

decisions is no longer available. On the other hand, it does not work well for customers who were not around during the original solicitation, so some sort of neutral weighting must be applied to them.

This function can be extended further by considering whether the original customers responded to the solicitation, resulting in a solicitation metric like:

dsolicitation(A, B) = 0, when A and B both responded to the solicitation

dsolicitation(A, B) = 0.1, when A and B were both chosen but neither responded

dsolicitation(A, B) = 0.2, when neither A nor B was chosen, but both were available in the data

dsolicitation(A, B) = 0.3, when A and B were both chosen, but only one responded

dsolicitation(A, B) = 0.3, when one or both were not considered

dsolicitation(A, B) = 1.0, when one was chosen and the other was not

Of course, the particular values are not sacrosanct; they are only meant as a guide for measuring similarity and showing how previous information and response histories can be incorporated into a distance function.
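As a sketch only, the rules above could be coded as follows. Each customer's history with the original solicitation is assumed to be summarized by one of four labels: "responded", "chosen" (chosen but did not respond), "available" (in the data but not chosen), or "not considered". The labels and the function name are illustrative, not part of the original example.

def d_solicitation(a, b):
    # Distance between two customers based on their solicitation histories
    # (labels are hypothetical encodings of the cases described above).
    if "not considered" in (a, b):
        return 0.3                      # one or both were not considered
    if a == "responded" and b == "responded":
        return 0.0                      # both responded to the solicitation
    if a == "chosen" and b == "chosen":
        return 0.1                      # both chosen, but neither responded
    if a == "available" and b == "available":
        return 0.2                      # neither chosen, both available in the data
    if {a, b} == {"responded", "chosen"}:
        return 0.3                      # both chosen, but only one responded
    return 1.0                          # one was chosen and the other was not

# Example: one customer responded, the other was never chosen.
print(d_solicitation("responded", "available"))   # 1.0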

The Combination Function: Asking the Neighbors for the Answer

The distance function is used to determine which records comprise the neighborhood. This section presents different ways to combine data gathered from those neighbors to make a prediction. At the beginning of this chapter, we estimated the median rent in the town of Tuxedo by taking an average of the median rents in similar towns. In that example, averaging was the combination function. This section explores other methods of canvassing the neighborhood.

The Basic Approach: Democracy

One common combination function is for the k nearest neighbors to vote on an answer: democracy in data mining. When MBR is used for classification, each neighbor casts its vote for its own class. The proportion of votes for each class is an estimate of the probability that the new record belongs to the corresponding class. When the task is to assign a single class, it is simply the one with the most votes. When there are only two categories, an odd number of neighbors should be polled to avoid ties. As a rule of thumb, use c + 1 neighbors when there are c categories to ensure that at least one class has a plurality.
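To make the voting rule concrete, here is a minimal Python sketch (an illustration, not code from the book). The hypothetical vote function takes the classes of the k nearest neighbors, nearest first, and returns the winning class along with the fraction of neighbors that agree:

from collections import Counter

def vote(neighbor_classes):
    # Tally the classes of the k nearest neighbors.
    tally = Counter(neighbor_classes).most_common()
    # A tie for first place means no prediction can be made.
    if len(tally) > 1 and tally[0][1] == tally[1][1]:
        return None, tally[0][1] / len(neighbor_classes)
    winner, count = tally[0]
    return winner, count / len(neighbor_classes)   # class and its share of the vote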



In Table 8.12, the five test cases seen earlier have been augmented with a flag that signals whether the customer has become inactive.

For this example, three of the customers have become inactive and two have not, so the training set is nearly balanced. For illustrative purposes, let's try to determine whether the new record is active or inactive by using different values of k for two distance functions, dsum and dEuclid (Table 8.13).

The question marks indicate that no prediction has been made due to a tie among the neighbors. Notice that different values of k do affect the classification. This suggests using the percentage of neighbors in agreement to provide the level of confidence in the prediction (Table 8.14).

Table 8.12 Customers with Attrition History

RECNUM   GENDER   SALARY     INACTIVE
1        female   $19,000    No
2        male     $64,000    Yes
3        male     $105,000   Yes
4        female   $55,000    Yes
5        male     $45,000    No
new      female   $100,000   ?

Table 8.13 Using MBR to Determine if the New Customer Will Become Inactive

           NEIGHBORS    NEIGHBOR ATTRITION   K = 1   K = 2   K = 3   K = 4   K = 5
dsum       4,3,5,2,1    Y,Y,N,Y,N            yes     yes     yes     yes     yes
dEuclid    4,1,5,2,3    Y,N,N,Y,Y            yes     ?       no      ?       yes

Table 8.14 Attrition Prediction with Confidence

           K = 1       K = 2       K = 3      K = 4      K = 5
dsum       yes, 100%   yes, 100%   yes, 67%   yes, 75%   yes, 60%
dEuclid    yes, 100%   yes, 50%    no, 67%    yes, 50%   yes, 60%
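Applying the vote sketch above to the dEuclid ordering in Table 8.13 (attrition flags Y,N,N,Y,Y, nearest first) reproduces both the yes/?/no/?/yes pattern and the 100, 67, and 60 percent confidence levels; at k = 2 and k = 4 the neighbors split evenly, so the sketch returns no prediction:

neighbors = ["Y", "N", "N", "Y", "Y"]     # dEuclid ordering from Table 8.13, nearest first
for k in range(1, 6):
    print(k, vote(neighbors[:k]))
# 1 ('Y', 1.0)   2 (None, 0.5)   3 ('N', ~0.67)   4 (None, 0.5)   5 ('Y', 0.6)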



The confidence level works just as well when there are more than two categories. However, with more categories, there is a greater chance that no single category will have a majority vote. One of the key assumptions about MBR (and data mining in general) is that the training set provides sufficient information for predictive purposes. If the neighborhoods of new cases consistently produce no obvious choice of classification, then the data simply may not contain the necessary information and the choice of dimensions and possibly of the training set needs to be reevaluated. By measuring the effectiveness of MBR on the test set, you can determine whether the training set has a sufficient number of examples.

WARNING: MBR is only as good as the training set it uses. To measure whether the training set is effective, measure the results of its predictions on the test set using two, three, and four neighbors. If the results are inconclusive or inaccurate, then the training set is not large enough or the dimensions and distance metrics chosen are not appropriate.

Weighted Voting

Weighted voting is similar to voting in the previous section, except that the neighbors are not all created equal; it is more like shareholder democracy than one-person, one-vote. The size of a vote is inversely proportional to the neighbor's distance from the new record, so closer neighbors have stronger votes than neighbors farther away. To prevent problems when the distance might be 0, it is common to add 1 to the distance before taking the inverse. Adding 1 also keeps every vote between 0 and 1.
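A minimal sketch of this scheme (again an illustration, with hypothetical class labels and distances): each neighbor contributes a vote of 1 / (1 + distance) to its own class, and the class totals are then compared.

def weighted_vote(neighbors):
    # neighbors: list of (class_label, distance) pairs, nearest first.
    # Each vote is 1 / (1 + distance), so every vote lies between 0 and 1.
    totals = {}
    for label, distance in neighbors:
        totals[label] = totals.get(label, 0.0) + 1.0 / (1.0 + distance)
    winner = max(totals, key=totals.get)
    return winner, totals[winner] / sum(totals.values())   # winning share of all votes

# Hypothetical distances, purely for illustration:
print(weighted_vote([("Y", 0.34), ("N", 0.55)]))   # ('Y', ~0.54)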

Table 8.15 applies weighted voting to the previous example. The "yes, the customer will become inactive" vote is listed first; the "no, this is a good customer" vote is second.

Weighted voting has introduced enough variation to prevent ties. The confidence level can now be calculated as the ratio of winning votes to total votes (Table 8.16).

Table 8.15 Attrition Prediction with Weighted Voting

           K = 1        K = 2            K = 3            K = 4            K = 5
dsum       0.749 to 0   1.441 to 0       1.441 to 0.647   2.085 to 0.647   2.085 to 1.290
dEuclid    0.669 to 0   0.669 to 0.562   0.669 to 1.062   1.157 to 1.062   1.601 to 1.062
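As a worked example of that ratio, take the dsum column for k = 5 in the table as reconstructed above: the winning "yes" total is 2.085 and the "no" total is 1.290, so the confidence is 2.085 / (2.085 + 1.290), or roughly 62 percent.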


