
Gender is an example of categorical data. The simplest distance function is the identical-to function, which is 0 when the genders are the same and 1 otherwise:

dgender(female, female) = 0
dgender(female, male) = 1
dgender(male, female) = 1
dgender(male, male) = 0
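As a sketch, this overlap distance is a one-liner (the function name `d_gender` is my own, not from the text):

```python
def d_gender(a, b):
    """Overlap distance for a categorical field: 0 if the values match, 1 if not."""
    return 0 if a == b else 1
```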

So far, so simple. There are now three field distance functions that need to be merged into a single record distance function. There are three common ways to do this:

Manhattan distance or summation:

dsum(A,B) = dgender(A,B) + dage(A,B) + dsalary(A,B)

Normalized summation: dnorm(A,B) = dsum(A,B) / max(dsum)

Euclidean distance:

dEuclid(A,B) = sqrt(dgender(A,B)^2 + dage(A,B)^2 + dsalary(A,B)^2)
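The three combinations can be sketched directly from the formulas above. Here `field_dists` stands for the list of per-field distances [dgender, dage, dsalary] and `max_sum` for max(dsum) over the training records; both names are mine, not the book's:

```python
import math

def d_sum(field_dists):
    """Manhattan distance: add up the per-field distances."""
    return sum(field_dists)

def d_norm(field_dists, max_sum):
    """Normalized summation: divide the sum by the largest summed distance."""
    return d_sum(field_dists) / max_sum

def d_euclid(field_dists):
    """Euclidean distance: square each field distance, sum, take the root."""
    return math.sqrt(sum(d * d for d in field_dists))
```

For example, field distances of [1.0, 0.2, 0.3] give d_sum = 1.5 and d_euclid = sqrt(1.13), about 1.063.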

Table 8.9 shows the nearest neighbors for each of the points using the three functions.

In this case, the sets of nearest neighbors are exactly the same regardless of how the component distances are combined. This is a coincidence, caused by the fact that the five records fall into two well-defined clusters. One of the clusters is lower-paid, younger females and the other is better-paid, older males. These clusters imply that if two records are close to each other relative to one field, then they are close on all fields, so the way the distances on each field are combined is not important. This is not a very common situation.

Consider what happens when a new record (Table 8.10) is used for the comparison.

Table 8.9 Set of Nearest Neighbors for Three Distance Functions, Ordered Nearest to Farthest

RECNUM   DSUM         DNORM        DEUCLID
1        1,4,5,2,3    1,4,5,2,3    1,4,5,2,3
2        2,5,3,4,1    2,5,3,4,1    2,5,3,4,1
3        3,2,5,4,1    3,2,5,4,1    3,2,5,4,1
4        4,1,5,2,3    4,1,5,2,3    4,1,5,2,3
5        5,2,3,4,1    5,2,3,4,1    5,2,3,4,1



Table 8.10 New Customer

RECNUM   GENDER   SALARY
         female   $100,000

This new record does not fall into either of the clusters. Table 8.11 shows her distances from each record in the training set, along with the resulting list of neighbors, ordered nearest to farthest.

Now the set of neighbors depends on how the record distance function combines the field distance functions. In fact, the second nearest neighbor under the summation function is the farthest neighbor under the Euclidean, and vice versa. Compared to the summation or normalized metric, the Euclidean metric tends to favor neighbors where all the fields are relatively close. It punishes Record 3 because the genders are different and maximally far apart (a distance of 1.00). Correspondingly, it favors Record 1 because the genders are the same. Note that the neighbors for dsum and dnorm are identical: the definition of the normalized distance preserves the ordering of the summation distance, since the distance values are simply rescaled to the range from 0 to 1.

The summation, Euclidean, and normalized functions can also incorporate weights so each field contributes a different amount to the record distance function. MBR usually produces good results when all the weights are equal to 1. However, sometimes weights can be used to incorporate a priori knowledge, such as a particular field suspected of having a large effect on the classification.
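A weighted variant is a small change to the plain summation; the weights would come from a priori knowledge, and the names here are illustrative rather than from the text:

```python
def d_weighted_sum(field_dists, weights):
    """Weighted summation: each field's distance scaled by its weight.
    With all weights equal to 1 this reduces to the plain summation."""
    return sum(w * d for w, d in zip(weights, field_dists))
```

Doubling the weight on a field suspected of driving the classification makes that field count twice as much toward the record distance.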

Distance Functions for Other Data Types

A 5-digit American zip code is often represented as a simple number. Do any of the default distance functions for numeric fields make any sense? No. The difference between two randomly chosen zip codes has no meaning. Well, almost no meaning; a zip code does encode location information. The first three digits represent a postal zone; for instance, all zip codes in Manhattan start with 100, 101, or 102.

Table 8.11 Set of Nearest Neighbors for New Customer

          DISTANCE TO RECORD
          1        2        3        4        5        NEIGHBORS
dsum      1.662    1.659    1.338    1.003    1.640    4,3,5,2,1
dnorm     0.554    0.553    0.446    0.334    0.547    4,3,5,2,1
dEuclid   0.781    1.052    1.251    0.494    1.000    4,1,5,2,3



Furthermore, there is a general pattern of zip codes increasing from East to West. Codes that start with 0 are in New England and Puerto Rico; those beginning with 9 are on the west coast. This suggests a distance function that approximates geographic distance by looking at the high order digits of the zip code.

dzip(A,B) = 0.0 if the zip codes are identical
dzip(A,B) = 0.1 if the first three digits are identical (e.g., 20008 and 20015)
dzip(A,B) = 0.5 if the first digit is identical (e.g., 95050 and 98125)
dzip(A,B) = 1.0 if the first digit is not identical (e.g., 02138 and 94704)
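This tiered comparison translates directly into code; a minimal sketch, assuming zip codes arrive as 5-digit strings:

```python
def d_zip(a, b):
    """Tiered zip-code distance based on high-order digits:
    identical codes are closest, then same postal zone (first three
    digits), then same national region (first digit)."""
    if a == b:
        return 0.0
    if a[:3] == b[:3]:
        return 0.1
    if a[0] == b[0]:
        return 0.5
    return 1.0
```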

Of course, if geographic distance were truly of interest, a better approach would be to look up the latitude and longitude of each zip code in a table and calculate the distances that way (it is possible to get this information for the United States from www.census.gov). For many purposes, however, geographic proximity is not nearly as important as some other measure of similarity. 10011 and 10031 are both in Manhattan, but from a marketing point of view, they don't have much else in common, because one is an upscale downtown neighborhood and the other is a working-class Harlem neighborhood. On the other hand, 02138 and 94704 are on opposite coasts, but are likely to respond very similarly to direct mail from a political action committee, since they correspond to Cambridge, MA, and Berkeley, CA, respectively.
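Once latitude and longitude have been looked up as suggested, the great-circle distance between two points can be computed with the haversine formula. This sketch is not from the text, and the coordinates in the test are rough values for Cambridge, MA (02138) and Berkeley, CA (94704), not a zip-code lookup table:

```python
import math

def great_circle_miles(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance between two lat/long points, in miles."""
    r = 3959.0  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)   # difference in latitude
    dl = math.radians(lon2 - lon1)   # difference in longitude
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))
```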

This is just one example of how the choice of a distance metric depends on the data mining context. There are additional examples of distance and similarity measures in Chapter 11 where they are applied to clustering.

When a Distance Metric Already Exists

There are some situations where a distance metric already exists, but is difficult to spot. These situations generally arise in one of two forms. Sometimes, a function already exists that provides a distance measure that can be adapted for use in MBR. The news story case study provides a good example of adapting an existing function, the relevance feedback score, for use as a distance function.

Other times, there are fields that do not appear to capture distance, but can be pressed into service. An example of such a hidden distance field is solicitation history. Two customers who were chosen for a particular solicitation in the past are close, even though the reasons why they were chosen may no longer be available; two who were not chosen are close, but not as close; and one that was chosen and one that was not are far apart. The advantage of this metric is that it can incorporate previous decisions, even if the basis for the


