
Manhattan Distance

Another common distance metric gets its name from the rectangular grid pattern of streets in midtown Manhattan. It is simply the sum of the distances traveled along each axis. This measure is sometimes preferred to the Euclidean distance because the distances along each axis are not squared, so a large difference in one dimension is less likely to dominate the total distance.
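As a minimal sketch (not from the original text), the Manhattan distance between two numeric records can be computed as below, with the Euclidean distance shown for comparison; the record values are made up for illustration:

def manhattan_distance(a, b):
    """Sum of the absolute differences along each axis (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean_distance(a, b):
    """Square root of the sum of squared differences (L2 distance), for comparison."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Two hypothetical records described by three numeric fields.
record_1 = (3.0, 7.0, 1.0)
record_2 = (6.0, 7.0, 2.0)

print(manhattan_distance(record_1, record_2))  # 4.0
print(euclidean_distance(record_1, record_2))  # about 3.16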

Number of Features in Common

When the preponderance of fields in the records are categorical variables, geometric measures are not the best choice. A better measure is based on the degree of overlap between records. As with the geometric measures, there are many variations on this idea. In all variations, the two records are compared field by field to determine the number of fields that match and the number of fields that don't match. The simplest measure is the ratio of matches to the total number of fields.

In its simplest form, this measure counts two null or empty fields as matching. This has the perhaps perverse result that everything with missing data ends up in the same cluster. A simple improvement is to not include matches of this sort in the match count. Another improvement is to weight the matches by the prevalence of each class in the general population. After all, a match on Chevy Nomad (a rare classic) ought to count for more than a match on Ford F-150 Pickup (one of the most common vehicles on the road).
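A hedged sketch of such an overlap measure follows; the field names, records, and the optional per-value weights are illustrative assumptions rather than anything specified in the text:

def overlap_similarity(a, b, weights=None):
    """Fraction of fields on which two records agree.

    Matches where both values are missing (None) are not counted,
    and each match can optionally be weighted, for example by the
    rarity of the shared value in the overall population.
    """
    score, total = 0.0, 0
    for field, x in a.items():
        y = b.get(field)
        if x is None and y is None:
            continue  # ignore matches on missing data
        total += 1
        if x == y:
            score += weights.get((field, x), 1.0) if weights else 1.0
    return score / total if total else 0.0

# Hypothetical categorical records.
r1 = {"car": "Chevy Nomad", "state": "MA", "children": None}
r2 = {"car": "Chevy Nomad", "state": "NY", "children": None}
print(overlap_similarity(r1, r2))  # 0.5: one match out of two usable fields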

Data Preparation for Clustering

The notions of scaling and weighting each play important roles in clustering. Although similar, and often confused with each other, the two notions are not the same. Scaling adjusts the values of variables to take into account the fact that different variables are measured in different units or over different ranges. For instance, household income is measured in tens of thousands of dollars and number of children in single digits. Weighting provides a relative adjustment for a variable, because some variables are more important than others.

Scaling for Consistency

In geometry, all dimensions are equally important. Two points that differ by 2 in dimensions X and Y and by 1 in dimension Z are the same distance apart as two other points that differ by 1 in dimension X and by 2 in dimensions Y and Z. It doesn't matter what units X, Y, and Z are measured in, so long as they are the same.



But what if X is measured in yards, Y is measured in centimeters, and Z is measured in nautical miles? A difference of 1 in Z is now equivalent to a difference of 185,200 in Y or 2,025 in X. Clearly, they must all be converted to a common scale before distances will make any sense.

Unfortunately, in commercial data mining there is usually no common scale available because the different units being used are measuring quite different things. If variables include plot size, number of children, car ownership, and family income, they cannot all be converted to a common unit. On the other hand, it is misleading for a difference of 20 acres to count the same as a difference of $20. One solution is to map all the variables to a common range (often 0 to 1 or -1 to 1). That way, at least the ratios of change become comparable: doubling the plot size has the same effect as doubling income. Scaling solves this problem, in this case by remapping to a common range.

It is very important to scale different variables so their values fall roughly into the same range, by normalizing, indexing, or standardizing the values.

Here are three common ways of scaling variables to bring them all into comparable ranges:

Divide each variable by the range (the difference between the lowest and highest value it takes on) after subtracting the lowest value. This maps all values to the range 0 to 1, which is useful for some data mining algorithms.

Divide each variable by the mean of all the values it takes on. This is often called indexing a variable.

Subtract the mean value from each variable and then divide it by the standard deviation. This is often called standardization or converting to z-scores. A z-score tells you how many standard deviations away from the mean a value is.
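A minimal sketch of these three transformations, using made-up income values (the variable name and data are illustrative, not from the text):

def scale_to_range(values):
    """Map values to 0..1 by subtracting the minimum and dividing by the range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def index_by_mean(values):
    """Divide each value by the mean ('indexing' a variable)."""
    mean = sum(values) / len(values)
    return [v / mean for v in values]

def z_scores(values):
    """Subtract the mean and divide by the standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

incomes = [20_000, 35_000, 50_000, 95_000]  # hypothetical household incomes
print(scale_to_range(incomes))  # values between 0 and 1
print(index_by_mean(incomes))   # values around 1
print(z_scores(incomes))        # values in standard deviations from the mean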

Normalizing a single variable simply changes its range. A closely related concept is vector normalization, which scales all variables at once. This too has a geometric interpretation. Consider the collection of values in a single record or observation as a vector. Normalizing them scales each value so as to make the length of the vector equal one. Transforming all the vectors to unit length emphasizes the differences internal to each record rather than the differences between records. As an example, consider two records with fields for debt and equity. The first record contains debt of $200,000 and equity of $100,000; the second, debt of $10,000 and equity of $5,000. After normalization, the two records look the same since both have the same ratio of debt to equity.
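A short sketch of vector normalization applied to the debt and equity example above (the record names are made up):

def normalize_vector(record):
    """Scale a record so that, viewed as a vector, it has length one."""
    length = sum(v ** 2 for v in record) ** 0.5
    return [v / length for v in record]

# Hypothetical (debt, equity) records from the example above.
big_company = [200_000, 100_000]
small_company = [10_000, 5_000]

print(normalize_vector(big_company))    # [0.894..., 0.447...]
print(normalize_vector(small_company))  # [0.894..., 0.447...]  same direction, so identical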



Use Weights to Encode Outside Information

Scaling takes care of the problem that changes in one variable appear more significant than changes in another simply because of differences in the magnitudes of the values in the variable. What if we think that two families with the same income have more in common than two families on the same size plot, and we want that to be taken into consideration during clustering? That is where weighting comes in. The purpose of weighting is to encode the information that one variable is more (or less) important than others.

A good place to start is by standardizing all variables so each has a mean of zero and a variance (and standard deviation) of one. That way, all fields contribute equally when the distance between two records is computed.

We suggest going farther. The whole point of automatic cluster detection is to find clusters that make sense to you. If, for your purposes, whether people have children is much more important than the number of credit cards they carry, there is no reason not to bias the outcome of the clustering by multiplying the number of children field by a higher weight than the number of credit cards field. After scaling to get rid of bias that is due to the units, use weights to introduce bias based on knowledge of the business context.

Some clustering tools allow the user to attach weights to different dimensions, simplifying the process. Even for tools that don't have such functionality, it is possible to apply weights by adjusting the scaled values. That is, first scale the values to a common range to eliminate range effects. Then multiply the resulting values by a weight to introduce bias based on the business context.
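A sketch of this scale-then-weight preparation is below; the field names, data, and weight values are assumptions made for illustration, not values from the text:

import numpy as np

raw = np.array([
    [2, 5, 45_000.0],  # children, credit cards, income (hypothetical records)
    [0, 1, 80_000.0],
    [3, 7, 32_000.0],
])

# Step 1: standardize each column (mean 0, standard deviation 1).
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# Step 2: multiply by weights reflecting business importance
# (here, children matter most and credit cards least).
weights = np.array([3.0, 1.0, 2.0])
prepared = scaled * weights

# 'prepared' is what would be fed to the clustering algorithm.
print(prepared)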

Of course, if you want to evaluate the effects of different weighting strategies, you will have to add another outer loop to the clustering process.

Other Approaches to Cluster Detection

The basic K-means algorithm has many variations. Many commercial software tools that include automatic cluster detection incorporate some of these variations. Among the differences are alternate methods of choosing the initial seeds and the use of probability density rather than distance to associate records with clusters. This last variation merits additional discussion. In addition, there are several different approaches to clustering, including agglomerative clustering, divisive clustering, and self-organizing maps.

Gaussian Mixture Models

The K-means method as described has some drawbacks:

It does not do well with overlapping clusters.

The clusters are easily pulled off-center by outliers.

Each record is either inside or outside of a given cluster.
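To illustrate the last point, a Gaussian mixture model gives each record a probability of belonging to every cluster rather than a single hard label. The sketch below, using scikit-learn and made-up two-dimensional data, contrasts the two behaviors; it is an illustration under those assumptions, not an implementation from the text:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two overlapping blobs of synthetic two-dimensional data.
data = np.vstack([
    rng.normal(loc=[0, 0], scale=1.0, size=(100, 2)),
    rng.normal(loc=[2, 2], scale=1.0, size=(100, 2)),
])

# K-means: every record gets exactly one cluster label.
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)

# Gaussian mixture: every record gets a probability for each cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
soft_memberships = gmm.predict_proba(data)

print(hard_labels[:5])       # e.g. one integer label per record
print(soft_memberships[:5])  # rows of probabilities summing to 1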


