Промышленный лизинг Промышленный лизинг  Методички 

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 [ 128 ] 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222

True measures are interval variables that measure from a meaningful zero point. This trait is important because it means that the ratio of two values of the variable is meaningful. The Fahrenheit temperature scale used in the United States and the Celsius scale used in most of the rest of the world do not have this property. In neither system does it make sense to say that a 30° day is twice as warm as a 15° day. Similarly, a size 12 dress is not twice as large as a size 6, and gypsum is not twice as hard as talc though they are 2 and 1 on the hardness scale. It does make perfect sense, however, to say that a 50-year-old is twice as old as a 25-year-old or that a 10-pound bag of sugar is twice as heavy as a 5-pound one. Age, weight, length, customer tenure, and volume are examples of true measures.

Geometric distance metrics are well-defined for interval variables and true measures. In order to use categorical variables and rankings, it is necessary to transform them into interval variables. Unfortunately, these transformations may add spurious information. If ice cream flavors are assigned arbitrary numbers 1 through 28, it will appear that flavors 5 and 6 are closely related while flavors 1 and 28 are far apart.

These and other data transformation and preparation issues are discussed extensively in Chapter 17.

Formal Measures of Similarity

There are dozens if not hundreds of published techniques for measuring the similarity of two records. Some have been developed for specialized applications such as comparing passages of text. Others are designed especially for use with certain types of data such as binary variables or categorical variables. Of the three presented here, the first two are suitable for use with interval variables and true measures, while the third is suitable for categorical variables.

Geometric Distance between Two Points

When the fields in a record are numeric, the record represents a point in n-dimensional space. The distance between the points represented by two records is used as the measure of similarity between them. If two points are close in distance, the corresponding records are similar.

There are many ways to measure the distance between two points, as discussed in the sidebar Distance Metrics . The most common one is the Euclidian distance familiar from high-school geometry. To find the Euclidian distance between X and Y, first find the differences between the corresponding elements of X and Y (the distance along each axis) and square them. The distance is the square root of the sum of the squared differences.



DISTANCE METRICS

Any function that takes two points and produces a single number describing a relationship between them is a candidate measure of similarity, but to be a true distance metric, it must meet the following criteria:

♦ Distance(X, Y = 0 if and only if X = Y

♦ Distance(X,Y) > 0 for all X and all Y

♦ Distance(X,Y) = Distance(Y,X)

♦ Distance(X,Y) < Distance(X,Z) + Distance(Z,Y)

These are the formal definition of a distance metric in geometry.

A true distance is a good metric for clustering, but some of these conditions can be relaxed. The most important conditions are the second and third (called identity and commutativity by mathematicians)-that the measure is 0 or positive and is well-defined for any two points. If two records have a distance of 0, that is okay, as long as they are very, very similar, since they will always fall into the same cluster.

The last condition, the Triangle Inequality, is perhaps the most interesting mathematically. In terms of clustering, it basically means that adding a new cluster center will not make two distant points suddenly seem close together. Fortunately, most metrics we could devise satisfy this condition.

Angle between Two Vectors

Sometimes it makes more sense to consider two records closely associated because of similarities in the way the fields within each record are related. Minnows should cluster with sardines, cod, and tuna, while kittens cluster with cougars, lions, and tigers, even though in a database of body-part lengths, the sardine is closer to a kitten than it is to a catfish.

The solution is to use a different geometric interpretation of the same data. Instead of thinking of X and Y as points in space and measuring the distance between them, think of them as vectors and measure the angle between them. In this context, a vector is the line segment connecting the origin of a coordinate system to the point described by the vector values. A vector has both magnitude (the distance from the origin to the point) and direction. For this similarity measure, it is the direction that matters.

Take the values for length of whiskers, length of tail, overall body length, length of teeth, and length of claws for a lion and a house cat and plot them as single points, they will be very far apart. But if the ratios of lengths of these body parts to one another are similar in the two species, than the vectors will be nearly colinear.



The angle between vectors provides a measure of association that is not influenced by differences in magnitude between the two things being compared (see Figure 11.7). Actually, the sine of the angle is a better measure since it will range from 0 when the vectors are closest (most nearly parallel) to 1 when they are perpendicular. Using the sine ensures that an angle of 0 degrees is treated the same as an angle of 180 degrees, which is as it should be since for this measure, any two vectors that differ only by a constant factor are considered similar, even if the constant factor is negative. Note that the cosine of the angle measures correlation; it is 1 when the vectors are parallel (perfectly correlated) and 0 when they are orthogonal.


>7

I I I I I I I Г

Figure 11.7 The angle between vectors as a measure of similarity.

Team-Fly®



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 [ 128 ] 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222