Промышленный лизинг Промышленный лизинг  Методички 

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 [ 132 ] 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222

algorithm is to supply a purity function chosen to either minimize the average intracluster distance or maximize the intercluster distances. An example of such a purity function is the average distance from the centroid of the parent.

With no change in the purity function, we might say that decision trees provide directed clustering; that is, they create clusters of records that are similar with respect to some target variable. For this reason, ordinary decision trees are often a better choice for customer segmentation than the undirected clustering algorithms discussed in this chapter. If the purpose of the customer segmentation is to find customer segments that are loyal or profitable or likely to respond to some particular offer, it makes sense to use one of those variables (or a proxie) as the target for directed clustering. If, on the other hand, the point of the customer segmentation is to stimulate discussion of new product offerings geared to various naturally occurring clusters of customers, than an undirected approach is more appropriate.

Self-organizing maps are a variant of neural networks that have been used for many years in applications such as feature detection in two-dimensional images. More recently, they have been applied successfully for more general clustering applications. There is a discussion of self-organizing networks in Chapter 7.

When using the K-means approach to cluster detection, is there a way to determine what value of K finds the best clusters? Similarly, when using a hierarchical approach, is there a test for which level in the hierarchy contains the best clusters? What does it mean to say that a cluster is good?

These questions are important when using clustering in practice. In general terms, clusters should have members that have a high degree of similarity-or, in geometric terms, are close to each other-and the clusters themselves should be widely spaced.

A standard measure of within-cluster similarity is variance (the sum of the squared differences of each element from the mean), so the best set of clusters might be the set whose clusters have tthe lowest variance. However, this measure does not take into account cluster size. A similar measure would be the average variance, the total variance divided by the size of the cluster.

For agglomerative clustering, using variance as a measure does not make sense, since this method always starts out with clusters of one-which, of course, have variance zero. A good measure to use with agglomerative clusters


Self-Organizing Maps

Evaluating Clusters

Team-Fly®



is the difference between the distance value at which it was formed and the distance value at which it is merged into the next level. This is a measure of the durability of the cluster. Strong clusters, like the one linking 1 to 13-year-olds at distance 3 in Figure 11.11, last through many iterations of the algorithm.

A general-purpose measure that works with any form of cluster detection is to take whatever similarity measure or distance metric is used to form the clusters and use it to compare the average distance between members of a cluster and the cluster centroid to the average distance between cluster centroids. This can be done for each cluster individually and for the entire collection of clusters.

If there are one or two good clusters along with a number of weaker ones, it may be possible to improve results by removing all members of the strong clusters. The strong clusters are worthy of further analysis anyway, and removing their strong pull may allow new clusters to be detected in the records left behind.

Inside the Cluster

Clustering often produces one or more strong clusters-rather large clusters where the records are quite similar. The question is, what makes a strong cluster special? What is it about the records in this cluster that causes them to be lumped together? Even more importantly, is it possible to find rules and patterns within this cluster now that the noise from the rest of the database has been reduced?

The easiest way to approach these questions is to take the mean of each variable within the cluster and compare it to the mean of the same variable in the parent population. Rank order the variables by the magnitude of the difference, or better yet, the z-score. Looking at the variables that show the largest difference between the cluster and the rest of the database goes a long way towards explaining what makes the cluster special.

Outside the Cluster

Clustering can be useful even when only a single cluster is found. When screening for a very rare defect, there may not be enough examples to train a directed data mining model to detect it. One example is testing electric motors at the factory that makes them. Cluster detection methods can be used on a sample containing only good motors to determine the shape and size of the normal cluster. When a motor comes along that falls outside the cluster for any reason, it is suspect. This approach has been used in medicine to detect the presence of abnormal cells in tissue samples and in telecommunications to detect calling patterns indicative of fraud.



Case Study: Clustering Towns

The Boston Globe is one of two major dailies serving Boston and the surrounding area of eastern Massachusetts and southern New Hampshire. The Globe is the leading circulated newspaper in Boston with daily circulation of over 467,000 in 2003 compared to 243,000 for the Boston Herald, the other major daily in town. On Sundays, the Globe has circulation of over 705,000. Despite this leading position, in 2003 the Globe did not want to stand still. As with many newspapers, it faced declining readership in its core Boston market and strong competition from local papers in the suburban markets where some of its readers have migrated.

In order to compete better with the suburban papers, the Globe introduced geographically distinct versions of the paper with specialized editorial content for each of 12 geographically defined zones. Two days a week, readers are treated to a few pages of local coverage for their area. The editorial zones were drawn up using data available to the Globe, common sense, and a map, but no formal statistical analysis. There were some constraints on the composition of the editorial zones:

The zones had to be geographically contiguous so that the trucks carrying the localized editions from the central printing plant in Boston could take sensible routes.

The zones had to be reasonably compact and contain sufficient population to justify specialized editorial content.

The editorial zones had to be closely aligned with the geographic zones used to sell advertising.

Within these constraints, the Globe wished to design editorial zones that would group similar towns together. Sounds sensible, but which towns are similar? That is the question that The Boston Globe brought to us at Data Miners.

Creating Town Signatures

Before deciding which towns belonged together, there needed to be a way of describing the towns-a town signature with a column for every feature that might be useful for characterizing a town and comparing it with its neighbors. As it happened, Data Miners had worked on an earlier project to find towns with good prospects for future circulation growth that had already defined town signatures. Those signatures, which had been developed for a regression model to predict Globe home delivery penetration, turned out to be equally useful for undirected clustering. This is a fairly common occurrence; once a useful set of descriptive attributes has been collected it can be used for all sorts of things. In another example, a long-distance company developed customer



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 [ 132 ] 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222