

less likely to churn. The additional requirement to identify separate segments of subscribers at risk and understand what motivates each group to leave suggests the use of decision trees and clever derived variables.

Each leaf of the decision tree has a label, which in this case would be not likely to churn or likely to churn. Each leaf also contains a different proportion of churners, and this proportion can be used as a churn score. Each leaf also has a set of rules describing who ends up there. With skill and creativity, an analyst may be able to turn these mechanistic rules into comprehensible reasons for leaving that, once understood, can be counteracted. Decision trees often have more leaves than desired for the purpose of developing special offers and telemarketing scripts. To combine leaves into larger groups, take whole branches of the tree as the groups, rather than single leaves.
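The idea of scoring by leaf, and of merging leaves into whole branches, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the output of any particular tree tool: the records, the leaf identifiers (dotted paths from the root), and the function name churn_scores are all hypothetical.

```python
from collections import defaultdict

# Hypothetical records after being dropped through a tree: (leaf_id, churned).
# Leaf ids encode the path from the root, so "L.L" and "L.R" are the two
# leaves under branch "L".
records = [
    ("L.L", True), ("L.L", True), ("L.L", False), ("L.L", True),
    ("L.R", True), ("L.R", False),
    ("R", False), ("R", False), ("R", False), ("R", True),
]

def churn_scores(records, depth=None):
    """Return the proportion of churners in each group.

    With depth=None every leaf is its own group. With depth=k, leaves are
    merged into the branch formed by the first k steps of their path.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [churners, total]
    for leaf, churned in records:
        group = leaf if depth is None else ".".join(leaf.split(".")[:depth])
        counts[group][0] += int(churned)
        counts[group][1] += 1
    return {group: c / n for group, (c, n) in counts.items()}

leaf_scores = churn_scores(records)             # one score per leaf
branch_scores = churn_scores(records, depth=1)  # leaves merged into branches
```

Merging at depth 1 leaves two groups instead of three, each still carrying a churn proportion and a (shorter) rule that describes its members.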

Note that our preference for decision-tree methods in this case stems from the desire to understand the reasons for attrition and our desire to treat subgroups differentially. If the goal were simply to do the best possible job of predicting the subscribers at risk, without worrying about the reasons, we might select a different approach. Different business goals suggest different data mining techniques. If the goal were to estimate next month's minutes of use for each subscriber, neural networks or regression would be better choices. If the goal were to find naturally occurring customer segments, an undirected clustering technique or profiling and hypothesis testing would be appropriate.

Determine the Relevant Characteristics of the Data

Once the data mining tasks have been identified and used to narrow the range of data mining methods under consideration, the characteristics of the available data can help to refine the selection further. In general terms, the goal is to select the data mining technique that minimizes the number and difficulty of the data transformations that must be performed in order to coax good results from the data.

As discussed in the previous chapter, some amount of data transformation is always part of the data mining process. The raw data may need to be summarized in various ways, data encodings must be rationalized, and so forth. These kinds of transformations are necessary regardless of the technique chosen. However, some kinds of data pose particular problems for some data mining techniques.

Data Type

Categorical variables are especially problematic for data mining techniques that use the numeric values of input variables. Numeric variables of the kind that can be summed and multiplied play to the strengths of data mining techniques, such as regression, K-means clustering, and neural networks, that are based on arithmetic operations. When the data has many categorical variables, decision trees are quite useful, although association rules and link analysis may be appropriate in some cases.
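To make the difficulty concrete, the sketch below shows one common workaround that the text does not itself prescribe: replacing a categorical field with a set of 0/1 indicator columns so that arithmetic-based techniques have numbers to work with. The field name handset and its values are hypothetical.

```python
def one_hot(values):
    """Map each categorical value to a 0/1 indicator vector."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)  # one column per distinct category
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

handset = ["flip", "smart", "basic", "smart"]  # hypothetical categorical field
categories, encoded = one_hot(handset)
```

Each distinct value adds a column, which is one reason a field with many categories is awkward for techniques that are also sensitive to the number of inputs. A decision tree can split on the original field directly and needs no such transformation.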

Number of Input Fields

In directed data mining applications, there should be a single target field or dependent variable. The rest of the fields (except for those that are either clearly irrelevant or clearly dependent on the target variable) are treated as potential inputs to the model. Data mining methods vary in their ability to successfully process large numbers of input fields. This can be a factor in deciding on the right technique for a particular application.

In general, techniques that rely on adjusting a vector of weights that has an element for each input field run into trouble when the number of fields grows very large. Neural networks and memory-based reasoning share that trait. Association rules run into a different problem. The technique looks at all possible combinations of the inputs; as the number of inputs grows, processing the combinations becomes impossible to do in a reasonable amount of time.
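The growth in combinations is easy to quantify. The following sketch simply counts the non-empty combinations of n inputs, optionally capped at a maximum size; the function name itemset_count and the cap are illustrative assumptions, not part of any particular association-rule tool.

```python
from math import comb

def itemset_count(n_items, max_size=None):
    """Count the non-empty combinations of n_items with size <= max_size."""
    if max_size is None:
        max_size = n_items
    return sum(comb(n_items, k) for k in range(1, max_size + 1))

# Compare the uncapped count with the count for combinations of at most 3.
for n in (10, 20, 40):
    print(n, itemset_count(n), itemset_count(n, max_size=3))
```

Uncapped, the count is 2^n - 1, so each additional input doubles the work. Capping the combination size slows the growth to a polynomial, but the count still climbs quickly as inputs are added.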

Decision-tree methods are much less hindered by large numbers of fields. As the tree is built, the decision-tree algorithm identifies the single field that contributes the most information at each node and bases the next segment of the rule on that field alone. Dozens or hundreds of other fields can come along for the ride, but won't be represented in the final rules unless they contribute to the solution.
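A minimal sketch of how a single field might be chosen at a node follows. It assumes entropy-based information gain as the measure of "most information," which is one common splitting criterion among several; the sample rows and field names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, field, target):
    """Reduction in target entropy obtained by splitting rows on field."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[field], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

def best_field(rows, fields, target):
    """Pick the single field with the highest information gain."""
    return max(fields, key=lambda f: information_gain(rows, f, target))

# Hypothetical subscriber records; "plan" separates churners perfectly,
# while "region" tells us nothing about churn.
rows = [
    {"plan": "basic", "region": "east", "churn": "yes"},
    {"plan": "basic", "region": "west", "churn": "yes"},
    {"plan": "premium", "region": "east", "churn": "no"},
    {"plan": "premium", "region": "west", "churn": "no"},
]
chosen = best_field(rows, ["plan", "region"], "churn")
```

Because only the winning field is used at each node, an uninformative field such as region here never appears in the rules, no matter how many such fields are present.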

When faced with a large number of fields for a directed data mining problem, it is a good idea to start by building a decision tree, even if the final model is to be built using a different technique. The decision tree will identify a good subset of the fields to use as input to another technique that might be swamped by the original set of input variables.

Free-Form Text

Most data mining techniques are incapable of directly handling free-form text. But clearly, text fields often contain extremely valuable information. When analyzing warranty claims submitted to an engine manufacturer by independent dealers, the mechanics' free-form notes explaining what went wrong and what was done to fix the problem are at least as valuable as the fixed fields that show the part numbers and hours of labor used.

One data mining technique that can deal with free text is memory-based reasoning, one of the nearest neighbor methods discussed in Chapter 8. Recall that memory-based reasoning is based on the ability to measure the distance from one record to all the other records in a database in order to form a neighborhood of similar records. Often, finding an appropriate distance metric is a stumbling block that makes it hard to apply the technique, but researchers in the field of information retrieval have come up with good measures of the distance between two blocks of text. These measurements are based on the overlap in vocabulary between the documents, especially of uncommon words and proper nouns. The ability of Web search engines to find appropriate articles is one familiar example of text mining.
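One way to picture a vocabulary-overlap distance is the sketch below. It is an assumption-laden illustration, not the specific measure used by any system mentioned here: it computes a weighted Jaccard distance in which each word is weighted by an inverse-document-frequency style score, so that uncommon words count for more. The sample notes, the whitespace tokenizer, and the function names are hypothetical.

```python
from collections import Counter
from math import log

def tokenize(text):
    """Very crude tokenizer: lowercase, split on whitespace, drop repeats."""
    return set(text.lower().split())

def idf_weights(docs):
    """Weight each word by how rare it is across docs (rarer means heavier)."""
    n = len(docs)
    df = Counter(w for d in docs for w in tokenize(d))
    return {w: log(n / df[w]) + 1.0 for w in df}

def text_distance(a, b, weights):
    """Weighted Jaccard distance: 0 for identical vocabulary, 1 for disjoint."""
    wa, wb = tokenize(a), tokenize(b)
    shared = sum(weights.get(w, 1.0) for w in wa & wb)
    total = sum(weights.get(w, 1.0) for w in wa | wb)
    return 1.0 - shared / total if total else 0.0

# Hypothetical mechanics' notes from warranty claims.
notes = [
    "replaced cracked piston ring",
    "replaced piston ring and gasket",
    "replaced fuel filter",
]
weights = idf_weights(notes)
d_close = text_distance(notes[0], notes[1], weights)  # share "piston ring"
d_far = text_distance(notes[0], notes[2], weights)    # share only "replaced"
```

The first two notes come out closer to each other than the first and third, because the word all three share, replaced, is common and therefore carries little weight. With a distance like this in hand, the nearest neighbors of a new note can be found in the usual memory-based reasoning fashion.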

As described in Chapter 8, memory-based reasoning on free-form text has also been used to classify workers into industries and job categories based on written job descriptions they supplied on the U.S. census long form and to add keywords to news stories.

Consider Hybrid Approaches

Sometimes, a combination of techniques works better than any single approach. This may require breaking down a single data mining task into two or more sub-tasks. The automotive marketing example from Chapter 2 illustrates the point. Researchers found that the best way of selecting prospects for a particular car model was to first use a neural network to identify people likely to buy a car, then use a decision tree to predict the particular model each car buyer would select.

Another example is a bank that uses three variables as input to a credit solicitation decision. The three inputs are estimates for:

The likelihood of a response

The projected first-year revenue from this customer

The risk of the new customer defaulting

These tasks vary considerably in the amount of relevant training data likely to be available, the input fields likely to be important, and the length of time required to verify the accuracy of a prediction. Soon after a mailing, the bank knows exactly who responded because the solicitation contains a deadline after which responses are considered invalid. A whole year must pass before the estimated first-year revenue can be checked against the actual amount, and it may take even longer for a customer to go bad. Given all these differences, it is not surprising that a different data mining technique may turn out to be best for each task.

How One Company Began Data Mining

Over the years, the authors have watched many companies make their first forays into data mining. Although each company's situation is unique, some


