

Splitting on a Categorical Input Variable

The simplest algorithm for splitting on a categorical input variable is simply to create a new branch for each class that the categorical variable can take on. So, if color is chosen as the best field on which to split the root node, and the training set includes records that take on the values red, orange, yellow, green, blue, indigo, and violet, then there will be seven nodes in the next level of the tree. This approach is actually used by some software packages, but it often yields poor results. High branching factors quickly reduce the population of training records available at each node in lower levels of the tree, making further splits less reliable.
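To make the naive approach concrete, here is a minimal sketch (not from the text, with made-up records) that builds one child group per distinct value of a categorical field:

from collections import defaultdict

def multiway_split(records, field):
    # Partition records into one group per distinct value of a categorical field.
    children = defaultdict(list)
    for record in records:
        children[record[field]].append(record)
    return dict(children)

# Splitting on "color" creates one child per color present in the training set.
records = [{"color": "red", "response": 1}, {"color": "blue", "response": 0},
           {"color": "red", "response": 0}, {"color": "green", "response": 1}]
print({value: len(group) for value, group in multiway_split(records, "color").items()})
# {'red': 2, 'blue': 1, 'green': 1}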

A more common approach is to group together the classes that, taken individually, predict similar outcomes. More precisely, if two classes of the input variable yield distributions of the classes of the output variable that do not differ significantly from one another, the two classes can be merged. The usual test for whether the distributions differ significantly is the chi-square test.
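The merging test might be sketched as follows; the counts are invented for illustration, and the 0.05 significance level is an assumption, not something the text specifies:

from scipy.stats import chi2_contingency

def can_merge(counts_a, counts_b, alpha=0.05):
    # counts_a, counts_b: output-class counts (e.g. [responders, nonresponders])
    # for two values of the input variable. Merge them when a chi-square test
    # finds no significant difference between the two distributions.
    _, p_value, _, _ = chi2_contingency([counts_a, counts_b])
    return p_value > alpha

# Hypothetical counts: "indigo" and "violet" predict similar response rates
# and would be merged; "red" differs enough to keep its own branch.
print(can_merge([40, 60], [42, 58]))   # True  -> merge
print(can_merge([40, 60], [80, 20]))   # False -> keep separate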

Splitting in the Presence of Missing Values

One of the nicest things about decision trees is their ability to handle missing values in either numeric or categorical input fields by simply considering null to be a possible value with its own branch. This approach is preferable to throwing out records with missing values or trying to impute missing values. Throwing out records due to missing values is likely to create a biased training set because the records with missing values are not likely to be a random sample of the population. Replacing missing values with imputed values has the risk that important information provided by the fact that a value is missing will be ignored in the model. We have seen many cases where the fact that a particular value is null has predictive value. In one such case, the count of non-null values in appended household-level demographic data was positively correlated with response to an offer of term life insurance. Apparently, people who leave many traces in Acxiom's household database (by buying houses, getting married, registering products, and subscribing to magazines) are more likely to be interested in life insurance than those whose lifestyles leave more fields null.

Decision trees can produce splits based on missing values of an input variable. The fact that a value is null can often have predictive value, so do not be hasty to filter out records with missing values or to try to replace them with imputed values.
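As an illustration, a minimal sketch (using the same made-up record-dictionary format as above) that routes missing values to their own branch rather than dropping or imputing them:

from collections import defaultdict

def split_with_null_branch(records, field):
    # Partition records by field value, routing missing (None) values to a
    # dedicated "NULL" branch so that the fact of being missing can itself
    # carry predictive weight.
    children = defaultdict(list)
    for record in records:
        value = record.get(field)
        children["NULL" if value is None else value].append(record)
    return dict(children)

# A record with no value for "color" lands in the "NULL" branch.
print(split_with_null_branch([{"color": "red"}, {"color": None}], "color").keys())
# dict_keys(['red', 'NULL'])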

Although splitting on null as a separate class is often quite valuable, at least one data mining product offers an alternative approach as well. In Enterprise Miner, each node stores several possible splitting rules, each one based on a different input field. When a null value is encountered in the field that yields the best splits, the software uses the surrogate split based on the next best available input variable.
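The text does not show Enterprise Miner's internals; the following is only a hypothetical sketch of the surrogate-split idea, in which a node keeps its candidate rules ranked by quality and falls back to the best rule whose field is populated in the record being routed. The field names and rules are invented:

def route_record(record, ranked_rules):
    # ranked_rules: list of (field, rule_fn) pairs ordered best-first.
    # Each rule_fn maps a non-null field value to a child-node label.
    for field, rule_fn in ranked_rules:
        value = record.get(field)
        if value is not None:
            return rule_fn(value)
    return "default_child"   # no usable field: fall back to a default child

# The primary split uses "income"; when income is null, the surrogate on
# "age" routes the record instead.
rules = [("income", lambda v: "left" if v < 50000 else "right"),
         ("age",    lambda v: "left" if v < 40 else "right")]
print(route_record({"income": None, "age": 35}, rules))   # -> 'left' via the surrogate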

Growing the Full Tree

The initial split produces two or more child nodes, each of which is then split in the same manner as the root node. Once again, all input fields are considered as candidate splitters, even fields already used for splits. However, fields that take on only one value are eliminated from consideration since there is no way that they can be used to create a split. A categorical field that has been used as a splitter higher up in the tree is likely to become single-valued fairly quickly. The best split for each of the remaining fields is determined. When no split can be found that significantly increases the purity of a given node, or when the number of records in the node reaches some preset lower bound, or when the depth of the tree reaches some preset limit, the split search for that branch is abandoned and the node is labeled as a leaf node.
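The growth loop just described might be sketched as follows. This is a simplified, self-contained illustration, not the algorithm of any particular product: it uses Gini impurity as the purity measure for concreteness, and the stopping thresholds are purely illustrative.

from collections import Counter, defaultdict

MIN_RECORDS = 50     # preset lower bound on node size
MAX_DEPTH = 10       # preset limit on tree depth
MIN_GAIN = 0.01      # smallest purity increase worth splitting on

def gini(records, target="class"):
    counts = Counter(r[target] for r in records)
    n = len(records)
    return 1.0 - sum((k / n) ** 2 for k in counts.values())

def best_split(records, fields, target="class"):
    # Return (field, partition, gain) for the split with the largest purity gain,
    # or None when no field with more than one value remains.
    best = None
    for field in fields:
        partition = defaultdict(list)
        for r in records:
            partition[r[field]].append(r)
        if len(partition) < 2:
            continue   # single-valued fields cannot create a split
        gain = gini(records, target) - sum(
            len(s) / len(records) * gini(s, target) for s in partition.values())
        if best is None or gain > best[2]:
            best = (field, dict(partition), gain)
    return best

def grow_tree(records, fields, depth=0, target="class"):
    best = best_split(records, fields, target)
    if (best is None or best[2] < MIN_GAIN
            or len(records) < MIN_RECORDS or depth >= MAX_DEPTH):
        # Label the leaf with the majority class of the records reaching it.
        return {"leaf": Counter(r[target] for r in records).most_common(1)[0][0]}
    field, partition, _ = best
    return {"split_on": field,
            "children": {value: grow_tree(subset, fields, depth + 1, target)
                         for value, subset in partition.items()}}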

Eventually, it is not possible to find any more splits anywhere in the tree and the full decision tree has been grown. As we will see, this full tree is generally not the tree that does the best job of classifying a new set of records.

Decision-tree-building algorithms begin by trying to find the input variable that does the best job of splitting the data among the desired categories. At each succeeding level of the tree, the subsets created by the preceding split are themselves split according to whatever rule works best for them. The tree continues to grow until it is no longer possible to find better ways to split up incoming records. If there were a completely deterministic relationship between the input variables and the target, this recursive splitting would eventually yield a tree with completely pure leaves. It is easy to manufacture examples of this sort, but they do not occur very often in marketing or CRM applications.

Customer behavior data almost never contains such clear, deterministic relationships between inputs and outputs. The fact that two customers have the exact same description in terms of the available input variables does not ensure that they will exhibit the same behavior. A decision tree for a catalog response model might include a leaf representing females with age greater than 50, three or more purchases within the last year, and total lifetime spending of over $145. The customers reaching this leaf will typically be a mix of responders and nonresponders. If the leaf in question is labeled responder, then the proportion of nonresponders is the error rate for this leaf. The ratio of the proportion of responders in this leaf to the proportion of responders in the population is the lift at this leaf.
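A small worked example of the leaf statistics just described, with invented counts:

# Invented counts for one leaf of a catalog response model.
leaf_responders, leaf_nonresponders = 30, 70
population_response_rate = 0.10                       # 10 percent responders overall

leaf_size = leaf_responders + leaf_nonresponders
leaf_response_rate = leaf_responders / leaf_size      # 0.30

# The leaf is labeled "responder", so its nonresponders are the errors.
error_rate = leaf_nonresponders / leaf_size           # 0.70
lift = leaf_response_rate / population_response_rate  # 3.0
print(error_rate, lift)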

One circumstance where deterministic rules are likely to be discovered is when patterns in data reflect business rules. The authors had this fact driven home to them by an experience at Caterpillar, a manufacturer of diesel engines. We built a decision tree model to predict which warranty claims would be approved. At the time, the company had a policy by which certain claims were paid automatically. The results were startling: The model was 100 percent accurate on unseen test data. In other words, it had discovered the exact rules used by Caterpillar to classify the claims. On this problem, a neural network tool was less successful. Of course, discovering known business rules may not be particularly useful; it does, however, underline the effectiveness of decision trees on rule-oriented problems.

Many domains, ranging from genetics to industrial processes, really do have underlying rules, though these may be quite complex and obscured by noisy data. Decision trees are a natural choice when you suspect the existence of underlying rules.

Measuring the Effectiveness of a Decision Tree

The effectiveness of a decision tree, taken as a whole, is determined by applying it to the test set (a collection of records not used to build the tree) and observing the percentage classified correctly. This provides the classification error rate for the tree as a whole, but it is also important to pay attention to the quality of the individual branches of the tree. Each path through the tree represents a rule, and some rules are better than others.

At each node, whether a leaf node or a branching node, we can measure:

The number of records entering the node

The proportion of records in each class

How those records would be classified if this were a leaf node

The percentage of records classified correctly at this node

The variance in distribution between the training set and the test set

Of particular interest is the percentage of records classified correctly at this node. Surprisingly, sometimes a node higher up in the tree does a better job of classifying the test set than nodes lower down.
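As a sketch, the first four of these measurements might be computed per node as follows; the assumption that each record carries its true class under a "class" key is an invention for illustration:

from collections import Counter

def node_statistics(records, target="class"):
    # Count of records entering the node, class proportions, the label this
    # node would assign if it were a leaf, and the percentage that label gets right.
    counts = Counter(r[target] for r in records)
    n = len(records)
    label, label_count = counts.most_common(1)[0]
    return {
        "records_entering_node": n,
        "class_proportions": {c: k / n for c, k in counts.items()},
        "label_if_leaf": label,                      # majority class at this node
        "pct_classified_correctly": label_count / n,
    }

print(node_statistics([{"class": "responder"}] * 3 + [{"class": "nonresponder"}] * 7))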

Tests for Choosing the Best Split

A number of different measures are available to evaluate potential splits. Algorithms developed in the machine learning community focus on the increase in purity resulting from a split, while those developed in the statistics community focus on the statistical significance of the difference between the distributions of the child nodes. Alternate splitting criteria often lead to trees that look quite different from one another, but have similar performance. That is because there are usually many candidate splits with very similar performance. Different purity measures lead to different candidates being selected, but since all of the measures are trying to capture the same idea, the resulting models tend to behave similarly.
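For concreteness, here is a sketch of two widely used purity measures, Gini impurity and entropy; as noted above, different measures often rank the same candidate splits in nearly the same order:

import math

def gini(class_counts):
    # Gini impurity: probability that two records drawn at random (with
    # replacement) belong to different classes.
    n = sum(class_counts)
    return 1.0 - sum((k / n) ** 2 for k in class_counts)

def entropy(class_counts):
    # Entropy in bits: zero for a pure node, maximal for an even mix.
    n = sum(class_counts)
    return -sum((k / n) * math.log2(k / n) for k in class_counts if k > 0)

# A maximally impure node versus a much purer one.
print(gini([50, 50]), entropy([50, 50]))   # 0.5, 1.0
print(gini([90, 10]), entropy([90, 10]))   # 0.18, ~0.469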


