


Decision Trees

Decision trees are powerful and popular tools for both classification and prediction. The attractiveness of tree-based methods is due largely to the fact that decision trees represent rules. Rules can readily be expressed in English so that we humans can understand them; they can also be expressed in a database access language such as SQL to retrieve records in a particular category. Decision trees are also useful for exploring data to gain insight into the relationships of a large number of candidate input variables to a target variable. Because decision trees combine both data exploration and modeling, they are a powerful first step in the modeling process even when building the final model using some other technique.

There is often a trade-off between model accuracy and model transparency. In some applications, the accuracy of a classification or prediction is the only thing that matters; if a direct mail firm obtains a model that can accurately predict which members of a prospect pool are most likely to respond to a certain solicitation, the firm may not care how or why the model works. In other situations, the ability to explain the reason for a decision is crucial. In insurance underwriting, for example, there are legal prohibitions against discrimination based on certain variables. An insurance company could find itself in the position of having to demonstrate to a court of law that it has not used illegal discriminatory practices in granting or denying coverage. Similarly, it is more acceptable to both the loan officer and the credit applicant to hear that an application for credit has been denied on the basis of a computer-generated rule (such as income below some threshold and number of existing revolving accounts greater than some other threshold) than to hear that the decision has been made by a neural network that provides no explanation for its action.
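Part of what makes such a rule defensible is that it can be stated directly in code. The Python sketch below encodes a credit rule of the kind just described; the function name, field names, and thresholds are invented for illustration and do not come from any real underwriting model.

```python
# A hypothetical credit rule of the kind a decision tree might produce.
# The thresholds below are invented for illustration only.
INCOME_THRESHOLD = 25_000
MAX_REVOLVING_ACCOUNTS = 5

def deny_credit(income: float, revolving_accounts: int) -> bool:
    """Return True if the sample rule denies the application."""
    return income < INCOME_THRESHOLD and revolving_accounts > MAX_REVOLVING_ACCOUNTS
```

Every denial produced by such a rule can be explained by pointing at the two comparisons, which is exactly the transparency the neural network lacks.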

This chapter begins with an examination of what decision trees are, how they work, and how they can be applied to classification and prediction problems. It then describes the core algorithm used to build decision trees and discusses some of the most popular variants of that algorithm. Practical examples drawn from the authors' experience are used to demonstrate the utility and general applicability of decision tree models and to illustrate practical considerations that must be taken into account.

What Is a Decision Tree?

A decision tree is a structure that can be used to divide up a large collection of records into successively smaller sets of records by applying a sequence of simple decision rules. With each successive division, the members of the resulting sets become more and more similar to one another. The familiar division of living things into kingdoms, phyla, classes, orders, families, genera, and species, invented by the Swedish botanist Carl Linnaeus in the 1730s, provides a good example. Within the animal kingdom, a particular animal is assigned to the phylum chordata if it has a spinal cord. Additional characteristics are used to further subdivide the chordates into the birds, mammals, reptiles, and so on. These classes are further subdivided until, at the lowest level in the taxonomy, members of the same species are not only morphologically similar, they are capable of breeding and producing fertile offspring.

A decision tree model consists of a set of rules for dividing a large heterogeneous population into smaller, more homogeneous groups with respect to a particular target variable. A decision tree may be painstakingly constructed by hand in the manner of Linnaeus and the generations of taxonomists that followed him, or it may be grown automatically by applying any one of several decision tree algorithms to a model set composed of preclassified data. This chapter is mostly concerned with the algorithms for automatically generating decision trees. The target variable is usually categorical and the decision tree model is used either to calculate the probability that a given record belongs to each of the categories, or to classify the record by assigning it to the most likely class. Decision trees can also be used to estimate the value of a continuous variable, although there are other techniques more suitable to that task.
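To make the distinction between scoring and classifying concrete, here is a short sketch of growing a tree automatically from preclassified data. It assumes scikit-learn and its bundled iris data purely for illustration; the chapter itself does not prescribe any particular library or algorithm.

```python
# Growing a decision tree from a preclassified model set
# (assumes scikit-learn; any tree implementation would do).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)            # preclassified records
model = DecisionTreeClassifier(max_depth=3)  # keep the tree small
model.fit(X, y)

print(model.predict(X[:1]))        # classify: assign the most likely class
print(model.predict_proba(X[:1]))  # score: probability of each class
```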

Classification

Anyone familiar with the game of Twenty Questions will have no difficulty understanding how a decision tree classifies records. In the game, one player thinks of a particular place, person, or thing that would be known or recognized by all the participants, but the player gives no clue to its identity. The other players try to discover what it is by asking a series of yes-or-no questions. A good player rarely needs the full allotment of 20 questions to move all the way from "Is it bigger than a bread box?" to the Golden Gate Bridge.

A decision tree represents such a series of questions. As in the game, the answer to the first question determines the follow-up question. The initial questions create broad categories with many members; follow-on questions divide the broad categories into smaller and smaller sets. If the questions are well chosen, a surprisingly short series is enough to accurately classify an incoming record.

The game of Twenty Questions illustrates the process of using a tree to assign a score or class to a record. A record enters the tree at the root node. The root node applies a test to determine which child node the record will encounter next. There are different algorithms for choosing the initial test, but the goal is always the same: to choose the test that best discriminates among the target classes. This process is repeated until the record arrives at a leaf node. All the records that end up at a given leaf of the tree are classified the same way. There is a unique path from the root to each leaf. That path is an expression of the rule used to classify the records.
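This descent from root to leaf is easy to restate as code. The sketch below is a bare-bones illustration with an invented node structure; real implementations differ in detail but follow the same loop.

```python
# A bare-bones sketch of classifying a record by walking a tree.
# The node structure and field names are invented for illustration.
class Node:
    def __init__(self, test=None, children=None, label=None):
        self.test = test          # maps a record to a child key
        self.children = children  # dict: test outcome -> child Node
        self.label = label        # predicted class, if this is a leaf

def classify(record, node):
    """Follow the unique root-to-leaf path for this record."""
    while node.children:          # leaves have no children
        node = node.children[node.test(record)]
    return node.label

# A one-split tree: the root test sends each record to a leaf.
root = Node(test=lambda r: r["orders"] > 6.5,
            children={True: Node(label=1), False: Node(label=0)})
print(classify({"orders": 9}, root))  # -> 1
```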

Different leaves may make the same classification, although each leaf makes that classification for a different reason. For example, in a tree that classifies fruits and vegetables by color, the leaves for apple, tomato, and cherry might all predict red, albeit with varying degrees of confidence since there are likely to be examples of green apples, yellow tomatoes, and black cherries as well.

The decision tree in Figure 6.1 classifies potential catalog recipients as likely (1) or unlikely (0) to place an order if sent a new catalog.

The tree in Figure 6.1 was created using the SAS Enterprise Miner Tree Viewer tool. The chart is drawn according to the usual convention in data mining circles, with the root at the top and the leaves at the bottom, perhaps indicating that data miners ought to get out more to see how real trees grow. Each node is labeled with a node number in the upper-right corner and the predicted class in the center. The decision rules that split each node are printed on the lines connecting each node to its children. The root node splits on lifetime orders; the left branch is for customers who had six or fewer orders and the right branch is for customers who had seven or more.

Any record that reaches leaf node 19, 14, 16, 17, or 18 is classified as likely to respond, because the predicted class at those leaves is 1. The paths to these leaf nodes describe the rules in the tree. For example, the rule for leaf 19 is "If the customer has placed more than 6.5 orders and it has been fewer than 765 days since the last order, the customer is likely to respond."
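Most tree software can print these root-to-leaf rules directly. The sketch below again assumes scikit-learn, with its bundled iris data standing in for the catalog data behind Figure 6.1; export_text prints one line per split condition along each path.

```python
# Printing the root-to-leaf rules of a fitted tree
# (assumes scikit-learn; iris data stands in for the catalog data).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
model = DecisionTreeClassifier(max_depth=2).fit(iris.data, iris.target)
print(export_text(model, feature_names=iris.feature_names))
```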


