

In data mining, more is better, but with some caveats. The first caveat has to do with the relationship between the size of the model set and its density. Density refers to the prevalence of the outcome of interest. Often the target variable represents something relatively rare. It is rare for prospects to respond to a direct mail offer. It is rare for credit card holders to commit fraud. In any given month, it is rare for newspaper subscribers to cancel their subscriptions. As discussed later in this chapter (in the section on creating the model set), it is desirable for the model set to be balanced, with equal numbers of each of the outcomes, during the model-building process. A smaller, balanced sample is preferable to a larger one with a very low proportion of rare outcomes.
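As a concrete illustration, here is a minimal sketch of building a balanced model set by downsampling the more common outcomes, assuming a pandas DataFrame with a hypothetical target column such as "responded" (the data and column names are placeholders, not part of the original text):

```python
import pandas as pd

def balance_model_set(df: pd.DataFrame, target: str, random_state: int = 42) -> pd.DataFrame:
    """Downsample every outcome class to the size of the rarest one."""
    n_rare = df[target].value_counts().min()          # size of the rarest outcome
    balanced = (
        df.groupby(target, group_keys=False)
          .apply(lambda g: g.sample(n=n_rare, random_state=random_state))
    )
    return balanced.sample(frac=1, random_state=random_state)  # shuffle the rows

# Example (hypothetical file and column):
# customers = pd.read_csv("customers.csv")
# model_set = balance_model_set(customers, target="responded")
```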

The second caveat has to do with the data miner's time. When the model set is large enough to build good, stable models, making it larger is counterproductive because everything will take longer to run on the larger dataset. Since data mining is an iterative process, the time spent waiting for results can become very large if each run of a modeling routine takes hours instead of minutes.

A simple test for whether the sample used for modeling is large enough is to try doubling it and measure the improvement in the model's accuracy. If the model created using the larger sample is significantly better than the one created using the smaller sample, then the smaller sample is not big enough. If there is no improvement, or only a slight improvement, then the original sample is probably adequate.
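The doubling test can be scripted directly. The sketch below is only an illustration under assumptions not in the original text: it uses scikit-learn, picks a decision tree as the model, and presumes the features are already numeric. It trains the same model on n rows and on 2n rows and compares accuracy on a shared holdout set:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def doubling_test(df: pd.DataFrame, target: str, n: int, random_state: int = 42) -> dict:
    """Train on n rows and on 2n rows; compare accuracy on the same holdout."""
    train, holdout = train_test_split(df, test_size=0.3, random_state=random_state)
    scores = {}
    for size in (n, 2 * n):
        sample = train.sample(n=min(size, len(train)), random_state=random_state)
        model = DecisionTreeClassifier(random_state=random_state)
        model.fit(sample.drop(columns=[target]), sample[target])
        predictions = model.predict(holdout.drop(columns=[target]))
        scores[size] = accuracy_score(holdout[target], predictions)
    return scores  # a much better score at 2n suggests n was too small
```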

How Much History Is Required?

Data mining uses data from the past to make predictions about the future. But how far in the past should the data come from? This is another simple question without a simple answer. The first thing to consider is seasonality. Most businesses display some degree of seasonality. Sales go up in the fourth quarter. Leisure travel goes up in the summer. There should be enough historical data to capture periodic events of this sort.
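One quick way to check for seasonality is to aggregate a measure such as sales by calendar month and look for recurring peaks. The sketch below assumes a hypothetical orders file with order_date and amount columns; these names are illustrative only:

```python
import pandas as pd

orders = pd.read_csv("orders.csv", parse_dates=["order_date"])  # hypothetical file and columns

# Total sales by calendar month, across all years of available history
monthly_sales = (
    orders.assign(month=orders["order_date"].dt.month)
          .groupby("month")["amount"]
          .sum()
)
print(monthly_sales)  # a recurring fourth-quarter peak means at least a full year of history is needed
```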

On the other hand, data from too far in the past may not be useful for mining because of changing market conditions. This is especially true when some external event such as a change in the regulatory regime has intervened. For many customer-focused applications, 2 to 3 years of history is appropriate. However, even in such cases, data about the beginning of the customer relationship often proves very valuable: what was the initial channel, what was the initial offer, how did the customer initially pay, and so on.

How Many Variables?

Inexperienced data miners are sometimes in too much of a hurry to throw out variables that seem unlikely to be interesting, keeping only a few carefully chosen variables they expect to be important. The data mining approach calls for letting the data itself reveal what is and is not important.



Often, variables that had previously been ignored turn out to have predictive value when used in combination with other variables. For example, one credit card issuer that had never included data on cash advances in its customer profitability models discovered through data mining that people who use cash advances only in November and December are highly profitable. Presumably, these are people who are prudent enough to avoid borrowing money at high interest rates most of the time (a prudence that makes them less likely to default than habitual users of cash advances) but who need some extra cash for the holidays and are willing to pay exorbitant interest to get it.

It is true that a final model is usually based on just a few variables. But these few variables are often derived by combining several other variables, and it may not have been obvious at the beginning which ones would end up being important.
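For instance, a derived variable like the one in the cash advance example above might be built roughly as follows; the transaction table and its column names are hypothetical, used only to show the mechanics:

```python
import pandas as pd

advances = pd.read_csv("cash_advances.csv", parse_dates=["txn_date"])  # hypothetical table
advances["month"] = advances["txn_date"].dt.month

# One row per customer: does every cash advance fall in November or December?
profile = advances.groupby("customer_id")["month"].agg(
    holiday_only=lambda m: m.isin([11, 12]).all(),
    n_advances="count",
)
# profile["holiday_only"] is a candidate derived variable for a profitability model
```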

What Must the Data Contain?

At a minimum, the data must contain examples of all possible outcomes of interest. In directed data mining, where the goal is to predict the value of a particular target variable, it is crucial to have a model set composed of preclassified data. To distinguish people who are likely to default on a loan from people who are not, there need to be thousands of examples from each class to build a model that distinguishes one from the other. When a new applicant comes along, his or her application is compared with those of past customers, either directly, as in memory-based reasoning, or indirectly through rules or neural networks derived from the historical data. If the new application looks like those of people who defaulted in the past, it will be rejected.

Implicit in this description is the idea that it is possible to know what happened in the past. To learn from our mistakes, we first have to recognize that we have made them. This is not always possible. One company had to give up on an attempt to use directed knowledge discovery to build a warranty claims fraud model because, although they suspected that some claims might be fraudulent, they had no idea which ones. Without a training set containing warranty claims clearly marked as fraudulent or legitimate, it was impossible to apply these techniques. Another company wanted a direct mail response model built, but could only supply data on people who had responded to past campaigns. They had not kept any information on people who had not responded, so there was no basis for comparison.

Step Three: Get to Know the Data

It is hard to overstate the importance of spending time exploring the data before rushing into building models. Because of its importance, Chapter 17 is devoted to this topic in detail. Good data miners seem to rely heavily on intuition, somehow being able to guess what a good derived variable to try might be, for instance. The only way to develop intuition for what is going on in an unfamiliar dataset is to immerse yourself in it. Along the way, you are likely to discover many data quality problems and be inspired to ask many questions that would not otherwise have come up.

Examine Distributions

A good first step is to examine a histogram of each variable in the dataset and think about what it is telling you. Make note of anything that seems surprising. If there is a state code variable, is California the tallest bar? If not, why not? Are some states missing? If so, does it seem plausible that this company does not do business in those states? If there is a gender variable, are there similar numbers of men and women? If not, is that unexpected? Pay attention to the range of each variable. Do variables that should be counts take on negative values? Do the highest and lowest values sound like reasonable values for that variable to take on? Is the mean much different from the median? How many missing values are there? Have the variable counts been consistent over time?
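A quick pass along these lines can be scripted with pandas and matplotlib. The sketch below assumes the data has already been loaded from a hypothetical file; it prints missing-value counts, ranges, and mean versus median for numeric variables, and plots a histogram or bar chart for each column:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customers.csv")   # hypothetical file; substitute the dataset at hand

for col in df.columns:
    print(f"--- {col}: {df[col].isna().sum()} missing ---")
    if pd.api.types.is_numeric_dtype(df[col]):
        # Range, and mean versus median, for numeric variables
        print("min/max:", df[col].min(), df[col].max())
        print("mean/median:", df[col].mean(), df[col].median())
        df[col].plot(kind="hist", bins=30, title=col)
    else:
        # Bar chart of the most common values for categorical variables
        df[col].value_counts().head(20).plot(kind="bar", title=col)
    plt.show()
```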

As soon as you get your hands on a data file from a new source, it is a good idea to profile the data to understand what is going on, including getting counts and summary statistics for each field, counts of the number of distinct values taken on by categorical variables, and where appropriate, cross-tabulations such as sales by product by region. In addition to providing insight into the data, the profiling exercise is likely to raise warning flags about inconsistencies or definitional problems that could destroy the usefulness of later analysis.
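A minimal profiling pass might look like the following sketch; the file name and the product, region, and sales columns are assumptions used only to illustrate the cross-tabulation:

```python
import pandas as pd

df = pd.read_csv("new_source.csv")   # hypothetical file from the new data source

print(df.describe(include="all"))                      # counts and summary statistics per field
print(df.select_dtypes(include="object").nunique())    # distinct values per categorical field

# A cross-tabulation such as sales by product by region (column names assumed)
print(pd.pivot_table(df, values="sales", index="product", columns="region", aggfunc="sum"))
```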

Data visualization tools can be very helpful during the initial exploration of a database. Figure 3.6 shows some data from the 2000 census of the state of New York. (This dataset may be downloaded from the companion Web site at www.data-miners.com/companion where you will also find suggested exercises that make use of it.) The red bars indicate the proportion of towns in the county where more than 15 percent of homes are heated by wood. (In New York, a town is a subdivision of a county that may or may not include any incorporated villages or cities. For instance, the town of Cortlandt is in Westchester County and includes the village of Croton-on-Hudson, whereas the city of Cortland is in Cortland County, in another part of the state.) The picture, generated by software from Quadstone, shows at a glance that wood-burning stoves are not much used to heat homes in the urbanized counties close to New York City, but are popular in rural areas upstate.
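Although the figure itself was produced with Quadstone, a similar view can be sketched with ordinary tools. The example below assumes a hypothetical extract of the census data with one row per town, a county column, and a pct_wood_heat column giving the percentage of homes heated by wood; none of these names come from the original text:

```python
import pandas as pd
import matplotlib.pyplot as plt

towns = pd.read_csv("ny_census_2000_towns.csv")     # hypothetical extract of the census data
towns["wood_heavy"] = towns["pct_wood_heat"] > 15    # town passes the 15 percent threshold

# For each county, the share of towns where wood heating is common
share_by_county = towns.groupby("county")["wood_heavy"].mean().sort_values()
share_by_county.plot(kind="barh", figsize=(6, 12),
                     title="Share of towns with more than 15% wood-heated homes")
plt.show()
```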


