Промышленный лизинг Промышленный лизинг  Методички 

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 [ 31 ] 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222

ADDING MORE NEEDLES TO THE HAYSTACK

In standard statistical analysis, it is common practice to throw out outliers-observations that are far outside the normal range. In data mining, however, these outliers may be just what we are looking for. Perhaps they represent fraud, some sort of error in our business procedures, or some fabulously profitable niche market. In these cases, we dont want to throw out the outliers, we want to get to know and understand them!

The problem is that knowledge discovery algorithms learn by example. If there are not enough examples of a particular class or pattern of behavior, the data mining tools will not be able to come up with a model for predicting it. In this situation, we may be able to improve our chances by artificially enriching the training data with examples of the rare event.

Stratified Sampling

Weights

l£t6J

00} 0]3 02 00 00 05j 06j 00 00

When an outcome is rare, there are two ways to create a balanced sample.

For example, a bank might want to build a model of who is a likely prospect for a private banking program. These programs appeal only to the very wealthiest clients, few of whom are represented in even a fairly large sample of bank customers. To build a model capable of spotting these fortunate individuals, we might create a training set of checking transaction histories of a population that includes 50 percent private banking clients even though they represent fewer than 1 percent of all checking accounts.

Alternately, each private banking client might be given a weight of 1 and other customers a weight of 0.01, so the total weight of the exclusive customers equals the total weight of the rest of the customers (we prefer to have the maximum weight be 1).



Including Multiple Timeframes

The primary goal of the methodology is creating stable models. Among other things, that means models that will work at any time of year and well into the future. This is more likely to happen if the data in the model set does not all come from one time of year. Even if the model is to be based on only 3 months of history, different rows of the model set should use different 3-month windows. The idea is to let the model generalize from the past rather than memorize what happened at one particular time in the past.

Building a model on data from a single time period increases the risk of learning things that are not generally true. One amusing example that the authors once saw was an association rules model built on a single weeks worth of point of sale data from a supermarket. Association rules try to predict items a shopping basket will contain given that it is known to contain certain other items. In this case, all the rules predicted eggs. This surprising result became less so when we realized that the model set was from the week before Easter.

Creating a Model Set for Prediction

When the model set is going to be used for prediction, there is another aspect of time to worry about. Although the model set should contain multiple time-frames, any one customer signature should have a gap in time between the predictor variables and the target variable. Time can always be divided into three periods: the past, present, and future. When making a prediction, a model uses data from the past to make predictions about the future.

As shown in Figure 3.7, all three of these periods should be represented in the model set. Of course all data comes from the past, so the time periods in the model set are actually the distant past, the not-so-distant past, and the recent past. Predictive models are built be finding patterns in the distant past that explain outcomes in the recent past. When the model is deployed, it is then able to use data from the recent past to make predictions about the future.

Model Building Time


Not So

Distant Past

Distant Past

Recent Past

Present

Future

Model Scoring Time

Figure 3.7 Data from the past mimics data from the past, present, and future.



It may not be immediately obvious why some recent data-from the not-so-distant past-is not used in a particular customer signature. The answer is that when the model is applied in the present, no data from the present is available as input. The diagram in Figure 3.8 makes this clearer.

If a model were built using data from June (the not-so-distant past) in order to predict July (the recent past), then it could not be used to predict September until August data was available. But when is August data available? Certainly not in August, since it is still being created. Chances are, not in the first week of September either, since it has to be collected and cleaned and loaded and tested and blessed. In many companies, the August data will not be available until mid-September or even October, by which point nobody will care about predictions for September. The solution is to include a month of latency in the model set.

Partitioning the Model Set

Once the preclassified data has been obtained from the appropriate time-frames, the methodology calls for dividing it into three parts. The first part, the training set, is used to build the initial model. The second part, the validation sett, is used to adjust the initial model to make it more general and less tied to the idiosyncrasies of the training set. The third part, the test set, is used to gauge the likely effectiveness of the model when applied to unseen data. Three sets are necessary because once data has been used for one step in the process, it can no longer be used for the next step because the information it contains has already become part of the model; therefore, it cannot be used to correct or judge.

January

February

March

April

June

July

August

September

October

Target Month

Model Building Time

Target Month

Model Scoring Time

Figure 3.8 Time when the model is built compared to time when the model is used.



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 [ 31 ] 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222