
Figure 3.14 A lift chart starts high and then goes to 1. (Vertical axis: lift value; horizontal axis: percentile, from 10 to 100.)

Step Nine: Deploy Models

Deploying a model means moving it from the data mining environment to the scoring environment. This process may be easy or hard. In the worst case (and we have seen this at more than one company), the model is developed in a special modeling environment using software that runs nowhere else. To deploy the model, a programmer takes a printed description of the model and recodes it in another programming language so it can be run on the scoring platform.

A more common problem is that the model uses input variables that are not in the original data. This should not be a problem since the model inputs are at least derived from the fields that were originally extracted to form the model set. Unfortunately, data miners are not always good about keeping a clean, reusable record of the transformations they applied to the data.
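One way to keep such a record is to put every derivation in a single function that both the modeling and the scoring environments call. The following is a minimal sketch under that assumption; the field names (`start_date`, `total_spend`, `email`) and the derived variables are hypothetical and not taken from any particular project.

```python
from datetime import date


def derive_inputs(raw, as_of):
    """Turn one raw customer record into model inputs.

    All field names are hypothetical placeholders. Because the same
    function is used to build the model set and to score, the
    transformations are recorded in exactly one place.
    """
    tenure_days = (as_of - raw["start_date"]).days
    # Treat very new customers as having one month of tenure so the
    # ratio below never divides by zero.
    months = max(tenure_days / 30.0, 1.0)
    return {
        "tenure_days": tenure_days,
        "spend_per_month": raw["total_spend"] / months,
        "has_email": 1 if raw.get("email") else 0,
    }


raw = {"start_date": date(2023, 1, 1), "total_spend": 600.0, "email": None}
inputs = derive_inputs(raw, as_of=date(2023, 12, 27))
```

Keeping the derivations in code rather than in notes means the scoring platform can import or port one well-defined unit instead of reconstructing the steps from memory.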

The challenge in deploying data mining models is that they are often used to score very large datasets. In some environments, every one of millions of customer records is updated with a new behavior score every day. A score is simply an additional field in a database table. Scores often represent a probability or likelihood, so they are typically numeric values between 0 and 1, but by no means necessarily so. A score might also be a class label provided by a clustering model, for instance, or a class label with a probability.
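To make the idea of a score as an additional field concrete, here is a small sketch that streams records through a model and attaches a `score` field to each. The logistic model, its coefficients, and the field names are invented for illustration; a real deployment would use whatever model the project produced.

```python
import math

# Hypothetical model: these coefficients are made up for illustration.
WEIGHTS = {"tenure_days": -0.002, "spend_per_month": 0.01}
INTERCEPT = -1.0


def score(inputs):
    """Return a logistic score strictly between 0 and 1."""
    z = INTERCEPT + sum(WEIGHTS[name] * inputs[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def score_records(records):
    """Yield a copy of each record with a 'score' field added.

    Working one record at a time means a very large table can be
    streamed through without being held in memory.
    """
    for rec in records:
        yield {**rec, "score": score(rec)}


customers = [
    {"id": 1, "tenure_days": 0, "spend_per_month": 100.0},
    {"id": 2, "tenure_days": 500, "spend_per_month": 0.0},
]
scored = list(score_records(customers))
```

A clustering model would fit the same pattern, with `score` holding a class label instead of a number.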

Step Ten: Assess Results

The response chart in Figure 3.14 compares the number of responders reached for a given amount of postage, with and without the use of a predictive model. A more useful chart would show how many dollars are brought in for a given expenditure on the marketing campaign. After all, if developing the model is very expensive, a mass mailing may be more cost-effective than a targeted one.

What is the fixed cost of setting up the campaign and the model that supports it?

What is the cost per recipient of making the offer?

What is the cost per respondent of fulfilling the offer?

What is the value of a positive response?

Plugging these numbers into a spreadsheet makes it possible to measure the impact of the model in dollars. The cumulative response chart can then be turned into a cumulative profit chart, which determines where the sorted mailing list should be cut off. If, for example, there is a high fixed price of setting up the campaign and also a fairly high price per recipient of making the offer (as when a wireless company buys loyalty by giving away mobile phones or waiving renewal fees), the company loses money by going after too few prospects because there are still not enough respondents to make up for the high fixed costs of the program. On the other hand, if it makes the offer to too many people, high variable costs begin to hurt.
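The spreadsheet calculation can be sketched in a few lines. The example below assumes the mailing list has already been sorted by model score and split into deciles; every number in it (responders per decile, costs, value of a response) is made up purely for illustration.

```python
def cumulative_profit(responders_by_decile, decile_size, fixed_cost,
                      cost_per_recipient, cost_per_respondent,
                      value_per_response):
    """Return the campaign profit if the list is cut off after each decile."""
    profits = []
    mailed = 0
    responders = 0
    for r in responders_by_decile:
        mailed += decile_size
        responders += r
        profit = (responders * (value_per_response - cost_per_respondent)
                  - mailed * cost_per_recipient
                  - fixed_cost)
        profits.append(profit)
    return profits


# Illustrative inputs only: responders found in each decile of the
# score-sorted list, best-scoring decile first.
responders_by_decile = [300, 200, 150, 110, 80, 60, 40, 30, 20, 20]

profits = cumulative_profit(
    responders_by_decile,
    decile_size=10_000,
    fixed_cost=25_000,
    cost_per_recipient=1.0,
    cost_per_respondent=5.0,
    value_per_response=105.0,
)

# Index of the most profitable cutoff (0 means "mail only the top decile").
best_cutoff = max(range(len(profits)), key=profits.__getitem__)
```

With these invented inputs, mailing only the top decile loses money because the fixed cost is not yet covered, profit peaks at the fourth decile, and mailing the whole list loses money again as the per-recipient cost accumulates.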

Of course, the profit model is only as good as its inputs. While the fixed and variable costs of the campaign are fairly easy to come by, the predicted value of a responder can be harder to estimate. The process of figuring out what a customer is worth is beyond the scope of this book, but a good estimate helps to measure the true value of a data mining model.

In the end, the measure that counts the most is return on investment. Measuring lift on a test set helps choose the right model. Profitability models based on lift will help decide how to apply the results of the model. But it is very important to measure these things in the field as well. In a database marketing application, this requires always setting aside control groups and carefully tracking customer response according to various model scores.
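As a rough illustration of tracking response by model score with a control group, the sketch below compares response rates for customers who received the offer against a held-out control group within each score band. The band names, group labels, and counts are hypothetical.

```python
from collections import defaultdict


def response_rates(customers):
    """Return {(score_band, group): response rate}."""
    counts = defaultdict(lambda: [0, 0])  # [responders, total]
    for c in customers:
        key = (c["score_band"], c["group"])
        counts[key][0] += c["responded"]
        counts[key][1] += 1
    return {key: resp / total for key, (resp, total) in counts.items()}


def make(band, group, n, n_resp):
    """Build n synthetic customers, the first n_resp of whom responded."""
    return [
        {"score_band": band, "group": group, "responded": 1 if i < n_resp else 0}
        for i in range(n)
    ]


# Synthetic field results, for illustration only.
customers = (
    make("high", "treated", 100, 12) + make("high", "control", 50, 2)
    + make("low", "treated", 100, 3) + make("low", "control", 50, 1)
)

rates = response_rates(customers)
# Difference between treated and control response rates in each band.
uplift = {
    band: rates[(band, "treated")] - rates[(band, "control")]
    for band in ("high", "low")
}
```

If the model is working in the field, the difference between treated and control should be larger in the high-score band than in the low-score band, as it is in this synthetic data.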

Step Eleven: Begin Again

Every data mining project raises more questions than it answers. This is a good thing. It means that new relationships are now visible that were not visible before. The newly discovered relationships suggest new hypotheses to test and the data mining process begins all over again.

Lessons Learned

Data mining comes in two forms. Directed data mining involves searching through historical records to find patterns that explain a particular outcome. Directed data mining includes the tasks of classification, estimation, prediction, and profiling. Undirected data mining searches through the same records for interesting patterns. It includes the tasks of clustering, finding association rules, and description.

Data mining brings the business closer to data. As such, hypothesis testing is a very important part of the process. However, the primary lesson of this chapter is that data mining is full of traps for the unwary and following a methodology based on experience can help avoid them.

The first hurdle is translating the business problem into one of the six tasks that can be solved by data mining: classification, estimation, prediction, affinity grouping, clustering, and profiling.

The next challenge is to locate appropriate data that can be transformed into actionable information. Once the data has been located, it should be thoroughly explored. The exploration process is likely to reveal problems with the data. It will also help build up the data miner's intuitive understanding of the data. The next step is to create a model set and partition it into training, validation, and test sets.
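The partitioning step can be sketched as a shuffled split. The 60/20/20 proportions and the fixed seed below are assumptions chosen for the example, not recommendations from the text.

```python
import random


def partition(records, train=0.6, validation=0.2, seed=0):
    """Shuffle records and split them into training, validation, and test sets.

    Whatever is left after the training and validation shares goes to
    the test set. The default proportions are illustrative.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )


model_set = list(range(100))  # stand-in for 100 customer records
training, validation, test = partition(model_set)
```

Fixing the seed keeps the three sets stable between runs, which makes it easier to compare models trained at different times.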

Data transformations are necessary for two purposes: to fix problems with the data such as missing values and categorical variables that take on too many values, and to bring information to the surface by creating new variables to represent trends and other ratios and combinations.

Once the data has been prepared, building models is a relatively easy process. Each type of model has its own metrics by which it can be assessed, but there are also assessment tools that are independent of the type of model. Some of the most important of these are the lift chart, which shows how the model has increased the concentration of the desired value of the target variable, and the confusion matrix, which shows the misclassification error rate for each of the target classes. The next chapter uses examples from real data mining projects to show the methodology in action.
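Both assessment tools are simple to compute from a scored test set. The following sketch uses a tiny made-up test set of ten records and an assumed classification threshold of 0.5; lift is taken here as the response rate among the top-scoring fraction divided by the overall response rate.

```python
from collections import Counter


def lift_at(scores, actuals, fraction):
    """Response rate in the top `fraction` by score, divided by the overall rate."""
    ranked = sorted(zip(scores, actuals), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    top_rate = sum(actual for _, actual in ranked[:n_top]) / n_top
    overall_rate = sum(actuals) / len(actuals)
    return top_rate / overall_rate


def confusion_matrix(actuals, predictions):
    """Count records for each (actual class, predicted class) pair."""
    return Counter(zip(actuals, predictions))


def error_rate_by_class(actuals, predictions):
    """Share of each actual class that the model assigned to a different class."""
    matrix = confusion_matrix(actuals, predictions)
    totals = Counter(actuals)
    return {
        cls: sum(n for (a, p), n in matrix.items() if a == cls and p != cls) / totals[cls]
        for cls in totals
    }


# Made-up test set: model scores and actual outcomes (1 = responder).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
actuals = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
predictions = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold

top_lift = lift_at(scores, actuals, 0.2)
full_lift = lift_at(scores, actuals, 1.0)
matrix = confusion_matrix(actuals, predictions)
errors = error_rate_by_class(actuals, predictions)
```

Lift computed over the whole list is always 1, which is why a lift chart starts high and ends at 1.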


