

of maleness. It seems safe to assume that the link is from men to beer and not vice versa.

With behavioral data, the direction of causality is not always so clear. Consider a couple of actual examples from real data mining projects:

People who have purchased certificates of deposit (CDs) have little or no money in their savings accounts.

Customers who use voice mail make a lot of short calls to their own number.

Not keeping money in a savings account is a common behavior of CD holders, just as being male is a common feature of beer drinkers. Beer companies seek out males to market their product, so should banks seek out people with no money in savings in order to sell them certificates of deposit? Probably not! Presumably, the CD holders have no money in their savings accounts because they used that money to buy CDs. A more common reason for not having money in a savings account is not having any money, and people with no money are not likely to purchase certificates of deposit. Similarly, the voice mail users call their own number so much because in this particular system that is one way to check voice mail. The pattern is useless for finding prospective users.

Prediction

Profiling uses data from the past to describe what happened in the past. Prediction goes one step further. Prediction uses data from the past to predict what is likely to happen in the future. This is a more powerful use of data. While the correlation between low savings balances and CD ownership may not be useful in a profile of CD holders, it is likely that having a high savings balance is (in combination with other indicators) a predictor of future CD purchases.

Building a predictive model requires separation in time between the model inputs or predictors and the model output, the thing to be predicted. If this separation is not maintained, the model will not work. This is one example of why it is important to follow a sound data mining methodology.
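The separation in time can be made concrete with a minimal sketch that reuses the certificate of deposit example. Everything in it is an assumption made for illustration: the records, the cutoff dates, and the names `build_model_set`, `savings_before`, and `bought_cd` are hypothetical and do not come from any real project. Inputs are taken only from before the cutoff, and the target only from the period after it.

```python
from datetime import date

# Hypothetical balance snapshots: (customer, snapshot date, savings balance).
balances = [
    ("a", date(2003, 1, 31), 9000.0),
    ("a", date(2003, 6, 30), 200.0),   # after the CD purchase: savings drained
    ("b", date(2003, 1, 31), 150.0),
    ("b", date(2003, 6, 30), 180.0),
]
# Hypothetical CD purchase dates; customer "b" never bought a CD.
cd_purchases = {"a": date(2003, 4, 15)}

CUTOFF = date(2003, 3, 1)      # inputs must be observed before this date
TARGET_END = date(2003, 9, 1)  # target is observed in [CUTOFF, TARGET_END)


def build_model_set(balances, cd_purchases, cutoff, target_end):
    """Return one row per customer: inputs from before the cutoff,
    target from the period after it."""
    latest = {}
    for cust, when, balance in balances:
        if when < cutoff:  # discard anything observed at or after the cutoff
            if cust not in latest or when > latest[cust][0]:
                latest[cust] = (when, balance)
    rows = []
    for cust, (_, balance) in sorted(latest.items()):
        bought = cd_purchases.get(cust)
        target = bought is not None and cutoff <= bought < target_end
        rows.append(
            {"customer": cust, "savings_before": balance, "bought_cd": target}
        )
    return rows


model_set = build_model_set(balances, cd_purchases, CUTOFF, TARGET_END)
for row in model_set:
    print(row)
```

Had the June snapshot been used as the input instead, customer "a" would show a balance of 200 and the model set would reproduce the misleading pattern of CD holders with empty savings accounts. The sketch does not handle customers who bought a CD before the cutoff; a real model set would need a rule for them.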

The Methodology

The data mining methodology has 11 steps.

1. Translate the business problem into a data mining problem.

2. Select appropriate data.

3. Get to know the data.

4. Create a model set.

5. Fix problems with the data.



6. Transform data to bring information to the surface.

7. Build models.

8. Assess models.

9. Deploy models.

10. Assess results.

11. Begin again.

As shown in Figure 3.5, the data mining process is best thought of as a set of nested loops rather than a straight line. The steps do have a natural order, but it is not necessary or even desirable to completely finish with one before moving on to the next. And things learned in later steps will cause earlier ones to be revisited.


Figure 3.5 Data mining is not a linear process.



Step One: Translate the Business Problem into a Data Mining Problem

A favorite scene from Alice in Wonderland is the passage where Alice asks the Cheshire cat for directions:

"Would you tell me, please, which way I ought to go from here?"

"That depends a good deal on where you want to get to," said the Cat.

"I don't much care where-" said Alice.

"Then it doesn't matter which way you go," said the Cat.

"-so long as I get somewhere," Alice added as an explanation.

"Oh, you're sure to do that," said the Cat, "if you only walk long enough."

The Cheshire cat might have added that without some way of recognizing the destination, you can never tell whether you have walked long enough! The proper destination for a data mining project is the solution of a well-defined business problem. Data mining goals for a particular project should not be stated in broad, general terms, such as:

Gaining insight into customer behavior

Discovering meaningful patterns in data

Learning something interesting

These are all worthy goals, but it is hard to measure whether they have been achieved. Projects that are hard to measure are hard to put a value on. Wherever possible, the broad, general goals should be broken down into more specific ones to make it easier to monitor progress in achieving them. For example, "gaining insight into customer behavior" might turn into concrete goals such as:

Identify customers who are unlikely to renew their subscriptions.

Design a calling plan that will reduce churn for home-based business customers.

Rank order all customers based on propensity to ski.

List products whose sales are at risk if we discontinue wine and beer sales.

Not only are these concrete goals easier to monitor, they are easier to translate into data mining problems as well.
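As a rough illustration of why a concrete goal is easier to monitor, the first goal above (identifying customers who are unlikely to renew) can be checked against what customers actually did. The sketch below is only an assumption about how such a check might look: the `outcomes` records and the function `non_renewal_rate` are hypothetical, and the values are made up for the example.

```python
# Hypothetical results after the renewal period:
# customer -> (flagged as unlikely to renew, actually renewed)
outcomes = {
    "c1": (True, False),
    "c2": (True, False),
    "c3": (True, True),
    "c4": (False, True),
    "c5": (False, True),
    "c6": (False, False),
}


def non_renewal_rate(outcomes, flagged):
    """Fraction of customers in the flagged (or unflagged) group
    who did not renew; None if the group is empty."""
    group = [renewed for is_flagged, renewed in outcomes.values()
             if is_flagged == flagged]
    if not group:
        return None
    return sum(1 for renewed in group if not renewed) / len(group)


flagged_rate = non_renewal_rate(outcomes, flagged=True)
unflagged_rate = non_renewal_rate(outcomes, flagged=False)
print(f"non-renewal among flagged customers:   {flagged_rate:.2f}")
print(f"non-renewal among unflagged customers: {unflagged_rate:.2f}")
```

If the flagged group does not fail to renew more often than the unflagged group, the goal has not been met. No comparable check exists for a goal such as "learning something interesting".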

What Does a Data Mining Problem Look Like?

To translate a business problem into a data mining problem, it should be reformulated as one of the six data mining tasks introduced in Chapter One:


