

Compare Values with Descriptions

Look at the values of each variable and compare them with the description given for that variable in available documentation. This exercise often reveals that the descriptions are inaccurate or incomplete. In one dataset of grocery purchases, a variable that was labeled as an item count had many noninteger values. Upon further investigation, it turned out that the field contained an item count for products sold by the item, but a weight for items sold by weight. Another dataset, this one from a retail catalog company, included a field that was described as containing total spending over several quarters. This field was mysteriously capable of predicting the target variable: whether a customer had placed an order from a particular catalog mailing. Everyone who had not placed an order had a zero value in the mystery field. Everyone who had placed an order had a number greater than zero in the field. We surmise that the field actually contained the value of the customer's order from the mailing in question. In any case, it certainly did not contain the documented value.
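As a concrete illustration, a quick check along these lines can flag such mismatches between a field's documented meaning and its actual contents. This is a minimal pandas sketch; the file name purchases.csv and the column name item_count are hypothetical stand-ins for whatever the documentation describes as a count field.

```python
import pandas as pd

# Hypothetical grocery-purchase extract; file and column names are illustrative.
purchases = pd.read_csv("purchases.csv")

# If item_count really is a count, every value should be a whole number.
counts = purchases["item_count"].dropna()
noninteger = counts[counts != counts.round()]

print(f"{len(noninteger)} of {len(counts)} values are not whole numbers")
print(noninteger.head())  # inspect a few offenders, e.g. weights such as 1.37
```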



Validate Assumptions

Using simple cross-tabulation and visualization tools such as scatter plots, bar graphs, and maps, validate assumptions about the data. Look at the target variable in relation to various other variables to see such things as response rate by channel, churn rate by market, or income by sex. Where possible, try to match reported summary numbers by reconstructing them directly from the base-level data. For example, if reported monthly churn is 2 percent, count up the number of customers who cancel in a given month and see whether it is around 2 percent of the total.

Trying to recreate reported aggregate numbers from the detail data that supposedly goes into them is an instructive exercise. In trying to explain the discrepancies, you are likely to learn much about the operational processes and business rules behind the reported numbers.
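For instance, the churn check described above might be reconstructed from a customer file along the following lines. This is a sketch under stated assumptions: the file name, the start_date and cancel_date columns, and the example month are all hypothetical.

```python
import pandas as pd

# Hypothetical customer extract; all names here are illustrative.
customers = pd.read_csv("customers.csv",
                        parse_dates=["start_date", "cancel_date"])

month_start = pd.Timestamp("2004-06-01")
month_end = pd.Timestamp("2004-07-01")

# Customers who were active at the start of the month...
base = customers[(customers["start_date"] < month_start) &
                 (customers["cancel_date"].isna() |
                  (customers["cancel_date"] >= month_start))]

# ...and cancelled during the month.
cancels = base[(base["cancel_date"] >= month_start) &
               (base["cancel_date"] < month_end)]

# Compare the reconstructed rate with the reported 2 percent.
print(f"Churn for the month: {len(cancels) / len(base):.1%}")
```

If the reconstructed number does not match the reported one, the discrepancy itself is informative: it usually points to a business rule, such as how mid-month starts or pending cancellations are counted, that the documentation does not mention.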

Ask Lots of Questions

Wherever the data does not seem to bear out received wisdom or your own expectations, make a note of it. An important output of the data exploration process is a list of questions for the people who supplied the data. Often these questions will require further research because few users look at data as carefully as data miners do. Examples of the kinds of questions that are likely to come out of the preliminary exploration are:

Why are no auto insurance policies sold in New Jersey or Massachusetts?

Why were some customers active for 31 days in February, but none were active for more than 28 days in January?

Why were so many customers born in 1911? Are they really that old?

Why are there no examples of repeat purchasers?

What does it mean when the contract begin date is after the contract end date?

Why are there negative numbers in the sale price field?

How can active customers have a non-null value in the cancellation reason code field?

These are all real questions we have had occasion to ask about real data. Sometimes the answers taught us things we hadn't known about the client's industry. New Jersey and Massachusetts do not allow automobile insurers much flexibility in setting rates, so a company that sees its main competitive advantage as smarter pricing does not want to operate in those markets. Other times we learned about idiosyncrasies of the operational systems, such as the data entry screen that insisted on a birth date even when none was known, which led to a lot of people being assigned the birthday November 11, 1911, because 11/11/11 is the date you get by holding down the 1 key and letting it auto-repeat until the field is full (and no other keys work to fill in valid dates). Sometimes we discovered serious problems with the data, such as the data for February being misidentified as January. And in the last instance, we learned that the process extracting the data had bugs.

Step Four: Create a Model Set

The model set contains all the data that is used in the modeling process. Some of the data in the model set is used to find patterns. Some is used to verify that the model is stable. Some is used to assess the model's performance. Creating a model set requires assembling data from multiple sources to form customer signatures and then preparing the data for analysis.
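One common way to carve the model set into these three roles is a simple random partition. The sketch below is illustrative, not a prescription: the flat model_set.csv file and the 60/20/20 proportions are assumptions.

```python
import pandas as pd

# Hypothetical model set, one row per customer signature.
model_set = pd.read_csv("model_set.csv")

# Shuffle once so the three pieces are random samples of the whole.
shuffled = model_set.sample(frac=1.0, random_state=42)
n = len(shuffled)

train = shuffled.iloc[:int(0.6 * n)]                    # used to find patterns
validation = shuffled.iloc[int(0.6 * n):int(0.8 * n)]   # used to check stability
test = shuffled.iloc[int(0.8 * n):]                     # used to assess performance
```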

Assembling Customer Signatures

The model set is a table or collection of tables with one row per item to be studied, and fields for everything known about that item that could be useful for modeling. When the data describes customers, the rows of the model set are often called customer signatures. Assembling the customer signatures from relational databases often requires complex queries to join data from many tables, after which the result may be augmented with data from other sources.

Part of the data assembly process is getting all data to be at the correct level of summarization so there is one value per customer, rather than one value per transaction or one value per zip code. These issues are discussed in Chapter 17.
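In code, this rollup-then-join pattern might look like the following pandas sketch; the table and column names are hypothetical.

```python
import pandas as pd

# Hypothetical source tables; all names are illustrative.
customers = pd.read_csv("customers.csv")  # one row per customer
transactions = pd.read_csv("transactions.csv",
                           parse_dates=["order_date"])  # one row per transaction

# Roll transactions up to one row per customer before joining,
# so the signature stays at the customer level.
per_customer = (transactions
                .groupby("customer_id")
                .agg(n_orders=("order_id", "nunique"),
                     total_spend=("amount", "sum"),
                     last_order=("order_date", "max"))
                .reset_index())

# One row per customer, with transaction behavior summarized alongside
# whatever the customer table already contains.
signatures = customers.merge(per_customer, on="customer_id", how="left")
```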

Creating a Balanced Sample

Very often, the data mining task involves learning to distinguish between groups such as responders and nonresponders, goods and bads, or members of different customer segments. As explained in the sidebar, data mining algorithms do best when these groups have roughly the same number of members. This is unlikely to occur naturally. In fact, it is usually the more interesting groups that are underrepresented.

Before modeling, the dataset should be balanced, either by sampling from the different groups at different rates or by adding a weighting factor so that members of the larger groups are not weighted as heavily as members of the smaller ones.
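Both approaches can be sketched in a few lines of pandas. Assume a model set with a binary response column; all names here are illustrative.

```python
import pandas as pd

# Hypothetical model set with a 0/1 target; names are illustrative.
model_set = pd.read_csv("model_set.csv")

responders = model_set[model_set["response"] == 1]
nonresponders = model_set[model_set["response"] == 0]

# Option 1: balance by sampling the larger group down to the size of
# the smaller one (here assuming responders are the rarer group).
balanced = pd.concat([
    responders,
    nonresponders.sample(n=len(responders), random_state=42),
])

# Option 2: keep every row but weight each group inversely to its size,
# so both groups contribute equally during modeling.
model_set["weight"] = model_set["response"].map({
    1: len(model_set) / (2 * len(responders)),
    0: len(model_set) / (2 * len(nonresponders)),
})
```

Option 1 assumes responders are the smaller group, as is typical; if not, swap the roles of the two samples.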


