People often find it hard to understand why the training set and validation set are tainted once they have been used to build a model. An analogy may help: Imagine yourself back in the fifth grade. The class is taking a spelling test. Suppose that, at the end of the test period, the teacher asks you to estimate your own grade on the quiz by marking the words you got wrong. You will give yourself a very good grade, but your spelling will not improve. If, at the beginning of the period, you thought there should be an e at the end of tomato, nothing will have happened to change your mind when you grade your paper. No new information has entered the system. You need a validation set!

Now, imagine that at the end of the test the teacher allows you to look at the papers of several neighbors before grading your own. If they all agree that tomato has no final e, you may decide to mark your own answer wrong. If the teacher gives the same quiz tomorrow, you will do better. But how much better? If you use the papers of the very same neighbors to evaluate your performance tomorrow, you may still be fooling yourself. If they all agree that potatoes has no more need of an e than tomato, and you have changed your own guess to agree with theirs, then you will overestimate your actual grade on the second quiz as well. That is why the test set should be different from the validation set.

For predictive models, the test set should also come from a different time period than the training and validation sets. The proof of a model's stability is in its ability to perform well month after month. A test set from a different time period, often called an out-of-time test set, is a good way to verify model stability, although such a test set is not always available.
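
For readers who prefer to see the split in code, here is a minimal sketch in Python. The file name and the snapshot_month and split parameters are illustrative assumptions, not anything prescribed by the text.

```python
# A minimal sketch of a three-way split with an out-of-time test set.
# File and column names here are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("model_set.csv")

# Hold out the most recent month as the out-of-time test set.
latest = df["snapshot_month"].max()
out_of_time_test = df[df["snapshot_month"] == latest]
in_time = df[df["snapshot_month"] < latest]

# Split the remaining records into training and validation sets.
train, validation = train_test_split(in_time, test_size=0.3,
                                     random_state=42)

# Fit on train, tune on validation, and judge the final model
# once on out_of_time_test.
```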

Step Five: Fix Problems with the Data

All data is dirty. All data has problems. What is or isn't a problem varies with the data mining technique. For some, such as decision trees, missing values and outliers do not cause too much trouble. For others, such as neural networks, they cause all sorts of trouble. For that reason, some of what we have to say about fixing problems with data can be found in the chapters on the techniques where they cause the most difficulty. The rest of what we have to say on this topic can be found in Chapter 17, in the section called "The Dark Side of Data."

The next few sections talk about some of the common problems that need to be fixed.




Categorical Variables with Too Many Values

Variables such as zip code, county, telephone handset model, and occupation code all convey useful information, but not in a way that most data mining algorithms can handle. Where a person lives and what he or she does for work are important predictors, but the variables that carry this information have so many possible values, and so few examples in the data for most of those values, that variables such as zip code and occupation end up being thrown away along with their valuable information content.

Variables like these must either be grouped, so that the many classes with approximately the same relationship to the target variable are combined, or replaced by interesting attributes of the zip code, handset model, or occupation. Replace zip codes with the zip code's median home price, population density, historical response rate, or whatever else seems likely to be predictive. Replace occupation with the median salary for that occupation. And so on.
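
A minimal sketch of the replacement approach in Python follows. The lookup table and all column names are invented for illustration; in practice the per-zip attributes would come from census data or a similar external source.

```python
# Replace a high-cardinality variable (zip code) with numeric
# attributes of the zip code. All values here are hypothetical.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "zip": ["02138", "10001", "94110"],
})

# Per-zip attributes gathered from an external source.
zip_stats = pd.DataFrame({
    "zip": ["02138", "10001", "94110"],
    "median_home_price": [750_000, 890_000, 1_100_000],
    "population_density": [6_500, 27_000, 17_000],
})

# Swap the raw zip code for its (presumably) predictive attributes.
customers = customers.merge(zip_stats, on="zip", how="left")
customers = customers.drop(columns="zip")
print(customers)
```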

Numeric Variables with Skewed Distributions and Outliers

Skewed distributions and outliers cause problems for any data mining technique that uses the values arithmetically (by multiplying them by weights and adding them together, for instance). In many cases, it makes sense to discard records that have outliers. In other cases, it is better to divide the values into equal-sized ranges, such as deciles. Sometimes, the best approach is to transform such variables to reduce the range of values, for instance by replacing each value with its logarithm.
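
Both remedies are easy to express in code. Here is a minimal sketch with invented income values; the single extreme value plays the role of the outlier.

```python
# Two ways of taming a skewed variable: binning into equal-sized
# ranges, and a log transform. The income values are made up.
import numpy as np
import pandas as pd

income = pd.Series([18_000, 25_000, 31_000, 42_000, 55_000,
                    72_000, 95_000, 130_000, 250_000, 4_000_000])

# Option 1: bin into equal-sized ranges (quintiles here, for brevity);
# the outlier simply lands in the top bin instead of distorting the scale.
bins = pd.qcut(income, q=5, labels=False)

# Option 2: replace each value with its logarithm to shrink the range.
log_income = np.log1p(income)

print(bins.values)
print(log_income.round(2).values)
```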

Missing Values

Some data mining algorithms are capable of treating "missing" as a value in its own right and incorporating it into rules. Others cannot handle missing values, unfortunately. None of the obvious solutions preserves the true distribution of the variable. Throwing out all records with missing values introduces bias, because it is unlikely that such records are distributed randomly. Replacing the missing value with some likely value, such as the mean or the most common value, adds spurious information. Replacing the missing value with an unlikely value is even worse, since the data mining algorithm will not recognize that -999, say, is an unlikely value for age. It will go ahead and use it.



When missing values must be replaced, the best approach is to impute them by creating a model that has the missing value as its target variable.
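
A minimal sketch of this model-based imputation, assuming scikit-learn is available; the file, the age target, and the predictor columns are all hypothetical.

```python
# Impute a missing value by modeling it: fit on the records where
# the value is present, predict where it is absent. Names are
# hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("customers.csv")
predictors = ["tenure_months", "num_orders", "region_code"]

known = df[df["age"].notna()]
unknown = df[df["age"].isna()]

model = RandomForestRegressor(random_state=42)
model.fit(known[predictors], known["age"])

# Fill in only the missing entries with the model's predictions.
df.loc[df["age"].isna(), "age"] = model.predict(unknown[predictors])
```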

Values with Meanings That Change over Time

When data comes from several different points in history, it is not uncommon for the same value in the same field to have changed its meaning over time. Credit class A may always be the best, but the exact range of credit scores that get classed as an A may change from time to time. Dealing with this properly requires a well-designed data warehouse where such changes in meaning are recorded so a new variable can be defined that has a constant meaning over time.
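
As a minimal sketch of the derived, constant-meaning variable, assume the warehouse has recorded the class boundaries for each period; the cutoffs below are invented for illustration.

```python
# Derive a class label with a constant meaning over time from the
# raw score plus the recorded, period-specific cutoffs (hypothetical).
credit_class_cutoffs = {
    # period: minimum credit score for class "A"
    "2001": 700,
    "2002": 720,   # the bar for an "A" moved
}

def normalized_class(score: int, period: str) -> str:
    """Return a class label whose meaning does not drift over time."""
    return "A" if score >= credit_class_cutoffs[period] else "B"

# A 710 score earned an "A" in 2001 but not in 2002.
print(normalized_class(710, "2001"), normalized_class(710, "2002"))
```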

Inconsistent Data Encoding

When information on the same topic is collected from multiple sources, the various sources often represent the same data in different ways. If these differences are not caught, they add spurious distinctions that can lead to erroneous conclusions. In one call-detail analysis project, each of the markets studied had a different way of indicating a call to check one's own voice mail. In one city, a call to voice mail from the phone line associated with that mailbox was recorded as having the same origin and destination numbers. In another city, the same situation was represented by the presence of a specific nonexistent number as the call destination. In yet another city, the actual number dialed to reach voice mail was recorded. Understanding apparent differences in voice mail habits between cities required putting the data in a common form.
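
Putting such data in a common form amounts to one normalizing function per source. A minimal sketch follows; the market names and phone numbers are hypothetical stand-ins for the three encodings just described.

```python
# Normalize three market-specific encodings of "call to own voice
# mail" into one common flag. All constants are hypothetical.
VOICEMAIL_DUMMY = "000-000-0000"    # nonexistent destination number
VOICEMAIL_ACCESS = "555-632-4550"   # actual dial-in number

def is_voicemail_call(market: str, origin: str, destination: str) -> bool:
    if market == "city_a":          # same origin and destination
        return origin == destination
    if market == "city_b":          # specific nonexistent number
        return destination == VOICEMAIL_DUMMY
    if market == "city_c":          # the real access number was recorded
        return destination == VOICEMAIL_ACCESS
    raise ValueError(f"unknown market: {market}")
```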

The same data set contained multiple abbreviations for some states and, in some cases, a particular city was counted separately from the rest of the state. If issues like this are not resolved, you may find yourself building a model of calling patterns to California based on data that excludes calls to Los Angeles.

Step Six: Transform Data to Bring Information to the Surface

Once the data has been assembled and major data problems fixed, the data must still be prepared for analysis. This involves adding derived fields to bring information to the surface. It may also involve removing outliers, binning numeric variables, grouping classes for categorical variables, applying transformations such as logarithms, turning counts into proportions, and the like.
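
As one example of a derived field bringing information to the surface, here is a minimal sketch, with invented column names, of turning a raw count into a proportion.

```python
# A derived field: raw counts confound "heavy caller" with
# "international caller"; the proportion separates the two.
# Column names and values are hypothetical.
import pandas as pd

usage = pd.DataFrame({
    "total_calls": [120, 45, 300],
    "intl_calls":  [12,  0,  150],
})

usage["intl_share"] = usage["intl_calls"] / usage["total_calls"]
print(usage)
```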


