
Size of Training Set

The more features there are in the network, the more training examples are needed to get good coverage of the patterns in the data. Unfortunately, there is no simple rule expressing the relationship between the number of features and the size of the training set. However, a minimum of a few hundred examples is typically needed to support each feature with adequate coverage; having several thousand is not unreasonable. The authors have worked with neural networks that have only six or seven inputs but whose training sets contained hundreds of thousands of rows.

When the training set is not sufficiently large, neural networks tend to over-fit the data. Overfitting is guaranteed to happen when there are fewer training examples than there are weights in the network. This poses a problem, because the network will work very, very well on the training set, but it will fail spectacularly on unseen data.
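As a rough, hypothetical illustration of that rule of thumb, the sketch below counts the weights (including bias terms) in a small fully connected network with one hidden layer; the topology is made up for the example, not taken from the text.

```python
# Rough illustration: count the weights (including biases) in a fully
# connected network with one hidden layer. The topology is hypothetical,
# chosen only to show the arithmetic.
def count_weights(n_inputs, n_hidden, n_outputs):
    # each hidden unit has one weight per input plus a bias;
    # each output unit has one weight per hidden unit plus a bias
    hidden_weights = n_hidden * (n_inputs + 1)
    output_weights = n_outputs * (n_hidden + 1)
    return hidden_weights + output_weights

print(count_weights(n_inputs=7, n_hidden=15, n_outputs=1))  # 136 weights
# A training set with fewer than ~136 rows would almost certainly overfit;
# a few hundred examples per input feature is a safer starting point.
```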

Of course, the downside of a really large training set is that it takes the neural network longer to train. In a given amount of time, you may get better models by using fewer input features and a smaller training set and experimenting with different combinations of features and network topologies rather than using the largest possible training set that leaves no time for experimentation.

Number of Outputs

A training example typically has many more inputs than outputs, so good coverage of the inputs usually results in good coverage of the outputs. However, it is very important that there be many examples for all possible output values from the network. In addition, the number of training examples for each possible output should be about the same. This can be critical when deciding what to use as the training set.

For instance, if the neural network is going to be used to detect rare but important events, such as failures in diesel engines, fraudulent use of a credit card, or who will respond to an offer for a home equity line of credit, then the training set must have a sufficient number of examples of these rare events. A random sample of the available data may not be sufficient, because common examples will swamp the rare ones. To get around this, the training set needs to be balanced by oversampling the rare cases. For this type of problem, a training set consisting of 10,000 good examples and 10,000 bad examples gives better results than a randomly selected training set of 100,000 good examples and 1,000 bad examples. After all, using the randomly sampled training set, the neural network would probably predict "good" regardless of the input and be right 99 percent of the time. This is an exception to the general rule that a larger training set is better.
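A minimal sketch of this kind of balancing, using only the Python standard library; the function name, arguments, and row counts are illustrative assumptions, not a particular tool's API.

```python
import random

def balanced_training_set(common_rows, rare_rows, n_per_class, seed=0):
    """Build a balanced training set: sample the common class down to
    n_per_class rows and, if the rare class has fewer rows than that,
    oversample it with replacement."""
    rng = random.Random(seed)
    common = rng.sample(common_rows, min(n_per_class, len(common_rows)))
    if len(rare_rows) >= n_per_class:
        rare = rng.sample(rare_rows, n_per_class)
    else:
        rare = rng.choices(rare_rows, k=n_per_class)  # oversample rare cases
    rows = common + rare
    rng.shuffle(rows)
    return rows

# e.g. 100,000 "good" rows and 1,000 "bad" rows -> 10,000 of each
# training_rows = balanced_training_set(good_rows, bad_rows, 10_000)
```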



TIP The training set for a neural network has to be large enough to cover all the values taken on by all the features. You want to have at least a dozen, if not hundreds or thousands, of examples for each input feature. For the outputs of the network, you want to be sure that there is an even distribution of values. This is a case where fewer examples in the training set can actually improve results, by not swamping the network with good examples when you want to train it to recognize bad examples. The size of the training set is also influenced by the power of the machine running the model. A neural network needs more time to train when the training set is very large. That time could perhaps better be used to experiment with different features, input mapping functions, and parameters of the network.

Preparing the Data

Preparing the input data is often the most complicated part of using a neural network. Part of the complication is the normal problem of choosing the right data and the right examples for a data mining endeavor. Another part is mapping each field to an appropriate range; remember that using a limited range of inputs helps networks recognize patterns better. Some neural network packages facilitate this translation with friendly, graphical interfaces. Since the format of the data going into the network has a big effect on how well the network performs, we review the common ways to map data here. Chapter 17 contains additional material on data preparation.

Features with Continuous Values

Some features take on continuous values, generally ranging between known minimum and maximum bounds. Examples of such features are:

Dollar amounts (sales price, monthly balance, weekly sales, income, and so on)

Averages (average monthly balance, average sales volume, and so on)

Ratios (debt-to-income, price-to-earnings, and so on)

Physical measurements (area of living space, temperature, and so on)

The real estate appraisal example showed a good way to handle continuous features. When these features fall into a predefined range between a minimum value and a maximum value, the values can be scaled to be in a reasonable range, using a calculation such as:

mapped value = 2 * (original value - min) / (max - min + 1) - 1



This transformation (subtract the min, divide by the range, double and subtract 1) produces a value in the range from -1 to 1 that follows the same distribution as the original value. This works well in many cases, but there are some additional considerations.
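A minimal sketch of this mapping in Python; the function name is ours, and the "+ 1" in the denominator simply follows the formula above, which keeps the mapped maximum just under 1.

```python
def map_value(original, min_val, max_val):
    """Scale a continuous value into roughly [-1, 1] using the formula above.
    The '+ 1' in the denominator keeps the mapped maximum just below 1."""
    return 2.0 * (original - min_val) / (max_val - min_val + 1) - 1.0

# Living area from the real estate example: training range 714 to 4,185 sq ft
print(map_value(714, 714, 4185))   # -1.0
print(map_value(4185, 714, 4185))  # about 0.999
```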

The first is that the range a variable takes in the training set may differ from its range in the data being scored. Of course, we try to avoid this situation by ensuring that all variable values are represented in the training set. However, this ideal is not always achievable. Someone could build a new house in the neighborhood with 5,000 square feet of living space, perhaps rendering the real estate appraisal network useless. There are several ways to approach this:

Plan for a larger range. The range of living areas in the training set was from 714 square feet to 4,185 square feet. Instead of using these values for the minimum and maximum of the range, allow for some growth, using, say, 500 and 5,000 instead.

Reject out-of-range values. Once we start extrapolating beyond the ranges of values in the training set, we have much less confidence in the results. Only use the network for predefined ranges of input values. This is particularly important when using a network for controlling a manufacturing process; wildly incorrect results can lead to disasters.

Peg values lower than the minimum to the minimum and values higher than the maximum to the maximum, so houses larger than 4,000 square feet would all be treated the same. This works well in many situations, but probably not here: the price of a house is highly correlated with its living area, so a house with 20 percent more living area than the maximum house size (all other things being equal) would cost about 20 percent more. In other situations, though, pegging the values works quite well.

Map the minimum value to -0.9 and the maximum value to 0.9 instead of -1 and 1.

Or, most likely, don't worry about it. What matters is that most values are near 0; a few exceptions probably will not have a significant impact. (A short sketch of these range-handling options follows this list.)
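Here is the sketch promised above, combining a planned-for wider range, pegging, and the -0.9/0.9 mapping. The function is an illustrative assumption; only the 714 to 4,185 square foot training range and the 5,000 square foot house come from the example.

```python
def map_with_pegging(original, min_val, max_val, lo=-1.0, hi=1.0):
    """Peg out-of-range values to the training range, then scale to [lo, hi].
    Using lo=-0.9, hi=0.9 leaves headroom for values slightly outside the
    range seen during training."""
    pegged = max(min_val, min(original, max_val))
    return lo + (hi - lo) * (pegged - min_val) / (max_val - min_val)

# Hypothetical 5,000 sq ft house scored against the 714-4,185 sq ft training range
print(map_with_pegging(5000, 714, 4185))             # pegged, maps to 1.0
print(map_with_pegging(5000, 714, 4185, -0.9, 0.9))  # pegged, maps to 0.9
print(map_with_pegging(5000, 500, 5000))             # planned-for wider range: 1.0
```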

Figure 7.9 illustrates another problem that arises with continuous features: skewed distributions of values. In this data, almost all incomes are under $100,000, but the range goes from $10,000 to $1,000,000. Scaling the values as suggested maps a $30,000 income to -0.96 and a $65,000 income to -0.89, hardly any difference at all, although this income differential might be very significant for a marketing application. On the other hand, $250,000 and $800,000 become -0.51 and +0.60, respectively, a very large difference, though this income differential might be much less significant. The incomes are highly skewed toward the low end, and this can make it difficult for the neural network to take advantage of the income field. Skewed distributions
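To see the compression numerically, this sketch reproduces the linear scaling of those incomes and contrasts it with scaling the logarithm of income, one common remedy for skew (our illustrative suggestion here, not necessarily the one the text goes on to recommend).

```python
import math

def scale(value, lo, hi):
    # linear scaling into [-1, 1]
    return 2.0 * (value - lo) / (hi - lo) - 1.0

incomes = [30_000, 65_000, 250_000, 800_000]

# Linear scaling of the skewed income field: the low incomes are crowded together
print([round(scale(v, 10_000, 1_000_000), 2) for v in incomes])
# roughly [-0.96, -0.89, -0.52, 0.6], close to the values quoted above

# Scaling log10(income) instead spreads the low end back out
print([round(scale(math.log10(v), math.log10(10_000), math.log10(1_000_000)), 2)
       for v in incomes])
# roughly [-0.52, -0.19, 0.4, 0.9]
```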


