Figure 5.11 shows another situation with the same result. This curve shows sales and inventory for a retailer for one product. Sales are always less than or equal to the inventory. On the days marked with Xs, though, the inventory sold out. What were the potential sales on those days? The potential sales are greater than or equal to the observed sales, another example of censored data.

Truncated data poses another problem in terms of biasing samples. Truncated data is not included in databases, often because it is too old. For instance, when Company A purchases Company B, the two companies' systems are merged. Often, only the active customers from Company B are moved into the data warehouse for Company A. That is, all customers active on a given date are moved over; customers who had stopped the day before are not. This is an example of left truncation, and it pops up throughout corporate databases, usually with no warning (unless the documentation is very good about saying what is not in the warehouse as well as what is). This can cause confusion when looking at when customers started and discovering that all customers who started 5 years before the merger were mysteriously active for at least 5 years. This is not due to a miraculous acquisition program; it is because all the ones who stopped earlier were excluded.

Lessons Learned


This chapter talks about some basic statistical methods that are useful for analyzing data. When looking at data, it is useful to look at histograms and cumulative histograms to see what values are most common. More important, though, is looking at values over time.
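As a rough illustration, here is a minimal sketch in Python of computing a histogram and a cumulative histogram; the column of values (order amounts) and the synthetic data standing in for it are assumptions for the example, not data from the text.

import numpy as np
import matplotlib.pyplot as plt

# Stand-in data: 1,000 hypothetical order amounts (not from the text).
order_amounts = np.random.lognormal(mean=3.0, sigma=0.8, size=1000)

# Histogram: how many values fall into each bin.
counts, bin_edges = np.histogram(order_amounts, bins=20)

# Cumulative histogram: a running total of the bin counts.
cumulative = np.cumsum(counts)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.bar(bin_edges[:-1], counts, width=np.diff(bin_edges), align="edge")
ax1.set_title("Histogram: most common values")
ax2.step(bin_edges[:-1], cumulative, where="post")
ax2.set_title("Cumulative histogram")
plt.show()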

One of the big questions addressed by statistics is whether observed values are expected or not. For this, the number of standard deviations from the mean (z-score) can be used to calculate the probability of the value being due to chance (the p-value). High p-values mean that the results are consistent with the null hypothesis; that is, nothing interesting is happening. Low p-values suggest that other factors may be influencing the results. Converting z-scores to p-values depends on the normal distribution.
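A minimal sketch of this conversion, assuming scipy is available; the observed value, expected value, and standard deviation below are hypothetical numbers chosen only to show the calculation.

from scipy.stats import norm

observed = 5320.0   # observed value (hypothetical)
expected = 5000.0   # value expected under the null hypothesis (hypothetical)
std_dev = 150.0     # standard deviation of the measurement (hypothetical)

# z-score: how many standard deviations the observation is from the expected value.
z_score = (observed - expected) / std_dev

# Two-sided p-value: probability of a value at least this far out by chance alone.
p_value = 2 * norm.sf(abs(z_score))

print(f"z-score = {z_score:.2f}, p-value = {p_value:.4f}")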

Business problems often require analyzing data expressed as proportions. Fortunately, proportions based on reasonably large samples behave much like normally distributed values. The formula for the standard error of a proportion (SEP) makes it possible to define a confidence interval on a proportion such as a response rate. The standard error for the difference of proportions (SEDP) makes it possible to determine whether two values are similar. This works by defining a confidence interval for the difference between the two values.
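The formulas themselves are short. The following sketch, with hypothetical response counts, shows the SEP, a confidence interval built from it, and the SEDP for comparing two response rates.

import math

def sep(p, n):
    # Standard error of a proportion p observed on n cases.
    return math.sqrt(p * (1 - p) / n)

def sedp(p1, n1, p2, n2):
    # Standard error of the difference between two proportions.
    return math.sqrt(sep(p1, n1) ** 2 + sep(p2, n2) ** 2)

# Hypothetical response rates for two mailings.
p1, n1 = 0.045, 100000   # 4.5% response on 100,000 pieces
p2, n2 = 0.050, 20000    # 5.0% response on 20,000 pieces

# 95% confidence interval for the first rate: roughly +/- 1.96 standard errors.
print(f"rate 1: {p1:.3%} +/- {1.96 * sep(p1, n1):.3%}")

# Confidence interval for the difference; if it straddles zero, the two rates
# cannot be distinguished at this confidence level.
print(f"difference: {p2 - p1:.3%} +/- {1.96 * sedp(p1, n1, p2, n2):.3%}")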

When designing marketing tests, the SEP and SEDP can be used for sizing test and control groups. In particular, these groups should be large enough to measure differences in response with a high enough confidence. Tests that have more than two groups need to take an adjustment, called Bonferroni's correction, into account when setting the group sizes.
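One way to turn this into a group-size calculation is sketched below; the response rate, the difference to detect, and the number of comparisons are illustrative assumptions, and the formula simply inverts the SEDP expression for two equal-sized groups.

import math
from scipy.stats import norm

def group_size(p, detectable_diff, alpha=0.05, num_comparisons=1):
    # Cases needed per group so that a difference of detectable_diff around a
    # response rate p falls outside the confidence interval. With more than
    # two groups, alpha is divided by the number of comparisons (Bonferroni's
    # correction), which widens the interval and enlarges the groups.
    adjusted_alpha = alpha / num_comparisons
    z = norm.ppf(1 - adjusted_alpha / 2)   # two-sided critical z-value
    # From z * sqrt(2 * p * (1 - p) / n) <= detectable_diff, solve for n.
    return math.ceil(2 * p * (1 - p) * (z / detectable_diff) ** 2)

# Detect a 0.5 percentage point difference around a 5% response rate.
print(group_size(p=0.05, detectable_diff=0.005))                      # single comparison
print(group_size(p=0.05, detectable_diff=0.005, num_comparisons=3))   # three comparisons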

The chi-square test is another statistical method that is often useful. This method calculates expected values for data laid out in rows and columns and measures how far the observed counts deviate from those estimates. Based on this, the chi-square test can determine whether the results are likely or unlikely to be due to chance. As shown in an example, the chi-square test and SEDP methods produce similar results.
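A minimal sketch using scipy's chi-square test on a two-by-two table; the counts (responders and non-responders for two hypothetical offers) are invented for illustration.

from scipy.stats import chi2_contingency

observed = [
    [520, 9480],   # offer A: responders, non-responders (hypothetical)
    [580, 9420],   # offer B: responders, non-responders (hypothetical)
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, dof = {dof}, p-value = {p_value:.4f}")
# 'expected' holds the estimated cell counts assuming rows and columns are
# independent; a low p-value means the observed counts deviate from these
# estimates by more than chance alone would explain.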

Statisticians and data miners solve similar problems. However, because of historical differences and differences in the nature of the problems, there are some differences in approaches. Data miners generally have lots and lots of data with few measurement errors. This data changes over time, and values are sometimes incomplete. The data miner has to be particularly suspicious about bias introduced into the data by business processes.

The next eight chapters dive into more modern techniques for building models and understanding data. Many of these techniques have been adopted by statisticians and build on more than a century of work in this area.




