Промышленный лизинг Промышленный лизинг  Методички 

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 [ 50 ] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222

The simplest explanation is usually the best one-even (or especially) if it does not prove the hypothesis you want to prove.

This skeptical attitude is very valuable for both statisticians and data miners. Our goal is to demonstrate results that work, and to discount the null hypothesis. One difference between data miners and statisticians is that data miners are often working with sufficiently large amounts of data that make it unnecessary to worry about the mechanics of calculating the probability of something being due to chance.

P-Values

The null hypothesis is not merely an approach to analysis; it can also be quantified. The p-value is the probability that the null hypothesis is true. Remember, when the null hypothesis is true, nothing is really happening, because differences are due to chance. Much of statistics is devoted to determining bounds for the p-value.

Consider the previous example of the presidential poll. Consider that the p-value is calculated to be 60 percent (more on how this is done later in the chapter). This means that there is a 60 percent likelihood that the difference in the support for the two candidates as measured by the poll is due strictly to chance and not to the overall support in the general population. In this case, there is little evidence that the support for the two candidates is different.

Lets say the p-value is 5 percent, instead. This is a relatively small number, and it means that we are 95 percent confident that Candidate B is doing better than Candidate A. Confidence, sometimes called the q-value, is the flip side of the p-value. Generally, the goal is to aim for a confidence level of at least 90 percent, if not 95 percent or more (meaning that the corresponding p-value is less than 10 percent, or 5 percent, respectively).

These ideas-null hypothesis, p-value, and confidence-are three basic ideas in statistics. The next section carries these ideas further and introduces the statistical concept of distributions, with particular attention to the normal distribution.

A Look at Data

A statistic refers to a measure taken on a sample of data. Statistics is the study of these measures and the samples they are measured on. A good place to start, then, is with such useful measures, and how to look at data.



Looking at Discrete Values

Much of the data used in data mining is discrete by nature, rather than continuous. Discrete data shows up in the form of products, channels, regions, and descriptive information about businesses. This section discusses ways of looking at and analyzing discrete fields.

Histograms

The most basic descriptive statistic about discrete fields is the number of times different values occur. Figure 5.1 shows a histogram of stop reason codes during a period of time. A histogram shows how often each value occurs in the data and can have either absolute quantities (204 times) or percentage (14.6 percent). Often, there are too many values to show in a single histogram such as this case where there are over 30 additional codes grouped into the other category.

In addition to the values for each category, this histogram also shows the cumulative proportion of stops, whose scale is shown on the left-hand side. Through the cumulative histogram, it is possible to see that the top three codes account for about 50 percent of stops, and the top 10, almost 90 percent. As an aesthetic note, the grid lines intersect both the left- and right-hand scales at sensible points, making it easier to read values off of the chart.

12,500

10,000

7,500

5,000

2,500

1,491 1,306

1,226 1,108

П П П П

1---I---I---I---I---I---I---I---I---г

TI NO OT VN PE CM CP NR MV EX OTHER

Stop Reason Code

100%

10,048

5,944

<U

E з z

4,884

3,851

3,549

3,311

3,054

Figure 5.1 This example shows both a histogram (as a vertical bar chart) and cumulative proportion (as a line) on the same chart for stop reasons associated with a particular marketing effort.



Time Series

Histograms are quite useful and easily made with Excel or any statistics package. However, histograms describe a single moment. Data mining is often concerned with what is happening over time. A key question is whether the frequency of values is constant over time.

Time series analysis requires choosing an appropriate time frame for the data; this includes not only the units of time, but also when we start counting from. Some different time frames are the beginning of a customer relationship, when a customer requests a stop, the actual stop date, and so on. Different fields belong in different time frames. For example:

Fields describing the beginning of a customer relationship-such as original product, original channel, or original market-should be looked at by the customers original start date.

Fields describing the end of a customer relationship-such as last product, stop reason, or stop channel-should be looked at by the customers stop date or the customers tenure at that point in time.

Fields describing events during the customer relationship-such as product upgrade or downgrade, response to a promotion, or a late payment-should be looked at by the date of the event, the customers tenure at that point in time, or the relative time since some other event.

The next step is to plot the time series as shown in Figure 5.2. This figure has two series for stops by stop date. One shows a particular stop type over time (price increase stops) and the other, the total number of stops. Notice that the units for the time axis are in days. Although much business reporting is done at the weekly and monthly level, we prefer to look at data by day in order to see important patterns that might emerge at a fine level of granularity, patterns that might be obscured by summarization. In this case, there is a clear up and down wiggling pattern in both lines. This is due to a weekly cycle in stops. In addition, the lighter line is for the price increase related stops. These clearly show a marked increase starting in February, due to a change in pricing.

When looking at field values over time, look at the data by day to get a feel for the data at the most granular level.

A time series chart has a wealth of information. For example, fitting a line to the data makes it possible to see and quantify long term trends, as shown in Figure 5.2. Be careful when doing this, because of seasonality. Partial years might introduce inadvertent trends, so include entire years when using a best-fit line. The trend in this figure shows an increase in stops. This may be nothing to worry about, especially since the number of customers is also increasing over this period of time. This suggests that a better measure would be the stop rate, rather than the raw number of stops.



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 [ 50 ] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222