

DEGREES OF FREEDOM

The idea behind the degrees of freedom is how many different variables are needed to describe the table of expected values. This is a measure of how constrained the data is in the table.

If the table has r rows and c columns, then there are r * c cells in the table. With no constraints on the table, this is the number of variables that would be needed. However, the calculation of the expected values has imposed some constraints. In particular, the sum of the values in each row is the same for the expected values as for the original table. That is, if one value in a row were missing, we could recover it by subtracting the sum of the rest of the values in the row from the sum for the whole row. Each row therefore gives up one free value, which suggests that the degrees of freedom is r * c - r. The same situation exists for the columns, yielding an estimate of r * c - r - c.

However, there is one additional constraint. The sum of all the row sums and the sum of all the column sums must be the same. It turns out that we have overcounted the constraints by one, so the degrees of freedom is really r * c - r - c + 1. Another way of writing this is (r - 1) * (c - 1).
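The counting argument can be checked on a small table. The sketch below is illustrative only: the 3 x 2 counts are invented, the function names are hypothetical, and the expected values are assumed to follow the usual independence rule (row total times column total, divided by the grand total).

```python
def expected_table(observed):
    """Expected counts, assuming row_total * col_total / grand_total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def degrees_of_freedom(n_rows, n_cols):
    # r * c cells, minus r row constraints, minus c column constraints,
    # plus 1 because the constraints were overcounted by one.
    return n_rows * n_cols - n_rows - n_cols + 1


# Hypothetical 3 x 2 table of counts, invented for illustration.
observed = [[20, 30], [40, 10], [15, 35]]
expected = expected_table(observed)

# The expected table keeps every row sum and column sum of the original.
for obs_row, exp_row in zip(observed, expected):
    assert abs(sum(obs_row) - sum(exp_row)) < 1e-9
for obs_col, exp_col in zip(zip(*observed), zip(*expected)):
    assert abs(sum(obs_col) - sum(exp_col)) < 1e-9

dof = degrees_of_freedom(len(observed), len(observed[0]))
# Both ways of writing the degrees of freedom agree.
assert dof == (len(observed) - 1) * (len(observed[0]) - 1)
print(dof)
```

Because the row and column sums are preserved, filling in (r - 1) * (c - 1) cells of the expected table is enough to determine all the others.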

The result is the probability that the distribution of values in the table is due to random fluctuations rather than some external criteria. As Occam's Razor suggests, the simplest explanation is that there is no difference at all due to the various factors; that is, the observed differences from the expected values are entirely within the range of expectation.

Comparison of Chi-Square to Difference of Proportions

Chi-square and difference of proportions can be applied to the same problems. Although the results are not exactly the same, the results are similar enough for comfort. Earlier, in Table 5.4, we determined the likelihood of champion and challenger results being the same using the difference of proportions method for a range of champion response rates. Table 5.7 repeats this using the chi-square calculation instead of the difference of proportions. The results from the chi-square test are very similar to the results from the difference of proportions, a remarkable result considering how different the two methods are.
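As a rough check on this comparison, the sketch below recomputes two rows of Table 5.7 using only the Python standard library. Several things here are assumptions rather than anything stated in the text: the function names are hypothetical, the chi-square p-value uses the fact that a 2 x 2 table has one degree of freedom, and the difference of proportions is taken to be a two-tailed test with an unpooled standard error.

```python
import math


def chi_square_2x2(resp_a, non_a, resp_b, non_b):
    """Chi-square statistic for a 2 x 2 table of responders and non-responders."""
    observed = [[resp_a, non_a], [resp_b, non_b]]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


def chi_square_p_value_1dof(chi2):
    # With 1 degree of freedom, chi-square is a squared standard normal,
    # so the upper-tail probability is erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def diff_of_proportions_p_value(resp_a, n_a, resp_b, n_b):
    # Assumption: unpooled standard error and a two-tailed normal test.
    p_a, p_b = resp_a / n_a, resp_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = abs(p_a - p_b) / se
    return math.erfc(z / math.sqrt(2.0))


# First row of Table 5.7: challenger 5,000 / 95,000, champion 40,500 / 859,500.
chi2_first = chi_square_2x2(5_000, 95_000, 40_500, 859_500)
print(round(chi2_first, 2))

# Fourth row of Table 5.7: champion 43,200 / 856,800.
chi2_fourth = chi_square_2x2(5_000, 95_000, 43_200, 856_800)
p_chi = chi_square_p_value_1dof(chi2_fourth)
p_diff = diff_of_proportions_p_value(5_000, 100_000, 43_200, 900_000)
print(f"{p_chi:.2%} {p_diff:.2%}")
```

Under these assumptions the two p-values for the fourth row should land close to the 0.51% and 0.58% reported in the table, which is the sense in which the two methods agree.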



Table 5.7 Chi-Square Calculation for Difference of Proportions Example in Table 5.4

CHALLENGER        CHAMPION          OVERALL  CHALLENGER EXP    CHAMPION EXP      CHAL CHI-SQUARE   CHAMP CHI-SQUARE  CHI-SQUARE        DIFF PROP
RESP    NON RESP  RESP    NON RESP  RESP     RESP    NON RESP  RESP    NON RESP  RESP    NON RESP  RESP    NON RESP  VALUE   P-VALUE   P-VALUE
5,000   95,000    40,500  859,500   4.55%    4,550   95,450    40,950  859,050   44.51   2.12      4.95    0.24      51.81   0.00%     0.00%
5,000   95,000    41,400  858,600   4.64%    4,640   95,360    41,760  858,240   27.93   1.36      3.10    0.15      32.54   0.00%     0.00%
5,000   95,000    42,300  857,700   4.73%    4,730   95,270    42,570  857,430   15.41   0.77      1.71    0.09      17.97   0.00%     0.00%
5,000   95,000    43,200  856,800   4.82%    4,820   95,180    43,380  856,620   6.72    0.34      0.75    0.04      7.85    0.51%     0.58%
5,000   95,000    44,100  855,900   4.91%    4,910   95,090    44,190  855,810   1.65    0.09      0.18    0.01      1.93    16.50%    16.83%
5,000   95,000    45,000  855,000   5.00%    5,000   95,000    45,000  855,000   0.00    0.00      0.00    0.00      0.00    100.00%   100.00%
5,000   95,000    45,900  854,100   5.09%    5,090   94,910    45,810  854,190   1.59    0.09      0.18    0.01      1.86    17.23%    16.91%
5,000   95,000    46,800  853,200   5.18%    5,180   94,820    46,620  853,380   6.25    0.34      0.69    0.04      7.33    0.68%     0.60%
5,000   95,000    47,700  852,300   5.27%    5,270   94,730    47,430  852,570   13.83   0.77      1.54    0.09      16.23   0.01%     0.00%
5,000   95,000    48,600  851,400   5.36%    5,360   94,640    48,240  851,760   24.18   1.37      2.69    0.15      28.39   0.00%     0.00%
5,000   95,000    49,500  850,500   5.45%    5,450   94,550    49,050  850,950   37.16   2.14      4.13    0.24      43.66   0.00%     0.00%



An Example: Chi-Square for Regions and Starts

A large consumer-oriented company has been running acquisition campaigns in the New York City area. The purpose of this analysis is to look at their acquisition channels to try to gain an understanding of different parts of the area. For the purposes of this analysis, three channels are of interest:

Telemarketing. Customers who are acquired through outbound telemarketing calls (note that this data was collected before the national do-not-call list went into effect).

Direct mail. Customers who respond to direct mail pieces.

Other. Customers who come in through other means.

The area of interest consists of eight counties in New York State. Five of these counties are the boroughs of New York City, two others (Nassau and Suffolk counties) are on Long Island, and one (Westchester) lies just north of the city. This data was shown earlier in Table 5.1. The purpose of this analysis is to determine whether the breakdown of starts by channel and county is due to chance or whether some other factors might be at work.

This problem is particularly suitable for chi-square because the data can be laid out in rows and columns, with no customer being counted in more than one cell. Table 5.8 shows the deviation, expected values, and chi-square values for each combination in the table. Notice that the chi-square values are often quite large in this example. The overall chi-square score for the table is 7,200, which is very large; the probability that the overall score is due to chance is basically 0. That is, the variation among starts by channel and by region is not due to sample variation. There are other factors at work.
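To see why a score of 7,200 leaves essentially no room for chance, the sketch below evaluates the chi-square upper tail for a table of eight counties by three channels. It is an illustration, not the method used in the text: the function name is hypothetical, and it relies on a closed form that is valid only when the number of degrees of freedom is even, which happens to be the case here.

```python
import math


def chi_square_upper_tail_even_dof(chi2, dof):
    """Upper-tail probability of a chi-square value when dof is even.

    Uses the closed form exp(-x/2) * sum over i < dof/2 of (x/2)^i / i!,
    which holds only for an even number of degrees of freedom.
    """
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("this sketch handles positive even dof only")
    half = chi2 / 2.0
    term, total = 1.0, 1.0
    for i in range(1, dof // 2):
        term *= half / i  # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total


dof = (8 - 1) * (3 - 1)  # eight counties by three channels
p_value = chi_square_upper_tail_even_dof(7200.0, dof)
print(dof, p_value)
```

The exponential factor is so small that it underflows in double precision, so the computed probability is indistinguishable from zero.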

The next step is to determine which of the values are too high and too low and with what probability. It is tempting to convert each chi-square value in each cell into a probability, using the degrees of freedom for the table. The table is 8 x 3, so it has 14 degrees of freedom. However, this is not an appropriate thing to do. The chi-square result is for the entire table; inverting the individual scores to get a probability does not produce valid results. Chi-square scores are not additive.

An alternative approach proves more accurate. The idea is to compare each cell to everything else. The result is a table that has two columns and two rows, as shown in Table 5.9. One column is the column of the original cell; the other column is everything else. One row is the row of the original cell; the other row is everything else.
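The cell-versus-everything-else idea can be sketched as follows. The counts are invented for illustration (they are not the values from Table 5.1), the function names are hypothetical, and the p-value assumes one degree of freedom for the resulting 2 x 2 table.

```python
import math


def cell_versus_rest(table, row, col):
    """Collapse a table to 2 x 2: the cell's row and column versus everything else."""
    total = sum(sum(r) for r in table)
    row_total = sum(table[row])
    col_total = sum(r[col] for r in table)
    cell = table[row][col]
    return [
        [cell, row_total - cell],
        [col_total - cell, total - row_total - col_total + cell],
    ]


def chi_square(table):
    """Chi-square statistic with expected counts from row and column totals."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    total = sum(row_totals)
    score = 0.0
    for i, r in enumerate(table):
        for j, observed in enumerate(r):
            expected = row_totals[i] * col_totals[j] / total
            score += (observed - expected) ** 2 / expected
    return score


# Hypothetical starts: rows are regions, columns are channels.
starts = [
    [100, 200, 700],
    [300, 150, 550],
    [50, 100, 850],
]

collapsed = cell_versus_rest(starts, 0, 0)
score = chi_square(collapsed)
# A 2 x 2 table has (2 - 1) * (2 - 1) = 1 degree of freedom.
p_value = math.erfc(math.sqrt(score / 2.0))
print(collapsed, round(score, 2), p_value)
```

Repeating this for every cell gives one chi-square score and one probability per cell, each computed on its own 2 x 2 table rather than read off the score for the full table.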


