
A NOTE ABOUT SURVIVAL ANALYSIS AND STATISTICS

The discussion of survival analysis in this chapter assumes that time is discrete. In particular, things happen on particular days, and the particular time of day is not important. This is not only reasonable for the problems addressed by data mining, but it is also more intuitive and simplifies the mathematics.

In statistics, though, survival analysis makes the opposite assumption, that time is continuous. Instead of hazard probabilities, statisticians work with hazard rates, which are turned into survival curves by using exponentiation and integration. One difference between a rate and a probability is that the rate can exceed 1, whereas a probability never does. Also, a rate seems less intuitive for many survival problems encountered with customers.
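As a rough sketch of the relationship just described, the two conventions can be written side by side. The symbols h(i), \lambda(u), and S(t) are notation assumed here for illustration; they do not come from the chapter, and the discrete indexing shown is only one common convention.

```latex
% Discrete time: h(i) is the hazard probability at tenure i, so 0 <= h(i) <= 1
S(t) = \prod_{i=0}^{t-1} \bigl(1 - h(i)\bigr)

% Continuous time: \lambda(u) is the hazard rate, which may exceed 1
S(t) = \exp\!\left(-\int_{0}^{t} \lambda(u)\,du\right)
```

For instance, a constant rate of 2 per year is perfectly legitimate in the continuous form: survival to the half-year mark is exp(-1), or roughly 0.37. A hazard probability of 2, by contrast, would be meaningless.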

The method for calculating hazards in this chapter is called the life table method, and it works well with discrete time data. A very similar method, called Kaplan-Meier, is used for continuous time data. The two techniques produce almost exactly the same results when events occur at discrete times.
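The life table calculation can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's own code: the function names and the sample data are hypothetical, and it assumes that a customer counts as "at risk" at tenure t whenever his or her observed tenure is at least t.

```python
from collections import Counter


def life_table_hazards(tenures, stopped, max_tenure):
    """Return a list h where h[t] is the hazard probability at tenure t.

    tenures: observed tenure of each customer, in whole time units.
    stopped: True if the customer stopped at that tenure, False if the
             customer was still active when observed (censored).
    """
    stops_at = Counter(t for t, s in zip(tenures, stopped) if s)
    hazards = []
    for t in range(max_tenure + 1):
        # Assumption: everyone whose tenure is at least t is at risk at t.
        at_risk = sum(1 for x in tenures if x >= t)
        hazards.append(stops_at[t] / at_risk if at_risk else 0.0)
    return hazards


def survival_from_hazards(hazards):
    """Turn hazards into survival: result[t] is the share surviving to tenure t."""
    survival = [1.0]
    for h in hazards:
        survival.append(survival[-1] * (1.0 - h))
    return survival


# Hypothetical data: six customers, three of whom have stopped.
example_tenures = [1, 2, 2, 3, 3, 3]
example_stopped = [True, True, False, True, False, False]
example_hazards = life_table_hazards(example_tenures, example_stopped, 3)
example_survival = survival_from_hazards(example_hazards)
```

Censored customers contribute to the population at risk for every tenure they were observed at, but never to the count of stops, which is what lets the method use customers who are still active.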

An important part of statistical survival analysis is the estimation of hazards using parameterized regression, that is, trying to find the best functional form for the hazards. This is an alternative to the approach used in this chapter, which calculates the hazards directly from the data.

The parameterized approach has the important advantage that it can more easily include covariates in the process. Later in this chapter, there is an example based on such a parameterized model. Unfortunately, the hazard function rarely follows a form that would be familiar to nonstatisticians. The hazards do such a good job of describing the customer life cycle that it would be shocking if a simple function captured that rich complexity.

We strongly encourage interested readers who have a mathematical or statistical background to investigate the area further.

Proportional Hazards

Sir David Cox is one of the most cited statisticians of the past century; his work comprises numerous books and over 250 articles. He has received many awards including a knighthood bestowed on him by Queen Elizabeth in 1985. Much of his research centered on understanding hazard functions, and his work has been particularly important in the world of medical research.

His seminal paper was about determining the effect of initial factors (time-zero covariates) on hazards. By assuming that these initial factors have a uniform proportional effect on hazards, he was able to figure out how to measure this effect for different factors. The purpose of this section is to introduce proportional hazards and to suggest how they are useful for understanding customers. This section starts with some examples of why proportional hazards are useful. It then describes an alternative approach before returning to the Cox model itself.

Examples of Proportional Hazards

Consider the following statement about one risk from smoking: The risk of leukemia for smokers is 1.53 times greater than for nonsmokers. This result is a classic example of proportional hazards. At the time of the study, the researchers knew whether someone was or was not a smoker (actually, there was a third group of former smokers, but our purpose here is to illustrate an example). Whether or not someone is a smoker is an example of an initial condition. Since there are only two factors to consider, it is possible to just look at the hazard curves and to derive some sort of average for the overall risk.

Figure 12.11 provides an illustration from the world of marketing. It shows two sets of hazard probabilities, one for customers who joined from a telephone solicitation and the other from direct mail. Once again, how someone became a customer is an example of an initial condition. The hazards for the telemarketing customers are higher; looking at the chart, we might say telemarketing customers are a bit less than twice as risky as direct mail customers. Cox proportional hazard regression provides a way to quantify this.

The two just-mentioned examples use categorical variables as the risk factor. Consider another statement about the risk of tobacco: The risk of colorectal cancer increases 6.7 percent per pack-year smoked. This statement differs from the previous one, because it now depends on a continuous variable. Using proportional hazards, it is possible to determine the contribution of both categorical and continuous covariates.
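To make the per-unit statement concrete, here is a small sketch of how such a figure translates into a hazard multiplier. It assumes the effect compounds multiplicatively with each additional unit, as it would in a Cox-style model where the multiplier is exp(beta * x); the function name is hypothetical, and the 1.067 figure is simply the one quoted above.

```python
import math


def hazard_multiplier(ratio_per_unit, units):
    """Hazard multiplier after `units` of a continuous covariate.

    Assumes a proportional effect: each unit multiplies the hazard by
    ratio_per_unit, so the coefficient is beta = log(ratio_per_unit)
    and the overall multiplier is exp(beta * units).
    """
    beta = math.log(ratio_per_unit)
    return math.exp(beta * units)


# A 6.7 percent increase per pack-year, applied to 10 pack-years.
ten_pack_years = hazard_multiplier(1.067, 10)
```

Under that assumption, ten pack-years correspond to a hazard a little less than twice the baseline, not the 67 percent increase that simple addition would suggest. A categorical factor such as smoker versus nonsmoker is the special case where the covariate is either 0 or 1.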

[Chart: hazard probability (vertical axis, in percent, up to 20%) by tenure in weeks (horizontal axis, 0 to 70), with one curve for Telemarketing and one for Direct Mail.]

Figure 12.11 These two hazard functions suggest that the risk of attrition is about one and a half times as great for customers acquired through telemarketing versus direct mail.



Stratification: Measuring Initial Effects on Survival

Figure 12.11 showed hazard probabilities for two different groups of customers, one that started via outbound telemarketing campaigns and the other via direct mail campaigns. These two curves clearly show differences between these channels. It is possible to generate a survival curve for these hazards and quantify the difference, using 1-year survival, median survival, or average truncated tenure. This approach to measuring differences among different groups defined by initial conditions is called stratification because each group is analyzed independently from other groups. This produces good visualizations and accurate survival values. It is also quite easy, since statistical packages such as SAS and SPSS have options that make it easy to stratify data for this purpose.
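Stratification can be sketched as follows: split the customers by the initial condition, build a survival curve for each group separately, and compare summary values. This is a minimal illustration with hypothetical function names and made-up records; it assumes discrete tenures and the life table convention that anyone with tenure of at least t is at risk at t. Statistical packages do the same work with far more options.

```python
from collections import defaultdict


def survival_curve(tenures, stopped, max_tenure):
    """Return a list where entry t is the share of customers surviving to tenure t."""
    survival = [1.0]
    for t in range(max_tenure):
        at_risk = sum(1 for x in tenures if x >= t)
        stops = sum(1 for x, s in zip(tenures, stopped) if s and x == t)
        hazard = stops / at_risk if at_risk else 0.0
        survival.append(survival[-1] * (1.0 - hazard))
    return survival


def median_tenure(survival):
    """First tenure at which survival is 50% or less, or None if never reached."""
    for t, share in enumerate(survival):
        if share <= 0.5:
            return t
    return None


def stratify(records, max_tenure):
    """Build one survival curve per group; records are (group, tenure, stopped)."""
    groups = defaultdict(lambda: ([], []))
    for group, tenure, has_stopped in records:
        groups[group][0].append(tenure)
        groups[group][1].append(has_stopped)
    return {g: survival_curve(t, s, max_tenure) for g, (t, s) in groups.items()}


# Hypothetical records: (acquisition channel, tenure, stopped?)
records = [
    ("telemarketing", 1, True), ("telemarketing", 1, True),
    ("telemarketing", 2, True), ("telemarketing", 3, False),
    ("direct_mail", 2, True), ("direct_mail", 3, False),
    ("direct_mail", 3, False), ("direct_mail", 3, False),
]
curves = stratify(records, max_tenure=3)
for channel, curve in curves.items():
    print(channel, "survival at tenure 3:", curve[3], "median:", median_tenure(curve))
```

Because each group is processed on its own, every curve depends only on the customers in that group, which is why group size matters so much for this approach.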

Stratification solves the problem of understanding initial effects assuming that two conditions are true. First, the initial effect needs to be a categorical variable. Since the data is being broken into separate groups, some variable, such as channel or product or region, needs to be chosen for this purpose. Of course, it is always possible to use binning to break a continuous variable into discrete chunks.

The second is that each group needs to be fairly big. When starting with lots and lots of customers and only using one variable that takes on a handful of values, such as channel, this is not a problem. However, there may be multiple variables of interest, such as:

Acquisition channel

Original promotion

Geography

Once more than one dimension is included, the number of categories grows very quickly. This means that the data gets spread thinly, making the hazards less and less reliable.
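A quick back-of-the-envelope calculation shows how fast the groups thin out. The counts below are purely hypothetical and are not taken from any data set in this chapter; the point is only that the number of strata is the product of the number of values in each dimension.

```python
import math

# Hypothetical number of distinct values for each initial condition.
distinct_values = {
    "acquisition_channel": 5,
    "original_promotion": 20,
    "geography": 50,
}

# Hypothetical customer base.
total_customers = 1_000_000

# Every combination of values forms its own stratum.
strata = math.prod(distinct_values.values())
average_per_stratum = total_customers / strata

print(strata, "strata, about", average_per_stratum, "customers each on average")
```

An average hides the real problem: customers are rarely spread evenly, so many combinations end up with far fewer members than the average, and the hazards for those groups are estimated from very few stops.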

Cox Proportional Hazards

In 1972, Sir David Cox recognized this problem and proposed a method of analysis, now known as Cox proportional hazards regression, that overcomes these limitations. His brilliant insight was to find a way to focus on the original conditions and not on the hazards themselves. The question is: What effect do the initial conditions have on hazards? His approach to answering this question is quite interesting.

Fortunately, the ideas are simpler than the mathematics behind his approach. Instead of focusing on hazards, he introduces the idea of partial likelihood. Assuming that only one customer stops at a given time t, the partial likelihood at t is the likelihood that exactly that particular customer stopped.
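The idea can be sketched in code using the standard form of the Cox partial likelihood, which is not spelled out in the text above: each customer still at risk at time t gets a relative risk score exp(beta . x), and the partial likelihood is the stopped customer's score divided by the sum of scores over everyone at risk. The function name, the coefficient, and the covariate values are hypothetical.

```python
import math


def partial_likelihood(beta, at_risk_covariates, stopped_index):
    """Likelihood that the customer at stopped_index is the one who stopped.

    beta: list of coefficients, one per covariate.
    at_risk_covariates: one covariate list per customer still at risk at time t.
    stopped_index: position of the customer who actually stopped at time t.
    """
    scores = [
        math.exp(sum(b * x for b, x in zip(beta, covariates)))
        for covariates in at_risk_covariates
    ]
    return scores[stopped_index] / sum(scores)


# Hypothetical example: one covariate (1 = telemarketing, 0 = direct mail),
# three customers at risk, and the telemarketing customer stops.
beta = [math.log(2)]  # assumes telemarketing doubles the hazard
at_risk = [[1], [0], [0]]
likelihood = partial_likelihood(beta, at_risk, stopped_index=0)
```

Notice that the baseline hazard never appears: it would multiply every score equally and cancel out of the ratio. That cancellation is what lets the method concentrate on the initial conditions and not on the hazards themselves. Fitting the model means choosing the coefficients that make the product of these terms, over all stops, as large as possible.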


