

Data mining is often presented as a technical problem of finding a model that explains the relationship of a target variable to a group of input variables. That technical task is indeed central to most data mining efforts, but it should not be attempted until the target variable has been properly defined and the appropriate input variables identified. That, in turn, depends on a good understanding of the business problem to be addressed. As the story in the sidebar illustrates, failure to properly translate the business problem into a data mining problem leads to one of the dangers we are trying to avoid: learning things that are true, but not useful.

For a complete treatment of turning business problems into data mining problems, we recommend the book Business Modeling and Data Mining by our colleague Dorian Pyle. This book gives detailed advice on how to find the business problems where data mining provides the most benefit and how to formulate those problems for mining. Here, we simply remind the reader to consider two important questions before beginning the actual data mining process: How will the results be used? And, in what form will the results be delivered? The answer to the first question goes a long way towards answering the second.

Step Two: Select Appropriate Data

Data mining requires data. In the best of all possible worlds, the required data would already be resident in a corporate data warehouse, cleansed, available, historically accurate, and frequently updated. In fact, it is more often scattered in a variety of operational systems in incompatible formats on computers running different operating systems, accessed through incompatible desktop tools.

The data sources that are useful and available vary, of course, from problem to problem and industry to industry. Some examples of useful data:

Warranty claims data (including both fixed-format and free-text fields)

Point-of-sale data (including ring codes, coupons proffered, discounts applied)

Credit card charge records

Medical insurance claims data

Web log data

E-commerce server application logs

Direct mail response records

Call-center records, including memos written by the call-center reps

Printing press run records



Motor vehicle registration records

Noise level in decibels from microphones placed in communities near an airport

Telephone call detail records

Survey response data

Demographic and lifestyle data

Economic data

Hourly weather readings (wind direction, wind strength, precipitation)

Census data

Once the business problem has been formulated, it is possible to form a wish list of data that would be nice to have. For a study of existing customers, this should include data from the time they were acquired (acquisition channel, acquisition date, original product mix, original credit score, and so on), similar data describing their current status, and behavioral data accumulated during their tenure. Of course, it may not be possible to find everything on the wish list, but it is better to start out with an idea of what you would like to find.

Occasionally, a data mining effort starts without a specific business problem. A company becomes aware that it is not getting good value from the data it collects, and sets out to determine whether the data could be made more useful through data mining. The trick to making such a project successful is to turn it into a project designed to solve a specific problem. The first step is to explore the available data and make a list of candidate business problems. Invite business users to create a lengthy wish list, which can then be reduced to a small number of achievable goals: the data mining problem.

What Is Available?

The first place to look for data is in the corporate data warehouse. Data in the warehouse has already been cleaned and verified and brought together from multiple sources. A single data model hopefully ensures that similarly named fields have the same meaning and compatible data types throughout the database. The corporate data warehouse is a historical repository; new data is appended, but the historical data is never changed. Since it was designed for decision support, the data warehouse provides detailed data that can be aggregated to the right level for data mining. Chapter 15 goes into more detail about the relationship between data mining and data warehousing.

The only problem is that in many organizations such a data warehouse does not actually exist, or one or more data warehouses exist but don't live up to the promises. That being the case, data miners must seek data from various departmental databases and from within the bowels of operational systems.



These operational systems are designed to perform a certain task such as claims processing, call switching, order entry, or billing. They are designed with the primary goal of processing transactions quickly and accurately. The data is in whatever format best suits that goal and the historical record, if any, is likely to be in a tape archive. It may require significant political and programming effort to get the data in a form useful for knowledge discovery.

In some cases, operational procedures have to be changed in order to supply data. We know of one major catalog retailer that wanted to analyze the buying habits of its customers so as to market differentially to new customers and longstanding customers. Unfortunately, anyone who hadn't ordered anything in the past six months was routinely purged from the records. The substantial population of people who loyally used the catalog for Christmas shopping, but not during the rest of the year, went unrecognized, and indeed were unrecognizable, until the company began keeping historical customer records.
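To see how such a purge rule hides seasonal customers, consider the following minimal sketch. It is an illustration only: the function name `purge_inactive`, the 183-day approximation of "six months", and the two sample customers are assumptions, not details of the retailer's actual system.

```python
from datetime import date, timedelta

def purge_inactive(last_order_by_customer, as_of, window_days=183):
    """Keep only customers whose last order falls inside the window.

    window_days=183 is an assumed stand-in for "six months".
    """
    cutoff = as_of - timedelta(days=window_days)
    return {cust: d for cust, d in last_order_by_customer.items() if d >= cutoff}

# Hypothetical customers: one orders year-round, one only each December.
last_order = {
    "year_round": date(2003, 6, 20),
    "christmas_only": date(2002, 12, 10),
}

# Running the purge in July drops the loyal December shopper entirely.
kept = purge_inactive(last_order, as_of=date(2003, 7, 1))
print(sorted(kept))
```

Once the record is gone, nothing distinguishes a returning Christmas shopper from a brand-new customer, which is why the historical record has to be kept.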

In many companies, determining what data is available is surprisingly difficult. Documentation is often missing or out of date. Typically, there is no one person who can provide all the answers. Determining what is available requires looking through data dictionaries, interviewing users and database administrators, and examining existing reports.

WARNING: Use database documentation and data dictionaries as a guide, but do not accept them as unalterable fact. The fact that a field is defined in a table or mentioned in a document does not mean the field exists, is actually available for all customers, and is correctly loaded.
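One way to act on this warning is to audit each documented field against the data itself. The sketch below assumes records are available as Python dictionaries; the function `audit_fields` and the field names are hypothetical. It checks only the first two points (whether a field appears at all and how often it is populated). Whether values are correctly loaded still requires comparison against a trusted source.

```python
def audit_fields(records, documented_fields):
    """For each documented field, report whether it appears and its fill rate."""
    report = {}
    total = len(records)
    for field in documented_fields:
        # A field "exists" if at least one record carries it.
        present = sum(1 for r in records if field in r)
        # A value counts as filled if it is neither missing, None, nor empty.
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        report[field] = {
            "exists": present > 0,
            "fill_rate": filled / total if total else 0.0,
        }
    return report

# Hypothetical extract and a hypothetical data dictionary.
records = [
    {"customer_id": 1, "acquisition_channel": "web", "credit_score": 700},
    {"customer_id": 2, "acquisition_channel": "", "credit_score": None},
    {"customer_id": 3, "acquisition_channel": "mail"},
]
documented = ["customer_id", "acquisition_channel", "credit_score",
              "original_product_mix"]

report = audit_fields(records, documented)
for field, stats in report.items():
    print(field, stats)
```

A field that is documented but never present, or one that is filled for only a fraction of customers, is a signal to go back to the database administrators before building it into a model.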

How Much Data Is Enough?

Unfortunately, there is no simple answer to this question. The answer depends on the particular algorithms employed, the complexity of the data, and the relative frequency of possible outcomes. Statisticians have spent years developing tests for determining the smallest model set that can be used to produce a model. Machine learning researchers have spent much time and energy devising ways to let parts of the training set be reused for validation and test. All of this work ignores an important point: In the commercial world, statisticians are scarce, and data is anything but.

In any case, where data is scarce, data mining is not only less effective, it is less likely to be useful. Data mining is most useful when the sheer volume of data obscures patterns that might be detectable in smaller databases. Therefore, our advice is to use so much data that the questions about what constitutes an adequate sample size simply do not arise. We generally start with tens of thousands if not millions of preclassified records so that the training, validation, and test sets each contain many thousands of records.
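With that much data, partitioning is straightforward. The following is a minimal sketch of splitting preclassified records into training, validation, and test sets. The 60/20/20 proportions, the fixed seed, and the function name `split_records` are assumptions chosen for illustration; the text does not prescribe particular fractions.

```python
import random

def split_records(records, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle records and cut them into training, validation, and test sets.

    Whatever remains after the training and validation cuts becomes the test set.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

# Stand-in for 50,000 preclassified records (here just record identifiers).
records = list(range(50_000))
train, valid, test = split_records(records)
print(len(train), len(valid), len(test))
```

Starting from tens of thousands of records, each of the three sets still contains many thousands, so no record needs to be reused across sets.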



