

account future growth (data does not grow smaller over time). The scalability aspect of data mining is important in three ways:

Transforming the data into customer signatures requires a lot of I/O and computing power.

Building models is a repetitive and computationally expensive process.

Scoring models requires complex data transformations.

For exploring and transforming data, the most readily available scalable software is the relational database. Relational databases are designed to take advantage of multiple processors and multiple disks when handling a single query. Another class of software, the extraction, transformation, and load (ETL) tools used to create databases, may also be scalable and useful for data mining. Most programming languages, however, do not scale; they support only a single processor and a single disk for a single task. When a lot of data needs to be combined, the most scalable solution is often found at the database level.
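As a concrete illustration (not from the text), the sketch below pushes a customer-signature transformation into a relational database, so the SQL engine rather than the mining tool does the heavy I/O. The table and column names are hypothetical placeholders.

import sqlite3

# A minimal sketch of building customer signatures inside the database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer_id INT, tx_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(1, "2023-01-05", 19.95), (1, "2023-02-11", 42.50), (2, "2023-01-20", 7.25)],
)

# One row per customer: the customer signature that becomes the model input.
conn.execute("""
    CREATE TABLE customer_signature AS
    SELECT customer_id,
           COUNT(*)     AS num_transactions,
           SUM(amount)  AS total_spend,
           AVG(amount)  AS avg_amount,
           MAX(tx_date) AS last_tx_date
    FROM transactions
    GROUP BY customer_id
""")
for row in conn.execute("SELECT * FROM customer_signature"):
    print(row)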

Building models and exploring data require software that runs fast enough on large enough quantities of data. Some data mining tools work only on data in memory, so the volume of data is limited by available memory. This has the advantage that algorithms run faster, but it also imposes limits. In practice, this was a serious problem when available memory was measured in megabytes; the gigabytes of memory available even on a typical workstation ameliorate it. Often, the data mining environment places a multiuser data mining server on a powerful machine close to the data. This is a good solution. As workstations become more powerful, building the models locally is also viable. In either case, the goal is to run the models on hundreds of thousands or millions of rows in a reasonable amount of time. A data mining environment should encourage users to understand and explore the data, rather than expend effort sampling it down to make it fit in memory.
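For example, one way to explore more data than fits in memory, without sampling it down, is to accumulate summary statistics chunk by chunk. The sketch below assumes a hypothetical customers.csv file with a balance column.

import pandas as pd

# Accumulate summary statistics over a file too large to load at once.
row_count = 0
balance_sum = 0.0
for chunk in pd.read_csv("customers.csv", chunksize=100_000):
    row_count += len(chunk)
    balance_sum += chunk["balance"].sum()

print("rows:", row_count)
print("mean balance:", balance_sum / row_count)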

The scoring environment is often the most complex, because it requires transforming the data and running the models at the same time, preferably with a minimal amount of user interaction. Perhaps the best solution is when data mining software can both read from and write to relational databases, making it possible to use the database for scalable data manipulation and the data mining tool for efficient model building.
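The sketch below illustrates that scoring loop in the simplest terms: signatures are read from the database, a model (here a trivial placeholder) produces a score, and the scores are written back. It assumes a hypothetical warehouse.db containing the customer_signature table from the earlier sketch.

import sqlite3

def score(num_transactions, total_spend):
    # Placeholder for a trained model; the formula is made up.
    return 0.1 * num_transactions + 0.01 * total_spend

conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS scores (customer_id INT, score REAL)")
rows = conn.execute(
    "SELECT customer_id, num_transactions, total_spend FROM customer_signature"
).fetchall()
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [(cid, score(n, s)) for cid, n, s in rows],
)
conn.commit()
conn.close()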

Support for Scoring

The ability to write to as well as read from a database is desirable when data mining is used to develop models used for scoring. The models may be developed using samples extracted from the master database, but once developed, the models will score every record in the database.



The value of a response model decreases with time. Ideally, the results of one campaign should be analyzed in time to affect the next one. In many organizations, however, there is a long lag between the time a model is developed and the time it can be used to append scores to a database; sometimes that lag is measured in weeks or months. The delay is caused by the difficulty of moving the scoring model, which is often developed on a different computer from the database server, into a form that can be applied to the database. This might involve interpreting the output of a data mining tool and writing a computer program that embodies the rules that make up the model.

The problem is even worse when the database is actually stored at a third facility, such as that of a list processor. The list processor is unlikely to accept a neural network model in the form of C source code as input to a list selection request. Building a unified model development and scoring framework requires significant integration effort, but if scoring large databases is an important application for your business, the effort will be repaid.
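One common workaround, sketched below with made-up coefficients and column names, is to translate a trained model (here a logistic regression) into a SQL expression that the database or list processor can run directly, rather than shipping C source code. The sketch assumes the target database provides an EXP() function.

# Translate a hypothetical logistic regression into SQL for in-database scoring.
intercept = -2.3
coefficients = {"num_transactions": 0.12, "total_spend": 0.004}

linear_term = " + ".join(
    f"({coef} * {column})" for column, coef in coefficients.items()
)
sql = (
    "UPDATE customer_signature "
    f"SET response_score = 1.0 / (1.0 + EXP(-({intercept} + {linear_term})))"
)
print(sql)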

Multiple Levels of User Interfaces

In many organizations, several different communities of users use the data mining software. In order to accommodate their differing needs, the tool should provide several different user interfaces:

A graphical user interface (GUI) with reasonable default values for data mining parameters, for the casual user.

Advanced options for more skilled users.

An ability to build models in batch mode (which could be provided by a command line interface).

An applications program interface (API) so that predictive modeling can be built into applications.

The GUI for a data mining tool should not only make it easy for users to build models; it should be designed to encourage best practices, such as ensuring that model assessment is performed on a hold-out set and that the target variables for predictive models come from a later timeframe than the inputs. The user interface should include context-sensitive help. To improve the chance of success for casual users, it should provide reasonable default values for such things as the minimum number of records needed to support a split in a decision tree or the number of nodes in the hidden layer of a neural network. On the other hand, the interface should make it easy for more knowledgeable users to change the defaults. Advanced users should be able to control every aspect of the underlying data mining algorithms.
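As an illustration of this defaults-versus-overrides principle (using scikit-learn, which is not a tool discussed here), a casual user can accept the defaults while an advanced user tightens the minimum number of records needed to support a split:

from sklearn.tree import DecisionTreeClassifier

# Casual user: accept the tool's defaults.
casual_model = DecisionTreeClassifier()

# Advanced user: override the defaults to control the algorithm directly.
advanced_model = DecisionTreeClassifier(
    min_samples_split=200,  # require at least 200 records to support a split
    max_depth=6,            # keep the tree small enough to read
)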



Comprehensible Output

Tools vary greatly in the extent to which they explain themselves. Rule generators, tree visualizers, Web diagrams, and association tables can all help.

Some vendors place great emphasis on the visual representation of both data and rules, providing three-dimensional data terrain maps, geographic information systems (GIS), and cluster diagrams to help make sense of complex relationships. The final destination of much data mining work is reports for management, and the power of graphics for convincing non-technical audiences of data mining results should not be underestimated. A data mining tool should make it easy to export results to commonly available reporting and analysis packages such as Excel and PowerPoint.

Ability to Handle Diverse Data Types

Many data mining software packages place restrictions on the kinds of data that can be analyzed. Before investing in a data mining software package, find out how it deals with the various data types you want to work with.

Some tools have difficulty using categorical variables (such as model, type, gender) as input variables and require the user to convert these into a series of yes/no variables, one for each possible class. Others can deal with categorical variables that take on a small number of values, but break down when faced with too many. On the target field side, some tools can handle a binary classification task (good/bad), but have difficulty predicting the value of a categorical variable that can take on several values.
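For example, converting a categorical variable into a series of yes/no indicator columns looks like the following pandas sketch; the data is made up for illustration.

import pandas as pd

# Expand each categorical variable into one 0/1 indicator column per value.
df = pd.DataFrame({"gender": ["M", "F", "F", "M"], "type": ["A", "B", "A", "C"]})
encoded = pd.get_dummies(df, columns=["gender", "type"])
print(encoded)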

Some data mining packages on the market require that continuous variables (income, mileage, balance) be split into ranges by the user. This is especially likely to be true of tools that generate association rules, since these require a certain number of occurrences of the same combination of values in order to recognize a rule.
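Splitting a continuous variable into ranges by hand might look like the sketch below; the income values and bin edges are invented for illustration.

import pandas as pd

# Bin a continuous variable into labeled ranges before rule generation.
income = pd.Series([12_000, 48_000, 95_000, 230_000])
income_band = pd.cut(
    income,
    bins=[0, 25_000, 75_000, 150_000, float("inf")],
    labels=["low", "middle", "upper", "high"],
)
print(income_band)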

Most data mining tools cannot deal with text, although such support is starting to appear. If the text strings in the data are standardized codes (state, part number), this is not really a problem, since character codes can easily be converted to numeric or categorical ones. If the application requires the ability to analyze free text, some of the more advanced data mining tool sets are starting to provide support for this capability.

Documentation and Ease of Use

A well-designed user interface should make it possible to start mining right away, even if mastery of the tool requires time and study. As with any complex software, good documentation can spell the difference between success and frustration. Before deciding on a tool, ask to look over the manual. It is very


