

data (continued)
  missing data
    data correction, 73-74
    NULL values, 590
    splits, decision trees, 174-175
  operational feedback, 485, 492
  patterns
    meaningful discoveries, 56
    prediction, 45
    untruthful learning sources, 45-46
  point-of-sale
    association rules, 288
    scanners, 3
    as useful data source, 60
  preparation
    automatic cluster detection, 363-365
    categorical values, neural networks, 239-240
    continuous values, neural networks, 235-237
  quality, association rules, 308
  representation, genetic algorithms, 432-433
  scarce, 62
  source systems, 484, 486-487
  SQL, time series analysis, 572-573
  terabytes, 5
  truncated, 162
  useful data sources, 60-61
  visualization tools, 65
  wrong level of detail, untruthful learning sources, 47
data mining
  architecture, 528-532
  as creative process, 33
  directed
    classification, 57
    discussed, 7
    estimation, 57
    prediction, 57
  documentation, 536-537
  goals of, 7
  insourcing, 524-525
  outsourcing, 522-524
  platforms, 527
  scalability, 533-534
  scoring platforms, 527-528
  staffing, 525-526
  typical operational systems versus, 33
  undirected
    affinity grouping, 57
    clustering, 57
    discussed, 7
Data Preparation for Data Mining (Dorian Pyle), 75
The Data Warehouse Toolkit (Ralph Kimball), 474
data warehousing
  customer patterns, 5
  for decision support, 13
  discussed, 4
database administrators (DBAs), 488
databases
  call detail, 37
  demographic, 37
  KDD (knowledge discovery in databases), 8
  server platforms, affordability, 13
datasets, balanced, model sets, 68
dates and times, interval variables, 551

DBAs (database administrators), 488
deaths, household level data, 96
debt, nonrepayment of, credit risks, 114
decision support
  data warehousing for, 13
  hypothesis testing, 50-51
  summary data, OLAP, 477-478
decision trees
  alphas, 188
  alternate representations for, 199-202
  applying to sequential events, 205
  branching nodes, 176
  building models, 8
  case-study, 206, 208
  for catalog response models, 175
  classification, 9, 166-168
  cost considerations, 195
  effectiveness of, measuring, 176
  estimation, 170
  as exploration tool, 203-204
  fields, multiple, 195-197
  neural networks, 199
  profiling tasks, 12
  projective visualization, 207-208
  pruning
    C5 algorithm, 190-191
    CART algorithm, 185, 188-189
    discussed, 184
    minimum support pruning, 312
    stability-based, 191-192
  rectangular regions, 197
  regression trees, 170
  rules, extracting, 193-194
  SAS Enterprise Miner Tree Viewer tool, 167-168
  scoring, 169-170
  splits
    on categorical input variables, 174
    chi-square testing, 180-183
    discussed, 170
    diversity measures, 177-178
    entropy, 179
    finding, 172
    Gini splitting criterion, 178
    information gain ratio, 178, 180
    intrinsic information of, 180
    missing values, 174-175
    multiway, 171
    on numeric input variables, 173
    population diversity, 178
    purity measures, 177-178
    reduction in variance, 183
    surrogate, 175
  subtrees, selecting, 189
  uses for, 166
declining usage, behavior-based variables, 577-579

deep intimacy, customer relationships, 449, 451
default classes, records, 194
default risks, proof-of-concept projects, 599
degrees of freedom values, chi-square tests, 152-153
democracy approach, memory-based reasoning, 279-281
demographic databases, 37
demographic profiles, customers, 31
density
  data selection, 62-63
  density function, statistics, 133
deploying models, 84-85
derived variables, column data, 542
descriptions
  comparing values with, 65
  data transformation, 57
descriptive models, assessing, 78
descriptive profiling, 52
deviation. See standard deviation
difference of proportion
  chi-square tests versus, 153-154
  statistical analysis, 143-144
differential response analysis, marketing campaigns, 107-108
differentiation, market basket analysis, 289
dimension
  automatic cluster detection, 352
  dimension tables, OLAP, 502-503
directed clustering, automatic cluster detection, 372
directed data mining
  classification, 57
  discussed, 7
  estimation, 57
  prediction, 57
directed graphs, 330
directed models, assessing, 78-79
directed profiling, 52
dirty data, 592-593



discrete outcomes, classification, 9
discrete values, statistics, 127-131
discrimination measures, ROC curves, 99
dissociation rules, 317
distance and similarity, automatic cluster detection, 359-363
distance function
  defined, 271-272
  discussed, 258, 265
  hidden distance fields, 278
  identity distance, 271
  numeric fields, 275
  triangle inequality, 272
  zip codes, 276-277
distribution
  data exploration, 65
  one-tailed, 134
  probability and, 135
  statistics, 130-132
  two-tailed, 134
diverse data types, 536
diversity measures, splitting criteria, decision trees, 177-178
divisive clustering, automatic cluster detection, 371-372
documentation
  data mining, 536-537
  historical data as, 61
dumping data, flat files, 594

EBCF (existing base churn forecast), 469
economic data, useful data sources, 61
edges, graphs, 322
education level, household level data, 96
e-mail
  as communication channel, 89
  free text resources, 556-557
encoding, inconsistent, data correction, 74
enterprise-wide data, 33
entropy, information gain, 178-180
equal-height binning, 551
equal-width binning, 551
erroneous conclusions, 74
errors
  countervailing, 81-82
  error rates
    adjusted, 185
    establishing, 79
  measurement, 159
  operational, 159
  predicting, 191
  standard error of proportion, statistical analysis, 139-141
established customers, customer relationships, 457
estimation
  accuracy, 79-81
  averages, 81
  business goals, formulating, 605
  classification tasks, 9
  collaborative filtering, 284-285
  data transformation, 57
  decision trees, 170
  directed data mining, 57
  estimation task examples, 10
  examples of, 10
  neural networks, 10, 215
  regression models, 10
  revenue, behavior-based variables, 581-583
  standard deviation, 81
  valued outcomes, 9
ETL (extraction, transformation, and load) tools, 487, 595
evaluation, automatic cluster detection, 372-373
event-based relationships, customer relationships, 458-459
existing base churn forecast (EBCF), 469
expectations
  comparing to results, 31
  expected values, chi-square tests, 150-151
  proof-of-concept projects, 599


