
In practice, star schemas may not be efficient for answering all users' questions, because the central fact table is so large. In such cases, OLAP systems introduce summary tables at different levels to speed query response. Relational database vendors have been providing more and more support for star schemas. With a typical architecture, any query on the central fact table requires multiple joins back to the dimension tables. By applying standard indexes, and by creatively enhancing indexing technology, relational databases can handle these queries quite well.
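As a rough illustration, the sketch below mimics such a query using pandas in place of a relational engine; the fact and dimension tables (sales_fact, date_dim, product_dim) and their columns are invented for the example.

    # Hypothetical star schema: one fact table and two small dimension tables.
    import pandas as pd

    sales_fact = pd.DataFrame({
        "date_key": [1, 1, 2, 2],
        "product_key": [10, 11, 10, 11],
        "units_sold": [3, 5, 2, 7],
        "revenue": [30.0, 125.0, 20.0, 175.0],
    })
    date_dim = pd.DataFrame({"date_key": [1, 2], "month": ["Jan", "Feb"]})
    product_dim = pd.DataFrame({"product_key": [10, 11],
                                "category": ["Snacks", "Drinks"]})

    # A typical query joins the central fact table back to its dimensions...
    joined = (sales_fact
              .merge(date_dim, on="date_key")
              .merge(product_dim, on="product_key"))

    # ...and then aggregates along the chosen dimensions, which is the work
    # that summary tables and good indexing are meant to speed up.
    summary = joined.groupby(["month", "category"])["revenue"].sum()
    print(summary)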

OLAP and Data Mining

Data mining is about the successful exploitation of data for decision-support purposes. The virtuous cycle of data mining, described in Chapter 2, reminds us that success depends on more than advanced pattern recognition algorithms. The data mining process needs to provide feedback to people and encourage using information gained from data mining to improve business processes. The data mining process should enable people to provide input, in the form of observations, hypotheses, and hunches about what results are important and how to use those results.

In the larger context of data exploitation, OLAP clearly plays an important role as a means of broadening the audience with access to data. Decisions once made based on experience and educated guesses can now be based on data and on patterns in the data. Anomalies and outliers can be identified for further investigation and further modeling, sometimes using the most sophisticated data mining techniques. For instance, using an OLAP tool, a user might discover that a particular item sells better at a particular time during the week. This might lead to an investigation using market basket analysis to find other items purchased with that item. Market basket analysis might suggest an explanation for the observed behavior, providing more information and more opportunities for exploiting it.

There are other synergies between data mining and OLAP. One of the characteristics of decision trees discussed in Chapter 6 is their ability to identify the most informative features in the data relative to a particular outcome. That is, if a decision tree is built to predict attrition, then the upper levels of the tree contain the features that are the most important predictors of attrition. These predictors are often good choices for dimensions in an OLAP tool, and such analysis helps build better, more useful cubes. Another problem when building cubes is determining how to make continuous dimensions discrete. The split points in the nodes of a decision tree can suggest good break points for a continuous value, and this information can be fed into the OLAP tool to improve the dimension.
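A minimal sketch of this idea, using scikit-learn's DecisionTreeClassifier on synthetic data, is shown below. The field names (tenure_months, churned) and the attrition pattern are assumptions made purely for illustration; the point is that the split thresholds the tree learns are candidate break points for a continuous dimension.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic example: attrition is assumed to be more likely at short tenures.
    rng = np.random.default_rng(0)
    tenure_months = rng.uniform(0, 72, size=1000).reshape(-1, 1)
    churned = (tenure_months.ravel() < 12) | (rng.random(1000) < 0.1)

    tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=50)
    tree.fit(tenure_months, churned)

    # Internal nodes store the thresholds used to split; leaves are marked
    # with the sentinel value -2. The remaining thresholds are candidate bin edges.
    thresholds = sorted(t for t in tree.tree_.threshold if t != -2)
    print("Suggested break points (months):", [round(t, 1) for t in thresholds])

In a cube built on tenure, those break points would replace arbitrary, evenly spaced bins with ranges that actually matter for attrition.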



One of the problems with neural networks is the difficulty of understanding the results. This is especially true when using them for undirected data mining, as when using self-organizing map (SOM) networks to detect clusters. The SOM identifies clusters, but it cannot explain what the clusters mean.

OLAP to the rescue! The data can now be enhanced with a predicted cluster, as well as with other information about customers, such as demographics, purchase history, and so on. This is a good application for a cube. Using OLAP, with information about the clusters included as a dimension, makes it possible for end users to explore the clusters and to determine the features that distinguish them. The dimensions used for the OLAP cube should include the inputs to the SOM neural network, along with the cluster identifier and perhaps other descriptive variables. There is a tricky data conversion problem, because the neural network requires continuous values scaled between -1 and 1, while OLAP tools prefer discrete values. For values that were originally discrete, this is no problem. For continuous values, various binning techniques solve the problem.
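The sketch below shows one way such a cube's input might be prepared: the cluster identifier (assumed here to come from a SOM trained elsewhere) is carried along as a dimension, and the continuous inputs are binned into discrete ranges. All column names and bin edges are illustrative assumptions.

    import pandas as pd

    # Hypothetical customer records with a cluster label produced elsewhere.
    customers = pd.DataFrame({
        "customer_id": [1, 2, 3, 4, 5],
        "age": [23, 35, 47, 62, 29],
        "annual_spend": [180.0, 950.0, 420.0, 1300.0, 610.0],
        "cluster_id": [2, 0, 1, 0, 2],
    })

    # Discretize the continuous inputs so they can serve as OLAP dimensions.
    customers["age_band"] = pd.cut(customers["age"],
                                   bins=[0, 25, 40, 55, 120],
                                   labels=["<25", "25-39", "40-54", "55+"])
    customers["spend_band"] = pd.qcut(customers["annual_spend"], q=3,
                                      labels=["low", "medium", "high"])

    # Cross-tabulating cluster against a binned dimension is a first step
    # toward explaining what each cluster means.
    print(pd.crosstab(customers["cluster_id"], customers["age_band"]))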

As these examples show, OLAP and data mining complement each other. Data mining can help build better cubes by defining appropriate dimensions and by determining how to break up continuous values on those dimensions. OLAP provides a powerful visualization capability that helps users better understand the results of data mining, such as clustering and neural networks. Used together, OLAP and data mining reinforce each other's strengths and provide more opportunities for exploiting data.

Where Data Mining Fits in with Data Warehousing

Data mining plays an important role in the data warehouse environment. The initial returns from a data warehouse come from automating existing processes, such as putting reports online and giving existing applications a clean source of data. The biggest returns, though, come from improved access to data that spurs innovation and creativity, through new ways of looking at and analyzing the data. This is the role of data mining: to provide the tools that improve understanding and inspire creativity based on observations in the data.

A good data warehousing environment serves as a catalyst for data mining. The two technologies work together as partners:

Data mining thrives on large amounts of data, and the more detailed the data, the better; this is exactly the kind of data that comes from a data warehouse.

Data mining thrives on clean and consistent data, capitalizing on the investment in data cleansing tools.



The data warehouse environment enables hypothesis testing and simplifies efforts to measure the effects of actions taken, supporting the virtuous cycle of data mining.

Scalable hardware and relational database software can offload the data processing parts of data mining.

There is, however, a distinction between the way data mining looks at the world and the way data warehousing does. Normalized data warehouses can store data with time stamps, but it is very difficult to do time-related manipulations, such as determining what event happened just before some other event of interest. OLAP introduces a time dimension. Data mining extends this even further by taking into account the notion of before and after. Data mining learns from data (the before), with the purpose of applying these findings to the future (the after). For this reason, data mining often puts a heavy load on data warehouses. These are complementary technologies, supporting each other as discussed in the next few sections.
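The fragment below sketches the kind of before-and-after question that is awkward to answer in a normalized warehouse: for each customer, what event immediately preceded an event of interest? The events table and its columns are hypothetical.

    import pandas as pd

    # Hypothetical time-stamped event history for two customers.
    events = pd.DataFrame({
        "customer_id": [1, 1, 1, 2, 2],
        "event": ["purchase", "complaint", "churn", "purchase", "churn"],
        "timestamp": pd.to_datetime(["2024-01-05", "2024-02-10", "2024-03-01",
                                     "2024-01-20", "2024-02-15"]),
    })

    # Order each customer's history, then look one step back from the
    # event of interest (here, churn) to see what happened just before it.
    events = events.sort_values(["customer_id", "timestamp"])
    events["previous_event"] = events.groupby("customer_id")["event"].shift(1)
    print(events.loc[events["event"] == "churn",
                     ["customer_id", "previous_event"]])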

Lots of Data

The traditional approach to data analysis generally starts by reducing the size of the data. There are three common ways of doing this: summarizing detailed transactions, taking a subset of the data, and looking at only certain attributes. The reason for reducing the size of the data is to make it possible to analyze the data on the available hardware and software systems. When this is done properly, the laws of statistics come into play, and it is possible to choose a sample that behaves roughly like the rest of the data.
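As a small, hedged example of such a reduction, the sketch below draws a sample stratified on the outcome so that the sample's response rate closely matches that of the full data; the table and column names are invented for the illustration.

    import pandas as pd

    # Hypothetical customer table with a roughly 10 percent response rate.
    customers = pd.DataFrame({
        "customer_id": range(1, 1001),
        "responded": [i % 10 == 0 for i in range(1, 1001)],
    })

    # Sample 20 percent within each outcome group so the rate is preserved.
    sample = (customers
              .groupby("responded", group_keys=False)
              .sample(frac=0.2, random_state=42))

    print("Full response rate:  ", customers["responded"].mean())
    print("Sample response rate:", sample["responded"].mean())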

Data mining, on the other hand, searches for trends in the data and for valuable anomalies. It often tries to answer different types of questions than traditional statistical analysis does, such as: Which product is this customer most likely to purchase next? Even if it is possible to devise a model using a subset of the data, it is still necessary to deploy the model and score all customers, a process that can be very computationally intensive.

Fortunately, data mining algorithms are often able to take advantage of large amounts of data. When looking for patterns that identify rare events, such as having to write off customers because they failed to pay, having large amounts of data ensures that there is sufficient data for analysis. A subset of the data might be statistically relevant in total, but when you try to decompose it into segments (by region, by product, by customer segment), there may be too little data in each segment to produce statistically meaningful results.

Data mining algorithms are able to make use of lots of data. Decision trees, for example, work very well, even when there are dozens or hundreds of fields in each record. Link analysis requires a full complement of the data to create a


