

believe that informed decisions lead to better bottom-line results over time, and data warehouses help managers make informed decisions. Decision support, as used here, is an intentionally ambiguous concept. It can be as rudimentary as getting production reports to front-line managers every week. It can be as complex as sophisticated modeling of prospective customers using neural networks to determine which message to offer. It can be, and is, just about everything in between.

Data warehousing is a natural ally of data mining. Data mining seeks to find actionable patterns in data and therefore has a firm requirement for clean and consistent data. Much of the effort behind data mining endeavors is in the steps of identifying, acquiring, and cleansing the data. A well-designed corporate data warehouse is a valuable ally. Better yet, if the design of the data warehouse includes support for data mining applications, the warehouse facilitates and catalyzes data mining efforts. The two technologies work together to deliver value. Data mining fulfills some of the promise of data warehousing by converting an essentially inert source of clean and consistent data into actionable information.

There is also a technological component to this relationship. Apart from the ability of users to run multiple jobs at the same time, most software, including data mining and statistical software, does not take advantage of the multiple processors and multiple disks available on the fastest servers. Relational database management systems (RDBMS), the heart of most data warehouses, are parallel-enabled and can take advantage of all of a system's resources for processing a single query. Even more importantly, users do not need to be aware of this fact, since the interface, some variant on SQL, remains the same. A database running on a powerful server can be a powerful asset for processing large amounts of data, as is the case when summarizing transactions at the customer level.
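As a rough illustration of summarizing transactions at the customer level through SQL, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and sample rows are hypothetical and are not taken from the text. SQLite is used only because it needs no setup; it is not a parallel database, but the same query could be submitted unchanged to a parallel-enabled RDBMS.

```python
import sqlite3

# Hypothetical transaction table; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (customer_id INTEGER, tx_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        (1, "2003-01-05", 20.0),
        (1, "2003-02-11", 35.5),
        (2, "2003-01-20", 12.0),
        (2, "2003-03-02", 8.0),
        (2, "2003-03-15", 40.0),
    ],
)

# One output row per customer: a customer-level summary of the detail rows.
SUMMARY_SQL = """
    SELECT customer_id,
           COUNT(*)     AS num_transactions,
           SUM(amount)  AS total_amount,
           MIN(tx_date) AS first_tx,
           MAX(tx_date) AS last_tx
    FROM transactions
    GROUP BY customer_id
    ORDER BY customer_id
"""
customer_summary = conn.execute(SUMMARY_SQL).fetchall()
for row in customer_summary:
    print(row)
```

The point of the sketch is that the person writing the query states only what summary is wanted; how the work is spread across processors and disks is left to the database.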

As useful as data warehousing is, such systems are not a prerequisite for data mining and data analysis. Statisticians, actuaries, and analysts have been using statistical packages for decades, and achieving good results with their analyses, without the benefit of a well-designed centralized warehouse. This process can continue to be useful. Even so, because of the need for consistent, accurate, and timely data to support business units, data warehousing has become increasingly important for any kind of decision support or information analysis.

This chapter is focused on data warehousing as part of the virtuous cycle of data mining, as a valuable and often critical component in supporting all four phases of the cycle: identifying opportunities, analyzing data, applying information, and measuring results. It is not a how-to guide for building a warehouse; there are many books already devoted to that subject, and we heartily recommend Ralph Kimball's The Data Warehouse Toolkit (Wiley, 2002) and Bill Inmon's Building the Data Warehouse (Wiley, 2002).



The chapter starts with a discussion of the different types of data that are available, and then discusses data warehousing requirements from the perspective of data mining. It then shows a typical data warehousing architecture and variants on this theme. The chapter next turns to Online Analytic Processing (OLAP), an alternative approach to the normalized data warehouse. The final discussion covers the role of data mining in these environments. As with much that has to do with data mining, however, the place to start is with data.

The Architecture of Data

There are many different flavors of information represented on computers. Different levels of data represent different types of abstraction, as shown in Figure 15.1.

[Figure: pyramid of data levels, listed from most abstract (top) to least abstract (bottom). Axis label: Data Size.]

Business rules: what's been learned from the data

Metadata: logical model and mappings to physical layout and sources

Database schema: physical layout of the data, tables, fields, indexes, types

Summary data (operational and decision-support): summaries by who, what, where, when

Operational (transaction) data: who, what, where, and when

Figure 15.1 A hierarchy of data and its descriptions helps users navigate around a data warehouse. As data gets more abstract, it generally gets less voluminous.



The level of abstraction is an important characteristic of data used in data mining. In a well-designed system, it should be possible to drill down through these levels of abstraction to obtain the base data that supports a summarization or a business rule. The lower levels of the pyramid are more voluminous and tend to be the stuff of databases. The upper levels are smaller and tend to be the stuff of computer programs. All these levels are important, because we do not want to analyze the detailed data merely to produce what should already be known.
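The idea of drilling down from a summary to the base data that supports it can be sketched in a few lines of Python. The records, field layout, and function names below are hypothetical and chosen only for illustration; a real warehouse would do this with queries against detail tables.

```python
from collections import defaultdict

# Hypothetical base-level records: (region, product, amount).
transactions = [
    ("East", "widget", 10.0),
    ("East", "gadget", 25.0),
    ("West", "widget", 7.5),
    ("East", "widget", 5.0),
]

def summarize(rows):
    """Roll the detail rows up to one total per region."""
    totals = defaultdict(float)
    for region, _product, amount in rows:
        totals[region] += amount
    return dict(totals)

def drill_down(rows, region):
    """Return the detail rows that support one region's summary figure."""
    return [row for row in rows if row[0] == region]

summary = summarize(transactions)
east_detail = drill_down(transactions, "East")

# The summary figure should be reproducible from the rows beneath it.
assert summary["East"] == sum(amount for _, _, amount in east_detail)
```

The final check expresses the property the text asks for: every summarized number can be traced back to, and recomputed from, the detail that produced it.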

Transaction Data, the Base Level

Every product purchased by a customer, every bank transaction, every Web page visit, every credit card purchase, every flight segment, every package, every telephone call is recorded in some operational system. Every time a new customer opens an account or pays a bill, there should be a record of the transaction somewhere, providing information about who, what, where, when, and how much. Such transaction-level data is the raw material for understanding customer behavior. It is the eyes and ears of the enterprise.

Unfortunately, over time operational systems change because of changing business needs. Fields may change their meaning over time. Important data is simply rolled off and deleted. Change is constant, in response to the introduction of new products, expanding numbers of customers, acquisitions, reorganizations, and new technology. The fact that operational data changes over time has to be part of any robust data warehousing approach.

Data warehouses need to store data so the information is compatible over time, even when product lines change, when markets change, when customer segments change, when business organizations change. Otherwise, data mining is likely to pick up patterns that represent these changes, rather than underlying customer behavior.
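One possible way to keep information compatible over time is to record when each operational code carried which meaning, and to translate codes into stable keys as data is loaded. The sketch below assumes this approach; it is not a method described in the text, and the codes, dates, and key names are invented for illustration.

```python
from datetime import date

# Hypothetical history of what each operational code meant, and when.
# Each entry: (code, valid_from, valid_to_exclusive, stable_warehouse_key)
CODE_HISTORY = [
    ("A1", date(2000, 1, 1), date(2002, 1, 1), "basic_checking"),
    ("A1", date(2002, 1, 1), date(9999, 12, 31), "premium_checking"),
    ("B7", date(2000, 1, 1), date(9999, 12, 31), "savings"),
]

def stable_key(code, as_of):
    """Translate an operational code into the meaning it had on a given date."""
    for hist_code, valid_from, valid_to, key in CODE_HISTORY:
        if hist_code == code and valid_from <= as_of < valid_to:
            return key
    raise KeyError(f"no mapping for code {code!r} on {as_of}")

# The same code resolves to different products before and after the change.
old = stable_key("A1", date(2001, 6, 30))
new = stable_key("A1", date(2003, 6, 30))
print(old, new)
```

With a translation step of this kind, a reused code shows up in the warehouse as two distinct products rather than as an apparent shift in customer behavior.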

The amount of data gathered from transactional systems can be enormous. A single fast food restaurant sells hundreds of thousands of meals over the course of a year. A chain of supermarkets can have tens or hundreds of thousands of transactions a day. A large bank processes millions of checks and credit card purchases a day. Large Web sites have millions of hits each day (in 2003, Google was already handling over 250 million searches each day). A telephone company has tens or even hundreds of millions of completed calls every day. A large ad server on the Web keeps track of over a billion ad views every day. Even with the price of disk space falling, storing all these transactions requires a significant investment. For reference, it is worth remembering that a day has 86,400 seconds, so a million transactions a day is really an average of about 12 transactions per second all day (and 250 million searches


