
CHAPTER

The Lure of Statistics: Data Mining Using Familiar Tools

For statisticians (and economists too), the term data mining has long had a pejorative meaning. Instead of finding useful patterns in large volumes of data, data mining has the connotation of searching for data to fit preconceived ideas. This is much like what politicians do around election time, when they search for data to show the success of their deeds. This is certainly not what we mean by data mining! This chapter is intended to bridge some of the gap between statisticians and data miners.

The two disciplines are very similar. Statisticians and data miners commonly use many of the same techniques, and statistical software vendors now include many of the techniques described in the next eight chapters in their software packages. Statistics developed as a discipline separate from mathematics over the past century and a half to help scientists make sense of observations and to design experiments that yield the reproducible and accurate results we associate with the scientific method. For almost all of this period, the issue was not too much data, but too little. Scientists had to figure out how to understand the world using data collected by hand in notebooks. These quantities were sometimes mistakenly recorded, illegible due to fading and smudged ink, and so on. Early statisticians were practical people who invented techniques to handle whatever problem was at hand. Statisticians are still practical people who use modern techniques as well as the tried and true.



What is remarkable and a testament to the founders of modern statistics is that techniques developed on tiny amounts of data have survived and still prove their utility. These techniques have proven their worth not only in the original domains but also in virtually all areas where data is collected, from agriculture to psychology to astronomy and even to business.

Perhaps the greatest statistician of the twentieth century was R. A. Fisher, considered by many to be the father of modern statistics. In the 1920s, before the invention of modern computers, he devised methods for designing and analyzing experiments. For two years, while living on a farm outside London, he collected various measurements of crop yields along with potential explanatory variables, such as the amount of rain, sun, and fertilizer. To understand what has an effect on crop yields, he invented new techniques (such as analysis of variance, or ANOVA) and performed perhaps a million calculations on the data he collected. Although twenty-first-century computer chips easily handle many millions of calculations in a second, each of Fisher's calculations required pulling a lever on a manual calculating machine. Results trickled in slowly over weeks and months, along with sore hands and calluses.

The advent of computing power has clearly simplified some aspects of analysis, although its bigger effect is probably the wealth of data produced. Our goal is no longer to extract every last iota of possible information from each rare datum. Our goal is instead to make sense of quantities of data so large that they are beyond the ability of our brains to comprehend in their raw format.

The purpose of this chapter is to present some key ideas from statistics that have proven to be useful tools for data mining. This is intended to be neither a thorough nor a comprehensive introduction to statistics; rather, it is an introduction to a handful of useful statistical techniques and ideas. These tools are shown by demonstration, rather than through mathematical proof.

The chapter starts with an introduction to what is probably the most important aspect of applied statistics: the skeptical attitude. It then discusses looking at data through a statistician's eye, introducing important concepts and terminology along the way. Sprinkled through the chapter are examples, especially for confidence intervals and the chi-square test. The final example, using the chi-square test to understand geography and channel, is an unusual application of the ideas presented in the chapter. The chapter ends with a brief discussion of some of the differences between data miners and statisticians, differences in attitude that are more a matter of degree than of substance.

Occam's Razor

William of Occam was a Franciscan monk born in a small English town in 1280, not only before modern statistics was invented, but also before the Renaissance and the printing press. He was an influential philosopher, theologian, and professor who expounded many ideas about many things, including church politics. As a monk, he was an ascetic who took his vow of poverty very seriously. He was also a fervent advocate of the power of reason, denying the existence of universal truths and espousing a modern philosophy that was quite different from the views of most of his contemporaries living in the Middle Ages.

What does William of Occam have to do with data mining? His name has become associated with a very simple idea. He himself explained it in Latin (the language of learning, even among the English, at the time): "Entia non sunt multiplicanda sine necessitate." In more familiar English, we would say "the simpler explanation is the preferable one" or, more colloquially, "Keep it simple, stupid." Any explanation should strive to reduce the number of causes to a bare minimum. This line of reasoning is referred to as Occam's Razor and is William of Occam's gift to data analysis.

The story of William of Occam had an interesting ending. Perhaps because of his focus on the power of reason, he also believed that the powers of the church should be separate from the powers of the state, and that the church should be confined to religious matters. This led to his opposition to the meddling of Pope John XXII in politics and eventually to his own excommunication. He died in Munich during an outbreak of the plague in 1349, leaving a legacy of clear and critical thinking for future generations.

The Null Hypothesis

Occam's Razor is very important for data mining and statistics, although statistics expresses the idea a bit differently. The null hypothesis is the assumption that differences among observations are due simply to chance. To give an example, consider a presidential poll that gives Candidate A 45 percent and Candidate B 47 percent. Because this data is from a poll, there are several sources of error, so the values are only approximate estimates of the popularity of each candidate. The layperson is inclined to ask, "Are these two values different?" The statistician phrases the question slightly differently: "What is the probability that these two values are really the same?"

Although the two questions are very similar, the statistician's version has a bit of an attitude. This attitude is that the difference may have no significance at all, and it is an example of using the null hypothesis. There is an observed difference of 2 percent in this example. However, this observed value may be explained by the particular sample of people who responded. Another sample may have a difference of 2 percent in the other direction, or may have a difference of 0 percent. All are reasonably likely results from a poll. Of course, if the preferences differed by 20 percent, then sampling variation is much less likely to be the cause. Such a large difference would greatly improve the confidence that one candidate is doing better than the other, and greatly reduce the probability of the null hypothesis being true.
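
The poll comparison can be made concrete with a small calculation. The following Python sketch is only an illustration under stated assumptions: the chapter does not give a sample size, so the 1,000 respondents are hypothetical, as are the 36 and 56 percent shares used to stand in for a 20 percent gap. The function name is likewise invented for this example. It uses a normal approximation to estimate how likely a gap at least as large as the observed one would be if the two candidates really had equal support.

```python
import math

def two_sided_p_value(share_a, share_b, sample_size):
    """Approximate p-value for the null hypothesis that two candidates
    measured in the same poll have equal support (normal approximation)."""
    diff = share_a - share_b
    # Variance of the difference between two shares taken from one sample,
    # estimated from the observed shares.
    variance = (share_a + share_b - diff ** 2) / sample_size
    z = abs(diff) / math.sqrt(variance)
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(z / math.sqrt(2))

# Hypothetical sample size of 1,000; the source does not state one.
small_gap = two_sided_p_value(0.45, 0.47, 1000)
# Hypothetical shares that differ by 20 percentage points.
large_gap = two_sided_p_value(0.36, 0.56, 1000)

print(f"2-point gap:  p = {small_gap:.2f}")
print(f"20-point gap: p = {large_gap:.2g}")
```

Under these assumptions, the 2 percent gap yields a large p-value, so chance alone remains a plausible explanation, while the 20 percent gap yields a p-value very close to zero. Strictly speaking, the p-value is the probability of seeing a gap this large when the null hypothesis holds, which is a slightly different quantity from the probability that the null hypothesis itself is true.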


