Промышленный лизинг Промышленный лизинг  Методички 

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 [ 94 ] 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222

obvious geometric interpretation. The ideas of MBR are illustrated through a case study showing how MBR has been used to attach keywords to news stories. The chapter then looks at collaborative filtering, a popular approach to making recommendations, especially on the Web. Collaborative filtering is also based on nearest neighbors, but with a slight twist-instead of grouping restaurants or movies into neighborhoods, it groups the people recommending them.

Memory Based Reasoning

The human ability to reason from experience depends on the ability to recognize appropriate examples from the past. A doctor diagnosing diseases, a claims analyst flagging fraudulent insurance claims, and a mushroom hunter spotting Morels are all following a similar process. Each first identifies similar cases from experience and then applies what their knowledge of those cases to the problem at hand. This is the essence of memory-based reasoning. A database of known records is searched to find preclassified records similar to a new record. These neighbors are used for classification and estimation. Applications of MBR span many areas:

Fraud detection. New cases of fraud are likely to be similar to known cases. MBR can find and flag them for further investigation.

Customer response prediction. The next customers likely to respond to an offer are probably similar to previous customers who have responded. MBR can easily identify the next likely customers.

Medical treatments. The most effective treatment for a given patient is probably the treatment that resulted in the best outcomes for similar patients. MBR can find the treatment that produces the best outcome.

Classifying responses. Free-text responses, such as those on the U.S. Census form for occupation and industry or complaints coming from customers, need to be classified into a fixed set of codes. MBR can process the free-text and assign the codes.

One of the strengths of MBR is its ability to use data as is. Unlike other data mining techniques, it does not care about the format of the records. It only cares about the existence of two operations: A distance function capable of calculating a distance between any two records and a combination function capable of combining results from several neighbors to arrive at an answer. These functions are readily defined for many kinds of records, including records with complex or unusual data types such as geographic locations, images, and free text that



are usually difficult to handle with other analysis techniques. A case study later in the chapter shows MBRs successful application to the classification of news stories-an example that takes advantage of the full text of the news story to assign subject codes.

Another strength of MBR is its ability to adapt. Merely incorporating new data into the historical database makes it possible for MBR to learn about new categories and new definitions of old ones. MBR also produces good results without a long period devoted to training or to massaging incoming data into the right format.

These advantages come at a cost. MBR tends to be a resource hog since a large amount of historical data must be readily available for finding neighbors. Classifying new records can require processing all the historical records to find the most similar neighbors-a more time-consuming process than applying an already-trained neural network or an already-built decision tree. There is also the challenge of finding good distance and combination functions, which often requires a bit of trial and error and intuition.

Example: Using MBR to Estimate Rents in Tuxedo, New York

The purpose of this example is to illustrate how MBR works by estimating the cost of renting an apartment in the target town by combining data on rents in several similar towns-its nearest neighbors.

MBR works by first identifying neighbors and then combining information from them. Figure 8.1 illustrates the first of these steps. The goal is to make predictions about the town of Tuxedo in Orange County, New York by looking at its neighbors. Not its geographic neighbors along the Hudson and Delaware rivers, rather its neighbors based on descriptive variables-in this case, population and median home value. The scatter plot shows New York towns arranged by these two variables. Figure 8.1 shows that measured this way, Brooklyn and Queens are close neighbors, and both are far from Manhattan. Although Manhattan is nearly as populous as Brooklyn and Queens, its home prices put it in a class by itself.

TIP Neighborhoods can be found in many dimensions. The choice of dimensions determines which records are close to one another. For some purposes, geographic proximity might be important. For other purposes home price or average lot size or population density might be more important. The choice of dimensions and the choice of a distance metric are crucial to any nearest-neighbor approach.



The first stage of MBR finds the closest neighbor on the scatter plot shown in Figure 8.1. Then the next closest neighbor is found, and so on until the desired number are available. In this case, the number of neighbors is two and the nearest ones turn out to be Shelter Island (which really is an island) way out by the tip of Long Islands North Fork, and North Salem, a town in Northern Westchester near the Connecticut border. These towns fall at about the middle of a list sorted by population and near the top of one sorted by home value. Although they are many miles apart, along these two dimensions, Shelter Island and North Salem are very similar to Tuxedo.

Once the neighbors have been located, the next step is to combine information from the neighbors to infer something about the target. For this example, the goal is to estimate the cost of renting a house in Tuxedo. There is more than one reasonable way to combine data from the neighbors. The census provides information on rents in two forms. Table 8.1 shows what the 2000 census reports about rents in the two towns selected as neighbors. For each town, there is a count of the number of households paying rent in each of several price bands as well as the median rent for each town. The challenge is to figure out how best to use this data to characterize rents in the neighbors and then how to combine information from the neighbors to come up with an estimate that characterizes rents in Tuxedo in the same way.

Tuxedos nearest neighbors, the towns of North Salem and Shelter Island, have quite different distributions of rents even though the median rents are similar. In Shelter Island, a plurality of homes, 34.6 percent, rent in the $500 to $750 range. In the town of North Salem, the largest number of homes, 30.9 percent, rent in the $1,000 to $1,500 range. Furthermore, while only 3.1 percent of homes in Shelter Island rent for over $1,500, 24.2 percent of homes in North Salem do. On the other hand, at $804, the median rent in Shelter Island is above the $750 ceiling of the most common range, while the median rent in North Salem, $1,150, is below the floor of the most common range for that town. If the average rent were available, it too would be a good candidate for characterizing the rents in the various towns.

Table 8.1 The Neighbors

TOWN

POPULATION

MEDIAN RENT

RENT RENT

<$500 $750 (%) (%)

RENT $1500

RENT $1000

RENT

>$1500

NO RENT

Shelter Island

2228

$804

34.6

31.4

10.7

North Salem

5173

$1150

10.2

21.6

30.9

24.2

10.2



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 [ 94 ] 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222