
Applying MBR

This section explains how MBR was used to assign codes to news stories for a news retrieval service. The important steps were:

1. Choosing the training set

2. Determining the distance function

3. Choosing the number of nearest neighbors

4. Determining the combination function

The following sections discuss each of these steps in turn.

Choosing the Training Set

The training set consisted of 49,652 news stories, provided by the news retrieval service for this purpose. These stories came from about three months of news and from almost 100 different sources. Each story contained, on average, 2,700 words and had eight codes assigned to it. The training set was not specially created, so the frequency of codes in the training set varied a great deal, mimicking the overall frequency of codes in news stories in general. Although this training set yielded good results, a better-constructed training set with more examples of the less common codes would probably have performed even better.

Choosing the Distance Function

The next step is choosing the distance function. In this case, a distance function already existed, based on a notion called relevance feedback that measures the similarity of two documents based on the words they contain. Relevance feedback, which is described more fully in the sidebar, was originally designed to return documents similar to a given document, as a way of refining searches. The most similar documents are the neighbors used for MBR.

Choosing the Combination Function

The next decision is the combination function. Assigning classification codes to news stories is a bit different from most classification problems. Most classification problems are looking for the single best solution. However, news stories can have multiple codes, even from the same category. The ability to adapt MBR to this problem highlights its flexibility.



USING RELEVANCE FEEDBACK TO CREATE A DISTANCE FUNCTION

Relevance feedback is a powerful technique that allows users to refine searches on text databases by asking the database to return documents similar to one they already have. (Hubs and authorities, another method for improving search results on hyperlinked web pages, is described in Chapter 10.) In the course of doing this, the text database scores all the other documents in the database and returns those that are most similar, along with a measure of similarity. This is the relevance feedback score, which can be used as the basis for a distance measure for MBR.

In the case study, the calculation of the relevance feedback score went as follows:

1. Common, non-content-bearing words, such as it, and, and of, were removed from the text of all stories in the training set. A total of 368 words in this category were identified and removed.

2. The next most common words, accounting for 20 percent of the words in the database, were removed from the text. Because these words are so common, they provide little information to distinguish between documents.

3. The remaining words were collected into a dictionary of searchable terms. Each was assigned a weight inversely proportional to its frequency in the database. The particular weight was the negative of the base 2 log of the term's frequency in the training set.

4. Capitalized word pairs, such as United States and New Mexico, were identified (automatically) and included in the dictionary of searchable terms.

5. To calculate the relevance feedback score for two stories, the weights of the searchable terms in both stories were added together. The algorithm used for this case study included a bonus when searchable terms appeared in close proximity in both stories.
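The scoring steps above might be sketched as follows. This is an illustrative reconstruction, not the production system: the mini-corpus, stopword list, and tokenizer are stand-ins (the case study used 368 stopwords, trimmed the next 20 percent of common words, and added capitalized word pairs and a proximity bonus, all omitted here).

```python
from collections import Counter
from math import log2

# Hypothetical mini-corpus standing in for the 49,652-story training set.
corpus = [
    "trade talks between japan and the united states resumed today",
    "california regulators approved the japan trade agreement",
    "markets in the far east rallied on trade news",
]

# Step 1: common, non-content-bearing words (368 in the case study).
STOPWORDS = {"the", "and", "of", "in", "on", "between", "today"}

def terms(story):
    """Tokenize a story, dropping the non-content-bearing words."""
    return [w for w in story.lower().split() if w not in STOPWORDS]

# Step 3: weight each searchable term by the negative base-2 log of its
# frequency in the training set, so rarer terms count for more.
all_terms = [t for story in corpus for t in terms(story)]
counts = Counter(all_terms)
total = len(all_terms)
weight = {t: -log2(counts[t] / total) for t in counts}

def score(a, b):
    """Step 5: sum the weights of searchable terms appearing in both stories."""
    shared = set(terms(a)) & set(terms(b))
    return sum(weight.get(t, 0.0) for t in shared)
```

Two stories sharing rare terms thus score higher than two sharing only common ones, and a story always scores at least as high against itself as against any other story.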

The relevance feedback score is an example of the adaptation of an already-existing function for use as a distance function. However, the score itself does not quite fit the definition of a distance function. In particular, a score of 0 indicates that two stories have no words in common, instead of implying that the stories are identical. The following transformation converts the relevance feedback score to a function suitable for measuring the distance between news stories:

d_classification(A, B) = 1 - score(A, B) / score(A, A)

This is the function used to find the nearest neighbors. Strictly speaking, even this is not a true distance function, because d(A,B) is not the same as d(B,A), but it works well enough.
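The transformation from score to distance is a one-liner; a minimal sketch follows, with a toy shared-word count standing in for the real relevance feedback score:

```python
def distance(a, b, score):
    """Convert a relevance feedback score into a distance-like measure:
    d(A, B) = 1 - score(A, B) / score(A, A).
    A story is at distance 0 from itself; a story sharing nothing with A
    is at distance 1. Note the asymmetry: d(A, B) != d(B, A) in general."""
    return 1.0 - score(a, b) / score(a, a)

# Toy score: the number of shared words (an unweighted stand-in
# for the term-weighted relevance feedback score).
def toy_score(a, b):
    return len(set(a.split()) & set(b.split()))

a = "japan trade talks resume"
b = "japan trade surplus grows"
print(distance(a, a, toy_score))  # 0.0 -- identical stories
print(distance(a, b, toy_score))  # 0.5 -- two of a's four words shared
```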



Table 8.3 Classified Neighbors of a Not-Yet-Classified Story

NEIGHBOR   DISTANCE   WEIGHT   CODES
1          0.076      0.924    R/FE, R/CA, R/CO
2          0.346      0.654    R/FE, R/JA, R/CA
3          0.369      0.631    R/FE, R/JA, R/MI
4          0.393      0.607    R/FE, R/JA, R/CA

The combination function used a weighted summation technique. Since the maximum distance was 1, the weight was simply one minus the distance, so weights are large for neighbors at small distances and small for neighbors at large distances. For example, suppose the four nearest neighbors of a story had the distances, weights, and region codes shown in Table 8.3.

The total score for a code was then the sum of the weights of the neighbors containing it; codes with scores below a certain threshold value were eliminated. For instance, the score for R/FE (the region code for the Far East) is the sum of the weights of neighbors 1, 2, 3, and 4, since all of them contain R/FE, yielding a score of 2.816. Table 8.4 shows the results for the five region codes contained by at least one of the four neighbors. For this example, a threshold of 1.0 leaves only three codes: R/CA, R/FE, and R/JA. The particular choice of threshold was based on experimenting with different values and is not important to understanding MBR.

Table 8.4 Code Scores for the Not-Yet-Classified Story

CODE   NEIGHBOR WEIGHTS                  SCORE
R/CA   0.924 + 0.607                     1.531
R/CO   0.924                             0.924
R/FE   0.924 + 0.654 + 0.631 + 0.607     2.816
R/JA   0.654 + 0.631 + 0.607             1.892
R/MI   0.631                             0.631
