

MEASURING THE EFFECTIVENESS OF ASSIGNING CODES: RECALL AND PRECISION

Recall and precision are two measurements that are useful for determining the appropriateness of a set of assigned codes or keywords. The case study on coding news stories, for instance, assigns many codes to news stories. Recall and precision can be used to evaluate these assignments.

Recall answers the question: How many of the correct codes did MBR assign to the story? It is the ratio of codes assigned by MBR that are correct (as verified by editors) to the total number of correct codes on the story. If MBR assigns all available codes to every story, then recall is 100 percent because the correct codes all get assigned, along with many other irrelevant codes. If MBR assigns no codes to any story, then recall is 0 percent.

Precision answers the question: How many of the codes assigned by MBR were correct? It is the ratio of correct codes assigned by MBR to the total number of codes assigned by MBR. Precision is 100 percent when MBR assigns only correct codes to a story. It is close to 0 percent when MBR assigns all codes to every story.

Neither recall nor precision alone gives the full story of how good the classification is. Ideally, we want 100 percent recall and 100 percent precision. Often, it is possible to trade off one against the other. For instance, using more neighbors increases recall but decreases precision, while raising the threshold increases precision but decreases recall. Table 8.5 gives some insight into these measurements for a few specific cases.
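The threshold trade-off can be illustrated with a minimal sketch. The scores, the set of correct codes, and the function names below are all hypothetical; the only assumption is that each candidate code carries a numeric score and is assigned when that score reaches the threshold.

```python
# Hypothetical relevance scores for candidate codes on a single story.
scores = {"A": 0.9, "B": 0.8, "E": 0.6, "C": 0.5, "F": 0.3, "G": 0.2}
# Hypothetical set of editor-verified correct codes for the same story.
correct = {"A", "B", "C", "D"}


def assign_codes(scores, threshold):
    """Keep every code whose score reaches the threshold."""
    return {code for code, score in scores.items() if score >= threshold}


def recall_precision(assigned, correct):
    """Return (recall, precision); an empty denominator gives 0.0 (assumed convention)."""
    hits = len(assigned & correct)
    recall = hits / len(correct) if correct else 0.0
    precision = hits / len(assigned) if assigned else 0.0
    return recall, precision


results = {}
for threshold in (0.1, 0.4, 0.7):
    assigned = assign_codes(scores, threshold)
    results[threshold] = recall_precision(assigned, correct)
    recall, precision = results[threshold]
    print(f"threshold={threshold}: recall={recall:.2f} precision={precision:.2f}")
```

With these made-up scores, each increase in the threshold drops low-scoring codes, so precision rises while recall can only stay the same or fall.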

Table 8.5 Examples of Recall and Precision

CODES BY MBR       CORRECT CODES   RECALL   PRECISION
A,B,C,D            A,B,C,D         100%     100%
[...]              A,B,C,D         [...]    100%
A,B,C,D,E,F,G,H    A,B,C,D         100%     50%
[...]              A,B,C,D         [...]    [...]
A,B,E,F            A,B,C,D         50%      50%
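The two definitions can be checked with a short sketch, applied here to three pairs of MBR codes and correct codes taken from Table 8.5. The function name and the handling of empty sets are assumptions made for illustration, not part of the case study.

```python
def recall_precision(assigned, correct):
    """Return (recall, precision) for the codes on one story."""
    assigned, correct = set(assigned), set(correct)
    hits = len(assigned & correct)  # codes that are both assigned and correct
    # Assumed convention: an empty denominator yields 0.0 rather than an error.
    recall = hits / len(correct) if correct else 0.0
    precision = hits / len(assigned) if assigned else 0.0
    return recall, precision


# (codes by MBR, correct codes), one letter per code
rows = [
    ("ABCD", "ABCD"),
    ("ABCDEFGH", "ABCD"),
    ("ABEF", "ABCD"),
]
for assigned, correct in rows:
    recall, precision = recall_precision(assigned, correct)
    print(f"{assigned:<10}{correct:<6}recall={recall:.0%}  precision={precision:.0%}")
```

Assigning all eight codes finds every correct one (recall 100 percent) at the cost of four irrelevant ones (precision 50 percent), which is the extreme case described in the text.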

The original codes assigned to the stories by individual editors had a recall of 83 percent and a precision of 88 percent with respect to the validated set of correct codes. For MBR, the recall was 80 percent and the precision 72 percent. However, these figures are averages across all categories. Table 8.6 breaks the results down by category, and MBR did significantly better in some of the categories.



Table 8.6 Recall and Precision Measurements by Code Category

CATEGORY         RECALL   PRECISION
Government       [...]    [...]
Industry         [...]    [...]
Market Sector    [...]    [...]
Product          [...]    [...]
Region           [...]    [...]
Subject          [...]    [...]

The variation in the results by category suggests that the original stories used for the training set may not have been coded consistently. The results from MBR can only be as good as the examples chosen for the training set. Even so, MBR performed as well as all but the most experienced editors.

Building a Distance Function One Field at a Time

It is easy to understand distance as a geometric concept, but how can distance be defined for records consisting of many different fields of different types? The answer is, one field at a time. Consider some sample records such as those shown in Table 8.7.

Figure 8.6 illustrates a scatter graph in three dimensions. The records are a bit complicated, with two numeric fields and one categorical. This example shows how to define field distance functions for each field, then combine them into a single record distance function that gives a distance between two records.

Table 8.7 Five Customers in a Marketing Database

RECNUM   AGE     GENDER   SALARY
1        [...]   female   $19,000
2        [...]   male     $64,000
3        [...]   male     $105,000
4        [...]   female   $55,000
5        [...]   male     $45,000



[Figure 8.6: scatter plot. Horizontal axis: Age, 25 to 60. Vertical axis: salary, $0 to $120,000. Legend: Female, Male.]

Figure 8.6 This scatter plot shows the five records from Table 8.7 in three dimensions (age, salary, and gender) and suggests that standard distance is a good metric for nearest neighbors.

The four most common distance functions for numeric fields are:

Absolute value of the difference: |A - B|

Square of the difference: (A - B)^2

Normalized absolute value: |A - B| / (maximum difference)

Absolute value of difference of standardized values: |(A - mean)/(standard deviation) - (B - mean)/(standard deviation)|, which is equivalent to |A - B| / (standard deviation)
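The four functions listed above can be sketched in a few lines. The function names are illustrative, and the use of the population standard deviation (statistics.pstdev) is an assumption, since the text does not say which form is meant. The salaries are those of Table 8.7.

```python
import statistics


def absolute_difference(a, b):
    return abs(a - b)


def squared_difference(a, b):
    return (a - b) ** 2


def normalized_absolute(a, b, max_difference):
    # max_difference is the largest difference between any two values of the field
    return abs(a - b) / max_difference


def standardized_difference(a, b, std_dev):
    # The mean cancels out, so only the standard deviation is needed.
    return abs(a - b) / std_dev


salaries = [19_000, 64_000, 105_000, 55_000, 45_000]  # from Table 8.7
max_diff = max(salaries) - min(salaries)
mean = statistics.mean(salaries)
std = statistics.pstdev(salaries)  # assumption: population standard deviation

a, b = salaries[0], salaries[1]
print(absolute_difference(a, b))
print(squared_difference(a, b))
print(normalized_absolute(a, b, max_diff))
# Long form of the standardized difference, for comparison with the short form
long_form = abs((a - mean) / std - (b - mean) / std)
print(long_form, standardized_difference(a, b, std))
```

The last line prints the standardized difference computed both ways; the two values agree up to floating-point rounding, which is the equivalence stated in the list.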

The advantage of the normalized absolute value is that it is always between 0 and 1. Since ages are much smaller than the salaries in this example, the normalized absolute value is a good choice for both fields, so that neither one dominates the record distance function (the difference of standardized values is also a good choice). For the ages, the distance matrix looks like Table 8.8.

Table 8.8 Distance Matrix Based on Ages of Customers

        1      2      3      4      5
1     0.00   0.96   1.00   0.24   0.72
2     0.96   0.00   0.04   0.72   0.24
3     1.00   0.04   0.00   0.76   0.28
4     0.24   0.72   0.76   0.00   0.48
5     0.72   0.24   0.28   0.48   0.00
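A matrix like Table 8.8 can be produced with a short sketch of the normalized absolute value. The ages below are assumed values, chosen only because their normalized differences reproduce the entries of Table 8.8; they should not be read as the source data. The salaries are those of Table 8.7, and the function name is illustrative.

```python
def normalized_distance_matrix(values):
    """Pairwise |a - b| / (maximum difference), rounded to two decimals."""
    # Note: fails with a division by zero if all values are identical.
    max_difference = max(values) - min(values)
    return [
        [round(abs(a - b) / max_difference, 2) for b in values]
        for a in values
    ]


ages = [27, 51, 52, 33, 45]  # ASSUMED ages, consistent with Table 8.8
salaries = [19_000, 64_000, 105_000, 55_000, 45_000]  # from Table 8.7

age_matrix = normalized_distance_matrix(ages)
salary_matrix = normalized_distance_matrix(salaries)

for row in age_matrix:
    print("  ".join(f"{d:.2f}" for d in row))
```

The same function applied to the salaries gives a matrix on the same 0 to 1 scale, which is what keeps either field from dominating when the field distances are later combined.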


