Промышленный лизинг Промышленный лизинг  Методички 

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 [ 112 ] 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222

For instance, in the grocery store that sells orange juice, milk, detergent, soda, and window cleaner, the first step calculates the counts for each of these items. During the second step, the following counts are created:

Milk and detergent, milk and soda, milk and cleaner

Detergent and soda, detergent and cleaner

Soda and cleaner

This is a total of 10 pairs of items. The third pass takes all combinations of three items and so on. Of course, each of these stages may require a separate pass through the data or multiple stages can be combined into a single pass by considering different numbers of combinations at the same time.

Although it is not obvious when there are just five items, increasing the number of items in the combinations requires exponentially more computation. This results in exponentially growing run times-and long, long waits when considering combinations with more than three or four items. The solution is pruning. Pruning is a technique for reducing the number of items and combinations of items being considered at each step. At each stage, the algorithm throws out a certain number of combinations that do not meet some threshold criterion.

The most common pruning threshold is called minimum support pruning. Support refers to the number of transactions in the database where the rule holds. Minimum support pruning requires that a rule hold on a minimum number of transactions. For instance, if there are one million transactions and the minimum support is 1 percent, then only rules supported by 10,000 transactions are of interest. This makes sense, because the purpose of generating these rules is to pursue some sort of action-such as striking a deal with Mattel (the makers of Barbie dolls) to make a candy-bar-eating doll-and the action must affect enough transactions to be worthwhile.

The minimum support constraint has a cascading effect. Consider a rule with four items in it:

if A, B, and C, then D.

Using minimum support pruning, this rule has to be true on at least 10,000 transactions in the data. It follows that:

A must appear in at least 10,000 transactions, and, B must appear in at least 10,000 transactions, and, C must appear in at least 10,000 transactions, and, D must appear in at least 10,000 transactions.

Team-Fly®



In other words, minimum support pruning eliminates items that do not appear in enough transactions. The threshold criterion applies to each step in the algorithm. The minimum threshold also implies that:

A and B must appear together in at least 10,000 transactions, and, A and C must appear together in at least 10,000 transactions, and, A and D must appear together in at least 10,000 transactions, and so on.

Each step of the calculation of the co-occurrence table can eliminate combinations of items that do not meet the threshold, reducing its size and the number of combinations to consider during the next pass.

Figure 9.11 is an example of how the calculation takes place. In this example, choosing a minimum support level of 10 percent would eliminate all the combinations with three items-and their associated rules-from consideration. This is an example where pruning does not have an effect on the best rule since the best rule has only two items. In the case of pizza, these toppings are all fairly common, so are not pruned individually. If anchovies were included in the analysis-and there are only 15 pizzas containing them out of the 2,000- then a minimum support of 10 percent, or even 1 percent, would eliminate anchovies during the first pass.

The best choice for minimum support depends on the data and the situation. It is also possible to vary the minimum support as the algorithm progresses. For instance, using different levels at different stages you can find uncommon combinations of common items (by decreasing the support level for successive steps) or relatively common combinations of uncommon items (by increasing the support level).

The Problem of Big Data

A typical fast food restaurant offers several dozen items on its menu, say 100. To use probabilities to generate association rules, counts have to be calculated for each combination of items. The number of combinations of a given size tends to grow exponentially. A combination with three items might be a small fries, cheeseburger, and medium Diet Coke. On a menu with 100 items, how many combinations are there with three different menu items? There are 161,700! This calculation is based on the binomial formula On the other hand, a typical supermarket has at least 10,000 different items in stock, and more typically 20,000 or 30,000.



A pizza restaurant has sold 2000 pizzas, of which:

100 are mushroom only, 150 are pepperoni, 200 are extra cheese

400 are mushroom and pepperoni, 300 are mushroom and extra cheese, 200 are pepperoni and extra cheese 100 are mushroom, pepperoni, and extra cheese. 550 have no extra toppings.

We need to calculate the probabilities for all possible combinations of items.

.4

<s P. J00 + 400 + 300 + 100 = 900 pizzas or 45%

л - Mushroom and pepperoni \ The w°rks

Just mushroom Mushroom and extra cheese

150 + 400 + 200 + 100 = 850 pizzas or 42.5%

p-j 200 + 300 + 200 + 100 = 800 pizzas or 40%

- °o°q, 400 + 100 = 500 pizzas or 25%

300 + 100 = 400 pizzas or 20% 200 + 100 = 300 pizzas or 15%

100 pizzas or 5%

There are three rules with all three items:

Support = 5%

Confidence = 5% divided by 25% = 0.2

Lift = 20%(100/500) divided by 40%(800/2000) = 0.5

Support = 5%

Confidence = 5% divided by 20% = 0.25

Lift = 25%(100/400) divided by 42.5%(850/2000) = 0.588

The best rule has only two items:


Support = 5%

Confidence = 5% divided by 15% = 0.333

Lift = 33.3%(100/300) divided by 45%(900/2000) = 0.74

Support = 25%

Confidence = 25% divided by 42.5% = 0.588

Lift = 55.6%(500/900) divided by 43.5%(200/850) = 1.31

Figure 9.11 This example shows how to count up the frequencies on pizza sales for market basket analysis.

Calculating the support, confidence, and lift quickly gets out of hand as the number of items in the combinations grows. There are almost 50 million possible combinations of two items in the grocery store and over 100 billion combinations of three items. Although computers are getting more powerful and



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 [ 112 ] 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222