Промышленный лизинг Промышленный лизинг  Методички 

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 [ 115 ] 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222


Link Analysis

The international route maps of British Airways and Air France offer more than just trip planning help. They also provide insights into the history and politics of their respective homelands and of lost empires. A traveler bound from New York to Mombasa changes planes at Heathrow; one bound for Abidjan changes at Charles de Gaul. The international route maps show how much information can be gained from knowing how things are connected.

Which Web sites link to which other ones? Who calls whom on the telephone? Which physicians prescribe which drugs to which patients? These relationships are all visible in data, and they all contain a wealth of information that most data mining techniques are not able to take direct advantage of. In our ever-more-connected world (where, it has been claimed, there are no more than six degrees of separation between any two people on the planet), understanding relationships and connections is critical. Link analysis is the data mining technique that addresses this need.

Link analysis is based on a branch of mathematics called graph theory. This chapter reviews the key notions of graphs, then shows how link analysis has been applied to solve real problems. Link analysis is not applicable to all types of data nor can it solve all types of problems. However, when it can be used, it



often yields very insightful and actionable results. Some areas where it has yielded good results are:

Identifying authoritative sources of information on the World Wide Web by analyzing the links between its pages

Analyzing telephone call patterns to identify particular market segments such as people working from home

Understanding physician referral patterns; a referral is a relationship between two physicians, once again, naturally susceptible to link analysis

Even where links are explicitly recorded, assembling them into a useful graph can be a data-processing challenge. Links between Web pages are encoded in the HTML of the pages themselves. Links between telephones are recorded in call detail records. Neither of these data sources is useful for link analysis without considerable preprocessing, however. In other cases, the links are implicit and part of the data mining challenge is to recognize them.

The chapter begins with a brief introduction to graph theory and some of the classic problems that it has been used to solve. It then moves on to applications in data mining such as search engine rankings and analysis of call detail records.

Graphs are an abstraction developed specifically to represent relationships. They have proven very useful in both mathematics and computer science for developing algorithms that exploit these relationships. Fortunately, graphs are quite intuitive, and there is a wealth of examples that illustrate how to take advantage of them.

A graph consists of two distinct parts:

Nodes (sometimes called vertices) are the things in the graph that have relationships. These have names and often have additional useful properties.

Edges are pairs of nodes connected by a relationship. An edge is represented by the two nodes that it connects, so (A, B) or AB represents the edge that connects A and B. An edge might also have a weight in a weighted graph.

Figure 10.1 illustrates two graphs. The graph on the left has four nodes connected by six edges and has the property that there is an edge between every pair of nodes. Such a graph is said to be fully connected. It could be representing daily flights between Atlanta, New York, Cincinnati, and Salt Lake City on an airline where these four cities serve as regional hubs. It could also represent


Basic Graph Theory

Team-Fly®



four people, all of whom know each other, or four mutually related leads for a criminal investigation. The graph on the right has one node in the center connected to four other nodes. This could represent daily flights connecting Atlanta to Birmingham, Greenville, Charlotte, and Savannah on an airline that serves the Southeast from a hub in Atlanta, or a restaurant frequented by four credit card customers. The graph itself captures the information about what is connected to what. Without any labels, it can describe many different situations. This is the power of abstraction.

A few points of terminology about graphs. Because graphs are so useful for visualizing relationships, it is nice when the nodes and edges can be drawn with no intersecting edges. The graphs in Figure 10.2 have this property. They are planar graphs, since they can be drawn on a sheet of paper (what mathematicians call a plane) without having any edges intersect. Figure 10.2 shows two graphs that cannot be drawn without having at least two edges cross. There is, in fact, a theorem in graph theory that says that if a graph is nonpla-nar, then lurking inside it is one of the two previously described graphs.

When a path exists between any two nodes in a graph, the graph is said to be connected. For the rest of this chapter, we assume that all graphs are connected, unless otherwise specified. A path, as its name implies, is an ordered sequence of nodes connected by edges. Consider a graph where each node represents a city, and the edges are flights between pairs of cities. On such a graph, a node is a city and an edge is a flight segment-two cities that are connected by a nonstop flight. A path is an itinerary of flight segments that go from one city to another, such as from Greenville, South Carolina to Atlanta, from Atlanta to Chicago, and from Chicago to Peoria.


A fully connected graph with four nodes and six edges. In a fully connected graph, there is an edge between every pair

A graph with five nodes and four edges.

of nodes.

Figure 10.1 Two examples of graphs.



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 [ 115 ] 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222