Промышленный лизинг Промышленный лизинг  Методички 

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 [ 118 ] 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222

Directed Graphs

The graphs discussed so far are undirected. In undirected graphs, the edges are like expressways between nodes: they go in both directions. In a directed graph, the edges are like one-way roads. An edge going from A to B is distinct from an edge going from B to A. A directed edge from A to B is an outgoing edge of A and an incoming edge of B.

Directed graphs are a powerful way of representing data:

Flight segments that connect a set of cities

Hyperlinks between Web pages

Telephone calling patterns

State transition diagrams

Two types of nodes are of particular interest in directed graphs. All the edges connected to a source node are outgoing edges. Since there are no incoming edges, no path exists from any other node in the graph to any of the source nodes. When all the edges on a node are incoming edges, the node is called a sink node. The existence of source nodes and sink nodes is an important difference between directed graphs and their undirected cousins.

An important property of directed graphs is whether the graph contains any paths that start and end at the same vertex. Such a path is called a cycle, implying that the path could repeat itself endlessly: ABCABCABC and so on. If a directed graph contains at least one cycle, it is called cyclic. Cycles in a graph of flight segments, for instance, might be the path of a single airplane. In a call graph, members of a cycle call each other-these are good candidates for a friends and family-style promotion, where the whole group gets a discount, or for marketing conference call services.

Detecting Cycles in a Graph

There is a simple algorithm to detect whether a directed graph has any cycles. This algorithm starts with the observation that if a directed graph has no sink vertices, and it has at least one edge, then any path can be extended arbitrarily. Without any sink vertices, the terminating node of a path is always connected to another node, so the path can be extended by appending that node. Similarly, if the graph has no source nodes, then we can always prepend a node to the beginning of the path. Once the path contains more nodes than there are nodes in the graph, we know that the path must visit at least one node twice. Call this node X. The portion of the path between the first X and the second X in the path is a cycle, so the graph is cyclic.

Now consider the case when a graph has one or more source nodes and one or more sink nodes. It is pretty obvious that source nodes and sink nodes



cannot be part of a cycle. Removing the source and sink nodes from the graph, along with all their edges, does not affect whether the graph is cyclic. If the resulting graph has no sink nodes or no source nodes, then it contains a cycle, as just shown. The process of removing sink nodes, source nodes, and their edges is repeated until one of the following occurs:

No more edges or no more nodes are left. In this case, the graph has no cycles.

Some edges remain but there are no source or sink nodes. In this case, the graph is cyclic.

If no cycles exist, then the graph is called an acyclic graph. These graphs are useful for describing dependencies or one-way relationships between things. For instance, different products often belong to nested hierarchies that can be represented by acyclic graphs. The decision trees described in Chapter 6 are another example.

In an acyclic graph, any two nodes have a well-defined precedence relationship with each other. If node A precedes node B in some path that contains both A and B, then A will precede B in all paths containing both A and B (otherwise there would be a cycle). In this case, we say that A is a predecessor of B and that B is a successor of A. If no paths contain both A and B, then A and B are disjoint. This strict ordering can be an important property of the nodes and is sometimes useful for data mining purposes.

A Familiar Application of Link Analysis

Most readers of this book have probably used the Google search engine. Its phenomenal popularity stems from its ability to help people find reasonably good material on pretty much any subject. This feat is accomplished through link analysis.

The World Wide Web is a huge directed graph. The nodes are Web pages and the edges are the hyperlinks between them. Special programs called spiders or web crawlers are continually traversing these links to update maps of the huge directed graph that is the web. Some of these spiders simply index the content of Web pages for use by purely text-based search engines. Others record the Webs global structure as a directed graph that can be used for analysis.

Once upon a time, search engines analyzed only the nodes of this graph. Text from a query was compared with text from the Web pages using techniques similar to those described in Chapter 8. Googles approach (which has now been adopted by other search engines) is to make use of the information encoded in the edges of the graph as well as the information found in the nodes.



The Kleinberg Algorithm

Some Web sites or magazine articles are more interesting than others even if they are devoted to the same topic. This simple idea is easy to grasp but hard to explain to a computer. So when a search is performed on a topic that many people write about, it is hard to find the most interesting or authoritative documents in the huge collection that satisfies the search criteria.

Professor Jon Kleinberg of Cornell University came up with one widely adopted technique for addressing this problem. His approach takes advantage of the insight that in creating a link from one site to another, a human being is making a judgment about the value of the site being linked to. Each link to another site is effectively a recommendation of that site. Cumulatively, the independent judgments of many Web site designers who all decide to provide links to the same target are conferring authority on that target. Furthermore, the reliability of the sites making the link can be judged according to the authoritativeness of the sites they link to. The recommendations of a site with many other good recommendations can be given more weight in determining the authority of another.

In Kleinbergs terminology, a page that links to many authorities is a hub; a page that is linked to by many hubs is an authority. These ideas are illustrated in Figure 10.7 The two concepts can be used together to tell the difference between authority and mere popularity. At first glance, it might seem that a good method for finding authoritative Web sites would be to rank them by the number of unrelated sites linking to them. The problem with this technique is that any time the topic is mentioned, even in passing, by a popular site (one with many inbound links), it will be ranked higher than a site that is much more authoritative on the particular subject though less popular in general. The solution is to rank pages, not by the total number of links pointing to them, but by the number of subject-related hubs that point to them. Google.com uses a modified and enhanced version of the basic Kleinberg algorithm described here.

A search based on link analysis begins with an ordinary text-based search. This initial search provides a pool of pages (often a couple hundred) with which to start the process. It is quite likely that the set of documents returned by such a search does not include the documents that a human reader would judge to be the most authoritative sources on the topic. That is because the most authoritative sources on a topic are not necessarily the ones that use the words in the search string most frequently. Kleinberg uses the example of a search on the keyword Harvard. Most people would agree that www. harvard.edu is one of the most authoritative sites on this topic, but in a purely content-based analysis, it does not stand out among the more than a million Web pages containing the word Harvard so it is quite likely that a text-based search will not return the universitys own Web site among its top results. It is very likely, however, that at least a few of the documents returned will contain

Team-Fly®



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 [ 118 ] 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222