Промышленный лизинг Промышленный лизинг  Методички 

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 [ 119 ] 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222

a link to Harvards home page or, failing that, that some page that points to one of the pages in the pool of pages will also point to www.harvard.edu.

An essential feature of Kleinbergs algorithm is that it does not simply take the pages returned by the initial text-based search and attempt to rank them; it uses them to construct the much larger pool of documents that point to or are pointed to by any of the documents in the root set. This larger pool contains much more global structure-structure that can be mined to determine which documents are considered to be most authoritative by the wide community of people who created the documents in the pool.

The Details: Finding Hubs and Authorities

Kleinbergs algorithm for identifying authoritative sources has three phases:

1. Creating the root set

2. Identifying the candidates

3. Ranking hubs and authorities

In the first phase, a root set of pages is formed using a text-based search engine to find pages containing the search string. In the second phase, this root set is expanded to include documents that point to or are pointed to by documents in the root set. This expanded set contains the candidates. In the third phase, which is iterative, the candidates are ranked according to their strength as hubs (documents that have links to many authoritative documents) and authorities (pages that have links from many authoritative hubs).

Creating the Root Set

The root set of documents is generated using a content-based search. As a first step, stop words (common words such as a, an, the, and so on) are removed from the original search string supplied. Then, depending on the particular content-based search strategy employed, the remaining search terms may undergo stemming. Stemming reduces words to their root form by removing plural forms and other endings due to verb conjugation, noun declension, and so on. Then, the Web index is searched for documents containing the terms in the search string. There are many variations on the details of how matches are evaluated, which is one reason why performing the same search on two text-based search engines yields different results. In any case, some combination of the number of matching terms, the rarity of the terms matched, and the number of times the search terms are mentioned in a document is used to give the indexed documents a score that determines their rank in relation to the query. The top n documents are used to establish the root set. A typical value for n is 200.



Identifying the Candidates

In the second phase, the root set is expanded to create the set of candidates. The candidate set includes all pages that any page in the root set links to along with a subset of the pages that link to any page in the root set. Locating pages that link to a particular target page is simple if the global structure of the Web is available as a directed graph. The same task can also be accomplished with an index-based text search using the URL of the target page as the search string.

The reason for using only a subset of the pages that link to each page in the root set is to guard against the possibility of an extremely popular site in the root set bringing in an unmanageable number of pages. There is also a parameter d that limits the number of pages that may be brought into the candidate set by any single member of the root set.

If more than d documents link to a particular document in the root set, then an arbitrary subset of d documents is brought into the candidate set. A typical value for d is 50. The candidate set typically ends up containing 1,000 to 5,000 documents.

This basic algorithm can be refined in various ways. One possible refinement, for instance, is to filter out any links from within the same domain, many of which are likely to be purely navigational. Another refinement is to allow a document in the root set to bring in at most m pages from the same site. This is to avoid being fooled by collusion between all the pages of a site to, for example, advertise the site of the Web site designer with a this site designed by link on every page.

Ranking Hubs and Authorities

The final phase is to divide the candidate pages into hubs and authorities and rank them according to their strength in those roles. This process also has the effect of grouping together pages that refer to the same meaning of a search term with multiple meanings-for instance, Madonna the rock star versus the Madonna and Child in art history or Jaguar the car versus jaguar the big cat. It also differentiates between authorities on the topic of interest and sites that are simply popular in general. Authoritative pages on the correct topic are not only linked to by many pages, they tend to be linked to by the same pages. It is these hub pages that tie together the authorities and distinguish them from unrelated but popular pages. Figure 10.7 illustrates the difference between hubs, authorities, and unrelated popular pages.

Hubs and authorities have a mutually reinforcing relationship. A strong hub is one that links to many strong authorities; a strong authority is one that is linked to by many strong hubs. The algorithm therefore proceeds iteratively, first adjusting the strength rating of the authorities based on the strengths of the hubs that link to them and then adjusting the strengths of the hubs based on the strength of the authorities to which they link.



Hubs Authorities Popular Site

Figure 10.7 Google uses link analysis to distinguish hubs, authorities, and popular pages.

For each page, there is a value A that measures its strength as an authority and a value H that measures its strength as a hub. Both these values are initialized to 1 for all pages. Then, the A value for each page is updated by adding up the H values of all the pages that link to them. The A values for each page are then normalized so that the sum of their squares is equal to 1. Then the H values are updated in a similar manner. The H value for each page is set to the sum of the A values of the pages it links to, and the new H values are normalized so that the sum of their squares is equal to 1. This process is repeated until an equilibrium set of A and H values is reached. The pages that end up with the highest H values are the strongest hubs; those with the strongest A values are the strongest authorities.

The authorities returned by this application of link analysis tend to be strong examples of one particular possible meaning of the search string. A search on a contentious topic such as gay marriage or Taiwan independence yields strong authorities on both sides because the global structure of the Web includes tightly connected subgraphs representing documents maintained by like-minded authors.



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 [ 119 ] 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222