The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)

PDF FILE – This is the paper that started it all, the prototype of the Google search engine. This is back when they were working with 24 million pages. It is a necessary read if you want to understand the origins of Google’s ranking system. Of particular interest is section 2.1.2 – Intuitive Justification:

The probability that the random surfer visits a page is its PageRank.
And, the d damping factor is the probability at each page the “random surfer” will get bored and request
another random page. One important variation is to only add the damping factor d to a single page, or a
group of pages. This allows for personalization and can make it nearly impossible to deliberately
mislead the system in order to get a higher ranking.

Nearly impossible eh? Well, that was the thinking back in ’98…
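
To make the random surfer idea concrete, here is a minimal sketch in Python. It is not the paper's implementation. It assumes the common convention in which the surfer follows a link with probability d and jumps to another page with probability 1 - d. The function name `pagerank`, the `jump` parameter and the tiny three-page graph are all made up for illustration. Passing a `jump` distribution concentrated on one page or a group of pages mimics the personalization variation described in the quote.

```python
def pagerank(links, d=0.85, jump=None, iterations=100):
    """Power-iteration sketch. `links` maps each page to the pages it links to.

    `jump` (hypothetical parameter) is the distribution the surfer jumps to
    when bored: uniform by default, or concentrated on chosen pages.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    if jump is None:
        jump = {p: 1.0 / n for p in pages}
    else:
        total = sum(jump.values())
        jump = {p: jump.get(p, 0.0) / total for p in pages}

    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page first receives its share of the random jumps.
        new = {p: (1.0 - d) * jump[p] for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = d * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Assumption: a page with no outlinks hands its rank to the jump distribution.
                for q in pages:
                    new[q] += d * rank[p] * jump[q]
        rank = new
    return rank


web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
uniform = pagerank(web)
personal = pagerank(web, jump={"a": 1.0})  # all random jumps land on page "a"
```

In this toy graph the personalized run gives page "a" a higher score than the uniform run, because every bored jump lands on it.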

The PageRank Citation Ranking: Bringing Order to the Web (1998)

PDF FILE – The definitive PageRank paper, where a comparison is made to the ‘idealized web surfer’. Even back in 1998, they were worried about commercial interests manipulating their results, and about link buying.

At worst, you can have manipulation in the form of buying advertisements (links) on important sites. But, this seems well under control since it costs money.

Of course, back then they thought that ‘cost’ almost equated to immunity from manipulation. Instead, links are bought and sold like any commodity.

Authoritative Sources in a Hyperlinked Environment (1999)

PDF FILE – This paper by Jon Kleinberg is cited by Brin in Bringing Order To The Web and deals directly with hubs and authorities.

Of course, there are a number of potential pitfalls in the application of links
for such a purpose. First of all, links are created for a wide variety of reasons,
many of which have nothing to do with the conferral of authority. For example, a
large number of links are created primarily for navigational purposes (“Click
here to return to the main menu”); others represent paid advertisements.
Another issue is the difficulty in finding an appropriate balance between the
criteria of relevance and popularity, each of which contributes to our intuitive
notion of authority.
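
For readers who have not met hubs and authorities before, the core iteration can be sketched in a few lines of Python. This is a simplified illustration, not Kleinberg's full method: it skips the step of building a query-focused subgraph and assumes the link graph is already given. The function name `hits` and the example graph are hypothetical.

```python
import math


def hits(links, iterations=50):
    """Sketch of the mutually reinforcing hub/authority update.

    `links` maps each page to the list of pages it links to.
    Returns two dicts: (hub scores, authority scores).
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A good authority is pointed to by good hubs.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # A good hub points to good authorities.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded from one round to the next.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


graph = {"h1": ["x", "y"], "h2": ["x", "y"], "h3": ["x"]}
hub, auth = hits(graph)
```

Here "x" ends up as the strongest authority because all three hub pages link to it, while "h3" is the weakest hub because it points to only one authority. Note that the sketch counts every link equally, which is exactly the pitfall the quote warns about for navigational and paid links.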

Improved Algorithms for Topic Distillation in a Hyperlinked Environment (1998)

PDF FILE – This paper addresses three problems found in the Kleinberg Connectivity Analysis Algorithm: 1. Mutually reinforcing relationships between hosts. 2. Automatically generated links. 3. Non-relevant nodes. The imp algorithm comes to life.

4.4 Implementation
Unlike the previous implementation where it sufficed to
get the graph from the Connectivity Server, in this case
we need to fetch all the documents to do content analysis.
To build term vectors we eliminate stop words and use
Porter stemming [27]. For IDF weights, since we know of
no source of IDF weights for the Web and of no official
representative collection, we had to build our own collection.
Hence we used term frequencies measured in a
crawl of 400,000 Yahoo! [30] documents in January 1997.
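
The term-vector step in that excerpt can be illustrated with a rough Python sketch. Several things here are assumptions and not taken from the paper: the tiny stop-word list, the smoothed IDF formula, the use of cosine similarity to compare vectors, and the three-document stand-in for a reference collection. Porter stemming is left out to keep the example dependency-free.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the paper's list).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}


def tokens(text):
    """Lowercase, split on non-alphanumerics, drop stop words (no stemming here)."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


def build_idf(collection):
    """Derive IDF weights from a reference collection of documents."""
    n = len(collection)
    df = Counter()
    for doc in collection:
        df.update(set(tokens(doc)))

    def idf(term):
        # Smoothed so that terms unseen in the collection still get a finite weight.
        return math.log((1 + n) / (1 + df[term])) + 1.0

    return idf


def term_vector(text, idf):
    """Map each term to (term frequency) x (IDF weight)."""
    return {t: c * idf(t) for t, c in Counter(tokens(text)).items()}


def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


collection = ["jaguar car dealer", "jaguar cat habitat", "car insurance quotes"]
idf = build_idf(collection)
topic = term_vector("jaguar car", idf)
on_topic = cosine(topic, term_vector("used jaguar car prices", idf))
off_topic = cosine(topic, term_vector("cat habitat in the rainforest", idf))
```

A page that shares no terms with the topic scores zero, which is one simple way a non-relevant node could be spotted before it influences the link analysis.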

PageRank: Quantitative Model of Interaction Information Retrieval

PDF FILE – This paper provides another view of PageRank: rather than the stochastic view, it treats PageRank as a dynamic system that seeks equilibrium, which makes it possible to compute relative importance.

Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text (1998)

PDF FILE – The ‘automatic resource compiler’.

The subject of this paper is the design and evaluation of an automatic resource compiler. An automatic
resource compiler is a system which, given a topic that is broad and well-represented on the web, will
seek out and return a list of web resources that it considers the most authoritative for that topic. Our
system is built on an algorithm that performs a local analysis of both text and links to arrive at a “global consensus” of the best resources for the topic.

This paper asserts that their automatically compiled results were “almost competitive with, and occasionally better than the (semi) manually compiled results”. Yahoo and Infoseek were used as the (semi) manually compiled results.
