The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)

PDF FILE -This is the paper that started it all, the prototype of the Google search engine, written back when they were working with 24 million pages. It is a necessary read if you want to understand the origins of Google’s ranking system. Of particular interest is section 2.1.2, Intuitive Justification:

The probability that the random surfer visits a page is its PageRank.
And, the d damping factor is the probability at each page the “random surfer” will get bored and request
another random page. One important variation is to only add the damping factor d to a single page, or a
group of pages. This allows for personalization and can make it nearly impossible to deliberately
mislead the system in order to get a higher ranking.

Nearly impossible eh? Well, that was the thinking back in ‘98…
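To make the random surfer a bit more concrete, here is a minimal sketch of PageRank by repeated iteration. It is an illustration under assumptions, not the code from the paper: the function name `pagerank`, the `teleport` argument, and the handling of pages with no outgoing links are all choices made for this example. It follows the now-common convention where d (0.85 here) weights the followed links, so 1 - d plays the part of the bored surfer's random jump. Passing a skewed `teleport` distribution is one way to read the "single page, or a group of pages" variation in the quote.

```python
def pagerank(links, d=0.85, teleport=None, iterations=50):
    """Iterative PageRank sketch.

    links: dict mapping each page to the list of pages it links to.
    teleport: dict of jump probabilities summing to 1 (hypothetical
              parameter name); uniform over all pages if omitted.
    """
    pages = sorted(set(links) | {t for outs in links.values() for t in outs})
    n = len(pages)
    if teleport is None:
        teleport = {p: 1.0 / n for p in pages}
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page first receives its share of the random jumps.
        new_rank = {p: (1.0 - d) * teleport.get(p, 0.0) for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                # A page splits its rank evenly across its outgoing links.
                share = d * rank[p] / len(outs)
                for q in outs:
                    new_rank[q] += share
            else:
                # Assumption: a page with no links hands its rank back
                # through the teleport distribution.
                for q in pages:
                    new_rank[q] += d * rank[p] * teleport.get(q, 0.0)
        rank = new_rank
    return rank


web = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["c"]}
uniform = pagerank(web)
personal = pagerank(web, teleport={"d": 1.0})  # all random jumps land on "d"
```

With the uniform jump, page "c" comes out on top because it collects links from both "b" and "d". With every jump sent to "d", that page's score rises even though nothing links to it, which is the personalization effect the paper hints at.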

The PageRank Citation Ranking: Bringing Order to the Web (1998)

PDF FILE- The definitive PageRank paper, where a comparison is made to the ‘idealized web surfer’. Even back in 1998, they were worried about commercial interests manipulating their results, and about link buying.

At worst, you can have manipulation in the form of buying advertisements (links) on important sites. But, this seems well under control since it costs money.

Of course, back then they thought that ‘cost’ almost equated to immunity from manipulation. Instead, links are bought and sold like any commodity.

Authoritative Sources in a Hyperlinked Environment (1999)

PDF FILE – This paper by Jon Kleinberg is cited by Brin in Bringing Order To The Web and deals directly with hubs and authorities.

Of course, there are a number of potential pitfalls in the application of links
for such a purpose. First of all, links are created for a wide variety of reasons,
many of which have nothing to do with the conferral of authority. For example, a
large number of links are created primarily for navigational purposes (“Click
here to return to the main menu”); others represent paid advertisements.
Another issue is the difficulty in finding an appropriate balance between the
criteria of relevance and popularity, each of which contributes to our intuitive
notion of authority.
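For anyone who wants to see hubs and authorities in motion, here is a minimal sketch of the mutually reinforcing update: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. This is a simplification, not Kleinberg's full procedure, which first builds a query-focused subgraph before iterating. The function name `hits` and the toy graph are assumptions made for this example.

```python
import math


def hits(links, iterations=50):
    """Hub/authority sketch.

    links: dict mapping each page to the list of pages it links to.
    Returns (hub, authority) score dicts, each normalized to unit length.
    """
    pages = sorted(set(links) | {t for outs in links.values() for t in outs})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority: sum of the hub scores of pages that link in.
        auth = {p: 0.0 for p in pages}
        for p, outs in links.items():
            for q in outs:
                auth[q] += hub[p]
        # Hub: sum of the authority scores of pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


graph = {"h1": ["x", "y"], "h2": ["x", "y"], "h3": ["x"]}
hub, auth = hits(graph)
```

In this toy graph "x" ends up as the strongest authority, and "h1" and "h2" outrank "h3" as hubs because they point at both authorities. Notice that the sketch treats every link as an endorsement, which is exactly the pitfall the quoted passage warns about for navigational and paid links.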

Improved Algorithms for Topic Distillation in a Hyperlinked Environment (1998)

PDF FILE – This paper addresses three problems found in the Kleinberg Connectivity Analysis Algorithm: 1. Mutually reinforcing relationships between hosts. 2. Automatically generated links. 3. Non-relevant nodes. The imp algorithm comes to life.

4.4 Implementation
Unlike the previous implementation where it sufficed to
get the graph from the Connectivity Server, in this case
we need to fetch all the documents to do content analysis.
To build term vectors we eliminate stop words and use
Porter stemming [27]. For IDF weights, since we know of
no source of IDF weights for the Web and of no official
representative collection, we had to build our own collection.
Hence we used term frequencies measured in a
crawl of 400,000 Yahoo! [30] documents in January 1997.
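The term-vector step in that excerpt can be sketched roughly as follows. This is an illustration under assumptions, not the authors' implementation: the stop-word list is a tiny stand-in, Porter stemming is left out to keep the example dependency-free, the three-document collection is made up, and the cosine comparison at the end is included only as a common way to compare such vectors. All function names are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}


def tokenize(text):
    """Lowercase, split into words, drop stop words (no stemming here)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def idf_weights(collection):
    """IDF from a reference collection: log(N / number of docs with the term)."""
    n = len(collection)
    doc_freq = Counter()
    for doc in collection:
        doc_freq.update(set(tokenize(doc)))
    return {term: math.log(n / df) for term, df in doc_freq.items()}


def term_vector(text, idf):
    """TF * IDF weights; terms unseen in the collection get weight 0 (assumption)."""
    tf = Counter(tokenize(text))
    return {term: count * idf.get(term, 0.0) for term, count in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


collection = [
    "the jaguar car dealer",
    "jaguar cat in the jungle",
    "car repair shop",
]
idf = idf_weights(collection)
query = term_vector("jaguar car", idf)
scores = [cosine(query, term_vector(doc, idf)) for doc in collection]
```

Here the first document scores highest against the query because it shares both terms. The quality of the IDF weights depends entirely on the reference collection, which is why the authors had to build their own from a Yahoo! crawl.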

PageRank: Quantitative Model of Interaction Information Retrieval

PDF FILE – This paper offers another view of PageRank: instead of the stochastic (random surfer) view, it treats PageRank as a dynamic system that seeks equilibrium, which makes it possible to compute relative importance.

Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text (1998)

PDF FILE – The ‘automatic resource compiler’.

The subject of this paper is the design and evaluation of an automatic resource compiler. An automatic
resource compiler is a system which, given a topic that is broad and well-represented on the web, will
seek out and return a list of web resources that it considers the most authoritative for that topic. Our
system is built on an algorithm that performs a local analysis of both text and links to arrive at a “global consensus” of the best resources for the topic.

This paper asserts that their automatically compiled results were “almost competitive with, and occasionally better than the (semi) manually compiled results”. Yahoo and Infoseek were used as the (semi) manually compiled results.

