Unraveling the mystery of PageRanking!
By
Mitali Jha
B.Tech(IT) 4th year
United College of Engineering & Research, Allahabad
To understand how search engines work, let's get a brief idea of the technologies employed for the purpose:
1. Information Retrieval
It deals with querying unstructured textual data. Information is organised into documents; for example, each HTML page is treated as one document. Each document has a set of keywords associated with it, and a search locates the relevant documents based on user input. When that input is a set of keywords, the approach is called Keyword-based Information Retrieval. This type of information retrieval can be used for retrieving textual, audio and video data.
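A minimal sketch of keyword-based retrieval, assuming each page is reduced to a bag of lowercase words. The page names, the sample text and the helper functions `build_index` and `search` are hypothetical and only illustrate the idea of mapping keywords to documents:

```python
from collections import defaultdict

def build_index(documents):
    """Map each keyword to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every keyword in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start with the matches for the first keyword, then narrow them down.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical pages, each treated as one document.
documents = {
    "page1.html": "search engines rank web pages",
    "page2.html": "web crawlers visit pages",
    "page3.html": "engines need fuel",
}
index = build_index(documents)
matches = search(index, "web pages")
```

Here `matches` holds the two pages that contain both "web" and "pages". A real search engine would also normalise words and rank the matches, which this sketch leaves out.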
2. PageRank Technology
A page with fewer links pointing to it can still receive a higher PageRank if those links come from highly valued pages. Such pages are treated as more important and relevant, and are more likely to appear near the top of the search results. This technology is also referred to as Popularity or Prestige Ranking.
High PageRank does NOT guarantee a high search ranking for any particular term. If it did, then PR10 sites like Adobe would always show up for any search you do. They don't.
Popularity cannot simply be measured per site, however. Many sites, such as university home-page servers and Web portals, host a large number of mostly unrelated pages. For such sites, the popularity of one part of the site does not imply the popularity of another part. The alternative is to transfer prestige (importance value) from popular pages to the pages that they link to!
Google interprets a link from page A to page B as a vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages important. A hyperlink to a page counts as a vote of support. No links to a web page means no support for that page.
PageRank theory holds that even an imaginary surfer who is randomly clicking on links will eventually stop clicking. The probability, at any step, that the surfer will continue is the damping factor 'd', which is generally set to around 0.85.
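As a rough worked example, assume each click is an independent decision with the same probability d of continuing (an assumption of this sketch, not something stated by the PageRank definition). The number of clicks before the surfer stops then follows a geometric distribution:

```latex
P(\text{exactly } k \text{ clicks}) = d^{k}(1-d), \qquad
E[\text{clicks}] = \sum_{k=0}^{\infty} k\, d^{k}(1-d) = \frac{d}{1-d}
```

With d = 0.85 this gives 0.85 / 0.15, or about 5.7 clicks on average before the surfer stops.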
The following formula is used to calculate the PageRank of a page A, which measures its relative importance among the pages that could fulfill a user's query:
PR(A) = (1-d)/N + d (PR(B)/L(B) + PR(C)/L(C) + ...)
where B, C, ... are the pages that link to A, d is the damping factor, N is the number of documents in the collection and L(X) is the number of outbound links from page X.
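The formula defines each page's rank in terms of the ranks of other pages, so in practice it is usually evaluated by repeated substitution until the values settle. The sketch below does this for a tiny, hypothetical three-page graph. The function name `pagerank`, the fixed iteration count and the assumption that every page has at least one outbound link are choices made for this illustration:

```python
def pagerank(links, d=0.85, iterations=100):
    """Iteratively apply PR(A) = (1-d)/N + d * sum(PR(X)/L(X)).

    links maps each page to the list of pages it links to.
    Assumes every page has at least one outbound link.
    """
    pages = list(links)
    n = len(pages)
    # Start from a uniform distribution over all pages.
    pr = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum the shares passed on by every page that links to this one.
            incoming = sum(
                pr[src] / len(links[src]) for src in pages if page in links[src]
            )
            new_pr[page] = (1 - d) / n + d * incoming
        pr = new_pr
    return pr

# Hypothetical graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

In this example C ends up with the highest rank because it receives votes from both A and B, and A comes next because its single vote comes from the highly ranked C. Pages with no outbound links need extra handling that this sketch does not include.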
Uses of PageRanking
- A version of PageRank can be used to rank academic departments: in PageRank terms, departments link to each other by hiring their faculty from each other.
- A Web Crawler may use PageRank to determine which URL to visit next during a crawl of the Web.
- PageRank may also be used as a methodology to measure the apparent impact of a community like the Blogosphere on the overall Web itself.
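The crawler use case above can be sketched with a priority queue that always hands out the highest-ranked URL next. The URLs, the scores and the function name `crawl_order` are hypothetical; a real crawler would also update scores as it discovers new pages:

```python
import heapq

def crawl_order(scores):
    """Yield URLs from the highest score to the lowest."""
    # heapq is a min-heap, so scores are negated to pop the largest first.
    frontier = [(-score, url) for url, score in scores.items()]
    heapq.heapify(frontier)
    while frontier:
        _, url = heapq.heappop(frontier)
        yield url

# Hypothetical PageRank estimates for three URLs waiting to be crawled.
scores = {
    "http://example.com/a": 0.39,
    "http://example.com/b": 0.21,
    "http://example.com/c": 0.40,
}
order = list(crawl_order(scores))
```

With these scores the crawler would visit page c first, then a, then b.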
3. Hypertext-matching analysis
The technology analyzes the full content of the specific page along with the neighbouring web pages to ensure that the results returned are the most relevant to the user's query.
So, the takeaway from the above revolves around three major aspects:
- Best locally relevant results are served globally: It is ensured that no user query is left behind.
- Keeping it simple: For a wide variety of user queries, there has to be a response in multiple languages. Even then, the system has to be made simple and understandable.
- No manual intervention: No editing is done manually. The final ordering of the results must be decided by the algorithms alone.