Page Rank Algorithm

The document summarizes the key aspects of the original paper by Page and Brin that proposed the concept of PageRank and the initial architecture of the Google search engine. It describes how PageRank assigns importance scores to webpages based on both the quantity and quality of inbound links, addressing limitations of prior search engines. It also outlines Google's initial architecture, including how it crawls, indexes, and stores webpages before using PageRank and other signals to return relevant results for search queries.


The Anatomy of a Large-Scale Hypertextual Web Search Engine and Parallelizing K-means Clustering with MapReduce

CSE 590 DATA MINING Prof. Anita Wasilewska SUNY Stony Brook Presented By: Tushar Deshpande (SB ID: 106676653) Tejas Vora (SB ID: 106879612)

The Anatomy of a Large-Scale Hypertextual Web Search Engine

Sergey Brin (sergey@cs.stanford.edu) and Lawrence Page (page@cs.stanford.edu), Computer Science Department, Stanford University, Stanford, CA 94305, USA.

Seventh International World-Wide Web Conference (WWW 1998), April 14-18, 1998, Brisbane, Australia.
http://infolab.stanford.edu/~backrub/google.html

Description about the Authors


Larry Page (born March 26, 1973, in Lansing, Michigan, USA) and Sergey Brin (born August 21, 1973, in Moscow, Soviet Union) are cofounders of Google, Inc., the world's largest internet company, built on its search engine and online advertising technology. Together they were ranked #1 among the 50 Most Important People on the Web by PC World magazine. Larry Page is ranked 26th on the 2008 Forbes list of the world's billionaires and is the 6th richest person in America, and Forbes has ranked Sergey Brin as the 25th richest person in the world.

http://en.wikipedia.org/wiki/Larry_Page, http://en.wikipedia.org/wiki/Sergey_Brin

Important References
Serge Abiteboul and Victor Vianu. Queries and Computation on the Web. Proceedings of the International Conference on Database Theory, Delphi, Greece, 1997.
S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan. Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text. Seventh International Web Conference (WWW 98), Brisbane, Australia, April 14-18, 1998.
Junghoo Cho, Hector Garcia-Molina, and Lawrence Page. Efficient Crawling Through URL Ordering. Seventh International Web Conference (WWW 98), Brisbane, Australia, April 14-18, 1998.
Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. New York: Van Nostrand Reinhold, 1994.
Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. http://google.stanford.edu/~backrub/pageranksub.ps

http://infolab.stanford.edu/~backrub/google.html

Abstract
Google is a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The paper addresses the question of how to build a practical large-scale system which can exploit the additional information present in hypertext.

http://infolab.stanford.edu/~backrub/google.html

Introduction

www.cis.temple.edu/~vasilis/Courses/CIS664/Papers/An-google.ppt

Need for Search Engines


The popularity of the internet was increasing day by day. As the internet grew more popular, the amount of content on it grew as well. This led to the need for efficient search engines.

http://infolab.stanford.edu/~backrub/google.html

Conventional Search Engines


Conventional search engines categorized pages on the basis of search tags: they counted the number of occurrences of a particular tag (keyword) in each page. The page with the highest number of hits for the given tags was considered the most relevant document.

http://infolab.stanford.edu/~backrub/google.html

Working of Conventional Search Engine


Suppose I want to search for "stony brook university". A conventional search engine will look for the matching tags in the web pages. Depending on the hit rate, the most relevant web pages are displayed.

http://infolab.stanford.edu/~backrub/google.html

Drawbacks of Conventional Search Engines


Example 1: Suppose I have 100 web pages, each of which has the text "stony brook university" repeated at least 1000 times. A conventional search engine will return these 100 web pages as the most relevant documents. Example 2: Compare the usage information from a major homepage like Yahoo's, which currently receives millions of page views every day, with an obscure historical article which might receive one view every ten years. Clearly, these two items must be treated very differently by a search engine.

http://infolab.stanford.edu/~backrub/google.html

Link Analysis
Links represent citations. The quantity of links to a website makes the website more popular, and the quality of the links to a website also helps in computing its rank. The link structure of the web was largely unused before Larry Page proposed exploiting it to his thesis advisor.

http://infolab.stanford.edu/~backrub/google.html

Page Rank
PageRank is one of the top 10 IEEE ICDM data mining algorithms. PageRank is a trademark of Google, and the PageRank process has been patented. Google utilizes links to improve search results: PageRank is a link analysis algorithm which assigns a numerical weighting to each web page, with the purpose of "measuring" its relative importance.

www.cis.temple.edu/~vasilis/Courses/CIS664/Papers/An-google.ppt

Simplified PageRank algorithm


Assume four web pages: A, B, C, and D. Let each page begin with an estimated PageRank of 0.25. [Slide diagram: two example link graphs over pages A, B, C, and D.]

L(A) is defined as the number of links going out of page A. The PageRank of page A is the sum, over each page linking to A, of that page's PageRank divided by its outgoing link count: PR(A) = PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D), where B, C, and D are the pages that link to A.

www.cis.temple.edu/~vasilis/Courses/CIS664/Papers/An-google.ppt
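The simplified update can be sketched in a few lines of Python. The four-page link graph below is a hypothetical stand-in for the slide's diagram; each iteration gives every page the sum of PR(src)/L(src) over its inbound links.

```python
# Simplified PageRank: every page starts at 1/N, and each iteration
# a page's new rank is the sum of PR(src)/L(src) over its inbound links.
links = {                 # hypothetical link graph: page -> out-links
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}
pages = list(links)
pr = {p: 1 / len(pages) for p in pages}    # each page begins at 0.25

for _ in range(50):                        # iterate until (roughly) stable
    nxt = {p: 0.0 for p in pages}
    for src, outs in links.items():
        share = pr[src] / len(outs)        # PR(src) split over its L(src) links
        for dst in outs:
            nxt[dst] += share
    pr = nxt

print({p: round(r, 3) for p, r in pr.items()})
# → {'A': 0.4, 'B': 0.2, 'C': 0.4, 'D': 0.0}
```

Note that page D, which nothing links to, loses all of its rank under the simplified update; the damping factor introduced on the next slide fixes exactly this.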

PageRank algorithm including damping factor


Assume page A has pages B, C, D, ..., pointing to it. The parameter d is a damping factor which can be set between 0 and 1; it is usually set to 0.85. The PageRank of page A is then given by: PR(A) = (1 - d) + d (PR(B)/L(B) + PR(C)/L(C) + ...).

www.cis.temple.edu/~vasilis/Courses/CIS664/Papers/An-google.ppt
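A minimal sketch of the damped formula PR(A) = (1 - d) + d·(PR(B)/L(B) + ...), using the same hypothetical four-page graph; in this form of the formula the ranks converge to values that sum to the number of pages.

```python
d = 0.85                                   # damping factor, as in the paper
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["A", "B", "C"]}
pages = list(links)
# invert the graph once: which pages point to p?
incoming = {p: [s for s in pages if p in links[s]] for p in pages}

pr = {p: 1.0 for p in pages}               # with this form, ranks sum to N
for _ in range(100):
    # Jacobi-style update: the comprehension reads the old pr throughout
    pr = {p: (1 - d) + d * sum(pr[s] / len(links[s]) for s in incoming[p])
          for p in pages}
```

Page D, which no page links to, now keeps the baseline rank 1 - d = 0.15 instead of dropping to zero.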

Intuitive Justification
Imagine a "random surfer" who is given a web page at random and keeps clicking on links, never hitting "back", but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank. The damping factor d is the probability at each page that the "random surfer" will get bored and request another random page. A page can have a high PageRank if there are many pages that point to it, or if there are some pages that point to it which themselves have a high PageRank.

www.cis.temple.edu/~vasilis/Courses/CIS664/Papers/An-google.ppt
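The random-surfer story can be checked directly by simulation. A hedged sketch, assuming the same hypothetical four-page graph: the surfer follows a random out-link with probability d, otherwise jumps to a random page, and the long-run visit frequencies approach PageRank (normalized by the number of pages).

```python
import random

random.seed(0)                       # fixed seed so the run is repeatable
d = 0.85                             # probability of following a link
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["A", "B", "C"]}
pages = list(links)

steps = 200_000
visits = {p: 0 for p in pages}
page = random.choice(pages)
for _ in range(steps):
    visits[page] += 1
    if random.random() < d:          # keep clicking: follow a random out-link
        page = random.choice(links[page])
    else:                            # get bored: restart on a random page
        page = random.choice(pages)

est = {p: v / steps for p, v in visits.items()}
```

Page D is reached only through random restarts, so its estimated frequency sits near (1 - d)/4 ≈ 0.0375, matching the iterative computation.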

Other Features
The text associated with a link (anchor text) is treated in a special way. Google keeps location information for all hits, and it also tracks some visual presentation details such as the font size of words: words in a larger or bolder font are weighted higher than other words. The full raw HTML of pages is available in a repository.

www.cis.temple.edu/~vasilis/Courses/CIS664/Papers/An-google.ppt

Architecture Overview

www.cis.temple.edu/~vasilis/Courses/CIS664/Papers/An-google.ppt

Architecture Overview
URL server: sends URLs to be fetched to the crawlers.
Crawler: downloads web pages; crawling is done by several distributed crawlers.
Store server: compresses the web pages and stores them in the repository.
Repository: each stored web page is associated with a docID.

www.cis.temple.edu/~vasilis/Courses/CIS664/Papers/An-google.ppt
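A minimal sketch of the store server's role, assuming a hypothetical in-memory repository: each downloaded page is compressed and filed under its docID (Python's zlib stands in here for whatever compression the real system used).

```python
import zlib

repository = {}                           # hypothetical in-memory repository

def store(doc_id: int, url: str, html: str) -> None:
    """Compress a fetched page and file it under its docID."""
    repository[doc_id] = (url, zlib.compress(html.encode("utf-8")))

def fetch(doc_id: int) -> str:
    """Recover the original page text for a docID."""
    url, blob = repository[doc_id]
    return zlib.decompress(blob).decode("utf-8")

store(1, "http://example.com/", "<html><body>hello</body></html>")
```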

Google Architecture Overview


Indexer: reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits; a hit records the word, its position, font, and capitalization. The indexer distributes the hits into a set of barrels. It also parses out all the links in the web pages and stores them in an anchor file, which contains enough information to determine where each link points from and to, and the text of the link.

www.cis.temple.edu/~vasilis/Courses/CIS664/Papers/An-google.ppt
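The indexer step can be sketched as follows. The barrel count, hit layout, and `index_document` helper are hypothetical simplifications: each word occurrence becomes a hit recording the document, position, and a capitalization flag, and hits are distributed across barrels by wordID.

```python
NUM_BARRELS = 4                # hypothetical barrel count
lexicon = {}                   # word -> wordID, grown as words are seen
barrels = [[] for _ in range(NUM_BARRELS)]

def word_id(word: str) -> int:
    return lexicon.setdefault(word, len(lexicon))

def index_document(doc_id: int, text: str) -> None:
    # each occurrence becomes a hit: (docID, position, capitalized?)
    for pos, word in enumerate(text.split()):
        wid = word_id(word.lower())
        hit = (doc_id, pos, word[0].isupper())
        barrels[wid % NUM_BARRELS].append((wid, hit))

index_document(1, "Stony Brook University in Stony Brook")
```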

Google Architecture Overview


URL resolver: reads the anchor files, converts the relative URLs into absolute URLs, and in turn into docIDs.

www.cis.temple.edu/~vasilis/Courses/CIS664/Papers/An-google.ppt
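The URL resolver's core operation can be sketched like this, assuming a hypothetical `doc_ids` table: `urllib.parse.urljoin` performs the relative-to-absolute conversion, and each distinct absolute URL is assigned a docID.

```python
from urllib.parse import urljoin

doc_ids = {}                              # hypothetical table: absolute URL -> docID

def resolve(base_url: str, href: str) -> int:
    absolute = urljoin(base_url, href)    # relative link -> absolute URL
    return doc_ids.setdefault(absolute, len(doc_ids))

a = resolve("http://example.com/dir/page.html", "../other.html")
b = resolve("http://example.com/", "other.html")   # same target page
c = resolve("http://example.com/", "/index.html")  # a different page
```

Two different relative links that resolve to the same absolute URL map to the same docID, which is what lets the link graph be built over documents rather than raw URL strings.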

Google Architecture Overview


Sorter: takes the barrels, which are sorted by docID, and resorts them by wordID.
DumpLexicon: generates a new lexicon for use by the searcher.
Searcher: uses this lexicon, together with the barrels and the PageRanks, to answer queries.

www.cis.temple.edu/~vasilis/Courses/CIS664/Papers/An-google.ppt
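The sorter's job, in miniature: a forward barrel holds hits grouped by docID, and the inverted barrel is the same data re-sorted by wordID so the searcher can scan all documents containing a given word. The tuple layout below is a hypothetical simplification.

```python
# Forward barrel: hits as (docID, wordID, position), grouped by docID.
forward_barrel = [
    (1, 7, 4), (1, 3, 0),       # hits for document 1
    (2, 7, 0), (2, 3, 1),       # hits for document 2
]
# Inverted barrel: the same hits re-sorted by wordID, then docID.
inverted_barrel = sorted(forward_barrel, key=lambda hit: (hit[1], hit[0]))
```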

Searching
The goal of searching is to provide quality search results; the present focus of the paper is not on response time. To limit the response time, the current search is stopped once 40,000 matching documents have been retrieved.

http://infolab.stanford.edu/~backrub/google.html

The Ranking system


Google maintains much more information about web documents than typical search engines. Every hit list includes position, font, and capitalization information. Every document has a count-weight, which depends on the number of hits, and an associated type-weight; the type-weight reflects the position, font, or capitalization of the hit. These two, along with the PageRank of the page, form the ranking function. The computation becomes more complex when the search query has multiple terms: an additional type-prox-weight is used when there are multiple terms in the search box. The type-prox-weight is based on proximity, that is, on how far apart the hits are in the document.

http://infolab.stanford.edu/~backrub/google.html
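A hedged sketch of how such a ranking function might combine these signals; the weight values and the tapering count-weight below are illustrative assumptions, not the paper's actual numbers.

```python
import math

# Illustrative type-weights: where/how a word appears in the document.
TYPE_WEIGHT = {"title": 10.0, "anchor": 8.0, "large_font": 4.0, "plain": 1.0}

def count_weight(n_hits: int) -> float:
    # count-weight grows with the number of hits but quickly tapers off
    return math.log1p(n_hits)

def score(hit_types, pagerank):
    # IR score from type-weights and count-weight, combined with PageRank
    ir = sum(TYPE_WEIGHT[t] for t in hit_types) * count_weight(len(hit_types))
    return ir * pagerank

in_title = score(["title", "plain"], pagerank=0.4)
in_body = score(["plain", "plain"], pagerank=0.4)
```

With equal PageRank, the document whose hits include a title occurrence outranks one with only plain-text hits, which is the behavior the type-weights are meant to capture.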

Feedback
Figuring out the right values for the type-weights and type-prox-weights is a challenge. For this reason the authors came up with the idea of user feedback: depending on the user feedback, the ranking function is modified.

http://infolab.stanford.edu/~backrub/google.html

Results and Performance


Now, if we query the same text as before, "Stony Brook University", we are bound to get the desired Stony Brook University home page in the search results. As the main purpose of the paper was to return relevant search results, the focus was not on retrieval time, but the authors claim that the system will be made more efficient and delivered as a commercial product.

http://infolab.stanford.edu/~backrub/google.html

Conclusion
Google is designed to be a scalable search engine. The primary goal is to provide high-quality search results over a rapidly growing World Wide Web. Google employs a number of techniques to improve search quality, including PageRank, anchor text, and proximity information. Google is a complete architecture for gathering web pages, indexing them, and performing search queries over them.

http://infolab.stanford.edu/~backrub/google.html
