Page Rank Algorithm
CSE 590 DATA MINING Prof. Anita Wasilewska SUNY Stony Brook Presented By: Tushar Deshpande (SB ID: 106676653) Tejas Vora (SB ID: 106879612)
Sergey Brin sergey@cs.stanford.edu and Lawrence Page page@cs.stanford.edu Computer Science Department , Stanford University, Stanford, CA 94305, USA.
Seventh International World-Wide Web Conference (WWW 1998), April 14-18, 1998, Brisbane, Australia.
http://infolab.stanford.edu/~backrub/google.html
http://en.wikipedia.org/wiki/Larry_Page, http://en.wikipedia.org/wiki/Sergey_Brin
Important References
Serge Abiteboul and Victor Vianu. Queries and Computation on the Web. Proceedings of the International Conference on Database Theory. Delphi, Greece, 1997.
S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan and S. Rajagopalan. Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text. Seventh International Web Conference (WWW 98). Brisbane, Australia, April 14-18, 1998.
Junghoo Cho, Hector Garcia-Molina, Lawrence Page. Efficient Crawling Through URL Ordering. Seventh International Web Conference (WWW 98). Brisbane, Australia, April 14-18, 1998.
Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. New York: Van Nostrand Reinhold, 1994.
Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. http://google.stanford.edu/~backrub/pageranksub.ps
Abstract
Google is a prototype of a large-scale search engine that makes heavy use of the structure present in hypertext. It is designed to crawl and index the Web efficiently and to produce much more satisfying search results than existing systems. The paper addresses the question of how to build a practical large-scale system that can exploit the additional information present in hypertext.
Introduction
www.cis.temple.edu/~vasilis/Courses/CIS664/Papers/An-google.ppt
Link Analysis
Links represent citations.
The quantity of links to a website makes the website more popular.
The quality of links to a website also helps in computing its rank.
Link structure was largely unused before Larry Page proposed it to his thesis advisor.
Page Rank
One of the top 10 data mining algorithms identified by IEEE ICDM.
PageRank is a trademark of Google, and the PageRank process has been patented.
Google uses links to improve search results.
PageRank is a link analysis algorithm that assigns a numerical weight to each Web page, with the purpose of measuring its relative importance.
L(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:
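A minimal sketch of the computation, assuming the form given in the Brin and Page paper, PR(A) = (1 - d) + d * (PR(T1)/L(T1) + ... + PR(Tn)/L(Tn)), where T1..Tn are the pages that link to A and d is a damping factor (0.85 is assumed here). The function name, the toy graph, and the fixed iteration count are illustrative assumptions, not part of the original system.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively compute PageRank.

    links maps each page to the list of pages it links to,
    so len(links[p]) plays the role of L(p).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # Start every page at rank 1.0 (an arbitrary starting point).
    pr = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Every page receives the constant (1 - d) term.
        new = {p: 1.0 - d for p in pages}
        for src, targets in links.items():
            if not targets:
                # Simplification: a page with no out-links passes on nothing.
                continue
            share = d * pr[src] / len(targets)
            for t in targets:
                new[t] += share
        pr = new
    return pr


# Hypothetical three-page web: A links to B and C, B to C, C back to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph C ends up ranked highest because it is linked from both other pages, and A comes next because it receives all of C's rank. With this form of the formula and no dead-end pages, the ranks sum to the number of pages.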
Intuitive Justification
Consider a "random surfer" who is given a web page at random and keeps clicking on links, never hitting "back", but eventually gets bored and starts on another random page.
The probability that the random surfer visits a page is its PageRank.
The damping factor d governs this behaviour: at each page the surfer follows a link with probability d, and gets bored and requests another random page with probability 1 - d.
A page can have a high PageRank if many pages point to it, or if some of the pages that point to it have a high PageRank themselves.
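The random-surfer picture can be checked with a small simulation. This is only a sketch under stated assumptions: d is taken as the probability of following a link (0.85 here), every page appears as a key in the hypothetical `links` dictionary, and the step count and seed are arbitrary.

```python
import random


def simulate_surfer(links, d=0.85, steps=100_000, seed=0):
    """Estimate how often a random surfer visits each page."""
    rng = random.Random(seed)
    pages = sorted(links)
    visits = {p: 0 for p in pages}
    current = rng.choice(pages)

    for _ in range(steps):
        visits[current] += 1
        targets = links[current]
        if targets and rng.random() < d:
            # Keep clicking: follow one of the out-links at random.
            current = rng.choice(targets)
        else:
            # Bored (or stuck on a dead end): jump to a random page.
            current = rng.choice(pages)

    return {p: visits[p] / steps for p in pages}


# Hypothetical three-page web, used only for illustration.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
freq = simulate_surfer(graph)
print(freq)
```

The visit frequencies sum to 1, and page B, which has only one in-link carrying half of A's rank, is visited least often.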
Other Features
The text associated with a link (anchor text) is treated in a special way.
Google has location information for all hits.
Google keeps track of some visual presentation details, such as the font size of words; words in a larger or bolder font are weighted higher than other words.
The full raw HTML of pages is available in a repository.
Architecture Overview
URL server: sends URLs to be fetched to the crawlers.
Crawler: downloads web pages; the work is done by several distributed crawlers.
Store server: compresses and stores web pages.
Repository: each web page is associated with a docID.
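The flow from URL server to repository can be sketched as a toy, single-process pipeline. Everything named here (`Repository`, `crawl`, `fake_web`) is hypothetical, the "download" is a dictionary lookup rather than a real HTTP fetch, and zlib merely stands in for whatever compression the store server applies.

```python
import zlib


class Repository:
    """Stores compressed pages, each keyed by a numeric docID."""

    def __init__(self):
        self._pages = {}    # docID -> compressed HTML
        self._doc_ids = {}  # URL -> docID

    def store(self, url, html):
        # Reuse the docID if the URL was seen before, else assign the next one.
        doc_id = self._doc_ids.setdefault(url, len(self._doc_ids))
        self._pages[doc_id] = zlib.compress(html.encode("utf-8"))
        return doc_id

    def fetch(self, doc_id):
        return zlib.decompress(self._pages[doc_id]).decode("utf-8")


def crawl(url_queue, download, repository):
    """Play URL server, crawler and store server in a single loop."""
    return [repository.store(url, download(url)) for url in url_queue]


# A stand-in for the Web: URL -> page content.
fake_web = {
    "http://example.com/a": "<html>page a</html>",
    "http://example.com/b": "<html>page b</html>",
}
repo = Repository()
doc_ids = crawl(list(fake_web), fake_web.get, repo)
print(doc_ids, repo.fetch(doc_ids[0]))
```

Keeping the raw HTML retrievable by docID is what lets later stages, such as indexing, work from the repository instead of fetching pages again.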
Searching
The goal of searching is to provide quality search results. Response time is not the present focus of the paper. To bound the response time, the search currently stops once 40,000 matching documents have been retrieved.
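The early cut-off can be illustrated with a short sketch. The `search` function, the scoring function, and the stream of document IDs are assumptions made up for the example; only the 40,000 limit comes from the text.

```python
from itertools import islice


def search(matching_docs, score, limit=40_000, top_k=10):
    """Rank at most `limit` matching documents and return the best top_k."""
    # Stop pulling matches once the limit is reached, however many remain.
    candidates = list(islice(matching_docs, limit))
    return sorted(candidates, key=score, reverse=True)[:top_k]


# Hypothetical stream of 100,000 matching docIDs with a made-up score.
docs = iter(range(100_000))
results = search(docs, score=lambda doc_id: doc_id % 1000, top_k=3)
print(results)
```

The trade-off is that a highly ranked document beyond the first 40,000 matches is never considered, so the results can be sub-optimal in exchange for a bounded amount of work.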
Feedback
Figuring out the right values for the type-weights and type-prox-weights is a challenge. For this reason the authors introduced a user feedback mechanism: the ranking function is adjusted in light of the feedback users give on the results.
Now, if we run the same query as earlier, "Stony Brook University", we expect the home page of Stony Brook University to appear in the search results. Because the main purpose of the paper was to return relevant search results, the focus was not on retrieval time, but the authors state that the system will be made more efficient and delivered as a commercial product.
Conclusion
Google is designed to be a scalable search engine. Its primary goal is to provide high-quality search results over a rapidly growing World Wide Web. Google employs a number of techniques to improve search quality, including PageRank, anchor text, and proximity information. Google is a complete architecture for gathering web pages, indexing them, and performing search queries over them.