IRWM: Assignment 1: How Does the Google Search Engine Work?
Ganesh B. Solanke
1. Crawling and Indexing
Google navigates the web by crawling: it follows links from page to page, sorts pages by their content and other factors, and keeps track of it all in its index.
These processes lay the foundation: they gather and organize information on the web so that Google can return the most useful results to users. The index is well over 100,000,000 gigabytes, and Google has spent over one million computing hours building it.
Google uses software known as “web crawlers” to discover publicly available web pages; the most well-known crawler is called “Googlebot”. Crawlers look at web pages and follow links on those pages, much as we would if we were browsing content on the web. They go from link to link and bring data about those web pages back to Google’s servers.
The crawl process begins with a list of web addresses from past crawls and sitemaps provided by website owners. As the crawlers visit these websites, they look for links to other pages to visit. The software pays special attention to new sites, changes to existing sites, and dead links. Computer programs determine which sites to crawl, how often, and how many pages to fetch from each site. Google doesn’t accept payment to crawl a site more frequently for its web search results. It cares more about having the best possible results, because in the long run that’s what’s best for users and, therefore, for its business.
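The crawl loop described above can be sketched roughly as follows; the breadth-first strategy, seed list and helper names are illustrative assumptions, not Google’s actual implementation:

# Minimal crawl-loop sketch (illustrative only; Googlebot's real scheduler is far more complex).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags on a fetched page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl: start from past-crawl/sitemap seeds and follow links."""
    frontier = deque(seed_urls)        # URLs waiting to be fetched
    seen = set(seed_urls)              # avoid re-fetching the same URL
    pages = {}                         # url -> raw HTML brought back to the servers
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue                   # dead or unusable link: note it and move on
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages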
In the original Google architecture, the sorter takes the barrels, which are sorted by docID, and re-sorts them by wordID to generate the inverted index. This is done in place so that little temporary space is needed for the operation. The sorter also produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list, together with the lexicon produced by the indexer, and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.
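The barrel-to-inverted-index step can be pictured with a small sketch; the data layout below (tuples of docID, wordID and position) is a deliberate simplification of the structures described in the original paper:

# Sketch of the sorter's job: barrels hold hits ordered by docID; the inverted
# index needs them ordered by wordID. The data layout is a simplification.
from collections import defaultdict

def build_inverted_index(barrels):
    """barrels: iterable of (docID, wordID, position) hits, sorted by docID.
    Returns wordID -> list of (docID, position) postings, i.e. the inverted index."""
    inverted = defaultdict(list)
    for doc_id, word_id, position in barrels:
        inverted[word_id].append((doc_id, position))
    # Re-sorting by wordID (the dict keys) yields the searcher-facing order.
    return {word_id: sorted(postings) for word_id, postings in sorted(inverted.items())}

barrels = [(1, 7, 0), (1, 42, 3), (2, 7, 5), (3, 42, 1)]   # toy forward data
index = build_inverted_index(barrels)
# index[7] -> [(1, 0), (2, 5)]: every document (and position) where word 7 occurs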
The web is like an ever-growing public library with billions of books and no central filing system. Google essentially gathers the pages during the crawl process and then creates an index, so it knows exactly how to look things up. Much like the index in the back of a book, the Google index includes information about words and their locations. When people search, at the most basic level, Google’s algorithms look up the search terms in the index to find the appropriate pages.
The search process gets much more complex from there. Google’s indexing systems note
many different aspects of pages, such as when they were published, whether they contain
pictures and videos, and much more. With the Knowledge Graph, Google is continuing to go
beyond keyword matching to better understand the people, places and things people care about.
New sites, changes to existing sites, and dead links are noted and used to update the Google index.
Most websites don’t need to set up restrictions for crawling, indexing or serving, so their pages are eligible to appear in search results without any extra work. That said, site owners have many choices about how Google crawls and indexes their sites, through Webmaster Tools and a file called “robots.txt”. With the robots.txt file, site owners can choose not to be crawled at all, or give more specific instructions about which of their pages may be crawled.
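How a polite crawler honours robots.txt can be illustrated with Python’s standard library; the URL and user-agent string below are placeholders:

# Checking robots.txt before fetching, as a well-behaved crawler would.
# "example.com" and the user-agent string are placeholders.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("MyCrawler", "https://example.com/private/page.html"):
    print("Allowed to fetch")
else:
    print("Disallowed by robots.txt; skip this page")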
2. Algorithms and Ranking
There are many components to the search process and the results page, and Google constantly updates its technologies and systems to deliver better results. Many of these changes involve exciting new innovations, such as the Knowledge Graph or Google Instant. There are other important systems that Google constantly tunes and refines. The following list of projects provides a glimpse into the many different aspects of search. Some of them are:
• Autocomplete: Predicts what a user might be searching for. This includes understanding terms with more than one meaning (a minimal prefix-matching sketch follows this list).
• Freshness: Shows the latest news and information. This includes gathering timely results when you’re searching for specific dates.
• Google Instant: Displays immediate results as you type.
• Indexing: Uses systems for collecting and storing documents on the web.
• Mobile: Includes improvements designed specifically for mobile devices, such as tablets
and smart phones.
• Query Understanding: Gets to the deeper meaning of the words you type.
• Refinements: Provides features like “Advanced Search,” related searches, and other
search tools, all of which help you fine-tune your search.
• Safe Search: Reduces the number of adult web pages, images, and videos in the results.
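As promised in the Autocomplete item above, a minimal prefix-matching sketch of the idea follows; the query log and popularity counts are invented for illustration:

# Toy autocomplete: suggest past queries that start with what the user has typed.
import bisect

query_log = sorted([                      # (query, popularity); invented data
    ("how does google search work", 950),
    ("how does gps work", 430),
    ("how does the stock market work", 610),
    ("weather today", 1200),
])

def autocomplete(prefix, k=3):
    """Return up to k logged queries starting with `prefix`, most popular first."""
    queries = [q for q, _ in query_log]
    start = bisect.bisect_left(queries, prefix)
    matches = []
    for q, count in query_log[start:]:
        if not q.startswith(prefix):
            break
        matches.append((count, q))
    return [q for _, q in sorted(matches, reverse=True)[:k]]

print(autocomplete("how does g"))  # ['how does google search work', 'how does gps work']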
Based on all of the above signals, Google pulls the relevant documents from the index and ranks them. After the ranking process, they are returned to the user as query results.
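A toy sketch of that retrieve-then-rank step is given below; the in-memory index, the precomputed PageRank values and the scoring rule (term frequency multiplied by PageRank) are invented for illustration:

# Toy retrieve-and-rank: look up each query term, intersect the posting lists,
# then order the surviving documents by an invented combined score.
index = {                    # term -> {docID: term frequency}; toy data
    "google": {1: 3, 2: 1},
    "search": {1: 2, 2: 4, 3: 1},
}
pagerank = {1: 0.5, 2: 0.3, 3: 0.2}   # assumed precomputed link-based scores

def search(query_terms):
    # Retrieve: documents containing every query term.
    candidates = None
    for term in query_terms:
        docs = set(index.get(term, {}))
        candidates = docs if candidates is None else candidates & docs
    candidates = candidates or set()
    # Rank: combine a text signal with the link-based signal.
    def score(doc):
        tf = sum(index[t].get(doc, 0) for t in query_terms)
        return tf * pagerank.get(doc, 0.0)
    return sorted(candidates, key=score, reverse=True)

print(search(["google", "search"]))   # [1, 2]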
Google’s most important feature is PageRank, a method that determines the “importance” of a web page by analyzing which other pages link to it, as well as other data; it is derived from a link-analysis algorithm. It is not possible for a user to go through the millions of pages returned by a search, so pages are weighted according to their priority and presented in the order of their weights and importance. PageRank is an excellent way to prioritize the results of web keyword searches. It is essentially a numeric value that represents how important a web page is on the web, and it is calculated by counting citations or backlinks to a given page.
The PageRanks form a probability distribution over web pages, so the PageRanks of all web pages sum to one.
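A short sketch of how PageRank values that sum to one can be computed by power iteration is shown below; the tiny link graph and the damping factor of 0.85 are textbook choices, not Google’s production values:

# Power-iteration PageRank sketch. With the (1 - d)/N term, the scores sum to one.
def pagerank(links, d=0.85, iterations=50):
    """links: page -> list of pages it links to. Returns page -> PageRank."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}              # start from a uniform distribution
    for _ in range(iterations):
        new_pr = {p: (1 - d) / n for p in pages}
        for p, outlinks in links.items():
            if not outlinks:                      # dangling page: spread rank evenly
                for q in pages:
                    new_pr[q] += d * pr[p] / n
            else:
                for q in outlinks:
                    new_pr[q] += d * pr[p] / len(outlinks)
        pr = new_pr
    return pr

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
print(ranks, sum(ranks.values()))                 # the values sum to (about) 1.0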
Anchor Text
Anchor text is the visible, highlighted, clickable text displayed for a hyperlink in an HTML page. Search engines treat anchor text in a special way: it can influence the rank of the page it points to, because anchors often describe the target page more accurately than that page describes itself. Anchors can also exist for documents that cannot be indexed by a text-based search engine, such as images, programs, and databases.
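The idea of crediting anchor text to the page it points to, rather than the page containing the link, can be sketched as follows; the HTML snippet is made up for illustration:

# Sketch of attributing anchor text to the *target* page.
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Maps target URL -> list of anchor texts pointing at it."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.current_target = None
        self.anchors = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.current_target = urljoin(self.base_url, href)

    def handle_data(self, data):
        if self.current_target and data.strip():
            self.anchors[self.current_target].append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a":
            self.current_target = None


html = '<p>See the <a href="/cats.jpg">photo of my cat</a> for details.</p>'
collector = AnchorCollector("https://example.com/")
collector.feed(html)
print(dict(collector.anchors))  # {'https://example.com/cats.jpg': ['photo of my cat']}

This is why an image such as cats.jpg, which contains no indexable text of its own, can still be found for the query “photo of my cat”.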
Other Features:
Google keeps location information for all hits, i.e. a set of all word occurrences, so it makes extensive use of proximity in searching. It also keeps information about some visual presentation details, such as font size: words in a larger or bolder font are weighted higher than other words. The full raw HTML of pages is available in a repository.
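A toy scoring function that uses these visual cues and proximity might look like the following; the hit format and weight values are invented assumptions, not Google’s actual formula:

# Toy document scoring from hits: (position, weight), where bigger/bolder words
# get a larger weight. Weights and the proximity rule are invented.
def score_document(hits_per_term):
    """hits_per_term: list (one entry per query term) of [(position, weight), ...]."""
    # Visual-presentation signal: sum of the hit weights.
    visual = sum(w for hits in hits_per_term for _, w in hits)
    # Proximity signal: reward the first two query terms occurring close together.
    proximity = 0.0
    if len(hits_per_term) >= 2:
        best_gap = min(
            abs(p1 - p2)
            for p1, _ in hits_per_term[0]
            for p2, _ in hits_per_term[1]
        )
        proximity = 1.0 / (1 + best_gap)
    return visual + proximity

# "google" at position 4 in bold (weight 2.0) and "search" right after it.
print(score_document([[(4, 2.0)], [(5, 1.0)]]))   # 3.0 visual + 0.5 proximity = 3.5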
Result Serving
Results are served to the user in different forms, such as images, audio, text, videos, links, Knowledge Graph panels, snippets, news, thumbnails, voice results and more, in roughly one-eighth of a second. Some of these forms are described below.
• Snippets: Shows small previews of information, such as a page’s title and short
descriptive text, about each search result.
• Knowledge Graph: Provides results based on a database of real world people, places,
things, and the connections between them.
• News: Includes results from online newspapers and blogs from around the world.
• Answers: Displays immediate answers and information for things such as the weather,
sports scores and quick facts.
• Videos: Shows video-based results with thumbnails so you can quickly decide which
video to watch.
• Images: Shows you image-based results with thumbnails so you can decide which page to
visit from just a glance.
• Books: Finds results out of millions of books, including previews and text, from libraries
and publishers worldwide.
3. Fighting Spam
It fights spam through a combination of computer algorithms and manual review. Spam
sites attempt to game their way to the top of search results through techniques like repeating
keywords over and over, buying links that pass PageRank or putting invisible text on the screen.
This is bad for search because relevant websites get buried, and it’s bad for legitimate website
owners because their sites become harder to find. The good news is that Google's algorithms can
detect the vast majority of spam and demote it automatically. For the rest, they have teams who
manually review sites.
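One of the spam signals mentioned above, keyword stuffing, can be sketched with a naive check; the 20% threshold is arbitrary, and real spam systems combine many more signals:

# Naive keyword-stuffing check: flag a page when a single word dominates its text.
from collections import Counter
import re

def looks_keyword_stuffed(text, threshold=0.20):
    words = re.findall(r"[a-z']+", text.lower())
    if len(words) < 20:
        return False                      # too little text to judge
    most_common_count = Counter(words).most_common(1)[0][1]
    return most_common_count / len(words) > threshold

spammy = "cheap watches " * 30 + "buy now"
print(looks_keyword_stuffed(spammy))      # True: 'cheap' is roughly half of all words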
Identifying Spam
Spam sites come in all shapes and sizes. Some sites are automatically-generated gibberish
that no human could make sense of. Of course, it also sees sites using subtler spam techniques.
While its algorithms address the vast majority of spam, it addresses other spam manually
to prevent it from affecting the quality of your results. The numbers may look large out of
context, but the web is a really big place. A recent snapshot of its index showed that about 0.22%
of domains had been manually marked for removal.
When it takes manual action on a website, it tries to alert the site's owner to help him or
her address issues. It wants website owners to have the information they need to get their sites in
shape. That’s why, over time, it has invested substantial resources in webmaster communication
and outreach.
Manual actions don’t last forever. Once website owners clean up their sites to remove spammy content, they can ask Google to review the site again by filing a reconsideration request. Google processes all of the reconsideration requests it receives and communicates along the way to let site owners know how things are going. Historically, most sites that have submitted reconsideration requests
are not actually affected by any manual spam action. Often these sites are simply experiencing
the natural ebb and flow of online traffic, an algorithmic change, or perhaps a technical problem
preventing Google from accessing site content.