IRWM: Assignment 1: How Does the Google Search Engine Work?
Ganesh B. Solanke
1. Crawling and Indexing
Google navigates the web by crawling: it follows links from page to page, sorts pages by their content and other factors, and keeps track of it all in its index.
These processes lay the foundation: they gather and organize information on the web so that Google can return the most useful results to users. The index is well over 100,000,000 gigabytes, and Google has spent over one million computing hours building it.
Google uses software known as “web crawlers” to discover publicly available web pages; the most well-known crawler is called “Googlebot”. Crawlers look at web pages and follow links on those pages, much as we would if we were browsing content on the web. They go from link to link and bring data about those web pages back to Google’s servers.
The crawl process begins with a list of web addresses from past crawls and sitemaps provided by website owners. As the crawlers visit these websites, they look for links to other pages to visit. The software pays special attention to new sites, changes to existing sites, and dead links. Computer programs determine which sites to crawl, how often, and how many pages to fetch from each site. Google doesn’t accept payment to crawl a site more frequently for its web search results. It cares more about having the best possible results, because in the long run that’s what’s best for users and, therefore, for its business.
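The crawl loop described above can be sketched roughly as follows; the breadth-first strategy, seed list and helper names are illustrative assumptions, not Google’s actual implementation:

# Minimal crawl-loop sketch (illustrative only; Googlebot's real scheduler is far more complex).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags on a fetched page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl: start from past-crawl/sitemap seeds and follow links."""
    frontier = deque(seed_urls)        # URLs waiting to be fetched
    seen = set(seed_urls)              # avoid re-fetching the same URL
    pages = {}                         # url -> raw HTML brought back to the servers
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue                   # dead or unusable link: note it and move on
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages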
In the original Google architecture, the sorter takes the barrels, which are sorted by docID, and re-sorts them by wordID to generate the inverted index. This is done in place so that little temporary space is needed for the operation. The sorter also produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list, together with the lexicon produced by the indexer, and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.
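The barrel-to-inverted-index step can be pictured with a small sketch; the data layout below (tuples of docID, wordID and position) is a deliberate simplification of the structures described in the original paper:

# Sketch of the sorter's job: barrels hold hits ordered by docID; the inverted
# index needs them ordered by wordID. The data layout is a simplification.
from collections import defaultdict

def build_inverted_index(barrels):
    """barrels: iterable of (docID, wordID, position) hits, sorted by docID.
    Returns wordID -> list of (docID, position) postings, i.e. the inverted index."""
    inverted = defaultdict(list)
    for doc_id, word_id, position in barrels:
        inverted[word_id].append((doc_id, position))
    # Re-sorting by wordID (the dict keys) yields the searcher-facing order.
    return {word_id: sorted(postings) for word_id, postings in sorted(inverted.items())}

barrels = [(1, 7, 0), (1, 42, 3), (2, 7, 5), (3, 42, 1)]   # toy forward data
index = build_inverted_index(barrels)
# index[7] -> [(1, 0), (2, 5)]: every document (and position) where word 7 occurs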
The web is like an ever-growing public library with billions of books and no central filing system. Google essentially gathers the pages during the crawl process and then creates an index, so it knows exactly how to look things up. Much like the index in the back of a book, the Google index includes information about words and their locations. When people search, at the most basic level, Google’s algorithms look up the search terms in the index to find the appropriate pages.
The search process gets much more complex from there. Google’s indexing systems note
many different aspects of pages, such as when they were published, whether they contain
pictures and videos, and much more. With the Knowledge Graph, Google is continuing to go
beyond keyword matching to better understand the people, places and things people care about.
New sites, changes to existing sites, and dead links are noted and used to update the Google index.
Most websites don’t need to set up restrictions for crawling, indexing or serving, so their pages are eligible to appear in search results without any extra work. That said, site owners have many choices about how Google crawls and indexes their sites, through Webmaster Tools and a file called “robots.txt”. With the robots.txt file, site owners can choose not to be crawled at all, or give more specific instructions about which of their pages may be crawled.
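How a polite crawler honours robots.txt can be illustrated with Python’s standard library; the URL and user-agent string below are placeholders:

# Checking robots.txt before fetching, as a well-behaved crawler would.
# "example.com" and the user-agent string are placeholders.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("MyCrawler", "https://example.com/private/page.html"):
    print("Allowed to fetch")
else:
    print("Disallowed by robots.txt; skip this page")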
2. Algorithms and Ranking
There are many components to the search process and the results page, and Google constantly updates its technologies and systems to deliver better results. Many of these changes involve exciting new innovations, such as the Knowledge Graph or Google Instant. There are other important systems that Google constantly tunes and refines. The following list of projects provides a glimpse into the many different aspects of search. Some of them are:
• Autocomplete: Predicts what a user might be searching for. This includes understanding terms with more than one meaning (a minimal prefix-matching sketch follows this list).
• Freshness: Shows the latest news and information. This includes gathering timely results when you’re searching for specific dates.
• Google Instant: Displays immediate results as you type.
• Indexing: Uses systems for collecting and storing documents on the web.
• Mobile: Includes improvements designed specifically for mobile devices, such as tablets
and smart phones.
• Query Understanding: Gets to the deeper meaning of the words you type.
• Refinements: Provides features like “Advanced Search,” related searches, and other
search tools, all of which help you fine-tune your search.
• Safe Search: Reduces the number of adult web pages, images, and videos in the results.
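As promised in the Autocomplete item above, a minimal prefix-matching sketch of the idea follows; the query log and popularity counts are invented for illustration:

# Toy autocomplete: suggest past queries that start with what the user has typed.
import bisect

query_log = sorted([                      # (query, popularity); invented data
    ("how does google search work", 950),
    ("how does gps work", 430),
    ("how does the stock market work", 610),
    ("weather today", 1200),
])

def autocomplete(prefix, k=3):
    """Return up to k logged queries starting with `prefix`, most popular first."""
    queries = [q for q, _ in query_log]
    start = bisect.bisect_left(queries, prefix)
    matches = []
    for q, count in query_log[start:]:
        if not q.startswith(prefix):
            break
        matches.append((count, q))
    return [q for _, q in sorted(matches, reverse=True)[:k]]

print(autocomplete("how does g"))  # ['how does google search work', 'how does gps work']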
Based on all of the above signals, Google pulls the relevant documents from the index and ranks them. After the ranking process, they are returned to the user as query results.
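A toy sketch of that retrieve-then-rank step is given below; the in-memory index, the precomputed PageRank values and the scoring rule (term frequency multiplied by PageRank) are invented for illustration:

# Toy retrieve-and-rank: look up each query term, intersect the posting lists,
# then order the surviving documents by an invented combined score.
index = {                    # term -> {docID: term frequency}; toy data
    "google": {1: 3, 2: 1},
    "search": {1: 2, 2: 4, 3: 1},
}
pagerank = {1: 0.5, 2: 0.3, 3: 0.2}   # assumed precomputed link-based scores

def search(query_terms):
    # Retrieve: documents containing every query term.
    candidates = None
    for term in query_terms:
        docs = set(index.get(term, {}))
        candidates = docs if candidates is None else candidates & docs
    candidates = candidates or set()
    # Rank: combine a text signal with the link-based signal.
    def score(doc):
        tf = sum(index[t].get(doc, 0) for t in query_terms)
        return tf * pagerank.get(doc, 0.0)
    return sorted(candidates, key=score, reverse=True)

print(search(["google", "search"]))   # [1, 2]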
Google’s most important feature is PageRank, a method that determines the “importance” of a web page by analyzing which other pages link to it, as well as other data; it is derived from a link-analysis algorithm. It is not possible for a user to go through the millions of pages returned by a search, so pages are weighted according to their priority and presented in the order of their weights and importance. PageRank is an excellent way to prioritize the results of web keyword searches. It is essentially a numeric value that represents how important a web page is on the web, and it is calculated by counting citations or backlinks to a given page.
The PageRanks form a probability distribution over web pages, so the PageRanks of all web pages sum to one.
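A short sketch of how PageRank values that sum to one can be computed by power iteration is shown below; the tiny link graph and the damping factor of 0.85 are textbook choices, not Google’s production values:

# Power-iteration PageRank sketch. With the (1 - d)/N term, the scores sum to one.
def pagerank(links, d=0.85, iterations=50):
    """links: page -> list of pages it links to. Returns page -> PageRank."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}              # start from a uniform distribution
    for _ in range(iterations):
        new_pr = {p: (1 - d) / n for p in pages}
        for p, outlinks in links.items():
            if not outlinks:                      # dangling page: spread rank evenly
                for q in pages:
                    new_pr[q] += d * pr[p] / n
            else:
                for q in outlinks:
                    new_pr[q] += d * pr[p] / len(outlinks)
        pr = new_pr
    return pr

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
print(ranks, sum(ranks.values()))                 # the values sum to (about) 1.0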
Anchor Text
Anchor text is the visible, highlighted, clickable text displayed for a hyperlink in an HTML page. Search engines treat anchor text in a special way: it can influence the rank of the page it points to, because anchors often describe the target page more accurately than that page describes itself. Anchors can also exist for documents that cannot be indexed by a text-based search engine, such as images, programs, and databases.
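The idea of crediting anchor text to the page it points to, rather than the page containing the link, can be sketched as follows; the HTML snippet is made up for illustration:

# Sketch of attributing anchor text to the *target* page.
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Maps target URL -> list of anchor texts pointing at it."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.current_target = None
        self.anchors = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.current_target = urljoin(self.base_url, href)

    def handle_data(self, data):
        if self.current_target and data.strip():
            self.anchors[self.current_target].append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a":
            self.current_target = None


html = '<p>See the <a href="/cats.jpg">photo of my cat</a> for details.</p>'
collector = AnchorCollector("https://example.com/")
collector.feed(html)
print(dict(collector.anchors))  # {'https://example.com/cats.jpg': ['photo of my cat']}

This is why an image such as cats.jpg, which contains no indexable text of its own, can still be found for the query “photo of my cat”.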
Other Features:
Google keeps location information for all hits, i.e. a set of all word occurrences, so it makes extensive use of proximity in searching. It also keeps information about some visual presentation details, such as font size: words in a larger or bolder font are weighted higher than other words. The full raw HTML of pages is available in a repository.
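A toy scoring function that uses these visual cues and proximity might look like the following; the hit format and weight values are invented assumptions, not Google’s actual formula:

# Toy document scoring from hits: (position, weight), where bigger/bolder words
# get a larger weight. Weights and the proximity rule are invented.
def score_document(hits_per_term):
    """hits_per_term: list (one entry per query term) of [(position, weight), ...]."""
    # Visual-presentation signal: sum of the hit weights.
    visual = sum(w for hits in hits_per_term for _, w in hits)
    # Proximity signal: reward the first two query terms occurring close together.
    proximity = 0.0
    if len(hits_per_term) >= 2:
        best_gap = min(
            abs(p1 - p2)
            for p1, _ in hits_per_term[0]
            for p2, _ in hits_per_term[1]
        )
        proximity = 1.0 / (1 + best_gap)
    return visual + proximity

# "google" at position 4 in bold (weight 2.0) and "search" right after it.
print(score_document([[(4, 2.0)], [(5, 1.0)]]))   # 3.0 visual + 0.5 proximity = 3.5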
Result Serving
Results are served to the user in different forms, such as images, audio, text, videos, links, Knowledge Graph panels, snippets, news, thumbnails, voice results and more, in roughly one-eighth of a second. Some of these forms are described below.
• Snippets: Shows small previews of information, such as a page’s title and short
descriptive text, about each search result.
• Knowledge Graph: Provides results based on a database of real world people, places,
things, and the connections between them.
• News: Includes results from online newspapers and blogs from around the world.
• Answers: Displays immediate answers and information for things such as the weather,
sports scores and quick facts.
• Videos: Shows video-based results with thumbnails so you can quickly decide which
video to watch.
• Images: Shows you image-based results with thumbnails so you can decide which page to
visit from just a glance.
• Books: Finds results out of millions of books, including previews and text, from libraries
and publishers worldwide.
3. Fighting Spam
It fights spam through a combination of computer algorithms and manual review. Spam
sites attempt to game their way to the top of search results through techniques like repeating
keywords over and over, buying links that pass PageRank or putting invisible text on the screen.
This is bad for search because relevant websites get buried, and it’s bad for legitimate website
owners because their sites become harder to find. The good news is that Google's algorithms can
detect the vast majority of spam and demote it automatically. For the rest, they have teams who
manually review sites.
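One of the spam signals mentioned above, keyword stuffing, can be sketched with a naive check; the 20% threshold is arbitrary, and real spam systems combine many more signals:

# Naive keyword-stuffing check: flag a page when a single word dominates its text.
from collections import Counter
import re

def looks_keyword_stuffed(text, threshold=0.20):
    words = re.findall(r"[a-z']+", text.lower())
    if len(words) < 20:
        return False                      # too little text to judge
    most_common_count = Counter(words).most_common(1)[0][1]
    return most_common_count / len(words) > threshold

spammy = "cheap watches " * 30 + "buy now"
print(looks_keyword_stuffed(spammy))      # True: 'cheap' is roughly half of all words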
Identifying Spam
Spam sites come in all shapes and sizes. Some sites are automatically-generated gibberish
that no human could make sense of. Of course, it also sees sites using subtler spam techniques.
While its algorithms address the vast majority of spam, it addresses other spam manually
to prevent it from affecting the quality of your results. The numbers may look large out of
context, but the web is a really big place. A recent snapshot of its index showed that about 0.22%
of domains had been manually marked for removal.
When it takes manual action on a website, it tries to alert the site's owner to help him or
her address issues. It wants website owners to have the information they need to get their sites in
shape. That’s why, over time, it has invested substantial resources in webmaster communication
and outreach.
Manual actions don’t last forever. Once website owners clean up their sites to remove spammy content, they can ask Google to review the site again by filing a reconsideration request. Google processes all of the reconsideration requests it receives and communicates along the way to let site owners know how things are going. Historically, most sites that have submitted reconsideration requests
are not actually affected by any manual spam action. Often these sites are simply experiencing
the natural ebb and flow of online traffic, an algorithmic change, or perhaps a technical problem
preventing Google from accessing site content.