
Contextualized Online Search and Research Skills

 
Search Engine
A search engine is a website through which users can search the content
available on the Internet. Users enter the desired keywords into a search
field; the search engine then looks through its index for relevant web pages
and displays them in the form of a list.

What is the Web?


The World Wide Web—usually called the Web for short—is a collection of
different websites you can access through the Internet. A website is made up
of related text, images, and other resources. Websites can resemble other
forms of media—like newspaper articles or television programs—or they can
be interactive in a way that's unique to computers.
Internet
The internet is a globally connected network system facilitating worldwide
communication and access to data resources through a vast collection of private,
public, business, academic, and government networks. It is governed by agencies
like the Internet Assigned Numbers Authority (IANA) that establish universal
protocols.
The terms internet and World Wide Web are often used interchangeably, but they
are not exactly the same thing; the internet refers to the global communication
system, including hardware and infrastructure, while the web is one of the
services communicated over the internet.
 
How Do Search Engines work?
 
To be effective, search engines need to understand exactly what kind of
information is available and present it to users logically. The way they
accomplish this is through three fundamental actions: crawling, indexing, and
ranking.

Search engine process flow


 
Through these actions, they discover newly published content, store the
information on their servers, and organize it for your consumption. Let’s break
down what happens during each of these actions:
• Crawl: Search engines send out web crawlers, also known as bots or spiders, to review website content. Paying close attention to new websites and to existing content that has recently been changed, web crawlers look at data such as URLs, sitemaps, and code to discover the types of content being displayed.
• Index: Once a website has been crawled, the search engines need to decide how to organize the information. The indexing process is when they review website data for positive or negative ranking signals and store it in the correct location on their servers.
• Rank: During the indexing process, search engines start making decisions on where to display specific content on the search engine results page (SERP). Ranking is accomplished by assessing a number of different factors based on an end user's query for quality and relevancy. (A small code sketch of the index and rank steps follows this list.)
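To make the index and rank steps concrete, here is a minimal Python sketch that stands in for a real pipeline. Everything in it is an illustrative assumption: the page URLs and text are invented, and the term-count scoring is a stand-in for the far richer signals real engines use.

```python
from collections import defaultdict

# Stand-in for crawl output: invented URLs mapped to their page text.
crawled_pages = {
    "example.com/a": "wireless headphones review best picks",
    "example.com/b": "wired headphones buying guide",
    "example.com/c": "best wireless headphones compared",
}

# Index: build an inverted index mapping each term to the pages containing it.
index = defaultdict(set)
for url, text in crawled_pages.items():
    for term in text.split():
        index[term].add(url)

# Rank: score each page by how many query terms it contains.
def rank(query):
    scores = defaultdict(int)
    for term in query.split():
        for url in index.get(term, set()):
            scores[url] += 1
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(rank("best wireless headphones"))
# e.g. [('example.com/a', 3), ('example.com/c', 3), ('example.com/b', 1)]
```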
Breaking Down Search Engine Algorithms by Platform
 
Google Search Algorithm
Google is the most popular search engine on the planet. Their search engine
routinely own above 90% of the market, resulting in approximately 3.5 billion of
individual searches on their platform every day. While notoriously tight-lipped
about how their algorithm works, Googles does provide some high-level context
about how they prioritize websites in the results page.
New websites are created every day. Google can find these pages by following
links from existing content they’ve crawled previously, or when a website owner
submits their sitemap directly. Any updates to existing content can also be
submitted to Google by asking them to recrawl a specific URL. This is done
through Google’s Search Console.
While Google doesn’t state how often sites are crawled, any new content that is
linked to existing content will be found eventually as well.
Once the web crawlers gather enough information, they bring it back to Google
for indexing.
Indexing starts by analyzing website data, including written content, images,
videos, and technical site structure. Google is looking for positive and negative
ranking signals such as keywords and website freshness to try and understand
what any page they crawled is all about.
Google’s website index contains billions of pages and 100,000,000 gigabytes of
data. To organize this information, Google uses a machine-learning algorithm
called RankBrain and a knowledge base called Knowledge Graph. This all works
together to help Google provide the most relevant content possible for
users. Once the indexing is complete, they move on to the ranking action.
Everything that takes place up to this point is done in the background, before a
user ever interacts with Google’s search functionality. Ranking is the action that
occurs based on what a user is searching for. Google looks at five major
factors when someone performs a search:
• Query meaning: This determines the intent of any end user’s question. Google uses this to determine exactly what someone is looking for when they perform a search. They parse each query using complex language models built on past searches and usage behavior.
• Web page relevance: Once Google has determined the intent of a user’s search query, they review the content of ranking web pages to figure out which one is the most relevant. The primary driver for this is keyword analysis. The keywords on a website have to match Google’s understanding of the question a user asked.
• Content quality: With keywords matched, Google takes it a step further and reviews the quality of the content on the requisite web pages. This helps them prioritize which results come first by looking at the authority of a given website as well as its page rank and freshness.
• Web page usability: Google gives ranking priority to websites that are easy to use. Usability covers everything from site speed to responsiveness.
• Additional context and settings: This step tailors searches to past user engagement and specific settings within the Google platform. (A toy sketch of combining factors like these follows this list.)
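As a rough illustration only, the sketch below combines per-page scores for several of the factors above into a single ordering. The pages, scores, and weights are all invented; Google's actual signals and weighting are not public.

```python
# Invented weights for illustration; Google's real weighting is unknown.
WEIGHTS = {"relevance": 0.4, "quality": 0.3, "usability": 0.2, "context": 0.1}

# Hypothetical per-page factor scores in the range 0.0-1.0.
pages = {
    "example.com/reviews": {"relevance": 0.9, "quality": 0.8, "usability": 0.7, "context": 0.5},
    "example.com/shop": {"relevance": 0.7, "quality": 0.6, "usability": 0.9, "context": 0.8},
}

def combined_score(factors):
    # Weighted sum of the individual factor scores.
    return sum(WEIGHTS[name] * value for name, value in factors.items())

for url in sorted(pages, key=lambda u: combined_score(pages[u]), reverse=True):
    print(f"{combined_score(pages[url]):.2f}  {url}")
# 0.79  example.com/reviews
# 0.72  example.com/shop
```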


Once all of this information has been processed, Google will provide results that look something like this:


 
Google search for “best wireless headphones 2019”
Let’s break down these results:
• User query: The question a user asked Google.
• Google Shopping: Google considers the intent of this query as someone searching for products to purchase. As a result, they pull products from their index that match this intent and display them first in the results.
• Featured snippet: A result of the Knowledge Graph. Google presents specific information from a SERP result to make it easier for users to review without leaving the results page.
• Top-ranking results: The first site listed in the results is the one Google thinks best matches the intent of a user’s query. The top-ranking result is the one that performs best, based on the five ranking factors we discussed earlier.
• People also ask: This box is another result of the Knowledge Graph. It gives users a quick way to move on to another search that might match their intent even better.
These results are possible only because Google has information stored on
each of these pages in their index. Before a user performs a search, Google
has reviewed websites to figure out what keywords and intent they match for.
That process makes it easy to populate the results page quickly when a
search is made and helps Google provide the most relevant content possible.
As the most popular search engine around, Google more or less built the
framework for how search engines look at content. Most marketers tailor their
content specifically to rank on Google, which means they’re potentially
missing out on other platforms.
Bing Search Algorithm
Bing, Microsoft’s proprietary search engine, uses an open-source vector-search
algorithm called Space Partition Tree And Graph (SPTAG) to surface results.
This means they’re going in a totally different direction from Google’s keyword-
based search.
Being open source means that anyone can look at the nuts-and-bolts code of
what makes up Bing’s search results and make comments. This open model is
antithetical to Google’s tight control of their algorithms. The code itself is
split into two modules—index builder and searcher:
• Index Builder: The code that works to categorize website information into vectors
• Searcher: The way that Bing makes connections between search queries and vectors in their index


The second big difference between Bing and Google is at the core of how the
information is stored and indexed. Instead of a keyword-first model, like Google,
Bing breaks down information into individual data points called vectors. A vector
is a numerical representation of a concept, and this concept is the basis for
Bing’s search structure.
Search queries for Bing are based on an algorithmic principle called Approximate
Nearest Neighbor, which uses deep learning and natural-language models to
provide faster results based on the proximity of certain vectors to one another.
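The snippet below is a simplified sketch of the nearest-neighbor idea, not SPTAG itself (which partitions the vector space into trees and graphs to avoid comparing against every vector). It does a brute-force cosine-similarity search with NumPy over invented vectors.

```python
import numpy as np

# Invented 4-dimensional "concept" vectors for three documents.
doc_vectors = np.array([
    [0.9, 0.1, 0.0, 0.3],  # doc 0
    [0.2, 0.8, 0.1, 0.0],  # doc 1
    [0.8, 0.2, 0.1, 0.4],  # doc 2
])
query = np.array([0.85, 0.15, 0.05, 0.35])

# Cosine similarity: vectors pointing in similar directions score near 1.0.
similarities = doc_vectors @ query / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query)
)
print(np.argsort(similarities)[::-1])  # document indices, nearest first
```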

Graphical representation of Bing’s Approximate Nearest Neighbor algorithm (SPTAG)
If we look at the yellow dot as a user query, the green dots are its closest
neighbors, followed by the blue dots. Tracking the orange arrow, we can see how
Bing’s algorithm decides which information is most relevant to the user’s
search.
While the underlying principles driving Bing’s search structure are fundamentally
different, the process of building their database still follows the crawl, index, rank
actions.
Bing crawls websites to find new content or updates to existing content. They
then create vectors for that information to store in their index. From there, they
look at specific ranking factors. The biggest difference in comparison with Google
is that Bing does not include pages without ranking authority, meaning that new
pages have a more difficult time ranking if they don’t have backlinks to an
existing page with more authority.
If we look at the same search performed on Bing, the results are different:
 

Bing search results for “best wireless headphones 2019”


While the results look similar in their structure, Bing is pulling from different
websites for both their Shopping and their featured snippet selections. The
top-ranking result is also different from our search in Google, though both
match our intent quite well.
If you’re thinking about tailoring content for Bing, you should start by looking at
the differences between the top-ranking sites and featured snippets. Their platform
prioritizes content differently from Google, and these distinctions will help you
understand why.
DuckDuckGo Search Algorithm
DuckDuckGo is a bit of a maverick in the search engine market but is making
headway as the go-to search engine for anyone concerned about their data
privacy. While they have a proprietary web crawler called DuckDuckBot to scour
web-page content, much of the information DuckDuckGo shows on their results
page is compiled from 400+ additional third-party sources, including Bing, Yahoo,
and Wikipedia.
Unlike Google and Bing, DuckDuckGo does not capture personal information on
their users, including past search history and IP address. This dedication to
privacy in some ways makes their algorithm work harder to provide personalized
results.
For even more privacy, DuckDuckGo can also be used for completely
anonymous browsing using the Tor network or an onion service.
As a result of this focus on privacy, DuckDuckGo has the most streamlined
results page so far.

DuckDuckGo results for “best wireless headphones 2019”


Both Bing and DuckDuckGo have the same first and second results, which
makes sense, considering that Bing is included in DuckDuckGo’s search
algorithms.
DuckDuckGo’s 400+ additional sources also include computational databases
like WolframAlpha, a platform built primarily to answer complex mathematical
equations and provide tools for data analysis. Other sources come in the form
of Instant Answers, which pull content from relevant websites in an effort to
provide on-page answers, like the featured snippets we’ve seen from Google and
Bing.

Instant Answer from DuckDuckGo


The information in our example comes directly from Wikipedia.
DuckDuckGo doesn’t provide specific information on the different kinds of
ranking factors that go into these results pages but alludes to the fact that linking
to sites with good authority is something to consider.
Another interesting aspect of the DuckDuckGo platform is that they allow users to
use custom parameters called bangs to bypass the search results page entirely.
Because it pulls from multiple sources to display results, DuckDuckGo can act
as a direct search portal for platforms like Wikipedia, Amazon, and Twitter.
Because DuckDuckGo is a privacy-conscious platform, we can assume that past
searches are not part of its ranking algorithm. That, combined with the
informational aspects of their additional sources, makes for a platform that is
less personalized than Bing or Google but is still able to provide quality and
relevant content for their users. Tailoring content for Bing would work for this
platform as well.
YouTube Search Algorithm
YouTube is the most popular video-hosting website. Their search engine is
effectively run by rules similar to those of Google, which owns the platform, and it
focuses on keywords and relevancy. The algorithm is broken down into two
separate functions: ranking videos in search and surfacing relevant
recommendations.
The specific reasons why certain videos rank higher than others are, like all
Google properties, not outwardly defined. That said, most interpretations lean
toward newness of video and frequency of channel upload being the most
important factors.
In terms of recommendations, this research paper from 2016 lists the main
priorities for YouTube as scale, freshness, and noise:
• Scale: There are 300 hours of video uploaded to YouTube every minute, and the platform has approximately 1.3 billion users. This makes parsing information significantly more difficult, so the algorithm’s primary focus is finding ways to sift through this amount of data on a user-by-user basis.
• Freshness: YouTube balances how they recommend videos based on how recently a video was uploaded as well as on an individual user’s past behavior.
• Noise: Due to the varying amounts of content most users watch on YouTube, it is difficult for any AI to parse what is the most relevant at any time.
These factors result in a recommendations page that is tailored to each individual
user account.

YouTube recommendations on the home page


This also shows how Subscriptions factor into the way YouTube presents results.
When a user subscribes to a particular channel, that boosts its ranking in search
results, recommendations, and what to watch next.
Other ranking factors include what a user watches, how long they engage with
different videos, and what the overall popularity of a video on YouTube is.
Take a look at the results page for “best wireless headphones 2019.”

YouTube search results page for “best wireless headphones 2019”


The top result is the most-viewed video of the bunch. This is followed by a newer
upload with fewer views but an exact keyword match. The third video has more
views than the second, but no exact keyword match—it is also a slightly older
upload.
Part 2: Web Search Techniques and Strategies
The information you retrieve will depend on the search engine(s) and the
search term(s) you use. Check your search engine’s home page or initial screen
to find out its default or basic settings; look for “help”, “tips”, or “FAQs”.
Knowing the default settings may explain why your search results are not what
you expected.
Basic Searching Aids
Boolean Operators
Most search engines now offer Boolean capabilities. Boolean operators express
different and specific relationships between the words and phrases used in a
search.
AND limits a search by requiring that each term be present. For example, a
search on learning AND cognition specifies that you want information on BOTH
learning and cognition. If an article only has the term learning in it, it will not be
matched. Using AND will usually produce fewer hits.
OR expands the search by combining discrete terms into a conditional set.
Searching for learning OR cognition specifies that you want information on either
learning or cognition. Using OR usually produces the most hits.
NOT limits the search by specifying that a term not be present. Searching for
learning NOT training will find matches with the term learning but not training.
(The sketch below shows each operator as set logic.)
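The three operators map directly onto set operations over the documents each term matches. A minimal sketch with invented document IDs:

```python
# Invented sets of document IDs matched by each term.
learning = {1, 2, 3, 5}
cognition = {2, 4, 5}
training = {3, 6}

print(learning & cognition)  # AND: both terms present -> {2, 5}
print(learning | cognition)  # OR: either term present -> {1, 2, 3, 4, 5}
print(learning - training)   # NOT: learning but not training -> {1, 2, 5}
```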
Proximity Operators
With some search engines you can use proximity operators, such as OpenText's
NEAR operator or Webcrawler's ADJacent and FOLLOWED BY operators. With
each of these operators, word order is important. For example, placing terms in
square brackets, such as [learning theory], causes a hit if the terms are found
within 100 words of each other (Gray, 1966).
Truncation (*)
You can use truncation on most search engines. That is, you can use the
asterisk (*) operator to end a root word. For example, searching for teach* will
find teacher, teaching, and teachers. Note: the asterisk cannot be the first or
second letter of a root word.
Wildcard (?)
You can find words that share some but not all characters using the question
mark (?) operator. For example, Johns?n will find Johnson and Johnsen.
Note: the ? cannot be the first character in the search.
You may also use combinations of truncation (*) and the single-character
wildcard (?) in your searches. (Both are translated into regular expressions in
the sketch below.)
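Both operators behave much like simple regular expressions. The sketch below translates the examples above into Python's re module; this is an analogy for how the matching works, not any engine's actual implementation.

```python
import re

words = ["teach", "teacher", "teaching", "teachers", "Johnson", "Johnsen", "Johnstone"]

# teach* -> the root plus any ending (truncation).
print([w for w in words if re.fullmatch(r"teach\w*", w)])
# ['teach', 'teacher', 'teaching', 'teachers']

# Johns?n -> exactly one character in place of the ? (wildcard).
print([w for w in words if re.fullmatch(r"Johns\wn", w)])
# ['Johnson', 'Johnsen']
```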
Bookmarks
Throughout the process of searching for information you will find many useful
sites. If you do not have time to examine these sites in detail, you may either
print them for off-line review or simply set a bookmark to easily return to them
later. Although bookmarks are simple to set and will certainly help your overall
searching, organizing your bookmarks dramatically increases your efficiency.
Netscape, for example, allows you to organize bookmarks into folders.
Search Strategies
Search tools are certainly proliferating on the web. These tools have grown
from early naive indexing tools to those that now use a form of artificial
intelligence algorithms termed heuristics. Heuristic searching tools are
designed to aid the user in learning, discovering, or problem solving through
self-educating techniques (i.e., feedback) to improve performance.
 
To determine which search engine(s) you should use to aid you in your task,
you need to know a little about various strategies they use and features they
provide.
Starting Points
Five categories of searching strategies
1. Rating
2. Sampling
3. Locating
4. Collecting
5. Concept Searching

Search engines may be segregated into one or more of these categories. As
search engines continue to develop, many are integrating multiple strategies into
their capabilities and thus blurring these categorical "lines." But for now, I have
defined each of these categories to help you understand the uses of various
search engines.
The decision regarding which search engine to use depends upon your
knowledge of how an engine searches and indexes web pages.
To better understand this, let's look at a few examples. The Lycos indexing
search engine examines only specific parts of a web page, such as the title,
headings, and the most significant 100 words, whereas Webcrawler examines
every word on a web page (Webster & Paul, 1996). But these are not the only
criteria to consider when selecting a search engine. The size of the database
(i.e., listings) is also a major factor.
I have provided hyperlinks to examples for each of the five categories for your
examination and better understanding of their uses. I recommend that, after you
have reviewed these search engines, you bookmark those that are most
useful to you for this course and for your professional work. This way you will not
have to continually return to this web page to access your preferred search tools.
Five categories for search strategies
Rating Strategy (rating and reviews) - Finding rated and reviewed sites.
Use: When you want to find out how others have rated topical sites.
• Magellan
• WebCrawler Select
• Point Communications

Sampling Strategy (subject trees) - Finding a few high-quality sources based on topics.
Use: When you are looking for broad "trailblazer" or topical pages.
• Yahoo
• Excite
• OpenText
• Galaxy Subjects
• Internet Sleuth (Netscape Users)
• Planet Earth
• Internet Public Library
• WWW Virtual Library

Locating Strategy (indexes) - Finding a list of items (sites).
Use: When you need to find a list of sites in specific databases.
• Yahoo...w/Boolean Operators
• Excite...Advanced
• Galaxy Search
• HotBot (Click "Modify" and "Expert" for Advanced)
• AltaVista
• InfoSeek
• InfoSeek Ultra (Faster and More Comprehensive)
• Lycos
• Lycos...Custom Search

Newsgroups, E-Mail Lists, Addresses and Software Archives
• DejaNews - Newsgroups
• Mailing Lists
• Four11 - Email
• Shareware
• People
• News/Weather
• Publications/Literature
• Technical Reports
• Documentation
• Desk Reference
• Other Useful Search Engines (Airlines, Road Maps, Image Search and Much More)
Graphic Searching
• Yahoo Image Surfer
• Apollo (Pick regions to search by clickable map)
Collecting Strategy - Metasearch (meta indexes/multi-threaded) - Finding and cataloging a high number of available web documents on a subject.
Use: When a comprehensive simultaneous search of multiple databases is necessary.
• DogPile
• Highway 61
• Internet Sleuth
• Metacrawler
• Savvy Search
• Starting Point
• SuperSeek
Concept Searching (heuristics, fuzzy matching and relevancy matching) - Finding information on topics using feedback.
Use: When you are unsure about the target.
• Autonomy Web Researcher
• EchoSearch
• Inso Search Wizard (Click on "Learn more about InsoSearch Wizard" and then "Demo")
• Surfbot
• Teleport Pro
Robots, Spiders, Worms, WebAnts, and Agents


Most search engines create indexes that are compiled by computer
programs known as robots, spiders, WebAnts, or worms. Robots and spiders
are the same thing, but worms are technically different in that they are
replicating programs, while WebAnts are distributed, cooperating robots.
These programs traverse the web, examine documents, enter them into a
database, and recursively retrieve all the documents that are referenced
(Koster, 1996). These robots will follow the hyperlinks to other documents
and index those also. Agents have numerous meanings in the computing
arena; broadly, agents are programs which act autonomously on a task. The
most common agents found on the web are Autonomous Agents, which are
programs that travel between sites and decide, based on algorithms, when to
move and what to do, and Intelligent Agents, which are programs that help
users with things such as forms and heuristics. They choose a product, guide
a user through forms, or help users find information. (A toy link-following
spider appears below.)
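Here is a toy spider in the spirit described above, using only Python's standard library. The start URL is a placeholder, and a real robot would also respect robots.txt, rate limits, and other politeness conventions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the href targets of anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href" and value)

def crawl(start_url, max_pages=10):
    """Breadth-first traversal: fetch a page, then queue every link on it."""
    seen, queue = set(), [start_url]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue  # skip unreachable or malformed URLs
        extractor = LinkExtractor()
        extractor.feed(html)
        # Recursively follow hyperlinks by resolving them against the page URL.
        queue.extend(urljoin(url, link) for link in extractor.links)
    return seen

# print(crawl("https://example.com"))  # placeholder start URL
```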
 
Part 3: Google Web Search Features
Major features of Google Search Console are:

1) Search Analytics:
One of the most popular features of Google Search Console is Search Analytics.
It tells you a lot about how to get organic traffic from Google. It also offers critical
search metrics for the website, including clicks, impressions, rankings, and
click-through rates. It is easy to filter data in multiple ways, such as by pages,
queries, devices, and more. SEO professionals never fail to check the Queries
section, as it helps identify the organic keywords that people commonly use to
search for the products or services offered by a website. You can also find out
the number of visitors using Image search to visit your website. The average
CTR of mobile and desktop can be easily compared, and the average position or
ranking of specific pages can be checked. (A quick CTR calculation follows below.)
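For reference, the click-through rate reported in Search Analytics is simply clicks divided by impressions. A quick sketch with invented numbers:

```python
# Invented Search Analytics rows: (query, clicks, impressions).
rows = [
    ("wireless headphones", 120, 4000),
    ("headphone reviews", 45, 900),
]
for query, clicks, impressions in rows:
    ctr = clicks / impressions * 100  # CTR as a percentage
    print(f"{query}: CTR = {ctr:.1f}%")
# wireless headphones: CTR = 3.0%
# headphone reviews: CTR = 5.0%
```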
2) HTML Improvements
The HTML Improvements section helps in improving how pages display on the
SERP. If there are any SEO-related issues, these features help identify them.
Issues like missing metadata, duplicate content, and over- or under-optimized
metadata can be readily identified. If identical content is available on the Internet
as multiple pieces, the search engines find it difficult to decide which content is
more relevant to a specific query. Similarly, missing metadata such as meta
descriptions or title tags can be easily found.
3) Crawl Errors
Checking the crawl error report on a periodic basis helps you solve various
problems in the crawl section. All the errors Googlebot encounters while
crawling website pages are shown clearly. Information about site URLs that
could not be crawled successfully by Google is shown with an HTTP error code.
An individual chart can be easily displayed, revealing information such as DNS
errors, robots.txt fetch failures, and server errors.
4) Fetch as Google
One of the essential tools, Fetch as Google helps ensure that web pages
are search engine friendly. Google crawls every page on the site for publishing or
indexing on the Search Engine Results Page. The URL is analyzed with the help
of this tool for verification, including changes in the content, title tag, etc. This
tool helps in communicating with the search engine bots to find out whether the
page can be indexed or not. It also helps indicate when, due to certain errors,
the site is not being crawled, or when it may be blocked by coding errors or
robots.txt.

5) Sitemaps & robots.txt Tester

An XML sitemap is used to help search engines (Google, Yahoo, Bing, etc.)
understand the website better while it is being crawled by their robots. There is a
section named Sitemaps where you can test whether your sitemap can be
crawled. While Google can discover pages without a sitemap, submitting one
helps ensure that all of a site's pages are found and indexed. robots.txt is a text
file which instructs search engine bots what to crawl and what not to crawl. This
file can be used to check which URL is blocked or disallowed by robots.txt.
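Python's standard library can read a robots.txt file the same way a well-behaved bot would. A short sketch, using a placeholder site:

```python
from urllib.robotparser import RobotFileParser

# Placeholder URL; point this at any real site's robots.txt to try it.
robots = RobotFileParser("https://example.com/robots.txt")
robots.read()  # fetch and parse the file

# Ask whether a given user agent is allowed to crawl a given URL.
print(robots.can_fetch("Googlebot", "https://example.com/private/page"))
```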
