Building Search Engine Using Machine Learning Technique
I. INTRODUCTION
The World Wide Web is a web of individual systems and servers connected through different technologies and methods. Every site comprises heaps of web pages that are created and deployed on a server. If a user needs something, he or she types a keyword: a set of words extracted from the user's search input. The search input given by a user may be syntactically incorrect. Here arises the actual need for search engines. A search engine provides a simple interface for searching a user query and displays the results in the form of the web addresses of the relevant web pages.

Fig. 1. Block Diagram of Search Engine [1]

Figure 1 focuses on the three main components of a search engine.
1) Web crawler
Web crawlers help in collecting data about a website and the links related to it. We use the web crawler only for collecting data and information from the WWW and storing them in our database.
2) Indexer
The indexer arranges each term on each web page and stores the resulting list of terms in a huge repository.
3) Query engine
The query engine is mainly used to answer the user's keyword and show the effective outcome for it. Within the query engine, a page ranking algorithm ranks the URLs.

This paper uses machine learning techniques to discover the most suitable web address for a given keyword. The output of the PageRank algorithm is given as input to a machine learning algorithm.

Section II discusses related work on search engines and the PageRank algorithm. Section III explains the objective. Section IV presents the proposed system, which is based on a machine learning technique, and Section V contains the conclusion.

II. LITERATURE REVIEW

Numerous endeavors have been made by data experts and researchers in the field of search engines. Dutta and Bansal [1] discuss various types of search engines and conclude that the crawler-based search engine is the best among them; it is also the type Google uses. It gives a user more relevant web addresses for a query. A web crawler is a program that navigates the web by following its regularly changing, dense, and distributed
Authorized licensed use limited to: University College London. Downloaded on May 26,2020 at 06:00:12 UTC from IEEE Xplore. Restrictions apply.
Proceedings of the International Conference on Intelligent Computing and Control Systems (ICICCS 2019)
IEEE Xplore Part Number: CFP19K34-ART; ISBN: 978-1-5386-8113-8
hyperlinked structure and then storing the downloaded pages in a vast database, which is afterwards indexed for productive execution of user queries. In [2], the author concludes that the major benefit of a keyword-focused web crawler over a traditional web crawler is that it works intelligently and efficiently.
The search engine uses a page ranking algorithm to place the most relevant web pages at the top of the results, according to the user's need. It eases the searching process, and users get the required information easily. Initially, as users faced problems in searching for data, a simple algorithm working on the link structure was introduced; as the web kept expanding, further modifications followed, and Weighted PageRank and HITS came into the scenario. In [3], the authors compare various PageRank algorithms and find that, among them, the Weighted PageRank algorithm is best suited for our system.
Michael Chau and Hsinchun Chen [4] proposed a system based on a machine learning approach for web page filtering. The machine learning results were compared with those of a traditional algorithm and found to be more useful. The proposed approach is also effective for building a search engine.

III. OBJECTIVE

To build a search engine that gives the web address of the most relevant web page at the top of the search result, according to user queries. The main focus of our system is to build a search engine using a machine learning technique to increase accuracy compared to available search engines.

IV. METHODOLOGY

The proposed system is a search engine that returns the web address of the most relevant web page at the top of the search result, according to user queries, and uses a machine learning technique to increase accuracy compared to available search engines.
The step-by-step procedure for building the search engine is as follows:
1) Collect data from the WWW using a web crawler.
2) Perform data cleaning using NLP.
3) Study and compare the existing page ranking algorithms.
4) Merge the selected page ranking algorithm with current machine learning techniques.
5) Implement a query engine to display efficient results for user queries.

A. Collect data from WWW using web crawler

In this step, we use a keyword-based web crawler to collect data and information from the internet. It begins working from a seed URL. After visiting the web page of the seed URL, it extracts every hyperlink present in that page, stores the extracted hyperlinks in a queue, and extracts the data from all web pages. Finally, it filters out the URLs that are relevant to the particular keywords.
Algorithm steps:
Step 1: Start with the seed URL.
Step 2: Initialize the queue (q).
Step 3: Dequeue a URL from the queue (q).
Step 4: Download the web page associated with this URL.
Step 5: Extract all URLs from the downloaded web page.
Step 6: Insert the extracted URLs into the queue (q).
Step 7: Go to step 3 until sufficiently relevant results are achieved.

Fig. 2. Flowchart for keyword focused web crawler [2]

B. Perform data cleaning using NLP

After collecting data from the WWW using the web crawler, the data must be cleaned using NLP. In this step, the data is preprocessed with NLP steps so that unnecessary data is removed.

Fig. 3. NLP steps for data cleaning

Figure 3 shows the step-by-step procedure for data cleaning using NLP.
1) Tokenization: Tokenization divides web page passages into phrases, or phrases into individual terms.
2) Capitalization: The most common approach is to reduce all web page data to lower case for simplicity.
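The crawler procedure above (steps 1 to 7) is essentially a breadth-first traversal of the link graph. The sketch below is illustrative, not the paper's implementation: `fetch` is a hypothetical callable mapping a URL to its HTML (injected so the loop can run without network access), and link extraction uses a naive regex rather than a real HTML parser.

```python
from collections import deque
from urllib.parse import urljoin
import re

def crawl(seed_url, fetch, max_pages=50):
    """BFS crawler sketch: `fetch` maps a URL to its HTML text."""
    queue = deque([seed_url])                  # Step 2: initialize queue (q)
    visited, pages = set(), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()                  # Step 3: dequeue a URL
        if url in visited:
            continue
        visited.add(url)
        try:
            html = fetch(url)                  # Step 4: download the page
        except OSError:
            continue                           # skip unreachable pages
        pages[url] = html
        # Step 5: extract hyperlinks; Step 6: enqueue them
        for href in re.findall(r'href="([^"#]+)"', html):
            queue.append(urljoin(url, href))
    return pages                               # Step 7: loop until enough pages
```

A production crawler would additionally normalize URLs, respect robots.txt, and apply the keyword-relevance filter described above before storing a page.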
3) Removing stopwords: Web pages contain many words that mainly connect parts of a sentence rather than convey important information. Such words need to be removed.
4) Part-of-speech tagging: This splits the sentence into tokens and assigns a significance to each token; it is also applied to user queries in search engines to find out exactly what the user requires.
5) Lemmatization: Lemmatization reduces words to a root form by removing inflection and dropping unnecessary characters.

C. Study and compare the existing page ranking algorithm

TABLE I
COMPARISON BETWEEN PR, WPR AND HITS [7]

Criteria | PageRank (PR) | Weighted PageRank (WPR) | HITS
Working | Calculates the page score at the time the pages are indexed | Calculates web page weight based on the importance of inbound and outbound links | Calculates hub and authority scores for each web page
Input parameters | Incoming links | Incoming and outgoing links | Content, incoming and outgoing links
Algorithm complexity | O(log N) | < O(log N) | < O(log N)
Quality of results | Good | More than PageRank | Less than PageRank
Efficiency | Medium | High | Low

Among all, the Weighted PageRank algorithm is best suited for our system because it gives more accuracy and efficiency compared to the others (see Table I).

D. Merge the selected page rank algorithm with current technologies in machine learning

After selecting and implementing the best-suited PageRank algorithm, the topmost output of the PageRank algorithm is taken as input for the machine learning algorithm. The output of the machine learning algorithm is given to the user as the web address of the relevant web page, based on the user's query.
To implement the machine learning algorithm that finds the most relevant web page for a user query, we divide the web features into three parts:
1) Page content
2) Page content of neighbors
3) Link analysis

1) Page content: Instead of a word vector, the content of each web page is represented by the following feature scores. Five feature scores are defined:
a) Title (w) = Number of words in the title of web page w found associated with the given query.
b) TFIDF (w) = Sum of the TF-IDF of the words in web page w found associated with the given query.
c) URL (w) = Number of words in the URL of web page w found associated with the given query.
d) Heading (w) = Number of words in the headings of web page w found related to the given query.
e) Anchor (w) = Number of words in the anchor text describing web page w related to the particular query.
2) Page content of neighbors: In this methodology, three sorts of neighbors are considered: inbound, outbound, and sibling. For any web page w, the inbound neighbors are the set of all web pages that have a hyperlink to w. The outbound neighbors are the set of all web pages whose hyperlinks are found in w. The sibling pages are the set of all web pages pointed to by any of the parents of w. Six feature scores are defined:
a) InTitle (w) = Mean (number of words in the title of web page m found associated with the given query) over all inbound web pages m of w.
b) InTFIDF (w) = Mean (sum of the TF-IDF of the words in page m found associated with the given query) over all inbound web pages m of w.
c) OutTitle (w) = Mean (number of words in the title of web page n found associated with the given query) over all outbound web pages n of w.
d) OutTFIDF (w) = Mean (sum of the TF-IDF of the words in web page n found associated with the given query) over all outbound web pages n of w.
e) SiblingTitle (w) = Mean (number of words in the title of web page q found associated with the given query) over all sibling web pages q of w.
f) SiblingTFIDF (w) = Mean (sum of the TF-IDF of the words in web page q found associated with the given query) over all sibling web pages q of w.
3) Link analysis: Links are hyperlinks from one page to another; they are also very useful for deciding the relevancy and quality of a web page. Three feature scores are defined:
a) PageRank (w) = PageRank of web page w.
b) Inlinks (w) = Count of inbound links pointing to w.
c) Outlinks (w) = Count of outbound links from w.

All of the above web features are taken as input features for an ANN, an SVM, and XGBoost, and the model that gives the highest accuracy is merged with the selected PageRank algorithm to give the URLs of relevant web pages for user queries.

E. Implement query engine to display the efficient results for user query

Finally, the query engine is implemented; it takes input from the user in the form of a query and displays the efficient result
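The cleaning steps can be sketched as a small pipeline. Everything here is a toy stand-in: the stopword list is a short placeholder, `lemmatize` is crude suffix stripping rather than true lemmatization, and part-of-speech tagging is omitted; a real system would use a library such as NLTK or spaCy for those steps.

```python
import re

# Toy placeholder; real systems use a full stopword list (e.g. NLTK's).
STOPWORDS = {"the", "is", "a", "an", "of", "and", "to", "in"}

def lemmatize(token):
    """Crude suffix stripping standing in for true lemmatization."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def clean(text):
    tokens = re.findall(r"[a-zA-Z]+", text)             # 1) tokenization
    tokens = [t.lower() for t in tokens]                # 2) capitalization
    tokens = [t for t in tokens if t not in STOPWORDS]  # 3) stopword removal
    return [lemmatize(t) for t in tokens]               # 5) lemmatization
```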
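One common formulation of Weighted PageRank weights each link (v, u) by u's share of the in-links and out-links among all pages that v references. The iterative sketch below follows that formulation; the damping factor and iteration count are illustrative defaults, not values taken from the paper.

```python
def weighted_pagerank(graph, d=0.85, iters=50):
    """Weighted PageRank sketch. `graph` maps each page to the pages it links to."""
    pages = list(graph)
    inlinks = {p: [] for p in pages}
    for v, outs in graph.items():
        for u in outs:
            inlinks[u].append(v)
    pr = {p: 1.0 for p in pages}
    for _ in range(iters):
        new = {}
        for u in pages:
            s = 0.0
            for v in inlinks[u]:
                refs = graph[v]  # all pages v links to
                # u's share of in-links / out-links among v's references
                win = len(inlinks[u]) / max(1, sum(len(inlinks[p]) for p in refs))
                wout = len(graph[u]) / max(1, sum(len(graph[p]) for p in refs))
                s += pr[v] * win * wout
            new[u] = (1 - d) + d * s
        pr = new
    return pr
```

Pages with more (and better-connected) in-links accumulate higher scores, which is the behavior Table I credits to WPR.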
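Several of the simpler feature scores can be computed directly from crawled data. The sketch below assumes a hypothetical toy page representation (a dict with `title` and `links` fields); the TF-IDF and neighbor-averaged scores are omitted for brevity.

```python
def page_features(w, pages, query_terms):
    """Compute a few of the per-page feature scores for page w.
    `pages` maps URL -> {"title": str, "links": [urls]}."""
    page = pages[w]
    terms = {t.lower() for t in query_terms}
    title_words = page["title"].lower().split()
    return {
        "Title":    sum(t in terms for t in title_words),      # query words in title
        "URL":      sum(t in w.lower() for t in terms),        # query words in URL
        "Inlinks":  sum(w in p["links"] for p in pages.values()),
        "Outlinks": len(page["links"]),
    }
```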