
Proceedings of the International Conference on Intelligent Computing and Control Systems (ICICCS 2019)

IEEE Xplore Part Number: CFP19K34-ART; ISBN: 978-1-5386-8113-8

Building Search Engine Using Machine Learning Technique

Rushikesh Karwa
Department of Computer Science and Engineering
Walchand College of Engineering, Sangli (MS), India
rushikeshkarwa55@gmail.com

Vikas Honmane
Department of Computer Science and Engineering
Walchand College of Engineering, Sangli (MS), India
vhonmane@gmail.com

Abstract—The web is the largest and richest source of data. Search engines are commonly used to retrieve information from the World Wide Web. A search engine provides a simple interface for entering a user query and displays the results as the web addresses of relevant web pages, but obtaining suitable information with traditional search engines has become very challenging. This paper proposes a search engine built with a machine learning technique that returns more relevant web pages at the top of the results for user queries.

Index Terms—World Wide Web, Search Engine, PageRank, Machine Learning.

I. INTRODUCTION
The World Wide Web is a web of individual systems and servers connected by different technologies and methods. Every site consists of many web pages that are created and deployed on a server. When a user needs something, he or she types a keyword, that is, a set of words extracted from the user's search input. The search input given by the user may be syntactically incorrect. This is where search engines are needed: they provide a simple interface for entering a user query and display the results in the form of the web addresses of relevant web pages.
Figure 1 focuses on the three main components of a search engine.
1) Web crawler: Web crawlers collect data about websites and the links related to them. We use the web crawler only to collect data and information from the WWW and store it in our database.
2) Indexer: The indexer arranges the terms found on each web page and stores the resulting list of terms in a large repository.
3) Query engine: The query engine answers the user's keyword and shows effective results for it. Within the query engine, a page ranking algorithm ranks the URLs.

Fig. 1. Block Diagram of Search Engine [1]

This paper utilizes machine learning techniques to discover the most suitable web addresses for a given keyword. The output of the PageRank algorithm is given as input to the machine learning algorithm.
Section II discusses related work on search engines and the PageRank algorithm. Section III explains the objective, Section IV presents the proposed system based on a machine learning technique, Section V reports the experimental results, and Section VI concludes the paper.

II. LITERATURE REVIEW

Numerous efforts have been made by data experts and researchers in the field of search engines. Dutta and Bansal [1] discuss various types of search engines and conclude that the crawler based search engine, which Google also uses, is the best among them: it gives the user more relevant web addresses for a query. A web crawler is a program that navigates the web by following its regularly changing, dense, and distributed hyperlinked structure, storing the downloaded pages in a vast database that is afterwards indexed for efficient execution of user queries.


In [2], the authors conclude that the major benefit of a keyword focused web crawler over a traditional web crawler is that it works intelligently and efficiently.
A search engine uses a page ranking algorithm to place the most relevant web pages at the top of the results, according to the user's need. It eases the searching process, so users get the required information very easily. Initially, only a simple algorithm working on the link structure was introduced, because users were facing problems in searching for data; as the web kept expanding, further modifications such as Weighted PageRank and HITS came into the scenario. In [3], the authors compare various PageRank algorithms; among them all, the Weighted PageRank algorithm is best suited for our system.
Michael Chau and Hsinchun Chen [4] proposed a system based on a machine learning approach to web page filtering. The machine learning results were compared with a traditional algorithm and found to be more useful. The proposed approach is also effective for building a search engine.

III. OBJECTIVE

The objective is to build a search engine that gives the web address of the most relevant web page at the top of the search result, according to user queries. The main focus of our system is to build a search engine using a machine learning technique that increases accuracy compared to the available search engines.

IV. METHODOLOGY

As stated in the objective, the aim is to place the web address of the most relevant web page at the top of the search result for a user query. Following is the step-by-step procedure for building the search engine:
1) Collect data from the WWW using a web crawler.
2) Perform data cleaning using NLP.
3) Study and compare the existing page ranking algorithms.
4) Merge the selected page rank algorithm with current technologies in machine learning.
5) Implement a query engine to display efficient results for user queries.
A. Collect data from WWW using web crawler

In this step, a keyword based web crawler is used to collect data and information from the internet. It begins working from a seed URL: after visiting the web page of the seed URL, it extracts every hyperlink present in that page, stores the extracted hyperlinks in a queue, and extracts the data from all the web pages. Finally, it filters out the URLs that are relevant to the particular keywords.
Algorithm steps:
Step 1: Start with the seed URL.
Step 2: Initialize the queue (q).
Step 3: Dequeue a URL from the queue (q).
Step 4: Download the web page associated with this URL.
Step 5: Extract all URLs from the downloaded web page.
Step 6: Insert the extracted URLs into the queue (q).
Step 7: Repeat from Step 3 until sufficiently relevant results are achieved.

Fig. 2. Flowchart for keyword focused web crawler [2]
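The paper describes the crawler only through the steps above; the following is a minimal Python sketch of that loop, assuming the requests and beautifulsoup4 packages. The seed URL, keyword list and page limit are illustrative placeholders, not values from the paper.

```python
# Minimal sketch of the keyword focused crawling loop (Steps 1-7 above).
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def keyword_focused_crawl(seed_url, keywords, max_pages=100):
    queue = deque([seed_url])                # Steps 1-2: seed URL and queue
    visited, relevant = set(), []
    while queue and len(visited) < max_pages:
        url = queue.popleft()                # Step 3: dequeue a URL
        if url in visited:
            continue
        visited.add(url)
        try:
            page = requests.get(url, timeout=5)          # Step 4: download
        except requests.RequestException:
            continue
        soup = BeautifulSoup(page.text, "html.parser")
        text = soup.get_text(" ", strip=True).lower()
        if any(kw.lower() in text for kw in keywords):   # keep relevant URLs
            relevant.append(url)
        for link in soup.find_all("a", href=True):       # Steps 5-6: extract
            queue.append(urljoin(url, link["href"]))     # and enqueue links
    return relevant                          # Step 7: loop until done

# Example: crawl from a seed page looking for pages about "machine learning".
# print(keyword_focused_crawl("https://example.com", ["machine learning"]))
```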


B. Perform data cleaning using NLP

After collecting data from the WWW using the web crawler, the data must be cleaned. In this step, data cleaning is performed to preprocess the data using NLP steps so that unnecessary data is removed.

Fig. 3. NLP steps for data cleaning

Figure 3 shows the step-by-step procedure for data cleaning using NLP.
1) Tokenization: Tokenization splits web page passages into phrases, or phrases into individual terms.
2) Capitalization: The most common approach is to reduce all web page data to lower case for simplicity.
3) Stopword removal: Web pages contain many words that mainly connect parts of a sentence rather than carry important information; such words need to be removed.
4) Parts of speech tagging: POS tagging splits a sentence into tokens and assigns a significance to each token. It is also applied to user queries in the search engine to find out exactly what the user requires.
5) Lemmatization: Lemmatization reduces words to a root form by removing inflection and dropping unnecessary characters.
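The paper does not name a specific NLP toolkit; the sketch below implements the five cleaning steps of Fig. 3 with NLTK as one possible choice. The function name and example text are illustrative.

```python
# Minimal sketch of the data cleaning pipeline in Fig. 3, using NLTK.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the required NLTK resources.
for pkg in ("punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"):
    nltk.download(pkg, quiet=True)

def clean_page_text(text):
    # 1) Tokenization: split the passage into individual terms.
    tokens = nltk.word_tokenize(text)
    # 2) Capitalization: reduce everything to lower case.
    tokens = [t.lower() for t in tokens if t.isalpha()]
    # 3) Stopword removal: drop connecting words with little information.
    stop = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop]
    # 4) Parts of speech tagging: attach a POS tag to every token.
    tagged = nltk.pos_tag(tokens)
    # 5) Lemmatization: reduce each word to its root form.
    lemmatizer = WordNetLemmatizer()
    return [(lemmatizer.lemmatize(word), tag) for word, tag in tagged]

# Example usage on a fragment of crawled page text.
# print(clean_page_text("Search engines are crawling and indexing web pages."))
```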
C. Study and compare the existing page ranking algorithms

Table I compares the PageRank, Weighted PageRank and HITS algorithms.

TABLE I
COMPARISON BETWEEN PR, WPR AND HITS [7]

Criteria             | PageRank (PR)                  | Weighted PageRank (WPR)               | HITS
Working              | Calculates the page score at   | The web page weight is calculated     | Calculates hub and authority
                     | the time the pages are indexed | from the inbound and outbound links   | scores for each web page
                     |                                | of important web pages                |
Input parameter      | Incoming links                 | Incoming and outgoing links           | Content, incoming and outgoing links
Algorithm complexity | O(log N)                       | < O(log N)                            | < O(log N)
Quality of results   | Good                           | More than PageRank                    | Less than PageRank
Efficiency           | Medium                         | High                                  | Low

Among them all, the Weighted PageRank algorithm is best suited for our system because it gives more accuracy and efficiency compared to the others (see Table I).
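The paper compares these algorithms without reproducing an update rule. For reference, the sketch below is a minimal implementation of the commonly used Weighted PageRank iteration, in which a page's rank is distributed over its out-links in proportion to their in-link and out-link counts; the link graph, damping factor and iteration count are illustrative, not taken from the paper.

```python
# Minimal sketch of a Weighted PageRank iteration over a small link graph.
def weighted_pagerank(graph, d=0.85, iterations=50):
    # graph: dict mapping each page to the list of pages it links to.
    pages = list(graph)
    in_links = {p: [q for q in pages if p in graph[q]] for p in pages}
    in_deg = {p: len(in_links[p]) for p in pages}
    out_deg = {p: len(graph[p]) for p in pages}
    wpr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new = {}
        for u in pages:
            total = 0.0
            for v in in_links[u]:
                # v's rank is split among its out-links in proportion to
                # their in-degree (w_in) and out-degree (w_out).
                ref = graph[v]
                w_in = in_deg[u] / max(sum(in_deg[p] for p in ref), 1)
                w_out = out_deg[u] / max(sum(out_deg[p] for p in ref), 1)
                total += wpr[v] * w_in * w_out
            new[u] = (1 - d) + d * total
        wpr = new
    return wpr

# Example: A links to B and C, B links to A, C links to A.
# print(weighted_pagerank({"A": ["B", "C"], "B": ["A"], "C": ["A"]}))
```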
D. Merge the selected page rank algorithm with current technologies in machine learning

After selecting and implementing the best suited PageRank algorithm, the topmost output of the PageRank algorithm is considered in this step as the input for the machine learning algorithm. The output of the machine learning algorithm is given to the user as the web addresses of relevant web pages based on the user's queries.
To implement the machine learning algorithm that finds the most relevant web pages for user queries, we divide the web features into three parts:
1) Page content
2) Page content of neighbors
3) Link analysis

1) Page content: Instead of a word vector, the content of each web page is represented by the following feature scores. Five feature scores are defined:
   a) Title (w) = number of words within the title of web page w found associated with the given query.
   b) TFIDF (w) = sum of the TFIDF of the words in web page w found associated with the given query.
   c) URL (w) = number of words within the URL of web page w found associated with the given query.
   d) Heading (w) = number of words within the headings of web page w found related to the given query.
   e) Anchor (w) = number of words found in the anchor text describing web page w related to the particular query.
2) Page content of neighbors: In this methodology, three sorts of neighbors are considered: inbound, outbound, and sibling. For any web page w, inbound neighbors are the set of all web pages that have a hyperlink to w, outbound neighbors are the set of all web pages whose hyperlinks are found in w, and sibling pages are the set of all web pages pointed to by any of the parents of w. Six feature scores are defined:
   a) InTitle (w) = mean (number of words within the title of web page m found associated with the given query) over all inbound web pages m of w.
   b) InTFIDF (w) = mean (sum of TFIDF of the words in page m found associated with the given query) over all inbound web pages m of w.
   c) OutTitle (w) = mean (number of words within the title of web page n found associated with the given query) over all outbound web pages n of w.
   d) OutTFIDF (w) = mean (sum of TFIDF of the words in web page n found associated with the given query) over all outbound web pages n of w.
   e) SiblingTitle (w) = mean (number of words within the title of web page q found associated with the given query) over all sibling web pages q of w.
   f) SiblingTFIDF (w) = mean (sum of TFIDF of the words in web page q found associated with the given query) over all sibling web pages q of w.
3) Link analysis: Links are hyperlinks from one page to another; they are also very useful for deciding the relevancy and quality of a web page. Three feature scores are defined:
   a) PageRank (w) = PageRank of web page w.
   b) Inlinks (w) = count of inbound links pointing to w.
   c) Outlinks (w) = count of outbound links from w.
All of the above web features are considered as input features for the ANN, SVM and XGBoost models, and the model that gives the highest accuracy is merged with the selected PageRank algorithm to give the URLs of relevant web pages for user queries.
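Taken together, the three groups above give 14 feature scores per web page (5 + 6 + 3), matching the 14-dimensional input used by the classifiers in Section V. A minimal sketch of assembling that vector is shown below; the per-score helper functions are hypothetical placeholders, since the paper defines the scores only in words.

```python
# Sketch of assembling the 14-dimensional feature vector described above.
# `scores` exposes hypothetical helpers (title, tfidf, url, heading, anchor,
# pagerank); only the 5 + 6 + 3 grouping follows the paper.
from statistics import mean

def page_feature_vector(w, query, inbound, outbound, siblings, scores):
    """w: page id; inbound/outbound/siblings: lists of neighbor page ids."""
    content = [                                     # 1) Page content (5)
        scores.title(w, query), scores.tfidf(w, query), scores.url(w, query),
        scores.heading(w, query), scores.anchor(w, query),
    ]
    def avg(pages, fn):
        return mean(fn(p, query) for p in pages) if pages else 0.0
    neighbors = [                                   # 2) Neighbor content (6)
        avg(inbound, scores.title), avg(inbound, scores.tfidf),
        avg(outbound, scores.title), avg(outbound, scores.tfidf),
        avg(siblings, scores.title), avg(siblings, scores.tfidf),
    ]
    links = [                                       # 3) Link analysis (3)
        scores.pagerank(w), len(inbound), len(outbound),
    ]
    return content + neighbors + links              # 14 features in total
```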
E. Implement query engine to display the efficient results for user query

Finally, the query engine is implemented. It takes input from the user in the form of a query and displays efficient results for it: the web addresses of relevant pages, based on the output of the machine learning algorithm.
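The paper does not detail the internals of the query engine; the sketch below is one hypothetical way to combine the earlier pieces (NLP cleaning, the 14-feature vector, the trained classifier and PageRank) to answer a query. All function names reuse the earlier sketches and are not taken from the paper.

```python
# Hypothetical query engine sketch: clean the query, score each candidate
# page with the trained classifier, and return the top URLs.
# clean_page_text, page_feature_vector and `model` come from the earlier
# sketches; `candidates` maps URLs to (inbound, outbound, siblings) lists.
def answer_query(query, candidates, scores, model, top_k=10):
    terms = [w for w, _ in clean_page_text(query)]       # reuse NLP cleaning
    ranked = []
    for url, (inbound, outbound, siblings) in candidates.items():
        features = page_feature_vector(url, terms, inbound, outbound,
                                       siblings, scores)
        relevance = model.predict([features])[0]          # 1 = relevant
        ranked.append((relevance, scores.pagerank(url), url))
    # Relevant pages first, then by PageRank within each group.
    ranked.sort(key=lambda r: (r[0], r[1]), reverse=True)
    return [url for _, _, url in ranked[:top_k]]
```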


V. EXPERIMENTAL RESULT

The following algorithms were implemented; the one that gives the highest accuracy is used together with the PageRank algorithm.
1) Support Vector Machine
2) Artificial Neural Network
3) XGBoost

1) Support Vector Machine: Because of its exceptional performance, an SVM was also used to enable a better approach. It uses the same set of feature scores to perform classification. The dataset is not linearly separable, so a non-linear SVM is used; RBF, polynomial and sigmoid are types of non-linear kernels. The 14 features above are selected as the input to the SVM model, and based on those features the SVM predicts whether each web page in the testing set is relevant to the given query or not. The results were stored and used for performance evaluation.
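A minimal scikit-learn sketch of this classifier is given below; only the non-linear (RBF) kernel and the 14-feature input come from the paper, while the feature scaling and regularisation constant are assumptions.

```python
# Non-linear SVM on the 14 feature scores.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def build_svm():
    # Scaling the 14 features helps the RBF kernel; C=1.0 is the default.
    return make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

# Usage, with X_train/y_train from the 378-record training split:
# model = build_svm().fit(X_train, y_train)
# y_pred = model.predict(X_test)
```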
2) Artificial Neural Network: The neural network consists of three layers, namely an input layer, a hidden layer, and an output layer. The network's input layer consists of 14 nodes corresponding to each web page's 14 feature scores. Only one output node is required in the output layer to determine the relevancy of a web page. The number of nodes in the hidden layer was set to 7. These parameters were set using a grid search based on some initial experimentation. The entire training process is repeated 150 times with a batch size of 10. The results were stored and used for performance evaluation.
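A minimal Keras sketch of this network follows; the 14-7-1 layer sizes, the 150 training repetitions and the batch size of 10 come from the paper, whereas the activation functions, optimiser and loss are assumptions.

```python
# Sketch of the 14-input, 7-hidden-node, 1-output relevancy network.
from tensorflow import keras
from tensorflow.keras import layers

def build_ann():
    model = keras.Sequential([
        layers.Input(shape=(14,)),             # one node per feature score
        layers.Dense(7, activation="relu"),    # hidden layer of 7 nodes
        layers.Dense(1, activation="sigmoid"), # single relevancy output
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Usage, with the training split described in Section V-A:
# build_ann().fit(X_train, y_train, epochs=150, batch_size=10)
```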
3) XGBoost: XGBoost is a boosting based ensemble learning method. It uses gradient boosted decision trees to improve accuracy and speed. The input consists of the same 14 features, and a gbtree based booster is used. The number of classifiers is set to 50 and the maximum depth is set to 4. These parameters were set using parameter tuning and cross validation based on some initial experimentation.
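A minimal sketch of this model with the xgboost library is shown below; the gbtree booster, 50 estimators and maximum depth of 4 come from the paper, and the remaining settings are library defaults.

```python
# Gradient boosted decision trees on the same 14 features.
from xgboost import XGBClassifier

def build_xgb():
    return XGBClassifier(booster="gbtree", n_estimators=50, max_depth=4,
                         eval_metric="logloss")

# Usage:
# model = build_xgb().fit(X_train, y_train)
# y_pred = model.predict(X_test)
```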
A. Performance

1) Accuracy: First, a dataset of 540 records is created in which the dependent variable is the relevancy of the URL, either 0 or 1: 1 indicates a relevant URL and 0 an irrelevant URL. The dataset is divided into training and testing sets; out of the 540 records, 378 are used for training and 162 for testing. The accuracy is calculated using the following formula:

    accuracy = number of documents correctly classified / total number of documents
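A minimal sketch of this evaluation protocol is given below, assuming the feature matrix X (540 x 14) and the 0/1 relevance labels y have already been built with the feature extraction sketched in Section IV-D.

```python
# 378/162 train/test split of the 540-record dataset and the accuracy metric.
import numpy as np
from sklearn.model_selection import train_test_split

def split_dataset(X, y):
    # Hold out 162 of the 540 records for testing, keeping both classes.
    return train_test_split(X, y, test_size=162, random_state=42, stratify=y)

def accuracy(y_true, y_pred):
    # accuracy = correctly classified documents / total documents
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))
```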
Table II shows the accuracy of each algorithm. Out of all of them, XGBoost has the highest accuracy.

TABLE II
ACCURACY OF DIFFERENT ALGORITHMS

No.  Algorithm  Accuracy (%)
1    SVM        89.50
2    ANN        91.35
3    XGBoost    92.59

VI. CONCLUSION

A search engine is very useful for finding the most relevant URLs for a given keyword, and it reduces the time users spend searching for relevant web pages. For this, accuracy is a very important factor. From the above observations, it can be concluded that XGBoost is the best in terms of accuracy, better than SVM and ANN. Thus, a search engine built using XGBoost and the PageRank algorithm will give better accuracy.

REFERENCES

[1] Manika Dutta, K. L. Bansal, "A Review Paper on Various Search Engines (Google, Yahoo, Altavista, Ask and Bing)", International Journal on Recent and Innovation Trends in Computing and Communication, 2016.
[2] Gunjan H. Agre, Nikita V. Mahajan, "Keyword Focused Web Crawler", International Conference on Electronic and Communication Systems, IEEE, 2015.
[3] Tuhena Sen, Dev Kumar Chaudhary, "Contrastive Study of Simple PageRank, HITS and Weighted PageRank Algorithms: Review", International Conference on Cloud Computing, Data Science & Engineering, IEEE, 2017.
[4] Michael Chau, Hsinchun Chen, "A machine learning approach to web page filtering using content and structure analysis", Decision Support Systems 44 (2008) 482–494, ScienceDirect, 2008.
[5] Taruna Kumari, Ashlesha Gupta, Ashutosh Dixit, "Comparative Study of Page Rank and Weighted Page Rank Algorithm", International Journal of Innovative Research in Computer and Communication Engineering, February 2014.
[6] K. R. Srinath, "Page Ranking Algorithms – A Comparison", International Research Journal of Engineering and Technology (IRJET), December 2017.
[7] S. Prabha, K. Duraiswamy, J. Indhumathi, "Comparative Analysis of Different Page Ranking Algorithms", International Journal of Computer and Information Engineering, 2014.
[8] Dilip Kumar Sharma, A. K. Sharma, "A Comparative Analysis of Web Page Ranking Algorithms", International Journal on Computer Science and Engineering, 2010.
[9] Vijay Chauhan, Arunima Jaiswal, Junaid Khalid Khan, "Web Page Ranking Using Machine Learning Approach", International Conference on Advanced Computing Communication Technologies, 2015.
[10] Amanjot Kaur Sandhu, Tiewei S. Liu, "Wikipedia Search Engine: Interactive Information Retrieval Interface Design", International Conference on Industrial and Information Systems, 2014.
[11] Neha Sharma, Rashi Agarwal, Narendra Kohli, "Review of features and machine learning techniques for web searching", International Conference on Advanced Computing Communication Technologies, 2016.
[12] Sweah Liang Yong, Markus Hagenbuchner, Ah Chung Tsoi, "Ranking Web Pages using Machine Learning Approaches", International Conference on Web Intelligence and Intelligent Agent Technology, 2008.
[13] B. Jaganathan, Kalyani Desikan, "Weighted Page Rank Algorithm based on In-Out Weight of Webpages", Indian Journal of Science and Technology, December 2015.
