Machine Learning for Ride Sharing Model
2019
Part of the Software Engineering Commons, and the Theory and Algorithms Commons
Recommended Citation
Yatnalkar, Govind Pramod, "A Machine Learning Recommender Model for Ride Sharing Based on Rider
Characteristics and User Threshold Time" (2019). Theses, Dissertations and Capstones. 1259.
[Link]
This Thesis is brought to you for free and open access by Marshall Digital Scholar. It has been accepted for
inclusion in Theses, Dissertations and Capstones by an authorized administrator of Marshall Digital Scholar. For
more information, please contact zhangj@[Link], beachgr@[Link].
A MACHINE LEARNING RECOMMENDER MODEL FOR RIDE SHARING
BASED ON RIDER CHARACTERISTICS AND USER THRESHOLD TIME
A thesis submitted to
the Graduate College of
Marshall University
In partial fulfillment of
the requirements for the degree of
Master of Science
in
Computer Science
by
Govind Pramod Yatnalkar
Approved by
Dr. Wook-Sung Yoo, Committee Chairperson
Dr. Husnu S. Narman
Dr. Haroon Malik
Marshall University
December 2019
© 2019
Govind Pramod Yatnalkar
ALL RIGHTS RESERVED
ACKNOWLEDGEMENTS
brother, Mr. Prabodh Yatnalkar, his wife, Mrs. Manjiri Yatnalkar, and their lovely daughter,
Madhura for providing me the support which was essential for accomplishing the crucial task of
the thesis.
I extend my gratitude towards my thesis advisor, Dr. Husnu Narman, for his
contributions to my thesis. Without his adept suggestions and expert guidance, I would not have
reached this point of achievement in my research work. Along with Dr. Narman, I am highly
obliged to my thesis committee members, Dr. Wook-Sung Yoo and Dr. Haroon Malik, for their
helpful assistance and mentorship in the reviewing and fulfillment of the thesis document.
At last, I would like to thank the Weisberg Division of Computer Science, College of
present my research work. It was a huge honor to work with highly skilled professionals, along
with the support of my friends, which led to the completion of my thesis work.
TABLE OF CONTENTS
List of Figures
Abstract
Chapter 1 INTRODUCTION
1.2 Motivation
1.4 Contributions
3.2 Architecture
Chapter 4 METHODOLOGIES
4.5 Saving User Feedback
5.4 Experimentations
Chapter 7 CONCLUSION
7.1 Conclusion
7.2 Shortcomings
References
Appendix B Acronyms
LIST OF TABLES
LIST OF FIGURES
Figure 25 An Illustration of Root Mean Square Error (RMSE)
Figure 35 Result Comparison of the Matching Rates from Phase 1 (left) and Phase 2 (right)
Figure 36 Result Comparison of the Number of Computed Trips in Phase 1 (left) and Phase 2 (right)
Figure 37 Result Comparison of the Total Simulation Time in Phase 1 (left) and Phase 2 (right)
ABSTRACT
In the present age, human life is prospering incredibly due to the 4th Industrial Revolution, or the
Age of Digitization and Computing. The ubiquitous availability of the Internet and advanced
computing systems has resulted in the rapid development of smart cities. From connected
devices to live vehicle tracking, technology is taking the field of transportation to a new level. An
essential part of the transportation domain in smart cities is Ride Sharing. It is an excellent
solution to issues like pollution, traffic, and the rapid consumption of fuel. Even though Ride
Sharing has several benefits, the current usage is significantly low due to limitations like social
barriers and long rider waiting times. The thesis proposes a novel Ride Sharing model with two
matching layers to eliminate most of the observed issues in the existing Ride Sharing applications
like UberPool and LyftLine. The first matching layer matches riders based on specific human
characteristics, and the second matching layer provides riders the option to restrict the waiting
time by using personalized threshold time. At the end of trips, the system collects user feedback
according to five characteristics. Then, at most, two main characteristics that are the most
important to riders are determined based on the collected feedback. The registered characteristics
and the two main determined characteristics are fed as the inputs to a Machine Learning
classification module. For newly registering users, the module predicts the two main
characteristics of riders, and that assists in matching with other riders having similar determined
characteristics. The thesis includes subjecting the proposed model to an extensive simulation for
measuring system efficiency. The model simulations utilized the real-time New York City
Cab traffic data with real traffic conditions using the Google Maps Application Programming
Interface (API). Results indicate that the proposed Ride Sharing model is feasible and efficient as
the number of riders increases while maintaining the rider threshold time. The expected outcome
of the thesis is to help service providers increase the usage of Ride Sharing, complete the pool for
the maximum number of trips in minimal time and perform maximum rider matches based on
similar characteristics, thus providing an energy-efficient and a social platform for Ride Sharing.
CHAPTER 1
INTRODUCTION
The world is progressing rapidly from a perspective of technology and innovation due to
the 4th Industrial Revolution. The ultimate motive of the revolution, or digitization, is to
convert time-consuming manual tasks to automation [1, 2]. Automation includes the involvement
of computing systems and sophisticated software to execute and speed up the manual tasks [3].
Also, automation not only acts as a catalyst for speeding up the processes but also significantly
increases productivity. Several domains experience the usage of automation and technology like
Finance and Banking, Manufacturing and Production, Information Technology (IT) Industry,
Education, Health and Public Safety, Medicare, and many more [2, 4, 5]. The list also includes
the domain of Transportation, on which the thesis mostly focuses [2, 3].
Transportation is one of the most vital domains for humankind [6]. The need for vehicles
is comparable to the necessity of food and water to humans. Irrespective of weather conditions,
vehicles provide the ability to swiftly traverse from a source to a destination [6, 7]. Engineers and
researchers are continuously deploying advanced tools in vehicles, which offer smoother and faster
Incorporating the latest technologies enables a system to perform better and achieve more
significant results [4]. A system that executes tasks while exploiting the features of
advanced technologies is a smart system [8]. Such systems couple an environment
that is generally orchestrated by human beings with machinery plus computing power [2, 8]. A
popular paradigm of smart systems is smart cities, of which transportation forms a crucial component
[4].
tracking. An example that revolutionized vehicle connectivity is manipulating cell controls via
vehicle handles through Bluetooth and Network [9]. Also, driving is smarter and more
manageable with features like Steering Assists, Cruise Control, and the latest innovation in the
market, Auto-Parking, and Auto-Pilot [10, 11]. Such examples constitute the smart
A simple definition of Ride Sharing is to share a ride among multiple users. The history of
Ride Sharing dates back to the times of World War II [12, 13], during the oil and energy crisis.
The concept of Ride Sharing emerged as a bright and potent idea of sharing a journey and was an
effective way of saving plus sharing oil or fuel resources [13]. With time, world conditions
improved, automobiles evolved, economies thrived, and as people started getting financially
stable, a downfall in the utilization of Ride Sharing was observed, resulting in people owning
self-purchased vehicles.
In the present age, under the topic of Green Computing, Ride Sharing is gaining much
attention [14]. Ride Sharing is synonymous with names like Ride-Hailing, Car-Pooling, and
Vehicle-Pooling. Hence, in the further chapters of the thesis, the term Ride Sharing is referenced
with Ride-Hailing, Car-Pooling, and Vehicle-Pooling. Utilizing the basic idea of Ride Sharing, the
thesis proposes an Enhanced Ride Sharing Model (ERSM). The enhanced model addresses several
1.2 Motivation
Even though automobiles provide many benefits, they also give rise to many problems. The
problems arise due to the continuously rising requirements of people for fuel resources. A study of
previous applications led to the discovery of a relation between the number of vehicles and the
current rising population. The relation states that as the population increases, the number of
vehicles increases [7, 15]. The overall number of vehicles has
risen exponentially in the past decade, which has directly impacted the present traffic conditions
[16]. Solutions like High Occupancy Vehicle (HOV) lanes are proposed to address the traffic issue
[17] in existing Ride Sharing systems, but there has not been a significant improvement in current
With traffic, vehicle fuel consumption has increased exponentially, and in the coming
years, there is a possibility of outrunning the natural resources [19]. Governments from many
countries are investing in technologies for renewable energy generation, but the rate of fuel
consumption is much higher than the rate of renewable energy consumption [20]. Other hurdles
include the installation and production costs for renewable energy generation. The byproducts of
burning fuel are the smoke and harmful gas emissions that have detrimental effects on the
One of the primary issues in the domain of transportation is the pollution resulting from
emissions from many vehicles [22]. As the population increases, the number of vehicles and
emissions increases, resulting in Global Warming [7, 15, 23]. Additionally, the emissions
affect not only human health but every living being on the planet Earth [21]. For example, the number of
reported cases of respiratory issues has risen to notable levels in the past five years [24]. An
increase in the number of vehicles also leads to car accidents, and a minor but critical issue of the
Ride Sharing is a possible candidate solution to the aforementioned problems of traffic, air
pollution, and rapid fuel consumption. It is the process of sharing a ride among people who are
traversing through a series of sources and destinations. In Ride Sharing, the journey is completed
by following a specific trajectory that is formed using multiple locations [25, 26]. Moreover, Ride
Sharing increases the number of HOV lanes, providing a smoother traffic flow.
Currently, there exist many Ride Sharing applications. The thesis includes a detailed
study of several Ride Sharing applications and has listed several limitations in the previous works.
The three primary persisting problems in most cases are that riders do not reach the seating
capacity of the vehicle, the system suddenly adds or accepts passengers on an ongoing trip, and
riders avoid Ride Sharing due to the social barriers, as riders do not know with whom they are
going to travel on an upcoming trip [27, 28]. Such factors lead to consumer disappointment and
frustration. The motivation of the thesis is to provide solutions to the three aforementioned
issues, along with several others that are stated in further chapters.
Ride Sharing is a definite practical solution if applied effectively [29]. For example,
consider a case of five users who have their distinct vehicles for commuting purposes. If the five
users decide to share a ride, they cut down the usage of four cars. Eliminating the usage of four
cars leads to an overall reduction in traffic and a decrease in fuel consumption plus carbon
emissions by almost 80% [23, 30]. Additional advantages include splitting the stress, fatigue, and
fares among riders, increasing parking spots, and encouraging social interactions with others
during the journey [25, 30, 31, 32]. The use case of the five users sharing a ride is portrayed in
Figure 1.
Currently, there exist problems in Ride Sharing applications like social barriers and the
sudden rider addition without rider consent. Such factors cause people to avoid the usage of Ride
Sharing [33, 34]. A fact obtained by research is that humans thrive on social relations and cannot
stay isolated for long. Also, human beings tend to associate themselves with people having similar
characteristics [35]. The thesis uses a similar kind of approach and utilizes human attributes in
the rider matching layers. The aim of the thesis is to implement an Enhanced Ride Sharing Model
that addresses the issues related to unknown characteristics of riders and the sudden elongation of
the trip time. The Enhanced Ride Sharing Model, in a nutshell, is depicted in Figure 2.
[Figure 2: the Enhanced Ride Sharing Model in a nutshell. Surviving diagram label: Basic Ride Sharing Model.]
The designed model in the thesis includes Ride Sharing technology with two matching
layers. The model begins with the rider registration, where users provide required profile data
along with five specific characteristics. Characteristics are the requirements that define the search
criteria for a match and are positive integers on a scale of 1 to 5. The selected human
characteristics in the system are chatty, friendliness, safety, punctuality, and comfortability. Once
a user registers to the system, the proposed model searches and matches riders having a similar
set of characteristics. The matching of riders using the five characteristics constitutes the first
matching layer.
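The first matching layer described above can be illustrated with a minimal sketch. The `Rider` class, the `first_layer_match` function, and the sample riders are hypothetical names introduced only for illustration, and the sketch assumes that "similar" means "identical" ratings on all five characteristics; it is not the implementation used in the thesis.

```python
from dataclasses import dataclass

# The five characteristics named in the text; the ordering is an assumption.
CHARACTERISTICS = ("chatty", "friendliness", "safety", "punctuality", "comfortability")


@dataclass(frozen=True)
class Rider:
    name: str
    ratings: tuple  # one integer per entry of CHARACTERISTICS

    def __post_init__(self):
        # Registration check: five ratings, each a positive integer from 1 to 5.
        if len(self.ratings) != len(CHARACTERISTICS):
            raise ValueError("expected one rating per characteristic")
        if any(not isinstance(r, int) or not 1 <= r <= 5 for r in self.ratings):
            raise ValueError("ratings must be integers from 1 to 5")


def first_layer_match(broadcaster, candidates):
    """Keep candidates whose registered ratings equal the broadcaster's (assumed rule)."""
    return [c for c in candidates if c.ratings == broadcaster.ratings]


broadcaster = Rider("b", (2, 2, 4, 3, 4))
riders = [
    Rider("r1", (2, 2, 4, 3, 4)),
    Rider("r2", (5, 2, 4, 3, 4)),  # differs on "chatty", so it is filtered out
    Rider("r3", (2, 2, 4, 3, 4)),
]
matched = first_layer_match(broadcaster, riders)
print([r.name for r in matched])
```

The list returned here would then serve as the input to the second matching layer.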
The thesis proposes a novel concept of User Threshold Time (UTT). In registration, riders
provide the User Threshold Time or User Tolerated Time. UTT is defined as the time in minutes
that riders are willing to spend while other riders are picked up. It is the maximum
waiting time that both riders and drivers agree to when accepting a rider. UTT in the thesis is taken on a
scale of 10 to 30 in multiples of 5. Therefore, riders can select 10, 15, 20,
25, or 30 as the UTT. Drivers pick up other riders based on the minimal UTT among the riders on a trip, which
respects the tolerated time of every rider. Hence, riders are only accepted if their
traveling time or waiting time is less than or equal to the registered UTT. User Threshold
Time assures that travelers do not wait long while other riders are picked up during a journey.
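The UTT rules above can be sketched in a few lines. The function names `validate_utt`, `trip_utt`, and `can_accept` are hypothetical and chosen for illustration; the sketch only assumes what the paragraph states, namely that a UTT is one of 10, 15, 20, 25, or 30 minutes and that the smallest UTT on a trip bounds the pickup time.

```python
ALLOWED_UTT_MINUTES = (10, 15, 20, 25, 30)


def validate_utt(utt_minutes):
    """Reject any UTT that is not one of the five registered options."""
    if utt_minutes not in ALLOWED_UTT_MINUTES:
        raise ValueError(f"UTT must be one of {ALLOWED_UTT_MINUTES}")
    return utt_minutes


def trip_utt(utts_on_trip):
    """The trip UTT is the smallest UTT among the riders on the trip."""
    return min(validate_utt(u) for u in utts_on_trip)


def can_accept(pickup_minutes, utts_on_trip):
    """A rider is accepted only if the pickup time does not exceed the trip UTT."""
    return pickup_minutes <= trip_utt(utts_on_trip)


print(trip_utt([20, 10, 30]))        # smallest tolerated time on the trip
print(can_accept(10, [20, 10, 30]))  # pickup time equal to the trip UTT
print(can_accept(12, [20, 10, 30]))  # pickup time above the trip UTT
```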
The next stage in the proposed model is the execution of the matching layers, which
begins with a broadcasting rider request. The ride request triggers a search for other riders
having similar registered characteristics. The output list from the characteristics matching layer is
the input to the UTT matching layer, where the traveling time between the locations of the
broadcasting rider and other riders is computed using the Google Maps APIs and checked to be
less than the trip UTT. If riders satisfy the matching layer conditions, the system
adds them onto the final trip itinerary, marking the completion of trip formation.
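The two matching layers described above can be chained as in the following sketch. Everything named here is hypothetical: `build_trip`, the dictionary keys, and the `TRAVEL` lookup table, which stands in for a routing service such as the Google Maps APIs and contains made-up minutes. The sketch assumes identical ratings for the first layer, a trip UTT taken as the minimum over the riders already on the trip plus the candidate, and a `<=` comparison following the UTT definition.

```python
def build_trip(broadcaster, candidates, travel_minutes, rider_capacity):
    """Form a trip: characteristics layer first, then the UTT layer.

    Riders are dicts with keys "name", "ratings", "utt", and "location".
    travel_minutes(origin, destination) returns minutes; it is a stand-in
    for a real routing service.
    """
    # Layer 1: riders with the same registered characteristics (assumed rule).
    layer1 = [c for c in candidates if c["ratings"] == broadcaster["ratings"]]

    trip = [broadcaster]
    for cand in layer1:
        if len(trip) >= rider_capacity:
            break
        # Layer 2: the pickup time must respect the smallest UTT on the trip.
        utt = min(r["utt"] for r in trip + [cand])
        if travel_minutes(broadcaster["location"], cand["location"]) <= utt:
            trip.append(cand)
    return trip


# Made-up travel times in minutes between hypothetical locations.
TRAVEL = {("A", "B"): 8, ("A", "C"): 14, ("A", "D"): 5}


def lookup(origin, destination):
    return TRAVEL[(origin, destination)]


broadcaster = {"name": "b", "ratings": (2, 2, 4, 3, 4), "utt": 15, "location": "A"}
candidates = [
    {"name": "r1", "ratings": (2, 2, 4, 3, 4), "utt": 10, "location": "B"},
    {"name": "r2", "ratings": (2, 2, 4, 3, 4), "utt": 20, "location": "C"},
    {"name": "r3", "ratings": (5, 2, 4, 3, 4), "utt": 30, "location": "D"},
]
trip = build_trip(broadcaster, candidates, lookup, rider_capacity=4)
print([r["name"] for r in trip])
```

In this example r1 is accepted (8 minutes against a trip UTT of 10), r2 is rejected by the UTT layer (14 minutes against 10), and r3 never reaches the UTT layer because its characteristics differ.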
After the trip formation, the execution of a novel designed feedback system begins where
riders rate the driver as well as other riders on the trip. The feedback given by a user forms an
essential data-set as the system uses the feedback data to compute the two main characteristics
for every user. The determined characteristics are later employed by the Machine Learning
algorithms to predict better rider recommendations. The generic control flow of the designed
based on the feedback the rider gets from other riders. The two main characteristics are used to
determine the characteristics a rider focuses on most during a trip while rating other riders. In the end,
based on the feedback patterns in past trips, the system assigns the two most favored
The computations for determining the main characteristics of a rider are complex
and computationally expensive. Thus, after recording a sufficient number of trip and feedback records, the
thesis made use of the Machine Learning classification algorithms or classifiers to predict the main
characteristics of a rider, which eliminates the need for complex computations. Machine Learning
(ML) is a technology where a system learns and trains based on an existing data-set and predicts
outputs for new input data [36, 37]. In the case of the Ride Sharing model, the thesis employs the
Support Vector Machine (SVM) classification algorithm. After appropriate training and testing,
the SVM classifier predicts the two main characteristics of newly registering riders. Riders are
[Figure: control flow of the proposed model. Steps recoverable from the diagram: broadcasting rider request; find closest driver; find riders based on registered characteristics; filter riders using UTT matching; vehicle seats == 0 (yes/no); complete trip and get feedback; compute two main characteristics; train and test the Machine Learning classifier; new registering riders; predict two main characteristics using the Machine Learning classifier; find riders based on predicted characteristics; complete trip.]
In the chapter of results, the model’s explorations and analysis showcase that it is possible
to allocate the best-matched riders using characteristics and UTT. The proposed model in the
thesis aims to increase the use of Ride Sharing while respecting rider considerations and decrease
consumer frustration.
1.4 Contributions
ii Filtering riders matched in the characteristics matching layer using the UTT matching layer.
iii Recording user feedback and computing the two main characteristics for every user, which are
iv Using a Machine Learning Algorithm to predict the two main characteristics and recommend
v Evaluating the proposed model with an extensive simulation and real data to analyze the
model efficiency.
Based on the user characteristics and UTT, the system allocates the riders based on
similar characteristics on a trip to ensure they have a joyful and stress-free ride. The motive of
the thesis is to reduce trip differences and promote an interactive journey. Through UTT, the
model tries to minimize consumer frustrations in cases where the user unexpectedly waits for a
long time on a short trip. The observations and results in the thesis show that The Enhanced
Ride Sharing model is feasible, and can be deployed to increase the usage of Ride Sharing.
Ultimately, the objective of the thesis is to enhance the usage of the present Ride Sharing services
using the human characteristics, user feedback, and UTT, which will indirectly reduce the effects
of Global Warming and increase the fuel reserves for future generations.
The organization of the rest of the thesis is as follows: Chapter 2 describes the related
works for the present Ride Sharing applications. Chapter 3 includes the discussion of the system
model, which possesses the problem statement and system architectures. Chapter 4 describes the
methodologies followed for the proposed model, and Chapter 5 showcases the designs of the
Enhanced Ride Sharing Model using Machine Learning algorithms and reports the simulations
performed to test the system efficiency. Chapter 6 presents the model results plus observations,
and Chapter 7 has the concluding remarks and plans to improve the proposed model.
CHAPTER 2
LITERATURE REVIEW
With the presence of advanced technologies and complex computing systems, there has
been immense development in the field of Ride Sharing. Companies like Uber, Lyft, Via, Ola, and
Juno are continuously developing ideas to improve their applications and revenue models [38].
However, due to the lack of appropriate equipment and technology, Ride Sharing is discouraged in
many states and countries. Governments are making efforts and proposing plans to
encourage Ride Sharing, such as reducing taxes on vehicles affiliated with Ride Sharing
applications and using public plus private transportation in conjunction with Ride Sharing
services, yet the overall market for Ride Sharing remains small [18, 39].
The chapter of the literature survey begins with the research of current popular Ride
Sharing applications. The next section presents the study of different vehicle traversing
approaches and the modern technologies integrated with the existing Ride Sharing applications.
The section after the study of modern technologies with Ride Sharing applications presents the
methods for determining the two main characteristics for a rider. In the final section of the
chapter, the research provides the explorations of several Machine Learning classification
algorithms. Also, the last section discusses the selected Machine Learning classifier, which is later
The literature survey began with an investigation of the most popular Ride Sharing
applications like UberPool, LyftLine, Juno, Curb, Wingz, Via, Flywheel, Zimride, and Waze
[40, 41, 42, 43, 44, 45, 46, 47, 48]. Uber, Lyft, Wingz, and Via are Ride Sharing applications that allow
any person to be a rider or driver [43, 46]. The study observed role restrictions in Juno, Gett, and
Curb as they are taxi-based Ride Sharing services [7]. The strong point of most of the applications
is the usage of modern technologies like Internet of Things (IoT) and Cloud Computing. An
application hosted with advanced technologies promotes quicker computation, easy
availability of services, sophisticated notification abilities, and virtually unlimited data storage [16, 49].
Ride Sharing is employed throughout the United States of America, but states like New
York, California, Florida, and Texas experience a higher usage of Ride Sharing services as
compared to other states [50]. California is home to many Ride Sharing companies. Hence, Ride
Sharing is highly popular in California. A separate Ride Sharing terminal at the San Francisco
International Airport in California is a paradigm for the extensive usage of Ride Sharing [42, 44].
Gett, by Juno, is widely used in London, United Kingdom, as well as in the states of
California, Texas, and New York in the US. New York City (NYC) Cab, which is a taxi-based
service, is working with Uber, Lyft, Via plus Juno, and contributing notably to Ride Sharing
services [51].
The research on currently popular Ride Sharing applications identified several issues.
Common limitations across the applications are that drivers learn
the count of passengers only at the pickup location [34], and on most trips, the riders and driver do not
reach the vehicle seating capacity [7, 26]. Additional issues are that passengers do not possess the
basic information of other passengers they are traveling with, pricing is unfair for users [33], and the
sudden addition of a rider whose destination is too far adds significant time to trip completion
[52]. Also, a critical issue lies in the vehicle traversing approach, or the route a car covers
on a trip, which does not meet rider expectations of completing the journey in minimal time [53].
Acknowledging the listed issues in the existing Ride Sharing applications, the proposed
model in the thesis is designed in a way that eliminates most of the problems. To accept most of
the broadcasting riders in the system, the proposed model in the thesis presents the three types of
rider matching. The first type of match is the Exact match, also referred to as the Same or
Similar match. In the Exact match, the system finds riders with exactly matching characteristics.
If the pool is incomplete or if the riders do not reach the seating capacity of the vehicle, the
system triggers the search for riders with the second type of matching. The second type finds
riders with Closer or Altered characteristics. Closer characteristics are the characteristics that are
slightly different from the broadcasting rider’s characteristics. If the pool is still incomplete, the
system begins the third type of matching, which is comparable to the Uber and Lyft approach of
matching riders [54]. The third type finds riders irrespective of characteristics i.e., matching based
on the closest traveling time [54]. The third type of matching is called the Alternative type of
characteristic matching as the system searches for passengers with alternative characteristics. The
system serves most of the broadcasting rider requests by using the three types of characteristics
matching and assures that it generates trips for a maximum number of riders. The three types of
[Figure: the three types of characteristics matching. Panel titles: Same/Exact/Similar Match; Altered/Closer Match; Uber/Lyft Alternative Match. In each panel, rider B is shown with Chatty 2, Punctuality 3, Comfortability 4, Safety 4, and Friendliness 2.]
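The escalation through the Exact, Closer, and Alternative match types can be sketched as follows. The function `fill_pool` and the sample candidates are hypothetical. The text does not define how different a "slightly different" characteristic may be, so the sketch assumes a tolerance of one point per characteristic; it also orders candidates by a made-up traveling time and leaves out the UTT layer for brevity.

```python
def fill_pool(broadcaster_ratings, candidates, capacity, tolerance=1):
    """Fill a pool by trying Exact, then Closer, then Alternative matches.

    candidates: list of (name, ratings, travel_minutes) tuples.
    Returns a list of (name, match_type) pairs of length at most capacity.
    """
    remaining = sorted(candidates, key=lambda c: c[2])  # closest rider first
    tiers = [
        ("exact", lambda r: r == broadcaster_ratings),
        # Assumed rule: every characteristic differs by at most `tolerance`.
        ("closer", lambda r: all(abs(x - y) <= tolerance
                                 for x, y in zip(r, broadcaster_ratings))),
        ("alternative", lambda r: True),  # ignore characteristics entirely
    ]
    pool = []
    for label, accepts in tiers:
        for cand in list(remaining):  # iterate over a copy while removing
            if len(pool) >= capacity:
                return pool
            if accepts(cand[1]):
                pool.append((cand[0], label))
                remaining.remove(cand)
    return pool


# Ratings of rider B as they appear in the figure; the ordering is arbitrary.
rider_b = (2, 3, 4, 4, 2)
candidates = [
    ("r1", (2, 3, 4, 4, 2), 12),
    ("r2", (3, 3, 4, 4, 2), 6),
    ("r3", (5, 1, 1, 2, 5), 4),
    ("r4", (2, 3, 4, 4, 2), 9),
]
pool_of_three = fill_pool(rider_b, candidates, capacity=3)
pool_of_four = fill_pool(rider_b, candidates, capacity=4)
print(pool_of_three)
print(pool_of_four)
```

With three free seats the pool is completed by two exact matches and one closer match; only when a fourth seat is free does the search fall through to the alternative match.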
After completing the trip formation, the system sends the trip itinerary to every user,
including the driver. The event of sending every user’s basic information to other users reduces
the social barriers among riders as riders get to know with whom they are going to travel on an
upcoming trip. Also, with the User Threshold Time (UTT) matching layer and shortest path
are not at a location that exceeds the trip’s User Threshold Time. By studying the limitations in
current Ride Sharing applications, it is concluded that for an application to be efficient and
popular, it is essential to implement the system features which reach the overall user expectations
Significant technologies that are speeding up the building of smart cities are the Internet
of Things (IoT), Artificial Intelligence, and Cloud Computing. Such technologies also contribute
IoT enables efficient device connectivity and communication while broadcasting the data.
It is possible to push or send a notification to a million connected devices within a few seconds
[55]. Integrated with Car-Pooling, every vehicle can connect and communicate to a data hub that
logs every minor detail about the trip [16, 55]. Accordingly, the current status of a vehicle can be
notified to broadcasting riders, facilitating faster decisions for road traversing, vehicle tracking,
and location-based requests clustering. Such features result in continuous status updates, quicker
Cloud services bring numerous benefits to any computing system [56, 57]. Enabling Cloud
services results in better system scalability, service availability, and efficient load balancing of
requests [56]. Cloud services also decrease the overall costs of any system by offering resources
like virtual machines, domain spaces for website hosting, and the databases. Also, the Cloud
services facilitate efficient resource allocation plus management, and the resources are virtually
If the system consumes too much time while responding to a client request, the system
loses efficiency. In the Cloud environment, requests from a client device travel to the Cloud,
interact with the Cloud servers, and travel back to client devices to render server data,
introducing latency. If the Cloud server and application reside at two different places, the
traveling time of requests from the client to the server can cause a considerable time delay [49].
Further research on Cloud Computing led to the topic of Fog Computing [58].
A Fog server consists of a group of small servers that reside near the client location.
Computations take place at the Fog server, which significantly reduces the request travel time, as the
servers processing the client requests are placed nearer to the client machines than the
actual Cloud [59].
The simulations in the thesis observed a large number of request and response
transactions. For quicker computations of the client requests, the thesis utilized the technology of
Fog Computing. Currently, the processing of requests takes place at a client machine that resides
in a Cloud server. For storage purposes and exploiting the benefits of Cloud Computing in terms
of databases [49], the system uses the MongoDB Atlas database, which is a Cloud-based database.
To conclude, modern technologies play a crucial role in application design, data storage, and
resource management. Also, factors like load balancing, timeliness of result, and quality of service
It is of utmost importance to meet the traversing requirements of the users. The traversing
requirements are the possibilities of routing a vehicle to pick and drop users from their respective
sources to destinations [47]. There are four traversing possibilities. The first traversing path is the
Same-Source-Same-Destination (SSSD), where the trip starts from the same source and ends at
the same destination for all riders. There are no stops included in the SSSD. The second
from the same source location and dropped at different destinations. In the third approach, which
but they all end up at the same location. The research on the traversing modules observed the
use of MSSD in many existing applications [40, 42, 44]. The last and most significant traversing
it includes all the traversing modules, which are SSSD, SSMD and MSSD [26]. MSMD meets
the primary user requirement that users may start from multiple sources and end up at
multiple destinations, or a trip may include multiple pickups and multiple drop-offs [28, 33].
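The four traversing possibilities differ only in whether the riders share a source and whether they share a destination. A minimal sketch of that distinction, using a hypothetical function `traversing_approach` and made-up location labels, is given below.

```python
def traversing_approach(trips):
    """Classify a pool of (source, destination) pairs as SSSD, SSMD, MSSD, or MSMD."""
    sources = {source for source, _ in trips}
    destinations = {destination for _, destination in trips}
    source_part = "S" if len(sources) == 1 else "M"
    destination_part = "S" if len(destinations) == 1 else "M"
    return f"{source_part}S{destination_part}D"


print(traversing_approach([("A", "X"), ("A", "X")]))  # one source, one destination
print(traversing_approach([("A", "X"), ("A", "Y")]))  # one source, several destinations
print(traversing_approach([("A", "X"), ("B", "X")]))  # several sources, one destination
print(traversing_approach([("A", "X"), ("B", "Y")]))  # several sources and destinations
```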
The study on the vehicle traversing approaches included a search for algorithms that
contribute to the formation of an MSMD path. The primary outcome of the MSMD approach is
to consider the sources and destinations of all users and form an optimized itinerary. Some of the
algorithms that promote the formation of an MSMD itinerary include Mesh networks, Dijkstra's
shortest path Many-Sources-Same-Destination approach, and Greedy algorithms [32, 39, 60].
[Figure 5 diagram: sources S1, S2, S3 and destinations D1, D2, D3 joined by travel-time edges; only the edge label "1 min" survives extraction.]
Figure 5: Multiple-Sources-Multiple-Destinations (MSMD) Traversing Approach.
Figure 5 showcases the example of 3 riders having three sources, S1, S2, S3, and three
destinations, D1, D2, D3. Initially, the system selects the broadcasting rider source and calculates
the traveling time between all locations. The next station to be selected is the closest remaining source or
the destination of a rider whose source has already been selected. The process continues until the system traverses
through all the locations. Based on the computed traveling times, the green arrowed route shows
the optimized travel path, which is S1-S2-D2-S3-D1-D3.
The Mesh networks include the creation of a route based on the dynamic addition of
locations [61]. The trip itinerary is regenerated if new locations are added to an ongoing trip
[61, 39]. The drawback observed in the Mesh network is the computation time required for
developing multiple optimized routes using different combinations of locations until finding the
best one [39]. Another approach includes completing the journey through various public and
private transportation systems like buses, cabs, and taxis [18, 22, 62]. In some cases, users had to
walk a certain distance and meet other riders at a common location where passengers would be
later picked for Ride Sharing. The limitation of completing the journey through different methods
of transportation introduces latency due to the involvement and exchange of various means of
The selected method in the thesis for creating a MSMD route is the Greedy algorithm
[63]. In the Greedy algorithm, initially, any source from the available sources is selected. The
next step is the selection of the closest source or the destination of the rider whose source is
initially selected. The process of location selection continues until all the locations are traversed.
The journey created is an optimized one and formed by a Greedy approach. The modification
performed in the thesis is starting the itinerary formation with the broadcasting rider’s source and
selecting further locations based on the traveling time instead of the traveling distance. Figure 5
demonstrates the formation of an optimized path using the MSMD vehicle traversing approach.
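The greedy selection described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function name `greedy_msmd_route`, the rider identifiers, and the one-dimensional `positions` used to fake traveling times are all assumptions made for the example. A source stays eligible until it is visited, and a destination becomes eligible only after its rider's source has been visited.

```python
from typing import Callable, Dict, List, Tuple

# travel_time(a, b) -> minutes; a hypothetical stand-in for a routing service.
TravelTime = Callable[[str, str], float]


def greedy_msmd_route(
    riders: Dict[str, Tuple[str, str]],
    broadcaster: str,
    travel_time: TravelTime,
) -> List[str]:
    """Start at the broadcasting rider's source, then repeatedly move to the
    nearest eligible stop measured by traveling time."""
    start, first_destination = riders[broadcaster]
    route = [start]
    pending_sources = {r: src for r, (src, _) in riders.items() if r != broadcaster}
    pending_destinations = {broadcaster: first_destination}

    while pending_sources or pending_destinations:
        # Candidate stops: unvisited sources plus destinations of picked-up riders.
        candidates = [(loc, r, True) for r, loc in pending_sources.items()]
        candidates += [(loc, r, False) for r, loc in pending_destinations.items()]
        current = route[-1]
        loc, rider, is_pickup = min(candidates, key=lambda c: travel_time(current, c[0]))
        route.append(loc)
        if is_pickup:
            # The rider is now on board, so the rider's destination becomes eligible.
            del pending_sources[rider]
            pending_destinations[rider] = riders[rider][1]
        else:
            del pending_destinations[rider]
    return route


# Illustrative data only: locations on a line, traveling time = distance between them.
positions = {"S1": 0, "S2": 1, "D2": 2, "S3": 3, "D1": 4, "D3": 5}
riders = {"r1": ("S1", "D1"), "r2": ("S2", "D2"), "r3": ("S3", "D3")}
route = greedy_msmd_route(riders, "r1", lambda a, b: abs(positions[a] - positions[b]))
print("-".join(route))
```

With these made-up traveling times the sketch reproduces the order shown in Figure 5. A greedy choice is only locally best at each step, so on other inputs the resulting itinerary is not guaranteed to be the globally shortest one.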
One of the main motives of the thesis is to track or determine the main characteristics a
rider focuses on most while rating other riders. The method for tracking the main characteristic
depends on feedback data. For example, a user may give the same score, such as 4 or 0, to a specific
characteristic for several trips, implying the user is less interested in that characteristic. The
task is to find the characteristics the user is most interested in and recommend riders based on
The research on the methods for tracking the main characteristics led to the finding of
statistical methods like the range of a data-set [64], standard deviation [64], and variance
[65, 66, 64]. The selected methodology for tracking the main rider characteristics is the variance.
The concept of the variance is demonstrated with the help of stated three lists, L1 , L2 and L3 .
L1 = [1, 0, 5, 4, 0]
L2 = [0, 0, 0, 0, 2]
L3 = [4, 4, 4, 4, 4]
Let N be the total number of sample points in a list. The mean of the sample set is
denoted by xi . The distance of data-point x to the mean xi or the spread of a specific sample
$$x_{distance} = x - x_i \tag{2.1}$$
The variance of a data-set is computed using Equation 2.1. Variance indicates the level of
spread of each sample point in a data-set [64]. Variance is also defined as the average of squared
differences from the mean of the data-set [66, 64]. The differences are squared because the
subtraction of the mean from a sample point may result in a negative value [64]. The larger the
variance of a data-set, the higher is the data-variety or the spread of data in the data-set.
The variance for the list L1 will be comparatively higher than the lists, L2 and L3 . The
spread of data around the mean in lists L2 and L3 is notably low [64]. If a similar methodology is
applied for the feedback data-set of a rider in the proposed model, the characteristic feedback set
with the highest variance is the main tracked characteristic of the rider. Thus, the thesis
employed the variance to track the main rider characteristics based on feedback data-sets.
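As a worked check of the comparison above, the short sketch below computes the variance of the three lists and picks the one with the highest value. It assumes the population form of the variance (the average of squared differences from the mean, as defined in the text); whether the thesis divides by N or N - 1 is not stated, and the names `population_variance` and `feedback` are illustrative.

```python
from typing import Dict, List


def population_variance(samples: List[float]) -> float:
    """Average of squared differences from the mean (assumed population form)."""
    mean = sum(samples) / len(samples)
    return sum((x - mean) ** 2 for x in samples) / len(samples)


# The three lists from the text.
feedback: Dict[str, List[float]] = {
    "L1": [1, 0, 5, 4, 0],
    "L2": [0, 0, 0, 0, 2],
    "L3": [4, 4, 4, 4, 4],
}

variances = {name: population_variance(values) for name, values in feedback.items()}
# The list with the highest variance plays the role of the tracked main characteristic.
main = max(variances, key=variances.get)
print(variances, main)
```

Under this assumption the variances are about 4.4 for L1, 0.64 for L2, and 0 for L3, so L1 is selected, which agrees with the ordering stated in the text.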
altering the characteristics of broadcasting riders like adding or subtracting by 1 and then
researching the riders. For automating the task of manual alterations, the research included the
study of Machine Learning algorithms and led to the finding of the Machine Learning
features are converted to vectors and represented in a d-dimensional space, where d is the number
of features [67, 68]. The angular distance or the Cosine of the angle θ, which is between the
vectors is calculated using the equation of the Dot Product [68, 69]. The recommendation system
plots the vectors with the highest Cosine values closer to each other [67, 68, 69]. The thesis uses a
similar methodology where the selected features are the registered rider characteristics and the
UTT. The ML-based recommendation system plots the rider vectors with higher Cosine values in
proximity, and riders closest to each other are selected and added on a trip.
An additional need for Machine Learning algorithms in the thesis is the prediction of the
two main characteristics of newly registering riders. There is room for a little error due to the
presence of the imbalanced feedback data-sets in the proposed model [70]. In the case of an
imbalanced data-set, for similar inputs, different outputs may be recorded, creating uncertainties
during predictions [71]. The research included a search for a Machine Learning classification
algorithm or a classifier that could appropriately fit an imbalanced data-set and give quality
predictions.
The search for a suitable Machine Learning classifier led to the training and testing of
feedback data-sets with classifiers like Logistic Regression, K-Nearest Neighbours (KNN)
classifier, Naive Bayes Multinomial classifier, Random Forests classifier, Neural Networks, and
Support Vector Machine (SVM) [72, 73, 74, 75, 76]. Out of all tested classifiers, SVM turned out
to be the most feasible because of the Radial Basis Function (RBF) Kernel [77].
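For readers unfamiliar with the RBF kernel, the sketch below evaluates its standard definition, K(x, y) = exp(-γ‖x − y‖²), on two example vectors. The formula is the textbook one and is not quoted from the thesis; the vectors and the γ values are made up for illustration. In libraries such as scikit-learn, the same γ and the regularization parameter C appear as the `gamma` and `C` arguments of `SVC(kernel="rbf")`; which library the thesis used is not stated here.

```python
import math
from typing import Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Hypothetical 5-characteristic vectors, used only to show the effect of gamma.
a = [3, 4, 3, 3, 4]
b = [4, 4, 3, 5, 3]

same = rbf_kernel(a, a, gamma=0.5)    # identical points always give 1.0
narrow = rbf_kernel(a, b, gamma=1.0)  # large gamma: similarity decays quickly
wide = rbf_kernel(a, b, gamma=0.1)    # small gamma: similarity decays slowly
print(same, narrow, wide)
```

A larger γ makes each training point influence only its close neighbourhood, which bends the decision curve more tightly around the data, while a smaller γ gives a smoother curve.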
For distinguishing classes, SVM uses the RBF kernel, which produces a highly non-linear decision curve
[76, 77, 78]. SVM works on the principle of placing the separating line or curve so that its
distance to the closest data-points is maximized [76]. The regularization parameter, C, and the gamma parameter, γ,
dictate the shape and the placement of the curve [77]. The process of governing the placement of the
curve by manipulating the values of C and γ is called Kernelization [77, 78, 79]. Kernelization
allows maximum fitting of data-points of a class, which may also include fitting a few data-points
from other classes. Hence, SVM tolerates a small error in exchange for the maximum fitting of data and
works best for imbalanced data-sets [70, 77, 79]. In the end, the classifier selected in the thesis is
CHAPTER 3
SYSTEM MODEL
The system model in the thesis reflects the proposed model framework and the purposes
of the Enhanced Ride Sharing Model. The chapter begins by specifying the problem statement,
which outlines the need for the designed model. The chapter concludes by describing the system
architecture, which focuses on several components utilized in orchestrating the matching layers
The increased number of vehicles has led to significant problems in the transportation
domain like Global Warming, traffic congestion, and rapid consumption of fuel [23, 25, 26]. Along
with humans, the stated issues also affect other living beings on our planet [80]. In such cases, the
concept of Ride Sharing provides optimal solutions, and currently, many existing Ride Sharing
applications tackle and solve the aforementioned problems. After an in-depth inspection of several
applications and research papers based on the existing Ride Sharing models, the investigation
concluded that the primary issue lies in the matching of riders, unexpected rider additions,
vehicle traversing approach, and the overall time management in the completion of a trip
[32, 39, 57]. Hence, even though there exist many Ride Sharing applications, Ride Sharing is not
The thesis presents a Ride Sharing platform that focuses on encouraging the services of
Ride Sharing. The proposed solutions provide outcomes like higher rider matching rates and
minimal time expenditure for trip formation plus trip completion. The system also provides the
trip’s metadata to all users to loosen the influence of social barriers among riders. Also, the
model reaches user traversing expectations by creating an optimized path using the
Car-Pooling to complete a trip with appropriate time management. The main idea of the
designed model is to match riders based on human characteristics and the User Threshold Time
(UTT). Riders having similar characteristics are grouped considering the minimal restricted
traveling time on a trip. The proposed model also uses Machine Learning algorithms to predict
better rider recommendations and to tune the system efficiency. The thesis mainly focuses on
the expansion of the Ride Sharing services, which will indirectly result in improving the weather
3.2 Architecture
The architecture in the thesis resembles the blueprint of the designed Ride Sharing model.
The thesis includes two design phases, and therefore, the chapter of the system model presents
Phase 1 provides the first design for the Enhanced Ride Sharing Model. The execution of
Phase 1 commences with associating a driver on a trip. The driver allocation is followed by
finding and filtering the riders based on rider characteristics and User Threshold Time. Figure 6
While researching the NYC Cab service, the study led to the finding of the NYC Cab
location data repository. The repository includes real-time NYC taxi zone locations and is
publicly available [81]. The New York City Cab Department has divided New York City into
265 small areas, also referred to as the zones. Figure 7 gives an idea about the zones in New York
City and showcases an example of the zone “Ridgewood.” Each zone possesses almost 1000
locations in the form of latitudes and longitudes [81]. For accurate measurement of the system
efficiency, the Ride Sharing model in the thesis made use of the NYC Cab location directory while
client-server setting, a client device sends a request with the user data to a server. The server
processes the client’s request and sends the response data back to the client device. The system
then renders the received data from the server on the client device. In Phase 1, initially, a user
broadcasts a rider request that holds the broadcasting rider’s user-id, source, and destination.
Based on the contents in the request, the data server retrieves the characteristics plus UTT and
creates a data document, called the trip document, which includes the request data of the
broadcasting rider.
The database on the server-side includes an active repository of all drivers. When drivers
are active or awaiting broadcasting rider requests, the system keeps updating the location and
status of the vehicles for quicker driver allotment to incoming requests. Additionally, the system
notes the source zone from the trip document. The noted source zone indicates from which source
zone the broadcasting request has originated. All the available drivers from the noted source zone
are retrieved, and the closest available driver to the user’s source location is selected. The model
adds the selected driver to the trip document, and the activity of the driver association completes
Furthermore, the system sends the source zone as a parameter to the rider matching
functions. The first function is the characteristics matching function, which includes the Exact,
Closer, and Alternative types of characteristics matching. The function searches and retrieves all
the active and broadcasting riders from the same source zone and gets a rider list based on the
three characteristics matching types. The accepted passengers are further sent to the second
The second function is the UTT filtering function that computes the traveling time from
every accepted rider’s location to the broadcasting rider’s location. If the traveling time is less
than trip UTT, the rider is accepted. The characteristics and UTT functions continue the
matching and filtering of riders until the number of accepted riders reaches the seating capacity of
the vehicle or until there are no riders left in the characteristics matching accepted rider list. If
the pool is incomplete or if the maximum number of seats in the car is not occupied, the model
searches for active and broadcasting riders in other zones. Riders found in other zones undergo
After concluding the rider search, the system executes trip completion and records rider
feedback. The simulation of trip completion consists of adding the time required to traverse
between all rider locations. The feedback in Phase 1 is the event where a user provides a
single-digit rating to other users on a scale of 1 to 5. The single-digit feedback module forms the
The characteristics and UTT functions provided successful results that reached the
expectations of Phase 1. However, the simulations in Phase 1 incurred a significant time lag while
running the characteristics matching function. After a minor inspection of the function, the
investigation revealed that the time lag is due to the presence of numerous conditional statements
in the Closer characteristics matching. For achieving optimized performance, it was necessary to
eliminate the extra time consumed in the characteristics matching due to the several conditional
loops. The model experienced significant programming changes that led to the creation of Phase
2 of the thesis. The changes in design did not imply changing the idea of characteristics matching
The second phase mainly focuses on the characteristics matching layer and the rider
feedback systems. The improvements in the architecture consist of elements like matching layers
with recommendation systems, 1-minute threshold driver match, broadcasting rider requests with
the feedback status, redesigned feedback system, computation of the main characteristics, and the
Machine Learning classification model. Figure 8 reflects the system architecture for Phase 2 of the
A new field added to the broadcasting request is the feedback status. The feedback status
checks if there is the presence of historical data representing the user feedback pattern. The
historical data of a rider consists of the rider feedback and the computed main characteristics
assigned by the system based on past trips. If the user has performed trips before, the system
gets the assigned main characteristics and prioritizes a search for other riders with similarly
The next step is a similar step performed in the first phase, which is to find the closest
available driver from the same source zone. In Phase 1, the system computed the traveling time
for all available drivers and then selected the nearest driver. The driver search in Phase 1
consumed much time as all drivers were initially searched and later compared based on computed
traveling time. A solution implemented in such a case is the 1-minute driver threshold strategy,
which is stopping the driver search if the system finds a driver at a location that is within a
traveling time of 1 minute. Hence, there are limited iterations with the 1-minute threshold driver
strategy. The allocation of the driver to a trip marks the completion of Step 1.
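The 1-minute threshold strategy can be sketched as an early-exit search. This is an illustrative outline only: `find_driver`, the driver tuples, and the `travel_time` callable are hypothetical, and the callable stands in for whatever routing service supplies the traveling times.

```python
from typing import Callable, Iterable, List, Optional, Tuple

THRESHOLD_MINUTES = 1.0  # the 1-minute driver threshold from the text


def find_driver(
    rider_location: str,
    drivers: Iterable[Tuple[str, str]],
    travel_time: Callable[[str, str], float],
) -> Optional[Tuple[str, float]]:
    """Return (driver_id, minutes). Stop at the first driver within the
    threshold; otherwise fall back to the closest driver seen."""
    best: Optional[Tuple[str, float]] = None
    for driver_id, driver_location in drivers:
        minutes = travel_time(driver_location, rider_location)
        if minutes <= THRESHOLD_MINUTES:
            return driver_id, minutes  # early exit: remaining drivers are not checked
        if best is None or minutes < best[1]:
            best = (driver_id, minutes)
    return best


# Made-up traveling times in minutes, keyed by driver location.
times = {"A": 4.0, "B": 0.8, "C": 0.5}
calls: List[str] = []


def lookup(driver_location: str, rider_location: str) -> float:
    calls.append(driver_location)  # record how many lookups were needed
    return times[driver_location]


selected = find_driver("rider", [("d1", "A"), ("d2", "B"), ("d3", "C")], lookup)
print(selected, len(calls))
```

In this example the search stops at the second driver, so the third traveling-time lookup is never made. The trade-off is that the selected driver is within the threshold but not necessarily the closest one overall.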
After the system associates a driver to a trip, the execution of the characteristics matching
layers begins. In step 2, the system fetches all the broadcasting riders based on the Exact, Closer,
and Alternative matching types. The enhancement in Step 2 is that the proposed model uses the
Machine Learning recommendation system in all three types of characteristics matching. The
recommendation system eliminates the process of manually updating the characteristics. As there
is no interference for updating a characteristic, the time consumed for the trip formation in Phase
2 is notably less as compared to the time consumed for the trip formation in Phase 1.
After getting a rider list from the characteristic matching layer, the system computes
traveling time between the broadcasting and selected rider locations. The step of computing plus
checking the traveling time is the UTT matching layer and is Step 3 of the architecture. Riders
are added to the final trip itinerary if they satisfy the conditions in Step 2 and Step 3. The
system continues the running of matching layers until the accepted riders fill up the seats of the
selected driver’s vehicle, or no more riders are left to traverse in the accepted rider list. The trip’s
pool completion status is labeled “Yes” if the riders and the driver together occupy at least n_seats − 1 seats, where
n_seats is the total number of seats in the vehicle. The pool completion status is labeled “No” if
the riders and driver do not reach the vehicle seating capacity.
Step 4 is saving the feedback and computing the two main characteristics for every rider
and driver. Also, the design in Phase 2 included a significant change in the feedback rating
approach of Phase 1. In the new approach, a user rates the five characteristics of riders instead of
providing a single-digit rating. Such an approach assists in tracking the user’s most favored
characteristics. Through the newly designed feedback module, it is possible to get the
characteristics a rider expects in other riders while commuting on a trip. If the system groups
riders having similar expectations based on the rider feedback patterns, the thesis achieves the
After recording the feedback by users, the system segregates the feedback records and
enters the data into two distinct data-sets. The first data-set comprises the feedback data a user
provides to other users, while the second data-set comprises the feedback data a user receives
from other users. The Machine Learning classifier uses both data-sets to predict the main
characteristics for every user in the system. Therefore, Step 5 is the training and testing of the
Machine Learning classifier. The need for Machine Learning in the thesis is to predict the main
characteristics of the newly registering riders. The step of predicting the characteristics by the
CHAPTER 4
METHODOLOGIES
The chapter of methodologies presents the approaches utilized to construct the proposed
model. The current chapter specifically focuses on the implementations of the elements that
constitute the system architecture. The chapter exhibits an in-depth visualization of the system
components in the form of six prime sections: (i) The Broadcasting Rider Request (ii) The Search
for the Closest Driver (iii) Searching Riders by Characteristics Matching (iv) Filtering Riders
through UTT Matching (v) Saving User Feedback and (vi) The Final Trip Document.
[Figure: fields of the broadcasting request and trip document: SOURCE LOCATION, SOURCE ZONE, DESTINATION LOCATION, DESTINATION ZONE, MONGO-ID, USER-ID, CHATTY_REQ, SAFETY_REQ, PUNCTUALITY_REQ, FRIENDLINESS_REQ, COMFORTABILITY_REQ, UTT, TIME STAMP.]
The program execution begins with a client device like a cell phone or a computer
broadcasting a request to the data server. The most important part of the broadcasting request is
the user-id. The server fetches the registered rider characteristics and UTT from the rider
registration records using the user-id. The next step is one of the vital steps in the Ride Sharing
model, which is the creation of the trip document. Figure 9 showcases the structure and elements
The trip document logs every essential trip detail or any minor trip updates. If the trip
document is inspected closely, in addition to the user-id, there is also a mongo-id. During user
registration, the database, MongoDB, creates a new and unique mongo-id for every user, which is a 12-byte
BSON (binary JSON) ObjectId. The mongo-id is hard to read and needs
conversion to a simple string format for later data handling purposes. Hence, the system
generates a user-id for every rider at registration. User-id is a unique identification number that is
easily readable and serves better for data handling tasks like data additions and alterations. The
broadcasting rider’s characteristics and UTT are referenced as the trip characteristics and trip
UTT because, throughout the trip, the model refers to the broadcasting rider’s characteristics
and UTT present in the trip document while searching a rider or a driver.
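The creation of the trip document can be sketched as a lookup of the registration record followed by assembling a dictionary. The field names follow the labels listed for the trip document, but the function name, the shape of the request, the registration store, and the sample values are all assumptions for illustration; the thesis stores these records in MongoDB, which is replaced here by a plain in-memory dictionary.

```python
from datetime import datetime, timezone
from typing import Any, Dict

CHARACTERISTICS = ("chatty", "safety", "punctuality", "friendliness", "comfortability")


def create_trip_document(
    request: Dict[str, Any],
    registrations: Dict[str, Dict[str, Any]],
) -> Dict[str, Any]:
    """Combine the broadcast request with the rider's registered
    characteristics and UTT, looked up by user-id."""
    profile = registrations[request["user_id"]]
    document: Dict[str, Any] = {
        "mongo_id": profile["mongo_id"],
        "user_id": request["user_id"],
        "source_location": request["source_location"],
        "source_zone": request["source_zone"],
        "destination_location": request["destination_location"],
        "destination_zone": request["destination_zone"],
        "utt": profile["utt"],  # becomes the trip UTT
        "time_stamp": datetime.now(timezone.utc).isoformat(),
    }
    for name in CHARACTERISTICS:
        # The broadcasting rider's characteristics become the trip characteristics.
        document[f"{name}_req"] = profile[name]
    return document


# Hypothetical registration record and request.
registrations = {
    "u42": {"mongo_id": "5e1f0c9ab7e4a2d1c3f4a5b6", "utt": 10,
            "chatty": 3, "safety": 4, "punctuality": 3,
            "friendliness": 3, "comfortability": 4},
}
request = {
    "user_id": "u42",
    "source_location": (40.7006, -73.9057), "source_zone": "Ridgewood",
    "destination_location": (40.7128, -74.0060), "destination_zone": "Zone B",
}
trip_document = create_trip_document(request, registrations)
print(sorted(trip_document))
```

Keeping the request small (user-id plus locations) and resolving everything else on the server mirrors the flow described above, where the user-id is the key to the registered characteristics and UTT.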
[Figure: driver search module. The data server holds the trip document (MONGO-ID, USER-ID, SOURCE and DESTINATION zone and location, CHATTY_REQ, SAFETY_REQ, PUNCTUALITY_REQ, FRIENDLINESS_REQ, COMFORTABILITY_REQ, UTT, TIME_STAMP) and gets the current driver location. If Time <= 1 min: add driver. Else: fetch the closest driver and add to the trip.]
The system keeps a driver’s status as available as long as the driver is active and the riders
have not reached the seating capacity of the vehicle. At first, the system gets the source zone
from the recently created trip document and retrieves all the available drivers using the source
zone as the parameter. After getting the driver list, the system records every driver’s current
location. The next crucial step is the computation of traveling time between the broadcasting
The traveling time including real-time traffic is computed using the Google Maps Distance
Matrix API. The calculated timings are compared to find the lowest one, and the driver with the
shortest traveling time is selected and added on the trip. An essential step in the driver search is
noting the vehicle seating capacity. Figure 10 represents the complete driver search module using
With the improved driver search algorithm in Phase 2, driver allotment is more agile.
The system stops the search for a driver if it finds a driver within a traveling time of one minute.
Selecting and adding a driver to a trip implies updating the trip document with driver details like
driver’s mongo-id, user-id, the vehicle’s license plate number, and the computed traveling time
between the broadcasting rider and the selected driver. Figure 11 showcases the updated trip
A complete rider search module with the three types of characteristics matching is stated
in Figure 12. The rider search commences with Exact characteristics matching. In Exact
matching, the model searches for riders from the same zone with identical characteristics to that
of the broadcasting rider’s characteristics. The model adds the found riders in a list and sends the
[Figure: rider search module. The trip document fields (MONGO-ID, USER-ID, SOURCE ZONE, DESTINATION ZONE, SOURCE LOCATION, DESTINATION LOCATION, TIME_STAMP, CHATTY_REQ, SAFETY_REQ, PUNCTUALITY_REQ, FRIENDLINESS_REQ, COMFORTABILITY_REQ, UTT) feed the Exact, Altered/Closer, and Alternative characteristics matches, which pass riders to the UTT matching layer. When seat capacity = 0 or no riders remain in the queue, the final rider list is produced.]
If the riders do not complete the pool, the system commences the Closer characteristics
matching. In Closer matching, especially in Phase 1, the characteristics were manually altered. In
future stages, an iterative conditional program replaced the manual alterations of the
characteristics. Even though the task was automated, the execution still consumed significant
time. The root cause of the delay was the presence of numerous conditional executions. Figure 13
provides an idea of how the program altered the broadcasting rider’s characteristics.
As stated in the chapter of the system model, a significant design change led to the
replacement of the entire matching function with the Machine Learning recommendation system.
The ML-based system facilitates computing a match between the broadcasting rider
characteristics and countless possible combinations of other rider characteristics. The next
chapter of matching layers with Machine Learning presents a detailed mathematical explanation
of the Content-Based recommendation system. After the rider acceptance in Closer matching, the
system adds all riders in a queue and sends them for the UTT matching.
Furthermore, if the seats of the vehicle remain unfilled, the model employs the last type of
matching. In the Alternative type of rider matching, the system searches for riders irrespective of
the characteristics. The approach is similar to the rider selection approach employed by services like
UberPool and LyftLine. Riders are then added in a queue and sent for the UTT matching.
[Figure 14: UTT matching layer. Riders accepted in the characteristics matching layer pass a 1st travel time check (less than UTT) between the broadcasting rider's source and the rider's source, then a 2nd travel time check (less than UTT) between the two destinations. Both traveling times come from the Google Maps Distance Matrix API. The output is the final rider list based on the characteristics and UTT layers.]
The next vital action in the rider matching is sending the accepted riders from the
characteristics matching layer to the UTT matching layer. At first, for all riders, the system
calculates the traveling time between the broadcasting rider’s and the accepted rider’s source locations using the
Google Maps Distance Matrix API. The model then performs the first UTT check, which is to
verify if the traveling time is less than or equal to the trip UTT. If the rider satisfies the first UTT
check, the model sends the rider to a second UTT check. The model
performs the second check, which is to verify if the traveling time between the broadcasting rider’s and the rider’s
destination locations is less than or equal to the trip UTT. If the rider satisfies both UTT checks,
the system adds the rider into a final itinerary. The process of UTT matching continues until the
riders reach the seating capacity of the vehicle, or the model does not find any more riders in the
accepted rider queue. Figure 14 illustrates the two UTT checks in the UTT matching layer.
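The two UTT checks can be sketched as a filter over the queue of riders accepted by the characteristics matching layer. The function name, the dictionary layout, and the traveling times below are hypothetical, and the `travel_time` callable stands in for the Google Maps Distance Matrix API call. The direction of travel (from the broadcasting rider's location to the candidate's location) is also an assumption, since the text does not fix it.

```python
from typing import Callable, Dict, List


def utt_filter(
    broadcaster: Dict[str, str],
    candidates: List[Dict[str, str]],
    travel_time: Callable[[str, str], float],
    trip_utt: float,
    free_seats: int,
) -> List[str]:
    """Keep riders whose source and destination are both within the trip UTT
    of the broadcasting rider's source and destination."""
    accepted: List[str] = []
    for rider in candidates:
        if len(accepted) >= free_seats:
            break  # vehicle is full
        # First UTT check: source to source.
        if travel_time(broadcaster["source"], rider["source"]) > trip_utt:
            continue
        # Second UTT check: destination to destination.
        if travel_time(broadcaster["destination"], rider["destination"]) > trip_utt:
            continue
        accepted.append(rider["user_id"])
    return accepted


# Made-up traveling times in minutes between named locations.
minutes = {
    ("BS", "S1"): 3, ("BD", "D1"): 4,  # passes both checks
    ("BS", "S2"): 9, ("BD", "D2"): 1,  # fails the first check
    ("BS", "S3"): 2, ("BD", "D3"): 7,  # fails the second check
    ("BS", "S4"): 5, ("BD", "D4"): 5,  # equal to the UTT, so accepted
}
broadcaster = {"user_id": "b", "source": "BS", "destination": "BD"}
candidates = [
    {"user_id": "r1", "source": "S1", "destination": "D1"},
    {"user_id": "r2", "source": "S2", "destination": "D2"},
    {"user_id": "r3", "source": "S3", "destination": "D3"},
    {"user_id": "r4", "source": "S4", "destination": "D4"},
]
accepted = utt_filter(broadcaster, candidates, lambda a, b: minutes[(a, b)],
                      trip_utt=5, free_seats=3)
print(accepted)
```

Running the second check only for riders that pass the first one avoids a traveling-time lookup for every rejected rider, which matters when each lookup is a network call.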
The thesis maintains a threshold of two minutes for the trip formation, which assures that
the time for trip formations is not high. If the rider list is exhausted for the same zone, the
system extends the rider search with characteristics and UTT matching to other zones. Also, if a
rider gets rejected from one trip, the system redirects the rejected rider request to other ongoing
trips. The rejected rider request may also be sent to an active and available driver to commence a
new trip. Using the maximum rider check strategy from multiple zones allows the vehicle seats in
Figure 15: A Use Case of Phase 2 Feedback System.
Figure 15 illustrates the recrafted feedback system with riders Rider1 and Rider2 and the driver
D. The arrows provide the direction of rating, and the label on the arrows specifies the
characteristic the user is rating. In the given example, Rider1 provides a rating of 3 to the
punctuality characteristic of Rider2 . Another example is of the safety rating of 5, provided by the
driver D to Rider1 .
The architecture in Phase 2 replaced the single-digit rating model with the five
characteristics rating model to track the user characteristics. A rider provides ratings to other
riders in terms of five characteristics. The five characteristics are the same characteristics present
at the time of rider registration. Figure 15 provides an example of ratings given by two riders and
a driver on a trip. Equation 4.1 represents the rating a user gives to another user for a specific
characteristic.
In Equation 4.1, a represents the user rating other users, and b represents the user getting
rated. If a user submits feedback without rating a specific characteristic, the system assigns the
value 0 to the characteristic the user has not rated. The following is an example of the feedback
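\begin{align*}
\mathit{chatty}_{1D} &= 0 \\
\mathit{safety}_{1D} &= 4 \\
\mathit{punctuality}_{1D} &= 5 \\
\mathit{friendliness}_{1D} &= 3 \\
\mathit{comfortability}_{1D} &= 0
\end{align*}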
The system adds the feedback into a feedback data-set after each user submits the ratings
for other users. The feedback data-set is later used for the computation of the main
characteristics.
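The zero-fill rule for unrated characteristics can be sketched as below. The record layout, the function name `build_feedback`, and the in-memory list standing in for the feedback data-set are assumptions for illustration; the example values are the ones given above for Rider 1 rating the driver D.

```python
from typing import Dict, List

CHARACTERISTICS = ("chatty", "safety", "punctuality", "friendliness", "comfortability")


def build_feedback(rater: str, rated: str, ratings: Dict[str, int]) -> Dict[str, object]:
    """Create one feedback record; any characteristic left unrated is stored as 0."""
    record: Dict[str, object] = {"rater": rater, "rated": rated}
    for name in CHARACTERISTICS:
        record[name] = ratings.get(name, 0)
    return record


feedback_dataset: List[Dict[str, object]] = []

# Rider 1 rates driver D on three characteristics and skips the other two.
record = build_feedback("1", "D", {"safety": 4, "punctuality": 5, "friendliness": 3})
feedback_dataset.append(record)
print(record)
```

Storing an explicit 0 keeps every record the same width, so the later variance computation can treat each characteristic as a fixed-length list of scores.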
[Figure 16: final trip document. Trip-level fields: TRIP-ID, TRIP CHATTY_REQ, TRIP SAFETY_REQ, TRIP PUNCTUALITY_REQ, TRIP FRIENDLINESS_REQ, TRIP COMFORTABILITY_REQ, TRIP UTT, TRIP SOURCE LOCATION, TRIP DESTINATION LOCATION, TIME STAMP, START TIME, END TIME, TIME DIFF (MINS, SECS), TRIP STATUS, POOL COMPLETION STATUS. Broadcasting rider and each accepted rider (RIDER 2, RIDER 3): MONGO-ID, USER-ID, SOURCE LOCATION, SOURCE ZONE, DESTINATION LOCATION, DESTINATION ZONE, RIDER RATINGS. Driver: DRIVER USER-ID, DRIVER MONGO-ID, VEHICLE LICENCE PLATE, DR_TRAVEL TIME_DIFF, VEHICLE SEATS, DRIVER STATUS.]
From broadcasting of requests till the ending of the trip, the trip document keeps
collecting data from several entities like the broadcasting rider requests, the allocated driver, and
the accepted riders. At each stage, each part of the trip contributes to the creation of the final
trip document. For example, from the broadcasting rider, the starting and ending location of the
trip is saved, and based on the rider locations, an optimized path is created, which provides the
total trip time. In the end, the model possesses a massive block of trip data, and the system saves
the trip document in a distinct trip data-set for future data management. As shown in Figure 16,
every trip document is assigned a unique trip-id. The reason for the creation of the unique
trip-ids is for future maintenance and support. If there are customer complaints on a trip, the
The final section of the proposed solution concludes with a description of the final trip
document. In the trip document, the trip characteristics and the trip UTT are the broadcasting
rider’s characteristics and UTT. Each rider’s data is itself a small document and contains
information like mongo-id, user-id, source and destination, and a copy of ratings given by the
corresponding user to other users. Additionally, the trip document also contains an overall time
taken for the completion of the journey. The time difference is the difference in time from the
point the first rider broadcasts the request for a trip until the point where the driver
acknowledges, “trip ended” as the trip status. The final trip document marks the completion of
the entire trip and forms the final step of the proposed model.
CHAPTER 5
The chapter of matching layers with Machine Learning is a part of the proposed model,
but the module of Machine Learning itself spans numerous details. Thus, it was essential to
provide the research and contributions of the Machine Learning modules in a distinct chapter.
The current chapter starts by discussing how the recommendation system improves the quality of
matching. Later, the chapter describes the methodologies for computing the main characteristics
of the rider and presents an in-depth discussion on the selected Machine Learning classifier for
predicting the main characteristics. The chapter concludes with the simulations performed for
In the Exact matching type, the system searches for riders with exactly matching
characteristics. The odds of finding an Exact match in the same zone are low because it requires
that two or more broadcasting riders having exactly the same characteristics start
around the same time and travel between the same or nearby sources and destinations. Hence, the number
of matches in Exact characteristics matching is notably low. Alternatively, the chances of finding
a rider with slightly different characteristics and heading on the same trajectory are high. Hence,
the Closer characteristics matching type accepts the largest number of riders. The
alteration of characteristics in the Closer matching involved a large number of iterations in Phase
1. The thesis employed the concept of Machine Learning Content-Based recommendation system
to reduce the higher number of loops in Phase 2. Initially, the system converts the characteristics
follows:
chatty_br = 3
safety_br = 4
punctuality_br = 3
friendliness_br = 3
comfortability_br = 4
The vector representation char_v_br for the broadcasting rider is given by Equation 5.2.
[Figure 17: Rider1 and Rider2 plotted in a 5-dimensional space, with char_v1 = [4,4,3,5,3] and char_v2 = [2,1,5,1,1].]
Consider a case of two riders, Riderx and Ridery. The vector representing Riderx is given
by char_v_x and the vector representing Ridery is given by char_v_y. For measuring the level of
match between Riderx and Ridery, the angular distance between two vectors or the Cosine of
angle θxy is calculated using the Equation 5.3. The Cosine of θxy is equal to the Dot Product of
the two vectors divided by the product of their magnitudes.
\[ \cos\theta_{xy} = \frac{char\_v_x \cdot char\_v_y}{\lVert char\_v_x \rVert \, \lVert char\_v_y \rVert} \tag{5.3} \]
If θxy is equal to 0, the value of cosθxy is 1. The angular distance is 0 when the system
finds a rider with exactly matching characteristics. If there are other riders with identical
characteristics as that of the broadcasting rider, it is a 100% match. Such a case is matching the
riders through the Exact matching type. Hence, a greater Cosine value results in a higher match.
space or a 5-dimensional hypercube. The reason for selecting a 5-dimensional space is because the
number of dimensions is equal to the number of selected features. In the figure, ‘O’ represents the
origin, and Riderbr represents the broadcasting rider. Additionally, Rider1 and Rider2 represent
the riders to be matched with the broadcasting rider Riderbr . Points B, 1, and 2 in the hypercube
represent the points plotted by the vectors for Riderbr , Rider1 , and Rider2 .
By observing Figure 17, char_v1 appears to be a better match than char_v2. The match is
higher for char_v1 because the angle θ1B is smaller than the angle θ2B. The higher match implies
Rider1 is a better match than Rider2. Hence, Rider1 is selected and added to the trip. The system
redirects Rider2 to other ongoing trips or to any available and active drivers. In simulations,
the system accepts a rider only if the match exceeds 85%, that is, only if the computed cosθxy
value is above 0.85.
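The matching rule described above can be illustrated with a minimal sketch. It assumes the characteristic vectors are ordered as chatty, safety, punctuality, friendliness, comfortability, and that the broadcasting rider's vector is built from the five values listed earlier; the function names are hypothetical and are not taken from the thesis implementation.

```python
import math

ACCEPT_THRESHOLD = 0.85  # acceptance cut-off quoted in the text


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an all-zero vector matches nothing
    return dot / (norm_a * norm_b)


def accepted_riders(broadcaster, candidates, threshold=ACCEPT_THRESHOLD):
    """Return (name, score) pairs above the threshold, best match first."""
    scored = [(name, cosine_similarity(broadcaster, vec))
              for name, vec in candidates.items()]
    kept = [pair for pair in scored if pair[1] > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Assumed ordering: chatty, safety, punctuality, friendliness, comfortability.
char_v_br = [3, 4, 3, 3, 4]
candidates = {"Rider1": [4, 4, 3, 5, 3], "Rider2": [2, 1, 5, 1, 1]}
matches = accepted_riders(char_v_br, candidates)
```

Under these assumptions Rider1 scores roughly 0.96 and Rider2 roughly 0.74, so only Rider1 passes the 0.85 threshold, which agrees with the outcome described for Figure 17.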
The benefit of the recommendation system is that it can be utilized to compute the
angular match between any two characteristic vectors irrespective of the characteristic’s matching
type. Thus, using the ML-based recommendation model led to the elimination of altering the
rider characteristics in Closer matching. The elimination resulted in cutting down a large number
of conditional loops that profoundly affected the system performance in terms of time complexity.
It is now possible to get a match between the broadcasting rider and a rider with any
combination of characteristic values. Therefore, the employment of the recommendation system is
not limited to Closer matching. The thesis also employs the recommendation system in the
Exact and Alternative types of matching. A further advantage is the processing offered
by the Sklearn libraries, which are Machine Learning libraries that facilitate batch
processing between multiple vectors. Batch processing is the process of computing the Cosine
Similarity between multiple vectors at the same time. Using the feature of batch processing, the
model computes the Cosine Similarity between any number of riders, which results in the
The computation of main characteristics initiates after saving the rider feedback and
dividing the feedback data into two parts. The first main characteristic is the
Feedback-Given-Characteristic and uses the first part of the feedback data, which includes the
ratings given by each rider to other riders. The purpose of the first main characteristic is to track
the characteristic the rider focuses on most while giving feedback to other riders. The computation
of the first main characteristic is discussed with an example of the feedback given by Rider1 to
The next step is to segregate and append the feedback data based on every characteristic
of the rider. The system creates the following lists based on the feedback given by Rider1 .
chatty_Rider1 = [0, 0, 1]
punctuality_Rider1 = [1, 0, 0]
friendliness_Rider1 = [4, 4, 4]
The observation made from the five lists is that Rider1 may continue to give a friendliness
rating of 4 in future trips. Another observation is that the rider submitted the feedback
without rating the comfortability characteristic, and therefore the system assigned the value 0 to
the comfortability rating. The greatest data variety is observed in the safety rating. The
characteristic with the highest data variety is the Feedback-Given-Characteristic. Hence, the first
main characteristic of Rider1 is the safety class. The system computes the first main characteristic
The lists created from the given feedback data form the sample sets for computing
variance. The higher the spread of the data around the mean of a sample set, the higher is the
characteristic variance. The total number of elements in a characteristic list is n_char or
data_count_char. x_i denotes a specific element from the characteristic sample set, and x̄_char
denotes the mean of the characteristic sample set. The system calculates the squared differences
using Equation 5.4.
\[ x\_sqr\_diff_i = (x_i - \bar{x}_{char})^2 \tag{5.4} \]
The selected characteristic variance is denoted by σ²_char and is represented in Equation
5.5. The system selects the characteristic list with the highest variance, which implies the user is
more diverse in rating the selected characteristic and therefore focuses on the selected
characteristic. Besides noting the characteristic with the highest variance, the system also records
the variance of other characteristics and saves in the feedback data-set for every user.
\[ \sigma^2_{char} = \frac{\sum_{i=1}^{n_{char}} x\_sqr\_diff_i}{data\_count_{char}} = \frac{\sum_{i=1}^{n_{char}} (x_i - \bar{x}_{char})^2}{data\_count_{char}} \tag{5.5} \]
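The variance-based selection of the Feedback-Given-Characteristic can be sketched as follows. The chatty, punctuality, and friendliness lists are the ones quoted for Rider1; the safety values are hypothetical placeholders, because the original list is not reproduced here, and the comfortability list follows the statement that unrated characteristics are stored as 0. The function names are illustrative only.

```python
def population_variance(ratings):
    """Mean of squared differences from the mean, as in Equation 5.5."""
    mean = sum(ratings) / len(ratings)
    squared_diffs = [(x - mean) ** 2 for x in ratings]
    return sum(squared_diffs) / len(ratings)


def feedback_given_characteristic(feedback_lists):
    """Return the characteristic with the highest variance and all variances."""
    variances = {name: population_variance(values)
                 for name, values in feedback_lists.items()}
    main_characteristic = max(variances, key=variances.get)
    return main_characteristic, variances


rider1_feedback = {
    "chatty": [0, 0, 1],
    "safety": [1, 5, 3],          # hypothetical values, not from the thesis
    "punctuality": [1, 0, 0],
    "friendliness": [4, 4, 4],
    "comfortability": [0, 0, 0],  # unrated, stored as 0 per the text
}
main_char, variances = feedback_given_characteristic(rider1_feedback)
```

With these sample values the constant friendliness list has zero variance, and the safety list has the largest spread, so safety is returned as the first main characteristic.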
After calculating the first main characteristic, the system proceeds to the computation of
the second main characteristic, the Feedback-Received-Characteristic. The purpose of the
second main characteristic is to categorize users based on the feedback given by other users. For
example, if 20 users have provided the highest rating to the chatty characteristic of Ridera on
many trips, the best-observed characteristic in Ridera is chatty. If users are looking for a rider
who enjoys conversations, the system will recommend Ridera, as the rider has the highest
chatty rating. The methodology for the second main characteristic uses the second part of the
Table 2 provides a use case of the feedback given to Rider1 by Rider2, Rider3, and Rider4.
Each element in the column has two values. The first value is the feedback given by Rideri for a
specific characteristic, and the second value is the characteristic variance (σ²_char) computed for
Rideri. Every time a rider provides feedback for a specific characteristic, the
system multiplies the feedback value by their respective characteristic variance. To exemplify,
Rider2 variance for safety is 4.31, and the safety rating given by Rider2 to Rider1 is 2. The
feedback to Rider1 by Rider2 for the safety characteristic is the product of variance and the value
provided in the rating. The system computes the product of variance and rated value for every
In the end, for every characteristic, all the multiplications are added and compared. The
characteristic with the highest score is the Feedback-Received-Characteristic. In the same use
case, the second main characteristic computed for Rider1 is chatty, as the value of 39.26 is highest
Based on the computed main characteristics, the system redefines the search criteria for
every rider. Indeed, the scenario is a practical use-case where riders rate other riders based on
their past experiences and provide a real-time idea of the characteristics a user possesses. The
main characteristics assist in promoting a better and real-time recommendation to the riders.
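The weighted scoring behind the Feedback-Received-Characteristic can be sketched as follows. Only the pair "safety rating 2 with variance 4.31" for Rider2 comes from the text; every other rating and variance below is a hypothetical value chosen for illustration, and the data layout is an assumption rather than the thesis data model.

```python
def feedback_received_characteristic(feedback_entries):
    """Sum rating * rater variance per characteristic and pick the largest."""
    scores = {}
    for entry in feedback_entries:
        for characteristic, rating in entry["ratings"].items():
            weight = entry["variances"][characteristic]
            scores[characteristic] = scores.get(characteristic, 0.0) + rating * weight
    main_characteristic = max(scores, key=scores.get)
    return main_characteristic, scores


# Feedback given to Rider1; values other than Rider2's safety pair are hypothetical.
feedback_to_rider1 = [
    {"rater": "Rider2",
     "ratings": {"chatty": 4, "safety": 2},
     "variances": {"chatty": 3.0, "safety": 4.31}},
    {"rater": "Rider3",
     "ratings": {"chatty": 5, "safety": 3},
     "variances": {"chatty": 2.5, "safety": 1.0}},
]
main_received, scores = feedback_received_characteristic(feedback_to_rider1)
```

Weighting each rating by the rater's own variance gives more influence to raters who discriminate strongly on that characteristic, which is the intent described in the text.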
The Machine Learning module selected in the thesis is Support Vector Machine (SVM).
The data-sets created for training the SVMs are the Feedback-Given-Characteristic Data-set and
Feedback-Received-Characteristic Data-set. In both data-sets, the input fields are the registered
user characteristics and the registered UTT. The outputs or the labels to be predicted are the
computed main characteristics. Table 3 and Table 4 reflect the fields and sample rows of the
Feedback-Given-Characteristic Data-set
Class Given      Chatty  Safety  Punctuality  Friendliness  Comfortability  UTT (minutes)
Comfortability   3       5       4            1             4               20
Chatty           1       2       4            3             5               10

Feedback-Received-Characteristic Data-Set
Class Received   Chatty  Safety  Punctuality  Friendliness  Comfortability  UTT (minutes)
Punctuality      4       4       2            3             1               10
Safety           5       4       4            1             4               25
The registered characteristics and UTT are the selected input fields or selected features
for the SVM modules. The expected output is that the SVM should predict the main
characteristics that match the computed main characteristics in the data-sets. For two data-sets,
the thesis uses two distinct SVM modules. The function of the first SVM is to predict the
Feedback-Given-Characteristic, and the function of the second SVM is to predict the
Feedback-Received-Characteristic. The working of the two SVM modules is reflected in Figure 18.
The training and testing of SVMs are vital steps in the thesis. The module training has
been completed using 27,000 records for both SVM classifiers. Through the data-sets, the SVM
classifier learns that for a specific combination of the registered characteristics and UTT, the
output is a specific main characteristic. The SVMs were tested using 12,000 new records. For the
inputs in the testing data, initially, the system computed the respective first and second
characteristics. For the same set of input data, the SVMs predicted main characteristics. The
system then compared the predicted and computed values to check if the SVM is correctly
predicting the main characteristics. The comparison is a part of evaluating the Machine Learning
model accuracy, and the chapter of results provides a brief description of the SVM classifier
evaluation.
After getting significant testing results, the thesis employed the trained and tested SVMs
to predict the main characteristics of newly registering riders. At first, the newly registered riders
provide the characteristics and UTT in the registration phase. The system then sends the
recorded characteristics and UTT to SVMs, which predict the main characteristics of the newly
registered riders.
In the experimentations, the Machine Learning module was retrained using variance as an
additional input feature. For an SVM, an additional informative feature can raise the
accuracy of the module. The value of variance provides an extra edge for the SVM to classify and
plot the data points into labeled classes. With variance as an additional feature, the system
5.4 Experimentations
A simulation is denoted by Equation 5.6 where Ui denotes the User Threshold Time, RCi
denotes the number of riders traversed, and Si denotes a specific simulation event for the ith
simulation.
\[ S_i = \{U_i, RC_i\} \tag{5.6} \]
At the beginning of every simulation, the trip starts by selecting a broadcasting rider from
the rider records. The selected rider has a UTT equal to Ui . Consider the first simulation S1 . For
the 1st iteration, U1 is taken as 10 minutes and RC1 as 200. After selecting a broadcasting rider
with a registered UTT of 10 minutes, the system begins the trip formation and the creation of the
trip document. If the trip ends while traversing through the first 20 riders, the simulation
continues by starting a new trip. The system again searches for a rider with a similar registered
Ui and begins the trip. The process of rider traversing and trip completion continues until the
For every next simulation, the system keeps the same value for Ui and increases the
value of RCi by 200 until it reaches 1000. Hence, the next simulation S2 is denoted by
S2 = {U2, RC2} = {10, 400}, and when RCi reaches 1000, the simulation is
S5 = {U5, RC5} = {10, 1000}. Once RCi reaches 1000, the system resets RCi to 200 and
increases Ui by 5. The next simulation is denoted by S6 = {U6, RC6} = {15, 200} and is followed
by simulations until S10 = {U10, RC10} = {15, 1000}. The system proceeds with the simulations
until Ui reaches 30. The nth and last recorded simulation is given by Sn.
Sn = {30, 1000}
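The simulation sequence described above can be enumerated with a short sketch. The function name and its parameters are hypothetical; the start, stop, and step values are the ones stated in the text.

```python
def simulation_schedule(utt_start=10, utt_stop=30, utt_step=5,
                        rc_start=200, rc_stop=1000, rc_step=200):
    """List the (U_i, RC_i) pairs in the order the simulations are run."""
    schedule = []
    for utt in range(utt_start, utt_stop + 1, utt_step):
        # For a fixed UTT, the traversed-rider count grows in steps of 200.
        for rider_count in range(rc_start, rc_stop + 1, rc_step):
            schedule.append((utt, rider_count))
    return schedule


schedule = simulation_schedule()
```

With five UTT values and five rider counts, the schedule contains 25 simulations, starting at (10, 200) and ending at (30, 1000).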
Variable            Description
Ui                  Trip User Threshold Time for a simulation Si
RCi                 Total number of riders traversed in a simulation Si
RPi                 Total number of riders accepted in a simulation Si
Ti                  Total time consumed for completion of a simulation Si
trip_count_i        Total number of trips computed in a simulation Si
MRi                 Matching rate of a simulation Si
closeri             Count of riders accepted through the Exact and Closer matching types in a simulation Si
alternativei        Count of riders accepted through the Alternative matching type in a simulation Si
match_closer        Total count of riders accepted through the Exact and Closer matching types in all simulations
match_alternative   Total count of riders accepted through the Alternative matching type in all simulations
For every simulation Si, the system notes RPi, the total number of riders accepted, Ti, the
total time required for completing the simulation, and trip_count_i, the total number of trips
computed. Table 5 describes each variable that tracks the significant updates
in every simulation. In some cases, the results showed that RPi has a larger value than RPi+1.
The expected result is that RPi, the number of accepted riders, should keep increasing with
every progressing simulation event. The unexpected increase or decrease in RPi was a
randomness factor introduced due to uneven acceptance of the riders at the UTT matching layer.
The system performed the same simulation ten times without the Machine Learning model
and five times with the Machine Learning module to reduce the randomness factor.
Performing the simulations several times reduced the randomness element, which led to the
An important measure in the system efficiency is the matching rate. Equation 5.7 defines
the matching rate for a simulation. The matching rate is the ratio of
accepted riders to the total number of traversed riders. According to the expectation of the
thesis, the matching rate should keep increasing for consecutive simulations.
\[ MR_i = \frac{RP_i}{RC_i} \tag{5.7} \]
Two variables, closeri and alternativei, track the number of accepted riders by
characteristics matching type in every simulation Si: closeri counts riders accepted through the
Exact and Closer matching types, and alternativei counts riders accepted through the Alternative
matching type. In the end, the system adds the values of closeri and alternativei from all
simulations to compare the number of riders accepted by the type of matching. Equation 5.8 and
Equation 5.9 represent the added rider count by the type of matching for all simulations. n marks
the total number of simulations, and match_closer and match_alternative are the variables that track
\[ match_{closer} = \sum_{i=1}^{n} closer_i \tag{5.8} \]
\[ match_{alternative} = \sum_{i=1}^{n} alternative_i \tag{5.9} \]
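Equations 5.7 to 5.9 reduce to a division and two sums, as the following sketch shows. The per-simulation records are hypothetical sample values used only to exercise the functions, and the dictionary keys are assumed names rather than fields from the thesis data model.

```python
def matching_rate(riders_accepted, riders_traversed):
    """MR_i = RP_i / RC_i (Equation 5.7)."""
    if riders_traversed == 0:
        return 0.0  # assumption: an empty simulation has a rate of 0
    return riders_accepted / riders_traversed


def totals_by_matching_type(simulations):
    """Sum closer_i and alternative_i over all simulations (Equations 5.8 and 5.9)."""
    match_closer = sum(sim["closer"] for sim in simulations)
    match_alternative = sum(sim["alternative"] for sim in simulations)
    return match_closer, match_alternative


# Hypothetical records: RC = traversed riders, RP = accepted riders.
simulations = [
    {"RC": 200, "RP": 60, "closer": 10, "alternative": 50},
    {"RC": 400, "RP": 140, "closer": 25, "alternative": 115},
]
rates = [matching_rate(sim["RP"], sim["RC"]) for sim in simulations]
match_closer, match_alternative = totals_by_matching_type(simulations)
```

In this sample the matching rate rises from 0.30 to 0.35 between the two simulations, which is the kind of increase the thesis expects across consecutive simulations.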
Every tracking variable contributes to the performance measurement of the system. The
next chapter of results provides an in-depth analysis of the entire Ride Sharing model and
provides the contributions of the matching rate, the number of trips, and the time required for trip
formation towards the overall system efficiency. Also, the chapter includes a comparison of results
from Phase 1 and Phase 2, which states the improvements observed due to changes in Phase 2.
CHAPTER 6
The chapter of analysis and results includes four sections that are most crucial while
evaluating the system efficiency. The four sections are (i) Results from Phase 1 (ii) Machine
Learning Accuracy Measurement and Evaluation (iii) Results from Phase 2 and (iv) Comparison
of Results.
The three significant components in the results of both phases of the thesis are the
matching rate, the number of completed trips, and the total time taken for the completion of a
simulation.
The matching rate MRi provides a fractional value of the accepted riders out of the
total traversed riders RCi in a simulation Si. It is essential to consider the total traversed riders
RCi in a simulation to understand the impact of the matching rate. Consider an example of a
simulation where the computed matching rate is 0.48. For better understanding, MRi is
multiplied by 100 to get the matching rate in percentage. If RCi is 100, a matching rate of 0.48
implies that the system accepted 48% of riders while finding a match for the broadcasting riders
in all the computed trips in a simulation. The expected outcome of the thesis is that the matching
rate should improve as the number of traversed riders and the trip UTT increases. If the matching
rate is constant or falls for an increasing number of riders, the system fails to be efficient. The
X-axis in Figure 19 represents the RCi or the total number of searched riders while the Y-axis
specifies the scale of the matching rate. Each legend or the trend line indicates the trip UTT of
the simulation. The matching rate for Phase 1 is drafted using a stacked-line graph in Figure 19.
[Figure 19: stacked-line graph of the matching rate (Y-axis) against the total number of riders per simulation (X-axis), with one trend line per trip UTT (10, 15, 20, 25, and 30 minutes).]
The observations from the graph in Figure 19 state that with the rising number of riders,
the system recorded a higher matching rate than the previously recorded matching rates, as shown
in the simulations. The highest recorded matching rate in Phase 1 is 0.49 for 1000 riders and 30
minutes as the trip UTT. The matching rate of 0.49 implies that out of 1000 traversed riders, the
For every simulation Si, the variable trip_count_i tracks the number of completed trips.
The elements of the stacked-line graph for the total number of completed trips are similar to the
graph of the matching rate. The X-axis reflects the number of traversed riders from 200 to 1000,
and the legends indicate the trip UTT with distinct markers. The only difference is the value
and scale on the Y-axis, which represents the total number of computed trips. The expectations
from the model in terms of computed trips are that trip_count_i should increase with the increasing
number of riders and trip UTT, and that the number of computed trips should be at least 3
for every 100 traversed riders. Like the matching rate, if the total number of computed trips drops
for progressing simulations, the Enhanced Ride Sharing Model proves to be inefficient. The
graph of the computed trips against the count of traversed riders is reflected in Figure 20.
[Figure 20 plot: stacked-line graph; X-axis: Total Number of Riders Per Simulation (200 to 1000); Y-axis: Total Number of Completed Trips; one trend line per trip UTT (10, 15, 20, 25, and 30 minutes).]
Figure 20: Total Number of Computed Trips, Phase 1.
The number of computed trips is the total count of completed trips in a simulation. The graph is
evaluated by observing a simulation event and the value plotted for the simulation indicating the
total number of completed trips. For example, for UTT:10 minutes and traversed rider count of
600, the total number of completed trips is 15.
The result in Figure 20 reflects that the Ride Sharing model in Phase 1 of the thesis
achieved a trip count of 5 for the first simulation, where the UTT is 10 minutes and the number
of traversed riders is 200. The observations from the stacked-line graph in Figure 20 state that
the trip count increases with every simulation event, that is, with the growing count of riders
and trip UTT. The highest number of completed trips is 26 for UTT:15 minutes and UTT:30 minutes for a
6.1.3 Trip Simulation Time
The trip simulation time Ti specifies the total time consumed for completing a
specific number of trips. The resultant graph for the trip simulation is portrayed in the form of a
stacked-line graph in Figure 21.
[Figure 21 plot: stacked-line graph; X-axis: Total Number of Riders Per Simulation (200 to 1000); Y-axis: Total Simulation Time (mins); one trend line per trip UTT (10, 15, 20, 25, and 30 minutes).]
Figure 21: Trip Simulation Time, Phase 1.
The data-points in the stacked-line graph describe the total time recorded to complete one entire
simulation. The graph is understood by providing an example of the simulation event where the
trip UTT is 15 minutes, and the number of traversed riders is 800. The simulation time recorded
for the selected example is 12.83 minutes.
The contribution of the simulation time towards the system efficiency depends on the
matching rate and the total number of computed trips in a simulation. The increase or decrease in
the simulation time does not affect the system. However, if the simulation time increases for every
consecutive simulation, it is crucial to observe the values of the matching rate and the completed
number of trips. The expected result in the thesis is that if the simulation time Ti keeps rising for
every simulation Si, there should be a corresponding increase in the matching rate MRi and the
total number of computed trips trip_count_i. From Figure 21, the noted observation is that the trip
simulation time keeps increasing for every progressing simulation. If Figure 19, Figure 20, and
Figure 21 are placed next to each other, the results show that the matching rate MRi
and the number of completed trips trip_count_i increase with the rising trip simulation time Ti.
[Figure 22: pie chart of trips by pool completion status. Total trip simulation count: 7159; trips without pool completion: 811 (11%).]
The proposed model indirectly helps humanity preserve the environment and fuel resources if
the number of computed trips with pool completion is higher than the number of completed trips
without pool completion. After the completion of Phase 1, the trip count from all simulations is
added and classified based on the pool completion status. A trip completes the pool when
the accepted riders and the driver reach the vehicle seating capacity. Figure 22 presents the
classification of trips by pool status in a pie-chart. The expected outcome in the thesis is that
out of the total number of computed trips, at least 70% of trips should complete the pool.
The result achieved in the case of trips with pool completion is acceptable.
The pie-chart in Figure 22 shows that the share of completed trips with pool completion is
89%, which is a significant measure in the case of the Ride Sharing model. For the total
computed trips trip_count_n, with n being the total number of simulations, it is confirmed that
around 90% of the total trips completed the journey with pool completion.
[Figure 23: pie chart of total riders accepted in all trips (100%). Exact/Closer characteristics match with UTT: 17%; Different characteristics match with UTT: 83%.]
The count of matches by characteristics matching type is one of the most critical measures
of the system. The model design includes the Exact, Closer, and Alternative types of
characteristics matching. As stated in the previous chapter, the variables which track the rider
count by the matching type are match_closer and match_alternative. If the system accepts a rider
by the Exact or Closer type of matching, the value of match_closer is increased by 1. If
the system accepts a rider by the Alternative type of matching, the value of match_alternative is
increased by 1.
The reason for generating results by the matching type is to check which rider
characteristics matching type is the most utilized by the system while creating trips. The
expected outcome in the thesis is that the number of rider matches by the Exact or Closer
characteristics matching must be higher than the number of rider matches by the Alternative
matching type. Based on the values of match_closer and match_alternative, the accepted riders are
plotted on a pie-chart as showcased in Figure 23. The two classes in the pie-chart are the Exact
or Closer characteristics match with UTT, and the Different or Alternative characteristics match
with UTT.
The pie-chart in Figure 23 reveals that the result achieved in terms of the number of rider
matches by characteristics matching types is an average quality result. The system recorded a
lower number of matches by the Exact and Closer matching type than the Alternative matching
type. But, as all riders undergo the UTT matching layer and as most of the trips complete the
It is necessary to provide the definitions of true positive (tp), true negative (tn), false
positive (fp), and false negative (fn) for evaluating the Machine Learning classifiers. The
definitions are given with the help of a confusion matrix, which is illustrated in Figure 24.
                      Actual or Computed Values
                      Chatty      Safety
Predicted   Chatty    tp          fp
Values      Safety    fn          tn
Figure 24: An Example of Confusion Matrix.
Figure 24 illustrates a simple example of the confusion matrix with two classes, safety and chatty.
The values on the X-axis represent the actual or computed values and the values on the Y-axis
represent the predicted values by a Machine Learning classifier. The matrix contains values which
determine the quality of the prediction of a Machine Learning classifier in the form of true
positive (tp), true negative (tn), false positive (fp), and false negative (fn).
The confusion matrix provides the level of correctness between a system’s computed value
for a class and a Machine Learning classifier’s predicted value for the same class. For illustrating
the confusion matrix, only two classes have been considered, the chatty and the safety class. Let
the computed main characteristic for a rider be the chatty class. The system evaluates the
Machine Learning accuracy by checking the main characteristic predicted by the SVM classifier.
If a Machine Learning classifier predicts a class that is the same as the computed class, the
prediction is true positive (tp). For example, if the SVM predicts the class chatty as the main
Consider a case where an entity E does not belong to a class C_false. If the system subjects
the classifier with the entity E and the class C_false and the classifier correctly predicts that E does
not belong to class C_false, the prediction is true negative (tn). Returning to the same example, if
the system subjects the SVM with the rider and the safety class, and if the SVM predicts that the
rider is not associated with the safety class, the prediction is a true negative prediction.
The prediction is a false negative (fn) prediction when the Machine Learning classifier
predicts the wrong class as the right class. In the example of the rider, if the SVM predicts the
safety class, the prediction is a false negative prediction as the computed class is chatty.
Consider a case of the entity E, which does not belong to a class C_false. If the system subjects
the Machine Learning classifier with class C_false and the entity E, and the classifier predicts that
E belongs to class C_false, then the prediction is false positive (fp). In the same two-class example,
if the computed class for a rider is safety and the SVM predicts that the rider belongs to the
chatty class, the prediction is a false positive prediction for the chatty class.
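The four outcomes can be counted directly from paired computed and predicted labels. The sketch below follows the standard one-vs-rest convention and the layout of Figure 24, treating one class (here chatty) as the positive class; the label lists are hypothetical, and the function name is illustrative.

```python
def confusion_counts(computed, predicted, positive_class):
    """Count tp, fp, fn, tn for one class treated as the positive class."""
    tp = fp = fn = tn = 0
    for actual, guess in zip(computed, predicted):
        if guess == positive_class and actual == positive_class:
            tp += 1   # predicted positive, computed positive
        elif guess == positive_class:
            fp += 1   # predicted positive, computed another class
        elif actual == positive_class:
            fn += 1   # predicted another class, computed positive
        else:
            tn += 1   # neither predicted nor computed as positive
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


# Hypothetical labels for five riders.
computed = ["chatty", "chatty", "safety", "safety", "chatty"]
predicted = ["chatty", "safety", "safety", "chatty", "chatty"]
counts = confusion_counts(computed, predicted, "chatty")
```

For the five-class case in the thesis, the same function can be called once per class to obtain per-class counts.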
With the definitions of tp, tn, fp, and fn, the important elements of a Machine Learning
The first measure of performance for the Machine Learning classifier is the accuracy.
Accuracy is the fractional value of the total number of correctly predicted samples to the total
number of present samples in a data-set. Equation 6.1 represents the accuracy measure of a
\[ accuracy = \frac{tp + tn}{tp + tn + fp + fn} \tag{6.1} \]
Consider a case of a data-set where the number of false positives and false negatives is
the same. When fp and fn are almost equal, the data-set is balanced. In the case of an
imbalanced data-set, the accuracy is not enough to measure the quality of prediction of a Machine
Learning classifier. The Enhanced Ride Sharing Model possesses imbalanced feedback data-sets,
and therefore more measures are required to evaluate the efficiency of SVMs. The other
performance measures for evaluating a Machine Learning classifier are precision, recall, and the
F1 score.
The second performance measure is precision, which is the ratio of the true positive
predictions to all the positive predictions. Of all the positively predicted values,
precision states the fraction that is correctly predicted, that is, that matches the computed values.
The system calculates the positive predictions by adding all true positive and false positive
predictions. A classifier is expected to possess a higher precision value for quality prediction. The
following Equation 6.2 represents the formula for computing precision of a Machine Learning
classifier.
\[ precision = \frac{tp}{tp + fp} \tag{6.2} \]
The third performance measure is the recall. Recall provides the fractional value of the
correctly predicted samples of a class to the total samples that actually belong to that class. In
the case of the recall, the system adds the true positive and false negative predictions to get the
whole sample set of the class. Equation 6.3
provides the formula for computing the recall of a Machine Learning classifier.
\[ recall = \frac{tp}{tp + fn} \tag{6.3} \]
It may be the case that the classifier predicts accurately for a particular set of classes but
predicts incorrectly for a few classes. In such a case, a possible solution is using the combination
of precision and recall. The combination of precision and recall provides the F1 score. The major
focus of the F1 score is on false positives and false negatives. Equation 6.4 provides the formula
\[ F1\ Score = 2 \cdot \frac{recall \cdot precision}{recall + precision} \tag{6.4} \]
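Equations 6.1 to 6.4 can be evaluated together from the four confusion-matrix counts, as in the following sketch. The counts passed in are hypothetical, and returning 0 when a denominator is zero is an assumption made for the sketch, not a rule stated in the thesis.

```python
def classifier_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 score from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for one class.
metrics = classifier_metrics(tp=8, tn=5, fp=2, fn=4)
```

In this sample the precision is 0.8 while the recall is about 0.67, and the F1 score of about 0.73 sits between the two, which illustrates why the F1 score is useful when false positives and false negatives differ.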
If a classifier has a higher computed F1 score, the precision and recall utilized for
computing the F1 score are also high. Hence, a good or a higher F1 score indicates the classifier
prediction is accurate. The last performance measure for a Machine Learning classifier is the Root
Mean Square Error (RMSE). The concept of the RMSE is explained through Figure 25.
[Figure 25: plot on X and Y axes showing the predicted data-points, the actual or computed data-points, the curve fitting the data-points, and the error in prediction.]
RMSE computes a value based on the errors present in every predicted sample point.
Initially, a curve or a line passes through all the computed data-points in the system. The first
step for calculating RMSE is the computation of error. Equation 6.5 gives the error between the
computed and predicted data-points. The error is the distance between the computed data-points
and the predicted data-points.
\[ error = y_{computed} - y_{predicted} \tag{6.5} \]
As the errors may result in negative values, all the errors are squared and added. The
added squares are divided by the total number of data-points, providing the mean of squared
errors. The last step is computing the square root of the mean. A lower RMSE value represents
less error and therefore signifies that the classifier prediction is accurate. Equation 6.6 is utilized
\[ RMSE = \sqrt{\frac{\sum (error)^2}{total\_sample\_points}} = \sqrt{\frac{\sum (y_{computed} - y_{predicted})^2}{total\_sample\_points}} \tag{6.6} \]
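The RMSE steps described above (square the errors, average them, take the square root) can be sketched as follows. The numeric values are hypothetical; applying RMSE to the five characteristic classes would require a numeric encoding of the class labels, which is an assumption and is not specified here.

```python
import math


def rmse(y_computed, y_predicted):
    """Root Mean Square Error between computed and predicted values."""
    if len(y_computed) != len(y_predicted) or not y_computed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(c - p) ** 2 for c, p in zip(y_computed, y_predicted)]
    mean_squared_error = sum(squared_errors) / len(squared_errors)
    return math.sqrt(mean_squared_error)


# Hypothetical numeric data-points.
y_computed = [3, 5, 2, 4]
y_predicted = [3, 4, 4, 4]
error_value = rmse(y_computed, y_predicted)
```

Here the squared errors are 0, 1, 4, and 0, so the mean is 1.25 and the RMSE is its square root, about 1.12; identical inputs give an RMSE of 0.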
The thesis uses the confusion matrix, F1 score, precision, recall, RMSE, and accuracy for
measuring the quality of prediction and the performance of the Support Vector Machines.
Due to the presence of the imbalanced data-sets in the training and testing of Support
Vector Machines, it is essential to measure F1 score, precision, and recall for the five classes and
the overall RMSE and accuracy. The evaluation of the SVMs begins with the performance
measure for the first SVM, which is the Feedback-Given-Characteristic SVM. Table 6 cites the
calculated performance measures for the first SVM classifier. Also, to compare the system
computed values with the SVM predicted values for every class, the section of performance
measure includes the confusion matrix plotted for the first SVM, as shown in Figure 26.
The expected result in the thesis from the perspective of Machine Learning is getting a
higher score for accuracy, F1 score, precision, recall, and getting a minimal RMSE score for the
Feedback-Given-Characteristic SVM achieves a higher score for every performance measure. The