Analysis of Machine Learning Algorithm With Road Accidents Data Sets

ABSTRACT
Road transport infrastructure is currently failing to cope with the exponential
increase in the vehicle population, and computing the fastest driving routes while
anticipating accidents under varying traffic conditions is an essential problem in modern
navigation systems. To address this problem, this work investigates a transport department
dataset with ensemble learning methods, aiming to select the best road by forecasting
accidents; supervised machine learning algorithms are compared, and the one with the
best accuracy is used for prediction. In statistics and machine learning, ensemble
methods use multiple learning algorithms to obtain better predictive performance. The
dataset is analyzed with supervised machine learning techniques (SMLT) to capture
several kinds of information: variable identification, uni-variate, bi-variate and
multi-variate analysis, and missing value treatment. Data validation, data
cleaning/preparation and data visualization are carried out on the entire dataset.
Additionally, the performance of various machine learning algorithms on the transport
department dataset is compared and discussed, and a GUI-based road accident
prediction from given attributes is evaluated.
TABLE OF CONTENTS
CHAPTER NO. TITLE
ABSTRACT
LIST OF FIGURES
LIST OF ABBREVIATIONS
1 INTRODUCTION
1.1 EXISTING SYSTEM
1.2 PROPOSED SYSTEM
1.2.1 Ensemble learning
1.2.2 Max Voting
1.2.3 Averaging
1.2.4 Weighted Average
1.2.5 Voting based Ensemble learning
1.3 AIM
1.4 SCOPE
1.5 OBJECTIVE
2 LITERATURE SURVEY
2.1 LITERATURE SURVEY
3 METHODOLOGY
3.1 METHODOLOGIES
3.1.1 Sequential Ensemble learning (Boosting)
3.1.2 Parallel Ensemble Learning (Bagging)
3.1.3 Stacking & Blending
3.2 FEASIBILITY STUDY
3.2.1 Data Wrangling
3.2.2 Data collection
3.2.3 Preprocessing
3.3 CONSTRUCTION OF A PREDICTIVE MODEL
3.3.1 Dataflow diagram for machine learning
3.3.2 Work flow diagram
3.3.3 UML Diagram
3.3.3.1 Use Case Diagram
3.3.3.2 Activity Diagram
3.3.4 Sequence Diagram
3.4 PROJECT REQUIREMENTS
3.5 ENVIRONMENTAL REQUIREMENTS
3.5.1 Software Description
3.5.2 Anaconda Navigator
3.5.3 Conda
3.5.4 The Jupyter Notebook
3.5.5 Notebook document
3.5.6 Jupyter Notebook App
3.5.7 Kernel
3.5.8 Notebook Dashboard
3.6 MODULE - 01
3.6.1 Data validation process
3.6.2 Data Validation
3.6.3 Data Pre-processing
3.7 MODULE - 02
3.7.1 Exploration data analysis of visualization
3.7.2 Training the Dataset
3.7.3 Testing the Dataset
3.8 MODULE - 03
3.8.1 Logistic Regression
3.8.2 Decision Tree
3.9 MODULE - 04
3.9.1 Support Vector Machines (SVM)
3.9.2 Random Forest
3.10 MODULE - 05
3.10.1 K-Nearest Neighbor (KNN)
3.10.2 Naive Bayes algorithm
3.11 MODULE - 06
3.12 MODULE - 07
3.12.1 Accuracy calculation
3.12.2 Comparing Algorithm with prediction result
4 RESULTS AND DISCUSSION
4.1 RESULTS
4.2 DISCUSSION
5 CONCLUSION AND FUTURE WORK
5.1 CONCLUSION
5.2 FUTURE WORK
REFERENCES
APPENDIX
A. SAMPLE CODE
B. SCREEN SHOTS
C. PUBLICATION WITH PLAGIARISM REPORT
LIST OF FIGURES
FIGURE NO. TITLE
1.1 Process of Machine learning
1.2 Ensemble structure
1.3 Ensemble model
3.1 Process of dataflow diagram
3.2 Data flow diagram for Machine learning model
3.3 Workflow Diagram
3.4 Use Case Diagram
3.5 Activity Diagram
3.6 Sequence Diagram
3.7 Pre-processing Data
3.8 Analysis of data
3.9 Result of Logistic Regression
3.10 Result of Decision Tree Classifier
3.11 Result of Support Vector Machine
3.12 Result of Random Forest Algorithm
3.13 Result of K-Nearest Neighbor
3.14 Result of Naive Bayes
3.15 Confusion matrix
LIST OF ABBREVIATIONS
ABBREVIATION EXPANSION
FN False Negatives
FP False Positives
GUI Graphical User Interfaces
NBA Naive Bayes Algorithm
PEL Parallel Ensemble Learning
SEL Sequential Ensemble learning
SVM Support Vector Machines
TN True Negatives
TP True Positives
CHAPTER 1
INTRODUCTION
The amount of accident data stored by traffic management departments has
been growing at an ever-increasing rate over the last few years because of the large
number of road traffic accidents. Government entities and some private organizations
collect accident data on a daily basis. Accident data is often among the most
valuable assets for local authorities and central governments, as it helps in budgeting
and in implementing policies. It can also help policymakers make appropriate
decisions on infrastructure development, planning and the allocation of social
grants. But as the amount of this data grows, there is a high demand for
methods, techniques and tools to analyze such large volumes of data and to find
the causes of accidents. Road traffic accidents are a major worldwide threat
that continues to cause injuries and fatalities on roadways every day,
resulting in huge economic and social losses. This global problem
needs more attention in order to reduce the severity and frequency of accidents.
Data about previous accidents gives researchers a formidable opportunity
to identify the most influential factors in such accidents, which in turn plays
a key role in finding appropriate ways to mitigate the problem in the future. In recent
years, several data mining techniques have been used effectively to extract useful
knowledge from large data sets containing information about traffic accidents. Road
traffic accidents are one of the major causes of death and disability across
the world. A road accident can be considered an event in which a vehicle
collides with another vehicle, a person or another object. A road accident not only
causes property damage but may also lead to partial or full disability, and can
sometimes be fatal. The increasing number of road accidents is not a good sign
for transport safety. Addressing it requires analyzing traffic accident
data to identify the different causes of road accidents and taking preventive measures.
Machine learning is an application of artificial intelligence in which
the machine learns from data rather than being explicitly programmed. Nowadays,
machine learning plays a crucial role in day-to-day life. It is used in almost every
field, such as transport, medicine and banking; the transport and medical fields are
more important than others because they concern human lives. In transportation,
machine learning can be applied to many tasks, such as traffic flow prediction, accident
prediction and tourist place suggestion. Machine learning is used not only for
automation but also for safety.
Machine learning predicts the future from past data. Machine learning (ML)
is a type of artificial intelligence (AI) that gives computers the ability to learn
without being explicitly programmed. It focuses on the development of
computer programs that can change when exposed to new data. This work covers the
basics of machine learning and the implementation of simple machine learning
algorithms in Python. Training and prediction rely on specialized algorithms: the
training data is fed to an algorithm, and the algorithm uses it to make
predictions on new test data. Machine learning can be roughly separated into three
categories: supervised learning, unsupervised learning and reinforcement
learning. In supervised learning, the program is given both the input data and the
corresponding labels to learn from, so the data has to be labeled by a human being
beforehand. In unsupervised learning, no labels are provided to the learning algorithm,
which has to figure out the clustering of the input data on its own. Finally,
reinforcement learning interacts dynamically with its environment and receives positive
or negative feedback to improve its performance. Data scientists use many different
kinds of machine learning algorithms in Python to discover patterns that lead to
actionable insights. At a high level, these algorithms can be classified into two
groups based on the way they “learn” about data to make predictions: supervised and
unsupervised learning.
Classification is the process of predicting the class of given data points. Classes are
sometimes called targets, labels or categories. Classification predictive modeling is
the task of approximating a mapping function from input variables (X) to discrete output
variables (y). In machine learning and statistics, classification is a supervised learning
approach in which the computer program learns from the input data given to it and then
uses this learning to classify new observations. The data set may be bi-class or
multi-class. Some examples of classification problems are speech
recognition, handwriting recognition, biometric identification and document
classification.
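The train-then-predict cycle described above can be illustrated with a minimal sketch. It uses a k-nearest-neighbour classifier written in plain Python; the feature names (speed, visibility), the toy records and the labels are hypothetical and are not taken from the transport department dataset.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Return the majority label among the k training points closest to x."""
    nearest = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], x),  # Euclidean distance
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labeled training data: (speed in km/h, visibility in metres)
train_X = [(30, 900), (40, 800), (35, 1000), (110, 150), (120, 100), (100, 200)]
train_y = ["no_accident", "no_accident", "no_accident",
           "accident", "accident", "accident"]

# New, unlabeled test data: the model maps each input X to a discrete class y
test_X = [(38, 850), (115, 120)]
predictions = [knn_predict(train_X, train_y, x) for x in test_X]
print(predictions)
```

This is a bi-class problem; a multi-class problem would only add more distinct labels to `train_y`.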
[Figure 1.1 diagram. Labels: Past Dataset; Trains; Machine Learning; Predicts; Result]
Fig 1.1: Process of Machine learning
The majority of practical machine learning uses supervised learning. In
supervised learning there are input variables (X) and an
output variable (y), and an algorithm is used to learn the mapping function from the
input to the output, y = f(X). The goal is to approximate the mapping function so well
that, given new input data (X), the output variable (y) can be predicted for that data.
Supervised machine learning techniques include logistic
regression, multi-class classification, decision trees and support vector machines.
Supervised learning requires that the data used to train the algorithm is already labeled
with correct answers. Supervised learning problems can be further grouped
into regression and classification problems. Both have as their goal the construction
of a succinct model that can predict the value of the dependent attribute from the
attribute variables. The difference between the two tasks is that the dependent
attribute is numerical for regression and categorical for classification. A
classification model attempts to draw a conclusion from observed values: given one or
more inputs, it tries to predict the value of one or more outcomes. A classification
problem is one where the output variable is a category, such as “red” or “blue”.
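The numerical-versus-categorical distinction can be made concrete with a small sketch in plain Python. Both functions learn a mapping y = f(X) from labeled examples: one fits a least-squares line for a numerical target, the other learns a single cut-off for a categorical target. The speeds, costs and severity labels are invented for illustration only.

```python
def fit_line(xs, ys):
    """Regression: least-squares fit of y = a * x + b for a numerical target."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b


def fit_threshold(xs, labels, positive):
    """Classification: pick the cut on x with the fewest training errors.

    Inputs above the cut are predicted as `positive`.
    """
    points = sorted(zip(xs, labels))
    best_cut, best_errors = None, len(points) + 1
    for (x_lo, _), (x_hi, _) in zip(points, points[1:]):
        cut = (x_lo + x_hi) / 2
        errors = sum((x > cut) != (label == positive) for x, label in points)
        if errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut


# Hypothetical labeled data: the same input (speed) with two kinds of target
speeds = [20, 40, 60, 80, 100]
damage_cost = [10, 20, 30, 40, 50]               # numerical target
severity = ["low", "low", "low", "high", "high"]  # categorical target

a, b = fit_line(speeds, damage_cost)
cut = fit_threshold(speeds, severity, positive="high")

new_speed = 75
predicted_cost = a * new_speed + b                        # a number
predicted_class = "high" if new_speed > cut else "low"    # a category
print(predicted_cost, predicted_class)
```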
Existing system:
The traffic signals, optimized in a receding horizon framework, can quickly adapt
to variations in traffic flow or changes in vehicle behavior. To minimize the
total crossing time of all vehicles, the shortest possible signal durations
are usually generated. Such short signals almost eliminate the stop-delay of the
automated vehicles at any penetration level when the intersection is undersaturated.
As the proportion of automated vehicles increases, the traffic density, average speed
and stop-delay improve significantly. The scheme is found to be environmentally
friendly, as it significantly reduces fuel consumption and emissions around the
intersection. The automated vehicles use distributed controllers and decide their
driving actions without using detailed information about the preceding traffic accidents.
The traffic signals are optimized in a receding horizon control framework that
aims to minimize the total crossing time of all vehicles, taking their dynamical
states into account. The control scheme ensures comfortable crossing for manually
driven vehicles by retaining the basic features of traditional signal management
systems. The optimal signal changing times are broadcast one cycle ahead, which
enables the automated vehicles to tune their speed in order to cross the intersection
with minimum stop-delay. More specifically, the framework optimizes the green time of
each signal without explicitly considering the existing cycle-split concept. The
optimization process is observed to usually result in the shortest green period for
each signal that can be realized without reducing the capacity of the intersection at
any traffic volume. Consequently, the resulting short signal cycle, which adapts to the
traffic around the intersection, improves the average speed and reduces both the traffic
density and the number of idling vehicles.
Drawbacks:
• It cannot legally obtain videos or any other information from the transport
department to predict traffic accidents for a user interface through a dataset.
• It covers only traffic control scenario information.
• It does not achieve accurate prediction results for the parameters.
PROPOSED SYSTEM
Ensemble learning:
Ensemble learning helps improve machine learning results by combining several
models. This approach produces better predictive performance than a single model:
it is the art of combining a diverse set of learners
(individual models) to improve the stability and predictive power of the
model. In statistics and machine learning, ensemble learning techniques
attempt to improve the performance of predictive models by improving their
accuracy. Ensemble learning is a process in which multiple machine learning
models (such as classifiers) are strategically constructed to solve a particular problem.
• A single model overfits.
• The results are worth the extra training.
• It can be used for classification as well as regression.
Voting based Ensemble learning:
Voting is one of the most straightforward ensemble learning techniques:
predictions from multiple models are combined. The method starts by creating two or
more separate models from the same dataset. A voting-based ensemble model
then wraps these models and aggregates their predictions. Once the voting-based
ensemble model is constructed, it can be used to make predictions on new data.
The predictions made by the sub-models can be assigned
weights. Stacked aggregation is a technique that can be used to learn how to weight
these predictions in the best possible way.
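As a rough sketch of the voting idea, the plain-Python example below aggregates the predictions of three sub-models, first by simple majority and then with per-model weights. The three rule-based sub-models, the record fields and the weights are hypothetical stand-ins for trained classifiers; the weights are hand-picked here, whereas stacked aggregation would learn them from data.

```python
from collections import Counter, defaultdict


def max_vote(predictions):
    """Majority voting: return the label predicted by the most sub-models."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(predictions, weights):
    """Weighted voting: each sub-model's vote counts in proportion to its weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Hypothetical sub-models standing in for trained classifiers
def model_speed(record):
    return "accident" if record["speed"] > 90 else "no_accident"


def model_visibility(record):
    return "accident" if record["visibility"] < 200 else "no_accident"


def model_rain(record):
    return "accident" if record["rain"] else "no_accident"


models = [model_speed, model_visibility, model_rain]
record = {"speed": 95, "visibility": 500, "rain": False}  # new, unseen data

predictions = [model(record) for model in models]
majority = max_vote(predictions)
weighted = weighted_vote(predictions, weights=[0.6, 0.2, 0.2])
print(predictions, majority, weighted)
```

With equal votes the two "no_accident" predictions win, while the assumed weights let the more trusted speed model override them, which is why the choice of weights matters.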
[Figure 1.3 diagram. Labels: Road traffic information; Data Processing; Training dataset; Test dataset; Classification ML Algorithm; Ensemble Model]
Fig 1.3: Ensemble model
AIM:
There are several problems with current practices for preventing
accidents in localities. The data we will use is officially available from
Kaggle and government websites. The collected data will be analyzed, integrated and
grouped according to different constraints using the best-suited algorithm. This
estimation will help to analyze and identify the flaws behind, and the reasons for, the
accidents. It will also serve as a reference when building roads and bridges, to avoid
the problems faced before. The predictions made will be very useful to plan
