Diabetes Prediction Using Machine Learning

Article Info
Volume 6, Issue 4
Page Number: 294-305
Publication Issue: July-August-2020
Article History
Accepted: 25 July 2020
Published: 30 July 2020

ABSTRACT
Diabetes is a chronic disease with the potential to cause a worldwide health care crisis. According to the International Diabetes Federation, 382 million people are living with diabetes across the world, and by 2035 this number is expected to rise to 592 million. Diabetes is a disease caused by increased levels of blood glucose. High blood glucose produces the symptoms of frequent urination, increased thirst, and increased hunger. Diabetes is one of the leading causes of blindness, kidney failure, amputations, heart failure and stroke. When we eat, our body turns food into sugars, or glucose. At that point, our pancreas is supposed to release insulin. Insulin serves as a key to open our cells, allowing the glucose to enter so that it can be used for energy. With diabetes, this system does not work. Type 1 and type 2 diabetes are the most common forms of the disease, but there are also other kinds, such as gestational diabetes, which occurs during pregnancy. Machine learning is an emerging scientific field in data science dealing with the ways in which machines learn from experience. The aim of this project is to develop a system which can perform early prediction of diabetes for a patient with higher accuracy by combining the results of different machine learning techniques. The algorithms K nearest neighbour, Logistic Regression, Random forest, Support vector machine and Decision tree are used. The accuracy of the model using each of the algorithms is calculated, and the one with the best accuracy is taken as the model for predicting diabetes.

Keywords: Machine Learning, Diabetes, Decision tree, K nearest neighbour, Logistic Regression, Support vector Machine, Accuracy.
Copyright: © the author(s), publisher and licensee Technoscience Academy. This is an open-access article distributed
under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-
commercial use, distribution, and reproduction in any medium, provided the original work is properly cited
glucose. The glucose moves around the body in the bloodstream. Some of the glucose is taken to our brain to help us think clearly and function. The remainder of the glucose is taken to the cells of our body for energy and also to our liver, where it is stored as energy that is used later by the body. In order for the body to use glucose for energy, insulin is required. Insulin is a hormone that is produced by the beta cells in the pancreas. Insulin works like a key to a door: it attaches itself to doors on the cell, opening the door to allow glucose to move from the bloodstream, through the door, and into the cell. If the pancreas is not able to produce enough insulin (insulin deficiency) or if the body cannot use the insulin it produces (insulin resistance), glucose builds up in the bloodstream (hyperglycaemia) and diabetes develops. Diabetes mellitus means high levels of sugar (glucose) in the bloodstream and in the urine.

Types of Diabetes

Type 1 diabetes means that the immune system is compromised and the cells fail to produce insulin in sufficient amounts. There are no conclusive studies that prove the causes of type 1 diabetes, and there are currently no known methods of prevention.

Type 2 diabetes means that the cells produce a low quantity of insulin or the body cannot use the insulin correctly. This is the most common type of diabetes, affecting about 90% of persons diagnosed with diabetes. It is caused by both genetic factors and the manner of living.

Gestational diabetes appears in pregnant women who suddenly develop high blood sugar. In two thirds of the cases, it will reappear during subsequent pregnancies. There is a great chance that type 1 or type 2 diabetes will occur after a pregnancy affected by gestational diabetes.

Symptoms of Diabetes

• Frequent urination
• Increased thirst
• Tiredness/sleepiness
• Weight loss
• Blurred vision
• Mood swings
• Confusion and difficulty concentrating
• Frequent infections

Causes of Diabetes

Genetic factors are the main cause of diabetes. It is caused by at least two mutant genes in chromosome 6, the chromosome that affects the response of the body to various antigens. Viral infection may also influence the occurrence of type 1 and type 2 diabetes. Studies have shown that infection with viruses such as rubella, Coxsackievirus, mumps, hepatitis B virus, and cytomegalovirus increases the risk of developing diabetes.

II. LITERATURE REVIEW

Yasodha et al. [1] apply classification to diverse types of datasets to decide whether a person is diabetic or not. The diabetic patients' data set is established by gathering data from a hospital warehouse and contains two hundred instances with nine attributes. The instances in this dataset refer to two groups, i.e. blood tests and urine tests. In this study the implementation is done using WEKA to classify the data; the data is assessed by means of a 10-fold cross-validation approach, as it performs well on small datasets, and the outcomes are compared. Naïve Bayes, J48, REP Tree and Random Tree are used. It was concluded that J48 works best, showing an accuracy of 60.2% among the others.
Aiswarya et al. [2] aim to discover solutions to detect diabetes by investigating and examining the patterns found in the data via classification analysis using Decision Tree and Naïve Bayes algorithms. The research hopes to propose a faster and more efficient method of identifying the disease that will help in well-timed cure of the patients. Using the PIMA dataset and a cross-validation approach, the study concluded that the J48 algorithm gives an accuracy rate of 74.8%, while naïve Bayes gives an accuracy of 79.5% using a 70:30 split.

Gupta et al. [3] aim to find and calculate the accuracy, sensitivity and specificity percentages of numerous classification methods, and also tried to compare and analyse the results of several classification methods in WEKA. The study compares the performance of the same classifiers when implemented on other tools, including RapidMiner and Matlab, using the same parameters (i.e. accuracy, sensitivity and specificity). They applied the JRIP, Jgraft and BayesNet algorithms. The results show that Jgraft gives the highest accuracy, i.e. 81.3%, with a sensitivity of 59.7% and a specificity of 81.4%. It was also concluded that WEKA performs better than Matlab and RapidMiner.

Lee et al. [4] focus on applying a decision tree algorithm named CART on the diabetes dataset after applying a resample filter over the data. The author emphasizes the class imbalance problem and the need to handle it before applying any algorithm in order to achieve better accuracy rates. Class imbalance mostly occurs in datasets having dichotomous values, which means that the class variable has two possible outcomes; it can be handled easily if observed early in the data preprocessing stage, and doing so helps in boosting the accuracy of the predictive model.

III. METHODOLOGY

In this section we shall learn about the various classifiers used in machine learning to predict diabetes. We shall also explain our proposed methodology to improve the accuracy. Five different methods were used in this paper. The different methods used are defined below. The output is the accuracy metrics of the machine learning models; the best model can then be used for prediction.

Dataset Description
The diabetes data set originated from https://www.kaggle.com/johndasilva/diabetes and contains 2000 cases. The objective is to predict, based on the given measures, whether the patient is diabetic or not.
➔ The diabetes data set consists of 2000 data points, with 9 features each.
➔ "Outcome" is the feature we are going to predict: 0 means no diabetes, 1 means diabetes.
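As an illustration, the snippet below is a minimal sketch of how such a dataset could be loaded and split for the experiments. The local file name diabetes.csv, the "Outcome" column name and the split settings are assumptions based on the description above, not the authors' exact code.

```python
# Minimal loading/splitting sketch (assumes the Kaggle file has been saved
# locally as "diabetes.csv" with an "Outcome" column, as described above).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("diabetes.csv")          # 2000 rows x 9 columns expected
X = df.drop(columns="Outcome")            # predictor features
y = df["Outcome"]                         # 0 = no diabetes, 1 = diabetes

# Stratify so both classes keep their proportions in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

print(X_train.shape, X_test.shape)
```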
[Figure: proposed system workflow: Dataset → Preprocess Data → Apply Algorithms → Performance Evaluation on Various Measures]
Correlation Matrix:
It is easy to see that there is no single feature that has a very high correlation with our outcome value.
Some of the features have a negative correlation with the outcome value and some have a positive one.
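A sketch of how such a correlation matrix could be computed and plotted with pandas and seaborn is shown below; it is illustrative only, and the file name and column names are assumed from the dataset description.

```python
# Sketch of the correlation-matrix computation and heatmap plot.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("diabetes.csv")
corr = df.corr()                           # pairwise Pearson correlations

plt.figure(figsize=(9, 7))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation matrix of the diabetes dataset")
plt.show()

# Correlation of each feature with the outcome, sorted.
print(corr["Outcome"].sort_values(ascending=False))
```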
Histogram:
Let’s take a look at the plots. It shows how each feature and label is distributed along different ranges, which
further confirms the need for scaling. Next, wherever you see discrete bars, it basically means that each of these
is actually a categorical variable. We will need to handle these categorical variables before applying Machine
Learning. Our outcome labels have two classes, 0 for no disease and 1 for disease.
The above graph shows that the data is biased towards data points having an outcome value of 0, i.e. cases where diabetes is not present. The number of non-diabetic patients is almost twice the number of diabetic patients.
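The following sketch shows one way the per-feature histograms and the class-balance check could be produced; the figure size and bin count are arbitrary illustrative choices, not the authors' settings.

```python
# Sketch of the per-feature histograms and the class-balance check.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("diabetes.csv")

# One histogram per column; the differing value ranges motivate feature scaling.
df.hist(figsize=(12, 10), bins=30)
plt.tight_layout()
plt.show()

# Class balance: roughly twice as many non-diabetic (0) as diabetic (1) cases.
print(df["Outcome"].value_counts())
```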
k-Nearest Neighbors:
The k-NN algorithm is arguably the simplest machine learning algorithm. Building the model consists only of
storing the training data set. To make a prediction for a new data point, the algorithm finds the closest data
points in the training data set, its “nearest neighbors.”
First, let’s investigate whether we can confirm the connection between model complexity and accuracy:
The above plot shows the training and test set accuracy on the y-axis against the setting of n_neighbors on the x-axis. If we choose a single nearest neighbor, the prediction on the training set is perfect. But when more neighbors are considered, the training accuracy drops, indicating that using the single nearest neighbor leads to a model that is too complex. The best performance is somewhere around 9 neighbors.
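A sketch of how this complexity curve could be generated with scikit-learn is shown below; the scaling step, the range of n_neighbors and the split are assumptions rather than the authors' exact setup.

```python
# Sketch of the n_neighbors sweep for k-NN.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv("diabetes.csv")
X, y = df.drop(columns="Outcome"), df["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# k-NN is distance based, so the features are standardised first.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

train_acc, test_acc, ks = [], [], range(1, 16)
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train_s, y_train)
    train_acc.append(knn.score(X_train_s, y_train))
    test_acc.append(knn.score(X_test_s, y_test))

plt.plot(ks, train_acc, label="training accuracy")
plt.plot(ks, test_acc, label="test accuracy")
plt.xlabel("n_neighbors")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```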
Table-1
Logistic regression:
Table-2
➔ In the first row, the default value of C=1 gives 77% accuracy on the training set and 78% accuracy on the test set.
➔ In the second row, using C=0.01 gives 78% accuracy on both the training and the test sets.
➔ Using C=100 results in slightly lower accuracy on the training set and slightly higher accuracy on the test set, suggesting that less regularization and a more complex model may not generalize better than the default setting.
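A sketch of the corresponding experiment, varying the regularization parameter C of scikit-learn's LogisticRegression, is given below; the scaling pipeline and the max_iter value are assumptions.

```python
# Sketch of the regularisation comparison for logistic regression
# (C values follow the text above).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("diabetes.csv")
X, y = df.drop(columns="Outcome"), df["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

for C in (1, 0.01, 100):
    # Scaling inside a pipeline keeps the comparison fair across C values.
    logreg = make_pipeline(StandardScaler(),
                           LogisticRegression(C=C, max_iter=1000))
    logreg.fit(X_train, y_train)
    print(f"C={C}: train={logreg.score(X_train, y_train):.3f}, "
          f"test={logreg.score(X_test, y_test):.3f}")
```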
Decision Tree:
This classifier creates a decision tree based on which it assigns the class value to each data point. Here, we can vary the maximum number of features to be considered while creating the model.
Table-3
The accuracy on the training set is 100% and the test set accuracy is also good.
Feature importance rates how important each feature is for the decision a tree makes. It is a number between
0 and 1 for each feature, where 0 means “not used at all” and 1 means “perfectly predicts the target”.
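A sketch of how the tree could be trained and its feature importances inspected is shown below; the particular max_features values tried are assumptions made for illustration.

```python
# Sketch of the decision-tree experiment (varying the maximum number of
# features considered per split, as described above) and its feature importances.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("diabetes.csv")
X, y = df.drop(columns="Outcome"), df["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

for mf in (None, 4, 6):   # maximum number of features considered per split
    tree = DecisionTreeClassifier(max_features=mf, random_state=0).fit(X_train, y_train)
    print(f"max_features={mf}: train={tree.score(X_train, y_train):.3f}, "
          f"test={tree.score(X_test, y_test):.3f}")

# Feature importance: 0 means "not used at all"; the values sum to 1 overall.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(pd.Series(tree.feature_importances_, index=X.columns).sort_values(ascending=False))
```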
Random Forest:
This classifier takes the concept of decision trees to the next level. It creates a forest of trees where each tree is
formed by a random selection of features from the total features.
Similarly to the single decision tree, the random forest also gives a lot of importance to the “Glucose” feature, but
it also chooses “BMI” to be the 2nd most informative feature overall.
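A sketch of the corresponding random forest experiment is shown below; the number of trees (n_estimators=100) is an assumption, not the authors' reported setting.

```python
# Sketch of the random-forest experiment and its averaged feature importances.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("diabetes.csv")
X, y = df.drop(columns="Outcome"), df["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("train:", round(forest.score(X_train, y_train), 3),
      "test:", round(forest.score(X_test, y_test), 3))

# Importances averaged over all trees; Glucose and BMI typically rank highest here.
print(pd.Series(forest.feature_importances_, index=X.columns).sort_values(ascending=False))
```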
Support Vector Machine:
This classifier aims at forming a hyperplane that can separate the classes as much as possible by adjusting the distance between the data points and the hyperplane. There are several kernels based on which the hyperplane is decided. We tried four kernels, namely linear, poly, rbf, and sigmoid.
[Figure: SVM accuracy by kernel: linear 0.778, poly 0.774, rbf 0.774, sigmoid 0.482]
As can be seen from the plot above, the linear kernel performed the best for this dataset and achieved a score of
77%.
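A sketch of how the four kernels could be compared with scikit-learn's SVC is shown below; the standardization step and the default hyperparameters are assumptions.

```python
# Sketch of the kernel comparison for the support vector classifier.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

df = pd.read_csv("diabetes.csv")
X, y = df.drop(columns="Outcome"), df["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    # SVMs are sensitive to feature scale, so standardise inside a pipeline.
    svm = make_pipeline(StandardScaler(), SVC(kernel=kernel)).fit(X_train, y_train)
    print(f"{kernel}: test accuracy = {svm.score(X_test, y_test):.3f}")
```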
Accuracy Comparison:
Table-5
Table-5 shows the accuracy values for all five machine learning algorithms. The Decision Tree algorithm gives the best accuracy, with 98% training accuracy and 99% testing accuracy.
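A comparison along the lines of Table-5 could be assembled as in the sketch below, which gathers train and test accuracies for all five classifiers into one table; the individual model settings mirror the earlier sketches and remain assumptions.

```python
# Sketch of assembling an accuracy-comparison table for the five classifiers.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

df = pd.read_csv("diabetes.csv")
X, y = df.drop(columns="Outcome"), df["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "K nearest neighbour": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=9)),
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Support vector machine": make_pipeline(StandardScaler(), SVC(kernel="linear")),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    rows.append({"Model": name,
                 "Train accuracy": model.score(X_train, y_train),
                 "Test accuracy": model.score(X_test, y_test)})

print(pd.DataFrame(rows).round(3))
```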
V. CONCLUSION AND FUTURE WORK

One of the important real-world medical problems is the detection of diabetes at its early stage. In this study, systematic efforts are made in designing a system which results in the prediction of diabetes. During this work, five machine learning classification algorithms are studied and evaluated on various measures. Experiments are performed on the Kaggle diabetes database. Experimental results determine the adequacy of the designed system, with an achieved accuracy of 99% using the Decision Tree algorithm.

In future, the designed system with the used machine learning classification algorithms can be used to predict or diagnose other diseases. The work can be extended and improved for the automation of diabetes analysis, including some other machine learning algorithms.
About Author