A Cardiovascular Disease Prediction Using Machine Learning Algorithms
6 authors, including:
Rubini Pe
CMR Institute of Technology
ABSTRACT
Heart disease has taken a tremendous toll in the modern age. Because doctors deal with precious human life, it
is very important for their results to be right. Thus, an application was developed that can predict
vulnerability to heart disease given basic attributes such as age, gender, pulse rate, resting blood pressure,
cholesterol, fasting blood sugar, resting electrocardiographic results, exercise-induced angina, ST depression,
the slope of the ST segment at peak exercise, the number of major vessels colored by fluoroscopy, and maximum
heart rate achieved. Doctors can use it to recheck and confirm their patient's condition. Existing surveys
considered only 10 features for prediction, whereas this research work takes 14 necessary features into
consideration. This paper also presents a comparative analysis of machine learning techniques, namely Random
Forest (RF), Logistic Regression, Support Vector Machine (SVM), and Naïve Bayes, in the classification of
cardiovascular disease. In the comparative analysis, Random Forest proved to be the most accurate and reliable
algorithm and is hence used in the proposed system. The system also reports the relation between diabetes and
heart disease and how much diabetes influences it.
Keywords:
Heart disease; Machine learning algorithms; Random Forest; Logistic regression; Support Vector Machine;
Naïve Bayes; Diabetes Influence
1. Introduction
Heart disease has the highest death rate in the world. In 2012, around 17.5 million people died
from heart disease, which means it accounts for 31% of all global deaths. Moreover, the heart
disease death toll rises every year and is expected to grow to more than 23.6 million by 2030.
Research from January 2017 showed that the leading cause of death worldwide is cardiovascular
disease. Cardiovascular disease is considered the world's biggest killer; it has held the top
position in the list of ten causes of death for the past 15 years, and in 2015 it accounted for
fifteen million deaths. Many human lives could be saved by diagnosing in time. Diagnosing the
disease is therefore important, and it is a very complicated task. Automating this process would
overcome the problems with diagnosis. The use of machine learning in disease classification is
common, and researchers are particularly interested in developing such systems for easier
tracking and diagnosis of cardiovascular diseases. Since ML allows computer programs to learn
from data, building a model to recognize common patterns and make decisions based on the
gathered information, it does not have difficulties with the incompleteness of the medical
database used. The proposed model gathers relevant data on all factors related to heart disease
and the parameters influencing it, trains on the data according to the proposed machine learning
algorithm, and predicts how strong the probability is for a patient to develop heart disease. The
relationship with diabetes-related attributes is considered in order to establish the influence. [2]
http://annalsofrscb.ro 904
Annals of R.S.C.B., ISSN:1583-6258, Vol. 25, Issue 2, 2021, Pages. 904 - 912
Received 20 January 2021; Accepted 08 February 2021.
2. Methodology
Cardiovascular disease was predicted using the following four algorithms, and the results are
compared. Fig. 1 shows the architecture diagram for predicting cardiovascular disease.
1. Random Forest
2. Logistic Regression
3. Naive Bayes algorithm
4. Support Vector Machines
The dataset used for analysis is the “Framingham” dataset obtained from Kaggle. A heart disease
dataset with 14 features is obtained from the UCI Machine Learning Repository [19]. The data is
cleaned by replacing all non-available values with the median of the values in that column.
Categorical data are assigned numerical values.
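The cleaning step described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' code: the helper names `impute_median` and `encode_categories`, the column names, and the toy records are all assumptions made for the example.

```python
from statistics import median

def impute_median(rows, columns):
    """Replace missing values (None) in each listed column with that column's median."""
    cleaned = [dict(r) for r in rows]  # copy so the input records stay untouched
    for col in columns:
        observed = [r[col] for r in cleaned if r[col] is not None]
        fill = median(observed)
        for r in cleaned:
            if r[col] is None:
                r[col] = fill
    return cleaned

def encode_categories(rows, column):
    """Assign an integer code to each distinct category (codes follow sorted order)."""
    codes = {value: i for i, value in enumerate(sorted({r[column] for r in rows}))}
    return [{**r, column: codes[r[column]]} for r in rows], codes

# Hypothetical records; the field names are illustrative only.
records = [
    {"age": 63, "chol": 233, "sex": "male"},
    {"age": 41, "chol": None, "sex": "female"},
    {"age": 57, "chol": 354, "sex": "female"},
    {"age": 56, "chol": 192, "sex": "male"},
]
cleaned = impute_median(records, ["age", "chol"])
encoded, sex_codes = encode_categories(cleaned, "sex")
```

The median is computed only from the observed values in a column, so a single missing entry does not distort the fill value.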
2. Implementation
The implementation of random forest works as follows:
a. Load the heart disease dataset.
b. After preprocessing, split the heart disease dataset into train and test data in the proportion
60:40 for the Random Forest classifier.
c. Apply K-fold cross-validation, in which the data set is split into K sections (folds) and
every fold is used as the testing set at some point.
d. Train the model using train set.
e. Make predictions on the test fold.
f. Map predictions to outcomes (only possible outcomes are 1 and 0).
g. Calculate the accuracy.
Accuracy = ((TP + TN) / (TP + TN + FP + FN)) * 100
Where,
TP - True Positive (the prediction is yes, and they do have the disease.)
TN - True Negative (the prediction is no, and they don't have the disease.)
FP - False Positive (the prediction is yes, but they don't actually have the disease. Also known
as a "Type I error.")
FN - False Negative (the prediction is no, but they actually do have the disease. Also known as a
"Type II error.")
The accuracy obtained by using the random forest algorithm is 84.81%.
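The fold splitting in step c and the accuracy calculation in step g can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names and the example labels are hypothetical, no shuffling is applied, and the classifier itself is omitted, so the numbers below have no relation to the reported result.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs so that every sample is tested exactly once."""
    # Distribute any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels where 1 = disease and 0 = no disease."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy_percent(y_true, y_pred):
    """Accuracy = ((TP + TN) / (TP + TN + FP + FN)) * 100."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return 100 * (tp + tn) / (tp + tn + fp + fn)

# Hypothetical labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
folds = list(k_fold_indices(len(y_true), 3))
acc = accuracy_percent(y_true, y_pred)
```

In a full pipeline the model would be trained on `train_idx` and evaluated on `test_idx` for each fold, and the per-fold accuracies averaged.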
2. Kernel Functions
If the data points are not linearly separable, the kernel function transforms them so that a
linear decision surface can separate them.
Some kernel functions are as follows:
a. Linear Function: With this kind of kernel the hyperplane is a straight line. Linear kernel
functions can provide the best results for classifiers that have exactly two target classes.
b. Polynomial Function: With this kind of kernel function the hyperplane is generally a
polynomial curve such as a parabola or hyperbola.
c. Radial Basis Function: The Radial Basis Function is used when points cannot be separated in a
linear fashion. The function maps points into a mostly radial (circular) arrangement so that
further processing can be performed.
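The three kernels listed above can be written out in their commonly used forms. This is a sketch under assumptions: the paper gives no formulas or parameter values, so `degree`, `coef0`, and `gamma` below are illustrative defaults, not the authors' settings.

```python
import math

def linear_kernel(x, z):
    """Linear kernel: the plain dot product of the two feature vectors."""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """Polynomial kernel: (x . z + coef0) ** degree."""
    return (linear_kernel(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=0.5):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

# Two hypothetical feature vectors.
x = [1.0, 2.0]
z = [3.0, 4.0]
k_lin = linear_kernel(x, z)
k_poly = polynomial_kernel(x, z)
k_rbf = rbf_kernel(x, z)
```

The RBF value depends only on the distance between the two points, which is why it equals 1 for identical points and decays toward 0 as they move apart.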
3. Implementation
The implementation of the Support Vector Machine is described as follows:
a. Load the data sets and clean the values; if a particular feature has no value in a row,
replace it with the median value of that feature from the dataset.
b. Split the data set into train and test sets in a 60:40 ratio respectively.
c. Choose the kernel function: either the Linear Kernel Function or the Radial Basis Function.
d. Apply SVM by first creating a hyperplane with the help of the train data set.
e. Calculate the accuracy as follows:
The train data is taken and each kernel function, namely the Linear Kernel Function and the
Radial Basis Function, is applied.
Apply the test data set to the trained model.
The model uses the hyperplane and finds the closest proximity to either class, that is, having
heart disease (yes/1) or not having heart disease (no/0).
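For the linear kernel, the final classification step amounts to checking on which side of the hyperplane a point falls. The sketch below assumes the weights and bias have already been learned; the numbers used here are hypothetical placeholders, not values from the paper, and the training procedure itself is not shown.

```python
def decision_value(weights, bias, features):
    """Signed score w . x + b; its sign tells which side of the hyperplane x lies on."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def predict(weights, bias, features):
    """Return 1 (heart disease) on the non-negative side, otherwise 0 (no heart disease)."""
    return 1 if decision_value(weights, bias, features) >= 0 else 0

# Hypothetical learned parameters and two hypothetical (scaled) patients.
weights = [0.8, -0.5]
bias = -0.1
patient_a = [1.0, 0.5]
patient_b = [0.2, 1.0]
label_a = predict(weights, bias, patient_a)
label_b = predict(weights, bias, patient_b)
```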
Table 1 examines the accuracies calculated for both SVM models, with RBF and the Linear Function
as kernels. The Linear Kernel Function provides higher accuracy than RBF. This is because the
problem is a two-class classification problem; hence a hyperplane in the form of a line would be
the best way to classify such values. In comparison, RBF uses a circle as the hyperplane, thus
producing lower accuracy. The hyperplane plot for SVM for predicting heart disease is shown in
Fig. 4. In it, the yellow dots represent patients having heart disease and the purple dots
represent patients not having heart disease.
Figure 4: Hyperplane and distribution of data points on either side of the hyperplane for Heart
Disease Prediction
The Naïve Bayes classifier is based on probability, which is used mostly in the training phase.
This algorithm is used for removing redundant data from the datasets.
1. Implementation
j. Compare the result between the last column and the predicted values.
k. Calculate the accuracy.
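A generic Gaussian naive Bayes classifier ends with the same two steps: compare the predicted values with the true labels, then compute the accuracy. The sketch below is an assumption-laden illustration, not the authors' procedure: the paper does not state which naive Bayes variant was used, and the function names, the small variance-smoothing constant, and the toy data are all hypothetical.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate, per class, the prior plus a mean and variance for every feature."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # 1e-9 is an assumed smoothing term that avoids division by zero.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(rows), means, variances)
    return model

def predict_gaussian_nb(model, row):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical two-feature data (for example age and cholesterol); 1 = disease.
X_train = [[45, 180], [50, 190], [48, 185], [62, 260], [66, 280], [64, 270]]
y_train = [0, 0, 0, 1, 1, 1]
X_test = [[47, 182], [65, 275]]
y_test = [0, 1]

model = fit_gaussian_nb(X_train, y_train)
predictions = [predict_gaussian_nb(model, row) for row in X_test]
# Compare predicted values with the true labels, then compute the accuracy.
correct = sum(1 for p, t in zip(predictions, y_test) if p == t)
accuracy = 100 * correct / len(y_test)
```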
D. Logistic Regression
Logistic regression is a machine learning algorithm used for classification. It is based on the
concept of probability. Logistic regression is used to assign observations to a discrete class.
The output is transformed using the logistic sigmoid function. The logistic regression hypothesis
limits its output to the range between 0 and 1. A linear function therefore cannot represent it,
because a linear function can take values greater than 1 or less than 0, which is not possible
according to the logistic regression hypothesis.
1. Implementation
The implementation steps for logistic regression are given as follows:
a. Obtain the probabilities: map predicted values to probabilities using the sigmoid function.
$$\sigma(y) = \frac{1}{1 + e^{-y}} \tag{6}$$
where y is the input to the function and e is the base of the natural logarithm. Obtain the
probabilities by the following equations:
$$P = \frac{e^{y}}{1 + e^{y}} \tag{7}$$
where P is the probability of success, and q is the probability of failure, written as:
$$q = 1 - P = 1 - \frac{e^{y}}{1 + e^{y}} \tag{8}$$
On dividing (7) by (8), we get
$$\frac{P}{1 - P} = e^{y} \tag{9}$$
On taking the log of both sides,
$$\log\left(\frac{P}{1 - P}\right) = y \tag{10}$$
where P/(1 - P) is the odds. When 'y' is positive, the probability of success is more than 50%.
classification is negative. Logistic regression can also have multiple classes where the highest
probability predicted class is considered.
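The relationship between the sigmoid, the odds, and the log-odds in step a can be checked numerically. This is a small sketch; the function names are chosen for the example and the input values are arbitrary.

```python
import math

def sigmoid(y):
    """Map any real input y to a probability in (0, 1); equal to e**y / (1 + e**y)."""
    return 1.0 / (1.0 + math.exp(-y))

def odds(p):
    """Odds of success: probability of success divided by probability of failure."""
    return p / (1.0 - p)

def log_odds(p):
    """Natural log of the odds; applying it to sigmoid(y) recovers y."""
    return math.log(odds(p))

p_positive = sigmoid(1.5)    # positive input gives a probability above 0.5
p_negative = sigmoid(-1.5)   # negative input gives a probability below 0.5
recovered = log_odds(sigmoid(2.0))
```

Because the odds of `sigmoid(y)` equal `e**y`, taking the logarithm returns the original input, which is the step from equation (9) to equation (10).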
2. Analysis of result:
The result can be analyzed in following ways.
a. Using Confusion Matrix: Accuracy is calculated by formula
Accuracy= ((TP + TN) / (TP + TN + FP + FN)) * 100
Where TP- True Positive, TN-True Negative, FP-False Positive, FN-False Negative
b. ROC curve: The receiver operating characteristic summarizes the performance by evaluating the
trade-offs between sensitivity and 1 - specificity. To plot the ROC curve, assume p > 0.5. The
area under the curve, referred to as an index of accuracy or concordance index, is a performance
metric for the curve. The larger the area under the curve, the better the predictive power of
the model.
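The ROC curve and the area under it can be computed directly from predicted scores. This is a minimal sketch, assuming binary labels with at least one positive and one negative sample; the function names, labels, and scores are made up for illustration, and the area is computed with the trapezoidal rule.

```python
def roc_points(y_true, scores):
    """Return (1 - specificity, sensitivity) points, one per distinct score threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under the piecewise-linear curve through the points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical labels (1 = disease) and predicted probabilities.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(y_true, scores)
auc = area_under_curve(curve)
```

Here 8 of the 9 positive-negative pairs are ranked correctly, so the area is 8/9; a model that ranks every positive above every negative would reach 1.0.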
3. Result
Results from Random Forest, Support Vector Machine, Logistic Regression and naïve Bayes are
analyzed, and Random Forest Algorithm has given the highest accuracy. Hence Random Forest
has been implemented in the proposed system.
Heart disease prediction using a machine learning algorithm gives users a prediction of whether
they have heart disease. Recent advancements in technology have driven the evolution of machine
learning algorithms. In the proposed method the Random Forest algorithm was used because of its
efficiency and accuracy. This algorithm is also used to find the heart disease prediction
percentage from the correlation between diabetes and heart disease. Similar prediction systems
can be built by calculating the correlation between heart disease and other diseases. New
algorithms can also be used to achieve increased accuracy. Better performance is obtained when
more parameters are used in these algorithms.
References
[1] Jaymin Patel, Tejal Upadhyay, and Samir Patel, "Heart disease prediction using machine
learning and data mining technique," Vol. 7, No. 1, Sept 2015 - March 2016.
[2] K. Thenmozhi and P. Deepika, "Heart disease prediction using classification with different
decision tree techniques," International Journal of Engineering Research & General Science,
Vol. 2(6), pp. 6-11, Oct 2014.
[3] Igor Kononenko, "Machine learning for medical diagnosis: history, state of the art and
perspective."