A Cardiovascular Disease Prediction Using Machine Learning Algorithms

This document summarizes a research study that used machine learning algorithms to predict cardiovascular disease. Specifically, it used Random Forest, Logistic Regression, Naive Bayes, and Support Vector Machines on a dataset of 14 features to classify patients as having cardiovascular disease or not. Random Forest performed the best with an accuracy of over 80%. The study also examined the relationship between diabetes and cardiovascular disease risk.

Article  in  Annals of the Romanian Society for Cell Biology · January 2021

Annals of R.S.C.B., ISSN:1583-6258, Vol. 25, Issue 2, 2021, Pages. 904 - 912
Received 20 January 2021; Accepted 08 February 2021.

A Cardiovascular Disease Prediction using Machine Learning Algorithms

Rubini PE (1), Dr. C.A. Subasini (2), Dr. A. Vanitha Katharine (3), V. Kumaresan (4),
S. GowdhamKumar (5), T.M. Nithya (6)

(1) Assistant Professor, Department of Computer Science and Engineering, CMR Institute of Technology, Bengaluru.
(2) Associate Professor, Department of Computer Science and Engineering, St. Joseph's Institute of Technology,
Chennai-119.
(3) Associate Professor, Department of Computer Applications, PSNA College of Engineering and Technology,
Dindigul.
(4) Assistant Professor (Senior Grade), Department of Electrical and Electronics Engineering, Kongu Engineering
College (Autonomous), Perundurai, Erode-638060.
(5) Training Officer, PSG Industrial Institute (PSG College of Technology), Peelamedu, Coimbatore-641041.
(6) Assistant Professor, Department of Computer Science and Engineering, K. Ramakrishnan College of Engineering,
Trichy.

ABSTRACT
Heart diseases have taken a tremendous toll in this modern age. Because doctors deal with precious human
lives, it is very important that their results be right. Thus, an application was developed which can predict
vulnerability to heart disease, given basic symptoms such as age, gender, pulse rate, resting blood pressure,
cholesterol, fasting blood sugar, resting electrocardiographic results, exercise-induced angina, ST depression,
the slope of the ST segment at peak exercise, the number of major vessels colored by fluoroscopy, and maximum
heart rate achieved. Doctors can use it to recheck and confirm their patient's condition. Existing surveys
considered only 10 features for prediction, whereas this research work takes 14 necessary features into
consideration. This paper also presents a comparative analysis of machine learning techniques, namely
Random Forest (RF), Logistic Regression, Support Vector Machine (SVM), and Naïve Bayes, in the classification
of cardiovascular disease. In the comparative analysis, Random Forest proved to be the most accurate and
reliable algorithm and is hence used in the proposed system. The system also reports the relation between
diabetes and heart disease and how strongly diabetes influences it.

Keywords:
Heart disease; Machine learning algorithms; Random Forest; Logistic regression; Support Vector Machine;
Naïve Bayes; Diabetes Influence

1. Introduction

Heart disease accounts for the largest share of deaths in the world. In 2012, around 17.5 million
people died from heart disease, which amounts to 31% of all global deaths. Moreover, the heart
disease death toll rises each year and is expected to grow to more than 23.6 million by 2030.
Research from January 2017 showed that cardiovascular diseases are the leading cause of death
worldwide. Cardiovascular disease is considered the world's biggest killer; it has held the top
position in the list of the ten leading causes of death for the past 15 years, and in 2015 it
accounted for fifteen million deaths. Many human lives could be saved by timely diagnosis.
Diagnosing the disease is therefore important, and it is also a very complicated task. Automating
this process would overcome the problems with diagnosis. The use of machine learning (ML) in
disease classification is common, and researchers are especially interested in developing such
systems for easier tracking and diagnosis of cardiovascular diseases. Since ML allows computer
programs to learn from data, building a model that recognizes common patterns and makes decisions
based on the gathered information, it is not hindered by deficiencies in the medical database used.
The proposed model gathers relevant data on all factors related to heart disease and the parameters
influencing it, trains on the data with the proposed machine learning algorithm, and predicts how
likely a patient is to develop heart disease. The relationship with diabetes-related attributes is
considered in order to establish their influence. [2]

2. Methodology

Cardiovascular disease was predicted using the following four algorithms, and their results were
compared. Fig. 1 shows the architecture diagram for predicting cardiovascular disease.
1. Random Forest
2. Logistic Regression
3. Naive Bayes algorithm
4. Support Vector Machines

Figure 1: Methodology to predict heart disease

A. Random Forest Algorithm


The Random Forest algorithm can be understood as a forest made up of trees. First, it builds decision
trees on randomly selected data samples from the dataset. It then obtains a prediction from each
tree and selects the best result by means of voting. It is an enhancement of decision trees [3].
Its applications include image classification, recommendation engines, and feature selection. The
algorithm is considered highly accurate and robust because of the number of trees participating in
the process. One of its advantages is that it is far less prone to overfitting than a single
decision tree. Finally, it takes the average of the predictions from all the trees, which cancels
out the biases of individual trees.
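The two ideas in this description, drawing random samples for each tree and combining the trees by voting, can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' code: the helper names `bootstrap_sample` and `majority_vote` are hypothetical, and the per-tree predictions are assumed to be already available as a list of class labels.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one random sample per tree)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical example: five trees vote on one patient (1 = disease, 0 = no disease).
votes = [1, 0, 1, 1, 0]
forest_prediction = majority_vote(votes)

# Each tree would be trained on its own resampled copy of the training rows.
sample = bootstrap_sample(list(range(10)), random.Random(0))
```

Because each tree sees a different resampled dataset, the individual trees make different mistakes, and the vote smooths those mistakes out.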

1. Dataset collection and pre-processing

The datasets used for analysis are the "Framingham" dataset, obtained from Kaggle, and the heart
disease dataset with 14 features, obtained from the UCI Machine Learning Repository [17]. The data
is cleaned by replacing all non-available values with the median of the values in that column.
Categorical data are assigned numerical values.
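The cleaning described here (median replacement of missing values and numeric coding of categories) could look roughly like the following. It is a sketch under stated assumptions: missing values are assumed to be represented as `None`, the function names and toy rows are hypothetical, and the paper does not say which integer codes were assigned to the categories.

```python
from statistics import median

def impute_median(rows, missing=None):
    """Replace missing entries in each column with that column's median."""
    filled = [list(r) for r in rows]
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not missing]
        col_median = median(observed)
        for r in filled:
            if r[j] is missing:
                r[j] = col_median
    return filled

def encode_categories(values):
    """Map each distinct category to an integer code (codes follow sorted order)."""
    codes = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes

# Hypothetical toy rows: [age, cholesterol]; None marks a non-available value.
rows = [[63, 233], [41, None], [56, 236], [None, 250]]
clean = impute_median(rows)
sex_codes, mapping = encode_categories(["male", "female", "male"])
```

The median is used rather than the mean because it is not pulled around by a few extreme readings, which are common in clinical measurements such as cholesterol.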

2. Implementation
The implementation of random forest works as follows:
a. Load the heart disease dataset.
b. After preprocessing, split the heart disease dataset into train and test data in the proportion
60:40 for the Random Forest classifier.
c. Apply K-Fold Cross Validation, in which the data set is split into K sections (folds) and every
fold is used as the testing set at some point.
d. Train the model using the train set.
e. Make predictions on the test fold.
f. Map predictions to outcomes (the only possible outcomes are 1 and 0).
g. Calculate the accuracy.
Accuracy = ((TP + TN) / (TP + TN + FP + FN)) * 100
Where,
TP - True Positive (the prediction is yes, and they do have the disease).
TN - True Negative (the prediction is no, and they don't have the disease).
FP - False Positive (the prediction is yes, but they don't actually have the disease; also known as
a "Type I error").
FN - False Negative (the prediction is no, but they actually do have the disease; also known as a
"Type II error").
The accuracy obtained using the random forest algorithm is 84.81%.
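As an illustration of steps c and g, the fold construction and the accuracy formula can be written in plain Python. This is a hedged sketch rather than the code shown in Figure 2: `k_fold_indices` and `accuracy_percent` are hypothetical helper names, and the folds are taken in order without shuffling, which the paper does not specify.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def accuracy_percent(y_true, y_pred):
    """Accuracy = ((TP + TN) / (TP + TN + FP + FN)) * 100 for labels 1 and 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)

folds = list(k_fold_indices(10, 3))
score = accuracy_percent([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
```

In a full run, a model would be trained on each `train_idx`, evaluated on the matching `test_idx`, and the per-fold accuracies averaged.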

Figure 2: Sample Code of Random Forest

Figure 3: Accuracy result of Random Forest algorithm

B. Support Vector Machine


1. Introduction
Support Vector Machine is a classification technique which separates data values by constructing
hyper planes. Hyper planes can take different shapes depending on the spread of the data, but only
the points which help differentiate between the classes (the support vectors) are considered for
classification.

2. Kernel Functions
If the data points are not linearly separable, the kernel function maps them so that a linear
decision surface can separate them.
Some kernel functions are as follows:

a. Linear Function: With this kind of kernel the hyper plane is a straight line. Linear kernel
functions can provide the best results for classifiers which have exactly two target classes.
b. Polynomial Function: With this kind of kernel the hyper plane is generally a polynomial curve
such as a parabola or hyperbola.
c. Radial Basis Function: The Radial Basis Function is used when points cannot be separated in a
linear fashion. The function arranges the points in a mostly radial (circular) shape so that they
can then be separated.
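For reference, the three kernels listed above have standard textbook forms, sketched below in plain Python. The parameter names and default values (`degree`, `coef0`, `gamma`) are assumptions chosen for illustration; the paper does not report the values it used.

```python
import math

def linear_kernel(x, z):
    """K(x, z) = x . z"""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """K(x, z) = (x . z + coef0) ** degree"""
    return (linear_kernel(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2); depends only on the distance between points."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Hypothetical two-feature points.
k_lin = linear_kernel([1, 2], [3, 4])
k_poly = polynomial_kernel([1, 2], [3, 4])
k_rbf_same = rbf_kernel([1, 2], [1, 2])
k_rbf_far = rbf_kernel([0, 0], [3, 4])
```

The RBF value is 1 for identical points and falls toward 0 as they move apart, which is why its decision regions look radial.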

3. Implementation
The implementation of Support Vector Machine is described as follows:
a. Load the data sets and clean the values; if a row has no value for a particular feature,
replace it with the median value of that feature from the dataset.
b. Split the data set into train and test sets in a 60:40 ratio respectively.
c. Choose the kernel function: Linear Kernel Function or Radial Basis Function.
d. Apply SVM by first creating a hyper plane with the help of the train data set.
e. Calculate the accuracy as follows:
- Take the train data and apply each kernel function, namely the Linear Kernel Function and the
Radial Basis Function.
- Apply the test data set to the trained model.
- The model uses the hyper plane and finds the closest proximity to either class, that is, having
heart disease (yes/1) or not having heart disease (no/0).

Kernel Function                  Accuracy (%)
Linear Kernel Function           74.05
Radial Basis Function (RBF)      58.577

TABLE 1: Comparison of SVM accuracies with Kernel Functions

Table 1 compares the accuracies of the two SVM models, with RBF and the Linear Function as
kernels. The Linear Kernel Function provides higher accuracy than RBF. This is because the problem
is a two-class classification problem, so a hyper plane in the form of a line is the best way to
classify such values. In comparison, RBF uses a circle as the hyper plane, thus producing lower
accuracy. The hyper plane plot of the SVM for predicting heart disease is shown in Fig. 4. In it,
the yellow dots represent patients having heart disease and the purple dots represent patients not
having heart disease.

Figure 4: Hyper plane and distribution of data points on either side of hyper plane for Heart
Disease Prediction

C. Naïve Bayes Classification

The Naïve Bayes classifier is based on probability, and the probabilities are computed mostly in
the training phase. The algorithm is also used for removing redundant data from the datasets.

1. Implementation

The implementation of Naive Bayes is as follows:


a. Extract the dataset.
b. Apply cleaning on the dataset to remove unwanted values.
c. In case any values are missing then find the median value of the column and fill the
missing value.
d. Find the conditional probability of occurrence of heart disease with respect to the 14
parameters.
e. Then find the conditional probability of non-occurrence of heart disease with respect to the 14
parameters.
f. Train the model using this probability formula given below
= (1)
(2)
(3)
(4)
(5)
Where x1 - age; x2 - sex; x3 - cp; x4 - rest bp; x5 - chol; x6 - fbs; x7 - rest ecg; x8 - thalach;
x9 - exang; x10 - oldpeak; x11 - slope; x12 - ca; x13 - thal; x14 - pulse rate.
g. Once the model is trained, apply the test data set.
h. Remove the last column of the test data set, which indicates whether or not the person has
heart disease.
i. Apply the model on the test data set and extract the values.

j. Compare the result between the last column and the predicted values.
k. Calculate the accuracy.
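The training and prediction steps above can be illustrated with a small from-scratch classifier. The sketch below uses the standard naive Bayes rule, choosing the class y that maximizes P(y) multiplied by the product of P(xi | y) over the features. It assumes Gaussian likelihoods for every feature, which is an assumption made here for brevity and is not stated in the paper; the function names and toy data are likewise hypothetical.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the class prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small floor keeps the variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gaussian_nb(model, row):
    """Return the class maximizing log P(y) + sum_i log P(x_i | y)."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data: [age, resting blood pressure]; 1 = disease, 0 = no disease.
X = [[45, 120], [50, 130], [62, 160], [67, 170]]
y = [0, 0, 1, 1]
model = fit_gaussian_nb(X, y)
```

Working with log-probabilities avoids numerical underflow when the per-feature probabilities of all 14 parameters are multiplied together.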

Figure 5: Working of Naïve Bayes

D. Logistic Regression
Logistic regression is a machine learning algorithm used for classification. It is based on the
concept of probability and is used to assign observations to a discrete class. The output is
transformed using the logistic sigmoid function. The logistic regression hypothesis limits its
output to the range between 0 and 1. A linear function cannot represent it, because a linear
function can take a value > 1 or < 0, which is not possible under the logistic regression
hypothesis.

1. Implementation
The implementation steps for logistic regression are as follows:
a. Obtain the probabilities: Map predicted values to probabilities using the sigmoid function.
$$S(y) = \frac{1}{1 + e^{-y}} \quad (6)$$
where y is the input to the function and e is the base of the natural logarithm. Obtain the
probabilities from the following equations:
$$P = \frac{e^{y}}{1 + e^{y}} \quad (7)$$
where P is the probability of success, and q is the probability of failure, written as:
$$q = 1 - P = 1 - \frac{e^{y}}{1 + e^{y}} \quad (8)$$
On dividing (7) by (8), we get
$$\frac{P}{1 - P} = e^{y} \quad (9)$$
On taking the log of both sides,
$$\log\left(\frac{P}{1 - P}\right) = y \quad (10)$$
where P/(1 - P) is the odds. When y is positive, the probability of success is more than 50%.
b. Decision Boundary-Mapping probabilities to classes


The prediction function returns a probability score between 0 and 1. To assign an observation to a
discrete class, a threshold value is selected; a score above it is classified as class 1,
otherwise as class 2. For example, if the threshold is 0.5 and the function value is 0.7, the
observation is classified as positive. For a value of, say, 0.3, the classification is negative.
Logistic regression can also handle multiple classes, in which case the class with the highest
predicted probability is chosen.
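The sigmoid mapping, the log-odds relation, and the threshold rule described in steps a and b can be checked numerically with a short Python sketch. The function names are hypothetical, and the 0.5 threshold is simply the example value used in the text.

```python
import math

def sigmoid(y):
    """Map a real-valued score y to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

def log_odds(p):
    """Inverse of the sigmoid: log(P / (1 - P))."""
    return math.log(p / (1.0 - p))

def classify(p, threshold=0.5):
    """Return 1 (positive) if the probability is above the threshold, else 0."""
    return 1 if p > threshold else 0

p_pos = sigmoid(2.0)          # positive score, so probability above 0.5
recovered = log_odds(p_pos)   # taking log-odds recovers the original score
labels = [classify(0.7), classify(0.3)]
```

Recovering the input score from `log_odds(sigmoid(y))` is the numerical counterpart of dividing the success probability by the failure probability and taking the logarithm.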

2. Analysis of result:
The result can be analyzed in following ways.
a. Using Confusion Matrix: Accuracy is calculated by formula
Accuracy= ((TP + TN) / (TP + TN + FP + FN)) * 100
Where TP- True Positive, TN-True Negative, FP-False Positive, FN-False Negative
b. ROC curve: The receiver operating characteristic curve summarizes performance by evaluating the
trade-offs between sensitivity and 1-specificity. To plot the ROC curve, a cut-off of p > 0.5 is
assumed. The area under the curve, also referred to as the index of accuracy or concordance index,
is a performance metric for the curve. The larger the area under the curve, the better the
predictive power of the model.
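To make the ROC quantities concrete, the sketch below computes one ROC point (1-specificity, sensitivity) at a given cut-off and the area under the curve. The AUC is computed through its concordance interpretation: the fraction of (positive, negative) pairs in which the positive case receives the higher score. The function names and toy scores are hypothetical, and this is not necessarily how the authors produced Figure 7.

```python
def roc_point(y_true, scores, threshold=0.5):
    """Return (1 - specificity, sensitivity) when predicting 1 for score > threshold."""
    preds = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 0)
    return fp / (fp + tn), tp / (tp + fn)

def roc_auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities for four patients.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
point = roc_point(y_true, scores)
auc = roc_auc(y_true, scores)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.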

Figure 7. ROC Curve - Logistic Regression

Figure 8. Accuracy result of Logistic Regression

3. Result

The results from Random Forest, Support Vector Machine, Logistic Regression and Naïve Bayes were
analyzed, and the Random Forest algorithm gave the highest accuracy (84.81%). Hence Random Forest
has been implemented in the proposed system.

Figure 9. Graphical Representation of Accuracy

ALGORITHM                                                      ACCURACY (%)
Random Forest                                                  84.81
Logistic Regression                                            83.828
Support Vector Machine (using Linear Kernel Function)          74.05
Support Vector Machine (using Radial Basis Kernel Function)    58.577
Naïve Bayes                                                    54.08401

TABLE II: Comparison of Accuracies

4. Conclusion and Future Scope

Heart disease prediction using machine learning algorithms gives users a prediction of whether
they have heart disease. Recent advancements in technology have allowed machine learning
algorithms to evolve. In the proposed method, the Random Forest algorithm was used because of its
efficiency and accuracy. The algorithm is also used to find the heart disease prediction
percentage from the correlation between diabetes and heart disease. Similar prediction systems can
be built by calculating the correlation between heart disease and other diseases. New algorithms
can also be used to achieve higher accuracy. Better performance is obtained when more parameters
are used in these algorithms.

References

[1] Jaymin Patel, Prof. Tejal Upadhyay, Dr. Samir Patel, "Heart disease prediction using Machine
learning and Data Mining Technique", Volume 7, Number 1, Sept 2015-March 2016.
[2] Thenmozhi K. and Deepika P., "Heart Disease Prediction using classification with different
decision tree techniques", International Journal of Engineering Research & General Science,
Vol 2(6), pp 6-11, Oct 2014.
[3] Igor Kononenko, "Machine learning for medical diagnosis: history, state of art & perspective",
Elsevier - Artificial Intelligence in Medicine, Volume 23, Aug 2001.
[4] Gregory F. Cooper, Constantin F. Aliferis, Richard Ambrosino, John Aronis, Bruce G.
Buchanan, Richard Caruana, Michael J. Fine, Clark Glymour, Geoffrey Gordon, Barbara
H. Hanusa, Janine E. Janosky, Christopher Meek, Tom Mitchell, Thomas Richardson,
Peter Spirtes, "An evaluation of machine-learning methods for predicting pneumonia
mortality", Elsevier, Feb 1997.
[5] Sana Bharti, Shailendra Narayan Singh, "Analytical study of heart disease comparing with
different algorithms", Computing, Communication & Automation (ICCCA),
2015 International Conference.
[6] B. Dhomse Kanchan, M. Mahale Kishore, "Study of Machine learning algorithms for special
disease predictions using the principal of component analysis", Global Trends in Signal
Processing, Information Computing and Communication (ICGTSPICC), 2016.
[7] Matjaz Kuka, Igor Kononenko, Cyril Groselj, Katrina Kalif, Jure Fettich, "Analysing and
improving the diagnosis of ischaemic heart disease with machine learning", Elsevier -
Artificial Intelligence in Medicine, Volume 23, May 1999.
[8] Geert Meyfroidt, Fabian Guiza, Jan Ramon, Maurice Brynooghe, "Machine learning
techniques to examine large patient databases", Best Practice & Research Clinical
Anaesthesiology, Elsevier, Volume 23 (1), Mar 1, 2009.
[9] Gregory F. Cooper, Constantin F. Aliferis, Richard Ambrosino, "An evaluation of Machine
learning methods for predicting pneumonia mortality", Elsevier, 1997.
[10] Sanjay Kumar Sen, "Predicting and Diagnosing of Heart Disease Using Machine Learning
Algorithms", International Journal of Engineering And Computer Science, ISSN: 2319-7242,
Volume 6, Issue 6, June 2017.
[11] Abhishek Taneja, "Heart Disease Prediction System Using Data Mining Techniques", Vol. 6,
No (4), December 2013.
[12] Animesh Hazra, Subrata Kumar Mandal, Amit Gupta, Arkomita Mukherjee and Asmita
Mukherjee, "Heart Disease Diagnosis and Prediction Using Machine Learning and Data
Mining Techniques: A Review", Advances in Computational Sciences and Technology,
ISSN 0973-6107, Volume 10, Number 7 (2017).
[13] Beant Kaur, Williamjeet Singh, "Review on Heart Diseases Prediction System using different
Data Mining Techniques", International Journal on Recent and Innovation Trends in
Computing and Communication, Volume 2, Issue 10, October 2014.
[14] Sonam Nikhar, A.M. Karandikar, "Prediction of Heart Disease Using different Machine
Learning Algorithms", Vol-2, Issue-6, June 2016.
[15] S. U. Ghumbre and A. A. Ghatol, "Heart Disease Diagnosis Using Machine Learning
Algorithm", Advances in Intelligent and Soft Computing, Proceedings of the International
Conference on Information Systems Design and Intelligent Applications.
[16] "Machine learning based decision support systems (DSS) for heart disease diagnosis: a
review", Online: 25 March 2017, DOI: 10.1007/s10462-01
[17] Dataset URL: https://archive.ics.uci.edu/ml/machielearnindatabses/heartdisease
