Enhance the Accuracy of Heart Dataset Using the Assist of Hybrid Algorithm

1st Abyan Arya Syahputra Harahap, Information System, Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia ([email protected])
2nd Morin Adepatrick Damanik, Information System, Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia ([email protected])
3rd Inayatul Hafizhah Lubis, Information System, Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia ([email protected])

Abstract—Cardiovascular disease (CVD) is a general term for conditions affecting the heart or blood vessels. It is usually associated with a build-up of fatty deposits inside the arteries (atherosclerosis) and an increased risk of blood clots. It can also be associated with damage to arteries in organs such as the brain, heart, kidneys and eyes. A hybrid algorithm is a combination of algorithms built from simpler algorithms. This paper aims to improve the accuracy of classification algorithms and to improve imbalanced-data processing, feature selection and parameter optimization. The hybrid algorithm classifications used in this paper are XGBoost with Grid Search, Random Forest with Randomized Search, and XGBoost with Randomized Search. The proposed models achieved 100%, 91% and 89% accuracy with an 80:20 split of the data into training and testing sets.

Keywords—Accuracy, Algorithms, Cardiovascular Disease, Classification, Hybrid

I. INTRODUCTION

Cardiovascular disease is defined as a set of conditions related to disorders of the heart and blood vessels. Cardiovascular disease is also caused by coronary artery damage, damage to the whole or part of the heart, or an inadequate supply of nutrients and oxygen to the heart. Predicting and diagnosing cardiovascular disease is significant to ensure the appropriate treatment of this disease. In the early stage of cardiovascular disease, a machine learning model can aid physicians in making the right decision about the medication [1]. Further, medicating a patient appropriately after manually analyzing massive quantities of heart disease-related data is a significant challenge. To address this issue, machine learning (ML) techniques aid in creating predictive models that can process and analyze large amounts of complex medical data and predict the absence or presence of cardiovascular disease for a patient with more accurate results [2]. Machine learning techniques have already achieved significantly higher precision in classification-based problems. Information abstraction has been achieved using various machine learning techniques, including feature selection, classification, and clustering [3].

Classification is a machine learning technique used to categorize data into a given number of classes; it predicts the class labels or categories for new data. Classification algorithms are used to predict the target class, where predefined labels are assigned to instances based on their properties. A supervised learning technique is used for classification: Training set → Classification algorithm → Unseen data → Prediction result [4].

The objectives of this research work are:

1. To identify the valuable features from the dataset using multiple feature selection techniques to improve the classification accuracy.
2. To select an optimal ratio between training and testing data for analyzing prediction accuracy.

II. MATERIAL

A. Description Dataset

This paper proposes an ensemble method-based multilayer dynamic system that can improve its current knowledge in every layer. To test the proposed model's efficiency, we have used a realistic dataset collected from Kaggle. Xiao-Yan et al. proposed an ensemble method-based heart disease prediction model using 1025 instances with 13 independent attributes collected from Kaggle. Two feature selection algorithms (linear discriminant analysis and principal component analysis) are used to select the effective features. In this model, several algorithms are used to construct the ensemble method. The data set dates from 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and Long Beach V. It contains 76 attributes, including the predicted attribute, but all published experiments refer to using a subset of 14 of them. The "target" field refers to the presence of heart disease in the patient; it is integer valued, 0 = no disease and 1 = disease.


The attribute information contains age, sex, chest pain type (4 values), resting blood pressure, serum cholesterol in mg/dl, fasting blood sugar > 120 mg/dl, resting electrocardiographic results (values 0, 1, 2), maximum heart rate achieved, exercise induced angina, oldpeak (ST depression induced by exercise relative to rest), the slope of the peak exercise ST segment, number of major vessels (0-3) colored by fluoroscopy, and thal: 0 = normal; 1 = fixed defect; 2 = reversible defect. The names and social security numbers of the patients were recently removed from the database and replaced with dummy values.
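As a concrete starting point, the sketch below loads this Kaggle dataset and produces the 80:20 train/test split used throughout the paper; the local file name heart.csv is an assumption based on the Kaggle page, and the split parameters (stratification, random seed) are illustrative rather than taken from the text.

    # Minimal sketch: load the Kaggle heart disease CSV and create an 80:20 split.
    # Assumptions: the file is saved locally as "heart.csv" and uses the usual
    # column names from the Kaggle page (age, sex, cp, ..., target).
    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("heart.csv")        # 1025 rows, 13 features plus "target"

    X = df.drop(columns=["target"])      # the 13 independent attributes
    y = df["target"]                     # 0 = no disease, 1 = disease

    # 80:20 split, stratified so both classes keep their proportions
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, stratify=y, random_state=42
    )
    print(X_train.shape, X_test.shape)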
B. Hybrid Algorithm Theory

A hybrid algorithm is a combination of more than one algorithm for efficiently solving a problem; it is designed to yield better performance than the individual algorithms. It is an algorithm that combines two or more other algorithms that solve the same problem, either choosing one (depending on the data) or switching between them over the course of the algorithm. This is generally done to combine desired features of each, so that the overall algorithm is better than the individual components. "Hybrid algorithm" does not refer to simply combining multiple algorithms to solve a different problem (many algorithms can be considered combinations of simpler pieces), but only to combining algorithms that solve the same problem while differing in other characteristics, notably performance [5].

III. METHODOLOGY

In this Business Analytics project we used three methods to implement the Heart Disease Dataset:

A. Random Forest with Randomize Search
B. XGBoost with Grid Search
C. XGBoost with Randomize Search
A. Random Forest with Randomize Search

Random Forest is an example of an ensemble method involving a collection of decision trees. In the Random Forest algorithm, samples are drawn randomly, decision trees are built for each random sample, and the process is repeated. It handles missing values and outliers through data analysis and data pre-processing steps, and it corrects the overfitting of individual trees to their training dataset. This ensemble classifier incorporates several decision trees to get the best result. The decision trees mainly apply bootstrap aggregating, or bagging [6].

Random Search selects only a few points from the whole search space at random. This process is very useful when we have several hyperparameters to be optimized, although the points chosen may not be the best points. The optimization process produces the combination of hyperparameter values with the best accuracy value. At this stage, Random Search takes the best combination based on the accuracy validation value obtained from the searching process [7].
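A minimal sketch of this first hybrid, using scikit-learn's RandomizedSearchCV around a RandomForestClassifier and the 80:20 split from the dataset sketch in Section II, is shown below; the hyperparameter ranges and the number of sampled combinations are illustrative assumptions, since the paper does not list its actual search space.

    # Sketch: Random Forest + Randomized Search (hyperparameter ranges are
    # assumed, not taken from the paper).
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    param_distributions = {
        "n_estimators": [100, 200, 400, 800],
        "max_depth": [None, 4, 8, 16],
        "min_samples_split": [2, 5, 10],
        "max_features": ["sqrt", "log2"],
    }

    # Randomized Search samples only n_iter points from the full grid at random.
    rf_search = RandomizedSearchCV(
        estimator=RandomForestClassifier(random_state=42),
        param_distributions=param_distributions,
        n_iter=20,
        scoring="accuracy",
        cv=5,
        random_state=42,
    )
    rf_search.fit(X_train, y_train)

    print("Best parameters:", rf_search.best_params_)
    print("Validation accuracy:", rf_search.best_score_)
    print("Test accuracy:", rf_search.score(X_test, y_test))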
Bagging itself relies on bootstrap sampling. For example, consider drawing a random sample from r samples: through 1/r and (1 - 1/r) we may calculate the likelihood of a given sample being captured, or not, in each draw. If random sampling with replacement is carried out r times (one draw per original sample), the chance of a given sample never being chosen is (1 - 1/r)^r, which converges to 1/e ≈ 0.368 as r grows. A mirror (bootstrap) sample is therefore available, and roughly one third (about 1/3) of the instances stray out of it into new samples. Out-of-Bag (OOB) instances are the data overlooked during this extraction, and the related inaccuracy is known as the OOB error; it can be expressed in terms of X_OOB, the value of the testing error, N, the number of data, and N_B, the number of OOB instances acknowledged as a class of each data [8].

The Gini Index is used to create a decision tree that elaborates the model's impurity level using the CART algorithm; a lower Gini index value indicates less impurity. For a classification problem with N classes, where the probability of the n-th category is p_n, the Gini index is given by [8]:

    Gini(p) = Σ_{n=1}^{N} p_n (1 - p_n) = 1 - Σ_{n=1}^{N} p_n^2

Random Forest is a set of tree-structured predictors; each tree is trained on independently sampled random vectors, and all the trees in the forest have the same distribution. When generating the k-th tree, the random vector θ_k is independent, so that this random vector generates the forest T together with other random vectors of the same distribution, and its prediction is defined as the average over the trees:

    f(X) = (1/n) Σ_{k=1}^{n} T(X, θ_k)

where f(X) is the target vector, n is the total number of spanning trees, and X is the input vector. The predicted value is the weighted sum of the regression results obtained by each tree [9].
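The 0.368 figure quoted above is easy to check numerically, and scikit-learn exposes the corresponding Out-of-Bag estimate directly; the short sketch below does both, reusing the split from Section II (the forest settings are illustrative, not the paper's configuration).

    # Sketch: check that (1 - 1/r)^r approaches 1/e ≈ 0.368, then read the OOB
    # accuracy of a bagged Random Forest (settings are assumed, not the paper's).
    import math
    from sklearn.ensemble import RandomForestClassifier

    for r in (10, 100, 1000, 1025, 100000):
        print(r, (1 - 1 / r) ** r)        # fraction of samples left out of the bag
    print("1/e =", 1 / math.e)            # ≈ 0.3679

    # With bootstrap sampling, about 36.8% of rows are left out of each tree's
    # bag; those Out-of-Bag rows give a built-in validation estimate.
    rf = RandomForestClassifier(n_estimators=200, bootstrap=True, oob_score=True,
                                random_state=42)
    rf.fit(X_train, y_train)
    print("OOB accuracy estimate:", rf.oob_score_)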
B. XGBoost with Grid Search

The model uses an integrated algorithm based on gradient boosting decision tree optimization, which can meet both the performance and the construction requirements; the model has excellent interpretability and can record the importance of characteristic indices through tree nodes. XGBoost introduces regularization in its algorithm principle, which gives it unique advantages in dealing with overfitting.

The core idea of XGBoost in dealing with regression problems is to use greedy methods to learn each base tree. By constantly forming new decision trees to fit the residuals of the previous prediction, the residual error between the predicted value and the real value is continuously reduced, so as to improve the prediction accuracy. The optimization objective can be defined by the following mathematical formula [9]:

    Obj = Σ_{i=1}^{n} l(y_i, ŷ_i) + Σ_{k=1}^{K} Ω(f_k)

where l is the loss between the real value y_i and the predicted value ŷ_i, f_k is the k-th tree, and Ω(f_k) is the regularization term that penalizes the complexity of each tree.
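The residual-fitting idea can be illustrated without the XGBoost library itself. The sketch below implements plain gradient boosting with squared loss, adding small regression trees one at a time, each fitted to the residuals of the current prediction; it illustrates the principle only, not XGBoost's exact regularized objective.

    # Sketch: gradient boosting by repeatedly fitting trees to residuals
    # (illustrates the idea; XGBoost adds regularization and second-order terms).
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def boosted_predict(X, trees, learning_rate, base):
        """Sum the base prediction and the scaled contribution of every tree."""
        pred = np.full(X.shape[0], base)
        for tree in trees:
            pred += learning_rate * tree.predict(X)
        return pred

    def fit_boosting(X, y, n_trees=50, learning_rate=0.1, max_depth=3):
        base = y.mean()                    # initial constant prediction
        trees = []
        for _ in range(n_trees):
            residuals = y - boosted_predict(X, trees, learning_rate, base)
            tree = DecisionTreeRegressor(max_depth=max_depth)
            tree.fit(X, residuals)         # each new tree fits the current residuals
            trees.append(tree)
        return trees, base

    # Example usage with the earlier split (0/1 labels treated as regression targets):
    # trees, base = fit_boosting(X_train.values, y_train.values)
    # test_pred = boosted_predict(X_test.values, trees, 0.1, base) > 0.5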

The grid search method is an exhaustive search method for determining given parameter values [10]. Its process involves arranging and combining the possible values of each parameter, listing all possible combinations first to generate a grid, and then applying each combination to the training of the machine learning model. After the fitting function has tried all the parameter combinations, it returns an appropriate classifier, automatically adjusted to the best parameter combination for the machine learning model [11].

The grid search algorithm is used to determine the optimal values of the max_depth and eta parameters of the XGBoost algorithm, where max_depth is used to prevent the model from falling into overfitting, and eta is used to control the learning rate of the model and improve its adaptability. Sensitivity, specificity, precision, and area under the curve (AUC) were used to evaluate the predictive performance of the model on the test set [12].
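A minimal sketch of this second hybrid pairs xgboost's XGBClassifier with scikit-learn's GridSearchCV over max_depth and learning_rate (the Python API's name for eta), then reports the evaluation metrics named above on the 20% test split from Section II; the grid values are illustrative assumptions, not the paper's actual configuration.

    # Sketch: XGBoost + Grid Search over max_depth and eta/learning_rate
    # (grid values are assumed; the paper does not list its search space).
    from sklearn.metrics import confusion_matrix, precision_score, roc_auc_score
    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBClassifier

    param_grid = {
        "max_depth": [3, 4, 6, 8],           # limits tree depth to curb overfitting
        "learning_rate": [0.01, 0.1, 0.3],   # eta: shrinks each tree's contribution
    }

    grid = GridSearchCV(
        estimator=XGBClassifier(n_estimators=200, eval_metric="logloss"),
        param_grid=param_grid,
        scoring="accuracy",
        cv=5,
    )
    grid.fit(X_train, y_train)
    print("Best parameters:", grid.best_params_)

    # Evaluate on the held-out 20% test split.
    y_pred = grid.predict(X_test)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    print("Sensitivity:", tp / (tp + fn))
    print("Specificity:", tn / (tn + fp))
    print("Precision:  ", precision_score(y_test, y_pred))
    print("AUC:        ", roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1]))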

C. XGBoost with Randomize Search

The gradient boosting decision tree algorithm was proposed by Friedman in 2001 [13]. Its principle is to use the gradient descent method to generate new trees based on all previous trees, making the objective function as small as possible. XGBoost, the extreme gradient boosting decision tree, is a tree ensemble model that can be used for classification and regression. When XGBoost is applied to regression problems, new regression trees are added continuously, and the residuals of the previous model are fitted through the newly generated CART tree. K marks the number of trees in the completely trained model, and the sum of the results corresponding to each tree is used as the final predicted value [14].
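For completeness, the third hybrid can be sketched exactly like the first, with XGBClassifier as the estimator inside RandomizedSearchCV and the same 80:20 split; as before, the sampled hyperparameter ranges are assumptions rather than the paper's settings.

    # Sketch: XGBoost + Randomized Search (parameter ranges are assumed).
    from sklearn.model_selection import RandomizedSearchCV
    from xgboost import XGBClassifier

    param_distributions = {
        "n_estimators": [100, 200, 400],
        "max_depth": [3, 4, 6, 8],
        "learning_rate": [0.01, 0.05, 0.1, 0.3],
        "subsample": [0.6, 0.8, 1.0],
    }

    xgb_search = RandomizedSearchCV(
        estimator=XGBClassifier(eval_metric="logloss"),
        param_distributions=param_distributions,
        n_iter=20,                 # sample 20 random combinations, not the full grid
        scoring="accuracy",
        cv=5,
        random_state=42,
    )
    xgb_search.fit(X_train, y_train)
    print("Best parameters:", xgb_search.best_params_)
    print("Test accuracy:  ", xgb_search.score(X_test, y_test))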

IV. RESULT ANALYSIS

A. XGBoost with Randomize Search

With the 80:20 split, the XGBoost + Randomize Search model reached 89% accuracy (Table III).

B. Random Forest with Randomize Search

With the 80:20 split, the Random Forest + Randomize Search model reached 91% accuracy (Table III).

C. XGBoost with Grid Search

With the 80:20 split, the XGBoost + Grid Search model reached the highest accuracy of 100% (Table III).

TABLE I. ACCURACY RESULTS BASED ON JOURNAL (Heart Dataset, Kaggle)

    Classifier    Accuracy
    MLDS          92.72%
    XGB           73.52%
    LR            70.7%
    SVM           72.46%
    KNN           69.22%
    DT            63.14%
TABLE II. ACCURACY RESULTS BASED ON MIDTERM REPORT WITH DIFFERENT METHODS (Heart Dataset, Kaggle)

    Classifier                         Accuracy
    Random Forest                      98.59%
    Artificial Neural Network (ANN)    95.12%
    Naive Bayes (NB)                   81.46%

TABLE III. ACCURACY RESULTS WITH HYBRID ALGORITHMS (Heart Dataset, Kaggle)

    Classifier                          Accuracy
    XGBoost + Grid Search               100%
    Random Forest + Randomize Search    91%
    XGBoost + Randomize Search          89%

We analyzed and compared the accuracy results from the methods used in the journal, from our previous midterm methods, and from the methods with hybrid algorithms. The accuracy results reported in the journal paper range from 63.14% to a peak of 92.72%. The accuracy results obtained from our previous midterm methods range from 81.46% to a high of 98.59%. The accuracy results we obtained with the hybrid algorithms range from 89% to 100%, using XGBoost + Grid Search, Random Forest + Randomize Search, and XGBoost + Randomize Search. After several trials, we obtained the highest accuracy of 100% using the XGBoost + Grid Search method, whereas the best accuracy in the journal is 92.72% using the MLDS method and, based on our previous midterm report, the highest accuracy we obtained is 98.59% using the Random Forest method. We can therefore conclude that the accuracy achieved with the hybrid algorithms is higher than in the prior experiments.

REFERENCES

[1] Uddin, M.N. and Halder, R.K., 2021. An ensemble method based multilayer dynamic system to predict cardiovascular disease using machine learning approach. Informatics in Medicine Unlocked, 24, p.100584.
[2] Ray, S., 2019, February. A quick review of machine learning algorithms. In 2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon) (pp. 35-39). IEEE.
[3] Ahmed, M.R., Mahmud, S.H., Hossin, M.A., Jahan, H. and Noori, S.R.H., 2018, December. A cloud based four-tier architecture for early detection of heart disease with machine learning algorithms. In 2018 IEEE 4th International Conference on Computer and Communications (ICCC) (pp. 1951-1955). IEEE.
[4] Sheeba, P.T., Roy, D. and Syed, M.H., 2022. A metaheuristic-enabled training system for ensemble classification technique for heart disease prediction. Advances in Engineering Software, 174, p.103297.
[5] Bhattacharyya, S. and Dutta, P., 2012. Handbook of Research on Computational Intelligence for Engineering, Science, and Business. IGI Global.
[6] Javeed, A., Zhou, S., Yongjian, L., Qasim, I., Noor, A. and Nour, R., 2019. An intelligent learning system based on random search algorithm and optimized random forest model for improved heart disease detection. IEEE Access, 7, pp.180235-180243. https://doi.org/10.1109/ACCESS.2019.2952107.
[7] Fordana, M.D.Y. and Rochmawati, N., 2022. Optimisasi Hyperparameter CNN Menggunakan Random Search Untuk Deteksi COVID-19 Dari Citra X-Ray Dada. Journal of Informatics and Computer Science (JINACS), 4(01), pp.10-18.
[8] Javeed, A., Zhou, S., Yongjian, L., Qasim, I., Noor, A. and Nour, R., 2019. An intelligent learning system based on random search algorithm and optimized random forest model for improved heart disease detection. IEEE Access, 7, pp.180235-180243.
[9] Yan, H., He, Z., Gao, C., Xie, M., Sheng, H. and Chen, H., 2022. Investment estimation of prefabricated concrete buildings based on XGBoost machine learning algorithm. Advanced Engineering Informatics, 54, p.101789.
[10] Hsu, C., Chang, C. and Lin, C., 2003. A Practical Guide to Support Vector Classification. Technical report, pp. 1-16.
[11] Pan, S., Zheng, Z., Guo, Z. and Luo, H., 2022. An optimized XGBoost method for predicting reservoir porosity using petrophysical logs. Journal of Petroleum Science and Engineering, 208, p.109520.
[12] Gong, J., Zhong, X., Tan, J., Liu, Y., Rao, Q., Xiang, T. and Wang, H., 2020. "Grid search + XGBoost" algorithm to establish a predictive model for children with septic shock. Medical Journal of the People's Liberation Army, pp.1-7.
[13] Friedman, J.H., 2001. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pp.1189-1232.
[14] Nguyen, H., Bui, X.N., Bui, H.B. and Cuong, D.T., 2019. Developing an XGBoost model to predict blast-induced peak particle velocity in an open-pit mine: a case study. Acta Geophysica, 67(2), pp.477-490.
[15] Heart Disease Dataset, Kaggle: https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset, accessed 2022-10-10.