
SN Computer Science (2023) 4:377

https://doi.org/10.1007/s42979-023-01825-x

ORIGINAL RESEARCH

BCPUML: Breast Cancer Prediction Using Machine Learning Approach—A Performance Analysis
Rahul Karmakar1 · Subhranil Chatterjee1 · Akhil Kumar Das2 · Ardhendu Mandal1,3

Received: 8 May 2022 / Accepted: 9 April 2023


© The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd 2023

Abstract
Breast cancer is taking the lives of women globally. It is one of the most common malignancies in women, as well as the leading cause of cancer-related death. Even though there is no certain cure for breast cancer, early detection and diagnosis are critical in determining survival chances. Machine learning for medical diagnosis, and its accuracy, is a key advancement and a predicted trend in future medical practice. Our goal is to use machine learning algorithms to diagnose breast cancer and to perform a performance analysis. Under the supervised machine learning approach, different classifiers are applied to the data sets to predict outcomes. We have used five such classifiers, K-Nearest Neighbors (KNN), Random Forest (RF), Decision Trees (DT), Logistic Regression (LR), and Support Vector Machines (SVM), on the Breast Cancer WISCONSIN (Diagnostic) data set to observe how accurately they predict the cancerous instances. The classifiers are treated with cross-validation approaches to obtain reliable accuracies. The same is done with different partitions of the data set, and a performance analysis is made based on every observation. Significance statement: we predict breast cancer using several machine learning models and present a performance analysis.

Keywords Breast cancer · Machine learning · Classifiers · Random forest · Cross-validation

* Rahul Karmakar
rkarmakar@cs.buruniv.ac.in

Subhranil Chatterjee
nilsubhrachat99@gmail.com

Akhil Kumar Das
dasakhi@gmail.com

Ardhendu Mandal
am.csa.nbu@nbu.ac.in

1 Department of Computer Science, The University of Burdwan, Golapbag, Burdwan, West Bengal 713104, India
2 Computer Science, Gour Mahavidyalaya, Mangalbari, Malda, West Bengal, India
3 Department of Computer Science and Application, The University of North Bengal, Siliguri, India

Introduction

When cancer cells strike, their growth goes astray, causing them to proliferate out of control. Damage to the DNA causes cells to become cancerous and form a lump. A lump can be benign (non-cancerous) or malignant (cancerous). Breast cancer is a kind of cancer that starts in the breast cells. A malignant tumor can expand (metastasize) to the surrounding tissues or to distant parts of the body. Men seldom develop breast cancer; the majority of patients are women. The size and kind of tumor determine the stage of breast cancer. Another determining factor is the extent to which the tumor cells have invaded the breast tissues. There are five tumor stages, where Stage 0 indicates non-invasive tumors and Stage 4 refers to invasive tumors. Stage 0 is the non-invasive stage: there is no evidence of cancer cells or non-cancerous aberrant cells breaking out of the part of the breast where they began, or penetrating through to or infecting nearby normal tissue. Stage 1 is invasive breast carcinoma and has two sub-stages, 1A and 1B. Stage 1A represents a tumor up to 2 cm in diameter that does not involve the lymph nodes. Stage 1B, on the other hand, denotes a small collection of cancer cells larger than 0.2 mm detected within a lymph node. There are two groupings in Stage 2: 2A and 2B. In Stage 2A, the tumor is inside the lymph nodes, or it could be in the sentinel lymph nodes, yet no cancer is detected inside the breast; the tumor can be 2 cm or more in diameter but less than 5 cm. Stage 2B, on the other hand, depicts a tumor that is larger than 5 cm but does not reach the lymph nodes.


These three stages are the early stages of breast cancer. In Stage 3, the tumor size exceeds 5 cm, and it may spread over nine lymph nodes. Stage 4 is the worst form, where cancer spreads to neighboring organs of the body.

There is a higher chance of saving a woman diagnosed earlier with breast cancer. Radiation treatment, chemotherapy, hormone therapy, HER2-targeted therapy, and other pharmacological therapies are used to treat early breast cancer [1–4].

A lot of data sets with features are taken from health-care centers and are trained with machine learning models to predict diseases. Machine learning is an arm of artificial intelligence that uses a range of statistical, optimization, and probabilistic approaches to allow computers to learn from previous examples and detect difficult-to-find patterns in big, noisy, or complicated data sets. This strategy is especially intriguing because of the rising trend toward personalized, predictive treatment. The current population growth leads to a critical issue regarding early diagnosis: the danger of death from breast cancer increases as the world's population grows. Breast cancer is the second-most severe of the known cancers. An automatic illness detection system assists medical personnel in disease diagnosis and provides a reliable, effective, and quick response while also lowering the danger of death [5]. Under the supervised machine learning approach, classification algorithms help to predict breast cancer. In this method, labeled data sets are used to predict outcomes or classify data. The model is fed the labeled data and adjusts its weights. Cross-validation is done after that, and it ensures that the model avoids under-fitting or over-fitting [6]. A classification algorithm determines the category of new observations from the training data: it learns from a data set of labeled outcomes and then assigns new observations to one of several classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, Football or Golf, etc. Targets, labels, or categories are used to describe the classes [7]. There is much work in this field using different classifiers, ensemble learning, data mining tools, and artificial neural networks [8–10].

Our work focused on classification algorithms. We have used three partitions of the WISCONSIN (Diagnostic) Breast Cancer Dataset. We aimed to find the best classifier for predicting breast cancer. Cross-validation approaches were used to obtain reliable results. We performed a performance analysis to determine how the classifiers, undergoing cross-validation, perform for different partitions of the data set. The flow of our work is pictured in Fig. 1.

Fig. 1  Flow of work

Prior works regarding breast cancer prediction only focused on using a single partition of the data set for training and testing. Previous papers showed performance analysis of classifiers based on a specific split of data (training and testing). Well-detailed performance analysis of classifiers applied on different partitions of the same data set undergoing cross-validation techniques was missing in related works. We have come across the above as the problem statement of our research.

The key contributions of the paper are as follows:

• A precise literature review of the related works.
• Performance analysis of the classifiers on three different partitions of the data set undergoing cross-validation.
• Finding the best cross-validated classifier for predicting breast cancer.

The rest of the paper is organized as follows: in "Related Works", we have done the literature review. In "Problem Domain, Materials and Methods", we conducted a performance analysis, followed by discussions in "Results and


Discussions". Lastly, we have concluded our work in "Conclusion and Future Work".

Related Works

Dai et al. [11] implemented random forest and decision tree algorithms, discussed the scenario of breast cancer diagnosis, and achieved excellent prediction accuracy using ensemble learning. They got 95% accuracy using the ensemble learning approach. The data set used in this paper is the University of California, Irvine Breast Cancer Wisconsin (Diagnostic) data set.

Shubham Sharma et al. [12] implemented K-NN, RF, and Naive Bayes to detect breast cancer. They used the Wisconsin Diagnosis Breast Cancer Dataset and compared the performance of the different classifiers. They concluded that a woman chosen at random had a 12 percent risk of being diagnosed with breast cancer. KNN's accuracy came highest at 95.90%, with one observation misclassified as Malignant.

In another interesting work [13], the authors implemented six machine learning models (algorithms) on the Wisconsin breast cancer data set (WDBC). The DL_ANN topped the classifiers with 98.24%.

Random Forest and Extreme Gradient Boosting (commonly known as XGBoost) were implemented by Sajib Kabiraj et al. [14] to predict breast cancer. These are two ensemble machine learning methods. They used a data set from the UCI Machine Learning Repository, divided into 67% for training and 33% for testing. In conclusion, they reported an accuracy of 74.73% for Random Forest and 73.63% for XGBoost.

In [15], the authors used machine learning techniques to predict breast cancer on the public data set from the University of Wisconsin Hospitals Madison Breast Cancer Database. SVM had the highest accuracy, 97.9%, and a smaller error than the other classifiers used on this data set.

Kaya Keles [16] aimed to predict and detect breast cancer when the size of the tumor is generally small. The data set used in this study was from the studies conducted by Avsar Aydin. The author used non-invasive and untroublesome methods, comparing data mining algorithms (classifiers) using the Weka tool. The highest result obtained was Random Forest (TREES) with 92.2% accuracy.

Murugan et al. [17] used supervised machine learning models (classifiers) like Decision Tree, Random Forest, and Linear Regression to predict breast cancer. They aimed to predict whether the type of cancer is benign or malignant. The repository of data used here is the "UCI Machine Learning Repository (Wisconsin Breast Cancer) Data set". The resultant diagnosis is grouped as B-Benign and M-Malignant. The highest result came from the linear regression algorithm with 84.15% accuracy.

Chelvian Aroef et al. [18] used the UCI machine learning repository dataset. They implemented machine learning classifiers (algorithms) like Random Forest and Support Vector Machines. The data was segregated into 80% for training and 20% for testing. SVM worked better than Random Forest, with an accuracy of 95.45%.

In another work [19], the authors aimed to classify breast cancer types. The breast cancer lesions obtained from FNA (fine needle aspiration) were used, and the work was done with the help of a random forest classifier. The data set used here was the Fine Needle Aspiration Biopsy Data, which has 458 benign cases and 241 malignant cases, obtained from the UCI Machine Learning Repository. After fitting the essential features and training the RF classifier, the results were based on 250 testing cases, where sensitivity came to 75%, specificity to 70%, and accuracy to 72%. In conclusion, it can be said that RF accurately classified breast cancer tumors; however, further studies are required to improve accuracy.

Subrato Bharati et al. [20] used machine learning algorithms like Naïve Bayes, Random Forest, and the KNN classifier, with Kappa statistics to calculate precision and recall. The data was collected from the UCI Machine Learning Repository. In conclusion, KNN gave the highest result, correctly classifying 97.9021% of cases. Multilayer perceptron was second with 96.5035%. KNN even got a 99.9% ROC area.

In another work [21], the authors proposed a rule-based machine learning classification algorithm to predict different types of breast cancer. They implemented classifiers like NB, Trees Random Forest (TRF), 1-Nearest Neighbour, SVM, AdaBoost (AD), Multilayer Perceptron (MLP), and RBF Network (a radial basis function network, a kind of artificial neural network). They compared all the models, where NB had the highest ROC area of 0.94 and TRF had the highest classification accuracy of 0.96. The data set used here was taken from the Cancer Registry Organization of Kerman Province, in Iran.

Octaviani et al. [22] used the Wisconsin Breast Cancer Database (WBCD) data from the UCI Repository. Their work aimed to predict breast cancer using one of the best machine learning algorithms. They got an accuracy of 100%. A classifier showing 100% accuracy is a bit problematic; an algorithm can't be 100% accurate. Data pre-processing and cross-validation techniques could have solved this issue.

In another work [23], the authors implemented five machine learning models, DT, RF, SVM, NN, and LR, to predict breast cancer. They used two data sets: the Breast Cancer Coimbra Dataset (BCCD) and the Wisconsin Breast Cancer Database (WBCD). In conclusion, the values of RF in both data sets were higher than those of the other classifiers; the RF scores were 74.3% and 93.7%. RF can also be used with other data


mining technologies to improve its accuracy. Table 1 depicts the accuracies obtained by previous works and their limitations.

Mohammad Monirunjjaman et al. [24] used the WISCONSIN breast cancer data set to predict the cancer probability of a patient. A correlation matrix was computed for selecting the desired features. Classifiers like Random Forest, K-Nearest Neighbors, Decision Tree, and Logistic Regression were used in their work. Out of them, Logistic Regression secured the highest accuracy of 98.6%.

Another interesting work [25] is one where the authors implemented farthest-first clustering with Naive Bayes and an artificial neural network model to predict breast cancer on the WISCONSIN Breast Cancer data set. They used five partitions of the data set; of them, the 60–40 data split gave the highest accuracy of 97.8%, and the balanced average is 98%.

We addressed all these problems in our proposed work and achieved better breast cancer prediction accuracy. Works from the year 2013 onward have been collected and reviewed. Their methodologies, results, and limitations are stated briefly in Table 1. For papers with no such limitations, the corresponding cell is kept blank.

Research Gap

Our work focused primarily on improving the accuracy of classifiers. The majority of existing works analyzed the classifiers applied to a specific data partition. The results lacked the cross-validation method, which can change the outcomes drastically. Performance analysis of the classifiers implemented on different data partitions was missing.

Problem Domain, Materials and Methods

Prior works regarding breast cancer prediction using machine learning approaches only focused on using a specific partition of the data set for training and testing. They did the performance analysis of classifiers based on one particular split of data (training and testing). Detailed performance analysis considering different partitions of the same data set and cross-validation techniques was missing in related works.

Dataset Collection and Preprocessing

We have taken the Breast Cancer WISCONSIN (Diagnostic) data set for this work. The source of the data set is Kaggle [26]. The data set has ten dimensions, with 569 rows and 33 columns. Patients are classified as B-Benign (non-cancerous) or M-Malignant (cancerous) under the 'diagnosis' class. There are 569 instances, out of which 212 are malignant. Further work has been done with the 212 malignant cases.

The data were processed by dropping the patient id and the 'Unnamed: 32' column. Since the data set has several features, we plotted a correlation matrix to remove the highly correlated features. Figure 2 depicts the correlation matrix. None of the attributes has null values. After removing the highly correlated features, the data frame was reduced to 23 columns. The input variable, the X data frame, was created by dropping the 'diagnosis' attribute, and the output variable, the Y frame, has only the 'diagnosis' attribute. The data set was split into three parts: first into 70% for training and 30% for testing, then into 80% for training and 20% for testing, and then into 90% for training and 10% for testing. We used the standard scaler to scale the data.

Methodology

The five classifiers we employed, KNN (K-Nearest Neighbors), DT (Decision Trees), RF (Random Forest), LR (Logistic Regression), and SVM (Support Vector Machines), are trained on the three partitions of the data as shown in the workflow diagram in Fig. 1. For every data split (70–30, 80–20, and 90–10), model accuracies with the confusion matrix, precision, recall, and F1-score are recorded in a table format. The model accuracies are subjected to tenfold, threefold, fourfold, and fivefold cross-validation methodologies. The best cross-validation score and the mean cross-validation scores for every fold are noted.

Observations

For the 70–30 split data, Logistic Regression got the highest accuracy of 95.9%. For Logistic Regression, seven instances are misclassified: two Benign cases as Malignant and five Malignant cases as Benign. The precision, recall, F1-score, and support scores are shown in Table 2.

For the 80–20 split data, Logistic Regression got the highest accuracy of 96.4%. For Logistic Regression, four instances are misclassified, out of which a Benign case is misclassified as Malignant and three Malignant cases are misclassified as Benign. The precision, recall, F1-score, and support scores are shown in Table 3.

For the 90–10 split data, Logistic Regression got the highest accuracy of 100%. This is a fault, perhaps, so we will consider the random forest with 98.2% accuracy. For random forest, two instances are misclassified, out of which a Benign case is misclassified as Malignant and a Malignant case is misclassified as Benign. The precision, recall, F1-score, and support scores are shown in Table 4.

The K-fold cross-validation techniques are applied, and the outcomes are shown in Table 5. The highest accuracy score for both the 70–30 and 80–20 split data is secured by Logistic

Table 1  Accuracies obtained by previous works and their limitations

| Author and year | Classifier used | Accuracy | Limitations |
|---|---|---|---|
| Mohammad Monirunjjaman et al. (2022) [24] | Random forest, K-nearest neighbors, decision tree, and logistic regression | Logistic regression: 98.6% | |
| Chinniyan and Subramani (2021) [25] | Farthest first clustering with Naïve Bayes and artificial neural network | Farthest first clustering with artificial neural network: 98% | |
| Gupta and Garg (2020) [13] | Logistic regression, decision tree, KNN, random forest, support vector machine with radial basis function kernel, root mean square propagation, and deep learning using Adam gradient descent learning | Adam gradient descent learning: 98.24% | KNN and logistic regression were found to be lowest, which generally doesn't occur |
| Kabiraj et al. (2020) [14] | Extreme gradient boosting and random forest | Random forest: 74.73% | Preprocessing of data was not done efficiently. Fewer features and samples were used. Better methods to train the data were expected |
| Aroef et al. (2020) [18] | Support vector machines and random forest | SVM: 95.45% | With 10% of training data, the worst accuracy of 72.81% was obtained |
| Kaya Keles (2019) [16] | Bagging, IBk, random committee, simple CART, and random forest | Random forest (TREES): 92.2% | The accuracies could have increased with proper preprocessing and feature selection |
| Montazeri et al. (2019) [21] | Naive Bayes, trees random forest, 1-nearest neighbour, AdaBoost, support vector machine, multilayer perceptron, and RBF network | Random forest: 96% | Cross-validation was unnecessary for TRF. It was mentioned the data didn't need to be rescaled or modified |
| Octaviani et al. (2019) [22] | Random forest | 100% | The data were not preprocessed or scaled. All of the features were used. A single classifier was used to claim that it is the best for breast cancer prediction |
| Dai et al. (2018) [11] | Random forest, decision tree, ensemble learning | 95% | Random forest was used to get high accuracy. The scaling of data was not mentioned. The correlated features were not removed |
| Sharma et al. (2018) [12] | Random forest, K-nearest neighbours, and naive Bayes | KNN: 95.90% | Correlated features were not removed. Scaling of data was not done |
| Khourdifi and Bahaj (2018) [15] | Random forest, naive Bayes, K-nearest neighbours, and support vector machines | SVM: 97.9% | |
| Bharati et al. (2018) [20] | Naive Bayes, random forest, Kappa statistics, and KNN | KNN: 97.9021% | |
| Li (2018) [23] | Decision tree (DT), random forest (RF), logistic regression (LR), support vector machine (SVM), and neural network (NN) | DT: 96.1% | |
| Murugan et al. (2017) [17] | Random forest, decision tree, and linear regression | 88.14% | Data were not well preprocessed. Further preprocessing was required, as mentioned in the paper |
| Ahmad and Yusoff (2013) [19] | Random forest classifier | 72% | Data were not well processed |


Fig. 2  Correlation matrix
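The feature-filtering step described in "Dataset Collection and Preprocessing" (dropping the id and 'Unnamed: 32' columns, then removing highly correlated features via the correlation matrix) can be sketched as below. This is a minimal illustration, not the authors' code: the 0.9 correlation cutoff is an assumption (the paper does not state its threshold), and scikit-learn's built-in copy of the WDBC data stands in for the Kaggle CSV, whose 'id' and 'Unnamed: 32' columns are simulated here.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Build a DataFrame resembling the Kaggle WDBC CSV; scikit-learn ships the
# same 569 x 30 feature matrix. 'id' and 'Unnamed: 32' are added only to
# mirror the columns the paper drops.
wdbc = load_breast_cancer(as_frame=True)
df = wdbc.data.copy()
df["diagnosis"] = np.where(wdbc.target == 0, "M", "B")  # target 0 = malignant
df["id"] = range(len(df))
df["Unnamed: 32"] = np.nan

# Step 1: drop the identifier and the empty trailing column.
df = df.drop(columns=["id", "Unnamed: 32"])

# Step 2: remove highly correlated features. Scan the upper triangle of the
# absolute correlation matrix and drop one feature from each offending pair.
corr = df.drop(columns=["diagnosis"]).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df = df.drop(columns=to_drop)

# Step 3: split into input X and output Y, as in the paper.
X = df.drop(columns=["diagnosis"])
Y = df["diagnosis"]
print(f"Kept {X.shape[1]} features for {len(X)} instances")
```

The exact number of surviving columns depends on the chosen threshold, so this sketch will not necessarily reproduce the paper's 23-column frame.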

Table 2  Accuracy score, confusion matrix, and classification report for 70–30 split data

| Models | Accuracy (training / testing) | Confusion matrix | Precision | Recall | F1-score | Support |
|---|---|---|---|---|---|---|
| Random forest | 0.962311 / 0.935672 | [[107 1] [5 58]] | 0.96 / 0.98 | 0.99 / 0.92 | 0.97 / 0.95 | 108 / 63 |
| Logistic regression | 0.989949 / 0.959064 | [[106 2] [5 58]] | 0.95 / 0.97 | 0.98 / 0.92 | 0.97 / 0.94 | 108 / 63 |
| Decision tree | 1.0 / 0.929824 | [[99 9] [3 60]] | 0.97 / 0.87 | 0.92 / 0.95 | 0.94 / 0.91 | 108 / 63 |
| KNN | 0.962311 / 0.935672 | [[105 3] [8 58]] | 0.93 / 0.95 | 0.95 / 0.87 | 0.95 / 0.91 | 108 / 63 |
| Support vector machine | 0.989949 / 0.929824 | [[99 9] [3 60]] | 0.97 / 0.87 | 0.92 / 0.95 | 0.94 / 0.91 | 108 / 63 |

Per-class values are given as class B / class M.
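The per-split experiments behind Tables 2, 3, and 4 can be reproduced in outline with scikit-learn. The sketch below covers the 70–30 split; default hyperparameters and `random_state=42` are assumptions (the paper reports neither), so the exact numbers will differ from the tables.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

# Scale with the standard scaler, fitted on the training data only.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "KNN": KNeighborsClassifier(),
    "Decision tree": DecisionTreeClassifier(random_state=42),
    "Random forest": RandomForestClassifier(random_state=42),
    "Logistic regression": LogisticRegression(max_iter=5000),
    "SVM": SVC(),
}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = accuracy_score(y_test, pred)
    print(name, results[name])
    print(confusion_matrix(y_test, pred))
    print(classification_report(y_test, pred))
```

Repeating the loop with `test_size=0.20` and `test_size=0.10` yields the 80–20 and 90–10 counterparts.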

SN Computer Science
SN Computer Science (2023) 4:377 Page 7 of 10 377

Table 3  Accuracy score, confusion matrix, and classification report for 80–20 split data

| Models | Accuracy (training / testing) | Confusion matrix | Precision | Recall | F1-score | Support |
|---|---|---|---|---|---|---|
| Random forest | 0.967032 / 0.956140 | [[67 0] [3 44]] | 0.96 / 1.0 | 1.00 / 0.94 | 0.98 / 0.97 | 67 / 47 |
| Logistic regression | 0.989010 / 0.964912 | [[66 1] [3 44]] | 0.96 / 0.98 | 0.99 / 0.94 | 0.97 / 0.96 | 67 / 47 |
| Decision tree | 1.0 / 0.938596 | [[61 6] [1 46]] | 0.98 / 0.88 | 0.91 / 0.98 | 0.95 / 0.93 | 67 / 47 |
| KNN | 1.0 / 0.938596 | [[61 6] [1 46]] | 0.94 / 0.98 | 0.99 / 0.91 | 0.96 / 0.95 | 67 / 47 |
| Support vector machine | 0.991208 / 0.938596 | [[61 6] [1 46]] | 0.98 / 0.88 | 0.91 / 0.98 | 0.95 / 0.93 | 67 / 47 |

Per-class values are given as class B / class M.
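As a check on how the classification-report columns relate to the confusion matrix, the logistic regression row of Table 3 can be recomputed by hand from its matrix [[66 1] [3 44]]:

```python
# Recompute precision/recall/F1 from a 2x2 confusion matrix,
# rows = actual class (B, M), columns = predicted class (B, M).
cm = [[66, 1], [3, 44]]  # logistic regression, 80-20 split (Table 3)

def per_class_metrics(cm, cls):
    tp = cm[cls][cls]
    predicted = cm[0][cls] + cm[1][cls]   # column sum: everything predicted cls
    actual = cm[cls][0] + cm[cls][1]      # row sum: everything actually cls
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, actual

accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))
print(f"accuracy = {accuracy:.6f}")  # 0.964912, the 96.4% quoted in the text
for cls, label in [(0, "B"), (1, "M")]:
    p, r, f1, support = per_class_metrics(cm, cls)
    print(f"{label}: precision {p:.2f}, recall {r:.2f}, f1 {f1:.2f}, support {support}")
# B: precision 0.96, recall 0.99, f1 0.97, support 67
# M: precision 0.98, recall 0.94, f1 0.96, support 47
```

The printed values agree with the logistic regression row of Table 3.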

Table 4  Accuracy score, confusion matrix, and classification report for 90–10 split data

| Models | Accuracy (training / testing) | Confusion matrix | Precision | Recall | F1-score | Support |
|---|---|---|---|---|---|---|
| Random forest | 0.962890 / 0.982456 | [[34 1] [1 21]] | 0.97 / 0.95 | 0.97 / 0.95 | 0.97 / 0.95 | 35 / 22 |
| Logistic regression | 0.988281 / 1.0 | [[35 0] [0 22]] | 1.0 / 1.0 | 1.0 / 1.0 | 1.0 / 1.0 | 35 / 22 |
| Decision tree | 1.0 / 0.929825 | [[32 3] [1 21]] | 0.97 / 0.88 | 0.91 / 0.95 | 0.94 / 0.91 | 35 / 22 |
| KNN | 0.962890 / 0.982456 | [[35 0] [1 21]] | 0.97 / 1.0 | 1.0 / 0.95 | 0.99 / 0.98 | 35 / 22 |
| Support vector machine | 0.990234 / 0.929825 | [[32 3] [1 21]] | 0.97 / 0.88 | 0.91 / 0.95 | 0.94 / 0.91 | 35 / 22 |

Per-class values are given as class B / class M.
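The three data partitions themselves can be sketched as one loop over test sizes. The test-set sizes that result (171, 114, and 57 rows for 569 instances) match the summed support columns of Tables 2 (108 + 63), 3 (67 + 47), and 4 (35 + 22). The `random_state` and the absence of stratification are assumptions; the paper specifies neither.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # 569 instances, as in the paper

# One split per experiment: 70-30, 80-20, 90-10.
n_test = []
for test_size in (0.30, 0.20, 0.10):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42)
    scaler = StandardScaler().fit(X_train)   # scale as described in the paper
    X_train_s = scaler.transform(X_train)
    X_test_s = scaler.transform(X_test)
    n_test.append(len(X_test))
    print(f"{1 - test_size:.0%}-{test_size:.0%}: "
          f"{len(X_train)} train / {len(X_test)} test")
# -> test sets of 171, 114, and 57 rows
```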

Table 5  Accuracy scores and cross-validation scores

| Models | 70–30 accuracy | 70–30 best CV score | 70–30 mean CV score | 80–20 accuracy | 80–20 best CV score | 80–20 mean CV score | 90–10 accuracy | 90–10 best CV score | 90–10 mean CV score |
|---|---|---|---|---|---|---|---|---|---|
| Random forest | 0.959064 | 0.968406 | 0.959617 | 0.964912 | 0.964884 | 0.964448 | 0.982456 | 0.964897 | 0.961376 |
| Logistic regression | 0.935673 | 0.945536 | 0.944228 | 0.956140 | 0.945530 | 0.944228 | 1.000000 | 0.945536 | 0.944228 |
| Decision tree | 0.935673 | 0.922683 | 0.925782 | 0.956140 | 0.929704 | 0.925335 | 0.929825 | 0.920944 | 0.926230 |
| KNN | 0.912281 | 0.903354 | 0.902957 | 0.938596 | 0.898079 | 0.902957 | 0.982456 | 0.903354 | 0.902957 |
| Support vector machine | 0.912281 | 0.680345 | 0.704978 | 0.938596 | 0.826047 | 0.722490 | 0.929825 | 0.785592 | 0.761034 |

The best results are indicated in bold.
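The cross-validation columns of Table 5 can be produced with a loop like the one below. It assumes that "best cross-validation score" means the best of the k-fold mean accuracies over k in {3, 4, 5, 10} and "mean cross-validation score" their average (the paper does not define the terms precisely), and uses default Random Forest hyperparameters, so the numbers are illustrative rather than a reproduction.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Pipeline so the scaler is re-fit inside every CV fold (avoids leakage).
model = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=42))

fold_means = []
for k in (3, 4, 5, 10):
    scores = cross_val_score(model, X, y,
                             cv=KFold(n_splits=k, shuffle=True, random_state=42))
    fold_means.append(scores.mean())
    print(f"{k}-fold mean accuracy: {scores.mean():.6f}")

print("best CV score:", max(fold_means))
print("mean CV score:", sum(fold_means) / len(fold_means))
```

The same loop, run per classifier and per training partition, fills out the rest of the table.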

regression. Its cross-validation scores are greater than those of the other classifiers. In the case of the 90–10 split data, we encountered a problem: Logistic Regression got the highest accuracy of 100%, so we considered Random Forest, which has an accuracy of 98.2%.

Graphs plotted for the accuracy comparison, the best cross-validation score, and the mean cross-validation scores are shown in Fig. 3.


Fig. 3  Comparison graph of the best cross-validation score and mean cross-validation scores

Results and Discussions

From the above model accuracy tables, we observed that when the data were divided into 70% training data and 30% testing data, Logistic Regression got the highest accuracy of 95.9%. But after the application of the K-fold cross-validation technique, Random Forest had the best score of 0.968406. The best result of the cross-validation technique came from the fivefold cross-validation method. We used threefold, fourfold, fivefold, and tenfold cross-validation methods to pick the best result. Random Forest's mean cross-validation score was 0.959617. The rest are in Table 5.

When the data were split into 80% training data and 20% testing data, Logistic Regression got the highest accuracy of 96.4%. But after the application of the K-fold cross-validation technique, Random Forest had the best score of 0.964884. The best result of the cross-validation technique came from the threefold cross-validation method. Random Forest's mean cross-validation score was 0.964448. The rest are in Table 5.

When the data were segregated into 90% training data and 10% testing data, Logistic Regression got the highest accuracy of 100%. This is a problem, perhaps. But after the application of the K-fold cross-validation approach, Random Forest had the best score of 0.964897. The best result of the cross-validation technique came from the fivefold cross-validation method. Random Forest's mean cross-validation score was 0.961376. The rest are in Table 5.

In Table 6, we have compared the accuracies obtained by prior works with those of the proposed work. Although several previous works used the random forest well, our proposed work showed better results than those.

Conclusion and Future Work

In conclusion, we can say that, despite differences in accuracy scores, Random Forest gave the best results after the application of the k-fold cross-validation technique. In the 90% training and 10% testing partition of the data, Random Forest gave the highest cross-validation score of nearly 0.9649. We can say that Random Forest is more accurate than the other classifiers used in this work, although Logistic Regression also did well and gave accuracy close to Random Forest's. After observing the implementation of the different classifiers on the three splits of data into 70–30, 80–20, and 90–10, we can say that Random Forest is the best classifier for predicting breast cancer.

We have got a better result in our work, and it can be further improved by using ensemble learning or data mining tools. We are looking forward to improving our outcomes shortly.


Table 6  Proposed work comparison with previous works

| Works | Classifiers used | Previous work accuracies | Proposed work accuracies |
|---|---|---|---|
| Kabiraj et al. [14] | Random forest | 74.73% | 96.48% |
| | Extreme gradient boosting | 73.63% | N/A |
| Chelvian Aroef et al. [18] | Random forest | 90.90% | 96.4% |
| | Support vector machines | 95.45% | 82.60% |
| Kaya Keles [16] | Bagging | 90.90% | N/A |
| | IBk | 90.0% | N/A |
| | Random committee | 90.9% | N/A |
| | Simple CART | 90.1% | N/A |
| | Random forest | 92.2% | 96.48% |
| Dai et al. [11] | Ensemble learning | 95% | N/A |
| Sharma et al. [12] | Random forest | 95% | 96.48% |
| | K-nearest neighbours | 95.90% | 90.33% |
| | Naïve Bayes | 94.47% | N/A |
| Murugan et al. [17] | Random forest | 88.14% | 96.48% |
| | Linear regression | 84.14% | N/A |
| Ahmad and Yusoff [19] | Random forest classifier | 72% | 96.48% |
| Li [23] | Decision tree (DT) | 68.6% / 96.1% | 92.97% |
| | Random forest (RF) | 74.3% / 96.1% | 96.48% |
| | Support vector machine (SVM) | 71.4% / 95.1% | 82.60% |
| | Neural network (NN) | 60.00% / 95.6% | N/A |
| | Logistic regression (LR) | 65.7% / 93.7% | 94.55% |

The best results are indicated in bold. For Li [23], the previous accuracies are given as BCCD / WBCD; the proposed work accuracies are for the WBCD data.

Acknowledgements  We sincerely thank the Department of Computer Science, The University of Burdwan, India, for their assistance to pursue our research work.

Funding  This research is not funded by any agencies or institutes.

Data availability  The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

Declarations

Conflict of Interest  The authors declare that they have no conflict of interest.

References

1. Jemal A, Siegel R, Ward E, Hao Y, Xu J, Thun MJ. Cancer statistics, 2009. CA Cancer J Clin. 2009;59(4):225–49.
2. Akram M, Iqbal M, Daniyal M, Khan AU. Awareness and current knowledge of breast cancer. Biol Res. 2017;50(1):33.
3. Milosevic M, Jankovic D, Milenkovic A, Stojanov D. Early diagnosis and detection of breast cancer. Technol Health Care. 2018;26(4):729–59.
4. Kalli S, Semine A, Cohen S, Naber SP, Makim SS, Bahl M. American Joint Committee on Cancer's staging system for breast cancer, eighth edition: what the radiologist needs to know. Radiographics. 2018;38(7):1921–33.
5. Cruz JA, Wishart DS. Applications of machine learning in cancer prediction and prognosis. Cancer Inf. 2006;2:117693510600200030.
6. Choi RY, Coyner AS, Kalpathy-Cramer J, Chiang MF, Campbell JP. Introduction to machine learning, neural networks, and deep learning. Transl Vis Sci Technol. 2020;9(2):14.
7. Chowdhury S, Schoen MP. Research paper classification using supervised machine learning techniques. In: 2020 Intermountain Engineering, Technology and Computing (IETC). IEEE, October 2020, pp. 1–6.
8. N. S., S. A. Prediction of breast cancer using decision tree and random forest algorithm. Int J Comput Sci Eng. 2018;6(2):226–229.
9. Jaikrishnan SVJ, Chantarakasemchit O, Meesad P. A breakup machine learning approach for breast cancer prediction. In: 2019 11th International Conference on Information Technology and Electrical Engineering (ICITEE). Pattaya, Thailand: IEEE, October 2019, pp. 1–6.
10. Li J, Zhou Z, Dong J, Fu Y, Li Y, Luan Z, Peng Z. Predicting breast cancer 5-year survival using machine learning: a systematic review. PLOS ONE. 2021;16(4):1–23.
11. Dai B, Chen RC, Zhu SZ, Zhang WW. Using random forest algorithm for breast cancer diagnosis. In: International Symposium on Computer, Consumer and Control (IS3C). Taichung, Taiwan: IEEE, December 2018, pp. 449–452.
12. Sharma S, Aggarwal A, Choudhury T. Breast cancer detection using machine learning algorithms. 2018. p. 5.


13. Gupta P, Garg S. Breast cancer prediction using varying parameters of machine learning models. Proc Comput Sci. 2020;171:593–601.
14. Kabiraj S, Raihan M, Alvi N, Afrin M, Akter L, Sohagi SA, Podder E. Breast cancer risk prediction using XGBoost and random forest algorithm. In: 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT). Kharagpur, India: IEEE, July 2020, pp. 1–4.
15. Khourdifi Y, Bahaj M. Applying best machine learning algorithms for breast cancer prediction and classification. In: International Conference on Electronics, Control, Optimization and Computer Science (ICECOCS). Kenitra: IEEE, December 2018, pp. 1–5.
16. Keleş MK. Breast cancer prediction and detection using data mining classification algorithms: a comparative study. Tehnicki vjesnik (Technical Gazette). 2019;26(1).
17. Murugan S, Kumar BM, Amudha S. Classification and prediction of breast cancer using linear regression, decision tree and random forest. In: International Conference on Current Trends in Computer, Electrical, Electronics and Communication (CTCEEC). Mysore: IEEE, September 2017, pp. 763–766.
18. Aroef C, Rivan Y, Rustam Z. Comparing random forest and support vector machines for breast cancer classification. TELKOMNIKA (Telecommun Comput Electron Control). 2020;18(2):815.
19. Ahmad FK, Yusoff N. Classifying breast cancer types based on fine needle aspiration biopsy data using random forest classifier. In: 13th International Conference on Intelligent Systems Design and Applications. Selangor, Malaysia: IEEE, December 2013, pp. 121–125.
20. Bharati S, Rahman MA, Podder P. Breast cancer prediction applying different classification algorithm with comparative analysis using WEKA. In: 4th International Conference on Electrical Engineering and Information & Communication Technology (iCEEiCT). Dhaka, Bangladesh: IEEE, September 2018, pp. 581–584.
21. Montazeri M, Beigzadeh A. Machine learning models in breast cancer survival prediction. 2019. p. 12.
22. Octaviani TL, Rustam Z. Random forest for breast cancer prediction. Depok, Indonesia; 2019. p. 020050.
23. Li Y. Performance evaluation of machine learning methods for breast cancer prediction. Appl Comput Math. 2018;7(4):212.
24. Khan M. Machine learning based comparative analysis for breast cancer prediction. J Healthc Eng. 2022;2022:1–15.
25. Chinniyan K, Subramani R. Breast cancer prediction using machine learning algorithms. Int J Mech Eng. 2021;6(3).
26. Breast Cancer Wisconsin (Diagnostic) Data Set. Kaggle; 2018.

Publisher's Note  Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
