BCPUML Breast Cancer Prediction Using Machine Learning Approach—a Performance Analysis
https://doi.org/10.1007/s42979-023-01825-x
ORIGINAL RESEARCH
Abstract
Breast cancer takes the lives of women globally. It is one of the most common malignancies in women, as well as the
leading cause of cancer-related death. Even though there is no guaranteed cure for breast cancer, early detection and
diagnosis are critical in determining the chances of survival. Machine learning for medical diagnosis, and its accuracy,
is a key advancement and a predicted trend in future medical models. Our goal is to use machine learning algorithms to
diagnose breast cancer and analyze their performance. Under the supervised machine learning approach, different
classifiers are applied to the data set to predict outcomes. We have used five such classifiers, K-Nearest Neighbors (KNN),
Random Forest (RF), Decision Trees (DT), Logistic Regression (LR), and Support Vector Machines (SVM), on the Breast
Cancer Wisconsin (Diagnostic) data set to observe how accurately they predict the cancerous instances. The classifiers
are combined with cross-validation approaches to obtain reliable accuracies. The same is done with different partitions
of the data set, and a performance analysis is made based on every observation. Significance statement: we have performed
breast cancer prediction using several machine learning models and made a performance analysis.
SN Computer Science (2023) 4:377
Discussions”. Lastly, we have concluded our work in “Conclusion and Future Work”.

Related Works

Dai et al. [11] implemented random forest and decision tree algorithms, discussed the scenario of breast cancer diagnosis, and achieved excellent prediction accuracy using ensemble learning, reaching 95%. The data set used in this paper is the University of California, Irvine Breast Cancer Wisconsin (Diagnostic) data set.

Shubham Sharma et al. [12] implemented K-NN, RF, and Naive Bayes to detect breast cancer on the Wisconsin Diagnosis Breast Cancer data set and compared the performance of the different classifiers. They concluded that a woman chosen at random had a 12 percent risk of being diagnosed with breast cancer. CNN's accuracy came highest at 95.90%, with one observation misclassified as malignant.

In another interesting work [13], the authors implemented six machine learning models on the Wisconsin breast cancer data set (WDBC). The DL_ANN topped the classifiers with 98.24%.

Random Forest and Extreme Gradient Boosting (commonly known as XGBoost), two ensemble machine learning methods, were implemented by Sajib Kabiraj et al. [14] to predict breast cancer. They used a data set from the UCI Machine Learning Repository, divided into 67% for training and 33% for testing. They reported an accuracy of 74.73% for Random Forest and 73.63% for XGBoost. In [15], the authors used machine learning techniques to predict breast cancer on the public data set from the University of Wisconsin Hospitals Madison Breast Cancer Database. SVM had the highest accuracy, 97.9%, with a smaller error than the other classifiers on this data set.

Kaya Keles [16] aimed to predict and detect breast cancer when the size of the tumor is generally small, using the data set from the studies conducted by Avsar Aydin. The author compared data mining algorithms (classifiers) built with the Weka tool, a non-invasive and untroublesome method. The highest result obtained was Random Forest (Trees) with 92.2% accuracy.

Murugan et al. [17] used supervised machine learning models (classifiers) such as Decision Tree, Random Forest, and Linear Regression to predict whether the type of cancer is benign or malignant. The repository of data used is the “UCI Machine Learning Repository (Wisconsin Breast Cancer) Data set”, in which the resultant diagnosis is grouped as B (benign) and M (malignant). The highest result came from the linear regression algorithm with 84.15% accuracy.

Chelvian Aroef et al. [18] used a UCI Machine Learning Repository dataset. They implemented machine learning classifiers such as Random Forest and Support Vector Machines, with the data segregated into 80% for training and 20% for testing. SVM worked better than Random Forest, with an accuracy of 95.4%.

In another work [19], the authors aimed to classify breast cancer types. Breast cancer lesions obtained from FNA (fine needle aspiration) were used together with a random forest classifier. The data set, Fine Needle Aspiration Biopsy Data obtained from the UCI Machine Learning Repository, has 458 benign and 241 malignant cases. After fitting the essential features and training the RF classifier, the results on 250 testing cases were a sensitivity of 75%, a specificity of 70%, and an accuracy of 72%. In conclusion, RF classified breast cancer tumors accurately; however, further studies are required to improve accuracy.

Subrato Bharati et al. [20] used machine learning algorithms such as Naïve Bayes, Random Forest, and the KNN classifier, along with Kappa statistics, precision, and recall, on data collected from the UCI Machine Learning Repository. KNN gave the highest result, correctly classifying 97.9021% of cases; the multilayer perceptron came second with 96.5035%. KNN even reached a 99.9% ROC area.

In another work [21], the authors proposed a rule-based machine learning classification algorithm to predict different types of breast cancer. They implemented classifiers such as NB, Trees Random Forest (TRF), 1-Nearest Neighbour, SVM, AdaBoost (AD), Multilayer Perceptron (MLP), and the RBF Network (a radial basis function network, a kind of artificial neural network). Comparing all the models, NB had the highest ROC area of 0.94, and TRF had the highest classification accuracy of 0.96. The data set used here was taken from the Cancer Registry Organization of Kerman Province, Iran.

Octaviani et al. [22] used the Wisconsin Breast Cancer Database (WBCD) data from the UCI Repository. Their work aimed to predict breast cancer using one of the best machine learning algorithms, and they reported an accuracy of 100%. A classifier showing 100% accuracy is problematic, since no algorithm is perfectly accurate; data pre-processing and cross-validation techniques could have solved this issue.

In another work [23], the authors implemented five machine learning models, DT, RF, SVM, NN, and LR, to predict breast cancer. They used two data sets: the Breast Cancer Coimbra Dataset (BCCD) and the Wisconsin Breast Cancer Database (WBCD). In conclusion, the scores of RF on both data sets were higher than those of the other classifiers: 74.3% and 93.7%, respectively. RF can also be used with other data
mining technologies to improve its accuracy. Table 1 depicts the accuracies obtained by previous works and their limitations.

Mohammad Monirunjjaman et al. [24] used the WISCONSIN breast cancer data set to predict the cancer probability of a patient. A correlation matrix was computed for selecting the desired features. Classifiers like Random Forest, K-Nearest Neighbors, Decision Tree, and Logistic Regression were used in their work; of these, Logistic Regression secured the highest accuracy of 98.6%.

Another interesting work [25] implemented farthest-first clustering with Naive Bayes and an artificial neural network model to predict breast cancer on the WISCONSIN Breast Cancer data set. Five partitions of the data set were used; of these, the 60–40 split gave the highest accuracy of 97.8%, and the balanced average is 98%.

We addressed all these problems in our proposed work and achieved better breast cancer prediction accuracy. Works from the year 2013 onward have been collected and reviewed. Their methodologies, results, and limitations are stated briefly in Table 1; the limitation cell is left blank for papers with no such limitations.

Research Gap

Our work focused primarily on improving the accuracy of classifiers. The majority of existing works analyzed the classifiers applied to a specific data partition. The results lacked the cross-validation method, which can change the outcomes drastically. Performance analysis of the classifiers implemented on different data partitions was missing.

The data set contains 569 instances, of which 212 are malignant. Further work has been done with the 212 malignant cases.

The data were processed by dropping the patient id(s) and the ‘Unnamed: 32’ column. Since the data set has several features, we plot a correlation matrix to remove the highly correlated features. Figure 2 depicts the correlation matrix. None of the attributes has null values. After removing the highly correlated features, the data frame was reduced to 23 columns. The input variable, the X data frame, was created by dropping the ‘diagnosis’ attribute, and the output variable, the Y frame, has only the ‘diagnosis’ attribute. The data set was split in three ways: first into 70% for training and 30% for testing, then into 80% for training and 20% for testing, and finally into 90% for training and 10% for testing. We used the standard scaler to scale the data.

Methodology

The five classifiers we employed, KNN (K-Nearest Neighbors), DT (Decision Trees), RF (Random Forest), LR (Logistic Regression), and SVM (Support Vector Machines), are trained on the three partitions of the data as shown in the workflow diagram in Fig. 1. For every data split (70–30, 80–20, and 90–10), the model accuracies, the confusion matrix, precision, recall, and F1-score are recorded in table format. The models are also evaluated with threefold, fourfold, fivefold, and tenfold cross-validation. The best cross-validation score and the mean cross-validation score for every fold are noted.
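The pre-processing and partitioning steps described above can be sketched with pandas and scikit-learn. This is a minimal sketch, not the authors' code: scikit-learn's built-in copy of the Wisconsin Diagnostic data stands in for the CSV used in the paper (so there are no 'id' or 'Unnamed: 32' columns to drop here), and the 0.9 correlation threshold is an assumption, since the exact cutoff is not stated.

```python
# Sketch of the described pipeline: correlation-based feature pruning,
# three train/test partitions, and standard scaling. Assumptions: the
# built-in WDBC copy replaces the paper's CSV, and the 0.9 correlation
# cutoff is a guess rather than the paper's value.
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["diagnosis"] = data.target  # in scikit-learn's encoding, 0 = malignant, 1 = benign

# Drop the second feature of every pair with absolute correlation above 0.9,
# using only the upper triangle so each pair is examined once.
corr = df.drop(columns="diagnosis").corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]

X = df.drop(columns=["diagnosis", *to_drop])
y = df["diagnosis"]

# Three partitions (70-30, 80-20, 90-10), each standardized with a scaler
# fitted on the training portion only, to avoid leaking test statistics.
splits = {}
for test_size in (0.30, 0.20, 0.10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=42, stratify=y)
    scaler = StandardScaler().fit(X_tr)
    splits[test_size] = (scaler.transform(X_tr), scaler.transform(X_te), y_tr, y_te)

print(f"{X.shape[1]} features kept after dropping {len(to_drop)} correlated ones")
```

Fitting the scaler on the training fold alone is a standard precaution; the number of surviving features depends on the chosen correlation threshold.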
Observations
Table 1 Accuracies obtained by previous works and their limitations

Author and year | Classifier used | Accuracy | Limitations
Mohammad Monirunjjaman et al. (2022) [24] | Random forest, K-nearest neighbors, decision tree, logistic regression | Logistic regression: 98.6% |
Ahmad and Yusoff (2013) [19] | Random forest classifier | 72% | Data were not well processed
Table 2 Accuracy score, confusion matrix, and classification report for the 70–30 split data (per-class values listed as benign / malignant)

Models | Accuracy score (training / testing) | Confusion matrix | Precision | Recall | F1-score | Support
Random forest | 0.962311 / 0.935672 | [[107 1] [5 58]] | 0.96 / 0.98 | 0.99 / 0.92 | 0.97 / 0.95 | 108 / 63
Logistic regression | 0.989949 / 0.959064 | [[106 2] [5 58]] | 0.95 / 0.97 | 0.98 / 0.92 | 0.97 / 0.94 | 108 / 63
Decision tree | 1.0 / 0.929824 | [[99 9] [3 60]] | 0.97 / 0.87 | 0.92 / 0.95 | 0.94 / 0.91 | 108 / 63
KNN | 0.962311 / 0.935672 | [[105 3] [8 58]] | 0.93 / 0.95 | 0.95 / 0.87 | 0.95 / 0.91 | 108 / 63
Support vector machine | 0.989949 / 0.929824 | [[99 9] [3 60]] | 0.97 / 0.87 | 0.92 / 0.95 | 0.94 / 0.91 | 108 / 63
Table 3 Accuracy score, confusion matrix, and classification report for the 80–20 split data (per-class values listed as benign / malignant)

Models | Accuracy score (training / testing) | Confusion matrix | Precision | Recall | F1-score | Support
Random forest | 0.967032 / 0.956140 | [[67 0] [3 44]] | 0.96 / 1.0 | 1.00 / 0.94 | 0.98 / 0.97 | 67 / 47
Logistic regression | 0.989010 / 0.964912 | [[66 1] [3 44]] | 0.96 / 0.98 | 0.99 / 0.94 | 0.97 / 0.96 | 67 / 47
Decision tree | 1.0 / 0.938596 | [[61 6] [1 46]] | 0.98 / 0.88 | 0.91 / 0.98 | 0.95 / 0.93 | 67 / 47
KNN | 1.0 / 0.938596 | [[61 6] [1 46]] | 0.94 / 0.98 | 0.99 / 0.91 | 0.96 / 0.95 | 67 / 47
Support vector machine | 0.991208 / 0.938596 | [[61 6] [1 46]] | 0.98 / 0.88 | 0.91 / 0.98 | 0.95 / 0.93 | 67 / 47
Table 4 Accuracy score, confusion matrix, and classification report for the 90–10 split data (per-class values listed as benign / malignant)

Models | Accuracy score (training / testing) | Confusion matrix | Precision | Recall | F1-score | Support
Random forest | 0.962890 / 0.982456 | [[34 1] [1 21]] | 0.97 / 0.95 | 0.97 / 0.95 | 0.97 / 0.95 | 35 / 22
Logistic regression | 0.988281 / 1.0 | [[35 0] [0 22]] | 1.0 / 1.0 | 1.0 / 1.0 | 1.0 / 1.0 | 35 / 22
Decision tree | 1.0 / 0.929825 | [[32 3] [1 21]] | 0.97 / 0.88 | 0.91 / 0.95 | 0.94 / 0.91 | 35 / 22
KNN | 0.962890 / 0.982456 | [[35 0] [1 21]] | 0.97 / 1.0 | 1.0 / 0.95 | 0.99 / 0.98 | 35 / 22
Support vector machine | 0.990234 / 0.929825 | [[32 3] [1 21]] | 0.97 / 0.88 | 0.91 / 0.95 | 0.94 / 0.91 | 35 / 22
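Reports of the kind shown in Tables 2, 3, and 4 can be generated with scikit-learn's metric utilities. The following is a minimal sketch, not the authors' code: it assumes default hyperparameters (the paper does not state its settings) and scikit-learn's built-in copy of the WDBC data, so the exact numbers will differ from the tables.

```python
# Hedged sketch: train the five classifiers on one split and print the
# quantities tabulated above (train/test accuracy, confusion matrix,
# per-class precision/recall/F1). Hyperparameters are assumed defaults.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30,
                                          random_state=1, stratify=y)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

models = {
    "Random forest": RandomForestClassifier(random_state=1),
    "Logistic regression": LogisticRegression(max_iter=5000),
    "Decision tree": DecisionTreeClassifier(random_state=1),
    "KNN": KNeighborsClassifier(),
    "Support vector machine": SVC(),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, model.predict(X_tr))
    test_pred = model.predict(X_te)
    print(f"{name}: train={train_acc:.6f} test={accuracy_score(y_te, test_pred):.6f}")
    print(confusion_matrix(y_te, test_pred))
    print(classification_report(y_te, test_pred))
```

Repeating the loop with `test_size=0.20` and `test_size=0.10` yields the 80–20 and 90–10 counterparts.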
Table 5 Accuracy, best cross-validation score, and mean cross-validation score for each data split

Models | 70–30: accuracy / best CV / mean CV | 80–20: accuracy / best CV / mean CV | 90–10: accuracy / best CV / mean CV
Random forest | 0.959064 / 0.968406 / 0.959617 | 0.964912 / 0.964884 / 0.964448 | 0.982456 / 0.964897 / 0.961376
Logistic regression | 0.935673 / 0.945536 / 0.944228 | 0.956140 / 0.945530 / 0.944228 | 1.000000 / 0.945536 / 0.944228
Decision tree | 0.935673 / 0.922683 / 0.925782 | 0.956140 / 0.929704 / 0.925335 | 0.929825 / 0.920944 / 0.926230
KNN | 0.912281 / 0.903354 / 0.902957 | 0.938596 / 0.898079 / 0.902957 | 0.982456 / 0.903354 / 0.902957
Support vector machine | 0.912281 / 0.680345 / 0.704978 | 0.938596 / 0.826047 / 0.722490 | 0.929825 / 0.785592 / 0.761034
regression. The cross-validation scores are greater than those of the other classifiers. In the case of the 90–10 split data, we encountered a problem: Logistic Regression got the highest accuracy of 100%, so we considered Random Forest, which has an accuracy of 98.2%.

Graphs plotted for the accuracy comparison, the best cross-validation score, and the mean cross-validation scores are shown in Fig. 3.
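The fold comparison behind these scores can be sketched as follows. This is a minimal illustration under stated assumptions: scikit-learn's `cross_val_score` (which applies stratified k-fold splitting for classifiers) on the built-in WDBC copy, not the authors' exact pipeline, so the resulting numbers will not match Table 5.

```python
# Hedged sketch: for each k in {3, 4, 5, 10}, run k-fold cross-validation
# and record the best single-fold score and the mean score, mirroring the
# "best CV" and "mean CV" columns of Table 5.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
# Scaling inside a pipeline so each CV fold fits its own scaler.
model = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=1))

best_per_k, mean_per_k = {}, {}
for k in (3, 4, 5, 10):
    scores = cross_val_score(model, X, y, cv=k)
    best_per_k[k] = scores.max()   # best cross-validation score for this fold count
    mean_per_k[k] = scores.mean()  # mean cross-validation score for this fold count
    print(f"{k}-fold: best={scores.max():.6f} mean={scores.mean():.6f}")

overall_best = max(best_per_k.values())
print(f"overall best cross-validation score: {overall_best:.6f}")
```

Swapping in the other four classifiers fills out the remaining rows of the comparison.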
Fig. 3 Comparison graph of the best cross-validation score and mean cross-validation scores
Results and Discussions

From the above model accuracy tables, we observed that when the data were divided into 70% training and 30% testing, Logistic Regression got the highest accuracy of 95.9%. But after the application of the k-fold cross-validation technique, Random Forest had the best score, 0.968406, obtained with the fivefold cross-validation method. We used threefold, fourfold, fivefold, and tenfold cross-validation to pick the best result. Random Forest's mean cross-validation score was 0.959617; the rest are in Table 5.

When the data were split into 80% training and 20% testing, Logistic Regression got the highest accuracy of 96.4%. But after the application of the k-fold cross-validation technique, Random Forest had the best score, 0.964884, obtained with the threefold cross-validation method. We used threefold, fourfold, fivefold, and tenfold cross-validation to pick the best result. Random Forest's mean cross-validation score was 0.964448; the rest are in Table 5.

When the data were segregated into 90% training and 10% testing, Logistic Regression got the highest accuracy of 100%, which is perhaps a problem. But after the application of the k-fold cross-validation approach, Random Forest had the best score, 0.964897, obtained with the fivefold cross-validation method. We used threefold, fourfold, fivefold, and tenfold cross-validation to pick out the best result. Random Forest's mean cross-validation score was 0.961376; the rest are in Table 5.

In Table 6, we have compared the accuracies obtained by prior works with the proposed work. Although several previous works used the random forest well, our proposed work showed better results than those.

Conclusion and Future Work

In conclusion, we can say that, despite differences in accuracy scores, Random Forest gave the best results after the application of the k-fold cross-validation technique. This was found in the 90% training and 10% testing partition of the data, where Random Forest gave the highest cross-validation score of nearly 0.96488. We can say that Random Forest is more accurate than the other classifiers used in this work, although Logistic Regression also did well and gave an accuracy close to Random Forest's. After observing the implementation of the different classifiers on the three data splits, 70–30, 80–20, and 90–10, we can say that Random Forest is the best classifier for predicting breast cancer.

We have obtained a better result in our work, and it can be further improved by using ensemble learning or data mining tools. We look forward to improving our outcomes shortly.
Acknowledgements We sincerely thank the Department of Computer Science, The University of Burdwan, India, for their assistance in pursuing our research work.

Funding This research is not funded by any agencies or institutes.

Data availability The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

Declarations

Conflict of Interest The authors declare that they have no conflict of interest.

References

1. Jemal A, Siegel R, Ward E, Hao Y, Xu J, Thun MJ. Cancer statistics, 2009. CA Cancer J Clin. 2009;59(4):225–49.
2. Akram M, Iqbal M, Daniyal M, Khan AU. Awareness and current knowledge of breast cancer. Biol Res. 2017;50(1):33.
3. Milosevic M, Jankovic D, Milenkovic A, Stojanov D. Early diagnosis and detection of breast cancer. Technol Health Care. 2018;26(4):729–59.
4. Kalli S, Semine A, Cohen S, Naber SP, Makim SS, Bahl M. American Joint Committee on Cancer's staging system for breast cancer, eighth edition: what the radiologist needs to know. Radiographics. 2018;38(7):1921–33.
5. Cruz JA, Wishart DS. Applications of machine learning in cancer prediction and prognosis. Cancer Inf. 2006;2:117693510600200030.
6. Choi RY, Coyner AS, Kalpathy-Cramer J, Chiang MF, Campbell JP. Introduction to machine learning, neural networks, and deep learning. Transl Vis Sci Technol. 2020;9(2):14.
7. Chowdhury S, Schoen MP. Research paper classification using supervised machine learning techniques. In: 2020 Intermountain Engineering, Technology and Computing (IETC). Oct 2020, pp. 1–6.
8. Department of Computer Science, Avinashilingam Institute for Home Science and Higher Education for Women, Coimbatore. Prediction of breast cancer using decision tree and random forest algorithm. Int J Comput Sci Eng. 2018;6(2):226–9.
9. Jaikrishnan SVJ, Chantarakasemchit O, Meesad P. A breakup machine learning approach for breast cancer prediction. In: 2019 11th International Conference on Information Technology and Electrical Engineering (ICITEE). Pattaya, Thailand: IEEE, October 2019, pp. 1–6.
10. Li J, Zhou Z, Dong J, Fu Y, Li Y, Luan Z, Peng Z. Predicting breast cancer 5-year survival using machine learning: a systematic review. PLOS ONE. 2021;16(4):1–23.
11. Dai B, Chen RC, Zhu SZ, Zhang WW. Using random forest algorithm for breast cancer diagnosis. In: International Symposium on Computer, Consumer and Control (IS3C). Taichung, Taiwan: IEEE, December 2018, pp. 449–452.
12. Sharma S, Aggarwal A, Choudhury T. Breast cancer detection using machine learning algorithms. p. 5, 2018.
13. Gupta P, Garg S. Breast cancer prediction using varying parameters of machine learning models. Proc Comput Sci. 2020;171:593–601.
14. Kabiraj S, Raihan M, Alvi N, Afrin M, Akter L, Sohagi SA, Podder E. Breast cancer risk prediction using XGBoost and random forest algorithm. In: 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT). Kharagpur, India: IEEE, July 2020, pp. 1–4.
15. Khourdifi Y, Bahaj M. Applying best machine learning algorithms for breast cancer prediction and classification. In: International Conference on Electronics, Control, Optimization and Computer Science (ICECOCS). Kenitra: IEEE, December 2018, pp. 1–5.
16. Keleş MK. Breast cancer prediction and detection using data mining classification algorithms: a comparative study. Tehnicki vjesnik (Technical Gazette). 2019;26(1).
17. Murugan S, Kumar BM, Amudha S. Classification and prediction of breast cancer using linear regression, decision tree and random forest. In: International Conference on Current Trends in Computer, Electrical, Electronics and Communication (CTCEEC). Mysore: IEEE, September 2017, pp. 763–766.
18. Aroef C, Rivan Y, Rustam Z. Comparing random forest and support vector machines for breast cancer classification. TELKOMNIKA (Telecommun Comput Electron Control). 2020;18(2):815.
19. Ahmad FK, Yusoff N. Classifying breast cancer types based on fine needle aspiration biopsy data using random forest classifier. In: 13th International Conference on Intelligent Systems Design and Applications. Salangor, Malaysia: IEEE, December 2013, pp. 121–125.
20. Bharati S, Rahman MA, Podder P. Breast cancer prediction applying different classification algorithm with comparative analysis using WEKA. In: 4th International Conference on Electrical Engineering and Information & Communication Technology (iCEEiCT). Dhaka, Bangladesh: IEEE, September 2018, pp. 581–584.
21. Montazeri M, Beigzadeh A. Machine learning models in breast cancer survival prediction. p. 12, 2019.
22. Octaviani TL, Rustam Z. Random forest for breast cancer prediction. Depok, Indonesia, p. 020050, 2019.
23. Li Y. Performance evaluation of machine learning methods for breast cancer prediction. Appl Comput Math. 2018;7(4):212.
24. Khan M. Machine learning based comparative analysis for breast cancer prediction. J Healthc Eng. 2022;2022:1–15.
25. Chinniyan K, Subramani R. Breast cancer prediction using machine learning algorithms. Int J Mech Eng. 2021;6(3):9–9.
26. Breast Cancer Wisconsin (Diagnostic) Data Set, 2018.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.