Cancer Detection Using Machine Learning
All content following this page was uploaded by Mervat Adib Bamiah on 30 September 2018.
Abstract—Cancer is the second leading cause of death in the world; 8.8 million patients died of cancer in 2015. Breast cancer is the leading cause of death among women. Many studies have addressed early detection of breast cancer so that treatment can start sooner and the chance of survival increases. Most of these studies concentrated on mammogram images. However, mammogram images carry a risk of false detection that may endanger the patient's health. It is vital to find alternative methods that are easier to implement, work with different data sets, are cheaper and safer, and produce more reliable predictions. This paper proposes a hybrid model that combines several Machine Learning (ML) algorithms, including Support Vector Machine (SVM), Artificial Neural Network (ANN), K-Nearest Neighbor (KNN), and Decision Tree (DT), for effective breast cancer detection. This study also discusses the datasets used for breast cancer detection and diagnosis. The proposed model can be used with different data types such as images, blood samples, etc.

Index Terms—Breast Cancer; Breast Cancer Detection; Medical Images; Machine Learning.

I. INTRODUCTION

The World Health Organization (WHO) reported that breast cancer is the most common cancer among women globally [1]. It is also the leading cause of cancer death among women worldwide [2, 3]. In Malaysia, breast cancer has the highest rate of cancer deaths, around 25%, and it is the most common cancer among women [4]. Around 5% of Malaysian women are at risk of breast cancer, while in Europe and the United States the figure is around 12.5% [3]. It has also been confirmed that women with breast cancer in Malaysia present at a later stage of the disease than women from other countries [4]. Usually, breast cancer can be easily detected if specific symptoms appear. However, many women who have breast cancer show no symptoms. Hence, regular breast cancer screening is very important for early detection [3].

Early detection of breast cancer enables early diagnosis and treatment, which matters because prognosis is very important for long-term survival [5]. Since early detection, diagnosis, and treatment of cancer can reduce the risk of death, they play a significant role in saving the life of the patient. Any delay in detecting cancer in its early stages leads to disease progression and complicates treatment [5]; therefore, a long waiting time before the diagnosis of breast cancer and the start of treatment is of prognostic concern.

Previous studies on the consequences of a late diagnosis of cancer confirm that it is strongly associated with progression of the disease to more advanced stages and, consequently, with a lower chance of saving the patient's life. A systematic review by Richards et al. [6], analyzing 87 studies, concluded that female patients with breast cancer who start their therapy less than 3 months after the appearance of symptoms have a significantly higher chance of survival than those who wait for more than 3 months.

Many previous studies confirm that detecting breast cancer in its early stages significantly increases the chance of survival because it prevents malignant cells from spreading throughout the entire body [6].

The main contribution of this paper is to review the role of machine learning techniques in the early detection of breast cancer.

Artificial Intelligence (AI) can be applied to improve breast cancer detection and diagnosis, as well as to prevent overtreatment. Moreover, combining AI and Machine Learning (ML) methods enables prediction and supports accurate decision making, for example, deciding from biopsy results whether a patient needs surgery.

Currently, mammograms are the most widely used test available; however, they still produce false positive (high-risk) results that show abnormal cells and can lead to unnecessary biopsies and surgeries. Sometimes surgery performed to remove a lesion reveals that the lesion is benign and therefore not harmful, which means that the patient has gone through unnecessary, painful, and expensive surgery.

ML algorithms offer many advantages, such as effective performance on healthcare-related datasets that involve images, X-rays, blood samples, etc. Some methods are appropriate for small datasets, whereas others are suitable for very large datasets. However, noise can be problematic for some methods.

This paper is organized as follows. Section II briefly introduces breast cancer. Section III explains the ML algorithms used for detecting breast cancer. Section IV summarizes previous related work. Finally, Section V concludes the paper.

II. BREAST CANCER

Breast cancer is the most commonly found disease in women worldwide. An abnormal growth of a mass of tissue causes the expansion of malignant cells, which leads to acute breast cancer. These malignant cells originate in the milk glands of the breast. They are the main cause of breast cancer and can be classified into different groups according to their abnormal progression and their capability to affect other normal cells [7]. This capability refers to whether the malignant cells affect only local cells or can spread throughout the body. The spread of malignant cells throughout the whole body of the patient is called metastasis [7]. It is very important to prevent this spread by diagnosing cancer in its early stages using advanced techniques and equipment. In recent decades, there have been many efforts to employ artificial intelligence and related methods to assist in detecting cancer at earlier stages.

Early detection of cancer raises the chance of survival to 98% [8]. Figure 1 shows different types of cancers, among which breast cancer leads with 24%.

[...] and Smooth Support Vector Machine (SSVM).

C. K-Nearest Neighbors (KNN)

KNN is a supervised learning method used for diagnosing and classifying cancer [12]. In this method, the computer is trained on data from a specific field and is then given new data. The machine uses the most similar training samples, that is, the K nearest neighbors, to classify the unknown data. It is recommended to choose a large dataset for training, and the value of K should be an odd number so that a two-class majority vote cannot end in a tie.
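The KNN procedure described above can be illustrated with a minimal sketch. It is not the implementation used in [12]: the function name `knn_predict`, the Euclidean distance, and the toy feature values are assumptions made only for illustration.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    if k < 1 or k > len(train_points):
        raise ValueError("k must be between 1 and the number of training points")
    # Rank the training samples by Euclidean distance to the query point.
    order = sorted(range(len(train_points)),
                   key=lambda i: dist(train_points[i], query))
    # Count the labels of the k closest samples and return the most common one.
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature samples (made-up values, not from any dataset in this paper).
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
                (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_labels = ["benign", "benign", "benign",
                "malignant", "malignant", "malignant"]

prediction = knn_predict(train_points, train_labels, query=(3.9, 4.1), k=3)
print(prediction)
```

With two classes, an odd K guarantees that the vote in `votes` has a strict winner, which is the reason for the recommendation in the text.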
Chunqiu Wang et al. [18] chose Microwave Tomography Imaging (MTI) to extract features and classify the images using ANN. Two different techniques, GMM and KNN, were compared in this study. Their results showed that the sensitivity obtained by KNN is 87%, while that of GMM is 67%. The accuracy was 85% for KNN and 75% for GMM. The Matthews Correlation Coefficient (MCC) was 67% and 48% for KNN and GMM, respectively. Finally, the specificity was 84% for KNN and 86% for GMM. According to their findings, KNN was better than GMM in sensitivity, accuracy, and MCC, but GMM was better in specificity and precision.

Chowdhary and Acharjya [19] focused on mammogram images as they are cheaper and more efficient in detection. Since selecting and extracting features is important for improving performance, Fuzzy Histogram Hyperbolization (FHH) was chosen to increase the quality of the images, Fuzzy C-mean for segmentation, and the Gray level dependence model for feature extraction. Their method showed 94% accuracy in detecting malignant breast lesions.

In a study conducted by Aminikhanghahi et al. [20], wireless cyber mammography images were explored. After selecting and extracting features, the researchers chose two different ML techniques, SVM and GMM, and checked their accuracy. Their findings showed that SVM is more accurate if there is no noise or error; otherwise, GMM is better and safer.

Durai et al. [21] selected a Data Mining technique for detecting diseases including breast cancer. They used LRC and compared it with four other techniques: BFI, ID3, J48, and SVM. The results show that LRC is the most accurate, with 99.25% accuracy.

Wang and Yoon [22] chose four Data Mining methods to measure their effectiveness in detection. These models were SVM, ANN, Naïve Bayes Classification, and Adaboost tree. In addition, PCs and PCi were used to build hybrid models. After checking the accuracy, they found that Principal Component Analysis (PCA) can be a critical factor in improving performance.

Hafizah et al. [23] compared SVM and ANN using four different datasets of breast and liver cancer including WBCD, BUPA JNC, Data, Ovarian. The researchers demonstrated that both methods achieve high performance, but SVM was still better than ANN.

Azar and El-Said [24] worked on six different methods of SVM. They compared ST-SVM with LPSVM, LSVM, SSVM, PSVM, and NSVM to find out which method performs best in accuracy, sensitivity, specificity, and ROC. LPSVM proved to be the best, with accuracy 97.1429%, sensitivity 98.2456%, specificity 95.082%, and ROC 99.38%. Therefore, LPSVM has the highest performance.

Deng and Perkowski [25] used a new method called Weighted Hierarchical Adaptive Voting Ensemble (WHAVE). They compared the accuracy of WHAVE with seven other methods that had the highest accuracies in previous research. WHAVE achieved the highest performance value of 99.8%.

Rehman et al. [26] extracted different features, including Phylogenetic trees, Statistical Features, and Local Binary Patterns (LBP), from mammography images. They used a hybrid model combining SVM and RBF for classification. They first checked the accuracy of each feature set separately. In this step, the best accuracy was 76% for 90 features chosen from the Taxonomic Indices based Feature (TIF) Vector, compared with 68% for the Statistical and LBP based Feature Vector. The features were then combined (Taxonomic Indices, Statistical and LBP based Feature Vector) and checked for accuracy again. The evaluation results were best after four rounds of testing. The researchers claimed that using different features increases the performance and efficiency of breast cancer detection.

Mejia et al. [27] chose Thermogram images for detecting breast cancer, as thermography is cheaper and safer than other methods. It can detect cancer at an earlier stage than other images or tests, and it has no limitations related to pregnancy or to the size or density of the breast. Also, it does not need any complex features to be extracted. They selected 18 cases, 9 abnormal and 9 normal. A KNN classifier was used to improve the accuracy. The results were 88.88% for abnormal and 94.44% for normal cases.

Ayeldeen et al. [28] used AI techniques for breast cancer detection. They used 5 different methods for performance comparison. The RF algorithm showed the highest result, with 99% performance.

Avramov and Si [29] worked on feature extraction and the impact of feature selection on performance. They applied 4 ways of correlation selection (PCA, T-Test Significance and Random feature selection) and 5 classification models (LR, DT, KNN, LSVM, and CSVM). The best result was achieved by stacking the logistic, SVM, and CSVM models, which improved accuracy to 98.56%.

Ngadi et al. [30] used the NSVC algorithm to test different classification methods including RBF, Poly, and Linear. They then compared the results with other classification methods such as Naïve Bayes, DT, K-NN, SVM, RF, and Adaboost. RF had the best performance result with 93% accuracy. This proves that NSVC was better than the other methods.

Jiang and Xu [31] used Diffusion-Weighted Magnetic Resonance Images (DWI) for breast cancer detection. They used two types of features, one based on ROI and the other based on ADC, on data from 61 patients. Moreover, they implemented RF-RFE together with the RF algorithm. The study findings show that the accuracy of RF-RFE and RF with Histogram + GLCM is 77.05%, which indicates that texture features have a critical role in improving performance and detection.

Salma [32] selected two different data sets, WBCD and KDD, and applied FM-ANN to both of them. The results were compared with other techniques (RBF, FNN, and MNN). After training and testing, KDD achieved the better accuracy of 99.96% because it has more features. In the comparison, FM-ANN proved to be more accurate.

Bevilacqua et al. [33] selected MR images for training and testing. After extracting and processing the data, they used ANN for classification and breast cancer detection. When a Genetic Algorithm was used to optimize the ANN, the observed specificity was 90.46%, the sensitivity was 89.08%, the average accuracy improved to 89.77%, and the highest accuracy reached 100%.

Table 1 summarizes the ML methods used in the related work reviewed in this study [17-33]. It lists the references, the types of extracted features, the data sets, and the measured performance. Performance is the most significant criterion in choosing the proper method.
Table 1
Related work on different types of methodology, features, dataset, and references for breast cancer detection

[19]
Method: SVM, KNN, RSDA
Features: Fuzzy Histogram Hyperbolization, Fuzzy C-mean, and Gray level dependence model
Data type: Mammogram
Dataset: Mammographic Image Analysis Society (MIAS)
Results (class: training set, accuracy %):
  Normal: 70, 100
  Benign: 60, 96.67
  Malignant: 50, 94

[20]
Method: SVM, GMM
Features: Contrast, Homogeneity, Mean, Correlation, Energy, Maximum
Data type: Mammography
Dataset: DDSM, University of South Florida
Results (MCC, Sensitivity, Specificity):
  SVM: 78.78%, 82%, 96%
  GMM: 72.06%, 84%, 86%

[21]
Method: LRC
Features: Mitoses, Marginal-Adhesion, Normal Nucleoli, Clump Thickness, Bland Chromatin, Uniformity of cell shape, Single Epithelial cell size, Uniformity of cell size, Bare Nuclei
Data type: Standard Data
Dataset: UCI
Results (accuracy percentage):
  LRC: 99.25
  BFI: 95.46
  ID3: 92.99
  J48: 98.14
  SVM: 96.40

[23]
Method: ANN, SVM
Features: Mitoses, Marginal-Adhesion, Normal Nucleoli, Clump Thickness, Bland Chromatin, Uniformity of cell shape and size, Single Epithelial cell size, Bare Nuclei
Data type: Standard Data
Dataset: Wisconsin Breast Cancer Database (WBCD)
Results (Accuracy, Sensitivity, Specificity, AUC):
  SVM: 99.51%, 99.25%, 100%, 99.63%
  ANN: 98.54%, 99.25%, 97.22%, 98.24%

[25]
Method: Weighted Hierarchical Adaptive Voting Ensemble (WHAVE), Disjunctive Normal Form (DNF) rule-based method, DT, NB, SVM
Features: Mitoses, Marginal-Adhesion, Normal Nucleoli, Clump Thickness, Bland Chromatin, Uniformity of cell shape and size, Single Epithelial cell size, Bare Nuclei
Dataset: WBCD
Results (method: accuracy percentage):
  DNF: 65.72
  DT: 94.74
  NB: 84.5
  SVM: 99.54
  Hybrid: 99.54
  KNN: 97.14
  Quadratic Classifier: 97.14
  WHAVE: 99.8

[27]
Method: KNN
Features: Mean, Standard Deviation
Data type: Thermogram
Dataset: Federal Fluminense University Hospital
Results (KNN accuracy):
  Normal: 94.44%
  Abnormal: 88.88%

[28]
Method: Bayes Net (BN), Multi-Class Classifier, DT, Radial Basis Function, RF
Features: TP Rate, FP Rate, Precision, Recall, F-measure, ROC area
Data type: Blood Serum
Dataset: Department of Biochemistry and Molecular Biology of Kasr Alainy
Results (TP rate, FP rate, Precision, Recall, F, ROC):
  BN: 0.947, 0.035, 0.949, 0.947, 0.945, 0.995
  Multi CC: 0.933, 0.043, 0.933, 0.933, 0.93, 0.987
  DT: 0.87, 0.084, 0.878, 0.87, 0.868, 0.966
  RBF: 0.774, 0.128, 0.722, 0.774, 0.739, 0.908
  RF: 0.99, 0.007, 0.99, 0.99, 0.99, 1

[29]
Method: Logistic Regression (LR), DT, KNN, Cubic SVM (CSVM)
Features: Radius, Texture, Perimeter, Area, Smoothness, Compactness, Concavity, Concave Points, Symmetry, Fractal Dimension
Data type: Microscope Digital Image
Dataset: UCI
Results (accuracy percentage):
  DT with 30 features: 92.51
  KNN with 30 features: 91.56
  LR with 3 features: 96.27
  LR with 6 features: 97.77
  LR with 30 features: 95.65
  LSVM with 3 features: 97.47
  LSVM with 10 features: 97.87
  LSVM with 30 features: 97.30
  CSVM with 11 features: 97.98
  SVM and CSVM: 98.56
  CSVM with 30 features: 98
  Stacking the Logistic, LSVM, and CSVM: 98.56

[31]
Method: RF-Recursive Feature Elimination (RF-RFE) method
Features: ROI: Mean, Variance, Skewness, Kurtosis, Energy, Entropy; ADC: Contrast, Entropy, ASM, Correlation
Data type: Diffusion-Weighted Magnetic Resonance Image (DW (Convert to ADC)-MRI)
Dataset: Zhejiang Cancer Hospital
Results (Accuracy, Sensitivity, Specificity, AUC):
  RF-RFE and RF: 77.05%, 84.21%, 65.21%, 0.76
  Histogram: 68.85%, 76.32%, 56.52%, 0.73
  GLCM: 65.57%, 71.05%, 56.52%, 0.63
  Histogram + GLCM: 77.05%, 84.21%, 65.21%, 0.76
According to Figure 2, most researchers have worked on mammogram images, as mammography is quicker than other types of breast cancer detection and is safe and more effective [34]. Figure 3 compares the ML methods and algorithms employed for breast cancer detection in the reviewed literature listed in Table 1. It is observed that SVM is the most frequently used method. Figure 4 presents the results of breast cancer detection using ML methods.

[Figure 2: Different breast cancer detection methods. Chart legend: Mammogram, Standard, MRI, MTI, Thermogram, Blood Test, MDI, DW, X-Ray.]

V. CONCLUSION

In the present paper, breast cancer and ML were introduced, and an in-depth literature review was performed on existing ML methods used for breast cancer detection. The findings of the reviewed studies suggest that SVM is the most popular method used for cancer detection applications. SVM was used either alone or combined with another method to improve performance. The maximum accuracy achieved by SVM (single or hybrid) was 99.8%, which can be improved to 100%. The work of [33], which used an optimized ANN on MRI, reported 100% accuracy in detecting breast cancer. This method can be applied and tested on other datasets, such as mammogram and ultrasound, to check its performance on different data types. The mammogram was the most frequently used data type compared to other types of data such as ultrasound images, thermal images, or blood features.
[Figure 3: Using machine learning methods in cancer detection. Chart titled "Popularity of Machine Learning Methods" with entries NSVC, MCC, BNN, DNF, PCA, ABT, RSDA, RBF, NB, LRC, GMM, RF, DT, ANN, K-NN, and SVM.]

[Figure 4: chart of Accuracy (%), on a scale from 0 to 100, for references [17] to [33].]

[6] M.A. Richards, A.M. Westcombe, S.B. Love, P. Littlejohns, and A.J. Ramirez, “Influence of delay on survival in patients with breast cancer: a systematic review,” The Lancet, 1999, vol. 353, no. 9159, pp. 1119-1126.
[7] B. Stewart and C.P. Wild, World Cancer Report 2014, International Agency for Research on Cancer, WHO, 2014.
[8] S. A. Korkmaz and M. Poyraz, “A New Method Based for Diagnosis of Breast Cancer Cells from Microscopic Images: DWEE—JHT,” J. Med. Syst., vol. 38, no. 9, p. 92, 2014.
[9] P. Louridas and C. Ebert, “Machine Learning,” IEEE Softw., vol. 33, no. 5, pp. 110–115, 2016.
[10] A. Simons, “Using artificial intelligence to improve early breast cancer detection,” 2017. Retrieved on April 10, 2018, from https://www.csail.mit.edu/news/using-artificial-intelligence-improve-early-breast-cancer-detection
[11] E. Ali and W. Feng, “Breast Cancer classification using Support Vector Machine and Neural Network,” International Journal of Science and Research, pp. 2013, 2319-7064.
[12] S. Medjahed, T. Saadi, and A. Benyettou, “Breast Cancer Diagnosis by using k-Nearest Neighbor with Different Distances and Classification Rules,” International Journal of Computer Applications, 2013, vol. 62, no. 1, pp. 0975 – 8887.
[13] R. Sumbaly, N. Vishnusri, and S. Jeyalatha, “Diagnosis of Breast Cancer using Decision Tree Data Mining Technique,” International Journal of Computer Applications, 2014, vol. 98, no. 10, pp. 0975 – 8887.
[14] M. Elgedawy, “Prediction of Breast Cancer using Random Forest, Support Vector Machines and Naïve Bayes,” International Journal of Engineering and Computer Science, 2017, vol. 6, no. 1, pp. 19884-19889.
[15] R. Senkamalavalli and T. Bhuvaneswari, “Improved classification of breast cancer data using hybrid techniques,” International Journal of Advanced Research in Computer Science, 2017, vol. 8, no. 8, pp. 454-457.
[16] A. Hazra, S. Mandal, and A. Gupta, “Study and Analysis of Breast Cancer Cell Detection using Naïve Bayes, SVM and Ensemble Algorithms,” International Journal of Computer Applications, 2016, vol. 145, no. 2, pp. 0975 – 8887.
[17] S. Gc, R. Kasaudhan, T. K. Heo, and H.D. Choi, “Variability Measurement for Breast Cancer Classification of Mammographic Masses,” in Proceedings of the 2015 Conference on research in adaptive and convergent systems (RACS), Prague, Czech Republic, 2015, pp. 177–182.
[18] C. Wang, W. Wang, S. Shin, and S. I. Jeon, “Comparative Study of Microwave Tomography Segmentation Techniques Based on GMM and KNN in Breast Cancer Detection,” in Proceedings of the 2014 Conference on Research in Adaptive and Convergent Systems (RACS '14), Towson, Maryland, 2014, pp. 303–308.
[19] C. L. Chowdhary and D. P. Acharjya, “Breast Cancer Detection using Intuitionistic Fuzzy Histogram Hyperbolization and Possibilitic Fuzzy c-mean Clustering algorithms with texture feature-based Classification on Mammography Images,” in Proceedings of the International Conference on Advances in Information Communication Technology & Computing, Bikaner, India, 2016, pp. 1–6.
[20] S. Aminikhanghahi, S. Shin, W. Wang, S. I. Jeon, S. H. Son, and C.
Descriptors for Breast Cancer Detection,” 2015 Asia-Pacific Conf. Comput. Aided Syst. Eng., pp. 24–29, 2015.
[28] H. Ayeldeen, M. A. Elfattah, O. Shaker, A. E. Hassanien, and T.-H. Kim, “Case-Based Retrieval Approach of Clinical Breast Cancer Patients,” 2015 3rd Int. Conf. Comput. Inf. Appl., pp. 38–41, 2015.
[29] T. K. Avramov and D. Si, “Comparison of Feature Reduction Methods and Machine Learning Models for Breast Cancer Diagnosis,” Proc. Int. Conf. Comput. Data Anal. - ICCDA ’17, pp. 69–74, 2017.
[30] M. Ngadi, A. Amine, and B. Nassih, “A Robust Approach for Mammographic Image Classification Using NSVC Algorithm,” Proc. Mediterr. Conf. Pattern Recognit. Artif. Intell. - MedPRAI-2016, pp. 44–49, 2016.
[31] Z. Jiang and W. Xu, “Classification of benign and malignant breast cancer based on DWI texture features,” ICBCI 2017 Proceedings of the International Conference on Bioinformatics and Computational Intelligence 2017.
[32] M. U. Salma, “Fast Modular Artificial Neural Network for the Classification of Breast Cancer Data,” Proc. Third Int. Symp. Women Comput. Informatics - WCI ’15, pp. 66–72, 2015.
[33] V. Bevilacqua, A. Brunetti, M. Triggiani, D. Magaletti, M. Telegrafo, and M. Moschetta, “An Optimized Feed-forward Artificial Neural Network Topology to Support Radiologists in Breast Lesions Classification,” Proc. 2016 Genet. Evol. Comput. Conf. Companion - GECCO ’16 Companion, pp. 1385–1392, 2016.
[34] M. Rmili and A. El, “A Combined Approach for Breast Cancer Detection in Mammogram,” 2016 13th International Conference on Computer Graphics, Imaging and Visualization, pp. 350–353, 2016.