Breast Cancer Detection Using Machine Learning Algorithm
Project Report
On
Bachelor of Engineering
in
Computer
By
Supervisor
Prof. Vaibhav Badbe
Technology Personified
(2018-19)
CERTIFICATE
This is to certify that, the Project titled
Bachelor of Engineering
In
Computer
To the
University of Mumbai
Supervisor
Prof. Vaibhav Badbe
This is to certify that the project entitled “Breast Cancer Detection Using Machine
Learning” is a bonafide work done by Vinay Vilas Patil, Arun Vasantlal Jasiwal and Shubham
Satish under the supervision of Prof. Vaibhav Badbe. This project has been approved for the
award of Bachelor’s Degree in Computer Engineering, University of Mumbai.
Examiners:
1...............................
2...............................
Supervisors:
1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Principal:
..............................
Date:
Place:
Declaration
I declare that this written submission represents my ideas in my own words and where
other’s ideas or words have been included, I have adequately cited and referenced the original
sources. I also declare that I have adhered to all principles of academic honesty and integrity
and have not misrepresented or fabricated or falsified any idea/data/fact/source in my
submission. I understand that any violation of the above will be cause for disciplinary action
by the Institute and can also evoke penal action from the sources which have thus not been
properly cited or from whom proper permission has not been taken when needed.
Date:
Contents
Abstract
List of Tables
1. Introduction
4. Methodology
5. Design of System
Reference
Abstract
Breast cancer (BC) is one of the most common cancers among women worldwide. According to
global statistics it accounts for a large share of new cancer cases and cancer-related deaths,
making it a significant public health issue. Early diagnosis of BC can significantly improve
prognosis and the chance of survival, because it allows patients to receive timely clinical
treatment. This project aims to detect the type of breast cancer (malignant or benign) from
cell parameters using K-Nearest Neighbors (K-NN), a machine learning algorithm. The quality of
the results depends largely on the distance measure and on the value of the parameter “k”,
which is the number of nearest neighbors considered. The project also aims to maximize
detection accuracy on BC data sets. The study is conducted on the Wisconsin Breast Cancer
Dataset (WBCD), collected at the University of Wisconsin Hospitals and obtained from the UCI
repository.
List of Figures
List of Tables
BREAST CANCER DETECTION USING MACHINE LEARNING
CHAPTER 1
1.1 Introduction:
Breast cancer (BC) is the most common cancer in women, affecting about 10% of all
women at some stage of their life. Over the past few decades, machine learning (ML) techniques
have been widely used in intelligent healthcare systems, especially for BC detection and
diagnosis.
The breast is made up of different tissue, ranging from very fatty tissue to very dense
tissue. Within this tissue is a network of lobes. Each lobe is made up of tiny, tube-like
structures called lobules that contain milk glands. Tiny ducts connect the glands, lobules, and
lobes, carrying milk from the lobes to the nipple. The nipple is located in the middle of the
areola, which is the darker area that surrounds the nipple. Blood and lymph vessels also run
throughout the breast. Blood nourishes the cells. The lymph system drains bodily waste
products. The lymph vessels connect to lymph nodes, the tiny, bean-shaped organs that help
fight infection.
Breast cancer has been identified as one of the leading causes of death in developing
countries. Earlier detection can reduce the death rate and shorten the treatment phase. Because
breast cancer is a medical condition, detecting it requires a medical diagnosis, and
computer-aided tools are adopted to make that process much simpler. The objective of this work
is to classify the given data set into the two types of breast tumor, benign or malignant.
A mass of abnormal tissue is known as a tumor. Breast tumors are classified into two
types: benign, those that are non-cancerous, and malignant, those that are cancerous.
Benign tumors: generally, these tumors are not aggressive toward surrounding tissue, although
they may occasionally continue to grow. Malignant tumors: these are cancerous and aggressive
because they invade and damage surrounding tissue.
K-Nearest Neighbors (K-NN) is one of the most prominent classification algorithms
because it is simple, effective, and often competitive in accuracy with other classification
algorithms. It makes no assumptions about the underlying data distribution.
Diagnosis at advanced stages of disease contributes to the high mortality rate among
women due to breast cancer, which can be attributed to low levels of awareness, cumbersome
referral pathways to diagnosis, limited access to effective treatment at regional cancer centres
and incomplete treatment regimens. With the rising breast cancer incidence in India and
disproportionately higher mortality, it is essential to understand the level of cancer literacy,
especially since the average age at diagnosis is 10 years younger than among women in Western
countries. An assessment of existing levels of cancer awareness is a pre-requisite for planning
comprehensive health programmes, early detection and treatment campaigns, that effectively
engage communities of women and men.
The main aim of this project is to help women diagnose breast cancer (BC) from their
cell reports and to identify its type: non-dangerous (benign) and treatable, or dangerous
(malignant). Knowing the type helps them decide what measures to take at an early stage.
The project helps women to identify whether a breast tumor is malignant or benign. The
user enters the data from the cell report, and the machine learning algorithm helps to diagnose
it. The K-NN algorithm compares the entered cell data with the most similar previously labelled
cases and assigns the type held by the majority of them, so that preventive measures can be
taken at an early stage.
According to reports, 23% of women in India are still not aware of breast cancer, and
detecting BC in its early stages is difficult. This is the main reason the disease becomes
severe as it grows. Cancer cells have a high division rate, so the tumor grows quite fast,
which is why BC that is detected late carries a high mortality rate.
Because many women are not aware of the early stages of cancer, the disease is often
detected late, which leads to deaths. Diagnosis after initial treatment is time consuming and
treatment is costly, so it takes time for a patient to learn the cancer type, and treatment can
begin only once the type is known.
To help women determine the cancer type, we are building an application that uses
machine learning techniques to identify it. By entering the values requested by the application
from their reports, they can check the type in the early stages of the disease, which can
decrease the death rate due to breast cancer.
CHAPTER 2
Many studies in the field of breast cancer detection have proposed a range of concepts,
methods, and algorithms for detecting the disease. Some of them are discussed here.
A. Image Processing:
1. Mammography:
2. MRI:
MRI uses the hydrogen nucleus (a single proton) for imaging because this nucleus is
abundant in water and fat. The magnetic property of the hydrogen nucleus is used to yield
detailed images of any part of the body. The patient being examined is placed in a magnetic
field, and a radio frequency wave is applied to generate high contrast images of the breast. In
dynamic contrast enhanced-MRI (DCE-MRI), a contrast agent is injected before the images are
captured. This technique has been found to be more complex than mammography.
The use of the ANN proved to give better diagnostic performance than the radiologists
when the network output was compared to the radiologists’ categorical assessment. Both
used ANNs to predict malignancy from different mammographic elements as inputs. The
accuracies were significant and improved by 3–5% compared with conventional expert
judgment. Thus, more than twenty years ago, ANNs had already proved effective in BC diagnosis
and prognosis. Although the ANN has been shown to be a good predictor in pattern classification
problems, it is not easily explained and has been regarded as a “black box”. To give a better
understanding of the ANN, a three-phase algorithm has been proposed that unveils its workings
by building a weight-decay BP network, deleting insignificant connections, and extracting rules
by recursively discretizing the activation values of the hidden units. In a series of tests,
the rules from this pruned network kept the accuracy as high as the rules from the standard
ANN. A year later, a modified pruned network based on this method was presented, with fewer
connections between neurons and higher accuracy.
In machine learning, support vector machines are supervised learning models with
associated learning algorithms that analyze data used for classification and regression analysis.
Given a set of training examples, each marked as belonging to one or the other of two
categories, an SVM training algorithm builds a model that assigns new examples to one
category or the other, making it a non-probabilistic binary linear classifier (although methods
such as Platt scaling exist to use SVM in a probabilistic classification setting). An SVM model
is a representation of the examples as points in space, mapped so that the examples of the
separate categories are divided by a clear gap that is as wide as possible. New examples are
then mapped into that same space and predicted to belong to a category based on which side of
the gap they fall.
CHAPTER 3
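No.  Attribute                      Domain
1    Clump Thickness                1 - 10
2    Uniformity of Cell Size        1 - 10
3    Uniformity of Cell Shape       1 - 10
4    Marginal Adhesion              1 - 10
5    Single Epithelial Cell Size    1 - 10
6    Bare Nuclei                    1 - 10
7    Bland Chromatin                1 - 10
8    Normal Nucleoli                1 - 10
9    Mitoses                        1 - 10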
Clump thickness: Benign cells tend to be grouped in monolayers, while cancerous cells are
often grouped in multilayers.
Uniformity of cell size/shape: Cancer cells tend to vary in size and shape. That is why these
parameters are valuable in determining whether the cells are cancerous or not.
Marginal adhesion: Normal cells tend to stick together. Cancer cells tend to lose this ability,
so loss of adhesion is a sign of malignancy.
Single epithelial cell size: This is related to the uniformity mentioned above. Epithelial
cells that are significantly enlarged may be malignant.
Bare nuclei: This is a term used for nuclei that are not surrounded by cytoplasm (the rest of
the cell). Those are typically seen in benign tumours.
Bland chromatin: Describes a uniform "texture" of the nucleus seen in benign cells. In cancer
cells the chromatin tends to be coarser.
Normal nucleoli: Nucleoli are small structures seen in the nucleus. In normal cells the
nucleolus is usually very small, if visible at all. In cancer cells the nucleoli become more
prominent, and sometimes there are more of them.
K-Nearest Neighbor:
The k-nearest neighbors (k-NN) algorithm is one of the most widely used algorithms in machine
learning and one of the most central ML techniques in classification. It is an instance-based,
non-parametric, lazy learning method: it has no training phase, and the model consists of the
training sample together with a distance function and a rule that chooses the class from the
classes of the nearest neighbors. k-NN considers only the neighbors around the object, not the
underlying data distribution.
To classify a new element, it is compared to the other elements using a similarity measure.
Its k nearest neighbors are then considered, and the class that appears most often among them
is assigned to the element. The neighbors may also be weighted by their distance to the new
element.
Figure 1 gives an example of k-NN for BC diagnosis and prognosis. If k = 3, the test sample is
assigned to malignant BC because two of its three nearest neighbors, those inside the inner
circle, are malignant and only one is benign. If k = 5, the test sample is assigned to benign
BC.
Figure 1: k-Nearest neighbor for breast cancer diagnosis. The green circle is the test sample,
red triangles are malignant BC, and blue squares are benign BC.
k-NN related algorithms have a number of applications in BC diagnosis and prognosis. The
quality of the classification depends on the selection of k. In 2000, the k-NN and fuzzy k-NN
algorithms were implemented to classify the WBCD. Values of k from 1 to 15 were considered,
and the best performance was obtained with k = 1.
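The voting rule described above can be illustrated with a minimal Python sketch. The function names and the six two-dimensional points are hypothetical, chosen only to reproduce the situation in the figure, where the predicted class changes between k = 3 and k = 5; they are not taken from the WBCD.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k):
    """Return the majority label among the k training points nearest to query.

    train is a list of (features, label) pairs.
    """
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical points: two malignant cases lie closest to the query,
# but benign cases are in the majority once k grows to 5.
train = [
    ((1.0, 0.0), "malignant"),
    ((0.0, 1.2), "malignant"),
    ((-1.5, 0.0), "benign"),
    ((0.0, -2.0), "benign"),
    ((2.2, 0.0), "benign"),
    ((5.0, 5.0), "malignant"),
]
query = (0.0, 0.0)

print(knn_predict(train, query, k=3))  # malignant (2 malignant vs 1 benign)
print(knn_predict(train, query, k=5))  # benign (3 benign vs 2 malignant)
```

The sketch uses an unweighted vote; a distance-weighted vote would replace the counter with a sum of weights per class.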
CHAPTER 4
Methodology
The k-nearest neighbors (KNN) algorithm is one of the simplest similarity-based machine
learning algorithms and performs well in some contexts. The value of k must be chosen a
priori; various techniques have been proposed to select it, such as cross-validation and
heuristics. The value should not be a multiple of the number of classes, to avoid tie votes.
Thus, for binary classification, k should be odd so that a majority necessarily emerges.
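The tie-vote argument can be checked with a short Python sketch. The helper names `vote_counts` and `candidate_k_values` are hypothetical and serve only to illustrate the rule stated above.

```python
from collections import Counter


def vote_counts(neighbor_labels):
    """Count class votes and report whether the top two classes are tied."""
    counts = Counter(neighbor_labels).most_common()
    tied = len(counts) > 1 and counts[0][1] == counts[1][1]
    return counts, tied


def candidate_k_values(max_k, n_classes=2):
    """Values of k up to max_k that are not multiples of the class count."""
    return [k for k in range(1, max_k + 1) if k % n_classes != 0]


# With k = 4 the two classes can split the vote evenly.
_, tied_even = vote_counts(["benign", "malignant", "benign", "malignant"])
# With k = 5 and two classes, one class always holds a majority.
_, tied_odd = vote_counts(["benign", "malignant", "benign", "malignant", "benign"])

print(tied_even, tied_odd)    # True False
print(candidate_k_values(9))  # [1, 3, 5, 7, 9]
```

The candidate list can then be searched by cross-validation, keeping the value of k with the best validation score.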
• The proposed system is a cancer prediction system built on the k-nearest neighbor
algorithm, with Python as the base language.
• It aims to reduce the error caused by human intervention in cancer prediction and to increase
the accuracy of prediction and diagnosis of the disease.
• The K-Nearest Neighbour (K-NN) algorithm is used because it can provide high levels of
accuracy in prediction.
The proposed methodology is shown in the figure. The main idea is to use the k-NN algorithm to
predict the class labels in the test set. Then, for each classifier, the conformal prediction
algorithm is applied to calculate the nonconformity score of each prediction, which is used to
calculate the confidence. The conformal prediction algorithm is fully described.
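The report does not state which nonconformity measure is used, so the following Python sketch rests on assumptions: it uses split (inductive) conformal prediction, and it scores a candidate label by the ratio of summed distances to the k nearest same-label points over the k nearest other-label points, a common choice for k-NN. All names and data points are hypothetical.

```python
import math


def nonconformity(sample_set, features, label, k=3):
    """Ratio of summed distances to the k nearest same-label and other-label points.

    Larger values mean the (features, label) pair looks less typical.
    """
    same = sorted(math.dist(f, features) for f, y in sample_set if y == label)[:k]
    other = sorted(math.dist(f, features) for f, y in sample_set if y != label)[:k]
    return sum(same) / (sum(other) or 1e-12)


def p_value(calibration_scores, test_score):
    """Fraction of scores (calibration plus the test one) at least as large."""
    at_least = sum(score >= test_score for score in calibration_scores)
    return (at_least + 1) / (len(calibration_scores) + 1)


def conformal_predict(train, calibration, features, labels=("benign", "malignant")):
    """Return (predicted label, confidence) for one test point."""
    cal_scores = [nonconformity(train, f, y) for f, y in calibration]
    p = {y: p_value(cal_scores, nonconformity(train, features, y)) for y in labels}
    ranked = sorted(p, key=p.get, reverse=True)
    # Confidence is one minus the second-largest p-value.
    return ranked[0], 1 - p[ranked[1]]


# Hypothetical, well-separated 2-D data for illustration only.
train = [((1, 1), "benign"), ((1, 2), "benign"), ((2, 1), "benign"), ((2, 2), "benign"),
         ((8, 8), "malignant"), ((8, 9), "malignant"), ((9, 8), "malignant"),
         ((9, 9), "malignant")]
calibration = [((2.5, 2.5), "benign"), ((0.5, 0.5), "benign"),
               ((7.5, 7.5), "malignant"), ((9.5, 9.5), "malignant")]

label, confidence = conformal_predict(train, calibration, (1.5, 1.5))
print(label, round(confidence, 3))  # benign 0.8
```

With only four calibration points the smallest possible p-value is 1/5, so the confidence cannot exceed 0.8 here; a larger calibration set allows finer confidence values.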
K-Nearest Neighbors (KNN) is a lazy classifier that is widely used in data mining
applications. In this work, we implement the KNN algorithm using Euclidean distance as the
similarity measure, with K = 7.
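For reference, the Euclidean distance between two samples can be written as follows. The symbols x, y, and n are notation introduced here rather than taken from the report; n is the number of attributes per sample.

```latex
d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

% Worked example with n = 3, x = (5, 1, 1), y = (2, 1, 5):
d(\mathbf{x}, \mathbf{y}) = \sqrt{(5-2)^2 + (1-1)^2 + (1-5)^2} = \sqrt{9 + 0 + 16} = 5
```

A smaller distance means two samples are more similar, so the K = 7 training samples with the smallest distance to a new sample are the ones that vote on its class.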
The study is conducted on the Wisconsin Breast Cancer Dataset (WBCD) from the UCI
repository [18]. The dataset has 699 clinical cases, each labeled as malignant (cancerous)
or benign (non-cancerous). There are 241 malignant cases (34.5%) and 458 benign cases (65.5%).
Sixteen samples (cases) have missing values; removing them reduces the sample size to 683.
Every sample has 11 features (Table 1). The first feature is the sample id, and the last is a
class label that takes two values: 2 for benign and 4 for malignant.
In practice, stratified 10-fold cross-validation has proved to be one of the best evaluation
methods because of its low bias and variance. After the dataset is divided into ten folds, the
first fold is selected for testing and the other nine folds are combined for training. Each run
therefore has about 69 test samples and 614 training samples, of which 215 are positive
(malignant) and 399 are negative (benign). The standard deviations of the positive and negative
classes are 8.269 and 3.143, respectively.
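The dataset preparation described above can be sketched in Python as follows. The three rows are illustrative and are not copied from the dataset, and the sketch assumes that the file is comma-separated and that a missing value is marked with "?"; the function and variable names are hypothetical.

```python
import csv
import io

# Illustrative rows in the layout described above:
# id, nine attributes (1-10), class (2 = benign, 4 = malignant).
# Assumption: a "?" marks a missing value.
RAW = """\
1000001,5,1,1,1,2,1,3,1,1,2
1000002,8,4,5,1,2,?,7,3,1,4
1000003,8,10,10,8,7,10,9,7,1,4
"""

LABELS = {"2": "benign", "4": "malignant"}


def load_samples(text):
    """Parse rows, drop those with missing values, and strip the id column."""
    samples = []
    for row in csv.reader(io.StringIO(text)):
        if not row or "?" in row:
            continue  # skip blank lines and incomplete cases
        features = [int(value) for value in row[1:-1]]
        samples.append((features, LABELS[row[-1]]))
    return samples


samples = load_samples(RAW)
print(len(samples))  # 2
print(samples[0])    # ([5, 1, 1, 1, 2, 1, 3, 1, 1], 'benign')
```

Dropping the id column matters because it carries no diagnostic information but would otherwise dominate the distance calculation.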
Every sample in the test fold is then classified by finding its K nearest samples in the
training set, and the accuracy, sensitivity, and specificity are measured for that fold. This
process is repeated ten times, with each fold selected exactly once for testing, which yields
ten values each for accuracy, sensitivity, and specificity.
To make the outcome more reliable, these steps are repeated 100 times, with the samples
randomly reassigned to the folds each time.
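The evaluation procedure above can be sketched in plain Python. This is an illustration under assumptions rather than the project's implementation: the data is synthetic and well separated (it is not the WBCD), and all function names are hypothetical. Sensitivity is computed as TP / (TP + FN) and specificity as TN / (TN + FP), with malignant as the positive class.

```python
import math
import random
from collections import Counter, defaultdict


def knn_predict(train, query, k):
    """Majority label among the k training samples nearest to query."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def stratified_folds(samples, n_folds, rng):
    """Shuffle each class separately and deal its samples round-robin into folds."""
    by_class = defaultdict(list)
    for sample in samples:
        by_class[sample[1]].append(sample)
    folds = [[] for _ in range(n_folds)]
    for members in by_class.values():
        rng.shuffle(members)
        for i, sample in enumerate(members):
            folds[i % n_folds].append(sample)
    return folds


def evaluate_fold(train, test, k, positive="malignant"):
    """Return (accuracy, sensitivity, specificity) for one test fold."""
    tp = tn = fp = fn = 0
    for features, actual in test:
        predicted = knn_predict(train, features, k)
        if actual == positive:
            tp += predicted == positive
            fn += predicted != positive
        else:
            tn += predicted != positive
            fp += predicted == positive
    accuracy = (tp + tn) / len(test)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return accuracy, sensitivity, specificity


def cross_validate(samples, k, n_folds=10, seed=0):
    """Run stratified n-fold cross-validation; return one score triple per fold."""
    folds = stratified_folds(samples, n_folds, random.Random(seed))
    scores = []
    for i, test in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        scores.append(evaluate_fold(train, test, k))
    return scores


# Synthetic stand-in data: 40 benign and 20 malignant samples, nine attributes each.
rng = random.Random(1)
samples = [([rng.randint(1, 3) for _ in range(9)], "benign") for _ in range(40)]
samples += [([rng.randint(7, 10) for _ in range(9)], "malignant") for _ in range(20)]

scores = cross_validate(samples, k=7)
mean_accuracy = sum(s[0] for s in scores) / len(scores)
print(len(scores), mean_accuracy)  # 10 1.0
```

Calling `cross_validate` with a different `seed` reassigns the samples to the folds, so the 100 repetitions correspond to a loop over 100 seed values.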
The user inputs the cell parameters listed in the report. These parameters determine the
predicted type of cancer.
First, the sensitivity, specificity, and accuracy for different values of K between 1 and 614
are reported to show the individual performance of the classifier on the positive and negative
classes. After that, the maximum values, minimum values, and standard deviations for the
positive and negative classes are examined to show the stability of the classifier over both
classes. The system then detects the cancer type from the training dataset and the data entered
by the user.
Various factors and variables define cancer cells. The genomic data is collected with
biological knowledge and stored in a database, which is collectively called the dataset. The
purpose of this module is to connect to the dataset so that it can be processed to predict
cancer. There are 32 variables that contribute to the tumor’s initiation and progression, which
are recorded and stored in the dataset; they include the radius, texture, volume, size, etc. of
the cancer cells.
Hardware:
Software
Chapter 5
Design Of System
There are 10 variables that contribute to the tumor’s initiation and progression, which
are recorded and stored in the dataset; they include the radius, texture, volume, size, etc.
of the cancer cells. The dataset is uploaded as a training set into the R language environment,
and the kNN algorithm is applied to it to obtain the predicted outcome. The input is taken from
the user in the form of cell parameters. The architecture of the system is kept as simple as
possible to make it accessible to a wide range of consumers and to maintain a simple user
interface.
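The input step can be sketched in Python as follows. This sketch assumes that the user enters the nine cytological attributes described in Chapter 3, each as an integer from 1 to 10, and that K = 7 as stated in the methodology; the function names and the training cases are hypothetical and for illustration only.

```python
import math
from collections import Counter

ATTRIBUTES = [
    "Clump Thickness", "Uniformity of Cell Size", "Uniformity of Cell Shape",
    "Marginal Adhesion", "Single Epithelial Cell Size", "Bare Nuclei",
    "Bland Chromatin", "Normal Nucleoli", "Mitoses",
]


def validate_report(values):
    """Check that the user supplied one integer in 1..10 per attribute."""
    if len(values) != len(ATTRIBUTES):
        raise ValueError(f"expected {len(ATTRIBUTES)} values, got {len(values)}")
    for name, value in zip(ATTRIBUTES, values):
        if not isinstance(value, int) or not 1 <= value <= 10:
            raise ValueError(f"{name} must be an integer from 1 to 10, got {value!r}")
    return list(values)


def diagnose(training_set, report_values, k=7):
    """Classify a validated cell report by majority vote of its k nearest cases."""
    query = validate_report(report_values)
    nearest = sorted(training_set, key=lambda s: math.dist(s[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical training cases, eight per class, for illustration only.
training_set = [([1 + i % 3] * 9, "benign") for i in range(8)]
training_set += [([8 + i % 3] * 9, "malignant") for i in range(8)]

print(diagnose(training_set, [2, 1, 1, 1, 2, 1, 3, 1, 1]))     # benign
print(diagnose(training_set, [9, 10, 10, 8, 7, 10, 9, 7, 8]))  # malignant
```

Validating the input before classification keeps the interface simple: a value outside the 1 to 10 range is rejected with a message that names the attribute, instead of silently producing a prediction.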
Chapter 6
We have proposed a Breast Cancer Detection System based on the KNN algorithm, a machine
learning technique. This approach increases the chances of detecting breast cancer in its
early stages, so that women can start treatment early and improve their chances of being cured
or reduce the risk of death. This will in turn reduce the mortality rate and help spread
awareness of breast cancer diagnosis among women.
References
1. Arash Roshanpoor, Reza Safdari, The Performance of K-Nearest Neighbors on Malignant and
Benign Classes, 2017.
2. Dheeraj R., Hariprasath R., Akshay Kannan V., Nishanth Kumar S., Cancer Prediction Using
KNN, 2018.
3. Moh'd Rasoul Al-Hadidi, Abdulsalam Alarabeyyat, Mohannad Alhanahnah, Breast Cancer
Detection Using K-Nearest Neighbor Machine Learning Algorithm, 2016.
4. Seyyid Ahmed Medjahed, Tamazouzt Ait Saadi, Abdelkader Benyettou, Breast Cancer Diagnosis
by Using k-Nearest Neighbor with Different Distances and Classification Rules, Int J Comput
Appl, 2013;62.
5. H. Zhang, T. Arslan, B. Flynn, A Single Antenna Based Microwave System for Breast Cancer
Detection: Experimental Results, IEEE, 2013.
6. S. Gupta, D. Kumar, A. Sharma, Data Mining Classification Techniques Applied for Breast
Cancer Diagnosis and Prognosis.