
Research Paper

The document outlines a Breast Cancer Prediction System that employs machine learning techniques to enhance early diagnosis by analyzing tumor characteristics to classify them as benign or malignant. Utilizing the Wisconsin Breast Cancer Dataset, the system focuses on data preprocessing, model training, and performance evaluation to support healthcare professionals in clinical decision-making. The study emphasizes the importance of accuracy, simplicity, and reliability in the prediction process to improve patient outcomes.

Breast Cancer Prediction System

Ruchi Kumari and Shrishti

Department of Artificial Intelligence and Machine Learning
Buddha Institute of Technology
Gorakhpur, India
[email protected] [email protected]

Abstract: This paper presents a Breast Cancer Prediction System designed to assist in
early diagnosis by utilizing machine learning techniques. The system analyzes medical data such as
tumor characteristics to classify whether a tumor is benign or malignant. By training predictive models
on historical datasets, the system aims to improve accuracy in diagnosis and support healthcare
professionals in decision-making. The project emphasizes simplicity, accuracy, and reliability in a
user-friendly interface to promote early detection and treatment of breast cancer.

Keywords: Breast cancer, Prediction, Machine learning, Diagnosis, Tumor classification, Support
Vector Machine

I.​ INTRODUCTION

The field of breast cancer prediction is a critical domain within medical research and machine
learning. Accurate and early detection of breast cancer significantly increases the chances of
successful treatment and survival. As a result, researchers are leveraging advanced artificial
intelligence (AI) techniques—such as machine learning (ML), deep learning (DL), and data
mining—to develop predictive models that analyze various clinical and diagnostic datasets.

This paper focuses on implementing a breast cancer prediction system using supervised learning
models, particularly those trained on benchmark datasets like the Wisconsin Breast Cancer
Dataset (WBCD). The study outlines data preprocessing techniques, model training, validation
metrics, and system performance analysis. While elements like tables, figures, and mathematical
expressions may vary by research, this paper incorporates them using standard formatting
guidelines to maintain clarity and readability.
Through the integration of AI and healthcare data, the paper aims to contribute to the growing
body of work supporting early diagnosis and treatment planning in oncology, potentially saving
thousands of lives by aiding in clinical decision-making.

II.​ METHODOLOGY
A.​Data Source
The foundation of the Breast Cancer Prediction System lies in the quality and relevance of the
dataset used. For this study, the Wisconsin Breast Cancer Dataset (WBCD) is selected due to
its wide acceptance in the machine learning and medical research communities. The dataset is
publicly available from the UCI Machine Learning Repository, which serves as a reputable
source of benchmark datasets for classification and prediction tasks.
The WBCD contains 569 instances with 32 attributes: an ID number, the diagnosis label, and 30
real-valued features derived from digitized images of fine needle aspirates (FNA) of breast
masses. These features describe characteristics of the cell nuclei present in the images, such as:
●​ Radius (mean of distances from the center to points on the perimeter)
●​ Texture (standard deviation of gray-scale values)
●​ Perimeter
●​ Area
●​ Smoothness
●​ Compactness
●​ Concavity
●​ Symmetry, and others.
Each instance is labeled as either:
●​ Benign (0) or
●​ Malignant (1)
The dataset does not contain any missing values, which makes it ideal for training machine
learning models without the need for complex data imputation techniques.
The use of this dataset enables a consistent benchmarking approach for evaluating the accuracy
and efficiency of various supervised learning algorithms, including Support Vector Machines
(SVM), Logistic Regression, Decision Trees, and Random Forests. The dataset is divided into
training (80%) and testing (20%) subsets so that models are evaluated on data they were not
trained on, which makes overfitting visible.
This data-driven approach ensures that the prediction system is trained on reliable, medically
significant patterns and can provide early and accurate diagnosis support for healthcare
professionals.
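
The 80/20 split described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code. The function name `stratified_split`, the fixed seed, and the choice to stratify by class are assumptions. The label vector is a stand-in built from the dataset's class counts (357 benign, 212 malignant).

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with similar class proportions in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class before cutting
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Stand-in labels: 357 benign (0) and 212 malignant (1), as in the WBCD.
labels = [0] * 357 + [1] * 212
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Stratifying keeps the benign-to-malignant ratio roughly the same in both subsets, so the test score is not distorted by an unlucky draw.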

B.​Normalization

In machine learning, the values of input features often vary in magnitude, which can negatively
affect the performance of certain algorithms. For instance, in the Wisconsin Breast Cancer
Dataset, features such as radius, area, and perimeter may have significantly larger values
compared to others, like smoothness or symmetry. Algorithms like Support Vector Machines
or K-Nearest Neighbors, which are sensitive to the scale of data, can produce biased results if
features are not adjusted to a common range.

To overcome this issue, all features in the dataset are transformed to the same numerical range,
typically between 0 and 1. This is achieved using a simple rescaling technique where each
feature value is adjusted based on its minimum and maximum values in the dataset:

X' = (X - min(X)) / (max(X) - min(X))

This rescaling puts every feature on a comparable scale during model training and helps prevent
features with large numeric ranges from dominating the outcome. It can also improve numerical
stability and convergence during optimization.

With all input features in a common range, scale-sensitive models such as SVM and KNN are
better suited to reliable breast cancer prediction.
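
A minimal sketch of this rescaling is shown below, assuming the minimum and maximum are computed from the training rows only and then reused for the test rows, so that no information leaks from the test set. The helper names `fit_min_max` and `apply_min_max` and the numeric values are made up for illustration.

```python
def fit_min_max(rows):
    """Compute the per-feature (min, max) pairs from the training rows."""
    return [(min(col), max(col)) for col in zip(*rows)]


def apply_min_max(rows, bounds):
    """Apply X' = (X - min) / (max - min) feature by feature."""
    scaled = []
    for row in rows:
        new_row = []
        for x, (lo, hi) in zip(row, bounds):
            # A constant feature has hi == lo; map it to 0.0 to avoid dividing by zero.
            new_row.append(0.0 if hi == lo else (x - lo) / (hi - lo))
        scaled.append(new_row)
    return scaled


# Made-up values for a radius-like and a smoothness-like feature.
train = [[10.0, 0.08], [15.0, 0.10], [20.0, 0.12]]
test = [[25.0, 0.09]]

bounds = fit_min_max(train)
train_scaled = apply_min_max(train, bounds)
test_scaled = apply_min_max(test, bounds)
print(train_scaled)
print(test_scaled)
```

Training values always land in [0, 1], but a test value outside the training range (the radius of 25.0 here) maps outside that interval. This is expected when the bounds come from the training data alone.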

III.​ MATERIALS AND METHODS

This research was carried out in three phases. The first phase consisted of a literature review of
existing research on classifying cancerous and non-cancerous breast cells. The second phase
covered designing, developing, and carrying out the experiment. The third phase covered the
presentation, analysis, and discussion of the findings. Fig. 1 shows the framework of the research.

Fig. 1. Research Framework

A.​Abbreviations and Acronyms

Artificial Intelligence (AI) techniques are widely applied in Breast Cancer Prediction (BCP)
systems. Machine Learning (ML) and Deep Learning (DL) are two major branches used in
building predictive models. Computer-Aided Diagnosis (CAD) systems often utilize
Convolutional Neural Networks (CNNs), Artificial Neural Networks (ANNs), Support Vector
Machines (SVMs), K-Nearest Neighbors (KNNs), Decision Trees (DTs), and Random Forests
(RFs) for classification and prediction tasks.

Diagnostic methods like Fine Needle Aspiration Cytology (FNAC), Magnetic Resonance
Imaging (MRI), and Ultrasonography (USG) help in the early detection of Breast Cancer (BC).
The Breast Imaging-Reporting and Data System (BI-RADS) is used for standardizing
mammography results. Common cancer types include Invasive Ductal Carcinoma (IDC) and
Invasive Lobular Carcinoma (ILC). Estrogen Receptor (ER), Progesterone Receptor (PR), and
Human Epidermal Growth Factor Receptor 2 (HER2) status are key biomarkers used in
diagnosis.

The Wisconsin Breast Cancer Dataset (WBCD) is widely used for training ML models. Model
evaluation uses metrics such as Accuracy (ACC), Receiver Operating Characteristic (ROC),
Area Under Curve (AUC), F1 Score (F1), True Positive (TP), True Negative (TN), False Positive
(FP), and False Negative (FN). Principal Component Analysis (PCA) is used for dimensionality
reduction, and Root Mean Squared Error (RMSE) measures prediction error.

In genetic testing, mutations in Breast Cancer Gene 1 (BRCA1) and Breast Cancer Gene 2
(BRCA2) are indicators of hereditary risk. Other terms include Deoxyribonucleic Acid (DNA),
Ribonucleic Acid (RNA), and Single Nucleotide Polymorphism (SNP), which are essential in
genomic analysis.

B.​Units
In breast cancer prediction systems, standard SI (International System of Units) units are primarily
used to maintain consistency and scientific clarity. The following units are commonly used in
data collection, feature extraction, image processing, and model evaluation:
●​ Millimeters (mm) – Used for measuring tumor size, cell nucleus radius, perimeter, and
other morphological features.
●​ Micrometers (µm) – Used in microscopic image analysis to assess fine structures like
nuclei or mitotic figures.
●​ Grams (g) – Sometimes used in biopsy sample weight or tissue mass analysis.
●​ Degrees Celsius (°C) – Used in thermal imaging techniques and to report temperature
during imaging processes.
●​ Pixels – Used in digital mammography and histopathological image resolution.
●​ Seconds (s) or Milliseconds (ms) – Applied in processing time measurements and
algorithm evaluation.
●​ Volts (V) – Occasionally used in electrical biosensing equipment or imaging systems.
●​ Hertz (Hz) – Used in ultrasound or MRI frequency measurements.
●​ Percentage (%) – Used to express accuracy, sensitivity, specificity, precision, and recall.
●​ Area Under Curve (AUC) – A unitless metric used to evaluate the performance of
classification models via ROC curves.
●​ Units for concentration – Such as ng/mL (nanograms per milliliter) or pg/mL
(picograms per milliliter) for biomarker levels (e.g., HER2, ER, PR).
Ensure consistency and clarity by using a space between the number and the unit (e.g., 15
mm, 98 %, 3.5 s).

C.​Equations
Equations are used extensively in breast cancer prediction to normalize data, calculate
evaluation metrics, and model relationships between features and outcomes. Equations should be
numbered consecutively and aligned to the right. Symbols used in equations should be defined
immediately after they appear.
1) Data Normalization
To bring features onto a common scale, min-max normalization is commonly used:

X' = (X - min(X)) / (max(X) - min(X))


Where:
● X is the original feature value,
● min(X) and max(X) are the minimum and maximum values of that feature in the dataset, respectively,
● X' is the normalized feature value (scaled between 0 and 1).

2) Accuracy
Accuracy is a key metric used to evaluate model performance, expressed here as a percentage:
Accuracy = (TP + TN) / (TP + TN + FP + FN) × 100
Where:
● TP = True Positives,
● TN = True Negatives,
● FP = False Positives,
● FN = False Negatives.

3) Precision and Recall
● Precision = TP / (TP + FP)
● Recall = TP / (TP + FN)

4) F1 Score
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
These metrics provide insight into the effectiveness of the classification model, especially in
imbalanced datasets common in medical diagnostics.
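
As a worked illustration of the formulas above, the following sketch computes all four metrics from confusion-matrix counts. The function name `classification_metrics` and the counts are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy (in percent), precision, recall, and F1 from raw counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total * 100 if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for a 113-sample test set (not measured results).
m = classification_metrics(tp=40, tn=68, fp=3, fn=2)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

The guards against zero denominators matter in practice: a model that never predicts the positive class has TP + FP = 0, and precision is then undefined rather than zero.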

D. Some Common Mistakes

● Relying only on accuracy: Accuracy alone can be misleading, especially with
imbalanced datasets. Important metrics like precision, recall, F1-score, and AUC should
also be used.
●​ Ignoring data imbalance: Many breast cancer datasets have more benign cases than
malignant ones. Not addressing this imbalance can lead to biased models.
●​ Skipping normalization or feature scaling: Algorithms like SVM, KNN, and ANN
require input features to be on the same scale. Without normalization, model performance
may suffer.
● Overfitting the model: Evaluating a model only on the data it was trained on hides
overfitting. Cross-validation or a separate test set should always be used.
●​ Not defining abbreviations: Abbreviations like CNN, FNAC, or WBCD should be
defined the first time they are used in the main body of the text, even if used in the
abstract.
●​ Improper equation formatting: Equations should be written clearly, numbered properly,
and each variable should be defined for clarity.
●​ Inconsistent use of units: Always use standard SI units like millimeters (mm), grams (g),
degrees Celsius (°C), and ensure consistency throughout the document.
●​ Lack of dataset description: The dataset source, size, features, and class distribution
must be described to ensure transparency and reproducibility.
●​ Using regression metrics for classification problems: Metrics like RMSE are meant for
regression. For classification, accuracy, precision, recall, and F1-score are more
appropriate.
●​ Missing key sections: Sections such as methodology, evaluation, and limitations are
sometimes incomplete or missing, weakening the report structure.
●​ Excessive jargon without explanation: Using too many technical terms without proper
explanation can make the content hard to understand for a broader audience.
●​ Failing to cite sources: Not providing proper references for datasets, algorithms, or
previous research reduces the credibility of the work.
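
The cross-validation mentioned in the overfitting item above can be sketched as a k-fold index generator. This is an illustrative outline only. The function name `k_fold_indices` and the choice of k = 5 are assumptions, and the samples are assumed to have been shuffled beforehand.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs; every sample is used for validation exactly once."""
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# 569 is the number of instances in the WBCD.
folds = list(k_fold_indices(569, k=5))
for train_idx, val_idx in folds:
    print(len(train_idx), len(val_idx))
```

A model would be trained on each `train_idx` and scored on the matching `val_idx`. Averaging the k scores gives a more stable estimate than a single split.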

VII.​ PERFORMANCE MEASUREMENT

This study assessed the predictive performance of the model using metrics such as accuracy,
precision, and recall scores, which are derived from the confusion matrix generated after the
classifier model's predictions. Accuracy measures the frequency of correct predictions,
calculated by dividing the sum of true negatives (TN) and true positives (TP) by the total
number of predictions. Precision evaluates the correctness of positive predictions, indicating
how many of the positive predictions were accurate out of all positive predictions made.
Precision is obtained by dividing TP by the sum of TP and false positives (FP). Recall measures
the ratio of correctly classified positive instances to the total number of actual positive instances.
It is calculated by dividing TP by the sum of TP and false negatives (FN).
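
The confusion-matrix counts that these definitions rely on can be derived from true and predicted labels as in the sketch below. It assumes malignant (1) is the positive class. The function name `confusion_counts` and the label vectors are invented for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN), treating `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


# Invented labels: 1 = malignant, 0 = benign.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)
print("accuracy:", (tp + tn) / len(y_true))
```

In a diagnostic setting the FN count deserves particular attention, since a false negative corresponds to a malignant tumor classified as benign.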


Fig. 2. Description of the confusion matrix (source: Gyamfi and Missah (2017))

VIII. RESULTS AND DISCUSSION

After applying PCA to the dataset, the dimensionality of the data is reduced to two new
attributes, principal component 1 and principal component 2. The legend entries M and B in
Fig. 3 denote the malignant and benign classes. An overview of the transformed dataset is shown in Fig. 3.

Fig. 3. The distribution of the data after PCA

In the figure above, the diagonal boundary between the two classes is clearly visible, which
suggests that a classifier can readily separate the malignant and benign classes even after
PCA has reduced the data to two dimensions. The PCA-transformed data is used in the next step,
which applies a boosting ensemble method with SVM as the base classifier to assess the
efficiency of the ensemble method on this dataset.
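
To make the reduction to two principal components concrete, the sketch below projects a small made-up data matrix onto its first two components. It uses power iteration with deflation in pure Python. This is an illustrative assumption, not the procedure used in the study, and a real pipeline would normally rely on a library implementation. All function names and data values are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def center(rows):
    """Subtract the per-feature mean from every row."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    cols = list(zip(*centered))
    n = len(centered)
    return [[dot(ci, cj) / (n - 1) for cj in cols] for ci in cols]


def top_eigenvector(matrix, iterations=1000, seed=0):
    """Approximate the dominant eigenvector by power iteration."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in matrix]
    for _ in range(iterations):
        w = [dot(row, v) for row in matrix]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def pca_two_components(rows):
    centered = center(rows)
    cov = covariance(centered)
    pc1 = top_eigenvector(cov)
    lam1 = dot(pc1, [dot(row, pc1) for row in cov])  # eigenvalue of pc1
    d = len(cov)
    # Deflation: remove the pc1 direction so the next dominant vector is pc2.
    deflated = [[cov[i][j] - lam1 * pc1[i] * pc1[j] for j in range(d)] for i in range(d)]
    pc2 = top_eigenvector(deflated, seed=1)
    projected = [[dot(r, pc1), dot(r, pc2)] for r in centered]
    return projected, pc1, pc2


# Made-up measurements with three features per sample.
rows = [[2.0, 1.0, 0.5], [4.0, 2.1, 0.4], [6.0, 2.9, 0.7], [8.0, 4.2, 0.3], [10.0, 5.0, 0.6]]
projected, pc1, pc2 = pca_two_components(rows)
for point in projected:
    print([round(x, 3) for x in point])
```

Each sample is now described by two coordinates, which is what allows the data to be drawn as a two-dimensional scatter plot.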

Kernel Parameter Selection


During the kernel selection process, the default values of gamma and C were used for every
experiment. Notably, the sigmoid kernel exhibited a substantial change in performance
before and after applying PCA feature selection, with the accuracy score of the classifier model
increasing from 0.29 to 0.89. This improvement suggests that the sigmoid kernel was
ill-suited to the dataset's type and distribution without PCA feature selection. According to a
study by Rimah et al. [12], the sigmoid kernel may not be positive semi-definite for specific
parameter values, potentially leading to incorrect results and negatively impacting classifier
performance. The discrepancy could therefore be attributed to parameter values that were
unsuitable for the sigmoid kernel on the dataset without PCA feature selection, which had
significantly higher dimensionality than the dataset with PCA feature selection.
Consequently, further experimentation is warranted to assess the sigmoid kernel's suitability
under different conditions of the classifier model.
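
The point about positive semi-definiteness can be checked numerically on a tiny example. The sketch below builds the 2 × 2 Gram matrix of the sigmoid kernel tanh(gamma · <x, y> + coef0) for two one-dimensional points and computes its eigenvalues in closed form. The points and the values of gamma and coef0 are chosen purely for illustration and are not the settings used in the experiments.

```python
import math


def sigmoid_kernel(x, y, gamma, coef0):
    """Sigmoid (tanh) kernel: tanh(gamma * <x, y> + coef0)."""
    return math.tanh(gamma * sum(a * b for a, b in zip(x, y)) + coef0)


def eigenvalues_2x2_symmetric(a, b, d):
    """Eigenvalues of [[a, b], [b, d]] in ascending order."""
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    return mean - radius, mean + radius


# Illustrative points and parameters (assumptions, not the study's settings).
x1, x2 = [0.0], [1.0]
gamma, coef0 = 1.0, -1.0

k11 = sigmoid_kernel(x1, x1, gamma, coef0)
k12 = sigmoid_kernel(x1, x2, gamma, coef0)
k22 = sigmoid_kernel(x2, x2, gamma, coef0)

smallest, largest = eigenvalues_2x2_symmetric(k11, k12, k22)
print("Gram matrix:", [[k11, k12], [k12, k22]])
print("eigenvalues:", smallest, largest)
print("positive semi-definite:", smallest >= 0)
```

A negative eigenvalue means the Gram matrix is not positive semi-definite for these parameters, so the usual guarantees of the SVM optimization problem no longer hold.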

REFERENCES

[1] Breast cancer. (2019). Nature Reviews Disease Primers, 5(1).
https://doi.org/10.1038/s41572-019-0122-z.
[2] Arthur, R., Kirsh, V. A., Kreiger, N., & Rohan, T. (2018). A healthy lifestyle index and its
association with risk of breast, endometrial, and ovarian cancer among Canadian women. Cancer
Causes & Control, 29(6), 485-493. https://doi.org/10.1007/s10552-018-1032-1.
[3] Mehta, D., Mohite, A., Shinde, V., Khatri, R., & Dokare, I. (2022). Detection of breast cancer using
machine learning algorithms. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.4108758.
[4] Lou, S. J., Hou, M. F., Chang, H. T., Chiu, C. C., Lee, H. H., Yeh, S. C. J., & Shi, H. Y. (2020).
Machine learning algorithms to predict recurrence within 10 years after breast cancer surgery: A
prospective cohort study. Cancers, 12(12), 3817. https://doi.org/10.3390/cancers12123817.
[5] Bataineh, A. A. (2019). A comparative analysis of nonlinear machine learning algorithms for breast
cancer detection. International Journal of Machine Learning and Computing, 9(3), 248-254.
https://doi.org/10.18178/ijmlc.2019.9.3.794.
[6] Ganggayah, M. D., Taib, N. A., Har, Y. C., Lio, P., & Dhillon, S. K. (2019). Predicting factors for
survival of breast cancer patients using machine learning techniques. BMC Medical Informatics and
Decision Making, 19(1). https://doi.org/10.1186/s12911-019-0801-4.
[7] Journal of Research in Medical and Dental Sciences, 6(1), 365-368.
https://doi.org/10.5455/jrmds.20186159.
[8] Yadav, A., Jamir, I., Jain, R. R., & Sohani, M. (2019). Breast cancer prediction using SVM with
PCA feature selection method. International Journal of Scientific Research in Computer Science,
Engineering and Information Technology, 969-978. https://doi.org/10.32628/cseit1952277.
[9] Bhavsar, S., Arora, K., Koul, S., & Barapate, S. (2020, November 23). Understanding bagging &
boosting in machine learning. UCI Machine Learning Repository: Breast Cancer Wisconsin
(Diagnostic) Data Set. (n.d.). UCI Machine Learning. Retrieved 2022, from
https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic).
[10] Ghosal, I., & Hooker, G. (2020). Boosting random forests to reduce bias; one-step boosted forest
and its variance estimate. Journal of Computational and Graphical Statistics, 30(2), 493-502.
https://doi.org/10.1080/10618600.2020.1820345.
[11] UCI Machine Learning Repository Website. (n.d.). Retrieved September 30, 2023.
[12] Rimah, R. A., Dorra, D. B. A., & Noureddine, N. E. (2013). Practical selection of SVM supervised
parameters with different feature representations for vowel recognition. International Journal of
Digital Content Technology and Its Applications (JDCTA), 7(9).
https://doi.org/10.4156/jdcta.vol7.issue9.50.
