Lab 3
Deadline: Start of lab, week beginning 29 Nov. Extension policy: NO EXTENSIONS (EVEN ON REQUEST)
In general, a pattern recognition system consists of three modules: pre-processing, feature extraction and classification. As a typical pattern recognition task, a spam filter needs to pre-process text messages to clean up the data, extract salient features that distinguish normal mail from spam (forming a feature vector for each email message), and classify email messages based on those feature vectors in order to make a decision. While pre-processing and feature extraction require knowledge of natural language processing and computational linguistics, classification requires a proper classifier to provide the underpinning technique for decision-making. Bayesian methods have become very widely used for spam filtering, and you may have such a filter on your computer. Mozilla has a Bayesian filter, Spam Assassin uses Bayesian methods, and even Google uses a Bayesian filter to prevent search-engine spammers from dominating search results.

In this lab, you are going to implement naive Bayes classifiers for spam filtering based on a UCI machine learning benchmark database named spambase (http://archive.ics.uci.edu/ml/datasets/Spambase), in which 57 features (attributes with continuous values) have already been extracted for each email message and every instance is labelled as 0 (normal) or 1 (spam). The purpose of this lab is to use Matlab to implement the naive Bayes classification algorithm learned in the module, providing the underpinning techniques for a spam filter that works in different situations. In order to apply the discrete-valued naive Bayes classifier for spam filtering, the spambase database has been modified by further discretising the continuous attribute values. Also, some instances have been re-labelled as 2 to indicate "uncertain", a newly added class.
SET UP YOUR PATH

If it is not already set up, the following directory needs to be added to your Matlab path:

/opt/info/courses/COMP24111/lab3

See the lab sheet from last week on how to do this.

COPY DATAFILES

Go to the directory /opt/info/courses/COMP24111/lab3, where you will find four data files with the suffix .mat:

av2_c2 - a dataset for binary classification where each attribute has two discrete values.
av3_c2 - a dataset for binary classification where each attribute has three discrete values.
av7_c3 - a dataset for three-class classification where each attribute has seven discrete values.
avc_c2 - a dataset for binary classification where all attributes have continuous values, as in spambase.
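As a quick sanity check before you start coding, you can load one of the datasets and inspect the variables it provides. This is only a sketch; the variable names follow the data-format description in this sheet.

```matlab
% Load one of the discrete-valued datasets; this brings AttributeSet,
% LabelSet, testAttributeSet and validLabel into the workspace.
load('av2_c2');

whos AttributeSet LabelSet testAttributeSet validLabel
size(AttributeSet)    % expected: M rows x 57 columns (training examples)
unique(LabelSet)'     % the class labels present in the training set
```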
Each of the aforementioned datasets contains the same variables in the following format:

AttributeSet - a M*57 matrix where each row represents the feature vector of a training example; M is the number of training examples.
LabelSet - a M*1 column vector whose elements are the labels corresponding to the training examples in AttributeSet.
testAttributeSet - a N*57 matrix where each row represents the feature vector of a test instance; N is the number of test instances.
validLabel - a N*1 column vector whose elements are the labels corresponding to the test instances in testAttributeSet (used to measure the accuracy of a classifier).

In addition, a function (main.m) is provided to load a dataset for training and testing your implementation. You need to embed your code into this main function for evaluation; after doing so, you need only run this function.

A typical scenario for machine learning is to create a learning system by training it on a given training data set; later on, the system will be applied to different test data sets. As a result, you must write two generic functions, one for training a naive Bayes classifier and the other for testing with the trained naive Bayes classifier, so that they can be applied to any dataset via the main function. A typical format is shown below, where Parameter List stands for a set of data structures (e.g., vectors and matrices) used to store the conditional probabilities and prior probabilities to be learned. You need to design such data structures yourself for this purpose.

% for NB training
[ Parameter List ] = NBTrain(AttributeSet, LabelSet)
% for NB test
[predictLabel, accuracy] = NBTest(Parameter List, testAttributeSet, validLabel)

Note that the accuracy of a naive Bayes classifier is defined as follows:
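To make the NBTrain/NBTest format concrete, here is one possible sketch for the discrete-valued case. The Parameter List chosen here (a prior vector plus a 3-D conditional-probability array) is just one design among many, not the required one; you must still design your own data structures. Each function would go in its own .m file.

```matlab
function [prior, condProb, classes, values] = NBTrain(AttributeSet, LabelSet)
% One possible Parameter List: prior(c) = P(class c), and
% condProb(v, a, c) = P(attribute a takes value v | class c).
% Add-one (Laplace) smoothing avoids zero probabilities for rare values.
classes = unique(LabelSet);
values  = unique(AttributeSet(:));
nC = numel(classes); nA = size(AttributeSet, 2); nV = numel(values);
prior    = zeros(nC, 1);
condProb = zeros(nV, nA, nC);
for c = 1:nC
    rows = AttributeSet(LabelSet == classes(c), :);
    prior(c) = size(rows, 1) / size(AttributeSet, 1);
    for v = 1:nV
        condProb(v, :, c) = (sum(rows == values(v), 1) + 1) ...
                            / (size(rows, 1) + nV);
    end
end
end
```

```matlab
function [predictLabel, accuracy] = NBTest(prior, condProb, classes, values, ...
                                           testAttributeSet, validLabel)
% Predict by maximum posterior; log-probabilities are summed rather than
% probabilities multiplied, to avoid numerical underflow over 57 attributes.
% (Assumes every test attribute value also occurred in the training set.)
N = size(testAttributeSet, 1);
predictLabel = zeros(N, 1);
for i = 1:N
    logPost = log(prior');                      % 1 x nC row of log priors
    for a = 1:size(testAttributeSet, 2)
        v = find(values == testAttributeSet(i, a));
        logPost = logPost + log(squeeze(condProb(v, a, :))');
    end
    [~, best] = max(logPost);
    predictLabel(i) = classes(best);
end
accuracy = sum(predictLabel == validLabel) / N;
end
```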
accuracy = (number of correctly predicted test instances) / (total number of test instances)
For example, if you are given a test dataset of 100 instances and your code correctly predicts labels (i.e., predicted labels are the same as the true labels) for 85 instances, the accuracy is 85/100 = 0.85. Apart from the accuracy measure, you also need to report test results with a confusion matrix as described in lab 2, by appending the corresponding code to the end of the main function. If you wrote code for calculating a confusion matrix in lab 2, you can reuse it here.
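If you did not keep your lab 2 code, a confusion matrix can be rebuilt from the predicted and true labels along these lines (a sketch, assuming predictLabel and validLabel as returned by and passed to NBTest):

```matlab
% confMat(i, j) counts test instances whose true class is classes(i)
% and whose predicted class is classes(j).
classes = unique(validLabel);
nC = numel(classes);
confMat = zeros(nC, nC);
for i = 1:nC
    for j = 1:nC
        confMat(i, j) = sum(validLabel == classes(i) & ...
                            predictLabel == classes(j));
    end
end
disp(confMat)

% Accuracy is the sum of the diagonal over the total count, consistent
% with the definition above.
accuracy = trace(confMat) / sum(confMat(:));
```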
DELIVERABLES
By the deadline, you should submit two files, using the submit command as you did last year:

lab3.zip - all your .m files, zipped up
lab3.pdf - your report, as detailed below
A report should be submitted on one side of A4 as a PDF file. Anything beyond one side will be ignored. You should give a description of the Parameter List you designed to store conditional probabilities and prior probabilities in Part 1, and the means and variances of the Gaussian distributions for continuous input attribute values in Part 2, as well as the test results on the given data sets. Take note: we are not interested in the details of your code, e.g., what Matlab functions are called, what they return, etc. This module is about algorithms and is indifferent to how you program them. There is no specific format; marks will be allocated roughly on the basis of: functionality of your program and rigorous experimentation, knowledge displayed when talking to the demonstrator, imagination in Part 3, how informative your report is about your experiments, grammar, and ease of reading.
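For the continuous-valued case in Part 2, one common choice is to store a class-conditional mean and variance per attribute and model each likelihood as a univariate Gaussian. The fragment below is a sketch of that idea for a single class; the variable c (a class label) and x (one 1*57 test row) are placeholders you would supply in your own loop.

```matlab
% Estimate per-attribute Gaussian parameters from the training rows of class c.
rows   = AttributeSet(LabelSet == c, :);
mu     = mean(rows, 1);              % 1 x 57 vector of attribute means
sigma2 = var(rows, 0, 1) + 1e-6;     % variances, with a small floor to
                                     % guard against zero-variance attributes

% Naive-Bayes log-likelihood of a test vector x under this class's model:
logLik = sum(-0.5 * log(2 * pi * sigma2) - (x - mu).^2 ./ (2 * sigma2));
```

Adding log(prior for class c) to logLik and taking the class with the largest total gives the prediction, mirroring the discrete case.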
The lab is marked out of 15:

Part 1 - Implementation for discrete input attribute values (9 marks)
Part 2 - Implementation for continuous input attribute values (3 marks)
Part 3 - Bonus (3 marks)

Remember a mark of 12/15 constitutes a first class mark, 75%.