
COMP24111 Lab 3: Naïve Bayes Classifier for Spam Filtering

Deadline: Start of lab, week beginning 29 Nov. Extension policy: NO EXTENSIONS (EVEN ON REQUEST)
In general, a pattern recognition system consists of three modules: pre-processing, feature extraction and classification. As a typical pattern recognition task, a spam filter needs to pre-process text messages to clean up the data, extract salient features that distinguish normal from spam mails and form a feature vector for each email message, and classify email messages based on these feature vectors to make a decision. While pre-processing and feature extraction require knowledge of natural language processing and computational linguistics, classification needs a proper classifier to provide the underpinning technique for decision-making.

Bayesian methods have become very widely used for spam filtering, and you may have such a filter on your computer. Mozilla has a Bayesian filter, Spam Assassin uses Bayesian methods, and even Google uses a Bayesian filter to prevent search-engine spammers from dominating search results.

In this lab, you are going to implement naïve Bayes classifiers for spam filtering based on a UCI machine learning benchmark database named spambase (http://archive.ics.uci.edu/ml/datasets/Spambase), where 57 features (attributes with continuous values) have already been extracted for each email message and all instances are labelled as 0 (normal) or 1 (spam). The purpose of this lab is to use Matlab to implement the naïve Bayes classification algorithm learned in the module, providing underpinning techniques for developing a spam filter that works in different situations. In order to apply the discrete-valued naïve Bayes classifier for spam filtering, the spambase database has been modified by further discretisation of the continuous attribute values. Some instances have also been re-labelled as 2 to indicate uncertain, a newly added class.

You need to do three things:

1) Set up your path (and ensure it stays set up for every Matlab session).
2) Copy datafiles.
3) Then (and ONLY then) do the lab exercises.

SET UP YOUR PATH

If it is not already set up, the following directory needs to be added to your Matlab path:

/opt/info/courses/COMP24111/lab3

See the lab sheet from last week on how to do this.

COPY DATAFILES

Go to the directory /opt/info/courses/COMP24111/lab3, where you will find four data files with the suffix .mat:

av2_c2 - a dataset for binary classification where each attribute has two discrete values.
av3_c2 - a dataset for binary classification where each attribute has three discrete values.
av7_c3 - a dataset for three-class classification where each attribute has seven discrete values.
avc_c2 - a dataset for binary classification where all attributes have continuous values, as in spambase.

Each of the aforementioned datasets uses the same data format:

AttributeSet - a M*57 matrix where each row represents the feature vector of a training example; M is the number of training examples.
LabelSet - a M*1 column vector whose elements are the labels corresponding to the training examples in AttributeSet.
testAttributeSet - a N*57 matrix where each row represents the feature vector of a test instance; N is the number of test instances.
validLabel - a N*1 column vector whose elements are the labels corresponding to the test instances in testAttributeSet (used to measure the accuracy of a classifier).

In addition, a function (main.m) is provided to load a dataset for training and testing your implementation. You need to embed your code into this main function for evaluation; after doing so, you only need to run this function.

A typical scenario for machine learning is to create a learning system by training it on a given training data set; later on, the system is applied to different test data sets. You must therefore write two generic functions, one for training a naïve Bayes classifier and the other for testing with the trained naïve Bayes classifier, so that they can be applied to any dataset with the main function. A typical format is shown below, where Parameter List stands for a set of data structures (e.g., vectors and matrices) used to store the conditional probabilities and prior probabilities to be learned. You need to design such data structures yourself.

% for NB training
[ Parameter List ] = NBTrain(AttributeSet, LabelSet)
% for NB test
[predictLabel, accuracy] = NBTest(Parameter List, testAttributeSet, validLabel)

Note that the accuracy of a naïve Bayes classifier is defined as follows:

accuracy = (number of correctly classified test instances) / (number of test instances)

For example, if you are given a test dataset of 100 instances and your code correctly predicts labels (i.e., the predicted labels are the same as the true labels) for 85 instances, the accuracy is 85/100 = 0.85. Apart from the accuracy measure, you also need to report test results with a confusion matrix, as described in lab 2, by appending the corresponding code to the end of the main function. If you wrote code for calculating a confusion matrix in lab 2, you can reuse it here.
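Although the lab itself must be done in Matlab, the accuracy and confusion-matrix computation described above can be sketched in a few lines. The Python sketch below is illustrative only; the function name `evaluate` is our own, and the inputs mirror the roles of predictLabel and validLabel in the Matlab signatures:

```python
import numpy as np

def evaluate(predict_label, valid_label, num_classes):
    """Accuracy and confusion matrix for predicted vs. true labels.

    Labels are assumed to be integers 0..num_classes-1."""
    predict_label = np.asarray(predict_label)
    valid_label = np.asarray(valid_label)
    # accuracy = correctly classified test instances / test instances
    accuracy = np.mean(predict_label == valid_label)
    # confusion[i, j] = number of instances of true class i predicted as class j
    confusion = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(valid_label, predict_label):
        confusion[t, p] += 1
    return accuracy, confusion
```

The diagonal of the confusion matrix holds the correctly classified counts for each class, so the accuracy is also the trace of the matrix divided by N.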

Now, the lab exercises.

PART 1 Implementation for discrete input attribute values


The main task of this exercise is to implement a naïve Bayes classifier for discrete input attribute values (see the lecture notes on the Naïve Bayes Classifier for details). Your code must be able to handle all possible situations: different attributes may have different numbers of values (e.g., the first attribute has two values, the second has four values, and so on), and a problem may be either a binary or a multi-class classification task. If you do not have a clear idea of how to implement a generic naïve Bayes classifier at the beginning, you might want to start by coding a simplified naïve Bayes classifier for a simple situation, e.g., all attributes have only two values and the task is binary classification. After gaining experience with simple situations, you should be able to work on a generic naïve Bayes classifier. With the main function, you should test your code on the three datasets of discrete input attribute values.

In terms of functionality, three goals are set for Part 1:
1. your code works for binary attribute values and binary classification tasks,
2. your code works for multiple attribute values and binary classification tasks,
3. your code works for different attribute values and multi-class classification tasks.
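To make the discrete case concrete, here is a hedged Python sketch (the lab requires Matlab; the function names are our own and attribute values are assumed to be coded as non-negative integers). Training estimates the prior P(c) and the conditional probabilities P(attribute_j = v | c); prediction picks the class maximising the log of the product. Laplace smoothing (the alpha constant) is one common way to avoid zero probabilities and is an assumption here, not a lab requirement:

```python
import numpy as np

def nb_train_discrete(X, y, alpha=1.0):
    """Estimate priors and per-attribute conditional probabilities.

    X: (M, D) integer attribute values; y: (M,) integer class labels.
    alpha is a Laplace smoothing constant to avoid zero probabilities."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    conds = {}  # conds[(c, j)] maps value v -> P(attribute j = v | class c)
    for c in classes:
        Xc = X[y == c]
        for j in range(X.shape[1]):
            values = np.unique(X[:, j])  # values this attribute can take
            col = Xc[:, j]
            conds[(c, j)] = {
                v: (np.sum(col == v) + alpha) / (len(col) + alpha * len(values))
                for v in values
            }
    return priors, conds

def nb_predict_discrete(params, X):
    """Pick argmax_c [ log P(c) + sum_j log P(x_j | c) ] for each row of X."""
    priors, conds = params
    preds = []
    for x in X:
        scores = {}
        for c in priors:
            s = np.log(priors[c])
            for j, v in enumerate(x):
                # unseen values fall back to a tiny probability
                s += np.log(conds[(c, j)].get(v, 1e-12))
            scores[c] = s
        preds.append(max(scores, key=scores.get))
    return np.array(preds)
```

Note that nothing here assumes two attribute values or two classes: the loops run over whatever values and classes appear in the data, which is exactly the genericity goals 1-3 ask for.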

PART 2 Implementation for continuous input attribute values


Upon completion of Part 1 described above, you need to further implement a naïve Bayes classifier for continuous input attribute values, where each attribute is assumed to follow a Gaussian distribution (see the lecture notes on the Naïve Bayes Classifier for details). As in Part 1, you must have two generic functions, one for training and the other for testing. With the main function, you can test your code with the given dataset of continuous input attribute values.
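Under the Gaussian assumption, training reduces to estimating a per-class, per-attribute mean and variance, and testing evaluates the Gaussian log-density at each test point. The following Python sketch illustrates this (again, the lab itself must be in Matlab; the names and the small variance floor eps are our own assumptions):

```python
import numpy as np

def nb_train_gaussian(X, y, eps=1e-9):
    """Per-class prior plus per-attribute mean and variance.

    eps guards against zero variance when a feature is constant in a class."""
    classes = np.unique(y)
    params = {}
    for c in classes:
        Xc = X[y == c]
        params[c] = (np.mean(y == c),       # prior P(c)
                     Xc.mean(axis=0),       # means mu_{c,j}
                     Xc.var(axis=0) + eps)  # variances sigma^2_{c,j}
    return params

def nb_predict_gaussian(params, X):
    """Pick argmax_c [ log P(c) + sum_j log N(x_j; mu_{c,j}, sigma^2_{c,j}) ]."""
    preds = []
    for x in X:
        best, best_score = None, -np.inf
        for c, (prior, mu, var) in params.items():
            score = np.log(prior) - 0.5 * np.sum(
                np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
            if score > best_score:
                best, best_score = c, score
        preds.append(best)
    return np.array(preds)
```

Working in log space, as above, avoids the numerical underflow that multiplying 57 small densities together would otherwise cause.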

PART 3 Bonus marks


Additionally, bonus marks are available for truly exceptional students: those able to show observations and knowledge that were not explicitly supplied in lectures. To restate: to obtain marks in this category you should show evidence of learning beyond the supplied lecture notes. Examples of things you could do: learn about forms of evaluation other than the confusion matrix, other feature extraction/selection routines, or an extension of the naïve Bayes classifier that improves classification performance on this database. The original database, spambase.data, is provided in the same directory as the other datasets so that you can use it if needed.

DELIVERABLES
By the deadline, you should submit two files, using the submit command as you did last year:
lab3.zip - all your .m files, zipped up
lab3.pdf - your report, as detailed below

A report should be submitted as a PDF file, on one side of A4. Anything beyond one side will be ignored. You should give a description of the Parameter List you designed to store the conditional probabilities and prior probabilities in Part 1, and the means and variances of the Gaussian distributions for continuous input attribute values in Part 2, as well as test results on the given data sets. Take note: we are not interested in the details of your code, e.g., what Matlab functions are called, what they return, etc. This module is about algorithms and is indifferent to how you program them. There is no specific format; marks will be allocated roughly on the basis of: functionality of your program and rigorous experimentation, knowledge displayed when talking to the demonstrator, imagination in Part 3, how informative your report is about your experiments, grammar, and ease of reading.

The lab is marked out of 15:
Part 1 Implementation for discrete input attribute values (9 marks)
Part 2 Implementation for continuous input attribute values (3 marks)
Part 3 Bonus (3 marks)
Remember a mark of 12/15 constitutes a first class mark, 75%.
