Lab 3
Deadline: Start of lab, week beginning 29 Nov. Extension policy: NO EXTENSIONS (EVEN ON REQUEST)
In general, a pattern recognition system consists of three modules: pre-processing, feature extraction and classification. As a typical pattern recognition task, a spam filter needs to pre-process text messages to clean up the data, extract salient features that distinguish normal mail from spam (forming a feature vector for each email message), and classify email messages based on those feature vectors in order to make a decision. While pre-processing and feature extraction require knowledge of natural language processing and computational linguistics, classification requires a proper classifier to provide the underpinning technique for decision-making. Bayesian methods have become very widely used for spam filtering, and you may have such a filter on your computer. Mozilla has a Bayesian filter, Spam Assassin uses Bayesian methods, and even Google uses a Bayesian filter to prevent search-engine spammers from dominating search results.

In this lab, you are going to implement naive Bayes classifiers for spam filtering based on a UCI machine learning benchmark database named spambase (http://archive.ics.uci.edu/ml/datasets/Spambase), in which 57 features (attributes with continuous values) have already been extracted for each email message and every instance is labelled as 0 (normal) or 1 (spam). The purpose of this lab is to use Matlab to implement the naive Bayes classification algorithm learned in the module, providing the underpinning techniques for a spam filter that works in different situations. In order to apply the discrete-valued naive Bayes classifier for spam filtering, the spambase database has been modified by further discretising the continuous attribute values. Also, some instances have been re-labelled as 2 to indicate "uncertain", a newly added class.
SET UP YOUR PATH

If it is not already set up, the following directory needs to be added to your Matlab path:

/opt/info/courses/COMP24111/lab3

See the lab sheet from last week on how to do this.

COPY DATAFILES

Go to the directory /opt/info/courses/COMP24111/lab3, where you will find four data files with the suffix .mat:

av2_c2 - a dataset for binary classification where each attribute has two discrete values.
av3_c2 - a dataset for binary classification where each attribute has three discrete values.
av7_c3 - a dataset for three-class classification where each attribute has seven discrete values.
avc_c2 - a dataset for binary classification where all attributes have continuous values, as in spambase.
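As a quick sanity check before you start coding, you can load one of the datasets and inspect the variables it provides. This is only a sketch; the variable names follow the data-format description in this sheet.

```matlab
% Load one of the discrete-valued datasets; this brings AttributeSet,
% LabelSet, testAttributeSet and validLabel into the workspace.
load('av2_c2');

whos AttributeSet LabelSet testAttributeSet validLabel
size(AttributeSet)    % expected: M rows x 57 columns (training examples)
unique(LabelSet)'     % the class labels present in the training set
```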
Each of the aforementioned datasets contains the same variables in the following format:

AttributeSet - a M*57 matrix where each row represents the feature vector of a training example; M is the number of training examples.
LabelSet - a M*1 column vector whose elements are the labels corresponding to the training examples in AttributeSet.
testAttributeSet - a N*57 matrix where each row represents the feature vector of a test instance; N is the number of test instances.
validLabel - a N*1 column vector whose elements are the labels corresponding to the test instances in testAttributeSet (used to measure the accuracy of a classifier).

In addition, a function (main.m) is provided to load a dataset for training and testing your implementation. You need to embed your code into this main function for evaluation; after doing so, you need only run this function.

A typical scenario for machine learning is to create a learning system by training it on a given training data set; later on, the system will be applied to different test data sets. As a result, you must write two generic functions, one for training a naive Bayes classifier and the other for testing with the trained naive Bayes classifier, so that they can be applied to any dataset via the main function. A typical format is shown below, where Parameter List stands for a set of data structures (e.g., vectors and matrices) used to store the conditional probabilities and prior probabilities to be learned. You need to design such data structures yourself for this purpose.

% for NB training
[ Parameter List ] = NBTrain(AttributeSet, LabelSet)
% for NB test
[predictLabel, accuracy] = NBTest(Parameter List, testAttributeSet, validLabel)

Note that the accuracy of a naive Bayes classifier is defined as follows:
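To make the NBTrain/NBTest format concrete, here is one possible sketch for the discrete-valued case. The Parameter List chosen here (a prior vector plus a 3-D conditional-probability array) is just one design among many, not the required one; you must still design your own data structures. Each function would go in its own .m file.

```matlab
function [prior, condProb, classes, values] = NBTrain(AttributeSet, LabelSet)
% One possible Parameter List: prior(c) = P(class c), and
% condProb(v, a, c) = P(attribute a takes value v | class c).
% Add-one (Laplace) smoothing avoids zero probabilities for rare values.
classes = unique(LabelSet);
values  = unique(AttributeSet(:));
nC = numel(classes); nA = size(AttributeSet, 2); nV = numel(values);
prior    = zeros(nC, 1);
condProb = zeros(nV, nA, nC);
for c = 1:nC
    rows = AttributeSet(LabelSet == classes(c), :);
    prior(c) = size(rows, 1) / size(AttributeSet, 1);
    for v = 1:nV
        condProb(v, :, c) = (sum(rows == values(v), 1) + 1) ...
                            / (size(rows, 1) + nV);
    end
end
end
```

```matlab
function [predictLabel, accuracy] = NBTest(prior, condProb, classes, values, ...
                                           testAttributeSet, validLabel)
% Predict by maximum posterior; log-probabilities are summed rather than
% probabilities multiplied, to avoid numerical underflow over 57 attributes.
% (Assumes every test attribute value also occurred in the training set.)
N = size(testAttributeSet, 1);
predictLabel = zeros(N, 1);
for i = 1:N
    logPost = log(prior');                      % 1 x nC row of log priors
    for a = 1:size(testAttributeSet, 2)
        v = find(values == testAttributeSet(i, a));
        logPost = logPost + log(squeeze(condProb(v, a, :))');
    end
    [~, best] = max(logPost);
    predictLabel(i) = classes(best);
end
accuracy = sum(predictLabel == validLabel) / N;
end
```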
accuracy = (number of correctly predicted test instances) / (total number of test instances)
For example, if you are given a test dataset of 100 instances and your code correctly predicts labels (i.e., predicted labels are the same as the true labels) for 85 instances, the accuracy is 85/100 = 0.85. Apart from the accuracy measure, you also need to report test results with a confusion matrix as described in lab 2, by appending the corresponding code to the end of the main function. If you wrote code for calculating a confusion matrix in lab 2, you can reuse it here.
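If you did not keep your lab 2 code, a confusion matrix can be rebuilt from the predicted and true labels along these lines (a sketch, assuming predictLabel and validLabel as returned by and passed to NBTest):

```matlab
% confMat(i, j) counts test instances whose true class is classes(i)
% and whose predicted class is classes(j).
classes = unique(validLabel);
nC = numel(classes);
confMat = zeros(nC, nC);
for i = 1:nC
    for j = 1:nC
        confMat(i, j) = sum(validLabel == classes(i) & ...
                            predictLabel == classes(j));
    end
end
disp(confMat)

% Accuracy is the sum of the diagonal over the total count, consistent
% with the definition above.
accuracy = trace(confMat) / sum(confMat(:));
```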
DELIVERABLES
By the deadline, you should submit two files, using the submit command as you did last year:

lab3.zip - all your .m files, zipped up
lab3.pdf - your report, as detailed below
A report should be submitted on one side of A4 as a PDF file. Anything beyond one side will be ignored. You should give a description of the Parameter List you designed to store conditional probabilities and prior probabilities in Part 1, and the means and variances of the Gaussian distributions for continuous input attribute values in Part 2, as well as the test results on the given data sets. Take note: we are not interested in the details of your code, e.g., what Matlab functions are called, what they return, etc. This module is about algorithms and is indifferent to how you program them. There is no specific format; marks will be allocated roughly on the basis of: functionality of your program and rigorous experimentation, knowledge displayed when talking to the demonstrator, imagination in Part 3, how informative your report is about your experiments, grammar, and ease of reading.
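For the continuous-valued case in Part 2, one common choice is to store a class-conditional mean and variance per attribute and model each likelihood as a univariate Gaussian. The fragment below is a sketch of that idea for a single class; the variable c (a class label) and x (one 1*57 test row) are placeholders you would supply in your own loop.

```matlab
% Estimate per-attribute Gaussian parameters from the training rows of class c.
rows   = AttributeSet(LabelSet == c, :);
mu     = mean(rows, 1);              % 1 x 57 vector of attribute means
sigma2 = var(rows, 0, 1) + 1e-6;     % variances, with a small floor to
                                     % guard against zero-variance attributes

% Naive-Bayes log-likelihood of a test vector x under this class's model:
logLik = sum(-0.5 * log(2 * pi * sigma2) - (x - mu).^2 ./ (2 * sigma2));
```

Adding log(prior for class c) to logLik and taking the class with the largest total gives the prediction, mirroring the discrete case.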
The lab is marked out of 15:

Part 1 - Implementation for discrete input attribute values (9 marks)
Part 2 - Implementation for continuous input attribute values (3 marks)
Part 3 - Bonus (3 marks)

Remember a mark of 12/15 constitutes a first class mark, 75%.