COMP1831 LabExercise5
This is a lab exercise sheet for the COMP1831 module. You are required to follow the exercises in the order they are presented. This exercise sheet supports the module's learning outcomes.
Attempt the lab exercises on your own. If you encounter any issues, please ask your lab tutor.
Background
Along with graph modelling and algorithms, there is another extremely powerful tool in the arsenal of analysts when it comes to financial fraud detection. Fraud detection really boils down to automatic anomaly and pattern detection, which is something that data-driven approaches excel at. More specifically, machine learning and deep learning provide a wide collection of models that can be used to analyse datasets and classify anomalies that, more often than not, are correlated with fraudulent activity.
However, a common disadvantage of such approaches is the intrinsic black-box nature of some of the models. In the context of fraud detection, it is extremely important to be confident about the results of your analysis, as these may be provided as evidence in a court of law. Hence, black-box models (even with high confidence) are usually suitable only for the first stage of analysis, in order to narrow down the content that needs to be analysed further.
Yet, there are models and techniques which are fully transparent, so with simple inspection it is possible to extract the execution trail and explain how each classification was inferred.
The sandboxed environment that we will be using is Google Colaboratory, where we will create, run and edit Jupyter notebooks. A Jupyter notebook is an interactive Python programming environment in which you can run Python code snippets sequentially and receive immediate feedback.
To launch your Google Colaboratory environment, make sure you have, and are signed into, a Google account. Navigate to https://colab.research.google.com/. Spend some time familiarising yourself with the environment. Unlike the Neo4J sandbox, you can save your notebooks and continue at a later stage if you want to do so.
Before starting the next parts of the lab, follow the steps below to create a new Notebook and run some basic commands.
- Create a new notebook and run the following Python snippet to print the interpreter version:
import sys
print(sys.version)
- You can also run bash commands by prefixing them with an exclamation mark (“!”). Run the following command to get OS information:
!cat /etc/*-release
The decision tree model falls under the supervised learning category. In other words, the model is trained on data that has already been labelled, unlike the unsupervised learning category where labelled data is not required in the first place. The main disadvantage of supervised learning is the need for an already annotated dataset, which in many cases is very hard to obtain.
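To make both points concrete (fitting on labelled data, and the transparency discussed earlier), the snippet below is a minimal, self-contained sketch using scikit-learn on a made-up toy dataset; the feature names, values and thresholds are purely illustrative and are not taken from the lab data.
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy labelled data: each row is [amount, is_foreign]; label 1 means fraudulent.
X = [[50, 0], [120, 0], [9000, 1], [15000, 1], [30, 0], [12000, 0]]
y = [0, 0, 1, 1, 0, 1]

# Supervised learning: the tree is fitted on examples together with their labels.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

# Because the model is transparent, its learned rules can be printed and inspected;
# this is the execution trail that explains how each classification is reached.
print(export_text(clf, feature_names=["amount", "is_foreign"]))

# A new transaction is classified by following those same human-readable rules.
print(clf.predict([[11000, 0]]))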
Since Jupyter notebooks allow for Markdown syntax as well, the rest of the lab instructions are provided within the notebook itself. Follow the instructions below to import the skeleton notebook into your Google Colab environment.
Go to the module’s Moodle page and download the Week 7 Lab exercise notebook archive. This should download a zip file called “Week 7 - Lab Exercise Notebook.ipynb.zip”. Extract the contents of the archive to find the “Week 7 - Lab Exercise Notebook.ipynb” notebook inside.
Now navigate back to your Google Colab environment and select “File” -> ”Upload Notebook”. A new window should appear from which you can select the file to upload. Click “Choose file” and select the downloaded notebook (the file with the .ipynb extension).
After a couple of seconds, you should see the notebook uploaded and opened within the Google Colab environment.
Exercises:
For this exercise you need to perform the same analysis as shown in the lab, but on a different dataset; a minimal outline of how the tasks fit together is sketched after the task list below. You can find the dataset here:
https://drive.google.com/file/d/1A_EE7CIltvI5YfP_EPzmQ5jeNezT7cfF/view?usp=sharing
Tasks:
1.1 Download the provided dataset and load it from the CSV file as a pandas DataFrame.
1.2 Create a list of basic statistical properties (count, mean, std, min, 25%, 50%, 75%, max)
of the “amount” column in the data provided
1.3 What is the ratio of fraudulent / non-fraudulent transactions in the given dataset?
1.4 Apply one of the sampling methods demonstrated in the lab to balance the dataset classes.
1.5 Create a correlation heatmap plot across all the dataset’s fields.
1.6 Create a new data column which can be useful for classification, by using the existing
fields of the dataset.
1.8 Create a decision tree classifier using the scikit-learn package that classifies a transaction as fraudulent or not. (Train and test on a 70/30 split of your dataset.)
1.9 Provide a summary of your model’s performance metrics (precision, recall, F1-score).
1.10 Which feature was the most important for the model? Provide a plot that justifies your
answer
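The sketch below is only an outline of how these tasks fit together; it is not a model answer. The file name “dataset.csv” and the label column name “isFraud” are assumptions made for illustration, so check the actual file and column names in the provided dataset and adjust accordingly. You still need to explore the data, justify your engineered feature and produce the required plots yourself.
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

df = pd.read_csv("dataset.csv")                        # 1.1 load the CSV as a DataFrame (file name assumed)
print(df["amount"].describe())                         # 1.2 count, mean, std, min, quartiles, max
print(df["isFraud"].value_counts(normalize=True))      # 1.3 ratio of fraudulent to non-fraudulent (label name assumed)

# 1.4 one possible balancing method: random undersampling of the majority class
fraud = df[df["isFraud"] == 1]
legit = df[df["isFraud"] == 0].sample(n=len(fraud), random_state=42)
balanced = pd.concat([fraud, legit])

# 1.5 correlation heatmap across the numeric fields
sns.heatmap(balanced.select_dtypes("number").corr())

# 1.6 would go here: derive a new column from the existing fields (your choice to justify)

# 1.8 train a decision tree on a 70/30 split of the (balanced) data
features = balanced.select_dtypes("number").drop(columns=["isFraud"])
labels = balanced["isFraud"]
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=42, stratify=labels)
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# 1.9 precision, recall and F1 summary
print(classification_report(y_test, clf.predict(X_test)))

# 1.10 feature importances; plot these (e.g. as a bar chart) to justify your answer
print(pd.Series(clf.feature_importances_, index=features.columns).sort_values(ascending=False))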
Provide your solution in Python notebook (.ipynb) format. To export from Google Colab: File -> Download .ipynb. Your Python notebook should be named LabExercise5.ipynb.