COMP1831 LabExercise5
This is a lab exercise sheet for the COMP1831 module. You are required to follow the exercises in the order they are presented. This exercise sheet supports the module's learning outcomes.
Attempt the lab exercises on your own. If you encounter any issues, please ask your lab tutor.
Background
Along with graph modelling and algorithms, there is another extremely powerful tool in the arsenal of analysts when it comes to financial fraud detection. Fraud detection really boils down to automatic anomaly and pattern detection, which is something that data-driven approaches excel at. More specifically, machine learning and deep learning provide a wide collection of models that can be used to analyse datasets and classify anomalies that, more often than not, are correlated with fraudulent activity.
However, a common disadvantage of such approaches is the intrinsic black-box nature of some of the models. In the context of fraud detection, it is extremely important to be confident about the results of your analysis, as these may be provided as evidence in a court of law. Hence, black-box models (even with high confidence) are usually suitable only for the first stage of analysis, in order to narrow down the content that needs to be analysed further.
Yet, there are models and techniques which are fully transparent, so with simple inspection it is possible to extract the execution trail and explain how each classification was inferred.
The sandboxed environment that we will be using is Google Colaboratory, where we will create, run and edit Jupyter notebooks. A Jupyter notebook is an interactive Python programming environment in which you can run Python code snippets sequentially and receive immediate feedback.
To launch your Google Colaboratory environment, make sure you have, and are signed into, a Google account. Navigate to https://colab.research.google.com/. Spend some time familiarising yourself with the environment. Unlike the Neo4J sandbox, you can save your notebooks and continue at a later stage if you want to do so.
Before starting the next parts of the lab, follow the steps below to create a new Notebook and run some basic commands.
- Create a new notebook and run the following Python snippet to print the interpreter version:
import sys
print(sys.version)
- You can also run bash commands by prefixing them with an exclamation mark (“!”). Run the following command to get OS information:
!cat /etc/*-release
The decision tree model falls under the supervised learning category. In other words, the model is trained on data that has already been labelled, unlike the unsupervised learning category where labelled data is not required in the first place. The main disadvantage of supervised learning is the need for an already annotated dataset, which in many cases is very hard to obtain.
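To make both points concrete (fitting on labelled data, and the transparency discussed earlier), the snippet below is a minimal, self-contained sketch using scikit-learn on a made-up toy dataset; the feature names, values and thresholds are purely illustrative and are not taken from the lab data.
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy labelled data: each row is [amount, is_foreign]; label 1 means fraudulent.
X = [[50, 0], [120, 0], [9000, 1], [15000, 1], [30, 0], [12000, 0]]
y = [0, 0, 1, 1, 0, 1]

# Supervised learning: the tree is fitted on examples together with their labels.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

# Because the model is transparent, its learned rules can be printed and inspected;
# this is the execution trail that explains how each classification is reached.
print(export_text(clf, feature_names=["amount", "is_foreign"]))

# A new transaction is classified by following those same human-readable rules.
print(clf.predict([[11000, 0]]))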
Since Jupyter notebooks allow for Markdown syntax as well, the rest of the lab instructions are provided within the notebook itself. Follow the instructions below to import the skeleton notebook into your Google Colab environment.
Go to the module’s Moodle page and download the Week 7 Lab exercise notebook archive. This should download a zip file called “Week 7 - Lab Exercise Notebook.ipynb.zip”. Extract the contents of the archive to find the “Week 7 - Lab Exercise Notebook.ipynb” notebook inside.
Now navigate back to your Google Colab environment and select “File” -> ”Upload Notebook”. A new window should appear from which you can select the file to upload. Click “Choose file” and select the downloaded notebook (the file with the .ipynb extension).
After a couple of seconds, you should see the notebook uploaded and opened within the Google Colab environment.
Exercises:
For this exercise you need to perform the same analysis as shown in the lab, but on a different dataset; a minimal outline of how the tasks fit together is sketched after the task list below. You can find the dataset here:
https://drive.google.com/file/d/1A_EE7CIltvI5YfP_EPzmQ5jeNezT7cfF/view?usp=sharing
Tasks:
1.1 Download the provided dataset and load it from the CSV file as a pandas DataFrame.
1.2 Create a list of basic statistical properties (count, mean, std, min, 25%, 50%, 75%, max)
of the “amount” column in the data provided
1.3 What is the ratio of fraudulent / non-fraudulent transactions in the given dataset?
1.4 Apply one of the sampling methods demonstrated in the lab to balance the dataset classes.
1.5 Create a correlation heatmap plot across all the dataset’s fields.
1.6 Create a new data column which can be useful for classification, by using the existing
fields of the dataset.
1.8 Create a decision tree classifier using the scikit-learn package that classifies a transaction as fraudulent or not. (Train and test on a 70/30 split of your dataset.)
1.9 Provide a summary of your model’s performance metrics (precision, recall, F1-score).
1.10 Which feature was the most important for the model? Provide a plot that justifies your
answer
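The sketch below is only an outline of how these tasks fit together; it is not a model answer. The file name “dataset.csv” and the label column name “isFraud” are assumptions made for illustration, so check the actual file and column names in the provided dataset and adjust accordingly. You still need to explore the data, justify your engineered feature and produce the required plots yourself.
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

df = pd.read_csv("dataset.csv")                        # 1.1 load the CSV as a DataFrame (file name assumed)
print(df["amount"].describe())                         # 1.2 count, mean, std, min, quartiles, max
print(df["isFraud"].value_counts(normalize=True))      # 1.3 ratio of fraudulent to non-fraudulent (label name assumed)

# 1.4 one possible balancing method: random undersampling of the majority class
fraud = df[df["isFraud"] == 1]
legit = df[df["isFraud"] == 0].sample(n=len(fraud), random_state=42)
balanced = pd.concat([fraud, legit])

# 1.5 correlation heatmap across the numeric fields
sns.heatmap(balanced.select_dtypes("number").corr())

# 1.6 would go here: derive a new column from the existing fields (your choice to justify)

# 1.8 train a decision tree on a 70/30 split of the (balanced) data
features = balanced.select_dtypes("number").drop(columns=["isFraud"])
labels = balanced["isFraud"]
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=42, stratify=labels)
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# 1.9 precision, recall and F1 summary
print(classification_report(y_test, clf.predict(X_test)))

# 1.10 feature importances; plot these (e.g. as a bar chart) to justify your answer
print(pd.Series(clf.feature_importances_, index=features.columns).sort_values(ascending=False))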
Provide your solution in Python notebook (.ipynb) format. To export from Google Colab: File -> Download .ipynb. Your Python notebook should be named LabExercise5.ipynb.