Assignment 1
Data Mining Techniques
Data Mining Practice and Theory
INTRODUCTION
This document introduces the first assignment of the Data Mining Techniques course
at the VU. This is a group task (3 members); please make sure all team members contribute to
the work as expected. The assignment follows the structure of the CRISP-DM process model
and the first five lectures of the course. You will study and apply the subjects and algorithms
treated during these lectures to get hands-on experience and gain a deeper understanding.
TOOLS / PROGRAMMING LANGUAGE: You have complete freedom to use any tool or programming
language and can also combine them (e.g. data preparation in one tool and classification
using the resulting dataset in another). There is no need to implement your own algorithms
unless this is explicitly mentioned in the assignment; you can simply use readily available
packages. If you are in doubt about your choice, we recommend using Python, as it is the most
frequently used language and the algorithms treated during the lectures are widely available
in packages (e.g. scikit-learn).
DATASET: The group following Data Mining Techniques is quite diverse, so a basic and an
advanced version of this first assignment are available. Many components of this assignment
revolve around analyzing a particular dataset. The basic version is based on a simpler dataset,
while the advanced version focuses on a more challenging, complex dataset. The latter should
only be followed by those who already have some experience with machine learning and want
to challenge themselves. A bonus point will be given to those who do the advanced version. In
case there are specific instructions for the advanced dataset in the assignments, these will be
clearly indicated.
• Basic dataset The basic dataset is the dataset we collected during the first lecture, containing
information on all of you. In case this dataset does not satisfy your interest, you are also
allowed to use an alternative dataset from, for instance, Kaggle, but please discuss this with
your TA first. Note that the dataset should contain both a classification and a regression
problem and should not be "pre-cleaned".
• Advanced dataset This dataset originates from the domain of mental health. More and more
smartphone applications are becoming available to support people suffering from depression.
These applications record all kinds of sensory data about the behavior of the user and, in
addition, frequently ask the user for a rating of their mood. A snapshot of the resulting dataset
is shown in Table 1. The dataset contains IDs, reflecting the user the measurement originated
from. Furthermore, it contains time-stamped pairs of variables and values. The variables and
their interpretation are shown in Table 2. The goal is to predict the mood of the next day for
the subjects in the dataset (the mood of a day being the average of the mood values measured
during that day).
REPORT: We would like you, as a group of 3, to prepare a report with the following in mind:
• The report should be submitted via Canvas by 21/04/2024, 23:59. This is a strict deadline;
please respect it, otherwise points will be deducted (1 full grade point per day).
• Please format the document according to the LNCS guidelines. Templates are available
on Canvas for both LaTeX and Microsoft Word; do not deviate from these templates (so
no adjusting the margins, etc.). Note that you do not need to include an abstract in your
report, but do start with a brief introduction. The paper should not exceed 14 pages
including all figures and tables, but excluding references (references do not count towards
the page limit, to encourage you to cite all relevant work). With the page limit, our aim is
to challenge you to report only what is necessary.
Table 2: Variables in the advanced dataset
Variable Explanation
mood The mood scored by the user on a scale of 1-10
circumplex.arousal The arousal scored by the user, on a scale from -2 to 2
circumplex.valence The valence scored by the user, on a scale from -2 to 2
activity Activity score of the user (number between 0 and 1)
screen Duration of screen activity (time)
call Call made (indicated by a 1)
sms SMS sent (indicated by a 1)
appCat.builtin Duration of usage of built-in apps (time)
appCat.communication Duration of usage of communication apps (time)
appCat.entertainment Duration of usage of entertainment apps (time)
appCat.finance Duration of usage of finance apps (time)
appCat.game Duration of usage of game apps (time)
appCat.office Duration of usage of office apps (time)
appCat.other Duration of usage of other apps (time)
appCat.social Duration of usage of social apps (time)
appCat.travel Duration of usage of travel apps (time)
appCat.unknown Duration of usage of unknown apps (time)
appCat.utilities Duration of usage of utilities apps (time)
appCat.weather Duration of usage of weather apps (time)
• Make sure we can identify your report, i.e., your group number, names and student
numbers should be in the document’s header.
• Structure the report following the assignments, be selective in what you report as space
is limited, but do provide good descriptions of the steps you have taken (to make it
reproducible) and the rationale behind your choices.
GRADING: You can earn a total of 100 points. 10 bonus points are given to those selecting
the advanced dataset. Your grade for the assignment is the number of points divided by 10
(with 10 being the maximum grade). At the end of this assignment you can find the grading
scheme used.
TASK 1: DATA PREPARATION (30 POINTS)
As discussed during Lecture 2, the first phase of a Data Mining project typically includes get-
ting familiar with the domain and pre-processing the dataset in a suitable manner. In this
part of the assignment, we will go through those steps.
• Notice all sorts of properties of the dataset: how many records there are, how many
attributes, what kinds of attributes, ranges of values, distributions of values, relationships
between attributes, missing values, and so on. A table is often a suitable way of showing
such properties of a dataset. If you find something interesting (to you, or in general),
make sure to write it down.
• Make various plots of the data. Is there something interesting worth reporting? Report
the figures and discuss what is in them. What meaning do those bars, lines, dots, etc.
convey? Please select only essential and interesting plots for discussion, as you have
limited space for reporting your findings.
• Apply an approach to remove extreme and incorrect values from your dataset. Describe
what your approach is, why you consider that to be a good approach, and describe what
the result of applying the approach is.
• Impute the missing values using two different approaches. Describe the approaches
and study the impact of applying them to your data. Argue which of the two approaches
is most suitable and use that one to form your cleaned dataset. Base your choice on
scientific literature as well (an illustrative sketch of outlier removal and imputation is
given after this list).
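As a minimal, non-prescriptive sketch of these two steps in Python: the file name, the column
"score", the 1.5*IQR rule and the mean/kNN imputers below are merely illustrative assumptions,
not the required approach.

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical input: a table with numeric columns, among them "score".
df = pd.read_csv("dataset.csv")

# Extreme values: mark everything outside the 1.5 * IQR whiskers as missing.
q1, q3 = df["score"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = ~df["score"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df.loc[outliers, "score"] = np.nan

numeric = df.select_dtypes(include="number")

# Imputation approach 1: replace missing values by the column mean.
mean_imputed = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(numeric),
                            columns=numeric.columns, index=numeric.index)

# Imputation approach 2: k-nearest-neighbour imputation using the other numeric columns.
knn_imputed = pd.DataFrame(KNNImputer(n_neighbors=5).fit_transform(numeric),
                           columns=numeric.columns, index=numeric.index)

Comparing, for example, the resulting distributions of mean_imputed and knn_imputed supports
the argumentation asked for above.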
Advanced: The advanced dataset contains a number of time series. Select two approaches to
impute missing values that are logical for such time series and argue for one of them based
on the insights you gain. Also consider what to do with prolonged periods of missing data in
a time series.
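For the time-series case, a comparable (again purely illustrative) sketch could contrast forward
filling over short gaps with linear interpolation; the file name and column names mirror the
dataset description above but are assumptions.

import pandas as pd

# Hypothetical long format: one row per (id, time, variable, value) measurement.
raw = pd.read_csv("advanced_dataset.csv", parse_dates=["time"])
mood = (raw[raw["variable"] == "mood"]
        .set_index("time")
        .groupby("id")["value"]
        .resample("D")
        .mean())  # daily average mood per user; days without measurements become NaN

# Option 1: forward fill, but only across gaps of at most 2 days.
ffilled = mood.groupby(level="id").ffill(limit=2)

# Option 2: linear interpolation within each user's series, with the same gap limit.
interpolated = mood.groupby(level="id").transform(lambda s: s.interpolate(limit=2))

# Prolonged gaps remain NaN under both options and have to be handled explicitly,
# e.g. by dropping the affected days (or users) from the analysis.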
Feature engineering can, for example, involve the transformation of values (e.g. taking the log
of values given a certain distribution of values) or combining multiple features (e.g. two
features that are more valuable combined than as two separate values). Think of a creative
feature engineering approach for your dataset, describe it, and apply it. Report on why you
think this is a useful enrichment of your dataset.
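For instance (with entirely made-up column names and a hypothetical file), such transformations
could be as simple as:

import numpy as np
import pandas as pd

df = pd.read_csv("dataset.csv")  # hypothetical cleaned dataset from the previous task

# Transformation: log of a skewed, non-negative feature (log1p also handles zeros).
df["screen_time_log"] = np.log1p(df["screen_time"])

# Combination: two features that say more together than apart, here as a ratio.
df["social_share"] = df["social_app_time"] / (df["screen_time"] + 1e-9)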
Advanced: Essentially there are two approaches you can consider to create a predictive model
using this dataset (which we will do in the next part of this assignment): (1) use a machine
learning approach that can deal with temporal data (e.g. recurrent neural networks), or (2)
aggregate the history somehow to create attributes that can be used in a more common machine
learning approach (e.g. SVM, decision tree). For instance, you could use the average mood
during the last five days as a predictor. Ample literature is available in the area of temporal
data mining that describes how such a transformation can be made. For the feature engineering,
you will focus on such a transformation in this part of the assignment. This is illustrated
in Figure 1.
Figure 1: For each patient, the variables in a history period are aggregated into attributes
that are used to predict the mood at a target time point.
In the end, we end up with a dataset with a number of training instances per patient (as there
are multiple time points for which you can train), i.e. an instance that concerns the mood at
t=1, t=2, etc. From which time point you can start predicting of course depends on how much
history you consider relevant (if you use a window of 5 days of history to create attributes,
you cannot create training instances before the 6th day). To come
to this dataset, you need to:
1. Define attributes that aggregate the history; draw inspiration from the scientific literature.
2. Define the target by averaging the mood over the entire day (a minimal sketch of both steps
is given below).
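A minimal sketch of these two steps, assuming a DataFrame daily with one row per (id, date)
holding the day-averaged mood; the 5-day window and all names are examples, not requirements.

import pandas as pd

# 'daily' is assumed to contain the columns id, date and mood (the day-averaged mood).
daily = daily.sort_values(["id", "date"])

# Step 1: an attribute that aggregates the history, here the mean mood over the
# previous 5 days (shift(1) excludes the day that is being predicted).
daily["mood_mean_prev5"] = (daily.groupby("id")["mood"]
                                 .transform(lambda s: s.shift(1).rolling(5).mean()))

# Step 2: the target is the day-averaged mood itself; instances without a full
# history window (the first 5 days of each patient) are dropped.
instances = daily.dropna(subset=["mood_mean_prev5"]).rename(columns={"mood": "target"})

The same pattern extends to other history attributes (e.g. the average screen time or activity
over the window).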
Advanced: For the advanced assignment you go through the same steps (and shape it into a
classification problem for predicting the mood of the next day); however, you are required to
use two different types of classification algorithms: one that uses the dataset you formed in
Task 1C (e.g. a random forest) and one that is inherently temporal (e.g. a recurrent neural
network). Also consider a good evaluation setup given the nature of the dataset.
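Regarding the evaluation setup, one possible (certainly not the only) choice that respects the
temporal nature of the data is a time-based rather than a random split, for example:

# Temporal split: train on the earlier part of the period, test on the most recent days,
# using the hypothetical 'instances' DataFrame from the sketch above; a split per patient
# is another option worth considering.
dates = instances["date"].sort_values()
cutoff = dates.iloc[int(0.8 * len(dates))]
train = instances[instances["date"] <= cutoff]
test = instances[instances["date"] > cutoff]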
• http://www.kaggle.com/ - DM competitions
• Etc. - You should be able to find other relevant competitions by searching the Web.
The main goal is that you can demonstrate that you understand a technique that beats other
techniques under certain conditions (specified by the task and data at hand). Here’s what
we’d like you to include in the report for this task:
• A description of the competition: what competition, when was it held, what data they
were using, what task(s) they were solving, what evaluation measure(s) they used.
• What was the main idea of the winning approach? (Typically this would come from a
paper written by the winners.)
• What makes the winning approach stand out, or how is it different from standard or
non-winning methods?
• Needless to say, but for the record: please do not copy and paste from papers. Always
cite (properly) the sources you use.
• Write down their corresponding formulae.
• Discuss: Why would someone use one and not the other?
• Describe an example situation (dataset, problem, perhaps algorithm) where using MSE
or MAE would give identical results. Justify your answer (some maths may come in handy,
but a clear explanation is also sufficient).
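For reference, the standard definitions (with y_i the true values and \hat{y}_i the predictions
over n instances) are:

\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|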
Table 3: Grading scheme

Task        Grading component                                                       Weight
1A          Description of the dataset (statistics)                                 3
1A          Plots of some features                                                  3
1A          Interpretation and rationale                                            4
1B          Description of approach and results to remove outliers                  2
1B          Description of two approaches to impute missing values and comparison   3
1B          Interpretation and rationale                                            5
1C          Description of feature engineering approach                             5
1C          Interpretation and rationale                                            5
2A          Description of classification approaches                                4
2A          Description of hyperparameter optimization                              2
2A          Description of evaluation setup                                         4
2A          Description of results                                                  5
2A          Interpretation and rationale                                            5
2B          Description of the competition                                          2
2B          Describe winning technique                                              4
2B          Analysis/comparison                                                     4
3           Description of algorithm                                                5
3           Discussion pros and cons                                                5
4           Description of regression approaches                                    2
4           Description of hyperparameter optimization                              1
4           Description of evaluation setup                                         2
4           Description of results                                                  2
4           Interpretation and rationale                                            3
5A          Formulae                                                                2
5A          Example                                                                 4
5A          Rationale                                                               4
5B          Description of results                                                  6
5B          Interpretation and rationale                                            4
Deductions  Extra page                                                              -10
Deductions  Late (per day)                                                          -10
Deductions  Wrong formatting                                                        -10
Total                                                                               100