2024f-java-uml-programming

The Java final assignment involves practical work with the MNIST dataset, requiring students to read and process CSV files containing handwritten digit images. Tasks include extracting data, reshaping it into matrices, implementing basic statistical learning concepts, and evaluating model performance using classification metrics. Students are expected to refactor their code to adhere to Java standards and create test classes for each part of the assignment.

Uploaded by

seckinalpkargi
Copyright: © All Rights Reserved
Java final assignment

Exploring the Java language through practical work

In this assignment, you will use two CSV files:

- mnist_test.csv
- mnist_train.csv

This Java practical work has no prerequisites other than having taken the Java lectures. Open
book, ChatGPT allowed, individual work only.

First part: Reading data from csv


Open the mnist_train.csv file (not with Excel!). This CSV file contains 785 columns and roughly
60,000 lines! That is a huge set of data. Therefore, we need a programming language to analyze
its content.

A. First task, extract raw data:


a. Create a project in your IDE and import the two files into this project, then create a small
program that can read all the lines from the CSV files. Print the first two lines in the
console.

b. For each line, you will have this kind of format:

Modify your program to take this format into account and get an array of String from each
line. You can take advantage of the split function.

c. Convert this array of String into an array of double (beware: the first line of the file
contains the headers, not data).

Each line of this file contains only two pieces of data:

o In the first column, one can find a number ranging from 0 to 9.


o In the remaining columns, there is a “flat” matrix, that is, a 2D matrix which has been
written on a single line.
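
As an illustration of steps b and c, here is a minimal sketch of how a line could be split and converted. It assumes that fields are separated by plain commas with no quoting; the class name CsvLineParser and its method names are hypothetical, not imposed by the assignment. The list of lines could come, for example, from java.nio.file.Files.readAllLines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: the class and method names are illustrative only.
final class CsvLineParser {

    // Converts one data line into an array of double.
    // Assumption: fields are separated by plain commas, with no quoting.
    static double[] parseLine(String line) {
        String[] fields = line.trim().split(",");
        double[] values = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            values[i] = Double.parseDouble(fields[i].trim());
        }
        return values;
    }

    // Parses every non-empty line except the first one, which holds the headers.
    static List<double[]> parseAll(List<String> lines) {
        List<double[]> rows = new ArrayList<>();
        for (int i = 1; i < lines.size(); i++) {
            if (!lines.get(i).trim().isEmpty()) {
                rows.add(parseLine(lines.get(i)));
            }
        }
        return rows;
    }

    // The first value of a row is the digit (0 to 9).
    static int labelOf(double[] row) {
        return (int) row[0];
    }

    // The remaining values are the flattened matrix.
    static double[] pixelsOf(double[] row) {
        return Arrays.copyOfRange(row, 1, row.length);
    }
}
```

Skipping index 0 in parseAll is what keeps the header line from reaching Double.parseDouble, which would otherwise throw a NumberFormatException.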

For the submission:


- Make a test class called TestA in src/test/java where it is possible to see your code in
action
B. Second task, reshaping data
a. Add some logic to your existing code to transform the array of double extracted from
the remaining 784 columns into a 2D square matrix (28 x 28).
b. Write a method “showMatrix” that prints a matrix in the console.
In our case, the matrix contains values that range from 0 to 255, which are hard to read
when printed in the console.
c. To improve the rendering, print “xx” in the console if the value is above 100, else print
“..”.
o Use this showMatrix method on the matrix extracted from line 23.
o Print the first column of the same line. What do you notice?
d. Use a scatter plot chart from XChart to draw the same figure.
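
Steps a to c could look like the sketch below. It assumes that the flat array is stored row by row (row-major order); the class name MatrixViewer and the helper render are hypothetical, and only showMatrix is named by the assignment. The XChart part is left out on purpose.

```java
// Hypothetical helper: only the name showMatrix comes from the assignment.
final class MatrixViewer {

    // Turns a flat array of size*size values into a square matrix.
    // Assumption: the values are stored row by row.
    static double[][] reshape(double[] flat, int size) {
        if (flat.length != size * size) {
            throw new IllegalArgumentException("expected " + (size * size) + " values");
        }
        double[][] matrix = new double[size][size];
        for (int i = 0; i < flat.length; i++) {
            matrix[i / size][i % size] = flat[i];
        }
        return matrix;
    }

    // Builds the text rendering: "xx" above 100, ".." otherwise.
    static String render(double[][] matrix) {
        StringBuilder sb = new StringBuilder();
        for (double[] row : matrix) {
            for (double value : row) {
                sb.append(value > 100 ? "xx" : "..");
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    // Prints the rendering in the console.
    static void showMatrix(double[][] matrix) {
        System.out.print(render(matrix));
    }
}
```

For the real data, the call would be reshape(pixels, 28). Keeping render separate from showMatrix makes the output easy to check in a test class.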

For the submission:


- Make a test class called TestB in src/test/java where it is possible to see your code in
action

C. Refactor code to match java standards


a. Refactor your code to build two classes:
- An Image class that will contain two fields, “label” and “dataMatrix”, with the appropriate
types.
o label contains the image digit value (if the image represents a 7, then label will
contain the value 7).
o dataMatrix contains the raw data.

- An ImageCsvDAO class that will contain a function getAllImages() and all the code
you have written to read data from a CSV file. Be careful: the file name can vary, so make it
configurable in this service class.
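
One possible shape for these two classes is sketched below. The field types (int and double[][]), the constructor taking the file name, the comma separator and the wrapping of IOException are all assumptions made for the example; the assignment only fixes the class names, the two field names and getAllImages().

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

final class Image {
    private final int label;             // digit represented by the image
    private final double[][] dataMatrix; // raw pixel values

    Image(int label, double[][] dataMatrix) {
        this.label = label;
        this.dataMatrix = dataMatrix;
    }

    int getLabel() { return label; }

    double[][] getDataMatrix() { return dataMatrix; }
}

final class ImageCsvDAO {
    private final Path csvFile;

    // The file name is passed in, so the same class reads either file.
    ImageCsvDAO(String fileName) {
        this.csvFile = Paths.get(fileName);
    }

    List<Image> getAllImages() {
        List<String> lines;
        try {
            lines = Files.readAllLines(csvFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<Image> images = new ArrayList<>();
        for (int n = 1; n < lines.size(); n++) { // index 0 is the header line
            if (lines.get(n).trim().isEmpty()) {
                continue;
            }
            String[] fields = lines.get(n).trim().split(",");
            int side = (int) Math.round(Math.sqrt(fields.length - 1));
            if (side * side != fields.length - 1) {
                throw new IllegalStateException("line " + (n + 1) + " is not a square matrix");
            }
            double[][] matrix = new double[side][side];
            for (int i = 0; i < side * side; i++) {
                matrix[i / side][i % side] = Double.parseDouble(fields[i + 1].trim());
            }
            images.add(new Image((int) Double.parseDouble(fields[0].trim()), matrix));
        }
        return images;
    }
}
```

A test class could then call new ImageCsvDAO("mnist_train.csv").getAllImages(), assuming the file sits in the working directory.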

For the submission:


- Make a test class called TestC in src/test/java where it is possible to see your code in
action
Second part: toward statistical learning, from scratch
(and for beginners 😊!)
The data set considered here is in fact the MNIST dataset, which contains handwritten digits. This
dataset is a standard case study for beginners in machine learning.

As the goal here is not to train you in ML but to find an interesting way to practice Java,
the mathematics needed will be limited to simple formulas (like knowing how to
calculate a distance between 2 points or how to calculate an average from a list of values).

As you have noticed, the first column contains what we call a “label”, which indicates the digit
that the associated image (the 28x28 matrix) is supposed to represent.

To briefly introduce (statistical) machine learning: it is a process in which a program
determines statistical characteristics of a dataset containing known labels in order to make
decisions, such as classification decisions.

In our case, we will use a very basic approach to make decisions:

- For each digit, we’ll compute the “average representative” (also called centroid) of the
images of that digit.
- Given a new digit image, we’ll compare the distances from this image to each digit
centroid, and we’ll classify this image as the digit whose average representative is the
closest in terms of distance.

D. Analyzing the distribution of the dataset:


a. Write a “calculateDistribution” method that, for each digit, counts the total number of
occurrences in the data set.
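
A minimal sketch of such a method is given below. To stay self-contained it takes a list of labels rather than a list of Image, and it returns an int array indexed by digit; both choices, as well as the class name DistributionSketch, are assumptions made for the example.

```java
import java.util.List;

// Hypothetical holder class: only calculateDistribution is named by the assignment.
final class DistributionSketch {

    // Returns an array where index d holds the number of occurrences of digit d.
    // Assumption: every label is between 0 and 9.
    static int[] calculateDistribution(List<Integer> labels) {
        int[] counts = new int[10];
        for (int label : labels) {
            counts[label]++;
        }
        return counts;
    }
}
```

With Image instances, the same loop would simply read the label of each image instead.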

For the submission:


- Make a test class called TestD in src/test/java where it is possible to see your code in
action

E. Calculating the average representative:


a. Write a “trainCentroids” method that takes a list of “Image”. You take those images from
mnist_train.csv.
b. This method should return a data structure associating each digit with its centroid.
c. The centroid matrix of a digit contains the average of all the images of that digit, index by
index.
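
The sketch below shows one way to compute these averages. The returned structure (a Map from digit to matrix) and the minimal nested Image class are assumptions made so that the example compiles on its own; in your project you would reuse your own Image class.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical holder class, with a minimal Image to keep the example self-contained.
final class CentroidTrainingSketch {

    static final class Image {
        final int label;
        final double[][] dataMatrix;

        Image(int label, double[][] dataMatrix) {
            this.label = label;
            this.dataMatrix = dataMatrix;
        }
    }

    // Associates each digit with the index-by-index average of its images.
    static Map<Integer, double[][]> trainCentroids(List<Image> images) {
        Map<Integer, double[][]> sums = new TreeMap<>();
        Map<Integer, Integer> counts = new TreeMap<>();
        for (Image image : images) {
            double[][] data = image.dataMatrix;
            double[][] sum = sums.get(image.label);
            if (sum == null) {
                sum = new double[data.length][data[0].length];
                sums.put(image.label, sum);
            }
            for (int r = 0; r < data.length; r++) {
                for (int c = 0; c < data[r].length; c++) {
                    sum[r][c] += data[r][c];
                }
            }
            counts.merge(image.label, 1, Integer::sum);
        }
        // Divide each accumulated sum by the number of images of that digit.
        for (Map.Entry<Integer, double[][]> entry : sums.entrySet()) {
            int n = counts.get(entry.getKey());
            for (double[] row : entry.getValue()) {
                for (int c = 0; c < row.length; c++) {
                    row[c] /= n;
                }
            }
        }
        return sums;
    }
}
```

Accumulating sums and dividing once at the end avoids keeping all the images of a digit in a separate list.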

For the submission:


- Make a test class called TestE in src/test/java where it is possible to see your code in
action
F. Performing your first classification:
a. Load the file “mnist_test.csv” under the form of a list of Image instances
b. Isolate only the first 10 “0” occurrences.
c. For each instance, calculate the distance between this instance and every centroid defined
from previous task. Are you satisfied by the result?

Hint: the distance can be defined as the Euclidean distance, that is, the square root of the sum
of the squared index-to-index differences between the two matrices considered. Once you have
the distance to every centroid, you have to take the minimum. You can find a hint here:
https://hlab.stanford.edu/brian/euclidean_distance_in.html

And some sample code to adapt to your case here:

https://stackoverflow.com/questions/54357325/calculating-largest-euclidean-distance-between-two-values-in-a-2d-array
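
As a starting point, the Euclidean distance between two matrices of the same size could be sketched as follows. The class name MatrixDistance and the method name euclidean are hypothetical.

```java
// Hypothetical helper: the names are illustrative only.
final class MatrixDistance {

    // Square root of the sum of squared index-to-index differences.
    // Assumption: both matrices have the same dimensions.
    static double euclidean(double[][] a, double[][] b) {
        double sum = 0.0;
        for (int r = 0; r < a.length; r++) {
            for (int c = 0; c < a[r].length; c++) {
                double diff = a[r][c] - b[r][c];
                sum += diff * diff;
            }
        }
        return Math.sqrt(sum);
    }
}
```

For example, the distance between {{0, 0}} and {{3, 4}} is 5.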

For the submission:


- Make a test class called TestF in src/test/java where it is possible to see your code in
action

G. Refactor code to match java standards


a. Write a class “CentroidClassifier” containing the “trainCentroids” method
b. Refactor point “c” of the previous task into a method named “predict” that returns the
predicted digit for a given image.
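
A possible skeleton for the prediction side of this class is sketched below. To keep the example short and self-contained, the centroids are passed to the constructor and predict takes the raw matrix instead of an Image; both are assumptions, and the trainCentroids method required by point a is left out here.

```java
import java.util.Map;

// Sketch only: trainCentroids is omitted and the centroids are injected.
final class CentroidClassifier {
    private final Map<Integer, double[][]> centroids;

    CentroidClassifier(Map<Integer, double[][]> centroids) {
        this.centroids = centroids;
    }

    // Returns the digit whose centroid is closest to the image, or -1 if there is no centroid.
    int predict(double[][] image) {
        int bestDigit = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Map.Entry<Integer, double[][]> entry : centroids.entrySet()) {
            double distance = distance(image, entry.getValue());
            if (distance < bestDistance) {
                bestDistance = distance;
                bestDigit = entry.getKey();
            }
        }
        return bestDigit;
    }

    // Euclidean distance between two matrices of the same size.
    private static double distance(double[][] a, double[][] b) {
        double sum = 0.0;
        for (int r = 0; r < a.length; r++) {
            for (int c = 0; c < a[r].length; c++) {
                double diff = a[r][c] - b[r][c];
                sum += diff * diff;
            }
        }
        return Math.sqrt(sum);
    }
}
```

In case of a tie, this version keeps the first centroid encountered, which depends on the iteration order of the map.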

For the submission:


- Make a test class called TestG in src/test/java where it is possible to see your code in
action

Third part: Evaluating the model


Model evaluation is mandatory to know how much confidence you can have in a machine-learning
based system.
We define four core metrics:

- True positives, example: “the prediction is 5 and the label is actually 5”.
- True negatives, example: “the prediction is ‘not 5’ and the label is 4”.
- False positives, example: “the prediction is 5 and the label was actually 4”.
- False negatives, example: “the prediction is ‘not 5’ but the label was actually 5”.
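
For a single digit, these four counts could be computed as in the sketch below. The class name BinaryCounts, the use of two parallel int arrays (labels and predictions) and the order of the returned values are assumptions made for the example.

```java
// Hypothetical helper: counts the four metrics for one digit treated as "positive".
final class BinaryCounts {

    // Returns {truePositives, trueNegatives, falsePositives, falseNegatives}.
    // Assumption: labels[i] and predictions[i] refer to the same image.
    static int[] count(int[] labels, int[] predictions, int digit) {
        int tp = 0, tn = 0, fp = 0, fn = 0;
        for (int i = 0; i < labels.length; i++) {
            boolean predicted = predictions[i] == digit;
            boolean actual = labels[i] == digit;
            if (predicted && actual) {
                tp++;
            } else if (!predicted && !actual) {
                tn++;
            } else if (predicted) {
                fp++; // predicted the digit, but the label was another one
            } else {
                fn++; // missed an image that actually was the digit
            }
        }
        return new int[] {tp, tn, fp, fn};
    }
}
```

With labels {5, 4, 5, 4, 3} and predictions {5, 5, 4, 4, 3}, the counts for digit 5 are 1 true positive, 2 true negatives, 1 false positive and 1 false negative.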

H. Implement classification performance assessment.


a. As our model (CentroidClassifier) will aggregate the results of all the centroid
comparisons, you have to find a simple way to implement the above description for model
performance.
b. Find a complete (per centroid) implementation of the above.
c. Display the confusion matrix of this solution.
For the submission:


- Make a test class called TestH in src/test/java where it is possible to see your code in
action

I. Improve the model


1. Change your strategy to replace the average with the median in the centroid calculation (it
becomes a medoid then!). Your strategy to calculate the center of the cluster should be easy to
change in your code. What’s the best strategy?
2. (Bonus): implement the adaptive centroid classification, which takes into account the
standard deviation of each cluster to compute its center.
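
One way to make the center calculation easy to swap is a small strategy interface, as sketched below. All the names (CenterStrategies, CenterStrategy, MEAN, MEDIAN, computeCenter) are hypothetical, and the median is taken index by index, which is one possible reading of point 1.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a pluggable center calculation.
final class CenterStrategies {

    // Reduces all the values found at one index to a single value.
    interface CenterStrategy {
        double center(double[] values);
    }

    static final CenterStrategy MEAN = values -> {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    };

    static final CenterStrategy MEDIAN = values -> {
        double[] sorted = values.clone(); // sort a copy, leave the input untouched
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    };

    // Applies the strategy index by index over all the images of one digit.
    // Assumption: the list is not empty and all matrices have the same size.
    static double[][] computeCenter(List<double[][]> images, CenterStrategy strategy) {
        int rows = images.get(0).length;
        int cols = images.get(0)[0].length;
        double[][] center = new double[rows][cols];
        double[] buffer = new double[images.size()];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                for (int i = 0; i < images.size(); i++) {
                    buffer[i] = images.get(i)[r][c];
                }
                center[r][c] = strategy.center(buffer);
            }
        }
        return center;
    }
}
```

Switching from the average to the median then only means passing CenterStrategies.MEDIAN instead of CenterStrategies.MEAN, which makes it straightforward to compare both with the evaluation from task H.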

For the submission:


- Make a test class called TestI in src/test/java where it is possible to see your code in
action
