
CSCI946 Assignment_1_task_sheet

The University of Wollongong's CSCI446/946 Big Data Analytics assignment requires students to conduct data analytics experiments using Python, focusing on problem analysis, data preprocessing, clustering, classification, and result discussion. Groups must work independently, submit a single report and code files, and ensure all members contribute to avoid penalties. The assignment emphasizes the importance of clear documentation, justification of methods, and adherence to academic integrity standards.

Uploaded by

Masud Zaman

University of Wollongong

School of Computing and Information Technology


CSCI446/946 Big Data Analytics
Spring 2024

Assignment 1
(10 marks)
Due: Check Moodle Site

Aim
This assignment aims to provide students with essential experience conducting data analytics
experiments by using the programming language Python. After completing this assignment, you
should know how to
• load and save data and workspace; and
as part of data analysis:
• analyze a problem and preprocess raw dataset,
• perform clustering,
• perform classification, and
• discuss experiment results in an introductory way.

Group work: You are to work on this assignment as a group. Each group is to work
independently from other groups on this assignment. All group members are expected to
contribute to this assignment. Group members may use communications tools (e.g., UOW
Zoom, UOW Webex, UOW Teams, Slack, Discord, WhatsApp, etc.) and online collaboration
workspaces (e.g., UOW OneDrive, Google Drive, GitHub, ZenHub, etc.) to complete the
assignment. Please plan before starting the assignment, then keep a detailed digital work log and
timesheet for each group member. A justification and/or explanation must accompany all your
answers to this assignment. All group members are expected to work together and to contribute
to this assignment. One submission per group only.

Penalties: If a group member fails to make a minimum contribution, the member will be
awarded zero marks. Claims of low or no contribution must be supported by evidence such as a work log.
Plagiarism of any part in this assignment will result in zero marks being awarded to the whole
group.

Preliminaries
Read through the lecture slides, lab instructions and the recommended readings in Weeks 1 – 4.
Conduct relevant background studies. You should use Python for the tasks in this assignment.
You can use any publicly accessible toolbox or library for Python. Your submission must include
the source code file(s) which, when run, would re-create all your results. Some of the
assignment tasks can be computationally expensive or memory expensive. You may require a
computer with sufficient compute power and memory (at least 16GB of memory in this subject).

Task 1: Problem Analysis and Data Preprocess (4 marks)


An e-commerce website NewChic.com keeps records of its products. A snapshot of records of
instances are stored by categories in accessories.csv, bags.csv, beauty.csv,
house.csv, jewelry.csv, kids.csv, men.csv, shoes.csv, and women.csv, which
are provided to you with this instruction. A product is represented by one instance (i.e., a row).
The nine CSV data sheets form the NewChic dataset. The data dictionary(.pdf)
summarizes all attributes and their types in the dataset. In this assignment, you need to focus on
integer and decimal type data (i.e., columns), excluding the integer-typed id column.
The remaining columns will support your data analytics design and discussion.

The analytics on the NewChic dataset aims to find


• top 10 products from your selected categories, and
• the best category among your selected categories.
For example, if you choose to analyse products in beauty, jewellery, bags, and shoes, the 10 best
products from these four categories are going to be reported, as well as the best category out of
these four. Please use as much data (i.e., categories) as you can. You need to choose at least as
many categories as there are members in your group. For example, if your group has 6
members then you need to choose at least six categories. Using fewer categories than the
minimum required will be marked zero for all tasks in this assignment.

To answer these two questions, you need to think about the following parts. A figure to
illustrate your analytics plan is preferred.
1. Design your experiment (Task 1) and report: why you would choose all or part of the data from
the NewChic dataset; how you would define “top 10” and “the best”; why some columns are
picked for clustering and classification algorithms and some columns are for result discussion.
2. Program data preprocess (Task 1) by combining CSVs in one sheet and report: matched and
removed columns, with detailed explanations.
3. Use at least two clustering algorithms (Task 2) on preprocessed data and report: detailed steps
of each algorithm, how you preprocessed the data, the result of all algorithms in a table,
algorithm comparison and best result.
4. Program at least two classification algorithms (Task 3) on preprocessed data and report:
detailed steps of each algorithm, result of all algorithms in a table, algorithm comparison and best
result.
5. Discuss results (Task 4) and report: the 10 best products, the best category and your
suggestions to NewChic.

Task 1 is expected to be answered in two sections in your report, under sections “Problem
Analysis” and “Data Preprocess”. Please cite the articles and programming resources you
refer to in your writing. The code for Task 1 must also be submitted. Add the data preprocess code
to the ZIP file for your submission, with your code saved in .py files.
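
The "combine CSVs in one sheet" step could be sketched as follows. This is only an illustration under stated assumptions, not a required design: it uses pandas, reads two tiny in-memory stand-ins instead of the real category files, and the column names (id, name, current_price, likes_count) are hypothetical because the data dictionary is not reproduced here. Filling missing values with the median is one possible choice among several and needs its own justification in a report.

```python
import io

import pandas as pd

# In-memory stand-ins for two category files. The column names are
# hypothetical; replace them with the ones in the data dictionary.
BAGS_CSV = "id,name,current_price,likes_count\n1,Tote,19.99,120\n2,Clutch,9.50,\n"
SHOES_CSV = "id,name,current_price,likes_count\n3,Sneaker,35.00,300\n3,Sneaker,35.00,300\n"


def combine_categories(sources):
    """Stack per-category CSVs into one DataFrame with a 'category' column."""
    frames = []
    for category, source in sources.items():
        frame = pd.read_csv(source)
        frame["category"] = category  # remember which file each row came from
        frames.append(frame)
    combined = pd.concat(frames, ignore_index=True)
    # Drop rows that are exact duplicates of an earlier row.
    return combined.drop_duplicates().reset_index(drop=True)


def numeric_features(combined):
    """Keep integer and decimal columns, excluding the id column."""
    numeric = combined.select_dtypes(include="number")
    return numeric.drop(columns=["id"], errors="ignore")


combined = combine_categories({
    "bags": io.StringIO(BAGS_CSV),
    "shoes": io.StringIO(SHOES_CSV),
})
features = numeric_features(combined)
# Assumption: missing numeric values are filled with the column median.
features = features.fillna(features.median())
print(features)
```

With the real dataset, the dictionary passed to `combine_categories` would map each chosen category name to its file path (for example `"bags": "bags.csv"`), since `pd.read_csv` accepts paths as well as file-like objects.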

Task 2: Clustering (2 marks)


You are to analyze the data you preprocessed in Task 1. You need to apply at least two different
clustering algorithms. Justify and explain your selection! Please practice the lab - clustering first,
then complete this task. Task 2 requests a report for detailed explanations of the steps of each
algorithm, the result of each algorithm in a table, algorithm comparison and best result.
Task 2 is expected to be answered in the section “Clustering” in your report. Please cite the
articles and programming resources you refer to in your writing. The code for Task 2 must also be
submitted. Add the code of clustering algorithms to the ZIP file for your submission, with your
code saved in .py files.
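
A minimal sketch of running two clustering algorithms and collecting their results in one table is given below. It assumes scikit-learn and pandas, and it uses synthetic data from `make_blobs` as a stand-in for the preprocessed numeric columns. The choice of K-Means and agglomerative clustering, the number of clusters (3), and the silhouette score as the comparison measure are all illustrative assumptions, not requirements of this task sheet.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed numeric columns.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)
# Standardize so that no single column dominates the distance computation.
X_scaled = StandardScaler().fit_transform(X)

# Two example algorithms; n_clusters=3 is an assumption for this toy data.
models = {
    "KMeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "Agglomerative": AgglomerativeClustering(n_clusters=3),
}

rows = []
for name, model in models.items():
    labels = model.fit_predict(X_scaled)
    rows.append({
        "algorithm": name,
        "n_clusters": len(set(labels)),
        # Silhouette lies in [-1, 1]; higher means better separated clusters.
        "silhouette": silhouette_score(X_scaled, labels),
        # Size of the smallest cluster, to spot clusters with only a few points.
        "smallest_cluster": int(pd.Series(labels).value_counts().min()),
    })

results = pd.DataFrame(rows)
print(results.to_string(index=False))
```

The resulting `results` table has one row per algorithm, which is one way to present the comparison the report asks for.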

Task 3: Classification (2 marks)


You are to analyze the data you preprocessed in Task 1. You need to perform at least two
classification algorithms and explain your selection: KNN, Naïve Bayes and more are available in
the reading materials. Please practice the lab - classification first, then complete this task. Task 3
requests a report for detailed explanations of the steps of each algorithm, the result of each
algorithm in a table, algorithm comparison and best result.
Task 3 is expected to be answered in the section “Classification” in your report. Please
cite the articles and programming resources you refer to in your writing. The code for Task 3 must
also be submitted. Add the code of classification algorithms to the ZIP file for your submission,
with your code saved in .py files.
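
The comparison of two classifiers could be sketched as shown here, assuming scikit-learn and pandas. The data is synthetic (`make_classification`), and the class labels are placeholders: which column or derived label serves as the class in the NewChic dataset is a design decision left to each group. The 75/25 split, k=5, and the accuracy and macro F1 measures are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for preprocessed features and a 3-class label.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
# Hold out 25% of the rows for testing, keeping class proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

classifiers = {
    # KNN is distance based, so the features are standardized first.
    "KNN (k=5)": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Gaussian Naive Bayes": GaussianNB(),
}

rows = []
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    predicted = clf.predict(X_test)
    rows.append({
        "algorithm": name,
        "accuracy": accuracy_score(y_test, predicted),
        "macro_f1": f1_score(y_test, predicted, average="macro"),
    })

results = pd.DataFrame(rows)
print(results.to_string(index=False))
```

Fitting the scaler inside the pipeline means it only sees the training rows, which avoids leaking information from the test split.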

Task 4: Result Discussion (2 marks)


Task 4 can answer the following questions and more:
• Are the clusters well separated from each other?
• Did the classifiers separate products well from each other into different classes?
• Do any of the clusters/classes have only a few points?
• Are there meaningful and non-meaningful clusters/classes with respect to the analytics problems
posed in Task 1?
• What are the advantages and shortcomings of clustering and classification algorithms in this analytics
case? Which one provides results of greater value?
• Are the examined algorithms suitable for Big Data analytics, and why, in your opinion?
• Will data preprocess affect clustering and classification results, and why, in your opinion?
• More you can report …
Finally, please report the 10 best products, the best category and your suggestions to NewChic.
Task 4 is expected to be answered in the section “Result Discussion” in your report.
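
Once "top 10" and "the best" have been defined, the final report can be produced with a few lines of pandas. The sketch below is purely illustrative: it assumes a single hypothetical column, likes_count, as the ranking criterion and the mean of that column as the category score. The actual definitions are part of the Task 1 design and must be justified by each group.

```python
import pandas as pd

# Toy product table; names, categories, and likes_count are made up.
products = pd.DataFrame({
    "name": [f"P{i}" for i in range(1, 13)],
    "category": ["bags"] * 4 + ["shoes"] * 4 + ["beauty"] * 4,
    "likes_count": [10, 20, 30, 40, 50, 60, 5, 15, 70, 1, 2, 3],
})

# Assumed definition of "top 10": the ten highest likes_count values.
top10 = products.nlargest(10, "likes_count")

# Assumed definition of "the best" category: highest mean likes_count.
category_scores = products.groupby("category")["likes_count"].mean()
best_category = category_scores.idxmax()

print(top10[["name", "category", "likes_count"]].to_string(index=False))
print("Best category:", best_category)
```

A criterion that combines several columns, or one derived from cluster or class membership, would replace `likes_count` in the same pattern.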

Submission:
The submission link for assignment 1 is on the subject’s Moodle site. Only one submission per
group. The submission must be a zip file named “A1.zip”, under 200 MB, and contain a
report (mandatory) and code (mandatory). Accepted file formats are a report in .pdf
format, and code files in .py.

Important:
• The report must be in a single file and in .pdf format. The title page must list the full name
and student ID of all members in the group. Clearly indicate members who did not make a
minimum contribution.
• The report must answer the questions in their order as given in the assignment specification.
There is no page limit.
• The report must have a clear heading for each part of each task.
• Sufficient description, explanation, justification, and discussion are essential parts of your
answers. Marks will be deducted for incomplete or vague answers.
• Sufficient, suitable, and legible annotation shall be provided in your code to make it easy to
understand. Marks will be deducted for untidy code, code that is difficult to read, code that does
not run, or code that does not reproduce the results in your report.

Note: Failure of your code to run may attract zero marks. Plagiarism of any part in your code, or
any part in your report will attract zero marks for this assignment. It is the responsibility of the
group to ensure that your submission does not contain plagiarized material. You may be
requested to demonstrate and explain your program or explain your answer in the report.
Marks are deducted if you are unable to offer an explanation. Marks will be awarded for correct
design, implementation, style, completeness, and justification.

---------------------------------- END-------------------------------------
