CSCI946 Assignment_1_task_sheet

The University of Wollongong's CSCI446/946 Big Data Analytics assignment requires students to conduct data analytics experiments using Python, focusing on problem analysis, data preprocessing, clustering, classification, and result discussion. Groups must work independently, submit a single report and code files, and ensure all members contribute to avoid penalties. The assignment emphasizes the importance of clear documentation, justification of methods, and adherence to academic integrity standards.

Uploaded by

Masud Zaman

University of Wollongong

School of Computing and Information Technology

CSCI446/946 Big Data Analytics
Spring 2024

Assignment 1
(10 marks)
Due: Check Moodle Site

Aim
This assignment aims to provide students with essential experience conducting data analytics
experiments by using the programming language Python. After completing this assignment, you
should know how to
• load and save data and workspace; and
as part of data analysis:
• analyze a problem and preprocess a raw dataset,
• perform clustering,
• perform classification, and
• discuss experiment results in an introductory way.

Group work: You are to work on this assignment as a group. Each group is to work
independently from other groups on this assignment. All group members are expected to
contribute to this assignment. Group members may use communications tools (e.g., UOW
Zoom, UOW Webex, UOW Teams, Slack, Discord, WhatsApp, etc.) and online collaboration
workspaces (e.g., UOW OneDrive, Google Drive, GitHub, ZenHub, etc.) to complete the
assignment. Please plan before starting the assignment, then keep a detailed digital work log and
timesheet for each group member. A justification and/or explanation must accompany all your
answers to this assignment. All group members are expected to work together and to contribute
to this assignment. One submission per group only.

Penalties: If a group member fails to make a minimum contribution, the member will be
awarded zero marks. Claims of reduced or no contribution must be supported by evidence such as a work log.
Plagiarism of any part in this assignment will result in zero marks being awarded to the whole
group.

Preliminaries
Read through the lecture slides, lab instructions and the recommended readings in Weeks 1 – 4.
Conduct relevant background studies. You should use Python for the tasks in this assignment.
You can use any publicly accessible toolbox or library for Python. Your submission must include
the source code file(s) which, when run, would re-create all your results. Some of the
assignment tasks can be computationally expensive or memory expensive. You may require a
computer with sufficient compute power and memory (at least 16GB of memory in this subject).

Task 1: Problem Analysis and Data Preprocess (4 marks)

An e-commerce website, NewChic.com, keeps records of its products. A snapshot of these
records is stored by category in accessories.csv, bags.csv, beauty.csv,
house.csv, jewelry.csv, kids.csv, men.csv, shoes.csv, and women.csv, which
are provided to you with this instruction. A product is represented by one instance (i.e., a row).
The nine CSV data sheets form the NewChic dataset. The data dictionary (.pdf)
summarizes all attributes and their types in the dataset. In this assignment, you need to focus on
the integer and decimal type data (i.e., columns), excluding the integer-typed id column.
The remaining columns will support your data analytics design and discussion.
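
As a rough illustration of the combining and column-selection step described above, the following minimal sketch merges per-category tables and keeps only the numeric columns other than id. The column names (name, current_price, likes_count) and the toy rows are hypothetical placeholders; the real names and types come from the data dictionary. It uses only the standard library, although a library such as pandas would be a common choice.

```python
import csv
import io

# Toy stand-ins for two category files; the column names and values are
# hypothetical placeholders, not taken from the real data dictionary.
SAMPLE_FILES = {
    "bags": "id,name,current_price,likes_count\n1,Tote,19.99,120\n2,Clutch,9.50,45\n",
    "shoes": "id,name,current_price,likes_count\n7,Sneaker,35.00,300\n",
}


def is_number(text):
    """Return True if the text parses as an integer or decimal value."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def combine(files, exclude=("id",)):
    """Merge all files into one list of rows, tagging each row with its category."""
    rows = []
    for category, text in files.items():
        for row in csv.DictReader(io.StringIO(text)):
            row["category"] = category
            rows.append(row)
    # A column counts as numeric only if every row holds a parseable number.
    columns = [c for c in rows[0] if c not in exclude and c != "category"]
    numeric = [c for c in columns if all(is_number(r[c]) for r in rows)]
    return rows, numeric


rows, numeric_columns = combine(SAMPLE_FILES)
print(numeric_columns)
```

With the real files, each io.StringIO(text) would be replaced by a file opened with open(path, newline=""). Missing values and columns that do not match across categories would need explicit handling, which this sketch omits.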

The analytics on the NewChic dataset aims to find the

• top 10 products from your selected categories, and
• the best category among your selected categories.
For example, if you choose to analyse products in beauty, jewelry, bags, and shoes, the 10 best
products from these four categories are to be reported, as well as the best category out of
these four. Please use as much data (i.e., as many categories) as you can. You need to choose at least as
many categories as there are members in your group. For example, if your group has 6
members then you need to choose at least six categories. Using fewer categories than the
minimum required will result in zero marks for all tasks in this assignment.
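
Once a per-product score has been defined, reporting the two answers reduces to a sort and a group-wise aggregate. The sketch below assumes a hypothetical score per product and takes "the best" category to be the one with the highest mean score. Both choices are illustrative assumptions, since the task sheet leaves these definitions to each group.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (category, product, score) records; how "score" is defined is
# left to the analyst and is not specified by the task sheet.
records = [
    ("bags", "A", 0.91), ("bags", "B", 0.40),
    ("shoes", "C", 0.75), ("shoes", "D", 0.88),
    ("beauty", "E", 0.10),
]


def top_n(items, n=10):
    """Return the n highest-scoring records, best first."""
    return sorted(items, key=lambda r: r[2], reverse=True)[:n]


def best_category(items):
    """Return the category with the highest mean score."""
    by_category = defaultdict(list)
    for category, _, score in items:
        by_category[category].append(score)
    return max(by_category, key=lambda c: mean(by_category[c]))


print(top_n(records, n=3))
print(best_category(records))
```

Other aggregates (median, share of top-ranked products, cluster membership) would give different "best" categories, so the chosen definition should be justified in the report.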

To answer these two questions, you need to think about the following parts. A figure to
illustrate your analytics plan is preferred.
1. Design your experiment (Task 1) and report: why you would choose all or part of the data from
the NewChic dataset; how you would define “top 10” and “the best”; why some columns are
picked for the clustering and classification algorithms and some columns are kept for result discussion.
2. Program the data preprocess (Task 1) by combining the CSVs into one sheet and report: matched and
removed columns, with detailed explanations.
3. Use at least two clustering algorithms (Task 2) on the preprocessed data and report: detailed steps
of each algorithm, how you preprocessed the data, the results of all algorithms in a table,
algorithm comparison and best result.
4. Program at least two classification algorithms (Task 3) on the preprocessed data and report:
detailed steps of each algorithm, the results of all algorithms in a table, algorithm comparison and best
result.
5. Discuss results (Task 4) and report: the 10 best products, the best category and your
suggestions to NewChic.

Task 1 is expected to be answered in two sections in your report, under the sections “Problem
Analysis” and “Data Preprocess”. Please cite the articles and programming
resources you refer to in your writing. The code for Task 1 must also be submitted. Add the data preprocess code
to the ZIP file for your submission, saved as .py.

Task 2: Clustering (2 marks)

You are to analyze the data you preprocessed in Task 1. You need to apply at least two different
clustering algorithms. Justify and explain your selection! Please practice the lab - clustering first,
then complete this task. Task 2 requests a report with detailed explanations of the steps of each
algorithm, the result of each algorithm in a table, algorithm comparison and best result.
Task 2 is expected to be answered in the section “Clustering” in your report. Please cite
the articles and programming resources you refer to in your writing. The code for Task 2 must also be
submitted. Add the clustering code to the ZIP file for your submission,
saved as .py.
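
To make "detailed steps of each algorithm" concrete, here is a minimal sketch of one candidate, k-means (Lloyd's algorithm), written with the standard library only. The farthest-first initialisation is a simplifying assumption chosen to keep the example deterministic, and the six 2-D points are synthetic. In practice a library implementation would normally be used, and the task still requires a second, different algorithm.

```python
import math


def farthest_first(points, k):
    """Deterministic initialisation: start at the first point, then repeatedly
    add the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    centroids = farthest_first(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


# Synthetic 2-D points forming two obvious groups (illustrative only).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Because k-means relies on Euclidean distance, columns on very different scales would normally be rescaled before clustering; that step is omitted here.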

Task 3: Classification (2 marks)

You are to analyze the data you preprocessed in Task 1. You need to perform at least two
classification algorithms and explain your selection: KNN, Naïve Bayes and more are available in
the reading materials. Please practice the lab - classification first, then complete this task. Task 3
requests a report with detailed explanations of the steps of each algorithm, the result of each
algorithm in a table, algorithm comparison and best result.
Task 3 is expected to be answered in the section “Classification” in your report. Please
cite the articles and programming resources you refer to in your writing. The code for Task 3 must
also be submitted. Add the classification code to the ZIP file for your submission,
saved as .py.
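
As an illustration of one of the classifiers named above, the following minimal sketch implements k-nearest neighbours (KNN) by majority vote and measures accuracy on held-out points. The data, the class labels "low" and "high", and the choice of k=3 are hypothetical; how class labels are derived for the NewChic products is a design decision for Task 1.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label a query point by majority vote among its k nearest neighbours."""
    nearest = sorted(
        zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Synthetic training and test data with hypothetical class labels.
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
train_y = ["low", "low", "low", "high", "high", "high"]
test_x = [(1.1, 0.9), (7.9, 8.0)]
test_y = ["low", "high"]

predictions = [knn_predict(train_x, train_y, q) for q in test_x]
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
print(predictions, accuracy)
```

Evaluating on points that were not used for training, as done here, is what makes the accuracy figure meaningful when comparing algorithms in a results table.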

Task 4: Result Discussion (2 marks)

Task 4 can answer the following questions and more:
• Are the clusters well separated from each other?
• Did the classifiers separate products well into different classes?
• Do any of the clusters/classes have only a few points?
• Are there meaningful and non-meaningful clusters/classes with respect to the analytics problems
posed in Task 1?
• What are the advantages and shortcomings of the clustering and classification algorithms in this analytics
case? Which one provides results of greater value?
• Are the examined algorithms suitable for Big Data analytics, and why in your opinion?
• Will data preprocessing affect clustering and classification results, and why in your opinion?
• More you can report …
Finally, please report the 10 best products, the best category and your suggestions to NewChic.
Task 4 is expected to be answered in the section “Result Discussion” in your report.
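
Two of the questions above (cluster separation and cluster size) can be backed by simple numbers. The sketch below computes the mean silhouette coefficient, where values near 1 suggest well-separated clusters, and counts the points per cluster. The points and labels are synthetic, and the silhouette is only one possible separation measure, offered here as an assumption rather than a required metric.

```python
import math
from collections import Counter


def silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two distinct clusters."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        same = [
            math.dist(p, q)
            for j, (q, other) in enumerate(zip(points, labels))
            if other == lab and j != i
        ]
        if not same:  # singleton cluster: silhouette is defined as 0
            scores.append(0.0)
            continue
        a = sum(same) / len(same)  # mean distance within the own cluster
        b = min(  # mean distance to the nearest other cluster
            sum(math.dist(p, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels)
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Synthetic points with an illustrative cluster assignment.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels = [0, 0, 0, 1, 1, 1]

score = silhouette(points, labels)
sizes = Counter(labels)  # reveals clusters that hold only a few points
print(round(score, 3), dict(sizes))
```

This version compares every pair of points, so its cost grows quadratically with the number of rows; that cost is itself relevant to the question of suitability for Big Data analytics.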

Submission:
The submission link for assignment 1 is on the subject’s Moodle site. Only one submission per
group. The submission must be a zip file named “A1.zip”, under 200 MB, and must contain a
report (mandatory) and code (mandatory). Accepted file formats are a report in .pdf
format, and code files in .py.
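
Before uploading, the archive can be checked against the stated constraints. The following is a minimal sketch of such a check; the function name check_submission is hypothetical, and treating 200 MB as 200 × 1024 × 1024 bytes is an assumption. The single-report rule comes from the notes under "Important" on this sheet.

```python
import zipfile
from pathlib import Path

# 200 MB limit from the task sheet (assuming 1 MB = 1024 * 1024 bytes).
MAX_BYTES = 200 * 1024 * 1024


def check_submission(zip_path):
    """Return a list of problems found in the archive (empty if none)."""
    problems = []
    path = Path(zip_path)
    if path.name != "A1.zip":
        problems.append("archive must be named A1.zip")
    if path.stat().st_size >= MAX_BYTES:
        problems.append("archive must be under 200 MB")
    with zipfile.ZipFile(path) as archive:
        # Skip directory entries, which end with a slash.
        names = [n for n in archive.namelist() if not n.endswith("/")]
    suffixes = [Path(n).suffix.lower() for n in names]
    if suffixes.count(".pdf") != 1:
        problems.append("expected exactly one .pdf report")
    if ".py" not in suffixes:
        problems.append("expected at least one .py code file")
    if any(s not in (".pdf", ".py") for s in suffixes):
        problems.append("only .pdf and .py files are accepted")
    return problems
```

Calling check_submission("A1.zip") in the folder that holds the archive returns an empty list when none of these checks fail. It does not verify that the code runs or reproduces the reported results.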

Important:
• The report must be in a single file and in .pdf format. The title page must list the full name
and student ID of all members in the group. Clearly indicate members who did not make a
minimum contribution.
• The report must answer the questions in their order as given in the assignment specification.
There is no page limit.
• The report must have a clear heading for each part of each task.
• Sufficient description, explanation, justification, and discussion are essential parts of your
answers. Marks will be deducted for incomplete or vague answers.
• Sufficient, suitable, and legible annotation shall be provided in your code to make it easy to
understand. Marks will be deducted for untidy code, code that is difficult to read, code that does
not run, or code that does not reproduce the results in your report.

Note: Failure of your code to run may attract zero marks. Plagiarism of any part in your code, or
any part in your report will attract zero marks for this assignment. It is the responsibility of the
group to ensure that your submission does not contain plagiarized material. You may be
requested to demonstrate and explain your program or explain your answer in the report.
Marks are deducted if you are unable to offer an explanation. Marks will be awarded for correct
design, implementation, style, completeness, and justification.

---------------------------------- END-------------------------------------
