Lab Artificial Intelligence CS420
Lab: Decision Tree
1 Description
In this assignment, you are going to build decision trees on real-world datasets using scikit-learn.
The datasets you will be working on include:

• Binary class dataset: The UCI Breast Cancer Wisconsin (Diagnostic) dataset is used for
classifying tumors as malignant or benign based on features derived from its imaging data.
This dataset includes 569 samples, with labels indicating malignant (M) or benign (B).
Please visit the link below for the dataset:
https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic
• Multi-class dataset: The UCI Wine Quality dataset is used for classifying wine samples
into quality levels based on physicochemical properties such as acidity, alcohol content, etc.
This dataset includes 4898 samples, with labels from 0 (low quality) to 10 (high quality).
Please visit the link below for the dataset:
https://archive.ics.uci.edu/dataset/186/wine+quality
• Additional dataset: You have to find another dataset and build the decision tree for it.
Please provide a detailed description of the dataset information in your report.
Your dataset must:
– Contain both features and labels for supervised learning.

– Include at least 300 samples for meaningful analysis.
– Contain at least two classes (binary or multi-class).
2 Specifications
You are required to write Python notebooks (.ipynb) and use the scikit-learn library to complete
the following tasks, which are described for the Breast Cancer dataset.
While there are no strict guidelines for code organization, each task must be clearly documented
and fully comply with all specified requirements.
University of Science Faculty of Information Technology Page 1

2.1 Preparing the datasets

This task sets up the training and test datasets for the upcoming experiments.
Using the features and labels above, please prepare the following four subsets:
• feature_train: a set of training samples.
• label_train: a set of labels corresponding to the samples in feature_train.
• feature_test: a set of test samples with a structure similar to feature_train.
• label_test: a set of labels corresponding to the samples in feature_test.
You need to shuffle the dataset before splitting and ensure it is split in a stratified fashion.
Other parameters (if there are any) should remain at their default settings.
There will be experiments on training and test sets with different proportions, including 40/60,
60/40, 80/20, and 90/10 (train/test); therefore, you will need 16 subsets in total.
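The splitting step can be sketched as follows. This is a minimal illustration, not a required layout: it assumes scikit-learn's bundled copy of the Breast Cancer Wisconsin (Diagnostic) data (load_breast_cancer) in place of the UCI download, and the splits dictionary and the fixed random_state are hypothetical choices added here for convenience and reproducibility.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled copy of the dataset stands in for the
# file downloaded from the UCI repository.
features, labels = load_breast_cancer(return_X_y=True)

# Train fractions for the 40/60, 60/40, 80/20 and 90/10 experiments.
train_fractions = [0.4, 0.6, 0.8, 0.9]

splits = {}  # hypothetical container: train fraction -> the four subsets
for fraction in train_fractions:
    feature_train, feature_test, label_train, label_test = train_test_split(
        features,
        labels,
        train_size=fraction,
        shuffle=True,      # shuffle before splitting (this is also the default)
        stratify=labels,   # keep the class ratio the same in both parts
        random_state=42,   # assumption: fixed seed so that reruns match
    )
    splits[fraction] = (feature_train, feature_test, label_train, label_test)

for fraction, (f_train, f_test, l_train, l_test) in splits.items():
    print(f"train fraction {fraction}: {len(f_train)} train / {len(f_test)} test")
```

Four proportions with four subsets each give the 16 subsets mentioned above.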
Visualize the class distributions in all datasets (the original set, training sets, and test sets)
across all proportions to demonstrate that they have been appropriately prepared.
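One possible way to check the class distributions is a grouped bar chart of class shares per subset. The sketch below assumes matplotlib is available and again uses scikit-learn's bundled copy of the Breast Cancer data; the helper class_fractions and the distributions dictionary are hypothetical names introduced only for this example.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
features, labels = data.data, data.target
class_names = [str(name) for name in data.target_names]


def class_fractions(y):
    """Return the share of samples that belongs to each class in y."""
    counts = np.bincount(y, minlength=len(class_names))
    return counts / counts.sum()


# Collect the original set plus the train/test parts of every proportion.
distributions = {"original": class_fractions(labels)}
for fraction in [0.4, 0.6, 0.8, 0.9]:
    _, _, label_train, label_test = train_test_split(
        features, labels, train_size=fraction,
        shuffle=True, stratify=labels,
        random_state=42,  # assumption: fixed seed for reproducibility
    )
    tag = f"{round(fraction * 100)}/{round((1 - fraction) * 100)}"
    distributions[f"train {tag}"] = class_fractions(label_train)
    distributions[f"test {tag}"] = class_fractions(label_test)

# Grouped bar chart: one group per subset, one bar per class.
fig, ax = plt.subplots(figsize=(10, 4))
positions = np.arange(len(distributions))
width = 0.4
for i, name in enumerate(class_names):
    shares = [dist[i] for dist in distributions.values()]
    ax.bar(positions + i * width, shares, width, label=name)
ax.set_xticks(positions + width / 2)
ax.set_xticklabels(list(distributions), rotation=45, ha="right")
ax.set_ylabel("share of samples")
ax.legend()
fig.tight_layout()
```

If the split is stratified correctly, the bars for every subset should be nearly identical to those of the original set.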
2.2 Building the decision tree classifiers

This task involves conducting experiments on the designated train/test proportions listed above.
You need to fit an instance of sklearn.tree.DecisionTreeClassifier (using information gain)
to each training set and visualize the resulting decision tree with Graphviz.
Figure 1: Example for a decision tree classifier (with depth = 2).
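As a rough sketch of this task for a single proportion (80/20), the code below fits a classifier with criterion="entropy", which makes scikit-learn choose splits by information gain, and exports the tree in Graphviz DOT format. The bundled dataset copy and the fixed random_state are assumptions; rendering the DOT string additionally requires the optional graphviz package, which is only mentioned in a comment here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

data = load_breast_cancer()
feature_train, feature_test, label_train, label_test = train_test_split(
    data.data, data.target, train_size=0.8,
    shuffle=True, stratify=data.target,
    random_state=42,  # assumption: fixed seed for reproducibility
)

# criterion="entropy" selects splits by information gain.
classifier = DecisionTreeClassifier(criterion="entropy", random_state=42)
classifier.fit(feature_train, label_train)

# With out_file=None, export_graphviz returns the tree as a DOT string.
dot_source = export_graphviz(
    classifier,
    out_file=None,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    filled=True,
    rounded=True,
)

# In a notebook, the optional `graphviz` package can render the string:
#   import graphviz
#   graphviz.Source(dot_source)
print(dot_source[:200])
```

The same steps would be repeated for each of the other train/test proportions.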

2.3 Evaluating the decision tree classifiers

For each of the above decision tree classifiers, predict the samples in the corresponding test set
and generate a report using classification_report and confusion_matrix.
Figure 2: Example for Classification Report and Confusion Matrix.
How do you interpret the classification report and the confusion matrix? Based on the results,
provide your insights into the performance of these decision tree classifiers.
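The evaluation step might look like the sketch below, shown for the 80/20 split only. As before, the bundled dataset copy and the fixed random_state are assumptions made for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
feature_train, feature_test, label_train, label_test = train_test_split(
    data.data, data.target, train_size=0.8,
    shuffle=True, stratify=data.target,
    random_state=42,  # assumption: fixed seed for reproducibility
)

classifier = DecisionTreeClassifier(criterion="entropy", random_state=42)
classifier.fit(feature_train, label_train)
predictions = classifier.predict(feature_test)

# Per-class precision, recall and F1-score, plus overall accuracy.
report = classification_report(
    label_test, predictions, target_names=list(data.target_names)
)
# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(label_test, predictions)

print(report)
print(matrix)
```

In the confusion matrix, the diagonal entries count correct predictions, and each off-diagonal entry counts samples of the row's class that were predicted as the column's class.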
2.4 The depth and accuracy of a decision tree

This task focuses on the 80/20 training and test sets. You need to consider how the depth of
the decision tree affects classification accuracy.
You can specify the maximum depth of a decision tree by adjusting the max_depth parameter.
Try the following values for parameter max_depth: None, 2, 3, 4, 5, 6, 7. Then:
• Provide the decision trees, visualized using Graphviz, for each max_depth value.
• Report the accuracy_score (on the test set) of the decision tree classifier for each value of
the max_depth parameter in the following table.
max_depth | None | 2 | 3 | 4 | 5 | 6 | 7
Accuracy  |      |   |   |   |   |   |
• Provide charts and your insights on the statistics reported above.
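The depth experiment can be sketched as a loop over the listed max_depth values. The accuracies dictionary is a hypothetical name, and the bundled dataset copy and fixed random_state are assumptions; the Graphviz export for each tree is omitted here to keep the example short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
feature_train, feature_test, label_train, label_test = train_test_split(
    data.data, data.target, train_size=0.8,
    shuffle=True, stratify=data.target,
    random_state=42,  # assumption: fixed seed for reproducibility
)

accuracies = {}  # hypothetical container: max_depth -> test accuracy
for depth in [None, 2, 3, 4, 5, 6, 7]:
    classifier = DecisionTreeClassifier(
        criterion="entropy", max_depth=depth, random_state=42
    )
    classifier.fit(feature_train, label_train)
    predictions = classifier.predict(feature_test)
    accuracies[depth] = accuracy_score(label_test, predictions)

for depth, accuracy in accuracies.items():
    print(f"max_depth={depth}: accuracy={accuracy:.3f}")
```

The resulting values can be copied into the table and plotted as accuracy against max_depth.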

2.5 Repeat for other datasets

For the Wine Quality dataset and the Additional dataset, also perform the same steps as
described above.
Since the Wine Quality dataset's labels can span 11 quality levels (classes 0-10), you should
group them into 3 broader categories for analysis: Low quality (classes 0-4), Standard quality
(classes 5-6), and High quality (classes 7-10).
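One way to do this grouping is with numpy.digitize, as in the sketch below. The quality array holds made-up scores for illustration only; in practice it would be the quality column of the Wine Quality data. The names group_names and quality_group are hypothetical.

```python
import numpy as np

# Made-up quality scores standing in for the dataset's quality labels.
quality = np.array([3, 4, 5, 6, 7, 8, 9, 5, 6, 4])

group_names = np.array(["Low", "Standard", "High"])

# np.digitize with bins [5, 7] maps:
#   quality < 5       -> index 0 (Low, classes 0-4)
#   5 <= quality < 7  -> index 1 (Standard, classes 5-6)
#   quality >= 7      -> index 2 (High, classes 7-10)
group_index = np.digitize(quality, bins=[5, 7])
quality_group = group_names[group_index]

for score, group in zip(quality, quality_group):
    print(score, group)
```

The grouped labels then take the place of the raw quality scores when preparing the training and test sets.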
After completing the experiments for all datasets, analyze how characteristics of the datasets
(number of classes, number of features, and sample size) influence the decision tree’s performance
(accuracy, precision, etc.).
Provide your analysis based on the summarized results across the datasets in the report.
3 Requirements
3.1 Report
The report must include the following sections:
• Member information (Student ID, full name, etc.).
• A work assignment table, which includes information on each task assigned to team members,
along with the completion rate of each member relative to the assigned tasks.
• A self-evaluation of the completion rate of the Lab and other requirements.
• All visualizations must be presented in the .ipynb file, while statistical results and insights
must be presented in the report.
• The report needs to be well-formatted and exported to PDF. If there are figures cut off by
the page break, etc., points will be deducted.
• References (if any).
3.2 Submission
• All reports, code, etc., must be submitted as a single compressed file (.zip, .rar, .7z)
named according to the format: StudentID1_StudentID2_etc.zip/.rar/.7z.
• If the compressed file is larger than 25MB, prioritize compressing the report and source code.
Images and other large files may be uploaded to Google Drive and shared via a link.

4 Assessment
The detailed assessment criteria for this Lab are outlined as follows:
No.  Criteria                                       Score
1    Analysis of the Wine Quality dataset.          30%
2    Analysis of the Breast Cancer dataset.         30%
3    Analysis of an additional dataset.             30%
4    Comparative analysis of all three datasets.    5%
5    Well-structured and formatted notebooks.       5%
     Total                                          100%
The detailed assessment criteria for each dataset are outlined as follows:
No.  Criteria                                          Score
1    Data preparation.                                 30%
2    Implement decision tree classifiers.              20%
3    Performance evaluation of decision tree.
     - Classification report and confusion matrix.     10%
     - Insights.                                       10%
4    Depth and accuracy of decision trees.
     - Visualization (trees, tables, charts).          20%
     - Insights.                                       10%
     Total                                             100%
5 Notices
Please pay attention to the following notices:
• This is a GROUP assignment. Each group has 4 members.
• Duration: about 2 weeks.
• Any plagiarism, cheating, or dishonesty will result in a score of 0 for the course grade.
The end.
