Assignment_1_Machine Learning
Total Marks: 50
Release Date: 26 Jan 2025
Date of Submission: 2 March 2025
You have been provided with a CSV file, "Cars93.csv". The dataset is related to cars and
contains 26 columns. In this dataset, “Price” is the target variable (i.e., the output). The
distribution of marks across the tasks is as follows:
1. Assign a type to each of the following features (a) Model, (b) Type, (c) Max. Price and
(d) Airbags from the following: ordinal/nominal/ratio/interval scale.
2. Write a function to handle the missing values in the dataset (e.g., any NA, NaN values).
3. Write a function to reduce noise (any error in the feature) in individual attributes.
4. Write a function to encode all the categorical features in the dataset jointly, choosing
the encoding according to the type of each variable.
5. Write a function to normalize / scale the features either individually or jointly.
6. Write a function to create a random split of the data into train, validation and test sets in
the ratio of [70:20:10].
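Tasks 2, 5 and 6 above can be sketched in plain Python as follows. This is only a minimal illustration, not a required solution: the function names (`fill_missing_with_mean`, `min_max_scale`, `split_indices`) are hypothetical, mean imputation and min-max scaling are assumed choices among several valid ones, and each column is assumed to be a simple list of numbers rather than a DataFrame.

```python
import math
import random


def fill_missing_with_mean(values):
    """Replace None/NaN entries of a numeric column with the mean of the observed entries.

    Mean imputation is an assumed strategy; median or mode may suit other columns better.
    """
    observed = [v for v in values if v is not None and not math.isnan(v)]
    if not observed:
        raise ValueError("column has no observed values")
    mean = sum(observed) / len(observed)
    return [mean if (v is None or math.isnan(v)) else v for v in values]


def min_max_scale(values):
    """Rescale a numeric column to the range [0, 1] (min-max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map every entry to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def split_indices(n_rows, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle row indices and split them into train/validation/test lists.

    The test set receives whatever remains after train and validation,
    so every row is assigned to exactly one set.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = int(round(ratios[0] * n_rows))
    n_val = int(round(ratios[1] * n_rows))
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test
```

For example, `split_indices(100)` returns index lists of length 70, 20 and 10; the indices can then be used to select rows from the cleaned dataset.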
Q2a: Linear Regression Task – 6 Marks
Use the “linear_regression_dataset.csv” file.
Implement a linear regression model that captures the dependency between two variables.
1. Implement linear regression using the built-in “LinearRegression” model in
sklearn.
2. Print the coefficient obtained from linear regression and plot the fitted straight line on
the scatter plot.
3. Now, implement linear regression without the use of any inbuilt function.
4. Compare the results of 1 and 3 graphically.
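For item 3, one possible approach is the closed-form ordinary least squares solution for a single input variable: slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x). The sketch below assumes exactly one feature and one target, both given as plain lists; the function name `fit_simple_linear_regression` is hypothetical, and gradient descent would be an equally acceptable alternative.

```python
def fit_simple_linear_regression(x, y):
    """Fit y = slope * x + intercept by ordinary least squares (closed form)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sxx: sum of squared deviations of x; Sxy: sum of cross deviations of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x is constant, so the slope is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Evaluate the fitted line at each point of x."""
    return [slope * xi + intercept for xi in x]
```

The returned slope and intercept can be compared with the `coef_` and `intercept_` attributes of a fitted sklearn `LinearRegression` model, and both lines can be drawn on the same scatter plot for item 4.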
In the report, provide your findings for the output generated for all the kernels used and also
describe the changes observed after changing the regularization hyperparameter.
Q4: Decision Tree and Random Forest – 15 Marks
Load the IRIS dataset. The dataset consists of 150 samples of iris flowers, each belonging to
one of three species (setosa, versicolor, or virginica). Each sample includes four features: sepal
length, sepal width, petal length, and petal width.
1. Visualize the distribution of each feature and the class distribution.
2. Encode the categorical target variable (species) into numerical values.
3. Split the dataset into training and testing sets (use an appropriate ratio).
4. Decision Tree Model
i. Build a decision tree classifier using the training set.
ii. Visualize the resulting decision tree.
iii. Make predictions on the testing set and evaluate the model's performance using
appropriate metrics (e.g., accuracy, confusion matrix).
5. Random Forest Model
i. Build a random forest classifier using the training set.
ii. Tune the hyperparameters (e.g., number of trees, maximum depth) if necessary.
iii. Make predictions on the testing set and evaluate the model's performance using
appropriate metrics and compare it with the decision tree model.
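The evaluation steps (4.iii and 5.iii) can be sketched as below. This is an illustrative helper rather than a prescribed method: the function names are hypothetical, and the labels are assumed to be already encoded as integers 0, 1, 2 (as in step 2). Rows of the matrix correspond to true classes and columns to predicted classes.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Count predictions per (true class, predicted class) pair.

    matrix[t][p] is the number of samples with true label t predicted as p.
    """
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    if not y_true:
        raise ValueError("y_true must not be empty")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)
```

Applying both functions to the decision tree predictions and to the random forest predictions on the same test set gives a like-for-like comparison of the two models.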