Exercise 1: Introduction to Scikit-learn for Supervised Learning
Objective: To gain hands-on experience with implementing supervised learning algorithms
using the Scikit-learn library in Python.
Solution:
Theory:
Supervised Learning is a type of machine learning where the algorithm is trained on labeled
data, i.e., input features (X) and corresponding output labels (y). The goal is to learn a mapping
function that can predict the output for new, unseen data.
Scikit-learn (sklearn) is a Python library that provides simple and efficient tools for data
mining and machine learning. It includes several classification, regression, and clustering
algorithms.
In this exercise, we use the K-Nearest Neighbors (KNN) classifier on the Iris dataset, a
popular dataset of 150 samples with four measurements (sepal length, sepal width, petal
length, petal width) for each of three iris species.
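A minimal sketch of the KNN workflow described above, assuming a 70/30 train-test split and n_neighbors=5 (the scikit-learn default); both values are illustrative choices, not taken from the exercise:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the Iris data: 150 samples, 4 measurements each, 3 species
iris = load_iris()
X, y = iris.data, iris.target

# Hold out 30% of the samples for testing (the split ratio is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# n_neighbors=5 is the library default, used here only for illustration
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Predict the species of the held-out flowers and measure accuracy
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Test accuracy:", accuracy)
```

Each prediction is a majority vote among the five training samples closest to the test point, so the model has no training phase beyond storing the data.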
Key Features of Scikit-learn:
• Supervised learning: Classification and regression algorithms like:
o Decision Tree
o Random Forest
o K-Nearest Neighbors (KNN)
o Support Vector Machine (SVM)
o Logistic Regression
• Unsupervised learning: Clustering and dimensionality reduction algorithms like:
o K-Means
o PCA (Principal Component Analysis)
• Model selection: Tools for cross-validation, hyperparameter tuning (GridSearchCV,
RandomizedSearchCV)
• Preprocessing: Functions for:
o Data scaling (e.g., StandardScaler)
o Handling missing values
o Encoding categorical variables
• Datasets: Includes built-in sample datasets like:
o Iris
o Digits
o Boston Housing (deprecated in version 1.0 and removed in version 1.2)
o Wine, etc.
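The preprocessing, model selection, and dataset features listed above can be combined in a few lines. The following is a sketch only: the choice of the Wine dataset, the KNN model, the step names "scale" and "knn", and the candidate values for n_neighbors are all assumptions made for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Built-in sample dataset: 178 wines, 13 features, 3 classes
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Chain scaling and the classifier so scaling is refit inside each CV fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Try several neighbor counts with 5-fold cross-validation
param_grid = {"knn__n_neighbors": [1, 3, 5, 7, 9]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

best_k = search.best_params_["knn__n_neighbors"]
test_score = search.score(X_test, y_test)
print("Best n_neighbors:", best_k)
print("Held-out accuracy:", test_score)
```

Placing the scaler inside the pipeline keeps the test folds out of the scaling statistics during cross-validation.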
Why Use Scikit-learn?
• Easy to learn and use
• Well-documented
• Integrates well with other libraries like NumPy, Pandas, and Matplotlib
• Widely used in education, research, and industry
---------------------------------------------Page End-----------------------------------------------------
Exercise 2: Exploring Unsupervised Learning with K-Means Clustering
Objective: To explore the concepts of unsupervised learning and clustering using the K-Means
algorithm.
Solution:
Objective:
To explore the concepts of unsupervised learning by applying the K-Means clustering
algorithm using the Scikit-learn library. This exercise helps understand how to group similar
data points when no labels are provided.
Theory:
What is Unsupervised Learning?
Unsupervised learning is a type of machine learning where the model is not given any labeled
output data. The goal is to discover hidden patterns or groupings in the data.
What is K-Means Clustering?
K-Means is a popular unsupervised learning algorithm used for clustering data into K groups
(clusters) based on feature similarity. It works by:
1. Choosing the number of clusters (K).
2. Randomly selecting initial cluster centroids.
3. Assigning data points to the nearest centroid.
4. Updating centroids based on the mean of points in each cluster.
5. Repeating steps 3 and 4 until convergence.
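The five steps above can be sketched directly in NumPy. This is an illustrative implementation, not the one used by Scikit-learn; the function name kmeans, its parameters, and the synthetic three-blob data are assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain NumPy K-Means following the five steps listed above."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points;
        # an empty cluster keeps its previous centroid
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Synthetic data (an assumption, not the exercise's CSV file): three blobs
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=center, scale=0.3, size=(50, 2))
    for center in [(0, 0), (5, 5), (0, 5)]
])

# Step 1: choose K, then run the algorithm
labels, centroids = kmeans(X, k=3)
print(centroids)
```

Because the starting centroids are random, different seeds can converge to different groupings; library implementations typically rerun the algorithm from several starts and keep the best result.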
Code Implementation:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
df = pd.read_csv('/content/[Link]')
[Link]()
[Link]()
# feature scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# standardize both columns (each to zero mean and unit variance)
df[['Age', 'Income($)']] = scaler.fit_transform(df[['Age', 'Income($)']])
print(df)
# plot the data points
plt.figure(figsize=(10,4))
plt.scatter(df['Age'],df['Income($)'],s=100)
plt.xlabel('Age')
plt.ylabel('Income')
plt.title('Customer Data')
plt.show()
# 'df' is the DataFrame containing the scaled 'Age' and 'Income($)' columns
X = df[['Age', 'Income($)']] # select the features for clustering
km = KMeans(n_clusters=5)
ypred = km.fit_predict(X) # fit the model and return a cluster index for each row
ypred
df['cluster'] = ypred
print(df)
# get the centroids of the clusters
centroid=km.cluster_centers_
centroid
# split the rows by cluster label
df1=df[df['cluster']==0]
df1
df2=df[df['cluster']==1]
df3=df[df['cluster']==2]
# plot clusters 0-2 (with n_clusters=5, clusters 3 and 4 are not drawn)
plt.scatter(df1['Age'],df1['Income($)'],color='green',label='cluster1',s=150)
plt.scatter(df2['Age'],df2['Income($)'],color='red',label='cluster2',s=150)
plt.scatter(df3['Age'],df3['Income($)'],color='blue',label='cluster3',s=150)
# draw the centroids
plt.scatter(centroid[:,0],centroid[:,1],s=200,marker="*",color='purple',label='centroid')
plt.legend()
Output:
Fig no.1
Fig no.2
---------------------------------------------------Page End-----------------------------------------------
Exercise 3: Implementing Linear Regression from Scratch.
Objective: To gain a deeper understanding of linear regression by implementing it from scratch
using Python.
Solution:
Theory:
What is Linear Regression?
Linear regression is a supervised learning algorithm used for predicting continuous values. It
assumes a linear relationship between the input feature x and the output y, modeled by the
equation:
y = mx + c
Where:
• m is the slope (also called weight or coefficient),
• c is the intercept (bias),
• x is the independent variable,
• y is the dependent variable (target).
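A from-scratch fit of y = mx + c can be sketched with the closed-form least-squares estimates m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and c = ȳ − m·x̄. These formulas are standard but are not stated in the exercise, and the function name fit_line and the sample points are assumptions made for illustration.

```python
import numpy as np

def fit_line(x, y):
    """Return slope m and intercept c of the least-squares line y = mx + c."""
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: covariance of x and y divided by the variance of x
    m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means
    c = y_mean - m * x_mean
    return m, c

# Made-up points lying exactly on y = 2x + 1 (not the Salary_Data.csv values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0

m, c = fit_line(x, y)
print("slope:", m, "intercept:", c)

# Predict the target for a new input using the fitted line
y_new = m * 6.0 + c
print("prediction at x = 6:", y_new)
```

On noise-free data the fit recovers the generating slope and intercept; on real data it returns the line that minimizes the sum of squared vertical errors.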
Code Implementation:
import numpy as np
import matplotlib.pyplot as mtp
import pandas as pd
data_set= pd.read_csv("Salary_Data.csv")
x=data_set.iloc[:,:-1].values
y=data_set.iloc[:,1].values
print(x)
print(y)
# splitting the data set into training and testing
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x,y,test_size=1/3, random_state=0)
# fitting the simple linear regression model to the training dataset
from sklearn.linear_model import LinearRegression
regressor= LinearRegression() # 'regressor' is just a variable name; any other valid name would work
regressor.fit(x_train,y_train)
# prediction of test and training set result
y_pred= regressor.predict(x_test)
print(y_pred)
x_pred= regressor.predict(x_train)
print(x_pred)
# visualizing the training set results
mtp.scatter(x_train,y_train,color="green")
mtp.plot(x_train,x_pred,color="red")
mtp.title("Salary vs Experience (Training set)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary(In Rupees)")
mtp.show()
# visualizing the test set results (the red line is the same fitted line as above)
mtp.scatter(x_test,y_test,color="blue")
mtp.plot(x_train,x_pred,color="red")
mtp.title("Salary vs Experience (Test Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary(In Rupees)")
mtp.show()
Output:
Fig no.1
Fig no.2
------------------------------------------------------------
Exercise 4: Binary Classification with Logistic Regression.
Objective: To implement logistic regression for binary classification tasks and understand its
application in real-world scenarios.
Solution:
Theory:
What is Binary Classification?
Binary classification is a supervised learning task where the output variable has only two
possible classes, e.g., yes/no, 0/1, true/false.
What is Logistic Regression?
Logistic Regression is a statistical model used for classification tasks. It estimates the
probability that a given input point belongs to a certain class using the sigmoid (logistic)
function:
σ(z) = 1 / (1 + e^(-z)), where z = w · x + b
The output is a probability between 0 and 1, and a threshold (usually 0.5) is used to assign class
labels.
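The sigmoid and the 0.5 threshold can be illustrated numerically. The weights w, the bias b, and the three input rows below are hypothetical values chosen for the example, not parameters learned from any dataset.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias (illustrative values, not learned)
w = np.array([0.8, -0.5])
b = 0.1

# Three made-up input points with two features each
X = np.array([
    [2.0, 1.0],
    [-1.0, 3.0],
    [0.0, 1.0],
])

# Linear score z = w . x + b for every row, then squash to a probability
z = X @ w + b
probs = sigmoid(z)

# Assign class 1 when the probability reaches the 0.5 threshold, else class 0
labels = (probs >= 0.5).astype(int)
print("probabilities:", probs)
print("labels:", labels)
```

Since σ(0) = 0.5, thresholding the probability at 0.5 is the same as checking whether the linear score z is non-negative.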
Real-world Applications:
• Email spam detection (Spam or Not Spam)
• Disease diagnosis (Positive or Negative)
• Credit risk assessment (Default or Not)
Code Implementation:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder
# load the titanic data
data = pd.read_csv("/content/[Link]")
print(data)
# Select features and target
features =['Pclass','Sex','Age','SibSp','Parch','Fare']
data = data[features + ['Survived']].copy() # copy so later edits do not act on a view
# Handle missing values: fill missing ages with the median age
data['Age'] = data['Age'].fillna(data['Age'].median())
# Convert categorical column 'sex' to numeric
le = LabelEncoder()
data['Sex'] = le.fit_transform(data['Sex']) # male =1, female = 0
# split features and target
X = data[features]
Y = data['Survived']
# Train-test split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
# Train logistic regression
model = LogisticRegression(max_iter=200)
model.fit(X_train,Y_train)
# Predictions
Y_pred = model.predict(X_test)
# Evaluation
print("Accuracy:", accuracy_score(Y_test, Y_pred))
print("\nClassification Report:\n", classification_report(Y_test, Y_pred))
# Plot the confusion matrix as a heatmap
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             roc_curve, auc)
import matplotlib.pyplot as plt
import seaborn as sns
cm= confusion_matrix(Y_test, Y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Did Not Survive', 'Survived'],
            yticklabels=['Did Not Survive', 'Survived'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
Output:
Fig no.1
Exercise 5: Decision Tree Classifier for Multiclass Classification
Objective: To understand the working of decision tree classifiers and their application in
multiclass classification problems.
Solution:
Theory:
What is a Decision Tree Classifier?
A Decision Tree is a supervised machine learning algorithm used for both classification and
regression tasks. It works by splitting the dataset into branches based on feature values,
forming a tree structure. Each node represents a decision based on a feature, and each leaf node
represents a class label.
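To see the node and leaf structure described above, a fitted tree can be printed as text rules. This is a sketch: max_depth=2 and random_state=0 are illustrative settings chosen to keep the printout short and repeatable, not values from the exercise.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short (max_depth=2 is an assumption)
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented line is a decision on one feature; "class:" lines are leaves
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Reading the output from the top, each threshold test sends a sample down one branch until it reaches a leaf, whose class is the prediction.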
What is Multiclass Classification?
Multiclass classification involves classifying inputs into more than two categories, unlike
binary classification. For example:
• Classifying flowers as Setosa, Versicolor, or Virginica
• Digit recognition (0–9)
Use Case:
Classify iris flowers into three species using Decision Tree.
Code Implementation:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Load dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train the Decision Tree classifier
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred,
target_names=iris.target_names))
Output:
Fig no.1