Machine Learning Unit 1

The document discusses various machine learning concepts including supervised, unsupervised, and reinforcement learning, as well as linear regression and its evaluation metrics. It covers topics like gradient descent, multicollinearity, bias-variance tradeoff, and the significance of support vectors in classification. Additionally, it provides Python code examples for training models to predict rainfall and employee salaries, along with explanations of overfitting and underfitting.

Version 2

1. Supervised vs. Unsupervised vs. Reinforcement

● Supervised learning uses labeled data with known outputs to train models (e.g., predicting marks from study hours).
● Unsupervised learning works with unlabeled data to find hidden patterns or clusters (e.g., customer segmentation).
● Reinforcement learning is based on agents learning by interacting with an environment and improving via rewards/penalties.

2. Linear Regression

Linear regression is a statistical method to predict the dependent variable (y) based on one or more independent variables (x).
It assumes a linear relationship and fits a straight line, generally expressed as y = mx + c, where m is the slope and c is the intercept.

3. R² for given data

R² measures how well the predicted values fit the actual data.
For the given dataset, Mean(y) = 79, SST = 685, SSR = 660.
R² = SSR/SST = 660/685 ≈ 0.964 → the model explains about 96.4% of the variance.
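The same ratio can be verified in a couple of lines of Python (a minimal check using only the summary statistics quoted above):

sst = 685   # total sum of squares, sum((y - mean_y)**2)
ssr = 660   # regression (explained) sum of squares
r2 = ssr / sst
print(f"R² = {r2:.3f}")   # 0.964 -> the model explains ~96.4% of the variance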

4. Role of Learning Rate in Gradient Descent

The learning rate decides the step size during parameter updates.
If it is too high, the model may overshoot the minimum; if too low, convergence will be
very slow.
An optimal learning rate ensures fast and stable convergence.
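A minimal sketch (using a hypothetical one-parameter cost J(w) = w², not from the source) that demonstrates all three regimes:

# Gradient descent on J(w) = w**2, whose minimum is at w = 0.
# Gradient: dJ/dw = 2w; update: w <- w - lr * 2w
def step(w, lr):
    return w - lr * (2 * w)

for lr in (0.01, 0.1, 1.1):   # too small, reasonable, too large
    w = 5.0
    for _ in range(20):
        w = step(w, lr)
    print(f"lr={lr}: w after 20 steps = {w:.4f}")

# lr=0.01 -> w ≈ 3.34  (converging, but very slowly)
# lr=0.1  -> w ≈ 0.06  (fast, stable convergence)
# lr=1.1  -> w ≈ 191.6 (overshoots the minimum and diverges)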

5. Gradient Descent in Linear Regression

Gradient descent minimizes the cost function by updating weights iteratively.


It calculates the gradient (slope of error function) and moves in the opposite direction to
reduce error.
This process continues until the error reaches its minimum value.
6. Multicollinearity

Multicollinearity occurs when two or more independent variables in regression are highly
correlated.
This makes it difficult to determine the effect of each predictor accurately.
It can lead to unstable coefficient estimates and poor model interpretability.
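Multicollinearity is commonly diagnosed with the Variance Inflation Factor (VIF). Below is a sketch assuming statsmodels is available; the data are synthetic, with x2 built as a near-copy of x1:

import numpy as np
import pandas as pd
from statsmodels.tools.tools import add_constant
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 2 * x1 + rng.normal(scale=0.05, size=100)  # almost a linear function of x1
x3 = rng.normal(size=100)
X = add_constant(pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3}))

# A VIF above ~10 is a common rule of thumb for problematic collinearity;
# x1 and x2 should show very large VIFs here, while x3 stays near 1.
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))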

7. Bias-Variance Tradeoff

Bias is the error due to oversimplifying the model (underfitting), while variance is the
error due to high sensitivity to training data (overfitting).
A high-bias model ignores important patterns, and a high-variance model fails to
generalize.
The tradeoff balances both for optimal performance.

8. Outliers

Outliers are extreme values that differ significantly from other observations.
They can distort regression lines, increase error, and reduce prediction accuracy.
Handling outliers is important for building robust models.
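A common rule-of-thumb detector is the 1.5 × IQR rule; a minimal sketch with made-up values:

import numpy as np

values = np.array([10, 12, 11, 13, 12, 95, 11, 10])   # 95 is an obvious outlier
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(values[(values < lower) | (values > upper)])     # -> [95]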
QA507 Define support vectors and explain their significance.
Support vectors are the data points that are nearest to the hyperplane and affect the position and orientation of the hyperplane. We have to select the hyperplane for which the margin, i.e., the distance between the support vectors and the hyperplane, is maximum. Even a small change in the position of these support vectors can change the hyperplane.
QA508 What is the intuition of a large margin classifier?
A classifier with a large margin makes no low-certainty classification decisions. This gives a classification safety margin: a slight error in measurement or a slight variation in a document will not cause a misclassification.
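In scikit-learn, the support vectors of a fitted SVM can be inspected directly; a minimal sketch on toy, linearly separable data:

import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: two well-separated classes
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear', C=1.0).fit(X, y)
print(clf.support_vectors_)   # the points nearest the hyperplane

# Only these points determine the hyperplane: moving any non-support
# point (without crossing the margin) leaves the decision boundary unchanged.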

PART B (5 x 13 = 65 marks)

QB101 (a) Explain gradient descent algorithm for linear regression, emphasizing its optimization process.
Answer:
Gradient Descent is an optimization algorithm for finding a local minimum of a differentiable function. In machine learning, it is used to find the values of a function's parameters (coefficients) that minimize a cost function as far as possible.
In linear regression, the goal is to find the best-fitting line (or hyperplane in higher dimensions) that minimizes the sum of squared differences between the observed and predicted values. Gradient descent is used to update the coefficients of the linear regression model to minimize this error, and it finds the best-fit line for a given training dataset in a small number of iterations.
When the MSE is plotted against the slope and intercept, the resulting cost surface has a bowl (convex) shape.

Gradient descent is used to minimize a cost function J(W) parameterized by a model parameter
W.

The gradient (or derivative) tells us the incline or slope of the cost function. Hence, to
minimize the cost function, we move in the direction opposite to the gradient.

1. Initialize the weights W randomly.
2. Calculate the gradient G of the cost function with respect to the parameters, using partial differentiation: G = ∂J(W)/∂W. The value of G depends on the inputs, the current values of the model parameters, and the cost function.
3. Update the weights by an amount proportional to G, i.e., W = W − ηG.
4. Repeat until the cost J(W) stops reducing, or some other pre-defined termination criterion is met.
Defining the Algorithm:
The gradient descent update is repeated until convergence, for each parameter θj:

θj := θj − ⍺ · ∂J(θ)/∂θj

where:
⍺ = the learning rate, a value that controls how big a step is taken. It is usually a small positive number between 0 and 1; a larger ⍺ corresponds to a bigger step, and vice versa.
∂J(θ)/∂θj = the derivative of the cost function, which determines the direction to take at each step.

[Figure: gradient descent intuition]

Choosing The Learning Rate:

The choice of learning rate ⍺ has a huge impact on the efficiency of the overall algorithm. If the learning rate is too small, many more steps will be taken than required, so the total running time of the algorithm grows dramatically. On the other hand, if the learning rate is too large, the steps can miss the true minimum of the cost function, and the algorithm can actually diverge away from it.
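As an illustration of the algorithm above, here is a minimal NumPy sketch of batch gradient descent for one-variable linear regression (the data and hyperparameters are illustrative, not from the source):

import numpy as np

# Illustrative data, roughly following y = 2x + 1
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

m, c = 0.0, 0.0   # Step 1: initialize slope and intercept
lr = 0.01         # learning rate (alpha)
n = len(X)

for _ in range(2000):
    y_pred = m * X + c
    # Step 2: gradients of the MSE cost J = (1/n) * sum((y_pred - y)**2)
    dm = (2 / n) * np.sum((y_pred - y) * X)
    dc = (2 / n) * np.sum(y_pred - y)
    # Step 3: move opposite to the gradient
    m -= lr * dm
    c -= lr * dc

print(f"slope = {m:.3f}, intercept = {c:.3f}")   # approximately 2 and 1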

(Or)
QB101 (b)
Write a python program using appropriate libraries to train a model to predict the rainfall (in
mm) using precipitation and humidity as explanatory factors.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Sample dataset creation (replace this with your actual dataset)
data = {
    'Precipitation': [10.2, 12.3, 7.8, 9.5, 8.4, 13.2, 11.5, 14.2, 9.0, 10.8],
    'Humidity': [80, 85, 78, 82, 75, 90, 88, 92, 84, 79],
    'Rainfall': [15.2, 18.3, 10.1, 13.5, 11.9, 19.5, 16.7, 20.2, 13.0, 14.8]
}

# Convert the data into a pandas DataFrame
df = pd.DataFrame(data)

# Define the explanatory variables (X) and the target variable (y)
X = df[['Precipitation', 'Humidity']]
y = df['Rainfall']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the linear regression model
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)

# Predict the rainfall on the testing data
y_pred = model.predict(X_test)

# Evaluate the model's performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print the results
print("Mean Squared Error:", mse)
print("R-squared:", r2)

# Print model coefficients
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)

# Predicting rainfall for a new set of data (one row in the same feature order)
new_data = np.array([[10.5, 82]])
predicted_rainfall = model.predict(new_data)
print(f"Predicted Rainfall for Precipitation=10.5 and Humidity=82: {predicted_rainfall[0]:.2f} mm")
Short explanation/algorithm about the code should be given.

QB102 (a) Explain about overfitting, underfitting and best fit in detail with examples.
Overfitting
Overfitting occurs when a machine learning model tries to cover all the data points, or more data points than required, in the given dataset.

Because of this, the model starts capturing the noise and inaccurate values present in the dataset, and these factors reduce the efficiency and accuracy of the model.

The overfitted model has low bias and high variance.

The chances of overfitting increase the longer we train the model: the more we train on the same data, the more likely we are to end up with an overfitted model.

Overfitting is the main problem that occurs in supervised learning.
How to avoid overfitting in a model:
Both overfitting and underfitting degrade the performance of a machine learning model, but overfitting is the more common problem, and there are several ways to reduce its occurrence:
● Cross-validation
● Training with more data
● Removing features
● Regularization
● Reducing model complexity
● Early stopping during the training phase (monitor the loss over the training period and stop training as soon as the validation loss begins to increase)

Underfitting:
Underfitting occurs when a machine learning model is not able to capture the underlying trend of the data.
It can arise, for example, when training is stopped at an early stage to avoid overfitting: the model may not learn enough from the training data and so fails to capture the dominant trend.
In the case of underfitting, the model cannot learn enough from the training data, which reduces accuracy and produces unreliable predictions.
An underfitted model has high bias and low variance.

Techniques to reduce underfitting:
● Increase model complexity
● Increase the number of features by performing feature engineering
● Remove noise from the data
● Increase the number of epochs or the duration of training

Best Fit:
● The goal of model fitting is to strike a balance between overfitting and underfitting to achieve the best generalization performance (the sketch after this list demonstrates all three regimes).
● The best-fit model generalizes well to unseen data, capturing the underlying patterns in the data without being overly complex or too simplistic.
● Achieving the best fit involves selecting a model that balances bias and variance appropriately.
● This can be achieved through techniques such as cross-validation, regularization, feature selection, and model complexity control.
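A minimal scikit-learn sketch of the three regimes, varying polynomial degree on synthetic data (the dataset and degrees are illustrative, not from the source):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Noisy quadratic data
rng = np.random.default_rng(42)
X = np.linspace(0, 1, 40).reshape(-1, 1)
y = 1 + 2 * X.ravel() - 3 * X.ravel() ** 2 + rng.normal(scale=0.1, size=40)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in (1, 2, 15):   # underfit, best fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(X_tr))
    test_mse = mean_squared_error(y_te, model.predict(X_te))
    # Underfit: both errors high; overfit: tiny train error, high test error
    print(f"degree={degree}: train MSE={train_mse:.4f}, test MSE={test_mse:.4f}")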

Explanation – 13 Marks

(Or)
*QB102 (b) Write a python program using appropriate libraries to train a model to predict employees'
monthly salaries based on years of experience and education level using Linear Regression.

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error

# Step 1: Load the dataset (assuming you have a file named 'employee_data.csv')
df = pd.read_csv('employee_data.csv')

# Step 2: Data Preprocessing (encode 'education_level' categorical feature)
label_encoder = LabelEncoder()
df['education_level'] = label_encoder.fit_transform(df['education_level'])

# Step 3: Define features (X) and target variable (y)
X = df[['years_of_experience', 'education_level']]  # Features
y = df['salary']  # Target variable

# Step 4: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Step 6: Make predictions on the test set
y_pred = model.predict(X_test)

# Step 7: Evaluate the model using Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

# Step 8: Print the model coefficients
print(f"Model Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")

# Step 9: Making a prediction for a new employee
new_employee = pd.DataFrame({'years_of_experience': [6],
                             'education_level': [label_encoder.transform(['Masters'])[0]]})
predicted_salary = model.predict(new_employee)
print(f"Predicted salary for new employee: {predicted_salary[0]}")
Explanation:
●​ Load the dataset: Load the data from a CSV file.
●​ Preprocess the data: Encode the categorical feature (education_level).
●​ Train the model: Split the data into training and testing sets, then train the linear
regression model.
●​ Evaluate the model: Check model accuracy using Mean Squared Error (MSE).
●​ Make predictions: Predict the salary for new employee data.

1.​ Loading Data:

o​ The data is loaded from a CSV file using pd.read_csv(). The dataset must
include columns like years_of_experience, education_level, and salary.

2.​ Label Encoding:

o​ education_level is a categorical feature, so we convert it to numerical values


using LabelEncoder from scikit-learn.

3.​ Model Training:

o We split the dataset into training and test sets using train_test_split(), then train
the Linear Regression model using model.fit().

4.​ Evaluation:

o We evaluate the model's performance by calculating the Mean Squared Error
(MSE) using mean_squared_error() from sklearn.metrics.

5.​ Prediction:

We make predictions for new data, where the education_level is encoded using the
LabelEncoder.

*QB103 (a) (i) What is Linear Regression? Describe the steps involved in the linear regression model
representation for a single variable. (6)
Linear Regression is a statistical method used to model the relationship between a dependent
variable (target) and one or more independent variables (features) by fitting a linear equation to
observed data. The goal is to find the best-fit line that predicts the target variable based on the
input feature(s).
In the case of single-variable linear regression, there is only one independent variable
(feature) and one dependent variable (target). The model assumes a linear relationship between
them.
Linear Regression Model Representation for a Single Variable:
For single-variable linear regression, the relationship between the input variable X (independent variable) and the output Y (dependent variable) can be represented by the equation:

Y = β₀ + β₁X + ε

where β₀ is the intercept (the value of Y when X = 0), β₁ is the slope (the change in Y per unit change in X), and ε is the error term. The model representation involves: (1) assuming the linear form above, (2) estimating β₀ and β₁ from the training data by minimizing the squared error (e.g., via least squares or gradient descent), and (3) using the fitted line Ŷ = β₀ + β₁X to predict the target for new inputs.
(ii) How can the performance of a linear regression model be evaluated, and what are the key
performance measures used for this evaluation? (7)
● To measure the performance of a regression model, several statistical metrics are used:
○ Mean Squared Error (MSE)
○ Mean Absolute Error (MAE)
○ Root Mean Square Error (RMSE)
○ Coefficient of determination (R²)
○ Adjusted R²
● A good regression model is one where the difference between the actual (observed) values and the predicted values is small and unbiased for the train, validation, and test data sets.
● The performance of a regression model can be understood from the error rate of the predictions made by the model.
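All of these metrics except Adjusted R² are available directly in scikit-learn; a minimal sketch with placeholder arrays:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # placeholder actual values
y_pred = np.array([2.8, 5.3, 6.9, 9.4])   # placeholder predictions

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)

# Adjusted R² penalizes R² for the number of predictors k
n, k = len(y_true), 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(mse, mae, rmse, r2, adj_r2)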

(Or)
QB103 (b) (i) Write a python program to fit a straight line for the given data using the least squares method,
and also plot the graph.
X = [1,2,3,4,5,6,7,8,9,10]
Y = [0,3,5,7,9,11,13,2,4,8] (6)
import numpy as np
import matplotlib.pyplot as plt

# Preprocessing input data
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
Y = np.array([0, 3, 5, 7, 9, 11, 13, 2, 4, 8])

# Means
X_mean = np.mean(X)
Y_mean = np.mean(Y)

num = 0    # numerator of the slope
denom = 0  # denominator of the slope

# Sum of (xi - x_mean)*(yi - y_mean) and (xi - x_mean)^2
for i in range(len(X)):
    num += (X[i] - X_mean) * (Y[i] - Y_mean)
    denom += (X[i] - X_mean) ** 2

# Calculate slope
m = num / denom

# Calculate intercept
b = Y_mean - m * X_mean
print(m, b)

# Line equation
y_predicted = m * X + b
print(y_predicted)

# Plot the data points and the fitted line
plt.scatter(X, Y)
plt.plot(X, y_predicted, color='red')
plt.show()

Code – 10 Marks​
Explanation - 3 marks

(ii) Find the MSE for the following set of values: (43,41), (44,45), (45,49), (46,47), (47,44).
(7)
Step 1: Find the regression line.

The regression line is y = 9.2 + 0.8x.
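Step 2: Compute the fitted values and squared errors using ŷ = 9.2 + 0.8x.

x	y	ŷ = 9.2 + 0.8x	y − ŷ	(y − ŷ)²
43	41	43.6	−2.6	6.76
44	45	44.4	0.6	0.36
45	49	45.2	3.8	14.44
46	47	46.0	1.0	1.00
47	44	46.8	−2.8	7.84

Step 3: Sum the squared errors and divide by n.

MSE = Σ(y − ŷ)²/n = 30.4/5 = 6.08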

QB104 (a) Write a python program using appropriate libraries to train a model to predict the price of a
house and the number of occupants for the California housing dataset, and also find its
performance measures.
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import SGDRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Load the California Housing dataset
data = fetch_california_housing()

# Use the first 3 features as inputs
X = data.data[:, :3]  # Features: 'MedInc', 'HouseAge', 'AveRooms'

# Use 'MedHouseVal' and 'AveOccup' as output variables
Y = np.column_stack((data.target, data.data[:, 5]))  # Targets: 'MedHouseVal', 'AveOccup' (column index 5)

# Split the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Scale the features and target variables
scaler_X = StandardScaler()
scaler_Y = StandardScaler()

X_train = scaler_X.fit_transform(X_train)
X_test = scaler_X.transform(X_test)
Y_train = scaler_Y.fit_transform(Y_train)
Y_test = scaler_Y.transform(Y_test)

# Initialize the SGDRegressor
sgd = SGDRegressor(max_iter=1000, tol=1e-3)

# Use MultiOutputRegressor to handle multiple output variables
multi_output_sgd = MultiOutputRegressor(sgd)

# Train the model
multi_output_sgd.fit(X_train, Y_train)

# Predict on the test data
Y_pred = multi_output_sgd.predict(X_test)

# Inverse transform the predictions to get them back to the original scale
Y_pred = scaler_Y.inverse_transform(Y_pred)
Y_test = scaler_Y.inverse_transform(Y_test)

# Evaluate the model using Mean Squared Error
mse = mean_squared_error(Y_test, Y_pred)
print("Mean Squared Error:", mse)

# Optionally, print some predictions
print("\nPredictions:\n", Y_pred[:5])  # Print first 5 predictions

Short explanation/algorithm about the code should be given.


(Or)
QB104 (b)
Number of man-hours and the corresponding productivity (in units) are furnished below.
Determine the line of best fit using the least squares method. Also predict what will be the
productivity if the man-hours is 8.8. Also calculate MSE and MAE.

Mean X = 6.9
Mean Y = 13.44
Sum of squares = 40.48
Sum of products = 62.21

Regression Equation, ŷ = bX + a

b = 62.21/40.48 = 1.537

a = MY − bMX = 13.44 − (1.537 × 6.9) = 2.835

ŷ = 1.537X + 2.835

For X = 8.8,

ŷ = 1.537(8.8) + 2.835
ŷ = 16.36

MSE = 6.42
MAE = 1.94

Step-by-Step calculations to be done. Use tabulation.
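A quick Python check of the line using only the furnished summary statistics (the raw (X, Y) pairs, which this check cannot recover, are needed for the MSE and MAE):

# Check the least-squares line from the furnished summary statistics
mean_x, mean_y = 6.9, 13.44
sxx, sxy = 40.48, 62.21        # sum of squares of X, sum of products of X and Y

b = sxy / sxx                  # slope ≈ 1.537
a = mean_y - b * mean_x        # intercept ≈ 2.835
print(b, a)
print(b * 8.8 + a)             # predicted productivity at X = 8.8 ≈ 16.36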

*QB201 (a) Write a Python program to implement logistic regression for classifying customer reviews as
positive (1) or negative (0) based on extracted features such as review length, number of
positive/negative words, and star rating.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Sample list of positive and negative words
positive_words = set(["good", "great", "excellent", "awesome", "perfect", "love", "best"])
negative_words = set(["bad", "terrible", "poor", "horrible", "worst", "disappointing", "hate"])

# Function to extract features from a review
def extract_features(review, star_rating):
    review_length = len(review.split())  # Length of the review (in words)

    # Count the number of positive and negative words in the review
    pos_count = sum(1 for word in review.lower().split() if word in positive_words)
    neg_count = sum(1 for word in review.lower().split() if word in negative_words)

    # Return a feature vector (review_length, positive_count, negative_count, star_rating)
    return [review_length, pos_count, neg_count, star_rating]

# Load the dataset (replace 'customer_reviews.csv' with your file path)
df = pd.read_csv('customer_reviews.csv')

# Extract features and labels
X = np.array([extract_features(review, rating)
              for review, rating in zip(df['review'], df['rating'])])
y = df['label'].values

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale the features, then train and evaluate the classifier
# (these remaining steps follow from the imports above)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
