
Nims Institute of Engineering & Technology

Nims University Rajasthan, Jaipur

LAB MANUAL

FOR
Machine Learning (CSC601B)

NIMS UNIVERSITY RAJASTHAN, JAIPUR


Jaipur-Delhi Highway
Jaipur - 303121, Rajasthan, India
Website: www.nimsuniversity.org

Contents
Syllabus ................................................................................................................................................... 3
MACHINE LEARNING LAB ........................................................................................................................ 3
Practical 1 ................................................................................................................................................ 4
Aim: 1. Predict housing prices based on features like area, number of bedrooms, and location using
linear regression. .................................................................................................................................... 4
Practical 2 .............................................................................................................................................. 16
Aim: 2. Classify a dataset using a k-Nearest Neighbors (kNN) classifier ............................................... 16
Practical 3 .............................................................................................................................................. 22
Aim: 3. Implement a decision tree algorithm to classify email spam based on keywords and sender
information. .......................................................................................................................................... 22
Practical 4 .............................................................................................................................................. 25
Aim: 4. Cluster customer data based on purchase history using k-means clustering. ......................... 25
Practical 5 .............................................................................................................................................. 28
Aim: 5. Predict future stock prices using a time series forecasting model (e.g., ARIMA) .................... 28
Practical 6 .............................................................................................................................................. 32
Aim: 6. Develop a sentiment analysis model to classify movie reviews as positive, negative, or
neutral. .................................................................................................................................................. 32
Practical 7 .............................................................................................................................................. 36
Aim: 7. Explore dimensionality reduction techniques like Principal Component Analysis (PCA) to
visualize high-dimensional data. ........................................................................................................... 36
Practical 8 .............................................................................................................................................. 39
Aim: 8. Train a support vector machine (SVM) to classify data. ........................................................... 39

Syllabus

MACHINE LEARNING LAB


Course Objectives:

1. Understand the concept of learning in computer science.


2. Compare and contrast different paradigms for learning (supervised, unsupervised, etc.).
3. Design experiments to evaluate and compare different machine learning techniques on
real-world problems.

Experiments

1. Predict housing prices based on features like area, number of bedrooms, and
location using linear regression.
2. Classify a dataset using a k-Nearest Neighbors (kNN) classifier.
3. Implement a decision tree algorithm to classify email spam based on keywords
and sender information.
4. Cluster customer data based on purchase history using k-means clustering.
5. Predict future stock prices using a time series forecasting model (e.g., ARIMA).
6. Develop a sentiment analysis model to classify movie reviews as positive,
negative, or neutral.
7. Explore dimensionality reduction techniques like Principal Component Analysis
(PCA) to visualize high-dimensional data.
8. Train a support vector machine (SVM) to classify data.

Course Outcomes:
1. Implement and analyse existing learning algorithms, including well-studied methods for
classification, regression and clustering.
2. Apply evaluation metrics to various algorithms.
3. Identify and implement solutions to real-world problems.

Practical 1

Aim: 1. Predict housing prices based on features like area, number of


bedrooms, and location using linear regression.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Create a small dataset


data = {
"Area": [1500, 1800, 2400, 3000, 3500],
"Bedrooms": [3, 4, 3, 5, 4],
"Location": [1, 2, 1, 3, 2], # Encoding for location (e.g., 1: City A, 2: City B, 3: City C)
"Price": [300000, 400000, 350000, 500000, 450000]
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Features and target


X = df[["Area", "Bedrooms", "Location"]]
y = df["Price"]

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the Linear Regression model


model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Model evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Model Coefficients:", model.coef_)


print("Model Intercept:", model.intercept_)
print("Mean Squared Error:", mse)
print("R2 Score:", r2)

# Visualize actual vs predicted


plt.scatter(range(len(y_test)), y_test, color="blue", label="Actual Prices")

plt.scatter(range(len(y_pred)), y_pred, color="red", label="Predicted Prices")
plt.xlabel("Sample Index")
plt.ylabel("Price")
plt.legend()
plt.title("Actual vs Predicted Prices")
plt.show()

Output:

How the Code Works


• Dataset: a small in-code dataset with the features Area, Bedrooms, and Location and the
target Price. Location is encoded as an integer (1: City A, 2: City B, 3: City C); a sketch of
how such an encoding can be produced is given after this list.
• Splitting Data: divides the data into training (80%) and testing (20%) sets.
• Linear Regression: uses LinearRegression from scikit-learn to build the model.
• Evaluation: computes Mean Squared Error (MSE) and the R² score on the test set. Note that
with only five rows the test set holds a single sample, so the R² score is not well defined here;
use a larger dataset for a meaningful evaluation.
• Visualization: compares actual vs. predicted prices.
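The integer Location codes above are assumed to come from a mapping of city names to numbers. A minimal sketch of how such an encoding could be produced with pandas (the City column and mapping are illustrative, not part of the dataset above):

import pandas as pd

# Hypothetical raw data with city names instead of numeric codes
raw = pd.DataFrame({"City": ["City A", "City B", "City A", "City C", "City B"]})
location_map = {"City A": 1, "City B": 2, "City C": 3}  # same coding as above
raw["Location"] = raw["City"].map(location_map)
print(raw)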

Reading Data from an Excel File instead of a Hard-Coded DataFrame

First, create the dataset and save it as 1_Housing.xlsx.

Upload this file to Google Drive (if you are working in Google Colab, mount Drive as sketched below; otherwise place the file next to your script).
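If the notebook runs in Google Colab, the uploaded file can be read after mounting Drive. A minimal sketch, assuming the file sits directly under My Drive (adjust the path to wherever you uploaded it):

from google.colab import drive
import pandas as pd

drive.mount('/content/drive')   # authorise access to your Drive
df = pd.read_excel('/content/drive/MyDrive/1_Housing.xlsx')
print(df.head())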

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load data from an Excel file


data_path = "1_Housing.xlsx" # Replace with your Excel file path
df = pd.read_excel(data_path)

# Features and target
X = df[["Area", "Bedrooms", "Location"]]
y = df["Price"]

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the Linear Regression model


model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Model evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Model Coefficients:", model.coef_)


print("Model Intercept:", model.intercept_)
print("Mean Squared Error:", mse)
print("R2 Score:", r2)

# Visualize actual vs predicted


plt.scatter(range(len(y_test)), y_test, color="blue", label="Actual Prices")
plt.scatter(range(len(y_pred)), y_pred, color="red", label="Predicted Prices")
plt.xlabel("Sample Index")
plt.ylabel("Price")
plt.legend()
plt.title("Actual vs Predicted Prices")

plt.show()
Output:

Reading the data file from Excel and then predicting the price of a new house from its area,
number of bedrooms, and location.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the dataset from an Excel file


df = pd.read_excel('1_Housing.xlsx')  # Replace '1_Housing.xlsx' with your file path if different

# Features and target
X = df[["Area", "Bedrooms", "Location"]]
y = df["Price"]

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the Linear Regression model


model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Model evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Model Coefficients:", model.coef_)


print("Model Intercept:", model.intercept_)
print("Mean Squared Error:", mse)
print("R2 Score:", r2)

# Visualize actual vs predicted


plt.scatter(range(len(y_test)), y_test, color="blue", label="Actual Prices")
plt.scatter(range(len(y_pred)), y_pred, color="red", label="Predicted Prices")
plt.xlabel("Sample Index")
plt.ylabel("Price")
plt.legend()

plt.title("Actual vs Predicted Prices")
plt.show()

# Predict price for new data


new_data = pd.DataFrame({
"Area": [float(input("Enter Area: "))],
"Bedrooms": [int(input("Enter Bedrooms: "))],
"Location": [int(input("Enter Location (e.g., 1 for City A, 2 for City B): "))]
})

predicted_price = model.predict(new_data)
print(f"Predicted Price: {predicted_price[0]}")

Practical 2

Aim: 2. Classify a dataset using a k-Nearest Neighbors (kNN) classifier

from sklearn.datasets import load_iris

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler

from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load dataset

data = load_iris()

X = data.data

y = data.target

# Split dataset

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize data

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)

# Train kNN classifier

k=3

knn = KNeighborsClassifier(n_neighbors=k)

knn.fit(X_train, y_train)

# Make predictions

y_pred = knn.predict(X_test)

# Evaluate the model

print("Accuracy:", accuracy_score(y_test, y_pred))

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

print("Classification Report:\n", classification_report(y_test, y_pred))

Output:

Adding some visualization to the code:

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.datasets import load_iris

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler

from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load dataset

data = load_iris()

X = data.data

y = data.target

feature_names = data.feature_names

class_names = data.target_names

# Split dataset

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize data

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)

# Train kNN classifier

k=3

knn = KNeighborsClassifier(n_neighbors=k)

knn.fit(X_train, y_train)

# Make predictions

y_pred = knn.predict(X_test)

# Evaluate the model

accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

print("Classification Report:\n", classification_report(y_test, y_pred))

# Visualizations

# 1. Pairplot of features

df = pd.DataFrame(X, columns=feature_names)

df['target'] = y

sns.pairplot(df, hue='target', diag_kind='hist', palette='Set2')

plt.suptitle('Feature Distributions and Pairwise Scatter Plots', y=1.02)

plt.show()

# 2. Confusion Matrix Heatmap

conf_matrix = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(6, 5))

sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names, yticklabels=class_names)

plt.title('Confusion Matrix')

plt.xlabel('Predicted Labels')

plt.ylabel('True Labels')

plt.show()

# 3. Decision Boundary (plotted on the first two features)

if X.shape[1] >= 2:  # use the first two features for a 2D plot

    X_plot = X[:, :2]

    X_train_plot, X_test_plot, y_train_plot, y_test_plot = train_test_split(
        X_plot, y, test_size=0.2, random_state=42)

    X_train_plot = scaler.fit_transform(X_train_plot)

    X_test_plot = scaler.transform(X_test_plot)

    knn_2d = KNeighborsClassifier(n_neighbors=k)

    knn_2d.fit(X_train_plot, y_train_plot)

    # Create a mesh grid covering the training data

    x_min, x_max = X_train_plot[:, 0].min() - 1, X_train_plot[:, 0].max() + 1

    y_min, y_max = X_train_plot[:, 1].min() - 1, X_train_plot[:, 1].max() + 1

    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01),
                         np.arange(y_min, y_max, 0.01))

    Z = knn_2d.predict(np.c_[xx.ravel(), yy.ravel()])

    Z = Z.reshape(xx.shape)

    # Plot decision boundary

    plt.figure(figsize=(8, 6))

    plt.contourf(xx, yy, Z, alpha=0.8, cmap='Set3')

    scatter = plt.scatter(X_test_plot[:, 0], X_test_plot[:, 1], c=y_test_plot, edgecolor='k', cmap='viridis')

    plt.title('kNN Decision Boundary (2D)')

    plt.xlabel(feature_names[0])

    plt.ylabel(feature_names[1])

    legend = plt.legend(handles=scatter.legend_elements()[0], labels=list(class_names))

    plt.show()

else:

    print("Decision boundary visualization needs at least 2 features.")

Output:

How kNN works?
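kNN keeps the training samples and, for a new point, finds the k closest training samples (typically by Euclidean distance) and predicts the majority class among them. A minimal NumPy sketch of this idea (illustrative only, separate from the scikit-learn listing above):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # Distance from the query point to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Labels of the k nearest neighbours
    nearest_labels = y_train[np.argsort(distances)[:k]]
    # Majority vote among the k nearest labels
    return Counter(nearest_labels).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # expected: 0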

Practical 3

Aim: 3. Implement a decision tree algorithm to classify email spam


based on keywords and sender information.

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier, export_text

from sklearn.metrics import accuracy_score, classification_report

# Sample dataset

data = {
    'contains_free': [1, 0, 1, 0, 1, 0, 1, 0],
    'contains_offer': [0, 1, 1, 0, 0, 1, 1, 0],
    'sender_known': [0, 1, 1, 1, 0, 1, 0, 0],
    'spam': [1, 0, 1, 0, 1, 0, 1, 0]
}

# Create a DataFrame

df = pd.DataFrame(data)

# Features and target

X = df[['contains_free', 'contains_offer', 'sender_known']]

y = df['spam']

# Split dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Initialize the Decision Tree Classifier

clf = DecisionTreeClassifier(random_state=42)

# Train the model

clf.fit(X_train, y_train)

# Make predictions

y_pred = clf.predict(X_test)

# Evaluate the model

accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy * 100:.2f}%")

print("\nClassification Report:")

print(classification_report(y_test, y_pred))

# Display the decision tree

tree_rules = export_text(clf, feature_names=list(X.columns))

print("\nDecision Tree Rules:")

print(tree_rules)

Output:

Explanation of the Code

1. Dataset:
The dataset contains three features:
o contains_free: Indicates if the email contains the keyword "free" (1 for yes,
0 for no).
o contains_offer: Indicates if the email contains the keyword "offer" (1 for
yes, 0 for no).
o sender_known: Indicates if the sender is known (1 for yes, 0 for no).
The spam column is the target (1 for spam, 0 for not spam).
2. Splitting the Dataset:
The dataset is split into training and testing sets using train_test_split.
3. Model Training:
A decision tree classifier is initialized and trained on the training set.
4. Model Evaluation:
Predictions are made on the testing set, and the accuracy and classification report are
printed.
5. Visualizing the Tree:
The export_text function generates human-readable decision rules for the tree.
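Besides export_text, the fitted tree can also be drawn graphically. A minimal sketch using sklearn.tree.plot_tree, assuming clf and X from the listing above:

from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
plot_tree(clf, feature_names=list(X.columns), class_names=["not spam", "spam"], filled=True)
plt.show()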

Output Example
Accuracy: 100.00%

Classification Report:
precision recall f1-score support

0 1.00 1.00 1.00 2


1 1.00 1.00 1.00 2

accuracy 1.00 4
macro avg 1.00 1.00 1.00 4
weighted avg 1.00 1.00 1.00 4

Decision Tree Rules:


|--- contains_free <= 0.50
| |--- sender_known <= 0.50
| | |--- class: 0
| |--- sender_known > 0.50
| | |--- class: 0
|--- contains_free > 0.50
| |--- class: 1
Notes

1. Customization: You can replace the sample dataset with a larger and more realistic
email dataset.
2. Feature Engineering: Add more features like the length of the email, frequency of
certain words, etc.
3. Model Tuning: Adjust the parameters of DecisionTreeClassifier (e.g.,
max_depth, min_samples_split) for better performance.
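As a sketch of the tuning mentioned in note 3, the tree parameters could be searched with GridSearchCV. This assumes a larger, more realistic dataset X, y, since the eight-row toy set above is too small for meaningful cross-validation:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {"max_depth": [2, 3, 4, None], "min_samples_split": [2, 4, 8]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)  # X, y: features and labels of the larger dataset
print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)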

Practical 4

Aim: 4. Cluster customer data based on purchase history using k-means


clustering.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Simulated customer data


data = {
'customer_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'total_spent': [500, 1500, 300, 800, 2500, 200, 1000, 1800, 400, 700],
'frequency': [5, 20, 2, 8, 30, 1, 12, 25, 3, 6],
'average_purchase_value': [100, 75, 150, 100, 83, 200, 83, 72, 133, 117],
}

# Create a DataFrame
df = pd.DataFrame(data)

# Features for clustering


features = df[['total_spent', 'frequency', 'average_purchase_value']]

# Standardize the features


scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

# Apply KMeans clustering


kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(scaled_features)

# Add cluster labels to the DataFrame


df['cluster'] = clusters

# Visualize the clusters using PCA (2D projection)


pca = PCA(n_components=2)
pca_features = pca.fit_transform(scaled_features)

plt.figure(figsize=(8, 6))
for cluster in np.unique(clusters):
    plt.scatter(
        pca_features[clusters == cluster, 0],
        pca_features[clusters == cluster, 1],
        label=f'Cluster {cluster}'
    )

# Add centroids to the plot
centroids = pca.transform(kmeans.cluster_centers_)
plt.scatter(centroids[:, 0], centroids[:, 1], s=200, c='black', marker='X', label='Centroids')

plt.title('Customer Clusters')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.legend()
plt.grid()
plt.show()

# Print clustered data


print("Clustered Customer Data:")
print(df)

Output:

Key Components:

1. Dataset:
o total_spent: Total amount spent by a customer.
o frequency: Number of purchases.
o average_purchase_value: Average value of each purchase.
2. Standardization:
o Used StandardScaler to normalize features for better clustering
performance.
3. K-Means:
o Specified n_clusters=3 (can be optimized using the elbow method or
silhouette score).
4. Visualization:
o Used PCA for 2D visualization of high-dimensional data.
o Plotted clusters with their centroids.

Notes:

1. Elbow Method: Use the elbow method to determine the optimal number of clusters
by plotting the inertia for different cluster counts (a sketch follows after these notes).
2. Feature Engineering: Add more features like customer lifetime value, recency, etc.,
for better clustering.
3. Real Data: Replace the simulated data with real customer data.
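A minimal sketch of the elbow method from note 1, reusing scaled_features from the listing above: fit k-means for several values of k and look for the bend in the inertia curve.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = []
k_values = range(1, 8)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(scaled_features)
    inertias.append(km.inertia_)

plt.plot(k_values, inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.title("Elbow Method")
plt.show()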

Practical 5

Aim: 5. Predict future stock prices using a time series forecasting model
(e.g., ARIMA)
import pandas as pd
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
import matplotlib.pyplot as plt

# Step 1: Create a small dataset for stock prices


data = {
"Date": pd.date_range(start="2024-01-01", periods=10, freq="D"),
"Stock_Price": [150 + i * 2 + np.random.uniform(-3, 3) for i in
range(10)] # Simulated stock prices with noise
}
stock_data = pd.DataFrame(data)
stock_data.set_index("Date", inplace=True)

# Step 2: Fit an ARIMA model


stock_prices = stock_data['Stock_Price']
model = ARIMA(stock_prices, order=(1, 1, 1))  # ARIMA parameters (p, d, q)
fitted_model = model.fit()

# Step 3: Forecast future stock prices


forecast_steps = 5
forecast = fitted_model.forecast(steps=forecast_steps)

# Step 4: Plot the actual and forecasted stock prices


plt.figure(figsize=(10, 6))
plt.plot(stock_prices, label="Actual Stock Prices", marker="o")
forecast_index = pd.date_range(start=stock_data.index[-1] +
pd.Timedelta(days=1), periods=forecast_steps, freq='D')
plt.plot(forecast_index, forecast, label="Forecasted Stock Prices",
marker="x", linestyle="--", color="red")
plt.xlabel("Date")
plt.ylabel("Stock Price")
plt.title("Stock Price Forecast using ARIMA")
plt.legend()
plt.grid()
plt.show()

# Step 5: Display the forecasted values


print("Forecasted Stock Prices:")
forecast_df = pd.DataFrame({"Date": forecast_index, "Forecasted Price":
forecast.values})

print(forecast_df)

Output:

Here’s a step-by-step explanation of the code:

Step 1: Create a Small Dataset for Stock Prices


data = {
"Date": pd.date_range(start="2024-01-01", periods=10, freq="D"),
"Stock_Price": [150 + i * 2 + np.random.uniform(-3, 3) for i in
range(10)] # Simulated stock prices with noise
}
stock_data = pd.DataFrame(data)
stock_data.set_index("Date", inplace=True)

1. pd.date_range: Creates a range of 10 consecutive dates starting from "2024-01-01".


2. Stock_Price formula: Simulates prices starting at 150 and incrementing by 2 per day,
with random noise added using np.random.uniform(-3, 3).
3. pd.DataFrame: Stores the generated data in a pandas DataFrame.
4. set_index: Sets the Date column as the index, which is essential for time series
analysis.

Step 2: Fit an ARIMA Model
stock_prices = stock_data['Stock_Price']
model = ARIMA(stock_prices, order=(1, 1, 1)) # ARIMA parameters (p, d, q)
fitted_model = model.fit()

1. ARIMA: A popular time series forecasting model with three parameters:


o p: Autoregressive order (how past values influence current values).
o d: Degree of differencing (removes trends in the data).
o q: Moving average order (models residuals/errors).
2. Fit the model: The fit() method trains the ARIMA model on the dataset.
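The order (1, 1, 1) above is fixed by hand. A minimal sketch of comparing a few candidate orders by AIC (lower is generally better), reusing stock_prices from the listing above; on a series of only ten points the fits may emit convergence warnings:

from statsmodels.tsa.arima.model import ARIMA

for order in [(1, 1, 0), (0, 1, 1), (1, 1, 1), (2, 1, 1)]:
    aic = ARIMA(stock_prices, order=order).fit().aic
    print(f"ARIMA{order}: AIC = {aic:.2f}")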

Step 3: Forecast Future Stock Prices


forecast_steps = 5
forecast = fitted_model.forecast(steps=forecast_steps)

1. forecast_steps: Specifies the number of future days to predict (5 in this case).


2. fitted_model.forecast: Generates the predicted values for the specified number of
steps.

Step 4: Plot Actual and Forecasted Stock Prices


plt.figure(figsize=(10, 6))
plt.plot(stock_prices, label="Actual Stock Prices", marker="o")
forecast_index = pd.date_range(start=stock_data.index[-1] +
pd.Timedelta(days=1), periods=forecast_steps, freq='D')
plt.plot(forecast_index, forecast, label="Forecasted Stock Prices",
marker="x", linestyle="--", color="red")
plt.xlabel("Date")
plt.ylabel("Stock Price")
plt.title("Stock Price Forecast using ARIMA")
plt.legend()
plt.grid()
plt.show()

1. Plot actual data: plt.plot displays the historical stock prices (stock_prices).
2. Create forecast dates: pd.date_range generates future dates starting from the day
after the last date in the dataset.
3. Plot forecast: Plots the predicted stock prices on the same graph.
4. Styling: Adds labels, title, legend, and grid for better visualization.

Step 5: Display Forecasted Values



forecast_df = pd.DataFrame({"Date": forecast_index, "Forecasted Price":
forecast.values})
print(forecast_df)

1. Create DataFrame: Combines the forecasted dates and predicted prices into a new
DataFrame.
2. Print results: Displays the forecasted values for better clarity.

Output

• Graph: Shows both the actual stock prices and the forecasted values with clear
markers.
• Table: Displays the forecasted stock prices in tabular form.

Practical 6

Aim: 6. Develop a sentiment analysis model to classify movie reviews


as positive, negative, or neutral.

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.naive_bayes import MultinomialNB

from sklearn.pipeline import Pipeline

from sklearn.metrics import classification_report

# Step 1: Create a small dataset of movie reviews

data = {

    "Review": [
        "The movie was fantastic! I loved the characters and the plot.",
        "What a terrible movie. It was a complete waste of time.",
        "The movie was okay, not too good, not too bad.",
        "Absolutely loved it! One of the best movies I've seen this year.",
        "The plot was predictable, but the acting was decent.",
        "Horrible! I couldn't even finish it.",
        "It was just fine. Nothing special, nothing terrible.",
        "A masterpiece. Beautifully directed and acted.",
        "Worst movie ever. Do not watch this.",
        "Pretty average. Had some good moments but also some flaws."
    ],

    "Sentiment": [
        "Positive",
        "Negative",
        "Neutral",
        "Positive",
        "Neutral",
        "Negative",
        "Neutral",
        "Positive",
        "Negative",
        "Neutral"
    ]
}

# Convert to DataFrame

df = pd.DataFrame(data)

# Step 2: Split the data into training and test sets

X = df['Review']

y = df['Sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 3: Create a pipeline for vectorization and classification

pipeline = Pipeline([

('vectorizer', CountVectorizer()), # Converts text into numerical features

('classifier', MultinomialNB()) # Naive Bayes classifier

])

# Step 4: Train the model

pipeline.fit(X_train, y_train)

# Step 5: Evaluate the model

y_pred = pipeline.predict(X_test)

print("Classification Report:")

print(classification_report(y_test, y_pred))

# Step 6: Test with new reviews

new_reviews = [
    "What an amazing film! I would watch it again.",
    "It was boring and predictable. Not worth my time.",
    "An average movie. Nothing stood out."
]

predictions = pipeline.predict(new_reviews)

# Display predictions

for review, sentiment in zip(new_reviews, predictions):
    print(f"Review: {review}\nPredicted Sentiment: {sentiment}\n")

Output:

Explanation

1. Dataset:
o A small dataset with 10 movie reviews labeled as Positive, Negative, or
Neutral.
2. Splitting Data:
o The dataset is split into training (70%) and testing (30%) sets to evaluate the
model's performance.
3. Pipeline:
o CountVectorizer: Converts text into a numerical format using word
frequency (a small sketch of its output follows after this list).
o MultinomialNB: A Naive Bayes classifier, effective for text classification
tasks.
4. Training:
o The model is trained on the training dataset using the pipeline.
5. Evaluation:
o The model is tested on unseen reviews (test set) using
classification_report.
6. Predictions:
o The trained model predicts sentiments for new reviews.
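A minimal sketch of what CountVectorizer produces, on two short illustrative texts (independent of the pipeline above):

from sklearn.feature_extraction.text import CountVectorizer

texts = ["loved the movie", "terrible movie, terrible plot"]
vec = CountVectorizer()
counts = vec.fit_transform(texts)
print(vec.get_feature_names_out())  # vocabulary learned from the texts
print(counts.toarray())             # per-text word counts fed to the classifier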

Output

• Classification Report: Displays precision, recall, and F1 scores.


• Predictions: Shows the predicted sentiment for new reviews.

Practical 7

Aim: 7. Explore dimensionality reduction techniques like Principal


Component Analysis (PCA) to visualize high-dimensional data.

import numpy as np

import pandas as pd

from sklearn.decomposition import PCA

from sklearn.datasets import make_classification

import matplotlib.pyplot as plt

import seaborn as sns

# Step 1: Generate high-dimensional data

X, y = make_classification(
    n_samples=500,       # Number of samples
    n_features=10,       # Number of features
    n_informative=5,     # Number of informative features
    n_redundant=2,       # Number of redundant features
    n_classes=3,         # Number of classes
    random_state=42
)

# Step 2: Apply PCA

pca = PCA(n_components=2) # Reduce to 2 components for visualization

X_pca = pca.fit_transform(X)

# Step 3: Create a DataFrame for visualization

pca_df = pd.DataFrame(X_pca, columns=["PCA1", "PCA2"])

pca_df["Target"] = y

# Step 4: Visualize the PCA results

plt.figure(figsize=(10, 6))

sns.scatterplot(data=pca_df, x="PCA1", y="PCA2", hue="Target", palette="Set2", s=70)

plt.title("PCA Visualization of High-Dimensional Data", fontsize=14)

plt.xlabel("Principal Component 1", fontsize=12)

plt.ylabel("Principal Component 2", fontsize=12)

plt.legend(title="Class")

plt.grid()

plt.show()

# Step 5: Explain variance captured by PCA components

explained_variance = pca.explained_variance_ratio_

print("Explained Variance by Each Component:")

for i, variance in enumerate(explained_variance, 1):
    print(f"Component {i}: {variance:.2f}")

Explanation

1. Data Generation:
o make_classification creates a synthetic dataset with 10 features and 3
classes.
o n_informative and n_redundant specify the number of informative and
redundant features.
2. PCA Application:
o PCA(n_components=2) reduces the dimensionality to 2 components for easy
visualization.
o fit_transform applies PCA to the data.
3. Visualization:
o Seaborn scatterplot: Displays data points in a 2D space (PCA1 vs. PCA2)
colored by their class labels.
4. Variance Explained:
o explained_variance_ratio_: Provides the proportion of variance explained
by each principal component.
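To judge how many components are worth keeping, the explained-variance ratios can also be accumulated over all components. A minimal sketch, assuming X from the listing above:

import numpy as np
from sklearn.decomposition import PCA

pca_full = PCA().fit(X)  # keep all components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
for i, c in enumerate(cumulative, 1):
    print(f"First {i} component(s): {c:.2f} of total variance")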

Practical 8

Aim: 8. Train a support vector machine (SVM) to classify data.

import numpy as np

import pandas as pd

from sklearn.decomposition import PCA

from sklearn.datasets import make_classification

import matplotlib.pyplot as plt

import seaborn as sns

# Step 1: Generate high-dimensional data

X, y = make_classification(
    n_samples=500,       # Number of samples
    n_features=10,       # Number of features
    n_informative=5,     # Number of informative features
    n_redundant=2,       # Number of redundant features
    n_classes=3,         # Number of classes
    random_state=42
)

# Step 2: Apply PCA

pca = PCA(n_components=2) # Reduce to 2 components for visualization

X_pca = pca.fit_transform(X)

# Step 3: Create a DataFrame for visualization

pca_df = pd.DataFrame(X_pca, columns=["PCA1", "PCA2"])

pca_df["Target"] = y

# Step 4: Visualize the PCA results

plt.figure(figsize=(10, 6))

sns.scatterplot(data=pca_df, x="PCA1", y="PCA2", hue="Target", palette="Set2", s=70)

plt.title("PCA Visualization of High-Dimensional Data", fontsize=14)

plt.xlabel("Principal Component 1", fontsize=12)

plt.ylabel("Principal Component 2", fontsize=12)

plt.legend(title="Class")

plt.grid()

plt.show()

# Step 5: Explain variance captured by PCA components

explained_variance = pca.explained_variance_ratio_

print("Explained Variance by Each Component:")

for i, variance in enumerate(explained_variance, 1):
    print(f"Component {i}: {variance:.2f}")

Output:

Explanation
1. Dataset Creation:
o make_classification: Generates a synthetic dataset with two classes and
four features.
o 2 features are informative, and the rest are noise.
2. Train-Test Split:
o Splits data into 70% training and 30% testing sets.
3. Training the SVM:
o SVC: Trains a Support Vector Machine with a linear kernel to classify data.
4. Predictions:
o Predictions are made on the testing data.
5. Evaluation:
o classification_report: Displays precision, recall, F1-score, and accuracy.
o Confusion matrix: Provides a visual representation of true and false
predictions.
6. Visualization:
o Decision boundaries are plotted for the first two features to demonstrate how
the SVM separates the classes.

Output

1. Classification Report:
o Precision, recall, F1-score, and accuracy for each class.
2. Confusion Matrix:
o A heatmap displaying the confusion matrix.
3. Decision Boundary Plot:
o Visualization of the decision boundaries and the training/testing points.
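The listing above repeats the PCA example from Practical 7, while the explanation describes the intended SVM workflow. A minimal sketch consistent with that description (the dataset parameters, linear kernel, and 70/30 split are assumptions based on the explanation, not taken from a listing):

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

# Step 1: synthetic two-class dataset with four features (two informative)
X, y = make_classification(n_samples=300, n_features=4, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=42)

# Step 2: 70% training / 30% testing split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 3: train a linear-kernel SVM
svm = SVC(kernel="linear", random_state=42)
svm.fit(X_train, y_train)

# Steps 4-5: predict and evaluate
y_pred = svm.predict(X_test)
print(classification_report(y_test, y_pred))
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt="d", cmap="Blues")
plt.title("Confusion Matrix")
plt.show()

# Step 6: decision boundary using the first two features only
svm_2d = SVC(kernel="linear", random_state=42).fit(X_train[:, :2], y_train)
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.05), np.arange(y_min, y_max, 0.05))
Z = svm_2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3, cmap="coolwarm")
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, edgecolor="k", cmap="coolwarm")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("SVM Decision Boundary (first two features)")
plt.show()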

