
Nims Institute of Engineering & Technology

Nims University Rajasthan, Jaipur

LAB MANUAL

FOR
Machine Learning (CSC601B)

NIMS UNIVERSITY RAJASTHAN, JAIPUR


Jaipur-Delhi Highway
Jaipur - 303121, Rajasthan, India
Website: www.nimsuniversity.org

Contents
Syllabus ................................................................................................................................................... 3
MACHINE LEARNING LAB ........................................................................................................................ 3
Practical 1 ................................................................................................................................................ 4
Aim: 1. Predict housing prices based on features like area, number of bedrooms, and location using
linear regression. .................................................................................................................................... 4
Practical 2 .............................................................................................................................................. 16
Aim: 2. Classify a dataset using a k-Nearest Neighbors (kNN) classifier ............................................... 16
Practical 3 .............................................................................................................................................. 22
Aim: 3. Implement a decision tree algorithm to classify email spam based on keywords and sender
information. .......................................................................................................................................... 22
Practical 4 .............................................................................................................................................. 25
Aim: 4. Cluster customer data based on purchase history using k-means clustering. ......................... 25
Practical 5 .............................................................................................................................................. 28
Aim: 5. Predict future stock prices using a time series forecasting model (e.g., ARIMA) .................... 28
Practical 6 .............................................................................................................................................. 32
Aim: 6. Develop a sentiment analysis model to classify movie reviews as positive, negative, or
neutral. .................................................................................................................................................. 32
Practical 7 .............................................................................................................................................. 36
Aim: 7. Explore dimensionality reduction techniques like Principal Component Analysis (PCA) to
visualize high-dimensional data. ........................................................................................................... 36
Practical 8 .............................................................................................................................................. 39
Aim: 8. Train a support vector machine (SVM) to classify data. ........................................................... 39

Syllabus

MACHINE LEARNING LAB


Course Objectives:

1. Understand the concept of learning in computer science.


2. Compare and contrast different paradigms for learning (supervised, unsupervised, etc.).
3. Design experiments to evaluate and compare different machine learning techniques on
real-world problems.

Experiments

1. Predict housing prices based on features like area, number of bedrooms, and
location using linear regression.
2. Classify a dataset using a k-Nearest Neighbors (kNN) classifier.
3. Implement a decision tree algorithm to classify email spam based on keywords
and sender information.
4. Cluster customer data based on purchase history using k-means clustering.
5. Predict future stock prices using a time series forecasting model (e.g., ARIMA).
6. Develop a sentiment analysis model to classify movie reviews as positive,
negative, or neutral.
7. Explore dimensionality reduction techniques like Principal Component Analysis
(PCA) to visualize high-dimensional data.
8. Train a support vector machine (SVM) to classify data.

Course Outcomes:
1. Implement and analyse existing learning algorithms, including well-studied methods for
classification, regression and clustering.
2. Apply evaluation metrics to various algorithms.
3. Identify and implement solutions to real-world problems.

Practical 1

Aim: 1. Predict housing prices based on features like area, number of


bedrooms, and location using linear regression.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Create a small dataset


data = {
"Area": [1500, 1800, 2400, 3000, 3500],
"Bedrooms": [3, 4, 3, 5, 4],
"Location": [1, 2, 1, 3, 2], # Encoding for location (e.g., 1: City A, 2: City B, 3: City C)
"Price": [300000, 400000, 350000, 500000, 450000]
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Features and target


X = df[["Area", "Bedrooms", "Location"]]
y = df["Price"]

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the Linear Regression model


model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Model evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Model Coefficients:", model.coef_)


print("Model Intercept:", model.intercept_)
print("Mean Squared Error:", mse)
print("R2 Score:", r2)

# Visualize actual vs predicted


plt.scatter(range(len(y_test)), y_test, color="blue", label="Actual Prices")

plt.scatter(range(len(y_pred)), y_pred, color="red", label="Predicted Prices")
plt.xlabel("Sample Index")
plt.ylabel("Price")
plt.legend()
plt.title("Actual vs Predicted Prices")
plt.show()

Output:

How the Code Works


• Dataset: a small in-code dataset with the features Area, Bedrooms, and Location and the
target Price. Location is encoded as an integer (1: City A, 2: City B, 3: City C); a sketch of
how such an encoding can be produced is given after this list.
• Splitting Data: divides the data into training (80%) and testing (20%) sets.
• Linear Regression: uses LinearRegression from scikit-learn to build the model.
• Evaluation: computes Mean Squared Error (MSE) and the R² score on the test set. Note that
with only five rows the test set holds a single sample, so the R² score is not well defined here;
use a larger dataset for a meaningful evaluation.
• Visualization: compares actual vs. predicted prices.
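The integer Location codes above are assumed to come from a mapping of city names to numbers. A minimal sketch of how such an encoding could be produced with pandas (the City column and mapping are illustrative, not part of the dataset above):

import pandas as pd

# Hypothetical raw data with city names instead of numeric codes
raw = pd.DataFrame({"City": ["City A", "City B", "City A", "City C", "City B"]})
location_map = {"City A": 1, "City B": 2, "City C": 3}  # same coding as above
raw["Location"] = raw["City"].map(location_map)
print(raw)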

Reading Data from an Excel File instead of a Hard-Coded DataFrame

First, create the dataset and save it as 1_Housing.xlsx.

Upload this file to Google Drive (if you are working in Google Colab, mount Drive as sketched below; otherwise place the file next to your script).
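If the notebook runs in Google Colab, the uploaded file can be read after mounting Drive. A minimal sketch, assuming the file sits directly under My Drive (adjust the path to wherever you uploaded it):

from google.colab import drive
import pandas as pd

drive.mount('/content/drive')   # authorise access to your Drive
df = pd.read_excel('/content/drive/MyDrive/1_Housing.xlsx')
print(df.head())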

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load data from an Excel file


data_path = "1_Housing.xlsx" # Replace with your Excel file path
df = pd.read_excel(data_path)

# Features and target
X = df[["Area", "Bedrooms", "Location"]]
y = df["Price"]

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the Linear Regression model


model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Model evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Model Coefficients:", model.coef_)


print("Model Intercept:", model.intercept_)
print("Mean Squared Error:", mse)
print("R2 Score:", r2)

# Visualize actual vs predicted


plt.scatter(range(len(y_test)), y_test, color="blue", label="Actual Prices")
plt.scatter(range(len(y_pred)), y_pred, color="red", label="Predicted Prices")
plt.xlabel("Sample Index")
plt.ylabel("Price")
plt.legend()
plt.title("Actual vs Predicted Prices")

plt.show()
Output:

Reading the data file from Excel and then predicting the price of a new house from its area,
number of bedrooms, and location.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the dataset from an Excel file


df = pd.read_excel('1_Housing.xlsx')  # Replace '1_Housing.xlsx' with your file path if different

# Features and target
X = df[["Area", "Bedrooms", "Location"]]
y = df["Price"]

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the Linear Regression model


model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Model evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Model Coefficients:", model.coef_)


print("Model Intercept:", model.intercept_)
print("Mean Squared Error:", mse)
print("R2 Score:", r2)

# Visualize actual vs predicted


plt.scatter(range(len(y_test)), y_test, color="blue", label="Actual Prices")
plt.scatter(range(len(y_pred)), y_pred, color="red", label="Predicted Prices")
plt.xlabel("Sample Index")
plt.ylabel("Price")
plt.legend()

plt.title("Actual vs Predicted Prices")
plt.show()

# Predict price for new data


new_data = pd.DataFrame({
"Area": [float(input("Enter Area: "))],
"Bedrooms": [int(input("Enter Bedrooms: "))],
"Location": [int(input("Enter Location (e.g., 1 for City A, 2 for City B): "))]
})

predicted_price = model.predict(new_data)
print(f"Predicted Price: {predicted_price[0]}")

Practical 2

Aim: 2. Classify a dataset using a k-Nearest Neighbors (kNN) classifier

from sklearn.datasets import load_iris

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler

from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load dataset

data = load_iris()

X = data.data

y = data.target

# Split dataset

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize data

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)

# Train kNN classifier

k=3

knn = KNeighborsClassifier(n_neighbors=k)

knn.fit(X_train, y_train)

# Make predictions

y_pred = knn.predict(X_test)

# Evaluate the model

print("Accuracy:", accuracy_score(y_test, y_pred))

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

print("Classification Report:\n", classification_report(y_test, y_pred))

Output:

Adding some visualization to the code:

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.datasets import load_iris

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler

from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load dataset

data = load_iris()

X = data.data

y = data.target

feature_names = data.feature_names

class_names = data.target_names

# Split dataset

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize data

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)

# Train kNN classifier

k=3

knn = KNeighborsClassifier(n_neighbors=k)

knn.fit(X_train, y_train)

# Make predictions

y_pred = knn.predict(X_test)

# Evaluate the model

accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

print("Classification Report:\n", classification_report(y_test, y_pred))

# Visualizations

# 1. Pairplot of features

df = pd.DataFrame(X, columns=feature_names)

df['target'] = y

sns.pairplot(df, hue='target', diag_kind='hist', palette='Set2')

plt.suptitle('Feature Distributions and Pairwise Scatter Plots', y=1.02)

plt.show()

# 2. Confusion Matrix Heatmap

conf_matrix = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(6, 5))

sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names, yticklabels=class_names)

plt.title('Confusion Matrix')

plt.xlabel('Predicted Labels')

plt.ylabel('True Labels')

plt.show()

# 3. Decision Boundary (plotted on the first two features)

if X.shape[1] >= 2:  # use the first two features for a 2D plot

    X_plot = X[:, :2]

    X_train_plot, X_test_plot, y_train_plot, y_test_plot = train_test_split(
        X_plot, y, test_size=0.2, random_state=42)

    X_train_plot = scaler.fit_transform(X_train_plot)

    X_test_plot = scaler.transform(X_test_plot)

    knn_2d = KNeighborsClassifier(n_neighbors=k)

    knn_2d.fit(X_train_plot, y_train_plot)

    # Create a mesh grid covering the training data

    x_min, x_max = X_train_plot[:, 0].min() - 1, X_train_plot[:, 0].max() + 1

    y_min, y_max = X_train_plot[:, 1].min() - 1, X_train_plot[:, 1].max() + 1

    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01),
                         np.arange(y_min, y_max, 0.01))

    Z = knn_2d.predict(np.c_[xx.ravel(), yy.ravel()])

    Z = Z.reshape(xx.shape)

    # Plot decision boundary

    plt.figure(figsize=(8, 6))

    plt.contourf(xx, yy, Z, alpha=0.8, cmap='Set3')

    scatter = plt.scatter(X_test_plot[:, 0], X_test_plot[:, 1], c=y_test_plot, edgecolor='k', cmap='viridis')

    plt.title('kNN Decision Boundary (2D)')

    plt.xlabel(feature_names[0])

    plt.ylabel(feature_names[1])

    legend = plt.legend(handles=scatter.legend_elements()[0], labels=list(class_names))

    plt.show()

else:

    print("Decision boundary visualization needs at least 2 features.")

Output:

How kNN works?
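kNN keeps the training samples and, for a new point, finds the k closest training samples (typically by Euclidean distance) and predicts the majority class among them. A minimal NumPy sketch of this idea (illustrative only, separate from the scikit-learn listing above):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # Distance from the query point to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Labels of the k nearest neighbours
    nearest_labels = y_train[np.argsort(distances)[:k]]
    # Majority vote among the k nearest labels
    return Counter(nearest_labels).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # expected: 0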

Practical 3

Aim: 3. Implement a decision tree algorithm to classify email spam


based on keywords and sender information.

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier, export_text

from sklearn.metrics import accuracy_score, classification_report

# Sample dataset

data = {
    'contains_free': [1, 0, 1, 0, 1, 0, 1, 0],
    'contains_offer': [0, 1, 1, 0, 0, 1, 1, 0],
    'sender_known': [0, 1, 1, 1, 0, 1, 0, 0],
    'spam': [1, 0, 1, 0, 1, 0, 1, 0]
}

# Create a DataFrame

df = pd.DataFrame(data)

# Features and target

X = df[['contains_free', 'contains_offer', 'sender_known']]

y = df['spam']

# Split dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Initialize the Decision Tree Classifier

clf = DecisionTreeClassifier(random_state=42)

# Train the model

clf.fit(X_train, y_train)

# Make predictions

y_pred = clf.predict(X_test)

# Evaluate the model

accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy * 100:.2f}%")

print("\nClassification Report:")

print(classification_report(y_test, y_pred))

# Display the decision tree

tree_rules = export_text(clf, feature_names=list(X.columns))

print("\nDecision Tree Rules:")

print(tree_rules)

Output:

Explanation of the Code

1. Dataset:
The dataset contains three features:
o contains_free: Indicates if the email contains the keyword "free" (1 for yes,
0 for no).
o contains_offer: Indicates if the email contains the keyword "offer" (1 for
yes, 0 for no).
o sender_known: Indicates if the sender is known (1 for yes, 0 for no).
The spam column is the target (1 for spam, 0 for not spam).
2. Splitting the Dataset:
The dataset is split into training and testing sets using train_test_split.
3. Model Training:
A decision tree classifier is initialized and trained on the training set.
4. Model Evaluation:
Predictions are made on the testing set, and the accuracy and classification report are
printed.
5. Visualizing the Tree:
The export_text function generates human-readable decision rules for the tree.
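Besides export_text, the fitted tree can also be drawn graphically. A minimal sketch using sklearn.tree.plot_tree, assuming clf and X from the listing above:

from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
plot_tree(clf, feature_names=list(X.columns), class_names=["not spam", "spam"], filled=True)
plt.show()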

Output Example
Accuracy: 100.00%

Classification Report:
precision recall f1-score support

0 1.00 1.00 1.00 2


1 1.00 1.00 1.00 2

accuracy 1.00 4
macro avg 1.00 1.00 1.00 4
weighted avg 1.00 1.00 1.00 4

Decision Tree Rules:


|--- contains_free <= 0.50
| |--- sender_known <= 0.50
| | |--- class: 0
| |--- sender_known > 0.50
| | |--- class: 0
|--- contains_free > 0.50
| |--- class: 1
Notes

1. Customization: You can replace the sample dataset with a larger and more realistic
email dataset.
2. Feature Engineering: Add more features like the length of the email, frequency of
certain words, etc.
3. Model Tuning: Adjust the parameters of DecisionTreeClassifier (e.g.,
max_depth, min_samples_split) for better performance.
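As a sketch of the tuning mentioned in note 3, the tree parameters could be searched with GridSearchCV. This assumes a larger, more realistic dataset X, y, since the eight-row toy set above is too small for meaningful cross-validation:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {"max_depth": [2, 3, 4, None], "min_samples_split": [2, 4, 8]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)  # X, y: features and labels of the larger dataset
print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)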

Practical 4

Aim: 4. Cluster customer data based on purchase history using k-means


clustering.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Simulated customer data


data = {
'customer_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'total_spent': [500, 1500, 300, 800, 2500, 200, 1000, 1800, 400, 700],
'frequency': [5, 20, 2, 8, 30, 1, 12, 25, 3, 6],
'average_purchase_value': [100, 75, 150, 100, 83, 200, 83, 72, 133, 117],
}

# Create a DataFrame
df = pd.DataFrame(data)

# Features for clustering


features = df[['total_spent', 'frequency', 'average_purchase_value']]

# Standardize the features


scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

# Apply KMeans clustering


kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(scaled_features)

# Add cluster labels to the DataFrame


df['cluster'] = clusters

# Visualize the clusters using PCA (2D projection)


pca = PCA(n_components=2)
pca_features = pca.fit_transform(scaled_features)

plt.figure(figsize=(8, 6))
for cluster in np.unique(clusters):
    plt.scatter(
        pca_features[clusters == cluster, 0],
        pca_features[clusters == cluster, 1],
        label=f'Cluster {cluster}'
    )

# Add centroids to the plot
centroids = pca.transform(kmeans.cluster_centers_)
plt.scatter(centroids[:, 0], centroids[:, 1], s=200, c='black', marker='X', label='Centroids')

plt.title('Customer Clusters')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.legend()
plt.grid()
plt.show()

# Print clustered data


print("Clustered Customer Data:")
print(df)

Output:

Key Components:

1. Dataset:
o total_spent: Total amount spent by a customer.
o frequency: Number of purchases.
o average_purchase_value: Average value of each purchase.
2. Standardization:
o Used StandardScaler to normalize features for better clustering
performance.
3. K-Means:
o Specified n_clusters=3 (can be optimized using the elbow method or
silhouette score).
4. Visualization:
o Used PCA for 2D visualization of high-dimensional data.
o Plotted clusters with their centroids.

Notes:

1. Elbow Method: Use the elbow method to determine the optimal number of clusters
by plotting the inertia for different cluster counts (a sketch follows after these notes).
2. Feature Engineering: Add more features like customer lifetime value, recency, etc.,
for better clustering.
3. Real Data: Replace the simulated data with real customer data.
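A minimal sketch of the elbow method from note 1, reusing scaled_features from the listing above: fit k-means for several values of k and look for the bend in the inertia curve.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = []
k_values = range(1, 8)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(scaled_features)
    inertias.append(km.inertia_)

plt.plot(k_values, inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.title("Elbow Method")
plt.show()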

Practical 5

Aim: 5. Predict future stock prices using a time series forecasting model
(e.g., ARIMA)
import pandas as pd
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
import matplotlib.pyplot as plt

# Step 1: Create a small dataset for stock prices


data = {
"Date": pd.date_range(start="2024-01-01", periods=10, freq="D"),
"Stock_Price": [150 + i * 2 + np.random.uniform(-3, 3) for i in
range(10)] # Simulated stock prices with noise
}
stock_data = pd.DataFrame(data)
stock_data.set_index("Date", inplace=True)

# Step 2: Fit an ARIMA model


stock_prices = stock_data['Stock_Price']
model = ARIMA(stock_prices, order=(1, 1, 1))  # ARIMA parameters (p, d, q)
fitted_model = model.fit()

# Step 3: Forecast future stock prices


forecast_steps = 5
forecast = fitted_model.forecast(steps=forecast_steps)

# Step 4: Plot the actual and forecasted stock prices


plt.figure(figsize=(10, 6))
plt.plot(stock_prices, label="Actual Stock Prices", marker="o")
forecast_index = pd.date_range(start=stock_data.index[-1] +
pd.Timedelta(days=1), periods=forecast_steps, freq='D')
plt.plot(forecast_index, forecast, label="Forecasted Stock Prices",
marker="x", linestyle="--", color="red")
plt.xlabel("Date")
plt.ylabel("Stock Price")
plt.title("Stock Price Forecast using ARIMA")
plt.legend()
plt.grid()
plt.show()

# Step 5: Display the forecasted values


print("Forecasted Stock Prices:")
forecast_df = pd.DataFrame({"Date": forecast_index, "Forecasted Price":
forecast.values})

print(forecast_df)

Output:

Here’s a step-by-step explanation of the code:

Step 1: Create a Small Dataset for Stock Prices


data = {
"Date": pd.date_range(start="2024-01-01", periods=10, freq="D"),
"Stock_Price": [150 + i * 2 + np.random.uniform(-3, 3) for i in
range(10)] # Simulated stock prices with noise
}
stock_data = pd.DataFrame(data)
stock_data.set_index("Date", inplace=True)

1. pd.date_range: Creates a range of 10 consecutive dates starting from "2024-01-01".


2. Stock_Price formula: Simulates prices starting at 150 and incrementing by 2 per day,
with random noise added using np.random.uniform(-3, 3).
3. pd.DataFrame: Stores the generated data in a pandas DataFrame.
4. set_index: Sets the Date column as the index, which is essential for time series
analysis.

Step 2: Fit an ARIMA Model
stock_prices = stock_data['Stock_Price']
model = ARIMA(stock_prices, order=(1, 1, 1)) # ARIMA parameters (p, d, q)
fitted_model = model.fit()

1. ARIMA: A popular time series forecasting model with three parameters:


o p: Autoregressive order (how past values influence current values).
o d: Degree of differencing (removes trends in the data).
o q: Moving average order (models residuals/errors).
2. Fit the model: The fit() method trains the ARIMA model on the dataset.
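The order (1, 1, 1) above is fixed by hand. A minimal sketch of comparing a few candidate orders by AIC (lower is generally better), reusing stock_prices from the listing above; on a series of only ten points the fits may emit convergence warnings:

from statsmodels.tsa.arima.model import ARIMA

for order in [(1, 1, 0), (0, 1, 1), (1, 1, 1), (2, 1, 1)]:
    aic = ARIMA(stock_prices, order=order).fit().aic
    print(f"ARIMA{order}: AIC = {aic:.2f}")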

Step 3: Forecast Future Stock Prices


forecast_steps = 5
forecast = fitted_model.forecast(steps=forecast_steps)

1. forecast_steps: Specifies the number of future days to predict (5 in this case).


2. fitted_model.forecast: Generates the predicted values for the specified number of
steps.

Step 4: Plot Actual and Forecasted Stock Prices


plt.figure(figsize=(10, 6))
plt.plot(stock_prices, label="Actual Stock Prices", marker="o")
forecast_index = pd.date_range(start=stock_data.index[-1] +
pd.Timedelta(days=1), periods=forecast_steps, freq='D')
plt.plot(forecast_index, forecast, label="Forecasted Stock Prices",
marker="x", linestyle="--", color="red")
plt.xlabel("Date")
plt.ylabel("Stock Price")
plt.title("Stock Price Forecast using ARIMA")
plt.legend()
plt.grid()
plt.show()

1. Plot actual data: plt.plot displays the historical stock prices (stock_prices).
2. Create forecast dates: pd.date_range generates future dates starting from the day
after the last date in the dataset.
3. Plot forecast: Plots the predicted stock prices on the same graph.
4. Styling: Adds labels, title, legend, and grid for better visualization.

Step 5: Display Forecasted Values



forecast_df = pd.DataFrame({"Date": forecast_index, "Forecasted Price":
forecast.values})
print(forecast_df)

1. Create DataFrame: Combines the forecasted dates and predicted prices into a new
DataFrame.
2. Print results: Displays the forecasted values for better clarity.

Output

• Graph: Shows both the actual stock prices and the forecasted values with clear
markers.
• Table: Displays the forecasted stock prices in tabular form.

Practical 6

Aim: 6. Develop a sentiment analysis model to classify movie reviews


as positive, negative, or neutral.

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.naive_bayes import MultinomialNB

from sklearn.pipeline import Pipeline

from sklearn.metrics import classification_report

# Step 1: Create a small dataset of movie reviews

data = {

    "Review": [
        "The movie was fantastic! I loved the characters and the plot.",
        "What a terrible movie. It was a complete waste of time.",
        "The movie was okay, not too good, not too bad.",
        "Absolutely loved it! One of the best movies I've seen this year.",
        "The plot was predictable, but the acting was decent.",
        "Horrible! I couldn't even finish it.",
        "It was just fine. Nothing special, nothing terrible.",
        "A masterpiece. Beautifully directed and acted.",
        "Worst movie ever. Do not watch this.",
        "Pretty average. Had some good moments but also some flaws."
    ],

    "Sentiment": [
        "Positive",
        "Negative",
        "Neutral",
        "Positive",
        "Neutral",
        "Negative",
        "Neutral",
        "Positive",
        "Negative",
        "Neutral"
    ]
}

# Convert to DataFrame

df = pd.DataFrame(data)

# Step 2: Split the data into training and test sets

X = df['Review']

y = df['Sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 3: Create a pipeline for vectorization and classification

pipeline = Pipeline([

('vectorizer', CountVectorizer()), # Converts text into numerical features

('classifier', MultinomialNB()) # Naive Bayes classifier

])

# Step 4: Train the model

pipeline.fit(X_train, y_train)

# Step 5: Evaluate the model

y_pred = pipeline.predict(X_test)

print("Classification Report:")

print(classification_report(y_test, y_pred))

# Step 6: Test with new reviews

new_reviews = [
    "What an amazing film! I would watch it again.",
    "It was boring and predictable. Not worth my time.",
    "An average movie. Nothing stood out."
]

predictions = pipeline.predict(new_reviews)

# Display predictions

for review, sentiment in zip(new_reviews, predictions):
    print(f"Review: {review}\nPredicted Sentiment: {sentiment}\n")

Output:

Explanation

1. Dataset:
o A small dataset with 10 movie reviews labeled as Positive, Negative, or
Neutral.
2. Splitting Data:
o The dataset is split into training (70%) and testing (30%) sets to evaluate the
model's performance.
3. Pipeline:
o CountVectorizer: Converts text into a numerical format using word
frequency (a small sketch of its output follows after this list).
o MultinomialNB: A Naive Bayes classifier, effective for text classification
tasks.
4. Training:
o The model is trained on the training dataset using the pipeline.
5. Evaluation:
o The model is tested on unseen reviews (test set) using
classification_report.
6. Predictions:
o The trained model predicts sentiments for new reviews.
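A minimal sketch of what CountVectorizer produces, on two short illustrative texts (independent of the pipeline above):

from sklearn.feature_extraction.text import CountVectorizer

texts = ["loved the movie", "terrible movie, terrible plot"]
vec = CountVectorizer()
counts = vec.fit_transform(texts)
print(vec.get_feature_names_out())  # vocabulary learned from the texts
print(counts.toarray())             # per-text word counts fed to the classifier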

Output

• Classification Report: Displays precision, recall, and F1 scores.


• Predictions: Shows the predicted sentiment for new reviews.

Practical 7

Aim: 7. Explore dimensionality reduction techniques like Principal


Component Analysis (PCA) to visualize high-dimensional data.

import numpy as np

import pandas as pd

from sklearn.decomposition import PCA

from sklearn.datasets import make_classification

import matplotlib.pyplot as plt

import seaborn as sns

# Step 1: Generate high-dimensional data

X, y = make_classification(
    n_samples=500,       # Number of samples
    n_features=10,       # Number of features
    n_informative=5,     # Number of informative features
    n_redundant=2,       # Number of redundant features
    n_classes=3,         # Number of classes
    random_state=42
)

# Step 2: Apply PCA

pca = PCA(n_components=2) # Reduce to 2 components for visualization

X_pca = pca.fit_transform(X)

# Step 3: Create a DataFrame for visualization

pca_df = pd.DataFrame(X_pca, columns=["PCA1", "PCA2"])

pca_df["Target"] = y

# Step 4: Visualize the PCA results

plt.figure(figsize=(10, 6))

sns.scatterplot(data=pca_df, x="PCA1", y="PCA2", hue="Target", palette="Set2", s=70)

plt.title("PCA Visualization of High-Dimensional Data", fontsize=14)

plt.xlabel("Principal Component 1", fontsize=12)

plt.ylabel("Principal Component 2", fontsize=12)

plt.legend(title="Class")

plt.grid()

plt.show()

# Step 5: Explain variance captured by PCA components

explained_variance = pca.explained_variance_ratio_

print("Explained Variance by Each Component:")

for i, variance in enumerate(explained_variance, 1):
    print(f"Component {i}: {variance:.2f}")

Explanation

1. Data Generation:
o make_classification creates a synthetic dataset with 10 features and 3
classes.
o n_informative and n_redundant specify the number of informative and
redundant features.
2. PCA Application:
o PCA(n_components=2) reduces the dimensionality to 2 components for easy
visualization.
o fit_transform applies PCA to the data.
3. Visualization:
o Seaborn scatterplot: Displays data points in a 2D space (PCA1 vs. PCA2)
colored by their class labels.
4. Variance Explained:
o explained_variance_ratio_: Provides the proportion of variance explained
by each principal component.
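To judge how many components are worth keeping, the explained-variance ratios can also be accumulated over all components. A minimal sketch, assuming X from the listing above:

import numpy as np
from sklearn.decomposition import PCA

pca_full = PCA().fit(X)  # keep all components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
for i, c in enumerate(cumulative, 1):
    print(f"First {i} component(s): {c:.2f} of total variance")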

Practical 8

Aim: 8. Train a support vector machine (SVM) to classify data.

import numpy as np

import pandas as pd

from sklearn.decomposition import PCA

from sklearn.datasets import make_classification

import matplotlib.pyplot as plt

import seaborn as sns

# Step 1: Generate high-dimensional data

X, y = make_classification(
    n_samples=500,       # Number of samples
    n_features=10,       # Number of features
    n_informative=5,     # Number of informative features
    n_redundant=2,       # Number of redundant features
    n_classes=3,         # Number of classes
    random_state=42
)

# Step 2: Apply PCA

pca = PCA(n_components=2) # Reduce to 2 components for visualization

X_pca = pca.fit_transform(X)

# Step 3: Create a DataFrame for visualization

pca_df = pd.DataFrame(X_pca, columns=["PCA1", "PCA2"])

pca_df["Target"] = y

# Step 4: Visualize the PCA results

plt.figure(figsize=(10, 6))

sns.scatterplot(data=pca_df, x="PCA1", y="PCA2", hue="Target", palette="Set2", s=70)

plt.title("PCA Visualization of High-Dimensional Data", fontsize=14)

plt.xlabel("Principal Component 1", fontsize=12)

plt.ylabel("Principal Component 2", fontsize=12)

plt.legend(title="Class")

plt.grid()

plt.show()

# Step 5: Explain variance captured by PCA components

explained_variance = pca.explained_variance_ratio_

print("Explained Variance by Each Component:")

for i, variance in enumerate(explained_variance, 1):
    print(f"Component {i}: {variance:.2f}")

Output:

Explanation
1. Dataset Creation:
o make_classification: Generates a synthetic dataset with two classes and
four features.
o 2 features are informative, and the rest are noise.
2. Train-Test Split:
o Splits data into 70% training and 30% testing sets.
3. Training the SVM:
o SVC: Trains a Support Vector Machine with a linear kernel to classify data.
4. Predictions:
o Predictions are made on the testing data.
5. Evaluation:
o classification_report: Displays precision, recall, F1-score, and accuracy.
o Confusion matrix: Provides a visual representation of true and false
predictions.
6. Visualization:
o Decision boundaries are plotted for the first two features to demonstrate how
the SVM separates the classes.

Output

1. Classification Report:
o Precision, recall, F1-score, and accuracy for each class.
2. Confusion Matrix:
o A heatmap displaying the confusion matrix.
3. Decision Boundary Plot:
o Visualization of the decision boundaries and the training/testing points.
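The listing above repeats the PCA example from Practical 7, while the explanation describes the intended SVM workflow. A minimal sketch consistent with that description (the dataset parameters, linear kernel, and 70/30 split are assumptions based on the explanation, not taken from a listing):

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

# Step 1: synthetic two-class dataset with four features (two informative)
X, y = make_classification(n_samples=300, n_features=4, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=42)

# Step 2: 70% training / 30% testing split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 3: train a linear-kernel SVM
svm = SVC(kernel="linear", random_state=42)
svm.fit(X_train, y_train)

# Steps 4-5: predict and evaluate
y_pred = svm.predict(X_test)
print(classification_report(y_test, y_pred))
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt="d", cmap="Blues")
plt.title("Confusion Matrix")
plt.show()

# Step 6: decision boundary using the first two features only
svm_2d = SVC(kernel="linear", random_state=42).fit(X_train[:, :2], y_train)
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.05), np.arange(y_min, y_max, 0.05))
Z = svm_2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3, cmap="coolwarm")
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, edgecolor="k", cmap="coolwarm")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("SVM Decision Boundary (first two features)")
plt.show()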

