LAB MANUAL
FOR
Machine Learning (CSC601B)
Contents
Syllabus
MACHINE LEARNING LAB
Practical 1. Predict housing prices based on features like area, number of bedrooms, and location using linear regression.
Practical 2. Classify a dataset using a k-Nearest Neighbors (kNN) classifier.
Practical 3. Implement a decision tree algorithm to classify email spam based on keywords and sender information.
Practical 4. Cluster customer data based on purchase history using k-means clustering.
Practical 5. Predict future stock prices using a time series forecasting model (e.g., ARIMA).
Practical 6. Develop a sentiment analysis model to classify movie reviews as positive, negative, or neutral.
Practical 7. Explore dimensionality reduction techniques like Principal Component Analysis (PCA) to visualize high-dimensional data.
Practical 8. Train a support vector machine (SVM) to classify data.
Syllabus
Experiments
1. Predict housing prices based on features like area, number of bedrooms, and location using linear regression.
2. Classify a dataset using a k-Nearest Neighbors (kNN) classifier.
3. Implement a decision tree algorithm to classify email spam based on keywords and sender information.
4. Cluster customer data based on purchase history using k-means clustering.
5. Predict future stock prices using a time series forecasting model (e.g., ARIMA).
6. Develop a sentiment analysis model to classify movie reviews as positive, negative, or neutral.
7. Explore dimensionality reduction techniques like Principal Component Analysis (PCA) to visualize high-dimensional data.
8. Train a support vector machine (SVM) to classify data.
Course Outcomes:
1. Implement and analyse existing learning algorithms, including well-studied methods for classification, regression, and clustering.
2. Apply evaluation metrics to various algorithms.
3. Identify and implement solutions to real-world problems.
Practical 1
Aim: 1. Predict housing prices based on features like area, number of bedrooms, and location using linear regression.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Sample housing data (illustrative values; replace with real data as needed)
data = {"Area": [1200, 1500, 800, 2000, 1100, 1700, 950, 1800],
        "Bedrooms": [2, 3, 1, 4, 2, 3, 2, 4],
        "Location": [1, 2, 1, 3, 2, 3, 1, 2],
        "Price": [200000, 260000, 145000, 360000, 195000, 310000, 160000, 330000]}

# Convert to DataFrame
df = pd.DataFrame(data)

# Features and target
X = df[["Area", "Bedrooms", "Location"]]
y = df["Price"]

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Model evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}, R-squared: {r2:.2f}")

# Plot actual and predicted prices for the test samples
plt.scatter(range(len(y_test)), y_test, color="blue", label="Actual Prices")
plt.scatter(range(len(y_pred)), y_pred, color="red", label="Predicted Prices")
plt.xlabel("Sample Index")
plt.ylabel("Price")
plt.legend()
plt.title("Actual vs Predicted Prices")
plt.show()
Output:
Reading the data from an Excel file instead of a hard-coded DataFrame:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Read the dataset from an Excel file (filename assumed; adjust as needed)
df = pd.read_excel("housing_data.xlsx")

# Encode a categorical Location column as numeric codes, if necessary
df["Location"] = df["Location"].astype("category").cat.codes

# Features and target
X = df[["Area", "Bedrooms", "Location"]]
y = df["Price"]

# Split, train, and predict
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Model evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}, R-squared: {r2:.2f}")

# Plot actual vs predicted prices
plt.scatter(range(len(y_test)), y_test, label="Actual Prices")
plt.scatter(range(len(y_pred)), y_pred, label="Predicted Prices")
plt.legend()
plt.show()
Output:
Reading a data file from Excel and then predicting the price on the basis of Area, Number of Rooms, and City.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Read the dataset (filename assumed; adjust as needed)
df = pd.read_excel("housing_data.xlsx")
df["Location"] = df["Location"].astype("category").cat.codes

# Features and target
X = df[["Area", "Bedrooms", "Location"]]
y = df["Price"]

# Split and train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Model evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Plot actual vs predicted prices
plt.scatter(range(len(y_test)), y_test, label="Actual Prices")
plt.scatter(range(len(y_pred)), y_pred, label="Predicted Prices")
plt.legend()
plt.title("Actual vs Predicted Prices")
plt.show()

# Predict the price of a new house (illustrative feature values)
new_data = pd.DataFrame({"Area": [1500], "Bedrooms": [3], "Location": [1]})
predicted_price = model.predict(new_data)
print(f"Predicted Price: {predicted_price[0]}")
Practical 2
Aim: 2. Classify a dataset using a k-Nearest Neighbors (kNN) classifier.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train a kNN classifier with k = 3
k = 3
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, y_train)

# Make predictions
y_pred = knn.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Output:
A variant with visualizations (feature pairplot, confusion matrix, and a 2-D decision boundary):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Load dataset
data = load_iris()
X = data.data
y = data.target
feature_names = data.feature_names
class_names = data.target_names

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train a kNN classifier with k = 3
k = 3
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, y_train)

# Make predictions
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Visualizations
# 1. Pairplot of features
df = pd.DataFrame(X, columns=feature_names)
df['target'] = y
sns.pairplot(df, hue='target')
plt.show()

# 2. Confusion matrix heatmap
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', xticklabels=class_names, yticklabels=class_names)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.show()

# 3. Decision boundary using only the first two features
X_train_plot, X_test_plot, y_train_plot, y_test_plot = train_test_split(
    X[:, :2], y, test_size=0.3, random_state=42)
X_train_plot = scaler.fit_transform(X_train_plot)
X_test_plot = scaler.transform(X_test_plot)
knn_2d = KNeighborsClassifier(n_neighbors=k)
knn_2d.fit(X_train_plot, y_train_plot)

# Predict over a mesh grid covering the 2-D feature space
x_min, x_max = X_train_plot[:, 0].min() - 1, X_train_plot[:, 0].max() + 1
y_min, y_max = X_train_plot[:, 1].min() - 1, X_train_plot[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))
Z = knn_2d.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X_train_plot[:, 0], X_train_plot[:, 1], c=y_train_plot, edgecolor='k')
plt.xlabel(feature_names[0])
plt.ylabel(feature_names[1])
plt.title('kNN Decision Boundary (first two features)')
plt.show()
Output:
How kNN Works
kNN is a lazy, instance-based learner: it simply stores the training examples. To classify a new point, it computes the distance (commonly Euclidean) from that point to every stored example, selects the k nearest ones, and assigns the class that holds the majority among those neighbors. Because the method is distance-based, standardizing the features (as done above) prevents large-scale features from dominating the vote.
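To make the mechanics concrete, here is a minimal from-scratch sketch (illustrative only; the practicals above use scikit-learn's KNeighborsClassifier):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the query point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]   # indices of the k closest points
    votes = Counter(y_train[nearest])     # majority vote among the neighbors
    return votes.most_common(1)[0][0]

# Tiny worked example: two clusters, query point near the first one
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9])))  # prints 0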
Practical 3
Aim: 3. Implement a decision tree algorithm to classify email spam based on keywords and sender information.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score, classification_report

# Sample dataset (the feature values are illustrative)
data = {
    'contains_free': [1, 0, 1, 0, 1, 0, 1, 0],
    'contains_offer': [1, 0, 0, 1, 1, 0, 0, 1],
    'sender_known': [0, 1, 0, 1, 0, 1, 1, 0],
    'spam': [1, 0, 1, 0, 1, 0, 1, 0]
}

# Create a DataFrame
df = pd.DataFrame(data)
X = df[['contains_free', 'contains_offer', 'sender_known']]
y = df['spam']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

# Initialize the Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Human-readable decision rules
tree_rules = export_text(clf, feature_names=list(X.columns))
print(tree_rules)
Output:
Explanation of the Code
1. Dataset:
The dataset contains three features:
o contains_free: Indicates if the email contains the keyword "free" (1 for yes,
0 for no).
o contains_offer: Indicates if the email contains the keyword "offer" (1 for
yes, 0 for no).
o sender_known: Indicates if the sender is known (1 for yes, 0 for no).
The spam column is the target (1 for spam, 0 for not spam).
2. Splitting the Dataset:
The dataset is split into training and testing sets using train_test_split.
3. Model Training:
A decision tree classifier is initialized and trained on the training set.
4. Model Evaluation:
Predictions are made on the testing set, and the accuracy and classification report are
printed.
5. Visualizing the Tree:
The export_text function generates human-readable decision rules for the tree.
Output Example
Accuracy: 100.00%
Classification Report:
precision recall f1-score support
accuracy 1.00 4
macro avg 1.00 1.00 1.00 4
weighted avg 1.00 1.00 1.00 4
Notes:
1. Customization: You can replace the sample dataset with a larger, more realistic email dataset.
2. Feature Engineering: Add more features, such as the length of the email or the frequency of certain words.
3. Model Tuning: Adjust the parameters of DecisionTreeClassifier (e.g., max_depth, min_samples_split) for better performance; a sketch follows this list.
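A minimal tuning sketch using GridSearchCV (it assumes X_train and y_train from the code above; the candidate values and the small cv=2 are assumptions suited to this tiny dataset, so increase cv with real data):

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Candidate parameter values are illustrative, not prescribed by the lab
param_grid = {"max_depth": [2, 3, 5, None], "min_samples_split": [2, 4, 8]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=2)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)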
Practical 4
Aim: 4. Cluster customer data based on purchase history using k-means clustering.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Sample purchase-history data (illustrative values)
data = {
    'total_spent': [500, 1500, 300, 2500, 800, 2200, 400, 1800, 600, 2700],
    'frequency': [5, 15, 3, 25, 8, 20, 4, 18, 6, 28],
    'average_purchase_value': [100, 100, 100, 100, 100, 110, 100, 100, 100, 96]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Standardize the features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df)

# Cluster with k-means (3 clusters)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
clusters = kmeans.fit_predict(scaled_features)

# Reduce to 2-D with PCA for visualization
pca = PCA(n_components=2)
pca_features = pca.fit_transform(scaled_features)

plt.figure(figsize=(8, 6))
for cluster in np.unique(clusters):
    plt.scatter(
        pca_features[clusters == cluster, 0],
        pca_features[clusters == cluster, 1],
        label=f'Cluster {cluster}'
    )

# Add centroids to the plot
centroids = pca.transform(kmeans.cluster_centers_)
plt.scatter(centroids[:, 0], centroids[:, 1], s=200, c='black', marker='X', label='Centroids')
plt.title('Customer Clusters')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.legend()
plt.grid()
plt.show()
Output:
Key Components:
1. Dataset:
o total_spent: Total amount spent by a customer.
o frequency: Number of purchases.
o average_purchase_value: Average value of each purchase.
2. Standardization:
o Used StandardScaler to normalize features for better clustering
performance.
3. K-Means:
o Specified n_clusters=3 (can be optimized using the elbow method or
silhouette score).
4. Visualization:
o Used PCA for 2D visualization of high-dimensional data.
o Plotted clusters with their centroids.
Notes:
1. Elbow Method: Use the elbow method to determine the optimal number of clusters by plotting the inertia for different cluster counts (see the sketch after this list).
2. Feature Engineering: Add more features like customer lifetime value, recency, etc.,
for better clustering.
3. Real Data: Replace the simulated data with real customer data.
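A minimal elbow-method sketch, referenced in note 1 above (it assumes scaled_features, KMeans, and plt from the code in this practical):

# Plot inertia for 1 to 10 clusters; the "elbow" suggests a good cluster count
inertias = []
for n in range(1, 11):
    km = KMeans(n_clusters=n, random_state=42, n_init=10)
    km.fit(scaled_features)
    inertias.append(km.inertia_)

plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()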
Practical 5
Aim: 5. Predict future stock prices using a time series forecasting model (e.g., ARIMA).
import pandas as pd
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
import matplotlib.pyplot as plt

# Step 1: Load or simulate daily stock prices (a random walk is used here
# for illustration; replace with real historical prices)
np.random.seed(42)
dates = pd.date_range(start="2023-01-01", periods=100, freq="D")
prices = 100 + np.cumsum(np.random.normal(0, 1, 100))
stock_data = pd.DataFrame({"Stock_Price": prices}, index=dates)
print(stock_data.head())
Output:
Step 2: Fit an ARIMA Model
stock_prices = stock_data['Stock_Price']
model = ARIMA(stock_prices, order=(1, 1, 1)) # ARIMA parameters (p, d, q)
fitted_model = model.fit()
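A minimal forecasting-and-plotting sketch consistent with the numbered description below (the 10-day horizon and the styling choices are assumptions):

# Step 3: Forecast future prices and plot them with the historical data
forecast_steps = 10
forecast = fitted_model.forecast(steps=forecast_steps)
forecast_index = pd.date_range(start=stock_prices.index[-1] + pd.Timedelta(days=1),
                               periods=forecast_steps, freq="D")

plt.plot(stock_prices.index, stock_prices, label="Actual Prices")
plt.plot(forecast_index, forecast.values, marker="o", label="Forecasted Prices")
plt.xlabel("Date")
plt.ylabel("Stock Price")
plt.title("Stock Price Forecast (ARIMA)")
plt.legend()
plt.grid()
plt.show()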
1. Plot actual data: plt.plot displays the historical stock prices (stock_prices).
2. Create forecast dates: pd.date_range generates future dates starting from the day
after the last date in the dataset.
3. Plot forecast: Plots the predicted stock prices on the same graph.
4. Styling: Adds labels, title, legend, and grid for better visualization.
forecast_df = pd.DataFrame({"Date": forecast_index, "Forecasted Price": forecast.values})
print(forecast_df)
1. Create DataFrame: Combines the forecasted dates and predicted prices into a new
DataFrame.
2. Print results: Displays the forecasted values for better clarity.
Output
• Graph: Shows both the actual stock prices and the forecasted values with clear
markers.
• Table: Displays the forecasted stock prices in tabular form.
Practical 6
Aim: 6. Develop a sentiment analysis model to classify movie reviews as positive, negative, or neutral.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Sample dataset of 10 labeled reviews (several review texts are illustrative
# fill-ins; the sentiment labels follow the original dataset)
data = {
    "Review": [
        "The movie was fantastic! I loved the characters and the plot.",
        "Terrible film. The story made no sense and the acting was poor.",
        "The movie was okay, not too good, not too bad.",
        "Absolutely loved it! One of the best movies I've seen this year.",
        "Pretty average. Had some good moments but also some flaws.",
        "The plot was dull and the pacing was painfully slow.",
        "It was watchable, nothing more and nothing less.",
        "A delightful film with great performances all around.",
        "Poor acting and a weak script ruined the experience.",
        "Neither impressive nor disappointing overall."
    ],
    "Sentiment": [
        "Positive", "Negative", "Neutral", "Positive", "Neutral",
        "Negative", "Neutral", "Positive", "Negative", "Neutral"
    ]
}

# Convert to DataFrame
df = pd.DataFrame(data)
X = df['Review']
y = df['Sentiment']

# Split into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Build a text-classification pipeline: bag-of-words + Naive Bayes
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Predict sentiments for new, unseen reviews (illustrative examples)
new_reviews = [
    "An amazing story with brilliant acting.",
    "I did not enjoy this movie at all."
]
predictions = pipeline.predict(new_reviews)

# Display predictions
for review, sentiment in zip(new_reviews, predictions):
    print(f"{sentiment}: {review}")
Output:
Explanation
1. Dataset:
o A small dataset with 10 movie reviews labeled as Positive, Negative, or
Neutral.
2. Splitting Data:
o The dataset is split into training (70%) and testing (30%) sets to evaluate the
model's performance.
3. Pipeline:
o CountVectorizer: Converts text into a numerical format using word frequency (a small illustration follows this explanation).
o MultinomialNB: A Naive Bayes classifier, effective for text classification tasks.
4. Training:
o The model is trained on the training dataset using the pipeline.
5. Evaluation:
o The model is tested on unseen reviews (test set) using
classification_report.
6. Predictions:
o The trained model predicts sentiments for new reviews.
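To make the CountVectorizer step concrete, a tiny illustration (the two input strings are assumed examples, not part of the lab dataset):

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
counts = vec.fit_transform(["good movie", "bad movie"])
print(vec.get_feature_names_out())  # ['bad' 'good' 'movie']
print(counts.toarray())             # [[0 1 1] [1 0 1]]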
Practical 7
Aim: 7. Explore dimensionality reduction techniques like Principal Component Analysis (PCA) to visualize high-dimensional data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Generate a synthetic dataset: 10 features, 3 classes
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_redundant=2,
    n_classes=3, random_state=42
)

# Reduce to 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
pca_df = pd.DataFrame(X_pca, columns=["PCA1", "PCA2"])
pca_df["Target"] = y

# Scatter plot of the two components, colored by class
plt.figure(figsize=(10, 6))
sns.scatterplot(data=pca_df, x="PCA1", y="PCA2", hue="Target", palette="deep")
plt.legend(title="Class")
plt.grid()
plt.show()

# Proportion of variance explained by each component
explained_variance = pca.explained_variance_ratio_
print("Explained variance ratio:", explained_variance)
Explanation
1. Data Generation:
o make_classification creates a synthetic dataset with 10 features and 3
classes.
o n_informative and n_redundant specify the number of informative and
redundant features.
2. PCA Application:
o PCA(n_components=2) reduces the dimensionality to 2 components for easy
visualization.
o fit_transform applies PCA to the data.
3. Visualization:
o Seaborn scatterplot: Displays data points in a 2D space (PCA1 vs. PCA2)
colored by their class labels.
4. Variance Explained:
o explained_variance_ratio_: Provides the proportion of variance explained
by each principal component.
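A follow-up sketch on point 4: choosing the number of components by cumulative explained variance (the 95% threshold is an assumption; it reuses X and np from the code above):

pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_components = int(np.searchsorted(cumulative, 0.95)) + 1
print(f"Components needed for 95% of the variance: {n_components}")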
Practical 8
Aim: 8. Train a support vector machine (SVM) to classify data.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

# Synthetic two-class dataset: 4 features, 2 of them informative
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0,
    n_classes=2, random_state=42
)

# 70/30 train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a linear-kernel SVM
svm = SVC(kernel="linear", random_state=42)
svm.fit(X_train, y_train)

# Evaluate on the test set
y_pred = svm.predict(X_test)
print(classification_report(y_test, y_pred))

# Confusion matrix heatmap
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt="d")
plt.title("Confusion Matrix")
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.show()

# Decision boundary using the first two features only
svm_2d = SVC(kernel="linear", random_state=42).fit(X_train[:, :2], y_train)
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))
Z = svm_2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolor="k")
plt.title("SVM Decision Boundary (first two features)")
plt.show()
Output:
Explanation
1. Dataset Creation:
o make_classification: Generates a synthetic dataset with two classes and
four features.
o 2 features are informative, and the rest are noise.
2. Train-Test Split:
o Splits data into 70% training and 30% testing sets.
3. Training the SVM:
o SVC: Trains a Support Vector Machine with a linear kernel to classify data.
4. Predictions:
o Predictions are made on the testing data.
5. Evaluation:
o classification_report: Displays precision, recall, F1-score, and accuracy.
o Confusion matrix: Provides a visual representation of true and false
predictions.
6. Visualization:
o Decision boundaries are plotted for the first two features to demonstrate how
the SVM separates the classes.
Output
1. Classification Report:
o Precision, recall, F1-score, and accuracy for each class.
2. Confusion Matrix:
o A heatmap displaying the confusion matrix.
3. Decision Boundary Plot:
o Visualization of the decision boundaries and the training/testing points.
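A usage note going beyond the lab text (which uses a linear kernel): a non-linear boundary can be tried by swapping the kernel parameter, e.g. to an RBF kernel.

# Sketch: same data, RBF kernel instead of linear (assumes X_train etc. above)
svm_rbf = SVC(kernel="rbf", gamma="scale", random_state=42)
svm_rbf.fit(X_train, y_train)
print("RBF test accuracy:", svm_rbf.score(X_test, y_test))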