
Department of Artificial Intelligence & Data Science

Lab Manual



Computer Laboratory I: Machine Learning
(417521)

Prepared by
Mrs. Sonali Nawale

BE AI& DS
SEM –VII
Academic year-2023-24

Sr. No: Title of Experiment

1A. Feature Transformation: Use the PCA algorithm for dimensionality reduction. You have a dataset that includes measurements for different variables on wine (alcohol, ash, magnesium, and so on). Apply the PCA algorithm and transform this data so that most of the variation in the measurements of the variables is captured by a small number of principal components, making it easier to distinguish between red and white wine by inspecting these principal components.
Dataset link: https://media.geeksforgeeks.org/wp-content/uploads/Wine.csv

1B. Feature Transformation: Apply the LDA algorithm on the Iris dataset and classify which species a given flower belongs to.
Dataset link: https://www.kaggle.com/datasets/uciml/iris

2A. Regression Analysis: Predict the price of an Uber ride from a given pickup point to the agreed drop-off location. Perform the following tasks:
1. Pre-process the dataset.
2. Identify outliers.
3. Check the correlation.
4. Implement linear regression and ridge and lasso regression models.
5. Evaluate the models and compare their respective scores, such as R2, RMSE, etc.
Dataset link: https://www.kaggle.com/datasets/yasserh/uber-fares-dataset

2B. Regression Analysis: Use the diabetes data set from UCI and the Pima Indians Diabetes data set for performing the following:
a. Univariate analysis: frequency, mean, median, mode, variance, standard deviation, skewness and kurtosis
b. Bivariate analysis: linear and logistic regression modelling
c. Multiple regression analysis
d. Also compare the results of the above analysis for the two data sets
Dataset link: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database

3A. Classification Analysis: Implementation of Support Vector Machines (SVM) for classifying images of handwritten digits into their respective numerical classes (0 to 9).

3B. Classification Analysis: Implement the K-Nearest Neighbors algorithm on the social network ads dataset. Compute the confusion matrix, accuracy, error rate, precision and recall on the given dataset.
Dataset link: https://www.kaggle.com/datasets/rakeshrau/social-network-ads

4A. Clustering Analysis: Implement K-Means clustering on the Iris.csv dataset. Determine the number of clusters using the elbow method.
Dataset link: https://www.kaggle.com/datasets/uciml/iris

4B. Clustering Analysis: Implement the K-Medoid algorithm on a credit card dataset. Determine the number of clusters using the Silhouette Method.
Dataset link: https://www.kaggle.com/datasets/arjunbhasin2013/ccdata

5A. Ensemble Learning: Implement a Random Forest Classifier model to predict the safety of the car.
Dataset link: https://www.kaggle.com/datasets/elikplim/car-evaluation-data-set

5B. Ensemble Learning: Use different voting mechanisms and apply AdaBoost (Adaptive Boosting), Gradient Tree Boosting (GBM) and XGBoost classification on the Iris dataset, and compare the performance of the three models using different evaluation measures.
Dataset link: https://www.kaggle.com/datasets/uciml/iris

6A. Reinforcement Learning: Implement reinforcement learning using an example of a maze environment that the agent needs to explore.

6B. Reinforcement Learning: Solve the Taxi problem using reinforcement learning, where the agent acts as a taxi driver to pick up a passenger at one location and drop the passenger off at their destination.

6C. Reinforcement Learning: Build a Tic-Tac-Toe game using reinforcement learning in Python by performing the following tasks:
a. Setting up the environment
b. Defining the Tic-Tac-Toe game
c. Building the reinforcement learning model
d. Training the model
e. Testing the model

Assignment 1A:
Title of the Assignment: Feature Transformation:


To use PCA Algorithm for dimensionality reduction.


You have a dataset that includes measurements for different variables on wine (alcohol, ash,
magnesium, and so on). Apply PCA algorithm & transform this data so that most variations in
the measurements of the variables are captured by a small number of principal components so
that it is easier to distinguish between red and white wine by inspecting these principal
components.
Dataset Link: https://media.geeksforgeeks.org/wp-content/uploads/Wine.csv
Dataset Description: The dataset contains measurements of different variables for wine samples (alcohol, ash, magnesium, and so on). The aim is to capture most of the variation in these measurements with a small number of principal components so that it becomes easier to distinguish between red and white wine by inspecting those components.

Link for Dataset: https://media.geeksforgeeks.org/wp-content/uploads/Wine.csv

Objective of the Assignment: Students should be able to preprocess the dataset, identify outliers, and use the PCA algorithm to transform the data so that most of the variation in the measurements of the variables is captured by a small number of principal components.

Theory:
Dimensionality reduction is a way to reduce the complexity of a model and avoid overfitting. There are two main categories of dimensionality reduction: feature selection and feature extraction. Via feature selection, we select a subset of the original features, whereas in feature extraction, we derive information from the feature set to construct a new feature subspace.
Principal Component Analysis (PCA) is an algorithm used to compress a dataset onto a lower-dimensional feature subspace with the goal of retaining most of the relevant information. It is a feature-extraction technique that maps a higher-dimensional feature space to a lower-dimensional one. While reducing the number of dimensions, PCA ensures that the maximum information of the original dataset is retained in the reduced dataset and that the correlation between the newly obtained principal components is minimal. The new features obtained after applying PCA are called principal components and are denoted PCi (i = 1, 2, 3, ..., n). PC1 captures the maximum information of the original dataset, followed by PC2, then PC3, and so on.
A bar graph of the explained variance captured by each principal component is commonly used to decide how many components to keep (the explained variance is the amount of information captured by a principal component).


Steps to Apply PCA in Python for Dimensionality Reduction

Step-1: Import necessary libraries
Step-2: Load the dataset
Step-3: Standardize the features
Step-4: Check the correlation between features without PCA (optional)
Step-5: Apply Principal Component Analysis
Step-6: Check the correlation between features after PCA

A minimal code sketch of these steps is given below.
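
The following sketch uses pandas and scikit-learn. The label column name "Customer_Segment" is an assumption about the linked Wine.csv; adjust it to match the actual file.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the wine measurements (label column name is an assumption)
df = pd.read_csv("Wine.csv")
X = df.drop(columns=["Customer_Segment"])
y = df["Customer_Segment"]

# Standardize so every feature contributes equally to the components
X_scaled = StandardScaler().fit_transform(X)

# Keep two components for easy 2-D inspection of the classes
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print("Explained variance ratio:", pca.explained_variance_ratio_)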

Conclusion: In this way, we have explored the concept of feature transformation using the PCA algorithm for dimensionality reduction.


Assignment 1B:
Title of the Assignment: Feature Transformation:
Apply LDA Algorithm on Iris Dataset and classify which species a given flower belongs to.

Dataset Link: https://www.kaggle.com/datasets/uciml/iris

Dataset Description: The project involves using the Iris dataset, which includes measurements
of sepal length, sepal width, petal length, and petal width for three species of iris flowers. The
goal is to apply Linear Discriminant Analysis (LDA) to transform this data in a way that makes it
easier to classify a given flower into one of the three species based on its measurements.

Objective of the Assignment: Students should be able to preprocess the dataset, apply the LDA
algorithm for feature transformation, and build a classification model to classify iris flowers into
different species.

Theory:
Feature transformation is a critical step in preparing data for machine learning tasks. Linear
Discriminant Analysis (LDA) is a dimensionality reduction technique that aims to find a lower-
dimensional representation of the data while maximizing the separation between different
classes. In the case of the Iris dataset, LDA can help us transform the feature space so that the
species of iris flowers become more distinguishable.

Linear Discriminant Analysis (LDA):


LDA is a supervised dimensionality reduction technique that seeks to find a linear combination
of features that best separates multiple classes in the data. It aims to reduce the dimensionality of
the data while preserving as much class discrimination information as possible.

Steps to Apply LDA in Python for Feature Transformation and Classification:

Step 1: Import necessary libraries


Step 2: Load the Iris dataset
Step 3: Preprocess the data (e.g., handle missing values, encode categorical variables if any)
Step 4: Apply Linear Discriminant Analysis (LDA) to transform the features
Step 5: Split the data into training and testing sets
Step 6: Train a classification model (e.g., logistic regression, decision tree, or support vector
machine) on the transformed features
Step 7: Evaluate the model's performance using appropriate metrics (e.g., accuracy, precision,
recall)
Step 8: Visualize the results (e.g., scatter plots of LDA-transformed features)
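
As a minimal sketch of these steps, the snippet below uses scikit-learn's built-in copy of the Iris data (so it runs without downloading the Kaggle CSV), applies LDA, and trains a logistic regression classifier on the transformed features.

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# LDA can produce at most (n_classes - 1) = 2 components for the 3 species
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

X_train, X_test, y_train, y_test = train_test_split(
    X_lda, y, test_size=0.2, random_state=42, stratify=y)

clf = LogisticRegression().fit(X_train, y_train)
print("Accuracy on LDA-transformed features:",
      accuracy_score(y_test, clf.predict(X_test)))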


Conclusion: In this assignment, we have explored the concept of feature transformation using
the LDA algorithm for dimensionality reduction and classification. By transforming the Iris
dataset's features, we can build a classification model that can predict the species of iris flowers
based on their measurements. This demonstrates the power of feature transformation techniques
in improving the performance of machine learning models.


Assignment 2A:
Title of the Assignment: Regression Analysis:
Predict the price of the Uber ride from a given pickup point to the agreed drop-off location.
Perform following tasks:
1. Pre-process the dataset.
2. Identify outliers.
3. Check the correlation.
4. Implement linear regression and random forest regression models.
5. Evaluate the models and compare their respective scores like R2, RMSE, etc.

Dataset Description: The project is based on Uber Inc., one of the world's largest ride-hailing companies. In this project, we aim to predict the fare for future rides. Uber serves lakhs of customers daily, so it becomes really important to manage this data properly in order to come up with new business ideas and get the best results. Ultimately, it becomes really important to estimate fare prices accurately.

Link for Dataset: https://www.kaggle.com/datasets/yasserh/uber-fares-dataset

Objective of the Assignment: Students should be able to preprocess the dataset, identify outliers, check the correlation, and implement linear regression and random forest regression models, evaluating them with scores such as R2 and RMSE.

Theory:
Data Preprocessing: Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model. It is the first and most crucial step while creating a machine learning model. When creating a machine learning project, we do not always come across clean and formatted data, and before doing any operation with the data it is necessary to clean it and put it into a usable format. Why do we need data preprocessing? Real-world data generally contains noise and missing values, and may be in an unusable format that cannot be fed directly to machine learning models. Data preprocessing cleans the data and makes it suitable for a machine learning model, which also increases the model's accuracy and efficiency.

Linear Regression: Linear regression is one of the easiest and most popular machine learning algorithms. It is a statistical method used for predictive analysis. Linear regression makes predictions for continuous/real or numeric variables such as sales, salary, age, product price, etc. The algorithm models a linear relationship between a dependent variable (y) and one or more independent variables (x), hence the name linear regression. Since it models a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable. The linear regression model provides a sloped straight line representing the relationship between the variables.


Mean Squared Error:

The Mean Squared Error (MSE), or Mean Squared Deviation (MSD), of an estimator measures the average of the squared errors, i.e., the average squared difference between the estimated values and the true values. It is a risk function corresponding to the expected value of the squared error loss. It is always non-negative, and values close to zero are better. The MSE is the second moment of the error (about the origin) and thus incorporates both the variance of the estimator and its bias.
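
A minimal sketch of the model-building and evaluation steps is shown below. The feature and target column names ("fare_amount", the pickup/drop-off coordinates and "passenger_count") are assumptions based on the Kaggle Uber fares dataset; adjust them to the actual CSV.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

df = pd.read_csv("uber.csv").dropna()
features = ["pickup_longitude", "pickup_latitude",
            "dropoff_longitude", "dropoff_latitude", "passenger_count"]
X, y = df[features], df["fare_amount"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

for name, model in [("Linear regression", LinearRegression()),
                    ("Random forest", RandomForestRegressor(random_state=42))]:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    print(f"{name}: R2 = {r2_score(y_test, pred):.3f}, RMSE = {rmse:.3f}")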

Conclusion: In this way, we have explored the concept of correlation and implemented linear regression and random forest regression models.


Assignment 2B:
Title of the Assignment: Regression Analysis
Use the diabetes data set from UCI and Pima Indians Diabetes data set for performing the
following:
a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation,
Skewness, and Kurtosis
b. Bivariate analysis: Linear and logistic regression modelling
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets

Dataset link: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database

Dataset Description: The project involves working with two diabetes datasets: one from
the UCI Machine Learning Repository and the other the Pima Indians Diabetes Database.
The goal is to perform various regression analyses on these datasets, including univariate
analysis to understand the statistical properties of the data, bivariate analysis involving
linear and logistic regression modelling, and multiple regression analysis to predict
outcomes. Finally, the results of these analyses will be compared between the two datasets.

Objective of the Assignment: Students should be able to conduct comprehensive regression analysis on real-world datasets, including univariate and bivariate analyses, as well as multiple regression modelling. Additionally, they should be able to compare the results of these analyses between two different datasets.

Theory:
Univariate Analysis:
Univariate analysis is the process of analyzing a single variable or attribute at a time. It
involves computing summary statistics like frequency, mean, median, mode, variance,
standard deviation, skewness, and kurtosis for a single variable. This analysis helps in
understanding the distribution and characteristics of the data within that variable.
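
As a minimal sketch of the univariate analysis, the snippet below computes the required summary statistics for one column of the Pima Indians Diabetes data; the file name "diabetes.csv" and the column name "Glucose" are assumptions based on the Kaggle dataset.

import pandas as pd

df = pd.read_csv("diabetes.csv")
col = df["Glucose"]

print("Frequency:\n", col.value_counts().head())
print("Mean:", col.mean())
print("Median:", col.median())
print("Mode:", col.mode().iloc[0])
print("Variance:", col.var())
print("Standard deviation:", col.std())
print("Skewness:", col.skew())
print("Kurtosis:", col.kurtosis())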

Bivariate Analysis:


Bivariate analysis involves analyzing the relationship between two variables. In the context
of regression analysis, it includes linear and logistic regression modelling. Linear regression
is used when the dependent variable is continuous, while logistic regression is used when
the dependent variable is categorical.

Multiple Regression Analysis:


Multiple regression analysis extends linear regression by considering multiple independent
variables to predict the outcome. It helps in understanding how multiple predictors
collectively influence the dependent variable.

Linear Regression:
Linear regression is a statistical method for modelling the relationship between a dependent
variable and one or more independent variables by fitting a linear equation to the observed
data. It is commonly used for predicting numeric values.

Logistic Regression:
Logistic regression is a statistical method for modelling the probability of a binary outcome
by fitting a logistic curve to the observed data. It is used for binary classification problems.

Conclusion: In this assignment, we have explored various aspects of regression analysis, including univariate and bivariate analysis, as well as multiple regression modelling.


Assignment 3A:
Title of the Assignment: Classification Analysis:
Implementation of Support Vector Machines (SVM) for classifying images of handwritten digits into their respective numerical classes (0 to 9).

Dataset Description: Support Vector Machines (SVMs) are a type of supervised machine learning algorithm that can be used for classification and regression tasks; here we will focus on using SVMs for image classification.
When a computer processes an image, it perceives it as an array of pixels. The size of the array corresponds to the resolution of the image: for example, if the image is 200 pixels wide and 200 pixels tall, a colour image is stored as a 200 x 200 x 3 array. The first two dimensions represent the width and height of the image, respectively, while the third dimension represents the RGB colour channels. The values in the array can range from 0 to 255, which indicates the intensity of the pixel at each point.

Objective of the Assignment: Students should be able to learn classification analysis by implementing Support Vector Machines (SVM) for classifying images of handwritten digits.

Theory:
In order to classify an image using an SVM, we first need to extract features from the image.
These features can be the color values of the pixels, edge detection, or even the textures present
in the image. Once the features are extracted, we can use them as input for the SVM algorithm.
The SVM algorithm works by finding the hyperplane that separates the different classes in the
feature space. The key idea behind SVMs is to find the hyperplane that maximizes the margin,
which is the distance between the closest points of the different classes. The points that are
closest to the hyperplane are called support vectors.
One of the main advantages of using SVMs for image classification is that they can effectively
handle high-dimensional data, such as images. Additionally, SVMs are less prone to overfitting
than other algorithms such as neural networks.
A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression challenges. However, it is mostly used in classification problems, such as text classification. In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate. Then, we perform classification by finding the optimal hyperplane that differentiates the two classes as well as possible.
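
A minimal sketch using scikit-learn's built-in handwritten digits dataset (8x8 images, classes 0 to 9) is shown below; the flattened pixel intensities are used directly as features.

from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

digits = load_digits()
X = digits.images.reshape(len(digits.images), -1)  # flatten 8x8 -> 64 features
y = digits.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = SVC(kernel="rbf", gamma=0.001, C=10)
clf.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))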


Conclusion:
In this way, we have explored a Support Vector Machine (SVM) model to classify images of handwritten digits into their respective numerical classes. The best parameters for the SVM model were determined using GridSearchCV, and the model's accuracy was measured.


Assignment 3B:
Title of the Assignment: Classification Analysis:
Implement K-Nearest Neighbors (KNN) algorithm on social network ad dataset. Compute
confusion matrix, accuracy, error rate, precision, and recall on the given dataset.
Dataset link: https://www.kaggle.com/datasets/rakeshrau/social-network-ads
Dataset Description: In this project, we will work with a social network ad dataset that contains
information about users, including features such as age, gender, and estimated salary, as well as
whether a user clicked on a particular ad (0 for no, 1 for yes). The goal is to implement the K-
Nearest Neighbors (KNN) algorithm on this dataset to predict whether a user is likely to click on
the ad or not. We will compute various classification metrics, including confusion matrix,
accuracy, error rate, precision, and recall.

Objective of the Assignment: Students should be able to apply the K-Nearest Neighbors
algorithm to a real-world dataset for classification tasks. They should also learn how to evaluate
the performance of a classification model using key metrics.
Theory:
K-Nearest Neighbors (KNN) Algorithm:
K-Nearest Neighbors is a supervised machine learning algorithm used for classification and
regression tasks. In KNN, an object is classified by a majority vote of its neighbors, with the
object being assigned to the class that is most common among its K nearest neighbors (K is a
hyperparameter). KNN is a non-parametric and instance-based learning algorithm.

Confusion Matrix:


A confusion matrix is a table that is used to evaluate the performance of a classification model. It
shows the true positive, true negative, false positive, and false negative values, which are
essential for calculating various classification metrics.

Accuracy:
Accuracy measures the ratio of correctly predicted instances to the total instances in the dataset.
It provides a general measure of the model's performance.

Error Rate:
Error rate is the complement of accuracy and measures the ratio of incorrectly predicted
instances to the total instances. It provides the rate of misclassification.

Precision:
Precision is a metric that measures the accuracy of positive predictions. It is the ratio of true
positive predictions to the total positive predictions and helps in assessing the model's ability to
avoid false positives.

Recall:
Recall, also known as sensitivity or true positive rate, measures the ability of the model to
correctly identify positive instances. It is the ratio of true positive predictions to the total actual
positive instances.
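
The snippet below is a minimal sketch of the experiment; the column names "Age", "EstimatedSalary" and "Purchased" are assumptions based on the Kaggle Social_Network_Ads file.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score)

df = pd.read_csv("Social_Network_Ads.csv")
X = df[["Age", "EstimatedSalary"]]
y = df["Purchased"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Feature scaling matters for distance-based methods such as KNN
scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(scaler.transform(X_train), y_train)
y_pred = knn.predict(scaler.transform(X_test))

acc = accuracy_score(y_test, y_pred)
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("Accuracy:", acc, "Error rate:", 1 - acc)
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))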

Conclusion: In this assignment, we have explored the implementation of the K-Nearest Neighbors (KNN) algorithm on a social network ad dataset for classification. We have computed key classification metrics such as the confusion matrix, accuracy, error rate, precision, and recall to evaluate the model's performance in predicting user clicks on ads.


Assignment 4A:
Title of the Assignment: Clustering Analysis:
Implement K-Means clustering on Iris.csv dataset. Determine the number of clusters using the
elbow method.
Dataset Description: It has 150 entries with 1 dependent column and 4 feature columns.

Link for Dataset: https://www.kaggle.com/datasets/uciml/iris

Objective of the Assignment: Students should be able to implement K-Means clustering on the Iris.csv dataset and determine the number of clusters using the elbow method.

Theory:
K-Means clustering is an unsupervised learning algorithm that is used to solve clustering problems in machine learning and data science. In this topic, we will learn what the K-Means clustering algorithm is and how it works, along with a Python implementation of K-Means clustering.
"It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group of points with similar properties."

K Means Algorithm Working:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select random K points or centroids. (It can be other from the input dataset).

Step-3: Assign each data point to their closest centroid, which will form the predefined K
clusters.

Step-4: Calculate the variance and place a new centroid of each cluster.

Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of its cluster.

Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

Step-7: The model is ready.

K-Means is an unsupervised learning algorithm, so unlike supervised learning it does not require labelled data. It involves randomly initializing K cluster centroids and iteratively adjusting them until they stop moving. The elbow method is a popular way of choosing K for this algorithm. Let's go through the steps involved in K-Means clustering for a better understanding:


1. Select the number of clusters for the dataset (K)


2. Select the K number of centroids randomly from the dataset.
3. Now we will use Euclidean distance or Manhattan distance as the metric to calculate the
distance of the points from the nearest centroid and assign the points to that nearest cluster
centroid, thus creating K clusters.
4. Now we find the new centroid of the clusters thus formed.
5. Again, reassign the whole data point based on this new centroid, then repeat step 4. We
will continue this for a given number of iterations until the position of the centroid doesn’t
change, i.e., there is no more convergence.
Finding the optimal number of clusters is an important part of this algorithm. A commonly used
method for finding the optimum K value is Elbow Method.

K Means Clustering Using the Elbow Method


In the elbow method, we vary the number of clusters (K) from 1 to 10. For each value of K, we calculate the WCSS (Within-Cluster Sum of Squares), which is the sum of the squared distances between each point and the centroid of its cluster. When we plot WCSS against K, the plot looks like an elbow. As the number of clusters increases, the WCSS value decreases; it is largest when K = 1. When we analyze the graph, we can see that it changes rapidly at some point, creating an elbow shape, and from this point the curve moves almost parallel to the X-axis. The K value corresponding to this point is the optimal value of K, i.e., the optimal number of clusters.
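
A minimal sketch of K-Means with the elbow method is shown below; it uses scikit-learn's built-in copy of the Iris measurements so it runs without the Kaggle CSV.

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X, _ = load_iris(return_X_y=True)

# WCSS (inertia_) for K = 1..10
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()

# The elbow typically appears around K = 3 for the iris measurements
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)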

Conclusion:

In this way, we have studied the basic concepts of the K-Means clustering algorithm in machine learning. We used the elbow method to find a suitable K for clustering the data in our sample dataset.


Assignment 4B:
Title of the Assignment: Clustering Analysis:
Implement K-Medoid Algorithm on a credit card dataset. Determine the number of clusters using
the Silhouette Method.
Dataset link: https://www.kaggle.com/datasets/arjunbhasin2013/ccdata

Dataset Description: In this project, we will work with a credit card dataset that contains
information about credit card users, including various features such as credit limit, age, income,
and spending behavior. The goal is to implement the K-Medoid clustering algorithm on this
dataset to group users into clusters based on their credit card usage patterns. Additionally, we
will use the Silhouette Method to determine the optimal number of clusters.

Objective of the Assignment: Students should be able to apply clustering techniques to real-
world data, specifically using the K-Medoid algorithm. They should also learn how to evaluate
the quality of clustering using the Silhouette Method.
Theory:
K-Medoid Algorithm:
K-Medoid is a partitioning clustering algorithm that aims to divide a dataset into a pre-defined
number of clusters (K) where each cluster is represented by one data point, called the medoid.
Unlike K-Means, which uses the mean of data points as cluster representatives, K-Medoid uses
actual data points as representatives. This makes K-Medoid more robust to outliers and noise.

Silhouette Method:


The Silhouette Method is a technique to determine the optimal number of clusters for a given
dataset. It quantifies how similar an object is to its own cluster compared to other clusters. The
silhouette score ranges from -1 to 1, where a high value indicates that the object is well matched
to its own cluster and poorly matched to neighboring clusters. The goal is to find the number of
clusters that maximizes the silhouette score.

Clustering Evaluation:
Clustering evaluation helps measure the quality of clustering results. The Silhouette Method is
one such evaluation technique that provides a quantitative measure of the goodness of clustering.
Other evaluation metrics like the Davies-Bouldin Index and the Dunn Index can also be used.
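
The snippet below is a minimal sketch under two assumptions: the KMedoids implementation comes from the scikit-learn-extra package, and the credit card CSV has a CUST_ID identifier column (as in the Kaggle CC GENERAL.csv file); adjust the names to the actual data.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn_extra.cluster import KMedoids

df = pd.read_csv("CC GENERAL.csv").drop(columns=["CUST_ID"]).dropna()
X = StandardScaler().fit_transform(df)

# Try several K values and keep the one with the highest silhouette score
for k in range(2, 7):
    labels = KMedoids(n_clusters=k, random_state=42).fit_predict(X)
    print(k, "silhouette:", round(silhouette_score(X, labels), 3))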

Conclusion: In this assignment, we have explored the implementation of the K-Medoid clustering algorithm on a credit card dataset. We have also used the Silhouette Method to determine the optimal number of clusters for the data.


Assignment 5A:
Title of the Assignment: Ensemble Learning
Implement Random Forest Classifier model to predict the safety of the car.
Dataset link: https://www.kaggle.com/datasets/elikplim/car-evaluation-data-set

Dataset Description: In this project, we will work with a car evaluation dataset that
contains information about various features of cars, such as their price, maintenance cost,
number of doors, and safety rating. The goal is to implement the Random Forest Classifier
model on this dataset to predict the safety level of the car. The safety level is categorized as
"low," "medium," "high," and "very high."

Objective of the Assignment: Students should be able to apply ensemble learning techniques, specifically the Random Forest Classifier, to a real-world dataset for classification tasks. They should also learn how to evaluate the performance of an ensemble model.

Theory:
Random Forest Classifier:
Random Forest is an ensemble learning method used for classification and regression tasks.
It is an ensemble of decision trees, where multiple decision trees are trained on different
subsets of the data. The predictions from individual trees are combined to make the final
prediction. Random Forest is known for its high accuracy and resistance to overfitting.

Ensemble Learning:
Ensemble learning is a machine learning technique that combines the predictions of
multiple models to improve overall performance. Random Forest is an example of ensemble
learning, where multiple decision trees are combined to make more robust predictions.

Evaluation Metrics:


Evaluation metrics are used to measure the performance of classification models. Common
metrics include accuracy, precision, recall, and F1-score, among others. These metrics help
assess the model's ability to make correct predictions and handle different aspects of
classification performance.
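
A minimal sketch is given below. The column names follow the UCI car-evaluation description and the target is taken to be the final "class" column; both are assumptions about the Kaggle CSV, which may ship without a header row.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

cols = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]
df = pd.read_csv("car_evaluation.csv", names=cols)

# All attributes are categorical, so one-hot encode the features
X = pd.get_dummies(df.drop(columns=["class"]))
y = df["class"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, rf.predict(X_test)))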

Conclusion: In this assignment, we have explored the implementation of the Random Forest Classifier model on a car evaluation dataset to predict the safety level of cars. We have also discussed key ensemble learning concepts and evaluation metrics used to assess the model's performance.


Assignment 5B:
Title of the Assignment: Ensemble Learning
Use different voting mechanisms and apply AdaBoost (Adaptive Boosting), Gradient Tree Boosting (GBM) and XGBoost classification on the Iris dataset, and compare the performance of the three models using different evaluation measures.
Dataset Link: https://www.kaggle.com/datasets/uciml/iris

Dataset Description: In this project, we will work with the Iris dataset, a well-known
dataset in machine learning. The dataset contains information about different species of iris
flowers and their features, such as sepal length, sepal width, petal length, and petal width.
The goal is to apply three different ensemble learning techniques: AdaBoost, Gradient Tree
Boosting (GBM), and XGBoost, to classify iris flowers into their respective species. We
will compare the performance of these three models using various evaluation measures.

Objective of the Assignment: Students should be able to apply ensemble learning techniques to a classic dataset like Iris, compare the performance of different ensemble models, and use various evaluation metrics to assess their performance.

Theory:
Ensemble Learning:
Ensemble learning combines the predictions of multiple models to improve overall
performance. In this assignment, we will explore three ensemble techniques: AdaBoost,
GBM, and XGBoost, which are used for classification tasks.

AdaBoost (Adaptive Boosting):


AdaBoost is an ensemble learning method that focuses on improving the classification
performance by giving more weight to the misclassified data points in each iteration. It
combines multiple weak learners to create a strong classifier.

Gradient Tree Boosting (GBM):


Gradient Tree Boosting is another ensemble technique that builds an ensemble of decision
trees. It uses gradient descent optimization to minimize the loss function, gradually
improving the model's performance.

XGBoost Classification:
XGBoost (Extreme Gradient Boosting) is an advanced ensemble learning algorithm known
for its speed and performance. It combines the advantages of gradient boosting and
regularization techniques to create a powerful classifier.

Evaluation Metrics:
We will use various evaluation metrics, including accuracy, precision, recall, and F1-score,
to assess the performance of the three ensemble models. These metrics provide insights into
the models' ability to make correct predictions and handle different aspects of classification
performance.
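
The following is a minimal comparison sketch of the three boosting models on Iris; it assumes the xgboost package is installed alongside scikit-learn.

from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from xgboost import XGBClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

models = {
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "GBM": GradientBoostingClassifier(random_state=42),
    "XGBoost": XGBClassifier(eval_metric="mlogloss", random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"{name}: accuracy = {accuracy_score(y_test, pred):.3f}, "
          f"macro F1 = {f1_score(y_test, pred, average='macro'):.3f}")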

Conclusion: In this assignment, we will apply AdaBoost, GBM, and XGBoost classification models on the Iris dataset, compare their performance using various evaluation metrics, and assess their ability to classify iris flowers into different species.


Assignment 6A:
Title of the Assignment: Reinforcement Learning
Implement Reinforcement Learning using an example of a maze environment that the agent
needs to explore.

Assignment Description: In this project, we will delve into the field of Reinforcement
Learning (RL) by implementing RL techniques using a maze environment. The primary
objective is to create an RL agent capable of navigating and solving a maze problem
autonomously. The agent will learn to make decisions and take actions based on its
interactions with the maze environment, gradually improving its ability to navigate and
reach a predefined goal.

Objective of the Assignment: The assignment aims to introduce students to the fundamentals of Reinforcement Learning, specifically in the context of solving maze problems. Students will learn how RL agents can learn and make decisions through interaction with their environment.

Theory:
Reinforcement Learning (RL):
Reinforcement Learning is a type of machine learning where an agent interacts with an
environment and learns to make sequences of decisions to maximize a cumulative reward. It
involves the concept of an agent learning from its experiences, taking actions to achieve a
goal, and receiving feedback in the form of rewards or penalties.

Markov Decision Process (MDP):


MDP is a mathematical framework used to describe the RL problem. It defines the agent's
interactions with the environment as a sequence of states, actions, and rewards, where the
transition from one state to another depends on the agent's actions.


Agent, Environment, and State:


In RL, there are typically three main components: the agent (learner), the environment (with
which the agent interacts), and the state (representing the current situation of the
environment). The agent selects actions, the environment responds, and the state changes
accordingly.

Actions, Rewards, and Policies:


Actions represent the decisions made by the agent, rewards are the feedback given by the
environment to guide the agent, and policies are strategies or rules that the agent follows to
select actions.

Q-Learning and Value Iteration:


Q-Learning is a popular RL algorithm used to find optimal policies for decision-making in
an MDP. Value Iteration is another technique used to compute the expected cumulative
rewards for each state-action pair.

Exploration vs. Exploitation:


One of the key challenges in RL is the trade-off between exploration (trying new actions to
learn) and exploitation (choosing known actions for immediate reward). Balancing these
aspects is crucial for successful RL.


Learning from Experience (Episodes):


In RL, an agent typically learns from a series of episodes, where each episode consists of a
sequence of actions, states, and rewards. The agent uses these experiences to update its
policy and improve its decision-making.
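
The snippet below is a minimal tabular Q-learning sketch for a tiny 4x4 grid maze; the maze layout, rewards and hyperparameters are illustrative assumptions rather than a prescribed environment.

import numpy as np

n_rows, n_cols = 4, 4
goal = (3, 3)
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]        # up, down, left, right
Q = np.zeros((n_rows, n_cols, len(actions)))
alpha, gamma, epsilon = 0.1, 0.9, 0.2

def step(state, a):
    # Apply action a; bumping into a wall keeps the agent in place
    r, c = state
    dr, dc = actions[a]
    nr = min(max(r + dr, 0), n_rows - 1)
    nc = min(max(c + dc, 0), n_cols - 1)
    reward = 10 if (nr, nc) == goal else -1         # small step penalty
    return (nr, nc), reward, (nr, nc) == goal

for episode in range(500):
    state, done = (0, 0), False
    while not done:
        # Epsilon-greedy action selection (exploration vs. exploitation)
        if np.random.rand() < epsilon:
            a = np.random.randint(len(actions))
        else:
            a = int(np.argmax(Q[state]))
        next_state, reward, done = step(state, a)
        # Q-learning update
        Q[state][a] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state][a])
        state = next_state

print("Greedy action per cell:\n", np.argmax(Q, axis=2))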

Conclusion: This assignment will provide students with a practical understanding of Reinforcement Learning by implementing an RL agent to navigate and solve a maze environment.


Assignment 6B:
Title of the Assignment: Reinforcement Learning
Solve the Taxi problem using reinforcement learning where the agent acts as a taxi driver to
pick up a passenger at one location and then drop the passenger off at their destination.

Assignment Description: In this project, we will explore the field of Reinforcement Learning (RL) by solving the classic "Taxi Problem." The primary objective is to implement RL techniques to train an agent to act as a taxi driver. The agent will learn how to pick up a passenger at one location, navigate through a grid-world environment, and drop the passenger off at their specified destination. The assignment will demonstrate how RL can be applied to solve real-world problems involving decision-making and navigation.

Objective of the Assignment: The assignment aims to introduce students to Reinforcement Learning through a practical example, where they will learn how to formulate and solve a complex problem using RL techniques. By working on the Taxi problem, students will gain insights into RL concepts and how they can be applied to autonomous decision-making tasks.

Reinforcement Learning (RL):


Reinforcement Learning is a type of machine learning where an agent interacts with an
environment and learns to make sequences of decisions to maximize a cumulative reward. It
involves the concept of an agent learning from its experiences, taking actions to achieve a
goal, and receiving feedback in the form of rewards or penalties.

Markov Decision Process (MDP):


MDP is a mathematical framework used to describe the RL problem. It defines the agent's
interactions with the environment as a sequence of states, actions, and rewards, where the
transition from one state to another depends on the agent's actions.

Agent, Environment, and State:


In RL, there are typically three main components: the agent (learner), the environment (with which the agent interacts), and the state (representing the current situation of the environment). The agent selects actions, the environment responds, and the state changes accordingly.

Actions, Rewards, and Policies:


Actions represent the decisions made by the agent, rewards are the feedback given by the
environment to guide the agent, and policies are strategies or rules that the agent follows to
select actions.

Q-Learning and Value Iteration:


Q-Learning is a popular RL algorithm used to find optimal policies for decision-making in
an MDP. Value Iteration is another technique used to compute the expected cumulative
rewards for each state-action pair.

Exploration vs. Exploitation:


One of the key challenges in RL is the trade-off between exploration (trying new actions to
learn) and exploitation (choosing known actions for immediate reward). Balancing these
aspects is crucial for successful RL.

Learning from Experience (Episodes):


In RL, an agent typically learns from a series of episodes, where each episode consists of a
sequence of actions, states, and rewards. The agent uses these experiences to update its
policy and improve its decision-making.

Python Libraries:
Python libraries like NumPy are often used for implementing RL algorithms due to their
efficiency in numerical operations.
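
As a minimal sketch, the snippet below trains a tabular Q-learning agent on the "Taxi-v3" environment; it assumes the gymnasium package (the maintained successor of gym) is installed, and the hyperparameters are illustrative.

import numpy as np
import gymnasium as gym

env = gym.make("Taxi-v3")
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy policy
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Q-learning update
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state])
                                     - Q[state, action])
        state = next_state

print("Training finished; the greedy policy is stored in the Q-table.")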

Conclusion: This assignment will provide students with practical experience in Reinforcement Learning by solving the Taxi problem. They will learn how to train an RL agent to make decisions and navigate through a grid-world environment to pick up and drop off passengers. Through this project, students will gain valuable insights into RL concepts and problem-solving using RL techniques.


Assignment 6C:
Title of the Assignment: Reinforcement Learning
Build a Tic-Tac-Toe game using reinforcement learning in Python by using the following
tasks.
a. Setting up the environment
b. Defining the Tic-Tac-Toe game
c. Building the reinforcement learning model
d. Training the model
e. Testing the model

Assignment Description: In this project, we will explore the field of Reinforcement Learning (RL) by building a Tic-Tac-Toe game in Python and training an RL agent to play the game optimally. The primary objective is to understand how RL can be applied to create intelligent agents capable of learning and making strategic decisions.

Objective of the Assignment: The assignment aims to introduce students to Reinforcement Learning through hands-on experience with building a game environment and training an RL agent to play Tic-Tac-Toe. Students will learn the essential steps involved in setting up the RL environment, defining the game rules, developing the RL model, training the agent, and testing its performance.

Setting up the Environment:


Creating the RL environment involves defining the game board, specifying the rules, and
establishing the interaction between the RL agent and the environment.

Defining the Tic-Tac-Toe Game:


In this step, the rules of Tic-Tac-Toe are defined, including how players make moves, win
the game, or result in a draw.

Building the Reinforcement Learning Model:


The RL model includes the agent, states, actions, rewards, and policies. The agent learns to
make decisions by interacting with the environment and optimizing its actions.

Training the Model:


Training the RL agent involves running episodes of the game, where the agent learns to
make optimal moves by receiving rewards and updating its policies based on its
experiences.

Testing the Model:


After training, the RL agent's performance is evaluated by playing against it or assessing its
ability to make strategic decisions in the game.

Reinforcement Learning (RL):


Reinforcement Learning is a machine learning paradigm where an agent learns to make sequences of decisions to maximize a cumulative reward. It is often used for autonomous decision-making tasks.

Markov Decision Process (MDP):


MDP is a mathematical framework used to describe the RL problem. It defines the agent's
interactions with the environment as a sequence of states, actions, and rewards.

Agent, Environment, and State:


In RL, there are typically three main components: the agent (learner), the environment (with
which the agent interacts), and the state (representing the current situation of the
environment).

Actions, Rewards, and Policies:


Actions represent the decisions made by the agent, rewards are the feedback given by the
environment, and policies are strategies followed by the agent to select actions.

Q-Learning and Value Iteration:


Q-Learning is an RL algorithm used to find optimal policies for decision-making. Value
Iteration computes the expected cumulative rewards for state-action pairs.

Exploration vs. Exploitation:


Balancing exploration (trying new actions to learn) and exploitation (choosing known
actions for immediate reward) is a critical aspect of RL.

Learning from Experience (Episodes):


In RL, an agent learns from a series of episodes, updating its policy based on its experiences
to improve its decision-making.


Python Libraries:
Python libraries like NumPy are commonly used for implementing RL algorithms due to
their efficiency in numerical operations.

Tic-Tac-Toe Game Rules:


The rules of Tic-Tac-Toe include having two players, X and O, taking turns to place their
symbols on a 3x3 grid. The first player to form a line (horizontally, vertically, or
diagonally) wins, or the game ends in a draw if the grid is filled.
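
The compact sketch below covers tasks (a) to (e): it defines the board and rules, trains a tabular Q-learning agent playing X against a random O opponent, and then tests one greedy game. The board encoding, rewards and hyperparameters are illustrative assumptions.

import random
from collections import defaultdict

WINS = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
        (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in WINS:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return "draw" if " " not in board else None

def moves(board):
    return [i for i, cell in enumerate(board) if cell == " "]

Q = defaultdict(float)                    # (board string, move) -> value
alpha, gamma, epsilon = 0.3, 0.9, 0.1

def choose(board, greedy=False):
    if not greedy and random.random() < epsilon:
        return random.choice(moves(board))
    return max(moves(board), key=lambda m: Q[("".join(board), m)])

def train(episodes=20000):
    for _ in range(episodes):
        board = [" "] * 9
        while True:
            m = choose(board)             # the learning agent plays X
            state = "".join(board)
            board[m] = "X"
            result = winner(board)
            if result is None:            # the opponent O replies at random
                board[random.choice(moves(board))] = "O"
                result = winner(board)
            reward = {None: 0, "X": 1, "O": -1, "draw": 0.5}[result]
            target = reward if result else gamma * max(
                Q[("".join(board), n)] for n in moves(board))
            Q[(state, m)] += alpha * (target - Q[(state, m)])
            if result:
                break

train()
# Test: play one greedy game against a random opponent
board = [" "] * 9
while winner(board) is None:
    board[choose(board, greedy=True)] = "X"
    if winner(board) is None:
        board[random.choice(moves(board))] = "O"
print("Result of a greedy test game:", winner(board))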

Conclusion: This assignment will provide students with practical experience in Reinforcement Learning by building a Tic-Tac-Toe game and training an RL agent to play strategically. Students will gain valuable insights into RL concepts and problem-solving using RL techniques in a game environment.


Department of Artificial Intelligence & Data Science

Lab Manual



Computer Laboratory I: Data Modeling and
Visualization
(417522)

Prepared by
Mrs. Tejasvi Jadhav

BE AI& DS
SEM –VII
Academic year-2023-24


Sr. No: Title of Experiment

7. Data Loading, Storage and File Formats
Problem Statement: Analyzing Sales Data from Multiple File Formats
Dataset: Sales data in multiple file formats (e.g., CSV, Excel, JSON)
Description: The goal is to load and analyze sales data from different file formats, including CSV, Excel, and JSON, and perform data cleaning, transformation, and analysis on the dataset.
Tasks to Perform:
Obtain sales data files in various formats, such as CSV, Excel, and JSON.
1. Load the sales data from each file format into the appropriate data structures or DataFrames.
2. Explore the structure and content of the loaded data, identifying any inconsistencies, missing values, or data quality issues.
3. Perform data cleaning operations, such as handling missing values, removing duplicates, or correcting inconsistencies.
4. Convert the data into a unified format, such as a common DataFrame or data structure, to enable seamless analysis.
5. Perform data transformation tasks, such as merging multiple datasets, splitting columns, or deriving new variables.
6. Analyze the sales data by performing descriptive statistics, aggregating data by specific variables, or calculating metrics such as total sales, average order value, or product category distribution.
7. Create visualizations, such as bar plots, pie charts, or box plots, to represent the sales data and gain insights into sales trends, customer behavior, or product performance.

8. Interacting with Web APIs
Problem Statement: Analyzing Weather Data from the OpenWeatherMap API
Dataset: Weather data retrieved from the OpenWeatherMap API.
Description: The goal is to interact with the OpenWeatherMap API to retrieve weather data for a specific location and perform data modeling and visualization to analyze weather patterns over time.
Tasks to Perform:
1. Register and obtain an API key from OpenWeatherMap.
2. Interact with the OpenWeatherMap API using the API key to retrieve weather data for a specific location.
3. Extract relevant weather attributes such as temperature, humidity, wind speed, and precipitation from the API response.
4. Clean and preprocess the retrieved data, handling missing values or inconsistent formats.
5. Perform data modeling to analyze weather patterns, such as calculating average temperature, maximum/minimum values, or trends over time.
6. Visualize the weather data using appropriate plots, such as line charts, bar plots, or scatter plots, to represent temperature changes, precipitation levels, or wind speed variations.
7. Apply data aggregation techniques to summarize weather statistics by specific time periods (e.g., daily, monthly, seasonal).
8. Incorporate geographical information, if available, to create maps or geospatial visualizations representing weather patterns across different locations.
9. Explore and visualize relationships between weather attributes, such as temperature and humidity, using correlation plots or heatmaps.

9. Data Cleaning and Preparation
Problem Statement: Analyzing Customer Churn in a Telecommunications Company
Dataset: "Telecom_Customer_Churn.csv"
Description: The dataset contains information about customers of a telecommunications company and whether they have churned (i.e., discontinued their services). The dataset includes various attributes of the customers, such as their demographics, usage patterns, and account information. The goal is to perform data cleaning and preparation to gain insights into the factors that contribute to customer churn.
Tasks to Perform:
1. Import the "Telecom_Customer_Churn.csv" dataset.
2. Explore the dataset to understand its structure and content.
3. Handle missing values in the dataset, deciding on an appropriate strategy.
4. Remove any duplicate records from the dataset.
5. Check for inconsistent data, such as inconsistent formatting or spelling variations, and standardize it.
6. Convert columns to the correct data types as needed.
7. Identify and handle outliers in the data.
8. Perform feature engineering, creating new features that may be relevant to predicting customer churn.
9. Normalize or scale the data if necessary.
10. Split the dataset into training and testing sets for further analysis.
11. Export the cleaned dataset for future analysis or modeling.

10. Data Wrangling
Problem Statement: Data Wrangling on the Real Estate Market
Dataset: "RealEstate_Prices.csv"
Description: The dataset contains information about housing prices in a specific real estate market. It includes various attributes such as property characteristics, location, sale prices, and other relevant features. The goal is to perform data wrangling to gain insights into the factors influencing housing prices and prepare the dataset for further analysis or modeling.
Tasks to Perform:
1. Import the "RealEstate_Prices.csv" dataset. Clean column names by removing spaces, special characters, or renaming them for clarity.
2. Handle missing values in the dataset, deciding on an appropriate strategy (e.g., imputation or removal).
3. Perform data merging if additional datasets with relevant information are available (e.g., neighborhood demographics or nearby amenities).
4. Filter and subset the data based on specific criteria, such as a particular time period, property type, or location.
5. Handle categorical variables by encoding them appropriately (e.g., one-hot encoding or label encoding) for further analysis.
6. Aggregate the data to calculate summary statistics or derived metrics such as average sale prices by neighborhood or property type.
7. Identify and handle outliers or extreme values in the data that may affect the analysis or modeling process.

11. Data Visualization using matplotlib
Problem Statement: Analyzing Air Quality Index (AQI) Trends in a City
Dataset: "City_Air_Quality.csv"
Description: The dataset contains information about air quality measurements in a specific city over a period of time. It includes attributes such as date, time, pollutant levels (e.g., PM2.5, PM10, CO), and the Air Quality Index (AQI) values. The goal is to use the matplotlib library to create visualizations that effectively represent the AQI trends and patterns for different pollutants in the city.
Tasks to Perform:
1. Import the "City_Air_Quality.csv" dataset.
2. Explore the dataset to understand its structure and content.
3. Identify the relevant variables for visualizing AQI trends, such as date, pollutant levels, and AQI values.
4. Create line plots or time series plots to visualize the overall AQI trend over time.
5. Plot individual pollutant levels (e.g., PM2.5, PM10, CO) on separate line plots to visualize their trends over time.
6. Use bar plots or stacked bar plots to compare the AQI values across different dates or time periods.
7. Create box plots or violin plots to analyze the distribution of AQI values for different pollutant categories.
8. Use scatter plots or bubble charts to explore the relationship between AQI values and pollutant levels.
9. Customize the visualizations by adding labels, titles, legends, and appropriate color schemes.

12. Data Aggregation
Problem Statement: Analyzing Sales Performance by Region in a Retail Company
Dataset: "Retail_Sales_Data.csv"
Description: The dataset contains information about sales transactions in a retail company. It includes attributes such as transaction date, product category, quantity sold, and sales amount. The goal is to perform data aggregation to analyze the sales performance by region and identify the top-performing regions.
Tasks to Perform:
1. Import the "Retail_Sales_Data.csv" dataset.
2. Explore the dataset to understand its structure and content.
3. Identify the relevant variables for aggregating sales data, such as region, sales amount, and product category.
4. Group the sales data by region and calculate the total sales amount for each region.
5. Create bar plots or pie charts to visualize the sales distribution by region.
6. Identify the top-performing regions based on the highest sales amount.
7. Group the sales data by region and product category to calculate the total sales amount for each combination.
8. Create stacked bar plots or grouped bar plots to compare the sales amounts across different regions and product categories.

13. Time Series Data Analysis
Problem Statement: Analysis and Visualization of Stock Market Data
Dataset: "Stock_Prices.csv"
Description: The dataset contains historical stock price data for a particular company over a period of time. It includes attributes such as date, closing price, volume, and other relevant features. The goal is to perform time series data analysis on the stock price data to identify trends, patterns, and potential predictors, as well as build models to forecast future stock prices.
Tasks to Perform:
1. Import the "Stock_Prices.csv" dataset.
2. Explore the dataset to understand its structure and content.
3. Ensure that the date column is in the appropriate format (e.g., datetime) for time series analysis.
4. Plot line charts or time series plots to visualize the historical stock price trends over time.
5. Calculate and plot moving averages or rolling averages to identify the underlying trends and smooth out noise.
6. Perform seasonality analysis to identify periodic patterns in the stock prices, such as weekly, monthly, or yearly fluctuations.
7. Analyze and plot the correlation between the stock prices and other variables, such as trading volume or market indices.
8. Use autoregressive integrated moving average (ARIMA) models or exponential smoothing models to forecast future stock prices.

Assignment 7:

Title of the Assignment: Data Loading, Storage and File Formats


Problem Statement: Analyzing Sales Data from Multiple File Formats
Dataset: Sales data in multiple file formats (e.g., CSV, Excel, JSON)
Description: The goal is to load and analyze sales data from different file formats, including
CSV, Excel, and JSON, and perform data cleaning, transformation, and analysis on the
dataset.
Tasks to Perform:
Obtain sales data files in various formats, such as CSV, Excel, and JSON.
1. Load the sales data from each file format into the appropriate data structures or DataFrames.
2. Explore the structure and content of the loaded data, identifying any inconsistencies, missing
values, or data quality issues.
3. Perform data cleaning operations, such as handling missing values, removing duplicates, or
correcting inconsistencies.
4. Convert the data into a unified format, such as a common DataFrame or data structure, to
enable seamless analysis.
5. Perform data transformation tasks, such as merging multiple datasets, splitting columns, or
deriving new variables.
6. Analyze the sales data by performing descriptive statistics, aggregating data by specific
variables, or calculating metrics such as total sales, average order value, or product category
distribution.
7. Create visualizations, such as bar plots, pie charts, or box plots, to represent the sales data
and gain insights into sales trends, customer behaviour, or product performance.

Objective of the Assignment: Students should be able to load data from various file formats,
clean and preprocess the data, and perform exploratory data analysis (EDA) to gain insights.

Data loading, storage, and file formats play a crucial role in data analysis. Different sources and
file formats require specific methods for loading and processing. In this assignment, we will
explore the following steps:

Step-1: Import necessary libraries


Before starting with data loading and analysis, we need to import the required Python libraries,
such as pandas, NumPy, and matplotlib, to facilitate the process.

Step-2: Load data from different file formats


In this step, we will load sales data from various file formats, including CSV, Excel, and JSON.
Each format may require specific methods and libraries for loading.
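
A minimal sketch of this step with pandas, assuming placeholder file names sales_data.csv, sales_data.xlsx, and sales_data.json (the assignment does not fix the actual file names):

```python
import pandas as pd

# Placeholder file names -- replace them with the actual sales files you obtained.
csv_df = pd.read_csv("sales_data.csv")
excel_df = pd.read_excel("sales_data.xlsx")   # requires the openpyxl engine for .xlsx
json_df = pd.read_json("sales_data.json")

# Quick structural check of each loaded DataFrame.
for name, df in [("CSV", csv_df), ("Excel", excel_df), ("JSON", json_df)]:
    print(name, df.shape)
    print(df.head(), "\n")
```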

Step-3: Data Cleaning and Preprocessing

Data from different sources can be messy and may contain missing values, duplicates, or
inconsistent data. Data cleaning is essential to prepare the dataset for analysis. We will cover
techniques for data cleaning and preprocessing.
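
A hedged sketch of typical cleaning operations; the column names quantity, sales_amount, and category are assumptions and must be adapted to the real files:

```python
import pandas as pd

# Placeholder file and column names -- adapt them to the real dataset.
sales = pd.read_csv("sales_data.csv")

# Remove exact duplicate transactions.
sales = sales.drop_duplicates()

# Coerce quantity to numeric and impute missing values with the median.
sales["quantity"] = pd.to_numeric(sales["quantity"], errors="coerce")
sales["quantity"] = sales["quantity"].fillna(sales["quantity"].median())

# Drop rows that still lack an essential field.
sales = sales.dropna(subset=["sales_amount"])

# Standardize inconsistent text values, e.g. product category casing and spacing.
sales["category"] = sales["category"].str.strip().str.title()

sales.info()
```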

Step-4: Exploratory Data Analysis (EDA)


EDA is a crucial step in understanding the dataset. It involves tasks like summary statistics, data
visualization, and identifying patterns or trends in the data. We will use Python libraries like
matplotlib for data visualization.

Step-5: Analysis and Insights


After performing EDA, we will extract valuable insights from the sales data. This may include
identifying top-selling products, sales trends, or customer behaviour patterns.

Conclusion: In this assignment, we have explored the fundamentals of data loading, storage, and
file formats. We have learned how to load and analyze sales data from various file formats,
perform data cleaning, transformation, and conduct exploratory data analysis. This knowledge is
essential for data analysts and data scientists working with diverse datasets from different
sources and formats.

Assignment 8:
Title of the Assignment: Interacting with Web APIs
Problem Statement: Analyzing Weather Data from OpenWeatherMap API
Dataset: Weather data retrieved from OpenWeatherMap API
Description: The goal is to interact with the OpenWeatherMap API to retrieve weather data
for a specific location and perform data modelling and visualization to analyze weather patterns
over time.
Tasks to Perform:
1. Register and obtain API key from OpenWeatherMap.
2. Interact with the OpenWeatherMap API using the API key to retrieve weather data for a
specific location.
3. Extract relevant weather attributes such as temperature, humidity, wind speed, and
precipitation from the API response.
4. Clean and preprocess the retrieved data, handling missing values or inconsistent formats.
5. Perform data modelling to analyze weather patterns, such as calculating average temperature,
maximum/minimum values, or trends over time.
6. Visualize the weather data using appropriate plots, such as line charts, bar plots, or scatter
plots, to represent temperature changes, precipitation levels, or wind speed
variations.
7. Apply data aggregation techniques to summarize weather statistics by specific time periods
(e.g., daily, monthly, seasonal).
8. Incorporate geographical information, if available, to create maps or geospatial visualizations
representing weather patterns across different locations.
9. Explore and visualize relationships between weather attributes, such as temperature and
humidity, using correlation plots or heatmaps.

Objective of the Assignment: Students should be able to access and retrieve data from a web
API, clean and preprocess the data, and perform data modelling and visualization to analyze
weather patterns.

Web APIs (Application Programming Interfaces) provide a valuable way to access data from
various sources, including weather data from services like OpenWeatherMap. In this assignment,
we will follow these steps:

Step-1: Introduction to Web APIs


Understand the concept of web APIs and how they enable data retrieval from external sources.
Explore the OpenWeatherMap API and its capabilities for accessing weather data.

Step-2: Making HTTP Requests in Python

Learn how to use Python to make HTTP requests to the OpenWeatherMap API. We'll cover
different HTTP methods and how to pass parameters to retrieve specific weather data for a
location.
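
A minimal sketch of such a request, assuming the OpenWeatherMap current-weather endpoint; the API key and city are placeholders:

```python
import requests

API_KEY = "YOUR_API_KEY"      # placeholder -- use the key obtained in task 1
CITY = "Pune"                 # any location of interest

url = "https://api.openweathermap.org/data/2.5/weather"
params = {"q": CITY, "appid": API_KEY, "units": "metric"}

response = requests.get(url, params=params, timeout=10)
response.raise_for_status()   # fail loudly on HTTP errors (e.g. bad key or city)
data = response.json()

# The exact JSON layout is defined by OpenWeatherMap and may change over time.
print("Temperature (°C):", data["main"]["temp"])
print("Humidity (%):   ", data["main"]["humidity"])
print("Wind speed (m/s):", data["wind"]["speed"])
```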

Step-3: Data Retrieval from OpenWeatherMap API


Retrieve weather data for a specified location using the OpenWeatherMap API. This data may
include temperature, humidity, wind speed, and more.

Step-4: Data Cleaning and Preprocessing


Clean and preprocess the retrieved weather data. This may involve handling missing values, data
type conversions, and removing outliers.

Step-5: Data Modelling for Weather Analysis


Perform data modelling to analyze weather patterns over time. This could include tasks like
calculating monthly averages, identifying temperature trends, or understanding the impact of
weather conditions on local events.
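
One possible sketch of this modelling step, assuming the retrieved observations have already been accumulated into a file named weather_observations.csv with placeholder columns timestamp and temperature (the assignment does not prescribe how the observations are stored):

```python
import pandas as pd

# Assumed: repeated API responses were appended to 'weather_observations.csv'
# with placeholder columns 'timestamp' and 'temperature'.
weather = pd.read_csv("weather_observations.csv", parse_dates=["timestamp"])
weather = weather.set_index("timestamp").sort_index()

# Daily summary statistics plus a 7-day rolling mean as a simple trend estimate.
daily = weather["temperature"].resample("D").agg(["mean", "min", "max"])
daily["rolling_7d_mean"] = daily["mean"].rolling(window=7).mean()
print(daily.head(10))
```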

Step-6: Data Visualization with Python Libraries


Visualize the analyzed weather data using Python libraries such as matplotlib. Create plots and
charts to present weather patterns effectively.

Conclusion: In this assignment, we have explored the process of interacting with web APIs,
specifically the OpenWeatherMap API, to retrieve and analyze weather data. Students have
learned how to make HTTP requests, clean and preprocess data, perform data modelling, and
create data visualizations to gain insights into local weather conditions. This knowledge is
valuable for data analysts and scientists working with external data sources to derive meaningful
conclusions from real-world data.

Assignment 9:
Title of the Assignment: Data Cleaning and Preparation
Problem Statement: Analyzing Customer Churn in a Telecommunications Company
Dataset: "Telecom_Customer_Churn.csv"
Description: The dataset contains information about customers of a telecommunications
company and whether they have churned (i.e., discontinued their services). The dataset
includes various attributes of the customers, such as their demographics, usage patterns, and
account information. The goal is to perform data cleaning and preparation to gain insights
into the factors that contribute to customer churn.
Tasks to Perform:
1. Import the "Telecom_Customer_Churn.csv" dataset.
2. Explore the dataset to understand its structure and content.
3. Handle missing values in the dataset, deciding on an appropriate strategy.
4. Remove any duplicate records from the dataset.
5. Check for inconsistent data, such as inconsistent formatting or spelling variations, and
standardize it.
6. Convert columns to the correct data types as needed.
7. Identify and handle outliers in the data.
8. Perform feature engineering, creating new features that may be relevant to predicting customer
churn.
9. Normalize or scale the data if necessary.
10. Split the dataset into training and testing sets for further analysis.
11. Export the cleaned dataset for future analysis or modelling.
Objective of the Assignment: Students should be able to clean and prepare data for analysis,
including handling missing values, handling outliers, and performing feature engineering to
extract insights into customer churn.

Customer churn analysis is crucial for businesses, especially in the telecommunications industry.
It helps in understanding why customers leave and what factors influence their decision. This
assignment will guide students through the following steps:

Step-1: Introduction to Customer Churn Analysis


Understand the importance of analyzing customer churn and its relevance in the
telecommunications industry. Define the problem and objectives.

Step-2: Data Cleaning and Preprocessing


Clean and preprocess the dataset to ensure it is suitable for analysis. This includes handling
missing values, ensuring data consistency, and preparing it for further analysis.

Step-3: Handling Missing Values

Identify and address missing values in the dataset. Techniques such as imputation or removal of
missing data will be discussed.
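
A hedged sketch of one simple imputation strategy (median for numeric columns, mode for categorical ones); no specific column schema is assumed:

```python
import pandas as pd

churn = pd.read_csv("Telecom_Customer_Churn.csv")

# Inspect how much is missing per column before choosing a strategy.
print(churn.isna().sum())

# Numeric columns: impute with the median; categorical columns: impute with the mode.
for col in churn.select_dtypes(include="number").columns:
    churn[col] = churn[col].fillna(churn[col].median())
for col in churn.select_dtypes(include="object").columns:
    churn[col] = churn[col].fillna(churn[col].mode()[0])
```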

Step-4: Outlier Detection and Treatment


Identify and handle outliers in the data. Outliers can significantly impact the results of the
analysis and should be treated appropriately.
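
A sketch of the IQR rule for flagging and capping outliers; MonthlyCharges is a placeholder column name:

```python
import pandas as pd

churn = pd.read_csv("Telecom_Customer_Churn.csv")

# 'MonthlyCharges' is a placeholder column name -- use any numeric attribute.
col = "MonthlyCharges"
q1, q3 = churn[col].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = churn[(churn[col] < lower) | (churn[col] > upper)]
print(f"{len(outliers)} potential outliers in {col}")

# One treatment option: cap (winsorize) extreme values instead of dropping rows.
churn[col] = churn[col].clip(lower, upper)
```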

Step-5: Feature Engineering


Create new features or transform existing ones to extract insights related to customer churn.
Feature engineering can enhance the model's predictive power.

Step-6: Data Visualization for Churn Analysis


Visualize the cleaned and prepared data to identify patterns and relationships that may indicate
why customers churn. Use data visualization libraries to create informative charts and graphs.

Step-7: Drawing Conclusions


Summarize the findings from the data analysis and draw conclusions regarding the factors that
contribute to customer churn. Provide actionable insights that can be used by the
telecommunications company to reduce churn.

Conclusion: In this assignment, we have focused on data cleaning and preparation for customer
churn analysis in a telecommunications company. By following these steps, students have
learned how to clean data, handle missing values and outliers, engineer features, and visualize
the data to gain insights into customer behavior. This knowledge is essential for businesses
seeking to reduce churn and improve customer retention.

Assignment 10:
Title of the Assignment: Data Wrangling
Problem Statement: Data Wrangling on Real Estate Market
Dataset: "RealEstate_Prices.csv"
Description: The dataset contains information about housing prices in a specific real estate
market. It includes various attributes such as property characteristics, location, sale prices,
and other relevant features. The goal is to perform data wrangling to gain insights into the
factors influencing housing prices and prepare the dataset for further analysis or modelling.

Tasks to Perform:
1. Import the "RealEstate_Prices.csv" dataset. Clean column names by removing spaces,
special characters, or renaming them for clarity.
2. Handle missing values in the dataset, deciding on an appropriate strategy (e.g.,
imputation or removal).
3. Perform data merging if additional datasets with relevant information are available
(e.g., neighborhood demographics or nearby amenities).
4. Filter and subset the data based on specific criteria, such as a particular time period,
property type, or location.
5. Handle categorical variables by encoding them appropriately (e.g., one-hot encoding or
label encoding) for further analysis.
6. Aggregate the data to calculate summary statistics or derived metrics such as average sale
prices by neighborhood or property type.
7. Identify and handle outliers or extreme values in the data that may affect the analysis or
modelling process.

Objective of the Assignment: Students should be able to perform data wrangling tasks such as
data cleaning, data transformation, and data preparation to make the dataset suitable for analysis
or modelling in the context of the real estate market.

Data wrangling, also known as data munging, is a critical step in the data analysis process. It
involves cleaning, transforming, and preparing the data for analysis. In this assignment, students
will follow these steps:

Step-1: Introduction to Data Wrangling


Understand the importance of data wrangling and its role in preparing data for analysis. Define
the problem and objectives of the real estate market data analysis.

Step-2: Data Cleaning and Quality Assurance


Clean the dataset to ensure data quality. This includes identifying and handling missing values,
dealing with duplicates, and ensuring data consistency.

Step-3: Data Transformation and Feature Engineering

Transform the data and create new features to extract insights related to housing prices. Feature
engineering can enhance the analysis or modelling process.

Step-4: Handling Missing Values


Identify and address missing values in the dataset. Techniques such as imputation or removal of
missing data will be discussed.

Step-5: Dealing with Duplicates


Identify and handle duplicate records in the data. Duplicates can impact the analysis and should
be treated appropriately.

Step-6: Data Formatting and Standardization


Format data appropriately, including handling data types, converting categorical data, and
standardizing units or scales.
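
A brief sketch of these formatting steps; sale_date, sale_price, and property_type are assumed column names and must be matched to the real dataset:

```python
import pandas as pd

homes = pd.read_csv("RealEstate_Prices.csv")

# Tidy column names: lower-case, underscores instead of spaces/special characters.
homes.columns = (homes.columns.str.strip()
                              .str.lower()
                              .str.replace(r"[^a-z0-9]+", "_", regex=True))

# Ensure correct dtypes for dates and numeric fields (placeholder names).
homes["sale_date"] = pd.to_datetime(homes["sale_date"], errors="coerce")
homes["sale_price"] = pd.to_numeric(homes["sale_price"], errors="coerce")

# One-hot encode a categorical attribute such as property type.
homes = pd.get_dummies(homes, columns=["property_type"], drop_first=True)
```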

Step-7: Data Preparation for Analysis


Prepare the cleaned and transformed dataset for further analysis or modelling. This may involve
splitting data, scaling features, or encoding categorical variables.
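
A possible preparation sketch using scikit-learn, again assuming a placeholder target column named sale_price:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

homes = pd.read_csv("RealEstate_Prices.csv")            # placeholder schema
homes = homes.dropna(subset=["sale_price"])             # the target must be present

X = homes.drop(columns=["sale_price"]).select_dtypes(include="number")
X = X.fillna(X.median())                                # simple feature imputation
y = homes["sale_price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, to avoid information leakage.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```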

Conclusion: In this assignment, we have focused on data wrangling for a real estate market
dataset. By following these steps, students have learned how to clean data, transform it, handle
missing values and duplicates, engineer features, and prepare the dataset for analysis or
modelling. Data wrangling is a crucial step in data analysis to ensure that the data is of high
quality and ready for further exploration.

Assignment 11:
Title of the Assignment: Data Visualization using matplotlib
Problem Statement: Analyzing Air Quality Index (AQI) Trends in a City
Dataset: "City_Air_Quality.csv"
Description: The dataset contains information about air quality measurements in a specific city
over a period of time. It includes attributes such as date, time, pollutant levels (e.g., PM2.5,
PM10, CO), and the Air Quality Index (AQI) values. The goal is to use the matplotlib
library to create visualizations that effectively represent the AQI trends and patterns for different
pollutants in the city.
Tasks to Perform:
1. Import the "City_Air_Quality.csv" dataset.
2. Explore the dataset to understand its structure and content.
3. Identify the relevant variables for visualizing AQI trends, such as date, pollutant levels, and
AQI values.
4. Create line plots or time series plots to visualize the overall AQI trend over time.
5. Plot individual pollutant levels (e.g., PM2.5, PM10, CO) on separate line plots to visualize
their trends over time.
6. Use bar plots or stacked bar plots to compare the AQI values across different dates or time
periods.
7. Create box plots or violin plots to analyze the distribution of AQI values for different pollutant
categories.
8. Use scatter plots or bubble charts to explore the relationship between AQI values and pollutant
levels.
9. Customize the visualizations by adding labels, titles, legends, and appropriate colour schemes.

Objective of the Assignment: Students should be able to use Matplotlib to create various types
of data visualizations that provide insights into air quality trends, allowing for the interpretation
and communication of findings.

Data visualization is a powerful tool for interpreting and communicating data patterns
effectively. In this assignment, students will be guided through the following steps:

Step-1: Introduction to Data Visualization with Matplotlib


Understand the importance of data visualization and how Matplotlib can be used to create a
variety of charts and plots for data analysis.

Step-2: Understanding Air Quality Index (AQI)


Learn about AQI and its significance as a measure of air quality. Understand the pollutants used
to calculate AQI and their health implications.

Step-3: Types of Data Visualizations for AQI Trends


Explore the different types of data visualizations suitable for analyzing AQI trends, including
line plots, bar charts, and box plots.

Step-4: Line Plots for Time Series Analysis


Create line plots to visualize time series data, focusing on how AQI values change over time for
specific pollutants.
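
A minimal Matplotlib sketch of such a time series plot, assuming placeholder column names Date, AQI, and PM2.5:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder column names 'Date', 'AQI', 'PM2.5' -- adjust to the real file.
aqi = pd.read_csv("City_Air_Quality.csv", parse_dates=["Date"]).sort_values("Date")

plt.figure(figsize=(10, 4))
plt.plot(aqi["Date"], aqi["AQI"], label="Overall AQI", color="tab:blue")
plt.plot(aqi["Date"], aqi["PM2.5"], label="PM2.5", color="tab:orange", alpha=0.7)
plt.xlabel("Date")
plt.ylabel("Index / concentration")
plt.title("AQI and PM2.5 trend over time")
plt.legend()
plt.tight_layout()
plt.show()
```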

Step-5: Bar Charts for Comparing Pollutants


Use bar charts to compare pollutant levels and AQI values across different time periods or
locations within the city.

Step-6: Box Plots for AQI Distribution


Generate box plots to visualize the distribution of AQI values, highlighting variations and
outliers in air quality data.
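
A short sketch of a pollutant box plot, with PM2.5, PM10, and CO as assumed column names:

```python
import pandas as pd
import matplotlib.pyplot as plt

aqi = pd.read_csv("City_Air_Quality.csv")     # placeholder schema

# Column names are assumptions; use whichever pollutant columns exist.
pollutants = ["PM2.5", "PM10", "CO"]
plt.figure(figsize=(8, 4))
plt.boxplot([aqi[p].dropna() for p in pollutants], labels=pollutants)
plt.ylabel("Concentration")
plt.title("Distribution of pollutant levels")
plt.tight_layout()
plt.show()
```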

Step-7: Creating Legends and Labels


Enhance visualizations by adding legends, labels, and annotations to make the information more
accessible and interpretable.

Conclusion: In this assignment, we have explored data visualization techniques using Matplotlib
to analyze air quality trends in a specific city. By creating line plots, bar charts, and box plots,
students have learned how to effectively represent AQI data, making it easier to identify patterns
and communicate findings related to air quality in the city. Data visualization is a crucial skill for
data analysts and scientists to derive insights from complex datasets.

Assignment 12:
Title of the Assignment: Data Aggregation
Problem Statement: Analyzing Sales Performance by Region in a Retail Company
Dataset: "Retail_Sales_Data.csv"
Description: The dataset contains information about sales transactions in a retail company. It
includes attributes such as transaction date, product category, quantity sold, and sales
amount. The goal is to perform data aggregation to analyze the sales performance by region
and identify the top-performing regions.
Tasks to Perform:
1. Import the "Retail_Sales_Data.csv" dataset.
2. Explore the dataset to understand its structure and content.
3. Identify the relevant variables for aggregating sales data, such as region, sales amount, and
product category.
4. Group the sales data by region and calculate the total sales amount for each region.
5. Create bar plots or pie charts to visualize the sales distribution by region.
6. Identify the top-performing regions based on the highest sales amount.
7. Group the sales data by region and product category to calculate the total sales amount for
each combination.
8. Create stacked bar plots or grouped bar plots to compare the sales amounts across different
regions and product categories.

Objective of the Assignment: Students should be able to aggregate and summarize data to gain
insights into sales performance by region, allowing for the identification of top-performing
regions within the retail company.

Data aggregation is essential for summarizing and analyzing data at a higher level, providing
insights into overall performance. In this assignment, students will follow these steps:

Step-1: Introduction to Data Aggregation


Understand the importance of data aggregation in analyzing sales performance. Define the
problem and objectives for analyzing sales data by region.

Step-2: Identifying Key Metrics for Analysis


Identify the key metrics or attributes that are relevant to analyzing sales performance by region.
These may include sales amount, quantity sold, or product categories.

Step-3: Grouping Data by Region


Group the sales data by region to create subsets of data that can be analyzed separately. Regions
may be defined by geographical areas or any other relevant criteria.

Step-4: Aggregating Sales Data

Aggregate the sales data for each region, calculating metrics such as total sales amount, total
quantity sold, average sales, etc. This step involves using aggregation functions like sum, mean,
or count.
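
A sketch of this aggregation with a pandas groupby, assuming placeholder column names Region, Product_Category, and Sales_Amount:

```python
import pandas as pd

# Placeholder column names 'Region', 'Product_Category', 'Sales_Amount'.
sales = pd.read_csv("Retail_Sales_Data.csv")

# Total and average sales per region, sorted to surface the top performers.
by_region = (sales.groupby("Region")["Sales_Amount"]
                  .agg(total="sum", average="mean", transactions="count")
                  .sort_values("total", ascending=False))
print(by_region)

# Region x product-category totals, ready for a grouped or stacked bar chart.
by_region_cat = (sales.groupby(["Region", "Product_Category"])["Sales_Amount"]
                      .sum()
                      .unstack(fill_value=0))
print(by_region_cat)
```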

Step-5: Calculating Key Performance Indicators (KPIs)


Calculate key performance indicators (KPIs) that provide insights into sales performance. KPIs
may include top-selling products, revenue growth, or sales trends.

Step-6: Visualizing Sales Performance


Create visualizations, such as bar charts or heatmaps, to represent the sales performance by
region. Visualization helps in understanding the data and communicating findings effectively.

Conclusion: In this assignment, we have focused on data aggregation to analyze sales performance by region in a retail company. By identifying key metrics, grouping data by region,
aggregating sales data, and calculating KPIs, students have learned how to gain insights into the
top-performing regions. Data aggregation is a valuable technique for decision-makers in retail
companies to identify strengths and areas for improvement.

Assignment 13:
Title of the Assignment: Time Series Data Analysis
Problem statement: Analysis and Visualization of Stock Market Data
Dataset: "Stock_Prices.csv"
Description: The dataset contains historical stock price data for a particular company over a
period of time. It includes attributes such as date, closing price, volume, and other relevant
features. The goal is to perform time series data analysis on the stock price data to identify
trends, patterns, and potential predictors, as well as build models to forecast future stock prices.
Tasks to Perform:
1. Import the "Stock_Prices.csv" dataset.
2. Explore the dataset to understand its structure and content.
3. Ensure that the date column is in the appropriate format (e.g., datetime) for time series
analysis.
4. Plot line charts or time series plots to visualize the historical stock price trends over time.
5. Calculate and plot moving averages or rolling averages to identify the underlying trends and
smooth out noise.
6. Perform seasonality analysis to identify periodic patterns in the stock prices, such as weekly,
monthly, or yearly fluctuations.
7. Analyze and plot the correlation between the stock prices and other variables, such as trading
volume or market indices.
8. Use autoregressive integrated moving average (ARIMA) models or exponential smoothing
models to forecast future stock prices.

Objective of the Assignment: Students should be able to analyze time series data, identify
trends, patterns, and potential predictors in stock prices, and build forecasting models for future
stock price prediction.

Time series data analysis is essential for understanding and predicting trends in sequential data.
In this assignment, students will follow these steps:

Step-1: Introduction to Time Series Data Analysis


Understand the significance of time series data analysis, especially in the context of stock price
forecasting. Define the problem and objectives for analyzing the stock price data.

Step-2: Exploratory Data Analysis (EDA) for Time Series Data


Perform EDA to understand the data, identify patterns, and visualize stock price trends over
time. Techniques may include time plots, autocorrelation, and partial autocorrelation plots.

Step-3: Time Series Decomposition

Decompose the time series data into its components, such as trend, seasonality, and residuals.
This step helps in understanding the underlying patterns in the data.
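
A sketch of an additive decomposition with statsmodels, assuming Date and Close columns; the period of 252 trading days is an assumption and needs at least two years of data:

```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Placeholder column names 'Date' and 'Close' for the stock price file.
prices = pd.read_csv("Stock_Prices.csv", parse_dates=["Date"], index_col="Date")
close = prices["Close"].asfreq("B").ffill()   # business-day frequency, fill gaps

# Additive decomposition; period=252 (one trading year) is an assumption and
# requires at least two years of data -- use a smaller period for short series.
result = seasonal_decompose(close, model="additive", period=252)
result.plot()
plt.tight_layout()
plt.show()
```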

Step-4: Building Time Series Models


Build time series models such as Autoregressive Integrated Moving Average (ARIMA) or
Exponential Smoothing to forecast future stock prices. Model selection and parameter tuning are
crucial.
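
A hedged ARIMA sketch with statsmodels; the order (5, 1, 0) is only an illustrative starting point, not a recommended model:

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

prices = pd.read_csv("Stock_Prices.csv", parse_dates=["Date"], index_col="Date")
close = prices["Close"].asfreq("B").ffill()

# Hold out the last 30 business days for evaluation.
train, test = close[:-30], close[-30:]

# The order (5, 1, 0) is only an illustrative starting point; tune it using
# ACF/PACF plots or an information-criterion search before trusting forecasts.
model = ARIMA(train, order=(5, 1, 0))
fitted = model.fit()

forecast = fitted.forecast(steps=len(test))
rmse = ((forecast.values - test.values) ** 2).mean() ** 0.5
print(f"RMSE on the 30-day hold-out: {rmse:.2f}")
```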

Step-5: Model Evaluation and Forecasting


Evaluate the performance of time series models using appropriate metrics and cross-validation
techniques. Use the models to make future stock price forecasts.

Step-6: Data Visualization for Stock Price Analysis


Create visualizations, such as time series plots, forecasted trends, and prediction intervals, to
represent the stock price analysis. Visualization enhances the understanding of data.

Step-7: Interpretation of Time Series Results


Interpret the results, identify significant predictors, and draw conclusions regarding stock price
trends and predictions. Communicate findings effectively.

Conclusion: In this assignment, we have explored time series data analysis for historical stock
price data. Students have learned how to perform EDA, decompose time series data, build
forecasting models, evaluate model performance, and visualize stock price trends. Time series
data analysis is essential for investors and analysts to make informed decisions in the stock
market.
