DMV & ML Lab
Lab Manual
Prepared by
Mrs. Sonali Nawale
BE AI& DS
SEM –VII
Academic year-2023-24
Sr. No  Title of Experiment

1A  Feature Transformation: To use the PCA algorithm for dimensionality reduction. You have a dataset that includes
    measurements for different variables on wine (alcohol, ash, magnesium, and so on). Apply the PCA algorithm and
    transform this data so that most variations in the measurements of the variables are captured by a small number
    of principal components, making it easier to distinguish between red and white wine by inspecting these
    principal components.
    Dataset Link: https://media.geeksforgeeks.org/wp-content/uploads/Wine.csv

1B  Feature Transformation: Apply the LDA algorithm on the Iris dataset and classify which species a given flower
    belongs to.
    Dataset Link: https://www.kaggle.com/datasets/uciml/iris

6C  Reinforcement Learning: Build a Tic-Tac-Toe game using reinforcement learning in Python by performing the
    following tasks:
    a. Setting up the environment
    b. Defining the Tic-Tac-Toe game
    c. Building the reinforcement learning model
    d. Training the model
    e. Testing the model
Assignment 1A:
Title of the Assignment: Feature Transformation:
To use the PCA algorithm for dimensionality reduction on the Wine dataset.
Objective of the Assignment: Students should be able to preprocess the dataset, identify outliers, and apply the
PCA algorithm to transform the data so that most of the variation in the measurements of the variables is captured
by a small number of principal components.
Theory:
Dimensionality reduction is a way to reduce the complexity of a model and avoid overfitting.
There are two main categories of dimensionality reduction: feature selection and feature
extraction. Via feature selection, we select a subset of the original features, whereas in feature
extraction, we derive information from the feature set to construct a new feature subspace.
Principal Component Analysis (PCA) is an algorithm used to compress a dataset onto a lower-dimensional
feature subspace with the goal of retaining most of the relevant information. It is a feature
extraction technique that maps a higher-dimensional feature space to a lower-dimensional feature
space. While reducing the number of dimensions, PCA ensures that the maximum information of the
original dataset is retained in the dataset with the reduced number of dimensions and that the
correlation between the newly obtained principal components is minimal. The new features obtained
after applying PCA are called Principal Components and are denoted as PCi (i = 1, 2, 3, ..., n).
Here, PC1 (Principal Component 1) captures the maximum information of the original dataset,
followed by PC2, then PC3, and so on.
A bar graph of the explained variance depicts the amount of information captured by the various
principal components (the explained variance defines the amount of information captured by each
principal component); the sketch below produces such a graph for the wine data.
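A minimal version of this workflow with scikit-learn, assuming Wine.csv has been downloaded from the dataset link
and that its last column holds the class label (an assumption to verify against the actual file):

    # Minimal PCA sketch: standardize features, fit PCA, plot explained variance
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    import matplotlib.pyplot as plt

    df = pd.read_csv("Wine.csv")                      # local copy of the dataset (assumed file name)
    X = df.drop(columns=df.columns[-1])               # assume the last column is the class label
    X_scaled = StandardScaler().fit_transform(X)      # PCA is sensitive to feature scales

    pca = PCA(n_components=2)                         # keep the two strongest components
    X_pca = pca.fit_transform(X_scaled)

    # Bar graph of explained variance per principal component
    plt.bar(range(1, 3), pca.explained_variance_ratio_)
    plt.xlabel("Principal component")
    plt.ylabel("Explained variance ratio")
    plt.show()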
Conclusion: In this way, we have explored the concept of feature transformation using the PCA
algorithm for dimensionality reduction.
Assignment 1B:
Title of the Assignment: Feature Transformation:
Apply LDA Algorithm on Iris Dataset and classify which species a given flower belongs to.
Dataset Description: The project involves using the Iris dataset, which includes measurements
of sepal length, sepal width, petal length, and petal width for three species of iris flowers. The
goal is to apply Linear Discriminant Analysis (LDA) to transform this data in a way that makes it
easier to classify a given flower into one of the three species based on its measurements.
Objective of the Assignment: Students should be able to preprocess the dataset, apply the LDA
algorithm for feature transformation, and build a classification model to classify iris flowers into
different species.
Theory:
Feature transformation is a critical step in preparing data for machine learning tasks. Linear
Discriminant Analysis (LDA) is a dimensionality reduction technique that aims to find a lower-
dimensional representation of the data while maximizing the separation between different
classes. In the case of the Iris dataset, LDA can help us transform the feature space so that the
species of iris flowers become more distinguishable.
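As an illustration, the following is a minimal sketch of LDA on the Iris data, using the copy of the dataset bundled
with scikit-learn (rather than the Kaggle CSV) so that the example is self-contained; with three classes, LDA can
produce at most two discriminant components.

    # Minimal LDA sketch: supervised projection plus classification on Iris
    from sklearn.datasets import load_iris
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    lda = LinearDiscriminantAnalysis(n_components=2)   # at most (classes - 1) = 2 components
    X_train_lda = lda.fit_transform(X_train, y_train)  # class-aware projection of the features
    y_pred = lda.predict(X_test)                       # LDA can also act as the classifier

    print("Transformed shape:", X_train_lda.shape)
    print("Accuracy:", accuracy_score(y_test, y_pred))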
Conclusion: In this assignment, we have explored the concept of feature transformation using
the LDA algorithm for dimensionality reduction and classification. By transforming the Iris
dataset's features, we can build a classification model that can predict the species of iris flowers
based on their measurements. This demonstrates the power of feature transformation techniques
in improving the performance of machine learning models.
Assignment 2A:
Title of the Assignment: Regression Analysis:
Predict the price of the Uber ride from a given pickup point to the agreed drop-off location.
Perform following tasks:
1. Pre-process the dataset.
2. Identify outliers.
3. Check the correlation.
4. Implement linear regression and random forest regression models.
5. Evaluate the models and compare their respective scores like R2, RMSE, etc.
Dataset Description: The project is about Uber Inc., one of the world's largest taxi companies. In this
project, we aim to predict the fare for future rides. Uber serves lakhs of customers daily, so it
becomes really important to manage this data properly to come up with new business ideas and get the
best results. Ultimately, it is very important to estimate fare prices accurately.
Objective of the Assignment: Students should be able to preprocess the dataset, identify outliers,
check correlation, implement linear regression and random forest regression models, and evaluate
them using scores such as R2 and RMSE.
Theory:
Data Preprocessing: Data preprocessing is the process of preparing raw data and making it
suitable for a machine learning model. It is the first and most crucial step when creating a machine
learning project: we do not always come across clean and formatted data, and before doing any
operation with data it must be cleaned and put into a formatted form, which is exactly what data
preprocessing does. Why do we need data preprocessing? Real-world data generally contains noise
and missing values and may be in an unusable format that cannot be used directly by machine
learning models. Data preprocessing cleans the data and makes it suitable for a machine learning
model, which also increases the accuracy and efficiency of the model.
Linear Regression: Linear regression is one of the simplest and most popular machine learning
algorithms. It is a statistical method used for predictive analysis. Linear regression makes
predictions for continuous/real or numeric variables such as sales, salary, age, product price,
etc. The linear regression algorithm models a linear relationship between a dependent variable (y)
and one or more independent variables (x), hence the name linear regression. Because the
relationship is linear, the model describes how the value of the dependent variable changes with
the value of the independent variables, and it provides a sloped straight line representing that
relationship. A worked example for this assignment is sketched below.
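The sketch below covers tasks 4 and 5, assuming a cleaned Uber CSV; the file name, the feature columns distance_km
and passenger_count, and the target column fare_amount are illustrative assumptions rather than the actual schema.

    # Minimal sketch: linear regression vs. random forest regression with R2 and RMSE
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score, mean_squared_error

    df = pd.read_csv("uber.csv")                        # hypothetical cleaned dataset
    X = df[["distance_km", "passenger_count"]]          # assumed feature columns
    y = df["fare_amount"]                               # assumed target column
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    for name, model in [("Linear Regression", LinearRegression()),
                        ("Random Forest", RandomForestRegressor(random_state=42))]:
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        rmse = np.sqrt(mean_squared_error(y_test, pred))
        print(name, "R2:", r2_score(y_test, pred), "RMSE:", rmse)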
Conclusion: In this way we have explored the concept of correlation and implemented linear regression
and random forest regression models.
Assignment 2B:
Title of the Assignment: Regression Analysis
Use the diabetes data set from UCI and Pima Indians Diabetes data set for performing the
following:
a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation,
Skewness, and Kurtosis
b. Bivariate analysis: Linear and logistic regression modelling
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets
Dataset Description: The project involves working with two diabetes datasets: one from
the UCI Machine Learning Repository and the other from the Pima Indians Diabetes Database.
The goal is to perform various regression analyses on these datasets, including univariate
analysis to understand the statistical properties of the data, bivariate analysis involving
linear and logistic regression modelling, and multiple regression analysis to predict
outcomes. Finally, the results of these analyses will be compared between the two datasets.
Theory:
Univariate Analysis:
Univariate analysis is the process of analyzing a single variable or attribute at a time. It
involves computing summary statistics like frequency, mean, median, mode, variance,
standard deviation, skewness, and kurtosis for a single variable. This analysis helps in
understanding the distribution and characteristics of the data within that variable.
Bivariate Analysis:
Bivariate analysis involves analyzing the relationship between two variables. In the context
of regression analysis, it includes linear and logistic regression modelling. Linear regression
is used when the dependent variable is continuous, while logistic regression is used when
the dependent variable is categorical.
Linear Regression:
Linear regression is a statistical method for modelling the relationship between a dependent
variable and one or more independent variables by fitting a linear equation to the observed
data. It is commonly used for predicting numeric values.
Logistic Regression:
Logistic regression is a statistical method for modelling the probability of a binary outcome
by fitting a logistic curve to the observed data. It is used for binary classification problems.
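As an illustration, the following is a minimal sketch of the univariate, bivariate, and multiple regression steps,
assuming the Pima Indians data is saved as diabetes.csv with columns such as Glucose, BMI, Age, and a 0/1 Outcome
label (all column names are assumptions to check against the actual file).

    # Minimal sketch: univariate statistics, logistic regression, and multiple regression
    import pandas as pd
    from sklearn.linear_model import LinearRegression, LogisticRegression

    df = pd.read_csv("diabetes.csv")                    # hypothetical local copy

    # Univariate analysis for one attribute
    col = df["Glucose"]                                 # assumed column name
    print(col.mean(), col.median(), col.mode()[0])
    print(col.var(), col.std(), col.skew(), col.kurtosis())

    # Bivariate analysis: logistic regression of Outcome on Glucose
    log_reg = LogisticRegression(max_iter=1000)
    log_reg.fit(df[["Glucose"]], df["Outcome"])
    print("Logistic coefficient:", log_reg.coef_)

    # Multiple regression: linear model of Glucose on several predictors
    lin_reg = LinearRegression().fit(df[["BMI", "Age"]], df["Glucose"])
    print("R^2:", lin_reg.score(df[["BMI", "Age"]], df["Glucose"]))

The same code can be run on both datasets and the resulting statistics and model scores compared, as required by task d.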
Assignment 3A:
Title of the Assignment: Classification Analysis:
Implementation of Support Vector Machines (SVM) for classifying images of handwritten
digits into their respective numerical classes (0 to 9).
Dataset Description: The dataset consists of images of handwritten digits (0 to 9). Support Vector
Machines (SVMs) are a type of supervised machine learning algorithm that can be used for
classification and regression tasks; here, we will focus on using SVMs for image classification.
When a computer processes an image, it perceives it as an array of pixels. The size of the array
corresponds to the resolution of the image: for example, if a colour image is 200 pixels wide and
200 pixels tall, the array has the dimensions 200 x 200 x 3. The first two dimensions represent the
width and height of the image, respectively, while the third dimension represents the RGB colour
channels. The values in the array range from 0 to 255, indicating the intensity of the pixel at
each point.
Objective of the Assignment: Students should be able to perform classification analysis by
implementing Support Vector Machines (SVM) for classifying images of handwritten digits.
Theory:
In order to classify an image using an SVM, we first need to extract features from the image.
These features can be the color values of the pixels, edge detection, or even the textures present
in the image. Once the features are extracted, we can use them as input for the SVM algorithm.
The SVM algorithm works by finding the hyperplane that separates the different classes in the
feature space. The key idea behind SVMs is to find the hyperplane that maximizes the margin,
which is the distance between the closest points of the different classes. The points that are
closest to the hyperplane are called support vectors.
One of the main advantages of using SVMs for image classification is that they can effectively
handle high-dimensional data, such as images. Additionally, SVMs are less prone to overfitting
than other algorithms such as neural networks.
“Support Vector Machine” (SVM) is a supervised machine learning algorithm that can
be used for both classification and regression challenges. However, it is mostly used in
classification problems, such as text classification. In the SVM algorithm, we plot each data item
as a point in n-dimensional space (where n is the number of features you have), with the value of
each feature being the value of a particular coordinate. Then, we perform classification by
finding the optimal hyperplane that differentiates the two classes very well; a worked example is
sketched below.
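The sketch uses scikit-learn's built-in 8x8 digits dataset so that it is self-contained; the small parameter grid is
illustrative rather than exhaustive.

    # Minimal sketch: SVM digit classification with a small grid search
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score

    digits = load_digits()
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data, digits.target, test_size=0.2, random_state=42)

    # Small grid search over kernel parameters, as mentioned in the conclusion
    grid = GridSearchCV(SVC(), {"C": [1, 10], "gamma": ["scale", 0.01]}, cv=3)
    grid.fit(X_train, y_train)

    print("Best parameters:", grid.best_params_)
    print("Test accuracy:", accuracy_score(y_test, grid.predict(X_test)))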
Conclusion:
In this way we have explored a Support Vector Machine (SVM) model to accurately classify
images of handwritten digits into their respective numerical classes. The best parameters for the
SVM model were determined using GridSearchCV, and the model's accuracy was measured.
Assignment 3B:
Title of the Assignment: Classification Analysis:
Implement K-Nearest Neighbors (KNN) algorithm on social network ad dataset. Compute
confusion matrix, accuracy, error rate, precision, and recall on the given dataset.
Dataset link: https://www.kaggle.com/datasets/rakeshrau/social-network-ads
Dataset Description: In this project, we will work with a social network ad dataset that contains
information about users, including features such as age, gender, and estimated salary, as well as
whether a user clicked on a particular ad (0 for no, 1 for yes). The goal is to implement the K-
Nearest Neighbors (KNN) algorithm on this dataset to predict whether a user is likely to click on
the ad or not. We will compute various classification metrics, including confusion matrix,
accuracy, error rate, precision, and recall.
Objective of the Assignment: Students should be able to apply the K-Nearest Neighbors
algorithm to a real-world dataset for classification tasks. They should also learn how to evaluate
the performance of a classification model using key metrics.
Theory:
K-Nearest Neighbors (KNN) Algorithm:
K-Nearest Neighbors is a supervised machine learning algorithm used for classification and
regression tasks. In KNN, an object is classified by a majority vote of its neighbors, with the
object being assigned to the class that is most common among its K nearest neighbors (K is a
hyperparameter). KNN is a non-parametric and instance-based learning algorithm.
Confusion Matrix:
A confusion matrix is a table that is used to evaluate the performance of a classification model. It
shows the true positive, true negative, false positive, and false negative values, which are
essential for calculating various classification metrics.
Accuracy:
Accuracy measures the ratio of correctly predicted instances to the total instances in the dataset.
It provides a general measure of the model's performance.
Error Rate:
Error rate is the complement of accuracy and measures the ratio of incorrectly predicted
instances to the total instances. It provides the rate of misclassification.
Precision:
Precision is a metric that measures the accuracy of positive predictions. It is the ratio of true
positive predictions to the total positive predictions and helps in assessing the model's ability to
avoid false positives.
Recall:
Recall, also known as sensitivity or true positive rate, measures the ability of the model to
correctly identify positive instances. It is the ratio of true positive predictions to the total actual
positive instances.
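The following is a minimal sketch of the KNN workflow and the metrics listed above, assuming the Kaggle file is saved
as Social_Network_Ads.csv with Age and EstimatedSalary features and a 0/1 Purchased target (column names are
assumptions to verify against the downloaded file).

    # Minimal sketch: KNN classification with confusion matrix and metrics
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

    df = pd.read_csv("Social_Network_Ads.csv")          # assumed file name
    X = df[["Age", "EstimatedSalary"]]                  # assumed feature columns
    y = df["Purchased"]                                 # assumed 0/1 target column
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    scaler = StandardScaler()                           # KNN is distance-based, so scale features
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    y_pred = knn.predict(X_test)

    acc = accuracy_score(y_test, y_pred)
    print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
    print("Accuracy:", acc, "Error rate:", 1 - acc)
    print("Precision:", precision_score(y_test, y_pred), "Recall:", recall_score(y_test, y_pred))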
Assignment 4A:
Title of the Assignment: Clustering Analysis:
Implement K-Means clustering on Iris.csv dataset. Determine the number of clusters using the
elbow method.
Dataset Description: It has 150 entries with 1 dependent column and 4 feature columns.
Theory:
K-Means Clustering is an unsupervised learning algorithm that is used to solve clustering
problems in machine learning or data science. In this topic, we will learn what the K-means
clustering algorithm is and how it works, along with a Python implementation of K-means
clustering.
"It is an iterative algorithm that divides the unlabelled dataset into K different clusters in such a
way that each data point belongs to only one group of points with similar properties."
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids (they can be points other than those in the input dataset).
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of its
cluster.
The elbow method is a popular technique for choosing the number of clusters K in K-Means, which,
unlike supervised learning, does not require labelled data. The method runs K-Means (randomly
initializing K cluster centroids and iteratively adjusting them until they stop moving) for a range
of K values, records the within-cluster sum of squares (WCSS) for each K, and plots WCSS against K;
the "elbow" point where the curve bends is taken as the optimal K. The sketch below applies this to
the Iris data.
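A minimal sketch, assuming scikit-learn's bundled copy of the Iris features stands in for Iris.csv:

    # Minimal sketch: K-Means with the elbow method
    from sklearn.datasets import load_iris
    from sklearn.cluster import KMeans
    import matplotlib.pyplot as plt

    X = load_iris().data

    # Compute within-cluster sum of squares (inertia) for K = 1..10
    wcss = []
    for k in range(1, 11):
        km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
        wcss.append(km.inertia_)

    plt.plot(range(1, 11), wcss, marker="o")
    plt.xlabel("Number of clusters K")
    plt.ylabel("WCSS (inertia)")
    plt.title("Elbow method")
    plt.show()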
Conclusion:
In this way, we have studied the basic concepts of the K-Means Clustering algorithm in Machine
Learning. We used the elbow method to find a suitable value of K for clustering the data in our
sample dataset.
Assignment 4B:
Title of the Assignment: Clustering Analysis:
Implement K-Medoid Algorithm on a credit card dataset. Determine the number of clusters using
the Silhouette Method.
Dataset link: https://www.kaggle.com/datasets/arjunbhasin2013/ccdata
Dataset Description: In this project, we will work with a credit card dataset that contains
information about credit card users, including various features such as credit limit, age, income,
and spending behavior. The goal is to implement the K-Medoid clustering algorithm on this
dataset to group users into clusters based on their credit card usage patterns. Additionally, we
will use the Silhouette Method to determine the optimal number of clusters.
Objective of the Assignment: Students should be able to apply clustering techniques to real-
world data, specifically using the K-Medoid algorithm. They should also learn how to evaluate
the quality of clustering using the Silhouette Method.
Theory:
K-Medoid Algorithm:
K-Medoid is a partitioning clustering algorithm that aims to divide a dataset into a pre-defined
number of clusters (K) where each cluster is represented by one data point, called the medoid.
Unlike K-Means, which uses the mean of data points as cluster representatives, K-Medoid uses
actual data points as representatives. This makes K-Medoid more robust to outliers and noise.
Silhouette Method:
The Silhouette Method is a technique to determine the optimal number of clusters for a given
dataset. It quantifies how similar an object is to its own cluster compared to other clusters. The
silhouette score ranges from -1 to 1, where a high value indicates that the object is well matched
to its own cluster and poorly matched to neighboring clusters. The goal is to find the number of
clusters that maximizes the silhouette score.
Clustering Evaluation:
Clustering evaluation helps measure the quality of clustering results. The Silhouette Method is
one such evaluation technique that provides a quantitative measure of the goodness of clustering.
Other evaluation metrics like the Davies-Bouldin Index and the Dunn Index can also be used.
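As an illustration, the following is a minimal sketch of K-Medoids with the Silhouette Method; it assumes the
scikit-learn-extra package (which provides a KMedoids estimator) is installed and that the Kaggle file is saved
locally as "CC GENERAL.csv" with a CUST_ID identifier column (file and column names are assumptions).

    # Minimal sketch: K-Medoids clustering evaluated with silhouette scores
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.metrics import silhouette_score
    from sklearn_extra.cluster import KMedoids

    df = pd.read_csv("CC GENERAL.csv").drop(columns=["CUST_ID"]).dropna()  # assumed file/column names
    X = StandardScaler().fit_transform(df)

    # Try several cluster counts and report the silhouette score for each
    for k in range(2, 7):
        labels = KMedoids(n_clusters=k, random_state=42).fit_predict(X)
        print(k, "silhouette:", silhouette_score(X, labels))

The K giving the highest silhouette score is taken as the number of clusters.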
Assignment 5A:
Title of the Assignment: Ensemble Learning
Implement Random Forest Classifier model to predict the safety of the car.
Dataset link: https://www.kaggle.com/datasets/elikplim/car-evaluation-data-set
Dataset Description: In this project, we will work with a car evaluation dataset that
contains information about various features of cars, such as their price, maintenance cost,
number of doors, and safety rating. The goal is to implement the Random Forest Classifier
model on this dataset to predict the safety level of the car. The safety level is categorized as
"low," "medium," "high," and "very high."
Theory:
Random Forest Classifier:
Random Forest is an ensemble learning method used for classification and regression tasks.
It is an ensemble of decision trees, where multiple decision trees are trained on different
subsets of the data. The predictions from individual trees are combined to make the final
prediction. Random Forest is known for its high accuracy and resistance to overfitting.
Ensemble Learning:
Ensemble learning is a machine learning technique that combines the predictions of
multiple models to improve overall performance. Random Forest is an example of ensemble
learning, where multiple decision trees are combined to make more robust predictions.
Evaluation Metrics:
Evaluation metrics are used to measure the performance of classification models. Common
metrics include accuracy, precision, recall, and F1-score, among others. These metrics help
assess the model's ability to make correct predictions and handle different aspects of
classification performance.
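The following is a minimal sketch of the Random Forest workflow, assuming the Kaggle file is saved as
car_evaluation.csv and that the target column is named safety; both names, and the one-hot encoding of the remaining
categorical columns, are illustrative assumptions.

    # Minimal sketch: Random Forest classifier on the car evaluation data
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    df = pd.read_csv("car_evaluation.csv")               # hypothetical local copy
    X = pd.get_dummies(df.drop(columns=["safety"]))      # one-hot encode categorical features
    y = df["safety"]                                     # assumed target column
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
    print("Accuracy:", accuracy_score(y_test, rf.predict(X_test)))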
Assignment 5B:
Title of the Assignment: Ensemble Learning
Use different voting mechanisms and apply AdaBoost (Adaptive Boosting), Gradient Tree
Boosting (GBM), and XGBoost classification on the Iris dataset, and compare the performance
of the three models using different evaluation measures.
Dataset Link: https://www.kaggle.com/datasets/uciml/iris
Dataset Description: In this project, we will work with the Iris dataset, a well-known
dataset in machine learning. The dataset contains information about different species of iris
flowers and their features, such as sepal length, sepal width, petal length, and petal width.
The goal is to apply three different ensemble learning techniques: AdaBoost, Gradient Tree
Boosting (GBM), and XGBoost, to classify iris flowers into their respective species. We
will compare the performance of these three models using various evaluation measures.
Theory:
Ensemble Learning:
Ensemble learning combines the predictions of multiple models to improve overall
performance. In this assignment, we will explore three ensemble techniques: AdaBoost,
GBM, and XGBoost, which are used for classification tasks.
AdaBoost Classification:
AdaBoost (Adaptive Boosting) builds an ensemble of weak learners sequentially, with each new
learner focusing on the training samples that the previous learners misclassified.
Gradient Tree Boosting (GBM):
Gradient Tree Boosting is another ensemble technique that builds an ensemble of decision
trees. It uses gradient descent optimization to minimize the loss function, gradually
improving the model's performance.
XGBoost Classification:
XGBoost (Extreme Gradient Boosting) is an advanced ensemble learning algorithm known
for its speed and performance. It combines the advantages of gradient boosting and
regularization techniques to create a powerful classifier.
Evaluation Metrics:
We will use various evaluation metrics, including accuracy, precision, recall, and F1-score,
to assess the performance of the three ensemble models. These metrics provide insights into
the models' ability to make correct predictions and handle different aspects of classification
performance.
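A minimal sketch of the comparison, using scikit-learn's bundled Iris data and assuming the xgboost package is
installed; accuracy and macro-F1 stand in here for the fuller set of evaluation measures, and a voting classifier
can be added analogously with sklearn.ensemble.VotingClassifier.

    # Minimal sketch: comparing AdaBoost, Gradient Boosting, and XGBoost on Iris
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
    from sklearn.metrics import accuracy_score, f1_score
    from xgboost import XGBClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    models = {
        "AdaBoost": AdaBoostClassifier(random_state=42),
        "GBM": GradientBoostingClassifier(random_state=42),
        "XGBoost": XGBClassifier(eval_metric="mlogloss"),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        print(name, "accuracy:", accuracy_score(y_test, pred),
              "macro-F1:", f1_score(y_test, pred, average="macro"))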
Assignment 6A:
Title of the Assignment: Reinforcement Learning
Implement Reinforcement Learning using an example of a maze environment that the agent
needs to explore.
Assignment Description: In this project, we will delve into the field of Reinforcement
Learning (RL) by implementing RL techniques using a maze environment. The primary
objective is to create an RL agent capable of navigating and solving a maze problem
autonomously. The agent will learn to make decisions and take actions based on its
interactions with the maze environment, gradually improving its ability to navigate and
reach a predefined goal.
Theory:
Reinforcement Learning (RL):
Reinforcement Learning is a type of machine learning where an agent interacts with an
environment and learns to make sequences of decisions to maximize a cumulative reward. It
involves the concept of an agent learning from its experiences, taking actions to achieve a
goal, and receiving feedback in the form of rewards or penalties.
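The following is a minimal sketch of tabular Q-learning on a tiny 4x4 grid maze, written with plain NumPy; the maze
layout, reward values, and hyperparameters are all illustrative choices, not part of the assignment's specification.

    # Minimal sketch: tabular Q-learning on a 4x4 grid maze
    import numpy as np

    n_states, n_actions = 16, 4              # 4x4 grid; actions: 0=up, 1=down, 2=left, 3=right
    goal = 15                                # bottom-right cell is the goal
    Q = np.zeros((n_states, n_actions))
    alpha, gamma, epsilon = 0.1, 0.9, 0.2    # learning rate, discount factor, exploration rate

    def step(state, action):
        row, col = divmod(state, 4)
        if action == 0:
            row = max(row - 1, 0)
        elif action == 1:
            row = min(row + 1, 3)
        elif action == 2:
            col = max(col - 1, 0)
        else:
            col = min(col + 1, 3)
        next_state = row * 4 + col
        reward = 1.0 if next_state == goal else -0.01    # small step penalty, goal reward
        return next_state, reward, next_state == goal

    for episode in range(500):
        state = 0
        for _ in range(100):                             # cap episode length
            if np.random.rand() < epsilon:               # epsilon-greedy action selection
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = step(state, action)
            # Q-learning update rule
            Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
            state = next_state
            if done:
                break

    print("Greedy action per state:", np.argmax(Q, axis=1))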
Assignment 6B:
Title of the Assignment: Reinforcement Learning
Solve the Taxi problem using reinforcement learning, where the agent acts as a taxi driver who
must pick up a passenger at one location and then drop the passenger off at their destination.
The agent interacts with the Taxi environment: it selects actions, the environment responds, and
the state changes accordingly.
Python Libraries:
Python libraries like NumPy are often used for implementing RL algorithms due to their
efficiency in numerical operations.
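A minimal sketch of Q-learning on the Taxi problem, assuming the gymnasium package (the maintained successor of
OpenAI Gym, which ships the Taxi-v3 environment) is installed; hyperparameters and the episode count are illustrative.

    # Minimal sketch: Q-learning on the Taxi-v3 environment
    import numpy as np
    import gymnasium as gym

    env = gym.make("Taxi-v3")
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    alpha, gamma, epsilon = 0.1, 0.95, 0.1   # learning rate, discount factor, exploration rate

    for episode in range(2000):
        state, _ = env.reset()
        done = False
        while not done:
            if np.random.rand() < epsilon:               # epsilon-greedy over the Q-table
                action = env.action_space.sample()
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, terminated, truncated, _ = env.step(action)
            Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
            state = next_state
            done = terminated or truncated

    print("Training finished; greedy action from the last visited state:", int(np.argmax(Q[state])))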
Assignment 6C:
Title of the Assignment: Reinforcement Learning
Build a Tic-Tac-Toe game using reinforcement learning in Python by performing the following
tasks:
a. Setting up the environment
b. Defining the Tic-Tac-Toe game
c. Building the reinforcement learning model
d. Training the model
e. Testing the model
Python Libraries:
Python libraries like NumPy are commonly used for implementing RL algorithms due to
their efficiency in numerical operations.
Lab Manual
Prepared by
Mrs. Tejasvi Jadhav
BE AI& DS
SEM –VII
Academic year-2023-24
Sr. No  Title of Experiment

7   Data Loading, Storage and File Formats
    Problem Statement: Analyzing Sales Data from Multiple File Formats
    Dataset: Sales data in multiple file formats (e.g., CSV, Excel, JSON)
    Description: The goal is to load and analyze sales data from different file formats, including CSV, Excel, and
    JSON, and perform data cleaning, transformation, and analysis on the dataset.
    Tasks to Perform:
    Obtain sales data files in various formats, such as CSV, Excel, and JSON.
    1. Load the sales data from each file format into the appropriate data structures or DataFrames.
    2. Explore the structure and content of the loaded data, identifying any inconsistencies, missing values, or
       data quality issues.
    3. Perform data cleaning operations, such as handling missing values, removing duplicates, or correcting
       inconsistencies.
    4. Convert the data into a unified format, such as a common DataFrame or data structure, to enable seamless
       analysis.
    5. Perform data transformation tasks, such as merging multiple datasets, splitting columns, or deriving new
       variables.
    6. Analyze the sales data by performing descriptive statistics, aggregating data by specific variables, or
       calculating metrics such as total sales, average order value, or product category distribution.
    7. Create visualizations, such as bar plots, pie charts, or box plots, to represent the sales data and gain
       insights into sales trends, customer behavior, or product performance.

8   Interacting with Web APIs
    Problem Statement: Analyzing Weather Data from OpenWeatherMap API
    Dataset: Weather data retrieved from OpenWeatherMap API
    Description: The goal is to interact with the OpenWeatherMap API to retrieve weather data for a specific
    location and perform data modelling and visualization to analyze weather patterns over time.
    Tasks to Perform:
    1. Register and obtain API key from OpenWeatherMap.
    2. Interact with the OpenWeatherMap API using the API key to retrieve weather data for a specific location.
    3. Extract relevant weather attributes such as temperature, humidity, wind speed, and precipitation from the
       API response.
    4. Clean and preprocess the retrieved data, handling missing values or inconsistent formats.
    5. Perform data modelling to analyze weather patterns, such as calculating average temperature,
       maximum/minimum values, or trends over time.
    6. Visualize the weather data using appropriate plots, such as line charts, bar plots, or scatter plots, to
       represent temperature changes, precipitation levels, or wind speed variations.
    7. Apply data aggregation techniques to summarize weather statistics by specific time periods (e.g., daily,
       monthly, seasonal).
    8. Incorporate geographical information, if available, to create maps or geospatial visualizations representing
       weather patterns across different locations.
    9. Explore and visualize relationships between weather attributes, such as temperature and humidity, using
       correlation plots or heatmaps.

12  Data Aggregation
    Problem Statement: Analyzing Sales Performance by Region in a Retail Company
    Dataset: "Retail_Sales_Data.csv"
    Description: The dataset contains information about sales transactions in a retail company. It includes
    attributes such as transaction date, product category, quantity sold, and sales amount. The goal is to perform
    data aggregation to analyze the sales performance by region and identify the top-performing regions.
    Tasks to Perform:
    1. Import the "Retail_Sales_Data.csv" dataset.
    2. Explore the dataset to understand its structure and content.
    3. Identify the relevant variables for aggregating sales data, such as region, sales amount, and product
       category.
    4. Group the sales data by region and calculate the total sales amount for each region.
    5. Create bar plots or pie charts to visualize the sales distribution by region.
    6. Identify the top-performing regions based on the highest sales amount.
    7. Group the sales data by region and product category to calculate the total sales amount for each
       combination.
    8. Create stacked bar plots or grouped bar plots to compare the sales amounts across different regions and
       product categories.
Assignment 7:
Title of the Assignment: Data Loading, Storage and File Formats
Objective of the Assignment: Students should be able to load data from various file formats,
clean and preprocess the data, and perform exploratory data analysis (EDA) to gain insights.
Data loading, storage, and file formats play a crucial role in data analysis. Different sources and
file formats require specific methods for loading and processing. In this assignment, we will
explore the following steps:
Data from different sources can be messy and may contain missing values, duplicates, or
inconsistent data. Data cleaning is essential to prepare the dataset for analysis. We will cover
techniques for data cleaning and preprocessing.
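A minimal sketch of the loading and consolidation steps, assuming sales files named sales.csv, sales.xlsx, and
sales.json exist locally with matching columns such as sales_amount and product_category (all names are
assumptions), and that openpyxl is installed for Excel support.

    # Minimal sketch: load sales data from CSV, Excel, and JSON, then clean and summarize
    import pandas as pd

    df_csv = pd.read_csv("sales.csv")
    df_excel = pd.read_excel("sales.xlsx")
    df_json = pd.read_json("sales.json")

    # Combine into one unified DataFrame and do basic cleaning
    sales = pd.concat([df_csv, df_excel, df_json], ignore_index=True)
    sales = sales.drop_duplicates().dropna(subset=["sales_amount"])    # assumed column name

    print(sales.describe())
    print(sales.groupby("product_category")["sales_amount"].sum())    # assumed column names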
Conclusion: In this assignment, we have explored the fundamentals of data loading, storage, and
file formats. We have learned how to load and analyze sales data from various file formats,
perform data cleaning, transformation, and conduct exploratory data analysis. This knowledge is
essential for data analysts and data scientists working with diverse datasets from different
sources and formats.
Assignment 8:
Title of the Assignment: Interacting with Web APIs
Problem Statement: Analyzing Weather Data from OpenWeatherMap API
Dataset: Weather data retrieved from OpenWeatherMap API
Description: The goal is to interact with the OpenWeatherMap API to retrieve weather data
for a specific location and perform data modelling and visualization to analyze weather patterns
over time.
Tasks to Perform:
1. Register and obtain API key from OpenWeatherMap.
2. Interact with the OpenWeatherMap API using the API key to retrieve weather data for a
specific location.
3. Extract relevant weather attributes such as temperature, humidity, wind speed, and
precipitation from the API response.
4. Clean and preprocess the retrieved data, handling missing values or inconsistent formats.
5. Perform data modelling to analyze weather patterns, such as calculating average temperature,
maximum/minimum values, or trends over time.
6. Visualize the weather data using appropriate plots, such as line charts, bar plots, or scatter
plots, to represent temperature changes, precipitation levels, or wind speed
variations.
7. Apply data aggregation techniques to summarize weather statistics by specific time periods
(e.g., daily, monthly, seasonal).
8. Incorporate geographical information, if available, to create maps or geospatial visualizations
representing weather patterns across different locations.
9. Explore and visualize relationships between weather attributes, such as temperature and
humidity, using correlation plots or heatmaps.
Objective of the Assignment: Students should be able to access and retrieve data from a web
API, clean and preprocess the data, and perform data modelling and visualization to analyze
weather patterns.
Web APIs (Application Programming Interfaces) provide a valuable way to access data from
various sources, including weather data from services like OpenWeatherMap. In this assignment,
we will follow these steps:
Learn how to use Python to make HTTP requests to the OpenWeatherMap API. We'll cover
different HTTP methods and how to pass parameters to retrieve specific weather data for a
location.
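A minimal sketch of such a request with the requests library, using OpenWeatherMap's current-weather endpoint; the
city, the units parameter, and the placeholder YOUR_API_KEY are illustrative, and the response fields shown follow
the current-weather JSON documented by OpenWeatherMap.

    # Minimal sketch: request current weather and extract a few attributes
    import requests

    API_KEY = "YOUR_API_KEY"                      # key obtained in task 1 (placeholder)
    url = "https://api.openweathermap.org/data/2.5/weather"
    params = {"q": "Pune,IN", "appid": API_KEY, "units": "metric"}

    response = requests.get(url, params=params, timeout=10)
    data = response.json()

    # Extract relevant weather attributes from the JSON response
    print("Temperature (°C):", data["main"]["temp"])
    print("Humidity (%):", data["main"]["humidity"])
    print("Wind speed (m/s):", data["wind"]["speed"])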
Conclusion: In this assignment, we have explored the process of interacting with web APIs,
specifically the OpenWeatherMap API, to retrieve and analyze weather data. Students have
learned how to make HTTP requests, clean and preprocess data, perform data modelling, and
create data visualizations to gain insights into local weather conditions. This knowledge is
valuable for data analysts and scientists working with external data sources to derive meaningful
conclusions from real-world data.
Assignment 9:
Title of the Assignment: Data Cleaning and Preparation
Problem Statement: Analyzing Customer Churn in a Telecommunications Company
Dataset: "Telecom_Customer_Churn.csv"
Description: The dataset contains information about customers of a telecommunications
company and whether they have churned (i.e., discontinued their services). The dataset
includes various attributes of the customers, such as their demographics, usage patterns, and
account information. The goal is to perform data cleaning and preparation to gain insights
into the factors that contribute to customer churn.
Tasks to Perform:
1. Import the "Telecom_Customer_Churn.csv" dataset.
2. Explore the dataset to understand its structure and content.
3. Handle missing values in the dataset, deciding on an appropriate strategy.
4. Remove any duplicate records from the dataset.
5. Check for inconsistent data, such as inconsistent formatting or spelling variations, and
standardize it.
6. Convert columns to the correct data types as needed.
7. Identify and handle outliers in the data.
8. Perform feature engineering, creating new features that may be relevant to predicting customer
churn.
9. Normalize or scale the data if necessary.
10. Split the dataset into training and testing sets for further analysis.
11. Export the cleaned dataset for future analysis or modelling.
Objective of the Assignment: Students should be able to clean and prepare data for analysis,
including handling missing values, handling outliers, and performing feature engineering to
extract insights into customer churn.
Customer churn analysis is crucial for businesses, especially in the telecommunications industry.
It helps in understanding why customers leave and what factors influence their decision. This
assignment will guide students through the following steps:
Identify and address missing values in the dataset. Techniques such as imputation or removal of
missing data will be discussed.
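A minimal sketch of these cleaning steps, assuming Telecom_Customer_Churn.csv contains a numeric MonthlyCharges
column and a TotalCharges column stored as text (column names are assumptions based on common churn datasets).

    # Minimal sketch: duplicates, type fixes, missing-value imputation, and outlier detection
    import pandas as pd

    df = pd.read_csv("Telecom_Customer_Churn.csv")

    df = df.drop_duplicates()
    df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")       # fix inconsistent types
    df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].median())   # impute missing values

    # Flag outliers in MonthlyCharges using the IQR rule
    q1, q3 = df["MonthlyCharges"].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = df[(df["MonthlyCharges"] < q1 - 1.5 * iqr) | (df["MonthlyCharges"] > q3 + 1.5 * iqr)]
    print("Outlier rows:", len(outliers))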
Conclusion: In this assignment, we have focused on data cleaning and preparation for customer
churn analysis in a telecommunications company. By following these steps, students have
learned how to clean data, handle missing values and outliers, engineer features, and visualize
the data to gain insights into customer behavior. This knowledge is essential for businesses
seeking to reduce churn and improve customer retention.
Assignment 10:
Title of the Assignment: Data Wrangling
Problem Statement: Data Wrangling on Real Estate Market
Dataset: "RealEstate_Prices.csv"
Description: The dataset contains information about housing prices in a specific real estate
market. It includes various attributes such as property characteristics, location, sale prices,
and other relevant features. The goal is to perform data wrangling to gain insights into the
factors influencing housing prices and prepare the dataset for further analysis or modelling.
Tasks to Perform:
1. Import the "RealEstate_Prices.csv" dataset. Clean column names by removing spaces,
special characters, or renaming them for clarity.
2. Handle missing values in the dataset, deciding on an appropriate strategy (e.g.,
imputation or removal).
3. Perform data merging if additional datasets with relevant information are available
(e.g., neighborhood demographics or nearby amenities).
4. Filter and subset the data based on specific criteria, such as a particular time period,
property type, or location.
5. Handle categorical variables by encoding them appropriately (e.g., one-hot encoding or
label encoding) for further analysis.
6. Aggregate the data to calculate summary statistics or derived metrics such as average sale
prices by neighborhood or property type.
7. Identify and handle outliers or extreme values in the data that may affect the analysis or
modelling process.
Objective of the Assignment: Students should be able to perform data wrangling tasks such as
data cleaning, data transformation, and data preparation to make the dataset suitable for analysis
or modelling in the context of the real estate market.
Data wrangling, also known as data munging, is a critical step in the data analysis process. It
involves cleaning, transforming, and preparing the data for analysis. In this assignment, students
will follow these steps:
Transform the data and create new features to extract insights related to housing prices. Feature
engineering can enhance the analysis or modelling process.
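A minimal sketch of the wrangling and feature-engineering steps, assuming RealEstate_Prices.csv has columns along
the lines of 'Sale Price', 'Neighborhood', 'Property Type', and 'Area (sqft)'; all column names are illustrative.

    # Minimal sketch: clean column names, handle missing values, encode, and derive a feature
    import pandas as pd

    df = pd.read_csv("RealEstate_Prices.csv")

    # Clean column names: lowercase, replace spaces and special characters with underscores
    df.columns = (df.columns.str.strip().str.lower()
                            .str.replace(r"[^a-z0-9]+", "_", regex=True).str.strip("_"))

    df = df.dropna(subset=["sale_price"])                      # drop rows without a sale price
    df = pd.get_dummies(df, columns=["property_type"])         # encode a categorical variable

    # Derived feature and a simple aggregation by neighborhood
    df["price_per_sqft"] = df["sale_price"] / df["area_sqft"]
    print(df.groupby("neighborhood")["sale_price"].mean().sort_values(ascending=False).head())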
Conclusion: In this assignment, we have focused on data wrangling for a real estate market
dataset. By following these steps, students have learned how to clean data, transform it, handle
missing values and duplicates, engineer features, and prepare the dataset for analysis or
modelling. Data wrangling is a crucial step in data analysis to ensure that the data is of high
quality and ready for further exploration.
Assignment 11:
Title of the Assignment: Data Visualization using matplotlib
Problem Statement: Analyzing Air Quality Index (AQI) Trends in a City
Dataset: "City_Air_Quality.csv"
Description: The dataset contains information about air quality measurements in a specific city
over a period of time. It includes attributes such as date, time, pollutant levels (e.g., PM2.5,
PM10, CO), and the Air Quality Index (AQI) values. The goal is to use the matplotlib
library to create visualizations that effectively represent the AQI trends and patterns for different
pollutants in the city.
Tasks to Perform:
1. Import the "City_Air_Quality.csv" dataset.
2. Explore the dataset to understand its structure and content.
3. Identify the relevant variables for visualizing AQI trends, such as date, pollutant levels, and
AQI values.
4. Create line plots or time series plots to visualize the overall AQI trend over time.
5. Plot individual pollutant levels (e.g., PM2.5, PM10, CO) on separate line plots to visualize
their trends over time.
6. Use bar plots or stacked bar plots to compare the AQI values across different dates or time
periods.
7. Create box plots or violin plots to analyze the distribution of AQI values for different pollutant
categories.
8. Use scatter plots or bubble charts to explore the relationship between AQI values and pollutant
levels.
9. Customize the visualizations by adding labels, titles, legends, and appropriate colour schemes.
Objective of the Assignment: Students should be able to use Matplotlib to create various types
of data visualizations that provide insights into air quality trends, allowing for the interpretation
and communication of findings.
Data visualization is a powerful tool for interpreting and communicating data patterns
effectively. In this assignment, students will be guided through the following steps:
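A minimal sketch of two of the requested visualizations with matplotlib, assuming City_Air_Quality.csv has Date,
AQI, PM2.5, and PM10 columns (names are assumptions).

    # Minimal sketch: AQI trend line plot and pollutant-level line plots
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("City_Air_Quality.csv", parse_dates=["Date"])

    fig, axes = plt.subplots(2, 1, figsize=(10, 8))

    # Line plot of the overall AQI trend over time
    axes[0].plot(df["Date"], df["AQI"], color="tab:red", label="AQI")
    axes[0].set_title("AQI trend over time")
    axes[0].set_ylabel("AQI")
    axes[0].legend()

    # Line plots of individual pollutant levels
    for col in ["PM2.5", "PM10"]:
        axes[1].plot(df["Date"], df[col], label=col)
    axes[1].set_title("Pollutant levels over time")
    axes[1].set_ylabel("Concentration")
    axes[1].legend()

    plt.tight_layout()
    plt.show()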
Conclusion: In this assignment, we have explored data visualization techniques using Matplotlib
to analyze air quality trends in a specific city. By creating line plots, bar charts, and box plots,
students have learned how to effectively represent AQI data, making it easier to identify patterns
and communicate findings related to air quality in the city. Data visualization is a crucial skill for
data analysts and scientists to derive insights from complex datasets.
Assignment 12:
Title of the Assignment: Data Aggregation
Problem Statement: Analyzing Sales Performance by Region in a Retail Company
Dataset: "Retail_Sales_Data.csv"
Description: The dataset contains information about sales transactions in a retail company. It
includes attributes such as transaction date, product category, quantity sold, and sales
amount. The goal is to perform data aggregation to analyze the sales performance by region
and identify the top-performing regions.
Tasks to Perform:
1. Import the "Retail_Sales_Data.csv" dataset.
2. Explore the dataset to understand its structure and content.
3. Identify the relevant variables for aggregating sales data, such as region, sales amount, and
product category.
4. Group the sales data by region and calculate the total sales amount for each region.
5. Create bar plots or pie charts to visualize the sales distribution by region.
6. Identify the top-performing regions based on the highest sales amount.
7. Group the sales data by region and product category to calculate the total sales amount for
each combination.
8. Create stacked bar plots or grouped bar plots to compare the sales amounts across different
regions and product categories.
Objective of the Assignment: Students should be able to aggregate and summarize data to gain
insights into sales performance by region, allowing for the identification of top-performing
regions within the retail company.
Data aggregation is essential for summarizing and analyzing data at a higher level, providing
insights into overall performance. In this assignment, students will follow these steps:
Aggregate the sales data for each region, calculating metrics such as total sales amount, total
quantity sold, average sales, etc. This step involves using aggregation functions like sum, mean,
or count.
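A minimal sketch of the aggregation and plotting steps, assuming Retail_Sales_Data.csv has Region, Product_Category,
and Sales_Amount columns (names are assumptions).

    # Minimal sketch: group-by aggregation of sales by region and category
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("Retail_Sales_Data.csv")

    # Total sales per region, sorted to reveal the top performers
    region_sales = df.groupby("Region")["Sales_Amount"].sum().sort_values(ascending=False)
    print(region_sales)

    # Total sales per region and product category
    combo_sales = df.groupby(["Region", "Product_Category"])["Sales_Amount"].sum().unstack()

    region_sales.plot(kind="bar", title="Total sales by region")
    combo_sales.plot(kind="bar", stacked=True, title="Sales by region and product category")
    plt.show()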
Assignment 13:
Title of the Assignment: Time Series Data Analysis
Problem statement: Analysis and Visualization of Stock Market Data
Dataset: "Stock_Prices.csv"
Description: The dataset contains historical stock price data for a particular company over a
period of time. It includes attributes such as date, closing price, volume, and other relevant
features. The goal is to perform time series data analysis on the stock price data to identify
trends, patterns, and potential predictors, as well as build models to forecast future stock prices.
Tasks to Perform:
1. Import the "Stock_Prices.csv" dataset.
2. Explore the dataset to understand its structure and content.
3. Ensure that the date column is in the appropriate format (e.g., datetime) for time series
analysis.
4. Plot line charts or time series plots to visualize the historical stock price trends over time.
5. Calculate and plot moving averages or rolling averages to identify the underlying trends and
smooth out noise.
6. Perform seasonality analysis to identify periodic patterns in the stock prices, such as weekly,
monthly, or yearly fluctuations.
7. Analyze and plot the correlation between the stock prices and other variables, such as trading
volume or market indices.
8. Use autoregressive integrated moving average (ARIMA) models or exponential smoothing
models to forecast future stock prices.
Objective of the Assignment: Students should be able to analyze time series data, identify
trends, patterns, and potential predictors in stock prices, and build forecasting models for future
stock price prediction.
Time series data analysis is essential for understanding and predicting trends in sequential data.
In this assignment, students will follow these steps:
Decompose the time series data into its components, such as trend, seasonality, and residuals.
This step helps in understanding the underlying patterns in the data.
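A minimal sketch of the analysis steps, assuming Stock_Prices.csv has Date and Close columns and that the statsmodels
package is installed; the decomposition period and ARIMA order are illustrative choices, not tuned values.

    # Minimal sketch: rolling average, seasonal decomposition, and an ARIMA forecast
    import pandas as pd
    from statsmodels.tsa.seasonal import seasonal_decompose
    from statsmodels.tsa.arima.model import ARIMA

    df = pd.read_csv("Stock_Prices.csv", parse_dates=["Date"], index_col="Date").sort_index()
    close = df["Close"]

    # Rolling average to smooth out short-term noise
    rolling_30 = close.rolling(window=30).mean()
    print("30-day rolling mean (last values):\n", rolling_30.tail())

    # Decompose into trend, seasonality, and residuals (period is an assumption)
    decomposition = seasonal_decompose(close, model="additive", period=30)
    print("Trend component (first values):\n", decomposition.trend.dropna().head())

    # Simple ARIMA model and a 10-step-ahead forecast (order chosen for illustration)
    model = ARIMA(close, order=(5, 1, 0)).fit()
    print(model.forecast(steps=10))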
Conclusion: In this assignment, we have explored time series data analysis for historical stock
price data. Students have learned how to perform EDA, decompose time series data, build
forecasting models, evaluate model performance, and visualize stock price trends. Time series
data analysis is essential for investors and analysts to make informed decisions in the stock
market.