EDA – Unit-1
Prerequisites of the Subject:
Basic Statistics and Probability: To understand data distributions, variability, and
statistical testing.
Python Programming: For hands-on implementation of EDA tasks using libraries like
Pandas, Matplotlib, and Seaborn.
Mathematics (Linear Algebra, Algebra): For understanding data structures and numerical
computations.
Basic Data Handling Skills: Knowledge of Excel or basic data tools to manipulate and
visualize data.
Future Link with Other Subjects in Curriculum:
EDA serves as a foundation for the following advanced topics:
Machine Learning (ML): EDA is critical for data preprocessing, feature engineering, and
model diagnostics.
Data Visualization: Builds upon the visual aspects introduced during EDA.
Big Data Analytics: EDA introduces basic data handling and exploration techniques that
also apply to larger datasets.
Data Mining and Pattern Recognition: Uses insights found during EDA for deeper pattern
detection.
Artificial Intelligence (AI): Clean and well-explored data is crucial for building
intelligent systems.
Applications of the Subject:
EDA is used in nearly every field where data is involved. Some key application areas include:
Healthcare: Discovering patterns in patient records to predict disease outbreaks.
Finance: Analyzing stock market trends or customer credit behavior.
Marketing: Customer segmentation, campaign performance evaluation.
Manufacturing: Quality control and defect analysis.
Retail: Inventory and sales trend analysis.
Social Media: Analyzing user engagement and sentiment trends.
Course Objectives and Course Outcomes
Course Objectives:
To introduce students to the techniques of exploratory data analysis using statistical and
computational methods.
To develop the ability to prepare raw datasets for further analysis or modeling.
To teach how to extract meaningful insights from data and visualize them using tools such
as Python and Excel.
Course Outcomes (COs):
1. Understand and explain the importance of EDA in the data science workflow.
2. Apply statistical and visual methods to explore datasets.
3. Identify and handle data quality issues like missing values and outliers.
4. Use Python or Excel to perform different types of EDA.
5. Interpret the output of EDA to inform future analysis or modeling.
Example: After the course, a student should be able to use Python to plot the sales trend of a
product with line and bar plots and interpret what the plots show.
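The example outcome above could look roughly like the following sketch. The sales figures are invented for illustration, and the helper names (month_over_month_change, plot_sales_trend) are hypothetical, not part of any library. Matplotlib is assumed to be installed for the plotting step.

```python
# Hypothetical monthly unit sales for one product (made-up numbers).
monthly_sales = {
    "Jan": 120, "Feb": 135, "Mar": 150,
    "Apr": 145, "May": 170, "Jun": 190,
}


def month_over_month_change(sales):
    """Return the change in sales from each month to the next."""
    months = list(sales.keys())
    values = list(sales.values())
    return {
        months[i]: values[i] - values[i - 1]
        for i in range(1, len(values))
    }


def plot_sales_trend(sales, changes):
    """Draw a line plot of the trend and a bar plot of the monthly change."""
    # Imported inside the function so the numeric part of this sketch
    # still runs on a machine without Matplotlib installed.
    import matplotlib.pyplot as plt

    fig, (ax_line, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

    # Line plot: shows the overall direction of sales over time.
    ax_line.plot(list(sales.keys()), list(sales.values()), marker="o")
    ax_line.set_title("Monthly sales (line plot)")
    ax_line.set_xlabel("Month")
    ax_line.set_ylabel("Units sold")

    # Bar plot: makes individual rises and dips easy to compare.
    ax_bar.bar(list(changes.keys()), list(changes.values()))
    ax_bar.set_title("Change vs. previous month (bar plot)")
    ax_bar.set_xlabel("Month")
    ax_bar.set_ylabel("Change in units")

    fig.tight_layout()
    plt.show()


changes = month_over_month_change(monthly_sales)
print(changes)
```

Calling plot_sales_trend(monthly_sales, changes) displays both plots. With these made-up numbers, the line plot shows a generally rising trend, while the bar plot makes the single dip in April stand out.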
Introduction to Exploratory Data Analysis:
EDA is the first step in data analysis that focuses on summarizing, visualizing, and
understanding the structure of your dataset before applying any modeling or inference
techniques.
EDA is not just about data cleaning; it is a mindset of approaching the data with curiosity
and exploring it iteratively to understand it better.
EDA is like exploring your data for the first time. You’re trying to see what’s going on without
jumping to conclusions.
Exploratory Data Analysis is the preliminary step in data analysis focused on understanding
datasets through summarization, visualization, and pattern detection.
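A minimal sketch of what such a first look could involve, using only the Python standard library. The records, column names, and helper functions (missing_counts, numeric_summary) are made up for illustration; in practice a library such as Pandas would usually do this work.

```python
import statistics
from collections import Counter

# Hypothetical raw records; None marks a missing value.
records = [
    {"age": 34, "city": "Pune"},
    {"age": 29, "city": "Delhi"},
    {"age": None, "city": "Pune"},
    {"age": 41, "city": "Mumbai"},
    {"age": 29, "city": None},
    {"age": 52, "city": "Pune"},
]


def missing_counts(rows):
    """Count missing (None) entries per column."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None:
                counts[column] += 1
    return dict(counts)


def numeric_summary(values):
    """Basic summary statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


# Drop missing entries before summarizing each column.
ages = [r["age"] for r in records if r["age"] is not None]
cities = [r["city"] for r in records if r["city"] is not None]

missing = missing_counts(records)      # data quality check
age_summary = numeric_summary(ages)    # summarization of a numeric column
city_frequency = Counter(cities)       # frequency table of a categorical column

print("Missing values:", missing)
print("Age summary:", age_summary)
print("City frequency:", city_frequency.most_common())
```

Even on a toy dataset like this, the three outputs answer the first questions EDA asks: how complete the data is, what a typical value looks like, and which categories dominate.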
Kinds of models that can be built after the EDA process:
1-Predictive Modeling
To predict a continuous numeric value based on patterns in the data, e.g. house prices or
sales figures.
2-Classification Modeling
To classify items into categories, e.g. spam vs. not spam, fraud vs. not fraud.
3-Clustering (Unsupervised Learning)
To group similar items when you don’t know the labels beforehand.
4-Dimensionality Reduction
To simplify the dataset by reducing the number of features while keeping the important
information, e.g. when your data has too many variables or is hard to visualize.
5-Time Series Modeling
To predict values over time, e.g. sales next month, stock prices.
6-Anomaly Detection
To find unusual patterns or outliers in your data, e.g. for fraud detection or spotting
system failures.
7-Deep Learning Models
To solve more complex problems involving unstructured data, e.g. image recognition or
natural language processing (NLP).
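As a small illustration of how an EDA finding can point toward one of these model types, the sketch below flags outliers with the interquartile range (IQR) rule, often called Tukey's fences. This is a simple rule-based check, not a trained anomaly detection model; the transaction amounts and the function name iqr_outliers are hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical transaction amounts with one suspiciously large value.
transactions = [52, 48, 50, 47, 53, 49, 51, 50, 400, 46]

flagged = iqr_outliers(transactions)
print("Flagged as unusual:", flagged)
```

If a check like this flags values during EDA, that is a hint that a dedicated anomaly detection model may be worth building, or that the flagged records need cleaning before any other model is trained.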
Advantages of EDA:
Improves Model Quality: Better feature engineering and data understanding lead to more
accurate models.
Identifies Hidden Trends: Reveals patterns that are not visible in the raw data.
Saves Cost: Detecting data issues early reduces costly downstream errors.
COMMAND
act as an expert,
[Link] level: BTech graduation
[Link] format: descriptive, detail
3. book reference: "Hands-On Exploratory Data Analysis with Python, Suresh Kumar Mukhiya, Usman Ahmed,
Packt Publication, 2020", "Data Science Fundamentals and Practical Approaches, Gypsy Nandi, Rupam
Sharma, BPB Publications, 2020", and you can collect data from other sources as well.
this is syllabus: "Introduction to exploratory data analysis. Types of exploratory data analysis-Descriptive,
Inferential, Visual, Quantitative. Phases and steps involved in EDA. Advantages, limitations, and
application areas of EDA."
Under this syllabus, cover following points:
What is EDA?
Importance of EDA in the data science pipeline
Historical background (John Tukey’s role)
Comparison with Confirmatory Data Analysis (CDA)
Real-world motivation/examples
- Descriptive EDA: Summary statistics, frequency tables
- Inferential EDA: Confidence intervals, basic hypothesis testing
- Visual EDA: Use of plots (histogram, boxplot, scatter plot)
- Quantitative EDA: Correlation analysis, data distribution metrics
Phases and steps involved in EDA.
● Data collection and understanding
● Data cleaning (missing values, outliers)
● Data transformation (scaling, encoding)
● Feature analysis and selection
● Pattern discovery and initial insights
● Advantages: Improves model quality, finds hidden trends
● Limitations: Time-consuming, may not reveal deep insights without context
● Applications: Healthcare, finance, manufacturing, etc.
● Case studies or industry scenarios where EDA plays a crucial role