Lecture 1 Exploratory Data Analysis

Exploratory Data Analysis

Notebook: lecture-1-exploratory-data-analysis
What is Exploratory Data Analysis?
Loosely speaking, any method of looking at data that does not
include formal statistical modelling and inference falls under the
term exploratory data analysis.
Why do Exploratory Data Analysis?

• detection of mistakes
• checking of assumptions
• preliminary selection of appropriate models
• determining relationships among the explanatory variables, and
assessing the direction and rough size of relationships between
explanatory and outcome variables.
• People find looking at large lists of numbers tedious and overwhelming;
EDA counters this by accentuating the important features of the data
EDA and where it sits
EDA in Python
Rough classes of EDA

EDA methods are classed along two axes: graphical vs. non-graphical, and
univariate vs. multivariate.
Univariate Non-Graphical
• Looking at a single variable gathered from an experiment/population
• Get an idea about the distribution of this value
• Appreciate the ‘sample distribution’
• Make some assertions about what kind of distribution will best fit this
data

Pictures of distributions
Categorical, Ordinal, Interval data
• A categorical variable (sometimes called a nominal variable) is one that
has two or more categories, but there is no intrinsic ordering to the
categories.
• Example – hair colour
• An ordinal variable is similar to a categorical variable. The difference
between the two is that there is a clear ordering of the categories.
• Example – economic groups
• An interval variable is similar to an ordinal variable, except that the
intervals between the values of the numerical variable are equally
spaced.
• Example – evenly spaced price ranges
Quantitative data
• Any data that is represented by amounts, e.g. weight, velocity, strain
Univariate Non-Graphical
Categorical non-graphical representations
• The characteristics of interest for a categorical variable are
simply the range of values and the frequency (or relative
frequency) of occurrence for each value.
• A simple tabulation of the frequency of each category is the best
univariate non-graphical EDA for categorical data.
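
A minimal sketch of such a tabulation, using only the Python standard library. The `hair_colour` sample is made up for illustration; with pandas, `Series.value_counts()` gives the same counts.

```python
from collections import Counter

# Hypothetical sample of a categorical variable (hair colour).
hair_colour = ["brown", "black", "brown", "blonde", "red", "brown", "black", "blonde"]

counts = Counter(hair_colour)                      # frequency of each category
n = len(hair_colour)
rel_freq = {k: v / n for k, v in counts.items()}   # relative frequency

# Print the tabulation, most frequent category first.
for category, count in counts.most_common():
    print(f"{category:<8}{count:>3}{rel_freq[category]:>8.3f}")
```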
Quantitative data representations
• The characteristics of the population distribution of a
quantitative variable are its centre, spread, modality (number of
peaks in the probability distribution function), shape (including
“heaviness of the tails”), and outliers.
Non-graphical representations of quantitative data
• In most situations it is worthwhile to think of univariate non-
graphical EDA as telling you about aspects of the histogram of
the distribution of the variable of interest.
• If the quantitative variable does not have too many distinct
values, a tabulation, as we used for categorical data, will be a
worthwhile univariate, non-graphical technique.
• Mostly, for quantitative variables we are concerned here with
the quantitative numeric (non-graphical) measures which are
the various sample statistics
Descriptors of quantitative data
• Modality

• Central tendency

• Spread

• Skew

• Kurtosis
Modality
• How many peaks are there in the probability density function (pdf)?
Central Tendency (Mean)
• The common, useful measures of central tendency are the
statistics called (arithmetic) mean, median, and sometimes
mode.
• Occasionally other means such as geometric, harmonic,
truncated, or Winsorized means are used as measures of
centrality.
Central Tendency (Median)
• The middle value after all of the values are put in an ordered
list.
• For symmetric distributions, the mean and the median coincide.

https://medium.com/@nhan.tran/mean-median-an-mode-in-statistics-3359d3774b0b
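
The three measures can be compared on a small sample. This is a sketch with made-up numbers: `skewed` contains one extreme value, which pulls the mean away from the median, while in `symmetric` the two coincide.

```python
import statistics

# Hypothetical sample with one extreme value.
skewed = [2, 3, 3, 4, 5, 6, 40]
mean_skewed = statistics.mean(skewed)      # pulled upwards by the value 40
median_skewed = statistics.median(skewed)  # middle value of the ordered list
mode_skewed = statistics.mode(skewed)      # most frequent value

# Hypothetical symmetric sample: mean and median coincide.
symmetric = [1, 2, 3, 4, 5]
mean_symmetric = statistics.mean(symmetric)
median_symmetric = statistics.median(symmetric)

print(mean_skewed, median_skewed, mode_skewed)
print(mean_symmetric, median_symmetric)
```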
Spread (variance and standard deviation)
• Spread is an indicator of how far away from the centre we are still
likely to find data values.

• The standard deviation is simply the square root of the variance.

• For a theoretical Gaussian distribution, the mean plus or minus 1, 2 or 3
standard deviations holds 68.3%, 95.4% and 99.7% of the probability
respectively
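
These coverage figures can be checked against the Gaussian cumulative distribution function. The sketch below uses `statistics.NormalDist` from the standard library; the `coverage` dictionary is just an illustrative name.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Probability mass within k standard deviations of the mean.
coverage = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(f"within {k} sd: {p:.1%}")
```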
Variance and Standard Deviation
• Variances have the very important property that they are
additive for any number of different independent sources of
variation. This property is not shared by the “standard
deviation”.
• The standard deviation has the same units as the original data,
which helps make it more interpretable.
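
The additivity property can be illustrated with a small simulation. This is a sketch under assumed values: two independent Gaussian sources with standard deviations 3 and 4 (so variances 9 and 16), a fixed random seed, and a sample size of 20,000, none of which come from the lecture.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 20_000

# Two independent sources of variation with standard deviations 3 and 4.
x = [rng.gauss(0, 3) for _ in range(n)]
y = [rng.gauss(0, 4) for _ in range(n)]
total = [a + b for a, b in zip(x, y)]

var_x = statistics.variance(x)
var_y = statistics.variance(y)
var_total = statistics.variance(total)
sd_total = statistics.stdev(total)

# Variances add (9 + 16 is about 25); standard deviations do not (3 + 4 is not 5).
print(f"var(x) + var(y) = {var_x + var_y:.2f}, var(x + y) = {var_total:.2f}")
print(f"sd(x) + sd(y)   = {var_x**0.5 + var_y**0.5:.2f}, sd(x + y)  = {sd_total:.2f}")
```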
Spread (Inter-quartile range)
• The quartiles of a population or a sample are the three values
which divide the distribution or observed data into even fourths.
• ¼ values fall below Q1, ½ fall below Q2, ¾ fall below Q3
• IQR = Q3 – Q1
• The IQR is a more robust measure of spread than the variance
or standard deviation.
• The IQR is not affected by extreme outliers as strongly (if at all).
• For normally distributed data only, the IQR approximately
equals 4/3 times the standard deviation.
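
A short sketch of the IQR and its robustness, using `statistics.quantiles` with its default method (quartile conventions differ slightly between tools, so other software may give slightly different values). The data are made up; the last lines compute the IQR of a standard normal distribution to compare with the 4/3 rule of thumb.

```python
import statistics
from statistics import NormalDist

def iqr(values):
    """Return Q3 - Q1 using the default method of statistics.quantiles."""
    q1, _q2, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

clean = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
with_outlier = clean[:-1] + [1000]   # replace the largest value by an extreme one

iqr_clean = iqr(clean)
iqr_outlier = iqr(with_outlier)
sd_clean = statistics.stdev(clean)
sd_outlier = statistics.stdev(with_outlier)

# IQR of a standard normal distribution, in units of its standard deviation.
z = NormalDist()
normal_iqr = z.inv_cdf(0.75) - z.inv_cdf(0.25)

print(f"IQR: {iqr_clean} vs {iqr_outlier}")           # unchanged by the outlier
print(f"SD:  {sd_clean:.2f} vs {sd_outlier:.2f}")     # inflated by the outlier
print(f"normal IQR / sd = {normal_iqr:.3f} (4/3 = {4/3:.3f})")
```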
Spread (Percentiles)
• Percentiles are a more flexible version of quartiles
• We can define any percentiles that we like
• For example the 95th percentile is the value below which 95% of the
values fall
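
As an illustration, the 95th percentile can be read off the 99 cut points returned by `statistics.quantiles(values, n=100)`. The data (the integers 1 to 999) are an arbitrary choice for this sketch.

```python
import statistics

values = list(range(1, 1000))   # hypothetical data: the integers 1..999

cut_points = statistics.quantiles(values, n=100)   # 99 percentile cut points
p95 = cut_points[94]                               # index 94 holds the 95th percentile

# Share of the data that falls below the 95th percentile.
share_below = sum(v < p95 for v in values) / len(values)

print(p95, round(share_below, 3))
```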
Skewness
• Skewness is a measure of asymmetry

• Standard error of skewness:

• If Skew < -2SES: negative skew
• If Skew > 2SES: positive skew
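
The rule of thumb above can be sketched as follows. Two things here are assumptions rather than the lecture's own definitions: skewness is computed as the moment ratio m3 / m2^1.5, and the standard error of skewness is taken as the common large-sample approximation sqrt(6/n), since the slide's formula did not survive. The data are made up.

```python
import math

def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def classify_skew(values):
    # ASSUMPTION: SES approximated by sqrt(6 / n); exact formulas differ slightly.
    ses = math.sqrt(6 / len(values))
    g1 = skewness(values)
    if g1 > 2 * ses:
        return "positive skew"
    if g1 < -2 * ses:
        return "negative skew"
    return "no significant skew"

symmetric = [-3, -2, -1, 0, 1, 2, 3] * 10
right_tailed = [1] * 50 + [2] * 30 + [3] * 15 + [10] * 5
left_tailed = [-v for v in right_tailed]

for name, data in [("symmetric", symmetric), ("right", right_tailed), ("left", left_tailed)]:
    print(f"{name:<10}{skewness(data):>7.2f}  {classify_skew(data)}")
```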
Kurtosis
• Kurtosis is a measure of “peakedness”
relative to a Gaussian shape

• Standard error of kurtosis:

• If Kurtosis < -2SEK: negative kurtosis
• If Kurtosis > 2SEK: positive kurtosis
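
A sketch of the same kind of check for kurtosis. It assumes that "kurtosis" here means excess kurtosis (m4 / m2^2 minus 3, so that a Gaussian scores 0) and approximates the standard error of kurtosis by sqrt(24/n); both are assumptions, since the slide's formula did not survive. The data are made up.

```python
import math

def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (zero for a Gaussian)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3

def classify_kurtosis(values):
    # ASSUMPTION: SEK approximated by sqrt(24 / n); exact formulas differ slightly.
    sek = math.sqrt(24 / len(values))
    g2 = excess_kurtosis(values)
    if g2 > 2 * sek:
        return "positive kurtosis"
    if g2 < -2 * sek:
        return "negative kurtosis"
    return "no significant kurtosis"

flat = list(range(1, 101))                  # uniform-like: light tails
heavy_tailed = [0] * 96 + [-10, 10, -10, 10]  # mostly central, a few extremes

print(round(excess_kurtosis(flat), 3), classify_kurtosis(flat))
print(round(excess_kurtosis(heavy_tailed), 3), classify_kurtosis(heavy_tailed))
```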
Univariate Graphical
Histograms
• Barplot in which each bar represents the frequency (count) or
proportion (count/total count) of cases for a range of values
• The only one of the univariate graphical techniques here that makes
sense for categorical data
• Generally you will choose between about 5 and 30 bins
• It is often worthwhile to try a few different bin sizes/numbers
• It is very instructive to look at multiple samples from the same
population to get a feel for the variation that will be found in
histograms
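
To get a feel for the effect of the bin count without any plotting library, the counts behind a histogram can be computed directly. This is a sketch: `histogram_counts` is a hypothetical helper using equal-width bins, and the sample is simulated with a fixed seed. It assumes the values are not all identical.

```python
import random

def histogram_counts(values, n_bins):
    """Count values in n_bins equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value belongs to the last bin rather than a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

rng = random.Random(42)
sample = [rng.gauss(50, 10) for _ in range(200)]   # hypothetical measurements

# Try a few different bin counts and draw a crude text histogram for each.
for n_bins in (5, 10, 30):
    counts = histogram_counts(sample, n_bins)
    print(f"\n{n_bins} bins")
    for c in counts:
        print("#" * c)
```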
Boxplots
• Boxplots are very good at presenting
information about the central tendency,
symmetry and skew, as well as outliers,
although they can be misleading about
aspects such as multimodality
Outliers
• The term “outlier” is not well defined in statistics, and the
definition varies depending on the purpose and situation. The
“outliers” identified by a boxplot, which could be called “boxplot
outliers” are defined as any points more than 1.5 IQRs above
Q3 or more than 1.5 IQRs below Q1. This does not by itself
indicate a problem with those data points!
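
The boxplot rule can be written out directly. A sketch with made-up data; `boxplot_outliers` is a hypothetical helper, and the quartiles come from the default method of `statistics.quantiles`, which may differ slightly from the convention a given plotting library uses.

```python
import statistics

def boxplot_outliers(values, k=1.5):
    """Return points beyond k IQRs from the quartiles, plus the two fences."""
    q1, _q2, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [v for v in values if v < lower or v > upper]
    return flagged, (lower, upper)

data = [2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 30]
outliers, fences = boxplot_outliers(data)

print("fences:", fences)
print("boxplot outliers:", outliers)   # flagged for inspection, not automatically wrong
```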
Violin plots
• A violin plot is like a box plot that also
shows the peaks in the data. It is used to
visualize the distribution of numerical data.
Whereas a box plot shows only summary
statistics, a violin plot depicts both the
summary statistics and the density of each
variable.
Multivariate Non-Graphical
Multivariate non-graphical EDA
• Cross tabulation

• Statistics by variable

• Correlation stats of variables

Cross tabulation
• Cross-tabulation is the basic bivariate non-graphical EDA
technique
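
A cross-tabulation simply counts each combination of levels of two categorical variables. The sketch below does this with `collections.Counter` on made-up hair and eye colour data; in pandas, `pd.crosstab` produces the same table.

```python
from collections import Counter

# Hypothetical paired observations of two categorical variables.
hair = ["brown", "brown", "black", "blonde", "blonde", "black", "brown", "blonde"]
eyes = ["brown", "blue", "brown", "blue", "blue", "brown", "brown", "green"]

table = Counter(zip(hair, eyes))   # count of each (hair, eyes) combination

row_levels = sorted(set(hair))
col_levels = sorted(set(eyes))

# Print the table with hair colour as rows and eye colour as columns.
print(f"{'':<8}" + "".join(f"{c:>7}" for c in col_levels))
for r in row_levels:
    print(f"{r:<8}" + "".join(f"{table[(r, c)]:>7}" for c in col_levels))
```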
Correlation of variables
• Cramér’s V measures the strength of association
between nominal categorical variables. Recall
that nominal variables are ones that take on
category labels but have no natural ordering
• The value of Cramér’s V ranges from 0 to 1,
with 0 indicating no association between the
variables and 1 indicating a perfect association
between the variables.

• https://www.geeksforgeeks.org/how-to-calculate-cramers-v-in-python/
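
A minimal sketch of the calculation from a contingency table, using the standard definition V = sqrt(chi² / (n · (min(rows, cols) − 1))) without any bias correction. `cramers_v` is a hypothetical helper, the tables are made up, and the function assumes at least two rows and columns with no empty row or column.

```python
import math

def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    # Pearson chi-squared statistic: sum of (observed - expected)^2 / expected.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))

print(cramers_v([[10, 10], [10, 10]]))   # no association
print(cramers_v([[8, 2], [2, 8]]))       # moderate association
print(cramers_v([[10, 0], [0, 10]]))     # perfect association
```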
Quantitative variable statistics (covariance)
• For two quantitative variables, the basic statistics of interest are
the sample covariance and/or sample correlation

• Positive covariance values suggest that when one measurement is
above the mean the other will probably also be above the mean,
and vice versa.
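
A sketch of the sample covariance, written out by hand with the usual n − 1 divisor. `sample_covariance` is a hypothetical helper and the data are made up: `y_up` tends to rise with `x`, while `y_down` tends to fall.

```python
def sample_covariance(x, y):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 5, 4, 5]     # tends to be above its mean when x is above its mean
y_down = [5, 4, 5, 4, 2]   # tends to be below its mean when x is above its mean

cov_up = sample_covariance(x, y_up)
cov_down = sample_covariance(x, y_down)
print(cov_up, cov_down)
```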
Quantitative variable statistics (Correlation)
• Correlation is closely related to covariance
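
One way to see the relationship: the Pearson correlation is the covariance divided by the product of the two standard deviations, which removes the units and bounds the result between −1 and 1. The sketch below uses made-up data and a hypothetical helper `pearson_r`; rescaling `x` changes the covariance but not the correlation.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # The (n - 1) divisors of covariance and variances cancel out.
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
x_rescaled = [100 * a for a in x]   # same variable in different units

r = pearson_r(x, y)
r_rescaled = pearson_r(x_rescaled, y)
print(round(r, 4), round(r_rescaled, 4))
```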
Covariance and correlation matrices
• When we have many quantitative variables the most common
non-graphical EDA technique is to calculate all of the pairwise
covariances and/or correlations and assemble them into a
matrix.
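
A sketch of assembling such a matrix by hand. The three variables and their values are made up, and `pearson_r` is a hypothetical helper; with pandas, `DataFrame.corr()` and `DataFrame.cov()` do this in one call.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical dataset: one list of measurements per quantitative variable.
columns = {
    "height": [150, 160, 170, 180, 190],
    "weight": [50, 58, 69, 77, 91],
    "lap_time": [70, 66, 65, 60, 58],
}

# All pairwise correlations, stored as a nested dictionary.
names = list(columns)
corr = {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}

print(f"{'':<10}" + "".join(f"{b:>10}" for b in names))
for a in names:
    print(f"{a:<10}" + "".join(f"{corr[a][b]:>10.3f}" for b in names))
```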
Graphical Multivariate
Univariate plots by category
• When we have one categorical (usually
explanatory) and one quantitative (usually
outcome) variable, graphical EDA usually
takes the form of “conditioning” on the
categorical random variable. This simply
indicates that we focus on all of the
subjects with a particular level of the
categorical random variable, then make
plots of the quantitative variable for those
subjects.
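
The conditioning step itself is just a grouping. The sketch below groups a quantitative variable by the levels of a categorical one and prints the per-group summaries that side-by-side boxplots would display. The records (an economic group and a made-up outcome value) are purely illustrative, and no plot is drawn here.

```python
import statistics
from collections import defaultdict

# Hypothetical records: (categorical level, quantitative outcome).
records = [
    ("low", 21), ("low", 25), ("low", 23),
    ("middle", 30), ("middle", 34), ("middle", 32),
    ("high", 45), ("high", 41), ("high", 49),
]

# Condition on the categorical variable: collect the outcomes for each level.
groups = defaultdict(list)
for level, value in records:
    groups[level].append(value)

summary = {
    level: {
        "n": len(values),
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
    }
    for level, values in groups.items()
}

for level, stats in summary.items():
    print(level, stats)
```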
Scatterplots
• For two quantitative variables, the basic
graphical EDA technique is the scatterplot
which has one variable on the x-axis, one
on the y-axis and a point for each case in
your dataset. If one variable is
explanatory and the other is outcome, it is
a very, very strong convention to put the
outcome on the y (vertical) axis.
• In a scatterplot we can increase the
dimensionality with things like marker
size, colour, shape etc…but don’t go too
far or you will simply overload the viewer
Scatterplots

• With many variables we can make a multiple-scatterplot
• Easy with seaborn and pandas
Use graphical *and* non-graphical measures
Anscombe’s quartet. All four datasets have nearly identical
summary statistics, yet they look very different when plotted
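
The summary statistics of Anscombe's quartet can be checked numerically. The sketch below hard-codes the four published datasets and computes the mean and variance of each variable and their correlation; the helper `pearson_r` and the `summaries` dictionary are illustrative names. Only plotting reveals how different the four datasets are.

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summaries = {
    name: {
        "mean_x": statistics.mean(x),
        "var_x": statistics.variance(x),
        "mean_y": statistics.mean(y),
        "var_y": statistics.variance(y),
        "r": pearson_r(x, y),
    }
    for name, (x, y) in quartet.items()
}

for name, s in summaries.items():
    print(name, {k: round(v, 2) for k, v in s.items()})
```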
Summary
• Exploratory data analysis is a *very* important first step in data
science
• We can detect outliers, check assumptions, make hypotheses about
the data
• We can use graphical and non-graphical measures – usually a
combination of both is optimal
• Methods differ depending on the data types, categorical or
quantitative
• Python has many tools that make this task easy and aesthetically
pleasing
