Lecture 1 Exploratory Data Analysis

Uploaded by

124ll124
Copyright
© © All Rights Reserved

Exploratory Data Analysis

Notebook: lecture-1-exploratory-data-analysis
What is Exploratory Data Analysis?
Loosely speaking, any method of looking at data that does not
include formal statistical modelling and inference falls under the
term exploratory data analysis.
Why do Exploratory Data Analysis?

• detection of mistakes
• checking of assumptions
• preliminary selection of appropriate models
• determining relationships among the explanatory variables, and
assessing the direction and rough size of relationships between
explanatory and outcome variables.
• People find looking at large lists of numbers tedious and overwhelming;
EDA counters this by summarising the data and highlighting its key features
EDA and where it sits
EDA in Python
Rough classes of EDA

• Graphical vs. Non-Graphical
• Univariate vs. Multivariate
Univariate Non-Graphical
• Looking at a single variable measured in an experiment/population
• Get an idea about the distribution of this value
• Appreciate the ‘sample distribution’
• Make some assertions about what kind of distribution will best fit this
data

Pictures of distributions
Categorical, Ordinal, Interval data
• A categorical variable (sometimes called a nominal variable) is one that
has two or more categories, but there is no intrinsic ordering to the
categories.
• Example – hair colour
• An ordinal variable is similar to a categorical variable. The difference
between the two is that there is a clear ordering of the categories.
• Example – economic groups
• An interval variable is similar to an ordinal variable, except that the
intervals between the values of the numerical variable are equally
spaced.
• Example – evenly spaced price ranges
Quantitative data
• Any data that is represented by amounts, e.g. weight, velocity, strain etc.
Univariate Non-Graphical
Categorical non-graphical representations
• The characteristics of interest for a categorical variable are
simply the range of values and the frequency (or relative
frequency) of occurrence for each value.
• A simple tabulation of the frequency of each category is the best
univariate non-graphical EDA for categorical data.
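The tabulation described above can be sketched in pandas as follows. The hair-colour values are made up for illustration; `value_counts` gives the frequency of each category and `normalize=True` gives the relative frequency.

```python
import pandas as pd

# Hypothetical sample of a categorical variable (hair colour).
hair = pd.Series(
    ["brown", "black", "brown", "blonde", "brown", "black", "red", "brown"]
)

counts = hair.value_counts()                      # frequency of each category
proportions = hair.value_counts(normalize=True)   # relative frequency

# Combine both into one frequency table (rows are aligned on the category label).
table = pd.DataFrame({"count": counts, "proportion": proportions})
print(table)
```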
Quantitative data representations
• The characteristics of the population distribution of a
quantitative variable are its centre, spread, modality (number of
peaks in the probability distribution function), shape (including
“heaviness of the tails”), and outliers.
Non-graphical representations of quantitative data
• In most situations it is worthwhile to think of univariate non-
graphical EDA as telling you about aspects of the histogram of
the distribution of the variable of interest.
• If the quantitative variable does not have too many distinct
values, a tabulation, as we used for categorical data, will be a
worthwhile univariate, non-graphical technique.
• For quantitative variables we are mostly concerned with
numerical (non-graphical) summaries, namely the various
sample statistics
Descriptors of quantitative data
• Modality

• Central tendency

• Spread

• Skew

• Kurtosis
Modality
• How many peaks are there in the pdf?
Central Tendency (Mean)
• The common, useful measures of central tendency are the
statistics called (arithmetic) mean, median, and sometimes
mode.
• Occasionally other means such as geometric, harmonic,
truncated, or Winsorized means are used as measures of
centrality.
Central Tendency (Median)
• The middle value after all of the values are put in an ordered
list.
• For symmetric distributions, the mean and the median coincide.

https://medium.com/@nhan.tran/mean-median-an-mode-in-statistics-3359d3774b0b
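As a small illustration of the three measures, the sketch below uses made-up numbers: one sample with a single large value and one symmetric sample. It shows how the mean is pulled towards the extreme value while the median is not.

```python
import pandas as pd

# Hypothetical sample with one extreme value (40).
values = pd.Series([2, 3, 3, 4, 5, 6, 40])

mean = values.mean()           # sum / count, pulled upwards by the 40
median = values.median()       # middle value of the ordered list
mode = values.mode().tolist()  # most frequent value(s); can be more than one

# Hypothetical symmetric sample: mean and median coincide.
symmetric = pd.Series([1, 2, 3, 4, 5])
sym_mean, sym_median = symmetric.mean(), symmetric.median()

print(mean, median, mode, sym_mean, sym_median)
```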
Spread (variance and standard deviation)
• Spread is an indicator of how far away from the centre we are still
likely to find data values.

• The standard deviation is simply the square root of the variance.


• For a theoretical Gaussian distribution, we learned in the previous
chapter that the mean plus or minus 1, 2 or 3 standard deviations holds
68.3%, 95.4% and 99.7% of the probability respectively
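The sketch below checks these figures on simulated data. It assumes the usual sample variance with an n - 1 denominator (the slide's own formula did not survive extraction), and the mean of 10 and standard deviation of 2 are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=2.0, size=100_000)  # simulated Gaussian data

mean = x.mean()
variance = x.var(ddof=1)   # sample variance: sum of squared deviations / (n - 1)
std = np.sqrt(variance)    # standard deviation is the square root of the variance

# Fraction of values within 1, 2 and 3 standard deviations of the mean.
coverage = {k: float(np.mean(np.abs(x - mean) <= k * std)) for k in (1, 2, 3)}
print(coverage)
```

For real, non-Gaussian data the three fractions can differ noticeably from the theoretical values, which is itself a useful check.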
Variance and Standard Deviation
• Variances have the very important property that they are
additive for any number of different independent sources of
variation. This property is not shared by the “standard
deviation”.
• The standard deviation has the same units as the original data,
which helps make it more interpretable.
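The additivity property can be illustrated with two simulated, independent sources of variation. The standard deviations of 3 and 4 are arbitrary choices: the variances add (9 + 16 = 25), while the standard deviations do not (the total is about 5, not 7).

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(0.0, 3.0, 200_000)  # source 1: sd 3, variance 9
b = rng.normal(0.0, 4.0, 200_000)  # source 2: sd 4, variance 16 (independent of a)
total = a + b

var_sum = a.var(ddof=1) + b.var(ddof=1)  # sum of the individual variances
var_total = total.var(ddof=1)            # variance of the combined quantity
sd_sum = a.std(ddof=1) + b.std(ddof=1)   # about 7
sd_total = total.std(ddof=1)             # about 5

print(var_sum, var_total, sd_sum, sd_total)
```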
Spread (Inter-quartile range)
• The quartiles of a population or a sample are the three values
which divide the distribution or observed data into even fourths.
• ¼ of the values fall below Q1, ½ fall below Q2 (the median), ¾ fall below Q3
• IQR = Q3 – Q1
• The IQR is a more robust measure of spread than the variance
or standard deviation.
• The IQR is not affected by extreme outliers as strongly (if at all).
• For normally distributed data only, the IQR approximately
equals 4/3 times the standard deviation.
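A minimal sketch of these points on simulated normal data (the mean of 50 and standard deviation of 10 are arbitrary). `np.percentile` uses linear interpolation by default; other quartile conventions give slightly different values.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(50.0, 10.0, 100_000)

q1, q2, q3 = np.percentile(x, [25, 50, 75])  # the three quartiles
iqr = q3 - q1
ratio = iqr / x.std(ddof=1)  # about 1.35 for normal data, close to 4/3

# Robustness: add one extreme value and recompute both measures of spread.
x_out = np.append(x, 1e6)
q1_out, q3_out = np.percentile(x_out, [25, 75])
iqr_out = q3_out - q1_out    # barely changes
std_out = x_out.std(ddof=1)  # inflated enormously by the single outlier

print(iqr, ratio, iqr_out, std_out)
```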
Spread (Percentiles)
• Percentiles are a more flexible version of quartiles
• We can define any percentiles that we like
• For example the 95th percentile is the value below which 95% of the
values fall
Skewness
• Skewness is a measure of asymmetry

• Standard error of skewness:


• If Skew < -2SES: negative skew
• If Skew > 2SES: positive skew
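The rule of thumb above can be sketched as follows. The formulas on the slide did not survive extraction, so this assumes the bias-adjusted sample skewness returned by pandas `Series.skew()` and the commonly quoted standard error sqrt(6n(n-1) / ((n-2)(n+1)(n+3))); the helper names are hypothetical.

```python
import numpy as np
import pandas as pd


def standard_error_of_skewness(n: int) -> float:
    """Assumed SES formula for the bias-adjusted sample skewness."""
    return float(np.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3))))


def classify_skew(values) -> str:
    """Apply the 2 x SES rule of thumb to a sample."""
    s = pd.Series(np.asarray(values, dtype=float))
    skew = s.skew()  # bias-adjusted sample skewness
    ses = standard_error_of_skewness(len(s))
    if skew > 2 * ses:
        return "positive skew"
    if skew < -2 * ses:
        return "negative skew"
    return "no significant skew"


rng = np.random.default_rng(3)
right_tailed = rng.exponential(scale=1.0, size=500)  # long right tail
symmetric = np.linspace(-1.0, 1.0, 101)              # perfectly symmetric values

print(classify_skew(right_tailed), classify_skew(symmetric))
```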
Kurtosis
• Kurtosis is a measure of “peakedness”
relative to a Gaussian shape

• Standard error of kurtosis:


• If Kurtosis < -2SEK: negative kurtosis
• If Kurtosis > 2SEK: positive kurtosis
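A similar sketch for kurtosis. It assumes the excess kurtosis returned by pandas `Series.kurt()` (zero for a Gaussian) and the commonly quoted standard error sqrt(24n(n-1)^2 / ((n-3)(n-2)(n+3)(n+5))), since the slide's own formulas are missing; the helper names are hypothetical.

```python
import numpy as np
import pandas as pd


def standard_error_of_kurtosis(n: int) -> float:
    """Assumed SEK formula for the bias-adjusted excess kurtosis."""
    return float(
        np.sqrt(24.0 * n * (n - 1) ** 2 / ((n - 3) * (n - 2) * (n + 3) * (n + 5)))
    )


def classify_kurtosis(values) -> str:
    """Apply the 2 x SEK rule of thumb to a sample."""
    s = pd.Series(np.asarray(values, dtype=float))
    kurt = s.kurt()  # excess kurtosis: 0 for a Gaussian shape
    sek = standard_error_of_kurtosis(len(s))
    if kurt > 2 * sek:
        return "positive kurtosis"
    if kurt < -2 * sek:
        return "negative kurtosis"
    return "no significant kurtosis"


rng = np.random.default_rng(4)
flat = np.linspace(0.0, 1.0, 1000)    # evenly spread values, flatter than a Gaussian
heavy_tailed = rng.laplace(size=2000)  # sharper peak and heavier tails than a Gaussian

print(classify_kurtosis(flat), classify_kurtosis(heavy_tailed))
```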
Univariate Graphical
Histograms
• A bar plot in which each bar represents the frequency (count) or
proportion (count/total count) of cases for a range of values
• The only one of the univariate graphical techniques that makes sense
for categorical data (as a bar plot of the tabulation)
• Generally you will choose between about 5 and 30 bins
• It is often worthwhile to try a few different bin sizes/numbers
• It is very instructive to look at multiple samples from the same
population to get a feel for the variation that will be found in
histograms
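The binning behind a histogram can be sketched with `np.histogram`, which returns the counts and bin edges that a plotting call such as matplotlib's `plt.hist` would draw. The two samples are simulated from the same (arbitrary) standard normal population to show the sample-to-sample variation mentioned above.

```python
import numpy as np

rng = np.random.default_rng(4)
sample_a = rng.normal(0.0, 1.0, 200)  # first sample from the population
sample_b = rng.normal(0.0, 1.0, 200)  # second sample from the same population

# Try a few different numbers of bins on the same sample.
histograms = {}
for bins in (5, 15, 30):
    counts, edges = np.histogram(sample_a, bins=bins)
    histograms[bins] = {
        "counts": counts,                      # frequency per bin
        "proportions": counts / counts.sum(),  # count / total count
        "edges": edges,                        # bins + 1 edge positions
    }

# Re-use the 15-bin edges on the second sample so the two are comparable.
shared_edges = histograms[15]["edges"]
counts_b, _ = np.histogram(sample_b, bins=shared_edges)

print(histograms[15]["counts"])
print(counts_b)
```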
Boxplots
• Boxplots are very good at presenting
information about the central tendency,
symmetry and skew, as well as outliers,
although they can be misleading about
aspects such as multimodality
Outliers
• The term “outlier” is not well defined in statistics, and the
definition varies depending on the purpose and situation. The
“outliers” identified by a boxplot, which could be called “boxplot
outliers” are defined as any points more than 1.5 IQRs above
Q3 or more than 1.5 IQRs below Q1. This does not by itself
indicate a problem with those data points!
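The 1.5 IQR rule can be written out directly. This is a sketch with a hypothetical helper name and made-up data; plotting libraries may use a slightly different quartile convention than the `np.percentile` default used here.

```python
import numpy as np


def boxplot_outliers(values, k=1.5):
    """Return the points beyond k IQRs from the quartiles, plus the two fences."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (values < lower) | (values > upper)
    return values[mask], (lower, upper)


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]  # made-up sample with one extreme value
outliers, fences = boxplot_outliers(data)
print(outliers, fences)
```

Whether the flagged point is an error or a genuine observation still has to be judged from the context of the data.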
Violin plots
• A violin plot is like a box plot that also shows
peaks in the data. It is used to visualize the
distribution of numerical data. Unlike a box
plot, which can only show summary statistics,
a violin plot depicts both summary statistics and
the density of each variable.
Multivariate Non-Graphical
Multivariate non-graphical EDA
• Cross tabulation

• Statistics by variable

• Correlation stats of variables


Cross tabulation
• Cross-tabulation is the basic bivariate non-graphical EDA
technique
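A cross-tabulation of two categorical variables can be sketched with `pd.crosstab`. The hair and eye colours below are made up for illustration; `margins=True` adds row and column totals, and `normalize="index"` turns each row into proportions.

```python
import pandas as pd

# Hypothetical data: two categorical variables recorded for eight people.
df = pd.DataFrame({
    "hair": ["brown", "brown", "black", "blonde", "black", "brown", "blonde", "brown"],
    "eyes": ["brown", "blue", "brown", "blue", "brown", "brown", "blue", "green"],
})

table = pd.crosstab(df["hair"], df["eyes"], margins=True)          # counts with totals
row_props = pd.crosstab(df["hair"], df["eyes"], normalize="index")  # row proportions

print(table)
print(row_props)
```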
Correlation of variables
• Cramer’s V measures the strength of association
between nominal categorical variables. Recall
that nominal variables are ones that take on
category labels but have no natural ordering
• The value for Cramer’s V ranges from 0 to 1,
with 0 indicating no association between the
variables and 1 indicating a perfect association
between the variables.

• https://www.geeksforgeeks.org/how-to-calculate-cramers-v-in-python/
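A minimal sketch of the calculation, assuming the standard definition V = sqrt(chi2 / (n * min(r - 1, c - 1))) without any bias correction. The chi-squared statistic is computed by hand from the cross-tabulation, the helper name `cramers_v` is hypothetical, and both variables need at least two observed categories.

```python
import numpy as np
import pandas as pd


def cramers_v(x, y) -> float:
    """Cramer's V for two nominal variables (no bias correction)."""
    observed = pd.crosstab(pd.Series(x, name="x"), pd.Series(y, name="y"))
    observed = observed.to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts if the two variables were independent.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    r, c = observed.shape
    return float(np.sqrt(chi2 / (n * min(r - 1, c - 1))))


# Made-up examples: y fully determined by x, and y unrelated to x.
x_perfect = ["a", "b", "c"] * 10
y_perfect = ["u", "v", "w"] * 10
x_indep = ["a", "a", "b", "b"] * 5
y_indep = ["u", "v", "u", "v"] * 5

print(cramers_v(x_perfect, y_perfect), cramers_v(x_indep, y_indep))
```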
Quantitative variable statistics (covariance)
• For two quantitative variables, the basic statistics of interest are
the sample covariance and/or sample correlation

• Positive covariance values suggest that when one measurement is
above the mean the other will probably also be above the mean,
and vice versa.
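The sample covariance can be written out by hand and compared with `np.cov`. This sketch assumes the usual definition with an n - 1 denominator (the slide's formula is missing) and uses made-up numbers.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
n = len(x)

# Average product of the paired deviations from each mean, with n - 1 denominator.
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is cov(x, y).
cov_numpy = np.cov(x, y)[0, 1]

print(cov_manual, cov_numpy)
```

Here the products of deviations are mostly positive, so the covariance is positive: above-average x values tend to pair with above-average y values.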
Quantitative variable statistics (Correlation)
• Correlation is closely related to covariance
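The relationship can be sketched as follows, assuming the usual Pearson definition: the covariance divided by the product of the two standard deviations. Unlike covariance, the result does not change when a variable is rescaled (the factor of 100 is an arbitrary unit change).

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

cov_xy = np.cov(x, y)[0, 1]
r_manual = cov_xy / (x.std(ddof=1) * y.std(ddof=1))  # covariance scaled by both sds
r_numpy = np.corrcoef(x, y)[0, 1]

# Change the units of x (e.g. metres to centimetres).
x_cm = 100.0 * x
cov_scaled = np.cov(x_cm, y)[0, 1]     # 100 times larger
r_scaled = np.corrcoef(x_cm, y)[0, 1]  # unchanged

print(r_manual, r_numpy, cov_scaled, r_scaled)
```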
Covariance and correlation matrices
• When we have many quantitative variables the most common
non-graphical EDA technique is to calculate all of the pairwise
covariances and/or correlations and assemble them into a
matrix.
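With pandas, the pairwise matrices come from `DataFrame.cov()` and `DataFrame.corr()`. The data and column names below are simulated purely for illustration: `weight` is constructed to depend on `height`, while `noise` is unrelated to both.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n = 500
height = rng.normal(170.0, 10.0, n)
weight = 0.9 * height + rng.normal(0.0, 8.0, n)  # built to correlate with height
noise = rng.normal(0.0, 1.0, n)                  # unrelated to the other two

df = pd.DataFrame({"height": height, "weight": weight, "noise": noise})

cov_matrix = df.cov()    # variances on the diagonal, covariances elsewhere
corr_matrix = df.corr()  # ones on the diagonal, Pearson correlations elsewhere

print(cov_matrix.round(2))
print(corr_matrix.round(2))
```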
Graphical Multivariate
Univariate plots by category
• When we have one categorical (usually
explanatory) and one quantitative (usually
outcome) variable, graphical EDA usually
takes the form of “conditioning” on the
categorical random variable. This simply
indicates that we focus on all of the
subjects with a particular level of the
categorical random variable, then make
plots of the quantitative variable for those
subjects.
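Conditioning on a categorical variable can be sketched with `groupby` followed by side-by-side boxplots. The groups and values are simulated, and the `Agg` backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
df = pd.DataFrame({
    "group": np.repeat(["A", "B", "C"], 50),  # categorical (explanatory) variable
    "value": np.concatenate([                 # quantitative (outcome) variable
        rng.normal(10.0, 2.0, 50),
        rng.normal(12.0, 2.0, 50),
        rng.normal(15.0, 4.0, 50),
    ]),
})

# Condition on the category: collect the outcome values for each level.
names, samples = [], []
for name, subset in df.groupby("group"):
    names.append(name)
    samples.append(subset["value"].to_numpy())

# One boxplot per level, drawn side by side.
fig, ax = plt.subplots()
boxes = ax.boxplot(samples)
ax.set_xticks(range(1, len(names) + 1))
ax.set_xticklabels(names)
ax.set_xlabel("group")
ax.set_ylabel("value")
```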
Scatterplots
• For two quantitative variables, the basic
graphical EDA technique is the scatterplot
which has one variable on the x-axis, one
on the y-axis and a point for each case in
your dataset. If one variable is
explanatory and the other is outcome, it is
a very, very strong convention to put the
outcome on the y (vertical) axis.
• In a scatterplot we can increase the
dimensionality with things like marker
size, colour, shape etc…but don’t go too
far or you will simply overload the viewer
Scatterplots

• With many variables we can make a scatterplot matrix
(a grid of pairwise scatterplots)
• Easy with seaborn and pandas
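A minimal sketch using `pandas.plotting.scatter_matrix` on simulated data (the column names are made up). seaborn's `pairplot` produces a similar grid. The `Agg` backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(7)
n = 200
a = rng.normal(0.0, 1.0, n)
df = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(0.0, 1.0, n),  # related to a
    "c": rng.normal(0.0, 1.0, n),            # unrelated to a and b
})

# One scatterplot per pair of variables, with histograms on the diagonal.
axes = scatter_matrix(df, diagonal="hist", figsize=(6, 6))
```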
Use graphical *and* non-graphical measures
Anscombe’s quartet. All four datasets have nearly identical
summary statistics (means, variances and correlation), yet
look very different when plotted
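The summary statistics can be checked numerically. The values below are the quartet as it is commonly published, typed in by hand, so treat them as an assumption rather than the lecture's own data file.

```python
import numpy as np

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared x values for sets I-III
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (x, y) in quartet.items():
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    summary[name] = {
        "mean_x": x.mean(),
        "mean_y": y.mean(),
        "var_x": x.var(ddof=1),
        "var_y": y.var(ddof=1),
        "r": np.corrcoef(x, y)[0, 1],
    }

for name, stats in summary.items():
    print(name, {key: round(float(value), 3) for key, value in stats.items()})
```

The non-graphical measures cannot tell these four datasets apart; only a scatterplot of each one reveals the differences.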
Summary
• Exploratory data analysis is a *very* important first step in data
science
• We can detect outliers, check assumptions, make hypotheses about
the data
• We can use graphical and non-graphical measures – usually a
combination of both is optimal
• Methods differ depending on the data types, categorical or
quantitative
• Python has many tools that make this task easy and aesthetically
pleasing
