Lecture 1 Exploratory Data Analysis
Notebook: lecture-1-exploratory-data-analysis
What is Exploratory Data Analysis?
Loosely speaking, any method of looking at data that does not
include formal statistical modelling and inference falls under the
term exploratory data analysis.
Why do Exploratory Data Analysis?
• detection of mistakes
• checking of assumptions
• preliminary selection of appropriate models
• determining relationships among the explanatory variables, and
assessing the direction and rough size of relationships between
explanatory and outcome variables.
• People find looking at large lists of numbers tedious and overwhelming;
EDA counters this by highlighting the salient features of the data.
EDA and where it sits
EDA in Python
Rough classes of EDA
• Graphical or non-graphical
• Univariate or multivariate
Univariate Non-Graphical
• Looking at a single variable gathered from an experiment/population
• Get an idea about the distribution of this variable
• Appreciate the ‘sample distribution’
• Make some assertions about what kind of distribution will best fit this
data
Pictures of distributions
Categorical, Ordinal, Interval data
• A categorical variable (sometimes called a nominal variable) is one that
has two or more categories, but there is no intrinsic ordering to the
categories.
• Example – hair colour
• An ordinal variable is similar to a categorical variable. The difference
between the two is that there is a clear ordering of the categories.
• Example – economic groups
• An interval variable is similar to an ordinal variable, except that the
intervals between the values of the numerical variable are equally
spaced.
• Example – evenly spaced price ranges
Quantitative data
• Any data that is represented by amounts, e.g. weight, velocity, strain.
Univariate Non-Graphical
Categorical non-graphical representations
• The characteristics of interest for a categorical variable are
simply the range of values and the frequency (or relative
frequency) of occurrence for each value.
• A simple tabulation of the frequency of each category is the best
univariate non-graphical EDA for categorical data.
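The tabulation described above can be sketched with the Python standard library. The hair colour sample is made up for illustration, and the variable names are assumptions rather than anything from the lecture notebook.

```python
from collections import Counter

# Hypothetical sample of a categorical variable (hair colour).
hair_colour = ["brown", "black", "brown", "blonde", "red", "brown", "black", "blonde"]

counts = Counter(hair_colour)  # frequency of each category
n = len(hair_colour)

# Relative frequency = count / total number of observations.
relative = {category: count / n for category, count in counts.items()}

# Print the table, most frequent category first.
for category, count in counts.most_common():
    print(f"{category:<8}{count:>3}{relative[category]:>8.3f}")
```

The relative frequencies sum to 1, which is a quick sanity check on any such table.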
Quantitative data representations
• The characteristics of the population distribution of a
quantitative variable are its centre, spread, modality (number of
peaks in the probability distribution function), shape (including
“heaviness of the tails”), and outliers.
Non-graphical representations of quantitative data
• In most situations it is worthwhile to think of univariate
non-graphical EDA as telling you about aspects of the histogram of
the distribution of the variable of interest.
• If the quantitative variable does not have too many distinct
values, a tabulation, as we used for categorical data, will be a
worthwhile univariate, non-graphical technique.
• Mostly, for quantitative variables, we are concerned here with
the numeric (non-graphical) measures, which are
the various sample statistics.
Descriptors of quantitative data
• Modality
• Central tendency
• Spread
• Skew
• Kurtosis
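Skew and kurtosis are commonly computed from central moments of the sample. The sketch below assumes the simple moment-based definitions (skewness = m3 / m2^1.5, excess kurtosis = m4 / m2^2 - 3); other, bias-corrected definitions exist, and the function names and data are illustrative only.

```python
def central_moment(values, order):
    """Average of (value - mean) ** order over the sample."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** order for v in values) / n


def skewness(values):
    """Moment-based skewness: 0 for a symmetric sample, > 0 for a long right tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores about 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


symmetric = [1, 2, 3, 4, 5]      # hypothetical symmetric sample
right_skewed = [1, 2, 3, 4, 50]  # hypothetical sample with one large value

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed))
```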
Modality
• How many peaks are there in the pdf?
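One rough, non-graphical way to estimate modality is to bin the data and count local maxima in the bin counts. This is only a sketch: the result depends on the chosen bin width and is sensitive to noise, and the helper names and sample are hypothetical.

```python
def histogram_counts(values, bin_width):
    """Count how many values fall into each equal-width bin, in bin order."""
    lowest = min(values)
    n_bins = int((max(values) - lowest) // bin_width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - lowest) // bin_width)] += 1
    return counts


def count_peaks(counts):
    """Count local maxima in a list of bin counts; a flat-topped peak counts once."""
    # Collapse runs of equal neighbouring counts so plateaus become single points.
    collapsed = [c for i, c in enumerate(counts) if i == 0 or c != counts[i - 1]]
    # Pad with -1 so a maximum in the first or last bin is still detected.
    padded = [-1] + collapsed + [-1]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i - 1] < padded[i] > padded[i + 1]
    )


# Hypothetical sample with two clusters of values.
sample = [1, 2, 2, 2, 3, 3, 4, 7, 8, 8, 9, 9, 9, 10]
bins = histogram_counts(sample, bin_width=1)
print(bins, count_peaks(bins))
```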
Central Tendency (Mean)
• The common, useful measures of central tendency are the
statistics called (arithmetic) mean, median, and sometimes
mode.
• Occasionally other means such as geometric, harmonic,
truncated, or Winsorized means are used as measures of
centrality.
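The different means listed above can be compared on one sample. This is a minimal sketch using Python's standard `statistics` module; the data, the helper names `trimmed_mean` and `winsorized_mean`, and the choice of trimming one value from each end are assumptions for illustration.

```python
import statistics

# Hypothetical positive-valued sample with one very large value.
values = [2.0, 4.0, 4.0, 5.0, 8.0, 9.0, 10.0, 100.0]

arithmetic = statistics.mean(values)
geometric = statistics.geometric_mean(values)  # requires positive values
harmonic = statistics.harmonic_mean(values)    # requires positive values


def trimmed_mean(data, k):
    """Mean after dropping the k smallest and k largest values (needs 2k < len(data))."""
    ordered = sorted(data)
    return statistics.mean(ordered[k:len(ordered) - k])


def winsorized_mean(data, k):
    """Mean after replacing the k smallest/largest values by the nearest kept value."""
    ordered = sorted(data)
    low, high = ordered[k], ordered[len(ordered) - k - 1]
    return statistics.mean(min(max(x, low), high) for x in ordered)


trimmed = trimmed_mean(values, 1)
winsorized = winsorized_mean(values, 1)
print(arithmetic, geometric, harmonic, trimmed, winsorized)
```

The arithmetic mean is pulled upwards by the single value of 100, while the trimmed and Winsorized means stay close to the bulk of the data.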
Central Tendency (Median)
• The median is the middle value after all of the values are put in an
ordered list (for an even number of values, the average of the two
middle values).
• For symmetric distributions, the mean and the median coincide.
https://medium.com/@nhan.tran/mean-median-an-mode-in-statistics-3359d3774b0b
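A small sketch of how the mean and median behave on symmetric and skewed samples, using Python's standard `statistics` module. The three samples are made up for illustration.

```python
import statistics

symmetric = [1, 2, 3, 4, 5]     # hypothetical symmetric sample
skewed = [1, 2, 3, 4, 50]       # same sample with one large value
even_length = [1, 2, 3, 4]      # even number of values

# For the symmetric sample the mean and the median coincide.
print(statistics.mean(symmetric), statistics.median(symmetric))

# The large value moves the mean but leaves the median unchanged.
print(statistics.mean(skewed), statistics.median(skewed))

# With an even number of values the median is the average of the two middle ones.
print(statistics.median(even_length))
```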
Spread (variance and standard deviation)
• Spread is an indicator of how far away from the centre we are still
likely to find data values.
• Statistics by variable
• https://www.geeksforgeeks.org/how-to-calculate-cramers-v-in-python/
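Variance and standard deviation, the spread measures named in the slide title above, can be computed with Python's standard `statistics` module. The sketch assumes a made-up sample and shows both the sample versions (dividing by n - 1) and the population versions (dividing by n).

```python
import statistics

# Hypothetical sample; its mean is 5.
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
mean = sum(values) / n
squared_deviations = sum((v - mean) ** 2 for v in values)

# Sample variance divides by n - 1; population variance divides by n.
sample_variance = squared_deviations / (n - 1)
population_variance = squared_deviations / n

# The standard library gives the same results.
print(sample_variance, statistics.variance(values))
print(population_variance, statistics.pvariance(values))

# Standard deviation is the square root of the variance, in the data's own units.
print(statistics.stdev(values), statistics.pstdev(values))
```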
Quantitative variable statistics (covariance)
• For two quantitative variables, the basic statistics of interest are
the sample covariance and/or sample correlation.
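Sample covariance and sample correlation can be written out directly from their definitions. The sketch below assumes the usual n - 1 divisor and Pearson's correlation; the function names and the paired data are illustrative.

```python
import math


def sample_covariance(x, y):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def sample_correlation(x, y):
    """Covariance rescaled by both standard deviations; always between -1 and 1."""
    return sample_covariance(x, y) / math.sqrt(
        sample_covariance(x, x) * sample_covariance(y, y)
    )


# Hypothetical paired measurements of two quantitative variables.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

cov_xy = sample_covariance(x, y)
corr_xy = sample_correlation(x, y)
print(cov_xy, corr_xy)
```

The covariance depends on the units of both variables, whereas the correlation is unit-free, which makes it easier to compare across pairs of variables.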