Exploratory Data Analysis

WHAT IS EDA?
• The analysis of datasets based on various numerical methods and
graphical tools.
• Exploring data for patterns, trends, underlying structure, deviations
from the trend, anomalies and strange structures.
• It facilitates discovering the unexpected as well as confirming the
expected.
• Another definition: An approach/philosophy for data analysis that
employs a variety of techniques (mainly graphical).
AIMS OF EDA
• Maximize insight into a dataset
• Uncover underlying structure
• Extract important variables
• Detect outliers and anomalies
• Test underlying assumptions
• Develop valid models
• Determine optimal factor settings
Exploratory vs Confirmatory Data Analysis
EDA                                  CDA
• No hypothesis at first             • Start with a hypothesis
• Generate hypotheses                • Test the null hypothesis
• Uses graphical methods (mostly)    • Uses statistical models
STEPS OF EDA
• Generate good research questions
• Data restructuring: you may need to create new variables from the existing ones, for example:
  • Replacing two variables with a rate or percentage computed from them
  • Creating dummy variables for categorical variables
• Based on the research questions, use appropriate graphical tools and obtain
descriptive statistics. Try to understand the data structure, relationships, anomalies,
unexpected behaviors.
• Try to identify confounding variables, interaction relations and variances, if any.
(Confounding variables are those that affect other variables in a way that produces spurious or
distorted associations between two variables. They confound the "true" relationship between two
variables.)
• Handle missing observations
• Decide whether a transformation is needed (on the response and/or explanatory variables).
• Decide on the hypothesis based on your research questions.
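The restructuring and missing-data steps above can be sketched in a few lines of Python. This is only an illustration using the standard library: the records, the column names (region, visits, purchases) and the helper functions are hypothetical, and dropping incomplete rows is just one of several ways to handle missing observations.

```python
# Hypothetical records; None marks a missing observation.
records = [
    {"region": "north", "visits": 200, "purchases": 30},
    {"region": "south", "visits": 150, "purchases": None},
    {"region": "east", "visits": 400, "purchases": 80},
    {"region": "north", "visits": 100, "purchases": 10},
]

def add_rate(rows, numerator, denominator, name):
    """Derive one rate from two raw variables; stay None if it cannot be computed."""
    out = []
    for row in rows:
        num, den = row[numerator], row[denominator]
        rate = None if num is None or den in (None, 0) else num / den
        out.append({**row, name: rate})
    return out

def add_dummies(rows, column):
    """Add a 0/1 indicator per category, leaving the first (sorted) level as baseline."""
    levels = sorted({row[column] for row in rows})
    out = []
    for row in rows:
        new_row = dict(row)
        for level in levels[1:]:
            new_row[f"{column}_{level}"] = int(row[column] == level)
        out.append(new_row)
    return out

def drop_missing(rows, column):
    """Simplest missing-data strategy: keep only rows where the column is observed."""
    return [row for row in rows if row[column] is not None]

with_rate = add_rate(records, "purchases", "visits", "purchase_rate")
with_dummies = add_dummies(with_rate, "region")
prepared = drop_missing(with_dummies, "purchase_rate")

for row in prepared:
    print(row)
```

Leaving one category out as the baseline avoids redundant indicator columns when the dummies are later used in a regression model.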
Classification of EDA
• Exploratory data analysis is generally cross-classified in two ways. First, each
method is either non-graphical or graphical. And second, each method is either
univariate or multivariate (usually just bivariate).
• Non-graphical methods generally involve calculation of summary statistics,
while graphical methods obviously summarize the data in a diagrammatic or
pictorial way.
• Univariate methods look at one variable (data column) at a time, while
multivariate methods look at two or more variables at a time to explore
relationships. Usually our multivariate EDA will be bivariate (looking at exactly
two variables), but occasionally it will involve three or more variables.
• It is almost always a good idea to perform univariate EDA on each of the
components of a multivariate EDA before performing the multivariate EDA.
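As a small non-graphical illustration of this ordering, the sketch below first summarizes each variable on its own and only then looks at the pair together. The data (hours, score) are made up for the example, and Pearson correlation is just one possible bivariate summary.

```python
import math
import statistics

# Hypothetical paired observations.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
score = [52.0, 55.0, 61.0, 64.0, 68.0]

def describe(values):
    """Univariate, non-graphical EDA: a handful of summary statistics."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }

def pearson_r(x, y):
    """Bivariate, non-graphical EDA: Pearson correlation coefficient."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Step 1: univariate summaries of each component.
print(describe(hours))
print(describe(score))

# Step 2: the bivariate relationship.
r = pearson_r(hours, score)
print(round(r, 3))
```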
Data Types and Measurement Scales
• Variables may be one of several types, and have a defined set of
valid values.
• Two main classes of variables are:
Continuous Variables: (Quantitative, numeric).
Continuous data can be rounded or binned to create categorical data.
Categorical Variables: (Discrete, qualitative).
Some discrete variables (e.g. counts) are sometimes treated as
continuous.
Purely qualitative data such as culture, ethics, norms, etc. need different
analytical approaches such as text or content analysis, thematic analysis,
narrative analysis, etc.
Categorical Data
• Unordered categorical data (nominal)
2 possible values (binary or dichotomous)
Examples: gender, alive/dead, yes/no.
Greater than 2 possible values - No order to categories
Examples: marital status, religion, country of birth, race.
• Ordered categorical data (ordinal)
Ratings or preferences
Behavioural attributes
Quality of life scales
Psychographic variables
EDA Part 2: Summarizing Data With
Tables and Plots
Examine the entire data set using basic techniques before starting a
formal statistical analysis.
• Familiarize yourself with the data.
• Find possible errors and anomalies.
• Examine the distribution of values for each variable.
Summarizing Variables
• Categorical variables
Frequency tables - how many observations in each category?
Relative frequency table - percent in each category.
Bar chart and other plots.
• Continuous variables
Bin the observations (create categories, e.g., 0-10, 11-20, etc.), then treat them as
ordered categorical.
Plots specific to continuous variables.
The goal for both categorical and continuous data is data reduction while
preserving/extracting key information about the process under
investigation.
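A minimal sketch of these summaries, assuming made-up data: a frequency and relative frequency table for a categorical variable, and binning of a continuous variable so it can be tabulated the same way. The half-open bins such as [0, 10) are an assumption of this sketch, chosen so that every continuous value falls in exactly one bin.

```python
from collections import Counter

def frequency_table(values):
    """Map each category to (count, percent), most frequent first."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, 100.0 * n / total) for cat, n in counts.most_common()}

def bin_values(values, width):
    """Label each value with the half-open bin [lo, lo + width) that contains it."""
    labels = []
    for v in values:
        lo = int(v // width) * width
        labels.append(f"[{lo}, {lo + width})")
    return labels

# Hypothetical categorical variable.
marital = ["single", "married", "married", "divorced", "married", "single"]
for category, (count, percent) in frequency_table(marital).items():
    print(f"{category:10s} {count:3d} {percent:6.1f}%")

# Hypothetical continuous variable, binned and then treated as ordered categorical.
ages = [3.5, 12.0, 18.2, 25.0, 9.9, 10.0]
binned = bin_values(ages, 10)
for label, (count, percent) in sorted(frequency_table(binned).items()):
    print(f"{label:10s} {count:3d} {percent:6.1f}%")
```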
Hypothesis
• We have to make decisions even when we are unsure: school,
marriage, therapy, jobs, and so on.
• Statistics provides an approach to decision making under
uncertainty: choose the way you would place a bet, by maximizing
expected utility (subjective value).
• In inferential statistics, the null hypothesis is a general statement
or default position that there is nothing new happening, such as no
association among groups or no relationship between two
measured phenomena.
Hypothesis
• A statistical hypothesis is a hypothesis that is testable on the basis
of observing a process that is modeled via a set of random variables.
Testing such hypotheses is sometimes called confirmatory data analysis.
*** Main idea ***
It is difficult to prove that a claim is “right”,
but it is easy to prove that it is “wrong”:
you only have to find one counterexample.
Steps for Hypothesis Testing
1. Formulate H0 and H1
2. Select Appropriate Test
3. Choose Level of Significance
4. Calculate Test Statistic TSCAL
5. Follow one of two routes:
   a. Determine Probability Associated with Test Statistic, then Compare with Level of Significance, α
   b. Determine Critical Value of Test Statistic TSCR, then Determine if TSCAL falls into (Non) Rejection Region
6. Reject/Do not Reject H0
7. Draw Research Conclusion
Step 1: Formulate the Hypothesis
• A null hypothesis may be rejected, but it can never be
accepted based on a single test.
• In research, the null hypothesis is formulated in such a
way that its rejection leads to the acceptance of the
desired conclusion.
• Example: a new Internet Shopping Service will be introduced if
more than 40% of people use it:
H0: π ≤ 0.40
H1: π > 0.40
(π is the population proportion of users.)
Step 2: Select an Appropriate Test
• The test statistic measures how close the sample has
come to the null hypothesis.
• The test statistic often follows a well-known distribution
(e.g., normal, t, or chi-square). The z test applies when the
test statistic follows the normal distribution.
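For the shopping-service example, a one-sample z test for a proportion is a natural choice. The sketch below assumes a one-sided test of H0: π ≤ 0.40 against H1: π > 0.40 and a sample large enough for the normal approximation to hold. The sample figures (92 users out of 200) and the function name are hypothetical and are not taken from the slides.

```python
import math
from statistics import NormalDist

def proportion_z_test(successes, n, pi0, alpha=0.05):
    """One-sided z test of H0: pi <= pi0 against H1: pi > pi0."""
    p_hat = successes / n
    # Standard error of the sample proportion, computed under H0.
    se = math.sqrt(pi0 * (1 - pi0) / n)
    z = (p_hat - pi0) / se                      # calculated test statistic (TSCAL)
    p_value = 1 - NormalDist().cdf(z)           # probability associated with z
    z_crit = NormalDist().inv_cdf(1 - alpha)    # critical value (TSCR)
    return {
        "z": z,
        "p_value": p_value,
        "z_crit": z_crit,
        "reject_h0": z > z_crit,
    }

# Hypothetical survey: 92 of 200 respondents use the service.
result = proportion_z_test(successes=92, n=200, pi0=0.40, alpha=0.05)
print(result)
```

Comparing the p-value with α and comparing z with the critical value are two views of the same decision, so both lead to the same reject or do-not-reject outcome.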
Step 3: Choose Level of Significance
Type I Error
• Occurs if the null hypothesis is rejected when it is in fact true.
• The probability of type I error (α) is also called the level of
significance.
Type II Error
• Occurs if the null hypothesis is not rejected when it is in fact false.
• The probability of type II error is denoted by β.
• Unlike α, which is specified by the researcher, the magnitude of β
depends on the actual value of the population parameter
(proportion).
It is necessary to balance the two types of errors.
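The trade-off can be made concrete with a small calculation. The sketch below assumes the one-sided proportion test H0: π ≤ π0 against H1: π > π0 with a normal approximation; the function name, sample size and candidate true proportions are hypothetical.

```python
import math
from statistics import NormalDist

def type_ii_error(pi0, pi_true, n, alpha=0.05):
    """Approximate beta for the one-sided test H0: pi <= pi0 vs H1: pi > pi0."""
    std_normal = NormalDist()
    # H0 is rejected when the sample proportion exceeds this cutoff.
    cutoff = pi0 + std_normal.inv_cdf(1 - alpha) * math.sqrt(pi0 * (1 - pi0) / n)
    # Spread of the sample proportion if the true proportion is pi_true.
    se_true = math.sqrt(pi_true * (1 - pi_true) / n)
    # beta = P(sample proportion <= cutoff | pi = pi_true)
    return std_normal.cdf((cutoff - pi_true) / se_true)

# beta depends on the actual value of the population proportion.
for pi_true in (0.45, 0.50, 0.55):
    beta = type_ii_error(pi0=0.40, pi_true=pi_true, n=200, alpha=0.05)
    print(pi_true, round(beta, 4))

# Lowering alpha raises beta for the same sample size.
beta_05 = type_ii_error(0.40, 0.45, 200, alpha=0.05)
beta_01 = type_ii_error(0.40, 0.45, 200, alpha=0.01)
print(round(beta_05, 4), round(beta_01, 4))
```

For a fixed sample size, making α smaller pushes the rejection cutoff outward and therefore increases β, which is why the two error types have to be balanced.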
