LabModule - Exploratory Data Analysis - 2023ic
Exploratory Data Analysis (Descriptive Statistics)
Course Objectives
To understand the use of statistical tests in research and to interpret results
from a statistical test.
Learning Objectives
1. Differentiate between the various types of variables/scales of measurement
2. Describe data location, spread, and distribution (mean, median, mode,
frequency, proportion, maximum, minimum, percentiles/quartiles,
standard deviation, interquartile range)
3. Explain the features of normal distribution and non-normal distribution
4. Effectively summarize data and display data in tables and graphs
5. Interpret the descriptive summary and tabular/graphical data
visualization correctly
6. Perform descriptive analysis in Jamovi
Introduction
Descriptive statistics refers to the different methods applied to summarize and
present data in a form that will be more informative but also easier to
understand. Data collected for a study often consist of hundreds or even
thousands of subjects from whom the information is obtained. In order to
analyze this numerical data, it is necessary to organize the data systematically
and describe them comprehensively.
To describe a mass of numerical data and explore patterns within the dataset,
graphical methods or summary measures such as mean, median, mode
(central tendency), and dispersion or variability of our data can be used. These
measures are generally referred to as descriptive statistics. We can either
explore the distribution of one variable at a time, or explore the relationship
between two variables. Exploratory data analysis can also be used to check for
data errors and to check the assumptions of more complex statistical tests.
The type of measures and graphs to be used depends on the type of
variable/scale of measurement.
Data collected from the field, by whatever method, are not homogeneous,
so understanding and describing the data distribution is important.
We can describe the shape, center, and spread of the distribution of a numerical
variable.
The shape of a distribution can be described by its symmetry/skewness and
its number of peaks (modality). By symmetry, distributions can be
distinguished as:
a. Symmetric distribution: the left side of the distribution mirrors the
right side when viewed from the median.
b. Skewed distribution: skewed either to the right or to the left. A
distribution skewed to the right (positively skewed) is caused by
extreme high values that pull the tail of the curve to the right. A
distribution skewed to the left (negatively skewed), a less common
type, is caused by extreme low values that pull the tail of the curve
to the left.
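As a rough numerical illustration of skewness, the sketch below computes the
moment coefficient of skewness (the third central moment divided by the second
central moment raised to the power 1.5). This coefficient is not introduced in
this module; it and the three small datasets are assumptions used only for
illustration. A value near zero suggests symmetry, a positive value a right
skew, and a negative value a left skew.

```python
import statistics


def sample_skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (moments divided by n)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical datasets, chosen only to show the sign of the coefficient.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 20]   # one extreme high value
left_skewed = [-20, 3, 3, 3, 4, 4, 5]   # one extreme low value

for name, data in [("symmetric", symmetric),
                   ("right skewed", right_skewed),
                   ("left skewed", left_skewed)]:
    print(f"{name}: skewness = {sample_skewness(data):.2f}")
```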
A unimodal distribution has one mode around which the observations are
concentrated, while a bimodal distribution has two modes. Data can also be
uniformly distributed (no mode).
The mean is very sensitive to extreme values (outliers), while the median is
largely unaffected by them. The mean is used for data with a symmetric
distribution and no extreme values, while the median is usually more
appropriate for data with a skewed distribution and/or extreme values.
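The effect of an outlier on the mean and the median can be checked with a
small sketch. The ages below are hypothetical values, not taken from the
course dataset; the example uses only Python's standard `statistics` module.

```python
import statistics

# Hypothetical ages (years); the second list adds one extreme value.
ages = [2, 3, 4, 5, 6]
ages_with_outlier = ages + [60]

mean_before = statistics.mean(ages)
median_before = statistics.median(ages)
mean_after = statistics.mean(ages_with_outlier)
median_after = statistics.median(ages_with_outlier)

print(f"without outlier: mean = {mean_before}, median = {median_before}")
print(f"with outlier:    mean = {mean_after:.2f}, median = {median_after}")
```

Adding the single value 60 moves the mean from 4 to about 13.3, while the
median only shifts from 4 to 4.5.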
3. Mode
Mode is the value that appears most frequently in the dataset. In other words,
the mode is the value that has the highest frequency.
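A minimal sketch of finding the mode, using hypothetical blood group and
score data. `statistics.mode` returns a single most frequent value, and
`statistics.multimode` returns every value tied for the highest frequency,
which is useful for bimodal data.

```python
import statistics

# Hypothetical categorical data: the mode also works for nominal variables.
blood_groups = ["A", "O", "B", "O", "AB", "O", "A"]
most_common = statistics.mode(blood_groups)

# Hypothetical data with two equally frequent values (bimodal).
scores = [1, 1, 2, 2, 3]
all_modes = statistics.multimode(scores)

print(most_common)  # O
print(all_modes)    # [1, 2]
```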
individual observations from the mean; however, the negative and
positive deviations cancel each other out. Therefore, these deviations
are squared. Conceptually, variance is the average squared
deviation from the mean.
b. Standard Deviation. Standard deviation is the square root of the
variance. Unlike the variance, standard deviation measures
variation in the original units of measurement, so it is the most
commonly used measure of spread.
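The steps behind the variance and standard deviation can be sketched as
follows. The values are hypothetical. The sketch divides by n - 1 (the sample
variance, which is an assumption here; it is the convention used by Python's
`statistics.variance`), so the "average" is taken over n - 1 rather than n.

```python
import math
import statistics

# Hypothetical measurements.
values = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(values)
mean = sum(values) / n

# Raw deviations from the mean sum to zero, so they cannot measure spread.
deviations = [x - mean for x in values]
sum_of_deviations = sum(deviations)

# Squaring removes the sign; divide by n - 1 for the sample variance.
sample_variance = sum(d ** 2 for d in deviations) / (n - 1)

# The square root brings the measure back to the original units.
sample_sd = math.sqrt(sample_variance)

print(f"sum of deviations = {sum_of_deviations}")
print(f"variance = {sample_variance:.3f}, SD = {sample_sd:.3f}")
```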
4. Percentile
Percentiles split a set of ordered data into hundredths. The most commonly
used are the 25th, 50th (the median), and 75th percentiles. For example, if
70 mmHg is the 25th percentile of the distribution of diastolic blood pressure
in a sample, then 25% of the sample have a diastolic blood pressure below 70
mmHg and 75% have a diastolic blood pressure above 70 mmHg.
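A short sketch of computing the quartiles, using hypothetical diastolic blood
pressure readings. It relies on `statistics.quantiles` with its default
"exclusive" method; other software may use a slightly different interpolation
rule, so quartiles from small samples can differ a little between programs.

```python
import statistics

# Hypothetical diastolic blood pressure readings (mmHg).
dbp = [62, 66, 70, 70, 74, 78, 80, 82, 86, 90, 94]

# n=4 returns the three cut points: 25th, 50th, and 75th percentiles.
q1, q2, q3 = statistics.quantiles(dbp, n=4)
iqr = q3 - q1  # interquartile range

print(f"25th percentile = {q1}")
print(f"50th percentile = {q2} (median = {statistics.median(dbp)})")
print(f"75th percentile = {q3}")
print(f"IQR = {iqr}")
```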
Data Summary
1. Frequency Distribution
The frequency distribution is an arrangement of numerical data according
to size or group. For discrete data, the frequency distribution is simply a tally
of the observations in each category. For continuous data, a frequency
distribution can be made by constructing classes and counting the number
of observations that fall into each class. Categorical variables can also
be summarized as frequencies and proportions.
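Constructing classes and counting the observations in each class can be
sketched as below. The ages, the class width of 5 years, and the helper
`class_label` are hypothetical choices made for illustration only.

```python
from collections import Counter

# Hypothetical ages at diagnosis (completed years).
ages = [1, 3, 4, 4, 6, 7, 9, 10, 12, 14]
class_width = 5


def class_label(age, width=class_width):
    """Return the class an age falls into, e.g. 7 -> '5-9'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


counts = Counter(class_label(a) for a in ages)
total = sum(counts.values())

# Print class, frequency, and proportion, ordered by the lower class limit.
for label in sorted(counts, key=lambda s: int(s.split("-")[0])):
    print(f"{label:>5}  n = {counts[label]}  ({counts[label] / total:.0%})")
```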
Table 1. Example of Frequency Distribution Table
2. Contingency table
Contingency tables (also called crosstabs or two-way tables) are used in
statistics to summarize the relationship between categorical
variables. A contingency table is a special type of frequency distribution
table in which two variables are shown simultaneously. In the simplest
case, the two-by-two table, each of the two variables has two categories.
Epidemiological studies often use a two-by-two table to compare exposure
vs. disease/case status.
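A two-by-two table can be built by counting each combination of exposure and
disease status. The records below are hypothetical, and the category names
are placeholders chosen for this sketch.

```python
from collections import Counter

# Hypothetical (exposure, disease status) records, one tuple per subject.
records = (
    [("exposed", "case")] * 3
    + [("exposed", "non-case")] * 2
    + [("unexposed", "case")] * 1
    + [("unexposed", "non-case")] * 4
)

cells = Counter(records)
rows = ["exposed", "unexposed"]
cols = ["case", "non-case"]

# Marginal totals for each row and each column.
row_totals = {r: sum(cells[(r, c)] for c in cols) for r in rows}
col_totals = {c: sum(cells[(r, c)] for r in rows) for c in cols}

print(f"{'':<10}{'case':>6}{'non-case':>10}{'total':>7}")
for r in rows:
    print(f"{r:<10}{cells[(r, 'case')]:>6}{cells[(r, 'non-case')]:>10}"
          f"{row_totals[r]:>7}")
print(f"{'total':<10}{col_totals['case']:>6}{col_totals['non-case']:>10}"
      f"{len(records):>7}")
```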
Data Presentation
There are various ways to present and summarize the data. The most
common way is to make a graph.
a. Stem and leaf plot
This is a quick way to picture the shape of a distribution while including the
actual numerical values in the graph. A stem and leaf plot works best for
a small number of observations. If the number is large (a hundred, for
example), the plot becomes difficult to read. A stem and leaf plot
is obtained by sorting the observations into rows according to their leading
digit.
Figure 1. Example of Stem and Leaf Plot
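Sorting observations into rows by their leading digit can be sketched as
follows. The function `stem_and_leaf` and the sample values are hypothetical;
the sketch assumes non-negative whole numbers and uses the tens as the stem
and the units as the leaf.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf plot (stem = tens, leaf = units)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    lines = []
    # Include empty stems so gaps in the distribution stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines


# Hypothetical observations.
plot_lines = stem_and_leaf([12, 15, 21, 34, 36, 38])
for line in plot_lines:
    print(line)
```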
95%. It depends on what percentile is interesting to show. This box is
crossed at the median. Then we draw a “whisker” from each end of the box
to the extreme value of the range. With a box plot, we can easily identify
extreme values (outliers), the lowest and highest values, the lower and
upper percentiles, the median, and the IQR. The minimum, lower quartile,
median, upper quartile, and maximum together make up the five-number
summary. Box plots are especially useful when presented side-by-side to
compare the distributions of several groups.
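The numbers behind a box plot can be sketched as below. The data are
hypothetical. Flagging values more than 1.5 × IQR beyond the quartiles as
outliers is a common convention assumed here, not a rule stated in this
module, and quartile values may differ slightly between software packages.

```python
import statistics

# Hypothetical observations with one extreme value.
values = [3, 4, 5, 5, 6, 7, 8, 9, 10, 11, 40]

q1, median, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Five-number summary: minimum, lower quartile, median, upper quartile, maximum.
five_number = (min(values), q1, median, q3, max(values))

# Assumed convention: points beyond 1.5 x IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print("five-number summary:", five_number)
print("IQR:", iqr)
print("outliers:", outliers)
```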
c. Histogram
A histogram is a better choice to present large volumes of data. In a
histogram, the data are simplified by grouping them into intervals (bins) and
then displaying them as a series of columns. The height of each column is
proportional to the number of observations or individuals falling into that
interval (the interval is plotted on the x-axis and the number of observations
in each interval (frequency) on the y-axis). For nominal or ordinal data, the
groups will be the categories themselves. For continuous data, the groups can
be made using the same steps as for making a frequency distribution.
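Grouping values into bins and drawing one column per bin can be sketched as a
simple text histogram. The leukocyte counts, the bin width of 10, and the
function `histogram_counts` are hypothetical; each bin includes its lower
limit and excludes its upper limit.

```python
def histogram_counts(values, start, width, n_bins):
    """Count how many values fall into each bin [lower, lower + width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


# Hypothetical leukocyte counts (x10^3 per microlitre).
leukocytes = [4, 6, 7, 9, 11, 12, 12, 15, 18, 23, 27, 41]
counts = histogram_counts(leukocytes, start=0, width=10, n_bins=5)

# One row per bin; the number of '#' marks is the frequency.
for i, count in enumerate(counts):
    lower = i * 10
    print(f"{lower:>2} to <{lower + 10:<2} | {'#' * count}")
```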
d. Chart (Bar and Pie)
A pie or bar chart is another alternative for data presentation. The pie
chart is more appropriate for a single variable with a small number of
categories, while the bar chart can be used for more than one variable and
more than five categories.
e. Scatter plot
The scatter plot deals with two continuous variables. In the scatter plot, we
put one variable on the horizontal line (x-axis) and the other on the vertical
line (y-axis). Then, we plot each observation according to its values on the
x- and y-axes. Besides showing the relation between the two variables, the
scatter plot helps us to locate outliers (if there are any).
                  | Stata | SPSS | SAS | R | Jamovi
Cost              | Affordable (perpetual license, renew only when upgrading) | Expensive (renewal when upgrading, long-term license) | Expensive (yearly renewal) | Open source | Open source (based on R software)
Program extension | .do (do-files) | .sps (syntax file) | .sas | .R (R script file) | .omv
Output extension  | .log (text file) / .smcl (Stata formatted log file) | .spo (SPSS output file) | Various formats | .txt (log files) / .Rmd (R markdown file in html/pdf/doc format) | .omv
Exercise
Pre-Class Exercise
Please read the following paper:
Purnama, I., Widjajanto, P. H., & Damayanti, W. (2021). Influence of initial treatment
delay on overall survival and event-free survival in childhood acute
lymphoblastic leukemia. Paediatrica Indonesiana, 61(4), 217–22.
https://doi.org/10.14238/pi61.4.2021.217-22
Based on the paper, please fill in the table below. The first row has been
filled in for you as a guide.
You have to do this exercise before the practical session. Please give the printout
of your completed assignment to the tutor when you enter the practical session room.
Failure to submit your assignment means that you will be denied entry to the practical
session room. Your attendance will then be marked as absent.
No  Variable name  Variable type  Scale    Unit/Group
1   Sex            Categorical    Nominal  Male & female
2
3
4
5
6
7
In-Class Exercise
Exercise 1
a. What is the variable type for age (this refers to the age at diagnosis)?
● Open the dataset in Jamovi. Choose the menu (three-bar icon at the upper left)
→ Open → leukemia_dataset.omv
● Choose “Age at diagnosis” and press the arrow button next to the Variables
list. Open the Statistics menu (below the variable selection area) → choose
N, Missing, Mean, Median, Percentiles, Std. Deviation, Variance, Range, and
IQR.
f. What do we need to see age variability?
i. Can we use the mean and median for the “Gender” variable?
● The Gender variable is not labelled yet (if you import the data from a
Microsoft Excel file, none of the variables are labelled). Assign labels by
double-clicking on the variable name → change the values in the Level box
(use the codebook sheet in leukocyte_dataset.xlsx as guidance).
● Choose Analyses → Exploration → Descriptives
● Choose Gender → press the arrow button next to the “variable(s)” list
● Choose “Frequency tables”, located between the Variables list and the
Statistics menu. In the Statistics menu, choose only N and Missing
Exercise 2
Below is a table about age and leukocyte count data. It contains the mean and
standard deviation. The students are asked to help the principal investigator
determine which of the two variables, age or leukocyte count, varies more.
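Because age and leukocyte count are measured in different units, their
standard deviations cannot be compared directly. One common approach, assumed
here and not prescribed by this module, is the coefficient of variation (the
standard deviation expressed as a percentage of the mean). The summary values
below are hypothetical and are not the values from the exercise table.

```python
def coefficient_of_variation(mean, sd):
    """Return the standard deviation as a percentage of the mean."""
    if mean == 0:
        raise ValueError("mean must be non-zero")
    return sd / mean * 100


# Hypothetical summary statistics (not taken from the exercise table).
cv_age = coefficient_of_variation(mean=6.0, sd=3.0)            # years
cv_leukocytes = coefficient_of_variation(mean=40.0, sd=50.0)   # x10^3/uL

print(f"CV of age = {cv_age:.0f}%")
print(f"CV of leukocyte count = {cv_leukocytes:.0f}%")
```

With these made-up numbers, the leukocyte count has the larger coefficient of
variation, so it would be considered the more variable of the two.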
Exercise 3
One of the common ways to visualize quantitative data is through box plots. Box
plots are useful for identifying outliers and for comparing distributions. You are asked
to construct a box plot graph for a single variable with a certain command using the
software. Please create a box plot about Age at diagnosis:
Steps in Jamovi
● Choose Analyses → Exploration → Descriptives
● Choose “Age at diagnosis” and press the arrow button next to the Variables
list.
● Open the Plots menu → choose Box plot
Exercise 4
There are some common ways to visualize quantitative data, such as diagrams,
histograms, and/or box plots. You can construct a graph for a single variable with a
certain command using the software. Please create a graph and interpret the result:
○ Choose Gender → press the arrow button next to the Dependent
Variable
○ Interpret the result
○ Repeat the steps for Classification (choose “Classification” variables
instead of “Gender”)
2) Box plot - Age between men and women – See the different shape of
distribution
○ Choose Analyses → Exploration → Descriptives
○ Choose “Age at diagnosis” and press the arrow button next to the
Variables list. Choose “Gender” and press the arrow button next to
Split by
○ Open Plots menu and choose Box plot
○ Interpret the result!
Exercise 5
● Choose “Leukocytes count” → press arrow next to “Y-Axis” box. Choose “Age