M2 - Visualization of Categorical and Numerical Data

Data Visualization

Visualization of Categorical and Numerical Data



Table of Contents
Introduction
1. Visual Analysis of Statistical Data
   1.1. Key Measures Computed for Statistical Data
   1.2. Examining the Data
2. Data Visualization - Variation within Categorical Measures
   2.1. Bar Graphs
   2.2. Specialized Bar Graphs
   2.3. Treemaps
3. Variation within Numerical Measures
   3.1. Histograms
Summary

Introduction
This topic explains why visual analysis techniques are needed for examining statistical
data. These techniques are used to understand the variation within categorical and
numerical data measures, and the relationships between these measures.

This topic covers the visualization methods used to examine graphically the variation
within categorical and numerical measures. The remaining methods, which deal with
variation through time and across space and with relationships between measures, are
covered in the subsequent topic.

Learning Objectives
Upon completion of this topic, you will be able to:
• Explain the importance of visual analysis for statistical data
• Describe the techniques available for visualizing variation within categorical and numerical
data.

1. Visual Analysis of Statistical Data


In the data analysis process, it is essential to understand the data thoroughly before
applying models to extract insights. However, the huge volume of data available in
today’s enterprises makes it difficult to get sufficient clarity about the data.

Hence, the approach is to move top-down, first arriving at an initial set of
characteristics that describe the data at an overall level. This is achieved by applying
data visualization methods to sample data to unearth critical facts and trends.

1.1. Key Measures Computed for Statistical Data


To analyze the datasets, the following statistical measures are computed for key variables:
a. Location: The mean, median and mode
b. Variability: Percentiles, variance and standard deviation
c. Shape: Skew and kurtosis
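
As a rough illustration of the measures listed above, the sketch below computes them with Python's standard `statistics` module. The function name `describe`, the sample values, and the choice of population (rather than sample) formulas for variance, skew and kurtosis are assumptions made for this example only.

```python
import statistics

def describe(values):
    """Return location, variability and shape measures for a list of numbers."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    # Shape: standardised third and fourth moments (population versions).
    skew = sum((v - mean) ** 3 for v in values) / (n * sd ** 3)
    kurtosis = sum((v - mean) ** 4 for v in values) / (n * sd ** 4)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "variance": statistics.pvariance(values),
        "std_dev": sd,
        "quartiles": statistics.quantiles(values, n=4),  # 25th, 50th, 75th percentiles
        "skew": skew,
        "kurtosis": kurtosis,
    }

# Made-up values with one unusually large observation.
sample = [2, 3, 3, 4, 5, 5, 5, 6, 7, 20]
summary = describe(sample)
for name, value in summary.items():
    print(name, value)
```

The single large value pulls the mean above the median and gives a positive skew, which is the kind of signal that prompts a closer visual check.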

Visualization techniques are especially useful for identifying outliers in distributions and
checking for associations between variables. Visualization often also reveals that
datasets with nearly identical location and variability measures can behave very
differently, as shown in Figure 1.1.

Figure 1.1. - Anscombe's quartet
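
The point of the quartet can also be checked numerically. The sketch below uses the y-values of the four datasets as published by Anscombe (1973); the variable names are chosen for this example. The mean and variance agree to about two decimal places, while the median already hints that the datasets differ.

```python
import statistics

# y-values of Anscombe's four datasets (Anscombe, 1973).
quartet_y = {
    "I":   [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "II":  [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "III": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    "IV":  [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
}

means = {k: statistics.mean(v) for k, v in quartet_y.items()}
variances = {k: statistics.variance(v) for k, v in quartet_y.items()}  # sample variance
medians = {k: statistics.median(v) for k, v in quartet_y.items()}

for name in quartet_y:
    print(name, round(means[name], 2), round(variances[name], 2), medians[name])
```

Only a plot of each dataset shows how different the four shapes really are, which is why summary measures alone are not enough.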


1.2. Examining the Data
To understand the data in a dataset, both the categorical and numerical (or quantitative)
measures associated with it need to be examined.

Categorical measures take values that identify a category. Typical examples are product,
country, customer and territory. Numerical measures take values that can be measured
quantitatively. Examples are profit, revenue, expense and blood pressure.

Statistical meaning is found both in the variations within the categorical and numerical
measures as well as in relationships among them. The following table summarizes the
different types of variations and relationships.

Variation within categorical measures      How items in the categories relate to each other?
                                           (Ranking, Part-to-whole)
Variation within numerical measures        How values in the measure are distributed across
                                           the range? (Distribution)
Variation through time                     How values change through time? (Time-series)
Relationship between numerical measures    How measures relate to one another? (Correlation)
Variation across space                     Where are values located in space relative to one
                                           another? (Spatial)
Relationship between categorical measures  How categories relate to each other mediated by
                                           measures? (Inter-category)

Table 1.1 – Summary table of variations and relationships within and across measures.

2. Data Visualization - Variation within Categorical Measures


This analysis evaluates how data items within a category in the dataset relate to each
other in terms of ranking (for example, highest to lowest) and proportion of the whole,
that is, part-to-whole. The visualization methods used in this analysis are explained below.
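
Before any chart is drawn, ranking and part-to-whole come down to two small computations. A minimal sketch, assuming a hypothetical `sales` dictionary with invented product names and values:

```python
# Hypothetical sales per product; names and figures are invented for illustration.
sales = {"Product A": 120, "Product B": 300, "Product C": 180}

# Ranking: order the categories from highest to lowest value.
ranking = sorted(sales, key=sales.get, reverse=True)

# Part-to-whole: each category's share of the total.
total = sum(sales.values())
shares = {product: value / total for product, value in sales.items()}

for product in ranking:
    print(f"{product}: {sales[product]} ({shares[product]:.0%} of total)")
```

The ranking feeds a sorted bar graph, and the shares feed stacked bars, pie charts or treemaps.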

2.1. Bar Graphs


Bar graphs are best suited to displaying values subdivided into discrete instances along a
nominal or ordinal scale. The visual weight of the bars places emphasis on the individual
values in the graph and makes it easy to compare values with one another by simply
comparing the heights (or, in a horizontal graph, the lengths) of the bars.

There are three main types of bar graphs:


a. Horizontal: It is perfect for comparative ranking, like a top-five list. It is also preferred
in situations where the category labels are very descriptive, and adjusting them within
the axis of a column graph becomes an issue.

Figure 2.1. - Ranking of top five products sales.


Source: datapine.com

b. Vertical/Column: Good for showing chronological data, such as growth over specific
periods, and comparing data across categories.

Figure 2.2. - Comparing product sales across channels and countries


Source: datapine.com

c. Stacked: Useful for handling part-to-whole relationships.

Figure 2.3. - Age-wise distribution of new customers across quarters


Source: datapine.com

Two other techniques used to visualize the part-to-whole distribution of items in a category
are the Pie Chart and the Doughnut Chart. The arc length of each sector, and consequently
its area, is proportional to the quantity it represents.
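
The proportionality can be made concrete: a sector's central angle is its share of the total multiplied by 360 degrees. A minimal sketch, with a hypothetical helper `sector_angles` and invented commuting counts:

```python
def sector_angles(counts):
    """Map each category to the central angle (in degrees) of its pie sector."""
    total = sum(counts.values())
    return {label: 360 * value / total for label, value in counts.items()}

# Invented counts, for illustration only.
commuters = {"Car": 25, "Bus": 50, "Walk": 25}
angles = sector_angles(commuters)
print(angles)
```

Because the angles always add up to 360 degrees, a pie or doughnut chart can only show parts of a single whole.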

[Pie chart. Legend: Car, Taxi, Two wheeler, Cycle, Local bus, Metro, Walk.
Sector values, in extraction order: 3%, 6%, 28%, 14%, 2%, 12%, 35%.]

Figure 2.4. - Commuting means by employees in XYZ company

Figure 2.5. - Doughnut chart of age structure


Source - Devexpress

2.2. Specialized Bar Graphs


a. Bullet graph
The bullet graph is a variation of the bar graph, which depicts a performance measure
along with a comparative value and a qualitative measure to show if performance is
good, bad or intermediate.

Figure 2.6. - Bullet Graph


Source: Wikipedia - Bullet Graph

In Figure 2.6, the dark bar represents the performance measure, the vertical marker
represents the comparative value, and the colour shading represents the qualitative
performance ranges, with the lightest shade denoting the best performance.

b. Pareto Charts
Pareto charts depict part-to-whole relationships. The items are shown as bars arranged
in descending order of value. A line shows the cumulative percentage, which rises to
100%, and helps to pinpoint the main contributors to the whole.
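
The two ingredients of a Pareto chart, the descending order and the cumulative percentage, can be sketched as follows. The helper name `pareto_table` and the defect counts are invented for this example.

```python
def pareto_table(counts):
    """Return (label, value, cumulative percent) rows in descending order of value."""
    total = sum(counts.values())
    rows = []
    running = 0
    for label, value in sorted(counts.items(), key=lambda item: item[1], reverse=True):
        running += value
        rows.append((label, value, 100 * running / total))
    return rows

# Invented defect counts, for illustration only.
defects = {"Scratch": 10, "Dent": 50, "Misalignment": 25, "Stain": 15}
rows = pareto_table(defects)
for label, value, cumulative in rows:
    print(f"{label:<14}{value:>4}{cumulative:>8.1f}%")
```

In this made-up data the first two items already account for 75% of the total, which is the kind of concentration the cumulative line is meant to expose.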

Figure 2.7. - Pareto chart


Source: pareto-chart.com

c. Deviation Bar Graphs


These are bar graphs which directly express the variation in value between two points in
time. They are very useful in cases where the focus is exclusively on variation of a value
in time, regardless of ranking or part-to-whole relationships.
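
What a deviation bar graph plots is simply the signed difference between the two points in time. A minimal sketch with invented case counts (the category names and numbers are hypothetical):

```python
# Invented case counts before and after an intervention; illustration only.
before = {"Disease A": 40, "Disease B": 25, "Disease C": 30}
after = {"Disease A": 22, "Disease B": 27, "Disease C": 12}

# Signed change per category: negative bars extend below (or left of) the zero line.
deviation = {name: after[name] - before[name] for name in before}

for name, change in deviation.items():
    print(f"{name}: {change:+d}")
```

Only the change is plotted, so the original magnitudes are deliberately left out of the picture.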

Figure 2.8. - Deviation in diseases before and after 4 weeks of sanitation drive
Source: www.cdc.gov

2.3. Treemaps
When the number of items to be compared in a category exceeds what a bar graph
can handle, treemaps are used. Treemaps are designed to display part-to-whole
relationships.

They use rectangles nested within larger rectangles to represent a hierarchy of up to 3
levels. In addition to rectangle size, colour can be used to display another attribute.

Figure 2.9. - Proportion of individual country GDP to Top 15 Nation GDP 2011
Source - Satori group

3. Variation within Numerical Measures


When examining the variation within numerical measures, the focus is on understanding
how the data is distributed across the range from the lowest to the highest value, in
other words, the data distribution.

3.1. Histograms
Histograms are the most commonly used graphs for summarising distributions and are
frequently used to understand the data. They are useful when there is a large number of
observations.

The spread of values is divided into intervals of equal size. Bars are used to display the
number (or percentage) of values in each interval. The histogram makes it easy to
recognise key characteristics or patterns in the data visually, for example the highest
and lowest scores, where scores are centred, and whether scores are clustered together
or scattered.
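
The binning step behind a histogram can be sketched without any plotting library. The function name `histogram`, the rule of placing the maximum value in the last interval, and the sample scores are assumptions of this example; it also assumes the values are not all equal.

```python
def histogram(values, bins):
    """Split the range of values into equal-width intervals and count each one."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / bins  # assumes highest > lowest
    counts = [0] * bins
    for v in values:
        # The maximum value would fall just past the last interval, so clamp it.
        index = min(int((v - lowest) / width), bins - 1)
        counts[index] += 1
    edges = [lowest + i * width for i in range(bins + 1)]
    return edges, counts

scores = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # made-up values
edges, counts = histogram(scores, bins=4)

# A rough text rendering: one '#' per observation in each interval.
for i, count in enumerate(counts):
    print(f"{edges[i]:>4.1f} to {edges[i + 1]:>4.1f} | {'#' * count}")
```

Even in text form the clustering at the low end and the isolated high value are visible at a glance.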

Figure 3.1. - Pattern of frequency of arrivals at park gate


Source - Wikimedia

Figure 3.2. - Bimodal histogram showing 2 peaks in distribution of weights


Source – Minitab

a. Relative frequency histogram


In a relative frequency histogram, the vertical scale is marked with relative frequencies
instead of actual frequencies.
Relative Frequency = Class Frequency/Sum of all Frequencies
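
Applied to a list of class frequencies, the formula above is a one-line computation. The helper name and the frequencies are invented for this sketch.

```python
def relative_frequencies(class_frequencies):
    """Divide each class frequency by the sum of all frequencies."""
    total = sum(class_frequencies)
    return [f / total for f in class_frequencies]

frequencies = [3, 5, 1, 1]  # made-up class frequencies
relative = relative_frequencies(frequencies)
print(relative)
```

The relative frequencies always sum to 1, so histograms built from samples of different sizes can be compared on the same vertical scale.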

Figure 3.3. - Relative frequency percentage of occurrences over the year


Source - The National Severe Storms Laboratory

b. Frequency polygons
These are similar in purpose to a histogram, but use a line to represent the values
instead of bars. Line segments are connected to points located directly above class
midpoint values. The heights of the points correspond to the class frequencies, and the
line segments are extended to the right and left so that the graph begins and ends on
the horizontal axis.
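
The construction just described can be sketched as a function that turns class boundaries and frequencies into the points of the polygon. The name `frequency_polygon_points` and the sample classes are assumptions; equal class widths are assumed, and the closing points are placed one class width beyond the first and last midpoints.

```python
def frequency_polygon_points(edges, frequencies):
    """Return (x, y) points: class midpoints plus a zero point at each end."""
    width = edges[1] - edges[0]  # assumes equal-width classes
    midpoints = [(edges[i] + edges[i + 1]) / 2 for i in range(len(frequencies))]
    xs = [midpoints[0] - width] + midpoints + [midpoints[-1] + width]
    ys = [0] + list(frequencies) + [0]  # begin and end on the horizontal axis
    return list(zip(xs, ys))

points = frequency_polygon_points([1, 3, 5, 7, 9], [3, 5, 1, 1])
print(points)
```

Joining these points with straight line segments gives the polygon.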

Figure 3.4. - Frequency polygon of bacterial cell lengths


Source - Kean University

Figure 3.5. - Frequency polygon with multiple distributions


Source - onlinestatbook.com
Frequency polygons have two advantages over histograms:
• The shape of the distribution is shown more clearly
• The shapes of multiple distributions can be compared in a single graph

c. Cumulative Frequency Distributions


Frequency polygons are also good choices to display cumulative frequency distributions.
Cumulative frequency for a given class is the sum of the frequencies for that class and
the preceding classes.
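
The running sum in this definition maps directly onto `itertools.accumulate` from the Python standard library. The class frequencies below are made up for illustration.

```python
from itertools import accumulate

frequencies = [3, 5, 1, 1]  # made-up class frequencies, lowest class first
cumulative = list(accumulate(frequencies))
print(cumulative)
```

The last cumulative value equals the total number of observations, so the plotted line never decreases and ends at the sample size.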

Figure 3.6. - Cumulative frequency distribution


Source - sychstat.missouristate.edu

3.2. Box Plots
Invented by John Tukey, box-plots are an excellent tool for comparing multiple distributions.
To draw a box-plot, we need to do the following:
a. Order the data
b. Obtain the minimum and maximum values
c. Obtain the median and the quartiles Q1 and Q3.
d. Draw a line from the minimum to the maximum value
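e. Draw a box with lines at Q1, the median and Q3.
The IQR is the inter-quartile range, that is, the value of Q3 - Q1. Values which fall
outside the lower and upper inner fences, that is, below (Q1 - 1.5 IQR) or above
(Q3 + 1.5 IQR), are considered outliers. When outliers are plotted as separate points,
the line in step d usually extends only to the most extreme values inside the fences.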
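
The steps and the outlier rule above can be sketched as follows. Quartile conventions differ between textbooks and tools; the "inclusive" method of `statistics.quantiles` used here is an assumption of this example, as are the function name `box_plot_summary` and the sample values.

```python
import statistics

def box_plot_summary(values):
    """Compute the numbers needed to draw a box-plot with 1.5 IQR fences."""
    data = sorted(values)  # step a: order the data
    # Step c: median and quartiles ("inclusive" method is one convention of several).
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "min": data[0],  # step b: minimum and maximum
        "max": data[-1],
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "fences": (lower_fence, upper_fence),
        "whiskers": (inside[0], inside[-1]),  # most extreme values inside the fences
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }

box = box_plot_summary([7, 2, 9, 4, 25, 5, 10, 6, 8])  # made-up values
for name, value in box.items():
    print(name, value)
```

With these made-up values the single observation of 25 lies above the upper fence and is reported as an outlier, while the whiskers stop at 2 and 10.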

Figure 3.7. - Box-plots


Source - whatissixsigma.net
The box-plot is very useful for displaying the full range of the data, that is, the centre,
the spread of values from minimum to maximum, and any outlier values. It also helps to
check whether values are clustered or evenly distributed.

Summary
In the data analysis process, it is important to thoroughly understand the data prior to further
analysis in order to obtain insights. Visualizations help tremendously in this process to
unearth facts and trends which might otherwise not have been visible except through
graphical means.

Visualization techniques are used to understand the shape of the data distribution. In order
to understand the data in the dataset, both the categorical and numerical (or quantitative)
measures associated with it need to be examined. Statistical meaning is found both in the
variations within, as well as relationships between categorical and numerical measures.

Variation within categorical measures is visualized primarily using bar graphs (horizontal,
column and stacked), pie and doughnut charts, specialized bar graphs such as bullet
graphs, Pareto charts and deviation bar graphs, and treemaps.

Variation within numerical measures is visualized using histograms (including relative
frequency histograms), frequency polygons and box-plots.
