M2 - Visualization of Categorical and Numerical Data
Table of Contents
Introduction
Summary
Data Visualization
Introduction
This topic gives an overview of why visual analysis techniques are needed when examining
statistical data. These techniques are used to understand the variation within categorical
and numerical data measures, as well as the relationships between these measures.
This topic covers the visualization methods used to examine graphically the variation within
categorical and numerical measures. The remaining methods, which deal with variation
through time and across space and with relationships between the measures, are
covered in the subsequent topic.
Learning Objectives
Upon completion of this topic, you will be able to:
• Explain the importance of visual analysis for statistical data.
• Describe the techniques available for visualizing variation within categorical and numerical
data.
Hence, the approach taken is to move top-down, first arriving at an initial set of
characteristics that describe the data at an overall level. This is achieved by applying data
visualization methods to sample data to unearth critical facts and trends.
Visualization techniques are especially useful for identifying outliers in distributions and
checking for associations between variables. Visualizing data distributions often also
reveals very different behavior in datasets that have identical location or variability
measures, as shown in Figure 1.1.
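As a minimal illustration of this point, the sketch below builds two made-up samples (the values are assumptions chosen for illustration, not the data behind Figure 1.1) that share the same mean and standard deviation yet have clearly different shapes. It uses only Python's standard statistics module and a crude text plot.

```python
import statistics

# Hypothetical samples, chosen only to illustrate the idea.
two_clusters = [2, 2, 2, 2, 8, 8, 8, 8]         # values split into two groups
spike_with_tails = [-1, 5, 5, 5, 5, 5, 5, 11]   # one central spike, two extremes


def describe(values):
    """Return (mean, population standard deviation) of a sample."""
    return statistics.mean(values), statistics.pstdev(values)


def text_plot(values):
    """Return one line per distinct value, with a star for each occurrence."""
    return [f"{v:>3} | {'*' * values.count(v)}" for v in sorted(set(values))]


for name, sample in [("two_clusters", two_clusters),
                     ("spike_with_tails", spike_with_tails)]:
    mean, sd = describe(sample)
    print(f"{name}: mean={mean}, sd={sd}")
    print("\n".join(text_plot(sample)))
```

Both samples report the same summary statistics, but the text plots differ, which is the kind of difference a summary table alone would hide.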
Categorical measures assign each observation to a category. Typical examples of such data
are product, country, customer and territory. Numerical measures are quantities that can be
measured. Examples are profit, revenue, expense and blood pressure.
Statistical meaning is found both in the variations within the categorical and numerical
measures as well as in relationships among them. The following table summarizes the
different types of variations and relationships.
Variation within categorical measures | How do items in the categories relate to each other? (Ranking, Part-to-whole)
Variation within numerical measures   | How are values in the measure distributed across the range? (Distribution)
Table 1.1 – Summary table of variations and relationships within and across measures.
b. Vertical/Column: Good for showing chronological data, such as growth over specific
periods, and comparing data across categories.
Two other techniques used to visualize the part-to-whole distribution of items in a category
are the Pie Chart and the Doughnut Chart. The arc length of each sector, and consequently
its area, is proportional to the quantity it represents.
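The proportionality rule can be sketched as a small calculation: each category's share of the total determines the central angle of its sector, and arc length and area scale with that angle. The function name and the sales figures below are hypothetical, used only to illustrate the arithmetic.

```python
def sector_angles(quantities):
    """Map each category to the central angle (in degrees) of its pie sector."""
    total = sum(quantities.values())
    if total <= 0:
        raise ValueError("quantities must sum to a positive total")
    # Each sector receives the same fraction of 360 degrees as its share of the total.
    return {name: 360.0 * value / total for name, value in quantities.items()}


# Hypothetical sales figures, used only to illustrate the calculation.
sales = {"North": 50, "South": 25, "East": 15, "West": 10}
angles = sector_angles(sales)

for region, angle in angles.items():
    print(f"{region}: {angle:.1f} degrees")
```

A category with twice the quantity of another receives a sector with twice the angle, which is what lets the reader compare parts of the whole by eye.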
In Figure 2.6, the dark bar represents the performance measure, the vertical marker
represents the comparative value, and the colour shading represents the degree of
performance, with the lightest shade denoting the best performance.
b. Pareto Charts
Pareto charts are helpful for depicting part-to-whole relationships. The items are shown as
bars arranged in descending order of value. The line denotes the cumulative value, totalling
100%, and helps to pinpoint the main contributing parts of the whole.
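The numbers behind a Pareto chart can be sketched as follows: sort the items in descending order of value, then accumulate a running percentage for the line. The function name and the complaint counts are hypothetical and serve only as an illustration.

```python
def pareto_table(values):
    """Return rows of (item, value, cumulative percentage), largest value first."""
    total = sum(values.values())
    if total <= 0:
        raise ValueError("values must sum to a positive total")
    ordered = sorted(values.items(), key=lambda pair: pair[1], reverse=True)
    rows, running = [], 0
    for item, value in ordered:
        running += value                       # height of the cumulative line so far
        rows.append((item, value, 100.0 * running / total))
    return rows


# Hypothetical complaint counts, used only to illustrate the calculation.
complaints = {"Billing": 7, "Late delivery": 45, "Other": 3,
              "Wrong item": 15, "Damaged item": 30}
rows = pareto_table(complaints)

for item, value, cumulative in rows:
    print(f"{item:<14} {value:>3}  cumulative {cumulative:5.1f}%")
```

Reading down the cumulative column shows how few items account for most of the total, which is the question a Pareto chart is meant to answer.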
Figure 2.8. - Deviation in diseases before and after 4 weeks of sanitation drive
Source: www.cdc.gov
2.3. Treemaps
When the number of products to be compared in the categories exceeds what a bar graph
can handle, treemaps are used. Treemaps are designed to display part-to-whole
relationships.
Figure 2.9. - Proportion of individual country GDP to Top 15 Nation GDP 2011
Source - Satori group
3.1. Histograms
Histograms are the most commonly used graphs for summarising distributions and are
frequently used to understand the data. They are useful when there are a large number of
observations. The range of values is divided into intervals of equal size, and bars display
the percentage of values in each interval. The histogram makes key characteristics or
patterns in the data easy to recognize visually, for example the highest and lowest scores,
where the scores are centred, and whether the scores are clustered together or scattered.
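A minimal sketch of the binning step, assuming equal-width intervals between the smallest and largest value and a simple text rendering instead of a plotting library. The function name and the test scores are hypothetical.

```python
def histogram(values, bins):
    """Split the range of values into equal-width intervals.

    Returns a list of (lower, upper, percentage of values) per interval.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; no range to divide")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last interval rather than a new one.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    n = len(values)
    return [(lo + i * width, lo + (i + 1) * width, 100.0 * c / n)
            for i, c in enumerate(counts)]


# Hypothetical test scores, used only to illustrate the calculation.
scores = [40, 52, 55, 61, 63, 65, 67, 68, 70, 71,
          72, 74, 75, 77, 79, 81, 84, 88, 92, 100]
bars = histogram(scores, bins=6)

for lower, upper, pct in bars:
    print(f"{lower:5.1f} to {upper:5.1f} | {'#' * round(pct / 5)} {pct:.0f}%")
```

In this made-up sample the tallest bar sits at 70 to 80, showing where the scores are centred, while the short bars at either end show the lowest and highest scores.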
b. Frequency polygons
Frequency polygons are similar in purpose to a histogram, but use a line instead of bars
to represent the values. Line segments connect points located directly above the class
midpoint values. The heights of the points correspond to the class frequencies, and the
line segments are extended to the right and left so that the graph begins and ends on
the horizontal axis.
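The construction can be sketched as a list of vertices. The sketch assumes equal-width classes and follows one common convention for closing the polygon: a zero-frequency point is added one class width before the first midpoint and one class width after the last. The function name, class boundaries and frequencies are hypothetical.

```python
def frequency_polygon(class_edges, frequencies):
    """Return the (x, y) vertices of a frequency polygon.

    class_edges holds the boundaries of equal-width classes, so it has
    one more entry than frequencies.
    """
    if len(class_edges) != len(frequencies) + 1:
        raise ValueError("need exactly one more class edge than frequencies")
    width = class_edges[1] - class_edges[0]
    # Each point sits directly above the midpoint of its class.
    midpoints = [(a + b) / 2 for a, b in zip(class_edges, class_edges[1:])]
    points = list(zip(midpoints, frequencies))
    # Assumed convention: extend one class width on each side with frequency 0
    # so that the line begins and ends on the horizontal axis.
    start = (midpoints[0] - width, 0)
    end = (midpoints[-1] + width, 0)
    return [start] + points + [end]


# Hypothetical classes and frequencies, used only to illustrate the steps.
edges = [40, 50, 60, 70, 80, 90, 100]
freqs = [1, 2, 5, 7, 3, 2]
vertices = frequency_polygon(edges, freqs)

for x, y in vertices:
    print(f"x={x:5.1f}  frequency={y}")
```

Joining these vertices in order with straight line segments gives the polygon.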
e. Draw a box with its lines drawn at Q1, the median and Q3.
The IQR is the interquartile range, i.e. the value of Q3 - Q1. Values which fall outside the
lower and upper inner fences, i.e. below (Q1 - 1.5 IQR) or above (Q3 + 1.5 IQR), are
considered outliers.
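The outlier rule can be sketched as below. The text does not say which quartile convention is used, so the sketch assumes the default method of Python's statistics.quantiles; other conventions can give slightly different Q1 and Q3 values. The function name and the delivery times are hypothetical.

```python
import statistics


def box_plot_summary(values):
    """Compute the quartiles, inner fences and outliers used in a box plot."""
    # Assumption: quartiles follow the default ("exclusive") method of
    # statistics.quantiles; the source does not specify a convention.
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in sorted(values) if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr,
            "lower_fence": lower_fence, "upper_fence": upper_fence,
            "outliers": outliers}


# Hypothetical delivery times in days, used only to illustrate the rule.
delivery_days = [3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 21]
summary = box_plot_summary(delivery_days)

for key, value in summary.items():
    print(f"{key}: {value}")
```

For this made-up sample the box spans Q1 = 4 to Q3 = 7 with the median at 6, and the value 21 lies above the upper inner fence of 11.5, so it is flagged as an outlier.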
Summary
In the data analysis process, it is important to understand the data thoroughly before
further analysis in order to obtain insights. Visualizations help tremendously in this
process by unearthing facts and trends that might not be visible except through
graphical means.
Visualization techniques are used to understand the shape of the data distribution. In order
to understand the data in the dataset, both the categorical and numerical (or quantitative)
measures associated with it need to be examined. Statistical meaning is found both in the
variations within, as well as relationships between categorical and numerical measures.
Variation within categorical measures is visualized primarily using bar graphs (horizontal,
column and stacked), pie and doughnut charts, and specialized graphs such as bullet
charts, Pareto charts, deviation bar graphs and treemaps.