
Module 4

The document covers the significance of data visualization and exploration, emphasizing the use of Python libraries like NumPy and pandas for data analysis and visualization. It explains various types of visualizations, including comparison plots, relation plots, and the importance of data wrangling to transform raw data into meaningful insights. Additionally, it discusses design practices for effective visual representation of data.

Uploaded by

lguy3631

DATA SCIENCE AND VISUALIZATION

21CS644
Module-4: Data Visualization and Data Exploration

2
The Importance of Data Visualization and Data Exploration

1. Introduce the basics of the statistical analysis of a dataset.
2. Cover basic operations for calculating the mean, median, and variance of different
datasets, and use NumPy and pandas to filter, sort, and shape the datasets to our
requirements.
3. Build the ability to explain the importance of data visualization, calculate basic
statistical values (such as median, mean, and variance), and use NumPy and
pandas for data wrangling.
3
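The statistical values named in these objectives can be computed in a few lines. The following is a minimal sketch, assuming NumPy and pandas are installed; the sample values are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, used only for illustration.
values = np.array([4, 8, 15, 16, 23, 42])

mean = np.mean(values)        # arithmetic mean
median = np.median(values)    # middle value of the sorted data
variance = np.var(values)     # population variance (ddof=0 by default)

# The same data as a pandas Series. pandas defaults to the sample
# variance (ddof=1), so the two results differ unless ddof is set.
series = pd.Series(values)
sample_variance = series.var()

# Filtering and sorting, two of the shaping operations mentioned above.
filtered = series[series > 10]
sorted_desc = series.sort_values(ascending=False)

print(mean, median, variance, sample_variance)
```

Note the different defaults: `np.var` divides by n, while `Series.var` divides by n - 1.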
Introduction

1. Unlike machines, people are usually not equipped for interpreting a large amount
of information from a random set of numbers and messages in each piece of
data.
2. Out of all our logical capabilities, we understand things best through the visual
processing of information.
3. When data is represented visually, the probability of understanding complex
structures and numbers increases.

4
Introduction

1. Python has recently emerged as a programming language that performs well for
data analysis.
2. It has applications across data science pipelines that convert data into a usable
format (such as pandas), analyze it (such as NumPy), and extract useful
conclusions from the data to represent it in a visually appealing manner (such as
Matplotlib or Bokeh).
3. Python provides data visualization libraries that can help you assemble graphical
representations efficiently.
5
Introduction to Data Visualization

1. Computers and smartphones store data such as names and numbers in a digital
format.
2. Data representation refers to the form in which you can store, process, and
transmit data.
3. Representations can narrate a story and convey fundamental discoveries to your
audience.
4. Data loses much of its value if it is not modeled appropriately so that
meaningful findings can be drawn from it.
6
Introduction to Data Visualization

1. Creating representations helps us achieve a more precise, more concise, and
more direct perspective of information, making it easier for anyone to understand
the data.
2. Information isn't equivalent to data. Representations are a useful apparatus to
derive insights from the data.
3. Thus, representations transform data into useful information.

7
The Importance of Data Visualization

1. Instead of just looking at data in the columns of an Excel spreadsheet, we get a
better idea of what our data contains by using visualization.
2. For instance, it's easy to see a pattern emerge from the numerical data that's
given in the following scatter plot.
3. It shows the correlation between body mass and the maximum longevity of
various animals grouped by class.
4. There is a positive correlation between body mass and maximum longevity:


10
The Importance of Data Visualization

1. Visualizing data has many advantages, such as the following:

1. Complex data can be easily understood.
2. A simple visual representation of outliers, target audiences, and futures
markets can be created.
3. Storytelling can be done using dashboards and animations.
4. Data can be explored through interactive visualizations.

11
Data Wrangling

1. Data wrangling is the process of transforming raw data into a suitable
representation for various tasks.
2. It is the discipline of augmenting, cleaning, filtering, standardizing, and enriching
data in a way that allows it to be used in a downstream task, which in our case is
data visualization.
3. Look at the following data wrangling process flow diagram to understand how
accurate and actionable data can be obtained for business analysts to work on:

12
Data Wrangling

13
Data Wrangling

1. In relation to the preceding figure, the following steps explain the flow of the data
wrangling process:

1. First, the Employee Engagement data is in its raw form.
2. Then, the data gets imported as a DataFrame and is later cleaned.
3. The cleaned data is then transformed into graphs, from which findings can be
derived.
4. Finally, we analyze this data to communicate the final results.
14
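The import, clean, and summarize steps of this flow can be sketched with pandas. This is only an illustration under stated assumptions: the column names, the records, and the specific cleaning rules are hypothetical and do not come from the Employee Engagement data in the figure.

```python
import pandas as pd

# Hypothetical raw survey records; names and columns are assumptions.
raw = pd.DataFrame({
    "employee": ["Ann", "Bob", "Bob", "Cara", "Dev"],
    "department": ["Sales", "sales", "sales", "support", "Support"],
    "engagement_score": [4.0, 3.0, 3.0, None, 5.0],
})

cleaned = (
    raw.drop_duplicates()                     # remove repeated rows
       .dropna(subset=["engagement_score"])   # drop rows without a score
       .assign(department=lambda d: d["department"].str.title())  # standardize labels
)

# Aggregate to the values that a chart would display.
summary = cleaned.groupby("department")["engagement_score"].mean()
print(summary)
```

The resulting `summary` holds one value per department, which is the shape a bar chart of engagement would need.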
Data Wrangling

1. For example, employee engagement can be measured based on raw data
gathered from feedback surveys, employee tenure, exit interviews, one-on-one
meetings, and so on.
2. This data is cleaned and made into graphs based on parameters such as
referrals, faith in leadership, and scope of promotions.
3. The percentages, that is, information derived from the graphs, help us reach our
result, which is to determine the measure of employee engagement.

15
Tools and Libraries for Visualization
1. There are several approaches to creating data visualizations.
2. Depending on your requirements, you might want to use a non-coding tool such
as Tableau, which allows you to get a good feel for your data.
3. Besides Python, which will be used in this book, MATLAB and R are widely used
in data analytics.
4. However, Python is one of the most popular languages in the industry.
5. Its ease of use and the speed at which you can manipulate and visualize data,
combined with the availability of a number of libraries, make Python a strong
choice for data visualization.
16
All You Need to Know about Plots

1. You will be equipped with the important skill of identifying the best plot type for a
given dataset and scenario.

17
Introduction
1. Focus on various visualizations and identify which visualization is best for
showing certain information for a given dataset.
2. We will describe every visualization in detail and give practical examples, such
as comparing different stocks over time or comparing the ratings for different
movies.
3. Starting with comparison plots, which are great for comparing multiple variables
over time, we will look at their types (such as line charts, bar charts, and radar
charts).
18
Introduction

1. We will then move onto relation plots, which are handy for showing relationships
among variables.
2. We will cover scatter plots for showing the relationship between two variables,
bubble plots for three variables, correlograms for variable pairs, and finally,
heatmaps for visualizing multivariate data.

19
Introduction

1. The chapter will further explain composition plots (used to visualize variables that
are part of a whole), such as pie charts, stacked bar charts, stacked area
charts, and Venn diagrams.
2. To give you a deeper insight into the distribution of variables, we will discuss
distribution plots, describing histograms, density plots, box plots, and violin plots.

20
Introduction

1. Finally, we will talk about dot maps, connection maps, and choropleth maps,
which can be categorized into geoplots.
2. Geoplots are useful for visualizing geospatial data.
3. Let’s start with the family of comparison plots, including line charts, bar charts,
and radar charts.

21
Comparison Plots
1. Comparison plots include charts that are ideal for comparing multiple variables or
variables over time.
2. Line charts are great for visualizing variables over time.
3. For comparison among items, bar charts (also called column charts) are the best
way to go.
4. For a short time period (say, fewer than 10 time points), vertical bar charts can
be used as well.
5. Radar charts or spider plots are great for visualizing multiple
variables for multiple groups.
22
Line Chart

1. Line charts are used to display quantitative values over a continuous time period
and show information as a series.
2. A line chart is ideal for a time series; the data points are connected by straight-line segments.

3. The value being measured is placed on the y-axis, while the x-axis is the
timescale.
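A line chart with time on the x-axis and the measured value on the y-axis can be drawn with Matplotlib. The following is a minimal sketch; the years and values are hypothetical, and the non-interactive "Agg" backend is an assumption made so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available, render off-screen
import matplotlib.pyplot as plt

# Hypothetical yearly values for two series.
years = [2015, 2016, 2017, 2018, 2019, 2020]
series_a = [1.0, 1.2, 1.5, 1.4, 1.8, 2.1]
series_b = [0.8, 0.9, 1.1, 1.3, 1.2, 1.6]

fig, ax = plt.subplots()
ax.plot(years, series_a, label="Series A")  # points joined by straight segments
ax.plot(years, series_b, label="Series B")
ax.set_xlabel("Year")    # timescale on the x-axis
ax.set_ylabel("Value")   # measured quantity on the y-axis
ax.legend()              # a legend identifies each variable
# fig.savefig("line_chart.png") would write the chart to a file.
```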

23
Line Chart

1. Uses
1. Line charts are great for comparing multiple variables and visualizing trends
for both single as well as multiple variables, especially if your dataset has
many time periods (more than 10).
2. For smaller time periods, vertical bar charts might be the better choice.

24
Line Chart
1. The following diagram shows a trend of real estate prices (in millions of US dollars)
across two decades. Line charts are ideal for showing data trends:

25
Line Chart

1. Example:
2. The following figure is a multiple-variable line chart that compares the
stock-closing prices for Google, Facebook, Apple, Amazon, and Microsoft.
3. A line chart is great for comparing values and visualizing the trend of the stock.
4. As we can see, Amazon shows the highest growth:

26
Line Chart

27
Line Chart

1. Design Practices
1. Avoid too many lines per chart.
2. Adjust your scale so that the trend is clearly visible.
2. Note: For plots with multiple variables, a legend should be given to describe each
variable.

28
Bar Chart

1. In a bar chart, the bar length encodes the value.
2. There are two variants of bar charts: vertical bar charts and horizontal bar charts.
3. Use
1. While they are both used to compare numerical values across categories,
vertical bar charts are sometimes used to show a single variable over time.
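Both bar chart variants can be produced from the same data. The following is a minimal sketch using Matplotlib; the student labels and marks are hypothetical, and the "Agg" backend is an assumption made so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available, render off-screen
import matplotlib.pyplot as plt

# Hypothetical marks (out of 100) for five students.
students = ["A", "B", "C", "D", "E"]
marks = [72, 85, 64, 90, 58]

fig, (ax_v, ax_h) = plt.subplots(1, 2, figsize=(8, 3))

ax_v.bar(students, marks)    # vertical: bar height encodes the value
ax_v.set_ylim(0, 100)        # numerical axis starts at zero
ax_v.set_title("Vertical")

ax_h.barh(students, marks)   # horizontal: bar width encodes the value
ax_h.set_xlim(0, 100)
ax_h.set_title("Horizontal")
```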

29
Bar Chart

1. Don’ts of Bar Charts
1. Don’t confuse vertical bar charts with histograms. Bar charts compare
different variables or categories, while histograms show the distribution for a
single variable. Histograms will be discussed later in this chapter.
2. Another common mistake is to use bar charts to show central tendencies
among groups or categories. Use box plots or violin plots to show statistical
measures or distributions in these cases.
30
Bar Chart
1. Examples: The following diagram shows a vertical bar chart. Each bar shows the
marks out of 100 that 5 students obtained in a test:

31
Bar Chart
1. The following diagram shows a horizontal bar chart. Each bar shows the marks
out of 100 that 5 students obtained in a test:

32
Bar Chart
1. The following diagram compares movie ratings, giving two different scores.
2. The Tomatometer is the percentage of approved critics who have given a positive
review for the movie.
3. The Audience Score is the percentage of users who have given a score of 3.5 or
higher out of 5.
4. As we can see, The Martian is the only movie with both a high Tomatometer and
Audience Score.
5. The Hobbit: An Unexpected Journey has a relatively high Audience Score
compared to the Tomatometer score, which might be due to a huge fan base:
33
Bar Chart

34
Bar Chart
1. Design Practices
1. The axis corresponding to the numerical variable should start at zero.
Starting with another value might be misleading, as it makes a small value
difference look like a big one.
2. Use horizontal labels as long as the number of bars is small and
the chart doesn’t look too cluttered.
3. The labels can be rotated to different angles if there isn’t enough space to
present them horizontally. You can see this on the labels of the x-axis of the
preceding diagram.
35
Radar Chart
1. Radar charts (also known as spider or web charts) visualize multiple variables
with each variable plotted on its own axis, resulting in a polygon.
2. All axes are arranged radially, starting at the center with equal distances between
one another, and have the same scale.
3. Uses
1. Radar charts are great for comparing multiple quantitative variables for a
single group or multiple groups.
2. They are also useful for showing which variables score high or low within a
dataset, making them ideal for visualizing performance.
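A radar chart can be built on a polar axis by spacing one axis per variable evenly around the circle and closing the polygon. The following is a minimal sketch with Matplotlib and NumPy; the subjects and scores are hypothetical, and the "Agg" backend is an assumption made so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available, render off-screen
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical marks of one student in five subjects.
labels = ["Math", "Physics", "Chemistry", "Biology", "English"]
scores = [80, 65, 90, 70, 85]

# One axis per variable, arranged radially with equal angular distance.
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()

# Repeat the first point at the end so the outline closes into a polygon.
angles_closed = angles + angles[:1]
scores_closed = scores + scores[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(angles_closed, scores_closed)
ax.fill(angles_closed, scores_closed, alpha=0.25)
ax.set_xticks(angles)
ax.set_xticklabels(labels)
ax.set_ylim(0, 100)  # all axes share the same scale
```

To compare several groups, the `plot` and `fill` calls can be repeated once per group on the same axis, or each group can get its own subplot (faceting).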
36
Radar Chart
1. Examples: The following diagram shows a radar chart for a single group. This
chart displays the marks that a student scored in different subjects:

37
Radar Chart
1. The following diagram shows a radar chart for two variables/groups. Here, the
chart explains the marks that were scored by two students in different subjects:

38
Radar Chart
1. The following diagram shows a radar chart for multiple variables/groups. Each
chart displays data about a student's performance in different subjects:


40
Radar Chart

1. Design Practices
1. Try to display 10 factors or fewer on a single radar chart to make it easier to
read.
2. Use faceting (displaying each variable in a separate plot) for multiple
variables/groups, as shown in the preceding diagram, in order to maintain
clarity.

41
Radar Chart

1. In the first section, we learned which plots are suitable for comparing items. Line
charts are great for comparing something over time, whereas bar charts are for
comparing different items.
2. Last but not least, radar charts are best suited for visualizing multiple variables
for multiple groups.
3. In the following activity, you can check whether you understood which plot is best
for which scenario.

42
Activity 2.01: Employee Skill Comparison

1. You are given scores of four employees (Alex, Alice, Chris, and Jennifer) for five
attributes: efficiency, quality, commitment, responsible conduct, and cooperation.
2. Your task is to compare the employees and their skills.
3. This activity will foster your skills in choosing the best visualization when it
comes to comparing items.
1. Which charts are suitable for this task?
2. You are given the following bar and radar charts. List the advantages and
disadvantages of both charts. Which is the better chart for this task in your
opinion, and why?
43
Activity 2.01: Employee Skill Comparison

44
Activity 2.01: Employee Skill Comparison

45
Activity 2.01: Employee Skill Comparison

1. What could be improved in the respective visualizations?

46
Activity 2.01: Employee Skill Comparison

1. Having completed the activity, you should have a good understanding of how to
decide which comparison plot is best for a given situation.
2. In the next section, we will discuss different relation plots.

47
Relation Plots

1. Relation plots are perfectly suited to showing relationships among variables.
2. A scatter plot visualizes the correlation between two variables for one or
multiple groups.
3. Bubble plots can be used to show relationships between three variables.
4. The additional third variable is represented by the dot size.
5. Heatmaps are great for revealing patterns or correlations between two
qualitative variables.
6. A correlogram is a perfect visualization for showing the correlation among
multiple variables.
48
Scatter Plot

1. Scatter plots show data points for two numerical variables, with one variable
on each axis.
2. Uses
1. You can detect whether a correlation (relationship) exists between two
variables.
2. They allow you to plot the relationship between multiple groups or
categories using different colors.
3. A bubble plot, which is a variation of the scatter plot, is an excellent tool for
visualizing the correlation with a third variable.
49
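Whether a correlation exists can be checked numerically as well as visually. The following is a minimal sketch that plots two variables and computes the Pearson correlation coefficient with NumPy; the height and weight values are hypothetical, and the "Agg" backend is an assumption made so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available, render off-screen
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical height and weight measurements for one group.
height_cm = np.array([150, 160, 165, 170, 175, 180, 185])
weight_kg = np.array([50, 56, 61, 66, 70, 77, 82])

fig, ax = plt.subplots()
ax.scatter(height_cm, weight_kg)  # one variable on each axis
ax.set_xlabel("Height (cm)")
ax.set_ylabel("Weight (kg)")

# Pearson correlation coefficient: the off-diagonal entry of the 2x2 matrix.
r = np.corrcoef(height_cm, weight_kg)[0, 1]
print(f"Pearson r = {r:.3f}")
```

A value of `r` close to 1 indicates a strong positive linear relationship, which appears in the plot as points rising from left to right.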
Scatter Plot

1. Examples: The following diagram shows a scatter plot of height and weight of
persons belonging to a single group:

50
Scatter Plot

1. The following diagram shows the same data as in the previous plot but
differentiates between groups. In this case, we have different groups: A, B, and
C:

51
Scatter Plot
1. The following diagram shows the correlation between body mass and the
maximum longevity for various animals grouped by their classes.
2. There is a positive correlation between body mass and maximum longevity:

52
Scatter Plot

1. Design Practices
1. Start both axes at zero to represent data accurately.
2. Use contrasting colors for data points and avoid using symbols for scatter
plots with multiple groups or categories.

53
Variants: Scatter Plots with Marginal Histograms

1. In addition to the scatter plot, which visualizes the correlation between two
numerical variables, you can plot the marginal distribution for each variable in the
form of histograms to give better insight into how each variable is distributed.
2. Examples: The following diagram shows the correlation between body mass and
the maximum longevity for animals in the Aves class.
3. The marginal histograms are also shown, which helps to get a better insight into
both variables:

54
Variants: Scatter Plots with Marginal Histograms

55
Bubble Plot

1. A bubble plot extends a scatter plot by introducing a third numerical variable.
2. The value of the variable is represented by the size of the dots.
3. The area of the dots is proportional to the value.
4. A legend is used to link the size of the dot to an actual numerical value.
5. Use
1. Bubble plots help to show a correlation between three variables.
6. Example: The following diagram shows a bubble plot of the height and age of
humans, with the weight of each person represented by the size of the bubble:
56
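In Matplotlib, the `s` argument of `scatter` is the marker area in points squared, so passing a multiple of the third variable keeps the dot area proportional to the value. The following is a minimal sketch; the ages, heights, weights, and the scale factor are hypothetical, and the "Agg" backend is an assumption made so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available, render off-screen
import matplotlib.pyplot as plt

# Hypothetical measurements for five people.
age = [25, 32, 41, 50, 60]
height_cm = [168, 175, 180, 172, 165]
weight_kg = [60, 72, 80, 68, 90]

scale = 5  # arbitrary: marker area in points^2 per kilogram
sizes = [scale * w for w in weight_kg]  # area proportional to weight

fig, ax = plt.subplots()
bubbles = ax.scatter(age, height_cm, s=sizes, alpha=0.5)
ax.set_xlabel("Age (years)")
ax.set_ylabel("Height (cm)")

# Size legend: convert marker areas back to kilograms for the labels.
handles, size_labels = bubbles.legend_elements(
    prop="sizes", num=3, func=lambda s: s / scale
)
ax.legend(handles, size_labels, title="Weight (kg)")
```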
Bubble Plot

57
Bubble Plot

1. Design Practices
1. The design practices for the scatter plot are also applicable to the bubble
plot.
2. Don't use bubble plots for very large amounts of data, since too many
bubbles make the chart difficult to read.

58
Correlogram
1. A correlogram is a combination of scatter plots and histograms. Histograms will be
discussed in detail later in this chapter.
2. A correlogram or correlation matrix visualizes the relationship between each pair of
numerical variables using a scatter plot.
3. The diagonals of the correlation matrix represent the distribution of each variable in
the form of a histogram.
4. You can also plot the relationship between multiple groups or categories using
different colors.
5. A correlogram is a great chart for exploratory data analysis to get a feel for your
data, especially the correlation between variable pairs.
59
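pandas offers `pandas.plotting.scatter_matrix`, which draws pairwise scatter plots with a histogram of each variable on the diagonal. The following is a minimal sketch; the data is randomly generated and purely hypothetical, and the "Agg" backend is an assumption made so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available, render off-screen
import numpy as np
import pandas as pd

# Hypothetical, randomly generated data for 50 people.
rng = np.random.default_rng(0)
height = rng.normal(170, 10, 50)
weight = 0.9 * height - 85 + rng.normal(0, 5, 50)  # built to depend on height
age = rng.integers(18, 65, 50)

df = pd.DataFrame({"height": height, "weight": weight, "age": age})

# Numerical view: pairwise Pearson correlation coefficients.
corr = df.corr()
print(corr.round(2))

# Visual view: scatter plots off the diagonal, histograms on the diagonal.
axes = pd.plotting.scatter_matrix(df, diagonal="hist")
```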
Correlogram

1. Examples: The following diagram shows a correlogram for the height, weight, and
age of humans.
2. The diagonal plots show a histogram for each variable.
3. The off-diagonal elements show scatter plots between variable pairs:

60
Correlogram

61
Correlogram
1. The following diagram shows the correlogram with data samples separated by color
into different groups:

62
Correlogram

1. Design Practices
1. Start both axes at zero to represent data accurately.
2. Use contrasting colors for data points and avoid using symbols for scatter plots
with multiple groups or categories.

63
Heatmap

1. A heatmap is a visualization where values contained in a matrix are represented as
colors or color saturation.
2. Heatmaps are great for visualizing multivariate data (data in which analysis is
based on more than two variables per observation), where categorical variables are
placed in the rows and columns and a numerical or categorical variable is
represented as colors or color saturation.
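A heatmap of a small matrix can be drawn with Matplotlib's `imshow`, with categories on the rows and columns and a numerical value encoded as color. The following is a minimal sketch; the site names, product names, and unit counts are hypothetical, and the "Agg" backend is an assumption made so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available, render off-screen
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical units sold: rows are websites, columns are products.
sites = ["Site A", "Site B", "Site C"]
products = ["Phone", "Laptop", "Tablet", "Camera"]
units = np.array([
    [120, 45, 60, 15],
    [90, 70, 30, 25],
    [150, 20, 80, 10],
])

fig, ax = plt.subplots()
im = ax.imshow(units, cmap="Blues")  # darker color means a larger value
ax.set_xticks(range(len(products)))
ax.set_xticklabels(products)
ax.set_yticks(range(len(sites)))
ax.set_yticklabels(sites)
fig.colorbar(im, ax=ax, label="Units sold")  # key linking color to value

# Annotated variant: write each value inside its cell.
for i in range(len(sites)):
    for j in range(len(products)):
        ax.text(j, i, str(units[i, j]), ha="center", va="center")
```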

64
Heatmap

1. Use
1. The visualization of multivariate data can be done using heatmaps as they are
great for finding patterns in your data.
2. Examples: The following diagram shows a heatmap for the most popular products
on the electronics category page across various e-commerce websites, where the
color shows the number of units sold.
3. In the following diagram, we can see that the darker colors represent more units
sold, as shown in the key:
65
Heatmap

66
Variants: Annotated Heatmaps

1. Let's see the same example we saw previously in an annotated heatmap, where the
color shows the number of units sold:

67
Heatmap

1. Design Practice
1. Select colors and contrasts that will be easily visible to individuals with vision
problems so that your plots are more inclusive.
2. In this section, we introduced various plots for relating a variable to other variables
and looked at their uses, and multiple examples for the different relation plots were
given.

The following activity will give you some practice in working with heatmaps.
68
Activity 2.02: Road Accidents Occurring over Two Decades

1. You are given a diagram that provides information about the road accidents that
have occurred over the past two decades during the months of January, April, July,
and October.
2. The aim of this activity is to understand how you can use heatmaps to visualize
multivariate data.
1. Identify the two years during which the number of road accidents occurring was
the least.
2. For the past two decades, identify the month for which accidents showed a
marked decrease:
69
Activity 2.02: Road Accidents Occurring over Two Decades

70
Activity 2.02: Road Accidents Occurring over Two Decades

1. The activity about road accidents gave you a simple example of how to use
heatmaps to illustrate the relationship between multiple variables.

71
Composition Plots

1. Composition plots are ideal if you think about something as a part of a whole.
2. For static data, you can use pie charts, stacked bar charts, or Venn diagrams.
3. Pie charts or donut charts help show proportions and percentages for groups.
4. If you need an additional dimension, stacked bar charts are great.
5. Venn diagrams are the best way to visualize overlapping groups, where each group
is represented by a circle.
6. For data that changes over time, you can use either stacked bar charts or
stacked area charts.
72
Pie Chart

1. Pie charts illustrate numerical proportions by dividing a circle into slices.
2. Each arc length represents a proportion of a category.
3. The full circle equates to 100%.
4. For humans, it is easier to compare bars than arc lengths; therefore, it is
recommended to use bar charts or stacked bar charts the majority of the time.
5. Use
1. To compare items that are part of a whole
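Each slice of a pie chart is the category's share of the total, and the shares add up to 100%. The following is a minimal sketch that computes the shares and draws the slices in decreasing order; the category names and amounts are hypothetical, and the "Agg" backend is an assumption made so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available, render off-screen
import matplotlib.pyplot as plt

# Hypothetical amounts per category.
amounts = {"Category A": 30, "Category B": 100, "Category C": 10, "Category D": 60}

total = sum(amounts.values())
shares = {name: 100 * value / total for name, value in amounts.items()}

# Order slices by size, largest first, as the design practice suggests.
ordered = sorted(shares.items(), key=lambda item: item[1], reverse=True)

fig, ax = plt.subplots()
ax.pie(
    [share for _, share in ordered],
    labels=[name for name, _ in ordered],
    autopct="%1.0f%%",   # print the percentage on each slice
    startangle=90,       # first slice starts at the top
    counterclock=False,  # slices proceed clockwise
)
```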

73
Pie Chart
1. Examples: The following diagram shows household water usage around the world:

74
Pie Chart

1. Design Practices
1. Arrange the slices according to their size in increasing/decreasing order, either
in a clockwise or counterclockwise manner.
2. Make sure that every slice has a different color.

75
Variants: Donut Chart
1. An alternative to a pie chart is a donut chart. Compared with a pie chart, a donut
chart makes it easier to compare the size of slices, since the reader focuses more
on the length of the arcs than on the area.
2. Donut charts are also more space-efficient because the center is cut out, so it can
be used to display information or further divide groups into subgroups.
3. The following diagram shows a basic donut chart:

76
Variants: Donut Chart

77
Variants: Donut Chart
1. The following diagram shows a donut chart with subgroups:
2. Design Practice
1. Use the same color that's used for the category for the subcategories. Use
varying brightness levels for the different subcategories.

78
Stacked Bar Chart
1. Stacked bar charts are used to show how a category is divided into subcategories
and the proportion of the subcategory in comparison to the overall category.
2. You can either compare total amounts across each bar or show a percentage of
each group.
3. The latter is also referred to as a 100% stacked bar chart and makes it easier to see
relative differences between quantities in each group
4. Use
1. To compare variables that can be divided into sub-variables
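A 100% stacked bar chart is obtained by dividing each value by its group total before stacking. The following is a minimal sketch with pandas and Matplotlib; the group and subcategory names and the values are hypothetical, and the "Agg" backend is an assumption made so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available, render off-screen
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical values: rows are groups, columns are subcategories.
df = pd.DataFrame(
    {"Sub 1": [10, 20, 30], "Sub 2": [30, 20, 10], "Sub 3": [10, 10, 10]},
    index=["Group A", "Group B", "Group C"],
)

# Convert each row to percentages of its own total.
percent = df.div(df.sum(axis=1), axis=0) * 100

fig, ax = plt.subplots()
bottom = [0.0] * len(percent)
for column in percent.columns:
    # Each subcategory is drawn on top of the ones before it.
    ax.bar(percent.index, percent[column], bottom=bottom, label=column)
    bottom = [b + v for b, v in zip(bottom, percent[column])]
ax.set_ylabel("Share of group (%)")
ax.legend()
```

Plotting `df` instead of `percent` in the same loop gives the ordinary stacked bar chart, where the total bar heights differ between groups.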
79
Stacked Bar Chart
1. Examples: The following diagram shows a generic stacked bar chart with five
groups:

80
Stacked Bar Chart
1. The following diagram shows a 100% stacked bar chart with the same data that
was used in the preceding diagram:

81
Stacked Bar Chart
1. The following diagram illustrates the daily total sales of a restaurant over several
days. The daily total sales of non-smokers are stacked on top of the daily total
sales of smokers:

82
Stacked Bar Chart
1. Design Practices
1. Use contrasting colors for stacked bars.
2. Ensure that the bars are adequately spaced to eliminate visual clutter. A common
guideline is to leave a gap of half a bar's width between bars.
3. Order the data alphabetically, sequentially, or by value, so that the ordering is
uniform and easier for your audience to follow.

83
Stacked Area Chart
1. Stacked area charts show trends for part-of-a-whole relations.
2. The values of several groups are illustrated by stacking individual area charts on top
of one another.
3. It helps to analyze both individual and overall trend information.
4. Use
1. To show trends for time series that are part of a whole
5. Examples: The following diagram shows a stacked area chart with the net profits of
Google, Facebook, Twitter, and Snapchat over a decade:
6. Design Practice: Use transparent colors to improve information visibility. This will
help you to analyze the overlapping data and you will also be able to see the grid
lines.
84
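Matplotlib's `stackplot` stacks several series on top of one another, so the upper edge of the top layer shows the overall trend. The following is a minimal sketch; the company names and values are hypothetical, and the "Agg" backend is an assumption made so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available, render off-screen
import matplotlib.pyplot as plt

# Hypothetical yearly values for three companies.
years = [2016, 2017, 2018, 2019, 2020]
company_a = [10, 12, 15, 18, 22]
company_b = [5, 7, 8, 10, 11]
company_c = [2, 3, 3, 4, 6]

fig, ax = plt.subplots()
layers = ax.stackplot(
    years, company_a, company_b, company_c,
    labels=["Company A", "Company B", "Company C"],
    alpha=0.6,  # transparency keeps the grid lines visible
)
ax.grid(True)
ax.legend(loc="upper left")

# The top edge of the stack equals the sum of all series per year.
totals = [a + b + c for a, b, c in zip(company_a, company_b, company_c)]
print(totals)
```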
Stacked Area Chart

85
Activity 2.03: Smartphone Sales Units
1. You want to compare smartphone sales units for the five biggest smartphone
manufacturers over time and see whether there is any trend.
2. In this activity, we also want to look at the advantages and disadvantages of stacked
area charts compared to line charts:
1. Looking at the following line chart, analyze the sales of each manufacturer and
identify the one whose fourth-quarter performance is exceptional when
compared to the third quarter.
2. Analyze the performance of all manufacturers and make a prediction about two
companies whose sales units will show a downward and an upward trend:
86
Activity 2.03: Smartphone Sales Units

87
Activity 2.03: Smartphone Sales Units
1. What would be the advantages and disadvantages of using a stacked area chart
instead of a line chart?

88
Venn Diagram

1. Venn diagrams, also known as set diagrams, show all possible logical relations
between a finite collection of different sets.
2. Each set is represented by a circle.
3. The circle size illustrates the importance of a group.
4. The size of overlap represents the intersection between multiple groups.
5. Use: To show overlaps for different sets.
6. Example: The following diagram shows a Venn diagram for students in two groups
taking the same class in a semester, visualizing the intersection of the two groups:
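The regions of a two-set Venn diagram correspond directly to set operations. The following is a minimal sketch using Python's built-in sets; the student names and group memberships are hypothetical.

```python
# Hypothetical group memberships.
group_a = {"Ana", "Ben", "Chloe", "Dan"}
group_b = {"Dan", "Eve", "Finn"}

both = group_a & group_b      # overlap of the two circles (intersection)
only_a = group_a - group_b    # part of circle A outside the overlap
only_b = group_b - group_a    # part of circle B outside the overlap
either = group_a | group_b    # everything covered by the diagram (union)

print(len(only_a), len(both), len(only_b))
```

The three counts printed at the end are the numbers that would be written into the three regions of the diagram.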
89
Venn Diagram

90
Venn Diagram

1. From the preceding diagram, we can note that there are eight students in just group
A, four students in just group B, and one student in both groups.
2. Design Practice:
1. It is not recommended to use Venn diagrams if you have more than three
groups, because the diagram becomes difficult to understand.
3. Moving on from composition plots, we will cover distribution plots in the
following section.
91
