Chapter 3 – Data Visualization
Data Mining for Business Intelligence
Shmueli, Patel & Bruce
© Galit Shmueli and Peter Bruce 2010
Graphs for Data Exploration
Basic Plots Distribution Plots
Line Graphs Boxplots
Bar Charts Histograms
Scatterplots
Line Graph for Time Series
Bar Chart for Categorical Variable
95% of tracts do not border
Charles River
Excel can confuse:
y-axis is actually “% of
records that have a value
for CATMEDV” (i.e., “% of
all records”)
Scatterplot
Displays relationship between two
numerical variables
Distribution Plots
Display “how many” of each value occur in a data
set
Or, for continuous data or data with many possible
values, “how many” values are in each of a series of
ranges or “bins”
Histograms
Boston Housing example:
Histogram shows the
distribution of the
outcome variable
(median house value)
Boxplots
Side-by-side boxplots are useful for comparing subgroups
Boston Housing Example:
Display distribution of
outcome variable (MEDV)
for neighborhoods on
Charles river (1) and not
on Charles river (0)
Box Plot
Top outliers defined as
those above
Q3+1.5(Q3-Q1).
outliers
“max” = maximum of
non-outliers
Analogous definitions
“max”
Quartile 3 for bottom outliers and
mean Median
for “min”
Details may differ
Quartile 1
“min”
across software
Heat Maps
Color conveys information
In data mining, used to visualize
Correlations
Missing Data
Heatmap to highlight correlations
(Boston Housing)
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
CRIM 1.00
ZN -0.20 1.00
INDUS 0.41 -0.53 1.00
CHAS -0.06 -0.04 0.06 1.00 In Excel
NOX 0.42 -0.52 0.76 0.09 1.00
RM -0.22 0.31 -0.39 0.09 -0.30 1.00 (using
AGE 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00
DIS -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 conditional
RAD 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00
TAX 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 formatting)
PTRATIO 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
B -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18 1.00
LSTAT 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37 -0.37 1.00
MEDV -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51 0.33 -0.74 1.00
In Spotfire
Multidimensional Visualization
Scatterplot with color added
Boston Housing
NOX vs. LSTAT
Red = low median value
Blue = high median
value
Matrix Plot
Matrix Plot
0 0.2 0.4 0.6 0.8 1
Shows scatterplots
9
1.8 3.6 5.4 7.2
for variable pairs CRIM
101
0
1
Example:
0.2 0.4 0.6 0.8
scatterplots for 3
ZN
102
Boston Housing 0
variables
3
0.6 1.2 1.8 2.4
INDUS
101
0
0 1.8 3.6 5.4 7.2 9 0 0.6 1.2 1.8 2.4 3
Rescaling to log scale (on right)
“uncrowds” the data
Aggregation
Amtrak Ridership – Monthly Data
Aggregation – Monthly Average
Aggregation – Yearly Average
Scatter Plot with Labels (Utilities)
Scaling: Smaller markers, jittering, color contrast
(Universal Bank; red = accept loan)
Jittering
Moving markers by a small random amount
Uncrowds the data by allowing more markers to be
seen
Without jittering (for comparison)
Parallel Coordinate Plot (Boston Housing)
CATMEDV =1
CATMEDV =0
- CAT. MEDV: (1)
Filter Settings
Linked plots
(same record is highlighted in each plot)
Network Graph – eBay Auctions
(sellers on left, buyers on right)
Circle size = # of
transactions for the node
Line width =# of
auctions for the buyer-
seller pair
Arrows point from buyer
to seller
Treemap – eBay Auctions
(Hierarchical eBay data:
Category> sub-category> Brand)
Rectangle size =
average closing
price (=item
value)
Color = % sellers
with negative
feedback
(darker=more)
Map Chart
(Comparing countries’ well-being with GDP)
Darker = higher value