Chapter 3 – Data Visualization
Chapter 4 – Summary Statistics
Data Mining for Business Intelligence
Shmueli, Patel & Bruce
© Galit Shmueli and Peter Bruce 2010
Data Visualization
• “A picture is worth a thousand words”
• Data visualization and summary statistics help condense
data
• Effective presentation
• Supports data cleaning (identify missing values, outliers,
incorrect values, duplicates) and exploring (combine some
groups)
• Helps identify suitable variables
• Mandatory initial step for most data mining applications
Graphs for Data Exploration
Basic Plots: Line Graphs, Bar Charts, Scatterplots
Distribution Plots: Boxplots, Histograms
Two Examples
Amtrak Ridership:
Amtrak routinely collects data on ridership
Goal: To predict future ridership using the series of monthly ridership data between Jan 1991 – March 2004
Boston Housing Data:
Census tracts in Boston
Several variables (14) – crime rate, location, etc.
Goal 1: Predict median value of a home in the tract
Goal 2: Cluster census tracts
Line Graph for Time Series
Shows how ridership patterns of Amtrak trains change over time
Bar Chart for Categorical Variable
Determine differences
between subgroups
Example: about 93% of tracts (471 of 506) do
not border the Charles River
Scatterplot
Displays relationship between two numerical variables – median values
decrease as percentage of low status population increases
Graphs
Three most effective plots:
bar charts – usually for categorical variables
line graphs – time series data
Scatterplots – relationship between 2 variables
Used widely in the business world
Domain knowledge and nature of the task are used to
select appropriate chart for data at hand
Distribution Plots
Display entire distribution of a numerical variable
Display “how many” of each value occur in a data set or,
for continuous data or data with many possible values,
“how many” values are in each of a series of ranges or
“bins”
Generally useful for prediction tasks (supervised learning)
and help determine the potential methods and variable
transformations
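The binning idea above can be sketched in a few lines of Python. This is a minimal illustration, not the tool used for the slides: the function name `bin_counts`, the bin width of 10, and the home values are all made up for the example.

```python
def bin_counts(values, bin_width, start):
    """Count how many values fall in each bin [lo, lo + bin_width)."""
    n_bins = int((max(values) - start) // bin_width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - start) // bin_width)] += 1
    return counts

# Made-up median home values (in $1000s), for illustration only.
medv = [12.5, 17.8, 21.0, 22.4, 23.1, 24.9, 27.5, 33.2, 36.0, 50.0]
bin_width, start = 10, 0
counts = bin_counts(medv, bin_width, start)

# A text "histogram": one row per bin, one mark per value in the bin.
for i, n in enumerate(counts):
    lo = start + i * bin_width
    print(f"{lo:>2}-{lo + bin_width:<2} {'#' * n}")
```

Changing `bin_width` changes the shape of the display, which is why histograms of the same variable can look different across tools.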
Histograms
Boston Housing example:
Histogram shows the
distribution of the
outcome variable
(median house value)
Boxplots
Side-by-side boxplots are useful for comparing subgroups
Boston Housing Example:
Display distribution of
outcome variable (MEDV)
for neighborhoods on the
Charles River (1) and not on the
Charles River (0)
Box Plot
Top outliers are defined as those above Q3 + 1.5(Q3 - Q1)
“max” = maximum of non-outliers
Analogous definitions for bottom outliers and for “min”
Details may differ across software
Figure labels (top to bottom): outliers, “max”, Quartile 3, mean, Median, Quartile 1, “min”
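The boxplot quantities can be computed directly. The sketch below uses Python's `statistics.quantiles` with `method="inclusive"`, which is one of several quartile conventions, so other software may report slightly different values. The function name `boxplot_stats` and the data are illustrative assumptions.

```python
import statistics

def boxplot_stats(values):
    """Quartiles, fences at 1.5 * IQR, and the non-outlier "min" and "max"."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    upper_fence = q3 + 1.5 * iqr   # top outliers lie above this
    lower_fence = q1 - 1.5 * iqr   # bottom outliers lie below this
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = sorted(v for v in values if v < lower_fence or v > upper_fence)
    return {
        "q1": q1, "median": median, "q3": q3,
        "mean": statistics.mean(values),
        "min": min(inside),   # smallest non-outlier
        "max": max(inside),   # largest non-outlier
        "outliers": outliers,
    }

# Made-up data with one unusually large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
stats = boxplot_stats(data)
print(stats)
```

Here the value 30 lies above Q3 + 1.5(Q3 - Q1) and is reported as an outlier, so the “max” whisker ends at 9.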
Heat Maps
Basic charts and distribution plots can display a maximum of 2
variables
Cannot represent high-dimensional data
In data mining, data are often multi-dimensional
Heat maps are graphical displays where color is used to
convey information
Used to visualize:
Correlation
Missing Data
Heat maps
Correlation table for p variables has p rows and p columns
Data table has p columns (variables) and n rows (records)
If n is large, a subset can be used
Easier and faster to scan the color coding rather than the
values
Useful when examining a large number of values but bar
charts and plots should be used for precise graphical
representations
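As a rough sketch of how a correlation table becomes a heat map, the code below builds a p-by-p table of Pearson correlations and maps each value to a text mark standing in for color. The helper names (`pearson`, `correlation_table`, `shade`) and the three short columns of data are made up for illustration; they are not the Boston Housing values.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_table(columns):
    """p variables in, p rows by p columns out."""
    names = list(columns)
    return {r: {c: pearson(columns[r], columns[c]) for c in names} for r in names}

def shade(r):
    """Stand-in for color coding: a stronger |r| gets a denser mark."""
    marks = " .:*#"
    return marks[min(int(abs(r) * 5), 4)]

# Made-up records for three variables.
columns = {
    "LSTAT": [5, 9, 12, 17, 22, 30],
    "RM": [7.1, 6.6, 6.2, 6.0, 5.7, 5.2],
    "MEDV": [36, 28, 23, 20, 16, 11],
}
table = correlation_table(columns)
names = list(table)
for r in names:
    print(f"{r:<6}" + " ".join(f"{table[r][c]:+.2f}{shade(table[r][c])}" for c in names))
```

Scanning the marks is quicker than reading the numbers, which is the point the slide makes about color coding.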
Heatmap to highlight correlations (Boston Housing)
In Excel (using conditional formatting):
           CRIM      ZN   INDUS    CHAS     NOX      RM     AGE     DIS     RAD     TAX PTRATIO       B   LSTAT    MEDV
CRIM       1.00
ZN        -0.20    1.00
INDUS      0.41   -0.53    1.00
CHAS      -0.06   -0.04    0.06    1.00
NOX        0.42   -0.52    0.76    0.09    1.00
RM        -0.22    0.31   -0.39    0.09   -0.30    1.00
AGE        0.35   -0.57    0.64    0.09    0.73   -0.24    1.00
DIS       -0.38    0.66   -0.71   -0.10   -0.77    0.21   -0.75    1.00
RAD        0.63   -0.31    0.60   -0.01    0.61   -0.21    0.46   -0.49    1.00
TAX        0.58   -0.31    0.72   -0.04    0.67   -0.29    0.51   -0.53    0.91    1.00
PTRATIO    0.29   -0.39    0.38   -0.12    0.19   -0.36    0.26   -0.23    0.46    0.46    1.00
B         -0.39    0.18   -0.36    0.05   -0.38    0.13   -0.27    0.29   -0.44   -0.44   -0.18    1.00
LSTAT      0.46   -0.41    0.60   -0.05    0.59   -0.61    0.60   -0.50    0.49    0.54    0.37   -0.37    1.00
MEDV      -0.39    0.36   -0.48    0.18   -0.43    0.70   -0.38    0.25   -0.38   -0.47   -0.51    0.33   -0.74    1.00
In Spotfire
Multidimensional Visualization
Adding variables
• To add more variables to a plot, map each to a visual attribute
• Categorical: hue, shape, multiple panels
• Numerical: color intensity
• Incorporating more variables has advantages
• Useful for both classification and prediction tasks
• Helps in adding interaction terms
Scatterplot with color added
Boston Housing
NOX vs. LSTAT
Red = low median value
Blue = high median value
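One way to add a numerical variable as color is to blend linearly between two hues. The sketch below maps a value to a hex color running from red (low) to blue (high), matching the legend above. The function name `value_to_color`, the range limits, and the points are assumptions for illustration; a plotting library would normally handle this step.

```python
def value_to_color(value, low, high):
    """Blend from red (value at or below low) to blue (value at or above high)."""
    t = (value - low) / (high - low)
    t = min(max(t, 0.0), 1.0)          # clamp values outside [low, high]
    red = round(255 * (1 - t))
    blue = round(255 * t)
    return f"#{red:02x}00{blue:02x}"

# Made-up (NOX, LSTAT, MEDV) points; MEDV drives the marker color.
points = [(0.45, 5.0, 40.0), (0.55, 12.0, 22.0), (0.70, 25.0, 10.0)]
colored = [(nox, lstat, value_to_color(medv, low=5.0, high=50.0))
           for nox, lstat, medv in points]
for row in colored:
    print(row)
```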
Data Manipulations
Important step in pre-processing of data
Includes variable transformations and deriving new
variables (binning, condensing categories)
Common methods:
Rescaling – can often enhance the plot and illuminate
relationships
Aggregation – temporal (by granularity: monthly,
weekly) or geographical (by zip codes)
Zooming and Panning – reveal patterns and outliers (as in Google
Maps, zooming in on areas of interest)
Filtering – removing some “noise” from data to focus
attention on certain data
Rescaling to log scale (on right)
“uncrowds” the data
Rescaling removes crowding and allows a better view of the linear
relationship between the two log-scaled variables
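A small numerical sketch of the “uncrowding” effect: for a skewed variable, most raw values sit in the lowest tenth of the axis, while after a base-10 log they spread out. The data and the helper `share_in_lowest_tenth` are made up for illustration, and the log requires strictly positive values.

```python
import math

def share_in_lowest_tenth(values):
    """Fraction of values falling in the lowest 10% of the axis range."""
    lo, hi = min(values), max(values)
    cutoff = lo + (hi - lo) / 10
    return sum(v <= cutoff for v in values) / len(values)

# Made-up, strongly right-skewed positive values.
raw = [0.01, 0.05, 0.2, 1.0, 4.0, 20.0, 80.0]
logged = [math.log10(v) for v in raw]   # log scale needs values > 0

crowding_raw = share_in_lowest_tenth(raw)
crowding_log = share_in_lowest_tenth(logged)
print(f"raw: {crowding_raw:.2f}  logged: {crowding_log:.2f}")
```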
Aggregation
Amtrak Ridership – Monthly Data
Aggregation – Monthly Average
“Seasonal aggregation” (monthly) – Ridership peaks in July-August
and dips in January-February
Aggregation – Yearly Average
“Temporal aggregation” (yearly) – Ridership decreased from 1991 to
1996 and then grew again from 1996 to 2004 (with a slight drop in
2003-2004)
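Both aggregations can be sketched on a small monthly series: averaging by month of year shows the seasonal pattern, and averaging by year shows the longer-term trend. The numbers below are invented for two years and are not the Amtrak data; the function names are illustrative.

```python
from statistics import mean

# Made-up monthly ridership (thousands), January to December for each year.
ridership = {
    1991: [1700, 1620, 1970, 1810, 1970, 1880, 2080, 2140, 1790, 1850, 1800, 1850],
    1992: [1600, 1560, 1850, 1780, 1900, 1840, 2000, 2060, 1700, 1780, 1720, 1800],
}

def month_of_year_average(data):
    """Seasonal aggregation: average each calendar month across years."""
    return {m + 1: mean(values[m] for values in data.values()) for m in range(12)}

def yearly_average(data):
    """Temporal aggregation: average all months within each year."""
    return {year: mean(values) for year, values in data.items()}

by_month = month_of_year_average(ridership)
by_year = yearly_average(ridership)

peak_month = max(by_month, key=by_month.get)
low_month = min(by_month, key=by_month.get)
print("peak month:", peak_month, "lowest month:", low_month)
print("yearly averages:", by_year)
```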
Scatter Plot with Labels (Utilities)
Helps visualize and identify clusters and outliers, detect patterns.
For example: Nevada and Puget are similar to each other and far from the rest
Scaling up: Large datasets
• Scatterplots with a large number of observations can be ineffective
• Alternatives:
• Sampling
• Reduce marker size
• Breaking data down into subsets
• Aggregation
• Jittering – slightly moving each marker by adding a small
amount of noise
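Two of the alternatives listed above, jittering and sampling, can be sketched as follows. The noise amount of 0.1, the seed, and the points are arbitrary choices for illustration, and uniform noise is only one possible choice of jitter.

```python
import random

def jitter(points, amount, rng):
    """Move each marker slightly by adding uniform noise in [-amount, amount]."""
    return [(x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
            for x, y in points]

def sample(points, k, rng):
    """Draw k points at random without replacement."""
    return rng.sample(points, k)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Made-up data: many records share the same coordinates and overplot.
points = [(1, 1)] * 5 + [(2, 3)] * 5
jittered = jitter(points, 0.1, rng)
subset = sample(points, 4, rng)

print(len(set(points)), "distinct marker positions before jittering")
print(len(set(jittered)), "distinct marker positions after jittering")
```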
Other plots/graphs
• Matrix plot – multiple scatterplots together for pairwise
relationships
• Interactive visualization
• Multiple interlinked plots (single view)
• Interactive visualization is often preferred over “static”
graphs – all plots on one screen
• Specialized Visualization
• Network graphs – actors and relations between them
(“nodes”, “edges”)
• Tree maps for hierarchical large-scale data
• Map charts for geographical data
• Spotfire software – [Link]
Linked plots
(same record is highlighted in each plot)
Network Graph – eBay Auctions
(sellers on left, buyers on right)
Circle size = # of transactions for the node
Line width = # of auctions for the buyer-seller pair
Arrows point from seller to buyer
Treemap – eBay Auctions
(Hierarchical eBay data: Category > Sub-category > Brand)
Rectangle size = average closing price (= item value)
Color = % of sellers with negative feedback (darker = more)
Map Chart
(Comparing countries’ well-being with GDP)
Darker = higher value
Summary of Data visualization tools
• Prediction and Classification
• Bar charts, scatterplots
• Boxplots, histograms
• Side-by-side boxplots, multiple panels, color added
• Aggregation methods
• Time series forecasting
• Line charts – temporal, seasonal aggregations
• Zooming and panning
• Unsupervised learning
• Matrix plots
• Heatmaps
• Aggregation, zooming and panning
• Map charts, parallel coordinate plots
Other Pre-processing steps – Chapter 2
Detecting outliers
Handling missing data
Normalizing/standardizing data
Summary Statistics: Exploring the data
• Useful initial step of data exploration
• Statistical summary of data: common metrics
• Average
• Median
• Mode
• Minimum
• Maximum
• Range
• Variance and Standard deviation
• Counts & percentages
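The numerical metrics listed above are all available in Python's standard library. The sketch below uses a small made-up list; note that `statistics.variance` and `statistics.stdev` compute the sample versions (dividing by n - 1), which is an assumption about which definition is wanted.

```python
import statistics

# Made-up values for illustration.
values = [2, 4, 4, 5, 7, 9, 11]

summary = {
    "average": statistics.mean(values),
    "median": statistics.median(values),
    "mode": statistics.mode(values),
    "minimum": min(values),
    "maximum": max(values),
    "range": max(values) - min(values),
    "variance": statistics.variance(values),   # sample variance (n - 1)
    "std_dev": statistics.stdev(values),       # sample standard deviation
}

for name, value in summary.items():
    print(f"{name:<9} {value}")
```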
Summary Statistics – Boston Housing
Summarize Using Pivot Tables
Counts & percentages are useful for summarizing categorical data
Boston Housing example:
35 neighborhoods border the Charles River (1)
471 neighborhoods do not (0)
Count of MEDV
CHAS          Total
0             471
1             35
Grand Total   506
Pivot Tables - cont.
Averages are useful for summarizing grouped numerical data
Boston Housing example:
Compare average home values in neighborhoods that border the Charles River (1) and those that do not (0)
Average of MEDV
CHAS          Total
0             22.09
1             28.44
Grand Total   22.53
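The two pivot-table summaries, counts per group and averages per group, amount to a group-by followed by an aggregate. The sketch below does this with plain Python on a handful of made-up records; the values are not taken from the Boston Housing data, and a spreadsheet or data-frame library would normally do this step.

```python
from collections import defaultdict

# Made-up records: (CHAS flag, MEDV in $1000s).
records = [(0, 20.0), (0, 24.0), (0, 19.0), (1, 30.0), (1, 26.0), (0, 21.0)]

# Group MEDV values by the CHAS flag.
groups = defaultdict(list)
for chas, medv in records:
    groups[chas].append(medv)

counts = {k: len(v) for k, v in groups.items()}
percentages = {k: 100 * n / len(records) for k, n in counts.items()}
averages = {k: sum(v) / len(v) for k, v in groups.items()}
grand_average = sum(medv for _, medv in records) / len(records)

print("CHAS   count   percent   average MEDV")
for k in sorted(groups):
    print(f"{k:<6} {counts[k]:<7} {percentages[k]:<9.1f} {averages[k]:.2f}")
print(f"Grand total: {len(records)} records, average {grand_average:.2f}")
```

The grand average is weighted by group size, so it sits closer to the average of the larger group, as in the Boston Housing table above.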
Conclusion
Both data visualization and summary statistics are
ways to explore, summarize and describe data
Visualization techniques are more appealing but
summary statistics are essential to quantitatively
understand the information from the data
They both help in data reduction and forming
groups/aggregates