Module 3
Data Mining:
Concepts and Techniques
3
Introduction
• Motivation: Why data mining?
• Applications
4
Motivation: “Necessity is the Mother
of Invention”
• Data explosion problem
• "Get your facts first, and then you can distort them as
much as you please.“
7
The Data Flood is Everywhere
8
Evolution of Database Technology
(See Fig. 1.1)
• 1960s:
– Data collection, database creation, IMS and network DBMS
• 1970s:
– Relational data model, relational DBMS implementation
• 1980s:
– RDBMS, advanced data models (extended-relational, OO, deductive,
etc.) and application-oriented DBMS (spatial, scientific, engineering,
etc.)
• 1990s—2000s:
– Data mining and data warehousing, multimedia databases, and Web
databases 9
What Is Data Mining?
– Expert systems
11
So what is it?
13
14
Why Data Mining? — Potential Applications
• Database analysis and decision support
17
Market Analysis and Management (2)
• Customer profiling
– data mining can tell you what types of customers buy what products
(clustering or classification)
Fraud Detection and Management (1)
• Applications
– widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
• Approach
– use historical data to build models of fraudulent behavior and use data
mining to help identify similar instances
• Examples
– auto insurance: detect a group of people who stage accidents to collect
on insurance
– money laundering: detect suspicious money transactions (US Treasury's
Financial Crimes Enforcement Network)
– medical insurance: detect professional patients and ring of doctors and
ring of references
20
Fraud Detection and Management (2)
• Sports
– IBM Advanced Scout analyzed NBA game statistics (shots blocked,
assists, and fouls) to gain competitive advantage for New York
Knicks and Miami Heat
• Astronomy
– JPL and the Palomar Observatory discovered 22 quasars with the help
of data mining
• Internet Web Surf-Aid
– IBM Surf-Aid applies data mining algorithms to Web access logs for
market-related pages to discover customer preferences and behavior,
analyze the effectiveness of Web marketing, improve Web site
organization, etc.
22
Data Mining: A KDD Process
• Data mining: the core of the knowledge discovery process
(Figure: the KDD process: Databases → Data Cleaning → Data Integration →
Task-relevant Data → Data Mining → Pattern Evaluation)
24
Steps of a KDD Process
Pattern evaluation
31
Data Mining Functionalities
33
Data Mining Functionalities
• Cluster analysis
– Class label is unknown: Group data to form new classes, e.g., cluster
houses to find distribution patterns
34
Data Mining Functionalities
• Outlier analysis
36
Are All the “Discovered” Patterns
Interesting?
• A data mining system/query may generate thousands of patterns; not all of them are
interesting.
(Figure: data mining at the confluence of multiple disciplines: database
technology, statistics, machine learning, visualization, information
science, and other disciplines)
39
Data Mining Development
(Figure: techniques from related fields that fed into data mining)
•Similarity Measures  •Hierarchical Clustering  •IR Systems  •Imprecise Queries
•Textual Data  •Web Search Engines
•Relational Data Model  •SQL  •Association Rule Algorithms  •Data Warehousing
•Scalability Techniques
•Bayes Theorem  •Regression Analysis  •EM Algorithm  •K-Means Clustering
•Time Series Analysis
•Algorithm Design Techniques  •Algorithm Analysis  •Data Structures
•Neural Networks  •Decision Tree Algorithms
40
Data Mining: Classification Schemes
• General functionality
42
Data Mining Models and Tasks
43
Major Issues in Data Mining (1)
• Mining methodology and user interaction
– Mining different kinds of knowledge in databases
• Data Visualization
46
Types of Data Sets
• Record
– Relational records
– Data matrix, e.g., numerical matrix, crosstabs
– Document data: text documents represented as term-frequency vectors
(e.g., a document × term matrix over terms such as team, coach, play,
ball, score, game, win, lost, timeout, season)
– Transaction data
• Graph and network
• Dimensionality
– Curse of dimensionality
• Sparsity
– Only presence counts
• Resolution
– Patterns depend on the scale
• Distribution
– Centrality and dispersion
48
Data Objects
52
Numeric Attribute Types
• Quantity (integer or real-valued)
• Interval
• Measured on a scale of equal-sized units
• Values have order and can be positive or negative
– E.g., temperature in °C or °F, calendar dates
• Ranking allows us to compare and quantify the
difference between values
• Ratio
• Inherent zero-point
• We can speak of values as being a multiple of another
value
• Values are ordered
– e.g., temperature in Kelvin, length, counts, monetary
quantities
53
Summary of Data Types and Scale Measures
54
Discrete vs. Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
• E.g., zip codes, profession, or the set of words in a
collection of documents
– Sometimes, represented as integer variables
– Note: Binary attributes are a special case of discrete attributes
• Continuous Attribute
– Has real numbers as attribute values
• E.g., temperature, height, or weight
– Practically, real values can only be measured and represented
using a finite number of digits
– Continuous attributes are typically represented as floating-point
variables
56
Basic Statistical Descriptions of Data
• Motivation
• Used to identify properties of the data and highlight which data
values should be treated as noise or outliers
1. Central tendency: measures the location of the middle or
center of a data distribution: the mean, median, mode, and
midrange
2. Dispersion of the data, that is, how the data are spread out.
The most common data dispersion measures are the range,
quartiles, and interquartile range; the five-number summary
and boxplots; and the variance and standard deviation of the
data
57
3. Graphic displays of basic statistical
descriptions allow visual inspection of the data. Most
statistical or graphical data presentation
software packages include bar charts, pie
charts, and line graphs. Other popular
displays of data summaries and distributions
include quantile plots, quantile–quantile
plots, histograms, and scatter plots
58
Measuring the Central Tendency
• Mean (algebraic measure):The most common and effective numeric measure of the
“center” of a set of data is the (arithmetic) mean
• Let x1, x2, …, xN be a set of N values or observations for
some numeric attribute X, like salary. The mean of this set of values is:
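The formula itself was lost in extraction; the standard definition of the arithmetic mean is:

\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \cdots + x_N}{N}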
59
• problem with the mean is its sensitivity to extreme (e.g., outlier)
values
• Even a small number of extreme values can corrupt the mean. For
example, the mean salary at a company may be substantially pushed
up by that of a few highly paid managers
• To offset the effect we can instead use the trimmed mean, which is
the mean obtained after chopping off values at the high and low
extremes
• For example, we can sort the values observed for salary and remove
the top and bottom 2% before computing the mean. We should
avoid trimming too large a portion (such as 20%) at both ends, as
this can result in the loss of valuable information
60
• Median: for skewed(asymmetric)data
– which is the middle value in a set of ordered data values. It is the value that separates
the higher half of a data set from the lower half
– Middle value if odd number of values, or average of the middle two values otherwise
– The median is expensive to compute when we have a large number of
observations, but we can easily approximate it by interpolation:
median ≈ L1 + ( (N/2 − (Σ freq)_l) / freq_median ) × width
• where L1 is the lower boundary of the median interval, N is the number of values in the
entire data set,
• (Σ freq)_l is the sum of the frequencies of all of the intervals that are lower than the median
interval, freq_median is the frequency of the median interval, and
• width is the width of the median interval
61
• Mode
– The mode for a set of data is the value that occurs most frequently in the
set
– it can be determined for qualitative and quantitative attributes.
– Data sets with one, two, or three modes are respectively called unimodal,
bimodal, and trimodal
– In general, a data set with two or more modes is multimodal
– At the other extreme, if each data value occurs only once, then there is
no mode.
• Midrange
– The midrange can also be used to assess the central tendency of a
numeric data set.
– It is the average of the largest and smallest values in the set.
62
• Mean. Suppose we have the following values for salary (in thousands of
dollars), shown in increasing order:
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
• Median. For the same example, the two middlemost values are 52 and 56
(the sixth and seventh values in the list). By convention, we assign the
average of the two middlemost values as the median; that is, (52 + 56)/2 = 54.
Thus, the median is $54,000.
Suppose that we had only the first 11 values in the list. Then the median
is the middlemost value. This is the sixth value in this list,
which has a value of $52,000.
• Mode. The data from the above example are bimodal. The two modes are $52,000
and $70,000.
• Midrange. It is the average of the largest and smallest values in the set:
(30,000 + 110,000)/2 = $70,000.
63
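To make the arithmetic above concrete, here is a small check using Python's standard statistics module (Python 3.8+ for multimode):

from statistics import mean, median, multimode

# Salary values in thousands of dollars, already sorted
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("mean:", mean(salaries))                            # 58
print("median:", median(salaries))                        # (52 + 56) / 2 = 54
print("modes:", multimode(salaries))                      # [52, 70], i.e., bimodal
print("midrange:", (min(salaries) + max(salaries)) / 2)   # 70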
Symmetric vs. Skewed Data
• Median, mean, and mode of symmetric,
positively skewed, and negatively skewed data
(Figure: three example distributions: symmetric, positively skewed, and negatively skewed)
64
Measuring the Dispersion of Data
• Range: of the set is the difference between the largest (max()) and smallest (min())
values.
• Quantiles: are points taken at regular intervals of a data distribution, dividing it into
essentially equal-sized consecutive sets.
• The 4-quantiles are the three data points that split the data distribution into four equal
parts; each part represents one-fourth of the data distribution. They are more commonly
referred to as quartiles
65
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)
– The 100-quantiles are more commonly referred to as percentiles; they
divide the data distribution into 100 equal-sized consecutive sets
66
Five-Number Summary, Boxplots, and Outliers
• Five number summary: of a distribution consists of the median (Q2), the quartiles Q1
and Q3, and the smallest and largest individual observations, written in the order of
Minimum, Q1, Median, Q3, Maximum
• Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot
outliers individually
• These include
– quantile plots,
– quantile–quantile plots,
– histograms,
– scatter plots.
• Such graphs are helpful for the visual inspection of data,
which is useful for data preprocessing
• The first three of these show univariate distributions (i.e.,
data for one attribute)
• while scatter plots show bivariate distributions (i.e.,
involving two attributes)
68
Quantile Plot
• Displays all of the data (allowing the user to assess both the
overall behavior and unusual occurrences)
• Plots quantile information
– For data xi sorted in increasing order, fi indicates that
approximately 100·fi % of the data are below or equal to the
value xi
69
Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate distribution against the corresponding
quantiles of another
• View: Is there a shift in going from one distribution to another?
• Example shows unit price of items sold at Branch 1 vs. Branch 2 for each
quantile. Unit prices of items sold at Branch 1 tend to be lower than those at
Branch 2.
70
Histogram Analysis
• Plotting histograms is a graphical method for summarizing the distribution of a given attribute, X
• If X is nominal, such as automobile model or item type, then a pole or vertical bar is drawn for
each known value of X. The height of the bar indicates the frequency (i.e., count) of that X value
• The resulting graph is more commonly known as a bar chart
• If X is numeric, the term histogram is preferred. The range of values for X is partitioned into
disjoint consecutive subranges
• subranges, referred to as buckets or bins, are disjoint subsets of the data distribution for X. The
range of a bucket is known as the width. Typically, the buckets are of equal width.
Scatter plot
• Provides a first look at bivariate data to see clusters of points,
outliers, etc
• Each pair of values is treated as a pair of coordinates and plotted
as points in the plane
72
Positively and Negatively Correlated Data
73
Uncorrelated Data
Data Visualization
• Why data visualization?
• Data visualization is the graphical representation of information and data.
– Gain insight into an information space by mapping data onto graphical primitives
– Provide qualitative overview of large data sets
– Search for patterns, trends, structure, irregularities, relationships among data
– Help find interesting regions and suitable parameters for further quantitative analysis
– Provide a visual proof of computer representations derived
• Categorization of visualization methods:
– Pixel-oriented visualization techniques
– Geometric projection visualization techniques
– Icon-based visualization techniques
– Hierarchical visualization techniques
– Visualizing complex data and relations
78
Pixel-Oriented Visualization Techniques
• For a data set of m dimensions, create m windows on the screen, one for each
dimension
• The m dimension values of a record are mapped to m pixels at the
corresponding positions in the windows
• The colors of the pixels reflect the corresponding values
81
• A 3-D scatter plot uses three axes in a Cartesian coordinate
system. If it also uses color, it can display up to 4-D data
points
82
• The scatter-plot matrix technique is a useful extension to the scatter plot.
For an n-dimensional data set, a scatter-plot matrix is an n×n grid of 2-D scatter
plots that provides a visualization of each dimension with every other dimension
83
• The scatter-plot matrix becomes less effective as the dimensionality increases.
• Another popular technique, called parallel coordinates, can handle higher dimensionality.
• To visualize n-dimensional data points, the parallel coordinates technique draws n
equally spaced axes, one for each dimension, parallel to one of the display axes
84
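As a concrete illustration (assuming pandas and matplotlib are installed; the column names and values are invented), pandas ships a parallel_coordinates helper that draws one equally spaced axis per dimension:

import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# A tiny illustrative data set: four numeric dimensions plus a class column
df = pd.DataFrame({
    "sepal_len": [5.1, 4.9, 6.3, 5.8],
    "sepal_wid": [3.5, 3.0, 3.3, 2.7],
    "petal_len": [1.4, 1.4, 6.0, 5.1],
    "petal_wid": [0.2, 0.2, 2.5, 1.9],
    "species":   ["setosa", "setosa", "virginica", "virginica"],
})

# Each record becomes a polyline crossing n parallel axes, one per dimension
parallel_coordinates(df, class_column="species")
plt.show()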
Icon-Based Visualization Techniques
• Visualization of the data values as features of icons
• Typical visualization methods
– Chernoff Faces
– Stick Figures
85
Chernoff Faces
• A way to display variables on a two-dimensional surface, e.g., let x be eyebrow
slant, y be eye size, z be nose length, etc.
• The figure shows faces produced using 10 characteristics--head eccentricity,
eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size,
mouth shape, mouth size, and mouth opening
86
Stick Figure
87
The figure shows census data, where age and income are mapped to the
display axes, and the remaining dimensions (gender, education, and
so on) are mapped to stick figures.
If the data items are relatively dense with respect to the two
display dimensions, the resulting visualization shows texture
patterns, reflecting data trends
88
Hierarchical Visualization Techniques
89
Dimensional Stacking
92
Worlds-within-Worlds
• First fix the values of dimensions X3, X4, X5 to some selected
values, say, c3, c4, c5.
• We can then visualize F,X1,X2 using a 3-D plot, called a world, as
shown
• The position of the origin of the inner world is located at the
point(c3, c4, c5) in the outer world, which is another 3-D plot using
dimensions X3,X4,X5.
• A user can interactively change, in the outer world, the location of
the origin of the inner world.
• The user then views the resulting changes of the inner world.
93
Tree-Map
• tree-maps display hierarchical data as a set of nested rectangles
• All news stories are organized into seven categories, each shown in a large rectangle of a
unique color.
• Within each category (i.e., each rectangle at the top level), the news stories are further
partitioned into smaller subcategories
Ack.: http://www.cs.umd.edu/hcil/treemap-
94
Tree-Map of a File System (Shneiderman)
95
InfoCube
• A 3-D visualization technique where hierarchical
information is displayed as nested semi-transparent
cubes
• The outermost cubes correspond to the top level
data, while the subnodes or the lower level data are
represented as smaller cubes inside the outermost
cubes, and so on
96
Data Preprocessing
– Data Quality
• Data Cleaning
• Data Integration
• Data Reduction
• Summary
98
Data Quality: Why Preprocess the
Data?
• Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g.,
instrument faulty, human or computer error, transmission error
– incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
– Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
100
Data Quality: Why Preprocess the Data?
101
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression
• Data transformation and data discretization
– Normalization
– Concept hierarchy generation
102
103
Incomplete (Missing) Data
• Data is not always available
– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time of
entry
– not register history or changes of the data
• Missing data may need to be inferred
104
Data Cleaning
• Data cleaning (or data cleansing) routines attempt to
fill in missing values, smooth out noise while identifying
outliers, and correct inconsistencies in the data
– Missing Values
– Noisy Data
– Data Cleaning as a Process
105
How to Handle Missing Data?
1. Ignore the tuple: usually done when class label is missing (when
doing classification)—not effective when the % of missing values
per attribute varies considerably
2. Fill in the missing value manually: tedious + infeasible?
3. Fill in it automatically with a global constant : e.g.,
“unknown”, a new class?!
4. Use a measure of central tendency for the attribute to fill in
the missing value
the attribute mean or median depending on data
106
5. Use the attribute mean or median for all samples belonging to the
same class as the given tuple
the attribute mean for all samples belonging to the same class:
smarter
6. Use the most probable value to fill in the missing value
the most probable value: inference-based such as Bayesian formula or decision tree
• Methods 3 through 6 bias the data—the filled-in value may not be
correct.
• Method 6, however, is a popular strategy as it uses the most information
from the present data to predict missing values
107
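A hedged pandas sketch of strategies 3 through 5 (the column names and values are invented):

import pandas as pd

df = pd.DataFrame({
    "income": [48.0, None, 61.5, None, 72.0],
    "class":  ["low", "low", "high", "high", "high"],
})

# Strategy 3: fill with a global constant
filled_const = df["income"].fillna(-1)

# Strategy 4: fill with a measure of central tendency (mean or median)
filled_mean = df["income"].fillna(df["income"].mean())

# Strategy 5: fill with the mean of samples belonging to the same class
filled_by_class = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))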
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
108
How to Handle Noisy Data?
• Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
109
Binning
• Binning methods smooth a sorted data value by consulting its “neighborhood,”
that is, the values around it
– The sorted values are distributed into a number of “buckets,” or bins; because the
method consults the neighborhood of values, it performs local smoothing.
– example, the data for price are first sorted and then partitioned into equal-
frequency bins of size 3 (i.e., each bin contains three values)
110
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
In smoothing by bin boundaries, the minimum and maximum values in a given bin
are identified as the bin boundaries; each bin value is then replaced by the
closest boundary value.
Binning is also used as a discretization technique.
111
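A small plain-Python sketch that reproduces the equal-frequency partition and the two smoothing rules shown above:

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

depth = 4  # equal-frequency (equi-depth) bins of 4 values each
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: replace every value with its bin's (rounded) mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: replace every value with the closer of min/max
by_boundaries = [
    [min((b[0], b[-1]), key=lambda edge: abs(v - edge)) for v in b]
    for b in bins
]

print(by_means)        # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_boundaries)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]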
Consider the following example for equal frequency binning.
• Dataset: 10,15, 18, 20, 31, 34, 41, 46, 51, 53, 54, 60
• First, we have to sort the data. If we have unsorted data, we need to sort it
first and then apply the second step. In the second step, we have to
find the frequency: divide the total number of data points by the
number of bins.
• In this case, the total number of data points is 12, and the
number of bins required is 3. Therefore, the frequency
comes out to be 4. Now let's add values into the bins.
• BIN 1: 10, 15, 18, 20
• BIN 2: 31, 34, 41, 46
• BIN 3: 51, 53, 54, 60
Customer Segmentation
• Binning can be used in customer segmentation. A business can optimize its
marketing strategies using the Binning method in customer segmentation.
■ Regression
■ smooth by fitting the data into regression functions
114
Data Cleaning as a Process
■ Data discrepancy detection
■ Use metadata (e.g., domain, range, dependency, distribution)
115
Data Preprocessing
■ Data Quality
■ Data Cleaning
■ Data Integration
■ Data Reduction
■ Summary
116
Data Integration
■ Data integration:
■ Combines data from multiple sources into a coherent store
■ help improve the accuracy and speed of the subsequent data
mining process.
■ The following techniques are used:
117
Entity Identification Problem
118
Redundancy and Correlation Analysis
■ Redundant data occur often when integration of multiple databases
■ Object identification: The same attribute or object may have
different names in different databases
■ Derivable data: One attribute may be a “derived” attribute in
another table, e.g., annual revenue
■ can be detected by correlation analysis
■ Given two attributes, such analysis can measure how strongly one
attribute implies the other, based on the available data
■ For nominal data, we use the χ² (chi-square) test.
■ For numeric attributes, we can use the correlation coefficient and
covariance
119
Correlation Analysis (Nominal Data)
120
121
Chi-Square Calculation: An Example
122
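The worked example on this slide did not survive extraction. As an illustrative stand-in (the contingency counts below are hypothetical), SciPy's chi2_contingency computes the χ² statistic for two nominal attributes:

from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table: rows = plays chess (yes/no),
# columns = likes science fiction (yes/no)
observed = [[250, 200],
            [50, 1000]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value, dof)
# A very small p-value means the two nominal attributes are not independent,
# i.e., they are correlated.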
Correlation Analysis (Numeric Data)
123
Covariance (Numeric Data)
124
125
■ Positive covariance: If Cov(A,B) > 0, then A and B both tend to be
larger than their expected values.
■ Negative covariance: if one of attributes tends to be above its
expected value when the other attribute is below its expected value,
then the covariance of A and B is negative
■ Independence: If A and B are independent, then Cov(A,B) = 0; but the converse is not true:
■ Some pairs of random variables may have a covariance of 0 but are not
independent. Only under some additional assumptions (e.g., the data
follow multivariate normal distributions) does a covariance of 0 imply
independence
126
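As a brief NumPy sketch (the two attribute vectors are invented), covariance and the Pearson correlation coefficient for numeric attributes can be computed as follows:

import numpy as np

# Hypothetical paired observations of two numeric attributes A and B
A = np.array([6.0, 5.0, 4.0, 3.0, 2.0])
B = np.array([20.0, 10.0, 14.0, 5.0, 5.0])

cov_ab = np.cov(A, B, ddof=0)[0, 1]   # population covariance E[AB] - E[A]E[B]
r_ab = np.corrcoef(A, B)[0, 1]        # Pearson correlation: covariance normalized by std devs

print(cov_ab, r_ab)
# Both are positive here, so A and B tend to rise and fall together.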
127
128
Tuple Duplication
Data Preprocessing
■ Data Quality
■ Data Cleaning
■ Data Integration
■ Data Reduction
■ Summary
131
Data Reduction Strategies
■ Data reduction: Obtain a reduced representation of the data set that
is much smaller in volume but yet produces the same (or almost the
same) analytical results
■ Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time to
run on the complete data set.
■ Data reduction strategies
■ Dimensionality reduction: the process of reducing the number of
random variables or attributes under consideration
132
Data Reduction 1: Dimensionality
Reduction
■ Curse of dimensionality
■ When dimensionality increases, data becomes increasingly sparse
■ Density and distance between points, which is critical to clustering, outlier
analysis, becomes less meaningful
■ The possible combinations of subspaces will grow exponentially
■ Dimensionality reduction
■ Avoid the curse of dimensionality
■ Help eliminate irrelevant features and reduce noise
■ Reduce time and space required in data mining
■ Allow easier visualization
■ Dimensionality reduction techniques
■ Wavelet transforms
■ Principal Component Analysis
■ Supervised and nonlinear techniques (e.g., feature selection)
133
Mapping Data to a New Space
■ Fourier transform
■ Wavelet transform
134
What Is Wavelet Transform?
■ Decomposes a signal into
different frequency subbands
■ Applicable to n-
dimensional signals
■ Data are transformed to
preserve relative distance
between objects at different
levels of resolution
■ Allow natural clusters to
become more distinguishable
■ Used for image compression
135
Wavelet Transformation
Haar2 Daubechie4
■ Discrete wavelet transform (DWT) for linear signal
processing, multi-resolution analysis
■ Compressed approximation: store only a small fraction of
the strongest of the wavelet coefficients
■ Similar to discrete Fourier transform (DFT), but better
lossy compression, localized in space
■ Method:
■ Length, L, must be an integer power of 2 (padding with 0’s, when
necessary)
■ Each transform has 2 functions: smoothing, difference
■ Applies to pairs of data, resulting in two sets of data of length L/2
■ Applies two functions recursively, until reaches the desired length
136
Wavelet Decomposition
■ Wavelets: A math tool for space-efficient hierarchical
decomposition of functions
■ S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to S^ =
[2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
■ Compression: many small detail coefficients can be
replaced by 0’s, and only the significant coefficients are
retained
137
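A minimal sketch of the Haar decomposition described above (plain Python, assuming the signal length is a power of two); it reproduces S^ for the sample signal:

def haar_dwt(signal):
    """One-dimensional Haar wavelet decomposition (averaging + differencing)."""
    data = list(signal)
    details = []
    while len(data) > 1:
        # Pairwise averages (smoothing) and half-differences (detail coefficients)
        avgs = [(a + b) / 2 for a, b in zip(data[0::2], data[1::2])]
        diffs = [(a - b) / 2 for a, b in zip(data[0::2], data[1::2])]
        details = diffs + details   # keep coarser details in front of finer ones
        data = avgs
    return data + details           # [overall average, detail coefficients ...]

print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]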
Haar Wavelet Coefficients
(Figure: hierarchical decomposition structure, a.k.a. the “error tree,” for the
original signal 2, 2, 0, 2, 3, 5, 4, 4; the root holds the overall average 2.75,
the next level holds -1.25, the level below holds 0.5 and 0, and the leaves hold
the detail coefficients 0, -1, -1, 0. Each coefficient’s “support” is the range of
original values it influences, with alternating + and - signs.)
138
Why Wavelet Transform?
■ Use hat-shape filters
■ Emphasize region where points cluster
■ Multi-resolution
■ Detect arbitrary shaped clusters at different scales
■ Efficient
■ Complexity O(N)
139
Principal Component Analysis (PCA)
■ Find a projection that captures the largest amount of variation in data
■ The original data are projected onto a much smaller space, resulting
in dimensionality reduction. We find the eigenvectors of the
covariance matrix, and these eigenvectors define the new space
(Figure: a 2-D data cloud over axes x1 and x2, with its principal directions of variation)
140
Principal Component Analysis (Steps)
■ Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
■ Normalize input data: Each attribute falls within the same range
■ Compute k orthonormal (unit) vectors, i.e., principal components
■ Each input data (vector) is a linear combination of the k principal
component vectors
■ The principal components are sorted in order of decreasing
“significance” or strength
■ Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with low
variance (i.e., using the strongest principal components, it is
possible to reconstruct a good approximation of the original data)
■ Works for numeric data only
141
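A hedged sketch of these steps with scikit-learn (assuming NumPy and scikit-learn are available; the data matrix is invented):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 6 records, 3 numeric attributes
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.2],
              [2.3, 2.7, 0.6]])

X_norm = StandardScaler().fit_transform(X)   # step 1: normalize each attribute

pca = PCA(n_components=2)                    # keep the k = 2 strongest components
X_reduced = pca.fit_transform(X_norm)        # project data onto the new space

print(pca.explained_variance_ratio_)         # components sorted by "significance"
print(X_reduced.shape)                       # (6, 2): the reduced representation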
Attribute Subset Selection
■ way to reduce dimensionality of data
■ Redundant attributes
■ Duplicate much or all of the information contained in
one or more other attributes
■ E.g., purchase price of a product and the amount of
sales tax paid
■ Irrelevant attributes
■ Contain no information that is useful for the data
mining task at hand
■ E.g., students' ID is often irrelevant to the task of
predicting students' GPA
142
Heuristic Search in Attribute Selection
■ There are 2n possible attribute combinations of n attributes
■ Typical heuristic attribute selection methods:
1. Stepwise forward selection: The procedure starts with an empty set of
attributes as the reduced set. The best of the original attributes is determined
and added to the reduced set. At each subsequent iteration or step,
the best of the remaining original attributes is added to the set.
2. Stepwise backward elimination: The procedure starts with the full set
of attributes. At each step, it removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination: The
stepwise forward selection and backward elimination methods can be combined so
that, at each step, the procedure selects the best attribute and
removes the worst from among the remaining attributes.
(A sketch of stepwise forward selection appears after the list of methods.)
143
■ 4. Decision tree induction: Decision tree
algorithms (e.g., ID3, C4.5, and CART) were
originally intended for classification
144
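A hedged sketch of stepwise forward selection using scikit-learn's SequentialFeatureSelector (available in scikit-learn 0.24+; the estimator, data set, and number of features are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Greedy forward selection: start from an empty set and repeatedly add the
# attribute that most improves cross-validated accuracy
selector = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=0),
    n_features_to_select=2,
    direction="forward",          # "backward" gives stepwise backward elimination
)
selector.fit(X, y)
print(selector.get_support())     # boolean mask of the selected attributes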
Data Reduction 2: Numerosity
Reduction
■ Reduce data volume by choosing alternative, smaller
forms of data representation
■ Parametric methods (e.g., regression)
■ Assume the data fits some model, estimate model parameters,
store only the parameters, and discard the data
145
■ In (simple) linear regression, the data are modeled to fit a
straight line. For example, a random variable, y (called a
response variable), can be modeled as a linear function of
another random variable, x (called a predictor variable), with
the equation
■ y = w x + b
■ where the variance of y is assumed to be constant.
■ In the context of data mining, x and y are numeric database
attributes.
■ The coefficients, w and b (called regression coefficients),
specify the slope of the line and the y-intercept, respectively
■ Multiple linear regression is an extension of (simple) linear regression,
which allows a response variable, y, to be modeled as a linear function of
two or more predictor variables.
146
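A short NumPy sketch (the sample points are invented) estimating the regression coefficients w and b with the least squares criterion:

import numpy as np

# Hypothetical predictor/response pairs (e.g., years of experience vs. salary)
x = np.array([3.0, 8.0, 9.0, 13.0, 3.0, 6.0, 11.0, 21.0, 1.0, 16.0])
y = np.array([30.0, 57.0, 64.0, 72.0, 36.0, 43.0, 59.0, 90.0, 20.0, 83.0])

# Least-squares estimates: w = cov(x, y) / var(x), b = mean(y) - w * mean(x)
w = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - w * x.mean()

print(f"y = {w:.2f} x + {b:.2f}")   # the fitted straight line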
■ Log-linear models approximate discrete multidimensional probability
distributions.
■ Given a set of tuples in n dimensions (e.g., described by n attributes), we can
consider each tuple as a point in an n-dimensional space.
■ Log-linear models can be used to estimate the probability of each point in a
multidimensional space for a set of discretized attributes,
■ This allows a higher-dimensional data space to be constructed from lower-
dimensional spaces.
■ Log-linear models are therefore also useful for dimensionality reduction
(since the lower-dimensional points together typically occupy less space than
the original data points) and data smoothing (since aggregate estimates in the
lower-dimensional space are less subject to sampling variations than the
estimates in the higher-dimensional space).
147
Parametric Data Reduction:
Regression and Log-Linear Models
■ Linear regression
■ Data modeled to fit a straight line
■ Multiple regression
■ Allows a response variable Y to be modeled as a linear
function of two or more predictor variables
■ Log-linear model
■ Approximates discrete multidimensional probability distributions
148
Regression Analysis
(Figure: data points in the x–y plane with a fitted regression line)
149
Regression Analysis and Log-Linear
Models
■ Linear regression: Y = w X + b
■ Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
■ Using the least squares criterion to the known values of Y1, Y2, …,
X1, X2, ….
■ Multiple regression: Y = b0 + b1 X1 + b2 X2
■ Many nonlinear functions can be transformed into the above
■ Log-linear models:
■ Approximate discrete multidimensional probability distributions
■ Estimate the probability of each point (tuple) in a multi-dimensional
space for a set of discretized attributes, based on a smaller subset
of dimensional combinations
■ Useful for dimensionality reduction and data smoothing
150
Histogram Analysis
151
■ Histograms Example
■ The following data are a list of AllElectronics prices
for commonly sold items (rounded to the nearest
dollar). The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5,
8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15,
15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20,
20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
152
■ Figure shows a histogram for the data using
singleton buckets
153
■ To further reduce the data, it is common to have
each bucket denote a continuous value range for
the given attribute. In Figure , each bucket
represents a different $10 range for price
154
■ “How are the buckets determined and the attribute values
partitioned?”
■ There are several partitioning rules, including the following:
■ Equal-width: In an equal-width histogram, the width of each
bucket range is uniform
156
Sampling
157
Types of Sampling
158
159
Sampling: With or without Replacement
Raw Data
160
❑ Stratified sample: If D is divided into mutually disjoint parts called
strata, a stratified sample of D is generated by obtaining an SRS (simple
random sample) at each stratum.
❑This helps ensure a representative sample, especially when the data are
skewed. For example, a stratified sample may be obtained from customer
data, where a stratum is created for each customer age group
❑In this way, the age group having the smallest number of customers will
be sure to be represented
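A brief pandas sketch (column names and values are invented) that draws a stratified sample by taking a simple random sample within each age-group stratum:

import pandas as pd

customers = pd.DataFrame({
    "cust_id": range(1, 11),
    "age_group": ["youth", "youth", "youth", "middle", "middle",
                  "middle", "middle", "senior", "senior", "senior"],
})

# Sample 50% within each stratum so every age group stays represented
stratified = customers.groupby("age_group").sample(frac=0.5, random_state=42)
print(stratified)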
Data Cube Aggregation
■ Concept hierarchies may exist for each attribute, allowing the analysis
of data at multiple abstraction levels
■ For example, a hierarchy for branch could allow branches to be
grouped into regions, based on their address
■ Data cubes provide fast access to precomputed, summarized data,
thereby benefiting online analytical processing as well as data mining
■ The cube created at the lowest abstraction level is referred to as the
base cuboid
■ base cuboid should correspond to an individual entity of interest such
as sales or customer.
■ A cube at the highest level of abstraction is the apex cuboid. For
the sales data in Figure 3.11,
■ the apex cuboid would give one total—the total sales for all three
years, for all item types, and for all branches
166
Data Reduction : Data Compression
■ String compression
■ There are extensive theories and well-tuned algorithms
(Figure: original data and its approximation after lossy compression)
169
Data Preprocessing
■ Data Quality
■ Data Cleaning
■ Data Integration
■ Data Reduction
■ Summary
170
Data Transformation
■ A function that maps the entire set of values of a given attribute to a
new set of replacement values s.t. each old value can be identified
with one of the new values
■ Methods
■ Smoothing: Remove noise from data, techniques include binning,
regression, and clustering
■ Attribute/feature construction
■ New attributes constructed from the given ones
■ Aggregation: Summarization, data cube construction
■ Normalization: where the attribute data are scaled so as to fall
within a smaller range, such as -1.0 to 1.0, or 0.0 to 1.0.
■ Discretization: where the raw values of a numeric attribute (e.g.,
age) are replaced by interval labels (e.g., 0–10, 11–20, etc.) or
conceptual labels (e.g., youth, adult, senior), labels, in turn, can
be recursively organized into higher-level concepts, resulting in a
concept hierarchy for the numeric attribute
171
■ Concept hierarchy generation for nominal data: where attributes
such as street can be generalized to higher-level concepts, like
city or country. Many hierarchies for nominal attributes are
implicit within the database schema and can be automatically
defined at the schema definition level
172
Data Transformation by Normalization
174
Data Transformation by Normalization
■ Let A be a numeric attribute with n observed values v1, v2, …, vn
■ Min-max normalization: performs a linear transformation on the original
data
■ Suppose that minA and maxA are the minimum and maximum values of an
attribute, A. Min-max normalization maps a value, vi, of A to vi' in the range
[new_minA, new_maxA] by computing
vi' = ((vi − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
175
■ Z-score normalization (μ: mean, σ: standard deviation): the values
for an attribute, A, are normalized based on the mean (i.e., average)
and standard deviation of A.
■ The value vi of A is normalized to vi' by computing
vi' = (vi − μA) / σA
176
■ Normalization by decimal scaling: normalizes by moving the
decimal point of values of attribute A. The number of decimal points
moved depends on the maximum absolute value of A.
■ The value vi of A is normalized to vi' by computing
vi' = vi / 10^j, where j is the smallest integer such that max(|vi'|) < 1
178
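A compact NumPy sketch (the value vector is illustrative) applying all three normalization methods described above:

import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])  # e.g., salaries

# Min-max normalization into [0.0, 1.0]
new_min, new_max = 0.0, 1.0
minmax = (values - values.min()) / (values.max() - values.min()) \
         * (new_max - new_min) + new_min

# Z-score normalization
zscore = (values - values.mean()) / values.std()

# Decimal scaling: divide by the smallest power of 10 that brings every |v'| below 1
j = int(np.ceil(np.log10(np.abs(values).max() + 1)))
decimal = values / (10 ** j)

print(minmax, zscore, decimal, sep="\n")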
Data Discretization Methods
■ Typical methods: All the methods can be applied recursively
■ Binning
■ Top-down split, unsupervised. For example, attribute values can be
discretized by applying equal-width or equal-frequency binning, and
then replacing each bin value by the bin mean or median, as in
smoothing by bin means or smoothing by bin medians, respectively
■ Histogram analysis
■ Top-down split, unsupervised. In an equal-width histogram, for example,
the values are partitioned into equal-size partitions or ranges
■ Clustering analysis (unsupervised, top-down split or bottom-up
merge)
■ Decision-tree analysis (supervised, top-down split)
179
Discretization Without Using Class Labels
(Binning vs. Clustering)
180
Discretization by Classification &
Correlation Analysis
■ Classification (e.g., decision tree analysis)
■ Supervised: Given class labels, e.g., cancerous vs. benign
■ Using entropy to determine split point (discretization point)
■ Top-down, recursive split
■ Details to be covered in Chapter 7
■ Correlation analysis (e.g., Chi-merge: χ2-based discretization)
■ Supervised: use class information
■ Bottom-up merge: find the best neighboring intervals (those
having similar distributions of classes, i.e., low χ2 values) to merge
■ Merge performed recursively, until a predefined stopping condition
181
Concept Hierarchy Generation
182
Concept Hierarchy Generation
for Nominal Data
■ Specification of a partial/total ordering of attributes
explicitly at the schema level by users or experts
■ street < city < state < country
■ Specification of a portion of a hierarchy for a set of values
by explicit data grouping
■ “{Alberta, Saskatchewan, Manitoba} ⊂ prairies Canada” and
■ “{British Columbia, prairies Canada} ⊂ Western Canada”
■ Specification of a set of attributes, but not of their partial
ordering: A user may specify a set of attributes forming a
concept hierarchy, but omit to explicitly state their partial
ordering. The system can then try to automatically
generate the attribute ordering so as to construct a
meaningful concept hierarchy
183
Automatic Concept Hierarchy Generation
■ Some hierarchies can be automatically generated based on
the analysis of the number of distinct values per attribute in
the data set
■ The attribute with the most distinct values is placed at the lowest
level of the hierarchy
■ Data Quality
■ Data Cleaning
■ Data Integration
■ Data Reduction
■ Summary
185
Summary
■ Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
■ Data cleaning: e.g. missing/noisy values, outliers
■ Data integration from multiple sources:
■ Entity identification problem
■ Remove redundancies
■ Detect inconsistencies
■ Data reduction
■ Dimensionality reduction
■ Numerosity reduction
■ Data compression
186
Links
• https://www.geeksforgeeks.org/discretization-by-histogram-analysis-in-
data-mining/
• https://zebrabi.com/advanced-guide/how-to-create-a-histogram-in-
power-
bi/#:~:text=Step%2Dby%2DStep%20Guide%20to%20Creating%20a%2
0Histogram%20in%20Power%20BI,-
Now%20that%20your&text=Click%20on%20the%20%22Visualizations
%22%20pane,field%20of%20the%20histogram%20visualization.
• https://youtu.be/cz1-SITC5vE
• https://www.javatpoint.com/discretization-in-data-mining