Notes Module 2
• The input-data is stored in various formats such as flat files, spreadsheets or relational tables.
• Purpose of preprocessing: to transform the raw input-data into an appropriate format
for subsequent analysis.
MOTIVATING CHALLENGES
Scalability
• Nowadays, data-sets with sizes of terabytes or even petabytes are becoming common.
• DM algorithms must be scalable in order to handle these massive data sets.
• Scalability may also require the implementation of novel data structures to access
individual records in an efficient manner.
• Scalability can also be improved by developing parallel & distributed algorithms.
High Dimensionality
• Traditional data-analysis techniques can only deal with low-dimensional data.
• Nowadays, data-sets with hundreds or thousands of attributes are becoming common.
• Data-sets with temporal or spatial components also tend to have high dimensionality.
• The computational complexity increases rapidly as the dimensionality increases.
Heterogeneous and Complex Data
• Traditional analysis methods can deal only with homogeneous types of attributes.
• Recent years have also seen the emergence of more complex data-objects.
• DM techniques for complex objects should take into consideration relationships in the data,
such as
→ temporal & spatial autocorrelation
→ parent-child relationships between the elements in semi-structured text & XML
documents
Data Ownership & Distribution
• Sometimes, the data is geographically distributed among resources belonging to multiple
entities.
• Key challenges include:
1) How to reduce amount of communication needed to perform the distributed
computation
2) How to effectively consolidate the DM results obtained from multiple sources &
3) How to address data-security issues
Non Traditional Analysis
• The traditional statistical approach is based on a hypothesize-and-test paradigm.
In other words, a hypothesis is proposed, an experiment is designed to gather the data,
and then the data is analyzed with respect to the hypothesis.
• Current data analysis tasks often require the generation and evaluation of thousands
of hypotheses, and consequently, the development of some DM techniques has been
motivated by the desire to automate the process of hypothesis generation and evaluation.
3) Association Analysis
• This is used to discover patterns that describe strongly associated features in the data.
• The goal is to extract the most interesting patterns in an efficient manner.
• Useful applications include
→ finding groups of genes that have related functionality or
→ identifying web pages that are accessed together
• Ex: market basket analysis
We may discover the rule {Diapers} -> {Milk}, which suggests that
customers who buy diapers also tend to buy milk.
4) Anomaly Detection
• This is the task of identifying observations whose characteristics are
significantly different from the rest of the data. Such observations are
known as anomalies or outliers.
• The goal is
→ to discover the real anomalies and
→ to avoid falsely labeling normal objects as anomalous.
• Applications include the detection of fraud, network intrusions, and unusual patterns
of disease.
Example 1.4 (Credit Card Fraud Detection).
• A credit card company records the transactions made by every credit
card holder, along with personal information such as credit limit, age,
annual income, and address.
• Since the number of fraudulent cases is relatively small compared to
the number of legitimate transactions, anomaly detection techniques can
be applied to build a profile of legitimate transactions for the users.
• When a new transaction arrives, it is compared against the profile of the user.
• If the characteristics of the transaction are very different from the
previously created profile, then the transaction is flagged as potentially
fraudulent
The type of an attribute depends on which of the following properties of numbers it possesses:
1) Distinctness: = ≠
2) Order: < >
3) Addition: + -
4) Multiplication: * /
• Nominal attribute: Uses only distinctness. Examples: ID numbers, eye color, pin codes.
• Ordinal attribute: Uses distinctness & order. Examples: grades in {SC, FC, FCD}, shirt sizes in {S, M, L, XL}.
• Interval attribute: Uses distinctness, order & addition
Examples: calendar dates, temperatures in Celsius or Fahrenheit.
• Ratio attribute: Uses all 4 properties
Examples: temperature in Kelvin, length, time, counts
ASYMMETRIC ATTRIBUTES
• Binary attributes where only non-zero values are important are called asymmetric
binary attributes.
• Consider a data-set where each object is a student and each attribute records whether
or not a student took a particular course at a university.
• For a specific student, an attribute has a value of 1 if the student took the course
associated with that attribute and a value of 0 otherwise.
• Because students take only a small fraction of all available courses, most of the values
in such a data-set would be 0.
• Therefore, it is more meaningful and more efficient to focus on the non-zero values.
• This type of attribute is particularly important for association analysis.
RECORD DATA
• Data-set is a collection of records.
Each record consists of a fixed set of attributes.
• Every record has the same set of attributes.
• There is no explicit relationship among records or attributes.
• The data is usually stored in either flat files or relational databases.
Document Data
• A document can be represented as a 'vector', where each term is an attribute of the vector and the value of each attribute is the number of times the corresponding term occurs in the document.
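As a small illustration (not part of the original notes), the following Python sketch builds such term-frequency vectors for two made-up documents over a tiny vocabulary:

```python
from collections import Counter

# Hypothetical example documents (made up for illustration)
docs = ["the team won the game",
        "the coach praised the team after the game"]

# Build the vocabulary: each distinct term becomes one attribute of the vector
vocab = sorted({term for d in docs for term in d.split()})

def term_frequency_vector(doc, vocab):
    """Value of each attribute = number of times the term occurs in the document."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]

for d in docs:
    print(term_frequency_vector(d, vocab))
```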
ORDERED DATA
Sequential Data (Temporal Data)
• This can be thought of as an extension of record-data, where each record has a time associated with it.
• A time can also be associated with each attribute.
• For example, each record could be the purchase history of a customer, with a listing of items purchased at different times.
• Using this information, it is possible to find patterns such as "people who buy DVD players tend to buy DVDs in the period immediately following the purchase."
Sequence Data
• This consists of a data-set that is a sequence of individual entities, such
as a sequence of words or letters.
• This is quite similar to sequential data, except that there are no time stamps;
instead, there are positions in an ordered sequence.
• For example, the genetic information of plants and animals can be
represented in the form of sequences of nucleotides that are known as
genes.
Time Series Data
• This is a special type of sequential data in which a series of measurements
are taken over time.
• For example, a financial data-set might contain objects that are time
series of the daily prices of various stocks.
• An important aspect of temporal-data is temporal-autocorrelation i.e.
if two measurements are close in time, then the values of those
measurements are often very similar.
Spatial Data
• Some objects have spatial attributes, such as positions or areas.
• An example is weather-data (temperature, pressure) that is collected for a variety of geographical locations.
• An important aspect of spatial-data is spatial-autocorrelation i.e.
objects that are physically close tend to be similar in other ways as well.
Data Quality
Preventing data quality problems is typically not possible. Hence, data mining focuses on:
1. Detection and Correction of Data Quality problems, this step is called Data Cleaning.
2. Use of Algorithms that can tolerate poor Data Quality.
Measurement Error: It refers to any problem resulting from the measurement process. A common
problem is that the value recorded differs from the true value to some extent.
Data Collection Error: It refers to errors such as omitting data objects or attribute values, or inappropriately including a data object.
Precision: The closeness of repeated measurements (of the same quantity) to one another.
Bias: A systematic variation of measurements from the quantity being measured.
Accuracy: Closeness of measurements to the true value of the quantity being measured.
Outliers:
1. Data objects that have characteristics that are different from most of the other data objects in the data set, or
2. Values of an attribute that are unusual with respect to the typical values for that attribute.
Missing Values
Inconsistent Values
Duplicate Data
A dataset may include data objects that are duplicates, or almost duplicates of one
another
To detect and eliminate such duplicates, two main issues must be addressed.
1. If there are two objects that actually represent a single object, then the values of
corresponding attributes may differ and these inconsistent values must be resolved.
2. Care needs to be taken to avoid accidentally combining data objects that are similar, but not duplicates, such as two distinct people with identical names.
The term deduplication is used to refer to the process of dealing with these issues.
1. Timeliness:
Some data starts to age as soon as it has been collected.
Eg. A snapshot of some ongoing process represents reality for only a limited time.
2. Relevance:
The available data must contain the information necessary for the application.
3. Knowledge about the data:
Ideally, the data sets are accompanied by documentation that describes different aspects
of data. The quality of this documentation can either aid or hinder the subsequent
analysis.
If the documentation is poor, then our analysis of the data may be faulty.
DATA PREPROCESSING
• Data preprocessing is a broad area and consists of a number of different strategies and techniques that are interrelated in complex ways.
• Different data preprocessing techniques are:
1. Aggregation
2. Sampling
3. Dimensionality reduction
4. Feature subset selection
5. Feature creation
6. Discretization and binarization
7. Variable transformation
AGGREGATION
• This refers to combining 2 or more attributes (or objects) into a single attribute (or object).
For example, merging daily sales figures to obtain monthly sales figures
Consider a data set consisting of transactions (data objects) recording the daily sales of products in various store locations for different days over the course of a year.
One way to aggregate transactions for this data set is to replace all the transactions of a single store with a single store-wide transaction.
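A minimal pandas sketch of this kind of aggregation, assuming hypothetical column names store, date and sales; it rolls daily per-store transactions up into monthly, store-wide totals:

```python
import pandas as pd

# Hypothetical daily sales transactions (column names are assumptions for illustration)
sales = pd.DataFrame({
    "store": ["S1", "S1", "S2", "S2"],
    "date":  pd.to_datetime(["2023-01-05", "2023-02-10", "2023-01-07", "2023-02-11"]),
    "sales": [120.0, 90.0, 200.0, 150.0],
})

# Aggregation: combine many daily transactions into one monthly figure per store
monthly = (sales
           .groupby(["store", sales["date"].dt.to_period("M")])["sales"]
           .sum()
           .reset_index())
print(monthly)
```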
SAMPLING
• This is a method used for selecting a subset of the data-objects to be analyzed.
• This is often used for both
→ preliminary investigation of the data → final data analysis
• Q: Why sampling?
Ans: Obtaining & processing the entire set of “data of interest” is too expensive or time
consuming.
• Sampling can reduce the data-size to the point where better & more expensive algorithms can be used.
• Key principle for effective sampling: Using a sample will work almost as well as
using entire data-set, if the sample is representative.
Sampling Methods
1) Simple Random Sampling
• There is an equal probability of selecting any particular object.
• There are 2 variations on random sampling:
i) Sampling without Replacement
• As each object is selected, it is removed from the population.
ii) Sampling with Replacement
• Objects are not removed from the population as they are selected for the
sample.
• The same object can be picked more than once.
• When the population consists of different types (or numbers) of objects, simple random sampling can fail to adequately represent those types of objects that are less frequent.
2) Stratified Sampling
• This starts with pre-specified groups of objects.
• In the simplest version, equal numbers of objects are drawn from each group even
though the groups are of different sizes.
• In another variation, the number of objects drawn from each group is
proportional to the size of that group.
3) Progressive Sampling
• If the proper sample-size is difficult to determine, then progressive sampling can be used.
• This method starts with a small sample, and then increases the sample-size until a
sample of sufficient size has been obtained.
• This method requires a way to evaluate the sample to judge if it is large enough.
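A short sketch of the sampling ideas above, using numpy and pandas on made-up data (the population, group column and sample sizes are placeholders chosen only for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
population = np.arange(100)          # hypothetical population of 100 object ids

# 1) Simple random sampling
without_repl = rng.choice(population, size=10, replace=False)  # each object picked at most once
with_repl    = rng.choice(population, size=10, replace=True)   # same object may be picked again

# 2) Stratified sampling (proportional variant): draw from each pre-specified group
df = pd.DataFrame({"id": population,
                   "group": np.where(population < 80, "common", "rare")})  # made-up strata
stratified = pd.concat(
    [grp.sample(frac=0.1, random_state=0)     # number drawn is proportional to group size
     for _, grp in df.groupby("group")]
)
print(without_repl, with_repl, stratified.shape, sep="\n")
```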
DIMENSIONALITY REDUCTION
• Key benefit: many DM algorithms work better if the dimensionality is lower.
Purpose
• May help to eliminate irrelevant features or reduce noise.
• Can lead to a more understandable model (which can be easily visualized).
• Reduce amount of time and memory required by DM algorithms.
• Avoid curse of dimensionality.
The Curse of Dimensionality
• Data-analysis becomes significantly harder as the dimensionality of the data increases.
• For classification, this can mean that there are not enough data-objects to allow
the creation of a model that reliably assigns a class to all possible objects.
• For clustering, the definitions of density and the distance between points (which are
critical for clustering) become less meaningful.
• As a result, we get
→ reduced classification accuracy &
→ poor quality clusters.
Techniques
– Principal Components Analysis (PCA) – finds a projection that captures the largest amount of variation in the data.
– Singular Value Decomposition (SVD) – a linear algebra technique, closely related to PCA, that also finds new attributes that are linear combinations of the original continuous attributes.
– Others: supervised and non-linear techniques
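As a rough illustration of PCA computed via SVD (not from the notes), the following numpy sketch mean-centres made-up data and projects it onto the directions of largest variation:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))          # hypothetical data: 50 objects, 5 attributes

# PCA via SVD: centre the data, then take the top-k right singular vectors
Xc = X - X.mean(axis=0)               # mean-centre each attribute
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2                                  # target (reduced) dimensionality
X_reduced = Xc @ Vt[:k].T              # projection onto the first k principal components

explained = (s[:k] ** 2) / (s ** 2).sum()   # fraction of variance captured by each component
print(X_reduced.shape, explained)
```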
FEATURE CREATION
Create new attributes that can capture the important information in a data set much more
efficiently than the original attributes. Furthermore, the number of new attributes can be smaller than the number of original attributes.
Binarization
• A simple technique to binarize a categorical attribute is the following: If
there are m categorical values, then uniquely assign each original value to an
integer in interval [0,m-1].
• Next, convert each of these m integers to a binary number.
• Since n = ⌈log2(m)⌉ binary digits are required to represent these integers, represent these binary numbers using n binary attributes.
Eg. For a categorical attribute with five values {awful, poor, OK, good, great} mapped to the integers 0–4, n = 3 binary attributes x1, x2, x3 are needed; attributes x2 and x3 become correlated because information about the 'good' value (binary 011) is encoded using both attributes.
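A small Python sketch of this binarization scheme, assuming the hypothetical five-value attribute used in the example above:

```python
import math

# Hypothetical categorical values (an ordinal attribute with m = 5 values)
values = ["awful", "poor", "OK", "good", "great"]
m = len(values)
n = math.ceil(math.log2(m))                 # number of binary attributes needed

# Step 1: uniquely assign each categorical value to an integer in [0, m-1]
to_int = {v: i for i, v in enumerate(values)}

# Step 2: convert each integer to n binary digits (one binary attribute per digit)
def binarize(value):
    i = to_int[value]
    return [(i >> bit) & 1 for bit in reversed(range(n))]

for v in values:
    print(v, binarize(v))                    # e.g. 'good' -> [0, 1, 1]: x2 and x3 are both 1
```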
Discretization of Continuous Attributes:
The problem of discretization is one of deciding how many split points to choose and where to place them. The result can be represented either as a set of intervals {(x0, x1], (x1, x2], ..., (xn−1, xn]}, or equivalently, as a series of cut points x1, x2, ..., xn−1.
Unsupervised Discretization:
Equal width approach – divides the range of the attribute into a user-specified number of intervals, each having the same width.
Disadvantage: badly affected by outliers.
Equal frequency (depth) approach – tries to put the same number of objects into each interval.
Clustering methods can also be used.
Supervised Discretization:
Place the splits in a way that maximizes the purity of the intervals.
Entropy-based approach:
Definition of Entropy:
The entropy of the i-th interval is
$e_i = -\sum_{j=1}^{k} p_{ij} \log_2 p_{ij}$
where
k = number of different class labels,
$p_{ij} = m_{ij}/m_i$ is the probability (fraction) of values of class j in the i-th interval,
$m_i$ = number of values in the i-th interval of the partition,
$m_{ij}$ = number of values of class j in interval i.
The total entropy of the partition is the weighted average of the individual interval entropies:
$e = \sum_{i=1}^{n} w_i e_i$
where $w_i = m_i/m$ is the fraction of values in the i-th interval, m = total number of values, and n = number of intervals.
Entropy of an interval is a measure of the purity of the interval.
If an interval contains only values of one class (is perfectly pure), then the entropy is 0 and it contributes nothing to the overall entropy.
If the classes of values in an interval occur equally often (the interval is as impure as possible), then the entropy is maximum.
A simple approach for partitioning a continuous attribute starts by bisecting the initial values so that the resulting two intervals give minimum entropy.
The splitting process is then repeated with another interval, typically choosing the interval with the worst (highest) entropy, until a user-specified number of intervals is reached, or a stopping criterion is satisfied.
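A compact Python sketch of the bisection step described above, written against the entropy definitions given earlier; the (value, class) data is made up for illustration:

```python
import math

def entropy(labels):
    """e_i = -sum_j p_ij * log2(p_ij) for one interval."""
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def best_bisection(values, labels):
    """Try every candidate split point and return the one with minimum total (weighted) entropy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    values = [values[i] for i in order]
    labels = [labels[i] for i in order]
    best = None
    for i in range(1, len(values)):
        left, right = labels[:i], labels[i:]
        w_left, w_right = len(left) / len(labels), len(right) / len(labels)
        e = w_left * entropy(left) + w_right * entropy(right)   # total entropy of the partition
        split = (values[i - 1] + values[i]) / 2
        if best is None or e < best[0]:
            best = (e, split)
    return best

# Made-up continuous attribute with two classes
vals = [1.0, 1.5, 2.0, 7.0, 7.5, 8.0]
labs = ["A", "A", "A", "B", "B", "B"]
print(best_bisection(vals, labs))   # perfect split between 2.0 and 7.0 -> entropy 0
```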
VARIABLE TRANSFORMATION
It is a function that maps the entire set of values of a given attribute to a new set of replacement
values such that each old value can be identified with one of the new values
• This refers to a transformation that is applied to all the values of a variable.
• Ex: converting a floating point value to an absolute value.
• Two types are:
1) Simple Functions
• A simple mathematical function is applied to each value individually.
• If x is a variable, then examples of transformations include $e^x$, 1/x, log(x) and sin(x).
2) Normalization (or Standardization)
• The goal is to make an entire set of values have a particular property.
• A traditional example is that of "standardizing a variable" in statistics.
• If $\bar{x}$ is the mean of the attribute values and $s_x$ is their standard deviation, then the transformation $x' = (x - \bar{x})/s_x$ creates a new variable that has a mean of 0 and a standard deviation of 1.
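A minimal numpy sketch of this standardization transformation on made-up attribute values:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])      # made-up attribute values

x_bar = x.mean()                               # mean of the attribute values
s_x = x.std(ddof=1)                            # sample standard deviation

x_std = (x - x_bar) / s_x                      # x' = (x - x_bar) / s_x
print(x_std.mean(), x_std.std(ddof=1))         # approximately 0, and exactly 1
```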
MEASURES OF SIMILARITY AND DISSIMILARITY
• Similarity and dissimilarity measures are used by a number of DM techniques, such as clustering, nearest-neighbor classification and anomaly detection.
• Proximity is used to refer to either similarity or dissimilarity.
• The similarity between 2 objects is a numerical measure of the degree to which the 2 objects are alike.
• Consequently, similarities are higher for pairs of objects that are more alike.
• Similarities are usually non-negative and are often between 0(no similarity) and
1(complete similarity).
• The dissimilarity between 2 objects is a numerical measure of the degree to which the
2 objects are different.
• Dissimilarities are lower for more similar pairs of objects.
• The term distance is used as a synonym for dissimilarity.
• Dissimilarities sometimes fall in the interval [0,1], but it is also common for them to range from 0 to infinity.
• The Euclidean distance measure given in equation 2.1 is generalized by the Minkowski distance metric, given by
$d(x, y) = \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r}$
where r is a parameter and n is the number of attributes (dimensions).
• The following are the three most common examples of Minkowski distance:
r = 1: City block (Manhattan, L1 norm) distance.
A common example is the Hamming distance, which is the number of bits that are different between two objects that have only binary attributes, i.e., between two binary vectors.
r = 2: Euclidean distance (L2 norm).
r = ∞: Supremum (L∞ or Lmax norm) distance. This is the maximum difference between any attribute of the objects. It is defined by
$d(x, y) = \max_{k} |x_k - y_k|$
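A short numpy sketch computing the three Minkowski distances listed above for two made-up points:

```python
import numpy as np

x = np.array([1.0, 4.0, 7.0])   # made-up points
y = np.array([2.0, 6.0, 3.0])

l1   = np.sum(np.abs(x - y))            # r = 1: city block (Manhattan) distance
l2   = np.sqrt(np.sum((x - y) ** 2))    # r = 2: Euclidean distance
linf = np.max(np.abs(x - y))            # r = infinity: supremum distance (max attribute difference)

print(l1, l2, linf)                     # 7.0, ~4.58, 4.0
```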
• If d(x,y) is the distance between two points, x and y, then the following properties hold
1) Positivity
d(x, y) >= 0 for all x and y, and d(x, y) = 0 only if x = y.
2) Symmetry
d(x, y) = d(y, x) for all x and y.
3) Triangle Inequality
d(x, z) <= d(x, y) + d(y, z) for all points x, y and z.
• Measures that satisfy all three properties are known as metrics.
JACCARD COEFFICIENT
• The Jaccard coefficient is frequently used to handle objects consisting of asymmetric binary attributes.
• The Jaccard coefficient is given by the following equation:
$J = \dfrac{f_{11}}{f_{01} + f_{10} + f_{11}}$
where $f_{11}$ = the number of attributes where both x and y have a value of 1, $f_{01}$ = the number of attributes where x is 0 and y is 1, and $f_{10}$ = the number of attributes where x is 1 and y is 0.
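A tiny Python sketch of the Jaccard coefficient for two made-up binary vectors, counting f11, f10 and f01 as defined above:

```python
# Made-up binary vectors with asymmetric binary attributes
x = [1, 0, 0, 1, 1, 0, 0, 0]
y = [1, 0, 1, 0, 1, 0, 0, 0]

f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))   # both 1
f10 = sum(a == 1 and b == 0 for a, b in zip(x, y))   # x is 1, y is 0
f01 = sum(a == 0 and b == 1 for a, b in zip(x, y))   # x is 0, y is 1

jaccard = f11 / (f01 + f10 + f11)    # 0-0 matches are ignored (asymmetric attributes)
print(jaccard)                        # 2 / (1 + 1 + 2) = 0.5
```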
COSINE SIMILARITY
• Documents are often represented as vectors, where each attribute represents the frequency with which a particular term (or word) occurs in the document.
• This is more complicated, since certain common words are ignored and various
processing techniques are used to account for
→ different forms of the same word
→ differing document lengths and
→ different word frequencies.
• The cosine similarity is one of the most common measures of document similarity.
• If x and y are two document vectors, then
$\cos(x, y) = \dfrac{x \cdot y}{\|x\|\,\|y\|}$
where $x \cdot y$ is the dot product of the two vectors and $\|x\|$ is the length (Euclidean norm) of x.
• As indicated by Figure 2.16, cosine similarity is really a measure of the angle between x and y.
• Thus, if the cosine similarity is 1, the angle between x and y is 0° and x and y are the same except for magnitude (length).
• If the cosine similarity is 0, then the angle between x and y is 90° and they do not share any terms.
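A short numpy sketch of cosine similarity for two made-up term-frequency vectors:

```python
import numpy as np

# Made-up term-frequency vectors for two documents
x = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0], dtype=float)
y = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2], dtype=float)

cos = x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))   # cos(x, y) = x.y / (||x|| ||y||)
print(round(cos, 3))                                       # roughly 0.31
```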
Correlation
Correlation between two data objects that have binary or continuous variables is a measure of the linear relationship between the attributes of the objects.
More Precisely, Pearson’s correlation coefficient between two data objects x and y is defined
by,
$\mathrm{corr}(x, y) = \dfrac{\mathrm{covariance}(x, y)}{\mathrm{standard\ deviation}(x) \times \mathrm{standard\ deviation}(y)} = \dfrac{s_{xy}}{s_x\, s_y}$
where
$s_{xy} = \dfrac{1}{n-1} \sum_{k=1}^{n} (x_k - \bar{x})(y_k - \bar{y})$   (covariance of x and y)
$s_x = \sqrt{\dfrac{1}{n-1} \sum_{k=1}^{n} (x_k - \bar{x})^2}$   (standard deviation of x)
$s_y = \sqrt{\dfrac{1}{n-1} \sum_{k=1}^{n} (y_k - \bar{y})^2}$   (standard deviation of y)
$\bar{x} = \dfrac{1}{n} \sum_{k=1}^{n} x_k$ is the mean of x, and $\bar{y}$ is the mean of y.
Ex. x = (-3, 6, 0, 3, -6), y = (1, -2, 0, -1, 2)
Ans:
$s_{xy} = -15/2$, $s_x = \sqrt{45/2}$, $s_y = \sqrt{5/2}$
corr(x, y) = -1
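A brief numpy check of this worked example (np.corrcoef gives the same Pearson correlation):

```python
import numpy as np

x = np.array([-3, 6, 0, 3, -6], dtype=float)
y = np.array([1, -2, 0, -1, 2], dtype=float)

s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)   # covariance = -7.5
s_x, s_y = x.std(ddof=1), y.std(ddof=1)                         # sqrt(45/2), sqrt(5/2)

print(s_xy / (s_x * s_y))          # -1.0
print(np.corrcoef(x, y)[0, 1])     # same result from numpy
```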
ISSUES IN PROXIMITY CALCULATION
1) How to handle the case in which attributes have different scales and/or are correlated.
2) How to calculate proximity between objects that are composed of different types of
attributes e.g. quantitative and qualitative.
3) How to handle proximity calculation when attributes have different weights.
Question Bank
1. What is data mining? Explain Data Mining and Knowledge Discovery? (10)
2. What are different challenges that motivated the development of DM? (10)
3. Explain Origins of data mining (5)
4. Discuss the tasks of data mining with suitable examples. (10)
5. Explain Anomaly Detection. Give an example. (5)
6. Explain Descriptive tasks in detail? (10)
7. Explain Predictive tasks in detail by example? (10)
8. Explain Data set. Give an Example? (5)
9. Explain 4 types of attributes by giving appropriate example? (10)