Lecture 8 Data Preparation Techniques
2. Outlier analysis
Sometimes a dataset contains records that do not comply with the general pattern of the data. Data points that are significantly different from, or inconsistent with, the remaining data are called outliers. They often arise from measurement error.
Many data mining algorithms try to minimize the effect of outliers on the final model or eliminate them in the preprocessing phase. Outlier detection methods are used to detect and remove outliers from data, which reduces the risk of poor decisions.
A data mining analyst should remain careful when eliminating outliers, because some of that data may be useful for the organization. The process of detecting outliers consists of two steps:
1. Build a profile of normal behavior.
2. Use the normal profile to detect outliers.
Outlier detection and removal from a dataset can be described as a process of selecting k outliers out of N data points. The main types of outlier detection techniques are:
- Visualization techniques
- Statistical techniques
- Distance-based techniques
- Model-based techniques
Notice how two points fail to fit the pattern, where each point represents a student. Brad and Sharon are considered outliers because:
- Sharon is carrying a much heavier backpack than the pattern predicts.
- Brad is carrying a much lighter backpack than the pattern predicts.
Grubbs' Test
G = max |Xi − X̄| / S
where Xi is an element of the dataset, and X̄ and S denote the sample mean and standard deviation.
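As a rough sketch, the Grubbs statistic can be computed with Python's standard library; the function name below is illustrative, and the critical value must still be looked up in a Grubbs table for the given sample size and significance level:

```python
import statistics

def grubbs_statistic(data):
    """Compute the Grubbs test statistic G = max|Xi - mean| / stdev."""
    mean = statistics.mean(data)
    stdev = statistics.stdev(data)  # sample standard deviation
    return max(abs(x - mean) for x in data) / stdev

# Example: 12.0 lies far from the tightly clustered values.
sample = [2.1, 2.3, 2.2, 2.5, 2.4, 12.0]
G = grubbs_statistic(sample)
# Compare G against the tabulated critical value for n = 6 at the
# chosen significance level; if G exceeds it, flag the extreme point.
```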
Interquartile Range (IQR)
IQR is a quartile-based method to detect outliers. There are two steps:
1. First, find Q1 and Q3, the first and third quartiles.
2. Find the difference between them: H = Q3 − Q1.
A value lower than Q1 − 1.5H or higher than Q3 + 1.5H is considered a mild outlier. A value lower than Q1 − 3H or higher than Q3 + 3H is considered an extreme outlier.
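The two steps above can be sketched in Python. The quartile estimate used here (median of the lower and upper halves) is one of several common conventions, so exact boundaries may differ slightly from other tools:

```python
def iqr_outliers(data):
    """Classify values as mild or extreme outliers using the IQR rule."""
    xs = sorted(data)
    n = len(xs)

    def median(v):
        m = len(v) // 2
        return v[m] if len(v) % 2 else (v[m - 1] + v[m]) / 2

    # Quartiles as medians of the lower and upper halves of the data.
    q1 = median(xs[: n // 2])
    q3 = median(xs[(n + 1) // 2:])
    h = q3 - q1  # the interquartile range H

    mild = [x for x in xs
            if q1 - 3*h <= x < q1 - 1.5*h or q3 + 1.5*h < x <= q3 + 3*h]
    extreme = [x for x in xs if x < q1 - 3*h or x > q3 + 3*h]
    return mild, extreme

data = [1.2, 1.5, 1.7, 1.8, 2.0, 2.1, 2.2, 2.3, 2.4, 10.5]
mild, extreme = iqr_outliers(data)  # 10.5 falls beyond Q3 + 3H
```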
Dixon's Test:
Dixon's Q test, developed by W. J. Dixon in 1950, is used to identify outliers in small samples, particularly when n ≤ 30. The test statistic for Dixon's Q test is defined as follows:
Q= Gap / Range
where:
The Gap is the absolute difference between the suspected outlier and its nearest neighbor.
The Range is the difference between the maximum and minimum values in the dataset.
To perform the Dixon's Q test, follow these steps:
1. Arrange the data in ascending order.
2. Identify the suspected outlier. This could be the smallest or largest value in the dataset.
3. Calculate the Q statistic for the suspected outlier using the formula Q = Gap / Range.
4. Compare the calculated Q value to the critical Q value from the Dixon's Q table for the given sample
size and confidence level.
5. Decision rule:
o If the calculated Q value is greater than the critical Q value, the suspected value is considered an
outlier.
o If the calculated Q value is less than or equal to the critical Q value, the suspected value is not
considered an outlier.
Limitations
Dixon's Q test is most appropriate for small sample sizes (n ≤ 30).
It is sensitive to the presence of multiple outliers; thus, it is typically used to detect a single outlier.
The test assumes that the data come from a normally distributed population.
Here’s an example:
Suppose you have the following dataset: [1.2, 1.5, 1.7, 1.8, 2.0, 2.1, 2.2, 2.3, 2.4, 10.5]
1. Arrange the data in ascending order: [1.2, 1.5, 1.7, 1.8, 2.0, 2.1, 2.2, 2.3, 2.4, 10.5]
2. Suspected outlier: 10.5
3. Calculate the Q statistic: Q = (10.5 − 2.4) / (10.5 − 1.2) = 8.1 / 9.3 ≈ 0.871
4. Compare the calculated Q value to the critical Q value from the Dixon's Q table for n=10.
If the critical Q value (at the desired confidence level) is, for instance, 0.568, then since 0.871 > 0.568, 10.5 is
considered an outlier.
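The worked example can be reproduced with a short Python sketch; the function name is illustrative, and the critical value must still come from a Dixon's Q table:

```python
def dixon_q(data):
    """Compute Dixon's Q statistic for the most extreme value (n <= 30)."""
    xs = sorted(data)
    gap_low = xs[1] - xs[0]     # gap if the minimum is the suspect
    gap_high = xs[-1] - xs[-2]  # gap if the maximum is the suspect
    rng = xs[-1] - xs[0]        # range of the dataset
    if gap_high >= gap_low:
        return xs[-1], gap_high / rng
    return xs[0], gap_low / rng

data = [1.2, 1.5, 1.7, 1.8, 2.0, 2.1, 2.2, 2.3, 2.4, 10.5]
suspect, q = dixon_q(data)  # suspect = 10.5, q = 8.1 / 9.3
# Compare q to the tabulated critical value for n = 10 (e.g. 0.568 in
# the example above); q greater than critical means suspect is an outlier.
```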
Distance-Based Methods
Distance-based methods for outlier detection rely on a multi-dimensional index to determine the neighborhood
of each object. The main idea is to check whether the neighborhood of each object contains a sufficient number
of points. If an object does not have enough neighbors within a specified distance, it is considered an outlier.
Here are the key points:
Neighborhood Retrieval: The multi-dimensional index helps efficiently retrieve the neighborhood of
each object.
Sufficient Points Criterion: An object is considered an outlier if its neighborhood lacks sufficient
points.
Scalability: Distance-based methods are generally more scalable to multi-dimensional spaces compared
to statistical methods.
1. Index-Based Algorithms: These algorithms use index structures to search for neighbors of each object within a specified radius. The complexity is O(k·n²), where k is the number of dimensions and n is the number of objects.
2. Cell-Based Algorithms: These algorithms avoid the O(n²) computational complexity by partitioning the data space into cells. This technique is more efficient for memory-resident datasets.
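The neighborhood criterion can be sketched naively in Python. The function name, radius, and neighbor threshold are illustrative; this is the brute-force O(n²) scan that index-based and cell-based algorithms are designed to accelerate:

```python
import math

def distance_outliers(points, radius, min_neighbors):
    """Flag points whose neighborhood of the given radius holds too few points."""
    outliers = []
    for i, p in enumerate(points):
        # Count every other point lying within the radius of p.
        neighbors = sum(
            1 for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= radius
        )
        if neighbors < min_neighbors:
            outliers.append(p)
    return outliers

# Tight cluster near the origin plus one distant point.
pts = [(0, 0), (0.5, 0.2), (0.3, 0.4), (0.1, 0.6), (9, 9)]
print(distance_outliers(pts, radius=1.0, min_neighbors=2))  # [(9, 9)]
```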
Model-Based Techniques
Model-based techniques form the third class of outlier detection methods. These methods simulate the way
humans identify unusual samples within a set of similar samples. The core idea is to define a model that
characterizes the normal behavior of the data, and samples that deviate significantly from this model are
considered outliers.
Characteristics Definition: The model defines the typical characteristics of the dataset.
Outlier Detection: Samples that do not match the defined model characteristics are classified as
outliers.
Sequential Exception Technique: One approach within model-based methods is the sequential
exception technique, which relies on a dissimilarity function to identify outliers based on how different
they are from the expected pattern.