Lecture 8 Data Preparation Techniques
2. Outlier analysis
Sometimes a dataset contains records that do not comply with the general pattern of the data. Data points that are significantly different from, or inconsistent with, the remaining data are called outliers. They often arise from measurement error.
Many data mining algorithms try to minimize the effect of outliers on the final model or eliminate them in the preprocessing phase. Outlier detection methods are used to detect and remove outliers from data, which reduces the risk of poor decisions.
A data mining analyst should remain careful when eliminating outliers, because some of that data may be useful for the organization. The process of detecting outliers consists of two steps:
1. Build a profile of normal behavior.
2. Use the normal profile to detect outliers.
Outlier detection and removal from a dataset can be described as a process of selecting k outliers out of N data points. The main types of outlier detection techniques are:
- Visualization techniques
- Statistical techniques
- Distance-based techniques
- Model-based techniques
Notice how two points fail to fit the pattern, where each point represents a student. Brad and Sharon are considered outliers because:
- Sharon is carrying a much heavier backpack than the pattern predicts.
- Brad is carrying a much lighter backpack than the pattern predicts.
Grubbs' Test
G = max |Xi − X̄| / S
where Xi is an element of the dataset, and X̄ and S denote the sample mean and standard deviation.
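As a rough sketch, the Grubbs statistic can be computed with Python's standard library; the function name below is illustrative, and the critical value must still be looked up in a Grubbs table for the given sample size and significance level:

```python
import statistics

def grubbs_statistic(data):
    """Compute the Grubbs test statistic G = max|Xi - mean| / stdev."""
    mean = statistics.mean(data)
    stdev = statistics.stdev(data)  # sample standard deviation
    return max(abs(x - mean) for x in data) / stdev

# Example: 12.0 lies far from the tightly clustered values.
sample = [2.1, 2.3, 2.2, 2.5, 2.4, 12.0]
G = grubbs_statistic(sample)
# Compare G against the tabulated critical value for n = 6 at the
# chosen significance level; if G exceeds it, flag the extreme point.
```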
Interquartile Range (IQR)
IQR is a quartile-based method to detect outliers. There are two steps:
1. First, find Q1 and Q3, the first and third quartiles.
2. Find the difference between them: H = Q3 − Q1.
A value lower than Q1 − 1.5H or higher than Q3 + 1.5H is considered a mild outlier. A value lower than Q1 − 3H or higher than Q3 + 3H is considered an extreme outlier.
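The two steps above can be sketched in Python. The quartile estimate used here (median of the lower and upper halves) is one of several common conventions, so exact boundaries may differ slightly from other tools:

```python
def iqr_outliers(data):
    """Classify values as mild or extreme outliers using the IQR rule."""
    xs = sorted(data)
    n = len(xs)

    def median(v):
        m = len(v) // 2
        return v[m] if len(v) % 2 else (v[m - 1] + v[m]) / 2

    # Quartiles as medians of the lower and upper halves of the data.
    q1 = median(xs[: n // 2])
    q3 = median(xs[(n + 1) // 2:])
    h = q3 - q1  # the interquartile range H

    mild = [x for x in xs
            if q1 - 3*h <= x < q1 - 1.5*h or q3 + 1.5*h < x <= q3 + 3*h]
    extreme = [x for x in xs if x < q1 - 3*h or x > q3 + 3*h]
    return mild, extreme

data = [1.2, 1.5, 1.7, 1.8, 2.0, 2.1, 2.2, 2.3, 2.4, 10.5]
mild, extreme = iqr_outliers(data)  # 10.5 falls beyond Q3 + 3H
```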
Dixon's Test:
Dixon's Q test, developed by W. J. Dixon in 1950, is used to identify outliers in small samples, particularly when n ≤ 30. The test statistic for Dixon's Q test is defined as follows:
Q= Gap / Range
where:
The Gap is the absolute difference between the suspected outlier and its nearest neighbor.
The Range is the difference between the maximum and minimum values in the dataset.
To perform the Dixon's Q test, follow these steps:
1. Arrange the data in ascending order.
2. Identify the suspected outlier. This could be the smallest or largest value in the dataset.
3. Calculate the Q statistic for the suspected outlier using the formula Q = Gap / Range.
4. Compare the calculated Q value to the critical Q value from the Dixon's Q table for the given sample
size and confidence level.
5. Decision rule:
o If the calculated Q value is greater than the critical Q value, the suspected value is considered an
outlier.
o If the calculated Q value is less than or equal to the critical Q value, the suspected value is not
considered an outlier.
Limitations
Dixon's Q test is most appropriate for small sample sizes (n ≤ 30).
It is sensitive to the presence of multiple outliers; thus, it is typically used to detect a single outlier.
The test assumes that the data come from a normally distributed population.
Here’s an example:
Suppose you have the following dataset: [1.2, 1.5, 1.7, 1.8, 2.0, 2.1, 2.2, 2.3, 2.4, 10.5]
1. Arrange the data in ascending order: [1.2, 1.5, 1.7, 1.8, 2.0, 2.1, 2.2, 2.3, 2.4, 10.5]
2. Suspected outlier: 10.5
3. Calculate the Q statistic: Q = (10.5 − 2.4) / (10.5 − 1.2) = 8.1 / 9.3 ≈ 0.871
4. Compare the calculated Q value to the critical Q value from the Dixon's Q table for n=10.
If the critical Q value (at the desired confidence level) is, for instance, 0.568, then since 0.871 > 0.568, 10.5 is
considered an outlier.
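The worked example can be reproduced with a short Python sketch; the function name is illustrative, and the critical value must still come from a Dixon's Q table:

```python
def dixon_q(data):
    """Compute Dixon's Q statistic for the most extreme value (n <= 30)."""
    xs = sorted(data)
    gap_low = xs[1] - xs[0]     # gap if the minimum is the suspect
    gap_high = xs[-1] - xs[-2]  # gap if the maximum is the suspect
    rng = xs[-1] - xs[0]        # range of the dataset
    if gap_high >= gap_low:
        return xs[-1], gap_high / rng
    return xs[0], gap_low / rng

data = [1.2, 1.5, 1.7, 1.8, 2.0, 2.1, 2.2, 2.3, 2.4, 10.5]
suspect, q = dixon_q(data)  # suspect = 10.5, q = 8.1 / 9.3
# Compare q to the tabulated critical value for n = 10 (e.g. 0.568 in
# the example above); q greater than critical means suspect is an outlier.
```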
Distance-Based Methods
Distance-based methods for outlier detection rely on a multi-dimensional index to determine the neighborhood
of each object. The main idea is to check whether the neighborhood of each object contains a sufficient number
of points. If an object does not have enough neighbors within a specified distance, it is considered an outlier.
Here are the key points:
Neighborhood Retrieval: The multi-dimensional index helps efficiently retrieve the neighborhood of
each object.
Sufficient Points Criterion: An object is considered an outlier if its neighborhood lacks sufficient
points.
Scalability: Distance-based methods are generally more scalable to multi-dimensional spaces compared
to statistical methods.
1. Index-Based Algorithms: These algorithms use index structures to search for neighbors of each object within a specified radius. The complexity is O(k·n²), where k is the number of dimensions and n is the number of objects.
2. Cell-Based Algorithms: These algorithms avoid the O(n²) computational complexity by partitioning the data space into cells. This technique is more efficient for memory-resident datasets.
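The neighborhood criterion can be sketched naively in Python. The function name, radius, and neighbor threshold are illustrative; this is the brute-force O(n²) scan that index-based and cell-based algorithms are designed to accelerate:

```python
import math

def distance_outliers(points, radius, min_neighbors):
    """Flag points whose neighborhood of the given radius holds too few points."""
    outliers = []
    for i, p in enumerate(points):
        # Count every other point lying within the radius of p.
        neighbors = sum(
            1 for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= radius
        )
        if neighbors < min_neighbors:
            outliers.append(p)
    return outliers

# Tight cluster near the origin plus one distant point.
pts = [(0, 0), (0.5, 0.2), (0.3, 0.4), (0.1, 0.6), (9, 9)]
print(distance_outliers(pts, radius=1.0, min_neighbors=2))  # [(9, 9)]
```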
Model-Based Techniques
Model-based techniques form the third class of outlier detection methods. These methods simulate the way
humans identify unusual samples within a set of similar samples. The core idea is to define a model that
characterizes the normal behavior of the data, and samples that deviate significantly from this model are
considered outliers.
Characteristics Definition: The model defines the typical characteristics of the dataset.
Outlier Detection: Samples that do not match the defined model characteristics are classified as
outliers.
Sequential Exception Technique: One approach within model-based methods is the sequential
exception technique, which relies on a dissimilarity function to identify outliers based on how different
they are from the expected pattern.