4 Automatic Outlier Detection Algorithms in Python
Identifying and removing outliers is challenging with simple statistical methods for most
machine learning datasets given the large number of input variables. Instead, automatic
outlier detection methods can be used in the modeling pipeline and compared, just like other
data preparation transforms that may be applied to the dataset.
In this tutorial, you will discover how to use automatic outlier detection and removal to
improve machine learning predictive modeling performance.
After completing this tutorial, you will know:
Let’s get started.
Tutorial Overview
This tutorial is divided into three parts; they are:
Perhaps the most common or familiar type of outlier is an observation that is far from the
rest of the observations or the center of mass of observations.
This is easy to understand when we have one or two variables and we can visualize the data
as a histogram or scatter plot, although it becomes very challenging when we have many
input variables defining a high-dimensional input feature space.
In this case, simple statistical methods for identifying outliers can break down, such as
methods that use standard deviations or the interquartile range.
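For context, the interquartile range rule mentioned above can be sketched for a single variable as follows. This is a minimal illustration using only the Python standard library; the function name, the sample values, and the conventional multiplier of 1.5 are assumptions chosen for the example.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the first or third quartile."""
    # quartile cut points, using the default 'exclusive' method of statistics.quantiles
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# hypothetical single-variable sample with one obviously extreme value
sample = [10, 12, 11, 13, 12, 11, 14, 12, 13, 95]
print(iqr_outliers(sample))  # prints [95]
```

The rule looks at one variable at a time, which is why it becomes hard to apply when an outlier is only unusual in combination across many input variables.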
It can be important to identify and remove outliers from data when training machine learning
algorithms for predictive modeling.
Outliers can skew statistical measures and data distributions, providing a misleading
representation of the underlying data and relationships. Removing outliers from training data
prior to modeling can result in a better fit of the data and, in turn, more skillful predictions.
Thankfully, there are a variety of automatic model-based methods for identifying outliers in
input data. Importantly, each method approaches the definition of an outlier in slightly
different ways, providing alternate approaches to preparing a training dataset that can be
evaluated and compared, just like any other data preparation step in a modeling pipeline.
Before we dive into automatic outlier detection methods, let’s first select a standard machine
learning dataset that we can use as the basis for our investigation.
This will provide the context for exploring the outlier identification and removal method of
data preparation in the next section.
This dataset has 13 input variables that describe the properties of the house and suburb and
requires the prediction of the median value of houses in the suburb in thousands of dollars.
No need to download the dataset as we will download it automatically as part of our worked
examples.
Open the dataset and review the raw data. The first few rows of data are listed below.
We can see that it is a regression predictive modeling problem with numerical input
variables, each of which has different scales.
0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98,24.00
0.02731,0.00,7.070,0,0.4690,6.4210,78.90,4.9671,2,242.0,17.80,396.90,9.14,21.60
0.02729,0.00,7.070,0,0.4690,7.1850,61.10,4.9671,2,242.0,17.80,392.83,4.03,34.70
0.03237,0.00,2.180,0,0.4580,6.9980,45.80,6.0622,3,222.0,18.70,394.63,2.94,33.40
0.06905,0.00,2.180,0,0.4580,7.1470,54.20,6.0622,3,222.0,18.70,396.90,5.33,36.20
...
The dataset has many numerical input variables that have unknown and complex
relationships. We don’t know that outliers exist in this dataset, although we may guess that
some outliers may be present.
The example below loads the dataset and splits it into the input and output columns, splits it
into train and test datasets, then summarizes the shapes of the data arrays.
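The splitting step can be sketched as below. To stay self-contained, this sketch uses a random stand-in array with the same shape as the housing data rather than downloading the file. The helper name and the seed value of 1 are assumptions, while the 33 percent test size follows the listings later in this tutorial.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_inputs_and_target(data, test_size=0.33, seed=1):
    """Split a 2D array whose last column is the target into train and test arrays."""
    # all columns but the last are inputs, the last column is the target
    X, y = data[:, :-1], data[:, -1]
    return train_test_split(X, y, test_size=test_size, random_state=seed)

# stand-in array (an assumption): 506 rows, 13 inputs plus 1 target column
rng = np.random.default_rng(1)
data = rng.normal(size=(506, 14))
X_train, X_test, y_train, y_test = split_inputs_and_target(data)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

With the real file, the array passed in would be the values of the loaded CSV, as in the complete listings further down.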
Running the example, we can see that the dataset was loaded correctly and that there are
506 rows of data with 13 input variables and a single target variable.
The dataset is split into train and test sets with 339 rows used for model training and 167 for
model evaluation.
Next, let’s evaluate a model on this dataset and establish a baseline in performance.
In this case, we will fit a linear regression algorithm and evaluate model performance by
training the model on the training dataset, making predictions on the test dataset, and
evaluating the predictions using the mean absolute error (MAE).
The complete example of evaluating a linear regression model on the dataset is listed below.
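A minimal sketch of this baseline procedure is given here on synthetic stand-in data, so the score it prints will not match the housing result reported in this tutorial. The function name, the seed, and the make_regression settings are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def baseline_mae(X, y, seed=1):
    """Fit linear regression on the training split and return the MAE on the test split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=seed)
    model = LinearRegression()
    model.fit(X_train, y_train)      # fit on training rows only
    yhat = model.predict(X_test)     # predict the held-out rows
    return mean_absolute_error(y_test, yhat)

# synthetic regression problem with the same number of rows and inputs (an assumption)
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=1)
print('MAE: %.3f' % baseline_mae(X, y))
```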
Running the example fits and evaluates the model, then reports the MAE.
Your specific results may differ given the stochastic nature of the learning algorithm, the
evaluation procedure, and/or differences in precision across systems. Try running the
example a few times.
In this case, we can see that the model achieved a MAE of about 3.417. This provides a
baseline in performance to which we can compare different outlier identification and removal
procedures.
MAE: 3.417
In this section, we will review four methods and compare their performance on the house
price dataset.
Each method will be defined, then fit on the training dataset. The fit model will then predict
which examples in the training dataset are outliers and which are not (so-called inliers). The
outliers will then be removed from the training dataset, then the model will be fit on the
remaining examples and evaluated on the entire test dataset.
It would be invalid to fit the outlier detection method on the entire dataset as this
would result in data leakage. That is, the model would have access to data (or information
about the data) in the test set not used to train the model. This may result in an optimistic
estimate of model performance.
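The procedure described above can be sketched as one helper that works with any detector offering fit_predict(). The helper name and the synthetic data are assumptions, and Isolation Forest is used here only as an example detector.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def evaluate_with_outlier_removal(detector, X_train, X_test, y_train, y_test):
    """Drop detected outliers from the training rows only, then score on the full test set."""
    # the detector sees training rows only, so nothing about the test set leaks in
    labels = detector.fit_predict(X_train)
    mask = labels != -1  # -1 marks outliers, +1 marks inliers
    model = LinearRegression()
    model.fit(X_train[mask], y_train[mask])
    # the test set is left untouched and used in full
    mae = mean_absolute_error(y_test, model.predict(X_test))
    return mae, int(mask.sum())

# stand-in regression data (an assumption, not the housing dataset)
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.normal(scale=0.5, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
mae, n_kept = evaluate_with_outlier_removal(
    IsolationForest(contamination=0.1, random_state=1), X_train, X_test, y_train, y_test)
print('kept %d of %d training rows, MAE: %.3f' % (n_kept, len(X_train), mae))
```

Passing a different detector object is enough to compare methods under the same split.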
We could attempt to detect outliers on “new data” such as the test set prior to making a
prediction, but then what do we do if outliers are detected?
One approach might be to return a “None” indicating that the model is unable to make a
prediction on those outlier cases. This might be an interesting extension to explore that may
be appropriate for your project.
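One way this extension might look is sketched below: a detector fit on the training data screens new rows, and flagged rows receive None instead of a prediction. The function name, the choice of Isolation Forest, and the toy data are all assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression

def predict_or_none(detector, model, X_new):
    """Return one prediction per row, or None where the fitted detector flags an outlier."""
    labels = detector.predict(X_new)  # +1 for inliers, -1 for outliers
    preds = model.predict(X_new)
    return [float(p) if lab == 1 else None for p, lab in zip(preds, labels)]

# toy training data (an assumption): the target is the sum of three inputs
rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 3))
y_train = X_train.sum(axis=1)
detector = IsolationForest(contamination=0.05, random_state=1).fit(X_train)
model = LinearRegression().fit(X_train, y_train)

X_new = np.array([[0.0, 0.0, 0.0],        # a typical row
                  [25.0, -25.0, 25.0]])   # far outside the training range
print(predict_or_none(detector, model, X_new))
```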
Isolation Forest
Isolation Forest, or iForest for short, is a tree-based anomaly detection algorithm.
It is based on modeling the normal data in such a way as to isolate anomalies that are both
few in number and different in the feature space.
Perhaps the most important hyperparameter in the model is the “contamination” argument,
which is used to help estimate the number of outliers in the dataset. This is a value between
0.0 and 0.5. Older versions of scikit-learn default it to 0.1, while recent versions default
to 'auto'.
...
# identify outliers in the training dataset
iso = IsolationForest(contamination=0.1)
yhat = iso.fit_predict(X_train)
Once identified, we can remove the outliers from the training dataset.
...
# select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]
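To see how the “contamination” argument drives the number of rows removed, the sketch below counts the flagged rows for a few settings. The helper name, the seed, and the random stand-in data are assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def count_flagged(X, contamination, seed=1):
    """Number of rows an isolation forest labels as outliers (-1) at a given contamination."""
    iso = IsolationForest(contamination=contamination, random_state=seed)
    return int((iso.fit_predict(X) == -1).sum())

# stand-in data with the shape of the training split used in this tutorial (an assumption)
rng = np.random.default_rng(1)
X = rng.normal(size=(339, 13))
for c in (0.01, 0.05, 0.1, 0.2):
    print('contamination=%.2f flagged=%d' % (c, count_flagged(X, c)))
```

The count grows roughly in proportion to the contamination value, since the argument sets the score threshold as a percentile of the training scores.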
Tying this together, the complete example of evaluating the linear model on the housing
dataset with outliers identified and removed with isolation forest is listed below.
Running the example fits and evaluates the model, then reports the MAE.
Your specific results may differ given the stochastic nature of the learning algorithm, the
evaluation procedure, and/or differences in precision across systems. Try running the
example a few times.
In this case, we can see that the model identified and removed 34 outliers and achieved an
MAE of about 3.189, an improvement over the baseline that achieved a score of about 3.417.
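Minimum Covariance Determinant
If the input variables have a Gaussian distribution, then simple statistical methods can be
used to detect outliers.
For example, if the dataset has two input variables and both are Gaussian, then the feature
space forms a multi-dimensional Gaussian and knowledge of this distribution can be used to
identify values far from the distribution.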
This approach can be generalized by defining a hypersphere (ellipsoid) that covers the
normal data, and data that falls outside this shape is considered an outlier. An efficient
implementation of this technique for multivariate data is known as the Minimum Covariance
Determinant, or MCD for short.
The scikit-learn library provides access to this method via the EllipticEnvelope class.
It provides the “contamination” argument that defines the expected ratio of outliers to be
observed in practice. In this case, we will set it to a value of 0.01, found with a little trial and
error.
...
# identify outliers in the training dataset
ee = EllipticEnvelope(contamination=0.01)
yhat = ee.fit_predict(X_train)
Once identified, the outliers can be removed from the training dataset as we did in the prior
example.
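The idea of an ellipse covering the normal data can be illustrated on two Gaussian inputs with one planted point placed far from the cloud. The data, the seed, and the planted point are assumptions. The mahalanobis() method returns squared distances under the robust fit.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(1)
# two correlated Gaussian inputs plus one planted point far from the cloud
inliers = rng.multivariate_normal(
    mean=[0.0, 0.0], cov=[[1.0, 0.6], [0.6, 1.0]], size=200)
planted = np.array([[6.0, -6.0]])
X = np.vstack([inliers, planted])

ee = EllipticEnvelope(contamination=0.01, random_state=1)
labels = ee.fit_predict(X)   # -1 marks points outside the fitted ellipse
dist = ee.mahalanobis(X)     # squared Mahalanobis distance under the robust fit
print('planted label:', labels[-1])
print('largest distance is the planted point:', int(np.argmax(dist)) == len(X) - 1)
```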
Tying this together, the complete example of identifying and removing outliers from the
housing dataset using the elliptical envelope (minimum covariance determinant) method is
listed below.
Running the example fits and evaluates the model, then reports the MAE.
Your specific results may differ given the stochastic nature of the learning algorithm, the
evaluation procedure, and/or differences in precision across systems. Try running the
example a few times.
In this case, we can see that the elliptical envelope method identified and removed only 4
outliers, resulting in a drop in MAE from 3.417 with the baseline to 3.388.
Local Outlier Factor
Locating examples that are far from the other examples in the feature space can work well
for feature spaces with low dimensionality (few features), although it can become less
reliable as the number of features is increased, referred to as the curse of dimensionality.
The local outlier factor, or LOF for short, is a technique that attempts to harness the idea of
nearest neighbors for outlier detection. Each example is assigned a score of how isolated it
is, or how likely it is to be an outlier, based on the size of its local neighborhood. Those
examples with the largest scores are more likely to be outliers.
We introduce a local outlier (LOF) for each object in the dataset, indicating its
degree of outlier-ness.
The model provides the “contamination” argument, which sets the expected proportion of
outliers in the dataset. Older versions of scikit-learn default it to 0.1, while recent
versions default to 'auto'.
...
# identify outliers in the training dataset
lof = LocalOutlierFactor()
yhat = lof.fit_predict(X_train)
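The per-example score can be inspected through the fitted model's negative_outlier_factor_ attribute. The sketch below uses a toy cluster with one isolated point, both of which are assumptions, and flips the sign so that larger values mean more outlying.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
cluster = rng.normal(size=(100, 2))   # a dense cloud of points
isolated = np.array([[10.0, 10.0]])   # one point far from the cloud
X = np.vstack([cluster, isolated])

lof = LocalOutlierFactor()
labels = lof.fit_predict(X)
scores = -lof.negative_outlier_factor_  # flip sign so larger means more outlying
print('label of isolated point:', labels[-1])
print('index of largest LOF score:', int(np.argmax(scores)))
```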
Tying this together, the complete example of identifying and removing outliers from the
housing dataset using the local outlier factor method is listed below.
# evaluate model performance with outliers removed using local outlier factor
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import mean_absolute_error
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = read_csv(url, header=None)
# retrieve the array
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# split into train and test sets (the seed value is cut off in this copy; 1 is assumed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize the shape of the training dataset
print(X_train.shape, y_train.shape)
# identify outliers in the training dataset
lof = LocalOutlierFactor()
yhat = lof.fit_predict(X_train)
# select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]
# summarize the shape of the updated training dataset
print(X_train.shape, y_train.shape)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)
Running the example fits and evaluates the model, then reports the MAE.
Your specific results may differ given the stochastic nature of the learning algorithm, the
evaluation procedure, and/or differences in precision across systems. Try running the
example a few times.
In this case, we can see that the local outlier factor method identified and removed 34
outliers, the same number as isolation forest, resulting in a drop in MAE from 3.417 with the
baseline to 3.356. Better, but not as good as isolation forest, suggesting a different set of
outliers were identified and removed.
One-Class SVM
The support vector machine, or SVM, algorithm developed initially for binary classification
can be used for one-class classification.
When modeling one class, the algorithm captures the density of the majority class and
classifies examples on the extremes of the density function as outliers. This modification of
SVM is referred to as One-Class SVM.
The class provides the “nu” argument that specifies the approximate ratio of outliers in the
dataset, which defaults to 0.5 in scikit-learn. In this case, we will set it to 0.01, found with
a little trial and error.
...
# identify outliers in the training dataset
ee = OneClassSVM(nu=0.01)
yhat = ee.fit_predict(X_train)
Tying this together, the complete example of identifying and removing outliers from the
housing dataset using the one class SVM method is listed below.
# evaluate model performance with outliers removed using one class SVM
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.svm import OneClassSVM
from sklearn.metrics import mean_absolute_error
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = read_csv(url, header=None)
# retrieve the array
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# split into train and test sets (the seed value is cut off in this copy; 1 is assumed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize the shape of the training dataset
print(X_train.shape, y_train.shape)
# identify outliers in the training dataset
ee = OneClassSVM(nu=0.01)
yhat = ee.fit_predict(X_train)
# select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]
# summarize the shape of the updated training dataset
print(X_train.shape, y_train.shape)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)
Running the example fits and evaluates the model, then reports the MAE.
Your specific results may differ given the stochastic nature of the learning algorithm, the
evaluation procedure, and/or differences in precision across systems. Try running the
example a few times.
In this case, we can see that only three outliers were identified and removed and the model
achieved a MAE of about 3.431, which is not better than the baseline model that achieved
3.417. Perhaps better performance can be achieved with more tuning.
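One possible way to approach such tuning is sketched below: each candidate “nu” value is scored on a validation split carved out of the training data, so the test set plays no part in the choice. The function name, the candidate values, the 25 percent validation size, and the stand-in data are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.svm import OneClassSVM

def mae_by_nu(X_train, y_train, nu_values, seed=1):
    """Validation MAE of a linear model after One-Class SVM outlier removal, per nu value."""
    # hold out part of the training data so the test set is never used for tuning
    X_fit, X_val, y_fit, y_val = train_test_split(
        X_train, y_train, test_size=0.25, random_state=seed)
    results = {}
    for nu in nu_values:
        mask = OneClassSVM(nu=nu).fit_predict(X_fit) != -1  # keep inliers only
        model = LinearRegression().fit(X_fit[mask], y_fit[mask])
        results[nu] = mean_absolute_error(y_val, model.predict(X_val))
    return results

# stand-in training data (an assumption, not the housing dataset)
rng = np.random.default_rng(1)
X = rng.normal(size=(339, 13))
y = X @ rng.normal(size=13) + rng.normal(scale=0.5, size=339)
for nu, mae in mae_by_nu(X, y, [0.01, 0.05, 0.1, 0.2]).items():
    print('nu=%.2f MAE: %.3f' % (nu, mae))
```

The value with the lowest validation MAE would then be used to refit on the full training set before a single final evaluation on the test set.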
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Related Tutorials
One-Class Classification Algorithms for Imbalanced Datasets
How to Remove Outliers for Machine Learning
Papers
Isolation Forest, 2008.
Minimum Covariance Determinant and Extensions, 2017.
LOF: Identifying Density-based Local Outliers, 2000.
Estimating the Support of a High-Dimensional Distribution, 2001.
APIs
Novelty and Outlier Detection, scikit-learn user guide.
sklearn.covariance.EllipticEnvelope API.
sklearn.svm.OneClassSVM API.
sklearn.neighbors.LocalOutlierFactor API.
sklearn.ensemble.IsolationForest API.
Summary
In this tutorial, you discovered how to use automatic outlier detection and removal to
improve machine learning predictive modeling performance.
Joseph July 8, 2020 at 7:00 pm #
Jason Brownlee July 9, 2020 at 6:39 am #
I think trees are pretty robust to outliers. Test for your dataset.
JParzival July 9, 2020 at 10:42 am #
Great article!
Thank you for sharing your experience!
Jason Brownlee July 9, 2020 at 1:19 pm #
You’re welcome.
Nagdev. A July 10, 2020 at 9:39 am #
Jason Brownlee July 10, 2020 at 1:47 pm #
Vishal July 10, 2020 at 10:18 pm #
Hello sir,
It was a great article. Just one doubt:
MCD technique doesn’t perform well when the data has very large dimensions like >1000. In
that case, it is a good option to feed the model with principal components of the data. The
paper that you mentioned in the link says:
“For large p we can still make a rough estimate of the scatter as follows. First compute the
first q < p robust principal components of the data. For this we can use the MCD-based
ROBPCA method53, which requires that the number of components q be set rather low."
Now the ROBPCA is not available in python. Can you please tell what can be done in this
case?
Thank you
Jason Brownlee July 11, 2020 at 6:13 am #
fabou July 10, 2020 at 11:08 pm #
Hi Jason,
How could automatic outlier detection be integrated into a cross validation loop? Does it
have to be part of a pipeline which steps would be : outlier detection > outlier removal
(transformer) > modeling?
In this case, should a specific transformer “outlier remover” be created?
Thanks
Jason Brownlee July 11, 2020 at 6:16 am #
You would have to run the CV loop manually and apply the method to the data
prior to fitting/evaluating a model or pipeline.
It’s disappointing that sklearn does not support methods in pipelines that add/remove
rows. imbalanced learn can do this kind of thing…
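A minimal sketch of such a manual cross-validation loop follows, with the detector refit inside each fold on that fold's training rows only. The function name, the use of Isolation Forest, the fold count, and the synthetic data are assumptions and are not part of the reply above.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

def cv_mae_with_outlier_removal(X, y, n_splits=5, contamination=0.1, seed=1):
    """Mean MAE over folds, with outliers removed from each training fold only."""
    scores = []
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_ix, test_ix in folds.split(X):
        X_tr, y_tr = X[train_ix], y[train_ix]
        # fit the detector on this fold's training rows and keep the inliers
        detector = IsolationForest(contamination=contamination, random_state=seed)
        mask = detector.fit_predict(X_tr) != -1
        model = LinearRegression().fit(X_tr[mask], y_tr[mask])
        # the held-out fold is scored in full, without removing any rows
        scores.append(mean_absolute_error(y[test_ix], model.predict(X[test_ix])))
    return float(np.mean(scores))

# synthetic regression data (an assumption)
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.normal(scale=0.5, size=300)
print('CV MAE: %.3f' % cv_mae_with_outlier_removal(X, y))
```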
Chayma July 15, 2020 at 1:16 am #
Which algorithm is the most suitable for outlier detection in time series data?
Jason Brownlee July 15, 2020 at 8:27 am #
I don’t know off hand, I hope to write about that topic in the future.
© 2020 Machine Learning Mastery Pty. Ltd. All Rights Reserved.