4 Automatic Outlier Detection Algorithms in Python

by Jason Brownlee on July 8, 2020 in Data Preparation

The presence of outliers in a classification or regression dataset can result in a poor fit and
lower predictive modeling performance.
Identifying and removing outliers is challenging with simple statistical methods for most
machine learning datasets given the large number of input variables. Instead, automatic
outlier detection methods can be used in the modeling pipeline and compared, just like other
data preparation transforms that may be applied to the dataset.
In this tutorial, you will discover how to use automatic outlier detection and removal to
improve machine learning predictive modeling performance.
After completing this tutorial, you will know:
Automatic outlier detection models provide an alternative to statistical techniques with a larger number of input variables with complex and unknown inter-relationships.
How to correctly apply automatic outlier detection and removal to the training dataset only to avoid data leakage.
How to evaluate and compare predictive modeling pipelines with outliers removed from the training dataset.
Let’s get started.
Model-Based Outlier Detection and Removal in Python

Photo by Zoltán Vörös, some rights reserved.
Tutorial Overview
This tutorial is divided into three parts; they are:
1. Outlier Detection and Removal
2. Dataset and Performance Baseline
    1. House Price Regression Dataset
    2. Baseline Model Performance
3. Automatic Outlier Detection
    1. Isolation Forest
    2. Minimum Covariance Determinant
    3. Local Outlier Factor
    4. One-Class SVM
Outlier Detection and Removal

Outliers are observations in a dataset that don’t fit in some way.
Perhaps the most common or familiar type of outlier is an observation that lies far from the
rest of the observations or from the center of mass of the observations.
This is easy to understand when we have one or two variables and we can visualize the data
as a histogram or scatter plot, although it becomes very challenging when we have many
input variables defining a high-dimensional input feature space.
In this case, simple statistical methods for identifying outliers can break down, such as
methods that use standard deviations or the interquartile range.
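For a single variable, the interquartile range rule mentioned above can be sketched in a few lines. This is only an illustration on made-up values: the array and the conventional 1.5 multiplier are assumptions, not taken from the tutorial's dataset. It shows that the rule works one column at a time, which is why it is hard to extend to many interacting input variables.

```python
import numpy as np

# hypothetical one-dimensional sample with two obvious extreme values
values = np.array([10.0, 11.0, 12.0, 12.5, 13.0, 13.5, 14.0, 15.0, 40.0, -20.0])

# quartiles and interquartile range
q25, q75 = np.percentile(values, [25, 75])
iqr = q75 - q25

# cut-offs placed 1.5 * IQR beyond the quartiles (the factor is a common convention)
lower, upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr

# boolean mask of the observations outside the cut-offs
is_outlier = (values < lower) | (values > upper)
print(values[is_outlier])
```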
It can be important to identify and remove outliers from data when training machine learning
algorithms for predictive modeling.
Outliers can skew statistical measures and data distributions, providing a misleading
representation of the underlying data and relationships. Removing outliers from training data
prior to modeling can result in a better fit of the data and, in turn, more skillful predictions.
Thankfully, there are a variety of automatic model-based methods for identifying outliers in
input data. Importantly, each method approaches the definition of an outlier in a slightly
different way, providing alternate approaches to preparing a training dataset that can be
evaluated and compared, just like any other data preparation step in a modeling pipeline.
Before we dive into automatic outlier detection methods, let’s first select a standard machine
learning dataset that we can use as the basis for our investigation.
Dataset and Performance Baseline

In this section, we will first select a standard machine learning dataset and establish a
baseline in performance on this dataset.
This will provide the context for exploring the outlier identification and removal method of
data preparation in the next section.
House Price Regression Dataset

We will use the house price regression dataset.
This dataset has 13 input variables that describe the properties of the house and suburb and
requires the prediction of the median value of houses in the suburb in thousands of dollars.
You can learn more about the dataset here:
House Price Dataset (housing.csv)

House Price Dataset Description (housing.names)
No need to download the dataset as we will download it automatically as part of our worked
examples.
Open the dataset and review the raw data. The first few rows of data are listed below.
We can see that it is a regression predictive modeling problem with numerical input
variables, each of which has different scales.
0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98,24.00
0.02731,0.00,7.070,0,0.4690,6.4210,78.90,4.9671,2,242.0,17.80,396.90,9.14,21.60
0.02729,0.00,7.070,0,0.4690,7.1850,61.10,4.9671,2,242.0,17.80,392.83,4.03,34.70
0.03237,0.00,2.180,0,0.4580,6.9980,45.80,6.0622,3,222.0,18.70,394.63,2.94,33.40
0.06905,0.00,2.180,0,0.4580,7.1470,54.20,6.0622,3,222.0,18.70,396.90,5.33,36.20
...
The dataset has many numerical input variables that have unknown and complex
relationships. We don’t know that outliers exist in this dataset, although we may guess that
some outliers may be present.
The example below loads the dataset and splits it into the input and output columns, splits it
into train and test datasets, then summarizes the shapes of the data arrays.
# load and summarize the dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = read_csv(url, header=None)
# retrieve the array
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# summarize the shape of the dataset
print(X.shape, y.shape)
# split into train and test sets (the seed value of 1 is an assumption)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize the shape of the train and test sets
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
Running the example, we can see that the dataset was loaded correctly and that there are
506 rows of data with 13 input variables and a single target variable.
The dataset is split into train and test sets with 339 rows used for model training and 167 for
model evaluation.
(506, 13) (506,)
(339, 13) (167, 13) (339,) (167,)
Next, let’s evaluate a model on this dataset and establish a baseline in performance.
Baseline Model Performance

It is a regression predictive modeling problem, meaning that we will be predicting a numeric
value. All input variables are also numeric.
In this case, we will fit a linear regression algorithm and evaluate model performance by
training the model on the training dataset, making predictions on the test data, and evaluating
the predictions using the mean absolute error (MAE).
The complete example of evaluating a linear regression model on the dataset is listed below.
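# evaluate model on the raw dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = read_csv(url, header=None)
# retrieve the array
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# split into train and test sets (the seed value of 1 is an assumption)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)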
Running the example fits and evaluates the model, then reports the MAE.
Your specific results may differ given the stochastic nature of the learning algorithm, the
evaluation procedure, and/or differences in precision across systems. Try running the
example a few times.
In this case, we can see that the model achieved a MAE of about 3.417. This provides a
baseline in performance to which we can compare different outlier identification and removal
procedures.
MAE: 3.417
Next, we can try removing outliers from the training dataset.
Automatic Outlier Detection

The scikit-learn library provides a number of built-in automatic methods for identifying
outliers in data.
In this section, we will review four methods and compare their performance on the house
price dataset.
Each method will be defined, then fit on the training dataset. The fit model will then predict
which examples in the training dataset are outliers and which are not (so-called inliers). The
outliers will then be removed from the training dataset, then the model will be fit on the
remaining examples and evaluated on the entire test dataset.
It would be invalid to fit the outlier detection method on the entire dataset (training and
test rows together) as this would result in data leakage. That is, the model would have access
to data (or information about the data) in the test set not used to train the model. This may
result in an optimistic estimate of model performance.
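The same rule applies when evaluating with cross-validation: the detector is fit on each training fold only, and the held-out fold is left untouched. The sketch below is one possible way to write such a loop by hand. It uses synthetic data from make_regression as a stand-in (an assumption, not the housing dataset), and the fold count and seeds are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# synthetic stand-in data (assumption; not the housing dataset)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=1)

scores = []
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
for train_ix, test_ix in kfold.split(X):
    X_train, y_train = X[train_ix], y[train_ix]
    X_test, y_test = X[test_ix], y[test_ix]
    # fit the detector on the training fold only, so the test fold stays unseen
    iso = IsolationForest(contamination=0.1, random_state=1)
    mask = iso.fit_predict(X_train) != -1
    # fit the model on the inliers, then evaluate on the untouched test fold
    model = LinearRegression().fit(X_train[mask], y_train[mask])
    scores.append(mean_absolute_error(y_test, model.predict(X_test)))

print('MAE: %.3f (%.3f)' % (np.mean(scores), np.std(scores)))
```

Rows are only ever removed from the training portion of each split, so every test row still receives a prediction.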
We could attempt to detect outliers on “new data” such as the test set prior to making a
prediction, but then what do we do if outliers are detected?
One approach might be to return a “None” indicating that the model is unable to make a
prediction on those outlier cases. This might be an interesting extension to explore that may
be appropriate for your project.
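A minimal sketch of that extension is shown here, under stated assumptions: the helper name predict_or_none is hypothetical, the data is synthetic, and Isolation Forest is used only as an example detector. The detector is fit on the training rows and later asked to flag new rows; flagged rows get None instead of a prediction.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression

def predict_or_none(detector, model, X_new):
    """Return one entry per row: a prediction, or None where the row is flagged."""
    flags = detector.predict(X_new)  # 1 for inliers, -1 for outliers
    preds = model.predict(X_new)
    return [float(p) if f == 1 else None for p, f in zip(preds, flags)]

# synthetic stand-in training data (assumption)
rng = np.random.RandomState(1)
X_train = rng.normal(size=(200, 3))
y_train = X_train @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# fit both the detector and the model on the training data only
detector = IsolationForest(contamination=0.05, random_state=1).fit(X_train)
model = LinearRegression().fit(X_train, y_train)

# one ordinary row and one row far away from the training data
X_new = np.array([[0.0, 0.0, 0.0], [25.0, -25.0, 25.0]])
results = predict_or_none(detector, model, X_new)
print(results)
```

Whether a missing prediction is acceptable depends on the application; a fallback model or a manual review step are other options.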
Isolation Forest
Isolation Forest, or iForest for short, is a tree-based anomaly detection algorithm.
It is based on modeling the normal data in such a way as to isolate anomalies that are both
few in number and different in the feature space.
… our proposed method takes advantage of two anomalies’ quantitative
properties: i) they are the minority consisting of fewer instances and ii) they have
attribute-values that are very different from those of normal instances.
— Isolation Forest, 2008.
The scikit-learn library provides an implementation of Isolation Forest in the IsolationForest
class.
Perhaps the most important hyperparameter in the model is the “contamination” argument,
which is used to help estimate the number of outliers in the dataset. This is a value between
0.0 and 0.5. Older scikit-learn releases defaulted to 0.1, while version 0.22 and later default
to 'auto'; here we set it to 0.1 explicitly.
...
# identify outliers in the training dataset
iso = IsolationForest(contamination=0.1)
yhat = iso.fit_predict(X_train)
Once identified, we can remove the outliers from the training dataset.
...
# select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]
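To see how “contamination” drives the number of rows removed, the following sketch fits Isolation Forest with several values and counts the flagged rows. The data is random and only mirrors the shape of the training set (339 rows, 13 columns); the values tried and the seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# synthetic stand-in with the same shape as the training set (assumption)
rng = np.random.RandomState(1)
X_train = rng.normal(size=(339, 13))

counts = {}
for contamination in [0.01, 0.05, 0.1]:
    iso = IsolationForest(contamination=contamination, random_state=1)
    yhat = iso.fit_predict(X_train)
    # rows labelled -1 are the ones that would be removed
    counts[contamination] = int((yhat == -1).sum())
    print(contamination, counts[contamination], round(counts[contamination] / len(X_train), 3))
```

The flagged fraction tracks the requested contamination closely, so a value of 0.1 on 339 rows removes roughly 34 of them regardless of whether the data truly contains that many outliers.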
Tying this together, the complete example of evaluating the linear model on the housing
dataset with outliers identified and removed with isolation forest is listed below.
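# evaluate model performance with outliers removed using isolation forest
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import IsolationForest
from sklearn.metrics import mean_absolute_error
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = read_csv(url, header=None)
data = df.values
X, y = data[:, :-1], data[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)  # seed value assumed
# summarize the shape of the training dataset
print(X_train.shape, y_train.shape)
# identify outliers in the training dataset and select all rows that are not outliers
iso = IsolationForest(contamination=0.1)
yhat = iso.fit_predict(X_train)
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]
# summarize the shape of the updated training dataset
print(X_train.shape, y_train.shape)
# fit the model and evaluate predictions on the test set
model = LinearRegression().fit(X_train, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test))
print('MAE: %.3f' % mae)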
In this case, we can see that the model identified and removed 34 outliers and achieved a
MAE of about 3.189, an improvement over the baseline that achieved a score of about 3.417.
(339, 13) (339,)
(305, 13) (305,)
MAE: 3.189
Minimum Covariance Determinant

If the input variables have a Gaussian distribution, then simple statistical methods can be
used to detect outliers.
For example, if the dataset has two input variables and both are Gaussian, then the feature
space forms a multi-dimensional Gaussian and knowledge of this distribution can be used to
identify values far from the distribution.
This approach can be generalized by defining a hypersphere (ellipsoid) that covers the
normal data, and data that falls outside this shape is considered an outlier. An efficient
implementation of this technique for multivariate data is known as the Minimum Covariance
Determinant, or MCD for short.
The Minimum Covariance Determinant (MCD) method is a highly robust estimator
of multivariate location and scatter, for which a fast algorithm is available. […] It
also serves as a convenient and efficient tool for outlier detection.
— Minimum Covariance Determinant and Extensions, 2017.
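The idea of an ellipsoid around the normal data can be illustrated with scikit-learn's MinCovDet estimator, which exposes the squared Mahalanobis distance of each row to the robust center. This is a sketch on made-up two-variable data with three planted points (all values here are assumptions); rows with the largest distances lie outside the bulk of the Gaussian.

```python
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.RandomState(1)
# correlated two-variable Gaussian sample (made-up data)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=200)
# three planted points that run against the correlation (rows 200, 201, 202)
X = np.vstack([X, [[4.0, -4.0], [5.0, -5.0], [-6.0, 6.0]]])

# robust estimate of location and scatter
mcd = MinCovDet(random_state=1).fit(X)
# squared Mahalanobis distance of each row to the robust center
d2 = mcd.mahalanobis(X)
# indices of the three most distant rows
top3 = np.argsort(d2)[-3:]
print(sorted(int(i) for i in top3))
```

The planted points are not extreme on either variable alone, yet they stand out once the correlation between the variables is taken into account.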
The scikit-learn library provides access to this method via the EllipticEnvelope class.
It provides the “contamination” argument that defines the expected ratio of outliers to be
observed in practice. In this case, we will set it to a value of 0.01, found with a little trial and
error.
...
# identify outliers in the training dataset
ee = EllipticEnvelope(contamination=0.01)
yhat = ee.fit_predict(X_train)
Once identified, the outliers can be removed from the training dataset as we did in the prior
example.
Tying this together, the example of identifying and removing outliers from the
housing dataset using the elliptical envelope (minimum covariance determinant) method is
listed below. Lines elided with "..." are the same loading, splitting, fitting, and evaluation
steps used in the earlier examples.
# evaluate model performance with outliers removed using elliptical envelope
...
from sklearn.covariance import EllipticEnvelope
...
data = df.values
...
X, y = data[:, :-1], data[:, -1]
...
# identify outliers in the training dataset
ee = EllipticEnvelope(contamination=0.01)
yhat = ee.fit_predict(X_train)
...
# fit the model
...
In this case, we can see that the elliptical envelope method identified and removed only 4
outliers, resulting in a drop in MAE from 3.417 with the baseline to 3.388.
(339, 13) (339,)
(335, 13) (335,)
MAE: 3.388
Local Outlier Factor

A simple approach to identifying outliers is to locate those examples that are far from the
other examples in the feature space.
This can work well for feature spaces with low dimensionality (few features), although it can
become less reliable as the number of features is increased, referred to as the curse of
dimensionality.
The local outlier factor, or LOF for short, is a technique that attempts to harness the idea of
nearest neighbors for outlier detection. Each example is assigned a score of how isolated it is,
or how likely it is to be an outlier, based on the size of its local neighborhood. Those examples
with the largest scores are more likely to be outliers.
We introduce a local outlier (LOF) for each object in the dataset, indicating its
degree of outlier-ness.
— LOF: Identifying Density-based Local Outliers, 2000.
The scikit-learn library provides an implementation of this approach in the LocalOutlierFactor
class.
The model provides the “contamination” argument, which sets the expected proportion of
outliers in the dataset. It defaulted to 0.1 in older scikit-learn releases and defaults to
'auto' from version 0.22 onward.
...
# identify outliers in the training dataset
lof = LocalOutlierFactor()
yhat = lof.fit_predict(X_train)
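The per-example score is available after fitting through the negative_outlier_factor_ attribute. The sketch below uses made-up data (a tight cluster plus one isolated point) with n_neighbors and contamination set explicitly; those values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(1)
# a tight cluster of 100 points plus one isolated point at index 100 (made-up data)
X = np.vstack([rng.normal(scale=0.5, size=(100, 2)), [[6.0, 6.0]]])

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
yhat = lof.fit_predict(X)

# scores near -1 indicate inliers; much lower values indicate outliers
scores = lof.negative_outlier_factor_
print(yhat[-1], scores[-1], scores[:-1].mean())
```

Because contamination is fixed at 0.1, about ten rows are labelled -1 here even though only one point is clearly isolated; the raw scores show how far apart those cases really are.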
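Tying this together, the example of identifying and removing outliers from the
housing dataset using the local outlier factor method is listed below. Lines elided with "..."
are the same loading, splitting, fitting, and evaluation steps used in the earlier examples.
# evaluate model performance with outliers removed using local outlier factor
...
from sklearn.neighbors import LocalOutlierFactor
...
data = df.values
...
X, y = data[:, :-1], data[:, -1]
...
# identify outliers in the training dataset
lof = LocalOutlierFactor()
yhat = lof.fit_predict(X_train)
...
# fit the model
...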
In this case, we can see that the local outlier factor method identified and removed 34
outliers, the same number as isolation forest, resulting in a drop in MAE from 3.417 with the
baseline to 3.356. Better, but not as good as isolation forest, suggesting a different set of
outliers were identified and removed.
(339, 13) (339,)
(305, 13) (305,)
MAE: 3.356
One-Class SVM
The support vector machine, or SVM, algorithm developed initially for binary classification
can be used for one-class classification.
When modeling one class, the algorithm captures the density of the majority class and
classifies examples on the extremes of the density function as outliers. This modification of
SVM is referred to as One-Class SVM.
… an algorithm that computes a binary function that is supposed to capture
regions in input space where the probability density lives (its support), that is, a
function such that most of the data will live in the region where the function is
nonzero.
— Estimating the Support of a High-Dimensional Distribution, 2001.
Although SVM is a classification algorithm and One-Class SVM is also a classification
algorithm, it can be used to discover outliers in input data for both regression and
classification datasets.
The scikit-learn library provides an implementation of one-class SVM in the OneClassSVM
class.
The class provides the “nu” argument that specifies the approximate ratio of outliers in the
dataset, which defaults to 0.5 in scikit-learn. In this case, we will set it to 0.01, found with
a little trial and error.
...
# identify outliers in the training dataset
ee = OneClassSVM(nu=0.01)
yhat = ee.fit_predict(X_train)
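As a rough illustration of how “nu” behaves, the sketch below fits a One-Class SVM with several values on made-up Gaussian data and reports the fraction of training rows flagged. The data, the values of nu, and the use of the default RBF kernel are assumptions; the flagged fraction is only approximately governed by nu.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(1)
X_train = rng.normal(size=(300, 4))  # made-up training data

fractions = {}
for nu in [0.01, 0.1, 0.3]:
    svm = OneClassSVM(nu=nu)  # default RBF kernel
    yhat = svm.fit_predict(X_train)
    # fraction of training rows labelled as outliers (-1)
    fractions[nu] = float((yhat == -1).mean())
    print(nu, round(fractions[nu], 3))
```

Larger values of nu flag a larger share of the training rows, which is why a small value such as 0.01 removes only a handful of examples.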
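Tying this together, the example of identifying and removing outliers from the
housing dataset using the one class SVM method is listed below. Lines elided with "..."
are the same loading, splitting, fitting, and evaluation steps used in the earlier examples.
# evaluate model performance with outliers removed using one class SVM
...
from sklearn.svm import OneClassSVM
...
data = df.values
...
X, y = data[:, :-1], data[:, -1]
...
ee = OneClassSVM(nu=0.01)
...
# fit the model
...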
In this case, we can see that only three outliers were identified and removed and the model
achieved a MAE of about 3.431, which is not better than the baseline model that achieved
3.417. Perhaps better performance can be achieved with more tuning.
(339, 13) (339,)
(336, 13) (336,)
MAE: 3.431
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Related Tutorials
One-Class Classification Algorithms for Imbalanced Datasets
How to Remove Outliers for Machine Learning
Papers
Isolation Forest, 2008.
Minimum Covariance Determinant and Extensions, 2017.
LOF: Identifying Density-based Local Outliers, 2000.
Estimating the Support of a High-Dimensional Distribution, 2001.
APIs
Novelty and Outlier Detection, scikit-learn user guide.
sklearn.covariance.EllipticEnvelope API.
sklearn.svm.OneClassSVM API.
sklearn.neighbors.LocalOutlierFactor API.
sklearn.ensemble.IsolationForest API.
Summary
In this tutorial, you discovered how to use automatic outlier detection and removal to
improve machine learning predictive modeling performance.
Specifically, you learned:
Automatic outlier detection models provide an alternative to statistical techniques with a
larger number of input variables with complex and unknown inter-relationships.
How to correctly apply automatic outlier detection and removal to the training dataset
only to avoid data leakage.
How to evaluate and compare predictive modeling pipelines with outliers removed from
the training dataset.
Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.
About Jason Brownlee

Jason Brownlee, PhD is a machine learning specialist who teaches developers how
to get results with modern machine learning methods via hands-on tutorials.
12 Responses to 4 Automatic Outlier Detection Algorithms in Python
Joseph July 8, 2020 at 7:00 pm
Hi Jason, thanks for one more great article!
My question is about outliers in tree based algorithms (RF, XGboost). Does it really change
model outcomes in real life to delete outliers in this case? Findings change over time, that’s
why I’ve this question.

Jason Brownlee July 9, 2020 at 6:39 am
I think trees are pretty robust to outliers. Test for your dataset.

JParzival July 9, 2020 at 10:42 am
Great article!
Thank you for sharing your experience!

Jason Brownlee July 9, 2020 at 1:19 pm
You’re welcome.

Nagdev. A July 10, 2020 at 9:39 am
Two more to the list autoencoders and PCA

Jason Brownlee July 10, 2020 at 1:47 pm
For outlier detection? How so?

Vishal July 10, 2020 at 10:18 pm
Hello sir,
It was a great article. Just one doubt:
MCD technique doesn’t perform well when the data has very large dimensions like >1000. In
that case, it is a good option to feed the model with principal components of the data. The
paper that you mentioned in the link says:
“For large p we can still make a rough estimate of the scatter as follows. First compute the
first q < p robust principal components of the data. For this we can use the MCD-based
ROBPCA method, which requires that the number of components q be set rather low.”
Now the ROBPCA is not available in python. Can you please tell what can be done in this
case?
Thank you

Great tip, thanks.
Perhaps find a different platform that implements the method?
Perhaps implement it yourself?
Perhaps use a different method entirely?

fabou July 10, 2020 at 11:08 pm
Hi Jason,
as usual great educational article.
How could automatic outlier detection be integrated into a cross validation loop? Does it
have to be part of a pipeline which steps would be : outlier detection > outlier removal
(transformer) > modeling?
In this case, should a specific transformer “outlier remover” be created?
Thanks

You would have to run the CV loop manually and apply the method to the data
prior to fitting/evaluating a model or pipeline.
It’s disappointing that sklearn does not support methods in pipelines that add/remove
rows. imbalanced learn can do this kind of thing…

Chayma July 15, 2020 at 1:16 am
Thank you for the great article.
Which algorithm is the most suitable for outlier detection in time series data?

I don’t know off hand, I hope to write about that topic in the future.
© 2020 Machine Learning Mastery Pty. Ltd. All Rights Reserved.
