Using Categorical Data with One Hot Encoding
In this
step, you will learn what a "categorical" variable is, as well as the most common approach for handling this
type of data.
Introduction
Categorical data is data that takes only a limited number of values.
For example, if people responded to a survey about what brand of car they owned, the result
would be categorical (because the answers would be things like Honda, Toyota, Ford, None, etc.).
Responses fall into a fixed set of categories.
You will get an error if you try to plug these variables into most machine learning models in Python
without "encoding" them first. Here we'll show the most popular method for encoding categorical
variables.
One hot encoding creates new (binary) columns, indicating the presence of each possible value from the
original data. Let's work through an example.
The values in the original data are Red, Yellow and Green. We create a separate column for each possible
value. Wherever the original value was Red, we put a 1 in the Red column.
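
To make this concrete, here is a minimal sketch of one-hot encoding a toy color column with pandas (the data and variable names are illustrative, not part of the tutorial's dataset):

import pandas as pd

# A toy categorical column with three possible values
colors = pd.Series(['Red', 'Yellow', 'Green', 'Red'])

# One binary column per distinct value; each row gets a 1 (True in
# newer pandas versions) in the column matching its original value
encoded = pd.get_dummies(colors)
print(encoded.columns.tolist())  # ['Green', 'Red', 'Yellow']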
Example
Let's see this in code. We'll skip the basic data set-up code, so you can start at the point where you have
train_predictors, test_predictors DataFrames. This data contains housing characteristics. You will use
them to predict home prices, which are stored in a Series called target.
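
(The set-up itself is not part of this tutorial, but if you want to reproduce that starting point yourself, a rough sketch might look like the following; the file paths and dropped columns are assumptions, not the tutorial's hidden code:)

import pandas as pd

# Hypothetical paths to the House Prices competition files
train_data = pd.read_csv('../input/train.csv')
test_data = pd.read_csv('../input/test.csv')

# Separate the target (sale price) from the predictors
target = train_data.SalePrice
train_predictors = train_data.drop(['Id', 'SalePrice'], axis=1)
test_predictors = test_data.drop(['Id'], axis=1)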
Pandas assigns a data type (called a dtype) to each column or Series. Let's see a random sample of
dtypes from our prediction data:
In [2]:
train_predictors.dtypes.sample(10)
Out[2]:
Heating object
CentralAir object
Foundation object
Condition1 object
YrSold int64
PavedDrive object
RoofMatl object
PoolArea int64
EnclosedPorch int64
KitchenAbvGr int64
dtype: object
Object indicates a column has text (there are other things it could theoretically be, but that's
unimportant for our purposes). It's most common to one-hot encode these "object" columns, since they
can't be plugged directly into most models. Pandas offers a convenient function called get_dummies to
get one-hot encodings. Call it like this:
In [3]:
one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
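
As a quick sanity check (a sketch; the exact indicator column names depend on the values present in the data), you can compare shapes and look at how one text column was expanded:

# The encoded frame keeps numeric columns as-is and replaces each text
# column with a set of binary indicator columns
print(train_predictors.shape)
print(one_hot_encoded_training_predictors.shape)

# e.g. the Heating column becomes indicator columns like Heating_GasA
print([c for c in one_hot_encoded_training_predictors.columns
       if c.startswith('Heating')])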
Alternatively, you could have dropped the categoricals. To see how the approaches compare, we can
calculate the mean absolute error of models built with two alternative sets of predictors:
In [4]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

# Helper to score a predictor set with cross-validated MAE;
# cross_val_score returns negative MAE (sklearn convention),
# so multiply by -1 to get a positive score
def get_mae(X, y):
    return -1 * cross_val_score(RandomForestRegressor(n_estimators=50),
                                X, y,
                                scoring='neg_mean_absolute_error').mean()

predictors_without_categoricals = train_predictors.select_dtypes(exclude=['object'])
mae_without_categoricals = get_mae(predictors_without_categoricals, target)
mae_one_hot_encoded = get_mae(one_hot_encoded_training_predictors, target)

print('Mean Absolute Error when Dropping Categoricals: ' + str(int(mae_without_categoricals)))
print('Mean Absolute Error with One-Hot Encoding: ' + str(int(mae_one_hot_encoded)))

One-hot encoding usually helps, but it varies on a case-by-case basis. In this case, there doesn't appear to
be any meaningful benefit from using the one-hot encoded variables.
Ensure the test data is encoded in the same manner as the training data with the align command:
In [5]:
one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
one_hot_encoded_test_predictors = pd.get_dummies(test_predictors)
final_train, final_test = one_hot_encoded_training_predictors.align(one_hot_encoded_test_predictors,
                                                                    join='left',
                                                                    axis=1)
The align command makes sure the columns show up in the same order in both datasets (it uses column
names to identify which columns line up in each dataset). The argument join='left' specifies that
we will do the equivalent of SQL's left join: if a column ever shows up in one dataset and not the
other, we keep exactly the columns from our training data, and any column missing from the test data
is filled with NaN there. The argument join='inner' would do what SQL databases call an inner join,
keeping only the columns showing up in both datasets. That's also a sensible choice.
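
A quick way to verify the alignment (a minimal sketch) is to check that the column lists match and to count test columns that ended up entirely empty:

# Both frames now have identical columns in identical order
assert list(final_train.columns) == list(final_test.columns)

# Columns that existed only in the training data are all-NaN in the
# aligned test data; count them to gauge the mismatch
print(final_test.isnull().all().sum())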
Conclusion
The world is filled with categorical data. You will be a much more effective data scientist if you know how
to use this data. Here are resources that will be useful as you start doing more sophisticated work with
categorical data.
Pipelines: Deploying models into production-ready systems is a topic unto itself. While one-hot
encoding is still a great approach, your code will need to be built in an especially robust way. Scikit-
learn pipelines are a great tool for this. Scikit-learn offers a class for one-hot encoding
(http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)
and this can be added to a Pipeline. Unfortunately, it doesn't handle text or object values, which
is a common use case.
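
Note that in more recent scikit-learn versions (0.20 and later), OneHotEncoder does accept string/object columns directly. A minimal pipeline sketch under that assumption (the imputation choices here are illustrative, since random forests cannot accept the NaNs present in this data):

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_cols = train_predictors.select_dtypes(include=['object']).columns
numeric_cols = train_predictors.select_dtypes(exclude=['object']).columns

# Impute missing categories, then one-hot encode;
# handle_unknown='ignore' avoids errors on categories unseen in training
categorical_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

preprocessor = ColumnTransformer([
    ('cat', categorical_pipe, categorical_cols),
    ('num', SimpleImputer(), numeric_cols),  # fill numeric NaNs with the mean
])

model = Pipeline([
    ('preprocess', preprocessor),
    ('forest', RandomForestRegressor(n_estimators=50)),
])

model.fit(train_predictors, target)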
Your Turn
Use one-hot encoding to allow categoricals in your course project. Then add some categorical columns
to your X data. If you choose the right variables, your model will improve quite a bit. Once you've done
that, Click Here (https://www.kaggle.com/learn/machine-learning) to return to Learning Machine
Learning where you can continue improving your model.
This kernel has been released under the Apache 2.0 open source license.