L7 - Categorical Data - Encoding - Preprocessing - NCU

The document discusses the importance of preprocessing categorical variables in machine learning, highlighting various encoding techniques such as Label Encoding, One Hot Encoding, Dummy Encoding, Effect Encoding, Hash Encoding, Binary Encoding, Base N Encoding, and Target Encoding. Each method is explained with examples and Python implementations, emphasizing their applications, advantages, and drawbacks. The document serves as a comprehensive guide for data scientists on how to effectively encode categorical data to improve model performance.

Uploaded by

23csu241

Introduction

The performance of a machine learning model depends not only on the model
and its hyperparameters but also on how we process and feed different types
of variables to it. Since most machine learning models accept only
numerical variables, preprocessing the categorical variables is a
necessary step. We need to convert these categorical variables to numbers
so that the model can understand them and extract valuable information.
A typical data scientist is often said to spend 70 to 80% of their time cleaning
and preparing the data, and converting categorical data is an unavoidable part
of that work. It not only improves model quality but also helps with feature
engineering. Now the question is, how do we proceed? Which categorical data
encoding method should we use?

In this article, I will be explaining various types of categorical data encoding
methods with implementation in Python.


Table of contents

• Overview
• Introduction
• What is categorical data?
• Label Encoding or Ordinal Encoding
• One Hot Encoding
• Dummy Encoding
  ◦ Drawbacks of One-Hot and Dummy Encoding
• Effect Encoding:
• Hash Encoder
• Binary Encoding
• Base N Encoding
• Target Encoding
• Frequently Asked Questions
• Endnote

What is categorical data?

Since we are going to be working on categorical variables in this article, here
is a quick refresher on the same with a couple of examples. Categorical
variables are usually represented as ‘strings’ or ‘categories’ and are finite in
number. Here are a few examples:

1. The city where a person lives: Delhi, Mumbai, Ahmedabad, Bangalore, etc.
2. The department a person works in: Finance, Human resources, IT,
Production.
3. The highest degree a person has: High school, Diploma, Bachelors,
Masters, PhD.
4. The grades of a student: A+, A, B+, B, B- etc.

In the above examples, the variables have only a fixed set of possible values.
Further, we can see there are two kinds of categorical data:

• Ordinal Data: The categories have an inherent order
• Nominal Data: The categories do not have an inherent order

With ordinal data, the encoding should retain the information about the
order of the categories. In the example above, the highest degree a person
holds gives vital information about their qualification. The degree is an
important feature for deciding whether a person is suitable for a post or not.

While encoding nominal data, we only have to consider the presence or absence
of a category. No notion of order is present. For example, take the
city a person lives in. It is important to retain where a person
lives, but the cities have no order or sequence: living in Delhi ranks neither
above nor below living in Bangalore.

For encoding categorical data, we have a Python package,
category_encoders. The following command installs it.

pip install category_encoders

Label Encoding or Ordinal Encoding

We use this categorical data encoding technique when the categorical feature
is ordinal. In this case, retaining the order is important, so the encoding
should reflect the sequence.

In label encoding, each label is converted into an integer value. We will
create a variable that contains the categories representing the education
qualification of a person.
Python Code:

#Fit and transform train data
df_train_transformed = encoder.fit_transform(train_df)
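
The idea behind ordinal encoding can be sketched in plain Python, without any library. This is only an illustration: the names DEGREE_ORDER and ordinal_encode and the sample values are assumptions made for this sketch, and it is not how category_encoders is implemented.

```python
# Hypothetical ordering of the degrees, lowest to highest (an assumption for this sketch).
DEGREE_ORDER = ["High school", "Diploma", "Bachelors", "Masters", "PhD"]


def ordinal_encode(values, ordered_categories):
    """Map each category to its 1-based rank in ordered_categories."""
    rank = {category: i for i, category in enumerate(ordered_categories, start=1)}
    return [rank[value] for value in values]


# Made-up sample data for illustration only.
degrees = ["High school", "Masters", "Diploma", "Bachelors", "Bachelors",
           "Masters", "PhD", "High school", "High school"]

encoded = ordinal_encode(degrees, DEGREE_ORDER)
print(encoded)  # [1, 4, 2, 3, 3, 4, 5, 1, 1]
```

Because the mapping is built from an explicit ordered list, a higher degree always receives a larger integer, which is the property an ordinal feature needs.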

One Hot Encoding

We use this categorical data encoding technique when the features are
nominal (do not have any order). In one-hot encoding, we create a new variable
for each level of a categorical feature. Each category is mapped to a
binary variable containing either 0 or 1. Here, 0 represents the absence and 1
represents the presence of that category.

These newly created binary features are known as dummy variables. The
number of dummy variables depends on the levels present in the categorical
variable. This might sound complicated. Let us take an example to understand
this better. Suppose we have a dataset with a categorical feature Animal, having
different animals like Dog, Cat, Sheep, Cow, Lion. Now we have to one-hot
encode this data.

After encoding, in the second table, we have dummy variables, each
representing a category in the feature Animal. For each category that is
present, we have 1 in the column of that category and 0 in the others. Let’s
see how to implement one-hot encoding in Python.

import category_encoders as ce
import pandas as pd

data=pd.DataFrame({'City':['Delhi','Mumbai','Hyderabad','Chennai','Bangalore','Delhi','Hyderabad','Bangalore','Delhi']})

#Create object for one-hot encoding
encoder=ce.OneHotEncoder(cols=['City'],handle_unknown='return_nan',return_df=True,use_cat_names=True)

#Original Data
data

#Fit and transform Data
data_encoded = encoder.fit_transform(data)
data_encoded
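
To make the Animal example concrete, here is a minimal plain-Python sketch of what one-hot encoding produces. The function name one_hot_encode and the choice of alphabetical column order are assumptions for illustration; the library may order its columns differently.

```python
def one_hot_encode(values):
    """Return (column names, rows) with one 0/1 column per distinct category."""
    categories = sorted(set(values))  # alphabetical column order (an assumption)
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


animals = ["Dog", "Cat", "Sheep", "Cow", "Lion", "Dog"]
columns, rows = one_hot_encode(animals)

print(columns)  # ['Cat', 'Cow', 'Dog', 'Lion', 'Sheep']
for animal, row in zip(animals, rows):
    print(animal, row)  # exactly one 1 per row, in that animal's column
```

Every row contains a single 1, so five categories need five columns.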

Now let’s move to another very interesting and widely used encoding
technique, i.e. dummy encoding.

Dummy Encoding
The dummy coding scheme is similar to one-hot encoding. This categorical data
encoding method transforms the categorical variable into a set of binary
variables (also known as dummy variables). One-hot encoding uses N binary
variables for N categories in a variable. Dummy encoding is a small
improvement over one-hot encoding: it uses N-1 features to represent
N labels/categories.

To understand this better, consider coding the same data using both one-hot
encoding and dummy encoding. One-hot encoding uses 3 variables to represent
3 categories, whereas dummy encoding uses only 2.

Let us implement it in Python.

import pandas as pd

data=pd.DataFrame({'City':['Delhi','Mumbai','Hyderabad','Chennai','Bangalore','Delhi','Hyderabad']})

#Original Data
data

#encode the data
data_encoded=pd.get_dummies(data=data,drop_first=True)
data_encoded

Here, using the drop_first argument, the first label in sorted order (Bangalore)
is dropped, so it is represented by 0 in all the remaining columns.
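
The effect of dropping the first category can be sketched in plain Python as follows. The name dummy_encode is hypothetical, and the sketch simply assumes that categories are sorted alphabetically before the first one is dropped.

```python
def dummy_encode(values):
    """Encode N categories with N-1 binary columns by dropping the first one."""
    categories = sorted(set(values))
    kept = categories[1:]  # the dropped category becomes the all-zero row
    rows = [[1 if value == category else 0 for category in kept]
            for value in values]
    return kept, rows


cities = ["Delhi", "Mumbai", "Hyderabad", "Chennai", "Bangalore", "Delhi", "Hyderabad"]
columns, rows = dummy_encode(cities)

print(columns)  # ['Chennai', 'Delhi', 'Hyderabad', 'Mumbai']
print(rows[4])  # Bangalore -> [0, 0, 0, 0]
print(rows[0])  # Delhi     -> [0, 1, 0, 0]
```

Five cities are represented with four columns, and no information is lost because the all-zero row can only mean Bangalore.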

Drawbacks of One-Hot and Dummy Encoding

One-hot encoding and dummy encoding are two powerful and effective
encoding schemes. They are also very popular among data scientists, but
may not be as effective when:

1. A large number of levels are present in the data. If a feature variable has
many categories, we need a similar number of dummy variables to encode it.
For example, a column with 30 different values will require 30 new variables
with one-hot encoding (29 with dummy encoding).
2. The dataset has multiple categorical features, e.g. 10 or more categorical
columns. The same situation occurs for each of them, and we end up with
several binary features for every categorical feature and its categories.

In both cases, these two encoding schemes introduce sparsity in
the dataset, i.e. many columns containing mostly 0s and only a few 1s. In
other words, they create many dummy features without adding
much information.

One-hot encoding might also lead to the dummy variable trap, a phenomenon where
the features are highly correlated: using the other dummy variables, we can
easily predict the value of the remaining one. Dropping one column, as dummy
encoding does, avoids this.

Because of the massive increase in the number of columns, these encodings slow
down the learning of the model, can deteriorate its overall performance, and
make the model computationally expensive. Further, these encodings are often
not an optimal choice for tree-based models.
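
The dummy variable trap can be seen directly on a small one-hot table: any one column is fully determined by the others. The sketch below uses a hypothetical helper one_hot and made-up data.

```python
def one_hot(values):
    """One 0/1 column per distinct category, columns in alphabetical order."""
    categories = sorted(set(values))
    return [[1 if value == category else 0 for category in categories]
            for value in values]


rows = one_hot(["Delhi", "Mumbai", "Chennai", "Delhi", "Mumbai"])

# The last column can be rebuilt from the others, because each row sums to 1.
actual_last = [row[-1] for row in rows]
reconstructed_last = [1 - sum(row[:-1]) for row in rows]

print(actual_last == reconstructed_last)  # True: the column is redundant
```

This redundancy is why one column can be dropped without losing information.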

Effect Encoding:

This encoding technique is also known as Deviation Encoding or Sum
Encoding. Effect encoding is very similar to dummy encoding, with one
difference. In dummy coding, we use 0 and 1 to represent the data, but in
effect encoding, we use three values: 1, 0, and -1.

The row containing only 0s in dummy encoding is encoded as -1 in effect
encoding. In the dummy encoding example, the city Bangalore at index
4 was encoded as 0 0 0 0, whereas in effect encoding it is represented by
-1 -1 -1 -1.

Let us see how we implement it in Python:

import category_encoders as ce
import pandas as pd

data=pd.DataFrame({'City':['Delhi','Mumbai','Hyderabad','Chennai','Bangalore','Delhi','Hyderabad']})
encoder=ce.SumEncoder(cols=['City'],verbose=False)

#Original Data
data

#Fit and transform Data
encoder.fit_transform(data)

Effect encoding is an advanced technique. In case you are interested to know
more about effect encoding, refer to this interesting paper.
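
As a rough plain-Python illustration of the rule described above, the sketch below turns the all-zero reference row of dummy encoding into a row of -1s. The name effect_encode and the choice of the alphabetically first category as the reference are assumptions for this sketch; the library may choose a different reference category and column layout.

```python
def effect_encode(values):
    """Dummy encoding in which the reference category is coded as all -1."""
    categories = sorted(set(values))
    reference, kept = categories[0], categories[1:]
    rows = []
    for value in values:
        if value == reference:
            rows.append([-1] * len(kept))
        else:
            rows.append([1 if value == category else 0 for category in kept])
    return kept, rows


cities = ["Delhi", "Mumbai", "Hyderabad", "Chennai", "Bangalore", "Delhi", "Hyderabad"]
columns, rows = effect_encode(cities)

print(rows[4])  # Bangalore -> [-1, -1, -1, -1]
print(rows[1])  # Mumbai    -> [0, 0, 0, 1]
```
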
Hash Encoder

To understand hash encoding, it is necessary to know about hashing. Hashing
is the transformation of an arbitrary-size input into a fixed-size value.
We use hashing algorithms to perform hashing operations, i.e. to generate the
hash value of an input. Further, hashing is a one-way process: in other words,
one cannot recover the original input from the hash representation.

Hashing has several applications, such as data retrieval, checking for data
corruption, and cryptography. Multiple hash functions are available, for
example Message Digest (MD2, MD4, MD5), Secure Hash Algorithm (SHA-0,
SHA-1, SHA-2), and many more.

Just like one-hot encoding, the hash encoder represents categorical features
using new dimensions. Here, the user can fix the number of dimensions
after transformation using the n_components argument. Here is what I mean: a
feature with 5 categories can be represented using N new features; similarly, a
feature with 100 categories can also be transformed using N new features.
Doesn’t this sound amazing?

By default, the hashing encoder uses the md5 hashing algorithm, but a user
can pass another algorithm through the hash_method argument. If you want to
explore the md5 algorithm, I suggest this paper.

import category_encoders as ce
import pandas as pd

#Create the dataframe
data=pd.DataFrame({'Month':['January','April','March','April','February','June','July','June','September']})

#Create object for hash encoder
encoder=ce.HashingEncoder(cols=['Month'],n_components=6)

#Fit and Transform Data
encoder.fit_transform(data)

Since hashing transforms the data into fewer dimensions, it may lead to loss of
information. Another issue faced by the hashing encoder is collision. Since
a large number of categories are mapped into fewer dimensions,
multiple values can be represented by the same hash value; this is known as
a collision.

Nevertheless, hashing encoders have been very successful in some Kaggle
competitions. They are worth trying if the dataset has high-cardinality features.
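
The hashing trick and the collision problem can be sketched with Python's standard hashlib module. This is a simplified illustration, not the library's exact procedure: the name hash_encode and the rule "md5 digest modulo n_components" are assumptions for this sketch.

```python
import hashlib


def hash_encode(values, n_components=6):
    """Put a 1 in the column chosen by hashing each value (simplified sketch)."""
    rows = []
    for value in values:
        digest = hashlib.md5(value.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % n_components  # fixed number of columns
        row = [0] * n_components
        row[bucket] = 1
        rows.append(row)
    return rows


months = ["January", "April", "March", "April", "February",
          "June", "July", "June", "September"]
rows = hash_encode(months, n_components=6)

# Which column each distinct month lands in.
buckets = {month: row.index(1) for month, row in zip(months, rows)}
print(buckets)
print(len(buckets), "distinct months share", len(set(buckets.values())), "columns")
```

With 7 distinct months and only 6 columns, at least two months must share a column, which is exactly the collision described above.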

Binary Encoding

Binary encoding is a combination of hash encoding and one-hot encoding. In
this encoding scheme, the categorical feature is first converted into numbers
using an ordinal encoder. Then the numbers are converted into binary
form. After that, the binary value is split into separate columns, one per digit.

Binary encoding works really well when there are a large number of categories,
for example the cities in a country where a company supplies its products.

#Import the libraries
import category_encoders as ce
import pandas as pd

#Create the Dataframe
data=pd.DataFrame({'City':['Delhi','Mumbai','Hyderabad','Chennai','Bangalore','Delhi','Hyderabad','Mumbai','Agra']})

#Create object for binary encoding (the column name is case-sensitive)
encoder= ce.BinaryEncoder(cols=['City'],return_df=True)

#Original Data
data

#Fit and Transform Data
data_encoded=encoder.fit_transform(data)
data_encoded

Binary encoding is a memory-efficient encoding scheme as it uses fewer
features than one-hot encoding. Further, it reduces the curse of
dimensionality for data with high cardinality.
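
The three steps of binary encoding (ordinal code, binary number, one column per bit) can be sketched in plain Python. The name binary_encode and the numbering of categories from 1 in order of first appearance are assumptions; the library may emit a different number of columns than this minimal width, depending on its version.

```python
def binary_encode(values):
    """Ordinal code -> binary number -> one column per bit."""
    # Step 1: ordinal codes in order of first appearance, starting at 1.
    codes = {}
    for value in values:
        codes.setdefault(value, len(codes) + 1)
    # Step 2: number of bits needed to write the largest code.
    width = max(codes.values()).bit_length()
    # Step 3: split each binary number into separate 0/1 columns.
    return [[int(bit) for bit in format(codes[value], f"0{width}b")]
            for value in values]


cities = ["Delhi", "Mumbai", "Hyderabad", "Chennai", "Bangalore",
          "Delhi", "Hyderabad", "Mumbai", "Agra"]
rows = binary_encode(cities)

print(rows[0])  # Delhi (code 1) -> [0, 0, 1]
print(rows[8])  # Agra  (code 6) -> [1, 1, 0]
```

Six cities fit into three bit columns here, where one-hot encoding would need six.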

Base N Encoding

Before diving into BaseN encoding, let’s first try to understand what “base”
means here.

In a numeral system, the base or radix is the number of digits (or of
digits and letters combined) used to represent numbers. The most
common base we use in our life is 10, the decimal system, where we use 10
unique digits, 0 to 9, to represent all numbers. Another widely used
system is binary, where the base is 2. It uses 2 digits, 0 and 1, to express
all numbers.

For binary encoding, the base is 2, which means it converts the numerical
value of a category into its binary form. If you want to change the
base of the encoding scheme, you may use the Base N encoder. When there
are many categories and binary encoding still produces too many
columns, we can use a larger base such as 4 or 8.

#Import the libraries
import category_encoders as ce
import pandas as pd

#Create the dataframe
data=pd.DataFrame({'City':['Delhi','Mumbai','Hyderabad','Chennai','Bangalore','Delhi','Hyderabad','Mumbai','Agra']})

#Create an object for Base N Encoding (the column name is case-sensitive)
encoder= ce.BaseNEncoder(cols=['City'],return_df=True,base=5)

#Original Data
data

#Fit and Transform Data
data_encoded=encoder.fit_transform(data)
data_encoded

In the above example, I have used base 5, also known as the quinary system.
It is similar to the example of binary encoding. While binary encoding
represents the same data with 4 new features, BaseN encoding uses only 3
new variables.

Hence the BaseN encoding technique further reduces the number of features
required to represent the data, which improves memory usage. The
default base for Base N is 2, which is equivalent to binary encoding.
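
How the base affects the number of columns can be sketched in plain Python. The helper names to_base, digits_needed and base_n_encode are hypothetical, and the sketch uses the minimal number of digits; the library may reserve extra codes, so its column counts can differ from these.

```python
def to_base(number, base, width):
    """Digits of number in the given base, most significant first, zero-padded."""
    digits = []
    for _ in range(width):
        number, remainder = divmod(number, base)
        digits.append(remainder)
    return digits[::-1]


def digits_needed(max_code, base):
    """Smallest number of digits that can represent max_code in this base."""
    width = 1
    while base ** width <= max_code:
        width += 1
    return width


def base_n_encode(values, base):
    """Ordinal code (from 1, in order of first appearance) written in the base."""
    codes = {}
    for value in values:
        codes.setdefault(value, len(codes) + 1)
    width = digits_needed(len(codes), base)
    return [to_base(codes[value], base, width) for value in values]


cities = ["Delhi", "Mumbai", "Hyderabad", "Chennai", "Bangalore",
          "Delhi", "Hyderabad", "Mumbai", "Agra"]

print(base_n_encode(cities, base=5)[8])  # Agra (code 6) in base 5 -> [1, 1]
print(base_n_encode(cities, base=2)[8])  # Agra (code 6) in base 2 -> [1, 1, 0]
print(digits_needed(100, 2), digits_needed(100, 8))  # 7 columns vs 3 columns
```

A larger base needs fewer columns for the same number of categories, at the cost of each column holding more than two distinct values.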

Target Encoding

Target encoding is a Bayesian encoding technique.

Bayesian encoders use information from the dependent/target variable to
encode the categorical data.

In target encoding, we calculate the mean of the target variable for each
category and replace the category with that mean value. In the case of
a categorical target variable, the posterior probability of the target replaces
each category.

#import the libraries
import pandas as pd
import category_encoders as ce

#Create the Dataframe
data=pd.DataFrame({'class':['A','B','C','B','C','A','A','A'],'Marks':[50,30,70,80,45,97,80,68]})

#Create target encoding object
encoder=ce.TargetEncoder(cols=['class'])

#Original Data
data

#Fit and Transform Train Data
encoder.fit_transform(data['class'],data['Marks'])
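
The plain per-category mean described above can be computed by hand as a check. This is a minimal sketch with a hypothetical function name, target_mean_encode; the library's TargetEncoder additionally smooths the means, so its output will generally not match these raw values.

```python
from collections import defaultdict


def target_mean_encode(categories, targets):
    """Replace each category with the mean target value of that category."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        totals[category] += target
        counts[category] += 1
    means = {category: totals[category] / counts[category] for category in totals}
    return [means[category] for category in categories], means


classes = ["A", "B", "C", "B", "C", "A", "A", "A"]
marks = [50, 30, 70, 80, 45, 97, 80, 68]

encoded, means = target_mean_encode(classes, marks)
print(means)    # {'A': 73.75, 'B': 55.0, 'C': 57.5}
print(encoded)  # each class replaced by its mean mark
```
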
We perform target encoding on the training data only and encode the test data
using the results obtained from the training dataset. Although it is a very
efficient coding system, it has the following issues that can deteriorate model
performance:

1. It can lead to target leakage or overfitting. To address overfitting we can
use different techniques.
   1. In leave-one-out encoding, the current row's target value is left out
   when computing the target mean for its category, to avoid leakage.
   2. In another method, we may introduce some Gaussian noise into the
   target statistics. The amount of noise is a hyperparameter of the
   model.
2. The second issue we may face is an uneven distribution of
categories between train and test data. In such a case, the means of rare
categories may take extreme values. Therefore the target mean for each category
is mixed with the marginal (overall) mean of the target.
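
Two of the remedies above, leave-one-out encoding and mixing the category mean with the overall mean, can be sketched in plain Python. The function names and the smoothing weight m are assumptions for illustration; libraries use their own formulas and parameters for smoothing.

```python
def leave_one_out_encode(categories, targets):
    """Category mean computed without the current row's own target."""
    totals, counts = {}, {}
    for category, target in zip(categories, targets):
        totals[category] = totals.get(category, 0.0) + target
        counts[category] = counts.get(category, 0) + 1
    global_mean = sum(targets) / len(targets)
    encoded = []
    for category, target in zip(categories, targets):
        if counts[category] > 1:
            encoded.append((totals[category] - target) / (counts[category] - 1))
        else:
            encoded.append(global_mean)  # fallback for a category seen only once
    return encoded


def smoothed_mean(category_sum, category_count, global_mean, m=2.0):
    """Mix the category mean with the global mean; m is a hypothetical weight."""
    return (category_sum + m * global_mean) / (category_count + m)


classes = ["A", "B", "C", "B", "C", "A", "A", "A"]
marks = [50, 30, 70, 80, 45, 97, 80, 68]

loo = leave_one_out_encode(classes, marks)
print(loo[1])  # first B row: mean of the other B mark -> 80.0

overall = sum(marks) / len(marks)  # 65.0
print(smoothed_mean(110, 2, overall, m=2.0))  # B: raw mean 55.0 pulled to 60.0
```

The fewer rows a category has, the more its smoothed value is pulled toward the overall mean, which limits the extreme values mentioned above.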
