
Why is the Analysis of Algorithms Important?
Last Updated : 02 Nov, 2023



In this article, we will discuss why algorithms and their analysis are important.
Algorithm analysis generally focuses on CPU (time) usage, memory usage,
disk usage, and network usage. All are important, but the biggest concern is
usually CPU time. Be careful to differentiate between:
 Performance: How much time/memory/disk/etc. is used when a
program is run. This depends on the machine, compiler, etc. as well as
the code we write.
 Complexity: How do the resource requirements of a program or
algorithm scale, i.e. what happens as the size of the problem being solved
by the code gets larger.
Note: Complexity affects performance but not vice-versa.
Algorithm Analysis:
Algorithm analysis is an important part of computational complexity theory,
which provides theoretical estimation for the required resources of an
algorithm to solve a specific computational problem. Analysis of algorithms
is the determination of the amount of time and space resources required to
execute it.
Why Analysis of Algorithms is important?
 To predict the behavior of an algorithm without implementing it on a
specific computer.
 It is much more convenient to have simple measures for the efficiency of
an algorithm than to implement the algorithm and test the efficiency
every time a certain parameter in the underlying computer system
changes.
 It is impossible to predict the exact behavior of an algorithm. There are
too many influencing factors.
 The analysis is thus only an approximation; it is not perfect.
 More importantly, by analyzing different algorithms, we can compare
them to determine the best one for our purpose.
Types of Algorithm Analysis:
1. Best case
2. Worst case
3. Average case
 Best case: The input for which the algorithm takes the least (minimum)
time. In the best case we calculate a lower bound on the running time of the
algorithm. Example: in linear search, the best case occurs when the search
key is present at the first location of the data.
 Worst case: The input for which the algorithm takes the longest (maximum)
time. In the worst case we calculate an upper bound on the running time of the
algorithm. Example: in linear search, the worst case occurs when the search
key is not present at all.
 Average case: In the average case we take all possible inputs, calculate
the computation time for each of them, and then divide the sum by the total
number of inputs.
Average case = (sum of running times over all inputs) / (total number of inputs)
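For instance, for linear search over n elements where the key is equally likely to sit at any position, the average number of comparisons is (1 + 2 + ... + n) / n = (n + 1) / 2, which is O(n). A small sketch (the helper comparisons_to_find is written purely for illustration, it is not part of the original article) confirms this:

def comparisons_to_find(arr, key):
    # count how many elements are inspected before the key is found
    count = 0
    for value in arr:
        count += 1
        if value == key:
            break
    return count

n = 10
arr = list(range(n))
avg = sum(comparisons_to_find(arr, k) for k in arr) / n
print(avg)  # (n + 1) / 2 = 5.5, i.e. O(n) on average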

"The DSA course helped me a lot in clearing the interview rounds. It was
really very helpful in setting a strong foundation for my problem-solving
skills. Really a great investment, the passion Sandeep sir has towards
DSA/teaching is what made the huge difference." - Gaurav | Placed at
Amazon

Before you move on to the world of development, master the


fundamentals of DSA on which every advanced algorithm is built upon.
Choose your preferred language and start learning today:
How to Analyse Loops for Complexity
Analysis of Algorithms
Last Updated : 08 Mar, 2024



We have discussed Asymptotic Analysis, Worst, Average and Best


Cases and Asymptotic Notations in previous posts. In this post, an analysis
of iterative programs with simple examples is discussed.
The analysis of loops for the complexity analysis of algorithms involves
finding the number of operations performed by a loop as a function of the
input size. This is usually done by determining the number of iterations of
the loop and the number of operations performed in each iteration.
Here are the general steps to analyze loops for complexity analysis:
1. Determine the number of iterations of the loop. This is usually done by
analyzing the loop control variables and the loop termination condition.
2. Determine the number of operations performed in each iteration of the loop.
This can include both arithmetic operations and data access operations, such
as array accesses or memory accesses.
3. Express the total number of operations performed by the loop as a function
of the input size. This may involve using mathematical expressions or
finding a closed-form expression for the number of operations performed by
the loop.
4. Determine the order of growth of the expression for the number of
operations performed by the loop. This can be done by using techniques
such as big O notation or by finding the dominant term and ignoring lower-
order terms.

Constant Time Complexity O(1):


The time complexity of a function (or set of statements) is considered as
O(1) if it doesn’t contain a loop, recursion, and call to any other non-
constant time function.
i.e. set of non-recursive and non-loop statements
In computer science, O(1) refers to constant time complexity, which means
that the running time of an algorithm remains constant and does not depend
on the size of the input. This means that the execution time of an O(1)
algorithm will always take the same amount of time regardless of the input
size. An example of an O(1) algorithm is accessing an element in an array
using an index.
Example:
 swap() function has O(1) time complexity.
 A loop or recursion that runs a constant number of times is also
considered O(1). For example, the following loop is O(1).

// Here c is a positive constant
for (int i = 1; i <= c; i++) {
    // some O(1) expressions
}
// This code is contributed by Kshitij

Linear Time Complexity O(n):


The Time Complexity of a loop is considered as O(n) if the loop variables
are incremented/decremented by a constant amount. For example following
functions have O(n) time complexity. Linear time complexity, denoted as
O(n), is a measure of the growth of the running time of an algorithm
proportional to the size of the input. In an O(n) algorithm, the running time
increases linearly with the size of the input. For example, searching for an
element in an unsorted array or iterating through an array and performing a
constant amount of work for each element would be O(n) operations. In
simple words, for an input of size n, the algorithm takes n steps to complete
the operation.

// Here c is a positive integer constant
for (int i = 1; i <= n; i = i + c) {
    // some O(1) expressions
}

for (int i = n; i > 0; i = i - c) {
    // some O(1) expressions
}
// This code is contributed by Kshitij

Quadratic Time Complexity O(n^c):


Quadratic time complexity describes an algorithm whose running time is
directly proportional to the square of the input size; for nested loops, it is
equal to the number of times the innermost statement is executed. For
example, the following sample loops have O(n^2) time complexity.
Quadratic time complexity, denoted as O(n^2), means that for an input of
size n, the algorithm takes on the order of n * n steps to complete the
operation. An example of an O(n^2) algorithm is a nested loop that iterates
over the entire input for each element, performing a constant amount of
work for each iteration. This results in a total of n * n iterations, making the
running time quadratic in the size of the input.

// Here c is any positive constant
for (int i = 1; i <= n; i += c) {
    for (int j = 1; j <= n; j += c) {
        // some O(1) expressions
    }
}

for (int i = n; i > 0; i -= c) {
    for (int j = i + 1; j <= n; j += c) {
        // some O(1) expressions
    }
}

for (int i = n; i > 0; i -= c) {
    for (int j = i - 1; j > 0; j -= c) {
        // some O(1) expressions
    }
}
// This code is contributed by Kshitij

Example: Selection Sort and Insertion Sort have O(n^2) time complexity.

Logarithmic Time Complexity O(Log n):


The time complexity of a loop is considered O(Log n) if the loop variable is
divided or multiplied by a constant amount on each iteration. Similarly, a
recursive function whose input is divided by a constant on each recursive
call also has O(Log n) time complexity.
for (int i = 1; i <= n; i *= c) {
    // some O(1) expressions
}

for (int i = n; i > 0; i /= c) {
    // some O(1) expressions
}
// This code is contributed by Kshitij


// Recursive function
// Here c is a positive integer constant greater than 1
void recurse(int n)
{
    if (n <= 0)
        return;
    else {
        // some O(1) expressions
        recurse(n / c);
    }
}
// This code is contributed by Kshitij

Example: Binary Search (refer to the iterative implementation) has O(Log n) time complexity.

Logarithmic Time Complexity O(Log Log n):


The Time Complexity of a loop is considered as O(LogLogn) if the loop
variables are reduced/increased exponentially by a constant amount.

// Here c is a constant greater than 1
for (int i = 2; i <= n; i = pow(i, c)) {
    // some O(1) expressions
}

// Here fun() is sqrt or cuberoot or any other constant root
for (int i = n; i > 1; i = fun(i)) {
    // some O(1) expressions
}
// This code is contributed by Kshitij

See this for mathematical details.
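To see where O(Log Log n) comes from in the first loop: starting at i = 2 and updating i = i^c, after k iterations i = 2^(c^k); the loop stops when 2^(c^k) exceeds n, i.e. when k is roughly log_c(log_2 n). A small counting sketch (with c = 2 assumed purely for illustration) shows the iteration count tracking log log n:

import math

def count_iterations(n, c=2):
    # count loop iterations for: i = 2; while i <= n: i = i ** c
    i, k = 2, 0
    while i <= n:
        i = i ** c
        k += 1
    return k

for n in (10**3, 10**6, 10**12):
    print(n, count_iterations(n), round(math.log2(math.log2(n)), 2))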

How to combine the time complexities of consecutive

loops?
When there are consecutive loops, we calculate time complexity as the sum of
the time complexities of the individual loops.
To combine time complexities, consider the number of iterations performed by
each loop and the amount of work performed in each iteration: loops that run
one after another add up, while nested loops multiply, and the dominant term
determines the overall order of growth.
For example, consider the following code:
for i in range(n):
    for j in range(m):
        # some constant time operation
Here, the outer loop performs n iterations, and the inner loop performs m
iterations for each iteration of the outer loop. So, the total number of
iterations performed by the inner loop is n * m, and the total time
complexity is O(n * m).
In another example, consider the following code:
for i in range(n):
    for j in range(i):
        # some constant time operation
Here, the outer loop performs n iterations, and the inner loop performs i
iterations for each iteration of the outer loop, where i is the current iteration
count of the outer loop. The total number of iterations performed by the
inner loop can be calculated by summing the number of iterations performed
in each iteration of the outer loop, which is given by the formula sum(i)
from i=1 to n, which is equal to n * (n + 1) / 2. Hence, the total time
complexity is O(n * (n + 1) / 2), which simplifies to O(n^2).

// Here c is any positive constant
for (int i = 1; i <= m; i += c) {
    // some O(1) expressions
}

for (int i = 1; i <= n; i += c) {
    // some O(1) expressions
}

// Time complexity of above code is O(m) + O(n) which is O(m + n)
// If m == n, the time complexity becomes O(2n) which is O(n).
// This code is contributed by Kshitij

How to calculate time complexity when there are

many if, else statements inside loops?


As discussed here, the worst-case time complexity is the most useful among
best, average and worst. Therefore we need to consider the worst case. We
evaluate the situation when values in if-else conditions cause a maximum
number of statements to be executed.
For example, consider the linear search function where we consider the case
when an element is present at the end or not present at all.
When the code is too complex to consider all if-else cases, we can get an
upper bound by ignoring if-else and other complex control statements.

How to calculate the time complexity of recursive

functions?
The time complexity of a recursive function can be written as a
mathematical recurrence relation. To calculate time complexity, we must
know how to solve recurrences. We will soon be discussing recurrence-
solving techniques as a separate post.
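For example, binary search gives the recurrence T(n) = T(n/2) + O(1), which solves to O(Log n), while merge sort gives T(n) = 2T(n/2) + O(n), which solves to O(n Log n).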
Naive Bayes Algorithms: A Complete Guide
for Beginners

Parth Shukla 21 Mar, 2024 • 11 min read

Introduction

Machine learning algorithms are an essential component of training and building an
intelligent model for a given problem statement. Many machine learning algorithms are
used across use cases due to their fast and accurate results. The Naive Bayes classifier
is also one of the best machine learning algorithms, producing a precise model with
little effort.

In this article, we will discuss the naive Bayes algorithms with their core intuition,
working mechanism, mathematical formulas, pros, cons, and other important aspects
related to the same. Also, the key takeaways discussed at the end will help one answer
interview questions related to the Naive Bayes classifier efficiently.

As the algorithm works entirely on the concepts of probability, conditional probability,
and the Bayes rule, we can start learning the Naive Bayes classifier by revising the
concepts of probability and conditional probability.

This article was published as a part of the Data Science Blogathon.

What is Probability?
To understand the Naive Bayes Classifier from scratch, it is required to understand the
term probability, as the algorithm itself works on the concept of probabilities of events. Let
us try to understand the same.

Probability is the term used in mathematics for the "chance of something taking place".
In simple words, probability is the chance of some event occurring.

We know that the sum of all probabilities is always one. For example, if we toss a coin,
the probability of heads is 0.5 and the probability of tails is also 0.5, which means
there is an equal, 50% chance of heads or tails on any trial.

What is Conditional Probability?

Now that we know the meaning of probability, the next term to understand is conditional
probability. Conditional probability is defined as the probability of some event happening
with respect to another event. In simple words, conditional probability is the probability
of something occurring when a condition is involved.

The formula for conditional probability is:

P(A|B) = P(A ∩ B) / P(B)
P(A ∩ B) = Probability of events A and B both happening

P(A)= Probability of event A to occur.

P(B)= Probability of event B to occur.

P(A|B)= Probability of event A happening when event B occurs.

P(B|A)= Probability of event B happening when event A occurs.


Bayes Rule

Now that we know these two critical terms, we are prepared to learn the Bayes rule.
Thomas Bayes, a British mathematician, gave the Bayes theorem in 1763, which helps
calculate the probability of an event taking place under given conditions.

The formula for Bayes' Rule is:

P(A|B) = P(B|A) * P(A) / P(B)

As we can see above, the formula comprises a total of 4 terms. Let us try to
understand them one by one.

P(B|A)= Probability of event B to happen when event A occurs.

P(A|B)= Probability of event A to happen when event B occurs.

P(A)= Probability of event A to occur.

P(B)= Probability of event B to occur.

From the above formula, we can easily calculate the conditional probability of an event
as long as we know the individual probabilities of the events and the reverse conditional
probability.
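As a quick numeric sketch (the numbers below are invented purely for illustration), suppose P(A) = 0.01, P(B|A) = 0.95, and P(B|not A) = 0.05; Bayes' rule then gives P(A|B) of roughly 0.16:

p_A = 0.01                 # P(A): prior probability of event A
p_B_given_A = 0.95         # P(B|A): probability of B when A occurs
p_B_given_not_A = 0.05     # P(B|not A): probability of B when A does not occur

# total probability of B
p_B = p_B_given_A * p_A + p_B_given_not_A * (1 - p_A)

# Bayes rule: P(A|B) = P(B|A) * P(A) / P(B)
p_A_given_B = (p_B_given_A * p_A) / p_B
print(round(p_A_given_B, 3))   # about 0.161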
What is Naive Bayes Algorithm?

Now is the best time to understand the naive Bayes algorithm, as the core fundamentals
are clear. In real problems, there can be many events and many conditions that happen
simultaneously with those events. In this case, we expand the Bayes theorem to solve
this type of issue. If the features are independent, we can quickly extend the theorem
and calculate the probability of the same.

The same bayesian theorem formula can be used here for multiple events and conditions,
and one can easily calculate the probability with the help of the same.

The algorithm is one of the most useful algorithms in machine learning which helps in
several classification problems, sentiment analysis, face recognition, etc.

How Naive Bayes Works?

After understanding the Naive Bayes algorithm, let us try to understand the working
mechanismof the algorithm.

Let us take an example.

Let's suppose we have a dataset of golf matches. The problem statement is a
classification problem where we have to predict whether a golf match will be played or
not, given some conditions of temperature, rain, weather, etc.
As we can see in the dataset, the outlook, temperature, humidity, and wind are
independent features, and play golf is a categorical target column. When we feed this
data to the algorithm, the algorithm will calculate the normal and conditional probabilities
of all the events occurring under all the possible conditions. Once the model is trained,
it is ready to predict on unseen data.

Suppose we try to predict whether a golf match will be played, given some conditions of
outlook, humidity, and temperature. In that case, the model will take the data as input and
calculate the probability of Yes and No with respect to all the conditions provided. If the
likelihood of Yes is higher than No, then the model will return Yes as the output and vice versa.
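A minimal sketch of that comparison, with class priors and conditional probabilities invented purely for illustration (they are not taken from any specific golf dataset), might look like this:

# class priors: P(Play = Yes) and P(Play = No), assumed counts 9/14 and 5/14
p_yes, p_no = 9 / 14, 5 / 14

# assumed conditional probabilities for the observed conditions
p_sunny_given_yes, p_sunny_given_no = 2 / 9, 3 / 5    # P(Outlook = Sunny | class)
p_high_given_yes, p_high_given_no = 3 / 9, 4 / 5      # P(Humidity = High | class)

# naive Bayes score: prior times the product of the conditional probabilities
score_yes = p_yes * p_sunny_given_yes * p_high_given_yes
score_no = p_no * p_sunny_given_no * p_high_given_no
print("Play" if score_yes > score_no else "Don't play")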

What is Multicollinearity?

Multicollinearity in machine learning is a term that deals with the linear relationships
between the features of the data. In simple words, a dataset having correlations between
its independent features is said to be multicollinear.

To understand the concept better, let us take an example.

Suppose we have a dataset with three columns: age, marks, and passed. Here age is the
age of the students, marks are the scores obtained by students in exams, and passed is a
categorical column that indicates whether a student passed or not.

Now here, age and marks are the training columns, meaning these columns should be fed
to the algorithm, and the passed column should be the target column that the machine
learning algorithm will predict. In some cases, the age and marks columns are somehow
correlated and are not independent anymore; we then say that the data has
Multicollinearity in its features.
The professor checking the answer sheets might be biased toward younger students and
mark them with better scores. Both columns are then correlated, and Multicollinearity is
present in this dataset.

How to Check Multicollinearity?

Since one of the basic assumptions of the naive Bayes algorithm is related to
Multicollinearity, it is required to check whether the data has Multicollinearity.

To check the same, we can use the following code:

import pandas as pd
df = pd.read_csv("data.csv")
df.corr()

The code above computes the Pearson correlation between all pairs of numeric columns;
we can check the relation of each independent column with the other independent columns
to check for Multicollinearity.

Why is it Naive?

Now a question might appear in your mind: Why is the algorithm called naive?

The main reason behind the name of the Naive Bayes classifier is the assumption it makes
while working on a dataset, which relates to Multicollinearity.

The Naive Bayes classifier assumes that the features of the dataset provided to the
algorithm are independent of each other and do not depend on other factors, which is why
the Naive Bayes algorithm is called Naive.

Types of Naive Bayes

There are mainly three types of naive Bayes algorithms. Different types of naive
Bayes are used for different use cases. Let us try to understand them one by one.
1. Bernoulli Naive Bayes

This Naive Bayes classifier is used when there is a boolean type of dependent or target
variable present in the dataset. For example, a dataset whose target column has the
categories Yes and No.

This type of naive Bayes is mainly used with a binary categorical target column where the
problem statement is to predict only Yes or No. For example, sentiment analysis with
Positive and Negative categories, whether a specific word is present in the text or not, etc.

Code Example:

from sklearn.datasets import make_classification
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split

nb_samples = 100
X, Y = make_classification(n_samples=nb_samples, n_features=2, n_informative=2, n_redundant=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25)
bnb = BernoulliNB(binarize=0.0)
bnb.fit(X_train, Y_train)
bnb.score(X_test, Y_test)
2. Multinomial Naive Bayes

This type of naive Bayes is used where the data is multinomially distributed. It is
mainly used when there is a text classification problem.

For Example, if you want to predict whether a text belongs to which tag, education,
politics, e-tech, or some other tag, you can use the multinomial Naive Bayes Classifier to
classify the same.

This naive Bayes variant outperforms on text classification problems and is used the
most out of all the Naive Bayes classifiers.

Code Example:

from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB
import numpy as np

data = [
    {'parth1': 100, 'parth2': 50, 'parth3': 25, 'parth4': 100, 'parth5': 20},
    {'parth1': 5, 'parth2': 5, 'parth3': 0, 'parth4': 10, 'parth5': 500, 'parth6': 1}
]
dv = DictVectorizer(sparse=False)
X = dv.fit_transform(data)
Y = np.array([1, 0])
mnb = MultinomialNB()
mnb.fit(X, Y)
test_data = [
    {'parth1': 80, 'parth2': 20, 'parth3': 15, 'parth4': 70, 'parth5': 10, 'parth6': 1},
    {'parth1': 10, 'parth2': 5, 'parth3': 1, 'parth4': 8, 'parth5': 300, 'parth6': 0}
]
mnb.predict(dv.transform(test_data))
3. Gaussian Naive Bayes

This type of naive Bayes is used when the predictor variables have continuous values
instead of discrete ones. Here it is assumed that the data follows a Gaussian distribution.

Code Example:

from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

nb_samples = 100
X, Y = make_classification(n_samples=nb_samples, n_features=2, n_informative=2, n_redundant=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25)
gnb = GaussianNB()
gnb.fit(X_train, Y_train)
gnb.score(X_test, Y_test)

Applications of Naive Bayes

1. Text Classification

The naive Bayes algorithms are known to perform best on text classification problems. The
algorithm is mainly used when there is a problem statement related to text and its
classification. Several naive Bayes variants are tried and tuned according to the problem
statement and used for a more accurate model. For example: classifying tags from text, etc.
2. Sentiment Analysis

Algorithms like Bernoulli naive Bayes are used most for these sentiment analysis problems.
This algorithm is known to perform well on binary classification problems and is hence
used most for such cases.

3. Recommendation Systems

There are two main types of recommendation systems: content-based and collaborative
filtering. Naive Bayes with collaborative-filtering-based models is known for good
accuracy on recommendation problems. The naive Bayes algorithms help achieve better
accuracy for recommending items to users based on their interests and on other users'
interests.

4. Real-Time Predictions

The Naive Bayes algorithms are eager learning algorithms that learn from the training
data and estimate some parameters. Whenever test data is provided for prediction, the
algorithm calculates the results according to the knowledge gained from training and
offers fast and accurate results. Hence it can be used for real-time predictions.
Advantages and Disadvantages of Naive Bayes

Advantages

1. Faster Algorithms:

The Naive Bayes algorithm is a parametric algorithm that estimates certain quantities
while training and uses that knowledge for prediction. Hence it takes significantly less
time for prediction and is a faster algorithm.

2. Less Training Data:

The naive Bayes algorithm assumes the independent features to be independent of each
other, and if that holds, naive Bayes needs less data for training and performs well.

3. Performance:

The Naive Bayes algorithm achieves fast and accurate performance with less data, and its
handling of categorical and text data is better than that of many other algorithms.

Disadvantages

1. Independent Features:

In a real-time dataset, obtaining independent features that are entirely independent of each
other is almost impossible. There are typically two to three features that correlate with
each other, thus not fully satisfying the assumption at all times.

2. Zero Frequency Error:

The zero frequency error in naive Bayes is one of the most critical cons of the Naive
Bayes algorithm. According to this error, if a category appears in the test data but was
absent from the training data, then the Naive Bayes algorithm will assign it zero
probability, resulting in what is known as the zero frequency error.

To address this kind of issue, we can use Laplace smoothing techniques.
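As a rough sketch, Laplace (add-one) smoothing replaces a raw count-based probability with (count + 1) / (total + number of categories), so an unseen category never gets exactly zero probability; in scikit-learn this corresponds to the alpha parameter of the naive Bayes estimators:

# add-one smoothing for an unseen category (the numbers are assumed for illustration)
count, total, n_categories = 0, 100, 50
print((count + 1) / (total + n_categories))   # small but non-zero probability

# scikit-learn exposes the same idea through alpha; alpha=1.0 is Laplace smoothing
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB(alpha=1.0)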

When to Use Naive Bayes?

The Naive Bayes algorithm is a fast and well-performing algorithm compared to many other
algorithms. Still, there are cases where it cannot perform well, and a different algorithm
should be used to handle such cases.

The Naive Bayes algorithm can be used if there is no multicollinearity in the independent
features and if the features’ probabilities provide some valuable information to the
algorithms.

This algorithm should also be preferred for text classification problems. One should avoid
using the Naive Bayes algorithm when the data is entirely numeric and multicollinearity is
present in the dataset.

If it is necessary to use the Naive Bayes algorithm, then one can use the following steps to
improve the performance of Naive Bayes algorithms.

How to Improve Naive Bayes?

1. Remove Correlated Features:

Naive Bayes algorithms perform well on datasets with no correlations between independent
features. Removing the correlated features may improve the performance of the algorithm.

2. Feature Engineering:

Try to apply feature engineering to the dataset and its features: combine some of the
features and extract new ones out of the existing ones. This may help the Naive Bayes
algorithm learn the data quickly and result in an accurate model.

3. Use Some Domain Knowledge:


One should always try to apply some domain knowledge to the dataset and its features and
take steps accordingly. It may help the algorithm make decisions faster and achieve
higher accuracy.

4. Probabilistic Features:

The Naive Bayes algorithm works on the concept of probabilities, so try to improve the
features whose probabilities carry more information for the algorithm, implement them,
and evaluate them iteratively to find which features work best for the algorithm.

5. Laplace Smoothing:

In some cases, a category may be present in the test dataset that was not present during
training, and the model will assign it zero probability. We should handle this issue with
Laplace smoothing.

6. Feature Transformation:

It is always better to have normal distributions in the dataset, so try to apply Box-Cox
and Yeo-Johnson feature transformation techniques to achieve normal distributions in the
dataset.
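A minimal sketch of such a transformation, assuming scikit-learn's PowerTransformer and a made-up skewed feature, could look like this:

import numpy as np
from sklearn.preprocessing import PowerTransformer

# a skewed toy feature, invented for illustration
X = np.random.exponential(scale=2.0, size=(100, 1))

# 'yeo-johnson' also handles zero/negative values; use 'box-cox' for strictly positive data
pt = PowerTransformer(method='yeo-johnson')
X_transformed = pt.fit_transform(X)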

Conclusion

In this article, we discussed the naive Bayes algorithm: probabilities, conditional
probabilities, the Bayes theorem, the core intuition and working mechanism of the
algorithm, its types, code examples, applications, pros and cons, and the key takeaways.
This knowledge will help one understand the Naive Bayes algorithm from scratch to an
in-depth level and answer related interview questions efficiently.

Key Takeaways

 Naive Bayes algorithm is a type of algorithm that works on the concept of


conditional probability and the bayesian theorem.
 The algorithm assumes that the features are independent of each
other; hence, it earns the name "Naive."

 The algorithm is an eager learning algorithm that learns during the training
phase and produces results quickly during the testing phase.

 Zero frequency error in a Naive Bayes algorithm is where the model assigns
zero probability to the unseen categories during the prediction phase.

 When there is a boolean type of target variable with two categories, the
Bernoulli Naive Bayes algorithm is used.

 The Multinomial Naive Bayes algorithm allows for text classification in


scenarios where multiple categories exist.

 For real-time datasets, it is impossible to have zero Multicollinearity; hence,


sometimes naive Bayes algorithms underperform in high Multicollinearity.

 One can use Box-Cox and Yeo-Johnson transforms to achieve the normal
distribution of the dataset columns.


A Comprehensive Guide to Time Series Analysis and Forecasting

Sukanya Bag 01 Dec, 2022 • 16 min read
This article was published as a part of the Data Science Blogathon .


Introduction

Time Series Analysis and Forecasting is a very pronounced and powerful


study in data science, data analytics and Artificial Intelligence. It helps us
to analyse and forecast or compute the probability of an incident, based on
data stored with respect to changing time. For example, suppose you have visited a
clinic due to some chest pain and want to get an electrocardiogram (ECG) test done
to see if your heart is functioning healthily. The ECG graph produced is time-series
data where your Heart Rate Variability (HRV) is plotted with respect to time; by
analysing it, the doctor can suggest crucial measures to take care of your heart and
reduce the risk of stroke or heart attack. Time series is used widely in healthcare
analytics, geospatial analysis, weather forecasting, and wherever we need to forecast
the future of data that changes continuously with time!

What is Time Series Analysis in Machine

Learning?

Time-series analysis is the process of extracting useful information from


time-series data to forecast and gain insights from it. It consists of a series
of data that varies with time, hence continuous and non-static in nature. It
may vary from hours to minutes and even seconds (milliseconds to
microseconds). Due to its non-static and continuous nature, working with
time-series data is indeed difficult even today!

As time-series data consists of a series of observations taken in


sequences of time, it is entirely non-static in nature.

Time Series – Analysis Vs. Forecasting


Time series data analysis is the scientific extraction of useful information
from time-series data to gather insights from it. It consists of a series of
data that varies with time. It is non-static in nature. Likewise, it may vary
from hours to minutes and even seconds (milliseconds to microseconds).
Due to its continuous and non-static nature, working with time-series data
is challenging!

As time-series data consists of a series of observations taken in


sequences of time, it is entirely non-static in nature.

Time Series Analysis and Time Series Forecasting are two studies that, most of the
time, are used interchangeably, although there is a very thin line between the two.
Which name applies depends on whether you are analysing and summarizing reports from
existing time-series data or predicting future trends from it.

Thus, it’s a descriptive Vs. predictive strategy based on your time-series


problem statement.

In a nutshell, time series analysis is the study of patterns and trends in a


time-series data frame by descriptive and inferential statistical methods.
Whereas, time series forecasting involves forecasting and extrapolating
future trends or values based on old data points (supervised time-series
forecasting), clustering them into groups, and predicting future patterns
(unsupervised time-series forecasting).

The Time Series Integrants

Any time-series problem or data can be broken down or decomposed into


several integrants, which can be useful for performing analysis and
forecasting. Transforming time series into a series of integrants is called
Time Series Decomposition.

A quick thing worth mentioning is that the integrants are broken further into
2 types-
1. Systematic — components that can be used for predictive modelling
and occur recurrently. Level, Trend, and Seasonality come under this
category.

2. Non-systematic — components that cannot be used for predictive


modelling directly. Noise comes under this category.

The original time series data is hence split or decomposed into 5 parts-

1. Level — The most common integrant in every time series data is the
level. It is nothing but the mean or average value in the time series. It has
0 variances when plotted against itself.

2. Trend — The linear movement or drift of the time series which may be
increasing, decreasing or neutral. Trends are observable over
positive(increasing) and negative(decreasing) and even linear slopes over
the entire range of time.

3. Seasonality — Seasonality is something that repeats over a lapse of


time, say a year. An easy way to get an idea about seasonality- seasons,
like summer, winter, spring, and monsoon, which come and go in cycles
throughout a specified period of time. However, in terms of data science,
seasonality is the integrant that repeats at a similar frequency.

Note — If seasonality doesn't occur at the same frequency, we call it a cycle. A cycle
does not have any predefined and fixed signal, and its frequency is very uncertain in
terms of probability. It may sometimes be random, which poses a great challenge in
forecasting.

4. Noise — An irregularity or noise is a randomly occurring integrant. It appears when
the features are not correlated with each other and, most importantly, when the variance
is similar across the series. Noise can lead to dirty and messy data and hinder
forecasting, hence noise removal, or at least reduction, is a very important part of the
time-series data pre-processing stage.

5. Cyclicity — A particular time-series pattern that repeats itself after a


large gap or interval of time, like months, years, or even decades.
The Time Series Forecasting Applications

Time series analysis and forecasting are done on automating a variety of


tasks, such as-

1. Weather Forecasting

2. Anomaly Forecasting

3. Sales Forecasting

4. Stock Market Analysis

5. ECG Analysis

6. Risk Analysis

and many more!

Time Series Components Combinatorics

A time-series model can be represented by 2 methodologies-

The Additive Methodology —

When the time series trend is a linear relationship between integrants, i.e.,
the frequency (width) and amplitude(height) of the series are the same,
the additive rule is applied.

Additive methodology is used when we have a time series where seasonal


variation is linear or constant over timestamps.

It can be represented as follows-

y(t) or x(t) = level + trend + seasonality + noise

where the model y(multivariate) or x(univariate) is a function of time t.

The Multiplicative Methodology —


When the time series is not a linear relationship between integrants, then
modelling is done following the multiplicative rule.

The multiplicative methodology is used when we have a time series where


seasonal variation increases with time — which may be exponential or
quadratic.

It is represented as-

y(t) or x(t)= Level * Trend * Seasonality * Noise
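A minimal sketch of such a decomposition into integrants, assuming statsmodels' seasonal_decompose and a hypothetical monthly 'AirPassengers.csv' file, could look like this:

import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# load a monthly series; the file name and column layout are assumptions for illustration
series = pd.read_csv('AirPassengers.csv', index_col=0, parse_dates=True).squeeze()

# model='additive' for constant seasonal variation, 'multiplicative' when it grows with time;
# period=12 means 12 observations per seasonal cycle for monthly data
result = seasonal_decompose(series, model='multiplicative', period=12)
result.plot()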

Deep-Dive into Supervised Time-Series

Forecasting

Supervised learning is the most used domain-specific machine learning,


and hence we will focus on supervised time series forecasting.

This will contain various detailed topics to ensure that readers at the end
will know how to-

1. Load time series data and use descriptive statistics to explore it

2. Scale and normalize time series data for further modelling

3. Extracting useful features from time-series data (Feature


Engineering)

4. Checking the stationarity of the time series to reduce it

5. ARIMA and Grid-search ARIMA models for time-series forecasting

6. Heading to deep learning methods for more complex time-series


forecasting (LSTM and bi-LSTMs)

So without further ado, let’s begin!

Load Time Series Data and Use Descriptive

Statistics to Explore it
For the easy and quick understanding and analysis of time-series data, we
will work on the famous toy dataset named ‘Daily Female Births Dataset’.

Get the dataset downloaded from here.

Importing necessary libraries and loading the data –

import numpy as np
import pandas as pd
import statsmodels
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('daily-total-female-births-in-cal.csv', parse_dates=True, header=0, squeeze=True)
data.head()

This is the output we get-

1959–01–01 35
1959–01–02 32
1959–01–03 30
1959–01–04 31
1959–01–05 44
Name: Daily total female births in California, 1959, dtype: int64

Note — Remember, it is required to use 'parse_dates' because it converts dates to
datetime objects that can be parsed, header=0 ensures the column name is stored for
easy reference, and squeeze=True converts the single-column data frame into a Series.

Exploring the Time-Series Data –

print(data.size) #output-365

(a) Carry out some descriptive statistics —

print(data.describe())

Output —
count 365.000000
mean 41.980822
std 7.348257
min 23.000000
25% 37.000000
50% 42.000000
75% 46.000000
max 73.000000

(b) A look at the time-series distribution plot —

plt.plot(data)
plt.show()

Scale and Normalize Time Series Data for Further

Modelling

A normalized data scales the numeric features in the training data in the
range of 0 and 1 so that gradient descent and loss optimization is fast and
efficient and converges quickly to the local minima. Interchangeably known
as feature scaling, it is crucial for any ML problem statement.
Let’s see how we can achieve normalization in time-series data.

For this purpose, let’s pick a highly fluctuating time-series data — the
minimum daily temperatures data. Grab it here!

Let’s have a look at the extreme fluctuating nature of the data —


It is seen that the data has a strong seasonality integrant. Hence, scaling is
important so that the large seasonal swings do not dominate, and it leads to a
crisper relationship between the independent and target features!

To normalize a feature, Scikit-learn's MinMaxScaler is very handy! If you want to
recover the original data points after prediction, an inverse_transform() function is
also provided by this scaler!

Here goes the normalization code —

# import necessary libraries
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# load and sanity check the data
data = pd.read_csv('daily-minimum-temperatures-in-me.csv', parse_dates=True, header=0, squeeze=True, index_col=0)
print(data.head())

# convert data into a matrix of row-col vectors
values = data.values
values = values.reshape((len(values), 1))

# feature scaling
scaler = MinMaxScaler(feature_range=(0, 1))

# fit the scaler with the data to get min-max values
scaler = scaler.fit(values)
print('Min: %f, Max: %f' % (scaler.data_min_, scaler.data_max_))

# normalize the data and sanity check
normalized = scaler.transform(values)
for i in range(5):
    print(normalized[i])

# inverse transform to obtain original values
original_matrix = scaler.inverse_transform(normalized)
for i in range(5):
    print(original_matrix[i])

Let’s have a look at what we got –


See how the values have scaled!

Note — In our case, our data does not have outliers present and hence a
MinMaxScaler solves the purpose well. In the case where you have an
unsupervised learning approach, and your data contains outliers, it is
better to go for standardization, which is more robust than normalization,
as normalization scales the data close to the mean which doesn’t handle
or include outliers leading to a poor model. Standardization, on the other
hand, takes large intervals with a standard deviation value of 1 and a
mean of 0, thus outlier handling is robust.

More on that here!

Extracting Useful Features from Time-Series Data

(Feature Engineering)

Framing data into a supervised learning problem simply deals with the task
of handling and extracting useful features and discarding irrelevant
features to make the model robust and cost-efficient.
We already know that supervised learning problems have 2 types of
features — the independents (x) and dependent/target(y). Hence, how
better the target value is achieved depends on how well we choose and
engineer the independent features.

You must know by now that time-series data has two columns, timestamp,
and its respective value. So, it is very self-explanatory that in the time
series problem, the independent feature is time and the dependent feature
is value.

Now let us look at what are the features that need to be engineered into
these input and output values so that the inherent relationship between
these two variables is established to make the forecasting as good as
possible.

The features which are extremely important to model the relationship


between the input and output variables in a time series are —

1. Descriptive Statistical Features — Quite straightforward as it sounds,
calculating the statistical details and summary of any data is extremely
important: mean, median, standard deviation, quantiles, and min-max values.
These come in extremely handy in tasks such as outlier detection, scaling and
normalization, recognizing the distribution, etc.

2. Window Statistic Features — Window features are a statistical


summary of different statistical operations upon a fixed window size of
previous timestamps. There are, in general, 2 ways to extract descriptive
statistics from windows. They are

(a) Rolling Window Statistics: The rolling window focuses on calculating


rolling means or what we conventionally call Moving Average, and often
other statistical operations. This calculates summary statistics (mostly
mean) across values within a specific sliding window, and then we can
assign these as features in our dataset.

Let the value at timestamp t-1 be x and at t-2 be y; we find the average of x and y
to predict the value at timestamp t+1. The rolling window hence takes a mean of 2
values to predict the 3rd value. After that is done, the window shifts to the next set
of values, and the mean is calculated for each window of 2 values. We use rolling
window statistics more often when recent data is more important for forecasting than
older data.

Let’s see how we can calculate moving or rolling average with a rolling
window —

from pandas import DataFrame
from pandas import concat

df = DataFrame(data.values)
tshifts = df.shift(1)
rwin = tshifts.rolling(window=2)
moving_avg = rwin.mean()
joined_df = concat([moving_avg, df], axis=1)
joined_df.columns = ['mean(t-2,t-1)', 't+1']
print(joined_df.head(5))

Let’s have a look at what we got —

(b) Expanding Window Statistics: Almost similar to the rolling window, the expanding
window takes into account not just a fixed number of recent values but all the previous
observations, growing each time it expands. This is beneficial when the older data is
equally important for forecasting as the recent data.

Let’s have a quick look at expanding window code-

window = tshifts.expanding()
joined_df2 = concat([window.mean(), df.shift(-1)], axis=1)
joined_df2.columns = ['mean', 't+1']
print(joined_df2.head(5))
Let’s have a look at what we got -

3. Lag Features — A lag feature simply uses the value at a previous timestamp, say t-1,
to help predict the value at timestamp t+1. It is simply the distance, or lag, between
two values at 2 different timestamps. (A short pandas sketch after this list shows how
lag and datetime features can be created.)

4. Datetime Features — This is simply the conversion of time into its


specific components like a month, or day, along with the value of
temperature for better forecasting. By doing this, we can gather specific
information about the month and day at a particular timestamp for each
record.

5. Timestamp Decomposition — Timestamp decomposition includes


breaking down the timestamp into subset columns of timestamp for storing
unique and special timestamps. Before Diwali or, say, Christmas, the sale
of crackers and Santa-caps, fruit-cakes increases exponentially more than
at other times of the year. So storing such a special timestamp by
decomposing the original timestamp into subsets is useful for forecasting.
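Here is that small pandas sketch for lag, datetime, and "special timestamp" features; the toy series, column names, and chosen lags are assumptions made purely for illustration:

import pandas as pd
import numpy as np

# a toy daily series, invented for illustration
idx = pd.date_range('2020-01-01', periods=30, freq='D')
df = pd.DataFrame({'value': np.arange(30)}, index=idx)

df['lag_1'] = df['value'].shift(1)                        # lag feature: value at t-1
df['lag_7'] = df['value'].shift(7)                        # lag feature: value one week back
df['month'] = df.index.month                              # datetime feature
df['dayofweek'] = df.index.dayofweek                      # datetime feature
df['is_december'] = (df.index.month == 12).astype(int)    # decomposed "special" timestamp
print(df.head())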

Time-series Data Stationary Checks

So, let’s first digest what stationary time-series data is!

Stationary, as the term suggests, means consistent. In time series, data that does not
contain seasonality or trends is termed stationary. Any other time-series data that has
a specific trend or seasonality is, thus, non-stationary.

Can you recall, that amongst the two time-series data we worked on, the
childbirths data had no trend or seasonality and is stationary. Whereas,
the average daily temperatures data, has a seasonality factor and drifts,
and hence, it’s non-stationary and hard to model!

Stationarity in time-series is noticeable in 3 types —

(a) Trend Stationary — This kind of time-series data possesses no trend.

(b) Seasonality Stationary — This kind of time-series data possesses no


seasonality factor.

(c) Strictly Stationary — The time-series data is strictly consistent with


almost no variance to drifts.

Now that we know what stationarity in time series is, how can we check for
the same?

Vision is everything. A quick visualization of your time-series data at hand


can give a quick eye review of whether the data can be stationary or not.
Next in the line comes the statistical summary. A clear look into the
summary statistics of the data like min, max, variance, deviation, mean,
quantiles, etc. can be very helpful to recognize drifts or shifts in data.

Let's POC this!

So, we take stationary data, which is the handy childbirths data we worked
on earlier. However, for the non-stationary data, let’s take the famous
airline-passenger data, which is simply the number of airline passengers
per month, and prove how they are stationary and non-stationary.

Case 1 — Stationary Proof

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('daily-total-female-births.csv', parse_dates=True, header=0, squeeze=True)
data.hist()
plt.show()

Output —
As I said, vision! Look how the visualization itself speaks that it’s a
Gaussian Distribution. Hence, stationary!

More curious? Let’s get solid math proof!

X = data.values
seq = round(len(X) / 2)
x1, x2 = X[0:seq], X[seq:]
meanx1, meanx2 = x1.mean(), x2.mean()
varx1, varx2 = x1.var(), x2.var()
print('meanx1=%f, meanx2=%f' % (meanx1, meanx2))
print('variancex1=%f, variancex2=%f' % (varx1, varx2))

Output —

meanx1=39.763736, meanx2=44.185792
variancex1=49.213410, variancex2=48.708651

The mean and variances linger around each other, which clearly shows
the data is invariant and hence, stationary! Great.

Case 2— Non-Stationary Proof


import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('international-airline-passengers.csv', parse_dates=True, header=0, squeeze=True)
data.hist()
plt.show()

Output —

The graph pretty much gives a seasonal taste. Moreover, it is too distorted
for a Gaussian tag. Let’s now quickly get the mean-variance gaps.

X = data.values
seq = round(len(X) / 2)
x1, x2 = X[0:seq], X[seq:]
meanx1, meanx2 = x1.mean(), x2.mean()
varx1, varx2 = x1.var(), x2.var()
print('meanx1=%f, meanx2=%f' % (meanx1, meanx2))
print('variancex1=%f, variancex2=%f' % (varx1, varx2))

Output —

meanx1=182.902778, meanx2=377.694444
variancex1=2244.087770, variancex2=7367.962191

Alright, the gap between the means and between the variances is pretty self-explanatory
for picking out the non-stationary kind.
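Beyond the mean and variance comparison above, another common check is the Augmented Dickey-Fuller test from statsmodels (not used in the original article); a p-value above roughly 0.05 suggests the series is non-stationary:

from statsmodels.tsa.stattools import adfuller

# run the ADF test on the airline passengers series loaded above
result = adfuller(data.values)
print('ADF statistic: %f' % result[0])
print('p-value: %f' % result[1])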

ARMA, ARIMA, and SARIMAX Models for Time-

Series Forecasting

Two very traditional yet remarkable 'machine learning' ways of forecasting a time series
are the ARMA (Auto-Regressive Moving Average) model and the Auto-Regressive Integrated
Moving Average model, commonly called ARIMA, both statistical models.

Other than these 2 traditional approaches, we have SARIMA (Seasonal


Auto-Regressive Integrated Moving Average) and Grid-Search ARIMA,
which we will see too!

So, let’s explore the models, one by one!

ARMA

The ARMA model is an assembly of 2 statistical models — the AR or Auto-


Regressive model and Moving Average.

The Auto-Regressive Model estimates any dependent variable value y(t)


at a given timestamp t on the basis of lags. Look at the formula below for a
better understanding —
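In its simplest first-order form, the model can be written as:

y(t) = α + β * y(t-1) + ε(t)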

Here, y(t) = predicted value at timestamp t, α = intercept term, β =


coefficient of lag, and, y(t-1) = time-series lag at timestamp t-1.

So α and β are the model estimators that estimate y(t).


The Moving Average Model plays a similar role, but it does not take the
past predicted forecasts into account, as said earlier in rolling average. It
rather uses the lagged forecast errors in previously predicted values to
predict the future values, as shown in the formula below.
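In general form, the moving average model can be written as:

y(t) = μ + ε(t) + θ1 * ε(t-1) + ... + θq * ε(t-q)

where ε(t-k) are the past forecast errors and θk their coefficients.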

Let’s see how both the AR and MA models perform on the International-
Airline-Passengers data.

AR model

# indexedDataset_logScale and datasetLogDiffShifting are the log-transformed and
# differenced versions of the passengers series prepared earlier in the original
# article (preparation not shown here)
from statsmodels.tsa.arima_model import ARIMA  # older statsmodels API, matching disp=-1 below

AR_model = ARIMA(indexedDataset_logScale, order=(2, 1, 0))
AR_results = AR_model.fit(disp=-1)
plt.plot(datasetLogDiffShifting)
plt.plot(AR_results.fittedvalues, color='red')
plt.title('RSS: %.4f' % sum((AR_results.fittedvalues - datasetLogDiffShifting['#Passengers'])**2))

The RSS, or residual sum of squares, is 1.5023 in the case of the AR model, which is
somewhat unsatisfactory as AR doesn't capture non-stationarity well enough.

MA Model
MA_model = ARIMA(indexedDataset_logScale, order=(0,1,2))
MA_results = MA_model.fit(disp=-1)
plt.plot(datasetLogDiffShifting)
plt.plot(MA_results.fittedvalues, color='red')
plt.title('RSS: %.4f'%sum((MA_results.fittedvalues -
datasetLogDiffShifting['#Passengers'])**2))

The MA model shows similar results to AR, differing by a very small


amount. We know our data is non-stationary, so let’s make this RSS score
better by the non-stationarity handler AR+I+MA!

ARIMA

Along with the combined use of the AR and MA models described earlier, ARIMA uses a
special concept of Integration (I) with the purpose of differencing some observations
in order to make non-stationary data stationary, for better forecasting. So it is
better than its predecessor ARMA, which can only handle stationary data.

What the differencing factor does is take the difference between the values at two
consecutive timestamps (t and t-1, for example). Doing this helps in achieving a
constant mean rather than a highly fluctuating 'non-stationary' mean.

Let’s fit the same data with ARIMA and see how well it performs!

ARIMA_model = ARIMA(indexedDataset_logScale, order=(2,1,2))


ARIMA_results = ARIMA_model.fit(disp=-1)
plt.plot(datasetLogDiffShifting)
plt.plot(ARIMA_results.fittedvalues, color='red')
plt.title('RSS: %.4f'%sum((ARIMA_results.fittedvalues -
datasetLogDiffShifting['#Passengers'])**2))

Great! The graph itself speaks how ARIMA fits our data in a well and
generalized fashion compared to the ARMA! Also, observe how the RSS
has dropped to 1.0292 from 1.5023 or 1.4721.
SARIMAX

Designed and developed as a beautiful extension to the ARIMA, SARIMAX


or, Seasonal Auto-Regressive Integrated Moving Average with eXogenous
factors is a better player than ARIMA in case of highly seasonal time
series. There are 4 seasonal components that SARIMAX takes into
account.

They are -

1. Seasonal Autoregressive Component

2. Seasonal Moving Average Component

3. Seasonal Integrity Order Component

4. Seasonal Periodicity


If you are more of a theory conscious person like me, do read more on
this here, as getting into the details of the formula is beyond the scope of
this article!

Now, let’s see how well SARIMAX performs on seasonal time-series data
like the International-Airline-Passengers data.

# train, test, start and end come from a train-test split of the passengers data
# in the original article (split not shown here)
from statsmodels.tsa.statespace.sarimax import SARIMAX

SARIMAX_model = SARIMAX(train['#Passengers'], order=(1, 1, 1), seasonal_order=(1, 0, 0, 12))
SARIMAX_results = SARIMAX_model.fit()
preds = SARIMAX_results.predict(start, end, typ='levels').rename('SARIMAX Predictions')
test['#Passengers'].plot(legend=True, figsize=(8, 5))
preds.plot(legend=True)

Look how beautifully SARIMAX handles seasonal time series!

Heading to DL Methods for Complex Time-Series

Forecasting

One of the very common features of time-series data is the long-term


dependency factor. It is obvious that many time-series forecasting works
on previous records (the future is forecasted based on previous records,
which may be far behind). Hence, ordinary traditional machine learning
models like ARIMA, ARMA, or SARIMAX are not capable of capturing
long-term dependencies, which makes them poor guys in sequence-
dependent time series problems.

To address such an issue, a massively intelligent and robust neural


network architecture was proposed which can extraordinarily handle
sequence dependence. It was known as Recurrent Neural Networks or
RNN.


RNN was designed to work on sequential data like time series. However, a very remarkable
pitfall of RNN was that it couldn't handle long-term dependencies. For a problem where
you want to forecast a time series based on a huge number of previous records, RNN
forgets most of the records that occurred much earlier and only learns sequences of
recent data fed to its neural network. So, RNN was observed to not be up to the mark
for NSP (Next Sequence Prediction) tasks in NLP and time series.

To address this issue of not capturing long-term dependencies, a powerful


variant of RNN was developed, known as LSTM (Long Short Term
Memory) Networks. Unlike RNN, which could only capture short-term
sequences/dependencies, LSTM, as its name suggests was observed to
learn long as well as short term dependencies. Hence, it was a great
success for modelling and forecasting time series data!

Note — Since explaining the architecture of LSTM will be beyond the size
of this blog, I recommend you to head over to my article where I explained
LSTM in detail!

Let us now take our Airline Passengers’ data and see how well RNN and
LSTM work on it!

Imports —

import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import sklearn.preprocessing
from sklearn.metrics import r2_score
from keras.layers import Dense, Dropout, SimpleRNN, LSTM
from keras.models import Sequential

Scaling the data into the 0-1 range for better training —

minmax_scaler = sklearn.preprocessing.MinMaxScaler()
data['Passengers'] = minmax_scaler.fit_transform(data['Passengers'].values.reshape(-1,1))
data.head()

Scaled data —
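The x and y arrays used in the split below come from a windowing step that is not shown in this excerpt; a minimal sketch with a lookback of 20 (matching the reshape below, and assumed purely for illustration) could be:

# build sliding windows of 20 past values (x) and the next value (y)
lookback = 20
values = data['Passengers'].values
x, y = [], []
for i in range(lookback, len(values)):
    x.append(values[i - lookback:i])
    y.append(values[i])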

Train, test splits (80–20 ratio) —

split = int(len(data['Passengers']) * 0.8)
x_train, y_train, x_test, y_test = np.array(x[:split]), np.array(y[:split]), np.array(x[split:]), np.array(y[split:])

# reshaping data to the (samples, timesteps, features) shape expected by the recurrent layers
x_train = np.reshape(x_train, (split, 20, 1))
x_test = np.reshape(x_test, (x_test.shape[0], 20, 1))

RNN Model —

model = Sequential()
model.add(SimpleRNN(40, activation="tanh", return_sequences=True,
input_shape=(x_train.shape[1],1)))
model.add(Dropout(0.15))
model.add(SimpleRNN(50, return_sequences=True, activation="tanh"))
model.add(Dropout(0.1)) #remove overfitting
model.add(SimpleRNN(10, activation="tanh"))
model.add(Dense(1))
model.summary()

Compile it, fit it and predict —

model.compile(optimizer="adam", loss="MSE")
model.fit(x_train, y_train, epochs=15, batch_size=50)
preds = model.predict(x_test)
Let me show you a picture of how well the model predicts —
Pretty much accurate!

LSTM Model —

model = Sequential()
model.add(LSTM(100, activation="relu", return_sequences=True, input_shape=(x_train.shape[1], 1)))
model.add(Dropout(0.2))
model.add(LSTM(80, activation="relu", return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(50, activation="relu", return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(30, activation="relu"))
model.add(Dense(1))
model.summary()
Compile it, fit it, and predict —

model.compile(optimizer="adam", loss="MSE")
model.fit(x_train, y_train, epochs=15, batch_size=50)
preds = model.predict(x_test)
Let me show you a picture of how well the model predicts —
Here, we can easily observe that the RNN does the job better than the LSTM. The LSTM performs great on the training data but poorly on the validation/test data, which is a sign of overfitting!

Hence, use LSTM only where there is a real need for long-term dependency learning; otherwise, a simple RNN works well enough.

Conclusion

Cheers on reaching the end of the guide and learning some genuinely interesting things about Time Series. From this guide, you learned the basics of time series, the difference between the Time Series Analysis and Time Series Forecasting subdomains, a crisp mathematical intuition for analysis and forecasting techniques, and how to work on Time Series problems with Machine Learning and Deep Learning to solve complex problems.

Hope you had fun exploring Time Series with Machine Learning and Deep Learning! If you are a curious learner and don't want to stop here, head over to the awesome notebook on time series provided by TensorFlow!

Feel free to follow me on Medium and GitHub for more articles and
notebooks on Machine & Deep Learning! Connect with me on LinkedIn if
you want to discuss anything regarding this article!
Happy Learning!
A Complete Guide on Hough Transform

Dulari Bhatt 07 Jun, 2024 • 11 min read

Introduction

The Hough Transform is a pivotal algorithm in computer vision and image processing, enabling the detection of geometrical shapes such as lines, circles, and ellipses within images. By transforming image space into parameter space, the Hough Transform leverages a voting mechanism to identify shapes through local maxima in an accumulator array. Typically, this method detects lines and edges, utilizing parameters like rho and theta to represent straight lines in polar coordinates. This algorithm is essential in various applications, from edge detection and feature extraction to more complex tasks like circle detection and generalized shape identification. Implemented in languages like Python, often using libraries like OpenCV, Hough transform line detection remains a robust tool for analyzing and interpreting visual data despite its computational intensity and parameter sensitivity.

This article will provide you with in-depth knowledge about the Hough
transform. It will give you a basic introduction to the Hough transform, how
exactly it works, and the math behind it. It also enlists the merits and
demerits of Hough Transform algorithm and its various applications.

Learning Outcomes:

 Explain the foundational concepts of the Hough Transform, including its purpose in detecting geometrical shapes within images, and its historical development.
 Gain an understanding of the mathematical principles underlying the Hough Transform, such as the transformation of image space to parameter space.
 Implement the Hough Transform line detection algorithm in a programming language such as Python using libraries like OpenCV.
 Explore various real-world applications of the Hough Transform algorithm, such as lane and traffic sign recognition, medical imaging, industrial inspection, and object tracking.

This article was published as a part of the Data Science Blogathon .


History

It was first developed for the automated examination of bubble chamber pictures (Hough, 1959). In 1962, the Hough transform was patented as U.S. Patent 3,069,654, assigned to the U.S. Atomic Energy Commission, under the name "Method and Means for Recognizing Complex Patterns." This invention uses a slope-intercept parameterization for straight lines, which leads to an unbounded transform space because the slope can go to infinity.

The variant used universally today was described in Duda, R. O., and Hart, P. E., "Use of the Hough Transformation to Detect Lines and Curves in Pictures," Comm. ACM, Vol. 15, pp. 11-15 (January 1972), although an equivalent transformation had been standard for the Radon transform since the 1930s. O'Gorman and Clowes discuss their version in Frank O'Gorman and M. B. Clowes, "Finding Picture Edges Through Collinearity of Feature Points," IEEE Transactions on Computers, Vol. 25, No. 4, pp. 449-456 (1976). Hart, P. E., "How the Hough Transform Was Invented," IEEE Signal Processing Magazine, Vol. 26, Issue 6, pp. 18-22 (November 2009), tells the tale of how the contemporary form of the Hough transform came to be.

What is Hough Transform?


Hough Transform is a computer vision technique that detects shapes like
lines and circles in an image. It converts these shapes into mathematical
representations in parameter space, making it easier to identify them even
if they’re broken or obscured. This method is valuable for image analysis,
pattern recognition, and object detection.

Hough Transform line detection is a feature extraction method in image analysis, computer vision, and digital image processing. It uses a voting mechanism to identify imperfect instances of objects within a given class of shapes. This voting mechanism is carried out in parameter space, and the HT algorithm produces object candidates as local maxima in an accumulator space.

The traditional HT was concerned with detecting lines in an image, but it was subsequently expanded to identifying the locations of arbitrary shapes, most often circles or ellipses. Richard Duda and Peter Hart devised the HT as we know it today in 1972, calling it a "generalized Hough transform" after Paul Hough's related 1962 patent. Dana H. Ballard popularized the transform in the computer vision field with his 1981 journal paper "Generalizing the Hough Transform to Detect Arbitrary Shapes."

Why is it Needed?
In many circumstances, a pre-processing stage can use an edge detector
to obtain picture points or pixels on the required curve in the image space.
However, there may be missing points or pixels on the required curves
due to flaws in either the image data or the edge detector and spatial
variations between the ideal line/circle/ellipse and the noisy edge points
acquired by the edge detector. As a result, grouping the extracted edge
characteristics into an appropriate collection of lines, circles, or ellipses is
frequently difficult.
Figure 1: Original image of a lane.

Figure 2: Image after applying an edge detection technique. Red circles show where the line breaks.
How Does the Hough Transform Work?

The Hough approach is effective for computing a global description of one or more features from (potentially noisy) local measurements, where the number of solution classes does not need to be known in advance. For line identification, for example, the motivating assumption is that each input measurement contributes to a globally consistent solution (e.g., the physical line that gave rise to that image point).

A line can be described analytically in various ways. One option is the parametric or normal form: x cos θ + y sin θ = r, where r is the length of the normal from the origin to the line and θ is its orientation, as shown in Figure 5.
The known variables (i.e., the image points (xi, yi)) are constants in the parametric line equation, whereas r and θ are the unknown variables we seek. If we plot the potential (r, θ) values defined by each point, points in Cartesian image space correspond to curves (i.e., sinusoids) in the polar Hough parameter space. This point-to-curve transformation is the Hough transform for straight lines. Collinear points in the Cartesian image space become obvious when examined in the Hough parameter space because their curves intersect at a single (r, θ) point.

The same idea extends to circles, parameterized as (x − a)² + (y − b)² = r², where a and b are the circle's center coordinates and r is the radius. The algorithm's computational complexity increases because we now have three coordinates in the parameter space and therefore a 3-D accumulator. (In general, the calculation and the size of the accumulator array grow polynomially with the number of parameters.) As a result, the fundamental Hough approach described here is restricted to straight lines.

Algorithm

 Determine the range of ρ and θ. Typically, θ ranges over [0, 180] degrees and ρ over [-d, d], where d is the diagonal length of the edge image. It is crucial to quantize these ranges so that there is only a finite number of possible values.
 Create a 2D array called the accumulator with dimensions (num_rhos, num_thetas) to represent the Hough space, and initialize all of its values to zero.
 Run edge detection (ED) on the original image. You can use whatever ED technique you like.
 Check each pixel of the edge image. If the pixel lies on an edge, loop over all possible values of θ, compute the corresponding ρ, locate the θ and ρ indices in the accumulator, and increment the accumulator at that index pair.
 Iterate over the accumulator's values. Retrieve the ρ and θ indices and recover the values of ρ and θ from each index pair. If the accumulated value exceeds a specified threshold, the index pair can be transformed back into a line of the form y = ax + b.
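To make the voting procedure above concrete, here is a minimal NumPy sketch of the accumulator construction for straight lines. It assumes edge_img is a binary edge map (for example from cv2.Canny) and is meant only as an illustration of the algorithm, not as a replacement for OpenCV's optimized cv2.HoughLines:

import numpy as np

def hough_lines_accumulator(edge_img, num_thetas=180):
    # Quantize theta over [0, 180) degrees and rho over [-d, d]
    h, w = edge_img.shape
    diag = int(np.ceil(np.hypot(h, w)))
    thetas = np.deg2rad(np.arange(0, num_thetas))
    rhos = np.arange(-diag, diag + 1)
    accumulator = np.zeros((len(rhos), len(thetas)), dtype=np.uint64)
    # Every edge pixel casts one vote for each candidate theta
    ys, xs = np.nonzero(edge_img)
    for x, y in zip(xs, ys):
        for t_idx, theta in enumerate(thetas):
            rho = int(round(x * np.cos(theta) + y * np.sin(theta))) + diag
            accumulator[rho, t_idx] += 1
    return accumulator, rhos, thetas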

Worked Example of the Hough Transform

Problem: Given the following set of points, use the Hough transform to find the line joining them.
A(1,4), B(2,3), C(3,1), D(4,1), E(5,0)

Solution:

Let's start from the equation of a line, y = ax + b.

Rewriting the same equation with b on the left-hand side gives b = -ax + y. Writing this equation for point A(1,4), i.e., x = 1 and y = 4, gives b = -a + 4. The following table shows the equations for all of the given points:

Point     x and y values    Substituting into b = -ax + y
A(1,4)    x=1 ; y=4         b = -a + 4
B(2,3)    x=2 ; y=3         b = -2a + 3
C(3,1)    x=3 ; y=1         b = -3a + 1
D(4,1)    x=4 ; y=1         b = -4a + 1
E(5,0)    x=5 ; y=0         b = -5a + 0

Now take a = 0 and a = 1 and find the corresponding b value for each of the five equations:

Point     Equation      For a = 0         New point (a,b)    For a = 1          New point (a,b)
A(1,4)    b = -a + 4    b = -(0)+4 = 4    (0,4)              b = -(1)+4 = 3     (1,3)
B(2,3)    b = -2a + 3   b = -2(0)+3 = 3   (0,3)              b = -2(1)+3 = 1    (1,1)
C(3,1)    b = -3a + 1   b = -3(0)+1 = 1   (0,1)              b = -3(1)+1 = -2   (1,-2)
D(4,1)    b = -4a + 1   b = -4(0)+1 = 1   (0,1)              b = -4(1)+1 = -3   (1,-3)
E(5,0)    b = -5a + 0   b = -5(0)+0 = 0   (0,0)              b = -5(1)+0 = -5   (1,-5)

Plotting these lines in (a, b) space (Figure 6), we can see that almost all of them cross at the point (-1, 5), so a = -1 and b = 5.

Substituting these values into y = ax + b gives y = -x + 5, which is the line equation that links the collinear edge points.

Advantages of Hough Transform


 Robustness: It can detect shapes in images even if they are partially obscured, degraded, or noisy.
 Simplicity: Conceptually straightforward, making it relatively easy to implement and understand.
 Versatility: Can be adapted to detect various geometric shapes, not just lines but also circles and ellipses.
 Global Detection: Able to identify lines and shapes across the entire image rather than just in localized regions.

Disadvantages of Hough Transform

 Computationally Intensive: Can be slow and requires significant computational resources, especially for large images or complex shapes.
 Parameter Sensitivity: Requires careful selection of parameters (e.g., resolution of the parameter space) to balance accuracy and performance.
 Noise Sensitivity: May produce false positives if the image has a lot of noise or clutter.
 Discrete Nature: Quantizing the parameter space can lead to a loss of precision in detecting shapes.

Application of the Hough Transform

The HT has been widely employed in numerous applications because of benefits such as noise immunity. 3D applications, object and shape detection, lane and traffic sign recognition, industrial and medical applications, pipe and cable inspection, and underwater tracking are just a few examples [6]. Below are some of these applications. For lane detection, one approach proposes the hierarchical additive Hough transform (HAHT), which accumulates votes at various hierarchical levels; segmenting lines into multiple blocks also minimizes the computational load. [7] proposes a lane detection strategy in which the HT is merged with Joint Photographic Experts Group (JPEG) compression, although only simulations are used to test the method.

Biometric and Man-machine Interaction

One work presents a generalized hand detection model with articulated degrees of freedom. It makes use of geometric feature correspondences between points and lines: the probabilistic Hough transform (PHT) detects the lines, which are then matched with a model. The GHT can identify Turkish finger-spelling; the scale-invariant feature transform (SIFT) is applied to produce interest zones, and superfluous interest zones are eliminated using skin-color reduction. Another work suggests using hand gestures and tracking to create a real-time human-robot interaction system, integrating connected component labeling (CCL) with the Hough transform. By sensing skin tone, the system extracts the hand's center, orientation, and the fingertip positions of all outstretched fingers.

3D Applications

It proposes a detector for 3D lines, boxes, and blocks. This approach


decreases the computing cost by using 2D Hough space for parallel line
detection. In addition, 3D HT is used to extract planar faces from point
clouds that are unevenly dispersed.

Object Recognition

It describes a method for detecting objects utilizing the GHT and color
similarity between homogeneous portions of the item. It takes as input
already segmented areas of homogenous hue. According to this research,
it resists changes in light, occlusion, and distortion of the segmentation
output. It can distinguish items that have been rotated, scaled, and even
moved around in a complicated environment.

Object Tracking

It uses a blend of uniform and Gaussian Hough transforms to conduct


shape-based tracking. While the camera moves, this method can find and
follow objects even against difficult backdrops like a dense forest.

Underwater Application
It offers a technique for detecting geometrical forms in underwater robot
images. Unlike the traditional Generalized Hough Transform (GHT), it
transforms the recognition problem into a bounded error estimation
problem. An autonomous underwater vehicle (AUV) creates a system for
visually directed operations underwater. The Hough Transform identifies
underwater pipelines, and their orientations are determined using binary
morphology.

Industrial and Commercial Application

Various industrial and commercial uses utilize the Hough Transform (HT)
and its various versions. For instance, in an uncrewed aerial vehicle (UAV)
surveillance and inspection system, a knowledge-based power line
identification approach is proposed. Before applying the Line Hough
Transform (LHT), researchers construct a pulse-coupled neural network
(PCNN) filter to eliminate background noise from picture frames. They
then refine the findings using knowledge-based line clustering.

Medical Application

It uses a warped frequency transform (WFT) to adjust for the dispersive


behavior of ultrasonic guided waves, followed by a Wigner-Ville time-
frequency analysis and the HT to increase fault localization accuracy. The
system automatically recognizes the locations and boundaries of
vertebrae. It utilizes the Hough Transform in conjunction with the Genetic
Algorithm to determine the migrating vertebrae.

Unconventional Application

This section shows unexpected usage to demonstrate the HT’s


adaptability and ever-expanding nature. It proposes a data mining
technique for locating arbitrarily oriented subspace clusters. This is
achieved by mapping the data space to the parameter space, defining the
collection of arbitrarily oriented subspaces that may be created. The
clustering method works by locating clusters that may hold many database
objects. Even if subspace clusters of differing dimensions are sparse or
crossed by other clusters in a noisy environment, the method can discover
them.
Implementation Using the Hough Transform

Code for Line Detection Using the Hough Transform

Python Code:
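The following is a minimal illustrative sketch of line detection using OpenCV's built-in probabilistic Hough transform. The file name 'lane.jpg' and the Canny/Hough threshold values are assumptions chosen for demonstration, not values from the original article:

import cv2
import numpy as np

# The Hough transform operates on an edge map, so detect edges first
img = cv2.imread('lane.jpg')                      # assumed example image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 50, 150, apertureSize=3)

# Probabilistic Hough transform: returns end points of detected line segments
lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=80,
                        minLineLength=30, maxLineGap=10)

# Draw each detected line segment on the original image
if lines is not None:
    for x1, y1, x2, y2 in lines[:, 0]:
        cv2.line(img, (x1, y1), (x2, y2), (0, 255, 0), 2)

cv2.imwrite('lines_detected.jpg', img)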

Circle Detection Using Hough Transform

# Imports (the snippet runs in Google Colab, hence cv2_imshow)
import cv2
import numpy as np
from google.colab.patches import cv2_imshow

# Read the image
img = cv2.imread('/content/drive/MyDrive/Colab Notebooks/circle_hough.jpg',
                 cv2.IMREAD_COLOR)
print("Original Image")
cv2_imshow(img)
# Convert to gray-scale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Blur the image to reduce noise
img_blur = cv2.medianBlur(gray, 5)
# Apply the Hough transform on the blurred image
circles = cv2.HoughCircles(img_blur, cv2.HOUGH_GRADIENT, 1, 40,
                           param1=100, param2=30, minRadius=1, maxRadius=40)
# Draw detected circles
if circles is not None:
    circles = np.uint16(np.around(circles))
    for i in circles[0, :]:
        # Draw outer circle
        cv2.circle(img, (i[0], i[1]), i[2], (0, 255, 0), 2)
        # Draw inner circle (the center point)
        cv2.circle(img, (i[0], i[1]), 2, (0, 0, 255), 5)
# Show result
print("Circle Detection using Hough Transform")
cv2_imshow(img)

Output

Conclusion
The Hough transform is a robust image processing technique for detecting
geometric shapes in noisy and occluded images. By transforming points
from image space to parameter space, it identifies local maxima in an
accumulator array to detect shapes like lines, circles, and ellipses. Despite
computational intensity and parameter sensitivity, its global detection
capabilities make it invaluable in applications such as lane recognition,
medical imaging, and industrial inspection. Parallel computing
advancements promise to enhance its efficiency for real-time use,
solidifying its role as a fundamental computer vision tool.

Key Takeaways:

 Hough Transform detects shapes by transforming points to parameter space
 Voting mechanism makes it robust to noise and missing data
 Wide applications: lane recognition, medical imaging, industrial inspection, etc.
 Fundamental tool despite computational intensity; efficiency improving with parallel computing
What is Camera Calibration in Computer Vision?

Dulari Bhatt 13 Jun, 2023 • 13 min read

Camera calibration is a fundamental task in computer vision, crucial in various applications such as 3D reconstruction, object tracking, augmented reality, and image analysis. Accurate calibration ensures precise measurements and reliable analysis by correcting distortions and estimating intrinsic and extrinsic camera parameters. This comprehensive guide delves into the principles, techniques, and algorithms of camera calibration. We explore obtaining intrinsic and extrinsic camera parameters, understanding distortion models, working with calibration patterns, and utilizing calibration software. Whether you are a beginner or an experienced computer vision practitioner, this guide will equip you with the knowledge and skills to perform accurate camera calibration and unlock the full potential of your vision-based applications.

This article was published as a part of the Data Science Blogathon


What is Camera Calibration?

A camera is a device that converts the 3D world into a 2D image: it captures a three-dimensional scene and stores it as a two-dimensional image. The mathematics behind this is fascinating. A camera can be represented by the following equation:

x = PX

Here x denotes a 2-D image point, P denotes the camera matrix, and X denotes a 3-D world point.

Figure 1: Vector representation of x = PX [1]

Camera calibration is a frequently used term in the image processing and computer vision fields. The camera calibration method is intended to identify the geometric characteristics of the image formation process. This is a vital step in many computer vision applications, especially when metric information about the scene is needed. In these applications the camera is characterized by a set of intrinsic parameters, such as axis skew, focal length, and principal point, while its orientation is expressed by extrinsic parameters such as rotation and translation. Linear or nonlinear algorithms are used to estimate the intrinsic and extrinsic parameters from known points in the real world and their projections in the image plane.

Types of Camera Calibration

Camera calibration is the process of determining specific camera


parameters in order to complete operations with specified performance
measurements.

Camera calibration can be defined as the technique of estimating the


characteristics of a camera. It means that we have all of the camera’s
information like parameters or coefficients which are needed to determine
an accurate relationship between a 3D point in the real world and its
corresponding 2D projection in the image acquired by that calibrated
camera.
In most cases, this entails recovering two types of parameters.

1. Intrinsic or Internal Parameters

It allows mapping between pixel coordinates and camera coordinates in


the image frame. E.g. optical center, focal length, and radial distortion
coefficients of the lens.

2. Extrinsic or External Parameters

It describes the orientation and location of the camera. This refers to the
rotation and translation of the camera with respect to some world
coordinate system.
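As a hedged illustration (using the standard column-vector convention rather than notation taken from this article), the intrinsic parameters are often collected into a 3-by-3 matrix K and combined with the extrinsic rotation R and translation t to form the camera matrix P from the equation x = PX above:

K = \begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}, \qquad P = K\,[\,R \mid t\,]

Here f_x and f_y are the focal lengths expressed in pixels, (c_x, c_y) is the principal point (optical center), and s is the axis skew.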

Camera Calibration Method

In this section, we will look at a simple calibrating procedure. The


important part is to get the right focal length because most of the
parameters can be set using simple assumptions like square straight
pixels, optical center at the middle of the image. A flat rectangular
calibration object for example a book will suffice, measuring tape or a
ruler, and a flat surface are required for this calibration method. The
following steps can be performed to have camera calibration.

 Take a measurement of the length and width of your rectangle


calibration object. Let us refer to these as dX and dY.

 Place the camera and calibration object on a flat surface with the
camera back and calibration object parallel and the object roughly in
the center of the camera’s vision. To acquire a good alignment, you
may need to lift the camera or object.

 Calculate the distance between the camera and the calibration


object. Let’s call it dZ.

 Take a picture to ensure that the setup is straight, which means that
the sides of the calibration object align with the image’s rows and
columns.
 In pixels, determine the width and height of the object. Let us refer
to these as dx and dy.

Focal Length

Now let us see the following Figure-3 for set up.

Figure 3: Normal camera calibration setup. The left image shows an image
of the setup; the right side image displays calibration. The focal length can
be determined by measuring the width and height of the calibration object
in the image, as well as the physical measurements of the setup.

For the exact setup in Figure 3, the object was measured to be 130 by 185 mm, hence dX = 130 and dY = 185. The camera's distance from the object was 460 mm, hence dZ = 460. Only the ratios of the measurements matter; any unit of measurement can be used. After using ginput() to choose four points in the image, the width and height in pixels were 722 and 1040, respectively; in other words, dx = 722 and dy = 1040. When these numbers are used in the aforementioned relationship, the result is fx = 2555 and fy = 2586.
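The relationship referred to here is presumably the similar-triangles relation of the pinhole model:

f_x = \frac{dx \cdot dZ}{dX}, \qquad f_y = \frac{dy \cdot dZ}{dY}

Indeed, 722 × 460 / 130 ≈ 2555 and 1040 × 460 / 185 ≈ 2586.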


It is critical to note that this is only applicable to a specific image resolution. In this situation, the image was 2592 × 1936 pixels in size. Focal length and optical center are measured in pixels and scale with image resolution, so if you choose a different image resolution (for example, a thumbnail image), the values will change. It's a good idea to wrap your camera's values in a helper function like this:

from numpy import diag   # the calibration matrix is built from a diagonal matrix

def my_calibration(sz):
    row, col = sz
    fx = 2555 * col / 2592
    fy = 2586 * row / 1936
    K = diag([fx, fy, 1])
    K[0, 2] = 0.5 * col
    K[1, 2] = 0.5 * row
    return K

After that, this function takes a size tuple and returns the calibration
matrix. The optical center is assumed to be the image’s center in this
case. Replace the focal lengths with their mean if you like; for most
consumer cameras, this is fine. It should be noted that the calibration is
only for photographs in landscape orientation.
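For example, for the 2592 × 1936 landscape image used above, one might call it as follows (note that the size tuple is (rows, columns)):

K = my_calibration((1936, 2592))   # rows, cols -> fx = 2555, fy = 2586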

Camera Calibration Models

Calibration techniques for the pinhole camera model and the fisheye camera model are included in MATLAB's Computer Vision Toolbox™. The fisheye variant is compatible with cameras with a field of view (FOV) of up to 195 degrees.
The pinhole calibration algorithm is based on Jean-Yves Bouguet’s [3]
model. The pinhole camera model and lens distortion are included in the
model. Because an ideal pinhole camera does not have a lens, the pinhole
camera model does not account for lens distortion. To accurately simulate
a genuine camera, the algorithm’s whole camera model incorporates radial
and tangential lens distortion.

The pinhole model cannot model a fisheye camera due to the high
distortion produced by a fisheye lens.

Pinhole Camera Model

A pinhole camera is a basic camera model without a lens. Light rays pass
through the aperture and project an inverted image on the opposite side of
the camera. Visualize the virtual image plane in front of the camera and
assume that it is containing the upright image of the scene.

The camera matrix is a 4-by-3 matrix that represents the pinhole camera specifications; it maps the 3-D world scene onto the image plane. The calibration algorithm computes the camera matrix using the extrinsic and intrinsic parameters. The extrinsic parameters represent the camera's position in the 3-D scene, while the intrinsic parameters represent the camera's optical center and focal length.
The world points are transformed to camera coordinates using the
extrinsic parameters. Intrinsic parameters are used to map the camera
coordinates into the image plane.

Fisheye Camera Model

Camera calibration is the process of calculating the extrinsic and intrinsic properties of a camera. After calibrating a camera, the picture information can be used to extract 3-D information from 2-D photographs, and images taken with a fisheye camera can be undistorted. In MATLAB, the Computer Vision Toolbox includes calibration procedures for both the pinhole camera model and the fisheye camera model; the fisheye variant works with cameras that have a field of view (FOV) of up to 195 degrees. Because a fisheye lens produces extreme distortion, the pinhole model cannot model a fisheye camera.

Fisheye cameras are employed in odometry as well as to visually solve


simultaneous localization and mapping (SLAM) difficulties. Surveillance
systems, GoPro, virtual reality (VR) to capture a 360-degree field of view
(fov), and stitching algorithms are examples of other applications. These
cameras employ a complicated array of lenses to increase the camera’s
field of view, allowing it to capture wide panoramic or hemispherical
images. The lenses achieve this extraordinarily wide-angle view, however,
by distorting the perspective lines
in the image.

Scaramuzza’s fisheye camera model is used by the Computer Vision


Toolbox calibration algorithm. An omnidirectional camera model is used in
the model. The imaging system is treated as a compact system in the
procedure. You must collect the camera’s extrinsic and intrinsic
parameters in order to link a 3-D world point to a 2-D image. The extrinsic
parameters are used to translate world points to camera coordinates. The
intrinsic parameters are used to transfer the camera coordinates onto the
picture plane.

Types of distortion effects and their cause


We obtain better photos when we use a lens, yet the lens produces some
distortion effects. Distortion effects are classified into two types:

Radial Distortion

This sort of distortion is caused by unequal light bending. The rays bend
more at the lens’s borders than they do at the lens’s center. Straight lines
in the actual world appear curved in the image due to radial distortion.
Before hitting the image sensor, the light ray is shifted radially inward or
outward from its optimal point. The radial distortion effect is classified into
two types.

1. Barrel distortion, which corresponds to a negative radial displacement.
2. Pincushion distortion, which corresponds to a positive radial displacement.

Tangential Distortion

It occurs when the picture screen or sensor is at an angle with respect to


the lens. As a result, the image appears to be slanted and stretched.

Figure: Example of the effect of barrel and pincushion distortion on a square grid.

There are three types of distortion depending on the source: radial distortion, decentering distortion, and thin prism distortion. Decentering and thin prism distortion both contribute radial and tangential components.
Figure: Diagrams illustrating (right) the effect of tangential distortion, with solid lines representing no distortion and dotted lines representing tangential distortion, and (left) how tangential and radial distortion shift a pixel from its ideal position.

We now have a better understanding of the different sorts of distortion


effects generated by lenses, but what does a distorted image look like? Is
it necessary to be concerned about the lens’s distortion? If so, why? How
are we going to deal with it?

Figure: Illustration of the distortion effect. Take note of how the margins of the wall and doors are curled as a result of distortion.

The image above is an example of the distortion effect that a lens can
produce. The figure corresponds to figure 1 and is a barrel distortion
effect, which is a form of the radial distortion effect. Which two points
would you consider if you were asked to find the correct door height?
Things get considerably more challenging when executing SLAM or
developing an augmented reality application using cameras that have a
large distortion effect in the image.

Mathematically Representing Lens Distortion

When attempting to estimate the 3D points of the real world from an


image, we must account for distortion effects.

Based on the lens parameters, we mathematically analyze the distortion


effect and combine it with the pinhole camera model described in the
previous piece in this series. As additional intrinsic parameters, we have
distortion coefficients (which quantitatively indicate lens distortion), in
addition to the intrinsic and extrinsic characteristics discussed in the
preceding post.

To account for these distortions in our camera model, we make the


following changes to the pinhole camera model:
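As a sketch of what these changes typically look like, the standard form used by OpenCV's rational distortion model (a reference form rather than a formula taken from this article) is, for normalized undistorted coordinates (x, y) with r^2 = x^2 + y^2:

x_{distorted} = x\,\frac{1 + k_1 r^2 + k_2 r^4 + k_3 r^6}{1 + k_4 r^2 + k_5 r^4 + k_6 r^6} + 2 p_1 x y + p_2 (r^2 + 2x^2)

y_{distorted} = y\,\frac{1 + k_1 r^2 + k_2 r^4 + k_3 r^6}{1 + k_4 r^2 + k_5 r^4 + k_6 r^6} + p_1 (r^2 + 2y^2) + 2 p_2 x y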
The calibrateCamera method returns the distCoeffs matrix, which contains
values for K1 through K6, which represent radial distortion, and P1 and
P2, which represent tangential distortion. Because the following
mathematical model of lens distortion covers all sorts of distortions, radial
distortion, decentering distortion, and thin prism distortion, the coefficients
K1 through K6 represent net radial distortion while P1 and P2 represent
net tangential distortion.

Removing Distortion

So what do we do after the calibration step? We obtained the camera matrix and distortion coefficients in the previous post on camera calibration, but how do we use these values?

One application is to use the derived distortion coefficients to undistort the image. The images shown below depict the effect of lens distortion and how it can be removed using the coefficients obtained from camera calibration.

How to Reduce Lens Distortion?

To reduce lens distortion, three fundamental actions must be taken.

1. Calibrate the camera and obtain the intrinsic camera parameters. This is exactly what we accomplished in the previous installment of this series. The camera distortion characteristics are also included in the intrinsic parameters.
2. Control the percentage of undesired pixels in the undistorted image by fine-tuning the camera matrix.
3. Use the revised camera matrix to remove distortion from the image.

The getOptimalNewCameraMatrix() method is used in the second phase.


What exactly does this refined matrix imply, and why do we require it?
Refer to the photographs below; in the right image, we can see several
black pixels at the boundaries. These are caused by the image’s
undistortion. These dark pixels are sometimes undesirable in the final
undistorted image. Thus, the getOptimalNewCameraMatrix() method gives
a refined camera matrix as well as the ROI (region of interest), which may
be used to crop the image to exclude all black pixels. The percentage of
undesirable pixels to be removed is determined by the alpha parameter,
which is supplied as an input to the getOptimalNewCameraMatrix()
method.
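As a concrete sketch of steps 2 and 3 (assuming mtx and dist are the camera matrix and distortion coefficients returned by cv2.calibrateCamera, as in the calibration code below, and 'distorted.jpg' is an assumed example photograph):

import cv2

img = cv2.imread('distorted.jpg')          # assumed example image
h, w = img.shape[:2]

# Step 2: refine the camera matrix; alpha=1 keeps all source pixels,
# alpha=0 crops away the regions that would otherwise be black
new_mtx, roi = cv2.getOptimalNewCameraMatrix(mtx, dist, (w, h), 1, (w, h))

# Step 3: undistort using the refined camera matrix
undistorted = cv2.undistort(img, mtx, dist, None, new_mtx)

# Optionally crop the black border pixels away using the returned ROI
x, y, rw, rh = roi
undistorted = undistorted[y:y + rh, x:x + rw]
cv2.imwrite('undistorted.jpg', undistorted)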

Python Code for Camera Calibration

import cv2
import numpy as np
import os
import glob

# Defining the dimensions of checkerboard
CHECKERBOARD = (6, 9)
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001)

# Creating vector to store vectors of 3D points for each checkerboard image
objpoints = []
# Creating vector to store vectors of 2D points for each checkerboard image
imgpoints = []

# Defining the world coordinates for 3D points
objp = np.zeros((1, CHECKERBOARD[0] * CHECKERBOARD[1], 3), np.float32)
objp[0, :, :2] = np.mgrid[0:CHECKERBOARD[0], 0:CHECKERBOARD[1]].T.reshape(-1, 2)
prev_img_shape = None

# Extracting path of individual image stored in a given directory
images = glob.glob('./images/*.jpg')
for fname in images:
    img = cv2.imread(fname)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Find the chess board corners
    # If desired number of corners are found in the image then ret = true
    ret, corners = cv2.findChessboardCorners(
        gray, CHECKERBOARD,
        cv2.CALIB_CB_ADAPTIVE_THRESH + cv2.CALIB_CB_FAST_CHECK + cv2.CALIB_CB_NORMALIZE_IMAGE)
    """
    If desired number of corners are detected,
    we refine the pixel coordinates and display
    them on the images of checker board
    """
    if ret == True:
        objpoints.append(objp)
        # refining pixel coordinates for given 2d points.
        corners2 = cv2.cornerSubPix(gray, corners, (11, 11), (-1, -1), criteria)
        imgpoints.append(corners2)
        # Draw and display the corners
        img = cv2.drawChessboardCorners(img, CHECKERBOARD, corners2, ret)
    cv2.imshow('img', img)
    cv2.waitKey(0)

cv2.destroyAllWindows()
h, w = img.shape[:2]

"""
Performing camera calibration by
passing the value of known 3D points (objpoints)
and corresponding pixel coordinates of the
detected corners (imgpoints)
"""
ret, mtx, dist, rvecs, tvecs = cv2.calibrateCamera(objpoints, imgpoints, gray.shape[::-1],
                                                   None, None)
print("Camera matrix : \n")
print(mtx)
print("dist : \n")
print(dist)
print("rvecs : \n")
print(rvecs)
print("tvecs : \n")
print(tvecs)

C++ Code for Camera Calibration

// OpenCV and standard headers used below
#include <opencv2/opencv.hpp>
#include <iostream>
#include <vector>
#include <string>

// Defining the dimensions of checkerboard
int CHECKERBOARD[2]{6, 9};

int main()
{
  // Creating vector to store vectors of 3D points for each checkerboard image
  std::vector<std::vector<cv::Point3f>> objpoints;
  // Creating vector to store vectors of 2D points for each checkerboard image
  std::vector<std::vector<cv::Point2f>> imgpoints;
  // Defining the world coordinates for 3D points
  std::vector<cv::Point3f> objp;
  for (int i{0}; i < CHECKERBOARD[1]; i++)
  {
    for (int j{0}; j < CHECKERBOARD[0]; j++)
      objp.push_back(cv::Point3f(j, i, 0));
  }
  // Extracting path of individual image stored in a given directory
  std::vector<cv::String> images;
  // Path of the folder containing checkerboard images
  std::string path = "./images/*.jpg";
  cv::glob(path, images);
  cv::Mat frame, gray;
  // vector to store the pixel coordinates of detected checker board corners
  std::vector<cv::Point2f> corner_pts;
  bool success;
  // Looping over all the images in the directory
  for (std::size_t i{0}; i < images.size(); i++)
  {
    frame = cv::imread(images[i]);
    cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
    // Finding checker board corners
    // If desired number of corners are found in the image then success = true
    success = cv::findChessboardCorners(gray, cv::Size(CHECKERBOARD[0], CHECKERBOARD[1]),
                                        corner_pts,
                                        cv::CALIB_CB_ADAPTIVE_THRESH |
                                        cv::CALIB_CB_FAST_CHECK |
                                        cv::CALIB_CB_NORMALIZE_IMAGE);
    /*
     * If desired number of corners are detected,
     * we refine the pixel coordinates and display
     * them on the images of checker board
     */
    if (success)
    {
      cv::TermCriteria criteria(cv::TermCriteria::EPS | cv::TermCriteria::MAX_ITER, 30, 0.001);
      // refining pixel coordinates for given 2d points.
      cv::cornerSubPix(gray, corner_pts, cv::Size(11, 11), cv::Size(-1, -1), criteria);
      // Displaying the detected corner points on the checker board
      cv::drawChessboardCorners(frame, cv::Size(CHECKERBOARD[0], CHECKERBOARD[1]),
                                corner_pts, success);
      objpoints.push_back(objp);
      imgpoints.push_back(corner_pts);
    }
    cv::imshow("Image", frame);
    cv::waitKey(0);
  }
  cv::destroyAllWindows();

  cv::Mat cameraMatrix, distCoeffs, R, T;
  /*
   * Performing camera calibration by
   * passing the value of known 3D points (objpoints)
   * and corresponding pixel coordinates of the
   * detected corners (imgpoints)
   */
  cv::calibrateCamera(objpoints, imgpoints, cv::Size(gray.rows, gray.cols),
                      cameraMatrix, distCoeffs, R, T);
  std::cout << "cameraMatrix : " << cameraMatrix << std::endl;
  std::cout << "distCoeffs : " << distCoeffs << std::endl;
  std::cout << "Rotation vector : " << R << std::endl;
  std::cout << "Translation vector : " << T << std::endl;
  return 0;
}

Learn All About Computer Vision

Camera calibration is vital to computer vision, ensuring accurate


measurements and reliable analysis in various applications. To enhance
your understanding and proficiency in camera calibration, consider joining
our Blackbelt program for Data Science. This comprehensive program
equips learners with in-depth knowledge, hands-on experience, and
industry-relevant skills to excel in computer vision and data science. Take
advantage of the opportunity to become a proficient practitioner through
the Blackbelt program and advance your career.
A Comprehensive Guide to UNET Architecture | Mastering Image Segmentation

Premanand S 05 Nov, 2023 • 17 min read

Introduction

In the exciting subject of computer vision, where images contain many


secrets and information, distinguishing and highlighting items is crucial.
Image segmentation, the process of splitting images into meaningful
regions or objects, is essential in various applications ranging from
medical imaging to autonomous driving and object recognition. Accurate
and automatic segmentation has long been challenging, with traditional
approaches frequently falling short in accuracy and efficiency. Enter the
UNET architecture, an intelligent method that has revolutionized image
segmentation. With its simple design and inventive techniques, UNET has
paved the way for more accurate and robust segmentation findings.
Whether you are a newcomer to the exciting field of computer vision or an
experienced practitioner looking to improve your segmentation abilities,
this in-depth blog article will unravel the complexities of UNET and provide
a complete understanding of its architecture, components, and usefulness.

This article was published as a part of the Data Science Blogathon .


Understanding Convolution Neural Network

CNNs are a class of deep learning models frequently employed in computer vision tasks, including image classification, object recognition, and image segmentation. CNNs are designed mainly to learn and extract relevant information from images, making them extremely useful for visual data analysis.

The critical components of CNNs

 Convolutional Layers: CNNs comprise a collection of learnable filters (kernels) convolved with the input picture or feature maps. Each filter applies element-wise multiplication and summing to produce a feature map highlighting specific patterns or local features in the input. These filters can capture many visual elements, such as edges, corners, and textures.
 Pooling Layers: The feature maps created by the convolutional layers are downsampled using pooling layers. Pooling reduces the spatial dimensions of the feature maps while maintaining the most critical information, lowering the computational complexity of succeeding layers and making the model more resistant to input fluctuations. The most common pooling operation is max pooling, which takes the most significant value within a given neighborhood.
 Activation Functions: Activation functions introduce non-linearity into the CNN model. They are applied element-wise to the outputs of convolutional or pooling layers, allowing the network to understand complicated associations and make non-linear decisions. Because of its simplicity and efficiency in addressing the vanishing gradient problem, the Rectified Linear Unit (ReLU) activation function is common in CNNs.
 Fully Connected Layers: Fully connected layers, also called dense layers, use the retrieved features to complete the final classification or regression operation. They connect every neuron in one layer to every neuron in the next, allowing the network to learn global representations and make high-level judgments based on the previous layers' combined input.
The network begins with a stack of convolutional layers to capture low-level features, followed by pooling layers. Deeper convolutional layers learn higher-level characteristics as the network goes deeper. Finally, one or more fully connected layers perform the classification or regression operation.
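To make this stack concrete, a minimal classification CNN in Keras might look like the sketch below (the layer sizes, input shape, and the ten output classes are arbitrary illustrative choices):

from tensorflow.keras import layers, models

# Conv -> ReLU -> Pool blocks for feature extraction, then dense layers
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(128, 128, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax'),   # e.g. 10 output classes
])
model.summary()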

Need for a Fully Connected Network

Traditional CNNs are generally intended for image classification jobs in


which a single label is assigned to the whole input image. On the other
hand, traditional CNN architectures have problems with finer-grained tasks
like semantic segmentation, in which each pixel of an image must be
sorted into various classes or regions. Fully Convolutional Networks
(FCNs) come into play here.

Limitations of Traditional CNN Architectures in

Segmentation Tasks

Loss of Spatial Information: Traditional CNNs use pooling layers to


gradually reduce the spatial dimensionality of feature maps. While this
downsampling helps capture high-level features, it results in a loss of
spatial information, making it difficult to precisely detect and split objects
at the pixel level.

Fixed Input Size: CNN architectures are often built to accept images of a
specific size. However, the input images might have various dimensions in
segmentation tasks, making variable-sized inputs challenging to manage
with typical CNNs.

Limited Localisation Accuracy: Traditional CNNs often use fully


connected layers at the end to provide a fixed-size output vector for
classification. Because they do not retain spatial information, they cannot
precisely localize objects or regions within the image.
Fully Convolutional Networks (FCNs) as a

Solution for Semantic Segmentation

By working exclusively on convolutional layers and maintaining spatial


information throughout the network, Fully Convolutional Networks (FCNs)
address the constraints of classic CNN architectures in segmentation
tasks. FCNs are intended to make pixel-by-pixel predictions, with each
pixel in the input image assigned a label or class. FCNs enable the
construction of a dense segmentation map with pixel-level forecasts by
upsampling the feature maps. Transposed convolutions (also known as deconvolutions or upsampling layers) replace the fully connected layers at the end of the CNN design. They increase the spatial resolution of the feature maps, allowing the output to be the same size as the input image.

During upsampling, FCNs generally use skip connections, bypassing


specific layers and directly linking lower-level feature maps with higher-
level ones. These skip relationships aid in preserving fine-grained details
and contextual information, boosting the segmented regions’ localization
accuracy. FCNs are extremely effective in various segmentation
applications, including medical picture segmentation, scene parsing, and
instance segmentation. It can now handle input images of various sizes,
provide pixel-level predictions, and keep spatial information across the
network by leveraging FCNs for semantic segmentation.

Image Segmentation

Image segmentation is a fundamental process in computer vision in which


an image is divided into many meaningful and separate parts or segments.
In contrast to image classification, which provides a single label to a
complete image, segmentation adds labels to each pixel or group of pixels,
essentially splitting the image into semantically significant parts. Image
segmentation is important because it allows for a more detailed
comprehension of the contents of an image. We can extract considerable
information about object boundaries, forms, sizes, and spatial
relationships by segmenting a picture into multiple parts. This fine-grained
analysis is critical in various computer vision tasks, enabling improved
applications and supporting higher-level visual data interpretations.

Understanding the UNET Architecture

Traditional image segmentation technologies, such as manual annotation


and pixel-wise classification, have various disadvantages that make them
wasteful and difficult for accurate and effective segmentation jobs.
Because of these constraints, more advanced solutions, such as
the UNET architecture, have been developed. Let us look at the flaws of
previous ways and why UNET was created to overcome these issues.

 Manual Annotation: Manual annotation entails sketching and


marking image boundaries or regions of interest. While this method
produces reliable segmentation results, it is time-consuming, labor-
intensive, and susceptible to human mistakes. Manual annotation is
not scalable for large datasets, and maintaining consistency and
inter-annotator agreement is difficult, especially in sophisticated
segmentation tasks.

 Pixel-wise Classification: Another common approach is pixel-wise


classification, in which each pixel in an image is classified
independently, generally using algorithms such as decision trees,
support vector machines (SVM), or random forests. Pixel-wise
categorization, on the other hand, struggles to capture global
context and dependencies among surrounding pixels, resulting in
over- or under-segmentation problems. It cannot consider spatial
relationships and frequently fails to offer accurate object
boundaries.

Overcomes Challenges
The UNET architecture was developed to address these limitations and
overcome the challenges faced by traditional approaches to image
segmentation. Here’s how UNET tackles these issues:

 End-to-End Learning: UNET takes an end-to-end learning


technique, which means it learns to segment images directly from
input-output pairs without user annotation. UNET can automatically
extract key features and execute accurate segmentation by training
on a large labeled dataset, removing the need for labor-intensive
manual annotation.

 Fully Convolutional Architecture: UNET is based on a fully


convolutional architecture, which implies that it is entirely made up
of convolutional layers and does not include any fully connected
layers. This architecture enables UNET to function on input images
of any size, increasing its flexibility and adaptability to various
segmentation tasks and input variations.

 U-shaped Architecture with Skip Connections: The network’s


characteristic architecture includes an encoding path (contracting
path) and a decoding path (expanding path), allowing it to collect
local information and global context. Skip connections bridge the
gap between the encoding and decoding paths, maintaining critical
information from previous layers and allowing for more precise
segmentation.

 Contextual Information and Localisation: The skip connections


are used by UNET to aggregate multi-scale feature maps from
multiple layers, allowing the network to absorb contextual
information and capture details at different levels of abstraction.
This information integration improves localization accuracy, allowing
for exact object boundaries and accurate segmentation results.

 Data Augmentation and Regularization: UNET employs data


augmentation and regularisation techniques to improve its resilience
and generalization ability during training. To increase the diversity of
the training data, data augmentation entails adding numerous
transformations to the training images, such as rotations, flips,
scaling, and deformations. Regularisation techniques such as
dropout and batch normalization prevent overfitting and improve
model performance on unknown data.

Overview of the UNET Architecture

UNET is a fully convolutional neural network (FCN) architecture built for


image segmentation applications. It was first proposed in 2015 by Olaf
Ronneberger, Philipp Fischer, and Thomas Brox. UNET is frequently
utilized for its accuracy in picture segmentation and has become a popular
choice in various medical imaging applications. UNET combines an
encoding path, also called the contracting path, with a decoding path
called the expanding path. The architecture is named after its U-shaped
look when depicted in a diagram. Because of this U-shaped architecture,
the network can record both local features and global context, resulting in
exact segmentation results.

Critical Components of the UNET Architecture

 Contracting Path (Encoding Path): UNET’s contracting path


comprises convolutional layers followed by max pooling operations.
This method captures high-resolution, low-level characteristics by
gradually lowering the spatial dimensions of the input image.

 Expanding Path (Decoding Path): Transposed convolutions, also


known as deconvolutions or upsampling layers, are used for
upsampling the feature maps from the encoding path in the UNET
expansion path. The feature maps’ spatial resolution is increased
during the upsampling phase, allowing the network to reconstitute a
dense segmentation map.

 Skip Connections: Skip connections are used in UNET to connect


matching layers from encoding to decoding paths. These links
enable the network to collect both local and global data. The
network retains essential spatial information and improves
segmentation accuracy by integrating feature maps from earlier
layers with those in the decoding route.
 Concatenation: Concatenation is commonly used to implement
skip connections in UNET. The feature maps from the encoding path
are concatenated with the upsampled feature maps from the
decoding path during the upsampling procedure. This concatenation
allows the network to incorporate multi-scale information for
appropriate segmentation, exploiting high-level context and low-
level features.

 Fully Convolutional Layers: UNET comprises convolutional layers


with no fully connected layers. This convolutional architecture
enables UNET to handle images of unlimited sizes while preserving
spatial information across the network, making it flexible and
adaptable to various segmentation tasks.

Encoding Path in UNET

The encoding path, or the contracting path, is an essential component of


UNET architecture. It is responsible for extracting high-level information
from the input image while gradually shrinking the spatial dimensions.

Convolutional Layers

The encoding process begins with a set of convolutional layers.


Convolutional layers extract information at multiple scales by applying a
set of learnable filters to the input image. These filters operate on the local
receptive field, allowing the network to catch spatial patterns and minor
features. With each convolutional layer, the depth of the feature maps
grows, allowing the network to learn more complicated representations.

Activation Function

Following each convolutional layer, an activation function such as the


Rectified Linear Unit (ReLU) is applied element by element to induce non-
linearity into the network. The activation function aids the network in
learning non-linear correlations between input images and retrieved
features.

Pooling Layers
Pooling layers are used after the convolutional layers to reduce the spatial
dimensionality of the feature maps. The operations, such as max pooling,
divide feature maps into non-overlapping regions and keep only the
maximum value inside each zone. It reduces the spatial resolution by
down-sampling feature maps, allowing the network to capture more
abstract and higher-level data.

The encoding path’s job is to capture features at various scales and levels
of abstraction in a hierarchical manner. The encoding process focuses on
extracting global context and high-level information as the spatial
dimensions decrease.

Skip Connections

The availability of skip connections that connect appropriate levels from


the encoding path to the decoding path is one of the UNET architecture’s
distinguishing features. These skip links are critical in maintaining key data
during the encoding process.

Feature maps from prior layers collect local details and fine-grained
information during the encoding path. These feature maps are
concatenated with the upsampled feature maps in the decoding pipeline
utilizing skip connections. This allows the network to incorporate multi-
scale data, low-level features and high-level context into the segmentation
process.

By conserving spatial information from prior layers, UNET can reliably


localize objects and keep finer details in segmentation results. UNET’s
skip connections aid in addressing the issue of information loss caused by
downsampling. The skip links allow for more excellent local and global
information integration, improving segmentation performance overall.

To summarise, the UNET encoding approach is critical for capturing high-


level characteristics and lowering the spatial dimensions of the input
image. The encoding path extracts progressively abstract representations
via convolutional layers, activation functions, and pooling layers. By
integrating local features and global context, introducing skip links allows
for preserving critical spatial information, facilitating reliable segmentation
outcomes.
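A minimal Keras sketch of one contracting-path block as described above, with two 3x3 convolutions followed by 2x2 max pooling (the filter count and the function name are illustrative assumptions, not code from the original UNET paper):

from tensorflow.keras import layers

def encoder_block(inputs, n_filters):
    # Two conv + ReLU layers extract features at the current resolution
    x = layers.Conv2D(n_filters, 3, padding='same', activation='relu')(inputs)
    x = layers.Conv2D(n_filters, 3, padding='same', activation='relu')(x)
    skip = x                           # kept for the skip connection
    down = layers.MaxPooling2D(2)(x)   # halves the spatial dimensions
    return down, skip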

Decoding Path in UNET

A critical component of the UNET architecture is the decoding path, also


known as the expanding path. It is responsible for upsampling the
encoding path’s feature maps and constructing the final segmentation
mask.

Upsampling Layers (Transposed Convolutions)

To boost the spatial resolution of the feature maps, the UNET decoding
method includes upsampling layers, frequently done using transposed
convolutions or deconvolutions. Transposed convolutions are essentially
the opposite of regular convolutions. They enhance spatial dimensions
rather than decrease them, allowing for upsampling. By constructing a
sparse kernel and applying it to the input feature map, transposed
convolutions learn to upsample the feature maps. The network learns to fill
in the gaps between the current spatial locations during this process, thus
boosting the resolution of the feature maps.

Concatenation

The feature maps from the preceding layers are concatenated with the
upsampled feature maps during the decoding phase. This concatenation
enables the network to aggregate multi-scale information for correct
segmentation, leveraging high-level context and low-level features. Aside
from upsampling, the UNET decoding path includes skip connections from
the encoding path’s comparable levels.

The network may recover and integrate fine-grained characteristics lost


during encoding by concatenating feature maps from skip connections. It
enables more precise object localization and delineation in the
segmentation mask.
The decoding process in UNET reconstructs a dense segmentation map
that fits with the spatial resolution of the input picture by progressively
upsampling the feature maps and including skip links.

The decoding path’s function is to recover spatial information lost during


the encoding path and refine the segmentation findings. It combines low-
level encoding details with high-level context gained from the upsampling
layers to provide an accurate and thorough segmentation mask.

UNET can boost the spatial resolution of the feature maps by using
transposed convolutions in the decoding process, thereby upsampling
them to match the original image size. Transposed convolutions assist the
network in generating a dense and fine-grained segmentation mask by
learning to fill in the gaps and expand the spatial dimensions.

In summary, the decoding process in UNET reconstructs the segmentation
mask by enhancing the spatial resolution of the feature maps via
upsampling layers and skip connections. Transposed convolutions are
critical in this phase because they allow the network to upsample the
feature maps and build a detailed segmentation mask that matches the
original input image.

Contracting and Expanding Paths in UNET

The UNET architecture follows an “encoder-decoder” structure, where the
contracting path represents the encoder, and the expanding path
represents the decoder. This design resembles encoding information into
a compressed form and then decoding it to reconstruct the original data.

Contracting Path (Encoder)

The encoder in UNET is the contracting path. It extracts context and
compresses the input image by gradually decreasing the spatial
dimensions. This method includes convolutional layers followed by pooling
procedures such as max pooling to downsample the feature maps. The
contracting path is responsible for obtaining high-level characteristics,
learning global context, and decreasing spatial resolution. It focuses on
compressing and abstracting the input image, efficiently capturing relevant
information for segmentation.

Expanding Path (Decoder)

The decoder in UNET is the expanding path. By upsampling the feature
maps from the contracting path, it recovers spatial information and
generates the final segmentation map. The expanding route comprises
upsampling layers, often performed with transposed convolutions or
deconvolutions, to increase the spatial resolution of the feature maps. The
expanding path reconstructs the original spatial dimensions via skip
connections by integrating the upsampled feature maps with the
equivalent maps from the contracting path. This method enables the
network to recover fine-grained features and properly localize items.

The UNET design captures global context and local details by combining
contracting and expanding pathways. The contracting path compresses
the input image into a compact representation, which the expanding path
then uses to build a detailed segmentation map. The expanding path is
concerned with decoding the compressed representation into a dense and
precise segmentation map. It reconstructs the missing spatial information
and refines the segmentation results. This encoder-decoder structure
enables precise segmentation using high-level context and fine-grained
spatial information.

In summary, UNET’s contracting and expanding routes resemble an
“encoder-decoder” structure. The expanding path is the decoder,
recovering spatial information and generating the final segmentation map.
In contrast, the contracting path serves as the encoder, capturing context
and compressing the input image. This architecture enables UNET to
encode and decode information effectively, allowing for accurate and
thorough image segmentation.

Skip Connections in UNET

Skip connections are essential to the UNET design because they allow
information to travel between the contracting (encoding) and expanding
(decoding) paths. They are critical for maintaining spatial information and
improving segmentation accuracy.

Preserving Spatial Information

Some spatial information may be lost during the encoding path as the
feature maps undergo downsampling procedures such as max pooling.
This information loss can lead to lower localization accuracy and a loss of
fine-grained details in the segmentation mask.

By establishing direct connections between corresponding layers in the
encoding and decoding processes, skip connections help to address this
issue. Skip connections protect vital spatial information that would
otherwise be lost during downsampling. These connections allow
information from the encoding stream to avoid downsampling and be
transmitted directly to the decoding path.
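As a rough illustration (a sketch with made-up tensor shapes, not the tutorial’s code), a skip connection simply concatenates an encoder feature map with the upsampled decoder feature map of the same spatial size:

import tensorflow as tf

# Illustrative shapes: an encoder feature map saved before pooling,
# and the matching upsampled feature map from the decoder
encoder_features = tf.random.normal((1, 64, 64, 32))
decoder_upsampled = tf.random.normal((1, 64, 64, 32))

# The skip connection: concatenate along the channel axis
merged = tf.keras.layers.concatenate([decoder_upsampled, encoder_features])
print(merged.shape)  # (1, 64, 64, 64): local detail and upsampled context side by side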

Multi-scale Information Fusion

Skip connections allow the merging of multi-scale information from many
network layers. Later levels of the encoding process capture high-level
context and semantic information, whereas earlier layers catch local
details and fine-grained information. UNET may successfully combine
local and global information by connecting these feature maps from the
encoding path to the equivalent layers in the decoding path. This
integration of multi-scale information improves segmentation accuracy
overall. The network can use low-level data from the encoding path to
refine segmentation findings in the decoding path, allowing for more
precise localization and better object boundary delineation.

Combining High-Level Context and Low-Level Details

Skip connections allow the decoding path to combine high-level context
and low-level details. The concatenated feature maps from the skip
connections include the decoding path’s upsampled feature maps and the
encoding path’s feature maps.

This combination enables the network to take advantage of the high-level
context carried along the decoding path and the fine-grained features
captured in the encoding path. The network can incorporate information at
several scales, allowing for more precise and detailed segmentation.

UNET may take advantage of multi-scale information, preserve spatial
details, and merge high-level context with low-level details by adding skip
connections. As a result, segmentation accuracy improves, object
localization improves, and fine-grained information in the segmentation
mask is retained.

In conclusion, skip connections in UNET are critical for maintaining
spatial information, integrating multi-scale information, and boosting
segmentation accuracy. They provide direct information flow across the
encoding and decoding routes, allowing the network to collect local and
global details, resulting in more precise and detailed image segmentation.

Loss Function in UNET

It is critical to select an appropriate loss function while training UNET and
optimizing its parameters for picture segmentation tasks. UNET frequently
employs segmentation-friendly loss functions such as the Dice coefficient
loss or cross-entropy loss.

Dice Coefficient Loss

The Dice coefficient is a similarity metric that measures the overlap
between the predicted and ground truth segmentation masks. The Dice
coefficient loss, or soft Dice loss, is calculated by subtracting the Dice
coefficient from one. When the predicted and ground truth masks align well,
the loss decreases, corresponding to a higher Dice coefficient.

The Dice coefficient loss is especially effective for unbalanced datasets in
which the background class has many pixels. By penalizing false positives
and false negatives, it encourages the network to segment both foreground
and background regions accurately.
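As a rough sketch (written in TensorFlow with a hypothetical smoothing constant; this is not code from the tutorial), a soft Dice loss for binary masks can be implemented as follows:

import tensorflow as tf

def dice_loss(y_true, y_pred, smooth=1e-6):
    # Flatten the masks and measure the overlap between prediction and ground truth
    y_true = tf.reshape(tf.cast(y_true, tf.float32), [-1])
    y_pred = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true * y_pred)
    dice = (2.0 * intersection + smooth) / (tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + smooth)
    return 1.0 - dice  # the loss shrinks as the masks overlap more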

Cross-Entropy Loss

Cross-entropy loss is another widely used loss function in image
segmentation tasks. It measures the dissimilarity between the predicted
class probabilities and the ground truth labels. In image segmentation,
each pixel is treated as an independent classification problem, and the
cross-entropy loss is computed pixel-wise.

The cross-entropy loss encourages the network to assign high
probabilities to the correct class labels for each pixel. It penalizes
deviations from the ground truth, promoting accurate segmentation results.
This loss function is effective when the foreground and background
classes are balanced or when multiple classes are involved in the
segmentation task.

The choice between the Dice coefficient loss and cross-entropy loss
depends on the segmentation task’s specific requirements and the
dataset’s characteristics. Both loss functions have advantages and can be
combined or customized based on specific needs.
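For instance, one common option (an assumption on our part, not something prescribed by the UNET paper) is to simply add the pixel-wise cross-entropy and the soft Dice loss sketched above so the network benefits from both signals:

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()

def bce_dice_loss(y_true, y_pred):
    # Pixel-wise binary cross-entropy plus the soft Dice loss defined in the sketch above
    y_true = tf.cast(y_true, tf.float32)
    return bce(y_true, y_pred) + dice_loss(y_true, y_pred)

# Hypothetical usage with the model built below:
# model.compile(optimizer='adam', loss=bce_dice_loss, metrics=['accuracy'])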

1: Importing Libraries

import tensorflow as tf
import os
import numpy as np
from tqdm import tqdm
from skimage.io import imread, imshow
from skimage.transform import resize
import matplotlib.pyplot as plt
import random

2: Image Dimensions – Settings

IMG_WIDTH = 128
IMG_HEIGHT = 128
IMG_CHANNELS = 3

3: Setting the Randomness

seed = 42
np.random.seed(seed)  # call the function rather than overwriting it

4: Importing the Dataset

# Data downloaded from https://www.kaggle.com/competitions/data-science-bowl-2018/data
# importing datasets
TRAIN_PATH = 'stage1_train/'
TEST_PATH = 'stage1_test/'

5: Reading all the Images Present in the Subfolder

train_ids = next(os.walk(TRAIN_PATH))[1]
test_ids = next(os.walk(TEST_PATH))[1]
6: Training

X_train = np.zeros((len(train_ids), IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS), dtype=np.uint8)
Y_train = np.zeros((len(train_ids), IMG_HEIGHT, IMG_WIDTH, 1), dtype=bool)

7: Resizing the Images

print('Resizing training images and masks')
for n, id_ in tqdm(enumerate(train_ids), total=len(train_ids)):
    path = TRAIN_PATH + id_
    img = imread(path + '/images/' + id_ + '.png')[:,:,:IMG_CHANNELS]
    img = resize(img, (IMG_HEIGHT, IMG_WIDTH), mode='constant', preserve_range=True)
    X_train[n] = img  # fill empty X_train with values from img
    mask = np.zeros((IMG_HEIGHT, IMG_WIDTH, 1), dtype=bool)
    for mask_file in next(os.walk(path + '/masks/'))[2]:
        mask_ = imread(path + '/masks/' + mask_file)
        mask_ = np.expand_dims(resize(mask_, (IMG_HEIGHT, IMG_WIDTH), mode='constant',
                                      preserve_range=True), axis=-1)
        mask = np.maximum(mask, mask_)
    Y_train[n] = mask

8: Testing the Images

# test images
X_test = np.zeros((len(test_ids), IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS), dtype=np.uint8)
sizes_test = []
print('Resizing test images')
for n, id_ in tqdm(enumerate(test_ids), total=len(test_ids)):
    path = TEST_PATH + id_
    img = imread(path + '/images/' + id_ + '.png')[:,:,:IMG_CHANNELS]
    sizes_test.append([img.shape[0], img.shape[1]])
    img = resize(img, (IMG_HEIGHT, IMG_WIDTH), mode='constant', preserve_range=True)
    X_test[n] = img

print('Done!')

9: Random Check of the Images

image_x = random.randint(0, len(train_ids) - 1)
imshow(X_train[image_x])
plt.show()
imshow(np.squeeze(Y_train[image_x]))
plt.show()

10: Building the Model

inputs = tf.keras.layers.Input((IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS))
s = tf.keras.layers.Lambda(lambda x: x / 255)(inputs)

11: Paths

#Contraction path
c1 = tf.keras.layers.Conv2D(16, (3, 3), activation='relu',
kernel_initializer='he_normal', padding='same')(s)
c1 = tf.keras.layers.Dropout(0.1)(c1)
c1 = tf.keras.layers.Conv2D(16, (3, 3), activation='relu',
kernel_initializer='he_normal', padding='same')(c1)
p1 = tf.keras.layers.MaxPooling2D((2, 2))(c1)
c2 = tf.keras.layers.Conv2D(32, (3, 3), activation='relu',
kernel_initializer='he_normal', padding='same')(p1)
c2 = tf.keras.layers.Dropout(0.1)(c2)
c2 = tf.keras.layers.Conv2D(32, (3, 3), activation='relu',
kernel_initializer='he_normal', padding='same')(c2)
p2 = tf.keras.layers.MaxPooling2D((2, 2))(c2)

c3 = tf.keras.layers.Conv2D(64, (3, 3), activation='relu',
kernel_initializer='he_normal', padding='same')(p2)
c3 = tf.keras.layers.Dropout(0.2)(c3)
c3 = tf.keras.layers.Conv2D(64, (3, 3), activation='relu',
kernel_initializer='he_normal', padding='same')(c3)
p3 = tf.keras.layers.MaxPooling2D((2, 2))(c3)

c4 = tf.keras.layers.Conv2D(128, (3, 3), activation='relu',
kernel_initializer='he_normal', padding='same')(p3)
c4 = tf.keras.layers.Dropout(0.2)(c4)
c4 = tf.keras.layers.Conv2D(128, (3, 3), activation='relu',
kernel_initializer='he_normal', padding='same')(c4)
p4 = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(c4)

c5 = tf.keras.layers.Conv2D(256, (3, 3), activation='relu',
kernel_initializer='he_normal', padding='same')(p4)
c5 = tf.keras.layers.Dropout(0.3)(c5)
c5 = tf.keras.layers.Conv2D(256, (3, 3), activation='relu',
kernel_initializer='he_normal', padding='same')(c5)

12: Expansion Paths

u6 = tf.keras.layers.Conv2DTranspose(128, (2, 2), strides=(2, 2), padding='same')(c5)
u6 = tf.keras.layers.concatenate([u6, c4])
c6 = tf.keras.layers.Conv2D(128, (3, 3), activation='relu',
kernel_initializer='he_normal',
padding='same')(u6)
c6 = tf.keras.layers.Dropout(0.2)(c6)
c6 = tf.keras.layers.Conv2D(128, (3, 3), activation='relu',
kernel_initializer='he_normal',
padding='same')(c6)

u7 = tf.keras.layers.Conv2DTranspose(64, (2, 2), strides=(2, 2), padding='same')(c6)
u7 = tf.keras.layers.concatenate([u7, c3])
c7 = tf.keras.layers.Conv2D(64, (3, 3), activation='relu',
kernel_initializer='he_normal',
padding='same')(u7)
c7 = tf.keras.layers.Dropout(0.2)(c7)
c7 = tf.keras.layers.Conv2D(64, (3, 3), activation='relu',
kernel_initializer='he_normal',
padding='same')(c7)

u8 = tf.keras.layers.Conv2DTranspose(32, (2, 2), strides=(2, 2), padding='same')(c7)
u8 = tf.keras.layers.concatenate([u8, c2])
c8 = tf.keras.layers.Conv2D(32, (3, 3), activation='relu',
kernel_initializer='he_normal',
padding='same')(u8)
c8 = tf.keras.layers.Dropout(0.1)(c8)
c8 = tf.keras.layers.Conv2D(32, (3, 3), activation='relu',
kernel_initializer='he_normal',
padding='same')(c8)

u9 = tf.keras.layers.Conv2DTranspose(16, (2, 2), strides=(2, 2), padding='same')(c8)
u9 = tf.keras.layers.concatenate([u9, c1], axis=3)
c9 = tf.keras.layers.Conv2D(16, (3, 3), activation='relu',
kernel_initializer='he_normal',
padding='same')(u9)
c9 = tf.keras.layers.Dropout(0.1)(c9)
c9 = tf.keras.layers.Conv2D(16, (3, 3), activation='relu',
kernel_initializer='he_normal',
padding='same')(c9)
13: Outputs

outputs = tf.keras.layers.Conv2D(1, (1, 1), activation='sigmoid')(c9)

14: Summary

model = tf.keras.Model(inputs=[inputs], outputs=[outputs])


model.compile(optimizer='adam', loss='binary_crossentropy',
metrics=['accuracy'])
model.summary()

15: Model Checkpoint

checkpointer = tf.keras.callbacks.ModelCheckpoint('model_for_nuclei.h5',
verbose=1, save_best_only=True)

callbacks = [
tf.keras.callbacks.EarlyStopping(patience=2, monitor='val_loss'),
tf.keras.callbacks.TensorBoard(log_dir='logs')]

results = model.fit(X_train, Y_train, validation_split=0.1, batch_size=16, epochs=25,
                    callbacks=callbacks)

16: Last Stage – Prediction

idx = random.randint(0, len(X_train))

preds_train = model.predict(X_train[:int(X_train.shape[0]*0.9)], verbose=1)


preds_val = model.predict(X_train[int(X_train.shape[0]*0.9):], verbose=1)
preds_test = model.predict(X_test, verbose=1)
preds_train_t = (preds_train > 0.5).astype(np.uint8)
preds_val_t = (preds_val > 0.5).astype(np.uint8)
preds_test_t = (preds_test > 0.5).astype(np.uint8)

# Perform a sanity check on some random training samples


ix = random.randint(0, len(preds_train_t) - 1)
imshow(X_train[ix])
plt.show()
imshow(np.squeeze(Y_train[ix]))
plt.show()
imshow(np.squeeze(preds_train_t[ix]))
plt.show()

# Perform a sanity check on some random validation samples


ix = random.randint(0, len(preds_val_t) - 1)
imshow(X_train[int(X_train.shape[0]*0.9):][ix])
plt.show()
imshow(np.squeeze(Y_train[int(Y_train.shape[0]*0.9):][ix]))
plt.show()
imshow(np.squeeze(preds_val_t[ix]))
plt.show()

Conclusion

In this comprehensive blog post, we have covered the UNET architecture
for image segmentation. By addressing the constraints of prior
methodologies, UNET architecture has revolutionized picture
segmentation. Its encoding and decoding routes, skip connections, and
other modifications, such as U-Net++, Attention U-Net, and Dense U-Net,
have proven highly effective in capturing context, maintaining spatial
information, and boosting segmentation accuracy. The potential for
accurate and automatic segmentation with UNET offers new pathways to
improve computer vision and beyond. We encourage readers to learn
more about UNET and experiment with its implementation to maximize its
utility in their picture segmentation projects.
Key Takeaways

1. Image segmentation is essential in computer vision tasks, allowing the
division of images into meaningful regions or objects.

2. Traditional approaches to image segmentation, such as manual
annotation and pixel-wise classification, have limitations in terms of
efficiency and accuracy.

3. The UNET architecture was developed to address these limitations and
achieve accurate segmentation results.

4. UNET is a fully convolutional neural network (FCN) combining an encoding
path to capture high-level features and a decoding path to generate the
segmentation mask.

5. Skip connections in UNET preserve spatial information, enhance feature
propagation, and improve segmentation accuracy.

6. UNET has found successful applications in medical imaging, satellite imagery
analysis, and industrial quality control, achieving notable benchmarks and
recognition in competitions.

A Step-by-Step Guide to Image Segmentation Techniques (Part 1)

Pulkit Sharma 11 Jun, 2024 • 15 min read


Introduction

What’s the first thing you do when attempting to cross the road? We
typically look left and right, take stock of the vehicles on the road, and
decide. In milliseconds, Our brain can analyze what kind of vehicle (car,
bus, truck, auto, etc.) is approaching us. Can machines do that?

The answer was an emphatic ‘no’ until a few years back. However, the rise
and advancements in computer vision have changed the game. We can
build computer vision models that can detect objects, determine their
shape, predict the direction they will go in, and many other things. You
might have guessed it—that’s the powerful technology behind self-driving
cars!

There are multiple ways of dealing with computer vision challenges. The
most popular approach I have encountered is based on identifying the
objects present in an image, aka object detection. But what if we want to
dive deeper? What if just detecting objects isn’t enough—we want to
analyze our image at a much more granular level?

As data scientists, we are always curious to dig deeper into the data.
Asking questions like these is why I love working in this field!

In this article, I will introduce you to image segmentation. It is a
powerful computer vision technique that builds upon the idea of object
detection and takes us to a whole new level of working with image data.
It opens up so many possibilities – it has blown my mind. You will learn
what image segmentation is in image processing, what its benefits are, and
how it works, so that by the end you have a complete picture of the
technique.


What Is Image Segmentation?

Image segmentation is a fundamental technique in computer vision that
involves dividing an image into meaningful parts. It’s like breaking down an
image into smaller pieces, like objects, groups of similar characteristics, or
even individual pixels.

Here’s a breakdown of what image segmentation is and what it does:

 Goal: Simplify and analyze images by separating them into different
segments. This makes it easier for computers to understand the
content of the image.

 Process: Assigns a label to each pixel in the image. Pixels with the
same label share certain properties, like color or brightness.

 Benefits:

o Enables object detection and recognition in images.

o Allows for more detailed analysis of specific image regions.

o Simplifies image processing tasks.

Let’s understand the image segmentation algorithm using a simple
example. Consider the below image:
There’s only one object here – a dog. We can build a straightforward cat-
dog classifier model and predict that there’s a dog in the given image. But
what if we have a cat and a dog in a single image?

We can train a multi-label classifier, for instance. However, there’s another
caveat—we won’t know the location of either animal or object in the image.

That’s where image localization comes into the picture (no pun intended!).
It helps us identify a single object’s location in the given image. We rely on
object detection (OD) if we have multiple objects present. We can predict
the location and class for each object using OD.
Before detecting the objects and even before classifying the image, we
need to understand what it consists of. Enter Image Segmentation.

How Does Image Segmentation Work?

We can divide or partition the image into various parts called segments.
It’s not a great idea to process the entire image at the same time, as there
will be regions in the image that do not contain any information. By
dividing the image into segments, we can use the important segments to
process the image. That, in a nutshell, is how image segmentation works.

An image is a collection or set of different pixels. We group the pixels that


have similar attributes using image segmentation. Take a moment to go
through the below visual (it’ll give you a practical idea of segmentation in
image processing):
Object detection builds a bounding box corresponding to each class in the
image. But it tells us nothing about the object’s shape—we only get the set
of bounding box coordinates. We want more information—this is too vague
for our purposes.

The image segmentation algorithm creates a pixel-wise mask for each


object in the image. This technique gives us a far more granular
understanding of the object(s) in the image.

Why do we need to go this deep? Can’t all image processing tasks be


solved using simple bounding box coordinates? Let’s take a real-world
example to answer this pertinent question.

What Is Image Segmentation Used For?

Cancer has long been a deadly illness. Even in today’s age of


technological advancements, cancer can be fatal if we don’t identify it at
an early stage. Detecting cancerous cells as quickly as possible can save
millions of lives.

The shape of the cancerous cells plays a vital role in determining the
severity of the cancer. You might have put the pieces together, but object
detection will not be very useful here. We will only generate bounding
boxes, which will not help us identify the shape of the cells.

Image Segmentation techniques make a MASSIVE impact here. They help


us approach this problem more granularly and get more meaningful
results. A win-win for everyone in the healthcare industry.
Source: Wikipedia

Here, we can see the shapes of all the cancerous cells. There are many
other applications where the Image segmentation algorithm is transforming
industries:

 Traffic Control Systems

 Self Driving Cars

 Locating objects in satellite images

There are even more applications where Image Segmentation algorithms


are very useful. Feel free to share them with me in the comments section

below this article – let’s see if we can build something together.

Different Types of Image Segmentation

We can broadly divide image segmentation techniques into two types.


Consider the below images:
Can you identify the difference between these two? Both images use
image segmentation techniques to identify and locate the people present.

 In image 1, every pixel belongs to a particular class (either


background or person). Also, all the pixels belonging to a particular
class are represented by the same color (background as black and
person as pink). This is an example of semantic segmentation

 Image 2 also assigns a particular class to each pixel of the image.


However, different objects of the same class have different colors
(Person 1 as red, Person 2 as green, background as black, etc.).
This is an example of instance segmentation

Let me quickly summarize what we’ve learned. If there are 5 people in an


image, semantic segmentation will focus on classifying all the people as a
single instance. Instance segmentation, however, will identify each of
these people individually.

So far, we have delved into the theoretical concepts of image processing


and segmentation. Let’s mix things up a bit – we’ll combine learning
concepts with implementing them in Python. I believe that’s the best way
to learn and remember any topic.

Region-based Segmentation

One simple way to segment different objects could be to use their pixel
values. An important point to note – the pixel values will be different for the
objects and the image’s background if there’s a sharp contrast between
them.

In this case, we can set a threshold value. The pixel values falling below or
above that threshold can be classified accordingly (as objects or
backgrounds). This technique is known as Threshold Segmentation.

If we want to divide the image into two regions (object and background),
we define a single threshold value. This is known as the global threshold.

If we have multiple objects along with the background, we must define


multiple thresholds. These thresholds are collectively known as the local
threshold.

Let’s implement what we’ve learned in this section. Download this


image and run the below code. It will give you a better understanding of
how thresholding works (you can use any image of your choice if you feel
like experimenting!).

First, we’ll import the required libraries.

from skimage.color import rgb2gray
import numpy as np
import cv2
import matplotlib.pyplot as plt
%matplotlib inline
from scipy import ndimage

Let’s read the downloaded image and plot it:

image = plt.imread('1.jpeg')
image.shape
plt.imshow(image)

It is a three-channel image (RGB). We need to convert it into grayscale so
that we have only a single channel. Doing this will also help us better
understand how the algorithm works.

Python Code:

gray = rgb2gray(image)
plt.imshow(gray, cmap='gray')

Now, we want to apply a certain threshold to this image. This threshold
should separate the image into two parts – the foreground and the
background. Before we do that, let’s quickly check the shape of this
image:

gray.shape

(192, 263)

The height and width of the image are 192 and 263, respectively. We will
take the mean of the pixel values and use that as a threshold. If the pixel
value exceeds our threshold, we can say it belongs to an object. The pixel
value will be treated as the background if it is less than the threshold. Let’s
code this:

gray_r = gray.reshape(gray.shape[0]*gray.shape[1])
for i in range(gray_r.shape[0]):
    if gray_r[i] > gray_r.mean():
        gray_r[i] = 1
    else:
        gray_r[i] = 0
gray = gray_r.reshape(gray.shape[0], gray.shape[1])
plt.imshow(gray, cmap='gray')

Nice! The darker region (black) represents the background, and the
brighter (white) region is the foreground. We can define multiple
thresholds as well to detect multiple objects:

gray = rgb2gray(image)
gray_r = gray.reshape(gray.shape[0]*gray.shape[1])
for i in range(gray_r.shape[0]):
    if gray_r[i] > gray_r.mean():
        gray_r[i] = 3
    elif gray_r[i] > 0.5:
        gray_r[i] = 2
    elif gray_r[i] > 0.25:
        gray_r[i] = 1
    else:
        gray_r[i] = 0
gray = gray_r.reshape(gray.shape[0], gray.shape[1])
plt.imshow(gray, cmap='gray')


There are four different segments in the above image. You can set
different threshold values and check how the segments are made. Some
of the advantages of this method are:

 Calculations are simpler

 Fast operation speed

 When the object and background have high contrast, this method
performs well

However, this approach has some limitations. When there is no significant


grayscale difference or an overlap of the grayscale pixel values, it
becomes very difficult to get accurate segments.

Edge Detection Segmentation

What divides two objects in an image? An edge is always between two


adjacent regions with different grayscale values (pixel values). The edges
can be considered as the discontinuous local features of an image.

We can use this discontinuity to detect edges and hence define a


boundary of the object. This helps us detect the shapes of multiple objects
in a given image. Now, the question is, how can we detect these edges?
This is where we can make use of filters and convolutions. Refer to this
article if you need to learn about these concepts.

The below visual will help you understand how a filter convolves over an
image :

Here’s the step-by-step process of how this works:

 Take the weight matrix

 Put it on top of the image

 Perform element-wise multiplication and get the output

 Move the weight matrix as per the stride chosen

 Convolve until all the pixels of the input are used

The values of the weight matrix define the output of the convolution and
help extract features from the input. Researchers have found that choosing
specific values for these weight matrices helps us detect horizontal or
vertical edges (or even a combination of the two).

One such weight matrix is the Sobel operator. It is typically used to detect
edges. The Sobel operator has two weight matrices—one for detecting
horizontal edges and the other for detecting vertical edges. Let me show
how these operators look, and we will then implement them in Python.

Sobel filter (horizontal) =

 1   2   1
 0   0   0
-1  -2  -1

Sobel filter (vertical) =

-1   0   1
-2   0   2
-1   0   1

Edge detection works by convolving these filters over the given image.
Let’s visualize them on this article .

image = plt.imread('index.png')
plt.imshow(image)

It should be fairly simple to understand how the edges are detected in this
image. Let’s convert it into grayscale and define the sobel filter (both
horizontal and vertical) that will be convolved over this image:

# converting to grayscale

gray = rgb2gray(image)
# defining the sobel filters
sobel_horizontal = np.array([np.array([1, 2, 1]), np.array([0, 0, 0]), np.array([-1, -2, -1])])
print(sobel_horizontal, 'is a kernel for detecting horizontal edges')

sobel_vertical = np.array([np.array([-1, 0, 1]), np.array([-2, 0, 2]), np.array([-1, 0, 1])])
print(sobel_vertical, 'is a kernel for detecting vertical edges')

Now, convolve this filter over the image using the convolve function of
the ndimage package from scipy.

out_h = ndimage.convolve(gray, sobel_horizontal, mode='reflect')

out_v = ndimage.convolve(gray, sobel_vertical, mode='reflect')

# here mode determines how the input array is extended when the filter overlaps a border.


Let’s plot these results:

plt.imshow(out_h, cmap='gray')

plt.imshow(out_v, cmap='gray')

Here, we can identify the horizontal and vertical edges. There is one more
type of filter that can detect both horizontal and vertical edges
simultaneously. This is called the laplace operator:

 1   1   1
 1  -8   1
 1   1   1

Let’s define this filter in Python and convolve it on the same image:
kernel_laplace = np.array([np.array([1, 1, 1]), np.array([1, -8, 1]), np.array([1, 1, 1])])
print(kernel_laplace, 'is a laplacian kernel')

Next, convolve the filter and print the output:

out_l = ndimage.convolve(gray, kernel_laplace, mode='reflect')
plt.imshow(out_l, cmap='gray')

Here, we can see that our method has detected both horizontal and
vertical edges. I encourage you to try it on different images and share your
results. Remember, the best way to learn is by practicing!
Clustering-based Image Segmentation

This idea might have come to you while reading about image
segmentation techniques. Can’t we use clustering techniques to divide
images into segments? We certainly can!

In this section, we’ll get an intuition of clustering (it’s always good to revise
certain concepts!) and how to use it to segment images.

Clustering is dividing the population (data points) into many groups, such
that data points in the same groups are more similar to other data points in
that group than those in other groups. These groups are known as
clusters.

K-means Clustering

One of the most commonly used clustering algorithms is k-means. Here,


the k represents the number of clusters (not to be confused with k-nearest
neighbor). Let’s understand how k-means works:

1. First, randomly select k initial clusters

2. Randomly assign each data point to any one of the k clusters

3. Calculate the centers of these clusters

4. Calculate the distance of all the points from the center of each
cluster

5. Depending on this distance, the points are reassigned to the
nearest cluster

6. Calculate the center of the newly formed clusters

7. Finally, repeat steps (4), (5) and (6) until either the center of the
clusters does not change or we reach the set number of iterations
The key advantage of using the k-means algorithm is that it is simple and
easy to understand. We are assigning the points to the clusters closest to
them.
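To make the steps above concrete, here is a minimal NumPy sketch of k-means on image pixels (an illustrative toy implementation of one common variant, not the scikit-learn code used later in this article; for brevity it does not handle empty clusters):

import numpy as np

def kmeans(pixels, k=5, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    # Pick k random pixels as the initial cluster centers
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)]
    for _ in range(iters):
        # Assign every pixel to its nearest center
        dists = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned pixels
        centers = np.array([pixels[labels == j].mean(axis=0) for j in range(k)])
    return centers, labels

# Usage on a flattened image of shape (num_pixels, 3), e.g. the pic_n array built below:
# centers, labels = kmeans(pic_n, k=5)
# segmented = centers[labels].reshape(pic.shape)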

How well does k-means segment objects in an image?

Let’s put our learning to the test and check how well k-means segment the
objects in an image. We will be using this image, so download it, read it
and, check its dimensions:

pic = plt.imread('1.jpeg')/255 # dividing by 255 to bring the pixel values between 0 and 1

print(pic.shape)

plt.imshow(pic)



It’s a 3-dimensional image of shape (192, 263, 3). To cluster the image
using k-means, we first need to convert it into a 2-dimensional array
whose shape is (length* width* channels). In our example, this will be
(192* 263, 3).

pic_n = pic.reshape(pic.shape[0]*pic.shape[1], pic.shape[2])
pic_n.shape

(50496, 3)

The image has been converted to a 2-dimensional array. Next, the k-means
algorithm is fitted to this reshaped array to obtain the clusters. The
cluster_centers_ attribute of k-means returns the cluster centers, and the
labels_ attribute gives us the label for each pixel (it tells us which pixel of
the image belongs to which cluster).

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, random_state=0).fit(pic_n)
pic2show = kmeans.cluster_centers_[kmeans.labels_]

I have chosen 5 clusters for this article, but you can play around with this
number and check the results. Now, let’s return the clusters to their
original shape, a 3-dimensional image, and plot the results.

cluster_pic = pic2show.reshape(pic.shape[0], pic.shape[1], pic.shape[2])
plt.imshow(cluster_pic)


Amazing, isn’t it? We can segment the image pretty well using just 5
clusters. I’m sure you’ll be able to improve the segmentation by increasing
the number of clusters.

K-means works well when we have a small dataset. It can segment the
objects in the image and give impressive results. However, when applied
to a large dataset (more images), the algorithm hits a roadblock.

It looks at all the samples at every iteration, so the time taken is too high.
Hence, it’s also too expensive to implement. And since k-means is a
distance-based algorithm, it only applies to convex datasets and is
unsuitable for clustering non-convex clusters.

Finally, let’s look at a simple, flexible, and general approach for


segmentation in image processing.

Mask R-CNN

Data scientists and researchers at Facebook AI Research (FAIR)


pioneered a deep learning architecture called Mask R-CNN that can create
a pixel-wise mask for each object in an image. This is a cool concept so
follow along closely!
Mask R-CNN is an extension of the popular Faster R-CNN object detection
architecture. Mask R-CNN adds a branch to the already existing Faster R-
CNN outputs. The Faster R-CNN method generates two things for each
object in the image:

 It’s class

 The bounding box coordinates

Mask R-CNN adds a third branch to this, which also outputs the object
mask. Take a look at the below image to get an intuition of how Mask R-
CNN works on the inside:

Source: arxiv.org

1. We take an image as input and pass it to the ConvNet, which
returns the feature map for that image

2. A region proposal network (RPN) is applied to these feature maps.
This returns the object proposals along with their objectness score

3. A RoI pooling layer is applied to these proposals to bring down all
the proposals to the same size

4. Finally, the proposals are passed to a fully connected layer to
classify and output the bounding boxes for objects. It also returns
the mask for each proposal

Mask R-CNN is the current state-of-the-art for image segmentation


techniques and runs at 5 fps.
Image Classification vs. Object Detection vs. Image Segmentation

Image classification, object detection, and image segmentation techniques


are all fundamental tasks in computer vision that analyze image content,
but they answer different questions about the image:

 Image Classification: What’s in the image? This is the most basic


task. The model assigns a single label to the entire image, like “cat”
or “landscape.” It’s like answering a multiple-choice question with
only one answer.

 Object Detection: What objects are in the image, and where are
they? This goes beyond classification. The model identifies specific
objects (cats, cars, people) and draws bounding boxes around them
to indicate their location. It’s like answering a multiple-choice
question where you can choose multiple answers and mark their
positions on the image.

 Image Segmentation: What are the exact shapes of the objects in


the image? This provides the most detail. The model assigns a label
to each pixel in the image, creating a kind of digital mask that
outlines the shape of each object. It’s like coloring each object in the
image with a different color to show their exact boundaries.

Summary of Image Segmentation Techniques

I have summarized the different image segmentation algorithms in the
table below. I suggest keeping this handy next time you’re working on an
image segmentation challenge or problem!

Region-Based Segmentation
Description: Separates the objects into different regions based on some threshold value(s).
Advantages: a. Simple calculations; b. Fast operation speed; c. When the object and background have high contrast, this method performs well.
Limitations: When there is no significant grayscale difference or an overlap of the grayscale pixel values, it becomes difficult to get accurate segments.

Edge Detection Segmentation
Advantages: It is good for images that have better contrast between objects.
Limitations: It is not suitable when there are too many edges in the image and if there is less contrast between objects.

Segmentation based on Clustering
Description: Divides the pixels of the image into homogeneous clusters.
Advantages: Works well on small datasets and generates excellent clusters.
Limitations: a. Computation time is too large and expensive; b. k-means is a distance-based algorithm and is not suitable for clustering non-convex clusters.

Mask R-CNN
Description: Gives three outputs for each object in the image: its class, bounding box coordinates, and object mask.
Advantages: a. Simple, flexible, and general approach; b. It is also the current state-of-the-art for image segmentation.
Limitations: High training time.

Image Segmentation in Image Processing

Image segmentation techniques are fundamental in image processing.


They divide a digital image into meaningful parts, like partitioning it into
different regions containing pixels with similar characteristics. This
simplifies the image and allows for a more focused analysis of specific
objects or areas of interest.

Here’s a breakdown of what image segmentation is and how it works:

Purpose:

 Simplify Complex Images: Segmenting an image breaks it down


into smaller, more manageable pieces, making it easier to analyze
specific regions or objects within the image.

 Extract Objects of Interest: Image segmentation allows you to


isolate specific objects from the background or other foreground
elements. This is crucial for tasks like object recognition, counting,
and tracking.

 Prepare Images for Further Processing: Segmentation can be a


pre-processing step for various image-processing tasks. By
segmenting the image, you can focus on relevant regions and
improve the accuracy of subsequent analysis.

How it Works:

 Grouping Pixels: Image segmentation algorithms group pixels in


an image based on shared characteristics. These characteristics
can include color, intensity, texture, or spatial location.

 Segmenting the Image: The segmentation process creates a new


image, often called a segmentation mask. This mask assigns a label
to each pixel, indicating the segment it belongs to.

Conclusion

This article is just beginning our journey to learn about image


segmentation. In the next article of this series, we will explore the
implementation of Mask R-CNN. So stay tuned!

The image segmentation algorithm has been a useful tool throughout my
deep learning career. The level of granularity I get from these techniques is
astounding. It always amazes me how much detail we can extract with a
few lines of code.
Computer Vision Tutorial:
Implementing Mask R-CNN for
Image Segmentation (with Python
Code)

Pulkit Sharma 30 May, 2024 • 14 min read

Overview

 Mask R-CNN is a state-of-the-art framework for Image


Segmentation tasks

 We will learn how Mask R-CNN works in a step-by-step manner

 We will also look at how to implement Mask R-CNN in Python and


use it for our own images

Introduction

I am fascinated by self-driving cars. The sheer complexity and mix of


different computer vision techniques that go into building a self-driving car
system is a dream for a data scientist like me.

So, I set about trying to understand the computer vision technique behind
how a self-driving car potentially detects objects. A simple object detection
framework might not work because it simply detects an object and draws a
fixed shape around it.

That’s a risky proposition in a real-world scenario. Imagine if there’s a


sharp turn in the road ahead and our system draws a rectangular box
around the road. The car might not be able to understand whether to turn
or go straight. That’s a potential disaster!

Instead, we need a technique that can detect the exact shape of the road
so our self-driving car system can safely navigate the sharp turns as well.

The latest state-of-the-art framework that we can use to build such a


system? That’s Mask R-CNN!

So, in this article, we will first quickly look at what image segmentation is.
Then we’ll look at the core of this article – the Mask R-CNN framework.
Finally, we will dive into implementing our own Mask R-CNN model in
Python, and we will also walk through a Mask R-CNN PyTorch
implementation. Let’s begin!


What is Mask R-CNN?

Mask R-CNN, which stands for Mask Region-based Convolutional Neural


Network, is a deep learning model that tackles computer vision tasks like
object detection and instance segmentation. It builds upon an existing
architecture called Faster R-CNN.

Two main types of image segmentation of Mask R-CNN


Semantic Segmentation involves categorizing every pixel in an image
into a distinct class. Picture a scene where individuals are strolling along a
road. Semantic segmentation is able to categorize all the pixels belonging
to individuals as a single group, despite there being numerous people.

Instance Segmentation: This method goes beyond semantic


segmentation by not only classifying pixels but also differentiating between
objects of the same class. In the same image with people walking,
instance segmentation would create a separate mask for each individual
person.

A Brief Overview of Image Segmentation

We learned the concept of image segmentation in part 1 of this series in a


lot of detail. We discussed what is image segmentation and its different
techniques, like region-based segmentation, edge detection segmentation,
and segmentation based on clustering.

I would recommend checking out that article first if you need a quick
refresher (or want to learn image segmentation from scratch).

I’ll quickly recap that article here. Image segmentation creates a pixel-wise
mask for each object in the image. This technique gives us a far more
granular understanding of the object(s) in the image. The image shown
below will help you to understand what image segmentation is:
Here, you can see that each object (which are the cells in this particular
image) has been segmented. This is how image segmentation works.

We also discussed the two types of image segmentation: Semantic


Segmentation and Instance Segmentation. Again, let’s take an example to
understand both of these types:

All 5 objects in the left image are people. Hence, semantic


segmentation will classify all the people as a single instance. Now, the
image on the right also has 5 objects (all of them are people). But here,
different objects of the same class have been assigned as different
instances. This is an example of instance segmentation.

Part one covered different techniques and their implementation in Python


to solve such image segmentation problems. In this article, we will be
implementing a state-of-the-art image segmentation technique called Mask
R-CNN to solve an instance segmentation problem.

Understanding Mask R-CNN

Mask R-CNN is basically an extension of Faster R-CNN. Faster R-CNN is
widely used for object detection tasks. For a given image, it returns the
class label and bounding box coordinates for each object in the image. So,
let’s say you pass the following image:

The Faster R-CNN model will return something like this:

The Mask R-CNN framework is built on top of Faster R-CNN. So, for a
given image, Mask R-CNN, in addition to the class label and bounding box
coordinates for each object, will also return the object mask.

Let’s first quickly understand how Faster R-CNN works. This will help us
grasp the intuition behind Mask R-CNN as well.

 Faster R-CNN first uses a ConvNet to extract feature maps from the
images

 These feature maps are then passed through a Region Proposal


Network (RPN) which returns the candidate bounding boxes

 We then apply an RoI pooling layer on these candidate bounding


boxes to bring all the candidates to the same size

 And finally, the proposals are passed to a fully connected layer to


classify and output the bounding boxes for objects
Once you understand how Faster R-CNN works, understanding Mask R-
CNN will be very easy. So, let’s understand it step-by-step starting from
the input to predicting the class label, bounding box, and object mask.

Backbone Model

Similar to the ConvNet that we use in Faster R-CNN to extract feature


maps from the image, we use the ResNet 101 architecture to extract
features from the images in Mask R-CNN. So, the first step is to take an
image and extract features using the ResNet 101 architecture. These
features act as an input for the next layer.

Region Proposal Network (RPN)

Now, we take the feature maps obtained in the previous step and apply a
region proposal network (RPN). This basically predicts if an object is
present in that region (or not). In this step, we get those regions or feature
maps which the model predicts contain some object.

Region of Interest (RoI)

The regions obtained from the RPN might be of different shapes, right?
Hence, we apply a pooling layer and convert all the regions to the same
shape. Next, these regions are passed through a fully connected network
so that the class label and bounding boxes are predicted.

Till this point, the steps are almost similar to how Faster R-CNN works.
Now comes the difference between the two frameworks. In addition to
this, Mask R-CNN also generates the segmentation mask.
For that, we first compute the region of interest so that the computation
time can be reduced. For all the predicted regions, we compute the
Intersection over Union (IoU) with the ground truth boxes. We can
compute IoU like this:

IoU = Area of the intersection / Area of the union

Now, only if the IoU is greater than or equal to 0.5, we consider that
as a region of interest. Otherwise, we neglect that particular region.
We do this for all the regions and then select only a set of regions for
which the IoU is greater than 0.5.
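As a quick illustration (a hedged sketch with hypothetical box coordinates, not part of the Mask R-CNN code used later), IoU for two axis-aligned boxes given as (x1, y1, x2, y2) can be computed like this:

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2); find the intersection rectangle first
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# e.g. iou((0, 0, 10, 10), (5, 5, 15, 15)) = 25 / 175, roughly 0.14, below the 0.5 threshold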

Let’s understand it using an example. Consider this image:

Here, the red box is the ground truth box for this image. Now, let’s say we
got 4 regions from the RPN as shown below:
Here, the IoU of Box 1 and Box 2 is possibly less than 0.5, whereas the
IoU of Box 3 and Box 4 is approximately greater than 0.5. Hence, we can
say that Box 3 and Box 4 are the regions of interest for this particular image
whereas Box 1 and Box 2 will be neglected.

Next, let’s see the final step of Mask R-CNN.

Segmentation Mask

Once we have the RoIs based on the IoU values, we can add a mask
branch to the existing architecture. This returns the segmentation mask for
each region that contains an object. It returns a mask of size 28 X 28 for
each region which is then scaled up for inference.

Again, let’s understand this visually. Consider the following image:


The segmentation mask for this image would look something like this:

Here, our model has segmented all the objects in the image. This is the
final step in Mask R-CNN where we predict the masks for all the objects in
the image.

Keep in mind that the training time for Mask R-CNN is quite high. It took
me somewhere around 1 to 2 days to train the Mask R-CNN on the
famous COCO dataset . So, for the scope of this article, we will not be
training our own Mask R-CNN model.

We will instead use the pretrained weights of the Mask R-CNN model
trained on the COCO dataset. Now, before we dive into the Python code,
let’s look at the steps to use the Mask R-CNN model to perform instance
segmentation.

Steps to implement Mask R-CNN

It’s time to perform some image segmentation tasks! We will be using


the mask rcnn framework created by the Data scientists and researchers
at Facebook AI Research (FAIR).

Let’s have a look at the steps which we will follow to perform image
segmentation using Mask R-CNN.

Step 1: Clone the repository

First, we will clone the mask rcnn repository which has the architecture for
Mask R-CNN. Use the following command to clone the repository:

git clone https://github.com/matterport/Mask_RCNN.git

Once this is done, we need to install the dependencies required by Mask


R-CNN.

Step 2: Install the dependencies

Here is a list of all the dependencies for Mask R-CNN:

 numpy

 scipy

 Pillow

 cython

 matplotlib

 scikit-image
 tensorflow>=1.3.0

 keras>=2.0.8

 opencv-python

 h5py

 imgaug

 IPython

You must install all these dependencies before using the Mask R-
CNN framework.

Step 3: Download the pre-trained weights (trained on MS COCO)

Next, we need to download the pretrained weights. You can use this
link to download the pre-trained weights. These weights are obtained from
a model that was trained on the MS COCO dataset. Once you have
downloaded the weights, paste this file in the samples folder of the
Mask_RCNN repository that we cloned in step 1.

Step 4: Predicting for our image

Finally, we will use the Mask R-CNN architecture and the pretrained
weights to generate predictions for our own images.

Once you’re done with these four steps, it’s time to jump into your Jupyter
Notebook! We will implement all these things in Python and then generate
the masks along with the classes and bounding boxes for objects in our
images.

What is Convolutional Neural Networks (CNN)?

Convolutional Neural Networks (CNNs) are a type of deep learning neural


network that excel at analyzing visual imagery like photos and videos.
They are inspired by the structure of the animal visual cortex, which is the
part of the brain responsible for processing vision.
Implementing Mask R-CNN in Python

So, are you ready to dive into Python and code your own image
segmentation model? Let’s begin!

To execute all the code blocks which I will be covering in this section,
create a new Python notebook inside the “samples” folder of the cloned
Mask_RCNN repository.

Let’s start by importing the required libraries:

import os
import sys
import random
import math
import numpy as np
import skimage.io
import matplotlib
import matplotlib.pyplot as plt

# Root directory of the project


ROOT_DIR = os.path.abspath("../")

import warnings
warnings.filterwarnings("ignore")

# Import Mask RCNN


sys.path.append(ROOT_DIR) # To find local version of the library
from mrcnn import utils
import mrcnn.model as modellib
from mrcnn import visualize
# Import COCO config
sys.path.append(os.path.join(ROOT_DIR, "samples/coco/"))  # To find local version
import coco
%matplotlib inline

Next, we will define the path for the pretrained weights and the images on
which we would like to perform segmentation:

# Directory to save logs and trained model


MODEL_DIR = os.path.join(ROOT_DIR, "logs")

# Local path to trained weights file


COCO_MODEL_PATH = os.path.join('', "mask_rcnn_coco.h5")

# Download COCO trained weights from Releases if needed


if not os.path.exists(COCO_MODEL_PATH):
utils.download_trained_weights(COCO_MODEL_PATH)

# Directory of images to run detection on


IMAGE_DIR = os.path.join(ROOT_DIR, "images")

If you have not placed the weights in the samples folder, this will again
download the weights. Now we will create an inference class which will be
used to infer the Mask R-CNN model:

class InferenceConfig(coco.CocoConfig):
    # Set batch size to 1 since we'll be running inference on
    # one image at a time. Batch size = GPU_COUNT * IMAGES_PER_GPU
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

config = InferenceConfig()
config.display()


What can you infer from the above summary? We can see the multiple
specifications of the Mask R-CNN model that we will be using.

So, the backbone is resnet101 as we have discussed earlier as well. The


mask shape that will be returned by the model is 28X28, as it is trained on
the COCO dataset. And we have a total of 81 classes (including the
background).

We can also see various other statistics as well, like:

 The input shape

 Number of GPUs to be used

 Validation steps, among other things.


You should spend a few moments and understand these specifications. If
you have any doubts regarding these specifications, feel free to ask me in
the comments section below.

Loading Weights

Next, we will create our model and load the pretrained weights which we
downloaded earlier. Make sure that the pretrained weights are in the same
folder as that of the notebook otherwise you have to give the location of
the weights file:

# Create model object in inference mode.


model = modellib.MaskRCNN(mode="inference",
model_dir='mask_rcnn_coco.hy', config=config)

# Load weights trained on MS-COCO


model.load_weights('mask_rcnn_coco.h5', by_name=True)

Now, we will define the classes of the COCO dataset which will help us in
the prediction phase:

# COCO Class names


class_names = ['BG', 'person', 'bicycle', 'car', 'motorcycle', 'airplane',
'bus', 'train', 'truck', 'boat', 'traffic light',
'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird',
'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear',
'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie',
'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
'kite', 'baseball bat', 'baseball glove', 'skateboard',
'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup',
'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple',
'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed',
'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote',
'keyboard', 'cell phone', 'microwave', 'oven', 'toaster',
'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors',
'teddy bear', 'hair drier', 'toothbrush']

Let’s load an image and try to see how the model performs. You can use
any of your images to test the model.

# Load a random image from the images folder


image = skimage.io.imread('sample.jpg')

# original image
plt.figure(figsize=(12,10))
skimage.io.imshow(image)

This is the image we will work with. You can clearly identify that there are
a couple of cars (one in the front and one in the back) along with a bicycle.
Making Predictions

It’s prediction time! We will use the Mask R-CNN model along with the
pretrained weights and see how well it segments the objects in the image.
We will first take the predictions from the model and then plot the results
to visualize them:

# Run detection
results = model.detect([image], verbose=1)

# Visualize results
r = results[0]
visualize.display_instances(image, r['rois'], r['masks'], r['class_ids'], class_names,
r['scores'])

Interesting. The model has done pretty well to segment both the cars as
well as the bicycle in the image. We can look at each mask or the
segmented objects separately as well. Let’s see how we can do that.
I will first take all the masks predicted by our model and store them in the
mask variable. Now, these masks are in the boolean form (True and
False) and hence we need to convert them to numbers (1 and 0). Let’s do
that first:

mask = r['masks']
mask = mask.astype(int)
mask.shape

Output:

(480,640,3)

This will give us an array of 0s and 1s, where 0 means that there is no
object at that particular pixel and 1 means that there is an object at that
pixel. Note that the shape of the mask is similar to that of the original
image (you can verify that by printing the shape of the original image).

However, the 3 here in the shape of the mask does not represent the
channels. Instead, it represents the number of objects segmented by our
model. Since the model has identified 3 objects in the above sample
image, the shape of the mask is (480, 640, 3). Had there been 5 objects,
this shape would have been (480, 640, 5).
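As a quick sanity check (this snippet is not part of the original tutorial; it only uses the r dictionary and class_names already defined above), we can print which classes those detected masks belong to:

# Map each detected class id to its label and print it with the detection score
for class_id, score in zip(r['class_ids'], r['scores']):
    print(class_names[class_id], round(float(score), 3))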

We now have the original image and the array of masks. To print or get
each segment from the image, we will create a for loop and multiply each
mask with the original image to get each segment:

for i in range(mask.shape[2]):
    temp = skimage.io.imread('sample.jpg')
    for j in range(temp.shape[2]):
        temp[:,:,j] = temp[:,:,j] * mask[:,:,i]
    plt.figure(figsize=(8,8))
    plt.imshow(temp)
This is how we can plot each mask or object from the image. This can
have a lot of interesting as well as useful use cases. Getting the segments
from the entire image can reduce the computation cost as we do not have
to preprocess the entire image now, but only the segments.

Inferences

Below are a few more results which I got using our Mask R-CNN model:
Looks awesome! You have just built your own image segmentation model
using Mask R-CNN – well done.

How to implement Mask R-CNN using PyTorch:

Install Required Libraries:

 Ensure you have torch, torchvision, and pycocotools installed. If not, install them using:

pip install torch torchvision pycocotools

1. Load the Pre-trained Mask R-CNN Model:

 PyTorch’s torchvision library provides a pre-trained Mask R-CNN model.

2. Prepare the Dataset:

 Mask R-CNN is typically trained on the COCO dataset, but you can prepare your own dataset following the COCO format.

3. Inference with Mask R-CNN:

 Load an image, preprocess it, and pass it through the model to get the predictions.

Here’s a sample code snippet demonstrating these steps:

import torch
import torchvision
from PIL import Image
import matplotlib.pyplot as plt
import torchvision.transforms as T

# Load a pre-trained Mask R-CNN model
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()  # Set the model to evaluation mode

# Load an image
img_path = 'path_to_your_image.jpg'
img = Image.open(img_path).convert("RGB")

# Preprocess the image
transform = T.Compose([T.ToTensor()])
img = transform(img)

# Add a batch dimension
img = img.unsqueeze(0)

# Perform inference
with torch.no_grad():
    prediction = model(img)

# Visualize the results
def plot_image(image, masks, boxes):
    fig, ax = plt.subplots(1, figsize=(12,9))
    ax.imshow(image)
    for i in range(len(masks)):
        # Each predicted mask has shape [1, H, W], so take the first channel
        mask = masks[i, 0].cpu().numpy()
        box = boxes[i].cpu().numpy()
        ax.imshow(mask, alpha=0.5)
        rect = plt.Rectangle((box[0], box[1]), box[2] - box[0], box[3] - box[1],
                             fill=False, color='red')
        ax.add_patch(rect)
    plt.show()

# Convert image back to numpy
img_np = img.squeeze().permute(1, 2, 0).cpu().numpy()

# Plot the image with masks and bounding boxes
plot_image(img_np, prediction[0]['masks'], prediction[0]['boxes'])
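In practice, the raw predictions include many low-confidence detections. A small, hedged post-processing sketch (the 0.5 score threshold and the variable names below are illustrative assumptions, not part of the original snippet) that keeps only confident detections before plotting:

# Keep only detections above a confidence threshold (0.5 here is an assumed, tunable value)
score_threshold = 0.5
scores = prediction[0]['scores']
keep = scores > score_threshold

filtered_masks = prediction[0]['masks'][keep]
filtered_boxes = prediction[0]['boxes'][keep]

plot_image(img_np, filtered_masks, filtered_boxes)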

Conclusion

This guide provides a comprehensive understanding of image segmentation, particularly focusing on the implementation of Mask R-CNN in Python. The step-by-step walkthrough covers the process from cloning the repository to predicting on your own images. By understanding Mask R-CNN and following the prescribed steps, you can effectively utilize pre-trained weights and execute segmentation tasks efficiently. This structured approach helps practitioners harness advanced segmentation techniques, improving their proficiency in image analysis across the many fields that rely on computer vision. You also now know how to implement Mask R-CNN in PyTorch.
Intersection over Union (IoU) in Object
Detection & Segmentation

Kukil
JUNE 28, 2022

Intersection over Union (IoU) is a number that quantifies the degree of overlap between two boxes. In the case of object detection and segmentation, IoU evaluates the overlap of the Ground Truth and Prediction regions.

If you are a computer vision practitioner or even an enthusiast, you must have
come across the term very often. It is the first checkpoint for evaluating the
accuracy of a model. In simple terms, it’s a metric that helps us measure the
correctness of a prediction.

In this blog post, you will get a detailed and intuitive explanation of the following.

✅ Intersection over Union in Object Detection

✅ Intersection over Union in Image Segmentation

✅ Implementing IoU using NumPy

✅ Implementing IoU using PyTorch


1. Intersection over Union (IoU) in Object Detection

1. Observations

2. Designing IoU Metric for Object Detection

3. Qualitative Analysis of Predictions

2. Intersection over Union (IoU) in Image Segmentation

3. A sample object detection example

4. Implementing IoU using NumPy

5. Implementing IoU using the built-in function box_iou in PyTorch

6. Implementing IoU manually using PyTorch

7. Conclusion
Intersection over Union in Object
Detection
Let’s go through the following example to understand how IoU is calculated.
Let three models- A, B, and C- be trained to predict birds. We pass an image
through the models where we already know the Ground Truth (marked in red).
The image below shows predictions of the models (marked in cyan).

1.1 Observations
 It is clear that the predicted box of Model A has more overlap with the Ground
Truth as compared to Model B.
 However, Model C has an even higher overlap with the ground truth. But it
also has a high overlap with the background.
 So from models B and C, it is clear that a metric based on only overlap is not a
fair one as we should also account for localization accuracy. It is not just about
matching the Ground Truth but how closely the prediction matches it.
 Therefore, we need a metric that penalizes the prediction whenever:

 The prediction fails to predict the area inside the Ground Truth.
 The prediction overflows the Ground Truth.

Keeping the above in mind, the IoU metric has been designed.
1.2 Designing Intersection over Union metric for
Object Detection

It is the ratio of the overlap area to the combined area of prediction and ground truth. The numerator will be smaller when the prediction fails to cover the area inside the ground truth. If the area of the predicted box is larger, the denominator will be larger, making the IoU lower.

IoU values range from 0 to 1, where 0 means no overlap and 1 means perfect overlap.

Looking closely, we are adding the area of the intersection twice in the denominator. So actually, we calculate IoU as shown in the illustration below.
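The illustration is an image in the original post; written out as a formula (our reconstruction), the calculation is:

IoU = Area of Intersection / Area of Union = I / (Area of Ground Truth + Area of Prediction − I)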

1.3 Qualitative Analysis of Predictions


With the help of the IoU threshold, we can decide whether a prediction is True Positive (TP), False Positive (FP), or False Negative (FN). The example below shows predictions with the IoU threshold ɑ set at 0.5.

The decision to mark a detection as True Positive or False Positive completely depends on the requirement.

 The first prediction is True Positive as the IoU threshold is 0.5.


 If we set the threshold at 0.97, it becomes a False Positive.
 Similarly, the second prediction shown above is False Positive due to the
threshold but can be True Positive if we set the threshold at 0.20.
 Theoretically, the third prediction can also be True Positive, given that we
lower the threshold all the way to 0.

Intersection over Union in Image Segmentation

IoU in object detection is a helper metric. However, in image segmentation, IoU is the primary metric to evaluate model accuracy.

In the case of Image Segmentation, the area is not necessarily rectangular. It can
have any regular or irregular shape. That means the predictions are segmentation
masks and not bounding boxes. Therefore, pixel-by-pixel analysis is done here.
Moreover, the definition of TP, FP, and FN is slightly different as it is not based
on a predefined threshold.
(a) True Positive: The area of intersection between the Ground Truth (GT) and the segmentation mask (S). Mathematically, this is the logical AND operation of GT and S.

(b) False Positive: The predicted area outside the Ground Truth. This is the logical OR of GT and segmentation, minus GT.

(c) False Negative: The number of pixels in the Ground Truth area that the model failed to predict. This is the logical OR of GT and segmentation, minus S.

We know from Object Detection that IoU is the ratio of the intersected area to the combined area of prediction and ground truth. Since the values of TP, FP, and FN are nothing but areas (or numbers of pixels), we can write IoU as follows.
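The formula appears as an image in the original post; reconstructed, it reads IoU = TP / (TP + FP + FN). A minimal pixel-wise sketch of this idea (assuming gt_mask and pred_mask are boolean NumPy arrays of the same shape; the function and variable names are ours, not part of the original code):

import numpy as np

def mask_iou(gt_mask, pred_mask):
    # Pixel-wise intersection (TP) and union (TP + FP + FN) of two boolean masks
    intersection = np.logical_and(gt_mask, pred_mask).sum()
    union = np.logical_or(gt_mask, pred_mask).sum()
    return intersection / union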
Note that we have already provided Colab notebooks for the PyTorch and NumPy versions in the code download section. Therefore, there is no need to install dependencies manually. However, if you run them locally, install PyTorch from the official source.

A Sample Object Detection Example

In the image above, the blue bounding box is the detected object. Given that the Ground Truth is known (shown in red), let us see how to implement the IoU calculation using NumPy and PyTorch. We will see the available built-in function and define manual functions as well.

In the order of top left to bottom right corner, the coordinates are,

☑ Ground truth: [1202, 123, 1650, 868]

☑ Prediction: [1162.0001, 92.0021, 1619.9832, 694.0033]
In practice, the predictions are obtained from model inference.


Implementing Intersection over Union using NumPy
Now that we know how IoU is calculated in theory let us define a function to
calculate IoU with our data, i.e., coordinates of the Ground Truth and Prediction.

(a). Import Dependencies for IoU


import numpy as np
np.__version__

(b). Defining a Function to Calculate IoU

Here, we find the coordinates of the bounding box surrounding the intersection
area. Then subtract the area of intersection from the sum of the area of Ground
Truth and Prediction. We add 1 while calculating height and width to counter
zero division errors. Theoretically, it is possible to add an infinitesimally small
positive value, say 0.0001. However, images are discrete. The minimum possible
dimension of an image is 1×1. Therefore, we have to add 1.

def get_iou(ground_truth, pred):
    # Coordinates of the area of intersection.
    ix1 = np.maximum(ground_truth[0], pred[0])
    iy1 = np.maximum(ground_truth[1], pred[1])
    ix2 = np.minimum(ground_truth[2], pred[2])
    iy2 = np.minimum(ground_truth[3], pred[3])

    # Intersection height and width.
    i_height = np.maximum(iy2 - iy1 + 1, np.array(0.))
    i_width = np.maximum(ix2 - ix1 + 1, np.array(0.))

    area_of_intersection = i_height * i_width

    # Ground Truth dimensions.
    gt_height = ground_truth[3] - ground_truth[1] + 1
    gt_width = ground_truth[2] - ground_truth[0] + 1

    # Prediction dimensions.
    pd_height = pred[3] - pred[1] + 1
    pd_width = pred[2] - pred[0] + 1

    area_of_union = gt_height * gt_width + pd_height * pd_width - area_of_intersection

    iou = area_of_intersection / area_of_union

    return iou

(c). Bounding Box Coordinates


ground_truth_bbox = np.array([1202, 123, 1650, 868], dtype=np.float32)

prediction_bbox = np.array([1162.0001, 92.0021, 1619.9832, 694.0033], dtype=np.float32)

(d). Get IoU Value


iou = get_iou(ground_truth_bbox, prediction_bbox)
print('IOU: ', iou)
Output

IOU: 0.6441399913136432

PyTorch Built-In Function for IoU


PyTorch already has a built-in function, box_iou [1], to calculate IoU. The documentation is linked in the Reference section. It takes the sets of bounding boxes as inputs and returns an IoU tensor.

# Import dependencies.
import torch
from torchvision import ops

# Bounding box coordinates.
ground_truth_bbox = torch.tensor([[1202, 123, 1650, 868]], dtype=torch.float)
prediction_bbox = torch.tensor([[1162.0001, 92.0021, 1619.9832, 694.0033]], dtype=torch.float)

# Get iou.
iou = ops.box_iou(ground_truth_bbox, prediction_bbox)
print('IOU : ', iou.numpy()[0][0])
Output

IOU : 0.6436676

Implementing IoU by defining a function in PyTorch

The code flow is similar to the NumPy implementation that we have done above.

(a). Import Dependencies


import torch
torch.__version__

(b). Function to Calculate IoU


def get_iou_torch(ground_truth, pred):
    # Coordinates of the area of intersection.
    ix1 = torch.max(ground_truth[0][0], pred[0][0])
    iy1 = torch.max(ground_truth[0][1], pred[0][1])
    ix2 = torch.min(ground_truth[0][2], pred[0][2])
    iy2 = torch.min(ground_truth[0][3], pred[0][3])

    # Intersection height and width.
    i_height = torch.max(iy2 - iy1 + 1, torch.tensor(0.))
    i_width = torch.max(ix2 - ix1 + 1, torch.tensor(0.))

    area_of_intersection = i_height * i_width

    # Ground Truth dimensions.
    gt_height = ground_truth[0][3] - ground_truth[0][1] + 1
    gt_width = ground_truth[0][2] - ground_truth[0][0] + 1

    # Prediction dimensions.
    pd_height = pred[0][3] - pred[0][1] + 1
    pd_width = pred[0][2] - pred[0][0] + 1

    area_of_union = gt_height * gt_width + pd_height * pd_width - area_of_intersection

    iou = area_of_intersection / area_of_union

    return iou

(c). Bounding Box Coordinates

The prediction bounding box is usually obtained while performing model inference. We are defining it manually here for the sake of simplicity.

ground_truth_bbox = torch.tensor([[1202, 123, 1650, 868]], dtype=torch.float)

prediction_bbox = torch.tensor([[1162.0001, 92.0021, 1619.9832, 694.0033]], dtype=torch.float)

(d). Get IoU Value


iou_val = get_iou_torch(ground_truth_bbox, prediction_bbox)
print('IOU : ', iou_val.numpy()[0][0])
Output

IOU : 0.64413995
We can see that the output varies slightly. This difference is introduced by adding 1 to counter the zero-division error. In practice, values are clamped to a Min-Max range. Here, let’s keep it as it is for the sake of simplicity. You can also look at the source code [2] for a better understanding.

Conclusion
So that’s all about Intersection over Union, or the Jaccard Index. In this blog post, we discussed the basics of IoU and why it is needed. You also learned the implementation of IoU using NumPy and PyTorch. It should be noted that IoU in object detection does not have the same meaning as in segmentation.

In object detection, IoU does not calculate the accuracy of a model directly. Rather, it is a helper metric that evaluates the degree of overlap between the ground truth and the prediction.

We have the Average Precision (AP) and Mean Average Precision (mAP) metrics for evaluating model accuracy. When we see mAP@0.5, mAP@0.75, etc., these are essentially mAP values calculated at IoU thresholds of 0.5 and 0.75, respectively. We will discuss mAP in more detail in a separate blog post.
Enhancing Image Segmentation using
U2-Net: An Approach to Efficient
Background Removal

Kunal Dawn
JUNE 11, 2024


U²-Net (popularly known as U2-Net) is a simple yet powerful deep-learning-based semantic segmentation model that revolutionizes background removal in image segmentation. Its effective and straightforward approach is crucial for applications where isolating foregrounds from backgrounds is beneficial and essential. This capability has significant implications for fields such as advertising, filmmaking, and medical imaging.

This article is aimed at intermediate to advanced readers interested


in mastering background subtraction using image segmentation. U2-
Net stands out by focusing on binary segmentation, offering a clear
advantage over traditional semantic segmentation models. Given the
limited high-quality content on this topic, our goal is to provide
valuable insights into using U2-Net for efficient background removal.
We will also discuss IS-Net, an enhanced version of U2-Net,
showcasing its superior results. This article will present impressive
outcomes on challenging images, demonstrating the model’s
capability. We hope to show that precise background removal is
achievable and encourage readers to apply these models to their
projects.

By the end of this article, you’ll understand why background removal


is a complex yet vital task and how U2-Net and IS-Net are making
significant strides in this area.

Scroll through the stunning inference results from U2-Net and


IS-Net for a quick look.

1. Why U2-Net for Foreground Estimation?

2. Architecture of ReSidual U-block (RSU) in the Context of U2-Net

3. U2-Net Architecture Explanation

4. Training and Evaluation Strategies for U2-Net

5. Qualitative Analysis of U2-Net Predictions

6. IS-Net Architecture: Advancing U2-Net for Image Segmentation

7. Inference Results from IS-Net

8. Key Takeaways

9. Conclusion

10. References

Why U2-Net for Foreground Estimation?
Deep learning segmentation architectures, such as Fully
Convolutional Networks (FCNs), tend to capture more semantic
information through local feature extraction as we go deeper through
the network with reduced feature map resolutions (as a consequence
of multiple pooling operations). However, they miss the global
contextual information extracted from the feature maps across
multiple scales.

Newer approaches, such as DeepLab, mitigate information loss


across multiple scales by increasing the network’s receptive field
through multiple dilated convolutions (also known as atrous
convolutions). However, this incurs significant computation costs
during training with higher image resolutions.

The authors propose the U2-Net architecture that can handle both
multi-level deep feature extraction and multi-scale information across
local and global contexts. This two-level nested modified U-Net-like
structure enables training without significant memory consumption
and computation costs. The core of each level in the architecture is
built upon the ReSidual U-block (RSU), which incorporates the
properties of a residual block and a U-Net-like symmetric encoder-
decoder structure.

Moreover, the U2-Net architecture performs better without using pre-


trained classification backbones, enabling it to be trained from
scratch.

In the next section, we will explore the RSU block in more detail.

Architecture of ReSidual U-block (RSU) in the Context of U2-Net
An RSU-L (L being the number of layers in the encoder) block can be
structurally represented as RSU-L(Cin, M, Cout), where:

 Cin is the channel of the input feature map


 M is the number of channels for the intermediate layers in the
encoder
 Cout is the channel for the output feature map.

Note that the spatial resolution of the output feature map from any
RSU-L block remains identical to that of the input feature map.

A concise representation of an RSU-7 block is provided in the paper,


as shown below.

Figure 1: ReSidual-U block (for L=7)


[Source-U2-Net: Going Deeper with Nested U-Structure for Salient
Object Detection]
The Residual-U blocks primarily comprise the three components:

1. An input convolution layer that transforms an input feature map (of

shape: [HxWxCin]) to an intermediate feature map of

shape HxWxCout to learn the local features.


2. A U-Net-like symmetric encoder-decoder block that takes the intermediate feature map F1(x) and learns to encode the multi-scale features U(F1(x)) (using the notation of the paper). These multi-scale features are extracted from the downsampled feature maps of the encoder layers through subsequent concatenation, convolution, and upsampling (in that order). The resolution of the final feature map is again HxWxCout.

The downsampling occurs due to the pooling operation, while a


“bilinear” upsampling is used in the decoding phase.

3. A residual connection to fuse the local and multi-scale features through addition: H_RSU(x) = U(F1(x)) + F1(x), following the paper’s notation.

The RSU block is analogous to a residual block, except that the plain convolution block used to learn local features is replaced by a U-Net-like structure that learns multi-scale features.

Figure 2: Comparison between Residual block and ReSidual-U block
[Source-U2-Net: Going Deeper with Nested U-Structure for Salient Object Detection]
In the diagram above, F1(x) and F2(F1(x)) are the feature representations learned from the weight (convolution) layers, while U(F1(x)) represents the multi-scale features learned from the encoder-decoder blocks.

We will now present a more detailed block diagram for RSU-7, which
has an input feature map with a resolution of 320x320x3. The
notations I, M, and O represent the number of input, intermediate,
and final output channels in the RSU block, respectively. The
diagram also ascertains the shapes resulting from convolution,
pooling, concatenation, and upsampling.

The REBNCONV block is the usual convolution followed by Batch


Norm and ReLU activation (in that order).

The default padding and dilation rate is 1*d, where d assumes a


default value of 1. So, if d is set to 2, then both padding and dilation
rate are set to 2.

The ⊕ symbol in the diagram below refers to the concatenation


operation.
Figure 3:
Complete block diagram for RSU-7 block with an input feature map
of 320x320x3

Code Explanation for RSU block

To better understand the implementation of the RSU7 module, let’s


first review the REBNCONV module and its _upsample_like helper
function:

1. The REBNCONV block applies the convolution followed by Batch


Norm and ReLU to an intermediate feature map (as discussed in the
previous diagram).

1 class REBNCONV(nn.Module):
2 def __init__(self,in_ch=3,out_ch=3,dirate=1):
3 super(REBNCONV,self).__init__()
4
5 self.conv_s1 = nn.Conv2d(in_ch,out_ch,3,padding=1*dirate,dilation=1*dirate)
6 self.bn_s1 = nn.BatchNorm2d(out_ch)
7 self.relu_s1 = nn.ReLU(inplace=True)
8
9 def forward(self,x):
10
11 hx = x
12 xout = self.relu_s1(self.bn_s1(self.conv_s1(hx)))
13
14 return xout
2. The function _upsample_like accepts feature maps src and tar and subsequently upsamples src to have the same spatial resolution as tar.

1 def _upsample_like(src,tar):
2
3 src = F.upsample(src,size=tar.shape[2:],mode='bilinear')
4
5 return src
We will begin by analyzing the RSU7 Module’s __init__ method for
initializing the encoder, decoder, and pooling layers.

1 class RSU7(nn.Module):#UNet07DRES(nn.Module):
2
3 def __init__(self, in_ch=3, mid_ch=12, out_ch=3):
4 super(RSU7,self).__init__()
5
6 self.rebnconvin = REBNCONV(in_ch,out_ch,dirate=1)
7
8 self.rebnconv1 = REBNCONV(out_ch,mid_ch,dirate=1)
9 self.pool1 = nn.MaxPool2d(2,stride=2,ceil_mode=True)
10
11 self.rebnconv2 = REBNCONV(mid_ch,mid_ch,dirate=1)
12 self.pool2 = nn.MaxPool2d(2,stride=2,ceil_mode=True)
13
14 self.rebnconv3 = REBNCONV(mid_ch,mid_ch,dirate=1)
15 self.pool3 = nn.MaxPool2d(2,stride=2,ceil_mode=True)
16
17 self.rebnconv4 = REBNCONV(mid_ch,mid_ch,dirate=1)
18 self.pool4 = nn.MaxPool2d(2,stride=2,ceil_mode=True)
19
20 self.rebnconv5 = REBNCONV(mid_ch,mid_ch,dirate=1)
21 self.pool5 = nn.MaxPool2d(2,stride=2,ceil_mode=True)
22
23 self.rebnconv6 = REBNCONV(mid_ch,mid_ch,dirate=1)
24
25 self.rebnconv7 = REBNCONV(mid_ch,mid_ch,dirate=2)
26
27 self.rebnconv6d = REBNCONV(mid_ch*2,mid_ch,dirate=1)
28 self.rebnconv5d = REBNCONV(mid_ch*2,mid_ch,dirate=1)
29 self.rebnconv4d = REBNCONV(mid_ch*2,mid_ch,dirate=1)
30 self.rebnconv3d = REBNCONV(mid_ch*2,mid_ch,dirate=1)
31 self.rebnconv2d = REBNCONV(mid_ch*2,mid_ch,dirate=1)
32 self.rebnconv1d = REBNCONV(mid_ch*2,out_ch,dirate=1)

 The rebnconvin block acts as an additional layer outside the


encoder-decoder blocks, transforming the input feature map to
pass through the subsequent encoder-decoder blocks.
Interestingly, the feature map output from this layer is finally
added to the final decoder output.

 Attributes rebnconv1 through rebnconv7 represent the


encoder blocks, while the max-pooling
layers pool1 through pool5 are used for downsampling the
features maps from encoder
blocks rebnconv1 through rebnconv5, respectively.

 Attributes rebnconv6d through rebnconv1d represent the


decoder blocks that extract the multi-scale features from the
subsequent encoder layers rebnconv7 through rebnconv1 in
a bottom-up fashion.

Lastly, we will discuss the forward step for the RSU7 module:

1 def forward(self,x):
2
3 hx = x
4 hxin = self.rebnconvin(hx)
5
6 hx1 = self.rebnconv1(hxin)
7 hx = self.pool1(hx1)
8
9 hx2 = self.rebnconv2(hx)
10 hx = self.pool2(hx2)
11
12 hx3 = self.rebnconv3(hx)
13 hx = self.pool3(hx3)
14
15 hx4 = self.rebnconv4(hx)
16 hx = self.pool4(hx4)
17
18 hx5 = self.rebnconv5(hx)
19 hx = self.pool5(hx5)
20
21 hx6 = self.rebnconv6(hx)
22
23 hx7 = self.rebnconv7(hx6)
24
25 hx6d = self.rebnconv6d(torch.cat((hx7,hx6),1))
26 hx6dup = _upsample_like(hx6d,hx5)
27
28 hx5d = self.rebnconv5d(torch.cat((hx6dup,hx5),1))
29 hx5dup = _upsample_like(hx5d,hx4)
30
31 hx4d = self.rebnconv4d(torch.cat((hx5dup,hx4),1))
32 hx4dup = _upsample_like(hx4d,hx3)
33
34 hx3d = self.rebnconv3d(torch.cat((hx4dup,hx3),1))
35 hx3dup = _upsample_like(hx3d,hx2)
36
37 hx2d = self.rebnconv2d(torch.cat((hx3dup,hx2),1))
38 hx2dup = _upsample_like(hx2d,hx1)
39
40 hx1d = self.rebnconv1d(torch.cat((hx2dup,hx1),1))
41
42 return hx1d + hxin
Figure 3. illustrates how convolution, downsampling, concatenation,
and upsampling are done across the multiple encoder-decoder
blocks.

Observe that the final addition happens from the outputs of


the rebnconvin and the `rebnconv1d` decoder layers.

Prior to explaining the U2-Net architecture, please note that the


authors also use a modified version of the RSU block that replaces
the pooling and upsampling layers in the encoder-decoder blocks
with dilated convolutions. This is done to mitigate the gradual loss of
contextual information across the deeper stages in the network. This
block is called the RSU-KF block, where K is the number of layers in
the encoder.

U2-Net Architecture Explanation


The core of the U2-Net architecture is the residual-U (RSU) block
described in the previous section. The “squared” exponentiation
comes from the network built on a two-level nested structure. The
outer level maintains a U-structure composed of 11 stages, each
comprising a configured RSU block.

The block diagram is shown below.

Figure 4:
U2-Net Architecture
[Source-U2-Net: Going Deeper with Nested U-Structure for Salient
Object Detection]
The U2-Net structure consists of the following three components:

1. The six encoder stages: En_1, En_2, En_3, En_4, En_5, and En_6. Stages 1-4 follow the RSU-7, RSU-6, RSU-5, and RSU-4 blocks, while stages 5 and 6 implement the RSU-4F block mentioned earlier.

2. The five decoder stages, De_1, De_2, De_3, De_4, and De_5, follow
similar RSU architectures as their symmetric encoder counterparts.
Each decoder (except De_5, which takes the upsampled concatenation
of output feature maps from En_5 and En_6) takes the upsampled
concatenation of the output from its previous stage and its symmetric
encoder stage.

3. The feature map outputs from the decoder stages De_1 to De_5 and the encoder stage En_6 are then convolved using a 3×3 convolution layer and upsampled to the input image resolution (320x320) to produce six side feature outputs, S_side^(1) through S_side^(6), which are then activated using the sigmoid activation function to produce the output probability maps.
These feature outputs are concatenated, followed by a 1x1 convolution and the sigmoid activation to produce the final probability map S_fuse.

Although the complete U2-Net architecture is built using the RSU


blocks, the implementation details for the final layers of the
architecture are worth mentioning since we will use them during
inference.

Let us now focus on the final side outputs implemented at the U2NET
module’s forward step.

1 #side output
2 d1 = self.side1(hx1d)
3
4 d2 = self.side2(hx2d)
5 d2 = _upsample_like(d2,d1)
6
7 d3 = self.side3(hx3d)
8 d3 = _upsample_like(d3,d1)
9
10 d4 = self.side4(hx4d)
11 d4 = _upsample_like(d4,d1)
12
13 d5 = self.side5(hx5d)
14 d5 = _upsample_like(d5,d1)
15
16 d6 = self.side6(hx6)
17 d6 = _upsample_like(d6,d1)
18
19 d0 = self.outconv(torch.cat((d1,d2,d3,d4,d5,d6),1))
20
21 return F.sigmoid(d0), F.sigmoid(d1), F.sigmoid(d2), F.sigmoid(d3), F.sigmoid(d4), F.sigmoid(d5), F.sigmoid(d6)
Attributes side1 through side6 are the 3x3 convolutions applied to
the output features from the decoder
stages hx1d through hx5d (De_1 to De_5) and the encoder
stage hx6. The upsampled side outputs are concatenated and
passed through a 1x1 convolution (outconv). Finally, the sigmoid
activation is applied to all the individuals and the concatenated side
outputs.

The authors also provide a smaller version of the U2-Net model, called U2-NetP, targeted at edge-device inference. It uses a reduced number of input, intermediate, and output channels in the RSU blocks.

Did you know that background subtraction can also be used for
document alignment? Our Automated Document
Alignment article describes how we fine-tune DeepLabv3 to
segment and align documents automatically.
Training and Evaluation Strategies
for U2-Net
Now that we have an in-depth understanding of the U2-Net
architecture, we will also discuss the various strategies the authors
have employed to train the U2-Net model.

Training Dataset and Augmentation

The authors used the DUTS Image dataset for this binary
segmentation. The training data contains 10553 images, further
augmented through horizontal flipping to obtain 21106 training
images offline.

During training, the images are first resized to 320×320 resolution, then randomly flipped vertically, and finally randomly cropped to 288×288 resolution.

Loss and Optimizer for U2-Net Training

The training loss is defined as the weighted sum of the losses from
the side output probability maps and that of the final fused
(concatenated) output map.

M is set to 6, indicating the six side-output saliency maps. The side and fused weights (w_side^(m) and w_fuse) are all set to 1. Each loss term, ℓ, is the standard Binary Cross-Entropy loss:
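Both loss equations appear as images in the original post; transcribed from the paper (our reconstruction), they read:

L = Σ_{m=1}^{M} w_side^(m) · ℓ_side^(m) + w_fuse · ℓ_fuse

ℓ = − Σ_{(r,c)}^{(H,W)} [ P_G(r,c) · log P_S(r,c) + (1 − P_G(r,c)) · log(1 − P_S(r,c)) ]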
Where (r,c) indicates the pixel location and H and W represent the image height and width, respectively. P_G(r,c) and P_S(r,c) refer to the ground truth and the predicted probability pixel values, respectively.

The authors used the Adam optimizer to train the network with the default hyperparameters: an initial learning rate of 1e-3, betas=(0.9, 0.999), eps=1e-8, and weight_decay=0.

Evaluation Datasets and Metrics

Six benchmark datasets were used for evaluation: DUT-


OMRON (5168 images), DUTS-TE (5019 images), HKU-IS (4447
images), ECSSD (1000 images), PASCAL-S (850
images), SOD (300 images).

The following evaluation metrics were used to report the performance


of the U2-Net model:

1. Precision-Recall Curve
2. The beta-F-score measure (higher the better) is given as follows:

3. Mean absolute Error (MAE – lower the better) between the ground
truth mask and the predicted map; given as:
4. Weighted F-score (higher the better) to overcome the possible unfair comparison caused by the “interpolation flaw, dependency flaw and equal-importance flaw”; given as:

5. S-measure (Sm – higher the better) is used to evaluate the structure


similarity of the predicted non-binary saliency map and the ground truth.
The S-measure is defined as the weighted sum of region-aware Sr and
object-aware So structural similarity:

6. The relaxed boundary F-measure (higher the better) is used to quantitatively estimate the predicted mask’s boundary quality.

The following table shows the metric scores across the six datasets
used during the evaluation.

Table 1: Evaluation scores from U2-Net and U2-NetP


U2-Net achieves almost state-of-the-art results on the DUT-OMRON, HKU-IS, and ECSSD datasets, as indicated by the scores highlighted in red (scores in green and blue indicate the second and third best scores). It achieves close to the second-best performance on the DUTS-TE and SOD data. Even though the performance of U2-Net on the PASCAL-S data is inferior, its scores are very close to the top 3 scores.

Qualitative Analysis of U2-Net


Predictions
Next, let’s perform the inference on some sample images. We shall
visualize the results using both easy and challenging examples.

We will use both the U2-Net and U2-NetP models for inference. The script u2net.py contains the model architecture definition, supporting modules, and helper functions.

We will start with our imports.

1 import os
2 from PIL import Image
3 import numpy as np
4
5 from torchinfo import summary
6
7 import torch
8 import torchvision.transforms as T
9
10 from u2net import U2NET, U2NETP
11
12 import torchvision.transforms.functional as F

Loading U2-Net and U2-NetP Weights


First, we will initialize the U2NET and U2NETP models.

1 u2net = U2NET(in_ch=3,out_ch=1)
2 u2netp = U2NETP(in_ch=3,out_ch=1)
The figure below shows the model summary for U2NET. The model
has approximately 44M parameters.

Figure 5: U2-Net Model Summary
The U2NETP model is around 38 times smaller than the original U2-Net model, containing only around 1.13M parameters, as the model summary shows.
Figure 6: U2-NetP Model Summary
Next, we load the model weights using the load_model helper
function.

1 def load_model(model, model_path, device):


2
3 model.load_state_dict(torch.load(model_path, map_location=device))
4 model = model.to(device)
5
6 return model
The model weights are now loaded through the following lines of
code.

1 u2net = load_model(model=u2net, model_path="u2net.pth", device="cuda:0")


2
3 u2netp = load_model(model=u2netp, model_path="u2netp.pth", device="cuda:0")
Data Preprocessing

Next, we shall prepare the batch image samples. The authors


recommend resizing the image resolution to 320x320 during
inference.

We will also scale the image data in the range [0, 1] and
normalize it using the ImageNet mean and standard deviation.

1 mean = torch.tensor([0.485, 0.456, 0.406])


2 std = torch.tensor([0.229, 0.224, 0.225])
3
4 resize_shape = (320,320)
5
6 transforms = T.Compose([T.ToTensor(),
7 T.Normalize(mean=mean, std=std)])
Using the above transforms, we will prepare the batch data and
provide the path to the image directory.

1 def prepare_image_batch(image_dir, resize, transforms, device):


2
3 image_batch = []
4
5 for image_file in os.listdir(image_dir):
6 image = Image.open(os.path.join(image_dir, image_file)).convert("RGB")
7 image_resize = image.resize(resize, resample = Image.BILINEAR)
8
9 image_trans = transforms(image_resize)
10 image_batch.append(image_trans)
11
12
13 image_batch = torch.stack(image_batch).to(device)
14
15 return image_batch
16
17 image_batch = prepare_image_batch(image_dir="test_images",
18 resize=resize_shape,
19 transforms=transforms,
20 device="cuda:0")
We will provide a helper function to denormalize images in the
range [0, 255]. This would be required for visualization purposes.
The denorm_image utility denormalizes an image using the same
ImageNet mean and standard deviation.

1 def denorm_image(image):
2 image_denorm = torch.addcmul(mean[:,None,None], image, std[:,None, None])
3 image = torch.clamp(image_denorm*255., min=0., max=255.)
4 image = torch.permute(image, dims=(1,2,0)).numpy().astype("uint8")
5
6 return image

Prepare Model Predictions

Once we have pre-processed our image batch, we can forward pass


it through the model. The prepare_predictions utility does
precisely this.

1 def prepare_predictions(model, image_batch):


2
3 model.eval()
4
5 all_results = []
6
7 for image in image_batch:
8 with torch.no_grad():
9 results = model(image.unsqueeze(dim=0))
10
11 all_results.append(torch.squeeze(results[0].cpu(), dim=(0,1)).numpy())
12
13 return all_results
Let’s remember that the forward method for the U2NET model
produces a combined (fused) probability map alongside separate
side outputs. However, we’ll only be utilizing the fused prediction
map.

We obtain the predictions for both the U2-Net and U2-NetP models.

1 predictions_u2net = prepare_predictions(u2net, image_batch)


2 predictions_u2netp = prepare_predictions(u2netp, image_batch)
Once we have the predictions, we can normalize them using the
simple min-max normalization.

1 def normPRED(predicted_map):
2 ma = np.max(predicted_map)
3 mi = np.min(predicted_map)
4
5 map_normalize = (predicted_map - mi) / (ma-mi)
6
7 return map_normalize
We shall now visualize the model predictions for both U2Net and U2-
NetP.

Let’s take a look at a simple example.

Figure 7: Prediction sample instance from U2-Net (middle) and U2-


NetP (right)
Both U2-Net (2nd column) and U2-NetP (3rd column) give decent
predictions. However, U2-Net produces slightly better predictions
around the sprinter’s hair portion.
The next series of predictions shows that U2-Net produces a higher-
confidence probability map than its U-NetP counterpart.

Figure 8: More Prediction samples from U2-Net (middle) and U2-NetP (right)
However, there are instances where U2-NetP surprisingly gives
better prediction maps compared to U2-Net.
Figure 9: Better predictions from U2-NetP (right) compared to U2-Net (middle)
Let us now take a closer look at a few challenging examples.
Figure 10: Predicted masks from U2-Net (middle) and U2-NetP (right) on challenging images
Although both models were able to segment out the foreground
instances, there is tremendous scope for improvements in getting
more fine-grained segmentation results.

The following section will discuss the IS-Net architecture composed


of RSU blocks in the encoder-decoder stages. It is similar to the U2-
Net structure but can produce significantly better results.

IS-Net Architecture: Advancing U2-Net for Image Segmentation
The authors of the U2-Net paper proposed a more robust approach
to performing foreground segmentation using an efficient
intermediate supervision (IS) learning strategy in their paper, “Highly
Accurate Dichotomous Image Segmentation,” released in 2022.

There were three major highlights of the paper:

1. Curation of the first large-scale dataset: DIS5K, containing 5470 high-


resolution image data coupled with highly accurate binary segmentation
masks.
2. Implement an intermediate self-supervision strategy (IS) to learn a
ground-truth mask-level encoder and incorporate it with the
segmentation component (based on U2-Net) to learn mask-level and
image-level features via feature synchronization. The entire architecture
is referred to as IS-Net.
3. Design of a novel metric called Human Correction Efforts (HCE) to
approximate the number of mouse-clicking operations required to
correct the false positives and false negatives.

We shall focus only on the IS-Net architecture since we intend to


compare the inference results with the previous models.

The IS-Net architecture comprises two components:


1. A ground truth encoder to learn high-dimensional mask-level
encodings.
2. An image segmentation component (similar to U2-Net) to learn high-
resolution multi-stage and multi-level image features.

Note: Unlike U2-Net, the segmentation component doesn’t use a


concatenated (fused) module of the side output maps.

The block diagram below shows the proposed IS-Net training


pipeline.

Figure 11: Proposed Training Pipeline for IS-Net


The training is performed with a 1024×1024 input image resolution.
The authors employ a two-phase training pipeline which involves the
following phases:

1. The first phase involves training a self-supervised ground truth encoder


model to learn high-dimensional mask-level features.
The encoder consists of 6-stage RSU encoder blocks (discussed
earlier). Specifically, encoders from stages 1-4 employ the RSU-
7, RSU-6, RSU-5, and RSU4 modules, respectively, while those
from stages 5-6 use the RSU-4F modules.

Note: To reduce the computation cost, a downsampled version


(512×512) of the high-resolution ground truth mask (1024×1024) is
passed to the encoder stages through strided convolution (with a stride
of 2).

The training involves using a simple binary cross-entropy loss involving


the side output maps (a result of 3×3 convolutions on top of the
encoder outputs across each stage) alongside the ground truth masks.
2. The image segmentation component consists of five decoder stages
(DE_1-DE_5 employing the RSU7-RSU4F modules) and six encoder
stages (EN_1 – EN_6 comprising RSU-7 to RSU-4F in that order).
This segmentation model produces the side output probability maps and
the intermediate features (the logits without the sigmoid activations)
from the decoder stages (DE_1 – DE_5) and the last encoder stage
(EN_6).

The authors use an MSE (mean-squared error) loss between the


learned encodings and the predicted output features from the
decoder stages (devoid of the side outputs) to perform feature
synchronization (FS) through intermediate supervision. The FS loss
(Lfs) is formulated using:
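The equation is an image in the original post; in the paper’s notation (our transcription), it is a weighted sum of stage-wise MSE terms:

Lfs = Σ_{d=1}^{D} λ_d · MSE(fdI, fdG)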

Where,

 fdI is the image features (logits without activations) extracted


from the decoder stage d (devoid of the side outputs) from the
segmentation component.
 fdG is the mask-level encoding learned in the first training phase
from encoder stage d.
 The weighting factor λ_d is kept as 1 for each stage.
 D=6; representing the stages of the segmentation model
(DE_1 – DE_5 and EN_6)

Additionally, the segmentation training process is identical to U2-net,


which computes the standard BCE loss (Lsg) of the side outputs and
the ground truth masks discussed in the [Loss and Optimizer for U2-
Net Training section].

The training process for the segmentation component is perceived as


the following optimizing problem:

We should remember that only the segmentation component of the


IS-Net pipeline is used for inference, as we had done with the
previous U2-Net models.

Inference Results from IS-Net


We will now perform inference on some image samples using the IS-
Net pipeline, which contains a mix of easy and challenging instances.
Finally, we will compare the results with U2-Net.

We will import the model architecture in the isnet.py script.

1 from isnet import ISNetDIS


2
3 isnet = ISNetDIS(in_ch=3,out_ch=1)
The IS-Net is a 44M parameter model similar to that of U2-Net. We
can observe the model summary shown below.
Next, we load the model weights.

1 isnet = load_model(model=isnet, model_path="isnet-general-use.pth", device="cuda:0")


We will now use the following pre-processing transforms to the image
data. We will perform the inference using the 1024x1024 image
resolution suggested by the authors.

1 mean = torch.tensor([0.5, 0.5, 0.5])


2 std = torch.tensor([1.0, 1.0, 1.0])
3
4 resize_shape = (1024,1024)
5
6 transforms = T.Compose([T.ToTensor(),
7 T.Normalize(mean=mean, std=std)])
Since the IS-Net pipeline doesn’t use a fused side output map, we
will use the side output from the first decoder stage (DE_1) for
inference. The prepare_predictions utility remains largely the same,
with the only difference being that we use the first side output (from
the first decoder stage).

1 def prepare_predictions(model, image_batch):


2
3 model.eval()
4
5 all_results = []
6
7 for image in image_batch:
8 with torch.no_grad():
9 results = model(image.unsqueeze(dim=0))
10
11 all_results.append(torch.squeeze(results[0][0].cpu(), dim=(0,1)).numpy())
12
13 return all_results
Let us visualize a couple of inference samples and compare them
with those from U2-Net.
Figure 12: Prediction comparison between IS-Net (middle) and U2-
Net (right) on image samples
We can observe that IS-Net gives stunning results compared to its
U2-Net counterpart.

Now, we look into a few more challenging instances.


Figure 13: Prediction comparison between IS-Net (middle) and U2-Net (right) on challenging image instances
Did you find the results exciting? Jump to the RSU-block explanation
section to know more about the U2-Net.

We might be tempted to believe that IS-Net always produces better


prediction masks than U2-Net. That might be most of the time, but we
show a few exceptions where U2-Net could give better prediction
masks.

Figure 14: Better prediction masks from U2-Net (right) compared to


IS-Net (middle)
However, we can improve the IS-Net prediction masks by using OpenCV’s thresholding utility. We kept the threshold value at 10.
Figure 15: Thresholded masks from IS-Net predictions
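A minimal sketch of this thresholding step (assuming pred_map is one of the normalized prediction maps from normPRED; the variable names here are ours, not from the original code):

import cv2
import numpy as np

# Scale the normalized prediction map to [0, 255] and binarize it at a threshold of 10
mask_uint8 = (pred_map * 255).astype(np.uint8)
_, binary_mask = cv2.threshold(mask_uint8, 10, 255, cv2.THRESH_BINARY)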

Key Takeaways
1. ReSidual-U block: We have learned how the ReSidual-U structure
(RSU) pivots in learning multi-scale global context in addition to learning
from local representations. It forms the crux of the pipeline for both U2-
Net and IS-Net.
2. U2-Net: The U2-Net is a nested two-level structure of RSU encoder-
decoder layers that helps attain multi-level and multi-scale deep feature
representations without requiring a pre-trained classification backbone
at minimal computation and memory costs.
3. Intermediate Supervision Strategy: Training a self-supervised ground-
truth encoder from the target segmentation masks helps capture high-
dimensional mask-level features.
4. IS-Net pipeline: The IS-Net pipeline aims to attain feature
synchronization using the trained GT encoder and the multi-stage
feature maps, along with learning intra-stage and multi-level image
features through the segmentation component (similar to U2-Net).

Conclusion
Background subtraction is one of the most crucial tasks in computer
vision, and hence, understanding the high-dimensional multi-scale
and multi-level image features is imperative. In this post, we have
explored how U2-Net helps achieve this through its RSU blocks. We
also observed that incorporating an intermediate supervision strategy
results in significant improvements while generating the prediction
masks, as evidenced by the IS-Net pipeline.
Recommender Systems
— A Complete Guide to
Machine Learning
Models
Leveraging data to help users discover new content

Recommender Systems: Why And How?

Recommender systems are algorithms providing


personalized suggestions for items that are most relevant to
each user. With the massive growth of available online
contents, users have been inundated with choices. It is
therefore crucial for web platforms to offer recommendations
of items to each user, in order to increase user satisfaction and
engagement.
YouTube recommends videos to users, to help them discover and watch content relevant to
them in the middle of a huge number of available contents. (Image by Author)

The following list shows examples of well-known web


platforms with a huge number of available contents,
which need efficient recommender systems to keep users
interested.

1. Youtube. Every minute people upload 500 hours of


videos, i.e. it would take 82 years to a user to watch all
videos uploaded just in the last hour.

2. Spotify. Users can listen to more than 80 million song tracks and podcasts.

3. Amazon. Users can buy more than 350 million different


products.

All these platforms use powerful machine learning models in


order to generate relevant recommendations for each user.

Explicit Feedback vs. Implicit Feedback

In recommender systems, machine learning models are used to


predict the rating rᵤᵢ of a user u on an item i. At inference time, we recommend to each user u the items i having the highest predicted rating rᵤᵢ.

We therefore need to collect user feedback, so that we can have


a ground truth for training and evaluating our models. An
important distinction has to be made here between explicit
feedback and implicit feedback.
Explicit vs. implicit feedback for recommender systems. (Image by Author)

Explicit feedback is a rating explicitly given by the user to


express their satisfaction with an item. Examples are: number
of stars on a scale from 1 to 5 given after buying a product,
thumb up/down given after watching a video, etc. This
feedback provides detailed information on how much a
user liked an item, but it is hard to collect as most users
typically don’t write reviews or give explicit ratings for each
item they purchase.

Implicit feedback, on the other hand, assume that user-item


interactions are an indication of preferences. Examples are:
purchases/browsing history of a user, list of songs played by a
user, etc. This feedback is extremely abundant, but at the
same time it is less detailed and more noisy (e.g. someone
may buy a product as a present for someone else). However,
this noise becomes negligible when compared to the sheer size
of available data of this kind, and most modern
Recommender Systems tend to rely on implicit
feedback.
User-item rating matrix for explicit feedback and implicit feedback datasets. (Image by
Author)

Once we have collected explicit or implicit feedbacks, we can


create the user-item rating matrix rᵤᵢ. For explicit
feedback, each entry in rᵤᵢ is a numerical value—e.g. rᵤᵢ = “stars
given by u to movie i”—or “?” if user u did not rate item i. For
implicit feedback, the values in rᵤᵢ are boolean values
representing presence or lack of interaction—e.g. rᵤᵢ = “did
user u watch movie i?”. Notice that the matrix rᵤᵢ is very sparse,
as users interact with few items among all available contents,
and they review even fewer items!

Content-Based vs. Collaborative Filtering

Approaches

Recommender system can be classified according to the kind of


information used to predict user preferences as Content-
Based or Collaborative Filtering.
Content-Based vs. Collaborative Filtering approaches for recommender systems. (Image
by author)

Content-Based Approach

Content-based methods describe users and


items by their known metadata. Each item i is
represented by a set of relevant tags—e.g. movies of the IMDb platform can be tagged as “action”, “comedy”, etc. Each user u is represented by a user profile, which can be created from known user information—e.g. sex and age—or from the user’s past activity.

To train a Machine Learning model with this approach we can


use a k-NN model. For instance, if we know that
user u bought an item i, we can recommend to u the available
items with features most similar to i.

The advantage of this approach is that items metadata are


known in advance, so we can also apply it to Cold-Start
scenarios where a new item or user is added to the platform
and we don’t have user-item interactions to train our model.
The disadvantages are that we don’t use the full set of known
user-item interactions (each user is treated independently),
and that we need to know metadata information for each item
and user.

Collaborative Filtering Approach

Collaborative filtering methods do not use item or user


metadata, but try instead to leverage the feedbacks or
activity history of all users in order to predict the rating of
a user on a given item by inferring interdependencies between
users and items from the observed activities.

To train a Machine Learning model with this approach we


typically try to cluster or factorize the rating matrix rᵤᵢ in order
to make predictions on the unobserved pairs (u,i), i.e.
where rᵤᵢ = “?”. In the following of this article we present
the Matrix Factorization algorithm, which is the most
popular method of this class.

The advantage of this approach is that the whole set of user-


item interactions (i.e. the matrix rᵤᵢ) is used, which typically
allows to obtain higher accuracy than using Content-Based
models. The disadvantage of this approach is that it requires
to have a few user interactions before the model can be fitted.

Hybrid Approaches

Finally, there are also hybrid methods that try to use both
the known metadata and the set of observed user-item
interactions. This approach combines advantages of both
Content-Based and Collaborative Filtering methods, and allow
to obtain the best results. Later in this article we
present LightFM, which is the most popular algorithm of this
class of methods.

Collaborative Filtering: Matrix

Factorization

Matrix factorization algorithms are probably the most


popular and effective collaborative filtering methods for
recommender systems. Matrix factorization is a latent factor
model assuming that for each user u and item i there
are latent vector representations pᵤ, qᵢ ∈ Rᶠ s.t. rᵤᵢ can be
uniquely expressed— i.e. “factorized” — in terms of pᵤ and qᵢ.
The Python library Surprise provides excellent implementations of these methods.
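As a quick illustration (a minimal sketch using the Surprise library on its built-in MovieLens-100k dataset; the setup shown is illustrative, not prescribed by this article):

from surprise import SVD, Dataset
from surprise.model_selection import cross_validate

# Load the built-in MovieLens-100k ratings (explicit feedback) and
# evaluate a matrix factorization model (SVD) with 5-fold cross-validation
data = Dataset.load_builtin('ml-100k')
cross_validate(SVD(), data, measures=['RMSE', 'MAE'], cv=5, verbose=True)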

Matrix Factorization for Explicit Feedback

The simplest idea is to model user-item interactions through


a linear model. To learn the values of pᵤ and qᵢ, we can
minimize a regularized MSE loss over the set K of pairs (u, i)
for which rᵤᵢ is known. The algorithm so obtained is
called probabilistic matrix factorization (PMF).

Probabilistic matrix factorization: model for rᵤᵢ and loss function.
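The model and loss shown in the image above take the standard form (our transcription): rᵤᵢ ≈ pᵤᵀqᵢ, with the regularized MSE loss

min over p, q of  Σ_{(u,i)∈K} (rᵤᵢ − pᵤᵀqᵢ)² + λ (‖pᵤ‖² + ‖qᵢ‖²)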

The loss function can be minimized in two different ways. The


first approach is to use stochastic gradient descent
(SGD). SGD is easy to implement, but it may have some issues
because pᵤ and qᵢ are both unknown and therefore the
loss function is not convex. To solve this issue, we can
alternately fix one of pᵤ and qᵢ and obtain a convex linear
regression problem that can be easily solved with ordinary
least squares (OLS). This second method is known
as alternating least squares (ALS) and allows significant
parallelization and speedup.

The PMF algorithm was later generalized by the singular


value decomposition (SVD) algorithm, which
introduced bias terms in the model. More
specifically, bᵤ and bᵢ measure observed rating deviations of
user u and item i, respectively, while μ is the overall average
rating. These terms often explain most of the observed
ratings rᵤᵢ, as some items widely receive better/worse ratings,
and some users are consistently more/less generous with their
ratings.

SVD algorithm, a generalization of probabilistic matrix factorization.
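In formula form (our transcription of the image above), the SVD model predicts rᵤᵢ ≈ μ + bᵤ + bᵢ + pᵤᵀqᵢ, and the same regularized MSE loss is minimized, now also regularizing the bias terms bᵤ and bᵢ.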

Matrix Factorization for Implicit Feedback

The SVD method can be adapted to implicit feedback


datasets. The idea is to look at implicit feedback as
an indirect measure of confidence. Let’s assume that the
implicit feedback tᵤᵢ measures the percentage of movie i that
user u has watched — e.g. tᵤᵢ = 0 means that u never
watched i, tᵤᵢ = 0.1 means that he watched only 10% of it, tᵤᵢ =
2 means that he watched it twice. Intuitively, a user is more
likely to be interested into a movie they watched twice, rather
than in a movie they never watched. We therefore define
a confidence matrix cᵤᵢ and a rating matrix rᵤᵢ as follows.

Confidence matrix and rating matrix for implicit feedback.
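A common choice, and the one implied by the image above (our transcription), is rᵤᵢ = 1 if tᵤᵢ > 0 and rᵤᵢ = 0 otherwise, with confidence cᵤᵢ = 1 + α·tᵤᵢ for some positive constant α.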


Then, we can model the observed rᵤᵢ using the same linear
model used for SVD, but with a slightly different loss function.
First, we compute the loss over all (u, i) pairs — unlike the
explicit case, if user u never interacted with i we have rᵤᵢ = 0
instead of rᵤᵢ =“?” . Second, we weight each loss term by the
confidence cᵤᵢ that u likes i.

Loss function for SVD for implicit feedback.
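Written out (a standard formulation, our transcription), the loss becomes

min of  Σ over all (u,i) of  cᵤᵢ (rᵤᵢ − (μ + bᵤ + bᵢ + pᵤᵀqᵢ))²  plus the usual regularization term,

i.e. the same linear model as above, with every squared error weighted by the confidence cᵤᵢ.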

Finally, the SVD++ algorithm can be used when we have


access to both explicit and implicit feedbacks. This can be very
useful, because typically users interact with many items (=
implicit feedabck) but rate only a small subset of them (=
explicit feedback). Let’s denote, for each user u, the set N(u) of
items that u has interacted with. Then, we assume that an
implicit interaction with an item j is associated with a new
latent vector zⱼ ∈ Rᶠ. The SVD++ algorithm modifies the linear
model of SVD by including into the user representation a
weighted sum of these latent factors zⱼ.

SVD++ for mixed (explicit + implicit) feedback
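In its usual form (our transcription), SVD++ predicts rᵤᵢ ≈ μ + bᵤ + bᵢ + qᵢᵀ (pᵤ + |N(u)|^(−1/2) · Σ_{j∈N(u)} zⱼ), i.e. the user factor pᵤ is augmented by the normalized sum of the implicit-feedback factors zⱼ.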

Hybrid Approach: LightFM

Collaborative filtering methods based on matrix factorization


often produce excellent results, but in cold-start scenarios—
where little to no interaction data is available for new items
and users—they cannot make good predictions because they
lack data to estimate the latent factors. Hybrid
approaches solve this issue by leveraging known item or user
metadata in order to improve the matrix factorization model.
The Python library LightFM implements one of the most
popular hybrid algorithms.
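As a small, hedged usage sketch (the MovieLens fetcher and the WARP ranking loss below are illustrative choices, not prescriptions from this article):

from lightfm import LightFM
from lightfm.datasets import fetch_movielens

# Fetch the MovieLens 100k dataset (ratings of 4+ treated as positive interactions)
data = fetch_movielens(min_rating=4.0)

# Train a hybrid latent-factor model with the WARP ranking loss
model = LightFM(loss='warp')
model.fit(data['train'], epochs=10, num_threads=2)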

In LightFM, we assume that for each user u we have collected a


set of tag annotations Aᵁ(u) — e.g. “male”, “age < 30”, … — and
similarly each item i has a set of annotations Aᴵ(i) — e.g. “price
> 100 $”, “book”, … Then we model each user tag by a latent
factor xᵁₐ ∈ Rᶠ and by a bias term bᵁₐ ∈ R, and we assume
that the user vector representation pᵤ and its associated
bias bᵤ can be expressed simply as the sum of these
terms xᵁₐ and bᵁₐ, respectively. We take the same approach to
item tags, using latent factors xᴵₐ ∈ Rᶠ and bias terms bᴵₐ ∈ R.
Once we have defined pᵤ, qᵢ, bᵤ, bᵢ using these formulas, we
can use the same linear model of SVD to describe the
relationship between these terms and rᵤᵢ.

LightFM: user/item embeddings and biases are the sum of latent vectors associated to each
user/item.

Notice that there are three interesting cases of this hybrid approach of LightFM.

1. Cold start. If we have a new item i with known tags Aᴵ(i), then we can use the latent vectors xᴵₐ (obtained by fitting our model on the previous data) to compute its embedding qᵢ, and therefore estimate its rating rᵤᵢ for any user u.

2. No available tags. If we don’t have any known metadata for items or users, the only annotation we can use is an indicator function, i.e. a different annotation a for each user and each item. Then, user and item feature matrices are identity matrices, and LightFM reduces to a classical collaborative filtering method such as SVD.

3. Content-based vs. Hybrid. If we only used user or item tags without indicator annotations, LightFM would essentially be a content-based model. So in practice, to leverage user-item interactions, we also add to the known tags an indicator annotation a that is different for each user and item.
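
As a rough sketch of how this looks in code (the interaction matrix, tag matrix, and hyperparameters below are invented for illustration), LightFM accepts the indicator-plus-tags item feature matrix directly:

```python
# Sketch of the LightFM hybrid model; the interactions and item tags here are
# toy placeholders (normally built with lightfm.data.Dataset from real logs).
import numpy as np
from scipy.sparse import csr_matrix, identity, hstack
from lightfm import LightFM

n_users, n_items, n_item_tags = 1000, 500, 20
rows = np.array([0, 0, 1, 2])
cols = np.array([10, 42, 10, 7])
interactions = csr_matrix((np.ones_like(rows, dtype=np.float32), (rows, cols)),
                          shape=(n_users, n_items))
item_tags = csr_matrix((np.random.rand(n_items, n_item_tags) > 0.8).astype(np.float32))

# Hybrid case: indicator annotations (identity matrix) + known tags per item.
item_features = hstack([identity(n_items, format="csr"), item_tags]).tocsr()

model = LightFM(no_components=32, loss="warp")
model.fit(interactions, item_features=item_features, epochs=10)

# Scores for user 0 over all items; a brand-new item with known tags would still
# get a usable embedding from its tag columns alone (the cold-start case above).
scores = model.predict(0, np.arange(n_items), item_features=item_features)
```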

TL;DR – Conclusions

 Recommender systems leverage machine learning algorithms to help users who are inundated with choices discover relevant content.

 Explicit vs. implicit feedback: the first is easier to leverage, but the second is way more abundant.

 Content-based models work well in cold-start scenarios, but require knowing user and item metadata.

 Collaborative filtering models typically use matrix factorization: PMF, SVD, SVD for implicit feedback, SVD++.

 Hybrid models take the best of content-based and collaborative filtering. LightFM is a great example of this approach.
Mastering Recommendation System: A
Complete Guide

Ankan Ghosh
MAY 21, 2024



Suppose you’re listening to a song on Spotify, watching a video on
YouTube or Netflix, or shopping on Amazon; you’ll always see a list
of similar songs, videos, or products recommended to you.
Amazingly, most of them perfectly match your tastes. Pretty cool,
right? Under the hood, these applications use a recommendation
system (or recommender system) to recommend their products to
the user. This article will explore what a Recommendation System is
and how it has advanced over recent years. Whether you’re a novice
or an expert in recommendation systems, you’ll thoroughly
understand their workflow by the end of this blog.

To see the key takeaways, scroll down to the concluding part of the article.

 What is a Recommendation System?
 Types of Recommendation System
 Recommendation System – Traditional ML Approaches
 Recommendation System – Latest DL Approaches
 Key Takeaways
 Conclusion
 References

What is a Recommendation System?


With the massive growth of online content, users have been
overwhelmed with choices. Recommender systems are designed to
understand users’ and products’ preferences, past decisions, and
characteristics by analyzing interaction data such as impressions,
clicks, likes, and purchases. These systems can recommend various
products or services that match consumers’ interests, including
books, videos, products, and clothing.

Image 1 – Popular Recommendation Systems

A Statistical Overview

Now, here is a list containing statistics from popular companies that use recommendation systems as a main component of their business:

1. YouTube – People upload 500 hours of videos every minute, so it would take 82 years for a user to watch all the videos uploaded in the last hour.
2. Spotify – Users can listen to over 100 million song tracks and podcasts.
3. Amazon – Users can buy more than 350 million different products.
4. Netflix – If a user wants to watch all the series on Netflix, it will take
approximately 10.27 years of continuous viewing, assuming no breaks
and watching 24 hours a day. It would take about 61.64 years of
watching 4 hours per day to watch all the series available on Netflix.
5. Facebook – As of 2024, Facebook has approximately 3.05 billion
monthly active users, and it’s impossible for anyone to connect with all
of them.
6. LinkedIn – Approximately 61 million people use LinkedIn to search for
jobs every week. LinkedIn hosts over 15 million job listings available for
users to apply to at any given time. Additionally, around 8.72 million job
applications are submitted daily, translating to about 363,600 job
applications per hour or 140 job applications every second.

Now, we can clearly see that every user needs suggestions or recommendations to meet their needs without wasting more time.
Here, recommendation systems do the work for you and recommend
all your favorite songs, the products you need to buy, or the web
series you want to watch. All these platforms use powerful machine
learning models to generate relevant recommendations for each
user. Now, let’s move to the next section to explore what popular
recommendation system types we use.

Types of Recommendation System


Three main types of recommendation systems are widely used:

1. Content-Based
2. Collaborative Filtering
3. Context Filtering

These recommendation systems rely on feedback data regarding ratings, clicks, likes or dislikes, etc. Before we move to the types, let’s explore the two main types of feedback:

Image 2 – Explicit vs Implicit Feedback

Recommendation System – Explicit Feedback

Explicit feedback is when a user directly rates something to show how much they like it. Suppose you give a product a star rating from
1 to 5 after buying it or give a video a thumbs up or down. This type
of feedback gives clear information about how much a user liked
something, but it is hard to collect because most people don’t take
the time to write reviews or rate every item they use.

Recommendation System – Implicit Feedback

Implicit feedback is based on user actions, like what they buy, browse, or listen to. Examples include a user’s purchase history, the songs they play, or the videos they watch. This feedback is abundant and easy to collect, but it is less precise and can be noisy. For instance, someone might buy a product as a gift, not for themselves. Despite this noise (the model will consider the product a preference of the user who placed the order), the large amount of data makes it very useful, and most modern recommendation systems prefer to use implicit feedback.
Rating Matrix

Image 3 – Rating Matrix

Once we collect explicit or implicit feedback, we can create a user-item rating matrix rᵤᵢ out of it. For explicit feedback, each entry in rᵤᵢ is a numerical value, like the number of stars you gave to a movie, or a “?” if you didn’t rate that movie. For implicit feedback, the values in rᵤᵢ are boolean, showing whether there was an interaction or not — like whether you watched a movie or not. It’s important to note that this matrix rᵤᵢ is very sparse, because you only interact with a few items out of all available ones, and you rate even fewer of them!

Now, let’s get back to the types of recommendation systems:

Content-Based Recommendation System

Content-based recommendation systems (or recommender systems) focus on the characteristics of items and users’ preferences as expressed through their interactions with items. For example, in a movie recommendation system like IMDb or Netflix, each movie might be tagged with genres such as “action” or “comedy.” Similarly, users are profiled based on their personal details, like age and gender, or their previous interactions with movies. This type of system recommends items by matching the attributes of items with users’ preferences. If a user has enjoyed movies like “Batman Begins” and “The Dark Knight,” the system might suggest a similar movie, “The Dark Knight Rises,” because it features the same actors or belongs to a similar genre.

Image 4 – Content-Based Filtering


Utility Matrix

This is a structured format where each cell represents the interaction between a user and an item, such as a rating or a like. It helps us understand which items a user prefers over others, although not all interactions are known. The rating matrix we saw before is an example of a utility matrix, built from the feedback we collect from users.

Creating Item Profiles

Each item in the system is described by a profile that captures its essential characteristics. For movies, this could include details about the cast, the director, the year it was released, and its genres. This information is quantified using techniques such as TF-IDF vectorization:

Term Frequency (TF): This is a count of how often a particular word appears in a document relative to the most frequent word in that document.

Image 5 – Term Frequency


Inverse Document Frequency (IDF): This calculates how unique a
word is across all documents; more unique words are given higher
importance.

Image 6 – Inverse Document Frequency
Term Frequency-Inverse Document Frequency (TF-IDF) is a score used in search and natural language processing (NLP). It measures how important a word is in a document when you look at a collection of documents together (called a corpus). TF measures how often a word appears in a document. IDF looks at how rare a word is across all documents, giving more weight to unique words. By multiplying TF and IDF, you get the TF-IDF score, which tells you how important each word is in that document.

Image 7 – Term Frequency-Inverse Document Frequency
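
As a quick illustration (the movie descriptions below are made up, and scikit-learn’s TfidfVectorizer uses a slightly different TF/IDF normalization than the formulas above), item profiles can be built like this:

```python
# Sketch: building TF-IDF item profiles from toy movie descriptions.
from sklearn.feature_extraction.text import TfidfVectorizer

movie_descriptions = [
    "dark superhero action thriller set in Gotham",
    "romantic comedy about two chefs in Paris",
    "superhero team saves the world in an action blockbuster",
]

vectorizer = TfidfVectorizer(stop_words="english")
item_profiles = vectorizer.fit_transform(movie_descriptions)   # shape: (n_items, n_terms)

print(vectorizer.get_feature_names_out())   # vocabulary terms
print(item_profiles.toarray().round(2))     # TF-IDF weight of each term per movie
```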

Creating User Profiles

User profiles are built from data on how users have interacted with
items. These profiles are essentially vectors that summarize a user’s
likes and dislikes based on past behavior. The profiles help predict
what new items a user might like.

Recommendation Methods

For content-based recommendation, we use two main methods:

1. Cosine Similarity – This method calculates the cosine of the angle between the item profile vector and the user profile vector. A smaller angle (or higher cosine similarity) suggests a stronger match between the user’s preferences and the item’s characteristics (see the sketch after this list).
2. Classification Models – Methods like decision trees can be used to
decide whether an item would appeal to a user based on various
characteristics of the user and the item.
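
Here is a minimal sketch of the cosine-similarity scoring step, assuming the user profile and item profiles already live in the same feature space (all vectors below are invented):

```python
# Sketch: scoring items for a user by cosine similarity between the user's
# profile vector and each item's profile vector.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

item_profiles = np.array([[0.9, 0.1, 0.0],    # e.g. [action, comedy, romance] weights
                          [0.1, 0.8, 0.3],
                          [0.7, 0.2, 0.1]])
user_profile = np.array([[0.8, 0.1, 0.1]])    # built from items the user liked

scores = cosine_similarity(user_profile, item_profiles)[0]
ranking = np.argsort(-scores)                 # items ordered from best to worst match
print(scores, ranking)
```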

When you search for a movie on Netflix or a photo on Google, you get results similar to your desired movie or photo, right? All of these recommendations use these methods under the hood. The recommendation system uses similarity search techniques to find and list the feature vectors nearest to your given input. We will explore this more in one of our future articles.

Image 8 – Content-Based vs Collaborative Filtering

Collaborative Filtering

Collaborative filtering (CF) is a popular technique in recommendation systems that predicts a user’s preferences by using the collective knowledge and behaviors of a large pool of users. Unlike content-based filtering, which relies on the properties of items, CF focuses on user-item interactions, utilizing the feedback or activity history of all users to make predictions.
Image 9 – Collaborative Filtering
We can classify collaborative filtering (CF) recommendation systems
based on various approaches. In the next section, we will explore this
further.

Memory-Based vs. Model-Based Approaches

1. Memory-Based CF – Uses the entire user-item interaction matrix to make direct recommendations based on similarities between users or items. It is straightforward but can struggle with large, sparse matrices. Generally, it deals with implicit feedback. It consists of two methods: user-based collaborative filtering and item-based collaborative filtering.

2. Model-Based CF – Uses machine learning models to predict interactions between users and items. Techniques like matrix factorization, clustering, SVD, and deep learning are used to learn latent features from the data, improving prediction accuracy. We also use these methods for explicit feedback.

Image 10 – Memory-Based vs. Model-Based CF

User-Based Collaborative Filtering

This method recommends items to a user based on the preferences of similar users in the database. It involves creating a matrix of items rated/liked/clicked by each user, computing similarity scores between users, and then suggesting items that the target user hasn’t encountered yet but similar users have liked.

Example – If User A likes ‘Batman Begins’ and ‘Justice League,’ and User B likes ‘Batman Begins,’ ‘Justice League,’ and ‘Avengers,’ User A might also enjoy ‘Avengers.’

Image 11 – User-Based Collaborative Filtering
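
A tiny sketch of this idea (the ratings are invented; a real system would work with a large sparse matrix):

```python
# Sketch: user-based collaborative filtering on a toy explicit rating matrix
# (users x items, 0 = not rated).
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

ratings = np.array([
    [5, 4, 0, 1],   # User A
    [5, 5, 4, 0],   # User B
    [1, 0, 5, 4],   # User C
])

user_sim = cosine_similarity(ratings)          # user-user similarity matrix
target = 0                                     # recommend for User A
weights = user_sim[target].copy()
weights[target] = 0.0                          # ignore the user's own row

# Predicted score per item: similarity-weighted average of other users' ratings.
pred = weights @ ratings / (weights.sum() + 1e-9)
pred[ratings[target] > 0] = -np.inf            # hide items User A already rated
print("Recommend item:", int(np.argmax(pred)))
```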

Item-based Collaborative Filtering

Instead of finding similar users, this method finds similar items to recommend based on the user’s past preferences. It identifies pairs of items rated/liked by the same users, measures their similarity, and then suggests similar items based on these similarity scores.

Example – If User A liked ‘Movie 1’ and ‘Movie 2’, and ‘Movie 2’ is similar to ‘Movie 3’ based on ratings by other users, ‘Movie 3’ is recommended to User A.
Image 12 – Item-Based Collaborative Filtering

Recommendation Methods

Matrix Factorization – In this method, the sparse user-item interaction matrix is divided into two smaller, dense matrices representing the latent features of users and items. In the context of movie recommendations, latent features could represent different genres or user preferences for certain types of movies.

Image 13 – Matrix Factorization


K-Nearest Neighbors (KNN) – To make a recommendation, KNN
identifies the “neighbors” of a user or item, which are most similar
based on certain features or past behavior. For example, KNN might
compare users based on their movie ratings in a movie
recommendation system. If a user has given similar ratings to a set of
movies as another user, KNN considers them neighbors. The system
then recommends movies that the neighbor liked, but the user hasn’t
seen yet.

Image 14 – K Nearest Neighbors
The process involves several steps. First, the system computes the
similarity between users or items using metrics like Euclidean
distance or cosine similarity. Then, it selects the top ‘k’ most similar
neighbors. A user-based KNN finds users with similar tastes; an item-
based KNN identifies items that have been similarly rated. Finally, the
system aggregates the preferences of these neighbors to generate a
recommendation list.
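
A small sketch of the neighbor-lookup step using scikit-learn’s NearestNeighbors (the rating matrix is invented):

```python
# Sketch: finding the k nearest neighbor users with cosine distance on a toy
# rating matrix; their preferences would then be aggregated into recommendations.
import numpy as np
from sklearn.neighbors import NearestNeighbors

ratings = np.array([
    [5, 4, 0, 1],
    [5, 5, 4, 0],
    [1, 0, 5, 4],
    [4, 5, 1, 0],
])

knn = NearestNeighbors(n_neighbors=2, metric="cosine")
knn.fit(ratings)

# Neighbors of user 0 (the closest "neighbor" returned is the user itself).
distances, indices = knn.kneighbors(ratings[[0]], n_neighbors=3)
print(indices[0], distances[0])
```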

Early recommendation models often used collaborative filtering. K-Nearest-Neighbor (KNN) models predict a user’s preferences by comparing them to similar users or items: user-based collaborative filtering uses a target user’s neighbors’ preferences, while item-based filtering uses the user’s preferences on similar items. Combining both methods improves accuracy on large datasets, and algorithms like SVD reduce data dimensionality. Matrix Factorization (MF) can also incorporate implicit feedback and temporal information, and typically outperforms KNN.

Context Filtering

Session-based and sequence-based recommender systems predict the next user action by analyzing past interactions. Session-based systems focus on the current session’s actions, while sequence-based systems use the order of all past interactions. These systems face challenges like varying session lengths, action types, and user anonymity. Conventional methods include K-nearest neighbors and Markov chains, while advanced approaches use deep learning models like RNNs and attention mechanisms. Netflix, for example, uses contextual sequence prediction by considering user actions and context to suggest what to watch next.

Image 15 – Context Filtering
This figure shows how user interactions with a service like Netflix can
be tracked over time with additional context to improve
recommendations. Each row represents a user’s activity at a specific
time, including their location, the device they used, and the
exact date and time of the interaction. For example, a user in the
United States watched “Stranger Things” on their computer on
December 10, 2017. The system captures more information about
each user’s behavior by recording this context—location, device, and
timestamp.

With this detailed data, recommendation systems can make better predictions about what users might want to watch next. For instance, if the system sees that a user often watches action movies on their computer in the evening, it can suggest similar movies around that time. The question mark in the last row shows an unknown action the system aims to predict based on past interactions and context. This approach helps create personalized recommendations that match the user’s habits and preferences, enhancing their viewing experience.

Summarization of Recommendation System Types

Content-based recommendation systems (or recommender systems) suggest items by matching item attributes with user preferences, relying on techniques like TF-IDF for profiling. Collaborative filtering leverages users’ collective behaviors to make predictions, employing methods such as user-based and item-based filtering and advanced techniques like matrix factorization. Context filtering predicts user actions based on session and sequence data, utilizing models like RNNs and attention mechanisms to provide relevant recommendations. Now, let’s move to the next section to explore the main components behind these recommendation systems’ workflows.
Recommendation System –
Traditional ML Approaches

Image 16 – ML Approach Evaluation in RecSys
Recommendation systems have advanced considerably over the decades. Here, we will explore three main and widely used ML techniques.

Matrix Factorization

As discussed, matrix factorization decomposes the user-item matrix into two lower-dimensional matrices representing users and items. These lower-dimensional matrices capture latent features that describe the preferences of users and the characteristics of items. The latent features are then used to predict user recommendations. Two different algorithms are used, depending on the type of feedback we receive.

Image 17 – Matrix Factorization Process

For Implicit Feedback

The SVD (singular value decomposition) algorithm is used for implicit feedback, which involves measuring user confidence in their preferences. Suppose tᵤᵢ represents how much of a movie i user u has watched. For example, tᵤᵢ = 0 means the user never watched the movie, tᵤᵢ = 0.3 means they watched 30% of it, and tᵤᵢ = 2 means they watched it twice. This implies that a user is likely more interested in a movie they’ve watched two times than in one they haven’t watched at all.

We define the confidence matrix cᵤᵢ and the rating matrix rᵤᵢ as follows: the confidence matrix is cᵤᵢ = 1 + αtᵤᵢ, where α is a constant. The rating matrix rᵤᵢ is defined as rᵤᵢ = 1 if tᵤᵢ > 0 (indicating the user has watched the movie) and rᵤᵢ = 0 if tᵤᵢ = 0 (indicating the user has not watched the movie).

Image 18 – Confidence matrix and rating matrix for implicit feedback
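
A minimal numpy sketch of these definitions (the watch fractions and the value of α are placeholders):

```python
# Sketch: building the confidence matrix c_ui = 1 + alpha * t_ui and the binary
# rating matrix r_ui from implicit "fraction watched" data.
import numpy as np

t = np.array([
    [0.0, 0.3, 2.0],   # user 0: never watched item 0, 30% of item 1, item 2 twice
    [1.0, 0.0, 0.1],
])
alpha = 40.0           # confidence scaling constant (placeholder value)

c = 1.0 + alpha * t            # confidence: higher for items watched more
r = (t > 0).astype(float)      # rating: 1 if any interaction, else 0
print(c, r, sep="\n")
```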

Then, using both matrices, the algorithm computes the loss over all user-item pairs (u, i) using a weighted squared error between the observed rating rᵤᵢ and the predicted rating pᵤᵀqᵢ, weighted by the confidence cᵤᵢ. Regularization terms for the user and item feature vectors (pᵤ and qᵢ) are added to prevent overfitting, scaled by a parameter λ. If a user never interacted with an item, rᵤᵢ is set to 0.

Image 19 – Loss function for implicit feedback


For Explicit Feedback

Explicit feedback uses direct user ratings for items, such as movie
ratings. In matrix factorization for explicit feedback, a linear model
represents user-item interactions, and the algorithm, called
probabilistic matrix factorization, learns latent vectors for users and
items by minimizing a regularized mean squared error (MSE) loss
over known ratings.

Image 20 – Probabilistic Matrix Factorization and loss function for explicit feedback
Two common optimization methods are used: stochastic gradient
descent (SGD) and alternating least squares (ALS). While SGD is
easy to implement, it may struggle with non-convex loss functions.
ALS, however, can transform the problem into a series of convex
linear regression problems, which are easier to solve and can be
significantly parallelized for speed.
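
For intuition, here is a minimal SGD sketch for this loss on toy explicit ratings (the learning rate, regularization strength, and ratings themselves are placeholders; a production system would use ALS or a dedicated library instead):

```python
# Sketch: SGD updates for probabilistic matrix factorization on known ratings.
import numpy as np

ratings = {(0, 0): 5.0, (0, 2): 1.0, (1, 0): 4.0, (2, 1): 2.0}   # (user, item) -> rating
n_users, n_items, f = 3, 3, 8
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, f))   # user latent vectors p_u
Q = rng.normal(scale=0.1, size=(n_items, f))   # item latent vectors q_i
lr, lam = 0.01, 0.1                            # learning rate and L2 weight lambda

for epoch in range(200):
    for (u, i), r_ui in ratings.items():
        err = r_ui - P[u] @ Q[i]               # error on a known rating only
        pu = P[u].copy()
        P[u] += lr * (err * Q[i] - lam * P[u]) # regularized gradient steps
        Q[i] += lr * (err * pu - lam * Q[i])

print(round(P[0] @ Q[0], 2))   # should move toward the observed rating 5.0
```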

Hybrid Approach

The SVD++ algorithm can handle both explicit and implicit feedback
simultaneously, which is useful because users often interact with
many items but rate only a few. This algorithm modifies the basic
SVD model by including a weighted sum of latent factors from items
a user has interacted with. This helps provide more accurate
recommendations by leveraging all available user interactions, not
just the rated items.

Image 21 – SVD++ for mixed (explicit + implicit) feedback


Matrix factorization methods, especially those using techniques like
ALS and SVD, have proven highly effective in collaborative filtering
systems. They are more advanced than traditional methods like K-
nearest neighbors (KNN) due to their ability to handle large datasets.

Logistic Regression

Logistic regression is a commonly used linear model in recommendation systems, especially for predicting click-through rates (CTR). This model predicts the probability of an event occurring, with values between 0 and 1. It can use side information such as user demographics, past behaviors, item attributes, and contextual information, helping to address the cold start problem. The logistic regression model is trained by minimizing a loss function with two parts: a regularization term and a logistic loss term. The optimization problem can be expressed as:

Image 22 – Logistic Regression
Here, λ is the regularization parameter, w is the weight vector, yᵢ is the label, and xᵢ is the feature vector.
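
A small sketch of such a CTR model with scikit-learn, assuming one-hot encoded categorical features (the feature columns and click labels are invented):

```python
# Sketch: logistic regression CTR model over one-hot encoded user/item features.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({
    "user_age_bucket": ["18-25", "26-35", "18-25", "36-45"],
    "item_category":   ["shoes", "books", "books", "shoes"],
    "device":          ["mobile", "desktop", "mobile", "mobile"],
})
y = np.array([1, 0, 1, 0])    # 1 = clicked, 0 = not clicked

model = make_pipeline(
    ColumnTransformer([("onehot", OneHotEncoder(handle_unknown="ignore"), X.columns.tolist())]),
    LogisticRegression(C=1.0),   # C is the inverse of the regularization strength lambda
)
model.fit(X, y)
print(model.predict_proba(X)[:, 1])   # predicted click probabilities
```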

Traditional logistic regression assigns a weight to each feature but does not consider how features interact with each other. Additional techniques are used to capture these interactions. One method is the degree-2 polynomial (Poly2) model, which learns a weight for every possible feature pair. Another approach combines decision trees with logistic regression, where decision trees transform the input features and their outputs serve as inputs to the logistic model.

Factorization Machines (FMs)

Factorization Machines (FMs) are a type of generalized low-rank model used to approximate tabular datasets by representing them with lower-dimensional latent factors. Unlike traditional matrix factorization, FMs can handle more complex feature interactions, making them useful for tasks like recommendation systems. FMs convert the weight of each feature pair into the inner product of two k-dimensional latent vectors corresponding to each feature in the pair.

Image 23 – Factorization Machines
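
For intuition, here is a numpy sketch of the FM scoring equation ŷ(x) = w₀ + Σᵢ wᵢxᵢ + Σᵢ<ⱼ ⟨vᵢ, vⱼ⟩xᵢxⱼ, computed with the usual O(kn) reformulation (all parameters below are random placeholders):

```python
# Sketch: factorization machine prediction for a single feature vector x.
import numpy as np

rng = np.random.default_rng(0)
n_features, k = 6, 4
x = rng.integers(0, 2, size=n_features).astype(float)    # e.g. one-hot style features
w0 = 0.1
w = rng.normal(scale=0.1, size=n_features)                # per-feature weights
V = rng.normal(scale=0.1, size=(n_features, k))           # latent vector v_i per feature

linear = w0 + w @ x
# Pairwise term: 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i (v_if x_i)^2 ]
pairwise = 0.5 * np.sum((V.T @ x) ** 2 - (V.T ** 2) @ (x ** 2))
print(linear + pairwise)
```
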
Field-aware Factorization Machines (FFMs) extend the basic FM
model by grouping features into different fields. Each feature has a
different latent vector for each field, allowing the model to capture
more nuanced interactions between features from different fields. In
FFMs, when deciding the weight for a feature pair, the latent vector of
one feature in the pair is used in the context of the field of the other
feature and vice versa.

Image 24 – Field-aware Factorization Machines
These algorithms mostly use feature vectors to provide predictions as
recommendations, and a ranking method like similarity search ranks
those features according to user input. However, these models face
challenges when trying to model interactions involving more than two
features. This limitation has led researchers to explore neural
network-based recommendation systems, which offer more flexibility
for advanced feature interactions and can more effectively handle
higher-order feature combinations.

Recommendation System – Latest DL Approaches

Image 25 – DL Approach Evaluation in RecSys
Deep learning (DL) models have advanced recommender systems by leveraging vast amounts of data and complex architectures. Unlike traditional machine learning methods, whose performance often plateaus as data grows, DL models continue to improve as more data is introduced. This increases accuracy and flexibility, making DL models ideal for personalized recommendations. Companies like Airbnb, Facebook, and Google have successfully implemented DL techniques in their recommendation engines. Let’s explore the key components of a DL recommendation system.

Life Cycle of Deep Learning Models for Recommendation
Training Phase

Image 26 – Training Phase


The training phase involves teaching the model to predict user-item
interaction probabilities by using historical data on user interactions.
This phase includes presenting the model with examples of past
interactions and adjusting its parameters to minimize prediction
errors. Techniques like backpropagation and gradient descent are
used to optimize the model. The goal is to accurately capture the
patterns in user behavior and item characteristics to predict future
interactions.
Inference Phase

Image 27 – Inference Phase
In the inference phase, the trained model predicts new user-item
interactions. This involves three key steps: candidate generation,
candidate ranking, and filtering. First, the model pairs users with
numerous candidate items based on learned similarities. Then, it
ranks these items by the likelihood of user enjoyment. Finally, the
highest-ranked items are presented to the user. This phase requires
efficient data processing and real-time prediction capabilities to
deliver timely and relevant recommendations.
Workflow

Image 28 – General Workflow
The general workflow of a DNN-based recommendation system
involves two steps:

1. First, user-item interactions, like movie ratings, are collected as features. A model is built by learning user and item embeddings from these features. The model is trained to predict item ratings or interactions.

2. Then, for a new recommendation, the user’s context (like past ratings) is
used to generate features, which are fed into the trained model. The
model utilizes the learned embeddings to recommend items similar to
those the user has liked before, providing personalized suggestions.
This process involves embedding learning, feature extraction, and
model inference for recommendation.

DNN-Based Recommendation Systems

Deep neural networks have gradually improved recommendation systems by understanding complex relationships between users and items. We will explore how embeddings turn data into useful patterns, different network types like MLPs and CNNs, and popular models such as Neural Collaborative Filtering, Variational Autoencoders, Google’s Wide and Deep model, and Meta’s Deep Learning Recommender Model. Before that, we will see the main components that work under the hood to make such complicated and accurate architectures possible.

Key Components

Embeddings

Image 29 – Embeddings
Embeddings are a core component of DL recommender systems,
transforming categorical data into dense vector representations.
Using embeddings, the model captures similarities between entities,
such as users and items, in a high-dimensional space. For example,
users with similar preferences will have similar embedding vectors.
These embeddings are learned during training and can significantly
enhance the model’s ability to generalize from sparse data.

Core Architecture

DL recommender systems utilize various network architectures, including feedforward neural networks, multilayer perceptrons (MLPs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs). Feedforward networks pass information forward through layers, while MLPs add depth and non-linearity with hidden layers. CNNs are used for processing image data, and RNNs handle sequential data, such as user interaction histories.

Image 30 – Core Architecture


The core architecture of these neural networks calculates the dot-
product between the user embedding and the item embedding to get
a final score, the likelihood that a user interacts with an item. As a
last step, you may apply the sigmoid activation function to transform
the output to a probability between 0 and 1.
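
A minimal PyTorch sketch of this scoring step (embedding sizes and IDs are placeholders):

```python
# Sketch: look up user and item embeddings, take their dot product, and squash
# it to an interaction probability with a sigmoid.
import torch
import torch.nn as nn

n_users, n_items, dim = 1000, 5000, 32
user_emb = nn.Embedding(n_users, dim)
item_emb = nn.Embedding(n_items, dim)

user_ids = torch.tensor([0, 3])
item_ids = torch.tensor([42, 7])

score = (user_emb(user_ids) * item_emb(item_ids)).sum(dim=1)  # dot product per pair
prob = torch.sigmoid(score)                                   # likelihood of interaction
print(prob)
```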

Image 31 – Core Architecture with Multiple Inputs
It’s important to consider additional user information such as gender,
age, city, time since last visit, and credit card used for payment,
along with item details like brand, price, categories, and quantity sold
in the last seven days. This additional information can enhance the
model’s ability to generalize. The neural network can be modified to incorporate these extra features as inputs.

Popular Architectures

Neural Collaborative Filtering (NCF)

NCF combines matrix factorization with MLPs to model user-item interactions. It treats the problem from a non-linearity perspective, where user and item embeddings are learned through interaction data. The model then multiplies these embeddings and feeds them into an MLP network. The matrix factorization and MLP outputs are combined to predict interaction probabilities. This approach allows for capturing complex, non-linear relationships in the data, enhancing recommendation accuracy.

Image 32 – Neural Collaborative Filtering (NCF)
It starts with a sparse input layer representing user and item IDs. These IDs are mapped to dense user (u) and item (i) latent vectors via an embedding layer. The latent vectors are then fed into multiple neural collaborative filtering (CF) layers, which learn interactions between users and items. The output layer produces a predicted score ŷᵤᵢ for the user-item pair. This score is compared to the actual rating yᵤᵢ during training to minimize prediction error. The trained model uses these interactions to recommend items to users based on learned embeddings.
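
A simplified PyTorch sketch of this idea follows: an element-wise (GMF-style) path plus an MLP path over the embeddings, combined into a predicted probability. It is a toy approximation of the NCF architecture, not the reference implementation, and all sizes are placeholders.

```python
# Sketch: simplified Neural Collaborative Filtering model.
import torch
import torch.nn as nn

class SimpleNCF(nn.Module):
    def __init__(self, n_users, n_items, dim=16):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)
        self.mlp = nn.Sequential(nn.Linear(2 * dim, 32), nn.ReLU(), nn.Linear(32, 16), nn.ReLU())
        self.out = nn.Linear(dim + 16, 1)   # combines the GMF and MLP outputs

    def forward(self, users, items):
        u, i = self.user_emb(users), self.item_emb(items)
        gmf = u * i                                   # element-wise product path
        mlp = self.mlp(torch.cat([u, i], dim=-1))     # non-linear interaction path
        return torch.sigmoid(self.out(torch.cat([gmf, mlp], dim=-1))).squeeze(-1)

model = SimpleNCF(n_users=100, n_items=200)
print(model(torch.tensor([0, 1]), torch.tensor([5, 9])))   # predicted scores for two pairs
```
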
Variational Autoencoders (VAE)

Variational Autoencoders (VAEs) for collaborative filtering learn non-linear representations of user-item interactions. The training data for this model consists of pairs of user-item IDs for each interaction between a user and an item. A VAE consists of an encoder that converts the input vector, representing a user’s interactions, into an n-dimensional variational distribution, which provides a latent feature representation (or embedding) of the user, and a decoder that reconstructs the input from this latent representation. This model can effectively handle missing data by predicting missing values in the user-item interaction matrix. The output is a vector of item interaction probabilities for the user.

Image 33 – Variational Autoencoders (VAE)


Wide and Deep by Google

The Wide and Deep model combines a linear model (the wide component) with a deep neural network (the deep component). The wide component memorizes simple, frequently occurring patterns, while the deep component can learn rich representations of relationships in the data and generalize them to similar items via embeddings. Categorical features are embedded into continuous vector spaces before being fed into the deep component. The outputs of both components are added to create the final interaction probability. This dual approach allows the model to capture both shallow and deep patterns in the data, improving overall recommendation performance.

Image 34 – Wide and Deep Architecture


Deep Learning Recommender Model (DLRM) by Meta

DLRM, introduced by Meta, combines categorical and numerical inputs through embedding layers and multilayer perceptrons (MLPs). Categorical features are mapped to dense vectors, and numerical features are fed directly into the MLP. The model explicitly computes second-order interactions between features by taking the dot product of all pairs of embedding vectors. These interactions are then processed by a top-level MLP to predict user-item interaction probabilities. DLRM is designed to handle large-scale data efficiently while maintaining high accuracy.
Image 35 – Deep Learning Recommender Model (DLRM) Architecture

Session-Based Recommendation Systems

Image 36 – Session-Based Recommendation Workflow
Session-based recommender systems use sequential data, which captures user interactions within a session, such as viewing multiple products. These models utilize variations of RNNs, like GRU or LSTM, or transformer-based architectures such as BERT, to process sequences and understand the context of user behavior. For instance, RNNs capture temporal dependencies, while transformers use attention mechanisms to focus on relevant interactions. These session-based models can predict the likelihood of users engaging with specific items based on their recent activity, providing more timely and relevant recommendations. Two popular implementations include Square’s deep learning-based product recommendation system, which uses BERT, GRUs, and NVIDIA GPUs to create vector representations of its sellers, and Alibaba’s transformer-based model.
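
A heavily simplified, GRU4Rec-style sketch of next-item prediction over a session of item IDs (all sizes and IDs are placeholders):

```python
# Sketch: a GRU over a session of clicked item IDs that scores candidate next items.
import torch
import torch.nn as nn

class SessionGRU(nn.Module):
    def __init__(self, n_items, dim=32, hidden=64):
        super().__init__()
        self.item_emb = nn.Embedding(n_items, dim)
        self.gru = nn.GRU(dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_items)   # score for every candidate next item

    def forward(self, session):                  # session: (batch, seq_len) item IDs
        x = self.item_emb(session)
        _, h = self.gru(x)                       # h: final hidden state summarizing the session
        return self.head(h.squeeze(0))           # (batch, n_items) next-item scores

model = SessionGRU(n_items=1000)
session = torch.tensor([[12, 7, 99, 4]])         # one session of four clicked items
next_item_scores = model(session)
print(next_item_scores.argmax(dim=-1))           # most likely next item (untrained, so arbitrary)
```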

Image 37 – NLP vs RecSys
In NLP applications, input text is converted into word vectors using
word embedding techniques. Each word is translated into numbers
before being processed by RNNs, Transformers, or BERT to
understand context. These numbers change during training, encoding semantic and contextual information, so that similar words end up close together in this space. These models provide outputs for tasks
like next-word prediction and text summarization. For session-based
recommendations, RNN models train on user event sequences (e.g.,
product clicks, interaction times) to predict the likelihood of clicking a
target item. Interactions are embedded like words in a sentence
before being processed by LSTM, GRU, or Transformers for context
understanding.

LLM-Based Recommendation System

Image 38 – LLM-Based Approach Evaluation in RecSys


Deep Neural Networks (DNNs) have significantly advanced RecSys
by effectively modeling user-item interactions. Recurrent Neural
Networks (RNNs), including Long Short-Term Memory (LSTM) and
Gated Recurrent Units (GRU), excel at handling sequential data and
capturing high-order dependencies in user interaction sequences.
Graph Neural Networks (GNNs) have emerged as powerful tools for
learning user and item representations from graph-structured data,
such as user behaviors on social networks. Additionally, DNNs can
encode side information like user reviews, with models like BERT
extracting and utilizing textual data.

Despite their success, traditional DNN-based RecSys models face limitations. They often struggle to capture rich textual knowledge about users and items, leading to suboptimal performance. Additionally, many RecSys methods are designed for specific tasks and lack generalization to new recommendation scenarios. These models typically perform well on simple tasks like rating prediction but find complex, multi-step decisions challenging, such as planning a travel itinerary.

Image 39 – LLM-Based Approach in RecSys


Large Language Models (LLMs), like ChatGPT and BERT, offer a
promising solution for next-generation RecSys. Trained on vast
amounts of text data, LLMs demonstrate powerful language
understanding and generation capabilities, making them suitable for
various tasks without extensive fine-tuning. Techniques like in-
context learning and chain-of-thought reasoning enhance their
ability to handle complex decision-making processes. Recent efforts
have explored using LLMs for recommendations, improving accuracy
and explainability through conversational interactions, and refined
candidate sets.

Key Takeaways
Understanding Recommendation System

Recommendation systems are algorithms that suggest relevant items to users based on their preferences and behaviors. They are crucial for platforms like Spotify, Amazon, and Netflix to help users find songs, products, or shows they might like. These systems analyze user interactions, such as clicks and purchases, to predict what users will enjoy. By understanding user and item profiles, recommendation systems make personalized suggestions, enhancing user satisfaction and engagement.

Types of Recommendation System

There are three main types of recommendation systems: content-based, collaborative filtering, and context filtering. Content-based systems match item attributes with user preferences, while collaborative filtering uses user interactions to find similarities among users or items. Context filtering considers the sequence of user actions to predict future behavior. Each type has unique methods and applications, providing various solutions for different recommendation challenges.

Traditional Machine Learning Techniques

Traditional machine learning techniques like matrix factorization, logistic regression, and factorization machines play a significant role in recommendation systems. Matrix factorization decomposes user-item interaction matrices into lower-dimensional representations, capturing latent features. Logistic regression predicts click-through rates by analyzing features and interactions. Factorization machines extend matrix factorization by modeling feature interactions, enhancing prediction accuracy. These techniques form the foundation of many recommendation systems.
Advancements with Deep Learning

Deep learning models have revolutionized recommendation systems by handling large datasets and complex interactions. Techniques like Neural Collaborative Filtering (NCF), Variational Autoencoders (VAE), and the Wide and Deep model combine user and item embeddings with neural networks to capture non-linear relationships. These models improve accuracy and personalization by learning intricate patterns from user behavior. Additionally, session-based recommendations using RNNs and transformers provide real-time suggestions, enhancing user experience on platforms like Netflix and Google.

Emerging Trends with Large Language Models

Large Language Models (LLMs) like ChatGPT and BERT enhance recommendation systems by understanding and generating natural language. These models can handle complex decision-making processes and improve user experience by making recommendations more accurate and explainable.

Conclusion
A recommendation system(or recommender system) is key to
enhancing user experience on modern platforms. By understanding
user preferences and behaviors, these systems offer personalized
suggestions that engage users. Recommendation systems have
evolved from traditional methods like matrix factorization to advanced
deep learning models to provide accurate and relevant
recommendations. They are essential for navigating the vast amount
of content available today, ensuring users find what they need quickly
and easily. So, the next time your parents get their favorite song or
movie suggested as “you might like this,” you can tell them the secret
behind it.
