How To Analyse Loops For Complexity Analysis of Algorithms
Last Updated : 02 Nov, 2023
In this article, we will discuss why algorithms and their analysis are important.
The analysis of an algorithm generally focuses on CPU (time) usage, memory
usage, disk usage, and network usage. All are important, but the greatest
concern is CPU time. Be careful to differentiate between:
Performance: How much time/memory/disk/etc. is used when a
program is run. This depends on the machine, compiler, etc. as well as
the code we write.
Complexity: How do the resource requirements of a program or
algorithm scale, i.e. what happens as the size of the problem being solved
by the code gets larger.
Note: Complexity affects performance but not vice-versa.
Algorithm Analysis:
Algorithm analysis is an important part of computational complexity theory,
which provides theoretical estimation for the required resources of an
algorithm to solve a specific computational problem. Analysis of algorithms
is the determination of the amount of time and space resources required to
execute it.
Why Analysis of Algorithms is important?
To predict the behavior of an algorithm without implementing it on a
specific computer.
It is much more convenient to have simple measures for the efficiency of
an algorithm than to implement the algorithm and test the efficiency
every time a certain parameter in the underlying computer system
changes.
It is impossible to predict the exact behavior of an algorithm. There are
too many influencing factors.
The analysis is thus only an approximation; it is not perfect.
More importantly, by analyzing different algorithms, we can compare
them to determine the best one for our purpose.
Types of Algorithm Analysis:
1. Best case
2. Worst case
3. Average case
Best case: Define the input for which the algorithm takes the minimum
time. In the best case, we calculate the lower bound of an algorithm.
Example: In linear search, the best case occurs when the search key is
present at the first location of the data.
Worst case: Define the input for which the algorithm takes the maximum
time. In the worst case, we calculate the upper bound of an algorithm.
Example: In linear search, the worst case occurs when the search key is
not present in the data at all.
Average case: In the average case, we take all possible inputs, calculate the
computation time for each, and divide the sum by the total number of inputs:
Average case = (sum of running times over all inputs) / (total number of inputs)
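For instance, here is a minimal Python sketch of linear search (illustrative, not from the original article) that makes the three cases concrete:

# Linear search: best, worst and average cases depend on where (or whether)
# the key appears in the list.
def linear_search(arr, key):
    for i, value in enumerate(arr):   # scan elements left to right
        if value == key:
            return i                  # best case: key at index 0 -> O(1)
    return -1                         # worst case: key absent -> O(n)

data = [4, 8, 15, 16, 23, 42]
print(linear_search(data, 4))    # best case: found at the first location
print(linear_search(data, 99))   # worst case: the whole list is scanned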
"The DSA course helped me a lot in clearing the interview rounds. It was
really very helpful in setting a strong foundation for my problem-solving
skills. Really a great investment, the passion Sandeep sir has towards
DSA/teaching is what made the huge difference." - Gaurav | Placed at
Amazon
Example: Selection Sort and Insertion Sort have O(n²) time complexity.
// Recursive function
void recurse(int n)
{
    if (n <= 0)
        return;
    else {
        // some O(1) expressions
        recurse(n / c);
    }
}
// For a constant c > 1, the recurrence T(n) = T(n/c) + O(1) gives O(log n) time.
How do we combine the time complexities of consecutive loops?
When there are consecutive loops, we calculate the time complexity as the sum of
the time complexities of the individual loops.
To combine time complexities, consider the number of iterations performed by each
loop and the amount of work performed in each iteration: multiply each loop's
iteration count by the cost of one of its iterations, then add the results for loops
that run one after another (for nested loops, the counts multiply instead).
For example, consider the following nested loops:
for i in range(n):
for j in range(m):
# some constant time operation
Here, the outer loop performs n iterations, and the inner loop performs m
iterations for each iteration of the outer loop. So, the total number of
iterations performed by the inner loop is n * m, and the total time
complexity is O(n * m).
In another example, consider the following code:
for i in range(n):
for j in range(i):
# some constant time operation
Here, the outer loop performs n iterations, and the inner loop performs i
iterations for each iteration of the outer loop, where i is the current iteration
count of the outer loop. The total number of iterations performed by the
inner loop can be calculated by summing the number of iterations performed
in each iteration of the outer loop, which is given by the formula sum(i)
from i=1 to n, which is equal to n * (n + 1) / 2. Hence, the total time
complexity is O(n²).
The time complexity of two consecutive loops, one running m times and the other
running n times, is O(m) + O(n), which is O(m + n).
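As a quick illustration (a minimal sketch, not the original snippet), two consecutive loops over m and n items perform m + n units of work:

def consecutive_loops(m, n):
    ops = 0
    for i in range(m):   # first loop: O(m)
        ops += 1
    for j in range(n):   # second loop: O(n)
        ops += 1
    return ops           # m + n constant-time operations in total

print(consecutive_loops(3, 5))   # 8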
How do we calculate the time complexity of recursive functions?
The time complexity of a recursive function can be written as a
mathematical recurrence relation. To calculate time complexity, we must
know how to solve recurrences. We will soon be discussing recurrence-
solving techniques as a separate post.
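As a small illustration (not from the original post), the following function makes one recursive call on a problem of size n - 1, so its running time satisfies the recurrence T(n) = T(n - 1) + O(1), which solves to O(n):

def countdown(n):
    if n <= 0:          # base case: O(1)
        return
    countdown(n - 1)    # recurse on a problem of size n - 1

countdown(10)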
Naive Bayes Algorithms: A Complete Guide
for Beginners
Introduction
Machine learning algorithms are an essential ingredient when training and
building an intelligent model for a problem statement. Many machine learning
algorithms are widely used because of their fast and accurate results. The
Naive Bayes Classifier algorithm is also one of the best machine learning algorithms,
resulting in a precise model with less effort.
In this article, we will discuss the naive Bayes algorithms with their core intuition,
working mechanism, mathematical formulas, PROs, CONs, and other important aspects
related to the same. Also, the key takeaways discussed in the end will help one answer the
interview questions related to the NaiveBayesClassifier algorithms efficiently.
What is Probability?
To understand the Naive Bayes Classifier from scratch, it is required to understand the
term probability, as the algorithm itself works on the concept of probabilities of events. Let
us try to understand the same.
We know that the sum of all probabilities is always one. For example, if we toss a
coin, the probability of heads is 0.5 and the probability of tails is also 0.5, which means
there is an equal, 50% chance of heads and of tails on the first trial.
Now that we know the meaning of probability, the next term to understand is conditional
probability. Conditional probability is defined as the probability of some event happening
given that another event has occurred. In simple words, conditional probability is the
probability of something occurring when a condition is involved.
(Image source: MachineLearningPlus)
Now that we know these two critical terms, we are prepared to learn the Bayesian rule.
Thomas Bayes, a British mathematician, gave the Bayes theorem in 1763, which
helps calculate the probability of an event taking place under a given condition.
P(A|B) = P(B|A) × P(A) / P(B)   (Image source: Medium)
As we can see in the formula above, it comprises a total of 4 terms: the posterior P(A|B),
the likelihood P(B|A), the prior P(A), and the evidence P(B).
From this formula, we can easily calculate the probability of an event happening
under a condition if we know the likelihood of the evidence given the event, the prior
probability of the event, and the overall probability of the evidence.
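As a minimal illustration with made-up numbers, Bayes' rule can be evaluated directly in Python; here we estimate the probability of rain given that the grass is wet:

p_rain = 0.2                 # P(rain), the prior
p_wet_given_rain = 0.9       # P(wet | rain), the likelihood
p_wet = 0.35                 # P(wet), the evidence
p_rain_given_wet = p_wet_given_rain * p_rain / p_wet
print(round(p_rain_given_wet, 3))   # ~0.514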
What is Naive Bayes Algorithm?
Now is the best time to understand the naive Bayes algorithm, as the core fundamentals
are clear. In real problems, there can be many events and many conditions occurring
simultaneously. So, in this case, we expand the Bayesian theorem to solve this
type of issue. If the features are independent, we can easily extend the theorem and
calculate the probability of the same.
The same bayesian theorem formula can be used here for multiple events and conditions,
and one can easily calculate the probability with the help of the same.
The algorithm is one of the most useful algorithms in machine learning which helps in
several classification problems, sentiment analysis, face recognition, etc.
After understanding the Naive Bayes algorithm, let us try to understand the working
mechanism of the algorithm.
Suppose we try to predict whether a golf match will be played, given conditions such as
outlook, humidity, and temperature. The model takes the data as input and calculates
the probability of Yes and No with respect to all the conditions provided. If the likelihood of
Yes is higher than that of No, the model returns Yes as the output, and vice versa.
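A minimal sketch of this golf example (the integer encoding and the tiny dataset are made up for illustration) using scikit-learn's CategoricalNB, which returns the class with the higher posterior probability:

from sklearn.naive_bayes import CategoricalNB

# columns: outlook (0=sunny, 1=overcast, 2=rainy), temperature (0=hot, 1=mild, 2=cool),
#          humidity (0=high, 1=normal)
X = [[0, 0, 0], [0, 0, 0], [1, 0, 0], [2, 1, 0], [2, 2, 1], [1, 2, 1], [0, 1, 0], [2, 1, 1]]
y = [0, 0, 1, 1, 1, 1, 0, 1]      # 0 = don't play, 1 = play

model = CategoricalNB()
model.fit(X, y)
print(model.predict([[1, 1, 1]]))         # predicted class for a new day
print(model.predict_proba([[1, 1, 1]]))   # posterior probabilities for No / Yes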
What is Multicollinearity?
Multicollinearity in machine learning is a term that deals with the linear relationships among
the features of a dataset. In simple words, a dataset having correlations between its
independent features is called multicollinear.
Suppose we have a dataset with three columns: age, marks, and passed. Here, age is the
age of the students, marks is the score obtained by students in exams, and passed is a
categorical column that indicates whether a student passed or not.
Here, age and marks are the training columns, meaning these columns are fed
to the algorithm, and the passed column is the target column that the machine
learning algorithm will predict. In some cases, the age and marks columns are
correlated somehow and are no longer independent. In that case, we say the data
has Multicollinearity in its features.
The professor checking the answer sheets can be biased toward students having less age
and marks them with good numbers. Both columns are now correlated, and
Multicollinearity is present in this dataset.
One of the basic assumptions of the Naive Bayes algorithm relates to
Multicollinearity, so it is required to check whether the data has Multicollinearity.
import pandas as pd

df = pd.read_csv("data.csv")   # load the dataset
df.corr()                      # pairwise correlations between the independent features
Why is it Naive?
Now a question might appear in your mind: Why is the algorithm called naive?
The main reason behind the name of the Naive Bayes Classifier is the assumption it
makes while working on a dataset, related to the Multicollinearity discussed above.
The Naive Bayes Classifier assumes that the features of the dataset provided to the
algorithm are independent of one another and do not depend on other
factors, which is why the Naive Bayes algorithm is called Naive.
There are mainly three types of Naive Bayes algorithms. Different types of Naive
Bayes are used for different use cases. Let us try to understand them one by one.
1. Bernoulli Naive Bayes
This Naive Bayes Classifier is used when there is a boolean type of dependent or target
variable present in the dataset, for example, a dataset whose target column categories are
Yes and No.
This type of Naive Bayes is mainly used with a binary categorical target column where the
problem statement is to predict only Yes or No, for example, sentiment analysis with Positive
and Negative categories, or whether a specific word is present in the text or not.
Code Example:
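A minimal sketch (with made-up binary word-presence features) of scikit-learn's BernoulliNB on a yes/no style target:

from sklearn.naive_bayes import BernoulliNB

X = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]]   # 1 = word present, 0 = absent
y = [1, 1, 0, 0]                                   # 1 = positive, 0 = negative

model = BernoulliNB()
model.fit(X, y)
print(model.predict([[1, 0, 0]]))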
2. Multinomial Naive Bayes
This type of Naive Bayes is used where the data is multinomially distributed. It is
mainly used when there is a text classification problem.
For Example, if you want to predict whether a text belongs to which tag, education,
politics, e-tech, or some other tag, you can use the multinomial Naive Bayes Classifier to
classify the same.
This Naive Bayes variant performs particularly well on text classification problems and is
used the most out of all the Naive Bayes Classifiers.
Code Example:
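A minimal sketch (with made-up texts) of MultinomialNB on word-count features for a simple tag-classification task:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["students exam school", "election vote parliament", "phone app software"]
tags = ["education", "politics", "e-tech"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)       # multinomial word-count features

model = MultinomialNB()
model.fit(X, tags)
print(model.predict(vectorizer.transform(["school exam results"])))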
3. Gaussian Naive Bayes
This type of Naive Bayes is used when the predictor variables have continuous values instead
of discrete ones. Here it is assumed that the data follows a Gaussian (normal) distribution.
Code Example:
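A minimal sketch (with made-up continuous features) of GaussianNB:

from sklearn.naive_bayes import GaussianNB

X = [[5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [6.7, 3.1]]   # continuous predictors
y = [0, 0, 1, 1]

model = GaussianNB()
model.fit(X, y)
print(model.predict([[5.0, 3.4]]))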
1. Text Classification
The Naive Bayes algorithms are known to perform best on text classification problems. The
algorithm is mainly used when there is a problem statement related to text and its
classification. Several Naive Bayes variants are tried and tuned according to the problem
statement to obtain a more accurate model, for example, classifying the tags of a piece of
text.
2. Sentiment Analysis
Algorithms like Bernoulli Naive Bayes are used most for these sentiment analysis problems.
This algorithm performs well on binary classification problems and is hence used
most for such cases.
3. Recommendation Systems
4. Real-Time Predictions
The Naive Bayes algorithms are eager learning algorithms that learn from the training
data and estimate their parameters up front. Whenever test data is provided for
prediction, the algorithm calculates the result from the knowledge gained during
training and offers fast and accurate output. Hence it can be
used for real-time predictions.
Advantages and Disadvantages of Naive Bayes
Advantages
1. Faster Algorithms:
The Naive Bayes algorithm is a parametric algorithm that estimates its parameters
during training and uses that knowledge for prediction. Hence it takes significantly less
time for prediction and is a faster algorithm.
3. Performance:
The Naive Bayes algorithm achieves fast and accurate performance with relatively little
data, and it handles categorical text data better than many other algorithms.
Disadvantages
1. Independent Features:
In a real-time dataset, obtaining independent features that are entirely independent of each
other is almost impossible. There are typically two to three features that correlate with
each other, thus not fully satisfying the assumption at all times.
The zero frequency error is one of the most critical cons of the Naive
Bayes algorithm. If a categorical value appears in the test data but was never observed
in the training data, the Naive Bayes algorithm assigns it zero probability, which zeroes
out the whole posterior; this is known as the zero frequency problem in Naive Bayes.
The Naive Bayes algorithm is often fast and performs well compared to
other algorithms. Still, there are cases where it cannot perform well, and a
different algorithm should be used to handle such cases.
The Naive Bayes algorithm can be used if there is no multicollinearity in the independent
features and if the features’ probabilities provide some valuable information to the
algorithms.
This algorithm should also be preferred for text classification problems. One should avoid
using the Naive Bayes algorithm when the data is entirely numeric and multicollinearity is
present in the dataset.
If it is necessary to use the Naive Bayes algorithm, then one can use the following steps to
improve the performance of Naive Bayes algorithms.
2. Feature Engineering:
Try to apply feature engineering to the dataset and its features: combine some of the
features, and extract new parts out of existing ones. This may help the Naive
Bayes algorithm learn the data quickly and results in a more accurate model.
4. Probabilistic Features:
The Naive Bayes algorithm works on the concept of probabilities, so try to improve the
features that give more weight to the algorithm and its probabilities; implement
those and run the trials in a loop to find which features work best for the algorithm.
5. Laplace Smoothing:
In some cases, a category may be present in the test dataset that was not present during
training, and the model will assign it zero probability. We handle this
issue with Laplace (additive) smoothing.
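In scikit-learn this corresponds to the alpha parameter of the Naive Bayes classifiers; a minimal sketch with made-up counts:

from sklearn.naive_bayes import MultinomialNB

X = [[2, 0, 1], [0, 3, 0], [1, 1, 4]]
y = ["a", "b", "c"]

model = MultinomialNB(alpha=1.0)   # alpha=1.0 is classic Laplace (additive) smoothing
model.fit(X, y)
print(model.predict([[0, 0, 2]]))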
6. Feature Transformation:
It is always better to have normal distributions in the dataset, so try to apply the Box-Cox
and Yeo-Johnson feature transformation techniques to achieve normal distributions in the
dataset.
Conclusion
In this article, we discussed the naive Bayes algorithm, the probabilities, conditional
probabilities, the bayesian theorem, the core intuition and working mechanism of the
algorithm with their types, code examples, applications, PROs, and CONs associated with
some of the key takeaways from this article. This article’s complete knowledge will help
one understand the Naive Bayes algorithm from scratch to an in-depth level. It will help
answer the interviews related to it very efficiently.
Key Takeaways
The algorithm is an eager learning algorithm that learns during the training
phase and therefore produces results faster during the testing phase.
Zero frequency error in a Naive Bayes algorithm is where the model assigns
zero probability to the unseen categories during the prediction phase.
When there is a boolean type of target variable with two categories, the
Bernoulli Naive Bayes algorithm is used.
One can use Box-Cox and Yeo-Johnson transforms to achieve the normal
distribution of the dataset columns.
Introduction
Time Series Analysis and Time Series Forecasting are two studies
that are often used interchangeably, although there is a very
thin line between the two. Which name applies depends on whether we are analysing
and summarizing existing time-series data or predicting
future trends from it.
A quick thing worth mentioning is that the integrants are broken further into
2 types:
1. Systematic — components that can be used for predictive modelling
and occur recurrently. Level, Trend, and Seasonality come under this
category.
2. Non-Systematic — components that cannot be directly modelled. Noise
comes under this category.
The original time series data is hence split or decomposed into 5 parts-
1. Level — The most common integrant in every time series data is the
level. It is nothing but the mean or average value in the time series. It has
0 variances when plotted against itself.
2. Trend — The linear movement or drift of the time series, which may be
increasing, decreasing or neutral. Trends can be positive (increasing) or
negative (decreasing) and often appear as roughly linear slopes over
the entire range of time.
1. Weather Forecasting
2. Anomaly Forecasting
3. Sales Forecasting
5. ECG Analysis
6. Risk Analysis
When the time series trend is a linear relationship between integrants, i.e.,
the frequency (width) and amplitude (height) of the series stay the same,
the additive rule is applied.
It is represented as: y(t) = Level + Trend + Seasonality + Noise
Forecasting
This will contain various detailed topics to ensure that readers at the end
will know how to-
Statistics to Explore it
For the easy and quick understanding and analysis of time-series data, we
will work on the famous toy dataset named ‘Daily Female Births Dataset’.
import numpy as np
import pandas as pd
import statsmodels
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('daily-total-female-births-in-cal.csv', parse_dates=True, header=0,
                   squeeze=True)
data.head()
1959–01–01 35
1959–01–02 32
1959–01–03 30
1959–01–04 31
1959–01–05 44
Name: Daily total female births in California, 1959, dtype: int64
print(data.size) #output-365
print(data.describe())
Output —
count 365.000000
mean 41.980822
std 7.348257
min 23.000000
25% 37.000000
50% 42.000000
75% 46.000000
max 73.000000
plt.plot(data)
plt.show()
Modelling
A normalized data scales the numeric features in the training data in the
range of 0 and 1 so that gradient descent and loss optimization is fast and
efficient and converges quickly to the local minima. Interchangeably known
as feature scaling, it is crucial for any ML problem statement.
Let’s see how we can achieve normalization in time-series data.
For this purpose, let’s pick a highly fluctuating time-series data — the
minimum daily temperatures data. Grab it here!
(Data source: DataMarket)
Note — In our case, our data does not have outliers present and hence a
MinMaxScaler solves the purpose well. In the case where you have an
unsupervised learning approach, and your data contains outliers, it is
better to go for standardization, which is more robust than normalization,
as normalization scales the data close to the mean which doesn’t handle
or include outliers leading to a poor model. Standardization, on the other
hand, takes large intervals with a standard deviation value of 1 and a
mean of 0, thus outlier handling is robust.
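A minimal sketch of the normalization step, assuming the minimum daily temperatures CSV is saved as 'daily-min-temperatures.csv' with a 'Temp' column (both names are assumptions):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

series = pd.read_csv('daily-min-temperatures.csv', header=0, index_col=0)
scaler = MinMaxScaler(feature_range=(0, 1))        # scale values into [0, 1]
scaled = scaler.fit_transform(series[['Temp']])    # assumed column name
series['Temp'] = scaled[:, 0]
print(series.head())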
Framing Time Series as a Supervised Learning Problem (Feature Engineering)
Framing data into a supervised learning problem simply deals with the task
of handling and extracting useful features and discarding irrelevant
features to make the model robust and cost-efficient.
We already know that supervised learning problems have 2 types of
features — the independent features (x) and the dependent/target feature (y). Hence, how
well the target value is predicted depends on how well we choose and
engineer the independent features.
You must know by now that time-series data has two columns, timestamp,
and its respective value. So, it is very self-explanatory that in the time
series problem, the independent feature is time and the dependent feature
is value.
Now let us look at what are the features that need to be engineered into
these input and output values so that the inherent relationship between
these two variables is established to make the forecasting as good as
possible.
Let the value at timestamp t-1 be x and the value at t-2 be y; we take the average of
x and y to predict the value at the next timestamp. The rolling window hence
takes the mean of 2 values to predict the 3rd value. After that is done, the
window shifts to the next set of values, and the mean is calculated
for each window of 2 values. We use rolling window statistics
more often when recent data matters more for forecasting than
older data.
Let’s see how we can calculate moving or rolling average with a rolling
window —
tshifts = data.shift(1)                    # data: the temperature series loaded earlier
rwin = tshifts.rolling(window=2)           # rolling window of width 2
joined_df2 = pd.concat([rwin.mean(), data.shift(-1)], axis=1)
joined_df2.columns = ['mean', 't+1']
print(joined_df2.head(5))
Let’s have a look at what we got -
Can you recall that, of the two time-series datasets we worked on, the
childbirths data had no trend or seasonality and is stationary, whereas
the average daily temperatures data has a seasonality factor and drifts,
and hence is non-stationary and harder to model?
Now that we know what stationarity in time series is, how can we check for
the same?
So, we take stationary data, which is the handy childbirths data we worked
on earlier. However, for the non-stationary data, let’s take the famous
airline-passenger data, which is simply the number of airline passengers
per month, and prove how they are stationary and non-stationary.
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('daily-total-female-births.csv', parse_dates=True, header=0,
                   squeeze=True)
data.hist()
plt.show()
Output —
As I said, vision! Look how the visualization itself speaks that it’s a
Gaussian Distribution. Hence, stationary!
X = data.values
seq = round(len(X) / 2)
x1, x2 = X[0:seq], X[seq:]
meanx1, meanx2 = x1.mean(), x2.mean()
varx1, varx2 = x1.var(), x2.var()
print('meanx1=%f, meanx2=%f' % (meanx1, meanx2))
print('variancex1=%f, variancex2=%f' % (varx1, varx2))
Output —
meanx1=39.763736, meanx2=44.185792
variancex1=49.213410, variancex2=48.708651
The mean and variances linger around each other, which clearly shows
the data is invariant and hence, stationary! Great.
Output —
The graph pretty much gives a seasonal taste. Moreover, it is too distorted
for a Gaussian tag. Let’s now quickly get the mean-variance gaps.
X = data.values
seq = round(len(X) / 2)
x1, x2 = X[0:seq], X[seq:]
meanx1, meanx2 = x1.mean(), x2.mean()
varx1, varx2 = x1.var(), x2.var()
print('meanx1=%f, meanx2=%f' % (meanx1, meanx2))
print('variancex1=%f, variancex2=%f' % (varx1, varx2))
Output —
meanx1=182.902778, meanx2=377.694444
variancex1=2244.087770, variancex2=7367.962191
Alright, the gap between the mean and variance values is self-explanatory
enough to identify this series as non-stationary.
Series Forecasting
ARMA
Let’s see how both the AR and MA models perform on the International-
Airline-Passengers data.
AR model
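A minimal AR sketch mirroring the MA snippet below, using the same (older) statsmodels ARIMA API; indexedDataset_logScale and datasetLogDiffShifting are assumed to be the log-scaled series and its differenced, shifted version prepared earlier:

from statsmodels.tsa.arima_model import ARIMA
import matplotlib.pyplot as plt

AR_model = ARIMA(indexedDataset_logScale, order=(2, 1, 0))
AR_results = AR_model.fit(disp=-1)
plt.plot(datasetLogDiffShifting)
plt.plot(AR_results.fittedvalues, color='red')
plt.title('RSS: %.4f' % sum((AR_results.fittedvalues - datasetLogDiffShifting['#Passengers'])**2))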
MA Model
MA_model = ARIMA(indexedDataset_logScale, order=(0,1,2))
MA_results = MA_model.fit(disp=-1)
plt.plot(datasetLogDiffShifting)
plt.plot(MA_results.fittedvalues, color='red')
plt.title('RSS: %.4f'%sum((MA_results.fittedvalues -
datasetLogDiffShifting['#Passengers'])**2))
ARIMA
Along with the combined use of the AR and MA models described earlier,
ARIMA uses a special concept of Integration (I) with the purpose of
differencing some observations in order to make non-stationary data
stationary, for better forecasting. So, it's an improvement over its
predecessor ARMA, which could only handle stationary data.
What the differencing factor does is, that it takes into account the
difference in predicted values between two timestamps (t and t+1, for
example). Doing this helps in achieving a constant mean rather than a
highly fluctuating ‘non-stationary’ mean.
Let’s fit the same data with ARIMA and see how well it performs!
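A minimal sketch of that fit, reusing the names from the AR and MA snippets above and combining their orders into ARIMA(2, 1, 2):

ARIMA_model = ARIMA(indexedDataset_logScale, order=(2, 1, 2))
ARIMA_results = ARIMA_model.fit(disp=-1)
plt.plot(datasetLogDiffShifting)
plt.plot(ARIMA_results.fittedvalues, color='red')
plt.title('RSS: %.4f' % sum((ARIMA_results.fittedvalues - datasetLogDiffShifting['#Passengers'])**2))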
Great! The graph itself speaks how ARIMA fits our data in a well and
generalized fashion compared to the ARMA! Also, observe how the RSS
has dropped to 1.0292 from 1.5023 or 1.4721.
SARIMAX
They are:
1. Seasonal Autoregressive (SAR) terms
2. Seasonal Differencing
3. Seasonal Moving Average (SMA) terms
4. Seasonal Periodicity
(Source: Wikipedia)
If you are more of a theory conscious person like me, do read more on
this here, as getting into the details of the formula is beyond the scope of
this article!
Now, let’s see how well SARIMAX performs on seasonal time-series data
like the International-Airline-Passengers data.
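A minimal sketch using statsmodels' SARIMAX with a 12-month seasonal period on the same assumed log-scaled series (the orders shown are illustrative starting points):

import statsmodels.api as sm

SARIMAX_model = sm.tsa.statespace.SARIMAX(indexedDataset_logScale,
                                          order=(2, 1, 2),
                                          seasonal_order=(1, 1, 1, 12))
SARIMAX_results = SARIMAX_model.fit(disp=False)
print(SARIMAX_results.summary())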
Forecasting
(Source: Medium)
RNNs were designed to work on sequential data like time series. However, a
very notable pitfall of the RNN is that it cannot handle long-term
dependencies. For a problem where you want to forecast a time series
based on a huge number of previous records, an RNN forgets most of
the records that occurred much earlier and only learns from the
sequences of recent data fed to its neural network. So, RNNs were observed
to not be up to the mark for NSP (Next Sequence Prediction) tasks in NLP
and time series.
Note — Since explaining the architecture of LSTM will be beyond the size
of this blog, I recommend you to head over to my article where I explained
LSTM in detail!
Let us now take our Airline Passengers’ data and see how well RNN and
LSTM work on it!
Imports —
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import sklearn.preprocessing
from sklearn.metrics import r2_score
from keras.layers import Dense, Dropout, SimpleRNN, LSTM
from keras.models import Sequential
minmax_scaler = sklearn.preprocessing.MinMaxScaler()
data['Passengers'] = minmax_scaler.fit_transform(data['Passengers'].values.reshape(-1,1))
data.head()
Scaled data —
split = int(len(data['Passengers'])*0.8)
# x and y are assumed to be the sliding input windows (length 20) and their
# targets, built from the scaled series in an earlier (omitted) step
x_train, y_train, x_test, y_test = np.array(x[:split]), np.array(y[:split]), \
                                   np.array(x[split:]), np.array(y[split:])
# reshaping data to (samples, timesteps, features)
x_train = np.reshape(x_train, (split, 20, 1))
x_test = np.reshape(x_test, (x_test.shape[0], 20, 1))
RNN Model —
model = Sequential()
model.add(SimpleRNN(40, activation="tanh", return_sequences=True,
input_shape=(x_train.shape[1],1)))
model.add(Dropout(0.15))
model.add(SimpleRNN(50, return_sequences=True, activation="tanh"))
model.add(Dropout(0.1)) #remove overfitting
model.add(SimpleRNN(10, activation="tanh"))
model.add(Dense(1))
model.summary()
model.compile(optimizer="adam", loss="MSE")
model.fit(x_train, y_train, epochs=15, batch_size=50)
preds = model.predict(x_test)
Let me show you a picture of how well the model predicts —
Pretty much accurate!
LSTM Model —
model = Sequential()
model.add(LSTM(100, activation="relu", return_sequences=True,
input_shape=(x_train.shape[1], 1)))
model.add(Dropout(0.2))
model.add(LSTM(80, activation="relu", return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(50, activation="relu", return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(30, activation="relu"))
model.add(Dense(1))
model.summary()
Compile it, fit it and predict —
model.compile(optimizer="adam", loss="MSE")
model.fit(x_train, y_train, epochs=15, batch_size=50)
preds = model.predict(x_test)
Let me show you a picture of how well the model predicts —
Here, we can easily observe that the RNN does the job better than the LSTM. It
is clearly seen that the LSTM works great on the training data but poorly
on validation/test data, which is a sign of overfitting!
Hence, try to use LSTMs only where there is a need for long-term
dependency learning; otherwise, an RNN works well enough.
Conclusion
Cheers on reaching the end of the guide and learning pretty interesting
kinds of stuff about Time Series. From this guide, you successfully learned
the basics of time series, got a brief idea of the difference between Time
Series Analysis and Forecasting subdomains of Time Series, a crisp
mathematical intuition on Time Series analysis and forecasting techniques
and explored how to work on Time Series problems in Machine Learning
and Deep Learning to solve complex problems.
Hope you had fun exploring Time Series with Machine Learning and Deep
Learning along with intuition! If you are a curious learner and want to “not”
stop learning more, head over to this awesome notebook on time series
provided by TensorFlow!
Feel free to follow me on Medium and GitHub for more articles and
notebooks on Machine & Deep Learning! Connect with me on LinkedIn if
you want to discuss anything regarding this article!
Happy Learning!
A Complete Guide on Hough
Transform
Introduction
This article will provide you with in-depth knowledge about the Hough
transform. It will give you a basic introduction to the Hough transform, how
exactly it works, and the math behind it. It also enlists the merits and
demerits of Hough Transform algorithm and its various applications.
Learning Outcomes:
History
Duda, R. O., and P. E. Hart, "Use of the Hough Transform to Detect Lines and Curves in
Pictures," Comm. ACM, Vol. 15, pp. 11–15 (January 1972) introduced the form used
today, although the underlying idea had been standard for the Radon transform since
the 1930s. Frank O'Gorman and M. B. Clowes discuss their version in "Finding Picture
Edges through Collinearity of Feature Points," IEEE Transactions on Computers, Vol. 25,
No. 4, pp. 449–456 (1976). Hart, P. E., "How the Hough Transform Was Invented," IEEE
Signal Processing Magazine, Vol. 26, Issue 6, pp. 18–22 (November 2009) tells the tale
of how the contemporary form of the Hough transform was invented.
Why is it Needed?
In many circumstances, a pre-processing stage can use an edge detector
to obtain picture points or pixels on the required curve in the image space.
However, there may be missing points or pixels on the required curves
due to flaws in either the image data or the edge detector and spatial
variations between the ideal line/circle/ellipse and the noisy edge points
acquired by the edge detector. As a result, grouping the extracted edge
characteristics into an appropriate collection of lines, circles, or ellipses is
frequently difficult.
Figure 1: Original image of a lane.
Here, a and b are the circle's center coordinates, and r is the radius. The
algorithm’s computing complexity increases because we now have three
coordinates in the parameter space and a 3-D accumulator. (In general,
the number of parameters increases the calculation and the size of the
accumulator array polynomially.) As a result, the fundamental Hough
approach described here only applies to straight lines.
Algorithm
Problem: Given set of points, use Hough transform to join these points.
A(1,4), B(2,3) ,C(3,1) ,D(4,1) ,E(5,0)
Solution:
Every point (x, y) maps to the line b = -ax + y in parameter space. For point A(1,4),
substituting x = 1 and y = 4 gives b = -a + 4. The following table shows the equations
for all the given points.
Now take a = 0 and a = 1 and find the corresponding b value for each of the five
equations.
Point    Equation       b at a = 0   New point (a, b)   b at a = 1   New point (a, b)
A(1,4)   b = -a + 4     4            (0, 4)             3            (1, 3)
B(2,3)   b = -2a + 3    3            (0, 3)             1            (1, 1)
C(3,1)   b = -3a + 1    1            (0, 1)             -2           (1, -2)
D(4,1)   b = -4a + 1    1            (0, 1)             -3           (1, -3)
E(5,0)   b = -5a + 0    0            (0, 0)             -5           (1, -5)
Let us plot the new point on the graph as given below in figure 6.
We can see that almost all line crosses each other at a point (-1,5). So
here now a=-1 and b =5.
Now let’s put these values in the y=ax+b equation so we get y=-1x+5 so
y=-x+5 is the line equation that will link all the edges.
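The voting idea behind this worked example can be sketched in a few lines of Python (a minimal illustration of the accumulator, not an excerpt from the article): each point votes for every (a, b) cell it could lie on, and the cell with the most votes gives the line.

import numpy as np

points = [(1, 4), (2, 3), (3, 1), (4, 1), (5, 0)]   # the five edge points above

a_values = np.arange(-3, 4)        # candidate slopes
b_values = np.arange(-10, 11)      # candidate intercepts
accumulator = np.zeros((len(a_values), len(b_values)), dtype=int)

for x, y in points:
    for i, a in enumerate(a_values):
        b = y - a * x              # b = -ax + y
        j = np.searchsorted(b_values, b)
        if 0 <= j < len(b_values) and b_values[j] == b:
            accumulator[i, j] += 1

i, j = np.unravel_index(accumulator.argmax(), accumulator.shape)
print("Best fit: y = {}x + {}".format(a_values[i], b_values[j]))   # y = -1x + 5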
3D Applications
Object Recognition
It describes a method for detecting objects utilizing the GHT and color
similarity between homogeneous portions of the item. It takes as input
already segmented areas of homogenous hue. According to this research,
it resists changes in light, occlusion, and distortion of the segmentation
output. It can distinguish items that have been rotated, scaled, and even
moved around in a complicated environment.
Object Tracking
Underwater Application
It offers a technique for detecting geometrical forms in underwater robot
images. Unlike the traditional Generalized Hough Transform (GHT), it
transforms the recognition problem into a bounded error estimation
problem. An autonomous underwater vehicle (AUV) creates a system for
visually directed operations underwater. The Hough Transform identifies
underwater pipelines, and their orientations are determined using binary
morphology.
Various industrial and commercial uses utilize the Hough Transform (HT)
and its various versions. For instance, in an uncrewed aerial vehicle (UAV)
surveillance and inspection system, a knowledge-based power line
identification approach is proposed. Before applying the Line Hough
Transform (LHT), researchers construct a pulse-coupled neural network
(PCNN) filter to eliminate background noise from picture frames. They
then refine the findings using knowledge-based line clustering.
Medical Application
Unconventional Application
Conclusion
The Hough transform is a robust image processing technique for detecting
geometric shapes in noisy and occluded images. By transforming points
from image space to parameter space, it identifies local maxima in an
accumulator array to detect shapes like lines, circles, and ellipses. Despite
computational intensity and parameter sensitivity, its global detection
capabilities make it invaluable in applications such as lane recognition,
medical imaging, and industrial inspection. Parallel computing
advancements promise to enhance its efficiency for real-time use,
solidifying its role as a fundamental computer vision tool.
Key Takeaways:
x=PX
Here x denotes 2-D image point, P denotes camera matrix and X denotes
3-D world point.
The extrinsic parameters describe the orientation and location of the camera. They refer
to the rotation and translation of the camera with respect to some world
coordinate system.
Place the camera and calibration object on a flat surface with the
camera back and calibration object parallel and the object roughly in
the center of the camera’s vision. To acquire a good alignment, you
may need to lift the camera or object.
Take a picture to ensure that the setup is straight, which means that
the sides of the calibration object align with the image’s rows and
columns.
In pixels, determine the width and height of the object. Let us refer
to these as dx and dy.
Focal Length
Figure 3: Normal camera calibration setup. The left image shows an image
of the setup; the right side image displays calibration. The focal length can
be determined by measuring the width and height of the calibration object
in the image, as well as the physical measurements of the setup.
For the exact setup in Figure 4-3, the object was measured to be 130 by
185 mm, hence dX = 130 and dY = 185. The camera’s distance from the
object was 460 mm, hence dZ = 460. Only the ratios of the measurements
matter; any unit of measurement can be utilized. After using ginput() to
choose four points in the image, the width and height in pixels were 722
and 1040, respectively. In other words, dx = 722 and dy = 1040. When
these numbers are used in the aforementioned relationship, the result is
fx = dZ·dx/dX ≈ 2555 and fy = dZ·dy/dY ≈ 2586 pixels, which are the focal-length
values used in the function below.
from numpy import diag

def my_calibration(sz):
    row, col = sz
    fx = 2555 * col / 2592
    fy = 2586 * row / 1936
    K = diag([fx, fy, 1])
    K[0, 2] = 0.5 * col
    K[1, 2] = 0.5 * row
    return K
After that, this function takes a size tuple and returns the calibration
matrix. The optical center is assumed to be the image’s center in this
case. Replace the focal lengths with their mean if you like; for most
consumer cameras, this is fine. It should be noted that the calibration is
only for photographs in landscape orientation.
Calibration techniques for the pinhole camera model and the fisheye
camera model are included in the Computer Vision ToolboxTM. The
fisheye variant is compatible with cameras with a field of view (FOV) of
up to 195 degrees.
The pinhole calibration algorithm is based on Jean-Yves Bouguet’s [3]
model. The pinhole camera model and lens distortion are included in the
model. Because an ideal pinhole camera does not have a lens, the pinhole
camera model does not account for lens distortion. To accurately simulate
a genuine camera, the algorithm’s whole camera model incorporates radial
and tangential lens distortion.
The pinhole model cannot model a fisheye camera due to the high
distortion produced by a fisheye lens.
A pinhole camera is a basic camera model without a lens. Light rays pass
through the aperture and project an inverted image on the opposite side of
the camera. Visualize the virtual image plane in front of the camera and
assume that it is containing the upright image of the scene.
The camera matrix is a 4-by-3 matrix that represents the pinhole camera
specifications. This matrix maps the 3-D world scene into the image
plane. Using the extrinsic and intrinsic
parameters, the calibration algorithm computes the camera matrix. The
extrinsic parameters represent the camera’s position in the 3-D scene. The
intrinsic characteristics represent the camera’s optical center and focal
length.
The world points are transformed to camera coordinates using the
extrinsic parameters. Intrinsic parameters are used to map the camera
coordinates into the image plane.
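A minimal numerical sketch of that pipeline, borrowing the focal lengths and the 460 mm distance from the calibration example above (the rotation, translation, and world point are made-up values):

import numpy as np

K = np.array([[2555.0, 0.0, 1296.0],
              [0.0, 2586.0, 968.0],
              [0.0, 0.0, 1.0]])                  # intrinsics: focal lengths and optical centre
R = np.eye(3)                                    # extrinsic rotation (camera aligned with world)
t = np.array([[0.0], [0.0], [460.0]])            # extrinsic translation (460 mm in front)
X = np.array([[65.0], [92.5], [0.0], [1.0]])     # homogeneous 3-D world point

x = K @ np.hstack([R, t]) @ X                    # project: x = K [R|t] X
x = x / x[2]                                     # normalise homogeneous coordinates
print(x[:2].ravel())                             # pixel coordinates (u, v)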
Radial Distortion
This sort of distortion is caused by unequal light bending. The rays bend
more at the lens’s borders than they do at the lens’s center. Straight lines
in the actual world appear curved in the image due to radial distortion.
Before hitting the image sensor, the light ray is shifted radially inward or
outward from its optimal point. The radial distortion effect is classified into
two types.
Tangential Distortion
The image above is an example of the distortion effect that a lens can
produce. The figure corresponds to figure 1 and is a barrel distortion
effect, which is a form of the radial distortion effect. Which two points
would you consider if you were asked to find the correct door height?
Things get considerably more challenging when executing SLAM or
developing an augmented reality application using cameras that have a
large distortion effect in the image.
Removing Distortion
import cv2
import numpy as np
import os
import glob
# Defining the dimensions of checkerboard
CHECKERBOARD = (6,9)
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001)
# Creating vector to store vectors of 3D points for each checkerboard image
objpoints = []
# Creating vector to store vectors of 2D points for each checkerboard image
imgpoints = []
# Defining the world coordinates for 3D points
objp = np.zeros((1, CHECKERBOARD[0] * CHECKERBOARD[1], 3), np.float32)
objp[0,:,:2] = np.mgrid[0:CHECKERBOARD[0], 0:CHECKERBOARD[1]].T.reshape(-1, 2)
prev_img_shape = None
# Extracting path of individual image stored in a given directory
images = glob.glob('./images/*.jpg')
for fname in images:
    img = cv2.imread(fname)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Find the chess board corners
    # If desired number of corners are found in the image then ret = true
    ret, corners = cv2.findChessboardCorners(gray, CHECKERBOARD,
                       cv2.CALIB_CB_ADAPTIVE_THRESH + cv2.CALIB_CB_FAST_CHECK +
                       cv2.CALIB_CB_NORMALIZE_IMAGE)
    """
    If desired number of corner are detected,
    we refine the pixel coordinates and display
    them on the images of checker board
    """
    if ret == True:
        objpoints.append(objp)
        # refining pixel coordinates for given 2d points.
        corners2 = cv2.cornerSubPix(gray, corners, (11,11), (-1,-1), criteria)
        imgpoints.append(corners2)
        # Draw and display the corners
        img = cv2.drawChessboardCorners(img, CHECKERBOARD, corners2, ret)
    cv2.imshow('img', img)
    cv2.waitKey(0)
cv2.destroyAllWindows()
h,w = img.shape[:2]
"""
Performing camera calibration by
passing the value of known 3D points (objpoints)
and corresponding pixel coordinates of the
detected corners (imgpoints)
"""
ret, mtx, dist, rvecs, tvecs = cv2.calibrateCamera(objpoints, imgpoints, gray.shape[::-1],
None, None)
print("Camera matrix : n")
print(mtx)
print("dist : n")
print(dist)
print("rvecs : n")
print(rvecs)
print("tvecs : n")
print(tvecs)
#include <opencv2/opencv.hpp>
#include <opencv2/calib3d/calib3d.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <stdio.h>
#include <iostream>
// Defining the dimensions of checkerboard
int CHECKERBOARD[2]{6,9};
int main()
{
// Creating vector to store vectors of 3D points for each checkerboard image
std::vector<std::vector<cv::Point3f> > objpoints;
// Creating vector to store vectors of 2D points for each checkerboard image
std::vector<std::vector<cv::Point2f> > imgpoints;
// Defining the world coordinates for 3D points
std::vector<cv::Point3f> objp;
for(int i{0}; i<CHECKERBOARD[1]; i++)
{
for(int j{0}; j<CHECKERBOARD[0]; j++)
objp.push_back(cv::Point3f(j,i,0));
}
// Extracting path of individual image stored in a given directory
std::vector<cv::String> images;
// Path of the folder containing checkerboard images
std::string path = "./images/*.jpg";
cv::glob(path, images);
cv::Mat frame, gray;
// vector to store the pixel coordinates of detected checker board corners
std::vector<cv::Point2f> corner_pts;
bool success;
// Looping over all the images in the directory
for(int i{0}; i<images.size(); i++)
{
frame = cv::imread(images[i]);
cv::cvtColor(frame,gray,cv::COLOR_BGR2GRAY);
// Finding checker board corners
// If desired number of corners are found in the image then success = true
success = cv::findChessboardCorners(gray, cv::Size(CHECKERBOARD[0],
CHECKERBOARD[1]), corner_pts, CV_CALIB_CB_ADAPTIVE_THRESH |
CV_CALIB_CB_FAST_CHECK | CV_CALIB_CB_NORMALIZE_IMAGE);
/*
* If desired number of corner are detected,
* we refine the pixel coordinates and display
* them on the images of checker board
*/
if(success)
{
cv::TermCriteria criteria(CV_TERMCRIT_EPS | CV_TERMCRIT_ITER, 30, 0.001);
// refining pixel coordinates for given 2d points.
cv::cornerSubPix(gray,corner_pts,cv::Size(11,11), cv::Size(-1,-1),criteria);
// Displaying the detected corner points on the checker board
cv::drawChessboardCorners(frame, cv::Size(CHECKERBOARD[0],
CHECKERBOARD[1]), corner_pts, success);
objpoints.push_back(objp);
imgpoints.push_back(corner_pts);
}
cv::imshow("Image",frame);
cv::waitKey(0);
}
cv::destroyAllWindows();
cv::Mat cameraMatrix,distCoeffs,R,T;
/*
* Performing camera calibration by
* passing the value of known 3D points (objpoints)
* and corresponding pixel coordinates of the
* detected corners (imgpoints)
*/
cv::calibrateCamera(objpoints, imgpoints, cv::Size(gray.rows,gray.cols), cameraMatrix,
distCoeffs, R, T);
std::cout << "cameraMatrix : " << cameraMatrix << std::endl;
std::cout << "distCoeffs : " << distCoeffs << std::endl;
std::cout << "Rotation vector : " << R << std::endl;
std::cout << "Translation vector : " << T << std::endl;
return 0;
}
Introduction
Segmentation Tasks
Fixed Input Size: CNN architectures are often built to accept images of a
specific size. However, the input images might have various dimensions in
segmentation tasks, making variable-sized inputs challenging to manage
with typical CNNs.
Image Segmentation
Overcomes Challenges
The UNET architecture was developed to address these limitations and
overcome the challenges faced by traditional approaches to image
segmentation. Here’s how UNET tackles these issues:
Convolutional Layers
Activation Function
Pooling Layers
Pooling layers are used after the convolutional layers to reduce the spatial
dimensionality of the feature maps. The operations, such as max pooling,
divide feature maps into non-overlapping regions and keep only the
maximum value inside each zone. It reduces the spatial resolution by
down-sampling feature maps, allowing the network to capture more
abstract and higher-level data.
The encoding path’s job is to capture features at various scales and levels
of abstraction in a hierarchical manner. The encoding process focuses on
extracting global context and high-level information as the spatial
dimensions decrease.
Skip Connections
Feature maps from prior layers collect local details and fine-grained
information during the encoding path. These feature maps are
concatenated with the upsampled feature maps in the decoding pipeline
utilizing skip connections. This allows the network to incorporate multi-
scale data, low-level features and high-level context into the segmentation
process.
To boost the spatial resolution of the feature maps, the UNET decoding
method includes upsampling layers, frequently done using transposed
convolutions or deconvolutions. Transposed convolutions are essentially
the opposite of regular convolutions. They enhance spatial dimensions
rather than decrease them, allowing for upsampling. By constructing a
sparse kernel and applying it to the input feature map, transposed
convolutions learn to upsample the feature maps. The network learns to fill
in the gaps between the current spatial locations during this process, thus
boosting the resolution of the feature maps.
Concatenation
The feature maps from the preceding layers are concatenated with the
upsampled feature maps during the decoding phase. This concatenation
enables the network to aggregate multi-scale information for correct
segmentation, leveraging high-level context and low-level features. Aside
from upsampling, the UNET decoding path includes skip connections from
the encoding path’s comparable levels.
UNET can boost the spatial resolution of the feature maps by using
transposed convolutions in the decoding process, thereby upsampling
them to match the original image size. Transposed convolutions assist the
network in generating a dense and fine-grained segmentation mask by
learning to fill in the gaps and expand the spatial dimensions.
The UNET design captures global context and local details by combining
contracting and expanding pathways. The contracting path compresses
the input image into a compact representation; the expanding path is concerned with
decoding that compressed representation into a dense and precise
segmentation map. It reconstructs the missing spatial information and
refines the segmentation results. This encoder-decoder structure enables
precision segmentation using high-level context and fine-grained spatial
information.
Skip connections are essential to the UNET design because they allow
information to travel between the contracting (encoding) and expanding
(decoding) paths. They are critical for maintaining spatial information and
improving segmentation accuracy.
Some spatial information may be lost during the encoding path as the
feature maps undergo downsampling procedures such as max pooling.
This information loss can lead to lower localization accuracy and a loss of
fine-grained details in the segmentation mask.
Details
Cross-Entropy Loss
The choice between the Dice coefficient loss and cross-entropy loss
depends on the segmentation task’s specific requirements and the
dataset’s characteristics. Both loss functions have advantages and can be
combined or customized based on specific needs.
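A minimal TensorFlow sketch of a Dice coefficient loss for binary masks (the smoothing term is an assumption to avoid division by zero); it can be combined with binary cross-entropy if desired:

import tensorflow as tf

def dice_loss(y_true, y_pred, smooth=1.0):
    y_true = tf.cast(tf.reshape(y_true, [-1]), tf.float32)
    y_pred = tf.cast(tf.reshape(y_pred, [-1]), tf.float32)
    intersection = tf.reduce_sum(y_true * y_pred)
    dice = (2.0 * intersection + smooth) / (tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + smooth)
    return 1.0 - dice

y_true = tf.constant([[1.0, 0.0], [1.0, 1.0]])
y_pred = tf.constant([[0.9, 0.1], [0.8, 0.7]])
print(dice_loss(y_true, y_pred).numpy())   # ~0.108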
1: Importing Libraries
import tensorflow as tf
import os
import numpy as np
from tqdm import tqdm
from skimage.io import imread, imshow
from skimage.transform import resize
import matplotlib.pyplot as plt
import random
IMG_WIDTH = 128
IMG_HEIGHT = 128
IMG_CHANNELS = 3
seed = 42
np.random.seed(seed)
Subfolder
train_ids = next(os.walk(TRAIN_PATH))[1]
test_ids = next(os.walk(TEST_PATH))[1]
6: Training
Y_train[n] = mask
# test images
X_test = np.zeros((len(test_ids), IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS), dtype=np.uint8)
sizes_test = []
print('Resizing test images')
for n, id_ in tqdm(enumerate(test_ids), total=len(test_ids)):
    path = TEST_PATH + id_
    img = imread(path + '/images/' + id_ + '.png')[:,:,:IMG_CHANNELS]
    sizes_test.append([img.shape[0], img.shape[1]])
    img = resize(img, (IMG_HEIGHT, IMG_WIDTH), mode='constant',
                 preserve_range=True)
    X_test[n] = img
print('Done!')
11: Paths
#Contraction path
c1 = tf.keras.layers.Conv2D(16, (3, 3), activation='relu',
kernel_initializer='he_normal', padding='same')(s)
c1 = tf.keras.layers.Dropout(0.1)(c1)
c1 = tf.keras.layers.Conv2D(16, (3, 3), activation='relu',
kernel_initializer='he_normal', padding='same')(c1)
p1 = tf.keras.layers.MaxPooling2D((2, 2))(c1)
c2 = tf.keras.layers.Conv2D(32, (3, 3), activation='relu',
kernel_initializer='he_normal', padding='same')(p1)
c2 = tf.keras.layers.Dropout(0.1)(c2)
c2 = tf.keras.layers.Conv2D(32, (3, 3), activation='relu',
kernel_initializer='he_normal', padding='same')(c2)
p2 = tf.keras.layers.MaxPooling2D((2, 2))(c2)
14: Summary
checkpointer = tf.keras.callbacks.ModelCheckpoint('model_for_nuclei.h5',
verbose=1, save_best_only=True)
callbacks = [
tf.keras.callbacks.EarlyStopping(patience=2, monitor='val_loss'),
tf.keras.callbacks.TensorBoard(log_dir='logs')]
Conclusion
What’s the first thing you do when attempting to cross the road? We
typically look left and right, take stock of the vehicles on the road, and
decide. In milliseconds, our brain can analyze what kind of vehicle (car,
bus, truck, auto, etc.) is approaching us. Can machines do that?
The answer was an emphatic ‘no’ until a few years back. However, the rise
and advancements in computer vision have changed the game. We can
build computer vision models that can detect objects, determine their
shape, predict the direction they will go in, and many other things. You
might have guessed it—that’s the powerful technology behind self-driving
cars!
There are multiple ways of dealing with computer vision challenges. The
most popular approach I have encountered is based on identifying the
objects present in an image, aka object detection. But what if we want to
dive deeper? What if just detecting objects isn’t enough—we want to
analyze our image at a much more granular level?
As data scientists, we are always curious to dig deeper into the data.
Asking questions like these is why I love working in this field!
Process: Assigns a label to each pixel in the image. Pixels with the
same label share certain properties, like color or brightness.
Benefits:
That’s where image localization comes into the picture (no pun intended!).
It helps us identify a single object’s location in the given image. We rely on
object detection (OD) if we have multiple objects present. We can predict
the location and class for each object using OD.
Before detecting the objects and even before classifying the image, we
need to understand what it consists of. Enter Image Segmentation.
We can divide or partition the image into various parts called segments.
It’s not a great idea to process the entire image at the same time, as there
will be regions in the image that do not contain any information. By
dividing the image into segments, we can use the important segments to
process the image. That, in a nutshell, is how image segmentation works.
The shape of the cancerous cells plays a vital role in determining the
severity of the cancer. You might have put the pieces together, but object
detection will not be very useful here. We will only generate bounding
boxes, which will not help us identify the shape of the cells.
Here, we can see the shapes of all the cancerous cells. There are many
other applications where the Image segmentation algorithm is transforming
industries:
Region-based Segmentation
One simple way to segment different objects could be to use their pixel
values. An important point to note – the pixel values will be different for the
objects and the image’s background if there’s a sharp contrast between
them.
In this case, we can set a threshold value. The pixel values falling below or
above that threshold can be classified accordingly (as objects or
backgrounds). This technique is known as Threshold Segmentation.
If we want to divide the image into two regions (object and background),
we define a single threshold value. This is known as the global threshold.
from skimage.color import rgb2gray
import numpy as np
import cv2
import matplotlib.pyplot as plt
%matplotlib inline

image = plt.imread('1.jpeg')
image.shape
plt.imshow(image)

gray = rgb2gray(image)
gray.shape
(192, 263)
The height and width of the image are 192 and 263, respectively. We will
take the mean of the pixel values and use that as a threshold. If the pixel
value exceeds our threshold, we can say it belongs to an object. The pixel
value will be treated as the background if it is less than the threshold. Let’s
code this:
gray_r = gray.reshape(gray.shape[0]*gray.shape[1])
threshold = gray_r.mean()          # mean pixel value as the threshold
for i in range(gray_r.shape[0]):
    if gray_r[i] > threshold:
        gray_r[i] = 1
    else:
        gray_r[i] = 0
gray = gray_r.reshape(gray.shape[0], gray.shape[1])
plt.imshow(gray, cmap='gray')
Nice! The darker region (black) represents the background, and the
brighter (white) region is the foreground. We can define multiple
thresholds as well to detect multiple objects:
gray = rgb2gray(image)
gray_r = gray.reshape(gray.shape[0]*gray.shape[1])
threshold = gray_r.mean()
for i in range(gray_r.shape[0]):
    # the extra thresholds below are illustrative; tune them for your image
    if gray_r[i] > threshold:
        gray_r[i] = 3
    elif gray_r[i] > 0.5:
        gray_r[i] = 2
    elif gray_r[i] > 0.25:
        gray_r[i] = 1
    else:
        gray_r[i] = 0
gray = gray_r.reshape(gray.shape[0], gray.shape[1])
plt.imshow(gray, cmap='gray')
When the object and background have high contrast, this method
performs well
The below visual will help you understand how a filter convolves over an
image :
The values of the weight matrix define the output of the convolution; it
helps to extract features from the input. Researchers have found
that choosing specific values for these weight matrices helps us
detect horizontal or vertical edges (or even a combination of horizontal
and vertical edges).
One such weight matrix is the Sobel operator. It is typically used to detect
edges. The Sobel operator has two weight matrices—one for detecting
horizontal edges and the other for detecting vertical edges. Let me show
how these operators look, and we will then implement them in Python.
Horizontal Sobel filter:      Vertical Sobel filter:
 1  2  1                      -1  0  1
 0  0  0                      -2  0  2
-1 -2 -1                      -1  0  1
Edge detection works by convolving these filters over the given image.
Let's apply them to an example image.
image = plt.imread('index.png')
plt.imshow(image)
It should be fairly simple to understand how the edges are detected in this
image. Let’s convert it into grayscale and define the sobel filter (both
horizontal and vertical) that will be convolved over this image:
# converting to grayscale
gray = rgb2gray(image)

# defining the sobel filters
sobel_horizontal = np.array([np.array([1, 2, 1]), np.array([0, 0, 0]), np.array([-1, -2, -1])])
sobel_vertical = np.array([np.array([-1, 0, 1]), np.array([-2, 0, 2]), np.array([-1, 0, 1])])
Now, convolve these filters over the image using the convolve function of
the ndimage package from scipy.

from scipy import ndimage

# here mode determines how the input array is extended when the filter overlaps a border
out_h = ndimage.convolve(gray, sobel_horizontal, mode='reflect')
out_v = ndimage.convolve(gray, sobel_vertical, mode='reflect')
plt.imshow(out_h, cmap='gray')
plt.imshow(out_v, cmap='gray')
Here, we can identify the horizontal and vertical edges. There is one more
type of filter that can detect both horizontal and vertical edges
simultaneously. This is called the laplace operator:
 1  1  1
 1 -8  1
 1  1  1
Let’s define this filter in Python and convolve it on the same image:
kernel_laplace = np.array([np.array([1, 1, 1]), np.array([1, -8, 1]), np.array([1, 1, 1])])
out_l = ndimage.convolve(gray, kernel_laplace, mode='reflect')
plt.imshow(out_l, cmap='gray')
Here, we can see that our method has detected both horizontal and
vertical edges. I encourage you to try it on different images and share your
results. Remember, the best way to learn is by practicing!
Clustering-based Image Segmentation
This idea might have come to you while reading about image
segmentation techniques. Can’t we use clustering techniques to divide
images into segments? We certainly can!
In this section, we’ll get an intuition of clustering (it’s always good to revise
certain concepts!) and how to use it to segment images.
Clustering is dividing the population (data points) into many groups, such
that data points in the same groups are more similar to other data points in
that group than those in other groups. These groups are known as
clusters.
K-means Clustering
4. Calculate the distance of all the points from the center of each
cluster
7. Finally, repeat steps (4), (5) and (6) until either the center of the
clusters does not change or we reach the set number of iterations
The key advantage of using the k-means algorithm is that it is simple and
easy to understand. We are assigning the points to the clusters closest to
them.
How can we use k-means to segment an image?
Let’s put our learning to the test and check how well k-means segment the
objects in an image. We will be using this image, so download it, read it
and, check its dimensions:
pic = plt.imread('1.jpeg')/255 # dividing by 255 to bring the pixel values between 0 and 1
print(pic.shape)
plt.imshow(pic)
pic_n = pic.reshape(pic.shape[0]*pic.shape[1],
pic.shape[2])
pic_n.shape
(50496, 3)
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, random_state=0).fit(pic_n)
pic2show = kmeans.cluster_centers_[kmeans.labels_]
I have chosen 5 clusters for this article, but you can play around with this
number and check the results. Now, let’s return the clusters to their
original shape, a 3-dimensional image, and plot the results.
cluster_pic = pic2show.reshape(pic.shape[0], pic.shape[1], pic.shape[2])
plt.imshow(cluster_pic)
K-means works well when we have a small dataset. It can segment the objects in the image and give impressive results. However, when applied to a large dataset (more images), the algorithm hits a roadblock.
It looks at all the samples at every iteration, so the time taken is too high, which also makes it expensive to run. And since k-means is a distance-based algorithm, it only applies to convex datasets and is unsuitable for clustering non-convex data.
Mask R-CNN
Faster R-CNN has two outputs for each candidate object: a class label and a bounding-box offset. Mask R-CNN adds a third branch to this, which also outputs the object mask. Take a look at the below image to get an intuition of how Mask R-CNN works on the inside:
Source: arxiv.org
Image Segmentation
Object Detection: What objects are in the image, and where are
they? This goes beyond classification. The model identifies specific
objects (cats, cars, people) and draws bounding boxes around them
to indicate their location. It’s like answering a multiple-choice
question where you can choose multiple answers and mark their
positions on the image.
Introduction
So, I set about trying to understand the computer vision technique behind
how a self-driving car potentially detects objects. A simple object detection
framework might not work because it simply detects an object and draws a
fixed shape around it.
Instead, we need a technique that can detect the exact shape of the road
so our self-driving car system can safely navigate the sharp turns as well.
So, in this article, we will first quickly look at what image segmentation is.
Then we’ll look at the core of this article – the Mask R-CNN framework.
Finally, we will dive into implementing our own Mask R-CNN model in Python, which will also give you clarity on the Mask R-CNN PyTorch implementation. So, let’s begin!
I would recommend checking out that article first if you need a quick
refresher (or want to learn image segmentation from scratch).
I’ll quickly recap that article here. Image segmentation creates a pixel-wise
mask for each object in the image. This technique gives us a far more
granular understanding of the object(s) in the image. The image shown
below will help you to understand what image segmentation is:
Here, you can see that each object (which, in this particular image, is a cell) has been segmented. This is how image segmentation works.
The Mask R-CNN framework is built on top of Faster R-CNN. So, for a
given image, Mask R-CNN, in addition to the class label and bounding box
coordinates for each object, will also return the object mask.
Let’s first quickly understand how Faster R-CNN works. This will help us
grasp the intuition behind Mask R-CNN as well.
Faster R-CNN first uses a ConvNet to extract feature maps from the
images
Backbone Model
Now, we take the feature maps obtained in the previous step and apply a region proposal network (RPN). This basically predicts whether an object is present in that region (or not). In this step, we get those regions or feature maps which the model predicts contain some object.
The regions obtained from the RPN might be of different shapes, right?
Hence, we apply a pooling layer and convert all the regions to the same
shape. Next, these regions are passed through a fully connected network
so that the class label and bounding boxes are predicted.
Till this point, the steps are almost similar to how Faster R-CNN works.
Now comes the difference between the two frameworks. In addition to
this, Mask R-CNN also generates the segmentation mask.
For that, we first compute the region of interest so that the computation
time can be reduced. For all the predicted regions, we compute the
Intersection over Union (IoU) with the ground truth boxes. We compute IoU as:
IoU = Area of the intersection / Area of the union
Now, only if the IoU is greater than or equal to 0.5, we consider that
as a region of interest. Otherwise, we neglect that particular region.
We do this for all the regions and then select only a set of regions for
which the IoU is greater than 0.5.
Here, the red box is the ground truth box for this image. Now, let’s say we
got 4 regions from the RPN as shown below:
Here, the IoU of Box 1 and Box 2 is possibly less than 0.5, whereas the IoU of Box 3 and Box 4 is approximately greater than 0.5. Hence, we can say that Box 3 and Box 4 are the regions of interest for this particular image, whereas Box 1 and Box 2 will be neglected.
Segmentation Mask
Once we have the RoIs based on the IoU values, we can add a mask
branch to the existing architecture. This returns the segmentation mask for
each region that contains an object. It returns a mask of size 28 X 28 for
each region which is then scaled up for inference.
Here, our model has segmented all the objects in the image. This is the
final step in Mask R-CNN where we predict the masks for all the objects in
the image.
Keep in mind that the training time for Mask R-CNN is quite high. It took
me somewhere around 1 to 2 days to train the Mask R-CNN on the
famous COCO dataset. So, for the scope of this article, we will not be
training our own Mask R-CNN model.
We will instead use the pretrained weights of the Mask R-CNN model
trained on the COCO dataset. Now, before we dive into the Python code,
let’s look at the steps to use the Mask R-CNN model to perform instance
segmentation.
Let’s have a look at the steps which we will follow to perform image
segmentation using Mask R-CNN.
First, we will clone the Mask R-CNN repository, which contains the architecture for Mask R-CNN.
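The clone command is not reproduced above. Assuming the widely used Matterport implementation is the repository being referenced here, cloning it would look like this:

git clone https://github.com/matterport/Mask_RCNN.git

The repository lists the following dependencies, which we will need: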
numpy
scipy
Pillow
cython
matplotlib
scikit-image
tensorflow>=1.3.0
keras>=2.0.8
opencv-python
h5py
imgaug
IPython
You must install all these dependencies before using the Mask R-
CNN framework.
Next, we need to download the pretrained weights. You can use this
link to download the pre-trained weights. These weights are obtained from
a model that was trained on the MS COCO dataset. Once you have
downloaded the weights, paste this file in the samples folder of the
Mask_RCNN repository that we cloned in step 1.
Finally, we will use the Mask R-CNN architecture and the pretrained
weights to generate predictions for our own images.
Once you’re done with these four steps, it’s time to jump into your Jupyter
Notebook! We will implement all these things in Python and then generate
the masks along with the classes and bounding boxes for objects in our
images.
So, are you ready to dive into Python and code your own image
segmentation model? Let’s begin!
To execute all the code blocks which I will be covering in this section,
create a new Python notebook inside the “samples” folder of the cloned
Mask_RCNN repository.
import os
import sys
import random
import math
import numpy as np
import skimage.io
import matplotlib
import matplotlib.pyplot as plt

# modules from the cloned Mask_RCNN repository
# (coco.py lives in samples/coco, so that folder may need to be added to sys.path)
from mrcnn import utils
from mrcnn import visualize
import mrcnn.model as modellib
import coco

import warnings
warnings.filterwarnings("ignore")
Next, we will define the path for the pretrained weights and the images on
which we would like to perform segmentation:
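The exact paths depend on where you cloned the repository and saved the weights; a minimal sketch with assumed folder names could look like this:

ROOT_DIR = os.path.abspath("../")                                          # root of the cloned Mask_RCNN repo
COCO_MODEL_PATH = os.path.join(ROOT_DIR, "samples", "mask_rcnn_coco.h5")   # pretrained COCO weights
IMAGE_DIR = os.path.join(ROOT_DIR, "images")                               # images to segment

# download the weights automatically if they are not already present (utility from the repo)
if not os.path.exists(COCO_MODEL_PATH):
    utils.download_trained_weights(COCO_MODEL_PATH)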
If you have not placed the weights in the samples folder, this will again
download the weights. Now we will create an inference class which will be
used to infer the Mask R-CNN model:
class InferenceConfig(coco.CocoConfig):
    # Set batch size to 1 since we'll be running inference on
    # one image at a time. Batch size = GPU_COUNT * IMAGES_PER_GPU
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

config = InferenceConfig()
config.display()
Loading Weights
Next, we will create our model and load the pretrained weights which we
downloaded earlier. Make sure that the pretrained weights are in the same
folder as that of the notebook otherwise you have to give the location of
the weights file:
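The code for this step is not reproduced above; a sketch using the Matterport API and the paths assumed earlier would be:

# create the model in inference mode and load the pretrained COCO weights
model = modellib.MaskRCNN(mode="inference", model_dir=os.getcwd(), config=config)
model.load_weights(COCO_MODEL_PATH, by_name=True)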
Now, we will define the classes of the COCO dataset which will help us in
the prediction phase:
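The list itself is omitted above. For the pretrained COCO model it is the standard set of 80 COCO categories plus a background class, in the order used by the common Matterport demo:

class_names = ['BG', 'person', 'bicycle', 'car', 'motorcycle', 'airplane',
               'bus', 'train', 'truck', 'boat', 'traffic light',
               'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird',
               'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear',
               'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie',
               'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
               'kite', 'baseball bat', 'baseball glove', 'skateboard',
               'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup',
               'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple',
               'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
               'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed',
               'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote',
               'keyboard', 'cell phone', 'microwave', 'oven', 'toaster',
               'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors',
               'teddy bear', 'hair drier', 'toothbrush']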
Let’s load an image and try to see how the model performs. You can use any of your images to test the model.
# read and display the original image
image = skimage.io.imread('sample.jpg')
plt.figure(figsize=(12,10))
skimage.io.imshow(image)
This is the image we will work with. You can clearly identify that there are
a couple of cars (one in the front and one in the back) along with a bicycle.
Making Predictions
It’s prediction time! We will use the Mask R-CNN model along with the
pretrained weights and see how well it segments the objects in the image.
We will first take the predictions from the model and then plot the results
to visualize them:
# Run detection
results = model.detect([image], verbose=1)
# Visualize results
r = results[0]
visualize.display_instances(image, r['rois'], r['masks'], r['class_ids'], class_names,
r['scores'])
Interesting. The model has done pretty well to segment both the cars as
well as the bicycle in the image. We can look at each mask or the
segmented objects separately as well. Let’s see how we can do that.
I will first take all the masks predicted by our model and store them in the
mask variable. Now, these masks are in the boolean form (True and
False) and hence we need to convert them to numbers (1 and 0). Let’s do
that first:
mask = r['masks']
mask = mask.astype(int)
mask.shape
Output:
(480,640,3)
This will give us an array of 0s and 1s, where 0 means that there is no
object at that particular pixel and 1 means that there is an object at that
pixel. Note that the shape of the mask is similar to that of the original
image (you can verify that by printing the shape of the original image).
However, the 3 here in the shape of the mask does not represent the
channels. Instead, it represents the number of objects segmented by our
model. Since the model has identified 3 objects in the above sample
image, the shape of the mask is (480, 640, 3). Had there been 5 objects,
this shape would have been (480, 640, 5).
We now have the original image and the array of masks. To print or get
each segment from the image, we will create a for loop and multiply each
mask with the original image to get each segment:
for i in range(mask.shape[2]):
    temp = skimage.io.imread('sample.jpg')
    for j in range(temp.shape[2]):
        temp[:,:,j] = temp[:,:,j] * mask[:,:,i]
    plt.figure(figsize=(8,8))
    plt.imshow(temp)
This is how we can plot each mask or object from the image. This can
have a lot of interesting as well as useful use cases. Getting the segments
from the entire image can reduce the computation cost as we do not have
to preprocess the entire image now, but only the segments.
Inferences
Below are a few more results which I got using our Mask R-CNN model:
Looks awesome! You have just built your own image segmentation model
using Mask R-CNN – well done.
import torch
import torchvision
from PIL import Image
import matplotlib.pyplot as plt
import torchvision.transforms as T

# One way to obtain a Mask R-CNN model in PyTorch: torchvision's implementation
# pretrained on COCO, switched to evaluation mode
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

# Load an image
img_path = 'path_to_your_image.jpg'
img = Image.open(img_path).convert("RGB")

# The model expects a list of 3xHxW tensors with values in [0, 1]
img_tensor = T.ToTensor()(img)

# Perform inference
with torch.no_grad():
    prediction = model([img_tensor])
Kukil
JUNE 28, 2022
If you are a computer vision practitioner or even an enthusiast, you must have come across the term Intersection over Union (IoU) very often. It is the first checkpoint for evaluating the accuracy of a model. In simple terms, it’s a metric that helps us measure the correctness of a prediction.
In this blog post, you will get a detailed and intuitive explanation of IoU: why it is needed, how the metric is designed, and how to implement it using NumPy and PyTorch.
Intersection over Union in Object
Detection
Let’s go through the following example to understand how IoU is calculated. Let three models, A, B, and C, be trained to predict birds. We pass an image through the models where we already know the Ground Truth (marked in red). The image below shows the predictions of the models (marked in cyan).
1.1 Observations
It is clear that the predicted box of Model A has more overlap with the Ground
Truth as compared to Model B.
However, Model C has an even higher overlap with the ground truth. But it
also has a high overlap with the background.
So from models B and C, it is clear that a metric based on only overlap is not a
fair one as we should also account for localization accuracy. It is not just about
matching the Ground Truth but how closely the prediction matches it.
Therefore, we need a metric that penalizes the prediction whenever:
The prediction fails to cover the area inside the Ground Truth.
The prediction overflows the Ground Truth.
Keeping the above in mind, the IoU metric has been designed.
1.2 Designing Intersection over Union metric for
Object Detection
IoU values range from 0 to 1, where 0 means no overlap and 1 means perfect overlap:
IoU = Area of Intersection / Area of Union
If we computed the union by simply summing the areas of the Ground Truth and the prediction, we would be counting the area of the intersection twice. So actually, we calculate IoU as:
IoU = Area of Intersection / (Area of Ground Truth + Area of Prediction - Area of Intersection)
In the case of Image Segmentation, the area is not necessarily rectangular. It can
have any regular or irregular shape. That means the predictions are segmentation
masks and not bounding boxes. Therefore, pixel-by-pixel analysis is done here.
Moreover, the definition of TP, FP, and FN is slightly different as it is not based
on a predefined threshold.
(a) True Positive: The area of intersection between the Ground Truth (GT) and the segmentation mask (S). Mathematically, this is the logical AND of GT and S, i.e., TP = GT ∧ S.
(b) False Positive: The predicted area outside the Ground Truth. This is the logical OR of GT and S, minus GT, i.e., FP = (GT ∨ S) - GT.
(c) False Negative: The number of pixels in the Ground Truth area that the model failed to predict. This is the logical OR of GT and S, minus S, i.e., FN = (GT ∨ S) - S.
In the image above, the blue bounding box is the detected object. Given that the Ground Truth is known (shown in red), let us see how to implement the IoU calculation using NumPy and PyTorch. We will use the available built-in function as well as define a manual function.
In the order of top-left to bottom-right corner, the Ground Truth coordinates are (1202, 123, 1650, 868) and the prediction coordinates are (1162.0001, 92.0021, 1619.9832, 694.0033).
Here, we find the coordinates of the bounding box surrounding the intersection area. Then we subtract the area of the intersection from the sum of the areas of the Ground Truth and the Prediction to obtain the union. We add 1 while calculating height and width to counter zero-division errors. Theoretically, it is possible to add an infinitesimally small positive value, say 0.0001. However, images are discrete: the minimum possible dimension of an image is 1×1. Therefore, we add 1.
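The NumPy implementation itself is not reproduced above; a sketch that follows this description (including the +1 on height and width) and, for the coordinates listed earlier, produces the value printed below would be:

import numpy as np

def get_iou(ground_truth, pred):
    # coordinates of the box surrounding the intersection area
    ix1 = np.maximum(ground_truth[0], pred[0])
    iy1 = np.maximum(ground_truth[1], pred[1])
    ix2 = np.minimum(ground_truth[2], pred[2])
    iy2 = np.minimum(ground_truth[3], pred[3])

    # intersection height and width (+1 for the discrete-pixel argument above)
    i_height = np.maximum(iy2 - iy1 + 1, 0.)
    i_width = np.maximum(ix2 - ix1 + 1, 0.)
    area_of_intersection = i_height * i_width

    # areas of the Ground Truth and the Prediction boxes
    gt_area = (ground_truth[2] - ground_truth[0] + 1) * (ground_truth[3] - ground_truth[1] + 1)
    pd_area = (pred[2] - pred[0] + 1) * (pred[3] - pred[1] + 1)

    # union = sum of both areas minus the intersection (so it is counted only once)
    area_of_union = gt_area + pd_area - area_of_intersection
    return area_of_intersection / area_of_union

ground_truth_bbox = np.array([1202, 123, 1650, 868], dtype=np.float64)
prediction_bbox = np.array([1162.0001, 92.0021, 1619.9832, 694.0033], dtype=np.float64)

print('IOU:', get_iou(ground_truth_bbox, prediction_bbox))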
IOU: 0.6441399913136432
# Import dependencies.
import torch
from torchvision import ops

# Bounding box coordinates.
ground_truth_bbox = torch.tensor([[1202, 123, 1650, 868]], dtype=torch.float)
prediction_bbox = torch.tensor([[1162.0001, 92.0021, 1619.9832, 694.0033]], dtype=torch.float)

# Get iou.
iou = ops.box_iou(ground_truth_bbox, prediction_bbox)
print('IOU : ', iou.numpy()[0][0])
Output:
IOU : 0.6436676
Compare this with the NumPy result above, IOU : 0.64413995.
We can see that the two outputs differ slightly. The difference comes from adding 1 in the NumPy version to counter the zero-division error. In practice, values are clamped to a min-max range; here, let’s keep it as it is for the sake of simplicity. You can also look at the source code [2] for a better understanding.
Conclusion
So that’s all about Intersection over Union, or the Jaccard Index. In this blog post, we discussed the basics of IoU and why it is needed. You also learned how to implement IoU using NumPy and PyTorch. It should be noted that IoU in object detection does not have exactly the same meaning as in segmentation.
In object detection, IoU does not measure the accuracy of a model directly. Rather, it is a helper metric that evaluates the degree of overlap between the ground truth and the prediction.
Kunal Dawn
JUNE 11, 2024
The authors propose the U2-Net architecture that can handle both
multi-level deep feature extraction and multi-scale information across
local and global contexts. This two-level nested modified U-Net-like
structure enables training without significant memory consumption
and computation costs. The core of each level in the architecture is
built upon the ReSidual U-block (RSU), which incorporates the
properties of a residual block and a U-Net-like symmetric encoder-
decoder structure.
In the next section, we will explore the RSU block in more detail.
Note that the spatial resolution of the output feature map from any RSU-L block remains identical to that of the input feature map. The U-Net-like middle portion of the block recovers multi-scale contextual information from the downsampled feature maps from the encoder layers through subsequent concatenation, convolution, and upsampling (in that order). These multi-scale features are then fused with the local input features through addition.
Figure 2: Comparison between the Residual block and the ReSidual-U block [Source: U2-Net: Going Deeper with Nested U-Structure for Salient Object Detection]
In the diagram above, F1(x) denotes the intermediate local feature representation and U(F1(x)) the multi-scale contextual representation learned by the U-Net-like structure; the RSU output is their sum.
We will now present a more detailed block diagram for RSU-7, which
has an input feature map with a resolution of 320x320x3. The
notations I, M, and O represent the number of input, intermediate,
and final output channels in the RSU block, respectively. The
diagram also shows the shapes resulting from convolution, pooling, concatenation, and upsampling.
class REBNCONV(nn.Module):
    def __init__(self, in_ch=3, out_ch=3, dirate=1):
        super(REBNCONV, self).__init__()

        self.conv_s1 = nn.Conv2d(in_ch, out_ch, 3, padding=1*dirate, dilation=1*dirate)
        self.bn_s1 = nn.BatchNorm2d(out_ch)
        self.relu_s1 = nn.ReLU(inplace=True)

    def forward(self, x):

        hx = x
        xout = self.relu_s1(self.bn_s1(self.conv_s1(hx)))

        return xout
2. The function _upsample_like accepts feature maps src and tar and upsamples src to have the same spatial resolution as tar.
def _upsample_like(src, tar):

    src = F.upsample(src, size=tar.shape[2:], mode='bilinear')

    return src
We will begin by analyzing the RSU7 Module’s __init__ method for
initializing the encoder, decoder, and pooling layers.
class RSU7(nn.Module):  # UNet07DRES(nn.Module)

    def __init__(self, in_ch=3, mid_ch=12, out_ch=3):
        super(RSU7, self).__init__()

        self.rebnconvin = REBNCONV(in_ch, out_ch, dirate=1)

        self.rebnconv1 = REBNCONV(out_ch, mid_ch, dirate=1)
        self.pool1 = nn.MaxPool2d(2, stride=2, ceil_mode=True)

        self.rebnconv2 = REBNCONV(mid_ch, mid_ch, dirate=1)
        self.pool2 = nn.MaxPool2d(2, stride=2, ceil_mode=True)

        self.rebnconv3 = REBNCONV(mid_ch, mid_ch, dirate=1)
        self.pool3 = nn.MaxPool2d(2, stride=2, ceil_mode=True)

        self.rebnconv4 = REBNCONV(mid_ch, mid_ch, dirate=1)
        self.pool4 = nn.MaxPool2d(2, stride=2, ceil_mode=True)

        self.rebnconv5 = REBNCONV(mid_ch, mid_ch, dirate=1)
        self.pool5 = nn.MaxPool2d(2, stride=2, ceil_mode=True)

        self.rebnconv6 = REBNCONV(mid_ch, mid_ch, dirate=1)

        self.rebnconv7 = REBNCONV(mid_ch, mid_ch, dirate=2)

        self.rebnconv6d = REBNCONV(mid_ch*2, mid_ch, dirate=1)
        self.rebnconv5d = REBNCONV(mid_ch*2, mid_ch, dirate=1)
        self.rebnconv4d = REBNCONV(mid_ch*2, mid_ch, dirate=1)
        self.rebnconv3d = REBNCONV(mid_ch*2, mid_ch, dirate=1)
        self.rebnconv2d = REBNCONV(mid_ch*2, mid_ch, dirate=1)
        self.rebnconv1d = REBNCONV(mid_ch*2, out_ch, dirate=1)
Lastly, we will discuss the forward step for the RSU7 module:
def forward(self, x):

    hx = x
    hxin = self.rebnconvin(hx)

    hx1 = self.rebnconv1(hxin)
    hx = self.pool1(hx1)

    hx2 = self.rebnconv2(hx)
    hx = self.pool2(hx2)

    hx3 = self.rebnconv3(hx)
    hx = self.pool3(hx3)

    hx4 = self.rebnconv4(hx)
    hx = self.pool4(hx4)

    hx5 = self.rebnconv5(hx)
    hx = self.pool5(hx5)

    hx6 = self.rebnconv6(hx)

    hx7 = self.rebnconv7(hx6)

    hx6d = self.rebnconv6d(torch.cat((hx7, hx6), 1))
    hx6dup = _upsample_like(hx6d, hx5)

    hx5d = self.rebnconv5d(torch.cat((hx6dup, hx5), 1))
    hx5dup = _upsample_like(hx5d, hx4)

    hx4d = self.rebnconv4d(torch.cat((hx5dup, hx4), 1))
    hx4dup = _upsample_like(hx4d, hx3)

    hx3d = self.rebnconv3d(torch.cat((hx4dup, hx3), 1))
    hx3dup = _upsample_like(hx3d, hx2)

    hx2d = self.rebnconv2d(torch.cat((hx3dup, hx2), 1))
    hx2dup = _upsample_like(hx2d, hx1)

    hx1d = self.rebnconv1d(torch.cat((hx2dup, hx1), 1))

    return hx1d + hxin
Figure 3. illustrates how convolution, downsampling, concatenation,
and upsampling are done across the multiple encoder-decoder
blocks.
Figure 4: U2-Net Architecture [Source: U2-Net: Going Deeper with Nested U-Structure for Salient Object Detection]
The U2-Net structure consists of the following three components:
1. The six encoder stages: En_1, En_2, En_3, En_4, En_5, and En_6. Stages 1-4 follow the RSU-7, RSU-6, RSU-5, and RSU-4 blocks, while stages 5 and 6 implement the RSU-4F block mentioned earlier.
2. The five decoder stages, De_1, De_2, De_3, De_4, and De_5, follow
similar RSU architectures as their symmetric encoder counterparts.
Each decoder (except De_5, which takes the upsampled concatenation
of output feature maps from En_5 and En_6) takes the upsampled
concatenation of the output from its previous stage and its symmetric
encoder stage.
3. The feature map outputs from the decoder stages De_1 to De_5 and the encoder stage En_6 are then convolved using a 3×3 convolution layer and upsampled to the input image resolution (320x320) to produce six side-output saliency probability maps.
Let us now focus on the final side outputs implemented at the U2NET
module’s forward step.
# side output
d1 = self.side1(hx1d)

d2 = self.side2(hx2d)
d2 = _upsample_like(d2, d1)

d3 = self.side3(hx3d)
d3 = _upsample_like(d3, d1)

d4 = self.side4(hx4d)
d4 = _upsample_like(d4, d1)

d5 = self.side5(hx5d)
d5 = _upsample_like(d5, d1)

d6 = self.side6(hx6)
d6 = _upsample_like(d6, d1)

d0 = self.outconv(torch.cat((d1, d2, d3, d4, d5, d6), 1))

return F.sigmoid(d0), F.sigmoid(d1), F.sigmoid(d2), F.sigmoid(d3), F.sigmoid(d4), F.sigmoid(d5), F.sigmoid(d6)
Attributes side1 through side6 are the 3x3 convolutions applied to the output features from the decoder stages hx1d through hx5d (De_1 to De_5) and the encoder stage hx6. The upsampled side outputs are concatenated and passed through a 1x1 convolution (outconv). Finally, the sigmoid activation is applied to each individual side output as well as to the fused (concatenated) output.
Did you know that background subtraction can also be used for
document alignment? Our Automated Document
Alignment article describes how we fine-tune DeepLabv3 to
segment and align documents automatically.
Training and Evaluation Strategies
for U2-Net
Now that we have an in-depth understanding of the U2-Net
architecture, we will also discuss the various strategies the authors
have employed to train the U2-Net model.
The authors used the DUTS Image dataset for this binary
segmentation. The training data contains 10553 images, further
augmented through horizontal flipping to obtain 21106 training
images offline.
The training loss is defined as the weighted sum of the losses from
the side output probability maps and that of the final fused
(concatenated) output map.
The authors used the Adam optimizer to train the network with the default hyperparameters: an initial learning rate of 1e-3, betas=(0.9, 0.999), eps=1e-8, and weight_decay=0.
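As a minimal sketch of that training setup (assuming, as in the official implementation, that all side-output loss weights are 1, and that the U2NET class from the u2net.py script used later in this post is available):

import torch
import torch.nn as nn
from u2net import U2NET   # model definition from the script used later in this post

net = U2NET(in_ch=3, out_ch=1)

bce_loss = nn.BCELoss(reduction='mean')

def multi_bce_loss_fusion(d0, d1, d2, d3, d4, d5, d6, labels):
    # binary cross-entropy on the fused map (d0) plus each of the six side outputs,
    # summed with weights of 1 (assumed, following the official implementation)
    loss0 = bce_loss(d0, labels)
    loss = loss0 + bce_loss(d1, labels) + bce_loss(d2, labels) + bce_loss(d3, labels) \
           + bce_loss(d4, labels) + bce_loss(d5, labels) + bce_loss(d6, labels)
    return loss0, loss

optimizer = torch.optim.Adam(net.parameters(), lr=1e-3, betas=(0.9, 0.999),
                             eps=1e-8, weight_decay=0)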
The authors evaluated the models using the following metrics:
1. Precision-Recall curve.
2. The beta-F-score measure (higher is better), given as F_beta = ((1 + beta^2) * Precision * Recall) / (beta^2 * Precision + Recall), with beta^2 set to 0.3 in the paper.
3. Mean Absolute Error (MAE, lower is better) between the ground-truth mask G and the predicted map P, given as MAE = (1 / (H * W)) * sum over all pixels (r, c) of |P(r, c) - G(r, c)|.
4. Weighted F-score (higher is better) to overcome the possible unfair comparison caused by the "interpolation flaw, dependency flaw and equal-importance flaw".
The following table shows the metric scores across the six datasets
used during the evaluation.
We will use both the U2-Net and U2-NetP models for inference. The script u2net.py contains the model architecture definition, supporting modules, and helper functions.
import os
from PIL import Image
import numpy as np

from torchinfo import summary

import torch
import torchvision.transforms as T

from u2net import U2NET, U2NETP

import torchvision.transforms.functional as F

u2net = U2NET(in_ch=3, out_ch=1)
u2netp = U2NETP(in_ch=3, out_ch=1)
The figure below shows the model summary for U2NET. The model has approximately 44M parameters.
Figure 5: U2-Net Model Summary
The U2NETP model is around 38 times smaller than the original U2-Net model, containing only around 1.13M parameters, as the model summary confirms.
Figure 6: U2-NetP Model Summary
Next, we load the model weights using the load_model helper
function.
We will also scale the image data in the range [0, 1] and
normalize it using the ImageNet mean and standard deviation.
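Neither the load_model helper nor the preprocessing transform is reproduced above; a minimal sketch (the weight-file names and the 320x320 resize are assumptions) could look like this:

def load_model(model, weights_path, device="cpu"):
    # load a checkpoint and switch the network to evaluation mode
    model.load_state_dict(torch.load(weights_path, map_location=device))
    model.eval()
    return model

u2net = load_model(u2net, "u2net.pth")     # assumed checkpoint file names
u2netp = load_model(u2netp, "u2netp.pth")

# scale to [0, 1] (ToTensor) and normalize with the ImageNet mean and standard deviation
mean = torch.tensor([0.485, 0.456, 0.406])
std = torch.tensor([0.229, 0.224, 0.225])
transforms = T.Compose([T.Resize((320, 320)),
                        T.ToTensor(),
                        T.Normalize(mean=mean, std=std)])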
def denorm_image(image):
    image_denorm = torch.addcmul(mean[:,None,None], image, std[:,None,None])
    image = torch.clamp(image_denorm*255., min=0., max=255.)
    image = torch.permute(image, dims=(1,2,0)).numpy().astype("uint8")

    return image
We obtain the predictions for both the U2-Net and U2-NetP models.
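As a sketch of that inference step (the input file name is hypothetical, and transforms is the preprocessing pipeline assumed above):

image = Image.open("test_image.jpg").convert("RGB")
input_tensor = transforms(image).unsqueeze(0)   # add a batch dimension

with torch.no_grad():
    # each model returns the fused map d0 followed by the six side outputs
    pred_u2net = u2net(input_tensor)[0]
    pred_u2netp = u2netp(input_tensor)[0]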
def normPRED(predicted_map):
    ma = np.max(predicted_map)
    mi = np.min(predicted_map)

    map_normalize = (predicted_map - mi) / (ma - mi)

    return map_normalize
We shall now visualize the model predictions for both U2Net and U2-
NetP.
Figure 8: More prediction samples from U2-Net (middle) and U2-NetP (right)
However, there are instances where U2-NetP surprisingly gives
better prediction maps compared to U2-Net.
Figure 9: Better predictions from U2-NetP (right) compared to U2-Net (middle)
Let us now take a closer look at a few challenging examples.
Figure 10: Predicted masks from U2-Net (middle) and U2-NetP (right) on challenging images
Although both models were able to segment out the foreground
instances, there is tremendous scope for improvements in getting
more fine-grained segmentation results.
Key Takeaways
1. ReSidual-U block: We have learned how the ReSidual-U structure (RSU) is pivotal in learning multi-scale global context in addition to learning from local representations. It forms the crux of the pipeline for both U2-Net and IS-Net.
2. U2-Net: The U2-Net is a nested two-level structure of RSU encoder-
decoder layers that helps attain multi-level and multi-scale deep feature
representations without requiring a pre-trained classification backbone
at minimal computation and memory costs.
3. Intermediate Supervision Strategy: Training a self-supervised ground-
truth encoder from the target segmentation masks helps capture high-
dimensional mask-level features.
4. IS-Net pipeline: The IS-Net pipeline aims to attain feature
synchronization using the trained GT encoder and the multi-stage
feature maps, along with learning intra-stage and multi-level image
features through the segmentation component (similar to U2-Net).
Conclusion
Background subtraction is one of the most crucial tasks in computer
vision, and hence, understanding the high-dimensional multi-scale
and multi-level image features is imperative. In this post, we have
explored how U2-Net helps achieve this through its RSU blocks. We
also observed that incorporating an intermediate supervision strategy
results in significant improvements while generating the prediction
masks, as evidenced by the IS-Net pipeline.
Recommender Systems — A Complete Guide to Machine Learning Models
Leveraging data to help users discover new content
Approaches
Content-Based Approach
Hybrid Approaches
Finally, there are also hybrid methods that try to use both the known metadata and the set of observed user-item interactions. This approach combines the advantages of both the Content-Based and Collaborative Filtering methods and allows us to obtain the best results. Later in this article we present LightFM, which is one of the most popular algorithms of this class of methods.
Factorization
LightFM: user/item embeddings and biases are the sum of the latent vectors associated with each user/item.
Ankan Ghosh
MAY 21, 2024
A Statistical Overview
1. Content-Based
2. Collaborative Filtering
3. Context filtering
Image 3 – Rating Matrix
Once we collect explicit or implicit feedback, we can create a user-item rating matrix (r_ui) out of it. For explicit feedback, each entry in r_ui is a numerical value, like the number of stars you gave to a movie, or a "?" if you didn't rate that movie. For implicit feedback, the values in r_ui are boolean, showing whether there was an interaction or not, like whether you watched a movie or not. It's important to note that this matrix r_ui is very sparse, because you only interact with a few items out of all available ones, and you rate even fewer items out of them!
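As a toy illustration of such a matrix (all numbers are made up, with np.nan standing in for the "?" of a missing rating):

import numpy as np

# explicit-feedback rating matrix r_ui: rows = users, columns = items
r_ui = np.array([
    [5.0, np.nan, 3.0, np.nan],
    [np.nan, 4.0, np.nan, np.nan],
    [1.0, np.nan, np.nan, 2.0],
])

# the implicit-feedback version is boolean: did the user interact with the item at all?
implicit_r_ui = ~np.isnan(r_ui)
print(implicit_r_ui)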
Image 7 – Term Frequency-Inverse Document Frequency
User profiles are built from data on how users have interacted with
items. These profiles are essentially vectors that summarize a user’s
likes and dislikes based on past behavior. The profiles help predict
what new items a user might like.
Recommendation Methods
Image 8 – Content-Based vs Collaborative Filtering
Collaborative Filtering
Image 10 – Memory-Based vs. Model-Based CF
Image 11 – User-Based Collaborative Filtering
Recommendation Methods
Image 14 – K Nearest Neighbors
The process involves several steps. First, the system computes the
similarity between users or items using metrics like Euclidean
distance or cosine similarity. Then, it selects the top ‘k’ most similar
neighbors. A user-based KNN finds users with similar tastes; an item-
based KNN identifies items that have been similarly rated. Finally, the
system aggregates the preferences of these neighbors to generate a
recommendation list.
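A minimal user-based sketch of these steps on a toy rating matrix (all values are made up; cosine similarity for the neighborhood, mean aggregation for scoring):

import numpy as np

def recommend_user_based(ratings, user_idx, k=2, top_n=2):
    # cosine similarity between the target user and every other user
    norms = np.linalg.norm(ratings, axis=1) + 1e-9
    sims = (ratings @ ratings[user_idx]) / (norms * norms[user_idx])
    sims[user_idx] = -np.inf                  # exclude the user themselves
    neighbors = np.argsort(sims)[-k:]         # indices of the top-k most similar users
    # aggregate the neighbors' ratings and hide items the user has already rated
    scores = ratings[neighbors].mean(axis=0)
    scores[ratings[user_idx] > 0] = -np.inf
    return np.argsort(scores)[::-1][:top_n]

ratings = np.array([
    [5, 4, 0, 0, 1],
    [4, 5, 1, 0, 0],
    [0, 0, 4, 5, 4],
    [1, 0, 5, 4, 0],
], dtype=float)

print(recommend_user_based(ratings, user_idx=0))   # item indices recommended for user 0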
Context Filtering
Image 15 – Context Filtering
This figure shows how user interactions with a service like Netflix can
be tracked over time with additional context to improve
recommendations. Each row represents a user’s activity at a specific
time, including their location, the device they used, and the
exact date and time of the interaction. For example, a user in the
United States watched “Stranger Things” on their computer on
December 10, 2017. The system captures more information about
each user’s behavior by recording this context—location, device, and
timestamp.
Image 16 – ML Approach Evaluation in RecSys
Many advancements have occurred throughout the decades for
recommendation systems. Here, we will explore three main and
widely used ML techniques.
Matrix Factorization
Image 17 – Matrix Factorization Process
We define the confidence matrix c_ui and the rating matrix r_ui as follows: the confidence matrix is c_ui = 1 + α t_ui, where α is a constant and t_ui measures the strength of the observed interaction (for example, how long the user watched). The rating matrix r_ui is defined as r_ui = 1 if t_ui > 0 (indicating the user has watched the movie) and r_ui = 0 if t_ui = 0 (indicating the user has not watched the movie).
Image 18 – Confidence matrix and rating matrix for implicit feedback
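A tiny numerical sketch of these two matrices (the interaction times and the value of α are made up):

import numpy as np

# t_ui: toy implicit feedback, e.g. minutes watched; 0 means no interaction
t_ui = np.array([
    [120, 0, 15],
    [0, 45, 0],
    [5, 0, 90],
], dtype=float)

alpha = 40.0                       # confidence scaling constant (illustrative choice)
c_ui = 1.0 + alpha * t_ui          # confidence matrix
r_ui = (t_ui > 0).astype(float)    # binary rating matrix
print(c_ui)
print(r_ui)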
Explicit feedback uses direct user ratings for items, such as movie
ratings. In matrix factorization for explicit feedback, a linear model
represents user-item interactions, and the algorithm, called
probabilistic matrix factorization, learns latent vectors for users and
items by minimizing a regularized mean squared error (MSE) loss
over known ratings.
Image 20 – Probabilistic Matrix Factorization and loss function for explicit feedback
Two common optimization methods are used: stochastic gradient
descent (SGD) and alternating least squares (ALS). While SGD is
easy to implement, it may struggle with non-convex loss functions.
ALS, however, can transform the problem into a series of convex
linear regression problems, which are easier to solve and can be
significantly parallelized for speed.
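To make the SGD variant concrete, here is a minimal sketch of matrix factorization trained with SGD on a toy explicit-feedback matrix (all values and hyperparameters are illustrative):

import numpy as np

def matrix_factorization_sgd(R, k=2, lr=0.01, reg=0.1, epochs=500):
    # R holds explicit ratings, with np.nan marking unknown entries
    n_users, n_items = R.shape
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(n_users, k))   # user latent vectors
    V = rng.normal(scale=0.1, size=(n_items, k))   # item latent vectors
    known = np.argwhere(~np.isnan(R))
    for _ in range(epochs):
        for u, i in known:
            err = R[u, i] - U[u] @ V[i]            # prediction error on a known rating
            u_old = U[u].copy()
            U[u] += lr * (err * V[i] - reg * U[u]) # regularized gradient steps
            V[i] += lr * (err * u_old - reg * V[i])
    return U, V

R = np.array([[5, 3, np.nan],
              [4, np.nan, 1],
              [np.nan, 1, 5]], dtype=float)
U, V = matrix_factorization_sgd(R)
print(np.round(U @ V.T, 2))   # reconstructed rating estimates, including the missing cells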
Hybrid Approach
The SVD++ algorithm can handle both explicit and implicit feedback
simultaneously, which is useful because users often interact with
many items but rate only a few. This algorithm modifies the basic
SVD model by including a weighted sum of latent factors from items
a user has interacted with. This helps provide more accurate
recommendations by leveraging all available user interactions, not
just the rated items.
Logistic Regression
Image 22 – Logistic Regression
In its common regularized form, logistic regression minimizes (λ/2) * ||w||^2 + sum over i of log(1 + exp(-yi * (w · xi))). Here, λ is the regularization parameter, w is the weight vector, yi is the label, and xi is the feature vector.
Image 23 – Factorization Machines
Field-aware Factorization Machines (FFMs) extend the basic FM
model by grouping features into different fields. Each feature has a
different latent vector for each field, allowing the model to capture
more nuanced interactions between features from different fields. In
FFMs, when deciding the weight for a feature pair, the latent vector of
one feature in the pair is used in the context of the field of the other
feature and vice versa.
Image 24 – Field-aware Factorization Machines
These algorithms mostly use feature vectors to provide predictions as
recommendations, and a ranking method like similarity search ranks
those features according to user input. However, these models face
challenges when trying to model interactions involving more than two
features. This limitation has led researchers to explore neural
network-based recommendation systems, which offer more flexibility
for advanced feature interactions and can more effectively handle
higher-order feature combinations.
Image 25 – DL Approach Evaluation in RecSys
Deep learning (DL) models advanced recommender systems by leveraging vast amounts of data and complex architectures. Unlike traditional machine learning methods, whose performance often plateaus, DL models continue to improve as more data is introduced. This increases accuracy and flexibility, making DL models ideal for personalized recommendations. Companies like Airbnb, Facebook, and Google have successfully implemented DL techniques in their recommendation engines. Let's explore the key components of a DL recommendation system.
Image 27 – Inference Phase
In the inference phase, the trained model predicts new user-item
interactions. This involves three key steps: candidate generation,
candidate ranking, and filtering. First, the model pairs users with
numerous candidate items based on learned similarities. Then, it
ranks these items by the likelihood of user enjoyment. Finally, the
highest-ranked items are presented to the user. This phase requires
efficient data processing and real-time prediction capabilities to
deliver timely and relevant recommendations.
Workflow
Image 28 – General Workflow
The general workflow of a DNN-based recommendation system involves two steps:
1. First, the model is trained on historical user-item interactions, learning embeddings for users and items in the process.
2. Then, for a new recommendation, the user's context (like past ratings) is used to generate features, which are fed into the trained model. The model utilizes the learned embeddings to recommend items similar to those the user has liked before, providing personalized suggestions.
This process involves embedding learning, feature extraction, and
model inference for recommendation.
Key Components
Embeddings
Image 29 – Embeddings
Embeddings are a core component of DL recommender systems,
transforming categorical data into dense vector representations.
Using Embeddings, the model captures similarities between entities,
such as users and items, in a high-dimensional space. For example,
users with similar preferences will have similar embedding vectors.
These embeddings are learned during training and can significantly
enhance the model’s ability to generalize from sparse data.
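As a small PyTorch sketch of this idea (the sizes and IDs below are arbitrary):

import torch
import torch.nn as nn

n_users, n_items, emb_dim = 1000, 500, 32
user_emb = nn.Embedding(n_users, emb_dim)   # one dense vector per user
item_emb = nn.Embedding(n_items, emb_dim)   # one dense vector per item

user_ids = torch.tensor([3, 42])
item_ids = torch.tensor([7, 99])

# dot product of user and item vectors as a simple preference score
scores = (user_emb(user_ids) * item_emb(item_ids)).sum(dim=1)
print(scores.shape)   # torch.Size([2])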
Core Architecture
Image 31 – Core Architecture with Multiple Inputs
It’s important to consider additional user information such as gender,
age, city, time since last visit, and credit card used for payment,
along with item details like brand, price, categories, and quantity sold
in the last seven days. This additional information can enhance the
model’s ability to generalize. Modify the neural network to incorporate
these extra features as inputs.
Popular Architectures
Image 32 – Neural Collaborative Filtering (NCF)
It starts with a sparse input layer representing user and item IDs.
These IDs are mapped to dense user(u) and item(i) latent vectors via
an embedding layer. The latent vectors are then fed into multiple
neural collaborative filtering (CF) layers, which learn interactions
between users and items. The output layer produces a predicted
score ŷui for the user-item pair. This score is compared to the actual
rating yui during training to minimize prediction error. The trained
model uses these interactions to recommend items to users based
on learned embeddings.
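A compact sketch of this architecture (the MLP variant of NCF, with arbitrary layer sizes) might look like this:

import torch
import torch.nn as nn

class NCF(nn.Module):
    # Minimal sketch of the MLP variant of Neural Collaborative Filtering
    def __init__(self, n_users, n_items, emb_dim=16):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, emb_dim)
        self.item_emb = nn.Embedding(n_items, emb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * emb_dim, 32), nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
            nn.Linear(16, 1),
        )

    def forward(self, user_ids, item_ids):
        # concatenate user and item latent vectors and predict a score y_hat_ui
        x = torch.cat([self.user_emb(user_ids), self.item_emb(item_ids)], dim=1)
        return torch.sigmoid(self.mlp(x)).squeeze(1)

model = NCF(n_users=1000, n_items=500)
y_hat = model(torch.tensor([1, 2]), torch.tensor([10, 20]))
print(y_hat)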
Variational Autoencoders (VAE)
DLRM by Meta
Image 36 – Session-Based Recommendation Workflow
Session-based recommender systems use sequential data, which
captures user interactions within a session, such as viewing multiple
products. These models utilize variations of RNNs like GRU, LSTM,
or transformer-based architectures such as BERT to process
sequences and understand the context of user behavior. For
instance, RNNs capture temporal dependencies, while transformers
use attention mechanisms to focus on relevant interactions. These
session-based models can predict the likelihood of
users engaging with specific items based on their recent activity,
providing more timely and relevant recommendations. Two popular
examples of implementations include Square’s deep learning-based
product recommendation system and Alibaba’s transformer-based
model BERT, GRUs, and NVIDIA GPUs to create a vector
representation of their sellers.
Image 37 – NLP vs RecSys
In NLP applications, input text is converted into word vectors using
word embedding techniques. Each word is translated into numbers
before being processed by RNNs, Transformers, or BERT to
understand context. These numbers change during training, encoding semantic and contextual information, so that similar words end up close in this space. These models provide outputs for tasks
like next-word prediction and text summarization. For session-based
recommendations, RNN models train on user event sequences (e.g.,
product clicks, interaction times) to predict the likelihood of clicking a
target item. Interactions are embedded like words in a sentence
before being processed by LSTM, GRU, or Transformers for context
understanding.
Conclusion
A recommendation system (or recommender system) is key to
enhancing user experience on modern platforms. By understanding
user preferences and behaviors, these systems offer personalized
suggestions that engage users. Recommendation systems have
evolved from traditional methods like matrix factorization to advanced
deep learning models to provide accurate and relevant
recommendations. They are essential for navigating the vast amount
of content available today, ensuring users find what they need quickly
and easily. So, the next time your parents get their favorite song or
movie suggested as “you might like this,” you can tell them the secret
behind it.