Unit 2 Notes
Supervised Learning: Classification and Regression, Linear Regression: Single and Multiple, Logistic
Regression: Ridge Regression, Lasso Regression, k-Nearest Neighbour, Naive Bayes Classifier, Decision
Tree, Support Vector Machine (TB-1)
Supervised Learning:
Supervised learning is used whenever we want to predict a certain outcome from a given input.
Its goal is to make accurate predictions for new, never-before-seen data.
Supervised learning often requires human effort to build the training set, but afterward automates and
often speeds up an otherwise laborious or infeasible task.
In supervised learning, the labelled training data provides the experience or prior knowledge or belief. It is called supervised learning because the process of learning from the training data by a machine resembles learning under the supervision of a teacher.
Training data is past information with a known value of the class field or ‘label’. Hence, we say that the ‘training data is labelled’ in the case of supervised learning.
k-Neighbors classification:
In its simplest version, the k-NN algorithm only considers exactly one nearest neighbor, which is the
closest training data point to the point we want to make a prediction for. The prediction is then simply
the known output for this training point.
mglearn.plots.plot_knn_classification(n_neighbors=1)
Figure 2-4. Predictions made by the one-nearest-neighbor model on the forge dataset
Here, we added three new data points, shown as stars. For each of them, we marked the closest point in the
training set. The prediction of the one-nearest-neighbor algorithm is the label of that point.
Instead of considering only the closest neighbor, we can also consider an arbitrary number, k, of
neighbors. This is where the name of the k-nearest neighbors algorithm comes from. When considering
more than one neighbor, we use voting to assign a label. This means that for each test point, we count
how many neighbors belong to class 0 and how many neighbors belong to class 1. We then assign the
class that is more frequent: in other words, the majority class among the k-nearest neighbors. The
following example (Figure 2-5) uses the three closest neighbors:
In[11]:
mglearn.plots.plot_knn_classification(n_neighbors=3)
Figure 2-5. Predictions made by the three-nearest-neighbors model on the forge dataset
Again, the prediction is shown as the color of the cross. You can see that the prediction for the new
data point at the top left is not the same as the prediction when we used only one neighbor.
While this illustration is for a binary classification problem, this method can be applied to datasets with
any number of classes. For more classes, we count how many neighbors belong to each class and again
predict the most common class.
Next, we split the data into a training and a test set, and then import and instantiate the classifier class. This is when we can set parameters, like the number of neighbors to use. Here, we set it to 3:
In[13]:
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=3)
Now, we fit the classifier using the training set. For KNeighborsClassifier, this means storing the dataset, so we can compute neighbors during prediction:
In[14]: clf.fit(X_train, y_train)
To make predictions on the test data, we call the predict method. For each data point in the test set,
this computes its nearest neighbors in the training set and finds the most common class among these:
In[15]: print("Test set predictions: {}".format(clf.predict(X_test)))
Out[15]: Test set predictions: [1 0 1 0 1 0 0]
To evaluate how well our model generalizes, we can call the score method with the test data together with the
test labels:
In[16]: print("Test set accuracy: {:.2f}".format(clf.score(X_test, y_test)))
Out[16]: Test set accuracy: 0.86
We see that our model is about 86% accurate, meaning the model predicted the class correctly for 86% of the
samples in the test dataset.
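A minimal end-to-end sketch tying these steps together (assuming the book's mglearn helper package is available; exact predictions and accuracy depend on library versions):
import mglearn
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# generate the small synthetic forge dataset and split it into train/test sets
X, y = mglearn.datasets.make_forge()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# instantiate the classifier with 3 neighbors and "fit" it (i.e. store the training data)
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
# predict labels for the test points and report accuracy
print("Test set predictions:", clf.predict(X_test))
print("Test set accuracy: {:.2f}".format(clf.score(X_test, y_test)))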
Let’s take the real-world Breast Cancer dataset. We begin by splitting the dataset into a training and a
test set. Then we evaluate training and test set performance with different numbers of neighbors.
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=66)

training_accuracy = []
test_accuracy = []
# try n_neighbors from 1 to 10
neighbors_settings = range(1, 11)
for n_neighbors in neighbors_settings:
    # build the model
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    clf.fit(X_train, y_train)
    # record training set accuracy
    training_accuracy.append(clf.score(X_train, y_train))
    # record generalization accuracy
    test_accuracy.append(clf.score(X_test, y_test))

plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.ylabel("Accuracy")
plt.xlabel("n_neighbors")
plt.legend()
The plot shows the training and test set accuracy on the y-axis against the setting of n_neighbors on
the x-axis. While real-world plots are rarely very smooth, we can still recognize some of the
characteristics of overfitting and underfitting.
Considering a single nearest neighbor, the prediction on the training set is perfect. But when more
neighbors are considered, the model becomes simpler and the training accuracy drops. The test set
accuracy for using a single neighbor is lower than when using more neighbors, indicating that using the
single nearest neighbor leads to a model that is too complex. On the other hand, when considering 10
neighbors, the model is too simple and performance is even worse. The best performance is somewhere
in the middle, using around six neighbors. Still, it is good to keep the scale of the plot in mind. The
worst performance is around 88% accuracy, which might still be acceptable.
k-neighbors regression:
This is another variant of the k-nearest neighbors algorithm. It starts by using the single nearest
neighbor, this time using the wave dataset. We’ve added three test data points as green stars on the x-
axis. The prediction using a single neighbor is just the target value of the nearest neighbor.
mglearn.plots.plot_knn_regression(n_neighbors=1)
Again, we can use more than the single closest neighbor for regression. When using multiple nearest
neighbors, the prediction is the average, or mean, of the relevant neighbors.
mglearn.plots.plot_knn_regression(n_neighbors=3)
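The same estimator interface applies to regression; a minimal sketch, again assuming the mglearn wave dataset is available:
import mglearn
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
X, y = mglearn.datasets.make_wave(n_samples=40)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# fit the regressor; each prediction is the mean target of the 3 nearest training points
reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X_train, y_train)
print("Test set predictions:", reg.predict(X_test))
# score returns the R^2 of the predictions on the test set
print("Test set R^2: {:.2f}".format(reg.score(X_test, y_test)))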
In principle, there are two important parameters to the KNeighbors classifier: the number of neighbors and how
you measure distance between data points. In practice, using a small number of neighbors like three or five often
works well, but you should certainly adjust this parameter. By default, Euclidean distance is used, which works
well in many settings.
One of the strengths of k-NN is that the model is very easy to understand, and it often gives reasonable performance without a lot of adjustment. Building the nearest neighbors model is usually very fast, but when your training set is very large (either in number of features or in number of samples) prediction can be slow.
When using the k-NN algorithm, it’s important to preprocess your data. This approach often does not perform
well on datasets with many features (hundreds or more), and it does particularly badly with datasets where
most features are 0 most of the time (so-called sparse datasets).
So, while the k-nearest neighbors algorithm is easy to understand, it is not often used in practice, due to prediction being slow and its inability to handle many features.
kNN algorithm
Input: Training data set, test data set (or data points), value of ‘k’ (i.e. number of nearest neighbours to be considered)
Steps:
Do for all test data points
• Calculate the distance (usually Euclidean distance) of the test data point from the different training data points.
• Find the closest ‘k’ training data points, i.e. training data points whose distances are least from the test data point.
If k = 1
Then assign class label of the training data point to the test data point
Else
Whichever class label is predominantly present in the training data points, assign that class label to the test data point
End do
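The steps above can be sketched directly in Python. This is a simplified illustration of the algorithm (not the scikit-learn implementation); the function name and the use of NumPy arrays are assumptions made for this example:
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k):
    # distance (Euclidean) of the test data point from every training data point
    distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # indices of the k closest training data points
    nearest = np.argsort(distances)[:k]
    # class label predominantly present among the k neighbours (for k = 1,
    # this is simply the label of the single closest training point)
    return Counter(y_train[nearest]).most_common(1)[0][0]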
Why the kNN algorithm is called a lazy learner?
It stores the training data and directly applies the philosophy of nearest neighbourhood finding to arrive at the
classification. So, for kNN, there is no learning happening in the real sense. Therefore, kNN falls under the
category of lazy learner.
What is K-Nearest Neighbors (KNN)?
K-Nearest Neighbors is a machine learning technique and algorithm that can be used for both regression and
classification tasks. K-Nearest Neighbors examines the labels of a chosen number of data points surrounding a
target data point, in order to make a prediction about the class that the data point falls into. K-Nearest Neighbors
(KNN) is a conceptually simple yet very powerful algorithm, and for those reasons, it’s one of the most popular
machine learning algorithms. Let’s take a deep dive into the KNN algorithm and see exactly how it works. Having a
good understanding of how KNN operates will let you appreciate the best and worst use cases for KNN.
Overview of KNN:
Let’s visualize a dataset on a 2D plane. Picture a bunch of data points on a graph, spread out along the graph in
small clusters. KNN examines the distribution of the data points and, depending on the arguments given to the
model, it separates the data points into groups. These groups are then assigned a label. The primary assumption
that a KNN model makes is that data points/instances which exist in close proximity to each other are highly
similar, while if a data point is far away from another group it’s dissimilar to those data points.
A KNN model calculates similarity using the distance between two points on a graph. The greater the distance
between the points, the less similar they are. There are multiple ways of calculating the distance between points,
but the most common distance metric is just Euclidean distance (the distance between two points in a straight
line).
KNN is a supervised learning algorithm, meaning that the examples in the dataset must have labels assigned to
them/their classes must be known. There are two other important things to know about KNN. First, KNN is a non-
parametric algorithm. This means that no assumptions about the dataset are made when the model is used.
Rather, the model is constructed entirely from the provided data. Second, KNN does not build a generalized model from the training data ahead of time; the training data itself is stored, and all of it is used when the model is asked to make predictions.
Linear models make a prediction using a linear function of the input features.
For regression, the general prediction formula for a linear model looks as follows:
ŷ = w[0] * x[0] + w[1] * x[1] + ... + w[p] * x[p] + b
Here, x[0] to x[p] denotes the features (in this example, the number of features is p + 1) of a single data point, w and b are parameters of the model that are learned, and ŷ is the prediction the model makes.
For a dataset with a single feature, this is:
ŷ = w[0] * x[0] + b
In[26]:
import mglearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
X, y = mglearn.datasets.make_wave(n_samples=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
lr = LinearRegression().fit(X_train, y_train)
The “slope” parameters (w), also called weights or coefficients, are stored in the coef_ attribute, while the
offset or intercept (b) is stored in the intercept_ attribute:
In[27]:
print("lr.coef_: {}".format(lr.coef_))
print("lr.intercept_: {}".format(lr.intercept_))
Out[27]:
lr.coef_: [ 0.394]
lr.intercept_: -0.031804343026759746
The intercept_ attribute is always a single float number, while the coef_ attribute is
a NumPy array with one entry per input feature.
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lr.score(X_test, y_test)))
The simple linear regression model and the multiple regression model assume that the dependent
variable is continuous.
The following expression describes the equation involving the relationship with two predictor variables,
namely X1 and X2 .
Ŷ = a + b1 X1 + b2 X2
The model describes a plane in the three-dimensional space of Ŷ, X1, and X2. Parameter ‘a’ is the
intercept of this plane. Parameters ‘b1’ and ‘b2 ’ are referred to as partial regression coefficients.
Parameter b1 represents the change in the mean response corresponding to a unit change in X1 when
X2 is held constant. Parameter b2 represents the change in the mean response corresponding to a unit
change in X2 when X1 is held constant.
Ŷ = 22 + 0.3X1 + 1.2X2
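For example, using this fitted equation, a data point with X1 = 10 and X2 = 5 (illustrative values) would be predicted as Ŷ = 22 + 0.3(10) + 1.2(5) = 22 + 3 + 6 = 31.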
While finding the best-fit line, we can also fit a polynomial or a curvilinear relationship instead of a straight line. Such models are known as polynomial regression and curvilinear regression, respectively.
OneHotEncoder(): The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse_output parameter).
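A brief sketch of how this encoder might be used (the city names below are made up purely for illustration):
from sklearn.preprocessing import OneHotEncoder

# one categorical feature with three distinct values
cities = [["Delhi"], ["Mumbai"], ["Delhi"], ["Chennai"]]
encoder = OneHotEncoder(sparse_output=False)  # return a dense array instead of a sparse matrix
encoded = encoder.fit_transform(cities)
print(encoder.categories_)  # categories discovered for the feature
print(encoded)              # one binary column per category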
Residual Sum of Squares: the sum of the squared differences between the actual values and the predicted values.
1. The dependent variable (Y) can be calculated / predicted as a linear function of a specific set of independent variables (X’s) plus an error term (ε).
2. The number of observations (n) is greater than the number of parameters (k) to be estimated, i.e.
n > k.
3. Relationships determined by regression are only relationships of association based on the data set
and not necessarily of cause and effect of the defined class.
4. A regression line is valid only over a limited range of data. If the line is extended beyond the range of the observed data (extrapolation), it may lead to wrong predictions.
5. If the business conditions change and the business assumptions underlying the regression model are
no longer valid, then the past data set will no longer be able to predict future trends.
6. Variance is the same for all values of X (homoskedasticity).
7. The error term (ε) is normally distributed. This also means that the mean of the error (ε) has an
expected value of 0.
8. The values of the error (ε) are independent and are not related to any values of X. This means that
there are no relationships between a particular X, Y that are related to another specific value of X, Y.
Logistic Regression:
Logistic regression is both a classification and a regression technique, depending on the scenario in which it is used.
Logistic regression (logit regression) is a type of regression analysis used for predicting the
outcome of a categorical dependent variable.
In logistic regression, dependent variable (Y) is binary (0,1) and independent variables (X) are
continuous in nature.
The probabilities describing the possible outcomes (probability that Y = 1) of a single trial are
modeled as a logistic function of the predictor variables.
In the logistic regression model, there is no R² to gauge the fit of the overall model; however, a chi-square test is used to gauge how well the logistic regression model fits the data.
Note: A chi-square test is a statistical test used to compare the observed results from an experiment with the expected results. The main purpose of this test is to determine whether the difference between the observed data and the expected data is due to chance, or whether it reflects a real relationship between the variables being studied.
The goal of logistic regression is to predict the likelihood that Y is equal to 1 (probability that Y =
1 rather than 0) given certain values of X. That is, if X and Y have a strong positive linear relationship,
the probability that a person will have a score of Y = 1 will increase as values of X increase. So, we are
predicting probabilities rather than the scores of the dependent variable.
For example, we might try to predict whether or not a small project will succeed or fail on the basis of
the number of years of experience of the project manager handling the project. We presume that those
project managers who have been managing projects for many years will be more likely to succeed. This
means that as X (the number of years of experience of project manager) increases, the probability that Y
will be equal to 1 (success of the new project) will tend to increase. If we take a hypothetical example in
which 60 already executed projects were studied and the years of experience of project managers
ranges from 0 to 20 years, we could represent this tendency to increase the probability that Y = 1 with a
graph.
A perfect relationship represents a perfectly curved S rather than a straight line.
An explanation of logistic regression begins with an explanation of the logistic function, which always
takes values between zero and one. The logistic formulae are stated in terms of
the probability that Y = 1, which is referred to as P.
The probability that Y = 0 is 1 − P.
To illustrate this, it is convenient to segregate years of experience into categories (i.e. 0–8, 9–16, 17–24, 25–32, 33–40). If we compute the mean score on Y (averaging the 0s and 1s) for each category of years of experience, we will get a proportion that rises with experience, tracing out an S-shaped curve.
Note that ln(c) = k is equivalent to e^k = c.
As X increases, the probability that Y = 1 increases. In other words, when the project manager has more
years of experience, a larger percentage of projects succeed. A perfect relationship represents a
perfectly curved S rather than a straight line, as was the case in OLS regression. So, to model this
relationship, we need some fancy algebra / mathematics that accounts for the bends in the curve.
The logistic model is stated as ln(P/(1 − P)) = a + bX, where ‘ln’ refers to the natural logarithm and a + bX is the regression line equation. Probability (P) can also be computed from the regression equation. So, if we know the regression equation, we could, theoretically, calculate the expected probability that Y = 1 for a given value of X.
Ex: a model that can predict whether a person is male or female on the basis of their height.
Given a height of 150 cm, coefficients of a = −100 and b = 0.6.
Using the above equation, we can calculate the probability of male given a height of 150 cm or more
formally P(male|height = 150).
y = e^(a + b × X)/(1 + e^(a + b × X))
y = e^(−100 + 0.6 × 150)/(1 + e^(−100 + 0.6 × 150))
y = 0.000045, or a probability of near zero that the person is a male.
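The arithmetic above can be checked with a few lines of Python (the coefficient values a = −100 and b = 0.6 are the ones given in the example):
import math

a, b, height = -100, 0.6, 150
# logistic function: P(male | height) = e^(a + b*X) / (1 + e^(a + b*X))
p_male = math.exp(a + b * height) / (1 + math.exp(a + b * height))
print("{:.6f}".format(p_male))  # roughly 0.000045, i.e. a probability of near zero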
o Logistic regression is one of the most popular Machine Learning algorithms, which comes under
the Supervised Learning technique. It is used for predicting the categorical dependent variable
using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore, the outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.
o Logistic Regression is very similar to Linear Regression except in how the two are used. Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic function,
which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as whether the
cells are cancerous or not, a mouse is obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.
o Logistic Regression can be used to classify observations using different types of data and can easily determine the most effective variables for the classification. The image below shows the logistic function:
Note: Logistic regression uses the concept of predictive modelling as regression does; therefore, it is called logistic regression. But because it is used to classify samples, it falls under the classification algorithms.
o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The value of the logistic regression must be between 0 and 1, which cannot go beyond this limit,
so it forms a curve like the "S" form. The S-form curve is called the Sigmoid function or the
logistic function.
o In logistic regression, we use the concept of a threshold value, which defines the probability of either 0 or 1. Values above the threshold tend to 1, and values below the threshold tend to 0.
o The equation of a straight line can be written as y = b0 + b1x1 + b2x2 + ... + bnxn. In Logistic Regression, y can be between 0 and 1 only, so let's divide the above equation by (1 − y):
y/(1 − y); this is 0 for y = 0 and infinity for y = 1
o But we need a range between −[infinity] and +[infinity], so taking the logarithm of the equation, it becomes:
log[y/(1 − y)] = b0 + b1x1 + b2x2 + ... + bnxn
o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".
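A minimal scikit-learn sketch of binomial logistic regression (the breast cancer dataset is reused here purely as an illustration; max_iter is raised only to ensure convergence):
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)
# the model outputs class probabilities via the logistic (sigmoid) function
logreg = LogisticRegression(max_iter=10000).fit(X_train, y_train)
print("Test set accuracy: {:.2f}".format(logreg.score(X_test, y_test)))
# probability that each of the first five test samples belongs to class 0 and class 1
print(logreg.predict_proba(X_test)[:5])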
Lasso and Ridge Regression:
Consider a regression model which is highly accurate and highly predictive; the overall error of such a model will be low, implying low bias (high accuracy) and low variance (high predictive power). This is highly preferable.
Similarly, we can say that if the variance increases (low predictive power), the spread of our data points increases, which results in less accurate prediction. As the bias increases (low accuracy), the error between the predicted values and the observed values increases. Therefore, balancing out bias and variance is essential in a regression model.
In the linear regression model, it is assumed that the number of observations (n) is greater than
the number of parameters (k) to be estimated, i.e. n > k, and in that case, the least squares
estimates tend to have low variance and hence will perform well on test observations.
However, if the number of observations (n) is not much larger than the number of parameters (k), then there can be high variability in the least squares fit, resulting in overfitting and leading to poor predictions.
If k > n, then linear regression is not usable. This also indicates infinite variance, and so, the method
cannot be used at all.
Accuracy of linear regression can be improved using the following three methods:
1. Shrinkage Approach (Regularization)
2. Subset Selection
3. Dimensionality (Variable) Reduction
By limiting (shrinking) the estimated coefficients, we can try to reduce the variance at the cost of a
negligible increase in bias. This can in turn lead to substantial improvements in the accuracy of the
model.
A few of the variables used in the multiple regression model may in fact not be associated with the overall response; these are called irrelevant variables and may lead to unnecessary complexity in the regression model.
This approach involves fitting a model involving all predictors. However, the estimated coefficients are
shrunken towards zero relative to the least squares estimates. This shrinkage (also known as
regularization) has the effect of reducing the overall variance. Some of the coefficients may also be
estimated to be exactly zero, thereby indirectly performing variable selection.
The two best-known techniques for shrinking the regression coefficients towards zero are ridge regression and lasso regression.
Ridge regression performs L2 regularization, i.e. it adds a penalty equivalent to the square of the magnitude of the coefficients.
Ridge regression (include all k predictors in the final model) is very similar to least squares, except that
the coefficients are estimated by minimizing a slightly different quantity. If k > n, then the least squares
estimates do not even have a unique solution, whereas ridge regression can still perform well by
trading off a small increase in bias for a large decrease in variance. Thus, ridge regression works
best in situations where the least squares estimates have high variance i.e. overfitting.
One disadvantage with ridge regression is that it will include all k predictors in the final model.
This may not be a problem for prediction accuracy, but it can create a challenge in model interpretation
in settings in which the number of variables k is quite large. Ridge regression will perform better when
the response is a function of many predictors, all with coefficients of roughly equal size.
Lasso regression performs L1 regularization, i.e. it adds a penalty equivalent to the absolute value of the magnitude of the coefficients.
The lasso overcomes this disadvantage by forcing some of the coefficients to zero value. We can
say that the lasso yields sparse models (involving only subset) that are simpler as well as more
interpretable. The lasso can be expected to perform better in a setting where a relatively small number
of predictors have substantial coefficients, and the remaining predictors have coefficients that are very
small or equal to zero.
These methods penalize models based on their complexity, favouring simpler models that are also better at generalizing.
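A minimal sketch of both techniques in scikit-learn (the synthetic dataset and the alpha values are illustrative choices, not tuned; alpha controls the strength of the penalty):
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split

# synthetic data with many features, only a few of which are actually informative
X, y = make_regression(n_samples=100, n_features=80, n_informative=10,
                       noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Ridge: L2 penalty shrinks coefficients towards zero but keeps all predictors
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
# Lasso: L1 penalty can set some coefficients exactly to zero (a sparse model)
lasso = Lasso(alpha=1.0).fit(X_train, y_train)

print("Ridge test R^2: {:.2f}".format(ridge.score(X_test, y_test)))
print("Lasso test R^2: {:.2f}".format(lasso.score(X_test, y_test)))
print("Features used by lasso:", np.sum(lasso.coef_ != 0))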
Bayes’ theorem:
P(A|B) = P(B|A) × P(A) / P(B)
where A and B are conditionally related events and P(A|B) denotes the probability of event A occurring when event B has already occurred.
Naïve Bayes is a simple technique for building classifiers: models that assign class labels to problem
instances. The basic idea of Bayes rule is that the outcome of a hypothesis can be predicted on the
basis of some evidence (E) that can be observed.
From Bayes’ rule, we observe that there are two kinds of probabilities:
1. A prior probability of hypothesis h or P(h): This is the probability of an event or hypothesis before the
evidence is observed.
2. A posterior probability of h or P(h|D): This is the probability of an event after the evidence is observed
within the population D.
Posterior Probability is of the format ‘What is the probability that a particular object belongs to
class i given its observed feature values?’
Bayes’ theorem is used when new information can be used to revise previously determined
probabilities.
According to the approach in Bayes’ theorem, the classification of the new instance is performed by assigning the most probable target classification on the basis of the attribute values of the new instance {a1, a2, …, an} (MAP — Maximum A Posteriori Hypothesis). So, the chosen classification is the value vj (from the set of possible classifications C) that maximizes P(vj | a1, a2, …, an), which by Bayes’ theorem is proportional to P(a1, a2, …, an | vj) P(vj).
We take a learning task where each instance x has some attributes and the target function (f(x)) can take
any value from the finite set of classification values C. We also have a set of training examples for target
function, and the set of attributes {a1, a2 ,…, an } for the new instance are known to us. Our task is to
predict the classification of the new instance.
• It is easy to estimate each of the P(vj) simply by counting the frequency with which each target
value vj occurs in the training data.
• The naive Bayes classifier is based on the simplifying assumption that the attribute values are conditionally independent given the target value (i.e. the attributes do not influence one another once the target value is known).
• Given the target value of the instance, the probability of observing the conjunction a1, a2, …, an is just the product of the probabilities for the individual attributes, so the classifier outputs the target value vj that maximizes P(vj) × Πi P(ai | vj).
• Naive Bayes classifier:
A Naïve Bayes classifier is a simple probabilistic classifier based on applying Bayes’ theorem (from Bayesian inference) with strong (naive) independence assumptions. The prior probabilities in Bayes’ theorem that are updated with the help of newly available information are called posterior probabilities.
A key benefit of the naive Bayes classifier is that it requires only a small amount of training data to estimate the parameters (the means and variances of the variables) essential for classification.
Some of the key strengths and weaknesses of Naïve Bayes classifiers are described below.
Example. Let us assume that we want to predict the outcome of a football world cup match on the
basis of the past performance data of the playing teams. We have training data available (refer Fig. 6.3)
for actual match outcome, while four parameters are considered – Weather Condition (Rainy, Overcast,
or Sunny), how many of the last three matches were won by this team (one match, two matches, or three matches), Humidity Condition (High or Normal), and whether they won the toss (True or False). Using Naïve Bayes, you need to classify the conditions under which this team wins and then
predict the probability of this team winning a particular match when Weather Conditions = Rainy, they
won two of the last three matches, Humidity = Normal and they won the toss in the particular match.
P(Yes) will give the overall probability of favourable condition in the given scenario.
P(No) will give the overall probability of non-favourable condition in the given scenario.
Conclusion: This shows that there is a 58% probability that the team will win if the above conditions hold on that particular day. Thus, the Naïve Bayes classifier provides a simple yet powerful way to consider the influence of multiple attributes on the target outcome and to refine the uncertainty of the event on the basis of prior knowledge, because it is able to simplify the calculation through the independence assumption.
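The same kind of calculation can be sketched with scikit-learn's categorical Naïve Bayes implementation. The tiny table below is made up for illustration and is not the actual data of Figure 6.3:
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# columns: Weather, Wins in last 3 matches, Humidity, Won toss; label: match won?
X = [["Rainy", "2", "Normal", "True"],
     ["Sunny", "1", "High", "False"],
     ["Overcast", "3", "Normal", "True"],
     ["Rainy", "1", "High", "False"],
     ["Sunny", "2", "Normal", "True"]]
y = ["Yes", "No", "Yes", "No", "Yes"]

# CategoricalNB expects integer-encoded categories, so encode the strings first
encoder = OrdinalEncoder()
X_enc = encoder.fit_transform(X)
model = CategoricalNB().fit(X_enc, y)

# probability of winning under the query conditions used in the example above
query = encoder.transform([["Rainy", "2", "Normal", "True"]])
print(model.predict(query))        # most probable class
print(model.predict_proba(query))  # [P(No), P(Yes)] for the query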
Applications of Naïve Bayes classifier
➢ Text classification: Naïve Bayes classifier is among the most successful known algorithms for
learning to classify text documents. It classifies the document where the probability of classifying
the text is more. It uses the above algorithm to check the permutation and combination of the
probability of classifying a document under a particular ‘Title’. It has various applications in
document categorization, language detection, and sentiment detection, which are very useful for
traditional retailers, e-retailors, and other businesses on judging the sentiments of their clients on the
basis of keywords in feedback forms, social media comments, etc.
➢ Spam filtering: Spam filtering is the best known use of Naïve Bayesian text classification.
Presently, almost all the email providers have this as a built-in functionality, which makes use of a
Naïve Bayes classifier to identify spam email on the basis of certain conditions and also the
probability of classifying an email as ‘Spam’. Naïve Bayesian spam filtering has become a mainstream mechanism to distinguish illegitimate spam email from legitimate email (sometimes called ‘ham’). Users can also install separate email filtering programmes.
Server-side email filters such as DSPAM, Spam Assassin, Spam Bayes, and ASSP make use of
Bayesian spam filtering techniques, and the functionality is sometimes embedded within the mail
server software itself.
➢ Hybrid Recommender System: It uses Naïve Bayes classifier and collaborative filtering.
Recommender systems (used by e-retailers like eBay, Alibaba, Target, Flipkart, etc.) apply machine learning and data mining techniques for filtering unseen information and can predict whether a user would like a given resource. For example, when we log in to these retailer websites, the system uses the text entered by the logged-in user together with their historical purchase data to automatically recommend products for that particular user. One such algorithm combines a Naïve Bayes classification
approach with collaborative filtering, and experimental results show that this algorithm provides better
performance regarding accuracy and coverage than other algorithms.
Online Sentiment Analysis: The online applications use supervised machine learning (Naïve Bayes)
and useful computing. In the case of sentiment analysis, let us assume there are three sentiments such as
nice, nasty, or neutral, and Naïve Bayes classifier is used to distinguish between them.
Simple emotion modelling combines a statistically based classifier with a dynamical model. The Naïve
Bayes classifier employs ‘single words’ and ‘word pairs’ like features and determines the sentiments of
the users. It allocates user utterances into nice, nasty, and neutral classes, labelled as +1, −1, and 0,
respectively. This binary output drives a simple first-order dynamical system, whose emotional state
represents the simulated emotional state of the experiment’s personification.
Decision tree:
➢ Decision tree learning is used for classification; it builds a model in the form of a tree structure.
➢ The goal of decision tree learning is to create a model (based on the past data called past
vector) that predicts the value of the output variable based on the input variables in the
feature vector.
➢ Each node (or decision node) of a decision tree corresponds to one of the features in the feature vector. From every node, there are edges to children, with one edge for each of the possible values (or ranges of values) of the feature associated with the node.
➢ The tree terminates at different leaf nodes (or terminal nodes) where each leaf node
represents a possible value for the output variable.
Each internal node (represented by a box) tests an attribute (represented as ‘A’/‘B’ within the boxes). Each branch corresponds to an attribute value (T/F in this case). Each leaf node assigns a classification. The first node is called the ‘Root’ node, intermediate nodes are ‘Branch’ nodes, and terminal nodes are ‘Leaf’ nodes: here ‘A’ is the Root node (first node), ‘B’ is a Branch node, and ‘T’ & ‘F’ are Leaf nodes.
Thus, a decision tree consists of three types of nodes:
• Root Node
• Branch Node
• Leaf Node
• Root Node: The root node is the starting point of a tree. At this point, the first split is performed.
• Internal Nodes: Each internal node represents a decision point (predictor variable) that
eventually leads to the prediction of the outcome.
• Leaf/ Terminal Nodes: Leaf nodes represent the final class of the outcome and therefore they’re
also called terminating nodes.
• Branches: Branches are connections between nodes, they’re represented as arrows. Each branch
represents a response such as yes or no.
Decision tree construction:
1. Data set
2. Approach to select relevant attributes
3. Test Condition
4. Splitting – used to grow the tree
Measures :
1. Entropy
2. Information gain
3. Gini Index
Fig. shows an example decision tree for a car driving – the decision to be taken is whether to ‘Keep
Going’ or to ‘Stop’, which depends on various situations as depicted in the figure. If the signal is RED in
colour, then the car should be stopped. If there is not enough gas (petrol) in the car, the car should be
stopped at the next available gas station.
Building a decision tree:
Decision trees are built corresponding to the training data following an approach called recursive
partitioning. The approach splits the data into multiple subsets on the basis of the feature values. It
starts from the root node, which is nothing but the entire data set. It first selects the feature which
predicts the target class in the strongest way. The decision tree splits the data set into multiple
partitions, with data in each partition having a distinct value for the feature based on which the
partitioning has happened. This is the first set of branches.
Likewise, the algorithm continues splitting the nodes on the basis of the feature which helps in the best
partition. This continues till a stopping criterion is reached. The usual stopping criteria are –
1. All or most of the examples at a particular node have the same class
2. All features have been used up in the partitioning
3. The tree has grown to a pre-defined threshold limit
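A minimal sketch of building such a tree with scikit-learn (the breast cancer data is reused here only as an example; the criterion and max_depth are illustrative choices):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# criterion="entropy" splits on information gain; max_depth acts as a
# pre-defined threshold limit, i.e. a stopping criterion
tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
tree.fit(X_train, y_train)
print("Training accuracy: {:.2f}".format(tree.score(X_train, y_train)))
print("Test accuracy: {:.2f}".format(tree.score(X_test, y_test)))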
FIGURE: A decision tree for the concept PlayTennis. An example is classified by sorting
it through the tree to the appropriate leaf node, then returning the classification associated
with this leaf
Example:
Global Technology Solutions (GTS), a leading provider of IT solutions, is coming to College of
Engineering and Management (CEM) for hiring B.Tech. students. Last year during campus recruitment,
they had shortlisted 18 students for the final interview. Being a company of international repute, they
follow a stringent interview process to select only the best of the students. The information related to
the interview evaluation results of shortlisted students (hiding the names) on the basis of different
evaluation parameters is available for reference in Figure 7.10. Chandra, a student of CEM, wants to find
out if he may be offered a job in GTS. His CGPA is quite high. His self-evaluation on the other
parameters is as follows:
There are many implementations of decision tree, the most prominent ones being C5.0, CART
(Classification and Regression Tree), CHAID (Chi-square Automatic Interaction Detector) and ID3
(Iterative Dichotomiser 3) algorithms. The biggest challenge of a decision tree algorithm is to find out
which feature to split upon.
Entropy is a measure of impurity of an attribute or feature.
The information gain is calculated on the basis of the decrease in entropy (S) after a data set is split
according to a particular attribute (A).
Constructing a decision tree is all about finding an attribute that returns the highest information gain
(i.e. the most homogeneous branches).
Note: Like information gain, there are other measures like Gini index or chi-square for individual nodes
to decide the feature on the basis of which the split has to be applied. The CART algorithm uses Gini
index, while the CHAID algorithm uses chi-square for deciding the feature for applying split.
Entropy(S) = -(p+) log2(p+) - (p-) log2(p-)
where
p+ is the proportion of positive examples in S
p- is the proportion of negative examples in S.
Entropy measures the impurity or uncertainty present in the data. It is used to decide how a
Decision Tree can split the data.
• The entropy is 0 when all the examples in the collection belong to the same class
• The entropy is 1 when the collection contains an equal number of positive and negative examples
• If the collection contains unequal numbers of positive and negative examples, the entropy is between 0 and 1
Example: Entropy
• Suppose S is a collection of 14 examples of some boolean concept, including 9 positive and 5 negative examples. Then the entropy of S relative to this boolean classification is
Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Ex:
Target class: ‘JobOffered?’—Yes and No
The value of pi for class value ‘Yes’ is 0.44 (i.e. 8/18) and that for class value ‘No’ is 0.56 (i.e. 10/18). So,
we can calculate the entropy as
Entropy(S) = -0.44 log2(0.44) - 0.56 log2(0.56) = 0.99
The information gain is calculated on the basis of the decrease in entropy (S) after a data set is split according to a particular attribute (A).
Constructing a decision tree is all about finding an attribute that returns the highest information gain
(i.e. the most homogeneous branches). If the information gain is 0, it means that there is no reduction in
entropy due to split of the data set according to that particular feature. On the other hand, the
maximum amount of information gain which may happen is the entropy of the data set before the split.
• Information gain is the expected reduction in entropy caused by partitioning the examples according to an attribute.
• The information gain, Gain(S, A), of an attribute A relative to a collection of examples S, is defined as
Gain(S, A) = Entropy(S) - Σ (|Sv|/|S|) × Entropy(Sv), summed over all values v of attribute A, where Sv is the subset of S for which attribute A has value v.
Information gain for a particular feature A is calculated by the difference in entropy before a split (or
Sbs ) with the entropy after the split (Sas ).
Information Gain (S, A) = Entropy (Sbs ) − Entropy (Sas )
We will find the value of entropy at the beginning before any split happens and then again after the
split happens. We will compare the values for all the cases –
1. when the feature ‘CGPA’ is used for the split
2. when the feature ‘Communication’ is used for the split
3. when the feature ‘Aptitude’ is used for the split
4. when the feature ‘Programming Skills’ is used for the split
As calculated, entropy of the data set before split (i.e. Entropy (Sbs )) = 0.99, and entropy of the data set
after split (i.e. Entropy (Sas )) is
0.69 when the feature ‘CGPA’ is used for split
0.63 when the feature ‘Communication’ is used for split
0.52 when the feature ‘Aptitude’ is used for split
0.95 when the feature ‘Programming skill’ is used for split
Fig: Entropy and information gain calculation (Level 1)
Therefore, the information gain from the feature ‘CGPA’ = 0.99 − 0.69 = 0.3, whereas the information
gain from the feature ‘Communication’ = 0.99 − 0.63 = 0.36. Likewise, the information gain for
‘Aptitude’ and ‘Programming skills’ is 0.47 and 0.04, respectively.
Hence, it is quite evident that among all the features, ‘Aptitude’ results in the best information gain
when adopted for the split. So, ‘Aptitude’ will be the first node of the decision tree formed.
One important point to be noted here is that for Aptitude = Low, entropy is 0, which indicates that the
branch towards Aptitude = Low will not continue any further.
As a part of level 2, only one branch to navigate in this case – the one for Aptitude = High.
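The entropy and information gain figures above can be verified with a few lines of Python (the class counts of 8 ‘Yes’ and 10 ‘No’ and the post-split entropies are taken from the example):
import math

def entropy(counts):
    # Entropy = -sum(p_i * log2(p_i)) over the class proportions
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# 8 'Yes' and 10 'No' among the 18 shortlisted students
entropy_before = entropy([8, 10])
print("Entropy before split: {:.2f}".format(entropy_before))  # approx. 0.99

# information gain = entropy before the split - entropy after the split
for feature, entropy_after in [("CGPA", 0.69), ("Communication", 0.63),
                               ("Aptitude", 0.52), ("Programming skill", 0.95)]:
    print(feature, "gain: {:.2f}".format(entropy_before - entropy_after))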
1. Kernel Trick: It utilizes the existing features, applies some transformations, and creates new features; these new features are the key to finding a nonlinear decision boundary. The two most popular kernels are:
a) Polynomial Kernel
b) Radial Basis Function (RBF)
2. Soft Margin: It is used for both linearly and non-linearly separable data. Applying a soft margin, SVM tolerates a few data points being misclassified and tries to balance the trade-off between finding a line that maximizes the margin and minimizing the misclassification.
Support Vector Machines:
An SVM model is a representation of the input instances as points in the feature space, which are mapped so that an apparent gap between them divides the instances of the separate classes.
In other words, the goal of the SVM analysis is to find a plane, or rather a hyperplane, which
separates the instances on the basis of their classes. New examples (i.e. new instances) are then
mapped into that same space and predicted to belong to a class on the basis of which side of the gap
the new instance will fall on. In summary, in the overall training process, the SVM algorithm analyses
input data and identifies a surface in the multi-dimensional feature space called the hyperplane.
There may be many possible hyperplanes, and one of the challenges with the SVM model is to
find the optimal hyperplane.
In the first figure, the data is linearly separable. Here the hyperplane D1 may give more errors with respect to test data and does not give a generalized model.
The aim is that the distance between the marginal planes (the margin) should be as high as possible.
Generalization error in terms of SVM is the measure of how accurately and precisely this SVM
model can predict values for previously unseen data (new data). A hard margin in terms of SVM
means that an SVM model is inflexible in classification and tries to work exceptionally fit in the
training set, thereby causing overfitting.
Support Vectors: Support vectors are the data points (representing classes), the critical components in a data set, which lie nearest to the identified hyperplane. If the support vectors are removed, the position of the dividing hyperplane will be altered.
Hyperplane and Margin: For an N-dimensional feature space, hyperplane is a flat subspace of
dimension (N−1) that separates and classifies a set of data. For example, if we consider a two-
dimensional feature space (which is nothing but a data set having two features and a class variable), a
hyperplane will be a one-dimensional subspace or a straight line. In the same way, for a three-
dimensional feature space (data set having three features and a class variable), hyperplane is a two-
dimensional subspace or a simple plane.
The farther from the hyperplane the data points lie, the more confident we can be about correct categorization. So, when a new test data point/data set is added, the side of the hyperplane on which it lands will decide the class that we assign to it. The distance between the hyperplane and the nearest data points is known as the margin.
Scenario 2:
we have three hyperplanes: A, B, and C. We have to identify the correct hyperplane which classifies the
triangles and circles in the best possible way.
Here, maximizing the distances between the nearest data points of both classes and the hyperplane will help us decide the correct hyperplane. This distance is called the margin.
Scenario 3:
Here SVM selects the hyperplane which classifies the classes accurately before maximizing the margin.
Here, hyperplane B has a classification error, and A has classified all data instances correctly. Therefore,
A is the correct hyperplane.
Scenario 4:
In this scenario, as shown in Figure 7.19a, it is not possible to distinctly segregate the two classes by
using a straight line, as one data instance belonging to one of the classes (triangle) lies in the territory
of the other class (circle) as an outlier.
One triangle at the other end is like an outlier for the triangle class. SVM has a feature to ignore
outliers and find the hyperplane that has the maximum margin (hyperplane A, as shown in Fig.
7.19b). Hence, we can say that SVM is robust to outliers.
Fig: 7.19 Support vector machine: Scenario 4
So, by summarizing the observations from the different scenarios, we can say that
1. The hyperplane should segregate the data instances belonging to the two classes in the best
possible way.
2. It should maximize the distances between the nearest data points of both the classes, i.e.
maximize the margin.
3. If there is a need to prioritize between higher margin and lesser misclassification, the
hyperplane should try to reduce misclassifications.
➢ Support vectors, as can be observed in Figure 7.20, are data instances from the two classes
which are closest to the MMH.
➢ There should be at least one support vector from each class. The identification of support
vectors requires intense mathematical formulation.
➢ Modelling a problem using SVM is nothing but identifying the support vectors and MMH
corresponding to the problem space.
Identifying the MMH for linearly separable data
➢ In this case, an outer boundary needs to be drawn for the data instances belonging to the
different classes. These outer boundaries are known as convex hull.
Fig: Drawing the MMH for linearly separable data
The MMH can be represented by the equation w · x + b = 0. Using this equation, the objective is to find a set of values for the vector w such that two hyperplanes, represented by the equations below, can be specified:
w · x + b ≥ +1 for the data instances of one class, and
w · x + b ≤ −1 for the data instances of the other class.
This is to ensure that all the data instances that belong to one class fall above one hyperplane and all the data instances belonging to the other class fall below the other hyperplane. According to vector geometry, the distance between these two hyperplanes is 2/‖w‖, so in order to maximize the distance between them, ‖w‖ should be minimized.
For non-linearly separable data, we have to use a slack variable ξ, which provides some soft margin for data instances that fall on the wrong side of the hyperplane. A cost value ‘C’ is imposed on all such data instances. The task of SVM is now to minimize the total cost due to such data instances in order to solve the revised optimization problem: minimize (‖w‖²/2) + C Σ ξi, subject to the margin constraints relaxed by the slack variables.
Drawing the MMH for non-linearly separable data
Kernel trick:
➢ One way to deal with nonlinearly separable data is by using a slack variable and an
optimization function to minimize the cost value.
➢ SVM has a technique called the kernel trick to deal with non-linearly separable data.
➢ These are functions which can transform lower dimensional input space to a higher
dimensional space. In the process, it converts linearly non separable data to a linearly
separable data. These functions are called kernels.
Kernel trick in SVM
When data instances of the classes are closer to each other, this method can be used. The
effectiveness of SVM depends both on the:
• selection of the kernel function
• adoption of values for the kernel parameters
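For example, a minimal scikit-learn sketch of a kernelized SVM (the RBF kernel and the C/gamma values are illustrative, untuned choices; the breast cancer data is used only for demonstration):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=0)

# SVMs are sensitive to feature scales, so standardize the features first
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# RBF kernel; C controls the softness of the margin, gamma the kernel width
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train_scaled, y_train)
print("Test accuracy: {:.2f}".format(svm.score(X_test_scaled, y_test)))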
Strengths of SVM
• SVM can be used for both classification and regression.
• It is robust, i.e. not much impacted by data with noise or outliers.
• The prediction results using this model are very promising.
Weaknesses of SVM
• SVM is applicable only for binary classification, i.e. when there are only two classes in the
problem domain.
• The SVM model is very complex – almost like a black box when it deals with a high-dimensional
data set. Hence, it is very difficult and close to impossible to understand the model in such
cases.
• It is slow for a large dataset, i.e. a data set with either a large number of features or a large
number of instances.
• It is quite memory-intensive.
Applications of SVM:
➢ SVM can be applied in the field of bioinformatics – more specifically, in detecting cancer and other genetic disorders.
➢ It can also be used in detecting the image of a face by binary classification of images into face
and non-face components.
The difference between a hard margin and a soft margin in SVMs lies in the separability of
the data. If our data is linearly separable, we go for a hard margin.
Sometimes the data is linearly separable, but the margin is so small that the model becomes prone to overfitting or too sensitive to outliers. In such cases, we can opt for a larger margin by using a soft margin SVM in order to help the model generalize better.
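In scikit-learn this trade-off is controlled by the C parameter of SVC: a very large C approximates a hard margin (few misclassifications tolerated), while a small C gives a softer margin that tolerates some misclassified points in exchange for better generalization. A brief illustrative sketch (the dataset and C values are chosen only for demonstration):
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# two slightly overlapping classes
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for C in [0.01, 1, 100]:
    svm = SVC(kernel="linear", C=C).fit(X_train, y_train)
    print("C = {:>6}: test accuracy = {:.2f}".format(C, svm.score(X_test, y_test)))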