
UNIT II:

Supervised Learning: Classification and Regression, Linear Regression: Single and Multiple, Logistic
Regression, Ridge Regression, Lasso Regression, k-Nearest Neighbour, Naive Bayes Classifier, Decision
Tree, Support Vector Machine (TB-1)

Website:

Scikit-learn SVM Tutorial with Python (Support Vector Machines) | DataCamp

Supervised Learning:
Supervised learning is used whenever we want to predict a certain outcome from a given input.
Its goal is to make accurate predictions for new, never-before-seen data.
Supervised learning often requires human effort to build the training set, but afterward automates and
often speeds up an otherwise laborious or infeasible task.

In supervised learning, the labelled training data represents the experience or prior knowledge. It is called
supervised learning because the machine learns from training data that is already labelled.
Training data is past information with a known value of the class field or 'label'. Hence, we say that the
training data is labelled in the case of supervised learning.

Some more examples of supervised learning are as follows:


• Prediction of results of a game based on the past analysis of results.
• Predicting whether a tumour is malignant or benign on the basis of the analysis of data.
• Price prediction in domains such as real estate, stocks, etc.
Classification and Regression:
There are two major types of supervised machine learning problems, called classification and regression.
• When we are trying to predict a categorical or nominal variable, the problem is known as a
classification problem.
In classification, the goal is to predict a class label, which is a choice from a predefined list of possibilities.
Classification is sometimes separated into binary classification, which is the special case of
distinguishing between exactly two classes, and multiclass classification, which is classification between
more than two classes. You can think of binary classification as trying to answer a yes/no question.
Classifying emails as either spam or not spam is an example of a binary classification problem.
In this binary classification task, the yes/no question being asked would be "Is this email spam?"
The iris example, on the other hand, is an example of a multiclass classification problem.
Another example is predicting what language a website is in from the text on the website. The classes
here would be a pre-defined list of possible languages.

FIG. 7.2 Classification model


In summary, classification is a type of supervised learning where a target feature, which is of
categorical type, is predicted for test data on the basis of the information imparted by the training
data. The target categorical feature is known as class .
Some typical classification problems include the following:
➢ Image classification
➢ Disease prediction
➢ Win–loss prediction of games
➢ Prediction of natural calamity such as earthquake, flood, etc.
➢ Handwriting recognition
For regression tasks, the goal is to predict a continuous number, or a floating-point number in
programming terms (or real number in mathematical terms). Predicting a person’s annual income from
their education, their age, and where they live is an example of a regression task. When predicting
income, the predicted value is an amount, and can be any number in a given range. Another example of a
regression task is predicting the yield of a corn farm given attributes such as previous yields, weather,
and number of employees working on the farm. The yield again can be an
arbitrary number.
An easy way to distinguish between classification and regression tasks is to ask whether there is some
kind of continuity in the output. If there is continuity between possible outcomes, then the problem
is a regression problem. Think about predicting annual income. There is a clear continuity in the output.
k-Nearest Neighbors
The k-NN algorithm is arguably the simplest machine learning algorithm. Building the model consists
only of storing the training dataset. To make a prediction for a new data point, the algorithm finds the
closest data points in the training dataset—its “nearest neighbors.”

k-Neighbors classification:
In its simplest version, the k-NN algorithm only considers exactly one nearest neighbor, which is the
closest training data point to the point we want to make a prediction for. The prediction is then simply
the known output for this training point.

mglearn.plots.plot_knn_classification(n_neighbors=1)

Figure 2-4. Predictions made by the one-nearest-neighbor model on the forge dataset

Here, we added three new data points, shown as stars. For each of them, we marked the closest point in the
training set. The prediction of the one-nearest-neighbor algorithm is the label of that point.

Instead of considering only the closest neighbor, we can also consider an arbitrary number, k, of
neighbors. This is where the name of the k-nearest neighbors algorithm comes from. When considering
more than one neighbor, we use voting to assign a label. This means that for each test point, we count
how many neighbors belong to class 0 and how many neighbors belong to class 1. We then assign the
class that is more frequent: in other words, the majority class among the k-nearest neighbors. The
following example (Figure 2-5) uses the three closest neighbors:

In[11]:
mglearn.plots.plot_knn_classification(n_neighbors=3)

Figure 2-5. Predictions made by the three-nearest-neighbors model on the forge dataset
Again, the prediction is shown as the color of the cross. You can see that the prediction for the new
data point at the top left is not the same as the prediction when we used only one neighbor.
While this illustration is for a binary classification problem, this method can be applied to datasets with
any number of classes. For more classes, we count how many neighbors belong to each class and again
predict the most common class.

import mglearn
from sklearn.model_selection import train_test_split

X, y = mglearn.datasets.make_forge()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

Next, we import and instantiate the class. This is when we can set parameters, like the number of
neighbors to use. Here, we set it to 3:
In[13]:
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=3)

Now, we fit the classifier using the training set. For KNeighborsClassifier this means storing the
dataset, so we can compute neighbors during prediction:
In[14]: clf.fit(X_train, y_train)
To make predictions on the test data, we call the predict method. For each data point in the test set,
this computes its nearest neighbors in the training set and finds the most common class among these:
In[15]: print("Test set predictions: {}".format(clf.predict(X_test)))
Out[15]: Test set predictions: [1 0 1 0 1 0 0]

To evaluate how well our model generalizes, we can call the score method with the test data together with the
test labels:
In[16]: print("Test set accuracy: {:.2f}".format(clf.score(X_test, y_test)))
Out[16]: Test set accuracy: 0.86
We see that our model is about 86% accurate, meaning the model predicted the class correctly for 86% of the
samples in the test dataset.

Let’s take the real-world Breast Cancer dataset. We begin by splitting the dataset into a training and a
test set. Then we evaluate training and test set performance with different numbers of neighbors.

import matplotlib.pyplot as plt


from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target,
random_state=66)
training_accuracy = []
test_accuracy = []
# try n_neighbors from 1 to 10
neighbors_settings = range(1, 11)
for n_neighbors in neighbors_settings:
    # build the model
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    clf.fit(X_train, y_train)
    # record training set accuracy
    training_accuracy.append(clf.score(X_train, y_train))
    # record generalization accuracy
    test_accuracy.append(clf.score(X_test, y_test))

plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.ylabel("Accuracy")
plt.xlabel("n_neighbors")
plt.legend()
The plot shows the training and test set accuracy on the y-axis against the setting of n_neighbors on
the x-axis. While real-world plots are rarely very smooth, we can still recognize some of the
characteristics of overfitting and underfitting.
Considering a single nearest neighbor, the prediction on the training set is perfect. But when more
neighbors are considered, the model becomes simpler and the training accuracy drops. The test set
accuracy for using a single neighbor is lower than when using more neighbors, indicating that using the
single nearest neighbor leads to a model that is too complex. On the other hand, when considering 10
neighbors, the model is too simple and performance is even worse. The best performance is somewhere
in the middle, using around six neighbors. Still, it is good to keep the scale of the plot in mind. The
worst performance is around 88% accuracy, which might still be acceptable.

k-neighbors regression:
There is also a regression variant of the k-nearest neighbors algorithm. Again, we start with the single nearest
neighbor, this time using the wave dataset. We've added three test data points as green stars on the x-
axis. The prediction using a single neighbor is just the target value of the nearest neighbor.
mglearn.plots.plot_knn_regression(n_neighbors=1)
Again, we can use more than the single closest neighbor for regression. When using multiple nearest
neighbors, the prediction is the average, or mean, of the relevant neighbors.
mglearn.plots.plot_knn_regression(n_neighbors=3)

Figure 2-9. Predictions made by three-nearest-neighbors regression on the wave dataset


The k-nearest neighbors algorithm for regression is implemented in the KNeighborsRegressor class in
scikit-learn. It’s used similarly to KNeighborsClassifier:
In[21]:
from sklearn.neighbors import KNeighborsRegressor
X, y = mglearn.datasets.make_wave(n_samples=40)
# split the wave dataset into a training and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# instantiate the model and set the number of neighbors to consider to 3
reg = KNeighborsRegressor(n_neighbors=3)
# fit the model using the training data and training targets
reg.fit(X_train, y_train)
Now we can make predictions on the test set:
In[22]: print("Test set predictions:\n{}".format(reg.predict(X_test)))
Out[22]: Test set predictions:
[-0.054 0.357 1.137 -1.894 -1.139 -1.631 0.357 0.912 -0.447 -1.139]
We can also evaluate the model using the score method, which for regressors returns the R^2 score. The
R^2 score, also known as the coefficient of determination, is a measure of goodness of a prediction for a
regression model, and yields a score between 0 and 1. A value of 1 corresponds to a perfect prediction,
and a value of 0 corresponds to a constant model that just predicts the mean of the training set
responses, y_train:
In[23]: print("Test set R^2: {:.2f}".format(reg.score(X_test, y_test)))
Out[23]: Test set R^2: 0.83
Here, the score is 0.83, which indicates a relatively good model fit.
Strengths, weaknesses, and parameters:

In principle, there are two important parameters to the KNeighborsClassifier: the number of neighbors and how
you measure distance between data points. In practice, using a small number of neighbors like three or five often
works well, but you should certainly adjust this parameter. By default, Euclidean distance is used, which works
well in many settings.
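
As a brief added illustration (not from the textbook), the sketch below fits KNeighborsClassifier on a tiny made-up dataset twice: once with the default Euclidean metric and once with the Manhattan metric. The data points are illustrative assumptions only.

from sklearn.neighbors import KNeighborsClassifier

X_train = [[0, 0], [1, 1], [5, 5], [6, 6]]
y_train = [0, 0, 1, 1]

# default metric is "minkowski" with p=2, i.e. Euclidean distance
clf_euclidean = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
# "manhattan" (city-block, L1) distance as an alternative
clf_manhattan = KNeighborsClassifier(n_neighbors=3, metric="manhattan").fit(X_train, y_train)

print(clf_euclidean.predict([[2, 2]]))  # [0]
print(clf_manhattan.predict([[2, 2]]))  # [0]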
One of the strengths of k-NN is that the model is very easy to understand, and often gives reasonable
performance without a lot of adjustments. Building the nearest neighbors model is usually very fast, but when your
training set is very large (either in number of features or in number of samples) prediction can be slow.
When using the k-NN algorithm, it’s important to preprocess your data. This approach often does not perform
well on datasets with many features (hundreds or more), and it does particularly badly with datasets where
most features are 0 most of the time (so-called sparse datasets).
So, while the k-nearest neighbors algorithm is easy to understand, it is not often used in practice, due to
prediction being slow and its inability to handle many features.

kNN algorithm
Input: Training data set, test data set (or data points), value of ‘k’ (i.e. number of nearest neighbours to be considered)
Steps:
Do for all test data points
• Calculate the distance (usually Euclidean distance) of the test data point from the different training data points.
• Find the closest ‘k’ training data points, i.e. training data points whose distances are least from the test data point.
If k = 1
Then assign class label of the training data point to the test data point
Else
Whichever class label is predominantly present in the training data points, assign that class label to the test data point

End do
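
The following is a minimal from-scratch sketch of the steps above (an addition, not part of the original notes), assuming the data is held in NumPy arrays; the tiny two-cluster dataset and the helper name knn_predict are illustrative.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_test, k=3):
    predictions = []
    for x in X_test:
        # distance (Euclidean) of the test point from every training point
        distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        # indices of the k closest training points
        nearest = np.argsort(distances)[:k]
        # class label predominantly present among the k neighbours (majority vote)
        labels = y_train[nearest]
        predictions.append(Counter(labels).most_common(1)[0][0])
    return np.array(predictions)

# tiny illustrative data: two clusters labelled 0 and 1
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([[2, 2], [8, 7]]), k=3))  # [0 1]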
Why the kNN algorithm is called a lazy learner?
It stores the training data and directly applies the philosophy of nearest neighbourhood finding to arrive at the
classification. So, for kNN, there is no learning happening in the real sense. Therefore, kNN falls under the
category of lazy learner.
What is K-Nearest Neighbors (KNN)?
K-Nearest Neighbors is a machine learning technique and algorithm that can be used for both regression and
classification tasks. K-Nearest Neighbors examines the labels of a chosen number of data points surrounding a
target data point, in order to make a prediction about the class that the data point falls into. K-Nearest Neighbors
(KNN) is a conceptually simple yet very powerful algorithm, and for those reasons, it’s one of the most popular
machine learning algorithms. Let’s take a deep dive into the KNN algorithm and see exactly how it works. Having a
good understanding of how KNN operates will let you appreciate the best and worst use cases for KNN.
What is a KNN (K-Nearest Neighbors)? - Unite.AI
Overview of KNN:
Let’s visualize a dataset on a 2D plane. Picture a bunch of data points on a graph, spread out along the graph in
small clusters. KNN examines the distribution of the data points and, depending on the arguments given to the
model, it separates the data points into groups. These groups are then assigned a label. The primary assumption
that a KNN model makes is that data points/instances which exist in close proximity to each other are highly
similar, while if a data point is far away from another group it’s dissimilar to those data points.
A KNN model calculates similarity using the distance between two points on a graph. The greater the distance
between the points, the less similar they are. There are multiple ways of calculating the distance between points,
but the most common distance metric is just Euclidean distance (the distance between two points in a straight
line).
KNN is a supervised learning algorithm, meaning that the examples in the dataset must have labels assigned to
them/their classes must be known. There are two other important things to know about KNN. First, KNN is a non-
parametric algorithm. This means that no assumptions about the dataset are made when the model is used.
Rather, the model is constructed entirely from the provided data. Second, KNN does not build an abstract model
during a separate training phase. It makes no generalizations from the training data in advance, so all of the
training data is needed when the model is asked to make predictions.

How a KNN Algorithm Operates


A KNN algorithm goes through five main steps as it is carried out:
1. Setting K to the chosen number of neighbors.
2. Calculating the distance between a provided/test example and the dataset examples.
3. Sorting the calculated distances.
4. Getting the labels of the top K entries.
5. Returning a prediction about the test example.
In the first step, K is chosen by the user and it tells the algorithm how many neighbors (how many
surrounding data points) should be considered when rendering a judgment about the group the target
example belongs to. In the second step, note that the model checks the distance between the target
example and every example in the dataset. The distances are then added into a list and sorted. Afterward,
the sorted list is checked and the labels for the top K elements are returned. In other words, if K is set to
5, the model checks the labels of the top 5 closest data points to the target data point. When rendering a
prediction about the target data point, it matters if the task is a regression or classification task. For a
regression task, the mean of the top K labels is used, while the mode of the top K labels is used in the
case of classification.
The exact mathematical operations used to carry out KNN differ depending on the chosen distance
metric.
KNN Pros And Cons
Pros:
• KNN can be used for both regression and classification tasks, unlike some other supervised learning algorithms.
• KNN is simple to use and often gives good accuracy without much tuning. It's easy to interpret, understand, and implement.
• KNN doesn’t make any assumptions about the data, meaning it can be used for a wide variety of problems.
Cons:
• KNN stores most or all of the data, which means that the model requires a lot of memory and is computationally
expensive. Large datasets can also cause predictions to take a long time.
• KNN proves to be very sensitive to the scale of the dataset, and it can be thrown off by irrelevant features fairly easily
in comparison to other models.
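
Because of this scale sensitivity, a common remedy is to standardize the features before applying k-NN. The sketch below (an addition, not from the notes) uses scikit-learn's StandardScaler in a Pipeline on the Breast Cancer dataset used earlier; the choice of n_neighbors=3 is an assumption.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=66)

# scale each feature to zero mean and unit variance, then apply k-NN
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn.fit(X_train, y_train)
print("Test accuracy with scaling: {:.2f}".format(knn.score(X_test, y_test)))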
Regression is a technique for investigating the relationship between independent variables or features and a
dependent variable or outcome. It's used as a method for predictive modelling in machine learning, in which an algorithm is
used to predict continuous outcomes.
Simple or Single Linear Regression:
It is the simplest regression model, which involves only one predictor. This model assumes a linear relationship
between the dependent variable and the predictor variable.
Linear regression performs the task to predict a dependent variable value (y) based on a given
independent variable (x). So, this regression technique finds out a linear relationship between x (input)
and y(output). Hence, the name is Linear Regression.

FIG. 8.1 Simple linear regression


Linear models for regression:

Linear models make a prediction using a linear function of the input features.
For regression, the general prediction formula for a linear model looks as follows:

ŷ = w[0] * x[0] + w[1] * x[1] + ... + w[p] * x[p] + b

Here, x[0] to x[p] denotes the features (in this example, the number of features is p) of a single data
point, w and b are parameters of the model that are learned, and ŷ is the prediction the model makes.
For a dataset with a single feature, this is:
ŷ = w[0] * x[0] + b

w[0] is the slope and b is the y-axis offset.


Linear regression, or ordinary least squares (OLS), is the simplest and most classic linear method for
regression. Linear regression finds the parameters w and b that minimize the mean squared error
between predictions and the true regression targets, y, on the training set. The mean squared
error is the average of the squared differences between the predictions and the true values.

In[26]:
from sklearn.linear_model import LinearRegression
X, y = mglearn.datasets.make_wave(n_samples=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
lr = LinearRegression().fit(X_train, y_train)
The “slope” parameters (w), also called weights or coefficients, are stored in the coef_ attribute, while the
offset or intercept (b) is stored in the intercept_ attribute:
In[27]:
print("lr.coef_: {}".format(lr.coef_))
print("lr.intercept_: {}".format(lr.intercept_))
Out[27]:
lr.coef_: [ 0.394]
lr.intercept_: -0.031804343026759746

The intercept_ attribute is always a single float number, while the coef_ attribute is
a NumPy array with one entry per input feature.
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lr.score(X_test, y_test)))

Multiple Linear Regression:


In a multiple regression model, two or more independent variables, i.e. predictors, are involved in the
model.
Simple linear regression considered the Price of a Property as the dependent variable and the Area of
the Property (in sq. m.) as the predictor variable. However, location, floor, number of years since
purchase, amenities available, etc. are also important predictors or independent variables.
A multiple regression equation is shown below:

Price_Property = f(Area_Property, Location, Floor, Ageing, Amenities)

The simple linear regression model and the multiple regression model assume that the dependent
variable is continuous.

The following expression describes the equation involving the relationship with two predictor variables,
namely X1 and X2 .
Ŷ = a + b1 X1 + b2 X2

The model describes a plane in the three-dimensional space of Ŷ, X1, and X2. Parameter ‘a’ is the
intercept of this plane. Parameters ‘b1’ and ‘b2 ’ are referred to as partial regression coefficients.
Parameter b1 represents the change in the mean response corresponding to a unit change in X1 when
X2 is held constant. Parameter b2 represents the change in the mean response corresponding to a unit
change in X2 when X1 is held constant.
Ŷ = 22 + 0.3X1 + 1.2X2

FIG. 8.15 Multiple regression plane


The multiple regression estimating equation when there are 'n' predictor variables is as follows:
Ŷ = a + b1X1 + b2X2 + b3X3 + … + bnXn
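
As a brief added illustration (not from the notes), the sketch below fits a multiple linear regression with two predictors on synthetic data generated from the illustrative plane Ŷ = 22 + 0.3X1 + 1.2X2 given above; the noise level and sample size are assumptions.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.uniform(0, 100, size=(200, 2))                      # two predictors X1, X2
y = 22 + 0.3 * X[:, 0] + 1.2 * X[:, 1] + rng.normal(0, 2, size=200)

mlr = LinearRegression().fit(X, y)
print("Intercept (a):", round(mlr.intercept_, 2))           # close to 22
print("Coefficients (b1, b2):", np.round(mlr.coef_, 2))     # close to [0.3, 1.2]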

While finding the best fit, we are not restricted to a straight line; we can also fit a polynomial or a curvilinear
function. These are known as polynomial and curvilinear regression, respectively.
Note on encoding categorical features with OneHotEncoder(): the input to this transformer should be an array-like of
integers or strings, denoting the values taken on by categorical (discrete) features. The features are encoded using a
one-hot (aka 'one-of-K' or 'dummy') encoding scheme. This creates a binary column for each category and returns a
sparse matrix or dense array (depending on the sparse_output parameter).
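
A minimal sketch of OneHotEncoder (added here for illustration); the Location values below are hypothetical, and sparse_output=False assumes scikit-learn 1.2 or later (older versions use the sparse parameter instead).

import numpy as np
from sklearn.preprocessing import OneHotEncoder

locations = np.array([["Downtown"], ["Suburb"], ["Downtown"], ["Rural"]])
encoder = OneHotEncoder(sparse_output=False)   # return a dense array instead of a sparse matrix
encoded = encoder.fit_transform(locations)
print(encoder.categories_)   # the learned list of categories
print(encoded)               # one binary column per category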
The residual is the difference between the actual value and the predicted value. The Residual Sum of Squares is
RSS = Σ(yi − ŷi)^2 and the Total Sum of Squares is TSS = Σ(yi − ȳ)^2; the R^2 score discussed earlier is
R^2 = 1 − RSS/TSS.

Note: Multicollinearity occurs when two or more independent variables have a high correlation with one
another in a regression model, which makes it difficult to determine the individual effect of each independent
variable on the dependent variable.
Although multicollinearity does not bias the regression estimates, it makes them vague, imprecise, and unreliable. Thus, it can be hard to
determine how the independent variables influence the dependent variable individually.

Assumptions in Regression Analysis:

1. The dependent variable (Y) can be calculated/predicted as a linear function of a specific set of
independent variables (X's) plus an error term (ε).
2. The number of observations (n) is greater than the number of parameters (k) to be estimated, i.e.
n > k.
3. Relationships determined by regression are only relationships of association based on the data set
and not necessarily of cause and effect of the defined class.
4. The regression line is valid only over a limited range of data. If the line is extended beyond that range
(extrapolation), it may lead to wrong predictions.
5. If the business conditions change and the business assumptions underlying the regression model are
no longer valid, then the past data set will no longer be able to predict future trends.
6. The variance of the error term is the same for all values of X (homoskedasticity).
7. The error term (ε) is normally distributed. This also means that the mean of the error (ε) has an
expected value of 0.
8. The values of the error (ε) are independent and are not related to any values of X. This means that
the error for one observation (X, Y) is not related to the error for any other observation (X, Y).

Main Problems in Regression Analysis:


In multiple regressions, there are two primary problems:
multicollinearity and heteroskedasticity.
• Multicollinearity is the situation in which the degree of correlation is not only between the
dependent variable and the independent variable, but there is also a strong correlation within
(among) the independent variables themselves. When multicollinearity is present, it increases
the standard errors of the coefficients
• Heteroskedasticity refers to the changing variance of the error term. If the variance of the error
term is not constant across observations, there will be erroneous predictions.

Improving Accuracy of the Linear Regression Model:


The concept of bias and variance is similar to accuracy and prediction. Accuracy refers to how close the
estimation is near the actual value, whereas prediction refers to continuous estimation of the value.
➢ High bias = low accuracy (not close to real value)
➢ High variance = low prediction (values are scattered)
➢ Low bias = high accuracy (close to real value)
➢ Low variance = high prediction (values are close to each other)
Bias is one type of error that occurs due to wrong assumptions about data such as
assuming data is linear when in reality, data follows a complex function. On the other
hand, variance gets introduced with high sensitivity to variations in training data.

Let us say we have a regression model which is highly accurate and highly predictive; therefore, the
overall error of our model will be low, implying a low bias (high accuracy) and low variance (high
prediction). This is highly preferable.

Similarly, we can say that if the variance increases (low prediction), the spread of our data points
increases, which results in less accurate prediction. As the bias increases (low accuracy), the error
between our predicted value and the observed values increases. Therefore, balancing out bias and
accuracy is essential in a regression model.

In the linear regression model, it is assumed that the number of observations (n) is greater than the
number of parameters (k) to be estimated, i.e. n > k, and in that case, the least squares estimates tend
to have low variance and hence will perform well on test observations.
However, if n is not much larger than k, then there can be high variability in the least squares fit,
resulting in overfitting and leading to poor predictions.
If k > n, then linear regression is not usable. This also indicates infinite variance, and so, the method
cannot be used at all.

Logistic Regression:

Logistic regression is both a classification and a regression technique, depending on the scenario in which it is used.
Logistic regression (logit regression) is a type of regression analysis used for predicting the
outcome of a categorical dependent variable.
In logistic regression, dependent variable (Y) is binary (0,1) and independent variables (X) are
continuous in nature.
The probabilities describing the possible outcomes (probability that Y = 1) of a single trial are
modeled as a logistic function of the predictor variables.

In the logistic regression model, there is no R^2 to gauge the fit of the overall model; however, a chi-
square test is used to gauge how well the logistic regression model fits the data.

Note: A chi-square test is a statistical test that is used to compare the observed results of an
experiment with the expected results. Its main purpose is to determine whether the difference between the
observed data and the expected data is due to chance, or due to a relationship between the variables that you
are studying.

The goal of logistic regression is to predict the likelihood that Y is equal to 1 (probability that Y =
1 rather than 0) given certain values of X. That is, if X and Y have a strong positive linear relationship,
the probability that a person will have a score of Y = 1 will increase as values of X increase. So, we are
predicting probabilities rather than the scores of the dependent variable.

For example, we might try to predict whether or not a small project will succeed or fail on the basis of
the number of years of experience of the project manager handling the project. We presume that those
project managers who have been managing projects for many years will be more likely to succeed. This
means that as X (the number of years of experience of project manager) increases, the probability that Y
will be equal to 1 (success of the new project) will tend to increase. If we take a hypothetical example in
which 60 already executed projects were studied and the years of experience of project managers
ranges from 0 to 20 years, we could represent this tendency to increase the probability that Y = 1 with a
graph.
A perfect relationship represents a perfectly curved S rather than a straight line.

An explanation of logistic regression begins with an explanation of the logistic function, which always
takes values between zero and one. The logistic formulae are stated in terms of
the probability that Y = 1, which is referred to as P.
The probability that Y = 0 is 1 − P.
To illustrate this, it is convenient to segregate years of experience into categories (i.e. 0–8, 9–16, 17–24,
25–32, 33–40). If we compute the mean score on Y (averaging the 0s and 1s) for each category of years
of experience, we will get an increasing proportion of successful projects as experience increases.
Note: ln(c) = k is equivalent to e^k = c, where 'ln' is the natural logarithm.

As X increases, the probability that Y = 1 increases. In other words, when the project manager has more
years of experience, a larger percentage of projects succeed. A perfect relationship represents a
perfectly curved S rather than a straight line, as was the case in OLS regression. So, to model this
relationship, we need some fancy algebra / mathematics that accounts for the bends in the curve.
Logistic regression always takes values between zero and one.
The logistic formulae are stated in terms of the probability that Y = 1, which is referred to as P.
The probability that Y is 0 is 1 − P.
The 'ln' symbol refers to the natural logarithm, and a + bX is the regression line equation: ln(P/(1 − P)) = a + bX.
Probability (P) can also be computed from the regression equation as P = exp(a + bX)/(1 + exp(a + bX)). So, if we
know the regression equation, we could, theoretically, calculate the expected probability that Y = 1 for a given value of X.

‘exp’ is the exponent function, which is sometimes also written as e.

Ex: a model that can predict whether a person is male or female on the basis of their height.
Given a height of 150 cm, coefficients of a = −100 and b = 0.6.
Using the above equation, we can calculate the probability of male given a height of 150 cm or more
formally P(male|height = 150).
y = e^(a + b × X)/(1 + e^(a + b × X))
y = exp(−100 + 0.6 × 150)/(1 + exp(−100 + 0.6 × 150))
y = 0.000046 or a probability of near zero that the person is a male.
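
The short sketch below (an added illustration, not from the notes) reproduces this arithmetic in Python, using math.exp for the 'exp' function.

import math

a, b, height = -100, 0.6, 150
# P(male | height) using the logistic function y = exp(a + b*X) / (1 + exp(a + b*X))
p_male = math.exp(a + b * height) / (1 + math.exp(a + b * height))
print(p_male)  # ~0.000045, i.e. a probability near zero that the person is male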

Assumptions in logistic regression


The following assumptions must hold when building a logistic regression model:
➢ There exists a linear relationship between logit function and independent variables
➢ The dependent variable Y must be categorical (1/0) and take binary value,
o e.g. if pass then Y = 1; else Y = 0
➢ The data meets the ‘iid’ criterion, i.e. the error terms, ε, are independent from one another and
identically distributed.
➢ The error term follows a binomial distribution [n, p]
o n = # of records in the data
o p = probability of success (pass, responder)

o Logistic regression is one of the most popular Machine Learning algorithms, which comes under
the Supervised Learning technique. It is used for predicting the categorical dependent variable
using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or False,
etc. but instead of giving the exact value as 0 and 1, it gives the probabilistic values which lie
between 0 and 1.
o Logistic Regression is much like Linear Regression except in how it is used.
Linear Regression is used for solving regression problems, whereas Logistic Regression is used
for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic function,
which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as whether the
cells are cancerous or not, a mouse is obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.
o Logistic Regression can be used to classify observations using different types of data and
can easily determine the most effective variables used for the classification. The logistic
(sigmoid) function is described in the next section.
Note: Logistic regression uses the concept of predictive modelling as in regression; therefore, it is
called logistic regression, but it is used to classify samples and therefore falls under the classification
algorithms.

Logistic Function (Sigmoid Function):

o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The output of logistic regression must lie between 0 and 1 and cannot go beyond this limit,
so it forms an "S"-shaped curve. The S-form curve is called the Sigmoid function or the
logistic function.
o In logistic regression, we use the concept of the threshold value, which defines the probability of
either 0 or 1. Values above the threshold tend to 1, and values below the threshold tend to 0.
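
As an added, hedged sketch (not from the notes), the code below fits scikit-learn's LogisticRegression on the Breast Cancer dataset used earlier; predict_proba returns the sigmoid output and predict applies the default 0.5 threshold. The max_iter value is an assumption made so the solver converges on unscaled data.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

logreg = LogisticRegression(max_iter=10000).fit(X_train, y_train)
print("Test accuracy: {:.2f}".format(logreg.score(X_test, y_test)))
# probability that the first test sample belongs to class 1, before thresholding
print("P(y=1) for the first test sample: {:.2f}".format(logreg.predict_proba(X_test[:1])[0, 1]))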

Assumptions for Logistic Regression:

o The dependent variable must be categorical in nature.


o The independent variable should not have multi-collinearity.
Logistic Regression Equation:
The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:

o We know the equation of the straight line can be written as:
  y = b0 + b1x1 + b2x2 + … + bnxn

o In Logistic Regression y can be between 0 and 1 only, so for this let's divide the above equation
  by (1 − y):
  y/(1 − y); this is 0 for y = 0 and infinity for y = 1

o But we need a range between −[infinity] and +[infinity], so take the logarithm of the equation; it will
  become:
  log[y/(1 − y)] = b0 + b1x1 + b2x2 + … + bnxn

The above equation is the final equation for Logistic Regression.

Type of Logistic Regression:


On the basis of the categories, Logistic Regression can be classified into three types:

o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered
types of the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".
Lasso and Ridge Regression:

Consider a polynomial model of the form Ѳ0 + Ѳ1x + Ѳ2x^2 + Ѳ3x^3 + Ѳ4x^4. The higher-order terms (Ѳ3x^3, Ѳ4x^4)
give the model the flexibility to overfit the training data. Regularization controls this by adding a penalty on the
magnitude of the coefficients to the cost function; lambda (the regularization strength) is the hyperparameter.

Lasso regression uses L1 regularization (a penalty on the absolute values of the coefficients).
Ridge regression uses L2 regularization (a penalty on the squares of the coefficients).


Accuracy of linear regression can be improved using the following three methods:
1. Shrinkage Approach (Regularization)
2. Subset Selection
3. Dimensionality (Variable) Reduction

Shrinkage (Regularization) approach:

By limiting (shrinking) the estimated coefficients, we can try to reduce the variance at the cost of a
negligible increase in bias. This can in turn lead to substantial improvements in the accuracy of the
model.

A few of the variables used in the multiple regression model may in fact not be associated with the overall response;
these are called irrelevant variables, and they may lead to unnecessary complexity in the regression model.

This approach involves fitting a model involving all predictors. However, the estimated coefficients are
shrunken towards zero relative to the least squares estimates. This shrinkage (also known as
regularization) has the effect of reducing the overall variance. Some of the coefficients may also be
estimated to be exactly zero, thereby indirectly performing variable selection.
The two best-known techniques for shrinking the regression coefficients towards zero are

1. ridge regression (L2 regularization)


2. lasso (Least Absolute Shrinkage and Selection Operator) (L1 regularization)

Ridge regression performs L2 regularization, i.e. it adds a penalty equivalent to the square of the
magnitude of the coefficients.

Minimization objective of ridge = LS Obj + α × (sum of squares of the coefficients)

Ridge regression (which includes all k predictors in the final model) is very similar to least squares, except that
the coefficients are estimated by minimizing a slightly different quantity. If k > n, then the least squares
estimates do not even have a unique solution, whereas ridge regression can still perform well by
trading off a small increase in bias for a large decrease in variance. Thus, ridge regression works
best in situations where the least squares estimates have high variance, i.e. overfitting.
One disadvantage with ridge regression is that it will include all k predictors in the final model.
This may not be a problem for prediction accuracy, but it can create a challenge in model interpretation
in settings in which the number of variables k is quite large. Ridge regression will perform better when
the response is a function of many predictors, all with coefficients of roughly equal size.
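
A minimal sketch of ridge regression in scikit-learn (added for illustration, not from the notes); the synthetic dataset with many features relative to the number of samples and the value alpha=1.0 are assumptions chosen to mimic the high-variance setting described above.

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=60, n_features=40, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LinearRegression().fit(X_train, y_train)
ridge = Ridge(alpha=1.0).fit(X_train, y_train)   # alpha controls the penalty strength

print("OLS test R^2:   {:.2f}".format(lr.score(X_test, y_test)))
print("Ridge test R^2: {:.2f}".format(ridge.score(X_test, y_test)))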

Lasso regression performs L1 regularization, i.e. it adds a penalty equivalent to the absolute value
of the magnitude of the coefficients.

Minimization objective of lasso = LS Obj + α × (sum of the absolute values of the coefficients)

The lasso overcomes this disadvantage by forcing some of the coefficients to exactly zero. We can
say that the lasso yields sparse models (involving only a subset of the predictors) that are simpler as well as more
interpretable. The lasso can be expected to perform better in a setting where a relatively small number
of predictors have substantial coefficients, and the remaining predictors have coefficients that are very
small or equal to zero.
Both techniques penalize models based on their complexity, favouring simpler models that are
also better at generalizing.
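
A similar added sketch for the lasso (not from the notes); the synthetic data with only a few informative predictors and the value alpha=1.0 are assumptions, chosen so that many coefficients are driven to exactly zero.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=100, n_features=30, n_informative=5,
                       noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lasso = Lasso(alpha=1.0).fit(X_train, y_train)
print("Test R^2: {:.2f}".format(lasso.score(X_test, y_test)))
print("Number of features used:", np.sum(lasso.coef_ != 0))  # non-zero coefficients: a sparse model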

Naive Bayes Classifier:

Bayes' theorem:

P(A|B) = P(B|A) × P(A) / P(B)

where A and B are conditionally related events and P(A|B) denotes the probability of event A occurring
when event B has already occurred.

Naïve Bayes is a simple technique for building classifiers: models that assign class labels to problem
instances. The basic idea of Bayes rule is that the outcome of a hypothesis can be predicted on the
basis of some evidence (E) that can be observed.
From Bayes rule, we observed that
1. A prior probability of hypothesis h or P(h): This is the probability of an event or hypothesis before the
evidence is observed.
2. A posterior probability of h or P(h|D): This is the probability of an event after the evidence is observed
within the population D.

Posterior Probability is of the format ‘What is the probability that a particular object belongs to
class i given its observed feature values?’
Bayes’ theorem is used when new information can be used to revise previously determined
probabilities.

According to the approach in Bayes' theorem, the classification of the new instance is performed by
assigning the most probable target classification cMAP on the basis of the attribute values of the new
instance {a1, a2, …, an} (MAP: Maximum A Posteriori Hypothesis). So,

cMAP = argmax (cj ∈ C) P(cj | a1, a2, …, an)

which can be rewritten using Bayes' theorem as

cMAP = argmax (cj ∈ C) P(a1, a2, …, an | cj) P(cj) / P(a1, a2, …, an) = argmax (cj ∈ C) P(a1, a2, …, an | cj) P(cj)

We take a learning task where each instance x has some attributes and the target function (f(x)) can take
any value from the finite set of classification values C. We also have a set of training examples for target
function, and the set of attributes {a1, a2 ,…, an } for the new instance are known to us. Our task is to
predict the classification of the new instance.
• It is easy to estimate each of the P(cj) simply by counting the frequency with which each target
value cj occurs in the training data.
• The naive Bayes classifier is based on the simplifying assumption that the attribute values are
conditionally independent given the target value (given the target value, the attributes do not influence one another).
• For a given target value of the instance, the probability of observing the conjunction a1, a2, …, an
is just the product of the probabilities for the individual attributes.
• Naive Bayes classifier: cNB = argmax (cj ∈ C) P(cj) × Π P(ai | cj)

A Naïve Bayes classifier is a simple probabilistic classifier based on applying Bayes'
theorem with strong (naive) independence assumptions between the features. The prior
probabilities in Bayes' theorem that are updated with the help of newly available information are
called posterior probabilities.

A key benefit of the naive Bayes classifier is that it requires only a small amount of training
data to estimate the parameters (the means and variances of the variables) essential for
classification.

Some of the key strengths and weaknesses of Naïve Bayes classifiers are described
Example. Let us assume that we want to predict the outcome of a football world cup match on the
basis of the past performance data of the playing teams. We have training data available (refer Fig. 6.3)
for actual match outcome, while four parameters are considered – Weather Condition (Rainy, Overcast,
or Sunny), how many matches were won by this team out of the last three matches (one match, two
matches, or three matches), Humidity Condition (High or Normal), and whether they won the toss (True
or False). Using Naïve Bayesian, you need to classify the conditions when this team wins and then
predict the probability of this team winning a particular match when Weather Conditions = Rainy, they
won two of the last three matches, Humidity = Normal and they won the toss in the particular match.

FIG. 6.3 Training data for the Naïve Bayesian method

Naïve Bayes classifier steps


Step 1: First construct a frequency table. A frequency table is drawn for each attribute against the
target outcome. For example, in Figure 6.3, the various attributes are (1) Weather Condition, (2) How
many matches won by this team in last three matches, (3) Humidity Condition, and (4) whether they
won the toss and the target outcome is will they win the match or not?
Step 2: Identify the cumulative probability for 'Won match = Yes' and the probability for 'Won match =
No' on the basis of all the attributes. In other words, simply multiply the probabilities of all the favourable
conditions to derive the 'Yes' probability, and multiply the probabilities of all the non-favourable conditions to
derive the 'No' probability.
Step 3: Calculate probability through normalization by applying the below Formula

P(Yes) will give the overall probability of favourable condition in the given scenario.
P(No) will give the overall probability of non-favourable condition in the given scenario.

Solving the above problem with Naive Bayes


Step 1: Construct a frequency table. The posterior probability can be easily derived by constructing a
frequency table for each attribute against the target.
For example, the frequency of the Weather Condition variable with value 'Sunny' when the target value Won
match is 'Yes' is 3/(3 + 4 + 2) = 3/9.
Figure 6.4 shows the frequency table thus constructed.
Step 2:
To predict whether the team will win for given Weather Condition (a1) = Rainy, Wins in last three
matches (a2) = 2 wins, Humidity (a3) = Normal and Win toss (a4) = True, we need to choose 'Yes' from
the above table for the given conditions.
From Bayes' theorem, we get

P(Win match | a1 ∩ a2 ∩ a3 ∩ a4)

This should be compared with

P(!Win match | a1 ∩ a2 ∩ a3 ∩ a4)

FIG. 6.4 Construct frequency table


Step 3: by normalizing the above two probabilities, we can ensure that the sum of these two
probabilities is 1.

Conclusion: This shows that there is 58% probability that the team will win if the above conditions
become true for that particular day. Thus, Naïve Bayes classifier provides a simple yet powerful way to
consider the influence of multiple attributes on the target outcome and refine the uncertainty of the event
on the basis of the prior knowledge because it is able to simplify the calculation through independence
assumption.
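
As an added sketch of the same frequency-table idea in scikit-learn (not from the notes), the code below uses OrdinalEncoder and CategoricalNB on a small hypothetical dataset; the attribute values and labels are illustrative and are not the actual Fig. 6.3 training data.

from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

X_raw = [["Rainy", "2 wins", "Normal", "True"],
         ["Sunny", "3 wins", "High", "False"],
         ["Overcast", "1 win", "Normal", "True"],
         ["Rainy", "1 win", "High", "False"],
         ["Sunny", "2 wins", "Normal", "True"]]
y = ["Yes", "No", "Yes", "No", "Yes"]          # Won match?

enc = OrdinalEncoder()                          # map each category to an integer code
X = enc.fit_transform(X_raw)

clf = CategoricalNB(alpha=1.0)                  # alpha applies Laplace smoothing to the counts
clf.fit(X, y)

new_match = enc.transform([["Rainy", "2 wins", "Normal", "True"]])
print(clf.predict(new_match))                   # predicted class label
print(clf.predict_proba(new_match))             # normalized posterior probabilities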
Applications of Naïve Bayes classifier
➢ Text classification: Naïve Bayes classifier is among the most successful known algorithms for
learning to classify text documents. It assigns a document to the class for which the computed
probability is highest. It uses the above algorithm to combine the word-level
probabilities for classifying a document under a particular 'Title'. It has various applications in
document categorization, language detection, and sentiment detection, which are very useful for
traditional retailers, e-retailors, and other businesses on judging the sentiments of their clients on the
basis of keywords in feedback forms, social media comments, etc.
➢ Spam filtering: Spam filtering is the best known use of Naïve Bayesian text classification.
Presently, almost all the email providers have this as a built-in functionality, which makes use of a
Naïve Bayes classifier to identify spam email on the basis of certain conditions and also the
probability of classifying an email as 'Spam'. Naïve Bayesian spam filtering has become a
mainstream mechanism to distinguish illegitimate spam email from legitimate
email (sometimes called 'ham'). Users can also install separate email filtering programmes.
Server-side email filters such as DSPAM, SpamAssassin, SpamBayes, and ASSP make use of
Bayesian spam filtering techniques, and the functionality is sometimes embedded within the mail
server software itself.
➢ Hybrid Recommender System: It uses Naïve Bayes classifier and collaborative filtering.
Recommender systems (used by e-retailers like eBay, Alibaba, Target, Flipkart, etc.) apply machine
learning and data mining techniques for filtering unseen information and can predict whether a user
would like a given resource. For example, when we log in to these retailer websites, on the basis of
the usage of texts used by the login and the historical data of purchase, it automatically recommends
the product for the particular login persona. One of the algorithms is combining a Naïve Bayes
classification
approach with collaborative filtering, and experimental results show that this algorithm provides better
performance regarding accuracy and coverage than other algorithms.
➢ Online Sentiment Analysis: Online applications use supervised machine learning (Naïve Bayes)
for sentiment analysis. In the case of sentiment analysis, let us assume there are three sentiments, namely
nice, nasty, and neutral, and a Naïve Bayes classifier is used to distinguish between them.
Simple emotion modelling combines a statistically based classifier with a dynamical model. The Naïve
Bayes classifier employs 'single words' and 'word pairs' as features and determines the sentiments of
the users. It allocates user utterances into nice, nasty, and neutral classes, labelled as +1, −1, and 0,
respectively. This binary output drives a simple first-order dynamical system, whose emotional state
represents the simulated emotional state of the experiment’s personification.
Decision tree:

➢ Decision tree learning is used for classification, and it builds a model in the form of a tree
structure.
➢ The goal of decision tree learning is to create a model (based on past training data) that
predicts the value of the output variable based on the input variables in the
feature vector.
➢ Each node (or decision node) of a decision tree corresponds to one of the features in the feature vector.
From every node, there are edges to children, wherein there is an edge for each of the possible
values (or range of values) of the feature associated with the node.
➢ The tree terminates at different leaf nodes (or terminal nodes) where each leaf node
represents a possible value for the output variable.

Fig: Decision tree structure

Each internal node (represented by a box) tests an attribute (represented as 'A'/'B' within the boxes).
Each branch corresponds to an attribute value (T/F in the above case). Each leaf node assigns a
classification. The first node is called the 'Root' node. Here 'A' is the Root node (first node), 'B' is a
Branch node, and 'T' & 'F' are Leaf nodes.
Thus, a decision tree consists of three types of nodes:
• Root Node
• Branch Node
• Leaf Node

• Root Node: The root node is the starting point of a tree. At this point, the first split is performed.
• Internal Nodes: Each internal node represents a decision point (predictor variable) that
eventually leads to the prediction of the outcome.
• Leaf/ Terminal Nodes: Leaf nodes represent the final class of the outcome and therefore they’re
also called terminating nodes.
• Branches: Branches are connections between nodes, they’re represented as arrows. Each branch
represents a response such as yes or no.
Decision tree construction:
1. Data set
2. Approach to select relevant attributes
3. Test Condition
4. Splitting – used to grow the tree
Measures:
1. Entropy
2. Information gain
3. Gini Index

Fig: The basic structure of a Decision Tree


FIG. 7.9 Decision tree example

Fig. shows an example decision tree for a car driving – the decision to be taken is whether to ‘Keep
Going’ or to ‘Stop’, which depends on various situations as depicted in the figure. If the signal is RED in
colour, then the car should be stopped. If there is not enough gas (petrol) in the car, the car should be
stopped at the next available gas station.
Building a decision tree:
Decision trees are built corresponding to the training data following an approach called recursive
partitioning. The approach splits the data into multiple subsets on the basis of the feature values. It
starts from the root node, which is nothing but the entire data set. It first selects the feature which
predicts the target class in the strongest way. The decision tree splits the data set into multiple
partitions, with data in each partition having a distinct value for the feature based on which the
partitioning has happened. This is the first set of branches.
Likewise, the algorithm continues splitting the nodes on the basis of the feature which helps in the best
partition. This continues till a stopping criterion is reached. The usual stopping criteria are –
1. All or most of the examples at a particular node have the same class
2. All features have been used up in the partitioning
3. The tree has grown to a pre-defined threshold limit

FIGURE: A decision tree for the concept PlayTennis. An example is classified by sorting
it through the tree to the appropriate leaf node, then returning the classification associated
with this leaf

Example:
Global Technology Solutions (GTS), a leading provider of IT solutions, is coming to College of
Engineering and Management (CEM) for hiring B.Tech. students. Last year during campus recruitment,
they had shortlisted 18 students for the final interview. Being a company of international repute, they
follow a stringent interview process to select only the best of the students. The information related to
the interview evaluation results of shortlisted students (hiding the names) on the basis of different
evaluation parameters is available for reference in Figure 7.10. Chandra, a student of CEM, wants to find
out if he may be offered a job in GTS. His CGPA is quite high. His self-evaluation on the other
parameters is as follows:

Communication – Bad; Aptitude – High; Programming skills – Bad

FIG. 7.10 Training data for GTS recruitment


Predict whether Chandra will get a job offer or not using the decision tree model.
First, we need to draw the decision tree corresponding to the training data. According to the table, job
offer condition (i.e. the outcome) is FALSE for all the cases where Aptitude = Low, irrespective of other
conditions. So, the feature Aptitude can be taken up as the first node of the decision tree.
Fig: Decision tree based on the training data

There are many implementations of decision tree, the most prominent ones being C5.0, CART
(Classification and Regression Tree), CHAID (Chi-square Automatic Interaction Detector) and ID3
(Iterative Dichotomiser 3) algorithms. The biggest challenge of a decision tree algorithm is to find out
which feature to split upon.
Entropy is a measure of the impurity of a set of examples with respect to the target attribute.
The information gain is calculated on the basis of the decrease in entropy (S) after a data set is split
according to a particular attribute (A).
Constructing a decision tree is all about finding an attribute that returns the highest information gain
(i.e. the most homogeneous branches).

Note: Like information gain, there are other measures like Gini index or chi-square for individual nodes
to decide the feature on the basis of which the split has to be applied. The CART algorithm uses Gini
index, while the CHAID algorithm uses chi-square for deciding the feature for applying split.
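
A minimal added sketch (not from the notes) of training a decision tree in scikit-learn on the Breast Cancer dataset used earlier; criterion="entropy" selects splits by information gain (the library default is "gini"), and max_depth=4 is an assumed pre-defined threshold of the kind listed in the stopping criteria above.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
tree.fit(X_train, y_train)

print("Training accuracy: {:.2f}".format(tree.score(X_train, y_train)))
print("Test accuracy:     {:.2f}".format(tree.score(X_test, y_test)))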

a. Entropy of a decision tree:


Entropy(S), measuring the impurity of S, is defined as

Entropy(S) = − Σ (i = 1 to c) pi log2(pi)

where S is the sample set of training examples, c is the number of different class labels, and pi refers to
the proportion of values falling into the i-th class label.
ENTROPY MEASURES HOMOGENEITY OF EXAMPLES:
• To define information gain, we begin by defining a measure called entropy.
Entropy measures the impurity of a collection of examples.
• Given a collection S, containing positive and negative examples of some target
concept, the entropy of S relative to this Boolean classification is

Entropy(S) = − p+ log2(p+) − p- log2(p-)

Where,
p+ is the proportion of positive examples in S
p- is the proportion of negative examples in S.
Entropy measures the impurity or uncertainty present in the data. It is used to decide how a
Decision Tree can split the data.

• The entropy is 0 if all members of S belong to the same class

•The entropy is 1 when the collection contains an equal number of positive and negative
examples
• If the collection contains unequal numbers of positive and negative examples, the
entropy is between 0 and 1
Example: Entropy
• Suppose S is a collection of 14 examples of some boolean concept, including 9 positive and 5
negative examples. Then the entropy of S relative to this boolean classification is

Entropy(S) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

Ex:
Target class: ‘JobOffered?’—Yes and No
The value of pi for class value 'Yes' is 0.44 (i.e. 8/18) and that for class value 'No' is 0.56 (i.e. 10/18). So,
we can calculate the entropy as
Entropy(S) = −0.44 log2(0.44) − 0.56 log2(0.56) = 0.99
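
The added sketch below (not from the notes) computes these entropy values directly from the class proportions using the formula above.

import math

def entropy(proportions):
    # Entropy(S) = -sum(pi * log2(pi)) over the class proportions
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(round(entropy([9/14, 5/14]), 3))    # 0.94  (boolean example above)
print(round(entropy([8/18, 10/18]), 2))   # 0.99  (JobOffered? example)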

b. Information gain of a decision tree:

The information gain is calculated on the basis of the decrease in entropy, Entropy(S), after a data set S is split
according to a particular attribute A.
Constructing a decision tree is all about finding an attribute that returns the highest information gain
(i.e. the most homogeneous branches). If the information gain is 0, it means that there is no reduction in
entropy due to split of the data set according to that particular feature. On the other hand, the
maximum amount of information gain which may happen is the entropy of the data set before the split.
• Information gain is the expected reduction in entropy caused by partitioning the examples
according to a given attribute.
• The information gain, Gain(S, A), of an attribute A relative to a collection of examples S, is defined as

Gain(S, A) = Entropy(S) - Σ (v ∈ Values(A)) (|S_v| / |S|) Entropy(S_v)

where S_v is the subset of S for which attribute A has value v. Equivalently, the information gain for a particular feature A is the difference between the entropy before the split (S_bs) and the weighted average entropy after the split (S_as):

Information Gain(S, A) = Entropy(S_bs) - Entropy(S_as)
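
As a rough Python sketch of the same idea (reusing the entropy helper from the earlier sketch; a split is described by the class counts of each resulting branch, so the 'after split' entropy is the weighted average of the branch entropies; the example split below is hypothetical):

def information_gain(parent_counts, branch_counts_list):
    """Gain(S, A) = Entropy(S) minus the weighted average entropy of the branches after splitting on A."""
    total = sum(parent_counts)
    entropy_before = entropy(parent_counts)
    entropy_after = sum(
        (sum(branch) / total) * entropy(branch)
        for branch in branch_counts_list
    )
    return entropy_before - entropy_after

# Hypothetical attribute that splits [8 Yes, 10 No] into branches [8 Yes, 2 No] and [0 Yes, 8 No]
print(round(information_gain([8, 10], [[8, 2], [0, 8]]), 2))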

We will find the value of entropy at the beginning before any split happens and then again after the
split happens. We will compare the values for all the cases –
1. when the feature ‘CGPA’ is used for the split
2. when the feature ‘Communication’ is used for the split
3. when the feature ‘Aptitude’ is used for the split
4. when the feature ‘Programming Skills’ is used for the split

As calculated, the entropy of the data set before the split (i.e. Entropy(S_bs)) = 0.99, and the entropy of the data set
after the split (i.e. Entropy(S_as)) is
0.69 when the feature ‘CGPA’ is used for split
0.63 when the feature ‘Communication’ is used for split
0.52 when the feature ‘Aptitude’ is used for split
0.95 when the feature ‘Programming skill’ is used for split
Fig: Entropy and information gain calculation (Level 1)
Therefore, the information gain from the feature ‘CGPA’ = 0.99 − 0.69 = 0.3, whereas the information
gain from the feature ‘Communication’ = 0.99 − 0.63 = 0.36. Likewise, the information gain for
‘Aptitude’ and ‘Programming skills’ is 0.47 and 0.04, respectively.

Hence, it is quite evident that among all the features, ‘Aptitude’ results in the best information gain
when adopted for the split. So, ‘Aptitude’ will be the first node of the decision tree formed.

One important point to be noted here is that for Aptitude = Low, entropy is 0, which indicates that the
branch towards Aptitude = Low will not continue any further.

As a part of level 2, there is only one branch to navigate in this case – the one for Aptitude = High.

The entropy value is as follows:


0.85 before the split
0.33 when the feature ‘CGPA’ is used for split
0.30 when the feature ‘Communication’ is used for split
0.80 when the feature ‘Programming skill’ is used for split
Fig: Entropy and information gain calculation (Level 2)
Since the feature 'Communication' yields the highest information gain at this level (0.85 - 0.30 = 0.55), it is chosen as the next split node. As a part of level 3, we will thus have only one branch to navigate in this case – the one for
Communication = Bad.
The entropy value is as follows:
0.81 before the split
0 when the feature ‘CGPA’ is used for split
0.50 when the feature ‘Programming Skill’ is used for split

Fig: Entropy and information gain calculation (Level 3)


Hence, the information gain after the split with the feature 'CGPA' is 0.81 - 0 = 0.81, which is the maximum possible
information gain (as the entropy before the split was 0.81). A split is therefore applied on the
basis of the value of 'CGPA'. Because the maximum information gain is already achieved and the resulting branches are pure, the tree will
not continue any further.
Algorithm for decision tree:
Input: Training data set, test data set (or data points)
Steps:
Initialize E_min to a large value
Do for all attributes F_i
    Calculate the entropy E_i of the data set after splitting on the attribute F_i
    if E_i < E_min
        then E_min = E_i and F_min = F_i
    end if
End do
Draw a decision tree node containing the attribute F_min and split the data set into subsets using F_min
Repeat the above steps on each subset until the full tree is drawn covering all the attributes of the original table.
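
As a rough scikit-learn sketch of the same workflow (the few rows below are an illustrative stand-in, not the actual Figure 7.10 table, which is not reproduced here; replace them with the real data to recreate the tree derived above):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Illustrative stand-in rows with the same columns as Figure 7.10
train_df = pd.DataFrame([
    {"CGPA": "High", "Communication": "Good", "Aptitude": "High", "Programming Skills": "Good", "Job Offered": "Yes"},
    {"CGPA": "High", "Communication": "Bad", "Aptitude": "Low", "Programming Skills": "Good", "Job Offered": "No"},
    {"CGPA": "Medium", "Communication": "Bad", "Aptitude": "High", "Programming Skills": "Bad", "Job Offered": "No"},
    {"CGPA": "High", "Communication": "Bad", "Aptitude": "High", "Programming Skills": "Bad", "Job Offered": "Yes"},
])

X = pd.get_dummies(train_df.drop(columns=["Job Offered"]))   # one-hot encode the categorical features
y = train_df["Job Offered"]

# criterion='entropy' makes scikit-learn grow the tree using the information gain measure discussed above
model = DecisionTreeClassifier(criterion="entropy", random_state=0)
model.fit(X, y)

# Chandra's self-evaluation, encoded with the same columns as the training data
chandra = pd.DataFrame([{"CGPA": "High", "Communication": "Bad",
                         "Aptitude": "High", "Programming Skills": "Bad"}])
chandra_encoded = pd.get_dummies(chandra).reindex(columns=X.columns, fill_value=0)
print(model.predict(chandra_encoded))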

Avoiding overfitting in decision tree – pruning


The decision tree algorithm, unless a stopping criterion is applied, may keep growing indefinitely –
splitting for every feature and dividing into smaller partitions till the point that the data is perfectly
classified. This, as is quite evident, results in overfitting problem. To prevent a decision tree getting
overfitted to the training data, pruning of the decision tree is essential. Pruning a decision tree reduces
the size of the tree such that the model is more generalized and can classify unknown and unlabelled
data in a better way.
There are two approaches of pruning:
Pre-pruning: Stop growing the tree before it reaches perfection.
Post-pruning: Allow the tree to grow entirely and then post-prune some of the branches from it.
➢ In the case of pre-pruning, the tree is stopped from growing further once it reaches a certain
number of decision nodes or decisions. Hence, in this strategy, the algorithm avoids overfitting
as well as optimizes computational cost. However, it also stands a chance of ignoring important
information contributed by a feature that was skipped, thereby missing certain patterns in the data.
➢ On the other hand, in the case of post-pruning, the tree is allowed to grow to the full extent.
Then, by using certain pruning criterion, e.g. error rates at the nodes, the size of the tree is
reduced. This is a more effective approach in terms of classification accuracy as it considers all
minute information available from the training data. However, the computational cost is
obviously more than that of pre-pruning.
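
As a rough scikit-learn sketch of the two strategies (assuming X and y are the encoded features and labels from the earlier example; the parameter values are illustrative, not tuned):

from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: stop the tree from growing further by limiting its depth and leaf size
pre_pruned = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                                    min_samples_leaf=2, random_state=0)
pre_pruned.fit(X, y)

# Post-pruning: grow the full tree, then prune it back using minimal cost-complexity pruning
path = DecisionTreeClassifier(criterion="entropy", random_state=0).cost_complexity_pruning_path(X, y)
alpha = path.ccp_alphas[-2] if len(path.ccp_alphas) > 1 else 0.0  # a non-trivial pruning level
post_pruned = DecisionTreeClassifier(criterion="entropy", ccp_alpha=alpha, random_state=0)
post_pruned.fit(X, y)

print(pre_pruned.get_depth(), post_pruned.get_depth())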

Strengths of decision tree


➢ It produces very simple understandable rules. For smaller trees, not much mathematical and
computational knowledge is required to understand this model.
➢ Works well for most of the problems.
➢ It can handle both numerical and categorical variables.
➢ Can work well both with small and large training data sets.
➢ Decision trees provide a definite clue of which features are more useful for classification.
Weaknesses of decision tree:
➢ Decision tree models are often biased towards features having a larger number of possible
values, i.e. levels.
➢ This model gets overfitted or underfitted quite easily.
➢ Decision trees are prone to errors in classification problems with many classes and relatively
small number of training examples.
➢ A decision tree can be computationally expensive to train.
➢ Large trees are complex to understand.

Application of decision tree:


➢ Marketing
➢ Retention of Customers
➢ Diagnosis of Diseases and Ailments
➢ Detection of Frauds
A decision tree can be used even for instances with missing attribute values and instances with
errors in the class labels or in the attribute values describing those examples;
such instances are handled well by decision trees, thereby making them a robust learning
method.

Ex: for Decision tree


SVM for non-separable data:
This uses two concepts –

1. Kernel trick: It utilizes the existing features, applies some transformations and creates new features. These new
features are the key to finding a nonlinear decision boundary. The two most popular kernels are:
a) Polynomial kernel
b) Radial Basis Function (RBF) kernel
2. Soft margin:
It is used for both linearly and non-linearly separable data. By applying a soft margin, SVM tolerates a few data points being
misclassified and tries to balance the trade-off between finding a line that maximizes the margin and
minimizes the misclassification.
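
A tiny numeric illustration of the idea behind the kernel trick is given below (a hand-written feature map on made-up 1-D data, not an actual kernel computation): points that cannot be split by a single threshold on a line become linearly separable once a squared feature is added.

import numpy as np

# 1-D data: class 0 sits in the middle, class 1 at both ends -- no single threshold separates them
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([1, 1, 0, 0, 0, 1, 1])

# Transform each point x into the pair (x, x**2); in this 2-D space the classes are
# separable by the horizontal line x2 = 2 (class 1 above it, class 0 below it)
phi = np.column_stack([x, x ** 2])
predicted = (phi[:, 1] > 2).astype(int)
print(np.array_equal(predicted, y))   # True: the lifted data is linearly separable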
Support Vector Machines:

It is useful for solving both regression and classification problems.


➢ SVM is a model, which can do linear classification as well as regression.
➢ SVM is based on the concept of a surface, called a hyperplane, which draws a boundary
between data instances plotted in the multi-dimensional feature space.
➢ The output prediction of an SVM is one of two conceivable classes which are already defined in
the training data.
➢ In summary, the SVM algorithm builds an N-dimensional hyperplane model that assigns
future instances into one of the two possible output classes.

Classification using hyperplanes:


➢ In SVM, a model is built to discriminate the data instances belonging to different classes.
➢ When mapped in a two-dimensional space, the data instances belonging to different classes
fall in different sides of a straight line drawn in the two-dimensional space.
➢ In a multidimensional feature space, the straight line dividing data instances belonging to different
classes becomes a hyperplane.

Fig: Linearly separable data instances

Thus, an SVM model is a representation of the input instances as points in the feature space,
which are mapped so that an apparent gap between them divides the instances of the separate classes.
In other words, the goal of the SVM analysis is to find a plane, or rather a hyperplane, which
separates the instances on the basis of their classes. New examples (i.e. new instances) are then
mapped into that same space and predicted to belong to a class on the basis of which side of the gap
the new instance will fall on. In summary, in the overall training process, the SVM algorithm analyses
input data and identifies a surface in the multi-dimensional feature space called the hyperplane.
There may be many possible hyperplanes, and one of the challenges with the SVM model is to
find the optimal hyperplane.
In the first figure, the data is linearly separable. However, a hyperplane such as D1 may give more errors with respect to the test data
and hence not a generalized model.
The aim is to make the distance to the marginal planes (i.e. the margin) as high as possible.
Generalization error in terms of SVM is the measure of how accurately and precisely this SVM
model can predict values for previously unseen data (new data). A hard margin in terms of SVM
means that an SVM model is inflexible in classification and tries to work exceptionally fit in the
training set, thereby causing overfitting.

Support Vectors: Support vectors are the data points (representing the classes) that are a critical component of a
data set, lying nearest to the identified hyperplane. If the support vectors are removed, the position of the
dividing hyperplane will be altered.

Hyperplane and Margin: For an N-dimensional feature space, hyperplane is a flat subspace of
dimension (N−1) that separates and classifies a set of data. For example, if we consider a two-
dimensional feature space (which is nothing but a data set having two features and a class variable), a
hyperplane will be a one-dimensional subspace or a straight line. In the same way, for a three-
dimensional feature space (data set having three features and a class variable), hyperplane is a two-
dimensional subspace or a simple plane.

Mathematically, in a two-dimensional space, hyperplane can be defined by the equation:


c0+ c1 X1 + c2 X2 = 0, which is nothing but an equation of a straight line.

For N-dimensional space, hyperplane can be defined by the equation:


c0 + c1 X1 + c2 X2 + … + cn Xn = 0, which, in short, can be represented as Σ (i = 1 to n) ci Xi + c0 = 0.

The farther from the hyperplane the data points lie, the more confident we can be about their correct
categorization. So, when a new test data point/data set is added, the side of the hyperplane it lands
on will decide the class that we assign to it. The distance between the hyperplane and the nearest data points is
known as the margin.
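
A small numeric sketch of this decision rule follows (the coefficient values and test points are made up purely for illustration):

import numpy as np

# Hypothetical 2-D hyperplane c0 + c1*X1 + c2*X2 = 0, with c0 = -4, c1 = 1, c2 = 2
c0 = -4.0
c = np.array([1.0, 2.0])

def classify(point):
    """Assign a class label based on which side of the hyperplane the point lands on."""
    score = c0 + np.dot(c, point)
    return "+1" if score >= 0 else "-1"

print(classify(np.array([3.0, 2.0])))   # score = 3.0  -> class +1
print(classify(np.array([1.0, 0.5])))   # score = -2.0 -> class -1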

Identifying the correct hyperplane in SVM:


Scenario1:
In this scenario, we have three hyperplanes: A, B, and C. Now, we need to identify the correct
hyperplane which better segregates the two classes represented by the triangles and circles. As we can
see, hyperplane ‘A’ has performed this task quite well.

Fig: Support vector machine: Scenario 1

Scenario 2:
We have three hyperplanes: A, B, and C. We have to identify the correct hyperplane which classifies the
triangles and circles in the best possible way.
Here, maximizing the distance between the nearest data points of both classes and the
hyperplane will help us decide the correct hyperplane. This distance is called the margin.

Fig: SVM Scenario2


Here, the margin for hyperplane A is high as compared to those for both B and C. Hence,
hyperplane A is the correct hyperplane. Another quick reason for selecting the hyperplane with
higher margin (distance) is robustness. If we select a hyperplane having a lower margin (distance),
then there is a high probability of misclassification.

Scenario 3:
Here SVM selects the hyperplane which classifies the classes accurately before maximizing the margin.
Here, hyperplane B has a classification error, and A has classified all data instances correctly. Therefore,
A is the correct hyperplane.

Fig: Support vector machine: Scenario 3

Scenario 4:
In this scenario, as shown in Figure 7.19a, it is not possible to distinctly segregate the two classes by
using a straight line, as one data instance belonging to one of the classes (triangle) lies in the territory
of the other class (circle) as an outlier.
One triangle at the other end is like an outlier for the triangle class. SVM has a feature to ignore
outliers and find the hyperplane that has the maximum margin (hyperplane A, as shown in Fig.
7.19b). Hence, we can say that SVM is robust to outliers.
Fig: 7.19 Support vector machine: Scenario 4

So, by summarizing the observations from the different scenarios, we can say that
1. The hyperplane should segregate the data instances belonging to the two classes in the best
possible way.
2. It should maximize the distances between the nearest data points of both the classes, i.e.
maximize the margin.
3. If there is a need to prioritize between a higher margin and fewer misclassifications, the
hyperplane should first try to reduce misclassification (and then maximize the margin).

Identify a hyperplane which maximizes the margin.

Maximum margin hyperplane:


Finding the Maximum Margin Hyperplane (MMH) is nothing but identifying the hyperplane which
has the largest separation with the data instances of the two classes.
It helps us achieve better generalization and hence fewer issues in the classification
of unknown data.

FIG. 7.20 Support vectors

➢ Support vectors, as can be observed in Figure 7.20, are data instances from the two classes
which are closest to the MMH.
➢ There should be at least one support vector from each class. The identification of support
vectors requires intense mathematical formulation.
➢ Modelling a problem using SVM is nothing but identifying the support vectors and MMH
corresponding to the problem space.
Identifying the MMH for linearly separable data
➢ In this case, an outer boundary needs to be drawn for the data instances belonging to the
different classes. These outer boundaries are known as convex hulls.
Fig: Drawing the MMH for linearly separable data

The hyperplane equation in the N-dimensional feature space is c0 + c1 X1 + c2 X2 + … + cn Xn = 0 (in vector form, c · X + c0 = 0, where c is the vector of coefficients).

Using this equation, the objective is to find a set of values for the vector c such that two parallel hyperplanes,
represented by the equations

c · X + c0 = +1 and c · X + c0 = -1,

can be specified. This is to ensure that all the data instances that belong to one class fall above one hyperplane and all
the data instances belonging to the other class fall below the other hyperplane.
According to vector geometry, the distance between these planes is 2/||c||. So, in order to maximize
the distance between the hyperplanes, the value of ||c|| (the norm of the coefficient vector) should be minimized.

So, in summary, the task of SVM is to solve the optimization problem:
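
A standard way to write this optimization problem (with c the coefficient vector, c0 the intercept and y_i in {+1, -1} the class labels) is, in LaTeX:

\min_{c,\;c_0} \; \frac{1}{2}\lVert c \rVert^2
\quad \text{subject to} \quad
y_i \left( c \cdot X_i + c_0 \right) \ge 1 \;\; \text{for every training instance } i.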


Identifying the MMH for non-linearly separable data.

For non-linearly separable data we have to use a slack variable ξ, which provides some soft margin
for data instances in one class that fall on the wrong side of the hyperplane. A cost value ‘C’ is
imposed on all such data instances that fall on the wrong side of the hyperplane. The task of SVM
is now to minimize the total cost due to such data instances in order to solve the revised

optimization problem:
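
A standard way to write this revised optimization problem, using the slack variables \xi_i and the cost C described above, is:

\min_{c,\;c_0,\;\xi} \; \frac{1}{2}\lVert c \rVert^2 + C \sum_{i} \xi_i
\quad \text{subject to} \quad
y_i \left( c \cdot X_i + c_0 \right) \ge 1 - \xi_i, \;\; \xi_i \ge 0.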
Fig: Drawing the MMH for non-linearly separable data

Kernel trick:
➢ One way to deal with nonlinearly separable data is by using a slack variable and an
optimization function to minimize the cost value.
➢ SVM has a technique called the kernel trick to deal with non-linearly separable data.
➢ Kernels are functions which can transform a lower-dimensional input space into a higher-dimensional
space. In the process, they convert linearly non-separable data into linearly
separable data.
Fig: Kernel trick in SVM

Different SVM implementations are as follows:


Linear kernel: It is in the form K(xi, xj) = xi · xj (the simple dot product of the two feature vectors).

Polynomial kernel: It is in the form K(xi, xj) = (xi · xj + 1)^d, where d is the degree of the polynomial.

Sigmoid kernel: It is in the form K(xi, xj) = tanh(k(xi · xj) + c), where k and c are kernel parameters.

Gaussian RBF kernel: It is in the form K(xi, xj) = exp(-||xi - xj||^2 / (2σ^2)), where σ controls the width of the Gaussian.

When the data instances of the classes are closer to each other, the kernel method can be used. The
effectiveness of SVM depends both on the:
• selection of the kernel function
• adoption of values for the kernel parameters
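
A brief scikit-learn sketch of how the kernel function and its parameters are chosen in practice is given below (the bundled iris data set is used purely for illustration, and the C and gamma values are defaults rather than tuned choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Compare different kernel functions; C and gamma are the main kernel-related parameters to set
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, C=1.0, gamma="scale")
    clf.fit(X_train, y_train)
    print(kernel, round(clf.score(X_test, y_test), 3))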

Strengths of SVM
• SVM can be used for both classification and regression.
• It is robust, i.e. not much impacted by data with noise or outliers.
• The prediction results using this model are very promising.
Weaknesses of SVM
• The basic SVM formulation supports only binary classification, i.e. when there are only two classes in the
problem domain; multiclass problems have to be handled indirectly (e.g. with one-vs-one or one-vs-rest strategies).
• The SVM model is very complex – almost like a black box when it deals with a high-dimensional
data set. Hence, it is very difficult and close to impossible to understand the model in such
cases.
• It is slow for a large dataset, i.e. a data set with either a large number of features or a large
number of instances.
• It is quite memory-intensive.
Applications of SVM:
➢ One area where SVM can be applied is the field of bioinformatics – more specifically, in detecting cancer and
other genetic disorders.
➢ It can also be used in detecting the image of a face by binary classification of images into face
and non-face components.

SVM is also an example of a linear classifier and a maximum margin classifier.


When the classes are closer to each other, the kernel method can be used.

The difference between a hard margin and a soft margin in SVMs lies in the separability of
the data. If our data is linearly separable, we go for a hard margin.

Sometimes, the data is linearly separable, but the margin is so small that the model
becomes prone to overfitting or being too sensitive to outliers. In this case, we can opt
for a larger margin by using a soft margin SVM in order to help the model generalize better.
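
In scikit-learn, this trade-off is controlled by the cost parameter C of SVC: a very large C behaves close to a hard margin (little tolerance for misclassified points), while a smaller C gives a softer margin that often generalizes better. A minimal sketch on the iris data (the values of C are chosen only for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for C in [0.01, 1.0, 1000.0]:   # small C -> softer margin; very large C -> close to a hard margin
    clf = SVC(kernel="linear", C=C)
    clf.fit(X_train, y_train)
    print("C =", C,
          "train accuracy =", round(clf.score(X_train, y_train), 3),
          "test accuracy =", round(clf.score(X_test, y_test), 3))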
