
Unit – III

Chapter - 1

Probabilistic Models in Machine Learning


Introduction

Probabilistic modelling in machine learning applies the principles of statistics to data analysis. It was one of
the earliest approaches to machine learning and is still widely used today. One of the best-known algorithms in
this family is the Naive Bayes algorithm.

Probabilistic modelling provides a framework for understanding what learning is. The probabilistic framework describes
how to represent and manipulate uncertainty about models. Predictions play a central role in scientific data analysis, and
they are just as important in machine learning, automation, cognitive computing and artificial intelligence.

Description

Probabilistic models offer a powerful idiom for describing the world: random variables serve as building blocks that are
tied together by probabilistic relationships.

Machine learning includes both probabilistic and non-probabilistic models. Familiarity with basic concepts of probability,
for example random variables and probability distributions, is helpful for gaining a good understanding of probabilistic
models.

Drawing inferences from noisy or ambiguous data is an essential part of intelligent systems. Within probability theory,
Bayes' theorem in particular serves as a principled framework for combining prior knowledge with empirical evidence.

Importance of probabilistic ML models

One of the key benefits of probabilistic models is that they convey the uncertainty associated with predictions, so we
can gauge how confident a machine learning model is in its prediction. For example, if a probabilistic classifier
assigns a probability of 0.9 to the 'Dog' class instead of 0.6, the classifier is more confident that the animal
in the image is a dog. These notions of uncertainty and confidence are very valuable in critical machine learning
applications such as disease diagnosis and autonomous driving. Moreover, probabilistic outputs are useful for many
methods related to machine learning, for instance Active Learning.

Bayesian Inference

At the centre of Bayesian inference is Bayes' rule, sometimes called Bayes' theorem. It is used to compute the
probability of a hypothesis given prior knowledge, and it rests on conditional probability.

The formula for Bayes' theorem is:

P (hypothesis│data) = P (data│hypothesis) P (hypothesis) / P (data)

 Bayes' rule states how to do inference about hypotheses from data.
 Learning and prediction may be understood as forms of inference.
Typical Bayesian inference with Bayes' rule requires a mechanism for directly characterising the target posterior
distribution. The inference process is a one-way procedure that maps the prior distribution to the posterior by
observing empirical data. In supervised learning and reinforcement learning, the final goal is to apply the posterior to
a learning task, evaluated with some performance measure, for instance prediction error or expected reward.
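
To make the prior-to-posterior mapping concrete, here is a minimal Python sketch of Bayes' rule for a two-hypothesis problem; the prior and likelihood numbers are illustrative assumptions, not values from the text.

# Minimal sketch of Bayes' rule: P(hypothesis | data) is proportional to
# P(data | hypothesis) * P(hypothesis). The numbers below are made up for
# illustration (a rare "disease" hypothesis and a fairly reliable test).

prior = {"disease": 0.01, "healthy": 0.99}        # P(hypothesis)
likelihood = {"disease": 0.95, "healthy": 0.05}   # P(positive test | hypothesis)

# P(data): total probability of observing a positive test under either hypothesis
evidence = sum(prior[h] * likelihood[h] for h in prior)

# Posterior: prior knowledge combined with the new evidence
posterior = {h: prior[h] * likelihood[h] / evidence for h in prior}
print(posterior)   # e.g. {'disease': ~0.16, 'healthy': ~0.84}

Even with a 95% sensitive test, the posterior probability of disease stays modest because the prior is small, which is exactly the kind of uncertainty information a probabilistic model exposes.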

A good posterior distribution should yield a small prediction error or a large expected reward. Furthermore, as
large-scale knowledge bases are built and crowd-sourcing platforms are widely adopted for gathering human data, outside
information increasingly needs to be incorporated into statistical modelling and inference when building an intelligent system.
Naive Bayes algorithm

The Naïve Bayes algorithm is a supervised learning algorithm. It is based on Bayes' theorem and is used for solving
classification problems. It is chiefly used in text classification, which involves high-dimensional training data. Naïve
Bayes is one of the simplest and most effective classification algorithms; it supports the construction of fast
machine learning models that can make rapid predictions.

The naive Bayes algorithm is a probabilistic classifier, which means that it predicts on the basis of the probability of an object.
Some common applications of the Naïve Bayes algorithm are:

 Spam filtering
 Sentiment analysis
 Classifying articles
A closely related model is logistic regression, which is sometimes considered the "hello world" of
modern machine learning. Don't be misled by its name: logistic regression is a classification algorithm rather than a regression
algorithm. Much like Naive Bayes, it is still quite useful today even though it predates modern computing by a
long time, thanks to its simple and versatile nature. It is frequently the first thing a data scientist tries on a
dataset to get a feel for the classification task at hand.

Types of Naïve Bayes Model

There are the following three types of Naive Bayes model (a brief code sketch follows the list):

 Gaussian: The Gaussian model assumes that the features follow a normal distribution. This means that if the
features take continuous values rather than discrete ones, the model assumes these values are sampled from a
Gaussian distribution.
 Multinomial: It is used when the data are multinomially distributed. It is mainly used for document classification
problems, i.e. deciding which category a document belongs to, such as Sports, Education or Politics. The
classifier uses word frequencies as the predictors.
 Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are
independent Boolean variables, for example whether a particular word is present in a document or not. This
model is also well known for document classification tasks.
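
As a rough illustration, the sketch below shows how the three variants might be used with scikit-learn (assuming it is installed); the tiny arrays are placeholder data, not a real dataset.

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])                       # two classes, four toy samples

# Gaussian NB: continuous features assumed normally distributed within each class
X_cont = np.array([[1.8, 60.2], [1.7, 55.0], [1.6, 80.5], [1.5, 75.3]])
print(GaussianNB().fit(X_cont, y).predict([[1.75, 58.0]]))

# Multinomial NB: word-count features, as in document classification
X_counts = np.array([[2, 0, 1], [3, 1, 0], [0, 4, 2], [1, 3, 3]])
print(MultinomialNB().fit(X_counts, y).predict([[2, 1, 0]]))

# Bernoulli NB: binary features, e.g. whether each word occurs at all
X_binary = (X_counts > 0).astype(int)
print(BernoulliNB().fit(X_binary, y).predict([[1, 1, 0]]))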

Uses of Naïve Bayes Model

The Naïve Bayes classifier is used:

 For credit scoring.
 In medical data classification.
 In real-time predictions, since the Naïve Bayes classifier is an eager learner.
 In text classification, for example spam filtering and sentiment analysis.
Pros and Cons of the Naïve Bayes Classifier

Pros

 Naïve Bayes is one of the simplest and fastest machine learning algorithms for predicting the class of a dataset.
 It can be used for binary as well as multi-class classification.
 It performs well in multi-class prediction compared with many other algorithms.
 It is the most popular choice for text classification problems.

Cons

 Naive Bayes assumes that all features are independent or unrelated, so it cannot learn relationships between
features.

Probability Density: Assume a random variable x that has a probability distribution p(x). The relationship between the
outcomes of a random variable and its probability is referred to as the probability density.
The problem is that we don’t always know the full probability distribution for a random variable. This is because we only
use a small subset of observations to derive the outcome. This problem is referred to as Probability Density Estimation as
we use only a random sample of observations to find the general density of the whole sample space.

Probability Density Function (PDF)

A PDF is a function that gives the probability of a random variable from a sub-sample space falling within a particular
range of values, rather than taking just one value. It tells the likelihood that the range of values in the random-variable
sub-space matches that of the whole sample. By definition, if X is any continuous random variable, then the function f(x) is
called a probability density function if:

P(a ≤ X ≤ b) = ∫ f(x) dx over [a, b],  with f(x) ≥ 0 for all x

where,

a -> lower limit

b -> upper limit

X -> continuous random variable

f(x) -> probability density function
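
As a quick sanity check of this definition, the sketch below numerically integrates the standard normal PDF between two limits (scipy is assumed to be available; the limits are arbitrary).

from scipy.integrate import quad
from scipy.stats import norm

a, b = -1.0, 1.0                       # lower and upper limits
prob, _ = quad(norm.pdf, a, b)         # P(a <= X <= b) = integral of f(x) over [a, b]
print(prob)                            # roughly 0.683 for the standard normal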

Steps Involved:

Step 1 - Create a histogram for the random set of observations to understand the density of the random sample.

Step 2 - Create the probability density function and fit it on the random sample. Observe how it fits the histogram plot.

Step 3 - Now iterate steps 1 and 2 in the following manner:

- Calculate the distribution parameters.
- Calculate the PDF for the random sample distribution.
- Observe the resulting PDF against the data.
- Transform the data until it best fits the distribution.

After fitting, the histogram of each random sample should closely match the histogram plot of the whole
population.

Density Estimation: It is the process of finding out the density of the whole population by examining a random
sample of data from that population. One of the best ways to achieve a density estimate is by using a histogram
plot.

Parametric Density Estimation

A normal distribution has two given parameters, mean and standard deviation. We calculate the sample mean
and standard deviation of the random sample taken from this population to estimate the density of the random
sample. The reason it is termed as ‘parametric’ is due to the fact that the relation between the observations and
its probability can be different based on the values of the two parameters.
Now, it is important to understand that the mean and standard deviation of this random sample is not going to
be the same as that of the whole population due to its small size. A sample plot for parametric density
estimation is shown below.

PDF fitted over histogram plot with one peak value
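
A minimal sketch of this parametric procedure in Python, assuming numpy, scipy and matplotlib are available; the synthetic sample stands in for a random sample drawn from the population.

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=5, size=1000)   # random sample from the "population"

mu, sigma = norm.fit(sample)                      # estimate the two parameters
xs = np.linspace(sample.min(), sample.max(), 200)

plt.hist(sample, bins=30, density=True)           # Step 1: histogram of the sample
plt.plot(xs, norm.pdf(xs, mu, sigma))             # Step 2: fitted PDF over the histogram
plt.show()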

Nonparametric Density Estimation

In some cases, the PDF may not fit the random sample because the sample does not follow a normal distribution (i.e. instead of one peak
there are multiple peaks in the graph). Here, instead of using distribution parameters like the mean and standard deviation, a
particular algorithm is used to estimate the probability distribution. Hence this is known as 'nonparametric density
estimation'.

One of the most common nonparametric approaches is Kernel Density Estimation (KDE). Here the objective is to
calculate the unknown density fh(x) using the equation below:

fh(x) = (1/n) Σ (i = 1 to n) Kh(x − xi) = (1/(n·h)) Σ (i = 1 to n) K((x − xi) / h)

where,

K -> kernel (non-negative function)

h -> bandwidth (smoothing parameter, h > 0)

Kh -> scaled kernel

fh(x) -> density (to calculate)

n -> no. of samples in random sample.

A sample plot for nonparametric density estimation is given below.
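
A minimal sketch of kernel density estimation on a two-peak sample, assuming scipy is available; the bandwidth value is an arbitrary illustrative choice.

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
# synthetic bimodal sample: a single normal PDF would not fit this well
sample = np.concatenate([rng.normal(20, 3, 500), rng.normal(40, 3, 500)])

kde = gaussian_kde(sample, bw_method=0.3)   # bw_method plays the role of the bandwidth h
xs = np.linspace(sample.min(), sample.max(), 200)
density = kde(xs)                           # estimated density fh(x) on a grid of points
print(xs[density.argmax()])                 # location of the highest estimated peak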
Naive Bayes Classifiers

Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. It is not a single algorithm
but a family of algorithms where all of them share a common principle, i.e. every pair of features being classified is
independent of each other.

To start with, let us consider a dataset.

Consider a fictional dataset that describes the weather conditions for playing a game of golf. Given the weather
conditions, each tuple classifies the conditions as fit(“Yes”) or unfit(“No”) for playing golf.

Here is a tabular representation of our dataset.

Outlook Temperature Humidity Windy Play Golf

Rainy Hot High False No

Rainy Hot High True No

Overcast Hot High False Yes

Sunny Mild High False Yes

Sunny Cool Normal False Yes

Sunny Cool Normal True No

Overcast Cool Normal True Yes


Rainy Mild High False No

Rainy Cool Normal False Yes

Sunny Mild Normal False Yes

Rainy Mild Normal True Yes

Overcast Mild High True Yes

Overcast Hot Normal False Yes

Sunny Mild High True No

The dataset is divided into two parts, namely, feature matrix and the response vector.

 Feature matrix: contains all the vectors (rows) of the dataset, in which each vector consists of the values of the
dependent features. In the above dataset, the features are 'Outlook', 'Temperature', 'Humidity' and 'Windy'.

 Response vector: contains the value of the class variable (prediction or output) for each row of the feature matrix.
In the above dataset, the class variable name is 'Play Golf'.

Assumption:

The fundamental Naive Bayes assumption is that each feature makes an:

 independent

 equal

contribution to the outcome.

With relation to our dataset, this concept can be understood as:

 We assume that no pair of features are dependent. For example, the temperature being ‘Hot’ has nothing to
do with the humidity or the outlook being ‘Rainy’ has no effect on the winds. Hence, the features are
assumed to be independent.

 Secondly, each feature is given the same weight(or importance). For example, knowing only temperature
and humidity alone can’t predict the outcome accurately. None of the attributes is irrelevant and assumed to
be contributing equally to the outcome.

Note: The assumptions made by Naive Bayes are not generally correct in real-world situations. In fact, the independence
assumption is never exactly correct, but it often works well in practice.
Now, before moving to the formula for Naive Bayes, it is important to know about Bayes’ theorem.

Bayes’ Theorem

Bayes’ Theorem finds the probability of an event occurring given the probability of another event that has already
occurred. Bayes’ theorem is stated mathematically as the following equation:

where A and B are events and P(B) ≠ 0.

 Basically, we are trying to find the probability of event A, given that event B is true. Event B is also
termed the evidence.

 P(A) is the prior probability of A, i.e. the probability of the event before the evidence is seen. The evidence is an
attribute value of an unknown instance (here, event B).

 P(A|B) is the posterior probability of A, i.e. the probability of the event after the evidence is seen.

Now, with regard to our dataset, we can apply Bayes' theorem in the following way:

P(y|X) = P(X|y) P(y) / P(X)

where y is the class variable and X is a dependent feature vector (of size n):

X = (x1, x2, x3, ..., xn)

Just to be clear, an example of a feature vector and corresponding class variable is (refer to the 1st row of the dataset):

X = (Rainy, Hot, High, False)
y = No

So basically, P(y|X) here means the probability of "not playing golf" given that the weather conditions are "rainy
outlook", "hot temperature", "high humidity" and "no wind".

Naive assumption

Now, it is time to apply the naive assumption to Bayes' theorem, namely independence among the features. We therefore
split the evidence into its independent parts.

If any two events A and B are independent, then

P(A, B) = P(A) P(B)

which can be expressed as:

P(y|x1, ..., xn) = P(x1|y) P(x2|y) ... P(xn|y) P(y) / (P(x1) P(x2) ... P(xn))

Now, as the denominator remains constant for a given input, we can remove that term:

P(y|x1, ..., xn) ∝ P(y) Π (i = 1 to n) P(xi|y)

Now, we need to create a classifier model. For this, we find the probability of the given set of inputs for all possible values of
the class variable y and pick the output with the maximum probability. This can be expressed mathematically as:

y = argmax over y of P(y) Π (i = 1 to n) P(xi|y)

So, finally, we are left with the task of calculating P(y) and P(xi | y).

Please note that P(y) is also called class probability and P(xi | y) is called conditional probability.
The different naive Bayes classifiers differ mainly in the assumptions they make regarding the distribution of P(xi | y).
Let us try to apply the above formula manually on our weather dataset. For this, we need to do some
precomputations on our dataset.

We need to find P(xi | yj) for each xi in X and yj in y. These counts and probabilities, computed from the 14 rows of the dataset, are shown in the tables below:

Table 1: Outlook Yes No P(Outlook|Yes) P(Outlook|No)
Rainy 2 3 2/9 3/5
Overcast 4 0 4/9 0/5
Sunny 3 2 3/9 2/5

Table 2: Temperature Yes No P(Temp|Yes) P(Temp|No)
Hot 2 2 2/9 2/5
Mild 4 2 4/9 2/5
Cool 3 1 3/9 1/5

Table 3: Humidity Yes No P(Humidity|Yes) P(Humidity|No)
High 3 4 3/9 4/5
Normal 6 1 6/9 1/5

Table 4: Windy Yes No P(Windy|Yes) P(Windy|No)
False 6 2 6/9 2/5
True 3 3 3/9 3/5

Table 5: Class P(y)
Yes 9/14
No 5/14

So, in the tables above, we have calculated P(xi | yj) for each xi in X and yj in y manually in tables 1-4. For example, the
probability of playing golf given that the temperature is cool is P(temp = Cool | play golf = Yes) = 3/9.

We also need the class probabilities P(y), which are given in table 5. For example, P(play golf = Yes) = 9/14.

So now, we are done with our pre-computations and the classifier is ready!

Let us test it on a new set of features (let us call it today):

today = (Sunny, Hot, Normal, False)

So, the probability of playing golf is given by:

P(Yes | today) = P(Sunny | Yes) P(Hot | Yes) P(Normal | Yes) P(False | Yes) P(Yes) / P(today)

and the probability of not playing golf is given by:

P(No | today) = P(Sunny | No) P(Hot | No) P(Normal | No) P(False | No) P(No) / P(today)

Since P(today) is common to both probabilities, we can ignore P(today) and work with proportional probabilities:

P(Yes | today) ∝ (3/9) · (2/9) · (6/9) · (6/9) · (9/14) ≈ 0.0212

and

P(No | today) ∝ (2/5) · (2/5) · (1/5) · (2/5) · (5/14) ≈ 0.0046

These numbers can be converted into probabilities by making them sum to 1 (normalization):

P(Yes | today) = 0.0212 / (0.0212 + 0.0046) ≈ 0.82

and

P(No | today) = 0.0046 / (0.0212 + 0.0046) ≈ 0.18

Since P(Yes | today) > P(No | today), the prediction is that golf would be played: 'Yes'.

The method discussed above is applicable to discrete data. In the case of continuous data, we need to make some
assumptions regarding the distribution of values of each feature. As noted earlier, the different naive Bayes classifiers differ mainly in the
assumptions they make regarding the distribution of P(xi | y).
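
The whole calculation can be reproduced with a short script. The sketch below estimates P(y) and P(xi | y) by counting over the 14-row golf table and then scores today = (Sunny, Hot, Normal, False); it uses only the Python standard library.

from collections import Counter, defaultdict

rows = [
    ("Rainy", "Hot", "High", "False", "No"),      ("Rainy", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),  ("Sunny", "Mild", "High", "False", "Yes"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),  ("Sunny", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),("Rainy", "Mild", "High", "False", "No"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),  ("Sunny", "Mild", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "True", "Yes"),   ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),("Sunny", "Mild", "High", "True", "No"),
]

class_counts = Counter(r[-1] for r in rows)        # numerators for P(y)
feat_counts = defaultdict(Counter)                 # numerators for P(x_i | y)
for *features, label in rows:
    for i, value in enumerate(features):
        feat_counts[(i, label)][value] += 1

def score(features, label):
    """Unnormalized P(y) * product over i of P(x_i | y)."""
    p = class_counts[label] / len(rows)
    for i, value in enumerate(features):
        p *= feat_counts[(i, label)][value] / class_counts[label]
    return p

today = ("Sunny", "Hot", "Normal", "False")
scores = {label: score(today, label) for label in class_counts}
total = sum(scores.values())
print({label: s / total for label, s in scores.items()})   # about 0.82 'Yes' vs 0.18 'No'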

Prediction

“Prediction” refers to the output of an algorithm after it has been trained on a historical dataset and applied to new
data when forecasting the likelihood of a particular outcome, such as whether or not a customer will churn in 30 days. The
algorithm will generate probable values for an unknown variable for each record in the new data, allowing the model
builder to identify what that value will most likely be.

The word “prediction” can be misleading. In some cases, it really does mean that you are predicting a future outcome,
such as when you’re using machine learning to determine the next best action in a marketing campaign. Other times,
though, the “prediction” has to do with, for example, whether or not a transaction that already occurred was fraudulent. In
that case, the transaction already happened, but you’re making an educated guess about whether or not it was legitimate,
allowing you to take the appropriate action.

Machine learning model predictions allow businesses to make highly accurate guesses as to the likely outcomes of a
question based on historical data, which can be about all kinds of things – customer churn likelihood, possible fraudulent
activity, and more. These provide the business with insights that result in tangible business value. For example, if a model
predicts a customer is likely to churn, the business can target them with specific communications and outreach that will
prevent the loss of that customer.

The DataRobot AI Cloud Platform allows users to easily develop models that make highly accurate predictions. It
streamlines the data science process so that users get high-quality predictions in a fraction of the time it took using
traditional methods, allowing them to more quickly implement those predictions and see the impact on their bottom line.

In order to start making predictions with DataRobot, you need to deploy the model into a production application. For
more details, see the deployment wiki entry or the DataRobot model deployment briefing.

Neural Networks
In a single-layer network there is one layer of weights. Now, instead of directly connecting the inputs to the outputs, we
insert a layer of "hidden" nodes, moving from a single-layer network to a multi-layer network. Introducing non-linearity
in a multi-layer network gives us non-linear decision boundaries.
The major weakness of linear models is that they cannot represent arbitrary decision boundaries.
Let x = (x1, x2, x3, x4, x5) be the inputs, w1 the weight vector of the first hidden node and w2 the weight vector of the second hidden node.
Each hidden node computes an activation a = w · x and then applies an activation function, either tanh or the sigmoid, e.g.:
h1 = tanh(w1 · x)
h2 = tanh(w2 · x)
We can generalize this as hi = f(wi · x), where f is any activation function.
The output is then

y^ = v1 h1 + v2 h2 = Σ (i = 1 to 2) vi hi = v · tanh(W x)

Two-layer neural networks are more expressive than single-layer networks (i.e., perceptrons).
To see this, you can construct a very small two-layer network that solves the XOR problem.
For simplicity, suppose that the data set consists of the four XOR data points.
The classification rule is that y = 0 if and only if x1 = x2.
We can solve this problem using a two-layer network with two hidden units.
The key idea is to make the first hidden unit compute an "or" function, x1 ∨ x2, and the second hidden unit compute an
"and" function, x1 ∧ x2. The output can then combine these into a single prediction that mimics XOR.
Once the first hidden unit activates for "or" and the second for "and", you need only set the output weights to −2
and +1, respectively.
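
Here is a tiny hand-built sketch of that idea in Python (numpy assumed), using 0/1 inputs and a hard step activation for clarity. The first hidden unit fires for "or", the second for "and"; the output weights used here (+1 on the "or" unit and -2 on the "and" unit, with a zero threshold) are just one workable choice.

import numpy as np

def step(z):
    return (z > 0).astype(float)

W_hidden = np.array([[1.0, 1.0],     # "or" unit:  fires when x1 + x2 > 0.5
                     [1.0, 1.0]])    # "and" unit: fires when x1 + x2 > 1.5
b_hidden = np.array([-0.5, -1.5])
v = np.array([1.0, -2.0])            # output weights for (or, and)

def xor_net(x):
    h = step(W_hidden @ x + b_hidden)
    return step(v @ h)               # positive only when exactly one input is 1

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_net(np.array(x, dtype=float)))   # prints 0.0, 1.0, 1.0, 0.0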
Back-propagation Algorithm

We predict values from the weights and biases by moving from the input layer to the output layer in a forward manner, layer by layer.
If the prediction is wrong, we update the weight and bias values in the reverse direction, i.e. from the output layer back to the
input layer, layer by layer.
The back-propagation algorithm is a classic approach to training neural networks. In short: back-propagation = gradient descent + chain rule.
Our main goal is to minimize the error

E = 1/2 Σn (yn − yn^)² = 1/2 Σn (en)²,  where en = yn − yn^

Here, f is some link function like tanh.


yn^ = v1 h1 + v2 h2 = Σ (i = 1 to 2) vi hi

en = yn − Σ (i = 1 to 2) vi hi
Beyond Two Layers
The definition of neural networks and the back-propagation algorithm can be generalized beyond two layers to any arbitrary
directed acyclic graph (DAG). Suppose that the network structure is stored in a directed acyclic graph whose nodes we index
as u, v. The activation at a node u before applying the non-linearity is au, and after the non-linearity it is hu. The graph has a
single sink, the output node y, with activation ay.
The graph has D-many inputs (i.e., nodes with no parent), whose activations hu are given by an input example. An edge (u,
v) goes from a parent to a child (i.e., from an input to a hidden unit, or from a hidden unit to the sink). Each edge has a weight
wu,v. We write par(u) for the set of parents of u.
There are two relevant algorithms: forward-propagation and backpropagation.
Forward-propagation tells you how to compute the activation of the sink y given the inputs.
Back-propagation computes derivatives of the edge weights for a given input.
The key aspect of the forward-propagation algorithm is to iteratively compute activations, going deeper and deeper in the
DAG: once the activations of all the parents of a node u have been computed, the activation of node u itself can be computed.
Back-propagation does the opposite: it computes gradients top-down in the network. The key idea is to compute
an error for each node in the network.
The error at the output unit is the "true error"; for any interior unit, the error is the amount of gradient flowing in
from its children (i.e., nodes higher in the network).
These errors are computed backwards through the network (hence the name back-propagation), along with the gradients
themselves.
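
As a small illustration of forward-propagation over a DAG, the sketch below computes activations in topological order for a toy graph; the node names, weights and the use of tanh at every non-input node are illustrative assumptions, not part of the text.

import math

# toy DAG: two inputs feed one hidden unit, which feeds the output y
parents = {"h": ["x1", "x2"], "y": ["h"]}                          # par(u) for each non-input node
weights = {("x1", "h"): 0.5, ("x2", "h"): -0.3, ("h", "y"): 1.2}   # weight w[u, v] on edge (u, v)
inputs = {"x1": 1.0, "x2": 2.0}                                    # activations of the input nodes

def forward(topological_order):
    h = dict(inputs)                                          # h[u]: activation after non-linearity
    for u in topological_order:                               # go deeper and deeper in the DAG
        a_u = sum(weights[(p, u)] * h[p] for p in parents[u]) # a[u]: pre-activation from parents
        h[u] = math.tanh(a_u)                                 # apply the non-linearity
    return h

print(forward(["h", "y"])["y"])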
Perceptron back-propagation problem

i1 = 0.05   w5 = 0.40   o1 = 0.01
i2 = 0.10   w6 = 0.45   o2 = 0.99
w1 = 0.15   w7 = 0.50
w2 = 0.20   w8 = 0.55
w3 = 0.25   b1 = 0.35
w4 = 0.30   b2 = 0.60

Backward propagation
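
A sketch of one forward pass and one gradient-descent update for this problem, assuming the usual layout for this classic example: w1-w4 connect the inputs to two hidden units, w5-w8 connect the hidden units to two outputs, b1 and b2 are shared biases, every unit uses the logistic (sigmoid) activation, o1 and o2 are the target outputs, and the learning rate of 0.5 is an assumed value.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# values from the problem statement above
i = np.array([0.05, 0.10])            # inputs i1, i2
target = np.array([0.01, 0.99])       # target outputs o1, o2 (assumed to be targets)
W1 = np.array([[0.15, 0.20],          # weights into hidden unit 1 (w1, w2)
               [0.25, 0.30]])         # weights into hidden unit 2 (w3, w4)
W2 = np.array([[0.40, 0.45],          # weights into output 1 (w5, w6)
               [0.50, 0.55]])         # weights into output 2 (w7, w8)
b1, b2 = 0.35, 0.60
lr = 0.5                              # assumed learning rate

# forward pass, layer by layer
h = sigmoid(W1 @ i + b1)              # hidden activations
o = sigmoid(W2 @ h + b2)              # output activations
E = 0.5 * np.sum((target - o) ** 2)
print("outputs:", o, "error:", E)     # error comes out around 0.30 before training

# backward pass: gradient descent + chain rule
delta_o = (o - target) * o * (1 - o)          # dE/d(net) at the outputs
delta_h = (W2.T @ delta_o) * h * (1 - h)      # dE/d(net) at the hidden units

W2 -= lr * np.outer(delta_o, h)       # update hidden-to-output weights
W1 -= lr * np.outer(delta_h, i)       # update input-to-hidden weights
print("updated w5:", W2[0, 0])        # roughly 0.359 after one step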
