Machine Learning Notes

Naveen Pragallapati

Unit-3
Part 2: Bayesian Learning
Basic Concepts and Notations:

What is h?

 h refers to the hypothesis, which is a specific assumption or statement about the relationship between the input (data) and the outcome.

 In machine learning, it is often a rule or model that makes predictions based on the data.

In our Weather Prediction Example:

 A hypothesis might describe the conditions under which it will rain.

 Examples of specific hypotheses could be:

o h1: "It will rain if the sky is cloudy."

o h2: "It will rain if it is both cloudy and humid."

o h3: "It will rain only if the temperature is below 20°C."

Each of these hypotheses reflects a possible model for predicting rain based on certain features (clouds, humidity, temperature).

What is H?

The set of all possible hypotheses (denoted by H) contains every possible rule or model that could relate the input features (like weather conditions) to the outcome (whether it rains or not). The set H is the collection of all possible hypotheses that could be formed using the available features.

In formal learning algorithms, the size of H depends on:

1. The number of features (e.g., cloudy, temperature, humidity).

2. The possible values of each feature (e.g., cloudy = {yes, no}, temperature = {low, high}).

What is Likelihood?

Likelihood measures how well a hypothesis or model explains the observed data.
It answers: "Given the data I observed, how plausible (believable/credible) is this particular model or hypothesis?"

Unlike probability, which sums to 1 over all possible outcomes, likelihood does not need to sum to 1 across different hypotheses.

Example of Likelihood:

 You roll a die and observe a 3. Now you want to evaluate two models:

o Model 1: The die is fair.

o Model 2: The die is biased toward 3, showing it with probability 0.5.

The likelihood of observing the 3 under these two models would be:

L(Fair Model | 3) = 1/6,  L(Biased Model | 3) = 0.5

Here, likelihood helps you determine which model better explains the observed outcome (seeing a 3).

Key Difference between Probability and Likelihood:

Summary

 Probability: The likelihood (in everyday terms) of an event happening, given the model.

 Likelihood: The plausibility of a model or hypothesis given the observed data.

What is D?

 D usually refers to the training data, the set of observations used to learn a model.

 Depending on the context, D could be:

o A single observation (e.g., "Today is cloudy").

o A set of training examples (e.g., historical weather data over several days).

In Bayesian learning, D often refers to the entire set of observations or data points used to update the belief about the hypothesis.

What is P(D)? (Marginal Likelihood / Evidence)

 P(D) is the marginal likelihood or evidence.

 It measures how likely the observed data D is, averaged over all possible hypotheses.

 Mathematically, it is computed as:

P(D) = Σ_{h ∈ H} P(D|h) P(h)

 P(D) is a normalizing constant that ensures the posterior probabilities sum to 1. It tells us how likely the entire dataset is, irrespective of any specific hypothesis.

What is P(D|h)? (Likelihood)

 P(D|h) is the likelihood of the data D given a specific hypothesis h.

 It measures how well the hypothesis h explains the observed data D.

 If D contains multiple training examples, P(D|h) is typically the product of the likelihoods of the individual examples (assuming independence):

P(D|h) = Π_i P(d_i | h)

Here, d_i is a single observation, and the product runs over all examples in the dataset.

What is P(h|D)? (Posterior Probability)

 P(h|D) is the posterior probability of the hypothesis h given the observed data D.

 It represents the updated belief in the hypothesis after seeing the data.

 Using Bayes' theorem, it is calculated as:

P(h|D) = P(D|h) P(h) / P(D)

Note:

 If a hypothesis h predicts the observed data well, its posterior probability P(h|D) increases.
 If it does not predict the data well, its posterior probability decreases.

***

Introduction to Bayesian Learning:


Bayesian Learning is a probabilistic approach to machine learning in which we make predictions by maintaining and updating a belief about which hypothesis (or model) is most likely to be true. The core idea is that as we gather new data, we update our beliefs using Bayes' theorem. Bayesian learning is useful because it not only gives us the most probable hypothesis but also provides a measure of uncertainty about that prediction. This is especially powerful in real-world scenarios such as weather forecasting, medical diagnosis, and spam detection.

Notations in Bayesian Learning

Bayes' Theorem in Bayesian Learning

P(h|D) = P(D|h) P(h) / P(D)

where P(h) is the prior, P(D|h) the likelihood, P(D) the evidence, and P(h|D) the posterior.

Example 1: Bayesian Learning with Two Hypotheses (Rain vs No Rain)


Problem Setup

We want to predict whether it will rain today based on the observation:

 Observation D: "It is Cloudy."

Our goal is to update the probabilities of the two hypotheses:

1. h1: It will rain today.

2. h2: It will not rain today.



Conclusion

After observing the cloudy sky, the posterior probability of rain increases from 0.4 to 0.64. This means that, based on the evidence (cloudy sky), we are now more confident that it will rain today. Similarly, the probability of no rain decreases from 0.6 to 0.36.
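The update can be reproduced numerically. A minimal sketch in Python, assuming illustrative likelihoods P(Cloudy | rain) = 0.8 and P(Cloudy | no rain) = 0.3 (these values are assumptions chosen to be consistent with the posteriors quoted above, not taken from the original worked example):

```python
# Bayesian update for the two-hypothesis rain example.
priors = {"rain": 0.4, "no_rain": 0.6}
# Assumed likelihoods P(Cloudy | h), consistent with the quoted posteriors.
likelihoods = {"rain": 0.8, "no_rain": 0.3}

evidence = sum(priors[h] * likelihoods[h] for h in priors)  # P(D) = 0.5
posteriors = {h: priors[h] * likelihoods[h] / evidence for h in priors}
print(posteriors)  # rain ≈ 0.64, no_rain ≈ 0.36
```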

Example 2: Weather Prediction using Bayesian Learning with Three Hypotheses


Problem Setup
Observed Data D:
"Today's weather is Cloudy with High Humidity, and we observe that it rained."

Hypotheses and Prior Probabilities

Our goal is to update the probabilities of the three hypotheses.



Likelihoods P(D|h)

Calculating the Marginal Likelihood P(D)

Calculating the Posterior Probabilities P(h|D)



Summary of Posterior Probabilities

h      P(h)    P(D|h)    P(h|D)
h1     0.3     0.7       0.30
h2     0.5     0.9       0.64
h3     0.2     0.2       0.06

Conclusion

After observing the data (Cloudy + High Humidity + Rain), the posterior probabilities indicate that:

 h2 ("It rains if cloudy and humid") is the most probable hypothesis, with a posterior probability of 0.64.

 h1 ("It rains if cloudy") still has some credibility, with a posterior probability of 0.30.

 h3 is unlikely, with a posterior probability of only 0.06.

This example demonstrates how Bayesian learning updates our beliefs about each hypothesis based on the observed data.
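A short sketch reproducing the numbers in the table above (the priors and likelihoods are exactly those listed there):

```python
# Posterior update for the three-hypothesis weather example.
priors = {"h1": 0.3, "h2": 0.5, "h3": 0.2}
likelihoods = {"h1": 0.7, "h2": 0.9, "h3": 0.2}  # P(D|h) from the table

p_d = sum(priors[h] * likelihoods[h] for h in priors)  # P(D) = 0.70
for h in priors:
    print(h, round(priors[h] * likelihoods[h] / p_d, 2))  # 0.3, 0.64, 0.06
```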

Bayesian Inference Process


1. Select prior beliefs: Choose prior probabilities P(h) for all hypotheses h ∈ H.

2. Observe data: Collect the data D.

3. Compute posterior: Use Bayes' theorem to update the belief in each hypothesis.

4. Predict new data: Use the posterior distribution to make predictions about future events or unseen data.

MAP and ML Hypotheses

The maximum a posteriori (MAP) hypothesis is h_MAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) P(h). If every hypothesis in H is assumed equally probable a priori, this reduces to the maximum likelihood (ML) hypothesis h_ML = argmax_{h ∈ H} P(D|h).



Example 3: Bayesian Inference in Medical Diagnosis


Consider a medical diagnosis problem in which there are two alternative hypotheses: (1) that the patient has a particular form of cancer, and (2) that the patient does not. The available data is from a particular laboratory test with two possible outcomes: ⨁ (positive) and ⊖ (negative). We have prior knowledge that over the entire population only 0.008 of people have this disease. Furthermore, the lab test is only an imperfect indicator of the disease. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. In other cases, the test returns the opposite result. Suppose we now observe a new patient for whom the lab test returns a positive result. Should we diagnose the patient as having cancer or not?

Computing P(⨁|cancer) P(cancer) = 0.98 × 0.008 = 0.0078 and P(⨁|¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.0298, the second quantity is larger. Thus h_MAP = ¬cancer: the patient is diagnosed as not having cancer.
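A short sketch of this computation (the probabilities are exactly those stated in the problem):

```python
# MAP hypothesis for the cancer-test example.
p_cancer, p_no_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_no = 0.98, 0.03

score_cancer = p_pos_given_cancer * p_cancer   # 0.0078
score_no = p_pos_given_no * p_no_cancer        # 0.0298
print("h_MAP =", "cancer" if score_cancer > score_no else "no cancer")
# Normalizing: P(cancer | +) = 0.0078 / 0.0376, about 0.21.
```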

Summary of Basic Probability Formulas:



Concept Learning Using Bayes' Theorem


Bayes' theorem provides a principled way to calculate the posterior probability of each hypothesis given the training data. We can therefore use it as the basis for a straightforward learning algorithm that calculates the probability of each possible hypothesis and then outputs the most probable one.
We now consider a brute-force Bayesian concept learning algorithm and compare it to the concept learning algorithms considered earlier. One interesting result of this comparison is that, under certain conditions, several algorithms discussed in earlier chapters output the same hypotheses as this brute-force Bayesian algorithm, despite the fact that they do not explicitly manipulate probabilities and are considerably more efficient.

Brute-Force Bayes Concept Learning


Overview
Brute-force Bayes concept learning refers to applying Bayes' theorem exhaustively across a hypothesis space H to find the most probable hypothesis. This approach calculates the posterior probability of each hypothesis given the data D and selects the hypothesis with the highest posterior.
Assume the learner considers some finite hypothesis space H defined over the instance space X, in which the task is to learn some target concept c: X → {0, 1}. Also assume that the learner is given some sequence of training examples <<x₁, d₁>, <x₂, d₂>, ..., <x_m, d_m>>, where x_i is some instance from X and d_i is the target value of x_i (i.e., d_i = c(x_i)). To simplify, we assume the sequence of instances <x₁, x₂, ..., x_m> is held fixed, so that the training data D can be written simply as the sequence of target values D = <d₁, d₂, ..., d_m>.

BRUTE-FORCE MAP Learning Algorithm:

Process:
Given a finite hypothesis space H:
1. Assign a prior probability P(h) to each hypothesis.
2. Calculate the likelihood P(D|h) for each hypothesis using the observed data D.
3. Normalize by dividing by the total probability P(D), which ensures that the posterior probabilities sum to 1, and output the hypothesis with the highest posterior.
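A minimal sketch of this procedure over a tiny finite hypothesis space, assuming the noise-free setting introduced below, where P(D|h) is 1 if h is consistent with every example and 0 otherwise (the hypotheses and data are invented for illustration):

```python
# Brute-force MAP learning over a small finite hypothesis space.
# Each hypothesis is a predicate over instances; data is (x, label) pairs.
hypotheses = {
    "ge2": lambda x: x >= 2,
    "ge4": lambda x: x >= 4,
    "even": lambda x: x % 2 == 0,
}
priors = {name: 1 / len(hypotheses) for name in hypotheses}  # uniform P(h)
data = [(4, True), (6, True), (3, False)]

def likelihood(h, data):
    # Noise-free setting: P(D|h) = 1 iff h agrees with every example.
    return 1.0 if all(h(x) == d for x, d in data) else 0.0

scores = {n: priors[n] * likelihood(h, data) for n, h in hypotheses.items()}
p_d = sum(scores.values())
posteriors = {n: s / p_d for n, s in scores.items()}
print(posteriors)  # the two consistent hypotheses share the posterior mass
```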

Challenges of Brute-Force Bayes Learning


1. Computational Cost:
For large hypothesis spaces H, computing the posterior for every hypothesis becomes impractical. As H grows, the computational requirements increase exponentially.
2. Dependence on Priors:
The outcome can be influenced by how the prior probabilities P(h) are assigned. Incorrect priors can bias the learning process.
3. Need for Full Enumeration:
The brute-force method requires examining every possible hypothesis in the space, which is feasible only for small spaces.

If we make the following assumptions:
1. The training data D is noise free (i.e., d_i = c(x_i)).
2. The target concept c is contained in the hypothesis space H.
3. We have no a priori reason to believe that any hypothesis is more probable than any other.
then we can use P(h) = 1/|H| for every h in H, and P(D|h) = 1 if h is consistent with D and 0 otherwise.

Thus, P(D|h) yields a value of 1 or 0, depending on whether the hypothesis correctly predicts the target values.

The above analysis implies that, under our choices for P(h) and P(D|h), every consistent hypothesis has posterior probability 1/|VS_{H,D}| (where VS_{H,D} is the version space: the subset of hypotheses in H consistent with D), and every inconsistent hypothesis has posterior probability 0.

"Every consistent hypothesis is, therefore, a MAP hypothesis."

MAP Hypotheses and Consistent Learners (FIND-S and Candidate Elimination Algorithms):

Minimum Description Length (MDL) Principle


Occam's Razor suggests that among competing hypotheses explaining the same data, the simplest one should be preferred. The MDL principle refines this idea using concepts from information theory and Bayesian learning, stating that the best hypothesis minimizes the total description length of both the hypothesis and the data it explains.
Recall that the maximum a posteriori hypothesis is

h_MAP = argmax_{h ∈ H} P(D|h) P(h)

Taking the logarithm (base 2) of the likelihood and the prior, this is equivalent to

h_MAP = argmax_{h ∈ H} [log₂ P(D|h) + log₂ P(h)]

or alternatively,

h_MAP = argmin_{h ∈ H} [−log₂ P(D|h) − log₂ P(h)]
Description Length
The MDL principle interprets the terms in the Bayesian equation as description lengths, using concepts from information theory.
1. Message Encoding and Description Length: Consider the problem of designing a code C to transmit messages drawn at random, where the probability of encountering message i is p_i. We want the most compact code; that is, we are interested in the code that minimizes the expected number of bits we must transmit in order to encode a message drawn at random. The minimum number of bits required to encode message i optimally using code C is called the description length of message i with respect to C. It is denoted by L_C(i) and is given by:

L_C(i) = −log₂ p_i

2. Encoding a Hypothesis h: The description length of a hypothesis h under the optimal encoding scheme C_H for the hypothesis space H is

L_{C_H}(h) = −log₂ P(h)

3. Encoding the Data Given the Hypothesis: The description length of the data D given a hypothesis h, under the optimal encoding scheme C_{D|h}, is:

L_{C_{D|h}}(D|h) = −log₂ P(D|h)

The MDL Principle (Formal Definition): The MDL principle recommends selecting the hypothesis that minimizes the total description length, which includes both:
1. The description length of the hypothesis h.
2. The description length of the data given the hypothesis.

Assuming we use the codes C_H and C_{D|h} to represent the hypothesis and the data given the hypothesis, we can state the MDL principle as

h_MDL = argmin_{h ∈ H} [L_{C_H}(h) + L_{C_{D|h}}(D|h)]

Note: If we choose C_H to be the optimal encoding of hypotheses and C_{D|h} to be the optimal encoding of the data given the hypothesis, then h_MDL = h_MAP.

Interpretation of MDL
 The MDL principle balances the complexity of the hypothesis against how well it explains the data.
 A simpler hypothesis that makes a few errors may be preferred over a complex one that perfectly fits the data, to avoid overfitting.
Example: MDL with Decision Trees
Suppose we want to apply the MDL principle to decision tree learning:
1. Encoding the Hypothesis (C_H):
o We use an encoding where the description length increases with the number of nodes and edges in the tree.
2. Encoding the Data Given the Hypothesis (C_{D|h}):
o If the hypothesis perfectly classifies the training data, the cost of encoding the data is zero.
o If some data points are misclassified, we need additional bits to:
 Identify the misclassified instances (at most log₂ m bits per instance, where m is the number of instances).
 Transmit the correct classification (at most log₂ k bits, where k is the number of possible classifications).
Applications of MDL:
1. Decision Tree Pruning:
o The MDL principle helps determine the best size for a decision tree by balancing simplicity and accuracy.
o Research by Quinlan and Rivest (1989) and Mehta et al. (1995) shows that MDL-based pruning achieves results similar to traditional methods.
2. Overfitting Prevention:
o MDL avoids overfitting by preferring simpler models, even if they make a few errors on the training data.
Limitations and Practical Considerations:
1. The effectiveness of MDL depends on selecting appropriate encoding schemes for hypotheses and data.
2. In practice, it can be difficult to determine optimal encoding schemes or estimate prior probabilities accurately.
3. Human intervention may be required to design encodings that reflect domain knowledge about the problem.

Bayes Optimal Classifier:


The Bayes optimal classifier is the most probabilistically optimal approach to classification. It provides the most accurate prediction possible, given all available data and hypotheses. Unlike simpler Bayesian methods, this classifier considers multiple hypotheses and selects the label with the highest expected probability.

Let the classification label v_j represent a possible class or output label that we want to predict. The Bayes optimal classifier predicts the label v_j that has the highest probability over all hypotheses, weighted by their posterior probabilities:

P(v_j | D) = Σ_{h ∈ H} P(v_j | h) P(h|D)

Bayes Optimal Classification:


Step 1: Calculate the posterior probability P(h|D) for each hypothesis h.
Step 2: For each possible label v_j, compute the weighted sum Σ_{h ∈ H} P(v_j | h) P(h|D) over all hypotheses h.
Step 3: Select the label v_j with the highest probability P(v_j | D):

Most probable class = argmax_{v_j ∈ V} Σ_{h ∈ H} P(v_j | h) P(h|D)

Example 1:
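The worked example did not survive extraction; a minimal sketch using a classic three-hypothesis illustration (the posteriors 0.4/0.3/0.3 and the label each hypothesis assigns are assumptions standing in for the missing numbers):

```python
# Bayes optimal classification with three hypotheses.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}   # P(h|D), assumed
predictions = {"h1": "+", "h2": "-", "h3": "-"}  # label each h assigns

score = {}
for v in set(predictions.values()):
    # P(v|h) is 1 or 0 here because each hypothesis predicts a single label.
    score[v] = sum(p for h, p in posteriors.items() if predictions[h] == v)
print(score)                      # {'+': 0.4, '-': 0.6}
print(max(score, key=score.get))  # '-' is the Bayes optimal label
```

Note that the MAP hypothesis (h1) would predict "+", while the Bayes optimal classifier, weighting all hypotheses, predicts "-".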

Advantages of the Bayes Optimal Classifier

1. The most accurate prediction possible, since it considers all hypotheses.

2. Reduces the risk of overfitting by not relying on a single hypothesis.
3. Can handle uncertainty in the data effectively.
Challenges

1. Computationally expensive: summing over all hypotheses can be infeasible, especially with large hypothesis spaces.
2. Requires complete knowledge of the priors and accurate likelihood estimates.

Gibbs Algorithm:

The Gibbs algorithm is a simple but powerful algorithm in the context of Bayesian learning. It demonstrates how randomization can be used to make predictions based on a probability distribution over hypotheses. The Gibbs algorithm is rooted in probabilistic learning: instead of deterministically choosing a single hypothesis, the algorithm randomly selects one according to a probability distribution.

Although the Bayes optimal classifier obtains the best performance that can be achieved from the given training data, it can be quite costly to apply. The expense is due to the fact that it computes the posterior probability for every hypothesis in H and then combines the predictions of all hypotheses to classify each new instance. An alternative, less optimal method is the Gibbs algorithm, defined as follows.

For each new instance x, the Gibbs algorithm:

1. Chooses a hypothesis h from H at random, according to the posterior distribution P(h|D) over H.

2. Uses h to predict the classification of the instance x.
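A minimal sketch of the two steps, reusing the illustrative posteriors from the previous example (assumed numbers, not from the original text):

```python
import random

# Gibbs algorithm: sample one hypothesis from the posterior, then predict.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}   # P(h|D), assumed
predictions = {"h1": "+", "h2": "-", "h3": "-"}

def gibbs_classify(x=None):
    h = random.choices(list(posteriors), weights=list(posteriors.values()))[0]
    return predictions[h]  # prediction of the single sampled hypothesis

print([gibbs_classify() for _ in range(5)])  # varies from run to run
```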

Advantages of the Gibbs Algorithm

1. Simple to implement, since it just involves selecting a hypothesis based on the posterior probabilities.
2. Theoretical foundation: it is closely aligned with the Bayesian learning framework.
3. Can be used in ensemble learning settings, where the algorithm effectively "votes" by selecting hypotheses according to their probabilities.

Limitations of the Gibbs Algorithm

1. Randomization: since the algorithm selects hypotheses randomly, predictions can be inconsistent.
2. Single hypothesis selection: it ignores the overall ensemble of hypotheses, making it sensitive to the selected hypothesis.
3. Computationally inefficient: computing posterior probabilities for all hypotheses can be expensive if the hypothesis space is large.

Applications and Use Cases:

1. The Gibbs algorithm is a conceptual framework useful for understanding the probabilistic nature of learning and ensemble methods.
2. It has relevance in:
 Ensemble learning strategies such as bagging and boosting.
 Randomized models, where variability in predictions is desirable.

Example:

Problem 1:
You are working with a dataset of two fruits: Apples and Oranges. There are 6 Apples, of which 4 are Red, and 4 Oranges, of which 1 is Red. We want to determine whether a new red fruit is more likely to be an Apple or an Orange.

Hypotheses:
h1: The fruit is an Apple.
h2: The fruit is an Orange.

Prior Probabilities: P(h1) = 0.5, P(h2) = 0.5

1. Compute the posterior probabilities.

2. Use h_MAP to predict the class of the new fruit.
3. Use the Bayes optimal classifier to predict the class of the new fruit.

Solution:
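The original worked solution was an image and did not survive extraction; a minimal sketch of the computation it calls for:

```python
# Problem 1: posterior of Apple vs Orange given that the fruit is Red.
priors = {"apple": 0.5, "orange": 0.5}
p_red = {"apple": 4 / 6, "orange": 1 / 4}  # P(Red | class) from the counts

evidence = sum(priors[c] * p_red[c] for c in priors)
posteriors = {c: priors[c] * p_red[c] / evidence for c in priors}
print(posteriors)  # apple ≈ 0.727, orange ≈ 0.273, so predict Apple
```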

Bayesian Classifier Prediction

Problem 2:
You are working with a small dataset of two fruits: Apples and Oranges, described by two features: Colour (Red or Green) and Size (Large or Small). The dataset contains 6 Apples, of which 4 are Red and 2 are Green, and 3 of them are Large while the other 3 are Small. There are 4 Oranges, of which 1 is Red and 3 are Green, and 2 are Large while the other 2 are Small. A new fruit appears with Colour = Red and Size = Large, and you need to determine whether this new fruit is more likely to be an Apple or an Orange using the Bayesian classifier.

Hints:
1. Assume the prior probabilities for both fruits are equal, i.e., P(Apple) = P(Orange) = 0.5.
2. Your hypothesis space consists of two hypotheses:
h1: The fruit is an Apple.
h2: The fruit is an Orange.
3. For each hypothesis h, apply Bayes' theorem to compute the posterior probabilities, i.e., compute P(h1 | Red, Large) and P(h2 | Red, Large).
4. Use the fact that for independent features, the joint probability can be computed as
P(Red, Large | h) = P(Red | h) · P(Large | h)
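A sketch following the hints, using the independence assumption from hint 4 (a hedged reconstruction, not the original solution):

```python
# Problem 2: two features (Colour = Red, Size = Large), assumed independent.
priors = {"apple": 0.5, "orange": 0.5}
p_red = {"apple": 4 / 6, "orange": 1 / 4}
p_large = {"apple": 3 / 6, "orange": 2 / 4}

joint = {c: priors[c] * p_red[c] * p_large[c] for c in priors}
evidence = sum(joint.values())
posteriors = {c: joint[c] / evidence for c in priors}
print(posteriors)  # apple ≈ 0.727, orange ≈ 0.273, so predict Apple
```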

Naïve Bayes Classifier:


Naive Bayes is a probabilistic classifier based on Bayes' theorem with an assumption of independence among features. It works well for large datasets, especially when the features are conditionally independent given the class label.

Naive Bayes Classifier Formula


For a given data point x = (x₁, x₂, ..., x_n), the classifier predicts the class c ∈ C that maximizes the posterior probability:

v_MAP = argmax_{c ∈ C} P(x|c) P(c)

The naive Bayes classifier assumes that all features x_i are conditionally independent of one another given the class label. This means

P(x|c) = P(x₁, x₂, ..., x_n | c) = Π_i P(x_i | c)

Naïve Bayes Classifier:

v_NB(x) = argmax_{c ∈ C} P(c) Π_i P(x_i | c)

where v_NB denotes the target value output by the naive Bayes classifier.

Illustrative Example:
Let us apply the naive Bayes classifier to a concept learning problem we considered during our discussion of decision tree learning: classifying days according to whether someone will play tennis. The following table provides a set of 14 training examples of the target concept PlayTennis, where each day is described by the attributes Outlook, Temperature, Humidity, and Wind.

Here we use the naive Bayes classifier and the training data from this table to classify the following new instance:
(Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong)
Our task is to predict the target value (yes or no) of the target concept PlayTennis for this new instance. The following steps demonstrate how the naive Bayes classifier predicts the target value.

Conclusion:
Using the naive Bayes classifier, we predict that the new instance corresponds to PlayTennis = No.
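A runnable sketch of the computation, assuming the standard 14-example PlayTennis table from the decision-tree literature (the table itself did not survive extraction, so the counts below are taken from that well-known dataset rather than from this document):

```python
# Naive Bayes on the PlayTennis data (counts from the standard 14-day table).
p_class = {"yes": 9 / 14, "no": 5 / 14}
# Conditional estimates for the query attributes, e.g. P(Sunny|yes) = 2/9.
cond = {
    "yes": {"sunny": 2 / 9, "cool": 3 / 9, "high": 3 / 9, "strong": 3 / 9},
    "no":  {"sunny": 3 / 5, "cool": 1 / 5, "high": 4 / 5, "strong": 3 / 5},
}

query = ["sunny", "cool", "high", "strong"]
score = {}
for c in p_class:
    s = p_class[c]
    for a in query:
        s *= cond[c][a]
    score[c] = s
print(score)                      # yes ≈ 0.0053, no ≈ 0.0206
print(max(score, key=score.get))  # 'no'
```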

Estimating the Probabilities:


In the above example, we estimated probabilities using relative frequencies: P = n_c / n, where n is the number of training examples in the class and n_c is the number of those examples with the attribute value in question.

While this method works well when the dataset is large, it becomes unreliable for small datasets. If n is small, even a true probability as low as 0.08 might result in n_c = 0 in the sample, leading to:

1. Biased underestimate: a value of 0 falsely suggests the event is impossible.

2. Zero-probability issue: in naive Bayes, multiplying a zero probability by the other terms makes the entire result zero, even if the other features suggest a non-zero posterior probability.
To avoid these difficulties, we adopt a Bayesian approach to estimating the probability, using the m-estimate:

P = (n_c + m·p) / (n + m)

where p is a prior estimate of the probability (often uniform, p = 1/k for k possible attribute values) and m is a constant called the equivalent sample size, which determines how heavily to weight p relative to the observed data.

Example of m-estimate
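The original worked example did not survive extraction; a small illustrative calculation (the numbers are assumed): suppose n = 5 examples of a class, of which n_c = 0 have a given binary attribute value, with uniform prior p = 1/2 and m = 4. Then

P = (0 + 4 × 0.5) / (5 + 4) = 2/9 ≈ 0.22,

so the estimate is pulled away from 0 toward the prior, avoiding the zero-probability problem.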



An Example: Learning to Classify Text:


Objective
The task is to classify a new text document based on the following training data:
Training Data:
Document 1: "I love machine learning and data science"
Document 2: "Machine learning is exciting and fun"
Document 3: "I dislike mathematics and statistics"
Document 4: "Mathematics is boring but necessary"
Document 5: "Python makes machine learning easier"
Target labels:
 Documents 1, 2, and 5 → like (interesting documents)
 Documents 3 and 4 → dislike (uninteresting documents)

New Document: "I love mathematics and machine learning"

Step-by-Step Naive Bayes Calculation:


We will train a naive Bayes classifier to predict whether the new document belongs to the "like" or "dislike" category based on word occurrences. We need to compute the likelihood of the document belonging to each class:
1. P(Document | like)
2. P(Document | dislike)

Naive Bayes Classifier Formula:


The naive Bayes classifier predicts the probability that a new document belongs to class like or dislike.

Step 1: Calculate Class Priors


Assume equal prior probabilities for the classes "like" and "dislike":
P(like) = P(dislike) = 0.5

Step 2: Apply Laplace Smoothing and Compute Likelihoods:


Now we calculate the smoothed likelihoods with uniform priors and with m equal to the size of the word vocabulary. The estimate for P(x_i | c) is then

P(x_i | c) = (n_i + 1) / (n + |Vocabulary|)

where n is the total number of word positions in the training documents of class c and n_i is the number of times the word occurs among them.

The Laplace-smoothed probability of a word w in class c is given by the formula above.

Vocabulary Size (|V|)


The combined vocabulary from both classes is:
V = {I, love, machine, learning, and, data, science, is, exciting, fun, Python, makes, easier, dislike, mathematics, statistics, boring, but, necessary}
Total unique words (|V|): 19

"like" Class Frequency Distribution:


{"I": 1, "love": 1, "machine": 3, "learning": 3, "and": 2, "data": 1, "science": 1, "is": 1, "exciting": 1, "fun": 1, "Python": 1, "makes": 1, "easier": 1}
Total Word Count (like): 18

"dislike" Class Frequency Distribution:


{"I": 1, "dislike": 1, "mathematics": 2, "and": 1, "statistics": 1, "is": 1, "boring": 1, "but": 1, "necessary": 1}
Total Word Count (dislike): 10

Step 3: Compute Likelihoods for the New Document


Document: "I love mathematics and machine learning"

Step 4: Compute Posterior Probabilities

Step 5: Classification Decision

Conclusion
The naive Bayes classifier predicts that the new document "I love mathematics and machine learning" belongs to the "like" class.

Experimental Results of Text Classification Using Naive Bayes

Objective: To assess how well a naive Bayes algorithm can classify Usenet news articles into their respective newsgroups.

Task Description:

 The goal was to assign new articles to the correct Usenet newsgroup based on the text content.
 Think of this as an automated newsgroup posting service: it learns from existing articles and tries to assign new ones to the appropriate group.

Dataset Details:
Number of Newsgroups: 20 distinct Usenet newsgroups were selected, covering topics ranging from science and sports to politics and technology.

Total Documents:
 1,000 articles were collected from each newsgroup.
 This resulted in a dataset of 20,000 documents.

Experimental Setup
Training and Test Split:
 Two-thirds of the 20,000 documents (~13,333 articles) were used for training the naive Bayes model.
 The remaining one-third (~6,667 articles) were used to test performance.

Performance Results
 The performance of the naive Bayes algorithm was evaluated on its ability to assign articles to the correct newsgroup.
 Classification Accuracy:
o The algorithm achieved 89% accuracy across the 20 newsgroups.
o This is a remarkable result given that there were 20 possible categories for each prediction (a multi-class classification task), where random guessing would achieve only 5% accuracy.

Interpretation of Results
1. Effectiveness of Naive Bayes:
 The naive Bayes classifier performed surprisingly well on this text classification task, despite its assumption of conditional independence (i.e., assuming that words are independent given the class).
 This demonstrates the practical utility of probabilistic models for text classification tasks.
2. Challenges with Text Classification:
 While naive Bayes works well for many tasks, it may struggle when words exhibit strong dependencies (e.g., word pairs or sequences).
 Nevertheless, this experiment shows that simple probabilistic approaches can be effective for large-scale classification problems, such as sorting news articles.
Conclusion:
This experiment by Joachims (1996) demonstrates that the naive Bayes algorithm is not only theoretically sound but also practically effective for large-scale text classification tasks, achieving 89% accuracy on the challenging task of assigning 20,000 Usenet articles to 20 different newsgroups. This highlights the algorithm's strength in handling high-dimensional text data efficiently.

The EM Algorithm:
In practical learning scenarios, not all features may be observable. The EM (Expectation-Maximization) algorithm is a widely used method for learning from data with hidden or unobserved variables (introduced by Dempster et al., 1977).

Applications:
 Bayesian networks (Heckerman, 1995)
 Radial basis function networks
 Clustering algorithms (e.g., Cheeseman et al., 1988)
 The Baum-Welch algorithm for learning partially observable Markov models (Rabiner, 1989)

General Idea of the EM Algorithm


Core Principle:
1. Use the current hypothesis to estimate the expected values of the hidden variables (E-step).
2. Use the expected values of the hidden variables to refine the hypothesis (M-step).
Convergence: with each iteration, the algorithm increases the likelihood unless it reaches a local maximum.
Example: Estimating the Means of k Gaussians
Scenario: data points are generated from a mixture of k Normal distributions (Gaussians).

Process:
1. Randomly select one of the k Gaussian distributions.
2. Generate a data point x_i from the chosen Gaussian.

Goal: find the maximum likelihood estimate of the means of these distributions; that is, a hypothesis h = ⟨μ₁, μ₂, ..., μ_k⟩ that maximises P(D|h).

Example Illustration: EM Algorithm for a 1D Gaussian Mixture Model (GMM)

Note: Using the EM algorithm, we can cluster the data into two Gaussian clusters. The EM algorithm provides soft assignments at each step, and the clusters may have overlapping boundaries.
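The worked illustration did not survive extraction; below is a minimal EM sketch for estimating the means of two 1D Gaussians, assuming a known common variance and equal mixing weights (the data and initial means are invented for illustration):

```python
import math

# EM for the means of two 1D Gaussians (known sigma, equal mixing weights).
data = [1.0, 1.2, 0.8, 4.9, 5.1, 5.3]  # illustrative sample
mu = [0.0, 6.0]                        # initial guesses for the two means
sigma = 1.0

def gauss(x, m):
    return math.exp(-((x - m) ** 2) / (2 * sigma ** 2))

for _ in range(20):
    # E-step: soft (fractional) assignment of each point to each Gaussian.
    resp = []
    for x in data:
        w = [gauss(x, m) for m in mu]
        total = sum(w)
        resp.append([wj / total for wj in w])
    # M-step: re-estimate each mean as a responsibility-weighted average.
    for j in range(len(mu)):
        mu[j] = (sum(r[j] * x for r, x in zip(resp, data))
                 / sum(r[j] for r in resp))

print([round(m, 2) for m in mu])  # converges near [1.0, 5.1]
```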

Challenges with the EM Algorithm


1. Slow convergence: the algorithm can converge slowly, especially near local optima.
2. Sensitivity to initialization: poor initialization can result in convergence to suboptimal solutions.
3. Computational cost: the E-step and M-step can be computationally expensive, depending on the complexity of the model.

Content Beyond Syllabus


K-means Algorithm for Clustering:
The K-means algorithm is a popular unsupervised machine learning technique used for clustering. Given a dataset, it partitions the data into K clusters such that each data point belongs to the cluster with the nearest mean.

K-means Algorithm:
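The algorithm listing itself was lost in extraction; a minimal 1D sketch of the standard procedure (assignment step plus mean-update step), with illustrative data:

```python
# K-means clustering in 1D (K = 2), with illustrative data and centres.
data = [1.0, 1.2, 0.8, 4.9, 5.1, 5.3]
centres = [0.0, 6.0]

for _ in range(10):
    # Assignment step: each point joins the cluster with the nearest centre.
    clusters = [[] for _ in centres]
    for x in data:
        j = min(range(len(centres)), key=lambda c: abs(x - centres[c]))
        clusters[j].append(x)
    # Update step: each centre moves to the mean of its assigned points.
    centres = [sum(pts) / len(pts) if pts else centres[j]
               for j, pts in enumerate(clusters)]

print(centres)  # approximately [1.0, 5.1]
```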

Key Relationship Between the EM Algorithm and the K-means Algorithm:


1. K-means is a simplified version of the EM algorithm for clustering.
2. K-means uses hard assignments and assumes spherical clusters, while EM provides soft assignments and models clusters as Gaussian distributions.
3. Both algorithms alternate between assigning points to clusters and updating parameters iteratively until convergence.

Both algorithms follow an iterative process that alternates between two steps:

Soft vs. Hard Assignments

Differences in the Model Assumptions

Mixture of Gaussians and K-means:


If we apply K-means to a dataset with Gaussian-like clusters, it may not perform well unless the clusters are spherical and well-separated. The EM algorithm for GMMs, however, will handle such data better because it accounts for the shape and spread (covariance) of each cluster.

Unit-4

Part 1: Computational Learning Theory

Introduction
• Computational Learning Theory (CoLT) is a subfield of artificial intelligence and theoretical computer science that focuses on understanding the theoretical foundations of machine learning.
• It provides a framework for analyzing the feasibility of learning algorithms and formalizes what it means for a machine to learn from data.
• The field investigates the capabilities and limitations of learning models and algorithms with the aim of determining whether certain tasks can be learned efficiently.
Computational Learning Theory explores questions such as:
1. What can be learned by a computer?
2. How much data is required to learn effectively?
3. How efficient are learning algorithms in terms of time and computational resources?
4. What guarantees can we provide about the generalization of learned models?
Important Models in CoLT
• The field draws on concepts from computer science, mathematics, and statistics to rigorously define learning and analyze learning algorithms' performance.
• Central to CoLT are different models of learning, such as Probably Approximately Correct (PAC) learning and the Vapnik–Chervonenkis (VC) dimension, which are used to describe the learnability of different classes of functions.
Key Notations and Definitions

Sample Complexity

Computational Complexity

Mistake Bound

True Error

Training Error

Relationship Between True Error and Training Error

PAC Learnability

PAC Learning

Example

Sample Complexity for Finite Hypothesis Spaces


• Sample complexity refers to the number of training examples needed to ensure that a learning algorithm can, with high probability, select a hypothesis from its hypothesis space that performs well on unseen data.
• When discussing finite hypothesis spaces, sample complexity helps determine how many examples are needed to guarantee that the learned hypothesis has an error rate close to the best possible hypothesis in the space.
Upper Bound on Sample Complexity

For a consistent learner and a finite hypothesis space H, it suffices to have

m ≥ (1/ε) (ln|H| + ln(1/δ))

training examples to guarantee that, with probability at least 1 − δ, every hypothesis consistent with the training data has true error at most ε.

Explanation of the Bound

Example
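The worked example did not survive extraction; an illustrative calculation under the bound above (the numbers are assumed): with |H| = 973, ε = 0.1, and δ = 0.05,

m ≥ (1/0.1) (ln 973 + ln(1/0.05)) ≈ 10 × (6.88 + 3.00) ≈ 99,

so about 99 training examples suffice.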

Sample Complexity for Finite Hypothesis Space



The Union Bound

Ensuring PAC Learning

Key Observations

Implications

Example

Summary

Sample Complexity in Infinite Hypothesis Spaces


In finite hypothesis spaces, we can often compute the sample complexity using tools like the union bound. In infinite hypothesis spaces, the challenge arises because it is impossible to directly enumerate hypotheses. Instead, we use the VC dimension to measure the complexity of the hypothesis space.

Infinite Hypothesis Spaces and Challenges


 In practical machine learning, hypothesis spaces are often infinite (e.g., linear functions, neural networks).
 The key questions:
1. How many examples are needed to ensure the hypothesis learned is close to the target function?
2. How does the complexity of the hypothesis space affect the sample complexity?
 Key insight: we cannot simply count hypotheses in infinite spaces. Instead, we measure the capacity of the hypothesis space using the VC dimension.

Vapnik–Chervonenkis (VC) Dimension

Relation Between VC Dimension and Sample Complexity

Implications for Learning Algorithms


 Practical implication: to design a hypothesis space with good generalization, balance:
o Complexity (VC dimension): ensure it is large enough to express the target function.
o Sample size: ensure sufficient data to learn accurately.
 Overfitting and VC dimension: hypothesis spaces with large VC dimension are prone to overfitting unless the number of training examples m is sufficiently large.

Summary

Mistake Bound Model


Definition: The Mistake Bound Model evaluates the performance of a learning algorithm by analysing the maximum number of mistakes it makes while learning a target concept. It focuses on learning in an online setting, where instances are presented sequentially and the algorithm must predict the class label before receiving feedback.

Goal:
Minimize the total number of mistakes across all examples.

Strengths of the Model:


 Provides a theoretical framework for evaluating online learning algorithms.
 Captures performance in terms of the number of errors rather than computational complexity.
 Applies to a wide range of hypothesis spaces.
Limitations:
 Assumes a worst-case scenario, which may be overly pessimistic.
 Does not account for probabilistic noise in the data.
 Focused only on classification tasks, not regression.
Comparison with the PAC Model:
 The Mistake Bound Model focuses on online learning and tracks mistakes sequentially.
 The PAC Model evaluates performance in a batch setting based on probabilistic guarantees.
Summary: The Mistake Bound Model is a foundational concept in machine learning that provides insight into how many errors a learning algorithm can tolerate in the worst case while converging to the correct target concept. It helps in designing and analysing robust algorithms, particularly in sequential learning scenarios.

Unit-4

Part 2: Instance-Based Techniques

k-Nearest Neighbour (k-NN) Learning:


k-NN is an instance-based learning algorithm. It belongs to the family of lazy learning algorithms because no explicit model is built during the training phase. Instead of learning a parametric model, k-NN directly uses the training instances to predict outcomes for unseen data.

Core Idea:
1. Given a query instance, the algorithm finds the k nearest neighbours from the training data (based on a distance metric).
2. The predicted class or value for the query instance is based on the majority class (for classification) or the average value (for regression) of those neighbours.
Algorithm

Note:
The k-NEAREST NEIGHBOR algorithm is easily adapted to approximating continuous-valued target functions. To approximate a real-valued target function f: Rⁿ → R, we replace the final line of the above algorithm by

f̂(x_q) = (Σ_{i=1}^{k} f(x_i)) / k
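A compact sketch of the classification variant (majority vote among the k nearest training instances); the data, k, and distance metric are illustrative:

```python
import math
from collections import Counter

# k-NN classification over (feature_vector, label) pairs; toy data.
train = [([1.0, 1.0], "A"), ([1.2, 0.9], "A"),
         ([5.0, 5.1], "B"), ([4.8, 5.3], "B"), ([5.2, 4.9], "B")]

def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def knn_classify(x_q, k=3):
    # Majority vote among the k training instances nearest to x_q.
    nearest = sorted(train, key=lambda t: euclidean(t[0], x_q))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_classify([5.0, 5.0]))  # 'B'
# For regression, replace the vote with the mean of the neighbours' values.
```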

A Note on Terminology
Much of the literature on nearest-neighbour methods and weighted local regression uses terminology that has arisen from the field of statistical pattern recognition. In reading that literature, it is useful to know the following terms:
1. Regression means approximating a real-valued target function.
2. Residual is the error f̂(x) − f(x) in approximating the target function.
3. Kernel function is the function of distance that is used to determine the weight of each training example. In other words, the kernel function is the function K such that w_i = K(d(x_q, x_i)).

Distance Metrics:
The choice of distance metric significantly impacts the performance of k-NN.
Common distance metrics include:
1. Euclidean distance (used for continuous data):

d(x, y) = √(Σ_j (x_j − y_j)²)

2. Manhattan distance (L1 norm):

d(x, y) = Σ_j |x_j − y_j|

3. Hamming distance (for categorical data):

Counts the number of mismatched attributes.

Choosing the Value of k:


1. Small k (e.g., k = 1): high sensitivity to noise, leading to overfitting.
2. Large k: more robust, but may lead to underfitting.
3. Rule of thumb:
A common practice is to take k = √n, where n is the size of the training dataset.
Weighted k-NN:
In some cases, closer neighbours are given more weight in the prediction.
Example: use an inverse-distance weighting scheme where closer points have higher influence, replacing the final line of the k-NN algorithm by

f̂(x_q) ← argmax_{v ∈ V} Σ_{i=1}^{k} w_i δ(v, f(x_i)), where w_i = 1 / d(x_q, x_i)²

We can distance-weight the instances for real-valued target functions in a similar fashion, replacing the final line of the algorithm in this case by

f̂(x_q) = (Σ_{i=1}^{k} w_i f(x_i)) / (Σ_{i=1}^{k} w_i)
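A sketch of the distance-weighted variant for a real-valued target, using w_i = 1/d(x_q, x_i)² on illustrative 1D data (the small eps guards against division by zero when the query coincides with a training point):

```python
# Distance-weighted k-NN regression with w_i = 1 / d(x_q, x_i)^2.
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)]  # (x, f(x)) pairs

def weighted_knn_regress(x_q, k=3, eps=1e-9):
    nearest = sorted(train, key=lambda t: abs(t[0] - x_q))[:k]
    weights = [1.0 / (abs(x - x_q) ** 2 + eps) for x, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

print(round(weighted_knn_regress(2.5), 2))  # about 4.9
```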

Strengths and Weaknesses:


Strengths:
1. Simple and intuitive.
2. No training phase or parameter tuning needed (beyond choosing k).
3. Flexible: can handle multi-class classification and regression.
Weaknesses:
1. Computationally expensive during prediction (distance computation).
2. Performance depends heavily on the choice of k and the distance metric.
3. Sensitive to irrelevant features and noisy data.

Applications of k-NN:


1. Image recognition: classifying images based on pixel similarity.
2. Recommender systems: identifying similar users or products for recommendations.
3. Anomaly detection: identifying outliers by measuring how close an instance is to its nearest neighbours.

Example Problem 1:
Given a dataset with three attributes (Age, Income, and Credit Score), predict whether a person will buy a product (Yes/No).

Steps:
1. Compute the Euclidean distance between the query instance and all training points.
2. Choose k = 3 (find the 3 nearest neighbours).
3. Use majority voting to assign a class label (Yes/No).

Example Problem 2:
Solve Example Problem 1 above using weighted k-NN.

Example Problem 3:

Weighted k-NN Regression Prediction: 297598.77

Summary of Predictions:

Summary:
1. k-NN is a powerful, non-parametric algorithm that works well for classification and regression.
2. Its main drawback is the high computational cost during prediction, especially with large datasets.
3. k-NN performs better when the feature space is small and the relevant features are carefully selected.

Locally Weighted Regression (LWR):


Locally Weighted Regression (LWR) is an instance-based learning algorithm that fits a separate model for each query point by weighting nearby data points more heavily than distant ones. Unlike global models such as linear regression, LWR constructs a local approximation specific to each query, making it a non-parametric learning approach. It is particularly useful for non-linear data patterns where global models cannot accurately capture the relationships between variables.

Applications of LWR:


 Robot control: used to model dynamics and predict local behaviour in robotics.
 Time-series forecasting: helps predict future values by fitting local trends to historical data.
 Geospatial data modelling: captures local variations in spatial data, like temperature or pollution levels.

Training Algorithm for Locally Weighted Regression (LWR) Using Gradient Descent:

Summary:
This algorithm fits a local linear model for each query point by iteratively updating the weights w using gradient descent. The kernel function ensures that only nearby points have significant influence, making the regression localized. The training process continues until the model converges, after which it can predict values based on the optimized weights.

Locally Weighted Regression Example (Using Gradient Descent):
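The worked example did not survive extraction; a minimal sketch of LWR for a single query point, using a Gaussian kernel and gradient descent on the kernel-weighted squared error (the data, bandwidth, and learning rate are illustrative assumptions):

```python
import math

# Locally weighted linear regression for a single query point x_q.
data = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
x_q, tau, lr = 2.5, 1.0, 0.01  # query point, kernel bandwidth, learning rate

def kernel(x):
    # Gaussian weight: nearby points dominate the local fit.
    return math.exp(-((x - x_q) ** 2) / (2 * tau ** 2))

w0, w1 = 0.0, 0.0  # local linear model f(x) = w0 + w1 * x
for _ in range(2000):
    g0 = g1 = 0.0
    for x, y in data:
        err = (w0 + w1 * x) - y
        k = kernel(x)
        g0 += k * err      # gradient of the kernel-weighted squared error
        g1 += k * err * x
    w0 -= lr * g0
    w1 -= lr * g1

print(round(w0 + w1 * x_q, 2))  # local prediction at x_q, roughly 2.5
```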



Radial Basis Function (RBF) Networks:


 An RBF network is a type of artificial neural network used for function approximation.
 RBF networks model the target function as a weighted sum of radial basis functions, typically Gaussian functions.
Key Structure:
 Input layer: passes the input to the hidden layer without weights.
 Hidden layer: each neuron represents a radial basis function.
 Output layer: combines the hidden-layer outputs linearly to generate predictions.

How RBF Networks Learn:

Training Process for RBF Networks:

Summary:
1. Select centres: place RBF neurons at key points (possibly through clustering).
2. Calculate activations: use the Gaussian kernel to compute how much influence each RBF neuron has for a given input.
3. Optimize weights: use least squares to find the weights that minimize the prediction error.
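A minimal sketch of this three-step procedure using NumPy (the centres, kernel width, and data are illustrative; the weights are fit by linear least squares on the Gaussian activations):

```python
import numpy as np

# Toy 1D regression target: y = sin(x) on [0, 2*pi].
x = np.linspace(0, 2 * np.pi, 40)
y = np.sin(x)

centres = np.linspace(0, 2 * np.pi, 6)   # step 1: place the RBF centres
sigma = 0.8                              # assumed Gaussian kernel width

def activations(xs):
    # Step 2: Gaussian activation of every hidden unit for every input.
    return np.exp(-((xs[:, None] - centres[None, :]) ** 2) / (2 * sigma ** 2))

phi = activations(x)
w, *_ = np.linalg.lstsq(phi, y, rcond=None)  # step 3: least-squares weights

x_test = np.array([1.0, 4.0])
print(activations(x_test) @ w)  # close to [sin(1.0), sin(4.0)]
```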

A radial basis function network: each hidden unit produces an activation determined by a Gaussian function centred at some instance x_u. Therefore, its activation will be close to zero unless the input x is near x_u. The output unit produces a linear combination of the hidden-unit activations. Although the network described here has just one output, multiple output units can also be included.

Diagram Summary
In the diagram, you would typically see:
 Input nodes connected to each hidden node (RBF neuron).
 RBF neurons in the hidden layer, each receiving the input vector and computing its activation.
 Weights associated with each connection from hidden neurons to the output layer.
 A single output node that aggregates the weighted activations to produce the predicted value.
This architecture enables RBF networks to model nonlinear relationships by combining local approximations (via Gaussian kernels) with a global linear combination at the output layer.

Note:

Solution:

Steps to follow:

Case-Based Reasoning (CBR)


Case-Based Reasoning is an approach in machine learning where new problems are solved by referencing or adapting solutions from previously encountered cases or examples. Unlike methods that generalize from training data to build a model, CBR relies on a memory of specific instances.
Key Steps in Case-Based Reasoning
1. Retrieve: Given a new problem, retrieve the most similar cases from memory. This involves defining a similarity measure that can match the new case with those stored in memory.
2. Reuse: Once a similar case or set of cases is retrieved, reuse the solution (or parts of it) for the new problem. This step may require adaptation if the old solution does not exactly fit the new context.
3. Revise: After applying the retrieved solution, test it. If necessary, revise or adapt the solution to improve it for the current problem.
4. Retain: Once the solution is successfully applied, it is stored as a new case in memory for future use. This retention allows the system to improve its knowledge base over time.
Advantages of CBR
1. Efficiency with limited data: CBR works well even with a smaller number of cases, making it suitable for applications where data is sparse.
2. Incremental learning: new cases are continually added to the memory, allowing the system to learn over time without retraining.
3. Interpretability: since solutions are adapted from actual cases, CBR offers interpretable outcomes and explanations.
Applications of CBR
CBR is widely used in domains where historical cases are accessible, such as:

 Medical diagnosis, where patient cases help diagnose similar future patients.
 Technical support and troubleshooting, where past solutions can be adapted for new issues.
 Legal reasoning, where previous legal cases inform judgments in new cases.
Summary
CBR's strength lies in its ability to adapt previous knowledge directly to new problems, which is especially powerful in contexts where cases do not follow a strict generalization rule. It is ideal for tasks where exceptions are common or complex adaptations are needed.

Example
Imagine CADET's library includes a case of a small irrigation pump with a flow rate of 10 liters per minute and a pressure of 5 psi. The new problem requires a pump with 10 liters per minute but with a higher pressure of 8 psi.
1. Retrieve: CADET retrieves the small irrigation pump case, recognizing that it meets the flow-rate requirement.
2. Reuse: CADET reuses much of the design, such as the general structure and configuration.
3. Revise: To meet the higher-pressure requirement, CADET modifies the pump by increasing the impeller size or using a more powerful motor, ensuring it can achieve 8 psi.
4. Retain: The revised pump design is stored as a new case with specifications for a 10 L/min, 8 psi water pump.
Adaptation Techniques in CADET
CADET's adaptation is based on both similarity metrics and specific engineering rules, such as:

 Increasing power to handle higher pressures.

 Altering materials based on durability requirements for different pressures or flow rates.

Remarks on Lazy and Eager Learning


Eager Learning: In eager learning, the model is constructed in advance of any query. This involves generalizing from the training data to create a model that can be used for prediction. Eager learners typically build a complete model during the training phase, which can be computationally expensive but allows for quick predictions once the model is built. Examples include decision trees, neural networks, and support vector machines.
Lazy Learning: In contrast, lazy learning does not construct a general model until a query is made. Instead, it retains the training instances and uses them directly to make predictions. This can be more efficient in terms of the time taken during the training phase, but may result in slower predictions because it must process the training data at query time. Examples of lazy learning include k-nearest neighbours (k-NN) and locally weighted regression.

Key Characteristics:


1. Generalization vs. Memorization:

 Eager learning emphasizes generalization, aiming to abstract patterns from the training data.
 Lazy learning focuses on memorization, retaining the original instances for later use.

2. Time Complexity:

 Eager learners require more time and computational resources during the training phase, as they need to analyse the data and construct a model.
 Lazy learners are quick to train, as they simply store the training data, but may require more time to make predictions since they analyse the stored data at query time.
3. Memory Usage:

 Eager learning typically uses less memory at query time, since it works with a model rather than storing all instances.
 Lazy learning may require significant memory if the training dataset is large, as it must keep all instances accessible for querying.
4. Flexibility:

 Eager learners can sometimes struggle with changes in the underlying data distribution, as retraining the model is necessary.
 Lazy learners can adapt to changes in the data more easily, since they can incorporate new instances dynamically during prediction.
5. Performance and Application Context:

 Eager learning methods may perform better in scenarios where the dataset is large and a general model can effectively capture the relationships in the data.
 Lazy learning may excel in cases where data is sparse or when predictions must be made based on local relationships in the data.

Use Cases:

 Eager learning is often used in applications where the cost of computation during training can be justified by the need for fast predictions, such as in online services.
 Lazy learning is useful in applications where real-time updates are critical, such as recommendation systems that adapt to user preferences.
Conclusion:
Both lazy and eager learning methods have their advantages and disadvantages, and the choice between them often depends on the specific problem context, the size and nature of the dataset, and the computational resources available. Understanding these concepts is crucial for selecting appropriate learning algorithms in practical machine learning applications.

Unit-5
Genetic Algorithms
A Biological Motivation
Genetic Algorithms (GAs) are inspired by the process of evolution in the natural world. Biological evolution provides a framework for solving complex problems through the principles of natural selection, genetic variation, and survival of the fittest.
In nature:
 Populations of organisms evolve over time to adapt to their environments.
 Evolution is driven by:
1. Natural Selection: The environment "selects" individuals with traits that give them an advantage in survival and reproduction.
2. Genetic Variation: Traits are passed from parents to offspring, with occasional random changes (mutations) introducing new traits.
3. Crossover (Recombination): When two parents reproduce, their genetic material combines to create offspring with mixed traits.
4. Fitness: The "fit" individuals (those better adapted) are more likely to survive and reproduce.
Visualizing the Biological Connection
1. Imagine a population of animals evolving to escape predators.
o Fitness = speed of running.
o Better runners survive, reproduce, and pass on faster-running traits.
2. In GAs, imagine solving an equation where:
o Fitness = how close a solution is to the correct answer.
o Over generations, solutions "evolve" to converge on the answer.

Translating Biology to Problem Solving

In evolutionary algorithms, genes, chromosomes, and their representations are inspired by biological concepts.

Gene
 A gene is the smallest unit of information in the solution encoding. It represents a single variable or parameter of the problem.
 In optimization problems, a gene might be a value that contributes to the solution (e.g., a decision variable or a parameter to be optimized).
Representation of a Gene
1. Binary: 0 or 1 (common in Genetic Algorithms).
2. Integer: A discrete value (e.g., 5, 10).
3. Real Number: A continuous value (e.g., 3.14, -7.6).
4. Symbol: Non-numeric entities used in genetic programming or symbolic computation.

Chromosome
 A chromosome is a collection of genes arranged in a specific order. It represents an individual candidate solution to the optimization problem.
 In essence, it is a data structure that holds the genes and encapsulates the solution.

Representation of a Chromosome
1. Binary Encoding: A sequence of 0s and 1s.
o Example: 101011
2. Integer Encoding: A sequence of integers.
o Example: [3, 5, 7, 9]
3. Real-valued Encoding: A sequence of real numbers.
o Example: [1.5, -2.3, 4.0, 7.8]
4. Tree Structure: Used in Genetic Programming.
o Example: A syntax tree representing a mathematical expression like (x + y) * z.
5. Permutation Encoding: Used in problems like the Traveling Salesman Problem.
o Example: [3, 1, 4, 2] (indicating a path through cities 3 → 1 → 4 → 2).

Definition: A genetic algorithm is a general optimization method that searches through a large space of candidate solutions, aiming to find the one with the highest fitness.

The Genetic Algorithm Process


The genetic algorithm operates by iteratively updating a pool of hypotheses, called the population. On each iteration, all members of the population are evaluated according to the fitness function. A new population is then generated by probabilistically selecting the most fit individuals from the current population. Some of these selected individuals are carried forward into the next-generation population intact. Others are used as the basis for creating new offspring individuals by applying genetic operations such as crossover and mutation.
1. Initialization: Generate an initial population of potential solutions randomly.
2. Fitness Evaluation: Calculate the fitness of each solution using the fitness function.
3. Selection: Select solutions with higher fitness for reproduction using methods like roulette-wheel selection, tournament selection, or rank-based selection.
4. Crossover: Combine parts of two selected solutions to create new offspring.
5. Mutation: Introduce small random changes to offspring to maintain diversity.
6. Replacement: Replace the old population with the new one.
7. Termination: Repeat steps 2–6 until a stopping condition is met (e.g., maximum generations, desired fitness).

Example: Solving the Traveling Salesperson Problem (TSP)


In TSP, the goal is to find the shortest route for visiting a set of cities and returning to the starting city.
1. Representation:
o Chromosome: A sequence of city indices (e.g., [1, 3, 5, 2, 4]).
2. Fitness Function:
o Inverse of the total distance travelled.
3. Population:
o A set of random city sequences.
4. Operators:
o Crossover: Combine segments of two routes.
o Mutation: Swap cities or reverse a segment.
5. Evolution:
o Over generations, routes evolve to minimize the total distance.

Key Benefits of Genetic Algorithms


 Exploration vs. Exploitation:
o Crossover explores new areas of the solution space.
o Mutation ensures diversity and avoids premature convergence.
 Robustness:
o GAs are flexible and can handle noisy, complex, or discontinuous functions.

A Prototypical Genetic Algorithm:
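The prototypical algorithm listing was lost in extraction; a minimal sketch of the standard loop (fitness-proportionate selection, single-point crossover, bit-flip mutation) on the toy task of maximizing the number of 1-bits (the task and all parameters are illustrative assumptions):

```python
import random

# Prototypical GA: maximize the number of 1s in a fixed-length bit string.
LENGTH, POP_SIZE, GENERATIONS, P_MUTATION = 12, 20, 40, 0.02

def fitness(s):
    return sum(s)

population = [[random.randint(0, 1) for _ in range(LENGTH)]
              for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    # Fitness-proportionate (roulette-wheel) selection of parents.
    weights = [fitness(s) + 1 for s in population]  # +1 avoids zero weights
    next_gen = []
    while len(next_gen) < POP_SIZE:
        p1, p2 = random.choices(population, weights=weights, k=2)
        point = random.randint(1, LENGTH - 1)  # single-point crossover
        child = p1[:point] + p2[point:]
        child = [b ^ 1 if random.random() < P_MUTATION else b for b in child]
        next_gen.append(child)
    population = next_gen

print(max(population, key=fitness))  # typically all (or nearly all) 1s
```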

Representing Hypotheses
Hypotheses in GAs are often represented by bit strings so that they can be easily manipulated by genetic operators such as mutation and crossover. The hypotheses represented by these bit strings can be quite complex: even if-then rules can be encoded as bit strings for easy manipulation by the genetic operators. For example, consider the PlayTennis problem. Let us represent all the attribute values as bit strings.
Attributes and Their Values
1. Outlook (3 values: Sunny, Overcast, Rain)
o 3-bit string:
 Sunny → 100
 Overcast → 010
 Rain → 001
 "Don't care" (all values allowed) → 111
2. Temperature (3 values: Hot, Mild, Cool)
o 3-bit string:
 Hot → 100
 Mild → 010
 Cool → 001
 "Don't care" → 111

3. Humidity (2 values: High, Normal)

o 2-bit string:
 High → 10
 Normal → 01
 "Don't care" → 11
4. Wind (2 values: Strong, Weak)
o 2-bit string:
 Strong → 10
 Weak → 01
 "Don't care" → 11
5. PlayTennis (2 values: Yes, No)
o 2-bit string:
 Yes → 10
 No → 01
 "Don't care" → 11

Encoding a Complete Rule


A rule includes preconditions (combinations of attribute constraints) and a postcondition (outcome). For example:
Rule: IF Outlook = Overcast AND Humidity = Normal THEN PlayTennis = Yes
1. Outlook = Overcast → 010
2. Temperature = Don't care → 111
3. Humidity = Normal → 01
4. Wind = Don't care → 11
5. PlayTennis = Yes → 10
Full Bit String Encoding: 010 111 01 11 10
General Rule Representation
Every rule will be a fixed-length bit string where:
 The first 3 bits represent Outlook.
 The next 3 bits represent Temperature.
 The next 2 bits represent Humidity.
 The next 2 bits represent Wind.
 The last 2 bits represent PlayTennis.
This ensures all hypotheses are encoded in a consistent format for use in genetic algorithms.
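A small sketch of this encoding scheme (the attribute layout is exactly the one listed above; the helper function name is invented for illustration):

```python
# Encode a PlayTennis rule as a fixed-length bit string.
LAYOUT = [("Outlook", ["Sunny", "Overcast", "Rain"]),
          ("Temperature", ["Hot", "Mild", "Cool"]),
          ("Humidity", ["High", "Normal"]),
          ("Wind", ["Strong", "Weak"]),
          ("PlayTennis", ["Yes", "No"])]

def encode_rule(constraints):
    """constraints: dict attribute -> value; a missing attribute means
    "don't care" (all bits set to 1)."""
    parts = []
    for attr, values in LAYOUT:
        if attr in constraints:
            bits = ["1" if v == constraints[attr] else "0" for v in values]
        else:
            bits = ["1"] * len(values)  # don't care: all values allowed
        parts.append("".join(bits))
    return " ".join(parts)

rule = {"Outlook": "Overcast", "Humidity": "Normal", "PlayTennis": "Yes"}
print(encode_rule(rule))  # 010 111 01 11 10
```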

Genetic Operators
Genetic operators are mechanisms inspired by biological evolution that help generate new
candidates (successors) in a Genetic Algorithm. These successors are typically formed by
recombining or mutating members of the current population.
Main Genetic Operators
A. Crossover
B. Mutation
C. Specialized Operators

A. Crossover
Crossover is the most commonly used operator; it generates offspring by combining bits
or segments from two parent individuals.

Types of Crossovers
1. Single-Point Crossover:

 The crossover mask is structured as contiguous 1s followed by 0s.
 A random crossover point n is chosen, splitting the bit string into two parts.
Example:

 Parent 1: 11001100100
 Parent 2: 10101010101
 Crossover Point: n = 5
 Crossover Mask: 11111000000
 Offspring:
o Offspring 1: 11001010101 (first 5 bits from Parent 1, remaining 6 bits from Parent 2)
o Offspring 2: 10101100100 (first 5 bits from Parent 2, remaining 6 bits from Parent 1)

2. Two-Point Crossover:

 The crossover mask contains a contiguous segment of 1s in the middle, flanked by 0s.
 Two numbers (n0, n1) are chosen randomly: the mask begins with n0 zeros followed by a
segment of n1 ones, and the bits covered by the 1s are exchanged between the parents.

Example:

 Parent 1: 11001100100
 Parent 2: 10101010101
 Crossover Mask: 00111110000 (n0 = 2, n1 = 5)
 Offspring:
 Offspring 1: 10001100101 (bits 3–7 from Parent 1, the rest from Parent 2)
 Offspring 2: 11101010100 (bits 3–7 from Parent 2, the rest from Parent 1)

3. Uniform Crossover:
 A random bit string serves as the crossover mask.
 Each bit is chosen randomly and independently to determine which parent contributes
that bit.
Example:

 Parent 1: 11001100100
 Parent 2: 10101010101
 Random Mask: 10110100101
 Offspring:
o Offspring 1: 10001110100 (from Parent 1 where the mask is 1, from Parent 2 where it is 0)
o Offspring 2: 11101000101 (the complementary selection)
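All three variants reduce to one rule: offspring 1 copies Parent 1 wherever the mask is 1 and Parent 2 wherever it is 0, while offspring 2 does the reverse. A minimal sketch (function names are illustrative):

```python
import random

def crossover(p1, p2, mask):
    # Offspring 1 takes bits from p1 where mask=1, from p2 where mask=0;
    # offspring 2 is the complementary selection.
    o1 = "".join(a if m == "1" else b for a, b, m in zip(p1, p2, mask))
    o2 = "".join(b if m == "1" else a for a, b, m in zip(p1, p2, mask))
    return o1, o2

def single_point_mask(length, n):
    return "1" * n + "0" * (length - n)

def two_point_mask(length, n0, n1):
    # n0 leading 0s, then a segment of n1 ones, 0s for the rest
    return "0" * n0 + "1" * n1 + "0" * (length - n0 - n1)

def uniform_mask(length):
    return "".join(random.choice("01") for _ in range(length))

p1, p2 = "11001100100", "10101010101"
print(crossover(p1, p2, single_point_mask(11, 5)))  # ('11001010101', '10101100100')
print(crossover(p1, p2, two_point_mask(11, 2, 5)))  # ('10001100101', '11101010100')
print(crossover(p1, p2, "10110100101"))             # ('10001110100', '11101000101')
```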
B. Mutation
Mutation introduces small, random changes to an individual, allowing exploration of the
solution space.
 Typically applied after crossover.
 A random bit position is selected, and its value is flipped.

Example:

 Original: 11001100100
 Mutation Position: 6 (bit flipped)
 Mutated Offspring: 11001000100
Mutation ensures diversity and prevents the algorithm from being trapped in local optima.

C. Specialized Operators
These are tailored to the problem or hypothesis representation. Examples include:
 Rule Specialization/Generalization: Operators replace or modify specific parts of rules
to broaden or narrow their applicability.
 Example Systems:
o Grefenstette et al. (1991): A system for learning robot control rules using a
rule-specialization operator.
o Janikow (1993): A system for learning rules by explicitly modifying conditions
(e.g., replacing a condition with "don't care").

Summary
1. Crossover explores new combinations of existing solutions.
2. Mutation introduces variation to escape local optima.
3. Specialized operators adapt GAs to problem-specific requirements.

Exercise Problems:
1. Given the following parent strings and a single-point crossover at position n=3,
determine the offspring:
o Parent 1: 1011011
o Parent 2: 1100101
2. Perform a two-point crossover on the following parent strings using n0=2 and n1=5:
o Parent 1: 11110000
o Parent 2: 00001111
3. Simulate uniform crossover using the following parents and mask. Generate the two
offspring:
o Parent 1: 10101010
o Parent 2: 11001100
o Mask: 10110011
4. Apply mutation to the following bit strings by flipping the bit at the given positions:
o String: 11010101, Mutation at position 4
o String: 00101011, Mutation at position 1

Fitness Function and Selection

Overview of Fitness Function and Selection


In the context of genetic algorithms (GAs), fitness functions and selection mechanisms are
essential for simulating natural selection. They guide the evolution of candidate solutions
(individuals) in a population toward an optimal solution.

Key Concepts
1. Fitness Function:
o A mathematical representation used to evaluate the quality (or "fitness") of
each individual in the population.
o Higher fitness values indicate better solutions.
o Fitness functions depend on the specific problem domain and the
representation of candidate solutions.
2. Selection:
o The process of choosing individuals from the current population to generate
the next generation.
o Selection favours individuals with higher fitness, giving them a higher chance
of passing on their genetic material.
o Common selection methods include:
 Proportional Selection: Probability of selection is proportional to an
individual’s fitness.
 Rank-Based Selection: Individuals are ranked based on fitness, and
selection probabilities are assigned according to rank.
 Tournament Selection: A subset of individuals is chosen randomly, and
the best individual in the subset is selected.
 Elitism: Ensures the best individuals are carried over to the next
generation unchanged.
3. Proportional Selection Mechanism:
o Often implemented using a roulette wheel analogy:
 Each individual gets a slice of the wheel proportional to its fitness.
 Spinning the wheel randomly selects individuals, with higher fitness
individuals more likely to be chosen.

Illustrated Example for Roulette Wheel Selection



 Spin 1: Random number r1=0.2 → falls in B's range → Select B.


 Spin 2: Random number r2=0.7 → falls in D's range → Select D.
 Repeat for all spins to form the next generation.
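Assuming illustrative fitness values whose cumulative ranges match the two spins above (upper bounds A: 0.10, B: 0.30, C: 0.45, D: 1.00; the values are not from the original table), roulette-wheel selection can be sketched as:

```python
import random
from itertools import accumulate

# Assumed fitness values: selection probabilities 0.10, 0.20, 0.15, 0.55
# give cumulative upper bounds A: 0.10, B: 0.30, C: 0.45, D: 1.00
population = ["A", "B", "C", "D"]
fitness = [10, 20, 15, 55]

def roulette_select(population, fitness, r=None):
    total = sum(fitness)
    # Cumulative probability boundaries: each individual's slice of the wheel
    bounds = list(accumulate(f / total for f in fitness))
    r = random.random() if r is None else r
    for individual, upper in zip(population, bounds):
        if r <= upper:
            return individual
    return population[-1]

print(roulette_select(population, fitness, r=0.2))   # falls in B's range -> 'B'
print(roulette_select(population, fitness, r=0.7))   # falls in D's range -> 'D'
```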

Solving an Optimization Problem using a Genetic Algorithm

Generation 1

Best Individual in the New Population (Generation 2): 25 with fitness 625.
Repeat for Generation 2:
1. Selection: Using fitness values, select pairs for crossover.
2. Crossover: Perform single-point crossover.
3. Mutation: Introduce random mutations.
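The surviving figures above (best individual 25 with fitness 625) match the classic task of maximizing f(x) = x² over 5-bit integers x in [0, 31]; a minimal sketch of one generation under that assumption (all names illustrative):

```python
import random

def fitness(bits):
    x = int(bits, 2)        # decode the 5-bit chromosome to an integer
    return x * x            # objective: maximize f(x) = x^2

def one_generation(pop):
    scores = [fitness(b) for b in pop]
    total = sum(scores)
    def select():           # fitness-proportionate (roulette-wheel) selection
        r, acc = random.uniform(0, total), 0
        for b, s in zip(pop, scores):
            acc += s
            if acc >= r:
                return b
        return pop[-1]
    new_pop = []
    while len(new_pop) < len(pop):
        p1, p2 = select(), select()
        n = random.randint(1, 4)                  # single-point crossover
        c1, c2 = p1[:n] + p2[n:], p2[:n] + p1[n:]
        for c in (c1, c2):
            if random.random() < 0.05:            # occasional mutation
                i = random.randrange(5)
                c = c[:i] + ("1" if c[i] == "0" else "0") + c[i + 1:]
            new_pop.append(c)
    return new_pop[:len(pop)]

pop = ["11001", "01010", "10011", "00111"]        # e.g. 25, 10, 19, 7
best = max(pop, key=fitness)
print(best, fitness(best))                        # 11001 -> 625
```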

Why Did the Best Individual Decrease after the 1st Generation?

This scenario illustrates that genetic algorithms do not always maintain the current
"best" solution across generations, especially if no elitism is applied: selection is
probabilistic, and crossover and mutation can disrupt the best individual's bit string.

Conclusion
The temporary drop in the best fitness is a natural part of how genetic algorithms work. It's a
trade-off between exploring new solutions and exploiting the current best solutions. With enough
generations and a balance between crossover and mutation, the algorithm typically converges
to the global or a near-global optimum.
Analogy: Think of climbing a mountain. Sometimes, you may step down a little to find a better
path to the peak. Similarly, the algorithm sacrifices short-term gains (losing the best individual)
to explore more of the solution space.

In optimization problems, is the fitness function always the objective function?

Not necessarily! While the fitness function is often derived from the
objective function in optimization problems, the two are not always identical. The
fitness function is a measure used to evaluate how good a solution is within the context
of a genetic algorithm or other heuristic methods, and it may involve additional
modifications or transformations of the objective function.

Key Takeaways
 The objective function defines the goal of the optimization problem (maximize or
minimize).
 The fitness function is used internally by the algorithm to evaluate solutions, often
adapted for constraints, maximization, or algorithm-specific needs.
In short: The fitness function is tailored to the problem and the algorithm, while the objective
function is the mathematical expression of the optimization goal.
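Two common adaptations, inverting a minimization objective so that higher fitness is better, and adding a penalty for constraint violations, can be sketched as follows (the example objective, constraint, and penalty weight are illustrative):

```python
def objective(x):
    # Minimization objective: squared distance from the target value 3
    return (x - 3) ** 2

def fitness(x, penalty_weight=100.0):
    # GA-internal measure: higher is better, so invert the objective...
    f = 1.0 / (1.0 + objective(x))
    # ...and subtract a penalty for violating a constraint, e.g. x >= 0
    if x < 0:
        f -= penalty_weight * (-x)
    return f

print(objective(3), fitness(3))   # 0 1.0 -> best objective gives best fitness
print(objective(5), fitness(5))   # 4 0.2
print(fitness(-1))                # infeasible point is heavily penalized
```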

Genetic Algorithm for Concept Learning:

A genetic algorithm (GA) is a general optimization method that searches a large space of
candidate objects seeking the one that performs best according to a fitness function. Although
not guaranteed to find an optimal solution, GAs often succeed in finding an object with high
fitness. They have been applied to numerous optimization problems beyond machine
learning, including circuit layout and job-shop scheduling, as well as to function-approximation
tasks and selecting network topologies for neural networks.
To illustrate the use of GAs for concept learning, we can look at the GABIL system described
by DeJong et al. (1993). GABIL uses a GA to learn boolean concepts represented by a
disjunctive set of propositional rules. In their experiments, GABIL's generalization accuracy
was comparable to that of other learning algorithms such as C4.5 (a decision tree learning
algorithm) and AQ14 (a rule learning system). The tasks included artificial learning problems
and real-world problems such as breast cancer diagnosis.
The algorithm employed by GABIL is exactly the prototypical Genetic Algorithm
described above. Key parameters used in their experiments include:
 Crossover fraction (r): Set to 0.6, determining the fraction of the parent population
replaced by crossover.
 Mutation rate (m): Set to 0.001, which is a typical value.
 Population size (p): Varied between 100 and 1000, depending on the learning task.
 Crossover frac on (r): Set to 0.6, determining the frac on of the parent popula on
replaced by crossover.
 Muta on rate (m): Set to 0.001, which is a typical value.
 Popula on size (p): Varied between 100 and 1000, depending on the learning task.
The specific instantiation of the GA algorithm in GABIL involves:
Representation: Each hypothesis corresponds to a disjunctive set of propositional rules
encoded as bit strings. For example, in a hypothesis space where rule preconditions are
conjunctions of constraints over two boolean attributes a1 and a2, and the rule postcondition
is a single bit indicating the value of the target attribute c, the rule set
IF a1 = T AND a2 = F THEN c = T; IF a2 = T THEN c = F
would be represented by the string
10 01 1 11 10 0
where each rule contributes two bits for a1, two bits for a2, and one bit for c, and the
substring 11 denotes "don't care."
Note the length of the bit string grows with the number of rules in the hypothesis. This
variable bit-string length requires a slight modification to the crossover operator, as described
below.
Genetic operators. GABIL uses the standard mutation operator, in which a single bit is chosen
at random and replaced by its complement. The crossover operator that it uses is a fairly
standard extension of the two-point crossover operator. In particular, to accommodate the
variable-length bit strings that encode rule sets, and to constrain the system so that crossover
occurs only between like sections of the bit strings that encode rules, the following approach
is taken. To perform a crossover operation on two parents, two crossover points are first
chosen at random in the first parent string. Let d1 (d2) denote the distance from the leftmost
(rightmost) of these two crossover points to the rule boundary immediately to its left. The
crossover points in the second parent are now randomly chosen, subject to the constraint that
they must have the same d1 and d2 values. For example, if the two parent strings are
h1: 10 01 1 11 10 0
h2: 01 11 0 10 01 0
and the crossover points chosen for the first parent are the points following bit positions 1
and 8,
h1: 1[0 01 1 11 1]0 0
where "[" and "]" indicate crossover points, then d1 = 1 and d2 = 3. Hence the allowed pairs
of crossover points for the second parent include the pairs of bit positions (1,3), (1,8), and
(6,8). If the pair (1,3) happens to be chosen,
h2: 0[1 1]1 0 10 01 0
then the two resulting offspring will be
h3: 11 10 0
h4: 00 01 1 11 11 0 10 01 0
As this example illustrates, this crossover operation enables offspring to contain a different
number of rules than their parents, while assuring that all bit strings generated in this fashion
represent well-defined rule sets.
Fitness function. The fitness of each hypothesized rule set is based on its classification
accuracy over the training data. In particular, the function used to measure fitness is
Fitness(h) = (correct(h))²
where correct(h) is the percent of all training examples correctly classified by hypothesis h.
In experiments comparing the behaviour of GABIL to decision tree learning algorithms such
as C4.5 and ID5R, and to the rule learning algorithm AQ14, DeJong et al. (1993) report roughly
comparable performance among these systems, tested on a variety of learning problems. For
example, over a set of 12 synthetic problems, GABIL achieved an average generalization
accuracy of 92.1%, whereas the performance of the other systems ranged from 91.2% to
96.6%.

Hypothesis Space Search

The Genetic Algorithm (GA) employs a randomized beam search approach to explore the
hypothesis space, differing significantly from methods like neural network backpropagation.
While backpropagation follows a smooth gradient descent to incrementally adjust
hypotheses, GAs make abrupt changes, replacing a parent hypothesis with an offspring that
may be very different. This characteristic makes GAs less prone to local minima compared to
gradient descent.
A challenge in GAs is crowding, where highly fit individuals reproduce excessively, reducing
population diversity and slowing progress. Strategies to mitigate crowding include:
1. Selection modifications, such as tournament or rank selection, instead of fitness-
proportionate selection.
2. Fitness sharing, which reduces an individual's fitness score when similar individuals
are present (see the sketch after this list).
3. Recombination restrictions, limiting mating to similar individuals to form clusters or
subspecies.
4. Spatial distribution, allowing only nearby individuals to recombine.
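A minimal sketch of fitness sharing (strategy 2), using Hamming distance between bit strings and a triangular sharing kernel; the sharing radius is an illustrative parameter:

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def shared_fitness(pop, raw_fitness, radius=3):
    # Each individual's fitness is divided by a niche count that grows
    # with the number of similar individuals, discouraging crowding.
    shared = []
    for i, ind in enumerate(pop):
        niche = sum(max(0.0, 1.0 - hamming(ind, other) / radius)
                    for other in pop)   # includes self (distance 0 -> 1.0)
        shared.append(raw_fitness[i] / niche)
    return shared

pop = ["11001", "11000", "00110"]
raw = [25.0, 24.0, 9.0]
print(shared_fitness(pop, raw))
# The two similar individuals share a niche, so their fitness drops more
# than that of the isolated one.
```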

Population Evolution and the Schema Theorem

The schema theorem of Holland (1975) characterizes mathematically how the population
evolves over time, in terms of schemas. A schema is any string composed of 0s, 1s, and *s,
where * means "don't care"; for example, the schema 0*10 represents the strings 0010 and
0110. The theorem describes E[m(s, t+1)], the expected number of instances of schema s in
the population at generation t+1. Considering fitness-proportionate selection alone,
E[m(s, t+1)] = (u(s,t) / f(t)) · m(s,t)
where m(s,t) is the number of instances of schema s in the population at time t, f(t) is the
average fitness of the population at time t, and u(s,t) is the average fitness of the instances
of s at time t.

Interpretation of Equation
A schema whose average fitness is above the population average is expected to gain instances
in the next generation, while a below-average schema is expected to lose instances, in
proportion to the ratio u(s,t) / f(t).

Full Schema Theorem
Accounting for the disruptive effects of single-point crossover and mutation as well,
E[m(s, t+1)] ≥ (u(s,t) / f(t)) · m(s,t) · (1 - pc · d(s)/(l - 1)) · (1 - pm)^o(s)

Components of Equation
 pc: probability that single-point crossover is applied to an individual.
 pm: probability that any single bit is mutated.
 l: length of the individual bit strings.
 o(s): order of s, the number of defined (non-*) bits in s.
 d(s): defining length of s, the distance between its leftmost and rightmost defined bits.

Example Illustration:
Consider the schema s = 1***0 over strings of length l = 5, so o(s) = 2 and d(s) = 4. With
pc = 0.6 and pm = 0.001, the probability that s survives crossover is at least
1 - 0.6·(4/4) = 0.4, and the probability it survives mutation is (0.999)² ≈ 0.998. If instances
of s are twice as fit as the population average and m(s,t) = 10, then
E[m(s,t+1)] ≥ 2 · 10 · 0.4 · 0.998 ≈ 8: even an above-average schema can be expected to
lose instances when its large defining length makes it easy for crossover to disrupt.

Genetic Programming

In genetic programming (GP), the individuals in the evolving population are computer
programs rather than bit strings.
Example: sin(x) + x² + y

Program tree representation in genetic programming.

Arbitrary programs are represented by their parse trees.
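One simple way to hold such parse trees in code is as nested tuples of (operator, children...), with a small recursive evaluator; the representation below is illustrative:

```python
import math

# (operator, child, child, ...) tuples; leaves are variable names or constants
TREE = ("+", ("+", ("sin", "x"), ("^", "x", 2)), "y")   # sin(x) + x^2 + y

def evaluate(node, env):
    if isinstance(node, str):             # variable leaf
        return env[node]
    if isinstance(node, (int, float)):    # constant leaf
        return node
    op, *args = node
    vals = [evaluate(a, env) for a in args]
    if op == "+":
        return vals[0] + vals[1]
    if op == "^":
        return vals[0] ** vals[1]
    if op == "sin":
        return math.sin(vals[0])
    raise ValueError(op)

print(evaluate(TREE, {"x": 2.0, "y": 1.0}))   # sin(2) + 4 + 1 ≈ 5.909
```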
Naveen Pragallapati

Genetic Programming Algorithm

Parent 1: sin(x) + 2    Parent 2: sin(x) + x + y

Offspring 1: sin(x) + 2    Offspring 2: sin(x) + x + 2y

Crossover operation applied to two parent program trees (top). Crossover points (nodes shown in
bold at top) are chosen at random. The subtrees rooted at these crossover points are then exchanged
to create children trees (bottom).
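Subtree crossover on such nested-tuple trees can be sketched as follows: enumerate the nodes of each parent, pick one node in each at random, and swap the subtrees rooted there (helper names are illustrative):

```python
import random

def nodes(tree, path=()):
    # Enumerate (path, subtree) pairs; children of a tuple start at index 1
    yield path, tree
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from nodes(child, path + (i,))

def replace(tree, path, new):
    if not path:
        return new
    i = path[0]
    parts = list(tree)
    parts[i] = replace(parts[i], path[1:], new)
    return tuple(parts)

def subtree_crossover(p1, p2):
    path1, sub1 = random.choice(list(nodes(p1)))
    path2, sub2 = random.choice(list(nodes(p2)))
    # Exchange the two randomly chosen subtrees
    return replace(p1, path1, sub2), replace(p2, path2, sub1)

parent1 = ("+", ("sin", "x"), 2)
parent2 = ("+", ("sin", "x"), ("+", "x", "y"))
print(subtree_crossover(parent1, parent2))
```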

Koza's Experiments
 Setup:
o Retained 10% of the population (elite individuals) unchanged in the next
generation.
o Created the remainder of the new generation through crossover.
o Did not use mutation in the described experiments.
 Applications:
o GP was successfully applied to solve problems in various domains:
 Symbolic regression.
 Control systems.
 Classification tasks.
Illustrative Example
Koza's example uses Genetic Programming (GP) to learn a program that can solve a block-
stacking problem. The goal is to stack blocks into a single stack that spells the word "universal",
regardless of their initial configuration. The program manipulates blocks one at a time using
predefined actions.

Solution Program
After 10 generations, the GP discovers the following solution, as reported for Koza's
block-stacking experiment:
(EQ (DU (MT CS) (NOT CS)) (DU (MS NN) (NOT NN)))
Here CS denotes the current stack, NN the next block needed on the stack, (MT x) moves a
block to the table, (MS x) moves a block to the stack, and (DU x y) executes x repeatedly
until predicate y becomes true. Stepping through the program from any initial state: the first
DU moves blocks from the current stack to the table until the stack is empty, and the second
DU then moves the next needed block onto the stack until no needed block remains, so the
result is a single stack spelling the target word.

Models of Evolution and Learning

 Evolution and learning are two key adaptive mechanisms through which
organisms acquire abilities to succeed in their environments.
 Evolution operates at the population level over generations, while learning
operates at the individual level during a single lifetime.
 This section investigates how evolution and learning might interact and how
computational models can simulate these processes.

Lamarckian Evolution
Lamarck, a 19th-century scientist, proposed that evolution could be influenced by the
experiences of individual organisms during their lifetime. Specifically, he suggested
that if an organism learned something during its life, such as avoiding a toxic food, it
could pass this learned trait on genetically to its offspring, which would not need to
learn it. This idea seemed appealing because it could allow for more efficient
evolutionary progress compared to the traditional generate-and-test approach, as in
Genetic Algorithms (GAs) and Genetic Programming (GP), which do not take individual
experiences into account.

However, modern science overwhelmingly rejects Lamarck's theory, as evidence shows
that the genetic makeup of an individual is not directly affected by the lifetime
experiences of its parents. Despite this biological fact, recent computational studies
have shown that introducing Lamarckian-like processes into genetic algorithms can
sometimes enhance their effectiveness. Examples of studies exploring this idea include
works by Grefenstette (1991), Ackley and Littman (1994), and Hart and Belew (1995).

Baldwin Effect
The Baldwin Effect refers to a theory proposed by James Mark Baldwin, which suggests
that learning during an individual's lifetime can influence evolutionary processes, even
though learned traits are not directly passed on genetically. According to the Baldwin
Effect, individuals who are better at learning or adapting to their environment are more
likely to survive and reproduce. Over generations, this can lead to the evolution of
genetic traits that make learning easier or more effective, even though the learned
behaviours themselves are not inherited.

In other words, while an individual's learning doesn't directly affect its genetic makeup,
the ability to learn efficiently can give that individual a survival advantage. Over time,
natural selection may favour organisms with genetic traits that make them better
learners, thus promoting the evolution of learning mechanisms.

This concept links the processes of learning and evolution, showing how evolution
could "select" for learning abilities without directly inheriting the outcomes of an
individual's learning experiences.

Parallelizing Genetic Algorithms

Parallelizing genetic algorithms (GAs) involves using multiple computational nodes to
perform parts of the search process simultaneously, taking advantage of their natural
suitability for parallel implementation.
1. Coarse-grained parallelization: In this approach, the population is divided into
groups (demes), each assigned to a different node. Each deme runs its own GA,
and communication between demes occurs less frequently. Migration between
demes helps prevent a single genotype from dominating the population,
addressing the crowding problem common in non-parallel GAs. Examples
include Tanese (1989) and Cohoon et al. (1987).
2. Fine-grained parallelization: Here, each individual in the population is assigned
to a separate processor. Recombination happens between neighbouring
individuals, with different types of neighbourhoods proposed (e.g., planar grid
or torus). Examples include Spiessens and Manderick (1991).
Both methods enhance GA performance and allow for more diverse solutions by
avoiding premature convergence. A sketch of the coarse-grained scheme appears below.
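A minimal sketch of the coarse-grained (deme/island) scheme, where `evolve_one_generation` stands in for any single-population GA step and is an assumed placeholder, not part of the original notes:

```python
def evolve_one_generation(deme, fitness_fn):
    # Placeholder for one GA step (selection, crossover, mutation) on a deme
    return deme

def island_ga(demes, fitness_fn, generations=100, migrate_every=10):
    for g in range(1, generations + 1):
        # Each deme evolves independently (in parallel on separate nodes)
        demes = [evolve_one_generation(d, fitness_fn) for d in demes]
        if g % migrate_every == 0:
            # Ring migration: each deme sends its best individual to the
            # next deme, replacing that deme's worst individual
            bests = [max(d, key=fitness_fn) for d in demes]
            for i, d in enumerate(demes):
                incoming = bests[(i - 1) % len(demes)]
                worst = min(range(len(d)), key=lambda j: fitness_fn(d[j]))
                d[worst] = incoming
    return demes
```

The infrequent migration step is what keeps a single genotype from dominating all demes while still letting good solutions spread.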

Advantages of GP
 GP evolves interpretable models, as the output programs are often human-readable.
 It is flexible and can handle diverse problem domains with minimal domain-specific
customization.
Challenges in GP
 Scalability: The search space of possible programs can grow exponentially.
 Fitness Evaluation: Running programs on large datasets can be computationally
expensive.
 Premature Convergence: The population may converge to suboptimal solutions early.

Summary
 Genetic Programming is a powerful technique for evolving programs to solve problems
in a variety of domains.
 The method relies heavily on tree representations, evolutionary operators like
crossover and mutation, and a fitness-based selection process.
 Despite its challenges, GP has proven effective for tasks like symbolic regression,
optimization, and automated program generation.
