Machine Learning Notes
Unit-3
Part 2: Bayesian Learning
Basic Concepts and Notations:
What is h?
h refers to a hypothesis: a specific assumption or statement about the relationship between the input (data) and the outcome.
In machine learning, it is often a rule or model that makes predictions based on the data.
A hypothesis might describe the conditions under which it will rain.
Each such hypothesis reflects a possible model for predicting rain based on certain features (clouds, humidity, temperature).
What is H?
The set of all possible hypotheses (denoted by H) contains every possible rule or model that could relate the input features (like weather conditions) to the outcome (whether it rains or not). The set H is the collection of all possible hypotheses that could be formed using the available features. Its size is determined by:
1. The features used to describe the data (e.g., cloudy, temperature).
2. The possible values for each feature (e.g., cloudy = {yes, no}, temperature = {low, high}).
What is Likelihood?
Likelihood measures how well a hypothesis or model explains the observed data.
It answers: “Given the data I observed, how plausible (believable/credible) is this particular model or hypothesis?”
It does not sum to 1 across different hypotheses, unlike probability, which sums to 1 over all possible outcomes.
Example of Likelihood:
You roll a die and observe a 3. Now you want to evaluate two models:
o Model 1: The die is fair, so each face (including 3) has probability 1/6.
o Model 2: The die is biased toward showing 3, with a higher chance of 50%.
The likelihood of observing the 3 under these two models would be 1/6 ≈ 0.17 under Model 1 and 0.5 under Model 2.
Here, likelihood helps you determine which model better explains the observed outcome (seeing a 3).
Summary
Probability: The likelihood (in everyday terms) of an event happening, given the model.
Likelihood: How well a given model explains the data that was actually observed.
What is D?
D usually refers to the training data: the set of observations used to learn a model.
o A set of training examples (e.g., historical weather data over several days).
In Bayesian learning, D often refers to the entire set of observations or data points used to update the belief about the hypothesis.
What is P(D)?
P(D) measures how likely the observed data D is, averaged over all possible hypotheses. It is a normalizing constant that ensures the posterior probabilities sum to 1; it tells us how likely the entire dataset is, irrespective of any specific hypothesis.
If D contains multiple training examples, P(D|h) is typically the product of the likelihoods of the individual examples (assuming independence):

P(D|h) = ∏ᵢ P(dᵢ|h)

Here, dᵢ is a single observation, and the product runs over all examples in the dataset.
P(h∣D) is the posterior probability of the hypothesis h given the observed data D.
Note:
If a hypothesis h predicts the observed data well, its posterior probability P(h∣D) increases.
If it does not predict the data well, its probability decreases.
***
We want to predict whether it will rain today based on the observation that the sky is cloudy.
Conclusion
After observing the cloudy sky, the posterior probability of rain increases from 0.4 to 0.64. This means that, based on the evidence (cloudy sky), we are now more confident that it will rain today. Similarly, the probability of no rain decreases from 0.6 to 0.36.
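The prior/likelihood table for this example did not survive extraction, so the following check uses one consistent assignment of numbers (an assumption, not the original figures): P(cloudy|rain) = 0.8 and P(cloudy|no rain) = 0.3 reproduce exactly the 0.4 → 0.64 update stated above.

```python
# Verifying the stated Bayesian update with assumed likelihoods.
p_rain, p_no_rain = 0.4, 0.6                      # priors from the text
p_cloudy_given_rain, p_cloudy_given_no_rain = 0.8, 0.3   # assumed likelihoods

# P(D) = P(cloudy) as a normalizing constant
p_cloudy = p_cloudy_given_rain * p_rain + p_cloudy_given_no_rain * p_no_rain  # 0.5

posterior_rain = p_cloudy_given_rain * p_rain / p_cloudy          # 0.32 / 0.5 = 0.64
posterior_no_rain = p_cloudy_given_no_rain * p_no_rain / p_cloudy # 0.18 / 0.5 = 0.36
print(posterior_rain, posterior_no_rain)
```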
Likelihoods P(D∣h)
Conclusion
After observing the data (Cloudy + High Humidity + Rain), the posterior probabilities indicate that:
h2 ("It rains if cloudy and humid") is the most probable hypothesis, with a posterior probability of 0.64.
h1 ("It rains if cloudy") still has some credibility, with a posterior probability of 0.30.
This example demonstrates how Bayesian learning updates our beliefs about each hypothesis based on the observed data.
3. Compute posterior: Use Bayes' theorem to update the belief in each hypothesis.
4. Predict new data: Use the posterior distribution to make predictions about future events or unseen data.
Thus h_MAP = argmax_{h ∈ H} P(h|D), the hypothesis with the highest posterior probability.
Process:
Given a finite hypothesis space H:
1. Assign prior probabilities P(h) to each hypothesis.
2. Calculate the likelihood P(D|h) for each hypothesis using the observed data D.
3. Normalize by dividing by the total probability P(D), which ensures that the posterior probabilities sum to 1.
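A minimal sketch of this three-step process over a finite hypothesis space. The three hypotheses and all probability values are illustrative assumptions, not figures from the notes.

```python
# Brute-force Bayesian learning over a small, finite hypothesis space.

priors = {"h1_rains_if_cloudy": 0.3,          # step 1: P(h), assumed values
          "h2_rains_if_cloudy_and_humid": 0.4,
          "h3_always_rains": 0.3}

likelihoods = {"h1_rains_if_cloudy": 0.5,     # step 2: P(D|h), assumed values
               "h2_rains_if_cloudy_and_humid": 0.8,
               "h3_always_rains": 0.1}

# Step 3: normalize by P(D) = sum_h P(D|h) P(h) so the posteriors sum to 1.
p_data = sum(likelihoods[h] * priors[h] for h in priors)
posteriors = {h: likelihoods[h] * priors[h] / p_data for h in priors}

h_map = max(posteriors, key=posteriors.get)   # the MAP hypothesis
print(posteriors, h_map)
```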
The above analysis implies that, under our choice for P(h) and P(D|h), every consistent hypothesis has posterior probability

P(h|D) = 1 / |VS_{H,D}|

and every inconsistent hypothesis has posterior probability 0, where VS_{H,D} denotes the version space: the subset of hypotheses in H consistent with D.
“Every consistent hypothesis is, therefore, a MAP hypothesis.”
MAP Hypothesis and Consistent Learners (FIND-S and Candidate Elimination Algorithms):
The MDL Principle: Formal Definition: The MDL principle recommends selecting the hypothesis that minimizes the total description length, which includes both:
1. The description length of the hypothesis h.
2. The description length of the data given the hypothesis.
Assuming we use the codes C1 and C2 to represent the hypothesis and the data given the hypothesis, we can state the MDL principle as

h_MDL = argmin_{h ∈ H} [ L_{C1}(h) + L_{C2}(D|h) ]

Note: If we choose C1 to be the optimal encoding of hypotheses, C_H, and if we choose C2 to be the optimal encoding C_{D|h}, then h_MDL = h_MAP.
Interpretation of MDL
The MDL principle balances the complexity of the hypothesis with how well it explains the data.
A simpler hypothesis that makes a few errors may be preferred over a complex one that perfectly fits the data, to avoid overfitting.
Example: MDL with Decision Trees
Suppose we want to apply the MDL principle to decision tree learning:
1. Encoding the Hypothesis (C1):
o We use an encoding where the description length increases with the number of nodes and edges in the tree.
2. Encoding the Data Given the Hypothesis (C2):
o If the hypothesis perfectly classifies the training data, the cost of encoding the data is zero.
o If some data points are misclassified, we need additional bits to:
Identify the misclassified instances (at most log₂(m) bits per instance, where m is the number of instances).
Transmit the correct classification (at most log₂(k) bits, where k is the number of possible classifications).
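A minimal sketch of this comparison. The tree sizes, error counts, and per-node bit cost below are made-up illustrations, not values from the notes.

```python
import math

m, k = 100, 2                 # training instances, possible classifications
BITS_PER_NODE = 4             # assumed cost of describing one tree node

def description_length(num_nodes, num_errors):
    l_h = num_nodes * BITS_PER_NODE                           # L_C1(h)
    l_d_given_h = num_errors * (math.log2(m) + math.log2(k))  # L_C2(D|h)
    return l_h + l_d_given_h

# A large tree with zero errors vs. a small tree with three errors:
candidates = {"large_tree": (25, 0), "small_tree": (7, 3)}
scores = {name: description_length(*spec) for name, spec in candidates.items()}
h_mdl = min(scores, key=scores.get)   # MDL prefers the shorter total description
print(scores, h_mdl)                  # here the small, slightly inaccurate tree wins
```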
Applications of MDL:
1. Decision Tree Pruning:
o The MDL principle helps determine the best size for a decision tree by balancing simplicity and accuracy.
o Research by Quinlan and Rivest (1989) and Mehta et al. (1995) shows that MDL-based pruning achieves results similar to traditional methods.
2. Overfitting Prevention:
o MDL avoids overfitting by preferring simpler models, even if they make a few errors on the training data.
Limitations and Practical Considerations:
1. The effectiveness of MDL depends on selecting appropriate encoding schemes for hypotheses and data.
2. In practice, it can be difficult to determine optimal encoding schemes or estimate prior probabilities accurately.
3. Human intervention may be required to design encodings that reflect domain knowledge about the problem.
Let the classification label v_j represent a possible class or output label that we want to predict. Then the Bayes optimal classifier predicts the label v_j that has the highest expected probability over all hypotheses, weighted by their posterior probabilities:

P(v_j|D) = Σ_{h ∈ H} P(v_j|h) · P(h|D)

and the Bayes optimal classification is argmax_{v_j ∈ V} Σ_{h ∈ H} P(v_j|h) · P(h|D).
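A minimal sketch of this rule. The posteriors (0.4/0.3/0.3) and per-hypothesis predictions are illustrative assumptions, chosen to show that the combined vote can disagree with the single MAP hypothesis.

```python
# Bayes optimal classification over a finite hypothesis space.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}

# P(v|h) for labels v in {"+", "-"}: h1 predicts "+", h2 and h3 predict "-".
p_label_given_h = {"h1": {"+": 1.0, "-": 0.0},
                   "h2": {"+": 0.0, "-": 1.0},
                   "h3": {"+": 0.0, "-": 1.0}}

def bayes_optimal(labels=("+", "-")):
    # P(v|D) = sum over hypotheses of P(v|h) * P(h|D)
    p_v = {v: sum(p_label_given_h[h][v] * posteriors[h] for h in posteriors)
           for v in labels}
    return max(p_v, key=p_v.get), p_v

print(bayes_optimal())   # "-" wins with 0.6, even though h1 alone is the MAP hypothesis
```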
Example 1:
Limitations of the Bayes optimal classifier:
1. Computationally expensive: Summing over all hypotheses can be infeasible, especially with large hypothesis spaces.
2. Requires complete knowledge of priors and accurate likelihood estimations.
Gibbs Algorithm:
The Gibbs algorithm is a simple but powerful algorithm in the context of Bayesian learning. It demonstrates how randomization can be used to make predictions based on a probability distribution over hypotheses. The Gibbs algorithm is rooted in probabilistic learning: instead of deterministically choosing a single hypothesis, the algorithm randomly selects one according to a probability distribution.
Although the Bayes optimal classifier obtains the best performance that can be achieved from the given training data, it can be quite costly to apply. The expense is due to the fact that it computes the posterior probability for every hypothesis in H and then combines the predictions of each hypothesis to classify each new instance. An alternative, less optimal method is the Gibbs algorithm, defined as follows:
1. Choose a hypothesis h from H at random, according to the posterior probability distribution over H.
2. Use h to predict the classification of the next instance x.
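A minimal sketch of these two steps, reusing the assumed posteriors from the Bayes optimal sketch above.

```python
import random

posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": "+", "h2": "-", "h3": "-"}   # each h's label for instance x

def gibbs_classify():
    # Step 1: sample ONE hypothesis according to the posterior distribution.
    h = random.choices(list(posteriors), weights=list(posteriors.values()), k=1)[0]
    # Step 2: use it alone to classify the instance.
    return predictions[h]

print(gibbs_classify())
# In expectation, the error of this randomized rule is at most twice the error
# of the Bayes optimal classifier (a classical result).
```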
Example:
Problem 1:
You are working with a dataset of two fruits: Apples and Oranges. There are 6 Apples, of which
4 are Red, and 4 Oranges, of which 1 is Red. We want to determine whether a new red fruit
is more likely to be an Apple or an Orange.
Hypotheses:
h1: The fruit is an Apple.
h2: The fruit is an Orange.
Solution:
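The worked solution itself did not survive extraction; the following is a minimal sketch of the computation, assuming priors proportional to the class counts (6 Apples, 4 Oranges).

```python
# Problem 1: is a new Red fruit more likely an Apple or an Orange?
p_apple, p_orange = 6/10, 4/10        # priors from class counts (assumed)
p_red_given_apple = 4/6
p_red_given_orange = 1/4

num_apple = p_red_given_apple * p_apple      # 0.4
num_orange = p_red_given_orange * p_orange   # 0.1
p_red = num_apple + num_orange               # P(Red) = 0.5

print("P(Apple|Red)  =", num_apple / p_red)   # 0.8
print("P(Orange|Red) =", num_orange / p_red)  # 0.2 -> the red fruit is most likely an Apple
```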
Problem-2:
You are working with a small dataset of two fruits: Apples and Oranges, described by two
features: Colour (Red or Green) and Size (Large or Small). The dataset contains 6 Apples, of
which 4 are Red and 2 are Green, and 3 of them are Large while the other 3 are Small. There
are 4 Oranges, of which 1 is Red and 3 are Green, and 2 are Large while the other 2 are Small.
A new fruit appears with Colour = Red and Size = Large, and you need to determine whether
this new fruit is more likely to be an Apple or an Orange using the Bayesian classifier.
Hints:
1. Assume the prior probabilities for both fruits are equal, i.e., P(Apple) = P(Orange) = 0.5.
2. Your hypothesis space consists of two hypotheses:
h1: The fruit is an Apple.
h2: The fruit is an Orange.
3. For each hypothesis h, apply Bayes' theorem to compute the posterior probabilities, i.e., compute P(h1|Red, Large) and P(h2|Red, Large).
4. Use the fact that for independent features, the joint probability can be computed as
P(Red, Large | h) = P(Red | h) ⋅ P(Large | h)
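A minimal sketch following the hints above (equal priors, independent features), with the conditional probabilities read off the stated counts.

```python
# Problem 2: classify a Red, Large fruit with the Bayesian classifier.
p = {"apple": 0.5, "orange": 0.5}          # hint 1: equal priors
p_red = {"apple": 4/6, "orange": 1/4}      # from the dataset counts
p_large = {"apple": 3/6, "orange": 2/4}

# Hint 4: P(Red, Large|h) = P(Red|h) * P(Large|h); multiply by the prior.
unnorm = {h: p[h] * p_red[h] * p_large[h] for h in p}
total = sum(unnorm.values())               # P(Red, Large)
posteriors = {h: unnorm[h] / total for h in unnorm}

print(posteriors)   # apple ≈ 0.727, orange ≈ 0.273 -> classify as Apple
```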
v_MAP = argmax_{c ∈ C} P(x|c) · P(c)

The Naive Bayes classifier assumes that all features x_i are conditionally independent of one another, given the class label. This means

P(x_1, x_2, …, x_n | c) = ∏ᵢ P(x_i | c),

so the classification rule becomes v_NB = argmax_{c ∈ C} P(c) · ∏ᵢ P(x_i | c).
Illustrative Example:
Let us apply the naive Bayes classifier to a concept learning problem we considered during our discussion of decision tree learning: classifying days according to whether someone will play tennis. The following table provides a set of 14 training examples of the target concept PlayTennis, where each day is described by the attributes Outlook, Temperature, Humidity, and Wind.
Here we use the naive Bayes classifier and the training data from this table to classify the following new instance:
(Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong)
Our task is to predict the target value (yes or no) of the target concept PlayTennis for this new instance. The following steps demonstrate how the naive Bayes classifier predicts the target value.
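The table itself did not survive extraction; the following sketch uses the attribute counts from the standard 14-example PlayTennis table (Mitchell), an assumption stated here explicitly.

```python
# Naive Bayes prediction for (sunny, cool, high, strong), counts from the
# standard PlayTennis table: 9 "yes" days, 5 "no" days.
p_yes, p_no = 9/14, 5/14
cond = {
    "yes": {"sunny": 2/9, "cool": 3/9, "high": 3/9, "strong": 3/9},
    "no":  {"sunny": 3/5, "cool": 1/5, "high": 4/5, "strong": 3/5},
}

instance = ["sunny", "cool", "high", "strong"]

score_yes, score_no = p_yes, p_no
for a in instance:                 # multiply in each conditional probability
    score_yes *= cond["yes"][a]
    score_no *= cond["no"][a]

print(score_yes, score_no)         # ≈ 0.0053 vs ≈ 0.0206
print("PlayTennis =", "Yes" if score_yes > score_no else "No")   # No
```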
Conclusion:
Using the Naive Bayes classifier, we predict that the new instance corresponds to PlayTennis = No.
While this method works well when the dataset is large, it becomes unreliable for small datasets. If n is small, even a true probability as low as 0.08 might result in n_c = 0 in the sample, leading to an estimated probability of zero, which then zeroes out the entire naive Bayes product.
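The standard remedy is the m-estimate of probability, P̂ = (n_c + m·p) / (n + m), where p is a prior estimate (e.g., uniform: p = 1/k) and m is the "equivalent sample size". A minimal sketch:

```python
def m_estimate(n_c, n, p, m):
    """Smoothed probability estimate; avoids a zero estimate when n_c = 0."""
    return (n_c + m * p) / (n + m)

# n_c = 0 observed out of n = 5 samples, uniform prior p = 0.5, m = 1:
print(m_estimate(0, 5, 0.5, 1))   # ≈ 0.083 instead of 0.0
```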
Conclusion
The Naive Bayes classifier predicts that the new document "I love mathematics and machine learning" belongs to the "like" class.
Objective: To assess how well a Naive Bayes algorithm can classify Usenet news articles into their respective newsgroups.
Dataset Details:
Number of Newsgroups: 20 distinct Usenet newsgroups were selected, covering topics ranging from science and sports to politics and technology.
Total Documents:
1,000 articles were collected from each newsgroup.
This resulted in a dataset of 20,000 documents.
Experimental Setup
Training and Test Split:
Two-thirds of the 20,000 documents (~13,333 articles) were used for training the Naive Bayes model.
The remaining one-third (~6,667 articles) were used to test the performance.
Performance Results
The performance of the Naive Bayes algorithm was evaluated based on its ability to correctly assign articles to the correct newsgroup.
Classification Accuracy:
o The algorithm achieved 89% accuracy across the 20 newsgroups.
o This is considered a remarkable result given that there were 20 possible categories for each prediction (a multi-class classification task).
Interpretation of Results
1. Effectiveness of Naive Bayes:
The Naive Bayes classifier performed surprisingly well in this text classification task, despite its assumption of conditional independence (i.e., assuming that words are independent given the class).
This demonstrates the practical utility of probabilistic models for text classification tasks.
2. Challenges with Text Classification:
While Naive Bayes works well for many tasks, it may struggle when words exhibit strong dependencies (e.g., word pairs or sequences).
Nevertheless, this experiment shows that simple probabilistic approaches can be effective for large-scale classification problems, such as sorting news articles.
Conclusion:
This experiment by Joachims (1996) demonstrates that the Naive Bayes algorithm is not only
theoretically sound but also practically effective for large-scale text classification tasks,
achieving 89% accuracy on the challenging task of assigning 20,000 Usenet articles to 20
different newsgroups. This highlights the algorithm’s strength in handling high-dimensional
text data efficiently.
The EM Algorithm:
In practical learning scenarios, not all features may be observable. The EM (Expectation-Maximization) algorithm is a widely used method to learn from data with hidden or unobserved variables (introduced by Dempster et al., 1977).
Applications:
Bayesian Networks (Heckerman, 1995)
Radial Basis Function Networks
Clustering Algorithms (e.g., Cheeseman et al., 1988)
Baum-Welch Algorithm for learning Partially Observable Markov Models (Rabiner, 1989)
Process:
1. Randomly select one of the k Gaussian distributions.
2. Generate a data point x_i from the chosen Gaussian.
Goal: Find the maximum likelihood estimate of the means of these distributions; that is, a hypothesis h = ⟨μ₁, μ₂, …, μ_k⟩ that maximizes P(D|h).
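A minimal sketch of EM for this setting: k = 2 Gaussians with known, equal variance and equal mixing weights, estimating only the means. The synthetic data and all constants are assumptions for illustration.

```python
import math, random

random.seed(0)
sigma = 1.0
data = [random.gauss(0.0, sigma) for _ in range(100)] + \
       [random.gauss(4.0, sigma) for _ in range(100)]

def gaussian(x, mu):
    # Unnormalized density; constants cancel when variances and weights are equal.
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

mu = [random.choice(data), random.choice(data)]   # initial guesses for the means
for _ in range(50):
    # E-step: soft assignment E[z_ij] = P(point i was generated by Gaussian j)
    resp = []
    for x in data:
        w = [gaussian(x, m) for m in mu]
        total = sum(w)
        resp.append([wj / total for wj in w])
    # M-step: re-estimate each mean as the responsibility-weighted average
    mu = [sum(r[j] * x for r, x in zip(resp, data)) /
          sum(r[j] for r in resp) for j in range(2)]

print(mu)   # converges near the true means 0.0 and 4.0
```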
Note: Using the EM algorithm, we can cluster the data into two Gaussian clusters. The EM algorithm provides soft assignments at each step, and the clusters can have overlapping boundaries.
K-means Algorithm:
Both algorithms follow an iterative process that alternates between two steps: an assignment step (assign each point to a cluster, softly in EM and hard in k-means) and an update step (re-estimate the cluster parameters from the current assignments).
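The k-means algorithm body did not survive extraction; the following is a minimal sketch of the two alternating steps on 1-D data (the data and k are assumptions for illustration).

```python
import random

random.seed(1)
data = [random.gauss(0.0, 1.0) for _ in range(100)] + \
       [random.gauss(4.0, 1.0) for _ in range(100)]
k = 2
centroids = random.sample(data, k)

for _ in range(20):
    # Assignment step: each point goes to its nearest centroid (hard assignment).
    clusters = [[] for _ in range(k)]
    for x in data:
        j = min(range(k), key=lambda j: (x - centroids[j]) ** 2)
        clusters[j].append(x)
    # Update step: each centroid becomes the mean of its assigned points.
    centroids = [sum(c) / len(c) if c else centroids[j]
                 for j, c in enumerate(clusters)]

print(centroids)   # close to the true means 0.0 and 4.0
```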
Unit-4
Introduction
• Computational Learning Theory (CoLT) is a subfield of artificial intelligence and theoretical computer science that focuses on understanding the theoretical foundations of machine learning.
• It provides a framework for analyzing the feasibility of learning algorithms and formalizes what it means for a machine to learn from data.
• The field investigates the capabilities and limitations of learning models and algorithms with the aim of determining whether certain tasks can be learned efficiently.
Computational Learning Theory explores questions such as:
1. What can be learned by a computer?
2. How much data is required to learn effectively?
3. How efficient are learning algorithms in terms of time and computational resources?
4. What guarantees can we provide about the generalization of learned models?
Important Models in CoLT
• The field draws on concepts from computer science, mathematics, and statistics to rigorously define learning and analyze learning algorithms' performance.
• Central to CoLT are different models of learning, such as Probably Approximately Correct (PAC) learning and the Vapnik–Chervonenkis (VC) dimension, which are used to describe the learnability of different classes of functions.
Key Notations and Definitions
Sample Complexity
Mistake Bound
True Error
Training Error
PAC Learnability
PAC Learning
Example
Example
Implications
Example
Summary
Summary
Goal:
Minimize the total number of mistakes across all examples.
Unit-4
Core Idea:
1. Given a query instance, the algorithm finds the k-nearest neighbours from the training
data (based on a distance metric).
2. The predicted class or value for the query instance is based on the majority class (for classification) or average value (for regression) of the neighbours.
Algorithm
Note:
The k-NEAREST NEIGHBOR algorithm is easily adapted to approximating continuous-valued target functions. To approximate a real-valued target function f : ℝⁿ → ℝ, we replace the final line of the above algorithm by the line

f̂(x_q) ← ( Σ_{i=1}^{k} f(x_i) ) / k
A Note on Terminology
Much of the literature on nearest-neighbour methods and weighted local regression uses terminology that has arisen from the field of statistical pattern recognition. In reading that literature, it is useful to know the following terms:
1. Regression means approximating a real-valued target function.
2. Residual is the error f̂(x) − f(x) in approximating the target function.
3. Kernel function is the function of distance that is used to determine the weight of each training example. In other words, the kernel function is the function K such that w_i = K(d(x_i, x_q)).
Distance Metrics:
The choice of distance metric significantly impacts the performance of k-NN. Common distance metrics include:
1. Euclidean Distance (used for continuous data):

d(x_i, x_j) = sqrt( Σ_{r=1}^{n} ( a_r(x_i) − a_r(x_j) )² )

where a_r(x) denotes the value of the r-th attribute of instance x.
We can distance-weight the instances for real-valued target functions in a similar fashion, replacing the final line of the algorithm in this case by

f̂(x_q) ← ( Σ_{i=1}^{k} w_i f(x_i) ) / ( Σ_{i=1}^{k} w_i )
Example Problem 1:
Given a dataset with three attributes: Age, Income, and Credit Score, predict whether a person will buy a product (Yes/No).
Steps:
1. Compute the Euclidean distance between the query instance and all training points.
2. Choose k = 3 (find the 3 nearest neighbours).
3. Use majority voting to assign a class label (Yes/No).
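The training table for this problem did not survive extraction, so the following sketch of steps 1–3 uses five made-up records (an assumption for illustration).

```python
import math
from collections import Counter

train = [  # (age, income_k, credit_score, buys) -- illustrative records
    (25, 40, 600, "No"), (35, 60, 700, "Yes"), (45, 80, 750, "Yes"),
    (23, 25, 550, "No"), (52, 110, 780, "Yes"),
]
query = (30, 55, 680)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Step 1: distances to all training points; Step 2: keep the k = 3 nearest.
# (In practice, normalize the attributes first; otherwise credit score dominates.)
neighbours = sorted(train, key=lambda r: euclidean(r[:3], query))[:3]

# Step 3: majority vote. For Example Problem-2 (weighted k-NN), weight each
# vote by 1 / d^2 instead of counting it once.
votes = Counter(label for *_, label in neighbours)
print(votes.most_common(1)[0][0])
```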
Example Problem-2:
Solve the Example Problem-1 above using weighted k-NN.
Example Problem-3:
Summary:
1. k-NN is a powerful, non-parametric algorithm that works well for classification and regression.
2. Its main drawback is the high computational cost during prediction, especially with large datasets.
3. k-NN performs better when the feature space is small and the relevant features are carefully selected.
Training Algorithm for Locally Weighted Regression (LWR) Using Gradient Descent:
Summary:
This algorithm fits a local linear model for each query point by iteratively updating the weights w_j using gradient descent. The kernel function ensures that only nearby points have significant influence, making the regression localized. The training process continues until the model converges, after which it can predict values based on the optimized weights.
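The algorithm listing itself was lost in extraction; the following is a minimal sketch of locally weighted linear regression trained by gradient descent for a single query point. The 1-D data, kernel width tau, and learning rate are assumptions for illustration.

```python
import math, random

random.seed(2)
data = [(x / 10.0, math.sin(x / 10.0) + random.gauss(0, 0.05)) for x in range(60)]
x_q, tau, eta = 3.0, 0.5, 0.5

def kernel(d):
    return math.exp(-d * d / (2 * tau * tau))   # Gaussian kernel K(d(x, x_q))

# Local linear model centered at the query: f_hat(x) = w0 + w1 * (x - x_q),
# so the prediction at x_q is simply w0.
w0, w1 = 0.0, 0.0
for _ in range(1000):                 # gradient descent on the kernel-weighted error
    g0 = g1 = 0.0
    for x, y in data:
        k = kernel(abs(x - x_q))      # nearby points carry most of the influence
        err = (w0 + w1 * (x - x_q)) - y
        g0 += k * err
        g1 += k * err * (x - x_q)
    w0 -= eta * g0 / len(data)
    w1 -= eta * g1 / len(data)

print(w0)   # local prediction at x_q, close to sin(3.0) ≈ 0.14
```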
Summary:
1. Select centres: Place RBF neurons at key points (possibly through clustering).
2. Calculate activations: Use the Gaussian kernel to compute how much influence each RBF neuron has for a given input.
3. Optimize weights: Use least squares to find the weights that minimize the prediction error.
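A minimal sketch of these three steps. The data, the grid of centres, and the kernel width are assumptions for illustration (centres could equally come from clustering, as noted above).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=200)
y = np.sin(X) + rng.normal(0, 0.05, size=X.shape)

centres = np.linspace(-3, 3, 10)   # step 1: place RBF neurons (here: a fixed grid)
sigma = 0.8

def activations(x):
    # Step 2: Gaussian kernel activation of each hidden unit for each input.
    return np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2 * sigma ** 2))

Phi = np.column_stack([np.ones_like(X), activations(X)])   # bias + hidden units
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)                # step 3: least squares

x_new = np.array([1.5])
y_hat = np.column_stack([np.ones_like(x_new), activations(x_new)]) @ w
print(y_hat)   # close to sin(1.5) ≈ 0.997
```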
Figure: A radial basis function network. Each hidden unit produces an activation determined by a Gaussian function centered at some instance x_u. Therefore, its activation will be close to zero unless the input x is near x_u. The output unit produces a linear combination of the hidden unit activations. Although the network shown here has just one output, multiple output units can also be included.
Diagram Summary
In the diagram, you would typically see:
Input nodes connected to each hidden node (RBF neuron).
RBF neurons in the hidden layer, each receiving the input vector and computing its activation.
Weights associated with each connection from hidden neurons to the output layer.
A single output node that aggregates the weighted activations to produce the predicted value.
This architecture enables RBF networks to model nonlinear relationships by combining local approximations (via Gaussian kernels) with a global linear combination at the output layer.
Note:
Solution:
Steps to follow:
Medical diagnosis, where pa ent cases help diagnose similar future pa ents.
Technical support and troubleshoo ng, where past solu ons can be adapted for new
issues.
Legal reasoning, where previous legal cases inform judgments in new cases.
Summary
CBR’s strength lies in its ability to adapt previous knowledge directly to new problems, which
is especially powerful in contexts where cases do not follow a strict generaliza on rule. It’s
ideal for tasks where excep ons are common or complex adapta ons are needed.
Example
Imagine CADET's library includes a case of a small irrigation pump with a flow rate of 10 liters per minute and a pressure of 5 psi. The new problem requires a pump with 10 liters per minute but with a higher pressure of 8 psi.
1. Retrieve: CADET retrieves the small irrigation pump case, recognizing that it meets the flow rate requirement.
2. Reuse: CADET reuses much of the design, such as the general structure and configuration.
3. Revise: To meet the higher-pressure requirement, CADET modifies the pump by increasing the impeller size or using a more powerful motor, ensuring it can achieve 8 psi.
4. Retain: The revised pump design is stored as a new case with specifications for a 10 L/min, 8 psi water pump.
Adaptation Techniques in CADET
CADET's adaptation is based on both similarity metrics and specific engineering rules, such as:
1. Generalization vs. Memorization:
Eager learning emphasizes generalization, aiming to abstract patterns from the training data.
Lazy learning focuses on memorization, retaining the original instances for later use.
2. Time Complexity:
Eager learners require more time and computational resources during the training phase, as they need to analyse and construct a model.
Lazy learners are quick to train, as they simply store the training data, but may require more time to make predictions since they analyse the stored data at query time.
3. Memory Usage:
Eager learning typically uses less memory at query time since it works with a model rather than storing all instances.
Lazy learning may require significant memory if the training dataset is large, as it must keep all instances accessible for querying.
4. Flexibility:
Eager learners can sometimes struggle with changes in the underlying data distribution, as retraining the model is necessary.
Lazy learners can adapt to changes in the data more easily since they can incorporate new instances dynamically during prediction.
5. Performance and Application Context:
Eager learning methods may perform better in scenarios where the dataset is large and a general model can effectively capture the relationships in the data.
Lazy learning may excel in cases where data is sparse or when predictions must be made based on local relationships in the data.
Use Cases:
Eager learning is often used in applications where the cost of computation during training can be justified by the need for fast predictions, such as in online services.
Lazy learning is useful in applications where real-time updates are critical, such as recommendation systems that adapt to user preferences.
Conclusion:
Both lazy and eager learning methods have their advantages and disadvantages, and the choice between them often depends on the specific problem context, the size and nature of the dataset, and the computational resources available. Understanding these concepts is crucial for selecting appropriate learning algorithms in practical machine learning applications.
Unit-5
Genetic Algorithms
A Biological Motivation
Genetic Algorithms (GAs) are inspired by the process of evolution in the natural world. Biological evolution provides a framework for solving complex problems through the principles of natural selection, genetic variation, and survival of the fittest.
In nature:
Populations of organisms evolve over time to adapt to their environments.
Evolution is driven by:
1. Natural Selection: The environment "selects" individuals with traits that give them an advantage in survival and reproduction.
2. Genetic Variation: Traits are passed from parents to offspring, with occasional random changes (mutations) introducing new traits.
3. Crossover (Recombination): When two parents reproduce, their genetic material combines to create offspring with mixed traits.
4. Fitness: The "fit" individuals (those better adapted) are more likely to survive and reproduce.
Visualizing the Biological Connection
1. Imagine a population of animals evolving to escape predators.
o Fitness = Speed of running.
o Better runners survive, reproduce, and pass on faster-running traits.
2. In GAs, imagine solving an equation where:
o Fitness = How close a solution is to the correct answer.
o Over generations, solutions "evolve" to converge on the answer.
In evolutionary algorithms, genes, chromosomes, and their representations are inspired by biological concepts.
Gene
A gene is the smallest unit of information in the solution encoding. It represents a single variable or parameter of the problem.
In optimization problems, a gene might be a value that contributes to the solution (e.g., a decision variable or a parameter to be optimized).
Representation of a Gene
1. Binary: 0 or 1 (common in Genetic Algorithms).
2. Integer: A discrete value (e.g., 5, 10).
3. Real Number: A continuous value (e.g., 3.14, -7.6).
4. Symbol: Non-numeric entities used in genetic programming or symbolic computation.
Chromosome
A chromosome is a collection of genes arranged in a specific order. It represents an individual candidate solution to the optimization problem.
In essence, it is a data structure that holds the genes and encapsulates the solution.
Representation of a Chromosome
1. Binary Encoding: A sequence of 0s and 1s.
o Example: 101011
2. Integer Encoding: A sequence of integers.
o Example: [3, 5, 7, 9]
3. Real-valued Encoding: A sequence of real numbers.
o Example: [1.5, -2.3, 4.0, 7.8]
4. Tree Structure: Used in Genetic Programming.
o Example: A syntax tree representing a mathematical expression like (x + y) * z.
5. Permutation Encoding: Used in problems like the Traveling Salesman Problem.
o Example: [3, 1, 4, 2] (indicating a path through cities 3 → 1 → 4 → 2).
Definition: A genetic algorithm is a general optimization method that searches through a large space of candidate solutions, aiming to find the one with the highest fitness.
Representing Hypotheses
Hypotheses in GAs are often represented by bit strings, so that they can be easily manipulated by genetic operators such as mutation and crossover. The hypotheses represented by these bit strings can be quite complex; sets of if-then rules, for example, can be encoded in this way. As an example, consider the PlayTennis problem, and represent each attribute's values as a bit string.
Attributes and Their Values
1. Outlook (3 values: Sunny, Overcast, Rain)
o 3-bit string:
Sunny → 100
Overcast → 010
Rain → 001
"Don't care" (all values allowed) → 111
2. Temperature (3 values: Hot, Mild, Cool)
o 3-bit string:
Hot → 100
Mild → 010
Cool → 001
"Don't care" → 111
Genetic Operators
Genetic operators are mechanisms inspired by biological evolution that help generate new candidates (successors) in a Genetic Algorithm. These successors are typically formed by recombining or mutating members of the current population.
Main Genetic Operators
A. Crossover
B. Mutation
C. Specialized Operators
A. Crossover
Crossover is the most commonly used operator; it generates offspring by combining bits or segments from two parent individuals.
Types of Crossovers
1. Single-Point Crossover:
Parent 1: 11001100100
Parent 2: 10101010101
Crossover Point: n = 5
Crossover Mask: 11111000000
Offspring:
o Offspring 1: 11001010101 (first 5 bits from Parent 1, remaining bits from Parent 2)
o Offspring 2: 10101100100 (first 5 bits from Parent 2, remaining bits from Parent 1)
2. Two-Point Crossover:
The crossover mask contains a contiguous segment of 1s in the middle, flanked by 0s. Two points (n0, n1) are chosen randomly to determine the section swapped between the parents.
Example:
Parent 1: 11001100100
Parent 2: 10101010101
Crossover Mask: 00111110000 (n0 = 2, n1 = 5)
Offspring:
Offspring 1: 10001100101 (middle segment from Parent 1, outer bits from Parent 2)
Offspring 2: 11101010100 (middle segment from Parent 2, outer bits from Parent 1)
3. Uniform Crossover:
A random bit string serves as the crossover mask. Each bit is chosen randomly and independently to determine which parent contributes that bit.
Example:
Parent 1: 11001100100
Parent 2: 10101010101
Random Mask: 10110100101
Offspring:
o Offspring 1: 10001110100 (bits from Parent 1 where the mask is 1, from Parent 2 where it is 0)
o Offspring 2: 11101000101 (the complementary selection)
B. Mutation
Mutation introduces small, random changes to an individual, allowing exploration of the solution space.
Typically applied after crossover.
A random bit position is selected, and its value is flipped.
Example:
Original: 11001100100
Mutation Position: 6 (bit flipped)
Mutated Offspring: 11001000100
Mutation ensures diversity and prevents the algorithm from being trapped in local optima.
C. Specialized Operators
These are tailored to the problem or hypothesis representation. Examples include:
Rule Specialization/Generalization: Operators replace or modify specific parts of rules to broaden or narrow their applicability.
Example Systems:
o Grefenstette et al. (1991): A system for learning robot control rules using a rule-specialization operator.
o Janikow (1993): A system for learning rules by explicitly modifying conditions (e.g., replacing a condition with "don't care").
Summary
1. Crossover explores new combinations of existing solutions.
2. Mutation introduces variation to escape local optima.
3. Specialized operators adapt GAs to problem-specific requirements.
Exercise Problems:
1. Given the following parent strings and a single-point crossover at position n = 3, determine the offspring:
o Parent 1: 1011011
o Parent 2: 1100101
2. Perform a two-point crossover on the following parent strings using n0=2 and n1=5:
o Parent 1: 11110000
o Parent 2: 00001111
3. Simulate uniform crossover using the following parents and mask. Generate the two
offspring:
o Parent 1: 10101010
o Parent 2: 11001100
o Mask: 10110011
4. Apply mutation to the following bit strings by flipping the bit at the given positions:
o String: 11010101, Mutation at position 4
o String: 00101011, Mutation at position 1
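Your answers can be checked with a small script; the following is a minimal sketch of the three crossover operators and point mutation, using the same mask conventions and 1-indexed positions as the examples above.

```python
import random

def apply_mask(p1, p2, mask):
    """Offspring 1 takes bits from p1 where mask is '1', else from p2."""
    c1 = "".join(a if m == "1" else b for a, b, m in zip(p1, p2, mask))
    c2 = "".join(b if m == "1" else a for a, b, m in zip(p1, p2, mask))
    return c1, c2

def single_point(p1, p2, n):
    mask = "1" * n + "0" * (len(p1) - n)
    return apply_mask(p1, p2, mask)

def two_point(p1, p2, n0, n1):
    # a segment of n1 ones starting after position n0
    mask = "0" * n0 + "1" * n1 + "0" * (len(p1) - n0 - n1)
    return apply_mask(p1, p2, mask)

def uniform(p1, p2, mask=None):
    mask = mask or "".join(random.choice("01") for _ in p1)
    return apply_mask(p1, p2, mask)

def mutate(s, pos):
    """Flip the bit at 1-indexed position pos."""
    i = pos - 1
    return s[:i] + ("0" if s[i] == "1" else "1") + s[i + 1:]

print(single_point("1011011", "1100101", 3))        # exercise 1
print(two_point("11110000", "00001111", 2, 5))      # exercise 2 (mask 00111110)
print(uniform("10101010", "11001100", "10110011"))  # exercise 3
print(mutate("11010101", 4), mutate("00101011", 1)) # exercise 4
```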
Key Concepts
1. Fitness Function:
o A mathematical representation used to evaluate the quality (or "fitness") of
each individual in the population.
o Higher fitness values indicate better solutions.
o Fitness functions depend on the specific problem domain and the
representation of candidate solutions.
2. Selection:
o The process of choosing individuals from the current population to generate
the next generation.
o Selection favours individuals with higher fitness, giving them a higher chance
of passing on their genetic material.
o Common selection methods include:
Proportional Selection: Probability of selection is proportional to an
individual’s fitness.
Rank-Based Selection: Individuals are ranked based on fitness, and
selection probabilities are assigned according to rank.
Tournament Selection: A subset of individuals is chosen randomly, and
the best individual in the subset is selected.
Elitism: Ensures the best individuals are carried over to the next
generation unchanged.
3. Proportional Selection Mechanism:
o Often implemented using a roulette wheel analogy:
Each individual gets a slice of the wheel proportional to its fitness.
Spinning the wheel randomly selects individuals, with higher fitness
individuals more likely to be chosen.
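A minimal sketch of the roulette-wheel mechanism. The population and the fitness function f(x) = x² are illustrative assumptions (they match the "25 with fitness 625" figure used below).

```python
import random

population = [25, 18, 7, 12]
fitness = [x * x for x in population]   # assumed objective: maximize x^2

def roulette_select(pop, fit):
    """Spin the wheel: each individual owns a slice proportional to its fitness."""
    spin = random.uniform(0, sum(fit))
    cumulative = 0.0
    for individual, f in zip(pop, fit):
        cumulative += f
        if spin <= cumulative:
            return individual
    return pop[-1]

parents = [roulette_select(population, fitness) for _ in range(4)]
print(parents)   # 25 (fitness 625) is selected most often, but not always
# (random.choices(pop, weights=fit) implements the same mechanism in one call.)
```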
Generation 1
Best Individual in the New Population (Generation 2): 25 with fitness 625.
Repeat for Generation 2:
1. Selection: Using fitness values, select pairs for crossover.
2. Crossover: Perform single-point crossover.
3. Mutation: Introduce random mutations.
So, this scenario illustrates that genetic algorithms do not always maintain the current "best" solution across generations, especially if no elitism is applied.
Conclusion
The temporary drop in the best fitness is a natural part of how genetic algorithms work. It is a trade-off between exploring new solutions and exploiting the current best solutions. With enough generations and a balance between crossover and mutation, the algorithm typically converges to the global or near-global optimum.
Analogy: Think of climbing a mountain. Sometimes, you may step down a little to find a better path to the peak. Similarly, the algorithm sacrifices short-term gains (losing the best individual) to explore more of the solution space.
Key Takeaways
The objective function defines the goal of the optimization problem (maximize or minimize).
The fitness function is used internally by the algorithm to evaluate solutions, often adapted for constraints, maximization, or algorithm-specific needs.
In short: The fitness function is tailored to the problem and the algorithm, while the objective function is the mathematical expression of the optimization goal.
Note that the length of the bit string grows with the number of rules in the hypothesis. This variable bit-string length requires a slight modification to the crossover operator, as described below.
Genetic operators. GABIL uses the standard mutation operator, in which a single bit is chosen at random and replaced by its complement. The crossover operator that it uses is a fairly standard extension to the two-point crossover operator. In particular, to accommodate the variable-length bit strings that encode rule sets, and to constrain the system so that crossover occurs only between like sections of the bit strings that encode rules, the following approach
is taken. To perform a crossover operation on two parents, two crossover points are first chosen at random in the first parent string. Let d1 (d2) denote the distance from the leftmost (rightmost) of these two crossover points to the rule boundary immediately to its left. The crossover points in the second parent are now randomly chosen, subject to the constraint that they must have the same d1 and d2 values. For example, if the two parent strings are

h1 = 10011 11100 (two rules of five bits each)
h2 = 01110 10010 (two rules of five bits each)

and the crossover points chosen for the first parent are the points following bit positions 1 and 8,

h1 = 1[0011 111]00

where "[" and "]" indicate crossover points, then d1 = 1 and d2 = 3. Hence the allowed pairs of crossover points for the second parent include the pairs of bit positions (1,3), (1,8), and (6,8). If the pair (1,3) happens to be chosen,

h2 = 0[11]10 10010

and the resulting offspring are

h3 = 11100 (one rule)
h4 = 00011 11110 10010 (three rules)

As this example illustrates, this crossover operation enables offspring to contain a different number of rules than their parents, while assuring that all bit strings generated in this fashion represent well-defined rule sets.
Fitness function. The fitness of each hypothesized rule set is based on its classification accuracy over the training data. In particular, the function used to measure fitness is

fitness(h) = (correct(h))²

where correct(h) is the percent of all training examples correctly classified by hypothesis h.
In experiments comparing the behaviour of GABIL to decision tree learning algorithms such as C4.5 and ID5R, and to the rule learning algorithm AQ14, DeJong et al. (1993) report roughly comparable performance among these systems, tested on a variety of learning problems. For example, over a set of 12 synthetic problems, GABIL achieved an average generalization accuracy of 92.1%, whereas the performance of the other systems ranged from 91.2% to 96.6%.
The Genetic Algorithm (GA) employs a randomized beam search approach to explore the hypothesis space, differing significantly from methods like neural network backpropagation. While backpropagation follows a smooth gradient descent to incrementally adjust hypotheses, GAs make abrupt changes, replacing a parent hypothesis with an offspring that may be very different. This characteristic makes GAs less prone to local minima compared to gradient descent.
A challenge in GAs is crowding, where highly fit individuals reproduce excessively, reducing population diversity and slowing progress. Strategies to mitigate crowding include:
1. Selection modifications, such as tournament or rank selection, instead of fitness-proportionate selection.
2. Fitness sharing, which reduces an individual's fitness score when similar individuals are present.
3. Recombination restrictions, limiting mating to similar individuals to form clusters or subspecies.
4. Spatial distribution, allowing only nearby individuals to recombine.
Interpretation of Equation
Components of Equation
Genetic Programming
Example: sin(x) + x² + y
Figure: Crossover operation applied to two parent program trees (top). Crossover points (nodes shown in bold at top) are chosen at random. The subtrees rooted at these crossover points are then exchanged to create children trees (bottom).
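A minimal sketch of subtree crossover on program trees represented as nested lists. The first parent encodes the example expression above; the second parent is an assumed illustration.

```python
import random, copy

# A program tree is a nested list [op, child1, ...] or a leaf ('x', 'y', or a number).

def random_node_path(tree):
    """Collect paths to all nodes, then pick one uniformly at random."""
    paths = []
    def walk(t, p):
        paths.append(p)
        if isinstance(t, list):
            for i, child in enumerate(t[1:], start=1):
                walk(child, p + (i,))
    walk(tree, ())
    return random.choice(paths)

def get_subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def set_subtree(tree, path, new):
    if not path:
        return new                     # swapping at the root replaces the whole tree
    parent = tree
    for i in path[:-1]:
        parent = parent[i]
    parent[path[-1]] = new
    return tree

def crossover(p1, p2):
    """Exchange randomly chosen subtrees between two parent program trees."""
    c1, c2 = copy.deepcopy(p1), copy.deepcopy(p2)
    path1, path2 = random_node_path(c1), random_node_path(c2)
    s1 = copy.deepcopy(get_subtree(c1, path1))
    s2 = copy.deepcopy(get_subtree(c2, path2))
    return set_subtree(c1, path1, s2), set_subtree(c2, path2, s1)

parent1 = ['+', ['sin', 'x'], ['+', ['*', 'x', 'x'], 'y']]   # sin(x) + x^2 + y
parent2 = ['*', ['+', 'x', 'y'], 'y']                        # (x + y) * y  (assumed)
child1, child2 = crossover(parent1, parent2)
print(child1)
print(child2)
```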
Koza's Experiments
Setup:
o Retained 10% of the population (elite individuals) unchanged in the next generation.
o Created the remainder of the new generation through crossover.
o Did not use mutation in the described experiments.
Applications:
o GP was successfully applied to solve problems in various domains:
Symbolic regression.
Control systems.
Classification tasks.
Illustrative Example
Koza's example uses Genetic Programming (GP) to learn a program that can solve a block-stacking problem. The goal is to stack blocks into a single stack that spells the word "universal", regardless of their initial configuration. The program manipulates blocks one at a time using predefined actions.
Solution Program
After 10 generations, the GP discovers the following solution:
Initial State
Result
Lamarckian Evolution
Lamarck, a 19th-century scientist, proposed that evolution could be influenced by the experiences of individual organisms during their lifetime. Specifically, he suggested that if an organism learned something during its life, such as avoiding a toxic food, it could pass this learned trait on genetically to its offspring, which would not need to learn it. This idea seemed appealing because it could allow for more efficient evolutionary progress compared to the traditional generate-and-test approach, as in Genetic Algorithms (GAs) and Genetic Programming (GP), which do not take individual experiences into account.
Baldwin Effect
The Baldwin Effect refers to a theory proposed by James Mark Baldwin, which suggests that learning during an individual's lifetime can influence evolutionary processes, even though learned traits are not directly passed on genetically. According to the Baldwin Effect, individuals who are better at learning or adapting to their environment are more likely to survive and reproduce. Over generations, this can lead to the evolution of genetic traits that make learning easier or more effective, even though the learned behaviours themselves are not inherited.
In other words, while an individual's learning does not directly affect its genetic makeup, the ability to learn efficiently can give that individual a survival advantage. Over time, natural selection may favour organisms with genetic traits that make them better learners, thus promoting the evolution of learning mechanisms.
This concept links the processes of learning and evolution, showing how evolution could "select" for learning abilities without directly inheriting the outcomes of an individual's learning experiences.
Advantages of GP
GP evolves interpretable models, as the output programs are often human-readable.
It is flexible and can handle diverse problem domains with minimal domain-specific customization.
Challenges in GP
Scalability: The search space of possible programs can grow exponentially.
Fitness Evaluation: Running programs on large datasets can be computationally expensive.
Premature Convergence: The population may converge to suboptimal solutions early.
Summary
Genetic Programming is a powerful technique for evolving programs to solve problems in a variety of domains.
The method relies heavily on tree representations, evolutionary operators like crossover and mutation, and a fitness-based selection process.
Despite its challenges, GP has proven effective for tasks like symbolic regression, optimization, and automated program generation.