(Autonomous)
(Accredited by NBA & NAAC with Grade “A” and Affiliated to JNTUK-Kakinada)
MACHINE LEARNING
LECTURE NOTES
B.TECH IV YEAR – I SEM
Machine Learning(20CI5T02)
SYLLABUS
Course Objectives:
The course is introduced for students to
Gain knowledge about basic concepts of Machine Learning
Study about different learning algorithms
Learn about evaluation of learning algorithms
Learn about artificial neural networks
Course Outcomes:
Identify machine learning techniques suitable for a given problem
Solve the problems using various machine learning techniques
Apply dimensionality reduction techniques
Design application using machine learning techniques
UNIT I
Introduction: Definition of learning systems, Goals and applications of machine learning, Aspects of developing a
learning system: training data, concept representation, function approximation. Inductive Classification: The
concept learning task, Concept learning as search through a hypothesis space, General-to-specific ordering of
hypotheses, Finding maximally specific hypotheses, Version spaces and the candidate elimination algorithm,
Learning conjunctive concepts, The importance of inductive bias.
UNIT II
Decision Tree Learning: Representing concepts as decision trees, Recursive induction of decision trees, Picking the best splitting
attribute: entropy and information gain, Searching for simple trees and computational complexity, Occam's razor, Overfitting,
noisy data, and pruning. Experimental Evaluation of Learning Algorithms: Measuring the accuracy of learned hypotheses.
Comparing learning algorithms: cross-validation, learning curves, and statistical hypothesis testing.
UNIT III
Computational Learning Theory: Models of learnability: learning in the limit; probably approximately correct (PAC) learning.
Sample complexity for infinite hypothesis spaces, Vapnik-Chervonenkis dimension. Rule Learning: Propositional and First-Order,
Translating decision trees into rules, Heuristic rule induction using separate and conquer and information gain, First-order Horn-
clause induction (Inductive Logic Programming) and Foil, Learning recursive rules, Inverse resolution, Golem, and Progol
UNIT IV
Artificial Neural Networks: Neurons and biological motivation, Linear threshold units. Perceptrons: representational limitation
and gradient descent training, Multilayer networks and backpropagation, Hidden layers and constructing intermediate,
distributed representations. Overfitting, learning network structure, recurrent networks. Support Vector Machines: Maximum
margin linear separators. Quadratic programming solution to finding maximum margin separators. Kernels for learning non-
linear functions.
UNIT V
Bayesian Learning: Probability theory and Bayes rule. Naive Bayes learning algorithm. Parameter smoothing. Generative vs.
discriminative training. Logistic regression. Bayes nets and Markov nets for representing dependencies. Instance-Based
Learning: Constructing explicit generalizations versus comparing to past specific examples. k-Nearest-neighbor algorithm. Case-
based learning
Text Books:
1) T.M. Mitchell, “Machine Learning”, McGraw-Hill, 1997.
2) Machine Learning, Saikat Dutt, Subramanian Chandramouli, Amit Kumar Das, Pearson, 2019
Reference Books:
1) Ethem Alpaydin, “Introduction to Machine Learning”, MIT Press, 2004.
2) Stephen Marsland, “Machine Learning -An Algorithmic Perspective”, Second Edition, Chapman and Hall/CRC Machine
Learning and Pattern Recognition Series, 2014.
3) Andreas C. Müller and Sarah Guido, “Introduction to Machine Learning with Python: A Guide for Data Scientists”, O'Reilly.
UNIT I
Introduction: Definition of learning systems, Goals and applications of machine learning,
Aspects of developing a learning system: training data, concept representation, function
approximation. Inductive Classification: The concept learning task, Concept learning as
search through a hypothesis space, General-to-specific ordering of hypotheses, Finding
maximally specific hypotheses, Version spaces and the candidate elimination algorithm,
Learning conjunctive concepts, The importance of inductive bias.
___________________________________________________________________________
1 INTRODUCTION
Definition (Tom M. Mitchell): A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Examples
i) Handwriting recognition learning problem
• Task T: Recognising and classifying handwritten words within images
• Performance P: Percent of words correctly classified
• Training experience E: A dataset of handwritten words with given
classifications
ii) A robot driving learning problem
• Task T: Driving on highways using vision sensors
• Performance measure P: Average distance traveled before an error
• Training experience E: A sequence of images and steering commands recorded while observing a human driver
iii) A checkers learning problem
• Task T: Playing checkers
• Performance measure P: Percent of games won against opponents
• Training experience E: Playing practice games against itself
Applications of Machine Learning
1. Image Recognition:
Image recognition is one of the most common applications of machine
learning. It is used to identify objects, persons, places, digital images, etc.
A popular use case of image recognition and face detection is the automatic friend tagging suggestion.
2. Speech Recognition
While using Google, we get an option of "Search by voice"; this comes under speech recognition, and it is a popular application of machine learning. Voice assistants such as Google Assistant, Siri, Cortana, and Alexa use speech recognition technology to follow voice instructions.
3. Traffic prediction:
If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with the shortest route and predicts the traffic conditions. It predicts the traffic conditions with the help of two ways:
o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time
Everyone who is using Google Maps is helping this app to make it better. It takes information from the user and sends it back to its database to improve the performance.
4. Product recommendations:
Machine learning is widely used by various e-commerce and entertainment
companies such as Amazon, Netflix, etc., for product recommendation to
the user. Whenever we search for some product on Amazon, we start getting advertisements for the same product while surfing the internet on the same browser, and this is because of machine learning.
5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars.
Machine learning plays a significant role in self-driving cars. Tesla, the most popular car manufacturing company, is working on self-driving cars. It uses an unsupervised learning method to train the car models to detect people and objects while driving.
6. Email Spam and Malware Filtering:
Whenever we receive a new email, it is filtered automatically as important, normal, or spam. We always receive important mails in our inbox with the important symbol and spam emails in our spam box, and the technology behind this is machine learning. Below are some spam filters used by Gmail:
o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters
7. Virtual Personal Assistant:
We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri. These assistants record our voice instructions, send them over the server on a cloud, decode them using ML algorithms, and act accordingly.
8. Online Fraud Detection:
Machine learning is making our online transactions safe and secure by detecting fraud transactions. For each genuine transaction, the output is converted into some hash values, and these values become the input for the next round. For each genuine transaction, there is a specific pattern which changes for a fraud transaction; hence, the system detects the fraud and makes our online transactions more secure.
9. Stock Market trading:
Machine learning is widely used in stock market trading. In the stock market, there is always a risk of ups and downs in shares, so machine learning's long short-term memory (LSTM) neural network is used for the prediction of stock market trends.
The learning process starts with task T, performance measure P, and training experience E, and the objective is to find an unknown target function. The target function is the exact knowledge to be learned from the training experience, and it is unknown. For example, in the case of credit approval, the learning system will have customer application records as experience, and the task would be to classify whether a given customer application is eligible for a loan. So in this case, the training examples can be represented as (x1, y1), (x2, y2), ..., (xn, yn), where xi represents customer application details and yi represents the status of credit approval.
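As an illustrative sketch in Python (the field names and values below are made-up assumptions, not taken from any real dataset), such training examples can be stored as simple (x, y) pairs:

# Hypothetical customer applications paired with loan decisions.
# Field names and values are illustrative assumptions only.
training_examples = [
    ({"income": 52000, "credit_score": 710, "existing_loans": 1}, "approved"),
    ({"income": 18000, "credit_score": 540, "existing_loans": 3}, "rejected"),
]

for x, y in training_examples:
    print(f"application details x = {x} -> credit status y = {y}")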
We have now looked into the learning process and understood the goal of learning. When we want to design a learning system that follows the learning process, we need to consider a few design choices. The design choices will be to decide the following key components:
1. Choosing the training experience
2. Choosing the target function
3. Choosing a representation for the target function
4. Choosing a function approximation algorithm
5. The final design
We will look into the checkers learning problem and apply the above design choices. For the checkers learning problem, the three elements (T, P, E) are: Task T — playing checkers; Performance measure P — percent of games won against opponents; Training experience E — playing practice games against itself.
1. Choosing the Training Experience:
The type of training experience available can have a significant impact on the success or failure of the learner. One key attribute is whether the training experience provides direct or indirect feedback.
Direct or indirect training experience — In the case of direct training experience, individual board states and the correct move for each board state are given. Direct feedback is the simpler case.
In the case of indirect training experience, the move sequences for a game and the final result (win, loss, or draw) are given for a number of games. How to assign credit or blame to individual moves is the credit assignment problem. The credit assignment problem can be particularly difficult because the game can be lost even when early moves are optimal.
2. Degree of learner control:
A second attribute of the training experience is the degree to which the learner controls the sequence of training examples.
Teacher or Not — Supervised — The training experience will be labeled,
which means, all the board states will be labeled with the correct move. So the
learning takes place in the presence of a supervisor or a teacher.
Unsupervised — The training experience will be unlabeled, which means, all
the board states will not have the moves. So the learner generates random
games and plays against itself with no supervision or teacher involvement.
Semi-supervised — The learner generates game states and asks the teacher for help in finding the correct move when a board state is confusing.
3. Is the training experience good — Do the training examples represent the distribution of examples over which the final system performance will be measured? Performance is best when training examples and test examples are drawn from the same or a similar distribution.
The checkers player learns by playing against itself; its experience is indirect, and it may not encounter moves that are common in human expert play. Once the proper training experience is available, the next design step will be choosing the target function.
1.3.2 Choosing the Target Function
The next design choice is to determine exactly what type of knowledge will be learned and how this will be used by the performance program. The program needs only to learn how to choose the best move from among a large search space. We need to find a target function that will help us choose the best move among the alternatives. Let us call this function ChooseMove and use the notation ChooseMove : B → M to indicate that this function accepts as input any board from the set of legal board states B and produces as output some move from the set of legal moves M.
When there is only indirect experience, however, it becomes difficult to learn such a function. An alternative is to assign a real score to each board state: let the function be V : B → R, indicating that it accepts as input any board from the set of legal board states B and produces as output a real score, assigning higher scores to better board states.
If the system can successfully learn such a target function V, then it can easily use it to select the best move from any board position.
Let us therefore define the target value V(b) for an arbitrary board state b in B, as follows:
1. if b is a final board state that is won, then V(b) = 100
2. if b is a final board state that is lost, then V(b) = -100
3. if b is a final board state that is drawn, then V(b) = 0
4. if b is not a final state in the game, then V(b) = V(b'), where b' is the best final board state that can be achieved starting from b and playing optimally until the end of the game.
Definition (4) is recursive: to determine the value V(b) for a particular board state, it searches ahead for the optimal line of play, all the way to the end of the game. So this definition is not efficiently computable by our checkers-playing program; we say that it is a nonoperational definition.
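A minimal Python sketch of this recursive definition, assuming a hand-built toy game tree rather than real checkers (the boards, moves, and outcomes below are hypothetical), shows why computing V requires searching all the way to the end of the game:

# A runnable toy illustrating the recursive definition of V(b). A "board"
# here is just (outcome, successors); real checkers boards and moves are
# assumed away for illustration.
def V(board, our_move=True):
    outcome, successors = board
    if not successors:                      # final board state
        return {"win": 100, "loss": -100, "draw": 0}[outcome]
    # Optimal play by both sides: we maximise the score, the opponent
    # minimises it. Searching to the end of the game like this is exactly
    # why the definition is nonoperational for a real game.
    values = [V(b, not our_move) for b in successors]
    return max(values) if our_move else min(values)

leaf_win, leaf_loss, leaf_draw = ("win", []), ("loss", []), ("draw", [])
root = (None, [(None, [leaf_win, leaf_loss]),   # opponent would pick the loss
               (None, [leaf_draw])])            # opponent can only draw
print(V(root))  # 0: with optimal play the best reachable final state is a draw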
The goal of learning, in this case, is to discover an operational
description of V ; that is, a description that can be used by the checkers-
playing program to evaluate states and select moves within realistic time
bounds.
It may be very difficult in general to learn such an operational form of V perfectly. We expect learning algorithms to acquire only some approximation to the target function, and for this reason the process of learning the target function is often called function approximation. We denote the learned approximation by ^V.
1.3.3 Choosing a Representation for the Target Function
Next we choose a representation that the learning program will use to describe the function it will learn. A simple choice is a linear combination of board features, for example: x1 = the number of black pieces on the board, x2 = the number of red pieces, x3 = the number of black kings, x4 = the number of red kings, x5 = the number of black pieces threatened by red, x6 = the number of red pieces threatened by black. The learned evaluation function is then
^V(b) = w0 + w1*x1 + w2*x2 + w3*x3 + w4*x4 + w5*x5 + w6*x6
where w0 through w6 are numerical coefficients (weights) to be chosen by the learning algorithm.
1.3.4 Choosing an approximation algorithm for the Target Function
Generating training data — To train our learning program, we need a set of training examples, each describing a specific board state b and the training value V_train(b) for b. Each training example is an ordered pair <b, V_train(b)>. For example, a training example may be <(x1 = 3, x2 = 0, x3 = 1, x4 = 0, x5 = 0, x6 = 0), +100>. This is an example where black has won the game, since x2 = 0 means red has no remaining pieces. However, such clean values of V_train(b) can be obtained only for board states b that are a clear win, loss, or draw. In that case, assigning a training value V_train(b) is direct, as those boards correspond to direct training experience.
But in the case of indirect training experience, assigning a training value V_train(b) for the intermediate boards is difficult. In such a case, the training values are updated using temporal difference learning. Temporal difference (TD) learning is a concept central to reinforcement learning, in which learning happens through the iterative correction of estimated returns towards a more accurate target return.
Let Successor(b) denote the next board state following b for which it is again the program's turn to move, and let ^V be the learner's current approximation to V. Using this information, assign the training value V_train(b) for any intermediate board state b as follows:
V_train(b) ← ^V(Successor(b))
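A minimal Python sketch of this rule, assuming the linear approximation ^V over board features x1..x6 from the representation above and a hypothetical successor function:

# Sketch: V_train(b) <- ^V(Successor(b)), with ^V a linear evaluation
# function over six board features. 'successor' is a hypothetical helper
# returning the features of the next state in which it is again the
# program's turn to move.
def v_hat(features, weights):
    # ^V(b) = w0 + w1*x1 + ... + w6*x6
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

def training_value(board, weights, successor):
    # Estimate for an intermediate board b: the current estimate of its successor.
    return v_hat(successor(board), weights)

# Toy usage with made-up weights and a fixed successor, for illustration only.
w = [0.5, 1.0, -1.0, 2.0, -2.0, -0.5, 0.5]
print(training_value(None, w, lambda b: (3, 0, 1, 0, 0, 0)))  # 5.5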
1.3.5 Final Design for Checkers Learning system
The final design of our checkers learning system can be naturally
described by four distinct program modules that represent the central
components in many learning systems.
1. The Performance System — Takes a new board as input and outputs a trace
of the game it played against itself.
2. The Critic — Takes the trace of a game as an input and outputs a set of
training examples of the target function.
3. The Generalizer — Takes training examples as input and outputs a
hypothesis that estimates the target function. Good generalization to new
cases is crucial.
4. The Experiment Generator — Takes the current hypothesis (currently
learned function) as input and outputs a new problem (an initial board state)
for the performance system to explore.
A common choice is the LMS (least mean squares) training rule. This algorithm for fitting the weights achieves its goal by iteratively tuning the weights, adding a correction to each weight each time the hypothesized evaluation function predicts a value that differs from the training value.
This algorithm works well when the hypothesis representation considered
by the learner defines a continuously parameterized space of potential
hypotheses.
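A sketch of this weight-update rule in Python; eta is an assumed small learning rate, and the feature vector x carries a leading 1 so that the constant weight w0 is updated like the other weights:

# Sketch of the LMS weight-update rule:
#   error = V_train(b) - ^V(b);  w_i <- w_i + eta * error * x_i
def lms_update(w, x, v_train, eta=0.1):
    error = v_train - sum(wi * xi for wi, xi in zip(w, x))
    return [wi + eta * error * xi for wi, xi in zip(w, x)]

w = [0.0] * 7                        # weights w0..w6, initially zero
x = (1, 3, 0, 1, 0, 0, 0)            # leading 1 for w0, then features x1..x6
w = lms_update(w, x, v_train=100)    # prediction was 0, so error = 100
print(w)  # [10.0, 30.0, 0.0, 10.0, 0.0, 0.0, 0.0]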
Learning algorithms search a hypothesis space defined by some underlying representation (e.g., linear functions, logical descriptions, decision trees, artificial neural networks). These different hypothesis representations are appropriate for learning different kinds of target functions. For each of these hypothesis representations, the corresponding learning algorithm takes advantage of a different underlying structure to organize the search through the hypothesis space.
Issues in Machine Learning:
What algorithms exist for learning general target functions from specific
training examples? In what settings will particular algorithms converge
to the desired function, given sufficient training data? Which algorithms
perform best for which types of problems and representations?
How much training data is sufficient? What general bounds can be
found to relate the confidence in learned hypotheses to the amount of
training experience and the character of the learner's hypothesis space?
When and how can prior knowledge held by the learner guide the
process of generalizing from examples? Can prior knowledge be helpful
even when it is only approximately correct?
What is the best strategy for choosing a useful next training
experience, and how does the choice of this strategy alter the complexity
of the learning problem?
What is the best way to reduce the learning task to one or more
function approximation problems? Put another way, what specific
functions should the system attempt to learn? Can this process itself be
automated?
How can the learner automatically alter its representation to
improve its ability to represent and learn the target function?
CONCEPT LEARNING
Concept learning, also known as category learning, is "the search for and listing of attributes that can be used to distinguish exemplars from non-exemplars of various categories". It is acquiring the definition of a general category from given sample positive and negative training examples of the category.
Much of human learning involves acquiring general concepts from past experiences. For
example, humans identify different vehicles among all the vehicles based on specific sets of features
defined over a large set of features. This special set of features differentiates the subset of cars in a set
of vehicles. This set of features that differentiate cars can be called a concept.
Similarly, machines can learn from concepts to identify whether an object belongs to a specific
category by processing past/training data to find a hypothesis that best fits the training examples.
Target concept:
The set of items/objects over which the concept is defined is called the set of instances and is denoted by X. The concept or function to be learned is called the target concept and is denoted by c. It can be seen as a boolean-valued function defined over X and can be represented as c : X → {0, 1}.
If we have a set of training examples with specific features of the target concept c, the problem faced by the learner is to estimate c from the training data.
H denotes the set of all possible hypotheses that the learner may consider regarding the identity of the target concept. The goal of the learner is to find a hypothesis h in H such that h(x) = c(x) for all x in X.
An algorithm that supports concept learning requires:
1. Training data (past experiences to train our models)
2. Target concept (hypothesis to identify data objects)
3. Actual data objects (for testing the models)
The hypothesis space
Each of the data objects can itself be viewed as a hypothesis; such a hypothesis is maximally specific because it covers only that one sample. Generally, we add some notation to express more hypotheses. We have the following notations:
1. <ϕ, ϕ, ϕ, ϕ> (represents a hypothesis that rejects all)
2. <?, ?, ?, ?> (accepts all)
3. <true, false, ?, ?> (accepts some; the specific values here are illustrative)
The hypothesis <ϕ, ϕ, ϕ, ϕ> will reject all the data samples, and the hypothesis <?, ?, ?, ?> will accept all the data samples. The ? notation indicates that the value of that specific feature does not affect the result.
The total number of possible hypotheses is (3 × 3 × 3 × 3) + 1 = 82: each of the four features can be true, false, or ?, plus one hypothesis that rejects all (<ϕ, ϕ, ϕ, ϕ>).
General to Specific
Many machine learning algorithms rely on the concept of a general-to-specific ordering of hypotheses. Consider the two hypotheses:
1. h1 = < true, true, ?, ? >
2. h2 = < true, ? , ? , ? >
Any instance classified positive by h1 will also be classified positive by h2. We can say that h2 is more general than h1. Using this concept, we can find a general hypothesis that can be defined over the entire dataset X.
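A small Python sketch of this more-general-than-or-equal-to test for such conjunctive hypotheses (ignoring the reject-all constraint for simplicity):

# h2 >=g h1 iff every instance satisfying h1 also satisfies h2,
# where "?" accepts any value for that attribute.
def more_general_or_equal(h2, h1):
    return all(c2 == "?" or c2 == c1 for c2, c1 in zip(h2, h1))

h1 = ("true", "true", "?", "?")
h2 = ("true", "?", "?", "?")
print(more_general_or_equal(h2, h1))  # True: h2 is more general than h1
print(more_general_or_equal(h1, h2))  # False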
To find a single hypothesis defined on X, we can use the more-general-than partial ordering. One way to do this is to start with the most specific hypothesis from H and generalize it each time it fails to classify an observed positive training example as positive. This is the idea behind the Find-S algorithm:
1. The first step in the Find-S algorithm is to start with the most specific hypothesis, which can be denoted by h ← <ϕ, ϕ, ϕ, ϕ>.
2. This step involves picking up next training sample and applying Step 3 on the sample.
3. The next step involves observing the data sample. If the sample is negative, the hypothesis
remains unchanged and we pick the next training sample by processing Step 2 again. Otherwise, we
process Step 4.
4. If the sample is positive and we find that our initial hypothesis is too specific because it does
not cover the current training sample, then we need to update our current hypothesis. This can be done
by the pairwise conjunction (logical and operation) of the current hypothesis and training sample.
If the current hypothesis is too specific to cover a new positive training sample, the pairwise conjunction produces a minimally more general hypothesis, and we directly replace the existing hypothesis with the new one.
Important Representation:
1. ? indicates that any value is acceptable for this attribute.
2. A single specific value indicates that exactly that value is required.
3. ϕ indicates that no value is acceptable.
4. The most general hypothesis is represented by <?, ?, ?, ?, ?, ?>.
5. The most specific hypothesis is represented by <ϕ, ϕ, ϕ, ϕ, ϕ, ϕ>.
Algorithm:
1. Initialize h to the most specific hypothesis in H.
2. For each positive training instance x: for each attribute constraint a_i in h, if the constraint a_i is satisfied by x, do nothing; otherwise replace a_i in h by the next more general constraint that is satisfied by x.
3. Output the hypothesis h.
Example:
Consider the following data set having the data about which particular seeds
are poisonous.
First, we consider the most specific hypothesis. Hence, our initial hypothesis is:
h = <ϕ, ϕ, ϕ, ϕ, ϕ, ϕ>
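Since the seeds table itself is not reproduced above, the following Python sketch of Find-S uses a small hypothetical dataset of the same shape: attribute tuples labelled "yes" (positive) or "no" (negative).

# A runnable sketch of FIND-S over conjunctive hypotheses, where "0"
# stands for the phi (no value acceptable) constraint.
def find_s(examples, n_attributes):
    h = ["0"] * n_attributes            # most specific hypothesis (all phi)
    for x, label in examples:
        if label != "yes":              # negative samples leave h unchanged
            continue
        for i, value in enumerate(x):
            if h[i] == "0":             # first positive example: copy values
                h[i] = value
            elif h[i] != value:         # conflicting value: generalize to '?'
                h[i] = "?"
    return h

examples = [  # hypothetical data, for illustration only
    (("smooth", "small", "brown"), "yes"),
    (("smooth", "large", "brown"), "yes"),
    (("rough",  "small", "grey"),  "no"),
]
print(find_s(examples, 3))   # -> ['smooth', '?', 'brown']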
Version Spaces
The version space is the subset of hypotheses from H consistent with the training examples D. A hypothesis is complete if it covers all positive examples, and consistent if it covers no negative examples.
1.7.1 Representation
The CANDIDATE-ELIMINATION algorithm finds all describable hypotheses that are consistent with the observed training examples. In order to define this algorithm precisely, we begin with a few basic definitions. First, let us say that a hypothesis is consistent with the training examples if it correctly classifies these examples.
Definition (version space): The version space, denoted VS_{H,D}, with respect to hypothesis space H and training examples D, is the subset of hypotheses from H consistent with the training examples in D.
1.7.2 The LIST-THEN-ELIMINATE Algorithm
The LIST-THEN-ELIMINATE algorithm first initializes the version space to contain all hypotheses in H and then eliminates any hypothesis found inconsistent with any training example.
Theorem (version space representation theorem): Let X be an arbitrary set of instances and let H be a set of boolean-valued hypotheses defined over X. Let c : X → {0, 1} be an arbitrary target concept defined over X, and let D be an arbitrary set of training examples {<x, c(x)>}. For all X, H, c, and D such that S and G are well defined,
VS_{H,D} = { h ∈ H | (∃ s ∈ S)(∃ g ∈ G) (g ≥g h ≥g s) }
where ≥g denotes the more-general-than-or-equal-to relation.
To prove:
1. Every h satisfying the right-hand side of the above expression is in VS_{H,D}.
2. Every member of VS_{H,D} satisfies the right-hand side of the expression.
Sketch of proof:
1. Let g, h, and s be arbitrary members of G, H, and S respectively, with g ≥g h ≥g s. By the definition of S, s must be satisfied by all positive examples in D; because h ≥g s, h must also be satisfied by all positive examples in D. By the definition of G, g cannot be satisfied by any negative example in D, and because g ≥g h, h cannot be satisfied by any negative example in D. Because h is satisfied by all positive examples in D and by no negative examples in D, h is consistent with D, and therefore h is a member of VS_{H,D}.
2. It can be proven by assuming some h in VS_{H,D} that does not satisfy the right-hand side of the expression, and then showing that this leads to an inconsistency.
1.7.3 CANDIDATE-ELIMINATION Learning Algorithm
Initialize G to the set of maximally general hypotheses in H
Initialize S to the set of maximally specific hypotheses in H
For each training example d, do:
• If d is a positive example
• Remove from G any hypothesis inconsistent with d
• For each hypothesis s in S that is not consistent with d
• Remove s from S
• Add to S all minimal generalizations h of s such that
• h is consistent with d, and some member of G is more general than h
• Remove from S any hypothesis that is more general than another hypothesis in S
• If d is a negative example
• Remove from S any hypothesis inconsistent with d
• For each hypothesis g in G that is not consistent with d
• Remove g from G
• Add to G all minimal specializations h of g such that
• h is consistent with d, and some member of S is more specific than h
• Remove from G any hypothesis that is less general than another hypothesis in G
(The CANDIDATE-ELIMINATION algorithm using version spaces.)
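The following is a compact, runnable Python sketch of the algorithm for conjunctive hypotheses. The attribute domains are assumed to be known in advance, and the illustrative data at the end is the classic EnjoySport training set from Mitchell's Table 2.1.

# Hypotheses are tuples whose entries are an attribute value, "?" (any
# value accepted) or "0" (no value accepted).
def covers(h, x):
    return all(c != "0" and (c == "?" or c == v) for c, v in zip(h, x))

def more_general_or_equal(h2, h1):
    if "0" in h1:                   # h1 covers nothing, so anything >=g it
        return True
    return all(c2 == "?" or c2 == c1 for c2, c1 in zip(h2, h1))

def min_generalization(s, x):
    # Minimal generalization of s that covers the positive example x.
    return tuple(v if c == "0" else (c if c == v else "?")
                 for c, v in zip(s, x))

def min_specializations(g, x, domains):
    # Minimal specializations of g that exclude the negative example x.
    return [g[:i] + (v,) + g[i + 1:]
            for i, c in enumerate(g) if c == "?"
            for v in domains[i] if v != x[i]]

def candidate_elimination(examples, domains):
    n = len(domains)
    S, G = {("0",) * n}, {("?",) * n}
    for x, positive in examples:
        if positive:
            G = {g for g in G if covers(g, x)}
            for s in [s for s in S if not covers(s, x)]:
                S.remove(s)
                h = min_generalization(s, x)
                if any(more_general_or_equal(g, h) for g in G):
                    S.add(h)
            S = {s for s in S if not any(
                s != t and more_general_or_equal(s, t) for t in S)}
        else:
            S = {s for s in S if not covers(s, x)}
            for g in [g for g in G if covers(g, x)]:
                G.remove(g)
                for h in min_specializations(g, x, domains):
                    if any(more_general_or_equal(h, s) for s in S):
                        G.add(h)
            G = {g for g in G if not any(
                g != t and more_general_or_equal(t, g) for t in G)}
    return S, G

# Illustration with the classic EnjoySport examples (Mitchell, Table 2.1).
domains = [("Sunny", "Rainy", "Cloudy"), ("Warm", "Cold"),
           ("Normal", "High"), ("Strong", "Weak"),
           ("Warm", "Cool"), ("Same", "Change")]
examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]
S, G = candidate_elimination(examples, domains)
print(S)  # {('Sunny', 'Warm', '?', 'Strong', '?', '?')}
print(G)  # {('Sunny','?','?','?','?','?'), ('?','Warm','?','?','?','?')}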
1.7.4 Example
Initializing the G boundary set to contain the most general hypothesis, and the S boundary set to contain the most specific (least general) hypothesis:
G0 : < ?, ?, ?, ?, ?, ? >
S0 : < ɸ, ɸ, ɸ, ɸ, ɸ, ɸ >
The first and second training examples (both positive) have the effect of generalizing S to S1 and then further to S2, leaving G unchanged, i.e., G2 = G1 = G0.
Given that there are six attributes that could be specified to specialize G2, why are there only three new hypotheses in G3? For example, the hypothesis h = <?, ?, Normal, ?, ?, ?> is a minimal specialization of G2 that correctly labels the new example as a negative example, but it is not included in G3. The reason this hypothesis is excluded is that it is inconsistent with the previously encountered positive examples.
Consider the fourth training example.
After processing these four examples, the boundary sets S4 and G4 delimit the version space of all hypotheses consistent with the set of incrementally observed training examples.
INDUCTIVE BIAS
As discussed above, the CANDIDATE-ELIMINATION algorithm will converge toward the true target
concept provided it is given accurate training examples and provided its initial hypothesis space contains
the target concept. What if the target concept is not contained in the hypothesis space? Can we avoid this
difficulty by using a hypothesis space that includes every possible hypothesis? How does the size of this
hypothesis space influence the ability of the algorithm to generalize to unobserved instances? How does
the size of the hypothesis space influence the number of training examples that must be observed? These
are fundamental questions for inductive inference in general. Here we examine them in the context of
the CANDIDATE-ELIMINATION algorithm. As we shall see, though, the conclusions we draw from
this analysis will apply to any concept learning system that outputs any hypothesis consistent with the
training data.
2.7.2 An Unbiased Learner
An obvious solution to the problem of assuring that the target concept is in the hypothesis space H is to choose a hypothesis space capable of representing every teachable concept; that is, let H be the power set of X (allowing arbitrary disjunctions, conjunctions, and negations). Unfortunately, the concept learning algorithm is now completely unable to generalize beyond the observed examples! To see why, suppose we present three positive examples (x1, x2, x3) and two negative examples (x4, x5) to the learner. At this point, the S boundary of the version space will contain the hypothesis that is just the disjunction of the positive examples, S : { (x1 ∨ x2 ∨ x3) }, because this is the most specific possible hypothesis that covers these three examples. Similarly, the G boundary will consist of the hypothesis that rules out only the observed negative examples, G : { ¬(x4 ∨ x5) }.
The problem here is that with this very expressive hypothesis representation,
the S boundary will always be simply the disjunction of the observed positive
examples, while the G boundary will always be the negated disjunction of the
observed negative examples. Therefore, the only examples that will be unambiguously classified by S and
G are the observed training examples themselves. In
order to converge to a single, final target concept, we will have to present every
single instance in X as a training example!
It might at first seem that we could avoid this difficulty by simply using the
partially learned version space and by taking a vote among the members of the
version space as discussed in Section 2.6.3. Unfortunately, the only instances that
will produce a unanimous vote are the previously observed training examples. For all the other instances, taking a vote will be futile: each unobserved instance will be classified positive by precisely half the hypotheses in the version space and will be classified negative by the other half (why?). To see the reason, note that when H is the power set of X and x is some previously unobserved instance, then for any hypothesis h in the version space that covers x, there will be another hypothesis h' in the power set that is identical to h except for its classification of x. And of course if h is in the version space, then h' will be as well, because it agrees with h on all the observed training examples.
2.7.3 The Futility of Bias-Free Learning
The above discussion illustrates a fundamental property of inductive inference:
a learner that makes no a priori assumptions regarding the identity of the target concept has no rational
basis for classifying any unseen instances. In fact,
the only reason that the CANDIDATE-ELIMINATION algorithm was able to generalize beyond the
observed training examples in our original formulation of the
EnjoySport task is that it was biased by the implicit assumption that the target
concept could be represented by a conjunction of attribute values. In cases where
this assumption is correct (and the training examples are error-free), its classification of new instances
will also be correct. If this assumption is incorrect, however,
it is certain that the CANDIDATE-ELIMINATION algorithm will misclassify at least some instances from X.
Because inductive learning requires some form of prior assumptions, or inductive bias, we will find it useful to characterize different learning approaches by the inductive bias they employ. Let us define this notion of inductive bias more precisely. The key idea we wish to capture here is the policy by which the learner generalizes beyond the observed training data, to infer the classification of new instances. Therefore, consider the general setting in which an arbitrary learning algorithm L is provided an arbitrary set of training data Dc = {<x, c(x)>} of some arbitrary target concept c. After training, L is asked to classify a new instance xi. Let L(xi, Dc) denote the classification (e.g., positive or negative) that L assigns to xi after learning from the training data Dc. We can describe this inductive inference step performed by L as follows:
(Dc ∧ xi) ≻ L(xi, Dc)
where the notation y ≻ z indicates that z is inductively inferred from y. For example, if we take L to be the CANDIDATE-ELIMINATION algorithm, Dc to be the training data from Table 2.1, and xi to be the first instance from Table 2.6, then the inductive inference performed in this case concludes that L(xi, Dc) = (EnjoySport = yes).
Because L is an inductive learning algorithm, the result L(xi, Dc) that it infers will not in general be provably correct; that is, the classification L(xi, Dc) need not follow deductively from the training data Dc and the description of the new instance xi. However, it is interesting to ask what additional assumptions could be added to Dc ∧ xi so that L(xi, Dc) would follow deductively. We define the inductive bias of L as this set of additional assumptions. More precisely, we define the inductive bias of L to be the set of assumptions B such that for all new instances xi
(B ∧ Dc ∧ xi) ⊢ L(xi, Dc)
where the notation y ⊢ z indicates that z follows deductively from y (i.e., that z is provable from y). Thus, we define the inductive bias of a learner as the set of additional assumptions B sufficient to justify its inductive inferences as deductive inferences. (The term inductive bias here is not to be confused with the term estimation bias commonly used in statistics; estimation bias is discussed in Chapter 5.) To summarize,
Definition: Consider a concept learning algorithm L for the set of instances X. Let c be an arbitrary concept defined over X, and let Dc = {<x, c(x)>} be an arbitrary set of training examples of c. Let L(xi, Dc) denote the classification assigned to the instance xi by L after training on the data Dc. The inductive bias of L is any minimal set of assertions B such that for any target concept c and corresponding training examples Dc:
(∀xi ∈ X) [(B ∧ Dc ∧ xi) ⊢ L(xi, Dc)]
What, then, is the inductive bias of the CANDIDATE-ELIMINATION algorithm? To answer this, let us specify L(xi, Dc) exactly for this algorithm: given a set of data Dc, the CANDIDATE-ELIMINATION algorithm will first compute the version space VS_{H,Dc}, then classify the new instance xi by a vote among the hypotheses in this version space. Here let us assume that it will output a classification for xi only if this vote among version space hypotheses is unanimously positive or negative, and that it will not output a classification otherwise. Given this definition of L(xi, Dc) for the CANDIDATE-ELIMINATION algorithm, what is its inductive bias? It is simply the assumption c ∈ H. Given this assumption, each inductive inference performed by the CANDIDATE-ELIMINATION algorithm can be justified deductively.
To see why the classification L(xi, Dc) follows deductively from B = {c ∈ H}, together with the data Dc and description of the instance xi, consider the following argument. First, notice that if we assume c ∈ H then it follows deductively that c ∈ VS_{H,Dc}. This follows from c ∈ H, from the definition of the version space VS_{H,Dc} as the set of all hypotheses in H that are consistent with Dc, and from our definition of Dc = {<x, c(x)>} as training data consistent with the target concept c. Second, recall that we defined the classification L(xi, Dc) to be the unanimous vote of all hypotheses in the version space. Thus, if L outputs the classification L(xi, Dc), it must be the case that every hypothesis in VS_{H,Dc} also produces this classification, including the hypothesis c ∈ VS_{H,Dc}. Therefore c(xi) = L(xi, Dc). To summarize, the CANDIDATE-ELIMINATION algorithm defined in this fashion can be characterized by the following bias:
Inductive bias of the CANDIDATE-ELIMINATION algorithm: The target concept c is contained in the given hypothesis space H.
FIGURE 2.8: Modeling inductive systems by equivalent deductive systems. The input-output behavior of the CANDIDATE-ELIMINATION algorithm using a hypothesis space H is identical to that of a deductive theorem prover utilizing the assertion "H contains the target concept." This assertion is therefore called the inductive bias of the CANDIDATE-ELIMINATION algorithm. Characterizing inductive systems by their inductive bias allows modeling them by their equivalent deductive systems. This provides a way to compare inductive systems according to their policies for generalizing beyond the observed training data. [The diagram itself is not reproduced: it shows an inductive system (CANDIDATE-ELIMINATION using hypothesis space H) and an equivalent deductive system (a theorem prover given the added assertion "H contains the target concept" as the inductive bias made explicit); each takes training examples and a new instance as inputs and outputs a classification of the new instance, or "don't know".]
Figure 2.8 summarizes the situation schematically. The inductive CANDIDATE-ELIMINATION algorithm at the top of the figure takes two inputs: the training examples and a new instance to be classified. At the bottom of the figure, a deductive
theorem prover is given these same two inputs plus the assertion "H contains the
target concept." These two systems will in principle produce identical outputs for
every possible input set of training examples and every possible new instance in
X. Of course the inductive bias that is explicitly input to the theorem prover is
only implicit in the code of the CANDIDATE-ELIMINATION algorithm. In a sense, it
exists only in the eye of us beholders. Nevertheless, it is a perfectly well-defined
set of assertions.
One advantage of viewing inductive inference systems in terms of their
inductive bias is that it provides a nonprocedural means of characterizing their
policy for generalizing beyond the observed data. A second advantage is that it
allows comparison of different learners according to the strength of the inductive
bias they employ. Consider, for example, the following three learning algorithms,
which are listed from weakest to strongest bias.
1. ROTE-LEARNER: Learning corresponds simply to storing each observed training example in
memory. Subsequent instances are classified by looking them
up in memory. If the instance is found in memory, the stored classification
is returned. Otherwise, the system refuses to classify the new instance.
2. CANDIDATE-ELIMINATION algorithm: New instances are classified only in the
case where all members of the current version space agree on the classification. Otherwise, the system
refuses to classify the new instance.
3. FIND-S: This algorithm, described earlier, finds the most specific hypothesis
consistent with the training examples. It then uses this hypothesis to classify
all subsequent instances.
The ROTE-LEARNER has no inductive bias. The classifications it provides
for new instances follow deductively from the observed training examples, with
no additional assumptions required. The CANDIDATE-ELIMINATION algorithm has a
stronger inductive bias: that the target concept can be represented in its hypothesis
space. Because it has a stronger bias, it will classify some instances that the ROTELEARNER will not.
Of course the correctness of such classifications will depend
completely on the correctness of this inductive bias. The FIND-S algorithm has
an even stronger inductive bias. In addition to the assumption that the target
concept can be described in its hypothesis space, it has an additional inductive bias assumption: that all instances are negative instances unless the opposite is entailed by its other knowledge.
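A tiny Python sketch makes the contrast concrete: a bias-free rote learner can only answer for instances it has literally stored, refusing any inductive leap (the data here is hypothetical):

# A bias-free ROTE-LEARNER: it stores training examples and refuses to
# classify anything it has not seen before (no inductive leap).
def rote_learner(training):
    memory = dict(training)                     # instance -> classification
    def classify(x):
        return memory.get(x, "refuses to classify")
    return classify

classify = rote_learner([(("Sunny", "Warm"), "yes")])
print(classify(("Sunny", "Warm")))   # 'yes' -- previously stored
print(classify(("Rainy", "Cold")))   # 'refuses to classify' -- unseen instance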
As we examine other inductive inference methods, it is useful to keep in
mind this means of characterizing them and the strength of their inductive bias.
More strongly biased methods make more inductive leaps, classifying a greater
proportion of unseen instances. Some inductive biases correspond to categorical
assumptions that completely rule out certain concepts, such as the bias "the hypothesis space H includes
the target concept." Other inductive biases merely rank
order the hypotheses by stating preferences such as "more specific hypotheses are
preferred over more general hypotheses." Some biases are implicit in the learner
and are unchangeable by the learner, such as the ones we have considered here.
In Chapters 11 and 12 we will see other systems whose bias is made explicit as
a set of assertions represented and manipulated by the learner.