KKR & KSR INSTITUTE OF TECHNOLOGY AND SCIENCES

(Autonomous)
(Accredited by NBA & NAAC with Grade “A” and Affiliated to JNTUK-Kakinada)

DEPARTMENT OF INFORMATION TECHNOLOGY

MACHINE LEARNING
LECTURE NOTES
B.TECH IV YEAR – I SEM
Machine Learning (20CI5T02)
SYLLABUS
Course Objectives:
The course is introduced for students to
• Gain knowledge about basic concepts of Machine Learning
• Study different learning algorithms
• Learn about evaluation of learning algorithms
• Learn about artificial neural networks

Course Outcomes:
• Identify machine learning techniques suitable for a given problem
• Solve problems using various machine learning techniques
• Apply dimensionality reduction techniques
• Design applications using machine learning techniques

UNIT I
Introduction: Definition of learning systems, Goals and applications of machine learning, Aspects of developing a
learning system: training data, concept representation, function approximation. Inductive Classification: The
concept learning task, Concept learning as search through a hypothesis space, General-to-specific ordering of
hypotheses, Finding maximally specific hypotheses, Version spaces and the candidate elimination algorithm,
Learning conjunctive concepts, The importance of inductive bias.

UNIT II
Decision Tree Learning: Representing concepts as decision trees, Recursive induction of decision trees, Picking the best splitting
attribute: entropy and information gain, Searching for simple trees and computational complexity, Occam's razor, Overfitting,
noisy data, and pruning. Experimental Evaluation of Learning Algorithms: Measuring the accuracy of learned hypotheses.
Comparing learning algorithms: cross-validation, learning curves, and statistical hypothesis testing.

UNIT III
Computational Learning Theory: Models of learnability: learning in the limit; probably approximately correct (PAC) learning.
Sample complexity for infinite hypothesis spaces, Vapnik-Chervonenkis dimension. Rule Learning: Propositional and First-Order,
Translating decision trees into rules, Heuristic rule induction using separate and conquer and information gain, First-order Horn-
clause induction (Inductive Logic Programming) and Foil, Learning recursive rules, Inverse resolution, Golem, and Progol

UNIT IV
Artificial Neural Networks: Neurons and biological motivation, Linear threshold units. Perceptrons: representational limitation
and gradient descent training, Multilayer networks and backpropagation, Hidden layers and constructing intermediate,
distributed representations. Overfitting, learning network structure, recurrent networks. Support Vector Machines: Maximum
margin linear separators. Quadratic programming solution to finding maximum margin separators. Kernels for learning non-
linear functions.

UNIT V
Bayesian Learning: Probability theory and Bayes rule. Naive Bayes learning algorithm. Parameter smoothing. Generative vs.
discriminative training. Logistic regression. Bayes nets and Markov nets for representing dependencies. Instance-Based
Learning: Constructing explicit generalizations versus comparing to past specific examples. k-Nearest-neighbor algorithm. Case-
based learning

Text Books:
1) T.M. Mitchell, “Machine Learning”, McGraw-Hill, 1997.
2) Saikat Dutt, Subramanian Chandramouli, Amit Kumar Das, “Machine Learning”, Pearson, 2019.

Reference Books:
1) Ethem Alpaydin, “Introduction to Machine Learning”, MIT Press, 2004.
2) Stephen Marsland, “Machine Learning -An Algorithmic Perspective”, Second Edition, Chapman and Hall/CRC Machine
Learning and Pattern Recognition Series, 2014.
3) Andreas C. Müller and Sarah Guido, “Introduction to Machine Learning with Python: A Guide for Data Scientists”, O'Reilly.
UNIT I
Introduction: Definition of learning systems, Goals and applications of machine learning,
Aspects of developing a learning system: training data, concept representation, function
approximation. Inductive Classification: The concept learning task, Concept learning as
search through a hypothesis space, General-to-specific ordering of hypotheses, Finding
maximally specific hypotheses, Version spaces and the candidate elimination algorithm,
Learning conjunctive concepts, The importance of inductive bias.
_______________________________________________________________________________________

1 INTRODUCTION

Learning is a very wide domain. Machine learning is a field of study that gives computers the capability to learn without being explicitly programmed.
Machine learning is programming computers to optimize or improve their performance criterion using example data or past experience. Based on the data, a machine learning system adapts to the user.
We have a model defined up to some parameters, and learning is the
execution of a computer program to optimize the parameters of the model using
the training data or past experience.
The model may be predictive to make predictions in the future, or
descriptive to gain knowledge from data, or both.
Arthur Samuel, an early American leader in the field of computer gaming
and artificial intelligence, coined the term “Machine Learning” in 1959 while at
IBM. He defined machine learning as “the field of study that gives computers the
ability to learn without being explicitly programmed.”

1.1 Definition of learning systems


Definition
A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its performance at tasks T,
as measured by P, improves with experience E.
A computer program which learns from experience is called a
machine learning program or simply a learning program. Such a program is
sometimes also referred to as a learner.

Examples
i) Handwriting recognition learning problem
• Task T: Recognising and classifying handwritten words within images
• Performance P: Percent of words correctly classified
• Training experience E: A dataset of handwritten words with given
classifications
ii) A robot driving learning problem
• Task T: Driving on highways using vision sensors
• Performance measure P: Average distance traveled before an error
• Training experience E: A sequence of images and steering commands recorded while observing a human driver

iii) A checkers learning problem
• Task T: Playing checkers
• Performance measure P: Percent of games won against opponents (i.e., making moves that win games)
• Training experience E: Playing practice games against itself

1.2 GOALS AND APPLICATIONS OF MACHINE LEARNING

1. Image Recognition:
Image recognition is one of the most common applications of machine
learning. It is used to identify objects, persons, places, digital images, etc.
A popular use case of image recognition and face detection is the automatic friend tagging suggestion:

Facebook provides us a feature of automatic friend tagging suggestions. Whenever we upload a photo with our Facebook friends, we automatically get a tagging suggestion with names, and the technology behind this is machine learning's face detection and recognition algorithm.

It is based on the Facebook project named "DeepFace," which is responsible for face recognition and person identification in pictures.

2. Speech Recognition
While using Google, we get an option of "Search by voice"; this comes under speech recognition, and it is a popular application of machine learning.

Speech recognition is a process of converting voice instructions into text, and it is also known as "speech to text" or "computer speech recognition." At present, machine learning algorithms are widely used in various applications of speech recognition. Google Assistant, Siri, Cortana, and Alexa use speech recognition technology to follow voice instructions.

3. Traffic prediction:
If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with the shortest route and predicts the traffic conditions.

It predicts the traffic conditions, such as whether traffic is clear, slow-moving, or heavily congested, with the help of two kinds of information:

o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time of day

Everyone who uses Google Maps is helping to make the app better. It takes information from the user and sends it back to its database to improve its performance.

4. Product recommendations:
Machine learning is widely used by various e-commerce and entertainment
companies such as Amazon, Netflix, etc., for product recommendation to
the user. Whenever we search for some product on Amazon, we start getting advertisements for the same product while surfing the internet in the same browser, and this is because of machine learning.

Google understands the user's interest using various machine learning algorithms and suggests products as per customer interest.

Similarly, when we use Netflix, we get recommendations for entertainment series, movies, etc., and this is also done with the help of machine learning.

5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars.
Machine learning plays a significant role in self-driving cars. Tesla, the most popular car manufacturing company, is working on self-driving cars. It uses an unsupervised learning method to train the car models to detect people and objects while driving.

6. Email Spam and Malware Filtering:

Whenever we receive a new email, it is filtered automatically as important, normal, or spam. We always receive important mail in our inbox with the important symbol and spam emails in our spam box, and the technology behind this is machine learning. Below are some spam filters used by Gmail:

o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters

Some machine learning algorithms, such as the Multi-Layer Perceptron, Decision Tree, and Naïve Bayes classifiers, are used for email spam filtering and malware detection.

7. Virtual Personal Assistant:


We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri. As the name suggests, they help us find information using our voice instructions. These assistants can help us in various ways just by our voice instructions, such as playing music, calling someone, opening an email, scheduling an appointment, etc.

These virtual assistants use machine learning algorithms as an important part. They record our voice instructions, send them over to a server on the cloud, decode them using ML algorithms, and act accordingly.

8. Online Fraud Detection:


Machine learning is making our online transactions safe and secure by detecting fraudulent transactions. Whenever we perform an online transaction, there are various ways a fraudulent transaction can take place, such as fake accounts, fake IDs, and money being stolen in the middle of a transaction. To detect this, a feed-forward neural network helps us by checking whether a transaction is genuine or fraudulent.

For each genuine transaction, the output is converted into hash values, and these values become the input for the next round. Each genuine transaction follows a specific pattern, which changes for a fraudulent transaction; hence the network detects fraud and makes our online transactions more secure.

9. Stock Market trading:
Machine learning is widely used in stock market trading. In the stock market, there is always a risk of ups and downs in share prices, so machine learning's long short-term memory (LSTM) neural network is used for the prediction of stock market trends.

10. Medical Diagnosis:


In medical science, machine learning is used for disease diagnosis. With it, medical technology is growing very fast and is able to build 3D models that can predict the exact position of lesions in the brain.

It helps in finding brain tumors and other brain-related diseases easily.


11. Automatic Language Translation:
Nowadays, if we visit a new place and we are not aware of the language, it is not a problem at all; machine learning helps us here too, by converting the text into languages we know. Google's GNMT (Google Neural Machine Translation) provides this feature: it is a neural machine learning system that translates text into our familiar language, and this is called automatic translation.

The technology behind automatic translation is a sequence-to-sequence learning algorithm, which is used together with image recognition to translate text from one language to another.

1.3 Aspects of Learning System


For any learning system, we must know the three elements — T (Task), P (Performance measure), and E (Training experience). At a high level, the learning process looks as follows.

The learning process starts with task T, performance measure P, and training experience E, and the objective is to find an unknown target function. The target function is the exact knowledge to be learned from the training experience, and it is unknown. For example, in the case of credit approval, the learning system will have customer application records as experience, and the task would be to classify whether a given customer application is eligible for a loan. So in this case, the training examples can be represented as (x1, y1), (x2, y2), ..., (xn, yn), where X represents the customer application details and y represents the status of credit approval.

So the target function to be learned in the credit approval learning system is a mapping function f : X → y. This function represents the exact knowledge defining the relationship between the input variable X and the output variable y.

Design of a learning system

We have now looked into the learning process and understood the goal of learning. When we want to design a learning system that follows the learning process, we need to consider a few design choices. The design choices will be to decide the following key components:

1. Choosing the training experience
2. Choosing the target function
3. Choosing a representation for the target function
4. Choosing an approximation algorithm for the target function
5. The final design

We will look into the game - checkers learning problem and apply the above
design choices. For a checkers learning problem, the three elements will be,

1. Task T: To play checkers
2. Performance measure P: Percent of games won in the tournament
3. Training experience E: A set of games played against itself

1.3.1 Choosing the training experience


During the design of the checkers learning system, the type of training experience available to a learning system will have a significant effect on the success or failure of the learning.
It involves three attributes:
1. The type of experience
2. The degree to which the learner controls the sequence of training examples
3. The distribution of examples

1. Choose the type of experience:

The type of training experience has an impact on the success or failure of learning; it can provide direct or indirect feedback.
Direct or indirect training experience — In the case of direct training experience, individual board states and the correct move for each board state are given. Direct feedback is the simpler case.
In the case of indirect training experience, the move sequences for a game and the final result (win, loss, or draw) are given for a number of games. How to assign credit or blame to individual moves is the credit assignment problem. The credit assignment problem can be particularly difficult because a game can be lost even when early moves are optimal.

2. Degree :
A second attribute of training experience is the degree to which the learner
controls the sequence of training examples
Teacher or Not — Supervised — The training experience will be labeled,
which means, all the board states will be labeled with the correct move. So the
learning takes place in the presence of a supervisor or a teacher.
Unsupervised — The training experience will be unlabeled, which means, all
the board states will not have the moves. So the learner generates random
games and plays against itself with no supervision or teacher involvement.
Semi-supervised — The learner generates game states and asks the teacher for help in finding the correct move if the board state is confusing.
3. Is the training experience good — Do the training examples represent the
distribution of examples over which the final system performance will be
measured? Performance is best when training examples and test examples are drawn from the same or a similar distribution.

The checkers player learns by playing against itself, so its experience is indirect; it may not encounter moves that are common in human expert play. Once proper training experience is available, the next design step is choosing the target function.

1.3.2 Choosing the Target Function


When you are playing the checkers game, at any moment of time, you
make a decision on choosing the best move from different possibilities. You
think and apply the learning that you have gained from the experience. Here the
learning is, for a specific board, you move a checker such that your board state
tends towards the winning situation. Now the same learning has to be defined in
terms of the target function.

Here there are 2 considerations — direct and indirect experience.

1.3.2.1 In the case of direct experience, the checkers learning system needs only to learn how to choose the best move from a large search space. We need to find a target function that will help us choose the best move among the alternatives. Let us call this function ChooseMove and use the notation ChooseMove : B → M to indicate that this function accepts as input any board from the set of legal board states B and produces as output some move from the set of legal moves M.
1.3.2.2 When there is only indirect experience, it becomes difficult to learn such a function. How about assigning a real score to each board state?

So let the function be V : B → R, indicating that it accepts as input any board from the set of legal board states B and produces as output a real score. This function assigns higher scores to better board states.

If the system can successfully learn such a target function V, then it can easily
use it to select the best move from any board position.
Let us therefore define the target value V(b) for an arbitrary board state b in B, as follows:
1. if b is a final board state that is won, then V(b) = 100
2. if b is a final board state that is lost, then V(b) = -100
3. if b is a final board state that is drawn, then V(b) = 0
4. if b is not a final state in the game, then V(b) = V(b′), where b′ is the best final board state that can be achieved starting from b and playing optimally until the end of the game.

Definition (4) is recursive: to determine the value V(b) for a particular board state, it requires searching ahead for the optimal line of play, all the way to the end of the game. Since this definition is not efficiently computable by our checkers-playing program, we say that it is a nonoperational definition.
The goal of learning, in this case, is to discover an operational description of V; that is, a description that can be used by the checkers-playing program to evaluate states and select moves within realistic time bounds.
It may be very difficult in general to learn such an operational form of V perfectly. We expect learning algorithms to acquire only some approximation V̂ to the target function.

1.3.3 Choosing a representation for the Target Function


Now that we have specified the ideal target function V, we must choose a representation that the learning program will use to describe the function V̂ that it will learn. As with earlier design choices, we again have many options.
We could, for example, allow the program to represent V̂ using a large table with a distinct entry specifying the value for each distinct board state. Or we could allow it to represent V̂ using a collection of rules that match against features of the board state, or a quadratic polynomial function of predefined board features, or an artificial neural network.
In general, this choice of representation involves a crucial tradeoff. On
one hand, we wish to pick a very expressive representation to allow representing
as close an approximation as possible to the ideal target function V.
On the other hand, the more expressive the representation, the more
training data the program will require in order to choose among the alternative
hypotheses it can represent. To keep the discussion brief, let us choose a simple
representation:
for any given board state, the function V̂ will be calculated as a linear combination of the following board features:
• x1(b) — number of black pieces on board b
• x2(b) — number of red pieces on b
• x3(b) — number of black kings on b
• x4(b) — number of red kings on b
• x5(b) — number of red pieces threatened by black (i.e., which can be taken on black's next turn)
• x6(b) — number of black pieces threatened by red

V̂(b) = w0 + w1·x1(b) + w2·x2(b) + w3·x3(b) + w4·x4(b) + w5·x5(b) + w6·x6(b)

where w0 through w6 are numerical coefficients, or weights, to be obtained by the learning algorithm. Weights w1 to w6 determine the relative importance of the different board features, and w0 is an additive constant.
Specification of the machine learning problem at this point — So far we have chosen the type of training experience, the target function, and its representation. The checkers learning task can be summarized as below.
• Task T: Play checkers
• Performance measure P: Percent of games won in the world tournament
• Training experience E: Opportunity to play against itself
• Target function: V : Board → R
• Target function representation: V̂(b) = w0 + w1·x1(b) + w2·x2(b) + w3·x3(b) + w4·x4(b) + w5·x5(b) + w6·x6(b)
The first three items above correspond to the specification of the learning task, whereas the final two items constitute design choices for the implementation of the learning program.
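This representation is easy to make concrete. Below is a minimal Python sketch of the linear evaluation function V̂(b); extracting x1..x6 from an actual board depends on the board encoding, which the notes leave unspecified, so the feature values and weights in the example are illustrative only.

```python
# Minimal sketch of the linear evaluation function V_hat(b).
# Assumes the six board features x1(b)..x6(b) have already been extracted
# into a list; actual feature extraction depends on the board encoding.

def v_hat(features, weights):
    """weights = [w0, w1, ..., w6]; features = [x1(b), ..., x6(b)]."""
    score = weights[0]                       # w0, the additive constant
    for w, x in zip(weights[1:], features):
        score += w * x                       # add wi * xi(b)
    return score

# Example: 3 black pieces, 1 black king, nothing threatened.
features = [3, 0, 1, 0, 0, 0]
weights = [0.0, 1.0, -1.0, 2.0, -2.0, 0.5, -0.5]   # illustrative values only
print(v_hat(features, weights))                     # 5.0
```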

1.3.4 Choosing an approximation algorithm for the Target Function

Generating training data — To train our learning program, we need a set of training examples, each describing a specific board state b and the training value V_train(b) for b. Each training example is an ordered pair ⟨b, V_train(b)⟩. For example, a training example may be ⟨(x1 = 3, x2 = 0, x3 = 1, x4 = 0, x5 = 0, x6 = 0), +100⟩. This is an example where black has won the game, since x2 = 0 means red has no remaining pieces.
However, such clean values of V_train(b) can be obtained only for board states b that are a clear win, loss, or draw. Assigning a training value V_train(b) for such boards is direct, as they constitute direct training experience.
But in the case of indirect training experience, assigning a training value V_train(b) for the intermediate boards is difficult. In such cases, the training values are updated using temporal difference learning. Temporal difference (TD) learning is a concept central to reinforcement learning, in which learning happens through the iterative correction of estimated returns towards a more accurate target return.
Let Successor(b) denote the next board state following b for which it is again the program's turn to move, and let V̂ be the learner's current approximation to V. Using this information, assign the training value V_train(b) for any intermediate board state b as follows:
V_train(b) ← V̂(Successor(b))

Adjusting the weights


Now it is time to define the learning algorithm for choosing the weights that best fit the set of training examples. One common approach is to define the best hypothesis as the one that minimizes the squared error E between the training values and the values predicted by the hypothesis V̂.

The learning algorithm should incrementally refine the weights as more training examples become available, and it needs to be robust to errors in the training data. The Least Mean Square (LMS) training rule is one training algorithm that adjusts the weights by a small amount in the direction that reduces the error.

The LMS algorithm is defined as follows. For each training example ⟨b, V_train(b)⟩:
• Use the current weights to calculate V̂(b)
• For each weight wi, update it as
wi ← wi + η (V_train(b) − V̂(b)) xi(b)
where η is a small constant (e.g., 0.1) that moderates the size of the weight update.
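A minimal sketch of this update in Python follows, assuming the v_hat() function from the previous sketch; the learning rate of 0.1 is the illustrative value mentioned above.

```python
# Minimal sketch of the LMS weight-update rule, assuming v_hat() from the
# previous sketch. eta is a small learning-rate constant (e.g., 0.1).

def lms_update(weights, features, v_train, eta=0.1):
    """Move each weight a small step in the direction that reduces the
    error (v_train - V_hat(b)) on this single training example."""
    error = v_train - v_hat(features, weights)
    weights[0] += eta * error                 # w0 (constant term, x0 = 1)
    for i, x in enumerate(features, start=1):
        weights[i] += eta * error * x         # wi <- wi + eta * error * xi(b)
    return weights

# For intermediate boards the training value comes from the TD rule
# V_train(b) <- V_hat(Successor(b)); terminal boards use +100, -100, or 0.
```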
1.3.5 Final Design for Checkers Learning system
The final design of our checkers learning system can be naturally
described by four distinct program modules that represent the central
components in many learning systems.
1. The Performance System — Takes a new board as input and outputs a trace of the game it played against itself.
2. The Critic — Takes the trace of a game as an input and outputs a set of
training examples of the target function.
3. The Generalizer — Takes training examples as input and outputs a
hypothesis that estimates the target function. Good generalization to new
cases is crucial.
4. The Experiment Generator — Takes the current hypothesis (currently
learned function) as input and outputs a new problem (an initial board state)
for the performance system to explore.

Figure: Final design of the checkers learning program.

1.4 PERSPECTIVES AND ISSUES IN MACHINE LEARNING

Perspectives in Machine Learning


One useful perspective on machine learning is that it involves searching
a very large space of possible hypotheses to determine one that best fits the
observed data and any prior knowledge held by the learner.
For example, consider the space of hypotheses that could in principle be output by the above checkers learner. This hypothesis space consists of all evaluation functions that can be represented by some choice of values for the weights w0 through w6.
The learner's task is thus to search through this vast space to locate the hypothesis that is most consistent with the available training examples. The LMS algorithm for fitting weights achieves this goal by iteratively tuning the weights, adding a correction to each weight each time the hypothesized evaluation function predicts a value that differs from the training value.
This algorithm works well when the hypothesis representation considered
by the learner defines a continuously parameterized space of potential
hypotheses.
Different learning algorithms search hypothesis spaces defined by different underlying representations (e.g., linear functions, logical descriptions, decision trees, artificial neural networks). These different hypothesis representations are appropriate for learning different kinds of target functions. For each of these hypothesis representations, the corresponding learning algorithm takes advantage of a different underlying structure to organize the search through the hypothesis space.

This perspective of learning as a search problem lets us characterize learning methods by their search strategies and by the underlying structure of the search spaces they explore. We will also find this viewpoint useful in formally analyzing the relationship between the size of the hypothesis space to be searched, the number of training examples available, and the confidence we can have that a hypothesis consistent with the training data will correctly generalize to unseen examples.

Issues in Machine Learning


Our checkers example raises a number of generic questions about machine learning. The field of machine learning is concerned with answering questions such as the following:

• What algorithms exist for learning general target functions from specific training examples? In what settings will particular algorithms converge to the desired function, given sufficient training data? Which algorithms perform best for which types of problems and representations?
• How much training data is sufficient? What general bounds can be found to relate the confidence in learned hypotheses to the amount of training experience and the character of the learner's hypothesis space?
• When and how can prior knowledge held by the learner guide the process of generalizing from examples? Can prior knowledge be helpful even when it is only approximately correct?
• What is the best strategy for choosing a useful next training experience, and how does the choice of this strategy alter the complexity of the learning problem?
• What is the best way to reduce the learning task to one or more function approximation problems? Put another way, what specific functions should the system attempt to learn? Can this process itself be automated?
• How can the learner automatically alter its representation to improve its ability to represent and learn the target function?

1.5 CONCEPT LEARNING

Concept learning is also known as category learning: "the search for and listing of attributes that can be used to distinguish exemplars from non-exemplars of various categories". It is the acquisition of the definition of a general category from given sample positive and negative training examples of the category.
Much of human learning involves acquiring general concepts from past experiences. For
example, humans identify different vehicles among all the vehicles based on specific sets of features
defined over a large set of features. This special set of features differentiates the subset of cars in a set
of vehicles. This set of features that differentiate cars can be called a concept.
Similarly, machines can learn from concepts to identify whether an object belongs to a specific
category by processing past/training data to find a hypothesis that best fits the training examples.
Target concept:

The set of items/objects over which the concept is defined is called the set of instances and denoted by
X. The concept or function to be learned is called the target concept and denoted by c. It can be seen as
a boolean valued function defined over X and can be represented as c: X -> {0, 1}.
If we have a set of training examples with specific features of the target concept c, the problem faced by the learner is to estimate c from the training data.
H denotes the set of all possible hypotheses that the learner may consider regarding the identity of the target concept. The goal of the learner is to find a hypothesis h in H such that h(x) = c(x) for all x in X.
An algorithm that supports concept learning requires:
1. Training data (past experiences to train our models)
2. Target concept (hypothesis to identify data objects)
3. Actual data objects (for testing the models)
The hypothesis space
Each of the data objects represents a concept and a hypothesis. A hypothesis in which every feature is fixed to a concrete value is maximally specific, because it can cover only one kind of sample. Generally, we use the following notation for hypotheses:
1. ∅ (represents a hypothesis that rejects all samples)
2. ⟨?, ?, ?, ?⟩ (accepts all samples)
3. A hypothesis that fixes some features and leaves others as ?, e.g., ⟨true, ?, ?, ?⟩ (accepts some samples)
The hypothesis ∅ will reject all the data samples, and the hypothesis ⟨?, ?, ?, ?⟩ will accept all the data samples. The ? notation indicates that the value of that specific feature does not affect the result.
The total number of possible hypotheses is (3 × 3 × 3 × 3) + 1: each of the four features can be true, false, or ?, and there is one additional hypothesis that rejects all (∅).
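This count is easy to verify by enumeration; a small Python sketch, assuming the four boolean features of this section:

```python
from itertools import product

# Each of the four boolean features can be True, False, or '?' (don't care),
# and there is one extra "reject all" (empty) hypothesis.
values = [True, False, '?']
hypotheses = list(product(values, repeat=4))
print(len(hypotheses) + 1)   # 3**4 + 1 = 82
```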

General to Specific
Many machine learning algorithms rely on the concept of a general-to-specific ordering of hypotheses:
1. h1 = ⟨true, true, ?, ?⟩
2. h2 = ⟨true, ?, ?, ?⟩
Any instance classified positive by h1 will also be classified positive by h2, so we say that h2 is more general than h1. Using this concept, we can find a general hypothesis that can be defined over the entire dataset X.
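The more-general-than check is mechanical for such conjunctive hypotheses. A minimal sketch, using the h1 and h2 above:

```python
def more_general_or_equal(h2, h1):
    """True if h2 is at least as general as h1: every constraint in h2
    is '?' or matches the corresponding constraint in h1."""
    return all(a == '?' or a == b for a, b in zip(h2, h1))

h1 = (True, True, '?', '?')
h2 = (True, '?', '?', '?')
print(more_general_or_equal(h2, h1))   # True: h2 covers everything h1 covers
print(more_general_or_equal(h1, h2))   # False
```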
To find a single hypothesis defined on X, we can use the more-general-than partial ordering. One way to do this is to start with the most specific hypothesis in H and generalize it each time it fails to classify an observed positive training data object as positive.
1. The first step in the Find-S algorithm is to start with the most specific hypothesis, h ← ∅.
2. Pick the next training sample and apply Step 3 to it.
3. Observe the data sample. If the sample is negative, the hypothesis remains unchanged, and we pick the next training sample by processing Step 2 again; otherwise, we process Step 4.
4. If the sample is positive and we find that our current hypothesis is too specific because it does not cover the sample, then we need to update the hypothesis. This is done by the pairwise conjunction (logical AND) of the current hypothesis and the training sample. If the current hypothesis is still ∅, we can directly replace it with the training sample itself.

If the next positive training sample is ⟨true, true, false, true⟩ and the current hypothesis is ⟨true, true, false, false⟩, then we can perform a pairwise conjunction: comparing the current hypothesis with the training sample, we form a new hypothesis by putting ? in each position where the two disagree:

⟨true, true, false, true⟩ ∧ ⟨true, true, false, false⟩ = ⟨true, true, false, ?⟩

Now we can replace our existing hypothesis with the new one: h ← ⟨true, true, false, ?⟩.
5. Repeat Step 2 while there are more training samples.
6. Once there are no more training samples, the current hypothesis is the one we wanted to find, and we can use this final hypothesis to classify real objects.
Paths through the hypothesis space
We can notice that every concept between the least general one and one of the most general ones is also a possible hypothesis, i.e., it covers all the positives and none of the negatives. Mathematically speaking, we say that this set of hypotheses is convex.

Algorithm: LGGConj-ID(x, y) — find the least general conjunctive generalisation of two conjunctions, employing internal disjunction.
Input: conjunctions x, y.
Output: conjunction z.
1 z ← true;
2 for each feature f do
3   if f = vx is a conjunct in x and f = vy is a conjunct in y then
4     add f = Combine-ID(vx, vy) to z; // Combine-ID combines the two values
5   end
6 end
7 return z
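For plain conjunctions over a fixed set of features and no internal disjunction, Combine-ID reduces to keeping a value the two conjunctions share and generalising to ? otherwise. A minimal sketch of that reduced case (the function name is illustrative):

```python
def lgg_conj(x, y):
    """Least general generalisation of two conjunctions x, y (tuples of
    feature values). Without internal disjunction, differing values can
    only be generalised to '?' (any value)."""
    return tuple(vx if vx == vy else '?' for vx, vy in zip(x, y))

print(lgg_conj((True, True, False, True), (True, True, False, False)))
# -> (True, True, False, '?'), matching the worked conjunction above
```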

1.6 Finding maximally specific hypotheses (Find-S Algorithm)

The Find-S algorithm is a basic concept learning algorithm in machine learning. It finds the most specific hypothesis that fits all the positive examples; note that the algorithm considers only the positive training examples. Find-S starts with the most specific hypothesis and generalizes it each time it fails to classify an observed positive training example. Hence, the Find-S algorithm moves from the most specific hypothesis towards the most general hypothesis.

Important representation:

1. ? indicates that any value is acceptable for the attribute.
2. A single required value (e.g., Cold) specifies that only that value is acceptable for the attribute.
3. ϕ indicates that no value is acceptable.
4. The most general hypothesis is represented by {?, ?, ?, ?, ?, ?}.
5. The most specific hypothesis is represented by {ϕ, ϕ, ϕ, ϕ, ϕ, ϕ}.

Algorithm:

1. Initialize h to the most specific hypothesis in H.
2. For each positive training instance x:
   For each attribute constraint ai in h:
      If the constraint ai is satisfied by x, then do nothing;
      else replace ai in h by the next more general constraint that is satisfied by x.
3. Output hypothesis h.

Steps involved in Find-S:

1. Start with the most specific hypothesis: h = {ϕ, ϕ, ϕ, ϕ, ϕ, ϕ}
2. Take the next example; if it is negative, no changes occur to the hypothesis.
3. If the example is positive and we find that our current hypothesis is too specific, we update the hypothesis to a more general condition.
4. Keep repeating the above steps until all the training examples have been processed.
5. After we have processed all the training examples, we have the final hypothesis, which we can use to classify new examples.
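These steps translate directly into code. A minimal Python sketch, using the positive rows of the seed data from the worked example that follows ('0' stands in for ϕ, the null constraint):

```python
def find_s(examples):
    """examples: list of (attribute_tuple, label) with label True/False.
    Returns the maximally specific hypothesis covering all positives."""
    n = len(examples[0][0])
    h = ('0',) * n                      # '0' plays the role of phi
    for x, label in examples:
        if not label:                   # negative examples are ignored
            continue
        # generalize attribute-wise: keep matches, fill phi, else use '?'
        h = tuple(hi if hi == xi else ('?' if hi != '0' else xi)
                  for hi, xi in zip(h, x))
    return h

data = [
    (('GREEN', 'HARD', 'NO', 'WRINKLED'), True),    # example 1
    (('ORANGE', 'HARD', 'NO', 'WRINKLED'), True),   # example 4
    (('GREEN', 'SOFT', 'YES', 'SMOOTH'), True),     # example 5
]
print(find_s(data))   # ('?', '?', '?', '?')
```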

Example :
Consider a data set containing data about which particular seeds are poisonous, with four attributes per seed (the table itself is not reproduced here; the attribute values of each example appear in the trace below).

First, we take the most specific hypothesis. Hence, our hypothesis would be:
h = {ϕ, ϕ, ϕ, ϕ}

Consider Training example 1 :


The data in example 1 is { GREEN, HARD, NO, WRINKLED }. We see that our
initial hypothesis is more specific and we have to generalize it for this example.
Hence, the hypothesis becomes :
h = { GREEN, HARD, NO, WRINKLED }
Consider Training example 2 :
Here we see that this example has a negative outcome. Hence we neglect this
example and our hypothesis remains the same.
h = { GREEN, HARD, NO, WRINKLED }
Consider Training example 3 :
Here we see that this example has a negative outcome. Hence we neglect this
example and our hypothesis remains the same.
h = { GREEN, HARD, NO, WRINKLED }
Consider training example 4:
The data present in example 4 is { ORANGE, HARD, NO, WRINKLED }. We compare every single attribute with the current hypothesis, and where a mismatch is found we replace that particular attribute with the general case ( " ? " ). After doing this, the hypothesis becomes:
h = { ?, HARD, NO, WRINKLED }
Consider training example 5:
The data present in example 5 is { GREEN, SOFT, YES, SMOOTH }. We compare every single attribute with the current hypothesis, and where a mismatch is found we replace that particular attribute with the general case ( " ? " ). After doing this, the hypothesis becomes:
h = { ?, ?, ?, ? }
Since we have reached a point where all the attributes in our hypothesis have the general condition, examples 6 and 7 would result in the same hypothesis with all general attributes:
h = { ?, ?, ?, ? }
Hence, for the given data, the final hypothesis is:
Final hypothesis: h = { ?, ?, ?, ? }

1.7 Version Spaces
The version space is the subset of hypotheses from H consistent with the training examples D.

Definition (version space). A concept is complete if it covers all positive examples. A concept is consistent if it covers none of the negative examples. The version space is the set of all complete and consistent concepts. This set is convex and is fully defined by its least and most general elements.

The key idea in the CANDIDATE-ELIMINATION algorithm is to output a description of the set of all hypotheses consistent with the training examples.

1.7.1 Representation
The CANDIDATE-ELIMINATION algorithm finds all describable hypotheses that are consistent with the observed training examples. In order to define this algorithm precisely, we begin with a few basic definitions. First, let us say that a hypothesis is consistent with the training examples if it correctly classifies these examples.

Definition: A hypothesis h is consistent with a set of training examples D if and only if h(x) = c(x) for each example ⟨x, c(x)⟩ in D.

Note the difference between the definitions of consistent and satisfies:

• An example x is said to satisfy hypothesis h when h(x) = 1, regardless of whether x is a positive or negative example of the target concept.
• An example x is said to be consistent with hypothesis h iff h(x) = c(x).

Definition (version space): The version space, denoted VS_{H,D}, with respect to hypothesis space H and training examples D, is the subset of hypotheses from H consistent with the training examples in D.

1.7.2 The LIST-THEN-ELIMINATE algorithm

The LIST-THEN-ELIMINATE algorithm first initializes the version space to contain all hypotheses in H and then eliminates any hypothesis found inconsistent with any training example.

1. VersionSpace ← a list containing every hypothesis in H
2. For each training example ⟨x, c(x)⟩, remove from VersionSpace any hypothesis h for which h(x) ≠ c(x)
3. Output the list of hypotheses in VersionSpace

• LIST-THEN-ELIMINATE works in principle, so long as the version space is finite.
• However, since it requires exhaustive enumeration of all hypotheses, in practice it is not feasible.
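In code, LIST-THEN-ELIMINATE is just brute-force filtering. The sketch below (which omits the single reject-all hypothesis for brevity) makes the scalability problem obvious, since the enumerated space grows exponentially with the number of attributes:

```python
from itertools import product

def consistent(h, x, label):
    """h is consistent with (x, label) iff it covers x exactly when
    label is True."""
    covers = all(hi == '?' or hi == xi for hi, xi in zip(h, x))
    return covers == label

def list_then_eliminate(domains, examples):
    """domains: one tuple of possible values per attribute.
    Enumerates every conjunctive hypothesis, then filters out any that
    misclassify a training example."""
    space = list(product(*[tuple(d) + ('?',) for d in domains]))
    for x, label in examples:
        space = [h for h in space if consistent(h, x, label)]
    return space

# For the six EnjoySport attributes below this already enumerates
# 4 * 3**5 = 972 hypotheses; larger attribute sets quickly become infeasible.
```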

As an illustration, suppose four hypotheses are checked against three training examples:

h1(x1) = c(x1)   h2(x1) = c(x1)   h3(x1) ≠ c(x1)   h4(x1) = c(x1)
h1(x2) = c(x2)   h2(x2) = c(x2)   h3(x2) = c(x2)   h4(x2) = c(x2)
h1(x3) = c(x3)   h2(x3) ≠ c(x3)   h3(x3) = c(x3)   h4(x3) = c(x3)

h1 is consistent, h2 is inconsistent, h3 is inconsistent, and h4 is consistent. We remove hypotheses h2 and h3 from the version space because they are inconsistent.
A More Compact Representation for Version Spaces
The version space is represented by its most general and least general
members. These members form general and specific boundary sets that delimit
the version space within the partially ordered hypothesis space.
Definition: The general boundary G, with respect to hypothesis space H and training data D, is the set of maximally general members of H consistent with D.

Definition: The specific boundary S, with respect to hypothesis space H and training data D, is the set of minimally general (i.e., maximally specific) members of H consistent with D.

Theorem (version space representation theorem): Let X be an arbitrary set of instances and let H be a set of boolean-valued hypotheses defined over X. Let c : X → {0, 1} be an arbitrary target concept defined over X, and let D be an arbitrary set of training examples {⟨x, c(x)⟩}. For all X, H, c, and D such that S and G are well defined,

VS_{H,D} = { h ∈ H | (∃s ∈ S)(∃g ∈ G) such that g ≥g h ≥g s }

where x ≥g y means x is more general than or equal to y.

To prove:
1. Every h satisfying the right-hand side of the above expression is in VS_{H,D}.
2. Every member of VS_{H,D} satisfies the right-hand side of the expression.

Sketch of proof:
1. Let g, h, s be arbitrary members of G, H, S respectively, with g ≥g h ≥g s.
• By the definition of S, s must be satisfied by all positive examples in D. Because h ≥g s, h must also be satisfied by all positive examples in D.
• By the definition of G, g cannot be satisfied by any negative example in D, and because g ≥g h, h cannot be satisfied by any negative example in D. Because h is satisfied by all positive examples in D and by no negative examples in D, h is consistent with D, and therefore h is a member of VS_{H,D}.
2. The second part can be proven by assuming some h in VS_{H,D} that does not satisfy the right-hand side of the expression, and then showing that this leads to an inconsistency.
1.7.3 CANDIDATE-ELIMINATION Learning Algorithm

The CANDIDATE-ELIMINATION algorithm computes the version space containing all hypotheses from H that are consistent with an observed sequence of training examples.

Initialize G to the set of maximally general hypotheses in H.
Initialize S to the set of maximally specific hypotheses in H.
For each training example d, do:
• If d is a positive example:
  • Remove from G any hypothesis inconsistent with d
  • For each hypothesis s in S that is not consistent with d:
    • Remove s from S
    • Add to S all minimal generalizations h of s such that h is consistent with d and some member of G is more general than h
    • Remove from S any hypothesis that is more general than another hypothesis in S
• If d is a negative example:
  • Remove from S any hypothesis inconsistent with d
  • For each hypothesis g in G that is not consistent with d:
    • Remove g from G
    • Add to G all minimal specializations h of g such that h is consistent with d and some member of S is more specific than h
    • Remove from G any hypothesis that is less general than another hypothesis in G

This is the CANDIDATE-ELIMINATION algorithm using version spaces.
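A compact Python sketch of this algorithm for conjunctive attribute hypotheses follows. It simplifies the boundary bookkeeping (in this hypothesis space S holds a single hypothesis, and minimal generalizations/specializations are computed attribute-wise); the helper names are illustrative. The usage example at the bottom runs the same EnjoySport data traced in the next subsection.

```python
def covers(h, x):
    """h covers instance x iff every constraint is '?' or matches x."""
    return all(hi == '?' or hi == xi for hi, xi in zip(h, x))

def more_general(h1, h2):
    """h1 is at least as general as h2."""
    return all(a == '?' or a == b for a, b in zip(h1, h2))

def min_generalize(s, x):
    """Least generalization of s that covers the positive example x
    ('0' plays the role of phi, the null constraint)."""
    return tuple(xi if si == '0' else (si if si == xi else '?')
                 for si, xi in zip(s, x))

def min_specializations(g, domains, x):
    """All minimal specializations of g that exclude negative example x."""
    out = []
    for i, (gi, dom) in enumerate(zip(g, domains)):
        if gi == '?':
            out += [g[:i] + (v,) + g[i+1:] for v in dom if v != x[i]]
    return out

def candidate_elimination(domains, examples):
    n = len(domains)
    S = [('0',) * n]                      # most specific boundary
    G = [('?',) * n]                      # most general boundary
    for x, positive in examples:
        if positive:
            G = [g for g in G if covers(g, x)]
            S = [min_generalize(s, x) for s in S]
            S = [s for s in S if any(more_general(g, s) for g in G)]
        else:
            S = [s for s in S if not covers(s, x)]
            G = [h for g in G
                 for h in (min_specializations(g, domains, x)
                           if covers(g, x) else [g])
                 if any(more_general(h, s) for s in S)]
            # drop members less general than another member of G
            G = [g for g in G
                 if not any(h != g and more_general(h, g) for h in G)]
    return S, G

domains = [('Sunny', 'Cloudy', 'Rainy'), ('Warm', 'Cold'),
           ('Normal', 'High'), ('Strong', 'Weak'),
           ('Warm', 'Cool'), ('Same', 'Change')]
data = [(('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), True),
        (('Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same'), True),
        (('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change'), False),
        (('Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change'), True)]
S, G = candidate_elimination(domains, data)
print(S)  # [('Sunny', 'Warm', '?', 'Strong', '?', '?')]
print(G)  # [('Sunny', '?', '?', '?', '?', '?'), ('?', 'Warm', '?', '?', '?', '?')]
```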

1.7.4 Example

Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1        Sunny  Warm     Normal    Strong  Warm   Same      Yes
2        Sunny  Warm     High      Strong  Warm   Same      Yes
3        Rainy  Cold     High      Strong  Warm   Change    No
4        Sunny  Warm     High      Strong  Cool   Change    Yes

The CANDIDATE-ELIMINATION algorithm begins by initializing the version space to the set of all hypotheses in H.

Initializing the G boundary set to contain the most general hypothesis in H:
G0: ⟨?, ?, ?, ?, ?, ?⟩

Initializing the S boundary set to contain the most specific (least general) hypothesis:
S0: ⟨ɸ, ɸ, ɸ, ɸ, ɸ, ɸ⟩

• When the first training example is presented, the CANDIDATE-ELIMINATION algorithm checks the S boundary and finds that it is overly specific: it fails to cover the positive example.
• The S boundary is therefore revised by moving it to the least more general hypothesis that covers this new example.
• No update of the G boundary is needed in response to this training example, because G0 correctly covers this example.
• When the second training example is observed, it has a similar effect of generalizing S further to S2, leaving G again unchanged, i.e., G2 = G1 = G0.

• Consider the third training example. This negative example reveals that the G boundary of the version space is overly general; that is, the hypothesis in G incorrectly predicts that this new example is a positive example.
• The hypothesis in the G boundary must therefore be specialized until it correctly classifies this new negative example.

Given that there are six attributes that could be specified to specialize G2, why are there only three new hypotheses in G3? For example, the hypothesis h = ⟨?, ?, Normal, ?, ?, ?⟩ is a minimal specialization of G2 that correctly labels the new example as a negative example, but it is not included in G3. The reason this hypothesis is excluded is that it is inconsistent with the previously encountered positive examples.
Consider the fourth training example. This positive example further generalizes the S boundary of the version space. It also results in removing one member of the G boundary, because this member fails to cover the new positive example.

After processing these four examples, the boundary sets S4 and G4 delimit the version space of all hypotheses consistent with the set of incrementally observed training examples.

1.8 INDUCTIVE BIAS
As discussed above, the CANDIDATE-ELIMINATION algorithm will converge toward the true target
concept provided it is given accurate training examples and provided its initial hypothesis space contains
the target concept. What if the target concept is not contained in the hypothesis space? Can we avoid this
difficulty by using a hypothesis space that includes every possible hypothesis? How does the size of this
hypothesis space influence the ability of the algorithm to generalize to unobserved instances? How does
the size of the hypothesis space influence the number of training examples that must be observed? These are fundamental questions for inductive inference in general. Here we examine them in the context of the CANDIDATE-ELIMINATION algorithm. As we shall see, the conclusions we draw from this analysis will apply to any concept learning system that outputs any hypothesis consistent with the training data.

1.8.1 A Biased Hypothesis Space


Suppose we wish to assure that the hypothesis space contains the unknown target concept. The obvious solution is to enrich the hypothesis space to include every possible hypothesis. To illustrate, consider again the EnjoySport example, in which we restricted the hypothesis space to include only conjunctions of attribute values. Because of this restriction, the hypothesis space is unable to represent even simple disjunctive target concepts such as "Sky = Sunny or Sky = Cloudy." In fact, given the following three training examples of this disjunctive concept, our algorithm would find that there are zero hypotheses in the version space.
Example Sky AirTemp Humidity Wind Water Forecast EnjoySport
1 Sunny Warm Normal Strong Cool Change Yes
2 Cloudy Warm Normal Strong Cool Change Yes
3 Rainy Warm Normal Strong Cool Change No
To see why there are no hypotheses consistent with these three examples,
note that the most specific hypothesis consistent with the first two examples and
representable in the given hypothesis space H is
S2 : (?, Warm, Normal, Strong, Cool, Change)
This hypothesis, although it is the maximally specific hypothesis from H that is
consistent with the first two examples, is already overly general: it incorrectly
covers the third (negative) training example. The problem is that we have biased
the learner to consider only conjunctive hypotheses. In this case we require a more
expressive hypothesis space.
1.8.2 An Unbiased Learner
The obvious solution to the problem of assuring that the target concept is in the hypothesis space H is to provide a hypothesis space capable of representing every teachable concept; that is, one capable of representing every possible subset of the instances X. In general, the set of all subsets of a set X is called the power set of X.
In the EnjoySport learning task, for example, the size of the instance space X of days described by the six available attributes is 96. How many possible concepts can be defined over this set of instances? In other words, how large is the power set of X? In general, the number of distinct subsets that can be defined over a set X containing |X| elements (i.e., the size of the power set of X) is 2^|X|. Thus, there are 2^96, or approximately 10^28, distinct target concepts that could be defined over this instance space and that our learner might be called upon to learn. Recall from Section 2.3 of Mitchell that our conjunctive hypothesis space is able to represent only 973 of these: a very biased hypothesis space indeed!
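Both counts are quick to check; a short sketch:

```python
# Instance space: Sky takes 3 values; the other five attributes take 2 each.
instances = 3 * 2**5              # |X| = 96
concepts = 2**instances           # 2**96, about 7.9e28 target concepts

# Semantically distinct conjunctive hypotheses: Sky can be one of its
# 3 values or '?', each other attribute one of its 2 values or '?',
# plus the single empty hypothesis that rejects everything.
conjunctive = 4 * 3**5 + 1        # 973
print(instances, concepts, conjunctive)
```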
Let us reformulate the EnjoySport learning task in an unbiased way by defining a new hypothesis space H′ that can represent every subset of instances; that is, let H′ correspond to the power set of X. One way to define such an H′ is to allow arbitrary disjunctions, conjunctions, and negations of our earlier hypotheses. For instance, the target concept "Sky = Sunny or Sky = Cloudy" could then be described as
⟨Sunny, ?, ?, ?, ?, ?⟩ ∨ ⟨Cloudy, ?, ?, ?, ?, ?⟩
Given this hypothesis space, we can safely use the CANDIDATE-ELIMINATION algorithm without worrying that the target concept might not be expressible. However, while this hypothesis space eliminates any problems of expressibility, it unfortunately raises a new, equally difficult problem: our concept learning algorithm is now completely unable to generalize beyond the observed examples! To see why, suppose we present three positive examples (x1, x2, x3) and two negative examples (x4, x5) to the learner. At this point, the S boundary of the version space will contain the hypothesis that is just the disjunction of the positive examples, (x1 ∨ x2 ∨ x3), because this is the most specific possible hypothesis that covers these three examples. Similarly, the G boundary will consist of the hypothesis that rules out only the observed negative examples, ¬(x4 ∨ x5).
The problem here is that with this very expressive hypothesis representation, the S boundary will always be simply the disjunction of the observed positive examples, while the G boundary will always be the negated disjunction of the observed negative examples. Therefore, the only examples that will be unambiguously classified by S and G are the observed training examples themselves. In order to converge to a single, final target concept, we will have to present every single instance in X as a training example!
It might at first seem that we could avoid this difficulty by simply using the partially learned version space and taking a vote among its members, as discussed in Section 2.6.3 of Mitchell. Unfortunately, the only instances that will produce a unanimous vote are the previously observed training examples. For all the other instances, taking a vote will be futile: each unobserved instance will be classified positive by precisely half the hypotheses in the version space and negative by the other half. To see why, note that when H is the power set of X and x is some previously unobserved instance, then for any hypothesis h in the version space that covers x, there will be another hypothesis h′ in the power set that is identical to h except for its classification of x. And of course if h is in the version space, then h′ will be as well, because it agrees with h on all the observed training examples.
1.8.3 The Futility of Bias-Free Learning
The above discussion illustrates a fundamental property of inductive inference: a learner that makes no a priori assumptions regarding the identity of the target concept has no rational basis for classifying any unseen instances. In fact, the only reason that the CANDIDATE-ELIMINATION algorithm was able to generalize beyond the observed training examples in our original formulation of the EnjoySport task is that it was biased by the implicit assumption that the target concept could be represented by a conjunction of attribute values. In cases where this assumption is correct (and the training examples are error-free), its classification of new instances will also be correct. If this assumption is incorrect, however, it is certain that the CANDIDATE-ELIMINATION algorithm will misclassify at least some instances from X.
Because inductive learning requires some form of prior assumptions, or inductive bias, we will find it useful to characterize different learning approaches by the inductive bias they employ. Let us define this notion of inductive bias more precisely. The key idea we wish to capture here is the policy by which the learner generalizes beyond the observed training data to infer the classification of new instances. Therefore, consider the general setting in which an arbitrary learning algorithm L is provided an arbitrary set of training data D_c = {⟨x, c(x)⟩} of some arbitrary target concept c. After training, L is asked to classify a new instance x_i. Let L(x_i, D_c) denote the classification (e.g., positive or negative) that L assigns to x_i after learning from the training data D_c. We can describe this inductive inference step performed by L as
(D_c ∧ x_i) ≻ L(x_i, D_c)
where the notation y ≻ z indicates that z is inductively inferred from y. For example, if we take L to be the CANDIDATE-ELIMINATION algorithm, D_c to be the training data from Table 2.1, and x_i to be the first instance from Table 2.6, then the inductive inference performed in this case concludes that L(x_i, D_c) = (EnjoySport = yes).
Because L is an inductive learning algorithm, the result L(x_i, D_c) that it infers will not in general be provably correct; that is, the classification L(x_i, D_c) need not follow deductively from the training data D_c and the description of the new instance x_i. However, it is interesting to ask what additional assumptions could be added to D_c ∧ x_i so that L(x_i, D_c) would follow deductively. We define the inductive bias of L as this set of additional assumptions. More precisely, we define the inductive bias of L to be the set of assumptions B such that for all new instances x_i
(B ∧ D_c ∧ x_i) ⊢ L(x_i, D_c)
where the notation y ⊢ z indicates that z follows deductively from y (i.e., that z is provable from y). Thus, we define the inductive bias of a learner as the set of additional assumptions B sufficient to justify its inductive inferences as deductive inferences. (The term inductive bias here is not to be confused with the term estimation bias commonly used in statistics.) To summarize:

Definition: Consider a concept learning algorithm L for the set of instances X. Let c be an arbitrary concept defined over X, and let D_c = {⟨x, c(x)⟩} be an arbitrary set of training examples of c. Let L(x_i, D_c) denote the classification assigned to the instance x_i by L after training on the data D_c. The inductive bias of L is any minimal set of assertions B such that for any target concept c and corresponding training examples D_c
(∀x_i ∈ X) [(B ∧ D_c ∧ x_i) ⊢ L(x_i, D_c)]
What, then, is the inductive bias of the CANDIDATE-ELIMINATION algorithm? To answer this, let us specify L(x_i, D_c) exactly for this algorithm: given a set of data D_c, the CANDIDATE-ELIMINATION algorithm will first compute the version space VS_{H,D_c}, then classify the new instance x_i by a vote among the hypotheses in this version space. Here let us assume that it will output a classification for x_i only if this vote among version space hypotheses is unanimously positive or negative, and that it will not output a classification otherwise. Given this definition of L(x_i, D_c) for the CANDIDATE-ELIMINATION algorithm, what is its inductive bias? It is simply the assumption c ∈ H. Given this assumption, each inductive inference performed by the CANDIDATE-ELIMINATION algorithm can be justified deductively.
To see why the classification L(x_i, D_c) follows deductively from B = {c ∈ H}, together with the data D_c and the description of the instance x_i, consider the following argument. First, notice that if we assume c ∈ H, then it follows deductively that c ∈ VS_{H,D_c}. This follows from c ∈ H, from the definition of the version space VS_{H,D_c} as the set of all hypotheses in H that are consistent with D_c, and from our definition of D_c = {⟨x, c(x)⟩} as training data consistent with the target concept c. Second, recall that we defined the classification L(x_i, D_c) to be the unanimous vote of all hypotheses in the version space. Thus, if L outputs the classification L(x_i, D_c), it must be the case that every hypothesis in VS_{H,D_c} also produces this classification, including the hypothesis c ∈ VS_{H,D_c}. Therefore c(x_i) = L(x_i, D_c).
To summarize, the CANDIDATE-ELIMINATION algorithm defined in this fashion can be characterized by the following bias:
Inductive bias of the CANDIDATE-ELIMINATION algorithm: The target concept c is contained in the given hypothesis space H.
Figure 2.8 (Mitchell) summarizes the situation schematically. The inductive CANDIDATE-ELIMINATION algorithm at the top of the figure takes two inputs: the training examples and a new instance to be classified. At the bottom of the figure, a deductive theorem prover is given these same two inputs plus the assertion "H contains the target concept." These two systems will in principle produce identical outputs for every possible input set of training examples and every possible new instance in X. Of course the inductive bias that is explicitly input to the theorem prover is only implicit in the code of the CANDIDATE-ELIMINATION algorithm; in a sense, it exists only in the eye of us beholders. Nevertheless, it is a perfectly well-defined set of assertions.

FIGURE 2.8: Modeling inductive systems by equivalent deductive systems. The input-output behavior of the CANDIDATE-ELIMINATION algorithm using a hypothesis space H is identical to that of a deductive theorem prover utilizing the assertion "H contains the target concept." This assertion is therefore called the inductive bias of the CANDIDATE-ELIMINATION algorithm. Characterizing inductive systems by their inductive bias allows modeling them by their equivalent deductive systems, and provides a way to compare inductive systems according to their policies for generalizing beyond the observed training data.
One advantage of viewing inductive inference systems in terms of their inductive bias is that it provides a nonprocedural means of characterizing their policy for generalizing beyond the observed data. A second advantage is that it allows comparison of different learners according to the strength of the inductive bias they employ. Consider, for example, the following three learning algorithms, listed from weakest to strongest bias:
1. ROTE-LEARNER: Learning corresponds simply to storing each observed training example in memory. Subsequent instances are classified by looking them up in memory. If the instance is found in memory, the stored classification is returned; otherwise, the system refuses to classify the new instance.
2. CANDIDATE-ELIMINATION algorithm: New instances are classified only in the case where all members of the current version space agree on the classification; otherwise, the system refuses to classify the new instance.
3. FIND-S: This algorithm, described earlier, finds the most specific hypothesis consistent with the training examples. It then uses this hypothesis to classify all subsequent instances.
The ROTE-LEARNER has no inductive bias. The classifications it provides for new instances follow deductively from the observed training examples, with no additional assumptions required. The CANDIDATE-ELIMINATION algorithm has a stronger inductive bias: that the target concept can be represented in its hypothesis space. Because it has a stronger bias, it will classify some instances that the ROTE-LEARNER will not. Of course, the correctness of such classifications will depend completely on the correctness of this inductive bias. The FIND-S algorithm has an even stronger inductive bias. In addition to the assumption that the target concept can be described in its hypothesis space, it makes the additional assumption that all instances are negative instances unless the opposite is entailed by its other knowledge.
As we examine other inductive inference methods, it is useful to keep in mind this means of characterizing them and the strength of their inductive bias. More strongly biased methods make more inductive leaps, classifying a greater proportion of unseen instances. Some inductive biases correspond to categorical assumptions that completely rule out certain concepts, such as the bias "the hypothesis space H includes the target concept." Other inductive biases merely rank-order the hypotheses by stating preferences such as "more specific hypotheses are preferred over more general hypotheses." Some biases are implicit in the learner and unchangeable by it, such as the ones we have considered here. Mitchell (Chapters 11 and 12) describes other systems whose bias is made explicit as a set of assertions represented and manipulated by the learner.
