Unit-1 ML
INTRODUCTION
Learning – Types of Machine Learning – Supervised Learning – The Brain and the Neuron –
Design a Learning System – Perspectives and Issues in Machine Learning – Concept Learning Task
– Concept Learning as Search – Finding a Maximally Specific Hypothesis – Version Spaces and the
Candidate Elimination Algorithm – Linear Discriminants – Perceptron – Linear Separability –
Linear Regression.
Learning
Machine learning is a branch of artificial intelligence (AI) and computer science that focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy.
Image recognition is a well-known and widespread example of machine learning in the real world. It can identify an object in a digital image, based on the intensity of the pixels in black-and-white or colour images. A real-world example of image recognition: labelling an x-ray as cancerous or not.
Data is not random; it contains structure that can be used to predict outcomes, or gain knowledge in
some way.
It is more difficult to design algorithms for such tasks (compared to, say, sorting an array or
calculating a payroll). Such algorithms need data.
Ex: construct a spam filter, using a collection of email messages labelled as spam/not
spam.
– Computer science: data structures and programs that solve an ML problem efficiently.
Supervised learning
A training set of examples with the correct responses (targets) is provided and, based on this
training set, the algorithm generalises to respond correctly to all possible inputs. This is also called
learning from exemplars.
1. Face recognition. Difficult because of the complex variability in the data: pose and illumination in a face image, occlusions, glasses/beard/make-up/etc.
2. Credit scoring: classify customers into high- and low-risk, based on their income and savings, using data about past loans (whether they were paid or not).
Unsupervised learning
No labels are provided, only input data.
Correct responses are not provided, but instead the algorithm tries to identify similarities between
the inputs so that inputs that have something in common are categorized together. The statistical
approach to unsupervised learning is known as density estimation.
1. Learning associations: ∗ Basket analysis: let p(Y|X) = "probability that a customer who buys product X also buys product Y", estimated from past purchases. If p(Y|X) is large (say 0.7), associate "X → Y". When someone buys X, recommend them Y.
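As a hedged sketch of how p(Y|X) could be estimated from past purchases, the following counts co-occurrences over a toy list of baskets (the product names and the 0.7 threshold are assumptions for illustration):

# Minimal sketch (assumed data): estimate p(Y|X) from past baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def conditional_prob(transactions, x, y):
    """Estimate p(y|x) = #(baskets with x and y) / #(baskets with x)."""
    with_x = [t for t in transactions if x in t]
    if not with_x:
        return 0.0
    return sum(1 for t in with_x if y in t) / len(with_x)

p = conditional_prob(transactions, "bread", "milk")
if p >= 0.7:                      # assumed association threshold
    print("associate bread -> milk, p =", p)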
Reinforcement learning
This is somewhere between supervised and unsupervised learning. The algorithm gets told when the answer is wrong, but does not get told how to correct it. It has to explore and try out different possibilities until it works out how to get the answer right. Reinforcement learning is sometimes called learning with a critic because of this monitor that scores the answer, but does not suggest improvements.
Evolutionary learning
Biological evolution can be seen as a learning process: biological organisms adapt to improve their
survival rates and chance of having offspring in their environment. We’ll look at how we can model
this in a computer, using an idea of fitness, which corresponds to a score for how good the current
solution is.
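As a minimal sketch of the fitness idea (the bit-string task, mutation rate, and population sizes below are assumptions for illustration, not a method from the text):

import random

# Toy evolutionary loop: evolve a bit-string toward all ones.
# Fitness = count of ones (an assumed example of a fitness score).
def fitness(candidate):
    return sum(candidate)

def mutate(candidate, rate=0.1):
    return [1 - b if random.random() < rate else b for b in candidate]

population = [[random.randint(0, 1) for _ in range(10)] for _ in range(20)]
for generation in range(50):
    # Keep the fitter half, refill by mutating survivors.
    population.sort(key=fitness, reverse=True)
    survivors = population[:10]
    population = survivors + [mutate(random.choice(survivors)) for _ in range(10)]

print(max(fitness(c) for c in population))   # approaches 10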
Supervised learning
1. Learning a class from examples: two-class problems
We are given a training set of labeled examples (positive and negative) and want to learn a classifier
that we can use to predict unseen examples, or to understand the data.
Input representation: we need to decide what attributes (features) to use to describe the input patterns
(examples, instances). This implies ignoring other attributes as irrelevant.
Training set: X = {(x_n, y_n)}_{n=1}^N, where x_n ∈ R^D is the nth input vector and y_n ∈ {0, 1} its class label.
• Hypothesis (model) class H: the set of classifier functions we will use. Ideally, the true class
distribution C can be represented by a function in H (exactly, or with a small error).
• Having selected H, learning the class reduces to finding an optimal h ∈ H. We don't know the true class regions C, but we can approximate them by the empirical error, the fraction of training instances that h misclassifies: E(h; X) = (1/N) Σ_{n=1}^N I(h(x_n) ≠ y_n).
– One source of error is noise from attributes not considered that affect the label (hidden or latent attributes, which may be unobservable).
With K classes, we can code the label as an integer y = k ∈ {1, …, K}, or as a one-of-K binary vector y = (y_1, …, y_K)^T ∈ {0, 1}^K (containing a single 1 in position k).
• One approach for K-class classification: consider it as K two-class classification problems, and minimize the total empirical error E({h_k}; X) = (1/N) Σ_{n=1}^N Σ_{k=1}^K I(h_k(x_n) ≠ y_{nk}), where y_n is coded as one-of-K and h_k is the two-class classifier for problem k, i.e., h_k(x) ∈ {0, 1}.
• Ideally, for a given pattern x, exactly one h_k(x) is one. When none, or more than one, h_k(x) is one, the classifier is in doubt and may reject the pattern, as in the sketch below.
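A minimal sketch of the K two-class construction and the doubt/reject rule (the random linear classifiers and toy dimensions are assumptions for illustration):

import numpy as np

# Assumed toy setup: K one-vs-rest classifiers, each a thresholded
# linear scorer returning 0 or 1. Weights are random placeholders.
rng = np.random.default_rng(0)
K, D = 3, 2
W = rng.normal(size=(K, D))
b = rng.normal(size=K)

def h(x):
    # Binary outputs of the K two-class classifiers h_k(x).
    return (W @ x + b > 0).astype(int)

def classify(x):
    votes = h(x)
    if votes.sum() == 1:
        return int(np.argmax(votes))  # exactly one h_k fired: predict class k
    return None                       # none or several fired: doubt, reject

print(classify(np.array([1.0, -0.5])))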
4. Regression
• Training set X = {(x_n, y_n)}_{n=1}^N, where the label for a pattern x_n ∈ R^D is a real value y_n ∈ R. In multivariate regression, y_n ∈ R^d is a real vector.
• We assume the labels were produced by applying an unknown function f to the instances, and we want to learn (or estimate) that function using functions h from a hypothesis class H. Ex: H = class of linear functions: h(x) = w_0 + w_1 x_1 + · · · + w_D x_D = w^T x + w_0.
• Interpolation: we learn a function h(x) that passes through each training pair (x_n, y_n) (no noise): y_n = h(x_n), n = 1, …, N. Ex: polynomial interpolation (requires a polynomial of degree N − 1 with N points in general position).
• Empirical error: E(h; X) = (1/N) Σ_{n=1}^N (y_n − h(x_n))², the mean of the squared errors at each instance. Other definitions of error are possible, e.g. absolute value instead of square.
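As a quick sketch, this empirical error is a one-liner in NumPy (the data and the fixed hypothesis h(x) = 2x + 1 are assumptions):

import numpy as np

# Assumed toy data and hypothesis: a fixed line h(x) = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

h = 2 * x + 1                      # predictions of the hypothesis
E = np.mean((y - h) ** 2)          # empirical (mean squared) error
print(E)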
Machine learning problems (classification, regression and others) are typically ill-posed: the
observed data is finite and does not uniquely determine the classification or regression function.
• For best generalization, we should match the complexity of the hypothesis class H with the
complexity of the function underlying the data:
– If H is less complex: underfitting. Ex: fitting a line to data generated from a cubic
polynomial.
– If H is more complex: overfitting. Ex: fitting a cubic polynomial to data generated from a
line.
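A hedged sketch of both failure modes using NumPy's polynomial fitting (the cubic data and the chosen degrees are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 10)
y = x**3 + 0.05 * rng.normal(size=x.size)   # data from a cubic, plus noise

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)        # fit polynomial of given degree
    err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(degree, err)   # degree 1 underfits; degree 9 drives training error
                         # near zero but will generalize poorly (overfitting)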
• An outlier is an instance that is very different from the other instances in the sample. Reasons:
– Abnormal behaviour: fraud in credit card transactions, an intruder in network traffic, etc.
– Recording error: faulty sensors, etc.
• Not usually cast as a two-class classification problem because there are typically few outliers and
they don’t fit a consistent pattern that can be easily learned.
• Instead, “one-class classification”: fit a density p(x) to non-outliers, then consider x as an outlier if
p(x) < θ for some threshold θ > 0 (low-probability instance).
• We can also identify outliers as points that are far away from other samples.
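A minimal sketch of the density-threshold idea, assuming a single Gaussian is fitted to the non-outliers (the data and the threshold θ are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
normal_data = rng.normal(loc=0.0, scale=1.0, size=500)   # assumed inliers

# Fit a Gaussian density p(x) to the non-outliers.
mu, sigma = normal_data.mean(), normal_data.std()

def p(x):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

theta = 0.01                       # assumed threshold
for x in (0.2, 4.5):
    print(x, "outlier" if p(x) < theta else "inlier")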
Data Collection and Preparation
Throughout this book we will be in the fortunate position of having datasets readily available for downloading and using to test the algorithms. This is, of course, less commonly the case when the desire is to learn about some new problem, when either the data has to be collected from scratch, or at the very least, assembled and prepared. In fact, if the problem is completely new, so that appropriate data can be chosen, then this process should be merged with the next step of feature selection, so that only the required data is collected. This can typically be done by assembling a reasonably small dataset with all of the features that you believe might be useful, and experimenting with it before choosing the best features and collecting and analysing the full dataset.
For supervised learning, target data is also needed, which can require the involvement of experts in
the relevant field and significant investments of time.
Finally, the quantity of data needs to be considered. Machine learning algorithms need significant
amounts of data, preferably without too much noise, but with increased dataset size comes increased
computational costs, and the sweet spot at which there is enough data without excessive
computational overhead is generally impossible to predict.
Feature Selection
An example of this part of the process was given in Section 1.4.2 when we looked at possible
features that might be useful for coin recognition. It consists of identifying the features that are most
useful for the problem under examination. This invariably requires prior knowledge of the problem
and the data; our common sense was used in the coins example above to identify some potentially
useful features and to exclude others.
As well as the identification of features that are useful for the learner, it is also necessary that the
features can be collected without significant expense or time, and that they are robust to noise and
other corruption of the data that may arise in the collection process.
Algorithm Choice
Given the dataset, the choice of an appropriate algorithm (or algorithms) is what this book should be able to prepare you for, in that the knowledge of the underlying principles of each algorithm and examples of their use is precisely what is required for this.
For many of the algorithms there are parameters that have to be set manually, or that require
experimentation to identify appropriate values. These requirements are discussed at the appropriate
points of the book.
Training
Given the dataset, algorithm, and parameters, training should simply be the use of computational resources to build a model of the data in order to predict the outputs on new data.
Evaluation
Before a system can be deployed it needs to be tested and evaluated for accuracy on data that it was not trained on. This can often include a comparison with human experts in the field, and the selection of appropriate metrics for this comparison.
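A minimal sketch of held-out evaluation (the data, the 70/30 split, and the stand-in model are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # assumed labelling rule

# Hold out 30% of the data for evaluation; never train on it.
split = 70
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

# Toy "trained" model: a fixed linear rule standing in for any learner.
predict = lambda X: (X @ np.array([1.0, 1.0]) > 0).astype(int)

accuracy = np.mean(predict(X_test) == y_test)
print("held-out accuracy:", accuracy)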
The Concept Learning Task
In concept learning, the main goal is to find the hypothesis that best fits the training data set.
1. X — The set of items over which the concept is defined is called the set of instances, which we
denote by X. In the current example, X is the set of all possible days, each represented by the
attributes Sky, AirTemp, Humidity, Wind, Water, and Forecast.
2. c — The concept or function to be learned is called the target concept, which we denote by c. In
general, c can be any boolean valued function defined over the instances X; that is, c: X → {0, 1}.
In the current example, the target concept corresponds to the value of the attribute EnjoySport
(i.e., c(x) = 1 if EnjoySport = Yes, and c(x) = 0 if EnjoySport = No).
3. (x, c(x)) — When learning the target concept, the learner is presented with a set of training examples, each consisting of an instance x from X, along with its target concept value c(x).
Instances for which c(x) = 1 are called positive examples and instances for which c(x) = 0 are
called negative examples. We will often write the ordered pair (x, c(x)) to describe the training
example consisting of the instance x and its target concept value c(x).
4. D — The set of available training examples, i.e., the pairs (x, c(x)).
5. H — Given a set of training examples of the target concept c, the problem faced by the learner is to hypothesize, or estimate, c. We use the symbol H to denote the set of all possible hypotheses that the learner may consider regarding the identity of the target concept.
Notice that the learning algorithm's objective is to find a hypothesis h in H such that h(x) = c(x) for all x in D.
We know that an inductive learning algorithm tries to induce a "general rule" from a set of observed instances. So the above case is the same as inductive learning, where a learning algorithm is trying to find a hypothesis h (general rule) in H such that h(x) = c(x) for all x in D. For a given collection of examples, in reality, a learning algorithm returns a function h (hypothesis) that approximates c (target concept). But the expectation is that the learning algorithm returns a function h (hypothesis) that equals c (target concept), i.e., h(x) = c(x) for all x in D.
The number of combinations: 5 × 4 × 4 × 4 × 4 × 4 = 5120 syntactically distinct hypotheses. They are syntactically distinct but not semantically: for example, any two hypotheses containing one or more Ø constraints, such as <Ø, Warm, Normal, Strong, Warm, Same> and <Sunny, Ø, Normal, Strong, Warm, Same>, classify every instance as negative and therefore represent the same (empty) concept, even though they look different.
Let us formalize this. We say an instance x satisfies a hypothesis h when h classifies x as positive, written h(x) = 1. Then hypothesis h_j is more general than or equal to h_k (written h_j ≥_g h_k) if and only if every instance that satisfies h_k also satisfies h_j: (∀x ∈ X) [(h_k(x) = 1) → (h_j(x) = 1)].
Having established the general-to-specific ordering of hypotheses, we can use this partial ordering to organize the search for a hypothesis that is consistent with the observed training examples. One way is to begin with the most specific possible hypothesis in H, then generalize this hypothesis each time it fails to cover an observed positive training example. The FIND-S algorithm is used for this purpose. Here are the steps of the FIND-S algorithm:
1. Initialize h to the most specific hypothesis in H.
2. For each positive training instance x: for each attribute constraint a_i in h, if the constraint is satisfied by x, do nothing; otherwise replace a_i by the next more general constraint that is satisfied by x.
3. Output the hypothesis h.
To illustrate this algorithm, assume the learner is given the following sequence of training examples from the EnjoySport task.
1. The first step of FIND-S is to initialize h to the most specific hypothesis in H: h ← <Ø, Ø, Ø, Ø, Ø, Ø>.
2. First training example x1 = <Sunny, Warm, Normal, Strong, Warm, Same>, EnjoySport = +ve. Observing the first training example, it is clear that hypothesis h is too specific. None of the "Ø" constraints in h are satisfied by this example, so each is replaced by the next more general constraint that fits the example: h1 = <Sunny, Warm, Normal, Strong, Warm, Same>.
3. Consider the second training example x2 = <Sunny, Warm, High, Strong, Warm, Same>, EnjoySport = +ve. The second training example forces the algorithm to further generalize h, this time substituting a "?" in place of any attribute value in h that is not satisfied by the new example. Now h2 = <Sunny, Warm, ?, Strong, Warm, Same>.
4. Consider the third training example x3 = <Rainy, Cold, High, Strong, Warm, Change>, EnjoySport = −ve. The FIND-S algorithm simply ignores every negative example, so the hypothesis remains as before: h3 = <Sunny, Warm, ?, Strong, Warm, Same>.
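A compact sketch of FIND-S in Python, run on the three training examples from the walkthrough ("0" stands in for the Ø constraint; the encoding is otherwise as in the text):

# FIND-S sketch: "0" plays the role of the most specific constraint (Ø),
# "?" the most general. Training data follows the EnjoySport walkthrough.
examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), False),
]

h = ["0"] * 6                       # most specific hypothesis <Ø,...,Ø>
for x, positive in examples:
    if not positive:
        continue                    # FIND-S ignores negative examples
    for i, value in enumerate(x):
        if h[i] == "0":
            h[i] = value            # replace Ø by the observed value
        elif h[i] != value:
            h[i] = "?"              # generalize to "don't care"

print(h)   # ['Sunny', 'Warm', '?', 'Strong', 'Warm', 'Same']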
• Consider learning a classifier given a sample {(x_n, y_n)}_{n=1}^N, where x_n ∈ R^D and y_n ∈ {1, …, K}.
• Classification works as follows:
– Training time: learn a set of discriminant functions {g_k(x)}_{k=1}^K.
– Test time: given a new instance x, choose C_k if k = arg max_{i=1,…,K} g_i(x).
– Generative approach: we learn p(x|C_k) and p(C_k) for each class from the training data, and then use g_k(x) = p(C_k|x) ∝ p(x|C_k) p(C_k) (from Bayes' rule) to predict a class. Hence, besides learning the class boundaries (where p(C_i|x) = p(C_j|x) for i ≠ j), we also model the density of each class, using parametric or nonparametric methods for p(x|C_k).
– Discriminant-based approach: learn the discriminants g_k(x) directly, without modelling the class densities. Advantages: faster to train, and low space and time complexity at test time: O(D) per class (store w_k and multiply by it).
Two classes
• One discriminant function is sufficient: g(x) = g_1(x) − g_2(x) = w^T x + w_0. Testing: choose C1 if g(x) > 0 and C2 if g(x) < 0.
• This defines a hyperplane where w is the weight vector and w_0 the threshold (or bias). It divides the input space R^D into two half-spaces, the decision regions R1 for C1 (positive side) and R2 for C2 (negative side). The hyperplane itself is the boundary or decision surface.
• The origin x = 0 is on the positive side if w_0 > 0, on the boundary if w_0 = 0, and on the negative side if w_0 < 0.
• The signed distance from x ∈ R^D to the hyperplane is r = g(x)/‖w‖. Proof: write x = x_p + r w/‖w‖, where x_p is the orthogonal projection of x on the hyperplane, and compute g(x). The signed distance of the origin to the hyperplane is r_0 = w_0/‖w‖.
• So w determines the orientation of the hyperplane and w_0 its location with respect to the origin.
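A small NumPy check of these formulas (the weight vector, bias, and test point are assumptions for the sketch):

import numpy as np

w = np.array([3.0, 4.0])           # assumed weight vector, ||w|| = 5
w0 = -5.0                          # assumed bias
x = np.array([2.0, 1.0])

g = w @ x + w0                     # discriminant value g(x)
r = g / np.linalg.norm(w)          # signed distance of x to the hyperplane
r0 = w0 / np.linalg.norm(w)        # signed distance of the origin

print(g, r, r0)                    # g > 0, so x lies on the C1 side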
Perceptron
The Perceptron is nothing more than a collection of McCulloch and Pitts neurons together with a set of inputs and some weights to fasten the inputs to the neurons. On the left of the figure, shaded in light grey, are the input nodes. These are not neurons; they are just a nice schematic way of showing how values are fed into the network.
FIGURE: The Perceptron network, consisting of a set of input nodes (left) connected to McCulloch and Pitts neurons using weighted connections.
They are almost always drawn as circles, just like neurons, which is rather confusing, so I've shaded them a different colour. The neurons are shown on the right, and you can see both the additive part (shown as a circle) and the thresholder. In practice nobody bothers to draw the thresholder separately; you just need to remember that it is part of the neuron.
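A minimal sketch of one such neuron in NumPy, showing the additive part followed by the thresholder (the input and weight values are assumptions):

import numpy as np

def mcculloch_pitts(inputs, weights, threshold=0.0):
    """Additive part (weighted sum) followed by the thresholder."""
    activation = np.dot(weights, inputs)       # the circle in the figure
    return 1 if activation > threshold else 0  # the thresholder

# Assumed example: two inputs feeding one neuron.
print(mcculloch_pitts(np.array([1.0, 0.5]), np.array([0.6, -0.2])))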
LINEAR SEPARABILITY
Linear separability is the concept wherein the separation of input space into regions is based on
whether the network response is positive or negative. A decision line is drawn to separate positive
and negative responses.
It is computed by multiplying each element of the first vector by the matching element of the second and adding them all together. As you might remember from high school, a · b = ‖a‖‖b‖ cos θ, where θ is the angle between a and b and ‖a‖ is the length of the vector a. So the inner product computes a function of the angle between the two vectors, scaled by their lengths. It can be computed in NumPy using the np.inner() function.
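A quick sketch of this identity with np.inner (the two vectors are assumptions chosen to be at right angles):

import numpy as np

a = np.array([1.0, 0.0])
b = np.array([0.0, 2.0])

inner = np.inner(a, b)                            # a . b
cos_theta = inner / (np.linalg.norm(a) * np.linalg.norm(b))
print(inner, cos_theta)   # 0.0, 0.0: the vectors are at right angles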
Getting back to the Perceptron, the boundary case is where we find an input vector x1 that has x1 · w^T = 0. Now suppose that we find another input vector x2 that satisfies x2 · w^T = 0. Putting these two equations together we get:
x1 · w^T = x2 · w^T, and hence (x1 − x2) · w^T = 0.
What does this last equation mean? In order for the inner product to be 0, either ‖x1 − x2‖ or ‖w‖ or cos θ needs to be zero. There is no reason to believe that ‖x1 − x2‖ or ‖w‖ should be 0, so cos θ = 0. This means that θ = π/2 (or −π/2), which means that the two vectors are at right angles to each other: the weight vector w is perpendicular to any vector lying in the decision boundary.
LINEAR REGRESSION
Linear Regression is a machine learning algorithm based on supervised learning. It performs a regression task: regression models a target prediction value based on independent variables, and it is mostly used for finding out the relationship between variables and for forecasting. Classification can be recast as regression using an indicator variable that is 1 for examples in the class and 0 for all of the others. Since classification can be replaced by regression using these methods, we'll think about regression here.
The only real difference between the Perceptron and more statistical approaches is in the way that the problem is set up. For regression we are making a prediction about an unknown value y (such as the indicator variable for classes or a future value of some data) by computing some function of known values x_i. We are thinking about straight lines, so the output y is going to be a sum of the x_i values, each multiplied by a constant parameter: y = Σ_{i=0}^M β_i x_i. The β_i define a straight line (plane in 3D, hyperplane in higher dimensions) that goes through (or at least near) the datapoints.
The least-squares solution can be written as β = (X^T X)^{−1} X^T t, where t is a column vector containing the targets and X is the matrix of input values (even including the bias inputs), just as for the Perceptron.
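A hedged sketch of this solution in NumPy (the toy data is made up, and np.linalg.lstsq is used in place of the explicit inverse for numerical stability):

import numpy as np

# Assumed toy data: targets roughly follow t = 2*x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.1, 2.9, 5.2, 6.8]).reshape(-1, 1)   # column vector of targets

# Input matrix with a bias column of ones, as for the Perceptron.
X = np.column_stack([np.ones_like(x), x])

# beta = (X^T X)^{-1} X^T t, computed via a least-squares solver.
beta, *_ = np.linalg.lstsq(X, t, rcond=None)
print(beta.ravel())   # approximately [1.0, 2.0]
print(X @ beta)       # predictions at the training inputs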