Haykin Chapter 2: Learning Processes
Spring 2008

Learning

• Property of primary significance in a nnet: it can learn from its
environment, and improve its performance through learning.

• Iterative adjustment of synaptic weights.

• Learning: the process by which the free parameters of a neural network
are adapted through a process of stimulation by the environment in which
the network is embedded. The type of learning is determined by the manner
in which the parameter changes take place.
Learning

Sequence of events in nnet learning:

• nnet is stimulated by the environment.

• nnet undergoes changes in its free parameters as a result of this
stimulation.

• nnet responds in a new way to the environment because of the changes
that have occurred in its internal structure.

A prescribed set of well-defined rules for the solution of the learning
problem is called a learning algorithm.

The manner in which a nnet relates to the environment dictates the
learning paradigm, which refers to a model of the environment operated on
by the nnet.

Overview

Organization of this chapter:

1. Five basic learning rules: error-correction, Hebbian, memory-based,
competitive, and Boltzmann.

2. Learning paradigms: credit assignment problem, supervised learning,
unsupervised learning.

3. Learning tasks, memory, and adaptation.

4. Probabilistic and statistical aspects of learning.
Error-Correction Learning

• Input x(n), output y_k(n), and desired response or target output d_k(n).

• Error signal e_k(n) = d_k(n) − y_k(n).

• e_k(n) actuates a control mechanism that gradually adjusts the synaptic
weights, to minimize the cost function (or index of performance):

    E(n) = (1/2) e_k^2(n)

• When the synaptic weights reach a steady state, learning is stopped.

Error-Correction Learning: Delta Rule

• Widrow-Hoff rule, with learning rate η:

    ∆w_kj(n) = η e_k(n) x_j(n)

• With that, we can update the weights:

    w_kj(n+1) = w_kj(n) + ∆w_kj(n)

• There is a sound theoretical reason for doing this, which we will
discuss later.
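A minimal Octave/MATLAB sketch of the delta rule; the linear target,
data, and learning rate below are made up for illustration, not from the
slides:

    % Delta rule on a single linear neuron; hypothetical target d = 2*x1 - x2.
    eta = 0.05;                 % learning rate (eta)
    w = zeros(2, 1);            % synaptic weights w_kj
    for n = 1:2000
      x = rand(2, 1);           % input x(n)
      d = 2*x(1) - x(2);        % desired response d(n)
      y = w' * x;               % actual output y(n)
      e = d - y;                % error signal e(n) = d(n) - y(n)
      w = w + eta * e * x;      % delta rule: w(n+1) = w(n) + eta*e(n)*x(n)
    end
    disp(w')                    % converges toward [2 -1]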
Memory-Based Learning

• Given a new input x_test, determine its class based on the local
neighborhood of x_test:

– the criterion used for determining the neighborhood, and

– the learning rule applied to the neighborhood of the input, within the
set of training examples.

• The nearest neighbor x'_N of x_test is the stored pattern satisfying

    min_i d(x_i, x_test) = d(x'_N, x_test),

where d(·, ·) is the Euclidean distance.

• x_test is classified as the same class as x'_N.

• Cover and Hart (1967): the bound on the error is at most twice that of
the optimal (the Bayes probability of error), given that

– the classified examples are independently and identically distributed,
and

– the sample size N is infinitely large.
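A small Octave/MATLAB sketch of the nearest-neighbor rule (the data are
made up); the k-NN extension on the next slide only changes the final
vote:

    % Nearest-neighbor classification with Euclidean distance.
    X      = [0 0; 0 1; 1 0; 1 1; 5 5; 5 6; 6 5];  % stored patterns (rows)
    labels = [0; 0; 0; 0; 1; 1; 1];                % class of each pattern
    xtest  = [4.5 5.2];                            % new input
    dists = sqrt(sum((X - xtest).^2, 2));   % d(x_i, xtest) for all i
    [~, i] = min(dists);                    % index of the nearest neighbor
    disp(labels(i))                         % xtest inherits its class: 1
    % k-NN variant: [~, idx] = sort(dists); mode(labels(idx(1:k)))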
Memory-Based Learning: k−Nearest Neighbor

[Figure: 2D scatter of class-0 and class-1 training points; the test
input x lies in a region dominated by 1s, with one 0 outlier nearby.]

• Identify the k classified patterns that lie nearest to the test vector
x_test, for some integer k.

• Assign x_test to the class that is most frequently represented by the k
neighbors (use a majority vote).

• In effect, it is like averaging, so it can deal with outliers. The
input x above will be classified as 1.

Hebbian Learning

• Donald Hebb's postulate of learning appeared in his book The
Organization of Behavior (1949):

    When an axon of cell A is near enough to excite a cell B and
    repeatedly or persistently takes part in firing it, some growth
    process or metabolic changes take place in one or both cells such
    that A's efficiency, as one of the cells firing B, is increased.

• Hebbian synapse:

– If two neurons on either side of a synapse are activated
simultaneously, the synapse is strengthened.

– If they are activated asynchronously, the synapse is weakened or
eliminated. (This part was not mentioned in Hebb.)
Mathematical Models of Synaptic Plasticity

Covariance Rule (Sejnowski 1977)
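As a sketch of the rules these slides cover, assuming the standard forms
from Haykin: the activity-product (Hebbian) rule ∆w_kj = η y_k x_j, and
Sejnowski's covariance rule ∆w_kj = η (x_j − x̄)(y_k − ȳ), with x̄, ȳ the
time-averaged activities. In Octave/MATLAB, with made-up activities:

    % Hebbian (activity-product) rule vs. covariance rule.
    eta = 0.01;
    x = randn(200, 1);  y = 0.5*x + 0.1*randn(200, 1);  % correlated activities
    xbar = mean(x);  ybar = mean(y);     % time-averaged activities
    w = 0;  wc = 0;
    for n = 1:200
      w  = w  + eta * y(n) * x(n);                    % Hebb: dw = eta*y*x
      wc = wc + eta * (y(n) - ybar) * (x(n) - xbar);  % covariance rule
    end
    disp([w wc])
    % Unlike the plain Hebbian rule, the covariance rule can also weaken
    % the synapse (dw < 0) when the activities fall on opposite sides of
    % their means.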
Competitive Learning

– Competition mechanism, to choose one winner: the winner-takes-all
neuron.

• Inputs and weights can be seen as vectors: x and w_k. Note that the
weight vector belongs to a certain output neuron k, hence the index.
Competitive Learning

• Single layer, with feedforward excitatory connections and lateral
inhibitory connections.

• Winner selection:

    y_k = 1 if v_k > v_j for all j, j ≠ k
    y_k = 0 otherwise

• Limit on the total synaptic weight of each neuron:

    Σ_j w_kj = 1 for all k.

• Adaptation:

    ∆w_kj = η (x_j − w_kj) if k is the winner
    ∆w_kj = 0              otherwise

* The synaptic weight vector w_k = (w_k1, w_k2, ..., w_kn) is moved
toward the input vector.

Competitive Learning: Example

[Figure: the weight vector w(n) is moved toward the input x by the update
η (x − w(n)), giving w(n+1).]

• Interpreting the adaptation rule as a vector operation, we get the
above plot: the winner's weight vector takes a step of size η along
x − w(n).

• Weight vectors converge toward local input clusters: clustering.
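A sketch of competitive learning in Octave/MATLAB. The clusters are made
up, and the winner is chosen here as the neuron whose weight vector is
closest to the input, which matches the maximum-v_k rule when the weight
vectors are normalized:

    % Winner-takes-all competitive learning: weights find cluster centers.
    eta = 0.1;
    X = [0.1*randn(50,2) + 1; 0.1*randn(50,2) - 1];  % clusters at (1,1), (-1,-1)
    W = rand(2, 2);                                  % one weight row per neuron
    for epoch = 1:20
      for n = 1:size(X, 1)
        x = X(n, :);
        [~, k] = min(sum((W - x).^2, 2));      % winner: closest weight vector
        W(k,:) = W(k,:) + eta * (x - W(k,:));  % dw = eta*(x - w), winner only
      end
    end
    disp(W)   % each row ends up near one cluster center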
Boltzmann Machine: Learning and Operation

• Learning:

– Correlation of activity during the clamped condition: ρ+_kj.

– Correlation of activity during the free-running condition: ρ−_kj.

– The weights are adjusted using the difference of the two correlations:
∆w_kj = η (ρ+_kj − ρ−_kj).

Learning Paradigms

How neural networks relate to their environment:

• credit assignment problem
Learning without a Teacher: Unsupervised Learning

• Learn based on a task-independent measure of the quality of the
representation.

• Internal representations for encoding features of the input space.

• A competitive learning rule is needed, such as winner-takes-all.

Learning Tasks, Memory, and Adaptation

Learning tasks:

• Pattern association

• Pattern recognition

• Function approximation

• Control

• Filtering/Beamforming

Memory and adaptation
Pattern Association

Pattern Classification

• Mapping between an input pattern and a prescribed number of classes
(categories).

Function Approximation

• System identification: d = f(x).

• Inverse modeling: x = f^{−1}(d).
Control

• Control of a plant: a process or critical part of a system that is to
be maintained in a controlled condition.

• Feedback controller: adjust the plant input u so that the output of the
plant y tracks the reference signal d. Learning is in the form of ...

Filtering, Smoothing, and Prediction

• Smoothing: estimate a quantity at time n, based on measurements up to
time n + α (α > 0).

• Prediction: estimate a quantity at time n + α, based on measurements up
to time n (α > 0).
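A toy Octave/MATLAB sketch contrasting which samples each task may use;
the signal, window sizes, and one-step predictor are illustrative
assumptions:

    % Filtering uses samples up to n; smoothing may also use samples after
    % n; prediction estimates a future value from samples up to n.
    t = (1:200)';
    z = sin(2*pi*t/50) + 0.2*randn(200, 1);   % noisy measurements
    n = 100;  a = 5;                          % current time n; lag/lead alpha
    x_filt   = mean(z(n-9:n));       % filtering: estimate at n from z(1..n)
    x_smooth = mean(z(n-a:n+a));     % smoothing: also uses z up to n+a
    x_pred   = z(n) + (z(n)-z(n-1)); % crude linear extrapolation toward n+1
    disp([x_filt x_smooth x_pred])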
Blind Source Separation and Nonlinear Prediction

Linear Algebra Tip: Partitioned (or Block) Matrices
Memory

• Memory: relatively enduring neural alterations induced by an organism's
interaction with the environment.

• Memory needs to be accessible by the nervous system to influence
behavior.

• Activity patterns need to be stored through a learning process.

• Types of memory: short-term and long-term memory.

Associative Memory

[Figure: a single-layer network mapping the key vector components
x_k1, x_k2, ..., x_km to the output components y_k1, y_k2, ..., y_km
through weights w_ij(k).]

• q pattern pairs: (x_k, y_k), for k = 1, 2, ..., q.

• Input (key vector): x_k = [x_k1, x_k2, ..., x_km]^T.

• Output (memorized vector): y_k = [y_k1, y_k2, ..., y_km]^T.

• The weights can be represented as a weight matrix:

    y_k = W(k) x_k, for k = 1, 2, ..., q,

or elementwise,

    y_ki = Σ_{j=1}^{m} w_ij(k) x_kj, for i = 1, 2, ..., m.
Correlation Matrix Memory: Recall

• For convenience, let's say

    M̂ = Σ_{k=1}^{q} y_k x_k^T = Σ_{k=1}^{q} W(k).

• Will M̂ x_k give y_k? Note that W(k) x_k = y_k x_k^T x_k =
y_k ||x_k||^2 (the length of x_k squared). If all x_k's were normalized
to have length 1, W(k) x_k = y_k will hold!

• Now, back to M̂: under what condition will M̂ x_j give y_j for all j?
Let's begin by assuming x_k^T x_k = 1 (key vectors are normalized).

• We can decompose M̂ x_j as follows:

    M̂ x_j = Σ_{k=1}^{q} y_k x_k^T x_j
          = y_j x_j^T x_j + Σ_{k=1, k≠j}^{q} y_k x_k^T x_j.

• If all keys are orthogonal (perpendicular to each other), then for an
arbitrary k ≠ j,

    x_k^T x_j = ||x_k|| ||x_j|| cos(θ_kj) = 1 × 1 × 0 = 0,

so the noise term becomes 0, and hence M̂ x_j = y_j + 0 = y_j. The
example on page 41 is one such (extreme) case!
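A quick Octave/MATLAB check of the recall argument, using orthonormal
keys (the stored vectors are arbitrary made-up values):

    % Correlation matrix memory: perfect recall with orthonormal keys.
    x1 = [1 0 0]';  x2 = [0 1 0]';  x3 = [0 0 1]';  % orthonormal keys
    y1 = [1 2 3]';  y2 = [4 5 6]';  y3 = [7 8 9]';  % memorized vectors
    M = y1*x1' + y2*x2' + y3*x3';   % M = sum_k y_k x_k'
    disp(M * x2)                    % recalls y2 exactly: noise term is 0
    % With non-orthogonal keys, the cross-talk sum_{k~=j} y_k (x_k'*x_j)
    % corrupts the recall of y_j.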
Correlation Matrix Memory: Recall (cont'd)

• We can also ask how many items can be stored in M̂, i.e., its capacity.

• The capacity is closely related to the rank of the matrix M̂. The rank
is the number of linearly independent column vectors (or row vectors) in
the matrix.

• Linear independence means a linear combination of the vectors can be
zero only when the coefficients are all zero:

    c_1 x_1 + c_2 x_2 + ... + c_n x_n = 0

only when c_i = 0 for all i = 1, 2, ..., n.

• The above and the examples in the previous pages are best understood by
running simple calculations in Octave or Matlab (a small example follows
the next slide). See the src/ directory for example scripts.

Adaptation

• When the environment is stationary (the statistical characteristics do
not change over time), supervised learning can be used to obtain a
relatively stable set of parameters.

• If the environment is nonstationary, the parameters need to be adapted
over time, on an on-going basis (continuous learning or
learning-on-the-fly).

• If the signal is locally stationary (pseudostationary), then the
parameters can be repeatedly retrained based on a small window of
samples, assuming these are stationary: continual training with
time-ordered samples.
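Following the slide's suggestion, a small Octave/MATLAB calculation for
rank and linear independence (the vectors are chosen arbitrarily):

    % Rank = number of linearly independent columns.
    x1 = [1 0 0]';  x2 = [0 1 0]';  x3 = [1 1 0]';  % x3 = x1 + x2: dependent
    disp(rank([x1 x2 x3]))   % prints 2
    % A memory built from q stored pairs has rank at most min(q, m), which
    % bounds how many pairs can be recalled without error.
    M = [1 2 3]'*x1' + [4 5 6]'*x2';
    disp(rank(M))            % prints 2: two stored pairs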
Statistical Nature of Learning

• Random input vectors X ∈ {x_i}_{i=1}^N and random output scalar values
D ∈ {d_i}_{i=1}^N.

• Suppose we have a training set T = {(x_i, d_i)}_{i=1}^N. The problem is
that the target values D in the training set may only be approximate
(D ≈ f(X), i.e., D ≠ f(X)).

• So, we end up with a regressive model:

    D = f(X) + ε,

where f(·) is deterministic and ε is a random expectational error
representing our ignorance.

• In this light, f(x) can be expressed in statistical terms:
f(x) = E[D|x], since from D = f(X) + ε we can get

    E[D|x] = E[f(x) + ε | x] = f(x) + E[ε|x] = f(x).

• A property that can be derived from the above is that the expectational
error term is uncorrelated with the regressive function:

    E[ε f(X)] = 0.

This will become useful in the following.
Statistical Nature of Learning (cont'd)

• Neural network realization of the regressive model: Y = F(X, w). We
want to map the knowledge in the training data T into the weights w.

• We can now define the cost function:

    E(w) = (1/2) Σ_{i=1}^{N} (d_i − F(x_i, w))^2,

which can be written equivalently as an average over the training set,
E_T[·]:

    E(w) = (1/2) E_T[(d − F(x, T))^2].

Statistical Nature of Learning (cont'd)

• Using the regressive model,

    d − F(x, T) = d − f(x) + f(x) − F(x, T) = ε + (f(x) − F(x, T)).

With that, E(w) = (1/2) E_T[(d − F(x, T))^2] becomes

    E(w) = (1/2) E_T[(ε + (f(x) − F(x, T)))^2]
         = (1/2) E_T[ε^2 + 2ε(f(x) − F(x, T)) + (f(x) − F(x, T))^2]
         = (1/2) E_T[ε^2] + E_T[ε(f(x) − F(x, T))]
           + (1/2) E_T[(f(x) − F(x, T))^2].

The first term is the intrinsic error, the middle term reduces to 0
(since E[ε f(X)] = 0), and the last term is the one we're interested in.
• The term E_T[(f(x) − F(x, T))^2] can be rewritten, knowing
f(x) = E[D|x]:

    E_T[(E[D|x] − F(x, T))^2]
      = E_T[(E[D|x] − E_T[F(x, T)] + E_T[F(x, T)] − F(x, T))^2]
      = (E_T[F(x, T)] − E[D|x])^2 + E_T[(F(x, T) − E_T[F(x, T)])^2],

where the first term is the (squared) bias and the second is the
variance.

The last step above is obtained using E_T[E[D|x]^2] = E[D|x]^2,
E_T[E_T[F(x, T)]^2] = E_T[F(x, T)]^2, and
E_T[E[D|x] F(x, T)] = E[D|x] E_T[F(x, T)].

* Note: E[c] = c and E[cX] = cE[X] for a constant c and random variable X.

• The variance indicates the variation in F(x, T) over the entire
training set T: the estimation error.

• Typically, achieving smaller bias leads to higher variance, and smaller
variance leads to higher bias.
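A Monte-Carlo sketch of the bias/variance trade-off in Octave/MATLAB. The
target function, noise level, sample sizes, and polynomial degrees are
made-up choices: F(x, T) is refit on many independently drawn training
sets T, and bias^2 and variance are estimated at a single point x0.

    % Estimate bias^2 and variance of F(x0, T) over many training sets T.
    f = @(x) sin(2*pi*x);            % "true" regression f(x) = E[D|x]
    x0 = 0.3;  sigma = 0.3;          % evaluation point; std of epsilon
    R = 500;  N = 20;                % number of training sets; set size
    for p = [1 5]                    % low- vs. high-complexity model
      Fx0 = zeros(R, 1);
      for r = 1:R
        x = rand(N, 1);  d = f(x) + sigma*randn(N, 1);  % training set T
        c = polyfit(x, d, p);        % fit F(., T): degree-p polynomial
        Fx0(r) = polyval(c, x0);     % F(x0, T)
      end
      bias2 = (mean(Fx0) - f(x0))^2;       % (E_T[F(x0,T)] - E[D|x0])^2
      vari  = mean((Fx0 - mean(Fx0)).^2);  % E_T[(F - E_T[F])^2]
      fprintf('degree %d: bias^2 = %.4f, variance = %.4f\n', p, bias2, vari);
    end

Degree 1 typically shows larger bias^2 with smaller variance, and degree
5 the reverse, matching the dilemma stated above.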
Statistical Learning Theory

• Statistical learning theory addresses the fundamental issue of how to
control the generalization ability of a neural network in mathematical
terms.

• Certain quantities, such as the sample size and the Vapnik-Chervonenkis
dimension (VC dimension), are closely related to the bounds on the
generalization error.

Appendix on VC Dimension

• The concept of shattering

• VC dimension
The Vapnik-Chervonenkis Dimension

VC Dim. of Linear Decision Surfaces
Uses of VC Dimension