Slide02 Haykin Chapter 2: Learning Processes

This document provides an introduction to neural network learning processes. It discusses several key points: 1. Neural networks learn by adjusting their synaptic weights through iterative processes in response to environmental stimuli in order to improve performance. 2. There are five basic learning rules: error correction, Hebbian, memory-based, competitive, and Boltzmann. Learning paradigms include supervised vs unsupervised learning. 3. Error correction learning aims to minimize an error function by gradually adjusting synaptic weights according to an error signal. The delta rule provides a theoretical basis for this type of learning.


CPSC 636-600, Spring 2008
Instructor: Yoonsuck Choe


Introduction

• Property of primary significance in nnet: learn from its environment, and improve its performance through learning.

• Iterative adjustment of synaptic weights.

• Learning: hard to define.

  – One definition by Mendel and McClaren: Learning is a process by which the free parameters of a neural network are adapted through a process of stimulation by the environment in which the network is embedded. The type of learning is determined by the manner in which the parameter changes take place.

Learning

Sequence of events in nnet learning:

• nnet is stimulated by the environment.

• nnet undergoes changes in its free parameters as a result of this stimulation.

• nnet responds in a new way to the environment because of the changes that have occurred in its internal structure.

A prescribed set of well-defined rules for the solution of the learning problem is called a learning algorithm.

The manner in which a nnet relates to the environment dictates the learning paradigm, which refers to a model of the environment operated on by the nnet.


Overview

Organization of this chapter:

1. Five basic learning rules: error correction, Hebbian, memory-based, competitive, and Boltzmann

2. Learning paradigms: credit assignment problem, supervised learning, unsupervised learning

3. Learning tasks, memory, and adaptation

4. Probabilistic and statistical aspects of learning
Error-Correction Learning

• Input x(n), output y_k(n), and desired response or target output d_k(n).

• Error signal e_k(n) = d_k(n) − y_k(n)

• e_k(n) actuates a control mechanism that gradually adjusts the synaptic weights, to minimize the cost function (or index of performance):

    E(n) = (1/2) e_k(n)²

• When synaptic weights reach a steady state, learning is stopped.


Error-Correction Learning: Delta Rule

• Widrow-Hoff rule, with learning rate η:

    Δw_kj(n) = η e_k(n) x_j(n)

• With that, we can update the weights:

    w_kj(n + 1) = w_kj(n) + Δw_kj(n)

• There is a sound theoretical reason for doing this, which we will discuss later.
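A minimal Python sketch of the delta rule for a single linear neuron; the toy data, learning rate, and number of epochs are illustrative assumptions, not part of the slides:

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy supervised data: targets d(n) generated by an assumed linear mapping.
    X = rng.normal(size=(100, 3))          # 100 input vectors x(n), 3 inputs each
    w_true = np.array([0.5, -1.0, 2.0])
    d = X @ w_true                         # desired responses d(n)

    w = np.zeros(3)                        # synaptic weights w_kj
    eta = 0.05                             # learning rate

    for epoch in range(50):
        for x_n, d_n in zip(X, d):
            y_n = w @ x_n                  # output y_k(n) of the linear neuron
            e_n = d_n - y_n                # error signal e_k(n) = d_k(n) - y_k(n)
            w += eta * e_n * x_n           # delta rule: dw_kj(n) = eta e_k(n) x_j(n)

    print(w)   # approaches w_true as the weights reach a steady state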

Memory-Based Learning

• All (or most) past experiences are explicitly stored, as input-target pairs {(x_i, d_i)}_{i=1}^N.

• Two classes C_1, C_2.

• Given a new input x_test, determine class based on local neighborhood of x_test.

  – Criterion used for determining the neighborhood.

  – Learning rule applied to the neighborhood of the input, within the set of training examples.


Memory-Based Learning: Nearest Neighbor

• A set of instances observed so far:

    X = {x_1, x_2, ..., x_N}

• Nearest neighbor x'_N ∈ X of x_test:

    min_i d(x_i, x_test) = d(x'_N, x_test)

  where d(·, ·) is the Euclidean distance.

• x_test is classified as the same class as x'_N.

• Cover and Hart (1967): The bound on error is at most twice that of the optimal (Bayes probability of error), given:

  – The classified examples are independently and identically distributed.

  – The sample size N is infinitely large.
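A minimal Python sketch of this rule; the stored points are made-up, and setting k > 1 gives the k-nearest-neighbor variant on the next slide:

    import numpy as np
    from collections import Counter

    def nearest_neighbor_classify(X_stored, labels, x_test, k=1):
        # Euclidean distances d(x_i, x_test) to all stored instances
        dists = np.linalg.norm(X_stored - x_test, axis=1)
        # k = 1: plain nearest neighbor; k > 1: majority vote (next slide)
        nearest = np.argsort(dists)[:k]
        return Counter(labels[i] for i in nearest).most_common(1)[0][0]

    # Toy memory of stored input-target pairs from classes C1 = 0 and C2 = 1
    X_stored = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
    labels = np.array([0, 0, 1, 1])
    print(nearest_neighbor_classify(X_stored, labels, np.array([0.8, 0.9])))       # -> 1
    print(nearest_neighbor_classify(X_stored, labels, np.array([0.8, 0.9]), k=3))  # -> 1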
Memory-Based Learning: k-Nearest Neighbor

[Figure: scatter of training patterns from two classes (0 and 1), with a test input x surrounded mostly by 1s.]

• Identify the k classified patterns that lie nearest to the test vector x_test, for some integer k.

• Assign x_test to the class that is most frequently represented by the k neighbors (use majority vote).

• In effect, it is like averaging. It can deal with outliers. The input x above will be classified as 1.


Hebbian Learning

• Donald Hebb's postulate of learning appeared in his book The Organization of Behavior (1949):

  "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic changes take place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."

• Hebbian synapse:

  – If two neurons on either side of a synapse are activated simultaneously, the synapse is strengthened.

  – If they are activated asynchronously, the synapse is weakened or eliminated. (This part was not mentioned in Hebb.)

Hebbian Synapses

• Time-dependent mechanism

• Local mechanism

• Interactive mechanism

• Correlative/conjunctive mechanism


Classification of Synaptic Plasticity

Hebbian: time-dependent, highly local, heavily interactive.

    Type           Positively correlated   Negatively correlated
    Hebbian        Strengthen              Weaken
    Anti-Hebbian   Weaken                  Strengthen
    Non-Hebbian    ×                       ×

Strong evidence for Hebbian plasticity in the hippocampus (brain region).
Mathematical Models of Synaptic Plasticity

• General form: Δw_kj(n) = F(y_k(n), x_j(n))

• Hebbian learning (with learning rate η): Δw_kj(n) = η y_k(n) x_j(n)

• Covariance rule: Δw_kj = η(x_j − x̄)(y_k − ȳ)


Covariance Rule (Sejnowski 1977)

    Δw_kj = η(x_j − x̄)(y_k − ȳ)

• Convergence to a nontrivial state.

• Prediction of both potentiation and depression.

• Observations:

  – Weight enhanced when both pre- and postsynaptic activities are above average.

  – Weight depressed when:

    ∗ Presynaptic activity more than average, and postsynaptic activity less than average.

    ∗ Presynaptic activity less than average, and postsynaptic activity more than average.
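The rules above differ only in the product they accumulate; a Python sketch with made-up pre- and postsynaptic activity traces:

    import numpy as np

    rng = np.random.default_rng(1)
    eta = 0.01

    x = rng.normal(loc=1.0, size=1000)              # presynaptic activity x_j over time
    y = 0.8 * x + rng.normal(scale=0.1, size=1000)  # correlated postsynaptic activity y_k

    # Plain Hebbian rule: dw_kj(n) = eta y_k(n) x_j(n), which grows without bound
    dw_hebbian = eta * y * x

    # Covariance rule: dw_kj = eta (x_j - x_bar)(y_k - y_bar), centered on the means
    dw_covariance = eta * (x - x.mean()) * (y - y.mean())

    # Positive when both activities deviate from their means in the same direction
    # (potentiation), negative when they deviate in opposite directions (depression).
    print(dw_hebbian.sum(), dw_covariance.sum())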

Competitive Learning

• Output neurons compete with each other for a chance to become active.

• Highly suited to discover statistically salient features (that may aid in classification).

• Three basic elements:

  – Same type of neurons with different weight sets, so that they respond differently to a given set of inputs.

  – A limit imposed on the strength of each neuron.

  – Competition mechanism, to choose one winner: winner-takes-all neuron.


Inputs and Weights Seen as Vectors in High-dimensional Space

[Figure: a single output neuron y_k with inputs x_1, ..., x_n and weights w_k1, ..., w_kn; the input vector (x_1 x_2 x_3 ... x_n) and the weight vector (w_k1 w_k2 w_k3 ... w_kn) drawn in the same coordinate system.]

• Inputs and weights can be seen as vectors: x and w_k. Note that the weight vector belongs to a certain output neuron k, hence the index.
Competitive Learning: Example

• Single layer, with feedforward excitatory and lateral inhibitory connections.

• Winner selection:

    y_k = 1 if v_k > v_j for all j, j ≠ k
        = 0 otherwise

• Limit: Σ_j w_kj = 1 for all k.

• Adaptation:

    Δw_kj = η(x_j − w_kj) if k is the winner
          = 0             otherwise

• The synaptic weight vector w_k = (w_k1, w_k2, ..., w_kn) is moved toward the input vector.


Competitive Learning (cont'd)

[Figure: the weight vector w(n) is moved toward the input x by the amount η(x − w(n)), yielding w(n+1).]

• Interpreting the adaptation rule as a vector operation, we get the above plot.

• Weight vectors converge toward local input clusters: clustering.
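A minimal sketch of winner-takes-all competitive learning in Python; the 2-D clusters and learning rate are illustrative, and the winner is picked here as the neuron whose weight vector lies closest to the input (a common stand-in for the largest-activation rule above):

    import numpy as np

    rng = np.random.default_rng(2)
    eta = 0.05

    # Inputs drawn from three assumed local clusters in 2-D
    centers = np.array([[0.0, 0.0], [3.0, 3.0], [0.0, 3.0]])
    X = np.vstack([c + 0.3 * rng.normal(size=(100, 2)) for c in centers])
    rng.shuffle(X)

    W = rng.normal(size=(3, 2))    # one weight vector w_k per output neuron

    for x in X:
        k = np.argmin(np.linalg.norm(W - x, axis=1))  # winner-takes-all selection
        W[k] += eta * (x - W[k])   # dw_kj = eta (x_j - w_kj), winner only

    print(W)   # each weight vector converges toward one input cluster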

Boltzmann Learning

• Stochastic learning algorithm rooted in statistical mechanics.

• Recurrent network, binary neurons (on: '+1', off: '-1').

• Energy function E:

    E = −(1/2) Σ_j Σ_{k, k≠j} w_kj x_k x_j

• Activation:

  – Choose a random neuron k.

  – Flip state with a probability (given temperature T):

      P(x_k → −x_k) = 1 / (1 + exp(−ΔE_k / T))

    where ΔE_k is the change in E due to the flip.


Boltzmann Machine

• Two types of neurons:

  – Visible neurons: can be affected by the environment.

  – Hidden neurons: isolated.

• Two modes of operation:

  – Clamped: visible neuron states are fixed by environmental input and held constant.

  – Free-running: all neurons are allowed to update their activity freely.
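A sketch of the stochastic update step in Python; the network size, weights, and temperature are illustrative, and ΔE_k is taken here as the decrease in energy due to the flip, so that flips which lower E are favored:

    import numpy as np

    rng = np.random.default_rng(3)
    n, T = 8, 1.0                            # number of neurons, temperature

    W = rng.normal(size=(n, n))
    W = (W + W.T) / 2                        # symmetric weights w_kj = w_jk
    np.fill_diagonal(W, 0.0)                 # no self-connections (k != j)
    x = rng.choice([-1.0, 1.0], size=n)      # binary states: on '+1', off '-1'

    def energy(state):
        # E = -(1/2) sum_j sum_{k != j} w_kj x_k x_j
        return -0.5 * state @ W @ state

    for step in range(1000):
        k = rng.integers(n)                  # choose a random neuron k
        flipped = x.copy()
        flipped[k] = -flipped[k]
        dE = energy(x) - energy(flipped)     # decrease in E due to the flip
        if rng.random() < 1.0 / (1.0 + np.exp(-dE / T)):
            x = flipped                      # P(x_k -> -x_k) = 1/(1 + exp(-dE_k/T))

    print(x, energy(x))                      # the network settles into low-energy states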
Boltzmann Machine: Learning and Operation

• Learning:

  – Correlation of activity during the clamped condition: ρ+_kj.

  – Correlation of activity during the free-running condition: ρ−_kj.

  – Weight update: Δw_kj = η(ρ+_kj − ρ−_kj), j ≠ k.

• Train weights w_kj with various clamping input patterns.

• After training is completed, present a new clamping input pattern that is a partial input of one of the known vectors.

• Let it run clamped on the new input (subset of visible neurons), and eventually it will complete the pattern (pattern completion).

    Correl(x, y) = Cov(x, y) / (σ_x σ_y)


Learning Paradigms

How neural networks relate to their environment:

• credit assignment problem

• learning with a teacher

• learning without a teacher

Credit-Assignment Problem

• How to assign credit or blame for the overall outcome to individual decisions made by the learning machine.

• In many cases, the outcomes depend on a sequence of actions.

  – Assignment of credit for outcomes of actions (temporal credit-assignment problem): when does a particular action deserve credit?

  – Assignment of credit for actions to internal decisions (structural credit-assignment problem): assign credit to internal structures of actions generated by the system.

The credit-assignment problem routinely arises in neural network learning: which neuron, which connection to credit or blame?


Learning with a Teacher

• Also known as supervised learning.

• Teacher has knowledge, represented as input-output examples. The environment is unknown to the nnet.

• Nnet tries to emulate the teacher gradually.

• Error-correction learning is one way to achieve this.

• Error surface, gradient, steepest descent, etc.
Learning without a Teacher

Two classes:

• Reinforcement learning (RL) / neurodynamic programming

• Unsupervised learning / self-organization


Learning without a Teacher: Reinforcement Learning

• Learning an input-output mapping through continued interaction with the environment.

• Actor-critic: the critic converts the primary reinforcement signal into a higher-quality, heuristic reinforcement signal (Barto, Sutton, ...).

• Goal is to optimize the cumulative cost of actions.

• In many cases, learning is under delayed reinforcement. Delayed RL is difficult since (1) the teacher does not provide a desired action at each step, and (2) it must solve the temporal credit-assignment problem.

• Relation to dynamic programming, in the context of optimal control theory (Bellman).

Learning without a Teacher: Unsupervised Learning

• Learn based on a task-independent measure of the quality of representation.

• Internal representations for encoding features of the input space.

• Competitive learning rule needed, such as winner-takes-all.


Learning Tasks, Memory, and Adaptation

Learning tasks:

• Pattern association

• Pattern recognition

• Function approximation

• Control

• Filtering/Beamforming

Memory and adaptation
Pattern Association

• Associative memory: brainlike distributed memory that learns associations. Storage and retrieval (recall).

• Pattern association (x_k: key pattern, y_k: memorized pattern):

    x_k → y_k,  k = 1, 2, ..., q

  – Autoassociation (x_k = y_k): given a partial or corrupted version of a stored pattern, retrieve the original.

  – Heteroassociation (x_k ≠ y_k): learn arbitrary pattern pairs and retrieve them.

• Relevant issues: storage capacity vs. accuracy.


Pattern Classification

• Mapping between an input pattern and a prescribed number of classes (categories).

• Two general types:

  – Feature extraction (observation space to feature space; cf. dimensionality reduction), then classification (feature space to decision space).

  – Single step (observation space to decision space).

Function Approximation

• Nonlinear input-output mapping: d = f(x) for an unknown f.

• Given a set of labeled examples T = {(x_i, d_i)}_{i=1}^N, estimate F(·) such that

    ‖F(x) − f(x)‖ < ε, for all x


Function Approx: System Identification and Inverse System Modeling

• System identification: learn the function of an unknown system:

    d = f(x)

• Inverse system modeling: learn the inverse function:

    x = f⁻¹(d)
Control

• Control of a plant: a process or critical part of a system that is to be maintained in a controlled condition.

• Feedback controller: adjust the plant input u so that the output of the plant y tracks the reference signal d. Learning is in the form of free-parameter adjustment in the controller.


Filtering, Smoothing, and Prediction

Extract information about a quantity of interest from a set of noisy data.

• Filtering: estimate the quantity at time n, based on measurements up to time n.

• Smoothing: estimate the quantity at time n, based on measurements up to time n + α (α > 0).

• Prediction: estimate the quantity at time n + α, based on measurements up to time n (α > 0).

Blind Source Separation and Nonlinear Prediction

• Blind source separation: recover u(n) from the distorted signal x(n) when the mixing matrix A is unknown:

    x(n) = A u(n)

• Nonlinear prediction: given x(n − T), x(n − 2T), ..., estimate x(n) (x̂(n) is the estimated value).


Linear Algebra Tip: Partitioned (or Block) Matrices

[Figure: a 4 × 5 matrix partitioned into four blocks A, B, C, D.]

• When multiplying matrices, or a matrix and a vector, partitioning them and multiplying the corresponding partitions can be very convenient.

• Consider the 4 × 5 matrix above (let's call it X). If you have another 5 × 4 matrix partitioned similarly into [E, F; G, H] (let's call it Y), then you can calculate the product as another block matrix:

    XY = [A, B; C, D] [E, F; G, H] = [AE + BG, AF + BH; CE + DG, CF + DH]

Example from http://[Link]/matrix_linear_trans/08_partition/[Link].
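The block-product identity is easy to check numerically; a sketch with random matrices (the 4 × 5 and 5 × 4 sizes follow the example above):

    import numpy as np

    rng = np.random.default_rng(4)
    Xm = rng.normal(size=(4, 5))
    Ym = rng.normal(size=(5, 4))

    # Partition Xm into [A B; C D] and Ym into [E F; G H]
    A, B = Xm[:2, :3], Xm[:2, 3:]
    C, D = Xm[2:, :3], Xm[2:, 3:]
    E, F = Ym[:3, :2], Ym[:3, 2:]
    G, H = Ym[3:, :2], Ym[3:, 2:]

    # The blockwise product agrees with the ordinary matrix product
    blockwise = np.block([[A @ E + B @ G, A @ F + B @ H],
                          [C @ E + D @ G, C @ F + D @ H]])
    print(np.allclose(blockwise, Xm @ Ym))   # True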
Memory

• Memory: relatively enduring neural alterations induced by an organism's interaction with the environment.

• Memory needs to be accessible by the nervous system to influence behavior.

• Activity patterns need to be stored through a learning process.

• Types of memory: short-term and long-term memory.


Associative Memory

[Figure: a single-layer network mapping inputs x_k1, x_k2, ..., x_km to outputs y_k1, y_k2, ..., y_km through weights w_ij(k).]

q pattern pairs: (x_k, y_k), for k = 1, 2, ..., q.

• Input (key vector) x_k = [x_k1, x_k2, ..., x_km]ᵀ.

• Output (memorized vector) y_k = [y_k1, y_k2, ..., y_km]ᵀ.

• Weights can be represented as a weight matrix:

    y_k = W(k) x_k,  for k = 1, 2, ..., q

    y_ki = Σ_{j=1}^{m} w_ij(k) x_kj,  for i = 1, 2, ..., m

Associative Memory (cont'd)

• Weight matrix:

    y_k = W(k) x_k,  for k = 1, 2, ..., q

    y_ki = Σ_{j=1}^{m} w_ij(k) x_kj,  for i = 1, 2, ..., m

    y_ki = [w_i1(k), w_i2(k), ..., w_im(k)] [x_k1; x_k2; ...; x_km],  i = 1, 2, ..., m

  In full matrix form:

    [ y_k1 ]   [ w_11(k)  w_12(k)  ...  w_1m(k) ] [ x_k1 ]
    [ y_k2 ] = [ w_21(k)  w_22(k)  ...  w_2m(k) ] [ x_k2 ]
    [  ... ]   [   ...      ...    ...    ...   ] [  ... ]
    [ y_km ]   [ w_m1(k)  w_m2(k)  ...  w_mm(k) ] [ x_km ]


Associative Memory (cont'd)

• With a single W(k), we can only represent one mapping (x_k to y_k). For all pairs (x_k, y_k) (k = 1, 2, ..., q), we need q such weight matrices:

    y_k = W(k) x_k,  for k = 1, 2, ..., q

• One strategy is to combine all W(k) into a single memory matrix M by simple summation:

    M = Σ_{k=1}^{q} W(k)

• Will such a simple strategy work? That is, can the following be possible with M?

    y_k ≈ M x_k,  for k = 1, 2, ..., q
Associative Memory: Example – Storing Multiple Mappings

With a fixed set of key vectors x_k, an m × m matrix can store m arbitrary output vectors y_k.

• Let x_k = [0, 0, ..., 1, ..., 0]ᵀ where only the k-th element is 1 and all the rest are 0.

• Construct a memory matrix M with each column k holding the arbitrary output vector y_k:

    M = [ y_1, y_2, ..., y_m ]

• Then, y_k = M x_k, for all k = 1, 2, ..., m.

• But, we want x_k to be arbitrary too!


Correlation Matrix Memory

• With q pairs (x_k, y_k), we can construct a candidate memory matrix that stores all q mappings as:

    M̂ = Σ_{k=1}^{q} y_k x_kᵀ = y_1 x_1ᵀ + y_2 x_2ᵀ + ... + y_q x_qᵀ,

  where y_k x_kᵀ represents the outer product of the vectors, which results in a matrix, i.e., (y_k x_kᵀ)_ij = y_ki x_kj.

• A more convenient notation is:

    M̂ = [ y_1, y_2, ..., y_q ] [ x_1ᵀ; x_2ᵀ; ...; x_qᵀ ] = Y Xᵀ.

  This can be verified easily using partitioned matrices.

Correlation Matrix Memory: Recall

• Will M̂ x_k give y_k?

• For convenience, let's say

    M̂ = Σ_{k=1}^{q} y_k x_kᵀ = Σ_{k=1}^{q} W(k).

• First, consider W(k) = y_k x_kᵀ only. Check if W(k) x_k = y_k:

    W(k) x_k = y_k x_kᵀ x_k = y_k (x_kᵀ x_k) = c y_k

  where c = x_kᵀ x_k, a scalar value (the length of the vector x_k squared). If all x_k's were normalized to have length 1, W(k) x_k = y_k will hold!


Correlation Matrix Memory: Recall (cont'd)

• Now, back to M̂: under what condition will M̂ x_j give y_j for all j? Let's begin by assuming x_kᵀ x_k = 1 (key vectors are normalized).

• We can decompose M̂ x_j as follows:

    M̂ x_j = Σ_{k=1}^{q} y_k x_kᵀ x_j = y_j x_jᵀ x_j + Σ_{k=1, k≠j}^{q} y_k x_kᵀ x_j.

• We know y_j x_jᵀ x_j = y_j, so it now becomes:

    M̂ x_j = y_j + Σ_{k=1, k≠j}^{q} y_k x_kᵀ x_j,

  where the second term is the noise term.

• If all keys are orthogonal (perpendicular to each other), then for an arbitrary k ≠ j, x_kᵀ x_j = ‖x_k‖ ‖x_j‖ cos(θ_kj) = 1 × 1 × 0 = 0, so the noise term becomes 0, and hence M̂ x_j = y_j + 0 = y_j. The earlier example with unit-basis key vectors is one such (extreme) case!
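A numerical sketch of the construction M̂ = Y Xᵀ and the recall argument, in Python; the sizes and the use of a QR factorization to obtain orthonormal keys are illustrative choices:

    import numpy as np

    rng = np.random.default_rng(5)
    m, q = 8, 4                        # dimension m, number of stored pairs q

    # Orthonormal keys x_k (columns of X): x_k'x_k = 1, x_k'x_j = 0 for k != j
    X = np.linalg.qr(rng.normal(size=(m, q)))[0]
    Y = rng.normal(size=(m, q))        # arbitrary memorized vectors y_k

    # Correlation matrix memory: M = sum_k y_k x_k' = Y X'
    M = Y @ X.T

    # Recall: with orthonormal keys the noise term vanishes, so M x_k = y_k
    print(np.allclose(M @ X, Y))       # True

    # Capacity relates to the rank of M (next slide): here rank(M) = q
    print(np.linalg.matrix_rank(M))    # 4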
Correlation Matrix Memory: Recall (cont'd)

• We can also ask how many items can be stored in M̂, i.e., its capacity.

• The capacity is closely related to the rank of the matrix M̂. The rank is the number of linearly independent column vectors (or row vectors) in the matrix.

• Linear independence means a linear combination of the vectors can be zero only when the coefficients are all zero:

    c_1 x_1 + c_2 x_2 + ... + c_n x_n = 0

  only when c_i = 0 for all i = 1, 2, ..., n.

• The above and the examples on the previous slides are best understood by running simple calculations in Octave or Matlab. See the src/ directory for example scripts.


Adaptation

• When the environment is stationary (the statistical characteristics do not change over time), supervised learning can be used to obtain a relatively stable set of parameters.

• If the environment is nonstationary, the parameters need to be adapted over time, on an ongoing basis (continuous learning or learning-on-the-fly).

• If the signal is locally stationary (pseudostationary), then the parameters can repeatedly be retrained based on a small window of samples, assuming these are stationary: continual training with time-ordered samples.
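A sketch of windowed retraining in Python (the src/ scripts mentioned above are Octave/Matlab; here the drifting linear target, window size, and learning rate are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(7)
    steps, window, eta = 2000, 50, 0.1

    # Nonstationary stream: the true parameters drift slowly over time.
    w_true = np.array([1.0, -1.0]) + np.cumsum(0.01 * rng.normal(size=(steps, 2)), axis=0)
    X = rng.normal(size=(steps, 2))
    d = (X * w_true).sum(axis=1) + 0.05 * rng.normal(size=steps)

    w = np.zeros(2)
    for n in range(window, steps + 1, window):
        # Retrain on the most recent window of samples, assumed locally stationary.
        for x_n, d_n in zip(X[n - window:n], d[n - window:n]):
            w += eta * (d_n - w @ x_n) * x_n   # delta-rule update on the window
    print(w, w_true[-1])   # windowed estimate tracks the drifting true parameters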

Statistical Nature of Learning

• Deviation between the target function f(x) and the neural network realization of the function F(x, w) can be expressed in statistical terms (note F(·, ·) is parameterized by the weight vector w).

• Random input vectors X ∈ {x_i}_{i=1}^N and random output scalar values D ∈ {d_i}_{i=1}^N.

• Suppose we have a training set T = {(x_i, d_i)}_{i=1}^N. The problem is that the target values D in the training set may only be approximate (D ≈ f(X), i.e., D ≠ f(X)).

• So, we end up with a regressive model:

    D = f(X) + ε,

  where f(·) is deterministic and ε is a random expectational error representing our ignorance.


Statistical Nature of Learning (cont'd)

• The error term ε is typically assumed to have a zero mean: E[ε|x] = 0. (E[·] is the expected value of a random variable.)

• In this light, f(x) can be expressed in statistical terms: f(x) = E[D|x], since from D = f(X) + ε, we can get

    E[D|x] = E[f(x) + ε | x] = f(x) + E[ε|x] = f(x).

• A property that can be derived from the above is that the expectational error term is uncorrelated with the regressive function: E[ε f(X)] = 0. This will become useful in the following.
Statistical Nature of Learning (cont'd)

• Neural network realization of the regressive model: Y = F(X, w). We want to map the knowledge in the training data T into the weights w.

• We can now define the cost function:

    E(w) = (1/2) Σ_{i=1}^{N} (d_i − F(x_i, w))²

  which can be written equivalently as an average over the training set, E_T[·]:

    E(w) = (1/2) E_T[(d − F(x, T))²]


Statistical Nature of Learning (cont'd)

• Note that

    d − F(x, T) = d − f(x) + f(x) − F(x, T) = ε + (f(x) − F(x, T)).

  With that, E(w) = (1/2) E_T[(d − F(x, T))²] becomes

    = (1/2) E_T[(ε + (f(x) − F(x, T)))²]
    = (1/2) E_T[ε² + 2ε(f(x) − F(x, T)) + (f(x) − F(x, T))²]
    = (1/2) E_T[ε²] + E_T[ε(f(x) − F(x, T))] + (1/2) E_T[(f(x) − F(x, T))²]

  The first term is the intrinsic error, the middle term reduces to 0, and the last term is the one we're interested in.

Statistical Nature of Learning: Bias/Variance Dilemma

The cost function term we derived,

    E_T[(f(x) − F(x, T))²],

can be rewritten, knowing f(x) = E[D|x]:

    E_T[(E[D|x] − F(x, T))²]
    = E_T[(E[D|x] − E_T[F(x, T)] + E_T[F(x, T)] − F(x, T))²]
    = (E_T[F(x, T)] − E[D|x])² + E_T[(F(x, T) − E_T[F(x, T)])²]

where the first term is the bias (squared) and the second is the variance. The last step above is obtained using E_T[E[D|x]²] = E[D|x]², E_T[E_T[F(x, T)]²] = E_T[F(x, T)]², and E_T[E[D|x] F(x, T)] = E[D|x] E_T[F(x, T)].

* Note: E[c] = c and E[cX] = c E[X] for a constant c and random variable X.


Bias/Variance Dilemma (cont'd)

• The bias indicates how much F(x, T) differs from the true function f(x): approximation error.

• The variance indicates the variance in F(x, T) over the entire training set T: estimation error.

• Typically, achieving smaller bias leads to higher variance, and smaller variance leads to higher bias.
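A small simulation of the bias/variance split in Python; the true function, noise level, and the two model families (degree-0 vs. degree-5 polynomials) are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(6)
    f = lambda x: np.sin(2 * np.pi * x)       # assumed true function f(x)
    x0 = 0.3                                  # point at which bias/variance is measured

    def fit_predict(degree):
        # Draw one training set T, fit F(., T), and predict F(x0, T).
        x = rng.uniform(size=20)
        d = f(x) + 0.3 * rng.normal(size=20)  # D = f(X) + epsilon
        return np.polyval(np.polyfit(x, d, degree), x0)

    for degree in (0, 5):
        preds = np.array([fit_predict(degree) for _ in range(2000)])
        bias_sq = (preds.mean() - f(x0)) ** 2  # (E_T[F(x,T)] - E[D|x])^2
        variance = preds.var()                 # E_T[(F(x,T) - E_T[F(x,T)])^2]
        print(degree, round(bias_sq, 4), round(variance, 4))
    # The rigid model shows high bias / low variance; the flexible one, the reverse.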
Statistical Learning Theory

• Statistical learning theory addresses the fundamental issue of how to control the generalization ability of a neural network in mathematical terms.

• Certain quantities, such as the sample size and the Vapnik-Chervonenkis dimension (VC dimension), are closely related to the bounds on generalization error.

• The probably approximately correct (PAC) learning model is another framework to study such bounds. In this case, the confidence δ (probably) and the tolerable error level ε (approximately correct) are important quantities. Given these, and other measures such as the VC dimension, we can calculate the sample complexity (how many samples are needed to achieve that level of correctness ε with that much confidence δ).


Appendix on VC Dimension

• The concept of shattering

• VC dimension

Shattering a Set of Instances

Definition: a dichotomy of a set S is a partition of S into two disjoint subsets.

Definition: a set of instances S is shattered by a function class F if and only if for every dichotomy of S there exists some function in F consistent with this dichotomy.


Three Instances Shattered

[Figure: instance space X with three instances and several closed contours.]

Each closed contour indicates one dichotomy. What kind of classifier function can shatter the instances?
The Vapnik-Chervonenkis Dimension

Definition: The Vapnik-Chervonenkis dimension, VC(F), of a function class F defined over sample space X is the size of the largest finite subset of X shattered by F. If arbitrarily large finite sets of X can be shattered by F, then VC(F) ≡ ∞.

Note that |F| can be infinite, while VC(F) is finite!


VC Dim. of Linear Decision Surfaces

[Figure: two point configurations, (a) and (b).]

• When F is a set of lines, and S a set of points, VC(F) = 3.

• (a) can be shattered, but (b) cannot be. However, if at least one subset of size 3 can be shattered, that's fine.

• A set of size 4 cannot be shattered, for any combination of points (think about an XOR-like situation).
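A brute-force check of shattering by lines, sketched in Python (this assumes SciPy is available; linear separability of each dichotomy is tested as a feasibility linear program, and the point sets are illustrative):

    import numpy as np
    from itertools import product
    from scipy.optimize import linprog

    def separable(points, labels):
        # Feasibility LP: find (w1, w2, b) with label * (w.x + b) >= 1 for all points.
        A_ub = np.array([[-y * x[0], -y * x[1], -y] for x, y in zip(points, labels)])
        b_ub = -np.ones(len(points))
        res = linprog(np.zeros(3), A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 3)
        return res.success

    def shattered_by_lines(points):
        # Shattered iff every dichotomy (every +/-1 labeling) is linearly separable.
        return all(separable(points, labels)
                   for labels in product([-1, 1], repeat=len(points)))

    three = [(0, 0), (1, 0), (0, 1)]          # like case (a): can be shattered
    four = [(0, 0), (1, 1), (1, 0), (0, 1)]   # XOR-like: cannot be shattered
    print(shattered_by_lines(three), shattered_by_lines(four))   # True False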

Uses of VC Dimension

• Training error decreases monotonically as the VC dimension is increased.

• Confidence interval increases monotonically as the VC dimension is increased.

• Sample complexity (in the PAC framework) increases as the VC dimension increases.
