
CS490 – Advanced Topics in Computing (Deep Learning)

Lecture 7: Activation Functions, Data Preprocessing & Model Initialization

Dr. Muhammad Shahzad
muhammad.shehzad@seecs.edu.pk

Department of Computing (DOC),
School of Electrical Engineering & Computer Science (SEECS),
National University of Sciences & Technology (NUST)

03/03/2021
Activation functions

Two major drawbacks:

1 - Sigmoids saturate and kill gradients



▪ When the local gradient is very small, it will
effectively “kill” the gradient and almost no
signal will flow through the neuron to its
weights and recursively to its data


▪ Additionally, one must pay extra caution when initializing the weights of sigmoid neurons to prevent saturation
   ▪ E.g., if the initial weights are too large, then most neurons would become saturated and the network will barely learn



2 - Sigmoid outputs are not zero-centered
▪ This has implications for the dynamics during gradient descent: if the data coming into a neuron is always positive, then during backpropagation the gradients on the weights w will be either all positive or all negative (depending on the gradient of the whole expression f)

▪ This could introduce undesirable zig-zagging
dynamics in the gradient updates for the
weights

▪ However, notice that once these gradients are added up across a batch of data, the final update for the weights can have variable signs, somewhat mitigating this issue

▪ Therefore, this is an inconvenience, but it has less severe consequences compared to the saturated activation problem above
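
To make both drawbacks concrete, here is a minimal NumPy sketch (the function names are ours, for illustration): the local gradient σ(x)(1 − σ(x)) is at most 0.25 and essentially vanishes once |x| is larger than a few units, while every output lies in (0, 1) and is therefore never zero-centered.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_local_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)               # derivative of the sigmoid w.r.t. its input

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(x))                      # all outputs in (0, 1)  -> not zero-centered
print(sigmoid_local_grad(x))           # ~4.5e-05 at |x| = 10   -> gradient effectively "killed"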
Activation functions

In practice, the tanh non-linearity is always preferred to the sigmoid non-linearity.

Also note that the tanh neuron is simply a scaled sigmoid neuron; in particular, the following holds:

tanh(x) = 2 σ(2x) − 1
Activation functions

▪ The ReLU was found to greatly accelerate (by a factor of ~6) the convergence of stochastic gradient descent compared to the sigmoid/tanh functions
▪ It is argued that this is due to its linear, non-saturating form

▪ Compared to tanh/sigmoid neurons that involve expensive operations (exponentials, etc.), the ReLU can be implemented by simply thresholding a matrix of activations at zero
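
For example, a whole layer of ReLUs is a single elementwise threshold (a minimal sketch; the batch of pre-activations is a made-up example):

import numpy as np

pre_activations = np.random.randn(128, 512)   # hypothetical (N x D) matrix of pre-activations
relu_out = np.maximum(0.0, pre_activations)   # ReLU: threshold the whole matrix at zero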


▪ Unfortunately, ReLU units can be fragile during training and can "die"
   ▪ E.g., a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron will never activate on any datapoint again
   ▪ If this happens, then the gradient flowing through the unit will forever be zero from that point on
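
A simple way to detect this in practice (and to monitor the fraction of "dead" units, as recommended later in this lecture) is to check which units never activate over a large batch of data; a sketch with made-up activations:

import numpy as np

relu_out = np.maximum(0.0, np.random.randn(1000, 256))   # hypothetical (N x H) ReLU outputs

dead = np.all(relu_out == 0.0, axis=0)   # a unit is "dead" if it never fires on any datapoint
print("fraction of dead units:", dead.mean())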

Activation functions

Some people report success with this form of activation function (the Leaky ReLU), but the results are not always consistent
Activation functions

Maxout units divide z into groups of k values and output the maximum within each group

(Figure: maxout units with k = 2 and k = 4)
What neuron type should I use?

▪ Use the ReLU non-linearity, be careful with your learning rates, and possibly monitor the fraction of "dead" units in a network
▪ If this concerns you, give Leaky ReLU or Maxout a try
▪ Never use sigmoid
▪ Try tanh, but expect it to work worse than ReLU/Maxout

Setting up the data and the model

Data Preprocessing
Mean subtraction
▪ The most common form of preprocessing
▪ It involves subtracting the mean across every individual feature in
the data
Normalization
▪ Refers to normalizing the data dimensions so that they are of
approximately the same scale
▪ Typically, it is done by dividing each dimension by its standard
deviation, once it has been zero-centered
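
In NumPy, both steps are one-liners in the style of the CS231n notes (X is assumed to be an N x D data matrix with one example per row):

import numpy as np

X = np.random.randn(500, 32) * 3.0 + 7.0   # hypothetical (N x D) data matrix

X -= np.mean(X, axis=0)                    # mean subtraction: zero-center every feature
X /= np.std(X, axis=0)                     # normalization: scale each dimension to unit standard deviation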

Data Preprocessing
PCA and Whitening
▪ Data is first centered
▪ Decorrelate the data by projecting the original (but zero-centered)
data into the eigenbasis
▪ PCA is applied to reduce the dimensionality
▪ Whiten the data by taking the data in the eigenbasis and dividing every dimension by the square root of its eigenvalue (i.e., its standard deviation in the eigenbasis) to normalize the scale
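
These steps map directly onto a few lines of NumPy (a sketch following the CS231n notes; the data matrix X, the number of retained components, and the 1e-5 fudge factor are illustrative assumptions):

import numpy as np

X = np.random.randn(500, 64)          # hypothetical (N x D) data matrix
X -= np.mean(X, axis=0)               # 1) center the data
cov = np.dot(X.T, X) / X.shape[0]     # covariance matrix (D x D)
U, S, _ = np.linalg.svd(cov)          # columns of U are the eigenvectors, S holds the eigenvalues

Xrot = np.dot(X, U)                   # 2) decorrelate: project onto the eigenbasis
Xrot_reduced = np.dot(X, U[:, :10])   # 3) PCA: keep only the top-10 components (example number)

Xwhite = Xrot / np.sqrt(S + 1e-5)     # 4) whiten: divide by sqrt(eigenvalue); 1e-5 avoids division by zero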

Common pitfall

▪ An important point to make about the preprocessing is that any preprocessing statistics (e.g., the data mean) must only be computed on the training data, and then applied to the validation / test data

   ► E.g., computing the mean and subtracting it from every image across the entire dataset and then splitting the data into train/val/test splits would be a mistake

▪ Instead, the mean must be computed only over the training data and then subtracted equally from all splits (train/val/test)
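
A sketch of the correct order of operations (the split sizes and arrays are hypothetical):

import numpy as np

X_train = np.random.randn(800, 32)
X_val = np.random.randn(100, 32)
X_test = np.random.randn(100, 32)

mean = np.mean(X_train, axis=0)   # preprocessing statistics come from the training split only...
X_train = X_train - mean          # ...and are then applied identically to every split
X_val = X_val - mean
X_test = X_test - mean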

Different Strategies in Practice for Images

Model Initialization

Weight Initialization
Do NOT perform all-zero initialization
▪ Every neuron in the network will compute the same output, and will therefore also compute the same gradients during backpropagation and undergo the exact same parameter updates
▪ In other words, there is no source of asymmetry between neurons if their weights are initialized to be the same

Weight Initialization
Use small random numbers
▪ Instead, it is common to initialize the weights of the neurons to small random numbers and refer to doing so as symmetry breaking
▪ Randomly initialize the neurons so that they are all unique in the beginning (typically sampled from a zero-mean, unit-standard-deviation Gaussian), so that they compute distinct updates and integrate themselves as diverse parts of the full network

(Gaussian with zero mean and 0.01 standard deviation)

Note: smaller is not necessarily better, since very small weights can greatly diminish the "gradient signal" flowing backward through a network

Works okay for smaller networks
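
The annotation above corresponds to an initialization of roughly this form (a sketch; D and H are hypothetical layer sizes):

import numpy as np

D, H = 784, 100                    # hypothetical fan-in and fan-out of a layer
W = 0.01 * np.random.randn(D, H)   # zero-mean Gaussian with standard deviation 0.01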

Weight Initialization

Calibrating the variances with 1/sqrt(n)

▪ One problem with the previous suggestion is that the distribution of the outputs from a randomly initialized neuron has a variance that grows with the number of inputs

▪ It turns out that we can normalize the variance of each neuron's output to 1 by scaling its weight vector by 1/sqrt(n), where n is its fan-in (i.e., its number of inputs)

▪ This ensures that all neurons in the network initially have approximately the same output distribution and empirically improves the rate of convergence

Weight Initialization

Consider a neuron with pre-activation s = Σᵢ wᵢxᵢ. Then:

Var(s) = Var(Σᵢ wᵢxᵢ)
       = Σᵢ Var(wᵢxᵢ)                                              (assuming that all wᵢ, xᵢ are independent)
       = Σᵢ [ E[wᵢ]² Var(xᵢ) + E[xᵢ]² Var(wᵢ) + Var(wᵢ)Var(xᵢ) ]   (using the property of variance of a product)
       = Σᵢ Var(wᵢ) Var(xᵢ)                                        (assuming zero mean, i.e. E[wᵢ] = E[xᵢ] = 0)
       = n Var(w) Var(x)                                           (assuming that all wᵢ, xᵢ are identically distributed)

Weight Initialization

If we want the variance of s to be the same as that of its input x, then (from the derivation above) the variance of w should be 1/n.

Since Var(aX) = a² Var(X), this implies that we should draw w from a unit Gaussian and then scale it by 1/sqrt(n) to achieve Var(w) = 1/n.
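
In code, this is the classic calibration (a sketch; n is the layer's fan-in and H the number of units):

import numpy as np

n, H = 784, 100                          # hypothetical fan-in and number of units
W = np.random.randn(n, H) / np.sqrt(n)   # unit Gaussian scaled by 1/sqrt(n)  ->  Var(w) = 1/n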

Weight Initialization: Statistics

A: All zero, no learning

Weight Initialization: Statistics

A: Local gradients all zero, no learning

Weight Initialization: “Xavier” Initialization
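
One common form of "Xavier" initialization scales a unit Gaussian by 1/sqrt(Din) (other variants also use the fan-out); a minimal sketch with hypothetical layer sizes:

import numpy as np

Din, Dout = 512, 256                            # hypothetical fan-in and fan-out
W = np.random.randn(Din, Dout) / np.sqrt(Din)   # "Xavier": standard deviation 1/sqrt(Din)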

Weight Initialization: What about ReLU?

Activations collapse to zero again, no learning
(Xavier initialization assumes zero-centered activations; ReLU sets roughly half of them to zero, so the output variance shrinks at every layer)

Weight Initialization: Kaiming / MSRA Initialization

Current recommendation for use in practice, specifically with ReLU neurons
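
A minimal sketch of this ReLU-aware variant, which compensates for ReLU zeroing out half of the activations by using a standard deviation of sqrt(2/Din) (Din, Dout are hypothetical layer sizes):

import numpy as np

Din, Dout = 512, 256
W = np.random.randn(Din, Dout) * np.sqrt(2.0 / Din)   # Kaiming / MSRA: std = sqrt(2 / fan-in)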

Proper initialization

An active research area…

Sparse Weight Initialization

▪ Another way to address the uncalibrated variances problem is to set all weight matrices to zero, but to break symmetry every neuron is randomly connected (with weights sampled from a small Gaussian as above) to a fixed number of neurons below it
▪ A typical number of neurons to connect to may be as small as 10
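
A sketch of this scheme (the layer sizes, the number of incoming connections, and the Gaussian scale are illustrative assumptions):

import numpy as np

n_in, n_out, n_connections = 1000, 500, 10   # hypothetical layer sizes; ~10 incoming connections per unit

W = np.zeros((n_in, n_out))                  # start from an all-zero weight matrix
for j in range(n_out):
    idx = np.random.choice(n_in, size=n_connections, replace=False)   # pick a few random inputs
    W[idx, j] = 0.01 * np.random.randn(n_connections)                 # small Gaussian weights break symmetry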

Bias Initialization

▪ It is possible and common to initialize the biases to be zero, since the asymmetry breaking is provided by the small random numbers in the weights

▪ For ReLU non-linearities, some people like to use a small constant value such as 0.01 for all biases, because this ensures that all ReLU units fire in the beginning and therefore obtain and propagate some gradient

▪ However, it is not clear if this provides a consistent improvement (in fact some results seem to indicate that this performs worse) and it is more common to simply use 0 bias initialization

Acknowledgements

Various contents in this presentation have been taken from different books, lecture notes (particularly CS231n Stanford, deeplearning.ai & neuralnetworksanddeeplearning.com), and the web. These solely belong to their owners and are used here only for clarifying various educational concepts. Any copyright infringement is not intended.
