03-Lecture Notes-Mid
Learning Objectives
A neural network is a powerful learning algorithm inspired by how the brain works. Here is a definition
from MathWorks:
image source: mathworks
A neural network (also called an artificial neural network) is an adaptive system that learns
by using interconnected nodes or neurons in a layered structure that resembles a human
brain. A neural network can learn from data—so it can be trained to recognize patterns,
classify data, and forecast future events.
A neural network breaks down the input into layers of abstraction. It can be trained using
many examples to recognize patterns in speech or images, for example, just as the human
brain does. Its behavior is defined by the way its individual elements are connected and
by the strength, or weights, of those connections. These weights are automatically
adjusted during training according to a specified learning rule until the artificial neural
network performs the desired task correctly.
A neural network combines several processing layers, using simple elements operating in
parallel and inspired by biological nervous systems. It consists of an input layer, one or
more hidden layers, and an output layer. In each layer there are several nodes, or neurons,
with each layer using the output of the previous layer as its input, so neurons interconnect
the different layers. Each neuron typically has weights that are adjusted during the
learning process, and as the weight decreases or increases, it changes the strength of the
signal of that neuron.
Supervised learning with neural networks
In supervised learning, we are given a data set and already know what our correct output
should look like, having the idea that there is a relationship between the input and the
output.
Structured data refers to things that have a well-defined meaning, such as price or age.
Unstructured data refers to things like pixels, raw audio, and text.
Deep learning is taking off due to the large amount of data available through the digitization
of society, faster computation, and innovations in the development of neural network
algorithms.
Two things have to be considered to reach a high level of performance: being able to train a big enough neural network, and having a large amount of labeled data.
Learning Objectives
Binary Classification
Topic B focuses on the basics of neural network programming, especially some important
techniques, such as how to deal with m training examples in the computation and how to
implement forward and backward propagation. To illustrate this process step by step,
Andrew Ng takes a lot of time explaining how logistic regression is implemented for a
binary classification task, here a Cat vs. Non-Cat classification, which takes an image as
input and outputs a label indicating whether this image is a cat (label 1) or not
(label 0).
An image is stored in the computer in three separate matrices corresponding to the Red,
Green, and Blue color channels of the image. The three matrices have the same size as
the image; for example, if the resolution of the cat image is 64 pixels x 64 pixels, the three
matrices (RGB) are 64 by 64 each. To create a feature vector x, the pixel intensity values
are "unrolled" or "reshaped" for each color into a single column. The dimension of the input feature vector x is therefore n_x = 64 x 64 x 3 = 12288.
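As a minimal sketch of this unrolling step (the image array below is a random placeholder rather than the actual cat picture):

import numpy as np

# a placeholder 64 x 64 RGB image; in practice this would be loaded from a file
image = np.random.randint(0, 256, size=(64, 64, 3))

# unroll the three colour channels into a single column feature vector x
x = image.reshape(64 * 64 * 3, 1)
print(x.shape)   # (12288, 1)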
Logistic Regression
Logistic regression is useful for situations in which you want to be able to predict the
presence or absence of a characteristic or outcome based on values of a set of predictor
variables. It is similar to a linear regression model but is suited to models where the
dependent variable is dichotomous. Logistic regression coefficients can be used to
estimate odds ratios for each of the independent variables in the model. Logistic
regression is applicable to a broader range of research situations than discriminant
analysis. (from ibm knowledge center)
A detailed guide on Logistic Regression for Machine Learning by Jason Brownlee is the
best summary of this topic for data science engineers.
Andrew Ng's course on Logistic Regression here focuses more on LR as the simplest
neural network, as its programming implementation is a good starting point for the deep
neural networks that will be covered later.
In logistic regression, we want to train the parameters w and b; to do so, we need to define a cost
function.
The loss function measures the discrepancy between the prediction (𝑦̂(𝑖)) and the desired
output (𝑦(𝑖)). In other words, the loss function computes the error for a single training
example.
The cost function is the average of the loss function over the entire training set. We are
going to find the parameters w and b that minimize the overall cost function.
The loss function measures how well the model is doing on the single training example,
whereas the cost function measures how well the parameters w and b are doing on the
entire training set.
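Written out, with $\hat{y}^{(i)} = \sigma(w^T x^{(i)} + b)$, these are the standard cross-entropy loss and its average used in the course:

$$\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = -\big(y^{(i)}\log\hat{y}^{(i)} + (1 - y^{(i)})\log(1 - \hat{y}^{(i)})\big)$$
$$J(w, b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)})$$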
Gradient Descent
As you go through any course on machine learning or deep learning, gradient descent is
the concept that comes up most often. It is used when training models, can be combined
with almost every algorithm, and is easy to understand and implement.
The goal of training the model is to minimize the cost function, usually starting with randomly
initialized parameters and using a gradient descent method with the following main
steps. Random parameter initialization is not strictly necessary in logistic regression
(zero initialization is fine), but it is necessary in multilayer neural networks.
1. Compute the cost and the gradient for the given training set of (x, y) with the
current parameters w and b.
2. Update the parameters w and b with a pre-set learning rate:
w_new = w_old - learning_rate * dJ/dw(w_old), and likewise for b, as shown in the sketch below.
3. Repeat these steps until you reach the minimum of the cost function.
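As a self-contained toy illustration of the update rule (using a simple one-dimensional cost J(w) = (w - 3)^2 rather than the logistic regression cost):

def cost(w):              # toy cost J(w) = (w - 3)^2, just to show the update rule
    return (w - 3.0) ** 2

def gradient(w):          # dJ/dw = 2 * (w - 3)
    return 2.0 * (w - 3.0)

w = 0.0                   # initial parameter
learning_rate = 0.1
for _ in range(100):
    w = w - learning_rate * gradient(w)   # w_new = w_old - lr * dJ/dw(w_old)

print(w)                  # converges towards 3.0, the minimizer of the cost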
Derivatives
Derivatives are crucial in backpropagation during neural network training, which uses the
concept of computational graphs and the chain rule of derivatives to make the
computation of thousands of parameters in neural networks more efficient.
Computation Graph
Computational graphs are a nice way to think about mathematical expressions. For
example, consider the expression e=(a+b)∗(b+1). There are three operations: two
additions and one multiplication. To help us talk about this, let’s introduce two
intermediary variables, c and d so that every function’s output has a variable. We now
have:
c=a+b
d=b+1
e=c∗d
To create a computational graph, we make each of these operations, along with the input
variables, into nodes. When one node’s value is the input to another node, an arrow goes
from one to another.
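A tiny numeric sketch of this graph, with the chain rule run backward through it (the values a = 2, b = 1 are arbitrary):

# forward pass through the computation graph e = (a + b) * (b + 1)
a, b = 2.0, 1.0
c = a + b          # node c
d = b + 1          # node d
e = c * d          # node e

# backward pass: apply the chain rule node by node
de_dc = d                        # de/dc = d
de_dd = c                        # de/dd = c
de_da = de_dc * 1                # de/da = de/dc * dc/da
de_db = de_dc * 1 + de_dd * 1    # b feeds into both c and d, so the two paths add

print(e, de_da, de_db)   # 6.0 2.0 5.0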
Andrew does the logistic regression gradient descent computation using the computation
graph in order to get us familiar with computation graph ideas for neural networks.
Since the cost function is computed as an average of the m individual loss values, the gradient
with respect to each parameter should also be calculated as the mean of the m gradient
values over the examples.
The calculation process can be done in a loop through the m examples.
J = 0
dw = np.zeros(n)    # n = number of input features
db = 0
for i in range(m):
    z[i] = np.dot(w.T, x[i]) + b
    a[i] = sigmoid(z[i])    # sigmoid(z) = 1 / (1 + exp(-z))
    J += -(y[i] * np.log(a[i]) + (1 - y[i]) * np.log(1 - a[i]))
    dz[i] = a[i] - y[i]
    dw = dw + x[i] * dz[i]  # accumulate the gradient contribution of example i
    db = db + dz[i]
J = J / m
dw = dw / m
db = db / m
After gradient computation, we can update parameters with a learning rate alpha.
# vectorization should also be applied here
for j in range(n):
    w[j] = w[j] - alpha * dw[j]
b = b - alpha * db
As you can see above, to update the parameters by one step, we have to go through all
the m examples. This will be mentioned again in later videos. Stay tuned!
Derivation of dL/dz
You may be wondering why dz = a - y in the above code is calculated this way and where it
comes from. A detailed derivation of dL/dz can be found on the course discussion forum.
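In short, with $a = \sigma(z)$ and $\mathcal{L} = -(y\log a + (1 - y)\log(1 - a))$, the chain rule gives:

$$\frac{\partial\mathcal{L}}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a}, \qquad \frac{\partial a}{\partial z} = a(1-a)$$
$$\frac{\partial\mathcal{L}}{\partial z} = \frac{\partial\mathcal{L}}{\partial a}\cdot\frac{\partial a}{\partial z} = -y(1-a) + (1-y)a = a - y$$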
Vectorization
Both GPUs and CPUs have parallelization instructions. They're sometimes called SIMD
instructions, which stands for single instruction, multiple data. The rule of thumb to
remember is: whenever possible, avoid using explicit for loops.
If we stack all m examples of x, we get an input matrix X with each column representing
one example. Using the built-in vectorization of numpy, we can simplify the above gradient
descent calculation to a few lines of code, which significantly boosts computational
efficiency.
Z = np.dot(w.T, X) + b
A = sigmoid(Z)
dz = A - Y
# in contrast to the inner loop above, vectorization is used here to boost computation
dw = 1/m * np.dot(X, dz.T)
db = 1/m * np.sum(dz)
Update parameters:
w = w - alpha * dw
b = b - alpha * db
Broadcasting in Python
The term broadcasting describes how numpy treats arrays with different shapes during
arithmetic operations. Subject to certain constraints, the smaller array is "broadcast"
across the larger array so that they have compatible shapes. Broadcasting provides a
means of vectorizing array operations so that looping occurs in C instead of Python. More
detailed examples on numpy.org.
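A small sketch of broadcasting in action (the arrays are arbitrary):

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])      # shape (2, 3)
b = np.array([[10.0, 20.0, 30.0]])   # shape (1, 3)

# b is "broadcast" (stretched) across the rows of A so the shapes become compatible
print(A + b)
# [[11. 22. 33.]
#  [14. 25. 36.]]

# the same mechanism is what makes "np.dot(w.T, X) + b" work with a scalar b above
print(A + 100.0)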
A note on python/numpy vectors
a = np.random.randn(5, 1)   # column vector, shape (5, 1)
a = np.random.randn(1, 5)   # row vector, shape (1, 5)
a = a.reshape(5, 1)         # reshape the row vector into a column vector
a.shape
# (5, 1)
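A related pitfall worth remembering here is the "rank-1" array, whose shape is (n,) rather than (n, 1); a quick sketch of why it is confusing:

import numpy as np

a = np.random.randn(5)        # rank-1 array, shape (5,): neither a row nor a column vector
print(a.shape)                # (5,)
print(a.T.shape)              # (5,)  -- transposing a rank-1 array changes nothing

col = np.random.randn(5, 1)   # explicit column vector
print(col.T.shape)            # (1, 5)
assert col.shape == (5, 1)    # cheap sanity check on shapes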
The Jupyter Notebook is an open-source web application that allows you to create and
share documents that contain live code, equations, visualizations and narrative text. Uses
include: data cleaning and transformation, numerical simulation, statistical modeling, data
visualization, machine learning, and much more.
See jupyter.org
pip install jupyterlab
jupyter-lab
To summarize: by minimizing the cost function J(w, b) we are really carrying out
maximum likelihood estimation with the logistic regression model, because minimizing
the loss corresponds to maximizing the log of the probability of the observed labels.
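Concretely, since $p(y \mid x) = \hat{y}^{\,y}(1 - \hat{y})^{1-y}$ for a single example, taking logs gives

$$\log p(y \mid x) = y\log\hat{y} + (1 - y)\log(1 - \hat{y}) = -\mathcal{L}(\hat{y}, y)$$

and, assuming the m training examples are drawn i.i.d., maximizing the log-likelihood $\sum_i \log p(y^{(i)} \mid x^{(i)})$ is the same as minimizing $\sum_i \mathcal{L}(\hat{y}^{(i)}, y^{(i)})$, i.e. minimizing J(w, b) up to the 1/m scaling.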
C: Shallow Neural Networks
Learning Objectives
A neural network consists of three types of layers: the input layer, hidden layer(s) and the output
layer. The input layer is not counted in the number of layers of a neural network. When we
talk about training a neural network, we are basically training the parameters associated with
the hidden layers and the output layer.
Input layer: input features (x1, x2, x3, ...) stacked up vertically
Hidden layer(s): values for these nodes are not observed in the training set
Output layer: responsible for generating the predicted value
In the above example, z[1] is the result of linear computation of the input values and the
parameters of the hidden layer and a[1] is the activation as a sigmoid function of z[1].
Generally, in a two-layer neural network, if we have nx features of input x and n1 neurons
of hidden layer and one output value, we have the following dimensions of each variable.
Specifically, we have nx=3, n1=4 in the above network.
variable     shape      description
W[1]         (n1, nx)   weight matrix of the first layer, i.e., the hidden layer
b[1]         (n1, 1)    bias vector of the first layer
z[1], a[1]   (n1, 1)    linear output and activation of the hidden layer (for one example)
W[2]         (1, n1)    weight matrix of the second layer, i.e., the output layer here
b[2]         (1, 1)     bias of the second layer
z[2], a[2]   (1, 1)     linear output and activation of the output layer (for one example)
We should compute z[1], a[1], z[2], a[2] for each example i of m examples:
for i in range(m):
    z[1][i] = np.dot(W[1], x[i]) + b[1]
    a[1][i] = sigmoid(z[1][i])
    z[2][i] = np.dot(W[2], a[1][i]) + b[2]
    a[2][i] = sigmoid(z[2][i])
Just as we are already familiar with vectorization and broadcasting from logistic
regression, we can apply the same method to neural network training. Inevitably,
we have to go through the m examples of input values in the computation, and
stacking them together as columns of matrices is a good idea. So we have the following
vectorized variables, with only small differences from before; a forward-pass sketch follows the table.
variable     shape      description
W[1]         (n1, nx)   weight matrix of the first layer (unchanged)
b[1]         (n1, 1)    bias of the first layer, broadcast across the m columns
Z[1], A[1]   (n1, m)    linear outputs and activations of the hidden layer, one column per example
W[2]         (1, n1)    weight matrix of the second layer, i.e., the output layer here
b[2]         (1, 1)     bias of the second layer, broadcast across the m columns
Z[2], A[2]   (1, m)     linear outputs and activations of the output layer, one column per example
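A minimal sketch of the vectorized forward pass with these shapes (the sizes nx = 3, n1 = 4, m = 5 and the tanh hidden activation are illustrative choices):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

nx, n1, m = 3, 4, 5
X = np.random.randn(nx, m)            # each column is one training example

W1 = np.random.randn(n1, nx) * 0.01   # (n1, nx)
b1 = np.zeros((n1, 1))                # (n1, 1), broadcast across the m columns
W2 = np.random.randn(1, n1) * 0.01    # (1, n1)
b2 = np.zeros((1, 1))

Z1 = np.dot(W1, X) + b1               # (n1, m)
A1 = np.tanh(Z1)                      # hidden-layer activation, here tanh
Z2 = np.dot(W2, A1) + b2              # (1, m)
A2 = sigmoid(Z2)                      # (1, m), one prediction per column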
Activation functions
So far, we know that a non-linear activation function is applied at the output step of each layer.
We have used the sigmoid so far, but there are several other common activation functions.
If we only allow linear activation functions in a neural network, the output will just be a
linear transformation of the input, which is not enough to form a universal function
approximator. Such a network can just be represented as a matrix multiplication, and you
would not be able to obtain very interesting behaviors from such a network.
activation   derivative g'(z), with a = g(z)
sigmoid      a(1 - a)
tanh         1 - a^2
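As a sketch, these derivatives in numpy (ReLU, another common choice covered in the course, is included for completeness):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    a = sigmoid(z)
    return a * (1.0 - a)           # g'(z) = a(1 - a)

def tanh_derivative(z):
    a = np.tanh(z)
    return 1.0 - a ** 2            # g'(z) = 1 - a^2

def relu(z):
    return np.maximum(0.0, z)

def relu_derivative(z):
    return (z > 0).astype(float)   # 1 where z > 0, 0 elsewhere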
Again we have a single hidden layer in our neural network; this section focuses on the
equations we need to implement in order to get backpropagation, i.e., gradient
descent, working. Suppose we have nx input features, n1 hidden units and n2 output units
in our examples. In the previous vectorization section we had n2 equal to one. Here we use
a more general representation in order to give ourselves a smoother transition to
the next topic of the course.
Variables:
variable     shape      description
W[1]         (n1, nx)   weight matrix of the first layer, i.e., the hidden layer
b[1]         (n1, 1)    bias of the first layer
Z[1], A[1]   (n1, m)    linear outputs and activations of the hidden layer
W[2]         (n2, n1)   weight matrix of the second layer, i.e., the output layer here
b[2]         (n2, 1)    bias of the second layer
Z[2], A[2]   (n2, m)    linear outputs and activations of the output layer
# backpropagation
dZ[2] = A[2] - Y    # combines the derivative of the cost function with g'[2]
dW[2] = 1/m * np.matmul(dZ[2], A[1].T)
db[2] = 1/m * np.sum(dZ[2], axis=1, keepdims=True)
# g1_prime(Z[1]) stands for the derivative of the hidden layer's activation, e.g. 1 - A[1]**2 for tanh
dZ[1] = np.multiply(np.matmul(W[2].T, dZ[2]), g1_prime(Z[1]))
dW[1] = 1/m * np.matmul(dZ[1], X.T)
db[1] = 1/m * np.sum(dZ[1], axis=1, keepdims=True)
# update parameters
W[1] = W[1] - learning_rate * dW[1]
b[1] = b[1] - learning_rate * db[1]
W[2] = W[2] - learning_rate * dW[2]
b[2] = b[2] - learning_rate * db[2]
Repeat forward propagation and backpropagation a lot of times until the parameters look
like they're converging.
Random initialization
Initialization of parameters: the weight matrices W[1], W[2] should be initialized to small random values, while the bias vectors b[1], b[2] can be initialized to zeros.
Imagine that you initialize all weights to the same value (e.g. zero or one). In this case,
each hidden unit will get exactly the same signal. E.g. if all weights are initialized to 1, each
unit gets a signal equal to the sum of the inputs (and outputs sigmoid(sum(inputs))). If all weights
are zeros, which is even worse, every hidden unit will get a zero signal. No matter what the
input was, if all weights are the same, all units in the hidden layer will be the same too.
The reason for using small values applies to the sigmoid or tanh activation functions. If the weight
parameters are initially large, we are more likely to get large values of z computed by z = wx + b.
If we check this on the graph of the sigmoid (or tanh) function, we can see that the slope for large z
is very close to zero, which would slow down the learning process since the parameters are updated
by only a very small amount each time.
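A minimal initialization sketch following these two observations, for the two-layer network above (small random weights scaled by 0.01 as in the course, zero biases):

import numpy as np

nx, n1 = 3, 4                          # illustrative layer sizes

W1 = np.random.randn(n1, nx) * 0.01    # small random values break the symmetry
b1 = np.zeros((n1, 1))                 # biases can safely start at zero
W2 = np.random.randn(1, n1) * 0.01
b2 = np.zeros((1, 1))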
Learning Objectives
Technically logistic regression is a 1-layer neural network. Deep neural networks, with
more layers, can learn functions that shallower models are often unable to.
Here L denotes the number of layers in a deep neural network. Some notations:
notation     description
L            number of layers (hidden layers plus the output layer)
n[l]         number of units in layer l, with n[0] = nx, the number of input features
a[l]         activations of layer l, with a[0] = x and a[L] = ŷ
W[l], b[l]   parameters used to compute z[l] in layer l
For one layer l, the parameter dimensions are:
matrix    dimension
W[l]      (n[l], n[l-1])
b[l]      (n[l], 1)
dW[l]     (n[l], n[l-1])
db[l]     (n[l], 1)
After vectorizing over the m examples, the activation dimensions are:
matrix    dimension
Z[l]      (n[l], m)
A[l]      (n[l], m)
dZ[l]     (n[l], m)
dA[l]     (n[l], m)
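A quick sketch that initializes an L-layer network's parameters with exactly these shapes (the layer sizes in the example call are illustrative only):

import numpy as np

def initialize_parameters(layer_dims):
    # layer_dims = [n[0], n[1], ..., n[L]]
    params = {}
    for l in range(1, len(layer_dims)):
        params["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
        assert params["W" + str(l)].shape == (layer_dims[l], layer_dims[l - 1])
        assert params["b" + str(l)].shape == (layer_dims[l], 1)
    return params

params = initialize_parameters([12288, 20, 7, 5, 1])   # hypothetical layer sizes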
A deep neural network with multiple hidden layers might have its earlier
layers learn low-level simple features and then have the later, deeper layers
put together the simpler things they have detected in order to detect more complex
things, such as specific words or even phrases or sentences.
If there aren't enough hidden layers, we might require exponentially more
hidden units to compute the same function with a shallower network.
Parameters vs Hyperparameters
Parameters: the weights and biases W[l] and b[l] learned during training.
Hyperparameters: values we choose ourselves that control how the parameters are learned, e.g. the learning rate, the number of iterations, the number of hidden layers L, the number of hidden units n[l], and the choice of activation function.
I do think that maybe the field of computer vision has taken a bit more inspiration from
the human brain than other disciplines that also apply deep learning, but I personally use
the analogy to the human brain less than I used to.