Deep Learning
1. Why do we need a non-linear activation function in a neural network?
Solution: otherwise we would have a composition of linear functions, which is itself a linear
function, giving a linear model. A stack of linear layers collapses to a single linear map (with the
effective parameters of just one layer), so no matter how deep the network is it can only represent
linear relationships and is therefore limited in the complexity it can model.
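The collapse of stacked linear layers can be checked numerically; a quick numpy sketch (the shapes here are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation between them: y = W2 @ (W1 @ x)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

deep_linear = W2 @ (W1 @ x)
# The composition collapses into a single linear map W = W2 @ W1
single_linear = (W2 @ W1) @ x

print(np.allclose(deep_linear, single_linear))  # True
```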
2. Describe two ways of dealing with the vanishing gradient problem in a neural network.
Solution: Two common remedies are: (a) use a non-saturating activation such as ReLU instead of
sigmoid/tanh, so gradients are not repeatedly multiplied by near-zero derivatives; (b) add residual
(skip) connections, which give the gradient a short path back to early layers. Batch normalization,
careful weight initialization, and (for RNNs) gated architectures such as LSTM/GRU also help.
3. What are some advantages of using a CNN (convolutional neural network) rather than a DNN
(dense neural network) in an image classification task?
Solution: while both models can capture the relationship between close pixels, CNNs have the
following properties:
Translation invariance — the exact location of a pattern in the image is irrelevant to the filter.
Less overfitting — thanks to weight sharing, the typical number of parameters in a CNN is
much smaller than that of a DNN.
Interpretability — we can look at the filters’ weights and visualize what the network
“learned”.
Hierarchical nature — the network learns to describe complex patterns in terms of simpler
ones.
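The overfitting point can be made concrete by counting parameters (the layer shapes below are illustrative, not from the text):

```python
# Parameters of one conv layer: 32 filters of size 3x3 over a 3-channel
# image. The weights are shared across all spatial positions.
conv_params = 32 * (3 * 3 * 3 + 1)  # filters * (kh * kw * in_channels + bias)

# Parameters of one dense layer mapping a flattened 224x224x3 image to
# only 100 hidden units: every input pixel gets its own weight per unit.
dense_params = (224 * 224 * 3) * 100 + 100

print(conv_params)   # 896
print(dense_params)  # 15052900
```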
4. Describe two methods for visualizing what parts of the input an image classifier relies on.
Solution:
Input occlusion — cover a part of the input image and see which parts affect the
classification the most. For instance, given a trained image classification model, feed it
several copies of an image, each with a different region occluded. If the version with the 3rd
region occluded is still classified as a dog with 98% probability, while the version with the
2nd region occluded reaches only 65% probability, it means that the part covered in the 2nd
image is more important to the classification.
Activation maximization — the idea is to create an artificial input image that maximizes
the target response, typically by gradient ascent on the input pixels.
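Input occlusion is simple enough to sketch end to end. The `predict` function below is a stand-in for a trained classifier (ours, not from the text), which pretends the evidence lives in the top-left quadrant:

```python
import numpy as np

def predict(img):
    # Stand-in for a trained model returning the "dog" probability;
    # here the evidence is placed in the top-left 16x16 quadrant.
    return img[:16, :16].mean()

img = np.ones((32, 32))
baseline = predict(img)

# Slide a 16x16 occluding patch over the image; a large drop in the
# predicted probability marks an important region.
drops = {}
for i in (0, 16):
    for j in (0, 16):
        occluded = img.copy()
        occluded[i:i + 16, j:j + 16] = 0.0
        drops[(i, j)] = baseline - predict(occluded)

print(max(drops, key=drops.get))  # (0, 0): the most important patch
```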
5. Is trying the following learning rates: 0.1, 0.2, …, 0.5 a good strategy to optimize the learning
rate?
Solution: No. A linear grid like this covers less than one order of magnitude; it is recommended
to search on a logarithmic scale instead (e.g. 0.0001, 0.001, 0.01, 0.1), since useful learning rates
can differ by several orders of magnitude.
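A common way to implement a logarithmic search is to sample the exponent uniformly; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(42)

# Sample learning rates log-uniformly between 1e-4 and 1e0:
# uniform in the exponent, not in the value itself.
exponents = rng.uniform(-4, 0, size=5)
learning_rates = 10.0 ** exponents
print(learning_rates)
```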
6. Suppose you have a NN with 3 layers and ReLU activations. What will happen if we initialize
all the weights with the same value? What if we only had 1 layer (i.e. linear/logistic regression)?
Solution: If we initialize all the weights to the same value we will not be able to break the
symmetry: all hidden units compute the same output, all their gradients are updated identically,
and the network cannot learn distinct features. In the 1-layer scenario, however, the cost function
is convex (linear/logistic regression), so the weights will converge to the optimal point regardless
of the initial value (convergence may just be slower).
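The symmetry problem can be seen directly: with identical weights, every hidden unit computes the same activation and receives the same gradient. A toy numpy check (shapes are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0])          # one input sample
W1 = np.full((3, 2), 0.5)         # hidden layer: all weights identical
w2 = np.full(3, 0.5)              # output layer: all weights identical

h = np.maximum(0.0, W1 @ x)       # ReLU hidden activations (all equal)
y = w2 @ h

# Gradient of y w.r.t. each row of W1 (chain rule; ReLU derivative is 1
# wherever h > 0): dy/dW1[j, k] = w2[j] * 1[h[j] > 0] * x[k]
grad_W1 = np.outer(w2 * (h > 0), x)

# Every hidden unit gets exactly the same gradient, so after any number
# of updates the rows of W1 remain identical: symmetry is never broken.
print(grad_W1)
```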
Solution: Adam, or adaptive moment estimation, combines two ideas to improve convergence:
per-parameter adaptive learning rates, which give faster convergence, and momentum, which
helps to avoid getting stuck in saddle points.
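The update rule is short enough to sketch. The snippet below uses the common default hyperparameters and a toy quadratic objective (the function name `adam_minimize` is ours, for illustration only):

```python
import numpy as np

def adam_minimize(grad, w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    """Minimize a function given its gradient, using the Adam update rule."""
    m = np.zeros_like(w)  # first moment: momentum
    v = np.zeros_like(w)  # second moment: per-parameter scaling
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)        # bias correction for the
        v_hat = v / (1 - beta2**t)        # zero-initialized moments
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Minimize C(w) = ||w||^2, whose gradient is 2w; the minimum is at 0.
w_final = adam_minimize(lambda w: 2 * w, np.array([5.0, -3.0]))
print(w_final)  # close to the minimum at [0, 0]
```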
Solution: GANs, or generative adversarial networks, consist of two networks (D, G), where D is
the “discriminator” network and G is the “generator” network. The goal is to generate data
(images, for instance) that are indistinguishable from real data. Suppose we want to generate
realistic images of cats. The network G generates images; the network D classifies images
according to whether they are real cats or generated ones. The cost function of G is constructed
so that it tries to “fool” D, i.e. to make D classify G’s outputs as real cats.
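This adversarial game is usually written as a minimax objective, with D trained to maximize it and G to minimize it:

```latex
\min_G \max_D \;
\mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log D(x)\right]
+ \mathbb{E}_{z \sim p_z}\!\left[\log\!\big(1 - D(G(z))\big)\right]
```

Here $x$ is a real sample, $z$ is random noise fed to the generator, and $D(\cdot)$ outputs the probability that its input is real.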
Solution: Batchnorm accelerates the training process. It also (as a byproduct of including some
noise) has a regularizing effect.
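What a batch-norm layer computes at training time can be sketched in a few lines (the learnable scale/shift are left at their defaults here, and the running statistics used at test time are omitted):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta

# A batch of 3 samples with 2 features on very different scales.
batch = np.array([[1.0, 50.0],
                  [3.0, 70.0],
                  [5.0, 90.0]])
out = batch_norm(batch)

print(out.mean(axis=0))  # ~[0, 0]: each feature is centered
print(out.std(axis=0))   # ~[1, 1]: and rescaled to unit variance
```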
Solution: Multi-task learning is useful when we have a small amount of data for some task and
would benefit from training a model on a large dataset for a related task. Parameters of the
models are shared, either in a “hard” way (i.e. the same parameters) or a “soft” way (i.e. a
regularization/penalty added to the cost function).
Solution: End-to-end learning usually means a model that takes the raw data as input and directly
outputs the desired outcome, with no intermediate tasks or feature engineering. It has several
advantages, among them: there is no need to handcraft features, and it generally leads to lower
bias.
14. What happens if we use a ReLU activation and then a sigmoid as the final layer?
Solution: Since ReLU always outputs a non-negative value, the input to the final sigmoid is
never negative, so its output is always at least 0.5. With the usual 0.5 decision threshold, the
network will constantly predict one class for all the inputs!
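This is easy to verify numerically: sigmoid of a non-negative number is never below 0.5, so every input gets the same label under a 0.5 threshold. A toy check:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Random pre-activations, both positive and negative.
z = np.random.default_rng(1).normal(size=1000)
probs = sigmoid(relu(z))

# Negative z is clipped to 0 by ReLU, and sigmoid(0) = 0.5, so the
# predicted probability never drops below 0.5.
print(probs.min())
```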
16. Is it necessary to shuffle the training data when using batch gradient descent?
Solution: No. In batch gradient descent the gradient is calculated at each iteration using the entire
training set, so the order of the samples makes no difference.
17. When using mini batch gradient descent, why is it important to shuffle the data?
Solution: otherwise, suppose we train a NN classifier with two classes, A and B, and that all
samples of class A come before those of class B. Without shuffling, each mini-batch contains
samples from only one class, so the weights are first pulled toward fitting A and then overwritten
to fit B; training may oscillate or converge to a poor solution.
18. When fine-tuning a pre-trained network (transfer learning), what choices do we have to
make?
Solution: How many layers of the pre-trained network to keep, how many new layers to add, and
how many layers to freeze during training.
19. Is dropout applied at test time?
Solution: No! Only during training. Dropout is a regularization technique that is applied in the
training process; at test time all units are kept.
20. Why does dropout work?
Solution: There are several (related) explanations of why dropout works. It can be seen as a form
of model averaging: at each step we “turn off” a part of the model and effectively average over
the resulting sub-networks. It also adds noise, which naturally has a regularizing effect. Finally, it
leads to more sparsity of the weights and prevents co-adaptation of neurons in the network.
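A sketch of (inverted) dropout at training time; rescaling by `keep_prob` keeps the expected activation unchanged, which is why nothing special is needed at test time:

```python
import numpy as np

def dropout(a, keep_prob=0.8, rng=np.random.default_rng(0)):
    """Randomly zero activations and rescale the survivors (training only)."""
    mask = rng.random(a.shape) < keep_prob  # keep each unit with prob keep_prob
    return a * mask / keep_prob             # rescale to preserve the expectation

a = np.ones(10000)
dropped = dropout(a)

print((dropped == 0).mean())  # ~0.2 of the units are "turned off"
print(dropped.mean())         # ~1.0: the expected activation is preserved
```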
21. Give some examples of tasks where the input is a sequence and the output is a single value
(many-to-one).
Solution: A few examples are sentiment analysis and gender recognition from speech.
22. When can’t we use BiLSTM? Explain what assumption has to be made.
Solution: in any bi-directional model, we assume that we have access to the later elements of the
sequence at a given time step. This is the case for complete text data (e.g. sentiment analysis,
translation), but not for streaming time-series data, where future values are not yet available at
prediction time.
23. True/false: adding L2 regularization to a RNN can help with the vanishing gradient problem.
Solution: false! Adding L2 regularization will shrink the weights towards zero, which can
actually make the vanishing gradients worse in some cases.
24. Suppose the training error/cost is high and that the validation cost/error is almost equal to it.
What does it mean? What should be done?
Solution: this indicates underfitting. One can add more parameters, increase the complexity of
the model, or lower the regularization.
25. Show that adding L2 regularization to the cost function is equivalent to weight decay in the
gradient descent update.
Solution: Suppose our cost function is C(w), and that we add a penalization term c‖w‖². When
using gradient descent, the iterations will look like