Deep Learning
1. Why do we need a non-linear activation function in a neural network?
Solution: otherwise we would have a composition of linear functions, which is itself a linear
function, giving a linear model. A stack of linear layers collapses to a single linear map (with the
effective parameters of just one layer), so no matter how deep the network is it can only represent
linear relationships and is therefore limited in the complexity it can model.
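The collapse of stacked linear layers can be checked numerically; a quick numpy sketch (the shapes here are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation between them: y = W2 @ (W1 @ x)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

deep_linear = W2 @ (W1 @ x)
# The composition collapses into a single linear map W = W2 @ W1
single_linear = (W2 @ W1) @ x

print(np.allclose(deep_linear, single_linear))  # True
```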
2. Describe two ways of dealing with the vanishing gradient problem in a neural network.
Solution: Two common remedies are: (a) use a non-saturating activation such as ReLU instead of
sigmoid/tanh, so gradients are not repeatedly multiplied by near-zero derivatives; (b) add residual
(skip) connections, which give the gradient a short path back to early layers. Batch normalization,
careful weight initialization, and (for RNNs) gated architectures such as LSTM/GRU also help.
3. What are some advantages of using a CNN (convolutional neural network) rather than a DNN
(dense neural network) in an image classification task?
Solution: while both models can capture the relationship between close pixels, CNNs have the
following properties:
Translation invariance — the exact location of a pattern in the image is irrelevant to the filter.
Less overfitting — thanks to weight sharing, the typical number of parameters in a CNN is
much smaller than that of a DNN.
Interpretability — we can look at the filters’ weights and visualize what the network
“learned”.
Hierarchical nature — the network learns to describe complex patterns in terms of simpler
ones.
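The overfitting point can be made concrete by counting parameters (the layer shapes below are illustrative, not from the text):

```python
# Parameters of one conv layer: 32 filters of size 3x3 over a 3-channel
# image. The weights are shared across all spatial positions.
conv_params = 32 * (3 * 3 * 3 + 1)  # filters * (kh * kw * in_channels + bias)

# Parameters of one dense layer mapping a flattened 224x224x3 image to
# only 100 hidden units: every input pixel gets its own weight per unit.
dense_params = (224 * 224 * 3) * 100 + 100

print(conv_params)   # 896
print(dense_params)  # 15052900
```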
4. Describe two methods for visualizing what parts of the input an image classifier relies on.
Solution:
Input occlusion — cover a part of the input image and see which parts affect the
classification the most. For instance, given a trained image classification model, feed it
several copies of an image, each with a different region occluded. If the version with the 3rd
region occluded is still classified as a dog with 98% probability, while the version with the
2nd region occluded reaches only 65% probability, it means that the part covered in the 2nd
image is more important to the classification.
Activation maximization — the idea is to create an artificial input image that maximizes
the target response, typically by gradient ascent on the input pixels.
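Input occlusion is simple enough to sketch end to end. The `predict` function below is a stand-in for a trained classifier (ours, not from the text), which pretends the evidence lives in the top-left quadrant:

```python
import numpy as np

def predict(img):
    # Stand-in for a trained model returning the "dog" probability;
    # here the evidence is placed in the top-left 16x16 quadrant.
    return img[:16, :16].mean()

img = np.ones((32, 32))
baseline = predict(img)

# Slide a 16x16 occluding patch over the image; a large drop in the
# predicted probability marks an important region.
drops = {}
for i in (0, 16):
    for j in (0, 16):
        occluded = img.copy()
        occluded[i:i + 16, j:j + 16] = 0.0
        drops[(i, j)] = baseline - predict(occluded)

print(max(drops, key=drops.get))  # (0, 0): the most important patch
```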
5. Is trying the following learning rates: 0.1, 0.2, …, 0.5 a good strategy to optimize the learning
rate?
Solution: No. A linear grid like this covers less than one order of magnitude; it is recommended
to search on a logarithmic scale instead (e.g. 0.0001, 0.001, 0.01, 0.1), since useful learning rates
can differ by several orders of magnitude.
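A common way to implement a logarithmic search is to sample the exponent uniformly; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(42)

# Sample learning rates log-uniformly between 1e-4 and 1e0:
# uniform in the exponent, not in the value itself.
exponents = rng.uniform(-4, 0, size=5)
learning_rates = 10.0 ** exponents
print(learning_rates)
```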
6. Suppose you have a NN with 3 layers and ReLU activations. What will happen if we initialize
all the weights with the same value? What if we only had 1 layer (i.e. linear/logistic regression)?
Solution: If we initialize all the weights to the same value we will not be able to break the
symmetry: all hidden units compute the same output, all their gradients are updated identically,
and the network cannot learn distinct features. In the 1-layer scenario, however, the cost function
is convex (linear/logistic regression), so the weights will converge to the optimal point regardless
of the initial value (convergence may just be slower).
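The symmetry problem can be seen directly: with identical weights, every hidden unit computes the same activation and receives the same gradient. A toy numpy check (shapes are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0])          # one input sample
W1 = np.full((3, 2), 0.5)         # hidden layer: all weights identical
w2 = np.full(3, 0.5)              # output layer: all weights identical

h = np.maximum(0.0, W1 @ x)       # ReLU hidden activations (all equal)
y = w2 @ h

# Gradient of y w.r.t. each row of W1 (chain rule; ReLU derivative is 1
# wherever h > 0): dy/dW1[j, k] = w2[j] * 1[h[j] > 0] * x[k]
grad_W1 = np.outer(w2 * (h > 0), x)

# Every hidden unit gets exactly the same gradient, so after any number
# of updates the rows of W1 remain identical: symmetry is never broken.
print(grad_W1)
```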
Solution: Adam, or adaptive moment estimation, combines two ideas to improve convergence:
per-parameter adaptive learning rates, which give faster convergence, and momentum, which
helps to avoid getting stuck in saddle points.
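The update rule is short enough to sketch. The snippet below uses the common default hyperparameters and a toy quadratic objective (the function name `adam_minimize` is ours, for illustration only):

```python
import numpy as np

def adam_minimize(grad, w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    """Minimize a function given its gradient, using the Adam update rule."""
    m = np.zeros_like(w)  # first moment: momentum
    v = np.zeros_like(w)  # second moment: per-parameter scaling
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)        # bias correction for the
        v_hat = v / (1 - beta2**t)        # zero-initialized moments
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Minimize C(w) = ||w||^2, whose gradient is 2w; the minimum is at 0.
w_final = adam_minimize(lambda w: 2 * w, np.array([5.0, -3.0]))
print(w_final)  # close to the minimum at [0, 0]
```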
Solution: GANs, or generative adversarial networks, consist of two networks (D, G), where D is
the “discriminator” network and G is the “generator” network. The goal is to generate data
(images, for instance) that are indistinguishable from real data. Suppose we want to generate
realistic images of cats. The network G generates images; the network D classifies images
according to whether they are real cats or generated ones. The cost function of G is constructed
so that it tries to “fool” D, i.e. to make D classify G’s outputs as real cats.
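This adversarial game is usually written as a minimax objective, with D trained to maximize it and G to minimize it:

```latex
\min_G \max_D \;
\mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log D(x)\right]
+ \mathbb{E}_{z \sim p_z}\!\left[\log\!\big(1 - D(G(z))\big)\right]
```

Here $x$ is a real sample, $z$ is random noise fed to the generator, and $D(\cdot)$ outputs the probability that its input is real.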
Solution: Batchnorm accelerates the training process. It also (as a byproduct of including some
noise) has a regularizing effect.
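What a batch-norm layer computes at training time can be sketched in a few lines (the learnable scale/shift are left at their defaults here, and the running statistics used at test time are omitted):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta

# A batch of 3 samples with 2 features on very different scales.
batch = np.array([[1.0, 50.0],
                  [3.0, 70.0],
                  [5.0, 90.0]])
out = batch_norm(batch)

print(out.mean(axis=0))  # ~[0, 0]: each feature is centered
print(out.std(axis=0))   # ~[1, 1]: and rescaled to unit variance
```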
Solution: Multi-task learning is useful when we have a small amount of data for some task and
would benefit from training a model on a large dataset for a related task. Parameters of the
models are shared, either in a “hard” way (i.e. the same parameters) or a “soft” way (i.e. a
regularization/penalty added to the cost function).
Solution: End-to-end learning usually means a model that takes the raw data as input and directly
outputs the desired outcome, with no intermediate tasks or feature engineering. It has several
advantages, among them: there is no need to handcraft features, and it generally leads to lower
bias.
14. What happens if we use a ReLU activation and then a sigmoid as the final layer?
Solution: Since ReLU always outputs a non-negative value, the input to the final sigmoid is
never negative, so its output is always at least 0.5. With the usual 0.5 decision threshold, the
network will constantly predict one class for all the inputs!
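This is easy to verify numerically: sigmoid of a non-negative number is never below 0.5, so every input gets the same label under a 0.5 threshold. A toy check:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Random pre-activations, both positive and negative.
z = np.random.default_rng(1).normal(size=1000)
probs = sigmoid(relu(z))

# Negative z is clipped to 0 by ReLU, and sigmoid(0) = 0.5, so the
# predicted probability never drops below 0.5.
print(probs.min())
```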
16. Is it necessary to shuffle the training data when using batch gradient descent?
Solution: No. In batch gradient descent the gradient is calculated at each iteration using the entire
training set, so the order of the samples makes no difference.
17. When using mini batch gradient descent, why is it important to shuffle the data?
Solution: otherwise, suppose we train a NN classifier with two classes, A and B, and that all
samples of class A come before those of class B. Without shuffling, each mini-batch contains
samples from only one class, so the weights are first pulled toward fitting A and then overwritten
to fit B; training may oscillate or converge to a poor solution.
18. When fine-tuning a pre-trained network (transfer learning), what choices do we have to
make?
Solution: How many layers of the pre-trained network to keep, how many new layers to add, and
how many layers to freeze during training.
19. Is dropout applied at test time?
Solution: No! Only during training. Dropout is a regularization technique that is applied in the
training process; at test time all units are kept.
20. Why does dropout work?
Solution: There are several (related) explanations of why dropout works. It can be seen as a form
of model averaging: at each step we “turn off” a part of the model and effectively average over
the resulting sub-networks. It also adds noise, which naturally has a regularizing effect. Finally, it
leads to more sparsity of the weights and prevents co-adaptation of neurons in the network.
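A sketch of (inverted) dropout at training time; rescaling by `keep_prob` keeps the expected activation unchanged, which is why nothing special is needed at test time:

```python
import numpy as np

def dropout(a, keep_prob=0.8, rng=np.random.default_rng(0)):
    """Randomly zero activations and rescale the survivors (training only)."""
    mask = rng.random(a.shape) < keep_prob  # keep each unit with prob keep_prob
    return a * mask / keep_prob             # rescale to preserve the expectation

a = np.ones(10000)
dropped = dropout(a)

print((dropped == 0).mean())  # ~0.2 of the units are "turned off"
print(dropped.mean())         # ~1.0: the expected activation is preserved
```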
21. Give some examples of tasks where the input is a sequence and the output is a single value
(many-to-one).
Solution: A few examples are sentiment analysis and gender recognition from speech.
22. When can’t we use BiLSTM? Explain what assumption has to be made.
Solution: in any bi-directional model, we assume that we have access to the later elements of the
sequence at a given time step. This is the case for complete text data (e.g. sentiment analysis,
translation), but not for streaming time-series data, where future values are not yet available at
prediction time.
23. True/false: adding L2 regularization to a RNN can help with the vanishing gradient problem.
Solution: false! Adding L2 regularization will shrink the weights towards zero, which can
actually make the vanishing gradients worse in some cases.
24. Suppose the training error/cost is high and that the validation cost/error is almost equal to it.
What does it mean? What should be done?
Solution: this indicates underfitting. One can add more parameters, increase the complexity of
the model, or lower the regularization.
25. Show that adding L2 regularization to the cost function is equivalent to weight decay in the
gradient descent update.
Solution: Suppose our cost function is C(w), and that we add a penalization term c‖w‖². When
using gradient descent, the iterations will look like