A Study of the Optimization Algorithms in Deep Learning
Abstract—Training deep learning models involves learning the parameters so as to meet the objective function. Typically the objective is to minimize the loss incurred during the learning process. In a supervised mode of learning, a model is given the data samples and their respective outcomes. When the model generates an output, it compares it with the desired output, takes the difference between the generated and desired outputs, and then attempts to bring the generated output close to the desired output. This is achieved through optimization algorithms. An optimization algorithm goes through several cycles until convergence to improve the accuracy of the model. Several types of optimization methods have been developed to address the challenges associated with the learning process. Six of these have been taken up for examination in this study to gain insights into their intricacies. The methods investigated are stochastic gradient descent, nesterov momentum, rmsprop, adam, adagrad and adadelta. Four datasets have been selected for the experiments: mnist, fashionmnist, cifar10 and cifar100. The optimal training results obtained are 1.00 for mnist with rmsprop and adam at epoch 200, 1.00 for fashionmnist with rmsprop and adam at epoch 400, 1.00 for cifar10 with rmsprop at epoch 200, and 1.00 for cifar100 with adam at epoch 100. The highest testing results, achieved with adam, are 0.9826, 0.9853, 0.9855 and 0.9842 for mnist, fashionmnist, cifar10 and cifar100 respectively. The analysis of the results shows that the adam optimization algorithm performs better than the others in the testing phase, and rmsprop and adam in the training phase.

Index Terms—Optimization Algorithm, Stochastic Gradient Descent, Momentum, RMSprop, AdamOptimizer, MNIST, FashionMNIST, Cifar10, Cifar100

I. INTRODUCTION

Deep learning has been at the forefront of solving quite a few real world problems. A machine is made to learn from a dataset and is expected to improve its performance over a period of time. When an input is given to the model, a function is applied to it and, through a sequence of layers, it is transformed into an output value. The model then compares the generated output with the actual output and the difference is calculated. In order to reduce this difference, the error is backpropagated through the model. The model adjusts the weights and repeats the same process until convergence. This leads to a quest for an algorithm that accelerates the learning process and generates optimal output. In this direction, several optimization algorithms have been developed and applied to various tasks. This study examines the most widely used optimization algorithms and their impact on the learning process.

The paper is organized as follows: Section II presents related work on optimization algorithms. Section III discusses the various optimization algorithms and the datasets used in this study. The algorithms described are stochastic gradient descent, momentum, rmsprop, adam, adagrad and adadelta. The datasets examined are mnist, fashion mnist, cifar10 and cifar100. Section IV elaborates on the results obtained by training the model on the selected datasets with the chosen optimization algorithms and also focuses on the comparison of the results. Section V concludes the paper and provides future directions.

II. RELATED WORK

Machine learning and deep learning rely on optimization algorithms to learn the parameters of the input data. Hence optimization algorithms play a vital role in the successful implementation of solutions to real world problems. Various studies have been conducted to determine the optimal algorithm for the problem at hand. Because there is no general method that solves all the different kinds of problems, investigations have to be carried out to figure out the method that works best for a given problem. [1] discusses gradient descent and its variants. The authors have reviewed various optimization algorithms for parallel and distributed environments, and the paper also investigates ways to optimize the basic gradient descent. The discussion begins by introducing the three variants of gradient descent: batch gradient descent, stochastic gradient descent and mini-batch gradient descent. These differ in the number of data samples used for the training process at once. The basic gradient descent is the batch method, which computes the gradients over the entire set of training samples. It is well suited to small and medium sized datasets, but it does not scale, owing to the constraint of having the entire dataset in memory, and it does not take into account data elements newly added to the dataset on the fly. Thus the batch method is slow and not appropriate for large datasets. Stochastic gradient descent (SGD) performs the gradient computation for each training sample. It also takes newly arriving data elements into consideration and trains on them, and it is relatively faster. However, it suffers from slow convergence due to the fluctuations that occur when training on individual data elements. Mini-batch gradient descent combines the strengths of the batch and stochastic methods and trains the dataset in mini-batches. This method achieves the advantages of fast convergence and inclusion of online data (the data that arrives on the fly).
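To make the distinction concrete, the sketch below contrasts the three update schemes on a linear least-squares model. The model, the loss_grad helper and all hyperparameters are illustrative assumptions introduced here; they are not taken from [1] or from the experiments in this paper.

```python
import numpy as np

def loss_grad(params, X, y):
    # Gradient of the mean squared error of a linear model (illustrative choice).
    return 2 * X.T @ (X @ params - y) / len(y)

def batch_gd(params, X, y, lr=0.01, epochs=100):
    # Batch gradient descent: one update per epoch, computed over the full dataset.
    for _ in range(epochs):
        params -= lr * loss_grad(params, X, y)
    return params

def sgd(params, X, y, lr=0.01, epochs=100):
    # Stochastic gradient descent: one update per individual training sample.
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):
            params -= lr * loss_grad(params, X[i:i + 1], y[i:i + 1])
    return params

def minibatch_gd(params, X, y, lr=0.01, epochs=100, batch_size=32):
    # Mini-batch gradient descent: one update per small batch of samples.
    for _ in range(epochs):
        idx = np.random.permutation(len(y))
        for start in range(0, len(y), batch_size):
            batch = idx[start:start + batch_size]
            params -= lr * loss_grad(params, X[batch], y[batch])
    return params
```

Batch descent makes one update per pass over the data, SGD makes one per sample, and the mini-batch variant makes one per small group of samples, which is why it converges quickly while still accommodating newly arriving data.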
There are a few challenges that need to be addressed in all the aforementioned types of gradient descent. The major challenge is the selection of the learning rate. Another is to have a variable learning rate, and a further one is to avoid convergence at a suboptimal local minimum. The paper further outlines techniques that overcome these challenges. The techniques described are momentum, nesterov momentum, rmsprop, adam, adagrad, adadelta and nadam. The paper addresses the question of which method is to be employed for a given scenario. Furthermore, it talks about extending the methods to parallel and distributed environments. Improvement of SGD through the use of shuffling, curriculum learning, batch normalization and early stopping is also presented. The paper [2] examines SGD and its impact on over-parameterized networks. It has been observed that SGD performs reasonably well in networks with few parameters, leading to fast convergence and optimal global minima. However, it is worth investigating whether this observation generalizes to large, massively parameterized networks. This is the work carried out by the authors, wherein it is studied and shown through their experiments that SGD indeed yields satisfactory results in over-parameterized networks.

III. OPTIMIZATION ALGORITHMS AND DATASETS

A. Optimization algorithms

Optimization algorithms form the basis on which a machine is able to learn from its experience. They compute gradients and attempt to minimize the loss function. Learning can be implemented in several ways with different kinds of optimization algorithms. The algorithms studied in the present work are described below; a compact sketch of their update rules follows the list.

1) Stochastic Gradient Descent: The vanilla gradient descent trains on the entire dataset at once. Its variant, stochastic gradient descent [3], performs the training on individual data elements.

2) Nesterov Momentum: The gradient is computed at the approximate future positions of the parameters rather than at the current parameters. Nesterov momentum is an improvement over momentum, which does not determine the future position of the parameters. [4] has incorporated nesterov momentum into adam.

3) Adagrad: This is a method that chooses the learning rate based on the situation. Learning rates are adaptive because the actual rate is determined from the parameters: parameters with high gradients will have a reduced learning rate, and parameters with small gradients will have an increased learning rate.

4) Adadelta: It is an extension of adagrad. [5] presents the modification of adagrad into adadelta. Instead of accumulating all past gradients, adadelta makes use of a fixed size window and tracks only the gradients available within that window.

5) RMSProp: RMSProp changes the way adagrad accumulates the gradient. Gradients are accumulated into an exponentially weighted average, so RMSProp discards distant history and maintains only recent gradient information. [6] discusses rmsprop and its variants. The paper also examines adagrad with logarithmic regret bounds.

6) Adam: It derives its name from "adaptive moments". It is a combination of rmsprop and momentum. The update operation considers only the smoothed version of the gradient and also includes a bias correction mechanism. [7] discusses the adam algorithm.
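As a compact summary of the six methods described above, the sketch below spells out a single parameter update for each of them in NumPy. The default hyperparameters (learning rates, decay factors and epsilon values) are commonly used illustrative choices and are not reported in this paper.

```python
import numpy as np

def sgd_step(w, g, lr=0.01):
    # Plain stochastic gradient descent step.
    return w - lr * g

def nesterov_step(w, g_lookahead, v, lr=0.01, mu=0.9):
    # g_lookahead is the gradient evaluated at w + mu * v, the approximate future position.
    v = mu * v - lr * g_lookahead
    return w + v, v

def adagrad_step(w, g, G, lr=0.01, eps=1e-8):
    # G accumulates squared gradients; parameters with large past gradients get smaller steps.
    G = G + g ** 2
    return w - lr * g / (np.sqrt(G) + eps), G

def adadelta_step(w, g, s, d, rho=0.95, eps=1e-6):
    # Keeps exponentially decayed windows of squared gradients (s) and squared updates (d),
    # so only recent information matters and no global learning rate is needed.
    s = rho * s + (1 - rho) * g ** 2
    update = -np.sqrt(d + eps) / np.sqrt(s + eps) * g
    d = rho * d + (1 - rho) * update ** 2
    return w + update, s, d

def rmsprop_step(w, g, s, lr=0.001, rho=0.9, eps=1e-8):
    # s is an exponentially weighted average of squared gradients (recent history only).
    s = rho * s + (1 - rho) * g ** 2
    return w - lr * g / (np.sqrt(s) + eps), s

def adam_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Combines momentum (first moment m) with rmsprop (second moment v) and bias correction.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```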
B. Datasets

1) MNIST: A dataset of handwritten digits. It consists of 60,000 training examples and 10,000 test images. The images are 28x28 pixels in grayscale and contain handwritten digits from 0 through 9, for a total of 10 classes. It is a benchmark dataset for experimenting with various algorithms and techniques. [8] describes the data.

2) FashionMNIST: A dataset of fashion related images. It consists of 60,000 training examples and 10,000 test images. The images are 28x28 pixels in grayscale. The training and test items cover t-shirt, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag and ankle boot. [9] developed the fashion mnist dataset.

3) Cifar10: [10] presents the details of the cifar10 dataset. It consists of 32x32 color images of airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. There are 60,000 images in total, with 6,000 per class, and 10 classes are defined for the ten different types of images.

4) Cifar100: This dataset is just like CIFAR-10, except that it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). Examples of superclasses in CIFAR-100 are aquatic mammals, fish and flowers; examples of classes are beaver, dolphin, otter, seal, whale, aquarium fish, flatfish, ray, shark, trout, orchids, poppies, roses, sunflowers and tulips [10].
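All four datasets are distributed with the Keras datasets module, so the data described above can be loaded along the following lines. This is a sketch assuming a TensorFlow/Keras environment; the variable names are illustrative.

```python
# Loading the four benchmark datasets with the Keras datasets module.
from tensorflow.keras import datasets

(x_mnist, y_mnist), (x_mnist_t, y_mnist_t) = datasets.mnist.load_data()      # 60,000/10,000 28x28 grayscale images
(x_fash, y_fash), (x_fash_t, y_fash_t) = datasets.fashion_mnist.load_data()  # 60,000/10,000 28x28 grayscale images
(x_c10, y_c10), (x_c10_t, y_c10_t) = datasets.cifar10.load_data()            # 50,000/10,000 32x32 color images, 10 classes
(x_c100, y_c100), (x_c100_t, y_c100_t) = datasets.cifar100.load_data(label_mode="fine")  # 100 fine classes

# Pixel values arrive as integers in [0, 255]; scaling to [0, 1] is a common preprocessing step.
x_mnist, x_mnist_t = x_mnist / 255.0, x_mnist_t / 255.0
```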
IV. RESULTS AND DISCUSSIONS

A. Training the model

The training process begins by defining the model and then invoking the fit_generator method. The dataset is divided into batches; a batch is a set or group of data elements, and the batch size denotes the total number of training elements in a particular batch. An optimization algorithm is used to iterate over the training examples a number of times to find the optimum results. The optimization algorithms used in this study are SGD, nesterov momentum, adagrad, adadelta, RMSprop and adam. Optimization algorithms operate in a forward and backward manner, and an epoch is one such pass over the entire dataset.
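The paper does not report the network architecture or the training hyperparameters, so the following Keras sketch only illustrates how such runs could be set up with the six optimizers on mnist. The model, batch size and epoch count are placeholder assumptions, and the current model.fit API is used here in place of the older fit_generator.

```python
from tensorflow.keras import datasets, layers, models, optimizers

(x_train, y_train), (x_test, y_test) = datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

def build_model():
    # Placeholder network; any classifier with a softmax output fits the description above.
    return models.Sequential([
        layers.Flatten(input_shape=(28, 28)),
        layers.Dense(256, activation="relu"),
        layers.Dense(10, activation="softmax"),
    ])

optimizers_to_try = {
    "sgd": optimizers.SGD(learning_rate=0.01),
    "nesterov momentum": optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True),
    "adagrad": optimizers.Adagrad(),
    "adadelta": optimizers.Adadelta(),
    "rmsprop": optimizers.RMSprop(),
    "adam": optimizers.Adam(),
}

histories = {}
for name, opt in optimizers_to_try.items():
    model = build_model()
    model.compile(optimizer=opt, loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    # One epoch is a full forward/backward pass over the training set, processed batch by batch.
    histories[name] = model.fit(x_train, y_train, batch_size=128, epochs=5,
                                validation_data=(x_test, y_test), verbose=0)
```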
B. Model Evaluation

The model is evaluated using accuracy and cross-entropy, defined below.

Accuracy is the proportion of correctly identified instances out of the total dataset. In other words, it is the number of true predictions that the model has achieved out of the total number of predictions. It is given by the following equation:

$$\mathrm{Accuracy} = \frac{\mathrm{Correct\ Predictions\ Count}}{\mathrm{Total\ Predictions}}$$

The loss function determines how well the network is performing its intended task. Cross-entropy (CE) loss, or log-loss, is defined as a measure of the performance of the classifier:

$$\mathrm{Cross\ Entropy} = -\sum_{i=1}^{c} a_i \log(p_i)$$

where c is the number of classes, a_i is the actual value and p_i is the predicted value.
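For concreteness, both metrics can be computed directly from the predicted class probabilities. The sketch below assumes integer class labels and a matrix of per-class probabilities; the array names are illustrative.

```python
import numpy as np

def accuracy(y_true, probs):
    # Fraction of samples whose highest-probability class matches the true label.
    return np.mean(np.argmax(probs, axis=1) == y_true)

def cross_entropy(y_true, probs, eps=1e-12):
    # Mean of -sum_i a_i * log(p_i); with one-hot targets only the true-class term survives.
    p_true = probs[np.arange(len(y_true)), y_true]
    return -np.mean(np.log(np.clip(p_true, eps, 1.0)))

# Example with three samples and three classes.
y_true = np.array([0, 2, 1])
probs = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.2, 0.6],
                  [0.3, 0.5, 0.2]])
print(accuracy(y_true, probs))       # 1.0
print(cross_entropy(y_true, probs))  # about 0.48
```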
C. Results

Table I shows the results obtained on the mnist dataset for the six optimization algorithms; the first column gives the epoch number, followed by one column per optimization algorithm. Table II shows the results obtained on the fashionmnist dataset, Table III the results obtained on the cifar10 dataset, and Table IV the results obtained on the cifar100 dataset. Figure 1 presents the tabular information in the form of graphs.

Fig. 1: Various optimization algorithms on the datasets: (a) MNIST, (b) FashionMNIST, (c) Cifar10, (d) Cifar100.

D. Comparison of Results

The optimal training results obtained are 1.00 for mnist with RMSProp and adam at epoch 200, 1.00 for fashionmnist with rmsprop and adam at epoch 400, 1.00 for cifar10 with rmsprop at epoch 200, and 1.00 for cifar100 with adam at epoch 100. The highest testing results, achieved with adam, are 0.9826, 0.9853, 0.9855 and 0.9842 for mnist, fashionmnist, cifar10 and cifar100 respectively. The analysis of the results shows that the adam optimization algorithm performs better than the others in the testing phase and rmsprop in the training phase. Table V shows the testing results obtained on the mnist, fashionmnist, cifar10 and cifar100 datasets for the SGD, momentum, adagrad, adadelta, rmsprop and adam optimization algorithms.

V. CONCLUSION AND FUTURE WORK

The methods investigated are stochastic gradient descent, nesterov momentum, rmsprop, adam, adagrad and adadelta. The