
International Conference on Inventive Systems and Control (ICISC 2019)

IEEE Xplore Part Number: CFP19J06-ART; ISBN: 978-1-5386-3950-4

A Study of the Optimization Algorithms in Deep Learning

Raniah Zaheer, Lecturer, Department of CS, Najran University
Humera Shaziya, Assistant Professor, Informatics, Nizam College, Osmania University

Abstract—Training a deep learning model involves learning the parameters that satisfy the objective function. Typically, the objective is to minimize the loss incurred during the learning process. In a supervised mode of learning, a model is given data samples and their respective outcomes. When the model generates an output, it compares it with the desired output, computes the difference between the two, and attempts to bring the generated output close to the desired output. This is achieved through optimization algorithms. An optimization algorithm goes through several cycles until convergence to improve the accuracy of the model. Several types of optimization methods have been developed to address the challenges associated with the learning process. Six of these have been examined in this study to gain insights into their intricacies: stochastic gradient descent, nesterov momentum, rmsprop, adam, adagrad and adadelta. Four datasets have been selected for the experiments: mnist, fashionmnist, cifar10 and cifar100. The optimal training accuracy of 1.00 is obtained for mnist with rmsprop and adam at epoch 200, for fashionmnist with rmsprop and adam at epoch 400, for cifar10 with rmsprop at epoch 200, and for cifar100 with adam at epoch 100. The highest testing accuracies, achieved with adam, are 0.9826, 0.9853, 0.9855 and 0.9842 for mnist, fashionmnist, cifar10 and cifar100 respectively. The analysis of results shows that the adam optimization algorithm performs better than the others in the testing phase, and rmsprop and adam in the training phase.

Index Terms—Optimization Algorithm, Stochastic Gradient Descent, Momentum, RMSProp, AdamOptimizer, MNIST, FashionMNIST, Cifar10, Cifar100

I. INTRODUCTION

Deep learning has been at the forefront of solving quite a few real world problems. A machine is made to learn from a dataset and is expected to improve its performance over a period of time. When input is given to the model, a function is applied to it, and through a sequence of layers it is transformed into an output value. The model then compares the generated output with the actual output and the difference is calculated. In order to reduce this difference, the error is backpropagated through the model. The model adjusts the weights and repeats the same process until convergence. This leads to a quest for an algorithm that accelerates the learning process and generates optimal output. In this direction, several optimization algorithms have been developed and applied to various tasks. This study examines the most widely used optimization algorithms and their impact on the learning process.

The paper is organized as follows: Section II presents related work on optimization algorithms. Section III discusses the optimization algorithms and the datasets used in this study. The algorithms described are stochastic gradient descent, momentum, rmsprop, adam, adagrad and adadelta; the datasets examined are mnist, fashionmnist, cifar10 and cifar100. Section IV elaborates on the results obtained by training the model on the selected datasets with the chosen optimization algorithms and compares the results. Section V concludes the paper and provides future directions.

II. RELATED WORK

Machine learning and deep learning rely on optimization algorithms to learn the parameters of the input data. Hence optimization algorithms play a vital role in the successful implementation of solutions to real world problems. Various studies have been conducted to determine the optimal algorithm for the problem at hand. Because there is no general method that solves all the different kinds of problems, investigations have to be carried out to figure out the method that works best for a given problem. [1] discusses gradient descent and its variants. The authors review various optimization algorithms for parallel and distributed environments and also investigate ways to optimize basic gradient descent. The discussion begins by introducing the three variants of gradient descent: batch gradient descent, stochastic gradient descent and mini-batch gradient descent. These differ in the number of data samples used for a single training update. Basic gradient descent is the batch method, which computes the gradients over the entire training set. It is well suited to small and medium sized datasets, but it does not scale because of the constraint of holding the entire dataset in memory, and it cannot incorporate newly added data elements on the fly during training. Thus the batch method is slow and not appropriate for large datasets. Stochastic gradient descent (SGD) performs the gradient computation for each training sample; it can take data elements into account as they arrive and is relatively faster. However, it suffers from slow convergence due to the fluctuations that occur when training on individual data elements. Mini-batch gradient descent combines the strengths of the batch and stochastic methods and trains the dataset in mini-batches.


This method achieves the advantages of fast convergence and the inclusion of online data (data that arrives on the fly). A few challenges remain in all the aforementioned types of gradient descent. The major one is the selection of the learning rate; others are supporting a variable learning rate and avoiding convergence at a suboptimal local minimum. The paper further outlines techniques that overcome these challenges: momentum, nesterov momentum, rmsprop, adam, adagrad, adadelta and nadam. It addresses the question of which method should be employed in a given scenario, and discusses extending the methods to parallel and distributed environments. Improvements to SGD through the use of shuffling, curriculum learning, batch normalization and early stopping are also presented. The paper [2] examines SGD and its impact on over-parameterized networks. It has been observed that SGD performs reasonably well in networks with few parameters, leading to fast convergence and an optimal global minimum. However, it is worth investigating whether this observation generalizes to massively over-parameterized networks. This is the work carried out by the authors, who show through their experiments that SGD indeed yields satisfactory results in over-parameterized networks.
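To make the distinction between the three variants concrete, here is a minimal NumPy sketch (my illustration, not code from [1] or from this paper) of one epoch of batch, stochastic and mini-batch gradient descent on a toy least-squares objective; the model, data and learning rate are assumptions chosen for brevity.

```python
import numpy as np

# Toy least-squares problem: loss(w) = mean((X @ w - y)^2) / 2
rng = np.random.default_rng(0)
X, y = rng.normal(size=(256, 10)), rng.normal(size=256)
lr = 0.01

def grad(w, Xb, yb):
    # Gradient of the mean squared error over the given subset of samples.
    return Xb.T @ (Xb @ w - yb) / len(yb)

def batch_epoch(w):
    # Batch gradient descent: a single update computed over the whole dataset.
    return w - lr * grad(w, X, y)

def sgd_epoch(w):
    # Stochastic gradient descent: one update per individual training sample.
    for i in rng.permutation(len(y)):
        w = w - lr * grad(w, X[i:i + 1], y[i:i + 1])
    return w

def minibatch_epoch(w, batch_size=32):
    # Mini-batch gradient descent: one update per small group of samples.
    idx = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        b = idx[start:start + batch_size]
        w = w - lr * grad(w, X[b], y[b])
    return w

w = np.zeros(10)
for _ in range(20):
    w = minibatch_epoch(w)   # swap in batch_epoch or sgd_epoch to compare behaviour
```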
III. OPTIMIZATION ALGORITHMS AND DATASETS

A. Optimization algorithms

Optimization algorithms form the basis on which a machine is able to learn from its experience. They compute gradients and attempt to minimize the loss function. Learning can be implemented in several ways with different kinds of optimization algorithms. The algorithms studied in the present work are described below.

1) Stochastic Gradient Descent: Vanilla gradient descent trains on the entire dataset at once. Its variant, stochastic gradient descent [3], performs the training on individual data elements.

2) Nesterov Momentum: The gradient is computed at the approximate future positions of the parameters rather than at the current parameters. Nesterov momentum is an improvement over plain momentum, which does not look ahead to the future position of the parameters. [4] has incorporated nesterov momentum into adam.

3) Adagrad: This method adapts the learning rate to the situation. Learning rates are adaptive because the actual rate is determined per parameter: parameters with high gradients receive a reduced learning rate, while parameters with small gradients receive an increased learning rate.

4) Adadelta: This is an extension of adagrad. [5] presents the modification of adagrad into adadelta. Instead of accumulating all past gradients, adadelta uses a fixed-size window and tracks only the gradients within that window.

5) RMSProp: RMSProp changes the way adagrad accumulates the gradient. Gradients are accumulated into an exponentially weighted average, so RMSProp discards distant history and maintains only recent gradient information. [6] discusses rmsprop and its variants, and also examines adagrad with logarithmic regret bounds.

6) Adam: The name is derived from "adaptive moments". It is a combination of rmsprop and momentum. The update operation considers only the smoothed version of the gradient and also includes a bias correction mechanism. [7] discusses the adam algorithm.
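As a compact illustration, the update rules of the six optimizers above can be sketched in NumPy as follows. This is my own sketch, not code from the paper; the hyperparameter defaults (lr, beta, rho, eps) are common textbook values and are assumptions, not the authors' settings.

```python
import numpy as np

def init_state(shape):
    # Accumulators shared by the adaptive methods below.
    return {"v": np.zeros(shape),   # velocity / first moment
            "s": np.zeros(shape),   # accumulated squared gradients / second moment
            "d": np.zeros(shape),   # adadelta's accumulated squared updates
            "t": 0}                 # step counter for adam's bias correction

def sgd(w, g, st, lr=0.01):
    return w - lr * g

def nesterov_momentum(w, grad_fn, st, lr=0.01, beta=0.9):
    # Unlike the others, this needs a gradient *function*: the gradient is taken
    # at the look-ahead point w + beta * v instead of at w itself.
    g = grad_fn(w + beta * st["v"])
    st["v"] = beta * st["v"] - lr * g
    return w + st["v"]

def adagrad(w, g, st, lr=0.01, eps=1e-8):
    st["s"] += g ** 2                                # accumulate all past squared gradients
    return w - lr * g / (np.sqrt(st["s"]) + eps)

def adadelta(w, g, st, rho=0.95, eps=1e-6):
    st["s"] = rho * st["s"] + (1 - rho) * g ** 2     # windowed squared gradients
    step = -np.sqrt(st["d"] + eps) / np.sqrt(st["s"] + eps) * g
    st["d"] = rho * st["d"] + (1 - rho) * step ** 2  # windowed squared updates
    return w + step

def rmsprop(w, g, st, lr=0.001, rho=0.9, eps=1e-8):
    st["s"] = rho * st["s"] + (1 - rho) * g ** 2     # exponentially weighted average
    return w - lr * g / (np.sqrt(st["s"]) + eps)

def adam(w, g, st, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    st["t"] += 1
    st["v"] = b1 * st["v"] + (1 - b1) * g            # momentum-style first moment
    st["s"] = b2 * st["s"] + (1 - b2) * g ** 2       # rmsprop-style second moment
    v_hat = st["v"] / (1 - b1 ** st["t"])            # bias correction
    s_hat = st["s"] / (1 - b2 ** st["t"])
    return w - lr * v_hat / (np.sqrt(s_hat) + eps)
```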
B. Datasets

1) MNIST: A dataset of handwritten digits. It consists of 60,000 training images and 10,000 test images. The images are 28x28 grayscale and contain handwritten digits from 0 through 9, for a total of 10 classes. It is a benchmark dataset for experimenting with various algorithms and techniques. [8] describes the data.

2) FashionMNIST: A dataset of fashion-related images. It consists of 60,000 training images and 10,000 test images. The images are 28x28 grayscale and span ten classes: t-shirt, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag and ankle boot. [9] developed the fashion mnist dataset.

3) Cifar10: [10] presents the details of the cifar10 dataset. It consists of 32x32 color images of airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. There are 60,000 images in total, 6,000 per class, with 10 classes defined for the ten different types of images.

4) Cifar100: This dataset is just like CIFAR-10, except that it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). Examples of superclasses in CIFAR-100 are aquatic mammals, fish and flowers; examples of classes are beaver, dolphin, otter, seal, whale, aquarium fish, flatfish, ray, shark, trout, orchids, poppies, roses, sunflowers and tulips [10].
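All four datasets ship with tf.keras, so loading them can be sketched as below. This is my addition, assuming a TensorFlow 2.x / tf.keras setup rather than reproducing the authors' exact code.

```python
from tensorflow.keras import datasets

# Each loader returns ((x_train, y_train), (x_test, y_test)) as NumPy arrays.
loaders = {
    "mnist": datasets.mnist.load_data,                # 60,000 / 10,000, 28x28 grayscale
    "fashion_mnist": datasets.fashion_mnist.load_data,
    "cifar10": datasets.cifar10.load_data,            # 50,000 / 10,000, 32x32 color
    "cifar100": lambda: datasets.cifar100.load_data(label_mode="fine"),
}

for name, load in loaders.items():
    (x_train, y_train), (x_test, y_test) = load()
    # Scale pixel values to [0, 1] before feeding them to a model.
    x_train, x_test = x_train / 255.0, x_test / 255.0
    print(name, x_train.shape, x_test.shape)
```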
IV. RESULTS AND DISCUSSIONS

A. Training the model

The training process begins by defining the model and then invoking the fit_generator method. The dataset is divided into batches; a batch is a group of data elements, and the batch size denotes the number of training elements in a particular batch. An optimization algorithm is used to iterate over the training examples a number of times to find the optimum results. The optimization algorithms used in this study are SGD, nesterov momentum, adagrad, adadelta, rmsprop and adam. Optimization algorithms operate in a forward and backward manner, and an epoch is one such pass over the entire dataset.
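The paper does not give its model or training code. The following is a hedged sketch of the kind of loop described above, using a small Keras CNN on mnist and the optimizer classes built into Keras; it uses model.fit rather than the fit_generator call mentioned in the paper, and the architecture, batch size and epoch count are placeholders, not the authors' settings.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers, datasets

(x_train, y_train), (x_test, y_test) = datasets.mnist.load_data()
x_train, x_test = x_train[..., None] / 255.0, x_test[..., None] / 255.0

def build_model():
    # A small placeholder CNN; the paper does not specify its architecture.
    return models.Sequential([
        layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(10, activation="softmax"),
    ])

# One optimizer per experiment, mirroring the six methods compared in the paper.
optimizer_choices = {
    "sgd": optimizers.SGD(learning_rate=0.01),
    "nesterov_momentum": optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True),
    "adagrad": optimizers.Adagrad(),
    "adadelta": optimizers.Adadelta(),
    "rmsprop": optimizers.RMSprop(),
    "adam": optimizers.Adam(),
}

results = {}
for name, opt in optimizer_choices.items():
    model = build_model()
    model.compile(optimizer=opt, loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    model.fit(x_train, y_train, batch_size=128, epochs=5, verbose=0)  # epoch count is illustrative
    _, test_acc = model.evaluate(x_test, y_test, verbose=0)
    results[name] = test_acc
print(results)
```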


TABLE I: MNIST (training accuracy)

Epoch       SGD    Momentum  Adagrad  Adadelta  RMSProp  Adam
Epoch 000   0.10   0.20      0.18     0.06      0.14     0.16
Epoch 100   0.54   0.90      0.84     0.04      0.96     0.98
Epoch 200   0.74   0.92      0.94     0.16      1.00     1.00
Epoch 300   0.86   0.96      0.94     0.18      1.00     1.00
Epoch 400   0.80   0.96      0.96     0.12      0.98     0.98
Epoch 500   0.82   0.94      0.96     0.28      0.98     0.92
Epoch 600   0.88   0.96      0.96     0.32      1.00     1.00
Epoch 700   0.86   0.92      0.94     0.46      1.00     1.00
Epoch 800   0.96   0.98      1.00     0.40      1.00     1.00
Epoch 900   0.88   0.88      0.96     0.44      1.00     1.00

TABLE II: FashionMNIST (training accuracy)

Epoch       SGD    Momentum  Adagrad  Adadelta  RMSProp  Adam
Epoch 000   0.12   0.22      0.20     0.08      0.20     0.22
Epoch 100   0.56   0.86      0.80     0.16      0.90     0.96
Epoch 200   0.84   0.90      0.90     0.10      0.96     0.98
Epoch 300   0.98   0.94      0.92     0.10      0.98     0.98
Epoch 400   0.88   0.90      0.96     0.20      1.00     1.00
Epoch 500   0.94   0.96      0.86     0.18      0.98     0.96
Epoch 600   0.92   1.00      0.90     0.22      0.98     1.00
Epoch 700   0.94   0.98      0.96     0.32      1.00     0.98
Epoch 800   0.90   0.94      0.96     0.38      1.00     1.00
Epoch 900   0.96   0.94      0.94     0.36      1.00     1.00

TABLE III: Cifar10 (training accuracy)

Epoch       SGD    Momentum  Adagrad  Adadelta  RMSProp  Adam
Epoch 000   0.12   0.26      0.14     0.06      0.18     0.16
Epoch 100   0.80   0.88      0.92     0.18      0.98     0.96
Epoch 200   0.76   0.92      0.96     0.14      1.00     0.98
Epoch 300   0.84   0.94      0.90     0.18      1.00     0.96
Epoch 400   0.84   0.92      0.98     0.20      1.00     1.00
Epoch 500   0.90   0.94      0.92     0.34      1.00     0.94
Epoch 600   0.90   0.98      0.88     0.28      1.00     0.98
Epoch 700   0.88   0.98      0.94     0.26      1.00     0.94
Epoch 800   0.82   0.96      0.92     0.40      1.00     0.98
Epoch 900   0.90   0.96      0.96     0.34      1.00     0.98

TABLE IV: Cifar100 (training accuracy)

Epoch       SGD    Momentum  Adagrad  Adadelta  RMSProp  Adam
Epoch 000   0.12   0.08      0.22     0.12      0.10     0.36
Epoch 100   0.62   0.90      0.78     0.10      0.84     1.00
Epoch 200   0.72   0.98      0.88     0.08      1.00     0.98
Epoch 300   0.84   0.98      0.94     0.16      0.98     0.98
Epoch 400   0.80   0.92      0.82     0.16      1.00     0.98
Epoch 500   0.86   0.92      0.96     0.12      1.00     0.98
Epoch 600   0.84   0.96      0.88     0.16      0.98     0.98
Epoch 700   0.90   0.92      0.96     0.20      1.00     0.98
Epoch 800   0.94   0.96      0.96     0.30      1.00     0.94
Epoch 900   0.84   0.96      0.96     0.32      1.00     0.98

B. Model Evaluation

The model is evaluated using accuracy and cross-entropy, defined below.

Accuracy is the fraction of correctly identified instances out of the total dataset; in other words, it is the proportion of true predictions that the model has achieved out of all predictions. It is given by the following equation:

\text{Accuracy} = \frac{\text{Correct Predictions Count}}{\text{Total Predictions}}

The loss function determines how well the network is performing its intended task. Cross-entropy (CE) loss, or log-loss, measures the performance of the classifier:

\text{Cross Entropy} = -\sum_{i=1}^{c} a_i \log(p_i)

where c is the number of classes, a_i is the actual value and p_i is the predicted value.
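A minimal NumPy sketch of these two metrics (my addition, not the authors' code), assuming one-hot labels a and predicted class probabilities p:

```python
import numpy as np

def accuracy(a, p):
    # Fraction of samples whose highest-probability class matches the true class.
    return np.mean(np.argmax(a, axis=1) == np.argmax(p, axis=1))

def cross_entropy(a, p, eps=1e-12):
    # Mean over the batch of -sum_i a_i * log(p_i); eps guards against log(0).
    return np.mean(-np.sum(a * np.log(p + eps), axis=1))

# Tiny worked example with 3 classes and 2 samples.
a = np.array([[1, 0, 0], [0, 1, 0]])              # actual (one-hot) labels
p = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])  # predicted probabilities
print(accuracy(a, p))       # 1.0
print(cross_entropy(a, p))  # ~0.29, the mean of -log(0.7) and -log(0.8)
```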
C. Results

Table I shows the results obtained on the mnist dataset for the six optimization algorithms; the first column gives the epoch number, followed by one column per optimization algorithm. Table II shows the results obtained on the fashionmnist dataset, Table III the results on cifar10, and Table IV the results on cifar100. Figure 1 presents the same tabular information in the form of graphs.

[Fig. 1: Various optimization algorithms on datasets; panels (a) MNIST, (b) FashionMNIST, (c) Cifar10, (d) Cifar100]
D. Comparison of Results

The optimal training accuracy of 1.00 is obtained for mnist with rmsprop and adam at epoch 200, for fashionmnist with rmsprop and adam at epoch 400, for cifar10 with rmsprop at epoch 200, and for cifar100 with adam at epoch 100. The highest testing accuracies, achieved with adam, are 0.9826, 0.9853, 0.9855 and 0.9842 for mnist, fashionmnist, cifar10 and cifar100 respectively. The analysis of the results shows that the adam optimization algorithm performs better than the others in the testing phase, and rmsprop in the training phase. Table V shows the testing results obtained on the mnist, fashionmnist, cifar10 and cifar100 datasets for the SGD, momentum, adagrad, adadelta, rmsprop and adam optimization algorithms.


TABLE V: Comparison of Results (testing accuracy)

Optimization Algorithm   MNIST    FashionMNIST   Cifar10   Cifar100
SGD                      0.9086   0.9108         0.9089    0.9151
Momentum                 0.9663   0.9666         0.9643    0.9650
Adagrad                  0.9394   0.9406         0.9386    0.9334
Adadelta                 0.4524   0.4346         0.4422    0.3491
RMSProp                  0.9823   0.9838         0.9773    0.9795
Adam                     0.9826   0.9853         0.9855    0.9842

V. CONCLUSION AND FUTURE WORK

The methods investigated are stochastic gradient descent, nesterov momentum, rmsprop, adam, adagrad and adadelta. The datasets selected for the experiments are mnist, fashionmnist, cifar10 and cifar100. This study of optimization algorithms has been conducted to gain insights into the intricacies of the different methods when used with different datasets. Six algorithms have been chosen for detailed experiments on four datasets. The training results show that rmsprop reaches an accuracy of 1.00 sooner than the other algorithms on cifar10, whereas adam reaches 1.00 sooner on cifar100; rmsprop and adam reach 1.00 at the same epoch on mnist and fashionmnist. The testing stage yields the best results on all the datasets with the adam algorithm, with accuracies of 0.9826, 0.9853, 0.9855 and 0.9842 for mnist, fashionmnist, cifar10 and cifar100 respectively. Future work can study optimization algorithms with varying parameters such as the learning rate, the number of filters and so on, and can examine optimization algorithms in different deep learning architectures.
REFERENCES
[1] S. Ruder, “An overview of gradient descent optimization algorithms,”
arXiv preprint arXiv:1609.04747, 2016.
[2] A. Brutzkus, A. Globerson, E. Malach, and S. Shalev-Shwartz, “Sgd
learns over-parameterized networks that provably generalize on linearly
separable data,” arXiv preprint arXiv:1710.10174, 2017.
[3] C. Darken, J. Chang, and J. Moody, “Learning rate schedules for faster
stochastic gradient search,” in Neural Networks for Signal Processing
[1992] II., Proceedings of the 1992 IEEE-SP Workshop. IEEE, 1992,
pp. 3–12.
[4] T. Dozat, “Incorporating nesterov momentum into adam,” 2016.
[5] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv
preprint arXiv:1212.5701, 2012.
[6] M. C. Mukkamala and M. Hein, “Variants of rmsprop and adagrad with
logarithmic regret bounds,” arXiv preprint arXiv:1706.05507, 2017.
[7] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
arXiv preprint arXiv:1412.6980, 2014.
[8] “MNIST handwritten digit database,” Yann LeCun, Corinna Cortes and Chris Burges, http://yann.lecun.com/exdb/mnist/, (Accessed on 12/16/2018).
[9] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel image
dataset for benchmarking machine learning algorithms,” arXiv preprint
arXiv:1708.07747, 2017.
[10] “Cifar-10 and cifar-100 datasets,” https://www.cs.toronto.edu/~kriz/cifar.html, (Accessed on 12/16/2018).
