Optimizing deep learning models from multi-objective perspective via Bayesian optimization
Corresponding Author:
Shafaf Ibrahim
College of Computing, Informatics and Mathematics, Universiti Teknologi MARA
Shah Alam, Selangor, Malaysia
Email: shafaf2429@uitm.edu.my
1. INTRODUCTION
The performance of deep learning (DL) models relies heavily on their hyperparameters [1], reinforcing the
necessity for hyperparameter optimization. The process of finding the optimal hyperparameter configuration for
a DL model is referred to as hyperparameter tuning [1]. Widely known hyperparameter tuning techniques
include manual search [2], grid search, and random search [3]. In manual search, the process of finding the
optimal hyperparameter configuration is carried out directly by a human, relying on intuition and
experience [4]. It is therefore laborious, time consuming, and prone to errors [5]. In contrast to manual
search, grid search automates the exploration process, systematically traversing the hyperparameter space in
sequential order. Yet its brute-force methodology incurs significant computational overhead, particularly as
the dimensionality of the search space escalates exponentially [6].
Random search is an alternative to grid search that samples hyperparameter configurations in random order
within the designated search space [7]. Compared to grid search, random search has proven to be more
effective, especially in high-dimensional spaces [8]. Even so, because it samples the hyperparameter search
space without any guidance, the results obtained by random search are purely a matter of 'luck'. Grid search
and random search thus exhibit a lack of sophistication, amounting to a naïve approach to hyperparameter
tuning [8] that often consumes substantial computational resources as the search space grows
exponentially [6]. A modern approach is therefore needed, one that uses past evaluations to find the optimal
hyperparameter configuration intelligently. Within the context of this study, Bayesian optimization (BO) is
the alternative to these naïve methods. BO leverages a statistical model, employing a surrogate model and an
acquisition function to guide the search for the optimal hyperparameter configuration. BO is known for its
ability to optimize expensive functions by iteratively constructing a probabilistic surrogate model of the
underlying target function [9].
Nasayreh et al. [10] tested BO, grid search, and random search on different machine learning (ML) models:
support vector machine (SVM), logistic regression (LR), random forest (RF), and naïve Bayes (NB). Grid
search and random search provided superior results on most of the tested models except LR. In the study
conducted in [11], BO was implemented with a Gaussian process (GP) as the surrogate model. However,
hyperparameter tuning in a large-dimensional space with a small fitness-evaluation budget showed that an
alternative to GP is necessary [12]. In this study, hyperparameter tuning was performed using BO, an
alternative to the brute-force methodology inherent in grid search and the stochastic nature of random
search. BO offers a sophisticated approach to identifying optimal solutions: it leverages prior data to
intelligently navigate the search space by employing a probabilistic model, notably the tree-structured
Parzen estimator (TPE), which encapsulates the underlying objective function.
Within the context of hyperparameter tuning, previous studies have predominantly focused on
single-objective optimization (SOO), as in [10]. SOO provides advantages such as reduced runtime and
improved convergence; however, it limits performance evaluation to a single objective, precluding the
consideration of conflicting objectives. A single objective typically fails to reflect real-world scenarios,
which commonly involve multiple conflicting objectives that clash with one another.
This study proposes multi-objective hyperparameter tuning using BO on different architectures of DL
models. The primary objective of this paper is to show that BO-based hyperparameter tuning from a
multi-objective perspective outperforms baseline methods on different DL architectures, particularly as the
search space grows exponentially when new hyperparameters are added. In addition, this study offers a
comprehensive exploration of the performance of different hyperparameter tuning methods within the
context of multi-objective optimization.
2. METHODS
2.1. Multi-objective hyperparameter tuning
The effectiveness of a learning algorithm is heavily dependent on the configuration of its
hyperparameters, λ. The right hyperparameter configuration can directly impact the performance of a DL
model [1], [13]. Mathematically, a learning algorithm with a designated hyperparameter configuration is
denoted as $\mathcal{A}_\lambda$, and $f = \mathcal{A}_\lambda(X^{(\text{train})})$ for a training set
$X^{(\text{train})}$. For instance, in a convolutional neural network (CNN) model where the batch size is
denoted $bs$ and the learning rate $l$, λ is written as $\lambda = (bs, l)$. In the context of hyperparameter
configuration, the size of the search space can be measured as stated in (1).

$N = \prod_{i=1}^{n} m_i$ (1)
Referring to (1), the number of hyperparameters is represented by $n$, whereas the number of possible
values of the $i$-th hyperparameter is represented by $m_i$. Within the context of this study, the
hyperparameters are the learning rate, epochs, batch size, kernel size, and neurons per layer, depending on
whether the architecture is a multi-layer perceptron (MLP), LeNet, or CNN. Based on (1), the dimension of
the search space increases exponentially as new hyperparameters are added to the equation [14]. As the
search space grows, the traditional approach to hyperparameter tuning proves tedious, laborious, prone to
errors, and computationally expensive [2].
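As a concrete illustration of (1), the following minimal Python sketch computes the search-space size for a hypothetical discretized grid; the hyperparameter names and value counts are illustrative assumptions, not the exact grids used in this study.

```python
from math import prod

# Hypothetical discretized search space; each entry lists the candidate
# values for one hyperparameter (illustrative, not the study's exact grid).
search_space = {
    "learning_rate": [1e-4, 1e-3, 1e-2, 1e-1],
    "batch_size": [64, 128, 256],
    "epochs": [10, 30, 50],
    "kernel_size": [2, 3, 4],
}

# N = product of m_i over all n hyperparameters, as in (1).
n_configurations = prod(len(values) for values in search_space.values())
print(n_configurations)  # 4 * 3 * 3 * 3 = 108
```

Adding one more hyperparameter with even a handful of candidate values multiplies this count again, which is the exponential growth the traditional methods struggle with.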
Moreover, within the context of hyperparameter tuning, conflicting objectives often arise when optimizing
for multiple performance metrics. Theoretically, a multi-objective optimization problem can be defined as in
(2) [15].

$\min_{x \in U} \; \left( f_1(x), f_2(x), \ldots, f_n(x) \right)$ (2)

Here $x$ represents the decision variables, $n$ the number of objective functions $f_1, \ldots, f_n$ (each of
which may be either minimized or maximized), and $U$ the feasible set. Within the context of DL, previous
studies report conflicts between accuracy and model size [16]; specificity, accuracy, and sensitivity [17];
and latency and accuracy [18]. In this study, the conflicting objectives are not limited to two, but extended
to three: accuracy, F1-score, and the weight of the model.
In contrast to GP, TPE concentrates on approximating the conditional probability $P(x|y)$ rather than
directly modelling $P(y|x)$. This conditional probability is estimated using two distinct density functions:
$l(x)$ for cases where the performance is below a certain threshold, and $g(x)$ for cases where the
performance meets or surpasses that threshold, as stated in (4) [21]. In the context of this study, the
threshold is adaptively adjusted and configured by Optuna depending on the problem at hand.
$P(x|y) = \begin{cases} l(x) & \text{if } y < y^* \\ g(x) & \text{if } y \geq y^* \end{cases}$ (4)
These two density functions, $l(x)$ and $g(x)$, are subsequently employed in the expected improvement
(EI) acquisition function, which guides the decision of where to sample the next set of hyperparameters. In
essence, TPE approximates the conditional probability of hyperparameters given a score and uses $l(x)$ and
$g(x)$ within the EI function to determine the most promising sampling points.
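A minimal sketch of TPE-driven tuning with Optuna is shown below. The toy objective stands in for actual model training, and the hyperparameter ranges and seed are illustrative assumptions; a real study would train a DL model inside the objective and return its validation score.

```python
import optuna

def objective(trial):
    # Illustrative search space; not the study's exact ranges.
    lr = trial.suggest_float("learning_rate", 1e-4, 1e-1, log=True)
    batch_size = trial.suggest_categorical("batch_size", [64, 128, 256])
    # Toy stand-in for a validation-loss surface.
    return (lr - 1e-3) ** 2 + batch_size * 1e-6

# TPESampler splits past trials at an adaptively chosen threshold y* into
# the l(x) (good) and g(x) (bad) densities and proposes points that
# maximize the ratio l(x)/g(x).
study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```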
As mentioned previously, the acquisition function used in this experiment is EI. The other widely used
acquisition function in BO is the probability of improvement (PI). PI considers only the probability of
improving on the current best estimate and does not factor in the magnitude of the improvement. In contrast,
EI is widely used because it considers both the probability and the magnitude of improvement at a point,
which also helps it avoid falling into local optima. The mathematical notation for EI, for a minimized
objective with incumbent threshold $y^*$, is as stated in (5) [2].

$EI(x) = \left( y^* - \mu(x) \right) \Phi\!\left( \frac{y^* - \mu(x)}{\sigma(x)} \right) + \sigma(x)\, \phi\!\left( \frac{y^* - \mu(x)}{\sigma(x)} \right)$ (5)

$\Phi(\cdot)$ and $\phi(\cdot)$ denote the cumulative distribution function and the probability density
function of the standard normal distribution, respectively. Maximizing EI corresponds to maximizing the
ratio $l(x)/g(x)$ in TPE, as shown in (6), where $y^*$ is some quantile of the observed $y$.

$EI_{y^*}(x) \propto \left( \gamma + \frac{g(x)}{l(x)} \left( 1 - \gamma \right) \right)^{-1}, \quad \gamma = P(y < y^*)$ (6)
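For concreteness, a small numerical sketch of (5) follows, using hypothetical surrogate predictions (mu, sigma) and threshold y*; it illustrates the formula only, not the study's implementation, which delegates the acquisition step to Optuna's TPE.

```python
from scipy.stats import norm

def expected_improvement(mu, sigma, y_star):
    """EI at one candidate point, per (5), for a minimized objective.

    mu, sigma: surrogate posterior mean and standard deviation (hypothetical
    stand-ins for a real surrogate's predictions). y_star: threshold y*.
    """
    if sigma <= 0.0:
        return max(y_star - mu, 0.0)
    z = (y_star - mu) / sigma
    return (y_star - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# Hypothetical candidates: a lower predicted mean and a higher uncertainty
# both raise EI, which is what lets EI escape local optima.
print(expected_improvement(mu=0.20, sigma=0.05, y_star=0.25))
print(expected_improvement(mu=0.24, sigma=0.15, y_star=0.25))
```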
In addition to accuracy, the F1-score will also be evaluated as a classification metric in this study.
Defined as the harmonic mean of precision and recall [25], the F1-score offers a balanced assessment of
model efficacy. Precision quantifies the accuracy of positive predictions, i.e., the ratio of correctly
identified positive instances to all instances predicted as positive. Recall, conversely, measures the
model's ability to detect positive instances, defined as the proportion of correctly identified positive
cases among all actual positive instances in the dataset. The formula for the F1-score is presented in (7).

$F1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$ (7)
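As a brief illustration of (7), the sketch below computes precision, recall, and F1 from a hypothetical set of binary labels; the arrays are illustrative, and scikit-learn's f1_score would give the same result.

```python
import numpy as np

# Hypothetical ground-truth labels and model predictions (binary case).
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # equation (7)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```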
In addition to classification metrics, the model's efficiency will be assessed through its computational
complexity. This evaluation is conducted by calculating the model's weight, determined by the total number
of parameters it contains; a similar approach to measuring model efficiency was used in [26]. The
performance of the model will then be evaluated from the multi-objective point of view using a scalarization
approach, in which a weighted sum combines the three conflicting objectives into a single measure, as shown
in (8).

$S = w_1 \cdot f_{\text{accuracy}} + w_2 \cdot f_{\text{F1}} + w_3 \cdot f_{\text{parameters}}$ (8)

In (8), $w$ refers to the weight of each performance metric and $f_{\text{parameters}}$ is the normalized
parameter count. In the context of this study, the weights $(w_1, w_2, w_3)$ are equal, as no conflicting
objective is prioritized over another; hence, all weights are set to 1.
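A minimal sketch of the weighted-sum scalarization in (8) follows. The parameter-count normalization shown (one minus the ratio to a hypothetical maximum, max_params) is an illustrative assumption, since the exact normalization scheme depends on the search space of each architecture.

```python
def weighted_sum(accuracy, f1, n_params, max_params=150_000_000,
                 w1=1.0, w2=1.0, w3=1.0):
    """Scalarize three conflicting objectives as in (8).

    The normalization 1 - n_params / max_params (with a hypothetical
    max_params) is an illustrative assumption; smaller models score
    closer to 1, so fewer parameters are rewarded.
    """
    f_params = 1.0 - n_params / max_params
    return w1 * accuracy + w2 * f1 + w3 * f_params

# Example with equal weights, using values of the same shape as the
# reported results (accuracy, F1, and parameter count for one model).
print(weighted_sum(accuracy=0.9932, f1=0.9931, n_params=40_714))
```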
Similarly, for the CNN architecture, the most effective hyperparameter combination found by BO is (kernel
size 1=4, kernel size 2=3, kernel size 3=3, filters 1=16, filters 2=48, filters 3=64, optimizer=RMSprop,
learning rate=0.0001, batch size=256, epochs=50, activation=elu), yielding an accuracy of 0.9932, an
F1-score of 0.9931, and 40,714 parameters, equivalent to a normalized parameter value of 0.9997. The
combined value of these three metrics is 2.9861. The results underscore the effectiveness of BO in
navigating the complex hyperparameter landscape to identify optimal configurations. Across all
architectures, the optimized configurations achieved high accuracy and F1-scores, demonstrating their
effectiveness in accurately classifying digits in the MNIST dataset.
In a different configuration of the LeNet architecture, referring to Table 6, hyperparameter tuning produced
a notable outcome: an accuracy of 0.6787, an F1-score of 0.6739, and 168,254 parameters, equivalent to
0.8902 after normalization, summing to 2.2428. The hyperparameter combination for this result is (kernel
size 1=4, kernel size 2=4, filters 1=16, filters 2=48, optimizer=Nadam, learning rate=0.1, batch size=64,
epochs=10, activation=relu). On the other hand, the best hyperparameters for the LeNet architecture produced
an accuracy of 0.6822, an F1-score of 0.6827, and 18,109,354 parameters, equivalent to 0.8864 after
normalization, summing to 2.2513. The hyperparameter combination contributing to this outcome is (kernel
size 1=4, kernel size 2=4, kernel size 3=4, kernel size 4=2, kernel size 5=2, filters 1=96, filters 2=64,
filters 3=64, filters 4=128, filters 5=64, optimizer=Adadelta, learning rate=0.1, batch size=256, epochs=30,
activation=relu).
Similarly, for the CNN architecture, the most effective hyperparameter combination found by BO is (kernel
size 1=4, kernel size 2=3, kernel size 3=3, filters 1=16, filters 2=48, filters 3=64, optimizer=RMSprop,
learning rate=0.0001, batch size=256, epochs=30, activation=elu), yielding an accuracy of 0.6984, an
F1-score of 0.6968, and 45,706 parameters, equivalent to a normalized parameter value of 0.9898. The
combined value of these three metrics is 2.385. Compared to the MNIST dataset, the CIFAR-10 dataset
presents greater challenges, as evidenced by the lower average accuracy of approximately 61%. Nonetheless,
the hyperparameter tuning process successfully identified configurations that significantly improved model
performance across all architectures, demonstrating the effectiveness of BO in optimizing DL models for
image classification tasks.
Similarly, on the CNN architecture, the performance of BO once again stands out compared to the other
baseline methods on both datasets. On the MNIST dataset, BO achieves a weighted sum of 2.9861, followed by
random search with 2.9846, grid search with 2.8773, and manual search with 2.8699. On the CIFAR-10 dataset,
BO obtained a weighted sum of 2.3842, compared to the baseline methods, which achieved 2.3109 (random
search), 2.1385 (manual search), and 2.1307 (grid search).
4. CONCLUSION
BO stands out from manual search, grid search, and random search by leveraging past data to guide the next
iteration toward an optimal solution. Unlike grid search, whose hyperparameter search space increases
exponentially with the addition of new hyperparameters, BO offers a more efficient approach that reduces
computing costs. Additionally, random search's inconsistency, owing to its random nature, contrasts with
BO's ability to provide more consistent results. However, it is important to acknowledge that BO is not
without weaknesses. One limitation is its reliance on probabilistic models, which may not always accurately
capture the underlying complexities of the hyperparameter space. Furthermore, BO's performance depends
heavily on the configuration of its own hyperparameters and the quality of the surrogate model used. Future
studies could focus on addressing these limitations and further refining the BO approach. Research avenues
might include exploring more advanced surrogate models, enhancing the acquisition function to better balance
exploration and exploitation, and investigating strategies to handle noisy or uncertain evaluations. Future
studies could also examine other modern optimization techniques, such as heuristic or swarm intelligence
approaches, which could be beneficial for hyperparameter tuning from a multi-objective perspective. In
addition, comparative studies could delve deeper into the trade-offs between BO and other hyperparameter
tuning techniques across a wider range of DL architectures and datasets.
ACKNOWLEDGEMENTS
This research was supported by the Ministry of Higher Education Malaysia (MoHE) and Universiti
Teknologi MARA through the Fundamental Research Grant Scheme (FRGS) (600-RMC/FRGS 5/3
(024/2021)).
REFERENCES
[1] R. Krithiga and E. Ilavarasan, “Hyperparameter tuning of adaboost algorithm for social spammer identification,” International
Journal of Pervasive Computing and Communications, vol. 17, no. 5, pp. 462–482, 2020, doi: 10.1108/IJPCC-09-2020-0130.
[2] L. Wen, X. Ye, and L. Gao, “A new automatic machine learning based hyperparameter optimization for workpiece quality
prediction,” Measurement and Control, vol. 53, no. 7–8, pp. 1–11, 2020, doi: 10.1177/0020294020932347.
[3] A. M. Vincent and P. Jidesh, “An improved hyperparameter optimization framework for automl systems using evolutionary
algorithms,” Scientific Reports, vol. 13, no. 1, 2023, doi: 10.1038/s41598-023-32027-3.
[4] H.-C. Kim and M.-J. Kang, “Comparison of hyper-parameter optimization methods for deep neural networks,” Journal of IKEEE,
vol. 24, no. 4, pp. 969–974, 2020, doi: 10.7471/ikeee.2020.24.4.969.
[5] A. A. R. K. Bsoul, M. A. Al-Shannaq, and H. M. Aloqool, “Maximizing cnn accuracy: a bayesian optimization approach with
gaussian processes,” 9th 2023 International Conference on Control, Decision and Information Technologies, CoDIT 2023,
pp. 2597–2602, 2023, doi: 10.1109/CoDIT58514.2023.10284448.
[6] X. Zhang, S. Kuenzel, N. Colombo, and C. Watkins, “Hybrid short-term load forecasting method based on empirical wavelet
transform and bidirectional long short-term memory neural networks,” Journal of Modern Power Systems and Clean Energy,
vol. 10, no. 5, pp. 1216–1228, 2022, doi: 10.35833/MPCE.2021.000276.
[7] J. A. Pandian et al., “A five convolutional layer deep convolutional neural network for plant leaf disease detection,” Electronics,
vol. 11, no. 8, 2022, doi: 10.3390/electronics11081266.
[8] A. R. M. Rom, N. Jamil, and S. Ibrahim, “Multi objective hyperparameter tuning via random search on deep learning models,”
Telkomnika (Telecommunication Computing Electronics and Control), vol. 22, no. 4, pp. 956–968, 2024, doi:
10.12928/TELKOMNIKA.v22i4.25847.
[9] A. Mathern et al., “Multi-objective constrained bayesian optimization for structural design,” Structural and Multidisciplinary
Optimization, vol. 63, no. 2, pp. 689–701, 2021, doi: 10.1007/s00158-020-02720-2.
[10] A. Nasayreh et al., “Arabic sentiment analysis for chatgpt using machine learning classification algorithms: a hyperparameter
optimization technique,” ACM Transactions on Asian and Low-Resource Language Information Processing, vol. 23, no. 3, 2024,
doi: 10.1145/3638285.
[11] O. Stephen and M. Sain, “Using deep learning with bayesian-gaussian inspired convolutional neural architectural search for
cancer recognition and classification from histopathological image frames,” Journal of Healthcare Engineering, vol. 2023, 2023,
doi: 10.1155/2023/4597445.
[12] S. Hanifi, A. Cammarono, and H. Zare-Behtash, “Advanced hyperparameter optimization of deep learning models for wind power
prediction,” Renewable Energy, vol. 221, 2024, doi: 10.1016/j.renene.2023.119700.
[13] A. Morales-Hernández, I. V. Nieuwenhuyse, and S. R. Gonzalez, “A survey on multi-objective hyperparameter optimization
algorithms for machine learning,” Artificial Intelligence Review, vol. 56, no. 8, pp. 8043–8093, 2023, doi: 10.1007/s10462-022-
10359-2.
[14] M. A. Amirabadi, M. H. Kahaei, and S. A. Nezamalhosseini, “Novel suboptimal approaches for hyperparameter tuning of deep
neural network [under the shelf of optical communication],” Physical Communication, vol. 41, 2020, doi:
10.1016/j.phycom.2020.101057.
[15] B. Gülmez, “A new multi-objective hyperparameter optimization algorithm for covid-19 detection from x-ray images,” Soft
Computing, vol. 28, pp. 11601–11617, 2024, doi: 10.1007/s00500-024-09872-z.
[16] L. Fromberg, T. Nielsen, F. D. Frumosu, and L. K. H. Clemmensen, “Beyond accuracy: fairness, scalability, and uncertainty
considerations in facial emotion recognition,” in Proceedings of the 5th Northern Lights Deep Learning Conference (NLDL), 2024.
[17] S. S. Mostafa, F. Mendonca, A. G. Ravelo-Garcia, G. Julia-Serda, and F. Morgado-Dias, “Multi-objective hyperparameter
optimization of convolutional neural network for obstructive sleep apnea detection,” IEEE Access, vol. 8, pp. 129586–129599,
2020, doi: 10.1109/ACCESS.2020.3009149.
[18] S. P. Chen, J. Wu, and X. Y. Liu, “EMORL: effective multi-objective reinforcement learning method for hyperparameter
optimization,” Engineering Applications of Artificial Intelligence, vol. 104, 2021, doi: 10.1016/j.engappai.2021.104315.
[19] H. Alibrahim and S. A. Ludwig, “Hyperparameter optimization: comparing genetic algorithm against grid search and bayesian
optimization,” 2021 IEEE Congress on Evolutionary Computation, CEC 2021, pp. 1551–1559, 2021, doi:
10.1109/CEC45853.2021.9504761.
[20] X. Wang, Y. Jin, S. Schmitt, and M. Olhofer, “Recent advances in bayesian optimization,” in ACM Computing Surveys, vol. 55,
no. 13s, Jul. 2023, pp.1–36, doi: 10.1145/3582078.
[21] J. Bergstra, R. Bardenet, Y. Bengio and B. Kégl, “Algorithms for hyper-parameter optimization,” in NIPS’11: Proceedings of the
24th International Conference on Neural Information Processing Systems, 2011.
[22] L. Deng, “The mnist database of handwritten digit images for machine learning research,” IEEE Signal Processing Magazine, vol.
29, no. 6, pp. 141–142, 2012, doi: 10.1109/MSP.2012.2211477.
[23] Q. Gou and Y. Ren, “Research on multi-scale cnn and transformer-based multi-level multi-classification method for images,”
IEEE Access, vol. 12, pp. 103049–103059, 2024, doi: 10.1109/ACCESS.2024.3433374.
[24] Z. Gao and D. S. Boning, “A review of bayesian methods in electronic design automation,” arXiv-Statistics, pp. 1–24, 2023.
[25] H. M. Rai, K. Chatterjee, and S. Dashkevich, “Automatic and accurate abnormality detection from brain mr images using a novel
hybrid unetresnext-50 deep CNN model,” Biomedical Signal Processing and Control, vol. 66, 2021, doi:
10.1016/j.bspc.2021.102477.
[26] S. M. Jeong, S. G. Lee, C. L. Seok, E. C. Lee, and J. Y. Lee, “Lightweight deep learning model for real-time colorectal polyp
segmentation,” Electronics, vol. 12, no. 9, 2023, doi: 10.3390/electronics12091962.