On Hyperparameter Optimization of Machine Learning Algorithms: Theory and Practice
Neurocomputing
journal homepage: www.elsevier.com/locate/neucom
Article info

Article history: Received 13 December 2019; Revised 14 May 2020; Accepted 16 July 2020; Available online 25 July 2020
Communicated by Yuhua Cheng

Keywords: Hyper-parameter optimization; Machine learning; Bayesian optimization; Particle swarm optimization; Genetic algorithm; Grid search

Abstract

Machine learning algorithms have been used widely in various applications and areas. To fit a machine learning model into different problems, its hyper-parameters must be tuned. Selecting the best hyper-parameter configuration for machine learning models has a direct impact on the model’s performance. It often requires deep knowledge of machine learning algorithms and appropriate hyper-parameter optimization techniques. Although several automatic optimization techniques exist, they have different strengths and drawbacks when applied to different types of problems. In this paper, optimizing the hyper-parameters of common machine learning models is studied. We introduce several state-of-the-art optimization techniques and discuss how to apply them to machine learning algorithms. Many available libraries and frameworks developed for hyper-parameter optimization problems are provided, and some open challenges of hyper-parameter optimization research are also discussed in this paper. Moreover, experiments are conducted on benchmark datasets to compare the performance of different optimization methods and provide practical examples of hyper-parameter optimization. This survey paper will help industrial users, data analysts, and researchers to better develop machine learning models by identifying the proper hyper-parameter configurations effectively.

© 2020 Elsevier B.V. All rights reserved.
E-mail addresses: lyang339@uwo.ca (L. Yang), abdallah.shami@uwo.ca (A. Shami)
https://doi.org/10.1016/j.neucom.2020.07.061
0925-2312/© 2020 Elsevier B.V. All rights reserved.
L. Yang, A. Shami / Neurocomputing 415 (2020) 295–316

1. Introduction

Machine learning (ML) algorithms have been widely used in many application domains, including advertising, recommendation systems, computer vision, natural language processing, and user behavior analytics [1]. This is because they are generic and demonstrate high performance in data analytics problems. Different ML algorithms are suitable for different types of problems or datasets [2]. In general, building an effective machine learning model is a complex and time-consuming process that involves determining the appropriate algorithm and obtaining an optimal model architecture by tuning its hyper-parameters (HPs) [3].

Two types of parameters exist in machine learning models: model parameters, which can be initialized and updated through the data learning process (e.g., the weights of neurons in neural networks), and hyper-parameters, which cannot be directly estimated from data learning and must be set before training a ML model because they define the model architecture [4]. Hyper-parameters are the parameters that are used to either configure a ML model (e.g., the penalty parameter C in a support vector machine, and the learning rate to train a neural network) or to specify the algorithm used to minimize the loss function (e.g., the activation function and optimizer types in a neural network, and the kernel type in a support vector machine) [5].

To build an optimal ML model, a range of possibilities must be explored. The process of designing the ideal model architecture with an optimal hyper-parameter configuration is named hyper-parameter tuning. Tuning hyper-parameters is considered a key component of building an effective ML model, especially for tree-based ML models and deep neural networks, which have many hyper-parameters [6]. The hyper-parameter tuning process differs among ML algorithms due to their different types of hyper-parameters, including categorical, discrete, and continuous hyper-parameters [7]. Manual testing is a traditional way to tune hyper-parameters and is still prevalent in graduate student research, although it requires a deep understanding of the used ML algorithms and their hyper-parameter value settings [8]. However, manual tuning is ineffective for many problems due to certain factors, including a large number of hyper-parameters, complex models, time-consuming model evaluations, and non-linear hyper-parameter interactions. These factors have inspired increased research in techniques for automatic optimization of hyper-parameters, so-called hyper-parameter optimization (HPO)
[9]. The main aim of HPO is to automate the hyper-parameter tuning process and make it possible for users to apply machine learning models to practical problems effectively [3]. The optimal model architecture of a ML model is expected to be obtained after a HPO process. Some important reasons for applying HPO techniques to ML models are as follows [6]:

1. It reduces the human effort required, since many ML developers spend considerable time tuning the hyper-parameters, especially for large datasets or complex ML algorithms with a large number of hyper-parameters.
2. It improves the performance of ML models. Many ML hyper-parameters have different optimums to achieve best performance in different datasets or problems.
3. It makes the models and research more reproducible. Only when the same level of hyper-parameter tuning process is implemented can different ML algorithms be compared fairly; hence, using the same HPO method on different ML algorithms also helps to determine the most suitable ML model for a specific problem.

It is crucial to select an appropriate optimization technique to detect optimal hyper-parameters. Traditional optimization techniques may be unsuitable for HPO problems, since many HPO problems are non-convex or non-differentiable optimization problems, for which such techniques may return a local instead of a global optimum [10]. Gradient descent-based methods are a common type of traditional optimization algorithm that can be used to tune continuous hyper-parameters by calculating their gradients [11]. For example, the learning rate in a neural network can be optimized by a gradient-based method.

Compared with traditional optimization methods like gradient descent, many other optimization techniques are more suitable for HPO problems, including decision-theoretic approaches, Bayesian optimization models, multi-fidelity optimization techniques, and metaheuristic algorithms [7]. Apart from detecting continuous hyper-parameters, many of these algorithms also have the capacity to effectively identify discrete, categorical, and conditional hyper-parameters.

Decision-theoretic methods are based on the concept of defining a hyper-parameter search space and then detecting the hyper-parameter combinations in the search space, ultimately selecting the best-performing hyper-parameter combination. Grid search (GS) [12] is a decision-theoretic approach that exhaustively searches for the optimal configuration in a fixed domain of hyper-parameters. Random search (RS) [13] is another decision-theoretic method that randomly selects hyper-parameter combinations in the search space, given limited execution time and resources. In GS and RS, each hyper-parameter configuration is treated independently.

Unlike GS and RS, Bayesian optimization (BO) [14] models determine the next hyper-parameter value based on the previous results of tested hyper-parameter values, which avoids many unnecessary evaluations; thus, BO can detect the optimal hyper-parameter combination within fewer iterations than GS and RS. To be applied to different problems, BO can model the distribution of the objective function using different models as the surrogate function, including Gaussian process (GP), random forest (RF), and tree-structured Parzen estimators (TPE) models [15]. BO-RF and BO-TPE can retain the conditionality of variables [15]. Thus, they can be used to optimize conditional hyper-parameters, like the kernel type and the penalty parameter C in a support vector machine (SVM). However, since BO models work sequentially to balance the exploration of unexplored areas and the exploitation of currently-tested regions, it is difficult to parallelize them.

Training a ML model often takes considerable time and space. Multi-fidelity optimization algorithms are developed to tackle problems with limited resources, the most common being bandit-based algorithms. Hyperband [16] is a popular bandit-based optimization technique that can be considered an improved version of RS. It generates small versions of datasets and allocates the same budget to each hyper-parameter combination. In each iteration of Hyperband, poorly-performing hyper-parameter configurations are eliminated to save time and resources.

Metaheuristic algorithms are a set of techniques used to solve complex, large-search-space, and non-convex optimization problems, to which HPO problems belong [17]. Among all metaheuristic methods, genetic algorithm (GA) [18] and particle swarm optimization (PSO) [19] are the two most prevalent metaheuristic algorithms used for HPO problems. Genetic algorithms detect well-performing hyper-parameter combinations in each generation, and pass them to the next generation until the best-performing combination is identified. In PSO algorithms, each particle communicates with other particles to detect and update the current global optimum in each iteration until the final optimum is detected. Metaheuristics can efficiently explore the search space to detect optimal or near-optimal solutions. Hence, they are particularly suitable for HPO problems with a large configuration space due to their high efficiency. For instance, they can be used in deep neural networks (DNNs), which have a large configuration space with multiple hyper-parameters, including the activation and optimizer types, the learning rate, drop-out rate, etc.

Although using HPO algorithms to tune the hyper-parameters of ML models greatly improves the model performance, certain other aspects, like their computational complexity, still have much room for improvement. On the other hand, since different HPO models have their own advantages and suitable problems, overviewing them is necessary for proper optimization algorithm selection in terms of different types of ML models and problems. This paper makes the following contributions:

1. It reviews common ML algorithms and their important hyper-parameters.
2. It analyzes common HPO techniques, including their benefits and drawbacks, to help apply them to different ML models by appropriate algorithm selection in practical problems.
3. It surveys common HPO libraries and frameworks for practical use.
4. It discusses the open challenges and research directions of the HPO research domain.

In this survey paper, we begin with a comprehensive introduction of the common optimization techniques used in ML hyper-parameter tuning problems. Section 2 introduces the main concepts of mathematical optimization and hyper-parameter optimization, as well as the general HPO process. In Section 3, we discuss the key hyper-parameters of common ML models that need to be tuned. Section 4 covers the various state-of-the-art optimization approaches that have been proposed for tackling HPO problems. In Section 5, we analyze different HPO methods and discuss how they can be applied to ML algorithms. In Section 6, we provide an introduction to various public libraries and frameworks that are developed to implement HPO. Section 7 presents and discusses the experimental results of using HPO on benchmark datasets for HPO method comparison and practical use case demonstration. In Section 8, we discuss several research directions and open challenges that should be considered to improve current HPO models or develop new HPO approaches. We conclude the paper in Section 9.
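As a rough illustration of the two decision-theoretic strategies described in this introduction, the following sketch runs grid search and random search on a toy objective. The function `validation_error`, the hyper-parameter names, and their ranges are assumptions made for this example only; in practice the objective would be a cross-validated model score, and each configuration is still evaluated independently of the others.

```python
import itertools
import random


def validation_error(learning_rate, n_layers):
    """Hypothetical objective (lower is better); its minimum is at (0.01, 3)."""
    return (learning_rate - 0.01) ** 2 * 1e4 + (n_layers - 3) ** 2


def grid_search(grid):
    """Evaluate every combination in the Cartesian product of the grid."""
    names = list(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = validation_error(**config)
        if best is None or score < best[1]:
            best = (config, score)
    return best


def random_search(n_trials, seed=0):
    """Evaluate n_trials configurations drawn at random from the search space."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        config = {
            "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform sample
            "n_layers": rng.randint(1, 6),               # integer in [1, 6]
        }
        score = validation_error(**config)
        if best is None or score < best[1]:
            best = (config, score)
    return best


grid = {
    "learning_rate": [0.0001, 0.001, 0.01, 0.1],
    "n_layers": [1, 2, 3, 4, 5, 6],
}
gs_config, gs_score = grid_search(grid)          # 4 x 6 = 24 evaluations
rs_config, rs_score = random_search(n_trials=24)  # same evaluation budget
print("grid search:  ", gs_config, gs_score)
print("random search:", rs_config, rs_score)
```

Both searches use the same budget of 24 evaluations here. Grid search can only return points that lie on the predefined grid, whereas random search samples the continuous range directly.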
2. Mathematical optimization and hyper-parameter optimization problems

The key process of machine learning is to solve optimization problems. To build a ML model, its weight parameters are initialized and optimized by an optimization method until the objective function approaches a minimum value or the accuracy approaches a maximum value [20]. Similarly, hyper-parameter optimization methods aim to optimize the architecture of a ML model by detecting the optimal hyper-parameter configurations. In this section, the main concepts of mathematical optimization and hyper-parameter optimization for machine learning models are discussed.

2.1. Mathematical optimization

Mathematical optimization is the process of finding the best solution from a set of available candidates to maximize or minimize the objective function [20]. Generally, optimization problems can be classified as constrained or unconstrained optimization problems based on whether they have constraints for the decision variables or the solution variables.

In unconstrained optimization problems, a decision variable, $x$, can take any values from the one-dimensional space of all real numbers, $\mathbb{R}$. An unconstrained optimization problem can be denoted by [21]:

$$\min_{x \in \mathbb{R}} f(x), \tag{1}$$

where $f(x)$ is the objective function.

On the other hand, most real-life optimization problems are constrained optimization problems. The decision variable $x$ for constrained optimization problems should be subject to certain constraints which could be mathematical equalities or inequalities. Therefore, constrained optimization problems or general optimization problems can be expressed as [21]:

$$\begin{aligned} \min_{x} \quad & f(x) \\ \text{subject to} \quad & g_i(x) \le 0, \quad i = 1, 2, \ldots, m, \\ & h_j(x) = 0, \quad j = 1, 2, \ldots, p, \\ & x \in X, \end{aligned} \tag{2}$$

where $g_i(x)$, $i = 1, 2, \ldots, m$, are the inequality constraint functions; $h_j(x)$, $j = 1, 2, \ldots, p$, are the equality constraint functions; and $X$ is the domain of $x$.

The role of constraints is to limit the possible values of the optimal solution to certain areas of the search space, named the feasible region [21]. Thus, the feasible region $D$ of $x$ can be represented by:

$$D = \{ x \in X \mid g_i(x) \le 0,\ h_j(x) = 0 \}. \tag{3}$$

To conclude, an optimization problem consists of three major components: a set of decision variables $x$, an objective function $f(x)$ to be either minimized or maximized, and a set of constraints that allow the variables to take on values in certain ranges (if it is a constrained optimization problem). Therefore, the goal of optimization tasks is to obtain the set of variable values that minimize or maximize the objective function while satisfying any applicable constraints.

Regarding ML models, many HPO problems have certain constraints, like the feasible domain of the number of clusters in k-means, as well as time and space constraints. Therefore, constrained optimization techniques are widely used in HPO problems [3].

For optimization problems, in many cases, only a local instead of a global optimum can be obtained. For example, to obtain the minimum of a problem, assuming $D$ is the feasible region of a decision variable $x$, a global minimum is the point $x^* \in D$ satisfying $f(x^*) \le f(x)\ \forall x \in D$, while a local minimum is a point $x^* \in D$ in a neighborhood $N$ satisfying $f(x^*) \le f(x)\ \forall x \in N \cap D$ [21]. Thus, the local optimum may only be an optimum in a small range instead of being the optimal solution in the entire feasible region.

A local optimum is only guaranteed to be the global optimum in convex functions [22]. Convex functions are the functions that only have one optimum. Therefore, continuing to search along the direction in which the objective function decreases can detect the global minimum value. A function $f(x)$ is a convex function if for $\forall x_1, x_2 \in X$, $\forall t \in [0, 1]$,

$$f(t x_1 + (1 - t) x_2) \le t f(x_1) + (1 - t) f(x_2), \tag{4}$$

where $X$ is the domain of decision variables, and $t$ is a coefficient in the range of $[0, 1]$.

An optimization problem is a convex optimization problem only when the objective function $f(x)$ is a convex function and the feasible region $C$ is a convex set, denoted by [22]:

$$\begin{aligned} \min_{x} \quad & f(x) \\ \text{subject to} \quad & x \in C. \end{aligned} \tag{5}$$

On the other hand, nonconvex functions have multiple local optimums, but only one of these optimums is the global optimum. Most ML and HPO problems are nonconvex optimization problems. Thus, utilizing inappropriate optimization methods may only result in a local instead of a global optimum.

There are many traditional methods that can be used to solve optimization problems, including gradient descent, Newton’s method, conjugate gradient, and heuristic optimization methods [20]. Gradient descent is a commonly-used optimization method that uses the negative gradient direction as the search direction to move towards the optimum. However, gradient descent is not guaranteed to find the global optimum unless the objective function is a convex function. Newton’s method uses the inverse matrix of the Hessian matrix to obtain the optimum. Newton’s method has faster convergence speed than gradient descent, but often requires more time and larger space than gradient descent to store and calculate the Hessian matrix. Conjugate gradient searches along the conjugated directions constructed by the gradient of known data points to detect the optimum. Conjugate gradient has faster convergence speed than gradient descent, but its calculation of the conjugate directions is more complex. Unlike other traditional methods, heuristic methods use empirical rules to solve the optimization problems instead of following systematic steps to obtain the solution. Heuristic methods can often detect the approximate global optimum within a few iterations, but are not guaranteed to find the global optimum [20].

2.2. Hyper-parameter optimization problem statement

During the design process of ML models, effectively searching the hyper-parameters’ space using optimization techniques can identify the optimal hyper-parameters for the models. The hyper-parameter optimization process consists of four main components: an estimator (a regressor or a classifier) with its objective function, a search space (configuration space), a search or optimization method used to find hyper-parameter combinations, and an evaluation function to compare the performance of different hyper-parameter configurations.

The domain of a hyper-parameter can be continuous (e.g., learning rate), discrete (e.g., number of clusters), binary (e.g., whether to use early stopping or not), or categorical (e.g., type of optimizer).
Therefore, hyper-parameters are classified as continuous, discrete, and categorical hyper-parameters. For continuous and discrete hyper-parameters, their domains are usually bounded in practical applications [12,23]. On the other hand, the hyper-parameter configuration space sometimes contains conditionality. A hyper-parameter that needs to be used or tuned only depending on the value of another hyper-parameter is called a conditional hyper-parameter [10]. For instance, in SVM, the degree of the polynomial kernel function only needs to be tuned when the kernel type is chosen to be polynomial.

In simple cases, all hyper-parameters can take unrestricted real values, and the feasible set $X$ of hyper-parameters can be a real-valued n-dimensional vector space. However, in most cases, the hyper-parameters of a ML model take on values from different domains and have different constraints, so their optimization problems are often complex constrained optimization problems [24]. For instance, the number of considered features in a decision tree should be in the range of 0 to the number of features, and the number of clusters in k-means should not be larger than the number of data points. Additionally, categorical hyper-parameters can often only take a few specific values, like the limited choices of the activation function and the optimizer of a neural network. Therefore, the feasible domain of $X$ often has a complex structure, which increases the problems’ complexity [24].

In general, for a hyper-parameter optimization problem, the aim is to obtain [19]:

$$x^* = \arg\min_{x \in X} f(x), \tag{6}$$

1. The optimization target, the objective function of ML models, is usually a non-convex and non-differentiable function. Therefore, many traditional optimization methods designed to solve convex or differentiable optimization problems are often unsuitable for HPO problems, since these methods may return a local optimum instead of a global optimum. Additionally, an optimization target lacking smoothness makes certain traditional derivative-free optimization models perform poorly for HPO problems [26].
2. The hyper-parameters of ML models include continuous, discrete, categorical, and conditional hyper-parameters. Thus, many traditional numerical optimization methods [27] that only aim to tackle numerical or continuous variables are unsuitable for HPO problems.
3. It is often computationally expensive to train a ML model on a large-scale dataset. HPO techniques sometimes use data sampling to obtain approximate values of the objective function. Thus, effective optimization techniques for HPO problems should be able to use these approximate values. However, function evaluation time is often ignored in many black-box optimization (BBO) models, so they often require exact instead of approximate objective function values. Consequently, many BBO algorithms are often unsuitable for HPO problems with limited time and resource budgets.

Therefore, appropriate optimization algorithms should be applied to HPO problems to identify optimal hyper-parameter configurations for ML models.
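The mixed and conditional structure of a configuration space can be made concrete with a small sketch. The example below samples from a hypothetical SVM search space with continuous, categorical, binary, discrete, and conditional hyper-parameters, and then picks the sampled configuration with the lowest objective value, in the spirit of Eq. (6). The ranges, the `shrinking` flag, and `toy_objective` are assumptions for illustration, not values taken from the paper.

```python
import math
import random


def sample_svm_config(rng):
    """Draw one configuration from a hypothetical SVM search space."""
    config = {
        "C": 10 ** rng.uniform(-2, 2),                               # continuous
        "kernel": rng.choice(["linear", "poly", "rbf", "sigmoid"]),  # categorical
        "shrinking": rng.choice([True, False]),                      # binary
    }
    # Conditional hyper-parameter: 'degree' is only sampled when the
    # polynomial kernel is selected.
    if config["kernel"] == "poly":
        config["degree"] = rng.randint(2, 5)                         # discrete
    return config


def toy_objective(config):
    """Stand-in for a validation error; made up for illustration only."""
    kernel_penalty = {"linear": 0.3, "poly": 0.2, "rbf": 0.1, "sigmoid": 0.4}
    return (
        kernel_penalty[config["kernel"]]
        + 0.05 * abs(math.log10(config["C"]))
        + 0.01 * config.get("degree", 0)
    )


rng = random.Random(42)
candidates = [sample_svm_config(rng) for _ in range(50)]

# x* = arg min f(x), restricted to the sampled configurations x in X.
best = min(candidates, key=toy_objective)
print(best, toy_objective(best))
```

Because `degree` is absent whenever the kernel is not polynomial, an optimizer working on this space has to handle configurations of different shapes, which is one reason purely numerical methods are a poor fit.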
shrinkage; thus, the coefficients are also more robust to collinearity.

Lasso regression [37] is another linear model used to estimate sparse coefficients, consisting of a linear model with an added $L_1$ regularization term. It aims to minimize the objective function [36]:

$$\alpha \lVert w \rVert_1 + \sum_{i=1}^{p} (y_i - w_i x_i)^2, \tag{10}$$

where $\alpha$ is the regularization strength and $\lVert w \rVert_1$ is the $L_1$-norm of the coefficient vector. Therefore, the regularization strength $\alpha$ is a crucial hyper-parameter of both ridge and lasso regression models.

Logistic regression (LR) [38] is a linear model used for classification problems. In LR, its cost function may be different, depending on the regularization method chosen for the penalization. There are three main types of regularization methods in LR: $L_1$-norm, $L_2$-norm, and elastic-net regularization [39].

Therefore, the first hyper-parameter that needs to be tuned in LR is the regularization method used in the penalization, ‘l1’, ‘l2’, ‘elasticnet’ or ‘none’, which is called ‘penalty’ in sklearn. The coefficient, ‘C’, is another essential hyper-parameter that determines the regularization strength of the model. In addition, the ‘solver’ type, representing the optimization algorithm type, can be set to ‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, or ‘saga’ in LR. The ‘solver’ type has correlations with ‘penalty’ and ‘C’, so they are conditional hyper-parameters.

$$\arg\min_{w} \frac{1}{n} \sum_{i=1}^{n} \max\{0,\ 1 - y_i f(x_i)\} + C w^{T} w, \tag{12}$$

where $w$ is a normalization vector; $C$ is the penalty parameter of the error term, which is an important hyper-parameter of all SVM models.

The kernel function $f(x)$, which is used to measure the similarity between two data points $x_i$ and $x_j$, can be chosen from multiple types of kernels in SVM models. Therefore, the kernel type would be a vital hyper-parameter to be tuned. Common kernel types in SVM include linear kernels, radial basis function (RBF), polynomial kernels, and sigmoid kernels.

The different kernel functions can be denoted as follows [45]:

1. Linear kernel:

$$f(x) = x_i^{T} x_j, \tag{13}$$

2. Polynomial kernel:

$$f(x) = (\gamma x_i^{T} x_j + r)^{d}, \tag{14}$$

3. RBF kernel:

$$f(x) = \exp\left(-\gamma \lVert x - x' \rVert^{2}\right), \tag{15}$$

4. Sigmoid kernel:

$$f(x) = \tanh\left(\gamma x_i^{T} x_j + r\right), \tag{16}$$
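To make the role of the kernel hyper-parameters concrete, the four kernels of Eqs. (13)-(16) can be written out directly. This is a minimal sketch in plain Python; the function names and default values are assumptions for this example, with `gamma`, `r`, and `d` standing for the kernel coefficient, the constant term, and the polynomial degree.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def linear_kernel(xi, xj):
    """Eq. (13): plain inner product, no extra hyper-parameters."""
    return dot(xi, xj)


def polynomial_kernel(xi, xj, gamma=1.0, r=0.0, d=3):
    """Eq. (14): (gamma * <xi, xj> + r) ** d."""
    return (gamma * dot(xi, xj) + r) ** d


def rbf_kernel(xi, xj, gamma=1.0):
    """Eq. (15): exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(xi, xj, gamma=1.0, r=0.0):
    """Eq. (16): tanh(gamma * <xi, xj> + r)."""
    return math.tanh(gamma * dot(xi, xj) + r)


xi, xj = [1.0, 2.0], [3.0, 0.5]
print(linear_kernel(xi, xj))
print(polynomial_kernel(xi, xj, gamma=0.5, r=1.0, d=2))
print(rbf_kernel(xi, xj, gamma=0.1))
print(sigmoid_kernel(xi, xj, gamma=0.1, r=0.0))
```

The signatures show which hyper-parameters are conditional on the kernel type: `d` matters only for the polynomial kernel, `r` only for the polynomial and sigmoid kernels, and `gamma` for every kernel except the linear one.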
continuous hyper-parameters (‘gamma’, ‘alpha’, ‘lambda’, and ‘learning_rate’), indicating the minimum loss reduction for a split, the $L_1$ and $L_2$ regularization terms on weights, and the learning rate, respectively.

3.1.6. Ensemble learning algorithms

Apart from tree-based ensemble models, there are several other generic ensemble learning methods that combine multiple singular ML models to achieve better model performance than any singular algorithm alone. Three common ensemble learning models (voting, bagging, and AdaBoost) are introduced in this subsection [59].

Voting [59] is a basic ensemble learning algorithm that uses the majority voting rule to combine singular estimators and generate a comprehensive estimator with improved accuracy. In sklearn, the voting method can be set to be ‘hard’ or ‘soft’, indicating whether to use majority voting or averaged predicted probabilities to determine the classification result. The list of selected single ML estimators and their weights can also be tuned in certain cases. For instance, a higher weight can be assigned to a better-performing singular ML model in a voting model.

Bootstrap aggregating [59], also named bagging, trains multiple base estimators on different randomly-extracted subsets to construct a final predictor [130]. When using bagging methods, the first consideration should be the type and number of base estimators in the ensemble, denoted by ‘base_estimator’ and ‘n_estimators’, respectively. Then, the ‘max_samples’ and ‘max_features’, indicating the sample size and feature size to generate different subsets, can also be tuned.

AdaBoost [59], short for adaptive boosting, is an ensemble learning method that trains multiple base learners (weak learners) consecutively, with later learners emphasizing the misclassified samples of previous learners; ultimately, a final strong learner is obtained. During this process, incorrectly-classified instances are retrained with other new instances, and their weights are adjusted so that the subsequent classifiers focus more on difficult cases, thereby gradually building a stronger classifier. In AdaBoost, the type of base estimator, ‘base_estimator’, can be set to a decision tree or other methods. In addition, the maximum number of estimators at which boosting is terminated, ‘n_estimators’, and the learning rate that shrinks the contribution of each classifier, should also be tuned to achieve a trade-off between these two hyper-parameters.

3.1.7. Deep learning models

Deep learning (DL) algorithms are widely applied to various areas, like computer vision, natural language processing, and machine translation, since they have had great success solving many types of problems. DL models are based on the theory of artificial neural networks (ANNs). Common types of DL architectures include deep neural networks (DNNs), feedforward neural networks (FFNNs), deep belief networks (DBNs), convolutional neural networks (CNNs), recurrent neural networks (RNNs) and many more [60]. All these DL models have similar hyper-parameters since they have similar underlying neural network architecture. Compared with other ML models, DL models benefit more from HPO since they often have many hyper-parameters that require tuning.

The first set of hyper-parameters is related to the construction of a DL model and is hence named model design hyper-parameters. Since all neural network models have an input layer and an output layer, the complexity of a deep learning model mainly depends on the number of hidden layers and the number of neurons in each layer, which are two main hyper-parameters to build DL models [61]. These two hyper-parameters are set and tuned according to the complexity of the datasets or the problems. DL models need to have enough capacity to model objective functions (or prediction tasks) while avoiding over-fitting. At the next stage, certain function types need to be set or tuned. The first function type to configure is the loss function type, which is chosen mainly based on the problem type (e.g., binary cross-entropy for binary classification problems, multi-class cross-entropy for multi-class classification problems, and RMSE for regression problems). Another important hyper-parameter is the activation function type used to model non-linear functions, which can be set to ‘softmax’, ‘rectified linear unit (ReLU)’, ‘sigmoid’, ‘tanh’, or ‘softsign’. Lastly, the optimizer type can be set to stochastic gradient descent (SGD), adaptive moment estimation (Adam), root mean square propagation (RMSprop), etc. [62].

On the other hand, some other hyper-parameters are related to the optimization and training process of DL models and are hence categorized as optimizer hyper-parameters. The learning rate is one of the most important hyper-parameters in DL models [63]. It determines the step size at each iteration, which enables the objective function to converge. A large learning rate speeds up the learning process, but the gradient may oscillate around a local minimum value or even fail to converge. On the other hand, a small learning rate converges smoothly, but will largely increase model training time by requiring more training epochs. An appropriate learning rate should enable the objective function to converge to a global minimum in a reasonable amount of time. Another common hyper-parameter is the drop-out rate. Drop-out is a standard regularization method for DL models proposed to reduce over-fitting. In drop-out, a proportion of neurons are randomly removed, and the percentage of neurons to be removed should be tuned.

Mini-batch size and the number of epochs are the other two DL hyper-parameters that represent the number of processed samples before updating the model, and the number of complete passes through the entire training set, respectively [64]. Mini-batch size is affected by the resource requirements of the training process and the number of iterations. The number of epochs depends on the size of the training set and should be tuned by slowly increasing its value until validation accuracy starts to decrease, which indicates over-fitting. On the other hand, DL models often converge within a few epochs, and the following epochs may lead to unnecessary additional execution time and over-fitting, which can be avoided by the early stopping method. Early stopping is a form of regularization whereby model training stops in advance when validation accuracy does not increase after a certain number of consecutive epochs. The number of waiting epochs, called early stop patience, can also be tuned to reduce model training time.

Apart from traditional DL models, transfer learning (TL) is a technology that obtains a pre-trained model on the data in a related domain and transfers it to other target tasks [65]. To transfer a DL model from one problem to another problem, a certain number of top layers are frozen, and only the remaining layers are retrained to fit the new problem. Therefore, the number of frozen layers is a vital hyper-parameter to tune if TL is used.

3.2. Unsupervised learning algorithms

Unsupervised learning algorithms are a set of ML algorithms used to identify unknown patterns in unlabeled datasets. Clustering and dimensionality-reduction algorithms are the two main types of unsupervised learning methods. Clustering methods include k-means, DBSCAN, EM, hierarchical clustering, etc.; while PCA and LDA are two commonly-used dimensionality reduction algorithms [29].

3.2.1. Clustering algorithms

For most clustering algorithms, including k-means, EM, and hierarchical clustering, the number of clusters is the most important hyper-parameter to tune [66].
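To give a feel for why the number of clusters dominates clustering quality, the sketch below runs a tiny one-dimensional k-means (Lloyd's iterations) for several values of k and records the sum of squared errors (SSE) to the nearest centroid. The data, the deterministic initialization, and the function name `kmeans_1d` are assumptions made for this illustration; library implementations typically use k-means++ or random seeding instead.

```python
def kmeans_1d(points, k, n_iter=20):
    """Toy 1-D k-means; returns (centroids, sum of squared errors)."""
    pts = sorted(points)
    # Deterministic initialization from evenly spaced order statistics
    # (an assumption of this sketch, chosen to keep the result reproducible).
    centroids = [pts[(2 * j + 1) * len(pts) // (2 * k)] for j in range(k)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda j: (p - centroids[j]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    sse = sum(min((p - c) ** 2 for c in centroids) for p in pts)
    return centroids, sse


# Three well-separated groups around 1, 5, and 9.
data = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8]
sse_by_k = {k: kmeans_1d(data, k)[1] for k in range(1, 6)}
for k, sse in sse_by_k.items():
    print(k, round(sse, 4))
```

On this toy data the SSE falls steeply up to k = 3, the number of well-separated groups, and changes little afterwards, so SSE alone cannot select k and has to be read together with model complexity.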
The k-means algorithm [67] uses k prototypes, indicating the centroids of clusters, to cluster data. In k-means algorithms, the number of clusters, ‘n_clusters’, must be specified, and is determined by minimizing the sum of squared errors [68]:

$$\sum_{i=0}^{n_k} \min_{u_j \in C_k} \left( \lVert x_i - u_j \rVert^2 \right), \tag{21}$$

where $(x_1, \ldots, x_n)$ is the data matrix; $u_j$, also called the centroid of the cluster $C_k$, is the mean of the samples in the cluster; and $n_k$ is the number of sample points in the cluster $C_k$.

To tune k-means, ‘n_clusters’ is the most crucial hyper-parameter. Besides this, the method for centroid initialization, ‘init’, could be set to ‘k-means++’, ‘random’ or a human-defined array, which slightly affects model performance. In addition, ‘n_init’, denoting the number of times that the k-means algorithm will be executed with different centroid seeds, and ‘max_iter’, the maximum number of iterations in a single execution of k-means, also have slight impacts on model performance [30].

The expectation–maximization (EM) algorithm [69] is an iterative algorithm used to detect the maximum likelihood estimation of parameters. The Gaussian mixture model is a clustering method that uses a mixture of Gaussian distributions to model data by implementing the EM method. Similar to k-means, its major hyper-parameter to be tuned is ‘n_components’, indicating the number of clusters or Gaussian distributions. Additionally, different methods can be chosen to constrain the covariance of the estimated classes in Gaussian mixture models, including ‘full covariance’, ‘tied’, ‘diagonal’ or ‘spherical’ [70]. Other hyper-parameters could also be tuned, including ‘max_iter’ and ‘tol’, representing the number of EM iterations to perform and the convergence threshold, respectively [30].

Hierarchical clustering [71] methods build clusters by continuously merging or splitting the built-in clusters. The hierarchy of clusters is represented by a tree-structure; its root indicates the unique cluster gathering all samples, and its leaves represent the clusters with only one sample [71]. In sklearn, the function ‘AgglomerativeClustering’ is a common type of hierarchical clustering. In agglomerative clustering, the linkage criterion, ‘linkage’, determines the distance between sets of observations and can be set to ‘ward’, ‘complete’, ‘average’, or ‘single’, indicating whether to minimize the variance of all the clusters, or use the maximum, average, or minimum distance between every two clusters, respectively. Like other clustering methods, its main hyper-parameter is the number of clusters, ‘n_clusters’. However, ‘n_clusters’ cannot be set if we choose to set the ‘distance_threshold’, the linkage distance threshold for merging clusters, since if so, ‘n_clusters’ will be determined automatically.

DBSCAN [72] is a density-based clustering method that determines the clusters by dividing data into clusters with sufficiently high density. Unlike other clustering models, the number of clusters does not need to be configured before training. Instead, DBSCAN has two significant conditional hyper-parameters (the scan radius represented by ‘eps’, and the minimum number of considered neighbor points represented by ‘min_samples’), which define the cluster density together [73]. DBSCAN works by starting with an unvisited point and detecting all its neighbor points within a pre-defined distance ‘eps’. If the number of neighbor points reaches the value of ‘min_samples’, this unvisited point and all its neighbors are defined as a cluster. The procedures are executed recursively until all data points have been visited. A higher ‘min_samples’ or a lower ‘eps’ indicates a higher density to form a cluster.

applications, many features are irrelevant or redundant to predict target variables. Dimensionality reduction algorithms often serve as feature engineering methods to extract important features and eliminate insignificant or redundant features. Two common dimensionality-reduction algorithms are principal component analysis (PCA) and linear discriminant analysis (LDA). In PCA and LDA, the number of features to be extracted, represented by ‘n_components’ in sklearn, is the main hyper-parameter to be tuned.

Principal component analysis (PCA) [74] is a widely used linear dimensionality reduction method. PCA is based on the concept of mapping the original n-dimensional features into k-dimension features as the new orthogonal features, also called the principal components. PCA works by calculating the covariance matrix of the data matrix to obtain the eigenvectors of the covariance matrix. The matrix comprises the eigenvectors of k features with the largest eigenvalues (i.e., the largest variance). Consequently, the data matrix can be transformed into a new space with a reduced dimensionality. Singular value decomposition (SVD) [75] is a popular method used to obtain the eigenvalues and eigenvectors of the covariance matrix of PCA. Therefore, in addition to ‘n_components’, the SVD solver type is another hyper-parameter of PCA to be tuned, which can be assigned to ‘auto’, ‘full’, ‘arpack’ or ‘randomized’ [30].

Linear discriminant analysis (LDA) [76] is another common dimensionality reduction method that projects the features onto the most discriminative directions. Unlike PCA, which obtains the direction with the largest variance as the principal component, LDA optimizes the feature subspace of classification. The objective of LDA is to minimize the variance inside each class and maximize the variance between different classes after projection. Thus, the projection points in each class should be as close as possible, and the distance between the center points of different classes should be as large as possible. Similar to PCA, the number of features to be extracted, ‘n_components’, should be tuned in LDA models. Additionally, the solver type of LDA can also be set to ‘svd’ for SVD, ‘lsqr’ for least-squares solution, or ‘eigen’ for eigenvalue decomposition [77]. LDA also has a conditional hyper-parameter, the shrinkage parameter, ‘shrinkage’, which can be set to a float value along with ‘lsqr’ and ‘eigen’ solvers.

4. Hyper-parameter optimization techniques

4.1. Model-free algorithms

4.1.1. Babysitting

Babysitting, also called ‘Trial and Error’ or grad student descent (GSD), is a basic hyper-parameter tuning method [8]. This method is implemented by 100% manual tuning and widely used by students and researchers. The workflow is simple: after building a ML model, a student tests many possible hyper-parameter values based on experience, guessing, or the analysis of previously-evaluated results; the process is repeated until this student runs out of time (often reaching a deadline) or is satisfied with the results. As such, this approach requires a sufficient amount of prior knowledge and experience to identify optimal hyper-parameter values with limited time.

Manual tuning is infeasible for many problems due to several factors, like a large number of hyper-parameters, complex models, time-consuming model evaluations, and non-linear hyper-parameter interactions [9]. These factors inspired increased research into techniques for the automatic optimization of hyper-parameters [78].
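The trial-and-error workflow can be illustrated with the ‘n_clusters’ hyper-parameter of k-means from Section 3.2.1. The following is a minimal sketch in plain Python, not sklearn code: the toy one-dimensional dataset, the helper name `kmeans_1d`, and its deterministic, evenly spaced initialization are all assumptions made for illustration. It computes the sum of squared errors of Eq. (21) for several manually chosen values of ‘n_clusters’.

```python
def kmeans_1d(points, n_clusters, max_iter=100):
    """Lloyd's algorithm on 1-D data; returns (centroids, sse)."""
    data = sorted(points)
    n = len(data)
    # Deterministic, evenly spaced initial centroids. This is an illustrative
    # stand-in for the 'init' hyper-parameter, not sklearn's 'k-means++'.
    centroids = [data[(i * n) // n_clusters + n // (2 * n_clusters)]
                 for i in range(n_clusters)]
    for _ in range(max_iter):
        # Assignment step: attach each sample to its nearest centroid.
        clusters = [[] for _ in centroids]
        for x in data:
            nearest = min(range(n_clusters),
                          key=lambda j: (x - centroids[j]) ** 2)
            clusters[nearest].append(x)
        # Update step: each centroid becomes the mean of its cluster.
        updated = [sum(c) / len(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    # Sum of squared errors, as in Eq. (21).
    sse = sum(min((x - u) ** 2 for u in centroids) for x in data)
    return centroids, sse


# Toy data (assumed): three tight groups around 1, 5 and 10.
data = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2, 9.8, 10.0, 10.2]

# "Babysitting": try a handful of values by hand and inspect the scores.
sse_by_k = {k: kmeans_1d(data, k)[1] for k in range(1, 6)}
for k, sse in sse_by_k.items():
    print(f"n_clusters={k}: SSE={sse:.2f}")
```

On this toy data the SSE falls from 122.24 with one cluster to 24.24 with two and 0.24 with three, so at most 0.24 remains to be gained by adding more clusters. A person tuning by hand would therefore settle on ‘n_clusters’ = 3, which is exactly the kind of judgment call that makes manual tuning hard to scale.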
sidered an exhaustive search or a brute-force method that evaluates all the hyper-parameter combinations given to the grid of configurations [131]. GS works by evaluating the Cartesian product of a user-specified finite set of values [6].

GS cannot exploit the well-performing regions further by itself. Therefore, to identify the global optimums, the following procedure needs to be performed manually [2]:

1. Start with a large search space and step size.
2. Narrow the search space and step size based on the previous results of well-performing hyper-parameter configurations.
3. Repeat step 2 multiple times until an optimum is reached.

GS can be easily implemented and parallelized. However, the main drawback of GS is its inefficiency for high-dimensionality hyper-parameter configuration space, since the number of evaluations increases exponentially as the number of hyper-parameters grows. This exponential growth is referred to as the curse of dimensionality [79]. For GS, assuming that there are k parameters, and each of them has n distinct values, its computational complexity increases exponentially at a rate of $O(n^k)$ [19]. Thus, only when the hyper-parameter configuration space is small can GS be an effective HPO method.

4.1.3. Random search

To overcome certain limitations of GS, random search (RS) was proposed in [13]. RS is similar to GS; but, instead of testing all values in the search space, RS randomly selects a pre-defined number of samples between the upper and lower bounds as candidate hyper-parameter values, and then trains these candidates until the defined budget is exhausted. The theoretical basis of RS is that if the configuration space is large enough, then the global optimums, or at least their approximations, can be detected. With a limited budget, RS is able to explore a larger search space than GS [13].

The main advantage of RS is that it is easily parallelized and resource-allocated since each evaluation is independent. Unlike GS, RS samples a fixed number of parameter combinations from the specified distribution, which improves system efficiency by reducing the probability of wasting much time on a small poor-performing region. Since the number of total evaluations in RS is set to a fixed value n before the optimization process starts, the computational complexity of RS is $O(n)$ [80]. In addition, RS can detect the global optimum or the near-global optimum when given enough budgets [6].

Although RS is more efficient than GS for large search spaces, there are still a large number of unnecessary function evaluations since it does not exploit the previously well-performing regions [2].

To conclude, the main limitation of both RS and GS is that every evaluation in their iterations is independent of previous evaluations; thus, they waste massive time evaluating poorly-performing areas of the search space. This issue can be solved by other optimization methods, like Bayesian optimization that uses previous evaluation records to determine the next evaluation [14].

4.2. Gradient-based optimization

Gradient descent [81] is a traditional optimization technique that calculates the gradient of variables to identify the promising direction and moves towards the optimum. After randomly selecting a data point, the technique moves towards the opposite direction of the largest gradient to locate the next data point. Therefore, a local optimum can be reached after convergence. The local optimum is also the global optimum for convex functions. Gradient-based algorithms have a time complexity of $O(n^k)$ for optimizing k hyper-parameters [82].

For specific machine learning algorithms, the gradient of certain hyper-parameters can be calculated, and then the gradient descent can be used to optimize these hyper-parameters. Although gradient-based algorithms have a faster convergence speed to reach local optimum than the previously-presented methods in Section 4.1, they have several limitations. Firstly, they can only be used to optimize continuous hyper-parameters because other types of hyper-parameters, like categorical hyper-parameters, do not have gradient directions. Secondly, they are only efficient for convex functions because the local instead of a global optimum may be reached for non-convex functions [2]. Therefore, the gradient-based algorithms can only be used in some cases where it is possible to obtain the gradient of hyper-parameters; e.g., optimizing the learning rate in neural networks (NN) [11]. Still, it is not guaranteed for ML algorithms to identify global optimums using gradient-based optimization techniques.

4.3. Bayesian optimization

Bayesian optimization (BO) [83] is an iterative algorithm that is popularly used for HPO problems. Unlike GS and RS, BO determines the future evaluation points based on the previously-obtained results. To determine the next hyper-parameter configuration, BO uses two key components: a surrogate model and an acquisition function [56]. The surrogate model aims to fit all the currently-observed points into the objective function. After obtaining the predictive distribution of the probabilistic surrogate model, the acquisition function determines the usage of different points by balancing the trade-off between exploration and exploitation. Exploration is to sample the instances in the areas that have not been sampled, while exploitation is to sample in the currently promising regions where the global optimum is most likely to occur, based on the posterior distribution. BO models balance the exploration and the exploitation processes to detect the current most likely optimal regions and avoid missing better configurations in the unexplored areas [84].

The basic procedures of BO are as follows [83]:

1. Build a probabilistic surrogate model of the objective function.
2. Detect the optimal hyper-parameter values on the surrogate model.
3. Apply these hyper-parameter values to the real objective function to evaluate them.
4. Update the surrogate model with new results.
5. Repeat steps 2–4 until the maximum number of iterations is reached.

Thus, BO works by updating the surrogate model after each evaluation on the objective function. BO is more efficient than GS and RS since it can detect the optimal hyper-parameter combinations by analyzing the previously-tested values, and running a surrogate model is often much cheaper than running the entire objective function.

However, since Bayesian optimization models are executed based on the previously-tested values, they belong to sequential methods that are difficult to parallelize; but they can usually detect near-optimal hyper-parameter combinations within a few iterations [7].

Common surrogate models for BO include Gaussian process (GP) [85], random forest (RF) [86], and the tree Parzen estimator (TPE) [12]. Therefore, there are three main types of BO algorithms based on their surrogate models: BO-GP, BO-RF, BO-TPE. An alternative name for BO-RF is sequential model-based algorithm configuration (SMAC) [86].

4.3.1. BO-GP

Gaussian process (GP) is a standard surrogate model for objective function modeling in BO [83]. Assuming that the function f with a mean $\mu$ and a covariance $\sigma^2$ is a realization of a GP, the predictions follow a normal distribution [87]:

$$p(y \mid x; D) = \mathcal{N}(y \mid \hat{\mu}, \hat{\sigma}^2), \tag{22}$$

where D is the configuration space of hyper-parameters, and $y = f(x)$ is the evaluation result of each hyper-parameter value x. After obtaining a set of predictions, the points to be evaluated next are then selected from the confidence intervals generated by the BO-GP model. Each newly-tested data point is added to the sample records, and the BO-GP model is re-built with the new information. This procedure is repeated until termination. Applying a BO-GP to a size n dataset has a time complexity of $O(n^3)$ and space complexity of $O(n^2)$ [88]. One main limitation of BO-GP is that the cubic complexity to the number of instances limits the capacity for parallelization [3]. Additionally, it is mainly used to optimize continuous variables.

4.3.2. SMAC

Random forest (RF) is another popular surrogate function for BO to model the objective function using an ensemble of regression trees. BO using RF as the surrogate model is also called SMAC [86]. Assuming that there is a Gaussian model $\mathcal{N}(y \mid \hat{\mu}, \hat{\sigma}^2)$, and $\hat{\mu}$ and $\hat{\sigma}^2$ are the mean and variance of the regression function $r(x)$, respectively, then [86]:

$$\hat{\mu} = \frac{1}{|B|} \sum_{r \in B} r(x), \tag{23}$$

$$\hat{\sigma}^2 = \frac{1}{|B| - 1} \sum_{r \in B} \left( r(x) - \hat{\mu} \right)^2, \tag{24}$$

where B is a set of regression trees in the forest. The major procedures of SMAC are as follows [3]:

1. RF starts with building B regression trees, each constructed by sampling n instances from the training set with replacement.
2. A split node is selected from d hyper-parameters for each tree.
3. To maintain a low computational cost, both the minimum number of instances considered for further split and the number of trees to grow are set to a certain value.
4. Finally, the mean and variance for each new configuration are estimated by RF.

Compared with BO-GP, the main advantage of SMAC is its support for all types of variables, including continuous, discrete, categorical, and conditional hyper-parameters [87]. The time complexities of using SMAC to fit and predict variances are $O(n \log n)$ and $O(\log n)$, respectively, which are much lower than the complexities of BO-GP [3].

4.3.3. BO-TPE

Tree-structured Parzen estimator (TPE) [12] is another common surrogate model for BO. Instead of defining a predictive distribution used in BO-GP, BO-TPE creates two density functions, $l(x)$ and $g(x)$, to act as the generative models for all domain variables [3]. To apply TPE, the observation results are divided into good results and poor results by a pre-defined percentile $y^*$, and the two sets of results are modeled by simple Parzen windows [12]:

$$p(x \mid y, D) = \begin{cases} l(x), & \text{if } y < y^* \\ g(x), & \text{if } y > y^* \end{cases} \tag{25}$$

After that, the expected improvement in the acquisition function is reflected by the ratio between the two density functions, which is used to determine the new configurations for evaluation. The Parzen estimators are organized in a tree structure, so the specified conditional dependencies are retained. Therefore, TPE naturally supports specified conditional hyper-parameters [87]. The time complexity of BO-TPE is $O(n \log n)$, which is lower than the complexity of BO-GP [3].

BO methods are effective for many HPO problems, even if the objective function f is stochastic, non-convex, or non-continuous. However, the main drawback of BO models is that, if they fail to achieve the balance between exploration and exploitation, they might only reach a local instead of a global optimum. RS does not have this limitation since it does not focus on any specific area. Additionally, it is difficult to parallelize BO models since their intermediate results are dependent on each other [7].

4.4. Multi-fidelity optimization algorithms

One major issue with HPO is the long execution time, which increases with a larger hyper-parameter configuration space and larger datasets. The execution time can take several hours, several days, or even more [89]. Multi-fidelity optimization techniques are common approaches to solve the constraint of limited time and resources. To save time, people can use a subset of the original dataset or a subset of the features [90]. Multi-fidelity involves low-fidelity and high-fidelity evaluations and combines them for practical applications [91]. In low-fidelity evaluations, a relatively small subset is evaluated at a low cost but with poor generalization performance. In high-fidelity evaluations, a relatively large subset is evaluated with better generalization performance but at a higher cost than low-fidelity evaluations. In multi-fidelity optimization algorithms, poorly-performing configurations are discarded after each round of hyper-parameter evaluation on generated subsets, and only well-performing hyper-parameter configurations will be evaluated on the entire training set.

Bandit-based algorithms, categorized as multi-fidelity optimization algorithms, have shown success dealing with deep learning optimization problems [3]. Two common bandit-based techniques are successive halving [92] and Hyperband [16].

4.4.1. Successive halving

Theoretically speaking, exhaustive methods are able to identify the optimal hyper-parameter combination by evaluating all the given combinations. However, many factors, including limited time and resources, should be considered in practical applications. These factors are called budgets (B). To overcome the limitations of GS and RS and to improve efficiency, successive halving algorithms were proposed in [92].

The main process of using successive halving algorithms for HPO is as follows. Firstly, it is presumed that there are n sets of hyper-parameter combinations, and that they are evaluated with uniformly-allocated budgets ($b = B/n$). Then, according to the evaluation results for each iteration, the poorly-performing half of the hyper-parameter configurations is eliminated, and the better-performing half is passed to the next iteration with double budgets ($b_{i+1} = 2 b_i$). The above process is repeated until the final optimal hyper-parameter combination is detected.

Successive halving is more efficient than RS, but is affected by the trade-off between the number of hyper-parameter configurations and the budgets allocated to each configuration [6]. Thus, the main concern of successive halving is how to allocate the budget and how to determine whether to test fewer configurations
with a higher budget for each or to test more configurations with a lower budget for each [2].

4.4.2. Hyperband

Hyperband [16] is then proposed to solve the dilemma of successive halving algorithms by dynamically choosing a reasonable number of configurations. It aims to achieve a trade-off between the number of hyper-parameter configurations (n) and their allocated budgets by dividing the total budgets (B) into n pieces and allocating these pieces to each configuration ($b = B/n$). Successive halving serves as a subroutine on each set of random configurations to eliminate the poorly-performing hyper-parameter configurations and improve efficiency. The main steps of Hyperband algorithms are shown in Algorithm 1 [2].

metaheuristics have the capacity to solve non-convex, non-continuous, and non-smooth optimization problems.

Population-based optimization algorithms (POAs) are a major type of metaheuristic algorithm, including genetic algorithms (GAs), evolutionary algorithms, evolutionary strategies, and particle swarm optimization (PSO). POAs start by creating and updating a population as each generation; each individual in every generation is then evaluated until the global optimum is identified [14]. The main differences between different POAs are the methods used to generate and select populations [17]. POAs can be easily parallelized since a population of N individuals can be evaluated on at most N threads or machines in parallel [6]. Genetic algorithms and particle swarm optimization are the two main POAs that are popularly-used for HPO problems.
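The successive halving routine of Section 4.4.1, which Hyperband runs as a subroutine on each set of random configurations, can be sketched in a few lines. This is a minimal illustration in plain Python and not a reproduction of Algorithm 1: the function names, the ‘learning_rate’ search range, and the deterministic `toy_loss` (a stand-in for training a model with a given budget) are all assumptions.

```python
import random


def successive_halving(configs, evaluate, total_budget):
    """Return the surviving configuration and a log of (n_configs, budget)."""
    survivors = list(configs)
    budget = total_budget / len(survivors)      # uniform allocation, b = B / n
    rounds = []
    while len(survivors) > 1:
        rounds.append((len(survivors), budget))
        # Lower loss is better: rank all survivors at the current budget.
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))
        survivors = ranked[: len(ranked) // 2]  # eliminate the worse half
        budget *= 2                             # b_{i+1} = 2 * b_i
    return survivors[0], rounds


def toy_loss(config, budget):
    """Hypothetical stand-in for training with `budget` units of resource."""
    return (config["learning_rate"] - 0.1) ** 2 + 1.0 / budget


rng = random.Random(0)
# Initial candidates are drawn at random, as RS would do.
candidates = [{"learning_rate": rng.uniform(0.001, 1.0)} for _ in range(8)]
best, rounds = successive_halving(candidates, toy_loss, total_budget=8.0)
print(best, rounds)
```

With eight candidates the rounds use 8, 4 and 2 configurations at budgets 1, 2 and 4, so every round spends the same total. Because the toy loss ranks configurations identically at every budget, the best candidate always survives here; with real, noisy low-budget evaluations a good configuration can be eliminated early, which is the trade-off between the number of configurations and the budget per configuration discussed above.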
Among the above steps, the population initialization step is an important step of GA and PSO since it provides an initial guess of the optimal values. Although the initialized values will be iteratively improved in the optimization process, a suitable population initialization method can significantly improve the convergence speed and performance of POAs. A good initial population of hyper-parameters should involve individuals that are close to global optimums by covering the promising regions and should not be localized to an unpromising region of the search space [96].

To generate hyper-parameter configuration candidates for the initial population, random initialization that simply creates the initial population with random values in the given search space is often used in GA [97]. Thus, GA is easily implemented and does not necessitate good initializations, because its selection, crossover, and mutation operations lower the possibility of missing the global optimum.

Hence, it is useful when the data analyst does not have much experience determining a potential appropriate initial search space for the hyper-parameters. The main limitation of GA is that the algorithm itself introduces additional hyper-parameters to be configured, including the fitness function type, population size, crossover rate, and mutation rate. Moreover, GA is a sequential execution algorithm, making it difficult to parallelize. The time complexity of GA is $O(n^2)$ [98]. As a result, sometimes, GA may be inefficient due to its low convergence speed.

4.5.2. Particle swarm optimization

Particle swarm optimization (PSO) [99] is another set of evolutionary algorithms that are commonly used for optimization problems. PSO algorithms are inspired by biological populations that exhibit both individual and social behaviors [17]. PSO works by enabling a group of particles (swarm) to traverse the search space in a semi-random manner [9]. PSO algorithms identify the optimal solution through cooperation and information sharing among individual particles in a group.

In PSO, there are a group of n particles in a swarm S [2]:

$$S = (S_1, S_2, \ldots, S_n), \tag{26}$$

and each particle $S_i$ is represented by a vector:

$$S_i = \langle \vec{x}_i, \vec{v}_i, \vec{p}_i \rangle, \tag{27}$$

where $\vec{x}_i$ is the current position, $\vec{v}_i$ is the current velocity, and $\vec{p}_i$ is the best position that particle $i$ has found so far.

After the position and velocity of each particle are initialized, each particle's current position is evaluated and recorded with its performance score. In the next iteration, the velocity $\vec{v}_i$ of each particle is changed based on its previous best position $\vec{p}_i$ and the current global optimal position $\vec{p}$:

$$\vec{v}_i := \vec{v}_i + U(0, \varphi_1)(\vec{p}_i - \vec{x}_i) + U(0, \varphi_2)(\vec{p} - \vec{x}_i), \tag{28}$$

where $U(0, \varphi)$ is the continuous uniform distribution based on the acceleration constants $\varphi_1$ and $\varphi_2$.

After that, the particles move based on their new velocity vectors:

$$\vec{x}_i := \vec{x}_i + \vec{v}_i. \tag{29}$$

The above procedures are repeated until convergence or termination constraints are reached.

Compared with GA, it is easier to implement PSO, since PSO does not have certain additional operations like crossover and mutation. In GA, all chromosomes share information with each other, so the entire population moves uniformly toward the optimal region; while in PSO, only information on the individual best particle and the global best particle is transmitted to others, which is a one-way flow of information sharing, and the entire search process follows the direction of the current optimal solution [2]. The computational complexity of PSO algorithm is $O(n \log n)$ [100]. In most cases, the convergence speed of PSO is faster than that of GA. In addition, particles in PSO operate independently and only need to share information with each other after each iteration, so this process is easily parallelized to improve model efficiency [9].

The main limitation of PSO is that it requires proper population initialization; otherwise, it might only reach a local instead of a global optimum, especially for discrete hyper-parameters [101]. Proper population initialization requires developers’ prior experience or using population initialization techniques. Many population initialization techniques have been proposed to improve the performance of evolutionary algorithms, like the opposition-based optimization algorithm [97] and the space transformation search method [102]. Involving additional population initialization techniques will require more execution time and resources.

5. Applying optimization techniques to machine learning algorithms

5.1. Optimization techniques analysis

Grid search (GS) is a simple method, its major limitation being that it is time-consuming and impacted by the curse of dimensionality [79]. Thus, it is unsuitable for a large number of hyper-parameters. Moreover, GS is often not able to detect the global optimum of continuous parameters, since it requires a pre-defined, finite set of hyper-parameter values. It is also unrealistic for GS to be used to identify integer and continuous hyper-parameter optimums with limited time and resources. Therefore, compared with other techniques, GS is only efficient for a small number of categorical hyper-parameters.

Random search is more efficient than GS and supports all types of hyper-parameters. In practical applications, using RS to evaluate the randomly-selected hyper-parameter values helps analysts to explore a large search space. However, since RS does not consider previously-tested results, it may involve many unnecessary evaluations, which decrease its efficiency [13].

Hyperband can be considered an improved version of RS, and they both support parallel executions. Hyperband balances model performance and resource usage, so it is more efficient than RS, especially with limited time and resources [15]. However, GS, RS, and Hyperband all have a major constraint in that they treat each hyper-parameter independently and do not consider hyper-parameter correlations [103]. Thus, they will be inefficient for ML algorithms with conditional hyper-parameters, like SVM, DBSCAN, and logistic regression.

Gradient-based algorithms are not a prevalent choice for hyper-parameter optimization, since they only support continuous hyper-parameters and can only detect a local instead of a global optimum for non-convex HPO problems [2]. Therefore, gradient-based algorithms can only be used to optimize certain hyper-parameters, like the learning rate in DL models.

Bayesian optimization models are divided into three different models (BO-GP, SMAC, and BO-TPE) based on their surrogate models. BO algorithms determine the next hyper-parameter value based on the previously-evaluated results to reduce unnecessary evaluations and improve efficiency. BO-GP mainly supports continuous and discrete hyper-parameters (by rounding them), but does not support conditional hyper-parameters [14]; while SMAC and BO-TPE are both able to handle categorical, discrete, continuous, and conditional hyper-parameters. SMAC performs better when there are many categorical and conditional parameters, or cross-validation is used, while BO-GP performs better for only a
hyper-parameters. SMAC is also a good choice, since it also performs well for tuning conditional hyper-parameters. GA and PSO can be used, as well.

5.2.4. A large hyper-parameter configuration space with multiple types of hyper-parameters

Tree-based algorithms, including DT, RF, ET, and XGBoost, as well as DL algorithms, like DNN, CNN, RNN, are the most complex types of ML algorithms to be tuned, since they have many hyper-parameters of various types. For these ML models, PSO is the best choice since it enables parallel executions to improve efficiency, particularly for DL models that often require massive training time. Some other techniques, like GA, BO-TPE, and SMAC can also be used, but they may cost more time than PSO, since it is difficult to parallelize these techniques.

5.2.5. Categorical hyper-parameters

This category of hyper-parameters is mainly for ensemble learning algorithms, since their major hyper-parameter is a categorical hyper-parameter. For bagging and AdaBoost, the categorical hyper-parameter is ‘base_estimator’, which is set to be a singular ML model. For voting, it is ‘estimators’, indicating a list of ML singular models to be combined. The voting method has another categorical hyper-parameter, ‘voting’, which is used to choose whether to use a hard or soft voting method. If we only consider these categorical hyper-parameters, GS would be sufficient to detect their suitable base machine learners. On the other hand, in many cases, other hyper-parameters need to be considered, like ‘n_estimators’, ‘max_samples’, and ‘max_features’ in bagging, as well as ‘n_estimators’ and ‘learning_rate’ in AdaBoost; consequently, BO algorithms would be a better choice to optimize these continuous or discrete hyper-parameters.

In conclusion, when tuning an ML model to achieve high model performance and low computational costs, the most suitable HPO algorithm should be selected based on the properties of its hyper-parameters.

To tackle HPO problems, many open-source libraries exist to apply theory into practice and lower the threshold for ML developers. In this section, we provide a brief introduction to some popular open-source HPO libraries or frameworks mainly for Python programming. The principles behind the involved optimization algorithms are provided in Section 4.

deficiency is that it is not very efficient for categorical and conditional hyper-parameters.

6.3. BayesOpt

Bayesian Optimization (BayesOpt) [105] is a Python library employed to solve HPO problems using BO. BayesOpt uses a Gaussian process as its surrogate model to calculate the objective function based on past evaluations and utilizes an acquisition function to determine the next values.

6.4. Hyperopt

Hyperopt [106] is an HPO framework that involves RS and BO-TPE as the optimization algorithms. Unlike some of the other libraries that only support a single model, Hyperopt is able to use multiple models to model hierarchical hyper-parameters. In addition, Hyperopt is parallelizable since it uses MongoDB as the central database to store the hyper-parameter combinations. Hyperopt-sklearn [107] and hyperas [108] are the two libraries that can apply Hyperopt to scikit-learn and Keras libraries.

6.5. SMAC

SMAC [109] is another library that uses BO with random forest as the surrogate model. It supports categorical, continuous, and discrete variables.

6.6. BOHB

BOHB framework [93] is a combination of Bayesian optimization and Hyperband [15]. It overcomes one limitation of Hyperband, in that it randomly generates the test configurations, by replacing this procedure by BO. TPE is used as the surrogate model to store and model function evaluations. Using BOHB to evaluate the instance can achieve a trade-off between model performance and the current budget.

Optunity [79] is a popular HPO framework that provides several optimization techniques, including GS, RS, PSO, and BO-TPE. In Optunity, categorical hyper-parameters are converted to discrete hyper-parameters by indexing, and discrete hyper-parameters are processed as continuous hyper-parameters by rounding them; as such, it supports all types of hyper-parameters.
In sklearn [30], ‘GridSearchCV’ can be implemented to detect Skopt (scikit-optimize) [110] is a HPO library that is built on top
the optimal hyper-parameters using the GS algorithm. Each of the scikit-learn [30] library. It implements several sequential
hyper-parameter value in the human-defined configuration space model-based optimization models, including RS and BO-GP. The
is evaluated by the program, with its performance evaluated using methods exhibit good performance with small search space and
cross-validation. When all the instances in the configuration space proper initialization.
have been evaluated, the optimal hyper-parameter combination in
the defined search space with its performance score will be 6.9. GpFlowOpt
returned. ’RandomizedSearchCV’ is also provided in sklearn to
implement a RS method. It evaluates a pre-defined number of GpFlowOpt [111] is a Python library for BO using GP as the sur-
randomly-selected hyper-parameter values in parallel. Cross- rogate model. It supports running BO-GP on GPU using the Tensor-
validation is conducted to effectively evaluate the performance of flow library. Therefore, GpFlowOpt is a good choice if BO is used in
each configuration. deep learning models with GPU resources available.
Spearmint [83] is a library using Bayesian optimization with the Talos [112] is a Python package designed for hyper-parameter
Gaussian process as the surrogate model. Spearmint’s primary optimization with Keras models. Talos can be fully deployed into
L. Yang, A. Shami / Neurocomputing 415 (2020) 295–316 309
any Keras models and implemented easily without learning any new syntax. Several optimization techniques, including GS, RS, and probabilistic reduction, can be implemented using Talos.
6.11. Sherpa
Sherpa [113] is a Python package used for HPO problems. It can be used with other ML libraries, including sklearn [30], TensorFlow [114], and Keras [32]. It supports parallel computations and has several optimization methods, including GS, RS, BO-GP (via GPyOpt), Hyperband, and population-based training (PBT).
6.12. Osprey
Osprey [115] is a Python library designed to optimize hyper-parameters. Several HPO strategies are available in Osprey, including GS, RS, BO-TPE (via Hyperopt), and BO-GP (via GPyOpt).
6.13. FAR-HO
FAR-HO [116] is a hyper-parameter optimization package that employs gradient-based algorithms with TensorFlow. FAR-HO contains a few gradient-based optimizers, like reverse hyper-gradient and forward hyper-gradient methods. This library is designed to provide access to the gradient-based hyper-parameter optimizers in TensorFlow, allowing deep learning model training and hyper-parameter optimization in GPU or other tensor-optimized computing environments.
6.14. Hyperband
Hyperband [16] is a Python package for tuning hyper-parameters by Hyperband, a bandit-based approach. Similar to ‘GridSearchCV’ and ‘RandomizedSearchCV’ in scikit-learn, there is a class named ‘HyperbandSearchCV’ in Hyperband that can be combined with sklearn and used for HPO problems. In the ‘HyperbandSearchCV’ method, cross-validation is used for evaluation.
6.15. DEAP
DEAP [117] is a novel evolutionary computation package for Python that contains several evolutionary algorithms like GA and PSO. It integrates with parallelization mechanisms like multiprocessing, and machine learning packages like sklearn.
6.16. TPOT
TPOT [118] is a Python tool for auto-ML that uses genetic programming to optimize ML pipelines. TPOT is built on top of sklearn, so it is easy to implement TPOT on ML models. ‘TPOTClassifier’ is its principal function, and several additional hyper-parameters of GA must be set to fit specific problems.
6.17. Nevergrad
Nevergrad [119] is an open-source Python library that includes a wide range of optimizers, like fast-GA and PSO. In ML, Nevergrad can be used to tune all types of hyper-parameters, including discrete, continuous, and categorical hyper-parameters, by choosing different optimizers.
7. Experiments
To summarize the content of Sections 3–6, a comprehensive overview of applying hyper-parameter optimization techniques to ML models is shown in Table 2. It provides a summary of common ML algorithms, their hyper-parameters, suitable optimization methods, and available Python libraries; thus, data analysts and researchers can look up this table and select suitable optimization algorithms as well as libraries for practical use.
To put theory into practice, several experiments have been conducted based on Table 2. This section presents experiments that apply eight different HPO techniques to three common and representative ML algorithms on two benchmark datasets. In the first part of this section, the experimental setup and the main process of HPO are discussed. In the second part, the results of utilizing different HPO methods are compared and analyzed. The sample code of the experiments has been published in [132] to illustrate the process of applying hyper-parameter optimization to ML models.
7.1. Experimental setup
Based on the steps to optimize hyper-parameters discussed in Section 2.2, several steps were completed before the actual optimization experiments started.
Firstly, two standard benchmarking datasets provided by the sklearn library [30], namely, the Modified National Institute of Standards and Technology dataset (MNIST) and the Boston housing dataset, are selected as the benchmark datasets for HPO method evaluation on data analytics problems. MNIST is a hand-written digit recognition dataset used as a multi-classification problem, while the Boston housing dataset contains information about the price of houses in various places in the city of Boston and can be used as a regression dataset to predict the housing prices.
At the next stage, the ML models with their objective function need to be configured. In Section 5.2, all common ML models are divided into five categories based on their hyper-parameter types. Among those ML categories, “one discrete hyper-parameter”, “a few conditional hyper-parameters”, and “a large hyper-parameter configuration space with multiple types of hyper-parameters” are the three most common cases. Thus, three ML algorithms, KNN, SVM, and RF, are selected as the target models to be optimized, since their hyper-parameter types represent the three most common HPO cases: KNN has one important hyper-parameter, the number of considered nearest neighbors for each sample; SVM has a few conditional hyper-parameters, like the kernel type and the penalty parameter C; RF has multiple hyper-parameters of different types, as discussed in Section 3. Moreover, KNN, SVM, and RF can all be applied to solve both classification and regression problems.
In the next step, the performance metrics and evaluation methods are configured. For each experiment on the selected two datasets, 3-fold cross-validation is implemented to evaluate the involved HPO methods. The two most commonly-used performance metrics are used in our experiments. For classification models, accuracy is used as the classifier performance metric, which is the proportion of correctly classified data; while for regression models, the mean squared error (MSE) is used as the regressor performance metric, which measures the average squared difference between the predicted values and the actual values. Additionally, the computational time (CT), the total time needed to complete an HPO process with 3-fold cross-validation, is also used as the model efficiency metric [54]. In each experiment, the optimized ML model architecture that has the highest accuracy or the lowest MSE and the optimal hyper-parameter configuration will be returned.
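As a minimal sketch of the evaluation procedure described above, the following Python snippet scores each candidate value of ‘n_neighbors’ for a KNN classifier by 3-fold cross-validated accuracy and records the computational time (CT) of the whole search. It is an illustration under stated assumptions rather than the published experiment code: we assume that sklearn's `load_digits` can stand in for the sklearn-provided MNIST data, the candidate range of 1 to 20 follows the setup of this section, and the helper name `objective` is our own.

```python
import time

from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: sklearn's 8x8 digits data stands in for the MNIST benchmark.
X, y = load_digits(return_X_y=True)


def objective(n_neighbors):
    """Mean 3-fold cross-validated accuracy for one hyper-parameter value."""
    model = KNeighborsClassifier(n_neighbors=n_neighbors)
    return cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()


start = time.perf_counter()
# Exhaustive (grid) evaluation of every candidate value from 1 to 20.
scores = {k: objective(k) for k in range(1, 21)}
ct = time.perf_counter() - start  # computational time (CT) in seconds

best_k = max(scores, key=scores.get)
print(f"best n_neighbors={best_k}, accuracy={scores[best_k]:.4f}, CT={ct:.2f}s")
```

For a regression model, the same pattern would apply with sklearn's ‘neg_mean_squared_error’ scorer in place of ‘accuracy’, in which case the configuration with the highest (least negative) score has the lowest MSE.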
Table 2
A comprehensive overview of common ML models, their hyper-parameters, suitable optimization techniques, and available Python libraries.
we compare different HPO methods using the same hyper-parameter configuration space. For KNN, the only hyper-parameter to be optimized, ‘n_neighbors’, is set to be in the same range of 1 to 20 for each optimization method evaluation. The hyper-parameters of SVM and RF models for classification and regression problems are also set to be in the same configuration space for each type of problem. The specifics of the configuration space for ML models are shown in Table 3. The selected hyper-parameters and their search space are determined based on the concepts in Section 3, domain knowledge, and manual testing [120]. The hyper-parameter types of each ML algorithm are also summarized in Table 3.
On the other hand, to fairly compare the performance metrics of optimization techniques, the maximum number of iterations for all HPO methods is set to 50 for RF and SVM model optimizations, and 10 for KNN model optimization, based on manual testing and domain knowledge. Moreover, to avoid the impacts of randomness, all experiments are repeated ten times with different random seeds, and results are averaged for regression problems or given the majority vote for classification problems.
In Section 4, more than ten HPO methods are introduced. In our experiments, eight representative HPO approaches are selected for performance comparison, including GS, RS, BO-GP, BO-TPE, Hyperband, BOHB, GA, and PSO. After setting up the fair experimental environments for each HPO method, the HPO experiments are implemented based on the steps discussed in Section 2.2.
All experiments were conducted using Python 3.5 on a machine with a 6-core i7-8700 processor and 16 gigabytes (GB) of memory. The involved ML and HPO algorithms are evaluated using multiple [...]
Table 3
ML Model | Hyper-parameter | Type | Search Space
RF Classifier | n_estimators | Discrete | [10,100]
RF Classifier | max_depth | Discrete | [5,50]
RF Classifier | min_samples_split | Discrete | [2,11]
RF Classifier | min_samples_leaf | Discrete | [1,11]
RF Classifier | criterion | Categorical | [‘gini’, ‘entropy’]
RF Classifier | max_features | Discrete | [1,64]
SVM Classifier | C | Continuous | [0.1,50]
SVM Classifier | kernel | Categorical | [‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’]
KNN Classifier | n_neighbors | Discrete | [1,20]
RF Regressor | n_estimators | Discrete | [10,100]
RF Regressor | max_depth | Discrete | [5,50]
RF Regressor | min_samples_split | Discrete | [2,11]
RF Regressor | min_samples_leaf | Discrete | [1,11]
RF Regressor | criterion | Categorical | [‘mse’, ‘mae’]
RF Regressor | max_features | Discrete | [1,13]
SVM Regressor | C | Continuous | [0.1,50]
SVM Regressor | kernel | Categorical | [‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’]
SVM Regressor | epsilon | Continuous | [0.001,1]
KNN Regressor | n_neighbors | Discrete | [1,20]
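To make the RF classifier configuration space of Table 3 concrete, the sketch below writes it as a Python dictionary, counts how many configurations an exhaustive grid would contain, and draws a fixed budget of random configurations. It rests on assumptions that are ours and not stated in the text: the bounds of each discrete hyper-parameter are read as inclusive integer ranges, and the names `rf_space` and `sample_configuration` are hypothetical. The budget of 50 matches the maximum number of iterations stated for RF.

```python
import math
import random

# RF classifier search space from Table 3.
# Assumption: discrete bounds are inclusive integer ranges.
rf_space = {
    "n_estimators": range(10, 101),
    "max_depth": range(5, 51),
    "min_samples_split": range(2, 12),
    "min_samples_leaf": range(1, 12),
    "criterion": ["gini", "entropy"],
    "max_features": range(1, 65),
}

# Size of the full grid: the product of the number of values per hyper-parameter.
grid_size = math.prod(len(values) for values in rf_space.values())


def sample_configuration(space, rng):
    """Draw one configuration uniformly at random, as a random search would."""
    return {name: rng.choice(values) for name, values in space.items()}


rng = random.Random(0)  # fixed seed so the draw is reproducible
budget = 50             # maximum number of iterations used for RF
candidates = [sample_configuration(rf_space, rng) for _ in range(budget)]

print(grid_size)       # 58938880 configurations in the full grid
print(candidates[0])   # one randomly drawn configuration
```

Under these assumptions the full grid holds 91 × 46 × 10 × 11 × 2 × 64 = 58,938,880 configurations, so a budget of 50 evaluations covers only a tiny fraction of it; this is why the choice of which configurations to evaluate matters so much for models like RF.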
and compare their accuracies for classification problems, or MSEs for regression problems, as well as their computational time (CT).
From Tables 4–9, we can see that using the default HP configurations does not yield the best model performance in our experiments, which emphasizes the importance of utilizing HPO methods. GS and RS can be seen as baseline models for HPO problems. From the results in Tables 4–9, it is shown that the computational time of GS is often much higher than that of other optimization methods. With the same search space size, RS is faster than GS, but neither of them can guarantee to detect the near-optimal hyper-parameter configurations of ML models, especially for RF and SVM models, which have a larger search space than KNN.
The performance of BO and multi-fidelity models is much better than GS and RS. The computational time of BO-GP is often higher than other HPO methods due to its cubic time complexity, but it can obtain better performance metrics for ML models with small-size continuous hyper-parameter space, like KNN. Conversely, Hyperband is often not able to obtain the highest accuracy or the lowest MSE among the optimization methods, but its computational time is low because it works on small-sized subsets. The performance of BO-TPE and BOHB is often better than others, since they can detect the optimal or near-optimal hyper-parameter configurations within a short computational time.
For the metaheuristic methods, GA and PSO, their accuracies are often higher than other HPO methods for classification problems, and their MSEs are often lower than other optimization techniques. However, their computational time is often higher than BO-TPE and multi-fidelity models, especially for GA, which does not support parallel executions.
Table 4
Performance evaluation of applying HPO methods to the RF classifier on the MNIST dataset.
Optimization Algorithm | Accuracy (%) | CT (s)
Table 5
Performance evaluation of applying HPO methods to the SVM classifier on the MNIST dataset.
Optimization Algorithm | Accuracy (%) | CT (s)
Default HPs | 97.05 | 0.29
GS | 97.44 | 32.90
RS | 97.35 | 12.48
BO-GP | 97.50 | 17.56
BO-TPE | 97.44 | 3.02
Hyperband | 97.44 | 11.37
BOHB | 97.44 | 8.18
GA | 97.44 | 16.89
PSO | 97.44 | 8.33
Optimization Algorithm | MSE | CT (s)
Default HPs | 31.26 | 0.08
GS | 29.02 | 4.64
RS | 27.92 | 3.42
BO-GP | 26.79 | 17.94
BO-TPE | 25.42 | 1.53
Hyperband | 26.14 | 2.56
BOHB | 25.56 | 1.88
GA | 26.95 | 4.73
PSO | 25.69 | 3.20
Table 8
Performance evaluation of applying HPO methods to the SVM regressor on the Boston–housing dataset.
Optimization Algorithm | MSE | CT (s)
Default HPs | 77.43 | 0.02
GS | 67.07 | 1.33
RS | 61.40 | 0.48
BO-GP | 61.27 | 5.87
BO-TPE | 59.40 | 0.33
Hyperband | 73.44 | 0.32
BOHB | 59.67 | 0.31
GA | 60.17 | 1.12
PSO | 58.72 | 0.53
Table 9
Performance evaluation of applying HPO methods to the KNN regressor on the Boston–housing dataset.
To summarize, it is simple to implement GS and RS, but they often cannot detect the optimal hyper-parameter configurations
or cost much computational time. BO-GP and GA also cost more computational time than many other HPO methods, but BO-GP works well on small configuration spaces, while GA is effective for large configuration spaces. Hyperband’s computational time is low, but it cannot guarantee to detect the global optimums. For ML models with large configuration spaces, BO-TPE, BOHB, and PSO often work well.
8. Open issues, challenges, and future research directions
Although there have been many existing HPO algorithms and practical frameworks, some issues still need to be addressed, and several aspects in this domain could be improved. In this section, we discuss the open challenges, current research questions, and potential research directions in the future. They can be classified as model complexity challenges and model performance challenges, as summarized in Table 10.
Table 10
The open challenges and future directions of HPO research.
Category | Challenges & Future Requirements | Brief Description
Model complexity | Costly objective function evaluations | HPO methods should reduce evaluation time on large datasets.
Model complexity | Complex search space | HPO methods should reduce execution time on high dimensionalities (large hyper-parameter search space).
Model performance | Strong anytime performance | HPO methods should be able to detect the optimal or near-optimal HPs even with a very limited budget.
Model performance | Strong final performance | HPO methods should be able to detect the global optimum when given a sufficient budget.
Model performance | Comparability | There should exist a standard set of benchmarks to fairly evaluate and compare different optimization algorithms.
Model performance | Over-fitting and generalization | The optimal HPs detected by HPO methods should have generalizability to build efficient models on unseen data.
Model performance | Randomness | HPO methods should reduce randomness on the obtained results.
Model performance | Scalability | HPO methods should be scalable to multiple libraries or platforms (e.g., distributed ML platforms).
Model performance | Continuous updating capability | HPO methods should consider their capacity to detect and update optimal HP combinations on continuously-updated data.
8.1. Model complexity
8.1.1. Costly objective function evaluations
To evaluate the performance of an ML model with different hyper-parameter configurations, its objective function must be minimized in each evaluation. Depending on the scale of data, the model complexity, and available computational resources, the evaluation of each hyper-parameter configuration may take several minutes, hours, days, or even more [89]. Additionally, the values of certain hyper-parameters have a direct impact on the execution time, like the number of considered neighbors in KNN, the number of basic decision trees in RF, and the number of hidden layers in deep neural networks [121].
To solve this problem by HPO algorithms, BO models reduce the total number of evaluations by spending time choosing the next evaluating point instead of simply evaluating all possible hyper-parameter configurations; however, they still require much execution time due to their poor capacity for parallelization. On the other hand, although multi-fidelity optimization methods, like Hyperband, have had some success dealing with HPO problems with limited budgets, there are still some problems that cannot be effectively solved by HPO due to the complexity of models or the scale of datasets [6]. For example, the ImageNet [122] challenge is a very popular problem in the image processing domain, but there has not been any research or work on efficiently optimizing hyper-parameters for the ImageNet challenge yet, due to its huge scale and the complexity of CNN models used on ImageNet.
8.1.2. Complex search space
In many problems to which ML algorithms are applied, only a few hyper-parameters have significant effects on model performance, and they are the main hyper-parameters that require tuning. However, certain other unimportant hyper-parameters may still affect the performance slightly and may be considered to optimize the ML model further, which increases the dimensionality of the hyper-parameter search space. As the number of hyper-parameters and their candidate values increases, the size of the search space and the complexity of the problem grow exponentially, and the total objective function evaluation time will also increase exponentially [7]. Therefore, it is necessary to reduce the influence of large search spaces on execution time by improving existing HPO methods.
8.2. Model performance
8.2.1. Strong anytime performance and final performance
HPO techniques are often expensive and sometimes require extreme resources, especially for massive datasets or complex ML models. One example of a resource-intensive model is deep learning models, since they view objective function evaluations as black-box functions and do not consider their complexity. However, the overall budget is often very limited for most practical situations, so HPO algorithms should be able to prioritize objective function evaluations and have a strong anytime performance, which indicates the capacity to detect optimal or near-optimal configurations even with a very limited budget [93]. For instance, an efficient HPO method should have a high convergence speed so that there would not be a huge difference between the results before and after model convergence, and should avoid random results even if time and resources are limited, which RS methods cannot do.
On the other hand, if conditions permit and an adequate budget is given, HPO approaches should be able to identify the global optimal hyper-parameter configuration, which is referred to as a strong final performance [93].
8.2.2. Comparability of HPO methods
To optimize the hyper-parameters of ML models, different optimization algorithms can be applied to each ML framework. Different optimization techniques have their own strengths and drawbacks in different cases, and currently, there is no single optimization approach that outperforms all other approaches when processing different datasets with various metrics and hyper-parameter types [3]. In this paper, we have analyzed the strengths and weaknesses of common hyper-parameter optimization techniques based on their principles and their performance in practical applications; but this topic could be extended more comprehensively.
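The distinction between anytime and final performance drawn in Section 8.2.1 can be made concrete by recording the best result found after each evaluation. The following is a minimal sketch under our own assumptions, not an experiment from this paper: `toy_objective` is a hypothetical one-dimensional loss standing in for a validation loss, the optimizer is plain random search, and reading the trace after 10 evaluations is an arbitrary choice of "limited budget".

```python
import random
from itertools import accumulate


def toy_objective(x):
    """Hypothetical 1-D validation loss with its minimum at x = 0.3."""
    return (x - 0.3) ** 2


rng = random.Random(42)  # fixed seed so the run is reproducible

# Random search: 50 evaluations drawn uniformly from [0, 1].
losses = [toy_objective(rng.uniform(0.0, 1.0)) for _ in range(50)]

# Best-so-far trace: the lowest loss available after each evaluation.
trace = list(accumulate(losses, min))

anytime_loss = trace[9]   # what a budget of only 10 evaluations would return
final_loss = trace[-1]    # what the full budget of 50 evaluations returns

print(f"after 10 evaluations: {anytime_loss:.5f}")
print(f"after 50 evaluations: {final_loss:.5f}")
```

A method with strong anytime performance drives this trace down within the first few evaluations, whereas strong final performance concerns only its last value; plotting such traces for several optimizers on the same objective is one simple way to compare them under different budgets.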
To solve this problem, a standard set of benchmarks could be designed and agreed on by the community for a better comparison of different HPO algorithms. For example, there is a platform called COCO (Comparing Continuous Optimizers) [123] that provides benchmarks and analyzes common continuous optimizers. However, there is, to date, no reliable platform that provides benchmarks and analysis of all common hyper-parameter optimization approaches. It would be easier for people to choose HPO algorithms in practical applications if a platform like COCO existed for HPO problems. In addition, a unified metric can also improve the comparability of different HPO algorithms, since different metrics are currently used in different practical problems [6].
On the other hand, based on the comparison of different HPO algorithms, a way to further improve HPO is to combine existing models or propose new models that contain as many benefits as possible and are more suitable for practical problems than existing singular models. For example, the BOHB method [93] has had some success dealing with HPO problems by combining Bayesian optimization and Hyperband. In addition, future research should consider both model performance and time budgets to develop HPO algorithms that suit real-world applications.
8.2.3. Over-fitting and generalization
Generalization is another issue with HPO models. Since hyper-parameter evaluations are done with a finite number of evaluations in datasets, the optimal hyper-parameter values detected by HPO approaches might not be the same optimums on previously-unseen data. This is similar to over-fitting issues with ML models that occur when a model is closely fit to a finite number of known data points but is unfit to unseen data [124]. Generalization is also a common concern for multi-fidelity algorithms, like Hyperband and BOHB, since they need to extract subsets to represent the entire dataset.
One solution to reduce or avoid over-fitting is to use cross-validation to identify a stable optimum that performs best in all or most of the subsets instead of a sharp optimum that only performs well in a singular validation set [6]. However, cross-validation increases the execution time several-fold. It would be beneficial if methods could better deal with over-fitting and improve generalization in future research.
8.2.4. Randomness
There are stochastic components in the objective function of ML algorithms; thus, in some cases, the optimal hyper-parameter configuration might be different after each run. This randomness could be due to various procedures of certain ML models, like neural network initialization, or different sampled subsets in a bagging model [89]; or due to certain procedures of HPO algorithms, like crossover and mutation operations in GA. In addition, it is often difficult for HPO methods to identify the global optimums, due to the fact that HPO problems are mainly NP-hard problems. Many existing HPO algorithms can only collect several different near-optimal values, which is caused by randomness. Thus, the existing HPO models can be further improved to reduce the impact of randomness. One possible solution is to run an HPO method multiple times and select the hyper-parameter value that occurs most often as the final optimum.
8.2.5. Scalability
In practice, one main limitation of many existing HPO frameworks is that they are tightly integrated with one or a couple of machine learning libraries, like sklearn and Keras, which restricts them to only work with a single node instead of large data volumes [3]. To tackle large datasets, some distributed machine learning platforms, like Apache SystemML [125] and Spark MLlib [126], have been developed; however, only very few HPO frameworks exist that support distributed ML. Therefore, more research efforts and scalable HPO frameworks, like the ones supporting distributed ML platforms, should be developed to support more libraries.
On the other hand, future practical HPO algorithms should have the scalability to efficiently optimize hyper-parameters from a small size to a large size, irrespective of whether they are continuous, discrete, categorical, or conditional hyper-parameters.
8.2.6. Continuous updating capability
In practice, many datasets are not stationary and are constantly updated by adding new data and deleting old data. Correspondingly, the optimal hyper-parameter values or combinations may also change with the changes in data. Currently, developing HPO methods with the capacity to continuously tune hyper-parameter values as the data changes has not drawn much attention, since researchers and data analysts often do not alter the ML model after achieving a currently optimal performance [3]. However, since their optimal hyper-parameter values would change as data changes, proper approaches should be proposed to achieve continuous updating capability.
9. Conclusion
Machine learning has become the primary strategy for tackling data-related problems and has been widely used in various applications. To apply ML models to practical problems, their hyper-parameters need to be tuned to fit specific datasets. However, since the scale of produced data has greatly increased in real life, and manually tuning hyper-parameters is extremely computationally expensive, it has become crucial to optimize hyper-parameters by an automatic process. In this survey paper, we have comprehensively discussed the state-of-the-art research in the domain of hyper-parameter optimization as well as how to apply these techniques to different ML models through theory and practical experiments. When applying optimization methods to ML models, the hyper-parameter types in an ML model are the main concern for HPO method selection. To summarize, BOHB is the recommended choice for optimizing an ML model, if randomly selected subsets are highly representative of the given dataset, since it can efficiently optimize all types of hyper-parameters; otherwise, BO models are recommended for small hyper-parameter configuration spaces, while PSO is usually the best choice for large configuration spaces. Moreover, some existing useful HPO tools and frameworks, open challenges, and potential research directions are also provided and highlighted for practical use and future research purposes. We hope that our survey paper serves as a useful resource for ML users, developers, data analysts, and researchers to use and tune ML models utilizing proper HPO techniques and frameworks. We also hope that it helps to enhance understanding of the challenges that still exist within the HPO domain, and thereby to further advance HPO and ML applications in future research.
CRediT authorship contribution statement
Li Yang: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing - original draft, Visualization. Abdallah Shami: Conceptualization, Resources, Writing - review & editing, Supervision, Project administration, Funding acquisition.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
shallow methods for modeling bioactivity data, J. Cheminf. 9 (2017) 1–13, https://doi.org/10.1186/s13321-017-0226-y.
[62] T. Domhan, J.T. Springenberg, F. Hutter, Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves, IJCAI Int. Jt. Conf. Artif. Intell. (2015) 3460–3468.
[63] Y. Ozaki, M. Yano, M. Onishi, Effective hyperparameter optimization using Nelder-Mead method in deep learning, IPSJ Trans. Comput. Vis. Appl. 9 (2017), https://doi.org/10.1186/s41074-017-0030-7.
[64] F.C. Soon, H.Y. Khaw, J.H. Chuah, J. Kanesan, Hyper-parameters optimisation of deep CNN architecture for vehicle logo recognition, IET Intell. Transp. Syst. 12 (2018) 939–946, https://doi.org/10.1049/iet-its.2018.5127.
[65] D. Han, Q. Liu, W. Fan, A new image classification method using CNN transfer learning and web data augmentation, Expert Syst. Appl. 95 (2018) 43–56, https://doi.org/10.1016/j.eswa.2017.11.028.
[66] C. Di Francescomarino, M. Dumas, M. Federici, C. Ghidini, F.M. Maggi, W. Rizzi, L. Simonetto, Genetic algorithms for hyperparameter optimization in predictive business process monitoring, Inf. Syst. 74 (2018) 67–83, https://doi.org/10.1016/j.is.2018.01.003.
[67] A. Moubayed, M. Injadat, A. Shami, H. Lutfiyya, Student engagement level in e-learning environment: clustering using K-means, Am. J. Distance Educ. 34 (2020) 1–20, https://doi.org/10.1080/08923647.2020.1696140.
[68] C. Ding, X. He, Cluster structure of K-means clustering via principal component analysis, Lect. Notes Comput. Sci. (Including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics) 3056 (2004) 414–418, https://doi.org/10.1145/1015330.1015408.
[69] T.K. Moon, The expectation-maximization algorithm, IEEE Signal Process. Mag. 13 (6) (1996) 47–60.
[70] S. Brahim-Belhouari, A. Bermak, M. Shi, P.C.H. Chan, Fast and Robust gas identification system using an integrated gas sensor technology and Gaussian mixture models, IEEE Sens. J. 5 (2005) 1433–1444, https://doi.org/10.1109/JSEN.2005.858926.
[71] Z. Y., K. G., Hierarchical clustering algorithms for document dataset, Data Min.
[94] A. Gogna, A. Tayal, Metaheuristics: review and application, J. Exp. Theor. Artif. Intell. 25 (2013) 503–526, https://doi.org/10.1080/0952813X.2013.782347.
[95] F. Itano, M.A. De Abreu De, E. Del-Moral-Hernandez Sousa, Extending MLP ANN hyper-parameters Optimization by using Genetic Algorithm, Proc. Int. Jt. Conf. Neural Networks (2018) 1–8, https://doi.org/10.1109/IJCNN.2018.8489520.
[96] B. Kazimipour, X. Li, A.K. Qin, A Review of Population Initialization Techniques for Evolutionary Algorithms, 2014 IEEE Congr. Evol. Comput. (2014) 2585–2592, https://doi.org/10.1109/CEC.2014.6900618.
[97] S. Rahnamayan, H.R. Tizhoosh, M.M.A. Salama, A novel population initialization method for accelerating evolutionary algorithms, Comput. Math. Appl. 53 (2007) 1605–1614, https://doi.org/10.1016/j.camwa.2006.07.013.
[98] F.G. Lobo, D.E. Goldberg, M. Pelikan, Time complexity of genetic algorithms on exponentially scaled problems, Proc. Genet. Evol. Comput. Conf. (2000) 151–158.
[99] Y. Shi, R.C. Eberhart, Parameter Selection in Particle Swarm Optimization, Evolutionary Programming VII, Springer, 1998, pp. 591–600.
[100] X. Yan, F. He, Y. Chen, A Novel Hardware/Software Partitioning Method Based on Position Disturbed Particle Swarm Optimization with Invasive Weed Optimization 32 (2017) 340–355, https://doi.org/10.1007/s11390-017-1714-2.
[101] M.Y. Cheng, K.Y. Huang, M. Hutomo, Multiobjective dynamic-guiding PSO for optimizing work shift schedules, J. Constr. Eng. Manag. 144 (2018) 1–7, https://doi.org/10.1061/(ASCE)CO.1943-7862.0001548.
[102] H. Wang, Z. Wu, J. Wang, X. Dong, S. Yu, G. Chen, A new population initialization method based on space transformation search, 5th Int. Conf. Nat. Comput. ICNC 2009 (5) (2009) 332–336, https://doi.org/10.1109/ICNC.2009.371.
[103] J. Wang, J. Xu, X. Wang, Combination of Hyperband and Bayesian Optimization for Hyperparameter Optimization in Deep Learning, arXiv preprint arXiv:1801.01596 (2018), https://arxiv.org/abs/1801.01596.
[104] P. Cazzaniga, M.S. Nobile, D. Besozzi, The impact of particles initialization in
Knowl. Discov. 10 (2005) 141–168. PSO: parameter estimation as a case in point, 2015 IEEE Conf. Comput. Intell.
[72] K. Khan, S.U. Rehman, K. Aziz, S. Fong, S. Sarasvady, A. Vishwa, DBSCAN: Past, Bioinforma. Comput. Biol. CIBCB 2015 (2015) 1–8, https://doi.org/10.1109/
present and future, 5th Int. Conf. Appl. Digit. Inf. Web Technol. ICADIWT CIBCB.2015.7300288.
2014, 2014, pp. 232–238. https://doi.org/10.1109/ICADIWT.2014.6814687. [105] R. Martinez-Cantin, BayesOpt: a Bayesian optimization library for nonlinear
[73] H. Zhou, P. Wang, H. Li, Research on adaptive parameters determination in optimization, experimental design and bandits, J. Mach. Learn. Res. 15 (2015)
DBSCAN algorithm, J. Inf. Comput. Sci. 9 (2012) 1967–1973. 3735–3739.
[74] J. Shlens, A Tutorial on Principal Component Analysis, arXiv preprint [106] J. Bergstra, B. Komer, C. Eliasmith, D. Yamins, D.D. Cox, Hyperopt: a Python
arXiv:1404.1100, (2014). https://arxiv.org/abs1404.1100. library for model selection and hyperparameter optimization, Comput. Sci.
[75] N. Halko, P. Martinsson, J. Tropp, Finding structure with randomness: Discov. 8 (2015), https://doi.org/10.1088/1749-4699/8/1/014008.
probabilistic algorithms for constructing approximate matrix [107] B. Komer, J. Bergstra, C. Eliasmith, Hyperopt-sklearn: automatic
decompositions, SIAM Rev. 53 (2) (2011) 217–288. hyperparameter configuration for scikit-learn, Proc. ICML Workshop
[76] M. Loog, Conditional linear discriminant analysis, Proc. – Int. Conf. Pattern AutoML (2014) 34–40.
Recognit. 2 (2006) 387–390, https://doi.org/10.1109/ICPR.2006.402. [108] M. Pumperla, Hyperas, 2019. http://maxpumperla.com/hyperas/.
[77] P. Howland, J. Wang, H. Park, Solving the small sample size problem in face [109] M. Lindauer, K. Eggensperger, M. Feurer, S. Falkner, A. Biedenkapp, and F.
recognition using generalized discriminant analysis, Pattern Recognit. 39 Hutter, Smac v3: Algorithm configuration in python, 2017. https://
(2006) 277–287, https://doi.org/10.1016/j.patcog.2005.06.013. github.com/automl/SMAC3.
[78] I. Ilievski, T. Akhtar, J. Feng, C.A. Shoemaker, Efficient hyperparameter [110] Tim Head, MechCoder, Gilles Louppe, et al., scikitoptimize/scikit-optimize:
optimization of deep learning algorithms using deterministic RBF v0.5.2, 2018. doi: 10.5281/zenodo.1207017.
surrogates, 31st AAAI Conf. Artif. Intell. AAAI 2017, 2017, pp. 822–829. [111] N. Knudde, J. van der Herten, T. Dhaene, I. Couckuyt, GPflowOpt: A Bayesian
[79] M. Claesen, J. Simm, D. Popovic, Y. Moreau, B. De Moor, Easy Hyperparameter Optimization Library using TensorFlow, arXiv preprint arXiv:1711.03845
Search Using Optunity, arXiv preprint arXiv:1412.1114 (2014). (2017).
[80] C. Witt, Worst-case and average-case approximations by simple randomized [112] Autonomio Talos [Computer software], 2019. http://github.com/
search heuristics, in: Proceedings of the 22nd Annual Symposium on autonomio/talos.
Theoretical Aspects of Computer Science, STACS’05, Stuttgart, Germany, [113] L. Hertel, P. Sadowski, J. Collado, P. Baldi, Sherpa: hyperparameter optimization
2005, pp. 44-56. for machine learning models, Conf. Neural Inf. Process. Syst., 2018.
[81] Y. Bengio, Gradient-based optimization of hyperparameters, Neural Comput. [114] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, et al.,
12 (8) (2000) 1889–1900. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed
[82] H.H. Yang, S.I. Amari, Complexity issues in natural gradient descent method Systems, arXiv preprint arXiv:1603.04467, (2016). https://arxiv.org/
for training multilayer perceptrons, Neural Comput. 10 (8) (1998) 2137– abs1603.04467.
2157. [115] J. Grandgirard, D. Poinsot, L. Krespi, J.P. Nénon, A.M. Cortesero, Osprey:
[83] J. Snoek, H. Larochelle, R. Adams, Practical Bayesian optimization of machine Hyperparameter Optimization for Machine Learning, 103 (2002) 239–248.
learning algorithms, Adv. Neural Inf. Process. Syst. 4 (2012) 2951–2959. https://doi.org/10.21105/joss.00034.
[84] E. Hazan, A. Klivans, Y. Yuan, Hyperparameter optimization: a spectral [116] L. Franceschi, M. Donini, P. Frasconi, M. Pontil, Forward and reverse gradient-
approach, arXiv preprint arXiv:1706.00764, 2017. based hyperparameter optimization, 34th Int, Conf. Mach. Learn. ICML 2017
[85] M. Seeger, Gaussian processes for machine learning, Int. J. Neural Syst. 14 (70) (2017) 1165–1173.
(2004) 69–106. [117] F.A. Fortin, F.M. De Rainville, M.A. Gardner, M. Parizeau, C. Gagńe, DEAP:
[86] F. Hutter, H.H. Hoos, K. Leyton-Brown, Sequential model-based optimization evolutionary algorithms made easy, J. Mach. Learn. Res. 13 (2012) 2171–
for general algorithm configuration, Proc. LION 5 (2011) 507–523. 2175.
[87] I. Dewancker, M. McCourt, S. Clark, Bayesian Optimization Primer, (2015). [118] R.S. Olson, J.H. Moore, TPOT: a tree-based pipeline optimization tool for
URL: https://sigopt.com/static/pdf/SigOpt Bayesian Optimization Primer.pdf. automating machine learning, Auto Mach. Learn. (2019) 151–160.
[88] J. Hensman, N. Fusi, N.D. Lawrence, Gaussian processes for big data, arXiv https://doi.org/10.1007/978-3-030-05318-5_8.
preprint arXiv:1309.6835, 2013. [119] J. Rapin, O. Teytaud, Nevergrad – a gradient-free optimization platform, 2018.
[89] M. Claesen, B. De Moor, Hyperparameter Search in Machine Learning, arXiv https://GitHub.com/FacebookResearch/Nevergrad.
preprint arXiv:1502.02127, 2015. [120] M. Injadat, A. Moubayed, A.B. Nassif, A. Shami, Systematic ensemble model
[90] L. Bottou, Large-scale machine learning with stochastic gradient descent, in: selection approach for educational data mining, Knowl.-Based Syst. 200
Proceedings of the COMPSTAT, Springer, 2010, pp. 177–186. (2020) 105992, https://doi.org/10.1016/j.knosys.2020.105992.
[91] S. Zhang, J. Xu, E. Huang, C.H. Chen, A new optimal sampling rule for multi- [121] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University
fidelity optimization via ordinal transformation, IEEE Int. Conf. Autom. Sci. Press, 1995.
Eng. (2016- (2016)) 670–674, https://doi.org/10.1109/COASE.2016.7743467. [122] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep
[92] Z. Karnin, T. Koren, O. Somekh, Almost optimal exploration in multi-armed convolutional neural networks, Adv. Neural Inf. Process. Syst. 25 (2012)
bandits, 30th Int. Conf. Mach. Learn. ICML 2013 (28) (2013) 2275–2283. 1097–1105.
[93] S. Falkner, A. Klein, F. Hutter, BOHB: robust and efficient hyperparameter [123] N. Hansen, A. Auger, O. Mersmann, T. Tusar, D. Brockhoff, COCO: A Platform
optimization at scale, 35th Int. Conf. Mach. Learn. ICML 2018 (4) (2018) for Comparing Continuous Optimizers in a Black-Box Setting, arXiv preprint
2323–2341. arXiv:1603.08785, (2016). https://arxiv.org/abs1603.08785.
316 L. Yang, A. Shami / Neurocomputing 415 (2020) 295–316
[124] G.C. Cawley, N.L.C. Talbot, On over-fitting in model selection and subsequent selection bias in performance evaluation, J. Mach. Learn. Res. 11 (2010) 2079–2107.
[125] M. Boehm, A. Surve, S. Tatikonda, et al., SystemML: declarative machine learning on Spark, Proc. VLDB Endow. 9 (2016) 1425–1436, https://doi.org/10.14778/3007263.3007279.
[126] X. Meng, J. Bradley, B. Yavuz, et al., MLlib: machine learning in Apache Spark, J. Mach. Learn. Res. 17 (1) (2016) 1235–1241.
[127] A. Moubayed, M. Injadat, A. Shami, H. Lutfiyya, DNS typo-squatting domain detection: a data analytics & machine learning based approach, 2018 IEEE Glob. Commun. Conf. GLOBECOM (2018), https://doi.org/10.1109/GLOCOM.2018.8647679.
[128] Li Yang, Comprehensive visibility indicator algorithm for adaptable speed limit control in intelligent transportation systems, University of Guelph, 2018.
[129] F. Salo, M.N. Injadat, A. Moubayed, A.B. Nassif, A. Essex, Clustering enabled classification using ensemble feature selection for intrusion detection, 2019 Int. Conf. Comput. Netw. Commun. ICNC (2019) 276–281, https://doi.org/10.1109/ICCNC.2019.8685636.
[130] A. Moubayed, E. Aqeeli, A. Shami, Ensemble-based feature selection and classification model for DNS typo-squatting detection, 2020 IEEE Can. Conf. Electr. Comput. Eng. (2020).
[131] M. Injadat, A. Moubayed, A.B. Nassif, A. Shami, Multi-split optimized bagging ensemble model selection for multi-class educational data mining, Springer’s Appl. Intell. (2020).
[132] L. Yang, A. Shami, Hyperparameter Optimization of Machine Learning Algorithms (2020). https://github.com/LiYangHart/Hyperparameter-Optimization-of-Machine-Learning-Algorithms.

Li Yang received the B.E. degree in computer science from Wuhan University of Science and Technology, Wuhan, China in 2016 and the MASc degree in Engineering from University of Guelph, Guelph, Canada, 2018. Since 2018 he has been working toward the Ph.D. degree in the Department of Electrical and Computer Engineering, Western University, London, Canada. His research interests include cybersecurity, machine learning, data analytics, and intelligent transportation systems.

Abdallah Shami is a professor with the ECE Department at Western University, Ontario, Canada. He is the Director of the Optimized Computing and Communications Laboratory at Western University (https://www.eng.uwo.ca/oc2/). He is currently an associate editor for IEEE Transactions on Mobile Computing, IEEE Network, and IEEE Communications Surveys and Tutorials. He has chaired key symposia for IEEE GLOBECOM, IEEE ICC, IEEE ICNC, and ICCIT. He was the elected Chair of the IEEE Communications Society Technical Committee on Communications Software (2016–2017) and the IEEE London Ontario Section Chair (2016–2018).