On Hyperparameter Optimization of Machine Learning Algorithms: Theory and Practice

This document discusses hyperparameter optimization techniques for machine learning algorithms. It introduces several state-of-the-art optimization techniques like Bayesian optimization, particle swarm optimization, and genetic algorithms. Experiments are conducted on benchmark datasets to compare different optimization methods. The document aims to help users select the proper techniques to effectively tune hyperparameters and improve machine learning model performance.


Neurocomputing 415 (2020) 295–316

Contents lists available at ScienceDirect

Neurocomputing
journal homepage: www.elsevier.com/locate/neucom

On hyperparameter optimization of machine learning algorithms: Theory and practice
Li Yang, Abdallah Shami
Department of Electrical and Computer Engineering, University of Western Ontario, 1151 Richmond St, London, ON N6A 3K7, Canada

Article info

Article history:
Received 13 December 2019
Revised 14 May 2020
Accepted 16 July 2020
Available online 25 July 2020
Communicated by Yuhua Cheng

Keywords:
Hyper-parameter optimization
Machine learning
Bayesian optimization
Particle swarm optimization
Genetic algorithm
Grid search

Abstract

Machine learning algorithms have been used widely in various applications and areas. To fit a machine learning model into different problems, its hyper-parameters must be tuned. Selecting the best hyper-parameter configuration for machine learning models has a direct impact on the model's performance. It often requires deep knowledge of machine learning algorithms and appropriate hyper-parameter optimization techniques. Although several automatic optimization techniques exist, they have different strengths and drawbacks when applied to different types of problems. In this paper, optimizing the hyper-parameters of common machine learning models is studied. We introduce several state-of-the-art optimization techniques and discuss how to apply them to machine learning algorithms. Many available libraries and frameworks developed for hyper-parameter optimization problems are provided, and some open challenges of hyper-parameter optimization research are also discussed in this paper. Moreover, experiments are conducted on benchmark datasets to compare the performance of different optimization methods and provide practical examples of hyper-parameter optimization. This survey paper will help industrial users, data analysts, and researchers to better develop machine learning models by identifying the proper hyper-parameter configurations effectively.
© 2020 Elsevier B.V. All rights reserved.

E-mail addresses: lyang339@uwo.ca (L. Yang), abdallah.shami@uwo.ca (A. Shami)
https://doi.org/10.1016/j.neucom.2020.07.061
0925-2312/© 2020 Elsevier B.V. All rights reserved.

1. Introduction

Machine learning (ML) algorithms have been widely used in many application domains, including advertising, recommendation systems, computer vision, natural language processing, and user behavior analytics [1]. This is because they are generic and demonstrate high performance in data analytics problems. Different ML algorithms are suitable for different types of problems or datasets [2]. In general, building an effective machine learning model is a complex and time-consuming process that involves determining the appropriate algorithm and obtaining an optimal model architecture by tuning its hyper-parameters (HPs) [3].

Two types of parameters exist in machine learning models: one that can be initialized and updated through the data learning process (e.g., the weights of neurons in neural networks), named model parameters; while the other, named hyper-parameters, cannot be directly estimated from data learning and must be set before training a ML model because they define the model architecture [4]. Hyper-parameters are the parameters that are used to either configure a ML model (e.g., the penalty parameter C in a support vector machine, and the learning rate to train a neural network) or to specify the algorithm used to minimize the loss function (e.g., the activation function and optimizer types in a neural network, and the kernel type in a support vector machine) [5].

To build an optimal ML model, a range of possibilities must be explored. The process of designing the ideal model architecture with an optimal hyper-parameter configuration is named hyper-parameter tuning. Tuning hyper-parameters is considered a key component of building an effective ML model, especially for tree-based ML models and deep neural networks, which have many hyper-parameters [6]. The hyper-parameter tuning process differs among ML algorithms due to their different types of hyper-parameters, including categorical, discrete, and continuous hyper-parameters [7]. Manual testing is a traditional way to tune hyper-parameters and is still prevalent in graduate student research, although it requires a deep understanding of the used ML algorithms and their hyper-parameter value settings [8]. However, manual tuning is ineffective for many problems due to certain factors, including a large number of hyper-parameters, complex models, time-consuming model evaluations, and non-linear hyper-parameter interactions. These factors have inspired increased research in techniques for automatic optimization of hyper-parameters; so-called hyper-parameter optimization (HPO)
[9]. The main aim of HPO is to automate the hyper-parameter tuning process and make it possible for users to apply machine learning models to practical problems effectively [3]. The optimal model architecture of a ML model is expected to be obtained after an HPO process. Some important reasons for applying HPO techniques to ML models are as follows [6]:

1. It reduces the human effort required, since many ML developers spend considerable time tuning the hyper-parameters, especially for large datasets or complex ML algorithms with a large number of hyper-parameters.
2. It improves the performance of ML models. Many ML hyper-parameters have different optimums to achieve best performance in different datasets or problems.
3. It makes the models and research more reproducible. Only when the same level of hyper-parameter tuning process is implemented can different ML algorithms be compared fairly; hence, using the same HPO method on different ML algorithms also helps to determine the most suitable ML model for a specific problem.

It is crucial to select an appropriate optimization technique to detect optimal hyper-parameters. Traditional optimization techniques may be unsuitable for HPO problems, since many HPO problems are non-convex or non-differentiable optimization problems, and may result in a local instead of a global optimum [10]. Gradient descent-based methods are a common type of traditional optimization algorithm that can be used to tune continuous hyper-parameters by calculating their gradients [11]. For example, the learning rate in a neural network can be optimized by a gradient-based method.

Compared with traditional optimization methods like gradient descent, many other optimization techniques are more suitable for HPO problems, including decision-theoretic approaches, Bayesian optimization models, multi-fidelity optimization techniques, and metaheuristic algorithms [7]. Apart from detecting continuous hyper-parameters, many of these algorithms also have the capacity to effectively identify discrete, categorical, and conditional hyper-parameters.

Decision-theoretic methods are based on the concept of defining a hyper-parameter search space and then detecting the hyper-parameter combinations in the search space, ultimately selecting the best-performing hyper-parameter combination. Grid search (GS) [12] is a decision-theoretic approach that exhaustively searches for the optimal configuration in a fixed domain of hyper-parameters. Random search (RS) [13] is another decision-theoretic method that randomly selects hyper-parameter combinations in the search space, given limited execution time and resources. In GS and RS, each hyper-parameter configuration is treated independently.

Unlike GS and RS, Bayesian optimization (BO) [14] models determine the next hyper-parameter value based on the previous results of tested hyper-parameter values, which avoids many unnecessary evaluations; thus, BO can detect the optimal hyper-parameter combination within fewer iterations than GS and RS. To be applied to different problems, BO can model the distribution of the objective function using different models as the surrogate function, including Gaussian process (GP), random forest (RF), and tree-structured Parzen estimators (TPE) models [15]. BO-RF and BO-TPE can retain the conditionality of variables [15]. Thus, they can be used to optimize conditional hyper-parameters, like the kernel type and the penalty parameter C in a support vector machine (SVM). However, since BO models work sequentially to balance the exploration of unexplored areas and the exploitation of currently-tested regions, it is difficult to parallelize them.

Training a ML model often takes considerable time and space. Multi-fidelity optimization algorithms are developed to tackle problems with limited resources, the most common of which are bandit-based algorithms. Hyperband [16] is a popular bandit-based optimization technique that can be considered an improved version of RS. It generates small versions of datasets and allocates the same budget to each hyper-parameter combination. In each iteration of Hyperband, poorly-performing hyper-parameter configurations are eliminated to save time and resources.

Metaheuristic algorithms are a set of techniques used to solve complex, large search space and non-convex optimization problems to which HPO problems belong [17]. Among all metaheuristic methods, genetic algorithm (GA) [18] and particle swarm optimization (PSO) [19] are the two most prevalent metaheuristic algorithms used for HPO problems. Genetic algorithms detect well-performing hyper-parameter combinations in each generation, and pass them to the next generation until the best-performing combination is identified. In PSO algorithms, each particle communicates with other particles to detect and update the current global optimum in each iteration until the final optimum is detected. Metaheuristics can efficiently explore the search space to detect optimal or near-optimal solutions. Hence, they are particularly suitable for HPO problems with a large configuration space due to their high efficiency. For instance, they can be used in deep neural networks (DNNs), which have a large configuration space with multiple hyper-parameters, including the activation and optimizer types, the learning rate, drop-out rate, etc.

Although using HPO algorithms to tune the hyper-parameters of ML models greatly improves the model performance, certain other aspects, like their computational complexity, still have much room for improvement. On the other hand, since different HPO models have their own advantages and suitable problems, overviewing them is necessary for proper optimization algorithm selection in terms of different types of ML models and problems. This paper makes the following contributions:

1. It reviews common ML algorithms and their important hyper-parameters.
2. It analyzes common HPO techniques, including their benefits and drawbacks, to help apply them to different ML models by appropriate algorithm selection in practical problems.
3. It surveys common HPO libraries and frameworks for practical use.
4. It discusses the open challenges and research directions of the HPO research domain.

In this survey paper, we begin with a comprehensive introduction of the common optimization techniques used in ML hyper-parameter tuning problems. Section 2 introduces the main concepts of mathematical optimization and hyper-parameter optimization, as well as the general HPO process. In Section 3, we discuss the key hyper-parameters of common ML models that need to be tuned. Section 4 covers the various state-of-the-art optimization approaches that have been proposed for tackling HPO problems. In Section 5, we analyze different HPO methods and discuss how they can be applied to ML algorithms. In Section 6, we provide an introduction to various public libraries and frameworks that are developed to implement HPO. Section 7 presents and discusses the experimental results of using HPO on benchmark datasets for HPO method comparison and practical use case demonstration. In Section 8, we discuss several research directions and open challenges that should be considered to improve current HPO models or develop new HPO approaches. We conclude the paper in Section 9.
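
As an illustration of the two decision-theoretic methods described in this introduction, the following minimal sketch tunes the penalty parameter C and the kernel coefficient of an SVM with grid search and with random search in scikit-learn. The iris dataset, the candidate values, the log-uniform sampling ranges, and the function name compare_grid_and_random_search are assumptions chosen for illustration only; they are not taken from the experiments of this paper.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC


def compare_grid_and_random_search(seed=0):
    """Tune an RBF-kernel SVC on the iris data with GS and with RS."""
    X, y = load_iris(return_X_y=True)

    # Grid search: every combination of the listed values is evaluated
    # (3 values of C x 3 values of gamma = 9 configurations).
    grid = GridSearchCV(
        SVC(kernel="rbf"),
        param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
        cv=3,
    )
    grid.fit(X, y)

    # Random search: a fixed number of configurations is drawn from
    # distributions over the same ranges (illustrative ranges).
    rand = RandomizedSearchCV(
        SVC(kernel="rbf"),
        param_distributions={
            "C": loguniform(1e-1, 1e1),
            "gamma": loguniform(1e-2, 1e0),
        },
        n_iter=9,
        cv=3,
        random_state=seed,
    )
    rand.fit(X, y)
    return grid, rand
```

After fitting, both search objects expose best_params_ and best_score_. With the same number of evaluated configurations, the grid is limited to the listed values, whereas random search can land on values between them.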
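
The elimination idea behind the bandit-based methods mentioned in the introduction can be sketched with successive halving, the routine that Hyperband repeats with different trade-offs between the number of configurations and the budget per configuration. This is a simplified sketch and not a full Hyperband implementation. The function toy_validation_loss is a hypothetical stand-in for training a model with a given budget, and all default values are assumptions.

```python
import random


def toy_validation_loss(learning_rate, budget, rng):
    """Hypothetical stand-in for training with `budget` units of resource.

    The loss is smallest near learning_rate = 0.1, and the evaluation noise
    shrinks as the budget grows, mimicking low- and high-fidelity runs.
    """
    noise = rng.gauss(0.0, 1.0) / budget
    return (learning_rate - 0.1) ** 2 + noise


def successive_halving(n_configs=16, min_budget=1, eta=2, seed=0):
    """Keep the best 1/eta of the configurations each round and give the
    survivors eta times more budget, until one configuration remains."""
    rng = random.Random(seed)
    # Initial candidates are drawn at random, as in random search.
    configs = [rng.uniform(0.0, 1.0) for _ in range(n_configs)]
    budget = min_budget
    history = []  # (budget per configuration, number of configurations)
    while len(configs) > 1:
        ranked = sorted(
            configs, key=lambda lr: toy_validation_loss(lr, budget, rng)
        )
        history.append((budget, len(configs)))
        configs = ranked[: max(1, len(configs) // eta)]
        budget *= eta
    return configs[0], history
```

With the default arguments, the rounds evaluate 16, 8, 4, and 2 configurations on budgets 1, 2, 4, and 8, so every round spends the same total budget while poorly-performing configurations are dropped early.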

2. Mathematical optimization and hyper-parameter optimization problems

The key process of machine learning is to solve optimization problems. To build a ML model, its weight parameters are initialized and optimized by an optimization method until the objective function approaches a minimum value or the accuracy approaches a maximum value [20]. Similarly, hyper-parameter optimization methods aim to optimize the architecture of a ML model by detecting the optimal hyper-parameter configurations. In this section, the main concepts of mathematical optimization and hyper-parameter optimization for machine learning models are discussed.

2.1. Mathematical optimization

Mathematical optimization is the process of finding the best solution from a set of available candidates to maximize or minimize the objective function [20]. Generally, optimization problems can be classified as constrained or unconstrained optimization problems based on whether they have constraints for the decision variables or the solution variables.

In unconstrained optimization problems, a decision variable, $x$, can take any values from the one-dimensional space of all real numbers, $\mathbb{R}$. An unconstrained optimization problem can be denoted by [21]:

$$\min_{x \in \mathbb{R}} f(x), \tag{1}$$

where $f(x)$ is the objective function.

On the other hand, most real-life optimization problems are constrained optimization problems. The decision variable $x$ for constrained optimization problems should be subject to certain constraints, which could be mathematical equalities or inequalities. Therefore, constrained optimization problems or general optimization problems can be expressed as [21]:

$$\begin{aligned} \min_{x} \quad & f(x) \\ \text{subject to} \quad & g_i(x) \le 0, \quad i = 1, 2, \ldots, m, \\ & h_j(x) = 0, \quad j = 1, 2, \ldots, p, \\ & x \in X, \end{aligned} \tag{2}$$

where $g_i(x), i = 1, 2, \ldots, m$, are the inequality constraint functions; $h_j(x), j = 1, 2, \ldots, p$, are the equality constraint functions; and $X$ is the domain of $x$.

The role of constraints is to limit the possible values of the optimal solution to certain areas of the search space, named the feasible region [21]. Thus, the feasible region $D$ of $x$ can be represented by:

$$D = \{ x \in X \mid g_i(x) \le 0, \; h_j(x) = 0 \}. \tag{3}$$

To conclude, an optimization problem consists of three major components: a set of decision variables $x$, an objective function $f(x)$ to be either minimized or maximized, and a set of constraints that allow the variables to take on values in certain ranges (if it is a constrained optimization problem). Therefore, the goal of optimization tasks is to obtain the set of variable values that minimize or maximize the objective function while satisfying any applicable constraints.

Regarding ML models, many HPO problems have certain constraints, like the feasible domain of the number of clusters in k-means, as well as time and space constraints. Therefore, constrained optimization techniques are widely used in HPO problems [3].

For optimization problems, in many cases, only a local instead of a global optimum can be obtained. For example, to obtain the minimum of a problem, assuming $D$ is the feasible region of a decision variable $x$, a global minimum is the point $x^* \in D$ satisfying $f(x^*) \le f(x) \; \forall x \in D$, while a local minimum is a point $x^* \in D$ in a neighborhood $N$ satisfying $f(x^*) \le f(x) \; \forall x \in N \cap D$ [21]. Thus, the local optimum may only be an optimum in a small range instead of being the optimal solution in the entire feasible region.

A local optimum is only guaranteed to be the global optimum in convex functions [22]. Convex functions are the functions that only have one optimum. Therefore, continuing to search along the direction in which the objective function decreases can detect the global minimum value. A function $f(x)$ is a convex function if for $\forall x_1, x_2 \in X, \forall t \in [0, 1]$,

$$f(t x_1 + (1 - t) x_2) \le t f(x_1) + (1 - t) f(x_2), \tag{4}$$

where $X$ is the domain of decision variables, and $t$ is a coefficient in the range of [0,1].

An optimization problem is a convex optimization problem only when the objective function $f(x)$ is a convex function and the feasible region $C$ is a convex set, denoted by [22]:

$$\min_{x} f(x) \quad \text{subject to} \quad x \in C. \tag{5}$$

On the other hand, nonconvex functions have multiple local optimums, but only one of these optimums is the global optimum. Most ML and HPO problems are nonconvex optimization problems. Thus, utilizing inappropriate optimization methods may only result in a local instead of a global optimum.

There are many traditional methods that can be used to solve optimization problems, including gradient descent, Newton's method, conjugate gradient, and heuristic optimization methods [20]. Gradient descent is a commonly-used optimization method that uses the negative gradient direction as the search direction to move towards the optimum. However, gradient descent is not guaranteed to detect the global optimum unless the objective function is a convex function. Newton's method uses the inverse matrix of the Hessian matrix to obtain the optimum. Newton's method has faster convergence speed than gradient descent, but often requires more time and larger space than gradient descent to store and calculate the Hessian matrix. Conjugate gradient searches along the conjugated directions constructed by the gradient of known data points to detect the optimum. Conjugate gradient has faster convergence speed than gradient descent, but its calculation of the conjugate gradient is more complex. Unlike other traditional methods, heuristic methods use empirical rules to solve the optimization problems instead of following systematic steps to obtain the solution. Heuristic methods can often detect the approximate global optimum within a few iterations, but are not guaranteed to detect the global optimum [20].

2.2. Hyper-parameter optimization problem statement

During the design process of ML models, effectively searching the hyper-parameters' space using optimization techniques can identify the optimal hyper-parameters for the models. The hyper-parameter optimization process consists of four main components: an estimator (a regressor or a classifier) with its objective function, a search space (configuration space), a search or optimization method used to find hyper-parameter combinations, and an evaluation function to compare the performance of different hyper-parameter configurations.

The domain of a hyper-parameter can be continuous (e.g., learning rate), discrete (e.g., number of clusters), binary (e.g., whether to use early stopping or not), or categorical (e.g., type of optimizer).
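
The four domain types named above (continuous, discrete, binary, and categorical) can be written down as a small search space from which configurations are drawn. The sketch below is only one possible encoding. The dictionary layout, the bounds, the log-scale sampling of the learning rate, and the names SEARCH_SPACE and sample_configuration are hypothetical choices made for this example.

```python
import math
import random

# Hypothetical search space: one entry per domain type named in the text.
SEARCH_SPACE = {
    "learning_rate": ("continuous", 1e-4, 1e-1),  # sampled on a log scale
    "n_clusters": ("discrete", 2, 10),            # integer range, inclusive
    "early_stopping": ("binary",),                # True or False
    "optimizer": ("categorical", ["sgd", "adam", "rmsprop"]),
}


def sample_configuration(space, rng):
    """Draw one hyper-parameter configuration from a mixed-type space."""
    config = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "continuous":
            low, high = spec[1], spec[2]
            # Uniform in log space, so each order of magnitude is equally likely.
            config[name] = math.exp(rng.uniform(math.log(low), math.log(high)))
        elif kind == "discrete":
            config[name] = rng.randint(spec[1], spec[2])
        elif kind == "binary":
            config[name] = rng.random() < 0.5
        elif kind == "categorical":
            config[name] = rng.choice(spec[1])
        else:
            raise ValueError(f"unknown domain type: {kind}")
    return config
```

Calling sample_configuration(SEARCH_SPACE, random.Random(0)) repeatedly yields the candidate configurations that a search method such as random search would then evaluate.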

Therefore, hyper-parameters are classified as continuous, discrete, and categorical hyper-parameters. For continuous and discrete hyper-parameters, their domains are usually bounded in practical applications [12] [23]. On the other hand, the hyper-parameter configuration space sometimes contains conditionality. A hyper-parameter may need to be used or tuned depending on the value of another hyper-parameter, called a conditional hyper-parameter [10]. For instance, in SVM, the degree of the polynomial kernel function only needs to be tuned when the kernel type is chosen to be polynomial.

In simple cases, all hyper-parameters can take unrestricted real values, and the feasible set $X$ of hyper-parameters can be a real-valued n-dimensional vector space. However, in most cases, the hyper-parameters of a ML model often take on values from different domains and have different constraints, so their optimization problems are often complex constrained optimization problems [24]. For instance, the number of considered features in a decision tree should be in the range of 0 to the number of features, and the number of clusters in k-means should not be larger than the number of data points. Additionally, categorical features can often only take several certain values, like the limited choices of the activation function and the optimizer of a neural network. Therefore, the feasible domain of $X$ often has a complex structure, which increases the problems' complexity [24].

In general, for a hyper-parameter optimization problem, the aim is to obtain [19]:

$$x^* = \arg\min_{x \in X} f(x), \tag{6}$$

where $f(x)$ is the objective function to be minimized, such as the error rate or the root mean squared error (RMSE); $x^*$ is the hyper-parameter configuration that produces the optimum value of $f(x)$; and a hyper-parameter $x$ can take any value in the search space $X$.

The aim of HPO is to achieve optimal or near-optimal model performance by tuning hyper-parameters within the given budgets [3]. The mathematical expression of the function $f$ varies, depending on the objective function of the chosen ML algorithm and the performance metric function. Model performance can be evaluated by various metrics, like accuracy, RMSE, F1-score, and false alarm rate. On the other hand, in practice, time budgets are an essential constraint for optimizing HPO models and must be considered. It often requires a massive amount of time to optimize the objective function of a ML model with a reasonable number of hyper-parameter configurations. Every time a hyper-parameter value is tested, the entire ML model needs to be retrained, and the validation set needs to be processed to generate a score that reflects the model performance. The main process of HPO is as follows [10]:

1. Select the objective function and the performance metrics;
2. Select the hyper-parameters that require tuning, summarize their types, and determine the appropriate optimization technique;
3. Train the ML model using the default hyper-parameter configuration or common values as the baseline model;
4. Start the optimization process with a large search space as the hyper-parameter feasible domain determined by manual testing and/or domain knowledge;
5. Narrow the search space based on the regions of currently-tested well-performing hyper-parameter values, or explore new search spaces if necessary;
6. Return the best-performing hyper-parameter configuration as the final solution.

However, most traditional optimization techniques [25] are unsuitable for HPO, since HPO problems are different from traditional optimization problems in the following aspects [10]:

1. The optimization target, the objective function of ML models, is usually a non-convex and non-differentiable function. Therefore, many traditional optimization methods designed to solve convex or differentiable optimization problems are often unsuitable for HPO problems, since these methods may return a local optimum instead of a global optimum. Additionally, an optimization target lacking smoothness makes certain traditional derivative-free optimization models perform poorly for HPO problems [26].
2. The hyper-parameters of ML models include continuous, discrete, categorical, and conditional hyper-parameters. Thus, many traditional numerical optimization methods [27] that only aim to tackle numerical or continuous variables are unsuitable for HPO problems.
3. It is often computationally expensive to train a ML model on a large-scale dataset. HPO techniques sometimes use data sampling to obtain approximate values of the objective function. Thus, effective optimization techniques for HPO problems should be able to use these approximate values. However, function evaluation time is often ignored in many black-box optimization (BBO) models, so they often require exact instead of approximate objective function values. Consequently, many BBO algorithms are often unsuitable for HPO problems with limited time and resource budgets.

Therefore, appropriate optimization algorithms should be applied to HPO problems to identify optimal hyper-parameter configurations for ML models.

3. Hyper-parameters in machine learning models

To boost ML models by HPO, firstly, we need to find out what the key hyper-parameters are that people need to tune to fit the ML models into specific problems or datasets.

In general, ML models can be classified as supervised and unsupervised learning algorithms, based on whether they are built to model labeled or unlabeled datasets [127]. Supervised learning algorithms are a set of machine learning algorithms that map input features to a target by training on labeled data, and mainly include linear models, k-nearest neighbors (KNN), support vector machines (SVM), naïve Bayes (NB), decision-tree-based models, and deep learning (DL) algorithms [28]. Unsupervised learning algorithms are used to find patterns from unlabeled data and can be divided into clustering and dimensionality reduction algorithms based on their aims. Clustering methods mainly include k-means, density-based spatial clustering of applications with noise (DBSCAN), hierarchical clustering, and expectation–maximization (EM); while two common dimensionality reduction algorithms are principal component analysis (PCA) and linear discriminant analysis (LDA) [29]. Moreover, there are several ensemble learning methods that combine different singular models to further improve model performance, like voting, bagging, and AdaBoost. In this paper, the important hyper-parameters of common ML models are studied based on their names in Python libraries, including scikit-learn (sklearn) [30], XGBoost [31], and Keras [32].

3.1. Supervised learning algorithms

In supervised learning, both the input $x$ and the output $y$ are available, and the goal is to obtain an optimal predictive model function $f^*$ to minimize the cost function $L(f(x), y)$ that models the error between the estimated output and ground-truth labels. The predictive model function $f$ varies based on its model structure. With limited model architectures determined by different hyper-parameter configurations, the domain of the ML model function $f$
is restricted to a set of functions F. Thus, the optimal predictive 3.1.2. KNN

model f can be obtained by [33]: K-nearest neighbor (KNN) is a simple ML algorithm that is used
to classify data points by calculating the distances between differ-
1X n
ent data points. In KNN, the predicted class of each test sample is
f ¼ arg min Lðf ðxi Þ; yi Þ ð7Þ
f 2F n i¼1 set to the class to which most of its k-nearest neighbors in the
training set belong.
where n is the number of training data points, xi is the feature vec-
Assuming the training set T ¼ fðx1 ; y1 Þ; ðx2 ; y2 Þ; . . . ; ðxn ; yn Þg; xi is
tor of the i-th instance, yi is the corresponding actual output, and L
the feature vector of an instance, and yi 2 fc1 ; c2 ; . . . ; cm g is the class
is the cost function value of each sample.
of the instance, i ¼ ð1; 2; . . . nÞ, for a test instance x, its class y can be
Many different loss functions exist in supervised learning algo-
denoted by [40]:
rithms, including the square of Euclidean distance, cross-entropy, X
information gain, etc. [33]. On the other hand, different ML algo- y ¼ arg max I yi ¼ cj ; i ¼ 1; 2; . . . ; n; j ¼ 1; 2; . . . ; m; ð11Þ
cj
rithms generate different predictive model architectures based xi 2Nk ðxÞ
on different hyper-parameter configurations, which will be dis-
where IðxÞ is an indicator function, I ¼ 1 when yi ¼ cj , otherwise
cussed in detail in this subsection.
I ¼ 0; Nk ðxÞ is the field involving the k-nearest neighbors of x.
In KNN, the number of considered nearest neighbors, k, is the
3.1.1. Linear models
most crucial hyper-parameter [41]. If k is too small, the model will
In general, supervised learning models can be classified as
be under-fitting; if k is too large, the model will be over-fitting and
regression and classification techniques when used to predict con-
require high computational time. In addition, the weighted func-
tinuous or discrete target variables, respectively. Linear regression
tion used in the prediction can also be chosen from ‘uniform’
[34] is a typical regression model that predicts a target y by the fol-
(points are weighted equally) or ‘distance’ (points are weighted
lowing equation:
by the inverse of their distance), depending on specific problems.
^ðw; xÞ ¼ w0 þ w1 x1 þ . . . þ wp xp ;
y ð8Þ The distance metric and the power parameter of the Minkowski
metric can also be tuned as it can result in minor improvement.
where the target variable y is expected to be a linear combination of
Lastly, the ‘algorithm’ used to compute the nearest neighbors can
^ is the predicted value. The
p input features x ¼ x1 ; . . . xp , and y
also be chosen from a ball tree, a k-dimensional (KD) tree, or a
weight vector $w = (w_1, \ldots, w_p)$ is designated as an attribute ‘coef_’, and $w_0$ is defined as another attribute ‘intercept_’ in the linear model of sklearn. Usually, no hyper-parameter needs to be tuned in linear regression. A linear model’s performance mainly depends on how well the problem or data follows a linear distribution.

To improve the original linear regression models, ridge regression was proposed in [35]. Ridge regression imposes a penalty on the coefficients, and aims to minimize the objective function [36]:

$$\alpha \lVert w \rVert_2^2 + \sum_{i=1}^{p} (y_i - w_i x_i)^2, \tag{9}$$

where $\lVert w \rVert_2$ is the $L_2$-norm of the coefficient vector, and $\alpha$ is the regularization strength. A larger value of $\alpha$ indicates a larger amount of shrinkage; thus, the coefficients are also more robust to collinearity.

Lasso regression [37] is another linear model used to estimate sparse coefficients, consisting of a linear model with an $L_1$ prior added as the regularization term. It aims to minimize the objective function [36]:

$$\alpha \lVert w \rVert_1 + \sum_{i=1}^{p} (y_i - w_i x_i)^2, \tag{10}$$

where $\alpha$ is the regularization strength and $\lVert w \rVert_1$ is the $L_1$-norm of the coefficient vector. Therefore, the regularization strength $\alpha$ is a crucial hyper-parameter of both ridge and lasso regression models.

Logistic regression (LR) [38] is a linear model used for classification problems. The cost function of LR depends on the regularization method chosen for the penalization. There are three main types of regularization methods in LR: $L_1$-norm, $L_2$-norm, and elastic-net regularization [39].

Therefore, the first hyper-parameter that needs to be tuned in LR is the regularization method used in the penalization, ‘l1’, ‘l2’, ‘elasticnet’ or ‘none’, which is called ‘penalty’ in sklearn. The coefficient ‘C’ is another essential hyper-parameter that determines the regularization strength of the model (in sklearn, ‘C’ is the inverse of the regularization strength, so smaller values mean stronger regularization). In addition, the ‘solver’ type, representing the optimization algorithm type, can be set to ‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, or ‘saga’ in LR. The ‘solver’ type is correlated with ‘penalty’ and ‘C’, so they are conditional hyper-parameters.

brute force search. Typically, the model can determine the most appropriate algorithm itself by setting the ‘algorithm’ to ‘auto’ in sklearn [30].

3.1.3. SVM

A support vector machine (SVM) [42] is a supervised learning algorithm that can be used for both classification and regression problems. SVM algorithms are based on the concept of mapping data points from a low-dimensional into a high-dimensional space to make them linearly separable; a hyperplane is then generated as the classification boundary to partition data points [43]. Assuming there are $n$ data points, the objective function of SVM is [44] [128]:

$$\arg\min_{w} \left\{ \frac{1}{n} \sum_{i=1}^{n} \max\{0,\, 1 - y_i f(x_i)\} + C w^{T} w \right\}, \tag{12}$$

where $w$ is a normalization vector; $C$ is the penalty parameter of the error term, which is an important hyper-parameter of all SVM models.

The kernel function $f(x)$, which is used to measure the similarity between two data points $x_i$ and $x_j$, can be chosen from multiple types of kernels in SVM models. Therefore, the kernel type is a vital hyper-parameter to be tuned. Common kernel types in SVM include linear kernels, radial basis function (RBF) kernels, polynomial kernels, and sigmoid kernels.

The different kernel functions can be denoted as follows [45]:

1. Linear kernel:
$$f(x) = x_i^{T} x_j, \tag{13}$$

2. Polynomial kernel:
$$f(x) = \left(\gamma x_i^{T} x_j + r\right)^{d}, \tag{14}$$

3. RBF kernel:
$$f(x) = \exp\left(-\gamma \lVert x - x' \rVert^{2}\right), \tag{15}$$

4. Sigmoid kernel:
$$f(x) = \tanh\left(\gamma x_i^{T} x_j + r\right), \tag{16}$$
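The four kernel functions in Eqs. (13)-(16) can be written out directly, which makes it easier to see which hyper-parameter enters which kernel. The following is a minimal sketch in plain Python; the function names are hypothetical and chosen only for illustration, and the arguments `gamma`, `r`, and `d` stand for the symbols $\gamma$, $r$, and $d$ in the equations. It is not the implementation used by sklearn.

```python
import math


def dot(a, b):
    """Inner product x_i^T x_j of two equal-length vectors."""
    return sum(u * v for u, v in zip(a, b))


def linear_kernel(xi, xj):
    # Eq. (13): no kernel-specific hyper-parameter.
    return dot(xi, xj)


def polynomial_kernel(xi, xj, gamma=1.0, r=0.0, d=3):
    # Eq. (14): depends on gamma, r and the degree d.
    return (gamma * dot(xi, xj) + r) ** d


def rbf_kernel(x, x_prime, gamma=1.0):
    # Eq. (15): depends on gamma only.
    sq_dist = sum((u - v) ** 2 for u, v in zip(x, x_prime))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(xi, xj, gamma=1.0, r=0.0):
    # Eq. (16): depends on gamma and r.
    return math.tanh(gamma * dot(xi, xj) + r)


# Toy vectors (assumed values, for illustration only).
xi, xj = [1.0, 2.0], [3.0, 0.5]
lin = linear_kernel(xi, xj)                           # 1*3 + 2*0.5 = 4.0
poly = polynomial_kernel(xi, xj, gamma=0.5, r=1.0, d=2)  # (0.5*4 + 1)^2 = 9.0
rbf_same = rbf_kernel(xi, xi, gamma=0.7)              # zero distance gives 1.0
sig = sigmoid_kernel(xi, xj, gamma=0.1, r=0.0)
```

Writing the kernels this way shows why the kernel type is tuned first: the set of remaining hyper-parameters changes with the kernel that is chosen.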
As shown in the kernel function equations, several other hyper-parameters need to be tuned after a kernel type is chosen. The coefficient $\gamma$, denoted by ‘gamma’ in sklearn, is a conditional hyper-parameter of the ‘kernel type’ hyper-parameter when it is set to polynomial, RBF, or sigmoid; $r$, specified by ‘coef0’ in sklearn, is a conditional hyper-parameter of polynomial and sigmoid kernels. Moreover, the polynomial kernel has an additional conditional hyper-parameter $d$ representing the ‘degree’ of the polynomial kernel function. In support vector regression (SVR) models, there is another hyper-parameter, ‘epsilon’, indicating the distance error tolerated by its loss function [30].

3.1.4. Naïve Bayes

Naïve Bayes (NB) [46] algorithms are supervised learning algorithms based on Bayes’ theorem. Assuming there are $n$ features $x_1, \ldots, x_n$, treated as conditionally independent given the target variable $y$, the objective function of naïve Bayes can be denoted by:

$$\hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y), \tag{17}$$

where $P(y)$ is the probability of a value $y$, and $P(x_i \mid y)$ is the conditional probability of $x_i$ given the value of $y$. Depending on the assumed distribution of $P(x_i \mid y)$, there are different types of naïve Bayes classifiers. The four main types of NB models are: Bernoulli NB, Gaussian NB, multinomial NB, and complement NB [47].

For Gaussian NB [48], the likelihood of features is assumed to follow a Gaussian distribution:

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^{2}}} \exp\left(-\frac{(x_i - \mu_y)^{2}}{2\sigma_y^{2}}\right). \tag{18}$$

The maximum likelihood method is used to calculate the mean value, $\mu_y$, and the variance, $\sigma_y^{2}$. Normally, no hyper-parameter needs to be tuned for Gaussian NB. The performance of a Gaussian NB model mainly depends on how well the dataset follows Gaussian distributions.

Multinomial NB [49] is designed for multinomially-distributed data based on the naïve Bayes algorithm. Assuming there are $n$ features, $\theta_{yi}$ is the distribution of each value of the target variable $y$, which equals the conditional probability $P(x_i \mid y)$ that a feature value $i$ is involved in a data point belonging to the class $y$. Based on the concept of relative frequency counting, $\theta_y$ can be estimated by a smoothed version of $\theta_{yi}$ [30]:

$$\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_y + \alpha n}, \tag{19}$$

where $N_{yi}$ is the number of times feature $i$ appears in a data point belonging to class $y$, and $N_y$ is the sum of all $N_{yi}$ ($i = 1, 2, \ldots, n$). The smoothing priors $\alpha \ge 0$ are used for features that are not in the learning samples. When $\alpha = 1$, it is called Laplace smoothing; when $\alpha < 1$, it is called Lidstone smoothing.

Complement NB [50] is an improved version of the standard multinomial NB algorithm and is suitable for processing imbalanced data, while Bernoulli NB [51] requires samples to have binary-valued feature vectors so that the data can follow multivariate Bernoulli distributions. They both have the additive (Laplace/Lidstone) smoothing parameter, $\alpha$, as the main hyper-parameter that needs tuning. To conclude, for naïve Bayes algorithms, developers often do not need to tune hyper-parameters or only need to tune the smoothing parameter $\alpha$, which is a continuous hyper-parameter.

3.1.5. Tree-based models

Decision tree (DT) [52] is a common classification method that uses a tree structure to model decisions and possible consequences by summarizing a set of classification rules from the data. A DT has three main components: a root node representing the entire data; multiple decision nodes indicating decision tests and sub-node splits over each feature; and several leaf nodes representing the result classes [53]. DT algorithms recursively split the training set with better feature values to achieve good decisions on each subset. Pruning, which means removing some of the sub-nodes of decision nodes, is used in DT to avoid over-fitting. Since a deeper tree has more sub-trees to make more accurate decisions, the maximum tree depth, ‘max_depth’, is an essential hyper-parameter that controls the complexity of DT algorithms [54].

There are many other important HPs to be tuned to build effective DT models [55]. Firstly, the quality of splits can be measured by setting a measuring function, denoted by ‘criterion’ in sklearn. Gini impurity and information gain are the two main types of measuring functions. The split selection method, ‘splitter’, can also be set to ‘best’ to choose the best split, or ‘random’ to select a random split. The number of considered features to generate the best split, ‘max_features’, can also be tuned as a feature selection process.

Moreover, there are several discrete hyper-parameters related to the splitting process: the minimum number of data points to split a decision node or to obtain a leaf node, denoted by ‘min_samples_split’ and ‘min_samples_leaf’, respectively; the ‘max_leaf_nodes’, indicating the maximum number of leaf nodes; and the ‘min_weight_fraction_leaf’, meaning the minimum weighted fraction of the total weights. These can also be tuned to improve model performance [30] [55].

Based on the concept of DT models, many decision-tree-based ensemble algorithms have been proposed to improve model performance by combining multiple decision trees, including random forest (RF), extra trees (ET), and extreme gradient boosting (XGBoost) models. RF [56] is an ensemble learning method that uses the bagging method to combine multiple decision trees. In RF, basic DTs are built on many randomly-generated subsets, and the class with the majority of votes is selected as the final classification result [129]. ET [57] is another tree-based ensemble learning method that is similar to RF, but it uses all samples to build DTs and randomly selects the feature sets. In addition, RF optimizes splits on DTs while ET makes the splits randomly.

XGBoost [31] is a popular tree-based ensemble model designed for speed and performance improvement, which uses the boosting and gradient descent methods to combine basic DTs. In XGBoost, the next input sample of a new DT will be related to the results of previous DTs. XGBoost aims to minimize the following objective function [54]:

$$\mathrm{Obj} = -\frac{1}{2} \sum_{j=1}^{t} \frac{G_j^{2}}{H_j + \lambda} + \gamma t, \tag{20}$$

where $t$ is the number of leaves in a decision tree, $G$ and $H$ are the sums of the first and second order gradient statistics of the cost function, and $\gamma$ and $\lambda$ are the penalty coefficients.

Since tree-based ensemble models are built with decision trees as base learners, they have the same hyper-parameters as DT models, as described in this subsection. Apart from these hyper-parameters, RF, ET, and XGBoost all have another crucial hyper-parameter to be tuned, which is the number of decision trees to be combined, denoted by ‘n_estimators’ in sklearn. XGBoost has several additional hyper-parameters, including [58]: ‘min_child_weight’, which means the minimum sum of weights in a child node; ‘subsample’ and ‘colsample_bytree’, used to control the subsampling ratio of instances and features, respectively; and four
continuous hyper-parameters (‘gamma’, ‘alpha’, ‘lambda’, and ‘learning_rate’) indicating the minimum loss reduction for a split, the $L_1$ and $L_2$ regularization terms on weights, and the learning rate, respectively.

3.1.6. Ensemble learning algorithms

Apart from tree-based ensemble models, there are several other generic ensemble learning methods that combine multiple singular ML models to achieve better model performance than any singular algorithm alone. Three common ensemble learning models (voting, bagging, and AdaBoost) are introduced in this subsection [59].

Voting [59] is a basic ensemble learning algorithm that uses the majority voting rule to combine singular estimators and generate a comprehensive estimator with improved accuracy. In sklearn, the voting method can be set to ‘hard’ or ‘soft’, indicating whether to use majority voting or averaged predicted probabilities to determine the classification result. The list of selected single ML estimators and their weights can also be tuned in certain cases. For instance, a higher weight can be assigned to a better-performing singular ML model in a voting model.

Bootstrap aggregating [59], also named bagging, trains multiple base estimators on different randomly-extracted subsets to construct a final predictor [130]. When using bagging methods, the first consideration should be the type and number of base estimators in the ensemble, denoted by ‘base_estimator’ and ‘n_estimators’, respectively. Then, the ‘max_samples’ and ‘max_features’, indicating the sample size and feature size to generate different subsets, can also be tuned.

AdaBoost [59], short for adaptive boosting, is an ensemble learning method that trains multiple base learners (weak learners) consecutively, with later learners emphasizing the mis-classified samples of previous learners; ultimately, a final strong learner is obtained. During this process, incorrectly-classified instances are retrained with other new instances, and their weights are adjusted so that the subsequent classifiers focus more on difficult cases, thereby gradually building a stronger classifier. In AdaBoost, the type of base estimator, ‘base_estimator’, can be set to a decision tree or other methods. In addition, the maximum number of estimators at which boosting is terminated, ‘n_estimators’, and the learning rate that shrinks the contribution of each classifier, should also be tuned to achieve a trade-off between these two hyper-parameters.

3.1.7. Deep learning models

Deep learning (DL) algorithms are widely applied to various areas, like computer vision, natural language processing, and machine translation, since they have had great success solving many types of problems. DL models are based on the theory of artificial neural networks (ANNs). Common types of DL architectures include deep neural networks (DNNs), feedforward neural networks (FFNNs), deep belief networks (DBNs), convolutional neural networks (CNNs), recurrent neural networks (RNNs) and many more [60]. All these DL models have similar hyper-parameters since they have similar underlying neural network architecture. Compared with other ML models, DL models benefit more from HPO since they often have many hyper-parameters that require tuning.

The first set of hyper-parameters is related to the construction of a DL model; hence, they are named model design hyper-parameters. Since all neural network models have an input layer and an output layer, the complexity of a deep learning model mainly depends on the number of hidden layers and the number of neurons in each layer, which are two main hyper-parameters to build DL models [61]. These two hyper-parameters are set and tuned according to the complexity of the datasets or the problems. DL models need to have enough capacity to model objective functions (or prediction tasks) while avoiding over-fitting. At the next stage, certain function types need to be set or tuned. The first function type to configure is the loss function type, which is chosen mainly based on the problem type (e.g., binary cross-entropy for binary classification problems, multi-class cross-entropy for multi-class classification problems, and RMSE for regression problems). Another important hyper-parameter is the activation function type used to model non-linear functions, which can be set to ‘softmax’, ‘rectified linear unit (ReLU)’, ‘sigmoid’, ‘tanh’, or ‘softsign’. Lastly, the optimizer type can be set to stochastic gradient descent (SGD), adaptive moment estimation (Adam), root mean square propagation (RMSprop), etc. [62].

On the other hand, some other hyper-parameters are related to the optimization and training process of DL models; hence, they are categorized as optimizer hyper-parameters. The learning rate is one of the most important hyper-parameters in DL models [63]. It determines the step size at each iteration, which enables the objective function to converge. A large learning rate speeds up the learning process, but the gradient may oscillate around a local minimum value or even fail to converge. On the other hand, a small learning rate converges smoothly, but will largely increase model training time by requiring more training epochs. An appropriate learning rate should enable the objective function to converge to a global minimum in a reasonable amount of time. Another common hyper-parameter is the drop-out rate. Drop-out is a standard regularization method for DL models proposed to reduce over-fitting. In drop-out, a proportion of neurons are randomly removed, and the percentage of neurons to be removed should be tuned.

Mini-batch size and the number of epochs are the other two DL hyper-parameters that represent the number of processed samples before updating the model, and the number of complete passes through the entire training set, respectively [64]. Mini-batch size is affected by the resource requirements of the training process and the number of iterations. The number of epochs depends on the size of the training set and should be tuned by slowly increasing its value until validation accuracy starts to decrease, which indicates over-fitting. On the other hand, DL models often converge within a few epochs, and the following epochs may lead to unnecessary additional execution time and over-fitting, which can be avoided by the early stopping method. Early stopping is a form of regularization whereby model training stops in advance when validation accuracy does not increase after a certain number of consecutive epochs. The number of waiting epochs, called early stop patience, can also be tuned to reduce model training time.

Apart from traditional DL models, transfer learning (TL) is a technology that obtains a pre-trained model on the data in a related domain and transfers it to other target tasks [65]. To transfer a DL model from one problem to another problem, a certain number of top layers are frozen, and only the remaining layers are retrained to fit the new problem. Therefore, the number of frozen layers is a vital hyper-parameter to tune if TL is used.

3.2. Unsupervised learning algorithms

Unsupervised learning algorithms are a set of ML algorithms used to identify unknown patterns in unlabeled datasets. Clustering and dimensionality-reduction algorithms are the two main types of unsupervised learning methods. Clustering methods include k-means, DBSCAN, EM, hierarchical clustering, etc.; while PCA and LDA are two commonly-used dimensionality reduction algorithms [29].

3.2.1. Clustering algorithms

For most clustering algorithms, including k-means, EM, and hierarchical clustering, the number of clusters is the most important hyper-parameter to tune [66].
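As a rough illustration of why the number of clusters needs tuning, the sketch below runs a minimal one-dimensional k-means (Lloyd's algorithm) for several values of k and records the resulting sum of squared errors. Everything here is an assumption made for illustration: the function name `kmeans_1d`, the toy data, and the deterministic initialization are hypothetical, and in practice a library implementation such as the one in sklearn would normally be used instead.

```python
def kmeans_1d(points, k, max_iter=100):
    """Minimal 1-D k-means; returns (centroids, sum of squared errors)."""
    pts = sorted(points)
    n = len(pts)
    # Deterministic initialization (an assumption of this sketch):
    # spread the starting centroids evenly over the sorted data.
    centroids = [pts[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda j: (p - centroids[j]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        updated = [
            sum(c) / len(c) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    sse = sum(min((p - c) ** 2 for c in centroids) for p in pts)
    return centroids, sse


# Toy data with three visibly separated groups (assumed values).
data = [0.8, 1.0, 1.2, 4.9, 5.0, 5.1, 8.8, 9.0, 9.2]
sse_by_k = {k: kmeans_1d(data, k)[1] for k in range(1, 5)}
for k, sse in sse_by_k.items():
    print(k, round(sse, 3))
```

On this toy data the error drops sharply up to k = 3 and only marginally afterwards, which is the kind of pattern usually inspected when choosing the number of clusters. The error alone cannot select k, because it keeps shrinking as k grows.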
The k-means algorithm [67] uses k prototypes, indicating the centroids of clusters, to cluster data. In k-means algorithms, the number of clusters, ‘n_clusters’, must be specified, and the clusters are determined by minimizing the sum of squared errors [68]:

$$\sum_{i=0}^{n_k} \min_{u_j \in C_k} \left( \lVert x_i - u_j \rVert^{2} \right), \tag{21}$$

where $(x_1, \ldots, x_n)$ is the data matrix; $u_j$, also called the centroid of the cluster $C_k$, is the mean of the samples in the cluster; and $n_k$ is the number of sample points in the cluster $C_k$.

To tune k-means, ‘n_clusters’ is the most crucial hyper-parameter. Besides this, the method for centroid initialization, ‘init’, could be set to ‘k-means++’, ‘random’ or a human-defined array, which slightly affects model performance. In addition, ‘n_init’, denoting the number of times that the k-means algorithm will be executed with different centroid seeds, and ‘max_iter’, the maximum number of iterations in a single execution of k-means, also have slight impacts on model performance [30].

The expectation–maximization (EM) algorithm [69] is an iterative algorithm used to find the maximum likelihood estimates of parameters. The Gaussian mixture model is a clustering method that uses a mixture of Gaussian distributions to model data by implementing the EM method. Similar to k-means, its major hyper-parameter to be tuned is ‘n_components’, indicating the number of clusters or Gaussian distributions. Additionally, different methods can be chosen to constrain the covariance of the estimated classes in Gaussian mixture models, including ‘full covariance’, ‘tied’, ‘diagonal’ or ‘spherical’ [70]. Other hyper-parameters could also be tuned, including ‘max_iter’ and ‘tol’, representing the number of EM iterations to perform and the convergence threshold, respectively [30].

Hierarchical clustering [71] methods build clusters by continuously merging or splitting the built-in clusters. The hierarchy of clusters is represented by a tree structure; its root indicates the unique cluster gathering all samples, and its leaves represent the clusters with only one sample [71]. In sklearn, ‘AgglomerativeClustering’ is a common type of hierarchical clustering. In agglomerative clustering, the linkage criterion, ‘linkage’, determines the distance between sets of observations and can be set to ‘ward’, ‘complete’, ‘average’, or ‘single’, indicating whether to minimize the variance of the clusters being merged, or use the maximum, average, or minimum distance between every two clusters, respectively. Like other clustering methods, its main hyper-parameter is the number of clusters, ‘n_clusters’. However, ‘n_clusters’ cannot be set if we choose to set the ‘distance_threshold’, the linkage distance threshold for merging clusters, since in that case ‘n_clusters’ will be determined automatically.

DBSCAN [72] is a density-based clustering method that determines the clusters by dividing data into clusters with sufficiently high density. Unlike other clustering models, the number of clusters does not need to be configured before training. Instead, DBSCAN has two significant conditional hyper-parameters, the scan radius represented by ‘eps’ and the minimum number of considered neighbor points represented by ‘min_samples’, which define the cluster density together [73]. DBSCAN works by starting with an unvisited point and detecting all its neighbor points within a pre-defined distance ‘eps’. If the number of neighbor points reaches the value of ‘min_samples’, this unvisited point and all its neighbors are defined as a cluster. The procedures are executed recursively until all data points have been visited. A higher ‘min_samples’ or a lower ‘eps’ indicates a higher density to form a cluster.

3.2.2. Dimensionality reduction algorithms

The increasing amount of collected data provides ample information, but also increases problem complexity. In real-world applications, many features are irrelevant or redundant to predict target variables. Dimensionality reduction algorithms often serve as feature engineering methods to extract important features and eliminate insignificant or redundant features. Two common dimensionality-reduction algorithms are principal component analysis (PCA) and linear discriminant analysis (LDA). In PCA and LDA, the number of features to be extracted, represented by ‘n_components’ in sklearn, is the main hyper-parameter to be tuned.

Principal component analysis (PCA) [74] is a widely used linear dimensionality reduction method. PCA is based on the concept of mapping the original n-dimensional features into k-dimensional features as the new orthogonal features, also called the principal components. PCA works by calculating the covariance matrix of the data matrix to obtain the eigenvectors of the covariance matrix. The projection matrix comprises the k eigenvectors with the largest eigenvalues (i.e., the largest variance). Consequently, the data matrix can be transformed into a new space with a reduced dimensionality. Singular value decomposition (SVD) [75] is a popular method used to obtain the eigenvalues and eigenvectors of the covariance matrix of PCA. Therefore, in addition to ‘n_components’, the SVD solver type is another hyper-parameter of PCA to be tuned, which can be assigned to ‘auto’, ‘full’, ‘arpack’ or ‘randomized’ [30].

Linear discriminant analysis (LDA) [76] is another common dimensionality reduction method that projects the features onto the most discriminative directions. Unlike PCA, which obtains the direction with the largest variance as the principal component, LDA optimizes the feature subspace for classification. The objective of LDA is to minimize the variance inside each class and maximize the variance between different classes after projection. Thus, the projection points in each class should be as close as possible, and the distance between the center points of different classes should be as large as possible. Similar to PCA, the number of features to be extracted, ‘n_components’, should be tuned in LDA models. Additionally, the solver type of LDA can also be set to ‘svd’ for SVD, ‘lsqr’ for least-squares solution, or ‘eigen’ for eigenvalue decomposition [77]. LDA also has a conditional hyper-parameter, the shrinkage parameter, ‘shrinkage’, which can be set to a float value along with ‘lsqr’ and ‘eigen’ solvers.

4. Hyper-parameter optimization techniques

4.1. Model-free algorithms

4.1.1. Babysitting

Babysitting, also called ‘Trial and Error’ or grad student descent (GSD), is a basic hyper-parameter tuning method [8]. This method is implemented by 100% manual tuning and is widely used by students and researchers. The workflow is simple: after building a ML model, a student tests many possible hyper-parameter values based on experience, guessing, or the analysis of previously-evaluated results; the process is repeated until this student runs out of time (often reaching a deadline) or is satisfied with the results. As such, this approach requires a sufficient amount of prior knowledge and experience to identify optimal hyper-parameter values with limited time.

Manual tuning is infeasible for many problems due to several factors, like a large number of hyper-parameters, complex models, time-consuming model evaluations, and non-linear hyper-parameter interactions [9]. These factors inspired increased research into techniques for the automatic optimization of hyper-parameters [78].

4.1.2. Grid search

Grid search (GS) is one of the most commonly-used methods to explore hyper-parameter configuration space [120]. GS can be considered an exhaustive search or a brute-force method that evaluates all the hyper-parameter combinations given to the grid of configurations [131]. GS works by evaluating the Cartesian product of a user-specified finite set of values [6].

GS cannot exploit the well-performing regions further by itself. Therefore, to identify the global optimums, the following procedure needs to be performed manually [2]:

1. Start with a large search space and step size.
2. Narrow the search space and step size based on the previous results of well-performing hyper-parameter configurations.
3. Repeat step 2 multiple times until an optimum is reached.

GS can be easily implemented and parallelized. However, the main drawback of GS is its inefficiency for high-dimensionality hyper-parameter configuration space, since the number of evaluations increases exponentially as the number of hyper-parameters grows. This exponential growth is referred to as the curse of dimensionality [79]. For GS, assuming that there are $k$ parameters, and each of them has $n$ distinct values, its computational complexity increases exponentially at a rate of $O(n^{k})$ [19]. Thus, only when the hyper-parameter configuration space is small can GS be an effective HPO method.

4.1.3. Random search

To overcome certain limitations of GS, random search (RS) was proposed in [13]. RS is similar to GS; but, instead of testing all values in the search space, RS randomly selects a pre-defined number of samples between the upper and lower bounds as candidate hyper-parameter values, and then trains these candidates until the defined budget is exhausted. The theoretical basis of RS is that if the configuration space is large enough, then the global optimums, or at least their approximations, can be detected. With a limited budget, RS is able to explore a larger search space than GS [13].

The main advantage of RS is that it is easy to parallelize and to allocate resources for, since each evaluation is independent. Unlike GS, RS samples a fixed number of parameter combinations from the specified distribution, which improves system efficiency by reducing the probability of wasting much time on a small poorly-performing region. Since the number of total evaluations in RS is set to a fixed value $n$ before the optimization process starts, the computational complexity of RS is $O(n)$ [80]. In addition, RS can detect the global optimum or the near-global optimum when given enough budget [6].

Although RS is more efficient than GS for large search spaces, there are still a large number of unnecessary function evaluations since it does not exploit the previously well-performing regions [2].

To conclude, the main limitation of both RS and GS is that every evaluation in their iterations is independent of previous evaluations; thus, they waste a large amount of time evaluating poorly-performing areas of the search space. This issue can be solved by other optimization methods, like Bayesian optimization, which uses previous evaluation records to determine the next evaluation [14].

4.2. Gradient-based optimization

Gradient descent [81] is a traditional optimization technique that calculates the gradient of variables to identify the promising direction and moves towards the optimum. After randomly selecting a data point, the technique moves towards the opposite direction of the largest gradient to locate the next data point. Therefore, a local optimum can be reached after convergence. The local optimum is also the global optimum for convex functions. Gradient-based algorithms have a time complexity of $O(n^{k})$ for optimizing $k$ hyper-parameters [82].

For specific machine learning algorithms, the gradient of certain hyper-parameters can be calculated, and then gradient descent can be used to optimize these hyper-parameters. Although gradient-based algorithms converge to a local optimum faster than the methods presented in Section 4.1, they have several limitations. Firstly, they can only be used to optimize continuous hyper-parameters because other types of hyper-parameters, like categorical hyper-parameters, do not have gradient directions. Secondly, they are only efficient for convex functions because a local instead of a global optimum may be reached for non-convex functions [2]. Therefore, the gradient-based algorithms can only be used in some cases where it is possible to obtain the gradient of hyper-parameters; e.g., optimizing the learning rate in neural networks (NN) [11]. Still, it is not guaranteed for ML algorithms to identify global optimums using gradient-based optimization techniques.

4.3. Bayesian optimization

Bayesian optimization (BO) [83] is an iterative algorithm that is popularly used for HPO problems. Unlike GS and RS, BO determines the future evaluation points based on the previously-obtained results. To determine the next hyper-parameter configuration, BO uses two key components: a surrogate model and an acquisition function [56]. The surrogate model aims to fit all the currently-observed points into the objective function. After obtaining the predictive distribution of the probabilistic surrogate model, the acquisition function determines the usage of different points by balancing the trade-off between exploration and exploitation. Exploration is to sample the instances in the areas that have not been sampled, while exploitation is to sample in the currently promising regions where the global optimum is most likely to occur, based on the posterior distribution. BO models balance the exploration and the exploitation processes to detect the current most likely optimal regions and avoid missing better configurations in the unexplored areas [84].

The basic procedures of BO are as follows [83]:

1. Build a probabilistic surrogate model of the objective function.
2. Detect the optimal hyper-parameter values on the surrogate model.
3. Apply these hyper-parameter values to the real objective function to evaluate them.
4. Update the surrogate model with new results.
5. Repeat steps 2–4 until the maximum number of iterations is reached.

Thus, BO works by updating the surrogate model after each evaluation on the objective function. BO is more efficient than GS and RS since it can detect the optimal hyper-parameter combinations by analyzing the previously-tested values, and running a surrogate model is often much cheaper than running the entire objective function.

However, since Bayesian optimization models are executed based on the previously-tested values, they are sequential methods that are difficult to parallelize; nevertheless, they can usually detect near-optimal hyper-parameter combinations within a few iterations [7].

Common surrogate models for BO include Gaussian process (GP) [85], random forest (RF) [86], and the tree-structured Parzen estimator (TPE) [12]. Therefore, there are three main types of BO algorithms based on their surrogate models: BO-GP, BO-RF, BO-TPE. An alternative name for BO-RF is sequential model-based algorithm configuration (SMAC) [86].

4.3.1. BO-GP

Gaussian process (GP) is a standard surrogate model for objective function modeling in BO [83]. Assuming that the function $f$ with a mean $\mu$ and a covariance $\sigma^{2}$ is a realization of a GP, the predictions follow a normal distribution [87]:

$$p(y \mid x, D) = N(y \mid \hat{\mu}, \hat{\sigma}^{2}), \tag{22}$$

where $D$ is the configuration space of hyper-parameters, and $y = f(x)$ is the evaluation result of each hyper-parameter value $x$. After obtaining a set of predictions, the points to be evaluated next are then selected from the confidence intervals generated by the BO-GP model. Each newly-tested data point is added to the sample records, and the BO-GP model is re-built with the new information. This procedure is repeated until termination. Applying a BO-GP to a size $n$ dataset has a time complexity of $O(n^{3})$ and space complexity of $O(n^{2})$ [88]. One main limitation of BO-GP is that its cubic complexity in the number of instances limits its capacity for parallelization [3]. Additionally, it is mainly used to optimize continuous variables.

4.3.2. SMAC

Random forest (RF) is another popular surrogate function for BO to model the objective function using an ensemble of regression trees. BO using RF as the surrogate model is also called SMAC [86]. Assuming that there is a Gaussian model $N(y \mid \hat{\mu}, \hat{\sigma}^{2})$, and $\hat{\mu}$ and $\hat{\sigma}^{2}$ are the mean and variance of the regression function $r(x)$,

$$p(x \mid y, D) = \begin{cases} l(x), & \text{if } y < y^{*} \\ g(x), & \text{if } y > y^{*} \end{cases}. \tag{25}$$

After that, the expected improvement in the acquisition function is reflected by the ratio between the two density functions, which is used to determine the new configurations for evaluation. The Parzen estimators are organized in a tree structure, so the specified conditional dependencies are retained. Therefore, TPE naturally supports specified conditional hyper-parameters [87]. The time complexity of BO-TPE is $O(n \log n)$, which is lower than the complexity of BO-GP [3].

BO methods are effective for many HPO problems, even if the objective function $f$ is stochastic, non-convex, or non-continuous. However, the main drawback of BO models is that, if they fail to achieve the balance between exploration and exploitation, they might only reach a local instead of a global optimum. RS does not have this limitation since it does not focus on any specific area. Additionally, it is difficult to parallelize BO models since their intermediate results are dependent on each other [7].

4.4. Multi-fidelity optimization algorithms

One major issue with HPO is the long execution time, which increases with a larger hyper-parameter configuration space and larger datasets. The execution time can take several hours, several days, or even more [89]. Multi-fidelity optimization techniques are common approaches to solve the constraint of limited time and resources. To save time, people can use a subset of the original dataset or a subset of the features [90]. Multi-fidelity involves low-fidelity and high-fidelity evaluations and combines them for
respectively, then [86]: practical applications [91]. In low-fidelity evaluations, a relatively
small subset is evaluated at a low cost but with poor generalization
1 X
l^ ¼ r ðxÞ; ð23Þ performance. In high-fidelity evaluations, a relatively large subset
jBj r2B is evaluated with better generalization performance but at a higher
cost than low-fidelity evaluations. In multi-fidelity optimization
1 X algorithms, poorly-performing configurations are discarded after
r^ 2 ¼ ð r ð xÞ l
^ Þ2 ; ð24Þ
jBj 1 each round of hyper-parameter evaluation on generated subsets,
r2B
and only well-performing hyper-parameter configurations will be
where B is a set of regression trees in the forest. The major proce- evaluated on the entire training set.
dures of SMAC are as follows [3]: Bandit-based algorithms categorized to multi-fidelity optimiza-
tion algorithms have shown success dealing with deep learning
1. RF starts with building B regression trees, each constructed by optimization problems [3]. Two common bandit-based techniques
sampling n instances from the training set with replacement. are successive halving [92] and Hyperband [16].
2. A split node is selected from d hyper-parameters for each tree.
3. To maintain a low computational cost, both the minimum num- 4.4.1. Successive halving
ber of instances considered for further split and the number of Theoretically speaking, exhaustive methods are able to identify
trees to grow are set to a certain value. the optimal hyper-parameter combination by evaluating all the
4. Finally, the mean and variance for each new configuration are given combinations. However, many factors, including limited
estimated by RF. time and resources, should be considered in practical applications.
These factors are called budgets (B). To overcome the limitations of
Compared with BO-GP, the main advantage of SMAC is its sup- GS and RS and to improve efficiency, successive halving algorithms
port for all types of variables, including continuous, discrete, cate- were proposed in [92].
gorical, and conditional hyper-parameters [87]. The time The main process of using successive halving algorithms for
complexities of using SMAC to fit and predict variances are HPO is as follows. Firstly, it is presumed that there are n sets of
OðnlognÞ and OðlognÞ, respectively, which are much lower than hyper-parameter combinations, and that they are evaluated with
the complexities of BO-GP [3]. uniformly-allocated budgets (b ¼ B=n). Then, according to the eval-
uation results for each iteration, half of the poorly-performing
4.3.3. BO-TPE hyper-parameter configurations are eliminated, and the better-
Tree-structured Parzen estimator (TPE) [12] is another com- performing half is passed to the next iteration with double budgets
mon surrogate model for BO. Instead of defining a predictive dis- (biþ1 ¼ 2 bi ). The above process is repeated until the final opti-
tribution used in BO-GP, BO-TPE creates two density functions, mal hyper-parameter combination is detected.
lðxÞ and g ðxÞ, to act as the generative models for all domain vari- Successive halving is more efficient than RS, but is affected by
ables [3]. To apply TPE, the observation results are divided into the trade-off between the number of hyper-parameter configura-
good results and poor results by a pre-defined percentile y , tions and the budgets allocated to each configuration [6]. Thus,
and the two sets of results are modeled by simple Parzen win- the main concern of successive halving is how to allocate the bud-
dows [12]: get and how to determine whether to test fewer configurations
with a higher budget for each or to test more configurations with a lower budget for each [2].

4.4.2. Hyperband
Hyperband [16] was then proposed to solve the dilemma of successive halving algorithms by dynamically choosing a reasonable number of configurations. It aims to achieve a trade-off between the number of hyper-parameter configurations ($n$) and their allocated budgets by dividing the total budgets ($B$) into $n$ pieces and allocating these pieces to each configuration ($b = B/n$). Successive halving serves as a subroutine on each set of random configurations to eliminate the poorly-performing hyper-parameter configurations and improve efficiency. The main steps of Hyperband algorithms are shown in Algorithm 1 [2].

Algorithm 1: Hyperband

Input: $b_{max}$, $b_{min}$
1: $s_{max} = \log(b_{max} / b_{min})$
2: for $s \in \{s_{max}, s_{max} - 1, \ldots, 0\}$ do
3:     $n$ = DetermineBudget($s$)
4:     $c$ = SampleConfigurations($n$)
5:     SuccessiveHalving($c$)
6: end for
7: return the best configuration so far.

Firstly, the budget constraints $b_{min}$ and $b_{max}$ are determined by the total number of data points, the minimum number of instances required to train a sensible model, and the available budgets. After that, the number of configurations $n$ and the budget size allocated to each configuration are calculated based on $b_{min}$ and $b_{max}$ in steps 2–3 of Algorithm 1. The configurations are sampled based on $n$ and $b$, and then passed to the successive halving model demonstrated in steps 4–5. The successive halving algorithm discards the identified poorly-performing configurations and passes the well-performing configurations on to the next iteration. This process is repeated until the final optimal hyper-parameter configuration is identified. By involving the successive halving searching method, Hyperband has a computational complexity of $O(n \log n)$ [16].

4.4.3. BOHB
Bayesian Optimization HyperBand (BOHB) [93] is a state-of-the-art HPO technique that combines Bayesian optimization and Hyperband to incorporate the advantages of both while avoiding their drawbacks. The original Hyperband uses a random search to search the hyper-parameter configuration space, which has a low efficiency. BOHB replaces the RS method with BO to achieve both high performance and low execution time by effectively using parallel resources to optimize all types of hyper-parameters. In BOHB, TPE is the standard surrogate model for BO, but it uses multidimensional kernel density estimators. Therefore, the complexity of BOHB is also $O(n \log n)$ [93].

It has been shown that BOHB outperforms many other optimization techniques when tuning SVM and DL models [93]. The only limitation of BOHB is that it requires the evaluations on subsets with small budgets to be representative of evaluations on the entire training set; otherwise, BOHB may have a slower convergence speed than standard BO models.

4.5. Metaheuristic algorithms

Metaheuristic algorithms [94] are a set of algorithms mainly inspired by biological theories and widely used for optimization problems. Unlike many traditional optimization methods, metaheuristics have the capacity to solve non-convex, non-continuous, and non-smooth optimization problems.

Population-based optimization algorithms (POAs) are a major type of metaheuristic algorithm, including genetic algorithms (GAs), evolutionary algorithms, evolutionary strategies, and particle swarm optimization (PSO). POAs start by creating a population and updating it in each generation; each individual in every generation is then evaluated until the global optimum is identified [14]. The main differences between different POAs are the methods used to generate and select populations [17]. POAs can be easily parallelized since a population of $N$ individuals can be evaluated on at most $N$ threads or machines in parallel [6]. Genetic algorithms and particle swarm optimization are the two POAs most widely used for HPO problems.

4.5.1. Genetic algorithm

Genetic algorithm (GA) [18] is one of the common metaheuristic algorithms based on the evolutionary theory that individuals with the best survival capability and adaptability to the environment are more likely to survive and pass on their capabilities to future generations. The next generation will also inherit their parents’ characteristics and may involve better and worse individuals. Better individuals will be more likely to survive and have more capable offspring, while the worse individuals will gradually disappear. After several generations, the individual with the best adaptability will be identified as the global optimum [95].

To apply GA to HPO problems, each chromosome or individual represents a hyper-parameter, and its decimal value is the actual input value of the hyper-parameter in each evaluation. Every chromosome has several genes, which are binary digits; crossover and mutation operations are then performed on the genes of this chromosome. The population involves all possible values within the initialized chromosome/parameter ranges, while the fitness function characterizes the evaluation metrics of the parameters [95].

Since the randomly-initialized parameter values often do not include the optimal parameter values, several operations, including selection, crossover, and mutation operations, must be performed on the well-performing chromosomes to identify the optimums [18]. Chromosome selection is implemented by selecting those chromosomes with good fitness function values. To keep the population size unchanged, the chromosomes with good fitness function values are passed to the next generation with higher probability, where they generate new chromosomes with the parents’ best characteristics. Chromosome selection ensures that good characteristics of each generation can be passed to later generations. Crossover is used to generate new chromosomes by exchanging a proportion of genes in different chromosomes. Mutation operations are also used to generate new chromosomes by randomly altering one or more genes of a chromosome. Crossover and mutation operations enable later generations to have different characteristics and reduce the chance of missing good characteristics [3].

The main procedures of GA are as follows [94]:

1. Randomly initialize the population, chromosomes, and genes, which represent the entire search space, hyper-parameters, and hyper-parameter values, respectively.
2. Evaluate the performance of each individual in the current generation by calculating the fitness function, which indicates the objective function of an ML model.
3. Perform selection, crossover, and mutation operations on the chromosomes to produce a new generation involving the next hyper-parameter configurations to be evaluated.
4. Repeat steps 2 & 3 until the termination condition is met.
5. Terminate and output the optimal hyper-parameter configuration.
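
The five GA steps listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the single integer hyper-parameter, the toy fitness function peaking at 100, tournament selection, single-point crossover, and all constants (`N_GENES`, `POP_SIZE`, `N_GENERATIONS`, `CROSSOVER_RATE`, `MUTATION_RATE`) are hypothetical choices made only for this example. In a real HPO setting, `fitness` would train and validate an ML model.

```python
import random

# All constants below are illustrative assumptions, not values from the paper.
N_GENES = 8            # binary genes per chromosome -> integer values 0..255
POP_SIZE = 20
N_GENERATIONS = 30
CROSSOVER_RATE = 0.9
MUTATION_RATE = 0.02


def decode(chromosome):
    """Decimal value of a binary chromosome, i.e. the hyper-parameter value."""
    return int("".join(map(str, chromosome)), 2)


def fitness(chromosome):
    """Toy stand-in for a validation score; it peaks at the value 100."""
    return -(decode(chromosome) - 100) ** 2


def tournament(population, rng, k=3):
    """Selection: return the fittest of k randomly drawn chromosomes."""
    return max(rng.sample(population, k), key=fitness)


def crossover(a, b, rng):
    """Single-point crossover: exchange the gene tails of two parents."""
    if rng.random() < CROSSOVER_RATE:
        point = rng.randint(1, N_GENES - 1)
        return a[:point] + b[point:], b[:point] + a[point:]
    return a[:], b[:]


def mutate(chromosome, rng):
    """Flip each gene independently with probability MUTATION_RATE."""
    return [1 - g if rng.random() < MUTATION_RATE else g for g in chromosome]


def run_ga(seed=0):
    rng = random.Random(seed)
    # Step 1: randomly initialize the population of binary chromosomes.
    population = [[rng.randint(0, 1) for _ in range(N_GENES)]
                  for _ in range(POP_SIZE)]
    best = max(population, key=fitness)      # Step 2 for the first generation
    for _ in range(N_GENERATIONS):           # Step 4: repeat steps 2 and 3
        children = []
        while len(children) < POP_SIZE:      # Step 3: selection, crossover, mutation
            c1, c2 = crossover(tournament(population, rng),
                               tournament(population, rng), rng)
            children += [mutate(c1, rng), mutate(c2, rng)]
        population = children[:POP_SIZE]
        # Step 2: evaluate the new generation and remember the best individual.
        best = max(population + [best], key=fitness)
    return decode(best), fitness(best)       # Step 5: output the best found


if __name__ == "__main__":
    print(run_ga(seed=0))
```

Calling `run_ga(seed=0)` returns the best decoded value found and its fitness. Keeping the best individual seen so far is a simple form of elitism added here so that the returned result never gets worse across generations; the text above does not prescribe it.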

Among the above steps, the population initialization step is an important step of GA and PSO since it provides an initial guess of the optimal values. Although the initialized values will be iteratively improved in the optimization process, a suitable population initialization method can significantly improve the convergence speed and performance of POAs. A good initial population of hyper-parameters should involve individuals that are close to global optimums by covering the promising regions and should not be localized to an unpromising region of the search space [96].

To generate hyper-parameter configuration candidates for the initial population, random initialization that simply creates the initial population with random values in the given search space is often used in GA [97]. Thus, GA is easily implemented and does not necessitate good initializations, because its selection, crossover, and mutation operations lower the possibility of missing the global optimum.

Hence, it is useful when the data analyst does not have much experience determining a potential appropriate initial search space for the hyper-parameters. The main limitation of GA is that the algorithm itself introduces additional hyper-parameters to be configured, including the fitness function type, population size, crossover rate, and mutation rate. Moreover, GA is a sequential execution algorithm, making it difficult to parallelize. The time complexity of GA is $O(n^2)$ [98]. As a result, sometimes, GA may be inefficient due to its low convergence speed.

4.5.2. Particle swarm optimization
Particle swarm optimization (PSO) [99] is another set of evolutionary algorithms that are commonly used for optimization problems. PSO algorithms are inspired by biological populations that exhibit both individual and social behaviors [17]. PSO works by enabling a group of particles (swarm) to traverse the search space in a semi-random manner [9]. PSO algorithms identify the optimal solution through cooperation and information sharing among individual particles in a group.

In PSO, there is a group of $n$ particles in a swarm $S$ [2]:

$$S = (S_1, S_2, \ldots, S_n), \tag{26}$$

and each particle $S_i$ is represented by a vector:

$$S_i = \langle \vec{x}_i, \vec{v}_i, \vec{p}_i \rangle, \tag{27}$$

where $\vec{x}_i$ is the current position, $\vec{v}_i$ is the current velocity, and $\vec{p}_i$ is the best known position of particle $i$ so far.

After initializing the position and velocity of each particle, it evaluates the current position and records the position with its performance score. In the next iteration, the velocity $\vec{v}_i$ of each particle is changed based on its best known position $\vec{p}_i$ and the current global optimal position $\vec{p}$:

$$\vec{v}_i := \vec{v}_i + U(0, \varphi_1)\left(\vec{p}_i - \vec{x}_i\right) + U(0, \varphi_2)\left(\vec{p} - \vec{x}_i\right), \tag{28}$$

where $U(0, \varphi)$ denotes continuous uniform distributions based on the acceleration constants $\varphi_1$ and $\varphi_2$.

After that, the particles move based on their new velocity vectors:

$$\vec{x}_i := \vec{x}_i + \vec{v}_i. \tag{29}$$

The above procedures are repeated until convergence or termination constraints are reached.

Compared with GA, it is easier to implement PSO, since PSO does not have certain additional operations like crossover and mutation. In GA, all chromosomes share information with each other, so the entire population moves uniformly toward the optimal region; while in PSO, only information on the individual best particle and the global best particle is transmitted to others, which is a one-way flow of information sharing, and the entire search process follows the direction of the current optimal solution [2]. The computational complexity of the PSO algorithm is $O(n \log n)$ [100]. In most cases, the convergence speed of PSO is faster than that of GA. In addition, particles in PSO operate independently and only need to share information with each other after each iteration, so this process is easily parallelized to improve model efficiency [9].

The main limitation of PSO is that it requires proper population initialization; otherwise, it might only reach a local instead of a global optimum, especially for discrete hyper-parameters [101]. Proper population initialization requires developers’ prior experience or the use of population initialization techniques. Many population initialization techniques have been proposed to improve the performance of evolutionary algorithms, like the opposition-based optimization algorithm [97] and the space transformation search method [102]. Involving additional population initialization techniques will require more execution time and resources.

5. Applying optimization techniques to machine learning algorithms

5.1. Optimization techniques analysis

Grid search (GS) is a simple method; its major limitations are that it is time-consuming and impacted by the curse of dimensionality [79]. Thus, it is unsuitable for a large number of hyper-parameters. Moreover, GS is often not able to detect the global optimum of continuous parameters, since it requires a pre-defined, finite set of hyper-parameter values. It is also unrealistic for GS to be used to identify integer and continuous hyper-parameter optimums with limited time and resources. Therefore, compared with other techniques, GS is only efficient for a small number of categorical hyper-parameters.

Random search is more efficient than GS and supports all types of hyper-parameters. In practical applications, using RS to evaluate the randomly-selected hyper-parameter values helps analysts to explore a large search space. However, since RS does not consider previously-tested results, it may involve many unnecessary evaluations, which decrease its efficiency [13].

Hyperband can be considered an improved version of RS, and they both support parallel executions. Hyperband balances model performance and resource usage, so it is more efficient than RS, especially with limited time and resources [15]. However, GS, RS, and Hyperband all have a major constraint in that they treat each hyper-parameter independently and do not consider hyper-parameter correlations [103]. Thus, they will be inefficient for ML algorithms with conditional hyper-parameters, like SVM, DBSCAN, and logistic regression.

Gradient-based algorithms are not a prevalent choice for hyper-parameter optimization, since they only support continuous hyper-parameters and can only detect a local instead of a global optimum for non-convex HPO problems [2]. Therefore, gradient-based algorithms can only be used to optimize certain hyper-parameters, like the learning rate in DL models.

Bayesian optimization models are divided into three different models (BO-GP, SMAC, and BO-TPE) based on their surrogate models. BO algorithms determine the next hyper-parameter value based on the previously-evaluated results to reduce unnecessary evaluations and improve efficiency. BO-GP mainly supports continuous and discrete hyper-parameters (by rounding them), but does not support conditional hyper-parameters [14]; while SMAC and BO-TPE are both able to handle categorical, discrete, continuous, and conditional hyper-parameters. SMAC performs better when there are many categorical and conditional parameters, or cross-validation is used, while BO-GP performs better for only a
few continuous parameters [15]. BO-TPE preserves the specified conditional relationships, so one advantage of BO-TPE over BO-GP is its innate support for specified conditional hyper-parameters [14].

Metaheuristic algorithms, including GA and PSO, are more complicated than many other HPO algorithms, but often perform well for complex optimization problems. They support all types of hyper-parameters and are particularly efficient for large configuration spaces, since they can obtain near-optimal solutions even within very few iterations. However, GA and PSO have their own advantages and disadvantages in practical use. PSO is able to support large-scale parallelization, and is particularly suitable for continuous and conditional HPO problems [19]; on the other hand, GA is executed sequentially, making it difficult to parallelize. Therefore, PSO often executes faster than GA, especially for large configuration spaces and large datasets. However, an appropriate population initialization is crucial for PSO; otherwise, it may converge slowly or only identify a local instead of a global optimum. Yet, the impact of proper population initialization is not as significant for GA as for PSO [104]. Another limitation of GA is that it introduces additional hyper-parameters, like its crossover and mutation rates [18].

The strengths and limitations of the hyper-parameter optimization algorithms involved in this paper are summarized in Table 1.

Table 1
The comparison of common HPO algorithms ($n$ is the number of hyper-parameter values and $k$ is the number of hyper-parameters).

| HPO Method | Strengths | Limitations | Time Complexity |
|---|---|---|---|
| GS | Simple. | Time-consuming. Only efficient with categorical HPs. | $O(n^k)$ |
| RS | More efficient than GS. Enable parallelization. | Not consider previous results. Not efficient with conditional HPs. | $O(n)$ |
| Gradient-based models | Fast convergence speed for continuous HPs. | Only support continuous HPs. May only detect local optimums. | $O(n^k)$ |
| BO-GP | Fast convergence speed for continuous HPs. | Poor capacity for parallelization. Not efficient with conditional HPs. | $O(n^3)$ |
| SMAC | Efficient with all types of HPs. | Poor capacity for parallelization. | $O(n \log n)$ |
| BO-TPE | Efficient with all types of HPs. Keep conditional dependencies. | Poor capacity for parallelization. | $O(n \log n)$ |
| Hyperband | Enable parallelization. | Not efficient with conditional HPs. Require subsets with small budgets to be representative. | $O(n \log n)$ |
| BOHB | Efficient with all types of HPs. Enable parallelization. | Require subsets with small budgets to be representative. | $O(n \log n)$ |
| GA | Efficient with all types of HPs. Not require good initialization. | Poor capacity for parallelization. | $O(n^2)$ |
| PSO | Efficient with all types of HPs. Enable parallelization. | Require proper initialization. | $O(n \log n)$ |

5.2. Apply HPO algorithms to ML models

Since there are many different HPO methods for different use cases, it is crucial to select the appropriate optimization techniques for different ML models.

Firstly, if we have access to multiple fidelities, meaning that meaningful budgets can be defined (the performance rankings of hyper-parameter configurations evaluated on small budgets should be the same as or similar to the configuration rankings on the full budget, i.e., the original dataset), BOHB would be the best choice, since it has the advantages of both BO and Hyperband [6,93].

On the other hand, if multiple fidelities are not applicable, which means that using the subsets of the original dataset or the subsets of original features is misleading or too noisy to reflect the performance of the entire dataset, BOHB may perform poorly, with higher time complexity than standard BO models; choosing other HPO algorithms would then be more efficient [93].

ML algorithms can be classified by the characteristics of their hyper-parameter configurations. Appropriate optimization algorithms can be chosen to optimize the hyper-parameters based on these characteristics.

5.2.1. One discrete hyper-parameter
For some ML algorithms, like certain neighbor-based, clustering, and dimensionality reduction algorithms, commonly only one discrete hyper-parameter needs to be tuned. For KNN, the major hyper-parameter is k, the number of considered neighbors. The most essential hyper-parameter of k-means, hierarchical clustering, and EM is the number of clusters. Similarly, for dimensionality reduction algorithms, including PCA and LDA, their basic hyper-parameter is ‘n_components’, the number of features to be extracted.

In these situations, Bayesian optimization is the best choice, and the three surrogates could be tested to find the best one. Hyperband is another good choice, which may have a fast execution speed due to its capacity for parallelization. In some cases, people may want to fine-tune the ML model by considering other less important hyper-parameters, like the distance metric of KNN and the SVD solver type of PCA; so BO-TPE, GA, or PSO could be chosen for these situations.

5.2.2. One continuous hyper-parameter
Some linear models, including ridge and lasso algorithms, and some naive Bayes algorithms, including multinomial NB, Bernoulli NB, and complement NB, generally only have one vital continuous hyper-parameter to be tuned. In ridge and lasso algorithms, the continuous hyper-parameter is ‘alpha’, the regularization strength. In the three NB algorithms mentioned above, the critical hyper-parameter is also named ‘alpha’, but it represents the additive (Laplace/Lidstone) smoothing parameter. For these ML algorithms, BO-GP is the best choice, since it is good at optimizing a small number of continuous hyper-parameters. Gradient-based algorithms can also be used, but might only detect local optimums, so they are less effective than BO-GP.

5.2.3. A few conditional hyper-parameters
It is noticeable that many ML algorithms have conditional hyper-parameters, like SVM, LR, and DBSCAN. LR has three correlated hyper-parameters, ‘penalty’, ‘C’, and the solver type. Similarly, DBSCAN has ‘eps’ and ‘min_samples’ that must be tuned in conjunction. SVM is more complex, since after setting a different kernel type, there is a separate set of conditional hyper-parameters that need to be tuned next, as described in Section 3.1.3. Hence, some HPO methods that cannot effectively optimize conditional hyper-parameters, including GS, RS, BO-GP, and Hyperband, are not suitable for ML models with conditional hyper-parameters. For these ML methods, BO-TPE is the best choice if we have pre-defined relationships among the
hyper-parameters. SMAC is also a good choice, since it also performs well for tuning conditional hyper-parameters. GA and PSO can be used, as well.

5.2.4. A large hyper-parameter configuration space with multiple types of hyper-parameters
Tree-based algorithms, including DT, RF, ET, and XGBoost, as well as DL algorithms, like DNN, CNN, and RNN, are the most complex types of ML algorithms to be tuned, since they have many hyper-parameters of various types. For these ML models, PSO is the best choice since it enables parallel executions to improve efficiency, particularly for DL models that often require massive training time. Some other techniques, like GA, BO-TPE, and SMAC can also be used, but they may cost more time than PSO, since it is difficult to parallelize these techniques.

5.2.5. Categorical hyper-parameters
This category of hyper-parameters is mainly for ensemble learning algorithms, since their major hyper-parameter is a categorical hyper-parameter. For bagging and AdaBoost, the categorical hyper-parameter is ‘base_estimator’, which is set to be a singular ML model. For voting, it is ‘estimators’, indicating a list of singular ML models to be combined. The voting method has another categorical hyper-parameter, ‘voting’, which is used to choose whether to use a hard or soft voting method. If we only consider these categorical hyper-parameters, GS would be sufficient to detect their suitable base machine learners. On the other hand, in many cases, other hyper-parameters need to be considered, like ‘n_estimators’, ‘max_samples’, and ‘max_features’ in bagging, as well as ‘n_estimators’ and ‘learning_rate’ in AdaBoost; consequently, BO algorithms would be a better choice to optimize these continuous or discrete hyper-parameters.

In conclusion, when tuning an ML model to achieve high model performance and low computational costs, the most suitable HPO algorithm should be selected based on the properties of its hyper-parameters.

6. Existing HPO frameworks

To tackle HPO problems, many open-source libraries exist to put theory into practice and lower the threshold for ML developers. In this section, we provide a brief introduction to some popular open-source HPO libraries or frameworks mainly for Python programming. The principles behind the involved optimization algorithms are provided in Section 4.

6.1. Sklearn

In sklearn [30], ‘GridSearchCV’ can be used to detect the optimal hyper-parameters using the GS algorithm. Each hyper-parameter value in the human-defined configuration space is evaluated by the program, with its performance evaluated using cross-validation. When all the instances in the configuration space have been evaluated, the optimal hyper-parameter combination in the defined search space with its performance score will be returned. ‘RandomizedSearchCV’ is also provided in sklearn to implement an RS method. It evaluates a pre-defined number of randomly-selected hyper-parameter values in parallel. Cross-validation is conducted to effectively evaluate the performance of each configuration.

6.2. Spearmint

Spearmint [83] is a library using Bayesian optimization with the Gaussian process as the surrogate model. Spearmint’s primary deficiency is that it is not very efficient for categorical and conditional hyper-parameters.

6.3. BayesOpt

Bayesian Optimization (BayesOpt) [105] is a Python library employed to solve HPO problems using BO. BayesOpt uses a Gaussian process as its surrogate model to calculate the objective function based on past evaluations and utilizes an acquisition function to determine the next values.

6.4. Hyperopt

Hyperopt [106] is an HPO framework that involves RS and BO-TPE as the optimization algorithms. Unlike some of the other libraries that only support a single model, Hyperopt is able to use multiple models to model hierarchical hyper-parameters. In addition, Hyperopt is parallelizable since it uses MongoDB as the central database to store the hyper-parameter combinations. Hyperopt-sklearn [107] and hyperas [108] are the two libraries that can apply Hyperopt to scikit-learn and Keras libraries.

6.5. SMAC

SMAC [109] is another library that uses BO with random forest as the surrogate model. It supports categorical, continuous, and discrete variables.

6.6. BOHB

The BOHB framework [93] is a combination of Bayesian optimization and Hyperband [15]. It overcomes one limitation of Hyperband, namely that it randomly generates the test configurations, by replacing this procedure with BO. TPE is used as the surrogate model to store and model function evaluations. Using BOHB to evaluate the instance can achieve a trade-off between model performance and the current budget.

6.7. Optunity

Optunity [79] is a popular HPO framework that provides several optimization techniques, including GS, RS, PSO, and BO-TPE. In Optunity, categorical hyper-parameters are converted to discrete hyper-parameters by indexing, and discrete hyper-parameters are processed as continuous hyper-parameters by rounding them; as such, it supports all types of hyper-parameters.

6.8. Skopt

Skopt (scikit-optimize) [110] is an HPO library that is built on top of the scikit-learn [30] library. It implements several sequential model-based optimization models, including RS and BO-GP. The methods exhibit good performance with a small search space and proper initialization.

6.9. GpFlowOpt

GpFlowOpt [111] is a Python library for BO using GP as the surrogate model. It supports running BO-GP on GPU using the TensorFlow library. Therefore, GpFlowOpt is a good choice if BO is used in deep learning models with GPU resources available.

6.10. Talos

Talos [112] is a Python package designed for hyper-parameter optimization with Keras models. Talos can be fully deployed into

any Keras models and implemented easily without learning any new syntax. Several optimization
techniques, including GS, RS, and probabilistic reduction, can be implemented using Talos.

6.11. Sherpa

Sherpa [113] is a Python package used for HPO problems. It can be used with other ML libraries,
including sklearn [30], TensorFlow [114], and Keras [32]. It supports parallel computations and has
several optimization methods, including GS, RS, BO-GP (via GPyOpt), Hyperband, and population-based
training (PBT).

6.12. Osprey

Osprey [115] is a Python library designed to optimize hyper-parameters. Several HPO strategies are
available in Osprey, including GS, RS, BO-TPE (via Hyperopt), and BO-GP (via GPyOpt).

6.13. FAR-HO

FAR-HO [116] is a hyper-parameter optimization package that employs gradient-based algorithms with
TensorFlow. FAR-HO contains a few gradient-based optimizers, like reverse hyper-gradient and forward
hyper-gradient methods. This library is designed to build access to the gradient-based
hyper-parameter optimizers in TensorFlow, allowing deep learning model training and hyper-parameter
optimization in GPU or other tensor-optimized computing environments.

6.14. Hyperband

Hyperband [16] is a Python package for tuning hyper-parameters by Hyperband, a bandit-based
approach. Similar to ‘GridSearchCV’ and ‘RandomizedSearchCV’ in scikit-learn, there is a class named
‘HyperbandSearchCV’ in Hyperband that can be combined with sklearn and used for HPO problems. In the
‘HyperbandSearchCV’ method, cross-validation is used for evaluation.

6.15. DEAP

DEAP [117] is a novel evolutionary computation package for Python that contains several
evolutionary algorithms like GA and PSO. It integrates with parallelization mechanisms like
multiprocessing, and machine learning packages like sklearn.

6.16. TPOT

TPOT [118] is a Python tool for auto-ML that uses genetic programming to optimize ML pipelines. TPOT
is built on top of sklearn, so it is easy to implement TPOT on ML models. ‘TPOTClassifier’ is its
principal function, and several additional hyper-parameters of GA must be set to fit specific
problems.

6.17. Nevergrad

Nevergrad [119] is an open-source Python library that includes a wide range of optimizers, like
fast-GA and PSO. In ML, Nevergrad can be used to tune all types of hyper-parameters, including
discrete, continuous, and categorical hyper-parameters, by choosing different optimizers.

7. Experiments

To summarize the content of Sections 3–6, a comprehensive overview of applying hyper-parameter
optimization techniques to ML models is shown in Table 2. It provides a summary of common ML
algorithms, their hyper-parameters, suitable optimization methods, and available Python libraries;
thus, data analysts and researchers can look up this table and select suitable optimization
algorithms as well as libraries for practical use.

To put theory into practice, several experiments have been conducted based on Table 2. This section
provides the experiments of applying eight different HPO techniques to three common and
representative ML algorithms on two benchmark datasets. In the first part of this section, the
experimental setup and the main process of HPO are discussed. In the second part, the results of
utilizing different HPO methods are compared and analyzed. The sample code of the experiments has
been published in [132] to illustrate the process of applying hyper-parameter optimization to ML
models.

7.1. Experimental setup

Based on the steps to optimize hyper-parameters discussed in Section 2.2, several steps were
completed before the actual optimization experiments start.

Firstly, two standard benchmarking datasets provided by the sklearn library [30], namely, the
Modified National Institute of Standards and Technology dataset (MNIST) and the Boston housing
dataset, are selected as the benchmark datasets for HPO method evaluation on data analytics
problems. MNIST is a hand-written digit recognition dataset used as a multi-classification problem,
while the Boston housing dataset contains information about the price of houses in various places
in the city of Boston and can be used as a regression dataset to predict the housing prices.

At the next stage, the ML models with their objective function need to be configured. In Section
5.2, all common ML models are divided into five categories based on their hyper-parameter types.
Among those ML categories, ‘‘one discrete hyper-parameter”, ‘‘a few conditional hyper-parameters”,
and ‘‘a large hyper-parameter configuration space with multiple types of hyper-parameters” are the
three most common cases. Thus, three ML algorithms, KNN, SVM, and RF, are selected as the target
models to be optimized, since their hyper-parameter types represent the three most common HPO
cases: KNN has one important hyper-parameter, the number of considered nearest neighbors for each
sample; SVM has a few conditional hyper-parameters, like the kernel type and the penalty parameter
C; RF has multiple hyper-parameters of different types, as discussed in Section 3. Moreover, KNN,
SVM, and RF can all be applied to solve both classification and regression problems.

In the next step, the performance metrics and evaluation methods are configured. For each
experiment on the selected two datasets, 3-fold cross-validation is implemented to evaluate the
involved HPO methods. The two most commonly-used performance metrics are used in our experiments.
For classification models, accuracy is used as the classifier performance metric, which is the
proportion of correctly classified data; while for regression models, the mean squared error (MSE)
is used as the regressor performance metric, which measures the average squared difference between
the predicted values and the actual values. Additionally, the computational time (CT), the total
time needed to complete an HPO process with 3-fold cross-validation, is also used as the model
efficiency metric [54]. In each experiment, the optimized ML model architecture that has the
highest accuracy or the lowest MSE and the optimal hyper-parameter configuration will be returned.
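
As a minimal sketch of the evaluation procedure described above, the following Python code tunes the single KNN hyper-parameter ‘n_neighbors’ with sklearn's ‘GridSearchCV’ and ‘RandomizedSearchCV’ under 3-fold cross-validation and records accuracy and CT. It is an illustration only, not the authors' published sample code [132]. Several choices are assumptions of this sketch: sklearn's built-in handwritten-digit data (‘load_digits’) stands in for the MNIST benchmark, the helper name ‘timed’ is hypothetical, and the range of 1 to 20 and the ten random draws are picked for illustration.

```python
import time

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: sklearn's handwritten-digit data plays the role of the MNIST benchmark.
X, y = load_digits(return_X_y=True)


def timed(fn):
    """Run fn() and return (result, elapsed seconds); the elapsed time is the CT metric."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Baseline: default hyper-parameters, mean accuracy over 3-fold cross-validation.
baseline_acc, baseline_ct = timed(
    lambda: cross_val_score(KNeighborsClassifier(), X, y, cv=3, scoring="accuracy").mean()
)

# Search space for the single discrete hyper-parameter (illustrative range).
param_grid = {"n_neighbors": list(range(1, 21))}

# GS: evaluates every value in the grid.
gs, gs_ct = timed(
    lambda: GridSearchCV(KNeighborsClassifier(), param_grid, cv=3, scoring="accuracy").fit(X, y)
)

# RS: evaluates a fixed number of randomly drawn values (10 here, an illustrative budget).
rs, rs_ct = timed(
    lambda: RandomizedSearchCV(
        KNeighborsClassifier(), param_grid, n_iter=10, cv=3,
        scoring="accuracy", random_state=0,
    ).fit(X, y)
)

print(f"Default HPs: accuracy={baseline_acc:.4f}, CT={baseline_ct:.2f}s")
print(f"GS: accuracy={gs.best_score_:.4f}, CT={gs_ct:.2f}s, best={gs.best_params_}")
print(f"RS: accuracy={rs.best_score_:.4f}, CT={rs_ct:.2f}s, best={rs.best_params_}")
```

Because RS evaluates only a subset of the grid, its best score can never exceed that of GS on the same folds, while its CT is roughly proportional to the smaller number of evaluated configurations.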

Table 2
A comprehensive overview of common ML models, their hyper-parameters, suitable optimization techniques, and available Python libraries.

ML Algorithm | Main HPs | Optional HPs | HPO methods | Libraries
Linear regression | – | – | – | –
Ridge & lasso | alpha | – | BO-GP | Skopt
Logistic regression | penalty, C, solver | – | BO-TPE, SMAC | Hyperopt, SMAC
KNN | n_neighbors | weights, p, algorithm | BOs, Hyperband | Skopt, Hyperopt, SMAC, Hyperband
SVM | C, kernel, epsilon (for SVR) | gamma, coef0, degree | BO-TPE, SMAC, BOHB | Hyperopt, SMAC, BOHB
NB | alpha | – | BO-GP | Skopt
DT | criterion, max_depth, min_samples_split, min_samples_leaf, max_features | splitter, min_weight_fraction_leaf, max_leaf_nodes | GA, PSO, BO-TPE, SMAC, BOHB | TPOT, Optunity, SMAC, BOHB
RF & ET | n_estimators, max_depth, criterion, min_samples_split, min_samples_leaf, max_features | splitter, min_weight_fraction_leaf, max_leaf_nodes | GA, PSO, BO-TPE, SMAC, BOHB | TPOT, Optunity, SMAC, BOHB
XGBoost | n_estimators, max_depth, learning_rate, subsample, colsample_bytree | min_child_weight, gamma, alpha, lambda | GA, PSO, BO-TPE, SMAC, BOHB | TPOT, Optunity, SMAC, BOHB
Voting | estimators, voting | weights | GS | sklearn
Bagging | base_estimator, n_estimators | max_samples, max_features | GS, BOs | sklearn, Skopt, Hyperopt, SMAC
AdaBoost | base_estimator, n_estimators, learning_rate | – | BO-TPE, SMAC | Hyperopt, SMAC
Deep learning | number of hidden layers, ‘units’ per layer, loss, optimizer, activation, learning_rate, dropout rate, epochs, batch_size, early stop patience | number of frozen layers (if transfer learning is used) | PSO, BOHB | Optunity, BOHB
K-means | n_clusters | init, n_init, max_iter | BOs, Hyperband | Skopt, Hyperopt, SMAC, Hyperband
Hierarchical clustering | n_clusters, distance_threshold | linkage | BOs, Hyperband | Skopt, Hyperopt, SMAC, Hyperband
DBSCAN | eps, min_samples | – | BO-TPE, SMAC, BOHB | Hyperopt, SMAC, BOHB
Gaussian mixture | n_components | covariance_type, max_iter, tol | BO-GP | Skopt
PCA | n_components | svd_solver | BOs, Hyperband | Skopt, Hyperopt, SMAC, Hyperband
LDA | n_components | solver, shrinkage | BOs, Hyperband | Skopt, Hyperopt, SMAC, Hyperband
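
To illustrate how one row of Table 2 translates into a search space, the sketch below samples configurations for the SVM row, where the optional hyper-parameters only matter for certain kernels. It uses only the Python standard library. The function name ‘sample_svm_config’ and the ranges for ‘C’, ‘gamma’, ‘coef0’, and ‘degree’ are assumptions made for illustration, and the kernel-to-parameter mapping follows the usual behavior of sklearn's SVM kernels rather than anything stated in the table itself.

```python
import random

KERNELS = ["linear", "poly", "rbf", "sigmoid"]


def sample_svm_config(rng):
    """Draw one SVM configuration; optional HPs appear only when the kernel uses them."""
    config = {
        "C": rng.uniform(0.1, 50.0),    # continuous main HP (illustrative range)
        "kernel": rng.choice(KERNELS),  # categorical main HP
    }
    if config["kernel"] in ("poly", "rbf", "sigmoid"):
        # Assumed log-uniform range for gamma; it is ignored by the linear kernel.
        config["gamma"] = 10 ** rng.uniform(-4, 0)
    if config["kernel"] in ("poly", "sigmoid"):
        config["coef0"] = rng.uniform(0.0, 1.0)  # assumed range
    if config["kernel"] == "poly":
        config["degree"] = rng.randint(2, 5)     # discrete, assumed range
    return config


rng = random.Random(0)
samples = [sample_svm_config(rng) for _ in range(200)]
for cfg in samples[:3]:
    print(cfg)
```

Methods that handle such conditional structure natively avoid spending evaluations on hyper-parameters that have no effect for the selected kernel.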

After that, to fairly compare different optimization algorithms and frameworks, certain constraints
should be satisfied. Firstly, we compare different HPO methods using the same hyper-parameter
configuration space. For KNN, the only hyper-parameter to be optimized, ‘n_neighbors’, is set to be
in the same range of 1 to 20 for each optimization method evaluation. The hyper-parameters of SVM
and RF models for classification and regression problems are also set to be in the same
configuration space for each type of problem. The specifics of the configuration space for ML
models are shown in Table 3. The selected hyper-parameters and their search space are determined
based on the concepts in Section 3, domain knowledge, and manual testing [120]. The hyper-parameter
types of each ML algorithm are also summarized in Table 3.

Table 3
Configuration space for the hyper-parameters of tested ML models.

ML Model | Hyper-parameter | Type | Search Space
RF Classifier | n_estimators | Discrete | [10,100]
RF Classifier | max_depth | Discrete | [5,50]
RF Classifier | min_samples_split | Discrete | [2,11]
RF Classifier | min_samples_leaf | Discrete | [1,11]
RF Classifier | criterion | Categorical | [‘gini’, ‘entropy’]
RF Classifier | max_features | Discrete | [1,64]
SVM Classifier | C | Continuous | [0.1,50]
SVM Classifier | kernel | Categorical | [‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’]
KNN Classifier | n_neighbors | Discrete | [1,20]
RF Regressor | n_estimators | Discrete | [10,100]
RF Regressor | max_depth | Discrete | [5,50]
RF Regressor | min_samples_split | Discrete | [2,11]
RF Regressor | min_samples_leaf | Discrete | [1,11]
RF Regressor | criterion | Categorical | [‘mse’, ‘mae’]
RF Regressor | max_features | Discrete | [1,13]
SVM Regressor | C | Continuous | [0.1,50]
SVM Regressor | kernel | Categorical | [‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’]
SVM Regressor | epsilon | Continuous | [0.001,1]
KNN Regressor | n_neighbors | Discrete | [1,20]

On the other hand, to fairly compare the performance metrics of optimization techniques, the
maximum number of iterations for all HPO methods is set to 50 for RF and SVM model optimizations,
and 10 for KNN model optimization based on manual testing and domain knowledge. Moreover, to avoid
the impacts of randomness, all experiments are repeated ten times with different random seeds, and
results are averaged for regression problems or decided by majority vote for classification
problems.

In Section 4, more than ten HPO methods are introduced. In our experiments, eight representative
HPO approaches are selected for performance comparison, including GS, RS, BO-GP, BO-TPE, Hyperband,
BOHB, GA, and PSO. After setting up the fair experimental environments for each HPO method, the HPO
experiments are implemented based on the steps discussed in Section 2.2.

All experiments were conducted using Python 3.5 on a machine with a 6-core i7-8700 processor and 16
gigabytes (GB) of memory. The involved ML and HPO algorithms are evaluated using multiple

open-source Python libraries and frameworks introduced in Section 6, including sklearn [30], Skopt
[110], Hyperopt [106], Optunity [79], Hyperband [16], BOHB [93], and TPOT [118].

7.2. Performance comparison

The experiments of applying eight different HPO methods to ML models are summarized in Tables 4–9.
Tables 4–6 provide the performance of each optimization algorithm when applied to RF, SVM, and KNN
classifiers evaluated on the MNIST dataset after a complete optimization process, while Tables 7–9
demonstrate the performance of each HPO method when applied to RF, SVM, and KNN regressors
evaluated on the Boston housing dataset. In the first step, each ML model with its default
hyper-parameter configuration is trained and evaluated as a baseline model. After that, each HPO
algorithm is implemented on the ML models to evaluate and compare their accuracies for
classification problems, or MSEs for regression problems, as well as their computational time (CT).

Table 4
Performance evaluation of applying HPO methods to the RF classifier on the MNIST dataset.

Optimization Algorithm | Accuracy (%) | CT (s)
Default HPs | 90.65 | 0.09
GS | 93.32 | 48.62
RS | 93.38 | 16.73
BO-GP | 93.38 | 20.60
BO-TPE | 93.88 | 12.58
Hyperband | 93.38 | 8.89
BOHB | 93.38 | 9.45
GA | 93.83 | 19.19
PSO | 93.73 | 12.43

Table 5
Performance evaluation of applying HPO methods to the SVM classifier on the MNIST dataset.

Optimization Algorithm | Accuracy (%) | CT (s)
Default HPs | 97.05 | 0.29
GS | 97.44 | 32.90
RS | 97.35 | 12.48
BO-GP | 97.50 | 17.56
BO-TPE | 97.44 | 3.02
Hyperband | 97.44 | 11.37
BOHB | 97.44 | 8.18
GA | 97.44 | 16.89
PSO | 97.44 | 8.33

Table 6
Performance evaluation of applying HPO methods to the KNN classifier on the MNIST dataset.

Optimization Algorithm | Accuracy (%) | CT (s)
Default HPs | 96.27 | 0.24
GS | 96.22 | 7.86
RS | 96.33 | 6.44
BO-GP | 96.83 | 1.12
BO-TPE | 96.83 | 2.33
Hyperband | 96.22 | 4.54
BOHB | 97.44 | 3.84
GA | 96.83 | 2.34
PSO | 96.83 | 1.73

Table 7
Performance evaluation of applying HPO methods to the RF regressor on the Boston housing dataset.

Optimization Algorithm | MSE | CT (s)
Default HPs | 31.26 | 0.08
GS | 29.02 | 4.64
RS | 27.92 | 3.42
BO-GP | 26.79 | 17.94
BO-TPE | 25.42 | 1.53
Hyperband | 26.14 | 2.56
BOHB | 25.56 | 1.88
GA | 26.95 | 4.73
PSO | 25.69 | 3.20

Table 8
Performance evaluation of applying HPO methods to the SVM regressor on the Boston housing dataset.

Optimization Algorithm | MSE | CT (s)
Default HPs | 77.43 | 0.02
GS | 67.07 | 1.33
RS | 61.40 | 0.48
BO-GP | 61.27 | 5.87
BO-TPE | 59.40 | 0.33
Hyperband | 73.44 | 0.32
BOHB | 59.67 | 0.31
GA | 60.17 | 1.12
PSO | 58.72 | 0.53

Table 9
Performance evaluation of applying HPO methods to the KNN regressor on the Boston housing dataset.

Optimization Algorithm | MSE | CT (s)
Default HPs | 81.48 | 0.004
GS | 81.53 | 0.12
RS | 80.77 | 0.11
BO-GP | 80.77 | 0.49
BO-TPE | 80.83 | 0.08
Hyperband | 80.87 | 0.10
BOHB | 80.77 | 0.09
GA | 80.77 | 0.33
PSO | 80.74 | 0.19

From Tables 4–9, we can see that using the default HP configurations does not yield the best model
performance in our experiments, which emphasizes the importance of utilizing HPO methods. GS and RS
can be seen as baseline models for HPO problems. From the results in Tables 4–9, it is shown that
the computational time of GS is often much higher than that of other optimization methods. With the
same search space size, RS is faster than GS, but neither of them can guarantee to detect the
near-optimal hyper-parameter configurations of ML models, especially for RF and SVM models, which
have a larger search space than KNN.

The performance of BO and multi-fidelity models is much better than GS and RS. The computational
time of BO-GP is often higher than that of other HPO methods due to its cubic time complexity, but
it can obtain better performance metrics for ML models with small-size continuous hyper-parameter
space, like KNN. Conversely, Hyperband is often not able to obtain the highest accuracy or the
lowest MSE among the optimization methods, but its computational time is low because it works on
small-sized subsets. The performance of BO-TPE and BOHB is often better than others, since they can
detect the optimal or near-optimal hyper-parameter configurations within a short computational
time.

For the metaheuristic methods, GA and PSO, their accuracies are often higher than other HPO methods
for classification problems, and their MSEs are often lower than other optimization techniques.
However, their computational time is often higher than BO-TPE and multi-fidelity models, especially
for GA, which does not support parallel executions.

To summarize, it is simple to implement GS and RS, but they often cannot detect the optimal
hyper-parameter configurations

or cost much computational time. BO-GP and GA also cost more computational time than many other HPO
methods, but BO-GP works well on small configuration space, while GA is effective for large
configuration space. Hyperband’s computational time is low, but it cannot guarantee to detect the
global optimums. For ML models with large configuration space, BO-TPE, BOHB, and PSO often work
well.

8. Open issues, challenges, and future research directions

Although there have been many existing HPO algorithms and practical frameworks, some issues still
need to be addressed, and several aspects in this domain could be improved. In this section, we
discuss the open challenges, current research questions, and potential research directions in the
future. They can be classified as model complexity challenges and model performance challenges, as
summarized in Table 10.

Table 10
The open challenges and future directions of HPO research.

Category | Challenges & Future Requirements | Brief Description
Model complexity | Costly objective function evaluations | HPO methods should reduce evaluation time on large datasets.
Model complexity | Complex search space | HPO methods should reduce execution time on high dimensionalities (large hyper-parameter search space).
Model performance | Strong anytime performance | HPO methods should be able to detect the optimal or near-optimal HPs even with a very limited budget.
Model performance | Strong final performance | HPO methods should be able to detect the global optimum when given a sufficient budget.
Model performance | Comparability | There should exist a standard set of benchmarks to fairly evaluate and compare different optimization algorithms.
Model performance | Over-fitting and generalization | The optimal HPs detected by HPO methods should have generalizability to build efficient models on unseen data.
Model performance | Randomness | HPO methods should reduce randomness on the obtained results.
Model performance | Scalability | HPO methods should be scalable to multiple libraries or platforms (e.g., distributed ML platforms).
Model performance | Continuous updating capability | HPO methods should consider their capacity to detect and update optimal HP combinations on continuously-updated data.

8.1. Model complexity

8.1.1. Costly objective function evaluations

To evaluate the performance of an ML model with different hyper-parameter configurations, its
objective function must be minimized in each evaluation. Depending on the scale of data, the model
complexity, and available computational resources, the evaluation of each hyper-parameter
configuration may take several minutes, hours, days, or even more [89]. Additionally, the values of
certain hyper-parameters have a direct impact on the execution time, like the number of considered
neighbors in KNN, the number of basic decision trees in RF, and the number of hidden layers in deep
neural networks [121].

To solve this problem by HPO algorithms, BO models reduce the total number of evaluations by
spending time choosing the next evaluation point instead of simply evaluating all possible
hyper-parameter configurations; however, they still require much execution time due to their poor
capacity for parallelization. On the other hand, although multi-fidelity optimization methods, like
Hyperband, have had some success dealing with HPO problems with limited budgets, there are still
some problems that cannot be effectively solved by HPO due to the complexity of models or the scale
of datasets [6]. For example, the ImageNet [122] challenge is a very popular problem in the image
processing domain, but there has not been any research or work on efficiently optimizing
hyper-parameters for the ImageNet challenge yet, due to its huge scale and the complexity of CNN
models used on ImageNet.

8.1.2. Complex search space

In many problems to which ML algorithms are applied, only a few hyper-parameters have significant
effects on model performance, and they are the main hyper-parameters that require tuning. However,
certain other unimportant hyper-parameters may still affect the performance slightly and may be
considered to optimize the ML model further, which increases the dimensionality of the
hyper-parameter search space. As the number of hyper-parameters and their candidate values
increases, the size of the search space and the complexity of the problem grow exponentially, and
the total objective function evaluation time will also increase exponentially [7]. Therefore, it is
necessary to reduce the influence of large search spaces on execution time by improving existing
HPO methods.

8.2. Model performance

8.2.1. Strong anytime performance and final performance

HPO techniques are often expensive and sometimes require extreme resources, especially for massive
datasets or complex ML models. One example of a resource-intensive model is deep learning models,
since HPO methods view their objective function evaluations as black-box functions and do not
consider their complexity. However, the overall budget is often very limited for most practical
situations, so HPO algorithms should be able to prioritize objective function evaluations and have
a strong anytime performance, which indicates the capacity to detect optimal or near-optimal
configurations even with a very limited budget [93]. For instance, an efficient HPO method should
have a high convergence speed so that there would not be a huge difference between the results
before and after model convergence, and should avoid returning random results when time and
resources are limited, which RS methods cannot do.

On the other hand, if conditions permit and an adequate budget is given, HPO approaches should be
able to identify the global optimal hyper-parameter configuration, named a strong final performance
[93].

8.2.2. Comparability of HPO methods

To optimize the hyper-parameters of ML models, different optimization algorithms can be applied to
each ML framework. Different optimization techniques have their own strengths and drawbacks in
different cases, and currently, there is no single optimization approach that outperforms all other
approaches when processing different datasets with various metrics and hyper-parameter types [3].
In this paper, we have analyzed the strengths and weaknesses of common hyper-parameter optimization
techniques based on their principles and their performance in practical applications; but this
topic could be extended more comprehensively.
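
One simple, budget-aware way to put the notions of anytime and final performance from Section 8.2.1 side by side is to record the best objective value found so far after every evaluation. The sketch below does this for random search on a toy one-dimensional objective, averaged over ten seeds. Everything in it is an assumption for illustration: the quadratic ‘objective’ stands in for a real validation loss, and the function names, the budget of 50 evaluations, and the checkpoint at 5 evaluations are arbitrary choices.

```python
import random


def objective(x):
    """Toy validation loss with its minimum (0.0) at x = 0.3; stands in for a real model."""
    return (x - 0.3) ** 2


def random_search_trace(budget, seed):
    """Return the best-so-far loss after each of `budget` random evaluations."""
    rng = random.Random(seed)
    best = float("inf")
    trace = []
    for _ in range(budget):
        best = min(best, objective(rng.uniform(0.0, 1.0)))
        trace.append(best)
    return trace


def mean_trace(budget, seeds):
    """Average the best-so-far traces over several seeds to damp randomness."""
    traces = [random_search_trace(budget, seed) for seed in seeds]
    return [sum(trace[i] for trace in traces) / len(traces) for i in range(budget)]


curve = mean_trace(budget=50, seeds=range(10))
print(f"anytime performance (after 5 evaluations): {curve[4]:.5f}")
print(f"final performance (after 50 evaluations): {curve[-1]:.5f}")
```

Curves of this kind are only comparable across HPO methods when the objective, the budget, and the metric are shared, which is the gap in comparability discussed in this subsection.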

To solve this problem, a standard set of benchmarks could be designed and agreed on by the
community for a better comparison of different HPO algorithms. For example, there is a platform
called COCO (Comparing Continuous Optimizers) [123] that provides benchmarks and analyzes common
continuous optimizers. However, there is, to date, not any reliable platform that provides
benchmarks and analysis of all common hyper-parameter optimization approaches. It would be easier
for people to choose HPO algorithms in practical applications if a platform like COCO exists for
HPO problems. In addition, a unified metric can also improve the comparability of different HPO
algorithms, since different metrics are currently used in different practical problems [6].

On the other hand, based on the comparison of different HPO algorithms, a way to further improve
HPO is to combine existing models or propose new models that contain as many benefits as possible
and are more suitable for practical problems than existing singular models. For example, the BOHB
method [93] has had some success dealing with HPO problems by combining Bayesian optimization and
Hyperband. In addition, future research should consider both model performance and time budgets to
develop HPO algorithms that suit real-world applications.

8.2.3. Over-fitting and generalization

Generalization is another issue with HPO models. Since hyper-parameter evaluations are done with a
finite number of evaluations in datasets, the optimal hyper-parameter values detected by HPO
approaches might not be the same optimums on previously-unseen data. This is similar to
over-fitting issues with ML models that occur when a model is closely fit to a finite number of
known data points but is unfit to unseen data [124]. Generalization is also a common concern for
multi-fidelity algorithms, like Hyperband and BOHB, since they need to extract subsets to represent
the entire dataset.

One solution to reduce or avoid over-fitting is to use cross-validation to identify a stable
optimum that performs best in all or most of the subsets instead of a sharp optimum that only
performs well in a singular validation set [6]. However, cross-validation increases the execution
time several-fold. It would be beneficial if methods can better deal with overfitting and improve
generalization in future research.

8.2.4. Randomness

There are stochastic components in the objective function of ML algorithms; thus, in some cases,
the optimal hyper-parameter configuration might be different after each run. This randomness could
be due to various procedures of certain ML models, like neural network initialization, or different
sampled subsets in a bagging model [89]; or due to certain procedures of HPO algorithms, like
crossover and mutation operations in GA. In addition, it is often difficult for HPO methods to
identify the global optimums, due to the fact that HPO problems are mainly NP-hard problems. Many
existing HPO algorithms can only collect several different near-optimal values, which is caused by
randomness. Thus, the existing HPO models can be further improved to reduce the impact of
randomness. One possible solution is to run an HPO method multiple times and select the
hyper-parameter value that occurs most as the final optimum.

8.2.5. Scalability

In practice, one main limitation of many existing HPO frameworks is that they are tightly
integrated with one or a couple of machine learning libraries, like sklearn and Keras, which
restricts them to only work with a single node instead of large data volumes [3]. To tackle large
datasets, some distributed machine learning platforms, like Apache SystemML [125] and Spark MLlib
[126], have been developed; however, only very few HPO frameworks exist that support distributed
ML. Therefore, more research efforts and scalable HPO frameworks, like the ones supporting
distributed ML platforms, should be developed to support more libraries.

On the other hand, future practical HPO algorithms should have the scalability to efficiently
optimize hyper-parameters from a small size to a large size, irrespective of whether they are
continuous, discrete, categorical, or conditional hyper-parameters.

8.2.6. Continuous updating capability

In practice, many datasets are not stationary and are constantly updated by adding new data and
deleting old data. Correspondingly, the optimal hyper-parameter values or combinations may also
change with the changes in data. Currently, developing HPO methods with the capacity to
continuously tune hyper-parameter values as the data changes has not drawn much attention, since
researchers and data analysts often do not alter the ML model after achieving a currently optimal
performance [3]. However, since their optimal hyper-parameter values would change as data changes,
proper approaches should be proposed to achieve continuous updating capability.

9. Conclusion

Machine learning has become the primary strategy for tackling data-related problems and has been
widely used in various applications. To apply ML models to practical problems, their
hyper-parameters need to be tuned to fit specific datasets. However, since the scale of produced
data is greatly increased in real-life, and manually tuning hyper-parameters is extremely
computationally expensive, it has become crucial to optimize hyper-parameters by an automatic
process. In this survey paper, we have comprehensively discussed the state-of-the-art research into
the domain of hyper-parameter optimization as well as how to apply these techniques to different ML
models by theory and practical experiments. To apply optimization methods to ML models, the
hyper-parameter types in an ML model are the main concern for HPO method selection. To summarize,
BOHB is the recommended choice for optimizing an ML model, if randomly selected subsets are
highly-representative of the given dataset, since it can efficiently optimize all types of
hyper-parameters; otherwise, BO models are recommended for small hyper-parameter configuration
space, while PSO is usually the best choice for large configuration space. Moreover, some existing
useful HPO tools and frameworks, open challenges, and potential research directions are also
provided and highlighted for practical use and future research purposes. We hope that our survey
paper serves as a useful resource for ML users, developers, data analysts, and researchers to use
and tune ML models utilizing proper HPO techniques and frameworks. We also hope that it helps to
enhance understanding of the challenges that still exist within the HPO domain, and thereby further
advances HPO and ML applications in future research.

CRediT authorship contribution statement

Li Yang: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data
curation, Writing - original draft, Visualization. Abdallah Shami: Conceptualization, Resources,
Writing - review & editing, Supervision, Project administration, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships
that could have appeared to influence the work reported in this paper.

References

[1] M.I. Jordan, T.M. Mitchell, Machine learning: trends, perspectives, and prospects, Science 349 (2015) 255–260, https://doi.org/10.1126/science.aaa8415.
[2] M.-A. Zöller, M.F. Huber, Benchmark and Survey of Automated Machine Learning Frameworks, arXiv preprint arXiv:1904.12054, (2019). https://arxiv.org/abs/1904.12054.
[3] R.E. Shawi, M. Maher, S. Sakr, Automated machine learning: State-of-the-art and open challenges, arXiv preprint arXiv:1906.02287, (2019). http://arxiv.org/abs/1906.02287.
[4] M. Kuhn, K. Johnson, Applied Predictive Modeling, Springer, 2013, ISBN: 9781461468493.
[5] G.I. Diaz, A. Fokoue-Nkoutche, G. Nannicini, H. Samulowitz, An effective algorithm for hyperparameter optimization of neural networks, IBM J. Res. Dev. 61 (2017) 1–20, https://doi.org/10.1147/JRD.2017.2709578.
[6] F. Hutter, L. Kotthoff, J. Vanschoren (Eds.), Automatic Machine Learning: Methods, Systems, Challenges, Springer, 2019, ISBN 9783030053185.
[7] N. Decastro-García, Á.L. Muñoz Castañeda, D. Escudero García, M.V. Carriegos, Effect of the sampling of a dataset in the hyperparameter optimization phase over the efficiency of a machine learning algorithm, Complexity 2019 (2019), https://doi.org/10.1155/2019/6278908.
[8] S. Abreu, Automated Architecture Design for Deep Neural Networks, arXiv preprint arXiv:1908.10714, (2019). http://arxiv.org/abs/1908.10714.
[9] O.S. Steinholtz, A Comparative Study of Black-box Optimization Algorithms for Tuning of Hyper-parameters in Deep Neural Networks, M.S. thesis, Dept. Elect. Eng., Luleå Univ. Technol., 2018.
[10] G. Luo, A review of automatic selection methods for machine learning algorithms and hyper-parameter values, Netw. Model. Anal. Heal. Inf. Bioinf. 5 (2016) 1–16, https://doi.org/10.1007/s13721-016-0125-6.
[33] C. Gambella, B. Ghaddar, J. Naoum-Sawaya, Optimization Models for Machine Learning: A Survey (2019) 1–40, http://arxiv.org/abs/1901.05331.
[34] C.M. Bishop, Pattern Recognition and Machine Learning, 2006, Springer, ISBN: 978-0-387-31073-2.
[35] A.E. Hoerl, R.W. Kennard, Ridge regression: applications to nonorthogonal problems, Technometrics 12 (1970) 69–82, https://doi.org/10.1080/00401706.1970.10488635.
[36] L.E. Melkumova, S.Y. Shatskikh, Comparing ridge and LASSO estimators for data analysis, Procedia Eng. 201 (2017) 746–755, https://doi.org/10.1016/j.proeng.2017.09.615.
[37] R. Tibshirani, Regression shrinkage and selection via the Lasso, J. R. Stat. Soc. Ser. B 58 (1996) 267–288, https://doi.org/10.1111/j.2517-6161.1996.tb02080.x.
[38] D.W. Hosmer Jr, S. Lemeshow, Applied logistic regression, Technometrics 34 (1) (2013) 358–359.
[39] J.O. Ogutu, T. Schulz-Streeck, H.P. Piepho, Genomic selection using regularized linear regression models: ridge regression, lasso, elastic net and their extensions, BMC Proc. BioMed Cent. 6 (2012).
[40] J.M. Keller, M.R. Gray, A fuzzy K-nearest neighbor algorithm, IEEE Trans. Syst. Man Cybern. SMC-15 (1985) 580–585, https://doi.org/10.1109/TSMC.1985.6313426.
[41] W. Zuo, D. Zhang, K. Wang, On kernel difference-weighted k-nearest neighbor classification, Pattern Anal. Appl. 11 (2008) 247–257, https://doi.org/10.1007/s10044-007-0100-z.
[42] A. Smola, V. Vapnik, Support vector regression machines, Adv. Neural Inf. Process. Syst. 9 (1997) 155–161.
[43] L. Yang, R. Muresan, A. Al-Dweik, L.J. Hadjileontiadis, Image-based visibility estimation algorithm for intelligent transportation systems, IEEE Access 6 (2018) 76728–76740, https://doi.org/10.1109/ACCESS.2018.2884225.
[44] J. Zhang, R. Jin, Y. Yang, A.G. Hauptmann, Modified logistic regression: an
approximation to SVM and its applications in large-scale text categorization,
[11] D. Maclaurin, D. Duvenaud, R.P. Adams, Gradient-based Hyperparameter
Proceedings Twent. Int. Conf. Mach. Learn. 2 (2003) 888–895.
Optimization through Reversible Learning, arXiv preprint arXiv:1502.03492,
[45] O.S. Soliman, A.S. Mahmoud, A classification system for remote sensing
(2015). http://arxiv.org/abs/1502.03492.
satellite images using support vector machine with non-linear kernel
[12] J. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, Algorithms for hyper-parameter
functions, 2012 8th Int. Conf. Informatics Syst. INFOS 2012. (2012) BIO-
optimization, Proc. Adv. Neural Inf. Process. Syst. (2011) 2546–2554.
181-BIO-187.
[13] B. James, B. Yoshua, Random search for hyper-parameter optimization, J.
[46] I. Rish, An empirical study of the naive Bayes classifier, IJCAI 2001 Work
Mach. Learn. Res. 13 (1) (2012) 281–305.
Empir. Methods Artif. Intell. (2001) 41–46.
[14] K. Eggensperger, M. Feurer, F. Hutter, J. Bergstra, J. Snoek, H. Hoos, K. Leyton-
[47] J.N. Sulzmann, J. Fürnkranz, E. Hüllermeier, On pairwise naive bayes
Brown, Towards an empirical foundation for assessing Bayesian optimization
classifiers, Lect. Notes Comput. Sci. (Including Subser. Lect. Notes Artif.
of hyperparameters, BayesOpt Work (2013) 1–5.
Intell. Lect. Notes Bioinformatics). 4701 LNAI (2007) 371-381. https://doi.org/
[15] K. Eggensperger, F. Hutter, H.H. Hoos, K. Leyton-Brown, Efficient
10.1007/978-3-540-74958-5_35.
benchmarking of hyperparameter optimizers via surrogates, Proc. Natl.
[48] C. Bustamante, L. Garrido, R. Soto, Comparing fuzzy Naive Bayes and Gaussian
Conf. Artif. Intell. 2 (2015) 1114–1120.
Naive Bayes for decision making in RoboCup 3D, Lect. Notes Comput. Sci.
[16] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, A. Talwalkar, Hyperband: a
(Including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics) (2006)
novel bandit-based approach to hyperparameter optimization, J. Mach. Learn.
237–247, https://doi.org/10.1007/11925231_23, 4293 LNA I.
Res. 18 (2012) 1–52.
[49] A.M. Kibriya, E. Frank, B. Pfahringer, G. Holmes, Multinomial naive bayes for
[17] Q. Yao, et al., Taking Human out of Learning Applications: A Survey on
text categorization revisited, Lect. Notes Artif. Intell. (Subseries Lect. Notes
Automated Machine Learning, arXiv preprint arXiv:1810.13306, (2018).
Comput. Sci.) 3339 (2004) 488–499.
http://arxiv.org/abs/1810.13306.
[50] J.D.M. Rennie, L. Shih, J. Teevan, D.R. Karger Tackling the poor assumptions of
[18] S. Lessmann, R. Stahlbock, S.F. Crone, Optimizing hyperparameters of support
Naive Bayes text classifiers, Proc. Twent. Int. Conf. Mach. Learn. ICML (2003),
vector machines by genetic algorithms, Proc. 2005 Int. Conf. Artif. Intell.
616–623.
ICAI’05. 1 (2005) 74–80.
[51] V. Narayanan, I. Arora, A. Bhatia, Fast and accurate sentiment classification
[19] P.R. Lorenzo, J. Nalepa, M. Kawulok, L.S. Ramos, J.R. Paster, Particle swarm
using an enhanced naíve Bayes model, arXiv preprint arXiv:1305.6143,
optimization for hyper-parameter selection in deep neural networks, Proc.
(2013). https://arxiv.org/abs/1305.6143.
ACM Int. Conf. Genet. Evol. Comput. (2017) 481–488.
[52] S. Rasoul, L. David, A survey of decision tree classifier methodology, IEEE
[20] S. Sun, Z. Cao, H. Zhu, J. Zhao, A Survey of Optimization Methods from a
Trans. Syst. Man. Cybern. 21 (1991) 660–674.
Machine Learning Perspective, arXiv preprint arXiv:1906.06821, (2019).
[53] D.M. Manias, M. Jammal, H. Hawilo, A. Shami, P. Heidari, A. Larabi, R. Brunner,
https://arxiv.org/abs/1906.06821.
Machine learning for performance-aware virtual network function
[21] T.M.S. Bradley, A. Hax, Applied Mathematical Programming, Addison-Wesley,
placement, 2019 IEEE Glob. Commun. Conf. GLOBECOM 2019 – Proc. (2019)
Reading, Massachusetts, 1977.
12–17, https://doi.org/10.1109/GLOBECOM38437.2019.9013246.
[22] S. Bubeck, Convex optimization: algorithms and complexity, Found. Trends
[54] L. Yang, A. Moubayed, I. Hamieh, A. Shami, Tree-based intelligent intrusion
Mach. Learn. 8 (2015) 231–357, https://doi.org/10.1561/2200000050.
detection system in internet of vehicles, 2019 IEEE Glob. Commun. Conf.
[23] B. Shahriari, A. Bouchard-Côté, N. de Freitas, Unbounded Bayesian
GLOBECOM 2019 – Proc. (2019), https://doi.org/10.1109/
optimization via regularization, Proc. Artif. Intell. Statist., (2016) 1168–1176.
GLOBECOM38437.2019.9013892.
[24] G.I. Diaz, A. Fokoue-Nkoutche, G. Nannicini, H. Samulowitz, An effective
[55] S. Sanders, C. Giraud-Carrier, Informing the use of hyperparameter
algorithm for hyperparameter optimization of neural networks, IBM J. Res.
optimization through metalearning, Proc. – IEEE Int. Conf. Data Mining,
Dev. 61 (2017) 1–20, https://doi.org/10.1147/JRD.2017.2709578.
ICDM. 2017-Novem (2017) 1051–1056. https://doi.org/10.1109/
[25] C. Gambella, B. Ghaddar, J. Naoum-Sawaya, Optimization Models for Machine
ICDM.2017.137.
Learning: A Survey, arXiv preprint arXiv:1901.05331, 2019.
[56] M. Injadat, F. Salo, A.B. Nassif, A. Essex, A. Shami, Bayesian optimization with
[26] E.R. Sparks, A. Talwalkar, D. Haas, M.J. Franklin, M.I. Jordan, T. Kraska,
machine learning algorithms towards anomaly detection, 2018 IEEE Glob.
Automating model search for large scale machine learning, Proc. 6th ACM
Commun. Conf. (2018) 1–6. https://doi.org/10.1109/glocom.2018.8647714.
Symp. Cloud Comput. (2015) 368–380.
[57] K. Arjunan, C.N. Modi, An enhanced intrusion detection framework for
[27] J. Nocedal, S. Wright, Numerical Optimization, 2006, Springer-Verlag, ISBN:
securing network layer of cloud computing, ISEA Asia Secur. Priv. Conf. 2017,
978-0-387-40065-5.
ISEASP 2017. (2017) 1–10. doi: 10.1109/ISEASP.2017.7976988.
[28] R. Caruana, A. Niculescu-Mizil, An empirical comparison of supervised
[58] Y. Xia, C. Liu, Y.Y. Li, N. Liu, A boosted decision tree approach using Bayesian
learning algorithms, ACM Int. Conf. Proc. Ser. 148 (2006) 161–168, https://
hyper-parameter optimization for credit scoring, Expert Syst. Appl. 78 (2017)
doi.org/10.1145/1143844.1143865.
225–241, https://doi.org/10.1016/j.eswa.2017.02.017.
[29] O. Kramer, Scikit-Learn, in Machine Learning for Evolution Strategies,
[59] T.G. Dietterich, Ensemble methods in machine learning, Mult. Classif. Syst.
Springer International Publishing, Cham, Switzerland, 2016, pp. 45–53.
2000 (1857) 1–15.
[30] F. Pedregosa et al., Scikit-learn: machine learning in Python, J. Mach. Learn.
[60] W. Yin, K. Kann, M. Yu, H. Schütze, Comparative Study of CNN and RNN for
Res. 12 (2011) 2825–2830.
Natural Language Processing, arXiv preprint arXiv:1702.01923, (2017).
[31] T. Chen, C.Guestrin, XGBoost: a scalable tree boosting system, arXiv preprint
https://arxiv.org/abs1702.01923.
arXiv:1603.02754, (2016). http://arxiv.org/abs/1603.02754.
[61] A. Koutsoukas, K.J. Monaghan, X. Li, J. Huan, Deep-learning: Investigating
[32] F. Chollet, Keras, 2015. https://github.com/fchollet/keras.
deep neural networks hyper-parameters and comparison of performance to
shallow methods for modeling bioactivity data, J. Cheminf. 9 (2017) 1–13, https://doi.org/10.1186/s13321-017-0226-y.
[62] T. Domhan, J.T. Springenberg, F. Hutter, Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves, IJCAI Int. Jt. Conf. Artif. Intell. (2015) 3460–3468.
[63] Y. Ozaki, M. Yano, M. Onishi, Effective hyperparameter optimization using Nelder-Mead method in deep learning, IPSJ Trans. Comput. Vis. Appl. 9 (2017), https://doi.org/10.1186/s41074-017-0030-7.
[64] F.C. Soon, H.Y. Khaw, J.H. Chuah, J. Kanesan, Hyper-parameters optimisation of deep CNN architecture for vehicle logo recognition, IET Intell. Transp. Syst. 12 (2018) 939–946, https://doi.org/10.1049/iet-its.2018.5127.
[65] D. Han, Q. Liu, W. Fan, A new image classification method using CNN transfer learning and web data augmentation, Expert Syst. Appl. 95 (2018) 43–56, https://doi.org/10.1016/j.eswa.2017.11.028.
[66] C. Di Francescomarino, M. Dumas, M. Federici, C. Ghidini, F.M. Maggi, W. Rizzi, L. Simonetto, Genetic algorithms for hyperparameter optimization in predictive business process monitoring, Inf. Syst. 74 (2018) 67–83, https://doi.org/10.1016/j.is.2018.01.003.
[67] A. Moubayed, M. Injadat, A. Shami, H. Lutfiyya, Student engagement level in e-learning environment: clustering using K-means, Am. J. Distance Educ. 34 (2020) 1–20, https://doi.org/10.1080/08923647.2020.1696140.
[68] C. Ding, X. He, Cluster structure of K-means clustering via principal component analysis, Lect. Notes Comput. Sci. (Including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics) 3056 (2004) 414–418, https://doi.org/10.1145/1015330.1015408.
[69] T.K. Moon, The expectation-maximization algorithm, IEEE Signal Process. Mag. 13 (6) (1996) 47–60.
[70] S. Brahim-Belhouari, A. Bermak, M. Shi, P.C.H. Chan, Fast and Robust gas identification system using an integrated gas sensor technology and Gaussian mixture models, IEEE Sens. J. 5 (2005) 1433–1444, https://doi.org/10.1109/JSEN.2005.858926.
[71] Z. Y., K. G., Hierarchical clustering algorithms for document dataset, Data Min. Knowl. Discov. 10 (2005) 141–168.
[72] K. Khan, S.U. Rehman, K. Aziz, S. Fong, S. Sarasvady, A. Vishwa, DBSCAN: Past, present and future, 5th Int. Conf. Appl. Digit. Inf. Web Technol. ICADIWT 2014, 2014, pp. 232–238. https://doi.org/10.1109/ICADIWT.2014.6814687.
[73] H. Zhou, P. Wang, H. Li, Research on adaptive parameters determination in DBSCAN algorithm, J. Inf. Comput. Sci. 9 (2012) 1967–1973.
[74] J. Shlens, A Tutorial on Principal Component Analysis, arXiv preprint arXiv:1404.1100, (2014). https://arxiv.org/abs/1404.1100.
[75] N. Halko, P. Martinsson, J. Tropp, Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions, SIAM Rev. 53 (2) (2011) 217–288.
[76] M. Loog, Conditional linear discriminant analysis, Proc. – Int. Conf. Pattern Recognit. 2 (2006) 387–390, https://doi.org/10.1109/ICPR.2006.402.
[77] P. Howland, J. Wang, H. Park, Solving the small sample size problem in face recognition using generalized discriminant analysis, Pattern Recognit. 39 (2006) 277–287, https://doi.org/10.1016/j.patcog.2005.06.013.
[78] I. Ilievski, T. Akhtar, J. Feng, C.A. Shoemaker, Efficient hyperparameter optimization of deep learning algorithms using deterministic RBF surrogates, 31st AAAI Conf. Artif. Intell. AAAI 2017, 2017, pp. 822–829.
[79] M. Claesen, J. Simm, D. Popovic, Y. Moreau, B. De Moor, Easy Hyperparameter Search Using Optunity, arXiv preprint arXiv:1412.1114 (2014).
[80] C. Witt, Worst-case and average-case approximations by simple randomized search heuristics, in: Proceedings of the 22nd Annual Symposium on Theoretical Aspects of Computer Science, STACS’05, Stuttgart, Germany, 2005, pp. 44–56.
[81] Y. Bengio, Gradient-based optimization of hyperparameters, Neural Comput. 12 (8) (2000) 1889–1900.
[82] H.H. Yang, S.I. Amari, Complexity issues in natural gradient descent method for training multilayer perceptrons, Neural Comput. 10 (8) (1998) 2137–2157.
[83] J. Snoek, H. Larochelle, R. Adams, Practical Bayesian optimization of machine learning algorithms, Adv. Neural Inf. Process. Syst. 4 (2012) 2951–2959.
[84] E. Hazan, A. Klivans, Y. Yuan, Hyperparameter optimization: a spectral approach, arXiv preprint arXiv:1706.00764, 2017.
[85] M. Seeger, Gaussian processes for machine learning, Int. J. Neural Syst. 14 (2004) 69–106.
[86] F. Hutter, H.H. Hoos, K. Leyton-Brown, Sequential model-based optimization for general algorithm configuration, Proc. LION 5 (2011) 507–523.
[87] I. Dewancker, M. McCourt, S. Clark, Bayesian Optimization Primer, (2015). URL: https://sigopt.com/static/pdf/SigOpt Bayesian Optimization Primer.pdf.
[88] J. Hensman, N. Fusi, N.D. Lawrence, Gaussian processes for big data, arXiv preprint arXiv:1309.6835, 2013.
[89] M. Claesen, B. De Moor, Hyperparameter Search in Machine Learning, arXiv preprint arXiv:1502.02127, 2015.
[90] L. Bottou, Large-scale machine learning with stochastic gradient descent, in: Proceedings of the COMPSTAT, Springer, 2010, pp. 177–186.
[91] S. Zhang, J. Xu, E. Huang, C.H. Chen, A new optimal sampling rule for multi-fidelity optimization via ordinal transformation, IEEE Int. Conf. Autom. Sci. Eng. (2016) 670–674, https://doi.org/10.1109/COASE.2016.7743467.
[92] Z. Karnin, T. Koren, O. Somekh, Almost optimal exploration in multi-armed bandits, 30th Int. Conf. Mach. Learn. ICML 2013 (28) (2013) 2275–2283.
[93] S. Falkner, A. Klein, F. Hutter, BOHB: robust and efficient hyperparameter optimization at scale, 35th Int. Conf. Mach. Learn. ICML 2018 (4) (2018) 2323–2341.
[94] A. Gogna, A. Tayal, Metaheuristics: review and application, J. Exp. Theor. Artif. Intell. 25 (2013) 503–526, https://doi.org/10.1080/0952813X.2013.782347.
[95] F. Itano, M.A. De Abreu De Sousa, E. Del-Moral-Hernandez, Extending MLP ANN hyper-parameters Optimization by using Genetic Algorithm, Proc. Int. Jt. Conf. Neural Networks (2018) 1–8, https://doi.org/10.1109/IJCNN.2018.8489520.
[96] B. Kazimipour, X. Li, A.K. Qin, A Review of Population Initialization Techniques for Evolutionary Algorithms, 2014 IEEE Congr. Evol. Comput. (2014) 2585–2592. https://doi.org/10.1109/CEC.2014.6900618.
[97] S. Rahnamayan, H.R. Tizhoosh, M.M.A. Salama, A novel population initialization method for accelerating evolutionary algorithms, Comput. Math. Appl. 53 (2007) 1605–1614, https://doi.org/10.1016/j.camwa.2006.07.013.
[98] F.G. Lobo, D.E. Goldberg, M. Pelikan, Time complexity of genetic algorithms on exponentially scaled problems, Proc. Genet. Evol. Comput. Conf. (2000) 151–158.
[99] Y. Shi, R.C. Eberhart, Parameter Selection in Particle Swarm Optimization, Evolutionary Programming VII, Springer, 1998, pp. 591–600.
[100] X. Yan, F. He, Y. Chen, A Novel Hardware/Software Partitioning Method Based on Position Disturbed Particle Swarm Optimization with Invasive Weed Optimization 32 (2017) 340–355, https://doi.org/10.1007/s11390-017-1714-2.
[101] M.Y. Cheng, K.Y. Huang, M. Hutomo, Multiobjective dynamic-guiding PSO for optimizing work shift schedules, J. Constr. Eng. Manag. 144 (2018) 1–7, https://doi.org/10.1061/(ASCE)CO.1943-7862.0001548.
[102] H. Wang, Z. Wu, J. Wang, X. Dong, S. Yu, G. Chen, A new population initialization method based on space transformation search, 5th Int. Conf. Nat. Comput. ICNC 2009 (5) (2009) 332–336, https://doi.org/10.1109/ICNC.2009.371.
[103] J. Wang, J. Xu, X. Wang, Combination of Hyperband and Bayesian Optimization for Hyperparameter Optimization in Deep Learning, arXiv preprint arXiv:1801.01596, (2018). https://arxiv.org/abs/1801.01596.
[104] P. Cazzaniga, M.S. Nobile, D. Besozzi, The impact of particles initialization in PSO: parameter estimation as a case in point, 2015 IEEE Conf. Comput. Intell. Bioinforma. Comput. Biol. CIBCB 2015 (2015) 1–8, https://doi.org/10.1109/CIBCB.2015.7300288.
[105] R. Martinez-Cantin, BayesOpt: a Bayesian optimization library for nonlinear optimization, experimental design and bandits, J. Mach. Learn. Res. 15 (2015) 3735–3739.
[106] J. Bergstra, B. Komer, C. Eliasmith, D. Yamins, D.D. Cox, Hyperopt: a Python library for model selection and hyperparameter optimization, Comput. Sci. Discov. 8 (2015), https://doi.org/10.1088/1749-4699/8/1/014008.
[107] B. Komer, J. Bergstra, C. Eliasmith, Hyperopt-sklearn: automatic hyperparameter configuration for scikit-learn, Proc. ICML Workshop AutoML (2014) 34–40.
[108] M. Pumperla, Hyperas, 2019. http://maxpumperla.com/hyperas/.
[109] M. Lindauer, K. Eggensperger, M. Feurer, S. Falkner, A. Biedenkapp, F. Hutter, Smac v3: Algorithm configuration in python, 2017. https://github.com/automl/SMAC3.
[110] Tim Head, MechCoder, Gilles Louppe, et al., scikitoptimize/scikit-optimize: v0.5.2, 2018. doi: 10.5281/zenodo.1207017.
[111] N. Knudde, J. van der Herten, T. Dhaene, I. Couckuyt, GPflowOpt: A Bayesian Optimization Library using TensorFlow, arXiv preprint arXiv:1711.03845 (2017).
[112] Autonomio Talos [Computer software], 2019. http://github.com/autonomio/talos.
[113] L. Hertel, P. Sadowski, J. Collado, P. Baldi, Sherpa: hyperparameter optimization for machine learning models, Conf. Neural Inf. Process. Syst., 2018.
[114] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, et al., TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, arXiv preprint arXiv:1603.04467, (2016). https://arxiv.org/abs/1603.04467.
[115] J. Grandgirard, D. Poinsot, L. Krespi, J.P. Nénon, A.M. Cortesero, Osprey: Hyperparameter Optimization for Machine Learning, 103 (2002) 239–248. https://doi.org/10.21105/joss.00034.
[116] L. Franceschi, M. Donini, P. Frasconi, M. Pontil, Forward and reverse gradient-based hyperparameter optimization, 34th Int. Conf. Mach. Learn. ICML 2017 (70) (2017) 1165–1173.
[117] F.A. Fortin, F.M. De Rainville, M.A. Gardner, M. Parizeau, C. Gagné, DEAP: evolutionary algorithms made easy, J. Mach. Learn. Res. 13 (2012) 2171–2175.
[118] R.S. Olson, J.H. Moore, TPOT: a tree-based pipeline optimization tool for automating machine learning, Auto Mach. Learn. (2019) 151–160. https://doi.org/10.1007/978-3-030-05318-5_8.
[119] J. Rapin, O. Teytaud, Nevergrad – a gradient-free optimization platform, 2018. https://GitHub.com/FacebookResearch/Nevergrad.
[120] M. Injadat, A. Moubayed, A.B. Nassif, A. Shami, Systematic ensemble model selection approach for educational data mining, Knowl.-Based Syst. 200 (2020) 105992, https://doi.org/10.1016/j.knosys.2020.105992.
[121] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995.
[122] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, Adv. Neural Inf. Process. Syst. 25 (2012) 1097–1105.
[123] N. Hansen, A. Auger, O. Mersmann, T. Tusar, D. Brockhoff, COCO: A Platform for Comparing Continuous Optimizers in a Black-Box Setting, arXiv preprint arXiv:1603.08785, (2016). https://arxiv.org/abs/1603.08785.
[124] G.C. Cawley, N.L.C. Talbot, On over-fitting in model selection and subsequent selection bias in performance evaluation, J. Mach. Learn. Res. 11 (2010) 2079–2107.
[125] M. Boehm, A. Surve, S. Tatikonda, et al., SystemML: declarative machine learning on spark, Proc. VLDB Endow. 9 (2016) 1425–1436, https://doi.org/10.14778/3007263.3007279.
[126] X. Meng, J. Bradley, B. Yavuz, et al., Mllib: machine learning in apache spark, J. Mach. Learn. Res. 17 (1) (2016) 1235–1241.
[127] A. Moubayed, M. Injadat, A. Shami, H. Lutfiyya, DNS typo-squatting domain detection: a data analytics & machine learning based approach, 2018 IEEE Glob. Commun. Conf. GLOBECOM. (2018), https://doi.org/10.1109/GLOCOM.2018.8647679.
[128] Li Yang, Comprehensive visibility indicator algorithm for adaptable speed limit control in intelligent transportation systems, University of Guelph, 2018.
[129] F. Salo, M.N. Injadat, A. Moubayed, A.B. Nassif, A. Essex, Clustering enabled classification using ensemble feature selection for intrusion detection, 2019 Int. Conf. Comput. Netw. Commun. ICNC (2019) 276–281, https://doi.org/10.1109/ICCNC.2019.8685636.
[130] A. Moubayed, E. Aqeeli, A. Shami, Ensemble-based feature selection and classification model for DNS typo-squatting detection, 2020 IEEE Can. Conf. Electr. Comput. Eng. (2020).
[131] M. Injadat, A. Moubayed, A.B. Nassif, A. Shami, Multi-split optimized bagging ensemble model selection for multi-class educational data mining, Springer’s Appl. Intell. (2020).
[132] L. Yang, A. Shami, Hyperparameter Optimization of Machine Learning Algorithms (2020). https://github.com/LiYangHart/Hyperparameter-Optimization-of-Machine-Learning-Algorithms.

Li Yang received the B.E. degree in computer science from Wuhan University of Science and Technology, Wuhan, China in 2016 and the MASc degree in Engineering from University of Guelph, Guelph, Canada, 2018. Since 2018 he has been working toward the Ph.D. degree in the Department of Electrical and Computer Engineering, Western University, London, Canada. His research interests include cybersecurity, machine learning, data analytics, and intelligent transportation systems.

Abdallah Shami is a professor with the ECE Department at Western University, Ontario, Canada. He is the Director of the Optimized Computing and Communications Laboratory at Western University (https://www.eng.uwo.ca/oc2/). He is currently an associate editor for IEEE Transactions on Mobile Computing, IEEE Network, and IEEE Communications Surveys and Tutorials. He has chaired key symposia for IEEE GLOBECOM, IEEE ICC, IEEE ICNC, and ICCIT. He was the elected Chair of the IEEE Communications Society Technical Committee on Communications Software (2016–2017) and the IEEE London Ontario Section Chair (2016–2018).

Algorithms For Big Data
100% (1)
Algorithms For Big Data
458 pages
Neural Networks: MATLAB
No ratings yet
Neural Networks: MATLAB
91 pages
Time Series
No ratings yet
Time Series
69 pages
Genetic Algorithms
100% (2)
Genetic Algorithms
94 pages
15hc11 Optimization Techniques in Engineering
No ratings yet
15hc11 Optimization Techniques in Engineering
1 page
Seminar Report Machine Learning
No ratings yet
Seminar Report Machine Learning
20 pages
MATLab Tutorial #5 PDF
No ratings yet
MATLab Tutorial #5 PDF
7 pages
Particle Swarm Optimization Matlab Toolbox 1 - Conformat
No ratings yet
Particle Swarm Optimization Matlab Toolbox 1 - Conformat
5 pages
Chatfield The Analysis of Time Series
No ratings yet
Chatfield The Analysis of Time Series
293 pages
Machine Learning Techniques
No ratings yet
Machine Learning Techniques
11 pages
1 - Intro To Machine Learning
100% (1)
1 - Intro To Machine Learning
20 pages
Lec20 RidgeRegression
No ratings yet
Lec20 RidgeRegression
21 pages
Roadmap To Build A Machine Learning Model
No ratings yet
Roadmap To Build A Machine Learning Model
12 pages
Applications of Artificial Neural Networks in Foundation Engineering
100% (1)
Applications of Artificial Neural Networks in Foundation Engineering
25 pages
机器学习周志华 8.16.23 PM
No ratings yet
机器学习周志华 8.16.23 PM
443 pages
Machine Learning Regression
No ratings yet
Machine Learning Regression
64 pages
Principal Component Analysis (PCA) in Machine Learning
No ratings yet
Principal Component Analysis (PCA) in Machine Learning
20 pages
Regression Problems in Python PDF
No ratings yet
Regression Problems in Python PDF
34 pages
Reinforcement Learning in Robotics A Survey
No ratings yet
Reinforcement Learning in Robotics A Survey
37 pages
Genetic Algorithms: and Other Approaches For Similar Applications
100% (1)
Genetic Algorithms: and Other Approaches For Similar Applications
83 pages
Tabu Search 1
100% (1)
Tabu Search 1
15 pages
Energy Prediction of Appliances Using Supervised ML Algorithms
No ratings yet
Energy Prediction of Appliances Using Supervised ML Algorithms
17 pages
Air Quality Prediction
No ratings yet
Air Quality Prediction
21 pages
Machine Learning Notes
No ratings yet
Machine Learning Notes
135 pages
Lecture 01 (Introduction To Pattern Recognition)
No ratings yet
Lecture 01 (Introduction To Pattern Recognition)
26 pages
Maximum Entropy Distribution
No ratings yet
Maximum Entropy Distribution
11 pages
Jupyter Installation
100% (1)
Jupyter Installation
19 pages
Artificial Intelligence: Computer Science Engineering
No ratings yet
Artificial Intelligence: Computer Science Engineering
1 page
Deep Learning Based Recommendation Systems
No ratings yet
Deep Learning Based Recommendation Systems
47 pages
Principle of Maximum Entropy
No ratings yet
Principle of Maximum Entropy
10 pages
Stochastic Search Methods
100% (1)
Stochastic Search Methods
45 pages
Portfolio Optimization Using Particle Swarm Optimization
No ratings yet
Portfolio Optimization Using Particle Swarm Optimization
6 pages
StatisticsMachineLearningPythonDraft PDF
100% (1)
StatisticsMachineLearningPythonDraft PDF
313 pages
Introduction To Machine Learning PDF
100% (1)
Introduction To Machine Learning PDF
17 pages
Support Vector Machines: Dominik Wisniewski Wojciech Wawrzyniak
No ratings yet
Support Vector Machines: Dominik Wisniewski Wojciech Wawrzyniak
16 pages
02 Fundamentals of Neural Network
No ratings yet
02 Fundamentals of Neural Network
40 pages
The Mosaic of Metaheuristic Algorithms in Structural Optimization
No ratings yet
The Mosaic of Metaheuristic Algorithms in Structural Optimization
57 pages
Car Make and Model Recognition Using Ima
No ratings yet
Car Make and Model Recognition Using Ima
8 pages
Temperature Control and Adaptive Fuzzy Systems
No ratings yet
Temperature Control and Adaptive Fuzzy Systems
11 pages
Role of Machine Learning in The Field of Fiber Reinforced Polymer
No ratings yet
Role of Machine Learning in The Field of Fiber Reinforced Polymer
6 pages
Mathematica Laboratories For Mathematical Statistics (ASA-SIAM Series On Statistics and Applied Probability) (Jenny A. Baglivo) 0898715660
No ratings yet
Mathematica Laboratories For Mathematical Statistics (ASA-SIAM Series On Statistics and Applied Probability) (Jenny A. Baglivo) 0898715660
281 pages
Fitting A Neural Network Model
No ratings yet
Fitting A Neural Network Model
9 pages
BaYesian Models Machine Learning 2016
No ratings yet
BaYesian Models Machine Learning 2016
126 pages
Machine Learning 1
No ratings yet
Machine Learning 1
11 pages
Research Methods in Machine Learning: A Content Analysis: Jackson Kamiri Geoffrey Mariga
No ratings yet
Research Methods in Machine Learning: A Content Analysis: Jackson Kamiri Geoffrey Mariga
14 pages
DEEP_LEARNING_UNIT_1[1]
No ratings yet
DEEP_LEARNING_UNIT_1[1]
24 pages
Transformer Architecture
No ratings yet
Transformer Architecture
18 pages
ML QP
No ratings yet
ML QP
6 pages
Feature engineering Complete Self-Assessment Guide
From Everand
Feature engineering Complete Self-Assessment Guide
Gerardus Blokdyk
No ratings yet
Advanced Dynamic-System Simulation: Model Replication and Monte Carlo Studies
From Everand
Advanced Dynamic-System Simulation: Model Replication and Monte Carlo Studies
Granino A. Korn
No ratings yet