Optimisation II Notes
Dr Matthew Woolway
2018-09-26
Contents

Course Outline
  Course Structure and Details
  Course Assessment
  Course Topics
  Hardware Requirements
Course Outline
Course Assessment
Course Topics
Hardware Requirements
The course will be very computational in nature; however, you do not need your own personal machine. MSL already has Python installed. The labs will be running IDEs for Python (along with Jupyter), while I will be using Jupyter for easier presentation and explanation in lectures. You will at some point need to become familiar with Jupyter, as the tests will be conducted in the Maths Science Labs (MSL) utilising this platform for autograding purposes.
If you do have your own machine and would prefer to work from it, you are more than welcome. Since all the notes and code will be presented through Jupyter, please follow these steps:
• Install Anaconda from here: https://repo.continuum.io/archive/Anaconda3-5.2.0-Windows-x86_64.exe
  – Make sure when installing Anaconda to add the installation to PATH when prompted (it will be deselected by default).
• To launch a Jupyter notebook, open the command prompt (cmd) and type jupyter notebook. This should launch the browser and Jupyter. If you see any proxy issues while on campus, then you will need to set the proxy to exclude localhost.
If you are not running Windows but rather Linux, then you can get Anaconda at: https://repo.continuum.io/archive/Anaconda3-5.2.0-Linux-x86_64.sh
Chapter 1: Definition and General Concepts
In industry, commerce, government, indeed in all walks of life, one frequently needs answers to questions concerning
operational efficiency. Thus an architect may need to know how to lay out a factory floor so that the article being
manufactured does not have to be moved about too much as it goes from one machine tool to another; the manager
of a shipping company needs to plan the itineraries of his ships so as to increase the amount of goods handled,
while avoiding costly waiting-around in docks. A telecommunications engineer may want to know how best to
transmit signals so as to minimise the possibility of error on reception. Further examples of problems of this sort
are provided by the planning of a railroad time-table to ensure that trains are available as and when needed, the
synchronisation of traffic lights, and many other real-life situations. Formerly such problems would usually be ‘solved’ by imprecise methods giving results that were both unreliable and costly. Today, they are increasingly being subjected to rigorous mathematical analysis, designed to provide methods for finding exact solutions or highly reliable estimates rather than vague approximations. Optimisation provides many of the mathematical tools used for solving such problems.
Optimisation, or mathematical programming, is the study of maximising and minimising functions subject to
specified boundary conditions or constraints. The functions to be optimised arise in engineering, the physical
and management sciences, and in various branches of mathematics. With the emergence of the computer age,
optimisation experienced a dramatic growth as a mathematical theory, enhancing both combinatorics and classical
analysis. In its applications to engineering and management science, optimisation (linear programming) forms an
important part of the discipline of operations research.
Linear programming has proved an extremely powerful tool, both in modelling real-world problems and as a widely applicable mathematical theory. However, many important practical optimisation problems are nonlinear. The study of such problems involves a diverse blend of linear algebra, multivariate calculus, quadratic forms, numerical analysis and computing techniques. Important special areas include the design of computational algorithms, the geometry and analysis of convex sets and functions, and the study of specially structured problems such as unconstrained and constrained nonlinear optimisation problems. Nonlinear optimisation provides fundamental insights into mathematical analysis, and is widely used in the applied sciences, in areas such as engineering design, regression analysis, inventory control, and geophysical exploration among others. General nonlinear optimisation problems and various computational algorithms for addressing such problems will be taught in this course.
Optimisation problems are made up of three primary ingredients: (i) an objective function, (ii) variables (unknowns) and (iii) constraints.
• An objective function which we want to minimize or maximize. For instance, in a manufacturing process, we
might want to maximize the profit or minimize the cost. In fitting experimental data to a user-defined model,
we might minimize the total deviation of observed data from predictions based on the model. In designing an
automobile panel, we might want to maximize the strength.
• A set of unknowns/variables which affect the value of the objective function. In the manufacturing problem,
the variables might include the amounts of different resources used or the time spent on each activity. In fitting-
the-data problem, the unknowns are the parameters that define the model. In the panel design problem, the
variables used define the shape and dimensions of the panel.
• A set of constraints that allow the unknowns to take on certain values but exclude others. For the manufac-
turing problem, it does not make sense to spend a negative amount of time on any activity, so we constrain all
the ‘time’ variables to be non-negative. In the panel design problem, we would probably want to limit the
weight of the product and to constrain its shape. The optimisation problem is then :
Find values of the variables that minimize or maximize the objective function while satisfying the constraints.
subject to:
g j (x) ≤ 0, j = 1, 2, . . . , m,
h j (x) = 0, j = 1, 2, . . . , r.
where f (x), g j (x) and h j (x) are scalar functions of the real vector x.
The continuous components xᵢ of x are called the variables, f(x) is the objective function, gⱼ(x) denotes the respective inequality constraint functions and hⱼ(x) the equality constraint functions. The optimum vector x that solves Equation (1.1) is denoted by x∗, with a corresponding optimum function value f(x∗). If there are no constraints specified, then the problem is aptly named an unconstrained minimisation problem. A great deal of progress has been made in solving different classes of the general problem introduced in Equation (1.1). On occasion these solutions can be attained analytically, yielding a closed-form solution. However, most real-world problems have n > 2 and as a result need to be solved numerically through suitable computational algorithms.
1.3 Important Optimisation Concepts
1.3.1 Definitions
1.3.1.1 Neighbourhoods
Definition 1.1 (δ-Neighbourhood). A δ-neighbourhood of a point y is the set of all points within a distance δ of y, denoted by Nδ(y). It is the set of all points x such that:

Nδ(y) = {x : ‖x − y‖ ≤ δ}, (1.2)

that is, x ∈ Nδ(y).
Definition 1.2 (Global Minimiser/Maximiser). If a point x ∈ S, where S is the feasible set, then x is a feasible solution. A feasible solution x_g of some problem P is the global minimiser of f(x) if:

f(x_g) ≤ f(x), (1.3)

for all feasible points x of the problem P. The value f(x_g) is called the global minimum. The converse applies for the global maximum.
Definition 1.3 (Local Minimiser/Maximiser). A point x∗ is called a local minimiser of f(x) if there exists a suitable δ > 0 such that for all feasible x ∈ Nδ(x∗):

f(x∗) ≤ f(x). (1.4)

In other words, x∗ is a local minimiser of f(x) if there exists a neighbourhood Nδ(x∗) of x∗ containing feasible x such that:

f(x∗) ≤ f(x), ∀x ∈ Nδ(x∗). (1.5)
Definition 1.4 (Infimum). If f is a function on S, the greatest lower bound or infimum of f on S is the largest number m (possibly m = −∞) such that f(x) ≥ m ∀x ∈ S. The infimum is denoted by:

inf_{x∈S} f(x).

Definition 1.5 (Supremum). If f is a function on S, the least upper bound or supremum of f on S is the smallest number m (possibly m = +∞) such that f(x) ≤ m ∀x ∈ S. The supremum is denoted by:

sup_{x∈S} f(x).
1.3.2 Convexity
Definition 1.6 (Affine Set). A line through the points x₁ and x₂ in Rⁿ is the set:

{x | x = θx₁ + (1 − θ)x₂, θ ∈ R}.

This is known as an Affine Set. An example of this is the solution set of the linear equations Ax = b.
Definition 1.7 (Convex Set). The line segment between the points x₁ and x₂ is the set of points:

x = θx₁ + (1 − θ)x₂, θ ∈ [0, 1].

A set is convex if, for any two points x₁ and x₂ in the set, this entire line segment also lies in the set. If this condition does not hold then the set is non-convex. Think of this as line of sight. Some examples are considered in the Figure below:
Definition 1.8 (Convex Combination). We can define the Convex Combination of x₁, . . . , xₙ as any point x which satisfies:

x = θ₁x₁ + θ₂x₂ + . . . + θₙxₙ, (1.10)

where θ₁ + . . . + θₙ = 1, θᵢ ≥ 0.
The Convex Hull (conv L) is the set of all convex combinations of the points in L. This can be thought of as the
tightest bound across all points in the set, as can be seen in the Figure below:
Definition 1.9 (Hyperplane/Halfspace). We can define the Hyperplane as the set of the form:

{x | aᵀx = b} (a ≠ 0), (1.11)

and the Halfspace as the set of the form:

{x | aᵀx ≤ b} (a ≠ 0). (1.12)

Note: Hyperplanes are both affine and convex, while halfspaces are only convex. These are illustrated in the Figures below:
Definition 1.10 (Level Set). Consider the real valued function f on L. Let a be in R; then we denote by L_a the set:

L_a = {x ∈ L | f(x) ≤ a}. (1.13)

These sets are known as level sets of f on L and can be thought of as all vectors returning a function value less than or equal to the constant value a.
Definition 1.11 (Level Surface). Consider the real valued function f on L. Let a be in R; then we denote by C_a the set:

C_a = {x ∈ L | f(x) = a}. (1.14)

These sets are known as level surfaces of f on L and can be thought of as the cross section taken at some point x₀ ∈ L.
1.3.3 Exercises
1. Determine which of the following sets are convex:
   S₁ = {(x₁, x₂) | x₁² + x₂² = 1}
   S₂ = {(x₁, x₂) | x₁² + x₂² > 0}
   S₃ = {(0, 0), (1, 0), (1, 1), (0, 1)}
   S₄ = {(x₁, x₂) | |x₁| + |x₂| < 1}
4. In each of the following cases, sketch the level sets L_b of the function f:
   • f(x₁, x₂) = x₁ + x₂, b = 1, 2
   • f(x₁, x₂) = x₁x₂, b = 1, 2
   • f(x) = eˣ, b = 10, 0
5. Let L be a convex set in Rⁿ, A be an m × n matrix and α a scalar. Show that the following sets are convex.
   • {y : y = Ax, x ∈ L}
   • {αx : x ∈ L}
6. If you have two points that solve a system of linear equations Ax = b, i.e. points x₁ and x₂, where x₁ ≠ x₂, then prove that the line that passes through these two points lies in the affine set.
7. Prove that a halfspace is convex.
Chapter 2: One Dimensional Unconstrained and Bound Constrained Problems
Consider the problem of minimising a function of a single real variable x over a set L:

min f(x), x ∈ L, (2.1)

where f is a continuous and twice differentiable function. If L is an interval, then x is bound constrained. If L = R, then the problem is unconstrained.
A function is said to be monotonic increasing along a given path when f(x₂) > f(x₁), and monotonic decreasing when f(x₂) < f(x₁), for all points in the domain for which x₂ > x₁. For the situations in which f(x₂) ≥ f(x₁) and f(x₂) ≤ f(x₁) the functions are respectively called monotonic non-decreasing and monotonic non-increasing. For example, if we consider f(x) = x², where x > 0, then f(x) is monotonic increasing. For x < 0, f(x) is monotonic decreasing. A function that has a single minimum or a single maximum (single peak) is known as a unimodal function. Functions with two peaks (two minima or two maxima) are called bimodal, and functions with many peaks are known as multimodal functions.
Definition 2.1 (Convex Functions). A function f : L ⊂ Rⁿ → R defined on the convex set L is convex if for all x₁, x₂ ∈ L and for all θ ∈ [0, 1]:

f(θx₁ + (1 − θ)x₂) ≤ θf(x₁) + (1 − θ)f(x₂). (2.2)

Consider a function of a single variable. We may then state this simply by saying: the function f is convex if the chord connecting x₁ and x₂ lies above the graph.
See the Figure below:
Note: the function is strictly convex if only the strict inequality (<) applies; f is concave if −f is convex.
We shall summarise some definitions that were previously mentioned. We begin with the concept of global optimi-
sation (maximisation or minimisation). In the context of optimisation, relative optima (maximum or minimum) are
normally referred to as local optima.
Global Optima
A point at which f(x) attains its greatest (or least) value on an interval [a, b] is called a point of global maximum (or minimum). In general, a function f(x) takes on its absolute (global) maximum (minimum) at a point x∗ if f(x) < f(x∗) (f(x) > f(x∗)) for all other x over which the function f(x) is defined.
Local Optima
f(x) has a strong local (relative) maximum (minimum) at an interior point x∗ ∈ (a, b) if f(x) < f(x∗) (f(x) > f(x∗)) for all x in some neighbourhood of x∗. The maximum (minimum) is weak if ≤ replaces <. Strong local optima are illustrated in the Figure below. If a function f(x) has a strong relative maximum at some point x∗, then there is an interval including x∗, no matter how small, such that for all x in this interval, f(x) is strictly less than f(x∗), i.e. f(x) < f(x∗). It is the ‘strictly less’ that makes this a strong relative maximum. If, however, the strictly-less sign is replaced by a ≤ sign, i.e. f(x) ≤ f(x∗), then the maximum value at x∗ is a weak maximum and x∗ is a weak maximiser.
A necessary condition for f(x) to have a maximum or a minimum at an interior point x∗ ∈ (a, b) is that the slope at this point must be zero, i.e.:

f′(x) = df(x)/dx = 0, (2.3)

which corresponds to the first order necessary condition (FONC). The FONC is necessary but not sufficient. For example, consider the function f(x) = x³ as seen in the Figure below. At x = 0, f′(x) = 0 but there is no maximum or minimum point on the interval (−∞, ∞). At x = 0 there is a point of inflection, where f″(x) = 0. Therefore, the point x = 0 is a stationary point but not a local optimum.
Thus in addition to the FONC, non-negative curvature (f″(x∗) ≥ 0) is also necessary at x∗. If, moreover, the second order condition:

f″(x) = d²f(x)/dx² > 0, (2.4)

holds at x∗, then x∗ is a strong local minimum. This is known as the second order sufficiency condition (SOSC).
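To make the conditions concrete, here is a small sketch (using sympy, which ships with Anaconda) that applies the FONC and the curvature check to the inflection example f(x) = x³ discussed above; the variable names are illustrative:
import sympy as sp
x = sp.symbols('x')
f = x**3
stationary = sp.solve(sp.diff(f, x), x)        # FONC: solve f'(x) = 0
for s in stationary:
    curvature = sp.diff(f, x, 2).subs(x, s)    # check f''(x) at the stationary point
    print(s, curvature)
## 0 0
A curvature of zero means the second order test is inconclusive, consistent with x = 0 being an inflection point.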
2.4.1 Exercises
1. If the convexity condition for any real valued function f : R → R is given by:

f(θx₁ + (1 − θ)x₂) ≤ θf(x₁) + (1 − θ)f(x₂), θ ∈ [0, 1],

then using the above, prove that the following one dimensional functions are convex:
   • f₁(x) = 1 + x²
   • f₂(x) = x² − 1
2. Find all stationary points of:
f (x) = x 3 (3x 2 − 5) + 1,
and decide the maximiser, minimiser and the point of inflection, if any.
3. Using the FONC of optimality, determine the optimiser of the following functions:
   • f(x) = (1/3)x³ − (7/2)x² + 12x + 3
   • f(x) = 2(x − 1)² + x²
4. Using necessary and sufficient conditions for optimality, investigate the maximiser/minimiser of:

Chapter 3: Numerical Solutions to Nonlinear Equations

It is often necessary to find the stationary point(s) of a given function f(x). This means finding the root of a nonlinear function g(x) if we consider g(x) = f′(x) = 0; in other words, solving g(x) = 0. Here, we introduce the Newton method. This method is important because when we cannot solve f′(x) = 0 analytically, we can still solve it numerically.
Newton’s method is one of the more powerful and well known numerical methods for finding a root of g(x), i.e. for solving for x such that g(x) = 0. So we can use it to find the turning point, i.e. where f′(x) = 0. In the context of optimisation we want an x∗ such that f′(x∗) = 0. Consider the figure below:
(Figure: plot of f(x) illustrating the tangent construction used by Newton's method.)
The tangent to the curve of g(x) at (xₙ, g(xₙ)) has slope:

g′(xₙ) = (y − g(xₙ)) / (x − xₙ). (3.1)

Setting y = 0 and writing x = xₙ₊₁ gives:

xₙ₊₁ = xₙ − g(xₙ)/g′(xₙ), provided g′(xₙ) ≠ 0. (3.2)
Hence the Newton method can be described by the following two steps:
• Choose an initial estimate x₀.
• Iterate by xₙ₊₁ = xₙ − f′(xₙ)/f″(xₙ).
3.1.0.1 Example
Find the minimum of f(x) = x⁴/4 + x²/2 − 3x near x = 2.
(Figure: plot of f(x) and its derivative over the interval shown.)
We can see that finding the root of the derivative yields the minimum value of f(x) in this case (approximately x = 1.21341). Check: perform a few iterations of Equation (3.2) to verify the solution.
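As a check, here is a minimal Python sketch of iteration (3.2) applied to this example (the function name, tolerance and iteration cap are illustrative choices):
def newton(g, g_prime, x0, tol=1e-8, max_iter=50):
    # iterate x_{n+1} = x_n - g(x_n)/g'(x_n) until the step is tiny
    x = x0
    for _ in range(max_iter):
        step = g(x) / g_prime(x)    # assumes g'(x) != 0
        x = x - step
        if abs(step) < tol:
            break
    return x

g = lambda x: x**3 + x - 3          # f'(x) for f(x) = x**4/4 + x**2/2 - 3x
g_prime = lambda x: 3*x**2 + 1      # f''(x)
print(round(newton(g, g_prime, 2.0), 5))
## 1.21341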
Advantages:
• Fast convergence near the root.
Disadvantages:
• Unknown number of steps needed for required accuracy, unlike Bisection for example.
• f must be at least twice differentiable.
• Runs into problems when g′(x∗) = 0.
• It could potentially be difficult to compute g(x) and g′(x) even if they do exist.
In general Newton's Method is fast, reliable and trouble-free, but one has to be mindful of the potential problems.
3.2.1 Exercises
1. Find a root of:

3x − sin(x) − exp(x) = 0.

2. Find the critical point of:

f(x) = (1/4)x⁴ + (1/2)x² − 2x + 1,

using both Newton's Method and the Secant Method. If the critical point is a minimiser then obtain the minimum value. You may assume x = 2 as an initial guess.
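The exercise above refers to the Secant Method. As a reminder, here is a minimal sketch of the standard secant iteration for g(x) = 0 (a generic textbook form, not specific to these notes), which replaces g′(xₙ) in (3.2) with a finite-difference slope:
def secant(g, x0, x1, tol=1e-8, max_iter=50):
    for _ in range(max_iter):
        slope = (g(x1) - g(x0)) / (x1 - x0)   # finite-difference estimate of g'
        x0, x1 = x1, x1 - g(x1) / slope
        if abs(x1 - x0) < tol:
            break
    return x1

print(round(secant(lambda x: x**3 + x - 2, 2.0, 1.5), 5))
## 1.0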
Chapter 4: Numerical Optimisation of Univariate Functions
The simplest functions with which to begin a study of non-linear optimisation methods are those with a single
independent variable. Although the minimisation of univariate functions is in itself of some practical importance,
the main area of application for these techniques is as a subproblem of multivariate minimisation.
There are functions to be minimised where the variable x is unrestricted (say, x ∈ R); there are also functions to be optimised over a finite interval (in n dimensions, a box). Single variable optimisation over a finite interval is important because of its application in multivariable optimisation. In this chapter we will consider one dimensional optimisation.
If one needs to find the maximum or minimum (i.e. the optimal) value of a function f(x) on the interval [a, b], the procedure would be:
1. Find all turning (stationary) points of f(x) (assuming f(x) is differentiable) on [a, b] and then decide the optimum.
2. Find the optimal turning point of f(x) on [a, b].
Generally it may be difficult, impossible or tiresome to implement step (1) analytically, so we resort to the computer and an appropriate numerical method to find an optimal (hopefully the best estimate!) solution of a univariate function. In the next section we introduce some numerical techniques. The numerical approach is mandatory when the function f(x) is not given explicitly.
In many cases one would like to find the minimiser of a function f(x) when neither f(x) nor f′(x) is given (or known) explicitly; numerical approaches, viz. polynomial interpolation or function comparison methods, are then used. These univariate optimisation techniques are used as line searches in multivariate optimisation.
We assume that an interval [a, b] is given and that a local minimum x∗ ∈ [a, b]. When the first derivative of the objective function f(x) is known at a and b, it is necessary to evaluate function information at only one interior point in order to reduce this interval. This is because it is possible to decide whether an interval brackets a minimum simply by looking at the function values f(a), f(b) and the derivatives f′(a), f′(b) at the extreme points a and b. The conditions to be satisfied are:
These three situations are illustrated in the Figure below. The next step of the bisection method is to reduce the interval. At the k-th iteration we have an interval [aₖ, bₖ] and the mid-point cₖ = ½(aₖ + bₖ) is computed. The next interval, [aₖ₊₁, bₖ₊₁], is either [aₖ, cₖ] or [cₖ, bₖ], depending on which interval brackets the minimum. The process continues until two consecutive intervals produce minima which are within an acceptable tolerance.
(Figures: the three bracketing cases, Condition 1, Condition 2 and Condition 3.)
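A minimal sketch of the derivative-based bisection step described above, assuming f′(a) < 0 < f′(b) so that [a, b] brackets a minimum (the names and tolerance are illustrative):
def bisect_min(f_prime, a, b, tol=1e-6):
    while b - a > tol:
        c = 0.5 * (a + b)        # mid-point of the current interval
        if f_prime(c) > 0:       # minimum is bracketed by [a, c]
            b = c
        else:                    # minimum is bracketed by [c, b]
            a = c
    return 0.5 * (a + b)

print(round(bisect_min(lambda x: x**3 + x - 3, 0.0, 2.0), 5))
## 1.21341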
4.1.1.1 Exercise
4.1.2 Golden Section Search
Suppose f : R → R on the interval [a₀, b₀] and f has only one minimum x∗ in this interval (we say f is unimodal). The problem is to locate x∗. The method we now discuss is based on evaluating the objective function at different points in the interval [a₀, b₀]. We choose these points in such a way that an approximation to the minimiser of f may be achieved in as few evaluations as possible. Our goal is to progressively narrow down the subinterval containing x∗.
If we evaluate f at only one intermediate point of the interval [a₀, b₀], we cannot narrow the range within which we know the minimiser is located. We have to evaluate f at two intermediate points a₁ and b₁, chosen so that the reduction in the range is symmetrical, in the sense that a₁ − a₀ = b₀ − b₁ = ρ(b₀ − a₀), where ρ < 1/2 so that a₁ lies to the left of b₁. We then evaluate f at the intermediate points. If f(a₁) < f(b₁), then the minimiser must lie in the range [a₀, b₁]. If, on the other hand, f(a₁) ≥ f(b₁), then the minimiser is located in the range [a₁, b₀]. Starting with the reduced range of uncertainty we can repeat the process and similarly find two new points a₂ and b₂, using the same value of ρ as before. However, we would like to minimise the number of function evaluations while reducing the width of the interval of uncertainty. Suppose that f(a₁) < f(b₁). Then we know that x∗ ∈ [a₀, b₁]. Because a₁ is already in the uncertainty interval and f(a₁) is known, we can use this information: we make a₁ coincide with b₂. Thus, only one new evaluation of f, at a₂, is necessary. We can now calculate the value of ρ that results in only one new evaluation of f. To save algebra we assume that b₀ − a₀ = 1. Then, to have only one new evaluation of f it is enough to choose ρ so that:

ρ(b₁ − a₀) = b₁ − b₂.

Because b₁ − a₀ = 1 − ρ and b₁ − b₂ = 1 − 2ρ, we have:

ρ(1 − ρ) = 1 − 2ρ, i.e. ρ² − 3ρ + 1 = 0, (4.1)

whose root in (0, 1/2) is ρ = (3 − √5)/2 ≈ 0.382 (the golden section).
Therefore:

a₁ = a₀ + ρ(b₀ − a₀), (4.2)
b₁ = b₀ − ρ(b₀ − a₀). (4.3)
4.1.2.1 Example
Use four iterations of the Golden Section search to find the value of x that minimises:

f(x) = x⁴ − 14x³ + 60x² − 70x,

over the range [0, 2], so a₀ = 0 and b₀ = 2.
Iteration 1:

a₁ = a₀ + ρ(b₀ − a₀) = 0.763,
b₁ = a₀ + (1 − ρ)(b₀ − a₀) = 1.236.

We compute:

f(a₁) = −24.36,
f(b₁) = −18.96.

Thus we have f(a₁) < f(b₁), and so the uncertainty interval is reduced to [a₀, b₁] = [0, 1.236].
Iteration 2:
We choose b₂ to coincide with a₁, so f need only be evaluated at one new point:

a₂ = a₀ + ρ(b₁ − a₀) = 0.4721.

Now we have:

f(a₂) = −21.10,
f(b₂) = f(a₁) = −24.36.

Since f(a₂) > f(b₂), the uncertainty interval is reduced to [a₂, b₁] = [0.4721, 1.236].
Iteration 3:
We set a₃ = b₂ = 0.763 and compute:

b₃ = a₂ + (1 − ρ)(b₁ − a₂) = 0.9443.

We have:

f(a₃) = −24.36,
f(b₃) = −23.59.

Since f(a₃) < f(b₃), the uncertainty interval is reduced to [a₂, b₃] = [0.4721, 0.9443].
Iteration 4:
We set b₄ = a₃ = 0.763 and compute:

a₄ = a₂ + ρ(b₃ − a₂) = 0.6525.

We have:

f(a₄) = −23.84,
f(b₄) = −24.36.

Since f(a₄) > f(b₄), the value of x that minimises f is located in the interval [a₄, b₃] = [0.652, 0.944].
(Figure: plot of f(x) = x⁴ − 14x³ + 60x² − 70x on [0, 2].)
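A compact Python sketch of the golden section search above, reproducing the example's intervals (the function and iteration count follow the example; the variable names are illustrative):
import math

rho = (3 - math.sqrt(5)) / 2                # ~0.382, the golden section ratio

def golden_section(f, a, b, n_iters):
    a1 = a + rho * (b - a)
    b1 = a + (1 - rho) * (b - a)
    fa1, fb1 = f(a1), f(b1)
    for _ in range(n_iters - 1):
        if fa1 < fb1:                       # minimiser in [a, b1]: reuse a1 as the new b1
            b, b1, fb1 = b1, a1, fa1
            a1 = a + rho * (b - a)
            fa1 = f(a1)
        else:                               # minimiser in [a1, b]: reuse b1 as the new a1
            a, a1, fa1 = a1, b1, fb1
            b1 = a + (1 - rho) * (b - a)
            fb1 = f(b1)
    return (a, b1) if fa1 < fb1 else (a1, b)

f = lambda x: x**4 - 14*x**3 + 60*x**2 - 70*x
lo, hi = golden_section(f, 0.0, 2.0, 4)
print(round(lo, 3), round(hi, 3))
## 0.652 0.944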
4.1.3 Exercises
1. Find the minimum value of the one dimensional function f(x) = x² − 3x exp(−x), over [0, 1], using:
   • the Bisection Method
   • the Golden Section Search Method
Chapter 5: Multivariate Unconstrained Optimisation
Unconstrained optimisation is optimisation when we know we do not have to worry about the boundaries of the feasible set:

min f(x) (5.1)
s.t. x ∈ S,

where S is the feasible set. It should then be possible to find local minima and maxima just by looking at the behaviour of the objective function, and indeed there are necessary and sufficient conditions for doing so. In this chapter these conditions will be derived. The idea of a line in a particular direction is important for unconstrained optimisation methods; we discuss this and derive the slope and curvature of the function f at a point on the line.
For a function f(x), x ∈ Rⁿ, there exists at any point x a vector of first order partial derivatives, or gradient vector:

∇f(x) = ( ∂f/∂x₁(x), ∂f/∂x₂(x), . . . , ∂f/∂xₙ(x) )ᵀ = g(x). (5.2)
It can be shown that if the function f(x) is smooth, then at the point x the gradient vector ∇f(x) (denoted by g(x)) is always perpendicular to the contours (or surfaces of constant function value) and is the direction of maximum increase of f(x), as seen in the Figure above. You can copy the Mathematica code to generate the output above. The Manipulate construct will allow you to move the point around to see the gradient at different contours.
If f(x) is twice continuously differentiable, then at the point x there exists a matrix of second order partial derivatives called the Hessian matrix:

H(x) = ∇²f(x) = [ ∂²f/(∂xᵢ∂xⱼ)(x) ], i, j = 1, . . . , n, (5.3)

the n × n matrix whose (i, j) entry is ∂²f/(∂xᵢ∂xⱼ)(x).
5.1.0.1 Example
Let f(x₁, x₂) = 5x₁ + 8x₂ + x₁x₂ − x₁² − 2x₂². Then:

∇f(x) = ( 5 + x₂ − 2x₁, 8 + x₁ − 4x₂ )ᵀ,

and:

∇²f(x) = ( −2  1 ; 1  −4 ).
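A quick sanity check of this example with sympy (an illustrative sketch; sympy ships with Anaconda):
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
f = 5*x1 + 8*x2 + x1*x2 - x1**2 - 2*x2**2
print(sp.Matrix([sp.diff(f, x1), sp.diff(f, x2)]))   # the gradient vector
## Matrix([[-2*x1 + x2 + 5], [x1 - 4*x2 + 8]])
print(sp.hessian(f, (x1, x2)))                       # the Hessian matrix
## Matrix([[-2, 1], [1, -4]])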
Definition 5.1 (Feasible Direction). A vector d ∈ Rⁿ, d ≠ 0, is a feasible direction at x ∈ S if there exists α₀ > 0 such that x + αd ∈ S for all α ∈ [0, α₀].
Definition 5.2 (Directional Derivative). Let f : Rⁿ → R be a real-valued function and let d be a feasible direction at x ∈ S. The directional derivative of f in the direction of d, denoted by dᵀ∇f(x), is given by:

dᵀ∇f(x) = lim_{α→0} ( f(x + αd) − f(x) ) / α. (5.4)

If ‖d‖ = 1, then dᵀ∇f(x) is the rate of increase of f at x in the direction d. To compute the above directional derivative, suppose that x and d are given. Then f(x + αd) is a function of α, and:

dᵀ∇f(x) = d/dα f(x + αd) |_{α=0}. (5.5)
A line through the point x₀ in the direction d can be written as:

x = x(α) = x₀ + αd. (5.6)

A point on this line is then specified not by x itself, but only by the value of α associated with any point along the line. For example:
import numpy as np
from numpy import linalg as LA
d = np.array([3, 1])
alpha = LA.norm(d, 2)       # Euclidean length of d
print(d)
## [3 1]
print(alpha)
## 3.1622776601683795
norm_d = d/alpha            # unit vector in the direction of d
print('The normalised vector d is:', norm_d)
## The normalised vector d is: [0.9486833  0.31622777]
print('The normalised d^T d gives:', round(float(np.dot(norm_d, norm_d)), 8))
## The normalised d^T d gives: 1.0
We now use the gradient and the Hessian of f(x) to derive the derivative of f(x) along a line in any direction. For a fixed line of a given direction, as in Equation (5.6), the points on the line are a function of α only. Hence a change in α causes a change in all coordinates of x(α). The derivative of f(x) with respect to α is:

df/dα = ∂f/∂x₁ · dx₁/dα + ∂f/∂x₂ · dx₂/dα + · · · + ∂f/∂xₙ · dxₙ/dα. (5.7)

Equation (5.7) represents the derivative of f(x) at any point x(α) along the line. The operator d/dα can be expressed as:

d/dα = dx₁/dα · ∂/∂x₁ + dx₂/dα · ∂/∂x₂ + · · · + dxₙ/dα · ∂/∂xₙ = dᵀ∇, (5.8)

so that:

df/dα = dᵀ∇f(x(α)) = ∇f(x(α))ᵀd. (5.9)

Differentiating once more:

d²f/dα² = d/dα ( df(x(α))/dα ) = dᵀ∇(∇fᵀd) = dᵀ∇²f d, (5.10)

where ∇f and ∇²f are evaluated at x(α). These (slope and curvature), when evaluated at α = 0, are respectively known as the derivative (also called the slope, since f = f(α) is now a function of the single variable α) and the curvature of f at x₀ in the direction of d.
5.2.0.1 Example
Consider the Rosenbrock function f(x) = 100(x₂ − x₁²)² + (1 − x₁)² at the point x₀ = (0, 0)ᵀ, with direction d = (1, 0)ᵀ. Then:

∇f = ( −400x₁(x₂ − x₁²) − 2(1 − x₁) ; 200(x₂ − x₁²) ) = ( −2 ; 0 ) at x₀,

and:

∇²f = ( −400(x₂ − x₁²) + 800x₁² + 2   −400x₁ ; −400x₁   200 ) = ( 2  0 ; 0  200 ) at x₀.

Thus the curvature in the direction d is dᵀGd = (1 0) ( 2  0 ; 0  200 ) (1 0)ᵀ = 2.
These definitions of slope and curvature depend on the size of d, and this ambiguity can be resolved by requiring that ‖d‖ = 1. Hence Equation (5.9) is the directional derivative in the direction of a unit vector d, given by dᵀ∇f(x). Likewise, the curvature along the line in the direction of the unit vector is given by dᵀ∇²f(x)d.
Since x(α) = x₀ + αd, at α = 0 we have x(0) = x₀. Therefore the function value is f(x(0)) = f(x₀), the slope at α = 0 in the direction of d is f′(0) = dᵀ∇f(x₀), and the curvature at α = 0 is f″(0) = dᵀG(x₀)d.
5.3 Taylor Series for Multivariate Function
In the context of optimisation involving a smooth function f(x) the Taylor series is indispensable. Since x = x(α) = x₀ + αd for a fixed point x₀ and a given direction d, the function f(x) at x(α) becomes a function of the single variable α. Hence f(x) = f(x(α)) = f(α). Therefore, expanding the Taylor series around zero we have:

f(α) = f(0 + α) = f(0) + αf′(0) + ½α²f″(0) + · · · (5.12)

But f(α) = f(x₀ + αd) is the value of the function f(x) of many variables along the line x(α). Hence, we can re-write Equation (5.12) as:

f(x₀ + αd) = f(x₀) + αdᵀ∇f(x₀) + ½α²[dᵀ∇²f(x₀)d] + · · · (5.13)

and, in terms of x directly:

f(x) = f(x₀) + (x − x₀)ᵀ∇f(x₀) + ½(x − x₀)ᵀ∇²f(x₀)(x − x₀) + · · · (5.14)
5.4 Quadratic Forms
A quadratic form is a function A(x) = xᵀQx, with Q a symmetric matrix. The form A is said to be positive definite if A(x) > 0 for all x, with A(x) = 0 iff x = 0. The form A is said to be positive semi-definite if A(x) ≥ 0 for all x. Similar definitions apply to negative definite and negative semi-definite with the inequalities reversed.
5.4.0.1 Example
Express the quadratic form A(x) = x₁² + 5x₁x₂ + 4x₂² in matrix notation.
Solution:

A(x) = (x₁, x₂) ( 1  5/2 ; 5/2  4 ) (x₁ ; x₂).
5.5 Stationary Points
In the following chapters we will be concerned with gradient based minimisation methods; therefore we only consider the minimisation of smooth functions. We will not consider non-smooth minima, as they do not satisfy the same conditions as smooth minima. We will, however, consider the case of saddle points. Hence, we assume that the first and second derivatives exist.
At a stationary point x∗ (where ∇f(x∗) = 0), let λᵢ denote the eigenvalues of the Hessian ∇²f(x∗). Then:
• If ∇²f(x∗) is indefinite, i.e. the λᵢ are of mixed sign, then x∗ is a saddle point.
• If ∇²f(x∗) is positive definite, i.e. all λᵢ > 0, then x∗ is a minimum.
• If ∇²f(x∗) is negative definite, i.e. all λᵢ < 0, then x∗ is a maximum.
• If ∇²f(x∗) is positive semi-definite, i.e. all λᵢ ≥ 0, then x∗ lies on a half cylinder (a valley of weak minima).
(Figures: example surfaces for the Positive Definite, Negative Definite, Positive Semi-Definite and indefinite cases.)
In summary:
Let G = ∇²f(x), i.e. the Hessian.
• G(x) is positive semi-definite if xᵀGx ≥ 0, ∀x
• G(x) is negative semi-definite if xᵀGx ≤ 0, ∀x
• G(x) is positive definite iff xᵀGx > 0, ∀x ≠ 0
• G(x) is negative definite iff xᵀGx < 0, ∀x ≠ 0
• G(x) is indefinite iff xᵀGx takes both negative and positive values
and:
• f(x) is concave iff G(x) is negative semi-definite
• f(x) is strictly concave if G(x) is negative definite
• f(x) is convex iff G(x) is positive semi-definite
• f(x) is strictly convex if G(x) is positive definite
There are a number of ways to test for positive or negative definiteness, namely: computing the eigenvalues of the Hessian, or checking the signs of the determinants of its principal subminors. Both are illustrated below.
5.5.1.1.1 Example
Classify the stationary points of the function

f(x) = 2x₁² + x₁x₂² + x₂².

Solution:

∂f/∂x₁ = 4x₁ + x₂² = 0
∂f/∂x₂ = 2x₁x₂ + 2x₂ = 0

which gives x₁ = (0, 0)ᵀ, x₂ = (−1, 2)ᵀ and x₃ = (−1, −2)ᵀ. The Hessian matrix is:

G = ( 4  2x₂ ; 2x₂  2x₁ + 2 ).

Thus:

G₁ = ( 4  0 ; 0  2 )

has characteristic equation (4 − λ)(2 − λ) = 0, so λ = 4, 2 > 0: G₁ is positive definite and (0, 0)ᵀ is a minimiser.

G₂ = ( 4  4 ; 4  0 )

has eigenvalues:

λ = 2 + √20, 2 − √20,

which are of mixed sign, so (−1, 2)ᵀ is a saddle point. Similarly,

G₃ = ( 4  −4 ; −4  0 )

has the same eigenvalues λ = 2 ± √20, so (−1, −2)ᵀ is also a saddle point.
(Figures: surface plots of f(X, Y) = 2x₁² + x₁x₂² + x₂² near the stationary points.)
From the Hessian we can compute the determinants of all subminors. If these are all greater than zero, then the Hessian is positive definite. Utilising the example above, if:

G₁ = ( 4  0 ; 0  2 ),

then the first subminor is just det(4) = 4 > 0. The second and final subminor is the entire matrix, so:

det ( 4  0 ; 0  2 ) = 8 − 0 > 0.

Therefore G₁ is positive definite. G₂ and G₃ are dealt with similarly. However, to prove negative definiteness we need to show (−1)ᵏDₖ > 0, where Dₖ is the determinant of the k-th principal minor.
This approach would be preferable when dealing with the case of large matrices.
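For larger matrices it is easiest to let the computer do the work. A small sketch (function name illustrative) that classifies the Hessians from the example above by their eigenvalues:
import numpy as np

def classify(G):
    eig = np.linalg.eigvalsh(G)    # eigenvalues of a symmetric matrix
    if np.all(eig > 0):
        return 'positive definite (minimum)'
    if np.all(eig < 0):
        return 'negative definite (maximum)'
    if np.any(eig > 0) and np.any(eig < 0):
        return 'indefinite (saddle point)'
    return 'semi-definite (test inconclusive)'

for G in ([[4, 0], [0, 2]], [[4, 4], [4, 0]], [[4, -4], [-4, 0]]):
    print(G, '->', classify(np.array(G)))
## [[4, 0], [0, 2]] -> positive definite (minimum)
## [[4, 4], [4, 0]] -> indefinite (saddle point)
## [[4, -4], [-4, 0]] -> indefinite (saddle point)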
5.6 Necessary and Sufficient Conditions
Theorem 5.1 (First Order Necessary Condition (FONC) for Local Maxima/Minima). If f(x) has continuous first partial derivatives at all points of S ⊂ Rⁿ, and x∗ is an interior point of the feasible set S that is a local minimum or maximum of f(x), then:

∇f(x∗) = 0, (5.17)

that is:

∂f(x∗)/∂xᵢ = 0, i = 1, 2, . . . , n. (5.18)
Theorem 5.2 (Second Order Necessary Condition (SONC) for Local Maxima/Minima). Let f be twice continuously differentiable on the feasible set S, let x∗ be a local minimiser of f(x), and let d be a feasible direction at x∗. If dᵀ∇f(x∗) = 0, then:

dᵀ∇²f(x∗)d ≥ 0. (5.19)
Theorem 5.3 (Second Order Sufficient Condition (SOSC) for Strong Local Maxima/Minima). Let x∗ be an interior point of S. If (i) ∇f(x∗) = 0 and (ii) dᵀ∇²f(x∗)d > 0 for all d ≠ 0 (that is, the Hessian is positive definite), then x∗ is a strong local minimiser of f(x).
5.6.0.1 Example
Let f(x) = x₁² + x₂². Show that x = (0, 0)ᵀ satisfies the FONC, the SONC and the SOSC, hence (0, 0)ᵀ is a strict local minimiser. We see that ∇f(x) = (2x₁, 2x₂)ᵀ = 0 if and only if x₁ = x₂ = 0. It can also easily be shown that for all d ≠ 0, dᵀ∇²f(x)d = 2d₁² + 2d₂² > 0. Hence ∇²f(x) is positive definite.
5.6.0.2 Example

f(x₁, x₂) = x₁⁴ + x₂⁴.

∇f(x) = (4x₁³, 4x₂³)ᵀ. The only stationary point is (0, 0)ᵀ. Now the Hessian is:

∇²f = ( 12x₁²  0 ; 0  12x₂² ).

At the origin the Hessian is the zero matrix, so the test gives no prediction of the minimum, although it is easy to see that the origin is a minimum.
5.6.0.3 Example

f(x₁, x₂) = (1/2c) ( x₁²/a² − x₂²/b² ),

where a, b, and c are constants. ∇f(x) = ( x₁/(ca²), −x₂/(cb²) )ᵀ, so the only stationary point is (0, 0)ᵀ. The Hessian is:

∇²f(x) = ( 1/(ca²)  0 ; 0  −1/(cb²) ).

This is clearly indefinite and hence (0, 0)ᵀ is a saddle point.
Thus in summary, the necessary and sufficient conditions for x∗ to be a strong local minimum are:
• ∇f(x∗) = 0
• the Hessian is positive definite
5.6.1 Exercises
1. Find and classify the stationary points of:

f(x₁, x₂) = (x₁ − 1)² + (x₂ − 1)² + x₁x₂
4. Prove that for a general quadratic function f(x) = c + bᵀx + ½xᵀGx, the Hessian G of f maps differences in position into differences in gradient, i.e., g₁ − g₂ = G(x₁ − x₂).
5. For the following functions, find the points where the gradients vanish, and investigate which of these are local minima, maxima or saddle points.
   • f(x₁, x₂) = x₁(1 + x₁) + x₂(1 + x₂) − 1.
   • f(x₁, x₂) = x₁² + x₁x₂ + x₂².
6. Consider the function f : R² → R determined by

f(x) = xᵀ ( 1  2 ; 4  8 ) x + xᵀ ( 3 ; 4 ) + 6.

Show that f has an absolute minimum at each of the points (x₁, x₂) = (±2, 0). Show that the point (0, 0) is a saddle point.
8. Show that any point x∗ on the line x₂ − 2x₁ = 0 is a weak global minimiser of

f(x) = 4x₁² − 4x₁x₂ + x₂².

9. Show that

f(x) = 3x₁² − x₂² + x₁³

has a strong local maximiser at (−2, 0)ᵀ and a saddle point at (0, 0)ᵀ, but has no minimisers.
Chapter 6: Gradient Methods for Unconstrained Optimisation
In this chapter we will study methods for solving nonlinear unconstrained optimisation problems. The nonlinear minimisation algorithms described here are iterative methods which generate a sequence of points x⁰, x¹, . . ., or {xᵏ} (superscripts denoting iteration number), hopefully converging to a minimiser x∗ of f(x). Univariate minimisation along a line in a particular direction is known as the line search technique; one dimensional minimisation is the line search subproblem inside multivariable unconstrained nonlinear minimisation.
The algorithms for multivariate minimisation are all iterative processes which fit into the same general framework:
At the beginning of the k-th iteration the current estimate of minimum is f (xk ), and a search is made
in Rn from xk along a given vector direction dk (dk is different for different minimization methods) in
an attempt to find a new point xk+1 such that f (xk+1 ) is sufficiently smaller than f (xk ). This process is
called line (or linear) search.
xᵏ⁺¹ = xᵏ + αₖdᵏ. (6.1)

Therefore, for a given dᵏ, a line-search procedure is used to choose an αₖ > 0 that approximately minimises f along the ray {xᵏ + αdᵏ : α > 0}. Hence, the line search is a univariate minimisation involving the single variable αₖ (since both xᵏ and dᵏ are known, f(xᵏ + αₖdᵏ) becomes a function of αₖ only) such that:

f(xᵏ + αₖdᵏ) = min_{α>0} f(xᵏ + αdᵏ). (6.2)

Bear in mind that this single variable minimiser cannot always be obtained analytically, and hence some numerical techniques may be necessary.
The challenges in finding a good αₖ lie in avoiding a step length that is either too long or too short. Consider the Figures below:
Here the objective function is f(x) = x² and the iterates xₖ₊₁ = xₖ + αₖdₖ are generated by the descent directions dₖ = (−1)ᵏ⁺¹ with steps αₖ = 2 + 3/2ᵏ⁺¹, with an initial starting point of x₀ = 2 (steps too long).
Here the objective function is f(x) = x² and the iterates xₖ₊₁ = xₖ + αₖdₖ are generated by the descent directions dₖ = −1 with steps αₖ = 1/2ᵏ⁺¹, with an initial starting point of x₀ = 2 (steps too short).
(Figure: Varying Alpha — oops, keeping to the notes' register: the iterates of f(x) = x² for the two step-length choices above.)
Given the direction dᵏ and the point xᵏ, f(xᵏ + αdᵏ) becomes a function of α. Hence it is simply a one dimensional minimisation with respect to α. The solution of df(α)/dα = 0 determines the exact location of the minimiser αₖ. However, it may not be possible to locate exactly the αₖ for which df(α)/dα = 0; it may even require a very large number of iterations to locate the minimiser αₖ. Nonetheless, the idea is conceptually useful. Notice that for an exact line search the slope df/dα at αₖ must be zero. Therefore, we get:

df(xᵏ⁺¹)/dα = ∇f(xᵏ⁺¹)ᵀ dxᵏ⁺¹/dα = g(xᵏ⁺¹)ᵀdᵏ = 0. (6.4)
Line search algorithms used in practice are much more involved than the one dimensional search methods (optimisation methods) presented in the previous chapter. The reason for this stems from several practical considerations. First, determining the value of αₖ that exactly minimises f(α) may be computationally demanding; even worse, the minimiser of f(α) may not even exist. Second, practical experience suggests that it is better to allocate more computational time to iterating the optimisation algorithm than to performing exact line searches. These considerations led to the development of conditions for terminating line search algorithms that result in low-accuracy line searches while still securing a decrease in the value of f from one iteration to the next.
In practice, the line search is terminated when some descent conditions along the line xk + αdk are satisfied. Hence,
it is no longer necessary to go for the exact line search. The line search carried out in this way is known as the
inexact line search. A further justification for the inexact line search is that it is not efficient to determine the line
search minima to a high accuracy when xk is far from the minimiser x∗ . Under these circumstances, nonlinear
minimisation algorithms employ an inexact or approximate line search. To sum up, exact line search relates to
theoretical concept and the inexact is its practical implementation.
Remark:
Each iteration of a line search method computes a search direction dᵏ and then decides how far to move along that direction. The iteration is given by

xᵏ⁺¹ = xᵏ + αₖdᵏ,

where the positive scalar αₖ is called the step length. The success of a line search method depends on effective choices of both the direction dᵏ and the step length αₖ. Most line search algorithms require dᵏ to be a descent direction.
The typical behaviour of a minimisation algorithm is that it repeatedly generates points xᵏ such that, as k increases, xᵏ moves closer to x∗. A feature of a minimisation algorithm is that f(xᵏ) is reduced on each iteration, which implies that the stationary point found turns out to be a local minimiser. A minimisation algorithm requires an initial estimate, say x⁰, to be supplied. At each iteration the algorithm finds a descent direction along which the function is minimised; this minimisation in a particular direction is the line search. The basic structure of the general algorithm (sketched in code after this list) is:
1. Initialise the algorithm with estimate xᵏ. Initialise k = 0.
2. Determine a search direction dᵏ at xᵏ.
3. Find αₖ to minimise f(xᵏ + αdᵏ) with respect to α.
4. Set xᵏ⁺¹ = xᵏ + αₖdᵏ.
5. The line search is stopped when f(xᵏ⁺¹) < f(xᵏ).
6. If the algorithm meets the stopping criteria then STOP, ELSE set k = k + 1 and go back to (2).
Different minimisation methods select dᵏ in different ways in step (2). Steps (3) and (4) are the one dimensional subproblem carried out along the line xᵏ⁺¹ = xᵏ + αₖdᵏ. The direction dᵏ at xᵏ must satisfy the descent condition.
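A generic sketch of this framework (the callables direction_fn and line_search are placeholders for whichever method is chosen; all names here are illustrative):
import numpy as np

def descent(f, grad_f, x0, direction_fn, line_search, tol=1e-6, max_iter=100):
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        d = direction_fn(x)                       # step (2): choose a search direction
        alpha = line_search(f, grad_f, x, d)      # steps (3)-(4): one dimensional subproblem
        x = x + alpha * d
        if np.linalg.norm(grad_f(x)) < tol:       # step (6): stopping criterion
            break
    return x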
6.3 The Descent Condition
Central to the development of gradient based minimisation methods is the idea of a descent direction. Conditions for the descent direction can be obtained using the Taylor series around the point xᵏ. Using two terms of the Taylor series we have:

f(xᵏ + αdᵏ) − f(xᵏ) = αdᵏᵀ∇f(xᵏ) + · · · (6.5)

Clearly the descent condition can then be seen to be:

dᵏᵀ∇f(xᵏ) < 0. (6.6)
A simple line search descent method is the steepest descent method. From (6.5), the first order change in f along dᵏ is:

f(xᵏ + αₖdᵏ) − f(xᵏ) ≈ αₖdᵏᵀgᵏ = αₖ‖dᵏ‖‖gᵏ‖ cos θ, (6.9)

where θ can be interpreted geometrically as the angle between dᵏ and gᵏ. If we allow θ to vary, holding αₖ, ‖dᵏ‖ and ‖gᵏ‖ constant, then the right hand side of Equation (6.9) is most negative when θ = π. Thus when αₖ is sufficiently small, the greatest reduction in the function is obtained in the direction:

dᵏ = −gᵏ. (6.10)
This negative gradient direction (6.10), which satisfies the descent condition (6.6), gives rise to the method of steepest descent. Here the search direction is taken as the negative gradient and the step size αₖ is chosen to achieve the maximum decrease in the objective function f at each step. Specifically we solve the problem:

Minimise f( x⁽ᵏ⁾ − α∇f(x⁽ᵏ⁾) ) w.r.t. α. (6.11)

In practice the algorithm is terminated when some convergence criterion is satisfied. Usually termination is enforced at iteration k if one, or a combination, of the following is met:
• ‖xᵏ⁺¹ − xᵏ‖ < ε₁
• ‖∇f(xᵏ)‖ < ε₂
6.5.2.1 Example
Consider f(x) = 2x₁² + 3x₂², where x⁰ = (1, 1)ᵀ. Use two iterations of Steepest Descent.
Solution:
Compute ∇f(x) = (4x₁, 6x₂)ᵀ = −g.
First Iteration:
We know that:

x¹ = x⁰ + α₀g(x⁰),

so:

x¹ = (1, 1)ᵀ − (4α, 6α)ᵀ = (1 − 4α, 1 − 6α)ᵀ.

Therefore:

f(x¹) = 2(1 − 4α)² + 3(1 − 6α)², and df/dα = 0 gives 280α = 52, i.e. α₀ = 13/70.

Finally:

x¹ = (1 − 4 · 13/70, 1 − 6 · 13/70)ᵀ = (9/35, −4/35)ᵀ.
Second Iteration:
We have:

x² = x¹ + α₁g(x¹).

Compute (simplified here):

f(x¹ + α₁g(x¹)) = (1/35²) [ 2(9 − 36α)² + 3(−4 + 24α)² ].

Setting the derivative with respect to α to zero, we get:

60α = 13 ⇒ α₁ = 13/60.

Therefore:

x² = x¹ + α₁g(x¹) = (9/35, −4/35)ᵀ + (13/60)(−36/35, 24/35)ᵀ = (6/175, 6/175)ᵀ.
The process continues in the same manner. We can see by inspection that the function should achieve its minimum at (0, 0)ᵀ; the Python code below provides a sanity check.
It is also worth noting that since this is a quadratic function, we can actually use another technique. We will redo the first iteration as illustration. Specifically, quadratic functions allow α to be solved for directly using:

αₖ = − gₖᵀdₖ / (dₖᵀQdₖ).
Thus:
First Iteration:
Compute f(x⁰) = 5, g(x⁰)ᵀ = (4, 6) and Q = ( 4  0 ; 0  6 ).
Therefore:

α₀ = − (g⁰)ᵀd⁰ / ((d⁰)ᵀQd⁰) = 52/280 = 13/70.

Thus:

x¹ = (1, 1) − (13/70)(4, 6) = (9/35, −4/35).

Similarly, the process repeats.
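As a sanity check, here is a short numpy sketch of steepest descent with the exact quadratic step, reproducing the two iterations above (names illustrative):
import numpy as np

Q = np.array([[4.0, 0.0], [0.0, 6.0]])     # Hessian of f(x) = 2*x1**2 + 3*x2**2

x = np.array([1.0, 1.0])
for k in range(2):
    g = Q @ x                               # gradient of f at x
    alpha = (g @ g) / (g @ (Q @ g))         # exact minimising step for a quadratic
    x = x - alpha * g
    print(k + 1, x)
## 1 [ 0.25714286 -0.11428571]
## 2 [0.03428571 0.03428571]
These are 9/35, −4/35 and 6/175, 6/175 respectively, as computed by hand above.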
(Figures: surface and contour plots of f(x) = 2x₁² + 3x₂².)
6.5.3 Inexact Line Search
Although you will only cover inexact line search techniques in the third year syllabus, we will quickly introduce a very simple inexact technique to use for the purpose of your labs. Starting from an initial step t (and a shrink factor β ∈ (0, 1)), while:

f(x − t∇f(x)) > f(x) − (t/2)‖∇f(x)‖²,

update t = βt.
This is a simple technique and tends to work quite well in practice. For further reading you can consult Convex
Optimisation by Boyd.
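A minimal sketch of this backtracking rule (the starting step t₀ = 1 and β = 0.5 are illustrative choices):
import numpy as np

def backtracking(f, grad_f, x, t0=1.0, beta=0.5):
    t, g = t0, grad_f(x)
    # shrink t until the sufficient-decrease condition above holds
    while f(x - t * g) > f(x) - (t / 2) * np.dot(g, g):
        t = beta * t
    return t

f = lambda x: 2*x[0]**2 + 3*x[1]**2
grad_f = lambda x: np.array([4*x[0], 6*x[1]])
print(backtracking(f, grad_f, np.array([1.0, 1.0])))
## 0.125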
(Figure: surface plot of f(x) = 2x₁² + 3x₂².)
6.5.4 Exercises
2. Show that if the quadratic function

f(x) = ½xᵀQx + bᵀx + c

is minimised exactly along each search direction using the iteration

xᵏ⁺¹ = xᵏ + αₖdᵏ,

then:

αₖ = − g⁽ᵏ⁾ᵀdᵏ / (d⁽ᵏ⁾ᵀQdᵏ).

3. Compute the first two iterations of the method of steepest descent applied to the objective function

f(x) = 4x₁² + x₂² − x₁²x₂.

4. f(x) = 3x₁² + 2x₂²
6.6 The Gradient Descent Algorithm and Machine Learning
We will briefly look at what we have learnt from the machine learning perspective, to emphasise the power of this chapter. In machine learning, you will find the gradient descent algorithm everywhere. While the literature may seem to allude to this method being new, powerful and cool, it is really nothing more than the method of steepest descent introduced above.
(Figure: plot of the objective function, showing a local minimum near x = 1.3 to 1.4.)
So from the above plot we can see that there is a local minimum somewhere around 1.3 to 1.4 on the x-axis. Of course, we normally won't be afforded the luxury of such information a priori, so let's just assume we arbitrarily set our starting point to be x₀ = 2. Implementing gradient descent with a fixed step size, or learning rate (in the context of ML), we have:
f = lambda x: x**3 - 2*x**2 + 2  # assumed objective; consistent with f_prime below
x_old = 0
x_new = 2 # The algorithm starts at x=2
n_k = 0.1 # step size fixed at 0.1
precision = 0.0001 # tolerance value
x_list, y_list = [x_new], [f(x_new)]
# returns the value of the derivative of our function
def f_prime(x):
    return 3*x**2-4*x
while abs(x_new - x_old) > precision:   # fixed-step gradient descent loop
    x_old = x_new
    x_new = x_old - n_k * f_prime(x_old)
    x_list.append(x_new); y_list.append(f(x_new))
print("Number of steps:", len(x_list) - 1)
## Number of steps: 17
How did the algorithm look step by step?
In our implementation above we had a fixed step size nₖ; in machine learning, this is called the learning rate. You'll notice this is contrary to the algorithm in the aforementioned pseudocode. The assumption of a fixed learning rate made the implementation easier, but could produce the issues mentioned at the beginning of the chapter. One means of overcoming this is to use adaptive step sizes. This can be done using scipy's fmin function to find the optimal step size at each iteration.
from scipy import stats
from scipy.optimize import fmin
# we setup this function to pass into the fmin algorithm
def f2(n, x, s):
    x = x + n*s
    return f(x)
x_old = 0
x_new = 2 # The algorithm starts at x=2
precision = 0.0001
num_steps = 0
while abs(x_new - x_old) > precision:   # descent loop with an adaptive step size
    x_old = x_new
    s_k = -f_prime(x_old)                           # descent direction
    n_k = fmin(f2, 0.1, (x_old, s_k), disp=0)[0]    # best step along s_k
    x_new = x_old + n_k * s_k
    num_steps += 1
print("Number of steps:", num_steps)
## Number of steps: 4
So we can see that using adaptive step sizes, we've reduced the number of iterations to convergence from 17 to 4. This is a substantial reduction; however, it must be noted that it takes time to compute the appropriate step size at each iteration. This highlights a major issue in decision making for optimisation: trying to find the balance between speed and accuracy.
How did the modified algorithm look step by step?
Well, we can see that it converges rapidly, and after the first two iterations we need to zoom in to see further improvements.
(Figures: gradient descent; zoomed in; zoomed in more — the iterates near the minimiser x ≈ 1.333.)
Instead of using computational resources to find an optimal step size at each iteration, we could apply a dampening factor at each step to reduce the step size over time. For example:

η(t + 1) = η(t) / (1 + t × d).
x_old = 0
x_new = 2 # The algorithm starts at x=2
n_k = 0.17 # step size
precision = 0.0001
t, d = 0, 1
x_list, y_list = [x_new], [f(x_new)]
while abs(x_new - x_old) > precision:   # descent loop, dampening n_k each step
    x_old = x_new
    x_new = x_old - n_k * f_prime(x_old)
    n_k, t = n_k / (1 + t * d), t + 1
    x_list.append(x_new); y_list.append(f(x_new))
print("Number of steps:", len(x_list) - 1)
## Number of steps: 6
We can now see that we've still reduced the number of iterations required substantially, but are not bound to finding an optimal step size at each iteration. This highlights the trade-off of finding cheap improvements that aid convergence at minimal cost.
While using these line methods to find the minima of basic functions is interesting, one might wonder how this relates to some of the regressions we are interested in performing. Let us consider a slightly more complicated example. In this data set, we have data relating to how temperature affects the noise produced by crickets. Specifically, the data is a number of observations or samples of cricket chirp rates at various temperatures.
(Figure: scatter plot of temperature in degrees Fahrenheit against chirps/sec for striped ground crickets.)
What can we deduce from the plotted data?
We can see that the data set exhibits a linear relationship. Therefore, our aim is to find the equation of the straight line given by:

h_θ(x) = θ₀ + θ₁x,

that best fits all of our data points, i.e. minimises the residual error.
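The data values themselves are not reprinted in the notes; the arrays below are assumed to be the classic striped ground cricket observations, chosen because they reproduce the regression outputs quoted later in this section:
import numpy as np

# assumed data: (chirps/sec, temperature in degrees Fahrenheit), 15 samples
x = np.array([20.0, 16.0, 19.8, 18.4, 17.1, 15.5, 14.7, 17.1, 15.4, 16.2,
              15.0, 17.2, 16.0, 17.0, 14.4])
y = np.array([88.6, 71.6, 93.3, 84.3, 80.6, 75.2, 69.7, 82.0, 69.4, 83.3,
              79.6, 82.6, 80.6, 83.5, 76.3])
m = len(x)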
# the hypothesis h and the gradient of the cost J; the def lines are inferred from the calls below
def h(theta_0, theta_1, x):
    return theta_0 + theta_1*x

def grad_J(x, y, m, theta_0, theta_1):
    returnValue = np.array([0., 0.])
    for i in range(m):
        returnValue[0] += (h(theta_0, theta_1, x[i]) - y[i])
        returnValue[1] += (h(theta_0, theta_1, x[i]) - y[i])*x[i]
    returnValue = returnValue/(m)
    return returnValue
import time
start = time.time()
theta_old = np.array([0.,0.])
theta_new = np.array([1.,1.]) # The algorithm starts at [1,1]
n_k = 0.001 # step size
precision = 0.001
num_steps = 0
s_k = float("inf")
while np.linalg.norm(s_k) > precision:
num_steps += 1
theta_old = theta_new
s_k = -grad_J(x,y,m,theta_old[0],theta_old[1])
theta_new = theta_old + n_k * s_k
print("Local minimum occurs where:")
## theta_0 = 25.128552558595363
print("theta_1 =", theta_new[1])
## theta_1 = 3.297264756251897
print("This took",num_steps,"steps to converge")
## 19.70560359954834seconds
It's clear that the algorithm takes quite a long time for such a trivial example. Let's check whether the values we've obtained from the gradient descent are any good. We can get the true values for θ₀ and θ₁ with the following:
from scipy import stats as sp
start = time.time()
actualvalues = sp.stats.linregress(x,y)
print("Actual values for theta are:")
## theta_0 = 25.232304983426026
print("theta_1 =", actualvalues.slope)
## theta_1 = 3.2910945679475647
end = time.time()
print(str(end - start) + 'seconds')
## 0.009187698364257812seconds
One thing this highlights is how much effort goes into optimising the functions found in these libraries. If one looks
at the code inside linregress, clever exploitations to speed up the computation can be found.
Now, let’s plot our obtained results on the original data set:
(Figure: the fitted regression line over the scatter of chirps/sec against temperature in degrees Fahrenheit.)
So in our implementation above, we needed to compute the full gradient at each step. While this might not seem important, it is! In this toy example we only have 15 data points; however, imagine the computational intractability when millions of data points are involved.
What we implemented above is often called Vanilla/Batch gradient descent. As pointed out, this implementation means that we need to sum the cost of each sample in order to calculate the gradient of the cost function. This means that given 3 million samples, we would have to loop through 3 million times!
So to move a single step towards the minimum, one would need to calculate each cost 3 million times.
So what can we do to overcome this? Well, we can use stochastic gradient descent. In this idea, we use the cost gradient of one sample at each iteration, rather than the sum of the cost gradients of all samples. So recall our gradient equations from above:

∂J(θ₀, θ₁)/∂θ₀ = (1/m) Σᵢ₌₁ᵐ (h_θ(xᵢ) − yᵢ),

∂J(θ₀, θ₁)/∂θ₁ = (1/m) Σᵢ₌₁ᵐ ((h_θ(xᵢ) − yᵢ) · xᵢ),

where:

h_θ(x) = θ₀ + θ₁x.
We now want to update our values for each item in the training set, instead of all of them, so that we can begin improving straight away.
We can redefine our algorithm as stochastic gradient descent for simple linear regression as follows:

Randomly shuffle the data set
for k = 0, 1, 2, . . . do
    for i = 1 to m do
        (θ₀, θ₁) ← (θ₀, θ₁) − α ( 2(h_θ(xᵢ) − yᵢ), 2xᵢ(h_θ(xᵢ) − yᵢ) )
    end for
end for
Depending on the size of the data set, we run through the entire data set 1 to k times.
The key advantage here is that, unlike batch gradient descent where we have to go through the entire data set before making any progress, we can now make progress straight away as we move through the data set. This is the primary reason why stochastic gradient descent is used when dealing with large data sets.
Let us look at another example using stochastic gradient descent for linear regression. We can create a set of 500 000 data points around the equation y = 2x + 17 + ε on the domain x ∈ [0, 100]. A small sketch of this is given below.
Show in jupyter notebook - example breaks rstudio!
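Since the full example lives in the notebook, here is a minimal, scaled-down sketch of the stochastic update above. Assumptions on my choices: only 50 000 points to keep the sketch quick, inputs scaled to [0, 1] so that a single learning rate is stable, and illustrative values for the learning rate and number of passes:
import numpy as np

rng = np.random.default_rng(0)
n = 50_000                        # smaller than the 500 000 in the text
x = rng.uniform(0, 100, n)
y = 2 * x + 17 + rng.normal(0, 1, n)

xs = x / 100.0                    # scale inputs to [0, 1] for stability
theta = np.zeros(2)               # [theta_0, theta_1] for the scaled model
alpha = 0.05                      # illustrative learning rate
for epoch in range(5):            # a few passes over the shuffled data
    for i in rng.permutation(n):
        err = theta[0] + theta[1] * xs[i] - y[i]
        theta -= alpha * np.array([2 * err, 2 * err * xs[i]])

print(theta[0], theta[1] / 100)   # roughly 17 and 2 after unscaling the slope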
Chapter 7: Newton and Quasi-Newton Methods
The steepest descent method uses information based only on the first partial derivatives in selecting a suitable
search direction. This strategy is not always the most effective. A faster method may be obtained by approximating
the objective function f (x) as a quadratic q(x) and making use of a knowledge of the second partial derivatives. This
is the basis of Newton’s method. The idea behind this method is as follows. Given a starting point, we construct a
quadratic approximation to the objective function that matches the first and the second derivative of the original
objective function at that point. We then minimise the approximate (quadratic) function instead of the original
objective function. We then use the minimiser of the quadratic function to obtain the next iterate and repeat the
procedure iteratively. If the objective function is quadratic, then the approximation is exact and the method yields the true minimiser in one step. If, on the other hand, the objective function is not quadratic, then the approximation will provide only an estimate of the position of the true minimiser.
We can obtain a quadratic approximation to the given twice continuously differentiable objective function using the Taylor series expansion of f about the current x⁽ᵏ⁾, neglecting terms of order three and higher:

f(x) ≈ f(x⁽ᵏ⁾) + (x − x⁽ᵏ⁾)ᵀg⁽ᵏ⁾ + ½(x − x⁽ᵏ⁾)ᵀH(x⁽ᵏ⁾)(x − x⁽ᵏ⁾) = q(x),

where g = ∇f and H is the Hessian matrix. The minimum of the quadratic q(x) satisfies:

∇q(x) = g⁽ᵏ⁾ + H(x⁽ᵏ⁾)(x − x⁽ᵏ⁾) = 0,

or, inverting:

x = x⁽ᵏ⁾ − H⁻¹(x⁽ᵏ⁾)g⁽ᵏ⁾.

Newton's formula is therefore:

x⁽ᵏ⁺¹⁾ = x⁽ᵏ⁾ − H⁻¹(x⁽ᵏ⁾)g⁽ᵏ⁾. (7.1)

This can be rewritten as:

H⁽ᵏ⁾d⁽ᵏ⁾ = −g⁽ᵏ⁾, (7.2)

where d⁽ᵏ⁾ = x⁽ᵏ⁺¹⁾ − x⁽ᵏ⁾.
Note: to solve g(x) = 0 in one dimension we iterate xₖ₊₁ = xₖ − g(xₖ)/g′(xₖ); the above formula is the multidimensional extension of Newton's method.
The method requires that f_k, g_k and H_k, i.e. the function value, the gradient and the Hessian, be made available at each iterate x_k. Most importantly, the Newton method is only well defined if the Hessian H_k is positive definite, because only then does q(x) have a unique minimiser. In general, positive definiteness of the Hessian can only be guaranteed if the starting iterate x_0 is very near the minimiser x* of f(x).
The Newton method converges quickly when it is applied close to the minimiser. If the starting (initial) point is far from the minimiser, the algorithm may not converge.
7.0.0.1 Example
Apply Newton's method to the Rosenbrock function:
\[
f(x) = 100\,(x_2 - x_1^2)^2 + (1 - x_1)^2.
\]
Let us take x^0 = (0, 0)^T. The gradient vector and the Hessian are respectively given by:
\[
g = \nabla f(x) = \begin{pmatrix} -400x_1(x_2 - x_1^2) - 2(1 - x_1) \\ 200(x_2 - x_1^2) \end{pmatrix},
\]
and:
\[
H(x) = \begin{pmatrix} 800x_1^2 - 400(x_2 - x_1^2) + 2 & -400x_1 \\ -400x_1 & 200 \end{pmatrix}.
\]
So substituting x^0 gives:
\[
g^0 = (-2, 0)^T; \qquad H^0 = \begin{pmatrix} 2 & 0 \\ 0 & 200 \end{pmatrix}.
\]
Now using
\[
H^0 d^0 = -g^0
\]
(recall that H^k d^k = −g^k), we obtain:
\[
d^0 = -(H^0)^{-1} g^0 = -\begin{pmatrix} 1/2 & 0 \\ 0 & 1/200 \end{pmatrix}\begin{pmatrix} -2 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}.
\]
Recall:
\[
d^k = x^{k+1} - x^k \;\Rightarrow\; d^0 = x^1 - x^0 \;\Rightarrow\; x^1 = d^0 + x^0.
\]
Thus:
\[
x^1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}.
\]
Note that f(x^0) = 1 while f(x^1) = 100, so this Newton step actually increases the function value.
[Figure: surface plot of the Rosenbrock function f(X, Y) over X, Y ∈ [−1, 1].]
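This single Newton step is easy to verify numerically; the following is a quick numpy sketch using the gradient and Hessian derived above:

import numpy as np

f = lambda x: 100*(x[1] - x[0]**2)**2 + (1 - x[0])**2

def grad(x):
    # Gradient of the Rosenbrock function, as derived above
    return np.array([-400*x[0]*(x[1] - x[0]**2) - 2*(1 - x[0]),
                     200*(x[1] - x[0]**2)])

def hess(x):
    # Hessian of the Rosenbrock function, as derived above
    return np.array([[800*x[0]**2 - 400*(x[1] - x[0]**2) + 2, -400*x[0]],
                     [-400*x[0], 200.0]])

x0 = np.array([0.0, 0.0])
d0 = np.linalg.solve(hess(x0), -grad(x0))  # solve H^0 d^0 = -g^0
x1 = x0 + d0
print(d0, x1)        # d0 = [1, 0], hence x1 = [1, 0]
print(f(x0), f(x1))  # f rises from 1.0 to 100.0: the step overshoots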
The step length parameter α_k modifies the step taken in the search direction, usually chosen so as to minimise f(x^(k+1)). Newton's method applied without this modification does not necessarily produce a decrease in f(x^(k+1)), as demonstrated by the above example.
To address the drawbacks of the Newton method, a line search is introduced in which f_{k+1} < f_k is sought. As with the other gradient-based methods, the new iterate x_{k+1} is found by minimising f along the search direction d_k such that:
x_{k+1} = x_k + α_k d_k.
Although the Newton method without this modification may generate points where the function increases (see the example above), the directions generated by the Newton method are initially downhill provided H_k is positive definite.
Remarks:
• Newton's method moves in a descent direction provided we do not step too far, but sometimes it over-steps the mark and fails to decrease the function.
• The drawback of the method is that computing H^{-1} (or solving the linear system (7.2)) at every iteration can be computationally expensive.
If f(x) = (1/2) x^T Q x + x^T b + c is a quadratic function with positive definite symmetric Q, then Newton's method reaches the minimum in one step, irrespective of the initial starting point.
Proof:
The gradient vector is g(x) = ∇f(x) = Qx + b, and the Hessian H(x) = Q is constant. Hence, given x^(0), a single Newton step yields:
\[
x^{(1)} = x^{(0)} - Q^{-1}\bigl(Q x^{(0)} + b\bigr) = -Q^{-1} b,
\]
which is precisely the point where ∇f vanishes, i.e. the unique minimiser x*.
The result also holds if Q is negative definite, in which case x* is a strong local maximum, or if Q is symmetric indefinite, in which case x* is a saddle point.
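A numerical illustration of this one-step property (a sketch; the particular Q, b and starting point are arbitrary choices):

import numpy as np

Q = np.array([[4.0, 1.0], [1.0, 3.0]])    # symmetric positive definite
b = np.array([1.0, 2.0])

# f(x) = (1/2) x^T Q x + x^T b, so g(x) = Qx + b and H(x) = Q
x0 = np.array([10.0, -7.0])               # arbitrary starting point
x1 = x0 - np.linalg.solve(Q, Q @ x0 + b)  # one Newton step

print(x1)                      # equals the exact minimiser ...
print(np.linalg.solve(Q, -b))  # ... x* = -Q^{-1} b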
The basic Newton method as it stands is not suitable for a general-purpose algorithm, since H_k may not be positive definite when x_k is remote from the solution. Furthermore, as we have shown in the previous example, even if H_k is positive definite, convergence may not occur. To address these issues, quasi-Newton algorithms were developed.
7.3 Quasi-Newton Methods
We start by describing the drawbacks of the Newton method. At each iteration (say, the k-th) of Newton's method a new matrix H_k has to be calculated (even if the method uses a line search), and then either the inverse of this matrix has to be found or a system of equations has to be solved before the new point x^(k+1) is obtained from x^(k+1) = x^(k) + d^(k). Quasi-Newton methods avoid the calculation of a new matrix at each iteration: they only update the (positive definite) matrix of the previous iteration, and the updated matrix remains positive definite. Nor do they need to solve a system of equations; the search direction is found directly from the positive definite matrix, and the step length by a line search.
The introduction of quasi-Newton methods greatly increased the range of problems which could be solved. This type of method is like the Newton method with line search, except that at each iteration H_k^{-1} is approximated by a symmetric positive definite matrix G_k, which is updated from iteration to iteration. Thus the k-th iteration has the basic structure:
1. Set d_k = −G_k g_k.
2. Perform a line search along d_k, giving x_{k+1} = x_k + α_k d_k.
3. Update G_k, giving G_{k+1}.
The initial positive definite matrix is chosen as G_0 = I. Potential advantages of the method (compared with Newton's method) are:
• Only first derivatives are required (Newton's method requires second derivatives).
• G_k positive definite implies the descent property (in Newton's method H_k may be indefinite).
Much of the interest lies in the updating formula which enables G_{k+1} to be calculated from G_k. We know that for any quadratic function:
\[
q(x) = \frac{1}{2} x^T H x + b^T x + c,
\]
where H, b and c are constant and H is symmetric, the Hessian maps differences in position into differences in gradient, i.e.:
\[
g_{k+1} - g_k = H(x_{k+1} - x_k). \qquad (7.3)
\]
The above property says that changes in the gradient g (= ∇f(x)) provide information about the second derivative of q(x) along (x_{k+1} − x_k). In the quasi-Newton methods, at x_k we have information about the direction d_k, the matrix G_k and the gradient g_k. We can use this information to perform a line search to obtain x_{k+1} and g_{k+1}. We now need to calculate G_{k+1} (the approximate inverse of H_{k+1}) using the above information. At this point we impose the condition given by Equation (7.3) on the non-quadratic function f. In other words, we impose that changes in the gradient provide information about the second derivative of f along the search direction d_k. Hence, we have:
\[
H_{k+1}^{-1}(g_{k+1} - g_k) = (x_{k+1} - x_k), \qquad (7.4)
\]
\[
G_{k+1}\,\gamma_k = \delta_k, \qquad (7.5)
\]
where G_{k+1} ≈ H_{k+1}^{-1}, δ_k = (x_{k+1} − x_k) and γ_k = (g_{k+1} − g_k). This is known as the quasi-Newton condition, and for a quasi-Newton algorithm the update from G_k to G_{k+1} must satisfy Equation (7.5).
Methods differ in the way they update the matrix G_k; essentially, they are classified according to rank-one and rank-two updating formulae.
This formula was first suggested as part of a method due to Davidon (1959), and later also presented by Fletcher and Powell (1963). The quasi-Newton method which goes with this updating formula is known as the DFP (Davidon, Fletcher and Powell) method; it is also known as the variable metric algorithm. The DFP algorithm preserves the positive definiteness of G_k, but can sometimes give trouble when G_k becomes nearly singular. A modification (known as BFGS) introduced in 1970 can cure this problem. The algorithm for the DFP method is given below:
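The original algorithm listing is not reproduced in these notes; the following is a minimal numpy sketch of the DFP iteration, using the standard rank-two update G_{k+1} = G_k + δδᵀ/(δᵀγ) − (G_kγ)(G_kγ)ᵀ/(γᵀG_kγ). The backtracking line search is an assumption here (an exact line search could be substituted):

import numpy as np

def dfp(f, grad, x0, tol=1e-6, max_iter=200):
    # Approximate inverse Hessian starts as the identity, G_0 = I
    x = np.asarray(x0, dtype=float)
    G = np.eye(len(x))
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        d = -G @ g                              # step 1: d_k = -G_k g_k
        alpha = 1.0                             # step 2: backtracking line search
        while f(x + alpha * d) > f(x) + 1e-4 * alpha * (g @ d):
            alpha *= 0.5
        x_new = x + alpha * d
        g_new = grad(x_new)
        delta, gamma = x_new - x, g_new - g     # delta_k and gamma_k
        # Step 3: rank-two DFP update; note that G_{k+1} gamma_k = delta_k holds
        Gg = G @ gamma
        G = G + np.outer(delta, delta) / (delta @ gamma) \
              - np.outer(Gg, Gg) / (gamma @ Gg)
        x, g = x_new, g_new
    return x

# e.g. the function from Exercise 4 below:
f = lambda x: 4*x[0]**2 - 4*x[0]*x[1] + 3*x[1]**2 + x[0]
grad = lambda x: np.array([8*x[0] - 4*x[1] + 1, -4*x[0] + 6*x[1]])
print(dfp(f, grad, [0.0, 0.0]))   # converges to about (-0.1875, -0.125)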
7.3.2 Exercises
1. Consider the function
\[
f(x) = x_1^4 - 3x_1x_2 + (x_2 + 2)^2,
\]
starting at the point x_0 = [0, 0]^T, and show that the function value at x_0 cannot be improved by searching in the Newton direction.
3. Minimise the function
\[
f(x) = \frac{1}{2}x^2 - \sin(x)
\]
using Newton's method. The initial value is x_0 = 0.5. The required accuracy is ε = 10^{-5}, in the sense that you stop when |x_{k+1} − x_k| < ε.
4. Using the DFP method, find the minimum of the following function:
\[
f(x) = 4x_1^2 - 4x_1x_2 + 3x_2^2 + x_1.
\]
Chapter 8
Direct Search Methods for Unconstrained Optimisation
Direct search methods, unlike the descent methods discussed in earlier chapters, do not require the derivatives of the function. Direct search methods require only objective function values when finding minima, and are often known as zeroth-order methods, since they use only the zeroth-order "derivative" of the function, i.e. the function value itself. We will consider two direct methods in this course, namely the Random Walk Method and the Downhill Simplex Method.
8.1 Random Walk Method
The random walk method is based on generating a sequence of improved approximations to the minimum, each derived from the previous approximation. If x_i is the approximation obtained at the end of the (i − 1)-th iteration, the next approximation is given by the relation:
\[
x_{i+1} = x_i + \lambda u_i,
\]
where λ is a scalar step length and u_i is a random unit vector generated at the i-th stage.
We can describe the algorithm as follows:
1. Start with an initial point x_1, a sufficiently large initial step length λ, a minimum allowable step length ε, and a maximum permissible number of iterations N.
2. Find the function value f_1 = f(x_1).
3. Set the iteration number, i, to 1.
4. Generate a set of n random numbers, r_1, …, r_n, each lying in the interval [−1, 1], and formulate the unit vector u as:
\[
u = \frac{1}{(r_1^2 + r_2^2 + \dots + r_n^2)^{1/2}} \begin{pmatrix} r_1 \\ r_2 \\ \vdots \\ r_n \end{pmatrix}.
\]
To avoid bias in the directions generated, we accept the vector only if its length R = (r_1^2 + r_2^2 + ⋯ + r_n^2)^{1/2} satisfies R ≤ 1.
5. Compute the new vector x = x_1 + λu and the corresponding function value f = f(x).
6. If f < f_1, then set the new values x_1 = x and f_1 = f, and go to step 3; else continue to step 7.
7. If i ≤ N, set the iteration number to i + 1 and go to step 4. Otherwise, if i > N, go to step 8.
8. Compute a new, reduced step length as λ = λ/2. If the new step length is smaller than or equal to ε, go to step 9; else go to step 4.
9. Stop the procedure, taking x_opt = x_1 and f_opt = f_1.
8.1.0.1 Example
Minimise f(x_1, x_2) = x_1 − x_2 + 2x_1^2 + 2x_1x_2 + x_2^2 using the random walk method. Begin with the initial point x_0 = [0, 0]^T and a starting step length of λ = 1. Use ε = 0.05 and iteration limit N = 100.
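The helper random_walk used in the code below is not reproduced in these notes; a minimal sketch consistent with steps 1–9 above might be (the random seed and the resetting of the counter after an improvement are implementation choices):

import numpy as np

def random_walk(f, x0, lam, eps, n, N):
    rng = np.random.default_rng(0)
    x1 = np.asarray(x0, dtype=float)       # steps 1-2: initial point and value
    f1 = f(*x1)
    while lam > eps:
        i = 1                              # step 3
        while i <= N:                      # step 7
            r = rng.uniform(-1, 1, n)      # step 4: n random numbers in [-1, 1]
            R = np.sqrt(r @ r)
            if R > 1:                      # reject to avoid directional bias
                continue
            x = x1 + lam * (r / R)         # step 5: move along the unit vector
            fx = f(*x)
            if fx < f1:                    # step 6: accept the improvement
                x1, f1, i = x, fx, 1
            else:
                i += 1
        lam /= 2                           # step 8: halve the step length
    return x1, f1                          # step 9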
# Objective and parameters for the example above
f = lambda x1, x2: x1 - x2 + 2*x1**2 + 2*x1*x2 + x2**2
x0 = np.array([0, 0])
lam = 1       # initial step length
eps = 0.05    # minimum allowable step length
n = 2         # problem dimension
N = 100       # maximum permissible iterations
print(random_walk(f, x0, lam, eps, n, N))
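For this function the exact minimiser is x* = (−1, 1.5) with f(x*) = −1.25, so the returned point should lie close to these values.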
[Figure: random walk iterates in the (x_1, x_2) plane.]
[Figure: surface plot of f(X, Y) for the example function.]
8.2 Downhill Simplex Method of Nelder and Mead
A direct search method for the unconstrained optimisation problem is the downhill simplex method developed by Nelder and Mead (1965). It makes no assumptions about the cost function to be minimised; in particular, the function does not need to satisfy any differentiability condition, unlike the other methods, i.e. it is a zero-order method. It makes use of simplices: polytopes with n + 1 vertices in dimension n. For example, in 2 dimensions the simplex is a polytope with 3 vertices (a triangle); in 3-dimensional space it forms a tetrahedron.
The method starts from an initial simplex. Each subsequent step updates the simplex, for which we define:
• xh is the vertex with highest function value,
• xs is the vertex with second highest function value,
• xl is the vertex with lowest function value,
• G is the centroid of all the vertices except x_h, i.e. the centroid of n points out of n + 1:
\[
G = \frac{1}{n}\sum_{\substack{j=1 \\ j \neq h}}^{n+1} x_j \qquad (8.1)
\]
The movement of the simplex is achieved by using three operations, known as reflection, contraction and expansion.
These can be seen in the Figures below:
A common practice to generate the remaining initial simplex vertices is to use x_0 + e_i b, where e_i is the unit vector in the direction of the x_i coordinate and b is an edge length. Assume a value of 0.1 for b.
80 CHAPTER 8. DIRECT SEARCH METHODS FOR UNCONSTRAINED OPTIMISATION
Let y = f(x) and y_h = f(x_h); the algorithm suggested by Nelder and Mead then iterates the operations above (a sketch is given below, after the stopping criterion).
The typical values for the above factors are α = 1, γ = 2 and β = 0.5. The stopping criterion to use is defined by:
\[
\sqrt{\frac{1}{n+1}\sum_{i=0}^{n}\bigl(f(x_i) - \bar{f}\bigr)^2} \leq \epsilon, \qquad (8.2)
\]
where \bar{f} denotes the mean of the n + 1 vertex function values.
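Since the algorithm listing itself did not survive into these notes, the following is a minimal numpy sketch of one common form of the Nelder–Mead iteration with the factors above (the shrink step towards x_l is standard in most presentations, but its exact form here is an assumption):

import numpy as np

def nelder_mead(f, x0, b=0.1, alpha=1.0, gamma=2.0, beta=0.5, eps=1e-8, max_iter=500):
    n = len(x0)
    # Initial simplex: x0 together with x0 + b*e_i in each coordinate direction
    simplex = np.vstack([x0, x0 + b * np.eye(n)])
    fvals = np.array([f(v) for v in simplex])
    for _ in range(max_iter):
        order = np.argsort(fvals)              # sort so x_l is first, x_h is last
        simplex, fvals = simplex[order], fvals[order]
        if np.sqrt(np.mean((fvals - fvals.mean())**2)) <= eps:
            break                              # stopping criterion (8.2)
        xh = simplex[-1]
        G = simplex[:-1].mean(axis=0)          # centroid excluding x_h, Eq. (8.1)
        xr = G + alpha * (G - xh)              # reflection
        fr = f(xr)
        if fr < fvals[0]:                      # better than x_l: try expansion
            xe = G + gamma * (xr - G)
            fe = f(xe)
            simplex[-1], fvals[-1] = (xe, fe) if fe < fr else (xr, fr)
        elif fr < fvals[-2]:                   # better than x_s: accept reflection
            simplex[-1], fvals[-1] = xr, fr
        else:
            xc = G + beta * (xh - G)           # contraction
            fc = f(xc)
            if fc < fvals[-1]:
                simplex[-1], fvals[-1] = xc, fc
            else:                              # shrink the whole simplex towards x_l
                simplex = simplex[0] + 0.5 * (simplex - simplex[0])
                fvals = np.array([f(v) for v in simplex])
    return simplex[0], fvals[0]

# e.g. the function from the random walk example:
f = lambda x: x[0] - x[1] + 2*x[0]**2 + 2*x[0]*x[1] + x[1]**2
print(nelder_mead(f, np.array([0.0, 0.0])))    # approaches (-1, 1.5)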
8.3.1 Exercises
1. Apply the above two strategies to all the multivariate functions introduced in earlier chapters and obtain their respective minima.
Chapter 9
Lagrangian Multipliers for Constrained Optimisation
In this chapter we will briefly consider the optimisation of continuous functions subject to equality constraints (this will be covered extensively in the 3rd-year course), that is, the problem:
\[
\text{minimise } f(x)
\]
subject to:
\[
g_i(x) = b_i, \quad i = 1, \dots, m, \qquad (9.1)
\]
where f and the g_i are differentiable. The Lagrange function, L, is defined by introducing one Lagrange multiplier λ_i for each constraint g_i(x) as:
\[
L(x, \lambda) = f(x) + \sum_{i=1}^{m} \lambda_i \bigl[\,b_i - g_i(x)\,\bigr]. \qquad (9.2)
\]
The necessary conditions for a stationary point of L are:
\[
\frac{\partial L}{\partial x_j} = \frac{\partial f}{\partial x_j} - \sum_{i=1}^{m} \lambda_i \frac{\partial g_i}{\partial x_j} = 0, \qquad \frac{\partial L}{\partial \lambda_i} = b_i - g_i(x) = 0. \qquad (9.3)
\]
9.0.1 Example
Minimise f(x_1, x_2) = x_1^2 + 4x_2^2 subject to the constraint g(x_1, x_2) = x_1 + 2x_2 = b with b = 1.
Solution:
The conditions (9.3) require:
\[
\frac{\partial f}{\partial x_1} - \lambda\frac{\partial g}{\partial x_1} = 0, \qquad \frac{\partial f}{\partial x_2} - \lambda\frac{\partial g}{\partial x_2} = 0,
\]
and
\[
g(x_1, x_2) = b.
\]
Therefore, we solve:
\[
2x_1 - \lambda = 0, \qquad 8x_2 - 2\lambda = 0, \qquad x_1 + 2x_2 = 1.
\]
Solving these three equations we obtain x_1 = 1/2, x_2 = 1/4 and λ = 1. Therefore, the optimum is:
\[
f(x_1, x_2) = \frac{1}{2}.
\]
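The same system can be checked symbolically; a quick sketch using sympy:

import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lam')
f = x1**2 + 4*x2**2

# Lagrangian L = f + lam*(b - g) with g = x1 + 2*x2 and b = 1
L = f + lam * (1 - x1 - 2*x2)
sols = sp.solve([sp.diff(L, v) for v in (x1, x2, lam)], [x1, x2, lam], dict=True)
print(sols)             # x1 = 1/2, x2 = 1/4, lam = 1
print(f.subs(sols[0]))  # 1/2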
9.0.2 Exercises
1. A length of wire L metres long is to be divided into two pieces, one bent into a circle and the other into a square. What must the individual lengths be so that the total enclosed area is a minimum? Formulate the optimisation problem mathematically and then solve it.
2. Minimise
\[
f(x) = x_1^2 + x_2^2
\]
subject to
\[
x_1 + 2x_2 + 1 = 0.
\]
3. Find the dimensions of a cylindrical tin of sheet metal that maximise its volume, given that the total surface area is equal to A_0 = 24π.