
Optimisation II APPM2007

Dr Matthew Woolway
2018-09-26
Contents

Course Outline
    Course Structure and Details
    Course Assessment
    Course Topics
    Hardware Requirements

1 Definition and General Concepts
    1.1 Nonlinear Optimisation
    1.2 General Statement of an Optimisation Problem
    1.3 Important Optimisation Concepts
        1.3.1 Definitions
            1.3.1.1 Neighbourhoods
            1.3.1.2 Local and Global Minimisers
            1.3.1.3 Infimum and Supremum of a Function
        1.3.2 Convexity
            1.3.2.1 Affine Set
            1.3.2.2 Convex Set
            1.3.2.3 Convex Combination and Convex Hull
            1.3.2.4 Hyperplanes and Halfspaces
            1.3.2.5 Level Sets and Level Surfaces
        1.3.3 Exercises

2 One Dimensional Unconstrained and Bound Constrained Problems
    2.1 Unimodal and Multimodal
    2.2 Convex Functions
    2.3 Global Extrema
    2.4 Necessary and Sufficient Conditions
        2.4.1 Exercises

3 Numerical Solutions to Nonlinear Equations
    3.1 Newton's Method
        3.1.0.1 Example
        3.1.1 Advantages and Disadvantages of Newton's Method
    3.2 Secant Method
        3.2.1 Exercises

4 Numerical Optimisation of Univariate Functions
    4.1 Techniques Using Function Evaluations
        4.1.1 Bisection Method
            4.1.1.1 Exercise
        4.1.2 Golden Search Method
            4.1.2.1 Example
        4.1.3 Exercises

5 Multivariate Unconstrained Optimisation
    5.1 Terminology for Functions of Several Variables
        5.1.0.1 Example
    5.2 A Line in a Particular Direction in the Context of Optimisation
        5.2.0.1 Example
    5.3 Taylor Series for Multivariate Function
    5.4 Quadratic Forms
        5.4.0.1 Example
    5.5 Stationary Points
        5.5.1 Tests for Positive Definiteness
            5.5.1.1 Compute the Eigenvalues
            5.5.1.2 Principal Minors
    5.6 Necessary and Sufficient Conditions
        5.6.0.1 Example
        5.6.0.2 Example
        5.6.0.3 Example
        5.6.1 Exercises

6 Gradient Methods for Unconstrained Optimisation
    6.1 General Line Search Techniques used in Unconstrained Multivariate Minimisation
        6.1.1 Challenges in Computing Step Length αₖ
    6.2 Exact and Inexact Line Search
        6.2.1 Algorithmic Structure
    6.3 The Descent Condition
    6.4 The Direction of Greatest Reduction
    6.5 The Method of Steepest Descent
        6.5.1 Steepest Descent Algorithm
        6.5.2 Convergence Criteria
            6.5.2.1 Example
        6.5.3 Inexact Line Search
            6.5.3.1 Backtracking Line Search
        6.5.4 Exercises
    6.6 The Gradient Descent Algorithm and Machine Learning
        6.6.1 Basic Example
        6.6.2 Adaptive Step-Size
        6.6.3 Decreasing Step-Size
        6.6.4 Stochastic Gradient Descent
            6.6.4.1 Additional Example

7 Newton and Quasi-Newton Methods
    7.0.0.1 Example
    7.1 The Modified Newton Method
    7.2 Convergence of Newton's Method for Quadratic Functions
    7.3 Quasi-Newton Methods
        7.3.1 The DFP Quasi-Newton Method
        7.3.2 Exercises

8 Direct Search Methods for Unconstrained Optimisation
    8.1 Random Walk Method
        8.1.0.1 Example
    8.2 Downhill Simplex Method of Nelder and Mead
    8.3 Rosenbrock Function Example
        8.3.1 Exercises

9 Lagrangian Multipliers for Constrained Optimisation
    9.0.1 Example
    9.0.2 Exercises
List of Figures

1.1 A Line Segment
1.2 Left and Right are Non-Convex, Center Convex
1.3 An Example of a Convex Hull
1.4 A Hyperplane
1.5 A Halfspace
2.1 An Example of a Convex Function
5.1 Mathematica Demo of Gradient
5.2 An Example of a Line in a Particular Direction
8.1 Here we have reflection and expansion
8.2 Here we have contraction
8.3 Here we have multiple contraction
8.4 Application of Downhill Simplex on Rosenbrock Function - 3D
8.5 Application of Downhill Simplex on Rosenbrock Function - 2D
Course Outline

Course Structure and Details

• Office: UG3 - Maths Science Building (MSB)


• Consultation: Wednesdays - 12:30 - 14:15 (This is for all three topics)
• Lecture Venues: P115 - Mondays (09:00 - 10:00)

Course Assessment

• There will be two tests and no assignment


• There will be labs. These may/may not count for extra marks
• The programming language will be Python


Course Topics

We will be covering the following topics throughout the course:


• Generalised concept of optimisation
• One dimensional unconstrained and bound constrained problems
• Numerical differentiation and solutions to nonlinear equations
• Numerical optimisation of univariate functions
• Multivariate unconstrained optimisation
• Gradient methods for unconstrained optimisation
• Newton and Quasi-Newton methods
• Direct search methods for unconstrained problems
• Lagrangian multipliers for constrained optimisation

Hardware Requirements

The course will be very computational in nature; however, you do not need your own personal machine. MSL already
has Python installed. The labs will be running the IDEs for Python (along with Jupyter), while I will be using Jupyter
for easier presentation and explanation in lectures. You will at some point need to become familiar with Jupyter, as
the tests will be conducted in the Maths Science Labs (MSL) utilising this platform for autograding purposes.
If you do have your own machine and would prefer to work from that, you are more than welcome. Since all the
notes and code will be presented through Jupyter, please follow these steps:
• Install Anaconda from here: https://repo.continuum.io/archive/Anaconda3-5.2.0-Windows-x86_64.exe
  – Make sure when installing Anaconda to add the installation to PATH when prompted (it will be deselected
    by default)
• To launch a Jupyter notebook, open the command prompt (cmd) and type jupyter notebook. This should
  launch the browser and Jupyter. If you see any proxy issues while on campus, then you will need to set the
  proxy to exclude the localhost.
If you are not running Windows but rather Linux, then you can get Anaconda at: https://repo.continuum.io/archive/Anaconda3-5.2.0-Linux-x86_64.sh
Chapter 1

Definition and General Concepts

In industry, commerce, government, indeed in all walks of life, one frequently needs answers to questions concerning
operational efficiency. Thus an architect may need to know how to lay out a factory floor so that the article being
manufactured does not have to be moved about too much as it goes from one machine tool to another; the manager
of a shipping company needs to plan the itineraries of his ships so as to increase the amount of goods handled,
while avoiding costly waiting-around in docks. A telecommunications engineer may want to know how best to
transmit signals so as to minimise the possibility of error on reception. Further examples of problems of this sort
are provided by the planning of a railroad time-table to ensure that trains are available as and when needed, the
synchronisation of traffic lights, and many other real-life situations. Formerly such problems would usually be
‘solved’ by imprecise methods giving results that were both unreliable and costly. Today, they are increasingly being
subjected to rigorous mathematical analysis, designed to provide methods for finding exact solutions or highly reliable
estimates rather than vague approximations. Optimisation provides many of the mathematical tools used for solving
such problems.

Optimisation, or mathematical programming, is the study of maximising and minimising functions subject to
specified boundary conditions or constraints. The functions to be optimised arise in engineering, the physical
and management sciences, and in various branches of mathematics. With the emergence of the computer age,
optimisation experienced a dramatic growth as a mathematical theory, enhancing both combinatorics and classical
analysis. In its applications to engineering and management science, optimisation (linear programming) forms an
important part of the discipline of operations research.

1.1 Nonlinear Optimisation

Linear programming has proved an extremely powerful tool, both in modelling real-world problems and as a widely
applicable mathematical theory. However, many interesting and practical optimisation problems are nonlinear. The
study of such problems involves a diverse blend of linear algebra, multivariate calculus, quadratic forms, numerical
analysis and computing techniques. Important special areas include the design of computational algorithms,
the geometry and analysis of convex sets and functions, and the study of specially structured problems such as
unconstrained and constrained nonlinear optimisation problems. Nonlinear optimisation provides fundamental
insights into mathematical analysis, and is widely used in the applied sciences, in areas such as engineering design,
regression analysis, inventory control, and geophysical exploration among others. General nonlinear optimisation
problems and various computational algorithms for addressing such problems will be taught in this course.

13
14 CHAPTER 1. DEFINITION AND GENERAL CONCEPTS

1.2 General Statement of an Optimisation Problem

Optimisation problems are made of three primary ingredients: (i) an objective function, (ii) variables (unknowns)
and (iii) the constraints.

• An objective function which we want to minimize or maximize. For instance, in a manufacturing process, we
might want to maximize the profit or minimize the cost. In fitting experimental data to a user-defined model,
we might minimize the total deviation of observed data from predictions based on the model. In designing an
automobile panel, we might want to maximize the strength.
• A set of unknowns/variables which affect the value of the objective function. In the manufacturing problem,
the variables might include the amounts of different resources used or the time spent on each activity. In the
fitting-the-data problem, the unknowns are the parameters that define the model. In the panel design problem,
the variables used define the shape and dimensions of the panel.
• A set of constraints that allow the unknowns to take on certain values but exclude others. For the manufac-
turing problem, it does not make sense to spend a negative amount of time on any activity, so we constrain all
the ‘time’ variables to be non-negative. In the panel design problem, we would probably want to limit the
weight of the product and to constrain its shape. The optimisation problem is then :

Find values of the variables that minimize or maximize the objective function while satisfying the constraints.

The general statement being:

minimise (w.r.t. x)  f(x),  x = [x₁, x₂, . . . , xₙ]ᵀ ∈ Rⁿ,   (1.1)

subject to:

gⱼ(x) ≤ 0, j = 1, 2, . . . , m,
hⱼ(x) = 0, j = 1, 2, . . . , r,

where f(x), gⱼ(x) and hⱼ(x) are scalar functions of the real vector x.

The continuous components xᵢ of x are called the variables, f(x) is the objective function, gⱼ(x) denotes the
respective inequality constraint functions and hⱼ(x) the equality constraint functions. The optimum vector x that
solves Equation (1.1) is denoted by x∗ with a corresponding optimum function value f(x∗). If there are no constraints
specified, then the problem is aptly named an unconstrained minimisation problem. Much progress has been made
in solving different classes of the general problem introduced in Equation (1.1). On occasion these solutions can be
attained analytically, yielding a closed-form solution. However, most real-world problems have n > 2 and as a result
need to be solved numerically through suitable computational algorithms.

1.3 Important Optimisation Concepts

1.3.1 Definitions

1.3.1.1 Neighbourhoods

Definition 1.1 (Neighbourhood). δ-neighbourhood of a point, say y, where y ∈ Rⁿ:

A δ-neighbourhood of a point y is the set of all points within distance δ of y, denoted by Nδ(y). It is the
set of all points x such that:

Nδ(y) : ‖x − y‖ ≤ δ,   (1.2)

that is, x ∈ Nδ(y).
1.3. IMPORTANT OPTIMISATION CONCEPTS 15

1.3.1.2 Local and Global Minimisers

Definition 1.2 (Global Minimiser/Maximiser). If a point x ∈ S, where S is the feasible set, then x is a feasible solution.
A feasible solution xg of some problem P is the global minimiser of f (x) if:

f (xg ) ≤ f (x), (1.3)

for all feasible points x of the problem P . The value f (xg ) is called the global minimum. The converse applies for the
global maximum.

Definition 1.3 (Local Minimiser/Maximiser). A point x∗ is called a local minimiser of f (x) if there exists a suitable
δ > 0 such that for all feasible x ∈ Nδ (x∗ ):
f (x∗ ) ≤ f (x). (1.4)

In other words, x∗ is a local minimiser of f (x) if there exists a neighbourhood Nδ (x∗ ) of x∗ containing feasible x such
that:
f(x∗) ≤ f(x), ∀ x ∈ Nδ(x∗).   (1.5)

Again the converse applies for local maximisers.

1.3.1.3 Infimum and Supremum of a Function

Definition 1.4 (Infimum). If f is a function on S, the greatest lower bound or infimum of f on S is the largest
number m (possibly m = −∞) such that f(x) ≥ m ∀x ∈ S. The infimum is denoted by:

inf_{x∈S} f(x).   (1.6)

Definition 1.5 (Supremum). If f is a function on S, the least upper bound or supremum of f on S is the smallest
number m (possibly m = +∞) such that f(x) ≤ m ∀x ∈ S. The supremum is denoted by:

sup_{x∈S} f(x).   (1.7)

1.3.2 Convexity

1.3.2.1 Affine Set

Definition 1.6 (Affine Set). A line through the points x₁ and x₂ in Rⁿ is the set:

L = {x | x = θx₁ + (1 − θ)x₂, θ ∈ R}.   (1.8)

This is known as an Affine Set. An example of this is the solution set of the linear equations Ax = b.

1.3.2.2 Convex Set

Definition 1.7 (Convex Set). A set is convex if it contains the line segment between any two of its points x₁ and x₂,
that is, every point:

x = θx₁ + (1 − θ)x₂, where 0 ≤ θ ≤ 1.   (1.9)

If this condition does not hold then the set is non-convex. Think of this as line of sight. Some examples are
considered in the Figure below:

Figure 1.1: A Line Segment.

Figure 1.2: Left and Right are Non-Convex, Center Convex.



Figure 1.3: An Example of a Convex Hull.

1.3.2.3 Convex Combination and Convex Hull

Definition 1.8 (Convex Combination). We can define the Convex Combination of x₁, . . . , xₙ as any point x which
satisfies:

x = θ₁x₁ + θ₂x₂ + . . . + θₙxₙ,   (1.10)

where θ₁ + . . . + θₙ = 1, θᵢ ≥ 0.

The Convex Hull (conv L) is the set of all convex combinations of the points in L. This can be thought of as the
smallest convex set containing all points in the set, as can be seen in Figure 1.3 above. A minimal numerical
illustration follows.
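As a quick numerical illustration (a sketch, not part of the original notes; the point coordinates are arbitrary and SciPy is assumed to be available alongside Anaconda):

import numpy as np
from scipy.spatial import ConvexHull

# A small cloud of points in R^2; its convex hull is the smallest convex
# set containing all of them.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.2, 0.4],
                   [0.5, 1.0], [0.9, 0.8], [0.4, 0.3]])
hull = ConvexHull(points)
print(points[hull.vertices])   # the extreme points of the hull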

1.3.2.4 Hyperplanes and Halfspaces

Definition 1.9 (Hyperplane/Halfspace). We can define the Hyperplane as the set of the form:

{x | aᵀx = b} (a ≠ 0),   (1.11)

we can define the Halfspace as the set of the form:

{x | aᵀx ≤ b} (a ≠ 0),   (1.12)

where a is the normal vector.

Note: Hyperplanes are both affine and convex, while halfspaces are only convex. These are illustrated in the Figures
below:

1.3.2.5 Level Sets and Level Surfaces

Definition 1.10 (Level Set). Consider the real valued function f on L. Let a be in R, then we denote L a to be the set:

L a = {x ∈ L | f (x) ≤ a}. (1.13)

These sets are known as level sets of f on L and can be thought of as all vectors returning a function value less than
or equal to the constant value a.

Figure 1.4: A Hyperplane.

Figure 1.5: A Halfspace.


1.3. IMPORTANT OPTIMISATION CONCEPTS 19

Definition 1.11 (Level Surface). Consider the real valued function f on L. Let a be in R, then we denote C a to be
the set:
C a = {x ∈ L | f (x) = a}. (1.14)
These sets are known as level surfaces of f on L and can be thought of as the cross-section taken at some point
x₀ ∈ L.

Consider, for example, the surface x²y + xy².

1.3.3 Exercises

1. Find the convex hull of the following sets:

S₁ = {(x₁, x₂) | x₁² + x₂² = 1}
S₂ = {(x₁, x₂) | x₁² + x₂² > 0}
S₃ = {(0, 0), (1, 0), (1, 1), (0, 1)}
S₄ = {(x₁, x₂) | |x₁| + |x₂| < 1}

2. Find the minimum (if any) of the following:


• min 1/x for x ∈ [1, ∞)
• min x for x ∈ (0, 1]
3. Determine the supremum, infimum, maximum and minimum of the following functions:
• f (x) = e x , x ∈ R
• Let f : L ⊂ R² → R. The set L is defined by the disc x₁² + x₂² ≤ 1 in R² and:
  f(x₁, x₂) = e^(x₁² + x₂²).

4. In each of the following cases, sketch the level sets L b of the function f :
• f(x₁, x₂) = x₁ + x₂, b = 1, 2
• f(x₁, x₂) = x₁x₂, b = 1, 2
• f (x) = e x , b = 10, 0
5. Let L be a convex set in Rn , A be an m × n matrix and α a scalar. Show that the following sets are convex.
• {y : y = Ax, x ∈ L}
• {αx : x ∈ L}
6. If you have two points that solve a system of linear equations Ax = b, i.e. points x₁ and x₂, where x₁ ≠ x₂,
then prove that the line that passes through these two points is in the affine set.
7. Prove that a halfspace is convex.
Chapter 2

One Dimensional Unconstrained and Bound Constrained Problems

The one dimensional unconstrained problem is defined by:

minimise_{x∈L} f(x),  L ⊆ R,   (2.1)

where f is a continuous and twice differentiable function. If L is an interval, then x is bound constrained. If
L = R, then the problem is unconstrained.

2.1 Unimodal and Multimodal

A function is said to be monotonic increasing along a given path when f (x 2 ) > f (x 1 ), and monotonic decreasing
when f (x 2 ) < f (x 1 ) for all points in the domain for which x 2 > x 1 . For the situation in which f (x 2 ) ≥ f (x 1 ) and
f (x 2 ) ≤ f (x 1 ) the functions are respectively called monotonic non-decreasing and monotonic non-increasing.
For example if we consider f (x) = x 2 , where x > 0, then f (x) is monotonic increasing. For x < 0, f (x) is monotonic
decreasing. A function that has a single minimum or a single maximum (single peak) is known as a unimodal function.
Functions with two peaks (two minima or two maxima) are called bimodal and functions with many peaks are
known as multimodal functions.

2.2 Convex Functions

Definition 2.1 (Convex Functions). A function f : L ⊂ Rn → R defined on the convex set L is convex if for all
x1 , x2 ∈ L, for all θ ∈ [0, 1]:

f(θx₂ + (1 − θ)x₁) ≤ θf(x₂) + (1 − θ)f(x₁).   (2.2)

Consider the function of a single variable. We may think of this easily then by saying: the function f is convex if the
chord connecting x₁ and x₂ lies above the graph.
See the Figure below:
Note:
The function is strictly convex if strict inequality (<) applies; f is concave if −f is convex.


Figure 2.1: An Example of a Convex Function.

2.3 Global Extrema

We shall summarise some definitions that were previously mentioned. We begin with the concept of global optimi-
sation (maximisation or minimisation). In the context of optimisation, relative optima (maximum or minimum) are
normally referred to as local optima.

Global Optima

A point at which f(x) attains its greatest (or least) value on an interval [a, b] is called a point of global maximum (or
minimum). In general, a function f(x) takes on its absolute (global) maximum (minimum) at a point x∗ if
f(x) < f(x∗) (f(x) > f(x∗)) for all x over which the function f(x) is defined.

Local Optima

f(x) has a strong local (relative) maximum (minimum) at an interior point x∗ ∈ (a, b) if f(x) < f(x∗) (f(x) > f(x∗))
for all x in some neighbourhood of x∗. The maximum (minimum) is weak if ≤ replaces <. Strong local optima are illustrated
in the Figure below. If a function f(x) has a strong relative maximum at some point x∗, then there is an interval
including x∗, no matter how small, such that for all x in this interval, f(x) is strictly less than f(x∗), i.e. f(x) < f(x∗).
It is the ‘strictly less’ that makes this a strong relative maximum. If, however, the strictly-less sign is replaced by a ≤
sign, i.e. f(x) ≤ f(x∗), then the maximum value at x∗ is a weak maximum and x∗ is a weak maximiser.

2.4 Necessary and Sufficient Conditions

A necessary condition for f(x) to have a maximum or a minimum at an interior point x∗ ∈ (a, b) is that the slope at
this point must be zero, i.e. where:

f′(x) = df(x)/dx = 0,   (2.3)

which corresponds to the first order necessary condition (FONC). The FONC is necessary but it is not
sufficient. For example, consider the function f(x) = x³ as seen in the Figure below. At x = 0, f′(x) = 0 but there
is no maximum or minimum point on the interval (−∞, ∞). At x = 0 there is a point of inflection, where f′′ = 0.
Therefore, the point x = 0 is a stationary point but not a local optimum.

Thus, in addition to the FONC, non-negative curvature is also necessary at x∗; for a strong local minimum it
suffices that the second order condition:

f′′(x) = d²f(x)/dx² > 0,   (2.4)

holds at x∗. This is known as the second order sufficiency condition (SOSC).
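The FONC and SOSC can be applied programmatically. Below is a minimal SymPy sketch (not part of the original notes; the sample function f(x) = x³ − 3x is an arbitrary choice):

import sympy as sp

x = sp.symbols('x')
f = x**3 - 3*x

stationary = sp.solve(sp.diff(f, x), x)        # FONC: solve f'(x) = 0
for s in stationary:
    curvature = sp.diff(f, x, 2).subs(x, s)    # f''(x) at the stationary point
    if curvature > 0:
        print(s, 'is a strong local minimiser (SOSC holds)')
    elif curvature < 0:
        print(s, 'is a strong local maximiser')
    else:
        print(s, 'second order test inconclusive')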

2.4.1 Exercises

1. If the convexity condition for any real valued function f : R → R is given by:

λf(b) + (1 − λ)f(a) ≥ f(λb + (1 − λ)a),  a, b ∈ R,  λ ∈ [0, 1],

then using the above, prove that the following one dimensional functions are convex;
• f 1 (x) = 1 + x 2
• f 2 (x) = x 2 − 1
2. Find all stationary points of:
f(x) = x³(3x² − 5) + 1,
and decide the maximiser, minimiser and the point of inflection, if any.
3. Using the FONC of optimality, determine the optimiser of the following functions:
• f(x) = (1/3)x³ − (7/2)x² + 12x + 3
• f(x) = 2(x − 1)² + x²
4. Using necessary and sufficient conditions for optimality, investigate the maximiser/minimiser of:

f(x) = −(x − 1)⁴.

5. The function f(x) = max{0.5, x, x²} is convex. True or false?

6. The function f(x) = min{0.5, x, x²} is concave. True or false?

7. The function f(x) = (x² + 2)/(x + 2) is concave. True or false?

8. Determine the maximum and minimum values of the function:

f(x) = 12x⁵ − 45x⁴ + 40x³ + 5


Chapter 3

Numerical Solutions to Nonlinear Equations

It is often necessary to find the stationary point(s) of a given function f(x). This means finding the root of a
nonlinear function g(x) if we consider g(x) = f′(x) = 0. In other words, solving g(x) = 0. Here, we introduce the
Newton method. This method is important because when we cannot solve f′(x) = 0 analytically we can still
solve it numerically.

3.1 Newton’s Method

Newton’s method is one of the more powerful and well known numerical methods for finding a root of g(x), i.e. for
solving for x such that g(x) = 0. So we can use it to find the turning point, i.e. when f′(x) = 0. In the context of
optimisation we want an x∗ such that f′(x∗) = 0. Consider the figure below:


[Figure: Newton's Method in Action.]
Suppose at some stage we have obtained the point xₙ as an approximation to the root at x∗ (initially this is a guess).
Newton observed that if g(x) were a straight line through (xₙ, g(xₙ)) with slope g′(xₙ) = g′(x) = constant for all x,
then the equation of the straight line could be found and the root read off. Obviously there would be no problem if
g(x) were actually a straight line; however, the tangent might still be a good approximation (as seen in the Figure
above). If we regard the tangent as a model of the function g(x) and we have an approximation xₙ, then we can
produce a better approximation xₙ₊₁. The method can be applied again and again, to give a sequence of values,
each approximating x∗ with more and more accuracy.

The tangent to the curve of g(x) at (xₙ, g(xₙ)) has slope:

g′(xₙ) = (y − g(xₙ)) / (x − xₙ).   (3.1)

When y = 0, let x = xₙ₊₁ (the point where the tangent cuts the x-axis).

Thus the Newton formula for root finding is:

xₙ₊₁ = xₙ − g(xₙ)/g′(xₙ),  g′(xₙ) ≠ 0.   (3.2)

Hence the Newton method can be described by the following two steps:

• Get an initial ‘guess’ x₀.

• Iterate by xₙ₊₁ = xₙ − f′(xₙ)/f′′(xₙ).

xₙ converges to a turning point for a suitable choice of x₀.



3.1.0.1 Example

Find the minimum of f(x) = x⁴/4 + x²/2 − 3x near x = 2.

[Figure: the original function f(x) and its derivative g(x); the root of the derivative marks the minimiser of f.]
We can see that finding the root of the derivative yields the minimum value of f(x) in this case (at approximately
x = 1.21341). Check: perform a few iterations of Equation (3.2) to check the solution.
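A minimal Python sketch of this check (not part of the original notes; the starting guess x = 2 follows the example):

def g(x):  return x**3 + x - 3     # g(x) = f'(x) for f(x) = x**4/4 + x**2/2 - 3*x
def gp(x): return 3*x**2 + 1       # g'(x) = f''(x)

x = 2.0                            # initial guess
for k in range(5):
    x = x - g(x) / gp(x)           # Newton update, Equation (3.2)
    print(k + 1, x)                # converges rapidly towards 1.21341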

3.1.1 Advantages and Disadvantages of Newton’s Method

Advantages

• The method is really fast when it works (quadratic convergence)

Disadvantages:

• Unknown number of steps needed for required accuracy, unlike Bisection for example.
• f must be at least twice differentiable.
• Runs into problems when g′(x∗) = 0.
• It could potentially be difficult to compute g(x) and g′(x) even if they do exist.

In general Newton’s Method is fast, reliable and trouble-free, but one has to be mindful of the potential problems.

3.2 Secant Method


Here we try to avoid the problem of computing f′′(x) and approximate it by (f′(xₙ) − f′(xₙ₋₁)) / (xₙ − xₙ₋₁).
This gives the updating formula:

xₙ₊₁ = xₙ − f′(xₙ) (xₙ − xₙ₋₁) / (f′(xₙ) − f′(xₙ₋₁)).   (3.3)

The other disadvantages of Newton's method still apply. A short numerical sketch follows the figure below.

[Figure: two panels — the plotted function, and a zoomed scale showing the secant method iterates.]
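A minimal Python sketch of the secant update (not part of the original notes), applied to f(x) = (1/4)x⁴ + (1/2)x² − 2x + 1 from Exercise 2 below; the two starting iterates are arbitrary choices:

def fp(x):
    return x**3 + x - 2            # f'(x) for the exercise's function

x_prev, x = 2.0, 1.8               # two starting iterates
for k in range(8):
    x_new = x - fp(x) * (x - x_prev) / (fp(x) - fp(x_prev))   # Equation (3.3)
    x_prev, x = x, x_new
print(x)                           # converges to the critical point x = 1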

3.2.1 Exercises

1. Beginning with x = 0, apply Newton’s Method to find the solution of

3x − sin(x) − exp(x) = 0,

up to four iterations. Here x is in radians.


2. Find the critical point(s) of the function

f(x) = (1/4)x⁴ + (1/2)x² − 2x + 1,
using both Newton’s Method and the Secant Method. If the critical point is a minimiser then obtain the
minimum value. You may assume x = 2 as an initial guess.
Chapter 4

Numerical Optimisation of Univariate Functions

The simplest functions with which to begin a study of non-linear optimisation methods are those with a single
independent variable. Although the minimisation of univariate functions is in itself of some practical importance,
the main area of application for these techniques is as a subproblem of multivariate minimisation.

There are functions to be minimised where the variable x is unrestricted (say, x ∈ R); there are also functions to
be optimised over a finite interval (in n dimensions it is a box). Single-variable optimisation on a finite interval
is important because of its application in multivariate optimisation. In this chapter we will consider one
dimensional optimisation.

If one needs to find the maximum or minimum (i.e. the optimal) value of a function f (x) on the interval [a, b] the
procedure would be:

1. Find all turning (stationary) points of f (x) (assuming f (x) is differentiable) on [a, b] and then decide the
optimum.
2. Find the optimal turning point of f (x) on [a, b].

Generally it may be difficult/impossible/tiresome to implement (i) analytically, so we resort to the computer and an
appropriate numerical method to find an optimal (hopefully the best estimate!) solution of an univariate function.
In the next section we introduce some numerical techniques. The numerical approach is mandatory when the
function f (x) is not given explicitly.

In many cases one would like to find the minimiser of a function f(x) when neither f(x) nor f′(x) is given (or
known) explicitly; then numerical approaches, viz. polynomial interpolation or function comparison methods,
are used. These are the univariate optimisation methods used as line searches in multivariate optimisation.

4.1 Techniques Using Function Evaluations

4.1.1 Bisection Method

We assume that an interval [a, b] is given and that a local minimum x∗ ∈ [a, b]. When the first derivative of the
objective function f(x) is known at a and b, it is necessary to evaluate function information at only one interior
point in order to reduce this interval. This is because it is possible to decide if an interval brackets a minimum
simply by looking at the function values f(a), f(b) and the derivatives f′(a), f′(b) at the extreme points a and b.
The conditions to be satisfied are:

• f′(a) < 0 and f′(b) > 0.


• f′(a) < 0 and f(b) > f(a).

• f′(a) > 0 and f′(b) > 0 and f(b) < f(a).

These three situations are illustrated in the Figure below. The next step of the bisection method is to reduce the
interval. At the k-th iteration we have an interval [aₖ, bₖ] and the mid-point cₖ = ½(aₖ + bₖ) is computed. The
next interval will be called [aₖ₊₁, bₖ₊₁], which is either [aₖ, cₖ] or [cₖ, bₖ] depending on which interval brackets the
minimum. The process continues until two consecutive intervals produce minima which are within an acceptable
tolerance. A minimal code sketch is given after the figure below.

[Figure: Conditions 1, 2 and 3 illustrated.]
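A minimal Python sketch of the interval-reduction loop (not part of the original notes), specialised to the first bracketing condition f′(a) < 0 < f′(b), so the sign of f′ at the mid-point decides which half to keep:

def bisection_min(fprime, a, b, tol=1e-6):
    # Assumes fprime(a) < 0 < fprime(b), so [a, b] brackets a minimum.
    while b - a > tol:
        c = 0.5 * (a + b)          # mid-point of the current interval
        if fprime(c) < 0:
            a = c                  # the minimum lies to the right of c
        else:
            b = c                  # the minimum lies to the left of c
    return 0.5 * (a + b)

# Check on the exercise below: f(x) = -(1/3)x**3 - (1/2)x**2 + 2*x - 5 on [-3, 0]
print(bisection_min(lambda x: -x**2 - x + 2, -3.0, 0.0))   # ~ -2.0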

4.1.1.1 Exercise

Find the minimum value of:

f(x) = −(1/3)x³ − (1/2)x² + 2x − 5,

over the domain [−3, 0] using the bisection method. The problem has a minimum value of −8.33 at x = −2.

4.1.2 Golden Search Method

Suppose f : R → R on the interval [a 0 , b 0 ] and f has only one minimum (we say f is unimodal at x ∗ . The problem is
to locate x ∗ . The method we now discuss is based on evaluating the objective function at different points in the
interval [a 0 , b 0 ]. We choose these points in such a way that an approximation to the minimiser of f may be achieved
in as few evaluation as possible. Our goal is to progressively narrow down the range of the subinterval containing x ∗ .
If we evaluate f at only one intermediate point of the interval [a 0 , b 0 ], we cannot narrow the range within which we
know the minimiser is located. We have to evaluate f at two intermediate points in such a way that the reduction
in the range is symmetrical, in the sense that a 1 − a 0 = b 0 − b 1 = ρ(b 0 − a 0 ), where ρ < 12 to keep a 1 ‘near’ to b 0 . We
then evaluate f at the intermediate points. If f (a 1 ) < f (b 1 ), then the minimiser must lie in the range [a 0 , b 1 ]. If, on
the other hand, f (a 1 ) ≥ f (b 1 ), then the minimiser located in the range [a 1 , b 0 ]. Starting with the reduced range of
uncertainty we can repeat the process and similarly find two new points a 2 and b 2 , using the same value of ρ as
before. However, we would like to minimise the number of function evaluations while reducing the width of the
interval of uncertainty. Suppose that f (a 1 ) < f (b 1 ). Then we know that x ∗ ∈ [a 0 , b 1 ]. Because a 1 is already in the

uncertainty interval and f(a₁) is known, we can use this information. We can make a₁ coincide with b₂. Thus,
only one new evaluation of f, at a₂, would be necessary. We can now calculate the value of ρ that results in only one
new evaluation of f. To save algebra we will assume that b₀ − a₀ = 1. Then, to have only one new evaluation of f it is
enough to choose ρ so that:

ρ(b₁ − a₀) = b₁ − b₂.

Because b₁ − a₀ = 1 − ρ and b₁ − b₂ = 1 − 2ρ, we have:

ρ(1 − ρ) = 1 − 2ρ.   (4.1)

The solutions of Equation (4.1) are (3 ± √5)/2; because ρ < 0.5 we take ρ = (3 − √5)/2.

Therefore,

a₁ = a₀ + ρ(b₀ − a₀)   (4.2)
b₁ = b₀ − ρ(b₀ − a₀)   (4.3)

Somewhere in the intervals [a₀, a₁), [a₁, b₁), [b₁, b₀] lies the point x∗.


• If f (a 1 ) < f (b 1 ), x ∗ ∈ [a 0 , b 1 ].
• If f (a 1 ) ≥ f (b 1 ), x ∗ ∈ [a 1 , b 0 ].
This forms the basis of a search algorithm since the technique is applied again on the reduced interval.

4.1.2.1 Example

Use four iterations of the Golden Section search to find the value of x that minimises:

f(x) = x⁴ − 14x³ + 60x² − 70x,

on the domain [0, 2].


Answer:
Iteration 1:
We evaluate f in two intermediate points a 1 and b 1 . We have:

a 1 = a 0 + ρ(b 0 − a 0 ) = 0.763,
b 1 = a 0 + (1 − ρ)(b 0 − a 0 ) = 1.236.

We compute
f (a 1 ) = −24.36,
f (b 1 ) = −18.96.

Thus we have f (a 1 ) < f (b 1 ), and so the uncertainty interval is reduced to [a 0 , b 1 ] = [0, 1.236].
Iteration 2:
We choose b 2 to coincide with a 1 , and f need only to be evaluated at one new point

a 2 = a 0 + ρ(b 1 − a 0 ) = 0.4721.

Now we have:

f (a 2 ) = −21.10,
f (b 2 ) = −24.36.

Now, f (b 2 ) < f (a 2 ), and so the uncertainty interval is reduced to [a 2 , b 1 ] = [0.4721, 1.236].

Iteration 3:

We set a 3 = b 2 , and compute b 3 :

b 3 = a 2 + (1 − ρ)(b 1 − a 2 ) = 0.9443.

We have:

f (a 3 ) = −24.36,
f (b 3 ) = −23.59.

So we have f (b 3 ) > f (a 3 ). Hence, the new interval is [a 2 , b 3 ] = [0.472, 0.944].

Iteration 4:

We set b 4 = a 3 , and compute a 4 :

a 4 = a 2 + ρ(b 3 − a 2 ) = 0.6525.

We have:

f (a 4 ) = −23.84,
f (b 4 ) = −24.36.

Since f(a₄) > f(b₄), the value of x that minimises f is located in the interval

[a 4 , b 3 ] = [0.652, 0.944].

[Figure: f(x) = x⁴ − 14x³ + 60x² − 70x plotted on the domain [0, 2].]

Golden Search Method in Python

import math

def golden_search(f, a, b, tol=1e-5):
    # Golden-section search for the minimiser of a unimodal f on [a, b].
    phi = (1 + math.sqrt(5)) / 2        # golden ratio, as in the inputs above
    c = b + (a - b) / phi               # left interior point
    d = a + (b - a) / phi               # right interior point
    fc, fd = f(c), f(d)
    while abs(c - d) > tol:
        if fc < fd:                     # minimiser lies in [a, d]
            b, d, fd = d, c, fc         # the old c is reused as the new d
            c = b + (a - b) / phi
            fc = f(c)
        else:                           # minimiser lies in [c, b]
            a, c, fc = c, d, fd         # the old d is reused as the new c
            d = a + (b - a) / phi
            fd = f(d)
    return (b + a) / 2
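Applied to the worked example above (a quick check, using the golden_search function just defined):

f = lambda x: x**4 - 14*x**3 + 60*x**2 - 70*x
print(golden_search(f, 0.0, 2.0))   # ~0.78, inside the final interval [0.652, 0.944]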

4.1.3 Exercises

1. Find the minimum value of the one dimensional function f(x) = x² − 3x exp(−x), over [0, 1], using:
• Bisection Method
• Golden Search Method
Chapter 5

Multivariate Unconstrained Optimisation

Unconstrained optimisation is optimisation when we know we do not have to worry about the boundaries of the
feasible set:

min f(x)  s.t. x ∈ S,   (5.1)

where S is the feasible set. It should then be possible to find local minima and maxima just by looking at the
behaviour of the objective function; and indeed there are sufficient and necessary conditions. In this chapter these
conditions will be derived. The idea of a line in a particular direction is important for any unconstrained
optimisation method; we discuss this and derive the slope and curvature of the function f at a point on the line.

5.1 Terminology for Functions of Several Variables

For a function f(x) of x ∈ Rⁿ there exists, at any point x, a vector of first order partial derivatives, or gradient vector:

∇f(x) = [∂f/∂x₁(x), ∂f/∂x₂(x), . . . , ∂f/∂xₙ(x)]ᵀ = g(x).   (5.2)

f[x_, y_] := x^2/4 - 2 x^2 y - 3 x y + y^4/16


grad[x_, y_] := Grad[f[x, y], {x, y}]
normal[x_, y_] = Simplify[grad[x, y]/Sqrt[grad[x, y].grad[x, y]]]
{
Manipulate[
ContourPlot[f[x, y], {x, -2, 2}, {y, -2, 2},
Epilog -> Arrow[{pt, pt + normal @@ pt}],
PerformanceGoal -> "Quality", Contours -> 20,
PlotRange -> {{-2, 2}, {-2, 2}, {-30, 30}},
ImageSize -> Medium], {{pt, {.01, -0.1}}, Locator},
FrameLabel -> "Click a point to see its normal",
SaveDefinitions -> True],
Plot3D[x^2/4 - 2 x^2 y - 3 x y + y^4/16, {x, -2 , 2}, {y, -2, 2},
PlotRange -> Automatic, ColorFunction -> "DarkRainbow",
ImageSize -> Large]
}


Figure 5.1: Mathematica Demo of Gradient.

It can be shown that if the function f (x) is smooth, then at the point x the gradient vector ∇ f (x) (denoted by g (x)) is
always perpendicular to the contours (or surfaces of constant function value) and is the direction of maximum
increase of f (x) as seen in the Figure above. You can copy the Mathematica code to generate the output above. The
manipulation construct will allow you to move the point around to see the gradient at different contours.
If f(x) is twice continuously differentiable then at the point x there exists a matrix of second order partial derivatives
called the Hessian matrix:

H(x) = ∇²f(x),  the n × n matrix whose (i, j)-th entry is ∂²f/∂xᵢ∂xⱼ(x).   (5.3)

5.1.0.1 Example

Let f(x₁, x₂) = 5x₁ + 8x₂ + x₁x₂ − x₁² − 2x₂². Then:

∇f(x) = [5 + x₂ − 2x₁; 8 + x₁ − 4x₂],

and

∇²f(x) = [−2, 1; 1, −4].
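A minimal SymPy sketch reproducing this calculation (not part of the original notes):

import sympy as sp

x1, x2 = sp.symbols('x1 x2')
f = 5*x1 + 8*x2 + x1*x2 - x1**2 - 2*x2**2

grad = sp.Matrix([sp.diff(f, v) for v in (x1, x2)])   # gradient, Equation (5.2)
H = sp.hessian(f, (x1, x2))                           # Hessian, Equation (5.3)
print(grad)   # [5 + x2 - 2*x1, 8 + x1 - 4*x2], up to term ordering
print(H)      # [[-2, 1], [1, -4]]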
Definition 5.1 (Feasible Direction). A vector d ∈ Rⁿ, d ≠ 0, is a feasible direction at x ∈ S if there exists α₀ > 0 such
that x + αd ∈ S for all α ∈ [0, α₀].
Definition 5.2 (Directional Derivative). Let f : Rⁿ → R be a real-valued function and let d be a feasible direction at
x ∈ S. The directional derivative of f in the direction of d, denoted by dᵀ∇f(x), is given by:

dᵀ∇f(x) = lim_{α→0} [f(x + αd) − f(x)] / α.   (5.4)

Figure 5.2: An Example of a Line in a Particular Direction.

If ‖d‖ = 1, then dᵀ∇f(x) is the rate of increase of f at x in the direction d. To compute the above directional
derivative, suppose that x and d are given. Then f(x + αd) is a function of α, and:

dᵀ∇f(x) = (d/dα) f(x + αd) |_{α=0}.   (5.5)

5.2 A Line in a Particular Direction in the Context of Optimisation

A line is a set of points x such that:

x = x₀ + αd, ∀ α,   (5.6)

where d and x₀ are given. For α ≥ 0, Equation (5.6) is a half-line. The point x₀ is a fixed point (corresponding to α = 0)
along the line, and d is the direction of the line. For instance, if we take the fixed point x₀ to be (2, 2)ᵀ and the direction
d = (3, 1)ᵀ then the Figure below shows the line in the direction of d.
The vector d is indicated by the arrow. We may normalise the vector d so that dᵀd = Σᵢ dᵢ² = 1. This does not change
the line, but only the value of α associated with any point along the line. For example:

import numpy as np
from numpy import linalg as LA

d = np.array([3, 1])
alpha = LA.norm(d, 2)        # the Euclidean length of d
print(d)

## [3 1]
print(alpha)

## 3.1622776601683795
norm_d = d/alpha             # rescale d to unit length
print('The normalised vector d is:')

## The normalised vector d is:

print(norm_d)

## [0.9486833 0.31622777]
print('The normalised d^Td gives:', np.dot(norm_d, norm_d))

## The normalised d^Td gives: 0.9999999999999999

print('So alpha x normalised d returns d:')

## So alpha x normalised d returns d:

print(alpha*norm_d)

## [3. 1.]

We now use the gradient and the Hessian of f(x) to derive the derivative of f(x) along a line of any direction. For a
fixed line of a given direction like Equation (5.6) we see that the points on the line are a function of α only. Hence a
change in α causes change in all coordinates of x(α). The derivative of f(x) with respect to α is:

df(x(α))/dα = ∂f(x(α))/∂x₁ · dx₁(α)/dα + ∂f(x(α))/∂x₂ · dx₂(α)/dα + ⋯ + ∂f(x(α))/∂xₙ · dxₙ(α)/dα.   (5.7)

Equation (5.7) represents the derivative of f(x) at any point x(α) along the line. The operator d/dα can be
expressed as:

d/dα = (∂/∂x₁)(dx₁/dα) + (∂/∂x₂)(dx₂/dα) + ⋯ + (∂/∂xₙ)(dxₙ/dα) = dᵀ∇.   (5.8)

The slope of f(x) at x(α) can be written as:

df/dα = dᵀ∇f(x(α)) = ∇f(x(α))ᵀd.   (5.9)

Likewise, the curvature of f(x(α)) along the line:

d²f/dα² = d/dα (df(x(α))/dα) = dᵀ∇(∇fᵀd) = dᵀ∇²f d,   (5.10)

where ∇f and ∇²f are evaluated at x(α). These (slope and curvature), when evaluated at α = 0, are respectively known
as the derivative (also called the slope, since f = f(α) is now a function of the single variable α) and the curvature of
f at x₀ in the direction of d.

5.2.0.1 Example

Let us consider the Rosenbrock function:

f(x) = 100(x₂ − x₁²)² + (1 − x₁)²   (5.11)

If x₀ = (0, 0)ᵀ, then show that the slope of f(x) along the line generated by d = (1, 0)ᵀ is dᵀ∇f = −2 and the curvature is
dᵀGd = 2, where G = ∇²f(x₀).
Solution:

∇f = [−400x₁(x₂ − x₁²) − 2(1 − x₁); 200(x₂ − x₁²)],  which at x₀ equals [−2; 0].

Therefore dᵀ∇f = [1 0] × [−2 0]ᵀ = −2. Next:

∇²f = [−400(x₂ − x₁²) + 800x₁² + 2, −400x₁; −400x₁, 200],  which at x₀ equals G = [2, 0; 0, 200].

Thus dᵀGd = [1 0] × [2, 0; 0, 200] × [1 0]ᵀ = 2.

These definitions of slope and curvature depend on the size of d, and this ambiguity can be resolved by
requiring that ‖d‖ = 1. Hence Equation (5.9) is the directional derivative in the direction of a unit vector
d, given by dᵀ∇f(x). Likewise the curvature along the line in the direction of the unit vector is
given by dᵀ∇²f(x)d.
Since x(α) = x₀ + αd, at α = 0 we have x(0) = x₀. Therefore, the function value is f(x(0)) = f(x₀), the slope at
α = 0 in the direction of d is f′(0) = dᵀ∇f(x₀) and the curvature at α = 0 is f′′(0) = dᵀG(x₀)d.
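A minimal NumPy check of this example (not part of the original notes):

import numpy as np

def grad(x):    # gradient of 100*(x2 - x1**2)**2 + (1 - x1)**2
    return np.array([-400*x[0]*(x[1] - x[0]**2) - 2*(1 - x[0]),
                     200*(x[1] - x[0]**2)])

def hess(x):    # Hessian of the same function
    return np.array([[-400*(x[1] - x[0]**2) + 800*x[0]**2 + 2, -400*x[0]],
                     [-400*x[0], 200.0]])

x0 = np.array([0.0, 0.0])
d = np.array([1.0, 0.0])
print(d @ grad(x0))        # slope:     -2.0
print(d @ hess(x0) @ d)    # curvature:  2.0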

5.3 Taylor Series for Multivariate Function

In the context of optimisation involving a smooth function f(x) the Taylor series is indispensable. Since x = x(α) =
x₀ + αd for a fixed point x₀ and a given direction d, f(x) at x(α) becomes a function of the single variable α. Hence,
f(x) = f(x(α)) = f(α). Therefore, expanding the Taylor series around zero we have:

f(α) = f(0 + α) = f(0) + αf′(0) + ½α²f′′(0) + ⋯   (5.12)

But f(α) = f(x₀ + αd) is the value of the function f(x) of many variables along the line x(α). Hence, we can re-write
Equation (5.12) as:

f(x₀ + αd) = f(x₀) + αdᵀ∇f(x₀) + ½α²dᵀ[∇²f(x₀)]d + ⋯   (5.13)
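A minimal numerical check of Equation (5.13) (not part of the original notes), reusing the slope −2 and curvature 2 computed for the Rosenbrock function in Section 5.2:

import numpy as np

f = lambda x: 100*(x[1] - x[0]**2)**2 + (1 - x[0])**2
x0, d = np.array([0.0, 0.0]), np.array([1.0, 0.0])

for alpha in (0.1, 0.01):
    quad = f(x0) + alpha*(-2.0) + 0.5*alpha**2*2.0   # truncated series (5.13)
    print(abs(f(x0 + alpha*d) - quad))               # remainder vanishes as alpha -> 0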

5.4 Quadratic Forms

The quadratic function in n variables may be written as:


f(x) = ½xᵀAx + bᵀx + c,   (5.14)

where c ∈ R, b is a real n-vector and A is an n × n real matrix that can be chosen in a non-unique manner. It is usually
chosen symmetric, in which case it follows that:

a₁₁x₁² + 2a₁₂x₁x₂ + a₂₂x₂² + 2a₁₃x₁x₃ + ⋯ + aₙₙxₙ² = Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ aᵢⱼxᵢxⱼ,   (5.15)

and:

∇ f (x) = Ax + b; H(x) = A. (5.16)

The form is said to be positive definite if xᵀAx > 0 for all x ≠ 0 (with xᵀAx = 0 iff x = 0). The form is said to be
positive semi-definite if xᵀAx ≥ 0 for all x. Similar definitions apply to negative definite and negative semi-definite
with the inequalities reversed.

5.4.0.1 Example

Write A(x) = x₁² + 5x₁x₂ + 4x₂² in matrix form.

Solution: A(x) = (x₁, x₂) [1, 5/2; 5/2, 4] (x₁; x₂).
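A minimal NumPy check of this matrix form (not part of the original notes; the test point is an arbitrary choice):

import numpy as np

A = np.array([[1.0, 2.5],
              [2.5, 4.0]])                  # the symmetric choice above

x = np.array([1.3, -0.7])                   # any test point
print(x @ A @ x)                            # the quadratic form x^T A x
print(x[0]**2 + 5*x[0]*x[1] + 4*x[1]**2)    # matches the original polynomial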

5.5 Stationary Points

In the following chapters we will be concerned with gradient based minimization methods. Therefore, we only
consider the minimization of smooth functions. We will not consider the non-smooth minima as they do not satisfy
the same conditions as smooth minima. We, however, will consider the case of saddle point. Hence, we assume that
the first and the second derivative exist.

We can classify definiteness by looking at the eigenvalues of ∇2 f (x). Specifically:

• If ∇²f(x∗) is indefinite, i.e. the λᵢ are of mixed sign, then x∗ is a saddle point.
• If ∇²f(x∗) is positive definite, i.e. all λᵢ > 0, then x∗ is a minimum.
• If ∇²f(x∗) is negative definite, i.e. all λᵢ < 0, then x∗ is a maximum.
• If ∇²f(x∗) is positive semi-definite, i.e. all λᵢ ≥ 0, then the surface at x∗ is locally a half cylinder (a valley of minima).

These can be seen in the Figure below:


[Figure: quadratic surfaces illustrating positive definite, negative definite, positive semi-definite and indefinite (saddle point) Hessians.]

In summary:
Let G = ∇2 f (x), i.e. the Hessian.
• G(x) is positive semi-definite if xᵀGx ≥ 0, ∀x
• G(x) is negative semi-definite if xᵀGx ≤ 0, ∀x
• G(x) is positive definite iff xᵀGx > 0, ∀x ≠ 0
• G(x) is negative definite iff xᵀGx < 0, ∀x ≠ 0
• G(x) is indefinite iff xᵀGx takes both negative and positive values
and:
• f(x) is concave iff G(x) is negative semi-definite
• f(x) is strictly concave if G(x) is negative definite
• f(x) is convex iff G(x) is positive semi-definite
• f(x) is strictly convex if G(x) is positive definite

5.5.1 Tests for Positive Definiteness

There are a number of ways to test for positive or negative definiteness, namely:

5.5.1.1 Compute the Eigenvalues

5.5.1.1.1 Example
Classify the stationary points of the function

f(x) = 2x₁² + x₁x₂² + x₂².

Solution:

The stationary points are the solutions of:

∂f/∂x₁ = 4x₁ + x₂² = 0
∂f/∂x₂ = 2x₁x₂ + 2x₂ = 0

which gives x₁ = (0, 0)ᵀ, x₂ = (−1, 2)ᵀ and x₃ = (−1, −2)ᵀ. The Hessian matrix is:

G = [4, 2x₂; 2x₂, 2x₁ + 2]

Thus:

G₁ = [4, 0; 0, 2]

The eigenvalues are the solutions of:

(4 − λ)(2 − λ) = 0,

which gives λ = 4, 2. Thus x₁ corresponds to a minimum. Similarly:

G₂ = [4, 4; 4, 0]

has eigenvalues:

λ = 2 + √20, 2 − √20.

Thus x₂ corresponds to a saddle point. Finally:

G₃ = [4, −4; −4, 0]

has the same eigenvalues as G₂ and therefore x₃ corresponds to a saddle point.


[Figure: surface plots of f(x) = 2x₁² + x₁x₂² + x₂² from two viewpoints.]

5.5.1.2 Principal Minors

From the Hessian we can compute the determinants of all leading principal minors. If these are all greater than
zero, then the Hessian is positive definite. Utilising the example above, if:

G₁ = [4, 0; 0, 2]

then the first principal minor is just det(4) = 4 > 0. The second and final principal minor is the determinant of the
entire matrix, so:

det [4, 0; 0, 2] = 8 − 0 > 0.

Therefore G₁ is positive definite. G₂ and G₃ are dealt with similarly. However, to prove negative definiteness we need
to prove (−1)ᵏDₖ > 0, where Dₖ is the determinant of the k-th leading principal minor.

This approach would be preferable when dealing with the case of large matrices.
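A minimal NumPy sketch of the eigenvalue test of Section 5.5.1.1 (not part of the original notes):

import numpy as np

def classify(G):
    # Classify a symmetric matrix by the signs of its eigenvalues.
    lam = np.linalg.eigvalsh(G)
    if np.all(lam > 0):  return 'positive definite'
    if np.all(lam < 0):  return 'negative definite'
    if np.all(lam >= 0): return 'positive semi-definite'
    if np.all(lam <= 0): return 'negative semi-definite'
    return 'indefinite'

print(classify(np.array([[4.0, 0.0], [0.0, 2.0]])))   # G1: positive definite
print(classify(np.array([[4.0, 4.0], [4.0, 0.0]])))   # G2: indefinite (saddle)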

5.6 Necessary and Sufficient Conditions

Theorem 5.1 (First Order Necessary Condition (FONC) for Local Maxima/Minima). Suppose f(x) has continuous first
partial derivatives at all points of S ⊂ Rⁿ. If x∗ is an interior point of the feasible set S and x∗ is a local minimum
or maximum of f(x), then:

∇f(x∗) = 0.   (5.17)

Alternatively, if x∗ ∈ S is a local minimum or maximum, then at the point x∗ :

∂f(x∗)/∂xᵢ = 0;  i = 1, 2, . . . , n.   (5.18)

Theorem 5.2 (Second Order Necessary Condition (SONC) for Local Maxima/Minima). Let f be twice continuously
differentiable on the feasible set S, let x∗ be a local minimiser of f(x), and let d be a feasible direction at x∗. If
dᵀ∇f(x∗) = 0, then:

dᵀ∇²f(x∗)d ≥ 0.   (5.19)

Theorem 5.3 (Second Order Sufficient Condition (SOSC) for Strong Local Maxima/Minima). Let x∗ be an interior
point of S. If (i) ∇f(x∗) = 0 and (ii) dᵀ∇²f(x∗)d > 0 for all d ≠ 0, that is, the Hessian is positive definite, then x∗ is a
strong local minimiser of f(x).

5.6.0.1 Example

Let f(x) = x₁² + x₂². Show that x = (0, 0)ᵀ satisfies the FONC, the SONC and the SOSC, hence (0, 0)ᵀ is a strict local
minimiser. We see that ∇f(x) = (2x₁, 2x₂)ᵀ = 0 if and only if x₁ = x₂ = 0. It can also easily be shown that for all d ≠ 0,
dᵀ∇²f(x)d = 2d₁² + 2d₂² > 0. Hence ∇²f(x) is positive definite.

5.6.0.2 Example

f(x₁, x₂) = x₁⁴ + x₂⁴

∇f(x) = [4x₁³; 4x₂³]. The only stationary point is (0, 0)ᵀ. Now the Hessian is ∇²f = [12x₁², 0; 0, 12x₂²]. At the origin
the Hessian is [0, 0; 0, 0] and so there is no prediction of the minimum from the test, although it is easy to see that
the origin is a minimum.

5.6.0.3 Example

f(x₁, x₂) = (1/2c)(x₁²/a² − x₂²/b²),

where a, b, and c are constants. ∇f(x) = [x₁/(ca²); −x₂/(cb²)]. So the only stationary point is (0, 0)ᵀ. The Hessian is
∇²f(x) = [1/(ca²), 0; 0, −1/(cb²)]. This is clearly indefinite and hence (0, 0)ᵀ is a saddle point.

Thus in summary, the necessary and sufficient conditions for x∗ to be a strong local minimum are:

• ∇ f (x∗ ) = 0
• Hessian is positive definite

5.6.1 Exercises

1. Find the gradient vectors of the following functions (where x ∈ ℝⁿ):

• f(x) = cᵀx, c ∈ ℝⁿ
• f(x) = ½xᵀx
• f(x) = ½xᵀGx where G is symmetric

2. Find the slope and the curvature of the following functions.

• f(x) = 100(x₂ − x₁²) + (1 − x₁)² at (0, 0)ᵀ in the direction of (1, 0)ᵀ.
• f(x) = x₁² − 2x₁ + 3x₁x₂² + 4x₂³ at (−1, 1)ᵀ along (−1, 0)ᵀ.
3. Use the necessary condition of optimality to determine the optimiser of the following function

f(x₁, x₂) = (x₁ − 1)² + (x₂ − 1)² + x₁x₂

4. Prove that for a general quadratic function f(x) = c + bᵀx + ½xᵀGx, the Hessian G of f maps differences in
position into differences in gradient, i.e., g₁ − g₂ = G(x₁ − x₂).
5. For the following functions, find the points where the gradients vanish, and investigate which of these are
local minima, maxima or saddle points.

• f(x₁, x₂) = x₁(1 + x₁) + x₂(1 + x₂) − 1.
• f(x₁, x₂) = x₁² + x₁x₂ + x₂².
6. Consider the function f : ℝ² → ℝ determined by

f(x) = xᵀ [ 1  2 ] x + xᵀ [ 3 ] + 6
          [ 4  8 ]        [ 4 ]

• Find the gradient and Hessian of f at the point (1, 1).


• Find the directional derivative of f at the point (1, 1) in the direction of the maximal rate of increase.
• Find a point that satisfies the first order necessary condition (FONC). Does the point also satisfy the
second order necessary condition (SONC) for a minimum?
7. Find the stationary points of the function
f(x₁, x₂) = (x₁² − 4)² + x₂²

Show that f has an absolute minimum at each of the points (x₁, x₂) = (±2, 0). Show that the point (0, 0) is a
saddle point.
8. Show that any point x∗ on the line x₂ − 2x₁ = 0 is a weak global minimiser of

f(x) = 4x₁² − 4x₁x₂ + x₂²

9. Show that

f(x) = 3x₁² − x₂² + x₁³

has a strong local maximiser at (−2, 0)ᵀ and a saddle point at (0, 0)ᵀ, but has no minimisers.
Chapter 6

Gradient Methods for Unconstrained Optimisation

In this chapter we will study methods for solving nonlinear unconstrained optimisation problems. The nonlinear
minimisation algorithms described here are iterative methods which generate a sequence of points,
x₀, x₁, . . . say, or {xₖ} (subscripts denoting the iteration number), hopefully converging to a minimiser x∗ of f(x). Univariate
minimisation along a line in a particular direction is known as the line search technique; this one-dimensional
minimisation appears as the line search subproblem in multivariate unconstrained nonlinear minimisation.

6.1 General Line Search Techniques used in Unconstrained Multivariate Minimisation

The algorithms for multivariate minimisation are all iterative processes which fit into the same general framework:

At the beginning of the k-th iteration the current estimate of the minimiser is xₖ, and a search is made
in ℝⁿ from xₖ along a given vector direction dₖ (dₖ differs between minimisation methods) in
an attempt to find a new point xₖ₊₁ such that f(xₖ₊₁) is sufficiently smaller than f(xₖ). This process is
called line (or linear) search.

Line-search methods, therefore, generate the iterates by setting:

xk+1 = xk + αk dk (6.1)

where dk is a search direction and αk > 0 is chosen so that:

f (xk + αk dk ) = f (xk+1 ) < f (xk ), (6.2)

Therefore, for a given dₖ, a line-search procedure is used to choose an αₖ > 0 that approximately minimises f along
the ray {xₖ + αdₖ : α > 0}. Hence, the line search is a univariate minimisation involving the single variable αₖ
(since both xₖ and dₖ are known, f(xₖ + αdₖ) becomes a function of α only) such that:

f(αₖ) = f(xₖ + αₖdₖ).    (6.3)

Bear in mind that this single variable minimiser cannot always be obtained analytically and hence some numerical
techniques may be necessary.


6.1.1 Challenges in Computing Step Length αk

The challenge in finding a good αₖ lies in avoiding a step length that is either too long or too short. Consider the
figures below:

[Figure: iterates of f(x) = x² with a step size that is too big]

Here the objective function is f(x) = x² and the iterates xₖ₊₁ = xₖ + αₖdₖ are generated by the descent directions
dₖ = (−1)ᵏ⁺¹ with steps αₖ = 2 + 3/2ᵏ⁺¹ and an initial starting point of x₀ = 2.

[Figure: iterates of f(x) = x² with a step size that is too small]

Here the objective function is f(x) = x² and the iterates xₖ₊₁ = xₖ + αₖdₖ are generated by the descent direction
dₖ = −1 with steps αₖ = 1/2ᵏ⁺¹ and an initial starting point of x₀ = 2.

[Figure: iterates of f(x) = x² with a varying step size αₖ]
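The two failure modes above are easy to reproduce. A small Python sketch of our own that generates both sequences from the formulas in the captions:

# Reproduce the two step-size failure modes on f(x) = x^2.
f = lambda x: x**2

def iterates(d, alpha, x0=2.0, n=8):
    xs = [x0]
    for k in range(n):
        xs.append(xs[-1] + alpha(k) * d(k))
    return xs

# too big: d_k = (-1)^(k+1), alpha_k = 2 + 3/2^(k+1): oscillates without converging
too_big = iterates(d=lambda k: (-1)**(k + 1), alpha=lambda k: 2 + 3 / 2**(k + 1))
# too small: d_k = -1, alpha_k = 1/2^(k+1): decreases but stalls short of 0
too_small = iterates(d=lambda k: -1, alpha=lambda k: 1 / 2**(k + 1))

print([round(x, 3) for x in too_big])
print([round(x, 3) for x in too_small])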

6.2 Exact and Inexact Line Search

Given the direction dₖ and the point xₖ, f(xₖ + αdₖ) becomes a function of α. Hence it is simply a one dimensional
minimisation with respect to α. The solution of df(α)/dα = 0 will determine the exact location of the minimiser αₖ.
However, it may not be possible to locate the exact value of αₖ for which df(α)/dα = 0; it may even require a very large
number of iterations to locate the minimiser αₖ. Nonetheless, the idea is conceptually useful. Notice that for exact
line search the slope df/dα at αₖ must be zero. Therefore, we get:

df(xₖ₊₁)/dα = ∇f(xₖ₊₁)ᵀ dxₖ₊₁/dα = g(xₖ₊₁)ᵀdₖ = 0.    (6.4)

Line search algorithms used in practice are much more involved than the one dimensional search methods (optimi-
sation methods) presented in the previous chapter. The reason for this stems from several practical considerations.
First, determining the value of αk that exactly minimises f (α) may be computationally demanding; even worse,
the minimiser of f (α) may not even exist. Second, practical experience suggests that it is better to allocate more
computational time on iterating the optimisation algorithm rather than performing exact line searches. These
considerations led to the development of conditions for terminating line search algorithms that would result in
low-accuracy line searches while still securing a decrease in the value of f from one iteration to the next.
In practice, the line search is terminated when some descent conditions along the line xk + αdk are satisfied. Hence,
it is no longer necessary to go for the exact line search. The line search carried out in this way is known as the
inexact line search. A further justification for the inexact line search is that it is not efficient to determine the line
search minima to a high accuracy when xk is far from the minimiser x∗ . Under these circumstances, nonlinear
minimisation algorithms employ an inexact or approximate line search. To sum up, exact line search is a theoretical
concept, and the inexact line search is its practical implementation.

Remark:
Each iteration of a line search method computes a search direction dk and then decides how far to move along that
direction. The iteration is given by
xk+1 = xk + αk dk ,
where the positive scalar αk is called the step length. The success of a line search method depends on effective
choices of both the direction dₖ and the step length αₖ. Most line search algorithms require dₖ to be a descent
direction.

6.2.1 Algorithmic Structure

The typical behaviour of a minimisation algorithm is that it repeatedly generates points xₖ such that, as k increases,
xₖ moves closer to x∗. A feature of a minimisation algorithm is that f(xₖ) is reduced at each iteration, which
implies that the stationary point found turns out to be a local minimiser. A minimisation algorithm requires
an initial estimate, say x₀, to be supplied. At each iteration the algorithm finds a descent direction along which the function is
minimised. This minimisation in a particular direction is known as the line search. The basic structure of
the general algorithm is:
1. Initialise the algorithm with an estimate x₀. Set k = 0.
2. Determine a search direction dₖ at xₖ.
3. Find αₖ to minimise f(xₖ + αdₖ) with respect to α.
4. Set xₖ₊₁ = xₖ + αₖdₖ.
5. The line search is stopped when f(xₖ₊₁) < f(xₖ).
6. If the algorithm meets the stopping criteria then STOP, ELSE set k = k + 1 and go back to (2).

Different minimisation methods select dₖ in different ways in step (2). Steps (3) and (4) form the one dimensional sub-problem
carried out along the line xₖ₊₁ = xₖ + αₖdₖ for α ∈ [0, 1]. The direction dₖ at xₖ must satisfy the descent condition. A skeleton of this framework in Python is sketched after this paragraph.
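Below is a minimal Python sketch of this framework (our own illustration); the search-direction and line-search rules are left as pluggable functions, since each method in this chapter chooses them differently.

import numpy as np

def line_search_descent(f, grad, x0, direction, step_length,
                        tol=1e-6, max_iter=1000):
    """Generic line-search framework: x_{k+1} = x_k + alpha_k d_k."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:          # stopping criterion
            break
        d = direction(x, g)                  # step (2): search direction
        alpha = step_length(f, x, d, g)      # step (3): line search
        x = x + alpha * d                    # step (4): update
    return x

# Example: steepest descent with a crude fixed step on f(x) = 2x1^2 + 3x2^2
f = lambda x: 2*x[0]**2 + 3*x[1]**2
grad = lambda x: np.array([4*x[0], 6*x[1]])
x_star = line_search_descent(f, grad, [1.0, 1.0],
                             direction=lambda x, g: -g,
                             step_length=lambda f, x, d, g: 0.1)
print(x_star)   # close to (0, 0)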

6.3 The Descent Condition

Central to the development of the gradient based minimisation methods is the idea of a descent direction. Conditions
for the descent direction can be obtained using a Taylor series around the point xₖ. Using two terms of the Taylor
series we have:

f(xₖ + αdₖ) − f(xₖ) = αdₖᵀ∇f(xₖ) + · · ·    (6.5)

Clearly the descent condition can be seen to be:

dₖᵀ∇f(xₖ) < 0,    (6.6)

since we require the left hand side of Equation (6.5) to be negative.

6.4 The Direction of Greatest Reduction

A simple line search descent method is the steepest descent method in which:

dₖ = −∇f(xₖ) = −gₖ, ∀k    (6.7)

From Equation (6.5) we see that:

fₖ₊₁ − fₖ = αₖdₖᵀgₖ    (6.8)
         = αₖ‖dₖ‖‖gₖ‖ cos θ,    (6.9)

where θ can be interpreted geometrically as the angle between dₖ and gₖ. If we allow θ to vary while holding αₖ, ‖dₖ‖ and
‖gₖ‖ constant, then the right hand side of Equation (6.9) is most negative when θ = π. Thus when αₖ is sufficiently
small, the greatest reduction in the function is obtained in the direction:

dₖ = −gₖ    (6.10)

This negative gradient direction, which satisfies the descent condition (6.6), gives rise to the method of steepest descent.

6.5 The Method of Steepest Descent

Here the search direction is taken as the negative gradient, and the step size αₖ is chosen to achieve the
maximum decrease in the objective function f at each step. Specifically we solve the problem:

Minimise f(x⁽ᵏ⁾ − α∇f(x⁽ᵏ⁾)) w.r.t. α    (6.11)

This is now a one-dimensional optimisation problem.

6.5.1 Steepest Descent Algorithm

Given x₀, for all iterations k = 0, 1, 2, . . . until the stopping criterion is met, do:

• Compute the gradient g(xₖ) = ∇f(xₖ).
• Compute αₖ such that f(xₖ − αₖgₖ) = min over α of f(xₖ − αgₖ).
• Compute xₖ₊₁ = xₖ − αₖgₖ.
• If the stopping criterion is met STOP, ELSE set k = k + 1 and return to the first step.

6.5.2 Convergence Criteria

In practice the algorithm is terminated if some convergence criterion is satisfied. Usually termination is enforced at
iteration k if one, or a combination, of the following is met:

• ‖xₖ − xₖ₋₁‖ < ε₁
• ‖∇f(xₖ)‖ < ε₂
• |f(xₖ) − f(xₖ₋₁)| < ε₃

Here ε₁, ε₂ and ε₃ are designated small positive tolerances.

6.5.2.1 Example

Consider f(x) = 2x₁² + 3x₂², where x₀ = (1, 1)ᵀ. Use two iterations of steepest descent.

Solution:

Compute ∇f(x) = (4x₁, 6x₂)ᵀ = g(x); the steepest descent direction is d = −g.

First Iteration:

We know that:

x₁ = x₀ − α₀g(x₀),

so:

x₁ = (1, 1)ᵀ − α(4, 6)ᵀ = (1 − 4α, 1 − 6α)ᵀ.

Therefore:

f(x₀ − αg(x₀)) = 2(1 − 4α)² + 3(1 − 6α)²
             = 2 − 16α + 32α² + 3 − 36α + 108α²
⇒ df/dα = 280α − 52 = 0
⇒ α₀ = 52/280 = 13/70.

Finally:

x₁ = (1 − 4·(13/70), 1 − 6·(13/70))ᵀ = (9/35, −4/35)ᵀ.

Second Iteration:

We have:

x₂ = x₁ − α₁g(x₁)

Compute (simplified here):

f(x₁ − αg(x₁)) = (1/35²)[2(9 − 36α)² + 3(−4 + 24α)²].

We get:

df/dα = 0 ⇒ 60α = 13 ⇒ α₁ = 13/60.

Therefore:

x₂ = x₁ − α₁g(x₁) = (9/35, −4/35)ᵀ − (13/60)(36/35, −24/35)ᵀ = (6/175, 6/175)ᵀ.
The process continues in the same manner as above. We can see by inspection that the function achieves its
minimum at (0, 0); the Python code below provides a sanity check.
It is also worth noting that since this is a quadratic function, we can actually use another technique. We will redo the
first iteration as illustration. Specifically, quadratic functions allow α to be solved using:
αₖ = −gₖᵀdₖ / (dₖᵀQdₖ).
Thus:

First Iteration:

Compute f(x₀) = 5, g(x₀)ᵀ = (4, 6) and

Q = [ 4  0 ]
    [ 0  6 ]

Therefore, with dₖ = −gₖ, we have −gₖᵀdₖ = gₖᵀgₖ = 4² + 6² = 52 and dₖᵀQdₖ = 4·4² + 6·6² = 280, so:

α₀ = −gₖᵀdₖ / (dₖᵀQdₖ) = 52/280 = 13/70

Thus:

x₁ = (1, 1)ᵀ − (13/70)(4, 6)ᵀ = (9/35, −4/35)ᵀ

Similarly, the process repeats.
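The sanity check referred to above: a short steepest descent loop of our own using the exact quadratic step length, confirming convergence to (0, 0).

import numpy as np

f = lambda x: 2*x[0]**2 + 3*x[1]**2
grad = lambda x: np.array([4*x[0], 6*x[1]])
Q = np.array([[4.0, 0.0], [0.0, 6.0]])

x = np.array([1.0, 1.0])
for k in range(10):
    g = grad(x)
    d = -g
    alpha = -(g @ d) / (d @ Q @ d)   # exact step for a quadratic
    x = x + alpha * d
    print(k + 1, x, f(x))

The first printed iterate is (9/35, −4/35), matching the hand calculation above.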

[Figure: surface plot of f(x₁, x₂) = 2x₁² + 3x₂²]

[Figure: contour plot of f(x₁, x₂) = 2x₁² + 3x₂²]

6.5.3 Inexact Line Search

Although you will only cover inexact line search techniques in the third year syllabus, we will quickly introduce a
very simple inexact technique for use in your labs.

6.5.3.1 Backtracking Line Search

One way to adaptively choose the step size is to do the following:

• First fix a parameter 0 < β < 1


• Then at each iteration, start with t = 1, and while

f(x − t∇f(x)) > f(x) − (t/2)‖∇f(x)‖²,

update t = βt.

This is a simple technique and tends to work quite well in practice. For further reading you can consult Convex
Optimisation by Boyd.
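A minimal sketch of this backtracking rule in Python (our own illustration, with β = 0.8 as an assumed choice):

import numpy as np

def backtracking_step(f, grad_f, x, beta=0.8):
    """Shrink t until the sufficient-decrease condition above holds."""
    g = grad_f(x)
    t = 1.0
    while f(x - t * g) > f(x) - (t / 2) * np.dot(g, g):
        t *= beta
    return t

# gradient descent with backtracking on f(x) = 2x1^2 + 3x2^2
f = lambda x: 2*x[0]**2 + 3*x[1]**2
grad_f = lambda x: np.array([4*x[0], 6*x[1]])

x = np.array([1.0, 1.0])
while np.linalg.norm(grad_f(x)) > 1e-6:
    t = backtracking_step(f, grad_f, x)
    x = x - t * grad_f(x)
print(x)   # close to (0, 0)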

[Figure: surface plot of f(x₁, x₂) = 2x₁² + 3x₂²]

6.5.4 Exercises

1. Show that the value of the function

ax₁² + bx₂² + cx₃²

reached after taking a single step of the steepest descent method from the point (1, 1, 1)ᵀ is:

[ab(b − a)² + bc(c − b)² + ca(a − c)²] / (a³ + b³ + c³).
2. Show that if exact line search is carried out on the quadratic

½xᵀQx + bᵀx + c

using the iteration

xₖ₊₁ = xₖ + αₖdₖ,

then:

αₖ = −gₖᵀdₖ / (dₖᵀQdₖ).
3. Compute the first two iterations of the method of steepest descent applied to the objective function

f(x) = 4x₁² + x₂² − x₁²x₂

with x₀ = (1, 1)ᵀ. Use exact line search.



4. Use three iterations of the steepest descent method on the function

f(x) = 3x₁² + 2x₂²

with initial point (1, 1)ᵀ.

6.6 The Gradient Descent Algorithm and Machine Learning

We will briefly look at what we have learnt from the machine learning perspective, to emphasise
the power of this chapter. In machine learning you will find the gradient descent algorithm everywhere. While
the literature may seem to allude to this method being new, powerful and cool, it is really nothing more than the
method of steepest descent introduced above.

6.6.1 Basic Example

Let's try to find a local minimum of the function f(x) = x³ − 2x² + 2:

[Figure: plot of f(x) = x³ − 2x² + 2 on the interval [−1, 2.5]]

From the above plot we can see that there is a local minimum somewhere around 1.3 - 1.4 on the x-axis.
Of course, we normally won't be afforded the luxury of such information a priori, so let us just assume we
arbitrarily set our starting point to be x₀ = 2. Implementing gradient descent with a fixed step size, or learning
rate (in the context of ML), we have:

f = lambda x: x**3 - 2*x**2 + 2   # the objective function

x_old = 0
x_new = 2            # The algorithm starts at x=2
n_k = 0.1            # step size fixed at 0.1
precision = 0.0001   # tolerance value
x_list, y_list = [x_new], [f(x_new)]

# returns the value of the derivative of our function
def f_prime(x):
    return 3*x**2 - 4*x

while abs(x_new - x_old) > precision:
    x_old = x_new
    s_k = -f_prime(x_old)
    x_new = x_old + n_k * s_k
    x_list.append(x_new)
    y_list.append(f(x_new))

print("Local minimum occurs at:", x_new)

## Local minimum occurs at: 1.3334253508453249


print("Number of steps:", len(x_list))

## Number of steps: 17
How did the algorithm look step by step?

[Figure: two panels, "Gradient descent" and "Gradient descent (zoomed in)", showing the iterates on f(x) = x³ − 2x² + 2]

In our above implementation we had a fixed step size n_k. In machine learning, this is called the learning rate. You'll
notice this is contrary to the algorithm in the aforementioned pseudocode. Assuming a fixed
learning rate made the implementation easier, but it can produce the issues mentioned at the beginning of the chapter.

6.6.2 Adaptive Step-Size

One means of overcoming this issue is to use adaptive step-sizes. This can be done using scipy’s fmin function to
find the optimal step-size at each iteration.
from scipy.optimize import fmin

# we set up this function to pass into the fmin algorithm
def f2(n, x, s):
    x = x + n*s
    return f(x)

x_old = 0
x_new = 2            # The algorithm starts at x=2
precision = 0.0001

x_list, y_list = [x_new], [f(x_new)]

# returns the value of the derivative of our function
def f_prime(x):
    return 3*x**2 - 4*x

while abs(x_new - x_old) > precision:
    x_old = x_new
    s_k = -f_prime(x_old)
    # use scipy fmin function to find ideal step size.
    # Uses the downhill simplex algorithm which is a zero-order method
    n_k = fmin(f2, 0.1, (x_old, s_k), full_output=False, disp=False)
    x_new = x_old + n_k * s_k
    x_list.append(x_new)
    y_list.append(f(x_new))

print("Local minimum occurs at ", float(x_new))

print("Local minimum occurs at ", float(x_new))

## Local minimum occurs at 1.3333333284505209


print("Number of steps:", len(x_list))

## Number of steps: 4
So we can see that using adaptive step sizes we've reduced the number of iterations to convergence from 17 to
4. This is a substantial reduction; however, it must be noted that it takes time to compute the appropriate step size
at each iteration. This highlights a major issue in decision making for optimisation: trying to find the balance
between speed and accuracy.
How did the modified algorithm look step by step?
Well we can see that it converges rapidly and after the first two iterations, we need to zoom in to see further
improvements.
[Figure: three panels ("Gradient descent", "zoomed in", "zoomed in more") showing the rapid convergence of the adaptive-step iterates]

6.6.3 Decreasing Step-Size

Instead of using computational resources to find an optimal step size at each iteration, we could apply a
damping factor at each step to reduce the step size over time. For example:

η(t + 1) = η(t) / (1 + t·d)

x_old = 0
x_new = 2            # The algorithm starts at x=2
n_k = 0.17           # step size
precision = 0.0001
t, d = 0, 1
x_list, y_list = [x_new], [f(x_new)]

# returns the value of the derivative of our function
def f_prime(x):
    return 3*x**2 - 4*x

while abs(x_new - x_old) > precision:
    x_old = x_new
    s_k = -f_prime(x_old)
    x_new = x_old + n_k * s_k
    x_list.append(x_new)
    y_list.append(f(x_new))
    n_k = n_k / (1 + t * d)
    t += 1

print("Local minimum occurs at:", x_new)

## Local minimum occurs at: 1.3308506740900838


print("Number of steps:", len(x_list))

## Number of steps: 6

We can now see that we've still substantially reduced the number of iterations required, without being bound to
finding an optimal step size at each iteration. This highlights the trade-off of finding cheap improvements that
aid convergence at minimal cost.

How Do We Use the Gradient Descent in Linear Regression?

While using these line search methods to find the minima of basic functions is interesting, one might wonder how this
relates to some of the regressions we are interested in performing. Let us consider a slightly more complicated example.
In this data set we have data relating to how temperature affects the noise produced by crickets. Specifically,
the data is a number of observations, or samples, of cricket chirp rates at various temperatures.

[Figure: scatter plot "cricket chirps vs temperature", plotting temperature in degrees Fahrenheit against chirps/sec for striped ground crickets]
What can we deduce from the plotted data?

We can see that the data set exhibits a linear relationship. Therefore, our aim is to find the equation of the
straight line

h_θ(x) = θ₀ + θ₁x,

that best fits all of our data points, i.e. minimises the residual error.

The function that we are trying to minimise in this case is:

J(θ₀, θ₁) = (1/2m) Σ_{i=1}^m (h_θ(xᵢ) − yᵢ)²

In this case, our gradient will be defined in two dimensions:

∂J/∂θ₀ = (1/m) Σ_{i=1}^m (h_θ(xᵢ) − yᵢ)

∂J/∂θ₁ = (1/m) Σ_{i=1}^m (h_θ(xᵢ) − yᵢ)·xᵢ

Below, we set up our functions for h, J and the gradient (here x, y and m are the chirp rates, temperatures and sample count from the data set above):

import numpy as np

h = lambda theta_0, theta_1, x: theta_0 + theta_1*x

def J(x, y, m, theta_0, theta_1):
    returnValue = 0
    for i in range(m):
        returnValue += (h(theta_0, theta_1, x[i]) - y[i])**2
    returnValue = returnValue/(2*m)
    return returnValue

def grad_J(x, y, m, theta_0, theta_1):
    returnValue = np.array([0., 0.])
    for i in range(m):
        returnValue[0] += (h(theta_0, theta_1, x[i]) - y[i])
        returnValue[1] += (h(theta_0, theta_1, x[i]) - y[i])*x[i]
    returnValue = returnValue/(m)
    return returnValue
import time
start = time.time()

theta_old = np.array([0., 0.])
theta_new = np.array([1., 1.])   # The algorithm starts at [1,1]
n_k = 0.001                      # step size
precision = 0.001
num_steps = 0
s_k = float("inf")

while np.linalg.norm(s_k) > precision:
    num_steps += 1
    theta_old = theta_new
    s_k = -grad_J(x, y, m, theta_old[0], theta_old[1])
    theta_new = theta_old + n_k * s_k

print("Local minimum occurs where:")

## Local minimum occurs where:


print("theta_0 =", theta_new[0])

## theta_0 = 25.128552558595363
print("theta_1 =", theta_new[1])

## theta_1 = 3.297264756251897
print("This took",num_steps,"steps to converge")

## This took 565859 steps to converge


end = time.time()
print(str(end - start) + 'seconds')

## 19.70560359954834seconds

It's clear that the algorithm takes quite a long time for such a trivial example. Let's check whether the values
we've obtained from the gradient descent are any good. We can get the true values for θ₀ and θ₁ with the following:
from scipy import stats as sp
start = time.time()
actualvalues = sp.linregress(x, y)
print("Actual values for theta are:")

## Actual values for theta are:


print("theta_0 =", actualvalues.intercept)

## theta_0 = 25.232304983426026
print("theta_1 =", actualvalues.slope)

## theta_1 = 3.2910945679475647
6.6. THE GRADIENT DESCENT ALGORITHM AND MACHINE LEARNING 69

end = time.time()
print(str(end - start) + 'seconds')

## 0.009187698364257812seconds
One thing this highlights is how much effort goes into optimising the functions found in these libraries. If one looks
at the code inside linregress, clever exploitations to speed up the computation can be found.
Now, let’s plot our obtained results on the original data set:

[Figure: the cricket chirps vs temperature data with the fitted regression line]
In our implementation above, we needed to compute the full gradient at each step. While this might not seem
important, it is! In this toy example we only have 15 data points; imagine the computational intractability
when millions of data points are involved.

6.6.4 Stochastic Gradient Descent

What we implemented above is often called Vanilla/Batch gradient descent. As pointed out, this implementation
means that we need to sum the cost of each sample in order to calculate the gradient of the cost function. This
means given 3 million samples, we would have to loop through 3 million times!
So to move a single step towards the minimum, one would need to calculate each cost 3 million times.
So what can we do to overcome this? We can use stochastic gradient descent. In this approach we use the cost
gradient of one sample at each iteration, rather than the sum of the cost gradients of all samples. Recall our gradient
equations from above:

∂J/∂θ₀ = (1/m) Σ_{i=1}^m (h_θ(xᵢ) − yᵢ),

∂J/∂θ₁ = (1/m) Σ_{i=1}^m (h_θ(xᵢ) − yᵢ)·xᵢ,

where:

h_θ(x) = θ₀ + θ₁x.
We now want to update our parameter values at each item in the training set, instead of after a pass over all of them, so that we can begin improving
straight away.
We can redefine our algorithm as the stochastic gradient descent for the simple linear regression as follows:
Randomly shuffle the data set
for k = 0, 1, 2, . . . do
    for i = 1 to m do

        (θ₀, θ₁)ᵀ := (θ₀, θ₁)ᵀ − α (2(h_θ(xᵢ) − yᵢ), 2xᵢ(h_θ(xᵢ) − yᵢ))ᵀ

    end for
end for
Depending on the size of the data set, we run through the entire data set 1 to k times.
The key advantage here is that, unlike batch gradient descent where we have to go through the entire data set
before making any progress, we can now make progress straight away as we move through the data set. This is the
primary reason why stochastic gradient descent is used when dealing with large data sets.

6.6.4.1 Additional Example

Let us look at another example with the use of stochastic gradient descent for linear regression. We can create a set
of 500 000 data points around the equation y = 2x + 17 + ε on the domain x ∈ [0, 100]. (The full example is demonstrated in the accompanying Jupyter notebook, as it breaks RStudio; a sketch is given below.)
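A minimal sketch of that experiment (our own; the noise scale, learning rate and number of passes are assumptions, not taken from the notes):

import numpy as np

np.random.seed(0)
n = 500000
x = np.random.uniform(0, 100, n)
eps = np.random.normal(0, 5, n)          # noise scale assumed
y = 2*x + 17 + eps

theta = np.array([0.0, 0.0])             # [theta_0, theta_1]
alpha = 1e-5                             # learning rate (assumed)

# three passes of stochastic gradient descent over a shuffled data set
for epoch in range(3):
    for i in np.random.permutation(n):
        err = (theta[0] + theta[1]*x[i]) - y[i]
        theta -= alpha * np.array([2*err, 2*err*x[i]])

print(theta)   # should approach roughly [17, 2]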
Chapter 7

Newton and Quasi-Newton Methods

The steepest descent method uses information based only on the first partial derivatives in selecting a suitable
search direction. This strategy is not always the most effective. A faster method may be obtained by approximating
the objective function f (x) as a quadratic q(x) and making use of a knowledge of the second partial derivatives. This
is the basis of Newton’s method. The idea behind this method is as follows. Given a starting point, we construct a
quadratic approximation to the objective function that matches the first and the second derivative of the original
objective function at that point. We then minimise the approximate (quadratic) function instead of the original
objective function. We then use the minimiser of the quadratic function to obtain the next iterate and repeat the
procedure iteratively. If the objective function is quadratic then the approximation is exact and the method
yields the true minimiser in one step. If, on the other hand, the objective function is not quadratic, then the
approximation will provide only an estimate of the position of the true minimiser.
We can obtain a quadratic approximation to the given twice continuously differentiable objective function using the
Taylor series expansion of f about the current x⁽ᵏ⁾, neglecting terms of order three and higher:

f(x) ≈ f(x⁽ᵏ⁾) + (x − x⁽ᵏ⁾)ᵀg⁽ᵏ⁾ + ½(x − x⁽ᵏ⁾)ᵀH(x⁽ᵏ⁾)(x − x⁽ᵏ⁾) = q(x),

where g = ∇f and H is the Hessian matrix. The minimum of the quadratic q(x) satisfies:

0 = ∇q(x) = g⁽ᵏ⁾ + H(x⁽ᵏ⁾)(x − x⁽ᵏ⁾),

or, inverting:

x = x⁽ᵏ⁾ − H⁻¹(x⁽ᵏ⁾)g⁽ᵏ⁾.

Newton's formula is:

x⁽ᵏ⁺¹⁾ = x⁽ᵏ⁾ − H⁻¹(x⁽ᵏ⁾)g⁽ᵏ⁾.    (7.1)

This can be rewritten as

H⁽ᵏ⁾d⁽ᵏ⁾ = −g⁽ᵏ⁾    (7.2)

where d⁽ᵏ⁾ = x⁽ᵏ⁺¹⁾ − x⁽ᵏ⁾.

Note that to solve g(x) = 0 in one dimension, we iterate xₖ₊₁ = xₖ − g(xₖ)/g′(xₖ). The above formula is the multidimensional extension of Newton's method.
The method requires that fₖ, gₖ and Hₖ, i.e. the function value, the gradient and the Hessian, be
made available at each iterate xₖ. Most importantly, the Newton method is only well defined if the
Hessian Hₖ is positive definite, because only then does q(x) have a unique minimiser. The
positive definiteness of the Hessian can only be guaranteed if the starting iterate x₀ is very near to the
minimiser x∗ of f(x).

The Newton method converges quickly when applied close to the minimiser. If the starting point (the initial
point) is far from the minimiser, then the algorithm may not converge.


7.0.0.1 Example

For example, let us take the following function:

f(x) = 100(x₂ − x₁²)² + (1 − x₁)².

Let us take x₀ = (0, 0)ᵀ. The gradient vector and the Hessian are respectively given by:

∇f(x) = g = ( −400x₁(x₂ − x₁²) − 2(1 − x₁),  200(x₂ − x₁²) )ᵀ,

and:

H(x) = [ 800x₁² − 400(x₂ − x₁²) + 2    −400x₁ ]
       [ −400x₁                        200    ]

So substituting x₀ gives:

g⁰ = (−2, 0)ᵀ;    H⁰ = [ 2  0   ]
                       [ 0  200 ]

Now using

H⁰d⁰ = −g⁰

(recall that Hᵏdᵏ = −gᵏ), we get:

d⁰ = −(H⁰)⁻¹g⁰ = [ 1/2    0     ] ( 2 )   ( 1 )
                 [ 0      1/200 ] ( 0 ) = ( 0 )

Recall:

dᵏ = xᵏ⁺¹ − xᵏ ⇒ d⁰ = x¹ − x⁰ ⇒ x¹ = d⁰ + x⁰

Thus:

x¹ = (1, 0)ᵀ + (0, 0)ᵀ = (1, 0)ᵀ.

Calculating the function value we have:

f(x¹) = 100 > f(x⁰) = 1,

which shows that the algorithm is diverging!
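The example can be verified numerically. A short sketch of our own of the basic Newton iteration on this function, confirming that the first step lands at (1, 0)ᵀ:

import numpy as np

f = lambda x: 100*(x[1] - x[0]**2)**2 + (1 - x[0])**2

def grad(x):
    return np.array([-400*x[0]*(x[1] - x[0]**2) - 2*(1 - x[0]),
                     200*(x[1] - x[0]**2)])

def hess(x):
    return np.array([[800*x[0]**2 - 400*(x[1] - x[0]**2) + 2, -400*x[0]],
                     [-400*x[0], 200.0]])

x = np.array([0.0, 0.0])
for k in range(3):
    d = np.linalg.solve(hess(x), -grad(x))   # solve H d = -g
    x = x + d
    print(k + 1, x, f(x))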



[Figure: surface plot of the Rosenbrock function]

7.1 The Modified Newton Method

The modified Newton method is

x⁽ᵏ⁺¹⁾ = x⁽ᵏ⁾ − αₖH⁻¹(x⁽ᵏ⁾)g⁽ᵏ⁾.

The step length parameter αₖ modifies the step taken in the search direction, usually to minimise f(x⁽ᵏ⁺¹⁾). Newton's
method applied without this modification does not necessarily produce a decrease in f(x⁽ᵏ⁺¹⁾), as described by the
above example.

To address this drawback of the Newton method, a line search is introduced in which fₖ₊₁ < fₖ is sought. As with the other
gradient based methods, the new iterate xₖ₊₁ is found by minimising f along the search direction dₖ such that:

xₖ₊₁ = xₖ + αₖdₖ

where αₖ is the value of α which minimises f(xₖ + αdₖ).

Although the Newton method without this modification may generate points where the function increases (see the
example above), the directions generated by the Newton method are initially downhill provided Hₖ is positive definite.

Remarks:

• Newton’s method always goes in a descent direction provided we do not go too far but sometimes Newton
over-steps the mark and does not work.
• The drawback to the method is that evaluating H⁻¹ can be expensive in computational time.

7.2 Convergence of Newton’s Method for Quadratic Functions

If f(x) = ½xᵀQx + xᵀb + c is a quadratic function with positive definite symmetric Q, then Newton's method reaches
the minimum in one step irrespective of the initial starting point.

Proof:
The gradient vector is g(x) = ∇f(x) = Qx + b. The Hessian H(x) = Q is constant. Hence, given x⁽⁰⁾,

x⁽¹⁾ = x⁽⁰⁾ − Q⁻¹g⁽⁰⁾
     = x⁽⁰⁾ − Q⁻¹(Qx⁽⁰⁾ + b)
     = −Q⁻¹b = x∗.

The result also holds if Q is negative definite, giving a strong local maximum, or if Q is symmetric indefinite, giving
x∗ as a saddle point.

7.3 Quasi-Newton Methods

The basic Newton method as it stands is not suitable as a general purpose algorithm, since Hₖ may not be positive
definite when xₖ is remote from the solution. Furthermore, as we have shown in the previous example, even if Hₖ is
positive definite, convergence may not occur. Quasi-Newton algorithms were developed to address these issues.
We start by describing the drawbacks of the Newton method. At each iteration (say, the k-th) of
Newton's method a new matrix Hₖ has to be calculated (even if the method uses a line search), and then either the
inverse of this matrix has to be found or a system of equations has to be solved before the new point x⁽ᵏ⁺¹⁾ is found
using x⁽ᵏ⁺¹⁾ = x⁽ᵏ⁾ + d⁽ᵏ⁾. Quasi-Newton methods avoid the calculation of a new matrix at each iteration; rather, they
only update the (positive definite) matrix of the previous iteration, and this matrix also remains positive definite. The
method also does not need to solve a system of equations: it finds its direction using the positive definite matrix
and finds the step length using a line search.
The introduction of quasi-Newton methods greatly increased the range of problems which could be solved. This
type of method is like the Newton method with line search, except that Hₖ⁻¹ at each iteration is approximated by a
symmetric positive definite matrix Gₖ, which is updated from iteration to iteration. Thus the k-th iteration has the
basic structure:
1. Set dₖ = −Gₖgₖ
2. Line search along dₖ giving xₖ₊₁ = xₖ + αₖdₖ
3. Update Gₖ giving Gₖ₊₁
The initial positive definite matrix is chosen as G₀ = I. Potential advantages of the method (as against Newton's
method) are:
• Only first derivatives are required (second derivatives are required in the Newton method)
• Gₖ positive definite implies the descent property (Hₖ may be indefinite in the Newton method)
Much of the interest lies in the updating formula which enables Gₖ₊₁ to be calculated from Gₖ. We know that for
any quadratic function:

q(x) = ½xᵀHx + bᵀx + c,

where H, b and c are constant and H is symmetric, the Hessian maps differences in position into differences in
gradient, i.e.,

gₖ₊₁ − gₖ = H(xₖ₊₁ − xₖ).    (7.3)
The above property says that changes in the gradient g (= ∇f(x)) provide information about the second derivative of q(x)
along (xₖ₊₁ − xₖ). In the quasi-Newton methods, at xₖ we have information about the direction dₖ, the matrix Gₖ and the
gradient gₖ. We can use this information to perform a line search to obtain xₖ₊₁ and gₖ₊₁. We now need to calculate

Gₖ₊₁ (the approximate inverse of Hₖ₊₁) using the above information. At this point we impose the condition given
by Equation (7.3) for the non-quadratic function f. In other words, we impose that changes in the gradient provide
information about the second derivative of f along the search direction dₖ. Hence, we have:

Hₖ₊₁⁻¹(gₖ₊₁ − gₖ) = (xₖ₊₁ − xₖ)    (7.4)

Therefore, we would like to have Gₖ₊₁ = Gₖ + ∆Gₖ such that:

Gₖ₊₁γₖ = δₖ,    (7.5)

where Gₖ₊₁ ≈ Hₖ₊₁⁻¹, δₖ = (xₖ₊₁ − xₖ) and γₖ = (gₖ₊₁ − gₖ). This is known as the quasi-Newton condition, and for
a quasi-Newton algorithm the update from Gₖ to Gₖ₊₁ must satisfy Equation (7.5).

Methods differ in the way they update the matrix Gₖ. Essentially they are classified according to rank-one and rank-two
updating formulae.

7.3.1 The DFP Quasi-Newton Method

Rank-two updating formulae take the form:

Gₖ₊₁ = Gₖ + auuᵀ + bvvᵀ.    (7.6)

One choice is u = δₖ and v = Gₖγₖ. The quasi-Newton condition (7.5) then requires auᵀγₖ = 1 and bvᵀγₖ = −1, which determine a and b. Thus:

Gₖ₊₁ = Gₖ + δₖδₖᵀ/(δₖᵀγₖ) − GₖγₖγₖᵀGₖ/(γₖᵀGₖγₖ)    (7.7)

This formula was first suggested as part of a method due to Davidon (1959), and later also presented by Fletcher
and Powell (1963). The quasi-Newton method which goes with this updating formula is known as the DFP (Davidon,
Fletcher and Powell) method. The DFP algorithm is also known as the variable metric algorithm. The DFP algorithm
preserves the positive definiteness of Gₖ but can sometimes give trouble when Gₖ becomes nearly singular. A
modification (known as BFGS) introduced in 1970 can cure this problem. The algorithm for the DFP method is given
below:

1. Set k = 0, G₀ = I and compute g₀ = g(x₀).
2. Compute dₖ = −Gₖgₖ.
3. Compute α = αₖ minimising f(xₖ + αdₖ), and set xₖ₊₁ = xₖ + αₖdₖ.
4. Compute gₖ₊₁ = g(xₖ₊₁).
5. If ‖gₖ₊₁‖ ≤ ε (ε is a user supplied small number) then go to (9).
6. Compute δₖ = xₖ₊₁ − xₖ and γₖ = gₖ₊₁ − gₖ.
7. Compute Gₖ₊₁ from Equation (7.7).
8. Set k = k + 1 and go to (2).
9. Set x∗ = xₖ₊₁, STOP.
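A compact Python sketch of this algorithm (our own illustration; the exact line search in step 3 is performed numerically with scipy's minimize_scalar, which is an implementation convenience rather than part of the method):

import numpy as np
from scipy.optimize import minimize_scalar

def dfp(f, grad, x0, eps=1e-6, max_iter=100):
    x = np.asarray(x0, dtype=float)
    G = np.eye(len(x))                     # step 1: G_0 = I
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= eps:       # step 5
            break
        d = -G @ g                         # step 2
        alpha = minimize_scalar(lambda a: f(x + a*d)).x   # step 3
        x_new = x + alpha*d
        g_new = grad(x_new)
        delta, gamma = x_new - x, g_new - g               # step 6
        G = (G + np.outer(delta, delta)/(delta @ gamma)   # step 7: DFP update
               - (G @ np.outer(gamma, gamma) @ G)/(gamma @ G @ gamma))
        x, g = x_new, g_new
    return x

# quick check on the quadratic f(x) = 2x1^2 + 3x2^2 from the previous chapter
f = lambda x: 2*x[0]**2 + 3*x[1]**2
grad = lambda x: np.array([4*x[0], 6*x[1]])
print(dfp(f, grad, [1.0, 1.0]))   # close to (0, 0)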

7.3.2 Exercises

1. Use the Newton method to minimise the function:

f(x) = x₁⁴ − 3x₁x₂ + (x₂ + 2)²,

starting at the point x₀ = (0, 0)ᵀ, and show that the function value at x₀ cannot be improved by searching in the
Newton direction.

2. Find the stationary points of:

f(x) = x₁² + x₂² − x₁²x₂

and determine their nature. Plot the contours of f. Find the value of f after taking one basic Newton step
from x₀ = (1, 1)ᵀ.
3. Using the Newton method, find the minimiser of:

f(x) = ½x² − sin(x).

The initial value is x₀ = 0.5. The required accuracy is ε = 10⁻⁵, in the sense that you stop when |xₖ₊₁ − xₖ| < ε.

4. Using the DFP method, find the minimum of the following function:

f(x) = 4x₁² − 4x₁x₂ + 3x₂² + x₁,

using the starting point (4, 3).


5. Find the minimum of the function given in question (2) utilising the DFP method. Use the same starting
point.
Chapter 8

Direct Search Methods for Unconstrained Optimisation

Direct search methods, unlike the descent methods discussed in earlier chapters, do not require the derivatives of
the function. Direct search methods require only objective function values when finding minima, and are
often known as zeroth-order methods since they use only the zeroth-order derivatives of the function. We will consider
two direct methods in this course, namely the random walk method and the downhill simplex method.

8.1 Random Walk Method

The random walk method is based on generating a sequence of improved approximations to a minimum, where
each approximation is derived from the previous approximation. If xᵢ is the approximation to the minimum
obtained in the (i − 1)-th iteration, the relation is:

xᵢ₊₁ = xᵢ + λuᵢ,

where λ is a scalar step length and uᵢ is a random unit vector generated at the i-th stage.
We can describe the algorithm as follows:
1. Start with an initial point x₁, a sufficiently large initial step length λ, a minimum allowable step length ε, and a
maximum permissible number of iterations N.
2. Find the function value f₁ = f(x₁).
3. Set the iteration number i to 1.
4. Generate a set of n random numbers r₁, . . ., rₙ, each lying in the interval [−1, 1], and formulate the unit vector
u as:

u = (1/R)(r₁, r₂, . . ., rₙ)ᵀ,  where R = (r₁² + r₂² + . . . + rₙ²)^(1/2).

To avoid bias in the directions generated, we only accept the vector if R ≤ 1.

5. Compute the new vector and the corresponding function value x = x1 + λu and f = f (x).
6. If f < f 1 , then set the new values of x1 = x and f 1 = f and go to step 3, else continue to 7.
7. If i ≤ N , set the new iteration to i + 1 and go to step 4. Otherwise, if i > N , go to step 8.


8. Compute the new, reduced step length as λ = λ/2. If the new step length is smaller than or equal to ε, then go to step
9, else go to step 4.
9. Stop the procedure by taking x_opt = x₁ and f_opt = f₁.

8.1.0.1 Example

Minimise f(x₁, x₂) = x₁ − x₂ + 2x₁² + 2x₁x₂ + x₂² using the random walk method. Begin with the initial point x₀ = (0, 0)ᵀ
and a starting step length of λ = 1. Use ε = 0.05 and iteration limit N = 100.
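The helper random_walk called below is not defined in these notes; a minimal sketch of our own, following the algorithm above, might be:

import numpy as np

def random_walk(f, x0, lam, eps, n, N):
    x1 = np.asarray(x0, dtype=float)
    f1 = f(*x1)
    while lam > eps:
        i = 1
        while i <= N:
            r = np.random.uniform(-1, 1, n)
            R = np.sqrt(np.sum(r**2))
            if R > 1:                  # reject to avoid directional bias
                continue
            u = r / R
            x = x1 + lam * u
            fx = f(*x)
            if fx < f1:                # improvement: accept and reset counter
                x1, f1, i = x, fx, 1
            else:
                i += 1
        lam = lam / 2                  # reduce the step length
    return x1, f1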
f = lambda x1, x2: x1 - x2 + 2*x1**2 + 2*x1*x2 + x2**2
x0 = np.array([0, 0])
lam = 1
eps = 0.05
n = 2
N = 100
print(random_walk(f, x0, lam, eps, n, N))

## (array([-0.99768499, 1.49885167]), -1.249993279604305)

Let us plot the function to see if our answer makes sense:

[Figure: contour plot of f(x₁, x₂) = x₁ − x₂ + 2x₁² + 2x₁x₂ + x₂² around the minimum]

[Figure: surface plot of f(x₁, x₂) = x₁ − x₂ + 2x₁² + 2x₁x₂ + x₂²]

8.2 Downhill Simplex Method of Nelder and Mead

A direct search method for the unconstrained optimisation problem is the downhill simplex method developed
by Nelder and Mead (1965). It makes no assumptions about the cost function to be minimised. Importantly, the
function in question does not need to satisfy any condition of differentiability, unlike other methods, i.e. it is a zero
order method. It makes use of simplices: polytopes of n + 1 vertices in n dimensions. For example, in 2 dimensions the
simplex is a polytope of 3 vertices (a triangle). In 3 dimensional space it forms a tetrahedron.
The method starts from an initial simplex. Subsequent steps of the method consist of updating the simplex, where
we define:
• xh is the vertex with highest function value,
• xs is the vertex with second highest function value,
• xl is the vertex with lowest function value,
• G is the centroid of all the vertices except xₕ, i.e. the centroid of n points out of n + 1:

G = (1/n) Σ_{j=1, j≠h}^{n+1} xⱼ    (8.1)

The movement of the simplex is achieved by using three operations, known as reflection, contraction and expansion.
These can be seen in the Figures below:
A common practice to generate the initial remaining simplex vertices is to make use of x₀ + eᵢb, where eᵢ is the unit
vector in the direction of the xᵢ coordinate and b an edge length. Assume a value of 0.1 for b.

Figure 8.1: Here we have reflection and expansion.

Figure 8.2: Here we have contraction.


8.2. DOWNHILL SIMPLEX METHOD OF NELDER AND MEAD 81

Figure 8.3: Here we have multiple contraction.

Let y = f(x) and yₕ = f(xₕ); then the algorithm suggested by Nelder and Mead is as follows (the original flowchart figure is not reproduced here; a sketch of the standard iteration is given below).
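This sketch is our own reconstruction of the standard Nelder-Mead iteration using the operations defined above; the function name nelder_mead_step is ours.

import numpy as np

def nelder_mead_step(f, vertices, alpha=1.0, gamma=2.0, beta=0.5):
    """One iteration of the downhill simplex method on a list of n+1 vertices."""
    vertices = sorted(vertices, key=f)           # ascending function value
    x_l, x_s, x_h = vertices[0], vertices[-2], vertices[-1]
    G = np.mean(vertices[:-1], axis=0)           # centroid excluding x_h

    x_r = G + alpha * (G - x_h)                  # reflection
    if f(x_r) < f(x_l):
        x_e = G + gamma * (x_r - G)              # expansion
        vertices[-1] = x_e if f(x_e) < f(x_r) else x_r
    elif f(x_r) < f(x_s):
        vertices[-1] = x_r                       # accept the reflection
    else:
        x_c = G + beta * (x_h - G)               # contraction
        if f(x_c) < f(x_h):
            vertices[-1] = x_c
        else:                                    # multiple contraction towards x_l
            vertices = [x_l + 0.5 * (v - x_l) for v in vertices]
    return vertices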

The typical values for the above factors are α = 1, γ = 2 and β = 0.5. The stopping criterion used is defined by:

√[ (1/(n + 1)) Σ_{i=0}^n ( f(xᵢ) − f̄ )² ] ≤ ε    (8.2)

where f̄ denotes the mean of the function values over the n + 1 vertices.


8.3 Rosenbrock Function Example

Recall the Rosenbrock function:

f(x) = (1 − x₁)² + 10(x₂ − x₁²)²,    (8.3)



Figure 8.4: Application of Downhill Simplex on Rosenbrock Function - 3D.

Applying the downhill simplex method to the above equation gives the result shown in Figure 8.4, with the 2D contours shown in Figure 8.5.

8.3.1 Exercises

1. Apply the above two strategies to all the multivariate functions introduced in earlier chapters and find
their respective minima.

Figure 8.5: Application of Downhill Simplex on Rosenbrock Function - 2D.


Chapter 9

Lagrangian Multipliers for Constrained Optimisation

In this chapter we will briefly consider the optimisation of continuous functions subject to equality constraints
(this will be covered extensively in the 3rd year course), that is, the problem:

minimise z = f(x)    (9.1)

subject to:

gᵢ(x) = bᵢ

where f and the gᵢ are differentiable. The Lagrange function, L, is defined by introducing one Lagrange multiplier λᵢ
for each constraint gᵢ(x) as:

L(x, λ) = f(x) + Σ_{i=1}^m λᵢ[bᵢ − gᵢ(x)]    (9.2)

The necessary conditions of optimality are given by:

∂L/∂xⱼ = ∂f/∂xⱼ − Σ_{i=1}^m λᵢ ∂gᵢ/∂xⱼ = 0,    ∂L/∂λᵢ = bᵢ − gᵢ(x) = 0    (9.3)

9.0.1 Example

Use Lagrangian multipliers to minimise:

f(x₁, x₂) = x₁² + 4x₂²

subject to:

x₁ + 2x₂ = 1

Solution:

∂f/∂x₁ − λ ∂g/∂x₁ = 0,
∂f/∂x₂ − λ ∂g/∂x₂ = 0,

and

g(x₁, x₂) = b.

Therefore, we solve:

2x₁ − λ = 0,


8x₂ − 2λ = 0,

and

x₁ + 2x₂ = 1.

Solving these three equations we obtain x₁ = 1/2, x₂ = 1/4 and λ = 1. Therefore, the optimum is:

f(x₁, x₂) = 1/2.

9.0.2 Exercises

1. A length of wire L metres long is to be divided into two pieces, one bent into a circular shape and the other into a
square. What must the individual lengths be so that the total area is a minimum? Formulate the optimisation
problem mathematically and then solve it.
2. Minimise

f(x) = x₁² + x₂²

subject to

x₁ + 2x₂ + 1 = 0
3. Find the dimensions of a cylindrical tin of sheet metal to maximise its volume such that the total surface area
is equal to A₀ = 24π.
