
THIS WORK IS NOT IN THE PUBLIC DOMAIN. THIS WORK IS COPYRIGHT © 2018. ALL RIGHTS RESERVED.

Modelling and Simulation


Lecture notes for the Chalmers course ESS101

Sebastien Gros
with

Bo Egardt

2023

[Cover: the Lagrange equation,

d/dt (∂L/∂q̇) − ∂L/∂q = Q,   L(q, q̇) = T(q, q̇) − V(q),

and the Bayesian estimator,

θ̂ = arg max_θ P[θ | y] = arg max_θ P[y | θ] P[θ] / P[y].]
PREFACE

These Lecture Notes were originally prepared in 2018 by Sebastien Gros for a significantly revised course Modelling and Simulation. Since 2020, the following changes have been made to the original manuscript:
• The chapter on system identification has been largely rewritten, and later revised. The probabilistic aspects have been de-emphasized in order to make the presentation more easily accessible. The “Bayesian approach” has been omitted, and Maximum Likelihood estimation is left for optional reading. The focus is now shifted towards Prediction Error Methods, including Least-squares, and a fairly thorough introductory section is devoted to simple curve-fitting. The section on practical aspects of system identification is also new.
• Chapter 2 on physics-based modelling is new.
• In Chapter 8, Collocation methods are for optional reading.

Göteborg, June 2023


Bo Egardt, Yasemin Bekiroglu
CONTENTS

1 Notation & Background material
1.1 Norms and scalar products
1.2 Some basics from Linear Algebra
1.3 Some basics from Calculus
1.4 Ordinary Differential Equations (ODE)
1.5 Linear ODE
1.6 Linearization
1.7 Solution of ODEs
1.8 Solution to LTI ODEs
1.9 Discrete-time systems: Z transform and shift operator
1.10 Probability

2 Building models from physics
2.1 Physical modelling workflow
2.2 Analogies in physical modelling
2.3 Some practical aspects on physical modelling
2.4 Time-scale separation and the Tikhonov Theorem
2.5 Equation based modelling
2.5.1 Formulating a model – another viewpoint
2.5.2 The Modelica language

3 Lagrange mechanics
3.1 Kinetic Energy
3.2 Potential Energy
3.3 Lagrange Equation
3.4 External forces
3.5 Constrained Lagrange Mechanics
3.5.1 Handling Models from Constrained Lagrange
3.6 Consistency conditions
3.7 Constraints drift

4 Newton Method
4.1 Basic idea of Newton method
4.2 Convergence of the Newton method
4.2.1 Convergence rate
4.2.2 Reduced Newton steps
4.3 Implicit Function Theorem
4.4 Jacobian Approximation
4.5 Newton Methods for Unconstrained Optimization
4.5.1 Gauss-Newton Hessian approximation
4.5.2 Convex optimization
4.6 Summary

5 System Identification (SysId)
5.1 Introductory example: fitting a function to data
5.2 Parameter Estimation for Linear Dynamic Systems
5.2.1 A special case: linear regression
5.2.2 Predictions for linear black-box models
5.2.3 Prediction Error Methods (PEM)
5.2.4 Properties of the PEM estimate
5.3 System identification in practice
5.3.1 Design of experimental conditions
5.3.2 Pretreatment of data
5.3.3 Model structure selection
5.3.4 Model validation
5.4 The maximum likelihood method*

6 Differential Algebraic Equations (DAEs)
6.1 What are DAEs?
6.2 Different forms of DAEs
6.3 Differential Index of DAEs
6.4 Connection to Lagrange mechanics
6.5 Index reduction

7 Explicit Integration Methods - Runge-Kutta
7.1 Explicit Euler
7.1.1 Accuracy of the explicit Euler method
7.1.2 Stability of the explicit Euler method
7.2 Explicit Runge-Kutta 2 methods
7.3 General RK methods
7.3.1 Butcher tableau
7.3.2 The RK4 method
7.3.3 Stages, order & efficiency of explicit RK methods
7.4 Stability of explicit RK schemes
7.4.1 Stiff systems
7.5 Error control & Adaptive integrators

8 Implicit Integration Methods – Runge-Kutta
8.1 Implicit Euler method
8.1.1 Stability of the implicit Euler method
8.2 Implicit Runge-Kutta methods
8.2.1 Accuracy and efficiency of IRK methods
8.2.2 Stability of RK methods
8.3 Collocation methods*
8.3.1 Polynomial interpolation
8.3.2 Interpolation of the trajectories
8.4 RK methods for implicit ODEs
8.5 RK methods for implicit DAEs
8.5.1 RK method for semi-explicit DAE models

9 Sensitivity of Simulations
9.1 Variational approach
9.2 Algorithmic Differentiation of the explicit Euler scheme
9.3 Algorithmic Differentiation of explicit Runge-Kutta methods
9.4 Sensitivity of implicit Runge-Kutta steps
9.5 Sensitivity with respect to inputs
1 NOTATION & BACKGROUND MATERIAL

Most often, we will denote vectors in bold math format, i.e. a ∈ Rn. Scalars will be denoted in the standard math format, i.e. a ∈ R. Matrices will be denoted using capital letters, i.e. A ∈ Rn×m. There will be a few exceptions to these rules.

We will use subscripts to denote the elements of vectors and matrices, i.e. a_i ∈ R will be the i:th element of vector a, and A_ij the element in the i:th row and j:th column of matrix A. Generic functions will obey the same notation rules. We will reserve the capital letter I for the identity matrix.

1.1 NORMS AND SCALAR PRODUCTS

In this course we will often use the notion of a norm, denoted ‖·‖. A norm associates a non-negative scalar to a vector. It is often used simply to “measure the length” of a vector, but it has a deeper meaning than this. An example of a norm is:

‖x‖ = √(x⊤x) (1.1)

But any operator ρ taking an element of the vector space under consideration into R+, and having the following properties, can serve as a norm:

1. ρ(x + y ) ≤ ρ(x) + ρ(y )

2. ρ(ax) = |a|ρ(x) for any a ∈ R

3. ρ(x) = 0 implies that x = 0

There are several norms that differ from (1.1), e.g. the 1-norm

‖x‖₁ = Σ_{k=1}^{n} |x_k| (1.2)

and the infinity norm

‖x‖∞ = max_{k=1,...,n} |x_k| (1.3)

Note that since all norms on Rn are equivalent (i.e. boundedness in one norm implies boundedness in the others), we often omit (when irrelevant) which type of norm we use.
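For concreteness, these norms are easy to compute; a minimal Python sketch (the function names are our own choices, purely for illustration):

```python
import math

def norm2(x):
    # Euclidean norm (1.1): sqrt(x^T x)
    return math.sqrt(sum(xi * xi for xi in x))

def norm1(x):
    # 1-norm (1.2): sum of the absolute values of the entries
    return sum(abs(xi) for xi in x)

def norm_inf(x):
    # infinity norm (1.3): largest absolute entry
    return max(abs(xi) for xi in x)

x = [3.0, -4.0]
assert norm2(x) == 5.0 and norm1(x) == 7.0 and norm_inf(x) == 4.0

# the triangle inequality (property 1) holds for each of them, e.g.:
y = [1.0, 2.0]
s = [xi + yi for xi, yi in zip(x, y)]
assert norm2(s) <= norm2(x) + norm2(y)
```

Note that the three values differ (5, 7 and 4 here), while equivalence of the norms only guarantees that they bound each other up to constant factors.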

We will use the notation

〈., .〉 (1.4)

to denote scalar products on any vector space. On Rn , the scalar product between two
vectors x, y ∈ Rn reads as:

〈 x , y 〉 = x⊤y (1.5)

1.2 SOME BASICS FROM LINEAR ALGEBRA
Let us review here some basic principles of linear algebra that can be useful in this course.

• The eigenvalues λ and eigenvectors v of a square matrix A satisfy the equation

Av = λv (1.6)

i.e. the eigenvectors are transformed into scaled versions of themselves by the application of matrix A. The eigenvalues can be computed by solving the polynomial equation

det (A − λI ) = 0 (1.7)

where det is the matrix determinant.

• A matrix A is invertible (i.e. A⁻¹ exists) iff

– det(A) ≠ 0, or equivalently
– all the eigenvalues of A are nonzero.

An invertible matrix is said to be “full rank”.

• A matrix A is said to be symmetric if A = A⊤

• A matrix A is said to be positive-definite if all its eigenvalues are real and positive; for a symmetric A, this is equivalent to x⊤Ax > 0 for all x ≠ 0.

• A quadratic form is the operation:

x⊤Ax ∈ R (1.8)

where A is a square matrix, and x is a vector of adequate dimension. If A is a positive-definite matrix, then

x⊤Ax ≥ 0 (1.9)

for any x.
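The following sketch ties these notions together numerically for a symmetric 2×2 matrix: the eigenvalues are obtained directly from the characteristic polynomial (1.7), and their positivity goes together with the non-negativity of the quadratic form (1.9). The helper names are our own, chosen for illustration:

```python
import math

def quadratic_form(A, x):
    # x^T A x (equation (1.8)) for A given as nested lists
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

def eig_sym_2x2(A):
    # eigenvalues of a symmetric 2x2 matrix, solving det(A - lambda*I) = 0,
    # i.e. the quadratic characteristic polynomial (1.7)
    a, b, d = A[0][0], A[0][1], A[1][1]
    tr, det = a + d, a * d - b * b
    disc = math.sqrt(tr * tr - 4.0 * det)
    return ((tr - disc) / 2.0, (tr + disc) / 2.0)

A = [[2.0, 1.0], [1.0, 2.0]]          # symmetric
lam = eig_sym_2x2(A)                  # eigenvalues 1 and 3, both positive
assert min(lam) > 0.0                 # so A is positive-definite ...
assert quadratic_form(A, [1.0, -1.0]) >= 0.0   # ... and (1.9) holds
```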

1.3 SOME BASICS FROM CALCULUS


We will use a lot of calculus in this course, and you need to be comfortable with it. We recall
some basic calculus here.

• The Jacobian of a multi-variate function f : Rm → Rn,

f(x) = [ f₁(x), …, f_n(x) ]⊤,   x ∈ Rm (1.10)

is the Rn×m matrix given by:

∂f/∂x = [ ∂f₁/∂x₁ … ∂f₁/∂x_m ; ⋮ ; ∂f_n/∂x₁ … ∂f_n/∂x_m ] (1.11)

• It is useful to have the Jacobians of some matrix functions readily available, in particular for the Lagrange modelling chapter. In the following expressions, A is a matrix:

∂(Ax)/∂x = A (1.12)
∂(x⊤Ax)/∂x = x⊤(A + A⊤) (1.13)
• It will be convenient at times to use the gradient operator ∇, which is essentially the transpose of the Jacobian operator, i.e.

∇_x f = (∂f/∂x)⊤ (1.14)

The gradient is most often used for scalar functions f : Rn → R, but the notion of gradient can be readily applied to all functions.

• The chain rule will be used a lot in this course. For a composition of functions (f ∘ g)(x) = f(g(x)), it reads as:

∂(f ∘ g)(x)/∂x = ∂f(g(x))/∂x = ∂f(y)/∂y |_{y=g(x)} · ∂g(x)/∂x (1.15)

Note that the order in the right-hand side of the equation matters for multi-variate functions, as the Jacobians are matrices. This order is reversed under the gradient operator, i.e.

∇_x f(g(x)) = ∇_x g(x) · ∇_y f(y) |_{y=g(x)} (1.16)

• We will need a few times to distinguish between total and partial derivatives. The notion can be tricky and even ambiguous sometimes, but let us recall here the basic principle. Consider a function:

f(x, y) (1.17)

where y = g(x). Then the partial derivative

∂f(x, y)/∂x (1.18)

ignores that y is intrinsically a function of x, i.e. disregards this dependency, while the total derivative takes it into account:

df(x, y)/dx = ∂f(x, y)/∂x + ∂f(x, y)/∂y · ∂y/∂x = ∂f(x, y)/∂x + ∂f(x, y)/∂y · ∂g/∂x (1.19)
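The chain rule can be verified numerically with a finite-difference approximation of the derivative. The sketch below (our own illustration, with f = sin and g(x) = x²) compares a finite difference of f ∘ g with the product of derivatives in (1.15):

```python
import math

def num_diff(fun, x, h=1e-6):
    # central finite difference, a simple numerical derivative
    return (fun(x + h) - fun(x - h)) / (2.0 * h)

# composition: f(y) = sin(y), g(x) = x^2, so (f o g)(x) = sin(x^2)
g = lambda x: x * x
fg = lambda x: math.sin(g(x))

x0 = 0.7
# chain rule (1.15): d(f o g)/dx = f'(y)|_{y=g(x)} * g'(x) = cos(x^2) * 2x
numeric = num_diff(fg, x0)
analytic = math.cos(g(x0)) * 2.0 * x0
assert abs(numeric - analytic) < 1e-6
```

For scalar functions the order of the factors is immaterial; the remark after (1.15) about ordering only bites once the Jacobians become matrices.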

1.4 ORDINARY DIFFERENTIAL EQUATIONS (ODE)

Ordinary Differential Equations (ODEs) will play a crucial role in this course. ODEs are a way to relate the outputs y of a system to the inputs u acting on the system. Generally speaking, an ODE reads as

ϕ(y^(m), y^(m−1), …, y, u^(m−1), u^(m−2), …, u) = 0 (1.20)

for some function ϕ and where we use the notation

y^(k) = (d^k/dt^k) y. (1.21)
A trivial example of such an equation is the motion of a mass m attached to a spring of
constant K , and subject to an external force u and viscous friction of parameter ξ. The
ODE describing such a system reads as:

ϕ = m ÿ + ξ ẏ + K y − u = 0 (1.22)

Most often we prefer to work with ODEs in their state-space form rather than in the form
(1.20). An ODE in the state-space is most often written as:

ẋ = f (x, u) (1.23)

for some state x ∈ Rn. E.g. the spring-mass example (1.22) in its state-space form reads as:

ẋ = [ x₂ ; (u − ξx₂ − Kx₁)/m ],   x = [ y ; ẏ ] (1.24)

A complete state-space model is then made of the dynamics (1.23) and an output function
y = h (x, u) telling us what we can “measure" or alternatively what we “care about" in the
system. The full state-space model then reads as:

ẋ = f (x, u) (1.25a)
y = h (x, u) (1.25b)

We can make a few remarks here concerning (1.25).


• For x ∈ Rn , n provides the order of the system (state-space dimension)

• Functions f , h are generally nonlinear, often smooth

• The model is said to be time-invariant if the functions f, h do not explicitly depend on time.
We will see later in the course that it can sometimes be useful to treat the dynamics of a system via an implicit ODE, which has the generic form:

F(ẋ, x, u) = 0 (1.26)

i.e. it delivers the derivatives of the state not via a function f that can be evaluated directly, but implicitly via the equations (1.26).
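As a preview of the integration methods treated properly in Chapter 7, the spring-mass model (1.24) can be simulated with a crude forward-Euler loop; the parameter values below are illustrative, not taken from the text:

```python
# forward-Euler simulation of the spring-mass model (1.24);
# parameter values are illustrative choices, not from the notes
m, xi, K = 1.0, 0.5, 2.0       # mass, viscous friction, spring constant

def f(x, u):
    # right-hand side of (1.24), with state x = [y, ydot]
    return [x[1], (u - xi * x[1] - K * x[0]) / m]

x, dt = [1.0, 0.0], 1e-3       # initial deflection 1, at rest, input u = 0
for _ in range(20000):         # simulate 20 seconds
    dx = f(x, 0.0)
    x = [x[0] + dt * dx[0], x[1] + dt * dx[1]]

# with positive viscous friction the mass settles back towards rest
assert abs(x[0]) < 0.05 and abs(x[1]) < 0.05
```

The loop simply evaluates f and steps the state forward, which is precisely what “simulation” of (1.25a) amounts to.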

1.5 LINEAR ODE
A very important class of models is the class of linear models, where functions f , h are
linear. We can then rewrite (1.25) in the form:

ẋ = Ax + Bu (1.27a)
y = C x + Du (1.27b)

for some matrices A ∈ Rnx×nx, B ∈ Rnx×nu, C ∈ Rny×nx, D ∈ Rny×nu. Linear systems have the superposition property, i.e. if x1(t) and x2(t) are solutions of (1.27) for the input profiles u1(t), u2(t) respectively, then α1x1(t) + α2x2(t) is the solution corresponding to the input profile α1u1(t) + α2u2(t). Linear ODEs are “nice” to work with because one can treat them exploiting the powerful theorems provided by functional analysis, which - among other things - give us the Fourier and Laplace transforms.

When the matrices A, B, C, D are fixed, the system (1.27) is said to be Linear Time Invariant (LTI). If any of the matrices A(t), B(t), C(t), D(t) is a function of time, the system (1.27) is said to be Linear Time Varying (LTV).

1.6 LINEARIZATION

Linearization consists in forming locally valid linear approximations of nonlinear ODEs (1.25). The linearization of an ODE is nothing more than an application of the Taylor expansion, to first order. I.e. the functions f, h are replaced by their first-order approximations with respect to all arguments. In that context, we will consider deviations in the states and inputs from a given reference, and establish how this deviation will evolve in time, to a first-order approximation. Let us build this step-by-step. Consider first the first-order expansion of function f. Consider a reference trajectory x(t) solution of (1.25a) for a given reference input profile u and reference initial conditions x0. Consider then deviations ∆u(t) and ∆x0 in the input profile and/or initial conditions, and consider ∆x(t) the resulting deviation in the ODE trajectories. We observe that:

ẋ(t) + ∆ẋ(t) = f(x(t) + ∆x(t), u(t) + ∆u(t)),   ∆x(0) = ∆x0 (1.28a)
y(t) + ∆y(t) = h(x(t) + ∆x(t), u(t) + ∆u(t)) (1.28b)

must hold. For small deviations ∆u(t), ∆x(t), we can form the first-order Taylor expansion of the right-hand side of (1.28), obtaining:

ẋ(t) + ∆ẋ(t) = f(x(t), u(t)) + ∂f/∂x|_{x(t),u(t)} ∆x(t) + ∂f/∂u|_{x(t),u(t)} ∆u(t) + O(‖∆x‖², ‖∆u‖²) (1.29a)
y(t) + ∆y(t) = h(x(t), u(t)) + ∂h/∂x|_{x(t),u(t)} ∆x(t) + ∂h/∂u|_{x(t),u(t)} ∆u(t) + O(‖∆x‖², ‖∆u‖²) (1.29b)

Since ẋ(t) = f(x(t), u(t)), we equivalently get:

∆ẋ(t) ≈ A(t)∆x(t) + B(t)∆u(t) (1.30a)
∆y(t) ≈ C(t)∆x(t) + D(t)∆u(t) (1.30b)

for the time-varying matrices:

A(t) = ∂f/∂x|_{x(t),u(t)},   B(t) = ∂f/∂u|_{x(t),u(t)} (1.31a)
C(t) = ∂h/∂x|_{x(t),u(t)},   D(t) = ∂h/∂u|_{x(t),u(t)} (1.31b)

A very commonly used special case of (1.30)-(1.31) is when the reference inputs and trajectories u(t), x(t) are constant, in which case the matrices A(t), B(t), C(t), D(t) are also constant (because they come from Jacobians evaluated at fixed arguments). Such trajectories are in fact a stationary point of the dynamics, and can be found by solving:

f(x0, u0) = 0 (1.32)

and using x = x0, u(t) = u0 such that ẋ = 0. We can then use these reference trajectories in (1.31).

Example 1.1 (Linearization). Consider the nonlinear dynamics

ẋ₁ = (1 − x₂²)x₁ − x₂ + u, (1.33a)
ẋ₂ = x₁, (1.33b)

with the output function:

h(x) = x₁ + x₂³ (1.34)

One can compute:

A = [ 1 − x₂(t)²  −2x₁(t)x₂(t) − 1 ; 1  0 ],   B = [ 1 ; 0 ],   C = [ 1  3x₂(t)² ],   D = 0 (1.35)

where x(t) is a trajectory of the system. A steady state x̄ of the system is obtained by solving:

ẋ₁ = (1 − x₂²)x₁ − x₂ + u = 0, (1.36a)
ẋ₂ = x₁ = 0, (1.36b)

yielding

x̄ = [ 0 ; ū ] (1.37)

for a given constant ū. A linearization at the steady state then reads as:

A = [ 1 − ū²  −1 ; 1  0 ],   B = [ 1 ; 0 ],   C = [ 1  3ū² ],   D = 0 (1.38)

It is important to observe here that the linear system approximating the nonlinear dynamics depends on the input ū selected. □
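The Jacobians of Example 1.1 can be checked numerically with finite differences; the sketch below (the helper name jacobian_A is our own) recovers the A matrix at the steady state for ū = 0:

```python
def f(x, u):
    # nonlinear dynamics (1.33)
    return [(1.0 - x[1] ** 2) * x[0] - x[1] + u, x[0]]

def jacobian_A(x, u, h=1e-6):
    # central finite-difference approximation of A = df/dx in (1.31a);
    # 'jacobian_A' is our own helper name, not from the notes
    n = len(x)
    A = [[0.0] * n for _ in range(n)]
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        fp, fm = f(xp, u), f(xm, u)
        for i in range(n):
            A[i][j] = (fp[i] - fm[i]) / (2.0 * h)
    return A

# at the steady state (0, 0) for u = 0, the linearization gives A = [1 -1; 1 0]
A = jacobian_A([0.0, 0.0], 0.0)
expected = [[1.0, -1.0], [1.0, 0.0]]
assert all(abs(A[i][j] - expected[i][j]) < 1e-6
           for i in range(2) for j in range(2))
```

The same routine evaluated along any other trajectory point reproduces the time-varying A(t) of (1.35).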

1.7 SOLUTION OF ODES
Let us now turn to a crucial question in this course. Consider an ODE in its state-space
form

ẋ = f (x, u) (1.39)

An important question to ask here is “provided the initial conditions x(0) = x0 , is there a
trajectory x(t ) solution of (1.39), and is it unique?".

Before treating that question, let us consider two simple examples showing that it is not
trivial.

Example 1.2 (Finite escape time). Consider the ODE:

ẋ(t ) = x(t )2 , x(0) = 1 (1.40)

One can verify that (1.40) admits

x(t) = 1/(1 − t) (1.41)
as solution. We illustrate the trajectory x(t ) in the figure below. One can easily observe
from (1.41) that the state x(t ) becomes arbitrarily large as t approaches 1. The trajectories,
in fact, do not exist at and beyond t = 1.

Figure 1.1: Illustration of the solution (1.41). The trajectories reach ∞ in finite time (when getting close to t = 1).

This simple example illustrates that the solution of an ODE can “explode” in finite time, and cease to exist beyond a limited time interval. □

Example 1.3 (Non-uniqueness of solutions). Consider the ODE

ẋ = √|x|,   x(0) = 0 (1.42)

One can verify that:

x(t) = { (t − t0)²/4  if t ≥ t0 ;  0  if t < t0 } (1.43)

is an admissible solution of (1.42) for any t0 ≥ 0. The issue here is that it is an admissible solution regardless of t0. In other words, there are infinitely many trajectories x(t) (for infinitely many choices of t0) that are solutions of (1.42). We illustrate these solutions in the next Figure.

Figure 1.2: Illustration of the solutions (1.43) for different values of t0 (dashed lines).

Let us introduce two theorems that unpack the two issues (existence and uniqueness) raised in the examples above. The first theorem deals with the existence of the ODE solution.

Theorem 1. Consider the ODE

ẋ = f(x) (1.44)

where f is continuous. If¹

‖f(x) − f(y)‖ ≤ c · ‖x − y‖,   ∀ x, y (1.45)

holds for some finite constant c, then the solution of (1.44) exists and is unique for all t.

¹ This property is called “Lipschitz continuity”.
Let us return to the first example above. One can observe that

f(x) = x² (1.46)

does not satisfy condition (1.45). Indeed, in this case,

‖f(x) − f(y)‖ = |x² − y²| = |x + y| · |x − y| (1.47)

and since the term |x + y| can be arbitrarily large, we cannot find a constant c for which (1.45) holds.

Unfortunately, property (1.45) can be difficult to verify for non-trivial ODEs. Let us intro-
duce a second theorem that makes our life easier.

Theorem 2. Consider the ODE

ẋ = f(x) (1.48)

If f is continuously differentiable (i.e. the Jacobian ∂f/∂x exists and is continuous), then the solution to the ODE exists and is unique on some time interval.

Let us return to the second example above. One can observe that

∂f/∂x = 1/(2√x) (1.49)

for x > 0, such that f is not differentiable at x = 0. This results in the non-uniqueness of the solution of (1.42). Note that the first theorem does not apply to the ODE (1.42) either, as the function f has an “infinite slope” at x = 0, such that condition (1.45) does not hold around x = 0. Note that the second theorem is a “weaker” version of the first one: it requires weaker conditions, and delivers a weaker conclusion (existence only on some time interval).

Before closing, let us consider the case of linear ODEs. We observe that the theorems above readily apply to any linear ODE, as linear functions satisfy (1.45) and are continuously differentiable. That is, linear ODEs always have a unique solution over the time interval [0, ∞[. However, the solution may be unbounded, i.e. it can grow forever if the ODE is unstable, but there is no specific time at which the solution becomes infinite (unlike (1.40)).

1.8 SOLUTION TO LTI ODES


Consider the LTI ODE:

ẋ = Ax + Bu,   x(0) = x0 (1.50)

One can verify that the solution to (1.50) reads as:

x(t) = e^{At} x0 + ∫₀ᵗ e^{A(t−τ)} B u(τ) dτ (1.51)

We ought to recall here the definition of a “matrix exponential” like e^{At}. Here we simply extend the series corresponding to the exponential function to matrices. More specifically, similarly to:

e^t = 1 + t + t²/2! + … = Σ_{k=0}^{∞} t^k / k! (1.52)

we define:

e^{At} = I + At + (At)²/2! + … = Σ_{k=0}^{∞} (At)^k / k! (1.53)

Note that here the power of a matrix, A^k, does not correspond to taking the power of the matrix entries, but rather to multiplying the matrix by itself k times. We additionally observe that the differentiation rule:

(d/dt) e^{at} = a e^{at} (1.54)

becomes, for the matrix exponential function:

(d/dt) e^{At} = A e^{At} (1.55)

Note that the matrix exponential function is “expm.m” in Matlab, and does not (necessarily) correspond to the classic exponential function “exp.m”.

We can conclude here that LTI ODEs do not need to be “simulated" as one can compute
their solution explicitly from (1.51). For nonlinear ODEs (and some LTV ODEs), however,
the best way of building their trajectories is by using computer-based simulations.
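The truncated series (1.53) translates directly into code. The naive sketch below (for illustration only; Matlab's expm uses far more sophisticated algorithms) exploits the fact that the series terminates for a nilpotent matrix, so the result can be checked exactly:

```python
def mat_mult(A, B):
    # product of two square matrices given as nested lists
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm(A, terms=25):
    # truncated series (1.53): e^A = I + A + A^2/2! + ...
    n = len(A)
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # I
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = mat_mult(term, A)                       # now holds A^k * k-stuff
        term = [[t / k for t in row] for row in term]  # scale to A^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result

# for the nilpotent matrix A = [[0, t], [0, 0]], the series terminates after
# the linear term, so e^{At} = [[1, t], [0, 1]] exactly
E = expm([[0.0, 2.0], [0.0, 0.0]])
assert E == [[1.0, 2.0], [0.0, 1.0]]
```

For a 1×1 matrix the routine reduces to the scalar series (1.52), another easy consistency check.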

1.9 DISCRETE-TIME SYSTEMS: Z-TRANSFORM AND SHIFT OPERATOR


In one part of this course we will use the Z-transform. We briefly recall here its principle. The Z-transform applies to LTI discrete-time systems, i.e. systems of e.g. the form:

y_k = Σ_{i=0}^{N_b} b_i u_{k−i} + Σ_{i=1}^{N_a} a_i y_{k−i}, (1.56)

where the system input u is described as a discrete sequence u_0, …, u_∞, and similarly for the system output y. Similarly to the Laplace transform, the Z-transform allows one to treat dynamics in the form (1.56) in the “polynomial world”.

The Z-transform of a signal y_k is given by:

Y(z) = Z(y) = Σ_{k=0}^{∞} y_k z^{−k} (1.57)

The Z-transform has a number of useful properties (such as e.g. linearity), among which the time-shift property

Z(y_{k−i}) = z^{−i} Z(y) (1.58)

allows one to write the Z-transform of the dynamics (1.56) as:

Y(z) = Σ_{i=0}^{N_b} b_i z^{−i} U(z) + Σ_{i=1}^{N_a} a_i z^{−i} Y(z) (1.59)

i.e.

Y(z) = [ Σ_{i=0}^{N_b} b_i z^{−i} / (1 − Σ_{i=1}^{N_a} a_i z^{−i}) ] U(z) (1.60)

Formally, calculations in the Z-domain, using the Z-transform, can equivalently be carried out in the time domain using the shift operator q, defined from q y_k = y_{k+1}, or its inverse q^{−1}, satisfying q^{−1} y_k = y_{k−1}. Using this notation, the system defined by the difference equation (1.56) can be written

y_k = Σ_{i=0}^{N_b} b_i u_{k−i} + Σ_{i=1}^{N_a} a_i y_{k−i} = Σ_{i=0}^{N_b} b_i q^{−i} u_k + Σ_{i=1}^{N_a} a_i q^{−i} y_k, (1.61)

or, equivalently,

(1 − Σ_{i=1}^{N_a} a_i q^{−i}) y_k = (Σ_{i=0}^{N_b} b_i q^{−i}) u_k (1.62)

By introducing polynomials in the (backward) shift operator, defined as

A(q) = 1 − Σ_{i=1}^{N_a} a_i q^{−i},   B(q) = Σ_{i=0}^{N_b} b_i q^{−i}, (1.63)

the difference equation can be written as

A(q) y_k = B(q) u_k (1.64)

or, in a “transfer-function-like” style,

y_k = [ B(q) / A(q) ] u_k (1.65)

The latter is thus nothing but a compact representation of the (time-domain) difference equation (1.56).
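The difference equation (1.56) can be simulated directly, which also illustrates the static gain B(1)/A(1) that can be read off (1.65). The sketch below, with our own function name, assumes zero initial conditions:

```python
def simulate(a, b, u):
    # simulate the difference equation (1.56) with zero initial conditions:
    #   y_k = sum_{i=0} b_i u_{k-i} + sum_{i=1} a_i y_{k-i}
    # here a[0] plays the role of a_1, a[1] of a_2, and so on
    y = []
    for k in range(len(u)):
        yk = sum(bi * u[k - i] for i, bi in enumerate(b) if k - i >= 0)
        yk += sum(ai * y[k - 1 - i] for i, ai in enumerate(a) if k - 1 - i >= 0)
        y.append(yk)
    return y

# first-order example: y_k = 0.5 y_{k-1} + u_k, i.e. B(q)/A(q) = 1/(1 - 0.5 q^-1)
y = simulate([0.5], [1.0], [1.0] * 50)   # unit step input
# the step response settles at the static gain B(1)/A(1) = 1/(1 - 0.5) = 2
assert abs(y[-1] - 2.0) < 1e-6
```

Evaluating B(z)/A(z) at z = 1 to predict the steady-state step response is the discrete-time analogue of the final value of a stable transfer function.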

1.10 PROBABILITY
In the system identification chapter, some basic concepts from probability theory will be
used. Let us review some basic principles and notations here.

RANDOM VARIABLES. In order to model uncertainties like disturbances and noise, the concept of a random variable is used. A real, scalar (or univariate) random variable (r.v.) X is defined by its (Cumulative) Distribution Function (CDF), describing the probability that X takes a value less than or equal to x:

F_X(x) = P[X ≤ x] (1.66)

We will occasionally use the extension to vector (multivariate) r.v., defined in an analogous way.

In most cases, we will use the Probability Density Function (PDF) to characterize the random variable. The PDF f_X(x) of a continuous r.v. is defined by

F_X(x) = ∫_{−∞}^{x} f_X(y) dy (1.67)

The PDF has an intuitive interpretation: f_X(x) indicates the “relative” likelihood that the r.v. X takes the value x. For a multivariate random variable X, the PDF f_X(x) is a function from Rn to R.

INDEPENDENCE. Two random variables X and Y with joint PDF f_{X,Y}(x, y) are called independent if f_{X,Y}(x, y) = f_X(x) · f_Y(y). This definition can be extended to any collection of random variables, and the term mutual independence is then often used.

EXPECTED VALUE. The expected value or expectation of a function g(X) of a r.v. X with PDF f_X(x) is given by

E[g(X)] = ∫_{−∞}^{∞} g(x) f_X(x) dx.

The expected value can be thought of as “the average taken over many experiments”, i.e. if one were to draw the random variable X many times and average the results of g(X), one would get something close to the expected value.

The mean µ and the variance λ of a r.v. X are particular expected values:

µ = E[X] = ∫_{−∞}^{∞} x f_X(x) dx,   λ = Var[X] = E[(X − µ)²] = ∫_{−∞}^{∞} (x − µ)² f_X(x) dx. (1.68)

The covariance of two jointly distributed random variables X and Y is defined as

Cov(X, Y) = E[(X − E[X])(Y − E[Y])] (1.69)

In the multivariate case, we will need the concept of covariance of X with itself (auto-covariance), and we will refer to the covariance matrix defined as

Cov X = E[(X − E[X])(X − E[X])⊤]. (1.70)

NORMAL DISTRIBUTION. The most familiar and often used random variable distribution is the normal or Gaussian distribution. It is characterized by the PDF

f_X(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}, (1.71)

where µ is the mean, σ is the standard deviation, and σ² is the variance of the r.v. X. In short, this can be written as X ∼ N(µ, σ²). Analogously, the multivariate normal (Gaussian) distribution has a PDF

f_X(x) = (1/((2π)^{n/2} (det Σ)^{1/2})) e^{−(1/2)(x−µ)⊤ Σ⁻¹ (x−µ)}, (1.72)

where µ (a vector) is the mean and Σ is the covariance matrix. Notation: X ∼ N(µ, Σ).
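The scalar PDF (1.71) can be sanity-checked numerically: it integrates to one, and its peak value at x = µ is 1/√(2πσ²). A small sketch of our own, using only the standard library:

```python
import math

def normal_pdf(x, mu, sigma):
    # scalar Gaussian PDF (1.71)
    return (math.exp(-0.5 * ((x - mu) / sigma) ** 2)
            / math.sqrt(2.0 * math.pi * sigma ** 2))

mu, sigma = 1.0, 2.0

# midpoint-rule integration over [mu - 8 sigma, mu + 8 sigma]; the tails
# outside this window are numerically negligible
nsteps = 40000
dx = 16.0 * sigma / nsteps
total = sum(normal_pdf(mu - 8.0 * sigma + (k + 0.5) * dx, mu, sigma)
            for k in range(nsteps)) * dx
assert abs(total - 1.0) < 1e-6

# the peak value at x = mu is 1 / sqrt(2 pi sigma^2)
assert abs(normal_pdf(mu, mu, sigma)
           - 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)) < 1e-12
```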

STOCHASTIC PROCESSES. When studying properties of signals, or sequences of e.g. input, output or noise variables, it is natural to define the concept of a stochastic process as a sequence of random variables {v(t)}_{t=1}^{∞}. When the stochastic process has properties which are independent of absolute time, we call it a stationary stochastic process, and we often characterize it by its second-order properties, which in the scalar case are:

• Mean value: m = E[v(t)]

• Covariance function: R_v(τ) = E[(v(t + τ) − m)(v(t) − m)]

Of particular interest are stochastic processes which consist of a sequence of independent random variables. Specifically, a white noise process is a sequence of independent, identically distributed (i.i.d.) random variables with mean m = 0 and covariance function

R(τ) = { σ²,  τ = 0 ;  0,  τ ≠ 0 } (1.73)

For a stochastic process {v(t)} with covariance function R_v(τ), the spectral density Φ_v(ω) is defined as the discrete Fourier transform of its covariance function:

Φ_v(ω) = Σ_{t=−∞}^{∞} R_v(t) e^{−itω}. (1.74)

The spectral density brings information on the “frequency content” of typical realizations of the stochastic process, with useful engineering applications. Parseval’s formula applied to stochastic processes suggests that the variance (“power”) of the stochastic process can be thought of as the summing-up of all contributions along the frequency axis:

E[v(t)²] = (1/2π) ∫_{−π}^{π} Φ_v(ω) dω. (1.75)
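The white-noise covariance function (1.73) can be illustrated by estimating R_v(τ) from a long sequence of pseudo-random Gaussian samples; this is our own sketch, not an excerpt from the notes:

```python
import random

random.seed(1)                      # fixed seed, so the sketch is reproducible
N, sigma = 20000, 1.5
v = [random.gauss(0.0, sigma) for _ in range(N)]

def cov_hat(v, tau):
    # sample estimate of the covariance function R_v(tau) of a zero-mean signal
    return sum(v[t + tau] * v[t] for t in range(len(v) - tau)) / (len(v) - tau)

# for white noise, (1.73) predicts R(0) = sigma^2 = 2.25 and R(tau) ~ 0 otherwise
assert abs(cov_hat(v, 0) - sigma ** 2) < 0.15
assert abs(cov_hat(v, 5)) < 0.15
```

The estimate at τ = 0 is the sample variance, i.e. the “power” appearing on the left-hand side of (1.75); for white noise the spectral density (1.74) is flat and equal to σ².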

2 BUILDING MODELS FROM PHYSICS
Recall from the introductory chapter that our “favourite” ODE model is the state-space
model (1.25), repeated here for convenience:

ẋ = f (x, u) (2.1a)
y = h (x, u) (2.1b)

In this chapter, we will discuss the process of going from characterizing a system from its
physical properties to determining a useful state-space model of the type (2.1). This pro-
cess, often referred to as physical modelling or first-principles modelling, should be familiar
from e.g. the basic control course. We will nevertheless spend some effort to refresh some
of the basic ideas, and to illustrate these on examples from different domains.

It should immediately be stressed that physical modelling depends crucially on domain


knowledge, i.e. knowledge about the basic physical phenomena that characterize e.g. me-
chanical, electrical, fluid, and thermal systems. Clearly, a course like this one cannot go
into details concerning this, but we have to rely on the fundamentals learnt in e.g. basic
physics courses. Instead, we will emphasize what is needed in addition to domain knowl-
edge, namely a sound methodology to develop useful models from basic physics.

One of the lessons to be learnt in this chapter is that the state-space model (2.1) is not quite
sufficient to model all systems that we are interested in. In particular, we will encounter a
class of more general models, based on differential-algebraic equations or DAE. Such mod-
els will be dealt with in some detail later in the course.

2.1 PHYSICAL MODELLING WORKFLOW


Developing a model for a complex technical system may require significant effort and time.
It is therefore useful to follow a systematic process, and to apply proven and well docu-
mented principles. We will here adopt a recommended work-flow described in [4]. It con-
sists of the three following main steps:

1. Analyze the system’s function and structure

2. Determine basic relations and equations

3. Formulate a model

Let us discuss each of these main steps in some more detail.

ANALYZE THE SYSTEM’S FUNCTION AND STRUCTURE. One of the first actions in the mod-
elling process is often to identify how the system can be viewed as a connection of subsys-
tems. The rationale for doing this is that it is usually beneficial to adopt the divide-and-
conquer paradigm, by which the (big) modelling task is divided into smaller tasks, each

dealing with a separate subsystem. When identifying the subsystems, one of the criteria is
that the interactions/connections between subsystems become as simple as possible.

In this first step, it is also important to carry out a basic analysis of the system’s function.
This means, for example, to identify which physical mechanisms are important, which
quantities/variables that describe these mechanisms, and which are their qualitative re-
lations. When doing this, domain expertise (or intuition!) is important to decide which
phenomena are important and which can be neglected, which dynamic effects are slow
and which are fast etc.

The result of the first step is typically some type of graph or block diagram that gives an
overview of the system, along with a list of the most important variables. If the system is
decomposed into subsystems, connections between them can have different “semantics”,
depending on context.

Example 2.1 (Combustion engine). In the figure below, a rough sketch of an internal com-
bustion engine is shown. The sketch illustrates the airflow through the throttle valve into
the intake manifold, how the injected fuel is mixed with the air and transported into the
cylinder for combustion, and how the resulting pressure is transformed into mechanical
torque.

Figure 2.1: The workings of an internal combustion engine illustrated by a simple sketch.

Based on this simple sketch, a more systematic description of the different physical pro-
cesses involved is obtained through the block diagram shown below. In the block diagram,
the most important variables that couple the subsystems are also shown. The block dia-
gram with its listed variables would be a useful output from the first step of the modelling
process.

Figure 2.2: A block diagram of the internal combustion engine.

DETERMINE BASIC RELATIONS AND EQUATIONS. In the second step, the mainly qualitative
description obtained from the first step is made quantitative. This is done by de-
termining equations describing the physical mechanisms and relating variables involved.
This is where you apply knowledge gained in courses on physics, mechanics, electricity,
thermodynamics etc. Two different types of equations can be distinguished:
• Balance equations encode principles of mass balance, energy balance, force balance
etc. These equations typically relate several variables of the same kind, i.e. having the
same units, which is also the case for e.g. Kirchhoff’s voltage and current laws.

• Constitutive relations, on the other hand, describe relations between different kinds
of variables. An example is Ohm’s law, which characterizes the relation between volt-
age and current for an ideal resistor; another example is the ideal gas law, which links
pressure, volume and temperature for an ideal gas.
When forming the equations, it is a good habit to check dimensions, i.e. to make sure that
all terms in an equation have the same physical unit—this is the most basic quality check
you can apply to a model, and it is easily done!

The result from the second step is a collection of equations, some of which may involve
derivatives, i.e. being differential equations. However, it is likely that also a number of
algebraic equations are obtained.

FORMULATE A MODEL. The last step in the simplified modelling workflow aims at arriv-
ing at a final model. This is accomplished by simplifying all the equations obtained from
the previous step—superfluous variables can often be eliminated by substitutions and by
solving simple equations etc. Further simplifications may be achieved by linearizing the
equations. The final goal is often to obtain a state-space model (or, in the linear case, a
transfer function), but in some cases this is not possible and we need to accept a model
in the form of an implicit differential equation or a differential-algebraic equation (DAE),
containing a mix of differential and algebraic equations. The latter case is illustrated in the
following simple example.

Example 2.2 (Nonlinear resistance). Consider the electrical circuit depicted below, con-
sisting of a resistor and a capacitor in series.


Figure 2.3: A simple electrical circuit with nonlinear resistance.

In case the resistor is linear, it is straightforward to formulate the ODE model

C dv_C/dt = (u − v_C)/R    (2.2)
However, assuming the resistor is nonlinear with a voltage-current relation

u_R = R_1 i + R_2 i⁵,    (2.3)

we cannot solve for i any longer, and the model needs to be stated in terms of a differential-
algebraic equation or a DAE:
i = C dv_C/dt    (2.4a)
u = v_C + R_1 i + R_2 i⁵    (2.4b)

We will return to a much more in-depth discussion of DAE models later in the course. 
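Even before introducing the DAE machinery, a model like (2.4) can be simulated by solving the algebraic equation numerically at every integration step. The sketch below (the parameter values and step sizes are illustrative, not from the notes) uses a Newton iteration for the current and explicit Euler for the capacitor voltage:

```python
# Illustrative parameter values (not from the notes)
C, R1, R2, u = 1e-3, 100.0, 50.0, 5.0   # F, Ohm, V/A^5, V

def solve_current(vC):
    """Solve the algebraic equation u = vC + R1*i + R2*i**5 for i by Newton."""
    i = (u - vC) / R1                       # linear-resistor initial guess
    for _ in range(20):
        residual = vC + R1 * i + R2 * i**5 - u
        i -= residual / (R1 + 5 * R2 * i**4)
    return i

# Integrate the differential part C dvC/dt = i with explicit Euler,
# re-solving the algebraic constraint at every step.
dt, vC = 1e-4, 0.0
for _ in range(5_000):                      # 0.5 s, about 5 time constants
    vC += dt * solve_current(vC) / C
print(vC)                                   # charges towards u = 5 V
```

This "solve the algebraic part at every step" pattern is exactly what general-purpose DAE solvers automate.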
We will now illustrate in a few examples how the proposed workflow can be used. The
examples will be kept simple, though, which means that the full extent of the first step
cannot be captured.

Example 2.3 (Electric circuit). Consider the electrical circuit depicted in the figure below,
and let us develop a state-space model for the circuit.


Figure 2.4: The simple electrical circuit to be modelled.

1. A qualitative analysis of the circuit reveals the following:


• The terminal voltages v a and v b are supplied as inputs.
• The capacitors C 1 , C 2 and the inductor L 5 represent energy storages and thus
give rise to dynamics.
• The resistors R 3 and R 4 are considered ideal and are hence static components.
• Temperature variations are considered small, and therefore component param-
eter values can be considered constant.

2. For the energy storage components, the following differential equations describe the
dynamics:

C_1 dv_1/dt = i_1    (2.5a)
C_2 dv_2/dt = i_2    (2.5b)
L_5 di_5/dt = v_5    (2.5c)
The resistors can be characterized by the static, constitutive relations

v_3 = R_3 i_3    (2.5d)
v_4 = R_4 i_4.    (2.5e)

Finally, Kirchhoff’s voltage and current laws can be applied to give the balance equa-
tions

v1 = va − v3 (2.5f)
v2 = v1 − v4 (2.5g)
vb = v2 − v5 (2.5h)
i3 = i1 + i4 (2.5i)
i4 = i2 + i5 (2.5j)

3. In the search for a state-space model, the three differential equations suggest that
v 1 , v 2 , and i 5 could serve as state variables; these are also associated with the energy
storage in the capacitors and the inductor, indicating that they are indeed natural
candidates for state variables. By using the remaining equations, the state equations
can be derived as follows:
C_1 dv_1/dt = i_1 = i_3 − i_4 = (1/R_3) v_3 − (1/R_4) v_4 = (1/R_3)(v_a − v_1) − (1/R_4)(v_1 − v_2)    (2.6a)
C_2 dv_2/dt = i_2 = i_4 − i_5 = (1/R_4)(v_1 − v_2) − i_5    (2.6b)
L_5 di_5/dt = v_5 = v_2 − v_b    (2.6c)
dt
Notice that we have reduced the initial 10 equations into 3 equations; several vari-
ables have been eliminated in this process, but they can easily be calculated from
the state variables. We can now obtain a final state-space model in vector form by
introducing x = (v_1, v_2, i_5) and u = (v_a, v_b):

        [ −1/(R_3C_1) − 1/(R_4C_1)    1/(R_4C_1)      0     ]          [ 1/(R_3C_1)     0     ]
ẋ(t) = [       1/(R_4C_2)           −1/(R_4C_2)   −1/C_2   ]  x(t) +  [     0          0     ]  u(t)    (2.7)
        [           0                  1/L_5          0     ]          [     0       −1/L_5   ]
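A model like (2.7) is straightforward to simulate. The sketch below (assuming NumPy; the component values are illustrative, not from the notes) integrates the state equations with explicit Euler and checks the result against the analytic steady state −A⁻¹Bu:

```python
import numpy as np

# Illustrative component values (not from the notes)
R3, R4, C1, C2, L5 = 1.0, 2.0, 1.0, 0.5, 0.1

A = np.array([[-1/(R3*C1) - 1/(R4*C1), 1/(R4*C1),  0.0  ],
              [ 1/(R4*C2),            -1/(R4*C2), -1/C2 ],
              [ 0.0,                   1/L5,       0.0  ]])
B = np.array([[1/(R3*C1),  0.0 ],
              [0.0,        0.0 ],
              [0.0,      -1/L5 ]])

u = np.array([1.0, 0.0])            # constant terminal voltages (va, vb)
x = np.zeros(3)                     # state (v1, v2, i5)
dt = 1e-3
for _ in range(20_000):             # explicit Euler until transients die out
    x = x + dt * (A @ x + B @ u)

x_ss = -np.linalg.solve(A, B @ u)   # analytic steady state of dx/dt = Ax + Bu
print(x, x_ss)
```

At dc the inductor acts as a short circuit and the capacitors as open circuits, so the steady state can be double-checked by hand: i_5 = v_a/(R_3 + R_4) = 1/3, v_1 = v_a − R_3 i_5 = 2/3 and v_2 = v_b = 0.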


Example 2.4 (DC motor with load). Consider the DC motor illustrated in the figure below.
The motor is supplied with a DC source with voltage u. The motor drives a rotating load,
characterized by its moment of inertia J and friction coefficient b. We would like to derive
a state-space model and a block diagram with transfer functions, describing the motor.


Figure 2.5: A simple sketch of a DC motor.

1. The structure of this electro-mechanical system can be illustrated by the simple dia-
gram below. The two blocks represent the electrical and the mechanical subsystems,
respectively. We also see how the subsystems are connected. The electrical subsys-
tem delivers by induction the torque Td = k m i , where i is the current and k m is the
torque constant. Conversely, the rotation causes a back-emf, i.e. a voltage u m = k e ω,
where ω is the rotational speed and k e is a constant.


Figure 2.6: The interconnected subsystems of a DC motor with load.

2. Letting u m denote the voltage over the motor and u R , u L be the component voltages,
the constitutive relations and Kirchhoff’s voltage law give the following equations for
the electrical subsystem:

u = u_R + u_L + u_m    (2.8a)
u_R = R i    (2.8b)
u_L = L di/dt    (2.8c)
u_m = k_e ω    (2.8d)

For the mechanical subsystem, Newton’s equation (a torque balance) and the consti-
tutive relations give:


J dω/dt = T_d − T_f    (2.9a)
T_f = b ω    (2.9b)
T_d = k_m i    (2.9c)

3. Choosing the differentiated variables, i and ω, as state variables, the following state-
space model is readily derived:

L di/dt = −R i − k_e ω + u    (2.10a)
J dω/dt = k_m i − b ω    (2.10b)
An alternative representation can be obtained by applying the Laplace transform to
the equations and solve for Td (s) = k m I (s) and Ω(s):

T_d(s) = k_m/(Ls + R) · ( U(s) − k_e Ω(s) )    (2.11a)
Ω(s) = 1/(Js + b) · T_d(s)    (2.11b)

These equations can be depicted in a block diagram, which clearly shows how the
electrical and mechanical subsystems are connected via a feedback caused by the
back-emf.


Figure 2.7: A transfer function block diagram for the DC motor.
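The state-space model (2.10) is easy to simulate directly. In the sketch below all numbers, including the motor constants k_m and k_e, are made-up illustrative values; the simulated speed is checked against the steady-state speed ω_ss = k_m u/(Rb + k_m k_e) obtained by setting both derivatives in (2.10) to zero:

```python
# Illustrative parameter values (not from the notes)
R, L, J, b = 2.0, 0.05, 1e-3, 0.01
km, ke, u = 0.1, 0.1, 10.0

i, w = 0.0, 0.0                      # current and rotational speed
dt = 1e-5
for _ in range(100_000):             # explicit Euler over 1 s
    di = (-R * i - ke * w + u) / L
    dw = (km * i - b * w) / J
    i, w = i + dt * di, w + dt * dw

w_ss = km * u / (R * b + km * ke)    # steady state of (2.10)
print(w, w_ss)
```

The feedback through the back-emf is visible in the steady-state expression: increasing k_e lowers the final speed even though it does not appear in the mechanical equation.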


Example 2.5 (Hydraulic system). Consider the hydraulic (fluid) system depicted in the fig-
ure below, consisting of two interconnected tanks containing a liquid. The terminal pres-
sures in the pipes are p a and p b . The laminar flows q 3 and q 4 are subject to linear flow re-
sistances. The flow q_5, on the other hand, can be considered frictionless, but its inertial
effects are important. These assumptions illustrate how domain knowledge is important to
be able to judge which phenomena are important and which can be neglected.


Figure 2.8: A simple hydraulic system.

1. Assuming the liquid is non-compressible, we can start by observing that the tanks
represent accumulation of mass and that this accumulation is related to build-up
of potential energy (measured by volume or level). Similarly, based on the statement
that inertial effects are important for the flow q 5 , the corresponding flow velocity rep-
resents kinetic energy (measured by flow-rate q 5 ). Both these observations point to
dynamic effects, described by differential equations. The external pressures p a and
p b are inputs to the system.

2. A mass balance for the tanks (with cross-sectional areas A 1 and A 2 ) and a force bal-
ance (Newton’s equation) for the outflow pipe (with cross-sectional area A) give the
following differential equations (ρ is the density of the liquid):
A_1 dh_1/dt = q_1    (2.12a)
A_2 dh_2/dt = q_2    (2.12b)
ρl dq_5/dt = A(p_2 − p_b)    (2.12c)

In addition, we have two constitutive relations for the linear flow resistances R 3 and
R 4 , constitutive relations linking pressure and level in the tanks, and balance equa-
tions for the flows:

p a − p 1 = R3 q3 (2.13a)
p 1 − p 2 = R4 q4 (2.13b)
p 1 = ρg h 1 (2.13c)
p 2 = ρg h 2 (2.13d)
q1 = q3 − q4 (2.13e)
q2 = q4 − q5 (2.13f)

3. Using pressures p 1 , p 2 and flow-rate q 5 as state variables, we can now form a state-
space model by combining all equations in (2.12) and (2.13):

(A_1/ρg) dp_1/dt = q_1 = q_3 − q_4 = (1/R_3)(p_a − p_1) − (1/R_4)(p_1 − p_2)    (2.14a)
(A_2/ρg) dp_2/dt = q_2 = q_4 − q_5 = (1/R_4)(p_1 − p_2) − q_5    (2.14b)
(ρl/A) dq_5/dt = p_2 − p_b    (2.14c)

SOME REMARKS ON THE EXAMPLES. Let us finish this section by making a few general
remarks related to the physical modelling workflow:

• As already mentioned, domain knowledge and experience are essential to guide the


modelling work. It gives useful guidance to judge which aspects and phenomena are
important for the intended model use, and which are not.

• One particular aspect of domain knowledge is that there are “standard” choices of
state variables that you quickly get accustomed to. Examples include positions and
velocities for masses in mechanical systems, charge of capacitors, current of induc-
tors, accumulated mass or volume in fluid systems, and enthalpy or temperature in
thermal systems.

• Similarly, there are many “standard” simplifications such as the assumption on point
mass, no mass or no friction in mechanical systems; incompressible liquids in hy-
draulics; perfect mixing, ideal gas, and no heat losses in chemical engineering.

• A good habit during the modelling work is to make assumptions and approximations
explicit, i.e. to clearly state and document them. This makes it easier for you or some-
one else to go back and check the validity of these.

• Another good habit is to always make dimensions/units explicit, and to check that
dimensions are compatible in equations (this has not been done rigorously in the
examples in the interest of brevity).

• We finally stress again that the initial step, involving e.g. the decomposition of the
system into interconnected subsystems, is more important in realistic modelling tasks,
as opposed to the “toy examples” presented here.

2.2 ANALOGIES IN PHYSICAL MODELLING


By comparing the state-space model (2.14) with the state-space model (2.6), we can ob-
serve that the two models are equivalent. Pressures correspond to voltages, and flow-rates
correspond to currents. We can also see that the tanks have a “capacitance” C f = A 1 /(ρg )
and similarly for tank 2, and that the outflow pipe has an “inductance” (called inertance)
L f = ρl /A. The similarity between the two models is striking, and it can in fact be extended
into the mechanical domain as well—it is not difficult to show (try it!) that the mechanical
system depicted below will give rise to the same model (with the state vector built up by the
spring forces and the velocity v 3 of the mass):


Figure 2.9: A mechanical system with the same model as the electrical system in Fig. 2.4
and the hydraulic system in Fig. 2.8. The system consists of a mass m connected
to two springs and two viscous dampers.

This observation of analogy between different physical domains can be taken further and
is formalized within the framework of bond-graphs (sw: bindnings-grafer), see e.g. [4] for
an introduction. At the core of this framework is the observation that in many types of
systems, the basic mechanisms of energy transformation, accumulation and dissipation
are conveyed by a pair of variables, referred to as effort (or potential) and flow variables,
respectively. Denoting the two types of variables e and f , respectively, the power trans-
formed/accumulated/dissipated is given by the product, i.e. P = e · f .

Table 2.1 summarizes the analogies between the three domains considered above, and how
the basic concepts can be generalized. We can see that resistance, inductance and capaci-
tance are concepts that apply generally. Whereas resistance is a static, dissipative element,
inductance and capacitance both represent dynamic effects that involve energy storage. For

Table 2.1: Basic elements involving effort and flow and their specializations in different do-
mains.

              General           Electrical       Hydraulic          Mechanical

Intensity     e                 u                p                  F
Flow          f                 i                q                  v
Power         P = e·f           P = u·i          P = p·q            P = F·v
Inductance    e = α df/dt       u = L di/dt      p = L_f dq/dt      F = m dv/dt
Capacitance   f = β de/dt       i = C du/dt      q = C_f dp/dt      v = (1/k) dF/dt
Resistance    e = γ f           u = R i          p = R_f q          F = b v

an inductance, we can derive the stored energy by integration of power:


E_I(t) = ∫ᵗ P(τ) dτ = ∫ᵗ α (df(τ)/dτ) f(τ) dτ = (1/2) α f²(t),    (2.15)

and similarly, a capacitance stores the energy (1/2)βe² (you are encouraged to check what
these formulas reveal in the special cases!).
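The energy formulas are also easy to check numerically. The sketch below (assuming NumPy; the values are arbitrary) ramps the effort of a capacitance element and integrates the power P = e·f over time:

```python
import numpy as np

# Capacitance element: f = C de/dt. Ramp the effort (voltage) linearly
# and integrate the power P = e*f with the trapezoidal rule.
C = 2.0
t = np.linspace(0.0, 3.0, 300_001)
e = t                                  # effort ramp e(t) = t
f = C * np.gradient(e, t)              # flow f = C de/dt (constant here)
P = e * f
E = float(np.sum((P[:-1] + P[1:]) / 2) * (t[1] - t[0]))
print(E)                               # should match 0.5*C*e(3)**2 = 9
```

In the electrical specialization this is the familiar capacitor energy (1/2)Cu²; in the mechanical one it is the spring energy (1/2)F²/k.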

2.3 SOME PRACTICAL ASPECTS ON PHYSICAL MODELLING


Physical modelling can sometimes turn out to be quite demanding and time-consuming.
One reason for this could be the sheer size and complexity of the system, a situation which
we cannot really illustrate within the scope of this course. It has already been pointed out
that it is a good idea in such cases to divide the system into sub-systems in order to get
some structure for the modelling work. Another challenge is often to make judgments on
what is important and what is not—domain knowledge and experience are essential, which
always makes it a good idea to seek advice from experts and colleagues with experience!

Beyond these general remarks, there are some useful guidelines and practical hints that can
be of some help in the modelling task. We will discuss some of these in this section.

MODEL REQUIREMENTS AND MODEL VALIDATION

It is desirable for the modeller to have a clear idea of the modelling purpose, i.e. the in-
tended use of the model. The purpose could be to get some general insights into the work-
ings of the system, to make some important design decisions, to simply determine some
parameters in the system, or to perform control design. The intended use of the model

naturally leads to requirements on model quality. If, for example, the model is going to be
used for control design, there will be higher requirements on model fidelity in certain fre-
quency ranges than in others. It is not always easy to quantify the required model quality,
but it is nevertheless important to have this aspect “in the back of your mind”.

All models are approximations of reality, and as such they come with a certain region of
validity—it is important not to use the model outside this region without extreme care! A
simple example of this rule is when linearization is performed to get a simple model; by
performing linearization in several operating points, the region of validity of the complete
model is extended.

Model validation is the action taken to verify that the model requirements are met. Typi-
cally, it involves simulating the model under some relevant conditions, and comparing the
results with data from the real system; it may also include analysis. The tests being done
during model validation may motivate modifications to steps taken earlier in the modelling
process. This underlines the important fact that modelling is an iterative process, as illus-
trated in the figure below.


Figure 2.10: Modelling is an iterative process.

MODEL APPROXIMATIONS

As already pointed out, approximations and simplifications are always carried out to some
extent in the modelling work. It can be done already during the initial modelling, but it may
also be the result of going back and revising some of the decisions, e.g. in order to reduce
the final model complexity. Some common situations are described in the following.

NEGLECTING SMALL EFFECTS. A common way to find approximations is to simply ne-


glect phenomena, which you are well aware of, but which you consider less important to
capture the main system characteristics. Here are some standard examples:
• Neglect air resistance and/or friction

• Neglect a mass, or alternatively, assume the mass is a point mass

• Assume a flow is laminar instead of turbulent (the Reynolds number is an indicator of


this)

• Approximate a nonlinear sensor characteristic with a linear one

• Neglect heat losses

• Assume perfect mixing when modelling a chemical process

• Neglect temperature variations in order to use constant parameter values


These and many other similar approximations typically rely on domain expertise, experi-
ence, and intuition. A general rule is that you should strive to make balanced approxima-
tions, i.e. to avoid making some crude approximations in one part of the model and treat
other parts in great detail (“strain at gnats and swallow camels”). Another good habit is to
always be prepared to reconsider the choices you have made in the modelling process.

SEPARATION OF TIME CONSTANTS. A common ingredient in modelling is judging which


dynamic phenomena are important for the final use of the model. This is sometimes re-
ferred to as time-scale separation or separation of time constants. The general rule is to
approximate dynamics that are either too slow or too fast to be relevant:
• fast dynamics are approximated with static relations, obtained by describing steady-
state conditions

• slow dynamics are approximated by considering some variables to be constant


Here are some examples:
• actuator and sensor dynamics can often be approximated by static relations

• mechanical links are often approximated as rigid, i.e. with no compliance/flexibility

• temperature dynamics are often considered slow compared to other dynamics

• dynamics due to long-time wear of components are often neglected


Example 2.6 (DC-motor, cont’d). For an example of time-scale separation, recall the DC-
motor Example 2.4. We concluded that the model can be viewed as a feedback connection
of two first order transfer functions, see Fig. 2.7. Assume the following parameter values
(from a small laboratory setup previously used in the basic control course):

R = 2.4 Ω
L = 0.25 mH
J = 1.1 × 10⁻³ kg m²
b = 0.0025 Nm/(rad/s)

From these parameter values, we can compute the mechanical and electrical time con-
stants for the two transfer functions:

τm = J /b = 0.44 s (2.16a)
τe = L/R = 0.0001 s (2.16b)

It is evident that for many applications, we can safely neglect the electrical time constant
and thus simplify the model by replacing the transfer function k_m/(Ls + R) by the steady-state
gain k_m/R. 
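The effect of this simplification can be checked numerically. In the sketch below the motor constants k_m and k_e are hypothetical (they are not given in the notes); the full model (2.10) and the reduced model, in which the current is replaced by its quasi-steady value i = (u − k_e ω)/R, are both integrated with explicit Euler for a unit voltage step:

```python
R, L, J, b = 2.4, 0.25e-3, 1.1e-3, 0.0025   # values from Example 2.6
km = ke = 0.05                               # hypothetical motor constants

tau_m, tau_e = J / b, L / R                  # 0.44 s vs roughly 1e-4 s

dt, steps = 1e-6, 500_000                    # 0.5 s, resolving tau_e
i, w_full, w_red = 0.0, 0.0, 0.0
for _ in range(steps):
    # full model (2.10), unit voltage step u = 1
    di = (-R * i - ke * w_full + 1.0) / L
    dw = (km * i - b * w_full) / J
    i, w_full = i + dt * di, w_full + dt * dw
    # reduced model: electrical dynamics assumed instantaneous
    i_qss = (1.0 - ke * w_red) / R
    w_red += dt * (km * i_qss - b * w_red) / J
print(w_full, w_red)                         # nearly identical trajectories
```

Note that the full model forces the tiny step size dt ≈ τ_e/100; the reduced model alone could be integrated with a step size several orders of magnitude larger, which is a practical payoff of time-scale separation.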

It is of course also possible to think about time-scale separation in the frequency domain.
As an example of this, we know from the basic control course that the model needs to be
accurate in the mid-frequency range when used for feedback control design. The following
example illustrates the point.

Example 2.7 (Open vs closed-loop [2]). Consider first three different systems described by
the transfer functions
G(s) = 1/((s + 1)(s + a)),    a = 0, ±0.01    (2.17)
It can be seen that one system is stable, one is unstable, and one is marginally stable. It is
thus not surprising that the three systems have quite different open-loop step responses, as
seen in Figure 2.11 (top left). However, when (a unity gain) feedback is applied, the closed-
loop step responses turn out to be very similar (same figure, top right).
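The similarity can be verified from the closed-loop poles. With unity feedback, the characteristic polynomial is (s + 1)(s + a) + 1, and a short computation (assuming NumPy) shows that the poles barely move while the open-loop pole at −a changes stability:

```python
import numpy as np

# Unity feedback around G(s) = 1/((s+1)(s+a)) gives the characteristic
# polynomial (s+1)(s+a) + 1 = s^2 + (1+a)s + (a+1).
for a in (-0.01, 0.0, 0.01):
    closed_poles = np.roots([1.0, 1.0 + a, a + 1.0])
    print(a, closed_poles)
# The open-loop pole at -a is unstable, marginal or stable depending on a,
# yet the closed-loop poles stay near -0.5 +/- 0.87j in all three cases.
```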


Figure 2.11: Open-loop (left column) and closed-loop (right column) step responses and
Bode diagrams for the systems in (2.17).

The result can be explained by looking at the corresponding Bode diagrams. The open loop
frequency responses in Figure 2.11 (bottom left) differ only in the low-frequency region.
However, it is the mid/high-frequency region that matters for the closed-loop frequency
response, which can be seen to be virtually identical (bottom right).

Consider now the three transfer functions


G(s) = 400(1 − sT)/((s + 1)(s + 20)(1 + sT)),    T = 0, 0.015, 0.03    (2.18)
which have very different closed-loop responses, although the open-loop step responses
are very similar, see Figure 2.12. Again, a look at the corresponding frequency responses
help to understand what is going on. The open-loop frequency responses differ only for
mid/high frequencies (bottom left), which thus affect only the initial step responses (top
left). But the effect on the closed-loop frequency response is evident (bottom right) and
this is reflected in the step responses (top right).


Figure 2.12: Open-loop (left column) and closed-loop (right column) step responses and
Bode diagrams for the systems in (2.18).

One of the conclusions is that open-loop simulation may not always be the best option to
validate a model! 

AGGREGATION OF STATES. Another way to reduce the complexity of the model is to “lump”
states together and represent them by one state only, e.g. the average. A classical example
of this is a distributed parameter system, whose state space is infinite-dimensional, and e.g.
described by a partial differential equation. Such systems frequently appear in chemical
engineering (reactors, heat exchangers etc.) and in systems involving heat conduction and
convection. Here are a couple of more examples of state aggregation:

• A battery in an electric car consists of many small cells, each with its own state of
charge, temperature etc. In order to study effects on system level, it is common to
lump these states together and to approximate the cell states with an average taken
over the cells in a larger battery module, or even in the entire battery.

• In vehicle dynamics, many longitudinal effects may be studied by aggregating the left
and right halves of the vehicle, leading to a so called bicycle model.

SCALING AND NORMALIZATION

It has already been pointed out that it is a good habit to always check dimensions/units in
the equations used and also in the final model. Closely related to this is scaling of variables,
i.e. basically deciding which units to use, which can have an impact on the numerical prop-
erties during simulation. This can be taken one step further by employing normalization
of variables, to be illustrated below. In short, scaling and normalization can contribute to

• simplify equations;

• reduce number of parameters;

• reveal structural properties of the model;

• simplify simulations and improve numerics.

We will illustrate in a couple of examples how scaling and normalization may be used. For
a more in-depth discussion and a formal treatment, please consult [4].

Example 2.8 (Mass-spring system). A simple mass-spring system can be described by the
equation
m ẍ(t ) + kx(t ) = F (t ), (2.19)
where F is the external force acting on the system. The system is characterized by the two
parameters m and k. However, we can simplify the study of the system properties by using
normalization. Define the following dimensionless position and time variables:

z = x/L    (2.20a)
τ = √(k/m) · t,    (2.20b)

where L is some chosen length scale. This leads to the new differential equation

d²z/dτ² = d²(x/L)/dt² · (dt/dτ)² = (F − kx)/(mL) · (m/k) = F/(kL) − z = u − z,    (2.21)
where the new (also dimensionless) input u = F /(kL) has been introduced. The conclusion
is that every mass-spring system of the type (2.19) can be studied in terms of the differential
equation
z̈ + z = u, (2.22)
given the particular interpretation (using re-scaling) of the variables z and u. 
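The equivalence can be verified on a case where (2.19) has a closed-form solution. For a constant force F(t) = F₀ and zero initial conditions (the numbers below are arbitrary), x(t) = (F₀/k)(1 − cos ω₀t) with ω₀ = √(k/m), and the rescaled trajectory x/L coincides with the solution z(τ) = u(1 − cos τ) of (2.22):

```python
import numpy as np

# Arbitrary illustrative values
m, k, L, F0 = 3.0, 12.0, 0.5, 6.0     # constant force F(t) = F0

w0 = np.sqrt(k / m)                   # so that tau = w0 * t
t = 1.7                               # any time instant

# Original model (2.19), zero initial conditions:
x = (F0 / k) * (1.0 - np.cos(w0 * t))

# Normalized model (2.22) with u = F0/(k*L), evaluated at tau = w0*t:
u = F0 / (k * L)
z = u * (1.0 - np.cos(w0 * t))

print(x / L, z)                       # the two agree exactly
```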

Example 2.9 (Cart with pendulum [1]). The figure below depicts a simple model for a cart
with a balancing inverted pendulum. This model can be seen as a simple representation of
e.g. a Segway.


Figure 2.13: An inverted pendulum balancing on a cart affected by a force F .

The following second order, vector differential equation can be derived for the system:
[ M + m       −ml cos θ ] [ q̈ ]   [ c q̇ + ml sin θ · θ̇² ]   [ F ]
[ −ml cos θ   J + ml²   ] [ θ̈ ] + [ γθ̇ − mg l sin θ      ] = [ 0 ]    (2.23)

This model is characterized by parameters m, M, J , l , g , c, γ. Let us neglect friction, i.e. c =


γ = 0. Further, let us choose length scale l, time scale 1/ω₀, and force scale (M + m)l ω₀², where

ω₀ = √( mgl / (J + ml²) ).    (2.24)
This is equivalent to introducing the normalized variables

τ = ω₀ t    (2.25a)
x = q/l    (2.25b)
u = F/((M + m)l ω₀²)    (2.25c)

We leave it as an exercise to verify that this leads to the following differential equations for
the new variables:

d²x/dτ² − α cos θ · d²θ/dτ² + α sin θ · (dθ/dτ)² = u,    α = m/(M + m)    (2.26a)
−β cos θ · d²x/dτ² + d²θ/dτ² − sin θ = 0,    β = ml²/(J + ml²)    (2.26b)

It can be seen that the scaling introduced has resulted in a model with only two parameters
α and β, which clearly simplifies the study of system properties as function of parameter
values. It can finally be noted that when m ≪ M and J ≪ ml², we have α ≈ 0
and β ≈ 1, which simplifies the equations even further. 

2.4 TIME-SCALE SEPARATION AND THE TIKHONOV THEOREM


We will now return to the techniques used in time-scale separation, specifically the elim-
ination of fast dynamics. As already pointed out, this is most often performed based on
the intuition and experience of the engineer, but it has in fact strong foundations in math-
ematics. The elimination of fast dynamics is one of the reasons why differential-algebraic
equations (DAEs) arise in modelling, and in Chapter 6 we will spend some effort to under-
stand these DAEs better. In this section, we will see how and why eliminating fast dynamics
from a model is justified.

Consider the system described by the state-space model

ẋ = f (x, z ) (2.27a)
ǫż = g (x, z ) (2.27b)

where 0 < ǫ ≪ 1. This way to formulate the model suggests that the system can be thought
of as a connection of a system with “slow” dynamics (2.27a) and a system with “fast” dy-
namics in (2.27b). Following the idea of time-scale separation, the fast dynamics would
be neglected and replaced by the corresponding stationary solutions (formally obtained by
putting ǫ to zero). This procedure results in the DAE model

ẋ = f (x, z ) (2.28a)
0 = g (x, z ) (2.28b)

A valid question to ask at this point is how this approximation can be justified mathemati-
cally. Before providing an answer to this question, let us have a look at an example, which
illustrates that the procedure is not entirely foolproof.

Example 2.10. Consider the linear system

ẋ1 = −x1 + x2 (2.29a)


ǫẋ2 = x1 − ax2 + u (2.29b)

The limit case ǫ = 0 gives the DAE

ẋ1 = −x1 + x2 (2.30a)


0 = x1 − ax2 + u (2.30b)

We then distinguish two cases:

• if a ≠ 0 then (2.30) has the solution

ẋ1 = (a −1 − 1)x1 + a −1 u (2.31a)


x2 = a −1 (x1 + u) (2.31b)

• if a = 0 then (2.30) has the solution

ẋ1 = −x1 + x2 (2.32a)


0 = x1 + u (2.32b)

which equivalently reads:

x2 = −(u + u̇) (2.33a)


x1 = −u (2.33b)

Clearly, the two cases differ qualitatively. For example, the DAE (2.32) requires that u is dif-
ferentiable as it appears time-differentiated. This is in contrast with (2.29) for which any
integrable input profile is admissible.

The trajectories for (2.29) are illustrated in Fig. 2.14 for a few different values of ǫ > 0. It can
be seen that for decreasing values of ǫ, the trajectories approach the solution of the DAE
(2.30), which are shown in bold.

[Figure 2.14: three stacked plots of x1, x2 and the input u versus t ∈ [0, 3]; grey: ODE (2.29), black: DAE (2.30).]

Figure 2.14: Trajectories for (2.29) with a piecewise constant input. The grey curves repre-
sent the trajectories for ǫ ranging from 10⁻¹ to 4·10⁻³ with the initial conditions
x1(0) = x2(0) = 1 and for a = 2. The black curve represents the solution to the
DAE (2.30).

We could try to explain the outcome in the example with a bit of intuition. Since x1 is
expected to change slowly, assume we “freeze” x1 and see what happens to x2 in (2.29b).
Well, we expect x2 to quickly converge to the stationary solution given by (2.30b), provided
(2.29b) is stable, i.e. if a > 0. With this condition, we can also solve (2.30b) for x2 as a
function of the “frozen” x1 (which is in reality changing slowly). This intuitive explanation
is indeed supported by the following theorem due to Tikhonov:

Theorem 3 (Tikhonov). Consider the ordinary differential equation (ODE):

ẋ = f (x, z ) (2.34a)
ǫż = g (x, z ) (2.34b)

where 0 < ǫ ≪ 1 is very small. Let us label xǫ (t ), z ǫ (t ) the solution to (2.34). Suppose:

• the dynamics ż = g(x, z) are stable ∀ x, and

• the matrix ∂g/∂z is full rank (i.e. invertible) everywhere.

Then
        lim_{ǫ→0} xǫ(t), zǫ(t) = x0(t), z0(t)                   (2.35)

where x0 (t ), z 0 (t ) is the solution of

ẋ = f (x, z ) (2.36a)
0 = g (x, z ) (2.36b)

The limit (2.35) is to be understood in the sense of “almost everywhere" formally defined
in the context of measure theory. From an intuitive point of view, the trajectories match
everywhere but on a union of time intervals that are “infinitely short”.

Going back to Example 2.10, we can see that the hypotheses of Theorem 3 are satisfied for
(2.29) if a > 0, such that ∂g/∂z = −a is full rank and (2.29b) is stable. However, if a ≤ 0, then
there is no mathematical support for approximating (2.29) with (2.30).
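The limit (2.35) can be probed numerically. The sketch below is our own illustration (the notes do not prescribe any tool): it integrates the stiff ODE (2.29) with scipy for decreasing values of ǫ, and measures the distance of the final point to the algebraic manifold 0 = x1 − ax2 + u of the DAE (2.30).

```python
# Numerical illustration of Theorem 3 on the linear system (2.29):
# for shrinking eps, the ODE trajectory approaches the solution manifold
# of the DAE (2.30). Values a = 2, u = 1 and the initial condition are
# illustrative assumptions.
from scipy.integrate import solve_ivp

a, u = 2.0, 1.0

def ode(t, x, eps):
    x1, x2 = x
    return [-x1 + x2, (x1 - a * x2 + u) / eps]   # (2.29a), (2.29b)

for eps in [1e-1, 1e-2, 1e-3]:
    sol = solve_ivp(ode, (0.0, 3.0), [0.0, 0.0], args=(eps,),
                    method="Radau", rtol=1e-8, atol=1e-10)  # stiff integrator
    x1, x2 = sol.y[:, -1]
    # distance to the manifold 0 = x1 - a*x2 + u at the final time
    print(eps, abs(x2 - (x1 + u) / a))
```

The printed residuals shrink roughly proportionally to ǫ, in line with the theorem.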

Example 2.11. Let us consider two masses moving horizontally (positions x1 and x2 ), sub-
ject to a velocity-dependent friction with parameter ρ, and connected by elastic links of
constant K . The dynamics of this two-mass system can be modelled via the linear ODE:

m 1 ẍ1 = K (x2 − x1 ) − K x1 − ρ ẋ1 + u (2.37a)


m 2 ẍ2 = K (x1 − x2 ) − ρ ẋ2 (2.37b)

K K
m1 m2
u

x1 x2
ρ ρ

Figure 2.15: Illustration of the two mass system modelled by (2.37).

We can find an approximate model in two steps as follows. By first assuming the oscillatory
dynamics of mass 1 are very fast compared to those of mass 2, i.e. m1 ≪ m2 and m1/K ≪ 1,
(2.37) can be approximated by setting m1/K to zero, giving the new model:

        (ρ/K) ẋ1 = (x2 − x1) − x1 + (1/K) u                     (2.38a)
        m2 ẍ2 = K(x1 − x2) − ρ ẋ2.                              (2.38b)
Assuming the dynamics of mass 1 are oscillatory, it can readily be shown that m1/K ≪ 1
implies that ρ/K ≪ 1 (you are encouraged to verify this!), so that we can form the further
approximation:

        0 = x2 − 2x1 + (1/K) u                                  (2.39a)
        m2 ẍ2 = K(x1 − x2) − ρ ẋ2,                              (2.39b)

which is a DAE since state x1 does not appear time differentiated. We can easily observe
that (2.39) boils down to eliminating ẋ1 , ẍ1 from equation (2.37a). Alternatively, we could
have arrived at this by putting (2.37a) in the state-space form (2.34b) with a suitable choice
of ǫ and then setting ǫ to zero; please refer to Exercises.

Figure 2.16 compares the trajectories of the ODE (2.37) with those of the DAE (2.39). One can
verify that for both applications of the Tikhonov theorem, (2.38) and (2.39), the hypotheses
apply.
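The time-scale separation invoked above can also be seen in the eigenvalues of the linear system (2.37). The sketch below (assuming numpy; m1 = 10⁻³ is one value from the range used in Figure 2.16) exhibits two eigenvalue pairs separated by more than an order of magnitude:

```python
# Eigenvalues of the two-mass system (2.37) with u = 0, written in
# first-order form with state (x1, xdot1, x2, xdot2). Parameters follow
# Figure 2.16 (rho = 1, K = 1e3, m2 = 1); m1 = 1e-3 is one value from
# the range of m1 shown there.
import numpy as np

rho, K, m1, m2 = 1.0, 1e3, 1e-3, 1.0

A = np.array([
    [0.0,          1.0,        0.0,       0.0],
    [-2 * K / m1, -rho / m1,   K / m1,    0.0],      # m1*xddot1 = K(x2-x1) - K*x1 - rho*xdot1
    [0.0,          0.0,        0.0,       1.0],
    [K / m2,       0.0,       -K / m2,   -rho / m2], # m2*xddot2 = K(x1-x2) - rho*xdot2
])

lam = np.linalg.eigvals(A)
speeds = np.sort(np.abs(lam))
print(speeds)  # two slow and two fast modes
```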

[Figure 2.16: four plots — x1 and x2 (top row), ẋ1 and the input u (bottom row) — versus t ∈ [0, 3]; grey: ODE (2.37), black: DAE (2.39).]
Figure 2.16: Illustration of the Tikhonov theorem for (2.37). The piecewise constant input
illustrated in the lower-right graph was selected. The grey curves represent
the trajectories for m 1 ranging from 10−1 to 4 · 10−4 with the initial conditions
x1 (0) = x2 (0) = 1 and ẋ1,2 (0) = 0. The parameters ρ = 1, K = 103 and m 2 = 1
were selected. The black curve represents the solution to the DAE (2.39).

Let us finish by providing some intuition behind the application of the Tikhonov theorem
for the two-mass example. One can observe that the approximation at play here is essen-
tially to replace the dynamics of mass 1 by its “steady-state” approximation. This means
that we consider that the motion of the mass decays instantaneously to its steady state
provided by (2.39a). One should understand here that “steady-state" does not mean “not
moving", as e.g. (2.39a) still allows x1 to move. However, in the approximation (2.39) the
motion of x1 is entirely dictated by the motion of x2 moving x1 in different steady-state
positions. This principle can very often be readily applied to models holding very fast dy-
namics. One ought to observe that the dynamics created by the DAE (2.39) do not contain
very fast dynamics (as opposed to e.g. (2.38), which still contains a very fast dynamics for
x1 ). This observation will connect to Section 7.4.

2.5 EQUATION BASED MODELLING


In this final section, we will briefly discuss a slightly different viewpoint on physical mod-
elling, often referred to as equation based modelling. The approach has gained increased
interest in recent years and offers some advantages for building models for complex phys-
ical systems. There are also connections to ideas in object-oriented programming and to
the earlier mentioned bond-graph techniques.

2.5.1 FORMULATING A MODEL – ANOTHER VIEWPOINT

Let us start by reminding about the proposed workflow for physical modelling in Section
2.1:

1. Analyze the system’s function and structure

2. Determine basic relations and equations

3. Formulate a model

It is clear that steps 1 and 2 depend heavily on your skills as a modeller. Step 3, on the other
hand, is largely a matter of “book-keeping”, involving substitutions, sorting etc, with the
main goal to arrive at a model that is possible to simulate. In many cases, this amounts to
finding a state-space model, for which computations are easily organized. Taking a closer
look at what happens in this step, there are several issues to discuss.

MODEL FOR COMPUTATIONS. One reason why we prefer state-space models is that these
are easy to simulate. The explicit ODE form makes it (in principle) very simple to recur-
sively compute an approximate solution by iterating a difference equation approximat-
ing the ODE. In this sense, the model can be seen as a representation of a computational
scheme, as illustrated in the following example.

Example 2.12 (Predators and prey [4]). The following model has been suggested to study
variations in populations for a pair of mutually dependent predator-prey, e.g. lynx and
hares:
        Ṅ1(t) = λ1 N1(t) − (γ1 − α1 N2(t)) N1(t)                (2.40a)
        Ṅ2(t) = λ2 N2(t) − (γ2 + α2 N1(t)) N2(t)                (2.40b)
The figure below illustrates the characteristic, cyclic variations when simulating the model.

Figure 2.17: Variations in populations of lynx and hares using the model (2.40).

The Matlab/Simulink model used to simulate the system is depicted below. It can be seen
that the model is in fact built according to the data-flow used at every iteration of the sim-
ulation. The two integrators holding the states N1 and N2 act as starting points for the
calculations, and completing all assignments “downstream” finally gives the derivatives Ṅ1
and Ṅ2 , allowing a state update by approximating the derivative by a difference quotient.

Figure 2.18: Simulink model used to simulate the predator-prey example.



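The data-flow of the Simulink model can be mimicked directly in code: evaluate the derivatives "downstream" of the two states, then update the states with a difference quotient. The following forward-Euler loop is our own minimal sketch; the parameter and initial values are illustrative assumptions, not taken from [4].

```python
# Forward-Euler simulation of the predator-prey model (2.40).
# All numerical values below are illustrative assumptions.
lam1, gam1, alpha1 = 0.5, 1.0, 0.02
lam2, gam2, alpha2 = 1.0, 0.5, 0.01
N1, N2 = 100.0, 50.0            # initial populations
dt, T = 0.001, 50.0             # step size and horizon

for _ in range(int(T / dt)):
    # derivatives, computed "downstream" of the states (cf. Fig. 2.18)
    dN1 = lam1 * N1 - (gam1 - alpha1 * N2) * N1   # (2.40a)
    dN2 = lam2 * N2 - (gam2 + alpha2 * N1) * N2   # (2.40b)
    # state update: derivative approximated by a difference quotient
    N1 += dt * dN1
    N2 += dt * dN2

print(N1, N2)
```

With these (assumed) parameter values the populations cycle around an equilibrium, reproducing the qualitative behaviour of Figure 2.17.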
The following example illustrates what is happening when a state-space model is formed
for the electric circuit in Example 2.3.

Example 2.13 (Electric circuit, revisited). In the figure below, we repeat the schematic of
the circuit, the list of equations (left) that we derived in step 2 of the modelling work-flow,
and the final state-space model (right) obtained in step 3.

        [Circuit schematic: sources va (left) and vb (right); R3, R4 and L5 in series along
        the top, carrying currents i3, i4, i5 and voltages v3, v4, v5; capacitors C1 and C2
        (voltages v1, v2, currents i1, i2) to ground.]

        C1 v̇1 = i1
        C2 v̇2 = i2
        L5 i̇5 = v5
        v3 = R3 i3
        v4 = R4 i4
        v1 = va − v3
        v2 = v1 − v4
        vb = v2 − v5
        i3 = i1 + i4
        i4 = i2 + i5

With state x = (v1, v2, i5)⊤ and input u = (va, vb)⊤, the state-space model reads:

             [ −1/(R3C1) − 1/(R4C1)    1/(R4C1)      0    ]     [ 1/(R3C1)     0    ]
        ẋ =  [       1/(R4C2)         −1/(R4C2)   −1/C2   ] x + [     0        0    ] u
             [          0                1/L5        0    ]     [     0     −1/L5   ]

Figure 2.19: The collection of basic equations (left) and the state-space model (right), both
describing the electrical circuit (top).

We can see here that considerable effort is needed in order to go from the basic equations,
describing the system, to a form that is suitable for computations. Moreover, the state-
space model has a structure that is far from the simple structure of the basic equations,
and parameters appear in combinations and in several instances. 

Based on the examples, it is natural to ask the following question: can we automate step 3
of the modelling workflow by leaving to the computer to manipulate the basic equations?
Well, why not? After all, the computer is very good at routine tasks like sorting, substitu-
tions etc. Leaving step 3 to computer tools would also shift our focus a bit by viewing the
collection of basic equations from step 2 as the main result of our modelling, even as the
model itself—this is what is stressed in the term equation based modelling.
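As a proof of concept, the "book-keeping" of step 3 can indeed be left to the computer. The sketch below (using sympy, our own choice of tool; the notes do not prescribe one) feeds the ten basic equations of Example 2.13 to a symbolic solver and recovers the state derivatives:

```python
# Letting the computer do step 3 of the modelling workflow: solve the
# basic circuit equations of Example 2.13 for the state derivatives
# (dv1, dv2, di5) and the algebraic variables. Symbol names are ours.
import sympy as sp

v1, v2, i5, va, vb = sp.symbols("v1 v2 i5 va vb")       # states and inputs
dv1, dv2, di5 = sp.symbols("dv1 dv2 di5")               # state derivatives
v3, v4, v5, i1, i2, i3, i4 = sp.symbols("v3 v4 v5 i1 i2 i3 i4")
C1, C2, L5, R3, R4 = sp.symbols("C1 C2 L5 R3 R4", positive=True)

eqs = [
    sp.Eq(C1 * dv1, i1), sp.Eq(C2 * dv2, i2), sp.Eq(L5 * di5, v5),
    sp.Eq(v3, R3 * i3), sp.Eq(v4, R4 * i4),
    sp.Eq(v1, va - v3), sp.Eq(v2, v1 - v4), sp.Eq(vb, v2 - v5),
    sp.Eq(i3, i1 + i4), sp.Eq(i4, i2 + i5),
]

sol = sp.solve(eqs, [dv1, dv2, di5, v3, v4, v5, i1, i2, i3, i4],
               dict=True)[0]
print(sp.simplify(sol[dv1]))   # expression for the time derivative of v1
print(sp.simplify(sol[di5]))   # expression for the time derivative of i5
```

The resulting expressions agree, term by term, with the state-space model of Example 2.13.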

Let us finally remind about Example 2.2, where it was demonstrated that it is not always
possible to go from a collection of basic equations to a state-space model. Instead, we
were left with a DAE model for the circuit with a nonlinear resistance. To detect such cases
would be another task for computer tools. In addition, we will see later in the Lagrange
modelling chapter that both implicit ODE and DAE models may be preferable compared to
state-space models to keep model complexity down.

DECLARATIVE AND A-CAUSAL MODELLING. We will now illustrate another consequence of
the new viewpoint of equation based modelling.

Example 2.14 (Two versions of an electrical circuit). The figure below depicts two versions
of basically the same electrical circuit. The only difference is how the circuit is driven—by
a voltage source (left) or a current source (right).
Figure 2.20: Two versions of the same electrical circuit driven by a voltage source (left) and
a current source (right), respectively.

The basic equations that describe the electrical circuit are as follows:

C v̇ C = i (2.41a)
v R = Ri (2.41b)
v R + vC = v (∗) (2.41c)

where the last equation is valid for the voltage source driven circuit only.

Let us view the voltage over the resistance as output, y = v R . We can then readily form
state-space models for the two cases:
        v̇C(t) = −(1/RC) vC(t) + (1/RC) v(t)          v̇C(t) = (1/C) i(t)
                                                                                  (2.42)
        y(t) = −vC(t) + v(t)                          y(t) = R i(t)

It is clear that the final model changes significantly, even for this small modification of the
circuit! On the other hand, the original equations (2.41) differ only by the application of
Kirchhoff’s voltage law for the voltage source case. Another way to explain what is going on
is to stress that the basic resistor equation

v R = Ri (2.43)

is just a declarative statement of how voltage and current for a resistor relate, and no causal-
ity is implied. When forming the state-space models, on the other hand, there is implicitly
a causality implied, since the models are very close to the computations to be performed.
This means that the resistor equation is really used in two different ways, namely to per-
form the following assignments:
        i := vR / R          vR := R i                          (2.44)


In spite of its simplicity, the example points at an important aspect of equation based mod-
elling. By viewing the model as just a collection of equations, a declarative and a-causal
(without causality) description, no implications concerning the organization of compu-
tations are made. This means that we hand over to computer algorithms to analyze the
causality of the model, and to organize computations accordingly. This is in contrast to
state-space models or Simulink models, which are imperative, i.e. built on assignments in
a specific order tied to the causality of the model.
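The a-causal viewpoint can be made concrete with a small symbolic experiment: the single declarative resistor equation (2.43) yields either assignment of (2.44), depending on which variable the solver is asked for. A minimal sketch, assuming sympy as the tool:

```python
# One declarative equation, two possible assignments: a symbolic solver
# can derive either causality from vR = R*i, as in (2.44).
import sympy as sp

vR, i, R = sp.symbols("vR i R", positive=True)
resistor = sp.Eq(vR, R * i)      # declarative statement, no causality implied

i_assign = sp.solve(resistor, i)[0]     # use as  i := vR/R
vR_assign = sp.solve(resistor, vR)[0]   # use as  vR := R*i
print(i_assign, vR_assign)
```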

COMPOSITE MODELS. Let us take another look at Example 2.13 and focus on step 2, i.e.
forming the basic equations for the circuit. It is clear that the equations can be divided
into two groups, one describing the components and one describing the couplings between
these:

        Components:                Couplings:
        C1 v̇1 = i1                 v1 = va − v3
        C2 v̇2 = i2                 v2 = v1 − v4
        L5 i̇5 = v5                 vb = v2 − v5
        v3 = R3 i3                 i3 = i1 + i4
        v4 = R4 i4                 i4 = i2 + i5

Figure 2.21: The basic equations for the circuit in Example 2.13, divided into two groups.

This observation can take us further when dealing with composite models, i.e. models built
up by components and/or subsystems. The component equations remain the same re-
gardless of which circuit we build. The coupling equations, on the other hand, encode
Kirchhoff’s laws describing how the components are connected. These laws follow from
the basic rules that the potentials of two connected pins are the same, and that the sum of
currents flowing into a connection node is 0. Similar rules apply for other pairs of
effort/potential and flow variables, as discussed in Section 2.2. This way of viewing model
connections has been formalized within the modelling language Modelica [5].

2.5.2 THE MODELICA LANGUAGE

Modelica is a modelling language that supports equation based, declarative and a-causal
physical modelling along the principles discussed in this section. Furthermore, re-use of
models is facilitated by building models from connections of sub-models and components.
We will give a brief “teaser” on Modelica here, but for further study of the language and its
features, we refer to the many publications freely available, see [6].

Let us start by taking another look at Example 2.14, now in Modelica terms.

Example 2.15 (Two electrical circuits, revisited). All electrical components and circuits
need connectors or pins. This is how it looks in Modelica:

connector Pin
Real v;
flow Real i;
end Pin;
Note that the pin is characterized by two variables, namely the effort/potential variable v
and the flow/current variable i . Implicitly, this tells us that potentials are set equal when
pins are connected, and currents are set up to sum to 0. This is all handled by Modelica
“under the hood” when declaring a connection:

                            pin1.v = pin2.v;
connect(pin1,pin2)   ⇔
                            pin1.i + pin2.i = 0;

Note the dot notation to be interpreted as a genitive s, e.g. pin1.v is “pin1’s v variable”.

We are now ready to define the basic Resistor and Capacitor components (the keyword
der is used here to denote the derivative of a variable):
model Resistor model Capacitor
parameter R; parameter C;
Pin n,p; Pin n,p;
Real i,u; Real i,u;
equation equation
u = p.v - n.v; u = p.v - n.v;
p.i + n.i = 0; p.i + n.i = 0;
i = p.i; i = p.i;
u = R*i; der(u)*C = i;
end Resistor; end Capacitor;

It may seem unnecessarily complicated to define such simple components with a number
of equations. However, the good news is that this allows us to generalize to many different
components, using the same constructs. Moreover, basic components are pre-defined and
available in libraries, which means we don’t have to bother about details as users.

Using these components, we can now define the circuit with a voltage source in Modelica:

model Circuit1
Resistor R (R=100);
Capacitor C (C=0.001);
VSource Vs (u=1);
equation
connect(Vs.p,R.n);
connect(R.p,C.n);
connect(C.p,Vs.n);
end Circuit1;
Notice that in order to change this model from using a voltage source to a current source,
we only need to change one line in the code! 

We have demonstrated above how models can be defined textually using the Modelica lan-
guage. There are, in addition, graphical tools available, allowing the user to define mod-
els by graphically connecting components and models earlier defined and collected in li-
braries. It should be stressed, however, that connecting models graphically does not change
the fundamental, declarative and a-causal nature of the model.

PARTIAL MODELS. Modelica is a rich language and offers many more features, which are
outside the scope of this introductory presentation. We will briefly illustrate one feature
though, namely the concept of a partial model, which resembles the inheritance feature
found in object-oriented languages.

An important observation is that many (component) models share some common charac-
teristics. Therefore, it is tempting to systematically develop such models by successively
specializing from more general models. This is exactly the idea of a partial model. Let us
illustrate by defining a general and basic electrical component, called OnePort, which in
turn makes use of a connector Pin, which we have come across earlier, but now slightly
refined:

connector Pin
Voltage v;
flow Current i;
end Pin;

partial model OnePort
Pin p,n;
Voltage v "voltage drop";
Current i;
equation
v = p.v - n.v;
p.i + n.i = 0;
i = p.i;
end OnePort;

The partial model OnePort needs specialization in order to be well-defined, and it basically
amounts to specify how the current through the component relates to the voltage drop.
In this way, it is straightforward to define models for the standard components, e.g. the
resistor and capacitor:
model Resistor model Capacitor
extends OnePort; extends OnePort;
parameter Real R=1 parameter Real C=0.001
’’Resistance in [Ohm]’’; ’’Capacitance in [F]’’;
equation equation
v = R*i; der(v)*C = i;
end Resistor; end Capacitor;
The same idea applies when defining an inductor, or any other simple one-port compo-
nent. In the code segments shown above, a couple of Modelica features are also shown:
the possibility to define default parameter values (that can later be changed, of course) and
units, the latter allowing to automatically check that a model is based on compatible units.

So far, we have used examples from the electrical domain to illustrate some of the basic
ideas in Modelica. However, Modelica is aimed for multi-domain modelling based on the
same principles, and we finish this brief tour by giving an electro-mechanical example,
namely the DC motor.
Example 2.16 (DC-motor [3]). A DC-motor model is given below in both textual and graph-
ical form. It can be seen that a special component EMF represents the coupling between the
electrical and mechanical domains. It describes both the torque induced by the current to
the motor and the back-emf, i.e. the voltage arising due to the rotation of the motor.

model DCMotor
Resistor R(R=100);
Inductor L(L=100);
VsourceDC DC(f=10);
Ground G;
EMF emf(k=10, J=10, b=2);
Inertia load;
equation
connect(DC.p,R.n);
connect(R.p,L.n);
connect(L.p,emf.n);
connect(emf.p,DC.n);
connect(DC.n,G.p);
connect(emf.flange_b,load.flange_a);
end DCMotor;

Despite the fact that the DC motor model is quite simple, it nevertheless generates quite
a few equations, as seen in the following list. Luckily, it is up to the modelling software to
handle these equations, e.g. to generate code for simulations.

3 LAGRANGE MECHANICS
In the previous chapter, we discussed some general principles and guidelines for physical
modelling. It was pointed out, however, that it is important to complement this with do-
main specific knowledge and guidelines. In this chapter, we will go one step further and
show how domain specific techniques can facilitate and strengthen the physical modelling
process, and this will be done by studying the basics of Lagrange mechanics. Lagrange me-
chanics is a powerful tool to build mathematical models for complex mechanical systems,
and is used in many applications. Lagrange mechanics
• allows to describe arbitrarily complex mechanical systems, including relatively moving
parts, accelerated frames, gyroscopic and centrifugal effects, without hassle;

• often allows to build simple models (in terms of complexity of the resulting model
equations), even for mechanical systems that are normally described by very com-
plex models.
We define the generalized coordinates of a mechanical system as a vector of time-varying
“coordinates" q (t ) ∈ Rnq that must be able to describe the “configuration" of the system at
a given time t . One can construe these coordinates as forming a “snapshot" of the system.
The snapshot does not tell us how the system will evolve in time, but it can tell us in what
configuration the system is at a given time. The generalized coordinates typically gather
positions and angles, but can also include more abstract representations of the system. For
the sake of simplicity, in the following we will omit the explicit dependence of the general-
ized coordinates and other time-varying objects in the notation, i.e. we will e.g. note the
time-varying generalized coordinates q (t ) simply as q .

Lagrange mechanics is based on a description of the mechanical system in terms of energy.


In order to build the Lagrange framework and models using the Lagrange equations, we
need to compute the kinetic and potential energy functions of the system, conventionally
labelled T and V , respectively. Building these objects is not necessarily straightforward,
but it can be done fairly systematically, and with a very limited risk of error. Let us discuss
the kinetic and potential energy functions next.

3.1 KINETIC ENERGY


We recall here that the kinetic energy of a “punctual mass" (i.e. of an adimensional,
point-like particle of mass m) whose position with respect to a fixed (inertial) reference
frame is given by p ∈ R^D (where the number of dimensions can be D = 1, 2, 3) reads as:

        T = ½ m ‖ṗ‖² = ½ m ṗ⊤ṗ                                  (3.1)
A few useful observations ought to be made here:
• The kinetic energy function is a quadratic form of the mass velocity ṗ. It is always
non-negative (as m > 0) and takes the minimum value T = 0. This observation is generally
true for any mechanical system.
• The generalized coordinates q describe the configuration of the entire mechanical
system we consider. I.e. if our punctual mass is part of a mechanical system having
the generalized coordinates q, then its position p is a function of q (in the sense that
p can be calculated from q). More formally, one could write p(q). The mass velocity
ṗ is then a direct result of the chain rule:

        ṗ = (∂p/∂q) q̇                                          (3.2)

• Imagine that our mechanical system is made of N punctual masses of position p i ,


fully determined by the generalized coordinates q . The kinetic energy of the system
is then an addition of all the individual kinetic energies, i.e.

        T = ½ Σ_{i=1}^N m_i ‖ṗ_i‖² = ½ Σ_{i=1}^N m_i ṗ_i⊤ṗ_i    (3.3)
The generalized coordinates describe the configuration of the whole system, i.e. it
describes the position p i of each mass. Each velocity is given by the chain rule:
        ṗ_i = (∂p_i/∂q) q̇                                      (3.4)
The kinetic energy of the N masses is then given by:
        T = ½ Σ_{i=1}^N m_i ṗ_i⊤ṗ_i = ½ Σ_{i=1}^N m_i q̇⊤ (∂p_i/∂q)⊤ (∂p_i/∂q) q̇
          = ½ q̇⊤ ( Σ_{i=1}^N m_i (∂p_i/∂q)⊤ (∂p_i/∂q) ) q̇      (3.5)

Here we can define:


        W = Σ_{i=1}^N m_i (∂p_i/∂q)⊤ (∂p_i/∂q)                  (3.6)

and we observe that W is a square, symmetric, (semi-)positive-definite matrix of size
n_q × n_q. This matrix is in general a function of q, but can also be constant for “smart"
choices of generalized coordinates.

• The observation above holds for systems having infinitely many “particles" (e.g. dis-
tributed mass systems). Generally, the kinetic energy function will take the form:
        T(q, q̇) = ½ q̇⊤ W(q) q̇                                  (3.7)

where each of these variables (i.e. q and q̇) is time-varying, and where W is a square,
symmetric, positive-definite matrix of size n_q × n_q, possibly a function of q.

• Note that since the kinetic energy function T is given by (3.7), it is essentially defined
via the matrix W(q); hence, in our constructions in the following, the matrix W(q) will play
a central role. In particular, we will see that W(q) actually being a function of q
or being constant plays a central role in the complexity of the model describing the
corresponding mechanical system.

Example 3.1 (Mass on a rigid rod). Imagine a punctual mass m attached to the origin by a
rigid, massless rod of length L, and moving in the plane x − y. We can describe the system
by the angle θ between the x axis and the rod. This angle alone describes entirely this
simple system, as knowing θ specifies in what “configuration" (or position) the system is.
Our (unique) generalized coordinate here is q = θ ∈ R.

Figure 3.1: Illustration of a mass on a rigid rod.

The position of the mass in the x − y plane is then given by:


        p = L (cos θ, sin θ)⊤                                   (3.8)

The chain rule then tells us that:


        ṗ = L θ̇ (−sin θ, cos θ)⊤                                (3.9)

The kinetic energy is then:


        T = ½ m ṗ⊤ṗ = ½ mL²θ̇² (sin²θ + cos²θ) = ½ mL²θ̇²        (3.10)

Here “matrix" W is simply

        W = mL²                                                 (3.11)

and is constant. 
Example 3.2 (Mass on an elastic rod). Imagine a punctual mass m attached to the origin
by an elastic, massless rod of varying length L, and moving in the plane x − y.

Figure 3.2: A mass on an elastic rod.

To describe the system we now need the angle θ between the x axis and the rod, and the
length L (as the latter is also changing and therefore not simply a constant). Our generalized
coordinates are then:
        q = (θ, L)⊤                                             (3.12)

as both are needed at a given time t to describe the “configuration" (or position) of the
system. The position of the mass in the x − y plane is still given by:
        p = L (cos θ, sin θ)⊤                                   (3.13)

but now L is no longer a constant and is part of our generalized coordinates. The chain rule
then tells us that:
        ṗ = L θ̇ (−sin θ, cos θ)⊤ + L̇ (cos θ, sin θ)⊤            (3.14)

The kinetic energy is then:

        T = ½ m ṗ⊤ṗ
          = ½ mL²θ̇² + ½ m L̇² + m L L̇ θ̇ (−sin θ cos θ + cos θ sin θ)
          = ½ mL²θ̇² + ½ m L̇² = ½ q̇⊤ diag(mL², m) q̇             (3.15)

Here matrix W is diagonal:


        W = diag(mL², m)                                        (3.16)

and not constant as L is part of the state. 
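The computation in Example 3.2 is easy to reproduce symbolically; the sketch below (assuming sympy) forms the velocity from the chain rule and reads W off T, using the fact that T is quadratic in q̇:

```python
# Kinetic energy and W matrix for the mass on an elastic rod (Example 3.2),
# computed symbolically. thetadot and Ldot stand for the entries of qdot.
import sympy as sp

m, L, theta, thetadot, Ldot = sp.symbols("m L theta thetadot Ldot", real=True)

# velocity from the chain rule, cf. (3.14)
pdot = (L * thetadot * sp.Matrix([-sp.sin(theta), sp.cos(theta)])
        + Ldot * sp.Matrix([sp.cos(theta), sp.sin(theta)]))

T = sp.simplify(sp.Rational(1, 2) * m * pdot.dot(pdot))   # cf. (3.15)
print(T)

# since T = (1/2) qdot^T W qdot, W is the Hessian of T in qdot
W = sp.hessian(T, (thetadot, Ldot))
print(W)
```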


Example 3.3 (Rigid rod with a mass). Imagine a uniform rod of length L and of negligible
thickness attached at the origin, having a mass m (uniformly distributed).

Figure 3.3: A rigid rod with mass.

The rod position is again described by the angle θ alone (we consider the length L fixed as
the rod is rigid), hence q = θ. We can consider our rod as made of infinitely many punctual
masses distributed along the rod axis. If we locate these masses via their position ν ∈ [0, L]
along the rod axis, we can describe their position in the plane x − y by:
        p(ν) = ν (cos θ, sin θ)⊤                                (3.17)

and their velocity is given by the chain rule:


        ṗ(ν) = ν θ̇ (−sin θ, cos θ)⊤                             (3.18)

The overall kinetic energy of the rod is given by the summation of the kinetic energies of all
the infinitesimal masses (each having a mass (m/L) dν). It can be computed by the integral:

        T = ∫₀^L ½ (m/L) ( ν θ̇ (−sin θ, cos θ)⊤ )⊤ ( ν θ̇ (−sin θ, cos θ)⊤ ) dν

          = ½ (m/L) ∫₀^L ν² θ̇² dν = (m/(6L)) θ̇² ν³ |_{ν=0}^{ν=L} = (1/6) mL² θ̇²      (3.19)

Our “matrix" W is now reduced to 1/3 of the one we had for the mass concentrated at the
end of the rod, i.e.

        W = (1/3) mL²                                           (3.20)

(one can compare (3.19) and (3.10) to see that). 
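The integral above is easily verified symbolically. The sketch below (assuming sympy) recovers T = (1/6)mL²θ̇², i.e. W = mL²/3, the moment of inertia of a uniform rod about its end:

```python
# Symbolic check of (3.19): integrate the kinetic energy of the rod's
# infinitesimal slices, each of mass (m/L) dnu.
import sympy as sp

m, L, nu, thetadot = sp.symbols("m L nu thetadot", positive=True)

dT = sp.Rational(1, 2) * (m / L) * nu**2 * thetadot**2   # slice at position nu
T = sp.integrate(dT, (nu, 0, L))
print(T)

W = sp.simplify(2 * T / thetadot**2)   # from T = (1/2) W thetadot^2
print(W)
```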

3.2 POTENTIAL ENERGY


Let us provide a few examples of potential energy, some of which will be extensively used
in this course.

• The potential energy due to gravity in most “standard" mechanical applications de-
rives from:

V = mg z (3.21)

which gives the potential energy of a punctual mass m, where “z" is the “height" of
the mass in the field of gravity. Consider e.g. in the examples above that the mass m
is concentrated at the end of the rigid rod. The position of the mass is given by (3.17),
such that its vertical position is given by:

p z = L sin θ (3.22)

Its potential energy is then given by:

V = mg L sinθ (3.23)

In contrast, consider that the mass is distributed throughout the rod. Computing
the potential energy then follows similar lines as (3.19), i.e. we need to “sum up"
the potential energy of every “particle" of the rod. A “particle" is described here as an
infinitesimal piece of the rod, of length dν, having a mass (m/L) dν and a vertical position
ν sin θ. We can then do:

        V = (mg/L) ∫₀^L ν sin θ dν = (mg/L) · ½ ν² sin θ |_{ν=0}^{ν=L} = ½ mgL sin θ      (3.24)

hence the potential energy corresponds to considering that the whole mass of
the rod is concentrated at the half-length (this is to be put in contrast with the kinetic
energy computation, where one can consider that the whole mass of the rod is con-
centrated at a distance L/√3 from the origin!).

• The considerations above are valid for mechanical systems evolving “at a small scale"
in the field of gravity, for which one can consider the field as “straight" and uniform.
For mechanical systems evolving at a “large scale" such as e.g. a satellite orbiting a
planet, one needs to apply (at least) the genuine formula for classic gravity, i.e.:
        V = −G M m / r                                          (3.25)

where M is the mass of the planet and r is the distance to its center.

Note that the Lagrange formalism can handle potential energy from other sources (mag-
netic field, electric field, etc.), but we will not consider these cases in this course. Another
source of potential energy that is commonly present in mechanical systems is potential en-
ergy stored in flexible components. We will often simply consider springs in this course, but
any part of the system that undergoes elastic deformations does in principle store potential
energy.

Example 3.4 (Simple spring). Consider the spring illustrated below:

Figure 3.4: Illustration of a spring. The position x = 0 is set at the spring rest length. The
rigidity constant of the spring is k.

The force delivered by the spring is given by

F = −kx (3.26)

where x is the position of the spring end, relative to its rest position (position at which the
spring yields no force). The potential energy stored in the spring is given by:

        V = ∫_x^0 (−kν) dν = −½ kν² |_x^0 = ½ kx²               (3.27)
Note that if the reference frame is chosen such that the rest length of the spring is not at
x = 0 but rather at an arbitrary position x = x0 , then the potential energy is given by:
        V = ½ k(x − x0)²                                        (3.28)

Example 3.5 (2D spring). Let us consider now a slightly more complex example illustrated
in the figure below.

Figure 3.5: Illustration of a spring moving in the plane. The rigidity constant of the spring
is k. We suppose that the rest length of the spring is L 0 .

We consider that the position of the end of the spring is given by


· ¸
x
p= (3.29)
y
such that the elongation of the spring with respect to its rest length is given by:

∆ = ( p⊤p )^{1/2} − L₀ (3.30)
The potential energy stored in the spring is then given by:
V = (1/2) k ∆² = (1/2) k ( ( p⊤p )^{1/2} − L₀ )² (3.31)

The latter expression is fairly complex, as the (·)^{1/2} does not simplify with the (·)² unless L₀ = 0.
If L₀ = 0, then the potential energy simplifies to:

V = (1/2) k p⊤p (3.32)

In many exercises in this course, we will use the simplifying assumption that L₀ = 0 in order to get simpler expressions. 
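The claim that the square root only cancels against the square for L₀ = 0 is quick to check symbolically; a small sketch with sympy (names arbitrary):

```python
import sympy as sp

k, L0, x, y = sp.symbols("k L0 x y", positive=True)
p = sp.Matrix([x, y])

# Potential energy (3.31) of the planar spring
V = sp.Rational(1, 2) * k * (sp.sqrt(p.dot(p)) - L0) ** 2

# With L0 = 0 the square root cancels against the square, recovering (3.32)
V0 = sp.expand(V.subs(L0, 0))
print(V0)  # k*x**2/2 + k*y**2/2
```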

3.3 LAGRANGE EQUATION
Before detailing the Euler-Lagrange equation, we need to define the Lagrange function

L(q, q̇) = T(q, q̇) − V(q) ∈ ℝ (3.33)

simply made by subtracting the potential energy from the kinetic energy. In general, the Lagrange function takes the generalized coordinates q and their time derivative q̇ as arguments. Using the observations of the sections above, we observe that the Lagrange function takes the form:

L(q, q̇) = (1/2) q̇⊤ W(q) q̇ − V(q) (3.34)
The Euler-Lagrange equation then reads as:

d/dt ∂L/∂q̇ − ∂L/∂q = 0 (3.35)

and defines the model of the mechanical system. It is useful to observe here that (3.35) defines the equations as a row vector. Indeed, e.g. ∂L/∂q ∈ ℝ^{1×n_q} is a row vector. One can verify that the first term is (of course) also a row vector, such that the difference between the two terms is well defined. As we tend to prefer working with equations in a column vector form, it can be useful when needed to rewrite (3.35) in its transposed version, i.e.:

d/dt ∇q̇ L − ∇q L = 0 (3.36)

where we use the “gradient" notation:

∇q̇ L = ( ∂L/∂q̇ )⊤ ,  ∇q L = ( ∂L/∂q )⊤ (3.37)

This transformation is cosmetic and of secondary importance, but it is useful to point it out. Remember this observation in the following whenever we transpose the terms appearing in (3.35), as we will do whenever it helps us write things in a neat and structured form.

Before investigating how (3.35) delivers the model of the system, it is worth pausing here to understand the mathematical meaning of the first term in (3.35). Indeed, the partial derivative of the Lagrange function L with respect to the time derivative of the generalized coordinates q̇, i.e. ∂L/∂q̇, may appear surprising or ambiguous. In fact, it is a very straightforward operation. In order to perform it correctly, one has to consider q̇ as a variable in itself, independent of all other variables, and take the classical differential operations accordingly. More specifically, considering (3.34), the operation ∂L/∂q̇ simply yields:

∇q̇ L = ∇q̇ ( (1/2) q̇⊤ W(q) q̇ − V(q) ) = ∇q̇ ( (1/2) q̇⊤ W(q) q̇ ) = W(q) q̇. (3.38)

Forming the term d/dt ∂L/∂q̇ in the Lagrange equation requires a total differentiation with respect to time (operator d/dt). Using the chain rule, this yields:

d/dt ( W(q) q̇ ) = ∂/∂q̇ ( W(q) q̇ ) q̈ + ∂/∂q ( W(q) q̇ ) q̇ = W(q) q̈ + ∂/∂q ( W(q) q̇ ) q̇ (3.39)

We can then pack the developments above to observe that the Lagrange equation yields:

W(q) q̈ + ∂/∂q ( W(q) q̇ ) q̇ − ∇q L = 0,  where ∇q L = ∇q T − ∇q V (3.40)

In order to understand the “mechanics" behind using the Lagrange equations, it is useful to run through a handful of simple and classical examples. It is really useful to underline here that the computations performed hereafter are best carried out in a Computer Algebra System (CAS) such as the Matlab Symbolic Toolbox. Indeed, the computations required to deploy Lagrange mechanics are very systematic, but quickly become involved. It is best to perform them in the computer. In the exercises associated with this course, we recommend using this approach to generate your models.
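These systematic computations are exactly what a CAS automates. As an illustration, a minimal sympy sketch (the helper name lagrange_eom is ours, not part of any library) that forms (3.35) for an arbitrary Lagrange function, demonstrated here on a simple pendulum (a crane with the cart frozen):

```python
import sympy as sp

t = sp.symbols("t")

def lagrange_eom(Lag, coords):
    """Form the Euler-Lagrange equations (3.35), one per generalized
    coordinate: d/dt(dL/dqdot_i) - dL/dq_i = 0."""
    return [sp.simplify(sp.diff(Lag, qi.diff(t)).diff(t) - sp.diff(Lag, qi))
            for qi in coords]

# Demonstration on a simple pendulum of mass m and length ell
m, g, ell = sp.symbols("m g ell", positive=True)
theta = sp.Function("theta")(t)
Lag = sp.Rational(1, 2) * m * ell ** 2 * theta.diff(t) ** 2 \
      + m * g * ell * sp.cos(theta)

eom = lagrange_eom(Lag, [theta])[0]
print(eom)
```

The printed residual is (up to factoring) m ell² θ̈ + m g ell sin θ, i.e. the familiar pendulum equation.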

We can make a few observations here.

1. The Lagrange equation delivers the differential equation in an implicit form, i.e. it is a
set of equations relating the generalized coordinates and their first and second-order
time derivatives, i.e. q , q̇ , q̈ .

2. The Lagrange equation allows one to compute the system acceleration q̈ for a given system configuration q and its time derivative q̇ if matrix W(q) is invertible. It can be used for simulating the system, if the initial conditions q, q̇ are provided (i.e. we need to specify the “position" and “velocity" of the system in order to predict its future trajectory). For given q, q̇, the Lagrange equation can be solved for the acceleration q̈, from which the time evolution of q, q̇ can be computed (i.e. q is obtained by integrating q̇ and q̇ is obtained by integrating q̈).

3. The second term

∇q L = ∇q T − ∇q V (3.41)

in the Lagrange equation generally introduces forces that are intrinsic to the system (as opposed to forces applied “externally" to the system), such as e.g. forces deriving from potentials (coming from the “V" part, see the remark following (3.51)) and centrifugal forces (coming from the “T" part).

4. The Lagrange equation is linear in the accelerations q̈. This is true for Lagrange functions in the form (3.34), and stems from (3.39), which yields the first term in the Lagrange equation in the form W(q)q̈, which is linear in q̈.

5. Because the accelerations enter linearly in the Lagrange equation, we can solve it for q̈. More specifically, if one writes the Lagrange equation in the form (3.40), the accelerations q̈ can be explicitly expressed as:

q̈ = W(q)⁻¹ [ ∇q L − ∂/∂q ( W(q) q̇ ) q̇ ] (3.42)

Unfortunately, writing the model in this explicit form is not always a very good move.
Indeed, for systems that do not have a vector of generalized coordinates q of very low dimension (e.g. n_q > 2), the inverse W(q)⁻¹ can be very complex, even if matrix W(q) is fairly simple. An exception to this observation is the case where matrix W(q) is constant (i.e. not actually a function of q). In this case

∇q T = 0 ⇒ ∇q L = −∇q V and ∂/∂q ( W(q) q̇ ) q̇ = 0 (3.43)

such that (3.42) becomes:

q̈ = −W⁻¹ ∇q V (3.44)

where matrix W is purely “numerical" (i.e. it contains only numbers, not expressions), and is therefore easy to invert.

Let us try to nail down these remarks via a series of examples, starting simple and ramping
up to complex systems.

Example 3.6 (Spring-mass system). Consider the spring-mass system depicted in Fig. 3.6
below.

Figure 3.6: Vertical spring-mass system. The mass “0" position is set at the spring rest length. The rigidity constant of the spring is k, and the hanging mass is m.

Let us deploy the principles of Lagrange modelling on this simple system. Our goal is to
unpack what the Lagrange equation does, and what the different terms “physically" corre-
spond to. Let us start with setting up our generalized coordinates. We need to describe the
mass position. Fig. 3.6 proposes to set the “0" position at the rest length of the spring, and

consider that x increases when the mass goes down. This choice is not unique: we could
set this “0" position anywhere we want and decide that x increases when the mass goes up.
The kinetic energy function is simple to calculate for this kind of system. We can use (3.1),
where our position p ≡ x is a scalar (we work in 1D). We then simply get:
T = (1/2) m ẋ² (3.45)
One can observe here that our kinetic energy function is in the form (3.7), with W (x) = m.
The potential energy is composed of the sum of gravity and of the energy stored in the
spring (note that energies always add). These two terms read as:
Vgravity = −mgx and Vspring = (1/2) kx² (3.46)
Note the minus sign on Vgravity . One can easily guess it by observing that if x increases the
mass goes down and therefore the potential energy decreases. This requires the minus sign.
The potential energy of a spring is always given by the quadratic form used here, i.e. it is
given by the elongation of the spring (from rest length) squared, multiplied by the rigidity
and divided by two.
We can now assemble the Lagrange function:
L = T − ( Vgravity + Vspring ) = (1/2) m ẋ² + mgx − (1/2) kx² (3.47)
Once the Lagrange function is assembled, the modelling does not require “intelligence" anymore, as the rest is just calculus and algebraic manipulations in order to extract a differential equation from (3.35). This procedure can actually be automated, and its deployment entrusted to computer tools. Let us do the exercise of computing (3.35) here, starting with:

∇q̇ L = ∇ẋ L = m ẋ (3.48)

We observe that this term is a momentum. As a matter of fact, the term ∂L/∂q̇ will always describe the momentum present in the system. We can also compare this expression with (3.38).
Furthermore, we can compute:
d/dt ∇q̇ L = d/dt ( m ẋ ) = m ẍ (3.49)
We observe that this term is equivalent to the “m · a" of the F = m · a of Newton, i.e. it corresponds to a mass multiplied by an acceleration. As in Newton, the product of a mass by an acceleration will equate forces. This observation will also hold true for all systems, though “masses" and “forces" will take a somewhat more general meaning in Lagrange mechanics.
We can also compare this expression to (3.39), and observe that
∂/∂q ( W(q) q̇ ) q̇ = ( ∂W/∂x ) ẋ² = 0 (3.50)

because W is constant. We will see later on that having a matrix W (q ) that is constant
simplifies significantly the resulting model equations.
We can turn to computing:

∇q L = ∇x L = mg − kx (3.51)

We ought to observe that this term delivers the forces acting on the system resulting from
the potentials present in the system. We can now assemble the Lagrange equations, i.e.
(3.35) for this simple example reads as:

m ẍ − mg + kx = 0 (3.52)

Because the acceleration ẍ enters linearly in the Lagrange equation, we can solve it for ẍ.
Here we can trivially manipulate (3.52) to get:

m ẍ = mg − kx. (3.53)

This equation is essentially the Newton equation “F = m · a" for the spring-mass system.
Indeed, we observe that:

m ẍ = mg − kx, where m ẍ ≡ m · a and mg − kx ≡ F (3.54)
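Equation (3.53) can be simulated directly; a pure-Python sketch using semi-implicit Euler integration (parameter values arbitrary), compared against the known analytic solution x(t) = (mg/k)(1 − cos(√(k/m) t)) for a mass released at rest from x(0) = 0:

```python
import math

def simulate(m, k, g, dt=1e-4, t_end=2.0):
    """Integrate m*xddot = m*g - k*x, eq (3.53), with semi-implicit Euler,
    starting at rest from the spring rest position x(0) = 0."""
    x, xd = 0.0, 0.0
    for _ in range(int(round(t_end / dt))):
        xd += dt * (g - k / m * x)  # velocity update from (3.53)
        x += dt * xd                # position update with the new velocity
    return x

m, k, g, t_end = 1.0, 4.0, 9.81, 2.0
x_num = simulate(m, k, g, t_end=t_end)
x_exact = m * g / k * (1.0 - math.cos(math.sqrt(k / m) * t_end))
print(abs(x_num - x_exact))  # small discretization error
```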

Example 3.7 (Linear crane). We will now consider a linear crane as depicted in Fig. 3.7.
This kind of system is e.g. important for loading/unloading cargo ships in large harbors.
The proposed generalized coordinates are visible in Fig. 3.7, i.e.

q = [ x ; θ ] ∈ ℝ² (3.55)

is made of the position of the cart on the rail, and the angle θ is the angle of the rod linking
the cart to the hanging mass with respect to the vertical. We observe here that there is
no problem mixing positions and angles (and any other physical quantity describing the
“configuration" of a mechanical system) in the generalized coordinates.

Figure 3.7: Simple hanging crane. The mass M moves on a rail (position x) and mass m
is linked to it via a massless rod of length L. We describe the position of the
hanging mass via the angle θ.

We can now compute the kinetic and potential energy functions. As energies are additive, we can compute them separately for the two masses. The kinetic energy of the cart is simply Tcart = (1/2) M ẋ². The energy of the hanging mass is a bit more complex to compute here, but can still be simply derived from formula (3.1). Indeed, we observe that the position of the hanging mass is given by:

p_m = [ x + L sin θ ; −L cos θ ] (3.56)

Note that the positive sign in the first line will specify in which direction a positive angle
will “rotate" the hanging mass. We can then compute:
ṗ_m = ( ∂p_m/∂q ) q̇ = [ ẋ + L θ̇ cos θ ; L θ̇ sin θ ] (3.57)

and
‖ṗ_m‖² = ṗ_m⊤ ṗ_m = L²θ̇² + ẋ² + 2L θ̇ ẋ cos θ (3.58)

The kinetic energy of the hanging crane is therefore given by:

T = (1/2) M ẋ² + (1/2) m ṗ_m⊤ ṗ_m = (1/2)(m + M) ẋ² + (1/2) m L² θ̇² + mL θ̇ ẋ cos θ (3.59)
We can deduce that the matrix W (q ) is now:
W(q) = [ m + M , mL cos θ ; mL cos θ , mL² ] (3.60)

and is not constant in this example (it is a function of θ). It can be useful to observe here that the matrix W(q) can be readily computed from the kinetic energy function by computing its Hessian with respect to q̇, i.e.

W(q) = ∂²T(q, q̇) / ∂q̇² (3.61)
The potential energy of the hanging crane is just gravity, and is related to the “z" position
of the hanging mass, i.e.:
V = mg [ 0 1 ] p_m = −mgL cos θ (3.62)

Our Lagrange function can now be assembled, and reads as:


L = (1/2)(m + M) ẋ² + (1/2) m L² θ̇² + mL θ̇ ẋ cos θ + mgL cos θ (3.63)
We can now develop the terms of the Lagrange equation:
∇q̇ L = [ (M + m) ẋ + mL θ̇ cos θ ; mL² θ̇ + mL ẋ cos θ ] = W(q) q̇ (3.64)
and
d/dt ∇q̇ L = [ (M + m) ẍ + mL cos θ θ̈ − mL sin θ θ̇² ; mL cos θ ẍ + mL² θ̈ − mL θ̇ ẋ sin θ ] (3.65)
Moreover,
∇q L = [ 0 ; −mL sin θ ( g + θ̇ ẋ ) ] (3.66)
The Lagrange equation then reads as:
[ (M + m) ẍ + mL cos θ θ̈ − mL sin θ θ̇² ; mL cos θ ẍ + mL² θ̈ + mgL sin θ ] = 0 (3.67)
The acceleration q̈ enters linearly in (3.67). We can actually observe that (3.67) can be writ-
ten as:
W(q) q̈ + mL sin θ [ −θ̇² ; g ] = 0 (3.68)
And inverting W(q), we can write:

q̈ = mL sin θ W(q)⁻¹ [ θ̇² ; −g ] (3.69)
Unfortunately, this construction yields fairly complex expressions. We can verify that:
q̈ = ( sin θ / ( M + m − m cos² θ ) ) [ m ( L θ̇² + g cos θ ) ; −(1/L) ( (M + m) g + mL cos θ θ̇² ) ] (3.70)
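The whole derivation above can be reproduced and checked in a CAS; a sympy sketch (variable names ours) that rebuilds the crane model from T and V and spot-checks the resulting accelerations against the closed form (3.70) at an arbitrary numerical state:

```python
import sympy as sp

t = sp.symbols("t")
M, m, L, g = sp.symbols("M m L g", positive=True)
x = sp.Function("x")(t)
th = sp.Function("theta")(t)

# Kinetic and potential energies of the crane, eqs (3.59) and (3.62)
p_m = sp.Matrix([x + L * sp.sin(th), -L * sp.cos(th)])
T = sp.Rational(1, 2) * M * x.diff(t) ** 2 \
    + sp.Rational(1, 2) * m * p_m.diff(t).dot(p_m.diff(t))
V = -m * g * L * sp.cos(th)
Lag = T - V

# Euler-Lagrange residuals d/dt(dL/dqdot) - dL/dq, solved for the accelerations
res = [sp.diff(Lag, qi.diff(t)).diff(t) - sp.diff(Lag, qi) for qi in (x, th)]
xdd, thdd = x.diff(t, 2), th.diff(t, 2)
sol = sp.solve(res, [xdd, thdd], dict=True)[0]

# Closed-form accelerations (3.70), for comparison
den = M + m - m * sp.cos(th) ** 2
xdd_ref = sp.sin(th) * m * (L * th.diff(t) ** 2 + g * sp.cos(th)) / den
thdd_ref = -sp.sin(th) / L * ((M + m) * g + m * L * sp.cos(th) * th.diff(t) ** 2) / den

# Numerical spot check at an arbitrary state (derivatives substituted first)
def ev(e):
    for key, val in [(x.diff(t), 1.1), (th.diff(t), -0.4), (th, 0.7), (x, 0.2)]:
        e = e.subs(key, val)
    return float(e.subs({M: 2.0, m: 0.5, L: 1.3, g: 9.81}))

print(ev(sol[xdd] - xdd_ref), ev(sol[thdd] - thdd_ref))  # both ~0
```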


Example 3.8 (Elastic crane). Let us consider an elastic crane, identical to the one depicted
in Fig. 3.7, but where the rod is not rigid, hence the length L becomes variable, and the
associated potential energy stored in the rod is then given by:
Vrod = (1/2) K (L − L₀)² (3.71)
where L₀ is the rest length of the rod. We will now adopt the generalized coordinates

q = [ x ; θ ; L ] (3.72)
The kinetic energy of the system now needs to account for L being time-varying, but the
computations performed previously do not change. I.e. we can compute:
ṗ_m = ( ∂p_m/∂q ) q̇ = [ ẋ + L θ̇ cos θ + L̇ sin θ ; L θ̇ sin θ − L̇ cos θ ] (3.73)
and
‖ṗ_m‖² = ṗ_m⊤ ṗ_m = L²θ̇² + ẋ² + 2L θ̇ ẋ cos θ + L̇² + 2 L̇ ẋ sin θ (3.74)
The kinetic energy of the hanging crane is therefore given by:
T = (1/2) M ẋ² + (1/2) m ṗ_m⊤ ṗ_m = (1/2)(m + M) ẋ² + (1/2) m L² θ̇² + mL θ̇ ẋ cos θ + (1/2) m L̇² + m L̇ ẋ sin θ (3.75)
Matrix W(q) is somewhat more complex in this example. It reads as:

W(q) = [ m + M , mL cos θ , m sin θ ; mL cos θ , mL² , 0 ; m sin θ , 0 , m ] (3.76)
It is the same as (3.60), but with an extra row and an extra column stemming from the new
coordinate L. Our potential energy now includes the new term (3.71) corresponding to the
elastic rod. It reads as:
V = −mgL cos θ + (1/2) K (L − L₀)² (3.77)
We can then assemble the Lagrange function and compute the different terms of the Lagrange equation. We have:

∇q̇ L = [ (M + m) ẋ + mL θ̇ cos θ + m L̇ sin θ ; mL² θ̇ + mL ẋ cos θ ; m L̇ + m ẋ sin θ ] = W(q) q̇ (3.78a)

d/dt ∇q̇ L = [ (M + m) ẍ + mL cos θ θ̈ − mL sin θ θ̇² + 2m L̇ cos θ θ̇ + m L̈ sin θ ; mL cos θ ẍ + mL² θ̈ − mL θ̇ ẋ sin θ + m L̇ ẋ cos θ + 2mL L̇ θ̇ ; m L̈ + m ẍ sin θ + m θ̇ ẋ cos θ ] (3.78b)

∇q L = [ 0 ; −m ( Lg sin θ − L̇ ẋ cos θ + L θ̇ ẋ sin θ ) ; mL θ̇² + m ẋ cos θ θ̇ − K (L − L₀) + mg cos θ ] (3.78c)

The Lagrange equation then reads as:

[ (M + m) ẍ + m sin θ L̈ + mL cos θ θ̈ − mL sin θ θ̇² + 2m L̇ cos θ θ̇ ; mL ( L θ̈ + 2 L̇ θ̇ + ẍ cos θ + g sin θ ) ; m L̈ + m sin θ ẍ + K (L − L₀) − mg cos θ − mL θ̇² ] = 0 (3.79)
As previously, the acceleration q̈ enters linearly in (3.79). We can again observe that (3.79) can be written as:

W(q) q̈ + [ m θ̇ ( 2 L̇ cos θ − L θ̇ sin θ ) ; mL ( 2 L̇ θ̇ + g sin θ ) ; −mL θ̇² + K (L − L₀) − mg cos θ ] = 0 (3.80)
and the acceleration can be computed explicitly using:

q̈ = −W(q)⁻¹ [ m θ̇ ( 2 L̇ cos θ − L θ̇ sin θ ) ; mL ( 2 L̇ θ̇ + g sin θ ) ; −mL θ̇² + K (L − L₀) − mg cos θ ] (3.81)
Unfortunately, (3.81) yields very long expressions and we skip writing it here. 
Example 3.9 (Elastic crane, cont’d). Let us revisit the elastic crane example with a different
choice of generalized coordinates. Instead of specifying the position of the hanging mass
using (θ, L) (these are polar coordinates), it is equally valid to use cartesian coordinates. Let
us describe the hanging mass using

p = [ p₁ ; p₂ ] ∈ ℝ² (3.82)

as illustrated in Fig. 3.8, and we choose to order our generalized coordinates q as:

q = [ x ; p ] ∈ ℝ³. (3.83)

Figure 3.8: Simple hanging crane similar to Fig. 3.7. Here we use a cartesian coordinate
system.

The kinetic energy then reads as:

T = (1/2) M ẋ² + (1/2) m ṗ⊤ ṗ (3.84)
and the W(q) matrix reads as:

W(q) = [ M , 0 , 0 ; 0 , m , 0 ; 0 , 0 , m ] (3.85)

and is constant. The potential energy in the rod reads as:

Vrod = (1/2) K (L − L₀)² where L = ‖ p − [ x ; 0 ] ‖ (3.86)

and the potential energy of the hanging mass is:

Vgravity = −mg p₂ (3.87)

(note that the minus sign here is due to the reference frame chosen for the hanging mass,
with the vertical basis vector oriented down, see Fig. 3.8). Hence the Lagrange function
reads as:
L = T − V = (1/2) M ẋ² + (1/2) m ṗ⊤ ṗ − (1/2) K (L − L₀)² + mg p₂ (3.88)
We can then compute:

∇q̇ L = [ M ẋ ; m ṗ ] (3.89a)

d/dt ∇q̇ L = [ M ẍ ; m p̈ ] (3.89b)

∇q L = [ 0 ; 0 ; mg ] + ( K (L − L₀) / L ) [ p₁ − x ; x − p₁ ; −p₂ ] (3.89c)

And write the model as:

[ ẍ ; p̈ ] = [ 0 ; 0 ; g ] + ( K (L − L₀) / L ) [ M⁻¹ (p₁ − x) ; m⁻¹ (x − p₁) ; −m⁻¹ p₂ ] (3.90)
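Since W is constant here, the right-hand side of (3.90) is cheap to evaluate; a pure-Python sketch (parameter values arbitrary) checking that the static equilibrium p₁ = x, p₂ = L₀ + mg/K yields zero acceleration:

```python
import math

def accel(x, p1, p2, M, m, K, L0, g=9.81):
    """Right-hand side of (3.90): accelerations (x'', p1'', p2'')."""
    L = math.hypot(p1 - x, p2)   # current rod length
    s = K * (L - L0) / L         # common elastic factor
    return (s * (p1 - x) / M, s * (x - p1) / m, g - s * p2 / m)

M, m, K, L0, g = 2.0, 0.5, 100.0, 1.0, 9.81
# Static equilibrium: rod vertical, stretched by the hanging weight m*g/K
a = accel(0.0, 0.0, L0 + m * g / K, M, m, K, L0, g)
print(a)  # all three components ~0
```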

Figure 3.9: Illustration of the elastic hanging chain

Example 3.10 (Hanging chain). Consider an elastic hanging chain as depicted in Fig. 3.9. The chain is made of N point masses, each of mass m, that can move in a 3-dimensional space, linked together by elastic rods of rigidity K, and linked to two fixed points by elastic rods as well. The “configuration" of the hanging chain can be described by listing the positions p_{1,...,N} of each mass in the vector q ∈ ℝ^{3N}. We will label the fixed end-points as p₀ and p_{N+1}, and these two points will not be part of the generalized coordinates q. For the sake of simplicity, we will consider that the elastic links have a rest length L₀ = 0.
The kinetic energy is given by

T = (1/2) m q̇⊤ q̇, (3.91)

and hence the matrix W(q) reads as

W(q) = mI, (3.92)

where I ∈ ℝ^{3N×3N} is the identity matrix, and is therefore constant. The potential energy function reads as:

V = Σ_{k=1}^{N} mg [ 0 0 1 ] p_k + (1/2) K Σ_{k=0}^{N} ‖ p_{k+1} − p_k ‖² (3.93)

where the first sum is the gravity term and the second the spring term.

Here p₀ and p_{N+1} are the fixed positions of the end points. The Lagrange function reads as:

L = (1/2) m q̇⊤ q̇ − Σ_{k=1}^{N} mg [ 0 0 1 ] p_k − (1/2) K Σ_{k=0}^{N} ‖ p_{k+1} − p_k ‖² (3.94)

We can then compute:

∇q̇ L = m q̇ (3.95a)

d/dt ∇q̇ L = m q̈ (3.95b)

∇q L = K [ p₀ − p₁ ; p₁ − p₂ ; … ; p_{N−1} − p_N ] + K [ p₂ − p₁ ; p₃ − p₂ ; … ; p_{N+1} − p_N ] − m [ g ; … ; g ],  g = [ 0 ; 0 ; g ] (3.95c)

The Lagrange equation then delivers:

m q̈ − K [ p₀ − p₁ ; … ; p_{N−1} − p_N ] − K [ p₂ − p₁ ; … ; p_{N+1} − p_N ] + m [ g ; … ; g ] = 0 (3.96)

which is trivial to put in an explicit form:

q̈ = (K/m) [ p₀ − p₁ ; … ; p_{N−1} − p_N ] + (K/m) [ p₂ − p₁ ; … ; p_{N+1} − p_N ] − [ g ; … ; g ] (3.97)
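Since L₀ = 0, the explicit dynamics (3.97), i.e. p̈_k = (K/m)(p_{k−1} − p_k) + (K/m)(p_{k+1} − p_k) − g e, are linear in q. A pure-Python sketch (parameters arbitrary) evaluating that right-hand side and checking that it vanishes on the discrete-parabola equilibrium, whose vertical second differences equal mg/K:

```python
def chain_accel(p, m, K, g):
    """Right-hand side of (3.97) for N masses in 3D; p is a list of N+2
    positions, p[0] and p[-1] being the fixed end points."""
    return [tuple(K / m * (p[k - 1][i] - p[k][i])
                  + K / m * (p[k + 1][i] - p[k][i])
                  - (g if i == 2 else 0.0) for i in range(3))
            for k in range(1, len(p) - 1)]

N, m, K, g = 8, 0.2, 50.0, 9.81
p0, pN1 = (0.0, 0.0, 0.0), (3.0, 0.0, 0.0)
# Equilibrium: vertical second differences equal m*g/K (a discrete parabola)
gamma = m * g / (2.0 * K)
p = [(p0[0] + k / (N + 1) * (pN1[0] - p0[0]), 0.0,
      gamma * k * k - gamma * (N + 1) * k) for k in range(N + 2)]
a = chain_accel(p, m, K, g)
print(max(abs(c) for pt in a for c in pt))  # ~0
```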

Example 3.11 (2-pendulum crane). Let us consider now a 2-pendulum crane as illustrated
in Fig. 3.10.

Figure 3.10: Illustration of the 2-pendulum crane. The cart of mass m moves on a rail (posi-
tion x, not shown here) and the two hanging masses also of mass m are linked
to it via massless rods of length L. We describe the position of the hanging
masses via the angle θ1,2 (relative to the verticals).

We proceed with building the kinetic and potential energy functions for this system. To that end, it is best to start with describing the position of the two hanging masses (in a cartesian reference frame), as functions of the generalized coordinates

q = [ x ; θ₁ ; θ₂ ] (3.98)

The cart is at a position:

p₀ = [ x ; 0 ] (3.99)

The first hanging mass is at a cartesian position:

p₁ = p₀ + R(θ₁) [ 0 ; −L ] (3.100)

where

R(θ₁) = [ cos θ₁ , −sin θ₁ ; sin θ₁ , cos θ₁ ] (3.101)

and the second hanging mass is at a position:

p₂ = p₁ + R(θ₂) [ 0 ; −L ] (3.102)

We can then readily apply the chain rule (3.2) to obtain ṗ_{0,...,2}. The kinetic energy function then reads as (the cart and the two hanging masses are all of mass m):

T = (1/2) Σ_{k=0}^{2} m ṗ_k⊤ ṗ_k (3.103)

but it is a fairly complex expression given by:

T = (1/2) q̇⊤ W(q) q̇ (3.104)

where matrix W(q) reads as:
where matrix W (q ) reads as:
 
W(q) = m [ 3 , 2L cos θ₁ , L cos θ₂ ; 2L cos θ₁ , 2L² , L² cos (θ₁ − θ₂) ; L cos θ₂ , L² cos (θ₁ − θ₂) , L² ] (3.105)

The potential energy function is given by:

V = Σ_{k=1}^{2} mg p_{k,2} = −2mgL cos θ₁ − mgL cos θ₂ (3.106)

We can then compute:

∇q̇ L = W(q) q̇ (3.107a)

d/dt ∇q̇ L = W(q) q̈ + ∂/∂q ( W(q) q̇ ) q̇ (3.107b)

∇q L = −mL [ 0 ; 2g sin θ₁ + 2 ẋ θ̇₁ sin θ₁ + L θ̇₁ θ̇₂ sin(θ₁ − θ₂) ; g sin θ₂ + ẋ θ̇₂ sin θ₂ − L θ̇₁ θ̇₂ sin(θ₁ − θ₂) ] (3.107c)

where

∂/∂q ( W(q) q̇ ) q̇ = −mL [ 2 sin θ₁ θ̇₁² + sin θ₂ θ̇₂² ; L θ̇₁ θ̇₂ sin(θ₁ − θ₂) + 2 ẋ θ̇₁ sin θ₁ − L θ̇₂² sin(θ₁ − θ₂) ; −L θ̇₁ θ̇₂ sin(θ₁ − θ₂) + ẋ θ̇₂ sin θ₂ + L θ̇₁² sin(θ₁ − θ₂) ] (3.108)

One can assemble the resulting model in e.g. its explicit form by computing (identically to (3.42)):

q̈ = W(q)⁻¹ [ ∇q L − ∂/∂q ( W(q) q̇ ) q̇ ] (3.109)

However, the symbolic complexity resulting from computing (3.109) is high (and is not reported here for this reason). 

3.4 EXTERNAL FORCES


The approach we have looked at so far does not include the possibility of having external forces and moments. We will close this gap now. One ought to observe that the Lagrange modelling approach is intrinsically an energy-based point of view of the mechanical system. As a result, in the Lagrange approach we will have to look at the external forces and moments in terms of the energy they deliver to or remove from the system. In other words, we will have to consider the work produced on the system by the external forces and moments. Work is related to motions occurring under forces and moments, and motions are described in the Lagrange approach as changes in the generalized coordinates q. External forces and moments are included in the Lagrange formulation in terms of generalized forces, which we will label Q, and which describe the amount of work produced on the system when moving the generalized coordinates. More formally, the generalized forces should satisfy:

∇q E = Q (3.110)

i.e. they describe the change of energy in the system when the generalized coordinates q
are “moved a little bit".

This concept is often presented in the literature via the following equation:

δW = 〈 Q , δq 〉 (3.111)

which is essentially saying the same as (3.110), i.e. a “small" motion δq, combined with the external forces and moments, produces a “small" amount of work δW (a change of energy in the system), and the generalized forces Q relate the small motion to the small amount of work produced. Once the generalized forces Q are known, they can be readily included in the Lagrange formalism using:

d/dt ∂L/∂q̇ − ∂L/∂q = Q⊤ (3.112)
A fairly systematic procedure can be derived from these concepts whenever a force or a torque is applied punctually somewhere in the system. Suppose that a force given by vector F ∈ ℝⁿ (where n = 1, 2 or 3 is the number of dimensions in which we are working) in a given fixed reference frame R is applied at a specific point of the system, having a position p ∈ ℝⁿ in the same reference frame R. We then observe that a small change in the generalized coordinates yields a small displacement of the position p given by the Jacobian ∂p/∂q, and a small work:

δW = F⊤ ( ∂p/∂q ) δq (3.113)

It follows that in this case the generalized force corresponding to F is given by:

Q = ( ∂p/∂q )⊤ F = ∇q p F (3.114)
This principle can be easily extended to several forces. When a set of forces F_{1,...,m} is applied to a set of points p_{1,...,m}, then the generalized force is given by:

Q = Σ_{i=1}^{m} ( ∂p_i/∂q )⊤ F_i ≡ Σ_{i=1}^{m} ∇q p_i F_i (3.115)

A couple of simple examples illustrate how these concepts can be used in practice:
• Consider the simple case of a point whose position is described in an orthonormal reference frame, such that the generalized coordinates are q ∈ ℝ³, and subject to a force F ∈ ℝ³ described in the same reference frame. A small displacement q → q + δq combined with the force F yields a small amount of work:
combined with the force F yields a small amount of work:
δW = 〈 F , δq 〉 = F ⊤ δq (3.116)
such that in this specific case Q = F , i.e. the generalized force in this system is the
force F itself. One can verify that this is also given by (3.115). Indeed, since the posi-
tion of the point is readily given by the generalized coordinates, i.e. p = q , it follows
that an application of (3.115) yields (where m = 1):
Q = ( ∂p/∂q )⊤ F = F, since ∂p/∂q = I (3.117)

• The very same principle applies to moments and rotations. Consider an axis of rotation whose position is described via the angle θ ∈ ℝ, and subject to a moment T ∈ ℝ. The amount of work produced for a small motion δθ due to the moment T is then given by:

δW = 〈 T , δθ 〉 = T δθ (3.118)

such that the generalized force for this case is Q = T.

Example 3.12 (2-pendulum crane, cont’d). Consider the crane example illustrated in Fig.
3.10. Recall that the position of the two masses in a fixed reference frame is given by (see
(3.100) and (3.102)):
p₁ = [ x ; 0 ] + R(θ₁) [ 0 ; −L ] (3.119a)

p₂ = p₁ + R(θ₂) [ 0 ; −L ] (3.119b)

with the generalized coordinates

q = [ x ; θ₁ ; θ₂ ] (3.120)

We then observe that:

( ∂p₁/∂q )⊤ = [ 1 , 0 ; L cos θ₁ , L sin θ₁ ; 0 , 0 ] ,  ( ∂p₂/∂q )⊤ = [ 1 , 0 ; L cos θ₁ , L sin θ₁ ; L cos θ₂ , L sin θ₂ ] (3.121)

If forces F₁,₂ ∈ ℝ² are applied to masses m₁, m₂ respectively, then the generalized force reads as:

Q = [ 1 , 0 ; L cos θ₁ , L sin θ₁ ; 0 , 0 ] F₁ + [ 1 , 0 ; L cos θ₁ , L sin θ₁ ; L cos θ₂ , L sin θ₂ ] F₂ (3.122)
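The Jacobians in (3.121) are easy to validate numerically; a pure-Python finite-difference sketch (values arbitrary):

```python
import math

L = 1.5

def positions(q):
    """Forward kinematics (3.119) for q = (x, theta1, theta2)."""
    x, t1, t2 = q
    p1 = (x + L * math.sin(t1), -L * math.cos(t1))
    p2 = (p1[0] + L * math.sin(t2), p1[1] - L * math.cos(t2))
    return p1, p2

def jac_fd(q, idx, h=1e-6):
    """Central finite-difference Jacobian (2x3) of p_idx w.r.t. q."""
    out = []
    for i in range(2):
        row = []
        for j in range(3):
            qp, qm = list(q), list(q)
            qp[j] += h
            qm[j] -= h
            row.append((positions(qp)[idx][i] - positions(qm)[idx][i]) / (2 * h))
        out.append(row)
    return out

q = (0.3, 0.5, -0.2)
x, t1, t2 = q
# Analytic Jacobians, the transposes of the two matrices in (3.121)
J1 = [[1.0, L * math.cos(t1), 0.0], [0.0, L * math.sin(t1), 0.0]]
J2 = [[1.0, L * math.cos(t1), L * math.cos(t2)],
      [0.0, L * math.sin(t1), L * math.sin(t2)]]
err = max(abs(jac_fd(q, 0)[i][j] - J1[i][j]) for i in range(2) for j in range(3))
err2 = max(abs(jac_fd(q, 1)[i][j] - J2[i][j]) for i in range(2) for j in range(3))
print(err, err2)  # both tiny
```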

3.5 CONSTRAINED LAGRANGE MECHANICS


We have been discussing so far cases where the generalized coordinates can “move inde-
pendently" from each other, in the sense that any element of the set Rnq is admissible for
the vector of generalized coordinates q . E.g.

• In the spring-mass system, any position x is “acceptable" for the system (even if for
practical reasons the position of the mass may be restricted in a real system)

• In the hanging crane example, any values of θ and x are valid (even if in practice the rail may have a limited length and the hanging mass may not be physically allowed to get above the rail)

To clarify where we are going here, let us consider a simple example where the generalized
coordinates are not free to “move independently".

Example 3.13 (Bowl). Let us consider a “bowl" in 3D, described by the scalar equation c(p) = 0 (see Fig. 3.11 for an illustration), where p ∈ ℝ³ are cartesian coordinates and

c(p) = p₃ − (1/4) ( p₂² + p₁² ) ∈ ℝ (3.123)

Figure 3.11: Illustration of the bowl example for (3.123).

Let us consider a mass m moving in 3D, but forced to slide on the surface of the bowl. The position of the mass can be described by the cartesian coordinates p ∈ ℝ³, but since the mass moves at the surface of the bowl, the position p is not “free" to move everywhere, but is “constrained" to move in the 2D “space" described by (3.123). Indeed, only positions that satisfy c(p) = 0 are admissible. Put differently, the generalized coordinates are not “free" to move independently. We will return to the example after having formally introduced some new concepts. 

More formally, we will consider mechanical systems of generalized coordinates q ∈ ℝ^{n_q} where the generalized coordinates are constrained by c(q) = 0 with c : ℝ^{n_q} ↦ ℝ^{n_c} and n_c < n_q. Mathematically, c(q) = 0 describes a manifold in ℝ^{n_q}, i.e. a (not necessarily flat) surface of dimension n_q − n_c. We ought to observe here that the dimension of the manifold in which the system moves, i.e. the difference n_q − n_c, is actually the number of degrees of freedom (DoF) that the system physically has. Being able to treat this type of problem systematically is extremely useful in a number of complex mechanical applications.

Fortunately, the Lagrange formalism we have discussed so far can be readily extended to
constrained problems, with minimal changes. The construction of the kinetic and potential
energy functions for the system can be done without any regard to the constraint function
c(q) = 0. The constraints appear in the Lagrange function, which is then assembled as:

L(q, q̇, z) = T(q, q̇) − V(q) − 〈 z , c(q) 〉 (3.124)

where z is a new set of variables, usually labelled the Lagrange multipliers associated to the constraint function c. When function c(q) has as image space the vector space ℝ^{n_c} (i.e. when it returns a “standard" vector), then z ∈ ℝ^{n_c} and we can rewrite (3.124) simply as²:

L(q, q̇, z) = T(q, q̇) − V(q) − z⊤ c(q) (3.125)

Beyond this point, the Lagrange formalism applies without any modification, i.e. the dynamics associated to a system of kinetic and potential energy T and V, constrained to evolve on the manifold c(q) = 0, are described by:

d/dt ∂L/∂q̇ − ∂L/∂q = Q⊤ (3.126a)

c(q) = 0 (3.126b)

The mathematical justification for this construction is beyond the scope of this course, but we will try to build some intuitions via examples. Before deploying examples of application of the constrained Lagrange equation, we can do the same work as in (3.38) and (3.39). I.e. one can verify that the equalities:

∇q̇ L = W(q) q̇ (3.127)

and

d/dt ∇q̇ L = d/dt ( W(q) q̇ ) = W(q) q̈ + ∂/∂q ( W(q) q̇ ) q̇ (3.128)

still hold, and are actually not impacted by the presence of the constraints c(q). The constraints modify the Lagrange equation via the term:

∇q L = ∇q T − ∇q V − ∇q c · z (3.129)

We ought to recall here remark 3. of Section 3.3: the term ∇q L in the Lagrange equation
holds forces that are “intrinsic" to the system (potential, centrifugal). This remark still holds
here, and we will see in the illustrative examples below that the term ∇q c · z is in fact akin
to a force “keeping the system on the manifold c ".

² The generic form (3.124) allows one to consider constraint functions in “exotic" vector spaces such as e.g. matrix spaces or Hilbert spaces. We will discuss only the simple case of “standard" vector spaces here.

As observed previously in Section 3.3, we can assemble the Lagrange equation (3.126) in the form:

W(q) q̈ + ∂/∂q ( W(q) q̇ ) q̇ − ∇q T + ∇q V + ∇q c · z = Q (3.130a)

c(q) = 0 (3.130b)

And the accelerations q̈ can be explicitly expressed from (3.130a) as:

q̈ = W(q)⁻¹ [ Q + ∇q T − ∇q V − ∇q c · z − ∂/∂q ( W(q) q̇ ) q̇ ] (3.131)

and as previously, for W(q) constant, this equation reduces to:

q̈ = W⁻¹ [ Q − ∇q V − ∇q c · z ] (3.132)

We can observe again here that the Lagrange equation (and their explicit forms (3.131) and
(3.132)) deliver the acceleration q̈ as a function of q , q̇, Q and of the Lagrange multipliers
z . Hence, a simulation of the model could be produced for given initial conditions q (0),
q̇(0), given external forces Q(t ) (given at all time) and the Lagrange multipliers z (t ). Hence,
in order to compute a simulation of the model equations, we need to calculate somehow
the (time-varying) Lagrange multipliers z at every time instant of the simulation. We will
investigate in the next section how to do that.

Let us now consider a list of examples in order to digest these theoretical concepts, and to
gather some intuitions on the meaning of the constrained Lagrange equation (3.126).

Example 3.14 (Bowl, cont’d). Let us start with our bowl example of Fig. 3.11 to detail the procedure behind the constrained Lagrange equation (3.126). The mass position is described in the cartesian coordinates q ≡ p ∈ ℝ³, such that the kinetic and potential energy functions read as:

T = (1/2) m ṗ⊤ ṗ,  V = mg [ 0 0 1 ] p (3.133)
2
Note that we can assemble these functions by completely disregarding the existence of the constraint c(q) = 0. We then assemble the constrained Lagrange function as per (3.125):

L = (1/2) m ṗ⊤ ṗ − mg [ 0 0 1 ] p − z⊤ ( p₃ − (1/4) ( p₂² + p₁² ) ) (3.134)
Note that here the constraint is scalar, such that z ∈ ℝ, and the scalar product z⊤ c(q) is a simple product here, i.e. we can rewrite the Lagrange function:

L = (1/2) m ṗ⊤ ṗ − mg [ 0 0 1 ] p − z ( p₃ − (1/4) ( p₂² + p₁² ) ) (3.135)

and deploy the Lagrange equation as usual:

∇q̇ L = m ṗ        (3.136a)
d/dt ∇q̇ L = m p̈        (3.136b)
∇q L = −mg (0, 0, 1)ᵀ − ∇q c · z,   where  ∇q c = (−½ p₁, −½ p₂, 1)ᵀ        (3.136c)

The dynamics of the mass in the bowl are then given by:

m p̈ = −mg (0, 0, 1)ᵀ − z (−½ p₁, −½ p₂, 1)ᵀ        (3.137a)
0 = p₃ − ¼ (p₂² + p₁²)        (3.137b)
As detailed in the theory above, it is worth here briefly underlining the following:

• Equation (3.137a) delivers the system acceleration p̈ for p and z known.

• Equation (3.137b) is scalar, and describes the condition that the position p ought to satisfy in order to “be on" the manifold (i.e. the bowl here). However, one ought to observe that (3.137b) does not deliver the unknown z. As a matter of fact, z does not even appear in (3.137b). Hence, while (3.137) provides 4 equations for the 4 unknown variables q̈, z, it cannot be solved for them as such, since it fails to determine z.

Example 3.15 (Crane, revisited). Let us revisit the crane of Fig. 3.8 in cartesian coordinates,
assuming that the link between the cart and the hanging mass is (infinitely) rigid. The
kinetic energy function is built in the same way as in the elastic example above, i.e. we use
the coordinates
p = (p₁, p₂)ᵀ ∈ R²        (3.138)

for the position of the hanging mass, and x for the position of the cart. We choose to order our generalized coordinates q as:

q = [ x ; p ]        (3.139)

The kinetic energy reads as (3.84), i.e.:

T = ½ M ẋ² + ½ m ṗᵀ ṗ        (3.140)

and the W matrix reads as (3.85), i.e.:
 
W = diag(M, m, m)        (3.141)

and is constant. The potential energy now involves only the energy from gravity, i.e.

V = −m g p₂        (3.142)

The constraint is then the distance between the cart and the hanging mass. The vector
describing the link is given by:
δ = p − (x, 0)ᵀ        (3.143)

And the constraint imposed by the link in the system can e.g. be written as:
√(δᵀ δ) − L = 0        (3.144)

For computational reasons, it is useful to “get rid" of the square root function in (3.144) and to scale the constraint by ½, i.e. we write:

c(q) = ½ (δᵀ δ − L²) = 0        (3.145)
which is equivalent to (3.144). The Lagrange function then reads as:

L = ½ M ẋ² + ½ m ṗᵀ ṗ + m g p₂ − ½ z (δᵀ δ − L²)        (3.146)
We can then deploy the Lagrange equation as usual:

∇q̇ L = W q̇        (3.147a)
d/dt ∇q̇ L = W q̈        (3.147b)
∇q L = (0, 0, mg)ᵀ − z (x − p₁, p₁ − x, p₂)ᵀ        (3.147c)

The model then reads as:


   
W q̈ − (0, 0, mg)ᵀ + z (x − p₁, p₁ − x, p₂)ᵀ = 0        (3.148a)
½ (δᵀ δ − L²) = 0        (3.148b)

In a more explicit form, it reads as:
   
q̈ = W⁻¹ [ (0, 0, mg)ᵀ − z (x − p₁, p₁ − x, p₂)ᵀ ]        (3.149a)
0 = ½ (δᵀ δ − L²)        (3.149b)
The observations already made above concerning solving (3.149) for z still hold here. It is interesting at this stage to compare (3.149) to its equivalent (3.70) developed in polar coordinates (i.e. using the angle θ to describe the position of the hanging mass). While (3.149) comprises more equations than (3.70) (4 vs. 2), its symbolic complexity is lower. Indeed, while (3.70) comprises several trigonometric terms (sin and cos) and a division, (3.149) includes only bilinear terms (i.e. products of the variables). ∎
Example 3.16 (2-pendulum crane, revisited). We will revisit now the 2-pendulum crane
illustrated in Fig. 3.10. Modelling this system using polar coordinates (i.e. the angles of
the pendulums) was resulting in a model of very high symbolic complexity (see (3.107)-
(3.108) and the following remarks). Let us see the outcome of approaching the modelling
of the same system using cartesian coordinates and constrained Lagrange. We select the
generalized coordinates:
 
q = [ x ; p₁ ; p₂ ],    p₁, p₂ ∈ R²        (3.150)

The kinetic energy function reads as:


T = ½ m q̇ᵀ q̇        (3.151)

hence W = m I (where I is the 5×5 identity matrix, since q ∈ R⁵). The potential energy function reads as:

V = −m g [0 0 1 0 1] q        (3.152)

The constraints then read as:


c(q) = ½ [ δ₁ᵀ δ₁ − L² ; δ₂ᵀ δ₂ − L² ] = 0        (3.153)

where
δ₁ = (x, 0)ᵀ − p₁,    δ₂ = p₂ − p₁        (3.154)

The Lagrange function reads as:


L = ½ m q̇ᵀ q̇ + m g [0 0 1 0 1] q − ½ zᵀ [ δ₁ᵀ δ₁ − L² ; δ₂ᵀ δ₂ − L² ]        (3.155)

where z ∈ R². We can then compute

∇q̇ L = m q̇        (3.156a)
d/dt ∇q̇ L = m q̈        (3.156b)
∇q L = (0, 0, mg, 0, mg)ᵀ − ∇q c · z        (3.156c)

where the 5×2 matrix ∇q c has the rows [ x − [1 0] p₁ , 0 ] (for x), [ −δ₁ , −δ₂ ] (for p₁) and [ 0 , δ₂ ] (for p₂).

The final model in an explicit form then reads as:

q̈ = (0, 0, g, 0, g)ᵀ − (1/m) ∇q c · z        (3.157)
0 = ½ [ δ₁ᵀ δ₁ − L² ; δ₂ᵀ δ₂ − L² ]        (3.158)

This ought to be compared to (3.107)-(3.108), where the final model was not even provided
because of its very high symbolic complexity. The reduction of symbolic complexity results from the choice of cartesian coordinates, which yields a constant and diagonal W matrix. As a result, the complex term ∂/∂q ( W(q) q̇ ) q̇ in

q̈ = W(q)⁻¹ [ Q + ∇q T − ∇q V − ∇q c · z − ∂/∂q ( W(q) q̇ ) q̇ ]        (3.159)

disappears, and the inverse W(q)⁻¹ is trivial. ∎

3.5.1 HANDLING MODELS FROM CONSTRAINED LAGRANGE

In the previous section we have seen that we can write the accelerations q̈ as:

q̈ = W⁻¹ [ Q − ∇q V − ∇q c · z ]        (3.160)

As observed before, in order to compute a simulation from (3.160) we need the (time-
varying) Lagrange multipliers z at every time instant of the simulation.

Unfortunately, a problem arises here. Indeed, since we use equation (3.126a) (or its equiv-
alent form (3.130a)) to compute q̈, we are left with equation (3.126b) to determine the un-
known z . However, equation (3.126b) does not deliver z ; indeed it is not even a function of
z , and thus cannot inform us about z . We will see in the following how to circumvent this
problem, and we will put some further formalism around this issue in Chapter 6.

Running simulations for mechanical systems requires one to be able to compute the system accelerations q̈ as a function of the positions and velocities q, q̇ and of the external forces Q. In the unconstrained case, this is readily feasible as long as matrix W(q) is full rank.

In the constrained case, we have raised the issue before that a model arising from con-
strained Lagrange and taking the form:
d/dt (∂L/∂q̇) − ∂L/∂q = Qᵀ        (3.161a)
c(q) = 0        (3.161b)

does not deliver as such the accelerations of the system. Indeed, while equation (3.161a) delivers the acceleration q̈ via its explicit version:

q̈ = W(q)⁻¹ [ Q + ∇q T − ∇q V − ∇q c · z − ∂/∂q ( W(q) q̇ ) q̇ ]        (3.162)

the accelerations q̈ are a function of the unknown z, which is not delivered by (3.161b), as c(q) is not even a function of z. In this section we will see how to tackle this problem.

If one sets z = 0 in the dynamic equation (3.161a), then one can verify that the equation would be describing a “free" motion, i.e. what the system would do if the constraints c(q) = 0 were not present at all. E.g. in the “bowl" example above, the ball would be in free fall, and in the 2-pendulum crane example, the hanging masses would be in free fall and the cart would move along its rail as if not connected to anything. One can therefore construe ∇q c · z in (3.162) as a term that enforces the constraints c(q) = 0 by manipulating the q̈ resulting from (3.162) via adequate selections of the variables z. As a matter of fact, the combination of terms

W(q)⁻¹ ∇q c · z        (3.163)

arising in (3.162) yields an acceleration that is added to the other “sources" of acceleration (e.g. stemming from forces, gravity, centrifugal effects, etc.). One can observe that W(q)⁻¹ ∇q c is a matrix of dimension n_q × n_c, such that the term (3.163) yields accelerations in the subspace spanned by the columns of W(q)⁻¹ ∇q c.

Let us then build some intuition behind what will happen here. We know that z ought to be chosen such that the accelerations q̈ enforce the constraints c(q) = 0. However, the constraints specify conditions on the system position, not on its accelerations. We ought to understand, though, that the positions q are obviously influenced by the accelerations q̈ (via two integrations), such that the accelerations influence c(q) (via two integrations). In order to choose z adequately, we need to make the impact of the accelerations on c(q) appear explicitly.

Fortunately, this influence is fairly simple to unpack. Indeed, let us take two time derivatives of the constraints c(q) (in order to “rewind" the two integrations from q̈ to q). We
should observe that if c(q) = 0 is enforced throughout the trajectory of the system (i.e. at every time instant), then:

dᵏ/dtᵏ c(q) = 0        (3.164)

also holds at all times for any k ≥ 0. We also observe that d²/dt² c(q) = 0 is a condition where the accelerations appear explicitly. In order to see that, we simply need to apply some chain rules:

ċ(q, q̇) = d/dt c(q) = (∂c/∂q) q̇        (3.165a)
c̈(q, q̇, q̈) = d²/dt² c(q) = (∂c/∂q) q̈ + ∂/∂q ( (∂c/∂q) q̇ ) q̇        (3.165b)
We can then assemble the Lagrange equation (3.126a) and equation (3.165b) into the new model:

d/dt (∂L/∂q̇) − ∂L/∂q = Qᵀ        (3.166a)
c̈(q, q̇, q̈) = 0        (3.166b)

Or in the explicit form similar to (3.130):


W(q) q̈ + ∂/∂q ( W(q) q̇ ) q̇ − ∇q T + ∇q V + ∇q c · z = Q        (3.167a)
(∂c/∂q) q̈ + ∂/∂q ( (∂c/∂q) q̇ ) q̇ = 0        (3.167b)
Recall that our original problem is that the constrained Lagrange equations (3.161) do not deliver the variables z, which are needed in order to compute the accelerations q̈. In order to solve the problem we have time-differentiated the constraints c(q) = 0, and obtained (3.167). However, it is not yet obvious that the modified constraints (3.167b) deliver z. As a matter of fact, they are still not a function of z. Let us investigate whether (3.167) is now capable of delivering the pair q̈, z. Here we could proceed via algebraic manipulations over (3.167). However, it is much easier to address the question by observing that (3.167) is linear in q̈, z. Indeed, one can rewrite (3.167) as:
[ W(q)    ∇q c ] [ q̈ ]   [ Q − ∂/∂q ( W(q) q̇ ) q̇ + ∇q T − ∇q V ]
[ ∇q cᵀ    0   ] [ z ] = [ − ∂/∂q ( (∂c/∂q) q̇ ) q̇              ]        (3.168)

where the matrix on the left-hand side is denoted M(q).
Here we can readily see that (3.168) delivers q̈, z jointly if matrix M(q) is full rank. One can also write the model in an explicit form:

[ q̈ ; z ] = M(q)⁻¹ [ Q − ∂/∂q ( W(q) q̇ ) q̇ + ∇q T − ∇q V ;  − ∂/∂q ( (∂c/∂q) q̇ ) q̇ ]        (3.169)

However, as observed previously for matrix W(q) alone, inverting the symbolic matrix M(q) can yield an extremely complex matrix M(q)⁻¹, even if M(q) is fairly “simple". For this reason, it is often preferable to work with the model (3.168) in its implicit form, or to treat the matrix inversion M(q)⁻¹ numerically when deploying model (3.169).

Let us revisit some of the examples of Section 3.5 in the light of the model transformation
described above.
Example 3.17 (Bowl, cont’d). In the bowl Example 3.14, the model equations determined were

m p̈ = −mg (0, 0, 1)ᵀ − z (−½ p₁, −½ p₂, 1)ᵀ        (3.170a)
0 = p₃ − ¼ (p₂² + p₁²)        (3.170b)
where (3.170b) is c(q) = 0 written explicitly. In order to perform the model transformation detailed above on this model, we need to time-differentiate (3.170b) twice:

c(q) = p₃ − ¼ (p₂² + p₁²)        (3.171a)
ċ(q, q̇) = ṗ₃ − ½ (p₂ ṗ₂ + p₁ ṗ₁)        (3.171b)
c̈(q, q̇, q̈) = p̈₃ − ½ (p₂ p̈₂ + p₁ p̈₁ + ṗ₂² + ṗ₁²)        (3.171c)
These expressions can be obtained directly as above, or by using the construction (3.165). The transformed model then reads as:

m p̈ = −mg (0, 0, 1)ᵀ − z (−½ p₁, −½ p₂, 1)ᵀ        (3.172a)
0 = p̈₃ − ½ (p₂ p̈₂ + p₁ p̈₁ + ṗ₂² + ṗ₁²)        (3.172b)
It can be put in the form (3.168):

[   m       0      0    −½ p₁ ] [ p̈₁ ]   [ 0             ]
[   0       m      0    −½ p₂ ] [ p̈₂ ] = [ 0             ]
[   0       0      m      1   ] [ p̈₃ ]   [ −mg           ]
[ −½ p₁   −½ p₂    1      0   ] [ z  ]   [ ½ (ṗ₂² + ṗ₁²) ]        (3.173)

where the matrix on the left-hand side is denoted M(q).
The determinant of matrix M(q) reads as:

det ( M(q) ) = −(m²/4) (p₁² + p₂² + 4) < 0        (3.174)

and is non-zero for any position p, such that (3.173) is always well defined. The inverse of matrix M(q) is fairly complex, such that writing (3.173) explicitly is tedious. ∎

Example 3.18 (Crane, cont’d). Let us look at the crane Example 3.15, which had the original model

W q̈ = (0, 0, mg)ᵀ − z (x − p₁, p₁ − x, p₂)ᵀ        (3.175a)
0 = ½ (δᵀ δ − L²)        (3.175b)

where δ = p − (x, 0)ᵀ and

W = diag(M, m, m)        (3.176)
We can then proceed with time-differentiating the constraints:

c(q) = ½ (δᵀ δ − L²)        (3.177a)
ċ(q, q̇) = δᵀ δ̇        (3.177b)
c̈(q, q̇, q̈) = δᵀ δ̈ + δ̇ᵀ δ̇        (3.177c)

The model put in the form (3.168) then reads as:

[   M        0        0     x − p₁ ] [ ẍ  ]   [ 0   ]
[   0        m        0     p₁ − x ] [ p̈₁ ] = [ 0   ]
[   0        0        m     p₂     ] [ p̈₂ ]   [ mg  ]
[ x − p₁   p₁ − x    p₂      0     ] [ z  ]   [ −(ẋ² + ṗ₁² + ṗ₂² − 2 ẋ ṗ₁) ]        (3.178)

where the matrix on the left-hand side is denoted M(q).
We observe that the determinant of matrix M(q) is non-zero as long as (x, 0)ᵀ ≠ p, otherwise the last row and last column of M(q) are zero, and the matrix becomes rank-deficient. This situation corresponds, physically, to the hanging mass coinciding with the cart (i.e. the length of the link is zero). The inverse of matrix M(q) is very complex, such that writing (3.178) explicitly is not recommended. ∎

3.6 CONSISTENCY CONDITIONS


Before finishing this chapter, we need to approach an important consequence of the model
transformation detailed above in terms of model simulation. The original model that is
describing the physical system reads as:
d/dt (∂L/∂q̇) − ∂L/∂q = Qᵀ        (3.179a)
c(q) = 0        (3.179b)

while the transformed model reads as:
d/dt (∂L/∂q̇) − ∂L/∂q = Qᵀ        (3.180a)
c̈(q, q̇, q̈) = 0        (3.180b)

An important question to ask here is whether the transformed model (3.180) is equivalent to the original model (3.179). Since the trajectories arising from the original model (3.179) satisfy c = 0 at all times, they also satisfy (3.180b) at all times. Hence, the trajectories of the original model are also trajectories of the transformed model (3.180). Unfortunately, the converse is not always true. Indeed, the trajectories of the transformed model (3.180) satisfy c̈ = 0 at all times, which does not entail that they satisfy c = 0 at all times. Actually, if the trajectories satisfy c̈ = 0 at all times, then one can verify that the constraints c must obey the time evolution:

c( q(t) ) = c( q(0) ) + t · ċ( q(0), q̇(0) )        (3.181)

such that c( q(t) ) = 0 does not necessarily hold. Fortunately, (3.181) delivers the conditions that are required in order for the trajectories of the transformed model (3.180) to be trajectories of the original model (3.179). Indeed, (3.181) readily tells us that c(q) = 0 holds at all times if:

C( q(0), q̇(0) ) := [ c( q(0) ) ; ċ( q(0), q̇(0) ) ] = 0        (3.182a)

hence, the transformed model delivers meaningful trajectories if the initial conditions q(0) and q̇(0) satisfy C( q(0), q̇(0) ) = 0. The conditions (3.182a) are referred to as the consistency conditions associated to the model transformation. They are required in order to impose initial conditions in the model that are physically meaningful.

In order to build some intuition behind these observations, let us use the bowl example above. The consistency conditions for the bowl read as:

C( q(0), q̇(0) ) := [ p₃ − ¼ p₁² − ¼ p₂² ;  ṗ₃ − ½ ṗ₁ p₁ − ½ ṗ₂ p₂ ]        (3.183)

While these expressions do not carry much intuition, they have a very strong geometric interpretation. The first condition in C simply states that the initial condition q(0) must satisfy the constraints c(q) = 0, and is a quite natural requirement. The second condition states that:

ċ( q(0), q̇(0) ) = (∂c/∂q) q̇ = ∇q cᵀ q̇ = 0        (3.184)

i.e. it requires that the scalar product between the gradient ∇q c and the velocities q̇ is
zero. In other words, it requires the velocities q̇ to be orthogonal to the gradient ∇q c . This
requirement has a very intuitive interpretation in the bowl example, see Fig. 3.12. Indeed,

Figure 3.12: Consistency conditions for the bowl example. The condition ∇q c ⊤ q̇ = 0 simply
requires that q̇(0) is tangent to the bowl.

the gradient ∇q c describes a normal to the surface described by the equation c( q(0) ) = 0. In our bowl example, this surface is the bowl itself (more complex systems may not deliver such a simple picture), and the condition ∇q c( q(0) )ᵀ q̇(0) = 0 essentially requires that the initial velocities q̇(0) are tangent to the bowl surface. This requirement is physically needed in order for the ball to slide on the surface of the bowl.
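For the bowl, the conditions (3.183) are easy to check numerically. The sketch below is a minimal illustration (the candidate state values and the function name are our own): both residuals must vanish for the state to be an admissible initial condition.

```python
import numpy as np

def consistency_residual(p, pdot):
    """Consistency conditions (3.183) for the bowl example: the first
    entry checks that p lies on the surface, the second that the
    velocity is tangent to it."""
    c    = p[2] - 0.25 * p[0]**2 - 0.25 * p[1]**2
    cdot = pdot[2] - 0.5 * pdot[0] * p[0] - 0.5 * pdot[1] * p[1]
    return np.array([c, cdot])

p0    = np.array([1.0, 0.0, 0.25])   # on the bowl: p3 = (p1^2 + p2^2)/4
pdot0 = np.array([1.0, 0.0, 0.5])    # tangent: pdot3 = (p1*pdot1 + p2*pdot2)/2
res = consistency_residual(p0, pdot0)
```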

3.7 CONSTRAINTS DRIFT


We have seen in the previous section that the initial conditions of the transformed model must satisfy some consistency conditions in order for the transformed model to deliver meaningful trajectories. The consistency conditions can be derived from the observation that the constraints c(q) obey the time evolution (3.181), i.e.:

c( q(t) ) = c( q(0) ) + t · ċ( q(0), q̇(0) )        (3.185)

on the trajectories q (t ) computed from the transformed model. However, this time evolu-
tion holds if and only if the condition c̈ = 0 is imposed. While this condition is imposed
by the transformed model (3.166) (and all the variants we have looked at), when simulat-
ing the system, the condition is imperfectly imposed for numerical reasons. As a result,
the time evolution (3.185) is “noisy", and it tends to accumulate the numerical errors over
time. This effect is illustrated in Fig. 3.13, which is the outcome of simulating the trans-
formed model (3.172) with consistent initial conditions, but a low integration accuracy. As
a result, the condition c̈ = 0 is not enforced very accurately, and tends to create a drift in the
constraints.
This issue can be treated via Baumgarte stabilization, whereby the transformed model does not impose c̈ = 0, but rather dynamics on c that stabilize it to zero. This can e.g. be done by imposing:

c̈ + 2α ċ + α² c = 0        (3.186)

[Plot omitted: time evolution of c(q) (top) and ċ(q) (bottom) over t ∈ [0, 10], both drifting away from zero on the order of 10⁻⁴.]
Figure 3.13: Illustration of the constraints drift for the bowl example.

for some α > 0 (such that the dynamics (3.186) are stable, with two real poles at −α). The transformed model then reads as:

d/dt (∂L/∂q̇) − ∂L/∂q = Qᵀ        (3.187a)
c̈ + 2α ċ + α² c = 0        (3.187b)
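As a sketch of how (3.186) modifies the bowl model of Example 3.17 (assuming m = 1, g = 9.81 and an arbitrary gain α = 5; the function name is our own), only the constraint row of the linear system changes: the target c̈ = 0 is replaced by c̈ = −2αċ − α²c.

```python
import numpy as np

m, g, alpha = 1.0, 9.81, 5.0    # assumed mass, gravity and Baumgarte gain

def bowl_acc_baumgarte(p, pdot):
    """Bowl model accelerations with Baumgarte stabilization (3.186):
    the constraint is steered back towards c = 0 instead of merely
    keeping cddot = 0."""
    c    = p[2] - 0.25 * (p[0]**2 + p[1]**2)
    cdot = pdot[2] - 0.5 * (p[0] * pdot[0] + p[1] * pdot[1])
    M = np.array([
        [m,           0.0,         0.0, -0.5 * p[0]],
        [0.0,         m,           0.0, -0.5 * p[1]],
        [0.0,         0.0,         m,    1.0       ],
        [-0.5 * p[0], -0.5 * p[1], 1.0,  0.0       ],
    ])
    rhs = np.array([0.0, 0.0, -m * g,
                    0.5 * (pdot[0]**2 + pdot[1]**2)
                    - 2.0 * alpha * cdot - alpha**2 * c])
    sol = np.linalg.solve(M, rhs)
    return sol[:3], sol[3]

# Slightly inconsistent state (c = 0.01): the stabilized dynamics
# produce a cddot that pushes c back towards zero
p, pdot = np.array([1.0, 0.0, 0.26]), np.array([0.0, 1.0, 0.0])
pddot, z = bowl_acc_baumgarte(p, pdot)
```

In a full simulation, the resulting acceleration would be fed to the integrator exactly as before; the only change is the stabilizing term on the right-hand side.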

4 NEWTON METHOD
The Newton method is a central tool for simulation, optimization, and estimation. It is
therefore no surprise that we will encounter it here. Generally speaking, the Newton method
aims at solving a set of equations, that we can e.g. write as:
ϕ(x, y) = 0        (4.1)

in x ∈ R^{n_x} for a given y ∈ R^{n_y}. Clearly, in order for the problem of finding x to be well-posed, one needs (4.1) to hold “as many equations as unknown variables", i.e. function ϕ is

ϕ : R^{n_x} × R^{n_y} → R^{n_x}        (4.2)

We will see later that this condition is not sufficient. When the system of equations described by (4.1) is nonlinear, finding a solution x can typically not be done explicitly (i.e. we typically cannot provide a set of symbolic expressions that describe x as a function of y). This does not mean that x is not perfectly well described as a function of y by (4.1). Formally, one says that x is an implicit function of y defined by (4.1).

The Newton method then allows one to compute x as a function of y numerically. In this
chapter we will introduce the basics of the Newton method.

4.1 BASIC IDEA OF NEWTON METHOD


The Newton method replaces the problem of solving the nonlinear set of equations (4.1) by solving a succession of surrogate linear approximations. These linear approximations are built by successive linearizations of (4.1), i.e. instead of solving (4.1), we solve:

ϕ(x₊, y) ≈ ϕ(x, y) + ∂ϕ(x, y)/∂x · (x₊ − x) = 0        (4.3)

for x₊, based on a given guess x. Note that (4.3) is a first-order Taylor approximation of ϕ(x₊, y), i.e.

ϕ(x₊, y) = ϕ(x, y) + ∂ϕ(x, y)/∂x · (x₊ − x) + O( ‖x₊ − x‖² )        (4.4)
and is an affine form in x₊, i.e. x₊ can be obtained from (4.3) explicitly as:

x₊ = x − ( ∂ϕ(x, y)/∂x )⁻¹ ϕ(x, y)        (4.5)

It is common to label

∆x = −( ∂ϕ(x, y)/∂x )⁻¹ ϕ(x, y)        (4.6)

as a (full) Newton step that corrects the guess x to the update x₊. See Fig. 4.1 for an illustration. Using this construction, the Newton method is then iterative, in the sense that (4.5) is repeated until the solution to ϕ(x, y) = 0 is reached (or approached closely enough). More specifically, the Newton method uses the algorithm:

Algorithm: Full-step Newton method

Input: Variable y, initial guess x, and tolerance tol
while ‖ϕ(x, y)‖∞ ≥ tol do
    Compute
        ϕ(x, y)  and  ∂ϕ(x, y)/∂x        (4.7)
    Compute the Newton step ∆x from
        ∂ϕ(x, y)/∂x · ∆x + ϕ(x, y) = 0        (4.8)
    Take the Newton step
        x ← x + ∆x        (4.9)
return x
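The algorithm above maps directly to code. The sketch below is a minimal full-step Newton solver (the function and variable names are our own; NumPy solves the linear step equation), applied to the scalar equation x² − 2 = 0.

```python
import numpy as np

def newton_full_step(phi, jac, x0, tol=1e-10, max_iter=50):
    """Full-step Newton method: repeat x <- x + dx, where dx solves
    jac(x) dx = -phi(x), until ||phi(x)||_inf < tol."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        r = phi(x)
        if np.linalg.norm(r, ord=np.inf) < tol:
            return x
        dx = np.linalg.solve(jac(x), -r)   # Newton step, cf. (4.8)
        x = x + dx                         # full step, cf. (4.9)
    raise RuntimeError("Newton iteration did not converge")

# Solve x^2 - 2 = 0 starting from the guess x = 1
root = newton_full_step(lambda x: np.array([x[0]**2 - 2.0]),
                        lambda x: np.array([[2.0 * x[0]]]),
                        [1.0])
```

Note that the step is computed by solving a linear system rather than by forming the Jacobian inverse explicitly, which is also the preferred practice for larger systems.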
Figure 4.1: Illustration of the Newton principle on a scalar equation (i.e. n_x = 1 and y is not used here). The update x₊ is obtained by finding the root of the linear approximation (grey line) of ϕ(x, y) built at a candidate point x. The update x₊ is closer to the solution x⋆ of ϕ(x, y) = 0 than the candidate x.

A few crucial remarks ought to be made here:

• If it converges, the Newton method converges to a solution x⋆ of ϕ(x, y) = 0.

• Each step of the Newton method requires evaluating the function ϕ(x, y) and its Jacobian ∂ϕ(x, y)/∂x, and solving the linear system (4.8).

• The Newton method requires that the square Jacobian matrix ∂ϕ(x, y)/∂x is full rank (i.e. invertible), in order for the linear system (4.8) to be well-posed.

• If function ϕ(x, y) is linear in x (and well posed), then the Newton method finds the solution x⋆ in one step. It is then fully equivalent to solving the linear system.

4.2 CONVERGENCE OF THE NEWTON METHOD


A crucial question for the Newton method is whether the Newton algorithm converges or not. If it converges, then it has to converge to a solution x⋆ of ϕ(x, y) = 0.³ Indeed, as long as ϕ(x, y) ≠ 0, the Newton algorithm performs corrections ∆x ≠ 0, and only stops correcting the initial guess when ϕ(x, y) ≈ 0 (depending on the tolerance). Unfortunately, the convergence of the Newton method is not trivial. We will discuss some basic results in this section.

4.2.1 CONVERGENCE RATE

Let us first assume that the Newton iteration presented above converges, and let us investigate its convergence rate, i.e. how quickly ‖x − x⋆‖ decreases. This question is addressed as follows:

Theorem 4. If the full-step Newton iteration converges, then it converges at a quadratic rate, i.e.

‖x₊ − x⋆‖ ≤ C · ‖x − x⋆‖²        (4.10)

for some constant C > 0.

Before delivering a formal proof of the quadratic convergence rate, one ought to understand that this result is fairly intuitive. Indeed, one can observe that at every iteration, the error that the Newton step is making is of order 2. Hence, every iteration “removes" the first-order inaccuracy between the solution candidate and the solution to ϕ(x, y) = 0. It follows that at every step, the Newton method retains an error of order ‖x − x⋆‖². This observation alone gives an intuition of why (4.10) ought to be true. We provide next a proof of Theorem 4. This proof is not part of the examination though.

³ Note that there may be more than one solution! Hence the Newton method could converge to different points, depending on where it is started.

Proof. For the sake of simplicity, let us dismiss the fixed argument y, and let us write the Newton step as

∆x = −M⁻¹ ϕ(x),    M = ∂ϕ(x)/∂x        (4.11)

We first observe that since ϕ(x⋆) = 0, the equality

x₊ − x⋆ = x − x⋆ − M⁻¹ ( ϕ(x) − ϕ(x⋆) )        (4.12)

holds. We then use classical results from analysis to observe that:

ϕ(x) − ϕ(x⋆) = ∫₀¹ ∂ϕ( x + τ(x⋆ − x) )/∂x · (x − x⋆) dτ        (4.13)

Using (4.12) and (4.13), we can write:

x₊ − x⋆ = x − x⋆ − M⁻¹ ( ∫₀¹ ∂ϕ( x + τ(x⋆ − x) )/∂x dτ ) (x − x⋆)        (4.14)

Or equivalently

x₊ − x⋆ = M⁻¹ ( M − ∫₀¹ ∂ϕ( x + τ(x⋆ − x) )/∂x dτ ) (x − x⋆)        (4.15)

We can modify (4.15) as:

x₊ − x⋆ = M⁻¹ ( M − ∂ϕ(x)/∂x − ∫₀¹ [ ∂ϕ( x + τ(x⋆ − x) )/∂x − ∂ϕ(x)/∂x ] dτ ) (x − x⋆)        (4.16)

We can then bound (using triangular inequalities):

‖x₊ − x⋆‖ ≤ ‖M⁻¹ ( M − ∂ϕ(x)/∂x )‖ ‖x − x⋆‖
          + ‖M⁻¹ ∫₀¹ [ ∂ϕ( x + τ(x⋆ − x) )/∂x − ∂ϕ(x)/∂x ] dτ‖ ‖x − x⋆‖        (4.17)

Since M = ∂ϕ(x)/∂x, the first term is zero. Moreover, the term under the integral sign in the second term can be bounded (on any closed set around x⋆) by

‖M⁻¹ ( ∂ϕ(x)/∂x − ∂ϕ(x⋆)/∂x )‖ ≤ c ‖x − x⋆‖        (4.18)

for some constant c > 0 for ϕ smooth. It follows that

‖M⁻¹ ∫₀¹ [ ∂ϕ( x + τ(x⋆ − x) )/∂x − ∂ϕ(x)/∂x ] dτ‖ ≤ ∫₀¹ ‖M⁻¹ [ ∂ϕ( x + τ(x⋆ − x) )/∂x − ∂ϕ(x)/∂x ]‖ dτ
                                                   ≤ c ‖x − x⋆‖ ∫₀¹ τ dτ = (c/2) ‖x − x⋆‖        (4.19)

such that

‖x₊ − x⋆‖ ≤ (c/2) ‖x − x⋆‖²        (4.20)

∎

Let us illustrate this convergence rate for the example provided in Fig. 4.3. A few remarks can be useful here:

• A quadratic convergence rate is a very strong convergence. It means that at every iteration the number of accurate digits is doubled, i.e. over the iterations, the error ‖x − x⋆‖ follows a decay of e.g. the form

‖x − x⋆‖ = 10⁻¹
‖x − x⋆‖ = 10⁻²
‖x − x⋆‖ = 10⁻⁴
‖x − x⋆‖ = 10⁻⁸
‖x − x⋆‖ = 10⁻¹⁶

Hence a few iterations are typically sufficient to reach machine precision (ǫ = 10⁻¹⁶)

• While (4.18) appears very technical, it can be understood as a “measure" of how nonlinear the function ϕ(x) is. Indeed, if ϕ(x) is linear in x, i.e.

ϕ(x) = A x + b        (4.21)

(we dismiss the argument y here) for a matrix A and a vector b, then

∂ϕ(x)/∂x = A,   ∀x        (4.22)

and the left-hand side of (4.18) is zero, such that c = 0. Conversely, if ϕ(x) is strongly nonlinear, then its Jacobian ∂ϕ(x)/∂x varies a lot and the difference

∂ϕ( x + τ(x⋆ − x) )/∂x − ∂ϕ(x)/∂x        (4.23)

can be very large, yielding a large constant c.

• The proof above also informs us on the condition required for the Newton iteration to converge. Indeed, in order for the bound (4.10) to guarantee the decay of ‖x − x⋆‖, one ought to require that the initial guess x provided to the algorithm is such that:

‖x − x⋆‖ ≤ 2c⁻¹        (4.24)

It is interesting here to understand (4.24) properly. Indeed, it tells us that

1. in order for the full-step Newton iteration to converge, it ought to be provided with an initial guess that is close enough to a solution x⋆. This observation is illustrated in Fig. 4.2.

2. “how far" the initial guess provided to the full-step Newton iteration can be from a solution x⋆ is larger if c is small, i.e. if ϕ(x) is close to linear. Conversely, if ϕ(x) is very nonlinear, then c is large and the initial guess provided to the full-step Newton iteration must be close to a solution x⋆ in order to guarantee the convergence of the iteration.

Figure 4.2: Newton iteration on a nonlinear, scalar function ϕ(x, y) (five steps are displayed here). On the left graph the iteration converges to the solution x⋆ of ϕ(x⋆, y) = 0, while on the right graph the iteration diverges. Convergence is obtained when the initial guess provided to the full-step Newton iteration is close enough to a solution x⋆, while an initial guess further away (possibly) results in an unstable iteration.

Figure 4.3: Converging full-step Newton iteration. The quadratic convergence rate yields a decrease of the solution error ‖x − x⋆‖ that “collapses" downward in a semi-log plot.

4.2.2 REDUCED NEWTON STEPS

The full-step Newton algorithm as presented above can diverge if the initial guess provided to the Newton iteration is “too far" from a solution x⋆ of ϕ(x, y) = 0. Fortunately, this shortcoming can be addressed using reduced Newton steps. In this section we present the justification and deployment of the approach.

The motivation behind reduced Newton steps can be stated as follows.

Theorem 5. If it exists, the Newton step ∆x solution of the linear system

∂ϕ(x, y)/∂x · ∆x + ϕ(x, y) = 0        (4.25)

is a descent direction for function ϕ(x, y), i.e.

‖ϕ(x + t · ∆x, y)‖ < ‖ϕ(x, y)‖        (4.26)

for some t ∈ ]0, 1].
Note that the result above is valid for any choice of norm ‖·‖ (this follows directly from the equivalence between norms), but it is easiest to prove using the 2-norm.

Proof. Inequality (4.26) is equivalent to⁴

‖ϕ(x + t · ∆x, y)‖² − ‖ϕ(x, y)‖² < 0        (4.27)

The first-order Taylor expansion of (4.27) yields:

d/dt ‖ϕ(x + t · ∆x, y)‖² |_{t=0} · t + O(t²) < 0        (4.28)

such that (4.27) holds for t > 0 small enough if

d/dt ‖ϕ(x + t · ∆x, y)‖² |_{t=0} < 0        (4.29)

We observe that

d/dt ‖ϕ(x + t · ∆x, y)‖² |_{t=0} = d/dt ( ϕ(x + t · ∆x, y)ᵀ ϕ(x + t · ∆x, y) ) |_{t=0}
                                = 2 ϕ(x, y)ᵀ d/dt ϕ(x + t · ∆x, y) |_{t=0}        (4.30)
                                = 2 ϕ(x, y)ᵀ ∂ϕ(x, y)/∂x · ∆x        (4.31)
                                = −2 ϕ(x, y)ᵀ ϕ(x, y) < 0        (4.32)

if ‖ϕ(x, y)‖ > 0. ∎
⁴ Squaring the 2-norm is helpful here, as it makes it smooth.

An important consequence of this result is that while the Newton method presented above may diverge, a careful selection of reduced Newton steps, i.e. modifications of x in the direction of the Newton step ∆x but possibly scaled down, must necessarily converge, as long as the Newton steps ∆x exist. This motivates the modification of the full-step Newton algorithm into:

Algorithm: Newton method with reduced steps

Input: Variable y, initial guess x, and tolerance tol
while ‖ϕ(x, y)‖∞ ≥ tol do
    Compute
        ϕ(x, y)  and  ∂ϕ(x, y)/∂x        (4.33)
    Compute the Newton step ∆x from
        ∂ϕ(x, y)/∂x · ∆x + ϕ(x, y) = 0        (4.34)
    Select step size t ∈ ]0, 1]
    Take the Newton step
        x ← x + t ∆x        (4.35)
return x ≈ x⋆
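A simple backtracking implementation of the step-size selection is sketched below (our own minimal version, halving t until the descent condition (4.26) holds; the function names are assumptions). It is tried on ϕ(x) = arctan(x), a classic example for which the full-step iteration diverges from initial guesses that are too large.

```python
import numpy as np

def newton_reduced(phi, jac, x0, tol=1e-10, max_iter=100):
    """Newton method with reduced steps: backtrack t in ]0, 1] until
    the residual norm decreases, cf. condition (4.26)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        r = phi(x)
        if np.linalg.norm(r, ord=np.inf) < tol:
            return x
        dx = np.linalg.solve(jac(x), -r)            # Newton step, cf. (4.34)
        t = 1.0
        while np.linalg.norm(phi(x + t * dx)) >= np.linalg.norm(r):
            t *= 0.5                                # reduce the step
            if t < 1e-12:
                raise RuntimeError("step underflow: Newton step may not exist")
        x = x + t * dx                              # reduced step, cf. (4.35)
    raise RuntimeError("no convergence")

# arctan(x) = 0: full steps diverge from x0 = 2, reduced steps converge
root = newton_reduced(lambda x: np.arctan(x),
                      lambda x: np.array([[1.0 / (1.0 + x[0]**2)]]),
                      [2.0])
```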

Finding the adequate step size t ∈]0, 1] is typically performed via a line-search strategy,
based on testing first a full step (t = 1), and then reducing it if the condition (4.26) is not
met. Clever methods exist in order to decide whether a step-size t is “good enough", or
too short or too long. Other methods also exist in order to adjust the Newton step ∆x and
guarantee the convergence (see e.g. trust-region techniques), but they are more difficult to
describe and implement.

A Newton iteration equipped with the possibility of taking reduced steps is guaranteed to converge to a solution x⋆ of ϕ(x, y) = 0 as long as the Newton steps ∆x exist throughout the iterations. It shall be underlined here that when reduced steps (i.e. using t < 1) are necessary, then the quadratic convergence rate detailed in Theorem 4 is lost. The Newton iteration with reduced steps has the following behavior:

• when x is “far" from a solution x⋆, then reduced steps (t < 1) are necessary, and the algorithm converges “slowly". The resulting convergence rate can be very poor, even though it is often close to linear in practice

• after a certain number of iterations, x becomes close enough to x⋆ and full steps (t = 1) are acceptable. The convergence then becomes quadratic. Once x is close to x⋆, full
Figure 4.4: Newton iteration with reduced steps on a nonlinear, scalar function ϕ(x) (five steps are displayed here). Here the iteration does not diverge, but it fails at a point where ∂ϕ(x, y)/∂x = 0. At this point, the linear system (4.8) does not have a well-defined solution and the Newton step ∆x ceases to exist.

steps can be taken all the way to x ⋆ and the convergence is very fast.

The Newton algorithm with reduced steps converges globally to a solution x⋆ unless the linear system (4.34) becomes ill-posed. In order to develop some intuition on this potential (and frequent) issue, let us consider the example illustrated in Fig. 4.4. A simple inspection reveals that the Newton iteration may converge to a point at which ∂ϕ(x, y)/∂x = 0, where the Newton step ceases to exist. In such a case, Theorem 5 ceases to apply, and the Newton iteration fails. One can also readily see that if the Newton iteration was started closer to the solution x⋆ (black dot in the graph), then it would converge to x⋆.

4.3 IMPLICIT FUNCTION THEOREM


We have presented the Newton method as a numerical approach to solve the equations
    ϕ(x, y) = 0    (4.36)

in terms of x. In many cases, the equations are dependent on some other argument, labelled y here, which in practice can gather e.g. parameters or data appearing in the equations. If one changes y, it is natural to expect that the solution x to ϕ(x, y) = 0 is affected. Formally, we ought to think of x as being implicitly a function of y, i.e. we should consider x as a function x(y) that we usually cannot write explicitly, but which is implicitly defined by

    ϕ(x(y), y) = 0    (4.37)

Some questions naturally arise then, such as: is x(y) well-defined, is it differentiable with respect to y, and what is its derivative? These questions are the object of the Implicit Function Theorem (IFT), which is one of the cornerstones in the field of numerical analysis.

Theorem 6. (IFT, simplified version): let the function ϕ(x, y) be smooth, and consider a point (x̄, ȳ) such that ϕ(x̄, ȳ) = 0. Suppose that the Jacobian

    ∂ϕ(x, y)/∂x |_{x=x̄, y=ȳ}    (4.38)

is full rank. Then there exists an open set Y around the point ȳ in which there exists a unique, smooth function x(y) satisfying:

    ϕ(x(y), y) = 0,  ∀ y ∈ Y    (4.39)

Moreover, the Jacobian of the function x(y) is given by

    ∂x(y)/∂y = − (∂ϕ(x, y)/∂x)⁻¹ ∂ϕ(x, y)/∂y |_{x=x(y), y}    (4.40)

The proof of the IFT is fairly involved and we will not do it here. However, equality (4.40) is
trivial to prove.

Proof. (of (4.40)) Because of (4.39), the equality

    (d/dy) ϕ(x(y), y) = 0    (4.41)

holds ∀ y ∈ Y. Using the chain rule, we observe that:

    ∂ϕ(x, y)/∂x · ∂x(y)/∂y + ∂ϕ(x, y)/∂y |_{x=x(y), y} = 0    (4.42)

which yields (4.40).

The implicit function theorem is essential in numerical analysis, and let us briefly review what it provides us:

• even though the function x(y) cannot be written explicitly and can only be evaluated numerically (using a Newton iteration), its derivative is trivial to compute, using (4.40) at the point x found by the Newton iteration for a given point y.

• the IFT guarantees that if ϕ(x, y) = 0 is well posed (i.e. the Jacobian (4.38) is full rank) at a point y, then it has a neighborhood where it is also well posed, i.e. we can “move" y (locally), and x “moves" accordingly, and also locally.

• One ought to observe that the assumption of the Jacobian ∂ϕ(x̄, ȳ)/∂x being full rank is also required for the Newton iteration to be well-behaved (i.e. for the linear system (4.34) to be well-posed). In other words, if the Newton iteration is well-behaved, then the IFT holds. Conversely, the assumption of the IFT must hold in order for a system of equations ϕ(x, y) = 0 to be solvable for x.

We will use the IFT in several places in the following.
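As an illustration (our own toy example, not from the notes), take ϕ(x, y) = x² − y, whose implicit solution is x(y) = √y. The sketch below compares the IFT formula (4.40) with a finite-difference estimate of the same sensitivity:

```python
import numpy as np

def dphi_dx(x, y):   # ∂ϕ/∂x for ϕ(x, y) = x**2 - y
    return 2.0 * x

def dphi_dy(x, y):   # ∂ϕ/∂y
    return -1.0

y = 3.0
x = np.sqrt(y)                             # solves ϕ(x, y) = 0 (here analytically;
                                           # in general a Newton iteration is used)
dxdy_ift = -dphi_dy(x, y) / dphi_dx(x, y)  # formula (4.40), scalar case

# finite-difference check of the same derivative dx/dy
h = 1e-6
dxdy_fd = (np.sqrt(y + h) - np.sqrt(y - h)) / (2.0 * h)
```

Both numbers agree with the analytical sensitivity 1/(2√y), even though x(y) was never written out as an explicit function inside the IFT formula.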

4.4 JACOBIAN APPROXIMATION


In some applications of the Newton iteration, the evaluation of the Jacobian ∂ϕ(x)/∂x is very expensive. It can then be useful to consider using an approximation that is less expensive to evaluate. Let us label this approximation:

    M ≈ ∂ϕ(x)/∂x    (4.43)

The resulting Newton-type step reads as:

    ∆x = −M⁻¹ ϕ(x)    (4.44)

The use of an approximate Jacobian has an impact on the convergence of the Newton
method. Let us briefly review how it is changed. In order to do that, we will revisit The-
orem 4 and 5 with the Jacobian approximation.

Theorem 7. The convergence of the full-step Newton method with an approximate Jacobian follows:

    ‖x⁺ − x⋆‖ ≤ ( κ + (c/2) ‖x − x⋆‖ ) ‖x − x⋆‖    (4.45)

for some constants c, κ > 0.

Proof. We can re-use (4.17), repeated here for simplicity:

    ‖x⁺ − x⋆‖ ≤ ‖M⁻¹ (M − ∂ϕ(x)/∂x)‖ ‖x − x⋆‖
              + ‖M⁻¹ ∫₀¹ [ ∂ϕ(x + τ(x⋆ − x))/∂x − ∂ϕ(x)/∂x ] dτ‖ ‖x − x⋆‖    (4.46)

and we observe that if M ≠ ∂ϕ(x)/∂x, the first term on the right-hand side of the inequality does not disappear. We can then bound (on any closed set around x⋆):

    ‖M⁻¹ (M − ∂ϕ(x)/∂x)‖ = ‖I − M⁻¹ ∂ϕ(x)/∂x‖ ≤ κ,  ∀ x    (4.47)

while the second term on the right-hand side of (4.46) can be bounded as in Theorem 4. Bound (4.45) follows.

It is useful here to make some remarks concerning Theorem 7.
• The bound (4.45) predicts a linear convergence rate, due to the term κ‖x − x⋆‖ in (4.45). A linear convergence rate is significantly slower than a quadratic convergence rate, as it does not go through a “collapse" to very small values. See Fig. 4.5 for an illustration.

Figure 4.5: Stable full-step Newton iteration with approximate Jacobian. The linear convergence rate yields a decrease of the solution error ‖x − x⋆‖ that follows a “line" downward in a semi-log plot. This is to be contrasted with Fig. 4.3.

• Bound (4.47) informs us of the impact of the error between M and the true Jacobian ∂ϕ(x)/∂x. Indeed, for M = ∂ϕ(x)/∂x, κ = 0 can be selected, and one recovers a quadratic convergence (see Th. 4). If M and ∂ϕ(x)/∂x are “very different", then κ has to be large. One can observe from (4.45) that κ < 1 must hold in order for the convergence of the full-step quasi-Newton iteration to be guaranteed (even for x arbitrarily close to x⋆).
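The linear rate predicted by (4.45) is easy to reproduce numerically. In the sketch below (a toy example of ours), the Jacobian of ϕ(x) = x³ − 2 is frozen at the initial guess; the resulting quasi-Newton iteration (4.44) shrinks the error by a roughly constant factor per step instead of collapsing quadratically:

```python
x_star = 2.0 ** (1.0 / 3.0)          # root of phi(x) = x**3 - 2

def phi(x):
    return x**3 - 2.0

x = 1.5
M = 3.0 * x**2                        # Jacobian approximation, frozen at x0
errs = []
for _ in range(30):
    x = x - phi(x) / M                # quasi-Newton step (4.44), full step t = 1
    errs.append(abs(x - x_star))
```

Plotting errs on a semi-log scale gives the straight “line" of Fig. 4.5; the ratio errs[k+1]/errs[k] settles near a constant (here around 0.3), the signature of linear convergence.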

4.5 NEWTON METHODS FOR UNCONSTRAINED OPTIMIZATION


An important application of the Newton method is for solving minimization problems. In this chapter, we will consider the case of unconstrained optimization, i.e. problems of the form:

    min_x Φ(x, y)    (4.48)

for a scalar function Φ : R^{nx} × R^{ny} → R. If the function Φ is smooth with respect to x, the solution x⋆ to (4.48) is given by solving:

    ∇x Φ(x, y) = 0    (4.49)

for x. The Newton methods detailed above then apply directly to solving (4.49). More specifically, a Newton iteration for solving (4.49) reads as:

Algorithm: Newton method for optimization

Input: variable y, initial guess x, and tolerance tol
while ‖∇x Φ(x, y)‖∞ ≥ tol do
    Compute

        ∇x Φ(x, y) and ∇²x Φ(x, y)    (4.50)

    Compute the Newton step ∆x from

        ∇²x Φ(x, y) ∆x + ∇x Φ(x, y) = 0    (4.51)

    Select a step size t ∈ ]0, 1]
    Take the Newton step

        x ← x + t ∆x    (4.52)
end while
return x ≈ x⋆

Let us review the results presented in the previous sections in the specific context of optimization.

• The convergence of Newton for optimization is quadratic in a neighborhood of the solution x⋆ and “slow" otherwise. If the Hessian ∇²x Φ(x, y) is approximated by M, then the method has a linear convergence rate (i.e. Theorem 7 applies).

• The iteration can fail at a point where (4.51) is ill-defined, i.e. at a point where the Hessian of Φ is rank-deficient.
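A minimal sketch of the loop above, on a convex test function of our own choosing (Φ(x) = exp(x₁) + exp(−x₁) + x₂², minimized at the origin), with the step size t chosen by simple backtracking on Φ:

```python
import numpy as np

def Phi(x):
    return np.exp(x[0]) + np.exp(-x[0]) + x[1] ** 2

def grad(x):
    return np.array([np.exp(x[0]) - np.exp(-x[0]), 2.0 * x[1]])

def hess(x):
    # Hessian happens to be diagonal (and positive definite) for this Phi
    return np.diag([np.exp(x[0]) + np.exp(-x[0]), 2.0])

x = np.array([2.0, -3.0])
tol = 1e-10
while np.linalg.norm(grad(x), np.inf) >= tol:
    dx = np.linalg.solve(hess(x), -grad(x))    # Newton step from (4.51)
    t = 1.0
    while Phi(x + t * dx) > Phi(x) and t > 1e-8:
        t *= 0.5                               # reduced step
    x = x + t * dx
```

Because Φ here is convex, the Hessian is always positive definite and the loop terminates at the global minimizer; on a nonconvex Φ the same loop could stop at any stationary point.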
In the context of optimization problems, an additional remark can be made concerning the
choice of Hessian approximation, summarized hereafter.

Theorem 8. If M > 0 (positive definite), then the quasi-Newton step

    ∆x = −M⁻¹ ∇x Φ(x, y)    (4.53)

is a descent direction for the cost function Φ(x, y), i.e.

    Φ(x + t ∆x, y) < Φ(x, y)    (4.54)

for some t > 0.
Proof. For the sake of simplicity, let us write the proof dismissing the argument y. We observe that

    Φ(x + t ∆x) − Φ(x) = (∂Φ(x)/∂x) ∆x · t + O(t²)    (4.55)
                       = −t ∇x Φ(x)⊤ M⁻¹ ∇x Φ(x) + O(t²) < 0    (4.56)

if M > 0 and for t > 0 sufficiently small.

Hence it is sufficient to choose a positive-definite Hessian approximation to guarantee that the Newton iteration converges. However, the results of Theorem 7 still apply, hence a poor Hessian approximation results in a strong degradation of the convergence rate.

4.5.1 GAUSS-NEWTON HESSIAN APPROXIMATION

Let us consider optimization problems of the form

    min_x (1/2) ‖φ(x) − y‖²    (4.57)

for some function φ : R^{nx} → R^{nφ}.

Optimization problems of the form (4.57) are very common in system identification, see Section 5. The gradient and Hessian of the cost function Φ(x, y) = (1/2)‖φ(x) − y‖² then read as:

    ∇x Φ(x, y) = ∇x φ(x) (φ(x) − y)    (4.58)

    ∇²x Φ(x, y) = ∇x φ(x) ∇x φ(x)⊤ + [ ∇_{x_i,x_j} φ(x) (φ(x) − y) ]_{i,j}    (4.59)

where [·]_{i,j} denotes a matrix where the elements i, j are detailed between the brackets.

For many fitting problems, the evaluation of the second term in the Hessian ∇²x Φ(x, y) is expensive, such that the Gauss-Newton Hessian approximation:

    ∇²x Φ(x, y) ≈ MGN = ∇x φ(x) ∇x φ(x)⊤    (4.60)

is often preferred. The Gauss-Newton Hessian approximation (4.60) is valid if:

• the function φ(x) is not very nonlinear, such that its second-order derivatives ∇_{x_i,x_j} φ(x) are small, or

• the optimal solution to the fitting problem (4.57) yields φ(x⋆) ≈ y, such that φ(x) − y is small.

Both cases justify the dismissal of the second term [ ∇_{x_i,x_j} φ(x) (φ(x) − y) ]_{i,j} in ∇²x Φ(x, y).
i ,j
It is useful to observe that the Gauss-Newton Hessian approximation is by construction positive semi-definite, as MGN ≥ 0. In order to ensure its strict positive definiteness, it is common to add some regularization, i.e. to use a Hessian approximation of the form:

    M = ∇x φ(x) ∇x φ(x)⊤ + αI    (4.61)

for some reasonably small α > 0.
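The sketch below shows a damped Gauss-Newton iteration on an invented exponential-fit problem (model x₀·exp(x₁·tᵢ), noise-free data; all names are our own). Note that `jac` returns the standard (nφ × nx) Jacobian, so `J.T @ r` is the gradient (4.58) and `J.T @ J` is MGN from (4.60), regularized as in (4.61):

```python
import numpy as np

t = np.linspace(0.0, 1.0, 20)
x_true = np.array([2.0, -1.5])
y = x_true[0] * np.exp(x_true[1] * t)           # synthetic, noise-free data

def residual(x):                                # phi(x) - y
    return x[0] * np.exp(x[1] * t) - y

def jac(x):                                     # Jacobian of phi, shape (20, 2)
    return np.column_stack([np.exp(x[1] * t), x[0] * t * np.exp(x[1] * t)])

x = np.array([1.0, 0.0])
alpha = 1e-8                                    # regularization as in (4.61)
for _ in range(100):
    J, r = jac(x), residual(x)
    if np.linalg.norm(J.T @ r) < 1e-12:         # gradient (4.58) small: done
        break
    dx = np.linalg.solve(J.T @ J + alpha * np.eye(2), -J.T @ r)
    step = 1.0
    while np.linalg.norm(residual(x + step * dx)) > np.linalg.norm(r) and step > 1e-8:
        step *= 0.5                             # damp the Gauss-Newton step
    x = x + step * dx
```

Since the regularized MGN is positive definite, dx is always a descent direction for the fitting cost (Theorem 8), so the damping loop always finds an acceptable step.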

4.5.2 CONVEX OPTIMIZATION

Before concluding this section, we ought to briefly talk about convexity in optimization problems. For a (sufficiently) smooth cost function Φ(x, y), the solution x⋆ must satisfy the condition

    ∇x Φ(x⋆, y) = 0    (4.62)

However, a point satisfying condition (4.62) is not necessarily the solution to the underlying optimization problem (4.48). This problem is illustrated in Fig. 4.6.

Figure 4.6: Illustration of the difficulty of finding the minimum of a function Φ(x) via solving (4.62). Several points satisfy (4.62) here. The black dot is the true minimum, but (4.62) also yields a so-called local minimum (circle), which is lower than its neighbours but not lower than the black dot. Condition (4.62) also returns (possibly local) maxima (square).

Mathematically speaking, we would say that condition (4.62) is necessary but not sufficient to deliver the solution to (4.48). An exception to this lack of equivalence between condition (4.62) and solving (4.48) is when the cost function Φ(x, y) is convex in x, i.e. if

    ∇²x Φ(x, y) > 0,  ∀ x    (4.63)

In this case, the solution to (4.62) is unique and delivers the solution to (4.48). Moreover,
the Newton iteration with reduced steps is guaranteed to converge to the solution of (4.48).

4.6 SUMMARY
The results concerning the Newton method are a bit convoluted. We provide hereafter a
brief summary condensing the main points:

• Exact reduced Newton steps ∆x improve ϕ for sufficiently small step sizes t ∈ ]0, 1]

• Inexact reduced Newton steps ∆x improve ϕ for a sufficiently small step size t ∈ ]0, 1] if M is sufficiently close to ∂ϕ/∂x. In the context of optimization, M > 0 and sufficiently small steps t ∈ ]0, 1] reduce the cost function Φ.

• Exact full (t = 1) Newton steps converge quadratically if close enough to the solution

• Inexact full (t = 1) Newton steps converge linearly if close enough to the solution and if the Jacobian approximation is “sufficiently good"

• The Newton iteration fails if ∂ϕ/∂x becomes singular

• Newton methods with reduced steps converge in two phases: a damped (slow) phase when reduced steps (t < 1) are needed, and a quadratic/linear phase when full steps are possible.

5 SYSTEM IDENTIFICATION (SYSID)
So far, we have primarily explored one main approach to create models for dynamical sys-
tems, namely physics based modelling, where physical knowledge is encoded in mathe-
matical relations describing the system. In some applications, however, it may be difficult
to determine and quantify the underlying physical mechanisms in a system. In such cases,
there is an alternative approach that is based on collecting data from experiments with
the system, and—based on analysis of the data—creating a mathematical model. Such experimental, or data-driven, modelling is often referred to as system identification (SysId for short).5

System identification may be an attractive alternative to physical modelling, e.g. when the
system is too complex to analyze in terms of physical mechanisms, or when the limited requirements on model fidelity do not motivate a possibly time-consuming physical modelling effort. It should be noted that in practice, a combination of physical modelling and
system identification is often used: physical modelling may give important hints about de-
pendencies and qualitative relationships, and system identification may then be used to
quantitatively determine model parameters.

The system identification problem (in the parametric case) basically amounts to adjusting
a model (with adjustable parameters) to data. The principle may be depicted as in the figure
below. An input sequence u is applied to the real system, generating an output sequence
y . The model depends on a set of parameters θ, and generates an output sequence ŷ (u, θ)
from the input sequence u. The idea is then to adjust the parameters θ such that the model
output ŷ matches in some sense the output y from the real system.

[Figure: block diagram of parametric SysId. The input sequence u is applied to the real system, producing the output y, and to the model ŷ = ŷ(u, θ); the parameters θ are adjusted such that ŷ matches y.]

Note that we will work in discrete time when applying system identification. Indeed, since
the measurements are always collected in a discrete fashion from the real system, it makes
sense to build a theory where the model output is also discrete-time.

5. This chapter is much influenced by the treatment in the book [4]. See also the course homepage.

For the system identification problem thus described, we can already at this stage observe
that there are a few key issues that have to be considered when we want to apply it for
system modelling:
• Selection of model structure: the model ŷ (u, θ) can take various forms, allowing e.g.
both linear and nonlinear dynamics, different parametrizations etc.

• Experiment design: this involves e.g. selection of inputs and outputs to be used and
construction of the input sequence u to be applied to the system.

• Algorithm design: we need to define what is meant by a good fit of the model to data,
and how to find the best model parameter vector θ.

• Model validation: how can we assess the resulting model and whether it fills its pur-
pose?
We will discuss each of these in the sequel, but in order to introduce some of the basic
concepts and ideas, we will start by investigating a much simpler problem, namely to fit a
function to experimental data.

5.1 INTRODUCTORY EXAMPLE: FITTING A FUNCTION TO DATA


The scenario to be discussed in this section would probably be familiar from lab experi-
ments in basic physics courses. The idea is that we want to investigate whether there is
a simple relationship between two physical variables x and y, so that we can write y as a
function of x, i.e. y = f (x) for some function f . In spite of the seemingly simple problem
statement, there will be several aspects of the problem to discuss.

DATA. Our investigation relies on experimental data. Hence, we assume a series of mea-
surements of the quantities x and y is available. The N data-points are labelled x(1), . . ., x(N )
and y(1), . . . , y(N ), respectively, and they are depicted in Fig. 5.1 (left) for a particular data
set with N = 100. The precise conditions for the data collection may vary, but the basic as-
sumption is that for the N known values of x (determined by the experimenter or in some
other way), the corresponding values of y are measured. Usually, these measurements are
corrupted by disturbances—this can be suspected already from the scattered data-points
in Fig. 5.1.

OVERFIT. In our search for a relation between x and y, a very simplistic (and a bit naive)
approach would be to rely completely on all measurements and assume they are reflecting
the “true” relation. This would mean that our model is obtained by simply connecting all
the data-points, as depicted in Fig. 5.1 (right). Note that we have here introduced the no-
tation ŷ = ŷ(x) to denote the predicted model output, corresponding to any (limited to the
interval [0, 1]) value of the input variable x. If we want to avoid the sharp “corners” of the
function ŷ(x), we could optionally use smooth curves instead of straight lines when con-
necting the data-points.

Figure 5.1: Illustration of the curve fitting example: experimental data (left) and a naive, overfitted model (right).

There are at least two ways to criticise what we have just done. The first objection is that
the procedure is likely to give very different results, depending on the particular realization
of the disturbances affecting the measurements. Thus, we have been tempted to rely too
much on the individual data-points, resulting in an over-fitted model. A natural remedy for
this would be to collect several output measurements for each of a set of pre-selected val-
ues of x and to take the average as the corresponding model output. However, this would
require that we have full control over the choice of x values and also raises additional ques-
tions, such as how to select these values from a continuous range, and how to interpolate
the model output in-between the set of pre-selected x values.

More importantly, we would still be left with the second objection: we have not at all taken
into account a fundamental assumption (a prejudice, if you want), namely that we almost
always postulate that the sought-for relation is a smooth one. Hence, we do not expect an
erratic behaviour of the function as the one shown in Fig. 5.1 (right). Another way to ex-
press the assumption on smoothness is that data-points contain information not only for
specific values of x, but also for nearby values. This fact can be exploited in order to come
up with numerous superior methods for curve fitting, but we will limit our discussion to
parametric methods.

PARAMETRIC MODELS. By postulating a model that is given by a smooth function, which is parametrized by a parameter vector θ, we more or less automatically take care of the
issues discussed above: the model will be smooth, and we can control the risk of overfit by
carefully selecting the number of parameters, i.e. the dimension of the vector θ. For the
curve fitting example, the model would become

ŷ = f (x, θ), (5.1)

where f is a predefined, known function of the “input” variable x and the adjustable pa-
rameter vector θ. The idea with this parametric model, formulated already in the beginning
of this chapter, is to adjust θ in such a way as to make the model output ŷ match the mea-
sured output as closely as possible.

MODEL STRUCTURE. Having come this far, the natural question to ask is how to choose
the function f (x, θ), also referred to as the model structure. There are basically two ways
to go. The first alternative is to use some physical insight, suggesting possible types of
relations between x and y. For example, if we want to experimentally determine the rela-
tion between the pressure drop over and the flow through a restriction in a pipe (such as
a valve), then basic physical laws tell us that a quadratic dependence would be expected.
Then a good first attempt would be the model structure

    ŷ = θx²,    (5.2)

containing only one parameter to adjust for the best fit to data.

The other alternative is to accept that there is no a priori information about promising
model structures. We can then resort to standard choices, such as polynomials, and hope-
fully be guided by inspection of experimental data. In our example, the data-points de-
picted in Fig 5.1 suggest that there may be a simple linear relationship between x and y:

ŷ = a + bx = θ⊤ ϕ, (5.3)

where we have defined the parameter vector θ and the regression vector ϕ (holding the regressors 1 and x) as

    θ = [a  b]⊤,  ϕ = [1  x]⊤.    (5.4)
The model structure defined by (5.3) is probably known to you as a “straight-line approx-
imation”. It is an example of a linear regression model, since the parameter vector to be
adjusted or estimated enters linearly. This property of the model will turn out to be impor-
tant.

PARAMETER ESTIMATION. Once the model structure has been defined—at least as a first
attempt—we would like to test how well the model works by determining what values of a
and b that would give as good fit as possible between the assumed model and the data. Let
us formalize the posed problem a bit:

• The model (5.3) can be used to “guess” or predict values of y, given values of x. Using
ϕ(i ) = [1 x(i )]⊤ , the predictions corresponding to the N data-points will be denoted

ŷ(i |θ) = θ⊤ ϕ(i ), i = 1, . . . , N (5.5)

where the notation emphasizes the fact that the predictions depend on the parameter
vector θ.

• In practice, there will always be a discrepancy between what the model predicts, i.e.
ŷ(i |θ), and the measured data point y(i ). Accordingly, we define the residual ε as

ε(i , θ) = y(i ) − ŷ (i |θ) = y(i ) − θ⊤ ϕ(i ), (5.6)

which again depends on the model parameter vector θ.

• Our desire is now to find the model parameters, which make the residuals as small
as possible in some sense. To this end, it is common to define a scalar loss function
or criterion VN (θ) to be minimized with respect to θ. The celebrated least-squares
method, going all the way back to Gauss, provides one solution to this problem and
will be described next, and we will then return to the example.

LEAST-SQUARES. The least-squares (LS) method for a linear regression model is based on
forming a measure of the model fit by summing the squares of the residuals, i.e. the least-
squares criterion is defined as

    VN(θ) = (1/N) Σ_{i=1}^{N} ε²(i, θ) = (1/N) Σ_{i=1}^{N} (y(i) − θ⊤ϕ(i))².    (5.7)

The least-squares estimate (LSE) θ̂N is then obtained by minimizing this criterion, i.e.

    θ̂N = arg min_θ VN(θ),

where the notation stresses the fact that the resulting estimate depends on N data points.

The solution to the least-squares problem can be readily found by completion of squares as follows:

    VN(θ) = (1/N) Σ_{i=1}^{N} ε²(i, θ) = (1/N) Σ_{i=1}^{N} (y(i) − θ⊤ϕ(i))²
          = (1/N) Σ_{i=1}^{N} y²(i) − 2θ⊤ fN + θ⊤ RN θ
          = (1/N) Σ_{i=1}^{N} y²(i) − fN⊤ RN⁻¹ fN + (θ − RN⁻¹ fN)⊤ RN (θ − RN⁻¹ fN),

where fN = (1/N) Σ_{i=1}^{N} ϕ(i)y(i) and RN = (1/N) Σ_{i=1}^{N} ϕ(i)ϕ⊤(i), assuming the matrix inverse exists (equivalently, RN is positive definite). Since the last term is the only term that depends on θ and since it is a nonnegative quadratic form, the least-squares estimate θ̂N is obtained by simply putting it to zero, i.e.

    θ̂N = RN⁻¹ fN = ( (1/N) Σ_{i=1}^{N} ϕ(i)ϕ⊤(i) )⁻¹ (1/N) Σ_{i=1}^{N} ϕ(i)y(i)    (5.8)

REMARK. An alternative derivation goes as follows. Define the vector y and the matrix Φ as

    y = [y(1), …, y(N)]⊤,  Φ = [ϕ(1), …, ϕ(N)]⊤,    (5.9)

thus holding all the data, the measured outputs and all regression vectors. Then the LS criterion can be written as (modulo a factor 2/N)

    VN(θ) = (1/2) ‖y − Φθ‖² = (1/2) (y − Φθ)⊤(y − Φθ)    (5.10)

The LS solution is found by differentiating w.r.t. θ:

    dVN(θ)/dθ = θ⊤Φ⊤Φ − y⊤Φ = 0,    (5.11)

giving

    θ̂N = (Φ⊤Φ)⁻¹ Φ⊤ y,    (5.12)

which is of course identical to (5.8) (can you verify this?). Note that the solution can be interpreted as an approximate solution of the overdetermined system of linear equations y = Φθ.
We leave it as an exercise to show that a weighted 2-norm replacing (5.10), i.e.

    VN(θ) = (1/2) ‖y − Φθ‖²_W = (1/2) (y − Φθ)⊤ W (y − Φθ),    (5.13)

for any symmetric, positive definite matrix W, gives the weighted least-squares solution

    θ̂N = (Φ⊤WΦ)⁻¹ Φ⊤W y    (5.14)
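A quick numerical check of (5.12) and (5.14) on synthetic straight-line data (all numbers invented for illustration); np.linalg.lstsq solves the same overdetermined system y ≈ Φθ:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
x = rng.uniform(0.0, 1.0, N)
y = 0.3 + 0.9 * x + 0.05 * rng.standard_normal(N)

Phi = np.column_stack([np.ones(N), x])          # rows are phi(i)^T, as in (5.9)

# (5.12): solve the normal equations directly
theta_ls = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
# same solution via a least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# (5.14): weighted least squares with an arbitrary positive definite W
W = np.diag(rng.uniform(0.5, 2.0, N))
theta_wls = np.linalg.solve(Phi.T @ W @ Phi, Phi.T @ W @ y)
```

The two unweighted solutions coincide to numerical precision, and both estimates land close to the “true" parameters (0.3, 0.9) used to generate the data.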


Let us now return to the curve fitting example. Since the model (5.3) is a linear regression, the LS estimate is obtained by applying (5.8) for the data set at hand:

    θ̂N = [âN  b̂N]⊤ = ( (1/N) Σ_{i=1}^{N} ϕ(i)ϕ⊤(i) )⁻¹ (1/N) Σ_{i=1}^{N} ϕ(i)y(i)    (5.15)

where (matrix rows separated by semicolons)

    (1/N) Σ_{i=1}^{N} ϕ(i)ϕ⊤(i) = (1/N) [ N, Σ_{i=1}^{N} x(i) ; Σ_{i=1}^{N} x(i), Σ_{i=1}^{N} x²(i) ]    (5.16)

    (1/N) Σ_{i=1}^{N} ϕ(i)y(i) = (1/N) [ Σ_{i=1}^{N} y(i) ; Σ_{i=1}^{N} x(i)y(i) ]    (5.17)

The linear model predictions are shown in Fig. 5.2 (left) along with the data points. At first
sight, it seems as if the simple, linear model serves its purpose quite well.

Figure 5.2: Curve fitting of a model using a linear function (left) and a quadratic function (right).

MODEL ORDER. When taking a closer look at the model fit to data in Fig. 5.2 (left), it can
be noticed that the data-points tend to lie above the fitted straight line for small and large
values of x, and below for intermediate values. This observation suggests that perhaps a
quadratic function would better describe the data. We can easily test this hypothesis by
redefining the parameter vector and the regression vector as
   
    θ = [a  b  c]⊤,  ϕ = [1  x  x²]⊤.    (5.18)

Notice that the model is still a linear regression, although one of the regressors is nonlinear
in x! As seen in Fig. 5.2 (right), the quadratic model clearly gives an improved fit to data;
the loss function is approximately halved compared to the straight line fit.

In fact, we could extend the model in this way to any polynomial in x or, for that matter,
with any regressor being a nonlinear function of x—the only thing that will change is the
definition of the regression vector ϕ. What is important for the derivation of the LS esti-
mate, however, is that the model ŷ = θ⊤ ϕ is linear in the parameters θ.
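The point is easy to verify numerically: with the extended regressor (5.18) the model stays linear in θ, so the same LS machinery applies unchanged. The sketch below (synthetic data of our own) fits both structures and compares the loss (5.7):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
x = rng.uniform(0.0, 1.0, N)
y = 0.2 + 0.4 * x + 0.8 * x**2 + 0.02 * rng.standard_normal(N)

Phi1 = np.column_stack([np.ones(N), x])          # straight line, regressors (5.4)
Phi2 = np.column_stack([np.ones(N), x, x**2])    # quadratic, regressors (5.18)

th1, *_ = np.linalg.lstsq(Phi1, y, rcond=None)
th2, *_ = np.linalg.lstsq(Phi2, y, rcond=None)

V1 = np.mean((y - Phi1 @ th1) ** 2)              # LS loss (5.7) for each structure
V2 = np.mean((y - Phi2 @ th2) ** 2)
```

The quadratic structure yields a clearly lower loss, since the straight line cannot absorb the curvature of the underlying function.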

PROPERTIES OF THE LS ESTIMATE. The least-squares criterion, based on summing the squares of the residuals, at this point seems a bit arbitrary, albeit reasonable. It can be ap-
plied quite generally, without introducing any particular assumptions on model uncertain-
ties or disturbances. However, if we would like to analyze and characterize the properties
of the LS estimate, then we would need to introduce some assumptions on how the data is
generated. This will be exemplified next.

Let us return to our curve fitting example and for a moment pretend that the data is actually
generated by the “true system”
y(i ) = θ0⊤ ϕ(i ) + e(i ), (5.19)

where the sequence {e(i )} consists of independent, identically distributed (i.i.d.) random
variables with variance σ2 (see Section 1.10 for a discussion of i.i.d. random variables). We
have used the word “pretend” to stress the fact that it is highly unlikely that the real data
obeys (5.19). However, the idealized assumption makes it possible to perform some analy-
sis in order to better understand how the least-squares method behaves.

A consequence of the assumption (5.19) is that the estimate θ̂N is itself a random variable,
taking a numerical value that depends on the particular realization of the noise sequence.
When we repeat the experiment, the noise realization will be different and, hence, the es-
timate as well. Considering the fact that θ̂N is a random variable, we can ask ourselves
how this random variable can be characterized. There are a few classical and important
concepts at hand, which we will now discuss:

• Bias. The concept of bias is used in order to describe what happens “on average”, if
we repeat the experiment many times. If there is a systematic error when estimating a
“true” parameter θ0, there is a bias. Conversely, we say that the estimate is unbiased6 if

    E[θ̂N] = θ0    (5.20)

• Variance. In addition to a systematic error, due to the randomness of the collected data, there will be random fluctuations of the estimate around the expected value.
These fluctuations are usually measured by the covariance (a matrix) of θ̂N ,

Cov θ̂N = E[(θ̂N − E[θ̂N ])(θ̂N − E[θ̂N ])⊤ ], (5.21)

which typically decays as the number of data points N increases.

• Consistency. Bias and variance describe properties of the estimate for finite data
records. In addition, a natural question is how the estimate θ̂N behaves when the
number of data N tends to infinity. The estimator is called consistent if

    lim_{N→∞} θ̂N = θ0.    (5.22)

It should be remarked here that the limit of a random variable can be defined in sev-
eral ways. We will use the concept of almost sure convergence or convergence with
probability 1 (w.p. 1). This means, loosely speaking, that the estimate converges in
the usual sense for almost all realizations (or, equivalently, realizations for which this
does not hold have probability 0). In this case, it is common to refer to the property
as strong consistency.

Returning to our linear regression example, we can now state the following properties of
the LS estimate θ̂N :

6. Swedish: medelvärdesriktig

1. The estimate is unbiased, which is easily proved:

    E[θ̂N] = E[ ( (1/N) Σ_{i=1}^{N} ϕ(i)ϕ⊤(i) )⁻¹ (1/N) Σ_{i=1}^{N} ϕ(i)y(i) ]
          = θ0 + ( (1/N) Σ_{i=1}^{N} ϕ(i)ϕ⊤(i) )⁻¹ · E[ (1/N) Σ_{i=1}^{N} ϕ(i)e(i) ] = θ0    (5.23)

2. The covariance (see Section 1.10) of the parameter estimate is, using (5.23):

    E[(θ̂N − θ0)(θ̂N − θ0)⊤] = RN⁻¹ E[ ( (1/N) Σ ϕe ) ( (1/N) Σ ϕe )⊤ ] RN⁻¹
        = (1/N²) RN⁻¹ Σ_{i,j=1}^{N} ϕ(i)ϕ(j)⊤ E[e(i)e(j)] RN⁻¹ = (σ²/N) RN⁻¹.    (5.24)

In the last step, the fact that the regressors are deterministic (or, alternatively, inde-
pendent of the noise) has been used, together with the i.i.d. property of the noise
sequence (implying that E[e(i )e( j )] = 0, i 6= j ).

This result seems quite natural: the spread increases with noise variance and de-
creases with increasing number of data-points. In addition, the spread depends on
the matrix R N , which can be interpreted as a measure of the information available in
the regressors about the model parameters.

3. Assume that the following limits exist:

    (1/N) Σ_{i=1}^{N} ϕ(i)ϕ⊤(i) → R∞,  w.p. 1 as N → ∞    (5.25)

    (1/N) Σ_{i=1}^{N} ϕ(i)e(i) → 0,  w.p. 1 as N → ∞    (5.26)

Then θ̂N is a strongly consistent estimate of θ0, i.e.:

    θ̂N → θ0,  w.p. 1 as N → ∞    (5.27)

REMARK. A natural way to estimate the noise variance is the following:

    σ̂²N = VN(θ̂N) = (1/N) Σ_{i=1}^{N} ε²(i, θ̂N)    (5.28)
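The bias and covariance results (5.23)-(5.24) can be checked by Monte-Carlo simulation under the idealized assumption (5.19) (the experiment below is our own construction): repeating the estimation over many noise realizations, the empirical mean of θ̂N should approach θ0 and its empirical covariance should approach (σ²/N)RN⁻¹:

```python
import numpy as np

rng = np.random.default_rng(2)
N, sigma = 50, 0.1
theta0 = np.array([0.5, 1.0])
x = np.linspace(0.0, 1.0, N)                     # deterministic regressors
Phi = np.column_stack([np.ones(N), x])
RN = Phi.T @ Phi / N

M = 5000                                         # number of repeated experiments
est = np.empty((M, 2))
for k in range(M):
    y = Phi @ theta0 + sigma * rng.standard_normal(N)   # "true system" (5.19)
    est[k] = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)    # LS estimate (5.8)

emp_mean = est.mean(axis=0)                      # should be close to theta0
emp_cov = np.cov(est.T)                          # should match (5.24)
theory_cov = sigma**2 / N * np.linalg.inv(RN)
```

Since the regressors are held fixed across realizations, the only randomness is the noise, exactly the setting in which (5.23)-(5.24) were derived.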

Let us make some brief comments on these results. First, it should again be stressed that
the properties of the LS estimate have been derived under ideal conditions, the most im-
portant one being that the true system belongs to the model set, i.e. the collection of models
obtained by varying the parameter vector θ. The “pay-back” we get in return to this strong

assumption are the insights through these results, which we can hope hold at least approx-
imately for more realistic scenarios.

Another relevant remark is that the characterization of the least-squares method is in terms
of the parameter estimate θ̂N , which is, strictly speaking, not our main concern. What we
are really interested in is to fit a function to data, and the parameter vector θ is just a vehicle
for doing this. However, there is a strong coupling, of course. To illustrate this, assume that
the data is described by, instead of (5.19), the following equation:

y(i ) = f (x(i )) + e(i ), (5.29)

where f (·) is the unknown function to model, and {e(i )} is an i.i.d. noise sequence with
variance σ². Let us now calculate the expected value of the LS criterion:

    E[VN(θ)] = E[ (1/N) Σ_{i=1}^{N} (y(i) − θ⊤ϕ(i))² ] = E[ (1/N) Σ_{i=1}^{N} (f(x(i)) + e(i) − θ⊤ϕ(i))² ]
             = (1/N) Σ_{i=1}^{N} (f(x(i)) − θ⊤ϕ(i))² + σ²    (5.30)

This result shows that there are two contributions to the (expected) LS loss function. The
first term constitutes the systematic error (bias), and the second term is a contribution from
the noise. The bias is determined by how well we can approximate the true function f by
a linear regression (linear or quadratic function of x in our example). Notice that the value
of θ that minimizes the bias depends on the set of data-points {x(i )}, since more emphasis
will be put on intervals with more data-points than intervals with scarce data. For example,
the data depicted in Fig. 5.1 indicates that less emphasis will be put on the interval [0.6, 0.8]
when fitting a model, since this interval contains few data-points. Even if the argument is
based on the expected value of the LS loss function, it gives an idea of how the parameter
estimation can be affected by the experimental conditions. We will return to this later when
discussing identification of dynamic models.
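To make the data-dependence of the bias concrete, here is a small Python/NumPy sketch (the function f, the noise level, and the data ranges are made up for the illustration): the same straight-line model is fitted to a nonlinear function twice, once with data-points spread uniformly and once with data concentrated in a sub-interval, and the resulting minimizers of the LS criterion differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # A nonlinear "true" function (made up for the illustration)
    return np.sin(2 * np.pi * x)

def fit_line(x, e_std=0.1):
    # Least-squares fit of y = theta0 + theta1*x to noisy samples of f
    y = f(x) + e_std * rng.standard_normal(x.size)
    Phi = np.column_stack([np.ones_like(x), x])   # regressors [1, x]
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return theta

# Same model structure, same f: only the placement of the data-points differs
theta_uniform = fit_line(rng.uniform(0.0, 1.0, 2000))   # data spread over [0, 1]
theta_skewed = fit_line(rng.uniform(0.0, 0.3, 2000))    # data concentrated in [0, 0.3]
print(theta_uniform, theta_skewed)
```

The two estimates disagree even in sign of the slope here, because the bias term in (5.30) is weighted by where the data-points lie.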

This concludes the discussion on the introductory curve fitting example, based on apply-
ing the least-squares method to a linear regression model. Although the problem of fitting
a model of a static function to data is different from fitting a dynamic model to data—for
example, it does not involve time per se—many of the concepts introduced and discussed
will turn out to be useful also in the dynamic case.

A final remark is that the problem we have discussed can be formulated already at the out-
set in a probabilistic context, i.e. characterizing the task in terms of stochastic variables. In
doing so, the maximum-likelihood (ML) method offers a general and powerful technique
to formulate and solve many parameter estimation problems. It turns out that the least-
squares method, although formulated here without any stochastic framework, is in fact a
special case of the ML method. A brief account of the ML technique is provided for the
interested reader in Section 5.4.

5.2 PARAMETER E STIMATION FOR L INEAR DYNAMIC S YSTEMS
We will now turn to the task of identifying dynamic models from experimental data. We recall the basic formulation of the (parametric) system identification problem, as for-
mulated in the beginning of this chapter. The basic idea is to choose a model, which de-
pends on a number of parameters, collected in the parameter vector θ, and then to adjust
the parameters so that the model behaviour is close to the real system’s behaviour, as rep-
resented by data recorded from experiments. In this way, you can view the parameters as
“tuning knobs” of the model, and the tuning should be done to mimic the real system as
closely as possible.

A natural question to ask at this point is what the model, with its parameter vector θ, often referred to as a model structure, should look like. Clearly, there are many choices, and any
type of qualitative a priori knowledge of the system may guide the decision. One useful dis-
tinction of model structures is between tailor-made and general-purpose models (a similar
distinction was discussed for the curve fitting example).

Tailor-made models are typically the result of some type of physical modelling effort. The
modelling may for example lead to a state-space model with a specific structure, but with
a number of parameters that are unknown. Such a white-box model could look like:
 
ẋ(t) = f(x(t), u(t), θ)
y(t) = h(x(t), θ),    θ = [θ1, . . . , θd]⊤    (5.31)

In this case, we typically trust that the model (5.31), including the functions f , h, describes
the system well, but lacking the precise values of the parameters θ, we want to determine
or estimate these from data.

Example 5.1 (DC motor). Let us recall the state space model previously derived for a DC
motor:
ẋ = [ −R/L   −Ke/L ]     [ 1/L ]
    [ Km/J    −b/J ] x + [  0  ] u
y = [ 0  1 ] x

where the state vector holds current i and angular speed ω, i.e. x = (i , ω), the input u is the
voltage applied, and the output y is the angular speed.

If the electrical time constant is neglected, the transfer function from voltage to angular
speed is given by
Ω(s)/U(s) = K/(1 + sτ);    K = Km/(Km Ke + bR),    τ = RJ/(Km Ke + bR)
While there are five parameters in the physical model (R, L, J , K e , K m ), the transfer function
can be characterized by only two parameters. The implication is that not all parameters are

identifiable, i.e. can be determined from experimental data describing the input-output re-
lations of the system. On the other hand, it is enough to find appropriate values for K and
τ in order to capture the dynamics of the DC motor. The fact that not all physical parame-
ters can be determined from input-output data is not uncommon for models derived from
physical principles. 
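A quick numerical check of this non-identifiability (Python sketch, parameter values invented for the illustration): rescaling (R, J, b) to (cR, J/c, b/c) leaves both K and τ unchanged, so input-output data cannot distinguish the two physical parameter sets.

```python
def k_tau(R, J, Ke, Km, b):
    # K and tau of the reduced DC-motor transfer function
    # (electrical time constant neglected)
    denom = Km * Ke + b * R
    return Km / denom, R * J / denom

# Made-up nominal values, and a rescaled set (R, J, b) -> (c*R, J/c, b/c)
c = 3.0
p1 = k_tau(R=1.0, J=0.01, Ke=0.1, Km=0.1, b=0.001)
p2 = k_tau(R=c * 1.0, J=0.01 / c, Ke=0.1, Km=0.1, b=0.001 / c)
print(p1, p2)   # identical input-output behaviour for two different parameter sets
```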

Motivated by the example above, it seems natural to ask whether we could instead focus
on models that only try to capture the input-output behaviour of the system. Such models
need not be based on physical relationships, as long as they could describe the dynamics,
and they could therefore be termed general-purpose models.

General-purpose models are also termed black-box models, suggesting that the internal
workings of the model are not of interest, only the external, input-output behaviour. Black-
box models may be linear or non-linear, and formulated in continuous or discrete time.
The following example shows a simple and well-known model.

Example 5.2 (Step-response analysis). Already in the basic control course a simple system
identification tool was employed, namely a step response test. In the simplest case, we ex-
ploit the fact that a first order system, given by the transfer function

G(s) = K / (1 + sT),
has a step response shown in the figure below.

[Figure: step response of a first-order system, rising towards the steady-state value K; the response reaches 0.63K at time t = T.]

Now, if the input sequence u is chosen as a step function and applied to a real system,
whose dynamics can be well approximated by a first order system, then the model param-
eters θ = (K , T ) can at least approximately be found from a plot of the step response—the
model parameters are adjusted to fit the data, i.e. the recorded step response. The proce-
dure can thus be described as a simple example of system identification. 
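The graphical procedure can be sketched in a few lines of Python/NumPy (the values of K, T and the sampling interval are made up for the illustration): K is read off as the steady-state value, and T as the time where the response has reached 63% of K.

```python
import numpy as np

# Simulated unit-step response of G(s) = K/(1 + sT), sampled every Ts seconds
# (K_true, T_true and Ts are invented values for this sketch)
K_true, T_true, Ts = 2.0, 0.5, 0.01
t = np.arange(0.0, 5.0, Ts)
y = K_true * (1.0 - np.exp(-t / T_true))

# Graphical method from the figure: steady-state value and 63%-rise time
K_hat = y[-1]
T_hat = t[np.argmax(y >= 0.63 * K_hat)]
print(K_hat, T_hat)
```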

In this course we will, however, focus our attention on linear, discrete-time models of the
form
y(t ) = G(q, θ)u(t ) + w(t ) = G(q, θ)u(t ) + H(q, θ)e(t ), (5.32)

where

G(q, θ) = B(q, θ)/F(q, θ) = (b1 q^{-1} + . . . + b_{nb} q^{-nb}) / (1 + f1 q^{-1} + . . . + f_{nf} q^{-nf})    (5.33a)
H(q, θ) = C(q, θ)/D(q, θ) = (1 + c1 q^{-1} + . . . + c_{nc} q^{-nc}) / (1 + d1 q^{-1} + . . . + d_{nd} q^{-nd})    (5.33b)

A few remarks are appropriate:

• Recall that the operator q is the time-domain counterpart of z, i.e. qu(t ) =
u(t + 1) and q −1 u(t ) = u(t − 1). See Section 1.9.

• The first term of the model (5.32) means that the u − y dependence is modelled by a
general, linear time-invariant transfer function including a time delay.

• The second term in (5.32), w(t), is an additive disturbance term. Sometimes, it is desirable to model also the disturbance, and here we exploit the fact that a large class of stationary stochastic processes can be modelled as filtered white noise e(t) (i.e. the sequence {e(t)} consists of i.i.d. random variables).

We will investigate the problem to estimate parameters of linear black-box models of the
type (5.32) in some detail. However, before treating the general case, we will introduce
some of the basic ideas in the context of the simpler linear regression case, where some of
the results obtained for the curve fitting example in Section 5.1 will turn out to be useful.

5.2.1 A SPECIAL CASE : LINEAR REGRESSION

Recall that much of the discussion on curve fitting in Section 5.1 was centered around mod-
els providing predictions of the form (5.5), repeated here for convenience:

ŷ(i |θ) = θ⊤ ϕ(i ). (5.34)

We also noted in passing that there is a great deal of flexibility in using different regressors in
ϕ. In fact, this observation opens up for applying the techniques also to dynamic systems,
and a couple of simple examples will illustrate this.

Example 5.3 (Finite impulse response model). Consider the Finite Impulse Response (FIR)
(why is it called this?) model with input u and output y, given by
y(t) = Σ_{i=1}^{nb} b_i u(t−i) + e(t) = θ⊤ϕ(t) + e(t),    (5.35)

where the vector θ holds the parameters b 1 , . . . , b nb , and the vector ϕ holds delayed values
of u. This is clearly a linear regression model, although the regressors are now lagged (de-
layed) input signals to a dynamic model, which is also the reason why we have switched

notation from i to t to stress that time is now the independent variable. In the same way as
before, the model gives rise to the prediction

ŷ(t |t − 1, θ) = θ⊤ ϕ(t ), (5.36)

where the notation now gives emphasis to the fact that the prediction is made one step into
the future, i.e. the model (which depends on θ) is used to predict the output at time t , given
information available at time t − 1 (note that there is no way to predict future values of
e(·), since it is an i.i.d. sequence). The conclusion is that the least-squares method can be
applied to estimate the parameters of the FIR model, and the least-squares estimate (LSE)
is given by (5.8), again repeated here for easy reference:

θ̂N = R_N^{-1} f_N = ( (1/N) Σ_{t=1}^N ϕ(t)ϕ⊤(t) )^{-1} (1/N) Σ_{t=1}^N ϕ(t)y(t)    (5.37)

Moreover, taking a close look at the analysis of the LSE performed in Section 5.1, we can
also conclude that the properties of the LSE still hold under the assumption that the noise
is uncorrelated with the input. 
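As a sketch of (5.37) in practice, the following Python/NumPy snippet simulates a hypothetical FIR system with a white-noise input and recovers its parameters by linear least squares (all numerical values are invented for the illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical true FIR system y(t) = sum_i b_i u(t-i) + e(t)
b_true = np.array([0.5, 0.3, -0.2])            # nb = 3
N = 500
u = rng.standard_normal(N)                     # white input: rich excitation
e = 0.05 * rng.standard_normal(N)
nb = b_true.size

# Regressor matrix: row t is phi(t) = [u(t-1), ..., u(t-nb)], zero initial conditions
Phi = np.column_stack([np.r_[np.zeros(i + 1), u[:N - i - 1]] for i in range(nb)])
y = Phi @ b_true + e

# Least-squares estimate, cf. (5.37)
theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(theta_hat)
```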

We discussed briefly in Section 5.1 that the choice of experimental data, i.e. selection of
points {x(i )}, has an impact on the resulting curve fitting. A similar observation can be
made in the dynamic case, as illustrated in the next example.

Example 5.4 (Estimation of an FIR model with different inputs). Let us illustrate next esti-
mation of an FIR model with n b = 20, assuming data is generated by a real system with the
same structure and noise added. Figure 5.3 shows the results for different inputs (left-hand
graphs), yielding different output of the real system and model (centred graphs), and set
of parameters θ̂ (right-hand side graphs). One can observe that the choice of input has a
strong impact on the quality of the parameter estimate θ̂. Indeed, a poor parameter esti-
mation can be obtained, even though the model output fits the system output fairly well.
In such a case, one would suspect that some of the conditions, under which we analyzed
the properties of the LSE, is not fulfilled (which one?). 

Figure 5.3: Illustration of the least-squares approach applied to an FIR model. The input
sequences u are displayed on the left-hand side graphs. The black signals repre-
sent the outputs y (centred graphs) and parameters (right-hand side graphs) of
the real system, while the grey signals represent the outputs ŷ (centred graphs)
and estimated parameters (right-hand side graphs) of the model.

Example 5.5 (Auto-regressive model). The FIR model describes a transfer function from u
to y, having zeros only (but no poles). We can add poles, however, by extending the FIR
model with auto-regressive terms,
y(t) = Σ_{i=1}^{nb} b_i u(t−i) − Σ_{i=1}^{na} a_i y(t−i) + e(t) = θ⊤ϕ(t) + e(t),    (5.38)

and this way get an Auto-Regressive model with eXogenous input (ARX) model. The vector θ
now contains both a- and b-parameters, and ϕ contains delayed values of both input and

output. The prediction is still given by (5.36), and is based on previous inputs and outputs.
Again, we can conclude that the least-squares method can be applied to estimate the pa-
rameters, and the LSE is given by (5.8) with the new definitions of the vectors involved.
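The same recipe can be sketched for a first-order ARX model in Python/NumPy (system parameters and noise level are made up for the illustration); note that the regressor vector now also contains a lagged output.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical first-order ARX system: y(t) = -a1*y(t-1) + b1*u(t-1) + e(t)
a1, b1 = -0.8, 1.0
N = 1000
u = rng.standard_normal(N)
e = 0.1 * rng.standard_normal(N)

y = np.zeros(N)
for t in range(1, N):
    y[t] = -a1 * y[t - 1] + b1 * u[t - 1] + e[t]

# Regressor phi(t) = [-y(t-1), u(t-1)], parameter vector theta = [a1, b1]
Phi = np.column_stack([-y[:-1], u[:-1]])
theta_hat, *_ = np.linalg.lstsq(Phi, y[1:], rcond=None)
print(theta_hat)        # approximates [a1, b1]
```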


5.2.2 P REDICTIONS FOR LINEAR BLACK - BOX MODELS

Motivated by the FIR and ARX examples, we would now like to extend the techniques to the
general linear, black-box model (5.32), repeated here for convenience:

y(t ) = G(q, θ)u(t ) + w(t ) = G(q, θ)u(t ) + H(q, θ)e(t ), (5.39)

where

G(q, θ) = B(q, θ)/F(q, θ) = (b1 q^{-1} + . . . + b_{nb} q^{-nb}) / (1 + f1 q^{-1} + . . . + f_{nf} q^{-nf})    (5.40a)
H(q, θ) = C(q, θ)/D(q, θ) = (1 + c1 q^{-1} + . . . + c_{nc} q^{-nc}) / (1 + d1 q^{-1} + . . . + d_{nd} q^{-nd})    (5.40b)

In block diagram form, the model (5.39) would look as depicted below:

[Block diagram: u passes through B(q)/F(q), e passes through C(q)/D(q), and their sum forms the output y.]

Two important remarks should be made at this point:

• Postulating the model above means we are interested in modelling both input-output
behaviour by the transfer function G, and the disturbance character by the noise
transfer function H (unless H = 1).

• A consequence of introducing a noise model H is that the prediction of y becomes a bit more involved than in the previous linear regression cases.

Let us start to investigate the latter item.

If we write the model transfer functions in terms of their impulse response coefficients, we
have

G(q, θ) = Σ_{k=1}^∞ g(k, θ) q^{-k},    H(q, θ) = 1 + Σ_{k=1}^∞ h(k, θ) q^{-k}    (5.41)

Now, using this representation for H in (5.39), we see that the disturbance term w(t ) is a
weighted sum of previous noise terms e(t − k). With the assumption that {e(·)} are i.i.d.

random variables, it is clear that the term e(t ) in w(t ) is completely unpredictable. On the
other hand, all previous noise terms are present in old outputs, so in principle it should be
possible to extract these from measurements of the output (along with the inputs). Indeed,
this program can be carried out mathematically by re-writing the output y(t ) as

y(t) = G(q)u(t) + (H(q) − 1)e(t) + e(t)
     = G(q)u(t) + (H(q) − 1)H(q)^{-1}( y(t) − G(q)u(t) ) + e(t)
     = H(q)^{-1}G(q)u(t) + (1 − H(q)^{-1})y(t) + e(t)    (5.42)

where we have omitted the argument θ for simplicity. As argued above, the last term in
this expression cannot be predicted. The remaining two terms consist of filtered input and
output signals up till time t − 1 (why?). Hence, these two terms will form the best mean-
square prediction of the output:

ŷ(t |t − 1, θ) = H −1 (q, θ)G(q, θ)u(t ) + (1 − H −1 (q, θ))y(t ), (5.43)

leading to the optimal prediction error ε(t , θ) = e(t ) (for θ known). By using the definitions
(5.40), the prediction can be expressed in the model polynomials as

ŷ(t|t−1, θ) = (D(q)/C(q)) · (B(q)/F(q)) u(t) + ((C(q) − D(q))/C(q)) y(t),    (5.44)

In order to make it a bit more concrete, let us take a look at some special cases.

Example 5.6 (Linear regression model structures). We have already encountered the FIR
and the ARX cases, which can be obtained by the choices F (q) = C (q) = D(q) = 1 (FIR) and
C (q) = 1, F (q) = D(q) = A(q) (ARX), respectively:

FIR: y(t) = B(q, θ)u(t) + e(t)    (5.45a)
ARX: y(t) = (B(q, θ)/A(q, θ)) u(t) + (1/A(q, θ)) e(t) ⇔ A(q, θ)y(t) = B(q, θ)u(t) + e(t)    (5.45b)

These two model structures are depicted in block diagram form below (FIR left, ARX right),
where it can be clearly seen that in the ARX model structure, the polynomial A(q) is shared
between the input-output and noise-output transfer functions.

[Block diagrams. FIR (left): u → B(q) → Σ → y, with e added directly at the sum. ARX (right): u → B(q)/A(q) → Σ → y, with e entering through 1/A(q).]

The predictors are obtained as special cases of (5.44):

FIR: ŷ(t|t−1, θ) = B(q, θ)u(t)    (5.46a)
ARX: ŷ(t|t−1, θ) = B(q, θ)u(t) + (1 − A(q, θ))y(t).    (5.46b)

Recall that we have in earlier examples observed that these expressions can both be written
as linear regressions ŷ(t |t − 1, θ) = θ⊤ ϕ(t ), which in turn gave us the opportunity to derive
explicit solutions for the least-squares estimates. 
There are two more special cases that we would like to mention, cases that will turn out to
be different than the ones discussed so far.

Example 5.7 (Non-linear regression model structures). The ARMAX (Auto-Regressive, Mov-
ing Average with eXogenous input) and OE (Output Error) model structures are obtained by
choosing F (q) = D(q) = A(q) and C (q) = D(q) = 1, respectively. We thus get the models and
predictors as follows:
ARMAX: y(t) = (B(q, θ)/A(q, θ)) u(t) + (C(q, θ)/A(q, θ)) e(t) ⇔ A(q, θ)y(t) = B(q, θ)u(t) + C(q, θ)e(t)
       ŷ(t|t−1, θ) = (B(q, θ)/C(q, θ)) u(t) + ((C(q, θ) − A(q, θ))/C(q, θ)) y(t)    (5.47a)

OE:    y(t) = (B(q, θ)/F(q, θ)) u(t) + e(t)
       ŷ(t|t−1, θ) = (B(q, θ)/F(q, θ)) u(t)    (5.47b)

Block diagrams are depicted below (ARMAX left, OE right).


[Block diagrams. ARMAX (left): u → B(q)/A(q) → Σ → y, with e entering through C(q)/A(q). OE (right): u → B(q)/F(q) → Σ → y, with e added directly at the sum.]


Let us reflect a bit on the results in the last example and make a couple of observations:

• We can notice that the predictions for these cases are not simply a linear combina-
tion of a few delayed, measured quantities, as in the FIR and ARX cases. Rather, the
predictions are given as the outputs of filters with measured u and y as inputs.

• In order for these predictor filters to produce meaningful results, we have to ensure
that C (q) and F (q), respectively, are stable polynomials.

• In the OE case, the prediction is generated by filtering u only, i.e. by simply simu-
lating the model with input u; the measured outputs are not used to compute the
prediction.

The first observation above—concerning predictions as output from filters—has important implications. To illustrate this, let's take a closer look at the OE case, where we can re-write the equation for ŷ as

F(q, θ) ŷ(t|t−1, θ) = B(q, θ)u(t),    (5.48)
or, after rearranging,

ŷ(t |t − 1, θ) = (1 − F (q, θ)) ŷ(t |t − 1, θ) + B(q, θ)u(t ) (5.49)

The first term in this expression contains delayed values of the prediction itself, rather than
measured quantities. We can still gather delayed signals, u and ŷ, in a vector ϕ to get an
expression for the prediction like

ŷ(t |t − 1, θ) = θ⊤ ϕ(t , θ), (5.50)

but the important distinction from earlier is that ϕ now depends on θ via ŷ. The conclusion
is that the prediction is nonlinear in θ. The same conclusion holds for the ARMAX case,
where the predictor can be re-written as

ŷ(t|t−1, θ) = B(q, θ)u(t) + (C(q, θ) − A(q, θ))y(t) + (1 − C(q, θ)) ŷ(t|t−1, θ)    (5.51)
            = B(q, θ)u(t) + (1 − A(q, θ))y(t) + (C(q, θ) − 1)( y(t) − ŷ(t|t−1, θ) ),    (5.52)

where the last expression contains previous values of the prediction error ε(t , θ).

5.2.3 P REDICTION E RROR M ETHODS (PEM)

Once we have determined a model structure, or rather the predictor associated with the
model structure, we can devise a parameter estimation scheme that is a fairly straightfor-
ward generalization of the ones encountered so far. The following “recipe” gives rise to an
entire class of Prediction Error Methods (PEM):

1. Compute the prediction error

   ε(t, θ) = y(t) − ŷ(t|t−1, θ),    t = 1, . . . , N

2. Compute the model fit (the cost)

   VN(θ) = (1/N) Σ_{t=1}^N l(t, θ, ε(t, θ)),

   where l is a scalar, positive function.

3. Pick the best model

   θ̂N = arg min_θ VN(θ)

The following remarks should be made:

• The PEM recipe is quite general and can be applied to both white-box and black-box
models.

• The choice of l can be made in various ways. The very common least-squares crite-
rion corresponds to the choice l (t , θ, ε(t , θ)) = ε2 (t , θ).
R EMARK : The maximum-likelihood method, described in Section 5.4, corresponds to
choosing l as the so-called negative log-likelihood function.

Let’s assume we want to apply the PEM recipe in order to fit a model to data using a least-
squares criterion, i.e. we would like to minimize the cost

VN(θ) = (1/N) Σ_{t=1}^N ε²(t, θ) = (1/N) Σ_{t=1}^N ( y(t) − ŷ(t|t−1, θ) )²    (5.53)

Based on the discussion we had concerning the properties of the predictor, we should distinguish between the following two situations:

LLS: For the linear regression case, including the ARX and FIR models, the predictor ŷ(t |t −
1, θ) = θ⊤ ϕ(t ) is linear in θ. The problem is then a Linear Least-Squares (LLS) prob-
lem, with the property that the estimate—the minimizer of VN (θ)—can be given ex-
plicitly or be computed as the solution of a system of linear equations. To refresh this,
we note that the cost VN (θ) can be differentiated and put to zero to find the solution:

∇θ VN(θ) = (2/N) Σ_{t=1}^N ∇θ ε(t, θ) ε(t, θ) = −(2/N) Σ_{t=1}^N ∇θ ŷ(t|t−1, θ) ( y(t) − ŷ(t|t−1, θ) ) = 0,    (5.54)

which, after inserting ŷ(t|t−1, θ) = θ⊤ϕ(t), gives the system of linear equations

[ (1/N) Σ_{t=1}^N ϕ(t)ϕ⊤(t) ] θ = (1/N) Σ_{t=1}^N ϕ(t)y(t)    (5.55)

NLS: For the other model structures, including ARMAX and OE, the predictor is nonlinear
in θ. The problem is then a Nonlinear Least Squares (NLS) problem, for which the
minimizer of VN (θ) has to be found by an iterative search. Indeed, we have already
come across a general tool for this, namely the Newton method. The method would
be applied to the same necessary condition for a minimum, i.e.

∇θ VN(θ) = −(2/N) Σ_{t=1}^N ∇θ ŷ(t|t−1, θ) ( y(t) − ŷ(t|t−1, θ) ) = 0,    (5.56)

the difference being that ŷ(t|t−1, θ) = θ⊤ϕ(t, θ) is no longer linear in θ.
Observe that, when applying an iterative method to find the minimum of the crite-
rion, both predictions and the gradient of the predictions (see below) are obtained
by filtering operations, and since the filters depend on θ, these signals need to be
re-evaluated at each iteration.

R EMARK . In the NLS case, e.g. for the ARMAX and OE models, we need an expression for
the gradient of the prediction, ∇θ ŷ(t|t−1, θ), in order to apply the Newton method. We will illustrate
for the OE case how this can be worked out and we start by showing in detail how to go
from (5.49) to (5.50):

ŷ(t |t − 1, θ) = (1 − F (q, θ)) ŷ (t |t − 1, θ) + B(q, θ)u(t ) = θ⊤ ϕ(t , θ), (5.57)

where
θ⊤ = [ f1 · · · f_{nf}   b1 · · · b_{nb} ]    (5.58a)
ϕ⊤(t, θ) = [ −ŷ(t−1) · · · −ŷ(t−nf)   u(t−1) · · · u(t−nb) ]    (5.58b)

This implies that we can calculate the derivatives of the prediction with respect to the pa-
rameters (using here total derivates, observing that previous values of the predictions de-
pend on the parameters):

(d/dbi) ŷ(t|t−1, θ) = (1 − F(q, θ)) (d/dbi) ŷ(t|t−1, θ) + u(t−i)    (5.59a)
(d/dfi) ŷ(t|t−1, θ) = (1 − F(q, θ)) (d/dfi) ŷ(t|t−1, θ) − ŷ(t−i|t−i−1, θ).    (5.59b)

so that, finally, the gradient can be compiled as

∇θ ŷ(t|t−1, θ) = (1/F(q, θ)) ϕ(t, θ)    (5.60)

Hence, the gradient is a filtered version of the regressor vector. The derivation for the AR-
MAX case is analogous. 
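To make the NLS machinery concrete, the following Python/NumPy sketch runs a damped Gauss-Newton iteration for a first-order OE model, using the filtered-regressor gradient (5.60); the system, noise level, initial guess, and step-damping scheme are all made up for the illustration (a real tool would use a more sophisticated search).

```python
import numpy as np

np.seterr(over="ignore")        # unstable trial predictors may overflow; damping rejects them
rng = np.random.default_rng(3)

# Data from a hypothetical first-order OE system, theta = (f1, b1)
f1_true, b1_true = -0.7, 1.0
N = 2000
u = rng.standard_normal(N)
y = np.zeros(N)
for t in range(1, N):
    y[t] = -f1_true * y[t - 1] + b1_true * u[t - 1]
y = y + 0.1 * rng.standard_normal(N)             # additive output error

def predict(theta):
    # OE predictor (5.47b): simulate B(q)/F(q) driven by u only
    f1, b1 = theta
    yhat = np.zeros(N)
    for t in range(1, N):
        yhat[t] = -f1 * yhat[t - 1] + b1 * u[t - 1]
    return yhat

def cost(theta):
    # V_N(theta), cf. (5.53)
    return np.mean((y - predict(theta)) ** 2)

def gradient(theta, yhat):
    # (5.60): the gradient is the regressor phi(t, theta) filtered through 1/F(q)
    f1, _ = theta
    phi = np.column_stack([np.r_[0.0, -yhat[:-1]], np.r_[0.0, u[:-1]]])
    grad = np.zeros_like(phi)
    for t in range(1, N):
        grad[t] = -f1 * grad[t - 1] + phi[t]
    return grad

# Damped Gauss-Newton iterations on the NLS problem
theta = np.array([-0.5, 0.5])                    # crude initial guess
for _ in range(50):
    yhat = predict(theta)
    J = gradient(theta, yhat)
    step = np.linalg.lstsq(J, y - yhat, rcond=None)[0]
    alpha, c0 = 1.0, cost(theta)
    while cost(theta + alpha * step) > c0 and alpha > 1e-8:
        alpha /= 2                               # halve the step until the cost decreases
    theta = theta + alpha * step
print(theta)
```

Note how both the prediction and its gradient are produced by filtering operations whose coefficients depend on θ, so both must be re-evaluated at every iteration, exactly as remarked above.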

5.2.4 P ROPERTIES OF THE PEM ESTIMATE

Properties of the PEM estimate θ̂N , such as bias and variance, can be established similarly
to least-squares applied to the non-dynamic case. We briefly state a few results below.

B IAS . In real applications, the system to be modelled is typically more complex than what
the model can capture. The implication is that, even with very large data sets, there will be
a discrepancy between the model and the real system; this is what we have earlier termed
systematic error or bias. One way to interpret the bias is to view the parameter estimation as
an attempt to find an approximation of the real system, an approximation that is in some
sense the best one can find within the given model structure. The “systematic” approxi-
mation error—the bias—will, however, depend on how the experiments were conducted,
since e.g. different types of inputs will highlight different features of the system.

It is difficult to characterize the bias in general terms, since it depends on e.g. properties
of the real system and the model structure, as well as properties of the input. However, by
assuming that the number of data points N tends to infinity, a fundamental property that

can be used to investigate the bias in special cases can be derived. Let us first note that,
under quite general conditions, VN (θ) converges to E ε2 (t , θ). This in turn can be used to
prove the following:

θ̂N → { θ | E ε²(t, θ) attains its minimum }    (5.61)
In hindsight, this result encodes what we can hope for: the parameter estimate converges
to a set, which minimizes the variance of the prediction errors, corresponding to the best
model within the given model structure.

The result quoted above is somewhat implicit. It is also formulated in terms of the param-
eter estimate θ̂, which is of less interest for the case of black-box models (cf. the discussion
on the curve-fitting example in Section 5.1). In order to get a bit more insight into characteri-
zation of bias, let us look at an example:
Example 5.8 (Bias distribution in the OE case). Recall the Output Error (OE) model struc-
ture with the predictor
B(q, θ)
ŷ(t |t − 1, θ) = u(t ) (5.62)
F (q, θ)
Now, assume the data is generated by the system

y(t ) = G(q)u(t ) + e(t ), (5.63)

where G(q) is a transfer function and e(·) is any noise sequence, which is uncorrelated with
the input u(·). Using (5.62) and the uncorrelatedness, we can then obtain an expression for
the variance of the prediction errors:
E ε²(t, θ) = E[ ( (G(q) − B(q, θ)/F(q, θ)) u(t) + e(t) )² ] = E[ ( (G(q) − B(q, θ)/F(q, θ)) u(t) )² ] + E e²(t)    (5.64)
The interpretation is that, asymptotically, the parameter estimate will be determined such
that the systematic error, given by the first term in this expression, will be minimized. Sim-
B(q,θ)
ilarly to the curve-fitting case, the fit of the model transfer function F (q,θ) to the system
transfer function G(q) will depend on the input u(t ). This dependence can be expressed
more explicitly in the frequency domain by exploiting Parseval’s formula:
Z Z
1 π B(e i ω , θ) 2 1 π
E ε2 (t , θ) = |G(e i ω ) − | Φu (ω)d ω + Φe (ω)d ω, (5.65)
2π −π F (e i ω , θ) 2π −π
where Φu (ω) and Φe (ω) are the spectral densities of the input and the noise, respectively.
From this expression, it can clearly be seen how the fit of the model transfer function will
be weighted by the frequency distribution of input power. This is quite intuitive and gives
a clear indication of how the spectral properties of the input excitation can be chosen to
prioritize certain frequency regions when fitting the model to data. 
The example above was formulated for the OE model structure. For the general model
structure (5.39), a similar result can be derived, but the interpretation is less transparent,
since the fitting of the input-output transfer function and the noise model transfer function
are interconnected.

VARIANCE . The random fluctuations of the parameter estimate due to the finite number
of (noise corrupted) data can be characterized in a similar way as discussed before for the
linear regression case. Basically, the only thing that is changed is that the regression vector
ϕ is replaced by the gradient of the prediction, i.e. ∇θ ŷ. Thus, assuming there is a “true” pa-
rameter θ0 , the covariance of the parameter estimate is approximately and asymptotically
given by:
E[(θ̂N − θ0)(θ̂N − θ0)⊤] ≈ (σ²/N) R^{-1},    (5.66)
where σ2 is the noise variance, and
R = E[ ∇θ ŷ(t, θ) ∇θ ŷ⊤(t, θ) ].    (5.67)

In fact, it can also be shown, under general conditions, that the estimate is asymptotically
normal distributed with mean θ0 and covariance matrix given in (5.66), or
√N (θ̂N − θ0) ∼ AsN(0, σ² R^{-1})    (5.68)
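The covariance formula (5.66) can be checked by a small Monte Carlo experiment; the Python/NumPy sketch below (all numbers invented for the illustration) uses a two-tap FIR model with unit-variance white input, for which ∇θ ŷ = ϕ(t) = [u(t−1), u(t−2)] and hence R = I:

```python
import numpy as np

rng = np.random.default_rng(4)

# Monte Carlo check of (5.66) for a 2-tap FIR model with white input
b_true = np.array([1.0, 0.5])
sigma, N, M = 0.2, 400, 300
estimates = np.zeros((M, 2))
for m in range(M):
    u = rng.standard_normal(N)
    Phi = np.column_stack([np.r_[0.0, u[:-1]], np.r_[0.0, 0.0, u[:-2]]])
    y = Phi @ b_true + sigma * rng.standard_normal(N)
    estimates[m], *_ = np.linalg.lstsq(Phi, y, rcond=None)

emp_cov = np.cov(estimates.T)          # empirical covariance over M experiments
theory = sigma**2 / N * np.eye(2)      # (sigma^2/N) * R^{-1}, with R = I here
print(emp_cov)
print(theory)
```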

5.3 S YSTEM IDENTIFICATION IN PRACTICE


In order to successfully apply the “experimental modelling” approach, there are plenty of
practical aspects that need consideration. Figure 5.4 gives a bird’s-eye view on the workflow
going from planning of the experiment to the final result, a model that fits its purpose. We
notice an important feature of the workflow, namely that there would be ample reasons
and opportunities for iteratively performing some of the steps, i.e. simply to re-do some
of the parts, after having analyzed results from later steps. In the following, we will briefly
discuss each of the steps in the suggested workflow.

5.3.1 D ESIGN OF EXPERIMENTAL CONDITIONS

With the experimental or data-driven approach to system identification, data is clearly fun-
damental. Indeed, carrying out the experiments and recording the chosen signals during
the experiment is the first necessary step. It should be stressed right away that it is usu-
ally worthwhile spending some effort to prepare the experiment and the data collection
carefully. The reason for this is that the quality of the final outcome of the system identi-
fication, the model, is largely dictated by the quality of the data, and significant efforts are
usually required to prepare and carry out additional experiments. Several factors influence
the quality of the data obtained, and some examples are given next.

C HOICE OF OPERATING POINT. The general rule is to collect data in operating points, sim-
ilar to those where the model is to be applied. If the system under study has significant
nonlinearities, the resulting model will depend on the operating point used. In some cases,
in particular if linear models are used, several models covering different parts of the oper-
ating window may have to be used. One example of this is when a so called gain scheduled
controller is designed based on the models.

[Figure: workflow loop: Experiment design → Data collection → Pretreatment of data → Model structure selection → Parameter estimation → Model validation → Model ok? If no, iterate back to an earlier step; if yes, done.]

Figure 5.4: Workflow for the system identification process in practice.

C HOICE OF SAMPLING INTERVAL . When data is collected, the signals are almost always sampled at discrete time instants. This necessarily implies loss of information, and hence it is impor-
tant to make a conscious decision on which sampling interval to use. Sampling too slow
may imply that dynamics in the interesting frequency region is not accurately reflected.
Sampling too fast, on the other hand, may cause the modelling to be focussed on e.g. ir-
relevant, high-frequency disturbances—and, not the least, to generate a lot of data! A
rule-of-thumb is that 6-10 samples per settling time of a step response is usually a good
starting-point. A final remark concerning sampling is the risk of aliasing and the need to
use (analog) filtering of the signals before sampling.

C HOICE OF INPUT SIGNAL . When carrying out an experiment on the system, you usually
need to decide which inputs should be applied. This is an opportunity to guide the
system identification process to produce high-quality models. Again, thinking in terms of
frequency properties usually helps intuition: if we intend to use the model in certain fre-
quency regions, we need to make sure that the applied input has high content (spectral
power) in those regions. Indeed, we saw in Example 5.4 that the lack of proper excitation—
in this case exemplified by a single sinusoid—may lead to poorly estimated parameters,
although the predictions could still be good.
Choosing amplitude of the input is usually a balance between getting good signal-to-noise
ratios (accuracy) and staying within operating constraints. In addition, in the presence of
non-linearities, it may be wise to limit the amplitudes (saturation in actuators is an ex-

treme example). Sometimes the inputs to the system are not decided by the experimenter,
e.g. when the system is in normal operation, or when the input is generated by feedback.
Such situations require some extra care, but will not be further pursued here.

5.3.2 P RETREATMENT OF DATA

Once data is collected, the rest of the system identification process basically amounts to
data processing. Traditionally, this involves a fair amount of engineering skills used to guide
the processing of the data⁷, and this work is supported by interactive system identification
tools, e.g. available in Matlab or Mathematica.

D ETECTION OF ABNORMAL DATA . The first step in the data processing is to prepare the data
to be used for system identification. A good advice is to simply look at the data in the very
first step—the human eye and brain are powerful tools to quickly detect strange features
of data, and this could save a lot of work later on in the process. One standard action is
to remove outliers, i.e. data-points that are obviously wrong, due to e.g. loss of data or a
temporary failure of a sensor.

F ILTERING OF DATA . Viewing the collected data as raw data, it may be worth “polishing”
the data prior to system identification. In the interest of focussing the modelling on the
most important frequency region, it may be a good idea to remove low-frequency content
by e.g. eliminating non-zero means or trends in the data. This can be done in different ways,
e.g. by fitting a constant or a straight line to the data, or by using differentiated data. In a
similar way, high-frequency content may be removed by low-pass filtering, e.g. when it has
been discovered that an unnecessarily high sampling frequency has been used.
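As an illustrative sketch (not from the notes' toolchain; the signal and time vector below are made up), the straight-line detrending step can be written in a few lines of Python:

```python
import numpy as np

def detrend(y, t):
    # Fit y ≈ a*t + b by least squares and subtract the fitted line.
    A = np.column_stack([t, np.ones_like(t)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return y - A @ coef

# Synthetic drifting signal: a slow linear trend plus the content of interest.
t = np.arange(100.0)
y_raw = 0.05 * t + 2.0 + np.sin(0.3 * t)
y = detrend(y_raw, t)
print(abs(y.mean()) < 1e-8)  # → True: mean and linear trend are removed
```

High-frequency content would instead be attenuated with a low-pass filter before the estimation step.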

5.3.3 M ODEL STRUCTURE SELECTION

Prior to the actual parameter estimation, we need to decide on the model structure to be
used, and we have seen that there are plenty of possible choices.

W HITE OR BLACK BOX ? In the beginning of this chapter, we discussed the choice between
tailor-made and general-purpose models, or white-box vs. black-box models. If physics-
based relations can be used as starting-point for the modelling, then the white-box alter-
native may be preferable. The PEM principle can still be used to formulate a system iden-
tification problem, with the formation of the predictor being one important step. In this
introductory treatment of the SysId process, we have chosen to focus on the alternative
route via black-box models. There are still choices to be made, however...

C HOICE OF PARAMETRIZATION . There are, within the general black-box model (5.39), sev-
eral special model structures such as ARX, OE and ARMAX. The selection may be guided
by some application insights, whether there is a desire to model disturbances or not, or

⁷ This is to some extent challenged by the current development within machine learning.

simply by a desire to keep it simple—not a bad starting point! In addition to the choice
of parametrization, you need to choose model order (number of parameters in each of the
polynomials) and possibly also a time delay of the model. These are difficult choices, and
in the end it is usually a process of trial-and-error, guided by different tools for model vali-
dation.

5.3.4 M ODEL VALIDATION

Once the model structure, including e.g. model order, has been decided, the actual param-
eter estimation is basically an automatic step, carried out by the software tool. The resulting
model needs to be scrutinized, however, a step that is usually referred to as model valida-
tion. Note that this is not only a final step to ensure that the model fits its purpose, but it
is also a tool to assess the different choices made earlier in the process, e.g. selection of
model orders. The outcome may very well lead to going back to a previous step, making
different choices and repeating the parameter estimation step.

T ESTING MODEL QUALITY. There are many alternatives to test the model quality. The most
immediate is to evaluate the loss function used in the PEM approach, and a ranking of dif-
ferent models can easily be made. This could be a crude first step, but more information
can be gathered by studying time series based on the model, e.g.:

• The predicted output, ŷ(t , θ), using the model, can be compared with the real output
y(t ).

• The simulated, noise-free output of the model, y m (t ) = G(q, θ)u(t ), can be compared
with the real output y(t ).

Notice the distinction between the two: the predicted output is at the core of the PEM,
whereas the simulated output neglects the disturbance model and therefore usually devi-
ates more from the measured output. In the FIR and OE cases, the two are the same.
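The distinction can be made concrete with a small Python sketch; the first-order ARX system, its coefficients and noise level are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical first-order ARX system: y(t) = -a*y(t-1) + b*u(t-1) + e(t).
a, b = -0.8, 1.0
N = 200
u = rng.standard_normal(N)
e = 0.1 * rng.standard_normal(N)
y = np.zeros(N)
for t in range(1, N):
    y[t] = -a * y[t - 1] + b * u[t - 1] + e[t]

# One-step-ahead prediction uses the measured past output...
y_pred = np.zeros(N)
for t in range(1, N):
    y_pred[t] = -a * y[t - 1] + b * u[t - 1]

# ...whereas simulation feeds back the model's own noise-free output.
y_sim = np.zeros(N)
for t in range(1, N):
    y_sim[t] = -a * y_sim[t - 1] + b * u[t - 1]

rmse_pred = np.sqrt(np.mean((y - y_pred) ** 2))
rmse_sim = np.sqrt(np.mean((y - y_sim) ** 2))
print(f"one-step RMSE {rmse_pred:.3f}, simulation RMSE {rmse_sim:.3f}")
```

For this ARX system the simulation error accumulates the noise through the output feedback, so it is typically the larger of the two.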
In addition to the above, the model can be studied in various ways, e.g. plotting of fre-
quency response, investigating poles and zeros etc.

M ODEL ORDER SELECTION . When testing model quality as described above, e.g. by com-
paring the values of the loss functions for several candidate models, it should come as
no surprise that increased model order—and hence model flexibility—will always improve
the fit. However, increasing the model order too much poses a risk: the model captures
specifics of the noise realization rather than relevant system dynamics. This phenomenon is
called overfit. A pragmatic way to avoid overfit is to look for a “knee” in the curve depicting
how the loss function decreases with model order, as illustrated in Figure 5.5.
Another way to judge what is a significant improvement when increasing model orders, is
to use some test. One such test quantity is Akaike’s final prediction error criterion, which is
defined as
FPE = (1 + n/N)/(1 − n/N) · V_N(θ̂_N), (5.69)

Figure 5.5: Loss function V(θ̂) for increasing model orders, a typical plot.

where n is the model order and N is the number of data-points. By looking for the model with the
smallest value of the test quantity, it is possible to make a trade-off between improved fit
and increased model complexity.
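The criterion (5.69) is easy to tabulate over candidate orders; the loss values below are invented to mimic the knee of Figure 5.5:

```python
def fpe(V_N, n, N):
    # Akaike's final prediction error criterion (5.69).
    return (1 + n / N) / (1 - n / N) * V_N

N = 500                                          # number of data-points
losses = {1: 0.80, 2: 0.31, 3: 0.30, 4: 0.299}   # V_N(theta_hat) per model order
scores = {n: fpe(V, n, N) for n, V in losses.items()}
best = min(scores, key=scores.get)
print(best)  # → 3: the tiny loss reduction at order 4 does not pay for the extra parameter
```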

A NALYZING PREDICTION ERRORS . The prediction errors comprise the very basis for the
PEM approach. Under certain assumptions, the statistical properties of the prediction er-
rors obtained after parameter estimation, the residuals, can be analyzed. This leads to two
useful tests:

• Cross-correlation test: If the model structure is able to “extract” all available information in the input when forming the prediction of the output, we would expect the residuals to be uncorrelated with the input. Indeed, approximately the following holds and can be used to formulate statistical tests:

R̂_εu(τ) = 0 for τ > 0, if the model fit is good; R̂_εu(τ) ≠ 0 for τ < 0, indication of feedback (5.70)

• Autocorrelation test: If the model structure basically fulfills the idealized assump-
tions we used in the derivation, then we expect the residuals to be an i.i.d. sequence
of random variables:

R_ε(τ) = 0, τ ≠ 0, if the model fit is good (5.71)

Figure 5.6 shows plots of two test quantities used for these tests, for a specific example.
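Both tests can be sketched with plain sample correlations; here the “residuals” are white noise standing in for a well-fitted model, and 1.96/√N is the usual approximate 95% threshold (all numbers invented):

```python
import numpy as np

def xcorr(a, b, max_lag):
    # Normalized sample cross-correlation for lags tau = 0..max_lag.
    a = a - a.mean()
    b = b - b.mean()
    n = len(a)
    denom = np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))
    return np.array([np.sum(a[tau:] * b[:n - tau])
                     for tau in range(max_lag + 1)]) / denom

rng = np.random.default_rng(1)
N = 1000
u = rng.standard_normal(N)
eps = rng.standard_normal(N)      # residuals of a (hypothetically) good model

thr = 1.96 / np.sqrt(N)           # approximate 95% confidence threshold
r_ee = xcorr(eps, eps, 20)        # autocorrelation test (5.71)
r_eu = xcorr(eps, u, 20)          # cross-correlation test (5.70), tau >= 0 only

print(np.mean(np.abs(r_ee[1:]) < thr))   # fraction of lags inside the band
print(np.mean(np.abs(r_eu) < thr))
```

For white residuals, almost all lags should fall inside the ±1.96/√N band, as in the top panel of Figure 5.6.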

C ROSS VALIDATION . For all of the above validation techniques, it is a good rule to try the
model on fresh data. This is called cross validation and thus means that you separate data
into two sets: the first one is used (after pre-processing) for parameter estimation, and the
second one is used for assessing the model. In this way you avoid the risk that the model
is picking up phenomena that are coupled to the particular realization of the noise, rather
than properties of the system.

Figure 5.6: Example of residual tests. Top: autocorrelation test of residuals; Bottom: cross-
correlation between input and residuals. Test thresholds shown with dotted
lines.

5.4 T HE MAXIMUM LIKELIHOOD METHOD *
The least-squares, and more generally the prediction error methods, as presented, are not
based on any particular description of the uncertainties of the model. We will now discuss
a different approach to parameter estimation, which is based on a stochastic, or probabilis-
tic, description of the uncertainties.

The starting point for the discussion is to assume that a θ-dependent model is given in
terms of a probability density function (PDF) for the random variable Y , namely f Y (θ, y ).
The task is to use this parametrized model to estimate the parameters θ, given an observa-
tion y ∗ of y . Intuitively, the “likelihood to make an observation y = y ∗ ” is proportional to
L(θ) = f Y (θ, y ∗ ), and the latter is called the likelihood function. Hence, selecting the param-
eters most likely to yield the measurements y ∗ corresponds to

θ̂ = arg max_θ f_Y(θ, y*)

The latter is referred to as the Maximum-Likelihood Estimate (MLE).

The maximum-likelihood method is a very general principle and can be applied to many
different problems. Our interest is to apply it to the estimation of parameters in dynamical
models. We will, however, begin to have another look at the simple curve fitting example,
and see how the MLE can be constructed.
Example 5.9 (MLE for curve fitting). In order to apply the ML method, we need to postulate
an explicit model for the uncertainty, or the mismatch between measured data and model
predictions. It is common to assume that the model output is corrupted by additive noise,
which for the linear regression case becomes

y(i ) = θ⊤ ϕ(i ) + e(i , θ) or y = Φθ + e(θ), (5.72)

where y and e are vectors with stacked variables as in (5.9). We assume {e(i , θ)} are i.i.d.
random variables with PDF f e (x; θ). Now, the “likelihood to make an observation y = y ∗ ”
is the same as the “likelihood that e takes the value y ∗ − Φθ”. Hence, denoting the data
available (i.e. measured outputs y and regressors Φ) by Z N , the likelihood function is given
by

f(θ, Z^N) = ∏_{i=1}^{N} f_e(ε(i, θ); θ) (5.73)
where we have used the independence assumption, and the residuals are given by (5.6).
Now, maximizing the likelihood function is equivalent to maximizing the logarithm of the
function—the log likelihood function—allowing us to get a simplified expression for the
maximum likelihood estimate:

θ̂_N = arg max_θ f(θ, Z^N) = arg max_θ log f(θ, Z^N) (5.74)
    = arg max_θ (1/N) ∑_{i=1}^{N} log f_e(ε(i, θ); θ) = arg min_θ −(1/N) ∑_{i=1}^{N} log f_e(ε(i, θ); θ) (5.75)

The derived MLE can be applied to any probability distribution, but we will get further
insight by assuming that the noise has a (centred) Gaussian distribution, a common as-
sumption. This means that the PDF is given by

f_e(ε, θ) = (1/(√(2π) σ)) e^{−ε²/(2σ²)} (5.76)

implying
−log f_e(ε, θ) = log σ + (1/2) ε²/σ² + const. (5.77)
We should observe here that the right hand side depends on both θ (via ε) and σ. By in-
cluding σ among the parameters to estimate, we therefore get the following expression for
the MLE:
(θ̂, σ̂)_N = arg min_{σ,θ} ( log σ + (1/(2σ²)) · (1/N) ∑_{i=1}^{N} ε²(i, θ) ),
which finally gives the estimates (check this by finding the local minima!)

θ̂_N = arg min_θ (1/N) ∑_{i=1}^{N} ε²(i, θ) (5.78)
σ̂²_N = (1/N) ∑_{i=1}^{N} ε²(i, θ̂_N) (5.79)

The important conclusion is that the maximum-likelihood estimate for the case with inde-
pendent, identically and normally distributed noise variables is identical to the least-squares
estimate (at least for the linear regression case)! This should leave us with a certain degree
of trust for the least-squares criterion, although it was initially introduced in a fairly ad-hoc
manner. □
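The closed-form estimates (5.78) and (5.79) can be checked on synthetic data (regressors, parameters and noise level are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Linear regression data y = Phi @ theta0 + Gaussian noise.
N = 5000
Phi = np.column_stack([rng.standard_normal(N), np.ones(N)])
theta0 = np.array([1.5, -0.5])
sigma0 = 0.2
y = Phi @ theta0 + sigma0 * rng.standard_normal(N)

# Under i.i.d. Gaussian noise the MLE is the least-squares estimate, (5.78)...
theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
# ...and the noise variance is estimated from the residuals, (5.79).
sigma2_hat = np.mean((y - Phi @ theta_hat) ** 2)

print(theta_hat, sigma2_hat)  # close to theta0 and sigma0**2 = 0.04
```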

R EMARK . In the derivation above, we assumed that the noise sequence consists of i.i.d.
random variables. If we instead assume e has a multi-variate normal distribution, denoted

e ∼ N (0, Σ) , (5.80)

then the probability distribution reads as:

f_e(ε, θ) = det(2πΣ)^{−1/2} e^{−(1/2) ε^⊤ Σ^{−1} ε}, (5.81)

which, assuming that Σ is known, implies

−log f_e(ε, θ) = const + (1/2) ε^⊤ Σ^{−1} ε. (5.82)
We see that the ML method in this case corresponds to minimizing a weighted LS criterion
as in (5.13), and that the weight matrix is given by the inverse of the covariance matrix, i.e.

W = Σ−1 . For the special case with independent noise variables with variances {λi }, i.e.
Σ = diag (λi ), the resulting estimate becomes

θ̂_N = arg min_θ ∑_{i=1}^{N} ε²(i, θ)/λ_i, (5.83)

so that each of the data-points used for the estimation is used with a weight that reflects
the “credibility” of the data-point, as reflected by its measurement noise variance. □
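A sketch of (5.83) for a scalar parameter, with the per-sample variances λ_i assumed known (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(3)

# Heteroscedastic data: each sample has its own known noise variance lambda_i.
N = 400
phi = rng.standard_normal(N)
lam = rng.uniform(0.01, 1.0, N)
theta0 = 2.0
y = theta0 * phi + np.sqrt(lam) * rng.standard_normal(N)

# Weighted LS (5.83): each squared residual is weighted by 1/lambda_i,
# so the noisiest samples get the least say in the estimate.
w = 1.0 / lam
theta_wls = np.sum(w * phi * y) / np.sum(w * phi ** 2)
theta_ols = np.sum(phi * y) / np.sum(phi ** 2)   # unweighted, for comparison
print(theta_wls, theta_ols)
```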
Let us conclude this section with an example, illustrating that the maximum-likelihood
method can be applied to other PDFs than the Gaussian.

Example 5.10 (ML for uniformly distributed noise). Consider a single scalar measurement,
y = y ∗ ∈ R, used to estimate a single scalar parameter θ ∈ R, based on the underlying model

y = θ · u + e, (5.84)

where u ∈ R is a given (single) input. Estimating the parameter θ in this case seems fairly
obvious, as one ought to compute θ̂ = y/u, but let us see where the ML method takes us.
With e assumed to be uniformly distributed in the interval [−1, 1], the PDF is

f_e(x) = { 1/2, x ∈ [−1, 1] ; 0, otherwise }, (5.85)

which implies that the likelihood function becomes

L(θ) = f_e(ε(θ); θ) = { 1/2, ε ∈ [−1, 1] ; 0, otherwise } = { 1/2, −1 ≤ y − θ·u ≤ 1 ; 0, otherwise } (5.86)

If we rule out the degenerate case u = 0, the MLE is thus given by a set of θ:

θ ∈ [ (y − 1)/u , (y + 1)/u ], (5.87)

for which the likelihood function takes the maximum value 0.5. Although our initial guess
for θ̂ is within this set, there is no reason it should be given any particular preference, considering the assumptions made. □

P ROPERTIES OF THE ML ESTIMATE . A classical result in statistical estimation theory says
the following:

• The MLE is consistent and asymptotically efficient for independent observations.

The concept of efficiency expresses that the estimator has the least possible covariance of
the estimate; the details follow below.

Let θ̂(y ∗ ) be any unbiased estimator of θ = θ0 . Then the covariance of the estimate is
bounded from below by the Cramér-Rao inequality:

Cov θ̂(y ∗ ) ≥ M −1

where the Fisher information matrix M is defined as

M = E [ (d/dθ) log f_Y(θ, y*) ] [ (d/dθ) log f_Y(θ, y*) ]^⊤ |_{θ=θ0}
  = −E [ (d²/dθ²) log f_Y(θ, y*) ] |_{θ=θ0}
The meaning of this result is that there is a strict bound for how good an (unbiased) esti-
mator can be made, in the sense of its covariance. An estimate is said to be (statistically)
efficient if the covariance of the estimate attains the lower bound in the Cramér-Rao in-
equality. The result above says that the MLE is such an estimate, if it is unbiased. Note,
however, that even if this is often the case, it is not universally true for the MLE.

6 D IFFERENTIAL A LGEBRAIC E QUATIONS (DAE S )
We have seen so far formally ordinary differential equations (ODEs), which describe the
time evolution of a vector of variables via relationships of the form:

ẋ(t ) = f (x(t ), u(t ), t ) (6.1)

where x ∈ Rnx is a vector of differential states and u ∈ Rnu a vector of “external variables",
often referred to as inputs (any variable that is not defined by (6.1), but rather “externally"
can be classified as an “input"). For the sake of simplicity, the time dependence (t ) of these
variables is usually omitted.

In this chapter, we will approach a new form of differential equations, labelled Differential-
Algebraic Equations (DAEs), which are widely used in the modelling of complex and large-
scale systems. DAEs have features and properties that are unlike ODEs. We will investigate
them here.

6.1 W HAT ARE DAE S ?

Simply put, DAEs are a set of equations that do not directly define the entire state. This hap-
pens e.g. if we have a mix of differential equations and purely algebraic equations (equa-
tions where the time-differentiated variables do not appear) or if the equations mix differ-
ential states, i.e. variables that appear as time-differentiated in the equations, and algebraic
states, i.e. variables (excluding inputs and parameters) whose time derivative does not ap-
pear at all. Before introducing the formal definition of a DAE, let us take a look at some
simple examples.

Example 6.1.

• Consider first the differential equation

ẋ1 = 2x1 + x2 (6.2a)

0 = 3x1 − x2 (6.2b)

Here we clearly observe that the state variable x2 does not appear in its
time-differentiated form ẋ2 , such that the equations do not define ẋ2 , but rather x2 .
Put differently, x2 is not a differential state but an algebraic state. Hence (6.2) is a
DAE.

• Now consider the differential equation

ẋ1 + 2x1 + ẋ2 + x2 = 0 (6.3a)

2ẋ1 + x1 + 2ẋ2 + 2x2 = 0 (6.3b)

Here both variables ẋ1 and ẋ2 appear time-differentiated. However, one can also ob-
serve that replacing (6.3b) by −2 · (6.3a) + (6.3b) yields:

ẋ1 + 2x1 + ẋ2 + x2 = 0 (6.4a)

−3x1 = 0 (6.4b)

Here we observe that (6.4) does not define the differential states ẋ1 and ẋ2 (in the
sense that one cannot solve (6.4) to obtain ẋ1 , ẋ2 ). However, (6.4) still provides a well-
defined trajectory for x1 and x2 . Indeed, (6.4b) specifies that x1 = 0, such that (6.4a)
reads as the simple ODE:

ẋ2 + x2 = 0 (6.5)

which can be solved for x2 .

Clearly, in order for the notion of DAE to be rigorous, we need a more formal definition.

Definition 1. Consider a differential equation having the state vector x ∈ Rnx , and defined
by the equation

F ( ẋ, x, u, t ) = 0 (6.6)

The differential equation (6.6) is a DAE if

det( ∂F/∂ẋ ) = 0 (6.7)

i.e. if the Jacobian ∂F/∂ẋ is rank-deficient.

To clarify this definition, let us test it on (6.2) and (6.3).

Example 6.1 cont’d.

• Differential equation (6.2) can be written as

F(ẋ, x) = [ ẋ1 − 2x1 − x2 ; 3x1 − x2 ] = 0 (6.8a)

such that
∂F/∂ẋ = [ 1 0 ; 0 0 ] (6.9)

is rank-deficient because the second column (row) of the Jacobian of F is zero. One
can generalize this observation: algebraic states yield columns of zeros in the Jacobian ∂F/∂ẋ and make it rank-deficient.

• Differential equation (6.3) can be written as
F(ẋ, x) = [ ẋ1 + 2x1 + ẋ2 + x2 ; 2ẋ1 + x1 + 2ẋ2 + 2x2 ] = 0 (6.10)

such that
∂F/∂ẋ = [ 1 1 ; 2 2 ] (6.11)

is rank-deficient because its determinant is zero. Here the zero determinant captures the fact that (6.3) does not define the differential state properly.
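For the linear equations (6.2) and (6.3), the Jacobian ∂F/∂ẋ is a constant matrix, so the test of Definition 1 reduces to a numeric rank check; a minimal sketch:

```python
import numpy as np

# Jacobians dF/dxdot for Example 6.1 (constant matrices, since F is linear):
J1 = np.array([[1.0, 0.0],    # (6.2): the column for xdot2 is zero
               [0.0, 0.0]])
J2 = np.array([[1.0, 1.0],    # (6.3): the two rows are linearly dependent
               [2.0, 2.0]])

for name, J in [("(6.2)", J1), ("(6.3)", J2)]:
    kind = "DAE" if np.linalg.matrix_rank(J) < J.shape[0] else "ODE"
    print(name, kind)  # both print "DAE"
```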

In practice, DAEs most often arise because some of the states are purely algebraic (i.e. they do not appear time-differentiated), as in (6.2). In order to stress the difference between differential and algebraic states, it is common to use the notation x for the differential states, and z for the algebraic states. E.g. (6.2) would be written as:

ẋ = 2x + z (6.12a)
0 = 3x − z (6.12b)

Definition 1 allows one to assess whether a differential equation is a DAE or an ODE in
ambiguous cases. However, one can trivially write differential equations that cannot be
categorized so simply. Let us consider two simple examples.

Example 6.2. Let us investigate two “freak" differential equations that can switch between
being ODEs and DAEs

• Consider the simple scalar differential equation

u ẋ + x = 0 (6.13)

here we observe that F = u ẋ + x and

∂F/∂ẋ = u (6.14)
Hence for u ≠ 0, (6.13) is an ODE and reads as:

ẋ = −x/u (6.15)
while for u = 0, (6.13) is a “DAE" (in fact it is purely algebraic) and reads as:

x =0 (6.16)

Hence (6.13) can switch between being an ODE and a DAE depending on the input
u.

• Consider the differential equation

ẋ1 + x1 − u = 0 (6.17a)
(x1 − x2 ) ẋ2 + x1 − x2 = 0 (6.17b)

One can verify that

det( ∂F/∂ẋ ) = x1 − x2 (6.18)
Hence for initial conditions x(0) satisfying x1 (0) = x2 (0), (6.17) starts as a DAE and
obeys the dynamics:

ẋ1 + x1 − u = 0 (6.19a)
x1 − x2 = 0 (6.19b)

hence it enforces x1 (t ) = x2 (t ) for all time, and (6.17) remains a DAE. However, if the
initial conditions satisfy x1 (0) 6= x2 (0), then (6.17) starts as an ODE and remains an
ODE throughout its trajectories.

These examples are meant to draw the attention to the fact that the notion of DAE can
be fairly convoluted. In most practical cases, however, DAEs arise because the differential
equations hold variables that do not appear time-differentiated. In this context, the prob-
lem of DAEs having exotic behaviors as in the examples above does not arise.

6.2 D IFFERENT FORMS OF DAE S

DAEs come in different forms, which are useful to recognize, as these different forms can
have different intrinsic properties. We will briefly look at these forms here.
• Fully-implicit DAEs are in the form (6.6) of definition 1, i.e. they read as

F (ẋ, x, u) = 0 (6.20)

where
det( ∂F/∂ẋ ) = 0 (6.21)
If the rank-deficiency arises because some of the states x do not appear as time-differentiated in F (hence creating columns of zeros in the Jacobian ∂F/∂ẋ), then they are commonly labelled as “z" states, and (6.20) is rewritten as:

F (ẋ, x, z , u) = 0 (6.22)

Note that here condition (6.21) becomes

det( [ ∂F/∂ẋ  ∂F/∂ż ] ) = det( [ ∂F/∂ẋ  0 ] ) = 0 (6.23)

hence (6.22) is by construction a DAE as ż is not an argument in F .

• Semi-explicit DAEs split explicitly the differential and algebraic equations, and can
generally be written in the form:

ẋ = f (x, z , u) (6.24a)
0 = g (x, z , u) (6.24b)

A semi-explicit DAE can be trivially written as a fully-implicit DAE by defining:


· ¸
ẋ − f (x, z , u)
F (ẋ, x, z , u) = =0 (6.25)
g (x, z , u)

Conversely, a fully-implicit DAE can be trivially written as a semi-explicit DAE by introducing some “helper variables" (labelled v here), i.e.

ẋ = v (6.26a)
0 = F (v , x, z , u) (6.26b)

We observe that (6.24) is a DAE by construction. Indeed, one can apply definition 1
on (6.25) (using the same construction as in (6.23)) and observe that:

det( [ ∂F/∂ẋ  ∂F/∂ż ] ) = det( [ I 0 ; 0 0 ] ) = 0 (6.27)

• Linear DAEs are DAEs where the underlying functions are linear. They are often put
in the form:

E ẋ = Ax + Bu (6.28)

where definition 1 requires that E is rank-deficient in order for (6.28) to be a DAE. In a semi-explicit form, and underlining the distinction between algebraic and differential
states, a linear DAE would read as:

ẋ = Ax + Bu +C z (6.29a)
0 = Dx + E u + F z (6.29b)

For matrix F full rank, we observe that the algebraic variables z can be eliminated
using (6.29b) and replaced into (6.29a) such that the DAE can be reduced to an ODE
with the addition of some “output variables" z :
ẋ = (A − C F⁻¹ D) x + (B − C F⁻¹ E) u (6.30a)
z = −F⁻¹ (D x + E u) (6.30b)

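The elimination (6.30) is a few lines of linear algebra; the matrices below are invented so that F is invertible:

```python
import numpy as np

# A small linear semi-explicit DAE (6.29), with invertible F (illustrative numbers).
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
C = np.array([[0.5], [0.0]])
D = np.array([[1.0, 0.0]])
E = np.array([[0.0]])
F = np.array([[2.0]])

Finv = np.linalg.inv(F)
A_red = A - C @ Finv @ D      # reduced ODE, (6.30a)
B_red = B - C @ Finv @ E

# Check at one point: the reduced ODE reproduces the DAE right-hand side,
# with z recovered from (6.30b).
x = np.array([[1.0], [0.0]])
u = np.array([[0.0]])
z = -Finv @ (D @ x + E @ u)
xdot_dae = A @ x + B @ u + C @ z
xdot_ode = A_red @ x + B_red @ u
print(np.allclose(xdot_dae, xdot_ode))  # → True
```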
Similarly to ODEs, nonlinear DAEs can be linearized into linear DAEs. E.g. a fully-
implicit DAE of the form (6.22) can be linearized to

(∂F/∂ẋ) ∆ẋ + (∂F/∂x) ∆x + (∂F/∂z) ∆z + (∂F/∂u) ∆u = 0 (6.31)

and a semi-explicit DAE of the form (6.24) can be linearized to

∆ẋ = (∂f/∂x) ∆x + (∂f/∂u) ∆u + (∂f/∂z) ∆z (6.32a)
0 = (∂g/∂x) ∆x + (∂g/∂u) ∆u + (∂g/∂z) ∆z (6.32b)
and therefore takes the form (6.29). It is interesting to note here that the elimination of the algebraic variables ∆z in (6.32) is possible if the Jacobian ∂g/∂z is full rank throughout the trajectory. We will get back to this notion soon.

6.3 D IFFERENTIAL I NDEX OF DAE S

DAEs are unlike ODEs in various ways. One very important distinction is that DAEs hold
the concept of differential index, which is crucial when it comes to solving them numer-
ically. While the notion of differential index is not straightforward, the problem of high-
index DAE, which is the main focus of this section, can be construed fairly intuitively.

One ought to understand that in order for a DAE e.g. in the semi-explicit form:

ẋ = f (x, z , u) (6.33a)
0 = g (x, z , u) (6.33b)

to be “solvable", one needs to be able to compute the differential state derivative ẋ and the algebraic states z for a given differential state x and input u. Indeed, if ẋ, z can be readily obtained (possibly numerically) from (6.33) for any given x, u, then one can propagate the system and build the trajectories corresponding to (6.33). In order to build this intuition, let us consider the following examples.

Example 6.3.

• Let us start with a simple linear DAE:

F = [ ẋ1 − z ; ẋ2 − x1 ; ẋ1 + x2 − u ] = 0 (6.34)

One can verify that, using simple algebraic manipulations, (6.34) can be rewritten as

z = u − x2 (6.35a)
ẋ2 = x1 (6.35b)
ẋ1 = u − x2 (6.35c)

i.e. (6.34) delivers ẋ and z as a function of x and u.

• Consider a linear DAE similar to but slightly different from (6.34)

F = [ ẋ1 − z ; ẋ2 − x1 ; x2 − u ] = 0 (6.36)

One can verify that no algebraic manipulation can deliver ẋ and z from (6.36). In-
deed, (6.36) also reads as:

[ 1 0 −1 ; 0 1 0 ; 0 0 0 ] [ ẋ1 ; ẋ2 ; z ] = [ 0 ; x1 ; x2 − u ], (6.37)

where the matrix on the left is labelled M.

and since matrix M is rank deficient (last row is zero), one cannot manipulate the
equations defined by (6.36) in order to extract ẋ and z. In contrast, we observe that
(6.34) can be rewritten as

[ 1 0 −1 ; 0 1 0 ; 1 0 0 ] [ ẋ1 ; ẋ2 ; z ] = [ 0 ; x1 ; u − x2 ] (6.38)

and its matrix is full rank, allowing us to build (6.35). We ought to observe that the
matrix M is obtained by taking the Jacobian

[ ∂F/∂ẋ  ∂F/∂z ] (6.39)

which needs to be full rank in order for the DAE to be solvable for ẋ and z.

The principles deployed in the examples above can be generalized to nonlinear DAEs, as is
stated next.
Theorem 9. a fully implicit DAE

F (ẋ, z , x, u) = 0 (6.40)

with function F smooth can be readily solved (i.e. solved for ẋ, z ) if the Jacobian
[ ∂F/∂ẋ  ∂F/∂z ] (6.41)

is full rank on all trajectories ẋ, z , x, u.
Proof. This theorem follows directly from the IFT (see Theorem 6). Indeed, for (6.41) full
rank, there exist functions ẋ (x, u) and z (x, u) such that

F (ẋ (x, u), z (x, u) , x, u) = 0 (6.42)

holds locally for any x, u. It follows that (6.40) can be solved. Another way of construing this
result is that if the Jacobian (6.41) is full rank, then a Newton iteration deployed on (6.40) in
order to compute ẋ, z converges locally (because the linear system (4.8) is well-posed).

143
One can readily apply Theorem 9 to semi-explicit DAEs.

Corollary 1. a semi-explicit DAE

ẋ = f (x, z , u) (6.43a)
0 = g (x, z , u) (6.43b)

with function g smooth can be readily solved (i.e. solved for ẋ, z ) if the Jacobian

∂g/∂z (6.44)
is full rank on all trajectories z , x, u.

Proof. Recall that a semi-explicit DAE can be transformed into a full implicit one via the
transformation (6.25) recalled here:
F(ẋ, x, z, u) = [ ẋ − f(x, z, u) ; g(x, z, u) ] = 0 (6.45)

A direct application of Theorem 9 then yields the Jacobian:

[ ∂F/∂ẋ  ∂F/∂z ] = [ I  −∂f/∂z ; 0  ∂g/∂z ] (6.46)

which is full rank if ∂g/∂z is full rank.

An intuitive way of construing Corollary 1 is by observing that if the Jacobian (6.44) is full
rank, then the algebraic equation (6.43b) can be solved for z at any point x, u (see IFT, The-
orem 6). The solution z can then be injected in (6.43a) to obtain ẋ.
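This recipe (solve the algebraic equation for z, then inject z into f) is exactly how a simple fixed-step integrator can be wired up for an index-1 semi-explicit DAE. A sketch on the DAE (6.12), where g happens to be solvable in closed form:

```python
import numpy as np

# Semi-explicit index-1 DAE (6.12): xdot = 2x + z, 0 = 3x - z.
# dg/dz = -1 is full rank, so z can be solved at every step.
def solve_dae(x0, h, steps):
    x = x0
    for _ in range(steps):
        z = 3.0 * x                  # solve g(x, z) = 0 for z
        x = x + h * (2.0 * x + z)    # explicit Euler step on the differential part
    return x

T, steps = 0.1, 1000
x_num = solve_dae(1.0, T / steps, steps)
x_exact = float(np.exp(5.0 * T))     # eliminating z gives the ODE xdot = 5x
print(abs(x_num - x_exact))          # small discretization error
```

For a general g, the closed-form solve would be replaced by a Newton iteration at each step, which is why full rankness of ∂g/∂z matters numerically.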

Theorem 9 and its corollary 1 tell us that there are DAEs that are “easy" to solve (those satis-
fying the full rankness of the Jacobians (6.41) and (6.44)) and DAEs that are “hard" to solve
(those where the Jacobians (6.41) and (6.44) are rank deficient). We will see next that Theo-
rem 9 and its corollary 1 have to do with the differential index of DAEs.
Definition 2. the differential index of a DAE is the number of times the operator d/dt must
be applied to the equations (+ possibly an arbitrary amount of algebraic manipulations) in
order to convert the DAE into an ODE.

Definition 2 is non-trivial and can require some care when being applied. Let us make a
few examples in order to gather some intuitions on how it works.

Example 6.4.

• Consider the linear DAE (6.35) repeated here:

z = u − x2 (6.47a)
ẋ2 = x1 (6.47b)
ẋ1 = u − x2 (6.47c)

Recall that (6.35) is an “easy" DAE since it satisfies the assumption of Theorem 9. We
also observe that a single application of d/dt on (6.47a) yields:

ż = u̇ − ẋ2 (6.48a)
ẋ2 = x1 (6.48b)
ẋ1 = u − x2 (6.48c)

or equivalently

ż = u̇ − x1 (6.49a)
ẋ2 = x1 (6.49b)
ẋ1 = u − x2 (6.49c)

The latter differential equation is an ODE, hence (6.35) is of index 1. It is interesting to
note here that by performing the transformation D AE → ODE (from (6.35) to (6.49)),
we have restricted the functional space for u. Indeed, u only needs to be integrable
in (6.35), while it needs to be differentiable in (6.49).

• Consider the linear DAE (6.36), repeated here

ẋ1 − z = 0 (6.50a)
ẋ2 − x1 = 0 (6.50b)
x2 − u = 0 (6.50c)

Recall that (6.36) is a “hard" DAE since it fails the assumption of Theorem 9. A time
differentiation of (6.36) (last row) yields

ẋ1 − z = 0 (6.51a)
ẋ2 − x1 = 0 (6.51b)
ẋ2 − u̇ = 0 (6.51c)

An algebraic manipulation yields

ẋ1 − z = 0 (6.52a)
u̇ − x1 = 0 (6.52b)
ẋ2 − u̇ = 0 (6.52c)

A second time differentiation applied on (6.52b) yields

ẋ1 − z = 0 (6.53a)
ü − ẋ1 = 0 (6.53b)
ẋ2 − u̇ = 0 (6.53c)

and an algebraic manipulation yields:

ü − z = 0 (6.54a)
ü − ẋ1 = 0 (6.54b)
ẋ2 − u̇ = 0 (6.54c)

A third time differentiation applied to (6.54a) yields the ODE

ż = u⃛ (6.55a)
ẋ1 = ü (6.55b)
ẋ2 = u̇ (6.55c)

A total count of 3 time differentiations was used to transform the DAE (6.36) to the
ODE (6.55). It follows that DAE (6.36) is of index 3.

Note that the same principles can be applied to nonlinear DAEs (see e.g. Section 6.4 below).

Let us now make the connection between the concept of differential index and Theorem 9
and its corollary 1.
Theorem 10. a fully-implicit DAE

F (ẋ, z , x, u) = 0 (6.56)

fulfils the assumption of Theorem 9 if and only if it is of index 1.
Proof. for an index-1 fully-implicit DAE, a single time differentiation allows a transforma-
tion to an ODE, i.e.
(d/dt) F = (∂F/∂ẋ) ẍ + (∂F/∂x) ẋ + (∂F/∂z) ż + (∂F/∂u) u̇ = 0 (6.57)
is an ODE. For the sake of clarity, let us introduce the state extension v = ẋ, such that (6.57)
reads as:
(∂F/∂ẋ) v̇ + (∂F/∂x) v + (∂F/∂z) ż + (∂F/∂u) u̇ = 0 (6.58)
or equivalently (6.57) reads as
[ v̇ ; ż ] = − [ ∂F/∂ẋ  ∂F/∂z ]⁻¹ ( (∂F/∂x) v + (∂F/∂u) u̇ ) (6.59)

and the Jacobian matrix [ ∂F/∂ẋ  ∂F/∂z ] must be full rank.

The same result extends to semi-explicit DAEs, i.e.

Corollary 2. a semi-explicit DAE

ẋ = f (x, z , u) (6.60a)
0 = g (x, z , u) (6.60b)

fulfils the assumption of Theorem 9 if and only if it is of index 1.

Proof. if (6.60) is of index 1, then a single time differentiation on the algebraic equation
yields an ODE, i.e.

ẋ = f(x, z, u) (6.61a)
0 = (d/dt) g(x, z, u) (6.61b)
is an ODE, i.e. equivalently, using a chain rule

ẋ = f(x, z, u) (6.62a)
0 = (∂g/∂x) ẋ + (∂g/∂z) ż + (∂g/∂u) u̇ (6.62b)
is an ODE. It follows that

ẋ = f(x, z, u) (6.63a)
ż = − (∂g/∂z)⁻¹ ( (∂g/∂x) ẋ + (∂g/∂u) u̇ ) (6.63b)

is an ODE, which requires the Jacobian ∂g/∂z to be full rank.

The message to take home from Theorem 10 and corollary 2 is that index-1 DAEs are “easy"
to solve in the sense that the equations readily deliver ẋ and z . DAEs of index more than 1
are generally referred to as “high-index" DAEs, and are notoriously more difficult to han-
dle. When approaching DAEs numerically, index-1 DAEs are clearly preferred. This does
not mean than high-index DAEs cannot be treated, but they are often best treated via a
so-called index-reduction procedure, which we will introduce in Section 6.4 and further
discuss in Section 6.5.

6.4 C ONNECTION TO L AGRANGE MECHANICS


An attentive reader will have noticed that we have already approached DAEs in Section
3.5, when discussing constrained Lagrange mechanics. Indeed equation (3.161) stemming
from constrained Lagrange mechanics:

(d/dt) ∂L/∂q̇ − ∂L/∂q = Q (6.64a)
c(q) = 0 (6.64b)
is in fact a semi-explicit DAE, where (6.64b) is the algebraic equation and (6.64a) yields (af-
ter minor treatments) an explicit ODE.

We have seen that (6.64) does not readily deliver q̈, z , and that it ought to be modified
by time-differentiating the constraint equation (6.64b) twice, delivering the model (3.166)
recalled here:
(d/dt) ∂L/∂q̇ − ∂L/∂q = Q (6.65a)
c̈(q, q̇, q̈) = 0 (6.65b)

It is interesting here to verify the differential index of (6.64) by applying definition 2. The
differential state for a mechanical system reads as
x = [ q ; q̇ ] (6.66)
The transformed Lagrange equation (6.65) can be written explicitly as:

[ I 0 0 ; 0 W(q) 0 ; 0 0 M(q) ] [ ẋ ; z ] = [ q̇ ; Q − (∂/∂q)(W(q) q̇) q̇ + ∇_q L ; −(∂/∂q)((∂c/∂q) q̇) q̇ ] (6.67)

One can observe that (6.67) readily delivers ẋ, z for W(q) and M(q) full rank. It follows that
(6.67) fulfils the assumption of Theorem 9, and is therefore of index 1. Since (6.67) has been
obtained after two time differentiations of (6.64), it follows that (6.64) is an index-3 DAE.
We can summarize this observation as:
(6.64b) [index 3] —d²/dt²—→ (6.65b) [index 1] —d/dt—→ ODE

This observation is generic, i.e. models arising from constrained Lagrange mechanics with position-dependent constraints of the form c(q) = 0 yield index-3 DAEs, and two time differentiations of the constraints c(q) = 0 yield an index-1 DAE.

The important message to take home here is that a high-index DAE such as (6.64) can be
transformed into an index-1 DAE (i.e. (6.65)) via time differentiations and algebraic ma-
nipulations. The trick we have explored to transform the DAEs arising from constrained
Lagrange mechanics into low-index DAEs is not limited to this special case, but is a generic
approach to transform “hard" (i.e. high-index) DAEs into “easy" ones (i.e. low-index DAEs),
commonly referred to as an index-reduction procedure. We further detail this in the next
section.

6.5 I NDEX REDUCTION
The index reduction of DAEs is identical to assessing their index as per definition 2, with the
minor difference that the procedure ought to be stopped one step before reaching an ODE
(i.e. at the index-1 step). More specifically, one ought to perform time differentiations and
algebraic manipulations until an index-1 DAE emerges. It is difficult to automate this pro-
cedure in general (although some software packages offer index-reduction capabilities), as
the algebraic manipulations to be performed typically require some insights into the DAE.
However, we can attempt a detailed description of the procedure.

Let us consider a semi-explicit DAE of the form

ẋ = f (x, z , u) (6.68a)
0 = g (x, z , u) (6.68b)

The index-reduction procedure can be described as follows.

1. Check if the DAE system is of index 1. If yes, stop.

2. Identify a subset of algebraic equations in (6.68b) that can be solved for a subset of
algebraic variables z .
3. Apply d/dt on the remaining algebraic equations containing some differential states x j ; this leads to terms ẋ j appearing in these differentiated equations.

4. Substitute the terms ẋ j by the corresponding expressions f j (x, z , u), this delivers new
algebraic equations to replace those differentiated in (6.68b).

5. With this new DAE system, go to step 1.

This procedure is not always straightforward to deploy. Let us consider a couple of exam-
ples to understand it better.

Example 6.5.

• Consider first the linear semi-explicit DAE


ẋ = [ x2 + z2 ; z1 + u ] ≡ f    (6.69a)
0 = [ x1 − x2 ; z2 − x1 ] ≡ g    (6.69b)

We observe that
∂g/∂z = [ 0 0 ; 0 1 ]    (6.70)

is rank deficient, such that (6.69) is of index > 1 (see Corollary 2). We observe that the
second equation of (6.69b) delivers z 2 as:

z2 = x1 (6.71)
We then apply d/dt on the first equation in (6.69b) to obtain:

ẋ1 − ẋ2 = 0 (6.72)

We use (6.69a) to replace ẋ1,2 by their corresponding expressions, delivering a new


algebraic equation:

x2 + z2 − z1 − u = 0 (6.73)

We obtain the new DAE where the differentiated algebraic equation has been re-
placed by (6.73):
ẋ = [ x2 + z2 ; z1 + u ]    (6.74a)
0 = [ x2 + z2 − z1 − u ; z2 − x1 ] := g̃    (6.74b)

We observe that
∂g̃/∂z = [ −1 1 ; 0 1 ]    (6.75)

is full rank, such that (6.74) is of index 1. This concludes the procedure. We can
additionally deduce from the index reduction that (6.69) is of index 2.

• Consider the semi-explicit DAE given by:

ẋ = Ax − bz (6.76a)
0 = (1/2) (x⊤x − 1)    (6.76b)

for some matrix A and vector b, and where z ∈ R. Here the algebraic part is:

g(x) = (1/2) (x⊤x − 1)    (6.77)

such that ∂g/∂z = 0 is rank deficient by construction. We then take the time derivative
of g (x):
g̃(x, z) = (d/dt) g(x) = x⊤ẋ = x⊤(Ax − bz) = 0    (6.78)
We observe that
∂g̃(x, z)/∂z = −x⊤b    (6.79)

is full rank if x⊤b ≠ 0. Note that z is here readily delivered by g̃(x, z) = 0. Indeed:

z = (x⊤Ax) / (x⊤b)    (6.80)
The index-reduced DAE then reads as:

ẋ = Ax − bz (6.81a)
0 = x ⊤ (Ax − bz) (6.81b)
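The steps carried out in the first example above can be checked numerically, since the DAE (6.69) is linear. The sketch below is illustrative (the matrix names F, G, A, B are our own choices, not notation from these notes):

```python
# Index reduction of the linear DAE (6.69), written as
#   xdot = F x + G z + e u,   0 = A x + B z  (+ input terms).
import numpy as np

F = np.array([[0., 1.], [0., 0.]])    # x-part of f in (6.69a)
G = np.array([[0., 1.], [1., 0.]])    # z-part of f
A = np.array([[1., -1.], [-1., 0.]])  # x-part of g in (6.69b)
B = np.array([[0., 0.], [0., 1.]])    # z-part of g, i.e. dg/dz

# dg/dz is rank deficient, so by Corollary 2 the index is > 1:
assert np.linalg.matrix_rank(B) < 2

# Differentiate the first algebraic row in time and substitute xdot:
#   A[0] @ xdot = A[0] @ (F x + G z + e u) = 0, cf. (6.72)-(6.73).
A_new, B_new = A.copy(), B.copy()
A_new[0] = A[0] @ F   # new x-coefficients of the algebraic row
B_new[0] = A[0] @ G   # new z-coefficients: first row of the new dg/dz

# The new algebraic Jacobian matches (6.75) and is full rank: index 1.
print(B_new, np.linalg.matrix_rank(B_new))
```

The input term A[0] · e (with e the u-coefficient of f) produces the −u in (6.73); it does not affect the Jacobian with respect to z.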

Similarly to Lagrange mechanics, the reduction of the index of a DAE requires one to collect
a set of consistency conditions that need to be enforced in order for the resulting index-1
DAE to match the original one. Generally speaking, when performing the index reduction,
one ought to collect all the algebraic equations on which a time differentiation is per-
formed, and add them to the list of consistency conditions. E.g. in the example above, the
first equation in (6.69b) is time-differentiated and is therefore the (only) consistency con-
dition.

Similarly to the Lagrange context, consistency conditions can drift numerically over long simulations, and ought to be corrected if needed. Describing the introduction of a Baumgarte stabilization for generic index-reduced DAEs in a general form can be difficult, and we will leave this question out of these notes. In simple cases (such as e.g. the Lagrange context), the principle is fairly simple.

7 EXPLICIT INTEGRATION METHODS - RUNGE-KUTTA
For the remainder of these notes we will discuss the numerical treatment of differential
equations. Simulating a system in e.g. the state-space form:

ẋ = f (x, u) , x(0) = x0 (7.1)

over a time interval [0, T ] consists essentially in computing a discrete sequence of state
vectors

x0 , . . . , x N (7.2)

that approximate the true and continuous trajectories x(t ), t ∈ [0, T ] solution of (7.1) on a
given time grid t0 , . . . , t N sufficiently accurately that they are useful for whichever goal we
have to tackle, i.e. we want e.g. the approximation:

‖xk − x(tk)‖ ≤ ε, k = 0, . . . , N    (7.3)

to hold. For most model equations f , the sequence (7.2) can only be built numerically in
a computer. For the remainder of the course, we will focus on understanding some (non-
exhaustive) modern methods to build a sequence (7.2) of simulated states that are “reason-
ably close" to the true model trajectories.

7.1 EXPLICIT EULER


In order to get started on integration methods, it is natural to start with the most intu-
itive, yet least efficient integration method, namely the explicit Euler method. Explicit Euler
ought to be avoided whenever efficiency⁸ matters. The explicit Euler method is, however,
deceivingly simple to implement and therefore sometimes a good choice when efficiency
ought to be sacrificed to minimize the coding effort.

The explicit Euler method uses the rule:

xk+1 = xk + ∆t · f(xk, uk), k = 0, . . . , N − 1    (7.4)

where u k = u(tk ), starting from the given initial conditions x0 . The Euler step is illustrated
in Figure 7.1, and is essentially linearly extrapolating from the model dynamics f (xk , u k ) in
order to build the next discrete state xk+1 from the current one xk .

8
i.e. the amount of computations needed to reach a given accuracy


Figure 7.1: Illustration of the principle underlying the explicit Euler scheme (7.4)

It ought to be intuitively clear here that the larger the time step ∆t is chosen, the larger the
discrepancy between the true model trajectories x(t ) and the simulated ones x0 , . . . , x N are.
Indeed, Euler is essentially ignoring the “curvature" of the model trajectories between the
discrete states and replacing it with a straight line. The longer the step, the “more curva-
ture" is ignored.
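The rule (7.4) takes only a few lines to implement; the sketch below is an illustrative implementation (the function name and the test dynamics are our own choices):

```python
# Illustrative implementation of the explicit Euler rule (7.4).
import numpy as np

def explicit_euler(f, x0, u, t0, dt, N):
    """Integrate xdot = f(x, u) with N Euler steps of size dt from x0."""
    xs = [np.asarray(x0, dtype=float)]
    for k in range(N):
        tk = t0 + k * dt
        xs.append(xs[-1] + dt * f(xs[-1], u(tk)))  # xk+1 = xk + dt*f(xk, uk)
    return np.array(xs)

# Example: xdot = -x (input unused); exact solution x(t) = x0*exp(-t).
traj = explicit_euler(lambda x, u: -x, [1.0], lambda t: None, 0.0, 0.01, 100)
err = abs(traj[-1, 0] - np.exp(-1.0))
print(err)  # shrinks roughly proportionally to dt (order 1)
```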

7.1.1 ACCURACY OF THE EXPLICIT E ULER METHOD

It is useful to formally analyse the discrepancy between the simulated states and the true trajectories, and relate it to the choice and parameters (e.g. ∆t) of the integration method. In order to do that, it is very useful to compute the one-step error, i.e. assuming that xk = x(tk) (i.e. the simulation is exact at time tk), what is the error at time tk+1, i.e. what is ‖xk+1 − x(tk+1)‖? This can be done via Taylor arguments. Indeed, we observe that:

x(tk+1) = x(tk) + ∆t · ẋ(tk) + (∆t²/2) ẍ(ξ)    (7.5)

(the first two terms on the right-hand side being exactly xk+1)

for some ξ ∈ [tk , tk+1 ] (this is the Taylor expansion with an explicit remainder). It follows
that:

‖xk+1 − x(tk+1)‖ = (∆t²/2) ‖ẍ(ξ)‖ ≤ (∆t²/2) max_{ξ∈[tk ,tk+1 ]} ‖ẍ(ξ)‖    (7.6)

Two remarks can be drawn from this expression:

• The one-step error is of the order ∆t²

• The one-step error is worse when ẍ is large, i.e. if the model trajectory is more “curved".

Using this result we can then discuss the global error of the explicit Euler integration method, i.e. what is the final integration error ‖xN − x(T)‖? We can answer that question almost directly from (7.6). Indeed, we make the following observations:

• An integration up to time T with a time step ∆t requires N = T/∆t Euler steps

• Each step yields an error of order ∆t²

• After N steps, the error will then be of the order N ∆t² = (T/∆t) ∆t² = T ∆t

• After N steps, the simulation error ‖xN − x(T)‖ will be of the order ∆t. We will use the notation

‖xN − x(T)‖ = O(∆t)    (7.7)

throughout the notes, which formally means:

‖xN − x(T)‖ ≤ c ∆t    (7.8)

holds for some constant c > 0 and for ∆t small enough.

Example 7.1. We illustrate next these principles on the integration of the following dynam-
ics:
 
ẋ = f(x, u) = [ σ (x2 − x1) ; x1 (ρ − x3) − x2 ; x1 x2 − β x3 ]    (7.9)

for σ = 10, β = 3, ρ = 28 and starting at the initial conditions x0 = [1, 1, 1]⊤. Figures
7.2-7.4 illustrate the model trajectories obtained from the explicit Euler scheme (7.4) for
different step sizes ∆t , and figure 7.5 reports the global error, i.e. the error observed at the
end of the simulation.
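The O(∆t) behaviour of the global error can be checked numerically on (7.9); in the sketch below, a much finer Euler run serves as the reference solution (a simplifying assumption, standing in for the high-accuracy integrator used in the figures):

```python
# Numerical check of the order-1 global error of explicit Euler on (7.9).
import numpy as np

def f(x):
    sigma, beta, rho = 10.0, 3.0, 28.0
    return np.array([sigma * (x[1] - x[0]),
                     x[0] * (rho - x[2]) - x[1],
                     x[0] * x[1] - beta * x[2]])

def euler(x0, dt, T):
    x = np.array(x0, dtype=float)
    for _ in range(int(round(T / dt))):
        x = x + dt * f(x)
    return x

x0, T = [1.0, 1.0, 1.0], 1.0
ref = euler(x0, 1e-5, T)   # stand-in for a very high accuracy reference
errs = [np.linalg.norm(euler(x0, dt, T) - ref) for dt in (1e-2, 1e-3, 1e-4)]
print(errs)  # each 10x reduction of dt reduces the error roughly 10x
```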


Figure 7.2: Numerical integration of the equation (7.9). In dark the trajectories from the explicit Euler scheme (7.4) using ∆t = 10^−2, and in grey the trajectories obtained using a very high accuracy integration method.


Figure 7.3: Numerical integration of the equation (7.9). In dark the trajectories from the explicit Euler scheme (7.4) using ∆t = 10^−2.5, and in grey the trajectories obtained using a very high accuracy integration method.


Figure 7.4: Numerical integration of the equation (7.9). In dark the trajectories from the explicit Euler scheme (7.4) using ∆t = 10^−3, and in grey the trajectories obtained using a very high accuracy integration method.

Figure 7.5: Global error vs. ∆t using the explicit Euler scheme (7.4).

7.1.2 STABILITY OF THE EXPLICIT EULER METHOD

Beyond the accuracy of integration methods, another crucial aspect to discuss is their sta-
bility. In order to introduce this concept, let us consider the trivial stable scalar linear sys-
tem:
ẋ = −λx, x(0) = x0 (7.10)

for λ > 0 and arbitrary initial conditions x0 . The dynamics (7.10) are linear and have there-
fore the explicit solution:

x(t ) = x0 e −λt (7.11)

Let us nonetheless consider deploying the explicit Euler method on (7.10). The Euler steps
then read as:

xk+1 = xk − λ∆t xk = (1 − λ∆t ) xk (7.12)

We observe that (7.12) is a linear discrete dynamic system, and that it becomes unstable
for |1 − λ∆t | > 1. Since λ, ∆t > 0, this happens if 1 − λ∆t < −1, i.e. if

∆t > 2/λ    (7.13)
In other words, the explicit Euler method delivers an unstable simulation for the dynam-
ics (7.10) when the time step ∆t is too large compared to the pole of the dynamics λ. We
need to stress here that the dynamics (7.10) are stable, only their numerical simulation is
possibly unstable. We illustrate these observations in Fig. 7.6 below.

(Panels, left to right: ∆t = 0.01, ∆t = 0.15, ∆t = 0.21; x vs. t.)

Figure 7.6: Illustration of the stability problem of the Euler method for λ = 10. Only a
small enough step-size ∆t ≤ 0.2 allows the method to be stable (left and mid-
dle graph), while ∆t > 0.2 yields an unstable simulation (right graph).
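The threshold (7.13) is easy to reproduce numerically; the sketch below applies the recursion (7.12) with λ = 10, on both sides of the bound ∆t = 2/λ = 0.2:

```python
# Reproducing the stability bound (7.13) for lambda = 10: the Euler recursion
# (7.12) contracts for dt < 2/lambda = 0.2 and diverges beyond it.
lam = 10.0

def euler_amplitude(dt, n=200):
    x = 1.0
    for _ in range(n):
        x = (1.0 - lam * dt) * x   # one Euler step (7.12)
    return abs(x)

print(euler_amplitude(0.15))   # |1 - 1.5| = 0.5 < 1: decays
print(euler_amplitude(0.21))   # |1 - 2.1| = 1.1 > 1: blows up
```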

We will need to discuss the stability of each integration method we will study. This stabil-
ity issue is in fact studied in a slightly more general way for various integration methods,
namely the test system:

ẋ = λx, x(0) = x0 (7.14)

is used with λ ∈ C and the integration method stability is described in terms of region of
stability in the complex plane, for which a given λ∆t yields a stable simulation. In the
specific case of the explicit Euler method the region of stability is given by a circle of radius
one centered at −1, i.e.


Figure 7.7: Stability region of the explicit Euler scheme (7.4) in the complex plane. All com-
binations of λ,∆t such that λ∆t lies within the circle result in a stable (though
not necessarily accurate) integration.

Obviously, the larger the region of stability is, the more “stable" the method is considered.

Summary

• The explicit Euler method is a method of order 1, i.e. the global error is ‖xN − x(T)‖ = O(∆t). This is the lowest order that one can accept from an integration method, as one expects from any method that the error gets smaller with the step size ∆t.

• The explicit Euler method can become unstable for a too large step size. Its region of stability for the test system (7.14) is fairly small and given in Fig. 7.7.

7.2 EXPLICIT RUNGE-KUTTA 2 METHODS


We will now consider an improvement of the explicit Euler method, while remaining in the family of explicit methods (the meaning of this label will become clear later). The most widely known family of explicit integration methods (beyond Euler) is the family of Runge-Kutta methods. Let us start with studying the Runge-Kutta 2 method (RK2), for which a simple interpretation of its construction can be developed.

The exact trajectories of the model dynamics (7.1) obey the following integral relationship:

x(tk+1) = x(tk) + ∫_{tk}^{tk+1} f(x(t), u(t)) dt    (7.15)

The key idea behind many integration methods is to try to provide good evaluations of the
integral term in (7.15). In fact, one can interpret the explicit Euler method as a crude way
of doing that. Indeed, explicit Euler approximates:
∫_{tk}^{tk+1} f(x(t), u(t)) dt ≈ (tk+1 − tk) f(x(tk), u(tk)) ≈ ∆t f(xk, uk)    (7.16)

essentially using a one-step rectangular quadrature on the integral. This approximation is illustrated in Figure 7.8.

Figure 7.8: Illustration of the approximation (7.16). The light grey area is ∫_{tk}^{tk+1} f(x(t), u(t)) dt (including the dark grey area), which the explicit Euler scheme is approximating with ∆t f(xk, uk), i.e. the dark grey area.

A natural question then is how we can improve the approximation (7.16). To that end, one can e.g. use a mid-point rule, based on the approximation

∫_{tk}^{tk+1} f(x(t), u(t)) dt ≈ ∆t f( x(tk + ∆t/2), u(tk + ∆t/2) )    (7.17)

which tends to be better than (7.16) whenever the model trajectories have a “low curva-
ture". This observation is illustrated in Figure 7.9.

Figure 7.9: Illustration of the approximation (7.17). The light grey area is ∫_{tk}^{tk+1} f(x(t), u(t)) dt (including the dark grey area), which the mid-point rule is approximating with ∆t f( x(tk + ∆t/2), u(tk + ∆t/2) ), i.e. the rectangular grey area.

Unfortunately, x(tk + ∆t/2) is unknown and needs to be itself replaced by an approximation built upon xk. Here we rely again on an explicit Euler “half" step, i.e. we use:

x(tk + ∆t/2) ≈ xk + (∆t/2) f(xk, uk)    (7.18)

Combining (7.18) and (7.17), we obtain the “mid-point" RK2 integration scheme:

K1 = f(xk, u(tk))    (7.19a)
K2 = f( xk + (∆t/2) K1 , u(tk + ∆t/2) )    (7.19b)
xk+1 = xk + ∆t K2    (7.19c)
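In code, one step of (7.19) can be sketched as follows (illustrative naming; on the linear test dynamics ẋ = −x the step can be checked against the exact solution):

```python
# One step of the "mid-point" RK2 scheme (7.19).
import numpy as np

def rk2_step(f, xk, u, tk, dt):
    K1 = f(xk, u(tk))                             # (7.19a)
    K2 = f(xk + 0.5 * dt * K1, u(tk + 0.5 * dt))  # (7.19b)
    return xk + dt * K2                           # (7.19c)

# On xdot = -x the exact step factor is exp(-dt); RK2 matches it to O(dt^3).
dt = 0.1
x1 = rk2_step(lambda x, u: -x, np.array([1.0]), lambda t: None, 0.0, dt)
print(abs(x1[0] - np.exp(-dt)))  # roughly dt^3/6
```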

Despite the somewhat convoluted construction, we can fairly easily show that this RK2
scheme is of order 2 (hence the name). In order to compute the order, we will use a Taylor
argument again. In order to reduce the complexity of the notation, let us do the calculation
assuming that the model dynamics do not have an input u. Assuming again that xk = x(tk )
(i.e. the integration is exact at time tk ), we observe that:

x(tk+1) = x(tk) + ∆t · f(x(tk)) + (∆t²/2) ḟ(x(tk)) + O(∆t³)    (7.20)

where ḟ = (∂f/∂x) f. We additionally observe that (7.19) in effect does:

xk+1 = xk + ∆t · f( xk + (∆t/2) f(x(tk)) )    (7.21)

and we finally observe that

f( xk + (∆t/2) f(x(tk)) ) = f(x(tk)) + (∆t/2) (∂f/∂x)|xk f(x(tk)) + O(∆t²)    (7.22)

Hence

xk+1 = xk + ∆t · f(x(tk)) + (∆t²/2) (∂f/∂x)|xk f(x(tk)) + O(∆t³)    (7.23)

A comparison of (7.20) and (7.23) finally yields:

‖xk+1 − x(tk+1)‖ = O(∆t³)    (7.24)

We can then conclude that the one-step error of the RK2 scheme is of order 3, and using a similar reasoning as for the explicit Euler scheme, the global error of the RK2 scheme is of order 2, i.e.

‖xN − x(T)‖ = O(∆t²)    (7.25)

(see Figure 7.10 for an illustration). We will discuss the stability of RK2 later, together with
the other RK methods.

Figure 7.10: Global error of the RK2 method (7.19), and comparison with the explicit Eu-
ler scheme (7.4) (RK1), as obtained from numerical experiments. One can see
the difference between having a first or second order method. The stronger
slope of the RK2 method (in the log-log plot) indicates a higher power in the
relationship ∆t to kx N − x (T )k.

7.3 GENERAL RK METHODS
The principles of the RK2 method can be generalized. We introduce here the general for-
mula for RK methods, which is a generalization of (7.19).
K1 = f( xk + ∆t Σ_{j=1..s} a1j Kj , u(tk + c1 ∆t) )    (7.26a)
⋮
Ki = f( xk + ∆t Σ_{j=1..s} aij Kj , u(tk + ci ∆t) )    (7.26b)
⋮
Ks = f( xk + ∆t Σ_{j=1..s} asj Kj , u(tk + cs ∆t) )    (7.26c)

xk+1 = xk + ∆t Σ_{i=1..s} bi Ki    (7.26d)

where the coefficients ai j , b i , c j define the RK method, and s is the number of stages of the
method. We can easily observe that both the explicit Euler method and the RK2 methods
introduced above can be written in this form. Indeed, it can be verified that:

• The explicit Euler method has s = 1 with

a11 = 0, b 1 = 1, c1 = 0 (7.27)

• The “mid-point" RK2 method has s = 2 with

a11 = 0, a12 = 0, c1 = 0    (7.28a)
a21 = 1/2, a22 = 0, c2 = 1/2    (7.28b)
b1 = 0, b2 = 1    (7.28c)

7.3.1 BUTCHER TABLEAU

The Butcher tableau is a convenient way of summarizing the coefficients of RK methods in a clean and compact way. The Butcher tableau reads as:

c1 | a11 . . . a1s
⋮  | ⋮         ⋮
cs | as1 . . . ass
   | b1  . . . bs

E.g. explicit Euler has the Butcher tableau

0 | 0
  | 1

and is therefore an RK1 method (because of order 1). The “mid-point" RK2 method has the Butcher tableau

0   | 0   0
1/2 | 1/2 0
    | 0   1
Other RK2 methods exist, with Butcher tableaus of the same size but with different coeffi-
cients, e.g.

Ralston’s RK2                Heun’s RK2

0   | 0   0                  0 | 0   0
2/3 | 2/3 0                  1 | 1   0
    | 1/4 3/4                  | 1/2 1/2

RK methods resulting from (7.26) ought to be divided in two categories: explicit and implicit. In order to make these terms clear, let us consider all RK methods having two stages
(s = 2). In this case (7.26) reads as:

K 1 = f ( xk + ∆t (a11 K 1 + a12 K 2 ) , u(tk + c1 ∆t ) ) (7.29a)


K 2 = f ( xk + ∆t (a21 K 1 + a22 K 2 ) , u(tk + c2 ∆t ) ) (7.29b)
xk+1 = xk + ∆t ( b 1 K 1 + b 2 K 2 ) (7.29c)

We observe that for a11 , a12 , a22 = 0 in (7.29) (such as in all the RK2 we have looked at
above), K 1 can be computed explicitly from xk , u(.), and then K 2 can be computed explic-
itly from xk , u(.), K 1 . However, for any of the a11 , a12 , a22 non-zero, this cannot be done as
equations (7.29a)-(7.29b) become implicit (the unknown K 1,2 appear on both sides of the
equalities and are linked through the function f ). We then say that the RK scheme is implicit. One can trivially see from the Butcher tableau of an RK method whether the method is explicit or implicit. We state this next.

A Butcher tableau defines an explicit integrator if and only if the diagonal and above-diagonal elements are all zero, i.e. if and only if aij = 0 for any j ≥ i. Otherwise it defines an implicit method.

The distinction is important in practice. Indeed, while explicit RK methods require simply

computing the K i sequentially (K 1 then K 2 , etc.), implicit RK methods require solving the
equations together and numerically, typically using a Newton method. The latter is com-
putationally more expensive because it requires solving linear systems a number of times
for each integration step. However, as we will see later, implicit RK schemes have powerful
properties that make them attractive, even when computational time is important.
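A generic explicit RK step can be written directly from the Butcher tableau, computing the Ki sequentially; the sketch below (our own illustrative code) checks that the tableau is explicit and applies Ralston's RK2 to ẋ = −x:

```python
# Generic explicit RK step driven by a Butcher tableau (A, b, c), cf. (7.26).
import numpy as np

def erk_step(f, xk, u, tk, dt, A, b, c):
    """One explicit RK step; A must be strictly lower triangular."""
    A, b, c = np.asarray(A, float), np.asarray(b, float), np.asarray(c, float)
    assert np.allclose(np.triu(A), 0.0), "tableau is not explicit"
    s = len(b)
    K = np.zeros((s,) + np.shape(xk))
    for i in range(s):                                # stages in sequence
        xi = xk + dt * np.tensordot(A[i], K, axes=1)  # xk + dt*sum_j aij*Kj
        K[i] = f(xi, u(tk + c[i] * dt))
    return xk + dt * np.tensordot(b, K, axes=1)       # (7.26d)

# Ralston's RK2 tableau from above:
A2, b2, c2 = [[0, 0], [2/3, 0]], [1/4, 3/4], [0, 2/3]
x1 = erk_step(lambda x, u: -x, np.array([1.0]), lambda t: None,
              0.0, 0.1, A2, b2, c2)
print(x1)  # one order-2 step on xdot = -x
```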

7.3.2 THE RK4 METHOD

It is worth discussing one of the most commonly used RK methods, the explicit RK4 (order 4), which requires s = 4 stages. It has the Butcher tableau⁹

0   | 0   0   0   0
1/2 | 1/2 0   0   0
1/2 | 0   1/2 0   0
1   | 0   0   1   0
    | 1/6 1/3 1/3 1/6

It is a popular approach because it has a very good trade-off between computational com-
plexity (CPU time) and accuracy.

• The RK4 method has s = 4 stages

• The integration order of this scheme is 4, i.e.

‖xN − x(T)‖ ≤ c ∆t⁴    (7.30)

holds for some c > 0 and for ∆t small enough. We should probably stress here the im-
plication of having an order 4. The consequence of the order 4 is that if one divides
the time step ∆t by 2, then the amount of computation required to run a simulation
on a time interval [0, T ] is doubled, but the numerical error of the simulation is di-
vided by 24 = 16. This ∆t 4 effect of the order 4 makes it “cheap" to gain accuracy
(cheap in terms of computational time).

• The method requires a few lines of computer code to implement

For the sake of completeness, let us write the pseudo-code corresponding to an RK4 method.

9
Note that other Butcher tableaus can generate order 4 RK methods, but this one is arguably the most com-
monly used.

RK4 method for explicit ODEs
Algorithm:

Input: Initial conditions x0, input profile u(.), step size ∆t

for k = 0, . . ., N − 1 do
    Compute:

        K1 = f(xk, u(tk))    (7.31a)
        K2 = f( xk + (∆t/2) K1 , u(tk + ∆t/2) )    (7.31b)
        K3 = f( xk + (∆t/2) K2 , u(tk + ∆t/2) )    (7.31c)
        K4 = f( xk + ∆t K3 , u(tk + ∆t) )    (7.31d)

    Assemble the RK step:

        xk+1 = xk + ∆t ( (1/6) K1 + (1/3) K2 + (1/3) K3 + (1/6) K4 )    (7.32)

return x1,...,N
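A runnable counterpart of this pseudo-code can be sketched as follows (illustrative naming):

```python
# Runnable version of the RK4 pseudo-code (7.31)-(7.32).
import numpy as np

def rk4(f, x0, u, t0, dt, N):
    xs = [np.asarray(x0, dtype=float)]
    for k in range(N):
        tk, xk = t0 + k * dt, xs[-1]
        K1 = f(xk, u(tk))                             # (7.31a)
        K2 = f(xk + 0.5 * dt * K1, u(tk + 0.5 * dt))  # (7.31b)
        K3 = f(xk + 0.5 * dt * K2, u(tk + 0.5 * dt))  # (7.31c)
        K4 = f(xk + dt * K3, u(tk + dt))              # (7.31d)
        xs.append(xk + dt * (K1 + 2 * K2 + 2 * K3 + K4) / 6.0)  # (7.32)
    return np.array(xs)

# Order-4 accuracy on xdot = -x over [0, 1] with dt = 0.01:
traj = rk4(lambda x, u: -x, [1.0], lambda t: None, 0.0, 0.01, 100)
print(abs(traj[-1, 0] - np.exp(-1.0)))  # far below the Euler error
```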

7.3.3 STAGES, ORDER & EFFICIENCY OF EXPLICIT RK METHODS

Let us discuss more the integration order of explicit RK methods. Before diving in, we
should discuss more the implication of the number of stages s that the different RK meth-
ods have. It can be observed that the number of stages of an explicit RK method defines the
computational complexity of evaluating one step xk → xk+1 in the integrator. Indeed, each
integration step (i.e. the execution of (7.26)) requires s evaluations of the model equations
f and each of these evaluations translates directly into a (typically) fixed CPU time. We can deduce that the computational cost of one step of an explicit integrator is roughly proportional to s. In that sense, integrators with a low number of stages appear preferable.

On the other hand, integrators with a larger number of stages s also achieve higher orders
(let’s label them o), and therefore achieve a higher accuracy for a given ∆t . Indeed, remem-
ber that the simulation error is bounded by:
kx N − x(T )k ≤ c∆t o (7.33)
for some c > 0 and ∆t small. Note that we should always consider ∆t small, i.e. ∆t o be-
comes exponentially smaller as o increases. Hence by increasing the number of stages s we
pay proportionally more in computations but we gain exponentially in accuracy. The gain
tends to out-weight the loss because of the “power effect" of the order.

The RK methods we have discussed so far suggest that the number of stages s of the method
equals its order o. Indeed, we have seen that:
• Explicit Euler has an order o = 1 and is an RK method with s = 1 stage. If ∆t is divided
by 2 then the error is divided by 2.

• The RK2 methods we have seen have an order o = 2 and have s = 2 stages. If ∆t is
divided by 2 then the error is divided by 4.

• The RK4 method we have seen has order o = 4 and has s = 4 stages. If ∆t is divided by
2 then the error is divided by 16.

From these observations, it may appear that the higher s the better. Unfortunately, this pat-
tern o = s breaks beyond o = 4. Let us stress this fact in the following table (these numbers
are not straightforward to establish, and we will leave that question out of these notes).

Order Stages required ∆t divided by 2 → error divided by...


RK1 1 2
RK2 2 4
RK3 3 8
RK4 4 16
RK5 6 32
RK6 7 64
RK7 9 128
RK8 11 256
We observe from this table that for a low number of stages (s ≤ 4), we have o = s, hence
adding a stage adds an order (see Fig. 7.11 for an illustration). However, for a higher number
of stages, the increase in the order of the method “stalls", and stops increasing as quickly
as the number of stages.

Figure 7.11: Global error of the different RK methods, as obtained from numerical experi-
ments.

This effect has some important consequences in practice on the choice of an integration
method. Indeed, the crux of any integration method is to achieve a given accuracy with
a minimum computational budget. There are two ways of increasing the integrator accu-
racy: increasing the order o, i.e. increasing the number of stages s, or decreasing ∆t. Having
these two options makes the choice of integrator (choice of order and choice of time step)
non-trivial.

However, using the table above, we can in fact somewhat formalize this choice. Let us
assume that we want the global error to be limited to some given number Tol

• Then for an integrator of order o, we need:

‖xN − x(T)‖ ≤ c ∆t^o ≤ Tol    (7.34)

such that the step size ∆t is limited to:

∆t ≤ (Tol/c)^(1/o)    (7.35)

• In order to carry out a simulation on the time interval [0, T], the number of integrator steps required is:

N = T/∆t    (7.36)

and for a number of stages s, the number of evaluations n of the system dynamics f is:

n = N · s = s T/∆t ≥ s T (Tol/c)^(−1/o)    (7.37)

We deduce from this simple reasoning that the computational cost per unit of simulation time is at least:

n/T ≥ s (Tol/c)^(−1/o)    (7.38)

where o and s are related via the table above. It is then interesting to chart this relationship, which we do in the figures below. There it becomes clear that a very low or very high order is not optimum, and that the optimum is in the order 3-6.
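The bound (7.38) together with the stage table above can be evaluated directly; the sketch below uses an arbitrary illustrative accuracy ratio Tol/c = 1e-4 (our own choice):

```python
# Cost per unit simulation time (7.38) for the stage counts s(o) tabulated
# above, with an assumed accuracy ratio Tol/c = 1e-4 (illustrative value).
stages = {1: 1, 2: 2, 3: 3, 4: 4, 5: 6, 6: 7, 7: 9, 8: 11}
tol_over_c = 1e-4

cost = {o: s * tol_over_c ** (-1.0 / o) for o, s in stages.items()}
best = min(cost, key=cost.get)
print({o: round(v, 1) for o, v in cost.items()}, "-> cheapest order:", best)
```

For this tolerance the cheapest order falls in the middle range, consistent with the observation above; changing Tol/c moves the optimum (looser tolerances favour lower orders).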

Figure 7.12: Illustration of formula (7.38) for different Tol/c. The horizontal axis represents the order o of the method, and the vertical axis the computational cost per simulation time T, i.e. n/T. The optimum is achieved in the middle range, and not for low nor high order methods.


Figure 7.13: Illustration of the number of function evaluations required to reach different
levels of simulation accuracy for a linear test example. Here the RK5 scheme
beats the other schemes for high accuracies, and RK4 beats the other schemes
for lower accuracies. Note that this result is model-dependent.

Before closing this section, a last point ought to be stressed regarding the order o achieved
by different RK methods. Indeed, we have seen the relationship between the order o of
different RK methods and the number of stages s required to achieve that order. We need to
stress here that the order o of an RK method does not follow simply from using the adequate
amount of stages. E.g. the RK2 schemes (o = 2) we have seen require s = 2 stages, but they
also require the coefficients a, b, c to be chosen adequately. E.g. in the case of a s = 2 stages
method, an order o = 2 is achieved if the coefficients satisfy:

1
b 1 + b 2 = 1, b2 c2 = , a21 = c2 (7.39)
2
Similarly, for the other RK methods, the coefficients cannot take arbitrary value if one wants
the method to achieve its highest possible order.

7.4 STABILITY OF EXPLICIT RK SCHEMES


Let us go back to the question of integration stability we investigated in Section 7.1.2 in the
more general context of RK methods of arbitrary order. The general formula defining the
stability region of the test system

ẋ = λx, x(0) = x0 (7.40)

for an RK method of order o is:

S = { λ∆t  s.t.  | Σ_{k=0..o} (λ∆t)^k / k! | ≤ 1 }    (7.41)

The regions of stability for different orders are depicted in Figure 7.14. One can observe
that the regions increase with the order, but their size is fairly limited.


Figure 7.14: Illustration of the region of stability for the explicit RK methods 1 to 5. One can
observe that the stability region grows with the order, but it nonetheless covers
a limited range of admissible λ∆t .
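Formula (7.41) can be evaluated directly; the sketch below checks points on the real axis, recovering the Euler bound (region ending at −2, cf. (7.13)) and the well-known real-axis limit of RK4 near −2.785 (a known value, not derived in these notes):

```python
# Membership test for the stability region S in (7.41).
import math

def in_stability_region(mu, o):
    """True if |sum_{k=0}^{o} mu^k / k!| <= 1, with mu = lambda*dt."""
    return abs(sum(mu ** k / math.factorial(k) for k in range(o + 1))) <= 1.0

print(in_stability_region(-1.9, 1), in_stability_region(-2.1, 1))  # Euler
print(in_stability_region(-2.7, 4), in_stability_region(-2.9, 4))  # RK4
```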

7.4.1 STIFF SYSTEMS

The numerical stability considerations discussed above have important practical conse-
quences. Indeed, while the dynamics (7.40) are clearly of no interest, they operate as a
trivial test system to discuss how integration methods react to fast dynamics, i.e. more
specifically, time constants in the dynamics that are in the same ballpark as the integration
step ∆t . The question of whether an integration method is capable of handling such fast
dynamics is crucial in the context of stiff systems, which appear very often in models for
engineering application such as mechanical and electrical systems. It is important to “de-
tect" the stiffness of a model when trying to simulate it.

We have in fact already encountered stiff systems in the DAE chapter. Indeed, (2.34) essen-
tially describes a model having very fast dynamics (for ǫ small) on the states z, and slower
dynamics on the states x. DAEs allow one to approximate these fast states via their “de-
cayed" values (i.e. their trajectories after their fast, stable dynamics have decayed). How-
ever, one may want to process these dynamics without using DAE formulations, in which
case the fast dynamics have to be dealt with.
Example 7.2 (Stiff system). As an example of a stiff system, consider the model equations

 
ẋ = [ 0    5    1         0        ;
     −5    0    0         2        ;
      0    0   −10⁻¹/ǫ    1/ǫ      ;
      0    0   −1/ǫ      −10⁻¹/ǫ   ] x    (7.42)

for ǫ = 5 · 10⁻⁴. The corresponding trajectories are illustrated in Fig. 7.15 and 7.16. One can observe that the states x1,2 evolve slowly (although they are influenced by the fast states and have some small oscillations in the beginning), while the states x3,4 oscillate fast and decay to slow trajectories. This kind of system can e.g. arise in mechanical systems or electrical circuits when some parts have eigenfrequencies that are much higher than other parts. They are especially expensive to treat in numerical simulations because the fast dynamics require a small ∆t in order for the numerical simulation to be stable, while long simulations (T large) are required in order to see the evolution of the “slow" states.
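The cost induced by stiffness can be reproduced on an even simpler toy system (an illustrative choice of ours, not the system (7.42) itself): one slow and one fast decaying mode, integrated with explicit Euler:

```python
# Toy stiff system: one slow and one fast stable mode. Explicit Euler must
# keep dt < 2/lam_fast even though the fast mode decays almost immediately --
# this is what makes stiff systems expensive for explicit methods.
import numpy as np

lam_slow, lam_fast = 1.0, 1000.0

def euler_final_norm(dt, T):
    x = np.array([1.0, 1.0])
    for _ in range(int(T / dt)):
        x = x + dt * np.array([-lam_slow * x[0], -lam_fast * x[1]])
    return np.linalg.norm(x)

print(euler_final_norm(0.0019, 5.0))  # dt < 2/1000: stable simulation
print(euler_final_norm(0.0021, 5.0))  # dt > 2/1000: the fast mode explodes
```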

Figure 7.15: Illustration of the trajectories of the stiff system (7.42).


Figure 7.16: Illustration of the trajectories of the stiff system (7.42) vs. a simulation using
RK4 with ∆t = 1.5 · 10−3 (dotted curve). The initial conditions for the “fast"
states are chosen such that ż = 0 at t = 0. The small step size is close to the
limit for which the simulation becomes stable.


We will see later on that stiff systems are in fact one of the motivations for using implicit
integration methods.

7.5 ERROR CONTROL & ADAPTIVE INTEGRATORS


We have seen above that the integrator accuracy and stability depend on the step size ∆t .
It can be tricky, however, to select the correct step size a priori. Moreover, as suggested in
the order computation of the RK1 and 2 schemes, the accuracy depends on how “curved"
the model trajectories are, and this “curvature" can vary significantly throughout the tra-
jectories of the system. A careful approach would then be to simply use an overly small step
size to ensure that the required accuracy is achieved. This conservative approach would,
however, be overly expensive, as it would use very small step sizes ∆t even when they are
not needed.

Example 7.3 (Van der Pol oscillator). A classic example of an ODE triggering the difficulties
mentioned above is the Van der Pol oscillator having the equations

ẋ1 = (1 − x2²) x1 − x2,    (7.43a)
ẋ2 = x1,    (7.43b)

and e.g. the trajectories displayed in Fig. 7.17. One can observe that the trajectories are
fairly “benign" most of the time, but regularly go through very sharp changes. The Van der
Pol oscillator is a challenging ODE for numerical integration as very fine time steps ∆t are
needed to “survive" the sharp changes, while fairly large time steps can be used on the rest
of the trajectories.

[Figure: x1 and x2 versus t over [0, 50].]
Figure 7.17: Illustration of the Van der Pol trajectories.

It then makes sense to use varying step sizes ∆t throughout the trajectories. Adaptive integration schemes are a practical approach to do precisely that. The key idea here is to adjust the step size ∆t during the simulations in order to meet the required accuracy. In practice, adaptive RK schemes perform an error control at every RK step, and when the error is larger than acceptable, the step size is reduced and a shorter step is attempted until the tolerance is met. Adaptive integrators require a baseline to assess the error at each step. A common practice is to compute the integration step using two different Butcher tableaus, and compare their outcomes. If their discrepancy is above a certain level, then the step size is reduced.

Let us e.g. consider the RK45 adaptive integration scheme, which is implemented in Matlab
in the function ode45.m. At each RK step, the procedure:

• Generates two steps from xk , let us call them xk+1 and x̂k+1 , using two different
Butcher tableaus.

• Compares the two steps, i.e. computes e = ‖x_{k+1} − x̂_{k+1}‖

• Reduces ∆t if the error e is above some tolerance, and computes the steps again until
the tolerance is met

• Increases ∆t a bit if the error e is significantly below the tolerance

We observe here that ∆t has a “memory" in the sense that if a very low ∆t was needed at a step k, then it will remain small for some steps. One ought to observe that this procedure is somewhat computationally expensive, as it computes two different steps at every RK step. In order to limit the computational expense, a common approach is to use two Butcher tableaus that differ only in the b coefficients, i.e. in forming the linear combination (7.26d), while the coefficients a and c are the same, such that equations (7.26a)-(7.26c) need to be computed only once. The two steps x_{k+1} and x̂_{k+1} are therefore computed as:
x_{k+1} = x_k + ∆t Σ_{i=1}^{s} b_i K_i    (7.44a)

x̂_{k+1} = x_k + ∆t Σ_{i=1}^{s} b̂_i K_i    (7.44b)

The Butcher tableau for adaptive RK methods is then presented with two lines in the “b"
part of the tableau, the first one for the “classic" b and the second for b̂. E.g. the Butcher
tableau used in the ode45.m Matlab function reads as10 :

0
1/5 1/5
3/10 3/40 9/40
4/5 44/45 −56/15 32/9
8/9 19372/6561 −25360/2187 64448/6561 −212/729
1 9017/3168 −355/33 46732/5247 49/176 −5103/18656
1 35/384 0 500/1113 125/192 −2187/6784 11/84
35/384 0 500/1113 125/192 −2187/6784 11/84 0
5179/57600 0 7571/16695 393/640 −92097/339200 187/2100 1/40

One can observe that this method has s = 7 stages. The step x_{k+1} is of order o = 5, while the comparison step x̂_{k+1} is of order 4. Figure 7.18 illustrates the step size ∆t selected by ode45.m when treating the Van der Pol oscillator (7.43).

One can finally note that the last line of the “a" part of the tableau is identical to the first line of the “b" part, and that K_1 is given by:

K_1 = f(x_k, u(t_k))    (7.45)

This implies that K_1 of step k + 1 matches K_7 of step k, such that the last K_7 can be reused as the next K_1.
10: the RK method coded by this tableau is referred to as the RK45 Dormand-Prince method

[Figure: step size ∆t (log scale, 10^{-3} to 10^{-1}) versus t over [0, 50].]
Figure 7.18: Illustration of the step sizes selected by ode45.m for simulating the Van Der Pol
trajectories. Whenever the trajectories are going through a “sharp" turn, the
step size drops drastically in order to keep the integration error under control.
It then increases slowly again to reduce the computational burden.
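The error-control loop described in this section can be sketched compactly in code. The example below is a minimal sketch, not the Dormand-Prince tableau of ode45.m: it uses the small embedded Heun-Euler pair (shared a coefficients, two b-rows as in (7.44)) on the Van der Pol oscillator (7.43). The tolerance, halving factor and growth factor are illustrative choices, not values from these notes.

```python
import numpy as np

# Embedded Heun-Euler pair: shared stages, two b-rows as in (7.44).
A = np.array([[0.0, 0.0],
              [1.0, 0.0]])
b = np.array([0.5, 0.5])      # order-2 combination
b_hat = np.array([1.0, 0.0])  # order-1 combination (plain explicit Euler)

def f(x):
    # Van der Pol oscillator (7.43)
    return np.array([(1.0 - x[1]**2) * x[0] - x[1], x[0]])

def adaptive_step(x, dt, tol=1e-6):
    """One error-controlled step: halve dt until both estimates agree."""
    while True:
        K = np.zeros((2, len(x)))
        for i in range(2):
            K[i] = f(x + dt * (A[i] @ K))    # stages shared by both b-rows
        err = np.linalg.norm(dt * ((b - b_hat) @ K))
        if err <= tol:
            return x + dt * (b @ K), dt      # accept the higher-order step
        dt *= 0.5                            # reject: retry with a shorter step

x, t, dt = np.array([2.0, 0.0]), 0.0, 0.1
while t < 1.0:
    x, dt_used = adaptive_step(x, dt)
    t += dt_used
    dt = min(1.5 * dt_used, 0.1)             # let dt grow back slowly
```

Note how the step size carries the same “memory" as described above: after a rejection it stays small, and is only allowed to grow back gradually.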

8 IMPLICIT INTEGRATION METHODS – RUNGE-KUTTA
In the explicit RK section, we have seen that a Butcher tableau can define explicit or implicit RK methods. We have discussed explicit RK methods in detail; it is now time to discuss implicit RK methods. While implicit methods are computationally more expensive (one needs to solve equations at every RK step, as opposed to simply evaluating the K_i sequentially as in explicit methods), we will see that they have some striking advantages. Indeed, implicit RK methods

• can achieve very high and systematic orders

• are stable regardless of the step size ∆t

• are ideal for handling the simulation of DAEs

8.1 IMPLICIT EULER METHOD


Let us start by approaching implicit integrators from a basic point of view, studying the implicit Euler method. Recall that the explicit Euler method uses the iteration:

xk+1 = xk + ∆t f (xk , u k ) (8.1)

where u k = u (tk ). The implicit Euler method uses instead:

xk+1 = xk + ∆t f (xk+1 , u k+1 ) (8.2)

The name “implicit" ought to be fairly clear from (8.2). Indeed, obtaining xk+1 from (8.2)
cannot be done via a simple function evaluation, but must be done via solving (8.2) for
xk+1 , i.e. one needs to find a solution to

r (xk+1 , xk , u k+1 ) := xk + ∆t f (xk+1 , u k+1 ) − xk+1 = 0 (8.3)

in terms of xk+1 . Note that r ∈ Rn where n is the size of the state space, i.e. x ∈ Rn . Solving
(8.3) is typically done via applying the Newton method covered earlier in these notes. The
deployment of the implicit Euler algorithm is then detailed below.

Implicit Euler method
Algorithm:

Input: Initial conditions x0 , input profile u(.), step size ∆t


for k = 0, . . ., N − 1 do
Guess xk+1 , one can e.g. use xk+1 = xk
while kr (xk+1 , xk , u k+1 )k > Tol do
Compute the solution ∆x_{k+1} to the linear system:

∂r(x_{k+1}, x_k, u_{k+1})/∂x_{k+1} · ∆x_{k+1} + r(x_{k+1}, x_k, u_{k+1}) = 0    (8.4)

where r is given by (8.3). Update:

xk+1 ← xk+1 + α∆xk+1 (8.5)

for some step size α ∈]0, 1] (a full step α = 1 generally works for implicit
integrators)
return x1,...,N

Note that this procedure is significantly more complex than the simple update (8.1) of the explicit Euler method. In particular, computing the Newton step (8.4) requires computing the Jacobian ∂r(x_{k+1}, x_k, u_{k+1})/∂x_{k+1} and forming its matrix factorization (i.e. solving the linear system). This procedure must be repeated at each RK step x_k → x_{k+1}, i.e. many times in a complete simulation. Note that the Jacobian reads as:

∂r(x_{k+1}, x_k, u_{k+1})/∂x_{k+1} = ∆t · ∂f(x_{k+1}, u_{k+1})/∂x_{k+1} − I    (8.6)

and requires computing the Jacobian of the model dynamics ∂f(x_{k+1}, u_{k+1})/∂x_{k+1}.

The implicit Euler method has a global error of order 1, i.e.

‖x_N − x(T)‖ ≤ c ∆t    (8.7)

for some c > 0 and for ∆t sufficiently small. Hence the order of the implicit Euler method is identical to that of the explicit Euler method.

8.1.1 STABILITY OF THE IMPLICIT EULER METHOD

Let us now unpack one of the main motivations for using, in some cases, implicit methods
for simulating a model. Let us come back to the stability issue of the explicit Euler method
(8.1), and recall that the iteration is unstable on the (stable) test system

ẋ = λx (8.8)

for ∆t · |λ| > 2.

We can make a very similar computation for the implicit Euler scheme, i.e. we observe that the implicit Euler method (8.2) deployed on the test system (8.8) reads as

x_{k+1} = x_k + ∆t λ x_{k+1}    (8.9)

or equivalently

x_{k+1} = x_k / (1 − λ∆t)    (8.10)
Hence the implicit Euler method (8.2) is stable if

|1 − λ∆t| > 1    (8.11)

Observe that ℜ(λ) < 0 is required in order for (8.8) to be stable, and it follows that (8.11) then always holds. This result is quite striking. Indeed, it entails that the implicit Euler method is stable regardless of how “fast" the time constants of the model are. This property is called A-stability, and means that the whole left-hand complex plane belongs to the stability region (as opposed to the limited regions depicted in Figure 7.14). A-stability allows one to treat stiff dynamics without taking special care of the instability of the method, i.e. by taking “fairly large" steps ∆t despite the fast time constants λ of the model.
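The contrast with the explicit scheme is easy to check numerically on the test system (8.8): with µ = λ∆t, the explicit Euler update multiplies the state by (1 + µ) per step, while the implicit one multiplies it by 1/(1 − µ). The values of λ and ∆t below are just an illustrative choice.

```python
# test system (8.8): x' = lam*x with lam < 0, and mu = lam*dt
lam, dt = -100.0, 0.1
mu = lam * dt                       # mu = -10: far outside the explicit stability disc
amp_explicit = abs(1 + mu)          # per-step amplification of explicit Euler (7.x)
amp_implicit = abs(1 / (1 - mu))    # per-step amplification of implicit Euler (8.10)
# amp_explicit = 9 (divergence), amp_implicit = 1/11 (contraction)
```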
Example 8.1 (Stiff system, cont'd). Let us reuse the stiff model (7.42) of Example 7.2 and apply the implicit Euler method to simulate it. The outcome is illustrated in Fig. 8.1 for ∆t = 1 · 10−3 and ∆t = 2 · 10−2. One can observe that, for “long" steps, the implicit Euler method remains stable and approximates the fast dynamics by letting them decay to their damped values.

[Figure: four panels showing x1, x2, x3 and x4 versus t over [0, 0.2].]
Figure 8.1: Illustration of the trajectories of the stiff system (7.42) and its simulation via
the implicit Euler method using ∆t = 2 · 10−2 (round markers) and ∆t = 1 · 10−3
(square markers).

8.2 IMPLICIT RUNGE-KUTTA METHODS
Let us now study higher-order implicit Runge-Kutta methods. As already explained in the explicit RK section, a Butcher tableau that is not lower-diagonal describes an implicit method. The RK equations (7.26), recalled here:
à !
Xs
K 1 = f xk + ∆t a1j K j , u(tk + c1 ∆t ) (8.12a)
j =1
..
.
à !
s
X
K i = f xk + ∆t ai j K j , u(tk + ci ∆t ) (8.12b)
j =1
..
.
à !
s
X
K s = f xk + ∆t a s j K j , u(tk + c s ∆t ) (8.12c)
j =1
s
X
xk+1 = xk + ∆t bi K i (8.12d)
i =1

are implicit in K_{1,...,s}, as a_{ij} ≠ 0 for some j ≥ i. In this case we need to solve (8.12a)-(8.12c) numerically, typically using the Newton method. More specifically, similarly to (8.3), we write:

r(K, x_k, u(.)) := [
  f( x_k + ∆t Σ_{j=1}^{s} a_{1j} K_j, u(t_k + c_1 ∆t) ) − K_1
  ⋮
  f( x_k + ∆t Σ_{j=1}^{s} a_{ij} K_j, u(t_k + c_i ∆t) ) − K_i
  ⋮
  f( x_k + ∆t Σ_{j=1}^{s} a_{sj} K_j, u(t_k + c_s ∆t) ) − K_s
] = 0    (8.13)

and deploy a Newton method on the “unknown"

K = [ K_1; . . . ; K_s ]    (8.14)

to solve (8.13). Note that r ∈ Rn·s and K ∈ Rn·s where n is the dimension of the state space
(i.e. x ∈ Rn ) and s the number of stages of the IRK method. The Newton method then works
as follows:

IRK for explicit ODEs
Algorithm:

Input: Initial conditions x0 , input profile u(.), Butcher tableau, step size ∆t
for k = 0, . . ., N − 1 do
Guess for K (one can e.g. use K i = xk )
while kr (K , xk , u(.))k > Tol do
Compute the solution ∆K to the linear system:

∂r(K, x_k, u(.))/∂K · ∆K + r(K, x_k, u(.)) = 0    (8.15)
with r given by (8.13). Update:

K ← K + α∆K (8.16)

for some step size α ∈]0, 1] (a full step α = 1 generally works for implicit
integrators)
Take RK step:

x_{k+1} = x_k + ∆t Σ_{i=1}^{s} b_i K_i    (8.17)

return x1 , . . . , x N

The main computational complexity of this procedure is typically solving the linear system (8.15), which involves the (possibly dense) Jacobian matrix ∂r(K, x_k, u(.))/∂K of size R^{n·s×n·s}. Forming the Jacobian matrix can also be fairly expensive in terms of computational complexity, as it requires evaluating the Jacobians of the system dynamics:

∂f( x_k + ∆t Σ_{j=1}^{s} a_{ij} K_j, u(t_k + c_i ∆t) ) / ∂K_l ∈ R^{n×n}    (8.18)

for i, l = 1, . . . , s.
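The algorithm above can be sketched for the smallest implicit Gauss method (s = 1, a₁₁ = c₁ = 1/2, b₁ = 1, i.e. the implicit midpoint rule, of order 2s = 2). The linear test system below (a harmonic oscillator) is an illustrative choice, not taken from the notes; the dynamics and their Jacobian are supplied by the user.

```python
import numpy as np

# s = 1 Gauss IRK (implicit midpoint): a11 = c1 = 1/2, b1 = 1, order 2s = 2
a11, b1 = 0.5, 1.0

def irk_step(f, jac_f, xk, dt, tol=1e-12):
    """Solve the s = 1 instance of (8.13) for K by Newton, then take step (8.17)."""
    K = np.array(xk, dtype=float)                      # guess K = x_k
    r = f(xk + dt * a11 * K) - K
    while np.linalg.norm(r) > tol:
        J = dt * a11 * jac_f(xk + dt * a11 * K) - np.eye(len(xk))  # dr/dK
        K = K - np.linalg.solve(J, r)                  # full Newton step
        r = f(xk + dt * a11 * K) - K
    return xk + dt * b1 * K

# linear test system: harmonic oscillator x1' = x2, x2' = -x1
Asys = np.array([[0.0, 1.0], [-1.0, 0.0]])
x = np.array([1.0, 0.0])
for _ in range(100):
    x = irk_step(lambda v: Asys @ v, lambda v: Asys, x, dt=0.01)
# after t = 1 the exact solution is (cos 1, -sin 1)
```

For this linear system the Newton loop again converges in one iteration; the interesting feature is that the implicit midpoint rule exactly preserves the norm of the state, which the test below exploits.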

8.2.1 ACCURACY AND EFFICIENCY OF IRK METHODS

Concerning the order of IRK methods, the picture here is fairly striking. Indeed, recall that
explicit RK methods achieve an order o = s (number of stages) for s ≤ 4 and then the order
“stalls" and does not increase as fast as s. This was bad news for high-order explicit RK
methods in terms of efficiency (complexity vs. accuracy).

In contrast, implicit RK methods can achieve o = 2s for any number of stages s.


We can then revisit the “order table" that we saw in the explicit RK section, comparing the explicit RK methods (ERK) to the implicit RK methods (IRK):

Order | Stages required        Order | Stages required
ERK2  | 2                      IRK2  | 1
ERK4  | 4                      IRK4  | 2
ERK6  | 7                      IRK6  | 3
ERK8  | 11                     IRK8  | 4

and readily see that IRK methods require dramatically fewer stages to achieve the same order as ERK methods. Let us illustrate this statement in Figure 8.2.

[Figure: global error ‖x_N − x(T)‖ versus ∆t (log-log) for ERK2 and IRK4.]
Figure 8.2: Comparison of the global error of the explicit RK2 method (7.19) and an implicit RK2 method, as obtained from numerical experiments. One can see the high order o = 2·s of the IRK method compared to the order o = s of the ERK method. One can also observe how the error of the IRK method stops decreasing at about 10^{−10}–10^{−11}, where it reaches the accuracy of the linear algebra.

Unfortunately, this striking difference does not mean that IRK methods are necessarily bet-
ter than the ERKs. In order to unpack that statement, let us investigate the efficiency of IRK
methods (complexity for a given accuracy), using similar lines as what we did for the ERK
methods. Let us assume that we want the global error to be limited to some given number
Tol

• Then for an integrator of order o (even), we need:

‖x_N − x(T)‖ ≤ c ∆t^o ≤ Tol    (8.19)

such that the step size ∆t is limited to:

∆t ≤ (Tol/c)^{1/o}    (8.20)

• In order to carry out a simulation on the time interval [0, T], the number of integrator steps required per simulation time T is:

N = T/∆t ≥ T · (Tol/c)^{−1/o}    (8.21)

• The similarity with ERK methods stops here. Indeed, the complexity of IRK methods is typically dominated by solving the linear system (8.15). The complexity of solving this system is typically in the order of the cube of the size of the Jacobian matrix if the matrix is dense¹¹, which is n · s, where n is the state size. This system must typically be solved several times, say m, in order to reach a good accuracy in solving the IRK equations. The complexity per time step is then in the order of:

C = O(m n³ s³)    (8.22)

We deduce from this simple reasoning that the computational cost per unit of simulation time is of the order

(N/T) · C = O( (Tol/c)^{−1/o} m n³ s³ ) = O( (Tol/c)^{−1/o} m n³ o³/8 )    (8.23)

where the last expression uses o = 2s, i.e. s³ = o³/8.

One can put (8.23) in contrast with the complexity of explicit RK methods (7.38), recalled here:

n/T ≥ s · (Tol/c)^{−1/o}    (8.24)
but the comparison is arguably a bit difficult, as we are trying to compare evaluations of the model dynamics f in explicit methods to forming and solving linear systems in implicit methods. We can, however, compare the explicit and implicit approaches via numerical experiments. E.g. Fig. 8.3 depicts the computational time vs. integration accuracy for explicit and implicit RK methods of various orders (for ∆t fixed) for the Van der Pol oscillator (7.43). The observations we make in this specific example are fairly consistent across different models. Implicit methods can be a bit more computationally expensive than explicit ones, though the difference is typically mild. This picture changes dramatically for stiff systems, where implicit integrators are typically required to obtain a numerically stable integration scheme.

11: the cube can be lowered if the Jacobian is sparse or structured, which can be the case for IRK methods. Linear algebra packages are very efficient at finding such structures, and lower the complexity of solving the linear system. Hence, in numerical experiments, one often observes a lower complexity.

[Figure: integration error versus computational time [s] (log-log) for explicit and implicit RK methods.]

Figure 8.3: Illustration of the computational time vs. accuracy for IRK and ERK methods
for the Van der Pol system (7.43) with ∆t = 10−2 . One can observe the different
behaviors of the implicit and explicit methods, and that even though implicit
methods reach very high order for few stages, the computational cost of solving
linear systems at every RK step makes the IRK method typically more expensive
than the ERK ones. Note that these results are only illustrative, and may change
depending on the coding language, computer architecture, system, etc.

8.2.2 STABILITY OF RK METHODS

We will now show how the stability of RK schemes can be analyzed more systematically, and indicate why IRK methods are different from ERK methods in this respect. Start by writing the Butcher array for a general RK method as

c_1 | a_11 · · · a_1s        c_1 | a_1^⊤
 ⋮  |  ⋮          ⋮     =     ⋮  |   ⋮     =   c | A
c_s | a_s1 · · · a_ss        c_s | a_s^⊤         | b^⊤    (8.25)
    | b_1  · · · b_s             | b^⊤
184
where {a_i}, b, and c are column vectors and A is a matrix. With this notation, applying the RK scheme to the test system ẋ = λx results in the equations

K_1 = λ(x_k + ∆t · a_1^⊤ K)    (8.26a)
  ⋮
K_s = λ(x_k + ∆t · a_s^⊤ K),    (8.26b)

which can be written in a more compact form as

K = λ(x_k · 1 + ∆t · A K),   where 1 = [1, . . . , 1]^⊤.    (8.27)

Letting µ = λ∆t, this gives

(I − µA) K = λ x_k · 1   ⇔   K = (I − µA)^{−1} λ x_k · 1.    (8.28)

The final row of the RK algorithm then gives

x_{k+1} = x_k + ∆t · b^⊤ K = (1 + µ b^⊤ (I − µA)^{−1} 1) x_k.    (8.29)

The pole of this first-order difference equation, a function of µ = λ∆t, is called the stability function and is given by

R(µ) = 1 + µ b^⊤ (I − µA)^{−1} 1.    (8.30)

The stability function characterizes the stability region of the RK scheme via the condition

|R(µ)| ≤ 1,   µ = λ∆t    (8.31)

As mentioned already in Section 8.1, we refer to the scheme as being A-stable, when the
stability region includes the entire left half-plane of the complex plane.

Based on the derived expression for the stability function R(µ) = 1 + µ b^⊤ (I − µA)^{−1} 1, it is in principle possible to analyze the stability properties once the matrix A and the vector b of the Butcher array are defined, but in practice it may still be difficult, of course. However, focussing for a moment on explicit RK schemes, we can be more concrete. Note that, for ERK methods, A is strictly lower-diagonal, implying that det(I − µA) = 1 (this also follows from (I − µA)^{−1} = I + µA + . . . + µ^{s−1} A^{s−1}, since A^s = 0). From this follows (for ERK schemes):

• R(µ) is a polynomial in µ of degree at most s (number of stages).

• For ERK schemes of order o = s, R(µ) is given by

  R(µ) = 1 + µ + . . . + (1/s!) µ^s

• Explicit schemes can never be A-stable.

For IRK schemes, on the other hand, the following holds:

• R(µ) is a rational function of µ. The numerator and denominator are polynomials in µ of degree at most s (number of stages).

• Many (but not all) implicit schemes are A-stable, meaning that they can “survive" (i.e. they are stable for) any fast dynamics. Some of them have an even stronger form of stability (L-stability), but we will not detail this in these notes.

Example 8.2 (Stability of ERK methods). Let us compute the stability function for a few
ERK methods, using the formula (8.30):

• The explicit Euler method with the Butcher array

  0 | 0
  ——+——    (8.32)
    | 1

  has the stability function

  R(µ) = 1 + µ,    (8.33)

  and the stability region is the disc defined by |1 + µ| < 1, cf. Section 7.1.

• The explicit midpoint RK2 method with Butcher array

  0   | 0    0
  1/2 | 1/2  0    (8.34)
  ————+————
      | 0    1

  has the stability function

  R(µ) = 1 + µ [0 1] [[1, 0], [µ/2, 1]] [1; 1] = 1 + µ + µ²/2.    (8.35)

• The explicit trapezoidal RK2 method with Butcher array

  0 | 0    0
  1 | 1    0    (8.36)
  ——+—————
    | 1/2  1/2

  has the stability function

  R(µ) = 1 + µ [1/2 1/2] [[1, 0], [µ, 1]] [1; 1] = 1 + µ + µ²/2.    (8.37)

Notice that the two RK2 methods have the same stability function, as is the case for all o = s = 2 ERK methods. □

Example 8.3 (Stability of IRK methods). A couple of examples of IRK methods verify that
the stability function is indeed a rational function:

• The implicit Euler method with the Butcher array

  1 | 1
  ——+——    (8.38)
    | 1

  has the stability function

  R(µ) = 1 + µ · 1/(1 − µ) = 1/(1 − µ),    (8.39)

  and the method is stable if |1 − µ| > 1, implying A-stability.

• The implicit trapezoidal RK2 method with Butcher array

  0 | 0    0
  1 | 1/2  1/2    (8.40)
  ——+—————
    | 1/2  1/2

  has the stability function

  R(µ) = 1 + µ [1/2 1/2] [[1, 0], [−µ/2, 1 − µ/2]]⁻¹ [1; 1]
       = 1 + µ [1/2 1/2] (1/(1 − µ/2)) [[1 − µ/2, 0], [µ/2, 1]] [1; 1]
       = 1 + µ/(1 − µ/2) = (1 + µ/2)/(1 − µ/2).    (8.41)

  The stability region, characterized by the inequality

  |(1 + µ/2)/(1 − µ/2)| < 1,    (8.42)

  is exactly the left half-plane (can you verify this?), i.e. the method is A-stable.
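The formula (8.30) lends itself to a direct numerical check of the examples above. The helper below evaluates R(µ) for any Butcher coefficients (A, b); the value µ = −3 used for checking the closed forms is an arbitrary illustrative choice.

```python
import numpy as np

def R(mu, A, b):
    """Stability function (8.30): R(mu) = 1 + mu * b^T (I - mu*A)^{-1} * 1."""
    s = len(b)
    return 1 + mu * (b @ np.linalg.solve(np.eye(s) - mu * A, np.ones(s)))

# explicit Euler (8.32): R(mu) = 1 + mu
A_ee, b_ee = np.array([[0.0]]), np.array([1.0])
# implicit Euler (8.38): R(mu) = 1/(1 - mu)
A_ie, b_ie = np.array([[1.0]]), np.array([1.0])
# implicit trapezoidal (8.40): R(mu) = (1 + mu/2)/(1 - mu/2)
A_tr, b_tr = np.array([[0.0, 0.0], [0.5, 0.5]]), np.array([0.5, 0.5])

vals = [R(-3.0, A_ee, b_ee), R(-3.0, A_ie, b_ie), R(-3.0, A_tr, b_tr)]
# closed forms at mu = -3: 1 + mu = -2, 1/(1 - mu) = 1/4, (1 + mu/2)/(1 - mu/2) = -1/5
```

Evaluating R on the imaginary axis (complex µ) also reproduces the A-stability boundary of the implicit trapezoidal rule, where |R(µ)| = 1.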

8.3 COLLOCATION METHODS*
There are different families of implicit RK methods, but in this course we will investigate
the family of collocation methods, which have very strong properties, and which have a
fairly intuitive interpretation. The key idea behind collocation methods is to approximate
the model trajectories via polynomials. This is intrinsically an interpolation problem.

8.3.1 POLYNOMIAL INTERPOLATION

Interpolation based on polynomials typically uses a linear combination:

p(τ, K) = Σ_{i=1}^{s} K_i ℓ_i(τ),   K = [ K_1; . . . ; K_s ]    (8.43)

of a collection of polynomials ℓ_i ∈ R built on the interval τ ∈ [0, 1] and weighted by the parameters K_{1,...,s} ∈ R^n. Note that p ∈ R^n as well. A common choice of polynomial collection ℓ_{1,...,s} is the so-called Lagrange polynomials, which are constructed as:

ℓ_i(τ) = Π_{j≠i} (τ − τ_j)/(τ_i − τ_j)    (8.44)

based on a grid τ_{1,...,s} ∈ [0, 1] (see Figure 8.4). These polynomials have two interesting features:
• For a suitable choice of grid, they are orthogonal, i.e.

  ∫_0^1 ℓ_i(τ) ℓ_j(τ) dτ = 0   if i ≠ j    (8.45)

• They satisfy:

  ℓ_i(τ_j) = 1 if i = j, and 0 otherwise    (8.46)

resulting in the following property on p:

p(τ_i, K) = K_i,   i = 1, . . . , s    (8.47)
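Properties (8.46) and (8.47) are easy to verify numerically. The grid below is an arbitrary illustrative choice, not the Gauss points introduced later in this section.

```python
import numpy as np

def lagrange(i, tau, grid):
    """Lagrange polynomial l_i of (8.44), evaluated at tau."""
    return np.prod([(tau - tj) / (grid[i] - tj)
                    for j, tj in enumerate(grid) if j != i])

grid = [0.0, 1.0/3.0, 2.0/3.0, 1.0]     # an arbitrary s = 4 grid
# property (8.46): l_i(tau_j) = 1 if i == j else 0
delta = np.array([[lagrange(i, tj, grid) for tj in grid] for i in range(4)])
# the basis also sums to one at any point, so p(tau, K) reproduces constants
one = sum(lagrange(i, 0.42, grid) for i in range(4))
```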

8.3.2 INTERPOLATION OF THE TRAJECTORIES

The key idea behind the collocation method is to approximate¹² the true model trajectories by using¹³:

ẋ(t_k + τ · ∆t) = p(τ, K) = Σ_{i=1}^{s} K_i ℓ_i(τ)    (8.48)
12: sometimes the method is presented by considering that the trajectories x(t) themselves are approximated by the polynomials instead of their time derivatives ẋ(t). This does not change much the mathematics of the collocation schemes.
13: Note that in the following developments we use the label x(.) to denote the trajectories provided by the integrator.

[Figure: four panels showing ℓ_1(τ), . . . , ℓ_4(τ) on the grid τ_1, . . . , τ_4.]

Figure 8.4: Illustration of the Lagrange polynomials for s = 4. The grid τ1,...,4 is displayed via
the vertical dotted lines (note that τ1 = 0). Property (8.46) is readily visible.

[Figure: the polynomial p(τ, K) interpolating the points K_1, . . . , K_4 on the grid τ_1, . . . , τ_4.]

Figure 8.5: Illustration of the polynomial p(τ, K) = Σ_{i=1}^{s} K_i ℓ_i(τ) for s = 4 and for an arbitrary coefficient vector K ∈ R⁴, with K_i ∈ R. One can observe the property (8.47), i.e. p(τ_i, K) = K_i.

on the interval [t_k, t_{k+1}] (corresponding to τ ∈ [0, 1]) by selecting the coefficients K_{1,...,s} ∈ R^n. Note that each interval [t_k, t_{k+1}] (for k = 0, . . . , N − 1) will have its own, possibly different coefficients

K = [ K_1; . . . ; K_s ]    (8.49)

where K_i ∈ R^n.

We should now formulate equations that let us determine the parameters K. The key idea here is to enforce the dynamics on the grid points τ_1, . . . , τ_s, i.e. we want to determine the K_{1,...,s} such that:

ẋ(t_k + ∆t · τ_j) = p(τ_j, K) = f( x(t_k + ∆t · τ_j), u(t_k + ∆t · τ_j) )    (8.50)

holds for j = 1, . . . , s (i.e. on each grid point τ_j). We first observe that

ẋ(t_k + τ_j · ∆t) = p(τ_j, K) = K_j    (8.51)

holds by construction from (8.47). In order to further specify the equations above, we need to relate x(t_k + τ_j · ∆t) (i.e. the integral of (8.48)) to the coefficients K. We do this next. We first observe that:

x(t_k + τ · ∆t) = x_k + ∫_0^{τ·∆t} ẋ(t_k + ν) dν    (8.52)

where x_k is the initial state of the interval [t_k, t_{k+1}]. Making the change of variable ν = ξ · ∆t, yielding dν = ∆t · dξ, we can then rewrite (8.52) as:

x(t_k + τ · ∆t) = x_k + ∆t ∫_0^τ ẋ(t_k + ∆t · ξ) dξ    (8.53)

We then use (8.48) in (8.53) to get:

x(t_k + τ · ∆t) ≈ x_k + ∆t ∫_0^τ Σ_{i=1}^{s} K_i ℓ_i(ξ) dξ    (8.54a)
              = x_k + ∆t Σ_{i=1}^{s} K_i ∫_0^τ ℓ_i(ξ) dξ    (8.54b)
              = x_k + ∆t Σ_{i=1}^{s} K_i L_i(τ)    (8.54c)

where we define:

L_i(τ) = ∫_0^τ ℓ_i(ξ) dξ    (8.55)

Note that L_i(τ) can easily be computed explicitly, as it is simply the integral of the polynomial ℓ_i. We can then use (8.54c) in (8.50) to get:

ẋ(t_k + ∆t · τ_j) = K_j = f( x_k + ∆t Σ_{i=1}^{s} K_i L_i(τ_j), u(t_k + ∆t · τ_j) )    (8.56)

which should hold for all grid points j = 1, . . . , s. It follows that on a given interval [tk , tk+1 ]
the collocation equations read as:
à !
Xs
K 1 = f xk + ∆t K i L i (τ1 ), u (tk + ∆t · τ1 ) (8.57a)
i =1
..
. (8.57b)
à !
s
X ¡ ¢
K j = f xk + ∆t K i L i (τ j ), u tk + ∆t · τ j (8.57c)
i =1
..
. (8.57d)
à !
s
X
K s = f xk + ∆t K i L i (τs ), u (tk + ∆t · τs ) (8.57e)
i =1
s
X
xk+1 = xk + ∆t · K i L i (1) (8.57f)
i =1

At this stage, it is very useful to compare (8.57) to the RK equations (7.26) and realise that they are identical if one defines:

a_{ji} = L_i(τ_j),   b_i = L_i(1),   c_j = τ_j    (8.58)

We also observe that once one picks the grid points τ_1, . . . , τ_s, the polynomials ℓ_{1,...,s}(τ) are defined, and so are their integrals L_i(τ). It follows that the coefficients (8.58) of the Butcher tableau are entirely defined via the grid points τ_1, . . . , τ_s. We can also understand here the role of the variables K_{1,...,s} in RK methods. Indeed, (8.56) shows us that these variables hold the state derivatives ẋ at the grid points τ_1, . . . , τ_s.

Note that not all IRK methods are collocation methods, as one could choose a dense Butcher tableau with coefficients that do not satisfy (8.58) for any choice of grid points τ_1, . . . , τ_s. Indeed, one can easily see that the Butcher tableau has s² + 2s degrees of freedom, while the grid point selection offers only s degrees of freedom, such that not all choices of coefficients a, b, c can arise from (8.58).

We now need to briefly discuss how to choose the grid points τ1 , . . . , τs effectively. There are
a few possible choices for a given number of stages s. The most commonly used for treating
ODEs is the Gauss-Legendre method, which chooses the grid points τ1 , . . . , τs as the roots
of the polynomial:
P_s(τ) = (1/s!) · d^s/dτ^s [ (τ² − τ)^s ]    (8.59)

i.e. we select the grid points τ_1, . . . , τ_s such that:

P_s(τ_j) = 0,   j = 1, . . . , s    (8.60)
This selection rule may sound mysterious. Its motivation, though, has solid and deep roots
in the Gauss quadrature theory, but we will not discuss this further here. Equipped with
this rule, we have a procedure to build IRK methods (i.e. their Butcher tableau):
1. Select the number of stages s

2. Find the roots of (8.59) to get τ1 , . . . , τs

3. Build the polynomials ℓ1 (τ), . . . , ℓs (τ) according to (8.44)

4. Build their integrals L i (τ) according to (8.55)

5. Evaluate a_{ji} = L_i(τ_j) and b_i = L_i(1) for i, j = 1, . . . , s

6. Build a computer code to solve the collocation equations (8.57)


This procedure builds an IRK method that is of order o = 2s and A-stable.

An additional benefit of using polynomial interpolation for simulating models is that the polynomials provide us with an approximation of the state trajectories at any point in time. I.e., unlike other E/IRK methods, which deliver only the states on the time grid t_{0,...,N}, collocation methods can be prompted to deliver the states in-between, using (8.53), i.e. the true state trajectory (say x^true) is approximated at any time point in [0, T] by:

x^true(t_k + ∆t · τ) ≈ x_k + ∆t ∫_0^τ ẋ(t_k + ∆t · ξ) dξ = x_k + ∆t Σ_{i=1}^{s} K_i ∫_0^τ ℓ_i(ξ) dξ    (8.61)

Note that the order of approximation o = 2s unfortunately only holds on the grid t_{0,...,N}, such that this approximation is typically a bit worse than order 2s.

8.4 RK METHODS FOR IMPLICIT ODES


We have so far discussed numerical integration methods for models in the explicit ODE form:

ẋ = f(x, u)    (8.62)

Some models, however, are easier to treat in an implicit ODE form:

F(ẋ, x, u) = 0    (8.63)

This is e.g. the case for models of complex mechanical systems arising in Lagrange mechanics, taking the form (see (3.40)), where we use v ≡ q̇:

[ I   0    ] [ q̇ ]   [ v                              ]
[ 0   W(q) ] [ v̇ ] = [ Q + ∇_q L − (∂/∂q (W(q) v)) v ]    (8.64)

with [q̇; v̇] ≡ ẋ,

and where the symbolic inverse of the matrix W(q) is very complex to write. In that case, it is best to avoid trying to form the explicit version of (8.64), i.e.

[ q̇ ]   [ I   0    ]⁻¹ [ v                              ]
[ v̇ ] = [ 0   W(q) ]   [ Q + ∇_q L − (∂/∂q (W(q) v)) v ]    (8.65)

and to work with the implicit form (8.64) directly. One trivial, but not necessarily effective approach would be to use an explicit RK method while forming the matrix inverse numerically in (8.65) at every integrator step.

A usually more effective approach is to use an implicit RK method directly on the implicit equation (8.64). Note that we can easily write (8.64) as (8.63), using:

F(ẋ, x) = [ I   0    ] ẋ − [ v                              ] = 0    (8.66)
          [ 0   W(q) ]      [ Q + ∇_q L − (∂/∂q (W(q) v)) v ]

We should then focus on treating (8.63) numerically in implicit integrators. Interestingly enough, the IRK equations (7.26) apply almost directly to implicit models, using only a minor modification. For easy reference, the RK equations for an explicit model of the form ẋ = f(x, u) are repeated here:

K_1 = f( x_k + ∆t Σ_{j=1}^{s} a_{1j} K_j, u(t_k + c_1 ∆t) )    (8.67a)
  ⋮
K_i = f( x_k + ∆t Σ_{j=1}^{s} a_{ij} K_j, u(t_k + c_i ∆t) )    (8.67b)
  ⋮
K_s = f( x_k + ∆t Σ_{j=1}^{s} a_{sj} K_j, u(t_k + c_s ∆t) )    (8.67c)

x_{k+1} = x_k + ∆t Σ_{i=1}^{s} b_i K_i    (8.67d)

Recall that the variables K_{1,...,s} in collocation RK methods hold the state derivatives ẋ at the grid points τ_1, . . . , τ_s. The modification of the IRK equations for treating (8.63) therefore reads as:
à !
s
X
F K 1 , xk + ∆t a1j K j , u(tk + c1 ∆t ) = 0 (8.68a)
j =1
..
.
à !
s
X
F K i , xk + ∆t ai j K j , u(tk + ci ∆t ) = 0 (8.68b)
j =1
..
.
à !
s
X
F K s , xk + ∆t a s j K j , u(tk + c s ∆t ) = 0 (8.68c)
j =1
X s
xk+1 = xk + ∆t bj Kj (8.68d)
j =1

The equations to solve are then

r(K, x_k, u(.)) := [
  F( K_1, x_k + ∆t Σ_{j=1}^{s} a_{1j} K_j, u(t_k + c_1 ∆t) )
  ⋮
  F( K_i, x_k + ∆t Σ_{j=1}^{s} a_{ij} K_j, u(t_k + c_i ∆t) )
  ⋮
  F( K_s, x_k + ∆t Σ_{j=1}^{s} a_{sj} K_j, u(t_k + c_s ∆t) )
] = 0    (8.69)

The rest works as IRK schemes for explicit ODEs (see algorithm “IRK for explicit ODEs").

Here it is useful to reflect on the IRK method in comparison to using an ERK method for solving the implicit ODE (8.63). Indeed, since the ODE does not readily deliver the state derivative ẋ, even when using an explicit method one would have to solve the implicit equation for ẋ at every stage of the RK step, typically via a Newton method. We can then observe that

• Deploying an explicit RK method would then require solving s implicit equations of size n (the size of the state space), in order to get an order o = s (for s ≤ 4).

• Deploying an implicit RK method requires solving one implicit equation of size n · s, in order to get an order o = 2s (for collocation methods).

Comparing formally the efficiency of the two approaches can be tricky, but the bottom line here is that if the ODE model is implicit, using an implicit RK method to perform the simulation can be a very good choice, independently of questions of stiffness.

8.5 RK METHODS FOR IMPLICIT DAES
In order to close this chapter, it remains to be discussed how to treat DAEs numerically. We have in fact already all the tools required to tackle this problem; we only need to clarify a few points carefully. Let us consider DAEs in the fully implicit form (6.22), recalled here:

F (ẋ, z , x, u) = 0 (8.70)

DAEs can be treated very similarly to implicit ODEs, but we need to understand how the
algebraic variables z are treated here.

At every time step, the algebraic variables z ought to be considered as “free" variables that
need to be determined independently of the other time steps, and they need to be adjusted
so that (8.70) holds. More specifically, while an implicit ODE ought to be treated via impos-
ing
à !
s
X
F K i , xk + ∆t ai j K j , u(tk + ci ∆t ) = 0 (8.71)
j =1

for i = 1, . . . , s of each RK step xk → xk+1 , equivalently for a DAE of the form (8.70), we ought
to impose
à !
s
X
F K i , z i , xk + ∆t ai j K j , u(tk + ci ∆t ) = 0 (8.72)
j =1

for i = 1, . . . , s of each RK step xk → xk+1 . Note that here the variables z i ∈ Rm operate as
“unknown" that must be determined alongside the variables K i ∈ Rn . The complete IRK
equations for DAEs then read as:
\[
r(w, x_k, u(\cdot)) := \begin{bmatrix}
F\big(K_1,\, z_1,\, x_k + \Delta t \sum_{j=1}^{s} a_{1j} K_j,\, u(t_k + c_1 \Delta t)\big) \\
\vdots \\
F\big(K_i,\, z_i,\, x_k + \Delta t \sum_{j=1}^{s} a_{ij} K_j,\, u(t_k + c_i \Delta t)\big) \\
\vdots \\
F\big(K_s,\, z_s,\, x_k + \Delta t \sum_{j=1}^{s} a_{sj} K_j,\, u(t_k + c_s \Delta t)\big)
\end{bmatrix} = 0 \tag{8.73}
\]

where we gathered the K i , z i variables in:



\[
w = \begin{bmatrix} K_1 \\ \vdots \\ K_s \\ z_1 \\ \vdots \\ z_s \end{bmatrix} \tag{8.74}
\]

Note that if x ∈ Rn and z ∈ Rm , then the function F returns vectors in Rn+m . It follows that w ∈ Rs(n+m) . Moreover, the “residual” function r in (8.73) returns a vector in Rs(n+m) , and solving (8.73) provides the variables w . The set of equations (8.73) must be solved at each RK step xk → xk+1 in order to build the complete simulation on the time interval [0, T ]; as previously, this is done using the Newton method.

For the sake of clarity, let us write down explicitly the pseudo-code required to perform the
numerical simulation of a fully implicit DAE model.

IRK for Fully Implicit DAEs


Algorithm:

Input: Initial conditions x0 , input profile u(.), Butcher tableau, step size ∆t
for k = 0, . . ., N − 1 do
Guess for w (one can e.g. use K i = xk , z i = 0)
while kr (w , xk , u(.))k > Tol do
Compute the solution ∆w to the linear system:

∂r (w , xk , u(.))
∆w + r (w , xk , u(.)) = 0 (8.75)
∂w
with r given by (8.73). Update:

w ← w + α∆w (8.76)

for some step size α ∈]0, 1] (a full step α = 1 generally works for implicit
integrators)
Take RK step:
\[
x_{k+1} = x_k + \Delta t \sum_{i=1}^{s} b_i K_i \tag{8.77}
\]

return x1,...,N
Note that the remarks at the end of section 8.4 also hold here, i.e. in order to deploy an
explicit RK method on the DAE (8.70), one would have to solve it for ẋ and z at every stage
of the RK steps, typically requiring the deployment of a Newton method.
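To make the procedure concrete, here is a minimal numerical sketch of the algorithm above, for a hypothetical scalar index-1 DAE F(ẋ, z, x) = [ẋ + x − z; z − x²] (no input u, all problem data are assumptions made for this example), using the 1-stage IRK with a₁₁ = b₁ = c₁ = 1 (implicit Euler) and a finite-difference Jacobian inside the Newton loop:

```python
import numpy as np

# Toy fully implicit DAE F(xdot, z, x) = 0 (a hypothetical example, no input u):
#   xdot + x - z = 0    (i.e. xdot = z - x)
#   z - x**2     = 0    (algebraic equation; the DAE is of index 1)
def F(xdot, z, x):
    return np.array([xdot + x - z, z - x**2])

def irk_dae_step(xk, dt, tol=1e-12):
    """One RK step for the fully implicit DAE, using the 1-stage IRK with
    a11 = b1 = c1 = 1 (implicit Euler), i.e. the residual (8.73) with s = 1."""
    w = np.array([0.0, 0.0])           # stacked unknowns w = [K1, z1], cf. (8.74)

    def r(w):                          # residual (8.72) for s = 1
        K1, z1 = w
        return F(K1, z1, xk + dt * K1)

    for _ in range(50):                # Newton iteration on r(w) = 0
        if np.linalg.norm(r(w)) < tol:
            break
        # central-difference Jacobian of r (a real implementation would use
        # exact derivatives, e.g. from an AD tool such as CasADi)
        J = np.zeros((2, 2))
        for j in range(2):
            e = np.zeros(2); e[j] = 1e-7
            J[:, j] = (r(w + e) - r(w - e)) / 2e-7
        w = w - np.linalg.solve(J, r(w))   # full Newton step (alpha = 1)

    K1, z1 = w
    return xk + dt * K1, z1            # RK step (8.77) and the consistent z

# simulate: the algebraic variable must satisfy z = x**2 along the trajectory
x, dt = 0.5, 0.01
for _ in range(100):
    x, z = irk_dae_step(x, dt)
```

At convergence, the second residual component enforces the algebraic equation, so the returned z is consistent with the new state by construction.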

8.5.1 RK METHOD FOR SEMI-EXPLICIT DAE MODELS

Adapting the approach presented above to the semi-explicit DAE case is fairly straightfor-
ward. Let us do it here nonetheless. Recall that semi-explicit DAEs are in the form:

ẋ = f (z , x, u) (8.78a)
0 = g (z , x, u) (8.78b)

Here we ought to impose at each i = 1, . . . , s:
\begin{align}
K_i &= f\Big(z_i,\; x_k + \Delta t \sum_{j=1}^{s} a_{ij} K_j,\; u(t_k + c_i \Delta t)\Big) \tag{8.79a} \\
0 &= g\Big(z_i,\; x_k + \Delta t \sum_{j=1}^{s} a_{ij} K_j,\; u(t_k + c_i \Delta t)\Big) \tag{8.79b}
\end{align}

hence we form the RK equations:


\[
r(w, x_k, u(\cdot)) := \begin{bmatrix}
f\big(z_1,\, x_k + \Delta t \sum_{j=1}^{s} a_{1j} K_j,\, u(t_k + c_1 \Delta t)\big) - K_1 \\
g\big(z_1,\, x_k + \Delta t \sum_{j=1}^{s} a_{1j} K_j,\, u(t_k + c_1 \Delta t)\big) \\
\vdots \\
f\big(z_i,\, x_k + \Delta t \sum_{j=1}^{s} a_{ij} K_j,\, u(t_k + c_i \Delta t)\big) - K_i \\
g\big(z_i,\, x_k + \Delta t \sum_{j=1}^{s} a_{ij} K_j,\, u(t_k + c_i \Delta t)\big) \\
\vdots \\
f\big(z_s,\, x_k + \Delta t \sum_{j=1}^{s} a_{sj} K_j,\, u(t_k + c_s \Delta t)\big) - K_s \\
g\big(z_s,\, x_k + \Delta t \sum_{j=1}^{s} a_{sj} K_j,\, u(t_k + c_s \Delta t)\big)
\end{bmatrix} = 0 \tag{8.80}
\]

The same procedure as described above is then applied to perform the simulation.

Let us now make a connection between the problem of DAE differential index and numer-
ical simulations. Recall that a semi-explicit DAE (8.78) is of index 1 (i.e. an “easy" DAE) if
∂g
the Jacobian ∂z is full rank (i.e. invertible) on the system trajectories. In order to make a
simple point, let us consider an IRK method having a single stage s = 1 to treat the DAE. In
that simple case, we observe that the RK equations (8.80) boil down to:
\[
r(w, x_k, u(\cdot)) = \begin{bmatrix}
f\big(z_1,\, x_k + \Delta t\, a_{11} K_1,\, u(t_k + c_1 \Delta t)\big) - K_1 \\
g\big(z_1,\, x_k + \Delta t\, a_{11} K_1,\, u(t_k + c_1 \Delta t)\big)
\end{bmatrix} = 0 \tag{8.81}
\]

and the Jacobian of these equations reads as:

\[
\frac{\partial r(w, x_k, u(\cdot))}{\partial w}
= \begin{bmatrix} \dfrac{\partial r}{\partial K_1} & \dfrac{\partial r}{\partial z_1} \end{bmatrix}
= \left.\begin{bmatrix}
\Delta t\, a_{11} \dfrac{\partial f}{\partial x} - I & \dfrac{\partial f}{\partial z} \\
\Delta t\, a_{11} \dfrac{\partial g}{\partial x} & \dfrac{\partial g}{\partial z}
\end{bmatrix}\right|_{\;x = x_k + \Delta t\, a_{11} K_1,\; z = z_1,\; u = u(t_k + c_1 \Delta t)} \tag{8.82}
\]

Suppose now that we let ∆t → 0, i.e. we investigate the behavior of the method as its accuracy becomes arbitrarily high. The Jacobian matrix then tends to:

\[
\lim_{\Delta t \to 0} \frac{\partial r(w, x_k, u(\cdot))}{\partial w}
= \begin{bmatrix} -I & \frac{\partial f}{\partial z} \\ 0 & \frac{\partial g}{\partial z} \end{bmatrix} \tag{8.83}
\]

which is full rank (invertible) if and only if ∂g /∂z is full rank. Recall that we need to deploy a Newton method in order to solve r (w , xk , u(.)) = 0, which requires the Jacobian ∂r (w , xk , u(.))/∂w to be invertible, and therefore requires ∂g /∂z to be full rank. We can therefore conclude from this simple example that

The IRK methods we have investigated should not be deployed on DAEs


of index above 1.

Note that if ∆t > 0, the Jacobian matrix may nonetheless be full rank, and one may conclude that the method is fine. However, the small analysis above shows that “something is intrinsically wrong” when deploying the IRK schemes we have studied on high-index DAEs (index above 1). Further analysis of this question is left outside the scope of this course. Before closing this chapter, it ought to be underlined that one can deploy some of the RK schemes we have investigated on high-index DAEs and obtain a trajectory from the method; hence, there may be no clear sign that “something is not right”. The trajectories obtained, however, are typically nonsensical. One ought therefore to check the index of the DAE before trusting the output of a classic RK method.
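The argument above can be checked numerically. The sketch below (assuming two toy scalar semi-explicit DAEs, one with g(z, x) = z − x so that ∂g/∂z = 1, and one with g(z, x) = x so that ∂g/∂z = 0; both are assumptions made for the example) builds the limiting Jacobian (8.83) and checks its rank:

```python
import numpy as np

# Limiting IRK Jacobian (8.83) for a scalar semi-explicit DAE (n = m = 1):
#   [ -1   df/dz ]
#   [  0   dg/dz ]
# It is invertible iff dg/dz is nonzero, i.e. iff the DAE is of index 1.
def limiting_jacobian(df_dz, dg_dz):
    return np.array([[-1.0, df_dz],
                     [ 0.0, dg_dz]])

# index-1 example: g(z, x) = z - x  =>  dg/dz = 1
rank_index1 = np.linalg.matrix_rank(limiting_jacobian(df_dz=1.0, dg_dz=1.0))
# higher-index example: g(z, x) = x  =>  dg/dz = 0
rank_high = np.linalg.matrix_rank(limiting_jacobian(df_dz=1.0, dg_dz=0.0))

print(rank_index1, rank_high)   # 2 1: for the high-index DAE, the Newton
                                # linear algebra becomes singular as dt -> 0
```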

9 SENSITIVITY OF SIMULATIONS
Our last chapter will deal with a topic that is crucial in many fields of engineering that
make use of numerical simulations. In many problems it is important not only to compute
accurate and reliable simulations for an ODE or a DAE, but also to get some information
on how this simulation would be affected by a change in the model parameters. To be a bit
more formal, let us consider an ODE depending on a fixed parameter p:
ẋ = f (x, p),    x(0) = x0                                        (9.1)

Note that we will omit the inputs in our developments here, and discuss in the end what
role they play. Suppose that we have computed a simulation of (9.1), i.e. we have a se-
quence:

xk ≈ x(tk ) (9.2)

on a given time grid t0 , . . . , t N . Clearly, if one were to modify “a bit” the initial conditions x0 and/or the parameters p, then the sequence x1,...,N would be affected. For nonlinear dynamics f , one cannot assess the impact of changing the parameters and/or initial conditions x0 on the simulation without redoing the entire simulation. However, one can use first-order arguments and compute the approximate effect of changing the parameters and initial conditions, i.e. one can try to compute:

\[
\frac{\partial x_k}{\partial x_0}, \qquad \frac{\partial x_k}{\partial p} \tag{9.3}
\]

for k = 1, . . . , N , evaluated at the given x0 , p. The sensitivity of simulations then deals with computing these derivatives.

9.1 VARIATIONAL APPROACH


Before giving the answer to the question raised above, let us investigate the derivative of
the trajectories x(t ) generated by the ODE (9.1) with respect to p, x0 , i.e. let us forget for
now the numerical aspect of the problem. The question of computing

\[
\frac{\partial x(t)}{\partial p}, \qquad \frac{\partial x(t)}{\partial x_0} \tag{9.4}
\]

can appear tricky, but let us consider the following construction. Suppose that we know the
trajectory x(t ) associated to (9.1). We observe that:

\begin{align}
\frac{d}{dt} \frac{\partial x(t)}{\partial x_0} &= \frac{\partial \dot{x}(t)}{\partial x_0} = \frac{\partial f}{\partial x} \frac{\partial x(t)}{\partial x_0} \tag{9.5a} \\
\frac{d}{dt} \frac{\partial x(t)}{\partial p} &= \frac{\partial \dot{x}(t)}{\partial p} = \frac{\partial f}{\partial x} \frac{\partial x(t)}{\partial p} + \frac{\partial f}{\partial p} \tag{9.5b}
\end{align}

where all Jacobians are evaluated at x(t ) and p. We moreover observe that

\[
\frac{\partial x(0)}{\partial x_0} = I, \qquad \frac{\partial x(0)}{\partial p} = 0 \tag{9.6}
\]

Let us label:
\[
A(t) = \frac{\partial x(t)}{\partial x_0}, \qquad B(t) = \frac{\partial x(t)}{\partial p} \tag{9.7}
\]

We can then write the dynamic system for A and B using (9.5)-(9.6):

\begin{align}
\dot{A}(t) &= \frac{\partial f}{\partial x} A(t), & A(0) &= I \tag{9.8a} \\
\dot{B}(t) &= \frac{\partial f}{\partial x} B(t) + \frac{\partial f}{\partial p}, & B(0) &= 0 \tag{9.8b}
\end{align}

We then observe that (9.8) is a linear time-varying system with a matrix-based state space
(A and B). It is often referred to as the variational equations associated to (9.1). In princi-
ple, one could answer the sensitivity question raised above by performing the integration
of the dynamic (9.1) together with (9.8) using any numerical integration method.
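As an illustration, the sketch below integrates (9.1) together with the variational equations (9.8) using an explicit Euler scheme, for the toy scalar ODE ẋ = p·x (an assumption made for this example), whose exact sensitivities are known in closed form:

```python
import numpy as np

# Variational approach: integrate the ODE (9.1) together with (9.8).
# Toy scalar ODE (an assumption for this example): xdot = p*x,
# so df/dx = p and df/dp = x.
p, x0, dt, N = -0.7, 1.5, 1e-3, 1000      # integrate on [0, 1]

x, A, B = x0, 1.0, 0.0                    # A(0) = I, B(0) = 0, cf. (9.6)
for _ in range(N):
    # one explicit Euler step on the augmented (x, A, B) system
    x, A, B = (x + dt * p * x,
               A + dt * p * A,            # Adot = (df/dx) A
               B + dt * (p * B + x))      # Bdot = (df/dx) B + df/dp

T = N * dt
# exact sensitivities of the true trajectory x(t) = x0 * exp(p*t):
A_exact = np.exp(p * T)                   # dx(T)/dx0 = exp(p*T)
B_exact = T * x0 * np.exp(p * T)          # dx(T)/dp  = T*x0*exp(p*T)
print(abs(A - A_exact), abs(B - B_exact)) # both errors are O(dt): inexact
```

The remaining mismatch with the exact derivatives illustrates the point made above: the integrated variational equations only approximate the sensitivities of the true trajectories.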

Such an approach, however, would not deliver exact derivatives. Indeed, the variational
equations are expressing the sensitivity of the trajectories of the system, whereas the se-
quence x1,...,N arises from numerical simulation and it approximates the trajectories. More-
over, a numerical integration of (9.8) is also an approximation of the trajectories A(t ), B(t ).
As a result the sensitivities generated can be quite inaccurate.

For these reasons, the variational approach is not very often used to treat the sensitivity
question, and rather replaced by so called Algorithmic Differentiation methods, which seek
to compute exact derivatives of the inexact simulation of the dynamics. Let us briefly study
these techniques next.

9.2 ALGORITHMIC DIFFERENTIATION OF THE EXPLICIT EULER SCHEME


As usual, it will be easiest to first consider the problem of sensitivity computation for the
explicit Euler scheme:
xk+1 = xk + ∆t f (xk , p)                                          (9.9)

Computing (9.3) for the simulation (9.9) boils down to a careful application of the chain
rule on (9.9). We observe that:
\begin{align}
\frac{\partial x_{k+1}}{\partial x_0} &= \frac{\partial x_k}{\partial x_0} + \Delta t \frac{\partial f}{\partial x_k} \frac{\partial x_k}{\partial x_0}, & \frac{\partial x_0}{\partial x_0} &= I \tag{9.10a} \\
\frac{\partial x_{k+1}}{\partial p} &= \frac{\partial x_k}{\partial p} + \Delta t \left( \frac{\partial f}{\partial x_k} \frac{\partial x_k}{\partial p} + \frac{\partial f}{\partial p} \right), & \frac{\partial x_0}{\partial p} &= 0 \tag{9.10b}
\end{align}

where all Jacobians are evaluated on the baseline simulation x1,...,N and for the parame-
ter value p. We observe that (9.10) is in fact a discrete linear dynamic system, assigning
dynamics to the matrices (9.3). Indeed, let us label

\[
A_k = \frac{\partial x_k}{\partial x_0}, \qquad B_k = \frac{\partial x_k}{\partial p} \tag{9.11}
\]

And rewrite (9.10) as:


\begin{align}
A_{k+1} &= \left( I + \Delta t \frac{\partial f}{\partial x_k} \right) A_k, & A_0 &= I \tag{9.12a} \\
B_{k+1} &= \left( I + \Delta t \frac{\partial f}{\partial x_k} \right) B_k + \Delta t \frac{\partial f}{\partial p}, & B_0 &= 0 \tag{9.12b}
\end{align}

This procedure is labelled Algorithmic Differentiation. An explicit Euler integrator with


sensitivity generation can then be implemented using the following code:
Explicit Euler with sensitivity generation
Algorithm:

Input: Initial conditions x0 , parameter p, ∆t


Set A 0 = I , B 0 = 0
for k = 0, . . ., N − 1 do
\begin{align}
x_{k+1} &= x_k + \Delta t\, f(x_k, p) \tag{9.13a} \\
A_{k+1} &= \left( I + \Delta t \left.\frac{\partial f}{\partial x}\right|_{x_k, p} \right) A_k \tag{9.13b} \\
B_{k+1} &= \left( I + \Delta t \left.\frac{\partial f}{\partial x}\right|_{x_k, p} \right) B_k + \Delta t \left.\frac{\partial f}{\partial p}\right|_{x_k, p} \tag{9.13c}
\end{align}

return x1,...,N , A 1,...,N and B 1,...,N


Note that if only the final state x N is needed, together with the associated sensitivities A N
and B N , then a lot of memory can be saved by reusing the memory cells assigned to storing
x, A and B.

In contrast to the variational approach, the sensitivities delivered by processing the discrete dynamics (9.12) are the exact (up to machine precision) derivatives of the sequence x1,...,N delivered by the explicit Euler scheme, even though that sequence is only an approximation of the true trajectories.

An attentive reader, however, may have observed that if one treats both (9.1) and (9.8) using an explicit Euler scheme, then one recovers the algorithm above. Hence, in this specific case, the variational equations and Algorithmic Differentiation coincide. This observation does not hold in general.
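A minimal sketch of the algorithm above, for assumed toy dynamics f(x, p) = −p·x² (everything here is an assumption made for the example), illustrates the exactness claim: the computed sensitivities match central finite differences of the Euler map itself to near machine precision:

```python
import numpy as np

# toy nonlinear dynamics (an assumption made for this example)
def f(x, p):    return -p * x**2
def dfdx(x, p): return -2 * p * x
def dfdp(x, p): return -x**2

def euler_with_sens(x0, p, dt, N):
    """Explicit Euler with sensitivity generation, cf. (9.13)."""
    x, A, B = x0, 1.0, 0.0            # A0 = I, B0 = 0
    for _ in range(N):
        # all Jacobians are evaluated at the current iterate (xk, p);
        # the tuple right-hand side is fully evaluated before assignment
        x, A, B = (x + dt * f(x, p),
                   (1 + dt * dfdx(x, p)) * A,
                   (1 + dt * dfdx(x, p)) * B + dt * dfdp(x, p))
    return x, A, B

x0, p, dt, N = 1.0, 0.8, 0.01, 200
xN, A, B = euler_with_sens(x0, p, dt, N)

# the sensitivities are exact derivatives of the numerical sequence itself:
eps = 1e-6
fd_A = (euler_with_sens(x0 + eps, p, dt, N)[0]
        - euler_with_sens(x0 - eps, p, dt, N)[0]) / (2 * eps)
fd_B = (euler_with_sens(x0, p + eps, dt, N)[0]
        - euler_with_sens(x0, p - eps, dt, N)[0]) / (2 * eps)
print(abs(A - fd_A), abs(B - fd_B))   # both tiny (finite-difference noise level)
```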

9.3 ALGORITHMIC DIFFERENTIATION OF EXPLICIT RUNGE-KUTTA METHODS
The explicit RK equations for (9.1) arising from a lower-diagonal Butcher tableau can be
written as:
\begin{align}
K_1 &= f(x_k, p) \tag{9.14a} \\
&\;\;\vdots \notag \\
K_i &= f\Big(x_k + \Delta t \sum_{j=1}^{i-1} a_{ij} K_j,\; p\Big) \tag{9.14b} \\
&\;\;\vdots \notag \\
K_s &= f\Big(x_k + \Delta t \sum_{j=1}^{s-1} a_{sj} K_j,\; p\Big) \tag{9.14c} \\
x_{k+1} &= x_k + \Delta t \sum_{j=1}^{s} b_j K_j \tag{9.14d}
\end{align}

(observe that the summations in the formation of the K 1,...,s entail by construction an ex-
plicit method). Although it is a bit tedious, one can apply chain rules to (9.14) in order to
extract the sensitivities. Indeed, one can observe that:

• The differentiation of (9.14d) reads as:


\begin{align}
\frac{\partial x_{k+1}}{\partial x_0} &= \left( I + \Delta t \sum_{j=1}^{s} b_j \frac{\partial K_j}{\partial x_k} \right) \frac{\partial x_k}{\partial x_0} \tag{9.15a} \\
\frac{\partial x_{k+1}}{\partial p} &= \left( I + \Delta t \sum_{j=1}^{s} b_j \frac{\partial K_j}{\partial x_k} \right) \frac{\partial x_k}{\partial p} + \Delta t \sum_{j=1}^{s} b_j \frac{\partial K_j}{\partial p} \tag{9.15b}
\end{align}

• The differentiation of (9.14a)-(9.14c) reads as (here ∂K i /∂p is taken with xk held fixed; the dependence of xk on p is already accounted for by the first term of (9.15b)):

\begin{align}
\frac{\partial K_i}{\partial x_k} &= \left.\frac{\partial f}{\partial x}\right|_{x_k + \Delta t \sum_{j=1}^{i-1} a_{ij} K_j,\; p} \cdot \left( I + \Delta t \sum_{j=1}^{i-1} a_{ij} \frac{\partial K_j}{\partial x_k} \right) \tag{9.16a} \\
\frac{\partial K_i}{\partial p} &= \left.\frac{\partial f}{\partial x}\right|_{x_k + \Delta t \sum_{j=1}^{i-1} a_{ij} K_j,\; p} \cdot \Delta t \sum_{j=1}^{i-1} a_{ij} \frac{\partial K_j}{\partial p} + \left.\frac{\partial f}{\partial p}\right|_{x_k + \Delta t \sum_{j=1}^{i-1} a_{ij} K_j,\; p} \tag{9.16b}
\end{align}

A pseudo-code deploying the above principle reads as follows.

Explicit RK scheme with sensitivity generation
Algorithm:

Input: Initial conditions x0 , parameter p, ∆t


Set A 0 = I , B 0 = 0
for k = 0, . . ., N − 1 do
for i = 1, . . ., s do
\[
K_i = f\Big(x_k + \Delta t \sum_{j=1}^{i-1} a_{ij} K_j,\; p\Big) \tag{9.17a}
\]
\[
\frac{\partial K_i}{\partial x_k} = \left.\frac{\partial f}{\partial x}\right|_{x_k + \Delta t \sum_{j=1}^{i-1} a_{ij} K_j,\; p} \left( I + \Delta t \sum_{j=1}^{i-1} a_{ij} \frac{\partial K_j}{\partial x_k} \right) \tag{9.17b}
\]
\[
\frac{\partial K_i}{\partial p} = \left.\frac{\partial f}{\partial x}\right|_{x_k + \Delta t \sum_{j=1}^{i-1} a_{ij} K_j,\; p} \Delta t \sum_{j=1}^{i-1} a_{ij} \frac{\partial K_j}{\partial p} + \left.\frac{\partial f}{\partial p}\right|_{x_k + \Delta t \sum_{j=1}^{i-1} a_{ij} K_j,\; p} \tag{9.17c}
\]

where, as in (9.16), ∂K i /∂p is taken with xk held fixed. Then take the RK step with sensitivities:

\[
x_{k+1} = x_k + \Delta t \sum_{j=1}^{s} b_j K_j \tag{9.18a}
\]
\[
A_{k+1} = \left( I + \Delta t \sum_{j=1}^{s} b_j \frac{\partial K_j}{\partial x_k} \right) A_k \tag{9.18b}
\]
\[
B_{k+1} = \left( I + \Delta t \sum_{j=1}^{s} b_j \frac{\partial K_j}{\partial x_k} \right) B_k + \Delta t \sum_{j=1}^{s} b_j \frac{\partial K_j}{\partial p} \tag{9.18c}
\]

return x1,...,N , A 1,...,N and B 1,...,N

Note that these computations are tedious to derive and to program. Fortunately, very efficient Algorithmic Differentiation tools such as CasADi can perform these operations automatically.
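The chain rules above can nonetheless be sketched by hand. The example below (assuming toy scalar dynamics f(x, p) = sin(p·x) and the classic RK4 tableau; all problem data are assumptions made for this example) propagates the stage sensitivities ∂Kᵢ/∂xₖ and ∂Kᵢ/∂p (the latter with xₖ held fixed) and checks the result against finite differences of the RK4 map:

```python
import numpy as np

# classic RK4 Butcher tableau (strictly lower triangular => explicit)
a = np.array([[0, 0, 0, 0], [0.5, 0, 0, 0], [0, 0.5, 0, 0], [0, 0, 1, 0]])
b = np.array([1, 2, 2, 1]) / 6.0

def f(x, p):    return np.sin(p * x)      # toy scalar dynamics (assumption)
def dfdx(x, p): return p * np.cos(p * x)
def dfdp(x, p): return x * np.cos(p * x)

def rk4_step_with_sens(x, A, B, p, dt):
    """One explicit RK step with sensitivities, cf. (9.17)-(9.18)."""
    s = 4
    K   = np.zeros(s)   # stage values K_i
    dKx = np.zeros(s)   # dK_i / dx_k
    dKp = np.zeros(s)   # dK_i / dp (with x_k held fixed)
    for i in range(s):
        xi  = x + dt * sum(a[i, j] * K[j]   for j in range(i))
        dxi = 1 + dt * sum(a[i, j] * dKx[j] for j in range(i))   # d xi / d x_k
        dpi =     dt * sum(a[i, j] * dKp[j] for j in range(i))   # d xi / d p
        K[i]   = f(xi, p)
        dKx[i] = dfdx(xi, p) * dxi                     # chain rule for the stage
        dKp[i] = dfdx(xi, p) * dpi + dfdp(xi, p)
    x_next = x + dt * b @ K
    A_next = (1 + dt * b @ dKx) * A
    B_next = (1 + dt * b @ dKx) * B + dt * b @ dKp
    return x_next, A_next, B_next

x0, p, dt, N = 0.3, 1.2, 0.05, 40
def simulate(x0, p):
    x, A, B = x0, 1.0, 0.0
    for _ in range(N):
        x, A, B = rk4_step_with_sens(x, A, B, p, dt)
    return x, A, B

xN, A, B = simulate(x0, p)
eps = 1e-6
fd_A = (simulate(x0 + eps, p)[0] - simulate(x0 - eps, p)[0]) / (2 * eps)
fd_B = (simulate(x0, p + eps)[0] - simulate(x0, p - eps)[0]) / (2 * eps)
print(abs(A - fd_A), abs(B - fd_B))   # both tiny (finite-difference noise level)
```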

9.4 SENSITIVITY OF IMPLICIT RUNGE-KUTTA STEPS


Note that the algorithm above can only work for explicit RK methods, i.e. RK methods
having a lower-diagonal Butcher tableau. Recall that the implicit RK equations generally

read as:
\begin{align}
K_1 &= f\Big(x_k + \Delta t \sum_{j=1}^{s} a_{1j} K_j,\; p\Big) \tag{9.19a} \\
&\;\;\vdots \notag \\
K_i &= f\Big(x_k + \Delta t \sum_{j=1}^{s} a_{ij} K_j,\; p\Big) \tag{9.19b} \\
&\;\;\vdots \notag \\
K_s &= f\Big(x_k + \Delta t \sum_{j=1}^{s} a_{sj} K_j,\; p\Big) \tag{9.19c} \\
x_{k+1} &= x_k + \Delta t \sum_{j=1}^{s} b_j K_j \tag{9.19d}
\end{align}

Interestingly enough, the sensitivities of an implicit RK scheme are fairly straightforward to formulate and compute, in fact easier than for an explicit RK scheme. Recall that the implicit equations (9.19a)-(9.19c) are solved via a Newton method, i.e. we write (8.13), recalled here:
\[
r(K, x_k, p) := \begin{bmatrix}
f\big(x_k + \Delta t \sum_{j=1}^{s} a_{1j} K_j,\, p\big) - K_1 \\
\vdots \\
f\big(x_k + \Delta t \sum_{j=1}^{s} a_{ij} K_j,\, p\big) - K_i \\
\vdots \\
f\big(x_k + \Delta t \sum_{j=1}^{s} a_{sj} K_j,\, p\big) - K_s
\end{bmatrix} = 0 \tag{9.20}
\]
and deploy a Newton method on the “unknown”

\[
K = \begin{bmatrix} K_1 \\ \vdots \\ K_s \end{bmatrix} \tag{9.21}
\]
i.e. we iterate (see algorithm “IRK for explicit ODEs”):

\[
K \leftarrow K - \alpha \left( \frac{\partial r(K, x_k, p)}{\partial K} \right)^{-1} r(K, x_k, p) \tag{9.22}
\]
Recall the Implicit Function Theorem (Theorem 6), and in particular equation (4.40). Here one ought to view

r (K , xk , p) = 0                                                (9.23)

as making K implicitly a function of xk , p, and use equation (4.40) to deduce that:
\begin{align}
\frac{\partial K}{\partial p} &= - \left( \frac{\partial r(K, x_k, p)}{\partial K} \right)^{-1} \frac{\partial r(K, x_k, p)}{\partial p} \tag{9.24a} \\
\frac{\partial K}{\partial x_k} &= - \left( \frac{\partial r(K, x_k, p)}{\partial K} \right)^{-1} \frac{\partial r(K, x_k, p)}{\partial x_k} \tag{9.24b}
\end{align}

One ought to observe that the matrix inverse (or rather the matrix factorization, for the linear-algebra savvy among the readers) needed in (9.22) can be reused in (9.24), such that the latter is inexpensive to compute. We then observe that (9.15) can be readily used to compute the sensitivities of xk+1 , as in the explicit RK method. We can summarize these observations in the following pseudo-code.
IRK for explicit ODEs with sensitivities
Algorithm:

Input: Initial conditions x0 , input profile u(.), Butcher tableau, step size ∆t
Set A 0 = I , B 0 = 0
for k = 0, . . ., N − 1 do
Guess for K (one can e.g. use K i = xk )
while kr (K , xk , u(.))k > Tol do
Compute the solution ∆K to the linear system:

\[
\frac{\partial r(K, x_k, u(\cdot))}{\partial K} \Delta K + r(K, x_k, u(\cdot)) = 0 \tag{9.25}
\]
with r given by (8.13). Update:

K ← K + α∆K (9.26)

for some step size α ∈]0, 1] (a full step α = 1 generally works for implicit
integrators)
Reuse the matrix factorization required to solve (9.25) to compute:
\begin{align}
\frac{\partial K}{\partial p} &= - \left( \frac{\partial r(K, x_k, p)}{\partial K} \right)^{-1} \frac{\partial r(K, x_k, p)}{\partial p} \tag{9.27a} \\
\frac{\partial K}{\partial x_k} &= - \left( \frac{\partial r(K, x_k, p)}{\partial K} \right)^{-1} \frac{\partial r(K, x_k, p)}{\partial x_k} \tag{9.27b}
\end{align}

Take RK step with sensitivities:


\[
x_{k+1} = x_k + \Delta t \sum_{j=1}^{s} b_j K_j \tag{9.28a}
\]
\[
A_{k+1} = \left( I + \Delta t \sum_{j=1}^{s} b_j \frac{\partial K_j}{\partial x_k} \right) A_k \tag{9.28b}
\]
\[
B_{k+1} = \left( I + \Delta t \sum_{j=1}^{s} b_j \frac{\partial K_j}{\partial x_k} \right) B_k + \Delta t \sum_{j=1}^{s} b_j \frac{\partial K_j}{\partial p} \tag{9.28c}
\]

return x1,...,N , A 1,...,N and B 1,...,N
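A minimal numerical sketch of this procedure, for the 1-stage IRK (implicit Euler) and assumed toy dynamics ẋ = −p·x³ (all problem data are assumptions made for the example), is given below; the converged Newton Jacobian ∂r/∂K is reused to form the sensitivities (9.27), and the result is checked against a finite difference of the integrator itself:

```python
import numpy as np

def f(x, p):    return -p * x**3          # toy dynamics (an assumption)
def dfdx(x, p): return -3 * p * x**2
def dfdp(x, p): return -x**3

def irk_step_with_sens(xk, A, B, p, dt):
    """One step of the 1-stage IRK (implicit Euler), with sensitivities
    obtained from the implicit function theorem, cf. (9.27)-(9.28)."""
    K = f(xk, p)                          # initial guess for the stage
    for _ in range(20):                   # Newton iteration on r(K) = 0
        xi = xk + dt * K
        r  = f(xi, p) - K                 # residual (9.20) for s = 1
        J  = dt * dfdx(xi, p) - 1.0       # dr/dK (here J <= -1, never singular)
        K  = K - r / J                    # full Newton step
    xi = xk + dt * K
    J  = dt * dfdx(xi, p) - 1.0
    # implicit function theorem (9.27): reuse the converged Jacobian J
    dK_dp = -dfdp(xi, p) / J
    dK_dx = -dfdx(xi, p) / J
    # sensitivity propagation (9.28) with s = 1, b1 = 1
    return (xk + dt * K,
            (1 + dt * dK_dx) * A,
            (1 + dt * dK_dx) * B + dt * dK_dp)

def simulate(x0, p, dt=0.05, N=40):
    x, A, B = x0, 1.0, 0.0
    for _ in range(N):
        x, A, B = irk_step_with_sens(x, A, B, p, dt)
    return x, A, B

x0, p = 1.0, 2.0
xN, A, B = simulate(x0, p)
eps = 1e-6
fd_B = (simulate(x0, p + eps)[0] - simulate(x0, p - eps)[0]) / (2 * eps)
```

In the scalar case the “factorization reuse” is simply the reuse of J; for systems, one would reuse the LU factorization of the Newton matrix.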

9.5 SENSITIVITY WITH RESPECT TO INPUTS
Let us finish these lecture notes with a discussion of the computation of the sensitivity of a simulation with respect to the input u applied to the dynamics ẋ = f (x, u). Since u(.) is, in general, a profile (i.e. a function of time on the interval [0, T ]) as opposed to a finite set of parameters p, discussing sensitivities in the form explored here is a priori inadequate. However, the classic approach in numerical simulations is to consider that the input profile u(.) is parameterized by a finite set of parameters, i.e. we consider that the input is given by:

u(t , p) (9.29)

Any derivative with respect to p in the algorithms above is then simply replaced by its chain-rule expansion:

\[
\frac{\partial \cdot}{\partial p} = \frac{\partial \cdot}{\partial u} \frac{\partial u}{\partial p} \tag{9.30}
\]

e.g.

\[
\frac{\partial f}{\partial p} = \frac{\partial f}{\partial u} \frac{\partial u}{\partial p} \tag{9.31}
\]

The input parametrization can take many forms. It can e.g. be convenient to use an ap-
proach similar to the one used in collocation-based IRK schemes. In that context, the input
profile is e.g. given by:

\[
u(t_k + \tau \Delta t,\, p) = \sum_{i=1}^{s} p_{k,i}\, \ell_i(\tau) \tag{9.32}
\]

where we have:
 
\[
p = \begin{bmatrix} p_{0,1} \\ \vdots \\ p_{0,s} \\ \vdots \\ p_{N-1,1} \\ \vdots \\ p_{N-1,s} \end{bmatrix} \tag{9.33}
\]

Hence the input profile is piecewise smooth, and interpolates the points p k,i on the time intervals [tk , tk+1 ] for k = 0, . . . , N − 1. This principle is illustrated in Figure 9.1. A widespread alternative input parametrization is the piecewise-constant approach, which uses:

u(tk + τ∆t ) = p k , k = 0, . . . , N − 1 (9.34)

This principle is illustrated in Figure 9.2. All the sensitivity principles we studied above can
be readily applied.
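For instance, with a piecewise-constant input, one explicit Euler step per interval, and assumed toy dynamics f(x, u) = −x + u (everything in this sketch is an assumption made for the example), the chain rule (9.30) amounts to adding ∆t·∂f/∂u to the sensitivity of the current interval's parameter at each step:

```python
import numpy as np

# Piecewise-constant input parametrization (9.34): u(t) = p_k on [t_k, t_k+1],
# with one explicit Euler step per interval. Toy dynamics (assumption):
def f(x, u):
    return -x + u
dfdx, dfdu = -1.0, 1.0          # constant Jacobians for this linear example

def simulate(x0, p, dt):
    N = len(p)
    x = x0
    B = np.zeros(N)             # B[j] = d x_k / d p_j, built up step by step
    for k in range(N):
        # chain rule (9.30): df/dp_j = (df/du)(du/dp_j), and du/dp_j = 1 iff j == k
        B = (1 + dt * dfdx) * B
        B[k] += dt * dfdu
        x = x + dt * f(x, p[k])
    return x, B

x0, dt = 0.0, 0.1
p = np.array([1.0, 0.5, -0.2, 0.8])
xN, B = simulate(x0, p, dt)

# check d xN / d p_1 against a finite difference
eps = 1e-6
pp = p.copy(); pp[1] += eps
pm = p.copy(); pm[1] -= eps
fd = (simulate(x0, pp, dt)[0] - simulate(x0, pm, dt)[0]) / (2 * eps)
```

Note that earlier parameters get attenuated by the factors (1 + ∆t·∂f/∂x) of the subsequent steps, exactly as the sensitivities A, B do in the algorithms above.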

Figure 9.1: Illustration of the “collocation-based” parametrization of the inputs, for N = 4 and s = 3. Note that the input profile is piecewise-smooth.

Figure 9.2: Illustration of the piecewise-constant approach to the input parametrization for N = 4.
