
Signal Processing 1

LVA 389.166

Univ.-Prof. DI. Dr.-Ing. Markus Rupp


Institute of Telecommunications
TU Wien

December 20, 2017


Contents

1 Classical Deterministic Signal Processing
  1.1 Linear Systems
    1.1.1 Convolution
    1.1.2 Polynomial Description
    1.1.3 Vector-Matrix Description
    1.1.4 State Space Description
  1.2 Properties of Linear Systems
    1.2.1 Stability of Linear, Time-invariant Systems
    1.2.2 Causality
    1.2.3 Linear-Phase Property
    1.2.4 Passivity and Activity
    1.2.5 Minimum Phase Property
  1.3 Sampling Theorems
    1.3.1 Poisson Summation Formula
    1.3.2 Equidistant Sampling
    1.3.3 Further Sampling Theorems
  1.4 Exercises

2 Linear Vector Space
  2.1 Metrics and Spaces
    2.1.1 Sparsity
    2.1.2 Limit Sequences
    2.1.3 Gibbs' Phenomenon
    2.1.4 Robustness
  2.2 Topology of Linear Vector Spaces
    2.2.1 Linear Independence
    2.2.2 System Identification
    2.2.3 Complete Spaces
  2.3 Norms in a Linear Vector Space
    2.3.1 Matrix Norms
    2.3.2 Induced Norms
    2.3.3 Submultiplicative Property
  2.4 Application of Norms
    2.4.1 The Small Gain Theorem
    2.4.2 The Cauchy-Schwarz Inequality
  2.5 Exercises

3 Representation and Approximation in Vector Spaces
  3.1 Least Squares Method
    3.1.1 The Gramian Matrix
    3.1.2 The Orthogonality Property of LS
    3.1.3 Gradient Methods
    3.1.4 Pseudoinverses
  3.2 Projection Operators
  3.3 Applications of Least Squares
  3.4 Weighted Least Squares
  3.5 Orthogonal Polynomial Families
    3.5.1 Orthogonal Polynomials
    3.5.2 Differential Equations
  3.6 Wavelets
    3.6.1 Pre-Wavelets
    3.6.2 Wavelet Basics
  3.7 Exercises

4 Linear Operators
  4.1 Linear Operators
    4.1.1 Basic Definitions
    4.1.2 Basic Properties
    4.1.3 Spaces of Linear Operators
  4.2 Matrix Basics
  4.3 Hermitian Matrices and Subspace Techniques
  4.4 Further Subspace Techniques
    4.4.1 Pisarenko's Harmonic Decomposition
    4.4.2 Multiple Signal Classification
    4.4.3 Estimation of Signal Parameters via Rotational Invariance Techniques
  4.5 Singular Value Decomposition
  4.6 Applications of SVD
  4.7 Exercises

5 Matrix Operations
  5.1 Motivation
  5.2 Tensor Operations
  5.3 Large Matrices with Structure
  5.4 Toeplitz and Circulant Matrices
    5.4.1 Toeplitz Matrices
    5.4.2 Circulant Matrices
    5.4.3 Relations of Toeplitz Matrices with Previous Considerations

Preface

The deterministic examination of signals and systems experienced its classical period
in the forties and fifties through the work of Wiener and Küpfmüller, among others. Much
fundamental knowledge about linear systems originates from this period. Linear functional
transformations such as the Laplace, Fourier and z-transform became standard tools in those
days. The main applications lay in the field of filter design, first in the development of
analogue filters; later, these methods experienced a renaissance in the design of digital filters.
The sixties were characterized by state-space descriptions (Kalman filter), which in
particular led to a paradigm shift in automatic control engineering. Furthermore, the
FFT (Fast Fourier Transform) was re-invented by Cooley and Tukey in 1965, who did not
know that Gauss had already used this technique in 1804.
The eighties were then shaped by the design of complete filter banks that split signals
into small, controllable subunits, in the well-known tradition of divide et impera, which
goes back to Philip II of Macedonia, albeit not in the field of signal processing.
With this, however, the classical themes were exhausted. Innovations now started to
emerge from other fields, namely from scientists who solved their problems with the help of
linear algebra. Therefore, today's signal processing is a modern discipline, based on linear
algebra, which requires its own methodological skills.
This lecture is therefore structured into a classical part, which subsumes the significant
methods and results of the classical period together with some applications, and a second,
much larger part which deals only with modern methods. Only these modern methods make
it possible to understand such complicated concepts as stereo equalizer methods, big
data, radio interference suppression by beamforming, finding petrol and sunken treasures,
and whatever else is possible with subspace methods.
The lecture also spans the history of algorithmic development over the past two hundred
years, in which many mathematicians left their names on the methods. In the
last chapter we finally arrive at the newest techniques, which are currently being developed.

Nomenclature:
In this script, it was attempted to stick to the following nomenclature whenever possible.
∗ denotes the complex conjugate,
T denotes the transpose operator,
H denotes the Hermitian operator, thus the complex conjugate transpose,
a, b, c, ... describe deterministic scalars,
a, b, c, ... describe random variables with probability density function $f_a(a)$, variance $\sigma_a^2$ and mean $\bar{a}$,
a, b, c, ... describe (column) vectors with a given number of elements,
1 describes a vector with all elements equal to one,
a, b, c, ... denote (column) vectors with random variables as elements; the joint probability density function is given by $f_a(a)$, and the autocorrelation matrix of, e.g., vector a is given by $R_{aa} = E[a a^H]$,
A, B, C, ... denote matrices of defined dimension, with scalar elements,
A, B, C, ... denote matrices of defined dimension, with random variables as elements,
I denotes the identity matrix of appropriate dimension,
$\|a\|_p^q$ applied to the vector a indicates the p-norm raised to the power q, $\|a\|_p^q = \left(\sum_i |a_i|^p\right)^{q/p}$,
$\|a\|_Q$ applied to the vector a indicates the norm $\sqrt{a^H Q a}$.

If there is an index k (lower case letter) introduced with matrices and vectors, then
k denotes an additional dependency on time. Random variables in this case have to be
interpreted as random processes. If an upper case letter is used as an index, on the other
hand, it mostly denotes the dimension of the vector. Matrices which are positive definite
are often described by the relation A > 0. In particular, this means that all eigenvalues of
A are larger than zero.
Chapter 1

Classical Deterministic Signal Processing

1.1 Linear Systems


Linear systems are a dominant assumption. Systems can be linear because they have been
designed to be so, because they behave sufficiently linearly (within the observation
accuracy), or because they, although intrinsically non-linear, are operated in a small domain
of their operational range, so that they behave approximately linearly.

Definition 1.1 (Linearity) A linear system S is a system which fulfills

$$S[\alpha x] = \alpha S[x] \qquad (1.1)$$
$$S[x_1 + x_2] = S[x_1] + S[x_2]. \qquad (1.2)$$

Sometimes both properties are combined in one form which reads:

$$S[\alpha x_1 + \beta x_2] = \alpha S[x_1] + \beta S[x_2]. \qquad (1.3)$$

In this context, α and β have to be arbitrary constants.

The first property (1.1) is called homogeneity, the second (1.2) additivity, and the
third (1.3) superposition. We denote by S[.] an operation, or an operator, which in this
case is a linear one. Such operators can be applied both to functions (i.e., time-continuous
mappings x(t)) and to sequences x_k.¹ We introduce here the concept of linear operators
very loosely. In Chapter 4 we will discuss the properties of linear operators exclusively and
learn many more details.

¹ In fact to all objects of a linear vector space, as we will learn later.


1.1.1 Convolution
Familiar linear operators are functional transformations, such as the convolution of contin-
uous functions:
$$H[x(t)] = \int_{-\infty}^{\infty} h(\tau)\,x(t-\tau)\,d\tau = \int_{-\infty}^{\infty} x(\tau)\,h(t-\tau)\,d\tau \qquad (1.4)$$

$$H[x(t)] = \int_{0}^{\infty} h(\tau)\,x(t-\tau)\,d\tau = \int_{-\infty}^{t} x(\tau)\,h(t-\tau)\,d\tau \qquad (1.5)$$

$$H[x(t)] = \int_{0}^{t_0} h(\tau)\,x(t-\tau)\,d\tau = \int_{t-t_0}^{t} x(\tau)\,h(t-\tau)\,d\tau. \qquad (1.6)$$

The first case (1.4) represents a non-causal system, as typically considered in telecommuni-
cations engineering. In the second case (1.5), the system h(τ) is causal but with an infinite
memory. Please note the two different notations: the former highlights the causal system
h(τ), where the range of τ is between zero and infinity. In the latter notation, the signal
x(t), which is present at the input of the system, is in the foreground. It acts upon the
range from −∞ to t, since the system is causal and thus x(t) cannot have an effect beyond t. In the
third description (1.6) a causal system with finite memory t_0 is treated. In this context,
such systems are referred to as FIR filters (FIR = Finite Impulse Response). In FIR filters,
the signal x(t) only has an effect over the duration from t − t_0 to t. Previous values
are irrelevant; they have been forgotten.
In analogy to continuous functions, the convolution can also be defined for series:
$$H[x_k] = \sum_{l=-\infty}^{\infty} h_l\, x_{k-l} = \sum_{l=-\infty}^{\infty} x_l\, h_{k-l}$$

$$H[x_k] = \sum_{l=0}^{\infty} h_l\, x_{k-l} = \sum_{l=-\infty}^{k} x_l\, h_{k-l}$$

$$H[x_k] = \sum_{l=0}^{k_0} h_l\, x_{k-l} = \sum_{l=k-k_0}^{k} x_l\, h_{k-l}.$$
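As a quick illustration (added to these notes, not part of the original script), the following NumPy sketch applies a causal FIR operator H[.] to a short sequence and checks the direct evaluation of the convolution sum against numpy.convolve. The coefficients and the input are arbitrary illustrative values.

```python
import numpy as np

# Hypothetical FIR impulse response h_l and input sequence x_k (illustrative values).
h = np.array([1.0, 0.5, 0.25])           # h_0, h_1, h_2 -> causal, finite memory k_0 = 2
x = np.array([1.0, 2.0, 0.0, -1.0, 3.0])

# Direct evaluation of y_k = sum_l h_l * x_{k-l} for k = 0, ..., len(x)-1 (x_k = 0 for k < 0)
y_direct = np.array([sum(h[l] * x[k - l] for l in range(len(h)) if 0 <= k - l < len(x))
                     for k in range(len(x))])

# NumPy's convolution gives the same values (truncated to the input length).
y_np = np.convolve(x, h)[:len(x)]

print(y_direct)
print(np.allclose(y_direct, y_np))       # True
```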

Example 1.1 (Multipath Propagation) The radio wave propagation can be described
by a so-called multipath model. In this context, one imagines multiple (nP ) propagation
paths of the electromagnetic waves, which are attenuated independently from each other
(αn ). The time-delays τn are determined by the geometrical runtime of the signals (waves).
Therefore, a linear impulse response h(τ ) can be assigned to each path. The simplest case
with only one propagation path is thus given by h(τ) = α_1 δ(τ − τ_1). Because transmitter, receiver
and scattering objects can move independently of each other, the parameters α_n(t) and τ_n(t)
are time-dependent, and accordingly so is the impulse response

$$h(t,\tau) = \sum_{n=1}^{n_P} h_n(t,\tau) = \sum_{n=1}^{n_P} \alpha_n(t)\,\delta(\tau - \tau_n(t)).$$

Why is this time-variant system linear? If we examine the response to the input signal
x_1(t), we obtain

$$y_1(t) = \int_{\tau=-\infty}^{\infty} x_1(t-\tau) \sum_{n=1}^{n_P} \alpha_n(t)\,\delta(\tau - \tau_n(t))\,d\tau = \sum_{n=1}^{n_P} \alpha_n(t)\,x_1(t - \tau_n(t)) = H[x_1(t)].$$

To simplify matters, we assumed a non-causal system. Thus, for a second input signal
x_2(t), the output signal is given by

$$y_2(t) = \sum_{n=1}^{n_P} \alpha_n(t)\,x_2(t - \tau_n(t)) = H[x_2(t)].$$

If we now investigate the superposition (1.3) of the two signals, we obtain

$$H[\beta_1 x_1(t) + \beta_2 x_2(t)] = \beta_1 \int_{\tau=-\infty}^{\infty} x_1(t-\tau) \sum_{n=1}^{n_P} \alpha_n(t)\,\delta(\tau - \tau_n(t))\,d\tau + \beta_2 \int_{\tau=-\infty}^{\infty} x_2(t-\tau) \sum_{n=1}^{n_P} \alpha_n(t)\,\delta(\tau - \tau_n(t))\,d\tau = \beta_1 H[x_1(t)] + \beta_2 H[x_2(t)].$$
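To make the linearity check concrete, here is a small NumPy sketch (added for illustration, not from the script; the path gains, delays and signals are made up) that implements a sample-spaced, time-variant multipath channel and verifies the superposition property numerically.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_paths = 200, 3
x1 = rng.standard_normal(n_samples)
x2 = rng.standard_normal(n_samples)

# Hypothetical slowly varying path gains alpha_n(k) and fixed sample-spaced delays tau_n.
alpha = 0.5 + 0.1 * np.sin(2 * np.pi * np.arange(n_samples)[:, None] / 50 + np.arange(n_paths))
tau = np.array([0, 2, 5])                 # delays in samples

def channel(x):
    """Time-variant multipath channel y_k = sum_n alpha_n(k) * x_{k - tau_n}."""
    y = np.zeros_like(x)
    for n in range(n_paths):
        shifted = np.concatenate([np.zeros(tau[n]), x[:len(x) - tau[n]]])
        y += alpha[:, n] * shifted
    return y

# Superposition check: H[b1*x1 + b2*x2] == b1*H[x1] + b2*H[x2]
b1, b2 = 2.0, -0.7
print(np.allclose(channel(b1 * x1 + b2 * x2), b1 * channel(x1) + b2 * channel(x2)))  # True
```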

Example 1.2 (Linear Prediction) The main applications of linear prediction are in
speech and image processing, but it is more generally applied to any strongly correlated
signal. The basic idea is that correlated random processes can be predicted to a certain
extent. An obvious approach to predict a signal is a linear combination (interpolation,
filtering) of past signal values:

$$\hat{x}_k = \sum_{n=1}^{n_P} a_n x_{k-n}.$$

Source-coding methods utilize this to transmit only the error e_k = x_k − x̂_k instead of the
original signal x_k, because the error can be coded with considerably fewer bits.

Is this a linear system? The answer is: no. This is because the choice of the optimal
prediction coefficients depends on the statistics of the signal x_k. If we consider two input
signals x_{k,1} and x_{k,2} with different statistics, the corresponding output signals y_{k,1} and y_{k,2}
will be different from the output signal matching the input signal x_{k,1} + x_{k,2}. See also
Example 3.6 for the computation of the prediction coefficients a_n.
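The following NumPy sketch (an illustration added here, not from the script) estimates the prediction coefficients a_n by least squares for a synthetic correlated signal and compares the variance of the prediction error with that of the signal; the AR(2) test signal and the predictor order are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Strongly correlated test signal: an AR(2) process (illustrative, not from the script).
N, order = 5000, 2
x = np.zeros(N)
for k in range(2, N):
    x[k] = 1.5 * x[k - 1] - 0.7 * x[k - 2] + 0.1 * rng.standard_normal()

# Build the linear predictor x_hat_k = sum_n a_n * x_{k-n} by least squares over the data.
X = np.column_stack([x[order - n: N - n] for n in range(1, order + 1)])  # past samples
target = x[order:N]
a, *_ = np.linalg.lstsq(X, target, rcond=None)

e = target - X @ a            # prediction error e_k = x_k - x_hat_k
print("coefficients:", a)
print("signal variance:", np.var(target), " error variance:", np.var(e))
```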

1.1.2 Polynomial Description


Besides the convolution operation, there are many other functional transformations that
are linear operations: the Fourier transform, Laplace transform, and z-transform. Although these
transformations simplify more complicated operations (convolution becomes a multipli-
cation in the complex variable domain), they are limited to specific signal categories. A
sufficient (but not necessary) existence constraint is the Dirichlet condition. Typically,
problems occur if the energy of the examined signals is not limited. Even such simple
functions as a sinusoidal wave need to be described by the complex construct
of the Dirac distribution in the Fourier domain. Also, if the z-transform or the Laplace
transform is applied, one has to consider the domains of the functions. On the other hand, a
linear operator, in particular if it only represents a linear filter, is applicable to any signal.

The operator description from Section 1.1 avoids this problem elegantly. Linear filters
are indicated by upper case letters, and typically, if operations on time-discrete series are
meant, the additional argument q^{-1} is used to make the notation unique. The simplest
operator q^{-1} denotes a unit delay: q^{-1}[x_k] = x_{k-1}.²
Example 1.3 (IIR-filter operator) For example, we write B(q^{-1})[v_k] = B[v_k] to denote
a series v_k which is filtered with the filter given by the coefficients b_0, b_1, ..., b_{n_B-1}. Recursive
filter structures can also be denoted very easily: $\frac{B(q^{-1})}{1 - A(q^{-1})}[v_k]$ describes a recursive
filter of the form:

$$y_k = \sum_{l=1}^{n_A} a_l y_{k-l} + \sum_{l=0}^{n_B-1} b_l v_{k-l}.$$

Although very similar to the z-transform, we will prefer this notation because, in contrast to the
z-transform, it does not require an (energy, Dirichlet) constraint on the input sequence.
At first glance, z-transforms and notations in q^{-1} appear similar. But note
that they are fundamentally different, as z is a complex-valued number, while q^{-1} describes
an operator. As a consequence, z^{-1} can be multiplied with any transfer function, say H(z),
while q^{-1} operates on (time) sequences h_k. Operations like z^* or |z| make no sense for q^{-1},
or would need to be defined there first.
This description of time-discrete series or systems in terms of q^{-1} is also called a polyno-
mial description, because in most cases finite polynomials in q^{-1} arise.
$$H[.] = h_0 + h_1 q^{-1} + h_2 q^{-2} + \dots + h_n q^{-n} \qquad \text{causal system}$$
$$H[.] = h_{-n} q^{n} + \dots + h_{-1} q^{1} + h_0 + h_1 q^{-1} + \dots + h_n q^{-n} \qquad \text{non-causal system}.$$

The polynomial description allows for a simple mathematical description:

$$G(q^{-1})H(q^{-1}) = \left(g_0 + g_1 q^{-1} + \dots + g_{n_G} q^{-n_G}\right)\left(h_0 + h_1 q^{-1} + \dots + h_{n_H} q^{-n_H}\right) = g_0 h_0 + (g_1 h_0 + g_0 h_1) q^{-1} + \dots + g_{n_G} h_{n_H} q^{-(n_G + n_H)}.$$


² For time-continuous signals a canonical linear operator can be the differentiator or the integrator.

Note that multiplying polynomials is commutative, i.e., the order can be changed.
Since this is an equivalent description of a convolution, this is true for any convolu-
tion as well and also for other equivalent descriptions of linear time-invariant systems,
such as Toeplitz or circulant matrices as we will learn later on (see in particular Chapter 5).

Example: In this example we want to distinguish between the various equivalent
forms of description. Let us first consider a time-discrete system with impulse response
{1, a}. In terms of an operator we can write

$$H[.] = H(q^{-1}) = 1 + a q^{-1} = h_0 + h_1 q^{-1}.$$

If a sequence x_k is the input of such a system, its output is y_k = x_k + a x_{k-1}. If the sequence
x_k can be transformed into the z-domain (because it satisfies the Dirichlet conditions):

$$X(z) = \sum_{k=-\infty}^{\infty} x_k z^{-k},$$

we can equivalently formulate the input-output relation as

$$Y(z) = H(z)X(z)$$

with H(z) = 1 + a z^{-1} = h_0 + h_1 z^{-1}. If we are interested in the Fourier behavior of such
a system, we can set z = e^{jΩ} and find

$$Y(e^{j\Omega}) = H(e^{j\Omega})X(e^{j\Omega})$$

with H(e^{jΩ}) = 1 + a e^{-jΩ}. If we are interested in its stability behavior, we can check the
existence of the system H in the complex plane:

$$H(z) = 1 + a z^{-1} = z^{-1}(z + a).$$

We find that finite values exist for all z except z = 0, and thus the ROC is given by all values of
z except zero. Setting the z-transfer function to zero, H(z) = 0, delivers the zero z_0 = −a.

Let us now consider a second system that describes the input-output relation

$$y_k = a y_{k-1} + x_k.$$

The corresponding linear operator is given by

$$H[.] = H(q^{-1}) = \frac{1}{1 - a q^{-1}}$$

and the z-transform reads:

$$H(z) = \frac{1}{1 - a z^{-1}} = \frac{z}{z - a}.$$

The impulse response of such a system is h_k = a^k; k = 0, 1, .... As long as |a| < 1 we
thus have a stable and causal system. But what about z? For which values of z is this
defined? To resolve this we consider the z-transform again:

$$H(z) = \sum_{k=0}^{\infty} a^k z^{-k} = \sum_{k=0}^{\infty} \left(\frac{a}{z}\right)^k.$$

We know that this sum can only converge as long as |a/z| < 1. We thus conclude that
|z| > |a|. The stability of such a system is given by its poles. We can compute the poles of
this system by setting the denominator to zero and find z_∞ = a. Thus, as long as the poles
remain inside the unit circle, we have a stable and causal system. We can then use all
values of z with |z| > |a|. This also includes the values on the unit circle z = e^{jΩ}, allowing
us to compute the Fourier behavior of the system.

In both systems the operator description did not require any restriction on its use, while
for applying the z-transform we have to ensure that the signals satisfy the Dirichlet
conditions, i.e., that they are sufficiently bounded, and that we only use values of z for which
the expressions exist.
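A small NumPy sketch (illustrative, not from the script) realizes the two systems of this example as operators on sequences: the FIR system 1 + a q^{-1} and the recursion y_k = a y_{k-1} + x_k. It computes their impulse responses by feeding in a unit pulse and evaluates the transfer functions on the unit circle; the value of a is arbitrary.

```python
import numpy as np

a = 0.8
N = 50
delta = np.zeros(N); delta[0] = 1.0

# System 1: y_k = x_k + a*x_{k-1}   (H(q^-1) = 1 + a q^-1)
def sys1(x):
    xm1 = np.concatenate([[0.0], x[:-1]])
    return x + a * xm1

# System 2: y_k = a*y_{k-1} + x_k   (H(q^-1) = 1 / (1 - a q^-1))
def sys2(x):
    y = np.zeros_like(x)
    for k in range(len(x)):
        y[k] = (a * y[k - 1] if k > 0 else 0.0) + x[k]
    return y

h1 = sys1(delta)                     # {1, a, 0, 0, ...}
h2 = sys2(delta)                     # {1, a, a^2, a^3, ...}, stable since |a| < 1
print(h1[:4], h2[:4])
print(np.allclose(h2, a ** np.arange(N)))       # impulse response a^k

# Frequency responses on the unit circle z = e^{jΩ}
Omega = np.linspace(-np.pi, np.pi, 512)
H1 = 1 + a * np.exp(-1j * Omega)
H2 = 1 / (1 - a * np.exp(-1j * Omega))
```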
Example 1.4 (Zero-Forcing Equalizer) Consider a linear transfer function in the form of
a polynomial H(q^{-1}). What does a polynomial G(q^{-1}) look like, such that G(q^{-1})H(q^{-1}) =
q^{-D}, i.e., such that the linear distortion of the filter H is compensated? Such problems
often arise, e.g., in the transmission over a static multipath radio propagation channel, or
in the transmission of loudspeaker signals (echoes) to the human ear. Because the transfer
function G(q^{-1}) of the receive filter ought to reconstruct the original signal, G(q^{-1}) is called
an “equalizer”. Without solving the problem, one easily sees that there will not always be
a solution. If H(q^{-1}) has zeroes, those cannot be compensated. Later on (see the treatment
of minimum phase systems in Section 1.2.5) we will learn that this problem in general has
only a doubly infinite solution, thus a non-causal G(q^{-1}). To illuminate this a bit more,
let us investigate a channel with the transfer function H(q^{-1}) = h_0 + h_1 q^{-1}. The problem is then
given by:

$$H(q^{-1})G(q^{-1}) = \left(h_0 + h_1 q^{-1}\right)\left(g_0 + g_1 q^{-1} + \dots + g_{n_G} q^{-n_G}\right) = g_0 h_0 + (g_1 h_0 + g_0 h_1) q^{-1} + \dots + h_1 g_{n_G} q^{-(1+n_G)} = q^{-D}.$$


Let us now choose D = 1. Then g_0 = 0 must hold, and thus g_1 = 1/h_0 is also determined.
Now, one can iteratively insert the solution and thus work towards g_{n_G - 1}. However, one will
discover that there is one equation too many: g_{n_G} is already determined, and this accordingly
leads to a contradiction. This can be shown systematically. Let us start with n_G = 1. Then
we obtain the system of equations

$$\begin{pmatrix} h_0 & 0 \\ h_1 & h_0 \\ 0 & h_1 \end{pmatrix} \begin{pmatrix} g_0 \\ g_1 \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}.$$

This is a system of equations with two unknowns and three equations. If the number of
degrees of freedom is increased, from nG = 1 to nG = 2, one obtains a system with three
unknowns and four equations. By induction it can be seen that there is always one equation
more than unknowns, showing that the system of equations typically does not lead to a
solution.

A solution for this difficult problem only appears if more than one channel H is incor-
porated.
Theorem 1.1 (Bezout) Consider the transfer functions H_1(q^{-1}), ..., H_{n_H}(q^{-1}) to be given.
Then, the equation

$$\sum_{n=1}^{n_H} H_n(q^{-1}) G_n(q^{-1}) = q^{-D} \qquad (1.7)$$

has a finite-dimensional solution G_1(q^{-1}), ..., G_{n_H}(q^{-1}) if and only if (iff) the polynomials
H_1(q^{-1}), ..., H_{n_H}(q^{-1}) do not share any common zeroes, i.e., they have to be co-prime.
Proof:
The proof (for polynomials of finite dimension) will proceed in two steps. In the first step
(Example 1.5) we show that if the polynomials have a common zero, no solution can exist.
In the second step we show that if they have no common zero, then there is at least one
solution.

Example 1.5 Let us consider two transfer functions H1 and H2 for which the following
holds: H1 (q −1 ) = (1 + hq −1 )C1 (q −1 ) and H2 (q −1 ) = (1 + hq −1 )C2 (q −1 ). Here, C1 (q −1 ) and
C2 (q −1 ) should not share any zeroes. To determine the solutions G1 (q −1 ) and G2 (q −1 ), the
following must hold:

$$q^{-D} = H_1(q^{-1})G_1(q^{-1}) + H_2(q^{-1})G_2(q^{-1}) = (1 + h q^{-1})C_1(q^{-1})G_1(q^{-1}) + (1 + h q^{-1})C_2(q^{-1})G_2(q^{-1}) = (1 + h q^{-1})\left(C_1(q^{-1})G_1(q^{-1}) + C_2(q^{-1})G_2(q^{-1})\right).$$

In order to lead to a solution, C_1(q^{-1})G_1(q^{-1}) + C_2(q^{-1})G_2(q^{-1}) would need to be equal to
q^{-D}/(1 + h q^{-1}). However, this is not possible, because C_1(q^{-1})G_1(q^{-1}) + C_2(q^{-1})G_2(q^{-1})
would now have to compensate for (1 + h q^{-1}), i.e., to realize (1 + h q^{-1})^{-1}, which is not possible with
finite-order polynomials.

Let us now turn to the second step of the proof. Let us assume that the conditions of
the theorem are fulfilled. How does the previous example change? If we have the two transfer
functions H_1(q^{-1}) = h_0^{(1)} + h_1^{(1)} q^{-1} and H_2(q^{-1}) = h_0^{(2)} + h_1^{(2)} q^{-1}, we can solve the problem
via two equalizers: G_1(q^{-1}) = g_0^{(1)} + g_1^{(1)} q^{-1} and G_2(q^{-1}) = g_0^{(2)} + g_1^{(2)} q^{-1}. Then, we obtain
the equation:

$$H_1(q^{-1})G_1(q^{-1}) + H_2(q^{-1})G_2(q^{-1}) = q^{-D}.$$

For D = 1, we find:

$$\begin{pmatrix} h_0^{(1)} & 0 & h_0^{(2)} & 0 \\ h_1^{(1)} & h_0^{(1)} & h_1^{(2)} & h_0^{(2)} \\ 0 & h_1^{(1)} & 0 & h_1^{(2)} \end{pmatrix} \begin{pmatrix} g_0^{(1)} \\ g_1^{(1)} \\ g_0^{(2)} \\ g_1^{(2)} \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}.$$

In contrast to the previous example we now have four unknowns, but only three equations.
This means that the system of equations can now be solved; strictly speaking, infinitely many
solutions exist. If we increase the number of free parameters from two to three, we obtain
six unknowns and five equations, and so on. By induction we can therefore show that we
always have more unknowns than equations. This proves Bezout's Theorem.
It also leaves open the question of which of the infinitely many solutions is to be taken.
The final answer to this problem will be provided in the context of the so-called minimum
norm solution in Chapter 3; see also Example 3.11.
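To illustrate Bezout's theorem numerically, the following NumPy sketch (not part of the script; the two channels are made-up coprime examples) stacks the convolution equations for D = 1 into the 3×4 system above and picks one of the infinitely many solutions (the minimum-norm one returned by lstsq), then checks that the two equalized branches add up to a pure delay.

```python
import numpy as np

# Two hypothetical coprime channels h^(1) and h^(2) (no common zero).
h1 = np.array([1.0, 0.5])
h2 = np.array([1.0, -0.8])

# Convolution (Sylvester-type) matrix of the system
#   H1(q^-1) G1(q^-1) + H2(q^-1) G2(q^-1) = q^-D  with D = 1 and nG = 1.
S = np.array([[h1[0], 0.0,   h2[0], 0.0  ],
              [h1[1], h1[0], h2[1], h2[0]],
              [0.0,   h1[1], 0.0,   h2[1]]])
rhs = np.array([0.0, 1.0, 0.0])          # desired overall response q^-1

# Underdetermined system (3 equations, 4 unknowns): lstsq returns the minimum-norm solution.
g, *_ = np.linalg.lstsq(S, rhs, rcond=None)
g1, g2 = g[:2], g[2:]

# Check: the two equalized branches add up to a pure delay.
total = np.convolve(h1, g1) + np.convolve(h2, g2)
print(np.round(total, 10))               # ~ [0, 1, 0]
```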

1.1.3 Vector-Matrix Description


In the case of a causal system with finite memory of size n_H − 1, the convolution

$$H[x_k] = \sum_{n=0}^{n_H-1} h_n x_{k-n} = \sum_{n=k-n_H+1}^{k} x_n h_{k-n} = \left(h_0 + h_1 q^{-1} + \dots + h_{n_H-1} q^{-n_H+1}\right) x_k = y_k \qquad (1.8)$$

can alternatively also be written in vector form:

$$y_k = h^T x_k = x_k^T h \qquad (1.9)$$
$$h^T = [h_0, h_1, \dots, h_{n_H-1}]$$
$$x_k^T = [x_k, x_{k-1}, \dots, x_{k-n_H+1}].$$

We choose the representation in column vector form, which we will use from now on. In
the strict sense, the series in (1.8) is the response to the input series x_k, whereas (1.9)
represents a single value as the response to a collection of input values x_k. Equation
(1.9), however, can also be interpreted as an output series y_k in response to the series of
input vectors x_k.
If we do not want to consider a single value y_k but a sorted set y_0, ..., y_{n_K-1} of values, the
matrix-vector notation proves useful:


  
$$\begin{pmatrix} y_0 \\ y_1 \\ y_2 \\ \vdots \\ y_{n_K-2} \\ y_{n_K-1} \end{pmatrix} = \begin{pmatrix} h_0 & & & & & \\ h_1 & h_0 & & & & \\ \vdots & h_1 & h_0 & & & \\ h_{n_H-1} & \ddots & \ddots & \ddots & & \\ & h_{n_H-1} & \dots & h_1 & h_0 & \\ & & h_{n_H-1} & \dots & h_1 & h_0 \end{pmatrix} \begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_{n_K-2} \\ x_{n_K-1} \end{pmatrix} \qquad (1.10)$$

Note that the occurring matrix is of dimension n_K × n_K and, due to its structure, the
matrix is regular and thus invertible (h_0 ≠ 0). The description also allows for a backward
calculation of the input signal from the output signal. The short notation for this is y =
Hx. Furthermore, this description includes the initial values, since we start with the time
index k = 0. A disadvantage, on the other hand, is that the matrices (and
vectors) grow with increasing time index. Note that often the reverse order of the vectors
is applied, resulting in an alternative representation:
 
$$\begin{pmatrix} y_{n_K-1} \\ y_{n_K-2} \\ y_{n_K-3} \\ \vdots \\ y_1 \\ y_0 \end{pmatrix} = \begin{pmatrix} h_0 & h_1 & \dots & h_{n_H-1} & & \\ & h_0 & h_1 & \dots & h_{n_H-1} & \\ & & h_0 & h_1 & \ddots & \\ & & & \ddots & \ddots & h_{n_H-1} \\ & & & & h_0 & h_1 \\ & & & & & h_0 \end{pmatrix} \begin{pmatrix} x_{n_K-1} \\ x_{n_K-2} \\ x_{n_K-3} \\ \vdots \\ x_1 \\ x_0 \end{pmatrix} \qquad (1.11)$$

The short notation for this is $y^{(r)} = H^T x^{(r)}$, with r denoting backward or reverse ordering.

The vector h and thus each of the lower rows of H contains the same information.
Because the initial values are often not of interest in communications engineering, the band
matrix description presents itself as an alternative presentation of the problem.
 
$$\begin{pmatrix} y_{k-n_N+1} \\ y_{k-n_N+2} \\ \vdots \\ y_{k-1} \\ y_k \end{pmatrix} = \begin{pmatrix} h_{n_H-1} & \dots & h_1 & h_0 & & & \\ & h_{n_H-1} & \dots & h_1 & h_0 & & \\ & & \ddots & & \ddots & \ddots & \\ & & & h_{n_H-1} & \dots & h_1 & h_0 \end{pmatrix} \begin{pmatrix} x_{k-n_N-n_H+2} \\ x_{k-n_N-n_H+3} \\ \vdots \\ x_{k-1} \\ x_k \end{pmatrix}$$

or in short notation $y_k = H_{n_N} x_k$, where the index n_N denotes the number of rows of the n_N × (n_N + n_H − 1) band matrix H_{n_N}. The band matrix description is very beneficial if several linear

systems are connected serially, because then again band matrices emerge. However, these
matrices are then not invertible anymore because they have a rectangular form, i.e., it is
not easily possible to calculate the input-signal from the output-signal. This is not very
astonishing given that there are no initial values in this description.

Example 1.6 Let us now consider the transmission of a training sequence at the beginning
of a data block. The training sequence is six symbols long and is sent through a channel
with transfer function h0 + h1 q −1 . The first description yields:
    
$$\begin{pmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \end{pmatrix} = \begin{pmatrix} h_0 & & & & & & \\ h_1 & h_0 & & & & & \\ & h_1 & h_0 & & & & \\ & & h_1 & h_0 & & & \\ & & & h_1 & h_0 & & \\ & & & & h_1 & h_0 & \\ & & & & & h_1 & h_0 \end{pmatrix} \begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 = 0 \end{pmatrix}.$$

It also contains the initial-values. If the training sequence is only seen as a part of an
infinitely long set of symbols, then the initial-values will be irrelevant. In that case the
following description may be more appropriate:
 
$$\begin{pmatrix} y_{k+1} \\ y_{k+2} \\ y_{k+3} \\ y_{k+4} \\ y_{k+5} \end{pmatrix} = \begin{pmatrix} h_1 & h_0 & & & & \\ & h_1 & h_0 & & & \\ & & h_1 & h_0 & & \\ & & & h_1 & h_0 & \\ & & & & h_1 & h_0 \end{pmatrix} \begin{pmatrix} x_k \\ x_{k+1} \\ x_{k+2} \\ x_{k+3} \\ x_{k+4} \\ x_{k+5} \end{pmatrix}.$$
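The following NumPy sketch (illustrative, not from the script) builds the lower-triangular Toeplitz convolution matrix of (1.10) for a short made-up channel and block, checks it against numpy.convolve, and demonstrates the backward calculation of the input from the output, which is possible since h_0 ≠ 0.

```python
import numpy as np

# Hypothetical channel h = [h0, h1] and a short input block (illustrative values).
h = np.array([1.0, 0.4])
x = np.array([1.0, -1.0, 1.0, 1.0, -1.0, -1.0])
nK = len(x)

# Lower-triangular Toeplitz convolution matrix of dimension nK x nK, as in (1.10).
H = np.zeros((nK, nK))
for i in range(nK):
    for j in range(max(0, i - len(h) + 1), i + 1):
        H[i, j] = h[i - j]

y = H @ x
print(np.allclose(y, np.convolve(x, h)[:nK]))   # True

# Since h0 != 0 the matrix is regular: the input can be recovered from the output.
x_rec = np.linalg.solve(H, y)
print(np.allclose(x_rec, x))                    # True
```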

1.1.4 State Space Description


Probably the most important matrix description of linear systems is the so-called state-
space form. Let us examine the (canonical) transfer function of an Infinite Impulse Response
(IIR) filter in operator notation:

$$H(q^{-1}) = \frac{b_0 + b_1 q^{-1} + \dots + b_{n_B-1} q^{-n_B+1}}{1 + a_1 q^{-1} + \dots + a_{n_A} q^{-n_A}} = \frac{b_0 + b_1 q^{-1} + \dots + b_p q^{-p}}{1 + a_1 q^{-1} + \dots + a_p q^{-p}}. \qquad (1.12)$$
The name canonical is due to the fact that this description is a basic structure that can, by
variations, be combined into new, more complex units. This is somewhat similar to a canon in
music theory, which is varied from strophe to strophe. In order not to handle two memory
variables n_A, n_B, we choose p = max{n_A, n_B − 1}. Here, either a_p or b_p can be zero, but
not both, and hence p determines the order of the filter. A Finite Impulse Response (FIR)
filter is thus a special case with a_1 = a_2 = ... = a_p = 0. Figure 1.1 clarifies this
connection in the form of a signal flow diagram.

Figure 1.1: Signal flow diagram of an IIR filter.

The outputs of the delay chain are denoted z_k^{(p)}, which are the states of the system. If the states are known, the output signal y_k can
be calculated for every input-signal xk . Accordingly, the state values adopt the role of the
initial values, not only for the origin k = 0, but for arbitrary time instances k. Thus, the
whole effect of the input sequence up to time k is accumulated in the system states. From
the figure, we can determine the following relations:

$$z_{k+1}^{(1)} = z_k^{(2)}; \quad z_{k+1}^{(2)} = z_k^{(3)}; \quad \dots; \quad z_{k+1}^{(p-1)} = z_k^{(p)}.$$

Furthermore, the following holds:

$$z_{k+1}^{(p)} = x_k - a_1 z_k^{(p)} - a_2 z_k^{(p-1)} - \dots - a_p z_k^{(1)} \qquad (1.13)$$
$$y_k = b_0 z_{k+1}^{(p)} + b_1 z_k^{(p)} + b_2 z_k^{(p-1)} + \dots + b_p z_k^{(1)}. \qquad (1.14)$$

In both equations the states appear. Let us combine the states at time instance k in a
vector z k . Then, we obtain the much more compact description:
 (1)     (1)   
zk+1 1 zk 0
 z (2)  . (2) 
..   zk   0 
    
 k+1 
 .  =    .  +  .  xk

 ..   1   ..   .. 
(p)
zk+1 −ap −ap−1 ... −a1 (p)
zk 1
| {z } | {z } | {z } | {z }
z k+1 A zk b
 (1)    (1) 
zk zk
(2) (2)
T
 zk    zk 
yk = [bp ,bp−1 ,...,b1 ]   + b0 xk − [ap ,ap−1 ,...,a1 ]T
    
..  .. 
 .    . 
(p) (p)
zk zk
 (1) 
zk
(2)
 zk 
= [bp − b0 ap ,bp−1 − b0 ap−1 ,...,b1 − b0 a1 ]   + |{z}
b0 x k .
 
..
| {z } . 
cT d
(p)
zk
And this can be written compactly as:
z k+1 = Az k + bxk (1.15)
yk = cT z k + dxk .
Naturally, the question regarding the relation to the transfer function arises. If the
transfer function is given in the form of (1.12), then it can be transformed into the form
(1.15). Conversely, starting from (1.15) and applying the z-transform, one obtains:

$$Z[z_{k+1}] = Z(z)\,z = A Z(z) + b X(z) \qquad (1.16)$$
$$Y(z) = c^T Z(z) + d X(z). \qquad (1.17)$$

The first equation (1.16) can be solved for Z(z), leading to:

$$Z(z) = (zI - A)^{-1} b X(z).$$

This can be inserted into the second equation (1.17), finally resulting in:

$$Y(z) = c^T (zI - A)^{-1} b X(z) + d X(z) = \underbrace{\left(c^T (zI - A)^{-1} b + d\right)}_{H(z)} X(z).$$

A first advantage of the state-space description is thus that we obtain a very compact
description:

$$\begin{pmatrix} z_{k+1} \\ y_k \end{pmatrix} = \begin{pmatrix} A & b \\ c^T & d \end{pmatrix} \begin{pmatrix} z_k \\ x_k \end{pmatrix}.$$

Example 1.7 Let us consider the signal flow diagram in the left part of Figure 1.2. We
can identify a simple feedback, as it is, e.g., used to average signals. We want to find the
canonical structure in the state-space form. The input-output relations are:
$$y_k = a y_{k-1} + d x_k.$$

By applying the z-transform, we obtain

$$Y(z) = a Y(z) z^{-1} + d X(z) = \frac{d}{1 - a z^{-1}} X(z).$$

Accordingly, we can determine: a_1 = −a, b_0 = d. The canonical structure of the IIR filter
can be seen in the right part of Figure 1.2. The state-space description is then given by

$$z_{k+1} = x_k - a_1 z_k$$
$$y_k = b_0 z_{k+1}.$$

In more compact notation, we obtain:

$$\begin{pmatrix} z_{k+1} \\ y_k \end{pmatrix} = \begin{pmatrix} -a_1 & 1 \\ -b_0 a_1 & b_0 \end{pmatrix} \begin{pmatrix} z_k \\ x_k \end{pmatrix} = \begin{pmatrix} a & 1 \\ da & d \end{pmatrix} \begin{pmatrix} z_k \\ x_k \end{pmatrix}.$$

Figure 1.2: Signal flow diagram of an elementary feedback system; left: arbitrary signal flow graph, right: corresponding canonical description.
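As a numerical cross-check of Example 1.7 (added here for illustration, not part of the script), the following NumPy sketch simulates the state-space quadruple {A, b, c, d} of the first-order feedback system and compares it with the direct recursion y_k = a y_{k-1} + d x_k; the values of a and d are arbitrary.

```python
import numpy as np

# First-order example: y_k = a*y_{k-1} + d*x_k, i.e. a1 = -a, b0 = d (values illustrative).
a, d = 0.9, 1.0
A = np.array([[a]])          # 1x1 "companion" matrix: -a1 = a
b = np.array([1.0])
c = np.array([d * a])        # c^T z_k = d*a*z_k, since y_k = b0*(x_k - a1*z_k)
d0 = d                       # passage (Durchgriff)

rng = np.random.default_rng(2)
x = rng.standard_normal(100)

# State-space simulation
z = np.zeros(1)
y_ss = np.zeros_like(x)
for k in range(len(x)):
    y_ss[k] = c @ z + d0 * x[k]
    z = A @ z + b * x[k]

# Direct recursion for comparison
y_dir = np.zeros_like(x)
for k in range(len(x)):
    y_dir[k] = (a * y_dir[k - 1] if k > 0 else 0.0) + d * x[k]

print(np.allclose(y_ss, y_dir))   # True
```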

Similarity Transformation: The dynamic behaviour of the system is solely determined
by the matrix A. Strictly speaking, the location of the eigenvalues of A within the unit circle
determines whether a system reacts quickly or inertly. However, A shows no diagonal structure
in the canonical form, which means that the eigenvalues are not directly identifiable. This
special form of the matrix A, directly derived from the canonical structure, is called companion
form:

$$A = \begin{pmatrix} & 1 & & \\ & & \ddots & \\ & & & 1 \\ -a_p & -a_{p-1} & \dots & -a_1 \end{pmatrix}.$$

It is relatively easy to determine the eigenvalues from this companion form, and thus the
dynamic behaviour of the system. Note that the state-space description is not unique.
We only considered the canonical structure, but in principle infinitely many forms are possible,
which all use inner states to describe the same system. Other representations of the
same system can be found by similarity transformations. A similarity transformation is given
by any regular matrix T, i.e., one whose inverse T^{-1} exists. Then, the input and output can
be transformed uniquely, so that this path can also be followed backwards. We substitute
A = T A' T^{-1} and multiply (1.15) with T^{-1} from the left:

$$\underbrace{T^{-1} z_{k+1}}_{z'_{k+1}} = A' \underbrace{T^{-1} z_k}_{z'_k} + \underbrace{T^{-1} b}_{b'}\, x_k. \qquad (1.18)$$

To be able to describe the output in terms of the transformed state z'_k, we also modify the output
equation:

$$y_k = c^T z_k + d x_k = \underbrace{c^T T}_{c'^T} z'_k + d x_k.$$

Hence, if the transformation matrix T is known, all involved variables can easily be converted:

$$A' = T^{-1} A T; \quad b' = T^{-1} b; \quad c' = T^T c; \quad d' = d. \qquad (1.19)$$
Particularly advantageous are transformations T for which A' shows a diagonal structure (or a
Jordan structure, see also Equation (4.1) and the discussion there), because then its eigen-
values can be determined directly. Accordingly, the remaining question is which matrix T
transforms the matrix A into a matrix D of diagonal structure. The answer is the following:
from (1.19) we know that TD = AT must hold. For this we find the Vandermonde matrix

$$T = \begin{pmatrix} 1 & 1 & \dots & 1 \\ \lambda_1 & \lambda_2 & \dots & \lambda_m \\ \vdots & & & \vdots \\ \lambda_1^{m-1} & \lambda_2^{m-1} & \dots & \lambda_m^{m-1} \end{pmatrix}$$

which diagonalises the companion form. If all eigenvalues are distinct, then the matrix is
regular and thus invertible. Note, however, that we need to know the eigenvalues first in
order to construct this matrix.

Example 1.8 Let us expand our averaging filter from the previous example by one pole, so
that we obtain
$$y_k = -a_1 y_{k-1} - a_2 y_{k-2} + d x_k.$$

By applying the z-transform, we obtain the canonical form

$$Y(z) = \frac{d}{1 + a_1 z^{-1} + a_2 z^{-2}} X(z)$$

and thus the compact canonical state-space description

$$\begin{pmatrix} z_{k+1} \\ y_k \end{pmatrix} = \begin{pmatrix} 0 & 1 & 0 \\ -a_2 & -a_1 & 1 \\ -a_2 d & -a_1 d & d \end{pmatrix} \begin{pmatrix} z_k \\ x_k \end{pmatrix}.$$

Thus, we can identify the matrix

$$A = \begin{pmatrix} 0 & 1 \\ -a_2 & -a_1 \end{pmatrix}$$

and calculate the two eigenvalues {λ_1, λ_2}. Accordingly, we can specify the transformation
matrix

$$T = \begin{pmatrix} 1 & 1 \\ \lambda_1 & \lambda_2 \end{pmatrix}$$

and consequently the transformed matrix A' = D in diagonal structure. The formal way,
however, is determined by the Vandermonde matrix T; unfortunately, the eigenvalues of the
matrix A always have to be calculated first.
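The following NumPy sketch (illustrative, not from the script) carries out the diagonalization of Example 1.8 numerically: it computes the eigenvalues of a made-up companion matrix, builds the Vandermonde matrix T, and verifies that T^{-1} A T is diagonal; it also checks the stability condition |λ_i| < 1.

```python
import numpy as np

# Companion matrix for y_k = -a1*y_{k-1} - a2*y_{k-2} + d*x_k (illustrative coefficients).
a1, a2 = -1.2, 0.35
A = np.array([[0.0, 1.0],
              [-a2, -a1]])

lam = np.linalg.eigvals(A)               # eigenvalues lambda_1, lambda_2
T = np.vander(lam, increasing=True).T    # Vandermonde matrix [[1, 1], [lam1, lam2]]

D = np.linalg.inv(T) @ A @ T             # A' = T^{-1} A T should be diagonal
print(lam)
print(np.round(D, 10))
print(np.all(np.abs(lam) < 1))           # stability check: eigenvalues inside the unit circle
```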
Extensions: The relevant parameters which describe the system behaviour are apparently
the quadruple {A, b, c, d}. Here, A solely describes the dynamic behaviour of the system,
thus how fast (or slowly) it reacts to changes in the input signal, and whether it is stable or
not. The vector parameter b characterizes the effect of the input signal on the states
of the system. In the canonical form considered so far, this is a unit vector and thus a
constant. But as we saw, this parameter can take on completely different values under a
similarity transformation, and accordingly can take up an important role. The second vector
parameter c denotes how the states affect the output. Finally, the scalar parameter d
describes how the input instantaneously controls the output. This parameter is thus called
passage (Ger.: Durchgriff).
A linear, time-variant system with vector input and output signals (MIMO system;
Multiple-Input Multiple-Output) can be described by the following state equations:

$$z_{k+1} = A_k z_k + B_k x_k, \quad z_{k_0} = \text{initial value}$$
$$y_k = C_k z_k + D_k x_k, \quad k \ge k_0,$$

where the matrices {A_k, B_k, C_k, D_k} have the dimensions m × m, m × q, p × m and p × q.
Accordingly, xk is of dimension q × 1 and y k is of dimension p × 1. The m-dimensional
vector z k is then called state of the system.
Since we are dealing with a time-variant system, we cannot utilize the z-transform to
solve the system. The solution for the state z_k can be given directly by using the
transition matrix Φ(k,j):

$$z_k = \Phi(k,j)\, z_j + \sum_{l=j}^{k-1} \Phi(k,l+1)\, B_l x_l. \qquad (1.20)$$

The transition matrix in this case is given by

$$\Phi(k,j) = A_{k-1} A_{k-2} \cdots A_j, \qquad \Phi(k,k) = I.$$

Via the state-space solution (1.20), the solution for the output of the time-variant
system can also be given in closed form:

$$y_k = C_k z_k + D_k x_k = C_k \Phi(k,j)\, z_j + \sum_{l=j}^{k-1} C_k \Phi(k,l+1)\, B_l x_l + D_k x_k.$$

For the special case of a time-invariant system {A, B, C, D}, the transition matrix becomes

$$\Phi(k,j) = A^{k-j}, \quad k \ge j.$$

Under special circumstances the time-invariant system can be diagonalized. Then

$$A = \mathrm{diag}\{\lambda_1, \lambda_2, \dots, \lambda_m\}.$$

Stability is guaranteed iff all eigenvalues satisfy |λ_i| < 1.

Example 1.9 (cable) Let us first consider a linear, time-invariant system H[.] with the
FIR impulse response h (e.g., a static radio transmission channel) and additive interference
wk . Then, we obtain:

$$z_{k+1} = I z_k + 0\,w_k; \qquad z_0 = h$$
$$y_k = x_k^T z_k + w_k.$$

Please note that the static channel can very easily be described by a simple state equation.

Example 1.10 (wireless) Now let us consider another, advanced example with a time-
variant transfer function, thus a typical mobile radio channel, which changes over time:

$$z_{k+1} = A z_k + 0\,w_k; \qquad z_0 = h$$
$$y_k = x_k^T z_k + w_k.$$

The only change in comparison to the previous example is the introduction of the matrix
A, which now describes the dynamic behaviour of the radio channel. Please note that a
time-invariant notation is sufficient for the description of this time-variant radio channel.
Of course, this can not be generalized. Certain time-variant channels require a description
by time-variant state-space equations. Nevertheless, one recognizes in this example that the
state-space description allows for descriptions which have not been possible before.

The reader may argue that the channel will now fade away and never return if all eigenvalues of A are inside
the unit circle. Such behavior can also be described by adding a driving force
for the channel:

$$z_{k+1} = A z_k + 0\,w_k + v_k; \qquad z_0 = h$$
$$y_k = x_k^T z_k + w_k + 0\,v_k.$$

Now the second noise source v_k drives the channel and lets it fluctuate. It is not a
direct noise source for the output, which is described by w_k.

1.2 Properties of Linear Systems


In the following, five important, elementary properties of linear, time-invariant systems will
be examined: stability, causality, linear-phase characteristics, all-pass character, and the
minimum phase property. These are indeed properties of linear systems, but in addition
they are attributes of their describing functions (impulse response and transfer function).
Although such qualities apply equally to time-discrete and time-continuous systems, they
will only be derived in one of the two forms. The corresponding other form should be
straightforward to derive.
In some cases we require a special method, the so-called "bi-linear transform". Let us
recall that for the Laplace transform the poles of a stable, causal system are located in
the left complex half-plane (for anti-causal systems in the right one), whereas for the z-transform
they are located inside the unit circle. A transformation that maps the left complex half-plane
into the unit circle is the bi-linear transform, or, correspondingly for Fourier transforms,
from the imaginary axis onto the unit circle:

$$j\omega = \frac{2}{T}\,\frac{e^{j\Omega} - 1}{e^{j\Omega} + 1} = j\,\frac{2}{T}\tan\!\left(\frac{\Omega}{2}\right), \qquad (1.21)$$
$$\Omega = 2\arctan\!\left(\frac{\omega T}{2}\right).$$

The corresponding integrals can be converted by

$$d\Omega = \frac{T\,d\omega}{1 + \frac{T^2 \omega^2}{4}}.$$

The time parameter T corresponds to the sampling time interval, assuming equidistant
sampling. Such a transformation is a special case of the so-called Möbius transform.³

³ August Ferdinand Möbius (1790–1868), German mathematician. In the English control literature it is often referred to as Tustin's method, after Arnold Tustin (16 July 1899 – 9 January 1994), a British engineer.

In the mathematical literature, the term T/2 is simply omitted, as it is only a scaling
and does not influence the basic behavior; it does, however, cause a frequency warping. As the
bi-linear transform maps the left half-plane into the unit circle, it preserves stability, i.e., if stability is
given in one domain, then after applying the bi-linear transform, stability is also given in the
corresponding domain. Note, however, that only stability is ensured by
such a transformation; other properties may change considerably. In the early days
of time-discrete signal processing, the bi-linear transformation gained a lot of attention
because it allowed for the direct transfer of numerous methods from the analogue world
(e.g., filter design methods) into the discrete world. However, nowadays there are so many
digital methods, which are moreover considerably better, that the bi-linear transform
has lost its importance.

Example: Consider an analog low-pass filter realized by an RC circuit (voltage divider). Its
transfer function in the Laplace domain reads

$$H(s) = \frac{1}{1 + sRC}.$$

Its impulse response is given by

$$h(t) = \begin{cases} \frac{1}{RC} e^{-\frac{t}{RC}} & ; t \ge 0 \\ 0 & ; \text{else} \end{cases}.$$

This is obviously an exponentially decaying function, whose samples can be described by

$$h(kT) = \frac{1}{RC} e^{-k\frac{T}{RC}} = \frac{1}{RC} \left(e^{-\frac{T}{RC}}\right)^k = \frac{1}{RC}\, a^k = h_k, \quad \text{for } k = 0, 1, \dots .$$

The z-transform of such a sequence is given by

$$H(z) = \sum_{k=0}^{\infty} h_k z^{-k} = \frac{1}{RC} \sum_{k=0}^{\infty} \left(\frac{a}{z}\right)^k = \frac{1}{RC}\,\frac{1}{1 - a z^{-1}} = \frac{1}{RC}\,\frac{1}{1 - e^{-\frac{T}{RC}} z^{-1}}.$$

Alternatively, let us now apply the bi-linear transform to H(s), i.e., $s = \frac{2}{T}\frac{z-1}{z+1}$. We obtain

$$H_B(z) = \frac{1}{1 + \frac{2RC}{T}\frac{z-1}{z+1}} = \frac{z+1}{z + 1 + \frac{2RC}{T}(z-1)} = \frac{1}{1 + \frac{2RC}{T}}\,\frac{1 + z^{-1}}{1 + \frac{T - 2RC}{T + 2RC}\, z^{-1}},$$

which is indeed a very different z-transform and thus also a different impulse response than
the previous one. The corresponding impulse response reads:

$$h_k = \left(\frac{1}{1 + \frac{2RC}{T}} - \frac{1}{1 - \frac{2RC}{T}}\right)\left(-\frac{T - 2RC}{T + 2RC}\right)^k; \quad k = 0, 1, \dots .$$

In Figure 1.3 we show this example for RC/T = 4 with T = 1. The smaller T is compared to
RC, the better the match.

Figure 1.3: Low-pass example for RC/T = 4. The impulse response of the continuous system
and that of the corresponding discrete system h_k match very well, while the bi-linear transform
results in a different impulse response (left) and consequently a different frequency response
(right).
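The following NumPy sketch (added for illustration, not part of the script) reproduces the comparison of Figure 1.3 numerically for RC/T = 4: it samples the continuous impulse response and computes the impulse response of the bilinear-transformed system by running its difference equation.

```python
import numpy as np

# RC lowpass H(s) = 1/(1 + sRC), sampled with period T; RC/T = 4 as in Figure 1.3.
T, RC = 1.0, 4.0
N = 30
k = np.arange(N)

# Sampled continuous-time impulse response h(kT) = (1/RC) * exp(-kT/RC)
h_sampled = np.exp(-k * T / RC) / RC

# Bilinear transform: H_B(z) = (1 + z^-1) / ((1 + 2RC/T) + (1 - 2RC/T) z^-1).
# Impulse response via the difference equation alpha*y_k + beta*y_{k-1} = x_k + x_{k-1}.
alpha, beta = 1 + 2 * RC / T, 1 - 2 * RC / T
x = np.zeros(N); x[0] = 1.0
h_bilinear = np.zeros(N)
for n in range(N):
    xm1 = x[n - 1] if n > 0 else 0.0
    ym1 = h_bilinear[n - 1] if n > 0 else 0.0
    h_bilinear[n] = (x[n] + xm1 - beta * ym1) / alpha

print(h_sampled[:5])
print(h_bilinear[:5])   # noticeably different impulse response, as Figure 1.3 suggests
```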

1.2.1 Stability of Linear, Time-invariant Systems


Theorem 1.2 (Stability) Assume a rational system function of a linear, time-invariant,
continuous-time system

$$H(s) = \frac{b_0 + b_1 s + \dots + b_p s^p}{1 + a_1 s + \dots + a_p s^p} = \sum_{i=1}^{p} \frac{d_i}{s - s_i}$$

with corresponding impulse response:

$$h(t) = \sum_{i=1}^{p} d_i e^{s_i t}\,U(t).$$

We call such a system bounded-input bounded-output (BIBO) stable if and only if ℜ{s_i} < 0.
We make use of the unit-step function here⁴

$$U(t) = \begin{cases} 0 & ; t < 0 \\ \tfrac{1}{2} & ; t = 0 \\ 1 & ; t > 0 \end{cases}.$$

BIBO stability refers to the property that if the input amplitude does not exceed a certain
value, max |x(t)| ≤ A < ∞, then correspondingly the output satisfies max |y(t)| ≤ B < ∞.

⁴ Note the subtlety for t = 0.

Proof: Consider the output y(t) of a linear system h(t) driven by the input x(t):

$$y(t) = h(t) * x(t) = \int_0^{\infty} h(\tau)\,x(t-\tau)\,d\tau$$
$$|y(t)| \le \max|x(t)| \int_0^{\infty} h(\tau)\,d\tau \le \max|x(t)| \int_0^{\infty} \sum_{i=1}^{p} d_i e^{s_i \tau}\,d\tau \le \max|x(t)| \sum_{i=1}^{p} \left[\frac{d_i}{s_i} e^{s_i \tau}\right]_0^{\infty} \le \max|x(t)| \sum_{i=1}^{p} \frac{d_i}{s_i}\,\left|e^{s_i \infty} - 1\right|.$$

The last term $e^{s_i \infty}$ can only be bounded if the real part of s_i is negative.
How about the stability of a time-discrete system? We simply apply the bi-linear transform
and find for ℜ{s} < 0:

$$|z|^2 - 1 < 0,$$

thus z needs to lie inside the unit circle for stability of a causal system.

Theorem 1.3 (Stability of the closed loop) The closed-loop system is BIBO stable if the
open-loop system satisfies:

$$\max_{\omega} |H(j\omega)G(j\omega)| < 1.$$

Proof: We find the input-output relation to be

$$\frac{Y(j\omega)}{X(j\omega)} = \frac{H(j\omega)}{1 + G(j\omega)H(j\omega)}.$$

As long as |G(jω)H(jω)| < 1, the system behaves stably, which concludes the proof. Note,
however, that even values larger than one can result in stable systems. Linear time-invariant
systems of this form allow very precise statements on stability (if and only if) by just
analyzing the open loop.

1.2.2 Causality
In principle, we know the relation between a continuous function and its Fourier-transform:
$$F(j\omega) = \int_{-\infty}^{\infty} f(t)e^{-j\omega t}\,dt$$

and the inverse Fourier transform:

$$f(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} F(j\omega)e^{j\omega t}\,d\omega = \frac{1}{2\pi}\int_{-\infty}^{\infty} \left(F_R(j\omega) + jF_I(j\omega)\right)e^{j\omega t}\,d\omega.$$

The complex-valued function F(jω) can be separated into magnitude and phase⁵:

$$F(j\omega) = F_R(j\omega) + jF_I(j\omega) = A(\omega)e^{j\phi(\omega)}.$$

If we restrict ourselves to real-valued functions f(t), the magnitude

$$A(\omega) = \sqrt{F_R(j\omega)^2 + F_I(j\omega)^2} = A(-\omega)$$

is an even function in ω, and the phase

$$\phi(\omega) = \operatorname{atan}\!\left(\frac{F_I(j\omega)}{F_R(j\omega)}\right) = \operatorname{arc}(F(j\omega)) = -\phi(-\omega)$$

is an odd function⁶.
A system is called causal if, for an excitation at t = t_0, no responses of the
system occur prior to t_0. In the impulse response h(τ), this is expressed by h(τ) = 0 ∀ τ < 0;
thus the impulse response has to be right-sided (Ger.: rechtsseitig). Let us first investigate a
general continuous function f(t). Such a function can always be decomposed into an even
and an odd part:

$$f_e(t) = \frac{1}{2}\left[f(t) + f(-t)\right]$$
$$f_o(t) = \frac{1}{2}\left[f(t) - f(-t)\right].$$

Accordingly, the function can be reassembled:

$$f(t) = f_e(t) + f_o(t).$$

Note that for causal functions and t > 0, the following must hold: f(t) = 2f_e(t) = 2f_o(t).

⁵ Please note that we use a description with positive phase. Some authors use the negative form e^{-jφ(ω)}, which accordingly leads to different terms with alternating signs.
⁶ Often in the literature the function arc(x) is used to compute the phase of a complex-valued argument x.

In general, the Fourier transform can be divided into its real and imaginary parts:

$$F(j\omega) = \int_{-\infty}^{\infty} f(t)e^{-j\omega t}\,dt = \int_{-\infty}^{\infty} (f_e(t) + f_o(t))e^{-j\omega t}\,dt = \int_{-\infty}^{\infty} (f_e(t) + f_o(t))\left\{\cos(\omega t) - j\sin(\omega t)\right\}dt$$
$$= \int_{-\infty}^{\infty} f_e(t)\cos(\omega t) - j f_o(t)\sin(\omega t) + \underbrace{f_o(t)\cos(\omega t) - j f_e(t)\sin(\omega t)}_{\int = 0}\,dt = F_R(j\omega) + jF_I(j\omega).$$


In this representation, we discover that the real part of the Fourier transform is only
determined by the even part of f(t), and the imaginary part is only determined by
the odd part. Furthermore, we recognize that the real part of the Fourier transform F_R(jω)
is an even function in ω and⁷ the imaginary part F_I(jω) is an odd function in ω. Hence,
we obtain the following two Fourier pairs

$$F_R(j\omega) = \int_{-\infty}^{\infty} f_e(t)\cos(\omega t)\,dt = 2\int_{0}^{\infty} f_e(t)\cos(\omega t)\,dt \qquad (1.22)$$
$$f_e(t) = \frac{1}{\pi}\int_{0}^{\infty} F_R(j\omega)\cos(\omega t)\,d\omega,$$

and

$$F_I(j\omega) = -\int_{-\infty}^{\infty} f_o(t)\sin(\omega t)\,dt = -2\int_{0}^{\infty} f_o(t)\sin(\omega t)\,dt$$
$$f_o(t) = -\frac{1}{\pi}\int_{0}^{\infty} F_I(j\omega)\sin(\omega t)\,d\omega.$$
For causal systems we have, for t > 0, f(t) = 2f_e(t) = 2f_o(t), and it can be concluded that

$$f(t) = \frac{2}{\pi}\int_{0}^{\infty} F_R(j\omega)\cos(\omega t)\,d\omega = -\frac{2}{\pi}\int_{0}^{\infty} F_I(j\omega)\sin(\omega t)\,d\omega.$$

Both the real and the imaginary part, F_R(jω) and F_I(jω), are tightly coupled. A repeated
insertion of this relation into (1.22) leads to:

$$F_R(j\omega) = -\frac{2}{\pi}\int_{0}^{\infty}\!\!\int_{0}^{\infty} F_I(j\omega')\sin(\omega' t)\cos(\omega t)\,d\omega'\,dt$$
$$F_I(j\omega) = -\frac{2}{\pi}\int_{0}^{\infty}\!\!\int_{0}^{\infty} F_R(j\omega')\cos(\omega' t)\sin(\omega t)\,d\omega'\,dt.$$
⁷ Because cos(ωt) is an even function in ω, which is not changed by the integration.

This description clearly shows that for causal, linear systems it is entirely sufficient to know either
the real or the imaginary part. Both are uniquely connected. This conclusion can
be formulated more clearly by converting the two equations.

To show this, let us start with the trivial equality for causal functions: $f(t) = f(t)U(t) + \frac{1}{2}f(0)\delta(t)$.
We added here the single point f(0) since the step function has U(0) = 1/2. Computing
the Fourier transform on both sides, we obtain

$$F_R(j\omega) + jF_I(j\omega) = \frac{1}{2\pi}\, F(j\omega) * \left(\pi\delta(\omega) + \frac{1}{j\omega}\right) + \frac{f(0)}{4\pi}$$
$$= \frac{1}{2}F(j\omega) + \frac{1}{2\pi j}\, F(j\omega) * \frac{1}{\omega} + \frac{f(0)}{4\pi}$$
$$= \frac{1}{\pi j}\, F(j\omega) * \frac{1}{\omega} + \frac{f(0)}{2\pi}$$

$$F_R(j\omega) = \frac{2}{\pi}\int_{0}^{\infty}\frac{F_I(j\omega')}{\omega - \omega'}\,d\omega' + F_R(\infty)$$
$$F_I(j\omega) = -\frac{2}{\pi}\int_{0}^{\infty}\frac{F_R(j\omega')}{\omega - \omega'}\,d\omega'.$$
The last two equations show that this is a convolution in the Fourier domain with the
function 1/ω. This corresponds to a multiplication with the sign function in the time
domain, which is exactly the operation necessary to link the even and odd parts with
each other:

$$f_e(t) = f_o(t)\,\mathrm{sgn}(t) + f(0)\delta(t).$$

The above description with the convolution in the Fourier domain is also called the Hilbert
transform of a function. The real part requires an additional term,

$$F_R(j\omega) = F_R(\infty) + \frac{2}{\pi}\int_{0}^{\infty}\frac{F_I(j\omega')}{\omega - \omega'}\,d\omega', \qquad (1.23)$$

due to the singular point in the unit-step function (and also in the sign function). Note
that the previous propositions are, strictly speaking, only valid for t > 0. For t = 0,
a separate investigation is needed. Because f_o(t = 0) = 0 must hold, but f_o(t) and
F_I(jω) are uniquely coupled, f(0) cannot be determined from F_I(jω). For this we
need a separate term, which occurs in (1.23) according to the final value theorem as F_R(∞).
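The tight coupling of the real and imaginary parts for causal signals can be checked numerically. The following NumPy sketch (illustrative, not from the script) takes a random causal sequence, discards the imaginary part of its spectrum, and reconstructs the sequence from the real part alone using f(0) = f_e(0) and f(k) = 2 f_e(k) for k > 0; the sequence length and support are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# A causal test sequence, zero-padded so that its support is shorter than half the FFT length.
N = 256
h = np.zeros(N)
h[:20] = rng.standard_normal(20)

F = np.fft.fft(h)
FR = F.real                      # keep only the real part of the spectrum

# The real part corresponds to the even part h_e of the sequence.
h_even = np.fft.ifft(FR).real

# For a causal sequence: h(0) = h_e(0), h(k) = 2*h_e(k) for k > 0 (within the support).
h_rec = np.zeros(N)
h_rec[0] = h_even[0]
h_rec[1:N // 2] = 2 * h_even[1:N // 2]

print(np.allclose(h_rec[:N // 2], h[:N // 2]))   # True: the real part alone determines h
```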

Further alternative formulations exist, but we only want to mention this one:

$$F_I(j\omega) = -\frac{2\omega}{\pi}\int_{0}^{\infty}\frac{F_R(j\omega')}{\omega^2 - \omega'^2}\,d\omega'$$
$$F_R(j\omega) = F_R(\infty) + \frac{2}{\pi}\int_{0}^{\infty}\omega'\,\frac{F_I(j\omega')}{\omega^2 - \omega'^2}\,d\omega'.$$

Hence we now know that the real and imaginary parts of the Fourier transform of causal
systems are connected by the Hilbert transform. Now the question arises whether there
exists a similar relation between the magnitude and the phase. It turns out that this is not
the case, but they are also not entirely independent of each other. A connection, however
in a very different form, is given by what is known as the Paley-Wiener theorem.

Theorem 1.4 (Paley-Wiener) Let H(jω) = A(ω)e^{jφ(ω)}. If for an even, non-negative
magnitude function A(ω) with energy constraint

$$\int_{-\infty}^{\infty} A^2(\omega)\,d\omega < \infty$$

the following holds:

$$\int_{-\infty}^{\infty} \frac{|\ln A(\omega)|}{1 + \omega^2}\,d\omega < \infty,$$

then and only then does the transfer function H(jω) describe a causal system.

For time-discrete systems, a corresponding theorem holds if

$$\int_{-\pi}^{\pi} |\ln A(\Omega)|\,d\Omega < \infty$$

is fulfilled. This proposition follows directly if the bi-linear transform (1.21) is applied. We
present here a normalized form in which the term T/2 of the bi-linear transform is simply
omitted. As we only test whether the terms are finite or not, scaling by a fixed constant
does not change the result. Please note that the conclusion of the Paley-Wiener theorem
establishes the existence of a phase function, but not its uniqueness. To see that, take a
valid phase function φ(ω) and add a delay e^{-jωT}: then φ̃(ω) = φ(ω) − ωT is also a valid phase
function.
Likewise, note that the theorem includes a very strong statement which is also invertible:
iff the system H(jω) is causal, then the previous theorem holds. In practice, the theorem
is often difficult to apply. If, for example, a magnitude function A(ω) is given for which it
was successfully verified by applying the theorem that it belongs to a causal system, the problem still
remains of how to find H(jω), e.g., in the form A²(ω) = H(jω)H*(jω). This is not a trivial
problem, and the question will confront us in the next chapter (see Example 2.13).

Example 1.11 Consider a low-pass filter of first order with the transfer function

$$H_{LP}(j\omega) = \frac{1}{1 + j\omega c}.$$

Thus, for the magnitude function, we have

$$A_{LP}(\omega) = \frac{1}{\sqrt{1 + \omega^2 c^2}}.$$

The first condition of the theorem is fulfilled:

$$\int_{-\infty}^{\infty} A_{LP}^2(\omega)\,d\omega = \frac{\pi}{|c|} < \infty.$$

However, the second condition cannot be verified as easily:

$$\int_{-\infty}^{\infty} \frac{|\ln A_{LP}(\omega)|}{1 + \omega^2}\,d\omega = \int_{-\infty}^{\infty} \frac{\left|\ln \frac{1}{\sqrt{1+\omega^2 c^2}}\right|}{1 + \omega^2}\,d\omega = \int_{-\infty}^{\infty} \frac{1}{2}\,\frac{\left|\ln \frac{1}{1+\omega^2 c^2}\right|}{1 + \omega^2}\,d\omega = \pi \ln(1 + |c|) < \infty.$$

Accordingly, this low-pass filter is a causal system.

Example 1.12 If we examine the magnitude function of an ideal low-pass filter, i.e.,

$$A_{IL}(\omega) = \begin{cases} 1 & ; |\omega| < \omega_g \\ 0 & ; \text{else} \end{cases},$$

the first condition is obviously fulfilled:

$$\int_{-\infty}^{\infty} A_{IL}^2(\omega)\,d\omega = \int_{-\omega_g}^{\omega_g} d\omega = 2\omega_g < \infty.$$

The second condition, though, is critical:

$$\int_{-\infty}^{\infty} \frac{|\ln A_{IL}(\omega)|}{1 + \omega^2}\,d\omega = \int_{-\omega_g}^{\omega_g} \frac{|\ln(1)|}{1 + \omega^2}\,d\omega + \int_{|\omega| > \omega_g} \frac{|\ln(0)|}{1 + \omega^2}\,d\omega.$$

The first term is zero, because ln(1) = 0, but the second term is not finite, since ln(0) = −∞.
Hence we conclude that the ideal low-pass filter is not a causal system and thus not realizable.
In principle, this is the case for all ideally band-limited systems.

1.2.3 Linear-Phase Property


Let us first have a look at the following Fourier-transform pairs:

$$f(t)e^{j\omega_o t} \;\Leftrightarrow\; F(j(\omega - \omega_o))$$
$$f(t)\cos(\omega_o t) \;\Leftrightarrow\; \tfrac{1}{2}\left\{F(j(\omega - \omega_o)) + F(j(\omega + \omega_o))\right\}$$
$$f(t - t_o) \;\Leftrightarrow\; F(j\omega)e^{-j\omega t_o}$$
$$f(t - t_o)e^{j\omega_o t} \;\Leftrightarrow\; F(j(\omega - \omega_o))e^{-j(\omega - \omega_o)t_o}.$$

For a narrow-band system, i.e., a system for which the considered bandwidth is small
compared to the reciprocal signal-duration (∆B∆T < 1), let us assume that the magnitude
function A(ω) = A(ω0 ) is constant and located at the centre frequency ωo between ωo − ωg
and ωo + ωg . Only the phase function is frequency-dependent. For that, we assume a Taylor
series which we truncate after the linear term:
H(j\omega) = A(\omega)e^{j\phi(\omega)}
\approx A(\omega_o)\begin{cases} e^{j[\phi(\omega_o)+(\omega-\omega_o)\phi'(\omega_o)]} & ; |\omega-\omega_o| < \omega_g \\ e^{j[\phi(-\omega_o)+(\omega+\omega_o)\phi'(\omega_o)]} & ; |\omega+\omega_o| < \omega_g \end{cases}
= A(\omega_o)\begin{cases} e^{j[\phi(\omega_o)+(\omega-\omega_o)\phi'(\omega_o)]} & ; |\omega-\omega_o| < \omega_g \\ e^{-j[\phi(\omega_o)-(\omega+\omega_o)\phi'(\omega_o)]} & ; |\omega+\omega_o| < \omega_g \end{cases}
We now examine a narrowband excitation signal X(jω) which is transmitted over the given
narrowband system. The following Figure 1.4 clarifies the connection. In the upper picture,
we identify the system-characteristics, given by a constant transmission around the centre
frequency ωo . In the lower picture, the transmitted signal X(jω) is shown which has been
shifted to the centre frequency by a frequency converter (mixer). We now calculate the


Figure 1.4: Transmission over a linear narrowband system.

output-signal Y(jω):

Y(j\omega) = \frac{A(\omega_o)}{2}\Bigl\{ X(j(\omega-\omega_o))\,e^{j(\phi(\omega_o)+(\omega-\omega_o)\phi'(\omega_o))} + X(j(\omega+\omega_o))\,e^{-j(\phi(\omega_o)-(\omega+\omega_o)\phi'(\omega_o))} \Bigr\}.

The inverse transformation then leads to

y(t) = \frac{A(\omega_o)}{2}\Bigl[ x\bigl(t+\phi'(\omega_o)\bigr)e^{j(\phi(\omega_o)+\omega_o t)} + x\bigl(t+\phi'(\omega_o)\bigr)e^{-j(\phi(\omega_o)+\omega_o t)} \Bigr]
     = A(\omega_o)\, x\bigl(t+\underbrace{\phi'(\omega_o)}_{-T_G}\bigr)\,\cos\Bigl(\omega_o\bigl(t+\underbrace{\tfrac{\phi(\omega_o)}{\omega_o}}_{-T_P}\bigr)\Bigr).

We identify two main parameters, which describe the behaviour of the narrowband system
fundamentally:

T_G = -\left.\frac{d\phi(\omega)}{d\omega}\right|_{\omega=\omega_o} = -\phi'(\omega_o) \quad \text{(group delay)}

T_P = -\frac{\phi(\omega_o)}{\omega_o} \quad \text{(phase delay)}.

Note that in the literature definitions with a positive sign (instead of the negative sign used in these lecture notes) also occur. This results from modeling the system as A(ω)e^{−jφ(ω)} instead of our definition A(ω)e^{jφ(ω)}; the group delay (Ger.: Gruppenlaufzeit) and the phase delay (Ger.: Phasenlaufzeit) then change sign. The narrowband description has the advantage that the (considerably costly) convolution is not needed. The transmission behaviour is sufficiently described by simpler operations.

The group delay TG thus indicates how fast a small group of energy passes the system
around ωo . If groups pass at different frequencies ωo with different velocities, the signal
becomes distorted. To obtain an undistorted (and only delayed) signal, a constant group
delay has to be required: the system has to have a linear phase such that the derivative is
constant.

Example 1.13 (Time-discrete system with linear phase) We investigate a linear,


time-discrete system H(ejΩ ) which is constructed from a second system, G(ejΩ ) (with real-
valued coefficients), in the following way:

H(ejΩ ) = G(ejΩ )G(e−jΩ )e−j∆Ω .

The integer constant ∆ > 0 is chosen such that a causal system is formed. We then recognize that A(Ω) = G(e^{jΩ})G(e^{−jΩ}) and φ(Ω) = −∆Ω. Thus, it is a system with linear phase, independent of the choice of G(e^{jΩ}).
Let us therefore assume that G(ejΩ ) = 1 + ae−jΩ , corresponding to the series gk = {1,a}
together with ∆ = 1. Then the following holds:

H(ejΩ ) = a + (1 + a2 )e−jΩ + ae−j2Ω


= e−jΩ [1 + a2 + 2a cos(Ω)].

The impulse response of the filter thus is: hk = {a,1 + a2 ,a}. This means that it is a
symmetric filter, and this also holds true, independently of the choice of G(ejΩ ). Hence, we
conclude that symmetric filters which are constructed according to this rule, are always of
linear phase8 .
8 Does the opposite also hold? Find out for yourself. See Example 3.16.
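A minimal numerical sketch of this construction (assumed values, not from the original script) confirms the symmetric impulse response and the exactly linear phase:

    import numpy as np

    # assumed illustration of H(e^{jO}) = G(e^{jO}) G(e^{-jO}) e^{-j Delta O}
    a = 0.7
    g = np.array([1.0, a])                      # G(e^{jO}) = 1 + a e^{-jO}
    h = np.convolve(g, g[::-1])                 # = {a, 1+a^2, a}, symmetric
    print(h)

    Omega = np.linspace(-np.pi, np.pi, 7)
    H = np.array([np.sum(h * np.exp(-1j*Om*np.arange(len(h)))) for Om in Omega])
    A = 1 + a**2 + 2*a*np.cos(Omega)
    print(np.allclose(H, np.exp(-1j*Omega) * A))   # True: phase is exactly -Omega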

1.2.4 Passivity and Activity


Definition 1.2 A system is called passive if at its output less energy emerges than is fed into its input (after the decay of memory effects) for every input signal. This is equivalent to requiring

max_Ω |H(e^{jΩ})| < 1.

Definition 1.3 If for some input signal the energy at the output is bigger than at the input, the system is called active:

max_Ω |H(e^{jΩ})| > 1.

See also the discussion of system gains in the context of norms in Chapter 2.
The gain can be measured, for example, by examining the ratio of output- and input-energy:

\frac{\sum_{k=0}^{K} y_k^2}{\sum_{k=0}^{K} x_k^2} = \frac{\mathbf{y}^T\mathbf{y}}{\mathbf{x}^T\mathbf{x}} \overset{?}{<} 1.

Next to passive and active systems there also exists a limit case, that is, a system that is neither active nor passive. We call this an allpass. For an allpass the method of comparing input and output energy has to be questioned, as it needs to be checked whether this property is fulfilled for all signals x_k or just for some. Apparently, this description seems not very practical, since all possible signals would have to be tested.
A better answer is delivered by the classical approach in the Fourier-domain. Let us
have a look at the following example.

Example 1.14 A linear, time-invariant, time-discrete system is given by its transfer func-
tion: Y (ejΩ ) = H(ejΩ )X(ejΩ ).
Now let us assume that the maximum (minimum) of the transfer function occurs at a
specific frequency Ω+ (Ω− ). Then, all signals X(ejΩ ) with spectral components unequal to
Ω+ (Ω− ) will come out of such system with less (more) energy than components at Ω+ (Ω− ).
To ensure the same energy at the output and the input, the following must hold:

\max_{X(e^{j\Omega})\neq 0} \frac{|Y(e^{j\Omega})|^2}{|X(e^{j\Omega})|^2} = \max_{\Omega}|H(e^{j\Omega})|^2 = \min_{\Omega}|H(e^{j\Omega})|^2 = 1.

If a system emits exactly the same amount of energy as it gathers, and thus no energy is added or consumed, it is called an all-pass.

Definition 1.4 A linear time-invariant (finite order) system is called allpass, if the following holds:

\min_{\Omega} |H(e^{j\Omega})| = \max_{\Omega} |H(e^{j\Omega})| = 1.

Example 1.15 Let 1/H^*(e^{jΩ}) be a causal, stable system. Then consider the following linear time-invariant system:

G(e^{j\Omega}) = \frac{H(e^{j\Omega})}{H^*(e^{j\Omega})} = \frac{A(\Omega)e^{j\phi(\Omega)}}{A(\Omega)e^{-j\phi(\Omega)}} = e^{j2\phi(\Omega)}.

This is obviously an all-pass, independent of the choice of the system H(ejΩ ).

Theorem 1.5 The poles and zeroes of an allpass system are symmetrical with respect to the
unit circle. The phase of an allpass is monotonically decreasing, or equivalently, φ0 (Ω) < 0.

Proof: Let us consider the following system of order one:

H(z) = -a^* + z^{-1}

with the corresponding Fourier-transform

H(e^{j\Omega}) = -a^* + e^{-j\Omega}.

We construct an allpass and obtain

G(z) = a^*\,\frac{z - \frac{1}{a^*}}{z - a} = \frac{z a^* - 1}{z - a}

with a pole at z = a and a zero at 1/a^*. Its Fourier-transform reads:

G(e^{j\Omega}) = \frac{e^{j\Omega}a^* - 1}{e^{j\Omega} - a} = -e^{j\Omega}\,\frac{e^{-j\Omega} - a^*}{e^{j\Omega} - a}.

The magnitude of this is

|G(e^{j\Omega})| = 1
and thus we have an allpass. Thus, for all polynomial combinations of zeroes and poles we
can only achieve the allpass property if we have a match of zero and pole (symmetrical
with respect to the unit circle means: pole at a, zero at 1/a∗ ). Concatenating higher
order systems provides the same result. Note that this is a property of an allpass, thus a
necessary condition, not a sufficient condition. For example, just multiply the allpass by
a constant (not equal to one) and it is not an allpass any more, but the poles and zeroes
remain unchanged.

In order to show the second property of the theorem, we first have to take a small detour.

Example 1.16 (Circle of Apollonius) If we choose the particular transfer function

H(e^{jΩ}) = a^* − e^{−jΩ}

and apply the previous example, we obtain G(e^{jΩ}) = e^{j2φ(Ω)}. Figure 1.5 illustrates the phase plot for a = 0.5 + 0.3j = R_a + jI_a with the phase function φ(Ω) = arc(H(e^{jΩ})):

\phi(\Omega) = \operatorname{atan}\left(\frac{-I_a + \sin(\Omega)}{R_a - \cos(\Omega)}\right) = \operatorname{atan}\left(\frac{-0.3 + \sin(\Omega)}{0.5 - \cos(\Omega)}\right).

Figure 1.5 clarifies the phase-relation, if we sweep the frequency Ω from −π to π. The


Figure 1.5: Phase plot of the all-pass for a = 0,5 + 0,3j.

angle φ(Ω) decreases (becomes more negative) if we go from Ω = −π to +π. In the figure we also find the angle ψ(Ω) that monotonically decreases as well. Note that φ(Ω) is related to ψ(Ω):

\psi(\Omega) = \operatorname{arc}\left(\frac{\frac{1}{a^*} - e^{j\Omega}}{a - e^{j\Omega}}\right)

and

G(e^{j\Omega}) = \frac{a^* - e^{-j\Omega}}{a - e^{j\Omega}} = -a^* e^{-j\Omega}\,\underbrace{\frac{\frac{1}{a^*} - e^{j\Omega}}{a - e^{j\Omega}}}_{\operatorname{arc}(\cdot)=\psi(\Omega)}.

As arc(G(e^{jΩ})) = 2φ(Ω), we finally find

2φ(Ω) = ψ(Ω) − Ω + arc(a^*) + π.

According to the Greek mathematician Apollonius (Apollonius of Perga [Pergaeus], ca. 262 BC - ca. 190 BC), who found this relationship for conjugate complex numbers (at a time when they had not even been invented), the property of decreasing phase when moving along a circle is associated with the circle of Apollonius. See Figure 1.6 for a better understanding.


Figure 1.6: Circle of Apollonius: starting at point P at Ω = −π in the left picture the angle
ψ(Ω) becomes more negative when moving to point P’ as shown in the right part of the
figure.

Consider now Figure 1.6. Let the center of the unit circle be denoted by C. The line PN can be constructed by adding the lines PC = −e^{jΩ} and CN = a in a vector manner: PN = a − e^{jΩ}, while PM = 1/a^* − e^{jΩ}. The angle between the lines PN and PM is ψ(Ω).

Why does P lie on the unit circle? For a better understanding of this, let us consider the points P, M and N, such that M is the point at which 1/a^* = (R_a + jI_a)/(R_a^2 + I_a^2) is located in the complex plane, and N is the point at which a = R_a + jI_a is located. The point P marks an arbitrary point on the unit circle. The complex values a and 1/a^* have the same angle in the complex plane.9

9 One can also compute this analytically by taking the derivative of the phase function. The straightforward method is to compute ∂/∂Ω atan((sin(Ω) − I_a)/(R_a − cos(Ω))). The sign of the derivative is then determined by −1 + R_a cos(Ω) + I_a sin(Ω) < 0, which holds as long as a lies inside the unit circle.

Now we investigate all points for which the following holds:

\frac{PN}{PM} = k = \frac{\bigl|R_p - R_a + j(I_p - I_a)\bigr|}{\bigl|R_p - \frac{R_a}{R_a^2+I_a^2} + j\bigl(I_p - \frac{I_a}{R_a^2+I_a^2}\bigr)\bigr|}.

The conclusion of Apollonius is that all of these points P are located on the unit circle. Thus, we obtain

k^2 = \frac{(R_p-R_a)^2 + (I_p-I_a)^2}{\bigl(R_p-\frac{R_a}{R_a^2+I_a^2}\bigr)^2 + \bigl(I_p-\frac{I_a}{R_a^2+I_a^2}\bigr)^2}.

With the approach R_p = R\cos(\alpha);\; I_p = R\sin(\alpha) the equivalent description in R and \alpha can be found:

k^2 = (R_a^2+I_a^2)\,\frac{R^2 + R_a^2 + I_a^2 - 2R(R_a\cos(\alpha) + I_a\sin(\alpha))}{R^2(R_a^2+I_a^2) + 1 - 2R(R_a\cos(\alpha) + I_a\sin(\alpha))}.
This equation is only solvable for R = 1, i.e., on the unit circle.

We have shown in this example the property for a single-stage allpass filter G(e^{jΩ}). The property also holds if several such systems are connected in a row, as the phases simply add up.
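Both allpass properties (unit magnitude and monotonically decreasing phase) are easy to verify numerically; the following Python sketch is an illustration with an assumed value of a, not part of the original script:

    import numpy as np

    # assumed numerical check of the first-order allpass G(z) = (z a* - 1)/(z - a)
    a = 0.5 + 0.3j
    Omega = np.linspace(-np.pi, np.pi, 2001)
    z = np.exp(1j * Omega)

    G = (z * np.conj(a) - 1) / (z - a)
    print(np.allclose(np.abs(G), 1.0))      # True: unit magnitude everywhere

    phase = np.unwrap(np.angle(G))
    print(np.all(np.diff(phase) < 0))       # True: phase decreases monotonically (|a| < 1)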

1.2.5 Minimum phase property


The question arises whether there exists a unique relation between the magnitude and the
phase of a linear system, or if these parameters can be chosen independently of each other
for a causal system. Obviously, this is not the case because we just saw that all-passes
exist which leave the magnitude untouched but change the phase. For this, we examine the
transfer function of a time-discrete system:
H(e^{j\Omega}) = \frac{N(e^{j\Omega})}{D(e^{j\Omega})},

given by the zeros of the numerator polynomial N(z) and the poles, i.e., the zeros of the denominator polynomial D(z). We assume that both polynomials N(e^{jΩ}) and D(e^{jΩ}) have complex-valued coefficients10, and thus:
N (ejΩ ) = a0 + a1 e−jΩ + a2 e−j2Ω + ... = No (ejΩ )Ni (ejΩ )
N ∗ (ejΩ ) = a∗0 + a∗1 ejΩ + a∗2 ej2Ω + ... = No∗ (ejΩ )Ni∗ (ejΩ ).
10 Please note that for polynomials with real-valued coefficients, N^*(e^{jΩ}) = N(e^{−jΩ}) holds. Consider, for example, N(z) = 1 − az^{−1}. Then N(e^{jΩ}) = 1 − ae^{−jΩ} and the zero is located at z_o = a. If we, on the other hand, consider N^*(e^{jΩ}) = 1 − a^*e^{jΩ}, then we find the zero located at z_o = 1/a^*. If a is outside the unit circle, then 1/a^* is inside.

The polynomial is decomposed into two fragments No (ejΩ ) and Ni (ejΩ ), of which the zeroes
of the corresponding fragments of N (z) = Ni (z)No (z) either all lie inside, Ni (z), or all
outside of the unit circle, No (z). Note that the location of the zeroes is affected by the
complex conjugate operation. All zeroes of No∗ (z) lie inside of the unit circle, whereas all
zeroes of Ni∗ (z) lie outside of it. By cleverly converting,

H(e^{j\Omega}) = \frac{N(e^{j\Omega})}{D(e^{j\Omega})} = \frac{N_i(e^{j\Omega})N_o(e^{j\Omega})}{D(e^{j\Omega})} = \frac{N_o(e^{j\Omega})}{N_o^*(e^{j\Omega})}\;\frac{N_o^*(e^{j\Omega})N_i(e^{j\Omega})}{D(e^{j\Omega})} = H_a(e^{j\Omega})\,H_m(e^{j\Omega}),

we detached an all-pass Ha (ejΩ ) and a system Hm (ejΩ ) of which no all-pass can be separated
anymore. All zeroes of the system Hm (ejΩ ) are located inside of the unit circle. It is
called minimum phase system. Figure 1.7 shows the pole-zero plot of the low-pass with
z-transformed transfer function
H(z) = \frac{\bigl(z-\frac{1}{a^*}\bigr)\bigl(z-\frac{1}{a}\bigr)}{z(z-b)} = \frac{\bigl(1-\frac{1}{a^*}z^{-1}\bigr)\bigl(1-\frac{1}{a}z^{-1}\bigr)}{1-bz^{-1}} \qquad (1.24)

with the parameters a = 0.5 + 0.3j and b = 0.8. We find the Fourier transform simply by replacing z = e^{jΩ}:

H\bigl(e^{j\Omega}\bigr) = \frac{\bigl(e^{j\Omega}-\frac{1}{a^*}\bigr)\bigl(e^{j\Omega}-\frac{1}{a}\bigr)}{e^{j\Omega}\bigl(e^{j\Omega}-b\bigr)}.

Figure 1.7: Conversion of a low-pass filter into an all-pass and a minimum phase low-pass
filter. Crosses denote poles and circles denote zeros.

Example 1.17 We investigate the phase of a linear transmission system H(ejΩ ) =


A(Ω)ejφ(Ω) and the corresponding minimum phase system Hm (ejΩ ) = A(Ω)ejφm (Ω) . Fig-
ure 1.8 shows the proper phase function; φm (Ω) = arc{Hm (ejΩ )} is the upper curve. In the

Figure 1.8: Phase plot of a linear transmission system H(e−jΩ ) from Example 1.16 (low-
pass H(z) = a∗ − z −1 and the corresponding minimum phase system Hm (z) = 1 − az −1 ).
Left: phase, right: derivative of phase.

right part of the picture, it can be observed that φ′_m(Ω) ≥ φ′(Ω); the phase change of the minimum phase system is thus smaller in magnitude (more positive here also means closer to zero)!

Definition 1.5 A time-discrete, rational minimum-phase system Hm (ejΩ ) with phase


φm (Ω) has all zeroes and poles located inside the unit circle.

Theorem 1.6 Given two causal LTI systems with the same amplitude function but different phase functions, the minimum phase system has the property φ′(Ω) < φ′_m(Ω).

Proof: Consider a second rational transmission system H(ejΩ ) with φ(Ω) under the con-
dition:
|H(ejΩ )| = |Hm (ejΩ )|.
Then, furthermore, it must hold that

H(ejΩ ) |H(ejΩ )|ejφ(Ω) ejφ(Ω)


Ha (ejΩ ) = = = = ejφa (Ω) ,
Hm (ejΩ ) |Hm (ejΩ )|ejφm (Ω) ejφm (Ω)

representing an allpass. For such a system, however, we have just learned that its phase derivative must satisfy φ′_a(Ω) < 0. Likewise, we can see that

φ_a(Ω) = φ(Ω) − φ_m(Ω),

or, equivalently,

φ′_a(Ω) = φ′(Ω) − φ′_m(Ω) < 0,

and the statement of the theorem follows.

Recall that the bi-linear transform uniquely maps zeros and poles from the left half plane into the unit circle. Therefore, we can conclude similar results for the minimum phase property of time-continuous systems. Rather than sorting the zeros into two sets, one inside and one outside the unit circle, for time-continuous systems we have two sets, one with zeros in the left and one with zeros in the right half plane.
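The decomposition into a minimum phase part and an allpass can also be carried out numerically by reflecting the zeros that lie outside the unit circle. The following Python sketch (an illustration with assumed example zeros, not part of the original script) demonstrates the idea for an FIR numerator:

    import numpy as np

    # assumed example: one zero outside and one inside the unit circle
    z_out, z_in = 1.6, 0.4
    N  = np.poly([z_out, z_in])                      # numerator coefficients of H
    Nm = np.poly([1.0/np.conj(z_out), z_in])         # outside zero reflected to 1/z_out*

    Omega = np.linspace(-np.pi, np.pi, 1001)
    z = np.exp(1j*Omega)
    H  = np.polyval(N,  z)
    Hm = np.polyval(Nm, z) * np.abs(z_out)           # scale so that |Hm| = |H|

    print(np.allclose(np.abs(H), np.abs(Hm)))        # True: same magnitude
    print(np.allclose(np.abs(H/Hm), 1.0))            # True: H/Hm is an allpass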

1.3 Sampling Theorems


Before we look into sampling, we recall an important relation, the so-called Poisson summation formula.

1.3.1 Poisson Summation Formula


The Poisson summation formula describes a relation between infinite, time-discrete series and sums of Fourier components. The main statement is:

\frac{T}{2\pi}\sum_{k=-\infty}^{\infty}\exp(jk\omega T) = \sum_{n=-\infty}^{\infty}\delta\left(\omega - \frac{2n\pi}{T}\right).

This can be shown by a Fourier-series expansion. The sum of the functions δ(ω − 2nπ/T) is periodic in 2π/T and can thus be expanded into a Fourier series of the form:

\sum_{n=-\infty}^{\infty}\delta\left(\omega - \frac{2n\pi}{T}\right) = \sum_{k=-\infty}^{\infty} c_k \exp(jk\omega T). \qquad (1.25)

The Fourier series coefficients are obtained by integrating over one period of length 2π/T:

c_k = \frac{T}{2\pi}\int_{-\pi/T}^{\pi/T}\delta(\omega)\exp(-jk\omega T)\,d\omega = \frac{T}{2\pi},

explaining the identity of (1.25).

1.3.2 Equidistant Sampling


Time-discrete series often occur due to sampling of time-continuous functions. If this is
the case, the question arises under which circumstances a continuous function f (t) can be
uniquely described by its samples f (kT ) = fk , or f (tk ), and thus can be reproduced from
those. The answer in case of equidistant sampling is given in the following theorem.

Theorem 1.7 (Sampling theorem for equidistant sampling) A continuous function


f (t) can uniquely be represented by its sampling values f (kT ), if it is band-limited with
|ω| < π/T .

Proof:
Consider an equidistant sampling with interval T. According to the theorem, there must exist an interpolation function p(t) with which f(t) can be obtained from the series f_k:

\hat f(t) = \sum_{k=-\infty}^{\infty} f(kT)\,T\,p(t-kT) = T\sum_{k=-\infty}^{\infty} f_k\,p(t-kT),

where we introduced the reconstructed signal f̂(t). We would like to investigate under which conditions f̂(t) = f(t), or equivalently, F̂(jω) = F(jω). The above equation can be interpreted as a clever approach in which the original question is now reduced to the query of how the interpolator p(t) has to be formed in order to obtain f̂(t) = f(t). To answer this, we consider the Fourier-transform F̂(jω) of the continuous function f̂(t):

\hat F(j\omega) = T\sum_{k=-\infty}^{\infty} f_k\,P(j\omega)e^{-jk\omega T} = P(j\omega)\,T\sum_{k=-\infty}^{\infty} f_k\,e^{-jk\omega T} = P(j\omega)\,F_A(j\omega).

Thus, we introduced an auxiliary function FA (jω) which has no physical meaning 11 . By


applying the Poisson summation formula (see previous Section 1.3.1 and appendix of this
11 Please note that often (1.26) is interpreted in the literature as physical sampling. This, however, is not true: no physical device offers an infinite bandwidth. In reality a gate/switch opens for a specified time and charges a capacitor, resulting in a completely different behavior. Modeling sampling by Diracs is thus not close to reality and in fact is not required for the understanding of sampling.

chapter), the auxiliary function fulfills

f_A(t) = T\sum_{k=-\infty}^{\infty} f_k\,\delta(t-kT) \qquad (1.26)

F_A(j\omega) = \sum_{k=-\infty}^{\infty} F\left(j\left(\omega + \frac{2k\pi}{T}\right)\right).

Hence:

\hat F(j\omega) = P(j\omega)F_A(j\omega) = P(j\omega)\sum_{k=-\infty}^{\infty} F\left(j\left(\omega + \frac{2k\pi}{T}\right)\right).

This equation adopts a practical form if F(jω) is band-limited with |ω| < π/T. In that case, F(jω) = F̂(jω) for |ω| < π/T and P(jω) = 1 in this region. Thus, for the interpolator, the following must hold:

P(j\omega) = \begin{cases} 1 & ; |\omega| < \pi/T \\ 0 & ; \text{else} \end{cases}.

However, this is nothing more than an ordinary ideal low-pass. Hence:

P(j\omega) \;\Leftrightarrow\; p(t) = \frac{\sin(\pi t/T)}{\pi t} = \frac{1}{T}\,\mathrm{sinc}\!\left(\frac{\pi t}{T}\right),

f(t) = \sum_{k=-\infty}^{\infty} f_k\,\mathrm{sinc}\!\left(\frac{\pi}{T}(t-kT)\right). \qquad (1.27)

However, this equation allows for other solutions as well. Consider a band-limited function
F (jω) in |ω+k2π/T | < π/T and a band-pass P (jω) in that domain. That is also a solution.
Figure 1.9 clarifies the situation.

Note that the sampling theorem provides a sufficient condition not a necessary one.
This explains why literally hundreds of variants exist. We will re-interpret our result in
Equation (1.27) in terms of metrics, see Exercise 2.11 and approximations, see Exercise 3.22.

Note also that the interpolation equation (1.27) allows a so-called "re-sampling":

f(mT_1) = \sum_{k=-\infty}^{\infty} f_k\,\mathrm{sinc}\!\left(\frac{\pi}{T}(mT_1-kT)\right).

Consequently, series that have been sampled with interval T can be converted to series with sampling interval T_1. However, note that in doing so, the sampling theorem must not be violated.

Figure 1.9: For the sampling theorem.

1.3.3 Further Sampling Theorems


A continuous function f(t) shall now not be directly accessible but only via the output of a linear system H(jω). The sampled output-values of the system are:

g_k = g(kT) = \frac{1}{2\pi}\int_{-\infty}^{\infty} F(j\omega)H(j\omega)\exp(j\omega kT)\,d\omega.
Is it possible to conclude from those values g_k, by interpolation, onto f(t)?

\hat f(t) = T\sum_{k=-\infty}^{\infty} g_k\,p(t-kT).

How does the interpolation function now have to be constructed so that f(t) = f̂(t)?

We examine again the Fourier-transform, of which we know that it has a periodic spectrum:

\hat F(j\omega) = P(j\omega)\sum_{k=-\infty}^{\infty} G\left(j\left(\omega+\frac{2k\pi}{T}\right)\right) = P(j\omega)\sum_{k=-\infty}^{\infty} F\left(j\left(\omega+\frac{2k\pi}{T}\right)\right)H\left(j\left(\omega+\frac{2k\pi}{T}\right)\right).

This equation can be solved by choosing a band-limited version of 1/H(jω) for P(jω) to compensate for the effect of the prefiltering in H(jω):

p(t) = \frac{T}{2\pi}\int_{-\pi/T}^{\pi/T}\frac{\exp(j\omega t)}{H(j\omega)}\,d\omega \;\Leftrightarrow\; P(j\omega) = \begin{cases} 1/H(j\omega) & ; |\omega| < \pi/T \\ 0 & ; \text{else} \end{cases}.

This idea can be generalized, see Figure 1.10. For each sampled signal and its spectrum,


Figure 1.10: On the generalization of the sampling theorem.

the following holds:

g_k(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} H_k(j\omega)F(j\omega)\exp(j\omega t)\,d\omega, \qquad G_k(j\omega) = H_k(j\omega)F(j\omega).

The question which arises therewith is the following: is it possible to reconstruct the function f(t) from the values g_k(nT); k = 1,2,...,m, sampled in the interval T (thus m samples per period T), with a linear interpolation?

f(t) = \sum_{n=-\infty}^{\infty}\sum_{k=1}^{m} g_k(nT)\,p_k(t-nT). \qquad (1.28)

Let us consider three application examples:


Example 1.18 Let Hk (jω) be interpolators, with Hk (jω) = exp(−jkT ω/m); k = 1,2,...,m.
Then we again obtain the sampling theorem for equidistant sampling with sampling rate
T /m. The solution then necessarily is given by an ideal band-pass (low-pass) of the smaller
bandwidth π/(T m).

Example 1.19 Let Hk (jω); k = 1,2,...,m be interpolators, with non-equidistant delays,


which nevertheless periodically repeat themselves (with period T ), see also Figure 1.11.
Under which constraints will the interpolation (1.28) work?

Figure 1.11: Sampling at non-equidistant but periodically repeating instants with period T.

Example 1.20 Let Hk (jω) k = 1,2,..m be arbitrary linear systems. Is it possible to derive
a conclusion concerning these transfer functions, so that the interpolation equation (1.28)
is realizable?

The solution for this problem was found by Papoulis [2], and can be summarized in the
following theorem:

Theorem 1.8 An interpolation of the function f(t) with the interpolation

f(t) = \sum_{n=-\infty}^{\infty}\sum_{k=1}^{m} g_k(nT)\,p_k(t-nT)

is achieved iff the system of equations

\begin{bmatrix} H_1(j\omega) & H_2(j\omega) & \cdots & H_m(j\omega)\\ H_1\bigl(j(\omega+\frac{2\pi}{T})\bigr) & H_2\bigl(j(\omega+\frac{2\pi}{T})\bigr) & \cdots & H_m\bigl(j(\omega+\frac{2\pi}{T})\bigr)\\ \vdots & & & \vdots\\ H_1\bigl(j(\omega+(m-1)\frac{2\pi}{T})\bigr) & \cdots & & H_m\bigl(j(\omega+(m-1)\frac{2\pi}{T})\bigr) \end{bmatrix}
\begin{bmatrix} Y_1(j\omega,t)\\ Y_2(j\omega,t)\\ \vdots\\ Y_m(j\omega,t) \end{bmatrix}
= \begin{bmatrix} 1\\ \exp\bigl(\frac{j2\pi t}{T}\bigr)\\ \exp\bigl(2\frac{j2\pi t}{T}\bigr)\\ \vdots\\ \exp\bigl((m-1)\frac{j2\pi t}{T}\bigr) \end{bmatrix} \qquad (1.29)

is solvable for all ω (determinant unequal to zero). The interpolators are then given by

p_k(t) = \frac{T}{2\pi}\int_{-m\frac{2\pi}{T}}^{-m\frac{2\pi}{T}+\frac{2\pi}{T}} Y_k(\omega,t)\,e^{j\omega t}\,d\omega. \qquad (1.30)

Note that the terms Y_k(ω,t) do not describe the Fourier-transform of p_k(t).

Proof:
Let us first consider the response to f(t) which occurs at the output of H_k(jω): g_k(t), sampled at the time instant nT, thus g_k(t − nT). This is equal to the response to an excitation f(t − nT) of the system H_k(jω), and also identical to the response of the system H_k(jω)exp(jnωT) excited by f(t). So, if

f(t) = \sum_{n=-\infty}^{\infty}\sum_{k=1}^{m} g_k(nT)\,p_k(t-nT)

should hold, then for the special excitation f(τ) = exp(jω_0(t+τ)) and its Fourier-transform F(jω) = exp(jω_0 t)δ(ω − ω_0), the following must hold:

\exp(j\omega_0(t+\tau)) = \sum_{k=1}^{m}\sum_{n=-\infty}^{\infty} g_k(nT)\,p_k(\tau-nT)
= \sum_{k=1}^{m}\sum_{n=-\infty}^{\infty}\frac{1}{2\pi}\int_{-\infty}^{\infty} G_k(j\omega')e^{jn\omega' T}d\omega'\;p_k(\tau-nT)
= \sum_{k=1}^{m}\sum_{n=-\infty}^{\infty}\frac{1}{2\pi}\int_{-\infty}^{\infty} H_k(j\omega')F(j\omega')e^{jn\omega' T}d\omega'\;p_k(\tau-nT)
= \sum_{k=1}^{m}\sum_{n=-\infty}^{\infty}\frac{1}{2\pi}\int_{-\infty}^{\infty} H_k(j\omega')e^{j\omega_0 t}\delta(\omega'-\omega_0)e^{jn\omega' T}d\omega'\;p_k(\tau-nT)
= \sum_{k=1}^{m}\sum_{n=-\infty}^{\infty} H_k(j\omega_0)\,e^{j\omega_0 t}\,e^{jn\omega_0 T}\,p_k(\tau-nT).

Cancelling e^{jω_0 t} on both sides leads to the following identity:

\exp(j\omega\tau) = \sum_{k=1}^{m} H_k(j\omega)\sum_{n=-\infty}^{\infty} e^{jn\omega T}\,p_k(\tau-nT). \qquad (1.31)

Furthermore, (1.30) is in force, and thus

p_k(\tau-nT) = \frac{T}{2\pi}\int_{-m\frac{2\pi}{T}}^{-m\frac{2\pi}{T}+\frac{2\pi}{T}} Y_k(\omega,\tau)\,e^{j\omega(\tau-nT)}\,d\omega. \qquad (1.32)

Note that the right side of (1.29) is periodic in 2π/T. Accordingly, Y_k(ω,τ) must also show this period. Thus, the term Y_k(ω,τ)exp(jωτ) can be expanded into a Fourier series on the interval \bigl(-m\frac{2\pi}{T},\,-m\frac{2\pi}{T}+\frac{2\pi}{T}\bigr):

Y_k(\omega,\tau)\exp(j\omega\tau) = \sum_{n=-\infty}^{\infty} p_k(\tau-nT)\exp(j\omega nT) \qquad (1.33)

with the Fourier-coefficients p_k(τ − nT). Insertion of the Fourier series in (1.31) leads to:

\exp(j\omega\tau) = \sum_{k=1}^{m} H_k(j\omega)\sum_{n=-\infty}^{\infty} p_k(\tau-nT)\exp(j\omega nT) = \sum_{k=1}^{m} H_k(j\omega)\,Y_k(j\omega,\tau)\exp(j\omega\tau).

With that, we obtain the first condition in form of an equation:

\sum_{k=1}^{m} H_k(j\omega)\,Y_k(j\omega,\tau) = 1.

Further m−1 linear equations can be obtained by determining the Fourier series (1.33) on the other intervals \bigl(-(m-l)\frac{2\pi}{T},\,-(m-l)\frac{2\pi}{T}+\frac{2\pi}{T}\bigr);\; l = 1,2,...,m−1, so that we obtain the full system of equations already stated in (1.29).

Interpretation: For the system of equations to be solvable, the determinant has to be


unequal to zero. For example, this is fulfilled if Hk (jω) = (jω)k but also if Hk (jω) =
exp(jωαk ), as long as |αk | < T /2 and all αk are different, corresponding to f (nT + αk ).
Likewise, the system of equations is also fulfilled if all functions Hk (jω) are non-overlapping
band-passes (low-passes), which together fill up the required bandwidth BF .
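The solvability condition can be probed numerically for a concrete choice of the systems H_k(jω). The following Python sketch (an assumed example with m = 2 delay systems, not part of the original script) checks that the determinant of the matrix in (1.29) does not vanish:

    import numpy as np

    # assumed numerical check of the solvability of (1.29) for m = 2
    # with H_k(jw) = exp(j*w*alpha_k), i.e., sampling f(nT + alpha_k)
    T = 1.0
    alpha = np.array([0.1, 0.35])        # two different offsets, |alpha_k| < T/2

    def papoulis_matrix(w):
        rows = []
        for l in range(2):
            wl = w + l * 2*np.pi/T
            rows.append(np.exp(1j * wl * alpha))
        return np.array(rows)

    w_test = np.linspace(-np.pi/T, np.pi/T, 201)
    dets = np.array([np.linalg.det(papoulis_matrix(w)) for w in w_test])
    print(np.min(np.abs(dets)) > 0)      # True: solvable for all tested w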

Appendix
As the derivation of the auxiliary functions in (1.26) is a bit lengthy, we moved it to this appendix. Consider the Fourier pair:

f_A(t) = T\sum_{k=-\infty}^{\infty} f_k\,\delta(t-kT)

F_A(j\omega) = \sum_{k=-\infty}^{\infty} F\left(j\left(\omega+\frac{2k\pi}{T}\right)\right).

The identity

F_A(j\omega) = T\sum_{k=-\infty}^{\infty} f_k\exp(-jk\omega T) = \sum_{n=-\infty}^{\infty} F\left(j\left(\omega+\frac{2n\pi}{T}\right)\right)

will be shown in the following. Note that

f(t=kT) = f_k = \frac{1}{2\pi}\int_{-\infty}^{\infty} F(j\omega')\exp(jk\omega' T)\,d\omega'

is the inverse Fourier transform of F(jω') evaluated at the time instant t = kT. With that we find:

X
FA (jω) = T fk exp(−jkωT )
k=−∞
∞ Z ∞
X T
= F (jω 0 ) exp(jkω 0 T )dω 0 exp(−jkωT )
k=−∞
2π −∞
Z ∞ ∞
X T
= F (jω 0 ) exp(jk(ω 0 − ω)T )dω 0
−∞ k=−∞

Z ∞ X∞
0
= F (jω ) δ(ω 0 − ω − 2nπ/T )dω 0
−∞ n=−∞
∞   
X 2nπ
= F j ω+ .
n=−∞
T

In other words, an equidistantly sampled function can always be represented by a periodic spectrum in the spectral domain.

1.4 Exercises
Exercise 1.1 Linearity can be defined by the following two constraints:

S[αx] = αS[x]
S[x1 + x2 ] = S[x1 ] + S[x2 ].

Show that the superposition form can be deduced from those and vice versa.

Exercise 1.2 Check if sampling and interpolation are linear operations.



Exercise 1.3 Observe the theorem of Bezout (1.7) for the case H_i(q^{-1}) = h_0^{(i)} + h_1^{(i)} q^{-1}; i = 1,...,n_I and G_i(q^{-1}) for arbitrary impulse response lengths n_G. How many unknowns and how many equations do you find, dependent on n_I and n_G?

Exercise 1.4 Consider a MIMO (Multiple Input Multiple Output) transmission system with two inputs and two outputs. The four transfer functions are H_11(q^{-1}), H_12(q^{-1}), H_21(q^{-1}), H_22(q^{-1}): design an equalizer by means of the theorem of Bezout to recover the two transmission sequences x_k^{(1)} and x_k^{(2)}.

Exercise 1.5 It is well known that linear time-invariant systems are commutative. Derive
the system’s transfer function by the polynomial approach, a z-transform and the state-space
approach for two concatenated linear time-invariant systems. Which description is easier?

Exercise 1.6 Derive the Hilbert-transform (relation between real- and imaginary part of
causal functions) for time-discrete series.

Exercise 1.7 Let H(jω) = a + jb; a,b ∈ R. Under which constraints is H(jω) causal?
What changes, if a time-discrete system is given, i.e., H(ejΩ ) = a + jb; a,b ∈ R?

Exercise 1.8 Symmetric filters show a linear phase. Is it possible to conclude that linear-phase filters are symmetric? Which sorts of symmetries do you know? Which forms of linear phase filters can be deduced from those?

Exercise 1.9 We examine the transfer function of a square-root raised cosine filter, as it is typically used for pulse shaping in UMTS and WiMAX:

H(j\omega) = \begin{cases} 1 & , |\omega| \le (1-\alpha)\frac{\pi}{T_S}\\ \sqrt{\tfrac12\Bigl[1-\sin\Bigl(\tfrac{T_S}{2\alpha}\bigl(|\omega|-\tfrac{\pi}{T_S}\bigr)\Bigr)\Bigr]} & , (1-\alpha)\frac{\pi}{T_S} \le |\omega| \le (1+\alpha)\frac{\pi}{T_S}\\ 0 & , |\omega| \ge (1+\alpha)\frac{\pi}{T_S}. \end{cases} \qquad (1.34)

Let the roll-off factor α be 0.22. What is the corresponding impulse response? Is it a causal filter? Does the filter show a linear phase?

Exercise 1.10 Consider

A^2(\omega) = \frac{1+4\omega^2}{1+10\omega^2+10\omega^4} \qquad (1.35)

to be given. Name all possible realizations of H(jω) = A(ω)e^{jφ(ω)}. How many realizations exist?

Exercise 1.11 Have a look at the derivation of the minimum phase system. Is it possible
to create a maximum phase system for which all zeroes are located outside the unit circle?

Exercise 1.12 Derive a minimum phase system for time-continuous systems

1. use the same method as in the lecture,

2. use the bi-linear transform.

Exercise 1.13 Consider a linear phase filter. Can you design such a filter that is of min-
imum phase?

Exercise 1.14 Consider the following linear phase filter with amplitude function

A(Ω) = 1 + cos(Ω) + 0.1 cos(2Ω).

It is to be realized by two concatenated filters of lower order, i.e.,

A(Ω) = A(1 + B cos(Ω))(1 + C cos(Ω)).

Find A, B and C. Consider now

A(Ω) = 1 + a cos(Ω) + b cos(2Ω).

Is the mapping from (a,b) to (A,B,C) unique?


Chapter 2

Linear Vector Space

2.1 Metrics and Spaces


Mathematical spaces are sets of objects. Note that sets and vectors are distinctly different concepts. While the position of an object inside a set does not matter, it does matter in a vector. Thus every concept including position information can be interpreted as a vector. The number of entries defines the dimension of a vector; an n-dimensional vector thus has n entries1. Also matrices are vectors. Functions can be interpreted as vectors with infinitely many entries. Typically the distance between such objects in a set and/or the size of the objects can only be defined if a metric is specified.

Let us consider two examples to understand the concept of metrics.

Example 2.1 In a first example we consider the transmission of binary data as depicted in Figure 2.1. Here, k bits are combined in a codeword x ∈ C ⊂ 𝔹^n = Y, pointing at one of M possible outcomes. Thus, 2^k = M. As the data is corrupted during the transmission process (i.e., some bits are flipped from zero to one or vice versa), the information needs to be protected. Such protection is obtained by adding redundancy. Here n − k additional bits are added following a particular coding rule (block code). The codewords x thus contain n > k bits. Nevertheless, there are only M = 2^k possible codewords and not 2^n! The received vectors y ∈ 𝔹^n are passed to the decoder whose task is to find that codeword that is closest to the received vector. Figure 2.2 depicts potential received values (red) and expected codewords (green). How does the decoder decide which received word is close to which codeword? It has to compare the received vector y to all allowed codewords x and measure the distance to each of them. The codeword x with the smallest distance is the most likely transmitted symbol. The distance is here simply the number of different bits,
1
Note that the dimension of a vector does not define the dimension of the vector space in which it lives.


Figure 2.1: Binary transmission with coded words.

also known as the Hamming distance:

d_H(\mathbf{y},\mathbf{x}) = \sum_{i=1}^{n} x_i \oplus y_i,

where ⊕ stands for antivalence, also known as exclusive-or.
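A small Python sketch (an assumed toy example using the (3,2) single parity check code that appears later in Example 2.24, not part of the original script) shows how the Hamming distance is used for minimum-distance decoding:

    import numpy as np

    # assumed illustration of minimum Hamming-distance decoding for a tiny block code
    codewords = np.array([[0,0,0],[0,1,1],[1,0,1],[1,1,0]])   # (3,2) single parity check code

    def hamming_distance(x, y):
        return int(np.sum(x ^ y))       # number of differing bits

    def decode(y):
        d = [hamming_distance(y, c) for c in codewords]
        return codewords[int(np.argmin(d))]

    received = np.array([1, 1, 1])      # corrupted word, not a valid codeword
    print(decode(received))             # nearest codeword (ties broken by order), e.g. [0 1 1]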

Example 2.2 In our second example, we transfer two possible signal forms from two sensors. Take for example a modern car with wireless pressure sensors in each wheel. To distinguish the left from the right wheel signals, they transmit their information by different signatures, say f(t) and g(t). The receiver has to decide whether the received signal form r(t) is closer to f(t) or to g(t). We thus need a distance measure from r(t) to f(t) and from r(t) to g(t):

d_2(r(t),f(t)) = \sqrt{\int_a^b |r(t)-f(t)|^2\,dt} \quad \text{vs.} \quad d_2(r(t),g(t)) = \sqrt{\int_a^b |r(t)-g(t)|^2\,dt}.

Definition 2.1 (Metric) Consider a set X from ℝ, ℂ, 𝔹, ℕ, ℤ. A metric d : X × X → ℝ_0^+ is a functional mapping to measure distances between objects/elements of the set X. In order to call such a distance a metric, the following properties need to be satisfied:

0) d(x,y) ≥ 0

1) d(x,y) = d(y,x)

2) d(x,y) = 0 iff x = y

3) d(x,z) ≤ d(x,y) + d(y,z) for all x,y,z ∈ X.


[(n,k) code: k = log₂(M) information bits are mapped onto n ≥ k code bits; code rate R = k/n = log₂(M)/n.]
Figure 2.2: Received corrupted signals.

Note that 0) is not really required since it follows from the other three: we have d(x,x) = 0 from 2). Furthermore, 0 = d(x,x) ≤ d(x,y) + d(y,x) = 2d(x,y), where we have used 1) and 3). We can thus conclude 0): d(x,y) ≥ 0.

Definition 2.2 (Metric space) A metric space (X,d) is described by a pair, comprising
a set X and a metric d, valid on this set.

Example 2.3 Consider the following metric d_1(·,·) : ℝ^n × ℝ^n → ℝ_0^+,

d_1(\mathbf{x},\mathbf{y}) = \sum_{i=1}^{n} |x_i - y_i|,

defined over the n-dimensional vectors x, y. This metric is called the l_1 metric. It is often used as it requires very little computational complexity. In the literature it is sometimes referred to as the Manhattan distance, as walking along the street grid of Manhattan would provide a similar distance.

Example 2.4 We can generalize the previous to d_p(·,·) : ℝ^n × ℝ^n → ℝ_0^+,

d_p(\mathbf{x},\mathbf{y}) = \left(\sum_{i=1}^{n} |x_i-y_i|^p\right)^{1/p},

which turns out to be a valid metric for every 1 ≤ p < ∞. Here, of particular interest are l_1, l_2 (Euclidean metric), and d_∞(·,·) : ℝ^n × ℝ^n → ℝ_0^+:

d_\infty(\mathbf{x},\mathbf{y}) = \lim_{p\to\infty}\left(\sum_{i=1}^{n} |x_i-y_i|^p\right)^{1/p} = \max_{1\le i\le n} |x_i-y_i|.

It is relatively straightforward to show that the last equivalence holds. Consider the largest difference ∆ = max |x_i − y_i| at index I = arg max |x_i − y_i| and take it out:

\sum_{i=1}^{n}|x_i-y_i|^p = \Delta^p + \sum_{i=1,\,i\neq I}^{n}|x_i-y_i|^p = \Delta^p\left(1+\sum_{i=1,\,i\neq I}^{n}\frac{|x_i-y_i|^p}{\Delta^p}\right).

Now, taking the p-th root we find

\left[\Delta^p\left(1+\sum_{i=1,\,i\neq I}^{n}\frac{|x_i-y_i|^p}{\Delta^p}\right)\right]^{1/p} = \Delta\left(1+\sum_{i=1,\,i\neq I}^{n}\frac{|x_i-y_i|^p}{\Delta^p}\right)^{1/p}.

Eventually, computing the limit for p → ∞, we find that all terms |x_i−y_i|^p/∆^p < 1 go to zero with growing p when compared to the leading one. Thus only ∆ remains, which is the desired result.

Example 2.5 We already mentioned the Hamming distance d_H(·,·) : 𝔹^n × 𝔹^n → ℕ_0:

d_H(\mathbf{y},\mathbf{x}) = \sum_{i=1}^{n} x_i \oplus y_i,

where ⊕ refers to the antivalence operation (exclusive or), i.e., (x_i ⊕ y_i) = 1 only if x_i ≠ y_i.

Example 2.6 The set ℝ^n (vectors with n entries from ℝ) together with the metric d_2(x,y) builds a metric space.

Note, a metric defines the distance to the zero vector for p ≥ 1:

d_p(\mathbf{x},\mathbf{0}) = \left(\sum_{i=1}^{n}|x_i-0|^p\right)^{1/p} = \left(\sum_{i=1}^{n}|x_i|^p\right)^{1/p}.

A metric thus allows for statements about sizes (lengths, areas, volumes, etc. for p = 2) of objects. Tied to this is the existence of such objects, or, equivalently, whether they fit into the given space. Given an object x of the space, we thus need to test if it fits into the space, i.e., if

d_p(\mathbf{x},\mathbf{0}) = d_p(\mathbf{x}) < \infty.

For l_p-metrics, we can also leave out the p-th root as it is a monotone mapping, i.e., it is sufficient to show that

\sum_{i=1}^{n}|x_i|^p < \infty.

Example 2.7 Consider the causal, infinitely long series of real- or complex-valued numbers x_i, i = 0,1,...,∞, with the property

\sum_{i=0}^{\infty}|x_i|^p < \infty.

For each p for which the condition is satisfied, the sequence belongs to the corresponding metric space l_p(0,∞) with metric d_p (Ger.: Folgenraum).
If we consider on the other hand the sequence x_i, i = −∞,...,−1,0,1,...,∞ with the same properties and the same metric, we have the l_p(−∞,∞) space. Considering the metric for p → ∞, for sequences for which |x_i| is bounded for every i, i.e., |x_i| < M, we obtain the l_∞(0,∞) or l_∞(−∞,∞) space, respectively:

d_\infty(x_n,y_n) = \sup_n |x_n - y_n|.

Definition 2.3 (Supremum) Consider a set S in ℝ, ℚ, ℤ, ℕ with elements x_i. The smallest number z in ℝ for which we have

z ≥ x_i for all i

is called the supremum (sup) of the set S. It is the least upper bound.

If there is no number in ℝ that is larger than the largest element in S, then we have sup(S) = ∞.

Definition 2.4 (Infimum) Consider a set S in ℝ, ℚ, ℤ, ℕ with elements x_i. The largest number z in ℝ for which we have

z ≤ x_i for all i

is called the infimum (inf) of the set S. It is the greatest lower bound.

If there is no number in ℝ that is smaller than the smallest element in S, then we have inf(S) = −∞.

Example 2.8 Let S = (3,6) be an open set, then: inf(S) = 3, sup(S) = 6. Let T = [4,8),
then: inf(T ) = 4, sup(T ) = 8. Let U = [2,∞), then: inf(U ) = 2, sup(U ) = ∞.

Why do we not simply select the maximum (minimum) of S?2

Example 2.9 Metric spaces can also be formed with particular properties. Let l_h(0,∞) be a metric space of sequences whose quadratic sum is finite (finite-energy sequences). Thus, x_i ∈ l_h(0,∞) = l_{p=2}(0,∞) means that

\sum_{i=0}^{\infty}|x_i|^2 < \infty,

i.e., the sequences of bounded energy. Note that Gaussian noise, a dominant noise model in telecommunications, is not a member of this metric space.

Metric spaces can also be defined over functions rather than a set of numbers (note
that functions are in a way sets of numbers). These are called metric spaces over functions
(Ger.: Funktionenraum).

Figure 2.3: Examples of metrics.

Definition 2.5 (p-metric) Let X be a set of real- or complex-valued functions, defined on the interval [a,b] with b > a, p ≥ 1, that have the property

\int_a^b |x(t)|^p\,dt < \infty; \quad 1 \le p < \infty.

2 As for infinite series it may turn out that the limit is not a member of S!

The metric over the functions x(t) and y(t) from X is then3

d_p(x,y) = \left(\int_a^b |x(t)-y(t)|^p\,dt\right)^{1/p}; \quad 1 \le p < \infty.

The space with the metric d_p over the functions is called L_p space. For p → ∞ we have for bounded functions (from above and below)

\sup_{t\in[a,b]} |x(t)| < \infty, \qquad d_\infty(x,y) = \sup\{|x(t)-y(t)|;\ a \le t \le b\}.

In Figure 2.3 we see some examples of metrics. In the upper graph we find two functions x_0(t) and x_m(t). Both functions are relatively close to each other. This is well expressed by the metrics: no matter whether a d_2 or d_∞ metric is selected, for both a small value ε comes out. The d_∞ metric can be found by moving x_0(t) up and down until the shifted copies are the tightest upper and lower bounds to x_m(t). In the lower picture the same function x_0(t) is displayed, but now two other functions x_1(t) and x_2(t) are shown. They are almost identical to x_0(t), but at one point they differ. Computing the d_2 metric, however, does not show any discrepancy, as integrating over a single point does not deliver any contribution. Taking the d_∞ metric here makes a substantial difference, as now the outlying point defines the value of the metric. If we are thus interested in measuring outliers, a d_∞ metric is the right choice.

2.1.1 Sparsity
In Figure 2.4 we consider a 2-dimensional vector whose d_p-metric is constant. Constant metrics are called iso-metrics (from the Greek word isos = equal)4. For p = 2 we find a circle (here only the right quadrant is shown). For larger values of p we see the metric inflate, that is, it becomes more blown up, and eventually for p → ∞ a square shape is obtained. But what happens for values of p smaller than two? For p = 1 we find a straight line. Decreasing p further makes the metric deflate.
What happens if we decrease p below 1? The d_p metric definition does not hold any longer, as the next example shows.
Example 2.10 Consider p = 1/2 and select the three points x = (1,0), y = (0,1), z = (0.5,0.3). Now let us compute d_{1/2}:

\left(\sum_{i=1}^{2}|x_i-y_i|^{\frac12}\right)^{2} = 4 \;>\; \left(\sum_{i=1}^{2}|x_i-z_i|^{\frac12}\right)^{2} + \left(\sum_{i=1}^{2}|z_i-y_i|^{\frac12}\right)^{2} = 3.96.
3
In terms of a Lebesgue Integral.
4
You may remember iso-bar lines that show lines of equal air pressure.

Figure 2.4: Varying p: inflating and deflating metrics.

Definition 2.6 For 0 < p < 1, the following expression is a metric:

d_p(x,y) = \int_a^b |x(t)-y(t)|^p\,dt.

Repeating Example 2.10 we now obtain

\sum_{i=1}^{2}|x_i-y_i|^{\frac12} = 2 \;<\; \sum_{i=1}^{2}|x_i-z_i|^{\frac12} + \sum_{i=1}^{2}|z_i-y_i|^{\frac12} = 2.798.

Definition 2.6 still expects p > 0. What happens if p = 0?

Definition 2.7 In practice this definition of d_0 is often given in the form

d_0(x,y) = \sum_{i=1}^{n}|x_i-y_i|^0,

which is truly a counter for sparseness. But note, mathematically speaking, this is not a metric (norm); sometimes it is called a pseudonorm. Mathematical difficulties with this definition occur due to discontinuities, as

x^0 = \begin{cases} 0 & ; x = 0 \\ 1 & ; \text{else}\end{cases}.

Example 2.11 Consider the following application of metrics and sparseness. Given an interpolator p(t) = sinc(πt/T), we want to sample a continuous function f(t) such that the corresponding sequence g_k minimizes the quadratic distance to the original function, i.e., we want

\min_{g_k} d_2\!\left(f(t),\sum_{k=-\infty}^{\infty} g_k\,p(t-kT)\right) = \min_{g_k}\left\{\int_{-\infty}^{\infty}\left(f(t)-\sum_{k=-\infty}^{\infty} g_k\,\mathrm{sinc}\!\left(\frac{\pi}{T}(t-kT)\right)\right)^2 dt\right\}^{\frac12}.

We know that the solution is given by g_k = f(kT) = f_k.

We can now vary the problem by providing a set of possible interpolation functions, say {p_1(t), p_2(t), ..., p_M(t)}. We now like to find a sparse representation of g_k:

\min_{g_k} d_0\bigl(g_k\,\big|\,\{p_m(t)\}\bigr) \quad \text{such that} \quad f(t) = \sum_{k=0}^{k_0} g_k\,p_m(t-kT).

Such a formulation allows replicating f(t), typically over a certain range 0 ≤ t ≤ k_0 T, with minimum complexity, as the so obtained g_k has only very few non-zero entries.

Alternatively we can vary the problem by providing a set of possible sampling sequences {g_{1,k}, g_{2,k}, ..., g_{m,k}}, if the interpolator is given in terms of a digital filter followed by a fixed interpolator, i.e.,

p(t) = \sum_{k=0}^{P-1} p_k\,\mathrm{sinc}\!\left(\frac{\pi}{T}(t-kT)\right).

A part of the interpolator is thus given in terms of the P coefficients p_0, p_1, ..., p_{P-1}. We can now formulate the sampling problem as

\min_{p_k} d_2\!\left(f(t),\sum_{k=0}^{k_0} g_{m,k}\,p(t-kT)\right).

The sparsity is thus provided by the set of sparse sampling sequences {g_{1,k}, g_{2,k}, ..., g_{m,k}} and does not need to be included in the optimization process. The desired level of sparsity (how many entries in g_k are non-zero) is already introduced in the initial design. We thus have to solve

\min_{g_{m,k}}\min_{p_l} d_2\!\left(f(t+t_D),\sum_{k=0}^{k_0} g_{m,k}\,p(t-kT)\right) = \min_{g_{m,k}}\min_{p_l}\left\{\int_0^{(P+k_0)T}\left(f(t+t_D)-\sum_{k=0}^{k_0}\sum_{l=0}^{P-1} g_{m,k}\,p_l\,\mathrm{sinc}\!\left(\frac{\pi}{T}(t-lT-kT)\right)\right)^2 dt\right\}^{\frac12}.

In terms of g_{m,k} it is a trial-and-error process; the size of the set defines the complexity of the search algorithm. In terms of the coefficients p_k, however, it is a d_2-metric problem that we will later learn to solve by so-called Least-Squares methods. This method is the principle of the so-called codebook excited linear predictive coding (CELP) that is nowadays a common standard for speech coding and can be found in every speech codec (mobile phones, digital telephony). The interpolator is the linear prediction mechanism (indicated by the future value f(t + t_D); t_D > 0; typically t_D = T), and the pre-stored set of excitations g_{m,k} is the codebook from which we select.

Let us summarize the various metric spaces in Table 2.1 as they are commonly used in
the literature.

lp (0,∞) Space of causal sequences with metric dp


lh (0,∞) Space of causal sequences with finite energy, (p = 2)
lp (−∞,∞) Space of non-causal sequences with metric dp
Lp (0,∞) Space of causal functions with metric dp
Lp (−∞,∞) Space of non-causal functions with metric dp
(C[a,b],dp ) Space of continuous functions with metric dp

Table 2.1: Nomenclature to describe various metric spaces.

Example 2.12 Let us apply the metric to a filter design: a linear time-discrete filter with linear phase

H(e^{jΩ}) = a_0 + a_1 e^{jΩ} + a_1 e^{−jΩ} = a_0 + 2a_1 cos(Ω)

is to be designed, so that it follows a desired amplitude function

|H_d(e^{jΩ})| = \begin{cases} 1 & ; \text{for } |Ω| < Ω_G \\ 0 & ; \text{else}\end{cases}

most optimally. We can formulate the problem by a d_2 metric:

\min_{H}\frac{1}{2\pi}\int_{-\pi}^{\pi}\bigl|H_d(e^{j\Omega})-H(e^{j\Omega})\bigr|^2 d\Omega
= \min_{a_0,a_1}\frac{1}{2\pi}\int_{-\pi}^{\pi}\bigl|H_d(e^{j\Omega})-a_0-2a_1\cos(\Omega)\bigr|^2 d\Omega
= \min_{a_0,a_1}\left\{\frac{1}{2\pi}\int_{-\Omega_G}^{\Omega_G}\bigl|1-a_0-2a_1\cos(\Omega)\bigr|^2 d\Omega + \frac{1}{2\pi}\int_{\text{Rest}}\bigl|0-a_0-2a_1\cos(\Omega)\bigr|^2 d\Omega\right\}.

Note that all terms in the desired values a_0, a_1 are quadratic. We can thus solve the problem by differentiating with respect to a_0 and a_1 individually. Setting the resulting equations to zero delivers the desired results. We find

a_0 = \frac{\Omega_G}{\pi}, \qquad a_1 = \frac{\sin(\Omega_G)}{\pi}.

Compare the result with the coefficients of the Fourier series: they are the first two coefficients of a Fourier series of the desired response. The result is depicted in Figure 2.6 further ahead (left upper part). Not surprisingly, the obtained filter solution only weakly resembles the desired low-pass character; too few coefficients were spent.
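The closed-form coefficients are easy to verify numerically; the following Python sketch (an assumed cutoff Ω_G, not part of the original script) computes a_0, a_1 and the remaining d_2 error:

    import numpy as np

    # assumed numerical check of the d2-optimal two-coefficient design from Example 2.12
    Omega_G = np.pi/3
    a0 = Omega_G/np.pi
    a1 = np.sin(Omega_G)/np.pi

    Omega = np.linspace(-np.pi, np.pi, 4001)
    H  = a0 + 2*a1*np.cos(Omega)
    Hd = (np.abs(Omega) < Omega_G).astype(float)

    err = np.trapz((Hd - H)**2, Omega)/(2*np.pi)
    print(a0, a1, err)    # residual squared d2 error of the crude two-term approximation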

Example 2.13 Let us return to the problem of Paley-Wiener. Given an amplitude function A_d(Ω), what is a valid filter F(e^{jΩ}) that satisfies it? We cannot use the technique of the previous example, as it requires knowledge of the desired filter function in amplitude and phase. However, if we consider linear phase filters as solution, we remember that for those we can write

F(e^{jΩ}) = A_F(Ω)e^{−jΩ∆}

with ∆ a positive constant, indicating the linear phase. We now recall that linear phase filters have symmetric impulse responses, for example

F(e^{j\Omega}) = e^{-j\Omega N}\left(f_0+\sum_{n=1}^{N} f_n\bigl(e^{-jn\Omega}+e^{jn\Omega}\bigr)\right) = e^{-j\Omega N}\left(f_0+\sum_{n=1}^{N} f_n\,2\cos(n\Omega)\right) = e^{-j\Omega N}\sum_{n=0}^{N} h_n\cos(n\Omega),

where we took the linear phase part in front, thus ∆ = N. In the last line we rewrote the terms by introducing new coefficients h_n = 2f_n for n = 1,2,...,N and h_0 = f_0. Now we are ready to compute the coefficients h_n based on a d_2 metric:

\min_{h_n}\frac{1}{2\pi}\int_{-\pi}^{\pi}\left(A_d(\Omega)-\sum_{n=0}^{N}h_n\cos(n\Omega)\right)^2 d\Omega.

As in the previous example we compute the derivatives, taking advantage of the following property:

\frac{1}{2\pi}\int_{-\pi}^{\pi}\cos(l\Omega)\cos(n\Omega)\,d\Omega = \begin{cases} 1 & ; l = n = 0\\ \tfrac12 & ; l = n \ge 1\\ 0 & ; \text{else}\end{cases}

and obtain

h_0 = \frac{1}{2\pi}\int_{-\pi}^{\pi} A_d(\Omega)\,d\Omega, \qquad h_l = \frac{1}{\pi}\int_{-\pi}^{\pi} A_d(\Omega)\cos(l\Omega)\,d\Omega \quad \text{for } l = 1,\dots,N.

Thus, given the amplitude function A_d(Ω), we now know how to compute a corresponding filter function. This is of course not the only solution: any additional allpass can vary the phase without changing the amplitude function. In this way also minimum phase solutions can be found, employing the techniques of the previous chapter.
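A small numerical companion (an assumed ideal low-pass target A_d and an assumed order N, not part of the original script) evaluates these coefficient integrals in Python:

    import numpy as np

    # assumed numerical version of Example 2.13: linear phase coefficients from A_d(Omega)
    Omega_G = np.pi/3
    Omega = np.linspace(-np.pi, np.pi, 20001)
    Ad = (np.abs(Omega) < Omega_G).astype(float)     # desired ideal low-pass amplitude

    N = 8
    h = np.zeros(N+1)
    h[0] = np.trapz(Ad, Omega)/(2*np.pi)
    for l in range(1, N+1):
        h[l] = np.trapz(Ad*np.cos(l*Omega), Omega)/np.pi

    A = sum(h[n]*np.cos(n*Omega) for n in range(N+1))   # resulting amplitude approximation
    print(h[:3])     # h0 ~ Omega_G/pi, h1 ~ 2 sin(Omega_G)/pi, ...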

2.1.2 Limit Sequences


Typically sequences follow a specific rule, for example a rational number is converted into another one and so on. Applying the rule an infinite number of times, a surprising result can follow. Take for example the series

f_k = \sum_{n=1}^{k}\frac{1}{n^2}.

We only add rational numbers, but once k → ∞ the final result is π²/6, which is clearly not in ℚ. We thus would like to know more about where such infinite series end up.

Definition 2.8 (Limit) If for every distance ε > 0 there exists a number n_o such that d(x_n,y) < ε for each n > n_o at a fixed value y, the sequence x_n is said to be convergent to y:

x_n → y, \qquad y = \lim_{n\to\infty} x_n.

y is called the limit (Ger.: Grenzwert) of x_n. All points within an arbitrarily small distance ε to y are called the ε-neighborhood (Ger.: Nachbarschaft) of y.

Example 2.14 The following two sequences diverge:

an = n2 ,
bn = 1 + (−1)n .

Definition 2.9 (Limit Point) If a sequence xn returns infinitely often to a neighborhood


of z, then we call this point z a limit point (Ger.: Häufungspunkt,Grenzpunkt). If limit
points exist, then there must be subsequences (partial sequences, Ger.: Teilfolgen) xn that
converge. The largest limit point of a sequence xn is called limes superior, or

lim sup xn .
n→∞

The smallest limit point of a sequence xn is called limes inferior, or

lim inf xn .
n→∞

A sequence converges if
lim sup xn = lim inf xn .
n→∞ n→∞

Example 2.15 Consider the sequence

c_n = 2 + \frac{1}{n} + 2\,(-1)^n; \quad n = 1,2,3,....

We find

\limsup_{n\to\infty} c_n = 4, \qquad \liminf_{n\to\infty} c_n = 0.

Thus, the sequence does not converge. Note that

\sup_n c_n = 4.5

delivers a different result. There are two limit points. The subsequence {c_2, c_4, c_6, ...} takes on the limit 4, while the subsequence {c_1, c_3, c_5, ...} takes on the limit 0.

The previous definition of a limit is not of practical nature. Once we know the limit point, we can test whether the series converges towards it; but if we have no clue whether it converges and where to, we cannot say. A more practical definition is thus the following one, in which we do not need to know the limit a-priori.

Definition 2.10 (Cauchy Sequence) A sequence x_n in a metric space (X,d) is called a Cauchy sequence if for every ε > 0 there exists an N(ε) > 0 so that d(x_n,x_m) < ε for every m,n > N.

Example 2.16 Let X = C[−1,1] be a set of continuous functions and f_n(t) a sequence of functions, defined by

f_n(t) = \begin{cases} 0 & ; t < -\frac{1}{n}\\ \frac{nt}{2}+\frac12 & ; -\frac{1}{n} \le t \le \frac{1}{n}\\ 1 & ; t > \frac{1}{n}\end{cases}

Consider the metric space (X,d_2) with

d_2(f,g) = \sqrt{\int_{-1}^{1}\bigl(f(t)-g(t)\bigr)^2\,dt}.

Figure 2.5: Function for small n (left) and larger n (right).

See Figure 2.5 for an illustration of the function f_n(t). We compute

d_2^2\bigl(f_n(t),f_m(t)\bigr) = \begin{cases} \dfrac{(m-n)^2}{6nm^2} & ; m > n\\[4pt] \dfrac{(m-n)^2}{6mn^2} & ; m < n\end{cases}.

For large m,n with m − n = k < ∞, we find d_2 → 0. We thus conclude that it is a Cauchy sequence. Note, however, that the function in the limit is given by

\lim_{n\to\infty} f_n(t) = f(t) = \begin{cases} 0 & ; t < 0\\ \frac12 & ; t = 0\\ 1 & ; t > 0\end{cases}.
This function is not continuous and thus not in X = C[−1,1]. The sequence is thus not
convergent in X!
Definition 2.11 (Complete) A metric space (X,d) is called complete (Ger.: vollständig),
if every Cauchy-sequence converges in X.
Once we know such property about a metric space we can trust our Cauchy criterion. For
some metric spaces, this property is well-known, see Section 2.2.3 further ahead.

2.1.3 Gibbs’ Phenomenon

The reader is probably well aware of the fact that periodic functions can be described by Fourier series.

Example 2.17 Consider the following periodic signal of period 1 in L[−0.5, 0.5]:

f(t) = \begin{cases} 0 & ; -0.5 \le t \le -0.25\\ 1 & ; -0.25 \le t \le 0.25\\ 0 & ; 0.25 \le t < 0.5\end{cases}

with the Fourier series

f_n(t) = 0.5 + \frac{2}{\pi}\sum_{k=1}^{n}\frac{(-1)^{k+1}}{2k-1}\cos\bigl(2(2k-1)\pi t\bigr).

The result f_n(t) is displayed for various values of n = 1,2,3,4 in Figure 2.6. In Figure 2.7 we find the result for n = 500. An overshoot appears that does not disappear even if n → ∞. This effect is known as Gibbs’ phenomenon.5 In terms of metrics it is easily explained: as a single point does not contribute to the integration, applying a d_2 metric results in no difference between f(t) and f_n(t): lim_{n→∞} d_2(f(t),f_n(t)) = 0. Applying a d_∞ metric instead, we obtain lim_{n→∞} d_∞(f(t),f_n(t)) >> 0. Typically the maximal point is about 9% higher than the neighboring level.
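The persistence of the overshoot is easy to reproduce; the following Python sketch (an illustration of the partial sums above, not part of the original script) shows that the peak error does not shrink with growing n:

    import numpy as np

    # assumed numerical illustration of the Gibbs overshoot for the square wave of Example 2.17
    t = np.linspace(-0.5, 0.5, 40001)

    def fn(t, n):
        s = np.full_like(t, 0.5)
        for k in range(1, n+1):
            s += (2/np.pi)*((-1)**(k+1)/(2*k-1))*np.cos(2*(2*k-1)*np.pi*t)
        return s

    for n in (10, 100, 500):
        print(n, np.max(fn(t, n)) - 1.0)   # overshoot stays near 9% instead of vanishing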

Figure 2.6: Gibbs Phenomenon.

Example 2.18 A further good example of Cauchy sequences is obtained when calculating square roots by the so-called Heron's method. We want to solve the problem

f(x) = x^2 − A = 0
5
Josiah Willard Gibbs (11.2.1839 -28.4.1903) was an American scientist in Physics, Chemistry, Mathe-
matics.

Figure 2.7: Gibbs Phenomenon.

and use an iterative approach of Newton's style:

x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}.

For square roots we find

x_{n+1} = \frac{x_n + \frac{A}{x_n}}{2}.

If we start with any value x_0 = a/b ∈ ℚ we end up with a new value

x_1 = \frac{a}{2b} + \frac{Ab}{2a} = \frac{a^2+Ab^2}{2ab},

which is again in ℚ. We thus stay in ℚ with every iteration. Nevertheless, for n → ∞ the limit is √A. Thus even if A = 3 ∈ ℕ, the limit √3 is irrational and hence only in ℝ.

2.1.4 Robustness
Often it is important not to optimize the mean value, e.g., mean data rate or Bit Error Ratio (BER), but the value under worst-case conditions. If a system proves to be insensitive to the worst cases, it is called robust. Robustness is often measured in terms of energy. Let x_i be the input sequence and y_i the output sequence of a system T. The output is assumed to be distorted by an unknown sequence v_i. The form of distortion is not necessarily additive or even known. Furthermore, the M initial states z_i of this system are also not known.

Assume that there exists a reference system without distortion having an output signal y_i^{(R)}. The influence of such distortion can be described by the following expression:

\frac{\sum_{i=0}^{N}\bigl(y_i-y_i^{(R)}\bigr)^2}{\sum_{i=1}^{M} z_i^2 + \sum_{i=0}^{N} v_i^2} \le \gamma^2.

We express the fact that this ratio stays below a positive upper bound by selecting a value γ². As this should hold for every possible noise and initial state, we have to consider the worst case:

\sup_{z_i\in l_h(1,M),\,v_i\in l_h(0,N)}\frac{\sum_{i=0}^{N}\bigl(y_i-y_i^{(R)}\bigr)^2}{\sum_{i=1}^{M} z_i^2 + \sum_{i=0}^{N} v_i^2} \le \gamma^2.

Assume now that there exist several possibilities for the realization of this system T(F), depending on a specific strategy F. Then the robustness criterion reads:

\inf_{F}\sup_{z_i\in l_h(1,M),\,v_i\in l_h(0,N)}\frac{\sum_{i=0}^{N}\bigl(y_i-y_i^{(R)}\bigr)^2}{\sum_{i=1}^{M} z_i^2 + \sum_{i=0}^{N} v_i^2},

saying that we are interested in finding the strategy F that minimizes this worst-case upper bound.

Example 2.19 A power amplifier (PA) as depicted in Figure 2.8 (left) in a mobile phone (Ger.: Handy) can be described by the following nonlinear mapping:

y_i = x_i\,\frac{\rho_i|x_i|^2}{1+\rho_i|x_i|^2} + v_i.

The variable ρ = ρ_i(T) describes the influence of temperature, aging and more. In order to define the robustness of the output signal with respect to ρ, the following expression is considered:

\sup_{\rho_i,\,v_i\in l_h(0,N)}\frac{\sum_{i=0}^{N}\bigl(y_i-y_i^{(R)}\bigr)^2}{\sum_{i=0}^{N}\rho_i^2+\sum_{i=0}^{N} v_i^2} \le \gamma^2.

If it is possible to reduce the influence of ρ by new technologies (e.g., a feedback with controller as depicted in Figure 2.8 (right)) or improved circuitry, a smaller factor γ is the result:

\sup_{\rho_i,\,v_i\in l_h(0,N)}\frac{\sum_{i=0}^{N}\bigl(y_i-y_i^{(R)}\bigr)^2}{\sum_{i=0}^{N}\rho_{i,\mathrm{new}}^2+\sum_{i=0}^{N} v_i^2} \le \gamma_{\mathrm{new}}^2 < \gamma^2.

The improvement is achieved by higher robustness (smaller γ).


Note: Robustness is not sensitivity.


Figure 2.8: Robust Power Amplifier (PA).

2.2 Topology of Linear Vector Spaces


We now have to introduce the ground rules under which we operate. This section is thus of mathematical nature, but worth understanding.

Definition 2.12 A binary operation ∗ on a set S is a rule that assigns to each ordered pair
(a,b) of elements from S some element from S. Since the operation ends with an element
in the same set, we call this also a closed operation.

Example 2.20 The following operation ‘*’ for (a,b) ∈ S = ℤ is a binary operation:

a ∗ b = min(a,b).

Definition 2.13 (Linear Vector Space) A linear vector space S over a set of scalars T (ℂ, ℝ, ℚ, ℤ, ℕ, 𝔹) is a set of (objects) vectors together with an additive + and a scalar multiplicative · operation, satisfying the following properties:

1. S is a group under addition.

2. For every a,b ∈ T and x,y ∈ S we have:

ax ∈ S, a(bx) = (ab)x, (a + b)x = ax + bx, a(x + y) = ax + ay.

3. W.r.t. the multiplicative operation there exists an identity (one) and a zero element:

1x = x, 0x = 0.
1x = x, 0x = 0.

Definition 2.14 (Group) A set S for which a binary operation ∗ (operation w.r.t. two elements of S) is defined is called a group if the following holds:

1. for each a,b ∈ S it holds that (a ∗ b) ∈ S;

2. there exists an identity element e in S, so that for every element a ∈ S: e ∗ a = a ∗ e = a;

3. for each element a in S there exists an inverse element b in S, so that a ∗ b = b ∗ a = e;

4. the binary operation ∗ is associative, i.e., (a ∗ b) ∗ c = a ∗ (b ∗ c).

The group is denoted by (S, ∗). If, furthermore, for each pair a,b in S it holds that a ∗ b = b ∗ a (commutativity), then the group is called commutative or Abelian (Ger.: Abelsch).

Note that in the previous definitions the binary operations +, ∗, · were placeholders, even though they may suggest classic addition or multiplication. For each number field the specific operations need to be tested.

Example 2.21 Let S be the set of vectors of a particular dimension, say n. S forms a
group w.r.t the additive operator, if the following properties are satisfied:

1. For every x,y ∈ S it holds that: x + y ∈ S.

2. There exists an identity element 0 in S, so that x + 0 = 0 + x = x.

3. For each element x ∈ S there exists a second inverse element y ∈ S, so that x+y = 0,
equivalently y = −x.

4. The additive operation is associative: for each x,y,z ∈ S it holds that (x + y) + z = x + (y + z).

Definition 2.15 (Ring) A set S for which two binary operations + and * are defined, is
called a ring if the following holds:

1. (S,+) is a commutative group (Abelian)

2. The operation * is associative.

3. Distributivity holds w.r.t. +: a(b + c) = ab + ac, (a + b)c = ac + bc.

A ring is denominated by (S,+, ∗).


Note: for ∗ there does not need to be an identity or inverse element.
If, additionally, an inverse element exists for ∗, the structure is called a skew field (Ger.:
Schiefkörper).

Definition 2.16 (Field) A set S equipped with two binary operations + and * is called a
field (Ger.: Körper), if:

1. (S,+) is an Abelian group



2. (S\0,∗) is an Abelian group

3. The operations + and * distribute.

If the set is finite, i.e., |S| < ∞, we talk about finite groups, finite rings, finite fields
and so on.

Example 2.22 The Galois field GF(2) is a field. It comprises the elements 0 and 1.
Let us define the two binary operations + and ∗ as follows:

a b | a+b | a∗b
0 0 |  0  |  0
0 1 |  1  |  0
1 0 |  1  |  0
1 1 |  0  |  1
W.r.t. the operation +, S needs to be an Abelian Group:

1. We find that all a + b ∈ S.

2. Identity element is 0 : 0 + 0 = 0, 1 + 0 = 1.

3. Inverse element exists: 0 + 0 = 0,1 + 1 = 0.

4. Associativity: (a + b) + c = a + (b + c).

5. Check the remaining properties for ∗ yourself.

Example 2.23 (IR,+,∗), (Q,+,∗), (C,+,∗) are fields.
(N,+,∗), (Z,+,∗) are not fields.

Example 2.24 Examples for finite dimensional linear vector spaces are:

1. Consider the linear vector space in IR4 (set of quadruples, Ger.: Menge der Quadrupel)

x = [1, 5, 4, 2]^T; \quad y = [5, 2, 0, -2]^T; \quad x + y = [6, 7, 4, 0]^T; \quad 3x + 2y = [13, 19, 12, 2]^T.

2. The set of m × n matrices with real-valued elements.

3. The set of polynomials of degree 0,2,...,n with real-valued coefficients.



4. Consider the (3,2) single parity check code in GF(2)^3 with the elements V = {[000],[011],[101],[110]}. We define the operation + as a binary exclusive-or operation. This set, together with +, is a linear vector space over GF(2).

5. Do the remaining elements W = {[111],[100],[010],[001]} also form a linear vector space? No, they do not. For example, the identity element [000] is missing.

Example 2.25 Examples for infinite-dimensional linear vector spaces are:


1. Consider the set of infinitely long sequences {xn }.

2. The set of continuous functions over the interval [a,b] → C[a,b].

3. The set of functions in Lp → Lp [a,b].

Definition 2.17 (Subspace) Let S be a vector space. If V is a subset of S such that V itself is a vector space, then V is called a subspace (Ger.: Unterraum) of S.

Example 2.26 Let S be the set of polynomials of arbitrary degree (e.g., > 6) and V the
set of polynomials of degree less than 6. Then V is a subspace of S.

Example 2.27 Consider a (n,k) binary linear block code in which k arbitrary bits are
mapped onto n bits. This is a k-dimensional subspace of GF(2)n .

2.2.1 Linear Independence


Definition 2.18 (Linear Combination) Let S be a linear vector space over R and T
a subset of S. A point x of S is called linear combination (Ger.: Linearkombination) of
points in T , if there is a finite set of points pi ; i = 1,2,...,m ∈ T and a finite set of scalars
ci ; i = 1,2,...,m ∈ R, so that:

x = c1 p1 + c2 p2 + ... + cm pm .

Example 2.28 Let S = C l (IR), the set of continuous functions over the complex (real)
numbers. Let, furthermore, p1 (t) = 1; p2 (t) = t, p3 (t) = t2 . A linear combination of such
functions is given by:
x(t) = c1 + c2 t + c3 t2 .
Consider the polynomial x(t) = t^2 + 5t - 1, and the function p4 (t) = t^2 − 1. We find:

x(t) = −p1 (t) + 5p2 (t) + p3 (t)


= p1 (t) + 5p2 (t) − p3 (t) + 2p4 (t)
= 5p2 (t) + p4 (t).

Obviously, the description is not unique. The number of required coefficients varies.

Definition 2.19 (Linear Independence) Let S be a linear vector space and T a subset of S. The subset T = {p_i; i = 1,2,...,m} is said to be linearly independent (Ger.: linear unabhängig) if, for each nonempty finite subset of T, the only set of scalars satisfying the equation
c_1 p_1 + c_2 p_2 + ... + c_m p_m = 0
is the trivial solution c_1 = c_2 = ... = c_m = 0.
If the above equation is satisfied by a set of scalars that are not all equal to zero, then the subset {p_i; i = 1,2,...,m} is called linearly dependent (Ger.: linear abhängig).
Example 2.29 The previously presented polynomials p1 (t) = 1; p2 (t) = t, p3 (t) =
t2 , p4 (t) = t2 − 1 are linearly dependent, since:
p1 (t) − p3 (t) + p4 (t) = 0.
The polynomials p1 (t),..., p3 (t) are linearly independent!
The vectors p_1 = [2, -3, 4], p_2 = [-1, 6, -2] and p_3 = [1, 6, 2] are linearly dependent since:
4p_1 + 5p_2 - 3p_3 = 0.
It is obviously not trivial to check whether a set of vectors is linearly dependent or not.
Example 2.30 Consider the complex numbers z = r + ji in IR^2 in vector form:
T = \{ [r, i]^T, [r, -i]^T \}.
This is a set describing z and z ∗ . Note that both elements are linearly independent as long
as i is unequal to zero.
Example 2.31 Consider band-limited functions or sequences x_k with Fourier transform X(e^{j\Omega}) (also band-limited random processes) that only exist in a frequency range S:
X(e^{j\Omega}) = 0 \quad \text{for } \Omega \notin S.
For a sequence f_k (linear time-invariant system) with Fourier transform F(e^{j\Omega}) existing in the complementary space of S,
F(e^{j\Omega}) = 0 \quad \text{for } \Omega \in S,
we find:

y_n = \sum_{k=-\infty}^{\infty} f_k x_{n-k},
Y(e^{j\Omega}) = F(e^{j\Omega}) X(e^{j\Omega}) = 0,
x_n = -\frac{1}{f_0} \sum_{k=1}^{\infty} f_k x_{n-k} - \frac{1}{f_0} \sum_{k=-\infty}^{-1} f_k x_{n-k}.

Band-limited signals are said to be linearly dependent! Note however, that the property
finite number of elements is not necessarily satisfied.

Example 2.32 An open issue is to find the linear weights ci in a linear combination given
the result of such combination.

c1 p1 + c2 p2 + ... + cm pm = x.

To this end, let us reformulate the linear combination as a matrix-vector relation:

x = [p_1, p_2, ..., p_m] [c_1, c_2, ..., c_m]^T = P c.

The vector set {pi } now forms a matrix P , the coefficients a vector c. Given a right hand
side x, do you know now how to obtain the linear weights ci ?
Answer: we have to invert matrix P and find c = P −1 x.
Let us set x = 0. Then a nontrivial solution c ≠ 0 exists only if the columns of P are linearly dependent.
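As a small numerical sketch (not part of the original text; the vectors and the right-hand side below are arbitrary examples), the weights of a linear combination can be computed with NumPy once the p_i are stacked as columns of P:

import numpy as np

# Hypothetical basis vectors p_i as columns of P (square and full rank here).
P = np.array([[1., -2., 1.],
              [6.,  4., 1.],
              [5.,  2., 0.]])
x = np.array([7., 19., 12.])

c = np.linalg.solve(P, x)              # numerically safer than forming P^{-1} x
print(c, np.allclose(P @ c, x))        # P c reproduces x

# For x = 0, a nontrivial c exists only if the columns are linearly dependent,
# i.e., if P is rank deficient:
print(np.linalg.matrix_rank(P) < P.shape[1])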

Example 2.33 Consider a blind channel estimation scheme as depicted in Figure 2.9.
Such schemes were very much of interest in the 90ies as the desire to get rid of redundant
training signals came up. The question was whether a channel estimation is possible at
all without known data signals. A signal sk is transmitted over two linear time-invariant

Figure 2.9: Blind channel estimation scheme.

channels h1 and h2. If the two linear time-invariant filters satisfy g1 = h2 and g2 = h1, then the convolutions result in the same outcome s_k^{(1)} = s_k^{(2)} and the difference of the two becomes zero. The question is under which conditions the requirement that both outcomes are
zero. The question is under which conditions is the requirement that both outcomes are
identical, sufficient to obtain the desired estimate for the two channels. This is a tricky
question that will require several iterations in order to solve it completely. Nevertheless, at
this state we can already formulate the problem and provide first conditions.

First, we recognize that if (g1, g2) are solutions, so are (αg1, αg2), that is, any scaled version. We can thus not expect a unique solution. Second, we recall Bezout's Theorem 1.1 and conclude that h1 and h2 need to be co-prime. Otherwise a common part could be split off and would alter the input sequence sk. The third condition now comes from a consideration in terms of linear dependency. At each sensor (antenna) we observe N elements of the past:
r_k^{(i)} = [r_k^{(i)}, r_{k-1}^{(i)}, ..., r_{k-N+1}^{(i)}]^T; \quad i = 1,2.

In order to be identical, both convolutions must provide the same result (starting at k = 1):
g_{11} r_1^{(1)} + g_{12} r_2^{(1)} + ... + g_{1m} r_m^{(1)} = g_{21} r_1^{(2)} + g_{22} r_2^{(2)} + ... + g_{2m} r_m^{(2)}.

This can be formulated as a large set of equations:

[r_1^{(1)}, r_2^{(1)}, ..., r_m^{(1)}, r_1^{(2)}, r_2^{(2)}, ..., r_m^{(2)}]\, [g_{11}, g_{12}, ..., g_{1m}, -g_{21}, -g_{22}, ..., -g_{2m}]^T = R g = 0.

Which properties must the receive vectors r_k^{(1)} and r_k^{(2)} have, so that the problem can be solved uniquely?
Answer: the vectors need to be linearly dependent to guarantee a solution to exist.

At this point several interesting questions arise:


1. Under which conditions is the linear combination of vectors unique?
2. Which is the smallest set of vectors required to describe every vector in S by a linear
combination?
3. If a vector x can be described by a linear combination of pi ; i = 1,2,...,m, how do we
get the linear weights ci ; i = 1,2,...,m?
4. Of which form do the vectors pi ; i = 1,2,...,m need to be in order to reach every point
x in S?
5. If x cannot be described exactly by a linear combination of pi ; i = 1,2...,m, how can
it be approximated in the best way (smallest error)?
All of those questions will be answered in the next parts of the script.

2.2.2 System Identification


The impulse response hk of a linear time-invariant system with input sequence sk and output rk is to be estimated based on observations of the input and output signals. What conditions does the input sk have to satisfy to ensure a unique identification? To answer this question we can formulate the identification as a set of linear equations:

\begin{pmatrix} s_1 & s_2 & \cdots & s_{M-1} & s_M \\ s_2 & s_3 & \cdots & s_M & s_{M+1} \\ \vdots & & & & \vdots \\ s_{M-1} & s_M & \cdots & & s_{2M-2} \\ s_M & s_{M+1} & \cdots & s_{2M-2} & s_{2M-1} \end{pmatrix} \begin{pmatrix} h_{M-1} \\ h_{M-2} \\ \vdots \\ h_1 \\ h_0 \end{pmatrix} = \begin{pmatrix} r_1 \\ r_2 \\ \vdots \\ r_{M-1} \\ r_M \end{pmatrix}.
For a unique identification, the rows (columns) need to be linearly independent! A signal
with such a property is said to be of persistent excitation (Ger.: hartnäckige Anregung).
Question: is a complex-valued exponential harmonic sk = exp(jΩo k) of persistent excita-
tion? What about a sinusoid sk = sin(Ωo k)?

To answer those questions, consider the following excitation signal:


s_k = e^{j\Omega_o k}
x_1 = [e^{j\Omega_o 0}, e^{j\Omega_o}, e^{j\Omega_o 2}, ..., e^{j\Omega_o (M-1)}]
x_2 = [e^{j\Omega_o}, e^{j\Omega_o 2}, e^{j\Omega_o 3}, ..., e^{j\Omega_o M}] = e^{j\Omega_o} x_1.
We conclude that all following vectors are linearly dependent and thus only a system of order 1 (thus a constant) can be identified with such a method. In modern wireless systems, OFDM modulation is often applied. It splits the broadband channel into many narrow-band channels, each of which can be described by a single coefficient.
Let us now consider a sinusoidal signal:
s_k = \sin(\Omega_o k) = \frac{1}{2j}\left(e^{j\Omega_o k} - e^{-j\Omega_o k}\right),
x_a = [e^{j\Omega_o 0}, e^{j\Omega_o}, e^{j\Omega_o 2}, ..., e^{j\Omega_o (M-1)}]
x_b = [e^{-j\Omega_o 0}, e^{-j\Omega_o}, e^{-j\Omega_o 2}, ..., e^{-j\Omega_o (M-1)}]
x_1 = \frac{1}{2j} x_a - \frac{1}{2j} x_b
s_{k+1} = \sin(\Omega_o (k+1)) = \frac{1}{2j}\left(e^{j\Omega_o} e^{j\Omega_o k} - e^{-j\Omega_o} e^{-j\Omega_o k}\right)
x_2 = \frac{e^{j\Omega_o}}{2j} x_a - \frac{e^{-j\Omega_o}}{2j} x_b.
We recognize here two linearly independent vectors x_a and x_b out of which all other vectors can be constructed by a linear combination. Thus, with such a sinusoid a system of order two can be identified. We can now answer the following general questions:
How many signals (frequencies Ωo ) of the form exp(jΩo k) are required to identify a
time-invariant system of order M ?
Answer: M different frequencies are required.
How many signals (frequencies Ωo ) of the form sin(Ωo k) are required to identify a
time-invariant system of order M ?
Answer: Only M/2 different frequencies are required.
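A short numerical sketch (not from the original script; M and Omega below are arbitrary choices) confirms this by checking the rank of the data matrix built from the excitation:

import numpy as np

M, Omega = 5, 0.7
k = np.arange(2 * M - 1)

def data_matrix_rank(s, M):
    # rows [s_k, s_{k+1}, ..., s_{k+M-1}] as in the identification system above
    S = np.array([s[i:i + M] for i in range(M)])
    return np.linalg.matrix_rank(S)

print(data_matrix_rank(np.exp(1j * Omega * k), M))  # rank 1: only order-1 systems identifiable
print(data_matrix_rank(np.sin(Omega * k), M))       # rank 2: a sinusoid excites two modes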

Definition 2.20 (Span) Let T be a set of vectors in a vector space S over a set of scalars
IR (or C, Q, Z, N, B). The set of vectors V that can be reached by all possible (finite) linear
combinations of vectors in T is the span (Ger.: aufgespannte Menge, erzeugte Menge,
lineare Hülle) of the vectors:
V = span(T ).

Note that span is a short form to describe something that would be lengthy otherwise.
We map a set T into a typically larger set V. Take for example T_1 = {x, y}; then
V_1 = \mathrm{span}(T_1) = \{ \alpha x + \beta y \,|\, \alpha, \beta \in K \}.
The number field K is typically IR or C. Thus the set T_1 with two elements is mapped into a set with an infinite number of members.
The saving of writing is even more pronounced in the next example where T2 = {xi |i =
1,2,...,N }. We find
V_2 = \mathrm{span}(T_2) = \left\{ \sum_{i=1}^{N} \alpha_i x_i \,\middle|\, \alpha_i \in K,\ x_i \in T_2 \right\}.

Definition 2.21 (Hamel Basis) Let S be a vector space, and let T be a set of vectors
from S such that span(T ) = S. If T is linearly independent, then T is said to be a Hamel6
basis for S.

Example 2.34 The vectors p1 = [1,6,5], p2 = [−2,4,2], p3 = [1,1,0] and p4 = [7,5,2] are
linearly dependent. Note that T = {p1 ,p2 ,p3 } spans the space IR3 and thus is a basis for IR3 .

Example 2.35 The vectors p1 = [1,0,0], p2 = [0,1,0] and p3 = [0,0,1] are linearly inde-
pendent and are a basis for IR3 . This basis is often called natural basis (Ger.: natürliche
Basis).
6
After the mathematician Georg Karl Wilhelm Hamel (12.9.1877-4.10.1954).

Example 2.36 Consider the (3,2) single parity check code in GF(2)3 with the elements:
V = {[000],[011],[101],[110]}. A Hamel basis is given by:
 
G = \begin{pmatrix} [011] \\ [101] \end{pmatrix}.

Example 2.37 The following 3 × 3 matrices are a Hamel basis in IR^{3×3}: the nine matrices E_{ij}, i,j = 1,2,3, each containing a single one at position (i,j) and zeros elsewhere, e.g.,

E_{11} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \quad E_{12} = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \quad \ldots, \quad E_{33} = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}.

Example 2.38 Are these 3 × 3 matrices a Hamel basis in IR^{3×3}?

\begin{pmatrix} 1 & 0 & -1 \\ 0 & 0 & 0 \\ -1 & 0 & 1 \end{pmatrix}, \quad \begin{pmatrix} 0 & 1 & -1 \\ 0 & 0 & 0 \\ 0 & -1 & 1 \end{pmatrix}, \quad \begin{pmatrix} 0 & 0 & 0 \\ 1 & 0 & -1 \\ -1 & 0 & 1 \end{pmatrix}, \quad \begin{pmatrix} 0 & 0 & 0 \\ 0 & 1 & -1 \\ 0 & -1 & 1 \end{pmatrix}.

Answer: These matrices are not a basis for IR^{3×3}. However, they are a basis for the subspace of IR^{3×3} whose elements have zero row and column sums!

Definition 2.22 (Cardinality) The number of elements in a set A is its cardinality |A|
(Ger.: Kardinalität).

Theorem 2.1 If two sets T and U are Hamel bases for the same vector space S, then T
and U are of the same cardinality.

Proof: (only for the finite dimensional case)


Let
T = {p_1, p_2, ..., p_m}; \quad U = {q_1, q_2, ..., q_n}

be two bases for S and


c1 p1 + c2 p2 + ... + cm pm = q 1
with at least one coefficient unequal to zero, say c1 . Then
p_1 = \frac{1}{c_1}\left(q_1 - c_2 p_2 - ... - c_m p_m\right).

Thus, {q_1, p_2, p_3, ..., p_m} is also a basis for S. Further substitution leads to:

{q_1, q_2, p_3, ..., p_m}
{q_1, q_2, q_3, ..., p_m}
...
{q_1, q_2, q_3, ..., q_n, p_{n+1}, ..., p_m}.

Since all n vectors of U can be inserted in this way, we must have m ≥ n. Exchanging the roles of T and U, i.e., starting with p_1 instead of q_1, we find n ≥ m. It follows that n = m.

Definition 2.23 (Dimension) Let T be a Hamel basis for S. The cardinality of T is the dimension of S, |T| = dim(S). It equals the number of linearly independent vectors required to span the space S.

Note: each vector space has at least one Hamel basis.


Operations in the vector space are often simpler and of lower complexity in their basis.

Let’s now revisit Example 2.33. Which necessary conditions do we have so far?

1. We recognize that if (g1 ,g2 ) are solutions, so are (αg1 ,αg2 ), that is any scaled version.
We can thus not expect a unique solution.
2. We recall Bezout's Theorem 1.1 and conclude that h1 and h2 need to be co-prime. Otherwise a common part could be split off and would alter the input sequence sk.
3. The input sequence sk must be persistently exciting.
4. We can enforce that the vectors r_k^{(i)}; i = 1,2; k = 1,2,...,m are linearly dependent. As each vector contains N elements and we have 2m such vectors, we simply select 2m > N. Since at most N vectors can be linearly independent in IR^N, with 2m > N vectors we ensure that some vectors are linearly dependent.
It turns out that these conditions are indeed sufficient for a solution that is unique up to
the scaling of the impulse responses. The existence and uniqueness of the solution is an
important step for finding the solution. The solution itself will be presented in a later
chapter.

2.2.3 Complete Spaces


Definition 2.24 (Pre-Hilbert Space) A vector space for which an inner vector prod-
uct is defined is said to be a pre-Hilbert space (Ger.: Innerer Vektorproduktraum, Vor-
Hilbertraum). An inner vector product maps two vectors onto one scalar with the following
properties:

0) \langle x,x\rangle > 0 for every x \ne 0, and \langle x,x\rangle = 0 for x = 0.
1) \langle x,y\rangle = \langle y,x\rangle^*
2) \langle \alpha x,y\rangle = \alpha \langle x,y\rangle
3) \langle x + y, z\rangle = \langle x,z\rangle + \langle y,z\rangle

Example 2.39 Consider the space of continuous functions C[a,b] with the two elements
x(t) and h(t). Let x(t) be an input signal and h(t) the impulse response of a low pass. We
have:
y(T) = \int_0^T x(\tau) h(T-\tau)\, d\tau = \int_0^T x(\tau) g(\tau)\, d\tau = \langle x, g\rangle,
with g(\tau) = h(T-\tau).


Example 2.40 A simple example are the vectors in C^n: \langle x,y\rangle = y^H x. But also matrices can build inner products by the trace operator:

\mathrm{tr}(B^H A) = \langle A,B\rangle.

Example 2.41 Consider the expectation of a random variable:


E[xy] = \int\!\!\int x\, y\, f_{xy}(x,y)\, dx\, dy = \langle x,y\rangle_f.

This is also an inner vector product, however with an additional weighting function f (x,y).

To illustrate the inner product of two vectors, let us consider two 2-dimensional vectors
x and y. Their inner product (projection) is a measure of non-orthogonality as depicted in
Figure 2.10. If the two vectors are perpendicular (orthogonal) to each other, the inner prod-
uct is zero; if they are parallel (antiparallel), the inner product is a maximum (minimum).

Definition 2.25 (Banach and Hilbert Space) A complete normed vector space is said to be a Banach space. If there is additionally an inner vector product (i.e., the norm is an induced norm), the space is said to be a Hilbert space.

(The definition of a norm follows later.)

Example 2.42 The space of continuous functions (C[a,b],d∞ ) is a Banach space.

Example 2.43 The space of continuous functions (C[a,b],dp ) is for finite p not a Banach
space since it is not complete.

Figure 2.10: The inner product of two 2-dimensional vectors: \langle x,y\rangle > 0, \langle x,y\rangle = 0, and \langle x,y\rangle < 0.

Example 2.44 The space of sequences lp (0,∞) is a Banach space. For p = 2 it is also a
Hilbert space.

Example 2.45 The space of functions Lp [a,b] is a Banach space. For p = 2 it is also a
Hilbert space. This Hilbert space is often denoted as L2 (IR) or L2 (IR) for functions and
l2 (IR) or l2 (IR) for sequences.

In the following part of the lecture we will exclusively consider Hilbert spaces (if not noted otherwise). Figure 2.11 illustrates the relation of several metric vector spaces.

Definition 2.26 (Orthogonal) Vectors of a pre-Hilbert space are said to be orthogonal or perpendicular (Ger.: normal) if
\langle x,y\rangle = 0.

Definition 2.27 (Orthonormal) Vectors of a pre-Hilbert space are said to be orthonormal (Ger.: orthonormiert) if:
\langle x,y\rangle = 0, \quad \langle x,x\rangle = 1, \quad \langle y,y\rangle = 1.

Definition 2.28 (Orthogonal Basis) A Hamel basis of dimension m is said to be orthogonal if for all basis vectors T = {p_1, p_2, p_3, ..., p_m}:
\langle p_i, p_j\rangle = 0 \text{ for } i \ne j, \quad \text{and} \quad \langle p_i, p_i\rangle \ne 0.

Definition 2.29 (Orthonormal Basis) A Hamel basis of dimension m is said to be orthonormal if all basis vectors T = {p_1, p_2, p_3, ..., p_m} satisfy:
\langle p_i, p_j\rangle = \delta_{i-j} = \begin{cases} 0, & i \ne j \\ 1, & i = j \end{cases}.

Figure 2.11: The relation of vector spaces.

Example 2.46 The following set of vectors is orthogonal:

\left\{ \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ -1 \\ 1 \\ -1 \end{pmatrix}, \begin{pmatrix} 1 \\ 1 \\ -1 \\ -1 \end{pmatrix}, \begin{pmatrix} 1 \\ -1 \\ -1 \\ 1 \end{pmatrix} \right\}.

How do we have to modify the set in order to make the vectors orthonormal?
Answer: we multiply all vectors by 1/2.

Consider a set of orthogonal vectors. We then have:

\left\langle \sum_{i=1}^{m} \alpha_i p_i, \sum_{i=1}^{m} \alpha_i p_i \right\rangle = \sum_{i=1}^{m} |\alpha_i|^2 \langle p_i, p_i\rangle.

Consider a set of orthonormal vectors. We then have:

\left\langle \sum_{i=1}^{m} \alpha_i p_i, \sum_{i=1}^{m} \alpha_i p_i \right\rangle = \sum_{i=1}^{m} |\alpha_i|^2.

Example 2.47 The following set of functions is orthonormal in [−4π, 4π]:
{1, e^{jt/4}, e^{j2t/4}, e^{j3t/4}}. The inner product is defined as:
\langle f,g\rangle = \frac{1}{8\pi} \int_{-4\pi}^{4\pi} f(t) g^*(t)\, dt.

Let, for example, f(t) = e^{jt/4} and g(t) = e^{j2t/4}:

\langle f,g\rangle = \frac{1}{8\pi} \int_{-4\pi}^{4\pi} e^{jt/4} e^{-j2t/4}\, dt = \frac{1}{8\pi} \int_{-4\pi}^{4\pi} e^{-jt/4}\, dt = 0.

Let T = {p_1, p_2, ..., p_n} be a set of vectors. How can we find a set S = {q_1, q_2, ..., q_m} with m smaller than or equal to n so that span(S) = span(T) and the vectors in S are orthonormal? The answer to this problem is known as the Gram-Schmidt7 method:

1. take p_1 and construct: q_1 = p_1 / \sqrt{\langle p_1, p_1\rangle}.

2. build: e_1 = p_2 - \langle p_2, q_1\rangle q_1; \quad q_2 = e_1 / \sqrt{\langle e_1, e_1\rangle}.

3. continue: e_2 = p_3 - \langle p_3, q_1\rangle q_1 - \langle p_3, q_2\rangle q_2; \quad q_3 = e_2 / \sqrt{\langle e_2, e_2\rangle}, ...

4. if e_i = 0, throw away p_{i+1} and continue.
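A minimal NumPy sketch of this procedure may be helpful (not part of the original script; the input vectors below are arbitrary examples and np.vdot conjugates its first argument, matching \langle x,y\rangle = y^H x):

import numpy as np

def gram_schmidt(P, tol=1e-12):
    # Orthonormalize the columns of P, discarding vectors that give e_i = 0.
    Q = []
    for p in P.T:
        e = p.astype(complex)
        for q in Q:
            e = e - np.vdot(q, e) * q      # subtract projection <p, q> q
        n = np.linalg.norm(e)
        if n > tol:
            Q.append(e / n)
    return np.array(Q).T

P = np.array([[1., -2., 1., 7.],
              [6.,  4., 1., 5.],
              [5.,  2., 0., 2.]])
Q = gram_schmidt(P)
print(Q.shape)                              # at most 3 orthonormal vectors remain
print(np.allclose(Q.conj().T @ Q, np.eye(Q.shape[1])))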

Orthogonal (and orthonormal) bases offer some advantages when calculating with them.
Consider a vector f = [a,b,c]T and let P = {p1 ,p2 ,p3 } be an orthonormal basis. We can
display the vector as
f = \langle f, p_1\rangle p_1 + \langle f, p_2\rangle p_2 + \langle f, p_3\rangle p_3 = [p_1, p_2, p_3] \begin{pmatrix} \langle f, p_1\rangle \\ \langle f, p_2\rangle \\ \langle f, p_3\rangle \end{pmatrix}.

The proof of this statement is easily done by left multiplication of [p1 ,p2 ,p3 ]T . The or-
thonormal basis thus works as a filter to separate the individual components of f .
7
Erhard Schmidt 13.1.1876- 6.12.1959 was a German Mathematician.

Definition 2.30 (Biorthonormal) If there are two bases, T = {p1 ,p2 ,p3 ...,pm } and U =
{q 1 ,q 2 ,q 3 ...,q m } that span the same space with the additional property:
\langle p_i, q_j\rangle = k_{ij}\, \delta_{i-j}

then these bases are said to be dual or biorthogonal (biorthonormal for kij = 1).

Example 2.48 Consider the following four vectors:

p1 = [1,0]T ; p2 = [−1,1]T ; q 1 = [1,1]T ; q 2 = [0,1]T ;

These pairs build a dual basis in IR2 . Consider a vector

f = [a,b]T

then
f = \langle f, q_1\rangle p_1 + \langle f, q_2\rangle p_2 = (a + b) p_1 + b\, p_2
  = \langle f, p_1\rangle q_1 + \langle f, p_2\rangle q_2 = a\, q_1 + (-a + b)\, q_2.

The biorthogonal basis allows a fast reinterpretation from one basis to the other. This concept was used widely in the context of wavelets in the 1990s and has become important in modern filter bank designs for wireless transmissions, so-called Filter Bank Multi-Carrier (FBMC) techniques.

2.3 Norms in a Linear Vector Space


Similar to the definition of inner vector products, we will now introduce the concept of a
norm in the linear vector space.

Definition 2.31 (Norm) Let S be a vector space with elements x. A real-valued function
kxk is said to be a norm (length) of x, if the following four properties are satisfied:
0) kxk ≥ 0 for every x ∈ S.
1) kxk = 0 iff x = 0.
2) kαxk = |α|kxk.
3) kx + yk ≤ kxk + kyk (triangle inequality).

Note that 0) follows from 1)–3)! Norms are a special form of metrics. They are tailored to be suitable for the linear vector space. In consequence, metrics are not necessarily norms (see, e.g., the Hamming distance metric).

Example 2.49 Some metrics can directly be used to define a norm. Take for example the
dp (·) metric. We find

dp (x,y) = kx − ykp ; dp (x,0) = kxkp .

We thus find
l_1\text{-norm: } \|x\|_1 = \sum_{i=1}^{n} |x_i|,
l_2\text{-norm: } \|x\|_2 = \sqrt{\sum_{i=1}^{n} |x_i|^2},
l_\infty\text{-norm: } \|x\|_\infty = \max_{1 \le i \le n} |x_i|,
L_1\text{-norm: } \|x(t)\|_1 = \int_a^b |x(t)|\, dt,
L_2\text{-norm: } \|x(t)\|_2 = \sqrt{\int_a^b |x(t)|^2\, dt},
L_\infty\text{-norm: } \|x(t)\|_\infty = \sup_{t \in [a,b]} |x(t)|.

Note that although these are commonly defined norms, adding a multiplicative constant does
not change the norm property. Thus, a normalization can also be found, e.g.,
L_2\text{-norm: } \|x(t)\|_2 = \sqrt{\frac{1}{b-a}\int_a^b |x(t)|^2\, dt}.
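As a quick numerical illustration (a sketch only; the vector is an arbitrary example), the three vector norms above can be evaluated directly or with NumPy's built-in routine:

import numpy as np

x = np.array([3., -4., 0., 1.])
l1  = np.sum(np.abs(x))               # ||x||_1
l2  = np.sqrt(np.sum(np.abs(x)**2))   # ||x||_2
loo = np.max(np.abs(x))               # ||x||_inf
print(l1, l2, loo)
print(np.linalg.norm(x, 1), np.linalg.norm(x, 2), np.linalg.norm(x, np.inf))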

Definition 2.32 (Normed Space) A normed, linear space is a pair (S,k · k) in which S
is a vector space and k · k is a norm on S. Metrics of a normed linear space are defined by
norms.

Definition 2.33 (Normalized Vector) A vector is said to be normalized (Ger.: normiert; unit vector, Ger.: Einheitsvektor) if \|x\| = 1. Except for the zero vector, all vectors can be normalized.

Note that the elements (vectors) of a linear and normed space are not necessarily
normalized vectors.
Only the space is normed, meaning that a norm exists for this space.

An important theorem in the context of norms is the Norm Equivalence Theorem that follows next. Loosely speaking, it claims that the knowledge of one norm is sufficient to draw conclusions about all other norms. More precisely, it shows that convergence in one norm guarantees convergence in any norm. Let us assume that a limit (vector) y^* exists. We consider
a sequence of vectors y k with time index k. The convergence of this series to y ∗ can be
quantified by a norm: limk→∞ ky k − y ∗ k = 0. We can also replace xk = y k − y ∗ and consider
sequences of vectors that converge to the zero vector.

Theorem 2.2 (Norm Equivalence) Let \|\cdot\| and \|\cdot\|' be two norms on finite-dimensional spaces IR^n (or C^n, Q^n, Z^n, N^n, B^n); then we have:

\lim_{k\to\infty} \|x_k\| = 0

iff

\lim_{k\to\infty} \|x_k\|' = 0.

Proof:
Without restricting generality we set (Ger.: oBdA) \|\cdot\|' = \|\cdot\|_2 and obtain for the lower bound:

x = \sum_{i=1}^{n} x_i e_i
\|x\| \le \sum_{i=1}^{n} |x_i|\, \|e_i\| \le \max_i\{|x_i|\} \sum_{i=1}^{n} \|e_i\| \le \|x\|_2 \sum_{i=1}^{n} \|e_i\| = \frac{1}{\alpha} \|x\|_2,
\alpha \|x\| \le \|x\|_2.

We have used the fact that the l_2-norm is an upper bound for the l_\infty-norm. This is easily shown by8

\|x\|_2 = \sqrt{\sum_i |x_i|^2} = |x_{\max}| \sqrt{\sum_i \frac{|x_i|^2}{|x_{\max}|^2}} = \|x\|_\infty \sqrt{1 + ...} \ge \|x\|_\infty.

For the upper bound β consider the case \|x\|_2 = c > 0. We then obtain:

\|x\| = \left\| \frac{x}{\|x\|_2} \right\| \|x\|_2 \ge \frac{1}{\beta} \|x\|_2; \quad \beta > 0.

With the above condition, x cannot be the zero vector and since k · k > 0 there must be a
positive lower bound β with c = kxk2 ≤ βkxk. Since the property is true for the l2 norm,
8
With this method, it can be shown that kxkp ≥ kxk∞ .

it is also true for all other norms!

Note that the equivalence theorem is formulated in terms of Cauchy sequences. This is a consequence of the proof: if a particular norm tends to zero, then so does every other norm. The equivalence is to be understood in this sense.

Norms often find their application in terms of energy relations. It can be in form of
average energy (l2 -norm) or peak values (l∞ -norm).

Thus, norms appear often when we describe systems. (Robustness, nonlinear systems,
convergence of learning systems such as adaptive equalizers or controllers).
Example 2.50 Consider a hands-free telephone (Ger.: Freisprechtelefon) as depicted in
Figure 2.12. The far end speaker signal is linearly distorted by echos (h). These echos
are electronically reproduced by filtering with ĥ. If both impulse responses are identical, the
echos disappear and only the local speaker signal vk is transmitted to the far end speaker.
A measure of how much the two agree is given by the norm

kh − ĥk2 .

Once this norm is small, the adaptation algorithm for learning h rests, while for large norms
the algorithm needs to run.


Figure 2.12: Hands-free telephone.

Example 2.51 Consider an adaptive equalizer as part of a receiver as depicted in Fig-


ure 2.13. The data signal sk is linearly distorted by the channel h. A linear equalizer f is supposed to reconstruct the signal so that ŝk = sk. However, due to the additive noise this is a difficult task. The quality of the equalizer's performance is measured by a cost function


Figure 2.13: Adaptive equalizer.

of the form
f (|ŝk − sk |)
which it tries to minimize (adaptive process). The error at its output is given by an error
vector:
e_k = [\hat{s}_{k-N} - s_{k-N}, \hat{s}_{k-N+1} - s_{k-N+1}, ..., \hat{s}_k - s_k] = [e_{k-N}, ..., e_k].
Let the adaptive equalizer have the property that the error signal from k − 1 to k is mapped
via the learning rule: y = g(x) = x3 . Under which condition is the equalizer adaptive?
For the error we find
   
ek−N e3k−N −1
 ek−N +1   e3 
 k−N 
ek =  ..  = f (ek−1 ) =  .. .
 
 .   . 
3
ek ek−1
We can thus compare the energy norms of the error vector from one time instant to the next under worst-case conditions:
\sup_{e_k \in l_h} \frac{\|e_k\|_2^2}{\|e_{k-1}\|_2^2} = \sup_{e_k \in l_h} \frac{e_k^2 + e_{k-1}^2 + ... + e_{k-N}^2}{e_{k-1}^2 + e_{k-2}^2 + ... + e_{k-N-1}^2} = \sup_{e_k \in l_h} \frac{e_{k-1}^6 + e_{k-2}^6 + ... + e_{k-N-1}^6}{e_{k-1}^2 + e_{k-2}^2 + ... + e_{k-N-1}^2} < 1.

As long as |ek | < 1 we find that the output is always smaller than the input energy. Thus
no matter what the error sequence is, as long as it preserves such property the algorithm
ensures that the error terms decrease and thus the output signal becomes sk .
Example 2.52 Consider the linear system in matrix-vector form:

\begin{pmatrix} y_{k-N+1} \\ \vdots \\ y_{k-2} \\ y_{k-1} \\ y_k \end{pmatrix} = \begin{pmatrix} h_{M-1} & \cdots & h_0 & & & \\ & h_{M-1} & \cdots & h_0 & & \\ & & \ddots & & \ddots & \\ & & & h_{M-1} & \cdots & h_0 \end{pmatrix} \begin{pmatrix} x_{k-N-M+2} \\ \vdots \\ x_{k-2} \\ x_{k-1} \\ x_k \end{pmatrix},

or short:
y_{N,k} = H_N x_{N+M-1,k}.
Consider the ratio of input and output energy:

\sup_{x_k \in l_h} \frac{\|y_{N,k}\|_2^2}{\|x_{N+M-1,k}\|_2^2} = f(H_N)?

The linear system HN will relate the input energy to an output energy. The ratio may
depend on the input sequence xk . But if we ask for the largest possible value for this ratio,
then it should only be dependent on the system HN itself. But at this point we cannot see
how to derive such property. In case the system is an allpass, how would we be able to
identify this property just based on the knowledge of HN ? The answers to these questions
will follow. See for example the discussion after (2.6) and in Section 5.4.3.

2.3.1 Matrix Norms


In principle we can apply vector norm concepts directly to matrices. Take for example any p-norm; we can apply it to the elements of a matrix just as if they formed a vector:

\|A\|'_p = \left( \sum_{i=1}^{m} \sum_{j=1}^{n} |A_{i,j}|^p \right)^{1/p}.

The most common one in this context is the so-called Frobenius9 norm for p = 2:

\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} |A_{i,j}|^2 = \mathrm{tr}(A^H A).

Later (see Example 4.33) we will recognize that this (squared) norm is identical to the sum of the squared singular values of such a matrix:

\|A\|_F^2 = \sum_{i=1}^{p} \sigma_i^2; \quad p = \min(m,n).

But also the p = 1 and p → ∞ norms provide practical values:

\|A\|'_1 = \sum_{i=1}^{m} \sum_{j=1}^{n} |A_{i,j}|,
\|A\|'_\infty = \max_{i,j} |A_{i,j}|.

9
After the German Mathematician Ferdinand Georg Frobenius (1849-1917).

There exist other norms on matrices. The most common ones are the maximum row
and column sums, that is
\|A\|_1 = \max_j \sum_{i=1}^{m} |A_{ij}| \quad (2.1)

\|A\|_\infty = \max_i \sum_{j=1}^{n} |A_{ij}| = \|A^T\|_1. \quad (2.2)
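A short NumPy sketch (the matrix below is an arbitrary example) computes these matrix norms and compares them against the corresponding built-in routines:

import numpy as np

A = np.array([[1., -2.],
              [3.,  4.]])
frob    = np.sqrt(np.trace(A.conj().T @ A))   # ||A||_F
col_sum = np.max(np.sum(np.abs(A), axis=0))   # ||A||_1  : maximum column sum (2.1)
row_sum = np.max(np.sum(np.abs(A), axis=1))   # ||A||_inf: maximum row sum   (2.2)
print(frob,    np.linalg.norm(A, 'fro'))
print(col_sum, np.linalg.norm(A, 1))
print(row_sum, np.linalg.norm(A, np.inf))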

2.3.2 Induced Norms


Induced norms are norms induced by a ‘simpler concept’. This is a relatively vague statement. A typical example of an induced norm is the vector 2-norm, which is induced by an inner vector product:
\|x\|_2 = \sqrt{\langle x,x\rangle}.
Example 2.53 Consider an induced norm in L_2[a,b]:
\|x(t)\|_2 = \langle x(t),x(t)\rangle^{1/2} = \left( \int_a^b |x(t)|^2\, dt \right)^{1/2}.

Note that for real-valued signals we have

\|x\|_2^2 + \|y\|_2^2 - \|x - y\|_2^2 = 2 \langle x,y\rangle.

In case the signals are in the complex domain, we find

\|x\|_2^2 + \|y\|_2^2 - \|x - y\|_2^2 = 2\, \Re\langle x,y\rangle,
that is only the real part results. Both formulations allow to describe inner products by
(energy) norms.

Vector norms can also be used to induce norms on multivariate functions. Consider the
following example:
f (x1 ,x2 ) = |x1 x2 | exp(−|x1 x2 |).
The graph of the function is illustrated in Figure 2.14. We induce a vector norm, i.e.
\|f\| = \sup_{x=[x_1,x_2]} \frac{|f(x_1,x_2)|}{\|x\|},
and select the l_2-norm:
\|f\|_2 = \sup_{x=[x_1,x_2]} \frac{|x_1 x_2| \exp(-|x_1 x_2|)}{\|x\|_2} = \sup_{x=[x_1,x_2]} \frac{|x_1 x_2| \exp(-|x_1 x_2|)}{\sqrt{x_1^2 + x_2^2}} \approx 0.2.
Figure 2.14: Function f (x1 ,x2 ).

Very common are vector norms that are induced to a matrix, for example all vector
p-norms define induced matrix norms:

\|A\|_{p,\mathrm{ind}} = \sup_{x \ne 0} \frac{\|Ax\|_p}{\|x\|_p} = \sup_{\|x\|_p = 1} \|Ax\|_p.

The most common one in use is the so-called spectral norm:

\|A\|_{2,\mathrm{ind}} = \sup_{x \ne 0} \frac{\|Ax\|_2}{\|x\|_2} = \sup_{\|x\|_2 = 1} \|Ax\|_2 = \sqrt{\lambda_{\max}(A^H A)} = \sigma_{\max}(A).

It can be interpreted as the largest elongation of an ellipsoid that is the output defined by
kAxk2 for an input vector on a unit sphere given by kxk2 = 1, see also Example 4.33.
In mathematical terms the variation in input vectors x stimulates the various modes λi
(eigenvalues) of A at its output, thus displaying the spectrum of these modes. However,
technically speaking, we can also interpret this for a matrix that describes the impulse
response of an FIR system. Consider an input vector at time instant k of the form

xM,k = [exp(jΩk), exp(jΩ(k − 1)),..., exp(jΩ(k − M + 1))]T , (2.3)

that is for a system of order M , a harmonic excitation at frequency Ω. Once the time

continues we find

xM,k+1 = [exp(jΩ(k + 1)), exp(jΩk),..., exp(jΩ(k − M + 2))]T = exp(jΩ)xM,k ,


...
xM,k+m = [exp(jΩ(k + m)), exp(jΩ(k + m − 1)),..., exp(jΩ(k − M + 1 + m))]T
= exp(jΩm)xM,k .

In other words, all such vectors are linearly dependent, independent of their order M or
their frequency Ω. Let’s now use such vector series as input of an FIR system with impulse
response
h = [h0 ,h1 ,...,hM −1 ]T .
We can describe the input-output relation by
  
A_M x_{M,k+l} = \begin{pmatrix} h_0 & \cdots & h_{M-1} & & 0 \\ & h_0 & \cdots & h_{M-1} & \\ & & \ddots & & \ddots \\ 0 & & h_0 & \cdots & h_{M-1} \end{pmatrix} \begin{pmatrix} \exp(j\Omega(k+l)) \\ \vdots \\ \exp(j\Omega(k+l+m-2)) \\ \exp(j\Omega(k+l+m-1)) \end{pmatrix}
= \begin{pmatrix} H(\exp(j\Omega)) \exp(j\Omega l) \\ \vdots \\ H(\exp(j\Omega)) \exp(j\Omega(l+m-2)) \\ H(\exp(j\Omega)) \exp(j\Omega(l+m-1)) \end{pmatrix} = H(\exp(j\Omega))\, x_{M,l}.

If we compute now the 2-induced norm for such matrix AM , we find for a growing size M
\lim_{M\to\infty} \|A_M\|_{2,\mathrm{ind}} = \lim_{M\to\infty} \sup_{x} \frac{\|A_M x\|_2}{\|x\|_2} \quad (2.4)
= \lim_{M\to\infty} \sup_{\|x_{M,k+l}\|_2^2 = M} \frac{\|A_M x_{M,k+l}\|_2}{\|x_{M,k+l}\|_2} \quad (2.5)
= \sup_{\Omega} |H(\exp(j\Omega))|. \quad (2.6)

In other words, for a transfer function we can simply detect the frequency at which it takes on its maximum magnitude, that is, the gain of such a linear time-invariant system. The so obtained value is equivalent to the 2-induced norm of the corresponding Toeplitz matrix once its dimension grows large. Note also the further remarks on this in Section 5.4.3.
This result already provides an answer to the questions in Example 2.52, telling us that
there exists an excitation vector of harmonic form for which we have the largest gain. If
this gain is larger than one, we have an amplifying system. If the gain is smaller than one,
there is no excitation for which more energy can come out. We thus have then a passive

system.
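A small numerical sketch (not from the original script; the impulse response is an arbitrary example) illustrates (2.4)–(2.6): the spectral norm of a growing convolution (Toeplitz) matrix approaches the maximum magnitude of the transfer function.

import numpy as np

h = np.array([1.0, -0.5, 0.25])                 # arbitrary FIR impulse response

def conv_matrix(h, N):
    # N x (N+M-1) banded Toeplitz matrix with rows [h_0, ..., h_{M-1}] shifted
    M = len(h)
    A = np.zeros((N, N + M - 1))
    for i in range(N):
        A[i, i:i + M] = h
    return A

for N in (4, 16, 64, 256):
    print(N, np.linalg.norm(conv_matrix(h, N), 2))   # largest singular value

Omega = np.linspace(0, np.pi, 2048)
H = sum(h[k] * np.exp(-1j * Omega * k) for k in range(len(h)))
print(np.max(np.abs(H)))                             # sup_Omega |H(e^{jOmega})|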

Although in general \|A\|_p \ne \|A\|_{p,\mathrm{ind}}, for two cases identical values result:

\|A\|_1 = \|A\|_{1,\mathrm{ind}} \quad \text{and} \quad \|A\|_\infty = \|A\|_{\infty,\mathrm{ind}}.

To understand these identities, we have to consider special cases for \|x\|_1 = 1 and \|x\|_\infty = 1. If the unit norm \|x\|_1 = 1 is achieved, all energy can be in one element of vector x, that is, for x = \gamma e_m with |\gamma| = 1. The unit vector e_m thus selects the m-th element and neglects the remaining ones. In this case we can compute the maximum of the i-th element of Ax:

\max_{\|x\|_1 = 1} \left| \sum_j A_{ij} x_j \right| = \max_j |A_{ij}|.

If, on the other hand, the unit norm \|x\|_\infty = 1 is achieved, all elements of x can be equally strong, that is, x = [\gamma_1, \gamma_2, ..., \gamma_M]^T with |\gamma_i| = 1; i = 1,2,...,M:

\max_{\|x\|_\infty = 1} \left| \sum_j A_{ij} x_j \right| = \sum_j |A_{ij}|,

which we find by simply selecting \gamma_j = A_{ij}^* / |A_{ij}|.


With such preceding considerations we are now able to show the identities (2.1) and
(2.2). We start with (2.1):

kAk1,ind = max kAxk1


kxk1 =1

X X
= max Aij xj


kxk1 =1
i j

X X X
= max Aij xj = max |Aij |

kxk1 =1 j
i j i
m
X
= max |Aij | = kAk1 .
j
i=1
94 Signal Processing 1

Similarly, for (2.2) we obtain:

\|A\|_{\infty,\mathrm{ind}} = \max_{\|x\|_\infty = 1} \|Ax\|_\infty = \max_{\|x\|_\infty = 1} \max_i \left| \sum_j A_{ij} x_j \right| = \max_i \max_{\|x\|_\infty = 1} \left| \sum_j A_{ij} x_j \right| = \max_i \sum_{j=1}^{n} |A_{ij}| = \|A\|_\infty.

2.3.3 Submultiplicative Property


Some matrix norms show a so-called submultiplicative property, that is

kABk ≤ kAk kBk .

Note, however, that if the matrices are not square, the norms can be different mappings. All induced vector p-norms have this submultiplicative property (with y = Bx):

\|AB\|_{p,\mathrm{ind}} = \sup_{x} \frac{\|ABx\|_p}{\|x\|_p} = \sup_{x} \frac{\|Ay\|_p}{\|y\|_p} \frac{\|y\|_p}{\|x\|_p} \le \|A\|_{p,\mathrm{ind}} \sup_{x} \frac{\|y\|_p}{\|x\|_p} = \|A\|_{p,\mathrm{ind}} \sup_{x} \frac{\|Bx\|_p}{\|x\|_p} = \|A\|_{p,\mathrm{ind}} \|B\|_{p,\mathrm{ind}}.

Example 2.54 Note that not all matrix norms have this submultiplicative property. Take for example \|A\|_\Delta = \max_{ij} |A_{ij}|. Consider A = B = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}. Here we find \|AB\|_\Delta = 2 > \|A\|_\Delta \|B\|_\Delta = 1.10

10 Note, however, that the following relations hold for any unitarily invariant norm:
\|AB\| \le \|A\| \|B\|_{2,\mathrm{ind}},
\|AB\| \le \|B\| \|A\|_{2,\mathrm{ind}}.

Example 2.55 Let us return to our previous Example 2.52 and try to understand if norms
can help us to find out if a given matrix relates to an allpass or not. When we consider the
ratio of input and output signal norm, the first problem we encounter is a technical one: the
input vector is of different length than the output vector. This can be helped by extending
the horizon to N → ∞, making the vectors infinitely long:
\lim_{N\to\infty} \|H_N\|_{p,\mathrm{ind}} = \lim_{N\to\infty} \sup_{x \ne 0} \frac{\|H_N x_N\|_p}{\|x_N\|_p}.
If we find
\lim_{N\to\infty} \|H_N\|_{p,\mathrm{ind}} = 1,
we still cannot conclude to have an allpass, as this reflects the worst-case scenario over all input sequences. For the case p = 2, we obtain the spectral norm, which means:
\lim_{N\to\infty} \|H_N\|_{2,\mathrm{ind}} = \max_{\Omega} |H(e^{j\Omega})|,
thus the maximum of the transfer function. At this point we only have a necessary but not sufficient condition.

Example 2.56 A further problem seems to be that the dimension of HN needs to grow
which is numerically difficult to handle. We could take the state space form instead as we
have matrices of finite dimension there.
    
\begin{pmatrix} z_{k+1} \\ y_k \end{pmatrix} = \begin{pmatrix} A & b \\ c^T & d \end{pmatrix} \begin{pmatrix} z_k \\ x_k \end{pmatrix}.
Let us take, for example, the simple transfer function yk = xk−2 , that is a pure delay by two
time instances. This must be an allpass. We find
   
z_{k+1} = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} z_k + \begin{pmatrix} 0 \\ 1 \end{pmatrix} x_k,
y_k = [1, 0]\, z_k + 0 \cdot x_k.
We can combine both equations into
 
\begin{pmatrix} z_{k+1} \\ y_k \end{pmatrix} = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix} \begin{pmatrix} z_k \\ x_k \end{pmatrix}.
Summing up all energy terms from zero to N , we find that
\frac{\|z_{N+1}\|^2 + \sum_{k=1}^{N} |y_k|^2}{\|z_0\|^2 + \sum_{k=1}^{N} |x_k|^2} = 1.
Including the initial and final memory storage, we recognize that the system indeed behaves like an allpass.

Example 2.57 We consider a third allpass example, this time a time-variant system. Let us start with a time-invariant system with two inputs and two outputs:

\begin{pmatrix} y_k^{(1)} \\ y_k^{(2)} \end{pmatrix} = \begin{pmatrix} a & b \\ -b & a \end{pmatrix} \begin{pmatrix} x_k^{(1)} \\ x_k^{(2)} \end{pmatrix} = A x.

As long as b = \sqrt{1 - |a|^2} and 0 \le |a| \le 1, the system behaves like an allpass. We can easily extend this behavior to a time-variant system and find

\begin{pmatrix} y_k^{(1)} \\ y_k^{(2)} \end{pmatrix} = \begin{pmatrix} a_k & b_k \\ -b_k & a_k \end{pmatrix} \begin{pmatrix} x_k^{(1)} \\ x_k^{(2)} \end{pmatrix} = A_k x.

This system also behaves like an allpass.

The examples show transfer matrices A or A_k that share a particular property: A^H A = I. Matrices with such a property are called unitary. They preserve energy and are thus equivalent forms to describe allpasses. The eigenvalues of unitary matrices are one in magnitude.

2.4 Application of Norms


2.4.1 The Small Gain Theorem
The small-gain theorem can be seen as the generalization of the stability conclusions of
the linear system theory. There, as it is generally known, a closed control loop is stable
if the open control loop shows amplification below one, thus attenuation. The small gain
theorem however holds for arbitrary systems, and thus also for non-linear and time-variant
systems. It was introduced by George Zames (1934-1997) in 1966.

With help of the small gain theorem we can solve stability problems as the example in
Figure 2.15 where nonlinear elements are included in the transfer path. Once the feedback
loop is closed, one would like to know conditions under which the closed loop system behaves
stable.
Let us consider an input-signal xk , k = 1,2,...,N , denoted by a vector xN of the dimension
N × 1. Then, let the answer of a system HN onto this signal be given by

y N = HN xN .

Definition 2.34 A mapping H is called l-stable if two positive constants γ,β exist, such
that for all input-signals xN , for the output-signal, the following holds:

ky N k = kHN xN k ≤ γkxN k + β.

Figure 2.15: Feedback loop including non-linear elements (forward path y = 1/(1 + h^2), feedback path 0.5 q^{-1}). Under which circumstances is the closed loop system stable?

Definition 2.35 The smallest positive constant γ, for which the l-stability is fulfilled, is
called gain of the system HN .

Remark: The term BIBO (bounded-input bounded-output) stability denotes l∞ -stability.

We now examine a feedback arrangement of the two systems HN and GN , with the gains
γh and γg , respectively, where the following holds (see Figure 2.16):

y N = HN hN = HN [xN − z N ]
z N = GN g N = GN [uN + y N ].


Figure 2.16: Small Gain Theorem.



Theorem 2.3 (Small Gain Theorem) If the gains γh and γg are such that

γh γg < 1,

then the signals hN and g N are limited by

\|h_N\| \le \frac{1}{1 - \gamma_h \gamma_g} \left[ \|x_N\| + \gamma_g \|u_N\| + \gamma_g \beta_h + \beta_g \right],
\|g_N\| \le \frac{1}{1 - \gamma_h \gamma_g} \left[ \|u_N\| + \gamma_h \|x_N\| + \gamma_h \beta_g + \beta_h \right].

Proof: It holds that

hN = xN − z N
g N = uN + y N .

And thus, for the norm:

\|h_N\| \le \|x_N\| + \|G_N g_N\|
\le \|x_N\| + \gamma_g \|g_N\| + \beta_g
\le \|x_N\| + \gamma_g [\|u_N\| + \gamma_h \|h_N\| + \beta_h] + \beta_g
= \gamma_g \gamma_h \|h_N\| + \|x_N\| + \gamma_g \|u_N\| + \gamma_g \beta_h + \beta_g,
and therefore
\|h_N\| \le \frac{1}{1 - \gamma_g \gamma_h} \left[ \|x_N\| + \gamma_g \|u_N\| + \gamma_g \beta_h + \beta_g \right].

The relation for g N can be shown in analogy.

Example 2.58 Let the automatic control of a cell phone power amplifier look like a feedback
loop in which the actual power amplifier behaves like
H_2 = \begin{pmatrix} 0 & 2\, \frac{\rho_1 |x_2|^2}{1 + \rho_1 |x_2|^2} \\ 2.5\, \frac{\rho_2 |x_1|^2}{1 + \rho_2 |x_1|^2} & 0 \end{pmatrix}

with two inputs h2 = [x1 ,x2 ]T denoting the I and Q phase of the amplifier. The nonlinear
saturation behavior is parameterized by some positive constants ρ1 ,ρ2 > 0. In general one
expects the amplification to be on the main diagonal of H2 . In this case the engineer has
swapped I and Q and thus the entries moved away from the diagonal. We also recognize
that the gain in the individual paths is slightly different as this is not untypical for analog
devices. Correspondingly a feedback system G2 , often introduced to linearize the nonlinear
power amplifier, is given by:
G_2 = \begin{pmatrix} 0 & 1/3 \\ 0.37 & 0 \end{pmatrix}.

In order to apply the small gain theorem we have to compute the gains of the two systems.
Let us start with the linear system G2 : z 2 = G2 g 2 . Here we need to compute
\|z_2\|_2 \le \|G_2\|_{2,\mathrm{ind}} \|g_2\|_2 = \left\| \begin{pmatrix} 0 & 1/3 \\ 0.37 & 0 \end{pmatrix} \right\|_{2,\mathrm{ind}} \|g_2\|_2 = 0.37\, \|g_2\|_2.

The gain for the power amplifier is more challenging to compute. We also have:

\|y_2\|_2 \le \|H_2\|_{2,\mathrm{ind}} \|h_2\|_2 = \left\| \begin{pmatrix} 0 & 2\, \frac{\rho_1 |x_2|^2}{1 + \rho_1 |x_2|^2} \\ 2.5\, \frac{\rho_2 |x_1|^2}{1 + \rho_2 |x_1|^2} & 0 \end{pmatrix} \right\|_{2,\mathrm{ind}} \|h_2\|_2 = 2.5\, \|h_2\|_2.

How do we compute the gain of H2 ?


\|H_2\|_{2,\mathrm{ind}} = \sup_{h_2 = [x_1, x_2]^T} \frac{\|H_2 h_2\|_2}{\|h_2\|_2} = \sup_{h_2 = [x_1, x_2]^T} \sqrt{ \frac{ \left( 2\, \frac{\rho_1 |x_2|^2}{1 + \rho_1 |x_2|^2} \right)^2 x_2^2 + \left( 2.5\, \frac{\rho_2 |x_1|^2}{1 + \rho_2 |x_1|^2} \right)^2 x_1^2 }{ x_1^2 + x_2^2 } } = 2.5.
We can now check whether the closed loop system is stable. We find
0.37 \cdot 2.5 = 0.925 < 1,
and thus the closed loop system is stable.
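A numerical sketch of this check (not part of the original example; the values of ρ1, ρ2 and the search grid are arbitrary assumptions) estimates both gains and verifies the small gain condition:

import numpy as np

G2 = np.array([[0., 1/3.],
               [0.37, 0.]])
gain_G = np.linalg.norm(G2, 2)                      # spectral norm = 0.37

def gain_H2(rho1, rho2, grid):
    best = 0.0
    for x1 in grid:
        for x2 in grid:
            if x1 == 0 and x2 == 0:
                continue
            y1 = 2.0 * rho1 * x2**2 / (1 + rho1 * x2**2) * x2
            y2 = 2.5 * rho2 * x1**2 / (1 + rho2 * x1**2) * x1
            best = max(best, np.hypot(y1, y2) / np.hypot(x1, x2))
    return best

grid = np.linspace(-50, 50, 201)
gain_H = gain_H2(rho1=1.0, rho2=1.0, grid=grid)     # approaches 2.5
print(gain_G, gain_H, gain_G * gain_H < 1)          # 0.37 * 2.5 < 1: stable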

2.4.2 The Cauchy-Schwarz Inequality


Theorem 2.4 (Cauchy-Schwarz) In an inner-product space S \subseteq C^n with induced norm \|\cdot\| we have:

|\langle x,y\rangle| \le \|x\|_2 \|y\|_2.

The inequality becomes an equality if and only if x = \alpha y for some \alpha \in C.
Proof:
We start with the simple fact that
kx − αyk2 ≥ 0.
We find further that
0 \le \|x - \alpha y\|_2^2 = \|x\|_2^2 - 2\,\Re\langle x, \alpha y\rangle + |\alpha|^2 \|y\|_2^2
= \|x\|_2^2 - \frac{|\langle x,y\rangle|^2}{\|y\|_2^2} + \|y\|_2^2 \left( \alpha - \frac{\langle x,y\rangle}{\|y\|_2^2} \right)\left( \alpha^* - \frac{\langle x,y\rangle^*}{\|y\|_2^2} \right).

The minimum is obtained for:
\alpha = \frac{\langle x,y\rangle}{\|y\|_2^2}.
Thus:
0 \le \min_{\alpha} \|x - \alpha y\|_2^2 = \|x\|_2^2 - \frac{|\langle x,y\rangle|^2}{\|y\|_2^2}
|\langle x,y\rangle| \le \|x\|_2 \|y\|_2.

We are not done!

For x = αy we obtain kx − αyk = 0. Due to the norm property kxk ≥ 0, equality


to zero can only be obtained for kx − αyk = 0 if the argument is zero. Thus equality is
obtained if and only if x = αy.

Note that for the special case of complex-valued vectors we have

|xH y|2 ≤ (xH x)(y H y)

and for real-valued functions in the interval [a,b] we have


\langle f(t),g(t)\rangle^2 = \left( \int_a^b f(t) g(t)\, dt \right)^2 \le \int_a^b f^2(t)\, dt \int_a^b g^2(t)\, dt.

The Cauchy-Schwarz Inequality is a very powerful tool due to its second part that allows
a maximization or minimization and thus to optimize particular problems. Four of those
are shown in the next examples.

Example 2.59 The Correlation Coefficient


The correlation coefficient is a useful measure to find relations or closeness of two (or more)
random variables. Given two random variables x and y, it is defined as

r_{xy} = \frac{E\left[(x - m_x)(y - m_y)\right]}{\sigma_x \sigma_y}.

The fact that a correlation coefficient always lies in the range [−1,1] is due to the Cauchy-
Schwarz Inequality. To understand this, let us first transform the two variables in zero
mean variables, i.e., x0 = x − mx and y0 = y − my for which we know that σx = σx0 and
σy = σy0 . As E [x0 y0 ] = hx0 ,y0 i, the original problem can thus equivalently be formulated as

r_{x_0 y_0} = \frac{E[x_0 y_0]}{\sigma_{x_0} \sigma_{y_0}}.

Due to the definition of \sigma_{x_0}^2 = E[x_0^2] we recognize now Cauchy-Schwarz and conclude

-1 \le \frac{E[x_0 y_0]}{\sqrt{E[x_0^2]\, E[y_0^2]}} \le 1.

The same relation also holds for the more practical implementation of a correlation coeffi-
cient based on a time series. Let us assume we have a set of pairs (xk ,yk ) with k = 1,2,...,N
elements, then we find an inner product
\langle x_k, y_k\rangle = \frac{1}{N} \sum_{k=1}^{N} x_k y_k,

and the correlation coefficient becomes:

r_{xy} = \frac{\frac{1}{N} \sum_{k=1}^{N} (x_k - m_x)(y_k - m_y)}{\sigma_x \sigma_y}

with the proper definitions


m_x = \frac{1}{N} \sum_{k=1}^{N} x_k, \qquad m_y = \frac{1}{N} \sum_{k=1}^{N} y_k,
\sigma_x^2 = \frac{1}{N} \sum_{k=1}^{N} (x_k - m_x)^2, \qquad \sigma_y^2 = \frac{1}{N} \sum_{k=1}^{N} (y_k - m_y)^2

to satisfy the Cauchy-Schwarz Inequality: |rxy |2 ≤ 1. If the inequality is satisfied with


equality we talk about perfectly correlated variables. This includes not only the special case
xk = yk but also xk = −yk .
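A short NumPy sketch (the synthetic data below are arbitrary) computes this time-series correlation coefficient and confirms that Cauchy-Schwarz keeps it inside [−1, 1]:

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
y = 0.8 * x + 0.6 * rng.standard_normal(1000)    # partially correlated with x

mx, my = x.mean(), y.mean()
r = np.mean((x - mx) * (y - my)) / (x.std() * y.std())
print(r, abs(r) <= 1)                             # |r| <= 1 by Cauchy-Schwarz
print(np.corrcoef(x, y)[0, 1])                    # numpy's estimate agrees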

Example 2.60 The Matched Filter


A typical problem in transmission is that a known signal form is being corrupted during
transmission and the receiver has to detect which of the expected symbol forms is most likely
being transmitted. The problem is illustrated in Figure 2.1711 . For a proper formulation of
the problem, we need, however, to include a random variable. Let us assume we transmit the
11
The solution of this problem goes back to Dwight O. North, 1943.


Figure 2.17: Matched filter problem: How to design g(t) so that the SNR at its output is
maximal?

symbols ak s(t − kT ) periodically every T with a fixed symbol shape s(t) and an information
carrying sequence ak ∈ (−1,1). At the input of the receiver filter we observe the signal

r(t) = ak s(t) + v(t).

If we assume that the random variables ak at time instant k are zero mean, we find
E[|ak |2 ] = 1. We further assume for the additive white noise E[v(t)v∗ (t+τ )] = σv2 δ(τ ). The
question is now which impulse response g(t) maximizes the signal-to-noise ratio (SNR). The signal component in the absence of noise is given by a_k \int g(t-\tau) s(\tau)\, d\tau; the noise component in the absence of the signal, on the other hand, is given by \int g(t-\tau) v(\tau)\, d\tau.
If both signal energies are compared, we obtain the SNR:

\mathrm{SNR} = \frac{|\langle g(t-\tau), s(\tau)\rangle|^2}{\sigma_v^2 \langle g(\tau), g(\tau)\rangle}.

In order to maximize the SNR, we have to find the right filter g(\tau):

\max_{\langle g,g\rangle = 1} \mathrm{SNR} = \max_{\langle g,g\rangle = 1} \frac{|\langle g(t-\tau), s(\tau)\rangle|^2}{\sigma_v^2 \langle g(\tau), g(\tau)\rangle} \le \frac{1}{\sigma_v^2} \langle g(t-\tau), g(t-\tau)\rangle \langle s(\tau), s(\tau)\rangle.

In the last step we augmented the term hs(τ ),s(τ )i which we assume to be constant, e.g.,
hs(τ ),s(τ )i = 1. The SNR is thus maximized if and only if we select the filter g(t − τ ) =
αs(τ ). In this case we find that the maximum SNR is given by 1/σv2 .
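A discrete-time numerical sketch of this result (not from the original text; the symbol shape, noise level, and the alternative filter are arbitrary assumptions) compares the output SNR of the matched filter g = s with that of another unit-norm filter:

import numpy as np

rng = np.random.default_rng(1)
s = np.array([1., 1., -1., 1., -1.])
s = s / np.linalg.norm(s)                        # <s,s> = 1
sigma_v = 0.5

def output_snr(g, trials=20000):
    sig = np.vdot(g, s)                          # deterministic signal part
    noise = np.array([np.vdot(g, sigma_v * rng.standard_normal(len(s)))
                      for _ in range(trials)])
    return abs(sig)**2 / np.mean(np.abs(noise)**2)

g_matched = s.copy()                             # matched filter g = alpha * s
g_other   = np.array([1., 0., 0., 0., 0.])       # some other unit-norm filter
print(output_snr(g_matched), 1 / sigma_v**2)     # close to the bound 1/sigma_v^2
print(output_snr(g_other))                       # strictly smaller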

Example 2.61 The Correlation Filter


An infinitely long symbol sequence rk ∈ Cl including periodically unique L symbol long
synchronization words is transmitted for the purpose of synchronization at the beginning of

a TDMA frame, i.e., to find the beginning of the frame. Design an optimal synchronization
filter with impulse response fk ; k = 0,1,...,L − 1 that ensures maximally good detection.
In such a TDMA transmission a unique sequence sk is periodically transmitted followed
by (random) data symbols, say ik as depicted in Figure 2.18. We expect that the probability

s Infodata ik s

Figure 2.18: Correlator problem: How to optimally detect the unique sequences sk ?

of sk being a part of ik is zero or at least sufficiently small. We thus have to distinguish sk from all other possible symbol combinations. The solution of this problem is a so-called correlation filter with impulse response sk. Let rk be the receiving sequence comprising sk and ik, corrupted by additive noise. The outcome of such a filter is:

d_k = \sum_{l=0}^{L-1} r_{k+l} f_l^* = \langle r_k, f_k\rangle.

Due to Cauchy-Schwarz, we know that

| hrk ,fk i |2 ≤ hrk ,rk i hfk ,fk i

and
max | hrk ,fk i |2 = hrk ,rk i hfk ,fk i
fk

obtained for fk = αsk . The filter maximizes the signal-to-interference ratio (SIR). Even in
case of a received sequence rk that is distorted by noise, the correlation filter provides the
best choice in terms of signal-to-interference-and-noise ratio (SINR).

Example 2.62 The Time-Frequency Uncertainty


A relatively difficult relation is the so-called time-frequency uncertainty which goes back
to the work of Werner Karl Heisenberg (1901-1976) and Herman Klaus Hugo Weyl (1885-
1955). Loosely speaking, it states that a signal that is localized (concentrated) in the time
domain cannot be localized in the frequency domain and vice versa. More precise, given a

function in time f(t) and its Fourier transform F(j\omega), then12:

\sqrt{ \frac{\int_{-\infty}^{\infty} f^2(t)\, t^2\, dt}{\int_{-\infty}^{\infty} f^2(t)\, dt} } \cdot \sqrt{ \frac{\int_{-\infty}^{\infty} |F(j\omega)|^2 \omega^2\, d\omega}{\int_{-\infty}^{\infty} |F(j\omega)|^2\, d\omega} } \ge \frac{1}{2}.

In order to show this property we need the following four relations

1. Energy identity in time and frequency domain. In the literature this mostly refers to Plancherel's theorem (and not Parseval's, which is valid for Fourier series):
\int_{-\infty}^{\infty} f^2(t)\, dt = \frac{1}{2\pi} \int_{-\infty}^{\infty} |F(j\omega)|^2\, d\omega.

2. Differentiation in the time domain results in multiplication by (j\omega) in the frequency domain:
\mathcal{F}\left\{ \frac{d^n f(t)}{dt^n} \right\} = (j\omega)^n F(j\omega).

3. Differentiation in the frequency domain results in multiplication by (-jt) in the time domain:
\mathcal{F}\{ (-jt)^n f(t) \} = \frac{d^n F(j\omega)}{d\omega^n}.

4. Partial integration rule of integrals, with the Dirichlet conditions that the function vanishes at the limits, \lim_{t\to\infty} f(t) = 0 and \lim_{t\to-\infty} f(t) = 0:
-2 \int_{-\infty}^{\infty} t f(t) f'(t)\, dt = \int_{-\infty}^{\infty} f^2(t)\, dt,
where we denote the derivative of f(t) by f'(t).

We now apply the Cauchy-Schwarz inequality for functions with f(t) → t f(t) and g(t) → f'(t) and obtain
\left( \int_{-\infty}^{\infty} t f(t) f'(t)\, dt \right)^2 \le \int_{-\infty}^{\infty} t^2 f^2(t)\, dt \int_{-\infty}^{\infty} f'^2(t)\, dt.

Applying property 4, we find

\left( \frac{1}{2} \int_{-\infty}^{\infty} f^2(t)\, dt \right)^2 \le \int_{-\infty}^{\infty} t^2 f^2(t)\, dt \int_{-\infty}^{\infty} f'^2(t)\, dt,

12 We assume here that the average time is zero, i.e., \int f^2(t)\, t\, dt = 0, and the same goes for the frequency: \int A^2(\omega)\, \omega\, d\omega = 0.

or equivalently,
\frac{\int_{-\infty}^{\infty} f^2(t)\, t^2\, dt}{\int_{-\infty}^{\infty} f^2(t)\, dt} \cdot \frac{\int_{-\infty}^{\infty} f'^2(t)\, dt}{\int_{-\infty}^{\infty} f^2(t)\, dt} \ge \frac{1}{4}.
Applying properties 1 and 2 leads to
\frac{\int_{-\infty}^{\infty} f^2(t)\, t^2\, dt}{\int_{-\infty}^{\infty} f^2(t)\, dt} \cdot \frac{\frac{1}{2\pi}\int_{-\infty}^{\infty} |F(j\omega)|^2 \omega^2\, d\omega}{\frac{1}{2\pi}\int_{-\infty}^{\infty} |F(j\omega)|^2\, d\omega} \ge \frac{1}{4}.

Applying the square root on both sides of the inequality returns the desired relation. The Cauchy-Schwarz inequality also offers us an answer to the question for which function f(t) equality is obtained. The answer is: for those functions that satisfy t f(t) = \alpha f'(t). The only functions that satisfy this relation are of the form f(t) = \exp(-\alpha t^2). Thus the Gaussian bell curve is the only function that is as localized in time as in frequency.
A straightforward result for time series is not so easy to derive. We refer the interested
reader to the formulation in [6].

2.5 Exercises
Exercise 2.1 Test the following definitions whether they define a metric for vectors with
a finite number of elements xi ,yi :

1. da (x,y) = |xi − yi |,

2. db (x,y) = ||xi | − |yi ||,

3. dc (x,y) = |x2i − yi2 |,

4. dd (x,y) = |x3i − yi3 |,


5. d_e(x,y) = \sum_{i=1}^{n} \frac{1}{m^i} \frac{|x_i - y_i|}{1 + |x_i - y_i|} with m ∈ N, m > 1,

6. d_f(x,y) = \left( \frac{1}{n} \sum_{i=1}^{n} \left( \frac{1}{x_i - y_i} \right)^2 \right)^{-1}.

Exercise 2.2 Consider the sequence


x_{n+1} = x_n + \frac{1}{(n+1)!}, \quad n = 0,1,2,...

for x0 = 1.

1. Find an upper and a lower bound for the sequence.



2. Show that the sequence is a Cauchy sequence.


3. Show that limn→∞ xn = e (Euler constant).
4. What is the result if x0 = 0?
Exercise 2.3 Consider tr(A), i.e., the sum of the main diagonal elements of a matrix A.
Answer the following questions:
1. Does tr(A^H A) induce a norm on matrix A?
2. Does tr(A^H A) define a norm on matrix A?
3. Does \sqrt{tr(A^H A)} define a norm on matrix A?
Exercise 2.4 A norm k.k on C l m×n is unitarily invariant if kU M V k = kM k for any
m×n
matrix M ∈ C l and U and V are unitary matrices of corresponding dimensions, i.e.,
U H U = Im , V H V = In . Show that the Frobenius norm is unitarily invariant.
Exercise 2.5 Prove the following inequality for sequences {a_k, b_k, c_k} ∈ IR:

\left( \sum_{k=1}^{n} a_k b_k c_k \right)^2 \le \sum_{k=1}^{n} a_k^2 \sum_{k=1}^{n} b_k^2 \sum_{k=1}^{n} c_k^2.

Exercise 2.6 Prove the parallelogram law:

\|x - y\|_2^2 + \|x + y\|_2^2 = 2\left(\|x\|_2^2 + \|y\|_2^2\right).
Exercise 2.7 Consider the first two polynomials P0 (t) = 1 and P1 (t) = t.
1. Given an l2 −norm in the interval [−1,1], show that the two polynomials are orthogo-
nal.
2. Make both polynomials orthonormal.
3. Use Gram-Schmidt to derive the next two polynomials P2(t) and P3(t) that are orthogonal to all previous ones and normalized.
Exercise 2.8 Consider the time-frequency uncertainty for arbitrary signals, that is, for those whose mean time and mean frequency do not equal zero:

\bar{t} = \int_{-\infty}^{\infty} f^2(t)\, t\, dt \ne 0,
\bar{\omega} = \int_{-\infty}^{\infty} A^2(\omega)\, \omega\, d\omega \ne 0.

Show that under these conditions the following uncertainty holds:

\sqrt{ \frac{\int_{-\infty}^{\infty} f^2(t)(t - \bar{t})^2\, dt}{\int_{-\infty}^{\infty} f^2(t)\, dt} } \cdot \sqrt{ \frac{\int_{-\infty}^{\infty} |F(j\omega)|^2 (\omega - \bar{\omega})^2\, d\omega}{\int_{-\infty}^{\infty} |F(j\omega)|^2\, d\omega} } \ge \frac{1}{2}.

Chapter 3
Chapter 3

Representation and Approximation in Vector Spaces

In the following we consider the so-called approximation problem. Consider the following
application: let x be a signal to transmit. Rather than transmitting x directly, we transmit
an approximation of x. This approximation goes in terms of a few coefficients which carry
much less data than the original vector x. If the number of coefficients is much smaller
than the entries of the vector, we reduce the amount of data that is stored or transmitted.
The question remains how this is possible.

3.1 Least Squares Method


To understand this, consider the following approximation problem:
Let (S, \|\cdot\|) be a linear, normed vector space and T = {p_1, p_2, ..., p_m} a subset of linearly independent vectors from S, and V = span(T). Given a vector x from S, find the coefficients c_i such that

\hat{x} = c_1 p_1 + c_2 p_2 + ... + c_m p_m = \sum_{i=1}^{m} c_i p_i = [p_1, p_2, ..., p_m]\, [c_1, c_2, ..., c_m]^T

approximates x in the best sense by a linear combination, thus the error vector e

e = x − x̂

becomes minimal.
In order to minimize e, it is of advantage to introduce a norm. If an l1- or l∞-norm were taken, the problem would become mathematically very difficult to treat. However,

utilizing the induced l2 -norm, we typically obtain quadratic equations, solvable by simple
derivatives. Later, we will also introduce iterative LS methods that can be used to solve
problems in other norms, see Section 3.4. Note that if x is in V , then the error can
become zero. However, if x is not in V , it is only possible to find a very small value for ||e||2 .

The vectors in T allow us to find an approximate solution with a very small error \|e\|_2. The receiver knows T. We thus only need to transmit the m coefficients c_i. If the number m of coefficients is much smaller than the number of samples in x, we obtain a considerable data reduction. The price for this is a representation of x that is not exactly x. The quality measure of our approximation is related to the remaining energy in the error vector e. As our hearing and seeing is not perfect, it is sufficient to describe audio and video signals only as precisely as our hearing and seeing works. These principles define the quality of audio and video coders as they are used today in our cell phones and video cameras.
Now that we have a proper description of the problem
\min_{c_1, c_2, ..., c_m} \left\| x - \sum_{i=1}^{m} c_i p_i \right\|_2^2,

we can try to solve it. In order to visualize the problem, let us first consider a single vector
in T = {p_1} with real-valued entries. We thus have e = x - c_1 p_1 and have to minimize \|e\|_2^2:

\min_{c_1} \|e\|_2^2 = \min_{c_1} \|x - c_1 p_1\|_2^2 = \min_{c_1} (x - c_1 p_1)^T (x - c_1 p_1).

\frac{\partial}{\partial c_1} (x - c_1 p_1)^T (x - c_1 p_1) = \frac{\partial}{\partial c_1} \left[ x^T x - c_1 p_1^T x - c_1 x^T p_1 + c_1^2 p_1^T p_1 \right] = -p_1^T x - x^T p_1 + 2 c_1 p_1^T p_1 = 0.

We can solve this last equation with respect to c1 and obtain:


c_{LS,1} = \frac{x^T p_1}{p_1^T p_1} = \frac{\langle x, p_1\rangle}{\|p_1\|_2^2}.

As we minimized the squared l2-norm, we call this method the Least-Squares (LS) method.1

Geometric Interpretation: See Figure 3.1 for an illustration. The so obtained minimal error e_LS = x - c_{LS,1} p_1 stands orthogonal (perpendicular) to the given direction p_1: \langle e_{LS}, p_1\rangle = 0.

1
The method goes back to its inventor Johann Carl Friedrich Gauß (30.4.1777-23.2.1855) who developed
the method to recover the return of asteroid Ceres in 1800.
Figure 3.1: Least squares. Upper row: the right amount c_1 p_1 leads to the closest point to x along the direction of p_1. Lower row: drawing circles with x as their center, the first circle that hits the line along p_1 has radius \|e_{LS}\|.

Note, however, that the problem is not restricted to vectors. All objects of the linear vector space are possible. We can thus write this more generally in terms of inner products of x and p_1 and use the vector notation only if vectors are explicitly meant:

\frac{\partial}{\partial c_1} \|x - c_1 p_1\|_2^2 = \frac{\partial}{\partial c_1} \langle x - c_1 p_1, x - c_1 p_1 \rangle = \langle -p_1, x - c_1 p_1 \rangle + \langle x - c_1 p_1, -p_1 \rangle = 0,

where x and p_1 are of identical dimension. They can be vectors, matrices, functions, series and many more. We obtain again

c_{LS,1} = \frac{\langle x, p_1 \rangle}{\|p_1\|_2^2}.

Let us check whether the orthogonality property that we obtained from the geometric interpretation also holds in general. For this we compute the inner product of e_{LS} and p_1:

\langle e_{LS}, p_1 \rangle = \langle x - c_{LS,1} p_1, p_1 \rangle = \langle x, p_1 \rangle - c_{LS,1} \langle p_1, p_1 \rangle = \langle x, p_1 \rangle - \frac{\langle x, p_1 \rangle}{\|p_1\|_2^2} \langle p_1, p_1 \rangle = \langle x, p_1 \rangle - \langle x, p_1 \rangle = 0.

Furthermore, we can conclude for the minimal error energy

\|e_{LS}\|_2^2 = \langle e_{LS}, x - c_{LS,1} p_1 \rangle = \langle e_{LS}, x \rangle - c_{LS,1} \langle e_{LS}, p_1 \rangle = \langle e_{LS}, x \rangle.

The inner product of the original x and the minimal error is identical to the error energy.

What has worked well for a single component p_1 should also work for a linear combination of m components, based on a set \{p_1, p_2, \dots, p_m\} of m terms. We thus redo the calculation, now with m components. We consider the optimal weight at position k, that is:

\frac{\partial}{\partial c_k} \|e\|_2^2 = \left\langle \frac{\partial}{\partial c_k} e, e \right\rangle + \left\langle e, \frac{\partial}{\partial c_k} e \right\rangle = 0.

If we follow the same procedure as before, we recognize that

\left\langle \frac{\partial}{\partial c_k} e, e \right\rangle = -\langle p_k, e \rangle

as well as

\left\langle e, \frac{\partial}{\partial c_k} e \right\rangle = -\langle e, p_k \rangle.

This leads to the condition

\langle p_k, e_{LS} \rangle + \langle e_{LS}, p_k \rangle = 0.

Replacing e_{LS} = x - \sum_{i=1}^{m} c_{LS,i} p_i we find

\left\langle p_k, x - \sum_{i=1}^{m} c_{LS,i} p_i \right\rangle + \left\langle x - \sum_{i=1}^{m} c_{LS,i} p_i, p_k \right\rangle = 0.

The condition is satisfied if and only if one of the inner products is zero (as the other is its complex conjugate) and we obtain

\left\langle x - \sum_{i=1}^{m} c_{LS,i} p_i, p_k \right\rangle = 0, \qquad k = 1, 2, \dots, m,

or, equivalently,

\langle x, p_k \rangle = \left\langle \sum_{i=1}^{m} c_{LS,i} p_i, p_k \right\rangle = \sum_{i=1}^{m} c_{LS,i} \langle p_i, p_k \rangle, \qquad k = 1, 2, \dots, m.

These are indeed m linear equations for k = 1, 2, \dots, m. If we arrange them line by line, we obtain the following set of equations:

\begin{bmatrix} \langle p_1, p_1 \rangle & \langle p_1, p_2 \rangle & \dots & \langle p_1, p_m \rangle \\ \langle p_2, p_1 \rangle & \langle p_2, p_2 \rangle & \dots & \langle p_2, p_m \rangle \\ \vdots & & & \vdots \\ \langle p_m, p_1 \rangle & \langle p_m, p_2 \rangle & \dots & \langle p_m, p_m \rangle \end{bmatrix} \begin{bmatrix} c_{LS,1} \\ c_{LS,2} \\ \vdots \\ c_{LS,m} \end{bmatrix} = \begin{bmatrix} \langle x, p_1 \rangle \\ \langle x, p_2 \rangle \\ \vdots \\ \langle x, p_m \rangle \end{bmatrix}

or, in short, R c_{LS} = p. The solution of such a matrix equation is called the (linear) Least-Squares solution. Whether such a matrix equation has a unique solution depends solely on the matrix R: R needs to be positive-definite in order to obtain a unique solution.
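As a small, hedged illustration (the basis vectors and the signal below are arbitrary example data), the normal equations R c_{LS} = p can be set up and solved numerically as follows:

import numpy as np

# columns of P are the basis vectors p_1, ..., p_m (example data)
P = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])
x = np.array([1.0, 2.0, 3.0])

R = P.T @ P          # Gramian of the basis vectors (real-valued case)
p = P.T @ x          # right-hand side: inner products <x, p_i>
c_ls = np.linalg.solve(R, p)

x_hat = P @ c_ls     # approximation of x in span(T)
e_ls  = x - x_hat
print(P.T @ e_ls)    # ~[0, 0]: the error is orthogonal to all basis vectors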

3.1.1 The Gramian Matrix

Definition 3.1 (Gramian) An m \times m matrix R built from the inner products of the p_i; i = 1, 2, \dots, m from T is called the Gramian (Ger.: Gramsche Matrix) of the set T:

R = \begin{bmatrix} \langle p_1, p_1 \rangle & \langle p_1, p_2 \rangle & \dots & \langle p_1, p_m \rangle \\ \langle p_2, p_1 \rangle & \langle p_2, p_2 \rangle & \dots & \langle p_2, p_m \rangle \\ \vdots & & & \vdots \\ \langle p_m, p_1 \rangle & \langle p_m, p_2 \rangle & \dots & \langle p_m, p_m \rangle \end{bmatrix}.

Due to its construction rule, a Gramian^2 matrix is always Hermitian, that is, R = R^H.

Definition 3.2 (positive definite) A matrix R is called positive-definite if for arbitrary vectors q unequal to zero:

q^H R q > 0.

Theorem 3.1 The Gramian matrix R is positive semi-definite. It is positive-definite if and only if the elements p_1, p_2, \dots, p_m are linearly independent.

Proof:
Let q^T = [q_1, q_2, \dots, q_m] be an arbitrary vector. Then

q^H R q = \sum_{i=1}^{m} \sum_{j=1}^{m} q_i^* q_j R_{ij} = \sum_{i=1}^{m} \sum_{j=1}^{m} q_i^* q_j \langle p_j, p_i \rangle = \left\langle \sum_{j=1}^{m} q_j p_j, \sum_{i=1}^{m} q_i p_i \right\rangle = \left\| \sum_{i=1}^{m} q_i p_i \right\|_2^2 \geq 0.

Conversely, if R is not positive-definite, then a vector q (unequal to the zero vector) must exist such that

q^H R q = 0.

^2 Jørgen Pedersen Gram (27.6.1850-29.4.1916) was a Danish mathematician.

Thus also

\left\| \sum_{i=1}^{m} q_i p_i \right\|_2^2 = 0

and

\sum_{i=1}^{m} q_i p_i = 0,

which means the components p_1, p_2, \dots, p_m are linearly dependent.

Note that such a method based on the matrix inverse of the Gramian entails a large complexity. The complexity can be reduced considerably if the vectors p_1, p_2, \dots, p_m are chosen orthogonal (orthonormal). In this case the Gramian becomes a diagonal (identity) matrix. We will therefore later concentrate on the search for orthonormal bases.

3.1.2 The Orthogonality Property of LS

In our first simple example we recognized that the Least-Squares error e_{LS} was perpendicular to the chosen direction p_1. We now ask whether this is still correct in the general case.

Theorem 3.2 Let (S, \|\cdot\|) be a linear, normed vector space, T = \{p_1, p_2, \dots, p_m\} a subset of linearly independent vectors from S, and V = \mathrm{span}(T). Given an element x from S, the coefficients c_1, \dots, c_m minimize the error e = x - \hat{x} in the induced l_2-norm of the linear combination

\hat{x} = c_1 p_1 + c_2 p_2 + \dots + c_m p_m

if and only if the error e is orthogonal to all elements in T.

Proof:
We show the orthogonality by substitution:

\langle e_{LS}, p_j \rangle = \left\langle x - \sum_{i=1}^{m} c_{LS,i} p_i, p_j \right\rangle = \langle x, p_j \rangle - \left\langle \sum_{i=1}^{m} c_{LS,i} p_i, p_j \right\rangle = 0, \qquad j = 1, 2, \dots, m.

This results in one equation for every index j = 1, 2, \dots, m. If we combine all m equations in vector form, we obtain

p - R c_{LS} = 0,

which is identical to what we obtained by minimizing the squared error norm. Both conditions lead to the same result.

Note: since e_{LS} is orthogonal to every component p_j, e_{LS} must also be orthogonal to the estimate:

\langle e_{LS}, \hat{x} \rangle = \left\langle e_{LS}, \sum_{i=1}^{m} c_{LS,i} p_i \right\rangle = 0.

In general we find

\langle e_{LS}, \hat{x} \rangle = 0, \qquad \langle e_{LS}, x \rangle = \langle e_{LS}, e_{LS} \rangle \geq 0.

The latter is zero if and only if x \in \mathrm{span}\{T\}.
Example 3.1 A nonlinear system f(x) is excited harmonically. Which amplitudes do the harmonics have? One possibility to solve this problem is to approximate the nonlinear system by polynomials. For each polynomial term the harmonics can be pre-computed, and the summation of all terms then yields the desired solution. For high-order polynomials this can become very tedious. An alternative is to assume the output in the form

f(\sin(x)) = a_0 + a_1 \sin(x) + a_2 \sin(2x) + \dots + b_1 \cos(x) + b_2 \cos(2x) + \dots - e(x).

We then approximate f(\sin(x)) by the coefficients \hat{a}_0, \hat{a}_1, \dots, \hat{b}_1, \dots and find

e(x) = f(\sin(x)) - \left( \hat{a}_0 + \hat{a}_1 \sin(x) + \hat{a}_2 \sin(2x) + \dots + \hat{b}_1 \cos(x) + \hat{b}_2 \cos(2x) + \dots \right).

Since the functions \{\sin(nx), \cos(nx)\} build an orthogonal basis, the coefficients are readily computed by LS methods.
In general, the set of linear equations R c_{LS} = p can be solved very easily if the basis vectors are orthogonal. The Gramian is then a diagonal matrix, say D, and the desired solution is c_{LS} = D^{-1} p, which only requires m divisions. The situation is even better if all vectors are orthonormal: then R = I and c_{LS} = p, so we only have to compute the right-hand side.

3.1.3 Gradient Methods

In some applications we cannot design the basis vectors ourselves. Then R can be an arbitrary Hermitian matrix and the computation of its inverse can be challenging. A numerically safe method is the so-called gradient descent method. Here the gradient of the quadratic error term is computed and the initial solution is improved iteratively until finally the LS solution is obtained. Given the initial value c_0, the steepest descent algorithm is given by

c_{k+1} = c_k + \mu \left( p - R c_k \right); \qquad k = 0, 1, \dots \qquad (3.1)

By selecting a proper step-size the algorithm can converge faster. In general there are two opposing effects: a too small step-size causes the convergence to become too slow, while a too large step-size may cause divergence.

Example 3.2 The matrix equation R c = p with the positive-definite matrix R is to be solved:

R c = p: \qquad \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix} c = \begin{bmatrix} 2 \\ 1 \end{bmatrix} \;\Rightarrow\; c = \begin{bmatrix} 1 \\ 0 \end{bmatrix}.

We start with the initial value c_0 = 0 and a step-size of \mu = 0.3 and obtain in the first iteration:

c_1 = 0 + \mu (p - 0) = \mu p = 0.3 \begin{bmatrix} 2 \\ 1 \end{bmatrix} = \begin{bmatrix} 0.6 \\ 0.3 \end{bmatrix}.

The second iteration reads:

c_2 = c_1 + \mu (p - R c_1) = \begin{bmatrix} 0.6 \\ 0.3 \end{bmatrix} + 0.3 \left( \begin{bmatrix} 2 \\ 1 \end{bmatrix} - \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix} \begin{bmatrix} 0.6 \\ 0.3 \end{bmatrix} \right) = \begin{bmatrix} 0.75 \\ 0.15 \end{bmatrix}.

Figure 3.2 depicts how the iteration continues. The two components of c_k finally reach their destination [1, 0]^T.
Figure 3.2: Iterative solution: after a few iterations the two components c(1) and c(2) approach their final destination.
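A minimal NumPy sketch of the steepest descent iteration (3.1), reproducing Example 3.2 (step-size and number of iterations are example choices):

import numpy as np

R = np.array([[2.0, 1.0],
              [1.0, 3.0]])
p = np.array([2.0, 1.0])

mu = 0.3
c = np.zeros(2)                 # initial value c_0 = 0
for k in range(20):
    c = c + mu * (p - R @ c)    # steepest descent step

print(c)                        # converges towards [1, 0]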

3.1.4 Pseudoinverses

If we arrange all vectors p_j; j = 1, 2, \dots, m in a matrix A:

A = [p_1, p_2, \dots, p_m],

we can also formulate the LS equations as

A^H A c_{LS} = A^H x

and recognize that A^H A = R resembles the Gramian while A^H x = p is the right-hand side. Inverting the Gramian results in

c_{LS} = [A^H A]^{-1} A^H x. \qquad (3.2)

The matrix B_l = [A^H A]^{-1} A^H obtained in this way is called the (left) pseudoinverse of A, as multiplying A from the left by B_l results in the identity. In general there are two possibilities for constructing pseudoinverses.^3

Definition 3.3 (Pseudoinverse) The matrix B_l = (A^H A)^{-1} A^H is called the left pseudoinverse of A. The matrix B_r = A^H (A A^H)^{-1} is called the right pseudoinverse of A.

The right pseudoinverse has its importance in the context of underdetermined LS problems. We will discuss them later. Note also that Eq. (3.2) shows that there is a linear mapping (operator) that translates the observation x into the desired coefficients. Once A is known, B_l maps the observation x to the LS coefficients c_{LS}.
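As a hedged numerical illustration (example data only), the left pseudoinverse can be formed explicitly or obtained through standard library routines:

import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])
x = np.array([1.0, 2.0, 3.0])

B_l  = np.linalg.inv(A.T @ A) @ A.T      # left pseudoinverse (real-valued case)
print(B_l @ A)                           # identity matrix
c_ls = B_l @ x                           # LS coefficients

# the same coefficients via the numerically preferable routines
print(np.linalg.lstsq(A, x, rcond=None)[0])
print(np.linalg.pinv(A) @ x)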

3.2 Projection Operators

If we plug in the LS coefficients one more time, we find that

\hat{x} = A (A^H A)^{-1} A^H x,

that is, the estimate or approximation \hat{x} is also a linear mapping of the observation. This mapping, defined by A (A^H A)^{-1} A^H, is very particular in its properties.

Definition 3.4 (Projection) A linear mapping P of a linear vector space onto itself is called a projection if P = P^2. Such an operator is called idempotent.

Lemma 3.1 (Projection Matrix) If P is a projection matrix, then I - P is also a projection matrix.

Proof:

(I - P)^2 = I - 2P + P^2 = I - P.

^3 The idea of pseudoinverses was developed independently by E. H. Moore in 1920 and Roger Penrose in 1955. They are often called the Moore-Penrose inverses of a matrix.

We thus recognize that A (A^H A)^{-1} A^H is indeed a projection matrix. If A (A^H A)^{-1} A^H maps x to its approximation \hat{x}, what does the corresponding projection I - A (A^H A)^{-1} A^H do? We find the two mappings:

\hat{x} = A (A^H A)^{-1} A^H x, \qquad (3.3)
e_{LS} = \left( I - A (A^H A)^{-1} A^H \right) x. \qquad (3.4)

While \hat{x} lies in the column space of A (it is a linear combination of the column vectors of A), the LS error e_{LS} does not. It is orthogonal to the approximation, which we can easily verify:

\langle \hat{x}, e_{LS} \rangle = \left\langle A (A^H A)^{-1} A^H x, \left( I - A (A^H A)^{-1} A^H \right) x \right\rangle = 0.

Let us define a subspace V = \mathrm{span}(A) and, correspondingly, a subspace W \perp V that is perpendicular to V. Then \hat{x} lies in V and e_{LS} lies in W. A general vector x will lie neither in V nor in W. Note, however, that due to the projection property we can now separate a given vector x into two orthogonal parts \hat{x} and e_{LS} that complement each other: x = \hat{x} + e_{LS}. This is illustrated in Figure 3.3.
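A short NumPy check of these projection properties (same example matrix as above, chosen only for illustration):

import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])
x = np.array([1.0, 2.0, 3.0])

P = A @ np.linalg.inv(A.T @ A) @ A.T   # orthogonal projection onto span(A)
print(np.allclose(P @ P, P))           # idempotent: P^2 = P
x_hat = P @ x
e_ls  = (np.eye(3) - P) @ x
print(np.dot(x_hat, e_ls))             # ~0: orthogonal decomposition x = x_hat + e_ls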

Figure 3.3: LS in the form of projections. A vector x = v + w with v \in V and w \in W is split by the two projections: A(A^H A)^{-1} A^H x = v and (I - A(A^H A)^{-1} A^H) x = w, so that S = V + W.

In principle we can distinguish between two different kinds of projections, so-called orthogonal projections and non-orthogonal or oblique projections. Orthogonal projections are of the form A (A^H A)^{-1} A^H that we just considered. But the form A (B^H A)^{-1} B^H is also idempotent and thus a projection; projections of this latter type are called oblique projections. Orthogonal projections are self-adjoint, P[.] = P^*[.] (for notation and more details see Section 4.1.3 ahead); oblique projections are not. The formal definitions will follow in the next chapter.

An interesting algorithm in terms of projections was introduced in 1933 by von Neumann^4. Consider first the following property when a projection P is applied to a vector f:

0 \leq \langle f, P f \rangle = \langle P f, P f \rangle = \langle P f, f \rangle = \varepsilon \, \langle f, f \rangle \leq \langle f, f \rangle, \qquad 0 \leq \varepsilon \leq 1.

Relaxed Projection Mapping (RPM) Algorithm:

f_{k+1} = f_k + \mu (P f_k - f_k),

for a step-size 0 < \mu < 2. The algorithm converges to a fixed point f^* = P f^*.

Convergence proof:
We consider the error term \tilde{f}_k = f_k - f^* and analyze its evolution from time instant k to k+1 to find under which conditions the term goes to zero:

f_{k+1} - f^* = f_k - f^* + \mu (P f_k - P f^* + f^* - f_k), \qquad \text{i.e.,} \qquad \tilde{f}_{k+1} = \tilde{f}_k + \mu (P \tilde{f}_k - \tilde{f}_k).

Computing the squared l_2-norm on both sides, we find

\|\tilde{f}_{k+1}\|_2^2 = (1-\mu)^2 \|\tilde{f}_k\|_2^2 + \left( \mu^2 + 2\mu(1-\mu) \right) \|P \tilde{f}_k\|_2^2 = \left[ (1-\mu)^2 + \left( \mu^2 + 2\mu(1-\mu) \right) \varepsilon \right] \|\tilde{f}_k\|_2^2.

For step-sizes 0 < \mu < 2 it can be shown that the factor in front of the norm is strictly bounded between \varepsilon and one, thus the energy term \|\tilde{f}_k\|_2^2 decays with every iteration and must finally end up at zero. If the norm is zero, the vector is zero, and thus \lim_{k\to\infty} f_k = f^*. Figure 3.4 illustrates this behavior for two values of \varepsilon. The algorithm has recently (around 2010) seen a renaissance and many interesting generalizations have been developed. Many learning algorithms (machine learning) can be interpreted as relaxed projections.
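A minimal sketch of the RPM iteration for an orthogonal projection (projection matrix and starting point are example choices):

import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])
P = A @ np.linalg.inv(A.T @ A) @ A.T   # projection onto span(A)

mu = 1.5                               # any step-size in (0, 2)
f  = np.array([1.0, -2.0, 4.0])        # arbitrary starting vector
for k in range(50):
    f = f + mu * (P @ f - f)           # relaxed projection step

print(np.allclose(P @ f, f))           # True: f has converged to a fixed point of P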

3.3 Applications of Least Squares

In the following we demonstrate the LS method on various applications.

^4 János Neumann Margittai (born 28.6.1903 in Budapest as János Lajos Neumann, died 8.2.1957).

[(1
− µ ) + (µ + 2 µ (1 − µ ) )ε ]
2
 
2

ε ≤...≤1 for 0≤ µ ≤ 2 1

= (1 − ε )(1 − µ ) + ε 2
0.9

0.8
ε=0.9
0.7 ε=0.1

0.6

0.5

0.4

0.3

0.2

0.1
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2
µ

Figure 3.4: Convergence of RPM Algorithm.

Example 3.3 Let us consider a polynomial fit: given a function f(x), it is to be fitted optimally by a polynomial p(x) of order m in the interval [a,b]. For this, the following quadratic cost function is selected:

\min_{c_0, c_1, \dots, c_{m-1}} \int_a^b \left( f(x) - c_0 - c_1 x - \dots - c_{m-1} x^{m-1} \right)^2 dx.

Applying LS we find the following linear set of equations in the unknown coefficients:

\begin{bmatrix} \langle 1,1 \rangle & \langle x,1 \rangle & \dots & \langle x^{m-1},1 \rangle \\ \langle 1,x \rangle & \langle x,x \rangle & \dots & \vdots \\ \vdots & & & \vdots \\ \langle 1,x^{m-1} \rangle & \dots & & \langle x^{m-1},x^{m-1} \rangle \end{bmatrix} \begin{bmatrix} c_0 \\ c_1 \\ \vdots \\ c_{m-1} \end{bmatrix} = \begin{bmatrix} \langle f(x),1 \rangle \\ \langle f(x),x \rangle \\ \vdots \\ \langle f(x),x^{m-1} \rangle \end{bmatrix}.

The inner products of the Gramian matrix can be computed explicitly:

\left\langle x^i, x^j \right\rangle = \int_a^b x^{i+j} dx = \frac{b^{i+j+1} - a^{i+j+1}}{i+j+1}.

Normalizing the interval to [0,1], we obtain for the Gramian the so-called Hilbert matrix:

R = \begin{bmatrix} 1 & \frac{1}{2} & \dots & \frac{1}{m} \\ \frac{1}{2} & \frac{1}{3} & \dots & \frac{1}{m+1} \\ \vdots & & & \vdots \\ \frac{1}{m} & \frac{1}{m+1} & \dots & \frac{1}{2m-1} \end{bmatrix}.

It is known that this matrix becomes very poorly conditioned as the order m grows. It is thus difficult to numerically invert the matrix. For this reason, (simple) polynomials are typically not used for approximation problems. For small values of m (m < 4) this effect is not so dramatic.
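The poor conditioning is easy to observe numerically; a small hedged sketch (the orders are chosen arbitrarily):

import numpy as np

for m in (3, 6, 9, 12):
    # Hilbert matrix R[i, j] = 1 / (i + j + 1), i, j = 0, ..., m-1
    i, j = np.indices((m, m))
    R = 1.0 / (i + j + 1)
    print(m, np.linalg.cond(R))   # the condition number grows dramatically with m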

Example 3.4 The function e^x is to be approximated by polynomials. The Taylor series yields: e^x = 1 + x + x^2/2 + \dots. LS on the other hand delivers: 1.013 + 0.8511x + 0.8392x^2 + \dots. Figure 3.5 depicts the approximation error over the interval [0,1]. The Taylor series^5 provides good results only close to its centre point, that is, zero (Maclaurin series). We can thus improve the Taylor series approach by shifting its centre point to the middle of the interval. Still, at the border of the interval the LS approach shows better results. In this application what counts is the largest possible error. If we want the largest error to become minimal, it is not sufficient to minimize the L_2-norm; in this case the L_\infty-norm is required:

\min_{c_0, c_1, \dots, c_{m-1}} \lim_{p \to \infty} \left( \int_a^b \left| f(x) - c_0 - c_1 x - \dots - c_{m-1} x^{m-1} \right|^p dx \right)^{\frac{1}{p}},

which is not an LS problem but can be treated as a so-called weighted LS problem, see Section 3.4.

Example 3.5 Linear Regression. This is probably the most popular application of LS. The intention is to fit a line such that the distance between the observations and their projections onto the line becomes minimal in the quadratic sense. Figure 3.6 illustrates the typical scenario. Given the data pairs (x_i, y_i), how do we find the line with minimum distance? For this we write down a set of equations, one for each data pair:

\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} a x_1 + b \\ a x_2 + b \\ \vdots \\ a x_n + b \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix} = \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix},

^5 After the English mathematician Brook Taylor (18 August 1685 - 29 December 1731) and the Scottish mathematician Colin Maclaurin (February 1698 - 14 June 1746).

Figure 3.5: LS application curve fitting on an exponential function.

or, in short, y = A c + e. We thus obtain the LS solution as

c = (A^H A)^{-1} A^H y.

Unusual here is the fact that the matrix A, out of which we compose the Gramian R = A^H A, itself contains data. It can thus not be precomputed and has to be computed anew for each data set. Note, however, that the Gramian in this case is only a square matrix of order two. Nevertheless, depending on the given data, the problem can be more or less numerically challenging.
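A hedged NumPy sketch of the linear regression solution (the data pairs below are synthetic):

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.9, 3.1, 5.2, 6.8, 9.1])          # noisy observations of a line

A = np.column_stack((x, np.ones_like(x)))        # regression matrix with rows [x_i, 1]
a, b = np.linalg.solve(A.T @ A, A.T @ y)         # slope and intercept
print(a, b)                                      # close to 2 and 1 for this data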

Figure 3.6: LS application curve fitting on data: linear regression.

Example 3.6 Another example in which the observation data influence the Gramian matrix is given next. In order to describe observations in terms of simple and compact key parameters, so-called parametric process models are often applied. A frequently used model is the Auto-Regressive (AR) process, which is built by linear filtering of past values:

x_k = a_1 x_{k-1} + \dots + a_{n_P} x_{k-n_P} + v_k = \sum_{n=1}^{n_P} a_n x_{k-n} + v_k.

The (driving) process v_k is a white random process with unit variance. AR processes are applied to model strong spectral peaks:

x_k = \frac{1}{1 - a_1 q^{-1} - \dots - a_{n_P} q^{-n_P}} \, v_k,

s_{xx}(e^{j\Omega}) = \sum_{l=-\infty}^{\infty} E[x_k x_{k+l}] \, e^{-j\Omega l} = \left| \frac{1}{1 - a_1 e^{-j\Omega} - \dots - a_{n_P} e^{-j\Omega n_P}} \right|^2.

Figure 3.7: Typical short-time spectrum of a human speaker, showing formants at 120, 530 and 3400 Hz.

A typical (short-time) spectrum of human speech is shown in Figure 3.7. Here three resonances of the vocal tract, so-called formants, are visible; the order of the process is thus n_P = 3. Linear prediction of speech signals has already been discussed in Example 1.2,

where we argued that the linear prediction method is not a linear operator. LS methods can be used to estimate the parameters a_1, \dots, a_{n_P} of an AR process. Starting from

x_k = a_1 x_{k-1} + a_2 x_{k-2} + \dots + a_{n_P} x_{k-n_P} + v_k,

we stack M values of the observations in vectors:

\mathbf{x}_k = \begin{bmatrix} x_k \\ \vdots \\ x_{k-M} \end{bmatrix} = a_1 \begin{bmatrix} x_{k-1} \\ \vdots \\ x_{k-M-1} \end{bmatrix} + a_2 \begin{bmatrix} x_{k-2} \\ \vdots \\ x_{k-M-2} \end{bmatrix} + \dots + \begin{bmatrix} v_k \\ \vdots \\ v_{k-M} \end{bmatrix}.

We can thus write more compactly

\mathbf{x}_k = a_1 \mathbf{x}_{k-1} + a_2 \mathbf{x}_{k-2} + \dots + a_{n_P} \mathbf{x}_{k-n_P} + \mathbf{v}_k = [\mathbf{x}_{k-1}, \mathbf{x}_{k-2}, \dots, \mathbf{x}_{k-n_P}] \, a + \mathbf{v}_k = X_{n_P,k} \, a + \mathbf{v}_k.

An estimate for a can be found from the observation \mathbf{x}_k by minimizing the estimation error,

\min_a \|\mathbf{x}_k - X_{n_P,k} \, a\|_2^2,

which is again a standard LS problem. We obtain:

0 = X_{n_P,k}^H \mathbf{x}_k - X_{n_P,k}^H X_{n_P,k} \, a, \qquad (3.5)
a = \left( X_{n_P,k}^H X_{n_P,k} \right)^{-1} X_{n_P,k}^H \mathbf{x}_k, \qquad (3.6)

bringing this back to a Least-Squares problem. We can interpret the matrix X_{n_P,k}^H X_{n_P,k} as an estimate of the ACF matrix and the vector X_{n_P,k}^H \mathbf{x}_k as an estimate of the autocorrelation vector in the Yule-Walker equations. Numerically this problem can be very challenging, as now the statistical properties of the speech signal decide whether the n_P \times n_P system can be solved. The process order n_P in speech applications is typically 8 to 12.

We also understand now why the linear prediction in Example 1.2 is a nonlinear operation: the solution in (3.6) shows that the LS coefficients depend on the speaker. A second speaker exhibits different coefficients for the same speech.
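A hedged NumPy sketch of AR parameter estimation via LS (an AR(2) process with assumed coefficients is simulated first):

import numpy as np

rng = np.random.default_rng(0)
a_true = np.array([1.2, -0.5])            # assumed AR(2) coefficients (stable)
x = np.zeros(1000)
v = rng.standard_normal(1000)
for k in range(2, 1000):
    x[k] = a_true[0]*x[k-1] + a_true[1]*x[k-2] + v[k]

# build the data matrix X and the observation vector, then solve the LS problem
X = np.column_stack((x[1:-1], x[:-2]))    # columns: x_{k-1}, x_{k-2}
xk = x[2:]
a_hat = np.linalg.lstsq(X, xk, rcond=None)[0]
print(a_hat)                              # close to [1.2, -0.5]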

Example 3.7 Channel estimation

A training sequence a_k with L symbols is sent at the beginning of a TDMA slot in order to estimate the channel h_k of length 3 (L > 3). In general we receive

r_k = h_0 a_k + h_1 a_{k-1} + h_2 a_{k-2} + v_k.

Combining several of these for k = 2, 3, \dots, L-1 we can formulate this in vector form:

\begin{bmatrix} r_{L-1} \\ r_{L-2} \\ \vdots \\ r_2 \end{bmatrix} = \begin{bmatrix} h_0 & h_1 & h_2 & & \\ & h_0 & h_1 & h_2 & \\ & & \ddots & \ddots & \ddots \\ & & & h_0 \;\; h_1 \;\; h_2 \end{bmatrix} \begin{bmatrix} a_{L-1} \\ a_{L-2} \\ \vdots \\ a_0 \end{bmatrix} + \begin{bmatrix} v_{L-1} \\ v_{L-2} \\ \vdots \\ v_2 \end{bmatrix} = \begin{bmatrix} a_{L-1} & a_{L-2} & a_{L-3} \\ a_{L-2} & a_{L-3} & a_{L-4} \\ a_{L-3} & a_{L-4} & a_{L-5} \\ \vdots & & \\ a_2 & a_1 & a_0 \end{bmatrix} \begin{bmatrix} h_0 \\ h_1 \\ h_2 \end{bmatrix} + \begin{bmatrix} v_{L-1} \\ v_{L-2} \\ \vdots \\ v_2 \end{bmatrix},

or, in short, r = H a + v = A h + v.

The first form of description is in Toeplitz form, exactly as we learned in Section 2.2.2. From there we know that the training sequence a_k needs to be persistently exciting, here of order three. Note that with this reformulation the channel matrix, which was in Toeplitz form, no longer appears; instead the data matrix appears in Hankel form. With this reformulation the channel can be estimated by the Least-Squares method:

r = A h + v, \qquad \hat{h} = [A^H A]^{-1} A^H r.

Note that the training sequence is already known at the receiver and thus the pseudoinverse [A^H A]^{-1} A^H can be pre-computed. A persistently exciting training sequence a_k guarantees the existence of the pseudoinverse.
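A hedged sketch of LS channel estimation with a Hankel-structured data matrix (training sequence, channel, and noise level are example choices):

import numpy as np

rng = np.random.default_rng(1)
h = np.array([0.8, -0.4, 0.2])                      # unknown channel (3 taps)
a = rng.choice([-1.0, 1.0], size=32)                # known training symbols
r = np.convolve(a, h)[:len(a)] + 0.01*rng.standard_normal(len(a))

# Hankel data matrix: row for time k contains [a_k, a_{k-1}, a_{k-2}]
A = np.column_stack((a[2:], a[1:-1], a[:-2]))
h_hat = np.linalg.lstsq(A, r[2:], rcond=None)[0]
print(h_hat)                                        # close to [0.8, -0.4, 0.2]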

Example 3.8 Iterative Receiver. Consider once again the equivalent description of the previous example. We can now continue after the L training symbols a_0, a_1, \dots, a_{L-1} and use the estimated symbols \tilde{a}_L, \tilde{a}_{L+1}, \dots, \tilde{a}_{2L-1} instead:

\begin{bmatrix} r_{2L-2} \\ r_{2L-3} \\ \vdots \\ r_{L+1} \end{bmatrix} = \begin{bmatrix} h_0 & h_1 & h_2 & & \\ & h_0 & h_1 & h_2 & \\ & & \ddots & \ddots & \ddots \\ & & & h_0 \;\; h_1 \;\; h_2 \end{bmatrix} \begin{bmatrix} \tilde{a}_{2L-1} \\ \tilde{a}_{2L-2} \\ \vdots \\ \tilde{a}_L \end{bmatrix} + \begin{bmatrix} v_{2L-2} \\ v_{2L-3} \\ \vdots \\ v_{L+1} \end{bmatrix} = \begin{bmatrix} \tilde{a}_{2L-1} & \tilde{a}_{2L-2} & \tilde{a}_{2L-3} \\ \tilde{a}_{2L-2} & \tilde{a}_{2L-3} & \tilde{a}_{2L-4} \\ \vdots & & \\ \tilde{a}_{L+2} & \tilde{a}_{L+1} & \tilde{a}_L \end{bmatrix} \begin{bmatrix} h_0 \\ h_1 \\ h_2 \end{bmatrix} + \begin{bmatrix} v_{2L-2} \\ v_{2L-3} \\ \vdots \\ v_{L+1} \end{bmatrix},

or, in short, r = H a + v = \tilde{A} h + v.

This means that the transmitted symbols as well as the channel coefficients can be estimated in a ping-pong manner by LS methods. Once the channel estimates are improved,

h_{LS} = (\tilde{A}^H \tilde{A})^{-1} \tilde{A}^H r,

in the next step the data estimates are improved,

a_{LS} = (H^H H)^{-1} H^H r,

and so on. This is the principle of a so-called iterative receiver, which is illustrated in Figure 3.8. Typically the soft decoded data symbols, which may be the outcome of an LS estimation, are fed back, but the final decision requires so-called hard symbols. A slicer (quantization device) forces the soft symbols into symbols of an allowed alphabet.

Example 3.9 Consider an underdetermined system of equations, i.e., there are more parameters to estimate than observations:

\begin{bmatrix} 1 & 2 & -3 \\ -5 & 4 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} -4 \\ 6 \end{bmatrix}.

Figure 3.8: Iterative receiver: the pseudoinverses A^# and H^# alternately produce channel estimates \hat{h} and soft symbols \tilde{a}; a slicer maps the soft symbols to hard symbols \hat{a}.

As the system is underdetermined, it has many solutions. Take one solution, e.g., x = [1,2,3]^T. Then we can describe the manifold of solutions by

x = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} + t \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}; \quad t \in \mathbb{C}.

Of all these solutions, the one with the least norm is of most interest:

\min \|x\|_2 \quad \text{subject to} \quad A x = b.

We assume again that an estimate of x is constructed as a linear combination of the observations, thus

\hat{x} = A^H c \;\Rightarrow\; A \hat{x} = A A^H c = b \;\Rightarrow\; c = [A A^H]^{-1} b.

Substituting this solution c into the initial ansatz, we find

\hat{x} = A^H [A A^H]^{-1} b.

Note that A^H (A A^H)^{-1} is also a pseudoinverse of A, since A \cdot A^H (A A^H)^{-1} = I. It is called the right pseudoinverse. Surprisingly, this solution always delivers the minimum norm solution. The reason is that all other solutions have additional components that are not linear combinations of the rows of A (they are not in the column space of A^H).

Example 3.10 Consider the previous example again. The minimum norm solution is given by x = [-1, 0, 1]^T and not, as one might assume, by [1, 2, 3]^T! A solution can be found by linear combinations of the row vectors of the matrix, i.e., [1, 2, -3] and [-5, 4, 1]:

x = \begin{bmatrix} -1 \\ 0 \\ 1 \end{bmatrix} + t \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}; \quad t \in \mathbb{C}, \qquad 2 \begin{bmatrix} 1 \\ 2 \\ -3 \end{bmatrix} - \begin{bmatrix} -5 \\ 4 \\ 1 \end{bmatrix} = \begin{bmatrix} 7 \\ 0 \\ -7 \end{bmatrix} = -7 \begin{bmatrix} -1 \\ 0 \\ 1 \end{bmatrix}.

The minimum norm solution can thus be composed as a linear combination of the row vectors (row space) of the underdetermined matrix. Note that [-1, 0, 1]^T lies in the row space of A, but [1, 2, 3]^T and [1, 1, 1]^T do not.

Now, is [-1, 0, 1]^T truly the minimum norm solution? We check this by writing straightforwardly

\|x\|_2^2 = \left\| \begin{bmatrix} -1 \\ 0 \\ 1 \end{bmatrix} + t \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \right\|_2^2 = (-1+t)^2 + t^2 + (1+t)^2.

Differentiating with respect to t delivers

\frac{\partial \|x\|_2^2}{\partial t} = 2(-1+t) + 2t + 2(1+t) = 6t = 0 \;\Rightarrow\; t = 0.

So indeed the minimum norm solution is given by [-1, 0, 1]^T. The reader may try the same with the non-minimum-norm solution [1, 2, 3]^T.
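A small hedged numerical check of the right pseudoinverse and the minimum norm solution discussed in Examples 3.9 and 3.10:

import numpy as np

A = np.array([[ 1.0, 2.0, -3.0],
              [-5.0, 4.0,  1.0]])
b = np.array([-4.0, 6.0])

x_min = A.T @ np.linalg.solve(A @ A.T, b)   # right pseudoinverse applied to b
print(x_min)                                # [-1, 0, 1], the minimum norm solution
print(A @ x_min)                            # reproduces b
print(np.linalg.norm(x_min), np.linalg.norm([1.0, 2.0, 3.0]))  # sqrt(2) < sqrt(14)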

The general condition

\min \|x\|_2 \quad \text{subject to} \quad A x = b

can also be modified when particular (sparse) solutions are of interest:

\min \|x\|_0 \quad \text{subject to} \quad A x = b.

This problem is known to be NP-hard and thus of high complexity to solve. Alternatively, one can minimize

\min \|x\|_1 \quad \text{subject to} \quad A x = b

as a practical approximation; in the literature this is known under the name basis pursuit. Alternative forms are:

\min \|A x - b\|_2^2 + \lambda \|x\|_0, \qquad \min \|A x - b\|_2^2 + \lambda \|x\|_1.

For some value of \lambda the first one is identical to the previous sparse problem; thus, the problem of finding \lambda remains. The second formulation is a convex approximation for which efficient numerical solutions exist. It is typically the preferred formulation for compressive sensing problems^6.

Example 3.11 Let us revisit Example 1.5 from the first chapter. We found that, with the help of Bezout's theorem, an infinite number of solutions exists for the equalizer problem. But which solution is the best? This can be answered in the context of additive noise. If we transmit data over a channel, we model further uncertainties by additive noise. In its simplest form it originates from the thermal behavior of the electronic circuits. We thus now receive

y_k = H_1(q^{-1}) G_1(q^{-1}) x_k + G_1(q^{-1}) v_{1,k} + H_2(q^{-1}) G_2(q^{-1}) x_k + G_2(q^{-1}) v_{2,k} = q^{-D} x_k + G_1(q^{-1}) v_{1,k} + G_2(q^{-1}) v_{2,k}.

In addition to the delayed version of the input we now experience noise filtered by the equalizers. If we want to find those equalizers that minimize the noise power, we have to minimize \|g^{(1)}, g^{(2)}\|_2^2, assuming that the noise sequences are of equal power and statistically independent. Fortunately, the LS approach delivers exactly this minimum norm solution when solving

\begin{bmatrix} h_0^{(1)} & 0 & h_0^{(2)} & 0 \\ h_1^{(1)} & h_0^{(1)} & h_1^{(2)} & h_0^{(2)} \\ 0 & h_1^{(1)} & 0 & h_1^{(2)} \end{bmatrix} \begin{bmatrix} g_0^{(1)} \\ g_1^{(1)} \\ g_0^{(2)} \\ g_1^{(2)} \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}.

3.4 Weighted Least Squares

Recall linear regression. There are not only linear relations; depending on the order m, we speak of quadratic, cubic, ... regression. If we have observations available with different precision (e.g., from different sensors), we can weight them according to their confidence. This can be achieved by a weighting matrix W:

c = [A^H W A]^{-1} A^H W y.

^6 In the context of sparse problems, an important matrix property is its spark. It is the smallest number of linearly dependent columns (or rows). Take for example the matrix

A = \begin{bmatrix} 1 & 2 & 0 & 1 \\ 1 & 2 & 0 & 2 \\ 1 & 2 & 0 & 3 \\ 0 & 2 & 1 & 4 \end{bmatrix}.

The spark of this matrix is three, as the first three columns can be linearly combined to the zero vector. The spark is to be seen as a counterpart to the rank (see Definition 4.15) of a matrix.

In general W needs to be positive-definite. Up to now we mostly considered Euclidean norms (L_2- and l_2-norms). The open question is how solutions in other norms can be computed:

\min_c \|x - A c\|_p \;\Rightarrow\; \min_c \|x - A c\|_p^p = \min_c \sum_{i=1}^{m} |x_i - (A c)_i|^p,

where we used the fact that powers are monotone functions and thus preserve minima. The latter can be reformulated into

\min_c \|x - A c\|_p^p = \min_c \sum_{i=1}^{m} \underbrace{|x_i - (A c)_i|^{p-2}}_{w_i} \; |x_i - (A c)_i|^2.

We interpret a part of the norm as a weighting term w_i. The problem is thus formulated as a classical quadratic problem with a diagonal weighting matrix W. To solve it, an iterative algorithm is known. Start with an initial LS estimate and then continue as follows:

c^{(1)} = [A^H A]^{-1} A^H x
for k = 1, 2, ...
\quad e^{(k)} = x - A c^{(k)}
\quad W^{(k)} = \mathrm{diag}\left( |e_1^{(k)}|^{p-2}, |e_2^{(k)}|^{p-2}, \dots, |e_m^{(k)}|^{p-2} \right)
\quad c^{(k+1)} = \lambda \left( A^H W^{(k)} A \right)^{-1} A^H W^{(k)} x + (1 - \lambda) c^{(k)}.

In the last step a convex linear combination is selected with some value \lambda \in [0,1].
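A minimal hedged sketch of this iteratively reweighted LS scheme (example data; p and \lambda are arbitrary choices, and a small regularizer is added to the weights to avoid division issues):

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 3))
x = A @ np.array([1.0, -2.0, 0.5]) + 0.1*rng.standard_normal(50)

p, lam = 4.0, 0.5
c = np.linalg.lstsq(A, x, rcond=None)[0]          # initial LS estimate
for k in range(30):
    e = x - A @ c
    w = np.abs(e)**(p - 2) + 1e-12                # weights |e_i|^(p-2), regularized
    AW = A.T * w                                  # A^T W with diagonal W
    c_new = np.linalg.solve(AW @ A, AW @ x)
    c = lam*c_new + (1 - lam)*c                   # convex combination

print(c)                                          # approximate l_p-norm solution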

Example 3.12 Filter design

A linear-phase FIR filter of length 2N+1 is to be designed such that a predefined magnitude response |H_d(e^{j\Omega})| is approximated in the best manner. Note that for linear-phase FIR filters we have h_k = h_{2N-k}; k = 0, 1, \dots, 2N, see also Exercise 2.13. We form b_0 = h_N and b_k = 2 h_{N+k}; k = 1, 2, \dots, N:

H(e^{j\Omega}) = e^{-jN\Omega} H_r(e^{j\Omega}) = e^{-jN\Omega} \sum_{n=0}^{N} b_n \cos(n\Omega) = e^{-jN\Omega} \, b^T c(\Omega).

Here we placed all cosine terms in the vector c(\Omega) = [1, \cos(\Omega), \cos(2\Omega), \dots, \cos(N\Omega)]^T. If we used a quadratic measure (as we did in a previous example), we would obtain the Fourier coefficients; the magnitude response would be approximated only moderately well. With a larger norm p \to \infty, however, a much better result is obtained (equiripple design):

\lim_{p \to \infty} \min_{H_r(e^{j\Omega})} \int_0^\pi \left| H_r(e^{j\Omega}) - |H_d(e^{j\Omega})| \right|^p d\Omega.

As the p-th root is a monotone function, we dropped it for the optimization process. In Figure 3.9 we depict the result obtained when 81 (N = 40) coefficients are applied to obtain

an ideal low pass filter. We recognize that the equiripple design allows a much steeper drop around the cut-off frequency. However, the attenuation at higher frequencies is better for the Fourier solution (p = 2). The algorithm used to obtain a solution for p \to \infty is called the Remez algorithm in the literature.

Figure 3.9: Equiripple filter solution for p → ∞ vs Fourier p = 2 design.

3.5 Orthogonal Polynomial Families

In the following we will treat the (still open) question of which basis functions are best suited for approximations. We have seen so far that simple polynomials lead to poorly conditioned problems. We have also seen that orthogonal sets are of particular interest, since the inverse of the Gramian becomes very simple. We will thus search for suitable basis functions with orthogonal (better: orthonormal) properties.

We approximate a function x(t) in the LS sense (L_2-norm) using orthonormal functions p_i(t):

\hat{x}(t) = \sum_{i=1}^{m} c_i p_i(t).

Due to the orthonormality of the basis functions p_i(t); i = 1, 2, \dots, m, we find

\left\| x(t) - \sum_{i=1}^{m} c_i p_i(t) \right\|_2^2 = \|x(t)\|_2^2 - \sum_{i=1}^{m} \big| \underbrace{\langle x(t), p_i(t) \rangle}_{c_{LS,i}} \big|^2 = \|x(t)\|_2^2 - \sum_{i=1}^{m} |c_{LS,i}|^2 \geq 0.

As this expression cannot be negative, we can conclude Bessel's inequality:

\|x(t)\|_2^2 \geq \sum_{i=1}^{m} |c_{LS,i}|^2.

The energy of the original signal is never exceeded by the energy of the LS coefficients when an orthonormal basis is applied.

It is very interesting to compare this result with Parseval's theorem, which basically states that the energy in the image domain (Fourier) is identical to the original energy. We thus want to investigate under which conditions Parseval's more strict and under which Bessel's looser condition holds.

Consider the limit of this series:

\hat{x}_m(t) = \sum_{i=1}^{m} c_i p_i(t), \qquad \hat{x}(t) = \hat{x}_\infty(t) = \sum_{i=1}^{\infty} c_i p_i(t).

Since the estimates form a Cauchy sequence and the Hilbert space is complete, we can conclude that the limit is also in the Hilbert space. However, not every (smooth) function can be approximated point by point by an orthonormal set; such limits are thus not necessarily in C[a,b]! Let us now restrict ourselves to approximations in the L_2-norm. Even then, not every function can be approximated (well) with a set of orthonormal basis functions, even if m \to \infty.

Example 3.13 The set \{\sin(nt)\}; n = 1, 2, \dots, \infty builds an orthogonal set. The function \cos(t) cannot be approximated by it, since all coefficients vanish:

c_n = \int_0^{2\pi} \cos(t) \sin(nt) \, dt = 0.

More generally, the set of sinusoids cannot approximate any even function.

We thus require a specific property of orthonormal sets in order to guarantee that every function can be approximated.

Theorem 3.3 (Completeness) A set of orthonormal functions p_i(t); i = 1, 2, \dots, \infty is complete in an inner product space S with induced norm (i.e., it can approximate an arbitrary function) if any of the following equivalent statements holds:

• x(t) = \sum_{i=1}^{\infty} \langle x(t), p_i(t) \rangle \, p_i(t),

• \left\| x(t) - \sum_{i=1}^{n} \langle x(t), p_i(t) \rangle \, p_i(t) \right\| < \epsilon \quad \text{for all } n \geq N, \; N < \infty,

• \|x(t)\|^2 = \sum_{i=1}^{\infty} \langle x(t), p_i(t) \rangle^2 \quad (\text{Parseval}),

• there is no nonzero function f(t) \in S for which the set \{f(t), p_1(t), p_2(t), \dots\} forms an orthonormal set.

It is also said that the orthogonal set of basis functions is complete (Ger.: vollständig). Note that this is not equivalent to a complete Hilbert space (\rightarrow Cauchy)!

It is worth pointing out the difference to finite-dimensional sets: for finite-dimensional sets it is sufficient to show that the functions p_i(t) are linearly independent.

If an infinite-dimensional set satisfies the properties of Theorem 3.3, then the representation of x is equivalently given by the infinite set of coefficients c_i. The coefficients c_i of a complete set are also called a generalized Fourier series.

Lemma 3.2 If two functions x(t) and y(t) from S have a generalized Fourier series representation using some orthonormal basis set p_i(t) in a Hilbert space S, then:

\langle x(t), y(t) \rangle = \sum_{i=1}^{\infty} c_i b_i.

Proof:
Let

x(t) = \sum_{i=1}^{\infty} c_i p_i(t), \qquad y(t) = \sum_{k=1}^{\infty} b_k p_k(t).

Then

\langle x(t), y(t) \rangle = \left\langle \sum_{i=1}^{\infty} c_i p_i(t), \sum_{k=1}^{\infty} b_k p_k(t) \right\rangle = \sum_{l=1}^{\infty} c_l b_l.

Compare Parseval's theorem in its most general form:

x_k = \frac{1}{2\pi} \int_{-\pi}^{\pi} X(e^{j\Omega}) \, e^{jk\Omega} \, d\Omega,

\sum_{k=-\infty}^{\infty} x_k y_k^* = \frac{1}{2\pi j} \oint_C X(z) \, Y^*\!\left( \frac{1}{z^*} \right) \frac{1}{z} \, dz = \frac{1}{2\pi} \int_{-\pi}^{\pi} X(e^{j\Omega}) \, Y^*(e^{j\Omega}) \, d\Omega,

which is nothing else but an inner product. The series x_k, y_k can be interpreted as the coefficients c_k, b_k, e.g.,

\sum_{k=-\infty}^{\infty} c_k b_k = \langle X, Y \rangle.

In the following, a few prominent members of orthonormal sets will be presented.

Example 3.14 Fourier Series

Consider the pair of Fourier series relations on [0, 2\pi]:

f(t) = \sum_{n=-\infty}^{\infty} c_n \frac{1}{\sqrt{2\pi}} e^{jnt}, \qquad c_n = \frac{1}{\sqrt{2\pi}} \int_0^{2\pi} f(t) \, e^{-jnt} \, dt.

The symmetric form with the prefactor \frac{1}{\sqrt{2\pi}} guarantees orthonormality. Other, non-symmetric forms are possible, leading to orthogonal sets. In case the period is not 2\pi

but arbitrary, say T, we consider the function to be periodic, f(t) = f(t + mT), and obtain

f(t) = \sum_{n=-\infty}^{\infty} c_n \frac{1}{\sqrt{T}} \, e^{jn\frac{2\pi}{T}t}, \qquad c_n = \frac{1}{\sqrt{T}} \int_0^{T} f(t) \, e^{-jn\frac{2\pi}{T}t} \, dt.

Example 3.15 Discrete Fourier Transform (DFT)

Closely related is the DFT. A series x_k is only known at N points, k = 0, 1, \dots, N-1:

x_k = \frac{1}{\sqrt{N}} \sum_{l=0}^{N-1} c_l \, e^{j2\pi k l / N}, \qquad c_l = \frac{1}{\sqrt{N}} \sum_{k=0}^{N-1} x_k \, e^{-j2\pi k l / N}.

Note that in both transformations, often orthogonal rather than orthonormal sets are applied in order to reduce complexity in one direction.

Note further that in the previous two examples the orthogonal functions are constructed from trigonometric functions e^{jn2\pi t/T} and e^{j2\pi kn/N}, respectively. The weighting function is thus w(t) = 1. We simply find:

\frac{1}{2\pi} \int_0^{2\pi} e^{jnt} e^{-jmt} \, dt = \begin{cases} 0 & ; n \neq m \\ 1 & ; n = m \end{cases}

\frac{1}{N} \sum_{k=0}^{N-1} e^{j2\pi k n / N} e^{-j2\pi k m / N} = \begin{cases} 0 & ; \text{else} \\ 1 & ; |n - m| \bmod N = 0 \end{cases} \;=\; \sum_{r=-\infty}^{\infty} \delta_{n-m+rN}.

The continuous functions are thus truly orthonormal, while the time-discrete series show a periodic behavior: the orthonormality only applies per period.
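A brief hedged numerical check of the (per-period) orthonormality of the normalized DFT basis:

import numpy as np

N = 8
k = np.arange(N)
# columns: p_l[k] = exp(j 2*pi*k*l/N) / sqrt(N)
F = np.exp(2j*np.pi*np.outer(k, k)/N) / np.sqrt(N)

print(np.allclose(F.conj().T @ F, np.eye(N)))   # True: orthonormal basis
x = np.random.default_rng(3).standard_normal(N)
c = F.conj().T @ x                              # analysis (DFT coefficients)
print(np.allclose(F @ c, x))                    # True: perfect reconstruction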

Example 3.16 We can now return to a problem that remained open from the first chapter when we discussed linear-phase filters, see Section 1.2.3. Assume an arbitrary causal FIR filter response of finite length n is given in terms of its z-transform:

F(z) = \sum_{i=0}^{n} f_i z^{-i}.

This can always be split into an even and an odd part:

f_i^{(e)} = f_i^{(o)} = \frac{f_i}{2}; \quad i = 1, 2, \dots, n, \qquad f_0^{(e)} = f_0, \quad f_0^{(o)} = 0.

Correspondingly, the z-transform consists of two parts: F(z) = F^{(e)}(z) + F^{(o)}(z). In terms of the Fourier transform we find:

F^{(e)}(e^{j\Omega}) = f_0 + \sum_{i=1}^{n} \frac{f_i}{2} \left( e^{j\Omega i} + e^{-j\Omega i} \right) = \sum_{i=0}^{n} f_i \cos(\Omega i),

F^{(o)}(e^{j\Omega}) = \sum_{i=1}^{n} \frac{f_i}{2} \left( e^{-j\Omega i} - e^{j\Omega i} \right) = -j \sum_{i=1}^{n} f_i \sin(\Omega i).

We thus find the real part of F(e^{j\Omega}) to be F^{(e)}(e^{j\Omega}) and the imaginary part to be given by F^{(o)}(e^{j\Omega}). The real part is purely written in terms of cosines and is thus an even function in \Omega; the imaginary part is an odd function with sine terms.

As we now understand that the \sin(\cdot) and \cos(\cdot) terms are perfectly orthogonal to each other, the real part cannot be approximated by the imaginary part; in that sense they are completely independent. As one guarantees an even symmetry and the other an odd symmetry, one can only choose one of them in order to have a linear-phase filter. For even filter responses only the real part in \cos(\cdot) terms remains, for odd ones only the imaginary part in \sin(\cdot) terms remains. We can thus conclude:

Linear-phase filters need to have a symmetric impulse response, and symmetric impulse responses result in a linear-phase filter.

3.5.1 Orthogonal Polynomials

We already noticed that simple polynomials lead to poorly conditioned problems because they are not orthogonal. However, it is possible to build orthogonal polynomial families. In the context of polynomial families it is common to employ weighted inner products. Thus, inner products in this section typically include some positive weighting function w(t). Naturally, all inner products of the same family apply the same, identical weighting function. Orthogonal functions have the property that their inner product becomes zero in the interval of interest [a,b]:

\langle p_i, p_j \rangle_w = \int_a^b w(t) \, p_i(t) \, p_j(t) \, dt = 0

for i \neq j, where a particular form of positive weighting function w(t) > 0 is commonly applied. All of these polynomials share a common property.

Lemma 3.3 Orthogonal polynomials satisfy the following recursive equation:

t \, p_n(t) = a_n p_{n+1}(t) + b_n p_n(t) + c_n p_{n-1}(t).

Proof:
Let

g_n(t) = t \, p_n(t) - a_n p_{n+1}(t)

be of degree n (by choice of a_n). Then we must have

g_n(t) = b_n p_n(t) + c_n p_{n-1}(t) = \sum_{i=0}^{n} d_i p_i(t)

with

d_i = \langle g_n(t), p_i(t) \rangle.

Since the polynomials are orthogonal to each other, we have

\langle p_n(t), p_i(t) \rangle = 0 \quad ; \; i = 0, 1, 2, \dots, n-1,
\langle t \, p_n(t), p_i(t) \rangle = \langle p_n(t), t \, p_i(t) \rangle = 0 \quad ; \; i = 0, 1, 2, \dots, n-2.

Since g_n(t) = t \, p_n(t) - a_n p_{n+1}(t), we must also have

d_i = \langle g_n(t), p_i(t) \rangle = \langle t \, p_n(t), p_i(t) \rangle - a_n \langle p_{n+1}(t), p_i(t) \rangle = 0 \quad ; \; i = 0, 1, 2, \dots, n-2.

Thus, only two coefficients (for i = n-1 and i = n) remain:

d_n = b_n = \langle g_n(t), p_n(t) \rangle, \qquad d_{n-1} = c_n = \langle g_n(t), p_{n-1}(t) \rangle.

Example 3.17 Hermite Polynomials

Consider the following differential form:

y_n(t) = \frac{d^n}{dt^n} \left( e^{-\frac{t^2}{2}} \right) = p_n(t) \, e^{-\frac{t^2}{2}}.

A polynomial p_n(t) of order n appears, which can be brought into the recursive form

t \, p_n(t) = -p_{n+1}(t) - n \, p_{n-1}(t).

The first polynomials are

p_0(t) = 1, \quad p_1(t) = -t, \quad p_2(t) = t^2 - 1, \quad p_3(t) = -t^3 + 3t.

As the coefficients are simply integer values, computation with this polynomial family is very easy. In the 19th century, when computers were not available, this polynomial family was often preferred as many computations had to be done manually. The weighting function becomes visible in the inner product:

\int_{-\infty}^{\infty} p_n(t) \, p_m(t) \, \underbrace{\frac{e^{-\frac{t^2}{2}}}{n! \sqrt{2\pi}}}_{w(t)} \, dt = \delta_{n-m}.

Example 3.18 Binomial Hermite Sequences

There also exist time-discrete binomial Hermite sequences of order N in the form

x_k^{(r+1)} = -x_{k-1}^{(r+1)} + x_k^{(r)} - x_{k-1}^{(r)},

a doubly recursive form, recursive in time k as well as in order r. We thus require initial values for both recursions:

x_{-1}^{(r)} = 0 \quad ; \; r = 0, 1, \dots, N, \qquad x_k^{(0)} = \binom{N}{k} \quad ; \; k = 0, 1, \dots, N.

Note that the order r really only runs from 0 to N, while k is bounded by N only for the initial values; afterwards k keeps increasing. Applying the z-transform with respect to k we obtain:

X^{(r+1)}(z) = -z^{-1} X^{(r+1)}(z) + X^{(r)}(z) - z^{-1} X^{(r)}(z),
X^{(r+1)}(z) \, (1 + z^{-1}) = X^{(r)}(z) \, (1 - z^{-1}),
X^{(r+1)}(z) = \frac{z-1}{z+1} X^{(r)}(z) = \left( \frac{z-1}{z+1} \right)^{r+1} X^{(0)}(z).

From the initial values we find

X^{(0)}(z) = \sum_{k=0}^{N} \binom{N}{k} z^{-k} = (1 + z^{-1})^N.

Plugging the initial values into the previous equation provides the final explicit solution

X^{(r)}(z) = (1 - z^{-1})^r (1 + z^{-1})^{N-r}.

We can now derive polynomials P_k^{(r)} from the sequence x_k^{(r)} (similar to the previous example) with

x_k^{(r)} = \binom{N}{k} P_k^{(r)}.

Recall that for large values of N we have

\binom{N}{k} \approx \frac{2^N}{\sqrt{N\pi/2}} \exp\!\left( -\frac{(k - N/2)^2}{N/2} \right),

explaining the association with the Hermite polynomials. The orthogonality of the obtained polynomials is given with respect to the weighting \binom{N}{k}:

\sum_{k=0}^{N} P_k^{(r)} P_k^{(s)} \binom{N}{k} = \frac{2^N}{\binom{N}{s}} \, \delta_{r-s}.

The polynomials obtained in this way share an interesting symmetry:

P_k^{(r)} = P_r^{(k)}.

Due to the recursive character, a so-called binomial filter bank can be constructed with very simple (low-complexity) filter stages, as depicted in Figure 3.10. If the input sequence is a Dirac impulse, the filter impulse responses correspond to the derived sequences. In general we are interested in the filter behavior of a filter bank. For this we show the magnitudes of the Fourier transforms in Figure 3.11. In the left picture the filter character is not easily visible. Once the filters are normalized with respect to their maxima, their bandpass character becomes apparent, as shown in the right-hand part of the figure.

Figure 3.10: Binomial filter bank architecture: a Dirac impulse \delta_k drives the prototype filter (1+z^{-1})^N, followed by a cascade of (z-1)/(z+1) stages producing x_k^{(0)}, x_k^{(1)}, x_k^{(2)}, \dots

Example 3.19 Legendre Polynomials

This family of polynomials is obtained straightforwardly by setting the weighting function to w(t) = 1. We find

t \, p_n(t) = \frac{n+1}{2n+1} \, p_{n+1}(t) + \frac{n}{2n+1} \, p_{n-1}(t).
Figure 3.11: Binomial filter bank, magnitude of the Fourier transforms |X_i(e^{j\Omega})| (left) and normalized version (right).

The first members are

p_0(t) = 1, \quad p_1(t) = t, \quad p_2(t) = \frac{3}{2} t^2 - \frac{1}{2}, \quad p_3(t) = \frac{5}{2} t^3 - \frac{3}{2} t.

Example 3.20 Tschebyscheff Polynomials

These polynomials are motivated by the design of polynomials that guarantee a limited maximal error. Designs based on them often carry the name equiripple or min-max approach. They include a weighting function:

\int_{-1}^{1} p_n(t) \, p_m(t) \underbrace{\frac{1}{\sqrt{1-t^2}}}_{w(t)} \, dt = \begin{cases} 0 & ; n \neq m \\ \pi & ; n = m = 0 \\ \frac{\pi}{2} & ; n = m \neq 0 \end{cases}

The recursive form reads

t \, p_n(t) = \frac{1}{2} p_{n+1}(t) + \frac{1}{2} p_{n-1}(t),

with the first members being

p_0(t) = 1, \quad p_1(t) = t, \quad p_2(t) = 2t^2 - 1, \quad p_3(t) = 4t^3 - 3t.

Surprisingly, there is even an explicit formula:

p_n(t) = \cos(n \arccos(t)).

Figure 3.12 depicts on the left the first members of the Legendre family and on the right those of the Tschebyscheff family.

Figure 3.12: (left): Legendre, (right): Tschebyscheff.
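A hedged sketch generating the first Tschebyscheff polynomials from the recursion (coefficient arrays are in ascending powers of t):

import numpy as np

def tschebyscheff(n_max):
    # p_0 = 1, p_1 = t, and t*p_n = (p_{n+1} + p_{n-1})/2  =>  p_{n+1} = 2*t*p_n - p_{n-1}
    polys = [np.array([1.0]), np.array([0.0, 1.0])]
    for n in range(1, n_max):
        shifted = np.concatenate(([0.0], 2.0*polys[n]))               # 2*t*p_n
        prev = np.pad(polys[n-1], (0, len(shifted)-len(polys[n-1])))
        polys.append(shifted - prev)
    return polys

for p in tschebyscheff(3):
    print(p)        # [1], [0 1], [-1 0 2], [0 -3 0 4]  ->  1, t, 2t^2-1, 4t^3-3t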

3.5.2 Differential Equations

Families of orthogonal functions are well suited to solve multivariate differential equations numerically. Many partial differential equations are formulated in time and space, and one can thus use one family (not necessarily the same) for each variable, as long as the variables are independent. Even nonlinear behavior can be supported by such methods. In the following we only show a very simple example for which the result is well known.

Example 3.21 Linear Differential Equation

Consider the following homogeneous differential equation

\frac{d\phi(t)}{dt} + \phi(t) = 0

with the condition \phi(t=0) = 1. The solution is well known:

\phi(t) = e^{-t} = 1 - t + \frac{1}{2} t^2 - \frac{1}{6} t^3 + \dots

Let us solve it with a polynomial. For this we select simple basis functions p_n(t) = t^n; n = 0, 1, 2, \dots. We can thus write

\phi(t) = a_0 + a_1 t + a_2 t^2 + \dots = \sum_n a_n p_n(t)

with unknown coefficients a_n. What happens to the coefficients when we apply a differentiation?

\frac{d\phi(t)}{dt} = a_1 + 2 a_2 t + 3 a_3 t^2 + \dots, \qquad \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 3 \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{bmatrix} = \begin{bmatrix} a_1 \\ 2 a_2 \\ 3 a_3 \end{bmatrix}.

Solving the homogeneous differential equation (truncated at order three), together with the initial condition a_0 = \phi(0) = 1, is thus equivalent to solving

\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 1 & 2 & 0 \\ 0 & 0 & 1 & 3 \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{bmatrix}, \qquad \begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{bmatrix} = \begin{bmatrix} 1 \\ -1 \\ \frac{1}{2} \\ -\frac{1}{6} \end{bmatrix}.

We now consider the inhomogeneous problem

\frac{d\phi(t)}{dt} + \phi(t) = 1 + \frac{t}{2}

with the initial values \phi(t=0) = 1.5 and \phi'(t=0) = -0.5. The polynomial on the right-hand side can be written in terms of the vector [1, 0.5, 0]^T and we find

\begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 1 & 2 & 0 \\ 0 & 0 & 1 & 3 \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{bmatrix} = \begin{bmatrix} 1 \\ 0.5 \\ 0 \end{bmatrix}.

This is an underdetermined set of equations which we can solve by LS (minimum norm solution), and we obtain

\begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{bmatrix} = \begin{bmatrix} \frac{1}{2} \\ \frac{1}{2} \\ 0 \\ 0 \end{bmatrix}.

The complete solution has the form

\phi(t) = \frac{1}{2} + \frac{1}{2} t + \alpha e^{-t}.

With the initial conditions, we find \alpha = 1.
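A hedged NumPy sketch of this polynomial-basis approach for the inhomogeneous problem (truncation order as in the example):

import numpy as np

# operator matrix for d/dt + 1 acting on [a0, a1, a2, a3] (cubic truncation)
A = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 2.0, 0.0],
              [0.0, 0.0, 1.0, 3.0]])
b = np.array([1.0, 0.5, 0.0])          # right-hand side 1 + t/2

a = A.T @ np.linalg.solve(A @ A.T, b)  # minimum norm solution of the underdetermined system
print(a)                               # [0.5, 0.5, 0, 0] -> particular solution 1/2 + t/2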

3.6 Wavelets

3.6.1 Pre-Wavelets

Before we consider candidates for wavelets, we reconsider our interpolation function from Chapter 1.

Example 3.22 Sinc Functions

Consider the set of basis functions

p_n(t) = \mathrm{sinc}(2Bt - n).

We can interpret the parameter B as a stretching element and the parameter n as a shift element. These functions share interesting properties in stretching as well as in shifting:

2B \int_{-\infty}^{\infty} \mathrm{sinc}(2Bt - n) \, \mathrm{sinc}(2Bt - m) \, dt = \begin{cases} 1 & ; n = m \\ 0 & ; n \neq m \end{cases}

2mnB \int_{-\infty}^{\infty} \mathrm{sinc}(2nBt) \, \mathrm{sinc}(2mBt) \, dt = \min(n, m).

While the first property displays a desired orthonormality with respect to shifting, a change in stretching does not deliver an orthogonal basis. If the latter is exploited, a non-diagonal

Gramian needs to be applied when solving the LS equations.

If we want to approximate a continuous function f(t) by an infinite series of shifted sinc functions (as a basis), we find in the LS sense the optimal coefficients

c_n = \frac{\langle f(t), p_n(t) \rangle}{\langle p_n(t), p_n(t) \rangle} = \langle f(t), p_n(t) \rangle.

The inner product is simply

\int_{-\infty}^{\infty} f(\tau) \, \mathrm{sinc}(2B\tau - n) \, d\tau.

We remember from the Sampling Theorem 1.7 that for band-limited functions f(t) the result is surprisingly simple: c_n = f(nT). This inner product is a convolution integral, yet not in the form we are most often used to:

\int_{-\infty}^{\infty} f(\tau) \, \mathrm{sinc}(2B(t - \tau)) \, d\tau = \int_{-\infty}^{\infty} f(\tau) \, \mathrm{sinc}(2B(\tau - t)) \, d\tau = f_L(t).

We thus obtain a low pass version f_L(t) of the original function f(t), as depicted in Figure 3.13.

Figure 3.13: Signal f(t) passes an ideal low pass and becomes f_L(t).

If we now compare the result of the convolution integral and the original inner product, we recognize that they are identical once we set 2Bt = n. The inner product is thus f_L(\frac{n}{2B}) = f_L(nT) = c_n. Only if the function is band-limited is the approximated (filtered) low pass function identical to the original function, f_L(t) = f(t), allowing us to replicate the function without any loss. If this condition is not satisfied, the higher frequency parts are simply dropped as they do not pass the ideal low pass.

In other words: sampling and interpolation can equivalently be interpreted as an approximation problem, in which a continuous function f(t) is represented by basis functions \mathrm{sinc}(2Bt - n) that are shifted in time by equidistant shifts T = 1/(2B). The approximation

works with zero error only if the function f(t) is bandlimited to |\omega| < 2\pi B.

This offers a straightforward generalization of our sampling approach. A function f(t) can be approximated in the LS sense by orthonormal basis functions p(2Bt - n) by computing

c_n = \langle f(t), p(2Bt - n) \rangle, \qquad \hat{f}(t) = \sum_n c_n \, p(2Bt - n).

Thus, by properly selecting p_n(t) = p(2Bt - n), we can select the space that fits our original signal best.

3.6.2 Wavelet Basics

We now return to one of our basic problems, namely how to find a basis that suits best. Or, posed slightly more generally: find the appropriate basis in which either the desired or the undesired parts of the signal can be described in sparse form. In this way the desired and undesired parts can be differentiated and finally decomposed. Figure 3.14 illustrates some applications of this kind. Very often desired and undesired parts are overlaid and need to be separated. Classical filtering with different frequency ranges does not work here, as both the distortion and the desired signal contain the same frequency components. Not only image processing is an application for this, but also audio: think of old vinyl records with lots of scratching sounds on them. Similar to the image example (lower right) in the figure, the scratching sounds also have a basis that can be extracted.

To achieve this, we consider bases that are orthonormal not only in shifting (= position = location), such as the sinc functions, but also in stretching:

p_{j,k}(t) = 2^{-\frac{j}{2}} \, \phi\!\left( 2^{-j} t - k \right).

The pre-factor 2^{-\frac{j}{2}} and the stretch factor 2^{-j} appear as a pair so that, for arbitrary values of j \in \mathbb{Z}, all resulting functions are normalized. Note that if \phi(t) is normalized (\|\phi(t)\| = 1), then we also have \|p_{j,k}(t)\| = 1. We select the function \phi(t) in such a way that its shifts build an orthonormal basis for a space:

V_0 = \mathrm{span}\{\phi(t - n), n \in \mathbb{Z}\}, \qquad \langle p_{0,k}, p_{0,l} \rangle = \delta_{k-l}.

The shifted functions thus build an orthonormal basis for V_0.

Figure 3.14: Examples of wavelet applications. Upper: a texture is overlaid on a picture. By selecting the proper basis (in this case a simple DCT), the texture and the background picture can be separated. Lower left: a fingerprint is separated from a cartoon. Lower right: a distorted picture is separated from its scratches.

Example 3.23 Unit Pulse

Consider the unit pulse

\phi(t) = U(t) - U(t-1) = p_{0,0}(t),

constructed out of two unit step functions. Shifted versions of this pulse span the space

V_0 = \mathrm{span}\{\phi(t - n), n \in \mathbb{Z}\}.

With this basis function, all functions f_0(t) that are constant on an integer mesh (Ger.: im Raster ganzzahliger Zahlen) can be described exactly. Continuous functions can be approximated with the precision of the integer spacing:

f_0(t) = \sum_n \langle f(t), p_{0,n}(t) \rangle \, p_{0,n}(t) = f(t) - e_0(t).

The coefficients obtained in this way,

c_n^{(0)} = \langle f(t), p_{0,n}(t) \rangle = \int_0^1 f(t+n) \, dt,

can also be interpreted as piecewise integrated areas over the function f (t). Figure 3.15
displays the quantization of a continuous function f (t) (left) and the corresponding error
term e0 (t) (right).

Figure 3.15: Left: decomposition into unit pulses; right: error term e_0(t).

Stretching can also be used to define new bases for other spaces, for example,

V_{-1} = \mathrm{span}\left\{ \sqrt{2}\,\phi(2t - n), n \in \mathbb{Z} \right\}.

If these spaces are nested (Ger.: Verschachtelung),

\dots \; V_2 \subset V_1 \subset V_0 \subset V_{-1} \; \dots \qquad (\leftarrow \text{coarse scale} \quad \text{fine scale} \rightarrow)

then we call \phi(t) a scaling function (Ger.: Skalierungsfunktion) for a wavelet. Next to nesting, there are other important properties of the V_m:

• Shrinking: \bigcap_{m \in \mathbb{Z}} V_m = \{0\}

• Closure: \overline{\bigcup_{m \in \mathbb{Z}} V_m} = L_2(\mathbb{R})

• Multi-resolution property: f(x) \in V_m \Leftrightarrow f(2x) \in V_{m-1} \quad \text{for } f(x) \in L_2(\mathbb{R}).



Example 3.24 Unit Pulse II

Consider the unit pulse again, this time in a stretched version:

\phi(2t) = U(2t) - U(2t - 1), \qquad V_{-1} = \mathrm{span}\left\{ \sqrt{2}\,\phi(2t - n), n \in \mathbb{Z} \right\}, \qquad p_{-1,n}(t) = \sqrt{2}\,\phi(2t - n).

With these functions we can represent all functions that are constant on a half-integer (n/2) mesh. All continuous functions can thus be approximated on a half-integer mesh:

f_{-1}(t) = \sum_n \langle f(t), p_{-1,n}(t) \rangle \, p_{-1,n}(t) = f(t) - e_{-1}(t).

The function f_{-1}(t) is thus an even finer approximation than f_0(t) in V_0. Since V_0 is a subset of V_{-1} we have:

f_{-1}(t) = \sum_n c_n^{(-1)} p_{-1,n}(t) = \sum_n c_n^{(0)} p_{0,n}(t) + e_{0,-1}(t) = \sum_n c_n^{(0)} p_{0,n}(t) + \sum_n d_n^{(0)} \psi_{0,n}(t)

with a suitable basis \psi_{0,n}(t) from W_0, where W_0 \oplus V_0 = V_{-1}. In other words, the set W_j complements the set V_j in such a way that

W_j \oplus V_j = V_{j-1},

so that with V_{j-1} the next finer approximation can be built. Hereby, W_j is the orthogonal complement of V_j:

W_j = V_j^{\perp} \subset V_{j-1}.

This is illustrated in Figure 3.16.

We have thus learned that the error terms live in spaces complementary to the corresponding signal spaces. The basis functions \psi_{j,n}(t) describing the error terms are called wavelets, a word creation from the French word ondelette (small wave), coined by Jean Morlet and Alex Grossmann around 1980. Thus, we can decompose any function at an arbitrary scaling step into two components, \{\psi_{j,n}\} and \{p_{j,n}\}. Very roughly, one can be considered a high pass, the other a low pass. By finer scaling the function can be approximated better and better. The required number of coefficients depends strongly on the wavelet or the corresponding scaling function.
Figure 3.16: The finer space V_{j-1} can be described by the vector space V_j and its orthogonal complement W_j.

Wavelets also have the scaling property

\psi_{j,n}(t) = 2^{-\frac{j}{2}} \, g\!\left( 2^{-j} t - n \right),

as well as the orthonormality property

\langle \psi_{j,k}(t), \psi_{l,n}(t) \rangle = \delta_{j-l} \, \delta_{k-n}.

Example 3.25 Haar Wavelets

Following up on the previous examples with unit pulses, we now look at the corresponding complement space. Based on unit pulses we can form

p_{m+1,n}(t) = \frac{1}{\sqrt{2}} \left[ p_{m,2n}(t) + p_{m,2n+1}(t) \right], \qquad \psi_{m+1,n}(t) = \frac{1}{\sqrt{2}} \left[ p_{m,2n}(t) - p_{m,2n+1}(t) \right].
Figure 3.17 illustrates the corresponding basis functions at level 0.^7
In Figure 3.18 an example is shown where a function f_{-1}(t) \in V_{-1} is approximated by f_0(t) \in V_0 with error e_{0,-1}(t). The error can be described completely by the wavelets \psi_{0,0}(t).
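A hedged sketch of one level of the Haar analysis (scaling and wavelet coefficients as normalized sums and differences of neighbouring samples; the input samples are example data):

import numpy as np

f = np.array([4.0, 2.0, 5.0, 5.0, 1.0, 3.0, 0.0, 2.0])   # samples on the fine grid

c = (f[0::2] + f[1::2]) / np.sqrt(2)   # scaling coefficients (coarse approximation)
d = (f[0::2] - f[1::2]) / np.sqrt(2)   # wavelet coefficients (detail / error terms)

# perfect reconstruction of the fine-grid samples from (c, d)
f_rec = np.empty_like(f)
f_rec[0::2] = (c + d) / np.sqrt(2)
f_rec[1::2] = (c - d) / np.sqrt(2)
print(np.allclose(f_rec, f))           # True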

Wavelets were in the research focus in the 80s and 90s. In the 80s, parallel subband coding was popular, and at first there was no obvious reason to favor one approach over the other, as equivalent formulations were possible with the wavelets existing at the time. This view (at the end of the 80s), however, did not reveal the true potential of wavelets, as they only offered equivalent performance. This situation changed when Ingrid Daubechies^8 introduced new families of wavelets, some of them not having the orthogonality property but a so-called bi-orthogonality property (see Definition 2.30). These wavelets found application in the JPEG2000 standard for image coding.

^7 Haar wavelets were introduced in 1909 by the Hungarian mathematician Alfréd Haar (11.10.1885-16.3.1933).
^8 Ingrid Daubechies (17.8.1954) is a Belgian physicist and mathematician.

Figure 3.17: Unit pulse p_{0,0}(t) = \phi(t) and p_{-1,0}(t) = \sqrt{2}\,\phi(2t) as well as wavelet \psi_{0,0}(t) = \phi(2t) - \phi(2t-1).

3.7 Exercises

Exercise 3.1 Consider projection matrices.

1. Show that, if A is a symmetric projection matrix, then B = I - 2A is orthogonal and symmetric.

2. Show that, if B is orthogonal and symmetric, then A = \frac{1}{2}(I - B) is a projection matrix.

3. Show that the eigenvalues of a projection matrix are either zero or one.

Figure 3.18: Function f_{-1}(t) \in V_{-1} is approximated by f_0(t) \in V_0 with error e_{0,-1}(t).
Chapter 4

Linear Operators

4.1 Linear Operators

In the following we generalize many of the previous concepts in terms of linear operators. More specifically, we will then constrain our interest to matrices as a special form of linear operators.

4.1.1 Basic Definitions


Definition 4.1 A transformation A: X \to Y, in which X and Y are vector spaces defined over a common scalar field, is called linear if for all x_1, x_2 from X and scalars a_1, a_2 from \mathbb{R} we have:

A[a_1 x_1 + a_2 x_2] = a_1 A[x_1] + a_2 A[x_2].

Note that in the mathematical literature the number field (Ger.: Zahlenkörper) can be arbitrary and needs to be defined. In engineering the number field is typically the real numbers. This can lead to situations where operators are called linear in the mathematical literature while in engineering they are not, and vice versa; see, e.g., Example 4.6.

Example 4.1 A complex-valued number z \in \mathbb{C} is mapped onto a vector x \in \mathbb{R}^2:

A[z] = x = \begin{bmatrix} \Re(z) \\ \Im(z) \end{bmatrix}.

We thus have a mapping from \mathbb{C} \to \mathbb{R}^2.

Example 4.2 A quadruple s = [s_1, s_2, s_3, s_4] \in \mathbb{C}^4 is mapped onto a 4 \times 4 matrix in \mathbb{C}^{4 \times 4}:

A[s] = \begin{bmatrix} s_1 & s_2 & s_3 & s_4 \\ -s_2^* & s_1^* & -s_4^* & s_3^* \\ s_3^* & s_4^* & s_1^* & s_2^* \\ s_4 & -s_3 & s_2 & -s_1 \end{bmatrix}.

Some more examples of linear operators are given by:

Example 4.3 Let a continuous function g(t) from C[0,1] be sampled at fixed time points 0 < t_1 < t_2 < \dots < t_n < 1:

A[g(t)] = [g(t_1), g(t_2), \dots, g(t_n)] \in \mathbb{R}^n.

Example 4.4 A function f: X \to \mathbb{R} \; (\mathbb{C}) that maps from a vector space X onto real (complex) numbers is called a functional. If it is linear, then it is called a linear functional:

f_1(x) = \frac{1}{T} \int_0^T x(t) \, dt, \qquad f_2(x) = \int_a^b x(t) \, g(\tau - t) \, dt, \qquad f_3(x) = \int_{-\infty}^{\infty} x(t) \exp(-j\omega t) \, dt.

More formally we can define a linear functional as a mapping with a particular property:

Definition 4.2 (Linear Functional)

f : X → IR, f (ax + by) = af (x) + bf (y).

• Remark 1: all inner products over functions can be interpreted as linear functionals,
e.g.,
Z b
f2 (x) = x(t)g(τ − t)dt = hx(t),g(t)i .
a

• Remark 2: all continuous, linear functionals in the Hilbert space can be described by
inner products (Riesz’ theorem).

Note that we introduced induced vector norms to describe norms on matrices. We can
also induce norms on linear operators:
 
kA[x]kp x
kA[.]kp = kA[.]kp,ind = sup = sup A = sup kA[x]kp .
x6=0 kxkp kxkp kxkp =1
x6=0

Example 4.5 Consider the causal sequence xk ; k = 0,1,2,.... The mapping of the sequence
onto a sum
Xk
sk = xl
l=0

is a linear operator.
152 Signal Processing 1

Example 4.6 Consider the Hermitian1 operator H[A] = AH , that transposes a matrix and
additionally builds the conjugate complex value of all elements. (Ger.: adjungierte Matrix,
Engl.: adjoint matrix):

B1 = AH H
1 = H[A1 ]; B2 = A2 = H[A2 ]
α1 B1 + α2 B2 = α1 AH H H
1 + α2 A2 = (α1 A1 + α2 A2 ) ; α1,2 ∈ IR

Another example is an operator on continuous functions that maps them to even func-
tions by
f (x) + f (−x)
Pe [f (x)] = .
2
Correspondingly there exists also an operator that maps onto odd functions

f (x) + f (x) f (x) + f (−x) f (x) − f (−x)


Po [f (x)] = (I − Pe )[f (x)] = − = .
2 2 2

4.1.2 Basic Properties


Definition 4.3 (Bounded) Is the norm of an operator finite, then we call this operator
bounded (Ger.: beschränkt):

kA[.]kp,ind < M < ∞, or, kA[x]kp < M kxkp .

Theorem 4.1 (Bounded and Continuous) A linear operator A : X → Y is bounded,


i.e., kA[x]k ≤ M kxk if and only if it is continuous, i.e., kA[x] − A[x + dx]k ≤ Lkdxk for
some finite and positive M and L.

Proof:
Let us assume A is bounded, then we find that

kA[x] − A[x + dx]k = kA[dx]k ≤ M kdxk

for all x from X. This however, is equivalent to being continuous. Now starting with
continuity, we find
kA[dx]k = kA[x] − A[x + dx]k ≤ Lkdxk
and we conclude boundedness.
This theorem is true for linear operators, thus also for linear functionals.

Lemma 4.1 Inner




are continuous. I.e., if xn → x is true in an inner product
products
space S, then xn ,y → x,y for y from S.
1
The name Hermitian originates from the French mathematician Charles Hermite (24.12.1822-
14.1.1901).
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 153

Proof:
If xn converges, it must also be bounded, thus

kxn k ≤ M < ∞.

Then we have:



k xn ,y − x,y k = | xn − x,y | ≤ kxn − xkkyk.



Since kxn − xk → 0, then xn ,y converges towards x,y .

With such technique we can also show the following: Let f (x) = hx,g(x)i be a functional,
then f (x) is continuous if g(x) is bounded.
Proof:
| hxn ,gi − hx,gi | = | hxn − x,gi | ≤ kxn − xk kgk .
If kxn − xk → 0, then f (xn ) = hxn ,gi converges towards f (x) = hx,gi.

Such techniques are important for the inverses of linear operators.

Notation: To simplify matters, we denote by A0 [.] = I[.], the identity operator. An


n-times concatenation A[A[.]] = An [.]. The inverse operator is defined as A−1 [A[.]] = I[.].

Theorem 4.2 (Inverse Operator) Let k · k be an operator norm satisfying the submul-
tiplicative property and A[.] : X → X a linear operator with kA[.]k < 1. Then (I − A)−1
exists2 and:

X
−1
(I − A) = Ai ,
i=0

X
A−1 = (I − A)i .
i=0

Proof:
Let kAk < 1. If I − A is singular then there is at least one vector x unequal to 0 so that
(I − A)[x] = 0. Thus we also have x = A[x] and

kxk = kA[x]k ≤ kAkkxk.

In this case we must have kA[.]k ≥ 1 which is a contradiction. By successive application


(multiplication for matrices) we have:

(I − A)(I + A + A2 + .. + Ak−1 ) = I − Ak .
2
Note that we use the matrix notation Ai to describe i successive applications of the operator A[.].
154 Signal Processing 1

Since
kAk k ≤ kAkk
(by the submultiplicative property) and kAk < 1 it must be true that

lim kAkk = 0.
k→∞

And therefore also: " ∞


#
X
i
(I − A) A = I.
i=0

Note that A[.] must be square: X → X.

Example 4.7 Consider the following linear operator:


   x 
x
A = y2∗ .
y 2

Obviously, this operator is bounded:


x
     
1 x
x → kA[.]k = 1 .

A = 2 =
y∗
y
2
2 y 2
And since it is square, its inverse must exist in the form:

X
(I − A[.])−1 = Ai .
i=0

We thus have to show that


  ∞  
−1 a X
i a
(I − A[.]) = A ,
b b
i=0

or, equivalently
∞    
X
i a a
A (I − A) = .
b b
i=0

We intend to show the latter. Note that for even exponents we find:
   
0 a a
A =
b b
   a 
2 a 4
A = b
b 4
a
   
2k a 22k
A = b ,
b 22k
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 155

and for odd exponents, we find:


a
   
a 2
A = b∗
b 2
a
   
3 a 2∗3
A = b
b 23
a
   
2k+1 a 22k+1
A = b∗ .
b 22k+1

We can thus already compute


a
   
a
(I − A[.]) = 2
b∗ .
b b− 2

We now have to compute the infinite sum:


∞   ∞  X∞  
X
i a X a 2i 2i+1 a
A = A + A
b b b
i=0 i=0 i=0
∞   ∞  
X 1 a X 1 a
= + ∗
2 2i b 2 2i+1 b
i=0 i=0
       
1 a 1 1 a 4 a 2 a
= + = + .
1 − 41 b 2 1 − 14 b∗ 3 b 3 b∗

Finally,
∞   ∞  a
  a
  a
  
X
i a X
i 4 2 a
A (I − A) = A 2
b∗ = 2
b∗ + 2 = .
b b− 2 3 b− 2 3 b∗ − b
2
b
i=0 i=0

4.1.3 Spaces of linear Operators


Note however, that most of the shown properties, in particular the projection properties
are not limited to matrices!

Definition 4.4 (Range or Column Space) The vector space, spanned by the columns of
a matrix A = [a1 ,a2 ,..,an ] : X → Y is called its column space or range (Ger.: Spaltenraum
von A):

R(A[.]) = span ({a1 ,a2 ,..,an })



= y ∈ Y : A[x] = y for x ∈ X .

The second row is the more general form and describes the column space of a linear operator.
156 Signal Processing 1

Definition 4.5 (Row Space) The vector space spanned by the (conjugate complex) rows
of a matrix A = [bT1 ; bT2 ; ...; bTn ] : X → Y , is called row space of A (Ger.: Zeilenraum) or
column space of the adjoint operator A∗ [.]:
R(A∗ [.]) = span ({b∗1 ; b∗2 ; ...; b∗n })
= x ∈ X : A∗ [y] = x for y ∈ Y .


Note that the Hermitian of a matrix is a special form of the adjoint (Ger.: adjungierter)
linear operator: A∗ [x] = AH x.
Definition 4.6 (Adjoint Operator) An adjoint operator for a given operator A[.]
satisfies the following property:

A[x],y = x,A∗ [y] .




Note that x and y can be of different size and even different type! While the adjoint op-
erator for a matrix is very simple to find (A∗ [·] = AH ), it is in general difficult to determine.

Example 4.8 (Adjoint Operator) Let us consider the following linear operator A[.] that
maps a continuous function x(t) ∈ C[−∞,∞] to a sequence xk ∈ l2 (Z
Z).
Z k+ 1
2
A[x(t)] = x(t)dt = zk .
k− 12

Now consider the inner product



X
hA[x(t)],yk i = zk yk∗ .
−∞

We like to find an adjoint operator A∗ [.] so that


Z ∞

hA[x(t)],yk i = hx(t),A [yk ]i = x(t)x̂(t)dt
−∞

with x̂(t) = A∗ [yk ]. For this to find we start with



X ∞ Z
X k+ 21
hA[x(t)],yk i = zk yk∗ = x(t)yk∗ dt.
−∞ −∞ k− 12

P∞ R k+ 21 R∞
We now recognize that −∞ k− 12
= −∞
and we obtain the same result if we display x̂(t)
as small parts of unit length (T = 1). We thus find

X
x̂(t) = yk rec(t − k)
−∞
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 157

with rec(t) defining a rectangular unit pulse from − 21 to 12 .

Definition 4.7 (Self-Adjoint Operator) A self-adjoint operator for a given operator


A[.] satisfies the following property:

A[x],y = x,A∗ [y] = x,A[y] .





Definition 4.8 (Nullspace) The vector space defined by the solutions A[x] = 0 of a linear
operator A[.] : X → Y is called nullspace N (A) of A (Ger.: Nullraum). It is also called
kernel (Ger.: Kern) of the operator: ker(A[.]).

Definition 4.9 (Left Nullspace) The vector space defined by the solutions A∗ [y] = 0 of
a linear operator A[.] : X → Y is called nullspace N (A∗ ) of A∗ [.] or left nullspace (Ger.:
linker Nullraum), equivalently ker(A∗ [.]).

Example 4.9 Let A be a linear matrix operator with


 
1 0 0
A= 1 0
 1 .
0 0 0

Then the column space and nullspace of A are given by


   
 1 0 
R(A) = span   1  ,  1  
0 0
 

and  


 0 
N (A) = span   1  .
0
 

Now we reformulate the matrix into its adjoint and obtain the row space and the left
nullspace of A:
   
 1 1 
H
R(A ) = span   0  , 0  

0 1
 
 
 0 
N (AH ) = span   0   .
1
 
158 Signal Processing 1

Example 4.10 Consider the convolution:


Z t
L(x(t)) = x(τ )h(t − τ )dτ.
0

The nullspace of the linear operator L (linear functional) consists of all functions x(t) which
convolved with h(t) result in zero. In the Fourier domain these are the functions X(jω)
that have no overlap with H(jω). Thus:

N (L(.)) = {x(t)|H(jω)X(jω) = 0}.

Let us revisit Example 2.33 that we can solve now. We were given a set of equations

Rg = 0.

As the columns in R are linearly dependent, we now understand that the desired solution
must come from the nullspace of R. By properly selecting the number of channel taps (m)
and the number of observations N , we can find unique solutions, that is there is only one
vector that spans the nullspace.

Another important example of nullspaces occurs when solving linear sets of equations.
Let vector b be from the column space of A. Then we have: A linear combination of the
columns of A must be exactly b : Ax = b.

Given Ax = b,

• There is exactly one solution if b is in the column space of A and the columns are
linearly independent.

• There is no solution if b is not in the column space of A.

• There is an infinite amount of solutions if b is in the column space of A, and its


columns are not linearly independent.

Proofs follow later...


Based on these definitions, we can already state the following relations that are valid
for all linear operators A[.] : X → Y , i.e., for each x ∈ X and y ∈ Y , the operator maps
A[x] = y such that

R(A[.]) ⊂ Y
R(A∗ [.]) ⊂ X
N (A[.]) ⊂ X
N (A∗ [.]) ⊂ Y
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 159

Example 4.11 In this (somewhat larger) example we want to show the importance of un-
derstanding the nullspaces. We consider the application hands free telephone, a common
feature of a handset today, not only used in private conversations but predomiantly in video
conferences and remote team work. It took however, a large amount of research work in the
80ies and 90ies to make that perfectly working what we use today. Consider the setup as
shown in Figure 4.1. The far end speaker signal enters the room via the loudspeaker, is re-
flected in the room and returns into the microphone signal of the local speaker together with
his voice. Both signals are transfered to the far end speaker where his own signal appears
as echo. In case the far end speaker is also using a hands free telephone the loop is closed
and a load sinusoidal sound is audible (Ger.: Rückkopplungspfeifen). An adaptive filter can
estimate the loudspeaker room impulse response and reconstruct the echo signal in order
to subtract it. Important are the very long impulse responses of typically several thousand
taps depending on the room size. In this application the local speaker is the disturbance (for
the adaptive filter estimation) but as it is the signal of interest to be transmitted it requires
special treatment.


ZZ 
 J

Z J

 Z J

 Z J
 Z

 Z J


Z J 
Z J

 ZJ


~
Z
-^
Z
J
Z 
>



@ J Z 

@ J Z 
J Z 

Z 
J Z 

J Z 

J  Z
J
 Z

Figure 4.1: Loudspeaker-Room-Microphone system.

Such application belongs to the problems of system identification. As shown in Fig-


ure 4.2 the path of the echo is an unknown linear impulse response wk . By observing the
input xk and output signal yk of the system, the adaptive filter learns the impulse response
of the system. With the known input signal the filter can reconstruct the echo without the
signal of interest and by subtraction obtain a clean echo-free signal ek . To understand the
basic principle, assume a (at least WSS) signal at the input of an LTI system with impulse
response w(t). The correlation of input and output signal is:
Z
rxx (t)w(τ − t)dt = rxy (τ ),

with the autocorrelation function rxx (t) of the input signal and the crosscorrelation rxy (τ )
160 Signal Processing 1

vk

yk
h
xk -
? - ? - ek

6




 ŷk
-
ŵ





Figure 4.2: System identification.

describing the correlation between input and output. Such integral can be discretized and
written in form of a linear set of equations:

Rxx w = rxy .

Not knowing the correlation terms, the linear system of equations can be approximated by
observations: ! !
n n
1X 1 X
x xT w = x yk .
n k=1 k k n k=1 k
In order to solve a system of order m(dim(w) = m), n must be at least m. In practise,
often n = 2m and xk must be persistent exciting! Then we find
n
!−1 n
!
1X 1 X
w= x xT x yk .
n k=1 k k n k=1 k

The problems with nullspaces, however, only became an issue once stereo applications
were to be developed. Simply replicating the well-working techniques for single microphone
systems turned out not to work. This was first quite puzzling as a clear explanation was
not found and not obvious. It took a few years and the right formulation of the problem to
understand it better.

For this consider now the problem depicted in Figure 4.3 where the far end speaker signal
(1)
sk is recorded at two microphones and transmitted individually over some channel gk and
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 161

gk(1) gk(1)
s s xk(1)
M F

gk(2) gk(2) xk(2)

Figure 4.3: Problem of stereo echo compensation. The left figure explicitly shows the air
transmission from source to microphone, a microphone converting air pressure into electrical
signals and an electrical transmission path F . On the right figure this is abstracted as one
(1)
transmission path with impulse response gk .

(2) (1) (2)


gk in order to be reproduces at two loudspeakers with signals xk and xk , respectively.
Let us assume the two channels are given in vector notation of finite dimension L
(i) (i) (i)
g (i) = [g0 ,g1 ,...,gL−1 ]T ; i = 1,2.
The corresponding outcomes of the loudspeakers at time instant k are then given by
(i) (i) (i) (i)
xk = [xk ,xk−1 ,...,xk−L+1 ]T ; i = 1,2.
If we write the original source signal sk in form of a Hankel matrix:
 
sk sk−1 ... sk−L+1
 sk−1 sk−2 
Sk =  ,
 
.. ...
 . 
sk−L+1 sk−2L+2
we find a very compact notation to describe the input-output relation
(i)
xk = Sk g (i) ; i = 1,2.
We nicely recognize here that the same source signal is modified via the two different chan-
nels into two different versions of output signals. Now note the following identities:
(i)T (j)T (i)
xk g (j) = g (i)T SkT g (j) = g (i)T Sk g (j) = g (j)T Sk g (i) = xk g .
(1) (2)
We now construct a composite vector out of the two vectors xk and xk , i.e.,
(1)T (2)T
xTk = [xk ,xk ]. If we now continue as in the previous setup and compute the autocorre-
lation matrix,
n n 
1 X Sk g (1) g (1)T Sk Sk g (1) g (2)T Sk

1X T
Rxx = xk xk = ,
n n Sk g (2) g (1)T Sk Sk g (2) g (2)T Sk
k=1 k=1
162 Signal Processing 1

we find that this so constructed autocorrelation matrix must be singular! The proof of this
fact is easily performed by constructing a vector uT = α[g (2)T , − g (1)T ]. Applying this vector
u from the right to the autocorrelation matrix , we find that the result is zero. In other words
the vector lies in the nullspace of Rxx . Clearly, solving the stereo problem with the same
techniques wouldn’t work. Eventually, other techniques were developed that circumvented
this problem and made hands free telephony also possible for stereo applications.

Definition 4.10 (Inner Sumspace) Let V and W be linear subspaces, then the space
S = V + W is called inner sumspace consisting of all combinations x = v + w.

Definition 4.11 (Direct Sumspace) Let V and W be linear subspaces. The direct
sumspace is constructed by the pairs (v,w).

Definition 4.12 (Orthogonal Subspace) Let S be a vector space and V and W both
subspaces of S. V and W are called orthogonal subspaces if for each pair v from V and w
from W we have: hv,wi = 0.

Definition 4.13 (Orthogonal Complement) Let V be a subset of a vector space S with


inner product. The space of all vectors orthogonal to the vectors in V is called orthogonal
complement (Ger.: orthogonaler Komplementärraum) and is denoted:

V ⊥ = W.

Example 4.12 Let us consider a set V = {(0,0,0),(1,0,0)} in GF(2)3


with exclusive-or (antivalence) as operation from the vector space S =
3
{(0,0,0),(1,0,0),(0,1,0),(0,0,1),(0,1,1),(1,1,0),(1,0,1),(1,1,1)}=GF(2) . Its orthogonal
complement is given by W = V⊥ = {(0,0,0),(0,1,0),(0,0,1),(0,1,1)}. If we merge the two
spaces we obtain

V ∪ V ⊥ = {(0,0,0),(1,0,0),(0,1,0),(0,0,1),(0,1,1)},

obviously not the entire vector space V . If on the other hand we build the linear hull with
these elements, we find span(V ∪ V ⊥ ) = S, a result we also obtain by the inner sumspace:
V + V ⊥ = S. If, however, we build the direct sumspace, we obtain something entirely
different:
 
⊥ (0,0,0,0,0,0),(1,0,0,0,0,0),(0,0,0,0,1,0),(1,0,0,0,1,0),
V ⊕V = .
(0,0,0,0,0,1),(1,0,0,0,0,1),(0,0,0,0,1,1),(1,0,0,0,1,1)

Example 4.13 Let be S =GF(2)3 . The vectors v = (1,0,0) and w = (0,0,1) are from S.
They span the subspaces V and W : V = span(v) = {(0,0,0),(1,0,0)} and W = span(w) =
{(0,0,0),(0,0,1)}. Both spaces are orthogonal subspaces. The subspace V has the orthogonal
complement :
V ⊥ = {(0,0,0),(0,1,0),(0,0,1),(0,1,1)} .
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 163

The vectors that span the orthogonal complement are:

V ⊥ = span ({(0,1,0),(0,0,1)}) ,

or any pair that does not include the zero vector. As in the previous example, merging V
with its orthogonal complement does not make for the entire vector space S. For this to
obtain we need to take the inner sumspace of the two.

Note: let be v from V and w from W . Assume that V and W are orthogonal comple-
ments in S. Then we do not necessarily have:

V ∪ V ⊥ = S,
span V ∪ V ⊥ = S,


V + V ⊥ = S,

although this may appear intuitively. Typically for such properties we need complete
spaces (Cauchy series!).

We can also now return to the concept of projections as we understand them even better.
We already mentioned that there are different kind of projections.

Definition 4.14 In an orthogonal projection its range and its nullspace are orthogonal
subspaces, in oblique (non-orthogonal) projections, this is not the case.

Example 4.14 Consider matrix  


0 0
A= .
α 1
Its range is given by  
0
R(A) = span ,
1
and its nullspace  
1
N (A) = span ,
−α
Only for α = 0, we have an orthogonal projection.

Note that the eigenvalues of a projection matrix are either zero or one.

In general the following statements are true.

Theorem 4.3 Let be V and W two subspaces of a vector space S (not necessarily a com-
plete one) with inner product. Then we have:
164 Signal Processing 1

1. V ⊥ is a complete subspace of S.
2. V ⊂ V ⊥⊥ .
3. If V ⊂ W then we have: W ⊥ ⊂ V ⊥ .
4. V ⊥⊥⊥ = V ⊥
5. If x ∈ V V ⊥ then x = {∅}.
T

6. {∅}⊥ = S,S ⊥ = {∅}.


Proof (Part 1):
Let be xn a series of vectors in V ⊥ with xn → x and v from V . Because of the continuity
property of the inner product (Lemma 4.1) we have:



xn ,y = 0 → x,y = 0; for v ∈ V.
Thus, x ∈ V ⊥ .

Exercise: try to prove the remaining properties by yourself.

Theorem 4.4 Let be A : X → Y be a bounded linear operator between two Hilbert spaces
X and Y and R(A[.]) as well as R(A∗ [.]) complete subspaces. Then we have:
N (A[.]) = R(A∗ [.])⊥ , N (A∗ [.]) = R(A[.])⊥ ,
R(A[.]) = N (A[.]∗ )⊥ , R(A∗ [.]) = N (A[.])⊥ .
Proof: Let us have a look on the first property. Let x ∈ N (A[.]). But then also
< A[x],y >=< x, A∗ [y] >= 0 holds for all y. This means that x is orthogonal to A∗ [.], and
thus x ∈ R(A∗ [.])⊥ . Then, since x ∈ N (A[.]), also N (A[.]) ⊆ R(A∗ [.])⊥ must hold.
If the argumentation is started with x ∈ R(A∗ [])⊥ , it can be concluded that x ∈ N (A[.]),
and thus R(A∗ [.])⊥ ⊆ N (A). Accordingly, the only possibility is that N (A) = R(A∗ [.])⊥ .

Exercise: Derive the other three properties in complete analogy.


Lemma 4.2 The following holds:
R(A∗ [.]) = R(A∗ [A[.]]).
Proof: With the previous theorem in mind, it is sufficient to show that N (A[.]) =
N (A∗ [A[.]]). Let x ∈ N (A∗ [A[.]]), then A∗ [A[x]] = 0. Thus, also hA∗ [A[x],xi = kA[x]k22 = 0.
However, A[x] = 0 is equivalent to x ∈ N (A[.]), and accordingly N (A∗ [A[.]]) ⊆ N (A[])
holds. Now we start the other way around: let x ∈ N (A[.]), then also A[x] = 0
and A∗ [A[x]] = 0 hold. This means that x ∈ N (A∗ [A[.]]) and consequently N (A[.]) ⊆
N (A∗ [A[.]]). For this reason, only N (A[.]) = N (A∗ [A[.]]) remains, and so also R(A∗ [.]) =
R(A∗ [A[.]]) holds.
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 165

Theorem 4.5 (Fredholms Alternative Theorem, Fredholmscher Alternativsatz)


Let A[.] be a bounded, linear operator. The equation A[x] = b has at least one solution if
and only if hb,vi = 0 for every vector v from N (A∗ [.]) : hv,A[.]i = 0. More precisely:

b ∈ R(A[.]) ≡ b ⊥ N (A∗ [.]).

Particular for matrices: the equation Ax = b has (at least) one solution if and only if
H
b v = 0 for each vector v for which AH v = 0.

Proof:
Let A[x] = b and v ∈ N (A∗ [.]). Then hb,vi = hA[x],vi = hx,A∗ [v]i = hx,0i = 0. Consider
now that hb,vi = 0 if v from N (A∗ [.]), but A[x] = b has no solution. Since b is not from
R(A[.]), we assume that b = br + b0 , with br from R(A[.]) and let b0 be orthogonal to the
vectors from R(A[.]). Thus, we have hA[x],b0 i = 0 for all x and thus A∗ [b0 ] = 0. Moreover,
if b0 is in N (A∗ [.]) and

0 = hb,b0 i = hbr + b0 ,b0 i = hbr ,b0 i + hb0 ,b0 i = hb0 ,b0 i → b0 = 0.

b is therefore from R(A[.]).

Example 4.15 Let us consider the following example. We want to solve the set of linear
equations    
1 4 5
Ax = 2 5 x = 7  .
  
3 6 9
We analyze matrix A and find
   
 1 4 
R(A) = span   2 , 5  
 
3 6
 

and   
 −1 
N (AH ) = span   2   .
−1
 

As b ∈ R(A) we conclude that there must be at least one solution. As the nullspace contains a
vector different from the zero vector, we conclude that there are infinite amounts of solution,
i.e., the solution is non unique. According to Fredholm’s theorem, we can simply test this
by computing the inner product of the vector from the left null space and the right hand side
b and we find that it to be zero. We thus must have at least one solution.
166 Signal Processing 1

Theorem 4.6 The solution of A[x] = b is unique if and only if the unique solution of
A[x] = 0 is x = 0, thus N (A[.]) = {∅}.

Proof
Let us assume the solution x of A[x] = b is not unique. Then a second solution must exist,
say x + ∆x for which we obtain the same right hand side, i.e.,

A[x + ∆x] = b.

Due to the linearity of operators, we have A[x + ∆x] = A[x] + A[∆x]. As A[x] = b, we
conclude that A[∆x] = 0. Thus, ∆x must come from the nullspace of operator A. If the
nullspace contains only the zero vector, this is indeed true. If the nullspace would contain
other vectors, it would not hold.

Definition 4.15 (Rank) The rank (Ger.: Rang) of an operator A[.] is defined by the
dimension of its column space (row space), thus the number of linearly independent columns
(or rows).

Example 4.16 Let an m × n matrix A be of rank r. Then we conclude:

dim(R(A[.]) = r; dim(R(A∗ [.])) = r


dim(N (A[.]) = n − r; dim(N (A∗ [.])) = m − r.

Remember for this the definition of dimension (Definition 2.23).

Example 4.17 Let us consider the following 2 × 3 matrix


 
1 4 2
A= .
5 20 5

We find
     
1 2 T 0
R(A) = span , ; N (A ) = span .
5 5 0

and
       
 1 5   −4 
R(AT ) = span   4  ,  20   ; N (A) = span   1   .
2 5 0
   

The rank of A is r = 2 as there is two linearly independent vectors spanning the range and
column space of the matrix.
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 167

Definition 4.16 (Full Rank) An m × n matrix is called of full rank if rank(A) =


min(m,n). If a matrix is not of full rank then it is called rank-deficient.

Theorem 4.7 For matrix products AB we have:

N (B) ⊂ N (AB)
R(AB) ⊂ R(A)
N (AH ) ⊂ N ((AB)H )
R((AB)H ) ⊂ R(B H ).

Proof, part 1:
If B[x] = 0, then A[B[x]] = 0. Thus, every x from N (B) is also in N (AB).

Exercise: Proof the remaining parts in an analogue way.

Note that in consequence to the 2nd and 4th property we find:

rank(A[B[.]]) ≤ rank(A[.]),
rank(A[B[.]]) ≤ rank(B[.]).

Example 4.18 Let us reconsider LS solutions for sets of linear equations. In a first ex-
ample we assume an overdetermined system Ax = b, i.e.,
   
a11 ... a1N   b1
... x1
 a21
  
..  =  b2 
.
 
 ..  .   ..
 .   . 
xN
aM 1 ... aM N bM

As the system is overdetermined we have M > N . If we apply now LS, we find

AH Ax = AH b.

Now the matrix is AH A and of dimension N × N . If AH A is of full rank N , LS delivers


a uniques solution, as AH b on the right side is now in the range of AH ! If, on the other
hand, rank(AH A) < N , LS cannot solve the problem. A solution for such an ill conditioned
problem is regularization:
AH A + I x = AH b,


in which a small, positive  is introduced. Its value is chosen so that the system of equation
can be solved numerically. Note however, with such regularization, the problem has been
168 Signal Processing 1

altered (slightly) and thus also its solution.

Let us now do the same for the corresponding underdetermined problem, i.e., N > M .
The underdetermined LS solution is given by

xLS = AH (AAH )−1 b.

If rank(AAH ) = M , the solution is the well-known minimum norm solution. If, on the
other hand, rank(AAH ) < M , the inverse cannot be computed. Again, regularization of
AAH may help in this case.

4.2 Matrix Basics


In the following we will restrict ourselves onto matrices as linear operators. The most
common application of matrices is a set of linear equations that needs to be solved. A
linear system of equations Ax = b can be solved in many different ways. Hereby the
numerical precision is in most cases an important factor for the quality of the result. There
are numerous methods for matrix equations that convert the general problem Ax = b in
an equivalent problem Bx = c in which the matrix B exhibits particular properties so that
the system can be solved easily. In the following we will shortly present the major ideas
without going into the details of each method.

• LU: stands for lower- and upper-triangular. A = LU can be solved easier since
LU x = b : U x = c and Lc = b, i.e., two linear systems of equations, that are easy to
solve.

• Cholesky: a particular solution of LU factorisation for Hermitian, positive-definite


matrices A = LU = LLH . In general such matrices can be decomposed into LDLH ,
U DU H or QDQH . Here, Q is a unitary matrix: QQH = I and D is a diagonal matrix.
In case of QDQH it is called eigenvalue decomposition.

• QR: A = QR. Here, Q is a unitary matrix: QQH = I and R = U is an upper


triangular matrix. A = QR is easier to solve since QRx = b : Rx = c and Qc = b →
c = QH b, i.e., two sets of equations that are easy to solve.

• SVD: Singular Value Decomposition. A = U ΣV H with two unitary matrices U and


V and the diagonal matrix Σ.

In the following after some matrix basics, we treat the eigenvalue decomposition first
and we will then show the singular valued decomposition.
Let A be an m × m matrix from C l . Consider the linear equation

Au = λu
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 169

or equivalently
(A − λI)u = 0.
Here the trivial solution u = 0 is not of interest but the nullspace of A − λI. Partic-
ular values λ, generating non-trivial nullspaces, are called eigenvalues. The corresponding
vectors u, are called eigenvectors.

Definition 4.17 The polynomial in λ, generated by the determinant of (A − λI) is called


characteristic polynomial.
The equation det(A − λI) = 0 is called characteristic equation of A.
The roots of the characteristic equation are called eigenvalues. The set with all eigen-
values is called the spectrum of A.

Example 4.19 Let a linear, time invariant system be described by the following equation
in state space:

z k+1 = Az k + Bxk
y k = Cz k
= H(q)[xk ].

We find the equivalent linear operator to be H(q) = C(q)(qI − A)−1 B(q)[xk ]. In case
of matrices A,B,C we can write H(q) = C(qI − A)−1 B. Since the matrix inversion of
(qI−A) determines the dynamic and stability behavior of the system, so does its determinant
det(λI − A).

Lemma 4.3 If the eigenvalues of a matrix A are all different then the corresponding eigen-
vectors are all linearly independent.

Proof:
We start with m = 2 and the opposite: Let us assume the eigenvectors u1 and u2 are
linearly dependent.

c1 u1 + c2 u2 = 0
c1 Au1 + c2 Au2 = c1 λ1 u1 + c2 λ2 u2
−| c1 λ2 u1 + c2 λ2 u2 = 0
c1 (λ1 − λ2 )u1 = 0.

Since λ1 and λ2 are different and u1 is not the zero vector, we must have c1 = 0. A similar
argument leads to c2 = 0. This proves the two eigenvectors are linearly independent. For
m > 2 we consider always the case that two vectors are linearly dependent and prove the
contradiction.
170 Signal Processing 1

Example 4.20 If the eigenvalues are not different, the eigenvectors can be linearly depen-
dent or not. Consider the following matrices:
   
4 0 4 1
A= ; B=
0 4 0 4
Both matrices have the same eigenvalues λ1 = λ2 = 4. The eigenvectors of matrix A are
linearly independent, those of B not.

Example 4.21 Check whether the following theorem holds: ‘A square n × n matrix A has
linearly independent columns if and only if all of its eigenvalues are non-zero.’

For each eigenvector uk we have:


Auk = λk uk ; k = 1,2,...,n.
If the eigenvalues are zero, we have Auk = 0, the columns are thus linearly dependent! If
the columns are linearly independent, then for all uk 6= 0 we must have that:
Auk 6= 0
and thus the corresponding eigenvalue λk 6= 0. Thus, the theorem is true.

Now consider the general decomposition for square matrices A = U ΛU −1 of size n × n


with the diagonal matrix Λ. Let us assume that A has n linearly independent eigenvectors
U = [u1 ,u2 ,...,un ]. Then we find
[Au1 ,Au2 ,...,Aun ] = [λ1 u1 ,λ2 u2 ,...,λn un ],
or more compactly
AU = U Λ.
If the eigenvectors are linearly independent, U can be inverted and thus: A = U ΛU −1 .

Remark: Two matrices are called similar if they have the same eigenvalues. A matrix
transformation that does not change the eigenvalues is called a similarity transformation
(Ger: Ähnlichkeitstransformation), see also Section 1.1.4.

Wit such diagonalisation we can compute complicated forms such as eA . We find:


Am = U Λm U −1 !
X X
f (A) = fi Ai = U f i Λi U −1
i i

!
X Λi
eA = U U −1 = U eΛ U −1 .
i=0
i!
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 171

If the eigenvalues are not all linearly independent, a diagonalisation is not possible.
However, a close to diagonal form is possible, the so-called Jordan form: A = T JT −1 . Here
matrix J is of blockdiagonal form with blocks Ji :
 
λi 1
 λi 1 
Ji =  . (4.1)
 
. .
 . 1 
λi

Example 4.22 Consider the following matrix


 
3 0 1
B =  0 3 0 .
0 0 3

It has a single eigenvalue λ = 3 and two linearly independent eigenvectors:


   
1 0
u1 = 0 ;
  u2 = 1  .

0 0

We find  
3 1 0
J(B) =  0 3 0  ,
0 0 3
and we need to complement T to
 
1 0 0
T =  0 1 1 .
0 1 0

By adding a linearly independent vector, we can guarantee T to be regular again and thus
its inverse exists. Note that as before we have B m = T J m T −1 but now J m does not preserve
its particular form, thus is not necessarily diagonal or of Jordan form.

Theorem 4.8 (Cayley Hamilton) Each square matrix satisfies its own characteristic
equation.

Proof:
Not too difficult. Try it yourself.

Definition 4.18 (anihilating polynomial) A polynomial f is called anihilating polyno-


mial of a square matrix A if: f (A) = 0.
172 Signal Processing 1

Definition 4.19 (minimal polynomial) The anihilating (monic) polynomial of A with


smallest degree is called minimal polynomial of A.

Example 4.23 Consider the following matrices:


     
4 5 1 6 1
A1 =  4 ; A2 =  5 ; A3 =  6 1 .
4 5 6

The corresponding minimal polynomials are:

f1 (x) = x − 4; f2 (x) = (x − 5)2 ; f3 (x) = (x − 6)3 .

Note that the order of the polynomial depends on the size of the Jordan block.

4.3 Hermitian Matrices and Subspace Techniques


Hermitian matrices AH = A ∈ C l , or if real-valued, symmetric matrices, AT = A ∈ IR are
square matrices that occur often in signal processing problems. If second order statistics
are included, for example, the resulting covariance matrices are Hermitian by construction:
R = E[xxH ], see also Example 4.27.
Definition 4.20 (Symmetric and Hermitian) If A = AT , for A from IR, then the ma-
trix A is called symmetric (Ger.: symmetrisch). If A = AH , for A from C
l , then the matrix
A is called Hermitian (Ger.: Hermitesch oder Hermitsch).
Tightly connected with Hermitian matrices are unitary matrices as they are the modal
matrices that diagonalize Hermitian matrices: U H AU = Λ.
Definition 4.21 (Unitary) A matrix with the property U H U = I is called unitary (Ger.:
unitär). A matrix with U T U = I is called orthogonal.
Note: if U is an n × n matrix, it follows that U H = U −1 . Note also that U U H = I as
well.
A consequence for square unitary matrices is that they are energy preserving and that
all eigenvalues must be one in magnitude. Sometimes we find in literature also this term
describing matrices that are only a part of a unitary matrix. If only a subset of the columns
are taken, we call this rectangular shaped matrix semi-unitary. We find: U H U = I but
U U H 6= I.
The eigenvalues of unitary matrices must be one in magnitude, otherwise input vectors
exist that result in smaller or larger outputs.

Lemma 4.4 (Eigenvalues of Hermitian Matrices) The eigenvalues of Hermitian ma-


trices are real-valued.
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 173

Proof:

hAu,ui = λ hu,ui
= u,AH u = hu,Aui

= λ∗ hu,ui .

Thus, λ = λ∗ .

Lemma 4.5 (Eigenvectors of Hermitian matrices) The eigenvectors to different


eigenvalues of Hermitian matrices are orthogonal.

Proof:
Let λ1 and λ2 be two different eigenvalues with corresponding eigenvectors u1 and u2 . Then
we have:

hAu1 , u2 i = u1 ,AH u2 = hu1 ,Au2 i = hu1 ,λ2 u2 i



= λ2 hu1 , u2 i = λ1 hu1 , u2 i .

Thus: (λ2 − λ1 ) hu1 , u2 i = 0 and therefore hu1 , u2 i = 0.

Lemma 4.6 Hermitian matrices can always be diagonalized by unitary matrices.


X
A = U ΛU H = λi ui uH
i .
i

Proof:
Consider an arbitrary matrix S that diagonalizes a Hermitian matrix A: S −1 AS = Λ. Due
to the Hermitian property we have AH = A but the eigenvalues remain real: ΛH = Λ:
H
S −1 ΛS = S H AS −1H = Λ = S −1 AS.

As a consequence we have S H = S −1 which means that S = U is unitary.3

In practical realizations it is often of advantage to first apply a unitary matrix U to the


input signals x, thus y = U x and then operate in the modal space of the matrix. Realizations
of unitary matrices in fixed-point processors are well understood. Once we are in the modal
space, operations may become simpler as we only have to deal with the diagonal elements
of Λ. For example, we want to compute the square root A of the Hermitian matrix B = A2 :
1 1
B = U ΛU H = U Λ 2 U H U Λ 2 U H = A2 .
3
Strictly, we prove here only that Hermitian matrices, if diagonalizable, then by unitary matrices. The
fact that Hermitian matrices can always be diagonalized is more difficult to prove and follows by the spectral
theorem of linear operators.
174 Signal Processing 1

1
We thus conclude that the square root of B is given by A = U Λ 2 U H , which exists as long
as none of the eigenvalues is negative. Those matrices B are positive semi-definite. As A
itself is Hermitian, we also find that B = A2 = AAH = AH A = A2H .
However, consider now an arbitrary rectangular shaped matrix à that is not Hermitian.
If we construct B = ÃÃH , we obtain a positive semi-definite Hermitian matrix B = B H .
If we now compute the square roots of B, we obtain different square shaped matrices:B =
AAH , say: A 6= Ã.

Example 4.24 In some applications where the square roots are of interest some additional
parts are added on the main diagonal. If white noise is involved a scaled identity matrix is
added, similarly if regularization is applied for numerical reasons:
1 1
A + δI = U (Λ + δI)U H = U (Λ + δI) 2 U H U (Λ + δI) 2 U H .

We can thus immediately observe the influence of δ > 0 on the square roots.
A puzzling result is obtained when we consider the following Hermitian positive semi-definite
matrix: B = 11T . We want to factorize:

B + γ 2 I = AAH .

We find the two eigenvalues of B to be zero and two. The corresponding eigenvectors are:
   
1 1 1
U=√ , .
2 −1 1
We can thus write  
2 2 + γ2
B+γ I =U UH.
γ2
This can now be factorized into
   T  T 2
γ2
  
1 1 1 1 γ
A= 1+ + .
1 1 2 −1 −1 2
The following property of linear operators is also very surprising. In general if an input
vector is applied to a matrix, all its columns are linearly combined. Only if zero entries
are in the input vector, some columns are left out. With some vectors, however, this is
different. When applying a linear combination of eigenvectors, a new linear combination of
such eigenvectors results. In other words, once we confine the input to be in a span of a
subset of eigenvectors, we stay in that subspace. The subspace is thus invariant.4

Definition 4.22 (Invariant Subspace) Let A be an n × n matrix and S a subspace of


R(A). S is called invariant subspace of A if for every x from S there exists an Ax from S.
4
The reader may recall the song ”Hotel California” by he Eagles. The refrain is ”You can check out but
you can never leave”. This is exactly what happens here.
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 175

Example 4.25 Let an n × n matrix A have k (smaller than n) different eigenvalues with
the corresponding eigenvectors ui , i = 1,2,...,n. Let U = [u1 , u2 ,..., un ] and Ui ; i = 1,2,...,k,
be the k subsets of eigenvectors, corresponding to the k eigenvalues λi , i = 1,2,...,k. The
subspaces span(Ui ) spanned by the subsets Ui are invariant subspaces of A. For example,
consider a 6 × 6 matrix with

λ1 has one eigenvector U1 = {u1 } ,


λ2 has two eigenvectors U2 = {u2 ,u3 } ,
λ3 has three eigenvectors U3 = {u4 ,u5 ,u6 } .

Consider now space S, spanned by the two eigenvectors u2 ,u3 , S = span ({u2 ,u3 }). We call
S an invariant subspace for A. Any vector build by the two eigenvectors: x = αu2 + βu3
results in a vector with the same property:

Ax = αAu2 + βAu3 = αλ2 u2 + βλ2 u3 ∈ S.

In other words, operating on the linear operator A results in staying in the same subspace.
The vector never leaves such subspace. Correspondingly, T = span ({u4 ,u5 ,u6 }) is also an
invariant subspace.

This property of invariant subspaces motivates the following decomposition concept.


If we put eigenvectors associated to identical eigenvalues into sets, we can describe the
operator (matrix) by such sets that themselves form invariant subspaces.

Theorem 4.9 Let A be an n × n Hermitian matrix with k (maximum n) different eigen-


values. Then we have:
• spectral decomposition: A = ki=1 λi Pi ,
P

• identity: I = ki=1 Pi ,
P

• with: Pi = uj ∈Ui uj uH
P
j .

The matrices Pi are projection matrices in the (invariant) subspace span(Ui ), spanned by
the normalized eigenvectors uj .

Proof:
We already know that Hermitian matrices can be diagonalized by unitary matrices, thus:
n
X
H
A = U ΛU = ui λi uH
i
i=1
n
X k
X
= λi ui uH
i = λi Pi .
i=1 i=1
176 Signal Processing 1

Note we have:
n
X k
X
ui uH
i = UU H
=I= Pi .
i=1 i=1

If we continue the previous example, we find that

A = λ1 u1 uH H H
+ λ3 u4 uH H H
 
1 + λ2 u2 u2 + u3 u3 4 + u5 u5 + u6 u6 .

Here we identify u1 uH1 = P1 the first projection. Note that eigenvectors are normalized
thus u1 u1 = u1 u1 /(uH
H H
1 u1 ) which reveals the projection operator. Analogously, we identify
u2 u2 + u3 q u = P2 and u4 uH
H H H H
4 + u5 u5 + u6 u6 = P3 .

Such spectral decomposition provides us with an insight what a Hermitian matrix does,
when applied to an input vector x:
k
X k
X X
Ax = λi Pi x = λi uj uH x.
i=1 i=1
|{z}
uj ∈Ui
| {zj}
stretching projectx

A Hermitian operator thus first projects the input onto its individual components (sub-
spaces) and then stretches (or compresses if λ < 1) the components. Finally all parts are
added together again.

Example 4.26 Consider the matrix


 
9 −2
A= ,
−2 6

with the two eigenvalues λ1 = 5 and λ2 = 10. It has the corresponding eigenvectors:
   
1 1 1 −2
u1 = √ ; u2 = √ .
5 2 5 1

We find the projections


  T  
1 1 1 1 1 2
P1 = =
5 2 2 5 2 4
  T  
1 −2 −2 1 4 −2
P2 = = .
5 1 1 5 −2 1

Combining them leads to


A = 5P1 + 10P2 .
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 177

Example 4.27 Consider a weakly stationary random process x with Hermitian autocorre-
lation matrix Rxx . The diagonalization of Rxx leads to:

Rxx = E xxH = U ΛU H .
 

We can linearly filter now such random process and obtain y = U H x. The autocorrelation
matrix Ryy of this new random process is given by

Ryy = E yyH = E U H xxH U = Λ.


   

We thus obtain a perfectly decorrelated random process. The eigenvalues can be interpreted
as the individual energy terms in such random process. If considering the eigenvalues one
realizes that some can be extremely small, thus do not have much part of the ACF matrix.
They could be neglected, approximating the process. In some applications such as image
processing the top ten eigenvalues may contain 99 percent of the energy. Such an approxi-
mation based on a few strong eigenvalues is called: Karhunen-Loeve description of random
processes x.

Consider now the following expression:


k
X k
X X
H H
x Ax = λi x P i x = λl xH uj uH
j x
i=1 i=1 uj ∈Ui
 
k
X X
= xH  λl uj uH
j
 x.
i=1 uj ∈Ui

By selecting the various eigenvectors for x = un , we obtain a single eigenvalue λn . For the
eigenvector umax associated to the largest eigenvalue λmax we obtain the largest eigenvalue
λmax and so on. The two extremes are thus:

xH Ax xH Ax
max = λmax ; min = λmin .
x xH x x xH x
Definition 4.23 (Rayleigh Quotient) The expression

xH Ax
r(A,x) =
xH x
is called Rayleigh quotient.
Note that such expression only makes sense when applied to Hermitian matrices. We find
the following important property:

λmin ≤ r(A,x) ≤ λmax .


178 Signal Processing 1

vk
xk yk
+ h

Figure 4.4: Eigenfilter transmission: knowing the statistic of xk , what is the optimal impulse
response hk to maximize the SNR?

Example 4.28 The matched filter (Ger.: Signalangepasstes Filter) is well known to max-
imize the signal to noise ratio of deterministic signals. If, however, the maximal signal
to noise ratio of random signals is considered, we speak of an eigenfilter. We consider
Figure 4.4 in which a random signal xk is additively corrupted by noise vk . We like to
design a filter hk so that the Signal-to-Noise Ratio (SNR) at its output is maximized. The
idea is thus to separate the signal from the noise as much as possible. We formulate the
convolution by inner vector products:

yk = hT xk + hT vk .

This allows to consider noise and signal components individually. The signal power at the
output of the filter is
P = hT E[xxH ]h∗ = hH Rxx h.
On the other hand, the noise power is simply given by: N = hT E[vvH ]h∗ = σv2 hH h. If we
want to maximize the SNR we have to compute

P hH Rxx h λmax (Rxx )


max = max 2 H = .
h N h σv h h σv2

The desired optimal solution can thus simply be found on the eigenvector h = umax that is
associated to the largest eigenvalue λmax .

Let us revisit Example 2.33 that we can solve now under the more realistic aspect of
(1)
additive noise in the transmissions. For this we assume that the received vectors rk and
(2) (i) (i)
rk are composed of the original vectors r̃k and noise terms v k :
(i) (i) (i)
rk = r̃k + v k ; i = 1,2.

The noise terms are filtered by the channel estimates g (1) and g (2) and appear in sum at
the output with σv2 (kg (1) k22 + kg (2) k22 ) as noise energy. If there is multiple solutions possible
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 179

we would certainly be interested in those that have the smallest noise influence that is the
smallest norm (minimum norm solution). Thus, rather than requiring the outcome of Rg to
be the zero vector, we want now kRgk22 to be as small as possible. As every scaled version
of g is acceptable, we can set the additional requirement on g to be of unit norm and obtain
a Rayleigh quotient problem:

g H RH Rg
g opt = arg min .
g kgk22

We thus have to find the smallest eigenvalue of RH R and select its eigenvector as solution.

Example 4.29 This technique can be used for filter design. Let us design a linear phase
filter with 2N +1 coefficients for which we are given the magnitude (Ger.: Amplitudengang)
Hd (ejΩ ):
N
X
−jN Ω −jN Ω
jΩ
H(e ) = e jΩ
HR (e ) = e cos(nΩ) = e−jN Ω bT c(Ω),
n=0

where we forced a linear phase filter due to symmetry constraints and we filled the one sides
filter impulse response (scaled by a factor of two) in vector b. A low pass filter is to design
with limit frequencies Ωp < Ωs so that

jΩ 1 ; 0 ≤ Ω ≤ Ωp
Hd (e = .
0 ; Ωs ≤ Ω ≤ π

Note that this formulation is much more flexible than the previous ones as we do not have
a single fixed frequency where the filter behavior changes rapidly from pass- to stopband
(Ger.: Sperrbereich) but instead we can have a large range between Ωp and Ωs where we
do not care about the filter slope. We find the signal energy in the stopband to:
Z π Z π
2 1 T
c(Ω)cT (Ω)dΩb = bT P b,
jΩ jΩ
Es = Hr (e ) − |Hs (e )| dΩ = b
Ωs π Ωs

where we introduced matrix P :


Z π
1
Pij = cos(iΩ) cos(jΩ)dΩ.
π Ωs

In the passband (Ger.: Durchlassbereich) we have for Ω = 0 : Hd (ejΩ ) = 1, or equivalently


bT 1 = 1. The error is given by: 1 − bT c(Ω) = bT [1 − c(Ω)]. The so obtained error energy is
to minimize:
1
Ep = bT [1 − c(Ω)][1 − c(Ω)]T dΩ b = bT Qb,

π
180 Signal Processing 1

where we introduced a matrix Q similar to P for the stopband. The entire filter problem
is thus given by:
J(α) = αEs + (1 − α)Ep = αbT P b + (1 − α)bT Qb = bT Rb
for 0 ≤ α ≤ 1 with the abbreviation R = αP + (1 − α)Q. Dependent on the chosen value of
α we can put more emphasis on the stop or the passband. Obviously, there is still freedom
in the choice of b. We can restrict this by normalizing in the form of bT b = 1. The filter
problem thus reads now:
min bT Rb ; with constraint bT b = 1,
or, equivalently
bT Rb
min = λmin (R).
b bT b
The problem can thus be solved by applying the Rayleigh quotient.

Example 4.30 (Coordinated Multipoint Problem 1) Consider the problem of a cel-


lular basestation for wireless transmitting to K users with N antennas when K > N . Is it
possible to form transmit vectors such that the main signal energy only goes to the desired
user and as little as possible to the others (leaking signal or interference). We model the
user channel by vectors h1 ,h2 ,...,hK and neglect noise in the transmission. To keep it simple,
we assume all signals to be real-valued. The individual received signal is then
ri = hTi x; i = 1,2,...,K.
Let us first consider the first user signal power:
|r1 |2 xT h1 hT1 x
= .
kxk22 kxk22
We are thus of interest to maximize the received signal power for a fixed transmission
power and find this to be x = h1 from Cauchy Schwarz (or Rayleigh quotient). However,
this solution ignores the interference terms entirely. Combining all remaining channels
h2 ,h3 ,...,hK into a matrix H, we are able to formulate the problem in terms of receive signal
power and leaking interference power.
kHxk22 kxk22
SLR = min = max .
x kxk22 x kHxk2 2

Maximizing such Signal-to-Leakage-Ratio (SLR) works by applying the Rayleigh quotient.


Consider the term B = H T H. Obviously we have here a positive semi-definite Hermitian
1
matrix that we like to factorize. We replace x = B − 2 y and find:
1 1
kxk22 y T B − 2 h1 hT1 B − 2 y
SLR = max = max y .
x kHxk2 2 kyk22
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 181

1
Now the problem is suitable for solving it by a Rayleigh quotient. We find y opt = B − 2 h1 ,
resulting in xopt = B −1 h1 = (H T H)−1 h1 . The corresponding maximal SLR is then

SLR = hT1 (H T H)−1 h1 .

4.4 Further Subspace Techniques


Many modern DSP techniques are based on subspace methods. The most relevant are:
• PHD: Pisarenko’s Harmonic Decomposition

• MUSIC: MUltiple SIgnal Classification

• ESPRIT: Estimation of Signal Parameters via Rotational Invariance Techniques.


As a short historical note, Vladilen Fedorovich Pisarenko published his method in 1973.
The improved version MUSIC was invented by R.O.Schmidt in 1979 (1981) and finally
R.Roy and Thomas Kailath proposed ESPRIT in 1986. We will have a closer look at them
in the following. They all have in common a specific signal model:
p
X
x(t) = ai ej2πfi t + w(t).
i=1

Altogether we can identify four unknowns here:


1. The order p, that is how many signals are superimposed

2. The amplitudes ai which we expect to be complex valued. We are satisfied if we can


estimate their magnitudes as those relate to the power that is strength of the signal
components. We model them as random variables.

3. The frequencies fi , one for each signal component.

4. The additive noise modeling observation and modeling errors. It is a random process
which we assume to be zero-mean complex-valued Gaussian. Its statistic is described
2
by a sole parameter, its variance σw which we hope to measure.
What are the applications of such model?
We rarely identify frequencies directly but note that the Fourier transform of the expression
gives: F [h(t − t)] = H(jω)ejωt . Thus, all calculations of temporal changes or delays are
equivalent to the determination of frequencies. This is for example being used in radar
techniques. Equivalently shock waves in the Earth can be measured by this and different
layers of material be identified. This is then naturally a method to detect large amounts of
hidden fluids in rocks, such as petrol. With antenna arrays the information captured can
182 Signal Processing 1

also be brought into AoA (Angle of Arrival) and AoD (Angle of Departure) computation
in wireless fields and can be used to detect the number of scatterers and reflections.
Depending on the wavelength this can lead to map of the scanned area. But discriminating
different angels of arrival also allow for spying techniques, that is in a superposition of
many sources, a single one can be picked and filtered out, thus become visible or audible.

The random amplitudes as well as the additive noise makes x(t) a random process.
Sampling such process equidistantly with time period T over M > p positions, we obtain a
vector x with
p
X
x= ai si + w.
i=1
Here, the noise is analogously sampled in vector w and we introduced a new deterministic
vector T
si = 1,ej2πfi T ,ej2πfi 2T ,...ej2πfi (M −1)T .


We can now construct an autocorrelation matrix out of observation vector x

Rxx = E xxH ,
 

which in practise can be obtained by observing several of theses vectors x at different times
and computing a mean expression for the ACF matrix rather than an ensemble. With our
vectors si we can write the ACF matrix as
p
X
E |ai |2 si sH
 
Rxx = i + Rww .
i=1

This in turn can be brought into the following notation

Rxx = SP S H + Rww ,

where we introduced matrix S = [s1 ,s2 ,...,sp ], a concatenation of all vectors si as well as a
diagonal matrix P that contains all the signal powers E [|ai |2 ]. Note that matrix S is of
Vandermonde form. We can relate the problem in form of a LS problem:

x = Sa + w
aLS = min kx − Sak22 .
a|S,p

But now that S and p are not given, the problem at hand is not simply an LS problem.

As we expect M > p an analysis of Rxx provides us with


M
X
Rxx = λ̃i ui uH
i .
i=1
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 183

If there were no noise, the matrix would only be spanned by the p signal components si
and thus p
X
Rxx|w=0 = λi ui uH
i ,
i=1

with slightly different eigenvalues but the same eigenvectors. The additive noise compo-
2 2
nent adds σw to all eigenvalues uniformly, as we assume white Gaussian noise: λ̃i = λi +σw .

As long as the noise is zero, we thus have only p eigenvalues different from zero. This
part of the space is spanned by the vectors si but equivalently by the first p eigenvectors
ui :  
span s1 ,s2 ,...,sp = span u1 ,u2 ,...,up = span {US } ,
meaning that although we do not know the individual vectors si (unless it is a single one,
p = 1), we do know the space spanned by them. We call this the signal space. The
remaining part is called the noise space:

N = span up+1 ,up+2 ,...,uM = span {UN } .

Similarly as before, we can relate the problem at hand to a classical LS description:

x = US b + w
bLS = min kx − US bk22 .
b|US ,p

Different to before is that now US is given.

4.4.1 Pisarenko’s Harmonic Decomposition


In PHD the basic ingredients of the three algorithms are already given. It can thus serve
as a canonical form through which the other methods can also be understood.

1. Finding p: in the classical PHD method, the ACF matrix is computed by (temporal)
averaging of instantaneous vector products xxH and its decomposition in eigenvectors
U and eigenvalues Λ̃ is computed. Ordering the eigenvalues form largest to smallest,
2
typically tells at which amount the “noise floor” starts, that is where only σw is
visible. Assuming that the largest values are discriminable against the noise related
values, p is determined.
2
2. Finding σw : once p is determined, a good estimate of the noise variance is simply
given by averaging over all “noisy” eigenvalues:
M
2 1 X
σ̂w = λ̃i .
M − p i=p+1
184 Signal Processing 1

3. Finding S: defining the signal space is a rater difficult nonlinear problem. Pisarenko
proposes to use the noise subspace N defined by the eigenvectors in UN . They must
be orthogonal to those in the signal subspace, in other words:
UNH S = 0,
or more individually for each signal vector si ; i = 1,2,...,p and each noise eigenvector
uk ; k = p + 1,p + 2,...,M we have
uH
k sk = 0.

Pisarenko specifically recommends to select M = p + 1 which results in a single noise


vector uM that is orthogonal to all p signal space vectors. Splitting such inner vector
product into its components, we find
M
X −1 M
X −1
uM,m exp(j2πfi mT ) = uM,m exp(j2πfi T )m = 0.
m=0 m=0

In other words, we now have given a polynomial of order M − 1 with the coefficients
uM,m and only need to find its roots. We recognize here several (numerical) problems:
(i) solving a polynomial of high dimension M can become very challenging, (ii) the
obtained result my not lie exactly on the unit circle, and if M > p + 1, (iii) depending
which eigenvector uk ; k = p + 1,...,M we take, we may get different results.
4. Finding a: Pisarenko suffices to compute the energy of the signal components, that
is E[|ai |2 ]. For this we relate to the description
Rxx = SP S H + Rww = SP S H + σw
2
I,
for which we only look into the first row. There we find the autocorrelation function
rxx (0),rxx (1),...,rxx (M − 1). Each of these elements can be computed by taking the
first row of the left S (which is only ones) and taking the m−th column of the right
S: p
X
rxx (m) = exp(−2jπT mfk )E[|ai |2 ] + δm σw
2
; m = 0,1,...,M − 1.
k=1
2
The term δm σw is only for the term rxx (0) as it sees the impact of noise, the remaining
ones are free of it. But note that we only need p values. We thus pick only the first
p autocorrelation values rxx (1),...,rxx (p). We recognize a set of linear equations in
E[|ai |2 ]:
    
exp(j2πf1 ) exp(j2πf2 ) exp(j2πf3 ) . . . exp(j2πfp ) E[|a1 |2 ] rxx (1)
 exp(j2π2f1 ) exp(j2π2f2 ) exp(j2π2f3 ) . . . exp(j2π2fp )   E[|a2 |2 ]   rxx (2) 
    
 exp(j2π3f1 ) exp(j2π3f2 ) exp(j2π3f3 ) . . . exp(j2π3fp )   E[|a3 |2 ]   rxx (3)
= .

 
 ..  ..   .. 
 .  .   . 
2
exp(j2πpf1 ) exp(j2πpf2 ) exp(j2πpf3 ) . . . exp(j2πpfp ) E[|ap | ] rxx (p)
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 185

The matrix on the left hand side is a Vandermonde matrix and thus invertible. The
set of equations can be solved.

4.4.2 Multiple Signal Classification


As the determination of the unknown frequencies is the major drawback of Pisarenko’s
proposal, the next methods improve this. In the MUSIC approach we consider again the
signal vectors
s(f ) = [1,ej2πf ,ej2π2f ,...,ej2π(M −1)f ]T ,
for which we want to find the frequencies f = fi so that
sH (f )uk = 0; k = p + 1,p + 2,...M.
Different to the previous method, we now bring on all eigenvectors form the noise space
and sum up the deviation to zero, i.e., we compute
1
P (f ) = PM .
H 2
k=p+1 |s (f )uk |

Rather than looking for zeros of a polynomial, we now identify outstandingly high values,
which are much easier to detect. The problem now reduces in identifying the maxima of
P (f ) with a desired precision.

4.4.3 Estimation of Signal Parameters via Rotational Invariance


Techniques
A drawback of the previous methods so far is the influence of the noise w. It directly
changes the entries on the main diagonal of the autocorrelation matrix and thus is visible
on all eigenvalues. If the signal is small compared to the additive noise variance, it becomes
numerical challenging to separate both. The idea in ESPRIT is thus to reduce the effect of
the noise by including alternative terms:
Rxx = E xk xH H
   
k ; Q = E x k k+1 .
x
By including matrix Q, the noise contribution moves to the next lower diagonal away from
the main diagonal. With our signal model we find:
Q = SP ΦS H + σw
2
E,
where E is simply a matrix with ones on its upper diagonal and Φ is a diagonal matrix
with phase rotations on them:
   
0 ej2πf1
 1 0   ej2πf2 
E= ; Φ= .
   
. . . . . .
 . .   . 
j2πfp
1 0 e
186 Signal Processing 1

2
After estimating p and σw we compute the noise free terms SP S H as well as SP ΦH S H and
consider
SP [I − λi ΦH ]S H u = 0; i = 1,2,...,p.
This is a generalized eigenvalue problem: SP S H u = λi SP ΦH S H u with generalized eigen-
values λi . Such eigenvalues are the desired values exp(j2πfi ).5

4.5 Singular Value Decomposition


Although quite useful the decomposition technique presented before seems to be limited to
Hermitian matrices only. In the following we present an even more general method that
allows to decompose any arbitrary (even rectangularly shaped) matrix.

Theorem 4.10 (Singular Value Decomposition) Every matrix A from C l m×n can be
decomposed in the following form: A = U ΣV H with the unitary matrices U from C l m×m
and V from Cl n×n as well as a matrix Σ ∈ IRm×n
+0 with a diagonal block Σ+ from IRr×r
+0 with
r ≤ min(m,n).

The concept behind the theorem is that if we take any arbitrary matrix A, we can form
two positive semi-definite Hermitian matrices, that is

B1 = AH A; AH AV = V Λ1
B2 = AAH ; AAH U = U Λ2 .

Now the unitary matrices U and V as well as the diagonal matrices Λ1 and Λ2 with the
eigenvalues must be related.

Proof:
The proof relates to the eigenvalue decomposition of Hermitian matrices. We thus start
with considering AH A with A ∈ C l n×n rather than A. Let the eigenvalues of AH A be
ordered from largest λ1 to smallest, so that λ1 ≤ λ2 ≤ ... ≤ λr > 0 and λr+1 = ...λn = 0.
The eigenvalues fill a diagonal matrix Λ1 . The Hermitian AH A can be diagonalized by the
unitary matrix V : V AH AV H = Λ1 . The first r vectors can thus be constructed by:

Av
ui = √ i ; i = 1,2,...,r.
λi


The vectors v i are the columns form V . We find that ui ,uj = δi−j , for i,j = 1,2,...,r. The
vectors ui build the columns of a matrix U1 . The set {u1 ,..., ur } from U1 can be extended
5
There is numerical solutions available for solving generalized eigenvalue problems. In Matlab use
eig(A,B) rather than eig(A).
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 187

with orthogonal vectors (for example by the Gram Schmidt method). We thus obtain
U = [u1 ,...,ur ,...,un ] = [U1 ,U2 ] : U H U = I.
l m×m :
Obviously, the vectors in U are eigenvectors for AAH from C
v v
AAH ui = AAH A √ i = Aλi √ i = λi ui .
λi λi
This is clear for the eigenvalues that are distinct from zero. For the zero eigenvalues
the corresponding eigenvectors must come from the nullspace of AAH . Since we have
for Hermitian matrices that R(AAH ) is the orthogonal complement to N ((AAH )H ) =
N (AAH ), all eigenvectors are orthogonal.
We therefore find for U AV H :

• i = 1,2,...,r:
1 H H p
uH
i Av j = √ v i A Av j = λi δi−j .
λi
• i = r + 1,r + 2,...,m:

AAH ui = 0 → λi = 0.
Since z i = AH ui is in the nullspace of A (Az i = 0) and in the range of AH . For
AH ui = 0 we also have v H H H
j A ui = uj Av i = 0.
p
Thus, U H AV = Σ contains a diagonal block Σ+ with the non-zero elements λj , j = 1..r .

Note that r denotes the number of non-zero eigenvalues. If r = min(m,n), all potential
eigenvalues are non-zero and the matrix is of full rank. However, if this is not the case
then we have r < min(m,n).

The method has many (independent) inventors: E. Beltrami (1835-1899), C.Jordan


(1838-1921), J.J. Sylvester (1814-1897), E. Schmidt (1876-1959) and H.Weyl (1885-1955).

Example 4.31 : Let the matrix have the two singular values σ1 and σ2 . If it is an m×n =
2 × 3 matrix, we find:  
σ1 0 0
Σ= .
0 σ2 0
If, on the other hand, it is an m × n = 3 × 2 matrix:
 
σ1 0
Σ =  0 σ2  .
0 0
188 Signal Processing 1

Example 4.32 Consider an arbitrary matrix A and construct B1 = AH A. Then B1 is


Hermitian and we can compute its eigenvalues by

B1 V = V Λ1 = V ΣT Σ.

Thus Λ1 = ΣT Σ contains the eigenvalues on its diagonal.


Now consider B2 = AAH . We can also compute the eigenvalues:

B2 U = U Λ2 = U ΣΣT .

Thus Λ2 = ΣΣT contains the eigenvalues on its diagonal. Both matrices B1 and B2 have
the same singular values and thus the same non-zero eigenvalues. Continuing the previous
example leads for a 2 × 3 matrix to
 2 
σ1 0 0  2 
T 2 T σ1 0
Σ Σ = 0 σ2 0 ; and ΣΣ =
  ;
0 σ22
0 0 0

For a 3 × 2 matrix both terms are swapped.

Note further that if A is from IR then all matrices (U,S,V ) are from IR. We can explicitly
formulate the spectral decomposition of a matrix A ∈ C l m×n , p = max(m,n) and obtain:
  p r
H V1H X X
A = U ΣV = [U1 U2 ]Σ = σi ui v H = σi ui v H
i .
V2H i
i=1 i=1

We recognize a split in [U1 ,U2 ] and [V1 ,V2 ] in which {U1 ,V1 } are associated to the singular
values, and {U2 ,V2 } to the zero blocks. Accordingly, the first summation also takes into
account terms that are zero, specifically σr+1 = σr+2 = ... = σp = 0. In the second sum
they are simply left out as they do not contribute. We can equivalently write
r
X
A= σi ui v H H
i = U1 ΣV1 .
i=1

This is called a thin SVD in literature.

Alternatively, we can also decompose the matrix in the following way:


r r
X X √ √
A= σi ui v H
i = σi ui σi v H H
i = Ũ1 Ṽ1 .
i=1 i=1

This can lead to two moderately small matrices Ũ1 ,Ṽ1 that fully describe the larger matrix
A. It is thus a form of data compression.
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 189

Example 4.33 Recall the induced matrix norm: kAk2,ind from Section 2.3.2:

kAk2,ind = sup kAxk2 = sup kU ΣV H xk2 = sup kΣV H xk2 = σmax .


kxk2 =1 kxk2 =1 kV H xk =1
2

Consider further the Frobenius norm from Section 2.3.1:


r
X
kAk2F = tr(AAH ) = tr(U ΣV H V ΣT U H ) = tr(ΣΣT ) = σi2 .
i=1

Does there also exist something like the Rayleigh quotient?


H
x Ay
A = U ΣV H : 0 ≤ ≤ σmax .
kxk2 kyk2

Note that the rank of a matrix A is given by the number of (nonzero) singular values.

Let us reconsider the range of a matrix A:

l m : b = Ax}
R(A) = {b ∈ C
l m : b = U ΣV H x

= b∈C
l m : b = U Σy

= b∈C
l m : b = [U1 U2 ] Σ+ y 1
  
= b∈C
l m : b = U1 z}
= {b ∈ C
= span(U1 ).

Similar arguments deliver the following relations:

R(A) = span(U1 ),
N (A) = span(V2 ),
R(AH ) = span(V1 ),
N (AH ) = span(U2 ).

Let us revisit now the LS problem with overdetermined parameters. We thus have more
observations m than parameters n:

min kAx − bk22 = min kU ΣV H x − bk22 ,


= min kΣV H x − U H bk22 ,
= min kΣx̃ − b̃k22 .
190 Signal Processing 1

Let us consider more closely what we now minimize:


 
b̃1
  . 
σ1   .. 
 
... x̃1  

 
Σx̃ =  ..   n

 b̃n+1  .
=
 
 . 
 σn 
x̃n 
 .. 

0 ... 0  . 
b̃m

We recognize that in this equivalent form only the first n observations are taken into account,
the rest is simply discarded. The LS method thus searches in the reduced observation
space Cl n the solution with smallest norm. Now the solution of this is obtained by the
pseudoinverse of Σ:
 
σ1
 ...  
Σ+

Σ =  = ,
 
 σr  O
0 0 0
 1 
σ1
0
Σ# =  ..   −1 
0  = Σ+ O .

.
1
σr
0

We find as LS solution:

x̃ = Σ# b̃
V H x = Σ# U H b
x = V Σ# U H b.

We recognize that the pseudo inverse has the following UVD:

(AH A)−1 AH = V Σ# U H .

How does this relate to the underdetermined solution? Let us consider now m < n,
l m, x ∈ C
b∈C l n . The previous formulation now reads:
 
x̃1
  ... 
 
 b̃1
σ1 0   
.   x̃m  
  b̃2 
Σx̃ = 
 . . 0  = .. .

 x̃m+1   . 
σm 0  .. 

.  b̃m
x̃n
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 191

In this case components x of the parameter space are eliminated. Let us thus consider the
solution of the underdetermined LS solution:

x = AH (AAH )−1 b
−1
= V ΣT U H U ΣV H V ΣT U H b
−1
= V ΣT U H U ΣΣT U H b
= V ΣT (ΣΣT )−1 U H b.

We recognize now the term ΣT (ΣΣT )−1 = Σ# the pseudinverse of the singular value matrix.
The underdetermined LS method also finds a minimum norm solution, however now in a
reduced parameter space.

4.6 Applications of SVD


SVD finds many applications in signal processing techniques. The presented subspace
techniques PHD, MUSIC and ESPRIT can also be solved by SVD techniques offering more
control if the observation data are ill-posed. In total LS problems matrix A is not known
perfectly but distorted: min kAx − bk can be solved with help of SVD techniques. In the
following we consider several other applications for SVD.

Example 4.34 (MIMO Transmission) Let us consider the problem of so-called


Multiple-Input-Multiple-Output (MIMO) transmissions as depicted in Figure 4.5. MIMO
systems occur typically in control where multiple sensors and/or actuators are involved. In
the last decades MIMO systems obtained much interest in communications with multiple
antennas. Let us consider a transmission system with NT antennas at the transmit side.
Each of the antenna obtains a data symbol per time unit when transmitting. All of those
are combined in a transmit data vector x ∈ Cl NT . Including additive noise we obtain at the
receiver end:
r = Hx + v.
NR
Assuming NR receive antennas, the receive vector r ∈ C
l and so is the noise vector

P1 H

s P2 T x y B ŝ

Figure 4.5: MIMO transmission: knowing the channel H, how do we design prefilter and
receiver filter to obtain maximum capacity?
192 Signal Processing 1

v. Matrix H ∈ C l NT ×NR describes the wireless channel. In modern Orthogonal-Frequency


Division-Multiplexing (OFDM) systems such as Long Term Evolution (LTE), such a de-
scription is already a valid model. We now wish to know under which conditions we can
transmit with highest data rate possible. From the well-known Shannon formula we know
that the capacity c of a channel is defined by
 
SNR H
c = max log2 det I + HRH ,
tr(R)=K NT

where SNR denotes the transmit SNR at unit transmit power. Here a degree of freedom is
the precoding defined by a covariance matrix R. The question thus is how to design such
matrix R with the power constraint tr(R) = K, i.e., the overall transmit power is limited
to a maximum power K. The solution is found due to the application of SVD on channel
matrix H:
H = U ΣV H .
Selecting R = V P V H with a diagonal matrix P , we can reformulate the problem into its
equivalent representation:
r  
Y SNR 2
c = Prmax log2 1+ Pi σ i .
i=1 Pi =K
i=1
NT

The value r refers to the rank of H, or equivalently the number of singular values that are
non-zero. Further straight forward calculation finally leads to:
r  
X SNR 2
c = Prmax log2 1 + Pi σ i .
i=1 Pi =K
i=1
NT

We can now interpret the MIMO transmission as transmitting over r individual channels,
each of them has its own SNR, given by SNRPi σi2 /NT . The optimal distribution of the
individual power shares is given by the well-known waterfilling solution (Shannon 1948).

Example 4.35 (Iterative Matrix Algorithm) Let us now consider a rectangular


shaped matrix W ∈ IRm×n with m > n. What is the solution to the following problem:

W = W W T W.

We can solve this puzzle by applying an SVD on W = U ΣV T and obtain:

U ΣV T = U ΣV T (U ΣV T )T U ΣV T = U ΣΣT ΣV T .

As the left and right hand matrices are unchanged now, the diagonal part needs to be
considered. Mathematically we obtain problems of the form σ = σ 3 for which three solutions
are possible: {−1,0,1}. Since we deal with singular values that cannot be negative per
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 193

definition, only two solutions remain. The solution of the problem is thus given by all
permutations of zeros and ones on the diagonal of Σ and the two unitary matrices U and V .

Let’s now modify the problem and formulate an iterative algorithm:

Wk+1 = Wk + µWk (I − WkT Wk ).

Let the step-size µ > 0 and W0 = X be an initial value for the algorithm, assuming
X ∈ IRm×n with rank r = n.

1. If we continue the algorithm, whereto will it go?

2. Will it converge at all?

3. If so, under which conditions?

4. What is the fix-point of this algorithm, if it exists?

We can solve these questions by applying SVD on the left and right-hand side of the
algorithm:

U Σk+1 V T = U Σk V T +µU Σk V T (I−(U σk V T )T U Σk V T ) = U Σk V T +µU Σk V T (I−V ΣTk Σk V T ).

We recognize that we can drop the unitary matrices U and V as they have no influence in
the convergence of the algorithm and obtain the much simpler form:

Σk+1 = Σk + µΣk (I − ΣTk Σk ),

which can be written equivalently for each element on its diagonal:


2
σi,k+1 = σi,k + µσi,k (1 − σi,k ).

Obviously zero is a fixed-point. The only other potential fixed-point is given by one as we
can show by:

1 − σi,k+1 = 1 − σi,k − µσi,k (1 − σi,k )(1 + σi,k ) = (1 − σi,k )(1 − µσi,k (1 + σi,k )).

We can thus conclude that as long as |(1 − µσi,k (1 + σi,k ))| < 1 the algorithm will converge.
The solution of the iterative algorithm is thus W∞ = U I+ V T , where I+ stands for a matrix
with zeros or ones on its main diagonal. If due to the initial condition X all components
T
are excited, then no zeros occur and W∞ W∞ = I.

What is such algorithm useful for? If W contains different superpositions of several


statistically independent signals (less or equal to n) then the final value W∞ contains the
signals nicely decomposed from each other. A typical application for this algorithm is thus
194 Signal Processing 1

blind source separation where superpositions of signals are recorded and then separated
again. Take for example a cocktail party with many guests (say 6). If you record there
speeches with several microphones (say 8) that are distributed in the room, chances are
high to obtain all individual speeches by all six persons.

But the algorithm can also be used for other difficult numerical tasks as it is very suit-
able for a fixed-point processor due to a processing that only contains add/mult operations.
Reformulate the algorithm, for example, to
Wk+1 = Wk + µWk (I − WkT R2 Wk ).
T 2
The algorithm will converge to W∞ R W∞ = I and thus a square matrix W equals the
inverse R−1 U , with some unitary matrix U . Starting with W0 = R, for example, results in
the desired modal space and we obtain W∞ = R−1 . Similarly the root of a matrix can be
computed by
Wk+1 = Wk + µWk (I − WkT R−1 Wk ).
Due to the cubic part, the algorithm typically performs very quickly. Once the estimate is
close to the solution, very few more iterations are required. But if the initial point is very
far from the solution, many iterations can be necessary. It is thus important to use as much
a-priori information as possible to have a good starting point.
Example 4.36 (Coordinated Multipoint Problem 2) Let us reconsider Exam-
ple 4.30, a coordinated multi-point problem. Different to the previous example, we now use
several antennas (N ), possibly even from several base-stations (thus the word coordinated)
to serve K < N users with single antennas. We look at the first user and find his receive
signal (without noise)
r1 = hT1 x
where h1 denotes the channel vector of the N antennas to this user and x is the precoding
vector we try to find. The channels h2 ,h3 ,...,hK of the remaining users 2,3,...,K we stack
in a matrix, say H ∈ RN ×(K−1) . We now try to minimize the leakage to the other users
assuming that if no power leaks, it will arrive at the desired user:
kHxk22 khT1 xk22 xT h1 hT1 x
SLR = min = max = max .
x khT1 xk22 x kHxk22 x xT H T Hx

different to problems relating the Rayleigh quotient, we have now take into account that
xopt ∈ N (H T H) such that the ratio becomes infinite. If the nullspace is sufficiently large,
an infinite set of solution sis possible. Nevertheless, not all solutions are of the same quality,
as some may increase the received signal strength more than others. A constructive way is
to identify the nullspace of H T H and then find the linear combination of the nullspace that
maximizes the receive power. We thus apply SVD on H T H:
H T H = U ΣU T = [U1 ,U2 ]Σ[U1 ,U2 ]T ,
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 195

where U2 spans the nullspace. That is any solution of the nullspace is feasible. We can
formulate the problem now as:
xT h1 hT1 x y T U2T h1 hT1 U2 y
max T T => max
x x H Hx y yT y
This problem we can easily solve with the Rayleigh quotient. The so obtained y opt = U2T h1
needs to be resubstituted to xopt = U2 y opt = U2 U2T h1 . We can interpret this solution as
the optimal solution h1 projected onto the nullspace U2 , as U2 U2T = U2 [U2T U2 ]−1 U2T is a
projection operator. The so obtained maximal SLR is given by:
(hT1 U2 U2T h1 )2
SLR = → ∞.
0
If compared to the first problem in Example4.30 we recognize that the term (H T H)−1 is now
replaced by U2 U2T .
Alternatively, in such case, a better metric may be the Signal-to-Noise and Leakage-Ratio
(SNLR):
xT h1 hT1 x xT h1 hT1 x
SNLR = max T T = max T T .
x x H Hx + σv2 x x [H H + σv2 I]x

Now the ratio cannot become infinite and all vectors from the nullspace obtain an equal
contribution to the denominator.
Example 4.37 (Condition Numbers) Let us consider a linear set of equations in which
after the SVD has been applied one singular value is very small, say , while the others are
reasonably large:  
σmax
 .. 
Σ= .
.
 
 σn−1 

In practise it can be wise to replace such  by zero and continue the computation with
a lower rank n − 1. Methods that are replacing very small singular values by zero are
called rank reducing methods. Before we can apply such a rank reducing method, however,
we should have a metric that quantifies the matrix and tells us how small the  really is
when compared to the remaining values. A common metric for such quality is the so-called
condition number, defined by
κ(A) = kAk22,ind kA−1 k22,ind .
We recall that the 2-induced matrix norm picks out the largest singular value σmax of matrix
A. Correspondingly
1 kA−1 xk2 1
= max = . (4.2)
σmin x kxk2 minx kAxk22
kxk
196 Signal Processing 1

Therefore the condition number is nothing else but the ratio of largest and smallest singular
value
σmax
κ(A) = .
σmin
Consequently, condition numbers are nonnegative numbers whose minimum is one.
In case matrix A is Hermitian, we particularly find also a relation given by the eigenvalues
of A:
λmax (A)
κ(A) = .
λmin (A)
For non-Hermitian matrices A, an alternative form in terms of eigenvalues is possible
s
λmax (AH A)
κ(A) = .
λmin (AH A)
The condition number relates directly to the numerical quality of a solution in a linear set
of equations. For this we consider the distorted problem
A(x + ∆x) = b + ∆b,
in which a distortion in ∆x causes a corresponding distortion in the right hand side. The
condition number κ(A) now relates by how much a distortion on one side relates to the
distortion on the other side:
k∆xk k∆bk
≤ κ(A) .
kxk kbk

4.7 Exercises
Exercise 4.1 Consider Theorem 4.3. Proof the remaining parts in an analogue way.
Exercise 4.2 Consider Theorem 4.7. Proof the remaining parts in an analogue way.
Exercise 4.3 Consider Theorem 4.8. Proof it.
Exercise 4.4 Consider skew-symmetric matrices, i.e.
AH = −A.
1. Show that all eigenvalues are zero or purely imaginary.
2. Derive a projection for arbitrary square matrices to skew-symmetric matrices and to
Hermitian matrices.
Exercise 4.5 Show with SVD techniques that Equation (4.2) is correct.
Exercise 4.6 Show that for the Frobenius norm the following is true:
kABkF ≤ kAkF kBk2,ind
kABkF ≤ kBkF kAk2,ind .
Chapter 5

Matrix Operations

5.1 Motivation
Many problems can be formulated as the solution of a large linear system of equations.
Although in principle mathematical methods are known, once the systems become very
large (say 10.000 equations), they become difficult and tedious to solve. Two sources of
difficulties occur:
1. Numerical problems: due to rounding errors in floating point (fix-point solutions
for large matrices are typically infeasible due to numerical problems). A method
to describe the severity of such numerical problems is the condition number, see
Example 4.37.
2. Complexity problems: with increasing number of equations also the complexity grows,
it thus takes longer and longer until the result is computed.
Both problems are present in many general systems of linear equations and are difficult to
solve. If, however, the systems are in some form structured, additional information may be
possible to deduce from them simply by their structure and by this reduce the problems at
hand. Take for example the following block structured square matrix
 
A 0
C= ,
0 B
comprising of two smaller square matrices A,B ∈ IRN ×N . If we know the eigenvalues αi ,βi
and corresponding eigenvectors ai and bi of A and B, respectively, then we can also say
something about C:
    
A 0 ai ai
= αi
0 B 0 0
    
A 0 0 0
= βi ; i = 1,2,...,N.
0 B bi bi

197
198 Signal Processing 1

But even more complicated constructs can be investigated. Take for example now:
 
A B
C=
B A

We can show that the eigenvalues of C may occur in pairs. If we know one eigenvalue λi
with its corresponding eigenvector, we also know the other:
    
A B xi xi
= λi
B A yi yi
    
A B yi yi
= λi ; i = 1,2,...,N.
B A xi xi

Thus, as long as xi 6= y i and xi 6= −y i , the eigenvalues occur in pairs, otherwise not. Try
yourself on the skewed block matrix1
 
A B
C= .
−B A

Consider now block matrix  


A A
C=
A A

It turns out that half of the eigenvalues of C are twice those of A and the other half is zero.
To understand such behavior better, we will consider Tensor operations next.

5.2 Tensor Operations


Definition 5.1 (Kronecker Product) Consider two matrices A and B with dimensions
n × p and m × q, respectively. Then, the Kronecker2 product (also called tensor) is defined
as  
a11 B a12 B ... a1p B
 a21 B a22 B ... a2p B 
A ⊗ B =  .. .
 
..
 . . 
an1 B an2 B ... anp B
1
The solution follows a similar path, finding a pair of eigenvalues for the eigenvectors [xTi ,y Ti ]T and
[y T , − xT ]T .
2
Leopold Kronecker (from 7. 12.1823 in Liegnitz until 29.12.1891 in Berlin) was a German Mathemati-
cian.
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 199

Example 5.1
   
1 2 7 7 7
A = ;B =
3 4 9 9 9
 
7 7 7 14 14 14
 9 9 9 18 18 18 
A⊗B = 
 21 21 21

28 28 28 
27 27 27 36 36 36
 
7 14 7 14 7 14
 21 28 21 28 21 28 
B⊗A = 
 9 18 9

18 9 18 
27 36 27 36 27 36

Let us consider the following properties


1. No Commutativity: In general we have

A ⊗ B 6= B ⊗ A,

i.e., in general commutativity does not hold. The previous example shows however,
that the elements of both tensor products are identical, they only occur at different
positions. We can thus fix this missing property by a so called permutation matrix,
that reorders the elements.
2. Commutativity by Permutations: Consider a permutation matrix P that is com-
posed of many products of elementary permutation matrices

P = Pi1 j1 Pi2 j2 ...Pin jn

in which each elementary permutation is defined by its index pair (ik ,jk ):
 
1
..

 . 

1
 
 
P ik jk =  0 1 .
 
1 0
 
 

 . ..


1
with the exchange on the ik −th and jk −th column and row elements. Note that
such a permutation matrix is unitary: P T P = I. With help of suitable permutation
matrices the desired order of the elements can be assured and we can obtain

B ⊗ A = P T (A ⊗ B)P.
200 Signal Processing 1

3. Distributivity: We find:
(A + B) ⊗ C = (A ⊗ C) + (B ⊗ C),
A ⊗ (B + C) = (A ⊗ B) + (A ⊗ C).

4. Associativity: We find:
(A ⊗ B) ⊗ C = A ⊗ (B ⊗ C).

5. Transpose: We find:
(A ⊗ B)T = AT ⊗ B T ,
(A ⊗ B)H = AH ⊗ B H .

6. Trace: We find:
tr(A ⊗ B) = tr(A)tr(B).

7. Diagonality: If A and B are diagonal matrices, so is their Kronecker product.


8. Determinant: Let A be an m × m and B an n × n matrix, then we find:
det(A ⊗ B) = det(A)n det(B)m .

9. Product of Kronecker Products: We find:


(A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD),
(A ⊗ B)(C ⊗ D)(E ⊗ F )... = (ACE...) ⊗ (BDF...).

10. Commutativity: We have learned that there is in general no commutativity, but


with the previous property we can write
A ⊗ B = (A ⊗ I)(I ⊗ B) = (I ⊗ B)(A ⊗ I).

11. Inverse: We find:


(A ⊗ B)−1 = A−1 ⊗ B −1 .

12. SVD: The SVD decomposition of a Kronecker product A ⊗ B is given by:


A ⊗ B = (UA ⊗ UB )(ΣA ⊗ ΣB )(VAH ⊗ VBH ).
Proof: Straightforwardly by substitution:
A ⊗ B = (UA ΣA VAH ) ⊗ (UB ΣB VBH )
= (UA ΣA ⊗ UB ΣB )(VAH ⊗ VBH )
= (UA ⊗ UB )(ΣA ⊗ ΣB )(VAH ⊗ VBH ) = UA⊗B ΣA⊗B VA⊗B
H
.
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 201

13. Pseudo Inverse: The pseudo inverse of A ⊗ B is given by the Kronecker product of
the pseudo inverses.
#
(A ⊗ B)# = UA⊗B ΣA⊗B VA⊗B H

= (VA ⊗ VB )(Σ# # H H
A ⊗ ΣB )(UA ⊗ UB )
= (VA Σ# H # H # #
A UA ) ⊗ (VB ΣB UB ) = A ⊗ B .

As shown in the context of SVD in the previous chapter, we find for the two pseudo
inverses:

(X H X)−1 X H = V Σ# U H
X H (XX H )−1 = V Σ# U H ,

for which we set X = A ⊗ B. Obviously the position of the unitary matrices U and V
remain unchanged: UA⊗B = UA ⊗ UB and VA⊗B = VA ⊗ VB . Only the singular value
matrix ΣA⊗B undergoes some transformation. Let us take a closer look to this (for
the first pseudo inverse):
−1
Σ#A⊗B = ΣT
A⊗B ΣA⊗B ΣA⊗B
−1
= (ΣA ⊗ ΣB )T (ΣA ⊗ ΣB )

(ΣA ⊗ ΣB )
−1
= (ΣA ΣA ) ⊗ (ΣTB ΣB )
 T 
(ΣA ⊗ ΣB )
−1 −1
 T T

= (ΣA ΣA ) ⊗ (ΣB ΣB ) (ΣA ⊗ ΣB )
= (ΣTA ΣA )−1 ΣA ⊗ (ΣTB ΣB )−1 ΣB = Σ# #
A ⊗ ΣB .

Proofs: Let us prove the first property. Select one entry of the matrix (or block),
for example aij , and show that the desired property is satisfied; then generalize. We find
a11 B 6= b11 A. All listed properties can be shown straightforwardly.

Theorem 5.1 Let A be an m × m matrix with eigenvalues αi ; i = 1..m and corresponding


eigenvectors ai and B an n × n matrix with eigenvalues βi ; i = 1..n and corresponding
eigenvectors bi , i = 1,2,...n. Then the mn eigenvalues of the Kronecker product are αi βj ; i =
1..m,j = 1..n and the corresponding eigenvectors are: ai ⊗ bj .

Proof:
Let Aai = αi ai and Bbj = βj bj . Then simply multiplying delivers:

(A ⊗ B)(ai ⊗ bj ) = (Aai ) ⊗ (Bbj ) = αi βj (ai ⊗ bj ).

Example 5.2 Consider the following two matrices


   
4 5 7 −3
A= ; B= .
2 1 −3 7
202 Signal Processing 1

The eigenvalues of A are given by (6, − 1) and those of B by (10,4). If we compute the
Kronecker product, we find:
 
28 −12 35 −15
 −12 28 −15 35 
A⊗B =
 14 −6
,
7 −3 
−6 14 −3 7

for which the eigenvalues are given by (60,24, − 4, − 10).

Definition 5.2 (Kronecker Sum) The Kronecker sum of an m × m matrix A and an


n × n matrix B is given by:

(A ⊕ B) = (A ⊗ In ) + (Im ⊗ B).

Note: This definition can also be found in variations in literature:

(A ⊕ B) = (In ⊗ A) + (B ⊗ Im ),
(A ⊕ B) = (In ⊗ A) + (B T ⊗ Im ).

For these forms similar properties are obtained!

Theorem 5.2 Let A be an m × m matrix with eigenvalues αi ; i = 1..m and corresponding


eigenvectors ai and B an n × n matrix with eigenvalues bi ; i = 1..n and corresponding
eigenvectors bi . Then the mn eigenvalues of the Kronecker sum are αi +βj ; i = 1..m,j = 1..n
and their corresponding eigenvectors are: ai ⊗ bj .

Proof:
Let Aai = αi ai and Bbj = βj bj . Then simply multiplying delivers:

(A ⊕ B)(ai ⊗ bj ) = (A ⊗ In )(ai ⊗ bj ) + (Im ⊗ B)(ai ⊗ bj )


= (Aai ⊗ bj ) + (ai ⊗ Bbj )
= αi (ai ⊗ bj ) + βj (ai ⊗ bj )
= (αi + βj )(ai ⊗ bj ).

Example 5.3 Consider the following two matrices


 
  7 7 7
1 2
A= ; B =  9 9 9 .
3 4
11 11 11
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 203


The eigenvalues of A are given by 52 ± 21 33, those of B by (0,0,27). The eigenvalues of
     
1 2 7 7 7 8 7 7 2

 1 2   9 9 9
 
  9 10 9
  2 

 1 2   11 11 11
 
 =  11 11 12
  2 
(A⊕B) =   3 +

 4   7 7 7   3
  11 7 7 

 3 4   9 9 9   3 9 13 9 
3 4 11 11 11 3 11 11 15
√ √ √
are thus given by (2.5 ± 12 33,2.5 ± 12 33,29.5 ± 21 33).
Definition 5.3 (Vec Operator) Let the matrix A be of dimension m × n. The vec oper-
ator rearranges the elements of a matrix column-wise into a vector:
 
a1
 a 
 2 
A = [a1 ,a2 ,...,an ]; vec[A] =  ..  .
 . 
an
Here some important properties of the vec-operator:
1. Trace: We find:
tr(AB) = vec[AT ]T vec[B],
tr(AB) = vec[AH ]H vec[B].

2. Vec of a product: We find:


vec[AB] = (I ⊗ A)vec[B],
vec[AB] = (B T ⊗ I)vec[A].

3. Tripple Product: We find:


vec[AY B] = (B T ⊗ A)vec[Y ].
This is the more general form, for Y = I we can deduce the second property. The
advantage of the formula is that we can take out a matrix Y that has a matrix form
the left and the right.
Proof: Let B be of dimension m × n. Consider the k−th column (AY B):k of AY B:
 
x

 x 

(AY )(b1 ,b2 ,...,bk ,...,bn ) =  .
. .
 
 . 
 x 
x
204 Signal Processing 1

The k−th column (AY B):k can be written as:


m
X m
X
(AY B):k = (AY ):j bjk = bjk (AY ):j
j=1 j=1
 
Y:1
 Y:2 
= [b1k A,b2k A,...,bnk A]   ; k = 1,2,...,n.
 
..
 . 
Y:n
More compactly, we can write:
 
Y:1
 Y:2 
[b1k A,b2k A,...,bnk A]   = (bTk ⊗ A) vec[Y ].
 
..
 . 
Y:n

Combining now all columns k = 1,2,...,n, and we obtain:


   T   T 
(AY B):1 (b1 ⊗ A)vec[Y ] (b1 ⊗ A)
 (AY B):2   (bT ⊗ A)vec[Y ]   (bT ⊗ A) 
  2   2
vec[AY B] =  = =  vec[Y ] = (B T ⊗A)vec[Y ].
 
.. .. ..
 .   .   . 
T
(AY B):n (bn ⊗ A)vec[Y ] (bTn ⊗ A)
Example 5.4 Given matrices A, B, and C, to solve is AXB = C. If A and B can be
inverted, we have X = A−1 CB −1 . Alternatively, we could write:
vec[AXB] = vec[C],
T
(B ⊗ A) vec[X] = vec[C],
| {z }
D
vec[X] = D−1 vec[C].
Example 5.5 Consider now the problem
A1 XB1 + A2 XB2 = C.
It cannot be solved by simply inverting any of the matrices. But
vec[A1 XB1 + A2 XB2 ] = vec[C],
 
(B1T ⊗ A1 ) + (B2T ⊗ A2 ) vec[X] = vec[C],
| {z } | {z }
D1 D2

vec[X] = (D1 + D2 )−1 vec[C].


Typical matrix Riccati equations can be solved now: X = AXA + C.
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 205

Let us consider a further example. Given a matrix A ∈ IRm×m and a matrix B ∈ IRn×n
as well as two rectangularly shaped matrices X,C ∈ IRn×m . We like to solve:

XAT + BX = C.

We first augment the equation with identities and obtain:

In XAT + BXIm = C

and now apply vectorization:

vec[In XAT + BXIm ] = vec[C].

Applying the vectorization property we find

((A ⊗ In ) + (Im ⊗ B)) vec[X] = vec[C].

We recognize a Kronecker sum. We thus can solve for X, if the pairwise sum of all
eigenvalues of A and B is non-zero.

Whether a large dimensional matrix is separable into smaller ones is not easily visible.
Definition 5.4 A matrix A is called separable (Ger.: separierbar) if there are two matrices
A1 ,A2 for which we have:
A = A1 ⊗ A2 .

Example 5.6 Separability allows for saving complexity. Consider for this the problem of
finding the inverse of a matrix A, for example, for solving a right hand side problem of the
form Ax = b. Computing the inverse of a general matrix of size m × m is of cubic order,
i.e., O(m3 ). Take for example
 
2 3 4 6    
 −5 6 −10 12  1 2 2 3
A=   = ⊗ .
10 15 14 21  5 7 −5 6
−25 30 −35 42

Computing the inverse of A immediately is of order 43 = 64 operations, while each individual


inverse if of order 23 = 8 operations, thus only 16 operations are required (plus the tensorial
product).

If a matrix is separable can be tested by the following idea: A given matrix F is


decomposed into a Kronecker product B ⊗ A is as close to F as possible. If the error
F − B ⊗ A in the LS sense becomes zero, then the matrix is separable; otherwise only a
corresponding part is separable. The remaining part is then orthogonal to the error. We
can describe this by the following theorem.
206 Signal Processing 1

Theorem 5.3 Consider a matrix F ∈ Cl m1 ×m2 with m1 = n1 p1 , m2 = n2 p2 , i.e.,


 
F11 F12 ... F1p2
 F21 F22 ... F2p2 
F =  ..
 

 . 
F p1 1 F p1 2 ... Fp1 p2

l n1 ×n2 ; l = 1,2,...,p1 ; k = 1,2,...,p2 . There exists unique (up to a phase) two


with Flk ∈ C
l n1 ×n2 ,B ∈ C
matrices A ∈ C l p1 ×p2 with kAkF = 1 such that

min kF − B ⊗ Ak22
A,B

in the LS sense given by the eigenvector associated to the largest eigenvalue:


p1 p2
X X
vec(A) = arg maxeigvec vec[Fkl ]vec[Fkl ]H .
l=1 k=1

The coefficients of B are then:

bmn = vec[A]H vec[Fmn ] ; m = 1,2,...p1 ; n = 1,2,...,p2 .

Proof: We first compute the elements of matrix B by LS and obtain:


p1 p2
∂ XX
kFkl − bkl Ak2F = 0
∂bmn l=1 k=1

that is
tr(AH Fmn )
bmn = = vec[A]H vec[Fmn ],
tr(AH A)
due to the norm constraint on A.
We now have to minimize
p1 p2
Fkl − tr(AH Fkl )A 2
X X
min F
A
l=1 k=1

which is equivalent to minimizing


p1 p2
X X
min tr(FklH Fkl ) − tr(FklH A)tr(AH Fkl ).
A
l=1 k=1

As tr(AH Fkl ) = vec[A]H vec[Fkl ], we find equivalently


p1 p2
X X
min vec[Fkl ]H vec[Fkl ] − vec[A]H vec[Fkl ]vec[Fkl ]H vec[A],
A
l=1 k=1
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 207

the solution of which is given by the eigenvector associated to the largest eigenvalue:

p1 p2
X X
vec[A] = arg maxeigvec vec[Fkl ]vec[Fkl ]H .
l=1 k=1

While the decomposition works well in case matrix A is separable, hat would happen if
not? In case of an arbitrary matrix A, the decomposition would extract that separable
part of A that a LS solution allows. After subtracting it from the original matrix A, the
remaining part describes the LS error term, orthogonal onto the separated part. The error
can thus be used to evaluate how well a matrix is separable or not.

F1 FN
v3
v1 v2
Figure 5.1: Example of a 3-way tensor.

Recall now the spectral decomposition from the previous chapter

r r
X X √ √
A= σi ui v H
i = σi ui σi v H H
i = Ũ1 Ṽ1 .
i=1 i=1

What is the consequence of the spectral decomposition? It allows to decompose A into


a product of two matrices. The size of them depends on the rank leading to low rank
compression. Matrix completion concept: observe a limited set of entries and estimate the
matrix s. th. its rank is preserved. SVD is thus the natural form of decomposing matrices
208 Signal Processing 1

into products Low rank means data compression. This is equivalent to the requirement:
1 
kAk∗ = min kŨ k2F + kṼ k2F ; s.th. A = Ũ Ṽ H .
2
rank
X(A)
kAk∗ = σr ; nuclear norm.
r=1

This norm provides something similar to sparsity but for matrices.3 The rank is thus the
minimum number of outer vector products, required to approximate a given matrix A with
zero error. Matrix A is a special case of a tensor, a so-called two-way tensor.

Now let us consider an extension of the matrix concept, the n-way tensor. Figure 5.1
depicts a 3-way tensor.
Example 5.7 Consider for example, a video stream, i.e., a sequence of matrices Fk ∈
IRl1 ×l3 of dimension l1 × l2 × l3 (being a frame) over time k = 1,2,...,l2 . We thus have a
three dimensional matrix (3-way tensor) F . Such a 3-way tensor F can be approximated
by:

rank
X (F )

min F − v 1,i ⊗ v 2,i ⊗ v 3,i
.
v 1,i ,v 2,i ,v 3,i
i=1
2
It is known that for such a tensor, its rank is bounded by
min(l1 ,l2 ,l3 ) ≤ rank(F ) ≤ min(l1 l2 ,l2 l3 ,l1 l3 ).
To find the rank of a tensor is in general NP-hard!
We can add arbitrary many dimensions to this construct. With every additional dimen-
sion, its data amount increases substantially. If a certain pattern is of interest, the time
to search through the entire tensor would be unfeasibly long. Apply first a separation into
vectors and/or matrices, then apply a search algorithm. Design decomposition based on a
fraction of data and running online without storing the entire data first.

5.3 Large Matrices with Structure


Definition 5.5 (Walsh-Hadamard Matrix) (Walsh) Hadamard matrices Hn with di-
mension n × n are matrices whose entries are {−1,1} with the following property
Hn HnT = HnT Hn = nI,
 
1 1
H2 = .
1 −1
 p1
rank(A)
P
3
The nuclear norm and Frobenius norms are a special case of a Schatten norm: kAkp = r=1 σrp .
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 209

Applications of the Hadamard transform can be found in source coding for speech, image
and video processing as well as mobile communications such as the UMTS wireless standard.

A straightforward method to build larger Hadamard matrices is called the Sylvester


construction:
H2n = H2 ⊗ Hn = H2 ⊗ H2 ⊗ ... ⊗ H2 .
| {z }
log2 (2n) times

Example 5.8 Consider n = 2, we find H4 = H2 ⊗ H2 :


 
1 1
H2 =
1 −1
 
1 1 1 1
 1 −1 1 −1 
H4 =  .
 1 1 −1 −1 
1 −1 −1 1

Often normalized Hadamard matrices are used. Applying √1n as normalization we ob-
tain unitary (orthogonal) matrices. The advantage of such structured matrices is that a
significant amount of operations can be saved (separable). Is typically n2 operations per
vector multiplication of an n × n matrix required, so can this complexity be reduced to
n log2 (n) operations. This is often called the Fast Hadamard Transform.

Example 5.9 Consider z = H4 x with x = [1,b,c,d]T . We thus have


 
a+b+c+d
 a−b+c−d 
z=  a+b−c−d
.

a−b−c+d

The operation can be applied straightforwardly with 12 operations and identically with only
8 = 4 log2 (4) operations due to structuring. We set:
   
a a+b
z 1 = H2 = ,
b a−b
   
c c+d
z 2 = H2 = ,
d c−d
 
z1 + z2
z = .
z1 − z2

Figure 5.2 depicts the signal flow graph for this example.
210 Signal Processing 1

Figure 5.2: Saving complexity by Hadamard matrices.

Theorem 5.4 Let Hn with n = 2m being a Hadamard matrix built by Sylvester construc-
tion, then we have:

Hn = Mn(1) Mn(2) ...Mn(m)


Mn(i) = I2i−1 ⊗ H2 ⊗ I2m−i ; i = 1,2,..,m.

Proof:
The proof goes by induction. We obtain the result immediately for m = 1. Assume the
result is true for m, is it also true for m + 1? For 0 < i < m + 1 we should have:
(i)
M2m+1 = I2i−1 ⊗ H2 ⊗ I2m+1−i = I2i−1 ⊗ H2 ⊗ I2m−i ⊗ I2
(i)
= M2m ⊗ I2 .

Moreover, for i = m + 1 we must have:


(m+1)
M2m+1 = I2m ⊗ H2 .

Thus we have:
(1) (2) (m+1) (1) (2) (m)
M2m+1 M2m+1 ...M2m+1 = (M2m ⊗ I2 )(M2m ⊗ I2 )...(M2m ⊗ I2 )(I2m ⊗ H2 )
(1) (2) (m)
= (M2m M2m ...M2m ⊗ I2 )(I2m ⊗ H2 )
(1) (2) (m)
= (M2m M2m ...M2m I2m )(I2 H2 )
= H2m ⊗ H2 = H2m+1 .

(1) (2)
Example 5.10 Consider H4 = M4 M4 . Figure 5.3 depicts the two matrices and that
only half their elements are non-zero.

With the HT we can also operate with other symbols than {0,1} or {−1,1}.
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 211

Figure 5.3: Sylvester construction of a Hadamard matrix.

Example 5.11 Assume two complex-valued symbols {a,b}, constructing:


" #
  (1) (2)
a b Hn Hn
H2 = ; H2n = .
b∗ −a∗ Hn
(2)∗
−Hn
(1)∗

Definition 5.6 The Discrete Fourier Transform describes an operation, mapping a discrete
time sequence into a discrete Fourier sequence.
N
X −1
kl
yl = wN xk ; l = 0,1,...,N − 1.
k=0

Example 5.12 Consider a vector with the entries x = [a,b,c,d,e]T . By a DFT this vector
can be mapped into the Fourier domain:
   
1 1 1 1 1 1 1 1 1 1 1 1
 1 w61 w62 w63 w64 w65   1 w61 w62 w63 w64 w65 
   
 1 w62 w64 w66 w68 w610   1 w62 w64 w60 w62 w64 
y=  3 6 9 12

15  x =   x = F6 x.
 1 w 6 w 6 w 6 w6 w 6 
 1
 w63 w60 w63 w60 w63 

4 8 12 16 20
 1 w6 w6 w6 w6 w6   1 w64 w62 w60 w64 w62 
5 10 15 20 25
1 w6 w6 w6 w6 w6 1 w65 w64 w63 w62 w61

We introduced the following twiddle factors4


m
wN = exp(−j2πm/N ).
Due to their periodic properties we have:
(m+N ) m
wN = exp(−j2π(m + N )/N ) = wN .
DFT exhibit the following properties (and thus FFTs as well).
4
Engl: to twiddle one’s thumbs=Ger: Däumchen drehen.
212 Signal Processing 1

1. Unique linear mapping


PN −1 kl
yl = k=0 xk wN ; l = 0,1,...,N − 1,
1
PN −1 −kl
xk = N l=0 yl wN ; k = 0,1,...,N − 1.

2. Periodicity
xk+N = xk , yl+N = yl .

3. Linearity
DFT[αxk + βuk ] = αDFT[xk ] + βDFT[uk ].

4. Circular Symmetry
Consider a sequence xk ; k = 0,1,...,N − 1 of length N . Take the first n0 terms away
and append them at the end.

N −1+n
X0 N
X −1 N −1+n
X0
kl kl kl
xk w N = x k wN + xk w N ,
k=n0 k=n0 k=N
N
X −1 N −1+n
X0
kl kl
= xk w N + xk+N wN ,
k=n0 k=N
N
X −1 0 −1
nX N
X −1
kl kl kl
= xk w N + xk w N = xk w N = yl .
k=n0 k=0 k=0

This property also holds equivalently in the Fourier domain.

5. Circular shift
In contrast to the previous property we shift the sequence now by n0 values circularly:

N
X −1 N
X −1
kl l(k−n0 +n0 )
xk−n0 wN = xk−n0 wN ,
k=0 k=0
N
X −1
ln0 l(k−n0 ) ln0
= wN xk−n0 wN = wN yl .
k=0

This property also holds equivalently in the Fourier domain.

6. Multiplication in the Fourier domain


Consider yl = DFT[xk ] and zl = DFT[uk ]. Then we find for the frequency-wise
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 213

product wl = yl zl = DFT[vk ]:

N
X −1
vm = xk um−k ; m = 0,1,...,N − 1
k=0
   
v0 u0 u−1 = uN −1 . . . u−N +1 = u1
 v1   u1 u0 . . . u−N +2 = u2 
v =  = x
   
.. .. ..
 .   . . 
vN −1 uN −1 uN −2 . . . u0
 
x0 x−1 . . . x−N +1
 x1 x0 . . . x−N +2 
=   u.
 
.. ...
 . 
xN −1 xN −2 . . . x0

Both sequences xk and uk can be described by circulant matrices.

Example 5.13 Consider the convolution with a short channel {h0 ,h1 ,h2 }
 
    xk
yk h0 h1 h2 
 xk−1 

 yk−1  =  h0 h1 h2  xk−2 ,
 
yk−2 h0 h1 h2  xk−3 
xk−4
    
yk h0 h1 h2 xk

 yk−1 


 h0 h1 h2   xk−1
  


 yk−2  = 
  h0 h1 h2   xk−2
 .

 ỹk−3   h2 h0 h1   xk−3 
ỹk−4 h1 h2 h0 xk−4

The first equation depicts the true transmission scenario, i.e., a convolution. In the
second equation the channel matrix is augmented to be a circulant matrix. Due to this
circulant matrix an FFT can be applied to offer a low complexity solution. There is
two options: 1) to simply throw the additional terms {ỹk−3 ,ỹk−4 } away (Overlap Add
method) or 2) to save them for the continuation of the sequence xk (Overlap Save
method).

7. Backwards sequence

DFT[xN −k ] = yN −l = y−l = yl∗ .


214 Signal Processing 1

8. Parseval’s Theorem
Let DFT[xk ] = yl and DFT[uk ] = zl :
N −1 N −1
X 1 X ∗
xk u∗k = yl zl .
k=0
N l=0

Question: Can the DFT similarly to the FHT be described so that a fast version with
less complexity is possible? Answer in Example 5.12. Consider two matrices
 
  1 1 1
1 1
F2 = ; F3 =  1 w62 w64 
1 w63
1 w64 w68
Now compute  
1 1 1 1 1 1

 1 w62 w64 w60 w62 w64 

 1 w64 w62 w60 w64 w62 
F2 ⊗ F3 =   6= F6 .

 1 w60 w60 w63 w63 w63 

 1 w62 w64 w63 w65 w61 
1 w64 w62 w63 w61 w65
Including input and output vector we find:
         
y0 1 1 1 1 1 1 x0 y0 1 1 1 1 1 1 x0
 y4   1 w62 w64 w60 w62 w64   x2   y1   1 w61 w62 w63 w64 w65  x1 
         
 y2   1 w64 w62 w60 w64 w62   x4   y2   1 w62 w64 w60 w62 w64  x2 
 =  ;  =  .
 y3   1 w60 w60 w63 w63 w63   x3   y3   1 w63 w60 w63 w60 w63  x3 
         
 y1   1 w62 w64 w63 w65 w61   x5   y4   1 w64 w62 w60 w64 w62  x4 
y5 1 w64 w62 w63 w61 w65 x1 y5 1 w65 w64 w63 w62 w61 x5
Obviously, we did not receive the correct DFT matrix. However, it is a DFT, just with
the wrong order of input and output elements. Obviously, the method is sufficient to
construct a fast variant of the DFT, the so called Fast Fourier Transform (FFT). The
corresponding signal flow graph is shown in Figure 5.4 where also a complexity comparison is
given with the direct approach, i.e., multiplication with the F6 matrix. Given the dimension
N = NL NL−1 ...N2 N1 of the matrices and assuming that all Nk are prime, we find the N -
dimensional FFT as:
F = FNL ⊗ FNL−1 ⊗ ... ⊗ FN2 ⊗ FN1 .
The drawback is that the ordering is different to the DFT. However, by suitable permutation
matrices for in- and output Px and Py this can be solved without increasing complexity.
FN = Py (FNL ⊗ FNL−1 ⊗ ... ⊗ FN2 ⊗ FN1 )Px .
Note that also the classical FFT (by Cooley and Tukey) requires a reordering of the entries!
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 215

y0
F2
x0 y3
x2 F3
x4
y4
F2
x3 y1
x5
F3 y2
x1
F2 y5

F2 ⊗ F3
Complexity: 2x(3x2) + 3x(2x1) = 18 <52=25

Figure 5.4: FFT construction for F6 .

5.4 Toeplitz and Circulant Matrices

Complexity reduction in calculating with large matrices is an important issue. Com-


plexity is typically measured in terms of operations (add/sub or mult). Depending on the
hardware implementation, however, different measures can be of interest. We follow here
a simplified approach that is used mostly in literature and provide asymptotic results in
terms of O(n) to describe linear complexity, O(n2 ) for quadratic complexity and so on.
The inner vector product of vectors with n entries is of linear order O(n), although we
only need n − 1 operations. The fast algorithms discussed in the previous section are of
order O(n log(n)), thus higher than linear but lower than quadratic. A multiplication of
an n × n matrix with an input vector is of order O(n2 ), multiplying two square matrices
with each other is of order O(n3 ). The solution of a linear set of equations typically
requires to compute the inverse of a matrix. General matrices without particular structure
require cubic complexity O(n3 ) for their inversion. If the matrix is however of Toeplitz
structure, the set of equations can be solved with quadratic complexity O(n2 ) by the
so-called Levinson-Durbin algorithm. It is thus of interest to look more into these matrix
structures.
216 Signal Processing 1

5.4.1 Toeplitz Matrices


Toeplitz5 matrices naturally appeared in the past of our considerations when we described
time-invariant linear systems. See for example the two forms in Equations (1.10) and (1.11).
In general a Toeplitz matrix does not need to be rectangular and can have entries below
and above the diagonal.

Definition 5.7 A (finite) m × n or infinite matrix is called a Toeplitz matrix if its entries
aij only depend on the distance i − j.

 
h0 ... hL−1
 .. ..
.
 . h0 hL−1 

Tmn = .. .
 h1−L . . . ..
. . 
h1−L ... h0

Here L is smaller than m,n but it can also reach max(m,n).


More particularly, a square Toeplitz matrix Tn of dimension n × n is given by the following
form:
 
h0 ... hL−1
.
h0 . . hL−1 
 
Tn =  .

... ..
 . 
h0

If a Toeplitz matrix is involved to describe a linear time-invariant system, it is typically


rectangular shaped. Consider, for example, the equalizer problem from Example 3.7

     
rL−1   aL−1 vL−1
h0 h1 h2
 rL−2   aL−2   vL−2 
   h0 h1 h2    

 rL−3 =
 
.. .. ..

 aL−3 +
  vL−3  = Ha + v.

 ..   . . .  ..   .. 
 .   .   . 
h0 h1 h2
r2 a0 v2

We try finding vector a by the observation r = Ha + v. We can interpret this as an

5
Otto Toeplitz (1881-1940) was a German mathematician.
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 217

underdetermined system of equations for which an LS solution exists:

aLS = H H (HH H )−1 r


 H
h0 h1 h2
 h0 h1 h2 
= 
 
.. .. .. 
 . . . 
h0 h1 h2
 
  H −1 rL−1
h0 h1 h2 h0 h1 h2  rL−2 
 h0 h1 h2   h0 h1 h2    
× 

. . .

.. .. ..
   rL−3 
 .. .. .. 


. . .





 ..


 . 
h0 h1 h2 h0 h1 h2
r2
 H
h0 h1 h2
 h0 h1 h2 
= 
 
.. .. .. 
 . . . 
h0 h1 h2
−1
h1 h∗0 + h2 h∗1 h2 h∗0

|h0 |2 + |h1 |2 + |h2 |2 0
 h∗1 h0 + h∗2 h1 |h0 |2 + |h1 |2 + |h2 |2 h1 h∗0 + h2 h∗1 h2 h∗0 
×  ∗ ∗ ∗

 h2 h0 2 2
h1 h0 + h2 h1 |h0 | + |h1 | + |h2 | 2
h1 h0 + h2 h∗1 

0 h∗2 h0 h∗1 h0 + h∗2 h1 |h0 |2 + |h1 |2 + |h2 |2


 
rL−1
 rL−2 
 
 rL−3 
× .
 .. 
 . 
r2

Thus, the system of equations exhibits square Toeplitz structure and it can be solved by
a low complexity method (O(n2 ): Levinson-Durbin algorithm). We further conclude that
LS as a method is not necessarily destroying the Toeplitz structure. However, if the order
of the channel L is large, the problem can become numerically challenging.
218 Signal Processing 1

5.4.2 Circulant Matrices


Consider the following two matrices of large size
 
  h0 h1 h2
h0 h1 h2 
 h0 h1 h2 


 h0 h1 h2 


 ... ... ... 

 . . . . . .   
 . . .   h 0 h 1 h2 
 
h0 h1 h2  h2 h0 h1 
h1 h2 h0

On the left hand we have a Toeplitz form while on the right hand side we have a circulant
matrix.
Definition 5.8 (Circulant Matrix) A circulant n × n matrix Cn is of the form
 
h0 h1 h2

 h0 h1 h2 

 ... ... ... 
Cn = 
 


 h0 h1 h2  
 h2 h0 h1 
h1 h2 h0

where each row is obtained by cyclically shifting to the right the previous row.
Note that Toeplitz as well as circulant matrices are commutative, i.e.:

GH = HG.

Note that the two matrices above only differ by three elements. This difference remains
constant even if the matrices keep growing. Thus, the larger the matrices become the more
similar they become.

Definition 5.9 (Asymptotic Equivalence) Two sequences of n×n matrices An and Bn


with kAn k < M1 < ∞ and kBn k < M2 < ∞ are called asymptotically equivalent when the
Frobenius norm of their difference relative to the order n tends to zero.
kAn − Bn kF
lim √
n→∞ n
This equivalence has many consequences as for growing size n both matrices share
the same eigenvalues and eigenvectors. The latter are particularly simple to compute for
circulant matrices. Mathematically more precisely, it can be shown that for finite l
1 1
tr Aln = lim tr Bnl ; for l = 0,1,...,L < ∞
 
lim
n→∞ n n→∞ n
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 219

Note that the trace of the matrices is equivalent to their sum of eigenvalues, thus
n
1 1X
lim tr Aln = lim (λn,k (An ))l .

n→∞ n n→∞ n
k=1

This is indeed a consequence of Definition 5.9. Let us define An − Bn = ∆n . We then have

Aln − Bnl = (∆n + Bn )l − B l = ∆ln + l∆l−1 Bn + ... + l∆n Bnl−1 .

Let us now consider the difference in terms of the Frobenius norm

kAln − Bnl kF k∆n (∆l−1


n + l∆
l−2
Bn + ... + lBnl−1 )kF
lim √ = lim √ = 0.
n→∞ n n→∞ n

As the remaining terms ∆l−1


n + l∆
l−2
Bn + ... + lBnl−1 are bounded, the leading term turns
to zero and pulls the remaining terms to zero. Thus, Aln is asymptotically equivalent to Bnl
and thus the same holds for their traces.

Theorem 5.5 All circulant matrices of the same size have the same eigenvectors. The
eigenvectors are given by the DFT matrix of the corresponding dimension n,i.e.,

FnH Cn Fn = Λ. (5.1)

Conversely, if the n eigenvalues are the entries of a diagonal matrix Λ, then

Cn = Fn ΛFnH

is a circulant matrix.

Thus, different to general matrices, the knowledge of the size n is sufficient to know
all eigenvectors. With this knowledge the eigenvalues can be computed. Even more, the
computation of the eigenvalues in Equation (5.1) is not of cubic complexity but requires
only two FFTs, thus O(n log(n)).

Example 5.14 Let us reconsider the equalizer problem again:


     
rL−1   aL−1 vL−1
 rL−2  h0 h1 h2
  aL−2   vL−2
   
  
 rL−3   h0 h1 h2   aL−3  

 = .. .. ..  + vL−3  = Tn a + v.

 ..   . . .   ..   .. 
 .   .   . 
h0 h1 h2
r2 a0 v2
220 Signal Processing 1

We can reformulate this by simply adding a few more lines:


 
rL−1  
 rL−2  h0 h1 h2    
aL−1 vL−1
  
 rL−3   h0 h1 h2 
   ... ... ...

 aL−2  
  vL−2 

 ..   aL−3 vL−3
 . = +  = Cn a + v.
   

   h0 h1 h2   ..   .. 
 r2    .   . 
   h2 h0 h1 
 x  a0 v2
h1 h2 h0
x

The problem of the equalizer is to find a linear filter G[r] so that a reappears. As the entries
x do not matter we can generate them for example by
   
rL−1   aL−1
 rL−2  h0 h1 h2  aL−2 
 
vL−1
  
 rL−3   h0 h1 h2   
  aL−3   vL−2 

 ..  
  . . . . . . . . .
  
  ..   vL−3 

 . =  .  +  .
   h 0 h 1 h2    .. 
 r2     a0   . 

 x 
  h 0 h1 h 2 
 aL−1 

v2
h0 h1 h2
x aL−2

We recognize now that a small part of the data is appearing twice, at the beginning and
at the end of the transmission. If the part from the end is replicated at the beginning, we
call it cyclic prefix, otherwise cyclic postfix. By this we have now an equivalent form of
transmission that makes the channel appearing cyclic although physically it is not. At the
receiver end we can decode simply by applying FFT matrices Fn

r = Cn a + v
Fn r = Fn C n a + Fn v
= Fn Cn FnH Fn a + Fn v.
| {z }
Λn

We recognize a linear distortion by the diagonal matrix Λn . Assuming that all eigenvalues
are non-zero such distortion can be compensated by its inverse:

Λ−1 −1
n Fn r = Fn a + Λ n Fn v
FnH Λ−1 H −1
n F n r = a + F n Λn F n v
a ≈ FnH Λ−1 F r.
| {zn n}
Gn

The linear filter Gn = FnH Λ−1


n Fn delivers the transmitted symbols. This happens only per-
fectly as long as no noise v occurs. The complexity of this is relatively moderate as the two
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 221

FFT operations require n log(n) operations each and the inverse Λ−1
n is of order n divisions.
Note that we do not claim this to be the optimal decoder. As the noise is changed by Λ−1 n
and we do not take this into account, there is better ones. Nevertheless, this is the basic
method of current WiFi and LTE transmission methods.

Theorem 5.6 The eigenvalues of a circulant matrix of size n × n are given by its Fourier
transform evaluated at the frequencies Ωk = 2πk
n
; k = 0,1,...,n − 1.

Proof: Let us denote the Fourier transform of the circulant set c0 ,c1 ,...,cn−1 by
n−1
X
jΩ
ck e−jkΩ .

c̃ e =
k=0

Consider now the following term


  2π   2π(n−1)
 
j0
c̃ (e ) c̃ e−j n ... c̃ e −j n
  2π   2π(n−1)  
 c̃ (ej0 ) wn c̃ e−j n . . . wnn−1 c̃ e−j n
 

  2π   2π(n−1)  
Cn Fn =  c̃ (ej0 ) wn2 c̃ e−j n . . . wnn−2 c̃ e−j n
 

 
 .. 

 .  2π   


2π(n−1)
c̃ (e ) wn c̃ e−j n
j0 n−1
... wn c̃ e−j n


with the twiddling factor wn = e−j n . Comparing this with its identity
 
λ0 λ1 . . . λn−1
 λ0 n−1
 w λ
n 1 . . . wn λn−1


 λ0
F n Λn =  wn λ1 . . . wnn−2 λn−1
2
,

 .. 
 . 
λ0 wnn−1 λ1 . . . wn λn−1

we find that the theorem holds.

5.4.3 Relations of Toeplitz Matrices with Previous Considera-


tions
Note that for a given Toeplitz matrix, it is rather straightforward to compute the DFT
corresponding to its first row entries:
L−1
−j(L−1)Ω
X
−jΩ
jΩ
H(e ) = h0 + h1 e + ... + hL−1 = hk e−jkΩ .
k=0
222 Signal Processing 1

Let us recall the definition of the spectral norm in Equation (2.6)

kTn xk2
lim kTn k2,ind = lim sup = sup |H(ejΩ )| (5.2)
n→∞ n→∞ x kxk2 Ω

valid for Toeplitz matrices. We connected there a statement on matrices with classical
Fourier Transform terms. We simply assumed at the time that the worst case vector, i.e.,
the vector that maximizes the ratio above, is given by a harmonic excitation (2.3). We now
understand that this assumption was indeed correct as with growing size n the Toeplitz
matrix becomes asymptotically equivalent to a circulant matrix. The eigenvector for that
is indeed a harmonic excitation with a single frequency.
A further means for connecting Toeplitz matrix properties with those of the Fourier
Transform are given in form of the following theorem.

Theorem 5.7 (Szegö) Given a continuous function g(x), the eigenvalues λn,k ; k =
0,1,...,n − 1 of a square Toeplitz matrix Tn with elements {h0 ,h1 ,...,hL−1 } asymptotically
satisfy the following condition:
n−1 Z π
1X 1
lim g(λn,k ) = g(H(ejΩ ))dΩ
n→∞ n 2π −π
k=0

PL−1
where H(ejΩ ) = k=0 hk e−jΩk .

Proof: We first show that


n−1 Z π
1X l 1
lim λn,k = H l (ejΩ )dΩ. (5.3)
n→∞ n 2π −π
k=0

Once this holds, a continuous function g(x) can be constructed by a linear combination
of polynomials. For Equation (5.3) to hold we recall that the eigenvalues of a circulant
matrix are identical to the corresponding Fourier transform at equidistant frequency points
Ωk = 2π
n
k. Thus
n−1 n
1X l 1 X l  −j 2π k 
λ = H e n .
n k=0 n,k n k=1
With growing n the sum can be replaced by an integral over the range 2π and we are done.

Example 5.15 Let us use g(x) = x. We then find:


n−1 Z π
1X 1 1
lim λn,k = H(ejΩ )dΩ = h0 = lim tr(Tn ).
n→∞ n 2π −π n→∞ n
k=0
Univ.-Prof. DI. Dr.-Ing. Markus Rupp 223

Example 5.16 Let us use g(x) = ln(x). We then find:


n−1 Z π
1X 1
ln H(ejΩ ) dΩ

lim ln(λn,k ) =
n→∞ n 2π −π
k=0
n−1
1 Y
= lim ln λn,k
n→∞ n
k=0
1
= lim ln det(Tn )
n→∞ n
1
= lim ln(det(Tn ) n ).
n→∞

As such terms typically appear in the capacity of multiple antenna transmission systems
(see Example 4.34), the theorem is helpful to compute the capacity bounds.

Example 5.17 Consider Shannon capacity for OFDM transmission over channel H (in
frequency domain) where we have to compute
HRH H
 
max ln det I + .
tr(R)≤P σv2
Often the simpler expression
HH H
 
ln det I + ,
σv2
known as Mutual Information, suffices. For linear time-invariant systems, matrix H is of
Toeplitz form. For very large matrices Hn we can now compute such expression to be
Hn HnH
 
lim ln det I +
n→∞ σv2
as we know that the Toeplitz matrices are asymptotically equivalent to circulant matrices:
Hn → Cn . The Hermitian of a circulant matrix remains a circulant matrix, leaving us
with a circulant matrix Cn CnH instead. The expression I + Cn CnH /σv2 can thus also be
interpreted as a circulant matrix and we can compute the desired value in two ways:
1 n−1
Hn HnH n

1X
ln λn,k (I + Cn CnH /σv2 )

lim ln det I + 2
= lim
n→∞ σv n→∞ n
k=0

where the eigenvalues λn,k of I + Cn CnH /σv2 can quickly be computed by an FFT.
Alternatively, we can compute the desired expression also by
1 Z π 
Hn HnH n |H(ejΩ )|2
 
1
lim ln det I + = ln 1 + dΩ.
n→∞ σv2 2π −π σv2
Bibliography

[1] Unbehauen, Systemtheorie, Oldenbourg.

[2] Papoulis, Signal Analysis, McGraw Hill.

[3] Proakis, Manolakis, Digital Signal Processing, MacMillan.

[4] Oppenheim, Schafer, Discrete Time Signal Processing, Prentice Hall.

[5] Moon, Stirling, Mathematical Methods and Algorithms for Signal Processing, Prentice
Hall.

[6] M. Vetterli, J. Kovacevic, and V. K. Goyal, Signal Processing: Foundations


http://www.fourierandwavelets.org

224

You might also like