
Math 114: Measure, Integration and Banach

Spaces — Course Notes

Fabian Haiden

November 29, 2016


Contents

Introduction
1 The Banach–Tarski paradox
2 Cantor’s set and the topology on R
3 Lebesgue Measure
4 Integration
5 Axiomatic measure theory
6 Differentiation and Integration
7 Lp spaces
8 Inner product spaces
9 Fourier analysis
10 Baire’s theorem and applications
11 Hausdorff Measure and Fractals

Introduction

He who seeks for methods without having a definite problem in mind
seeks in the most part in vain.
– David Hilbert

An alternative title for this course is The Journey towards Functional Anal-
ysis. Functional analysis is the theory of infinite-dimensional vector spaces over
the real or complex numbers. In typical examples the “vectors” are functions
of some specified type on a space, such as Rn . The motivation for this theory
and major applications come from:
• Partial differential equations, in particular proving existence of solutions.
• Finding a natural setting for the Fourier transform.

• The mathematical foundations of quantum mechanics.


The main prerequisites are measure theory and point set topology, which are
intriguing subjects by themselves which we will develop in detail.
There are many new phenomena when one passes from finite-dimensional
vector spaces to infinite-dimensional ones. First, there are many different
notions of convergence of a sequence of vectors. For example, suppose
$f_n, f : [0, 1] \to \mathbb{R}$ are continuous functions, then one natural choice is

\[ f_n \to f \quad \text{if and only if} \quad f_n(x) \to f(x) \text{ for all } x \in [0, 1] \]

which is pointwise convergence, while another is

\[ f_n \to f \quad \text{if and only if} \quad \int_0^1 |f_n(x) - f(x)|^2 \, dx \to 0 \]

which is convergence in mean squared error. Of course the choice will depend
on the particular application one has in mind, but the point is that one needs
to carefully keep track of these different notions, unlike for $\mathbb{R}^n$.
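
To see that the two notions really differ, here is a minimal sketch using the example $f_n(x) = x^n$ with limit candidate $f = 0$. This example is our own assumption and is not taken from the notes. The mean squared error $\int_0^1 x^{2n}\,dx = 1/(2n+1)$ tends to zero, while $f_n(1) = 1$ for every $n$, so $f_n$ does not converge to $0$ pointwise.

```python
from fractions import Fraction

def f_n(n, x):
    """The continuous function f_n(x) = x**n on [0, 1] (illustrative choice)."""
    return x ** n

def mean_square_error(n):
    """Exact value of the integral of |f_n(x) - 0|^2 over [0, 1], which is 1 / (2n + 1)."""
    return Fraction(1, 2 * n + 1)

for n in (1, 10, 100, 1000):
    # The mean squared distance to f = 0 shrinks, but the value at x = 1 stays at 1.
    print(n, float(mean_square_error(n)), f_n(n, 1))
```

The error is computed in closed form with exact fractions, so no numerical integration is involved.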
The next point, which is related to the first, is that an infinite-dimensional
vector space, together with a notion of distance of vectors, may not be complete
in the same sense in which Q is not complete, i.e. there are Cauchy sequences

which have no limit. This is an issue which manifests itself in Fourier theory.
Take a continuous function $f : [0, 2\pi] \to \mathbb{C}$ which has Fourier coefficients
\[ a_n = \frac{1}{2\pi} \int_0^{2\pi} f(x) e^{-inx} \, dx, \qquad n \in \mathbb{Z}. \]
The Parseval identity says that the Fourier transform preserves length, i.e.
\[ \|a\|^2 := \sum_{n \in \mathbb{Z}} |a_n|^2 = \frac{1}{2\pi} \int_0^{2\pi} |f(x)|^2 \, dx =: \|f\|^2. \]
The sequences $a = (a_n)_{n \in \mathbb{Z}}$ with $\sum_n |a_n|^2 < \infty$ form the prototypical example
of a complete infinite-dimensional inner product space. However some (in fact
most) of such sequences are not the Fourier coefficients of a continuous function,
because such functions do not form a complete space with the above notion of
length. It turns out that the theory of Riemann integration is not sufficient to
resolve this discrepancy and one needs the more powerful theory of Lebesgue
integration, which is a process of completion, like passing from Q to R.
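
As a sanity check of the Parseval identity, the following sketch evaluates both sides for the test function $f(x) = 1 + 2\cos x$. The choice of $f$, the helper names and the use of an equispaced Riemann sum are all assumptions made for illustration. For a trigonometric polynomial of low degree such a sum reproduces the integrals up to rounding.

```python
import cmath
import math

def f(x):
    """Hypothetical test function f(x) = 1 + 2 cos(x)."""
    return 1 + 2 * math.cos(x)

def fourier_coefficient(func, n, samples=64):
    """Approximate a_n = (1 / 2 pi) * integral of func(x) e^{-inx} over [0, 2 pi]
    by an equispaced Riemann sum with `samples` points."""
    total = 0
    for j in range(samples):
        x = 2 * math.pi * j / samples
        total += func(x) * cmath.exp(-1j * n * x)
    return total / samples

def mean_square(func, samples=64):
    """Approximate (1 / 2 pi) * integral of |func(x)|^2 over [0, 2 pi]."""
    total = 0.0
    for j in range(samples):
        total += abs(func(2 * math.pi * j / samples)) ** 2
    return total / samples

coefficients = {n: fourier_coefficient(f, n) for n in range(-5, 6)}
norm_a_squared = sum(abs(a) ** 2 for a in coefficients.values())
norm_f_squared = mean_square(f)
print(norm_a_squared, norm_f_squared)
```

Here only $a_{-1}, a_0, a_1$ are non-zero, so truncating the sum over $n$ to $|n| \le 5$ loses nothing.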
The Lebesgue integral also has many convenient technical properties, and
answers questions like when does
\[ \int f_n \to \int f, \qquad n \to \infty \]
leading to the Dominated convergence theorem, or when can one change the order
of integration, i.e.
\[ \int_{X \times Y} f = \int_X \int_Y f = \int_Y \int_X f \]
which is the content of Fubini’s theorem.


Early on, we will encounter a certain tension between set theory, as ax-
iomatized by Zermelo-Fraenkel, and measure and integration theory, due to
Lebesgue. First, suppose E ⊂ [0, 1] is an arbitrary subset. One could try to
assign a “length”, l(E), to E, which is intuitively the probability that a ran-
domly chosen number in [0, 1] is an element of E. It turns out that it is not
possible to do this in any useful way, and one must restrict to a class of subsets,
called measurable, which are much fewer than all subsets. Indeed without this
restriction set theory leads to “paradoxes” such as Banach–Tarski, which allows
for two identical copies of a solid to be created just by cutting it into finitely
many pieces and moving them around.
Another point is that in measure theory one disregards sets of measure zero
or null sets, which can nevertheless be uncountable, thus “big” from the point
of view of set theory. A consequence of this is that one often identifies functions
which differ only on a null set (e.g. a single point), and thus cannot be evaluated
at individual points, so expressions like f (x) no longer make sense. Although
this might seem bizarre at first, it actually makes one’s life simpler in practice.
Since Lebesgue theory is so powerful, it is natural to try to generalize it.
One such generalization is Hausdorff measure which measures the size of spaces

with non-integer dimension, like fractals, for example Cantor’s set (see Figure 1).
Thus it interpolates between 0-dimensional measure (counting), 1-dimensional
measure (length), 2-dimensional measure (area), and so on. From a more ax-
iomatic point of view one can study abstract measure theory, which provides the
mathematical foundation for probability theory and information theory.

Figure 1: First iterations in the construction of Cantor’s set, a fractal of
dimension ≈ 0.63.

Chapter 1

The Banach–Tarski paradox

We are not very pleased when we are forced to accept a mathematical
truth by virtue of a complicated chain of formal conclusions and
computations, which we traverse blindly, link by link, feeling our way by
touch. We want first an overview of the aim and of the road; we want
to understand the idea of the proof, the deeper context.
– Hermann Weyl

1 The paradox
The Banach–Tarski “paradox” asserts that one can take a solid ball $B$ in $\mathbb{R}^3$
and cut it into finitely many pieces which can then be reassembled to give two full
copies of $B$.
Theorem 1.1 (Banach–Tarski). Let $B$ be the solid unit ball in $\mathbb{R}^3$, then there
exists a decomposition
\[ B = B_1 \cup \ldots \cup B_n \tag{1} \]
of $B$, and Euclidean transformations (rotation-translations) $T_1, \ldots, T_n$ such that
there is another decomposition
\[ T_1(B_1) \cup \ldots \cup T_n(B_n) = B \cup B' \tag{2} \]
where $B'$ is a translated, disjoint copy of $B$.
The proof is non-constructive: it does not tell you what the pieces $B_i$ are.
One can imagine them as very noisy, much more pathological in shape than the
solids one deals with in geometry. The number $n$, on the other hand, comes
explicitly out of the proof. One can achieve $n = 10$, though that requires a bit
more work than larger values of $n$. The paradox comes from the fact that it
contradicts basic conservation of mass and our everyday experience.
Mathematically, it shows the impossibility of assigning a volume, in any
reasonable sense, to all subsets of $\mathbb{R}^3$.

2 The proof
The main steps of the proof are:
1. Find a paradoxical decomposition of the free group with two generators, $F_2$.
2. Realize $F_2$ as a group of rotations in $\mathbb{R}^3$.
3. Use the paradoxical decomposition of $F_2$ to get a paradoxical decomposition
   of the hollow unit sphere $S^2$.
4. Extend the decomposition of $S^2$ to the solid unit ball.

Step 1
What is $F_2$? Its elements are possible strings of letters $a$, $b$, $a^{-1}$, and $b^{-1}$,
for example

\[ a^{-1} b b a b^{-1} b^{-1} a b a a a b \tag{3} \]

with the only restriction that no consecutive pair of letters can be cancelled, i.e.
$aa^{-1}$, $a^{-1}a$, $bb^{-1}$, $b^{-1}b$ do not appear as substrings. Multiplication is defined by
concatenating the strings and then cancelling any forbidden pairs, for example

\[ aba^{-1}bba \cdot a^{-1}b^{-1}baa = aba^{-1}bbaa. \tag{4} \]

The neutral element, $e$, is the empty string, while the inverse is formed by
inverting each letter and reversing their order, for example

\[ \left( ab^{-1}a^{-1}b \right)^{-1} = b^{-1}aba^{-1}. \tag{5} \]

A beautiful way of visualizing $F_2$ is via its Cayley graph, see Figure 2. The graph
has a vertex for each element of $F_2$ and an edge whenever two such elements
are related by left-multiplication with one of the letters $a$, $b$, $a^{-1}$, $b^{-1}$.
Write $S(a)$ for the set of elements in $F_2$ which start with $a$, and similarly
$S(b)$, $S(a^{-1})$, $S(b^{-1})$. Then on the one hand we have a partition

\[ F_2 = \{e\} \cup S(a) \cup S(a^{-1}) \cup S(b) \cup S(b^{-1}) \tag{6} \]

but also

\[ F_2 = S(a) \cup aS(a^{-1}) = S(b) \cup bS(b^{-1}). \tag{7} \]

Here $aS(a^{-1})$ is the set of elements of $F_2$ which are of the form
$a \cdot x$ for some $x \in S(a^{-1})$. To see (7), note that $aS(a^{-1})$ is exactly the set
of strings in $F_2$ which do not start with $a$: the factor $a$ cancels the leading
$a^{-1}$ of $x$, and the letter following it cannot be $a$.
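
The string description of $F_2$ translates directly into code. The sketch below is an illustration under our own conventions: the capital letters `A` and `B` stand for $a^{-1}$ and $b^{-1}$, and all function names are hypothetical. It implements reduction, multiplication and inversion, and checks the first equality in (7) on all words of length at most 4.

```python
from itertools import product

# "A" stands for a^{-1} and "B" for b^{-1}; the empty string is the neutral element e.
INVERSE = {"a": "A", "A": "a", "b": "B", "B": "b"}

def reduce_word(word):
    """Cancel adjacent inverse pairs until none remain."""
    stack = []
    for letter in word:
        if stack and stack[-1] == INVERSE[letter]:
            stack.pop()
        else:
            stack.append(letter)
    return "".join(stack)

def multiply(u, v):
    """Concatenate two words and cancel forbidden pairs."""
    return reduce_word(u + v)

def inverse(word):
    """Invert each letter and reverse the order."""
    return "".join(INVERSE[letter] for letter in reversed(word))

def reduced_words(max_length):
    """All reduced words of length at most max_length."""
    words = set()
    for n in range(max_length + 1):
        for letters in product("aAbB", repeat=n):
            word = "".join(letters)
            if reduce_word(word) == word:
                words.add(word)
    return words

print(multiply("abAbba", "ABbaa"))  # the product in example (4)
print(inverse("aBAb"))              # the inverse in example (5)

# Check F_2 = S(a) ∪ a S(a^{-1}) on the finite set of words of length <= 4.
words = reduced_words(4)
starts_with_a = {w for w in words if w.startswith("a")}
a_times_S_A = {multiply("a", w) for w in reduced_words(5) if w.startswith("A")}
print((starts_with_a | a_times_S_A) == words, starts_with_a.isdisjoint(a_times_S_A))
```

Words of length up to 5 are needed for $S(a^{-1})$ because multiplying by $a$ shortens each of them by one letter.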

Figure 2: Cayley graph of $F_2$.

To deal with the neutral element $e$ we define

\begin{align*}
S_1 &:= S(a) \cup \{e, a^{-1}, a^{-2}, a^{-3}, \ldots\} \\
S_2 &:= S(a^{-1}) \setminus \{a^{-1}, a^{-2}, a^{-3}, \ldots\} \\
S_3 &:= S(b) \\
S_4 &:= S(b^{-1})
\end{align*}

then we get partitions

\[ F_2 = S_1 \cup S_2 \cup S_3 \cup S_4 = S_1 \cup aS_2 = S_3 \cup bS_4. \]

Note the analogy with Banach–Tarski. We have found a partition of $F_2$ into 5
pieces, so that after transformation (multiplying on the left), we can reassemble
four of them to get two copies of $F_2$. Visually, this manifests itself in the
self-similar nature of the Cayley graph, where $aS(a^{-1})$ is a scaled up version of
$S(a^{-1})$.

Step 2
So far we have viewed $F_2$ as an abstract group, its elements given by strings
of symbols. The next step is to realize each element of $F_2$ as a rotation in 3-D
space, in a way compatible with composition and so that no two elements get
mapped to the same rotation. In other words, we want to find an injective group
homomorphism $F_2 \to SO(3)$. We choose

\[
a \mapsto \begin{pmatrix} \frac{3}{5} & \frac{4}{5} & 0 \\ -\frac{4}{5} & \frac{3}{5} & 0 \\ 0 & 0 & 1 \end{pmatrix} =: A,
\qquad
b \mapsto \begin{pmatrix} 1 & 0 & 0 \\ 0 & \frac{3}{5} & \frac{4}{5} \\ 0 & -\frac{4}{5} & \frac{3}{5} \end{pmatrix} =: B
\]

which are rotations about the z-axis and x-axis, respectively. In order to be
compatible with composition, we must send any other element of $F_2$ to the
corresponding product of matrices, e.g. $ab^{-1} \mapsto AB^{-1}$. The tricky thing to
check is that no non-empty string is mapped to the identity matrix.
It suffices to consider $5A$, $5B$, $5A^{-1}$, $5B^{-1}$ and check that no non-trivial
composition gives a matrix with all coefficients divisible by 5. Thus, we consider
coefficients as elements in $\mathbb{F}_5$, the field with 5 elements, and the matrices

\[
5A = \begin{pmatrix} 3 & 4 & 0 \\ -4 & 3 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \qquad
5B = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 3 & 4 \\ 0 & -4 & 3 \end{pmatrix},
\]
\[
5A^{-1} = \begin{pmatrix} 3 & -4 & 0 \\ 4 & 3 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \qquad
5B^{-1} = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 3 & -4 \\ 0 & 4 & 3 \end{pmatrix}
\]

as transformations $\mathbb{F}_5^3 \to \mathbb{F}_5^3$. Considered over $\mathbb{F}_5$, each of these four matrices is
of rank 1, because
\[ 3 \begin{pmatrix} 3 \\ -4 \end{pmatrix} = \begin{pmatrix} 4 \\ 3 \end{pmatrix} \mod 5 \]
and thus they have 1-dimensional range and 2-dimensional kernel. The crucial
property about these eight subspaces is that even though range($5A$) is contained
in ker($5A^{-1}$), it is not contained in the kernels of $5A$, $5B$, or $5B^{-1}$, and similar
statements hold for the ranges of the other three matrices. Thus, in a product
involving the four linear transformations $5A$, $5B$, $5A^{-1}$, $5B^{-1}$, and such that
$5A$ and $5A^{-1}$ (resp. $5B$ and $5B^{-1}$) are never next to each other, the range
of one matrix is not contained in the kernel of the next. Since each range is
1-dimensional, the product again has rank 1, and so it is non-zero.
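
The argument can be checked by brute force for short words. The sketch below is a finite sanity check, not a proof: it multiplies the integer matrices $5A$, $5A^{-1}$, $5B$, $5B^{-1}$ modulo 5 along every reduced word of length at most 6 and confirms that the product is never the zero matrix. The letter encoding (`A` for $a^{-1}$, `B` for $b^{-1}$), the length bound and the helper names are our own assumptions.

```python
from itertools import product

INVERSE = {"a": "A", "A": "a", "b": "B", "B": "b"}

# Integer matrices 5A, 5A^{-1}, 5B, 5B^{-1}; entries are reduced mod 5 when multiplying.
FIVE_TIMES = {
    "a": ((3, 4, 0), (-4, 3, 0), (0, 0, 5)),
    "A": ((3, -4, 0), (4, 3, 0), (0, 0, 5)),
    "b": ((5, 0, 0), (0, 3, 4), (0, -4, 3)),
    "B": ((5, 0, 0), (0, 3, -4), (0, 4, 3)),
}
IDENTITY = ((1, 0, 0), (0, 1, 0), (0, 0, 1))
ZERO = ((0, 0, 0), (0, 0, 0), (0, 0, 0))

def matmul_mod5(X, Y):
    """Product of two 3x3 matrices with entries reduced modulo 5."""
    return tuple(
        tuple(sum(X[i][k] * Y[k][j] for k in range(3)) % 5 for j in range(3))
        for i in range(3)
    )

def image_mod5(word):
    """Image of a word under a -> 5A, b -> 5B, computed over the field with 5 elements."""
    result = IDENTITY
    for letter in word:
        result = matmul_mod5(result, FIVE_TIMES[letter])
    return result

def is_reduced(word):
    """True if no letter is directly followed by its inverse."""
    return all(word[i + 1] != INVERSE[word[i]] for i in range(len(word) - 1))

reduced = [
    "".join(letters)
    for n in range(1, 7)
    for letters in product("aAbB", repeat=n)
    if is_reduced("".join(letters))
]
all_nonzero = all(image_mod5(word) != ZERO for word in reduced)
print(len(reduced), all_nonzero)
```

If a word of length $k$ were mapped to the identity rotation, its image here would be $5^k$ times the identity, which is zero modulo 5, so a non-zero product rules this out.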

Step 3
Each rotation corresponding to an element of $F_2 \setminus \{e\}$ fixes exactly two points
on the unit sphere $S^2$. We let $C \subset S^2$ be the union of these fixed points, which
is countable and invariant under $F_2$. Then $F_2$ acts freely on $S^2 \setminus C$, meaning
that if we fix a point $x$ in an orbit in $S^2 \setminus C$, then we can identify that orbit
with $F_2$ via $g \mapsto gx$.
Now comes the non-constructive step. The axiom of choice implies the existence
of a subset $X \subset S^2 \setminus C$ which contains exactly one point from each orbit.
If we let $X_i = S_i X = \{gx \mid g \in S_i, x \in X\}$, then

\[ S^2 \setminus C = X_1 \cup X_2 \cup X_3 \cup X_4 = X_1 \cup AX_2 = X_3 \cup BX_4. \]

To get rid of the set $C$, we use the following simple trick. To illustrate it,
consider first the unit circle $S^1$, any point $p \in S^1$, and an irrational rotation $R$
of the plane. Then all the points $p, R(p), R^2(p), R^3(p), \ldots$ are distinct and if we
put $P = \{p, R(p), R^2(p), \ldots\}$, $Q = S^1 \setminus P$, then

\[ S^1 = P \cup Q, \qquad S^1 \setminus \{p\} = RP \cup Q \]

and these are disjoint unions. This means we can plug a hole in a circle by
cutting it into two pieces and rotating one of them! In fact the same thing works
if instead of $p$ we have any countable subset of $S^1$, since there are uncountably
many rotations. By the same reasoning, we can find a rotation $R$ of 3-D space
and a decomposition $S^2 = Y_1 \cup RY_2$ such that $S^2 \setminus C = Y_1 \cup Y_2$.
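
A finite, floating-point illustration of the circle trick is sketched below. It assumes a rotation by 1 radian (irrational because $\pi$ is irrational) and the starting point at angle 0; both choices and all names are hypothetical. It checks that the first 1001 orbit points are distinct and that rotating the orbit shifts it by one step, so that $p$ itself drops out of $RP$.

```python
import math

TWO_PI = 2 * math.pi
ANGLE = 1.0  # rotation by 1 radian, an irrational multiple of a full turn

def rotate(theta, angle=ANGLE):
    """Apply the rotation R to the point of the circle at angle theta."""
    return (theta + angle) % TWO_PI

def orbit(start, count):
    """Angles of p, R(p), ..., R^(count-1)(p) for the point p at angle `start`."""
    points = [start % TWO_PI]
    for _ in range(count - 1):
        points.append(rotate(points[-1]))
    return points

P = orbit(0.0, 1001)                      # truncated version of the set P
RP = [rotate(theta) for theta in P[:-1]]  # image of the first 1000 points under R

ordered = sorted(P)
smallest_gap = min(b - a for a, b in zip(ordered, ordered[1:]))
print(smallest_gap > 0, RP == P[1:], P[0] in RP)
```

Only finitely many points can be inspected this way; the statement for the full infinite orbit relies on the irrationality of the rotation.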
Combining the two ways of decomposing $S^2 \setminus C$ we get a partition
$S^2 = Z_1 \cup \ldots \cup Z_8$ of the 2-sphere into eight pieces and rotations $T_1, \ldots, T_8$ such that

\[ S^2 = T_1 Z_1 \cup \ldots \cup T_4 Z_4 = T_5 Z_5 \cup \ldots \cup T_8 Z_8. \]

Step 4
We are almost done. For each piece $Z_k$ of $S^2$ there is a corresponding subset of
the unit ball without its center, $B \setminus \{0\}$, consisting of all points which map to
$Z_k$ under radial projection. This immediately gives us a version of the theorem
for $B \setminus \{0\}$, and in order to get all of $B$ we apply the same trick as above to a
circle in $B$ which contains the origin.

3 Conclusion
The Banach–Tarski paradox shows the impossibility of assigning a volume Vol(A)
to every subset $A$ of $\mathbb{R}^3$ such that
1. Vol(A) = Vol(T (A)) where T is a rotation or translation,
2. Vol(A ∪ B) = Vol(A) + Vol(B) for any disjoint A and B,
3. Vol(B) > 0 if B is a solid ball.

Rather than giving up on any of these properties, which are actually a rather
minimal set of assumptions, we will instead adopt the point of view that one
should not try to assign a volume to every subset of $\mathbb{R}^n$, such as those
constructed in the proof above.
One could also argue that the axiom of choice is too strong and should be
rejected. However this would not lead to any simplification of measure theory
by itself.

References and further reading


• S. Banach, A. Tarski: “Sur la décomposition des ensembles de points en
parties respectivement congruentes.”
• T. Tao: “The Banach–Tarski Paradox.”
www.math.ucla.edu/~tao/preprints/Expository/banach-tarski.pdf
• L. M. Wapner: “The Pea and the Sun: A Mathematical Paradox.”

• “The Banach–Tarski Paradox.” Video by Vsauce.


www.youtube.com/watch?v=s86-Z-CbaHA

Chapter 2

Cantor’s set and the topology on R

Even fairly good students, when they have obtained the solution of the
problem and written down neatly the argument, shut their books and
look for something else. Doing so, they miss an important and in-
structive phase of the work. ... A good teacher should understand and
impress on his students the view that no problem whatever is completely
exhausted.
– George Polya

The purpose of this chapter is to recall the properties of open/closed subsets
of $\mathbb{R}$, and to build some intuition around these notions. To that end, we will
discuss Cantor’s set, which was mentioned in the introduction, in some detail,
as it is an important example throughout real analysis. As a warmup, consider
the following question.
Problem 2.1. Can you fit infinitely many disjoint open intervals $(a_k, b_k)$,
$k = 1, 2, 3, \ldots$, into the unit interval $[0, 1]$? What is the maximum total length
$l = \sum_k (b_k - a_k)$ that can be achieved?

One solution is to take the picture proof that
\[ \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \frac{1}{16} + \ldots = 1 \]
giving intervals
\[ \left( \tfrac{1}{2}, 1 \right), \left( \tfrac{1}{4}, \tfrac{1}{2} \right), \left( \tfrac{1}{8}, \tfrac{1}{4} \right), \ldots \]
which satisfy the requirement and have total length 1.
The complement of the intervals in $[0, 1]$ is the countable set
\[ \left\{ 0, \ldots, \tfrac{1}{8}, \tfrac{1}{4}, \tfrac{1}{2}, 1 \right\} \tag{1} \]

[Figure: the points 0, 1/16, 1/8, 1/4, 1/2, 1 marked on the unit interval.]

which is closed. Recall that a subset $E \subset \mathbb{R}$ is closed if it is closed under taking
limits, i.e. if $x_n$ is a sequence in $E$ with $x_n \to x$, then $x \in E$. The complements
of closed sets are called open and another characterization is that $E \subset \mathbb{R}$ is
open if for each $x \in E$ there is a small interval $(x - \varepsilon, x + \varepsilon)$ entirely contained
in $E$, where $\varepsilon > 0$ is allowed to depend on $x$. Thus we can “wiggle around” a
point in an open set without it leaving that set. The union of any collection of
open sets is again open, so the complement of any union of open intervals, in
particular of countably many, must be closed. Somewhat confusingly a set
can be both open and closed, but in $\mathbb{R}$, these sets are just ∅ and $\mathbb{R}$ itself. To
summarize:
• Any open interval (a, b) is open, as is the empty set ∅.

• (Possibly infinite) unions of open sets are open.


• Finite intersections of open sets are open.
The closed sets, being complements of open sets, satisfy the dual properties
where union and intersection are interchanged.
There is another solution, a construction of Cantor’s, which gives an un-
countable complement and one of the first examples of a fractal. The sequence
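of open intervals is
\begin{gather*}
\left( \tfrac{1}{3}, \tfrac{2}{3} \right) \\
\left( \tfrac{1}{9}, \tfrac{2}{9} \right), \left( \tfrac{7}{9}, \tfrac{8}{9} \right) \\
\left( \tfrac{1}{27}, \tfrac{2}{27} \right), \left( \tfrac{7}{27}, \tfrac{8}{27} \right), \left( \tfrac{19}{27}, \tfrac{20}{27} \right), \left( \tfrac{25}{27}, \tfrac{26}{27} \right) \\
\ldots
\end{gather*}

or better pictorially, the gaps in Figure 1 or thick segments in Figure 3.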

Figure 3: Intervals of length 1/3, 1/9 and 1/27 in the complement of Cantor’s
set.

The construction is iterative: The first interval is the middle third of $[0, 1]$.
Its complement consists of the two closed intervals $[0, 1/3]$ and $[2/3, 1]$, and we take
the open middle thirds of these intervals, and so on. At the $n$-th step we create
$2^{n-1}$ intervals of length $1/3^n$, so the total length of all intervals is
\[ \sum_{n=1}^{\infty} \frac{2^{n-1}}{3^n} = \frac{1}{2} \sum_{n=1}^{\infty} \left( \frac{2}{3} \right)^n = \frac{1}{2} \left( \frac{1}{1 - \frac{2}{3}} - 1 \right) = 1. \tag{2} \]

The complement, $C$, of the above intervals, i.e. what remains of $[0, 1]$ after we
repeatedly erase middle thirds, is called the Cantor set.
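
The iterative construction is easy to carry out with exact rational arithmetic. The following sketch, with a hypothetical helper name, lists the open middle thirds removed in the first few steps and sums their lengths; after $N$ steps the removed length is $1 - (2/3)^N$, which tends to 1.

```python
from fractions import Fraction

def removed_intervals(steps):
    """Open middle thirds removed in the first `steps` steps, grouped by step."""
    remaining = [(Fraction(0), Fraction(1))]  # closed intervals still present
    removed = []
    for _ in range(steps):
        new_remaining, gaps = [], []
        for left, right in remaining:
            third = (right - left) / 3
            gaps.append((left + third, right - third))      # open middle third
            new_remaining.append((left, left + third))      # left closed third
            new_remaining.append((right - third, right))    # right closed third
        removed.append(gaps)
        remaining = new_remaining
    return removed

by_step = removed_intervals(10)
total_removed = sum(b - a for gaps in by_step for a, b in gaps)
print([len(gaps) for gaps in by_step])
print(total_removed, float(total_removed))
```

Using `Fraction` avoids rounding, so the partial sums can be compared exactly with the closed formula.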
A proof that $C$ is uncountable uses ternary (base 3) expansion of real
numbers. For example, 3 becomes 10 in ternary, 1/3 becomes 0.1, 2/3 becomes
0.2. As with decimal ($1 = 0.99999\ldots$), some rational numbers have two ternary
expansions: one ending with 0 repeating, and one ending with 2 repeating, e.g.

\[ 0.22012 = 0.22011222222\ldots \]

To resolve this ambiguity, we agree to not allow ternary expansions ending in
$10000\ldots$ or $12222\ldots$. In fact, we could also ignore this issue, since there are
only countably many such numbers.
Returning to the construction of the Cantor set, we remove in the first step
those numbers in $[0, 1]$ which have a ternary expansion of the form $0.1\ldots$. (Note
that $1/3 = 0.0222\ldots$, $2/3 = 0.2$, by our convention.) At the second step we
remove those numbers which have a ternary expansion of the form $0.01\ldots$ or
$0.21\ldots$. In general, at the $n$-th step we remove those numbers which have a 1
at the $n$-th place of their ternary expansion. The conclusion is: $C$ is just the
set of numbers in $[0, 1]$ which do not have a 1 in their ternary expansion. But
upon replacing $2 \mapsto 1$ we just get the binary expansions of all the numbers in
$[0, 1]$, which form an uncountable set. A caveat: The numbers $1/3 = 0.0222\ldots$
and $2/3 = 0.2$ turn into $0.0111\ldots = 0.1 = \frac{1}{2}$ in binary, so we get both binary
expansions for those (countably many) numbers that have two. We have proven
Theorem 2.1. The Cantor set is uncountable.
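
The ternary description can be tested on rational numbers. The sketch below uses hypothetical helpers and handles rational inputs only. It decides membership in $C$ by repeatedly zooming into the left or right third, which amounts to reading off ternary digits, and it implements the digit replacement $2 \mapsto 1$ for finite digit strings.

```python
from fractions import Fraction

def in_cantor_set(x):
    """Decide whether a rational x survives every middle-third removal."""
    x = Fraction(x)
    if not 0 <= x <= 1:
        return False
    seen = set()
    while x not in seen:  # the orbit of a rational is finite, so this terminates
        seen.add(x)
        if Fraction(1, 3) < x < Fraction(2, 3):
            return False  # x lies in an open middle third
        # Zoom into the left third (digit 0) or the right third (digit 2).
        x = 3 * x if x <= Fraction(1, 3) else 3 * x - 2
    return True

def ternary_value(digits):
    """Number 0.d1 d2 d3 ... in base 3 for a finite list of digits."""
    return sum(Fraction(d, 3 ** (i + 1)) for i, d in enumerate(digits))

def binary_image(digits):
    """Replace each ternary digit 2 by 1 and read the result in base 2."""
    return sum(Fraction(d // 2, 2 ** (i + 1)) for i, d in enumerate(digits))

for x in (Fraction(1, 4), Fraction(1, 3), Fraction(1, 2), Fraction(5, 6)):
    print(x, in_cantor_set(x))
print(ternary_value([0, 2, 0, 2]), binary_image([0, 2, 0, 2]))
```

For example, $1/4 = 0.020202\ldots$ in ternary contains no digit 1, so it belongs to $C$ even though it is not an endpoint of any removed interval.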
What is the “length” of the Cantor set, C? We have not yet given any
rigorous definition, but it is fairly clear that the answer should be zero, for
the following reasons. First, the complement of C in [0, 1] is a union of disjoint
intervals of total length 1. Second, we have seen that C is just the set of numbers
in [0, 1] which have no 1 in their ternary expansion, which should be true for a
randomly chosen number with probability zero. Once we have developed some
measure theory, the Cantor set will serve as an example of an uncountable
measurable subset of R of measure (length) zero.
The Cantor set C also has some curious topological properties. First, by
the above general argument, it is closed, hence compact by the Heine–Borel
theorem. Like Q, it is totally disconnected, that is between any two points
x, y ∈ C there is a point z in the complement of C, i.e. no interval can fit into
C. Also, C has no isolated points: Deleting any one point from C results in
a set which is no longer closed.
On a historical note, the “Cantor set” was in fact discovered by Henry J.S.
Smith in 1874, but rediscovered independently by several mathematicians in
the subsequent years. Curiously, an engraving of an Egyptian column shows a
pattern reminiscent of the Cantor set, see Figure 4.
How complicated can a general open subset of $\mathbb{R}$ be? It turns out they are all
disjoint unions of countably many open intervals.
Proposition 2.2. Any open subset $E \subset \mathbb{R}$ can be written as $E = \bigcup_{n=1}^{\infty} I_n$,
where $I_n$ are disjoint open intervals.
Proof. An open interval $I = (a, b) \subset E$ is maximal if there is no open interval
$J \neq I$ with $I \subset J \subset E$. Here we allow $a = -\infty$ and $b = +\infty$. If $I_1, I_2 \subset E$
are maximal, then they must be equal or disjoint, since otherwise $I_1 \cup I_2$ is
an interval strictly larger than either one. Also, any $p \in E$ is contained in a
maximal interval $(a, b)$ where
\[ a = \inf\{x \in E \mid (x, p] \subset E\}, \qquad b = \sup\{x \in E \mid [p, x) \subset E\} \]
thus $E$ is the disjoint union of the maximal open intervals it contains. There
are only countably many such intervals, since we can pick a distinct rational
number from each.
How many open subsets are there?
Proposition 2.3. The cardinality of the set of open subsets in R is the same
as the cardinality of R.
Proof. We give a proof that generalizes to any second countable topological
space. Note that there are countably many intervals $(a, b)$ with rational
endpoints $a, b \in \mathbb{Q}$. But any open $E \subset \mathbb{R}$ is completely determined by which such
intervals it contains: To test if $x \in E$, see if intervals with rational endpoints
closer and closer to $x$ are contained in $E$. This produces for every open $E \subset \mathbb{R}$ a
countable sequence of 0’s and 1’s uniquely characterizing it, thus a real number
in binary notation. Hence there are at most as many open sets as real numbers.
Conversely, the intervals $(x, \infty)$ for $x \in \mathbb{R}$ are distinct open sets, so there are
at least as many.
We have seen two ways of fitting countably many disjoint intervals of total
length 1 into the unit interval [0, 1]. But can the total length be more than 1?
As one would expect, the answer turns out to be negative, a fact which quickly
follows from Lebesgue measure theory, which is the subject of the next chapter.

References and further reading


• H.J.S. Smith: “On the integration of discontinuous functions.”
• G. Cantor: “Über unendliche, lineare Punktmannigfaltigkeiten V.”
• G.L. Wise and E.B. Hall: “Counterexamples in Probability and Real Anal-
ysis.”
• B. Lumpkin: “Geometry Activities from Many Cultures.”
• M. Schroeder: “Fractals, Chaos, Power Laws.”

Figure 4: Egyptian column with pattern similar to the Cantor set. Engraving
of Ile de Philae from Description d’Egypte by Jean-Baptiste Prosper Jollois and
Edouard Devilliers, Imprimerie Imperiale, Paris, 1809-1828

Chapter 3

Lebesgue Measure

Truth is ever to be found in the simplicity, and not in the multiplicity
and confusion of things.
– Isaac Newton

Consider the set A = Q∩[0, 1] of rational numbers between 0 and 1. Can you
find a finite number of open intervals I1 , . . . , In which cover A, but have total
length < 1? What about if you are allowed to use countably many intervals?
The answer to the first question is negative. The main point is that

\[ [0, 1] = \overline{A} \subset \overline{I_1 \cup \ldots \cup I_n} = \overline{I_1} \cup \ldots \cup \overline{I_n} \]

where we are allowed to interchange closure and union because we are dealing
with just finitely many intervals. If we denote by $\chi_E$ the characteristic function
\[ \chi_E(x) = \begin{cases} 1 & x \in E \\ 0 & x \notin E \end{cases} \]
then since $\chi_{[0,1]} \leq \chi_{\overline{I_1}} + \ldots + \chi_{\overline{I_n}}$ we get
\[ 1 = \int_{\mathbb{R}} \chi_{[0,1]} \, dx \leq \int_{\mathbb{R}} \sum_{k=1}^{n} \chi_{\overline{I_k}} \, dx = \sum_{k=1}^{n} \mathrm{length}(I_k) \]
where we use the Riemann integral. One could replace the use of the Riemann
integral by more elementary arguments, or the theory developed later in this
chapter.
On the other hand, if we can use countably many intervals, then the answer
is “yes”, and their total length can be made smaller than any positive number.
Indeed, $\mathbb{Q}$, and thus $A = \mathbb{Q} \cap [0, 1]$, is countable, so we can put these numbers in
a sequence $q_1, q_2, q_3, \ldots$. Let $\varepsilon > 0$ and let $I_n$ be the open interval of diameter
$\varepsilon/2^n$ centered at $q_n$. Then the $I_n$ clearly cover $A$, and their total length is $\varepsilon$. In
fact, the same argument works for any countable subset $A \subset \mathbb{R}$, e.g. $\mathbb{Q}$, which
cannot be covered by finitely many intervals of finite length at all!
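
The covering argument can be made concrete for an initial segment of an enumeration of $\mathbb{Q} \cap [0, 1]$. The sketch below assumes an enumeration by increasing denominator and a bound of 20 on the denominator, both chosen only for illustration, and places an interval of length $\varepsilon/2^n$ around the $n$-th rational.

```python
from fractions import Fraction

def rationals_in_unit_interval(max_denominator):
    """Enumerate the rationals p/q in [0, 1] with q <= max_denominator, each once."""
    seen, result = set(), []
    for q in range(1, max_denominator + 1):
        for p in range(q + 1):
            r = Fraction(p, q)
            if r not in seen:
                seen.add(r)
                result.append(r)
    return result

def cover(points, epsilon):
    """Open interval of length epsilon / 2**n centered at the n-th point, n = 1, 2, ..."""
    intervals = []
    for n, point in enumerate(points, start=1):
        half_length = epsilon / 2 ** (n + 1)
        intervals.append((point - half_length, point + half_length))
    return intervals

epsilon = Fraction(1, 100)
points = rationals_in_unit_interval(20)
intervals = cover(points, epsilon)
total_length = sum(b - a for a, b in intervals)
print(len(points), float(total_length), total_length < epsilon)
```

However many rationals are enumerated, the total length stays below $\varepsilon$, since it is a partial sum of the geometric series $\sum_n \varepsilon/2^n = \varepsilon$.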

The preceding example emphasizes the importance of allowing infinite col-
lections of intervals in the following definition.
Definition 3.1. The outer measure, $m^*(E)$, of $E \subset \mathbb{R}$ is
\[ m^*(E) = \inf \left\{ \sum_k l(I_k) \;\middle|\; E \subset \bigcup_k I_k \right\} \in \mathbb{R} \cup \{+\infty\} \]
where $(I_k)_k$ is any countable collection of open intervals and $l(I_k)$ is the length
of the $k$-th interval.
The following lemma summarizes some basic properties of m∗ .
Lemma 3.2.
1. If $A \subset B$, then $m^*(A) \leq m^*(B)$ (monotonicity).
2. $m^*([a, b]) = b - a$.
3. $m^*\left( \bigcup_n A_n \right) \leq \sum_n m^*(A_n)$ if $A_n$ form a countable collection of subsets of
$\mathbb{R}$ (countable subadditivity).
Proof. Monotonicity is clear: If the intervals $I_k$ cover $B$, then they also cover
$A$, and inf itself is monotonic.
Since $[a, b] \subset (a - \varepsilon, b + \varepsilon)$ for any $\varepsilon > 0$ we have $m^*([a, b]) \leq b - a$. On the
other hand, if $I_k$ cover $[a, b]$ then by compactness just finitely many suffice, and
they must have total length $\geq b - a$ by the same argument as in the special case
$[0, 1]$ above.
We have already shown the last statement in the case when each $A_n$ is a
point, and so $A = \bigcup A_n$ is countable and $m^*(A) = 0$. We can use a similar idea
for the general case. First, if any $m^*(A_n) = +\infty$, then there is nothing to show,
so we may assume that $m^*(A_n)$ is finite for all $n$. Let $\varepsilon > 0$. By definition of
outer measure we can choose a cover $I_{n,k}$ of $A_n$ by open intervals with
\[ \sum_k l(I_{n,k}) \leq m^*(A_n) + \frac{\varepsilon}{2^n}. \]
Since the $I_{n,k}$ combined form a cover of $\bigcup A_n$ we get
\[ m^*\left( \bigcup_n A_n \right) \leq \sum_{k,n} l(I_{n,k}) \leq \sum_n m^*(A_n) + \varepsilon \]
but $\varepsilon > 0$ was arbitrary, so the claim follows.

Uncountable subadditivity does not hold in general, since e.g. [0, 1] is the
union of all its points.
In the case when all the $A_n$ are disjoint we would expect, or like to have,
countable additivity, i.e.
\[ m^*\left( \bigcup_n A_n \right) = \sum_n m^*(A_n). \]
This turns out to be false. A counterexample can be constructed using the
axiom of choice as follows.
Consider the relation ∼ on [0, 1] where x ∼ y if x − y ∈ Q. Since Q ⊂ R
is a subgroup, this is an equivalence relation. Using the axiom of choice, there
exists an E ⊂ [0, 1] which contains exactly one element from each equivalence
class. Such a set is called a Vitali set. By definition we have
\[ [0, 1] \subset \bigcup_{q \in \mathbb{Q} \cap [-1,1]} (q + E) \subset [-1, 2] \]
hence
\[ 1 \leq m^*\left( \bigcup_{q \in \mathbb{Q} \cap [-1,1]} (q + E) \right) \leq 3 \]
but $m^*(q + E) = m^*(E)$ for any $q$ by translation invariance, and so $m^*(E) > 0$
by countable subadditivity. Thus
\[ \sum_{q \in \mathbb{Q} \cap [-1,1]} m^*(q + E) = +\infty. \]
Since the sets $q + E$ are pairwise disjoint, this contradicts countable additivity,
and even finite additivity, since finitely many
of the $q + E$ already have total outer measure greater than 3.
The argument above did not use the definition of m∗ directly, only its basic
properties. This shows that the failure of countable additivity is not just some
defect of the definition. More precisely, there is no way of assigning a number
m(A) ∈ [0, +∞] to every A ⊂ R such that

1. m is countably additive,
2. m is translation invariant,
3. 0 < m([a, b]) < +∞ for any a < b.
In order to recover countable additivity one is forced to restrict to a class of
subsets of R, called measurable, for which this property holds. This turns out to
be a very non-restrictive requirement: To construct non-measurable sets, one
needs the axiom of choice.
A subset E ⊂ R is measurable if for any X ⊂ R one has

m∗ (X ∩ E) + m∗ (X ∩ (R \ E)) = m∗ (X).

Intuitively, a measurable subset E cuts any other subset cleanly into two pieces.
Note that the left hand side is ≥ the right hand side for any E, X by subadditiv-
ity. Also, the equality is non-trivial only for m∗ (X) < +∞. If E is measurable,
then we define m(E) = m∗ (E), the Lebesgue measure of E. This notation is
to emphasize that m is not defined on non-measurable sets.

Theorem 3.3. If $A_n \subset \mathbb{R}$, $n \geq 1$ are disjoint measurable sets, then $\bigcup_n A_n$ is
also measurable and
\[ m\left( \bigcup_{n=1}^{\infty} A_n \right) = \sum_{n=1}^{\infty} m(A_n) \]
i.e. $m$ is countably additive.


Proof. We show first that if $A, B \subset \mathbb{R}$ are measurable, not necessarily disjoint,
then their union is also measurable. The calculation is

\begin{align*}
m^*(X) &= m^*(X \cap A) + m^*(X \cap A^c) \\
&= m^*(X \cap A \cap B) + m^*(X \cap A \cap B^c) \\
&\quad + m^*(X \cap A^c \cap B) + m^*(X \cap A^c \cap B^c) \\
&\geq m^*(X \cap (A \cup B)) + m^*(X \cap (A \cup B)^c)
\end{align*}

where we use measurability of $A$, measurability of $B$ (twice) and subadditivity
of $m^*$ applied to

\[ A \cup B = (A \cap B) \cup (A \cap B^c) \cup (A^c \cap B) \]

and $(A \cup B)^c = A^c \cap B^c$. Recall that the reverse inequality always holds, so
$A \cup B$ is measurable. Moreover, if $A$ and $B$ are disjoint, then

\[ m^*(\underbrace{(A \cup B) \cap A}_{A}) + m^*(\underbrace{(A \cup B) \cap A^c}_{B}) = m^*(A \cup B). \]

By induction any finite union of measurable sets is again measurable, and $m^*$
is additive on finite unions of disjoint subsets.
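Now we know from the above that $\bigcup_{n=1}^{N} A_n$ is measurable, so for any $X \subset \mathbb{R}$
\[ m^*(X) = m^*\left( X \cap \bigcup_{n=1}^{N} A_n \right) + m^*\left( X \cap \bigcap_{n=1}^{N} A_n^c \right). \]

Note that since each $A_n$ is measurable and they are disjoint, we have
\begin{align*}
m^*\left( X \cap \bigcup_{n=1}^{N} A_n \right) &= m^*(X \cap A_N) + m^*\left( X \cap \bigcup_{n=1}^{N-1} A_n \right) \\
&\;\;\vdots \\
&= \sum_{n=1}^{N} m^*(X \cap A_n)
\end{align*}

for any $X \subset \mathbb{R}$, thus
\[ m^*(X) = \sum_{n=1}^{N} m^*(X \cap A_n) + m^*\left( X \cap \bigcap_{n=1}^{N} A_n^c \right) \]
so applying monotonicity to $\bigcap_{n=1}^{N} A_n^c \supset \bigcap_{n=1}^{\infty} A_n^c$ and passing to the limit we
get
\[ m^*(X) \geq \underbrace{\sum_{n=1}^{\infty} m^*(X \cap A_n)}_{\geq\, m^*\left( \bigcup A_n \cap X \right)} + m^*\left( X \cap \bigcap_{n=1}^{\infty} A_n^c \right) \]
thus $\bigcup_{n=1}^{\infty} A_n$ is measurable. Putting $X = \bigcup_{n=1}^{\infty} A_n$ in the above estimate
shows that
\[ m^*\left( \bigcup_{n=1}^{\infty} A_n \right) \geq \sum_{n=1}^{\infty} m^*(A_n) \]
and since the reverse inequality always holds, the theorem is proven.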
The conclusion is that although m has a smaller domain of definition than
m∗ (the measurable subsets), it has much better properties. But which subsets
of R are measurable?
Theorem 3.4. The following subsets of R are measurable.
1. ∅
2. complements of measurable sets
3. countable unions and intersections of measurable sets
4. open and closed sets
5. sets with zero outer measure, the null sets
Properties 1), 2), and 3) are summarized by saying that measurable sets form
a σ-algebra. This is stronger than just having a boolean algebra — a collection
of subsets closed under complements and finite unions and intersections.
Proof. For 1), 2), there is nothing to check, and 5) follows immediately from
monotonicity. We already know that finite unions of measurable sets are
measurable, and thus by 2) also finite intersections, since $A \cap B = (A^c \cup B^c)^c$. If $A$
is a countable union of measurable $A_n$, not necessarily disjoint, then
\[ A = \bigcup_{n=1}^{\infty} \left( A_n \setminus \bigcup_{k=1}^{n-1} A_k \right) \]
shows that $A$ is also a countable union of disjoint measurable subsets, hence
measurable by Theorem 3.3. Closure
under countable intersections follows by taking complements.
It remains to show that open subsets of $\mathbb{R}$ are measurable; closed sets are then
measurable as their complements. We show first
that any $(a, \infty)$ is measurable. Let $X \subset \mathbb{R}$, $\varepsilon > 0$, and choose a cover of $X$
by open intervals $I_k$ with total length $\leq m^*(X) + \varepsilon$. Then $I_k \cap (a, \infty)$ cover
$X \cap (a, \infty)$ and $I_k \cap (-\infty, a]$ cover $X \cap (-\infty, a]$, and the lengths of the two
parts of each $I_k$ add up to $l(I_k)$, so
\[ m^*(X \cap (a, \infty)) + m^*(X \cap (-\infty, a]) \leq \sum_{k=1}^{\infty} l(I_k) \leq m^*(X) + \varepsilon. \]
Since $\varepsilon > 0$ was arbitrary, $(a, \infty)$ is measurable. Moreover, closure under
complements and intersections then shows that any interval is measurable, and thus
any open subset of $\mathbb{R}$, as it is a countable union of open intervals (take intervals
with rational endpoints).

Borel–Cantelli Lemma
As an application of the above we prove the following result of probability theory,
stated in terms of measure theory.
Theorem 3.5 (Borel–Cantelli). Suppose $E_k$ are measurable with $\sum_k m(E_k) < \infty$.
Then the set of points which belong to infinitely many $E_k$ has measure zero.
Proof. We can write the set in question as
\[ N = \bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} E_k \]
so it is measurable. By monotonicity and subadditivity,
\[ m(N) \leq m\left( \bigcup_{k=n}^{\infty} E_k \right) \leq \sum_{k=n}^{\infty} m(E_k) \]
but the right hand side goes to 0 as $n \to \infty$, so $m(N) = 0$.


Here is the connection to probability theory. We can think of measurable
subsets A ⊂ [0, 1] as events, i.e. the event that a random number in [0, 1] is
contained in A, and P (A) = m(A) the probability of A. If A ⊂ [0, 1] is a null
set, like the Cantor set, that means that A will almost surely not happen. The
above theorem tells us that if Ai are events with the sum of their probabilities
finite, then almost surely only finitely many of the Ai will occur.
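As a small illustration (a sketch that is not part of the original notes), take the hypothetical events Ek = [0, 1/k²) in [0, 1], whose measures have a finite sum. The helper count_hits below, a name chosen only for this example, counts how many of the first K events contain a given point x. For every x > 0 the count stops growing, as the theorem predicts; only the null set {0} lies in infinitely many Ek.

```python
from fractions import Fraction

def count_hits(x, K):
    """Count the indices k <= K with x in E_k = [0, 1/k^2)."""
    x = Fraction(x)
    return sum(1 for k in range(1, K + 1) if 0 <= x < Fraction(1, k * k))

# For x > 0 the count stops growing once 1/k^2 <= x.
print(count_hits(Fraction(1, 20), 10))
print(count_hits(Fraction(1, 20), 10_000))
# The exceptional point x = 0 lies in every E_k, so its count grows with K.
print(count_hits(0, 50))
```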
There is a kind of converse of the theorem which is true for independent
events. Recall that a collection A of events is independent if for any distinct
A1 , . . . , An ∈ A we have
\[
P\Bigl( \bigcap_{k=1}^{n} A_k \Bigr) = \prod_{k=1}^{n} P(A_k).
\]
Theorem 3.6. Suppose that Ak , k ≥ 1, are independent events with ∑_k P (Ak ) =
∞. Then almost surely infinitely many of the Ak will occur.
The proof is based on the following continuity property of m for monotone
decreasing sequences of subsets.
Proposition 3.7. If E1 ⊃ E2 ⊃ E3 ⊃ . . . are measurable and m(E1 ) < ∞,
then

\[
m\Bigl( \bigcap_{k=1}^{\infty} E_k \Bigr) = \lim_{k \to \infty} m(E_k)
\]

To see that the assumption m(E1 ) < ∞ is needed, consider for example
En = [n, ∞).
Proof. Let E = ⋂_k Ek , then E1 = E ∪ (E1 \ E2 ) ∪ (E2 \ E3 ) ∪ . . ., thus
\[
m(E_1) = m(E) + \sum_{k=1}^{\infty} m(E_k \setminus E_{k+1}) = m(E) + \lim_{k \to \infty} \bigl( m(E_1) - m(E_k) \bigr)
\]

but since m(E1 ) < ∞ we can subtract it from both sides to get the result.
Proof of the theorem. The probability that only finitely many Ak occur is
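\[
P\Bigl( \Bigl( \bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} A_k \Bigr)^c \Bigr) = P\Bigl( \bigcup_{n=1}^{\infty} \bigcap_{k=n}^{\infty} A_k^c \Bigr)
\]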

which we want to show is zero. By independence and continuity we get


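\[
\begin{aligned}
P\Bigl( \bigcap_{k=n}^{\infty} A_k^c \Bigr) &= \prod_{k=n}^{\infty} P(A_k^c) \\
&= \prod_{k=n}^{\infty} (1 - P(A_k)) \\
&\le \prod_{k=n}^{\infty} \exp(-P(A_k)) \\
&= \exp\Bigl( -\sum_{k=n}^{\infty} P(A_k) \Bigr) \\
&= 0
\end{aligned}
\]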

which implies the claim.


A number x ∈ [0, 1] is (weakly) normal if every finite sequence of digits can
be found at infinitely many places of its decimal expansion. If N ⊂ [0, 1] is
the set of normal numbers, then we claim that m(N ) = 1, i.e. almost every
number is normal. To see this, let Ai be the set of numbers which have a 1 at
the i-th place of their decimal expansion. Then the Ai are independent events
with P (Ai ) = 1/10, so by the theorem a number almost surely has infinitely
many 1’s in its decimal expansion. The same argument works for any sequence
of digits instead of 1, so m(N ) = 1 as an intersection of countably many sets
with measure 1.
In particular, if one converts the works of Shakespeare into a (large) integer,
then they will appear infinitely many times in the decimal expansion of almost
every number! A stronger version of normality is to require each sequence of
digits to occur with its expected density, e.g. 1/10th of the digits being 1.
Even though most numbers are normal, it is typically difficult to prove normality
of any particular irrational number.

Another viewpoint on measurable sets
The measurable subsets of R can be thought of as a completion of the topological
(open or closed) ones, in the sense that the former can be well approximated by the
latter. The following theorem makes this precise.
Theorem 3.8. If E ⊂ R is measurable then for any ε > 0 there is an open set
U ⊃ E and a closed set F ⊂ E with m(U \ E) < ε and m(E \ F ) < ε.

Proof. Suppose first that m(E) < ∞. Then there are open intervals Ik , k ≥ 1
which cover E with
\[
\sum_k m(I_k) < m(E) + \varepsilon
\]
so if U = ⋃_k Ik then by additivity (here we use that E is measurable!) we get
\[
m(U \setminus E) = m(U) - m(E) \le \sum_k m(I_k) - m(E) < \varepsilon.
\]

For the case m(E) = ∞ we cut R into intervals [n, n + 1] so that E ∩ [n, n + 1]
has finite measure and use the above together with the ε/2^n trick.
The statement about closed sets follows by passing to complements: If U ⊃
E^c with m(U \ E^c ) < ε, then F = U^c ⊂ E is closed and U \ E^c = E \ F .
Any set which is a countable union of closed sets is called an Fσ set, and
dually a countable intersection of open sets is called a Gδ set. Applying the
above theorem to ε = 1/n for all n and taking the union/intersection gives the
following corollary.
Corollary 3.9. If E ⊂ R is measurable then there is a Gδ set U ⊃ E and an
Fσ set F ⊂ E with m(U \ E) = m(E \ F ) = 0.

In particular this shows that every measurable set is the union of an Fσ set
and a set of measure zero. On the other hand, since closed sets and null sets
are measurable, the union of an Fσ set and a null set is measurable, thus:
Corollary 3.10. A subset E ⊂ R is measurable if and only if it can be written
as the union of an Fσ set and a null set (a set with outer measure zero).

One notion of “distance” of subsets A, B ⊂ R is

d(A, B) = m∗ (A △ B)

where A △ B = (A \ B) ∪ (B \ A) is the symmetric difference of A and B. In
this sense we can find open/closed sets arbitrarily close to any measurable set.
Subsets of finite measure can be approximated by finite unions of intervals, a
fact known as Littlewood’s first principle.

Theorem 3.11 (Littlewood’s first principle of Lebesgue theory). Suppose E ⊂
R is measurable with m(E) < ∞. Then for any ε > 0 there is an I ⊂ R which
is a finite union of intervals such that

m(E △ I) < ε.

Informally, a measurable set is nearly a finite union of intervals.


We will deal with the remaining principles later when discussing measurable
functions.
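Proof. By the definition of m(E) we can find open intervals Ik , k ≥ 1 which
cover E and satisfy
\[
\sum_k m(I_k) < m(E) + \frac{\varepsilon}{2}.
\]
Choose n so that
\[
\sum_{k=n}^{\infty} m(I_k) < \frac{\varepsilon}{2}
\]
then I = ⋃_{k=1}^{n} Ik has the desired property, since I \ E ⊂ (⋃_k Ik ) \ E and
E \ I ⊂ ⋃_{k>n} Ik both have measure less than ε/2.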
The following is a variation of the above and may help you build some
intuition about measurable subsets. Suppose X ⊂ [0, 1] is an arbitrary subset,
perhaps not measurable. We “digitize” X with resolution n by approximating
it by a union, En , of intervals of the form
\[
\Bigl[ \frac{k}{n}, \frac{k+1}{n} \Bigr], \qquad 0 \le k < n \text{ integer}
\]
in such a way as to minimize the error
\[
m^*(X \,\triangle\, E_n).
\]
Does m∗ (X △ En ) → 0 as n → ∞, i.e. can we approximate X arbitrarily well
by increasing the resolution? The answer is yes, if and only if X is measurable!
Intuitively, the non-measurable subsets are noisy on all scales with noisy part
not confined to a set of measure zero.
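For sets X that are themselves finite unions of intervals, the digitization error can be computed exactly. The following sketch is only an illustration under that assumption (the function names and the test set X = [0, 1/3] are hypothetical): each grid cell is included in En exactly when this lowers the error, so the cell contributes the smaller of m(X ∩ cell) and m(cell \ X).

```python
from fractions import Fraction

def overlap(a, b, c, d):
    """Length of the intersection of [a, b] and [c, d]."""
    return max(Fraction(0), min(b, d) - max(a, c))

def digitize_error(intervals, n):
    """Minimal m(X symmetric-difference E_n), where X is a disjoint union of
    intervals in [0, 1] and E_n is a union of cells [k/n, (k+1)/n]."""
    error = Fraction(0)
    for k in range(n):
        lo, hi = Fraction(k, n), Fraction(k + 1, n)
        inside = sum(overlap(a, b, lo, hi) for a, b in intervals)
        # Including the cell costs (cell length - inside); excluding it costs inside.
        error += min(inside, (hi - lo) - inside)
    return error

X = [(Fraction(0), Fraction(1, 3))]   # hypothetical test set X = [0, 1/3]
for n in (2, 3, 4, 100):
    print(n, digitize_error(X, n))
```

For such an X only the cells containing an endpoint of X contribute, so the error is at most a constant times 1/n and tends to zero.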

References and further reading


• T. Tao: “An introduction to measure theory.”
• C. McMullen: “Real Analysis.” (Math 114 course notes)

Chapter 4

Integration

I regard as quite useless the reading of large treatises of pure analysis:


too large a number of methods pass at once before the eyes. It is in
the works of applications that one must study them; one judges their
ability there and one apprises the manner of making use of them.
– Joseph-Louis Lagrange

The Lebesgue integral is defined in several stages, going from special to more
general classes of functions. There are two important conditions that need to
be imposed in order to make the integral well-defined.
1. Measurability
2. Positivity or absolute integrability
Let us comment a bit more on that. A function f : R → R is measurable if
for every y ∈ R the sublevel set
{x ∈ R | f (x) < y}
is measurable. (Replacing < with >, ≤, or ≥ gives an equivalent definition.)
It’s clear that some condition like that is needed, since measure is a special case
of integration in the sense that
\[
\int \chi_E = m(E)
\]
and χE is measurable if and only if E is. Here we use again the notation χE for
the characteristic function defined by
\[
\chi_E(x) := \begin{cases} 1 & \text{if } x \in E \\ 0 & \text{if } x \notin E \end{cases}
\]

The second point is familiar from infinite series: The sum of countably many
numbers ai , i ∈ I is well-defined if either ai ≥ 0 or ∑_i |ai | < ∞. Here we think
of the integral as a continuous analog of an infinite sum.

Integral of a simple function
We begin by defining the integral of a simple function, which is a measur-
able function f : R → R attaining only finitely many values f (x) and having
bounded support in the sense that m(supp(f )) < ∞ where

supp(f ) = {x | f (x) ≠ 0}.

Note that in measure theory we do not take the closure, unlike for continuous
or smooth functions. Simple functions form a vector space over R which is
generated by functions χE with E measurable and m(E) < ∞. If f is a simple
function taking non-zero values a1 , . . . , an define
\[
\int f = \sum_{k=1}^{n} m(\{x \mid f(x) = a_k\})\, a_k = \sum_{a \in \mathbb{R}} m(\{x \mid f(x) = a\})\, a.
\]
Note that {x | f (x) = a} is measurable since f is, and m({x | f (x) = a}) < ∞
for a ≠ 0 because f has bounded support.
Lemma 4.1. The integral above is a linear functional on the vector space of
simple functions.
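Proof. Clearly ∫ cf = c ∫ f for any c ∈ R. For additivity of ∫ we use (finite)
additivity of m:
\[
\begin{aligned}
\int (f + g) &= \sum_{c \in \mathbb{R}} m(\{x \mid f(x) + g(x) = c\})\, c \\
&= \sum_{a, b \in \mathbb{R}} m(\{x \mid f(x) = a,\ g(x) = b\})\, (a + b) \\
&= \sum_{a \in \mathbb{R}} m(\{x \mid f(x) = a\})\, a + \sum_{b \in \mathbb{R}} m(\{x \mid g(x) = b\})\, b \\
&= \int f + \int g
\end{aligned}
\]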
Measurable functions
Before we continue we need a slightly better understanding of which functions
are measurable. The many stability properties of measurable functions are in
some ways parallel to those of measurable subsets.
Theorem 4.2. The following functions are measurable.
• continuous functions
• monotone functions

• sums of measurable functions
• products of measurable functions
• pointwise limits of measurable functions
Proof. First, any continuous function f is measurable, since all the sets {x ∈
R | f (x) < y} are open. For a monotone function, these sets are intervals.
To prove that the sum of measurable functions f, g is measurable, note that
\[
\{x \mid f(x) + g(x) < a\} = \bigcup_{q \in \mathbb{Q}} \{x \mid f(x) < q\} \cap \{x \mid q + g(x) < a\}
\]

where the right hand side is a countable union of measurable sets. Here we use
the fact that if f (x) + g(x) < a, then there is a rational number between f (x)
and a − g(x).
To show that the product f g is measurable we write it as
\[
fg = \frac{(f + g)^2 - f^2 - g^2}{2}
\]
so we just need to show that f² is measurable for f measurable, but this follows
from
\[
\{x \mid f(x)^2 < a^2\} = \{x \mid f(x) < a\} \cap \{x \mid f(x) > -a\}, \qquad a > 0.
\]
Finally, we want to show that if fn → f pointwise and all fn are measurable,
then f is measurable. This follows from

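\[
\begin{aligned}
\{x \mid f(x) < a\} &= \{x \mid \exists n\, \exists N\, \forall k \ge N : f_k(x) < a - 1/n\} \\
&= \bigcup_{n=1}^{\infty} \bigcup_{N=1}^{\infty} \bigcap_{k=N}^{\infty} \{x \mid f_k(x) < a - 1/n\}
\end{aligned}
\]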

Integral of a bounded function with bounded support
Having defined the integral of a simple function, we move on to functions which
are bounded both in range and support. We can try to define the integral of f by
approximating f by simple functions from either above or below. The following
theorem tells us that both give the same value, provided f is measurable.
Theorem 4.3. Let f : R → R be a function which is bounded, |f | ≤ M , and
which is zero outside a compact interval [a, b]. Then
\[
\inf_{\psi \ge f} \int \psi = \sup_{\varphi \le f} \int \varphi
\]

where φ, ψ range over simple functions, if and only if f is measurable.

Proof. Let us assume first that f is measurable. Divide the interval [−M, M ]
into intervals [ak , ak+1 ) with ak+1 − ak < ε. Let Ek = f −1 ([ak , ak+1 )) ∩ [a, b],
i.e. the set of points in [a, b] where ak ≤ f < ak+1 , which is measurable, since
f is. We get simple functions
\[
\varphi = \sum_k a_k \chi_{E_k}, \qquad \psi = \sum_k a_{k+1} \chi_{E_k}
\]
with φ ≤ f ≤ ψ and since ψ − φ < ε we get
\[
\int (\psi - \varphi) \le \varepsilon (b - a).
\]

So the approximations from above and below are arbitrarily close, which implies
equality in the statement of the theorem.
For the converse, suppose that inf = sup above, then there are sequences of
simple functions φn , ψn with φn ≤ f ≤ ψn and
\[
\int (\psi_n - \varphi_n) \to 0, \quad \text{as } n \to \infty.
\]
Take φ = sup φn and ψ = inf ψn , which are measurable since they are pointwise
limits of measurable functions. We have φ ≤ f ≤ ψ, but in fact φ = ψ almost
everywhere (i.e. outside a set of measure zero). The reason is that if ψ − φ >
ε > 0 on a set E with m(E) > 0, then εχE ≤ ψn − φn for all n, and so
\[
\int (\psi_n - \varphi_n) \ge \varepsilon\, m(E) > 0
\]
contradicting convergence to 0. Thus φ = ψ = f a.e., so f is measurable.


The Riemann integral is based on step functions which are a special case of
simple functions with all level sets a finite union of intervals. If a bounded f
is Riemann integrable, then there are sequences of step functions φn ≤ f ≤ ψn
with ∫ (ψn − φn ) → 0, so f is measurable and the two definitions of the integral
agree.
Note the difference with Riemann integration: Instead of cutting the domain
into smaller and smaller intervals, we do this instead with the range. This
gives us some possibly complicated subsets of the domain, but the Lebesgue
measure takes care of them. The method of discretizing the range, rather than
the domain, is also used when indicating height on a topographical map (see
Figure 5). The regions between the contour lines are preimages of intervals
under the function which maps a point on the map to the height of the terrain.
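The idea of slicing the range can be tried out numerically. The sketch below is an illustration for the single hypothetical example f(x) = x² on [0, 1], where the level sets Ek = {ak ≤ f < ak+1 } are the intervals [√ak , √ak+1 ) and so have an explicit measure. It computes the integrals of the simple functions φ and ψ from the proof of Theorem 4.3; the function name lebesgue_sums is our own.

```python
import math

def lebesgue_sums(num_slices):
    """Lower and upper sums for f(x) = x^2 on [0, 1], obtained by slicing the
    range [0, 1] into num_slices equal intervals [a_k, a_{k+1})."""
    lower = upper = 0.0
    for k in range(num_slices):
        a_k, a_next = k / num_slices, (k + 1) / num_slices
        # E_k = {x in [0, 1] : a_k <= x^2 < a_next} = [sqrt(a_k), sqrt(a_next))
        measure = math.sqrt(a_next) - math.sqrt(a_k)
        lower += a_k * measure      # integral of phi on E_k
        upper += a_next * measure   # integral of psi on E_k
    return lower, upper

for n in (10, 100, 1000):
    print(n, lebesgue_sums(n))
```

The two sums bracket the value 1/3 of the integral and differ by ε(b − a) with ε = 1/num_slices, as in the proof. The single point x = 1, where f = 1, is left out of the slices, which is harmless since it has measure zero.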
Next, we verify that integration is linear on bounded functions with bounded
support.
Theorem 4.4. Suppose f, g are bounded measurable functions with bounded
support, c ∈ R, then
\[
\int cf = c \int f, \qquad \int (f + g) = \int f + \int g
\]

i.e. integration is linear.

Proof. The first statement is clear. For the second note that, using additivity
for simple functions,
\[
\int f + \int g = \sup_{\varphi_1 \le f,\ \varphi_2 \le g} \int (\varphi_1 + \varphi_2) \le \sup_{\varphi \le f + g} \int \varphi = \int (f + g)
\]

and the reverse inequality follows similarly with the inf definition of the integral.

Figure 5: Contour lines on a topographic map.

Integral of a non-negative or absolutely integrable function
We first consider arbitrary measurable functions f : R → [0, +∞]. Define the
integral as
\[
\int f := \sup_{\varphi \le f} \int \varphi \in [0, +\infty]
\]
where φ ranges over bounded functions with bounded support.
Theorem 4.5 (Properties of integral for non-negative functions). Let f, g :
R → [0, +∞] be measurable.
1. f ≤ g =⇒ ∫ f ≤ ∫ g
2. ∫ cf = c ∫ f , for c ≥ 0
3. ∫ (f + g) = ∫ f + ∫ g
Proof. The first two are clear from the definition, so we only prove the third
property. The inequality “≥” follows as above in the proof for bounded func-
tions. On the other hand, if φ ≤ f + g then setting
\[
\varphi_1 := \min(\varphi, f) \le f, \qquad \varphi_2 := \varphi - \varphi_1 = \varphi - \min(\varphi, f) \le g
\]
we get
\[
\int \varphi = \int \varphi_1 + \int \varphi_2 \le \int f + \int g
\]
and since φ ≤ f + g was an arbitrary bounded function with bounded support
this proves the reverse inequality.
A measurable function f : R → R is absolutely integrable if ∫ |f | < ∞.
Here we use the integral just defined above. If f is absolutely integrable, then
define
\[
\int f := \int f_+ - \int f_-
\]
where f+ = max(f, 0), f− = − min(f, 0), f = f+ − f− are the positive and
negative parts of f . Absolute integrability ensures that ∫ f+ , ∫ f− < ∞, so we
do not get ∞ − ∞, which is undefined.
Theorem 4.6. The integral of absolutely integrable functions is linear.

Proof. Let f, g be absolutely integrable. Starting with

(f + g)+ − (f + g)− = f + g = (f+ − f− ) + (g+ − g− )

we rearrange the terms to

(f + g)+ + f− + g− = (f + g)− + f+ + g+ .

Applying linearity of the integral for non-negative functions gives


\[
\int (f + g)_+ + \int f_- + \int g_- = \int (f + g)_- + \int f_+ + \int g_+ .
\]

Putting the terms back to their original side of the equation and applying the
definition gives additivity.

It is easy to see that all the above definitions of ∫ f are compatible with
one another, i.e. agree for functions which fall into more than one of the classes
considered. Also, if one wants to integrate f not over all of R but some interval
[a, b] one can simply take
\[
\int_a^b f := \int \chi_{[a,b]} f
\]
or more generally for any measurable subset of R instead of [a, b]. If A and B
are disjoint then
\[
\int_{A \cup B} f = \int_A f + \int_B f.
\]
As a consequence of linearity we have a form of monotonicity:
\[
f \le g \implies \int f \le \int g
\]
and
\[
\Bigl| \int f \Bigr| \le \int |f|.
\]

Furthermore, if f is bounded, |f | ≤ M , and has support supp(f ) of finite
measure, then it also follows again from monotonicity that
\[
\int |f| \le m(\operatorname{supp}(f))\, M.
\]

A very important principle in Lebesgue theory is that if f = g a.e. (almost
everywhere), meaning that f − g is supported on a null set, then
\[
\int f = \int g.
\]

This means that even if f is undefined on a null set, for example a countable
set, then ∫ f is still well-defined, because no matter how we extend f to all of
R, we get the same integral.
Extending all this to complex valued functions is straightforward. The func-
tion f : R → C is defined to be measurable if and only if Re(f ) and Im(f ) are
measurable, and if also ∫ |f | < ∞ then we can define
\[
\int f = \int \operatorname{Re}(f) + i \int \operatorname{Im}(f).
\]

Convergence Theorems
In this section we consider the following problem: Suppose a sequence of func-
tions fn converges to f pointwise and that the integral of each fn and f is
defined. Under what conditions does
\[
\int f_n \to \int f
\]
as n → ∞? In the case of the Riemann integral, a sufficient condition is that fn
are continuous functions on a compact interval and converge uniformly. These
are very strong assumptions, and we will see that there is a much more useful
answer in the context of Lebesgue integration.
Let’s assume for now that each fn is nonnegative and measurable, so the
same will be true for f and the integrals are well-defined. Here are some exam-
ples where interchanging limit and integration is not possible:
• Escape to horizontal infinity. We take fn = χ[n,n+1] , then fn → f = 0
pointwise, but ∫ fn = 1 and ∫ f = 0. Informally, the mass (area under
the curve) has escaped to +∞ as n → ∞.

• Escape to width infinity. Let fn = (1/n) χ[0,n] , then fn → f = 0 uniformly,
but ∫ fn = 1 and ∫ f = 0 as before. The mass is spread out over the
infinite real line, eventually having zero density everywhere.
• Escape to vertical infinity. The previous two examples relied on non-
compactness of R, or more precisely m(R) = ∞. However, even if we
restrict to functions on a compact interval, things can go wrong. Define
fn = n χ[1/n, 2/n] , then again fn → f = 0 pointwise, but ∫ fn = 1 ≠ 0 = ∫ f .
This is essentially a 90 degree rotated version of the previous example,
with mass being concentrated with unbounded density.
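The three escape scenarios can be checked with exact arithmetic. In the following sketch (an illustration only; the tuple representation and the function names are hypothetical) each fn is a multiple c · χ[a,b] of a characteristic function, so its integral is simply c(b − a).

```python
from fractions import Fraction as F

# Each f_n is c * chi_[a, b], stored as the triple (c, a, b).
def horizontal(n):
    return (F(1), F(n), F(n + 1))

def width(n):
    return (F(1, n), F(0), F(n))

def vertical(n):
    return (F(n), F(1, n), F(2, n))

def integral(f):
    c, a, b = f
    return c * (b - a)

def evaluate(f, x):
    c, a, b = f
    return c if a <= x <= b else F(0)

for family in (horizontal, width, vertical):
    # The integrals stay equal to 1, yet the values at a fixed point tend to 0.
    print(family.__name__,
          [integral(family(n)) for n in (1, 10, 100)],
          [evaluate(family(n), F(1, 2)) for n in (1, 10, 100)])
```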
Any convergence theorem needs a condition on the fn which rules out the
above escape scenarios. The first such condition is monotonicity, i.e. mass can
only be added, not subtracted or moved around.
Theorem 4.7 (Monotone convergence theorem). Let 0 ≤ f1 ≤ f2 ≤ . . . be an
increasing sequence of non-negative measurable functions, then
\[
\lim_{n \to \infty} \int f_n = \int f
\]
where f = lim_{n→∞} fn = sup_n fn is the pointwise limit. (Here, functions are
allowed to take the value +∞.)
Of course it suffices if monotonicity holds almost everywhere, because we
can modify the fn on a null set, which does not change the integrals, so as to
ensure monotonicity everywhere.
Proof. Since fn ≤ f , monotonicity of the integral immediately implies ∫ fn ≤
∫ f , thus
\[
\lim_{n \to \infty} \int f_n \le \int f
\]
and so it remains to show the reverse inequality. By definition this means
showing that
\[
\int \varphi \le \lim_{n \to \infty} \int f_n
\]
for all simple functions 0 ≤ φ ≤ f . Let
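\[
\varphi = \sum_{k=1}^{N} a_k \chi_{E_k}
\]
with disjoint measurable Ek ⊂ R and ak > 0 (terms with ak = 0 can be dropped).
Let 0 < ε < 1 then φ ≤ f implies
\[
f(x) = \sup_n f_n(x) > (1 - \varepsilon) a_k, \qquad x \in E_k
\]
thus if we consider
\[
E_{k,n} := \{x \in E_k \mid f_n(x) > (1 - \varepsilon) a_k\}
\]
then each Ek,n is measurable, Ek,1 ⊂ Ek,2 ⊂ . . ., and
\[
\bigcup_n E_{k,n} = E_k
\]
so by continuity of the Lebesgue measure m we get
\[
m(E_k) = \lim_{n \to \infty} m(E_{k,n}).
\]
Tautologically we have
\[
f_n \ge \sum_{k=1}^{N} (1 - \varepsilon) a_k \chi_{E_{k,n}}
\]
thus, integrating both sides,
\[
\int f_n \ge \sum_{k=1}^{N} (1 - \varepsilon) a_k\, m(E_{k,n}).
\]
Taking the limit n → ∞ gives
\[
\lim_{n \to \infty} \int f_n \ge \sum_{k=1}^{N} (1 - \varepsilon) a_k\, m(E_k) = (1 - \varepsilon) \int \varphi
\]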

but since ε ∈ (0, 1) was arbitrary, this implies the other inequality.
Applying the theorem to partial sums yields the following corollary.
Corollary 4.8. Let fn : R → [0, ∞] be non-negative measurable functions, then
\[
\int \sum_{n=1}^{\infty} f_n = \sum_{n=1}^{\infty} \int f_n .
\]
n=1 n=1

In the absence of monotonicity, equality typically fails, but the integral can
only decrease in the limit, as we found in the examples of mass “escaping to
infinity”.

Theorem 4.9 (Fatou’s lemma). Let fn : R → [0, ∞] be non-negative measurable
functions, then
\[
\int \liminf_{n \to \infty} f_n \le \liminf_{n \to \infty} \int f_n .
\]

Proof. Let gn = inf_{k≥n} fk , then g1 ≤ g2 ≤ . . . so by the monotone convergence
theorem,
\[
\begin{aligned}
\int \liminf_{n \to \infty} f_n &= \int \lim_{n \to \infty} g_n \\
&= \lim_{n \to \infty} \int g_n \\
&\le \lim_{n \to \infty} \inf_{k \ge n} \int f_k
\end{aligned}
\]
where the inequality follows from gn ≤ fk for k ≥ n and monotonicity of the
integral.
Another way of preventing escape to infinity is to confine the fn using an
absolutely integrable function g which dominates all fn in the sense that |fn | ≤
g. This makes the fn absolutely integrable and we don’t need to restrict to non-
negative functions.
Theorem 4.10 (Dominated convergence theorem). Let fn : R → R be a se-
quence of measurable functions which converge pointwise to a function f . Sup-
pose there exists an absolutely integrable g : R → [0, ∞] such that |fn | ≤ g for
all n, then
\[
\lim_{n \to \infty} \int f_n = \int f.
\]
The theorem extends easily to complex-valued functions f : R → C. Also,
it suffices if the conditions of the theorem are satisfied almost everywhere only.
Proof. Each fn + g is non-negative by assumption, so Fatou’s lemma gives
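\[
\int (f + g) \le \liminf_{n \to \infty} \int (f_n + g)
\]
so subtracting ∫ g, which is finite, we get
\[
\int f \le \liminf_{n \to \infty} \int f_n .
\]
Similarly, applying Fatou’s lemma to g − fn ≥ 0 gives
\[
\int (g - f) \le \liminf_{n \to \infty} \int (g - f_n)
\]
thus
\[
\limsup_{n \to \infty} \int f_n \le \int f.
\]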
Putting the two inequalities together shows the claim.
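As a worked example (added for illustration, with functions chosen by us), consider the powers of x on the unit interval, which are dominated by the characteristic function of [0, 1]. For contrast, the last line shows why the sequence n χ[1/n, 2/n] from the discussion of escape to vertical infinity admits no integrable dominating function.

```latex
\[
f_n(x) = x^n \chi_{[0,1]}(x), \qquad |f_n| \le g := \chi_{[0,1]}, \qquad \int g = 1 < \infty,
\]
\[
\int f_n = \frac{1}{n+1} \longrightarrow 0 = \int f,
\qquad f = \lim_{n \to \infty} f_n = \chi_{\{1\}} = 0 \ \text{a.e.}
\]
% Contrast: h_n = n \chi_{[1/n, 2/n]}. For 0 < x \le 1 the interval [1/x, 2/x]
% has length at least 1, so it contains an integer n, and then h_n(x) = n \ge 1/x.
\[
\sup_n h_n(x) \ge \frac{1}{x} \quad (0 < x \le 1),
\qquad \int_0^1 \frac{dx}{x} = \infty .
\]
```

So any g with |hn | ≤ g for all n fails to be absolutely integrable, which is consistent with ∫ hn = 1 not converging to 0.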

The construction of the integral and its properties presented in this chapter
generalize (with the same proofs) to Rn , or any abstract measure space.

References and further reading


• T. Tao: “An introduction to measure theory.”
• C. McMullen: “Real Analysis.” (Math 114 course notes)

Chapter 5

Axiomatic measure theory

The laws of mathematics are not merely human inventions or creations.


They simply ’are’; they exist quite independently of the human intellect.
The most that any(one) ... can do is to find that they are there and to
take cognizance of them.
– Maurits Escher

The definition of the integral in the previous chapter, as well as its proper-
ties, can be developed in a much more general setting without any additional
difficulty. This framework puts infinite sums and integrals (in any dimension),
as well as their weighted or fractal variants, on the same footing.
First, instead of R we consider any set X. Second, we need to specify which
subsets of X should be the measurable ones, which amounts to choosing a σ-
algebra.

Definition 5.1. A collection A of subsets of X is a σ-algebra if


1. ∅ ∈ A
2. E ∈ A =⇒ X \ E ∈ A (closure under complements)
3. Ek ∈ A, k ≥ 1 =⇒ ⋃_k Ek ∈ A (closure under countable unions)
An important principle in their construction is the fact that any collection
of subsets can be completed to a σ-algebra in a canonical way. To see this, note
that the intersection of a collection of σ-algebras Ai ⊂ 2^X , i ∈ I is again a
σ-algebra. Thus it makes sense to define the σ-algebra generated by a collection
of subsets C ⊂ 2^X to be the intersection of all σ-algebras A with C ⊂ A. This is
the smallest σ-algebra which contains C, where A ⊂ 2^X is by definition smaller
than B ⊂ 2^X if A ⊂ B.
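On a finite set the generated σ-algebra can be computed by brute force, since countable unions reduce to finite ones. The sketch below is an illustration only and does not use the intersection definition directly: it closes the collection under complements and pairwise unions until nothing new appears, which yields the same smallest σ-algebra. The function name and the example set are our own choices.

```python
from itertools import combinations

def generate_sigma_algebra(X, C):
    """Smallest collection of subsets of the finite set X that contains C
    and is closed under complements and (finite) unions."""
    X = frozenset(X)
    sets = {frozenset(), X} | {frozenset(c) for c in C}
    while True:
        new = {X - a for a in sets} | {a | b for a, b in combinations(sets, 2)}
        if new <= sets:          # nothing new: the collection is closed
            return sets
        sets |= new

X = {1, 2, 3, 4}
A = generate_sigma_algebra(X, [{1, 2}])
print(sorted(sorted(s) for s in A))
```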
Theorem 5.2. The Lebesgue measurable subsets of R form a σ-algebra which
is generated by open sets and null sets.

Proof. This is just a restatement of things we already know. If A is a σ-algebra
containing the open sets, then by closure under complements it also contains the
closed sets, thus the Fσ -sets by closure under countable unions. But if A also
contains the null sets, then it contains every Lebesgue measurable set, as any
such set is the union of an Fσ set and a null set. Thus the Lebesgue measurable
sets form the smallest σ-algebra containing open and null sets.
Elements of the σ-algebra generated by the open sets are called Borel sets.
This notion makes sense for any topological space X. Hence every Borel subset
of R is measurable, but the converse is not true. A way to see this is as follows.
One can show that the cardinality of the σ-algebra of Borel subsets is the same
as that of R, but the cardinality of the σ-algebra of measurable sets is the same
as that of all subsets of R, since every subset of the Cantor set is measurable.
Once we have fixed a σ-algebra of measurable sets on X, the final piece of
data needed is the measure.
Definition 5.3. A measure on a σ-algebra A is a map µ : A → [0, +∞] such
that
1. µ(∅) = 0
2. If Ek ∈ A, k ≥ 1, are disjoint, then
\[
\mu\Bigl( \bigcup_{k=1}^{\infty} E_k \Bigr) = \sum_{k=1}^{\infty} \mu(E_k).
\]

Our main example is of course the Lebesgue measure m on the σ-algebra
of measurable (or Borel) subsets of R. Using the Lebesgue integral, we can
construct a measure, mf , from any measurable function f : R → [0, +∞] by
setting
\[
m_f(E) := \int_E f.
\]
Countable additivity follows from the monotone convergence theorem. For ex-
ample f could be a probability distribution, meaning ∫ f = 1, so mf (E) becomes
the probability of the event E. In this case mf (R) = 1, and in general one says
µ is a probability measure if µ(R) = 1.
A very different example of a measure on R is the Dirac measure, δ0 with
\[
\delta_0(E) = \begin{cases} 1 & \text{if } 0 \in E \\ 0 & \text{if } 0 \notin E \end{cases}
\]
which we can consider as a measure on the σ-algebra of all subsets of R, or
restrict to any smaller σ-algebra. This measure is not of the form mf for any
function f , so can be considered as a kind of generalized function.
A more elementary example is the counting measure: For any set X we can
take A = 2^X , i.e. all sets are measurable, and define
\[
\mu(E) = \begin{cases} |E| & \text{if } E \text{ is finite} \\ +\infty & \text{if } E \text{ is infinite.} \end{cases}
\]

When X is finite we can normalize the counting measure so that it becomes a
probability measure with µ(E) = |E|/|X|.
The list of axioms for a measure µ is remarkably small. They imply mono-
tonicity and countable subadditivity, for example, and indeed all the properties
needed to define the integral. A property specific to the Lebesgue measure is
translation invariance. One chooses not to impose such a condition in general,
as it would exclude many of the interesting examples above.
Once one has fixed a measure µ, one can always add the subsets of null-sets
to the measurable sets. More precisely, one passes to the σ-algebra generated
by A and the subsets of the null sets, to which µ extends in the obvious way.
The triple (X, A, µ) is known as a measure space. Measurable functions
are defined as in the case of R (and do not depend on µ): A function f : X → R
is measurable if {x ∈ X | f (x) < a} is measurable for each a ∈ R. This is
equivalent to requiring the preimage under f of any Borel subset of R to be
measurable, because intervals (−∞, a) generate the Borel σ-algebra.
As in the case X = R one defines the integral of any measurable function
f : X → R (or X → C) which is either non-negative or absolutely integrable.
The main properties of the integral, including the convergence theorems, all go
through in general. To avoid confusion when dealing with multiple measures
one writes
\[
\int f \, d\mu
\]
for the integral constructed from µ, i.e. the one satisfying
\[
\int \chi_E \, d\mu = \mu(E).
\]
Returning to the examples above we have
\[
\int f \, dm_g = \int f g \, dm
\]
and
\[
\int f \, d\delta_0 = f(0).
\]
For µ the counting measure, the integral just becomes a series:
\[
\int f \, d\mu = \sum_{x \in X} f(x)
\]
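On a finite set with all subsets measurable, a measure is determined by the masses of the individual points, and the integral is a weighted sum. The following sketch illustrates the last two formulas in that simplified setting; the function integrate and the dictionary representation of a measure are assumptions made for this example.

```python
def integrate(f, mu, points):
    """Integral of f against the measure with point masses mu[x] on a finite set."""
    return sum(f(x) * mu.get(x, 0) for x in points)

points = range(-3, 4)                  # hypothetical finite space X = {-3, ..., 3}
counting = {x: 1 for x in points}      # counting measure: every point has mass 1
dirac_0 = {0: 1}                       # Dirac measure at 0

def f(x):
    return x * x + 1

print(integrate(f, counting, points))  # the finite sum of f(x) over X
print(integrate(f, dirac_0, points))   # the value f(0)
```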

Probability spaces
A probability space is simply a measure space (X, A, µ) with µ(X) = 1. Some
examples were given above. In the context of probability theory, X is called the
sample space, and its elements the outcomes. An element A of the σ-algebra A
is an event and its probability is P (A) := µ(A). The condition µ(X) = 1 ensures
we have P (A) ∈ [0, 1] and that the event X, which means “any outcome”, has
probability 1.
When X is finite, the σ-algebras on X correspond to partitions X = X1 ∪. . .∪
Xn with Xk being the minimal non-empty measurable subsets. Any probability
measure is completely specified by numbers pk = P (Xk ) ∈ [0, 1] with p1 + . . . +
pn = 1.
σ-algebras, besides being essential in Lebesgue theory, also give a way of
modeling incomplete information. For example, let’s say a coin is tossed twice
giving possible outcomes X = {HH, HT, T H, T T }. We could consider the σ-
algebra of all subsets, and the probability measure which assigns 1/4 to each
outcome. But suppose we do not know the results of both coin tosses, only if
they are the same or differ. This means our σ-algebra of events is

A = {∅, {HH, T T }, {HT, T H}, X}.

A function f : X → R is measurable with respect to A if and only if it assigns


the same value to HH and T T , and to HT and T H.
A random variable is a measurable function f : X → R. Sometimes one
allows the target to be any measurable space, i.e. set with σ-algebra. The
expected value is simply the integral
\[
E(f) = \int f \, d\mu.
\]
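The coin toss example can be spelled out in a few lines. This is a sketch with names of our own choosing: measurability with respect to the coarse σ-algebra is tested as constancy on its two atoms {HH, TT} and {HT, TH}, and the expected value is computed as a finite sum against the uniform probability measure on all outcomes.

```python
from fractions import Fraction as F

X = ["HH", "HT", "TH", "TT"]
P = {x: F(1, 4) for x in X}                 # fair coin, tossed twice
atoms = [{"HH", "TT"}, {"HT", "TH"}]        # minimal non-empty events of A

def is_measurable(f):
    """f is A-measurable if and only if it is constant on each atom of A."""
    return all(len({f(x) for x in atom}) == 1 for atom in atoms)

def expectation(f):
    return sum(f(x) * P[x] for x in X)

def same(x):
    return 1 if x[0] == x[1] else 0         # indicator of "both tosses agree"

def heads(x):
    return x.count("H")                     # number of heads

print(is_measurable(same), is_measurable(heads))
print(expectation(same), expectation(heads))
```

The number of heads is a random variable for the σ-algebra of all subsets, but not for the coarser one, since it distinguishes HH from TT.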

References and further reading


• T. Tao: “An introduction to measure theory.”
• A. Kolmogorov: “Foundations of the Theory of Probability.”

• P. Billingsley: “Probability and Measure.”

Chapter 6

Differentiation and
Integration

Integration is sometimes introduced as “the inverse of differentiation”, i.e. a
way of finding an antiderivative. This is really two statements, that
\[
\frac{d}{dx} \int_a^x f = f
\]
and
\[
\int_a^x F' = F(x) + \text{const}
\]
which together are the fundamental theorem of calculus.
It turns out that in Lebesgue theory the first statement is true as long as f is
integrable, which is the best result we could hope for. For the second statement
to make sense, we need F to be differentiable almost everywhere, but even then
it can fail. Take for example the Heaviside step function F = χ[0,∞) , then F′(x)
exists and is zero at all points except x = 0, so equality does not hold.
An example with continuous F = c comes from the Cantor set: c : [0, 1] →
[0, 1] is the unique monotone extension of the function C → [0, 1] which was
constructed in the proof of uncountability of C. It can be described as follows:
1. Find the ternary expansion of x.
2. If x has a 1 in its expansion, then replace all digits after the 1 by 0.
3. Replace 2 ↦ 1 in the expansion.
4. The resulting sequence of 0’s and 1’s is the binary expansion of c(x).
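The four steps above translate directly into code. The sketch below works with exact rational inputs and truncates the ternary expansion after a fixed number of digits, so for points of the Cantor set with infinite expansions the result is only an approximation; the function name, the digits parameter and the special case x = 1 are our own choices.

```python
from fractions import Fraction

def cantor_function(x, digits=40):
    """Approximate c(x) for a rational x in [0, 1] from its first ternary digits."""
    x = Fraction(x)
    if not 0 <= x <= 1:
        raise ValueError("x must lie in [0, 1]")
    if x == 1:                      # 1 = 0.222... in base 3, so c(1) = 0.111... = 1
        return Fraction(1)
    value = Fraction(0)
    for i in range(1, digits + 1):
        x *= 3
        d = int(x)                  # next ternary digit: 0, 1 or 2
        x -= d
        if d == 1:                  # step 2: everything after the first 1 becomes 0
            value += Fraction(1, 2 ** i)
            break
        value += Fraction(d // 2, 2 ** i)   # step 3: the digit 2 becomes 1
    return value                    # step 4: read the digits in base 2

for x in (Fraction(1, 3), Fraction(1, 2), Fraction(2, 3), Fraction(1, 4)):
    print(x, float(cantor_function(x)))
```

The values at 1/3, 1/2 and 2/3 agree, reflecting that c is constant on the removed middle third.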
The graph of c is shown in Figure 6. The Cantor function is locally constant
on the complement of the Cantor set, which has full measure. In particular c′
exists and vanishes almost everywhere. This is in some sense paradoxical: If
c(t) describes the location of a particle depending on time t, then this means the
particle moves continuously across a positive distance in finite time, but if we
measure its speed at some random moment in time we will almost surely find it
resting.

Figure 6: The Cantor function or “Devil’s staircase”.

It turns out there is a sufficient and necessary condition on F to make
integration the inverse of differentiation, which is absolute continuity, and
therefore excludes examples as above. From another point of view, functions like
Heaviside’s or Cantor’s have derivatives which are measures on R, a kind of
generalized function. The failure of the fundamental theorem of calculus is then
explained by the fact that not all generalized functions are functions in the usual
sense (think of the Dirac delta function).

Nowhere differentiable functions


It came as somewhat of a shock to mathematicians of the 19th century that
a function could be continuous yet nowhere differentiable. The first published
example is due to Weierstrass and given by a Fourier series

\[
f(x) = \sum_{n=1}^{\infty} a^n \cos(b^n x)
\]
with 0 < a < 1 and b a positive odd integer such that ab > 1 + 3π/2. Intuitively,
the graph of f has wiggles on every scale, thus no well-defined slope no matter
how far one zooms in, see Figure 7. Similarly to the Cantor set, it is the fractal
nature of the Weierstrass function which is responsible for this unexpected
behavior.
Figure 7: Plot of a Weierstrass function.

Another example, which is a bit simpler to analyze, is based on a sawtooth
wave instead of cosine. Let w be the 2-periodic function on R which is equal to
|x| on [−1, 1]. Define
\[
f(x) = \sum_{n=0}^{\infty} \Bigl( \frac{3}{4} \Bigr)^n w(4^n x)
\]
which is continuous as a uniform limit of a sequence of continuous functions.
For any x ∈ R one can choose δk = ± 4^{−k}/2 so that
\[
\Bigl| \frac{f(x + \delta_k) - f(x)}{\delta_k} \Bigr| \to \infty
\]
as k → ∞, showing that f is not differentiable at x.
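The growth of the difference quotients can be observed with exact rational arithmetic. The sketch below is an illustration under the following assumptions: the sign of δk is chosen so that no integer lies strictly between 4^k x and 4^k (x + δk ), and only the terms n ≤ k of the series are summed, because the later terms cancel exactly in the difference. The standard estimate for this choice is that the quotient has absolute value at least (3^k + 1)/2. All function names are our own.

```python
from fractions import Fraction as F
import math

def w(x):
    """2-periodic sawtooth equal to |x| on [-1, 1]."""
    t = x % 2                         # t lies in [0, 2)
    return t if t <= 1 else 2 - t

def partial_sum(x, terms):
    return sum(F(3, 4) ** n * w(4 ** n * x) for n in range(terms))

def difference_quotient(x, k):
    """Difference quotient of f at x for delta_k = +-4^(-k)/2."""
    y = 4 ** k * x
    # Pick the sign so that no integer lies strictly between y and y +- 1/2.
    sign = 1 if math.floor(y + F(1, 2)) == math.floor(y) else -1
    delta = sign * F(1, 2 * 4 ** k)
    # For n > k the number 4^n * delta is an even integer, so those terms cancel.
    terms = k + 1
    return (partial_sum(x + delta, terms) - partial_sum(x, terms)) / delta

x = F(1, 3)
for k in range(1, 7):
    print(k, float(difference_quotient(x, k)))
```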
It turns out that in some sense most continuous functions are nowhere differ-
entiable. Hence the “pathological” case is in fact the typical one. However, the
following theorem tells us that such a function cannot be monotone (increasing
or decreasing) on any interval, so it must have wiggles on every scale as in the
examples above.
Theorem 6.1. A monotone function f : [a, b] → R is differentiable almost
everywhere.
The proof will require some preparation, in particular Vitali’s lemma.

Vitali coverings
The ball with center x and radius r > 0 in Rn is the subset
B(x, r) = {y ∈ Rn | |x − y| < r}.
If B = B(x, r) is a ball, write 3B := B(x, 3r).

Lemma 6.2. Let K ⊂ Rn be a compact subset covered by a collection E of balls.
Then there are disjoint B1 , . . . , Bn ∈ E such that 3B1 , . . . , 3Bn cover K.
Proof. By compactness we can assume that E is finite. We select the Bk “greed-
ily”, that is B1 should have maximal radius among the balls in E, B2 should
have maximal radius among the balls in E which are disjoint from B1 , and so
on. This inductively defines a collection B1 , . . . , Bn and we claim that the 3Bk
cover K.
Suppose, for contradiction, that x ∈ K, but x ∉ ⋃ 3Bk . Then x is contained
in one of the balls B = B(a, r) ∈ E which was not chosen by the algorithm. This
means there is some minimal k such that B intersects Bk , and thus Bk = B(b, s)
has radius s ≥ r. Hence

|x − b| ≤ |x − a| + |a − b| < r + (r + s) ≤ 3s

so x ∈ 3B(b, s), a contradiction.
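The greedy selection in this proof is easy to state as code. The following is a minimal sketch for a finite family of open balls in Rn, added for illustration; the representation of a ball as a pair (center, radius) and the helper names are assumptions of the sketch, not notation from the notes.

```python
import math

def greedy_disjoint(balls):
    """Greedy selection from the proof: repeatedly take a ball of maximal
    radius among those disjoint from all balls chosen so far.
    Each ball is a pair (center, radius) with center a tuple of floats."""
    chosen = []
    for center, radius in sorted(balls, key=lambda ball: ball[1], reverse=True):
        # Open balls are disjoint iff the centers are at least r1 + r2 apart.
        if all(math.dist(center, c) >= radius + r for c, r in chosen):
            chosen.append((center, radius))
    return chosen

def inside_some_tripled(ball, chosen):
    """True if `ball` is contained in 3B for some chosen ball B."""
    center, radius = ball
    return any(math.dist(center, c) + radius <= 3 * r for c, r in chosen)

if __name__ == "__main__":
    family = [((0.0, 0.0), 1.0), ((1.5, 0.0), 0.8),
              ((3.0, 0.0), 0.5), ((0.0, 2.5), 0.4)]
    picked = greedy_disjoint(family)
    print(picked)
    print(all(inside_some_tripled(b, picked) for b in family))
```

Processing the balls in order of decreasing radius and keeping those disjoint from everything kept so far gives the same selection as the inductive description (up to ties). The check mirrors the estimate in the proof: every ball of the family lies inside the tripled version of some selected ball.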


The same proof works in any compact metric space. A variation of this
result applies to Vitali coverings. A Vitali covering of a subset X ⊂ Rn is a
collection of balls E such that every x ∈ X is contained in some B ∈ E of
arbitrarily small radius.
Theorem 6.3. Let E be a Vitali covering of a compact set K ⊂ Rn . Then there
is a sequence of disjoint Bk ∈ E such that
\[ K \subset \bigcup_{k=1}^{n} B_k \;\cup \bigcup_{k=n+1}^{\infty} 3B_k \]
for every n ≥ 1.
Proof. For each n > 0 we can cover K by finitely many balls in E which all have
radius ≤ 1/n. Since the union of such coverings is also a Vitali covering, we can
assume that for every ε > 0 the collection E has only finitely many balls with
radius > ε. The same greedy algorithm as above then gives a finite or infinite
sequence of disjoint balls Bk with decreasing radius. If the sequence is finite,
then K ⊂ ⋃ Bk and we are done. Otherwise, fix n ≥ 1 and suppose x ∈ K with
x ∉ B1 ∪ . . . ∪ Bn . By the Vitali property, x ∈ B for some B ∈ E with radius less
than any of Bn . If B was not chosen by the algorithm, then it must intersect
some Bm , m > n, and so x ∈ 3Bm by the same reasoning as above.
We return to the real line, where the previous theorem allows us to prove a
strong version of Littlewood’s first principle.
Theorem 6.4 (Vitali’s Lemma). Let E ⊂ R be measurable with m(E) < ∞
and ℰ a Vitali covering of E by intervals. Then for every ε > 0 there is a finite
collection of disjoint intervals B1 , . . . , Bn ∈ ℰ with
\[ m\Bigl( E \,\triangle\, \bigcup_{k=1}^{n} B_k \Bigr) < \varepsilon. \]

Proof. Since E has finite measure we can find K ⊂ E ⊂ U with K compact, U
open and m(U ) − m(K) < ε. We may remove all intervals from the covering ℰ which
are not contained in U , while still keeping it a Vitali covering. The previous
theorem gives a sequence of disjoint Bk ∈ ℰ with
\[ K \subset \bigcup_{k=1}^{n} B_k \;\cup \bigcup_{k=n+1}^{\infty} 3B_k \]
for all n ≥ 1. In particular we have Σ m(Bk ) ≤ m(U ) < ∞, so choose n ≥ 1
with
\[ 3 \sum_{k=n+1}^{\infty} m(B_k) < \varepsilon \]
which implies
\[ m(U) - \varepsilon \le m(K) \le \sum_{k=1}^{n} m(B_k) + \varepsilon \]
so we get m(U \ (B1 ∪ . . . ∪ Bn )) < 2ε. By assumption we also have m(U \ E) < ε, so
\[ m\Bigl( E \,\triangle\, \bigcup_{k=1}^{n} B_k \Bigr) < 3\varepsilon \]
and we are done.

Lebesgue density
As an application of Vitali’s lemma we can prove Lebesgue’s density theorem.

Theorem 6.5 (Lebesgue’s density theorem). Let E ⊂ R be measurable, then
\[ \lim_{r \to 0} \frac{m(E \cap B(x,r))}{m(B(x,r))} = \chi_E(x) \]
for almost every x ∈ R.


The limit on the left hand side is the density of E at x, and may not exist
everywhere. The theorem tells us that for measurable sets, the density is either
0 or 1 almost everywhere.
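To see that the density can take other values at exceptional points, here is a small worked example added for illustration, with the set E = [0, ∞) chosen for convenience. At the boundary point x = 0 exactly half of every ball lies in E:

```latex
\[
  E = [0,\infty), \qquad
  \frac{m(E \cap B(0,r))}{m(B(0,r))}
  = \frac{m([0,r))}{m((-r,r))}
  = \frac{r}{2r}
  = \frac{1}{2}
  \quad \text{for all } r > 0 .
\]
```

So the density at 0 is 1/2, which differs from χE (0) = 1. This is consistent with the theorem, since the single point {0} is a null set.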
Proof. Since the statement is local, we may assume that E is contained in a
compact interval [a, b]. Let n > 0 and A ⊂ E the set of points x where
\[ \liminf_{r \to 0} \frac{m(E \cap B(x,r))}{m(B(x,r))} < 1 - \frac{1}{n}. \]
By definition, each x ∈ A is contained in an interval I of arbitrarily small size
so that the density of E in I is less than 1 − 1/n. These intervals form a Vitali
covering of A, so by Vitali’s lemma, for any ε > 0 there are disjoint such
intervals I1 , . . . , IN with
\[ m\Bigl( A \,\triangle\, \bigcup_{k=1}^{N} I_k \Bigr) < \varepsilon. \]
We get
\[
\begin{aligned}
m(A) &\le \varepsilon + \sum_{k=1}^{N} m(A \cap I_k) \\
     &\le \varepsilon + \Bigl(1 - \frac{1}{n}\Bigr) \sum_{k=1}^{N} m(I_k) \\
     &\le \varepsilon + \Bigl(1 - \frac{1}{n}\Bigr) \bigl(m(A) + \varepsilon\bigr)
\end{aligned}
\]
hence m(A) ≤ (1 − 1/n)m(A), since ε > 0 was arbitrary, so m(A) = 0. Since a
countable union of null sets is a null set, this shows that the density
tends to 1 for almost every x ∈ E. Replacing E by [a, b] \ E, we conclude that
the limit is 0 for almost every x ∉ E.
Clearly, the density of E does not change if we modify E by a null set. As
a consequence, we have a canonical representative among the measurable sets
which differ from E by a null set: The set of points of density 1.
A nice application of Lebesgue’s density theorem is the ergodicity of an
irrational rotation on the circle S^1.
Theorem 6.6. Let α be irrational and define

T : [0, 1) → [0, 1), T (x) = x + α mod 1

then T is ergodic in the sense that if A ⊂ [0, 1) is measurable and T -invariant,
i.e. T (A) ⊂ A, then m(A) = 0 or m(A) = 1.
The notion of ergodicity makes sense for any measure preserving map from
a probability space to itself.
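A numerical illustration may help. The sketch below, added here and not part of the notes, follows one orbit of T and records how often it visits an interval. It relies on the standard fact (not proved in this section) that orbits of an irrational rotation are equidistributed, so the visit frequency should approach the length of the interval. The value `math.sqrt(2) - 1` is only a floating-point stand-in for an irrational α, and the helper names are illustrative.

```python
import math

def orbit(x0, alpha, n):
    """First n points x0, T(x0), T^2(x0), ... of T(x) = x + alpha mod 1."""
    return [(x0 + k * alpha) % 1.0 for k in range(n)]

def visit_frequency(points, a, b):
    """Fraction of the points that lie in the interval [a, b)."""
    return sum(a <= p < b for p in points) / len(points)

if __name__ == "__main__":
    alpha = math.sqrt(2) - 1   # floating-point stand-in for an irrational number
    pts = orbit(0.1, alpha, 10_000)
    # The interval [0.2, 0.5) has length 0.3, so the frequency should be near 0.3.
    print(visit_frequency(pts, 0.2, 0.5))
```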
Proof. Consider the density
\[ \delta_r(x) = \frac{m(A \cap B(x,r))}{m(B(x,r))}. \]
Since T is measure preserving and T (A) ⊂ A, T (A) can differ from A only
by a null set, thus δr (x) is constant on the orbits of T . But since the orbits
of T are dense in [0, 1) and δr is continuous, δr is in fact constant. Assume
that m(A) ≠ 0, then by Lebesgue’s density theorem there is an x0 ∈ A with
lim r→0 δr (x0 ) = 1, thus lim δr (x) = 1 for all x ∈ [0, 1) and m(A) = 1.

Corollary 6.7. A measurable function f : [0, 1) → R which is invariant under
T agrees almost everywhere with a constant function.
Proof. Partition the target R into intervals Ik of length m(Ik ) ≤ ε. By the
theorem each f^{−1}(Ik ) has measure 0 or 1, but since m([0, 1)) = 1, there must
be exactly one Ik with m(f^{−1}(Ik )) = 1. Sending ε → 0 the intersection of such
intervals contains a single point, the value which f achieves almost everywhere.

Monotone functions
A function f : [a, b] → R is increasing if x ≤ y =⇒ f (x) ≤ f (y), decreasing
if −f is increasing, and monotone if it is either increasing or decreasing. We
can finally return to the proof of the following theorem.
Theorem 6.8. A monotone function f : [a, b] → R is differentiable almost
everywhere.

Proof. It suffices to consider the case when f is increasing. Let E ⊂ [a, b]
be the set of points where the derivative f′ does not exist. We remove from E
the countable set of points where f is discontinuous, the set of points where
f′(x) = +∞, which is a null set by an easy application of Vitali’s lemma, as
well as the endpoints a, b.
For a subset A ⊂ R let |A| = sup A − inf A be the length of the smallest
interval containing A. If I ⊂ [a, b] is an open interval, let
\[ D(I) = \frac{|f(I)|}{|I|} \]
which is an approximation to f′ on I. For x ∈ E, the value of D(I) does not
converge as |I| → 0 with x ∈ I. If for r < s ∈ Q we let Er,s be the set of
x ∈ [a, b] such that there are arbitrarily small open intervals I, J containing x
with
D(I) < r < s < D(J)
then E is the union of the sets Er,s over all rational r < s. It suffices to show
that A := Er,s is a null set for each r < s.
The idea is that on A the function f behaves like one with both f′ ≤ r and
f′ ≥ s, thus
s m(A) ≤ m(f (A)) ≤ r m(A)
leading to a contradiction if m(A) > 0.
For ε > 0, Vitali’s lemma provides us with disjoint intervals I1 , . . . , In with
D(Ik ) ≤ r and m(A △ ⋃ Ik ) < ε. Thus,
\[ m\Bigl( \bigcup_k I_k \Bigr) = \sum_k |I_k| < m(A) + \varepsilon. \]
Applying Vitali’s lemma to A ∩ ⋃ Ik gives disjoint intervals J1 , . . . , Jm with
⋃ Jk ⊂ ⋃ Ik , D(Jk ) > s, and
\[ m\Bigl( \bigcup_k J_k \Bigr) = \sum_k |J_k| > m(A) - 2\varepsilon. \]
Now monotonicity implies
\[ s \sum_k |J_k| \le \sum_k |f(J_k)| \le \sum_k |f(I_k)| \le r \sum_k |I_k|. \]
Combining this with the previous estimates for m(A) we get
\[ \frac{m(A) - 2\varepsilon}{m(A) + \varepsilon} \le \frac{\sum_k |J_k|}{\sum_k |I_k|} \le \frac{r}{s} < 1 \]
but the left hand side goes to 1 as ε → 0 unless m(A) = 0.


As a consequence of the theorem, if f is monotone then ∫_a^b f′ is well defined.
The examples at the beginning of the chapter show that this may not be equal
to f (b) − f (a), but we get at least an upper bound.
Theorem 6.9. Let f : R → R be an increasing function, then
\[ \int_a^b f' \le f(b) - f(a). \]

Proof. Modify f so that it is constant, equal to f (b), on [b, ∞). This does not
change either side of the inequality. Consider
\[ f_n(x) = n \Bigl( f\Bigl(x + \frac{1}{n}\Bigr) - f(x) \Bigr) \ge 0 \]
which converges pointwise to f′ a.e., thus
\[ \int_a^b f' \le \liminf_{n \to \infty} \int_a^b f_n \]
by Fatou’s lemma. However for n sufficiently large so that a + 1/n < b we get
\[
\begin{aligned}
\int_a^b f_n &= n \Bigl( \int_{a+1/n}^{b+1/n} f - \int_a^b f \Bigr) \\
             &= n \int_b^{b+1/n} f - n \int_a^{a+1/n} f \\
             &\le f(b) - f(a)
\end{aligned}
\]
by monotonicity of f .
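The inequality can be strict. For the Cantor function c the derivative vanishes almost everywhere, so the integral of c′ over [0, 1] is 0, while c(1) − c(0) = 1. The following minimal sketch, added for illustration, evaluates c from the ternary expansion of x; the helper name `cantor` and the truncation depth are choices made for this sketch.

```python
from fractions import Fraction

def cantor(x, depth=40):
    """Cantor function on [0, 1], from the first `depth` ternary digits of x.
    Exact on the plateaus; elsewhere truncated to within 2**(-depth)."""
    if x >= 1:
        return Fraction(1)
    value, scale = Fraction(0), Fraction(1, 2)
    for _ in range(depth):
        x *= 3
        digit = int(x)          # next ternary digit (x is non-negative)
        x -= digit
        if digit == 1:          # x lies in a removed middle third: plateau
            return value + scale
        value += scale * (digit // 2)   # ternary digit 2 becomes binary digit 1
        scale /= 2
    return value

if __name__ == "__main__":
    for x in (Fraction(0), Fraction(1, 4), Fraction(1, 3), Fraction(1, 2), Fraction(1)):
        print(x, float(cantor(x)))
```

The function is constant on each removed middle third, which is where the early `return` is triggered, and these intervals have total length 1.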

A source of monotone functions are the Borel measures, which are measures
on the σ-algebra of Borel subsets of R. Let us restrict to measures with µ(R) <
∞ for simplicity, the finite measures. We have already seen some examples
of Borel measures in the previous section, but let us be more systematic now.
There are three basic types.
1. Absolutely continuous. Given a measurable f : R → [0, +∞] with
∫ f < ∞ we get a measure mf ,
\[ m_f(E) = \int_E f \, dm \]
which is a weighted form of the Lebesgue measure m on R. Such a measure
has the property that m(E) = 0 =⇒ mf (E) = 0, i.e. any null set for
the Lebesgue measure is also a null set for mf . This turns out to uniquely
characterize such measures (Radon–Nikodym theorem).
2. Pure point. These are based on the Dirac measure. Let xk ∈ R, ak ∈
[0, ∞) be countable sequences of numbers with Σ ak < ∞. Then we get
a measure
\[ \mu = \sum_k a_k \delta_{x_k} \]
where δxk is the Dirac measure at xk . This means that
\[ \mu(E) = \sum_{k,\; x_k \in E} a_k . \]
The pure point measures are characterized by having support on a
countable set.
3. Singular continuous. Such measures are supported on a set of Lebesgue
measure zero, but unlike the pure point measures do not assign positive
measure to any single point. An example is the Cantor measure µ with

µ(E) = m(f (E))

where f : [0, 1] → [0, 1] is the Cantor function. It is supported entirely on
the Cantor set.
The Lebesgue decomposition theorem states that any Borel measure µ has
a unique decomposition
µ = µac + µsc + µpp
with µac , µsc , µpp being absolutely continuous, singular continuous, and pure
point, respectively. Intuitively, one can think of µac as the 1-dimensional part of
µ, µpp the 0-dimensional part of µ, and µsc the part of µ which has intermediate
“fractal” dimension in (0, 1).
Now if µ is a Borel measure on R, then its cumulative distribution

F (x) = µ((−∞, x])

is an increasing function from which µ can be uniquely recovered, since the
intervals (−∞, x] generate the Borel σ-algebra. The discontinuities in F come
from µpp , while µac is responsible for F′ = f , where f is the density of µac .
A singular continuous measure gives a continuous function with F′ = 0 almost
everywhere, like the Cantor function.

Functions of bounded variation


If f, g are increasing functions, then by our theorem f − g is also differentiable
almost everywhere. How can we see if a function is a difference of increasing
functions? For f : [a, b] → R define the total variation
\[ \|f\|_{BV} = \sup_{a_0 < \dots < a_n} \sum_{k=1}^{n} |f(a_k) - f(a_{k-1})| \]
and f is of bounded variation if ‖f‖BV < ∞.

Theorem 6.10. A function f : [a, b] → R is of the form f = g − h with g, h


increasing if and only if it is of bounded variation.
Proof. For an increasing f we have ‖f‖BV = f (b) − f (a) < ∞, and it is easy
to see that functions of bounded variation form a vector space, which thus
includes differences of monotone functions.
Assume now that ‖f‖BV < ∞, and define
\[ f_+(x) = \sup_{a_0 < \dots < a_n = x} \sum_{k=1}^{n} \bigl( f(a_k) - f(a_{k-1}) \bigr)_+ \]
\[ f_-(x) = \sup_{a_0 < \dots < a_n = x} \sum_{k=1}^{n} \bigl( f(a_k) - f(a_{k-1}) \bigr)_- \]
where t+ = max(t, 0) and t− = max(−t, 0). These are increasing non-negative
functions of x, bounded by ‖f‖BV . Note
that the sums in the definition of f+ , f− both increase if the partition of [a, x] is
refined. This means we can find a single sequence a = a0 < . . . < an = x
(common refinement) such that the sums differ from f+ (x) and f− (x) respectively
by no more than some ε > 0. But then
\[
\sum_{k=1}^{n} \Bigl( \bigl( f(a_k) - f(a_{k-1}) \bigr)_+ - \bigl( f(a_k) - f(a_{k-1}) \bigr)_- \Bigr)
= \sum_{k=1}^{n} \bigl( f(a_k) - f(a_{k-1}) \bigr)
= f(x) - f(a)
\]
thus, since ε > 0 was arbitrary,
f+ (x) − f− (x) = f (x) − f (a)
which proves the claim.
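The bookkeeping behind this identity can be seen on a single partition. The sketch below is an illustration added here, with hypothetical sample values. It splits the increments of f along one fixed partition into positive and negative parts. It does not take the supremum over partitions, so it only yields lower bounds for f+ , f− and the total variation.

```python
def variation_sums(values):
    """Positive, negative and total variation sums of the samples
    f(a_0), ..., f(a_n) along one fixed partition a_0 < ... < a_n."""
    steps = [later - earlier for earlier, later in zip(values, values[1:])]
    positive = sum(max(step, 0) for step in steps)
    negative = sum(max(-step, 0) for step in steps)
    return positive, negative, positive + negative

if __name__ == "__main__":
    samples = [0, 3, 1, 4, 4, 2]          # hypothetical values f(a_k)
    pos, neg, total = variation_sums(samples)
    print(pos, neg, total)
    # The telescoping identity: positive part minus negative part = f(x) - f(a).
    print(pos - neg == samples[-1] - samples[0])
```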

Just like monotone functions are related to measures, functions of bounded
variation are related to signed measures which can take negative values. The
theorem then corresponds to the Hahn–Jordan decomposition theorem, exhibiting a
signed measure µ as the difference µ+ − µ− of two unsigned ones.

Absolute continuity
We have seen that ∫_a^b f′ makes sense for any function of bounded variation,
but may not be equal to f (b) − f (a). To recover the fundamental theorem of
calculus, we need to impose a stronger condition. A function f : [a, b] → R is
absolutely continuous if for any ε > 0 there is a δ > 0 such that if (ak , bk ),
k = 1, . . . , n are disjoint intervals in [a, b] of total length less than δ, then
\[ \sum_{k=1}^{n} |f(a_k) - f(b_k)| < \varepsilon. \]
The Cantor function is not absolutely continuous, because we can cover the
Cantor set, where all the variation is concentrated, by finitely many intervals
of arbitrarily small total length. The function √x, restricted to [0, 1], say, is
absolutely continuous on the other hand, even though it is infinitely
steep at 0. The following theorem shows that absolute continuity is a necessary
condition for the fundamental theorem of calculus to hold.
Theorem 6.11. Suppose f : [a, b] → R is absolutely integrable, then
F (x) = ∫_a^x f is absolutely continuous.
This is a corollary of the following fact.
Theorem 6.12. Let f ≥ 0 with ∫ f < ∞, then for every ε > 0 there is a δ > 0
so that ∫_E f < ε if m(E) < δ.
Proof. Let fn = min(f, n) be the truncation, then fn → f pointwise and thus
there is an N > 0 so that
\[ \int (f - f_N) < \frac{\varepsilon}{2} \]
by the monotone convergence theorem. Setting δ = ε/(2N ) we have
\[ \int_E f \le \int_E (f - f_N) + N\, m(E) < \varepsilon \]
for any E with m(E) < δ.


Theorem 6.13. An absolutely continuous function f : [a, b] → R is of bounded
variation.
Proof. By assumption there is a δ > 0 so that
\[ \sum_{k=1}^{n} |f(a_k) - f(b_k)| < 1 \]
whenever (ak , bk ), k = 1, . . . , n are disjoint intervals in [a, b] of total length less
than δ. This shows that ‖f‖BV ≤ ⌈(b − a)/δ⌉.

The fundamental theorem of calculus


Let L^1([a, b]) denote the vector space of absolutely integrable functions f :
[a, b] → R, but with those functions identified which agree almost everywhere.
This vector space comes with the norm
\[ \|f\|_{L^1} = \int_a^b |f| \]
which has the non-degeneracy property ‖f‖L1 = 0 =⇒ f = 0 because we have
quotiented by functions with support on a null set.
Also let AC([a, b]) be the vector space of absolutely continuous functions
f : [a, b] → R where this time we mod out by constant functions. This makes
‖f‖BV a norm on AC([a, b]).
Theorem 6.14. Differentiation defines an isomorphism D : AC([a, b]) →
L^1([a, b]) with inverse I sending f to F (x) = ∫_a^x f . Furthermore, ‖F′‖L1 =
‖F‖BV , so this is an isomorphism of normed spaces.
We show first that I is injective.
Lemma 6.15. If f : [a, b] → R is absolutely integrable with ∫_a^x f = 0 for all
x ∈ [a, b], then f = 0 almost everywhere.
Proof. Note that the subsets E ⊂ [a, b] such that ∫_E f = 0 form a σ-algebra
which contains all null sets and all intervals, by assumption, thus coincides
with the σ-algebra of measurable subsets of [a, b]. But then it must contain
both {x : f (x) < 0} and {x : f (x) > 0}, so f = 0 a.e.
Next we will show that D ◦ I is the identity, i.e. if f : [a, b] → R is absolutely
integrable, then
\[ \frac{d}{dx} \int_a^x f = f(x) \]
almost everywhere.
We first specialize to the case where f is bounded. Recall that F : [a, b] → R
is Lipschitz continuous if there is a C ≥ 0 with

|F (x) − F (y)| ≤ C|x − y|

for all x, y ∈ [a, b]. Such a function is clearly absolutely continuous.


Lemma 6.16. Suppose that f : [a, b] → R is bounded and measurable, then
F = I(f ) is Lipschitz continuous with F 0 (x) = f (x) almost everywhere.

Proof. Let |f | ≤ C, then |F (x) − F (y)| ≤ C|x − y|, so F is Lipschitz continuous.
By injectivity of I it suffices to show that ∫_a^x (F′ − f ) = 0 for all x ∈ [a, b].
This is an application of the dominated convergence theorem to the sequence
of functions
Fn (x) := n(F (x + 1/n) − F (x))
which converge pointwise a.e. to F′ and are dominated by the constant function
C on [a, b]. This gives us
\[
\begin{aligned}
\int_a^x F' &= \lim_{n \to \infty} \int_a^x F_n \\
            &= \lim_{n \to \infty} n \Bigl( \int_x^{x+1/n} F - \int_a^{a+1/n} F \Bigr) \\
            &= F(x) - F(a) = \int_a^x f
\end{aligned}
\]
where the averages converge to the value of F at x, a since F is continuous.


Lemma 6.17. The composition D ◦ I is the identity on L1 ([a, b]).
Proof. Since both D and I are linear, it suffices to consider the case when f is
non-negative. Let fn := min(f, n) which is an increasing sequence of bounded
functions converging pointwise to f . Thus, if we consider Fn := I(fn ), then the
monotone convergence theorem implies that Fn converges to F pointwise. Since
F − Fn is increasing we also have F′ ≥ Fn′ , and Fn′ = fn by the previous result,
thus
\[ \int_a^x F' \ge \int_a^x f_n = F_n(x) \to F(x). \]
But the general estimate for monotone functions gives the reverse inequality:
\[ \int_a^x F' \le F(x) - F(a) = F(x) \]
hence
\[ \int_a^x (F' - f) = 0 \]
for all x ∈ [a, b], which by Lemma 6.15 shows F′ = f a.e.
Having shown D ◦ I is the identity, it follows that I ◦ D is the identity on
AC([a, b]) as long as D is injective. This is the content of the following lemma.
Lemma 6.18. The only absolutely continuous functions F with F 0 (x) = 0
almost everywhere are the constant ones.
Proof. Let c ∈ [a, b] and E the set of points in [a, c] where F′ = 0. Fix ε > 0,
then each x ∈ E is contained in an open interval I with
\[ \left| \frac{F(x) - F(y)}{x - y} \right| \le \varepsilon \]
for x ≠ y ∈ I. Vitali’s lemma gives us disjoint open intervals I1 , . . . , In with this
property which cover E up to a set of measure < ε. By absolute continuity the
total variation of F over the complement of the Ik , which is a union of intervals,
is at most some ε′ which can be made arbitrarily small if ε → 0. Putting this
together we get
\[ |F(c) - F(a)| \le \varepsilon' + 2\varepsilon \sum_k m(I_k) \le \varepsilon' + 2\varepsilon (b - a) \]
which is arbitrarily small, so F (c) = F (a) and F is constant.

This completes the proof of our main theorem in this section.

Convex functions
A function f : R → R is convex if every one of its chords lies above the graph,
that is
f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y)
for any x, y ∈ R, λ ∈ [0, 1]. The definition makes sense more generally when
the domain of f is a convex set. For the discussion below, f could just as well
be defined on an open interval. Geometrically, convexity means that the secant
slope
\[ \frac{f(x) - f(y)}{x - y}, \qquad x < y \]
increases if x or y is increased.
Theorem 6.19. A convex function f : R → R is Lipschitz continuous on any
compact interval, and differentiable outside a countable set.
Proof. Any secant to f with endpoints in [a, b] has slope bounded above by the
slope of the secant from b to b + ε, since the endpoints of the latter are to the
right. This shows Lipschitz continuity on [a, b].
By convexity, the left and right derivative of f exist everywhere and are
increasing functions. Moreover, the left/right derivatives must be discontinuous
at the points where they are not equal, and a monotone function can have at
most countably many jumps.

When f : R → R is an increasing function, then the average
\[ \frac{1}{y - x} \int_x^y f, \qquad x < y \]
clearly increases if x or y are increased, thus F (x) = ∫_a^x f is convex. Combining
this with the fundamental theorem of calculus we get a criterion for convexity.
Corollary 6.20. A function f : R → R is convex if and only if it is absolutely
continuous on any compact interval and f′ is increasing.

A useful fact about convex functions is Jensen’s inequality.
Theorem 6.21 (Jensen’s inequality). Suppose (X, A, µ) is a probability space,
f : X → R absolutely integrable, and φ : R → R convex, then
\[ \varphi\Bigl( \int_X f \, d\mu \Bigr) \le \int_X \varphi \circ f \, d\mu. \]
In terms of the expectation value, this can be written as φ(E(f )) ≤ E(φ(f )).
The convexity condition φ(λx+(1−λ)y) ≤ λφ(x)+(1−λ)φ(y) is just the special
case where X has only two points, µ assigns them probabilities λ, (1 − λ), and
f takes values x, y.
Proof. Let
\[ x_0 = \int_X f \, d\mu \]
then by convexity there is a linear function ax + b with ax + b ≤ φ(x) and
φ(x0 ) = ax0 + b. In particular, φ(f (x)) ≥ af (x) + b for all x ∈ X, thus
\[ \int \varphi \circ f \ge \int (a f + b) = a x_0 + b = \varphi(x_0) = \varphi\Bigl( \int f \Bigr). \]

For concave functions we get the reverse inequality. An important example
is the following. Let X = {1, . . . , n} with µ({k}) = pk ∈ (0, 1), Σ pk = 1, and
f (k) = ak > 0. Since φ = log is concave we get
\[ \sum_{k=1}^{n} p_k \log a_k \le \log\Bigl( \sum_{k=1}^{n} p_k a_k \Bigr) \]
thus taking exponentials
\[ \prod_{k=1}^{n} a_k^{p_k} \le \sum_{k=1}^{n} p_k a_k \]
which is the inequality between weighted geometric and arithmetic mean.
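As a quick numerical sanity check of this inequality, here is a minimal sketch added for illustration. The function name and the sample weights are not from the notes, and the weights are assumed to be positive with sum 1.

```python
import math

def weighted_means(weights, values):
    """Weighted geometric and arithmetic means of positive numbers.
    The weights are assumed positive with sum 1."""
    geometric = math.exp(sum(p * math.log(a) for p, a in zip(weights, values)))
    arithmetic = sum(p * a for p, a in zip(weights, values))
    return geometric, arithmetic

if __name__ == "__main__":
    g, a = weighted_means([0.5, 0.5], [1.0, 4.0])
    print(g, a)        # geometric mean sqrt(1 * 4), arithmetic mean (1 + 4) / 2
    print(g <= a)
```

The geometric mean is computed through logarithms, exactly as in the derivation above.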

References and further reading


• T. Tao: “An introduction to measure theory.”
• C. McMullen: “Real Analysis.” (Math 114 course notes)
• Stein, Shakarchi: “Real Analysis” (Princeton lectures in Analysis 3)

Chapter 7

Lp spaces

We are usually convinced more easily by reasons we have found ourselves
than by those which have occurred to others.
– Blaise Pascal

For many purposes, such as solving partial differential equations, it is useful
to express a function as a Fourier series
\[ f(x) = \sum_{n \in \mathbb{Z}} a_n e^{2\pi i n x}. \]
When does the right hand side make sense, i.e. define a periodic function on
R? Assuming that Σ |an | < ∞, the series converges uniformly to a continuous
function (Weierstrass M-test). If on the other hand we have only the weaker
condition Σ |an |^2 < ∞, then it is not clear that the series converges even
pointwise.
As an example consider
\[ \sum_{n=1}^{\infty} \frac{1}{n} e^{2\pi i n x} \]
which clearly diverges for x ∈ Z, but does converge at non-integer points as can
be seen from the identity
\[ \sum_{n=1}^{\infty} \frac{z^n}{n} = -\log(1 - z) \]
which holds for z ∈ C with |z| ≤ 1, z ≠ 1.
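The convergence at non-integer points is slow but visible numerically. The following minimal sketch, added for illustration, compares a partial sum with the closed form at x = 1/4, where z = i. The function names and the number of terms are choices made here. The principal branch of the logarithm is used, which is the appropriate one because 1 − z has non-negative real part when |z| ≤ 1.

```python
import cmath

def partial_sum(x, terms):
    """Partial sum of sum_{n >= 1} exp(2*pi*i*n*x) / n."""
    z = cmath.exp(2j * cmath.pi * x)
    total, power = 0j, 1 + 0j
    for n in range(1, terms + 1):
        power *= z            # power == z**n
        total += power / n
    return total

def closed_form(x):
    """Value -log(1 - z) with z = exp(2*pi*i*x), using the principal branch."""
    z = cmath.exp(2j * cmath.pi * x)
    return -cmath.log(1 - z)

if __name__ == "__main__":
    x = 0.25
    for terms in (10, 1000, 20000):
        print(terms, abs(partial_sum(x, terms) - closed_form(x)))
```

The error decays roughly like 1/N in the number of terms N, reflecting that the series converges only conditionally.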


As it turns out, any sequence an ∈ C with Σ |an |^2 < ∞ defines a
square-integrable function f , i.e. one with
\[ \int_0^1 |f|^2 < \infty \]
up to equality almost everywhere. This is a consequence of the completeness
of the inner product space L^2([0, 1]) of such functions, and another advantage
of Lebesgue theory. We have already met L^1([a, b]), the space of absolutely
integrable functions, in the discussion of the fundamental theorem of calculus.
More generally, one can define L^p(R) for p ∈ [1, ∞], which is the prototypical
example of a Banach space.

Normed vector spaces and Banach spaces


We have already seen some examples of vector spaces with some notion of length,
‖v‖, of a vector v. More generally, let V be a vector space over R or C, then a
function V → R, v ↦ ‖v‖, is a norm on V if
• ‖v‖ ≥ 0 for v ∈ V and ‖v‖ = 0 ⇐⇒ v = 0 (non-degeneracy)
• ‖λv‖ = |λ|‖v‖ for any v ∈ V , scalar λ (homogeneity)
• ‖v + w‖ ≤ ‖v‖ + ‖w‖ for any v, w ∈ V (triangle inequality)

A vector space together with a norm is called a normed (vector) space.
The norm gives us a notion of distance, d(v, w) := ‖v − w‖, on V giving it
the structure of a metric space, so we can talk about convergence, open sets,
compact sets, and so on.
Any norm on V defines a unit ball B = {v | ‖v‖ ≤ 1} from which it can be
recovered via
‖v‖ = inf{λ > 0 | v ∈ λB}.
The three properties of a norm imply that: B meets every line through the
origin in a non-trivial closed interval, λB = B if |λ| = 1, and B is convex.
Conversely, any B ⊂ V satisfying these three properties defines a norm with
unit ball B.
A Banach space is a normed space which is complete as a metric space,
meaning that any Cauchy sequence converges. Recall that a metric space is
complete if a sequence xn converges to a limit provided that for every ε > 0
there exists some N > 0 with d(xm , xn ) < ε for m, n ≥ N . In normed spaces
there is another criterion which is sometimes simpler to verify.
Theorem 7.1. A normed vector space V is complete if and only if any series
Σ xn with xn ∈ V which is absolutely convergent, i.e. Σ ‖xn‖ < ∞, is convergent
in V , meaning that the limit
\[ \lim_{n \to \infty} \sum_{k=1}^{n} x_k =: \sum_{n=1}^{\infty} x_n \]
exists.

Proof. Suppose xn is a Cauchy sequence in V , then we can pass to a subsequence
such that ‖xn − xn+1‖ < 2^{−n}. This implies that the series

x1 + (x2 − x1 ) + (x3 − x2 ) + . . .

is absolutely convergent, and so the partial sums, which are xn , converge to
some limit x. Since a Cauchy sequence with a convergent subsequence converges
to the same limit, the original sequence converges as well. The reverse
implication is clear.

Lp (X)
Let X ⊂ R be a measurable set. For p ∈ [1, ∞) we define L^p(X) as the normed
vector space of measurable functions f : X → R with ∫_X |f |^p < ∞ with norm
\[ \|f\|_p := \Bigl( \int_X |f|^p \Bigr)^{1/p} \]
where we identify functions which agree almost everywhere in order to ensure
non-degeneracy. One defines L^∞(X) as the vector space of essentially bounded
functions, i.e. measurable functions f : X → R such that there exists some
M ≥ 0 with |f | ≤ M almost everywhere. The smallest such M is by definition
the norm ‖f‖∞ . Again, functions are identified if they differ on a null set.
One can just as well consider complex-valued functions f : X → C, giving a
vector space over C. Much more generally, one can define L^p(X) in the same
way whenever X is a measure space. For example, if X = {1, . . . , n} is a finite
set with the counting measure, then L^p(X) = Rn with one of the norms
\[ \|x\|_p = \Bigl( \sum_{k=1}^{n} |x_k|^p \Bigr)^{1/p}, \qquad \|x\|_\infty = \max_k |x_k| . \]

We still need to check that Lp (X) is a Banach space, which follows from
results below.
Theorem 7.2 (Hölder’s inequality). Let p, q ∈ [1, ∞] with
\[ \frac{1}{p} + \frac{1}{q} = 1 \]
and f, g : X → R measurable, then
\[ \int_X |f g| \le \|f\|_p \|g\|_q . \]

Proof. If, say p = 1, q = ∞, then the inequality follows from |f g| ≤ |f |‖g‖∞
almost everywhere, so we may assume that p, q ∈ (1, ∞). Furthermore, there is
nothing to check if ‖f‖p = 0, since then f = 0 a.e. and thus f g = 0 a.e.
(similarly if ‖g‖q = 0), or if one of the norms is infinite. Thus, assume
‖f‖p , ‖g‖q ∈ (0, ∞). Passing to f /‖f‖p and
g/‖g‖q and using homogeneity of the p-norm, we can in fact reduce to the case
‖f‖p = ‖g‖q = 1. Young’s inequality and the condition on p and q imply the
pointwise estimate
\[ |f(x) g(x)| \le \frac{|f(x)|^p}{p} + \frac{|g(x)|^q}{q} \]
hence
\[ \int_X |f g| \le \int_X \Bigl( \frac{|f|^p}{p} + \frac{|g|^q}{q} \Bigr) = \frac{1}{p} + \frac{1}{q} = 1 \]
which shows the claim.
The following shows that Lp (X) is a normed space.
Theorem 7.3 (Minkowski’s inequality). Let f, g : X → R be measurable, then

‖f + g‖p ≤ ‖f‖p + ‖g‖p

i.e. the triangle inequality holds for the p-norm, p ∈ [1, ∞].
Proof. The cases p = 1 and p = ∞ are easy, so assume p ∈ (1, ∞) and let
1/q = 1 − 1/p. We have
\[
\begin{aligned}
\|f + g\|_p^p &= \int_X |f + g| \cdot |f + g|^{p-1} \\
  &\le \int_X (|f| + |g|)\, |f + g|^{p-1} \\
  &= \int_X |f|\, |f + g|^{p-1} + \int_X |g|\, |f + g|^{p-1} \\
  &\le (\|f\|_p + \|g\|_p) \Bigl( \int_X |f + g|^{(p-1)q} \Bigr)^{1/q} \\
  &= (\|f\|_p + \|g\|_p)\, \|f + g\|_p^{p-1}
\end{aligned}
\]
where we applied Hölder’s inequality twice and used (p − 1)q = p. Dividing by
‖f + g‖p^{p−1} gives Minkowski’s inequality.
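In the finite-dimensional case X = {1, . . . , n} with the counting measure, both inequalities can be checked directly on vectors. The following is a minimal sketch added for illustration; the function names and the sample vectors are assumptions of the sketch, and only exponents 1 < p < ∞ are handled by `holder_gap`.

```python
def p_norm(x, p):
    """p-norm of a finite sequence, for 1 <= p < infinity."""
    return sum(abs(t) ** p for t in x) ** (1.0 / p)

def holder_gap(f, g, p):
    """||f||_p * ||g||_q - sum |f_k g_k| with 1/p + 1/q = 1 (needs p > 1)."""
    q = p / (p - 1)
    return p_norm(f, p) * p_norm(g, q) - sum(abs(a * b) for a, b in zip(f, g))

def minkowski_gap(f, g, p):
    """||f||_p + ||g||_p - ||f + g||_p."""
    total = [a + b for a, b in zip(f, g)]
    return p_norm(f, p) + p_norm(g, p) - p_norm(total, p)

if __name__ == "__main__":
    f = [1.0, -2.0, 3.0]
    g = [0.5, 4.0, -1.0]
    for p in (1.5, 2, 3):
        # Both gaps should be non-negative, up to rounding.
        print(p, holder_gap(f, g, p), minkowski_gap(f, g, p))
```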
Finally, we verify that Lp (X) is a Banach space.
Theorem 7.4. The normed space Lp (X) is complete.
Proof. We use the criterion that absolutely convergent series are convergent.
Suppose fk ∈ L^p(X) with Σ ‖fk‖p < ∞. Let
\[ F(x) = \sum_{k=1}^{\infty} |f_k(x)| \]
then
\[ \|F\|_p \le \sum_{k=1}^{\infty} \|f_k\|_p \]
by the triangle inequality and monotone convergence, so in particular F ∈
L^p(X) and F is finite almost everywhere. Hence also f (x) = Σ fk (x) is
defined almost everywhere and in L^p(X), since |f | ≤ F . The series Σ fk
converges to f also in p-norm because
\[ \Bigl\| f - \sum_{k=1}^{n} f_k \Bigr\|_p \le \sum_{k=n+1}^{\infty} \|f_k\|_p \to 0 . \]

Elements of L^p(X), p < ∞, can be approximated by simple functions supported
in X. This is also true for L^∞(X) if one considers linear combinations
of characteristic functions χE without finiteness condition on m(E).

Theorem 7.5. Simple functions are dense in Lp (X) for 1 ≤ p < ∞.


Proof. Writing the function as f = f+ − f− we can assume f ≥ 0. If f is
unbounded, it can be approximated by truncations min(f, n) in the usual way,
so it suffices to consider the case where f is bounded. The increasing sequence
of simple functions
\[ f_n := \frac{1}{2^n} \lfloor 2^n f \rfloor \le f \]
converges pointwise to f , but also in p-norm since
\[ \lim_{n \to \infty} \int_X (f - f_n)^p = 0 \]
by dominated convergence with (f − fn )^p ≤ f^p .


Theorem 7.6. Compactly supported smooth functions, Cc∞ (R), are dense in
Lp (R) for 1 ≤ p < ∞.

Proof. By the previous theorem, it suffices to approximate characteristic
functions χE for measurable E with m(E) < ∞. First, the fact that such E can
be approximated arbitrarily well by finite unions of intervals (Littlewood’s first
principle) shows that there is a sequence of step functions converging to χE
in p-norm. But if E is an interval, then there is a smooth function g with
g = 1 in E and support contained in a slightly larger interval. This allows us to
construct a sequence of functions converging to χE .
Note that Cc∞ (R) is a vector space, and we can turn it into a normed vector
space using the p-norm for some p ∈ [1, ∞). Any normed space has a completion,
which may be defined abstractly using equivalence classes of Cauchy sequences,
just like R is constructed from Q. Without Lebesgue theory, it would not be
clear that elements in the completion are functions on R in some sense, and
rather difficult to get a handle on them. The results above show that the
completion of Cc∞ (R) with p-norm is just Lp (R), thus giving it a much more
concrete description.

Duality
If V is a normed linear space over R, then the dual space V ∗ consists of linear
functionals ϕ : V → R which are bounded in the sense that

|ϕ(v)| ≤ C‖v‖, v ∈ V

for some constant C ≥ 0. The norm of ϕ is the smallest such constant:

‖ϕ‖ := sup{|ϕ(v)| : v ∈ V, ‖v‖ ≤ 1}

This is just the ∞-norm of ϕ restricted to the unit ball in V .


Theorem 7.7. The dual space V ∗ is a Banach space.
Proof. Let ϕk ∈ V ∗ with Σ ‖ϕk‖ < ∞, then
\[ \sum_{k=1}^{\infty} |\varphi_k(v)| \le \|v\| \sum_{k=1}^{\infty} \|\varphi_k\| < \infty \]
for any v ∈ V . This means we can define ϕ(v) := Σ ϕk (v), since the sum is
absolutely convergent. It is easy to see that ϕ is linear with ‖ϕ‖ ≤ Σ ‖ϕk‖,
thus defines an element of V ∗ . Also,
\[ \Bigl| \varphi(v) - \sum_{k=1}^{n} \varphi_k(v) \Bigr| \le \|v\| \sum_{k=n+1}^{\infty} \|\varphi_k\| \longrightarrow 0, \quad \text{as } n \to \infty \]
hence ϕ = Σ ϕk .
We did not assume that V is a Banach space, but it turns out that V ∗ only
depends on the completion of V . For infinite-dimensional V it is not obvious
that V ∗ ≠ {0}. The Hahn–Banach theorem, which we will get to later, allows
one to construct many bounded linear functionals on V .
Returning to the example V = L^p(R), note that if 1/p + 1/q = 1, then
Hölder’s inequality shows that
\[ \varphi_f(g) = \int f g \]
defines a bounded functional L^p(R) → R for any f ∈ L^q(R). Furthermore, we
have ‖ϕf‖ = ‖f‖q since if ‖f‖q = 1, then
\[ \bigl\| \operatorname{sign}(f)\, |f|^{q/p} \bigr\|_p = 1 \]
and
\[ \varphi_f\bigl( \operatorname{sign}(f)\, |f|^{q/p} \bigr) = \int |f|^{1 + q/p} = \int |f|^q = 1 . \]
The general case follows by homogeneity. Thus f ↦ ϕf defines an isometry
L^q(R) → L^p(R)∗ , which turns out to be an isomorphism (onto) for p ≠ ∞.
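The exponent bookkeeping in this computation uses only the relation 1/p + 1/q = 1. Spelled out as a short check, added here for illustration and assuming 1 < p < ∞ and ‖f‖q = 1:

```latex
\[
  \frac{1}{p} + \frac{1}{q} = 1
  \;\Longrightarrow\;
  \frac{q}{p} = q - 1
  \;\Longrightarrow\;
  1 + \frac{q}{p} = q ,
\]
\[
  \bigl\| \operatorname{sign}(f)\,|f|^{q/p} \bigr\|_p^p
  = \int |f|^{(q/p)\,p}
  = \int |f|^{q}
  = \|f\|_q^q
  = 1 ,
  \qquad
  \int f \cdot \operatorname{sign}(f)\,|f|^{q/p}
  = \int |f|^{1 + q/p}
  = \int |f|^{q}
  = 1 .
\]
```

So the test function has unit p-norm and the functional takes the value 1 = ‖f‖q on it, which together with Hölder’s inequality gives the norm of the functional.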

Theorem 7.8. The map f ↦ ϕf defines an isomorphism L^p(R)∗ = L^q(R) for
p ≠ ∞, 1/p + 1/q = 1.
Proof. It only remains to show that the map is onto; injectivity follows from
the fact that the norm is preserved. Let p = 1 first. Given a bounded linear
functional ϕ : L^1(R) → R we must find a g ∈ L^∞(R) such that
\[ \varphi(f) = \int f g \]
for any f ∈ L^1(R). If the above relation holds, then the integral of g over [a, b]
is ϕ(χ[a,b] ). So the idea is to get g as the derivative of
\[
G(x) :=
\begin{cases}
\varphi(\chi_{[0,x]}) & x \ge 0 \\
-\varphi(\chi_{[x,0]}) & x \le 0
\end{cases}
\]
which satisfies
ϕ(χ[a,b] ) = G(b) − G(a)
for a < b. It follows from

|G(b) − G(a)| ≤ ‖ϕ‖ · ‖χ[a,b]‖1 = ‖ϕ‖(b − a)

that G is Lipschitz continuous, hence g := G′ ∈ L^∞(R). By the fundamental
theorem of calculus,
\[ \varphi_g(\chi_{[a,b]}) = \int_a^b G' = G(b) - G(a) = \varphi(\chi_{[a,b]}) \]
i.e. ϕg and ϕ agree on characteristic functions of intervals, thus by linearity on
step functions. Since step functions are dense in L^1(R), we get ϕg = ϕ.
We turn to the case of general p < ∞. Let ϕ ∈ L^p(R)∗ and define G as
before. If (ak , bk ), k = 1, . . . , n, is a collection of disjoint intervals of total
length ≤ δ, then
\[
\begin{aligned}
\sum_{k=1}^{n} |G(a_k) - G(b_k)| &= \sum_{k=1}^{n} |\varphi(\chi_{(a_k,b_k)})| \\
  &= \varphi\Bigl( \sum_{k=1}^{n} \pm \chi_{(a_k,b_k)} \Bigr) \\
  &\le \|\varphi\| \Bigl( \sum_{k=1}^{n} (b_k - a_k) \Bigr)^{1/p} \\
  &\le \|\varphi\|\, \delta^{1/p}
\end{aligned}
\]
showing that G is absolutely continuous. Let g = G′ , then

ϕ(χ[a,b] ) = ϕg (χ[a,b] )

for a < b. By the same argument as before, ϕ and ϕg agree on bounded functions
with bounded support.
To verify that g ∈ L^q(R) consider the truncations gn of g with |gn | ≤ n and
support in [−n, n]. We get
\[
\begin{aligned}
\int |g_n|^q &= \int g_n \cdot \operatorname{sign}(g_n)\, |g_n|^{q-1} \\
  &\le \int g \cdot \operatorname{sign}(g)\, |g_n|^{q-1} \\
  &= \varphi\bigl( \operatorname{sign}(g)\, |g_n|^{q-1} \bigr) \\
  &\le \|\varphi\| \cdot \bigl\| |g_n|^{q-1} \bigr\|_p \\
  &= \|\varphi\| \Bigl( \int |g_n|^q \Bigr)^{1/p}
\end{aligned}
\]
hence
\[ \Bigl( \int |g_n|^q \Bigr)^{1/q} \le \|\varphi\| \]
since 1 − 1/p = 1/q. Applying the monotone convergence theorem to the
increasing sequence |gn |^q → |g|^q shows that also ‖g‖q ≤ ‖ϕ‖. It follows from
Hölder’s inequality that ϕg ∈ L^p(R)∗ , and since it agrees with ϕ on the dense
set of bounded functions with bounded support, we get ϕg = ϕ.
The theorem still holds for general measure spaces, at least if p > 1. It also
holds in the case p = 1 if the measure space is σ-finite, i.e. a countable union
of sets of finite measure. This requires a different proof, since we relied heavily
on the fundamental theorem of calculus in the proof above. For p = ∞ the
theorem is in general not true — the dual of L∞ (X) is usually strictly bigger
than L1 (X).

The Hilbert space L2 (X)


For $1/p + 1/q = 1$ we get the symmetric case $p = q$ if and only if $p = q = 2$. Hölder’s inequality then specializes to the Cauchy–Schwarz inequality.
Corollary 7.9 (Cauchy–Schwarz inequality). If $f, g : X \to \mathbb{R}$ are measurable, then
\[ |\langle f, g\rangle| := \left|\int_X f g\right| \le \|f\|_2 \|g\|_2 . \]

Here we have introduced an inner product on $L^2(X)$ which is an infinite-dimensional version of the scalar product of vectors in $\mathbb{R}^n$. Note that the norm is determined by the inner product, since $\|v\|_2 = \sqrt{\langle v, v\rangle}$. In general, a norm is of this form if and only if it satisfies the parallelogram law
\[ 2\|x\|^2 + 2\|y\|^2 = \|x + y\|^2 + \|x - y\|^2 . \]

A complete inner product space, like $L^2(X)$, is called a Hilbert space. The Riesz representation theorem tells us that $L^2(\mathbb{R})$ is isomorphic to its own dual, more specifically that any bounded linear functional on it is of the form $v \mapsto \langle w, v\rangle$ for some $w$. This turns out to be true for any Hilbert space.
Hilbert spaces are easier to work with than general Banach spaces, since one
has a notion of orthogonal vectors, orthonormal basis, and so on, and the inner
product is bilinear, whereas a norm is only sublinear.

Chapter 8

Inner product spaces

Mathematical science is in my opinion an indivisible whole, an organism whose vitality is conditioned upon the connection of its parts.
– David Hilbert

Hilbert spaces are infinite-dimensional analogues of the finite-dimensional Euclidean spaces. The inner product allows us to make sense of important concepts such as orthogonality, angles, and Cartesian coordinates. Of particular interest are generalizations of the spectral theorem to the infinite-dimensional setting. Hilbert space theory is an essential tool in the study of ordinary and partial differential equations, Fourier analysis, ergodic theory, and quantum mechanics. Here is some of the early history:

• 1888: G. Peano gives an axiomatic account of abstract vector spaces
• 1900s: D. Hilbert and E. Schmidt study integral equations, exploiting the analogy between the inner product of functions and the dot product
• 1907: F. Riesz and E. S. Fischer independently prove completeness of $L^2$, the space of square integrable functions
• 1920s: W. Heisenberg, M. Born, P. Jordan, and independently E. Schrödinger develop the formalism of quantum mechanics
• 1929: J. von Neumann presents an axiomatic treatment of “abstract Hilbert spaces” and unbounded self-adjoint operators

Inner products
The dot product of vectors $x, y \in \mathbb{R}^n$ is given by
\[ \langle x, y\rangle = x_1 y_1 + \ldots + x_n y_n \]
and has the geometric interpretation
\[ \langle x, y\rangle = \|x\|\|y\| \cos(\phi) \]
where $\phi$ is the angle between $x$ and $y$. When dealing with vectors in $\mathbb{C}^n$ one defines
\[ \langle x, y\rangle = \bar{x}_1 y_1 + \ldots + \bar{x}_n y_n \]
(there are two conventions here, depending on whether the first or the second vector gets conjugated).
More generally, an inner product on a complex vector space $V$ assigns a complex number $\langle x, y\rangle$ to any pair of vectors $x, y$ and has the following properties.
1. $\langle x, y\rangle = \overline{\langle y, x\rangle}$
2. $\langle x, y + z\rangle = \langle x, y\rangle + \langle x, z\rangle$, $\langle x, \lambda y\rangle = \lambda\langle x, y\rangle$
3. $\langle x, x\rangle \ge 0$ with equality if and only if $x = 0$
An inner product is almost bilinear, except that $\langle \lambda x, y\rangle = \bar{\lambda}\langle x, y\rangle$. When $V$ is a vector space over $\mathbb{R}$ instead, the inner product should also take values in $\mathbb{R}$. In this case the inner product is symmetric and bilinear. An inner product space is a vector space (over $\mathbb{R}$ or $\mathbb{C}$) together with an inner product. As mentioned in the previous chapter, our main example of an inner product space is $L^2(X)$ with
\[ \langle f, g\rangle := \int_X \bar{f} g \]
where $X$ is a measurable subset of $\mathbb{R}$ or indeed any measure space.
Given an inner product we define the length of a vector $x \in V$ as
\[ \|x\| := \sqrt{\langle x, x\rangle}. \]

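Theorem 8.1. Let $V$ be an inner product space, then
• $|\langle x, y\rangle| \le \|x\|\|y\|$ (Cauchy–Schwarz inequality)
• $\|x + y\| \le \|x\| + \|y\|$ (Triangle inequality)
for any $x, y \in V$.
Proof. The Cauchy–Schwarz inequality is equivalent to
\[ \det\begin{pmatrix} \langle x, x\rangle & \langle x, y\rangle \\ \langle y, x\rangle & \langle y, y\rangle \end{pmatrix} \ge 0. \]
If $x$ and $y$ are linearly independent, this 2-by-2 matrix represents the quadratic form $v \mapsto \langle v, v\rangle$ restricted to the 2-dimensional subspace spanned by $x$ and $y$, and so has positive determinant by Sylvester’s criterion. If they are linearly dependent, the determinant vanishes.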

A more elementary proof goes as follows. Starting from the obvious inequality $\|x - y\|^2 \ge 0$ we get
\[ \mathrm{Re}\langle x, y\rangle \le \frac{1}{2}\|x\|^2 + \frac{1}{2}\|y\|^2 \]
by expanding out. The idea is to exploit symmetries of the terms to amplify this to the CS inequality. First, the right-hand side is invariant under multiplying $x$ or $y$ by some $e^{i\phi}$, so
\[ \mathrm{Re}(e^{i\phi}\langle x, y\rangle) \le \frac{1}{2}\|x\|^2 + \frac{1}{2}\|y\|^2 . \]
Optimizing this inequality means choosing $\phi$ so that the left-hand side is maximal, i.e. so that $e^{i\phi}\langle x, y\rangle$ is real and non-negative and thus
\[ |\langle x, y\rangle| \le \frac{1}{2}\|x\|^2 + \frac{1}{2}\|y\|^2 . \]
This is closer to what we want to prove, but we are not quite there yet. The next step is to use the symmetry $(x, y) \mapsto (\lambda x, \frac{1}{\lambda} y)$ of the left-hand side which implies
\[ |\langle x, y\rangle| \le \frac{\lambda^2}{2}\|x\|^2 + \frac{1}{2\lambda^2}\|y\|^2 \]
for any $\lambda > 0$. We can assume $x, y \neq 0$ at this point, since otherwise the inequality is trivial. Then the right-hand side attains its minimum at
\[ \lambda = \sqrt{\|y\|/\|x\|} \]
which yields the inequality. The general principle of using symmetries to optimize inequalities is a very useful one and can often be used to prove (or disprove!) them.
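As a sketch of the last step (a routine computation spelled out here, not part of the original argument), substituting the minimizing value of $\lambda$ into the right-hand side gives exactly the product of the norms:

```latex
\[
\left.\frac{\lambda^2}{2}\|x\|^2 + \frac{1}{2\lambda^2}\|y\|^2\right|_{\lambda^2 = \|y\|/\|x\|}
= \frac{1}{2}\|x\|\|y\| + \frac{1}{2}\|x\|\|y\|
= \|x\|\|y\| .
\]
```

That this value is the minimum also follows from the arithmetic-geometric mean inequality applied to the two summands, whose product $\frac{1}{4}\|x\|^2\|y\|^2$ does not depend on $\lambda$.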
The triangle inequality follows easily from the Cauchy–Schwarz inequality:
\[ \|x + y\|^2 = \|x\|^2 + 2\mathrm{Re}\langle x, y\rangle + \|y\|^2 \le \|x\|^2 + 2\|x\|\|y\| + \|y\|^2 \]
which is $\|x + y\| \le \|x\| + \|y\|$ after taking square roots.


The theorem shows that $x \mapsto \|x\|$ is a norm, so any inner product space is also a normed space. In fact one can recover the inner product from the norm using the polarization identities
\[ \langle x, y\rangle = \frac{1}{4}\left(\|x + y\|^2 - \|x - y\|^2\right) \]
in the real case and
\[ \langle x, y\rangle = \frac{1}{4}\left(\|x + y\|^2 - \|x - y\|^2 + i\|x - iy\|^2 - i\|x + iy\|^2\right) \]
in the complex case. Conversely, any norm for which the parallelogram law
\[ 2\|x\|^2 + 2\|y\|^2 = \|x + y\|^2 + \|x - y\|^2 \]
holds defines an inner product via the polarization identities. The proof of all these identities is by expanding out both sides using linearity.
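The complex polarization identity can be checked numerically. The following is a minimal sketch in $\mathbb{C}^3$, assuming the convention used in this chapter (conjugate-linear in the first argument); the helper names `inner`, `norm_sq` and `polarize` are made up for this illustration.

```python
def inner(x, y):
    """<x, y> = sum of conj(x_k) * y_k (conjugate-linear in the first slot)."""
    return sum(complex(a).conjugate() * b for a, b in zip(x, y))

def norm_sq(x):
    """Squared length ||x||^2 = <x, x>, which is always real."""
    return inner(x, x).real

def polarize(x, y):
    """Recover <x, y> from norms only, via the complex polarization identity."""
    def combo(c):
        # the vector x + c * y
        return [a + c * b for a, b in zip(x, y)]
    return (norm_sq(combo(1)) - norm_sq(combo(-1))
            + 1j * norm_sq(combo(-1j)) - 1j * norm_sq(combo(1j))) / 4

x = [1 + 2j, -1j, 0.5]
y = [2 - 1j, 3, 1j]
print(abs(polarize(x, y) - inner(x, y)))  # difference is at rounding level
```

With the other convention (conjugating the second vector), the signs of the two imaginary terms would swap.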
Two vectors $x, y \in V$ are orthogonal if $\langle x, y\rangle = 0$. If $x$ and $y$ are orthogonal, then $\|x + y\|^2 = \|x\|^2 + \|y\|^2$, which is a form of the Pythagorean theorem. This generalizes, by induction, to finite sets of mutually orthogonal vectors. A collection of vectors $e_i$, $i \in I$ is orthonormal if
\[ \langle e_i, e_j\rangle = \delta_{ij} := \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases} \]
i.e. all $e_i$ are mutually orthogonal and have length 1. Given linearly independent vectors $x_1, \ldots, x_n$ we can always find orthonormal vectors $e_1, \ldots, e_n$ with the same span using the Gram–Schmidt algorithm. This proceeds by induction on $n$. In the case $n = 1$ we just have $e_1 = x_1/\|x_1\|$. Assuming we have already found $e_1, \ldots, e_{n-1}$ we let
\[ y_n = x_n - \langle e_1, x_n\rangle e_1 - \ldots - \langle e_{n-1}, x_n\rangle e_{n-1} \]
which is orthogonal to $e_1, \ldots, e_{n-1}$, and normalize
\[ e_n = \frac{y_n}{\|y_n\|}. \]
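The inductive step above translates directly into code. Below is a minimal sketch for vectors in $\mathbb{C}^n$ stored as Python lists; the function names and the tolerance `1e-12` used to detect linear dependence are assumptions of this illustration.

```python
import math

def inner(x, y):
    """<x, y> = sum of conj(x_k) * y_k (conjugate-linear in the first slot)."""
    return sum(complex(a).conjugate() * b for a, b in zip(x, y))

def norm(x):
    return math.sqrt(inner(x, x).real)

def gram_schmidt(vectors):
    """Orthonormalize linearly independent vectors, keeping the successive spans."""
    basis = []
    for x in vectors:
        y = list(x)
        for e in basis:
            c = inner(e, x)                       # coefficient <e_k, x_n>
            y = [yk - c * ek for yk, ek in zip(y, e)]
        length = norm(y)
        if length < 1e-12:                        # y_n = 0 means x_n was dependent
            raise ValueError("vectors are not linearly independent")
        basis.append([yk / length for yk in y])   # e_n = y_n / ||y_n||
    return basis

es = gram_schmidt([[1, 1, 0], [1, 0, 1], [0, 1, 1j]])
```

In floating-point arithmetic this classical form can lose orthogonality for nearly dependent inputs; subtracting the projections from the running vector `y` instead of from `x` (the "modified" variant) is a common remedy.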

Hilbert spaces
In order to be able to take infinite sums of vectors, we need the hypothesis
of completeness. Define a Hilbert space to be an inner product space which
is complete (as a normed vector space). In particular any Hilbert space is a
Banach space, but not the other way around. Any finite-dimensional inner
product space is a Hilbert space, as is L2 (X) for any measure space X.
An important consequence of completeness is that certain convex optimiza-
tion problems have a unique solution.
Theorem 8.2. Let $H$ be a Hilbert space, $K \subset H$ a non-empty closed convex subset, and $p \in H$ a point. Then there exists a unique $q \in K$ which minimizes the distance $\|p - q\|$ to $p$.

Proof. Replacing $K$ by $K - p$, we may assume that $p = 0$. Let
\[ \delta = \inf_{x \in K} \|x\| \]
then $0 \le \delta < \infty$, since $K$ is non-empty. We need to show that the infimum is attained. Let $x_n \in K$ be a sequence with $\|x_n\| \to \delta$ as $n \to \infty$. Applying the parallelogram law to $x_m/2$ and $x_n/2$ gives
\[ \frac{\|x_m - x_n\|^2}{4} = \frac{\|x_m\|^2}{2} + \frac{\|x_n\|^2}{2} - \left\|\frac{x_m + x_n}{2}\right\|^2 \]
but $(x_m + x_n)/2 \in K$ by convexity, and so $\|(x_m + x_n)/2\| \ge \delta$, thus
\[ \|x_m - x_n\|^2 \le 2\|x_m\|^2 + 2\|x_n\|^2 - 4\delta^2 . \]
Since the right-hand side converges to 0 as $n, m \to \infty$, this shows that $x_n$ is a Cauchy sequence. By completeness of $H$ there is a $q \in H$ which is the limit of the $x_n$, and in fact $q \in K$, since $K$ is closed. Continuity of the norm (an easy consequence of the triangle inequality) implies $\|q\| = \delta$.
The calculation above also shows uniqueness: If $q'$ is another minimizer, then
\[ \|q - q'\|^2 \le 2\|q\|^2 + 2\|q'\|^2 - 4\delta^2 = 0. \]

A new phenomenon for infinite-dimensional spaces is that a linear subspace $V \subset H$ is not necessarily closed in the topological sense. For example, if $V \subset L^2(\mathbb{R}) = H$ is the vector space of simple functions, then $\overline{V} = H$ by density but clearly $V \neq H$, so $V$ is not closed. However if $V \subset H$ is a closed subspace of a Hilbert space, then it is complete, thus naturally a Hilbert space too.
If $V \subset H$ is any linear subspace of a Hilbert space, then define its orthogonal complement
\[ V^\perp = \{w \in H \mid \langle v, w\rangle = 0 \text{ for all } v \in V\} \]
which is the set of vectors perpendicular to every vector in $V$.

Lemma 8.3. $V^\perp$ is a closed subspace of $H$.
Proof. We claim first that the set of vectors perpendicular to a given vector $x$, $x^\perp$, is closed. Indeed, if $y_n$ is a sequence with $y_n \perp x$ converging to $y$, then $0 = \langle x, y_n\rangle \to \langle x, y\rangle$, so $\langle x, y\rangle = 0$ and $y \perp x$. Here we used continuity of the inner product, which follows from the Cauchy–Schwarz inequality.
In general, we can write
\[ V^\perp = \bigcap_{v \in V} v^\perp \]
which is closed as an intersection of closed sets.


Theorem 8.4. Let $V$ be a closed subspace of a Hilbert space $H$, then any $x \in H$ can be written as a sum
\[ x = y + z, \quad y \in V, \; z \in V^\perp \]
in a unique way. Moreover $y$ is the closest point in $V$ to $x$ and $z$ is the closest point in $V^\perp$ to $x$.
Proof. Suppose $x = y + z = y' + z'$ with $y, y' \in V$ and $z, z' \in V^\perp$. Then $y - y' = z' - z$ is in $V \cap V^\perp = \{0\}$, so $y = y'$ and $z = z'$, which shows uniqueness.

Now let $y \in V$ be the closest point to $x$ in $V$ and $z = x - y$. We want to show that $z \in V^\perp$, i.e. $\langle z, w\rangle = 0$ for all $w \in V$. Since $y$ is minimizing the distance we have
\[ \|z\|^2 = \|x - y\|^2 \le \|x - \lambda w - y\|^2 = \|z - \lambda w\|^2 \]
for every $\lambda \in \mathbb{C}$, which can be written as
\[ 0 \le -\lambda\langle z, w\rangle - \bar{\lambda}\langle w, z\rangle + \lambda\bar{\lambda}\|w\|^2 . \]
We may assume that $\|w\| = 1$ and set $\lambda = \langle w, z\rangle$ which gives $0 \le -|\langle z, w\rangle|^2$, thus $\langle z, w\rangle = 0$. This shows that the decomposition $x = y + z$ exists and that $y$ is the closest point in $V$ to $x$. By the same reasoning, if one lets $z$ be the closest point to $x$ in $V^\perp$, then $y = x - z \in V$, which completes the proof.
The conclusion of the theorem is often written as $H = V \oplus V^\perp$, meaning that $H$ can be identified, as an inner product space, with the set of pairs $(y, z)$, $y \in V$, $z \in V^\perp$. A consequence is that $\overline{V} = (V^\perp)^\perp$ where $\overline{V}$ is the closure of $V$. The map which assigns to any $x \in H$ its closest point in $V$ is the orthogonal projection to $V$ and is linear.
Every vector $x \in H$ defines an element $\varphi_x$ in the dual space $H^*$ with $\varphi_x(y) = \langle x, y\rangle$. Indeed boundedness of $\varphi_x$ follows from the Cauchy–Schwarz inequality. We already know in the special case $H = L^2(\mathbb{R})$ that this gives a (conjugate linear) isomorphism between $H$ and $H^*$. In fact this is true in general for any Hilbert space.
Theorem 8.5. (Riesz representation theorem) Let $H$ be a Hilbert space, $\varphi : H \to \mathbb{C}$ a bounded linear functional. Then there exists a unique $x \in H$ such that $\varphi = \varphi_x$, i.e. $\varphi(y) = \langle x, y\rangle$ for all $y \in H$.
Proof. We first show uniqueness. Let $\varphi_x = \varphi_y$. Then $\varphi_{x-y} = 0$, so $0 = \varphi_{x-y}(x - y) = \langle x - y, x - y\rangle$, thus $x = y$.
Next, we prove existence. If $\varphi = 0$ identically, then we can just take $x = 0$, so the claim is obvious. So assume $\varphi \neq 0$, which implies that $\mathrm{Ker}(\varphi) = \{x \in H \mid \varphi(x) = 0\}$ is a proper closed subspace of $H$, thus has non-trivial orthogonal complement. Choose $y \in \mathrm{Ker}(\varphi)^\perp$ of length one, then $\varphi(y) \neq 0$. Note that for any $z \in H$, the vector $z - \frac{\varphi(z)}{\varphi(y)} y$ is in $\mathrm{Ker}(\varphi)$, thus perpendicular to $y$, so
\[ \langle y, z\rangle - \frac{\varphi(z)}{\varphi(y)} = 0 \]
or
\[ \varphi(z) = \langle \overline{\varphi(y)}\, y, z\rangle \]
which proves the claim with $x = \overline{\varphi(y)}\, y$.
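In finite dimensions the construction in the proof can be carried out explicitly. The sketch below assumes $H = \mathbb{C}^n$ and a functional of the form $\varphi(z) = \sum_k a_k z_k$; the function `riesz_vector` and the sample data are hypothetical names chosen for this illustration.

```python
import math

def inner(x, y):
    """<x, y> = sum of conj(x_k) * y_k (conjugate-linear in the first slot)."""
    return sum(complex(a).conjugate() * b for a, b in zip(x, y))

def riesz_vector(a):
    """Representing vector of phi(z) = sum a_k z_k on C^n, following the proof."""
    def phi(z):
        return sum(ak * zk for ak, zk in zip(a, z))
    length = math.sqrt(sum(abs(ak) ** 2 for ak in a))
    if length == 0:
        return [0j] * len(a)                  # phi = 0 is represented by x = 0
    # Unit vector orthogonal to Ker(phi): w in Ker(phi) means <conj(a), w> = 0.
    y = [complex(ak).conjugate() / length for ak in a]
    c = complex(phi(y)).conjugate()           # conj(phi(y))
    return [c * yk for yk in y]               # x = conj(phi(y)) * y

a = [1 + 1j, 2, -3j]
x = riesz_vector(a)
z = [0.5, 1j, 2 - 1j]
print(inner(x, z), sum(ak * zk for ak, zk in zip(a, z)))  # the two values agree
```

Here the result is simply the entrywise conjugate of `a`, which matches the fact that $H \to H^*$ is conjugate linear.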

Orthonormal bases
In this section we use the completeness and orthogonality assumptions to make sense of infinite linear combinations of vectors. Let $H$ be a Hilbert space and let $e_1, e_2, \ldots$ be a countable sequence of orthonormal vectors in $H$. This means that each $e_n$ has unit length and is orthogonal to all the other $e_k$, $n \neq k$. Suppose $c_n \in \mathbb{C}$ are scalars with $\sum_n |c_n| < \infty$, then the series
\[ \sum_{n=1}^\infty c_n e_n \]
is absolutely convergent and thus defines an element in $H$ by completeness. However it turns out that all that is needed is the weaker condition $\sum_n |c_n|^2 < \infty$ of square-summability.
Theorem 8.6. Let $e_n$, $n = 1, 2, \ldots$, be an orthonormal set of vectors in a Hilbert space $H$ and let $c_n \in \mathbb{C}$ be square-summable. Then the limit
\[ \lim_{n \to \infty} \sum_{k=1}^n c_k e_k =: \sum_{k=1}^\infty c_k e_k \]
exists in $H$ and is independent of the order in which the $c_k e_k$ are summed. In other words, if $\sigma$ is a permutation of the positive integers, then
\[ \sum_{k=1}^\infty c_k e_k = \sum_{k=1}^\infty c_{\sigma(k)} e_{\sigma(k)} . \]

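Proof. Let $\varepsilon > 0$, then by square-summability we can choose $N > 0$ such that
\[ \sum_{n=N}^\infty |c_n|^2 < \varepsilon \]
hence if $m \ge n \ge N$, then
\[ \left\| \sum_{k=1}^m c_k e_k - \sum_{k=1}^n c_k e_k \right\|^2 = \sum_{k=n+1}^m |c_k|^2 < \varepsilon \]
by Pythagoras. This proves that the partial sums form a Cauchy sequence and thus have a limit since $H$ is complete.
For the independence statement, choose $N_1 > 0$ so that both
\[ \sum_{n=N_1}^\infty |c_n|^2 < \varepsilon, \qquad \sum_{n=N_1}^\infty |c_{\sigma(n)}|^2 < \varepsilon . \]
We can find $N_3 \ge N_2 \ge N_1$ so that the set $\{\sigma(n) \mid n \le N_2\}$ includes all the numbers $\le N_1$ but none of those $> N_3$. Then
\[ \left\| \sum_{n=1}^\infty c_n e_n - \sum_{n=1}^\infty c_{\sigma(n)} e_{\sigma(n)} \right\|^2 \le \left\| \sum_{n=1}^{N_3} c_n e_n - \sum_{n=1}^{N_2} c_{\sigma(n)} e_{\sigma(n)} \right\|^2 + 2\varepsilon \le 3\varepsilon . \]
Since $\varepsilon > 0$ was arbitrary, this completes the proof.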

The theorem ensures that if $e_i$, $i \in I$ is any family of orthonormal vectors, not necessarily countable, and $c_i \in \mathbb{C}$ are square summable, then $\sum_{i \in I} c_i e_i$ is a well-defined element in $H$.
Theorem 8.7. Let $e_i$, $i \in I$ be an orthonormal family of vectors in a Hilbert space $H$. Then the map
\[ L^2(I) \to H, \qquad (c_i)_{i \in I} \mapsto \sum_{i \in I} c_i e_i \]
is an isometry from the Hilbert space of square summable sequences $c_i \in \mathbb{C}$, $i \in I$ into $H$. The image $V$ of this map is the smallest closed linear subspace containing all the $e_i$, $i \in I$, and the orthogonal projection to $V$ is given by
\[ x \mapsto \sum_{i \in I} \langle e_i, x\rangle e_i . \]
We call the smallest closed subspace containing all the $e_i$ the (Hilbert space) span of the $e_i$, which is usually bigger than the algebraic span, which is the set of finite linear combinations of the $e_i$.
Proof. We need to show that the map preserves the inner product. If $I$ is finite, then
\[ \left\langle \sum_{i \in I} x_i e_i, \sum_{i \in I} y_i e_i \right\rangle = \sum_{i \in I} \bar{x}_i y_i \]
by (sesqui)linearity and orthonormality of $e_i$. By continuity of the inner product, the above also holds for infinite $I$.
The image of this map is necessarily a Hilbert space, isomorphic to $L^2(I)$, thus a closed linear subspace of $H$. It is by definition contained in the closure of the algebraic span of the $e_i$, thus minimal. To verify the formula for the orthogonal projection, note that
\[ \left\langle x - \sum_{i \in I} \langle e_i, x\rangle e_i, e_j \right\rangle = \langle x, e_j\rangle - \overline{\langle e_j, x\rangle} = 0. \]
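The projection formula is easy to test on a small example. The following sketch assumes $H = \mathbb{C}^3$ and a hand-picked orthonormal pair spanning the first two coordinates; the function `project` and the sample vectors are illustrative choices, not taken from the notes.

```python
def inner(x, y):
    """<x, y> = sum of conj(x_k) * y_k (conjugate-linear in the first slot)."""
    return sum(complex(a).conjugate() * b for a, b in zip(x, y))

def project(es, x):
    """Orthogonal projection of x onto the span of the orthonormal family es."""
    p = [0j] * len(x)
    for e in es:
        c = inner(e, x)                              # coefficient <e_i, x>
        p = [pk + c * ek for pk, ek in zip(p, e)]
    return p

s = 2 ** -0.5
es = [[s, s, 0], [s, -s, 0]]      # orthonormal, spans the first two coordinates
x = [1 + 1j, 2, 3 - 1j]
p = project(es, x)                # equals (1 + 1j, 2, 0) up to rounding
residual = [xk - pk for xk, pk in zip(x, p)]
```

The residual `x - p` is orthogonal to every `e` in `es`, and the squared coefficients $|\langle e_i, x\rangle|^2$ add up to $\|p\|^2 \le \|x\|^2$.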

We are particularly interested in the case when the $e_i$ span all of $H$, thus form an (orthonormal) basis, in the Hilbert space sense.
Theorem 8.8. Let $H$ be a Hilbert space and $(e_i)_{i \in I}$ an orthonormal family of vectors in $H$. The following conditions are equivalent.
1. The span of the $e_i$ is $H$.
2. The algebraic span of the $e_i$ is dense in $H$.
3. The Parseval identity
\[ \|x\|^2 = \sum_{i \in I} |\langle e_i, x\rangle|^2 \]
holds for all $x \in H$.
4. The inversion formula
\[ x = \sum_{i \in I} \langle e_i, x\rangle e_i \]
holds for all $x \in H$.
5. The only vector orthogonal to all $e_i$ is the zero vector.
6. The map
\[ L^2(I) \to H, \qquad (c_i)_{i \in I} \mapsto \sum_{i \in I} c_i e_i \]
is an isomorphism of Hilbert spaces.
If any of these equivalent statements holds, $(e_i)_{i \in I}$ is called an orthonormal basis of $H$.
Proof. The equivalence of these statements is an easy consequence of the previous theorems.
The motivating example of an orthonormal basis is the family of functions
\[ e_n(x) := e^{2\pi i n x}, \qquad n \in \mathbb{Z} \]
in $L^2([0, 1])$. It is easy to see that the $e_n$ are orthonormal. The statement that they form a basis is equivalent to the density of trigonometric polynomials in $L^2([0, 1])$. To show this, it suffices to approximate any characteristic function of an interval by trigonometric polynomials with respect to the 2-norm, which reduces to an explicit computation. More details will be given in the next chapter.
An almost trivial example is $L^2(\mathbb{Z})$ with the family of functions
\[ e_n(k) = \delta_{nk} = \begin{cases} 1 & n = k \\ 0 & n \neq k \end{cases} \]
which form an orthonormal basis. This generalizes to $L^2(I)$ for any set $I$ of course.
Theorem 8.9. Any Hilbert space has an orthonormal basis.
Proof. This is a consequence of Zorn’s lemma, which states that if a partially ordered set $P$ has the property that every chain has an upper bound in $P$, then $P$ contains a maximal element. In our case, $P$ is the set of orthonormal families in $H$. An upper bound on a chain of orthonormal families is given by their union. Thus there exists some maximal orthonormal family $(e_i)_{i \in I}$. If the span of this family is not all of $H$, then we could add to it some unit vector in the orthogonal complement, contradicting maximality, thus the $e_i$ must form a basis.
Combining the previous two theorems, we conclude that every Hilbert space
H is isomorphic to L2 (I) for some set I with the counting measure. One can
show that the cardinality of I is independent of the choice of orthonormal basis.
For most Hilbert spaces of interest I is countable, which is equivalent to H
being separable, i.e. containing a countable dense subset. In particular, there is
essentially only one separable infinite-dimensional Hilbert space.
Note that we consider only orthonormal bases of Hilbert spaces. Dealing
with coordinates with respect to more general families of vectors becomes rather
difficult in the topological setting. Also we have so far not discussed the theory
of linear maps from one Hilbert space to another, which is a very non-trivial
extension of its finite-dimensional counterpart.

Chapter 9

Fourier analysis

The Fourier transform is one of the most important tools in mathematics. It is a duality between functions, $f(x)$, on physical space (such as the circle, real line, or $\mathbb{R}^n$) and functions $\hat{f}(\xi)$ on frequency space. Properties and operations of $f$ have corresponding dual notions for $\hat{f}$, which can be rather different looking on the surface. The following table lists some examples.

Physical domain              Frequency domain
smoothness (fine scale)      decay (coarse scale)
convolution                  pointwise product
derivative d/dx              multiplication by 2πiξ
rescaling by λ               rescaling by 1/λ
frequency modulation         translation
subspace                     quotient space

Although $\hat{f}$ typically looks very different from $f$, they do have the same 2-norm, and likewise inner products of functions are preserved under Fourier transform.
The Fourier transform has many generalizations, depending on what kind
of “physical space” one considers and which kind of symmetries it has. Indeed
a large chunk of modern mathematics, roughly what is called representation
theory, can be thought of as generalizing Fourier analysis in some sense.

Fourier transform of functions on the circle


The circle, $S^1$, can be described either as the quotient $\mathbb{R}/(2\pi\mathbb{Z})$, i.e. as a set of equivalence classes of real numbers which differ by some $2\pi n$, or as the subset $U(1) := \{z \in \mathbb{C} \mid |z| = 1\}$ of complex numbers with absolute value 1. The identification of the two is given by the exponential map
\[ x \mapsto e^{ix} = \cos(x) + i \sin(x). \]
Functions on the circle are the same as $2\pi$-periodic functions $f : \mathbb{R} \to \mathbb{C}$, $f(x + 2\pi) = f(x)$. In the context of Fourier theory it will be important to think of the circle not just as a set, but with the following additional structures, which are all compatible with the above identification.
• Group structure: Addition in R/(2πZ), multiplication in U (1). The circle
is a commutative group.
• Topological space structure: The circle is compact.
• Measure space structure: The circle has a translation invariant probabil-
ity measure. The Lebesgue measure on R/(2πZ) is normalized so that
µ([0, 2π)) = 1
The dual frequency space of the circle is the integers $\mathbb{Z}$. This means that for each $n \in \mathbb{Z}$ we have a function on the circle given by
\[ e_n : \mathbb{R}/(2\pi\mathbb{Z}) \to \mathbb{C}, \qquad x \mapsto e^{inx} =: e_n(x) \]
or in the description $S^1 = U(1)$ this is just $z \mapsto z^n$. Note that the $e_n$ form an orthonormal family of vectors in the Hilbert space of the circle $L^2(S^1)$:
\[ \langle e_m, e_n\rangle = \frac{1}{2\pi} \int_0^{2\pi} e^{-imx} e^{inx}\, dx = \begin{cases} 1 & m = n \\ 0 & m \neq n \end{cases} \]
In fact they form a basis.


Theorem 9.1. The functions $e_n(x) = e^{inx}$, $n \in \mathbb{Z}$, form a Hilbert space basis of $L^2(S^1)$. Consequently, the map
\[ L^2(\mathbb{Z}) \to L^2(S^1), \qquad (c_n)_{n \in \mathbb{Z}} \mapsto \sum_{n \in \mathbb{Z}} c_n e^{inx} \]
is an isomorphism of Hilbert spaces with inverse
\[ L^2(S^1) \to L^2(\mathbb{Z}), \qquad f \mapsto c_n := \langle e_n, f\rangle = \frac{1}{2\pi} \int_0^{2\pi} f(x) e^{-inx}\, dx. \]
It suffices to show that the algebraic span of the $e_n$, the space of trigonometric polynomials, is dense in $L^2(S^1)$. Since continuous functions are dense in $L^2(S^1)$, this is a corollary of the following.
Theorem 9.2. Trigonometric polynomials are dense in $C(S^1)$, i.e. for every continuous $2\pi$-periodic function $f : \mathbb{R} \to \mathbb{C}$ and $\varepsilon > 0$ there is a trigonometric polynomial $P$ such that $\|f - P\|_\infty < \varepsilon$.

A trigonometric polynomial is a function of the form
\[ f(x) = \sum_{n=-N}^{N} c_n e^{inx} \]
where $N > 0$ is an integer and $c_n \in \mathbb{C}$. Since $e^{inx} = \cos(nx) + i \sin(nx)$ we can also write it in the form
\[ a_0 + \sum_{n=1}^{N} a_n \cos(nx) + \sum_{n=1}^{N} b_n \sin(nx). \]
Note that because $e^{imx} e^{inx} = e^{i(m+n)x}$, the product of two trigonometric polynomials is again one.
Proof. The basic trigonometric polynomials $e_n$ are concentrated at a single point $n \in \mathbb{Z}$ in frequency space, but spread out in physical space. The first step is to construct a sequence of trigonometric polynomials $Q_1, Q_2, Q_3, \ldots$ which are concentrated more and more near a given point in physical space, which we can take to be $0 \in \mathbb{R}/(2\pi\mathbb{Z})$. More precisely we want
1. $Q_n(x) \ge 0$ for all $x \in \mathbb{R}$,
2. $\frac{1}{2\pi} \int_0^{2\pi} Q_n(x)\, dx = 1$,
3. On each interval $[\varepsilon, 2\pi - \varepsilon]$, $\varepsilon > 0$, $Q_n$ converge to 0 uniformly as $n \to \infty$.
We can take
\[ Q_n(x) = C_n \left(\frac{1 + \cos(x)}{2}\right)^n \]
where the constants $C_n > 0$ are chosen so that $\frac{1}{2\pi} \int_0^{2\pi} Q_n(x)\, dx = 1$. Thus, using that $Q_n$ is an even function,
\[ 1 = \frac{C_n}{\pi} \int_0^{\pi} \left(\frac{1 + \cos(x)}{2}\right)^n dx > \frac{C_n}{\pi} \int_0^{\pi} \left(\frac{1 + \cos(x)}{2}\right)^n \sin(x)\, dx = \frac{2 C_n}{\pi(n + 1)} \]
and since $Q_n$ is decreasing on $[0, \pi]$,
\[ Q_n(x) \le Q_n(\varepsilon) \le \frac{\pi(n + 1)}{2} \left(\frac{1 + \cos(\varepsilon)}{2}\right)^n \to 0 \quad \text{as } n \to \infty \]
for $x \in [\varepsilon, \pi]$. This shows that the $Q_n$ satisfy the desired properties.
Let $f : \mathbb{R}/(2\pi\mathbb{Z}) \to \mathbb{C}$ be a continuous $2\pi$-periodic function. For each $n \ge 1$ we let
\[
\begin{aligned}
P_n(x) &= \frac{1}{2\pi} \int_0^{2\pi} f(x - t) Q_n(t)\, dt \\
&= \frac{1}{2\pi} \int_0^{2\pi} f(t) Q_n(x - t)\, dt.
\end{aligned}
\]

Figure 8: The trigonometric polynomials $Q_n$ for some values of $n$.

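We may write $Q_n(x) = \sum_{k=-N}^{N} c_k e^{ikx}$, since $Q_n$ is a trigonometric polynomial. Then
\[ P_n(x) = \sum_{k=-N}^{N} \left( \frac{c_k}{2\pi} \int_0^{2\pi} f(t) e^{-ikt}\, dt \right) e^{ikx} \]
so $P_n$ is also a trigonometric polynomial.
We claim that $P_n$ converge to $f$ in $\infty$-norm as $n \to \infty$. By the choice of $C_n$,
\[ P_n(x) - f(x) = \frac{1}{2\pi} \int_0^{2\pi} (f(x - t) - f(x)) Q_n(t)\, dt \]
thus
\[ |P_n(x) - f(x)| \le \frac{1}{2\pi} \int_0^{2\pi} |f(x - t) - f(x)| Q_n(t)\, dt. \]
For given $\varepsilon > 0$ we can find $\delta > 0$ such that $|f(x) - f(y)| < \varepsilon$ if $|x - y| < \delta$ by uniform continuity of $f$, so
\[
\begin{aligned}
|P_n(x) - f(x)| &\le \frac{1}{2\pi} \int_{-\delta}^{\delta} |f(x - t) - f(x)| Q_n(t)\, dt \\
&\quad + \frac{1}{2\pi} \int_{\delta}^{2\pi - \delta} |f(x - t) - f(x)| Q_n(t)\, dt \\
&\le \varepsilon + 2\|f\|_\infty \|Q_n|_{[\delta, 2\pi - \delta]}\|_\infty
\end{aligned}
\]
but $\|Q_n|_{[\delta, 2\pi - \delta]}\|_\infty \to 0$ as $n \to \infty$ by construction.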
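The convergence $P_n \to f$ can be observed numerically. The following is a rough sketch that replaces the integrals by Riemann sums on a uniform grid; the test function $|\sin x|$, the grid sizes and the function names are assumptions of this illustration.

```python
import math

def kernel(n, m=1000):
    """Samples of Q_n(t) = C_n ((1 + cos t)/2)^n on a uniform grid of [0, 2*pi)."""
    ts = [2 * math.pi * j / m for j in range(m)]
    raw = [((1 + math.cos(t)) / 2) ** n for t in ts]
    mean = sum(raw) / m                 # approximates (1/2pi) * integral of raw
    return ts, [r / mean for r in raw]  # dividing by the mean plays the role of C_n

def sup_error(f, n, m=1000, samples=64):
    """Approximate sup |P_n - f| over a sample grid, with P_n = f * Q_n."""
    ts, q = kernel(n, m)
    worst = 0.0
    for i in range(samples):
        x = 2 * math.pi * i / samples
        p = sum(f(x - t) * qt for t, qt in zip(ts, q)) / m
        worst = max(worst, abs(p - f(x)))
    return worst

f = lambda x: abs(math.sin(x))          # continuous, 2*pi-periodic, with kinks
errors = [sup_error(f, n) for n in (4, 16, 64)]
print(errors)                           # decreasing, roughly like 1/sqrt(n)
```

The slow decay reflects the kinks of $|\sin x|$ at multiples of $\pi$; for smoother functions the error shrinks faster.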


As a consequence of the main theorem, any $f \in L^2(S^1)$ can be represented by its Fourier series
\[ \sum_{n \in \mathbb{Z}} \hat{f}(n) e^{inx} \]
where the Fourier coefficients $\hat{f}(n) \in \mathbb{C}$ are uniquely determined by
\[ \hat{f}(n) = \frac{1}{2\pi} \int_0^{2\pi} f(x) e^{-inx}\, dx \]
and satisfy the Parseval identity
\[ \sum_{n \in \mathbb{Z}} |\hat{f}(n)|^2 = \frac{1}{2\pi} \int_0^{2\pi} |f(x)|^2\, dx. \]
The partial sums
\[ \sum_{n=-N}^{N} \hat{f}(n) e^{inx} \]
converge to $f$ in 2-norm as $N \to \infty$.
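As an illustration of the Parseval identity, consider the sawtooth $f(x) = x$ on $[0, 2\pi)$, for which integration by parts gives $\hat{f}(0) = \pi$ and $\hat{f}(n) = i/n$ for $n \neq 0$. The sketch below approximates the coefficients by the midpoint rule; the example function, the cutoff `N = 40` and the grid size are choices made for this illustration.

```python
import cmath
import math

def fourier_coefficient(f, n, m=1024):
    """Midpoint-rule approximation of (1/2pi) * integral_0^{2pi} f(x) e^{-inx} dx."""
    total = 0j
    for j in range(m):
        x = 2 * math.pi * (j + 0.5) / m
        total += f(x) * cmath.exp(-1j * n * x)
    return total / m

f = lambda x: x                       # sawtooth on [0, 2*pi)
N = 40
coeffs = {n: fourier_coefficient(f, n) for n in range(-N, N + 1)}
partial = sum(abs(c) ** 2 for c in coeffs.values())
energy = 4 * math.pi ** 2 / 3         # (1/2pi) * integral_0^{2pi} x^2 dx
print(partial, energy)                # partial sums increase towards the energy
```

The gap `energy - partial` is the tail $2\sum_{n > N} 1/n^2$, which is roughly $2/N$, so the convergence in Parseval's identity is slow for this discontinuous example.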

Convolution
The group structure on $S^1$ and $\mathbb{Z}$ allows us to define the convolution of functions on these spaces. Convolution turns out to be Fourier dual to the pointwise product of functions. While the latter naturally defines a product on $L^\infty$, the former is defined for a pair of functions in $L^1$.
We start with the discrete group $\mathbb{Z}$. Given sequences of complex numbers $a_n, b_n$, $n \in \mathbb{Z}$, we define
\[ (a * b)(n) = \sum_{i+j=n} a_i b_j = \sum_{k \in \mathbb{Z}} a_k b_{n-k} \]
whenever the infinite sum makes sense. For example, if $a \in L^1(\mathbb{Z})$, i.e. $\sum |a_n| < \infty$, and $b_n$ are bounded, then the terms of $\sum_{k \in \mathbb{Z}} a_k b_{n-k}$ are absolutely summable with
\[ \|a * b\|_\infty \le \|a\|_1 \|b\|_\infty . \]
If moreover $b \in L^1(\mathbb{Z})$, then $a * b \in L^1(\mathbb{Z})$ with
\[ \|a * b\|_1 \le \sum_{i,j \in \mathbb{Z}} |a_i b_j| = \|a\|_1 \|b\|_1 . \]
In particular, the convolution product gives $L^1(\mathbb{Z})$ the structure of an algebra: a vector space with a bilinear product operation. This is a Banach algebra: the norm of the product is bounded above by the product of the norms. Note that in contrast the bigger space $L^2(\mathbb{Z})$ is not closed under convolution.
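For finitely supported sequences the convolution and both norm inequalities can be checked directly. In this sketch a sequence on $\mathbb{Z}$ is stored as a dictionary from index to value (all other entries are zero); this representation and the sample sequences are assumptions of the illustration.

```python
def convolve(a, b):
    """(a * b)(n) = sum over i + j = n of a_i b_j, for finitely supported a, b."""
    out = {}
    for i, ai in a.items():
        for j, bj in b.items():
            out[i + j] = out.get(i + j, 0) + ai * bj
    return out

def norm1(a):
    return sum(abs(v) for v in a.values())

def norm_sup(a):
    return max(abs(v) for v in a.values())

a = {-1: 1, 0: -2, 1: 1}
b = {0: 3, 2: 1j, 5: -1}
c = convolve(a, b)
print(norm1(c), norm1(a) * norm1(b))        # first value does not exceed the second
print(norm_sup(c), norm1(a) * norm_sup(b))  # likewise
```

Since only the sum `i + j` enters, `convolve(a, b)` and `convolve(b, a)` agree, reflecting that the group $\mathbb{Z}$ is commutative.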
Convolution of functions on $S^1$ is completely analogous. We define formally
\[ (f * g)(x) = \frac{1}{2\pi} \int_0^{2\pi} f(t) g(x - t)\, dt. \]
This turns out to be well-defined if $f, g \in L^1(S^1)$ and we have
\[ \|f * g\|_1 \le \|f\|_1 \|g\|_1 . \]
The standard (easy) proof of this uses Fubini’s theorem on double integrals, which we have not covered so far. A more elementary proof uses the fact that $f * g$ is approximated by a linear combination of translates of $g$ and the triangle inequality for the 1-norm.
Suppose now that $f, g \in L^2(S^1) \subset L^1(S^1)$, then $f * g$ is in fact continuous. To see this let
\[ g_x(t) = g(x - t) \]
then $g_x \in L^2(S^1)$ depends continuously on $x \in \mathbb{R}$ and therefore
\[ (f * g)(x) = \langle \bar{f}, g_x\rangle \]
also depends continuously on $x$. So $L^2(S^1)$ turns out to be closed under convolution, which is dual to the fact that $L^2(\mathbb{Z})$ is closed under pointwise multiplication.
We return to the Fourier transform $f \mapsto \hat{f}$ and the claim that it interchanges pointwise products and convolution.
Theorem 9.3. Let $f, g \in L^1(S^1)$, then
\[ \widehat{f * g} = \hat{f} \hat{g}. \]
Similarly, if $f, g \in L^1(\mathbb{Z})$ then
\[ \widehat{f * g} = \hat{f} \hat{g}. \]
Note that $\hat{f}, \hat{g} \in L^\infty(\mathbb{Z})$ and we have extended the Fourier transform to $L^1(S^1) \supset L^2(S^1)$ using the same formula. It follows directly from the definition that $\|\hat{f}\|_\infty \le \|f\|_1$, so the Fourier transform is continuous as a map $L^1(S^1) \to L^\infty(\mathbb{Z})$. It is however not onto, since $\hat{f}(n) \to 0$ as $n \to \pm\infty$, so we do not get arbitrary bounded sequences.
Proof. We first consider the special case where $f = e_m$, $g = e_n$ are basis elements. Then
\[ (e_m * e_n)(x) = \frac{1}{2\pi} \int_0^{2\pi} e^{imt} e^{in(x - t)}\, dt = e^{inx} \delta_{mn} = e_n(x) \delta_{mn} \]
in other words
\[ \widehat{e_m * e_n} = \hat{e}_m \hat{e}_n . \]
Since convolution is bilinear, the statement is also true whenever $f, g$ are trigonometric polynomials:
\[ \left( \sum_{n=-N}^{N} a_n e_n \right) * \left( \sum_{n=-N}^{N} b_n e_n \right) = \sum_{n=-N}^{N} a_n b_n e_n . \]
Density of trigonometric polynomials in $L^1(S^1)$ and continuity of convolution, $\|f * g\|_1 \le \|f\|_1 \|g\|_1$, imply the statement for general $f, g \in L^1(S^1)$.
For the second statement the idea is the same as before, this time using the identity
\[ e_{m+n} = e_m e_n \]
in $L^1(S^1)$.

Convergence of Fourier series


We are now ready to investigate the following question: In what sense do the partial sums
\[ S_N(f) = \sum_{n=-N}^{N} \hat{f}(n) e^{inx} \]
converge to $f \in L^2(S^1)$ as $N \to \infty$? We already know that $S_N(f) \to f$ in 2-norm, but what if $f$ is not just square-integrable but continuous or even differentiable? When do we get pointwise convergence?
We start by rewriting the partial sums as
\[ S_N(f) = D_N * f \]
where
\[ D_N := \sum_{n=-N}^{N} e_n \]
is the Dirichlet kernel. The formula for $S_N(f)$ follows from the obvious identity $\widehat{S_N(f)} = \widehat{D_N} \hat{f}$ in $L^2(\mathbb{Z})$.
Note that $D_N$ is given by a geometric series, and so we can easily find a closed form as follows (recall that $z = e^{ix}$).
\[
\begin{aligned}
D_N(z) &= \sum_{n=-N}^{N} z^n \\
&= z^{-N} (1 + z + \ldots + z^{2N}) \\
&= \frac{1 - z^{2N+1}}{z^N (1 - z)} \\
&= \frac{z^{N+1/2} - z^{-N-1/2}}{z^{1/2} - z^{-1/2}} \\
&= \frac{\sin((N + 1/2)x)}{\sin(x/2)}
\end{aligned}
\]
where we used the formula $e^{ix} - e^{-ix} = 2i \sin(x)$.
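The closed form can be compared with the defining sum at a few sample points. This is a small sanity check with hypothetical function names; the closed form is only evaluated away from multiples of $2\pi$, where the denominator vanishes.

```python
import cmath
import math

def dirichlet_sum(N, x):
    """D_N(x) as the sum of e^{inx} over n = -N, ..., N (the sum is real)."""
    return sum(cmath.exp(1j * n * x) for n in range(-N, N + 1)).real

def dirichlet_closed(N, x):
    """Closed form sin((N + 1/2) x) / sin(x / 2), valid when sin(x / 2) != 0."""
    return math.sin((N + 0.5) * x) / math.sin(x / 2)

for x in (0.3, 1.0, 2.5, -1.7):
    print(x, dirichlet_sum(10, x), dirichlet_closed(10, x))  # columns agree
```

At $x = 0$ the sum equals $2N + 1$, which is also the limit of the closed form as $x \to 0$.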


Theorem 9.4. Suppose $f \in L^1(S^1)$ and the derivative $f'$ exists at $a \in S^1$, then $S_N(f)(a) \to f(a)$ as $N \to \infty$.

Figure 9: The Dirichlet kernel $D_{10}$.

Proof. Translating $f$, we may assume $a = 0$. Since $D_N * (f - f(0)) = D_N * f - f(0)$, it suffices to consider the case when $f(0) = 0$. We must then show that
\[ S_N(f)(0) = \frac{1}{2\pi} \int_0^{2\pi} f(x) D_N(x)\, dx \to 0 \]
as $N \to \infty$. We write this integral as
\[ S_N(f)(0) = \frac{1}{4\pi} \int_0^{4\pi} g(x) \sin((N + 1/2)x)\, dx \]
with
\[ g(x) = \frac{f(x)}{\sin(x/2)} . \]
Here we are integrating over two periods and dividing the result by two, since $\sin(x/2)$ and $\sin((N + 1/2)x)$ are only $4\pi$-periodic by themselves. Because we can write $S_N(f)(0)$ essentially as a Fourier coefficient of $g$, and Fourier coefficients of integrable functions tend to 0, the claim will follow as long as $g \in L^1(\mathbb{R}/(4\pi\mathbb{Z}))$. But this follows directly from our assumptions, since $f$ is differentiable wherever $\sin(x/2) = 0$ and thus $g$ is bounded near these points.

As a corollary we get a criterion for pointwise convergence.
Corollary 9.5 (Dirichlet). Suppose $f \in C^1(S^1)$, i.e. $f$ is continuously differentiable, then $S_N(f) \to f$ pointwise.
In contrast, if $f$ is only continuous, then $S_N(f)$ may fail to converge to $f$ at certain points, by a theorem of du Bois-Reymond. However, the set of points where $S_N(f)$ does not converge can be at most a null set, which is a very non-trivial result.

Theorem 9.6 (Carleson). If $f \in L^2(S^1)$, then $S_N(f) \to f$ pointwise almost everywhere.
It was discovered by Fejér that the convergence properties of the Fourier series can be radically improved by employing Cesàro summation, which means replacing each term of a sequence by the running average. Thus, instead of the $S_N(f)$ we consider the sequence
\[ T_N(f) = \frac{1}{N} (S_0(f) + \ldots + S_{N-1}(f)) = F_N * f \]
where
\[ F_N = \frac{1}{N} (D_0 + \ldots + D_{N-1}) \]
is the Fejér kernel. To better understand $F_N$, we find a closed formula. Compute
\[
\begin{aligned}
N F_N &= z^{-N+1} + 2z^{-N+2} + \ldots + N + \ldots + 2z^{N-2} + z^{N-1} \\
&= \left( z^{-(N-1)/2} + z^{-(N-3)/2} + \ldots + z^{(N-1)/2} \right)^2 \\
&= \left( \frac{z^{N/2} - z^{-N/2}}{z^{1/2} - z^{-1/2}} \right)^2
\end{aligned}
\]
hence
\[ F_N(x) = \frac{\sin^2(Nx/2)}{N \sin^2(x/2)} . \]
In particular, $F_N(x) \ge 0$ for all $x$ and $F_N$ converges uniformly to 0 as $N \to \infty$ on any interval $[\delta, 2\pi - \delta]$, $\delta > 0$. Also note that
\[ \frac{1}{2\pi} \int_0^{2\pi} F_N(x)\, dx = \langle F_N, e_0\rangle = 1 \]
thus we can replace the $Q_n$ in the proof of the theorem on density of trigonometric polynomials by $F_N$ and get the same conclusion.
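The closed formula for the Fejér kernel, and the contrast with the Dirichlet kernel, which changes sign, can be checked numerically. This is a minimal sketch with made-up function names; the values at multiples of $2\pi$ are filled in by their limits $2N + 1$ and $N$.

```python
import math

def dirichlet(N, x):
    """D_N(x) = sin((N + 1/2) x) / sin(x / 2), with limit 2N + 1 where sin(x/2) = 0."""
    s = math.sin(x / 2)
    if abs(s) < 1e-12:
        return 2 * N + 1.0
    return math.sin((N + 0.5) * x) / s

def fejer_average(N, x):
    """F_N(x) as the running average (D_0 + ... + D_{N-1}) / N."""
    return sum(dirichlet(k, x) for k in range(N)) / N

def fejer_closed(N, x):
    """Closed form sin^2(N x / 2) / (N sin^2(x / 2)), with limit N where sin(x/2) = 0."""
    s = math.sin(x / 2)
    if abs(s) < 1e-12:
        return float(N)
    return math.sin(N * x / 2) ** 2 / (N * s ** 2)

for x in (0.0, 0.4, 1.3, 3.0, 5.9):
    print(x, fejer_average(10, x), fejer_closed(10, x))  # columns agree
print(dirichlet(10, 0.5))  # negative, whereas the Fejer kernel never is
```

The non-negativity of $F_N$ is what makes the approximation argument go through; it is lost for $D_N$, which is one way to see why the partial sums $S_N(f)$ behave worse than the averages $T_N(f)$.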

Theorem 9.7 (Fejér). Suppose $f \in C(S^1)$, then $T_N(f) \to f$ uniformly as $N \to \infty$.
In a similar way one proves Abel’s theorem.
Theorem 9.8 (Abel). If $f \in C(S^1)$, $0 < r < 1$, define
\[ f_r(z) = \sum_{n=-\infty}^{\infty} \hat{f}(n) r^{|n|} z^n \]
then $f_r(z) \to f(z)$ uniformly as $r \to 1$ from below.
Figure 10: The Fejér kernel $F_{10}$.

Proof. We have $f_r = A_r * f$ where
\[ A_r(z) = \sum_{n=-\infty}^{\infty} r^{|n|} z^n \]
is Abel’s kernel. Note that
\[
\begin{aligned}
A_r &= \sum_{n=0}^{\infty} \left( (rz)^n + (r/z)^n \right) - 1 \\
&= \frac{1}{1 - rz} + \frac{1}{1 - r/z} - 1 \\
&= \frac{1 - r^2}{1 + r^2 - 2r \cos(x)}
\end{aligned}
\]
which again satisfies the key properties that $A_r \ge 0$, $\langle A_r, e_0\rangle = 1$, and $A_r|_{[\delta, 2\pi - \delta]} \to 0$ uniformly for any $\delta > 0$. Thus the claim follows by the same argument as in the proof of density of trigonometric polynomials and Fejér’s theorem.
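The closed form of Abel’s kernel can be compared against a truncation of the defining series, using $\sum_{n \in \mathbb{Z}} r^{|n|} e^{inx} = 1 + 2\sum_{n \ge 1} r^n \cos(nx)$. The truncation length and the function names below are assumptions of this sketch.

```python
import math

def abel_series(r, x, terms=400):
    """Truncated series 1 + 2 * sum_{n=1}^{terms} r^n cos(n x)."""
    return 1 + 2 * sum(r ** n * math.cos(n * x) for n in range(1, terms + 1))

def abel_closed(r, x):
    """Closed form (1 - r^2) / (1 + r^2 - 2 r cos x)."""
    return (1 - r * r) / (1 + r * r - 2 * r * math.cos(x))

for x in (0.0, 0.7, 2.0, 3.1):
    print(x, abel_series(0.9, x), abel_closed(0.9, x))  # columns agree
print(abel_closed(0.9, 1.0), abel_closed(0.99, 1.0))    # decays away from x = 0 as r -> 1
```

Positivity is visible from the closed form, since the denominator is at least $(1 - r)^2 > 0$.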

Applications to partial differential equations


The Fourier transform is particularly useful for solving linear PDEs which ex-
hibit translation invariance.

Laplace’s equation on the disk


We start with Laplace’s equation on the disk D = {z ∈ C, |z| < 1}. The problem
is finding a twice differentiable f : D → R with
∆f := fxx + fyy = 0 in D

83
satisfying the boundary condition f |S 1 = g for some given g : S 1 = ∂D → R,
which we assume to be continuous. Here, z = x + iy and fxx is shorthand
notation for ∂ 2 f /∂x2 . One can show that Laplace’s PDE is equivalent to finding
a minimizer of the energy
Z
E(f ) = fx2 + fy2 dxdy
D

where f ranges over continuously differentiable functions satisfying the given


boundary condition. Solutions are called harmonic functions.
We can write down a solution in terms of the Fourier series of $g$ in the form
\[
f(z) = a_0 + \sum_{n=1}^{\infty} \hat g(n) z^n + \sum_{n=1}^{\infty} \hat g(-n) \bar z^n,
\]
where $a_0 = \hat g(0)$. This converges to $g(z)$ on the boundary by Abel's theorem (note that $\bar z = 1/z$ for $z$ on the unit circle). To argue that $\Delta f = 0$, we refer to basic results in complex analysis which state that a power series $\sum a_n z^n$ is holomorphic (complex differentiable). Thus $f$ is a sum of a holomorphic and an anti-holomorphic function, hence harmonic.
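
As a sanity check, one can build the series solution for a trigonometric polynomial $g$ and test harmonicity with a finite-difference Laplacian. This is a sketch under assumptions made here: the Fourier coefficients are approximated by a Riemann sum, the series is truncated at a hypothetical `max_degree`, and the five-point stencil only approximates $\Delta f$.

```python
import cmath
import math

def fourier_coefficient(g, n, samples=512):
    """Riemann-sum approximation of (1/2pi) * integral of g(x) e^{-inx} over [0, 2pi]."""
    h = 2 * math.pi / samples
    return sum(g(j * h) * cmath.exp(-1j * n * j * h) for j in range(samples)) / samples

def harmonic_extension(g, max_degree=20):
    """Return f(z) = g^(0) + sum_{n>=1} g^(n) z^n + g^(-n) conj(z)^n, truncated at max_degree."""
    coeffs = {n: fourier_coefficient(g, n) for n in range(-max_degree, max_degree + 1)}

    def f(z):
        z = complex(z)
        total = coeffs[0]
        for n in range(1, max_degree + 1):
            total += coeffs[n] * z ** n + coeffs[-n] * z.conjugate() ** n
        return total.real  # g is real-valued, so the imaginary part is round-off

    return f

def discrete_laplacian(f, z, h=1e-2):
    """Five-point finite-difference approximation of f_xx + f_yy at the point z = x + iy."""
    return (f(z + h) + f(z - h) + f(z + 1j * h) + f(z - 1j * h) - 4 * f(z)) / (h * h)
```

For $g(x) = \cos(2x) + \tfrac12 \sin(x)$ the extension is $f(x + iy) = x^2 - y^2 + \tfrac12 y$, which the code reproduces up to round-off.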

Heat equation on the circle


The 1-dimensional heat equation is

fxx = ft

which is to be solved for $t > 0$ given an initial condition $f(x, 0) = g(x)$. First, note that for each $n \in \mathbb{Z}$ the function
\[
e^{-n^2 t + inx}
\]
is a solution of the heat equation with $g(x) = e^{inx}$ a basis element. At least formally we can write the general solution as
\[
f(x, t) = \sum_{n \in \mathbb{Z}} \hat g(n)\, e^{-n^2 t + inx} = (g * K_t)(x)
\]
where $K_t$, $t > 0$ is the heat kernel
\[
K_t(x) = \sum_{n \in \mathbb{Z}} e^{-n^2 t + inx}.
\]

We claim that
\[
\sum_{n \in \mathbb{Z}} e^{-n^2 t + inx} = \sqrt{\frac{\pi}{t}} \sum_{n \in \mathbb{Z}} e^{-(x - 2\pi n)^2/4t}
\]
for $x \in S^1$, $t > 0$. Since the left hand side is given as a Fourier series, it suffices to compute the Fourier coefficients of the right hand side. Note that the right hand side is a sum of translates, by $2\pi n$, of the same function $e^{-x^2/4t}$. So instead of integrating each of the translates over $[0, 2\pi]$, we can just integrate once over $\mathbb{R}$. We compute
\[
\sqrt{\frac{\pi}{t}} \cdot \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-x^2/4t} e^{-inx}\,dx
= \frac{1}{2\sqrt{\pi t}}\, e^{-n^2 t} \int_{-\infty}^{\infty} e^{-\left(x/(2\sqrt{t}) + in\sqrt{t}\right)^2}\,dx
\]
but moving the contour of integration
\[
\int_{-\infty}^{\infty} e^{-\left(x/(2\sqrt{t}) + in\sqrt{t}\right)^2}\,dx
= \int_{-\infty}^{\infty} e^{-\left(x/(2\sqrt{t})\right)^2}\,dx = 2\sqrt{\pi t}
\]
which proves the identity.


The second formula for $K_t$ shows that it is a positive function for each $t > 0$. Since the $n = 0$ Fourier coefficient of $K_t$ is $e^{-0^2 t} = 1$, we also have
\[
\frac{1}{2\pi} \int_0^{2\pi} K_t(x)\,dx = 1.
\]
Finally, using that $K_t$ is a sum of Gaussian peaks, one can show $K_t|_{[\delta, 2\pi - \delta]} \to 0$ uniformly as $t \to 0$ for fixed $\delta > 0$. Thus, by a now familiar argument, if $g$ is continuous then $K_t * g \to g$ uniformly as $t \to 0$.
Note that for $t > 0$ the Fourier coefficients $\hat g(n) e^{-n^2 t}$ of $K_t * g$ decay very rapidly as $|n| \to \infty$. In fact each $K_t * g$ is analytic, i.e. given locally by a convergent power series, in particular arbitrarily many times differentiable.
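
The identity between the two expressions for the heat kernel is easy to test numerically. The sketch below truncates both sums at a hypothetical cutoff `terms`, an assumption that is harmless for moderate $t$ because both series converge very quickly.

```python
import math

def heat_kernel_fourier(t, x, terms=200):
    """Truncated Fourier form: sum_n e^{-n^2 t + inx} = 1 + 2 sum_{n>=1} e^{-n^2 t} cos(nx)."""
    return 1 + 2 * sum(math.exp(-n * n * t) * math.cos(n * x) for n in range(1, terms))

def heat_kernel_gaussians(t, x, terms=200):
    """Truncated sum of Gaussian peaks: sqrt(pi/t) * sum_n e^{-(x - 2 pi n)^2 / (4t)}."""
    peaks = sum(math.exp(-(x - 2 * math.pi * n) ** 2 / (4 * t))
                for n in range(-terms, terms + 1))
    return math.sqrt(math.pi / t) * peaks
```

The Fourier form converges fastest for large $t$ and the Gaussian form for small $t$, which is one practical reason to have both.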

Figure 11: The heat kernel $K_t$ for $t = 1, 1/2, 1/4, 1/9$.

Weyl equidistribution theorem


This is an application of Fourier series to ergodic theory. It says that the integral
of a continuous function on S 1 is approximated by an average over an evenly
distributed set of points.

Theorem 9.9 (Weyl). If $x \in S^1 = \mathbb{R}/(2\pi\mathbb{Z})$ is an irrational multiple of $2\pi$ and $f \in C(S^1)$, then
\[
\lim_{N \to \infty} \frac{1}{N} \sum_{k=1}^{N} f(kx) = \frac{1}{2\pi} \int_0^{2\pi} f.
\]

Proof. Both sides are bounded linear functionals on the Banach space $C(S^1)$ with the $\infty$-norm, since the average is bounded by $\|f\|_\infty$. Thus it suffices to check the identity for $f(y) = e^{iny}$, $n \in \mathbb{Z}$, whose algebraic span is dense. For $n = 0$ both sides are obviously 1, so assume $n \ne 0$. Then the right hand side is 0 and $e^{inx} \ne 1$ by irrationality of $x/(2\pi)$, so
\[
\sum_{k=1}^{N} e^{inkx} = \frac{e^{inx} - e^{i(N+1)nx}}{1 - e^{inx}}
\]
which has absolute value bounded by $2/|1 - e^{inx}|$ for any $N$, hence the averages go to 0 as $N \to \infty$.

The theorem says that if x is irrational, then the numbers

nx mod 1, n = 1, 2, 3, . . .

in [0, 1) will be evenly distributed, e.g. the first digit after the decimal point is
equally likely to be any of 0, . . . , 9.
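
A quick experiment illustrates the statement about digits. The sketch below is only a floating-point illustration with names chosen here: `first_digit_frequencies` works with $kx \bmod 1$ as in the remark above, while `orbit_average` uses angles in $\mathbb{R}/(2\pi\mathbb{Z})$ as in the theorem.

```python
import math

def first_digit_frequencies(x, N):
    """Relative frequencies of the first decimal digit of k*x mod 1 for k = 1, ..., N."""
    counts = [0] * 10
    for k in range(1, N + 1):
        frac = (k * x) % 1.0
        counts[min(9, int(10 * frac))] += 1
    return [c / N for c in counts]

def orbit_average(f, x, N):
    """Average (1/N) * sum_{k=1}^N f(k*x), with x an angle in R/(2 pi Z)."""
    return sum(f(k * x) for k in range(1, N + 1)) / N
```

With $x = \sqrt{2}$ all ten digit frequencies come out close to $1/10$, and for $x = 2\pi\sqrt{2}$ the averages of $\cos(3y)$ along the orbit are close to its integral, which is 0.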

Fourier transform of functions on R


The Fourier transform of a function $f : \mathbb{R} \to \mathbb{C}$, $f \in L^1(\mathbb{R})$ is defined by
\[
\hat f(\xi) = \int_{-\infty}^{\infty} f(x) e^{-2\pi i x \xi}\,dx
\]
and $\hat f \in L^\infty(\mathbb{R})$. In a certain sense, this can be regarded as the limit of the Fourier transform of a periodic function as the period goes to infinity. Let us assume that $f$ is smooth with compact support. Suppose $R > 0$ is large enough so that the support of $f$ is contained in $[-R, R]$, then we can represent $f$ as a Fourier series on $[-R, R]$ of the form
\[
f(x) = \sum_{n \in \mathbb{Z}} a_n e^{2\pi i n x/(2R)}
\]
with
\[
a_n = \frac{1}{2R} \int_{-R}^{R} f(x) e^{-2\pi i n x/(2R)}\,dx = \frac{1}{2R} \hat f\left( \frac{n}{2R} \right).
\]
The right hand side is the function on $\mathbb{Z}$ obtained by sampling $\hat f$ on the regular grid $\mathbb{Z}/(2R)$ and weighting by $1/(2R)$, so becomes a better and better approximation to $\hat f$ as $R \to \infty$. Furthermore, Parseval's identity implies
\[
\int_{-\infty}^{\infty} |f(x)|^2\,dx = \frac{1}{2R} \sum_{n \in \mathbb{Z}} \left| \hat f\left( \frac{n}{2R} \right) \right|^2 \to \int_{-\infty}^{\infty} |\hat f(\xi)|^2\,d\xi, \quad \text{as } R \to \infty
\]
where convergence to the integral follows from continuity of $\hat f$. This is the Plancherel theorem, which says that the Fourier transform for functions on $\mathbb{R}$ is an isometry. In particular, the Fourier transform is continuous with respect to the 2-norm, and so extends from compactly supported smooth functions to all of $L^2(\mathbb{R})$.
A useful property of the Fourier transform is that taking the derivative of $\hat f$ corresponds to multiplying $f$ by $-2\pi i x$ pointwise. More precisely, we have:
Theorem 9.10. Suppose $f \in L^1(\mathbb{R})$ such that $xf \in L^1(\mathbb{R})$ as well. Then $\hat f$ is continuously differentiable and
\[
\frac{d\hat f}{d\xi}(\xi) = \widehat{-2\pi i x f}(\xi).
\]

Proof. We want to show existence of
\[
\lim_{h \to 0} \int_{-\infty}^{\infty} f(x)\, \frac{e^{-2\pi i x(\xi + h)} - e^{-2\pi i x \xi}}{h}\,dx
\]
thus need to justify interchange of limit and integration. Note that
\[
\left| \frac{e^{-2\pi i x(\xi + h)} - e^{-2\pi i x \xi}}{h} \right| \le 2\pi |x|
\]
and so, since $|xf| \in L^1(\mathbb{R})$ by assumption, we can apply the dominated convergence theorem which gives
\[
\hat f'(\xi) = -2\pi i \int_{-\infty}^{\infty} x f(x) e^{-2\pi i x \xi}\,dx = \widehat{-2\pi i x f}(\xi).
\]

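Theorem 9.11. Suppose $f$ is continuously differentiable with $f, f' \in L^1(\mathbb{R})$ and $f(x) \to 0$ as $x \to \pm\infty$. Then
\[
\widehat{\frac{df}{dx}}(\xi) = 2\pi i \xi \hat f(\xi).
\]
Proof. We need to show the identity
\[
\int_{-\infty}^{\infty} \frac{df}{dx}(x) e^{-2\pi i x \xi}\,dx = 2\pi i \xi \int_{-\infty}^{\infty} f(x) e^{-2\pi i x \xi}\,dx.
\]
The idea is to use integration by parts, which we can do by our assumption on $f$ and
\[
\int_{-R}^{R} \frac{d}{dx} \left( f(x) e^{-2\pi i x \xi} \right) dx = \left[ f(x) e^{-2\pi i x \xi} \right]_{-R}^{R} \to 0
\]
as $R \to \infty$.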

The previous theorems required assumptions both on the regularity and the decay of $f$. In order to not have to keep track of these, it is convenient to work with a class of functions which have an unlimited amount of both regularity and decay. A Schwartz function is an infinitely differentiable (aka $C^\infty$, smooth) function $f : \mathbb{R} \to \mathbb{C}$ such that all functions
\[
|x|^n \frac{d^m f}{dx^m}(x), \quad m, n \ge 0
\]
are bounded on $\mathbb{R}$. The space of Schwartz functions on $\mathbb{R}$ is denoted $S(\mathbb{R})$. Any smooth function with bounded support is a Schwartz function, as is the Gaussian $e^{-x^2}$. If $f \in S(\mathbb{R})$, then so are $f'$, $xf$, and $\hat f$, and $S(\mathbb{R})$ is closed under pointwise product and convolution of functions.

Uncertainty principle
An important idea about the Fourier transform is that $f$ and $\hat f$ cannot both be concentrated arbitrarily close to a single point. This is a consequence of the
general principle that Fourier transform interchanges small scale and large scale
structure. It leads to the famous Heisenberg uncertainty principle in quantum
mechanics concerning position and momentum of a particle.
We begin by defining operators (linear maps) $X, D : S(\mathbb{R}) \to S(\mathbb{R})$ on Schwartz functions given by
\[
(Xf)(x) := x f(x), \qquad (Df)(x) := \frac{i}{2\pi} \frac{d}{dx} f(x).
\]
These satisfy the commutator relation
\[
DX - XD = \frac{i}{2\pi}
\]
since
\[
(D(X(f)) - X(D(f)))(x) = \frac{i}{2\pi} \left( f(x) + x f'(x) - x f'(x) \right).
\]
They are also self-adjoint in the sense that
\[
\langle Xf, g \rangle = \langle f, Xg \rangle, \qquad \langle Df, g \rangle = \langle f, Dg \rangle
\]
for any $f, g \in S(\mathbb{R})$. The first identity is obvious, and the second follows by integration by parts:
\[
\langle Df, g \rangle = \int_{-\infty}^{\infty} \overline{\frac{i}{2\pi} f'(x)}\, g(x)\,dx
= \int_{-\infty}^{\infty} \overline{f(x)}\, \frac{i}{2\pi} g'(x)\,dx
= \langle f, Dg \rangle.
\]
Note that the sign coming from integration by parts cancels with the one from $\bar i = -i$.
Theorem 9.12. Let $f \in S(\mathbb{R})$, then
\[
\|Xf\|_2 \|Df\|_2 \ge \frac{1}{4\pi} \|f\|_2^2.
\]
Proof. We start with the obvious inequality
\[
\langle (aX + ibD)f, (aX + ibD)f \rangle \ge 0
\]
for any $a, b \in \mathbb{R}$. Expanding this out and using the self-adjointness and commutator relations gives
\[
a^2 \|Xf\|_2^2 + b^2 \|Df\|_2^2 \ge -\frac{ab}{2\pi} \|f\|_2^2.
\]
We optimize the inequality by setting
\[
a = \sqrt{\frac{\|Df\|_2}{2\|Xf\|_2}}, \qquad b = -\sqrt{\frac{\|Xf\|_2}{2\|Df\|_2}}
\]
which produces the claimed inequality.


The Plancherel theorem tells us that $\|Df\|_2 = \|\widehat{Df}\|_2 = \|X\hat f\|_2$, from which one can deduce the following.
Theorem 9.13. For $f \in S(\mathbb{R})$ with $\|f\|_2 = 1$ and $x_0, \xi_0 \in \mathbb{R}$, we have
\[
\left( \int_{-\infty}^{\infty} (x - x_0)^2 |f(x)|^2\,dx \right) \left( \int_{-\infty}^{\infty} (\xi - \xi_0)^2 |\hat f(\xi)|^2\,d\xi \right) \ge \frac{1}{16\pi^2}.
\]
The integral
\[
\int_{-\infty}^{\infty} (x - x_0)^2 |f(x)|^2\,dx
\]
is the dispersion of $f$ about $x_0$. Hence, the theorem says that for normalized $f$, the functions $f$ and $\hat f$ cannot both have arbitrarily small dispersion.

Let us set $x_0 = \xi_0 = 0$ for simplicity, then the inequality becomes an equality only for the Gaussian
\[
f(x) = C e^{-\pi x^2/\sigma^2}
\]
where $\sigma > 0$ is arbitrary and $C = \sqrt[4]{2}/\sqrt{\sigma}$ is chosen so that $\|f\|_2 = 1$. The Fourier transform is
\[
\hat f(\xi) = \sigma C e^{-\pi \sigma^2 \xi^2}
\]
so the Fourier transform of a Gaussian with variance $\sigma^2$ is again a Gaussian, but with variance $\sigma^{-2}$.
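
The equality case can be confirmed numerically: for the Gaussian above, the product of the two dispersions about 0 should equal $1/(16\pi^2)$ for every $\sigma$. The sketch below uses a plain midpoint rule on a truncated interval; the helper names, the cutoff `half_width`, and the step count are assumptions of this illustration.

```python
import math

def dispersion(density, half_width=20.0, steps=20000):
    """Midpoint approximation of the integral of x^2 * density(x) over [-half_width, half_width]."""
    h = 2 * half_width / steps
    total = 0.0
    for j in range(steps):
        x = -half_width + (j + 0.5) * h
        total += x * x * density(x) * h
    return total

def gaussian_densities(sigma):
    """Return |f|^2 and |f^|^2 for f(x) = C exp(-pi x^2 / sigma^2) with C = 2^(1/4) / sqrt(sigma)."""
    C = 2 ** 0.25 / math.sqrt(sigma)
    f_sq = lambda x: (C * math.exp(-math.pi * x * x / sigma ** 2)) ** 2
    fhat_sq = lambda xi: (sigma * C * math.exp(-math.pi * sigma ** 2 * xi * xi)) ** 2
    return f_sq, fhat_sq

def dispersion_product(sigma):
    f_sq, fhat_sq = gaussian_densities(sigma)
    return dispersion(f_sq) * dispersion(fhat_sq)
```

The individual dispersions are $\sigma^2/(4\pi)$ and $1/(4\pi\sigma^2)$, so shrinking one necessarily inflates the other.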

Discrete Fourier transform


For this variant the physical space is the cyclic group $\mathbb{Z}/N\mathbb{Z}$ which we can think of as a discretized $S^1$. The measure on $\mathbb{Z}/N\mathbb{Z}$ is the standard probability measure. What is the frequency space? The group homomorphisms $\chi : \mathbb{Z}/N\mathbb{Z} \to U(1)$ must send 1 to an $N$-th root of unity and are thus given by
\[
\chi_n(k) = e^{2\pi i n k/N}, \quad n = 0, 1, \dots, N - 1.
\]
So the dual to $\mathbb{Z}/N\mathbb{Z}$ is the same cyclic group. Also, the $\chi_n$ form an orthonormal basis of $L^2(\mathbb{Z}/N\mathbb{Z}) = \mathbb{C}^N$ since
\[
\langle \chi_m, \chi_n \rangle = \frac{1}{N} \sum_{k=0}^{N-1} e^{2\pi i (n - m)k/N}
\]
and the right hand side is clearly 1 if $m = n$, and otherwise 0 from the formula for a geometric series. Any $N$ orthonormal vectors in $\mathbb{C}^N$ give a basis.
The Fourier transform of $f : \mathbb{Z}/N\mathbb{Z} \to \mathbb{C}$ is
\[
\hat f(k) = \langle \chi_k, f \rangle = \frac{1}{N} \sum_{n=0}^{N-1} f(n) e^{-2\pi i n k/N}
\]
which is again a function $\mathbb{Z}/N\mathbb{Z} \to \mathbb{C}$. Since the $\chi_n$ are orthonormal, $f = \sum_n \langle \chi_n, f \rangle \chi_n$, so the inverse transform is given by the similar formula (without the factor $1/N$)
\[
f(k) = \sum_{n=0}^{N-1} \hat f(n) e^{2\pi i n k/N}.
\]
Thus we get an invertible isometry $F : \mathbb{C}^N \to \mathbb{C}^N$.


Note that computing the Fourier transform using the above formula requires about $O(N^2)$ operations. A significant improvement, used widely in signal processing, is the Fast Fourier Transform (FFT). This works particularly well if $N = 2^M$ is a power of 2, which we will assume. We encode $f : \mathbb{Z}/N\mathbb{Z} \to \mathbb{C}$ in terms of functions $f_0, f_1 : \mathbb{Z}/((N/2)\mathbb{Z}) \to \mathbb{C}$ with
\[
f_j(k) = f(2k + j).
\]
Then one easily checks the formula
\[
\hat f(k) = \frac{1}{2} \left( \hat f_0(k) + e^{-2\pi i k/N} \hat f_1(k) \right),
\]
where $\hat f_0$ and $\hat f_1$ are extended $N/2$-periodically in $k$. So the Fourier transform of a length $N$ vector is obtained from the Fourier transform of two length $N/2$ vectors in $O(N)$ operations. Building a recursive algorithm, we can thus compute the Fourier transform in $O(MN) = O(N \log N)$ operations. It is an open problem in computer science if this is essentially optimal, or if there is an algorithm faster than $O(N \log N)$.
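
The recursion can be sketched in a few lines. The code below follows the normalization used in these notes (a factor $1/N$ in the forward transform and none in the inverse); the function names are choices made here, and a production implementation would normally be iterative and work in place.

```python
import cmath

def dft(f):
    """Direct O(N^2) transform: f^(k) = (1/N) * sum_n f(n) e^{-2 pi i n k / N}."""
    N = len(f)
    return [sum(f[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N)) / N
            for k in range(N)]

def fft(f):
    """Recursive transform for len(f) a power of 2, using f^(k) = (f0^(k) + e^{-2 pi i k/N} f1^(k)) / 2."""
    N = len(f)
    if N == 1:
        return [complex(f[0])]
    if N % 2:
        raise ValueError("length must be a power of 2")
    even = fft(f[0::2])  # transform of f0(k) = f(2k)
    odd = fft(f[1::2])   # transform of f1(k) = f(2k + 1)
    half = N // 2
    out = [0j] * N
    for k in range(half):
        twiddle = cmath.exp(-2j * cmath.pi * k / N)
        out[k] = 0.5 * (even[k] + twiddle * odd[k])
        # f0^ and f1^ are (N/2)-periodic and the twiddle factor changes sign at k + N/2
        out[k + half] = 0.5 * (even[k] - twiddle * odd[k])
    return out

def inverse_dft(fhat):
    """Inverse transform f(k) = sum_n f^(n) e^{2 pi i n k / N} (no factor 1/N)."""
    N = len(fhat)
    return [sum(fhat[n] * cmath.exp(2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]
```

Each level of the recursion does $O(N)$ work and there are $M = \log_2 N$ levels, which gives the $O(N \log N)$ count.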

References and further reading


• T. Tao: “Fourier Transform”,
terrytao.wordpress.com/2009/04/06/the-fourier-transform/
• C. McMullen: “Real Analysis.” (Math 114 course notes)
• W. Rudin: “Real and complex analysis”

• Stein, Shakarchi: “Real Analysis” (Princeton lectures in Analysis 1,2,3)

Chapter 10

Baire's theorem and applications

We have seen the usefulness of the completeness property of infinite-dimensional vector spaces in Fourier analysis and the theory of Hilbert spaces in general. In
the more flexible case of Banach spaces, many theorems exploit completeness
indirectly by appealing to Baire’s theorem, which we discuss in this section.
Baire’s theorem is about sets which are “small” in a topological sense and this
notion can be a substitute for null sets if no measure space structure is available.
We can work in the setting of a complete metric space $(X, d)$, which we assume to be non-empty. Examples are Banach spaces with metric $d(x, y) = \|x - y\|$ and their closed subsets. The open ball with radius $r > 0$ and center $a \in X$ is the set
\[
B_a(r) = \{x \in X \mid d(a, x) < r\}.
\]
Recall that $E$ is dense in $X$ if every ball $B_a(r)$ contains points of $E$. A subset $E \subset X$ is nowhere dense if every ball $B \subset X$ contains a smaller ball $B' \subset B$ disjoint from $E$. This is the same as saying that the closure of $E$ has empty interior. A subset $E \subset X$ is meager if it is a countable union of nowhere dense sets. This makes the following properties automatic.

1. Nowhere dense sets are meager.
2. Subsets of meager sets are meager.
3. Countable unions of meager sets are meager.

Thus meager sets behave in many ways like null sets. What is not obvious,
however, is that meager sets are truly “small”, and in particular that the entire
space X is not meager. This is where the completeness assumption comes in.
Theorem 10.1 (Baire). Any countable intersection of dense open sets $U_1, U_2, U_3, \dots$ in a complete metric space $X$ is dense.

Proof. Let $B_0$ be an arbitrary ball in $X$. Since $U_1$ is dense the intersection $U_1 \cap B_0$ is non-empty, and open since $U_1$ and $B_0$ are. Choose a ball $B_1$ such that its closure is contained in $U_1 \cap B_0$. By induction we get balls $B_0 \supset B_1 \supset B_2 \supset \dots$ with the closure of $B_i$ contained in $U_i \cap B_{i-1}$ for $i \ge 1$. We may also assume that the $B_i$ are chosen small enough so that $\operatorname{diam}(B_i) \to 0$ as $i \to \infty$. Then the centers of the $B_i$ form a Cauchy sequence, and thus converge to a point $x \in X$ by completeness. By construction
\[
x \in B_0 \cap \bigcap_{i \ge 1} \overline{B_i} \subset B_0 \cap \bigcap_{i \ge 1} U_i
\]
so $\bigcap U_i$ intersects every open ball, i.e. is dense.
This result is also known as the Baire category theorem. Baire calls meager
sets of the first category and all other sets of the second category.
Corollary 10.2. The complement of any meager set is dense. In particular X
cannot be meager unless X = ∅.
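Proof. Suppose $E$ is meager, then
\[
E = \bigcup_{i \ge 1} E_i
\]
with $E_i$ nowhere dense. Let $U_i = X \setminus \overline{E_i}$, which is open and dense with
\[
X \setminus E = X \setminus \bigcup_{i \ge 1} E_i = \bigcap_{i \ge 1} (X \setminus E_i) \supset \bigcap_{i \ge 1} U_i.
\]
Baire's theorem implies that $\bigcap U_i$ is dense, thus so is $X \setminus E$.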
The theorem fails if the completeness assumption is dropped. For example
in X = Q, the singleton sets {q} are nowhere dense, so Q itself is meager. Also
note that the conclusion of Baire’s theorem only concerns the topology of X,
but the proof used the metric space structure. In fact, the conclusion of Baire’s
theorem also holds in locally compact Hausdorff spaces.
Here is another characterization of meager sets.
Corollary 10.3. A set E in a complete metric space is meager iff there is a
disjoint dense Gδ set.
Proof. Use the fact that a set is nowhere dense iff there is a disjoint dense open
set and Baire’s theorem.
Combining the previous two, we find:
Corollary 10.4. In a complete metric space dense Gδ ’s are closed under count-
able intersection.
A curious application is the following.

Theorem 10.5. There is no function f : [0, 1] → R which is continuous exactly
at the rational points.
Proof. Let $E \subset [0, 1]$ be the set of points where $f$ is continuous. We claim that $E$ is a $G_\delta$. To see this let $E_n$ be the set of $x \in [0, 1]$ such that there exists a $\delta > 0$ with
\[
f([x - \delta, x + \delta]) \subset I
\]
for some interval $I$ of length $1/n$. Then $E_n$ is open and $E = \bigcap_{n \ge 1} E_n$.
If $E = [0, 1] \cap \mathbb{Q}$, then $E$ is both meager and a dense $G_\delta$ in $[0, 1]$, a contradiction.

Diophantine approximation
Diophantine approximation, named after Diophantus of Alexandria, is about
the approximation of real numbers by rational ones. An application of the
pigeonhole principle gives the following.
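Theorem 10.6. If $x$ is irrational, then there are infinitely many $p/q \in \mathbb{Q}$ with
\[
\left| x - \frac{p}{q} \right| \le \frac{1}{q^2}.
\]

Proof. Let $n \ge 1$. By irrationality of $x$, the $n + 1$ numbers
\[
0, x, 2x, \dots, nx \mod 1
\]
are all distinct in $[0, 1)$. By the pigeonhole principle two of them lie in the same interval $[j/n, (j+1)/n)$, so there must be some $0 \le n_1 < n_2 \le n$ such that $n_1 x \bmod 1$ and $n_2 x \bmod 1$ have distance $\le 1/n$. Set $q = n_2 - n_1 \le n$, then $qx$ is within $1/n$ of an integer $p$, thus
\[
|p - qx| \le \frac{1}{n} \le \frac{1}{q}
\]
and dividing by $q$ gives $|x - p/q| \le 1/(nq) \le 1/q^2$. Since $|x - p/q| > 0$ by irrationality, letting $n \to \infty$ produces infinitely many distinct $p/q$.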
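
The pigeonhole argument guarantees that, for every $n$, some denominator $q \le n$ brings $qx$ within $1/n$ of an integer. A brute-force search over all $q \le n$ makes this concrete. The function name below is hypothetical and floating-point arithmetic is assumed to be accurate enough for the small $n$ used here.

```python
import math

def dirichlet_approximation(x, n):
    """Return (p, q) with 1 <= q <= n minimizing |q*x - p| over integers p and q."""
    best = None
    for q in range(1, n + 1):
        p = round(q * x)            # nearest integer to q*x
        err = abs(q * x - p)
        if best is None or err < best[0]:
            best = (err, p, q)
    _, p, q = best
    return p, q
```

For $x = \sqrt{2}$ and $n = 100$ the search returns $99/70$, one of the continued fraction convergents of $\sqrt{2}$.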
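In the other direction, a number $x \in \mathbb{R}$ is Diophantine of exponent $\alpha$ if there is a $C > 0$ with
\[
\left| x - \frac{p}{q} \right| > \frac{C}{q^\alpha}
\]
for all $p/q \in \mathbb{Q}$. A number is Diophantine if it is Diophantine of exponent $2 + \varepsilon$ for every $\varepsilon > 0$.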
Theorem 10.7. A random number in [0, 1] is Diophantine, i.e. the Diophantine
numbers form a set of full measure.

Proof. Let $\varepsilon > 0$ and consider
\[
E_q = \left\{ x \in [0, 1] \mid \exists p : |x - p/q| < 1/q^{2 + \varepsilon} \right\}.
\]
Note that $E_q$ is contained in the union of $q + 1$ intervals of length $2/q^{2 + \varepsilon}$ so
\[
m(E_q) \le \frac{2(q + 1)}{q^{2 + \varepsilon}} \le \frac{4}{q^{1 + \varepsilon}}
\]
hence
\[
\sum_{q=1}^{\infty} m(E_q) < \infty.
\]
By Borel–Cantelli almost every $x$ is contained in only finitely many $E_q$. But for such an $x$ we can find a suitable $C > 0$ satisfying the condition in the definition of Diophantine of exponent $2 + \varepsilon$. Letting $\varepsilon = 1/n$, for instance, we get a decreasing sequence of subsets of full measure with intersection the Diophantine numbers.
An irrational number x is Liouville if for any n ≥ 1 there is a p/q ∈ Q \ Z
with |x − p/q| < q −n . Note that such a number cannot be Diophantine of any
exponent. An example is
\[
x = \sum_{n=1}^{\infty} \frac{1}{10^{n!}}.
\]
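
That this series defines a Liouville number can be verified with exact rational arithmetic: the $n$-th partial sum $p/q$ has denominator $q = 10^{n!}$, and the tail of the series is smaller than $2 \cdot 10^{-(n+1)!}$, which is below $q^{-n}$. The helper names below are chosen for this sketch, and the tail bound is the elementary geometric-series estimate just stated.

```python
from fractions import Fraction
from math import factorial

def liouville_partial_sum(n):
    """Exact value of sum_{k=1}^{n} 10^{-k!} as a fraction p/q."""
    return sum(Fraction(1, 10 ** factorial(k)) for k in range(1, n + 1))

def liouville_tail_bound(n):
    """Upper bound 2 * 10^{-(n+1)!} for the tail sum_{k>n} 10^{-k!}."""
    return Fraction(2, 10 ** factorial(n + 1))

def is_liouville_witness(n):
    """Check that the n-th partial sum p/q is not an integer and satisfies |x - p/q| < q^{-n}."""
    approx = liouville_partial_sum(n)
    q = approx.denominator
    return q > 1 and liouville_tail_bound(n) < Fraction(1, q ** n)
```

The check succeeds for every $n$ because $(n+1)! = (n+1)\,n!$ exceeds $n \cdot n!$ by a full $n!$.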
The following theorem shows that Liouville numbers are transcendental.
Theorem 10.8. If x is the root of an irreducible polynomial f (t) of degree d > 1
with integer coefficients, then x is Diophantine of exponent d.

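Proof. Note that $f$ has no rational roots (otherwise $f$ has a linear factor), and $q^d f(p/q)$ is a nonzero integer, so
\[
\left| f\left( \frac{p}{q} \right) \right| \ge \frac{1}{q^d}.
\]
There is some $M \ge 1$ with $|f'| \le M$ on $[x - 1, x + 1]$ hence
\[
q^{-d} \le |f(x) - f(p/q)| \le M |x - p/q|
\]
for $p/q \in [x - 1, x + 1]$, but in fact for all $p/q \in \mathbb{Q}$. Thus
\[
\left| x - \frac{p}{q} \right| \ge \frac{1}{M q^d}
\]
which proves the claim.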


Theorem 10.9. The set of Liouville numbers in [0, 1] is a dense Gδ . Hence
the set of Diophantine numbers is meager.

Proof. Let
\[
E_n = \{x \in [0, 1] \mid \exists p, q \text{ with } q \ge 2 : |x - p/q| < q^{-n}\}
\]
which is open and contains $\mathbb{Q} \cap [0, 1]$, thus dense. The intersection of the $E_n$ with the irrational numbers (themselves a dense $G_\delta$), again a dense $G_\delta$ by Baire's theorem, is the set of Liouville numbers.
In particular a meager set can have full measure. On the other hand one can also find dense $G_\delta$'s which are null sets. Take for example the intersection of open $U_n \supset \mathbb{Q}$ with $m(U_n) \to 0$ as $n \to \infty$.

Uniform boundedness principle


We turn to another application of Baire’s theorem, the Banach–Steinhaus the-
orem, one of the fundamental results in the theory of Banach spaces. But
first, we need to look at the concept of a bounded linear operator. Suppose we
have normed spaces X, Y and a linear map (operator) T : X → Y , then T is
bounded if there is a constant C ≥ 0 such that

\[
\|T(x)\| \le C \|x\|
\]
for all $x \in X$. The smallest such $C$ is the operator norm
\[
\|T\| := \sup\{\|T(x)\| : x \in X, \|x\| \le 1\}.
\]
Boundedness is equivalent to continuity of $T$.


Theorem 10.10. Suppose $X, Y$ are normed spaces and $T : X \to Y$ is linear, then the following are equivalent:
1. $T$ is bounded, i.e. $\|T(x)\| \le C\|x\|$ for some $C \ge 0$.
2. $T$ is continuous.
3. $T$ is continuous at $0 \in X$.
Proof. $1 \implies 2$: If $x_n \to x$, $n \to \infty$, then
\[
\|T(x) - T(x_n)\| = \|T(x - x_n)\| \le C \|x - x_n\| \to 0
\]
so $T(x_n) \to T(x)$ also.
$2 \implies 3$: Clear.
$3 \implies 1$: If $T$ is continuous at 0 then there is a ball $B_0(\delta)$ around the origin in $X$ such that $T(B_0(\delta)) \subset B_0(1)$, the unit ball in $Y$. By homogeneity of $T$, this implies $\|T(x)\| \le (1/\delta)\|x\|$ for all $x \in X$.
The Banach–Steinhaus theorem or uniform boundedness principle concerns
a family of bounded operators.
Theorem 10.11 (Banach–Steinhaus). Let X be a Banach space, Y a normed
space, Ti : X → Y , i ∈ I a family of bounded operators. The following are
equivalent:

1. (Pointwise boundedness) For every $x \in X$ the set $\{T_i(x) \mid i \in I\} \subset Y$ is bounded.
2. (Uniform boundedness) The set of operator norms $\{\|T_i\| \mid i \in I\} \subset \mathbb{R}$ is bounded.

Proof. It is clear that 2. implies 1., since in this case we have a $C$, independent of $i$, such that $\|T_i(x)\| \le C\|x\|$.
The other direction requires completeness of $X$ and Baire's theorem. Define
\[
E_n := \{x \in X \mid \|T_i(x)\| \le n \text{ for all } i \in I\}
\]
for $n \ge 1$. By assumption, $X = \bigcup E_n$, so by Baire's theorem not all $E_n$ can be nowhere dense, thus some $E_n$ must be dense in a ball $B$. Since the $T_i$ are continuous, $E_n$ is closed hence $B \subset E_n$. Note that if $x, y \in E_n$, then
\[
\|T_i(x - y)\| \le \|T_i(x)\| + \|T_i(y)\| \le 2n
\]
so $E_n - E_n \subset E_{2n}$, but this means that $E_{2n}$ contains a ball $B_0(r)$ around the origin. Uniform boundedness follows with $C = 2n/r$.

Corollary 10.12. For $X, Y$ as before, and $T_n : X \to Y$ a sequence of bounded operators converging pointwise, the limiting operator $T : X \to Y$ with
\[
T(x) = \lim_{n \to \infty} T_n(x)
\]
is bounded.
One application of the uniform boundedness principle is to show that there are continuous functions whose Fourier series does not converge pointwise. Fix $x \in S^1$ and consider the linear functionals $\varphi_N : C(S^1) \to \mathbb{C}$ given by evaluating the $N$-th Fourier partial sum at $x$, i.e.
\[
\varphi_N(f) = \sum_{k=-N}^{N} \hat f(k) e^{ikx} = (D_N * f)(x).
\]
Here, $D_N$ is the $N$-th Dirichlet kernel. We have on one hand
\[
|\varphi_N(f)| = |(D_N * f)(x)| \le \|D_N * f\|_\infty \le \|D_N\|_1 \cdot \|f\|_\infty
\]
but also $|\varphi_N(f)|$ comes arbitrarily close to $\|D_N\|_1$ for continuous $f$ with $\|f\|_\infty \le 1$ approximating the sign of $y \mapsto D_N(x - y)$, thus $\|\varphi_N\| = \|D_N\|_1$. One can show that $\|D_N\|_1 \to \infty$ as $N \to \infty$, so the family of operators $\varphi_N : C(S^1) \to \mathbb{C}$ is unbounded. By the uniform boundedness principle we conclude:
Theorem 10.13. For each x ∈ S 1 there is a dense set of continuous functions
in C(S 1 ) whose Fourier series does not converge at x.
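
The growth of $\|D_N\|_1$ can be observed numerically. The sketch below assumes the closed form $D_N(x) = \sin((N + \tfrac12)x)/\sin(x/2)$ and the normalized measure $dx/(2\pi)$ on the circle; the function names and the midpoint rule are choices made for this illustration.

```python
import math

def dirichlet_kernel(N, x):
    """D_N(x) = sum_{k=-N}^{N} e^{ikx} = sin((N + 1/2) x) / sin(x/2), with value 2N + 1 where sin(x/2) = 0."""
    s = math.sin(x / 2)
    if abs(s) < 1e-12:
        return float(2 * N + 1)
    return math.sin((N + 0.5) * x) / s

def dirichlet_l1_norm(N, samples=50000):
    """Midpoint approximation of (1/2pi) * integral of |D_N(x)| over [0, 2pi]."""
    h = 2 * math.pi / samples
    return sum(abs(dirichlet_kernel(N, (j + 0.5) * h)) for j in range(samples)) / samples
```

The computed norms grow slowly, roughly like $(4/\pi^2)\log N$, which is unbounded but explains why divergence of Fourier series is hard to see in small examples.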

Open mapping theorem
A map $f : X \to Y$ between topological spaces is open if it maps open sets to open sets, i.e. $U \subset X$ open implies $f(U) \subset Y$ open. This is somewhat similar to continuity, which is defined by the condition $U \subset Y$ open implies $f^{-1}(U) \subset X$ open. Note that if a linear map $T : X \to Y$ between normed spaces is open, then the image of the open unit ball is an open set containing the origin in $Y$, thus contains some open ball of small radius. By homogeneity of $T$, any vector $y \in Y$ must be in the image of $T$, i.e. $T$ is surjective. The second fundamental result about bounded operators (after the uniform boundedness principle) is that the converse is also true: surjectivity implies openness, at least when $X, Y$ are Banach spaces and $T$ is bounded.
Lemma 10.14. Let $T : X \to Y$ be a bounded linear operator between Banach spaces $X, Y$ and $B \subset X$ the open unit ball. If the closure $\overline{T(B)}$ has nonempty interior, then $T$ is open and hence surjective.
Proof. Let $U$ be a nonempty open subset of $\overline{T(B)}$. Since $B$ is convex and symmetric, the same is true for $\overline{T(B)}$ and thus $(U - U)/2 \subset \overline{T(B)}$ which implies that $\overline{T(B)}$ contains some open ball $B_0(r)$ around the origin. Given $y \in B_0(r)$ we want to show that $T(x) = y$ for some $x \in X$. First, by assumption we can choose $x_0 \in B$ with $\|T(x_0) - y\| \le r/2$. Then, by scaling, we can find $x_1 \in B/2$ with
\[
\|T(x_0) + T(x_1) - y\| \le \frac{r}{4}.
\]
By induction, we get a sequence $x_k$ with $\|x_k\| \le 1/2^k$ satisfying
\[
T\left( \sum_{k=0}^{\infty} x_k \right) = y
\]
and $\sum_k \|x_k\| < 2$. Hence $T(B_0(2)) \supset B_0(r)$, so $T$ is an open mapping at the origin. By linearity, $T$ is open everywhere.
Theorem 10.15 (Open mapping theorem). Let $T : X \to Y$ be a bounded linear operator between Banach spaces $X, Y$. If $T(X) = Y$, then $T$ is open. Otherwise, $T(X)$ is meager.
Proof. If $T$ is surjective, then
\[
Y = T(X) = T\left( \bigcup_{n=1}^{\infty} nB \right) = \bigcup_{n=1}^{\infty} nT(B).
\]
By completeness of $Y$ and Baire's theorem, there must be some $nT(B)$ which is dense in a ball, hence $\overline{T(B)}$ has nonempty interior. The lemma implies that $T$ is open.
If $T$ is not surjective, then again by the lemma, all $nT(B)$ must be nowhere dense, hence $T(X)$ is meager.

There is a similar dichotomy for polynomial maps $P : \mathbb{C} \to \mathbb{C}$, whose image is either everything or a single point by the fundamental theorem of algebra. This extends to more general complex differentiable maps, leading to the "open mapping theorem" in complex analysis.
Corollary 10.16. Let T : X → Y be a bounded operator between Banach
spaces. If T is bijective, then the inverse T −1 is automatically bounded.
Proof. Because T is onto, T is open by the open mapping theorem. This means
the inverse pulls back open sets to open sets, i.e. is continuous.
The following application is due to Grothendieck.

Theorem 10.17. If $V \subset L^2([0, 1])$ is a closed subspace containing only continuous functions, then $V$ is finite-dimensional.
Proof. If $f \in V$, then
\[
\|f\|_2 = \left( \int_0^1 |f|^2 \right)^{1/2} \le \|f\|_\infty
\]
so the identity map $(V, \|\cdot\|_\infty) \to (V, \|\cdot\|_2)$ is bounded. By the open mapping theorem the inverse is also bounded, meaning that there is a $C > 0$ with $\|f\|_\infty \le C\|f\|_2$ for all $f \in V$.
Suppose $f_1, \dots, f_n$ are orthonormal vectors in $V$ and $x \in [0, 1]$, then
\[
\left( \sum_{k=1}^{n} |f_k(x)|^2 \right)^{1/2}
= \left\| \sum_{k=1}^{n} \overline{f_k(x)} f_k \right\|_2
\ge \frac{1}{C} \left\| \sum_{k=1}^{n} \overline{f_k(x)} f_k \right\|_\infty
\ge \frac{1}{C} \sum_{k=1}^{n} |f_k(x)|^2
\]
which implies
\[
\sum_{k=1}^{n} |f_k(x)|^2 \le C^2.
\]
Integrating both sides over $[0, 1]$ gives $n \le C^2$, which puts an upper bound on the dimension of $V$.

Closed graph theorem


Suppose T : X → Y is a linear map between normed spaces. The graph of T is
the subspace
{(x, T (x))|x ∈ X} ⊂ X × Y.

The graph is a closed subspace if and only if xn → x, T (xn ) → y implies
T (x) = y. This is a bit weaker than continuity of T , because there is nothing
to check for sequences where T (xn ) does not converge. However it turns out
that if X and Y are Banach spaces, then this condition on T is equivalent to
continuity.
Theorem 10.18 (Closed graph theorem). If T : X → Y is a linear map between
Banach spaces and the graph of T is closed, then T is bounded.
Proof. Let $G = \{(x, Tx)\} \subset X \times Y$ be the graph of $T$, which is a closed subspace by assumption, thus by itself a Banach space. The projection map $G \to X$, $(x, Tx) \mapsto x$ is a continuous bijection, hence its inverse $x \mapsto (x, Tx)$ is continuous by the open mapping theorem, thus $T$ is bounded.
The theorem is not true for non-linear maps $T$. Take for example the function $f : \mathbb{R} \to \mathbb{R}$ with $f(x) = 1/x$ for $x \ne 0$ and $f(0) = 0$, which is not continuous but has a closed graph.
Let H be a Hilbert space and T : H → H linear. T is self-adjoint if

hT x, yi = hx, T yi

for all $x, y \in H$. If $H = L^2(\mathbb{R})$, then one way of defining such operators is via a kernel $K : \mathbb{R}^2 \to \mathbb{C}$ with $K(y, x) = \overline{K(x, y)}$ by
\[
(Tf)(x) = \int K(x, y) f(y)\,dy.
\]
As long as $T$ is well-defined for a certain choice of $K$, it is automatically bounded, by the following result.
Theorem 10.19 (Toeplitz). If H is a Hilbert space, then any self-adjoint T :
H → H is bounded.
Proof. If $x_n \to x$ is a sequence in $H$ and $Tx_n \to y$, then for any $z \in H$ self-adjointness gives
\[
\langle Tx - y, z \rangle = \lim_{n \to \infty} \langle T(x - x_n), z \rangle = \lim_{n \to \infty} \langle x - x_n, Tz \rangle = 0
\]
which implies $Tx = y$. By the closed graph theorem, $T$ is bounded.


When discussing the Fourier transform we considered the operator $D : S(\mathbb{R}) \to S(\mathbb{R})$ given by
\[
Df = \frac{i}{2\pi} \frac{df}{dx}
\]
and showed that it is self-adjoint. However, $S(\mathbb{R})$ is not complete, so we cannot conclude that $D$ is bounded. Indeed one can see that $D$ is unbounded, for example by applying it to Gaussians $f(x) = e^{-Cx^2}$. It follows that $D$ cannot have a self-adjoint extension defined on all of $L^2(\mathbb{R})$, since such an extension would be bounded.
References and further reading
• C. McMullen: “Real Analysis.” (Math 114 course notes)
• T. Tao: “The Baire category theorem and its Banach space consequences”,
terrytao.wordpress.com/2009/02/01/245b-notes-9-the-baire-
category-theorem-and-its-banach-space-consequences/

• W. Rudin: “Real and complex analysis”, Chapter 5

Chapter 11

Hausdorff Measure and Fractals

Consider the following recursively constructed piecewise linear curves $C_n$.

Figure 12: First four approximations $C_1, C_2, C_3, C_4$ to the Koch curve.

Note that $C_{n+1}$ is constructed from $C_n$ by replacing each line segment with a scaled down version of $C_1$, putting a spike into it. In fact the $C_n$ converge uniformly to a continuous curve $C$ since we are moving points less and less in this procedure. What is the length of $C$, the Koch curve? First, $C_1$ is $4/3$ times longer than the straight line segment at its base, which we may assume has length 1, so by induction $C_n$ has length $(4/3)^n$. This geometric sequence goes to $\infty$ as $n \to \infty$, so $C$ has infinite length in some sense.
It turns out that from a measure theoretic point of view we should not consider $C$ as a curve, something one-dimensional, but as a fractal with non-integer dimension between 1 and 2. To find the dimension we use the self-similar nature of $C$. The principle is the following. If we take a unit square $[0, 1]^2$ and scale it by a factor of 2 we get $[0, 2]^2$ which is made of four disjoint copies of the original. The dimension is $\log(4)/\log(2) = 2$. Similarly if we scale a unit cube $[0, 1]^3$ by a factor of 2 we get eight copies, and the dimension is $\log(8)/\log(2) = 3$. Now if we scale the Koch curve by a factor of 3 we get 4 copies of the original, so the dimension should be $\log(4)/\log(3) = 1.26\ldots$.
A similar trick works with the Cantor set, which we can find at the base of the Koch curve. If we scale it by 3 we get two copies, so its dimension is $\log(2)/\log(3) = 0.63\ldots$, exactly half of that of the Koch curve.
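
The scaling rule used in these examples amounts to a one-line formula: a set made of `copies` pieces, each a copy of the whole scaled down by the factor `scale`, gets dimension $\log(\text{copies})/\log(\text{scale})$. The function names in this small sketch are chosen here for illustration.

```python
import math

def similarity_dimension(copies, scale):
    """Dimension d solving copies = scale**d for a self-similar set."""
    return math.log(copies) / math.log(scale)

def koch_length(n):
    """Length of the n-th approximation C_n of the Koch curve over a base segment of length 1."""
    return (4 / 3) ** n
```

For instance `similarity_dimension(4, 3)` gives the Koch curve value and `similarity_dimension(2, 3)` the Cantor set value, while `koch_length(n)` grows without bound.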
Of course self-similar sets, like the examples above, are very special. It was
F. Hausdorff who realized that the Lebesgue measure could be defined not just
for integer dimension d, leading to length, area, volume, and so on, but for any
positive real d. This also allows one to define the dimension of very general
subsets of Rn , even any metric space. We only give a sketch of the theory,
without going into detailed proofs (see references at the end of the chapter for
more details).

Hausdorff measure
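We begin by defining the $\alpha$-dimensional Hausdorff outer measure, $\alpha \ge 0$, of a subset $E \subset \mathbb{R}^n$ by
\[
m^*_\alpha(E) = \lim_{\delta \to 0} \inf \left\{ \sum_{i=1}^{\infty} (\operatorname{diam} E_i)^\alpha \;\middle|\; E \subset \bigcup_{i=1}^{\infty} E_i,\ \operatorname{diam} E_i \le \delta \right\}
\]
where $E_i$ are arbitrary subsets with diameter $\le \delta$. The diameter of a set $S \subset \mathbb{R}^n$ is just
\[
\operatorname{diam} S = \sup\{|x - y| : x, y \in S\}
\]
which could be $+\infty$. Note that the quantity
\[
\inf \left\{ \sum_{i=1}^{\infty} (\operatorname{diam} E_i)^\alpha \;\middle|\; E \subset \bigcup_{i=1}^{\infty} E_i,\ \operatorname{diam} E_i \le \delta \right\}
\]
is decreasing in $\delta$, so the above limit is defined and is a non-negative number or $+\infty$. It follows immediately from the definition and $\operatorname{diam}(\lambda S) = \lambda \operatorname{diam}(S)$ for $\lambda \ge 0$ that
\[
m^*_\alpha(\lambda E) = \lambda^\alpha m^*_\alpha(E)
\]
so $m^*_\alpha$ measures "$\alpha$-dimensional content".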
Just like in the case of Lebesgue measure on R, we need to restrict to a
special class of sets to make m∗α countably additive. We can take the σ-algebra
of Borel sets for that purpose. Recall that this is the smallest σ-algebra which
contains all the open and closed subsets of Rn . We call m∗α restricted to the
Borel sets the α-dimensional Hausdorff measure, written mα . This measure
is translation and rotation invariant. In the special case n = 1, α = 1, we recover
the previously defined Lebesgue measure on R. m0 (E) just counts the number
of points of E. For α = n, mn is the Lebesgue measure on Rn , up to a fixed
constant, the volume of the unit ball divided by 2n .

Hausdorff dimension
If $E \subset \mathbb{R}^n$ is a Borel set, then $m_\alpha(E)$ is non-trivial for at most one $\alpha$. More precisely, there exists a unique $\alpha \ge 0$ such that
\[
m_\beta(E) = \begin{cases} +\infty & \beta < \alpha \\ 0 & \beta > \alpha \end{cases}
\]
and this $\alpha \in \mathbb{R}_{\ge 0}$ is called the Hausdorff dimension of $E$, $\dim_H(E)$. Note that at the critical dimension $\alpha$, nothing can be said about $m_\alpha(E)$, which could in particular be $+\infty$.
A d-dimensional linear subspace of Rn has expected integer Hausdorff di-
mension d. With a bit of work one can show that in the self-similar examples
(e.g. the Cantor set) the similarity dimension we computed is the same as the
Hausdorff dimension. A set with non-integer Hausdorff dimension is called a
fractal, though this is not a uniformly agreed upon definition. It is known that
for every 0 ≤ α ≤ n there are subsets of Rn with Hausdorff dimension α, i.e. all
dimensions can be realized.
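
The jump at the critical exponent can be seen in the Cantor set using its natural covers: step $k$ of the construction covers the set by $2^k$ intervals of length $3^{-k}$. The sketch below evaluates the sum $\sum (\operatorname{diam} E_i)^\alpha$ for these particular covers only. Since the Hausdorff measure takes an infimum over all covers, this gives upper bounds and is an illustration, not a computation of $m_\alpha$.

```python
import math

def cantor_cover_sum(alpha, k):
    """Sum of diam^alpha over the 2^k intervals of length 3^-k covering the Cantor set at step k."""
    return (2 ** k) * (3.0 ** (-k * alpha))

# Exponent at which the covering sums neither blow up nor vanish
CRITICAL_EXPONENT = math.log(2) / math.log(3)
```

For $\alpha$ above $\log(2)/\log(3)$ the sums tend to 0 as $k$ grows, for $\alpha$ below they tend to $\infty$, and at the critical exponent they stay equal to 1.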

Figure 13: Menger sponge, dimH = log(20)/ log(3) = 2.7268 . . ..

Real world objects, unlike ideal mathematical ones, cannot have structure
on arbitrarily small scale, but they can still be approximately self-similar across
a wide range of scales and one can estimate their Hausdorff dimension. For
example, looking at the structure of cauliflower one sees that each branch carries

about 13 smaller branches of a third the size, so its Hausdorff dimension is about
log(13)/ log(3) = 2.33 . . .. The surface of the human lung is highly folded and
has dimension about 2.97, so it almost behaves like a solid, which is useful for
absorbing as much O2 as possible from a given volume of air.

Space filling curves


This surprising example shows that the unit interval and unit square are ab-
stractly the same measure space. Even more, we can find a continuous measure
preserving map [0, 1] → [0, 1]2 , a space filling curve. The following construction
is due to Hilbert.

Figure 14: First 5 iterations ($n = 1, \dots, 5$) in the construction of the Hilbert curve.

Let $c_n : [0, 1] \to [0, 1]^2$ be the piecewise smooth curve obtained at the $n$-th step of the construction. One shows that the $c_n$ converge uniformly to a continuous map $c : [0, 1] \to [0, 1]^2$. This curve is non-rectifiable: it has no well-defined length. Indeed the lengths of $c_n$ grow like $2^n$, so we would expect $c$ to have Hausdorff dimension 2. In fact, the image of $c$ is the entire square $[0, 1]^2$!
It is perhaps intuitively clear from the picture that the image of $c$ should be dense in $[0, 1]^2$. A precise argument uses the fact that $c_n$ meets all little squares with side length $1/2^{n-1}$ in $[0, 1]^2$. But now one can appeal to the general fact that since $[0, 1]$ is compact, and $c$ continuous, the image $c([0, 1])$ is also compact, in particular closed. So if $c([0, 1])$ is dense, then it must be the entire unit square. We summarize these facts, without proof, in the following theorem.
Theorem 11.1. If c : [0, 1] → [0, 1]2 is the Hilbert curve, then
1. c is continuous and onto, but not one-to-one (although it becomes a bijec-
tion after sets of measure zero are removed from the domain and target)
2. c is measure preserving
It is not possible to find a map $c : [0, 1] \to [0, 1]^2$ which is both continuous and a bijection, as such a map would have continuous inverse, and the two spaces cannot be topologically the same: Removing one point from $[0, 1]$ (except the two endpoints) gives a disconnected space, while removing any one point from $[0, 1]^2$ still leaves the space connected. Still, the theorem shows that all $[0, 1]^n$, $n \ge 1$, are the same as measure spaces (up to sets of measure zero), so there can be no intrinsic notion of dimension in measure theory.
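
A discrete version of the Hilbert curve is easy to compute. The sketch below uses the standard bit-manipulation conversion from a position along the curve to grid coordinates. It may traverse the square in a different orientation than the figure, and the function name and grid indexing are choices made here rather than part of Hilbert's construction as presented above.

```python
def hilbert_d2xy(order, d):
    """Map the index d in [0, 4**order) to the cell (x, y) of the 2**order by 2**order grid visited at step d."""
    x = y = 0
    t = d
    s = 1
    n = 1 << order
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # rotate or reflect the sub-square so that consecutive pieces join up
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y
```

Each index interval of length $4^{-j}$ (as a fraction of the whole) is sent into a single sub-square of area $4^{-j}$, which is the discrete shadow of the measure preserving property, and consecutive cells are always neighbours, reflecting continuity.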

References and further reading
• Stein, Shakarchi: “Real Analysis”, Chapter 7, (Princeton lectures in Anal-
ysis 3)
• T. Tao: “Hausdorff dimension”,
terrytao.wordpress.com/2009/05/19/245c-notes-5-hausdorff-dimension-optional/
