
Applications of Mathematics
Stochastic Modelling and Applied Probability 42

Stochastic Mechanics · Random Media · Signal Processing and Image Synthesis ·
Mathematical Economics and Finance · Stochastic Optimization · Stochastic Control

Edited by
I. Karatzas
M. Yor

Advisory Board
P. Brémaud


E. Carlen
W. Fleming
D. Geman
G. Grimmett
G. Papanicolaou
J. Scheinkman

Springer Science+Business Media, LLC


Applications of Mathematics

1 Fleming/Rishel, Deterministic and Stochastic Optimal Control (1975)


2 Marchuk, Methods of Numerical Mathematics, Second Ed. (1982)
3 Balakrishnan, Applied Functional Analysis, Second Ed. (1981)
4 Borovkov, Stochastic Processes in Queueing Theory (1976)
5 Liptser/Shiryayev, Statistics of Random Processes I: General Theory (1977)
6 Liptser/Shiryayev, Statistics of Random Processes II: Applications (1978)
7 Vorob'ev, Game Theory: Lectures for Economists and Systems Scientists (1977)
8 Shiryayev, Optimal Stopping Rules (1978)
9 Ibragimov/Rozanov, Gaussian Random Processes (1978)
10 Wonham, Linear Multivariable Control: A Geometric Approach, Third Ed. (1985)
11 Hida, Brownian Motion (1980)
12 Hestenes, Conjugate Direction Methods in Optimization (1980)
13 Kallianpur, Stochastic Filtering Theory (1980)
14 Krylov, Controlled Diffusion Processes (1980)
15 Prabhu, Stochastic Storage Processes: Queues, Insurance Risk, Dams, and Data Communication, Second Ed. (1998)
16 Ibragimov/Has'minskii, Statistical Estimation: Asymptotic Theory (1981)
17 Cesari, Optimization: Theory and Applications (1982)
18 Elliott, Stochastic Calculus and Applications (1982)
19 Marchuk/Shaidourov, Difference Methods and Their Extrapolations (1983)
20 Hijab, Stabilization of Control Systems (1986)
21 Protter, Stochastic Integration and Differential Equations (1990)
22 Benveniste/Métivier/Priouret, Adaptive Algorithms and Stochastic Approximations (1990)
23 Kloeden/Platen, Numerical Solution of Stochastic Differential Equations (1992)
24 Kushner/Dupuis, Numerical Methods for Stochastic Control Problems in Continuous Time (1992)
25 Fleming/Soner, Controlled Markov Processes and Viscosity Solutions (1993)
26 Baccelli/Brémaud, Elements of Queueing Theory (1994)
27 Winkler, Image Analysis, Random Fields, and Dynamic Monte Carlo Methods: An Introduction to Mathematical Aspects (1994)
28 Kalpazidou, Cycle Representations of Markov Processes (1995)
29 Elliott/Aggoun/Moore, Hidden Markov Models: Estimation and Control (1995)
30 Hernández-Lerma/Lasserre, Discrete-Time Markov Control Processes: Basic Optimality Criteria (1996)
31 Devroye/Györfi/Lugosi, A Probabilistic Theory of Pattern Recognition (1996)
32 Maitra/Sudderth, Discrete Gambling and Stochastic Games (1996)

(continued after index)


Onésimo Hernández-Lerma
Jean Bernard Lasserre

Further Topics on
Discrete-Time Markov
Control Processes

Springer
Onésimo Hernández-Lerma
CINVESTAV-IPN
Departamento de Matemáticas
Apartado Postal 14-740
07000 México DF, Mexico
ohernand@math.cinvestav.mx

Jean Bernard Lasserre
LAAS-CNRS
7 Av. du Colonel Roche
31077 Toulouse Cedex, France
lasserre@laas.fr

Managing Editors
I. Karatzas
Departments of Mathematics and Statistics
Columbia University
New York, NY 10027, USA

M. Yor
CNRS, Laboratoire de Probabilites
Universite Pierre et Marie Curie
4, Place Jussieu, Tour 56
F-75252 Paris Cedex 05, France

Mathematics Subject Classification (1991): 49L20, 90C39, 90C40, 93-02, 93E20

Library of Congress Cataloging-in-Publication Data


Hernández-Lerma, O. (Onésimo)
Further topics on discrete-time Markov control processes / Onésimo Hernández-Lerma, Jean B. Lasserre.
p. cm. -- (Applications of mathematics; 42)
Includes bibliographical references and index.
ISBN 978-1-4612-6818-5    ISBN 978-1-4612-0561-6 (eBook)
DOI 10.1007/978-1-4612-0561-6
1. Markov processes. 2. Discrete-time systems. 3. Control theory. I. Lasserre, Jean Bernard, 1953- . II. Title. III. Series.
QA274.7.H473 1999
519.2'33--dc21 99-12351

Printed on acid-free paper.

© 1999 Springer Science+Business Media New York


Originally published by Springer-Verlag New York Berlin Heidelberg in 1999
Softcover reprint of the hardcover 1st edition 1999
All rights reserved. This work may not be translated or copied in whole or in part without the
written permission of the publisher (Springer Science+Business Media, LLC), except for brief
excerpts in connection with reviews or scholarly analysis. Use in connection with any form of
information storage and retrieval, electronic adaptation, computer software, or by similar or
dissimilar methodology now known or hereafter developed is forbidden.
The use of general descriptive names, trade names, trademarks, etc., in this publication, even if
the former are not especially identified, is not to be taken as a sign that such names, as
understood by the Trade Marks and Merchandise Marks Act, may accordingly be used freely
by anyone.
Production managed by Frank McGuckin; manufacturing supervised by Jacqui Ashri.
Photocomposed copy prepared from the authors' LaTeX files.
9 8 7 6 5 4 3 2 1

ISBN 978-1-4612-6818-5 SPIN 10707298


For Marina, Adrian, Gerardo, and Andres
To Julia and Marine
Preface

This book presents the second part of a two-volume series devoted to a sys-
tematic exposition of some recent developments in the theory of discrete-
time Markov control processes (MCPs). As in the first part, hereafter re-
ferred to as "Volume I" (see Hernandez-Lerma and Lasserre [1]), interest is
mainly confined to MCPs with Borel state and control spaces, and possibly
unbounded costs. However, an important feature of the present volume is
that it is essentially self-contained and can be read independently of Volume
I. The reason for this independence is that even though both volumes deal
with similar classes of MCPs, the assumptions on the control models are
usually different. For instance, Volume I deals only with nonnegative cost-
per-stage functions, whereas in the present volume we allow cost functions
to take positive or negative values, as needed in some applications. Thus,
many results in Volume I on, say, discounted or average cost problems are
not applicable to the models considered here.
On the other hand, we now consider control models that typically re-
quire more restrictive classes of control-constraint sets and/or transition
laws. This loss of generality is, of course, deliberate because it allows us
to obtain more "precise" results. For example, in a very general context,
in §4.2 of Volume I we showed the convergence of the value iteration (VI)
procedure for discounted-cost MCPs, whereas now, in a somewhat more
restricted setting, we actually get a lot more information on the VI proce-
dure, such as the rate of convergence (§8.3), which in turn is used to study
"rolling horizon" procedures, as well as the existence of "forecast horizons",
and criteria for the elimination of nonoptimal control actions. Similarly, in
Chapter 10 and Chapter 11, which deal with average cost problems, we

obtain many interesting results that can hardly be obtained in a


context as general as that of Volume I. In the Introduction of each chapter
dealing with problems already studied in Volume I we clearly spell out the
difference between the corresponding settings in each volume.
Volume I comprises Chapter 1 to Chapter 6, and the present volume
contains Chapter 7 to Chapter 12. Chapter 7 introduces background ma-
terial on weighted-norm spaces of functions and spaces of measures, and
on noncontrolled Markov chains. In particular, it introduces the concept
of w-geometric ergodicity of Markov chains, where w is a given "weight"
(or bounding) function, and also the Poisson equation associated to a tran-
sition kernel and a given "charge". Chapter 8 studies a-discounted cost
(abbreviated a-DC or simply DC) MCPs. The basic idea is to give condi-
tions under which the dynamic programming (DP) operator is a contraction
with respect to a suitable weighted norm. This contraction property is used
to obtain the a-discount optimality (or DP-equation), and to make a de-
tailed analysis of the VI procedure. In Chapter 9 we turn our attention to
the expected total cost (ETC) criterion. Conditions are given for the ETC
function to be well defined (as an extended-real-valued measurable func-
tion) and for the existence of ETC-optimal control policies, among other
things. A quite complete analysis of the so-called "transient" case is also
included (§9.6). Chapter 10 deals with undiscounted cost criteria, from av-
erage cost (AC) problems to overtaking optimality, passing through the
existence of canonical policies, bias-optimal policies, Flynn's opportunity
cost, and other "intermediate" criteria. AC problems are also dealt with in
Chapters 11 and 12, but from very different viewpoints. Chapter 11 studies
sample path optimality and variance minimization, essentially using proba-
bilistic methods, while Chapter 12 concerns the expected AC problem using
the linear programming approach. Chapter 12 includes in particular a pro-
cedure to approximate by finite linear programs the infinite-dimensional
AC-related linear programs.
Acknowledgments. We wish to thank the Consejo Nacional de Ciencia y
Tecnología (CONACYT, Mexico) and the Centre National de la Recherche
Scientifique (CNRS, France) for their generous support of our research work
through the CONACYT-CNRS Scientific Cooperation Program. The work
of the first author (OHL) has also been partially supported by CONACYT
grant 3115P-E9608 and the Sistema Nacional de Investigadores (SNI).
Thanks are also due to Fernando Luque-Vasquez and Oscar Vega-Amaya for
helpful comments on several chapters, and to Gerardo Hernandez del Valle
for his efficient typing of the manuscript.

March 1999 Onesimo Hernandez-Lerma


Jean Bernard Lasserre
Contents

Preface ................. . vii

7 Ergodicity and Poisson's Equation 1


7.1 Introduction .. 1
7.2 Weighted norms and signed kernels 2
A. Weighted-norm spaces 2
B. Signed kernels 3
C. Contraction maps 5
7.3 Recurrence concepts 7
A. Irreducibility and recurrence 7
B. Invariant measures 8
C. Conditions for irreducibility and recurrence 9
D. w-Geometric ergodicity 11
7.4 Examples on w-geometric ergodicity 17
7.5 Poisson's equation 24
A. The multichain case 26
B. The unichain P.E. 31
C. Examples 34

8 Discounted Dynamic Programming with Weighted Norms 39


8.1 Introduction . 39
8.2 The control model and control policies 40
8.3 The optimality equation 43
A. Assumptions 44
B. The discounted-cost optimality equation 47

C. The dynamic programming operator 48


D. Proof of Theorem 8.3.6 . . . . . 51
8.4 Further analysis of value iteration. . . . 55
A. Asymptotic discount optimality . 56
B. Estimates of VI convergence. . . 57
C. Rolling horizon procedures . . . 58
D. Forecast horizons and elimination
of non-optimal actions 59
8.5 The weakly continuous case 65
8.6 Examples . . . . 68
8.7 Further remarks . . . . . . 73

9 The Expected Total Cost Criterion 75


9.1 Introduction . . . . . . . . . . . 75
9.2 Preliminaries . . . . . . . . . . 76
A. Extended real numbers 76
B. Integrability. . . 78
9.3 The expected total cost . . . . 79
9.4 Occupation measures . . . . . . 84
A. Expected occupation measures 85
B. The sufficiency problem 88
9.5 The optimality equation . . . . 93
A. The optimality equation 93
B. Optimality criteria . . . 95
C. Deterministic stationary policies 100
9.6 The transient case . . . . . . 103
A. Transient models . . . . . . . . . 103
B. Optimality conditions . . . . . . 109
C. Reduction to deterministic policies 110
D. The policy iteration algorithm 113

10 Undiscounted Cost Criteria 117


10.1 Introduction . . . . . . . . . 117
A. Undiscounted criteria 117
B. AC criteria . . . . . . 119
C. Outline of the chapter 120
10.2 Preliminaries . . . . 120
A. Assumptions 121
B. Corollaries 123
C. Discussion .. 124
10.3 From AC-optimality to undiscounted criteria 126
A. The AC optimality inequality 128
B. The AC optimality equation. 129
C. Uniqueness of the ACOE 131
D. Bias-optimal policies . . . . . 132

E. Undiscounted criteria 135


10.4 Proof of Theorem 10.3.1 . . . 137
A. Preliminary lemmas . 137
B. Completion of the proof 141
10.5 Proof of Theorem 10.3.6 143
A. Proof of part (a) 143
B. Proof of part (b) 146
C. Policy iteration . 146
10.6 Proof of Theorem 10.3.7 149
10.7 Proof of Theorem 10.3.10 150
10.8 Proof of Theorem 10.3.11 154
10.9 Examples . . . . . . . . . 156

11 Sample Path Average Cost 163


11.1 Introduction . . . . . . . . 163
A. Definitions . . . . 163
B. Outline of the chapter 166
11.2 Preliminaries . . . . . . . . . 167
A. Positive Harris recurrence 167
B. Limiting average variance 168
11.3 The w-geometrically ergodic case 175
A. Optimality in lIDS . . 177
B. Optimality in II . . . . . 178
C. Variance minimization .. 180
D. Proof of Theorem 11.3.5 . 181
E. Proof of Theorem 11.3.8 . 185
11.4 Strictly unbounded costs. 188
11.5 Examples . . . . . . . . . . . . . 196

12 The Linear Programming Approach 203


12.1 Introduction . . . . . . . . . . 203
A. Outline of the chapter . . . 204
12.2 Preliminaries . . . . . . . . . . . . 205
A. Dual pairs of vector spaces 205
B. Infinite linear programming 212
C. Approximation of linear programs 214
D. Tightness and invariant measures . 215
12.3 Linear programs for the AC problem 218
A. The linear programs . . 219
B. Solvability of (P) . . . . 222
C. Absence of duality gap . 224
D. The Farkas alternative. 226
12.4 Approximating sequences and strong duality. 233
A. Minimizing sequences for (P) . 233
B. Maximizing sequences for (P*) . . . . 234

12.5 Finite LP approximations 238


A. Aggregation. . . . 238
B. Aggregation-relaxation. 239
C. Aggregation-relaxation-inner approximations 240
12.6 Proof of Theorems 12.5.3, 12.5.5, 12.5.7 . . . . . . 242

References 251

Abbreviations 263

Glossary of notation 265

Index 271
Contents of Volume I

Chapter 1 Introduction and summary

Chapter 2 Markov control processes

Chapter 3 Finite-horizon problems

Chapter 4 Infinite-horizon discounted cost problems

Chapter 5 Long-run average-cost problems

Chapter 6 The linear programming formulation

Appendices

A. Miscellaneous results

B. Conditional expectation

C. Stochastic kernels
D. Multifunctions and selectors
E. Convergence of probability measures
7
Ergodicity and
Poisson's Equation

7.1 Introduction
This chapter deals with noncontrolled Markov chains and presents impor-
tant background material used in later chapters. The reader may omit it
and refer to it as needed.
There are in particular two key concepts we wish to arrive at in this
chapter, and which will be used to study several classes of Markov control
problems. One is the concept of w-geometric ergodicity with respect to some
weight function w, and the other is the Poisson equation (P.E.), which can
be seen as a special case of the Average-Cost Optimality Equation. The
former is introduced in §7.3.D and the latter in §7.5. First we introduce,
in §7.2, some general notions on weighted norms and signed kernels, and
then, in §§7.3.A-7.3.C, we review some standard results from Markov chain
theory. The latter results are presented without proofs; they are introduced
here mainly for ease of reference. Finally, §§7.4 and 7.5.C contain some ex-
amples on w-geometric ergodicity and the P.E., respectively.
Throughout the following, X denotes a Borel space (that is, a Borel subset
of a complete and separable metric space), unless explicitly stated other-
wise. Its Borel σ-algebra is denoted by B(X).

7.2 Weighted norms and signed kernels


Let X be a Borel space, and let 𝔹(X) be the Banach space of real-valued
bounded measurable functions u on X, with the sup (or supremum) norm

    ‖u‖ := sup_{x∈X} |u(x)|.

A. Weighted-norm spaces

We assume throughout that w : X → [1, ∞) denotes a given measurable
function that will be referred to as a weight function (some authors call
it a bounding function). If u is a real-valued function on X we define its
w-norm as

    ‖u‖_w := ‖u/w‖ = sup_{x∈X} |u(x)|/w(x).    (7.2.1)

Of course, if w is the constant function identically equal to 1, w(·) ≡ 1, the
w-norm and the sup norm coincide.
A function u is said to be bounded if ‖u‖ < ∞ and w-bounded if
‖u‖_w < ∞. In general, the weight function w will be unbounded, although
it is obviously w-bounded since ‖w‖_w = 1. On the other hand, if u is a
bounded function then it is w-bounded, as w ≥ 1 yields

    ‖u‖_w ≤ ‖u‖ < ∞  for all u ∈ 𝔹(X).    (7.2.2)


Let 𝔹_w(X) be the normed linear space of w-bounded measurable functions
u on X. This space is also complete because if {u_n} is a Cauchy sequence
in the w-norm, then {u_n/w} is Cauchy in the sup norm; hence, as 𝔹(X)
is a Banach space, one can deduce the existence of a function u in 𝔹_w(X)
that is the w-limit of {u_n}. Combining this fact with (7.2.2) we obtain:
7.2.1 Proposition. 𝔹_w(X) is a Banach space that contains 𝔹(X).
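To make (7.2.1) concrete, here is a minimal numerical sketch (illustrative, not from the book) on a finite grid, where the supremum becomes a maximum. The particular weight w(x) = 1 + x² and function u(x) = 3x are chosen only for illustration: u is unbounded, hence not in 𝔹(X), but it is w-bounded.

```python
# Illustrative sketch of the w-norm (7.2.1): ||u||_w = sup_x |u(x)|/w(x).
# On a finite grid the sup is a max.
def w_norm(u, w, xs):
    """||u||_w = max over the grid of |u(x)| / w(x)."""
    return max(abs(u(x)) / w(x) for x in xs)

xs = range(0, 51)
w = lambda x: 1.0 + x * x      # an unbounded weight function, w >= 1
u = lambda x: 3.0 * x          # unbounded, so u is not in B(X) ...
# ... but u is w-bounded: |u(x)|/w(x) = 3x/(1+x^2) <= 3/2, attained at x = 1
print(w_norm(u, w, xs))        # 1.5
print(w_norm(w, w, xs))        # ||w||_w = 1, as noted in the text
```

This also illustrates the remark above: the weight w itself always has ‖w‖_w = 1.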
Consider now the Banach space M(X) of finite signed measures μ on
B(X) endowed with the total variation norm

    ‖μ‖_TV := sup_{‖u‖≤1} |∫_X u dμ| = |μ|(X),    (7.2.3)

where |μ| = μ⁺ + μ⁻ denotes the total variation of μ, and μ⁺, μ⁻ stand for
the positive and negative parts of μ, respectively. By analogy, the w-norm
of μ is defined by

    ‖μ‖_w := sup_{‖u‖_w ≤ 1} |∫_X u dμ| = ∫_X w d|μ|,    (7.2.4)

which reduces to (7.2.3) when w(·) ≡ 1. Moreover, as w ≥ 1, we obtain

    ‖μ‖_TV ≤ ‖μ‖_w.

The latter inequality can be used to show that the normed linear space
M_w(X) of finite signed measures with a finite w-norm is a Banach space.
Summarizing:
7.2.2 Proposition. M_w(X) is a Banach space and it is contained in
M(X).
Remark. (a) In Chapter 6 we have already seen a particular class of
weighted norms; see Definitions 6.3.2 and 6.3.4.
(b) 𝔹_w(X) and M_w(X) are in fact ordered Banach spaces. Namely, as
usual, for functions in 𝔹_w(X) "u ≤ v" means u(x) ≤ v(x) for all x in X,
and for measures "μ ≤ ν" means μ(B) ≤ ν(B) for all B in B(X).

B. Signed kernels

7.2.3 Definition. (a) Let X and Y be two Borel spaces. A stochastic
kernel on X given Y is a function Q = {Q(B|y) : B ∈ B(X), y ∈ Y} such
that
(i) Q(·|y) is a p.m. (probability measure) on B(X) for every fixed y ∈ Y,
and
(ii) Q(B|·) is a measurable function on Y for every fixed B ∈ B(X).
If condition (i) is replaced by
(i)' Q(·|y) is a finite signed measure on B(X) for every fixed y ∈ Y,
then Q is called a signed kernel on X given Y.
(b) Suppose that X = Y. If Q satisfies (i) and (ii) it is then called a
stochastic kernel (or transition probability function) on X, whereas
if it satisfies (i)' and (ii) it is said to be a signed kernel on X.
Signed kernels typically appear as a difference P₁ − P₂ of stochastic ker-
nels.
In the remainder of this section we consider the case X = Y. Moreover, a
measure μ in M_w(X) will be identified with the "constant" kernel Q(·|x) ≡
μ(·).
A signed kernel Q on X defines linear maps u ↦ Qu on 𝔹_w(X) and
μ ↦ μQ on M_w(X) as follows:

    Qu(x) := ∫_X u(y) Q(dy|x),  u ∈ 𝔹_w(X), x ∈ X,    (7.2.5)

and

    μQ(B) := ∫_X Q(B|x) μ(dx),  μ ∈ M_w(X), B ∈ B(X).    (7.2.6)

7.2.4 Remark. If Q is the transition probability function of a time-homogeneous
X-valued Markov chain {x_t, t = 0, 1, ...}, so that

    Q(B|x) = Prob(x_{t+1} ∈ B | x_t = x)  for all t = 0, 1, ..., B ∈ B(X), x ∈ X,

then Qu(x) in (7.2.5) is the conditional expectation of u(x_{t+1}) given x_t = x,
that is,

    Qu(x) = E[u(x_{t+1}) | x_t = x],  t = 0, 1, ....

On the other hand, if μ is a p.m. that denotes the distribution of x_t, i.e.,

    μ(B) = Prob(x_t ∈ B),  B ∈ B(X),

then μQ in (7.2.6) denotes the distribution of x_{t+1}:

    μQ(B) = Prob(x_{t+1} ∈ B),  B ∈ B(X). ∎
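On a finite state space the maps (7.2.5) and (7.2.6) reduce to matrix-vector products, which makes Remark 7.2.4 easy to check by hand. The following sketch is illustrative, not from the book; the three-state matrix Q is an arbitrary example.

```python
# Finite-state sketch of (7.2.5)-(7.2.6): with states {0,1,2} and a
# stochastic matrix Q, u -> Qu is the expectation of u one step ahead,
# and mu -> muQ is the distribution of the next state.
Q = [[0.5, 0.5, 0.0],
     [0.2, 0.3, 0.5],
     [0.0, 0.4, 0.6]]

def Qu(Q, u):
    # (Qu)(x) = sum_y u(y) Q(y|x): conditional expectation of u(x_{t+1})
    return [sum(row[y] * u[y] for y in range(len(u))) for row in Q]

def muQ(mu, Q):
    # (muQ)(B) = sum_x Q(B|x) mu(x): the law of x_{t+1} when x_t ~ mu
    n = len(Q)
    return [sum(mu[x] * Q[x][y] for x in range(n)) for y in range(n)]

u = [1.0, 0.0, 2.0]
mu = [1.0, 0.0, 0.0]           # chain started at state 0
print(Qu(Q, u))                # [0.5, 1.2, 1.2]
print(muQ(mu, Q))              # [0.5, 0.5, 0.0]
```

Note that functions are acted on from the left (Qu) and measures from the right (μQ), exactly as in the text.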

Let Q be a signed kernel on X. To ensure that (7.2.5) and (7.2.6) define
operators with values in 𝔹_w(X) and M_w(X), respectively, we shall as-
sume that Q has a finite w-norm, which is defined (as usual for "operator
norms") by

    ‖Q‖_w := sup{‖Qu‖_w : ‖u‖_w ≤ 1 (i.e., |u| ≤ w)}.    (7.2.7)

We can write the w-norm of Q in several equivalent forms. For instance,
by (7.2.5) and (7.2.1),

    ‖Qu‖_w = sup_x w(x)⁻¹ |Qu(x)|,

which combined with (7.2.4) yields

    ‖Q‖_w = sup_x w(x)⁻¹ ‖Q(·|x)‖_w = sup_x w(x)⁻¹ ∫_X w(y) |Q(dy|x)|.    (7.2.8)

On the other hand, replacing μ in (7.2.6) and (7.2.4) by the Dirac measure
δ_x at the point x ∈ X [that is, δ_x(B) := 1 if x ∈ B and := 0 otherwise], we
see that

    δ_x Q(·) = Q(·|x)  and  ‖δ_x‖_w = w(x).    (7.2.9)

Then a direct calculation shows that (7.2.7) can also be written in the
following equivalent form using measures μ in M_w(X):

    ‖Q‖_w = sup{‖μQ‖_w : ‖μ‖_w ≤ 1}.    (7.2.10)
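Formula (7.2.8) is the form that is easiest to evaluate directly: it is a weighted row-sum computation. A finite-state sketch (illustrative, not from the book; the weight vector and kernel are arbitrary examples):

```python
# Sketch of formula (7.2.8) on a finite state space:
# ||Q||_w = max_x w(x)^{-1} * sum_y w(y) |Q(y|x)|.
def kernel_w_norm(Q, w):
    n = len(Q)
    return max(sum(w[y] * abs(Q[x][y]) for y in range(n)) / w[x]
               for x in range(n))

w = [1.0, 2.0, 4.0]            # weight function, w >= 1
P = [[0.5, 0.5, 0.0],          # a stochastic kernel
     [0.5, 0.0, 0.5],
     [0.0, 0.5, 0.5]]
print(kernel_w_norm(P, w))     # 1.5
```

For a stochastic kernel the w-norm is always at least 1 (take u = w in (7.2.7) when w is bounded on the support), and it can exceed 1, as here; this is why the contraction arguments below work with ‖Q‖_w rather than with the total variation norm.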

Moreover, the usual arguments for "operator norms" yield:

7.2.5 Proposition. Let Q be a signed kernel with a finite w-norm, i.e.,

    ‖Q‖_w < ∞.    (7.2.11)

Then

    ‖Qu‖_w ≤ ‖Q‖_w ‖u‖_w  for all u ∈ 𝔹_w(X),    (7.2.12)

and, therefore, (7.2.5) defines a linear map from 𝔹_w(X) into itself. Simi-
larly, (7.2.6) defines a linear map from M_w(X) into itself, since

    ‖μQ‖_w ≤ ‖Q‖_w ‖μ‖_w  for all μ ∈ M_w(X).    (7.2.13)

Finally, we consider the product (or composition) of two signed kernels
Q and R:

    (QR)(B|x) := ∫_X R(B|y) Q(dy|x),  B ∈ B(X), x ∈ X.    (7.2.14)

Further, if Q⁰ denotes the identity kernel given by

    Q⁰(B|x) := δ_x(B),  B ∈ B(X), x ∈ X,

we define Qⁿ recursively as

    Qⁿ := Q Qⁿ⁻¹,  n = 1, 2, ....    (7.2.15)

Then, from Proposition 7.2.5 and standard results for bounded linear op-
erators, we obtain:
7.2.6 Proposition. If Q and R are signed kernels with finite w-norms,
then

    ‖QR‖_w ≤ ‖Q‖_w ‖R‖_w.    (7.2.16)

In particular, ‖Qⁿ‖_w ≤ (‖Q‖_w)ⁿ for all n = 0, 1, ....

C. Contraction maps

7.2.7 Definition. Let (S, d) be a metric space. A map T : S → S is called
(a) nonexpansive if d(Ts₁, Ts₂) ≤ d(s₁, s₂) for all s₁, s₂ in S; and
(b) a contraction if there is a number 0 ≤ γ < 1 such that

    d(Ts₁, Ts₂) ≤ γ d(s₁, s₂)  for all s₁, s₂ in S.

In this case, γ is called the modulus of T.
Further, a point s* in S is said to be a fixed point of T if Ts* = s*.
Many applications of contraction maps are based on the following theo-
rem of Banach.
7.2.8 Proposition. (Banach's Fixed Point Theorem.) A contraction
map T on a complete metric space (S, d) has a unique fixed point s*. More-
over,

    d(Tⁿs, s*) ≤ γⁿ d(s, s*)  for all s ∈ S, n = 0, 1, ...,    (7.2.17)

where γ is the modulus of T, and Tⁿ := T(Tⁿ⁻¹) for n = 1, 2, ..., with
T⁰ := identity.
Proof. See Ross [1] or Luenberger [1], for instance. ∎
As an example, the maps defined by (7.2.5) and (7.2.6) are both nonex-
pansive if ‖Q‖_w = 1, and they are contractions with modulus γ := ‖Q‖_w
if

    ‖Q‖_w < 1.    (7.2.18)

This follows from (7.2.12) and (7.2.13). For instance, viewing Q as a map
on 𝔹_w(X), (7.2.12) gives

    ‖Qu − Qv‖_w = ‖Q(u − v)‖_w ≤ ‖Q‖_w ‖u − v‖_w,

and similarly for Q on M_w(X).
If T is a contraction map on M_w(X), with modulus γ, then Proposition
7.2.8 ensures the existence of a unique fixed point μ ∈ M_w(X) of T and
that

    ‖Tⁿν − μ‖_w ≤ γⁿ ‖ν − μ‖_w  for all ν ∈ M_w(X), n = 0, 1, ....    (7.2.19)
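Banach's theorem is constructive: iterating any contraction converges geometrically to the fixed point. The following sketch (illustrative, not from the book; the cost vector c, discount factor α, and matrix P are arbitrary choices) iterates an affine map of the kind that reappears as a dynamic programming operator in Chapter 8.

```python
# Banach's fixed point theorem in action: with alpha < 1 and P
# stochastic, the affine map T(u) = c + alpha * P u is a contraction
# on the bounded functions (sup norm) with modulus alpha, so T^n u
# converges geometrically to the unique u* with T u* = u*.
alpha = 0.5
c = [1.0, 2.0]
P = [[0.9, 0.1],
     [0.4, 0.6]]

def T(u):
    return [c[x] + alpha * sum(P[x][y] * u[y] for y in range(2))
            for x in range(2)]

u = [0.0, 0.0]
for _ in range(60):            # iterate: u <- T u
    u = T(u)
residual = max(abs(T(u)[x] - u[x]) for x in range(2))
print(u, residual)             # residual ~ 0: u is (numerically) fixed
```

By (7.2.17) the error after n iterations is at most γⁿ times the initial error, so 60 iterations with γ = 0.5 leave an error of order 2⁻⁶⁰.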

This property is closely related to the w-geometric ergodicity introduced
in Definition 7.3.9 below. On the other hand, in later chapters we will see
that dynamic programming (DP) operators on 𝔹_w(X), under suitable as-
sumptions, turn out to be contractions. In the latter case, we will use the
fact that typically a DP operator T is monotone, that is, u ≤ v implies
Tu ≤ Tv, and so we can apply the following criterion for T to be a con-
traction.
7.2.9 Proposition. Let T be a monotone map from 𝔹_w(X) into itself. If
there is a positive number γ < 1 such that

    T(u + rw) ≤ Tu + γrw  for all u ∈ 𝔹_w(X), r ≥ 0,    (7.2.20)

then T is a contraction with modulus γ.

Proof. For any two functions u and v in 𝔹_w(X) we have u ≤ v + w‖u − v‖_w.
Thus, the monotonicity of T and (7.2.20) with r = ‖u − v‖_w yield

    Tu ≤ T(v + rw) ≤ Tv + γrw,

i.e.,

    Tu − Tv ≤ γw ‖u − v‖_w.

If we now interchange u and v we get Tu − Tv ≥ −γw ‖u − v‖_w, so that

    |Tu − Tv| ≤ γw ‖u − v‖_w.

Hence, ‖Tu − Tv‖_w ≤ γ ‖u − v‖_w. ∎

Notes on §7.2

1. Since the material for this chapter comes from several sources, to
avoid confusion with the related literature it is important to keep in mind the
equivalence of the several forms (7.2.7), (7.2.8) and (7.2.10) for the w-norm
of Q. For instance, Kartashov [1]-[5] uses (7.2.10), while, say, Meyn and
Tweedie [1] use (7.2.7).
2. For Markov control processes with a nonnegative cost function c, in
Chapter 6 we used a weight function of the form w := 1 + c.

7.3 Recurrence concepts


In this section, {x_t, t = 0, 1, ...} denotes a time-homogeneous, X-valued
Markov chain with transition probability function (or stochastic kernel)
P(B|x), i.e.,

    P(B|x) = Prob(x_{t+1} ∈ B | x_t = x)  for all t = 0, 1, ..., B ∈ B(X), x ∈ X.

See Remark 7.2.4 (with Q = P) for the meaning of Pu(x) and μP(B).
All of the concepts and results presented in subsections A, B, C are well
known in Markov chain theory and will be stated without proofs. General
references are the books by Duflo [1], Meyn and Tweedie [1], and Nummelin
[1].
A. Irreducibility and recurrence

For any Borel set B ∈ B(X) we define the hitting time

    τ_B := inf{t ≥ 1 : x_t ∈ B}

and the occupation time

    η_B := Σ_{t=1}^∞ 1_B(x_t),

where 1_B denotes the indicator function of B. Moreover, given an initial
state x₀ = x we define

    L(x, B) := P_x(τ_B < ∞) = P_x(x_t ∈ B for some t ≥ 1),

and

    U(x, B) := Σ_{t=1}^∞ P^t(B|x),

where P^t(B|x) = P_x(x_t ∈ B) denotes the t-step transition probability.
[This can be defined recursively as in (7.2.15).]
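For a finite chain the quantity U(x, B) can be evaluated directly by truncating the sum, which makes the transient/recurrent dichotomy below visible. A small deterministic sketch (illustrative, not from the book; the two-state kernel is an arbitrary example with one absorbing state):

```python
# Evaluating U(x, B) = sum_{t>=1} P^t(B|x) by truncating the sum.
# State 0 leaks to the absorbing state 1, so U(0, {0}) is finite
# (the set {0} is uniformly transient), while U(1, {1}) diverges.
P = [[0.5, 0.5],
     [0.0, 1.0]]

def U(x, B, T):
    n = len(P)
    row = [1.0 if j == x else 0.0 for j in range(n)]   # law of x_0 = x
    total = 0.0
    for _ in range(T):
        row = [sum(row[i] * P[i][j] for i in range(n)) for j in range(n)]
        total += sum(row[j] for j in B)
    return total

print(U(0, {0}, 200))   # ~1.0 = sum_{t>=1} 0.5^t
print(U(1, {1}, 200))   # grows linearly with the truncation horizon
```

Here U(0, {0}) = Σ_{t≥1} 0.5^t = 1, while the partial sums of U(1, {1}) are unbounded, matching Definition 7.3.2 below.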
7.3.1 Definition. The Markov chain {x_t} (or the corresponding stochastic
kernel P) is called
(a) λ-irreducible if there exists a σ-finite measure λ on B(X) such that
L(x, B) > 0 for all x ∈ X whenever λ(B) > 0, and in this case λ is
called an irreducibility measure;
(b) Harris recurrent if there exists a σ-finite measure μ on B(X) such
that L(x, B) = 1 for all x ∈ X whenever μ(B) > 0.
Each of the following two conditions is equivalent to λ-irreducibility:
(a1) U(x, B) > 0 for all x ∈ X whenever λ(B) > 0.
(a2) If λ(B) > 0, then for each x ∈ X there exists some n = n(B, x) > 0
such that Pⁿ(B|x) > 0.
If P is λ-irreducible, then P is λ′-irreducible for any measure λ′ equiv-
alent to λ. Moreover, if P is λ-irreducible, then there exists a maximal
irreducibility measure ψ, that is, an irreducibility measure ψ such that, for
any measure λ′, the chain is λ′-irreducible if and only if λ′ is absolutely
continuous with respect to ψ (in symbols: λ′ ≪ ψ). We shall assume that
the measure λ in Definition 7.3.1(a) is a maximal irreducibility measure,
and we shall denote by B(X)⁺ the family of sets B ∈ B(X) for which
λ(B) > 0.
We next define another important notion of recurrence, weaker than Har-
ris recurrence.
7.3.2 Definition.
(a) A set B ∈ B(X) is called recurrent if U(x, B) = ∞ for all x ∈ B, and
the Markov chain (or the stochastic kernel P) is said to be recurrent
if it is λ-irreducible and every set in B(X)⁺ is recurrent.
(b) We call a set B ∈ B(X) uniformly transient if there exists a constant
M such that U(x, B) ≤ M for all x ∈ B, and simply transient if it
can be covered by countably many uniformly transient sets. Moreover,
the chain itself is said to be transient if the state space X is a
transient set.
There is a dichotomy between recurrence and transience in the following
sense.
7.3.3 Theorem. If {x_t} is λ-irreducible, then it is either recurrent or
transient.
B. Invariant measures

Recurrence is closely related to the existence of invariant measures. A
σ-finite measure μ on B(X) is said to be invariant for {x_t} (or for P) if

    μ(B) = ∫_X P(B|x) μ(dx)  for all B ∈ B(X).    (7.3.1)

Extending the notation (7.2.6) to σ-finite measures, we may rewrite (7.3.1)
as

    μP = μ.    (7.3.2)

Thus μ is an invariant measure if it is a "fixed point" of P.
The following theorem summarizes some relations between recurrence
and invariant measures. We shall use the abbreviation i.p.m. for invariant
probability measure, and (sometimes) we write μ(v) for an integral ∫ v dμ.
In chapters dealing with linear programming problems we usually write
integrals ∫ v dμ as an "inner product" ⟨μ, v⟩, so we have

    μ(v) = ⟨μ, v⟩ = ∫ v dμ.
7.3.4 Theorem.

(a) If P is recurrent (in particular, Harris recurrent), then there exists a
nontrivial invariant measure μ, which is unique up to a multiplicative
constant.

(b) If there exists an i.p.m. μ for P and P is λ-irreducible, then P is
recurrent; hence, by (a), μ is the unique i.p.m. for P.

(c) Suppose that P is Harris recurrent and let μ be as in (a). Then there
exists a triplet (n, ν, l) consisting of an integer n ≥ 1, a p.m. ν, and
a measurable function 0 ≤ l ≤ 1 such that:

(c1) Pⁿ(B|x) ≥ l(x) ν(B) for all B ∈ B(X) and x ∈ X;
(c2) ν(l) > 0; and
(c3) 0 < μ(l) < ∞.

When n = 1 in Theorem 7.3.4(c), the conditions (c1)-(c3) are related to
w-geometric ergodicity. (See Theorem 7.3.11.)
If the Markov chain is (Harris) recurrent and the invariant measure μ
in Theorem 7.3.4(a) is finite, then the chain is called positive (Harris)
recurrent; otherwise it is called null.

C. Conditions for irreducibility and recurrence

We next introduce some concepts that will be used to give conditions for
irreducibility and recurrence of Markov chains.
Let 𝔹(X) be, as in §7.2, the Banach space of bounded measurable func-
tions on X endowed with the sup norm, and let C_b(X) be the subspace of
continuous bounded functions. We shall use the notation Pu as in (7.2.5)
with Q = P.
7.3.5 Definition. The Markov chain {x_t} (or its stochastic kernel P) is
said to satisfy the weak Feller property if P leaves C_b(X) invariant, i.e.,

    Pu ∈ C_b(X)  for all u ∈ C_b(X),

and the strong Feller property if P maps 𝔹(X) into C_b(X), i.e.,

    Pu ∈ C_b(X)  for all u ∈ 𝔹(X).

The weak Feller property has important implications but, unfortunately,
many results for MCPs (Markov control processes) require the much more
restrictive strong Feller condition. To mitigate this fact, we shall sometimes
replace the latter condition by another which is easier to verify. First we
need some notation.
Let q = {q(t), t = 0, 1, ...} be a probability distribution on the set ℕ₊
of nonnegative integers and define the stochastic kernel

    K_q(B|x) := Σ_{t=0}^∞ q(t) P^t(B|x),  x ∈ X, B ∈ B(X).    (7.3.3)

q is referred to as a sampling distribution. Note that if β is a dis-
crete random variable with values in ℕ₊ and β has distribution q, then
K_q(B|x) = P_x(x_β ∈ B). A usual choice for q is the geometric distribution

    q(t) := (1 − α)α^t,  t = 0, 1, ...;  0 < α < 1,

in which case K_q corresponds to the resolvent of P.
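The resolvent smooths out features of P such as periodicity, which is why sampled kernels are useful below. A truncated-sum sketch of (7.3.3) with the geometric sampling distribution (illustrative, not from the book; the deterministic period-2 chain is an arbitrary example):

```python
# Resolvent kernel K_q = (1-a) * sum_t a^t P^t for the geometric
# sampling distribution q(t) = (1-a) a^t, truncated at a large horizon.
a = 0.5
P = [[0.0, 1.0],
     [1.0, 0.0]]               # period-2 chain: P^t alternates I, P, I, ...

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

n = len(P)
K = [[0.0] * n for _ in range(n)]
Pt = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # P^0 = I
for t in range(200):
    for i in range(n):
        for j in range(n):
            K[i][j] += (1 - a) * a**t * Pt[i][j]
    Pt = mat_mul(Pt, P)
print(K)   # every entry > 0 even though P itself has zero entries
```

Here K works out to [[2/3, 1/3], [1/3, 2/3]]: although the original chain can only alternate between the two states, the geometrically sampled chain reaches both states from anywhere in one "sampled" step.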


7.3.6 Definition. Let T(B|x) be a substochastic kernel on X, that is, T
is a nonnegative kernel such that T(X|x) ≤ 1 for all x ∈ X, and let q be a
sampling distribution.
(a) T is called a continuous component of K_q if

    K_q(B|x) ≥ T(B|x)  for all x ∈ X, B ∈ B(X),

and T(B|·) is an l.s.c. (lower semicontinuous) function on X for every
B ∈ B(X).
(b) The Markov chain {x_t} is said to be a T-chain if there exists a
sampling distribution q such that K_q admits a continuous component
T with T(X|x) > 0 for all x.
Of course, if {x_t} satisfies the strong Feller property, then it is a T-
chain; it suffices to take q(1) = 1 and T(B|x) = P(B|x). Before considering
other, less obvious, relations between the concepts introduced above, let
us recall that the support of a measure μ on B(X) is the unique closed
set F ⊂ X such that μ(X − F) = 0 and μ(F ∩ G) > 0 for every open set G
7.3 Recurrence concepts 11

that intersects F. In particular, the support of a p.m. J.t is the intersection


of all closed sets C in X such that J.t( C) = l.
7.3.7 Theorem. If {xt} is a A-irreducible Markov chain that satisfies the
weak Feller property and the support of A has nonempty interior, then {xt}
is a A-irreducible T -chain.
For a λ-irreducible Markov chain we have:
(1) The chain has a cycle of period d ≥ 1, that is, a collection
{C_1, ..., C_d} of disjoint Borel sets such that P(C_{i+1}|x) = 1 for
x ∈ C_i, i = 1, ..., d − 1, and P(C_1|x) = 1 for x ∈ C_d. When d = 1,
the chain is called aperiodic.
(2) Every set in B(X)⁺ contains a small set (also known as a C-set),
that is, a Borel set C such that for some integer n and some nontrivial
measure μ

P^n(·|x) ≥ μ(·) ∀x ∈ C. (7.3.4)

A more general concept than that of a small set is the following. A set
C in B(X) is called petite if there is a sampling distribution q and a
nontrivial measure μ′ such that

K_q(·|x) ≥ μ′(·) ∀x ∈ C. (7.3.5)

If the chain {x_t} is λ-irreducible, then a Borel set C is petite if and only if
for some n ≥ 1 and a nontrivial measure ν,

Σ_{t=1}^n P^t(·|x) ≥ ν(·) ∀x ∈ C. (7.3.6)

Further, for an aperiodic λ-irreducible chain, the class of petite sets is the
same as the class of small sets; however, the measures μ and μ′ in (7.3.4)
and (7.3.5) need not coincide.
Finally, we have the following important result, which will be used later
on in conjunction with Theorem 7.3.7.
7.3.8 Theorem. If every compact set is petite, then {x_t} is a T-chain.
Conversely, if {x_t} is a λ-irreducible T-chain, then every compact set is
petite.

D. w-Geometric ergodicity

7.3.9 Definition. Let w ≥ 1 be a weight function such that ‖P‖_w < ∞.
Then P (or the Markov chain {x_t}) is called w-geometrically ergodic
if there is a p.m. μ in M_w(X) and nonnegative constants R and ρ, with
ρ < 1, that satisfy

‖P^t − μ‖_w ≤ Rρ^t ∀t = 0, 1, ...; (7.3.7)

more explicitly [by (7.2.7)], for every u ∈ B_w(X), x ∈ X, and t = 0, 1, ...,

|P^t u(x) − μ(u)| ≤ ‖u‖_w Rρ^t w(x). (7.3.8)

The so-called limiting p.m. μ is necessarily the unique i.p.m. in M_w(X)
that satisfies (7.3.7).
To see that μ is indeed an i.p.m. for P, write P^t as P^{t−1}P and use (7.3.7)
and (7.2.16) to get

‖P^t − μP‖_w = ‖(P^{t−1} − μ)P‖_w ≤ ‖P^{t−1} − μ‖_w ‖P‖_w → 0 as t → ∞.

Using this fact and (7.3.7) again we obtain

‖μ − μP‖_w ≤ ‖μ − P^t‖_w + ‖P^t − μP‖_w → 0 as t → ∞.

Thus μ = μP.
We next present several results that guarantee (7.3.7).
7.3.10 Theorem. Let w ≥ 1 be a weight function. Suppose that the Markov
chain is λ-irreducible and either
(i) there is a petite set C ∈ B(X) and constants β < 1 and b < ∞ such
that

Pw(x) ≤ βw(x) + bI_C(x) ∀x ∈ X; (7.3.9a)

or
(ii) w is unbounded off petite sets (that is, the level set {x ∈ X : w(x) ≤ n}
is petite for every n < ∞) and, for some β < 1 and b < ∞,

Pw(x) ≤ βw(x) + b ∀x ∈ X. (7.3.9b)

Then:
(a) The chain is positive Harris recurrent, and
(b) μ(w) < ∞, where μ denotes the unique i.p.m. for P.
If, in addition, the chain is aperiodic, then
(c) The chain is w-geometrically ergodic, with limiting p.m. μ as in (b).

Proof. See Meyn and Tweedie [1], §16.1. [Lemma 15.2.8 of the latter ref-
erence shows the equivalence of the conditions (i) and (ii).] □
7.3.11 Theorem. Let w ≥ 1 be a weight function. Suppose that the Markov
chain has a unique i.p.m. μ and, further, there exist a p.m. ν, a measurable
function 0 ≤ l ≤ 1, and a number 0 < β < 1 such that

(i) P(B|x) ≥ l(x)ν(B) ∀B ∈ B(X) and x ∈ X;

(ii) ν(w) < ∞ and ν(l) · μ(l) > 0; and

(iii) Pw(x) ≤ βw(x) + ν(w)l(x) ∀x ∈ X. (7.3.10)

Then the chain is w-geometrically ergodic with limiting p.m. μ.

Proof. See Kartashov [3, Theorem 6]. □
7.3.12 Remark. Observe that each of the inequalities (7.3.9), (7.3.10)
trivially implies the condition ‖P‖_w < ∞ required in Definition 7.3.9. For
instance, since w ≥ 1, multiplying both sides of (7.3.9b) by w^{−1} we get

(Pw)/w ≤ β + b;

hence, by (7.2.8), ‖P‖_w ≤ β + b.
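On a finite state space the remark is easy to check numerically. In the sketch below the transition matrix P, the weight vector w, and the constants β, b are all made-up illustrations, chosen so that the drift bound (7.3.9b) holds:

```python
import numpy as np

# Finite-state sketch: a drift bound Pw <= beta*w + b caps the w-norm
# ||P||_w = sup_x (Pw)(x)/w(x) by beta + b, since w >= 1.
P = np.array([[0.9, 0.1, 0.0],
              [0.5, 0.3, 0.2],
              [0.0, 0.5, 0.5]])
w = np.array([1.0, 2.0, 4.0])        # a weight function w >= 1
beta, b = 0.75, 1.0                  # chosen so Pw <= beta*w + b holds

Pw = P @ w                           # (Pw)(x) = sum_y P(x,y) w(y)
assert np.all(Pw <= beta * w + b + 1e-12)   # drift inequality (7.3.9b)

norm_w = (Pw / w).max()              # ||P||_w as in (7.2.8)
assert norm_w <= beta + b
```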
7.3.13 Remark. (See Meyn and Tweedie [1], Theorem 16.0.1.) If the
Markov chain is λ-irreducible and aperiodic, then the following conditions
are equivalent for any weight function w ≥ 1:

(a) The chain is w-geometrically ergodic; that is, (7.3.7) holds.

(b) The inequality (7.3.9a) holds for some function w_0 which is equivalent
to w in the sense that c^{−1}w ≤ w_0 ≤ cw for some constant c ≥ 1.
On the other hand, Kartashov [2], [5, p. 20], studies (7.3.7) using the follow-
ing equivalent conditions (c) and (d):
(c) There is an integer n ≥ 1 and a number 0 < ρ < 1 such that

‖θP^n‖_w ≤ ρ‖θ‖_w ∀θ ∈ M_w^0(X), (7.3.11)

where M_w^0(X) ⊂ M_w(X) is the subspace of signed measures θ in
M_w(X) with θ(X) = 0.

(d) There is an integer n ≥ 1 and a number 0 < ρ < 1 such that

∫_X w(y)|P^n(dy|x) − P^n(dy|x′)| ≤ ρ[w(x) + w(x′)] (7.3.12)

for all x, x′ in X.

He shows that if P has a finite w-norm, then (c) [or (d)] is equivalent to
(a).
Statements (c) and (d) are nice because they are easy to interpret. For
instance [by Definition 7.2.7(b)], (7.3.11) means that P^n is a contraction
map (or, equivalently, P is an n-contraction) on the subspace M_w^0(X) of
M_w(X). Furthermore, (c) or (d) can be directly "translated" to obtain
geometric ergodicity in the total variation norm, as in (7.3.16) below. For
example, if we take w(·) ≡ 1 in (7.3.12) we get

‖P^n(·|x) − P^n(·|x′)‖_TV ≤ 2ρ ∀x, x′ ∈ X, (7.3.13)

which is a well-known necessary and sufficient condition for (7.3.16) (see
the references in Note 1). □
The equivalence of (a) and (c) [or (d)] in Remark 7.3.13 requires knowing
a priori that P has a finite w-norm. However, specializing (7.3.11), (7.3.12)
to n = 1 and adding a mild assumption we get the following.
7.3.14 Theorem. Suppose there exist a weight function w(·) ≥ 1, a pos-
itive number ρ < 1, and a state x* ∈ X that satisfy
(a) w* := Pw(x*) = ∫_X w(y)P(dy|x*) < ∞, and
(b1) ∫_X w(y)|P(dy|x) − P(dy|x′)| ≤ ρ[w(x) + w(x′)] ∀x, x′ in X,

or, equivalently [with M_w^0(X) as in Remark 7.3.13(c)],

(b2) ‖θP‖_w ≤ ρ‖θ‖_w ∀θ ∈ M_w^0(X).
Then
(i) Pw(x) ≤ ρw(x) + b for all x ∈ X, with b := ρw(x*) + w*;
(ii) P has a finite w-norm; and

(iii) P has a unique i.p.m. μ with a finite w-norm ‖μ‖_w ≤ b/(1 − ρ), and,
moreover, (7.3.7) holds with R := 1 + ‖μ‖_w ≤ 1 + b/(1 − ρ).
Proof. (i) Let x* and w* be as in (a). Then, by (b1),

∫ w(y)P(dy|x) ≤ ∫ w(y)|P(dy|x) − P(dy|x*)| + ∫ w(y)P(dy|x*)
              ≤ ρ[w(x) + w(x*)] + w*,

and (i) follows.
(ii) As noted in Remark 7.3.12, the inequality in (i) gives that P has a
finite w-norm. Furthermore, iteration of the inequality Pw ≤ ρw + b in (i)
yields, for all m = 1, 2, ...,

P^m w ≤ ρ^m w + b(1 + ρ + ··· + ρ^{m−1})
      ≤ w + b/(1 − ρ)
      ≤ [1 + b/(1 − ρ)]w;

thus

‖P^m‖_w ≤ R′ ∀m = 1, 2, ..., with R′ := 1 + b/(1 − ρ). (7.3.14)

(iii) To prove part (iii), we first show that for every x ∈ X the sequence
{P^t(·|x)} is a Cauchy sequence in (the Banach space) M_w(X). To prove
this, fix an arbitrary state x and note that, for any given integer m ≥ 1,
the finite signed measure

θ(·) := δ_x(·) − P^m(·|x)

is such that θ(X) = 0 and so it belongs to M_w^0(X) since, by (7.3.14) and
(7.2.9),

‖θ‖_w ≤ ‖δ_x‖_w + ‖P^m(·|x)‖_w ≤ (1 + R′)w(x) < ∞.

Furthermore, for every n = 1, 2, ...,

θP^n = P^n(·|x) − P^{n+m}(·|x),

whereas iteration of (b2) gives

‖θP^n‖_w ≤ ρ^n ‖θ‖_w ∀θ ∈ M_w^0(X), n = 1, 2, ..., (7.3.15)

so that ‖θP^n‖_w ≤ ρ^n(1 + R′)w(x). Combining these facts we see that

‖P^n(·|x) − P^{n+m}(·|x)‖_w ≤ ρ^n(1 + R′)w(x),

which, as m was arbitrary, clearly shows that {P^t(·|x)} is a Cauchy se-
quence in M_w(X).
Therefore, as M_w(X) is a Banach space, P^t(·|x) converges in the w-norm
to some measure μ_x in M_w(X). In fact, μ_x ≡ μ is independent of x, because
if we apply (7.3.15) to the signed measure θ′(·) := δ_x(·) − δ_{x′}(·), we obtain

‖P^n(·|x) − P^n(·|x′)‖_w ≤ ρ^n[w(x) + w(x′)] → 0 as n → ∞,

for all x, x′ in X. Hence, as in the paragraph after Definition 7.3.9, we
conclude that μ is an i.p.m. for P. Moreover, integration of both sides of
the inequality in (i) with respect to μ gives that ‖μ‖_w = ∫ w dμ satisfies
‖μ‖_w ≤ ρ‖μ‖_w + b, which yields the inequality

‖μ‖_w ≤ b/(1 − ρ).

Finally, to prove the last statement in (iii), we can use the invariance of
μ to write μ = μP^n for all n = 0, 1, ..., so that

P^n(·|x) − μ = δ_x P^n − μP^n = (δ_x − μ)P^n.

Thus, applying (7.3.15) to the signed measure δ_x(·) − μ(·) ∈ M_w^0(X),

‖P^n(·|x) − μ‖_w ≤ ρ^n ‖δ_x − μ‖_w ≤ ρ^n(1 + ‖μ‖_w)w(x),

which implies the desired result. □


Theorem 7.3.14 is essentially contained in the paper by Gordienko, Montes-
de-Oca and Minjárez-Sosa [1], except that, in addition to the hypotheses
(a) and (b), they assume that P has a unique i.p.m., and, further, the proofs
of some of the conclusions are referred to Kartashov [2]. We decided to in-
clude here a proof, independent of Kartashov's, because similar arguments
are repeatedly used in the remainder of this book.
Notes on §7.3

1. For each of the Theorems 7.3.10 and 7.3.11 it is possible to estimate
the values of R and ρ in (7.3.7): see Meyn and Tweedie [2] and Kartashov
[3], respectively. For w-geometric ergodicity of denumerable Markov chains,
see Dekker, Hordijk and Spieksma [1] or Spieksma and Tweedie [1].
On the other hand, an important special case is when w(·) is a bounded
function. In such a situation, the w-geometric ergodicity (7.3.7) reduces
to the standard (uniform) geometric ergodicity in the total variation
norm (7.2.3); that is [instead of (7.3.7)],

sup_x ‖P^t(·|x) − μ(·)‖_TV ≤ Rρ^t ∀t = 0, 1, .... (7.3.16)
Necessary and/or sufficient conditions for geometric ergodicity are well
known; see, for instance, Hernández-Lerma [1], Hernández-Lerma, Montes-
de-Oca and Cavazos-Cadena [1], and Meyn and Tweedie [1]. In particular,
Theorem 16.0.2 of the latter reference shows that the following conditions
(among others) are equivalent:
(a) The chain is (uniformly) geometrically ergodic;
(b) The chain is aperiodic and there is a bounded solution w ≥ 1 to the
inequality (7.3.9a);
(c) The chain is aperiodic and Doeblin's condition holds; that is, there
is a probability measure γ on B(X), positive numbers ε < 1 and δ,
and a positive integer m such that

inf_x P^m(B|x) > δ whenever γ(B) > ε.
2. The reader should be warned that the term "w-geometric ergodicity"
(Definition 7.3.9) is not quite standard. For example, Meyn and Tweedie [1]
call it w-uniform ergodicity (and use "V" instead of "w"), but in Kartashov
[1]-[5] the latter term means that P has a unique i.p.m. μ and that (7.3.7)
holds for the Cesàro sums

P^{(n)} := (1/n) Σ_{t=1}^n P^t, n = 1, 2, ...,

in lieu of P^t. Kartashov does not use any specific term for (7.3.7).

3. Speaking of terminology, we have in mind weight functions w that
satisfy

w(x) → ∞ as |x| → ∞, (7.3.17)

which in §5.7 were called strictly unbounded or moment functions. However,
a nonnegative measurable function that satisfies (7.3.17) is also known as
a Lyapunov (Duflo [1], Lasota and Mackey [1]) or norm-like (Meyn and
Tweedie [1]) function. In fact, inequalities of the form (7.3.9) and (7.3.10)
are called Lyapunov or Foster-Lyapunov criteria for "stability" (in this case
w-geometric ergodicity) of P, hence the name "Lyapunov function" for w.
There are Lyapunov-like criteria for other forms of "stability", such as the
existence of i.p.m.'s. Another way to approach the latter problem is to
view (7.3.2) as a "fixed point" equation or as a linear equation on a space
of measures. Then one can use generalized (infinite-dimensional) versions
of the Farkas Theorem to obtain conditions for the existence of solutions to
(7.3.2); see Hernández-Lerma and Lasserre [2]-[5], and Lasserre [1], [2].

7.4 Examples on w-geometric ergodicity


In this section we illustrate Theorems 7.3.10 and 7.3.11 on w-geometric
ergodicity. When considering an X-valued Markov chain of the form

x_{t+1} = F(x_t, z_t), t = 0, 1, ..., (7.4.1)

we always suppose the following:
7.4.1 Assumption.
(a) The disturbance sequence {z_t} consists of i.i.d. (independent and
identically distributed) random variables with values in a Borel space
Z, and {z_t} is independent of the initial state x_0. The common dis-
tribution of the z_t is denoted by G.
(b) F : X × Z → X is a given measurable function.
By Assumption 7.4.1(a), the variables x_t and z_t are independent for all
t = 0, 1, .... Hence, as in Remark 7.2.4 (with Q = P), for a system of the
form (7.4.1) we have

Pu(x) = E[u(x_{t+1}) | x_t = x] (7.4.2)
      = Eu[F(x, z_0)] = ∫_Z u[F(x, z)]G(dz).

In particular, taking u = I_B, the indicator function of a Borel set B ∈ B(X),
we obtain the transition probability function of {x_t}:

P(B|x) = PI_B(x) = EI_B[F(x, z_0)] = ∫_Z I_B[F(x, z)]G(dz). (7.4.3)



7.4.2 Example: Random walks. (Kartashov [4].) Consider the random
walk in the interval X = [0, ∞), namely,

x_{t+1} = (x_t + z_t)⁺, t = 0, 1, ..., (7.4.4)

where r⁺ := max(r, 0). We wish to verify that {x_t} satisfies the hypotheses
of Theorem 7.3.11, so that it is w-geometrically ergodic for some weight
function w. We shall suppose that E|z_0| < ∞ and

E(z_0) < 0. (7.4.5)

Under the condition E|z_0| < ∞, it is well known that (7.4.5) implies that
{x_t} is positive recurrent (see Meyn and Tweedie [1] or Nummelin [1]), and
so {x_t} has a unique i.p.m., which we shall denote by μ. Let us also suppose
that the moment generating function of z_0, namely,

m(s) := E exp(sz_0) = ∫_Z exp(sz)G(dz),

is finite for all s in some interval [0, s̄], with s̄ > 0. Then, as m(0) = 1 and
m′(0) = E(z_0) < 0, there is a number s such that

β := m(s) = E exp(sz_0) < 1. (7.4.6)

Now, for x ≥ 0, let

w(x) := e^{sx}, l(x) := Prob(z_0 + x ≤ 0), and ν(·) := δ_0(·), (7.4.7)

where δ_0 is the Dirac measure at 0. To verify the condition (i) in Theorem
7.3.11 consider the kernel

Q(B|x) := P(B|x) − l(x)ν(B).

If 0 is not in B, then of course Q(B|x) = P(B|x) ≥ 0. On the other hand,
if 0 ∈ B then

Q(B|x) = Prob[(x + z_0)⁺ ∈ B, z_0 > −x] ≥ 0.

Hence Q is a nonnegative kernel and the condition (i) follows.
To obtain the condition (ii) in Theorem 7.3.11 note that ν(w) = w(0) =
1, and, by (7.4.5), ν(l) = l(0) > 0. Further, as

l(x) = Prob(x_{t+1} = 0 | x_t = x) = P({0}|x),

the invariance of μ [see (7.3.1)] yields

μ(l) = ∫ l dμ = ∫_X P({0}|x)μ(dx) = μ({0}) > 0.



Finally, to obtain (7.3.10) observe first that

w[(x + z_0)⁺] = w(x) exp(sz_0) if x + z_0 > 0,
             = 1               if x + z_0 ≤ 0.

Hence, with β as in (7.4.6), we get

Pw(x) = E[w(x_{t+1}) | x_t = x]
      ≤ βw(x) + Prob(z_0 + x ≤ 0)
      = βw(x) + ν(w)l(x) [since ν(w) = 1],

which yields (7.3.10). Summarizing, the hypotheses of Theorem 7.3.11 are
satisfied and, consequently, the random walk (7.4.4) is w-geometrically er-
godic.
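For a concrete instance of this computation, take z_0 ~ Normal(μ, 1) with μ < 0. This distributional choice is an illustration only; the text requires just E(z_0) < 0 and a finite moment generating function. With Gaussian z_0 the quantities m(s), l(x) and Pw(x) all have closed forms, so the drift bound (7.3.10) can be checked state by state:

```python
import math

def Phi(t):  # standard normal c.d.f.
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

mu, s = -0.5, 0.5                       # illustrative: z0 ~ Normal(mu, 1)
beta = math.exp(mu * s + s * s / 2.0)   # m(s) = E e^{s z0}; beta < 1 as in (7.4.6)
assert beta < 1

def w(x):
    return math.exp(s * x)              # weight function w(x) = e^{sx} of (7.4.7)

def Pw(x):
    # Pw(x) = E w((x + z0)^+), split over {z0 > -x} and {z0 <= -x}:
    # E[e^{s z0} 1{z0 > -x}] = m(s) * Phi(x + mu + s) for Gaussian z0.
    return beta * w(x) * Phi(x + mu + s) + Phi(-x - mu)

for x in [0.0, 0.5, 2.0, 5.0]:
    l_x = Phi(-x - mu)                  # l(x) = Prob(z0 + x <= 0)
    nu_w = 1.0                          # nu = delta_0, so nu(w) = w(0) = 1
    assert Pw(x) <= beta * w(x) + nu_w * l_x + 1e-12   # drift bound (7.3.10)
```

Since Phi(·) ≤ 1, the inequality here is exact, not a numerical accident.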
This result for (7.4.4) can be extended in an obvious manner to chains
of the form

x_{t+1} = (x_t + η_t − ξ_t)⁺, t = 0, 1, ..., (7.4.8)

where {η_t} and {ξ_t} are independent sequences of nonnegative i.i.d. random
variables with finite means, and also independent of x_0. Defining z_t :=
η_t − ξ_t, the condition corresponding to (7.4.5) is

E(z_0) = E(η_0) − E(ξ_0) < 0, (7.4.9)

and we again require the moment generating function of z_0 = η_0 − ξ_0 to
be finite in some interval [0, s̄], with s̄ > 0. Then the chain in (7.4.8) is
w-geometrically ergodic with respect to the weight function w in (7.4.7).
The stochastic model (7.4.8) appears in many important applications.
For instance, in a random-release dam model, x_t denotes the content of the
dam at time t, η_t is the input, and ξ_t is the random release. On the other
hand, in a single-server queueing system, {ξ_t} is the sequence of interarrival
times of customers, the η_t are the service times, and x_t is the waiting time
of the tth customer. Finally, (7.4.8) can also be interpreted as represent-
ing an inventory-production system in which x_t is the stock level at time
t, η_t is the quantity produced, and ξ_t is the demand during the tth period. □
7.4.3 Example: A queueing system. The approach in Example 7.4.2,
using the moment generating function m(s) := E[exp(sz_0)] to find a suit-
able weight function, often works in other cases. For instance, consider a
discrete-time queueing system in which new customers are not admitted if
the system is not empty, that is, when x_t > 0, where x_t is the number of
customers at the beginning of time slot t. In this case, when x_t > 0 there is
exactly one service completion (and the served customer leaves the system)
in every time slot, and once the system is emptied (x_t = 0) new customers
are accepted. Thus, denoting by z_t the number of arrivals during time slot
t, the process is described by

x_{t+1} = x_t − 1 if x_t > 0,
        = z_t     if x_t = 0.

Equivalently, if I_0 stands for the indicator function of the set {0}, we can
write x_{t+1} as

x_{t+1} = x_t − 1 + (z_t + 1)I_0(x_t), t = 0, 1, ...,

which is an equation of the form (7.4.1). Further, in Assumption 7.4.1 we
have X = Z = {0, 1, ...} and G is the "arrival distribution":

G(B) = Σ_{i∈B} q(i) ∀B ⊂ {0, 1, ...},

with q(i) ≥ 0 and Σ_i q(i) = 1. The transition probability function
P({j}|i) =: p(j|i) is given by

p(i − 1|i) = 1 ∀i ≥ 1 and p(j|0) = q(j) ∀j ≥ 0.


To guarantee the existence of a unique i.p.m., as required in Theorem
7.3.11, it suffices to assume that the arrival distribution has a finite mean
q̄ := E(z_0), i.e.,
(a) q̄ = Σ_i iq(i) < ∞.
Then the invariance equation (7.3.1) becomes

μ(j) = Σ_{i=0}^∞ p(j|i)μ(i) ∀j ≥ 0,

with μ(j) ≥ 0 and Σ_j μ(j) = 1, which reduces to

μ(j) = μ(0)q(j) + μ(j + 1) ∀j ≥ 0.

Hence the unique i.p.m. is given by

μ(j) = μ(0) Σ_{k=j}^∞ q(k) ∀j ≥ 0, with μ(0) = 1/(1 + q̄).

On the other hand, to ensure that the conditions (i)-(iii) in Theorem 7.3.11
hold, we shall suppose that z_0 has a finite moment generating function m(s)
in some interval [0, s̄], that is,
(b) m(s) := E[exp(sz_0)] < ∞ ∀s in [0, s̄], for some s̄ > 0.
Now fix a number s in (0, s̄), and define the number β := exp(−s) and the
functions

l(i) := I_0(i) and w(i) := exp(si), i ≥ 0,

where I_0 is the indicator function of {0}. Then taking the p.m. ν as the
arrival distribution, that is, ν(i) = q(i) for all i, easy calculations show
that the hypotheses (i)-(iii) of Theorem 7.3.11 are satisfied; hence, {x_t} is
w-geometrically ergodic. □
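The displayed formula for the i.p.m. can be checked with exact rational arithmetic for any concrete arrival distribution; the finitely supported q below is a made-up example:

```python
from fractions import Fraction

# Illustrative arrival distribution with finite support.
q = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
assert sum(q) == 1
qbar = sum(i * qi for i, qi in enumerate(q))          # mean number of arrivals

# i.p.m. from the text: mu(j) = mu(0) * sum_{k >= j} q(k), mu(0) = 1/(1 + qbar)
mu0 = 1 / (1 + qbar)
mu = [mu0 * sum(q[j:]) for j in range(len(q))]        # mu(j) = 0 for larger j

assert sum(mu) == 1                                   # a probability measure
# invariance equation mu(j) = mu(0) q(j) + mu(j + 1):
for j in range(len(q)):
    mu_next = mu[j + 1] if j + 1 < len(q) else Fraction(0)
    assert mu[j] == mu0 * q[j] + mu_next
```

Here q̄ = 7/8, so μ(0) = 8/15, and the invariance equation holds exactly at every state.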

7.4.4 Example: Linear systems. In this example the state and distur-
bance spaces are X = Z = ℝ^d and (7.4.1) is replaced by

x_{t+1} = Fx_t + z_t, t = 0, 1, ..., (7.4.10)

where F is a given d × d matrix. In addition to Assumption 7.4.1(a) we will
suppose the following:
(a) all the eigenvalues of F have magnitude less than 1 (equivalently, the
spectral radius of F is less than 1);

(b) the disturbance distribution G is nonsingular with respect to Lebesgue
measure, with a nontrivial density, zero mean and finite second mo-
ment.
Under these hypotheses, the linear system (7.4.10) has many interesting
properties. In particular, it turns out to be w_i-geometrically ergodic (i =
1, 2, 3) with respect to each of the functions

w_1(x) = |x| + 1, w_2(x) = |x|² + 1, and w_3(x) = x′Lx + 1, (7.4.11)

where "prime" denotes "transpose", and L in w_3 is a symmetric positive-
definite matrix that satisfies the matrix equation

F′LF = L − I.

The existence of such a matrix L is ensured by the hypothesis (a); see, for
instance, Chen [1], Theorem 8-22. We shall omit the calculations showing
the w_i-geometric ergodicity and refer to the proofs of Proposition 12.5.1
and Theorem 17.6.2 of Meyn and Tweedie [1] for further details. Moreover,
it is worth noting that these results are valid when (7.4.10) is replaced by
the more general model

x_{t+1} = Fx_t + Hz_t, t = 0, 1, ..., (7.4.12)

where the z_t are p-dimensional random vectors and H is a d × p matrix,
provided that the pair of matrices (F, H) is "controllable" in the sense that
the matrix

[H | FH | ··· | F^{d−1}H]

has rank d. □
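The matrix equation F′LF = L − I can be solved by iterating its fixed-point form L = I + F′LF, which converges when the spectral radius of F is less than 1 because then L = Σ_k (F′)^k F^k. A sketch with made-up matrices F and H; the final rank test is the controllability condition for (7.4.12):

```python
import numpy as np

# Hypothetical stable 2x2 matrix F (spectral radius < 1).
F = np.array([[0.5, 0.2],
              [0.1, 0.4]])
assert max(abs(np.linalg.eigvals(F))) < 1

# F'LF = L - I  <=>  L = I + F'LF, so iterate the fixed-point map.
L = np.eye(2)
for _ in range(500):
    L = np.eye(2) + F.T @ L @ F

assert np.allclose(F.T @ L @ F, L - np.eye(2))   # the Lyapunov equation
assert np.all(np.linalg.eigvalsh(L) > 0)         # L is positive definite

# Controllability test for the pair (F, H) in (7.4.12), with d = 2:
H = np.array([[1.0], [0.0]])
ctrb = np.hstack([H, F @ H])                     # [H | FH]
assert np.linalg.matrix_rank(ctrb) == 2
```

With such an L, w_3(x) = x′Lx + 1 is one of the weight functions of (7.4.11).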
We now consider a general Markov chain (7.4.1) with state space X = ℝ^d;
the disturbance space Z is an arbitrary Borel space. We first present con-
ditions under which the inequalities (7.3.9) are satisfied, so that the chain
will be w-geometrically ergodic if the additional hypotheses of Theorem
7.3.10 are satisfied. Conditions for the latter to be true are given in Exam-
ple 7.4.6 for additive-noise systems.

7.4.5 Proposition. Let {x_t} be as in (7.4.1), with X = ℝ^d, and Z an
arbitrary Borel space. Suppose that there exist positive constants β, γ and
M such that
(a) β < 1 and β + γ ≥ 1;

(b) E|F(x, z_0)| ≤ β|x| − γ ∀|x| > M; and

(c) F̄ := sup_{|x|≤M} E|F(x, z_0)| < ∞.

Then

Pw(x) ≤ βw(x) + bI_C(x) ∀x ∈ X, (7.4.13)

where

w(x) := |x| + 1, b := γ + F̄, and C := {|x| ≤ M}. (7.4.14)

Proof. The proof follows from direct calculations. Let D := X − C be the
complement of C. Then, since I_C + I_D = 1,

E|F(x, z_0)| = E(|F(x, z_0)|[I_D(x) + I_C(x)])
            ≤ (β|x| − γ)I_D(x) + F̄I_C(x)
            = β|x| − γ + (F̄ − β|x| + γ)I_C(x)
            ≤ βw(x) − (β + γ) + bI_C(x).

Hence,

Pw(x) = E(|F(x, z_0)| + 1) ≤ βw(x) − (β + γ) + bI_C(x) + 1,

and (7.4.13) follows since, by (a), −(β + γ) + 1 ≤ 0. □
As an application of Proposition 7.4.5, we shall combine it with Theo-
rem 7.3.10 to give a set of conditions under which the additive-noise (or
autoregressive) process

x_{t+1} = F(x_t) + z_t, t = 0, 1, ... (7.4.15)

is w-geometrically ergodic.
7.4.6 Example: Additive-noise nonlinear systems. Let {x_t} be the
Markov chain given by (7.4.15), with values in X = ℝ^d and disturbances
satisfying Assumption 7.4.1. In addition, suppose that
(a) F : ℝ^d → ℝ^d is a continuous function;
(b) The disturbance distribution G is absolutely continuous with respect
to Lebesgue measure λ(dz) = dz, and its density g [i.e., G(dz) =
g(z)dz] is positive λ-a.e., and has a finite mean value;
(c) There exist positive constants β, γ and M such that β < 1, β + γ ≥ 1,
and

E|F(x) + z_0| ≤ β|x| − γ ∀|x| > M. (7.4.16)

Then {x_t} is w-geometrically ergodic, where w(x) := |x| + 1.
Indeed, the conditions (a) and (c) yield that the hypotheses of Propo-
sition 7.4.5 are satisfied, and, therefore, we obtain the inequality (7.4.13),
which is the same as (7.3.9a). Hence, to obtain the desired conclusion it
suffices to verify that the chain satisfies the other hypotheses of Theorem
7.3.10. To do this, first note that (7.4.15) and (7.4.2) yield

Pu(x) = Eu[F(x) + z_0] = ∫ u[F(x) + z]g(z)dz. (7.4.17)

Thus, in particular, if u is a continuous and bounded function on X = ℝ^d,
then so is Pu; in other words, {x_t} satisfies the weak Feller property. On
the other hand, taking u = I_B for any Borel set B, a change of integration
variable in (7.4.17) gives [cf. (7.4.3)]

P(B|x) = ∫_B g(z − F(x))dz, (7.4.18)

which, by hypothesis (b), shows that {x_t} is λ-irreducible. Therefore, The-
orem 7.3.7 yields that {x_t} is a λ-irreducible T-chain, which in turn,
by Theorem 7.3.8, shows that every compact set is petite. In particular,
C := {|x| ≤ M} is a petite set. Finally, from (b) and (7.4.18) again, we see
that the chain is aperiodic, so that all the assumptions of Theorem 7.3.10
are valid in the present case. □
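Condition (7.4.16) can be checked by Monte Carlo for a specific model. Below F(x) = x/2 and z_0 ~ Normal(0, 1) are illustrative assumptions, with β, γ, M chosen by hand so that β < 1 and β + γ ≥ 1:

```python
import random

random.seed(1)
# Illustrative additive-noise chain: F(x) = x/2, z0 ~ Normal(0, 1).
beta, gamma, M = 0.6, 0.4, 12.0          # beta < 1 and beta + gamma >= 1

def EabsFz(x, n=100_000):                # Monte Carlo estimate of E|F(x) + z0|
    return sum(abs(x / 2 + random.gauss(0, 1)) for _ in range(n)) / n

for x in [13.0, -20.0, 50.0]:            # sample states with |x| > M
    assert EabsFz(x) <= beta * abs(x) - gamma   # condition (7.4.16)
```

For |x| > M the margin is comfortable (E|x/2 + z_0| ≈ |x|/2 when |x|/2 is large compared with the noise), so the Monte Carlo error does not threaten the check.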
7.4.7 Example: Iterated function systems. An iterated function sys-
tem (IFS) is a Markov chain of the form (7.4.1) with state space a closed
subset X of ℝ^d and a finite disturbance set Z, say Z = {1, ..., N}. Let
{q_1, ..., q_N} be the common distribution of the i.i.d. disturbances, that is,

q_i = Prob(z_0 = i), i = 1, ..., N.

We assume that the probabilities q_i are positive. Then, writing F(x, i) as
F_i(x), for x ∈ X and i = 1, ..., N, the expressions (7.4.2) and (7.4.3)
become

Pu(x) = Σ_{i=1}^N u[F_i(x)]q_i (7.4.19)

and

P(B|x) = Σ_{i=1}^N I_B[F_i(x)]q_i, (7.4.20)

respectively. It turns out that the hypothesis (H) below implies that the
IFS {x_t} is w-geometrically ergodic with respect to the weight function

w(x) := |x| + 1, x ∈ X. (7.4.21)

(H) The zero vector 0 is in X and the functions F_i satisfy the Lipschitz
conditions

|F_i(x) − F_i(y)| ≤ L_i|x − y| ∀x, y ∈ X, i = 1, ..., N,

where L_1, ..., L_N are nonnegative constants such that

β := Σ_{i=1}^N L_i q_i < 1. (7.4.22)

By Proposition 12.8.1 of Lasota and Mackey [1], (H) implies that the IFS is
weakly asymptotically stable, which means that the transition probability
function P has a unique i.p.m. μ, and, in addition, (νP^t)(u) → μ(u) for
any initial distribution ν and any continuous bounded function u on X.
(In other words, for any p.m. ν, νP^t converges weakly to μ.) On the other
hand, the uniqueness of μ and the assumption that q_i > 0 yield that the
IFS {x_t} is μ-irreducible and aperiodic. Hence, by Theorem 7.3.10, to prove
that {x_t} is w-geometrically ergodic it suffices to show that the inequality
(7.3.9b) holds. To see this observe that, by (H),

|F_i(x)| ≤ |F_i(x) − F_i(0)| + |F_i(0)|
         ≤ L_i|x| + F̄ ∀x ∈ X and i = 1, ..., N,

where F̄ := max{|F_i(0)|, i = 1, ..., N}. Thus, from (7.4.21) and (7.4.19),
we obtain

Pw(x) = Σ_{i=1}^N (|F_i(x)| + 1)q_i ≤ βw(x) + (F̄ + 1 − β),

which is the same as (7.3.9b) with b := F̄ + 1 − β. □
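The drift computation at the end of the example is easy to verify for a concrete IFS. The two affine maps on [0, 1], their Lipschitz constants and the probabilities below are made up for illustration:

```python
# Illustrative IFS on X = [0, 1]: two affine maps with Lipschitz constants L_i.
Fs = [lambda x: 0.5 * x, lambda x: 0.3 * x + 0.7]
Ls = [0.5, 0.3]
qs = [0.6, 0.4]

beta = sum(L * q for L, q in zip(Ls, qs))       # beta = sum L_i q_i, (7.4.22)
assert beta < 1

def w(x):
    return abs(x) + 1                            # weight function (7.4.21)

Fbar = max(abs(F(0.0)) for F in Fs)              # Fbar = max |F_i(0)|
b = Fbar + 1 - beta                              # constant b of the text
for x in [0.0, 0.25, 0.9]:
    Pw = sum(q * w(F(x)) for F, q in zip(Fs, qs))   # (7.4.19) with u = w
    assert Pw <= beta * w(x) + b + 1e-12            # drift bound (7.3.9b)
```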

7.5 Poisson's equation


Let {x_t} be an X-valued Markov chain with transition probability function
P(B|x), and let (X, ‖·‖) be a normed vector space of real-valued measurable
functions on X. In most applications, X is one of the spaces B(X), B_w(X)
or L_p(X, B(X), μ) for some measure μ. We assume that the usual linear
map u ↦ Pu, with

Pu(x) := ∫_X u(y)P(dy|x), x ∈ X,

maps X into itself.

7.5.1 Definition. Let c, g and h be functions in X. Then the system of
equations

(a) g = Pg, and (b) g + h = c + Ph (7.5.1)

is called the Poisson equation (P.E.) for P with charge c, and if there
is a pair (g, h) of functions that satisfies (7.5.1), then (g, h) is said to be a
solution to the P.E. for P with charge c. Further, if P admits a unique
i.p.m., then (7.5.1) is referred to as the unichain P.E.; otherwise, it is called
the multichain P.E. A function that satisfies (7.5.1)(a) is said to be in-
variant (or harmonic) with respect to P.
Remark. Since Pk = k for any constant k, we can see that if (g, h) is a
solution of (7.5.1), then so is (g, h + k) for any constant k. Several results
in this section deal with the question of "uniqueness" of a solution to the
P.E. (See Corollary 7.5.6, and Theorems 7.5.7 and 7.5.10.) □
In this section we first present an elementary example showing, in par-
ticular, that the choice of the underlying space X is an important issue.
The remaining material is divided into three parts. In part A we present
some basic results on the multichain P.E., and in part B we consider the
special unichain case in which P is w-geometrically ergodic with respect to
a weight function w. Finally, in part C we present some examples.
7.5.2 Example. Let {x_t} be the Markov chain in Example 7.4.3; that is,
the state space is X = {0, 1, ...} and the transition probabilities are

p(i − 1|i) = 1 ∀i ≥ 1, and p(j|0) = q(j) ∀j ≥ 0,

where {q(i), i ∈ X} is a given probability distribution on X with finite
mean value, i.e.,

q̄ := Σ_{i=0}^∞ iq(i) < ∞.

Let X = B(X) be the Banach space of bounded functions on X with the
sup norm, and consider the P.E. (7.5.1) with charge c in B(X) given by

c(0) := −q̄, and c(i) := 1 ∀i ≥ 1.

Then it is easily verified that the pair of functions (g, h) with

g(·) ≡ 0, and h(i) = i − q̄ ∀i ∈ X

satisfies (7.5.1). However, since h is unbounded, it does not belong to the
space X = B(X) and, therefore, the pair (g, h) is not a "solution" of the
P.E. in the sense of Definition 7.5.1. The latter situation can be remedied
by introducing a suitable space larger than B(X). For instance, consider
the weight function w(i) := i + 1. Then using (7.2.8) it is easily verified
that the transition law is bounded in w-norm (in fact, ‖P‖_w ≤ q̄ + 1) and
so, by Proposition 7.2.5, P maps B_w(X) into itself. Finally, replacing B(X)
by B_w(X), we see that (g, h) is indeed a solution of (7.5.1) in the space
B_w(X) ⊃ B(X). □
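The claim that (g, h) = (0, i − q̄) satisfies the Poisson equation can be verified state by state with exact arithmetic; the arrival distribution q below is a made-up example with finite support:

```python
from fractions import Fraction

# Same chain as Example 7.4.3: p(i-1|i) = 1 for i >= 1, p(j|0) = q(j).
q = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]   # illustrative arrivals
qbar = sum(i * qi for i, qi in enumerate(q))           # mean q-bar

def c(i):                 # charge: c(0) = -qbar, c(i) = 1 for i >= 1
    return -qbar if i == 0 else Fraction(1)

def h(i):                 # candidate solution h(i) = i - qbar (with g = 0)
    return i - qbar

def Ph(i):                # Ph(i) = sum_j p(j|i) h(j)
    if i >= 1:
        return h(i - 1)
    return sum(qj * h(j) for j, qj in enumerate(q))

# Poisson equation (7.5.1)(b) with g = 0: h = c + Ph at every state.
for i in range(10):
    assert h(i) == c(i) + Ph(i)
```

At i = 0 the cancellation Σ_j q(j)(j − q̄) = 0 is exactly what the choice c(0) = −q̄ was designed to exploit.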

A. The multichain case

7.5.3 Remark. It is important to keep in mind the probabilistic meaning
of equations such as (7.5.1), but sometimes it will be more expedient to
do things "operationally". For instance, iterating the invariance equation
(7.5.1)(a) we obtain

g = P^t g ∀t = 0, 1, ..., (7.5.2)

which in turn yields

ng = Σ_{t=0}^{n−1} P^t g ∀n = 1, 2, .... (7.5.3)

On the other hand, from a "probabilistic" viewpoint, (7.5.1)(a) implies that
{g(x_t), t = 0, 1, ...} is a martingale. Indeed, for any t, g(x_t) is measur-
able with respect to σ(x_0, ..., x_t), the σ-algebra generated by {x_0, ..., x_t},
E_x|g(x_t)| ≤ ‖g‖ < ∞ for each x, whereas the Markov property yields

E[g(x_{t+1}) | x_t] = ∫ g(y)P(dy|x_t) = Pg(x_t) = g(x_t) by (7.5.1)(a).

This, of course, also yields (7.5.2), which can be written as

g(x) = P^t g(x) = ∫ g(y)P^t(dy|x) = E_x[g(x_t)] ∀t = 0, 1, ..., x ∈ X.

Moreover, rewriting (7.5.1)(b) as h = c − g + Ph and iterating we get

h = Σ_{t=0}^{n−1} P^t(c − g) + P^n h ∀n = 1, 2, ..., (7.5.4)

which "probabilistically" can be rewritten as

h(x) = Σ_{t=0}^{n−1} [E_x c(x_t) − E_x g(x_t)] + E_x h(x_n)
     = E_x [ Σ_{t=0}^{n−1} (c(x_t) − g(x_t)) + h(x_n) ] (7.5.5)
     = E_x(M_n),

where {M_n} is defined by

M_0 := h(x_0), and M_n := Σ_{t=0}^{n−1} [c(x_t) − g(x_t)] + h(x_n) (7.5.6)

for n ≥ 1. □
7.5.4 Definition. Let c, g and h be functions in X. The pair (g, h) is called
a c-canonical pair if

ng + h = Σ_{t=0}^{n−1} P^t c + P^n h ∀n = 1, 2, .... (7.5.7)

Canonical pairs, the sequence {M_n} in (7.5.6), and the P.E. (7.5.1) are
related as follows.
7.5.5 Theorem. The following conditions are equivalent:
(a) (g, h) is a solution to the P.E. with charge c.
(b) (g, h) is a c-canonical pair.
(c) {M_n} is a martingale and g is invariant.
Proof. (a) ⇔ (b). The implication (a) ⇒ (b) follows from (7.5.4) and
(7.5.3). Conversely, (7.5.1)(b) follows from (7.5.7) with n = 1. To obtain
the invariance equation (7.5.1)(a), apply P to both sides of (7.5.1)(b) to
get

P²h = Pg + Ph − Pc,

and, on the other hand, observe that for n = 2 (7.5.7) becomes

P²h = 2g + h − c − Pc.

The last two equations yield

Pg − g = g + h − (c + Ph) = 0, i.e., g = Pg.

(a) ⇔ (c). If (a) holds, then the invariance condition on g is obvious,
whereas, by its very definition, M_n is measurable with respect to σ{x_0, ..., x_n}
for every n, and E_x|M_n| ≤ n(‖c‖ + ‖g‖) + ‖h‖ < ∞ for each x ∈ X. More-
over, by the Markov property,

E[M_{n+1} − M_n | x_0, ..., x_n] = E[h(x_{n+1}) | x_n] + c(x_n) − g(x_n) − h(x_n)
                                = Ph(x_n) + c(x_n) − g(x_n) − h(x_n)
                                = 0 ∀n ≥ 0, by (7.5.1)(b),

i.e., {M_n} is a martingale. The converse follows from (7.5.5) [or (7.5.4)]
with n = 1, and the invariance of g. □

Although Theorem 7.5.5 is quite straightforward, it has important con-
sequences. In particular, it allows us to derive additional necessary and/or
sufficient conditions for the existence of solutions to the P.E. In part (c)
of the following corollary we require the stochastic kernel P to be power-
bounded, which means that

‖P^n‖ ≤ K for all n = 0, 1, ..., and some constant K. (7.5.8)

[Recall that the norm of an operator T on (X, ‖·‖) is defined as

‖T‖ := sup{‖Tu‖ : u ∈ X, ‖u‖ ≤ 1};

cf. (7.2.7) or (7.2.10).] The condition (7.5.8) holds, for instance, if ‖P‖ ≤ 1,
in which case P is nonexpansive (Definition 7.2.7) with respect to the
norm ‖·‖. For such a P, we have ‖P^n‖ ≤ ‖P‖^n ≤ 1 [cf. (7.2.16)] and
(7.5.8) follows. On the other hand, note that obviously a power-bounded
operator is bounded, i.e., ‖P‖ ≤ K for some constant K. An important
consequence of power-boundedness is given in part (c) of the following
corollary.
7.5.6 Corollary. Let (g, h) be a solution of the P.E. with charge c. Then:

(a) g = lim_{n→∞} (1/n) Σ_{t=0}^{n−1} P^t g pointwise and in norm (that is,
in the norm ‖·‖ of X).

(b) If

P^n h/n → 0 pointwise or in norm, (7.5.9)

then

g = lim_{n→∞} (1/n) Σ_{t=0}^{n−1} P^t g = lim_{n→∞} (1/n) Σ_{t=0}^{n−1} P^t c (7.5.10)

pointwise or in norm, respectively.

(c) If, further, P is power-bounded [with a constant K as in (7.5.8)] then,
for all n ≥ 1,

‖Σ_{t=0}^{n−1} P^t(c − g)‖ = ‖Σ_{t=0}^{n−1} P^t c − ng‖ ≤ (1 + K)‖h‖;

hence

‖(1/n) Σ_{t=0}^{n−1} P^t c − g‖ ≤ (1 + K)‖h‖/n.

(d) Uniqueness of solutions: Let (g_1, h_1) and (g_2, h_2) be two solutions of
the P.E. with charge c such that h_1 and h_2 satisfy (7.5.9). Then g_1 =
g_2 and

h_1 − h_2 = lim_{n→∞} (1/n) Σ_{t=0}^{n−1} P^t(h_1 − h_2) pointwise and in norm. (7.5.11)
Proof. Part (a) follows from (7.5.3). Moreover, from (7.5.3) and (7.5.7) we get

Σ_{t=0}^{n−1} P^t(c − g) = h − Pⁿh,  (7.5.12)

which, using part (a), yields (b). In fact, (7.5.12) also gives (c) because

‖Σ_{t=0}^{n−1} P^t(c − g)‖ ≤ (1 + K)‖h‖,

with K as in (7.5.8).

(d) The equality g₁ = g₂ results from (b), since

g₁ = lim_{n→∞} (1/n) Σ_{t=0}^{n−1} P^t c = g₂.

Let g := g₁ = g₂. Then writing (7.5.1)(b) for (g, h₁) and for (g, h₂) and subtracting, we see that u := h₁ − h₂ is invariant, i.e., u = Pu. Therefore, (7.5.11) follows from the same argument used in (a). □
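The bound in part (c) is easy to check numerically when X is a finite set, where P is a stochastic matrix and ‖Pⁿ‖ = 1 in the sup norm, so that K = 1. The sketch below is illustrative only: the matrix P and the charge c are arbitrary choices, and a solution (g, h) of the P.E. is obtained by solving (I − P)h = c − g directly with the normalization μ(h) = 0.

```python
import numpy as np

# An irreducible 3-state stochastic matrix P and an arbitrary charge c.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.7, 0.2],
              [0.3, 0.3, 0.4]])
c = np.array([1.0, -2.0, 0.5])

# Invariant p.m. mu: left eigenvector of P for eigenvalue 1, normalized.
w, V = np.linalg.eig(P.T)
mu = np.real(V[:, np.argmin(np.abs(w - 1))])
mu = mu / mu.sum()

g = float(mu @ c)                      # g = mu(c), a constant function
# Solve the P.E. (I - P)h = c - g together with mu(h) = 0 (stacked least squares).
A = np.vstack([np.eye(3) - P, mu.reshape(1, 3)])
b = np.append(c - g, 0.0)
h, *_ = np.linalg.lstsq(A, b, rcond=None)

# Corollary 7.5.6(c) with K = 1: || sum_{t<n} P^t c - n g || <= 2 ||h||.
S, Pt = np.zeros(3), np.eye(3)
for n in range(1, 60):
    S = S + Pt @ c
    assert np.max(np.abs(S - n * g)) <= 2.0 * np.max(np.abs(h)) + 1e-9
    Pt = Pt @ P
```

The loop verifies the partial-sum identity Σ_{t<n} P^t(c − g) = h − Pⁿh behind the estimate.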
The following theorem gives another characterization of a solution to the P.E.

7.5.7 Theorem. Let c, g and h be functions in (X, ‖·‖) and suppose that

(a) P is bounded (i.e., ‖P‖ ≤ K for some constant K), and

(b) (7.5.9) holds in norm for every u in X, i.e., ‖Pⁿu‖/n → 0.

Then the following two assertions are equivalent:

(i) (g, h) is the unique solution of the P.E. with charge c for which

lim_{n→∞} (1/n) Σ_{t=0}^{n−1} P^t h = 0 in norm.  (7.5.13)

(ii) g satisfies (7.5.10) in norm and

h = lim_{N→∞} (1/N) Σ_{n=1}^{N} Σ_{t=0}^{n−1} P^t(c − g) in norm.  (7.5.14)

Proof. (i) ⇒ (ii). Suppose that (i) holds. Then, by the hypothesis (b), h satisfies (7.5.9) and so the requirement on g follows from Corollary 7.5.6(b). On the other hand, by (7.5.7) and (7.5.3),

Nh = Σ_{n=1}^{N} Σ_{t=0}^{n−1} P^t(c − g) + Σ_{n=1}^{N} Pⁿh ∀N = 1, 2, ....  (7.5.15)

Hence, (7.5.13) gives (7.5.14).

(ii) ⇒ (i). Suppose that (ii) holds. By the assumption (a), we can interchange P and limits in norm; that is,

P(lim f_n) = lim P f_n if f_n converges in norm.  (7.5.16)

Applying this fact to the first equality in (7.5.10) we obtain Pg = g; that is, g satisfies (7.5.1)(a). Moreover, note that the hypothesis on g gives

lim_{n→∞} (1/n) Σ_{t=0}^{n−1} P^t(c − g) = 0 in norm.  (7.5.17)

Now, to obtain (7.5.1)(b), observe first that

(I − P) Σ_{t=0}^{n−1} P^t = I − Pⁿ ∀n = 1, 2, ...,

so that

(I − P) Σ_{t=0}^{n−1} P^t(c − g) = (I − Pⁿ)(c − g).  (7.5.18)

Therefore, applying I − P to (7.5.14) and using (7.5.16) and (7.5.18), we get

(I − P)h = lim_{N→∞} (1/N) Σ_{n=1}^{N} (I − P) Σ_{t=0}^{n−1} P^t(c − g)
  = (c − g) − lim_{N→∞} (1/N) Σ_{n=1}^{N} Pⁿ(c − g)
  = c − g by (7.5.17);

that is, (7.5.1)(b) holds. Hence, (g, h) is a solution to the P.E. with charge c, and, by (7.5.14) and (7.5.15), the function h satisfies (7.5.13). The latter condition, (7.5.13), and Corollary 7.5.6(d) give the uniqueness of (g, h). □
7.5.8 Remark: Finite X. If the state space X is a finite set, in which case the stochastic kernel P is a square matrix, it is well known that the limiting matrix

P̄ := lim_{n→∞} (1/n) Σ_{t=0}^{n−1} P^t (componentwise)  (7.5.19)

exists, and that I − P + P̄ is nonsingular; its inverse

Z := (I − P + P̄)⁻¹

is called the fundamental matrix associated to P. Moreover, the deviation matrix associated to P (or Drazin inverse of I − P), defined by

H := lim_{N→∞} (1/N) Σ_{n=1}^{N} Σ_{t=0}^{n−1} (P − P̄)^t (I − P̄),  (7.5.20)

can also be written as

H = lim_{N→∞} (1/N) Σ_{n=1}^{N} Σ_{t=0}^{n−1} (P^t − P̄)  (7.5.21)

and satisfies

H = Z(I − P̄) = (I − P + P̄)⁻¹(I − P̄).  (7.5.22)

Further, the pair (g, h) of vectors given by

g := P̄c and h := Hc  (7.5.23)

is a solution to the P.E. for P with charge c, and it is precisely of the form given by Theorem 7.5.7(ii). In fact, in a suitable theoretical setting, all of the expressions (7.5.19)–(7.5.23) have a well-defined meaning in a much more general context than the finite-state case. (See Hernandez-Lerma and Lasserre [6] for details.) □
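For a small concrete matrix, the objects of Remark 7.5.8 can be computed directly. The sketch below (with an arbitrary illustrative P and charge c) forms P̄, Z and H, checks that (g, h) = (P̄c, Hc) solves the P.E. with μ(h) = 0, and approximates the Cesàro limit (7.5.21) by truncation.

```python
import numpy as np

P = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.4, 0.2],
              [0.1, 0.6, 0.3]])
c = np.array([3.0, -1.0, 2.0])

# Pbar = lim (1/n) sum P^t: for an irreducible aperiodic chain, every row is mu.
w, V = np.linalg.eig(P.T)
mu = np.real(V[:, np.argmin(np.abs(w - 1))])
mu = mu / mu.sum()
Pbar = np.tile(mu, (3, 1))

I = np.eye(3)
Z = np.linalg.inv(I - P + Pbar)          # fundamental matrix, cf. (7.5.22)
H = Z @ (I - Pbar)                       # deviation matrix (Drazin inverse of I - P)

g, h = Pbar @ c, H @ c                   # (7.5.23): a solution of the P.E.
assert np.allclose(g + h, c + P @ h)     # (7.5.1)(b): g + h = c + Ph
assert np.allclose(Pbar @ h, 0, atol=1e-10)   # mu(h) = 0, cf. (7.5.29)

# H also equals the Cesaro limit (7.5.21), approximated here by truncation at N = 500.
S, Pt, acc = np.zeros((3, 3)), I.copy(), np.zeros((3, 3))
for n in range(1, 501):
    acc += Pt - Pbar                     # inner sum over t of (P^t - Pbar)
    Pt = Pt @ P
    S += acc
assert np.allclose(S / 500, H, atol=1e-3)
```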

B. The unichain P.E.

Suppose that P has an i.p.m. μ. Then by the Individual Ergodic Theorem (Yosida [1]), for any given function u in L₁(μ) ≡ L₁(X, B(X), μ) there is a function u* in L₁(μ) such that

(a) u* = lim_{n→∞} (1/n) Σ_{t=0}^{n−1} P^t u μ-a.e., and (b) ∫ u* dμ = ∫ u dμ.  (7.5.24)

On the other hand, the Mean Ergodic Theorem ensures that the convergence in (a) holds in L₁(μ); that is, for every u in L₁(μ),

(a) u* = lim_{n→∞} (1/n) Σ_{t=0}^{n−1} P^t u in L₁(μ), and (b) Pu* = u*.  (7.5.25)

Further, if μ is the unique i.p.m. for P, then u* is a constant μ-a.e. and in fact, by (7.5.24)(b),

u* = ∫ u dμ μ-a.e.  (7.5.26)

Consider now the unichain P.E., so that P has a unique i.p.m. μ, and let (g, h) be a solution to the P.E. with charge c. Moreover, assume that:

c is in L₁(μ), and h satisfies (7.5.9).  (7.5.27)

Then, by (7.5.26) and Corollary 7.5.6(b), we see that g = c*, that is,

g = μ(c) := ∫ c dμ μ-a.e.  (7.5.28)

This gives an "explicit" expression for g in the unichain case. However, we wish to distinguish (7.5.28) from the case where g = μ(c) holds everywhere.

7.5.9 Definition. The P.E. (7.5.1) is called strictly unichain if it is unichain and g(x) = μ(c) for all x ∈ X, where μ denotes the unique i.p.m. of P.
We show next that the P.E. is strictly unichain when the kernel P is w-geometrically ergodic (Definition 7.3.9).

7.5.10 Theorem. Suppose that P is w-geometrically ergodic with limiting p.m. μ. Then:

(a) Each function in B_w(X) is μ-integrable, i.e., B_w(X) ⊂ L₁(μ).

(b) For any function u in B_w(X), lim sup_n ‖Pⁿu‖_w ≤ |μ(u)|. Hence u satisfies (7.5.9), that is, Pⁿu/n → 0 in the w-norm and, of course, pointwise.

(c) μ is the unique i.p.m. of P.

(d) If u ∈ B_w(X) is invariant (or harmonic) with respect to P (i.e., Pu = u), then

u(x) = μ(u) for all x ∈ X.

(e) Let c be an arbitrary function in B_w(X). Then (i) a pair (g, h) of functions in B_w(X) is the unique solution of the strictly unichain P.E. with charge c and

μ(h) = 0  (7.5.29)

if and only if g(x) = μ(c) for all x and

h = Σ_{t=0}^{∞} P^t[c − μ(c)] = lim_{n→∞} Σ_{t=0}^{n−1} P^t[c − μ(c)] in w-norm.  (7.5.30)

Moreover, (ii) for any two solutions (g₁, h₁) and (g₂, h₂) of the strictly unichain P.E. with charge c we have g₁ = g₂ = μ(c), and h₁, h₂ differ at most by an additive constant; in fact, h₁ = h₂ + μ(h₁ − h₂).

Proof. (a) By definition (7.2.1) of the w-norm, for any function u in B_w(X) we have:

∫ |u| dμ ≤ ‖u‖_w ∫ w dμ = ‖u‖_w ‖μ‖_w < ∞,

where the last equality is due to (7.2.4).

(b) This follows from (7.3.8) and the inequality

‖Pⁿu‖_w ≤ ‖Pⁿu − μ(u)‖_w + |μ(u)|.

(c) This fact was already proved in the paragraph after Definition 7.3.9.

(d) If u = Pu, then u = P^t u for all t = 0, 1, .... Thus, the desired conclusion follows from (7.3.8).

(e) The statement (ii) follows from (d) and Corollary 7.5.6(d). Similarly, statement (i) follows from Theorem 7.5.7 if we can show that the functions in (7.5.14) and (7.5.30) coincide. In turn, to prove the latter, first note that if a sequence {S_n} converges in (some) norm to S, then the sequence of Cesàro sums

(1/N) Σ_{n=1}^{N} S_n

also converges in (the same) norm to S. Now, let S_n be the sequence in (7.5.30), i.e.,

S_n := Σ_{t=0}^{n−1} P^t[c − μ(c)],

and use (7.3.8) to show that {S_n} is a Cauchy sequence in B_w(X), and, therefore, it converges to a function, say, S := h in B_w(X). Finally, observe that (7.5.14), with g = μ(c) and the w-norm, is precisely the limit of the Cesàro sums of {S_n}, so that the function in (7.5.14) coincides with S = h. □
7.5.11 Remark. In the context of Theorem 7.5.10 we have the following:

(a) By part (ii) of Theorem 7.5.10(e), for any two solutions (g, h₁), (g, h₂) of the strictly unichain P.E. we have h₁ = h₂ + k, where k is the constant μ(h₁ − h₂). Thus, if we wish to guarantee that the P.E. has a unique solution, it suffices to have k = 0, which is precisely the role of condition (7.5.29). In general, to "fix" a unique h we only need to take any solution (g, h) of the P.E. and replace h by ĥ = h − μ(h). This is tacitly what we did in Theorems 7.5.10(e) and 7.5.7. Indeed, if we look at (7.5.15) we see that the "full form" of h is

h = lim_{N→∞} (1/N) Σ_{n=1}^{N} Σ_{t=0}^{n−1} P^t[c − μ(c)] + μ(h)  (7.5.31)
  = Σ_{t=0}^{∞} P^t[c − μ(c)] + μ(h) [by (7.5.30)].

There are other ways one can "fix" a unique h. For instance, let x̄ be an arbitrary fixed point in X and replace (7.5.29) by the condition h(x̄) = 0. Then in the above notation we again get μ(h₁ − h₂) = 0.
(b) We can replace the "convergence estimate" in Corollary 7.5.6(c) by the following estimate, which is obtained from (7.3.8): For all n ≥ 1,

‖Σ_{t=0}^{n−1} P^t(c − μ(c))‖_w ≤ ‖c‖_w R(1 − ρⁿ)/(1 − ρ) ≤ ‖c‖_w R/(1 − ρ).  (7.5.32)

Observe that (7.5.32) holds for any function c in B_w(X). On the other hand, again from (7.3.8) [or (7.3.7)] one can see that the function h in (7.5.4) can be written in the form (7.5.22)–(7.5.23), where the "limiting kernel (matrix)" P̄(B|x) is the limiting p.m. μ, that is, P̄(·|x) = μ(·) for all x ∈ X. In this case, the "fundamental kernel" Z = (I − P + P̄)⁻¹ is given by

(I − P + μ)⁻¹ = Σ_{t=0}^{∞} (P − μ)^t = Σ_{t=0}^{∞} (P^t − μ). □

For future reference we note that the conclusion of Theorem 7.5.10(d) holds μ-almost everywhere if u is subinvariant (i.e., u ≤ Pu) or superinvariant (i.e., u ≥ Pu). That is:

7.5.12 Lemma. Suppose that P is as in Theorem 7.5.10, and let u be a function in B_w(X) such that either

(a) u(x) ≥ Pu(x) ∀x, or (b) u(x) ≤ Pu(x) ∀x.

Then u(x) = μ(u) = inf_x u(x) μ-a.e. in case (a), and u(x) = μ(u) = sup_x u(x) μ-a.e. in case (b). Hence, in either case, u(x) = μ(u) μ-a.e.

Proof. (a) Suppose that u ≥ Pu. This inequality yields u ≥ P^t u for all t ≥ 0, so that, by (7.3.8), u ≥ P^t u ↓ μ(u). Thus

u(x) ≥ ∫ u dμ for all x.

Hence, letting u_i := inf_x u(x), we obtain u_i ≥ ∫ u dμ ≥ u_i, that is, ∫ u dμ = u_i. Finally, since ∫ (u − u_i) dμ = 0 and u ≥ u_i, we conclude that u = u_i = μ(u) μ-a.e. The proof in case (b) is similar [or apply (a) to −u]. □

C. Examples

In general, solving the P.E. may be a challenging problem. There are


cases, however, in which one can obtain, or perhaps "guess", a solution by

some iterative procedure. For instance, consider the Markov chain in Examples 7.4.3 and 7.5.2. In the latter example we mentioned that a solution to the P.E. with charge c(0) := −q and c(i) := 1 for all i ≥ 1 is given by the pair (g, h) with

g(·) ≡ 0 and h(i) = i − q ∀i = 0, 1, ...,  (7.5.33)

where

q := Σ_{i=1}^{∞} i q(i) < ∞.  (7.5.34)

The question is, how did we get the solution (7.5.33)? To see this, recall that in Example 7.4.3 it is shown that, under (7.5.34), the chain has the i.p.m.

μ(i) = μ(0) Σ_{k=i}^{∞} q(k) ∀i ≥ 0, with μ(0) = 1/(1 + q).  (7.5.35)

Therefore, the constant g(·) ≡ μ(c) is 0, since (7.5.34) and (7.5.35) yield

μ(c) = Σ_{i=0}^{∞} c(i)μ(i) = c(0)μ(0) + μ(0) Σ_{i=1}^{∞} i q(i) = μ(0)(−q + q) = 0.

Hence to solve the P.E. (7.5.1) it suffices to consider (7.5.1)(b), which becomes

h(i) = c(i) + Ph(i) for all i = 0, 1, ....  (7.5.36)

To compute

Ph(i) = Σ_{j=0}^{∞} h(j)p(j|i),

recall from Example 7.4.3 that the transition probabilities p(j|i) are given by

p(i − 1|i) = 1 ∀i ≥ 1, and p(j|0) = q(j) ∀j ≥ 0.

Consequently,

Ph(i) = h(i − 1) for i ≥ 1, and Ph(0) = Σ_{j=0}^{∞} h(j)q(j).  (7.5.37)

Then, replacing the values (7.5.37) in (7.5.36), we obtain

h(i) = i − q + Ph(0) ∀i ≥ 0,  (7.5.38)

which together with g = 0 yields a solution (g, h) to the P.E. (7.5.1). Further, in view of the remark after Definition 7.5.1, in (7.5.38) we may subtract the constant Ph(0) and so we obtain the solution (7.5.33). It is also illustrative, on the other hand, to verify other conditions in the results
also illustrative, on the other hand, to verify other conditions in the results

of this section. For instance, as shown in Example 7.4.3, the corresponding Markov chain can be described by the equation

x_{t+1} = x_t − 1 + (z_t + 1)I₀(x_t), t = 0, 1, ....

Thus, writing the charge c as c(i) = 1 − (1 + q)I₀(i) and using (7.5.38), a direct calculation shows that the sequence M_n in (7.5.6) satisfies

M_{n+1} = M_n + c(x_n) + h(x_{n+1}) − h(x_n) = M_n + (z_n − q)I₀(x_n),

from which one can immediately verify, say, the martingale condition in Theorem 7.5.5(c).
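The computations above can be replayed numerically for a concrete choice of q(·). In the sketch below, q is a truncated and renormalized Poisson(2) pmf, an illustrative choice only; the checks confirm h − Ph = c (the P.E. with g = 0) and μ(c) = 0 on the truncated state space.

```python
import numpy as np

N = 60                                  # truncation level of the state space
j = np.arange(N)
q = np.zeros(N)
q[0] = np.exp(-2.0)
for k in range(1, N):
    q[k] = q[k - 1] * 2.0 / k           # Poisson(2) pmf, built recursively
q = q / q.sum()                          # renormalize after truncation

qbar = float(j @ q)                      # qbar = sum_i i q(i), cf. (7.5.34)
c = np.ones(N)
c[0] = -qbar                             # the charge: c(0) = -qbar, c(i) = 1
h = j - qbar                             # candidate solution (7.5.33), with g = 0

# Ph(i) = h(i-1) for i >= 1 and Ph(0) = sum_j h(j) q(j)   [cf. (7.5.37)]
Ph = np.empty(N)
Ph[1:] = h[:-1]
Ph[0] = float(h @ q)
assert np.allclose(h - Ph, c)            # the P.E. h = c + Ph holds (g = 0)

# The i.p.m. (7.5.35): mu(i) = mu(0) * sum_{k>=i} q(k), with mu(0) = 1/(1+qbar).
mu = np.cumsum(q[::-1])[::-1] / (1.0 + qbar)
assert abs(mu.sum() - 1.0) < 1e-9
assert abs(float(mu @ c)) < 1e-9         # mu(c) = g = 0
```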
In the above example, finding a solution to the P.E. was very easy because of the special form of the charge and the transition probabilities. In more general situations, one can try to obtain an "estimate" of g = μ(c) and then use the sequence

Σ_{t=0}^{n−1} P^t[c − μ(c)]  (7.5.39)

in (7.5.30) to "guess" a feasible function h; finally, one would need of course to check that (g, h) is indeed a solution of the P.E. The following example shows how this procedure is supposed to work. For ease of exposition we restrict ourselves to a scalar case, but it should be clear that similar calculations can be done for the multivariable linear system (7.4.10).
7.5.13 Example: Linear systems (d = 1). Consider the scalar (d = 1) linear system

x_{t+1} = F x_t + z_t, t = 0, 1, ...,  (7.5.40)

under the hypotheses (a) and (b) in Example 7.4.4. In particular, in the present case the hypothesis (a) translates into

|F| < 1.  (7.5.41)

We assume of course that F ≠ 0. In addition, we will assume that the i.i.d. disturbances z_t have zero mean (which greatly simplifies the calculations) and finite second moment:

E(z₀) = 0, and σ² := E(z₀²) < ∞.  (7.5.42)

The charge c(x) is supposed to be the quadratic function

c(x) = γx², x ∈ ℝ,  (7.5.43)

for some constant γ, and we take the weight function w := w₂ in (7.4.11).
Step 1. By (7.3.8), to "estimate" μ(c) it suffices to compute

E_x c(x_t) = ∫ c(y)P^t(dy|x), ∀x₀ = x,

and then let t → ∞. To do this, observe that, for any t ≥ 1,

E[c(x_t)|x_{t−1}] = γ[F²x²_{t−1} + E(z²_{t−1}) + 2Fx_{t−1}E(z_{t−1})] = F²c(x_{t−1}) + γσ².

Therefore,

E_x c(x_t) = F² E_x c(x_{t−1}) + γσ² ∀t ≥ 1,

and iterating we get

E_x c(x_t) = F^{2t} c(x) + γσ² Σ_{k=0}^{t−1} F^{2k},

i.e.,

E_x c(x_t) = F^{2t} c(x) + γσ²(1 − F^{2t})/(1 − F²), t ≥ 1.  (7.5.44)

Thus, by (7.5.41), letting t → ∞ we obtain

μ(c) = γσ²/(1 − F²).  (7.5.45)

Observe that we can write (7.5.44) as

E_x c(x_t) = [c(x) − μ(c)]F^{2t} + μ(c), ∀t ≥ 0, x ∈ ℝ.  (7.5.46)

Step 2. Compute (7.5.39) and let n → ∞. From (7.5.46),

Σ_{t=0}^{n−1} E_x c(x_t) = [c(x) − μ(c)](1 − F^{2n})/(1 − F²) + nμ(c).

Hence, (7.5.39) becomes

Σ_{t=0}^{n−1} [E_x c(x_t) − μ(c)] = [c(x) − μ(c)](1 − F^{2n})/(1 − F²),

and letting n → ∞, (7.5.30) yields that

h(x) = [c(x) − μ(c)]/(1 − F²), x ∈ ℝ,  (7.5.47)

which obviously satisfies (7.5.29).

In conclusion, by Theorem 7.5.10(e), the pair (g, h) with g(·) ≡ μ(c) in (7.5.45), and h in (7.5.47), is the unique solution of the strictly unichain P.E. that satisfies the constraint (7.5.29). □
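The conclusion of Example 7.5.13 is easy to verify numerically. In the sketch below, the values of F, γ and σ are illustrative choices; the first check confirms the P.E. h − Ph = c − μ(c) on a grid, and the second is a crude Monte Carlo sanity check of (7.5.45) along one simulated path.

```python
import numpy as np

F, gamma, sigma = 0.8, 2.0, 1.5        # illustrative parameter values
mu_c = gamma * sigma**2 / (1 - F**2)   # (7.5.45)

def c(x):  return gamma * x**2                   # (7.5.43)
def h(x):  return (c(x) - mu_c) / (1 - F**2)     # (7.5.47)

# E[c(x_1)|x_0 = x] = F^2 c(x) + gamma sigma^2, hence
# Ph(x) = E[h(x_1)|x_0 = x] = (F^2 c(x) + gamma sigma^2 - mu_c)/(1 - F^2).
x = np.linspace(-5.0, 5.0, 101)
Ph = (F**2 * c(x) + gamma * sigma**2 - mu_c) / (1 - F**2)
assert np.allclose(h(x) - Ph, c(x) - mu_c)       # the P.E. h - Ph = c - mu(c)

# Monte Carlo check of mu(c): time-average of c(x_t) along one path of (7.5.40).
rng = np.random.default_rng(0)
z = sigma * rng.standard_normal(200_000)
xt, total = 0.0, 0.0
for zt in z:
    total += c(xt)
    xt = F * xt + zt
assert abs(total / len(z) - mu_c) < 0.5
```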
7.5.14 Example: Iterated function systems. Consider the IFS of Example 7.4.7 in the special case in which X = ℝ, and p_i(x) = L_i x + l_i, i = 1, 2, ..., N. Equivalently, the IFS is the "linear system"

x_{t+1} = L_{z_{t+1}} x_t + l_{z_{t+1}}, t = 0, 1, ...,

with random coefficients. In analogy with (7.4.22), we shall assume that L₁, ..., L_N satisfy the condition

Σ_{i=1}^{N} |L_i| q_i < 1.  (7.5.48)

Now define the numbers

L := E(L_{z₁}) = Σ_{i=1}^{N} L_i q_i, and l := E(l_{z₁}) = Σ_{i=1}^{N} l_i q_i,

and consider the charge c(x) = x for all x ∈ X. To solve the strictly unichain P.E. with charge c we may proceed exactly as in Example 7.5.13. In fact,

E(x_t|x_{t−1}) = L x_{t−1} + l for all t ≥ 1 [cf. (7.4.19)],

so that

E_x(x_t) = L E_x(x_{t−1}) + l = L^t x + l(1 − L^t)/(1 − L)

for all t ≥ 0. Finally, as in (7.5.44)–(7.5.47), using (7.5.48) we obtain

μ(c) = l/(1 − L), and h(x) = [x − μ(c)]/(1 − L), x ∈ X,

which is a solution to the P.E. with charge c(x) = x. □
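As in the previous example, the IFS solution can be checked numerically. The two affine maps, their coefficients and the probabilities q_i below are illustrative choices satisfying (7.5.48); the assertions verify the P.E. on a grid and the value of μ(c) by simulation.

```python
import numpy as np

# Two illustrative affine maps p_i(x) = L_i x + l_i chosen with probabilities q_i.
Ls = np.array([0.5, -0.3])
ls = np.array([1.0, 2.0])
qs = np.array([0.6, 0.4])
assert float(np.abs(Ls) @ qs) < 1.0      # the contraction condition (7.5.48)

L = float(Ls @ qs)                       # L = E(L_{z_1})
l = float(ls @ qs)                       # l = E(l_{z_1})
mu_c = l / (1 - L)

def h(x):  return (x - mu_c) / (1 - L)

# E[x_1 | x_0 = x] = L x + l, so Ph(x) = (L x + l - mu_c)/(1 - L); check the P.E.
x = np.linspace(-4.0, 4.0, 81)
Ph = (L * x + l - mu_c) / (1 - L)
assert np.allclose(h(x) - Ph, x - mu_c)  # h - Ph = c - mu(c) with c(x) = x

# Monte Carlo: the path average of x_t approximates mu(c).
rng = np.random.default_rng(1)
T = 100_000
idx = rng.choice(2, size=T, p=qs)        # random map index at each step
xt, total = 0.0, 0.0
for i in idx:
    total += xt
    xt = Ls[i] * xt + ls[i]
assert abs(total / T - mu_c) < 0.05
```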


Notes on §7.5

1. Most of the results in subsection A are from Hernandez-Lerma and


Lasserre [6], where other approaches to the multichain P.E. are studied;
see also papers [4] and [7] by the same authors, and Glynn and Meyn [1].
The concept of c-canonical pairs (Definition 7.5.4) is an adaptation to non-
controlled Markov chains of the canonical triplets studied in §5.2 for MCPs,
and which will also appear in later chapters. The P.E. also arises in potential
theory (Nummelin [2], Revuz [1], Syski [1]) and stochastic approximation
(Metivier and Priouret [1]), for instance.
2. For the special case of Markov chains with a countable state space
X (in particular finite, as in Remark 7.5.8) see, for instance, Hordijk and
Spieksma [1], Makowski and Shwartz [1], and Puterman [1].
8
Discounted Dynamic Programming
with Weighted Norms

8.1 Introduction
In this chapter we consider the infinite-horizon discounted cost problem for
a Markov control model (X, A, {A(x)lx EX}, Q, c). We already studied this
problem using dynamic programming and linear programming in Chapters
4 and 6, respectively. Here we use again dynamic programming, so it is
important to state at the outset the differences between this chapter and
Chapter 4.
The main difference lies in the assumptions. In Chapter 4 we considered
nonnegative cost-per-stage functions c, virtually without restriction in their
"growth rate", and we allowed non-compact control-constraint sets A(x).
In contrast, the hypotheses in this chapter-see Assumptions 8.3.1, 8.3.2
and 8.3.3, or 8.5.1, 8.5.2 and 8.5.3-require the sets A(x) to be compact,
but the cost functions c are allowed to take positive and negative values,
provided that they satisfy a certain growth condition (Assumption 8.3.2).
The corresponding dynamic programming theorems (Theorems 4.2.3 and
8.3.6) turn out to be very similar, except that in the present context the
dynamic programming operator is a contraction operator with respect to
a weighted w-norm (Proposition 8.3.9), which yields the w-geometric con-
vergence of the value iteration (VI) algorithm [see (8.3.15)]. The latter
fact, w-geometric convergence of the VI functions, gives many interest-
ing results-such as evaluation of rolling horizon procedures, criteria for
elimination of nonoptimal control actions, and existence and detection of
forecast horizons-which are practically impossible to get in a context as
O. Hernández-Lerma et al., Further Topics on Discrete-Time Markov Control Processes
© Springer Science+Business Media New York 1999

general as that of Chapter 4.


To summarize, except for the sign of c, this chapter considers assumptions
less general than those of Chapter 4, but in return we obtain much more
information from the dynamic programming techniques.
What follows is an outline of the contents of this chapter. Section 8.2
presents an abridged version of §2.2 and §2.3; namely, it summarizes some
of the main underlying concepts for a Markov control process (MCP), which
are a Markov control model, and the admissible control policies. Section 8.3
deals with discounted Dynamic Programming (DP). The main result here
is the DP Theorem 8.3.6, which concerns the discounted-cost optimality
equation (or DCOE) (8.3.4), the convergence of the VI functions to the
value function, and the existence of discount-optimal policies. In §8.4 we
make a further analysis of the VI functions. The main question dealt with
is, how "close" is a VI-policy to being optimal? The results in §8.3 and §8.4
are obtained under a set of hypotheses (Assumptions 8.3.1 to 8.3.3) which
in particular impose a form of "strong continuity" on the controlled pro-
cess' transition law Q-see Assumption 8.3.1(c). However, in §8.5 we show
that those results remain valid under a "weak continuity" (weak Feller-like)
condition on Q provided we change accordingly the hypotheses on the con-
trol model. The basic idea is to change the "measurability" requirements in
Assumptions 8.3.1, 8.3.2, 8.3.3 to "continuity" conditions in Assumptions
8.5.1, 8.5.2, 8.5.3. Finally, in §8.6 we present a couple of examples, and we
conclude in §8.7 with some general remarks.

8.2 The control model and control policies


Let M := (X, A, {A(x) | x ∈ X}, Q, c) be a Markov control model (MCM), where X and A are Borel spaces which denote the state space and the action (or control) set, respectively. For every state x ∈ X, A(x) is a nonempty Borel subset of A whose elements are the feasible actions (or controls) if the system is in the state x. The set of feasible state-action pairs, namely

𝕂 := {(x, a) | x ∈ X, a ∈ A(x)},  (8.2.1)

is assumed to be a Borel subset of X × A. Moreover, the transition law

Q = {Q(B|x, a) | B ∈ B(X), (x, a) ∈ 𝕂}

is a stochastic kernel on X given 𝕂, and, finally, the measurable function c : 𝕂 → ℝ denotes the cost-per-stage (or one-stage cost) function.

8.2.1 Definition. 𝔽 denotes the set of measurable functions f : X → A such that f(x) is in A(x) for all x ∈ X, and Φ stands for the set of stochastic kernels φ on A given X for which φ(A(x)|x) = 1 for all x ∈ X. The functions in 𝔽 are referred to as decision functions or selectors of the set-valued mapping x ↦ A(x).

A selector f ∈ 𝔽 may be identified with the stochastic kernel φ ∈ Φ for which φ(·|x) is the Dirac measure at f(x) for all x ∈ X. Hence, we have 𝔽 ⊂ Φ.

We shall assume that 𝔽 is nonempty, or, equivalently, that the set 𝕂 in (8.2.1) contains the graph of a measurable function from X to A. This assumption ensures that the set of control policies, defined below, is nonempty.
8.2.2 Definition. (Control policies.) For every t = 0, 1, ..., let H_t be the family of admissible histories up to time t; that is, H₀ := X, and

H_t := 𝕂^t × X = 𝕂 × H_{t−1} for t = 1, 2, ....

An element of H_t, called a "t-history", is a vector of the form

h_t = (x₀, a₀, ..., x_{t−1}, a_{t−1}, x_t)  (8.2.2)

with (x_i, a_i) ∈ 𝕂 for i = 0, ..., t − 1, and x_t ∈ X. A (randomized) control policy is a sequence π = {π_t} of stochastic kernels π_t on the control set A given H_t that satisfy the constraint

π_t(A(x_t)|h_t) = 1 for all h_t ∈ H_t, t = 0, 1, ....  (8.2.3)

The set of all control policies is denoted by Π. Moreover, a control policy π = {π_t} is said to be a:

(a) randomized Markov policy if there is a sequence {φ_t} of stochastic kernels φ_t ∈ Φ such that

π_t(·|h_t) = φ_t(·|x_t) for all h_t ∈ H_t, t = 0, 1, ...;  (8.2.4)

(b) randomized stationary policy if (8.2.4) holds for a stochastic kernel φ ∈ Φ independent of t, i.e., π_t(·|h_t) = φ(·|x_t) for all h_t ∈ H_t and t = 0, 1, ...;

(c) deterministic (or pure) policy if there is a sequence {g_t} of measurable functions g_t : H_t → A such that, for every h_t ∈ H_t and t = 0, 1, ..., we have g_t(h_t) ∈ A(x_t) and π_t(·|h_t) is the Dirac measure concentrated at g_t(h_t);

(d) deterministic Markov policy if there is a sequence {f_t} of selectors f_t ∈ 𝔽 such that π_t(·|h_t) is the Dirac measure at f_t(x_t) ∈ A(x_t) for all h_t ∈ H_t and t = 0, 1, ...;

(e) deterministic stationary policy if there is a selector f ∈ 𝔽 such that π_t(·|h_t) is the Dirac measure at f(x_t) ∈ A(x_t) for all h_t ∈ H_t and t = 0, 1, ....

We denote by Π_RM the family of randomized Markov policies, and by Π_RS the subfamily of randomized stationary policies. Similarly, Π_D ⊃ Π_DM ⊃ Π_DS denote the family of deterministic policies, and the subfamilies of deterministic Markov and deterministic stationary policies, respectively.

8.2.3 Remark. (a) If π is in Π_RS and φ ∈ Φ is as in Definition 8.2.2(b), then we write π as φ^∞. Similarly, if π is a deterministic stationary policy as in Definition 8.2.2(e), we write π as f^∞.

(b) Let 𝔽 and Φ be the sets in Definition 8.2.1, and let 𝕂 be as in (8.2.1). If G is a function on 𝕂 we write

G(x, φ) := ∫_A G(x, a)φ(da|x) if φ is in Φ,

which reduces to

G(x, f) := G(x, f(x)) for f in 𝔽.

In particular, for the cost function c and the transition law Q we write

c(x, φ) := ∫_A c(x, a)φ(da|x), Q(·|x, φ) := ∫_A Q(·|x, a)φ(da|x)  (8.2.5)

if φ is in Φ, and

c(x, f) := c(x, f(x)), Q(·|x, f) = Q(·|x, f(x)) if f ∈ 𝔽.  (8.2.6)

(c) Let π = {π_t} be an arbitrary control policy, and ν an arbitrary "initial distribution" on B(X). Further, let (Ω, 𝓕) be the measurable space consisting of the (canonical) "sample space" Ω := (X × A)^∞ and the corresponding product σ-algebra 𝓕. Then π and ν determine a probability measure (p.m.) P_ν^π and a stochastic process {(x_t, a_t), t = 0, 1, ...} on Ω, where x_t and a_t represent the state and the control action at time t, respectively. (See §2.2.) The expectation operator with respect to P_ν^π is denoted by E_ν^π. If ν is the Dirac measure concentrated at the initial state x₀ = x, we write P_ν^π and E_ν^π as P_x^π and E_x^π, respectively.

(d) The p.m. P_ν^π in part (c) has the following properties: For every B ∈ B(X) and C ∈ B(A), and every t-history h_t ∈ H_t as in (8.2.2), t = 0, 1, ...,

P_ν^π(x₀ ∈ B) = ν(B)  (8.2.7)
P_ν^π(a_t ∈ C|h_t) = π_t(C|h_t)  (8.2.8)
P_ν^π(x_{t+1} ∈ B|h_t, a_t) = Q(B|x_t, a_t).  (8.2.9)

In particular, from (8.2.7)–(8.2.9) one can deduce that if π = {φ_t} ∈ Π_RM is a randomized Markov policy, then the state process {x_t} is a nonhomogeneous Markov chain with transition kernels Q(·|·, φ_t), t = 0, 1, ...; that is, for every B ∈ B(X) and t = 0, 1, ... [and using the notation in (b)],

P_ν^π(x_{t+1} ∈ B|x_t) = Q(B|x_t, φ_t).  (8.2.10)

For a deterministic Markov policy π = {f_t} ∈ Π_DM, the corresponding transition kernels are Q(·|·, f_t). Moreover, for stationary policies φ^∞ ∈ Π_RS or f^∞ ∈ Π_DS, {x_t} is a time-homogeneous Markov chain with transition kernels Q(·|·, φ) and Q(·|·, f), respectively. In the latter case, the n-step transition kernel is written as Qⁿ(·|·, φ), i.e.,

Qⁿ(B|x, φ) := Prob(x_n ∈ B|x₀ = x),  (8.2.11)

and similarly for Qⁿ(·|·, f). [For a proof of (8.2.10) see Proposition 2.3.5.] □

8.3 The optimality equation


As in the previous section, let (X, A, {A(x) | x ∈ X}, Q, c) be a Markov control model, which is fixed throughout the remainder of this chapter, and consider the α-discounted cost (abbreviated α-DC, or simply DC)

V(π, x) := E_x^π [Σ_{t=0}^{∞} α^t c(x_t, a_t)], π ∈ Π, x ∈ X,  (8.3.1)

where α ∈ (0, 1) is a given discount factor. The corresponding α-discount value function (or α-discount optimal cost function) is

V*(x) := inf_{π∈Π} V(π, x), x ∈ X.  (8.3.2)

As in Chapter 4, the main problem we are concerned with is the calculation of V* and the search for an α-discount optimal policy, that is, a control policy π* such that

V(π*, x) = V*(x) ∀x ∈ X.  (8.3.3)

In Theorem 4.2.3 we showed in particular that, under quite general hypotheses, V* is the pointwise-minimal solution of the α-discounted cost optimality equation (α-DCOE)

V*(x) = min_{a∈A(x)} [c(x, a) + α ∫_X V*(y)Q(dy|x, a)] ∀x ∈ X.  (8.3.4)

In this section we shall prove a similar result (Theorem 8.3.6), which cannot be obtained from Theorem 4.2.3 because now we will consider cost-per-stage functions c that can take negative values, whereas all of Chapter 4 deals with nonnegative cost functions. Moreover, for application in later sections, we wish to obtain "good" estimates of the rate of convergence of the α-value iteration functions to V* [see (8.3.15)], which are virtually impossible to get in a context as general as that of Theorem 4.2.3.

Hence, in this chapter we impose three different hypotheses. The first one, Assumption 8.3.1, is about the usual compactness-continuity conditions for Markov control models [and in fact is the same as Condition 3.3.2(a), (b), (c2)]. The second, Assumption 8.3.2, uses a weight function w to impose a growth condition on the cost function and, in addition, it will yield that the dynamic programming (DP) operator T_α in (8.3.17) is a contraction on the space B_w(X) introduced in §7.2.A, namely, the Banach space of measurable functions on X with a finite w-norm. Finally, the third hypothesis, Assumption 8.3.3, is a further continuity condition that, combined with the previous assumptions, will ensure the existence of "measurable minimizers" for T_α (see Proposition 8.3.9). In §8.5 we introduce a different set of hypotheses.

A. Assumptions

8.3.1 Assumption. For every state x ∈ X:

(a) The control-constraint set A(x) is compact;

(b) The cost-per-stage c(x, a) is lower semicontinuous (l.s.c.) in a ∈ A(x); and

(c) The function u'(x, a) := ∫ u(y)Q(dy|x, a) is continuous in a ∈ A(x) for every function u in B(X), where B(X) denotes the Banach space of real-valued bounded measurable functions u on X, with the sup norm ‖u‖ := sup_x |u(x)| (see §7.2.A).

Remark. Assumption 8.3.1(c) is equivalent to the apparently weaker condition:

(c') u'(x, a) is l.s.c. in a ∈ A(x) for every nonnegative function u in B(X).

Indeed, it is obvious that (c) implies (c'). Conversely, suppose that (c') holds and let u be an arbitrary function in B(X). Then u + ‖u‖ is nonnegative, and, therefore, by (c'), the function

∫ (u(y) + ‖u‖)Q(dy|x, a) = u'(x, a) + ‖u‖

is l.s.c. in a ∈ A(x), which implies that u'(x, a) is l.s.c. in a ∈ A(x). In other words, u'(x, a) is l.s.c. in a ∈ A(x) for any function u in B(X). Moreover, applying the latter fact to −u, we can see that u'(x, a) is also upper semicontinuous (u.s.c.) in a ∈ A(x). Hence, as u'(x, ·) is both l.s.c. and u.s.c., it is in fact continuous, and (c) follows. □
8.3.2 Assumption. There exist nonnegative constants c̄ and β, with 1 ≤ β < 1/α, and a weight function w ≥ 1 on X such that for every state x ∈ X:

(a) sup_{a∈A(x)} |c(x, a)| ≤ c̄ w(x); and

(b) sup_{a∈A(x)} ∫ w(y)Q(dy|x, a) ≤ βw(x).  (8.3.5)

Further, we suppose that w satisfies:

8.3.3 Assumption. For every state x ∈ X, the function w'(x, a) := ∫ w(y)Q(dy|x, a) is continuous in a ∈ A(x).
Concerning the requirement β ≥ 1 in Assumption 8.3.2, see Note 4 at the end of this section.

An obvious sufficient condition for Assumption 8.3.2 is that c is bounded; that is, there is a constant c̄ such that |c(x, a)| ≤ c̄ for all (x, a) in 𝕂. In this case, the function w ≥ 1 can be taken to be bounded. Another sufficient condition is given in Remark 8.3.5(a) and also in Note 1 at the end of the section. On the other hand, we have:
8.3.4 Proposition. Assumption 8.3.2 implies that

C(x) := Σ_{t=0}^{∞} α^t c_t(x) < ∞ for every x ∈ X,  (8.3.6)

where c₀(x) := sup_{a∈A(x)} |c(x, a)| and

c_t(x) := sup_{a∈A(x)} ∫ c_{t−1}(y)Q(dy|x, a) for t = 1, 2, ...  (8.3.7)

are assumed to be measurable functions. Conversely, if (i) C ≥ 1, and (ii) the inequality in (8.3.6) is satisfied for some α₀ with α < α₀ < 1, then Assumption 8.3.2 holds with

c̄ := 1, w(x) := C(x), and β := α₀⁻¹.  (8.3.8)

Proof. If Assumption 8.3.2 holds, a straightforward induction argument shows that

c_t(x) ≤ c̄ β^t w(x) ∀t = 0, 1, ..., x ∈ X.  (8.3.9)

Hence, as 0 ≤ αβ < 1, we get

C(x) ≤ c̄ w(x)/(1 − αβ) < ∞ for every x ∈ X;  (8.3.10)

that is, (8.3.6) holds.

Conversely, by (ii) and (8.3.7),

∫ C(y)Q(dy|x, a) = Σ_{t=0}^{∞} α₀^t ∫ c_t(y)Q(dy|x, a) ≤ Σ_{t=0}^{∞} α₀^t c_{t+1}(x) = α₀⁻¹[C(x) − c₀(x)],

i.e.,

∫ C(y)Q(dy|x, a) ≤ α₀⁻¹ C(x).

This inequality, combined with the conditions (i), (ii), yields the desired conclusion. □
8.3.5 Remark. (a) In several places we shall consider inequalities of the form (7.3.9) or (7.3.10), so it is important to note that many results in this chapter remain valid if part (b) in Assumption 8.3.2 is replaced by an inequality of the form (8.3.11) below, where γ > 0 is not necessarily ≥ 1, in contrast to (8.3.5) where β ≥ 1. More precisely, we have:

Assumption 8.3.2 is satisfied if there exist a real-valued measurable function w' ≥ 1 on X and positive constants m, γ and b such that αγ < 1 and, for every state x ∈ X,

(i) sup_{a∈A(x)} |c(x, a)| ≤ m w'(x), and

(ii) sup_{a∈A(x)} ∫ w'(y)Q(dy|x, a) ≤ γ w'(x) + b.  (8.3.11)

Indeed, let C(x) and c_t(x) be as in (8.3.6) and (8.3.7), respectively, except that c₀ is redefined as

c₀(x) := 1 + sup_{a∈A(x)} |c(x, a)|.

Observe that, by (i), we have c₀ ≤ M w' with M := 1 + m. Then instead of (8.3.9) we obtain

c_t(x) ≤ M w'(x)γ^t + M b Σ_{j=0}^{t−1} γ^j ∀t = 1, 2, ...,

and so (8.3.10) becomes

C(x) ≤ M w'(x)/(1 − αγ) + M b α/[(1 − α)(1 − αγ)].

Therefore, by the "converse" of Proposition 8.3.4, there exist constants c̄ and β, and a function w(·) ≥ 1 that satisfy Assumption 8.3.2.

(b) In analogy with (7.2.8) [equivalently, (7.2.7)] we may define the w-norm of the transition law Q as

‖Q‖_w := sup w(x)⁻¹ ∫_X w(y)Q(dy|x, a),

where the sup is over all feasible state-action pairs (x, a) in 𝕂. Hence, under (8.3.5), we have ‖Q‖_w ≤ β < ∞, and so Proposition 7.2.5 is applicable in the present "controlled" context. □

B. The discounted-cost optimality equation

To state our main result in this section, we first recall from §4.2 the definition of the $\alpha$-value iteration (or $\alpha$-VI) functions
$$v_n(x) := \min_{A(x)} \Big[ c(x,a) + \alpha \int_X v_{n-1}(y)\,Q(dy|x,a) \Big] \qquad (8.3.12)$$
for all $n \ge 1$ and $x \in X$, with $v_0(\cdot) \equiv 0$. For every $n = 1,2,\dots$, $v_n$ is the optimal $n$-stage cost [see (3.4.11)], i.e.,
$$v_n(x) = \inf_{\pi} V_n(\pi, x), \qquad (8.3.13)$$
where
$$V_n(\pi, x) := E_x^\pi \Big[ \sum_{t=0}^{n-1} \alpha^t c(x_t, a_t) \Big]. \qquad (8.3.14)$$
The following theorem states among other things that the sequence $\{v_n\}$ converges geometrically in the $w$-norm to $V^*$ [see (8.3.2)].
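To make the recursion (8.3.12) concrete, here is a minimal sketch of $\alpha$-value iteration for a hypothetical finite model; the costs, transition law, and discount factor are assumptions invented for illustration, not data from the text.

```python
NS, NA = 3, 2            # hypothetical numbers of states and actions
alpha = 0.9              # discount factor (an assumption for the example)
c = [[1.0, 2.0], [0.5, 1.5], [2.0, 0.5]]    # c[x][a], invented costs
Q = [                                        # Q[a][x][y], invented transition law
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.3, 0.7]],
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.2, 0.2, 0.6]],
]

def T(v):
    """One value-iteration step (8.3.12): (Tv)(x) = min_a [c(x,a) + alpha * E_Q v]."""
    return [min(c[x][a] + alpha * sum(Q[a][x][y] * v[y] for y in range(NS))
                for a in range(NA))
            for x in range(NS)]

v = [0.0] * NS           # v_0 == 0
for _ in range(200):     # v_n converges geometrically to V*, cf. (8.3.15)
    v = T(v)
```

Since the invented costs are bounded, one may take $w \equiv 1$, so $\beta = 1$ and $\gamma = \alpha$; after $n$ steps the error $\|v_n - V^*\|$ is at most $c\,\gamma^n/(1-\gamma)$.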
8.3.6 Theorem. Suppose that Assumptions 8.3.1, 8.3.2 and 8.3.3 hold. Let $\beta$ be the constant in (8.3.5), and define $\gamma := \alpha\beta$. Then:

(a) The $\alpha$-discount value function $V^*$ is the unique solution of the $\alpha$-DCOE (8.3.4) in the space $\mathbb{B}_w(X)$, and
$$\|v_n - V^*\|_w \le c\,\gamma^n/(1 - \gamma), \quad n = 1,2,\dots, \qquad (8.3.15)$$
where $c$ is the constant in Assumption 8.3.2(a).

(b) There exists a selector $f_* \in \mathbb{F}$ such that $f_*(x) \in A(x)$ attains the minimum in (8.3.4) for every state $x$, that is [using the notation in (8.2.6)],
$$V^*(x) = c(x, f_*) + \alpha \int V^*(y)\,Q(dy|x, f_*) \quad \forall x \in X, \qquad (8.3.16)$$
and the deterministic stationary policy $f_*^\infty \in \Pi_{DS}$ is $\alpha$-discount optimal; conversely, if $f_*^\infty \in \Pi_{DS}$ is $\alpha$-discount optimal, then it satisfies (8.3.16).

(c) A policy $\pi^*$ is $\alpha$-discount optimal if and only if the corresponding cost function $V(\pi^*, \cdot)$ satisfies the $\alpha$-DCOE.

(d) If an $\alpha$-discount optimal policy exists, then there exists a deterministic stationary policy which is $\alpha$-discount optimal.

Except for the $w$-geometric convergence in (8.3.15), Theorem 8.3.6 looks very much the same as Theorem 4.2.3; consequently, it is important to note that neither of these two results can be obtained from the other. First of all, it is obvious that Theorem 8.3.6 does not yield Theorem 4.2.3 because the latter allows noncompact control-constraint sets $A(x)$, in contrast to the compactness required in Assumption 8.3.1(a). On the other hand, Theorem 4.2.3 deals exclusively with nonnegative cost functions $c$, and so it cannot be used to obtain Theorem 8.3.6. However, if we restrict ourselves to cost functions $c \ge 0$, then Theorem 8.3.6 turns out to be a strengthened form of Theorem 4.2.3. (Later on we will need to consider cost functions with negative values.)

The remainder of this section is devoted to proving Theorem 8.3.6, which requires several preliminary results. First we shall introduce the DP (dynamic programming) operator $T_\alpha$.

C. The dynamic programming operator

Given a measurable function $u : X \to \mathbb{R}$ and $0 \le \alpha \le 1$, we denote by $T_\alpha u$ the function given by
$$T_\alpha u(x) := \inf_{A(x)} \Big[ c(x,a) + \alpha \int_X u(y)\,Q(dy|x,a) \Big], \quad x \in X, \qquad (8.3.17)$$
whenever the integral is well defined. If the infimum in (8.3.17) is actually attained at some action $a \in A(x)$ for every $x \in X$, then we shall write "min" in lieu of "inf". For $0 < \alpha < 1$, we shall prove that $T_\alpha$ is a contraction operator on the space $\mathbb{B}_w(X)$ (Proposition 8.3.9), which uses part (a) in the following lemma; the Fatou-like results in part (b) are also used below.
8.3.7 Lemma. Suppose that Assumptions 8.3.3 and 8.3.1(c) hold. Then:

(a) The function $u'(x,a) := \int u(y)\,Q(dy|x,a)$ is continuous in $a \in A(x)$ for every $x \in X$ and every function $u$ in $\mathbb{B}_w(X)$.

(b) (An extension of Fatou's Lemma.) Let $\{u_n\}$ be a bounded sequence in $\mathbb{B}_w(X)$, that is, there is a constant $K$ such that $\|u_n\|_w \le K$ for all $n$, and define
$$u^I(x) := \liminf_{n\to\infty} u_n(x) \quad\text{and}\quad u^S(x) := \limsup_{n\to\infty} u_n(x).$$
Then for any state $x \in X$ and any sequence $\{a_n\}$ in $A(x)$ such that $a_n \to a$ in $A(x)$, we have
$$\liminf_{n\to\infty} \int u_n(y)\,Q(dy|x, a_n) \ge \int u^I(y)\,Q(dy|x, a), \qquad (8.3.18)$$

and
$$\limsup_{n\to\infty} \int u_n(y)\,Q(dy|x, a_n) \le \int u^S(y)\,Q(dy|x, a). \qquad (8.3.19)$$
Hence, if $u_n \to u$ (that is, $u^I = u^S = u$), then
$$\lim_{n\to\infty} \int u_n(y)\,Q(dy|x, a_n) = \int u(y)\,Q(dy|x, a). \qquad (8.3.20)$$

Proof. (a) Let $u$ be a function in $\mathbb{B}_w(X)$, so that $|u(x)| \le m\,w(x)$ for all $x \in X$, where $m := \|u\|_w$. Then $u_m := u + m\,w$ is a nonnegative function in $\mathbb{B}_w(X)$, and so it is the limit of a nondecreasing sequence of measurable bounded functions $u^k \in \mathbb{B}(X)$. Now fix $x \in X$ and let $\{a_n\}$ be a sequence in $A(x)$ converging to $a \in A(x)$. Then, as $u^k \uparrow u_m$, Assumption 8.3.1(c) yields, for every $k$,
$$\liminf_{n\to\infty} \int u_m(y)\,Q(dy|x, a_n) \ge \liminf_{n\to\infty} \int u^k(y)\,Q(dy|x, a_n) = \int u^k(y)\,Q(dy|x, a).$$
Hence, letting $k \to \infty$, monotone convergence yields that
$$\liminf_{n\to\infty} \int u_m(y)\,Q(dy|x, a_n) \ge \int u_m(y)\,Q(dy|x, a)$$
and, therefore, $\int u_m(y)\,Q(dy|x, \cdot)$ is l.s.c. on $A(x)$, which implies that $u'(x, \cdot)$ is l.s.c. on $A(x)$. In other words, $u'(x, \cdot)$ is l.s.c. on $A(x)$ for every function $u$ in $\mathbb{B}_w(X)$. Hence, if we now apply the latter fact to $-u$ in lieu of $u$, we see that $u'(x, \cdot)$ is also u.s.c. Thus $u'(x, \cdot)$ is continuous on $A(x)$.

(b) Write $u^I$ as $u^I(x) := \lim_{k\to\infty} U_k(x)$, where
$$U_k(x) := \inf_{n \ge k} u_n(x) \uparrow u^I(x) \quad \text{as } k \to \infty.$$
Then $u_n \ge U_k$ for all $n \ge k$ and, as $U_k$ is in $\mathbb{B}_w(X)$ (in fact, $\|U_k\|_w \le K$ for all $k$), part (a) yields
$$\liminf_{n\to\infty} \int u_n(y)\,Q(dy|x, a_n) \ge \int U_k(y)\,Q(dy|x, a). \qquad (8.3.21)$$
Thus, letting $k \to \infty$, we obtain (8.3.18) from (8.3.21) and monotone convergence. The proof of (8.3.19) is similar and is left to the reader. □

We will also need the following lemma, whose parts (a) and (b) are the same as the "measurable selection theorem" in Proposition D.5 (Appendix D). Recall that the set-valued mapping (or multifunction) $x \mapsto A(x)$ from $X$ to $A$ is said to be upper semicontinuous (u.s.c.) if $\{x \in X \mid A(x) \cap F \ne \emptyset\}$ is a closed subset of $X$ for every closed set $F \subset A$. [Equivalently, $x \mapsto A(x)$ is u.s.c. if $\{x \in X \mid A(x) \subset G\}$ is an open set in $X$ for every open set $G \subset A$.]
8.3.8 Lemma. Let $\mathbb{K}$ and $A(x)$ be as in (8.2.1) and Assumption 8.3.1(a), respectively, and let $v : \mathbb{K} \to \mathbb{R}$ be a given measurable function. Define
$$v^*(x) := \inf_{A(x)} v(x,a), \quad x \in X. \qquad (8.3.22a)$$

(a) If $v(x,\cdot)$ is l.s.c. on $A(x)$ for every $x \in X$, then there exists a selector $f \in \mathbb{F}$ such that $f(x) \in A(x)$ attains the minimum in (8.3.22a) for all $x \in X$, that is [using the notation in Remark 8.2.3(b)],
$$v^*(x) = v(x, f) \quad \forall x \in X, \qquad (8.3.22b)$$
and $v^*$ is a measurable function.

(b) If the set-valued mapping $x \mapsto A(x)$ is u.s.c. and $v$ is l.s.c. and bounded below on $\mathbb{K}$, then there exists a selector $f \in \mathbb{F}$ that satisfies (8.3.22b) and, moreover, $v^*$ is l.s.c. and bounded below on $X$.

(c) Suppose that $x \mapsto A(x)$ is u.s.c., $v$ is l.s.c. and, further,
$$\sup_{A(x)} |v(x,a)| \le k\,w(x) \quad \forall x \in X,$$
where $k$ is a constant and $w(\cdot) \ge 1$ is a continuous function on $X$. Then there is a selector $f \in \mathbb{F}$ that satisfies (8.3.22b), $v^*$ is l.s.c., and
$$|v^*(x)| \le k\,w(x) \quad \forall x \in X; \qquad (8.3.22c)$$
that is, $v^*$ is a l.s.c. function in the space $\mathbb{B}_w(X)$ and its $w$-norm satisfies $\|v^*\|_w \le k$.

Proof. For the proof of parts (a) and (b) see Rieder [1] or Schäl [1]. To prove (c), apply (b) to the nonnegative l.s.c. function $u(x,a) := v(x,a) + k\,w(x)$. □

8.3.9 Proposition. For $0 < \alpha < 1$, suppose that Assumptions 8.3.1, 8.3.2, and 8.3.3 hold, and let $T_\alpha$ be the map defined by (8.3.17). Then:

(a) $T_\alpha$ is a contraction operator on $\mathbb{B}_w(X)$, with modulus $\gamma := \alpha\beta < 1$; that is, $T_\alpha$ maps $\mathbb{B}_w(X)$ into itself and
$$\|T_\alpha u - T_\alpha v\|_w \le \gamma\,\|u - v\|_w \quad \forall u, v \in \mathbb{B}_w(X). \qquad (8.3.23)$$

(b) For every function $u$ in $\mathbb{B}_w(X)$ there is a selector $f \equiv f_u$ in $\mathbb{F}$ such that
$$T_\alpha u(x) = c(x, f) + \alpha \int_X u(y)\,Q(dy|x, f) \quad \forall x \in X. \qquad (8.3.24)$$
Part (b) holds for any $0 \le \alpha \le 1$.

Proof. Let $u$ be an arbitrary function in $\mathbb{B}_w(X)$. Then, by Lemma 8.3.7(a) and Assumption 8.3.1(b), the function
$$v(x,a) := c(x,a) + \alpha \int u(y)\,Q(dy|x,a)$$
is l.s.c. in $a \in A(x)$ for every $x \in X$. Hence, Lemma 8.3.8(a) yields that $T_\alpha u$ is a measurable function and that there exists $f \in \mathbb{F}$ that satisfies (8.3.24). It is also clear that $T_\alpha u$ has a finite $w$-norm since, by Assumption 8.3.2,
$$|v(x,a)| \le c\,w(x) + \alpha\,\|u\|_w \int w(y)\,Q(dy|x,a) \le (c + \alpha\beta\,\|u\|_w)\,w(x) \quad \forall x \in X,\ a \in A(x).$$
Moreover, it is obvious that $T_\alpha$ is a monotone operator ($u \le u'$ implies $T_\alpha u \le T_\alpha u'$). Thus, in view of Proposition 7.2.9, to complete the proof it suffices to show that (7.2.20) holds. This, however, is immediate since (8.3.17) and (8.3.5) yield, for any real number $r$,
$$T_\alpha(u + r\,w)(x) \le T_\alpha u(x) + \alpha\beta\,r\,w(x) \quad \forall x \in X.$$
That is, $T_\alpha$ is a contraction on $\mathbb{B}_w(X)$ with modulus $\gamma := \alpha\beta$. □


Before presenting the proof of Theorem 8.3.6, observe that using $T_\alpha$ we may rewrite (8.3.4) and (8.3.12) as
$$V^* = T_\alpha V^* \qquad (8.3.25)$$
and
$$v_n = T_\alpha v_{n-1} = T_\alpha^n v_0 \quad \forall n = 0,1,\dots, \text{ with } v_0 \equiv 0, \qquad (8.3.26)$$
respectively.
D. Proof of Theorem 8.3.6.

(a) By Proposition 8.3.9 and Banach's Fixed Point Theorem (Proposition 7.2.8), $T_\alpha$ has a unique fixed point $u^*$ in $\mathbb{B}_w(X)$, i.e.,
$$u^* = T_\alpha u^*, \qquad (8.3.27)$$
and
$$\|T_\alpha^n u - u^*\|_w \le \gamma^n\,\|u - u^*\|_w \quad \forall u \in \mathbb{B}_w(X),\ n = 0,1,\dots. \qquad (8.3.28)$$
Hence, to prove part (a) we need to show that

(a1) $V^*$ is in $\mathbb{B}_w(X)$, with $w$-norm $\|V^*\|_w \le c/(1-\gamma)$, and

(a2) $V^* = u^*$.

In this case, (8.3.15) will follow from (8.3.26) and (8.3.28) with $u \equiv 0$.

To prove (a1), let $\pi \in \Pi$ be an arbitrary policy and let $x \in X$ be an arbitrary initial state. Then for all $t = 0,1,\dots$
$$E_x^\pi w(x_t) \le \beta^t w(x) \qquad (8.3.29)$$
and
$$E_x^\pi |c(x_t, a_t)| \le c\,\beta^t w(x). \qquad (8.3.30)$$
Indeed, (8.3.29) is trivial for $t = 0$. Now, if $t \ge 1$, it follows from (8.2.9) that
$$E_x^\pi[w(x_t) \mid h_{t-1}, a_{t-1}] = \int w(y)\,Q(dy|x_{t-1}, a_{t-1}) \le \beta\,w(x_{t-1}) \quad \text{by (8.3.5)}. \qquad (8.3.31)$$
Therefore $E_x^\pi w(x_t) \le \beta\,E_x^\pi w(x_{t-1})$, which iterated yields (8.3.29). Similarly, to get (8.3.30) observe that Assumption 8.3.2(a) yields
$$|c(x_t, a_t)| \le c\,w(x_t) \quad \forall t = 0,1,\dots,$$
so that, by (8.3.29),
$$E_x^\pi |c(x_t, a_t)| \le c\,E_x^\pi w(x_t) \le c\,\beta^t w(x). \qquad (8.3.32)$$
Finally, note that (a1) follows from (8.3.30) since a direct calculation gives
$$|V(\pi, x)| \le \sum_{t=0}^{\infty} \alpha^t\,E_x^\pi|c(x_t, a_t)| \le c\,w(x)/(1 - \gamma), \qquad (8.3.33)$$
with $\gamma := \alpha\beta$. Thus, as $\pi \in \Pi$ and $x \in X$ were arbitrary,
$$|V^*(x)| \le c\,w(x)/(1 - \gamma). \qquad (8.3.34)$$

To prove (a2) let us first note that
$$\lim_{t \to \infty} \alpha^t\,E_x^\pi u(x_t) = 0 \quad \forall \pi \in \Pi,\ x \in X,\ u \in \mathbb{B}_w(X). \qquad (8.3.35)$$
Indeed, by definition of the $w$-norm and (8.3.29),
$$\alpha^t\,E_x^\pi |u(x_t)| \le \|u\|_w\,\alpha^t\,E_x^\pi w(x_t) \le \|u\|_w\,\gamma^t\,w(x),$$
and (8.3.35) follows. Let us now consider the equality $u^* = T_\alpha u^*$ in (8.3.27). By Proposition 8.3.9(b), there exists a selector $f \in \mathbb{F}$ such that
$$u^*(x) = c(x, f) + \alpha \int u^*(y)\,Q(dy|x, f) \quad \forall x \in X. \qquad (8.3.36)$$

Iteration of (8.3.36) yields
$$u^*(x) = E_x^{f^\infty}\Big[\sum_{t=0}^{n-1} \alpha^t c(x_t, f)\Big] + \alpha^n\,E_x^{f^\infty} u^*(x_n) \quad \forall n = 1,2,\dots,$$
and letting $n \to \infty$ we get, by (8.3.35),
$$u^*(x) = V(f^\infty, x).$$
Thus, by definition of $V^*$ [see (8.3.2)], we see that
$$u^*(x) \ge V^*(x). \qquad (8.3.37)$$
To get the reverse inequality, note that (8.3.27) implies that
$$u^*(x) \le c(x,a) + \alpha \int u^*(y)\,Q(dy|x,a) \quad \forall (x,a) \in \mathbb{K}. \qquad (8.3.38)$$
Hence, for any policy $\pi \in \Pi$ and initial state $x \in X$, (8.2.9) and (8.3.38) yield
$$u^*(x_t) \le c(x_t, a_t) + \alpha\,E_x^\pi[u^*(x_{t+1}) \mid h_t, a_t] \quad \forall t = 0,1,\dots.$$
Therefore, taking the expectation $E_x^\pi$, multiplying by $\alpha^t$, and summing over $t = 0,\dots,n-1$, we obtain
$$u^*(x) \le E_x^\pi\Big[\sum_{t=0}^{n-1} \alpha^t c(x_t, a_t)\Big] + \alpha^n\,E_x^\pi u^*(x_n).$$
Finally, letting $n \to \infty$ in the latter inequality and using (8.3.35), it follows that
$$u^*(x) \le V(\pi, x),$$
so that, as $\pi$ and $x$ were arbitrary, we conclude that $u^*(x) \le V^*(x)$ for all $x \in X$. This inequality and (8.3.37) yield (a2).

(b) The existence of a selector $f_* \in \mathbb{F}$ that satisfies (8.3.16) follows from part (a) and Proposition 8.3.9(b). Conversely, for any deterministic stationary policy $f_*^\infty$, the corresponding $\alpha$-discounted cost satisfies
$$V(f_*^\infty, x) = c(x, f_*) + \alpha \int V(f_*^\infty, y)\,Q(dy|x, f_*) \quad \forall x \in X. \qquad (8.3.39)$$
(See Remark 8.3.10.) Hence if $f_*^\infty$ is $\alpha$-discount optimal, we have $V(f_*^\infty, \cdot) = V^*(\cdot)$, so (8.3.39) yields (8.3.16).

Finally, part (c) is obvious, whereas (d) follows from (c) and (b). □

8.3.10 Remark. Concerning (8.3.39), there are at least two ways in which one can show that, for any deterministic stationary policy $f^\infty \in \Pi_{DS}$,
$$V(f^\infty, x) = c(x, f) + \alpha \int_X V(f^\infty, y)\,Q(dy|x, f) \quad \forall x \in X. \qquad (8.3.40)$$
The first one is to expand the right-hand side of (8.3.1), with $\pi = f^\infty$, to obtain
$$V(f^\infty, x) = c(x, f) + \alpha\,E_x^{f^\infty}\Big[\sum_{t=1}^{\infty} \alpha^{t-1} c(x_t, f)\Big].$$
Then (8.3.40) follows from the Markov property (8.2.10) and using the definition (8.3.1) again, with $\pi = f^\infty$. The second way is via a fixed-point argument as in Proposition 8.3.9(a). Namely, for any given deterministic stationary policy $f^\infty$, define an operator $R_f$ on $\mathbb{B}_w(X)$ by
$$(R_f u)(x) := c(x, f) + \alpha \int_X u(y)\,Q(dy|x, f), \quad x \in X. \qquad (8.3.41)$$
Then using Proposition 7.2.9 it can be verified that $R_f$ is a contraction operator on $\mathbb{B}_w(X)$ with modulus $\gamma := \alpha\beta$, and, therefore, $R_f$ has a unique fixed point $u_f$ in $\mathbb{B}_w(X)$, i.e.,
$$u_f = R_f u_f. \qquad (8.3.42)$$
From this equation and (8.3.41) we then have that $u_f$ is the unique solution in $\mathbb{B}_w(X)$ of the equation
$$u_f(x) = c(x, f) + \alpha \int u_f(y)\,Q(dy|x, f), \quad x \in X. \qquad (8.3.43)$$
Moreover, iteration of (8.3.42) or (8.3.43) yields
$$u_f(x) = R_f^n u_f(x) = E_x^{f^\infty}\Big[\sum_{t=0}^{n-1} \alpha^t c(x_t, f)\Big] + \alpha^n\,E_x^{f^\infty} u_f(x_n)$$
for all $x \in X$ and $n = 1,2,\dots$. Finally, letting $n \to \infty$, we see from (8.3.35) and (8.3.1) that $u_f(x) = V(f^\infty, x)$ for all $x \in X$, and so (8.3.43) is the same as (8.3.40). □
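The fixed-point characterization (8.3.42)-(8.3.43) suggests a simple computational scheme: for a fixed policy $f$, iterating $R_f$ evaluates $V(f^\infty, \cdot)$. The finite model below (costs, transition law, policy, and discount factor) is an invented illustration, not data from the text.

```python
alpha = 0.9              # discount factor (assumption)
c = [[1.0, 2.0], [0.5, 1.5], [2.0, 0.5]]    # c[x][a], invented costs
Q = [                                        # Q[a][x][y], invented transition law
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.3, 0.7]],
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.2, 0.2, 0.6]],
]
f = [0, 1, 1]            # an arbitrary deterministic stationary policy x -> f(x)

def R_f(u):
    """(R_f u)(x) = c(x, f) + alpha * sum_y u(y) * Q(y | x, f), cf. (8.3.41)."""
    return [c[x][f[x]] + alpha * sum(Q[f[x]][x][y] * u[y] for y in range(3))
            for x in range(3)]

u = [0.0, 0.0, 0.0]
for _ in range(300):     # Banach iteration converges to the fixed point u_f
    u = R_f(u)           # u now approximates V(f_infinity, .)
```

Because $R_f$ is a contraction with modulus $\alpha\beta$, the iterates approach $u_f = V(f^\infty,\cdot)$ geometrically, mirroring the second argument of the remark.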
Notes on §8.3

1. Except for the fact that the cost-per-stage $c$ is allowed to take negative values, Assumption 8.3.2 and (8.3.6) are, respectively, the same as conditions (b) and (c) in Proposition 4.3.1. However, there is a misprint in Proposition 4.3.1(b): the inequality $1 \le k \le 1/\alpha$ should be $1 \le k < 1/\alpha$.

2. Assumption 8.3.2 was introduced by Wessels [1], and it has been used by other authors, including Piunovski [1] and Wakuta [1]. On the other hand, van Nunen and Wessels [1] show that Assumption 8.3.2 is implied by the following condition introduced by Lippman [1]:

(L) There is a measurable function $w_0 \ge 1$ on $X$, a positive integer $m$, and positive constants $b$ and $M$ such that for all $(x,a) \in \mathbb{K}$:
$$|c(x,a)| \le M\,w_0(x)^m$$
and
$$\int_X w_0(y)^n\,Q(dy|x,a) \le [w_0(x) + b]^n \quad \text{for } n = 1,\dots,m.$$

3. Also the condition (8.3.6) has been used by several authors; see, for instance, Bensoussan [1], Bhattacharya and Majumdar [1], and Cavazos-Cadena [1].

4. If in Assumption 8.3.2 we allowed $\beta$ to be less than 1, then (8.3.29) would yield a contradiction. Namely, letting $t \to \infty$ in (8.3.29) we would get $1 \le 0$, since $w \ge 1$.

5. Let $\mathbb{K}$ be as in (8.2.1). Then for every pair $(x,a)$ in $\mathbb{K}$ there is a decision function $f \in \mathbb{F}$ such that $a = f(x)$. (See Rieder [1], Example 2.6.) This fact and Proposition 8.3.9(b) yield that we can rewrite $T_\alpha u$ in (8.3.17) as
$$T_\alpha u(x) = \inf_{f} \Big[ c(x,f) + \alpha \int u(y)\,Q(dy|x,f) \Big], \quad x \in X, \qquad (8.3.44)$$
for every function $u$ in $\mathbb{B}_w(X)$.

8.4 Further analysis of value iteration

For each $n = 1,2,\dots$, let $\mathbb{F}_n$ be the family of selectors $f \in \mathbb{F}$ for which $f(x) \in A(x)$ attains the minimum in (8.3.12); that is, $f \in \mathbb{F}_n$ if and only if
$$v_n(x) = c(x,f) + \alpha \int_X v_{n-1}(y)\,Q(dy|x,f) \quad \forall x \in X. \qquad (8.4.1)$$
A deterministic Markov policy $\pi = \{f_n\} \in \Pi_{DM}$ such that $f_n$ is in $\mathbb{F}_n$ for all $n = 1,2,\dots$ is called an $\alpha$-value iteration ($\alpha$-VI) policy. The selector $f_0 \in \mathbb{F}$ may be arbitrarily chosen. On the other hand, the family of selectors $f_*$ that satisfy (8.3.16) is denoted by $\mathbb{F}_*$. Thus, in view of Theorem 8.3.6(b), $f_*$ belongs to $\mathbb{F}_*$ if and only if the deterministic stationary policy $f_*^\infty \in \Pi_{DS}$ is $\alpha$-discount optimal. By (8.3.15), one would expect $\mathbb{F}_n$ to be "close" to $\mathbb{F}_*$ for all $n$ sufficiently large. The question is, "close" in what sense? To deal with this question, we first consider the notion of "asymptotic discount optimality" (already introduced in §4.6). Then we give an estimate for the difference
$$V(f_n^\infty, \cdot) - V^*(\cdot),$$
which is used to introduce rolling horizon policies, and, finally, we consider the problem of existence and detection of forecast horizons, which requires in particular a criterion to eliminate nonoptimal actions.

Throughout this section, Assumptions 8.3.1, 8.3.2 and 8.3.3 are supposed to hold.

A. Asymptotic discount optimality

Let $D : \mathbb{K} \to \mathbb{R}$ be the $\alpha$-discount discrepancy function defined by
$$D(x,a) := c(x,a) + \alpha \int_X V^*(y)\,Q(dy|x,a) - V^*(x). \qquad (8.4.2)$$
From the $\alpha$-DCOE (8.3.4) we can see that $D$ is a nonnegative function and that (8.3.4) can be rewritten as
$$\min_{A(x)} D(x,a) = 0 \quad \forall x \in X. \qquad (8.4.3)$$
Furthermore, by Theorem 8.3.6(b), a deterministic stationary policy $f_*^\infty$ is $\alpha$-discount optimal if and only if [using the notation in Remark 8.2.3(b)]
$$D(x, f_*) = 0 \quad \forall x \in X. \qquad (8.4.4)$$
Motivated by (8.4.4), we introduce the following concept.

8.4.1 Definition. A deterministic Markov policy $\pi = \{f_n\}$ is called pointwise asymptotically discount optimal (pointwise-ADO) if, for every state $x \in X$,
$$D(x, f_n) \to 0 \quad \text{as } n \to \infty. \qquad (8.4.5)$$

8.4.2 Proposition. Let $\pi = \{f_n\}$ be an $\alpha$-VI policy, that is, $f_n$ is in $\mathbb{F}_n$ for all $n = 1,2,\dots$, and $f_0 \in \mathbb{F}$ is an arbitrary selector. Then $\pi$ is a pointwise-ADO policy; in fact, for every $x \in X$ and $n = 1,2,\dots$,
$$0 \le D(x, f_n) \le 2c\,\gamma^n w(x)/(1-\gamma) \to 0, \quad \text{with } \gamma := \alpha\beta, \qquad (8.4.6)$$
where $c$, $\beta$ and $w(\cdot)$ are as in Assumption 8.3.2.


Proof. By (8.4.2) and (8.4.1)

c(x, In) +a I V*(y)Q(dyix, In) - V*(x) (8.4.7)

Vn(x) - V*(x) +a I [V*(y) - Vn-l (y»)Q(dyix, In)


8.4 Further analysis of value iteration 57

for all x E X and n = 1,2, .... Moreover, by (8.3.15),


(8.4.8)

and, similarly, by (8.3.15) and (8.3.5),

/ IV*(y) - vn-t{y)IQ(dylx, In) < C'yn-1(1 - ,),)-1 / w(y)Q(dylx, In)


< C'yn-1(1_ ,),)-l,8W(X).
Combining these inequalities with (8.4.7) we obtain (8.4.6). 0

B. Estimates of VI convergence

In view of (8.4.3) and (8.4.4), we can interpret (8.4.6) as an estimate of how "close" $f_n \in \mathbb{F}_n$ is to being optimal. Another estimate can be obtained if we consider the deterministic stationary policy $f_n^\infty = \{f_n, f_n, \dots\}$, which uses the control action $a_t = f_n(x_t)$ for all $t = 0,1,\dots$, and compute the difference between the corresponding infinite-horizon discounted cost $V(f_n^\infty, \cdot)$ and the $\alpha$-discount value function $V^*$. In this case we obtain the following.

8.4.3 Proposition. Fix an arbitrary integer $n \ge 1$ and let $f_n$ be a selector in $\mathbb{F}_n$. Then, for all $x \in X$,
$$0 \le V(f_n^\infty, x) - V^*(x) \le 2c\,\gamma^n w(x)/(1-\gamma) \qquad (8.4.9)$$
with $\gamma := \alpha\beta$. Multiplying by $w(x)^{-1}$ in (8.4.9) [and (8.4.6)] we obtain estimates that are uniform in the $w$-norm.
Proof. As $f_n$ is fixed, let us write $f := f_n$ and $v(x) := V(f^\infty, x)$, so that we wish to estimate
$$0 \le v(x) - V^*(x) = [v(x) - v_n(x)] + [v_n(x) - V^*(x)].$$
Thus, by the inequality (8.4.8), we will obtain (8.4.9) if we show that
$$|v(x) - v_n(x)| \le c\,\gamma^n w(x)/(1-\gamma). \qquad (8.4.10)$$
To prove the latter inequality, first note that [as in (8.3.40)] we have
$$v(x) = c(x, f) + \alpha \int v(y)\,Q(dy|x, f),$$
which together with (8.4.1) gives
$$v(x) - v_n(x) = \alpha \int [v(y) - v_{n-1}(y)]\,Q(dy|x, f),$$
and so
$$|v(x) - v_n(x)| \le \alpha \int |v(y) - v_{n-1}(y)|\,Q(dy|x, f). \qquad (8.4.11)$$
Iteration of (8.4.11) gives [recalling the notation (8.2.11) and that $v_0 \equiv 0$]
$$|v(x) - v_n(x)| \le \alpha^n \int |v(y)|\,Q^n(dy|x, f),$$
so that
$$|v(x) - v_n(x)| \le \alpha^n\,E_x^{f^\infty}|v(x_n)|. \qquad (8.4.12)$$
From the latter inequality, together with (8.3.29) and (8.3.33), we obtain (8.4.10) and, as we already mentioned, (8.4.9) follows. □
C. Rolling horizon procedures

The ideal goal in optimal control problems is, of course, to explicitly determine the optimal value function and an optimal control policy. Unfortunately, this goal is quite often very "difficult" (if not impossible) to attain. Thus, there are many cases in which one prefers to use a suboptimal but more practical procedure, provided its global performance can be assessed and compared with that of an optimal policy. We shall now discuss one such procedure, which is of frequent use in engineering and economics applications, such as stabilization of control systems, production management, and economic growth and macroplanning problems.

In a rolling horizon (RH) procedure (also known as a moving, receding, or sliding horizon procedure) we begin by fixing a positive integer $N$, which is called the rolling horizon, and proceed as follows:

Step 1. Set $k = 0$ and determine an optimal control policy $\pi^{k,N} = \{f_{k,t},\ t = k, k+1, \dots, k+N-1\}$ for the $N$-stage problem starting at time $k$; in other words, $\pi^{k,N}$ minimizes the $N$-stage cost [cf. (8.3.14)]
$$V_{k,N}(\pi, x) := E_x^\pi\Big[\sum_{t=k}^{k+N-1} \alpha^t c(x_t, a_t)\Big]. \qquad (8.4.13)$$
Define $\hat f_k := f_{k,k}$, the first optimal decision function for the $N$-stage problem.

Step 2. Replace $k$ by $k+1$ and go back to Step 1.
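In a stationary finite model the procedure above simplifies considerably: each $N$-stage problem is the same, so Step 1 always produces the same first decision function. The sketch below is a hypothetical illustration (the model data and the horizon $N$ are invented).

```python
NS, NA = 3, 2
alpha, N = 0.9, 10       # discount factor and rolling horizon (both invented)
c = [[1.0, 2.0], [0.5, 1.5], [2.0, 0.5]]    # c[x][a], hypothetical costs
Q = [                                        # Q[a][x][y], hypothetical transition law
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.3, 0.7]],
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.2, 0.2, 0.6]],
]

def q_value(x, a, v):
    """c(x,a) + alpha * integral of v with respect to Q(.|x,a)."""
    return c[x][a] + alpha * sum(Q[a][x][y] * v[y] for y in range(NS))

def rh_decision_rule():
    """First optimal decision function of the N-stage problem (v_0 == 0).
    In a stationary model the RH policy applies it at every stage."""
    v = [0.0] * NS
    for _ in range(N - 1):                   # value iteration up to v_{N-1}
        v = [min(q_value(x, a, v) for a in range(NA)) for x in range(NS)]
    return [min(range(NA), key=lambda a, x=x: q_value(x, a, v))
            for x in range(NS)]

f_hat = rh_decision_rule()   # the RH policy uses a_t = f_hat[x_t] for every t
```

The returned rule is a selector in $\mathbb{F}_N$, so the error bound of Proposition 8.4.3 applies to the resulting stationary policy.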

This procedure determines a control policy $\hat\pi = \{\hat f_0, \hat f_1, \dots\}$ for the original infinite-horizon problem, and to validate the procedure the question then is to find error bounds for the "degree of suboptimality" of the RH policy $\hat\pi$, measured by the difference $V(\hat\pi, \cdot) - V^*(\cdot)\ (\ge 0)$, where $V^*$ is the optimal value function in (8.3.2). However, as our Markov control model $(X, A, \{A(x) \mid x \in X\}, Q, c)$ is stationary, in the sense that all of its components $X, A, \dots$ are time-independent, we can see that, for any $k = 0,1,\dots$, minimizing $V_{k,N}$ in (8.4.13) is equivalent to minimizing the $N$-stage cost $V_N$ given by (8.3.14). Therefore, by (8.3.13) and (8.4.1), $\hat f_k = f_N$ for all $k \ge 0$, where $f_N$ is a selector in $\mathbb{F}_N$, and so the cost function $V(\hat\pi, \cdot)$ is the same as $V(f_N^\infty, \cdot)$. Consequently, Proposition 8.4.3 yields:
8.4.4 Corollary. For any rolling horizon $N$,
$$0 \le V(\hat\pi, x) - V^*(x) \le 2c\,\gamma^N w(x)/(1-\gamma) \quad \forall x \in X. \qquad (8.4.14)$$

D. Forecast horizons and elimination of nonoptimal actions

There is another way of interpreting the "closeness" of $\mathbb{F}_n$ to $\mathbb{F}_*$: By Theorem 4.6.5 (see Note 4 at the end of this section), if $\pi = \{f_n\}$ is an $\alpha$-VI policy, then there exists an optimal selector $f_* \in \mathbb{F}_*$ such that, for each state $x \in X$, $f_*(x)$ is an accumulation point of the sequence $\{f_n(x)\} \subset A(x)$; that is, for every $x \in X$, there is a subsequence $\{n_i\} \equiv \{n_i(x)\}$ of $\{n\}$ such that
$$f_{n_i}(x) \to f_*(x) \quad \text{as } i \to \infty. \qquad (8.4.15)$$
[Theorem 4.6.5 requires the action set $A$ to be locally compact. However, in our present context this requirement can be omitted because of Assumption 8.3.1(a).]

In turn, (8.4.15) suggests that there might be some control problems for which, for all $n$ sufficiently large, either
$$\mathbb{F}_n \subset \mathbb{F}_*, \qquad (8.4.16)$$
or
$$\mathbb{F}_* \subset \mathbb{F}_n, \qquad (8.4.17)$$
or even
$$\mathbb{F}_n = \mathbb{F}_*. \qquad (8.4.18)$$
If there is a positive integer $N^*$ such that (8.4.16) holds for all $n \ge N^*$, then $N^*$ is called a forecast horizon. Since $f_n \in \mathbb{F}_n$ is the first optimal decision function for the $n$-stage problem [see (8.3.12)-(8.3.14)], the existence of a forecast horizon $N^*$ means that, roughly speaking, $N^*$ is a finite horizon that is far enough off that the data beyond it (namely, the "forecasts") have no effect on the optimal decisions in the initial period. Unfortunately, the existence of forecast horizons requires strong assumptions (even in the case of a finite state space $X$; see Shapiro [1]) and, in fact, for general (Borel) state spaces we need the more restrictive notion in part (b) of the following definition.
8.4.5 Definition. (a) For every state $x \in X$, let $A^*(x)$ be the set of control actions $a \in A(x)$ for which the minimum is attained in (8.4.3) [or (8.3.4)]; that is, $a$ is in $A^*(x)$ if
$$V^*(x) = c(x,a) + \alpha \int V^*(y)\,Q(dy|x,a).$$
Similarly, for $n \ge 1$, $A_n(x)$ denotes the set of actions $a \in A(x)$ that attain the minimum in (8.3.12), i.e.,
$$v_n(x) = c(x,a) + \alpha \int v_{n-1}(y)\,Q(dy|x,a).$$
Hence, a deterministic stationary policy $f^\infty$ is $\alpha$-optimal, that is, $f$ is in $\mathbb{F}_*$, if and only if $f(x)$ is in $A^*(x)$ for all $x$; and, on the other hand, $f$ is in $\mathbb{F}_n$ if and only if $f(x)$ is in $A_n(x)$ for all $x$.

(b) Let $x \in X$ be a given (initial) state, and let $\pi = \{f_n\}$ be an $\alpha$-VI policy. Then a positive integer $N$ is said to be an $(x,\pi)$-forecast horizon if $f_n(x)$ is in $A^*(x)$ for all $n \ge N$, i.e.,
$$V^*(x) = c(x, f_n) + \alpha \int_X V^*(y)\,Q(dy|x, f_n) \quad \forall n \ge N. \qquad (8.4.19)$$

8.4.6 Proposition. (a) If

the action set $A$ is finite, \qquad (8.4.20)

then for every initial state $x \in X$ and every $\alpha$-VI policy $\pi = \{f_n\}$ there exists an $(x,\pi)$-forecast horizon $N_1$. If, in addition, the state space $X$ is also finite, then $N_1$ is independent of $x$ and $\pi$; in other words, $N_1$ is a forecast horizon in the sense that (8.4.16) holds for all $n \ge N_1$.

(b) In addition to (8.4.20), suppose that there exists a unique $\alpha$-optimal control policy $f_*^\infty$; that is, $\mathbb{F}_*$ consists of the single selector $f_* \in \mathbb{F}$:
$$\mathbb{F}_* = \{f_*\}. \qquad (8.4.21)$$
Then for every initial state $x \in X$, there exists a positive integer $N_2 = N_2(x)$ such that $f_*(x)$ is in $A_n(x)$ for all $n \ge N_2$ [cf. (8.4.17)]; in other words,
$$f_*(x) \in A_n(x) \quad \forall n \ge N_2(x). \qquad (8.4.22)$$
Thus, if $X$ is finite, there exists $N \ge 1$ such that (8.4.17) and (8.4.18) hold for all $n \ge N$.
Proof. (a) Fix $x$ and $\pi = \{f_n\}$, and suppose that (a) does not hold; that is, for every positive integer $N \ge 1$ there exists $n \ge N$ such that, instead of (8.4.19), we have
$$V^*(x) < c(x, f_n) + \alpha \int V^*(y)\,Q(dy|x, f_n).$$
Equivalently, there is a subsequence $\{m\}$ of $\{n\}$ such that
$$V^*(x) < c(x, f_m) + \alpha \int V^*(y)\,Q(dy|x, f_m) \quad \forall m,$$
and, furthermore, as $\pi$ is an $\alpha$-VI policy,
$$v_m(x) = c(x, f_m) + \alpha \int v_{m-1}(y)\,Q(dy|x, f_m) \quad \forall m.$$
On the other hand, as $A(x)$ is a finite set [by (8.4.20)], there is a further subsequence $\{m_i\}$ of $\{m\}$ and a control action $a_x \in A(x)$ such that
$$f_{m_i}(x) = a_x \quad \forall i, \quad \text{and} \quad a_x \notin A^*(x).$$
In other words,
$$v_{m_i}(x) = c(x, a_x) + \alpha \int v_{m_i - 1}(y)\,Q(dy|x, a_x) \quad \forall i, \qquad (8.4.23)$$
and
$$V^*(x) < c(x, a_x) + \alpha \int V^*(y)\,Q(dy|x, a_x). \qquad (8.4.24)$$
Thus, letting $i \to \infty$ in (8.4.23), from (8.3.15) and (8.3.20) we get
$$V^*(x) = c(x, a_x) + \alpha \int V^*(y)\,Q(dy|x, a_x), \qquad (8.4.25)$$
which contradicts (8.4.24). This proves the first part of (a); that is, there exists an $(x,\pi)$-forecast horizon, say $N_1(x,\pi)$.

Now, if $A$ and $X$ are both finite, then there are finitely many $\alpha$-VI policies $\pi$. Hence, $N_1 := \max_{x,\pi} N_1(x,\pi)$ defines a forecast horizon.
(b) Fix the initial state $x$ and suppose that (8.4.20) and (8.4.21) both hold. If (8.4.22) is not satisfied, then [arguing as in the proof of (a)] there is a subsequence $\{n_i\}$ of $\{n\}$, and controls $a_{n_i} \in A_{n_i}(x)$ and $a_x \in A(x)$, such that $a_{n_i} = a_x$ for all $i$, and
$$a_x \ne f_*(x). \qquad (8.4.26)$$
That is, for all $i$,
$$v_{n_i}(x) = c(x, a_x) + \alpha \int v_{n_i - 1}(y)\,Q(dy|x, a_x) \le c(x, f_*) + \alpha \int v_{n_i - 1}(y)\,Q(dy|x, f_*). \qquad (8.4.27)$$
Therefore, letting $i \to \infty$, from (8.3.15) and (8.3.20) we obtain
$$V^*(x) = c(x, a_x) + \alpha \int V^*(y)\,Q(dy|x, a_x) \le c(x, f_*) + \alpha \int V^*(y)\,Q(dy|x, f_*) = V^*(x), \qquad (8.4.28)$$
where the last equality holds since $f_*(x)$ is in $A^*(x)$. This means that $a_x$ belongs to $A^*(x)$, which, by (8.4.21), yields $a_x = f_*(x)$. As this contradicts (8.4.26), we conclude that (8.4.22) holds for some $N_2(x)$.

Finally, if $X$ is a finite set, we get (8.4.17) for all $n \ge N_2 := \max_x N_2(x)$, and also (8.4.18) for all $n \ge \max\{N_1, N_2\}$, with $N_1$ as in the second part of (a). □
Of course, for Proposition 8.4.6(a) to be of any practical use we need a method to "find", or at least "estimate", $N_1$; this is called the detection (of a forecast horizon) problem. To deal with it, we need in turn some criterion to eliminate nonoptimal actions, that is, actions which do not belong to $A^*(x)$. To do this, we will use the $\alpha$-VI discrepancy functions $D_n : \mathbb{K} \to \mathbb{R}$ defined, in analogy with (8.4.2), by
$$D_n(x,a) := c(x,a) + \alpha \int_X v_{n-1}(y)\,Q(dy|x,a) - v_n(x) \qquad (8.4.29)$$
for $n = 1,2,\dots$. Observe that, by (8.3.12), the functions $D_n$ are nonnegative and that (8.3.12) can be rewritten as
$$\min_{A(x)} D_n(x,a) = 0, \quad x \in X. \qquad (8.4.30)$$
Further, by (8.3.15) and (8.3.20),
$$\lim_{n\to\infty} D_n(x,a) = D(x,a). \qquad (8.4.31)$$

We also have (with $c$ and $\beta$ as in Assumption 8.3.2, and $\gamma := \alpha\beta$):

8.4.7 Proposition. (Criterion for elimination of nonoptimal actions.) An admissible control $a \in A(x)$ is nonoptimal in state $x$, that is, $a \notin A^*(x)$, if and only if there is a positive integer $n(a)$ such that
$$D_{n(a)}(x,a) \ge 2c\,\gamma^{n(a)-1}\,w(x)/(1-\gamma). \qquad (8.4.32)$$


Proof. First we prove that, for any $n \ge 1$,
$$D_{n+1}(x,a) \ge D_n(x,a) - 2c\,\gamma^n w(x). \qquad (8.4.33)$$
To see this, use (8.4.29) to obtain
$$D_{n+1}(x,a) - D_n(x,a) = \alpha \int [v_n(y) - v_{n-1}(y)]\,Q(dy|x,a) - [v_{n+1}(x) - v_n(x)]. \qquad (8.4.34)$$
On the other hand, as in (8.4.11), one can show that
$$|v_n(x) - v_{n-1}(x)| \le c\,\gamma^{n-1} w(x) \quad \forall n \ge 1. \qquad (8.4.35)$$
From the latter inequality, and using (8.3.5), we get
$$\alpha \int [v_n(y) - v_{n-1}(y)]\,Q(dy|x,a) \ge -c\,\gamma^n w(x) \quad \forall n. \qquad (8.4.36)$$
Finally, from (8.4.34)-(8.4.36), a straightforward calculation yields (8.4.33).

Now, iteration of (8.4.33) yields, for all $m, n \ge 1$,
$$D_{n+m}(x,a) \ge D_n(x,a) - 2c\,\gamma^n w(x) \sum_{k=0}^{m-1} \gamma^k;$$
hence,
$$D_{n+m}(x,a) \ge D_n(x,a) - 2c\,\gamma^n w(x)/(1-\gamma) \quad \forall m, n \ge 1. \qquad (8.4.37)$$

We are now ready to prove the proposition itself. Suppose that (8.4.32) holds, and take $n := n(a)$ in (8.4.37). This gives
$$D_{n+m}(x,a) \ge 2c\,\gamma^{n-1} w(x) \quad \text{for all } m \ge 1,$$
so that
$$D(x,a) = \lim_{m\to\infty} D_{n+m}(x,a) \ge 2c\,\gamma^{n-1} w(x) > 0;$$
hence, $a$ is not in $A^*(x)$. Conversely, if (8.4.32) does not hold, then
$$0 \le D_n(x,a) < 2c\,\gamma^{n-1} w(x)/(1-\gamma) \quad \forall n,$$
and the right-hand side tends to $0$ as $n \to \infty$. Therefore, $D(x,a) = 0$ [by (8.4.31)], which means $a \in A^*(x)$. □


If a E A(x) belongs to A*(x), so that D(x, a) = 0, then in view of (8.4.31)
we may define n(a) := O. Then Proposition 8.4.7 yields the following.
8.4.8 Corollary. Suppose that (8.4.20) holds, that is, A is a finite set, and
let x E X be a given initial state. For every admissible action a E A(x)
let n(a) be as in Proposition 8.4.7 if a is not in A*(x), and n(a) := 0
otherwise. Moreover, define
N* := max{n(a)la E A(x)}. (8.4.38)

Then N* is finite and it is a (x, 7r)-forecast horizon for every a- VI policy 7r.
Proof. Let 7r = Un} be an arbitrary a-VI policy, and let a := fn(x). If
a is not in A*(x), then as A(x) C A is a finite set, it is clear that n(a) is
also finite, and so is N*. Further, if n ;::: N* , then a is necessarily in A* (x)
because, otherwise, Dn(x, a) = 0 would contradict (8.4.32). D
Finally, if both conditions (8.4.20) and (8.4.21) hold, then Proposition 8.4.7 and Corollary 8.4.8 yield the following (on-line) algorithm to detect $N^*$ in (8.4.38) and the optimal selector $f_*$ in (8.4.21) for a given initial state $x \in X$:

Initialization. Let $n = 0$, and define $A_0 := A(x)$. If $A_0$ has a single element, say $a^*$, then stop: $a^* = f_*(x)$. Otherwise, go to step $n = 1$.

Step n. For every $a$ in $A_{n-1}$ compute $D_n(x,a)$. If
$$D_n(x,a) \ge 2c\,\gamma^{n-1} w(x)/(1-\gamma),$$
then eliminate $a$ from $A_{n-1}$, and define
$$A_n := \{a \in A_{n-1} \mid D_n(x,a) < 2c\,\gamma^{n-1} w(x)/(1-\gamma)\}.$$
If $A_n$ consists of a single element $a^*$, stop: $a^* = f_*(x)$. Otherwise, go to step $n+1$.

Under the conditions (8.4.20) and (8.4.21), this algorithm is ensured to stop after a finite number of steps, and the stopping time $N^*$ will be an $(x,\pi)$-forecast horizon for any $\alpha$-VI policy $\pi$ and any given initial state $x$. Furthermore, if the state space $X$ is finite, then, repeating the algorithm for every $x \in X$, one can get a forecast horizon in the sense of (8.4.16).
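A sketch of the detection algorithm at a fixed initial state, for an invented finite model with bounded costs: there one may take $w \equiv 1$, hence $\beta = 1$ and $\gamma = \alpha$, and the constant $c$ of Assumption 8.3.2(a) can be any bound on $|c(x,a)|$ (here 2). All model data below are assumptions for illustration only.

```python
NS, NA = 3, 2
alpha = 0.9
cost = [[1.0, 2.0], [0.5, 1.5], [2.0, 0.5]]  # invented bounded costs, so w == 1
Q = [                                         # invented transition law Q[a][x][y]
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.3, 0.7]],
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.2, 0.2, 0.6]],
]
cbar = 2.0               # |c(x,a)| <= cbar with w == 1, so beta = 1
gamma = alpha            # gamma = alpha * beta

def detect(x, max_steps=200):
    """Eliminate action a once D_n(x,a) >= 2*cbar*gamma^(n-1)/(1-gamma),
    cf. (8.4.32); stop when a single candidate remains."""
    v = [0.0] * NS                            # v_0 == 0
    candidates = set(range(NA))               # A_0 := A(x)
    for n in range(1, max_steps + 1):
        q = [[cost[z][a] + alpha * sum(Q[a][z][y] * v[y] for y in range(NS))
              for a in range(NA)] for z in range(NS)]
        v = [min(row) for row in q]           # v_n = T v_{n-1}
        thr = 2 * cbar * gamma ** (n - 1) / (1 - gamma)
        candidates = {a for a in candidates if q[x][a] - v[x] < thr}  # keep D_n < thr
        if len(candidates) == 1:
            return candidates.pop(), n        # detected f_*(x) and the stopping step
    return None, max_steps
```

By Proposition 8.4.7 the optimal action is never eliminated (its $D_n$ stays below the threshold), while every nonoptimal action eventually exceeds the shrinking threshold and is dropped.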

Notes on §8.4

1. For further properties and several characterizations of asymptotic discount optimality, see §§4.5, 4.6.

2. Proposition 8.4.3 and its application to rolling horizon (RH) policies are from Hernández-Lerma and Lasserre [8], [9], although the former reference deals only with bounded cost functions. These references and also, for instance, Alden and Smith [1] consider nonhomogeneous MCPs. For applications of RH policies in economics see Easley and Spulber [1], and Johansen [1]; for the stabilization of control systems see Kleinman [1] and Kwon et al. [1].

3. The results in subsection D, on forecast horizons and elimination of nonoptimal actions, are basically an extension of Bes and Lasserre [1] and Hernández-Lerma and Lasserre [10], which deal with bounded costs. For related references and applications see the survey by Bes and Sethi [1], and Rempala [1]. Perhaps the first paper on forecast horizons for (finite-state, finite-action) Markov control problems was the work of Shapiro [1], which soon afterwards was improved by Hinderer and Hübner [1]. In mathematical economics, "turnpike theorems" refer to results on asymptotic properties of optimal paths of capital accumulation in economic growth (see McKenzie [1]), and, by extension, forecast horizons are also known as turnpike-planning horizons.

4. Theorem 4.6.5 referred to at the beginning of §8.4.D is based on the following result by M. Schäl [1]:

Let $A(x)$ and $\mathbb{K}$ be as in Assumption 8.3.1(a) and (8.2.1), respectively, and let $\{f_n\}$ be a sequence in $\mathbb{F}$. Then there exists a selector $f \in \mathbb{F}$ such that, for each state $x \in X$, $f(x) \in A(x)$ is an accumulation point of the sequence $\{f_n(x)\}$.

8.5 The weakly continuous case

From the point of view of applications, perhaps the most restrictive of the hypotheses in §8.3.A is the "strong continuity" condition in Assumption 8.3.1(c), which requires the function
$$u'(x,a) := \int_X u(y)\,Q(dy|x,a), \quad (x,a) \in \mathbb{K}, \qquad (8.5.1)$$
to be continuous in $a \in A(x)$ for every $x \in X$ and every bounded measurable function $u \in \mathbb{B}(X)$. This condition, which is similar to the strong Feller property (Definition 7.3.1), is equivalent to requiring that
$$Q(B|x,a) = \operatorname{Prob}(x_{t+1} \in B \mid x_t = x,\ a_t = a) \qquad (8.5.2)$$
be continuous in $a \in A(x)$ for every $x \in X$, every Borel set $B \subset X$, and $t = 0,1,\dots$, and it is certainly too much to ask for a large class of control models. In this section we replace Assumption 8.3.1(c) by the "weak continuity" (weak Feller-like) condition in Assumption 8.5.1(c), but then, to obtain the corresponding results in §§8.3 and 8.4, we need to pay a price; namely, we have to strengthen other parts of Assumptions 8.3.1, 8.3.2, 8.3.3. The reason for this strengthening will be apparent below.

In this section we replace Assumptions 8.3.1-8.3.3 by the following Assumptions 8.5.1-8.5.3.
8.5.1 Assumption.

(a) A(x) is compact for every x ∈ X, and the set-valued mapping x ↦ A(x) is u.s.c.;

(b) The cost-per-stage c is l.s.c. on IK; and


(c) Q is weakly continuous on IK; that is, the function u′(x, a) in (8.5.1) is continuous on IK for every bounded continuous function u on X.

8.5.2 Assumption. This is the same as Assumption 8.3.2 except that the function w ≥ 1 is required to be continuous.

8.5.3 Assumption. The function w′(x, a) in Assumption 8.3.3 is continuous on IK.
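To see concretely the gap between the two continuity notions, consider a deterministic kernel Q(·|x, a) = δ_{F(x,a)}(·) with continuous F: then u′(x, a) = u(F(x, a)) is continuous whenever u is continuous (so the weak continuity of Assumption 8.5.1(c) holds), but not when u is merely a bounded measurable indicator (so (8.5.2) fails). The following sketch makes this explicit; the particular choice F(x, a) = x + a and the test functions are our own illustrative assumptions, not from the text.

```python
import math

# A toy deterministic kernel on X = R with a continuous system function
# F(x, a) = x + a (our own illustrative choice): Q(B | x, a) = delta_{x+a}(B).

def u_prime(u, x, a):
    """u'(x, a) = ∫ u(y) Q(dy | x, a) = u(x + a) for this Dirac kernel."""
    return u(x + a)

# Weak continuity [Assumption 8.5.1(c)]: for *continuous* bounded u,
# a -> u'(x, a) has no jumps.
u_cont = math.tanh
assert abs(u_prime(u_cont, 0.0, 0.5 - 1e-9) - u_prime(u_cont, 0.0, 0.5 + 1e-9)) < 1e-6

# Strong continuity (8.5.2) fails: take the bounded measurable u = 1_B with
# B = [0.5, ∞). Then a -> Q(B | 0, a) jumps at a = 0.5.
u_ind = lambda y: 1.0 if y >= 0.5 else 0.0
assert u_prime(u_ind, 0.0, 0.5 - 1e-9) == 0.0
assert u_prime(u_ind, 0.0, 0.5) == 1.0
```

Thus any deterministic control system with continuous dynamics is weakly continuous but, in general, not strongly continuous, which is precisely the kind of model §8.5 is designed to cover.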
Under this new set of assumptions (being basically a strengthening of Assumptions 8.3.1, 8.3.2, and 8.3.3), Theorem 8.3.6 remains valid, but in addition we get that

V* is a l.s.c. function in Bw(X). (8.5.3)

To obtain (8.5.3) we need to make some changes in the proof of Theorem 8.3.6. First we introduce the following notation.

8.5.4 Definition. IL(X) denotes the family of l.s.c. functions on X, and ILw(X) stands for the subfamily of l.s.c. functions that also belong to Bw(X), i.e.,

ILw(X) := IL(X) ∩ Bw(X).

Similarly, C(X) ⊂ IL(X) denotes the subfamily of continuous functions on X, and Cw(X) := C(X) ∩ Bw(X) is the subfamily of continuous functions in Bw(X). The family of continuous bounded functions on X is denoted by Cb(X).

To prove (8.5.3) we need the following lemma, whose parts (a) and (b) correspond to Lemma 8.3.7(a) and Proposition 8.3.9, respectively. On the other hand, Lemma 8.5.5(c) states that convergence in w-norm preserves lower semicontinuity.
8.5.5 Lemma. Suppose that Assumptions 8.5.1, 8.5.2, and 8.5.3 are satisfied. Then:

(a) The function u′ in (8.5.1) is l.s.c. on IK whenever u is in ILw(X), and continuous on IK whenever u is in Cw(X);

(b) Proposition 8.3.9 remains valid if Bw(X) is replaced by ILw(X);

(c) If {v_n} is a sequence in ILw(X) that converges in w-norm to a function v, then v is in ILw(X).
Proof. (a) The proof of this part is essentially the same as the proof of Lemma 8.3.7(a) with the obvious changes. Let u be a function in ILw(X) and define u_m as in the proof of Lemma 8.3.7(a). Then u_m is a nonnegative l.s.c. function and, therefore, there is a nondecreasing sequence of continuous bounded functions u_k ∈ Cb(X) such that u_k ↑ u_m. Now let (x_n, a_n) be a sequence in IK converging to (x, a) ∈ IK. Then Assumption 8.5.1(c) yields that, for every k,

∫ u_k(y)Q(dy|x_n, a_n) → ∫ u_k(y)Q(dy|x, a).

Hence, letting k → ∞ we get that ∫ u_m(y)Q(dy|·) is l.s.c., which together with Assumption 8.5.3 gives that u′ is l.s.c. on IK. That is, u′ is l.s.c. on IK for every function u in ILw(X). If now u is in Cw(X), then applying the latter fact to −u as well, we see that u′ is also u.s.c., hence continuous, on IK.
(b) Since T_α satisfies the contraction property (8.3.23), to complete the proof of part (b) it only remains to show that

(b1) T_α u is a l.s.c. function in Bw(X) for every u in ILw(X), and that

(b2) for every u ∈ ILw(X), there exists f ∈ IF that satisfies (8.3.24).

To prove (b1) and (b2), let u be a function in ILw(X) and note that, by part (a) and Assumption 8.5.1(c), the function within brackets in (8.3.7), namely,

c(x, a) + α u′(x, a) =: v(x, a),

is l.s.c. and, in fact, under the present assumptions, v satisfies all of the hypotheses of Lemma 8.3.8(c). Hence the latter lemma yields (b1) and (b2).
(c) Clearly, v is in Bw(X) if {v_n} is a sequence in ILw(X) and v_n converges to v in w-norm. [Recall that Bw(X) is a Banach space; see Proposition 7.2.1.] Thus, it only remains to show that v is l.s.c. To prove this, observe that

v(x) = [v(x) − v_n(x)] + v_n(x) ≥ −‖v_n − v‖_w w(x) + v_n(x)

for all x ∈ X and n. Therefore, if x_k → x, the lower semicontinuity of each v_n and the continuity of w (Assumption 8.5.2) yield

lim inf_k v(x_k) ≥ −‖v_n − v‖_w w(x) + v_n(x).

Finally, letting n → ∞ we obtain lim inf_k v(x_k) ≥ v(x); that is, v is l.s.c. □

We can now see that (8.5.3) is a direct consequence of Lemma 8.5.5(b), (c) and (8.3.15). Namely, by Lemma 8.5.5(b) and a trivial induction argument, the value-iteration functions v_n in (8.3.26) [or (8.3.12)] belong to ILw(X), which combined with Lemma 8.5.5(c) and (8.3.15) yields that V* is in ILw(X); that is, (8.5.3) holds.
The previous paragraphs illustrate a situation already mentioned in §§4.2
and 3.3: in addition to the features of the particular control problem we
are dealing with, the choice of hypotheses basically depends on whether one
wishes to (or can) work in a class of lower semicontinuous functions (as is
the case under Assumptions 8.5.1, 8.5.2, 8.5.3) or in a class of measurable
functions (Assumptions 8.3.1, 8.3.2, 8.3.3).
8.6 Examples
In this section we present a couple of examples of Markov control models
that satisfy the assumptions in §8.3 and §8.5. These examples are intended
to illustrate how one can proceed in similar cases.
When considering an X-valued controlled process {x_t} of the form

x_{t+1} = F(x_t, a_t, z_t), t = 0, 1, ..., (8.6.1)

we always suppose the following:


8.6.1 Assumption.

(a) The disturbance sequence {z_t} consists of i.i.d. random variables with values in a Borel space Z, and {z_t} is independent of the initial state x_0. The common distribution of the z_t is denoted by G.

(b) F : IK × Z → X is a given measurable function, where IK ⊂ X × A is the set defined in (8.2.1).
Let π = {a_t} be an arbitrary control policy (Definition 8.2.2). Then, by Assumption 8.6.1(a), the variables (x_t, a_t) and z_t are independent for each t = 0, 1, .... Thus the controlled process's transition law Q is given by

Q(B|x, a) := Prob(x_{t+1} ∈ B | x_t = x, a_t = a) = ∫_Z 1_B[F(x, a, z)]G(dz) (8.6.2)

for every B ∈ B(X), (x, a) ∈ IK, and t = 0, 1, .... The expression (8.6.2) is of course a particular case of the integral

u′(x, a) := ∫_X u(y)Q(dy|x, a) = E[u(x_{t+1}) | x_t = x, a_t = a] (8.6.3)

when u is the indicator function 1_B. In general, we may use (8.6.1) and Assumption 8.6.1(a) to write (8.6.3) as

u′(x, a) = ∫_Z u[F(x, a, z)]G(dz). (8.6.4)
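A representation such as (8.6.4) is also convenient computationally: u′(x, a) can be approximated by averaging u[F(x, a, z_i)] over i.i.d. draws z_i from G. In the sketch below, the dynamics F, the demand distribution, and the test function u are our own illustrative choices (they anticipate the inventory model of Example 8.6.2), and the Monte Carlo estimate is compared with the exact value.

```python
import math
import random

random.seed(0)

def u_prime_mc(u, F, x, a, sample_z, n=200_000):
    """Monte Carlo estimate of (8.6.4): u'(x, a) = ∫_Z u[F(x, a, z)] G(dz)."""
    return sum(u(F(x, a, sample_z())) for _ in range(n)) / n

# Illustrative choices (ours, not the text's): the inventory dynamics
# F(x, a, z) = (x + a - z)^+ of (8.6.5), demand z ~ Exp(1), and u(y) = y.
F = lambda x, a, z: max(x + a - z, 0.0)
u = lambda y: y
est = u_prime_mc(u, F, x=1.0, a=0.5, sample_z=lambda: random.expovariate(1.0))

# Closed form for comparison: E[(c - z)^+] = c - 1 + e^{-c} for z ~ Exp(1),
# with c = x + a = 1.5.
exact = 1.5 - 1.0 + math.exp(-1.5)
assert abs(est - exact) < 0.01
```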

8.6.2 Example: An inventory system. Consider an inventory-production system in which the state variable x_t, the control action a_t, and the disturbance z_t, for every t = 0, 1, ..., have the following meaning:

• x_t denotes the stock level at the beginning of period t;

• a_t is the amount of product ordered (and immediately supplied) at the beginning of period t;

• z_t denotes the product's demand during period t.
Using the standard notation r⁺ := max(r, 0), we assume that the stock level evolves according to the equation [see (8.6.1)]

x_{t+1} = (x_t + a_t − z_t)⁺, t = 0, 1, ..., (8.6.5)

for some given initial stock level x_0, so that the state space is the half-line X := [0, ∞). The production variables a_t are supposed to take values in the interval A := [0, θ], for some given constant θ > 0, irrespective of the stock level; that is, the control-constraint sets A(x) satisfy

A(x) = A ∀x ∈ X. (8.6.6)

In addition, we suppose that the demand process {z_t} satisfies Assumption 8.6.1 with Z := [0, ∞), so that z_t is nonnegative for each t, and that the demand distribution G has the following properties:

8.6.3 (a) G has a continuous bounded density g [i.e., G(dz) = g(z)dz];

(b) G has a finite mean value z̄, i.e.,

z̄ := E(z_0) = ∫_0^∞ z G(dz) < ∞.

Finally, to complete the description of the control model (X, A, Q, c), where Q is given by (8.6.5) and (8.6.2), we shall consider a cost-per-stage function c that represents a net cost of the form

production cost + maintenance (or holding) cost − sales revenue,

given by

c(x, a) := p·a + m·(x + a) − s·E min(x + a, z_0), (8.6.7)

where p, m, and s are positive constants. The unit production cost p and the unit maintenance cost m do not exceed the unit sale price s, i.e.,

p, m ≤ s. (8.6.8)
We shall now proceed to verify the assumptions in §8.3 and §8.5.
Verification of Assumptions 8.3.1 and 8.5.1. It is clear that parts (a) and (b) are satisfied. In particular, since

E min(x + a, z_0) = (x + a)[1 − G(x + a)] + ∫_0^{x+a} z G(dz), (8.6.9)

the cost function c(x, a) is continuous on IK := X × A. On the other hand, from (8.6.5), (8.6.3)-(8.6.4), and the property 8.6.3(a) we get

u′(x, a) = ∫_0^∞ u[(x + a − z)⁺]g(z)dz = u(0)[1 − G(x + a)] + ∫_0^{x+a} u(x + a − z)g(z)dz. (8.6.10)

Thus, an elementary change of variables in the latter integral yields

u′(x, a) = u(0)[1 − G(x + a)] + ∫_0^{x+a} u(z)g(x + a − z)dz,

and so we see that u′(x, a) is continuous in (x, a) ∈ IK for every bounded measurable function u on X. This implies part (c) in both Assumptions 8.3.1 and 8.5.1.
Verification of Assumptions 8.3.2 and 8.5.2. It suffices to find a continuous weight function w that satisfies conditions (i) and (ii) in Remark 8.3.5(a). To do this, let us first consider the moment generating function ψ of the variable θ − z_0,

ψ(r) := E exp[r(θ − z_0)], for r ≥ 0.

As ψ(0) = 1 and ψ is continuous, for each ε > 0 there is a positive number r such that

ψ(r) ≤ 1 + ε. (8.6.11)

Define

w(x) := exp[r(x + 2z̄)], x ∈ X. (8.6.12)

Then, from (8.6.10) with u = w,

w′(x, a) = w(0)[1 − G(x + a)] + w(x) ∫_0^{x+a} exp[r(a − z)]G(dz), (8.6.13)

so that, since 1 − G(x + a) ≤ 1 and r(a − z) ≤ r(θ − z) for all a ∈ A, we get

w′(x, a) ≤ w(0) + ψ(r)w(x) ≤ βw(x) + b ∀x ∈ X, (8.6.14)

with

β := 1 + ε and b := w(0).

On the other hand, a straightforward calculation using (8.6.8) and (8.6.9) shows that sup_A |c(x, a)| ≤ s·(x + 2z̄) for all x ∈ X and, therefore,

sup_A |c(x, a)| ≤ m w(x) (8.6.15)

for some constant m sufficiently large. Hence, as the function w in (8.6.12) is continuous, we see from (8.6.14) and (8.6.15) that conditions (i) and (ii) in Remark 8.3.5(a) are both satisfied for any discount factor α such that β < 1/α.
Verification of Assumptions 8.3.3 and 8.5.3. This follows from (8.6.13). □
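With all assumptions verified, Theorem 8.3.6 applies, so the dynamic programming operator is an α-contraction and value iteration converges to V*. The sketch below runs value iteration on a discretized version of Example 8.6.2; the grid size, cost parameters, and truncated Poisson demand are our own illustrative choices, not part of the example.

```python
import math

# A discretized sketch of Example 8.6.2 (grid size, parameters, and the
# truncated Poisson demand are our own choices): states x in {0,...,N},
# orders a in {0,...,theta}, discount factor alpha.
N, theta, alpha = 20, 4, 0.9
p_cost, m_cost, s_price = 1.0, 0.5, 2.0      # p, m <= s, as required by (8.6.8)
lam = 2.0
pz = [math.exp(-lam) * lam**z / math.factorial(z) for z in range(N + 1)]
pz[-1] += 1.0 - sum(pz)                      # put the Poisson tail mass on z = N

def cost(x, a):
    """c(x, a) = p·a + m·(x + a) - s·E min(x + a, z), cf. (8.6.7)."""
    e_min = sum(p * min(x + a, z) for z, p in enumerate(pz))
    return p_cost * a + m_cost * (x + a) - s_price * e_min

def T(v):
    """One value-iteration step: (Tv)(x) = min_a [c(x,a) + α Σ_z p(z) v((x+a-z)^+)]."""
    return [min(cost(x, a)
                + alpha * sum(p * v[max(x + a - z, 0)] for z, p in enumerate(pz))
                for a in range(theta + 1) if x + a <= N)
            for x in range(N + 1)]

v = [0.0] * (N + 1)
for _ in range(300):
    v = T(v)                                 # contracts at rate alpha
residual = max(abs(x - y) for x, y in zip(T(v), v))
assert residual < 1e-6                       # v is numerically a fixed point of T
```

The final assertion checks that the iterates have (numerically) reached the fixed point of T, as the contraction property guarantees.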
Example 8.6.2 is due essentially to Vega-Amaya [1]; see also Hernández-Lerma and Vega-Amaya [1]. For an inventory example illustrating the results in §8.4.D see Bes and Lasserre [1]. Many additional references on inventory theory are given in §1.3 and §3.7.


8.6.4 Example: A queueing system. Consider a control system of the form

x_{t+1} = (x_t + a_t η_t − ξ_t)⁺, t = 0, 1, ..., (8.6.16)

with state space X = [0, ∞). This system is related to the inventory model in Example 8.6.2 [in (8.6.16) take η_t ≡ 1], and also to Example 7.4.2 [compare (8.6.16) and (7.4.8)]. In fact, as noted in the last paragraph of Example 7.4.2, a model of the form (7.4.8) [or (8.6.16)] can have several interesting interpretations. Here, we interpret (8.6.16) as modelling a single-server queueing system (of general type GI/GI/1) with controllable service rates. Thus, x_t and η_t denote, respectively, the waiting time and a "base" service time of the tth customer (t = 0, 1, ...), whereas ξ_t stands for the interarrival time between the tth and (t + 1)th customers. The control variable a_t denotes the reciprocal of the service rate for the tth customer.
We shall suppose the following:

8.6.5 Assumptions on (8.6.16).

(a) The action (or control) set A = A(x), for all x ∈ X, is a compact subset of an interval (0, θ] for some (finite) number θ.

(b) {η_t} and {ξ_t} are independent sequences of i.i.d. random variables.

(c) η_0 and ξ_0 have continuous bounded densities g_1 and g_2, respectively.

(d) The random variable z := θη_0 − ξ_0 has a (finite) negative mean and a moment generating function ψ(r) := E(e^{rz}) that is finite for some r̄ > 0; that is,

(i) E(z) < 0, and (ii) ψ(r̄) < ∞. (8.6.17)

By Assumption 8.6.5(d), we have ψ(0) = 1 and ψ′(0) = E(z) < 0. Hence there is a positive number r ≤ r̄ such that

ψ(r) < 1.

For such a number r, we define the continuous weight function

w(x) := e^{rx}, x ∈ X. (8.6.18)

Moreover, we shall suppose that the associated cost-per-stage function c(x, a) satisfies Assumptions 8.5.1(b) [hence 8.3.1(b)] and 8.3.2(a); that is, c is l.s.c. on IK := X × A and, for some constant c̄,

sup_A |c(x, a)| ≤ c̄ w(x) ∀x ∈ X. (8.6.19)
Observe that this is not a restrictive condition; namely, as A is compact, the condition (8.6.19) is bound to be satisfied, for c̄ sufficiently large, by all the typical (say, polynomial) cost functions that appear in applications.

We shall now verify that the given queueing system satisfies the other assumptions in §8.3 and §8.5.
Verification of Assumptions 8.3.1-8.3.3 and 8.5.1-8.5.3. In view of the previous paragraphs, to complete the verification of Assumptions 8.3.1 and 8.5.1 it suffices to check the following:

8.6.6 Condition. The function u′(x, a) := ∫ u(y)Q(dy|x, a) is continuous and bounded on IK for every measurable bounded function u on X.

This requires some preliminary calculations. For every a ∈ A, let z_a := aη_0 − ξ_0. Then, by Assumptions 8.6.5(b),(c), the probability distribution function of z_a is given, for every real number y, by

P(z_a ≤ y) = ∫_0^∞ P(ξ_0 ≥ at − y) g_1(t) dt. (8.6.20)

Hence, denoting by g_a the density of z_a, we get

g_a(y) = ∫_0^∞ g_2(at − y) g_1(t) dt, (8.6.21)

which, by Assumption 8.6.5(c), is a bounded function, continuous in both variables a ∈ A and y ∈ ℝ. Observe that the latter property, continuity in a and y, is also satisfied by the distribution function in (8.6.20). This implies Condition 8.6.6 since, by (8.6.16) and (8.6.3),

u′(x, a) = E u[(x + z_a)⁺] = u(0)P(x + z_a ≤ 0) + ∫_{−x}^∞ u(x + y) g_a(y) dy, (8.6.22)

and the latter integral can be written as

∫_{−x}^∞ u(x + y) g_a(y) dy = ∫_0^∞ u(y) g_a(y − x) dy.

This completes the verification of Assumptions 8.3.1 and 8.5.1.


On the other hand, Assumptions 8.3.3 and 8.5.3 can also be deduced from (8.6.22) since replacing u by the weight function w [see (8.6.18)] yields

w′(x, a) = P(z_a ≤ −x) + w(x) ∫_{−x}^∞ e^{ry} g_a(y) dy, (8.6.23)

which is continuous on IK = X × A.

Finally, to verify (8.3.5) observe that, by Assumptions 8.6.5(a),(d),

z_a = aη_0 − ξ_0 ≤ θη_0 − ξ_0 = z ∀a ∈ A,

so that the integral in (8.6.23) satisfies

∫_{−x}^∞ e^{ry} g_a(y) dy ≤ ∫_{−∞}^∞ e^{ry} g_a(y) dy = E exp(rz_a) ≤ ψ(r) ∀(x, a) ∈ IK. (8.6.24)

Thus, as P(z_a ≤ −x) ≤ 1 ≤ w(x) for all (x, a) in IK, (8.6.23) implies that (8.3.5) holds with β := 1 + ψ(r). The latter fact and (8.6.19) show that Assumptions 8.3.2 and 8.5.2 are satisfied for every discount factor α < 1/β.

In conclusion, all of the results in §8.3 and §8.5 are applicable to the queueing system (8.6.16). In fact, many results in §8.4 are also applicable since the compactness of A in Assumption 8.6.5(a) implies, for instance, the condition (8.4.20). □
Example 8.6.4 comes from Gordienko and Hernández-Lerma [1].
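The drift argument in (8.6.23)-(8.6.24) is easy to test numerically on a concrete instance of (8.6.16). In the sketch below, the exponential service and interarrival distributions, and hence the closed form for ψ, are our own illustrative assumptions rather than part of the example.

```python
import math
import random

random.seed(1)

# An instance of (8.6.16) with our own distributions: base service time
# eta ~ Exp(2) (mean 1/2), interarrival time xi ~ Exp(1) (mean 1), theta = 1,
# so z = theta*eta - xi has E(z) = -1/2 < 0, as in (8.6.17)(i).
theta = 1.0
sample_z = lambda: theta * random.expovariate(2.0) - random.expovariate(1.0)

mc_mean = sum(sample_z() for _ in range(100_000)) / 100_000
assert mc_mean < 0                       # negative drift

# For these exponentials, psi(r) = E e^{rz} = (2/(2-r))·(1/(1+r)) for r < 2;
# at r = 1/2, psi = 8/9 < 1, so w(x) = e^{rx} is a valid weight (8.6.18).
r = 0.5
psi_exact = (2.0 / (2.0 - r)) * (1.0 / (1.0 + r))
assert psi_exact < 1.0

# Monte Carlo agrees with the closed form, supporting the drift bound
# w'(x, a) <= psi(r) w(x) + 1 behind (8.6.23)-(8.6.24).
mc_psi = sum(math.exp(r * sample_z()) for _ in range(100_000)) / 100_000
assert abs(mc_psi - psi_exact) < 0.02
```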

8.7 Further remarks


This chapter introduced a weighted-norm approach to discounted-cost Markov control problems. A comparison with the hypotheses and results in Chapters 4 and 6 shows that each particular context or solution technique has its own merits. For instance, the results in §8.4 are virtually impossible to obtain in the setting of Chapters 4 and 6, but then §8.4 (and §8.3) requires more restrictive assumptions.

There are other ways to study the discounted problem. In particular, in later chapters we will see that it can be studied by a "direct approach", or as a "transient" control problem, or using finite-dimensional linear-programming approximations.
9
The Expected Total Cost Criterion

9.1 Introduction

Let M = (X, A, {A(x) | x ∈ X}, Q, c) be the Markov control model (MCM) in §8.2. In this chapter we study the expected total cost (ETC) criterion defined as

V_1(π, x) := E_x^π[ Σ_{t=0}^∞ c(x_t, a_t) ], (9.1.1)

so the corresponding (optimal) value function is

V_1*(x) := inf_Π V_1(π, x), x ∈ X. (9.1.2)

As usual, the main problems we are concerned with are: (i) to "characterize" V_1*, for instance, as the solution of a certain "optimality (or dynamic programming) equation", and (ii) to determine conditions for the existence of ETC-optimal policies, that is, policies π* for which

V_1*(x) = V_1(π*, x) ∀x ∈ X. (9.1.3)

The ETC criterion was probably the earliest infinite-horizon control problem studied in the literature, going back at least to the 1920s; see, for example, Ramsey [1]. It is obviously very demanding from the technical viewpoint because, simply for V_1(π, x) to be finite valued, or even to be well defined, we need to impose very strong assumptions on the MCM M. In later chapters we will study other optimality criteria, such as "overtaking optimality", that are less restrictive but which at the same time maintain some of the features of the ETC criterion.

O. Hernández-Lerma et al., Further Topics on Discrete-Time Markov Control Processes
© Springer Science+Business Media New York 1999
The remainder of the chapter is organized as follows. For the sake of completeness and ease of reference, §9.2 summarizes some facts on extended real numbers and "quasi-integrability." In §9.3 we consider several theoretical questions, including the measurability of V_1(π, ·) and V_1*(·). Moreover, letting

V_α(π, x) := E_x^π[ Σ_{t=0}^∞ α^t c(x_t, a_t) ] (9.1.4)

be the α-discounted cost (0 < α < 1) in (8.3.1), we show that V_α(π, ·) converges to V_1(π, ·) as α ↑ 1. In §9.4 we study the sufficiency problem, in which the basic issue is to show that in (9.1.2) we may replace the set Π of all policies by the smaller set Π_RM of randomized Markov policies (Definition 8.2.2); in this case we say that Π_RM is a "complete" set of policies for the ETC criterion. The sufficiency problem requires the introduction of the ETC-expected occupation measures, a concept important in itself.

Section 9.5 gives conditions for the value function V_1* to be a solution of the ETC-optimality equation, as well as conditions for a policy to be ETC-optimal. Finally, in §9.6 we study transient MCMs. This is an important class of models for which all the hypotheses in §9.3 and §9.5 are satisfied, and, therefore, one can make a very detailed analysis of many optimality-related questions.

9.2 Preliminaries
This section contains background material. The reader may skip the section
and refer to it as needed.
A. Extended real numbers

In the set ℝ̄ := ℝ ∪ {+∞} ∪ {−∞} of extended real numbers, we adopt the usual rules of arithmetic (where ∞ := +∞):

r + ∞ = ∞ + r = ∞ and r − ∞ = −∞ + r = −∞ ∀r ∈ ℝ;
∞ + ∞ = ∞; −∞ − ∞ = −∞. (9.2.1)

Observe that ∞ − ∞ is not defined. Further,

r·∞ = ∞·r = ∞ if r ∈ ℝ̄, r > 0; = 0 if r = 0; = −∞ if r ∈ ℝ̄, r < 0. (9.2.2)
The positive and negative parts of an extended real number r are defined as

r⁺ := max(r, 0) and r⁻ := max(−r, 0), (9.2.3)

respectively, and satisfy r = r⁺ − r⁻ and |r| = r⁺ + r⁻.

Let {r_n} be a sequence in ℝ̄ that may contain one of the numbers +∞, −∞, but not both. Then the "partial sum" S_N := Σ_{n=0}^N r_n is well defined for each N = 1, 2, ..., and we say that the series Σ_{n=0}^∞ r_n converges (in ℝ̄) to r ∈ ℝ̄ if the limit lim_{N→∞} S_N exists (in ℝ̄) and equals r. For example, if

r_n ≥ 0 for all n = 0, 1, ..., (9.2.4)

then the series Σ r_n converges in ℝ̄ (the limit may be +∞). On the other hand, if {r_n} is such that

Σ_{n=0}^∞ r_n⁺ < ∞ or Σ_{n=0}^∞ r_n⁻ < ∞, (9.2.5)

then the series Σ r_n converges in ℝ̄ to

Σ_{n=0}^∞ r_n = Σ_{n=0}^∞ r_n⁺ − Σ_{n=0}^∞ r_n⁻. (9.2.6)

Moreover, if a series Σ r_n converges in ℝ̄, then

|Σ_{n=0}^∞ r_n| ≤ Σ_{n=0}^∞ |r_n|. (9.2.7)

This follows from the fact (easily verified by induction) that

|Σ_{n=0}^N r_n| ≤ Σ_{n=0}^N |r_n| ∀N = 0, 1, ....

The following elementary proposition will be used to relate the α-discounted cost V_α in (9.1.4) to the expected total cost V_1 [see Proposition 9.3.3(b)].

9.2.1 Proposition. Let {r_n} be a sequence in ℝ̄ and α (a "discount factor") a number in (0, 1). If {r_n} satisfies (9.2.4) or (9.2.5), then the series Σ α^n r_n converges in ℝ̄ for every α in (0, 1), and

lim_{α↑1} Σ_{n=0}^∞ α^n r_n = Σ_{n=0}^∞ r_n. (9.2.8)
Proof. If the proposition is true under (9.2.4), then [by (9.2.5) and (9.2.6)] it is also true under (9.2.5). Now, to prove the proposition assuming (9.2.4), it suffices to note that α^n r_n ≥ 0 for all n and, further, the partial sums

Σ_{n=0}^N α^n r_n

are nondecreasing in both variables α ∈ (0, 1] and N = 0, 1, .... Therefore (by Remark 9.6.13), we can interchange the following limits:

lim_{α↑1} lim_{N→∞} Σ_{n=0}^N α^n r_n = lim_{N→∞} lim_{α↑1} Σ_{n=0}^N α^n r_n = lim_{N→∞} Σ_{n=0}^N r_n,

and (9.2.8) follows. □
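A quick numerical illustration of Proposition 9.2.1, with sequences of our own choosing: for r_n = 2^{−n} the discounted sums increase to Σ r_n = 2 as α ↑ 1, while for r_n = (−1)^n, which satisfies neither (9.2.4) nor (9.2.5), Σ r_n diverges even though the discounted sums converge to 1/2 (an "Abel sum" only).

```python
# r_n = 2^{-n}: Σ r_n = 2, and Σ a^n r_n = Σ (a/2)^n = 2/(2 - a) -> 2 as a ↑ 1.
def discounted_sum(a, N=5_000):
    """Truncated discounted series Σ_{n<N} a^n r_n with r_n = 2^{-n}."""
    return sum(a**n * 0.5**n for n in range(N))

assert abs(discounted_sum(1.0) - 2.0) < 1e-12          # the undiscounted series
gaps = [abs(discounted_sum(a) - 2.0) for a in (0.9, 0.99, 0.999)]
assert gaps[0] > gaps[1] > gaps[2] and gaps[2] < 5e-3  # increases to 2 as a ↑ 1

# Without (9.2.4)/(9.2.5) the conclusion fails: for r_n = (-1)^n the series
# diverges, yet Σ a^n (-1)^n = 1/(1 + a) -> 1/2 as a ↑ 1.
abel = sum(0.999**n * (-1) ** n for n in range(100_001))
assert abs(abel - 0.5) < 0.01
```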

B. Integrability

Let (Ω, F, P) be a probability space, and ℝ̄ the set of extended real numbers. A random variable ζ : Ω → ℝ̄ is said to be integrable (with respect to P) if

E(ζ⁺) < ∞ and E(ζ⁻) < ∞.

In this case, the expectation (or expected value) of ζ is the real number

E(ζ) := E(ζ⁺) − E(ζ⁻). (9.2.9)

On the other hand, ζ is called quasi-integrable if

E(ζ⁺) < ∞ or E(ζ⁻) < ∞. (9.2.10)

The expectation of a quasi-integrable random variable ζ is again defined by (9.2.9), which is now an extended real number.

A nonnegative random variable ζ is quasi-integrable [as E(ζ⁻) = 0 < ∞]. A random variable ζ is integrable if and only if E|ζ| < ∞.

For a proof of parts (a) to (e) of the following proposition see, for instance, Neveu [1, p. 41]; for a proof of (f) see Hinderer [1, p. 146].

9.2.2 Proposition. Let ζ and ζ_n (n = 1, 2, ...) be quasi-integrable random variables. Then:

(a) E(kζ) = kE(ζ) for every finite constant k.
(b) E(ζ_1 + ζ_2) = E(ζ_1) + E(ζ_2) if ζ_1 + ζ_2 is defined [see (9.2.1)] and if ζ_1⁺ and ζ_2⁺ (or ζ_1⁻ and ζ_2⁻) are integrable.

(d) ζ_n ↑ ζ implies E(ζ_n) ↑ E(ζ) if ζ_n⁻ is integrable for at least one n.

(e) ζ_n ↓ ζ implies E(ζ_n) ↓ E(ζ) if ζ_n⁺ is integrable for at least one n.

(f) Suppose that Σ_n E(ζ_n⁺) < ∞ or Σ_n E(ζ_n⁻) < ∞. Then:

(f.2) Σ_n ζ_n converges almost surely to a quasi-integrable random variable.

(f.3) E(Σ_n ζ_n) = Σ_n E(ζ_n).

9.3 The expected total cost

Let M = (X, A, {A(x) | x ∈ X}, Q, c) be the Markov control model in §8.2, and consider the expected total cost (ETC)

V_1(π, x) := E_x^π[ Σ_{t=0}^∞ c(x_t, a_t) ], (9.3.1)

when using the policy π, given the initial state x_0 = x. The corresponding (optimal) value function is

V_1*(x) := inf_Π V_1(π, x), x ∈ X. (9.3.2)

To abbreviate, we shall sometimes write (9.3.1) as

V_1(π, x) = E_x^π[ Σ_{t=0}^∞ c_t ], where c_t := c(x_t, a_t). (9.3.3)

The first step in our study of the ETC criterion will be to consider the following basic theoretical issues.
9.3.1 Questions.

(a) Given a policy π, is V_1(π, ·) : X → ℝ (or ℝ̄) a measurable function? Similarly,

(b) Is V_1* : X → ℝ (or ℝ̄) a measurable function?

(c) For each policy π and each initial state x, let J_0(π, x) := 0 and

J_n(π, x) := E_x^π[ Σ_{t=0}^{n−1} c(x_t, a_t) ], n = 1, 2, ..., (9.3.4)

be the n-stage expected total cost. The corresponding optimal n-stage cost is J_0*(·) := 0 and, for n = 1, 2, ...,

J_n*(x) := inf_Π J_n(π, x), x ∈ X. (9.3.5)

The question is: as n → ∞, does J_n* converge to V_1*? In other words, we would like to find conditions under which

lim_{n→∞} J_n*(x) = V_1*(x) ∀x ∈ X. (9.3.6)

(d) Does the α-discounted cost V_α(π, x) converge to V_1(π, x) as α ↑ 1?

(e) To obtain the optimal value function V_1* in (9.3.2), is it "sufficient" to minimize V_1(·, x) over a subset Π′ of Π? If this happens to be true, we then say that Π′ is a sufficient set of policies for the ETC problem.

In this section we give conditions under which each of the questions (a) to (d) has an affirmative answer. Question (e) is postponed to the next section since it requires some preliminary concepts and results.
First, concerning Question 9.3.1(a), we need to ensure that the series Σ c_t in (9.3.3) is (at least) quasi-integrable with respect to P_x^π. Hence [as in (9.2.10)], we consider the expectations of the nonnegative series

V_1^{(±)}(π, x) := E_x^π[ Σ_{t=0}^∞ c_t^± ], (9.3.7)

and we suppose the following:

9.3.2 Assumption. For each x ∈ X,

sup_Π V_1^{(−)}(π, x) < ∞. (9.3.8)

Observe that (9.3.8) trivially holds if the cost-per-stage c(x, a) is nonnegative, in which case c⁻ = 0. On the other hand, under Assumption 9.3.2 we obtain, in particular, an affirmative answer to Questions 9.3.1(a) and (d):
9.3.3 Proposition. If Assumption 9.3.2 is satisfied, then:
(a) x ↦ V_1(π, x) is an extended real-valued measurable function on X for every policy π, and similarly for x ↦ V_α(π, x) for each discount factor α in (0, 1).

Moreover,

(b) lim_{α↑1} V_α(π, x) = V_1(π, x) for each π ∈ Π and x ∈ X, and

(c) V_1*(x) > −∞ for each x ∈ X.

Proof. (a) For each policy π and initial state x, the condition (9.3.8) implies that V_1^{(−)}(π, x) < ∞, with V_1^{(−)} as in (9.3.7). Hence, as

V_1(π, x) = V_1^{(+)}(π, x) − V_1^{(−)}(π, x), (9.3.9)

and similarly for V_α(π, x), part (a) follows from Proposition 9.2.2(f) and the properties (8.2.7) and (8.2.9) of the p.m. P_ν^π with ν = δ_x, the Dirac measure concentrated at x_0 = x.

(b) This follows from Proposition 9.2.1. Incidentally, observe that since

V_α* := inf_Π V_α(π, ·) ≤ V_α(π′, ·) ∀π′,

taking the limit α ↑ 1 and using (b) and the definition (9.3.2) of V_1*, we obtain

lim sup_{α↑1} V_α*(x) ≤ V_1*(x) ∀x ∈ X. (9.3.10)

(c) As V_1^{(+)} ≥ 0, (9.3.9) yields V_1(π, x) ≥ −V_1^{(−)}(π, x). Thus, taking the infimum over all π and using (9.3.8) we obtain part (c). □
Concerning Question 9.3.1(c), we show below [Theorem 9.3.5(a)] that Assumption 9.3.2 yields

lim sup_{n→∞} J_n*(x) ≤ V_1*(x) ∀x ∈ X. (9.3.11)

However, to get (9.3.6) we will introduce an additional assumption, which uses the following notation: V_1^n(π, x) denotes the expected total cost from time n onwards when using the policy π, given the initial state x_0 = x; that is,

V_1^n(π, x) := E_x^π[ Σ_{t=n}^∞ c(x_t, a_t) ], n = 0, 1, .... (9.3.12)

Note that V_1^0(π, x) = V_1(π, x). Moreover, from (9.3.3) and (9.3.4),

V_1(π, x) = J_n(π, x) + V_1^n(π, x), n = 0, 1, .... (9.3.13)

In addition, Assumption 9.3.2 and Proposition 9.2.2(f) yield that

lim_{n→∞} J_n(π, x) = V_1(π, x) ∀π, x, (9.3.14)

and, therefore, by (9.3.13),

lim_{n→∞} V_1^n(π, x) = 0 ∀π, x. (9.3.15)

The following assumption requires (9.3.15) to hold in a stronger form.

9.3.4 Assumption. For each x ∈ X,

lim inf_{n→∞} sup_Π V_1^n(π, x) = 0. (9.3.16)

9.3.5 Theorem. (a) If Assumption 9.3.2 is satisfied, then (9.3.11) holds.

(b) If both Assumptions 9.3.2 and 9.3.4 are satisfied, then

lim inf_{n→∞} J_n*(x) ≥ V_1*(x) ∀x ∈ X, (9.3.17)

which combined with (9.3.11) yields (9.3.6), i.e.,

lim_{n→∞} J_n*(x) = V_1*(x) ∀x ∈ X.

Proof. (a) By the definition (9.3.5) of J_n*,

J_n*(·) ≤ J_n(π, ·) ∀n, π,

and taking lim sup_n we get, by (9.3.14),

lim sup_{n→∞} J_n*(·) ≤ V_1(π, ·) ∀π.

This inequality and (9.3.2) imply (9.3.11).

(b) From (9.3.13),

V_1(π, x) = J_n(π, x) + V_1^n(π, x) ≤ J_n(π, x) + sup_Π V_1^n(π, x),

so that, taking the infimum over all π ∈ Π,

V_1*(x) ≤ J_n*(x) + sup_Π V_1^n(π, x).

Finally, taking lim inf_n and using (9.3.16) we obtain (9.3.17). □

From the previous paragraphs we can obtain two (obvious) sufficient conditions for Question 9.3.1(b) to have an affirmative answer.

9.3.6 Proposition. The value function V_1* is measurable if (for instance) one of the two following conditions is satisfied:

(1) Assumption 9.3.2 holds and, further, there exists an ETC-optimal policy π*, that is, a policy π* such that

V_1*(x) = V_1(π*, x) ∀x ∈ X. (9.3.18)

(2) Assumptions 9.3.2 and 9.3.4 both hold and the functions J_n* are measurable.

Proof. If (1) holds, then the measurability of V_1* follows from Proposition 9.3.3(a).

On the other hand, under (2), the measurability of V_1* follows from a well-known result in real analysis: a pointwise limit of Borel-measurable functions is Borel-measurable. (See, for instance, Ash [1], Theor. 1.5.4.) Indeed, if (2) holds, then V_1* is measurable because, by Theorem 9.3.5(b) and (9.3.6), it is the pointwise limit of the measurable functions J_n*. □

Of course, Proposition 9.3.6 answers one question [Question 9.3.1(b)], but simultaneously it raises another: when are (1) or (2) satisfied? This question is dealt with in §9.5 and §9.6. First, however, in the next section we consider Question 9.3.1(e).
Notes on §9.3

1. Most of the works on the expected total cost (ETC) criterion deal with Markov control processes (MCPs) in which: (i) the state space X is a countable set, and/or (ii) the MCP is either positive (that is, c ≤ 0) or negative (c ≥ 0). For extensive bibliographies on these two cases see, for instance, Altman [1], Bertsekas [1], or Puterman [1]. Among the few works dealing with Borel state spaces and not distinguishing between positive and negative MCPs, we can mention the papers by Quelle [1], Rieder [2], Schäl [1], and Hinderer's [1] monograph.

2. With respect to Question 9.3.1(b), it is well known that, in a very general context, the value function V_1* is universally measurable (see, for instance, Hinderer [1]), which is a much weaker concept than (Borel) measurability. To our knowledge, measurability of V_1* typically requires restrictive conditions, such as (1) and (2) in Proposition 9.3.6.

3. The Markov control model M = (X, A, {A(x) | x ∈ X}, Q, c) is called convergent if it satisfies Assumption 9.3.2 and, in addition,

sup_Π V_1^{(+)}(π, x) < ∞ for each x ∈ X. (9.3.19)

Equivalently, as ICt I = ct + ct", the MCM (Markov control model) M is


convergent if

supE;
II
(f: let I)
t=o
< 00 for each x E X. (9.3.20)

In §9.6 we introduce a class of MCMs for which (9.3.20) holds. Although


(9.3.20) is a strong condition, it still allows some "pathologies"-for in-
stance, it does not guarantee (9.3.6), as shown by counterexamples in Put-
erman [1, §7.3.3], Strauch [1], van Hee et al. [1], etc. This is one of the
84 9. The Expected Total Cost Criterion

reasons for introducing Assumption 9.3.4, which is similar to "tail condi-


tions" used by Schal [1], van Hee et al. [1], and many other authors. Other
works (for instance, van Nunen and Wessels [2]) use "Lyapunov functions"
instead of tail conditions.
Note that if (9.3.20) holds, then of course

n-I )
lim
n--+oo
!..E;
n
( "Ictl
~
= 0 '</7(, x. (9.3.21)
t=o
MCMs with this property are called zero-average cost models.

9.4 Occupation measures and the sufficiency problem

In this section we consider the "sufficiency problem" posed in Question 9.3.1(e); namely, is there a (proper) subset Π′ of Π which is sufficient for the ETC problem? Here, "sufficient" means that, with V_1* as in (9.3.2),

V_1*(·) = inf_Π V_1(π, ·) = inf_{Π′} V_1(π, ·). (9.4.1)

In Theorem 9.4.5 we show that, under suitable assumptions, (9.4.1) holds with Π′ = Π_RM, the family of randomized Markov policies [Definition 8.2.2(a)].

This result is important for at least two reasons. First, the minimization in (9.4.1) is greatly "simplified" in that it suffices to do it over the smaller set Π′. Second, (9.4.1) states that the information required to determine an optimal control action at each time t reduces to the current state x_t [see (8.2.4)], in contrast to a general policy, which requires the full history h_t [see (8.2.3)].

We also show (Theorem 9.4.6) that if c(x, a) is nonnegative, then the second equality in (9.4.1) is satisfied when Π and Π′ are replaced by Π_RS and Π_DS, respectively, i.e.,

inf_{Π_RS} V_1(π, ·) = inf_{Π_DS} V_1(π, ·). (9.4.2)

In other words, the family Π_DS of deterministic stationary policies is "sufficient" within the class Π_RS of randomized stationary policies when c(x, a) is nonnegative.

To study the sufficiency problem we use the notion of ETC-expected occupation measures, which are similar to the α-discount expected occupation measures (or state-action frequencies) introduced in §6.3.
A. Expected occupation measures

In this subsection, π ∈ Π denotes an arbitrary policy, and ν stands for an arbitrary initial distribution, that is, a p.m. (probability measure) on the state space X. The corresponding ETC is

V_1(π, ν) := E_ν^π[ Σ_{t=0}^∞ c(x_t, a_t) ]. (9.4.3)

In particular, if ν is the Dirac measure δ_x concentrated at x_0 = x, we obtain (9.3.1). We suppose the following:

9.4.1 Assumption. Assumption 9.3.2 is satisfied and, in addition, (9.3.8) holds with ν in lieu of x.
The easiest way to obtain the ETC-expected occupation measure μ_ν^π is to replace the cost c(x, a) in (9.4.3) by the indicator function 1_Γ(x, a) of a Borel set Γ in X × A, and then let Γ vary in B(X × A). This yields the measure

μ_ν^π(Γ) := Σ_{t=0}^∞ P_ν^π[(x_t, a_t) ∈ Γ], Γ ∈ B(X × A). (9.4.4)

9.4.2 Remark. (a) Let {λ_n} be a sequence of measures on some measurable space (S, S). If the sequence is increasing (λ_n ≤ λ_{n+1} ∀n) and converges setwise to λ [that is, λ_n(B) → λ(B) ∀B ∈ S], then the limit λ is a measure. (For a proof see, for instance, Doob [1], pp. 30-31.) It follows that μ_ν^π in (9.4.4) is indeed a measure, since it is the setwise limit of the increasing sequence of measures

γ_n(·) := Σ_{t=0}^n P_ν^π[(x_t, a_t) ∈ ·], n = 0, 1, .... (9.4.5)
(b) By (8.2.3) and the properties (8.2.7)-(8.2.9) of P:, the expected
occupation measure jL~ is defined on (the Borel subsets of) X x A but it is
concentrated on the set ]K in (8.2.1), i.e.,
jL~ (]KC) = 0, where ]Kc:= complement of ]K. (9.4.6)
Moreover, using again that μ_ν^π is the limit of the increasing sequence in
(9.4.5), and writing V_1(π, ν) as the difference of its positive and negative
parts [see (9.3.9) and (9.3.7)], it can be seen that we may rewrite (9.4.3) as

V_1(π, ν) = ∫_{X×A} c(x, a) μ_ν^π(d(x, a)).    (9.4.7)

To see this, suppose that c is nonnegative, as is the case for c⁺ and c⁻.
In addition, in (9.3.4) replace x with ν, and let γ_n be the measure in
(9.3.5). Then, since c ≥ 0 and γ_n ≤ μ_ν^π, we get

J_n(π, ν) = ∫ c dγ_n ≤ ∫ c dμ_ν^π    ∀n.

Therefore, letting n → ∞, (9.3.14) gives

V_1(π, ν) ≤ ∫ c dμ_ν^π.

To obtain the reverse inequality, let {u_k} be a nondecreasing sequence of
nonnegative simple functions such that u_k ↑ c pointwise. Hence

∫ u_k dγ_n ≤ ∫ c dγ_n = J_n(π, ν) ≤ V_1(π, ν)    ∀n,

and the setwise convergence of γ_n to μ_ν^π gives

∫ u_k dμ_ν^π ≤ V_1(π, ν)    ∀k.

Thus, letting k → ∞, it follows from the Monotone Convergence Theorem
that

∫ c dμ_ν^π ≤ V_1(π, ν),

which completes the proof of (9.4.7) when c is nonnegative. Finally, replacing c with c⁺ and c⁻ we obtain (9.4.7) for a general c.
Strictly speaking, we should write (9.4.7) as an integral over K instead of
X × A, since c(x, a) is defined on K only. However, we can always measurably
extend c to all of X × A for (9.4.7) to be well defined. For instance, we may
take c(x, a) := +∞ on the complement of K, and then [by (9.4.6)] the
convention (9.2.2) would yield that (9.4.7) consists of the integral over K
plus a term equal to zero.
Also note that, for the given initial distribution ν, we may rewrite (9.3.8)
as

sup_π ∫ c⁻(x, a) μ_ν^π(d(x, a)) < ∞.
(c) If B and C are Borel sets in X and A, respectively, and if Γ is the
measurable rectangle B × C, then (9.4.4) becomes

μ_ν^π(B × C) = Σ_{t=0}^∞ P_ν^π(x_t ∈ B, a_t ∈ C).    (9.4.8)

In particular, if C is the action space A itself (i.e., C = A), then we obtain
the marginal (or projection) μ̄_ν^π of μ_ν^π on X, i.e.,

μ̄_ν^π(B) := μ_ν^π(B × A) = Σ_{t=0}^∞ P_ν^π(x_t ∈ B)    (9.4.9)

for all B ∈ B(X). □
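For a finite state and action space and a stationary policy, the series (9.4.4) can be computed in closed form. The following Python sketch (all numerical data hypothetical, with a substochastic transition kernel so that the series converges, as in the transient case of §9.6) computes the occupation measure and checks it against the truncated sums (9.4.5):

```python
import numpy as np

# Toy finite model (hypothetical numbers): states {0,1}, actions {0,1}.
# Q[a][x, y] is a SUBstochastic transition law, so the series in (9.4.4)
# converges (cf. the transient case of Section 9.6).
Q = np.array([[[0.5, 0.3],
               [0.2, 0.4]],      # action 0
              [[0.1, 0.6],
               [0.3, 0.3]]])     # action 1
phi = np.array([[0.7, 0.3],      # phi(a|x): a randomized stationary policy
                [0.4, 0.6]])
nu = np.array([1.0, 0.0])        # initial distribution (Dirac at x = 0)

# State transition kernel under phi: Q_phi[x, y] = sum_a phi(a|x) Q(y|x, a).
Q_phi = np.einsum('xa,axy->xy', phi, Q)

# mu_bar(x) = sum_t P[x_t = x]  (the marginal (9.4.9)), via the Neumann
# series sum_t nu Q_phi^t = nu (I - Q_phi)^{-1}.
mu_bar = nu @ np.linalg.inv(np.eye(2) - Q_phi)

# mu(x, a) = mu_bar(x) phi(a|x): the occupation measure (9.4.4) on X x A.
mu = mu_bar[:, None] * phi

# Cross-check against a truncated sum (9.4.5): sum_{t<n} P[(x_t,a_t) in .].
mu_trunc = np.zeros((2, 2))
p = nu.copy()
for _ in range(200):
    mu_trunc += p[:, None] * phi
    p = p @ Q_phi
print(np.max(np.abs(mu - mu_trunc)))  # should be tiny
```

The geometric-series inverse is exactly the finite-state analogue of summing the increasing sequence (9.4.5) setwise.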



To deal with Question 9.3.1(e) it is convenient to rewrite (9.4.8) as

μ_ν^π(B × C) = Σ_{t=0}^∞ μ_{ν,t}^π(B × C),    (9.4.10)

where

μ_{ν,t}^π(B × C) := P_ν^π(x_t ∈ B, a_t ∈ C)    (9.4.11)

for B ∈ B(X), C ∈ B(A), t = 0, 1, .... Similarly, we may rewrite (9.4.9) as

μ̄_ν^π(B) = Σ_{t=0}^∞ μ̄_{ν,t}^π(B),    B ∈ B(X),    (9.4.12)

where

μ̄_{ν,t}^π(B) := μ_{ν,t}^π(B × A) = P_ν^π(x_t ∈ B),    B ∈ B(X),    (9.4.13)

is the marginal of μ_{ν,t}^π on X. We have, on the other hand:


9.4.3 Lemma.

(a) μ̄_{ν,0}^π(·) = ν(·).

(b) μ̄_{ν,t}^π(·) = ∫_{X×A} Q(·|x, a) μ_{ν,t−1}^π(d(x, a)),  t = 1, 2, ....

(c) μ̄_ν^π(·) = ν(·) + ∫_{X×A} Q(·|x, a) μ_ν^π(d(x, a)).

Proof. (a) This follows from (9.4.13) with t = 0, and (8.2.7).
(b) First note that, for t = 1, 2, ..., we may write (8.2.9) as

E_ν^π[I_B(x_t) | h_{t−1}, a_{t−1}] = Q(B | x_{t−1}, a_{t−1}).

So, taking the expectation E_ν^π(·) and using (9.4.13), we get, for any B in B(X),

μ̄_{ν,t}^π(B) = E_ν^π[I_B(x_t)]
  = E_ν^π[Q(B | x_{t−1}, a_{t−1})]
  = ∫_{X×A} Q(B | x, a) μ_{ν,t−1}^π(d(x, a)),

which proves (b).

(c) For any Borel set B in X, (9.4.12) and part (a) yield

μ̄_ν^π(B) = ν(B) + Σ_{t=1}^∞ μ̄_{ν,t}^π(B).

Thus, using (b) and (9.4.10), we obtain (c). □
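In the finite case, part (c) of the lemma is a linear identity that can be checked numerically. A small Python sketch (hypothetical numbers, substochastic kernel so the occupation measure is finite):

```python
import numpy as np

# Finite sketch of Lemma 9.4.3(c) (all numbers hypothetical): with
# mu(x,a) the occupation measure and mu_bar its marginal on X, check
#   mu_bar(B) = nu(B) + sum_{x,a} Q(B|x,a) mu(x,a).
Q = np.array([[[0.5, 0.3], [0.2, 0.4]],
              [[0.1, 0.6], [0.3, 0.3]]])          # substochastic Q(y|x,a)
phi = np.array([[0.7, 0.3], [0.4, 0.6]])          # stationary policy phi(a|x)
nu = np.array([0.5, 0.5])                         # initial distribution

Q_phi = np.einsum('xa,axy->xy', phi, Q)           # kernel under phi
mu_bar = nu @ np.linalg.inv(np.eye(2) - Q_phi)    # marginal (9.4.9)
mu = mu_bar[:, None] * phi                        # occupation measure (9.4.4)

rhs = nu + np.einsum('axy,xa->y', Q, mu)          # nu + integral of Q d(mu)
print(np.allclose(mu_bar, rhs))                   # True
```

The identity holds because mu_bar = nu Σ_t Q_phi^t = nu + (nu Σ_t Q_phi^t) Q_phi, the matrix form of the balance equation in the lemma.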



B. The sufficiency problem

We need the following preliminary result stated in Proposition D.8(a)j


for a proof see, for instance, Dynkin and Yushkevich [1, pp. 88-89] or Hin-
derer [1, p. 89].
9.4.4 Lemma. Let II{ and ~ be as in (8.2.1) and Definition 8.2.1, re-
spectively. If IL is a p. m. on X x A concentrated on II{, then there exists a
stochastic kernel cP E ~ such that

IL(B x C) = In cp(Clx)ji(dx) 'VB E B(X), C E B(A),

where jiO := IL(· x A) is the marginal (or projection) of IL on x.


We next give an affirmative answer to Question 9.3.1(e) by showing that
in (9.3.2) we may replace Π by the subset Π_RM of the randomized Markov
policies [Definition 8.2.2(a)]; in other words, Π_RM is a sufficient set of policies for the ETC problem.
9.4.5 Theorem. Let π be an arbitrary policy and ν an arbitrary initial
distribution. Suppose that Assumption 9.4.1 is satisfied. Then there exists
a randomized Markov policy π′ = {φ_t} ∈ Π_RM such that the ETC-expected
occupation measures for π and π′ coincide, i.e.,

μ_ν^π = μ_ν^{π′}.    (9.4.14)

Hence

V_1(π, ν) = V_1(π′, ν),    (9.4.15)

and

inf_{π ∈ Π} V_1(π, ν) = inf_{π′ ∈ Π_RM} V_1(π′, ν).    (9.4.16)

Proof. By (9.4.10), to prove (9.4.14) it suffices to find a randomized Markov
policy π′ = {φ_t, t = 0, 1, ...} such that

μ_{ν,t}^{π′} = μ_{ν,t}^π    ∀t = 0, 1, ...,    (9.4.17)

where μ_{ν,t}^π is the p.m. defined in (9.4.11). In turn, to prove (9.4.17) we
may use Lemma 9.4.4 as follows: Since [by (8.2.3)] μ_{ν,t}^π is a p.m. on X × A
concentrated on K, there exists a stochastic kernel φ_t in Φ such that (as in
Lemma 9.4.4)

μ_{ν,t}^π(B × C) = ∫_B φ_t(C|x) μ̄_{ν,t}^π(dx)    (9.4.18)

for every B ∈ B(X), C ∈ B(A), and t = 0, 1, .... Now let π′ be the randomized Markov policy π′ := {φ_0, φ_1, ...}, with φ_t as in (9.4.18). Then (9.4.17)
trivially holds for t = 0 since, by (9.4.11) and Lemma 9.4.3(a), for all B in
B(X) and C in B(A) we have

μ_{ν,0}^{π′}(B × C) := P_ν^{π′}(x_0 ∈ B, a_0 ∈ C)
  = ∫_B φ_0(C|x) ν(dx)
  = μ_{ν,0}^π(B × C),

where the last equality is due to (9.4.18) and Lemma 9.4.3(a) again. The
proof now proceeds by induction: Suppose that (9.4.17) holds for some
integer t ≥ 0. Then, by Lemma 9.4.3(b), the marginal of μ_{ν,t+1}^π on X
satisfies

μ̄_{ν,t+1}^π(·) = ∫_{X×A} Q(·|x, a) μ_{ν,t}^π(d(x, a))
  = ∫_{X×A} Q(·|x, a) μ_{ν,t}^{π′}(d(x, a))    [by the induction hypothesis]
  = μ̄_{ν,t+1}^{π′}(·)    [by Lemma 9.4.3(b)];

that is, the marginals (on X) of μ_{ν,t+1}^π and μ_{ν,t+1}^{π′} coincide. This implies
that (9.4.17) holds for t + 1 because [by (9.4.11)]

μ_{ν,t+1}^{π′}(B × C) = P_ν^{π′}(x_{t+1} ∈ B, a_{t+1} ∈ C)
  = ∫_B φ_{t+1}(C|x) μ̄_{ν,t+1}^{π′}(dx)
  = ∫_B φ_{t+1}(C|x) μ̄_{ν,t+1}^π(dx)
  = μ_{ν,t+1}^π(B × C)    [by (9.4.18)].

This completes the proof of (9.4.17), which, as was already mentioned, gives
(9.4.14). In turn, (9.4.14) and (9.4.7) give the equality (9.4.15), and also
give (9.4.16) since π was an arbitrary policy. □
In connection with the question of "sufficiency" of sets of policies, we
next present an interesting result which states that, under appropriate
assumptions, the ETC corresponding to a randomized stationary policy
φ^∞ ∈ Π_RS can always be "improved" (or "minorized") by the ETC of
some deterministic stationary policy f^∞ ∈ Π_DS, in the sense that

V_1(f^∞, ν) ≤ V_1(φ^∞, ν),

where ν is the given initial distribution. In other words, this fact would
yield another affirmative answer to Question 9.3.1(e) when Π and Π′ are
replaced by Π_RS and Π_DS, respectively; see (9.4.21). The precise statement is as follows.

9.4.6 Theorem. Suppose that the cost-per-stage function c(x, a) is nonnegative (so that, in particular, Assumption 9.3.2 is satisfied). Further,
let φ^∞ and ν be a given randomized stationary policy and a given initial
distribution, respectively, such that
(*) V_1(φ^∞, ·) is a finite-valued function on X, integrable
with respect to ν.
Then there exists a deterministic stationary policy f^∞ ∈ Π_DS such that

V_1(f^∞, x) ≤ V_1(φ^∞, x)    ∀x ∈ X;    (9.4.19)

hence

V_1(f^∞, ν) ≤ V_1(φ^∞, ν).    (9.4.20)

Moreover, if the condition (*) holds for every randomized stationary φ^∞ ∈
Π_RS, then

inf_{π ∈ Π_RS} V_1(π, ν) = inf_{π ∈ Π_DS} V_1(π, ν).    (9.4.21)

The proof of Theorem 9.4.6 requires two preliminary facts. The first one
is the following lemma of Hinderer [1, Lemma 15.1], which is an extension
of a result by Blackwell [1].
9.4.7 Lemma. Let K and F, Φ be as in (8.2.1) and Definition 8.2.1, respectively. If φ is a stochastic kernel in Φ and v : K → ℝ̄ is a measurable
function such that

x ↦ ∫_A v⁻(x, a) φ(da|x)

is a finite-valued map on X, then there exists a decision function f ∈ F
that satisfies

∫_A v(x, a) φ(da|x) ≥ v(x, f(x))    ∀x ∈ X.
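For finitely many actions, the decision function in the lemma can be taken as a pointwise argmin, since an average can never fall below the minimum. A Python sketch (hypothetical data):

```python
import numpy as np

# Finite sketch of Lemma 9.4.7 (hypothetical numbers): choosing
# f(x) in argmin_a v(x,a) guarantees that the average of v(x, .) under
# any randomized kernel phi(.|x) is >= v(x, f(x)).
v = np.array([[3.0, 1.0, 2.0],
              [0.5, 4.0, 0.7]])          # v(x,a) on 2 states, 3 actions
phi = np.array([[0.2, 0.5, 0.3],
                [0.1, 0.8, 0.1]])        # any stochastic kernel phi(a|x)

f = np.argmin(v, axis=1)                 # a decision function f in F
avg = (phi * v).sum(axis=1)              # integral of v(x,a) phi(da|x)
print(np.all(avg >= v[np.arange(2), f])) # True
```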

The second fact we need is simply another expression for the ETC V_1 in
(9.3.1), which will also be useful in later sections. [The expression (9.4.23)
corresponds to the special case n = 1 of Lemma 9.5.6(a).] We will use the
following notation: Given a policy π = {π_t, t = 0, 1, ...}, the 1-shift policy
π^(1) = {π_t^(1), t = 0, 1, ...} is defined as

π_0^(1)(·|x_1) := π_1(·|x_0, a_0, x_1),

and for t = 1, 2, ...,

π_t^(1)(·|x_1, a_1, ..., x_{t+1}) := π_{t+1}(·|x_0, a_0, x_1, a_1, ..., x_{t+1}).

In particular, if π = φ^∞ is a randomized stationary policy, then [by Definition 8.2.2(b)] the 1-shift policy is given by

(φ^∞)^(1) = φ^∞.    (9.4.22)

9.4.8 Lemma. Suppose that Assumption 9.3.2 is satisfied. Then, for each
policy π = {π_t} and initial state x ∈ X,

V_1(π, x) = ∫_A [ c(x, a) + ∫_X V_1(π^(1), y) Q(dy|x, a) ] π_0(da|x).    (9.4.23)

In particular, if π = φ^∞ is a randomized stationary policy, then [by (9.4.22)]

V_1(φ^∞, x) = ∫_A [ c(x, a) + ∫_X V_1(φ^∞, y) Q(dy|x, a) ] φ(da|x).    (9.4.24)

Proof. From (9.3.1),

V_1(π, x) = E_x^π[c(x_0, a_0)] + E_x^π [ Σ_{t=1}^∞ c(x_t, a_t) ].

Furthermore, by (8.2.7)–(8.2.9), the first term on the right-hand side equals

∫_A c(x, a) π_0(da|x),

and the second equals

E_x^π { E_x^π [ Σ_{t=1}^∞ c(x_t, a_t) | x_0, a_0, x_1 ] } = E_x^π[V_1(π^(1), x_1)]
  = ∫_A ∫_X V_1(π^(1), y) Q(dy|x, a) π_0(da|x).

Combining these expressions we obtain (9.4.23). □
We are now ready to prove Theorem 9.4.6.
Proof of Theorem 9.4.6. Rewrite (9.4.24) in the form

V_1(φ^∞, x) = ∫_A v(x, a) φ(da|x)

with

v(x, a) := c(x, a) + ∫_X V_1(φ^∞, y) Q(dy|x, a).

Therefore, by the hypothesis (*) and Lemma 9.4.7, there is a decision function f ∈ F such that, using the notation (8.2.6),

V_1(φ^∞, x) ≥ c(x, f) + ∫_X V_1(φ^∞, y) Q(dy|x, f).

Iteration of this inequality gives, for every n = 1, 2, ... [with Q^n(·|x, f) as
in (8.2.11) and f^∞ ∈ Π_DS the deterministic stationary policy determined
by f ∈ F; see Remark 8.2.3(a)],

V_1(φ^∞, x) ≥ E_x^{f^∞} [ Σ_{t=0}^{n−1} c(x_t, f) ] + ∫_X V_1(φ^∞, y) Q^n(dy|x, f).



Therefore, by (9.3.4) (with π = f^∞) and the assumption that c ≥ 0, we get

V_1(φ^∞, x) ≥ J_n(f^∞, x)    ∀n = 1, 2, ...,

and, letting n → ∞, we see that (9.4.19) follows from (9.3.14).
On the other hand, integration of both sides of (9.4.19) with respect to
ν yields (9.4.20).
Finally, to prove (9.4.21), observe that (9.4.20) implies that

inf_{Π_RS} V_1(π, ν) ≥ inf_{Π_DS} V_1(π, ν),

whereas the reverse inequality follows from the fact that Π_DS is contained
in Π_RS (since F is contained in Φ; see the paragraph after Definition 8.2.1).
□
Notes on §9.4

1. For MCPs with a countable state space X, Theorem 9.4.5 is due to
Strauch [1], and Derman and Strauch [1]; see also Derman [1]. Theorem
9.4.6, on the other hand, is modelled after results by Kurano and Kawai
[1] and González-Hernández and Hernández-Lerma [1] for discounted cost
problems, but in fact it can be seen as a very special case of Hinderer's [1]
Theorem 15.2, which deals with nonhomogeneous Markov control models.
In our present context, Hinderer's result can be stated as follows:
Suppose that Assumption 9.4.1 is satisfied and that the cost-per-stage
c(x, a) is nonnegative. Then for each policy π there exists a deterministic Markov policy π^d = {f_t} such that

V_1(π^d, x) ≤ V_1(π, x)    ∀x ∈ X.    (9.4.25)

To prove this result the idea is that, by Theorem 9.4.5, we may assume at
the outset that π is a randomized Markov policy, say π = {φ_t}. Then, by
(9.4.23) and Lemma 9.4.7, there exists f_0 ∈ F such that

V_1(π, x) = ∫_A [ c(x, a) + ∫_X V_1(π^(1), y) Q(dy|x, a) ] φ_0(da|x)
  ≥ c(x, f_0) + ∫_X V_1(π^(1), y) Q(dy|x, f_0).

Next, applying the same argument to V_1(π^(1), ·) one obtains f_1 ∈ F, and
continuing in this manner we get π^d = {f_0, f_1, ...}, which satisfies (9.4.25).
2. The ETC-expected occupation measure μ_ν^π in (9.4.4) corresponds to
the case α = 1 of the α-discount expected occupation measures (or state-action frequencies) in §6.3, namely,

μ_ν^{π,α}(Γ) := Σ_{t=0}^∞ α^t P_ν^π[(x_t, a_t) ∈ Γ],    Γ ∈ B(X × A),  0 < α < 1.    (9.4.26)

Of course, this measure is finite, with total measure 1/(1 − α), but μ_ν^π is
not. Another important difference between the α-discount and the undiscounted (α = 1) expected occupation measures is the following. In the
former case (0 < α < 1) the corresponding equation in Lemma 9.4.3(c) is

μ̄(·) = ν(·) + α ∫_{X×A} Q(·|x, a) μ(d(x, a)),    (9.4.27)

which characterizes the family of α-discount expected occupation measures
(Theorem 6.3.7); that is, any such measure satisfies (9.4.27), and conversely, if μ is a measure on X × A, concentrated on K, and satisfying
(9.4.27), then μ is the α-discount expected occupation measure μ_ν^{π,α} corresponding
to some policy π. The latter statement (the converse) is not true in the
undiscounted case, α = 1, unless we impose additional assumptions on the
transition law Q.

9.5 The optimality equation


This section has three main objectives. First, we give conditions under
which the value function V_1* [see (9.3.2)] satisfies the optimality equation
(9.5.2), (9.5.3). Second, we present several optimality criteria similar to
those in §4.5 for the discounted cost. Third, we use the optimality equation
to produce optimality criteria when using deterministic stationary policies.

A. The optimality equation

Let T ≡ T_1 be the dynamic programming operator T_α in (8.3.17) when
α = 1; that is,

Tu(x) := inf_{A(x)} [ c(x, a) + ∫_X u(y) Q(dy|x, a) ],    x ∈ X.    (9.5.1)

A measurable function u is said to satisfy (or to be a solution of) the
optimality equation for the expected total cost (ETC) criterion if u is a
fixed point of T, that is, u = Tu. In Theorem 9.5.3 below we show that,
under suitable assumptions, the value function V_1* satisfies the optimality
equation, so that

V_1* = TV_1*,    (9.5.2)

or, more explicitly,

V_1*(x) = inf_{A(x)} [ c(x, a) + ∫_X V_1*(y) Q(dy|x, a) ],    x ∈ X.    (9.5.3)

Observe that (9.5.3), which is also referred to as the dynamic programming


equation for the ETC criterion, does not have a unique solution (if it has
one). In fact, if a function u(·) satisfies (9.5.3), then so does u(·) + k for
any constant k. Thus, it is important to "characterize" Vt within a certain
class of solutions of (9.5.3); one such characterization is given in Theorem
9.5.9.
To obtain (9.5.3) we impose several hypotheses (Assumption 9.5.2),
which use the following definition.
9.5.1 Definition. U denotes the family of extended real-valued functions
that are integrable with respect to Q(·lx, a) for each x E X and a E A(x).
Note that U is nonempty since it contains (at least) all the measurable
bounded functions on X. In particular, the O-stage cost JO'(.) := 0 [see
(9.3.5)] is in U. Part (b) in the following assumption requires, among other
things, that the optimal n-stage cost J~ belongs to U for all n = 1,2, ....
This is known to be true under a variety of conditions-see, for instance,
§3.3, or the assumptions in §8.3 and §8.5.
9.5.2 Assumption.
(a) The hypotheses of Theorem 9.3.5(b)-that is, Assumptions 9.3.2 and
9.3.4-are satisfied.
(b) For each n = 1,2, ... , J~ is in U and satisfies
J~(x) = T J~_l(X) Vx E X. (9.5.4)

(c) Either (i) the cost-per-stage function c(x, a) is nonnegative, in which


case the functions J~ form a nondecreasing sequence, or (ii) there is
a function W in U such that
J~ ~ W Vn = 1,2, .... (9.5.5)

Observe that (9.5.4) is simply the value iteration equation that we have
already seen in several chapters; see, for instance, (8.3.12) or (8.3.26) for
the discounted case.
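For a finite model, the value iteration (9.5.4) is a few lines of code. The sketch below (hypothetical data; a substochastic kernel and nonnegative costs, so condition (i) of Assumption 9.5.2(c) applies) iterates J_n = T J_{n−1} from J_0 = 0 and checks that the limit is a fixed point of T:

```python
import numpy as np

# Value iteration (9.5.4), J_n = T J_{n-1} with J_0 = 0, for a toy
# transient model with nonnegative costs (hypothetical numbers). Since
# c >= 0, the J_n are nondecreasing and converge to a fixed point of T.
Q = np.array([[[0.5, 0.3], [0.2, 0.4]],    # substochastic Q(y|x,a)
              [[0.1, 0.6], [0.3, 0.3]]])
c = np.array([[1.0, 2.0],                  # c(x,a) >= 0
              [0.5, 1.5]])

def T(u):
    # (9.5.1): Tu(x) = min_a [ c(x,a) + sum_y u(y) Q(y|x,a) ]
    return np.min(c + np.einsum('axy,y->xa', Q, u), axis=1)

J = np.zeros(2)
for _ in range(500):
    J_next = T(J)
    assert np.all(J_next >= J - 1e-12)     # monotone, as in 9.5.2(c)(i)
    J = J_next
print(J, np.max(np.abs(T(J) - J)))         # J solves J = TJ up to round-off
```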
Under Assumption 9.5.2, which is supposed to hold throughout this section, Theorem 9.3.5(b) and Proposition 9.3.6 yield (9.3.6), i.e.,

lim_{n→∞} J_n*(x) = V_1*(x)    ∀x ∈ X,    (9.5.6)

and that V_1* is a measurable function on X. Moreover, V_1* is quasi-integrable
with respect to the measure Q(·|x, a) for each x ∈ X and a ∈ A(x). Indeed,
this is obvious if the condition (i) in Assumption 9.5.2(c) holds; in this
case V_1* is nonnegative and so the integral of its negative part is zero [see
(9.2.9), or (9.3.7), (9.3.8)]. On the other hand, under the condition (ii), we
have [by (9.5.5) and Proposition 9.3.3(c)]

−∞ < V_1*(x) ≤ W(x)    ∀x ∈ X;    (9.5.7)

hence, as W belongs to the set U (see Definition 9.5.1), (9.5.7) and Proposition 9.2.2(c) yield

−∞ ≤ ∫_X V_1*(y) Q(dy|x, a) < ∞    ∀x ∈ X, a ∈ A(x).

We also get the optimality equation (9.5.2), (9.5.3):

9.5.3 Theorem. If Assumption 9.5.2 holds, then V_1* satisfies (9.5.2),
(9.5.3).
Proof. We will show that

(a) V_1* ≥ TV_1*,  and  (b) V_1* ≤ TV_1*.    (9.5.8)

To prove (a), choose an arbitrary policy π ∈ Π and an arbitrary initial
state x ∈ X. Then, by Lemma 9.4.8,

V_1(π, x) = ∫_A [ c(x, a) + ∫_X V_1(π^(1), y) Q(dy|x, a) ] π_0(da|x)
  ≥ ∫_A [ c(x, a) + ∫_X V_1*(y) Q(dy|x, a) ] π_0(da|x)    [by (9.3.2)]
  ≥ TV_1*(x)    [by the definition (9.5.1) of T];

that is,

V_1(π, x) ≥ TV_1*(x).

This inequality and (9.3.2) give (9.5.8)(a), since π and x were arbitrary.
To obtain (9.5.8)(b), use (9.5.4) and (9.5.1) to write

J_n*(x) ≤ c(x, a) + ∫_X J_{n−1}*(y) Q(dy|x, a)    ∀x ∈ X, a ∈ A(x).    (9.5.9)

Now consider the condition (i) in Assumption 9.5.2(c). In this case the
sequence J_n* is nondecreasing and, therefore, letting n → ∞ in (9.5.9), we
obtain [by (9.5.6) and the Monotone Convergence Theorem]

V_1*(x) ≤ c(x, a) + ∫_X V_1*(y) Q(dy|x, a)    ∀x ∈ X, a ∈ A(x),    (9.5.10)

which implies (9.5.8)(b). On the other hand, under the condition (ii) in
Assumption 9.5.2(c), we may take lim sup_n in (9.5.9) and obtain (9.5.10)
again, by (9.5.6) and Fatou's Lemma. This completes the proof. □
B. Optimality criteria

Having the optimality equation (9.5.3) [or (9.5.2)], we can proceed to obtain several optimality criteria, which informally can be obtained by taking a
"discount factor" α = 1 in Theorem 4.5.1 (on discounted cost problems). In
particular, the concepts in the following definition correspond to the case
"α = 1" of those introduced in §4.5. [Alternatively, to obtain the discrepancy function D_1 in (9.5.11), we may take α = 1 in (8.4.2).]
9.5.4 Definition.

(a) The discrepancy function for the ETC criterion is the nonnegative
function D_1 on K [the set defined in (8.2.1)] given by

D_1(x, a) := c(x, a) + ∫_X V_1*(y) Q(dy|x, a) − V_1*(x).    (9.5.11)

(b) {M_n*} denotes the sequence defined by M_0* := V_1*(x_0), and

M_n* := Σ_{t=0}^{n−1} c(x_t, a_t) + V_1*(x_n)  for n = 1, 2, ....    (9.5.12)

(c) Given a policy π = {π_t} and an integer n ≥ 0, the corresponding
n-shift policy π^(n) = {π_t^(n), t = 0, 1, ...} is given [with h_t as in
(8.2.2)] by

π_0^(n)(·|x_n) := π_n(·|h_n),

and for t = 1, 2, ...,

π_t^(n)(·|x_n, a_n, ..., x_{t+n}) := π_{t+n}(·|h_n, a_n, ..., x_{t+n}).

In particular, the 0-shift policy is the same as π, i.e., π^(0) = π.

The role of the discrepancy function D_1 is, of course, the same as in
other Markov control problems. For instance, it gives an alternative way of
writing the optimality equation (9.5.2)-(9.5.3) as

inf_{A(x)} D_1(x, a) = 0    ∀x ∈ X,    (9.5.13)

and, more importantly, it gives an "explicit" expression, (9.5.19) below,
for the difference between the expected total cost V_1(π, ·) and the value
function V_1*(·) for every policy π.
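In a finite model the discrepancy function and the decomposition (9.5.19) can be verified directly. The Python sketch below (hypothetical data) computes D_1 from an approximation of V_1*, checks (9.5.13), and confirms that the excess cost of a fixed suboptimal stationary policy equals the expected sum of its discrepancies:

```python
import numpy as np

# The discrepancy function (9.5.11) for a toy transient model (hypothetical
# numbers): D1(x,a) = c(x,a) + sum_y V(y) Q(y|x,a) - V(x), where V = V_1^*.
# By (9.5.13), min_a D1(x,a) = 0, and (9.5.19) says the excess cost of a
# policy equals the expected sum of its discrepancies.
Q = np.array([[[0.5, 0.3], [0.2, 0.4]],
              [[0.1, 0.6], [0.3, 0.3]]])
c = np.array([[1.0, 2.0], [0.5, 1.5]])

def T(u):
    return np.min(c + np.einsum('axy,y->xa', Q, u), axis=1)

V = np.zeros(2)
for _ in range(500):                       # value iteration to get V_1^*
    V = T(V)

D1 = c + np.einsum('axy,y->xa', Q, V) - V[:, None]
print(np.min(D1, axis=1))                  # ~[0, 0], i.e. (9.5.13)

# Excess cost of the stationary policy "always action 1", via (9.5.19):
Q1, c1 = Q[1], c[:, 1]
V_pol = np.linalg.solve(np.eye(2) - Q1, c1)        # ETC of this policy
nu = np.array([1.0, 0.0])
excess = nu @ np.linalg.solve(np.eye(2) - Q1, D1[:, 1])  # sum_t E D1(x_t,1)
print(np.isclose(V_pol[0] - V[0], excess))         # True
```

The last check is exact whatever approximation V is used, since (I − Q1)^{-1} D1(·, 1) = V_pol − V is a linear identity.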
9.5.5 Theorem. (Optimality criteria.) Suppose that Assumption 9.5.2
is satisfied, and let π be a policy such that V_1(π, x) < ∞ for each x ∈ X.
Then the following statements (a) and (b) are equivalent, and also (c) and
(d) are equivalent:
(a) π is ETC-optimal, i.e., V_1(π, x) = V_1*(x) ∀x.

(b) V_1(π, x) = E_x^π V_1*(x_n) ∀n, x [with V_1(π, x) as in (9.3.12)].

(c) E_x^π D_1(x_n, a_n) = 0 ∀n, x.

(d) {M_n*, σ(h_n)} is a P_x^π-martingale for all x [where σ(h_n) is the σ-algebra
generated by the n-history h_n in (8.2.2), n = 0, 1, ...]; that is, for
every n, M_n* is P_x^π-integrable, σ(h_n)-measurable, and

E_x^π(M_{n+1}* | h_n) = M_n*    (P_x^π-a.s.).    (9.5.14)

If, in addition, V_1* satisfies

lim inf_{n→∞} E_x^π V_1*(x_n) ≥ 0  for each π ∈ Π and x ∈ X,    (9.5.15)

then the four conditions (a) to (d) are equivalent.

Observe that (9.5.15) holds if, for instance, c(x, a) is nonnegative.
The proof of Theorem 9.5.5 will follow from the next two lemmas, the
first of which (Lemma 9.5.6) basically states that the expected total cost
V_1(π, x) in (9.3.1) and (9.3.13) can be written in other alternative forms.
9.5.6 Lemma. For each π ∈ Π, x ∈ X for which V_1(π, x) < ∞, and
n = 0, 1, ...:

(a) V_1(π, x) = J_n(π, x) + E_x^π[V_1(π^(n), x_n)], where π^(n) is the n-shift policy
corresponding to π.

Moreover,

lim sup_{n→∞} E_x^π V_1*(x_n) ≤ 0;    (9.5.16)

hence, if (9.5.15) holds, then

lim_{n→∞} E_x^π V_1*(x_n) = 0,    (9.5.17)

and, furthermore,

V_1(π, x) = E_x^π V_1*(x_n) + Σ_{t=n}^∞ E_x^π D_1(x_t, a_t)    ∀π, x, n.    (9.5.18)

In particular, with n = 0,

V_1(π, x) = V_1*(x) + Σ_{t=0}^∞ E_x^π D_1(x_t, a_t)    ∀π, x.    (9.5.19)

Proof. (a) From (9.3.12), (8.2.8)–(8.2.9), and Definition 9.5.4(c) we see
that

E_x^π [ Σ_{t=n}^∞ c(x_t, a_t) ] = E_x^π[V_1(π^(n), x_n)],    (9.5.20)

and part (a) follows from (9.5.20) and (9.3.13).
(b) This is obtained by adding and subtracting E_x^π V_1*(x_n) in (9.3.13),
and using the definition (9.5.12) of M_n*.
To prove (9.5.16), note that (9.3.2) and (9.5.20) yield

E_x^π V_1*(x_n) ≤ E_x^π V_1(π^(n), x_n),    (9.5.21)

and so (9.5.16) follows from (9.3.15). Also note that (9.5.17) is a consequence of (9.5.16) and (9.5.15).
To prove (9.5.18), observe that (9.5.11) and (8.2.9) give, for every t =
0, 1, ...,

E_x^π[V_1*(x_{t+1}) | h_t, a_t] = D_1(x_t, a_t) − c(x_t, a_t) + V_1*(x_t).

Taking the expectation E_x^π on both sides of the latter equation we get

E_x^π V_1*(x_{t+1}) = E_x^π D_1(x_t, a_t) − E_x^π c(x_t, a_t) + E_x^π V_1*(x_t),

and then

Σ_{t=n}^{N−1} E_x^π D_1(x_t, a_t) = Σ_{t=n}^{N−1} E_x^π c(x_t, a_t) + E_x^π V_1*(x_N) − E_x^π V_1*(x_n).

Finally, letting N → ∞ and using (9.5.17) we obtain (9.5.18). □
9.5.7 Remark. Suppose that the cost-per-stage c(x, a) is nonnegative,
as in part (i) of Assumption 9.5.2(c). Then (9.5.15) is obviously satisfied,
and so the conditions (a) to (d) in Theorem 9.5.5 are all equivalent and,
moreover, (9.5.17) and (9.5.18) hold. □
9.5.8 Lemma. For each policy π and initial state x for which V_1(π, x) <
∞, the sequence {M_n*, σ(h_n)} is a P_x^π-submartingale; that is, for every n,
M_n* is P_x^π-integrable, σ(h_n)-measurable, and

E_x^π(M_{n+1}* | h_n) ≥ M_n*    (P_x^π-a.s.);    (9.5.22)

hence, for every n = 0, 1, ...,

E_x^π M_n* ≥ V_1*(x).    (9.5.23)

Proof. By Definition 9.5.4(b) and Assumption 9.5.2, it is evident that M_n*
is P_x^π-integrable and σ(h_n)-measurable for every n, and, furthermore,

M_{n+1}* = M_n* + [c(x_n, a_n) + V_1*(x_{n+1}) − V_1*(x_n)].

Thus, in view of (9.5.11),

E_x^π(M_{n+1}* | h_n) = M_n* + E_x^π[D_1(x_n, a_n) | h_n],    (9.5.24)

which gives (9.5.22) since D_1 is a nonnegative function. □


Using Lemmas 9.5.6 and 9.5.8 we can now easily prove Theorem 9.5.5:
Proof of Theorem 9.5.5.(a) {:} (b). Suppose that (a) holds. Then, by
(9.5.21), to prove (b) it suffices to show that

(9.5.25)

To obtain this inequality observe that, as 7r is optimal [i.e., VI (7r,') = VtO],


Lemma 9.5.6(b) and (9.5.23) yield

vt(x) ~ vt(x) + [vt(7r, x) - E;VI*(Xn)],

and (9.5.25) follows. This shows that (a) implies (b). The converse is obvi-
ously true: take n = 0 in (b).
(c) {:} (d). As

E;DI(xn,a n) = E;{E;[DI(xn,an)lhn]},

the equivalence of (c) and (d) follows from (9.5.24)-recall that DI is non-
negative.
Finally, if (9.5.15) holds, then (9.5.18) shows that (b) and (c) [hence (a)
to (d)] are equivalent. This completes the proof of Theorem 9.5.5. D
The inequality (9.5.16) can be used to obtain a characterization of V_1*
as the pointwise "maximal" solution of the optimality equation within a
certain subclass of functions in U (Definition 9.5.1). This is the essential
content of the following result. (In Theorem 9.5.13 we give conditions for
V_1* to be the "unique" solution of the optimality equation.)
9.5.9 Theorem. (A "characterization" of V_1*.) Suppose that Assumption 9.5.2 is satisfied. Let u ∈ U be a function that satisfies the optimality
equation (9.5.3) and the inequality (9.5.16), i.e.,

u(x) = inf_{A(x)} [ c(x, a) + ∫_X u(y) Q(dy|x, a) ]    ∀x ∈ X    (9.5.26)

and

lim sup_{n→∞} E_x^π u(x_n) ≤ 0    ∀x ∈ X, π ∈ Π,    (9.5.27)

respectively. Then
u(·) ::; vt(·). (9.5.28)
Hence, if Vt belongs to the class U, then Vt is the maximal function in U
that satisfies {9.5.26} and {9.5.27}.
Proof. We will show that if u E U satisfies (9.5.26) and (9.5.27), then

u(x) ::; VI(7r, X) \hr E II, x E X, (9.5.29)

which, by (9.3.2), gives (9.5.28).


To prove (9.5.29), choose an arbitrary policy 7r and an arbitrary initial
state x. Then, for every t = 0, 1, ... , (8.2.9) and (9.5.26) yield

J u(y)Q(dylxt, at)
> u(Xt) - c(xt, at),
so that taking expectation E; (.) and rearranging terms we obtain

Thus, summing over t = 0, ... , n - 1,

which we can rewrite as

As a result, taking lim sUPn' we obtain (9.5.29) from (9.5.27) and (9.3.14).
o
C. Deterministic stationary policies

We conclude this section with some remarks on the expected total cost
(ETC) when using a deterministic stationary policy f^∞ ∈ Π_DS. [Recall
Definitions 8.2.1 and 8.2.2(e).]
First note that, replacing the policy φ^∞ in (9.4.24) by f^∞, we get that
the ETC V_1(f^∞, ·) satisfies [using the notation (8.2.6)]

V_1(f^∞, x) = c(x, f) + ∫_X V_1(f^∞, y) Q(dy|x, f),    x ∈ X.    (9.5.30)

In other words, the function h(·) := V_1(f^∞, ·) satisfies

h(x) = c(x, f) + ∫_X h(y) Q(dy|x, f),    x ∈ X.    (9.5.31)

Comparing this equation with (7.5.1), we see that (9.5.31) is the Poisson
equation for the stochastic kernel P and the charge c given by

P(·|x) := Q(·|x, f)  and  c(·) := c(·, f),    (9.5.32)

respectively, with P-invariant (or P-harmonic) function

g(·) := 0.    (9.5.33)

As a result, Theorem 7.5.5 gives:


9.5.10 Proposition. The following conditions are equivalent for any deterministic stationary policy f^∞ for which V_1(f^∞, ·) is finite-valued:
(a) h(·) := V_1(f^∞, ·) satisfies the Poisson equation (9.5.31).
(b) h(x) = J_n(f^∞, x) + E_x^{f^∞} h(x_n)    ∀x ∈ X, n = 0, 1, ....    (9.5.34)
(c) The sequence {M_n(f)} with M_0(f) := h(x_0) and

M_n(f) := Σ_{t=0}^{n−1} c(x_t, f) + h(x_n)  for n = 1, 2, ...

is a P_x^{f^∞}-martingale for every initial state x.
On the other hand, (9.5.34) and (9.3.14) imply that h(·) = V_1(f^∞, ·)
satisfies

lim_{n→∞} E_x^{f^∞} h(x_n) = 0    ∀x ∈ X.    (9.5.35)

This statement has the following obvious converse:
9.5.11 Proposition. If h satisfies (9.5.31) and (9.5.35), then

h(x) = V_1(f^∞, x)    ∀x ∈ X.    (9.5.36)

Proof. Iteration of (9.5.31) gives (9.5.34), and letting n → ∞ we obtain
(9.5.36). □
The connection between Propositions 9.5.10 and 9.5.11 and the optimality
equation (9.5.3) is provided by the following result.
9.5.12 Theorem. Suppose that Assumption 9.5.2 is satisfied. Then a deterministic stationary policy f^∞ such that V_1(f^∞, ·) < ∞ is ETC-optimal
if and only if f(x) ∈ A(x) attains the minimum in (9.5.3) for all x ∈ X,
i.e.,

V_1*(x) = c(x, f) + ∫_X V_1*(y) Q(dy|x, f)    ∀x ∈ X,    (9.5.37)

and [cf. (9.5.16)]

lim sup_{n→∞} E_x^{f^∞} V_1*(x_n) = 0    ∀x ∈ X.    (9.5.38)

Proof. Suppose that f^∞ is ETC-optimal, that is, V_1(f^∞, ·) = V_1*(·). Then
(9.5.37) follows from Theorem 9.5.3 and the Poisson equation (9.5.30),
whereas (9.5.38) is a consequence of (9.5.35).
Conversely, suppose that (9.5.37) and (9.5.38) are satisfied. Then, from
(9.5.37) and (9.5.34), we have

V_1*(x) = J_n(f^∞, x) + E_x^{f^∞} V_1*(x_n)    ∀x, n,

and taking lim sup_n, (9.5.38) and (9.3.14) yield

V_1*(x) = V_1(f^∞, x)    ∀x.

That is, f^∞ is ETC-optimal. □


The conditions (9.5.37) and (9.5.38) are known in the MCP literature as
the conserving and the equalizing properties of f oo , respectively. There are
many elementary examples illustrating that indeed if one of these properties
fails the optimality of foo cannot be guaranteed-see, for instance, Bert-
sekas [1], Puterman [1], Cavazos-Cadena and Montes-de-Oca [1], Schweitzer
[1].
Finally, we will combine Theorems 9.5.9 and 9.5.12 to characterize Vt
as the unique solution of the optimality equation within a certain class of
functions.
9.5.13 Theorem. (Uniqueness of V_1*.) Suppose that Assumption 9.5.2
is satisfied, and let u be a function in U such that
(a) u satisfies the optimality equation (9.5.3);

(b) there is a decision function f ∈ F such that

u(x) = c(x, f) + ∫_X u(y) Q(dy|x, f)    ∀x ∈ X;

(c) lim sup_{n→∞} E_x^π u(x_n) = 0    ∀π, x.

Then

u = V_1*.    (9.5.39)

In other words, V_1* is the unique solution of the optimality equation
(9.5.3) in the class of functions u ∈ U that satisfy (a), (b), (c).
Proof. By Theorem 9.5.9, the present hypotheses (a) and (c) yield that
u ≤ V_1*. On the other hand, (b) and (c) yield, as in Proposition 9.5.11,
that u(·) = V_1(f^∞, ·); hence u ≥ V_1*. This completes the proof of (9.5.39).
□

Notes on §9.5

In this section we followed Hernández-Lerma, Carrasco and Pérez-Hernández [1], although the main results are already known in some form or
other; see Hinderer [1], Quelle [1], Rieder [2], Schäl [1].

9.6 The transient case


In this section we study the class of so-called transient MCMs (Markov
control models) introduced by Veinott [1] in the case of finite state spaces
X and finite action sets A. This class contains the discounted models (see
Proposition 9.6.3) and, under a suitable condition, it is contained in the
class of convergent models (Proposition 9.6.4). Many authors have extended
Veinott's definition and results to a countable state space X, but most of these
extensions very much depend on the "countability", or "discreteness", of
X. To our knowledge, the only author who has studied the transient case
in Borel state spaces is Pliska [1], and here we will follow, and slightly
generalize, his approach.
The section is divided into four parts. Part A presents the transient
MCM and some related models. In part B we go back to the conditions (1)
and (2) in Proposition 9.3.6, whereas in part C we show that to verify whether
a MCM is transient it suffices to consider deterministic stationary policies.
Finally, in part D we show that the policy iteration algorithm converges.
[The convergence of the value iteration algorithm will follow from Corollary
9.6.5 and Theorem 9.3.5(b).]

A. Transient models

In view of Theorem 9.4.5, we may, and will, restrict ourselves to dealing
with randomized Markov policies π = {φ_t, t = 0, 1, ...} in Π_RM. [See Definitions 8.2.1 and 8.2.2(a).] We will also use the following notational conventions:

• Given a stochastic kernel φ in Φ, the kernel Q(·|x, φ) in (8.2.5) will
be written as Q_φ(·|x), so that

Q_φ(·|x) := ∫_A Q(·|x, a) φ(da|x),    x ∈ X.    (9.6.1)

• Let π = {φ_0, φ_1, ...} be a randomized Markov policy. Sometimes, if
there is no risk of confusion, we shall write Q_{φ_t} in (9.6.1) as Q_t, i.e.,

Q_t(·|x) := ∫_A Q(·|x, a) φ_t(da|x),    x ∈ X.    (9.6.2)



Moreover, using (9.6.2) and the "product" (or "composition") formula
(7.2.14), we define the t-step transition kernels

Q_π^t := Q_0 Q_1 ⋯ Q_{t−1}  for t = 1, 2, ...,    (9.6.3)

with Q_π^0(·|x) := δ_x(·).

In particular, if π = {φ_t} is a randomized stationary policy φ^∞ (that is,
φ_t = φ for all t = 0, 1, ...), then Q_π^t reduces to the t-step transition kernel
in (8.2.11), i.e.,

Q_π^t = Q_φ^t,    (9.6.4)

where we have used (9.6.1), and similarly if π = f^∞ is a deterministic
stationary policy.
Observe, on the other hand, that the kernel (9.6.3) coincides with the
marginal measure μ̄_{x,t}^π in (9.4.13), i.e.,

Q_π^t(·|x) = μ̄_{x,t}^π(·)    ∀x ∈ X, t = 0, 1, ....    (9.6.5)

Indeed, from Lemma 9.4.3(b) and equations (9.4.18) and (9.6.1) we have

μ̄_{x,t}^π(·) = ∫_X Q_{t−1}(·|y) μ̄_{x,t−1}^π(dy)    ∀t = 1, 2, ...,    (9.6.6)

with μ̄_{x,0}^π = δ_x [by Lemma 9.4.3(a)]. Then iteration of (9.6.6) gives (9.6.5).


Throughout the following, w : X → [1, ∞) denotes a given weight function such that

‖Q_φ‖_w < ∞    ∀φ ∈ Φ,

where ‖·‖_w denotes the (operator) w-norm in (7.2.8); see also (7.2.7) or
(7.2.10). We will use this norm to define the transient case, but the reader
should keep in mind that, in principle, in (9.6.7) below we can replace the
w-norm by an arbitrary operator norm, as in Pliska [1].
9.6.1 Definition. The Markov control model is said to be transient if
there is a constant k such that

‖ Σ_{t=0}^∞ Q_π^t ‖_w ≤ k    ∀π ∈ Π_RM.    (9.6.7)
The w-norm in (9.6.7) can be written in several equivalent forms, such
as [by (7.2.8)]:

‖ Σ_{t=0}^∞ Q_π^t ‖_w = sup_x w(x)^{−1} Σ_{t=0}^∞ ∫_X w(y) Q_π^t(dy|x)    (9.6.8)
  = sup_x w(x)^{−1} Σ_{t=0}^∞ E_x^π w(x_t)    [by (9.6.3)]
  = sup_x w(x)^{−1} Σ_{t=0}^∞ ∫_X w(y) μ̄_{x,t}^π(dy)    [by (9.6.5)]
  = sup_x w(x)^{−1} ∫_X w(y) μ̄_x^π(dy)    [by (9.4.12)].

9.6.2 Remark. It is clear that for (9.6.7) to be true the transition kernel
Q(·|x, a) has to be very "special" and, in general, it is convenient to think of
it as being a substochastic kernel, that is, Q(X|x, a) ≤ 1 for all x ∈ X and
a ∈ A(x). This is precisely the case for the discounted model in Proposition
9.6.3, below. A related situation occurs for absorbing MCMs, in which
there exists a Borel set X_0 ⊂ X such that

Q(X_0|x, a) = 1  and  c(x, a) = 0    ∀x ∈ X_0, a ∈ A(x),    (9.6.9)

and, in addition,

sup_{π,x} E_x^π(τ_0) < ∞,  where τ_0 := inf{t ≥ 1 | x_t ∈ X_0}.    (9.6.10)

The idea is that the state process {x_t} "lives" in the complement of X_0, but
once it reaches X_0 [which occurs in a finite expected time, by (9.6.10)] then,
by (9.6.9), it remains there forever, at zero cost: it is "absorbed" by X_0.
If the state and action spaces are both finite, then the transient and the
absorbing models, as well as the class of so-called "contracting" models,
are all equivalent; see Kallenberg [1]. □
We next show that a discounted model can be transformed into a transient model.
Consider the usual MCM M = (X, A, {A(x) | x ∈ X}, Q, c) and let 0 < α < 1 be a "discount factor". We suppose that the weight function w satisfies Assumption 8.3.2(b); that is, there exists a constant β such that 1 ≤ β < 1/α and [as in (8.3.5)]

sup_{A(x)} ∫_X w(y) Q(dy|x, a) ≤ βw(x)   ∀x ∈ X.   (9.6.11)

9.6.3 Proposition. Suppose that (9.6.11) holds and let M̃ be a MCM that is the same as M except that the transition law Q is replaced by

Q̃(·|x, a) := αQ(·|x, a).   (9.6.12)

Then M̃ is transient; in fact, for all π ∈ Π_RM,

‖ ∑_{t=0}^∞ Q̃_t^π ‖_w ≤ k,   with k := 1/(1 − αβ).   (9.6.13)
106 9. The Expected Total Cost Criterion

Proof. Let π = {φ_t} be an arbitrary randomized Markov policy. Then, by (9.6.12) and (9.6.3),

Q̃_t^π = α^t Q_t^π   ∀t = 0, 1, ....

Thus, by (9.6.11),

∫ w(y) Q̃_t^π(dy|x) = α^t ∫ w(y) Q_t^π(dy|x) ≤ (αβ)^t w(x)   ∀x, t,

so that

∑_{t=0}^∞ ∫ w(y) Q̃_t^π(dy|x) ≤ kw(x)   ∀x,   with k := 1/(1 − αβ).

This inequality and (9.6.8) give (9.6.13). □
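Proposition 9.6.3 is easy to check numerically in the finite case. In the sketch below (all data hypothetical), a stochastic kernel Q and weight vector w satisfy (9.6.11) with β = 1.2 < 1/α; scaling by α as in (9.6.12) makes the kernel substochastic, and the w-norm of the resulting series stays below the constant k = 1/(1 − αβ) of (9.6.13).

```python
import numpy as np

alpha, beta = 0.8, 1.2             # hypothetical discount factor and growth bound
Q = np.array([[0.5, 0.5, 0.0],     # hypothetical stochastic kernel (rows sum to 1)
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])
w = np.array([1.0, 1.2, 1.5])      # weight function chosen so that (9.6.11) holds

Q_tilde = alpha * Q                          # transformed transition law (9.6.12)
S = np.linalg.inv(np.eye(3) - Q_tilde)       # sum_t Q_tilde^t
k = np.max((S @ w) / w)                      # w-norm (9.6.8) of the series
k_bound = 1.0 / (1.0 - alpha * beta)         # the constant in (9.6.13)
print(f"w-norm {k:.3f} <= 1/(1 - alpha*beta) = {k_bound:.3f}")
```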


In short, Proposition 9.6.3 states that a discounted model can be trans-
formed into a transient model if part (b) in Assumption 8.3.2 is satisfied. If
instead of (b) we use part (a) of Assumption 8.3.2, then a transient model
is convergent, that is, (9.3.20) holds. In other words:
9.6.4 Proposition. Suppose that the MCM M is transient and that there is a constant c ≥ 0 for which

sup_{A(x)} |c(x, a)| ≤ cw(x)   ∀x ∈ X.   (9.6.14)

Then M is convergent. In fact, if k satisfies (9.6.7), then (by Theorem 9.4.5)

sup_Π E_x^π ( ∑_{t=0}^∞ |c(x_t, a_t)| ) ≤ ckw(x)   ∀x ∈ X;   (9.6.15)

hence, in particular, the expected total cost V_1(π, ·) is a function in the space B_w(X) for every policy π.
Proof. Let π = {φ_t} be an arbitrary randomized Markov policy, and x an arbitrary initial state. Then, for each t = 0, 1, ..., (9.6.3) [or (9.6.5)] yields

E_x^π |c(x_t, a_t)| = ∫_X |c(y, φ_t)| Q_t^π(dy|x) ≤ c ∫_X w(y) Q_t^π(dy|x)   [by (9.6.14)],

so that, by (9.6.8) and (9.6.7),

∑_{t=0}^∞ E_x^π |c(x_t, a_t)| ≤ ckw(x).

This gives (9.6.15) since π and x were arbitrary. □


As an obvious consequence of (9.6.15) we have:
9.6.5 Corollary. Under the hypotheses of Proposition 9.6.4, part (a) in Assumption 9.5.2 is satisfied, and also part (c)(ii) with U := B_w(X) and W(·) := ckw(·).
We give below (subsection B) conditions ensuring Assumption 9.5.2(b).
First, however, we will comment on the relation between transient MCMs
and transient Markov chains, and we will also present a class of transient
MCMs (Example 9.6.7).
9.6.6 Remark. (Relation between transient MCMs and transient Markov chains.) Let us suppose that the MCM M is transient, and for every Borel set B ⊂ X define the occupation time (as in the first paragraph of §7.3.A)

η_B := ∑_{t=1}^∞ I_B(x_t).   (9.6.16)

Then, for any policy π ∈ Π_RM and initial state x ∈ X, we see from (9.6.5) that

E_x^π[I_B(x_t)] = P_x^π(x_t ∈ B) = Q_t^π(B|x),

so that the expected occupation time of B satisfies [by (9.6.16), (9.6.8) and (9.6.7)]

E_x^π(η_B) = ∑_{t=1}^∞ Q_t^π(B|x) ≤ kw(x).   (9.6.17)

Thus, extending Definition 7.3.2(b) to Markov control processes, it follows from (9.6.17) that if w(·) is bounded on the set B, then B is uniformly transient for every policy π; that is, if sup_B w(x) =: w_B < ∞, then

sup_Π E_x^π(η_B) ≤ kw_B   ∀x ∈ X,

where we have used Theorem 9.4.5 to write Π in lieu of Π_RM. Moreover, if the weight function w is bounded [which might be the case when dealing with bounded costs c(x, a)], then the whole state space X is a transient set, and so the state process {x_t} is transient for every policy π. □
9.6.7 Example. We show that if the weight function w satisfies an inequality of the form (7.3.9) or (7.3.10), namely,

∫ w(y) Q(dy|x, a) ≤ βw(x) + bl(x)   ∀x ∈ X, a ∈ A(x),   (9.6.18)

then the MCM M is transient. In (9.6.18) we suppose that β and b are nonnegative constants with β < 1, and 0 ≤ l(·) ≤ 1 is a measurable function. In addition, we assume that there exists a nonnegative constant γ < 1 such that for every randomized Markov policy π = {φ_t} [using the notation (9.6.2)]

Q_0 Q_1 ⋯ Q_{t−1} l ≤ γ^t   ∀t = 1, 2, ....   (9.6.19)

More explicitly, we can write (9.6.19) as

E_x^π[l(x_t)] ≤ γ^t   ∀x ∈ X, t = 1, 2, ....   (9.6.20)

To prove that M is transient we will show that (9.6.7) holds with

k := (1 − β)^{−1}[1 + b + bγ(1 − γ)^{−1}],   (9.6.21)

which follows from straightforward calculations. Indeed, note that integration of both sides of (9.6.18) with respect to φ_j(·|x) yields

∫ w(y) Q(dy|x, φ_j) ≤ βw(x) + bl(x)   ∀x ∈ X, j = 0, 1, ...,

which using (9.6.2) can be written in abbreviated form as

Q_j w ≤ βw + bl.

Then from (9.6.3) and (9.6.18) we obtain

Q_0^π w = w,
Q_1^π w = Q_0 w ≤ βw + bl,

and for t = 2, 3, ...

Q_t^π w ≤ β^t w + β^{t−1} bl + b ∑_{j=1}^{t−1} β^{t−1−j} Q_j^π l

≤ β^t w + β^{t−1} bl + b ∑_{j=1}^{t−1} β^{t−1−j} γ^j   [by (9.6.19)].

It follows that

∑_{t=0}^∞ Q_t^π w ≤ (1 − β)^{−1}[w + bl + bγ(1 − γ)^{−1}].

Finally, since l ≤ 1 ≤ w, we get

∑_{t=0}^∞ Q_t^π w ≤ kw,

with k as in (9.6.21). Therefore, by (9.6.8), M is transient. □
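A concrete finite-state instance of this example (all values hypothetical): take w ≡ 1, let l be the indicator of a single state, and pick a substochastic kernel obeying the drift condition (9.6.18); the decay condition (9.6.20) and the bound ∑_t Q_t^π w ≤ kw, with k as in (9.6.21), can then be verified directly.

```python
import numpy as np

beta, b, gamma = 0.5, 0.4, 0.6           # hypothetical constants; beta < 1, gamma < 1
Q = np.array([[0.10, 0.40, 0.40],        # substochastic kernel under a fixed policy
              [0.05, 0.20, 0.25],
              [0.05, 0.25, 0.20]])
w = np.ones(3)                           # weight function w == 1
l = np.array([1.0, 0.0, 0.0])            # 0 <= l <= 1: indicator of state 0

S = np.linalg.inv(np.eye(3) - Q)                          # sum_t Q^t
k = (1 - beta) ** -1 * (1 + b + b * gamma / (1 - gamma))  # bound (9.6.21)
print("actual w-norm:", round(np.max((S @ w) / w), 3), " bound k:", k)
```

The bound from (9.6.21) is of course not tight; it only certifies finiteness of the series uniformly in the policy.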



B. Optimality conditions

Let us suppose that the MCM M is as in Proposition 9.6.4. Then, by (9.6.15) and Corollary 9.6.5, one can see that all of the results in the previous sections will be true provided that Assumption 9.5.2(b) is satisfied. This is indeed the case because conditions such as, say, (c) in Theorem 9.5.3 trivially hold for all u in U := B_w(X); namely,

∫_X u(y) Q_n^π(dy|x) → 0 as n → ∞   ∀x ∈ X, π ∈ Π_RM,   (9.6.22)

by (9.6.8). In turn, as already shown in several chapters, Assumption 9.5.2(b) holds in a number of cases. In particular, as in this section we are dealing with MCMs with a weighted w-norm, it is natural to consider the analogue of Assumptions 8.3.1–8.3.3 or 8.5.1–8.5.3. To fix ideas, we will consider the former; that is, we suppose:
9.6.8 Assumption. In addition to the hypotheses of Proposition 9.6.4, we assume that for every state x ∈ X:
(a) A(x) is compact;
(b) c(x, a) is l.s.c. in a ∈ A(x);
(c) the function u'(x, a) := ∫ u(y) Q(dy|x, a) is continuous in a ∈ A(x) for every bounded function u in B_w(X); and
(d) the function w'(x, a) := ∫ w(y) Q(dy|x, a) is continuous in a ∈ A(x).
Using Assumption 9.6.8 we get, in particular, that Assumption 9.5.2(b) holds with U being the space B_w(X), as shown in the following lemma.
9.6.9 Lemma. If Assumption 9.6.8 is satisfied, then the optimal n-stage cost J_n^* belongs to the space B_w(X) and, moreover, J_n^* = T J_{n−1}^* for all n = 1, 2, ..., with J_0^*(·) := 0; that is,

J_n^*(x) = min_{A(x)} [ c(x, a) + ∫_X J_{n−1}^*(y) Q(dy|x, a) ]   ∀x ∈ X.   (9.6.23)

In addition, for every n = 1, 2, ..., there is a decision function f_n ∈ F such that f_n(x) ∈ A(x) attains the minimum in (9.6.23) for each x ∈ X, i.e.,

J_n^*(x) = c(x, f_n) + ∫_X J_{n−1}^*(y) Q(dy|x, f_n)   ∀x ∈ X.   (9.6.24)

Proof. The proof follows from a direct induction argument, using Lemma 8.3.8(a) [as in the proof of Proposition 8.3.9(b)]. □

From Lemma 9.6.9 and Corollary 9.6.5 we immediately deduce our main optimality result in this section. Namely:
9.6.10 Theorem. Suppose that Assumption 9.6.8 holds. Then:

(a) The value function V_1^* belongs to the space B_w(X) and there is a decision function f* ∈ F such that f*(x) ∈ A(x) attains the minimum on the right-hand side of the optimality equation (9.5.3), i.e.,

V_1^*(x) = c(x, f*) + ∫_X V_1^*(y) Q(dy|x, f*)   ∀x ∈ X,   (9.6.25)

and the corresponding deterministic stationary policy f_*^∞ is ETC-optimal.

(b) A deterministic stationary policy f_*^∞ is ETC-optimal if and only if the decision function f* ∈ F satisfies (9.6.25).

(c) V_1^* is the unique solution of the optimality equation (9.5.3) in the space B_w(X).

Proof. (a) From Lemma 9.6.9, Theorem 9.3.5(b) and Proposition 9.3.6(2), it follows that V_1^* is a measurable function, whereas the fact that it belongs to B_w(X) is a consequence of Proposition 9.6.4. The existence of a decision function f* that satisfies (9.6.25) results from Theorem 9.5.3 and Lemma 8.3.8(a).
Further, from (9.6.22) and Theorem 9.5.12 we obtain the optimality of f_*^∞ as well as part (b).
Finally, part (c) follows from (a) together with (9.6.22) and Theorem 9.5.13. □
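For a finite transient model, the value iteration of Lemma 9.6.9 can be carried out explicitly. The sketch below (hypothetical model: 2 states, 2 actions, substochastic kernels) iterates the dynamic programming operator T of (9.6.23); since the kernels are substochastic, T is a sup-norm contraction here, and the iterates J_n converge to the fixed point described in Theorem 9.6.10.

```python
import numpy as np

# Hypothetical transient model: Q[a, x, y] substochastic, c[x, a] the cost.
Q = np.array([[[0.3, 0.2], [0.1, 0.4]],    # action 0
              [[0.1, 0.1], [0.3, 0.2]]])   # action 1
c = np.array([[1.0, 2.0],
              [0.5, 1.5]])

def T(J):
    # Operator (9.6.23): (TJ)(x) = min_a [ c(x,a) + sum_y Q(y|x,a) J(y) ].
    return np.min(c.T + np.einsum('axy,y->ax', Q, J), axis=0)

J = np.zeros(2)                 # J_0 := 0
for _ in range(200):            # value iteration J_n = T J_{n-1}
    J = T(J)

f_star = np.argmin(c.T + np.einsum('axy,y->ax', Q, J), axis=0)
print("value function:", np.round(J, 4), " minimizing decision function:", f_star)
```

Consistent with Theorem 9.6.10(b), the fixed point coincides with the elementwise minimum of the policy values (I − Q_f)^{−1} c_f over the four deterministic stationary policies.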
One can also obtain other optimality results related to Theorem 9.6.10. For instance, for every n, let f_n ∈ F be as in (9.6.24). Then, from the result by Schäl mentioned in Note 4 of §8.4 and the convergence in (9.5.6), one can deduce the existence of a decision function (or selector) f* ∈ F such that, for each x ∈ X, f*(x) ∈ A(x) is an accumulation point of the sequence {f_n(x)} and, furthermore, the deterministic stationary policy f_*^∞ is ETC-optimal. We will also show, on the other hand, that the policy iteration algorithm converges (see subsection D). But first we will prove an interesting result according to which, in order to verify (9.6.7), it suffices to consider deterministic stationary policies.

C. Reduction to deterministic policies

9.6.11 Theorem. Suppose that the MCM M and the weight function w satisfy conditions (a), (b) and (d) of Assumption 9.6.8. In addition, suppose that there is a constant k ≥ 0 such that [using the notation (9.6.4) with f ∈ F in lieu of φ ∈ Φ]

‖ ∑_{t=0}^∞ Q_f^t ‖_w ≤ k   ∀f ∈ F.   (9.6.26)

Then (9.6.7) holds and so M is transient.

The proof of Theorem 9.6.11 is based on a clever idea of Veinott [1] (see also Pliska [1]), which consists in showing that there exists a deterministic stationary policy that maximizes the expected total "reward" function given by

W(π, x) := ∑_{t=0}^∞ Q_t^π w(x) = ∑_{t=0}^∞ E_x^π w(x_t),   π ∈ Π_RM, x ∈ X,   (9.6.27)

where the one-step "reward" is nothing less than the weight function w itself! The precise statement is as follows.
9.6.12 Lemma. Under the hypotheses of Theorem 9.6.11, there exists a function u* in B_w(X) and a decision function f* ∈ F such that, for all x ∈ X,

u*(x) = max_{A(x)} [ w(x) + ∫_X u*(y) Q(dy|x, a) ]   (9.6.28)
      = w(x) + ∫_X u*(y) Q(dy|x, f*);

hence

u*(x) = W(f_*^∞, x) = sup_Π W(π, x)   ∀x ∈ X.   (9.6.29)

Proof. Let R : B_w(X) → B_w(X), u ↦ Ru, be the operator defined as

Ru(x) := max_{A(x)} [ w(x) + ∫_X u(y) Q(dy|x, a) ].   (9.6.30)

If we define a function v(x, a) := w(x) + ∫ u(y) Q(dy|x, a) and apply Lemma 8.3.8(a) to −v, we can see that for each u ∈ B_w(X) there is a decision function f ∈ F such that f(x) ∈ A(x) attains the maximum in (9.6.30) for all x ∈ X, i.e.,

Ru(x) = w(x) + ∫ u(y) Q(dy|x, f)   ∀x ∈ X.   (9.6.31)

This will give the second equality in (9.6.28) if we show that u* is indeed a function in B_w(X). To prove this we will use the value iteration approach. Let {u_n} be the sequence in B_w(X) given by u_0 := 0 and u_n := Ru_{n−1} for n ≥ 1. As R is a monotone operator (u ≤ u' implies Ru ≤ Ru'), the sequence {u_n} is nondecreasing, i.e.,

u_{n+1} ≥ u_n ≥ 0   ∀n = 0, 1, ....   (9.6.32)

We will next show that

u_n ≤ kw   ∀n = 0, 1, ...,   (9.6.33)


where k is the constant in (9.6.26). Suppose that (9.6.33) it is not true;


that is, there exists an integer n 2 1 such that

Un-l ::; kw but Un 1:. kw.


Now, as in (9.6.31), let f E IF be a decision function for which

Un(x) = RUn-l(X) = w(x) + f un-l(y)Q(dylx,J) Vx E X,

and consider a new sequence {Vj} with Vo := Un-l and

Vj(X) = w(x) + f Vj-l (y)Q(dylx, J), j = 1, ....

Thus, using (9.6.4) with f in lieu of tp, we can write Vj = w + QfVj-l so


that
j-l

Vj =L Q}w + Qjvo.
t=o
Observe that, by (9.6.22), Qjvo -+ 0 since Vo := Un-l is in lffiw(X). There-
fore, as j -+ 00,

L Q}w ::; kw
00

Vj t [by (9.6.26)]'
t=o
which contradicts that Vl := Un 1:. kw. This proves (9.6.33).
Now, from (9.6.32) and (9.6.33), there exists a function u* in lffiw(X) such
that u* ::; kw and un(x) t u*(x) for all x E X. But, on the other hand, we
also have
Un(x) = RUn-l(X) t u*(x) for all x E X;
hence (by Remark 9.6.13, below), u* satisfies u*(x) = Ru*(x), which is the
same as the first equality in (9.6.28), and the second equality follows from
a previous remark-see (9.6.31).
Finally, the first equality in (9.6.29) follows from the Poisson equation

u*(x) = w(x) + f u*(y)Q(dYlx, f*)

in (9.6.28) combined with (9.6.22) and Proposition 9.5.11 (with the obvi-
ous changes in notation), whereas the second equality in (9.6.29) can be
obtained from the "optimality equation" (9.6.28) by standard arguments
[see for instance the proof of (9.5.29)]. 0
Having Lemma 9.6.12, the proof of Theorem 9.6.11 is straightforward:
Proof of Theorem 9.6.11. As u* ≤ kw, (9.6.29) yields

W(π, x) ≤ u*(x) ≤ kw(x)   ∀π ∈ Π_RM, x ∈ X.

This implies (9.6.7), by the definition (9.6.27) of W(π, x). □
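The monotone value iteration u_n = Ru_{n−1} of the proof can also be run numerically. The sketch below (hypothetical data: 2 states, 2 actions, substochastic kernels, w ≡ 1) iterates the max-operator R of (9.6.30); the limit u* equals the supremum in (9.6.29), which for a finite model is the elementwise maximum of ∑_t Q_f^t w over the decision functions f.

```python
import numpy as np

Q = np.array([[[0.3, 0.2], [0.1, 0.4]],    # hypothetical substochastic kernels Q[a, x, y]
              [[0.1, 0.1], [0.3, 0.2]]])
w = np.array([1.0, 1.0])                   # weight function as one-step "reward"

def R(u):
    # Operator (9.6.30): (Ru)(x) = max_a [ w(x) + sum_y Q(y|x,a) u(y) ].
    return np.max(w + np.einsum('axy,y->ax', Q, u), axis=0)

u = np.zeros(2)             # u_0 := 0; the iterates are nondecreasing, cf. (9.6.32)
for _ in range(200):
    u = R(u)
print("u* =", np.round(u, 4))
```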


9.6.13 Remark. Let Y and Z be two arbitrary sets, and let v be an extended-real-valued function on Y × Z. Then it is easily seen that

sup_y sup_z v(y, z) = sup_z sup_y v(y, z).

It follows that if {v_n} is a nondecreasing sequence of functions on K [the set in (8.2.1)] such that v_n ↑ v*, so that v* = sup_n v_n, then

lim_{n→∞} sup_{A(x)} v_n(x, a) = sup_{A(x)} lim_{n→∞} v_n(x, a) = sup_{A(x)} v*(x, a)   ∀x ∈ X.

Similarly (replacing "sup" and "nondecreasing" by "inf" and "nonincreasing", respectively), if v_n ↓ v* then

lim_{n→∞} inf_{A(x)} v_n(x, a) = inf_{A(x)} v*(x, a). □

D. The policy iteration algorithm

Suppose that Assumption 9.6.8 is satisfied and, for some integer i ≥ 0, let f_i^∞ be a deterministic stationary policy. Then, by Proposition 9.6.4, the corresponding expected total cost

v_i(x) := V_1(f_i^∞, x)

is a function in the space B_w(X), which, by (9.5.30), satisfies the Poisson equation

v_i(x) = c(x, f_i) + ∫_X v_i(y) Q(dy|x, f_i)   ∀x ∈ X.   (9.6.34)

Now let T be the dynamic programming operator in (9.5.1), and let f_{i+1} ∈ F be a decision function such that

c(x, f_{i+1}) + ∫_X v_i(y) Q(dy|x, f_{i+1}) = Tv_i(x) ≤ v_i(x)   ∀x ∈ X.   (9.6.35)

Iteration of this inequality gives

v_i(x) ≥ J_n(f_{i+1}^∞, x) + ∫ v_i(y) Q^n(dy|x, f_{i+1}),

and letting n → ∞ we obtain [by (9.6.22) and (9.3.14)]

v_i(x) ≥ v_{i+1}(x)   ∀x ∈ X,   (9.6.36)

where v_{i+1}(·) := V_1(f_{i+1}^∞, ·). This algorithm, which starts with an arbitrary policy f_0^∞ ∈ Π_DS and which at every step chooses the next policy f_{i+1} according to (9.6.35), is called the policy iteration algorithm (PIA), also known as Howard's policy improvement algorithm. We say that the PIA converges if the nonincreasing sequence {v_i} satisfies

lim_{i→∞} v_i(x) = V_1^*(x)   ∀x ∈ X.   (9.6.37)

We will next show that this is indeed the case.


9.6.14 Theorem. (Convergence of the PIA.) Under Assumption 9.6.8, the PIA converges.
Proof. First note that if for some i equality holds in (9.6.36) for all x ∈ X, then v_i = V_1^* and the deterministic stationary policy f_i^∞ is ETC-optimal. This is a consequence of Theorem 9.6.10(b), (c), because if equality holds in (9.6.36), then [by (9.6.35)] v_i satisfies the optimality equation v_i = Tv_i. This proves the theorem in the case of equality in (9.6.36).
In the general case, (9.6.36) and (9.6.15) imply the existence of a function v* in the space B_w(X) such that

v_i(x) ↓ v*(x)   ∀x ∈ X.   (9.6.38)

We wish to show that v* satisfies the optimality equation v* = Tv*, which, as in the previous paragraph, will give v* = V_1^*. Observe that (9.6.38) and v_i ≥ Tv_i [see (9.6.35)], combined with Remark 9.6.13 and Fatou's Lemma, yield

v* ≥ Tv*.   (9.6.39)

To obtain the reverse inequality, observe that for all x in X:

v*(x) ≤ v_i(x)   [by (9.6.38)]
      = c(x, f_i) + ∫ v_i(y) Q(dy|x, f_i)   [by (9.6.34)]
      ≤ c(x, f_i) + ∫ v_{i−1}(y) Q(dy|x, f_i)   [by (9.6.36)]
      = Tv_{i−1}(x)   [by (9.6.35)].

Thus, by (9.5.1),

v*(x) ≤ c(x, a) + ∫ v_{i−1}(y) Q(dy|x, a)   ∀a ∈ A(x).

It follows, letting i → ∞ and using (9.6.38) again, that

v*(x) ≤ c(x, a) + ∫ v*(y) Q(dy|x, a)   ∀x ∈ X, a ∈ A(x),

which implies v* ≤ Tv*. Therefore, in view of (9.6.39), v* = Tv*, which gives v* = V_1^*, and (9.6.37) follows. □
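For finite models the PIA amounts to a few lines of code. The sketch below (hypothetical data: 2 states, 2 actions, substochastic kernels) alternates policy evaluation via the Poisson-type equation (9.6.34) with policy improvement as in (9.6.35), stopping when the policy repeats; by Theorem 9.6.14, the final value solves the optimality equation.

```python
import numpy as np

Q = np.array([[[0.3, 0.2], [0.1, 0.4]],    # hypothetical substochastic kernels Q[a, x, y]
              [[0.1, 0.1], [0.3, 0.2]]])
c = np.array([[1.0, 2.0],
              [0.5, 1.5]])
X = np.arange(2)

f = np.zeros(2, dtype=int)                 # initial decision function f_0
for _ in range(50):
    # Policy evaluation (9.6.34): solve v = c_f + Q_f v.
    v = np.linalg.solve(np.eye(2) - Q[f, X], c[X, f])
    # Policy improvement (9.6.35): minimize c(x, a) + integral of v dQ over a.
    vals = c.T + np.einsum('axy,y->ax', Q, v)
    f_next = np.argmin(vals, axis=0)
    if np.array_equal(f_next, f):          # equality in (9.6.36): v solves v = Tv
        break
    f = f_next
print("PIA policy:", f, " value:", np.round(v, 4))
```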

The remark in the paragraph following the proof of Theorem 9.6.10 also holds for the sequence {f_i} in (9.6.34)–(9.6.35); namely:
9.6.15 Corollary. Suppose that Assumption 9.6.8 is satisfied and let {f_i} be the sequence of decision functions in (9.6.34)–(9.6.35). Then there exists a decision function f* ∈ F such that, for each x ∈ X, f*(x) ∈ A(x) is an accumulation point of {f_i(x)} and, moreover, the deterministic stationary policy f_*^∞ is ETC-optimal.
Notes on §9.6

1. This section comes from Hernández-Lerma, Carrasco and Pérez-Hernández [1], which also studies the "stability" of an ETC-optimal deterministic stationary policy.

2. As noted at the beginning of the section, transient MCMs were introduced by Veinott [1] in the case of finite state spaces, and they have been analyzed by several authors in the countable state case. Here we followed Pliska's [1] approach, partly because, to begin with, he is (to the best of our knowledge) the only author who has considered transient models in "general" (that is, Borel) state spaces, and partly because it fits very nicely in the weighted-norm framework of previous (and later) chapters. One should keep in mind, however, that instead of using the w-norm in (9.6.7), he uses an arbitrary operator norm. This of course allows some flexibility in the selection of the norm, but one should be careful how this selection is done. For instance, a natural choice in some problems might be to use the total variation norm ‖Q‖_TV of Q, which is obtained from (7.2.7) replacing the w-norm ‖u‖_w := sup_x |u(x)|/w(x) by the sup norm ‖u‖ := sup_x |u(x)|, i.e.,

‖Q‖_TV := sup{‖Qu‖ : ‖u‖ ≤ 1};   (9.6.40)

alternatively, we can use (7.2.10) replacing ‖μ‖_w by the total variation norm ‖μ‖_TV of the measure μ [see (7.2.3)], i.e.,

‖Q‖_TV := sup{‖Qμ‖_TV : ‖μ‖_TV ≤ 1}.   (9.6.41)

At any rate, if we use this norm rather than ‖·‖_w in (9.6.7), then all of the results in this section remain valid, but for MCMs with a bounded one-stage cost c(x, a), i.e., for some constant c,

|c(x, a)| ≤ c   ∀x ∈ X, a ∈ A(x),   (9.6.42)

which is the class of models considered by Pliska.
3. Theorem 9.6.11 might be very convenient for checking whether (9.6.7) holds. For example, Pliska [1, Theor. 3.2] shows that the following four conditions (a) to (d) are equivalent and that each of them implies (9.6.26):

(a) There exists a number k such that ‖ ∑_{t=0}^∞ Q_f^t ‖_w ≤ k for all f ∈ F.

(b) For each γ > 0, there exists an integer N such that ‖Q_f^t‖_w ≤ γ for all t ≥ N and all f ∈ F.

(c) For each γ > 0, there exists an integer t such that ‖Q_f^t‖_w ≤ γ for all f ∈ F.

(d) There exist positive numbers γ and δ, with γ < 1, such that ‖Q_f^t‖_w ≤ δγ^t for all f ∈ F and t = 0, 1, ....

Incidentally, since Pliska considers only models that satisfy (9.6.42), in the proof of Theorem 9.6.11 and Lemma 9.6.12 one may take w(·) ≡ 1 in (9.6.27) and (9.6.28). On the other hand, he shows that, when using a "general" operator norm ‖·‖, if the transition kernel Q is such that

‖Q_f‖ ≤ 1   ∀f ∈ F,   (9.6.43)

then the five conditions (a)–(d) and (9.6.26) are all equivalent. For the w-norm, (9.6.43) can also be written [by (7.2.8)] as

∫_X w(y) Q_f(dy|x) ≤ w(x)   ∀x ∈ X, f ∈ F.
4. For a general (not necessarily transient) MCM, Schäl [1, §17] shows that the compactness condition (a) in Assumption 9.6.8 can be relaxed; that is, under suitable hypotheses, the control constraint sets A(x) may be allowed to be noncompact. Alternatively, one may replace the compactness of A(x) by inf-compactness of the cost function c(x, a), as in Chapters 4, 5, 6. On the other hand, for a general MCM, Rieder [3] has given conditions for the convergence of the PIA [see (9.6.37)]; without the appropriate assumptions, it is well known that this convergence may not hold (for counterexamples see, for instance, Rieder [3], Bertsekas [1], or Puterman [1]).
10
Undiscounted Cost Criteria

10.1 Introduction

A. Undiscounted criteria

Infinite-horizon Markov control problems can be roughly classified as being "discounted" or "undiscounted". The former, which have been the main subject of Chapter 4 and Chapter 8, are basically well understood, in the sense that their theory can be safely considered to be complete. This is not the case for undiscounted problems; in fact, to start with, "undiscounted" can have several different meanings.
For example, if in the α-discounted cost criterion [see, for instance, (9.1.4)] we take α = 1, then the undiscounted problem concerns of course the expected total cost (ETC)

V_1(π, x) := E_x^π [ ∑_{t=0}^∞ c(x_t, a_t) ] = lim_{n→∞} J_n(π, x),   (10.1.1)

where

J_n(π, x) := E_x^π [ ∑_{t=0}^{n−1} c(x_t, a_t) ],   n = 1, 2, ...,   (10.1.2)

is the n-stage expected total cost [see (9.3.14)]. In this case a policy π* is "optimal", or ETC-optimal, if [as in (9.1.3)]

V_1(π*, x) = inf_Π V_1(π, x)   ∀x ∈ X.   (10.1.3)
O. Hernández-Lerma et al., Further Topics on Discrete-Time Markov Control Processes


© Springer Science+Business Media New York 1999

The ETC criterion, however, has at least two main drawbacks: (i) it might
not be well defined for all policies, and (ii) it does not look at how the finite
horizon cost I n (1f,x) varies with n.
A common way of coping with drawback (i) is to consider the long-run
expected average cost (AC), already studied in Chapter 5 and which is
also considered in the present chapter from a different perspective. But,
as noted in Chapter 5, the AC criterion has the inconvenience of being
extremely underselective, for it ignores what occurs in virtually any finite
period of time.
Thus, to cope with (i) and (ii), it might be more "convenient" to put the
undiscounted problem in the form introduced by Ramsey [1]: A policy 1f*
is said to overtake a policy 1f if for every initial state x there exists an
integer N = N (1f* , 1f, x) such that
(10.1.4)
Then a policy 1f" is called strongly overtaking optimal (strongly 0.0.),
or optimal in the sense of Ramsey, if 1f" overtakes any other policy 1f.
Under suitable conditions~for instance, if the sequences in (10.1.4) con-
verge [as in (9.3.14)]~strong overtaking optimality is equivalent to compare
policies with respect to the ETC criterion. In general, as is to be expected,
strong overtaking optimality turns out to be extremely overselective~there
are many well-known, elementary (finite-state) MCPs for which there is no
strongly 0.0. policy (see §1O.9).
Hence, we have to go back to the original un discounted problem and
put it in a form weaker that Ramsey's. This was done by Gale [1] and
von Weizsiicker [1] introducing the notion of weak overtaking optimality:
A policy 1f" is said to be weakly overtaking optimal (weakly 0.0.)
if for every policy 1f, initial state x, and c > 0, there exists an integer
N = N(1f",1f,X,c) such that
(10.1.5)

This notion of optimality seems to be more "reasonable" than (10.1.4).


But then again, if we are interested in the way I n (1f, x) varies with n,
it seems to be even more reasonable to compare I n (1f, x) with the optimal
n-stage cost
J~(x) := inf I n (1f, x). (10.1.6)
II

We thus arrive at Flynn's [1] opportunity cost of 1f, given the initial state
x, which is defined as

OC(1f,x) := limsup[Jn (1f,x) - J~(x)]. (10.1.7)


n-+oo

A policy 1f" is said to be opportunity cost-optimal (OC-optimal) if

OC(1f*, x) = inf OC(1f,x) \;fx E X. (10.1.8)


II

Subtracting J~(x) on both sides of (10.1.5), it follows immediately that OC


optimality is weaker than weak overtaking optimality, i.e.,

7r* weakly 0.0. ~ 7r* OC-optimal. (10.1.9)

On the other hand, instead of comparing I n (7r,x) with J~(x), we might


compare it with nJ*(x), where J*(x) is the optimal expected average
cost (AC), i.e.,
J*(x) := inf J(7r, x), (10.1.10)
II

and
J(7r,x):= lim sup I n (7r,x)/n (10.1.11)
n-+oo

is the long-run expected average cost (AC for short) when using 7r, given
the initial state x. Thus, instead of (10.1.7), we have Dutta's [1] criterion

D(7r, x) := lim sup[Jn (7r, x) - nJ*(x)]. (10.1.12)


n-+oo

In the terminology of Gale [1] and Dutta [1], a policy 7r for which D(7r,·)
is finite is said to be a "good" policy.
A policy 7r* is called D-optimal, or optimal in the sense of Dutta, if

D(7r*, x) = inf D(7r, x) Yx E X. (10.1.13)


II

Again, in analogy with (10.1.9), it follows directly from the definitions that
D-optimality is weaker than weak overtaking optimality, i.e.,

7r* weakly 0.0. ~ 7r* D-optimal. (10.1.14)

B. AC criteria

In general, the converse of (10.1.14) and (10.1.9) does not hold. But, on the other hand, we do have that if π* is such that OC(π*, ·) is finite-valued, then

π* OC-optimal ⟹ π* AC-optimal,   (10.1.15)

and similarly for a D-optimal policy π*, where AC-optimal means that

J(π*, x) = J*(x)   ∀x ∈ X.   (10.1.16)

Furthermore, if a policy π is not AC-optimal, that is, if J(π, x) > J*(x) for some state x, then a straightforward argument shows that D(π, x) and OC(π, x) are both infinite, i.e.,

J(π, x) > J*(x) ⟹ D(π, x) = +∞ and OC(π, x) = +∞.   (10.1.17)

From the latter fact, together with (10.1.9), (10.1.14) and (10.1.15), we can see not only that there are several "natural" undiscounted cost criteria

and that they all lead in an obvious manner to the AC criterion, but also that [by (10.1.17)] to find optimal policies with finite undiscounted costs it suffices to restrict ourselves to the class of AC-optimal policies.
In fact, one of the main objectives in this chapter is to show that, within the class Π_DS of deterministic stationary policies, all of the following concepts

weakly o.o., OC-optimal, D-optimal, and bias-optimal are equivalent,   (10.1.18)

where bias optimality is an AC-related criterion defined in §10.3.D.

C. Outline of the chapter

Section 10.2 introduces the Markov control model dealt with in this chapter. In §10.3 we present the main results, starting with AC-optimality and then going to special classes of AC-optimal policies (namely, canonical and bias-optimal policies), and to the undiscounted criteria in subsection A, above.
For the sake of "continuity" in the exposition, §10.3 contains only the statements of the main results; the proofs are given in §10.4 to §10.8. The chapter closes in §10.9 with some examples.
10.1.1 Remark. Theorem 10.3.1 establishes the existence of solutions to the Average Cost Optimality Inequality (ACOI) by using the same approach already used to prove Theorem 5.4.3, namely, the "vanishing discount" approach. However, to obtain Theorem 10.3.1 we cannot just refer to Theorem 5.4.3, because the hypotheses of these theorems are different, an important difference being that the latter theorem assumes the cost-per-stage c(x, a) to be nonnegative; this condition is not required in the present chapter. Moreover, the hypotheses here (Assumptions 10.2.1 and 10.2.2) are on the components of the control model itself, whereas in Chapter 5 the assumptions are based on the associated α-discounted cost problem.
10.1.2 Remark. The reader should be warned that not everyone uses the same terminology for the several optimality criteria in this chapter. For instance, what we call "strong overtaking optimality" [see (10.1.4)] is referred to as "overtaking optimality" by, say, Fernández-Gaucherand et al. [1], whereas our "weak o.o." [see (10.1.5)] is sometimes called "catching-up" in the economics literature, for example, in Dutta [1] and Gale [1].

10.2 Preliminaries

Let M := (X, A, {A(x) | x ∈ X}, Q, c) be the Markov control model introduced in §8.2. In this chapter we shall impose two sets of hypotheses on M. The first one, Assumption 10.2.1 below, is in fact a combination of Assumptions 8.3.1, 8.3.2 and 8.3.3, with (8.3.5) being replaced by an inequality of the form (8.3.11) [see also (7.3.9) or (7.3.10)].

A. Assumptions

10.2.1 Assumption. For every state x ∈ X:

(a) The control-constraint set A(x) is compact;

(b) The cost-per-stage c(x, a) is l.s.c. in a ∈ A(x); and

(c) The function a ↦ ∫_X u(y) Q(dy|x, a) is continuous on A(x) for every function u in B(X).

Moreover, there exists a weight function w ≥ 1, a bounded measurable (possibly constant) function b ≥ 0, and nonnegative constants c and β, with β < 1, such that for every x ∈ X:

(d) sup_{A(x)} |c(x, a)| ≤ cw(x);

(e) a ↦ ∫_X w(y) Q(dy|x, a) is continuous on A(x); and

(f) sup_{A(x)} ∫_X w(y) Q(dy|x, a) ≤ βw(x) + b(x).   (10.2.1)

As usual, one of the main purposes of parts (a) through (e) in Assumption 10.2.1 is to assure the existence of suitable "measurable selectors" f ∈ F, as in, for instance, Lemma 8.3.8 and Proposition 8.3.9(b). On the other hand, the Lyapunov-like inequality in part (f) yields an "expected growth" condition on the weight function w, as in (8.3.29) when b = 0; see also Remark 8.3.5(a), Example 9.6.7, or (10.4.2) below.
We will next introduce an additional assumption, according to which the Markov chains associated with deterministic stationary policies are w-geometrically ergodic (Definition 7.3.9), uniformly on Π_DS. To state this assumption it is convenient to slightly modify the notation (8.2.6) and (8.2.11) for the stochastic kernel Q(·|x, f), which now will be written as Q_f(·|x); that is, for every f ∈ F, B ∈ B(X), x ∈ X, and t = 0, 1, ...,

Q_f^t(B|x) := Q^t(B|x, f).   (10.2.2)

For t = 0, (10.2.2) reduces, of course, to

Q_f^0(·|x) = δ_x(·).   (10.2.3)

Observe that [by (10.2.23) below and the argument to obtain (8.3.44)] the inequality (10.2.1) is equivalent to

∫_X w(y) Q_f(dy|x) ≤ βw(x) + b(x)   ∀f ∈ F, x ∈ X.   (10.2.4)

Thus, multiplying by w(x)^{−1}, one can see that the stochastic kernels Q_f have a uniformly bounded w-norm. Actually, as in Remark 7.3.12, the w-norm of Q_f, i.e.,

‖Q_f‖_w := sup_x w(x)^{−1} ∫_X w(y) Q_f(dy|x),   (10.2.5)

satisfies

‖Q_f‖_w ≤ β + ‖b‖_w   ∀f ∈ F

or, alternatively, by (7.2.2),

‖Q_f‖_w ≤ β + ‖b‖   ∀f ∈ F,   (10.2.6)

where ‖b‖ := sup_x |b(x)| is the sup-norm of b. (Recall that b is assumed to be bounded.)
We will now state our second main assumption, and in the remainder of this section we discuss some consequences of, as well as some sufficient conditions for, it.

10.2.2 Assumption. (w-Geometric ergodicity.) For every decision function f ∈ F there exists a p.m. μ_f on X such that

‖Q_f^t − μ_f‖_w ≤ Rρ^t   ∀t = 0, 1, ...,   (10.2.7)

where R > 0 and 0 < ρ < 1 are constants independent of f.

This assumption implies in particular that μ_f is the unique i.p.m. of Q_f in M_w(X). (See the paragraph after Definition 7.3.9.) Furthermore, integration of both sides of (10.2.4) with respect to μ_f yields [by (7.3.1)]

∫ w dμ_f ≤ β ∫ w dμ_f + ‖b‖,

so that

‖μ_f‖_w := ∫ w dμ_f ≤ ‖b‖/(1 − β)   ∀f ∈ F.   (10.2.8)

In words, (10.2.8) means that the w-norm of μ_f is uniformly bounded in f ∈ F.
On the other hand, in analogy with (7.3.8), we may rewrite (10.2.7) as

| ∫ u(y) Q_f^t(dy|x) − μ_f(u) | ≤ ‖u‖_w Rρ^t w(x),   t = 0, 1, ...,   (10.2.9)

for all u in B_w(X), where

μ_f(u) := ∫_X u(y) μ_f(dy).   (10.2.10)

Observe that this integral is well-defined because every function in Bw(X)


is ILrintegrable, that is, with L 1 (IL/) := L 1 (X,B(X),IL/),
u E L 1 (IL/) Vu E Bw(X), / E F. (10.2.11)
In fact, by (7.2.1) and (10.2.8),

J luldIL/ ~ lIull w J wdIL/ ~ lIull wllbll/(1 - (3) (10.2.12)

for all u E Bw and / E F.
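Assumption 10.2.2 can be probed numerically for a given kernel. In the sketch below (a hypothetical symmetric 3-state stochastic matrix with w ≡ 1, so that the w-norm reduces to a total-variation-type norm), the distance ‖Q_f^t − μ_f‖_w of (10.2.7) decays geometrically, here exactly at rate ρ = 0.25, the modulus of the second-largest eigenvalue.

```python
import numpy as np

P = np.array([[0.50, 0.25, 0.25],     # hypothetical stochastic kernel Q_f
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
w = np.ones(3)                        # weight function w == 1
mu = np.array([1, 1, 1]) / 3          # invariant p.m. mu_f (uniform, by symmetry)

def wnorm_dist(t):
    # ||Q_f^t - mu_f||_w = sup_x w(x)^(-1) sum_y |P^t(x,y) - mu(y)| w(y), cf. (10.2.7)
    Pt = np.linalg.matrix_power(P, t)
    return np.max(np.abs(Pt - mu) @ w / w)

dists = [wnorm_dist(t) for t in range(8)]
print([round(d, 6) for d in dists])   # geometric decay at rate rho = 0.25
```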


B. Corollaries

We should remark that (10.2.11) is just a restatement of Theorem 7.5.10(a)
applied to the stochastic kernel $P := Q_f$ and the i.p.m. $\mu = \mu_f$. Similarly,
from parts (b) and (d) of Theorem 7.5.10 we obtain
$$\lim_{n\to\infty}\|Q_f^n u\|_w/n = 0 \quad \forall u\in\mathbb{B}_w(X),\ f\in\mathbb{F}, \qquad (10.2.13)$$
and if $u\in\mathbb{B}_w(X)$ is harmonic (or invariant) with respect to $Q_f$ (meaning:
$Q_f u(x) = u(x)$ for all $x\in X$), then $u$ is the constant $\mu_f(u)$, i.e.,
$$Q_f u = u \;\Longrightarrow\; u(x) = \mu_f(u) \quad \forall x\in X. \qquad (10.2.14)$$
We will now consider the strictly unichain Poisson equation (Definition
7.5.9) for the kernel $Q_f$ and the "charge" $c(\cdot,f)$, which we shall sometimes
write as $c_f(\cdot)$; i.e. [using (8.2.6)],
$$c_f(x) := c(x,f) = c(x,f(x)) \quad \forall x\in X,\ f\in\mathbb{F}. \qquad (10.2.15)$$
As in (10.1.2),
$$J_n(f^\infty,x) := E_x^{f^\infty}\Big[\sum_{t=0}^{n-1} c_f(x_t)\Big] = \sum_{t=0}^{n-1}\int_X c_f(y)\,Q_f^t(dy|x) \qquad (10.2.16)$$
denotes the n-stage expected total cost when using the deterministic stationary
policy $f^\infty\in\Pi_{DS}$. Similarly, by (10.1.11), the long-run expected
average cost (AC) is
$$J(f^\infty,x) := \limsup_{n\to\infty} J_n(f^\infty,x)/n. \qquad (10.2.17)$$
Let $J(f)$ be the constant
$$J(f) := \mu_f(c_f) = \int_X c_f(y)\,\mu_f(dy), \quad f\in\mathbb{F}. \qquad (10.2.18)$$

10.2.3 Proposition. (The Poisson equation.) Let $f\in\mathbb{F}$ be an arbitrary
decision function, and $f^\infty$ the corresponding deterministic stationary
policy. Then [with $c$ as in Assumption 10.2.1(d), and $R$, $\rho$ as in (10.2.7)]:
124 10. Undiscounted Cost Criteria

(a) $|J_n(f^\infty,x) - nJ(f)| \le cRw(x)/(1-\rho)$ for all $x\in X$ and $n = 1,2,\dots$,
so that, in particular,

(b) $J(f^\infty,x) = \lim_{n\to\infty} J_n(f^\infty,x)/n = J(f)$ for all $x\in X$ [cf. (10.2.17)], and

(c) the function
$$h_f(x) := \lim_{n\to\infty}[J_n(f^\infty,x) - nJ(f)] = \sum_{t=0}^{\infty} E_x^{f^\infty}[c_f(x_t) - J(f)] \qquad (10.2.19)$$
belongs to $\mathbb{B}_w(X)$ since, by (a),
$$\|h_f\|_w \le cR/(1-\rho). \qquad (10.2.20)$$

(d) The pair $(J(f),h_f)$ in $\mathbb{R}\times\mathbb{B}_w(X)$ is the unique solution of the strictly
unichain Poisson equation
$$J(f) + h_f(x) = c_f(x) + \int_X h_f(y)\,Q_f(dy|x), \quad x\in X, \qquad (10.2.21)$$
that satisfies the condition
$$\mu_f(h_f) = 0. \qquad (10.2.22)$$

Proof. Part (a) follows directly from (10.2.9) applied to $u = c_f$, together
with the elementary fact
$$\sum_{t=0}^{n-1}\rho^t = (1-\rho^n)/(1-\rho) \le 1/(1-\rho),$$
which was already used in (7.5.32).

Part (d), on the other hand, is a consequence of Theorem 7.5.10(e). $\Box$

Additional relevant information on the Poisson equation can be found in
§7.5; see, for instance, Theorem 7.5.5 and Remark 7.5.11.

The function $h_f$ in (10.2.19), that is, the unique solution of (10.2.21)-(10.2.22)
in $\mathbb{B}_w(X)$, will be referred to as the bias of the decision function
$f\in\mathbb{F}$ or of the deterministic stationary policy $f^\infty\in\Pi_{DS}$.
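For a finite ergodic chain, Proposition 10.2.3 is directly computational: $\mu_f$ is the left invariant vector of the matrix $Q_f$, $J(f) = \mu_f(c_f)$, and the bias $h_f$ solves the linear system (10.2.21) under the normalization (10.2.22). A minimal sketch with made-up two-state data (the kernel and cost are ours, chosen only for illustration):

```python
import numpy as np

Q = np.array([[0.9, 0.1],
              [0.4, 0.6]])        # kernel Q_f of a fixed decision function f
c = np.array([1.0, 3.0])          # one-stage cost c_f

n = len(c)
# Invariant p.m.: left eigenvector mu Q = mu, normalized to sum to 1.
A = np.vstack([Q.T - np.eye(n), np.ones(n)])
mu = np.linalg.lstsq(A, np.r_[np.zeros(n), 1.0], rcond=None)[0]
J = mu @ c                        # average cost J(f) = mu_f(c_f), cf. (10.2.18)

# Poisson equation (10.2.21): J + h = c + Q h, with mu_f(h_f) = 0 (10.2.22).
# Solve (I - Q) h = c - J together with the normalization mu @ h = 0.
B = np.vstack([np.eye(n) - Q, mu])
h = np.linalg.lstsq(B, np.r_[c - J, 0.0], rcond=None)[0]

assert np.allclose(J + h, c + Q @ h)   # Poisson equation holds
assert abs(mu @ h) < 1e-10             # normalization (10.2.22)
```

The overdetermined systems are consistent here, so `lstsq` returns exact solutions; any linear solver applied to a square reduced system would do equally well.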
C. Discussion

The most restrictive hypotheses are of course (10.2.4) [which is equivalent


to (10.2.1)] and (10.2.7). The former can be reduced to checking (10.2.4)
for a single decision function.

Indeed, by Assumption 10.2.1(e), we can apply Lemma 8.3.8(a) to the
function
$$v(x,a) := -\int w(y)\,Q(dy|x,a)$$
to see that there exists a decision function $f_w\in\mathbb{F}$ such that
$$\int w(y)\,Q_{f_w}(dy|x) = \max_{A(x)}\int w(y)\,Q(dy|x,a) \quad \forall x\in X. \qquad (10.2.23)$$
Then
$$\int w(y)\,Q_{f_w}(dy|x) \ge \int w(y)\,Q_f(dy|x) \quad \forall f\in\mathbb{F},\ x\in X,$$
and so if the inequality in (10.2.4) holds for $f_w$, i.e.,
$$\int w(y)\,Q_{f_w}(dy|x) \le \beta w(x) + b(x), \quad x\in X, \qquad (10.2.24)$$
then the inequality holds for all $f\in\mathbb{F}$.


To verify Assumption 10.2.2, on the other hand, we may try to apply
suitable "MCP-versions" of the results in §7.3D; see Theorems 7.3.10,
7.3.11, 7.3.14, and Remark 7.3.13. [If the cost-per-stage $c(x,a)$ is bounded,
it suffices to verify conditions of the form (7.3.13) or (7.3.16) for the total
variation norm.] For example, one can easily check that the same proof as
that of Theorem 7.3.14 yields the following.
10.2.4 Proposition. Suppose there exist a weight function $w\ge 1$, a
number $0<\rho<1$, a state $x^*\in X$, and a control action $a^*\in A(x^*)$ that
satisfy

(a) $w^* := \int_X w(y)\,Q(dy|x^*,a^*) < \infty$, and

(b$_1$) $\|Q_f(\cdot|x) - Q_f(\cdot|x')\|_w \le \rho[w(x)+w(x')]$ for all $f\in\mathbb{F}$, $x,x'\in X$,
or, equivalently,

(b$_2$) $\|\theta Q_f\|_w \le \rho\|\theta\|_w$ for every signed measure $\theta\in M_w(X)$ with $\theta(X) = 0$.

Then, for each $f\in\mathbb{F}$, the stochastic kernel $Q_f$ satisfies the conclusions of
Theorem 7.3.14; that is, for each $f\in\mathbb{F}$:

(i) $Q_f w \le \rho w + b$, with $b := \rho w(x^*) + w^*$;

(ii) $\|Q_f\|_w \le \rho + b < \infty$;

(iii) Assumption 10.2.2 holds for some constant $R \le 1 + b/(1-\rho)$, and $\rho$
as in (b$_1$), (b$_2$).
In other words, the hypotheses of Proposition 10.2.4 yield both (10.2.7)
and (10.2.4), with $\beta = \rho$ and $b(\cdot)$ the constant function $b = \rho w(x^*) + w^*$.
These hypotheses were introduced for MCPs by Gordienko, Montes-de-Oca
and Minjarez-Sosa [1], extending ideas of Kartashov [2] for noncontrolled
Markov chains.

Similarly, the MCP-version of Theorem 7.3.10 is the following proposition,
for the proof of which the reader is referred to Gordienko and
Hernandez-Lerma [2, Lemmas 3.3 and 3.4].
10.2.5 Proposition. Suppose that, for each $f\in\mathbb{F}$, the stochastic kernel
$Q_f$ has a unique i.p.m. $\mu_f$ and, in addition, there are a weight function
$w\ge 1$, a p.m. $\nu$ in $M_w(X)$, and positive numbers $\gamma$ and $\beta$, with $\beta<1$, that
satisfy the following: for each $f\in\mathbb{F}$ there exists a measurable function
$0\le l_f(\cdot)\le 1$ such that

(i) $Q_f(B|x) \ge l_f(x)\nu(B)$ for all $x\in X$, $B\in\mathcal{B}(X)$;

(ii) $\nu(l_f) := \int_X l_f\,d\nu \ge \gamma$;

(iii) $\nu(w) := \int_X w\,d\nu = \|\nu\|_w < \infty$; and

(iv) $\int_X w(y)\,Q_f(dy|x) \le \beta w(x) + l_f(x)\nu(w)$ for all $x\in X$. \quad (10.2.25)

Then there exist constants $R\ge 0$ and $0<\rho<1$, independent of $f\in\mathbb{F}$, for
which (10.2.7) holds.

Following ideas of Kartashov [3, Theor. 6], [5, Theor. 3.6], it is possible
to obtain estimates of the constants $R$ and $\rho$ in the conclusion of Proposition
10.2.5; see Gordienko and Hernandez-Lerma [2] for details.

10.3 From AC optimality to undiscounted criteria
As was already noted in §10.1, undiscounted criteria lead directly to the
average cost (AC) criterion; see (10.1.15), and also (10.1.9), (10.1.14), and
above all (10.1.17).

However, a priori it is not obvious how to go in the reverse direction, from
AC optimality to the undiscounted criteria, because AC-optimal policies
can have a "nasty" finite-horizon behavior. For instance, we can have two
AC-optimal policies $\pi$ and $\pi'$ with very different n-stage costs $J_n(\pi,\cdot)$,
$J_n(\pi',\cdot)$ for all $n$, so that, e.g.,
$$\lim_{n\to\infty}[J_n(\pi,x) - J_n(\pi',x)] = \infty. \qquad (10.3.1)$$

Thus to go from AC optimality to a result such as, say, (10.1.5) seems to
be virtually impossible.

Consequently, to relate the AC criterion to the undiscounted criteria
in §10.1.A it is necessary to study subclasses of AC-optimal policies with
special properties, which is the main objective of this section. For the sake
of "continuity" in the exposition, here we only state the main results; they
are proved in subsequent sections.
The program for this section is as follows. We begin by distinguishing a
class of deterministic stationary policies which are obtained from the Average
Cost Optimality Inequality (ACOI), and these are then used to obtain
the Average Cost Optimality Equation (ACOE). (See Theorems 10.3.1 and
10.3.6, respectively.) From the latter equation we shall obtain the subclass
of so-called canonical policies and, finally, a further subclass of bias-optimal
policies (Theorem 10.3.10). In short, we get a hierarchy
$$\mathbb{F}_{\mathrm{bias}} \subset \mathbb{F}_{\mathrm{ca}} \subset \mathbb{F}_{\mathrm{AC}} \subset \mathbb{F} \qquad (10.3.2)$$
of subsets of decision functions, where $\mathbb{F}_{\mathrm{AC}}$ denotes the family of decision
functions $f\in\mathbb{F}$ for which the corresponding deterministic stationary policy
$f^\infty\in\Pi_{DS}$ is AC-optimal, and similarly for the subfamilies $\mathbb{F}_{\mathrm{ca}}$ and $\mathbb{F}_{\mathrm{bias}}$
of canonical and bias-optimal decision functions, respectively.
A particularly interesting feature of canonical policies is that they exclude
possibilities such as (10.3.1). Namely, it will be shown that a canonical
policy $f^\infty$ satisfies a relation of the form
$$J_n(f^\infty,x) = n\rho^* + h^*(x) - E_x^{f^\infty}h^*(x_n) \quad \forall x\in X,\ n = 0,1,\dots, \qquad (10.3.3)$$
for some constant $\rho^*$ and some function $h^*$ in $\mathbb{B}_w(X)$. [See (10.3.21).] Thus,
if $g^\infty$ is any other such policy, then, by (10.2.9) and (10.2.10),
$$\lim_{n\to\infty}[J_n(f^\infty,x) - J_n(g^\infty,x)] = -\int h^*\,d\mu_f + \int h^*\,d\mu_g \quad \forall x\in X, \qquad (10.3.4)$$
and so the finite-horizon behavior of two canonical policies cannot differ
"too much".

Moreover, it will be shown [in (10.3.30), (10.3.31)] that if $f^\infty$ is bias-optimal
(that is, $f$ is in $\mathbb{F}_{\mathrm{bias}}$) then
$$\int h^*\,d\mu_f \ge \int h^*\,d\mu_g \quad \forall g\in\mathbb{F}_{\mathrm{AC}}, \qquad (10.3.5)$$
so that instead of (10.3.4) we will get
$$\lim_{n\to\infty}[J_n(f^\infty,x) - J_n(g^\infty,x)] \le 0, \qquad (10.3.6)$$
with equality if and only if $g$ is in $\mathbb{F}_{\mathrm{bias}}$ also. The inequality (10.3.6) will
turn out to be related to the defining property (10.1.5) of weakly overtaking
optimal (o.o.) policies. Combining these remarks with (10.1.9) and (10.1.14)
we will be on our way to proving statement (10.1.18), which is the result
that closes this section (see Theorem 10.3.11 and Corollary 10.3.12).

A. The AC optimality inequality

The following theorem states the existence of a solution $(\rho^*,h_0)$ to the
ACOI, as well as the existence of a deterministic stationary policy $f_0^\infty$ that
is AC-optimal in $\Pi_{DS}$, i.e.,
$$J(f_0^\infty,x) = \inf_{\Pi_{DS}} J(f^\infty,x) \quad \forall x\in X, \qquad (10.3.7)$$
or, equivalently [by (10.2.18) and Proposition 10.2.3(b)],
$$J(f_0^\infty,x) = J(f_0) = \inf_{\mathbb{F}} J(f) \quad \forall x\in X. \qquad (10.3.8)$$

10.3.1 Theorem. (The ACOI.) Suppose that Assumptions 10.2.1 and
10.2.2 are satisfied. Then there exist a constant $\rho^*$, a function $h_0$ in
$\mathbb{B}_w(X)$, and a decision function $f_0\in\mathbb{F}$ such that for each state $x\in X$
the Average Cost Optimality Inequality (ACOI) holds, i.e.,
$$\rho^* + h_0(x) \ge \min_{A(x)}\Big[c(x,a) + \int_X h_0(y)\,Q(dy|x,a)\Big], \qquad (10.3.9)$$
and, moreover, $f_0(x)\in A(x)$ attains the minimum in (10.3.9), so that
[using the notation (10.2.2) and (10.2.15)]
$$\rho^* + h_0(x) \ge c_{f_0}(x) + \int_X h_0(y)\,Q_{f_0}(dy|x). \qquad (10.3.10)$$
In addition, the deterministic stationary policy $f_0^\infty\in\Pi_{DS}$ corresponding
to $f_0$ is optimal for (that is, minimizes) the AC criterion in $\Pi_{DS}$, with $\rho^*$
being the optimal value; i.e., $f_0^\infty$ satisfies (10.3.7)-(10.3.8) and
$$\rho^* = J(f_0) = \inf_{\mathbb{F}} J(f). \qquad (10.3.11)$$
In fact, any decision function $f_0$ that satisfies (10.3.10) also satisfies
(10.3.11).

Proof. See §10.4.
10.3.2 Remark. (AC-optimality of $f_0^\infty$.) From the proof of Theorem
10.3.1 (see Lemma 10.4.3 and Remark 10.4.4) it will be clear that if the
one-stage cost function $c(x,a)$ is nonnegative, then $f_0^\infty$ is AC-optimal and
$\rho^*$ is the AC value function; that is, with $J^*(\cdot)$ as in (10.1.10), we can
rewrite (10.3.11) as
$$\rho^* = J(f_0) = J^*(x) \quad \forall x\in X. \qquad (10.3.12)$$
Furthermore, any decision function $f_0$ that satisfies (10.3.10) also satisfies
(10.3.12). $\Box$

B. The AC optimality equation

A deterministic stationary policy $f_*^\infty$ is said to be canonical (or AC-canonical)
if, for some function $h^*$ in $\mathbb{B}_w(X)$, equality holds in (10.3.9) and
(10.3.10) when $h_0$ and $f_0$ are replaced by $h^*$ and $f_*$, respectively; that is,
for each state $x\in X$,
$$\rho^* + h^*(x) = \min_{A(x)}\Big[c(x,a) + \int_X h^*(y)\,Q(dy|x,a)\Big], \qquad (10.3.13)$$
and
$$\rho^* + h^*(x) = c_{f_*}(x) + \int_X h^*(y)\,Q_{f_*}(dy|x). \qquad (10.3.14)$$
As in Chapter 5, (10.3.13) will be referred to as the Average Cost Optimality
Equation (ACOE). From (10.3.13) and (10.2.13) it is easy to see (as in §5.2)
that $\rho^*$ is the AC value function, and that any deterministic stationary
policy $f_*^\infty$ for which (10.3.14) holds is AC-optimal, i.e.,
$$\rho^* = J(f_*) = J^*(x) \quad \forall x\in X. \qquad (10.3.15)$$
The converse is not true; that is, an AC-optimal policy is not necessarily
canonical. [See, for instance, Example 10.9.1.] Nevertheless, in Theorem
10.3.6(b) we give conditions under which an AC-optimal policy is "almost
everywhere (a.e.)" canonical. Furthermore, the proof of Theorem 10.3.6(a)
will show in fact that the existence of a deterministic stationary policy that
minimizes the AC criterion in $\Pi_{DS}$ [as $f_0^\infty$ in (10.3.11)] implies the existence
of a canonical policy.

The term "canonical" comes from Definition 5.2.1 and Theorem 5.2.2,
which are briefly recalled next.
Let $h : X\to\mathbb{R}$ be a given measurable function, and let $J_n(\pi,x,h)$ be
the n-stage expected total cost with terminal cost function $h$; that is, for
each policy $\pi$ and initial state $x$,
$$J_0(\pi,x,h) := h(x),$$
and for $n = 1,2,\dots$
$$J_n(\pi,x,h) := E_x^\pi\Big[\sum_{t=0}^{n-1} c(x_t,a_t) + h(x_n)\Big]. \qquad (10.3.16)$$
Of course, we have
$$J_n(\pi,x,h) = J_n(\pi,x) + E_x^\pi h(x_n), \qquad (10.3.17)$$
where $J_n(\pi,x) = J_n(\pi,x,0)$ is the n-stage cost in (10.1.2). The value
function corresponding to $J_n(\pi,x,h)$ is
$$J_n^*(x,h) := \inf_\Pi J_n(\pi,x,h). \qquad (10.3.18)$$

10.3.3 Definition. Let $(\rho,h,f)$ be a triplet consisting of two real-valued
measurable functions $\rho$ and $h$ on $X$, and a decision function $f\in\mathbb{F}$. We call
$(\rho,h,f)$ a canonical triplet if
$$J_n(f^\infty,x,h) = n\rho(x) + h(x) = J_n^*(x,h) \quad \forall x\in X,\ n = 0,1,\dots. \qquad (10.3.19)$$
A decision function $f\in\mathbb{F}$, or the corresponding policy $f^\infty$ in $\Pi_{DS}$, is said
to be canonical if it enters into some canonical triplet.

In the context of this chapter, $\rho(\cdot)\equiv\rho^*$ is a constant, and the connection
between a canonical triplet and the ACOE (10.3.13)-(10.3.14) is provided
by Theorem 5.2.2, which can be specialized as follows.
10.3.4 Theorem. (ACOE $\Leftrightarrow$ canonical triplet.) A triplet $(\rho^*,h^*,f_*)$
consisting of a number $\rho^*$, a function $h^*$ in $\mathbb{B}_w(X)$, and a decision function
$f_*\in\mathbb{F}$ satisfies the ACOE (10.3.13)-(10.3.14) if and only if $(\rho^*,h^*,f_*)$ is
a canonical triplet, i.e.,
$$J_n(f_*^\infty,x,h^*) = n\rho^* + h^*(x) = J_n^*(x,h^*) \quad \forall x\in X,\ n = 0,1,\dots. \qquad (10.3.20)$$

Let us rewrite the first equality in (10.3.20) as
$$J_n(f_*^\infty,x) + E_x^{f_*^\infty}h^*(x_n) = n\rho^* + h^*(x). \qquad (10.3.21)$$
Then, by Theorem 7.5.5(a), (b), we can see that (10.3.21) is just another
way of writing the Poisson equation (10.3.14). In a little more generality,
the Poisson equation (10.2.21) is "equivalent" [in the sense of Theorem
7.5.5(a), (b)] to
$$J_n(f^\infty,x) + E_x^{f^\infty}h_f(x_n) = nJ(f) + h_f(x) \qquad (10.3.22)$$
for all $x\in X$ and $n = 0,1,\dots$.
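For a finite chain the identity (10.3.22) can be checked exactly: $J_n(f^\infty,x) = \sum_{t<n}(Q_f^t c_f)(x)$ and $E_x^{f^\infty}h_f(x_n) = (Q_f^n h_f)(x)$. A sketch, using made-up two-state data and the linear-algebra computation of $(J(f),h_f)$ described in §10.2 (illustration only):

```python
import numpy as np

Q = np.array([[0.9, 0.1], [0.4, 0.6]])   # illustrative kernel Q_f
c = np.array([1.0, 3.0])                  # illustrative cost c_f

# Invariant p.m., average cost, and Poisson solution h_f (with mu @ h = 0).
A = np.vstack([Q.T - np.eye(2), np.ones(2)])
mu = np.linalg.lstsq(A, np.array([0.0, 0.0, 1.0]), rcond=None)[0]
J = mu @ c
B = np.vstack([np.eye(2) - Q, mu])
h = np.linalg.lstsq(B, np.r_[c - J, 0.0], rcond=None)[0]

# Check (10.3.22): J_n(f,x) + E_x h(x_n) = n J(f) + h(x) for every n.
Jn = np.zeros(2)     # n-stage cost, componentwise in the initial state x
Qn = np.eye(2)       # Q_f^n
for n in range(1, 30):
    Jn += Qn @ c     # add the stage-t term Q^t c, t = n-1
    Qn = Qn @ Q
    assert np.allclose(Jn + Qn @ h, n * J + h)
```

The assertion holds exactly (up to rounding) because it is just the Poisson equation $c + Qh = J + h$ iterated $n$ times.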
We will require the following assumption, which uses the concept of
$\lambda$-irreducibility [Definition 7.3.1(a), (a1), (a2)] of the stochastic kernel $Q_f$
in (10.2.2).

10.3.5 Assumption. (Irreducibility.) There exists a $\sigma$-finite measure $\lambda$
on $\mathcal{B}(X)$ with respect to which $Q_f$ is $\lambda$-irreducible for all $f\in\mathbb{F}$. [Of course,
$\lambda$ is non-trivial: $\lambda(X) > 0$.]
10.3.6 Theorem. (Existence of canonical policies.) Suppose that
Assumptions 10.2.1, 10.2.2, and also 10.3.5 are satisfied. Then:

(a) There exists a canonical policy.

(b) If $f^\infty\in\Pi_{DS}$ is AC-optimal, then
$$\rho^* + h_f(x) = \min_{A(x)}\Big[c(x,a) + \int_X h_f(y)\,Q(dy|x,a)\Big] \quad \mu_f\text{-a.e.}, \qquad (10.3.23)$$
where $h_f$ is the solution to the Poisson equation (10.2.21)-(10.2.22).

Proof. See §10.5.

C. Uniqueness of the ACOE

In view of Theorem 10.3.4, part (a) of Theorem 10.3.6 gives, in other
words, the existence of solutions $(\rho^*,h^*)$ to the ACOE (10.3.13). Thus
we already have the two nonempty sets $\mathbb{F}_{\mathrm{AC}}$ and $\mathbb{F}_{\mathrm{ca}}$ satisfying (10.3.2).
Before introducing the third set, $\mathbb{F}_{\mathrm{bias}}$, we need to consider the question of
uniqueness of solutions to the ACOE.

It is obvious that $\rho^*$ is unique: there can be no two different values of $\rho^*$
that satisfy (10.3.15). It is just as obvious that if $h^*(\cdot)$ satisfies (10.3.13),
then so does $h^*(\cdot) + k$ for any constant $k$.

What is not obvious at all is that the solutions of (10.3.13) are all precisely
of the form $h^*(\cdot) + k$; that is, two solutions of the ACOE can differ at
most by a constant. Accordingly, understanding "uniqueness" as "uniqueness
modulo an additive constant," we have the following.

10.3.7 Theorem. (Uniqueness of solutions to the ACOE.) Suppose
that the hypotheses of Theorem 10.3.6(a) hold. If $h_1^*$ and $h_2^*$ are two
functions in $\mathbb{B}_w(X)$ such that $(\rho^*,h_1^*)$ and $(\rho^*,h_2^*)$ both satisfy the ACOE
(10.3.13), then there exists a constant $k = k(h_1^*,h_2^*)$ for which
$$h_1^*(x) = h_2^*(x) + k \quad \forall x\in X. \qquad (10.3.24)$$

Proof. See §10.6.


10.3.8 Remark. As (10.3.14) is just a Poisson equation of the form
(10.2.21), we may expect to be able to "fix" a solution to the ACOE in
the same way we did in Remark 7.5.11(a) for the "noncontrolled" Poisson
equation, which is indeed the case. For instance, let $h^*$ be a function that
satisfies (10.3.13) and choose an arbitrary, fixed state $x$. Then, by (10.3.24),
the function
$$h^*(\cdot) - h^*(x) \qquad (10.3.25)$$
is the unique solution of (10.3.13) that vanishes at $x$. Similarly [as in
(10.2.22)], if $h^*\in\mathbb{B}_w(X)$ and $f_*\in\mathbb{F}$ satisfy (10.3.13) and (10.3.14), then
$$h^*(\cdot) - \int_X h^*\,d\mu_{f_*} \qquad (10.3.26)$$
is the unique solution of the ACOE (10.3.13) whose integral with respect
to $\mu_{f_*}$ is zero.

D. Bias-optimal policies

Let $h_f\in\mathbb{B}_w(X)$ and $\mathbb{F}_{\mathrm{AC}}$ be as in (10.2.19) and (10.3.2), respectively;
that is, $h_f$ is the bias function corresponding to $f\in\mathbb{F}$, and
$$\mathbb{F}_{\mathrm{AC}} := \{f\in\mathbb{F} \mid J(f) = \rho^*\}.$$
The infimum of $h_f$ over $f$ in $\mathbb{F}_{\mathrm{AC}}$ is called the optimal bias function, and
we shall denote it by $\widehat{h}$, i.e.,
$$\widehat{h}(x) := \inf\{h_f(x) \mid f\in\mathbb{F}_{\mathrm{AC}}\} \quad \forall x\in X. \qquad (10.3.27)$$
Observe that the inequality (10.2.20) ensures that
$$\|\widehat{h}\|_w \le cR/(1-\rho). \qquad (10.3.28)$$
Hence $\widehat{h}$ is in the space $\mathbb{B}_w(X)$ if $\widehat{h}$ is measurable, which is the case, for
instance, if there is a decision function $\widehat{f}$ such that $\widehat{h} = h_{\widehat{f}}$. This is related
to the following definition.
10.3.9 Definition. A decision function $\widehat{f}$ (or the corresponding deterministic
stationary policy $\widehat{f}^\infty$) is said to be bias-optimal if it attains the
minimum in (10.3.27), i.e.,
$$J(\widehat{f}) = \rho^* \quad \text{and} \quad h_{\widehat{f}}(x) = \widehat{h}(x) \quad \forall x\in X. \qquad (10.3.29)$$
We denote by $\mathbb{F}_{\mathrm{bias}}$ the class of bias-optimal decision functions.

The concept of bias optimality was introduced by Veinott [2].
To prove the existence of bias-optimal policies we will use the fact that,
under the hypotheses of Theorem 10.3.6(a), to obtain $\widehat{h}$ in (10.3.27) we
may replace $\mathbb{F}_{\mathrm{AC}}$ by the smaller class $\mathbb{F}_{\mathrm{ca}}$ of canonical decision functions,
i.e.,
$$\widehat{h}(x) = \inf\{h_f(x) \mid f\in\mathbb{F}_{\mathrm{ca}}\}, \quad x\in X. \qquad (10.3.30)$$
In fact, we can write $\widehat{h}$ more explicitly as
$$\widehat{h}(x) = \inf_{\mathbb{F}_{\mathrm{ca}}}\Big[h^*(x) - \int_X h^*\,d\mu_f\Big] = h^*(x) - \sup_{\mathbb{F}_{\mathrm{ca}}}\int_X h^*\,d\mu_f \quad \forall x\in X. \qquad (10.3.31)$$
This is easily seen from the second equality in (10.3.20) and the definition
(10.3.18) of $J_n^*(x,h^*)$. They yield that for each decision function $f$ in $\mathbb{F}_{\mathrm{AC}}$
$$n\rho^* + h^*(x) \le J_n(f^\infty,x,h^*) = J_n(f^\infty,x) + E_x^{f^\infty}h^*(x_n) \quad \forall n,x,$$
or, equivalently,
$$J_n(f^\infty,x) - n\rho^* \ge h^*(x) - E_x^{f^\infty}h^*(x_n) \quad \forall n,x, \qquad (10.3.32)$$
with equality if $f$ is canonical [see (10.3.21)]. Consequently, as $J(f) = \rho^*$,
letting $n\to\infty$ we obtain [by (10.2.19) and (10.2.9)]
$$h_f(x) \ge h^*(x) - \int_X h^*\,d\mu_f \quad \forall x\in X,\ f\in\mathbb{F}_{\mathrm{AC}}, \qquad (10.3.33)$$
with equality if $f$ is canonical, that is, if $f$ is in $\mathbb{F}_{\mathrm{ca}}$. This fact gives
(10.3.30) and (10.3.31).
Observe that (10.3.31), (10.3.27) and (10.3.29) give an alternative characterization
of a bias-optimal decision function: $\widehat{f}\in\mathbb{F}_{\mathrm{AC}}$ is bias-optimal if
and only if it maximizes the integral $\int h^*\,d\mu_f$ over all $f\in\mathbb{F}_{\mathrm{ca}}$, that is,
$$\int_X h^*\,d\mu_{\widehat{f}} = \sup_{\mathbb{F}_{\mathrm{ca}}}\int_X h^*\,d\mu_f. \qquad (10.3.34)$$
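The role of the bias in discriminating among AC-optimal policies can be seen in a toy example. The two stationary policies below are entirely made up: they have the same average cost, but different Poisson solutions, and comparing the biases singles out the better one. (This is only a two-policy comparison, not a computation of $\widehat{h}$ over all of $\mathbb{F}_{\mathrm{AC}}$.)

```python
import numpy as np

def avg_and_bias(Q, c):
    """For a finite ergodic chain: invariant p.m. mu (mu Q = mu), average
    cost J = mu @ c, and Poisson solution J + h = c + Q h with mu @ h = 0."""
    n = len(c)
    A = np.vstack([Q.T - np.eye(n), np.ones(n)])
    mu = np.linalg.lstsq(A, np.r_[np.zeros(n), 1.0], rcond=None)[0]
    J = mu @ c
    B = np.vstack([np.eye(n) - Q, mu])
    h = np.linalg.lstsq(B, np.r_[c - J, 0.0], rcond=None)[0]
    return J, h

# Two made-up stationary policies with the SAME average cost (J = 1) ...
Jf, hf = avg_and_bias(np.array([[0.50, 0.50], [0.5, 0.5]]), np.array([0.0, 2.0]))
Jg, hg = avg_and_bias(np.array([[0.75, 0.25], [0.5, 0.5]]), np.array([0.5, 2.0]))
assert abs(Jf - Jg) < 1e-10
# ... but different biases: hf < hg pointwise, so f is the bias-optimal one here.
assert np.all(hf < hg)
```

Here the second policy pays a transient extra cost before settling into the same long-run behavior, which is exactly the kind of difference the average cost cannot see but the bias can.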

In addition to (10.3.34), other characterizations of bias-optimal policies
are given in Theorem 10.3.10, which uses the following notation: for each
state $x\in X$, $A^*(x)\subset A(x)$ is the set of actions at which the minimum is
attained in (10.3.13), i.e.,
$$A^*(x) := \Big\{a\in A(x) \,\Big|\, c(x,a) + \int_X h^*(y)\,Q(dy|x,a) = \rho^* + h^*(x)\Big\}. \qquad (10.3.35)$$
Thus, in particular, a decision function $f$ is canonical if and only if $f(x)$ is
in $A^*(x)$ for all $x\in X$, i.e.,
$$f\in\mathbb{F}_{\mathrm{ca}} \;\Leftrightarrow\; f(x)\in A^*(x) \quad \forall x\in X. \qquad (10.3.36)$$
Furthermore, in (10.3.13) we may replace $A(x)$ by $A^*(x)$, i.e.,
$$\rho^* + h^*(x) = \min_{A^*(x)}\Big[c(x,a) + \int_X h^*(y)\,Q(dy|x,a)\Big], \qquad (10.3.37)$$
and, on the other hand, as $\widehat{h}$ differs from $h^*$ only by a constant [see
(10.3.31)], in (10.3.13) and (10.3.37) we may replace $h^*$ by $\widehat{h}$, which gives
$$\rho^* + \widehat{h}(x) = \min_{A(x)}\Big[c(x,a) + \int_X \widehat{h}(y)\,Q(dy|x,a)\Big] \qquad (10.3.38)$$
or, equivalently,
$$\rho^* + \widehat{h}(x) = \min_{A^*(x)}\Big[c(x,a) + \int_X \widehat{h}(y)\,Q(dy|x,a)\Big]. \qquad (10.3.39)$$
In either case, a canonical decision function $f\in\mathbb{F}_{\mathrm{ca}}$ satisfies [by (10.3.36)]
$$\rho^* + \widehat{h}(x) = c_f(x) + \int_X \widehat{h}(y)\,Q_f(dy|x) \quad \forall x\in X. \qquad (10.3.40)$$


The following theorem shows that, among other things, (10.3.40) together
with an additional condition characterizes bias-optimal policies.

10.3.10 Theorem. (Existence and characterization of bias-optimal
policies.) Suppose that the hypotheses of Theorem 10.3.6(a) are satisfied.
Then:

(a) There exists a bias-optimal decision function $\widehat{f}\in\mathbb{F}_{\mathrm{bias}}$; moreover,

(b) $(\rho^*,\widehat{h},\widehat{f})$ is a canonical triplet [that is, it satisfies (10.3.39), or
(10.3.38), and (10.3.40)] and there exists a function $h'$ in $\mathbb{B}_w(X)$
such that
$$\widehat{h}(x) + h'(x) = \min_{A^*(x)}\int_X h'(y)\,Q(dy|x,a) \quad \forall x\in X. \qquad (10.3.41)$$
In addition, if $f'\in\mathbb{F}_{\mathrm{ca}}$ is a canonical decision function that attains
the minimum in (10.3.41), i.e.,
$$\widehat{h}(x) + h'(x) = \int_X h'(y)\,Q_{f'}(dy|x) \quad \forall x\in X, \qquad (10.3.42)$$
then $f'$ is also bias-optimal.

(c) Conversely, if $(\rho^*,h,f)$ is a canonical triplet and if there is a function
$h'$ in $\mathbb{B}_w(X)$ that together with $h$ and $f$ satisfies (10.3.41) and
(10.3.42) for all $x\in X$, i.e.,
$$h(x) + h'(x) = \int_X h'(y)\,Q_f(dy|x) = \min_{A^*(x)}\int_X h'(y)\,Q(dy|x,a), \qquad (10.3.43)$$
then $f$ is bias-optimal and $h$ is the optimal bias function, i.e., $h = \widehat{h}$.

(d) The following statements are equivalent:

(d$_1$) $f\in\mathbb{F}$ is bias-optimal.

(d$_2$) $f\in\mathbb{F}$ is a canonical decision function and
$$\int_X \widehat{h}\,d\mu_f = 0, \qquad (10.3.44)$$
where $\mu_f$ is the i.p.m. in Assumption 10.2.2.

Proof. See §10.7.
It is worth noting that Theorem 10.3.10 provides two "optimality equations"
for the bias-minimization problem, namely:

(i) From Theorem 10.3.10(c), the ACOE (10.3.13)-(10.3.14) [see also
(10.3.36)-(10.3.40)] and (10.3.43) together form an "optimality equation"
for bias minimization; and

(ii) From Theorem 10.3.10(d), the ACOE (10.3.13)-(10.3.14) [or (10.3.39)-
(10.3.40)] and (10.3.44) form another "optimality equation".

Furthermore (as shown in the proof of the theorem, in §10.7), case (i) occurs
when the bias-minimization problem is viewed as an "average cost"
problem, and then (i) is a direct consequence of Theorem 10.3.6(a). Case
(ii), on the other hand, appears when bias minimization is posed as an
"expected total cost (ETC)" problem, which is done in Remark 10.7.1.
The latter remark provides a second proof of Theorem 10.3.10(a), and it
is based on the ETC results in §9.5. Interestingly enough, our (first) proof
of Theorem 10.3.10(a), following an "average cost" approach [see (10.7.3),
(10.7.4)], is basically the same as Nowak's [1] proof of the existence (in
$\Pi_{DS}$) of weakly overtaking optimal policies! It was precisely this observation
that suggested the equivalence of the several optimality concepts in
(10.1.18), which is the content of Theorem 10.3.11 below.

E. Undiscounted criteria

In §10.1 we saw that some undiscounted cost criteria naturally lead to
the AC criterion. The following result, on the other hand, states that we
can go backwards, via bias optimality, in the sense that (10.1.18) holds
in the class $\Pi_{DS}$ of deterministic stationary policies.

10.3.11 Theorem. (Equivalence of undiscounted criteria.) Suppose
that the hypotheses of Theorem 10.3.6(a) are satisfied. Then the following
statements are equivalent:

(a) $f^\infty\in\Pi_{DS}$ is bias-optimal, that is, $f$ is in $\mathbb{F}_{\mathrm{bias}}$.

(b) $f^\infty$ is OC-optimal in $\Pi_{DS}$, i.e. [see (10.1.8)],
$$OC(f^\infty,x) = \inf_{\Pi_{DS}} OC(g^\infty,x) \quad \text{and} \quad OC(f^\infty,x) < \infty \quad \forall x\in X.$$

(c) $f^\infty$ is D-optimal in $\Pi_{DS}$, i.e. [see (10.1.13)],
$$D(f^\infty,x) = \inf_{\Pi_{DS}} D(g^\infty,x) \quad \text{and} \quad D(f^\infty,x) < \infty \quad \forall x\in X.$$

(d) $f^\infty$ is weakly o.o. in $\Pi_{DS}$, i.e. [see (10.1.5)],
$$\limsup_{n\to\infty}[J_n(f^\infty,x) - J_n(g^\infty,x)] \le 0 \quad \forall g^\infty\in\Pi_{DS},\ x\in X.$$

Proof. See §10.8.


Finally, as a direct consequence of Theorems 10.3.10 and 10.3.11 we have:
10.3.12 Corollary. (Existence of "undiscounted" optimal poli-
cies.) Under the hypotheses 0/ Theorem 10.3.6(aj, there is a deterministic
stationary policy /00 for which the statements (aj to (dj in Theorem 10.3.11
hold.

Notes on §10.3

1. All of the results in this section are essentially from Vega-Amaya [2]
and Hernandez-Lerma and Vega-Amaya [1]. Theorems 10.3.1 and 10.3.6 are
also obtained in Gordienko and Hernandez-Lerma [2], but under additional
assumptions. In particular, the latter reference requires the cost-per-stage
$c(x,a)$ to be nonnegative, which allows a direct application of the Abelian
theorem (10.4.13) to obtain the result mentioned in Remark 10.3.2. Moreover,
there the ACOE (10.3.13) is obtained via the Ascoli Theorem, which of
course requires imposing suitable "equicontinuity" hypotheses on the control
model. The proof of the ACOE presented here (in §10.5) uses a "policy
iteration" argument instead of the Ascoli Theorem.

For additional comments (with references) on how to obtain the ACOE,
see the Notes on §5.5.

2. Concerning the relation (10.3.2), we may recall from §5.2 that there are
intermediate optimality concepts between "canonical" and "AC-optimal".
For instance, a policy $\pi^*\in\Pi$ is said to be F-strong AC-optimal (or strong
AC-optimal in the sense of Flynn [1]) if
$$\lim_{n\to\infty}[J_n(\pi^*,x) - J_n^*(x)]/n = 0 \quad \forall x\in X. \qquad (10.3.45)$$
Thus, denoting by $\mathbb{F}_{F\text{-SAC}}$ the class of decision functions $f\in\mathbb{F}$ for which
$f^\infty$ is F-strong AC-optimal, it is easy to see that $\mathbb{F}_{F\text{-SAC}}$ lies between $\mathbb{F}_{\mathrm{ca}}$
and $\mathbb{F}_{\mathrm{AC}}$, i.e.,
$$\mathbb{F}_{\mathrm{ca}} \subset \mathbb{F}_{F\text{-SAC}} \subset \mathbb{F}_{\mathrm{AC}}. \qquad (10.3.46)$$

3. Examples by Brown [1] and Nowak and Vega-Amaya [1] show that,
without additional assumptions, the results in Theorem 10.3.11 and Corollary
10.3.12 cannot be extended to the class $\Pi$ of all policies. (See Remark
10.9.2.)

4. Haviv and Puterman [1] use bias optimality to distinguish between
two AC-optimal policies for a certain admission-control queueing system.
To discriminate among AC-optimal policies one can also use the minimum
average variance (see §11.3) instead of the minimum bias.
10.4 Proof of Theorem 10.3.1

The proof of Theorem 10.3.1 requires several preliminary results, which are
presented in the following subsection.
A. Preliminary lemmas

As the function $b(\cdot)$ in (10.2.1) satisfies $0\le b(x)\le\|b\|$ for all $x\in X$,
we will assume that $b(\cdot)$ is a constant, denoted again by $b$, i.e., $b(\cdot)\equiv b$.
Thus, instead of (10.2.1) we now have
$$\sup_{A(x)}\int_X w(y)\,Q(dy|x,a) \le \beta w(x) + b \quad \forall x\in X, \qquad (10.4.1)$$
and similarly for (10.2.4).
10.4.1 Lemma. Let $\pi\in\Pi$ be an arbitrary policy. Then for each $x\in X$
and $t = 1,2,\dots$
$$E_x^\pi w(x_t) \le \beta^t w(x) + b\sum_{j=0}^{t-1}\beta^j \le [1 + b/(1-\beta)]w(x). \qquad (10.4.2)$$
Hence [with $c$ as in Assumption 10.2.1(d)], for each $t = 0,1,\dots$, $x\in X$,
and $u\in\mathbb{B}_w(X)$,
$$E_x^\pi|c(x_t,a_t)| \le c[1 + b/(1-\beta)]w(x) \qquad (10.4.3)$$
and
$$E_x^\pi|u(x_t)| \le \|u\|_w[1 + b/(1-\beta)]w(x),$$
so that the expectations $E_x^\pi c(x_t,a_t)$ and $E_x^\pi u(x_t)$ are finite.

Proof. As in (8.3.31),
$$E_x^\pi[w(x_t)\mid h_{t-1},a_{t-1}] = \int w(y)\,Q(dy|x_{t-1},a_{t-1}) \le \beta w(x_{t-1}) + b \quad \text{[by (10.4.1)]}.$$
Hence, taking the expectation $E_x^\pi(\cdot)$,
$$E_x^\pi w(x_t) \le \beta E_x^\pi w(x_{t-1}) + b,$$
which iterated gives the first inequality in (10.4.2). The second inequality
in (10.4.2) is obvious (recall that $w\ge 1$).

To obtain (10.4.3) it suffices to note that, by Assumption 10.2.1(d),
$$|c(x,a)| \le cw(x) \quad \forall x\in X,\ a\in A(x), \qquad (10.4.4)$$
because then (10.4.3) follows from (10.4.2). Finally, as $|u(x_t)|\le\|u\|_w w(x_t)$,
the proof of the lemma can be completed in the obvious manner. $\Box$
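The iteration in the proof can be watched numerically: if $Qw \le \beta w + b$ componentwise for a finite kernel, then $Q^t w \le \beta^t w + b(1-\beta^t)/(1-\beta)$, which is (10.4.2) with $E_x^\pi w(x_t) = (Q^t w)(x)$ for the corresponding stationary chain. A sketch with made-up data (the kernel, weight, and constants are ours):

```python
import numpy as np

# Sanity check of (10.4.2) for a made-up kernel satisfying Qw <= beta*w + b.
Q = np.array([[0.8, 0.2, 0.0],
              [0.3, 0.5, 0.2],
              [0.1, 0.4, 0.5]])
w = np.array([1.0, 2.0, 4.0])
beta = 0.9
b = np.max(Q @ w - beta * w)               # drift constant for this kernel

Qt = np.eye(3)
for t in range(1, 50):
    Qt = Qt @ Q                            # Q^t; (Q^t w)(x) = E_x w(x_t)
    bound = beta**t * w + b * (1 - beta**t) / (1 - beta)
    assert np.all(Qt @ w <= bound + 1e-9)  # first inequality in (10.4.2)
```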


Let us now consider the $\alpha$-discounted cost ($0<\alpha<1$) and the $\alpha$-discount
value function in (8.3.1) and (8.3.2), i.e.,
$$V_\alpha(\pi,x) := E_x^\pi\Big[\sum_{t=0}^\infty \alpha^t c(x_t,a_t)\Big] \qquad (10.4.5)$$
and
$$V_\alpha^*(x) := \inf_\Pi V_\alpha(\pi,x). \qquad (10.4.6)$$
From (10.4.3) it is evident that
$$|V_\alpha(\pi,x)| \le \widehat{b}w(x)/(1-\alpha) \quad \forall \pi,x,$$
and
$$|V_\alpha^*(x)| \le \widehat{b}w(x)/(1-\alpha), \quad \text{with } \widehat{b} := c[1 + b/(1-\beta)]. \qquad (10.4.7)$$
Thus, for each fixed $0<\alpha<1$, both functions $V_\alpha(\pi,\cdot)$ and $V_\alpha^*(\cdot)$ belong to
$\mathbb{B}_w(X)$. On the other hand, note that the inequality (10.4.1) is of the same
form as (8.3.11). Therefore, in view of Remark 8.3.5(a), all the results of
Chapter 8 are valid in our present context. This means, in particular, that,
by Theorem 8.3.6(b), we may rewrite $V_\alpha^*$ in (10.4.6) as an infimum over
the class of deterministic stationary policies, i.e.,
$$V_\alpha^*(x) = \inf_{\Pi_{DS}} V_\alpha(f^\infty,x). \qquad (10.4.8)$$

Now fix an arbitrary state $z$ in $X$, and for every $0<\alpha<1$ consider the
function
$$u_\alpha(x) := V_\alpha^*(x) - V_\alpha^*(z). \qquad (10.4.9)$$
We will next show that $u_\alpha$ belongs to the space $\mathbb{B}_w(X)$ for all $0<\alpha<1$.

10.4.2 Lemma. Let $z\in X$ be the (fixed) state in (10.4.9). Then for every
$f^\infty$ in $\Pi_{DS}$, $x\in X$, and $t = 0,1,\dots$
$$|E_x^{f^\infty}c_f(x_t) - E_z^{f^\infty}c_f(x_t)| \le cR\rho^t[1 + w(z)]w(x), \qquad (10.4.10)$$
with $R$ and $\rho$ as in (10.2.9). Hence
$$|V_\alpha(f^\infty,x) - V_\alpha(f^\infty,z)| \le cR(1-\rho)^{-1}[1 + w(z)]w(x) \qquad (10.4.11)$$
and
$$|u_\alpha(x)| \le cR(1-\rho)^{-1}[1 + w(z)]w(x) \qquad (10.4.12)$$
for all $0<\alpha<1$ and $x\in X$.


Proof. Inside the absolute value on the left-hand side of (10.4.10) add and
subtract $\mu_f(c_f) := \int c_f\,d\mu_f$, where $\mu_f$ is the i.p.m. of $Q_f$. Then, by (10.2.9),
the left-hand side of (10.4.10) turns out to be less than or equal to
$$\Big|\int c_f(y)\,Q_f^t(dy|x) - \mu_f(c_f)\Big| + \Big|\int c_f(y)\,Q_f^t(dy|z) - \mu_f(c_f)\Big| \le cR\rho^t[w(x) + w(z)],$$
and (10.4.10) follows because $w(x)\ge 1$.

To obtain (10.4.11) note that (10.4.5) and (10.4.10) yield
$$|V_\alpha(f^\infty,x) - V_\alpha(f^\infty,z)| \le \sum_{t=0}^\infty \alpha^t|E_x^{f^\infty}c_f(x_t) - E_z^{f^\infty}c_f(x_t)| \le cR[1+w(z)]w(x)\sum_{t=0}^\infty \alpha^t\rho^t.$$
This implies (10.4.11) because, as $0<\alpha<1$,
$$\sum_{t=0}^\infty \alpha^t\rho^t \le \sum_{t=0}^\infty \rho^t = 1/(1-\rho).$$
Finally, to get (10.4.12) observe that (10.4.9) and (10.4.8) give
$$|u_\alpha(x)| \le \sup_{\Pi_{DS}}|V_\alpha(f^\infty,x) - V_\alpha(f^\infty,z)|,$$
so (10.4.12) follows from (10.4.11). $\Box$

The following Lemma 10.4.3 determines a candidate for the number $\rho^*$
in Theorem 10.3.1. We should also note that the inequality (10.4.16) is not
needed to prove Theorem 10.3.1; we introduce (10.4.16) simply to complement
Remark 10.3.2.

In the proofs of (10.4.15) and (10.4.16) we use the Abelian theorem in
Lemma 5.3.1, which states the following: if $\{b_t,\ t = 0,1,\dots\}$ is a sequence
of nonnegative numbers, then
$$\liminf_{n\to\infty}\frac{1}{n}\sum_{t=0}^{n-1}b_t \;\le\; \limsup_{\alpha\uparrow 1}(1-\alpha)\sum_{t=0}^\infty \alpha^t b_t \;\le\; \limsup_{n\to\infty}\frac{1}{n}\sum_{t=0}^{n-1}b_t. \qquad (10.4.13)$$
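The sandwich in (10.4.13) is easy to probe with a concrete sequence (the choice $b_t = 2 + \cos t$ is ours, purely for illustration): the Abel means $(1-\alpha)\sum_t \alpha^t b_t$ stay between the extreme Cesàro averages and converge to the same limit as the Cesàro means, here $2$.

```python
import numpy as np

# Numerical illustration of the Abelian relations (10.4.13) with the
# made-up nonnegative sequence b_t = 2 + cos t, whose Cesaro mean -> 2.
t = np.arange(200000)
b = 2.0 + np.cos(t)

cesaro = np.cumsum(b) / (t + 1)                      # (1/n) sum_{t<n} b_t
abel = [(1 - a) * np.sum(a**t * b) for a in (0.9, 0.99, 0.999)]

# Abel means lie between the extreme Cesaro averages, and both kinds of
# mean approach the same limit (here 2) as alpha -> 1 and n -> infinity.
assert cesaro.min() - 1e-9 <= max(abel) and min(abel) <= cesaro.max() + 1e-9
assert abs(cesaro[-1] - 2.0) < 1e-3 and abs(abel[-1] - 2.0) < 1e-2
```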

10.4.3 Lemma. There exists a number $\rho^*$ such that
$$\limsup_{\alpha\uparrow 1}(1-\alpha)V_\alpha^*(x) = \rho^* \quad \forall x\in X, \qquad (10.4.14)$$
and
$$\rho^* \le \inf_{\Pi_{DS}} J(f^\infty,x) = \inf_{\mathbb{F}} J(f) \quad \forall x. \qquad (10.4.15)$$
If, in addition, $c(x,a)$ is nonnegative, then
$$\rho^* \le J^*(x) := \inf_\Pi J(\pi,x) \quad \forall x\in X. \qquad (10.4.16)$$

Proof. Let $z\in X$ be the (fixed) state in (10.4.9), and for every $0<\alpha<1$
define
$$\rho(\alpha) := (1-\alpha)V_\alpha^*(z). \qquad (10.4.17)$$
By (10.4.7), $\rho(\alpha)$ is bounded, since [with $\widehat{b}$ as in (10.4.7)]
$$|\rho(\alpha)| \le \widehat{b}w(z) \quad \forall\, 0<\alpha<1.$$


Therefore, there is a number $\rho^*$ such that
$$\limsup_{\alpha\uparrow 1}\rho(\alpha) = \rho^*. \qquad (10.4.18)$$
To prove that $\rho^*$ satisfies (10.4.14), observe that (10.4.9) and (10.4.17) yield
$$|(1-\alpha)V_\alpha^*(x) - \rho^*| \le (1-\alpha)|u_\alpha(x)| + |\rho(\alpha) - \rho^*| \quad \forall x\in X.$$
Thus, by (10.4.18) and (10.4.12), letting $\alpha\uparrow 1$ we obtain (10.4.14).

We will now prove (10.4.16), which assumes $c(x,a)\ge 0$. Choose an arbitrary
policy $\pi$ and an arbitrary initial state $x$, and in (10.4.13) write
$E_x^\pi c(x_t,a_t)$ in lieu of $b_t$. Then the third inequality of (10.4.13) gives [by
(10.4.5) and (10.1.11)]
$$\limsup_{\alpha\uparrow 1}(1-\alpha)V_\alpha(\pi,x) \le J(\pi,x), \qquad (10.4.19)$$
which in turn, by (10.4.6), yields
$$\limsup_{\alpha\uparrow 1}(1-\alpha)V_\alpha^*(x) \le J(\pi,x).$$
Hence, as $\pi$ and $x$ were arbitrary, the latter inequality and (10.4.14) give
(10.4.16).
To complete the proof of the lemma, let us consider (10.4.15). We cannot
proceed as in (10.4.19) because now $E_x^\pi c(x_t,a_t)$ may take negative values
and so (10.4.13) is not directly applicable. Hence, we will use (10.4.4) to
replace $E_x^\pi c(x_t,a_t)$ by the nonnegative terms
$$E_x^\pi[c(x_t,a_t) + cw(x_t)];$$
in other words, we will write $V_\alpha$ as
$$V_\alpha(\pi,x) = \sum_{t=0}^\infty \alpha^t E_x^\pi[c(x_t,a_t) + cw(x_t)] - c\sum_{t=0}^\infty \alpha^t E_x^\pi w(x_t).$$
Moreover, let
$$W^s(\pi,x) := \limsup_{n\to\infty}\frac{1}{n}\sum_{t=0}^{n-1}E_x^\pi w(x_t), \quad W^i(\pi,x) := \liminf_{n\to\infty}\frac{1}{n}\sum_{t=0}^{n-1}E_x^\pi w(x_t).$$
Then (10.4.13) gives
$$\limsup_{\alpha\uparrow 1}(1-\alpha)V_\alpha(\pi,x) \le J(\pi,x) + c[W^s(\pi,x) - W^i(\pi,x)],$$
so that, as $V_\alpha^*(\cdot)\le V_\alpha(\pi,\cdot)$ [by (10.4.6)], we can use (10.4.14) again to
obtain
$$\rho^* \le J(\pi,x) + c[W^s(\pi,x) - W^i(\pi,x)]. \qquad (10.4.20)$$
Finally, note that if $\pi$ is deterministic stationary, say $\pi = f^\infty$, then
(10.4.20) reduces to
$$\rho^* \le J(f^\infty,x) = J(f) \qquad (10.4.21)$$
because, by (10.2.9) and Proposition 10.2.3(b),
$$W^s(f^\infty,\cdot) = W^i(f^\infty,\cdot) = \mu_f(w) \quad \text{and} \quad J(f^\infty,\cdot) = J(f).$$
As (10.4.21) holds for all $f^\infty$ in $\Pi_{DS}$, (10.4.15) follows. $\Box$

We are now ready to complete the proof of Theorem 10.3.1.
B. Completion of the proof

Consider the $\alpha$-discounted cost optimality equation in (8.3.4), i.e.,
$$V_\alpha^*(x) = \min_{A(x)}\Big[c(x,a) + \alpha\int_X V_\alpha^*(y)\,Q(dy|x,a)\Big], \quad x\in X. \qquad (10.4.22)$$
With $\rho(\alpha)$ and $u_\alpha(x)$ as in (10.4.17) and (10.4.9), respectively, we can
rewrite (10.4.22) as
$$\rho(\alpha) + u_\alpha(x) = \min_{A(x)}\Big[c(x,a) + \alpha\int_X u_\alpha(y)\,Q(dy|x,a)\Big], \quad x\in X. \qquad (10.4.23)$$
142 10. Un discounted Cost Criteria

On the other hand, by (10.4.18), there is a sequence of "discount factors"


a(n) t 1 such that
p* = lim p[a(n)]. (10.4.24)
n-HXl

Define
ho(x) := lim inf u",(n) (x), x E X. (10.4.25)
n-too

Observe that, by (10.4.12), ho is a function in lB\w(X) and, moreover, the


sequence {u",(n)} is bounded in lB\w(X) because

Thus, if in (10.4.23) we replace a by a(n) and take liminfn, then (10.4.24),


(10.4.25), and Fatou's Lemma 8.3.7(b) [more precisely, (8.3.18)] yield

p* + ho(x) 2 Tt~) [c(x, a) + f ho(y)Q(dYlx, a)] \Ix E X,

which is precisely the ACO! (10.3.9).


Also note that the existence of a decision function (or selector) f₀ ∈ F that satisfies (10.3.10) is ensured by Proposition 8.3.9(b).

Furthermore, iterating (10.3.10) we get

nρ* + h₀(x) ≥ J_n(f₀^∞, x) + ∫_X h₀(y) Q_{f₀}^n(dy|x) ∀n = 1, 2, …. (10.4.26)

Thus, multiplying by 1/n and letting n → ∞, we obtain [from (10.2.13) and Proposition 10.2.3(b)]

ρ* ≥ J(f₀^∞, x) = J(f₀) ∀x ∈ X. (10.4.27)

This implies (10.3.11) because, by (10.4.15),

ρ* ≤ inf_F J(f).

It is also clear that any decision function f₀ that satisfies (10.3.10) also satisfies (10.4.27) and, therefore, (10.3.11). This completes the proof of Theorem 10.3.1. □
10.4.4 Remark. If c(x, a) is nonnegative, then (10.4.27) and (10.4.16) give

ρ* ≤ J*(x) ≤ J(f₀) ≤ ρ* ∀x ∈ X,

and (10.3.12) follows. □
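For intuition, the vanishing-discount limit behind (10.4.24)-(10.4.25) can be observed numerically on a small Markov control model: (1 − α)V_α*(x) approaches the optimal average cost, and u_α = V_α* − V_α*(0) remains bounded, as α ↑ 1. The Python sketch below uses an illustrative two-state, two-action model (all numbers are our own, not from the book):

```python
import numpy as np

# Illustrative 2-state, 2-action model (our numbers): P[a][x, y] = Q(y|x, a).
P = [np.array([[0.9, 0.1], [0.4, 0.6]]),
     np.array([[0.2, 0.8], [0.7, 0.3]])]
c = np.array([[0.0, 2.0], [3.0, 1.0]])      # c[x, a] = one-stage cost

def discounted_value(alpha, iters=20000):
    # Value iteration for the alpha-discounted optimality equation (10.4.22).
    V = np.zeros(2)
    for _ in range(iters):
        V = np.array([min(c[x, a] + alpha * P[a][x] @ V for a in range(2))
                      for x in range(2)])
    return V

for alpha in (0.9, 0.99, 0.999):
    V = discounted_value(alpha)
    print(alpha, ((1 - alpha) * V).round(4),   # tends to rho* componentwise
          (V - V[0]).round(4))                 # u_alpha of (10.4.9): bounded
```

For this model the printed values of (1 − α)V_α* cluster near the optimal average cost as α approaches 1, while V_α* itself diverges like ρ*/(1 − α).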



10.5 Proof of Theorem 10.3.6

We will use the following fact; for a proof see, for instance, Orey [1, Theorem 7.2] or Meyn and Tweedie [1, Proposition 10.1.2].

10.5.1 Remark. Under Assumption 10.3.5, the irreducibility measure λ is absolutely continuous with respect to the i.p.m. μ_f for each f ∈ F (in symbols: λ ≪ μ_f ∀f ∈ F); that is, if a set B ∈ B(X) is such that μ_f(B) = 0, then λ(B) = 0. □
A. Proof of part (a)

We wish to show that there exists a canonical triplet or, equivalently (by Theorem 10.3.4), a solution (ρ*, h*, f*) to the ACOE (10.3.13), (10.3.14), with h* in B_w(X).

It will be convenient to use the dynamic programming operator T in (9.5.1), with "min" instead of "inf", to write (10.3.13) in the form

ρ* + h*(x) = Th*(x), x ∈ X. (10.5.1)

Moreover, to simplify the notation, given a sequence {f_n} in F we shall write

c_{f_n}, h_{f_n}, Q_{f_n}, μ_{f_n}, … as c_n, h_n, Q_n, μ_n, …, (10.5.2)

respectively, where h_n ∈ B_w(X) is the solution to the Poisson equation (10.2.21), (10.2.22) for f_n.
Now, to begin the proof itself, let ρ* ∈ ℝ, h₀ ∈ B_w(X), and f₀ ∈ F be as in Theorem 10.3.1. In particular, as in (10.3.11) we have

J(f₀) = ρ* = inf_F J(f) (10.5.3)

and so we can write the Poisson equation (10.2.21) for f₀ as

ρ* + h₀(x) = c₀(x) + ∫_X h₀(y) Q₀(dy|x), x ∈ X. (10.5.4)

From this equation and Proposition 8.3.9(b), there is a decision function f₁ ∈ F such that, for all x ∈ X,

ρ* + h₀(x) ≥ min_{a∈A(x)} [ c(x, a) + ∫_X h₀(y) Q(dy|x, a) ]
= Th₀(x)
= c_{f₁}(x) + ∫_X h₀(y) Q_{f₁}(dy|x);

that is, using the notation (10.5.2),

ρ* + h₀(x) ≥ c₁(x) + ∫_X h₀(y) Q₁(dy|x) ∀x ∈ X. (10.5.5)



Furthermore, by the last statement in Theorem 10.3.1, f₁ also satisfies (10.5.3), i.e., J(f₁) = ρ*. This implies in particular that the Poisson equation for f₁ is again of the form (10.5.4), namely,

ρ* + h₁(x) = c₁(x) + ∫_X h₁(y) Q₁(dy|x). (10.5.6)

On the other hand, from (10.5.6) and (10.5.5),

h₀(x) − h₁(x) ≥ ∫_X [h₀(y) − h₁(y)] Q₁(dy|x) ∀x ∈ X;

in other words, the function u := h₀ − h₁ ∈ B_w(X) is subinvariant (or subharmonic) for the stochastic kernel Q₁(·|x). Hence, by Lemma 7.5.12(a), it follows that u = h₀ − h₁ equals μ₁-a.e. the constant

∫_X [h₀(y) − h₁(y)] μ₁(dy) = inf_x [h₀(x) − h₁(x)] =: Δ₁.

More precisely, there is a Borel set N₁ ∈ B(X) such that μ₁(N₁) = 1 and

h₀(·) = h₁(·) + Δ₁ on N₁, and h₀(·) > h₁(·) + Δ₁ on N₁^c,

where N₁^c := X − N₁ denotes the complement of N₁.
Repeating this procedure we obtain sequences {f_n} in F, {h_n} in B_w(X), and {N_n} in B(X) for which the following holds: for every x ∈ X and n = 0, 1, … [and using the notation (10.5.2)]:

(i) J(f_n) = ρ*;

(ii) (ρ*, h_n) satisfies the Poisson equation

ρ* + h_n(x) = c_n(x) + ∫_X h_n(y) Q_n(dy|x); (10.5.7)

(iii) Th_n(x) = c_{n+1}(x) + ∫_X h_n(y) Q_{n+1}(dy|x); (10.5.8)

(iv) μ_{n+1}(N_{n+1}) = 1 and, with Δ_{n+1} := ∫ (h_n − h_{n+1}) dμ_{n+1} = inf_x [h_n(x) − h_{n+1}(x)],

h_n(·) = h_{n+1}(·) + Δ_{n+1} on N_{n+1}, and h_n(·) > h_{n+1}(·) + Δ_{n+1} on N_{n+1}^c. (10.5.9)

In addition, we claim that the set

N* := ∩_{n=1}^∞ N_n (10.5.10)

is nonempty. Indeed, if N* were empty, then from Assumption 10.3.5 and Remark 10.5.1 we would have

λ(X) = λ(∪_{n≥1} N_n^c) ≤ Σ_{n=1}^∞ λ(N_n^c) = 0, as μ_n(N_n^c) = 0 ∀n ≥ 1;

that is, λ(X) = 0, which contradicts that λ is a nontrivial measure.
Now choose x* in N*. Then, by (10.5.9),

h_n(x*) = h_{n+1}(x*) + Δ_{n+1} ∀n = 0, 1, …,

which implies that the functions

h_n′(·) := h_n(·) − h_n(x*)

form a nonincreasing sequence, i.e.,

h_n′ ≥ h_{n+1}′ ∀n = 0, 1, …. (10.5.11)

Moreover [by (10.2.20)], the sequence {h_n} is uniformly bounded in B_w(X) and, therefore, so is {h_n′}. Hence, there is a function h* in B_w(X) such that

h*(x) = lim_{n→∞} h_n′(x) ∀x ∈ X. (10.5.12)

Finally, to obtain (10.5.1), first note that the Poisson equation (10.5.7) remains valid if we replace h_n by h_n′, i.e.,

ρ* + h_n′(x) = c_n(x) + ∫_X h_n′(y) Q_n(dy|x). (10.5.13)

This in turn gives ρ* + h_n′ ≥ Th_n′, i.e.,

ρ* + h_n′(x) ≥ min_{a∈A(x)} [ c(x, a) + ∫_X h_n′(y) Q(dy|x, a) ] ∀x ∈ X.

Therefore, letting n → ∞, (10.5.12) and the Fatou Lemma (8.3.8) yield

ρ* + h*(x) ≥ Th*(x) ∀x ∈ X. (10.5.14)

To get the reverse inequality, i.e.,

ρ* + h*(x) ≤ Th*(x) ∀x ∈ X, (10.5.15)

write (10.5.8) with h_{n−1}′ and n − 1 in lieu of h_n and n, respectively, to obtain

c_n(x) = Th_{n−1}′(x) − ∫_X h_{n−1}′(y) Q_n(dy|x).



Thus, from (10.5.13),

ρ* + h_n′(x) = Th_{n−1}′(x) − ∫_X [h_{n−1}′(y) − h_n′(y)] Q_n(dy|x)
≤ Th_{n−1}′(x) [by (10.5.11)]
≤ c(x, a) + ∫_X h_{n−1}′(y) Q(dy|x, a) [by (9.5.1)]

for all x ∈ X and a ∈ A(x). It follows [from (10.5.12) and the Fatou Lemma (8.3.19)] that

ρ* + h*(x) ≤ c(x, a) + ∫_X h*(y) Q(dy|x, a) ∀x ∈ X, a ∈ A(x),

which implies (10.5.15).

This completes the proof of part (a) because (10.5.14)-(10.5.15) give the ACOE (10.5.1) and, as usual, the existence of a decision function f* ∈ F that satisfies (10.3.14) is obtained from Proposition 8.3.9(b).
B. Proof of part (b)

Let f^∞ ∈ Π_DS be an AC-optimal policy, i.e., J(f) = ρ*. Then, as in (10.5.4), the Poisson equation for f is

ρ* + h_f(x) = c_f(x) + ∫_X h_f(y) Q_f(dy|x) ∀x.

On the other hand, from the ACOE (10.5.1),

ρ* + h*(x) ≤ c_f(x) + ∫_X h*(y) Q_f(dy|x) ∀x.

Hence

h_f(x) − h*(x) ≥ ∫_X [h_f(y) − h*(y)] Q_f(dy|x) ∀x,

which means that the function h_f − h* ∈ B_w(X) is subinvariant for the stochastic kernel Q_f. This implies [by Lemma 7.5.12(a)] that

h_f(x) = h*(x) + k for μ_f-almost all x ∈ X, (10.5.16)

with k := ∫ (h_f − h*) dμ_f = −∫ h* dμ_f, by (10.2.22). Therefore, replacing h* with h_f − k in (10.5.1) we obtain (10.3.23), and part (b) follows.

This completes the proof of Theorem 10.3.6. □
C. Policy iteration

The approach used to prove Theorem 10.3.6(a) is a special case of the policy iteration algorithm (PIA) for the AC problem, which in general can be described as follows. (See §9.6.D for the PIA associated with the expected total cost problem.)

Step 0. Initialization: Set n = 0 and choose an arbitrary decision function f_n in F.

Step 1. Policy evaluation: Find J(f_n) ∈ ℝ and h_n ∈ B_w(X) that satisfy the Poisson equation for f_n; that is, using the notation (10.5.2),

J(f_n) + h_n(x) = c_n(x) + ∫_X h_n(y) Q_n(dy|x). (10.5.17)

Step 2. Policy improvement: Determine a decision function f_{n+1} ∈ F such that

Th_n(x) = c_{n+1}(x) + ∫_X h_n(y) Q_{n+1}(dy|x). (10.5.18)

Comparing (10.5.17)-(10.5.18) with (10.5.7)-(10.5.8), we see that the proof of Theorem 10.3.6(a) consisted precisely of the PIA when the initial decision function f₀ satisfies (10.3.10)-(10.3.11), which gave us J(f_n) = ρ* for all n, as well as (10.5.11).
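For a finite state and action space, the three steps above take a simple matrix form. The sketch below is a minimal illustration only; the model data, the normalization h_n(0) = 0 used to pin down the Poisson equation, and the stopping rule are our own choices, not the book's:

```python
import numpy as np

def poisson_solve(P, c):
    # Finite-state Poisson equation (cf. (10.5.17)):
    # find (g, h) with g + h(x) = c(x) + sum_y P[x, y] h(y), h[0] = 0.
    n = P.shape[0]
    A = np.zeros((n + 1, n + 1))
    A[:n, 0] = 1.0                      # coefficient of the unknown g
    A[:n, 1:] = np.eye(n) - P           # (I - P) h
    A[n, 1] = 1.0                       # normalization h[0] = 0
    sol = np.linalg.lstsq(A, np.append(c, 0.0), rcond=None)[0]
    return sol[0], sol[1:]

def policy_iteration(P, c):
    # P[a][x, y] = Q(y | x, a); c[x, a] = one-stage cost.
    n_states, n_actions = c.shape
    f = np.zeros(n_states, dtype=int)               # Step 0: arbitrary f_0
    while True:
        Pf = np.array([P[f[x]][x] for x in range(n_states)])
        cf = c[np.arange(n_states), f]
        g, h = poisson_solve(Pf, cf)                # Step 1: evaluation
        q = np.array([[c[x, a] + P[a][x] @ h for a in range(n_actions)]
                      for x in range(n_states)])
        if np.all(q[np.arange(n_states), f] <= q.min(axis=1) + 1e-10):
            return g, h, f                          # f already attains Th_n
        f = q.argmin(axis=1)                        # improvement step

# Illustrative 2-state, 2-action model (not from the book):
P = [np.array([[0.9, 0.1], [0.4, 0.6]]),
     np.array([[0.2, 0.8], [0.7, 0.3]])]
c = np.array([[0.0, 2.0], [3.0, 1.0]])
g, h, f = policy_iteration(P, c)
print(g, h, f)
```

The algorithm stops as soon as the current policy attains the minimum in Th_n, at which point (g, h) satisfies the finite-state ACOE.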
In general, the objective of the PIA is to find a solution (ρ*, h*) to the ACOE (10.3.13). The idea is that combining (10.5.17) and (10.5.18) we obtain

J(f_n) + h_n(x) ≥ Th_n(x) = c_{n+1}(x) + ∫_X h_n(y) Q_{n+1}(dy|x), (10.5.19)

so that integration with respect to the i.p.m. μ_{n+1} yields [by (10.2.18) or Proposition 10.2.3(b)]

J(f_{n+1}) ≤ J(f_n); (10.5.20)

that is, the sequence of average costs J(f_n) is nonincreasing. Moreover, it is obviously bounded since [by Assumption 10.2.1(d), (10.2.12), and (10.2.18)]

|J(f)| ≤ ∫ |c_f| dμ_f ≤ c̄b/(1 − β) ∀f ∈ F. (10.5.21)

Therefore, there exists a constant ρ such that

J(f_n) ↓ ρ as n → ∞. (10.5.22)

Of course, we necessarily have ρ ≥ ρ* because ρ* is the AC value function [see (10.3.15)].

Thus, the PIA's objective is to show that ρ = ρ* and that {h_n}, or a subsequence thereof, or even a modified sequence [such as h_n′ in (10.5.11)], converges to a function that satisfies the ACOE. It seems to be an open problem to determine whether the latter fact is possible under the hypotheses of Theorem 10.3.6. However, as stated in the following result, under some extra condition the PIA does converge. We shall omit the

proof of Theorem 10.5.2 because it is very similar to the proof of Theorem 10.3.6(a); the reader may refer to Hernández-Lerma and Lasserre [11] for details.
10.5.2 Theorem. (Convergence of the PIA.) Suppose that the hypotheses of Theorem 10.3.6(a) are satisfied and that, in addition, the sequence {h_n} ⊂ B_w(X) in (10.5.17) has a convergent subsequence {h_m}, i.e.,

lim_{m→∞} h_m(x) = h(x) ∀x ∈ X, (10.5.23)

for some function h on X. Then the PIA converges; in fact, the pair

(ρ*, h*) := (ρ, h) [with ρ as in (10.5.22)]

satisfies the ACOE (10.3.13).

10.5.3 Remark. (a) By (10.2.20), the function h in (10.5.23) is necessarily in B_w(X). Moreover, (10.2.20) gives that the sequence {h_n} in (10.5.17) is locally bounded. Hence, for instance, if it can be shown that {h_n} is equicontinuous, then the existence of a subsequence {h_m} that satisfies (10.5.23) is ensured by the well-known Ascoli Theorem. (See, for instance, Remark 5.5.2 or Royden [1] for the statement of the Ascoli Theorem.) Remark 5.5.3 and Assumption 2.7 in Gordienko and Hernández-Lerma [2] give conditions for {h_n} to be equicontinuous.

On the other hand, if X is a denumerable set (with the discrete topology), then any collection of functions on X, in particular {h_n}, is equicontinuous. Thus, in the denumerable-state case, Theorem 10.5.2 gives that the PIA converges.

Finally, it is important to keep in mind that in many cases the sequence {h_n} [or a modified sequence, such as h_n′ in (10.5.11)] can be shown to be monotone or "nearly monotone", so that {h_n} itself satisfies (10.5.23); see, for instance, Meyn [1] or Puterman [1, Propos. 8.6.5].
(b) The difference between the left-hand side and the right-hand side of (10.5.19), i.e.,

D_n(x) := J(f_n) + h_n(x) − c_{n+1}(x) − ∫_X h_n(y) Q_{n+1}(dy|x),

is called the PIA's discrepancy function at the nth iteration. Similarly, from (10.5.20) we get the cost decrease C_n := J(f_n) − J(f_{n+1}), which can also be written as

C_n = ∫_X D_n(x) μ_{n+1}(dx), n = 0, 1, ….

If we now define ĥ_n := h_n − h_{n+1}, we obtain

C_n + ĥ_n(x) = D_n(x) + ∫_X ĥ_n(y) Q_{n+1}(dy|x), (10.5.24)

which means that the pair (C_n, ĥ_n) is a solution to the Poisson equation for the transition kernel Q_{n+1}(·|x) = Q(·|x, f_{n+1}) with "cost" (or charge) function D_n. This fact can be used to prove the convergence of the PIA, at least when the state space X is a finite set; see, for instance, Puterman [1, §8.6]. Alternatively, one could try to show that D_n(x) → 0 for all x ∈ X, as n → ∞.
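The identities C_n = ∫ D_n dμ_{n+1} and (10.5.24) are easy to check numerically on a finite chain. In the sketch below the two policies and their costs are illustrative numbers of our own; the Poisson equations are solved with the normalization h(0) = 0, which affects neither identity:

```python
import numpy as np

# Illustrative data: transition rows and costs of f_n and f_{n+1}.
P_n   = np.array([[0.9, 0.1], [0.4, 0.6]])   # Q_n
c_n   = np.array([0.0, 3.0])
P_np1 = np.array([[0.9, 0.1], [0.7, 0.3]])   # Q_{n+1}
c_np1 = np.array([0.0, 1.0])

def stationary(P):
    # Invariant probability measure of an irreducible finite chain.
    vals, vecs = np.linalg.eig(P.T)
    mu = np.real(vecs[:, np.argmax(np.real(vals))])
    return mu / mu.sum()

def poisson(P, c):
    # Solve g + h = c + P h with h[0] = 0.
    n = len(c)
    A = np.zeros((n + 1, n + 1))
    A[:n, 0], A[:n, 1:] = 1.0, np.eye(n) - P
    A[n, 1] = 1.0
    z = np.linalg.lstsq(A, np.append(c, 0.0), rcond=None)[0]
    return z[0], z[1:]

g_n, h_n = poisson(P_n, c_n)
g_np1, h_np1 = poisson(P_np1, c_np1)

D = g_n + h_n - (c_np1 + P_np1 @ h_n)        # discrepancy function D_n
C = g_n - g_np1                               # cost decrease C_n
mu = stationary(P_np1)
print(np.isclose(C, mu @ D))                  # C_n = integral of D_n d mu_{n+1}
h_hat = h_n - h_np1
print(np.allclose(C + h_hat, D + P_np1 @ h_hat))   # identity (10.5.24)
```

Both checks print True: (C_n, ĥ_n) solves the Poisson equation for Q_{n+1} with charge D_n, exactly as stated above.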

10.6 Proof of Theorem 10.3.7


Let h₁* and h₂* be two functions in B_w(X) such that (ρ*, h₁*) and (ρ*, h₂*) both satisfy the ACOE (10.3.13); that is, for each state x ∈ X,

ρ* + h₁*(x) = min_{a∈A(x)} [ c(x, a) + ∫_X h₁*(y) Q(dy|x, a) ], (10.6.1)

and

ρ* + h₂*(x) = min_{a∈A(x)} [ c(x, a) + ∫_X h₂*(y) Q(dy|x, a) ]. (10.6.2)

Let f₁ ∈ F be a decision function such that f₁(x) ∈ A(x) attains the minimum in (10.6.1) for each x ∈ X. Then, using the notation (10.5.2),

ρ* + h₁*(x) = c₁(x) + ∫_X h₁*(y) Q₁(dy|x),

whereas using f₁ in (10.6.2) we get

ρ* + h₂*(x) ≤ c₁(x) + ∫_X h₂*(y) Q₁(dy|x).

Hence

h₁*(x) − h₂*(x) ≥ ∫_X [h₁*(y) − h₂*(y)] Q₁(dy|x) ∀x,

so that [as in the argument used after (10.5.6) or to obtain (10.5.16)] Lemma 7.5.12(a) yields the existence of a set N₁ ∈ B(X) and a constant k₁ such that μ₁(N₁) = 1 and

h₁*(x) ≥ h₂*(x) + k₁ ∀x ∈ X, with equality on N₁. (10.6.3)

We now repeat the above argument, but interchanging the roles of (10.6.1) and (10.6.2) and using part (b) of Lemma 7.5.12 instead of part (a). That is, we take a decision function f₂ ∈ F that attains the minimum in (10.6.2), and we get a set N₂ ∈ B(X) and a constant k₂ such that μ₂(N₂) = 1 and

h₁*(x) ≤ h₂*(x) + k₂ ∀x ∈ X, with equality on N₂. (10.6.4)



Then, as in the proof of Theorem 10.3.6(a) [see (10.5.10)], we can use Assumption 10.3.5 and Remark 10.5.1 to show that the set

N := N₁ ∩ N₂

is nonempty; otherwise we would get λ(X) = λ(N^c) = 0, a contradiction.

Now let x̄ be a point in N and define the functions

h₁(·) := h₁*(·) − h₁*(x̄), and h₂(·) := h₂*(·) − h₂*(x̄).

Thus [as in (10.5.11)], (10.6.3) and (10.6.4) yield

h₁(·) ≥ h₂(·) and h₁(·) ≤ h₂(·),

respectively. Hence h₁(·) = h₂(·), which gives (10.3.24) with k := h₁*(x̄) − h₂*(x̄). □

10.7 Proof of Theorem 10.3.10


We will present two proofs of part (a). The first one is a direct application of Theorem 10.3.6(a), the reasoning being that the first equality in (10.3.31) can be written as

ĥ(x) = h*(x) + inf_{f∈F_ca} ∫_X (−h*) dμ_f, (10.7.1)

and, therefore, the problem of finding a bias-optimal policy reduces to an AC problem with cost-per-stage function

c′(x, a) := −h*(x). (10.7.2)

Moreover, since the minimization in (10.7.1) is over the class F_ca of canonical decision functions, it suffices to consider "canonical control actions" a ∈ A*(x), where A*(x) ⊂ A(x) is the set in (10.3.35) [see also (10.3.36)]. The second proof of part (a) is given in Remark 10.7.1.

Proof of (a). Consider the Markov control model

M_bias := (X, A, {A*(x) | x ∈ X}, Q, c′), (10.7.3)

which is the same as the original control model M = (X, A, {A(x) | x ∈ X}, Q, c) except that A(x) and c(x, a) have been replaced by A*(x) [in (10.3.35)] and c′(x, a) [in (10.7.2)], respectively. Hence, as M satisfies the hypotheses of Theorem 10.3.6(a), it is clear that so does M_bias and, consequently, there is a canonical policy f̂ ∈ F_ca [see (10.3.36)] for the new model M_bias, i.e.,

∫_X (−h*) dμ_{f̂} = inf_{f∈F_ca} ∫_X (−h*) dμ_f =: ρ̂. (10.7.4)

This fact and (10.7.1) yield that f̂ is bias-optimal.

Proof of (b). The bias-optimal decision function f̂ in (a) is canonical, and so it satisfies (10.3.39) [or (10.3.38)] and (10.3.40); that is, (ρ*, ĥ, f̂) is a canonical triplet (see Theorem 10.3.4). Moreover, as was already mentioned in the proof of (a), M_bias satisfies the hypotheses of Theorem 10.3.6(a). Therefore, there exist a function h′ in B_w(X) and a canonical decision function f′ in F_ca such that (ρ̂, h′, f′) is a canonical triplet for M_bias, i.e. (by Theorem 10.3.4),

ρ̂ + h′(x) = min_{a∈A*(x)} [ c′(x, a) + ∫_X h′(y) Q(dy|x, a) ]

or [by (10.7.2)]

ρ̂ + h′(x) = −h*(x) + min_{a∈A*(x)} ∫_X h′(y) Q(dy|x, a),

and

ρ̂ + h′(x) = −h*(x) + ∫_X h′(y) Q_{f′}(dy|x) ∀x ∈ X, (10.7.5)

which [by (10.7.4) and (10.7.1)] yield (10.3.41) and (10.3.42), respectively. Finally, integrating both sides of (10.7.5) with respect to the i.p.m. μ_{f′} we get

ρ̂ = ∫_X (−h*) dμ_{f′};

that is, f′ is bias-optimal.


Proof of (c). The first equality in (10.3.43) implies

∫_X h dμ_f = 0. (10.7.6)

On the other hand, as (ρ*, h, f) is a canonical triplet, the uniqueness Theorem 10.3.7 yields that h = h* + k for some constant k, which combined with (10.7.6) implies

k = −∫_X h* dμ_f.

Hence, by (10.3.33), h coincides with the bias function h_f since

h = h* − ∫_X h* dμ_f = h_f. (10.7.7)

From this fact and the second equality in (10.3.43) it follows that

h*(x) − ∫_X h* dμ_f + h′(x) ≤ ∫_X h′(y) Q_g(dy|x) ∀g ∈ F_ca,



and integration with respect to the i.p.m. μ_g gives

∫_X h* dμ_g ≤ ∫_X h* dμ_f ∀g ∈ F_ca.

Thus

∫_X (−h*) dμ_f ≤ ∫_X (−h*) dμ_g ∀g ∈ F_ca,

which together with (10.3.34) and (10.7.7) yields that f is bias-optimal and that ĥ = h_f = h.

Proof of (d). (d₁) ⇒ (d₂). If f is bias-optimal, then [by (10.3.31)] f is in F_ca and

ĥ(·) = h*(·) − ∫_X h* dμ_f,

so that integrating with respect to μ_f we obtain (10.3.44).

(d₂) ⇒ (d₁). If (d₂) holds, then f ∈ F_ca satisfies (10.3.40) and, on the other hand, the Poisson equation for f is

ρ* + h_f(x) = c_f(x) + ∫_X h_f(y) Q_f(dy|x).

Subtracting the latter equation from (10.3.40), we see that the function u(·) := ĥ(·) − h_f(·) is invariant with respect to Q_f, i.e.,

u(x) = ∫_X u(y) Q_f(dy|x) ∀x ∈ X.

Consequently [by Theorem 7.5.10(d)] u(·) ≡ μ_f(u), i.e.,

ĥ(·) − h_f(·) = μ_f(ĥ) − μ_f(h_f) = 0,

where the second equality is due to (10.3.44) and (10.2.22). Therefore ĥ = h_f, and (d₁) follows. This completes the proof of Theorem 10.3.10. □
10.7.1 Remark: Another proof of Theorem 10.3.10. In this proof the idea is to use (10.3.30) and the definition (10.2.19) of h_f [with J(f) = ρ*] to pose the question of existence of bias-optimal policies as an expected total cost (ETC) problem.

Instead of the Markov control model M_bias in (10.7.3), consider the control model

M* := (X, A, {A*(x) | x ∈ X}, Q, c*)

with cost-per-stage c* given by

c*(x, a) := c(x, a) − ρ* ∀x ∈ X, a ∈ A*(x),



and A*(x) as in (10.3.35).

Note that for a canonical decision function f ∈ F_ca, the bias h_f in (10.2.19) becomes

h_f(x) = Σ_{t=0}^∞ E_x^{f^∞} [c_f(x_t) − ρ*] = Σ_{t=0}^∞ E_x^{f^∞} c_f*(x_t), (10.7.8)

which is the same as the ETC V₁(f^∞, x) in (9.1.1) with c*(x, a) in lieu of c(x, a). Let us write V₁(f^∞, x)* for the ETC thus obtained, so that

h_f(x) = V₁(f^∞, x)* = Σ_{t=0}^∞ E_x^{f^∞} c_f*(x_t), f ∈ F_ca.
Furthermore, if in (9.1.2) we replace Π and V₁(·, x) by F_ca and V₁(·, x)*, respectively, we get that the value function for the new ETC problem is precisely the optimal bias function ĥ in (10.3.30) and, on the other hand, the ETC optimality equation (9.5.3) becomes

ĥ(x) = min_{a∈A*(x)} [ c*(x, a) + ∫_X ĥ(y) Q(dy|x, a) ], (10.7.9)

which can also be written as

ρ* + ĥ(x) = min_{a∈A*(x)} [ c(x, a) + ∫_X ĥ(y) Q(dy|x, a) ], x ∈ X. (10.7.10)

In addition, the control model M* satisfies the hypotheses of Proposition 8.3.9(b), and so there exists a decision function f̂ in F_ca such that f̂(x) ∈ A*(x) attains the minimum in (10.7.10) for each x ∈ X, i.e.,

ρ* + ĥ(x) = c_{f̂}(x) + ∫_X ĥ(y) Q_{f̂}(dy|x) ∀x ∈ X. (10.7.11)

Therefore, by Theorem 9.5.12, to conclude that f̂ ∈ F_ca is bias-optimal it only remains to verify that Assumption 9.5.2 holds in the present context, and also that f̂ and ĥ satisfy (9.5.38), i.e.,

limsup_{n→∞} E_x^{f̂^∞} ĥ(x_n) = 0 ∀x ∈ X. (10.7.12)

To verify Assumption 9.5.2, replace Π and V₁(π, x) in (9.3.8) and (9.3.16) by F_ca and V₁(f^∞, x)* = h_f(x), respectively. Then Assumptions 9.3.2 and 9.3.4 both follow from (10.2.20), so that part (a) in Assumption 9.5.2 is satisfied. Similarly, replacing U, T and W by B_w(X),

T*u(x) := min_{a∈A*(x)} [ c*(x, a) + ∫_X u(y) Q(dy|x, a) ] [see (10.7.9)],

and

W(x) := c̄Rw(x)/(1 − ρ) [see (10.2.20)],

respectively, we obtain parts (b) and (c) in Assumption 9.5.2.

Finally, (10.7.12) follows from (10.7.11), which gives [as in (10.3.21) or (7.5.4)]

E_x^{f̂^∞} ĥ(x_n) = ĥ(x) − [J_n(f̂^∞, x) − nρ*] = Σ_{t=n}^∞ E_x^{f̂^∞} [c_{f̂}(x_t) − ρ*] → 0 as n → ∞ [by (10.7.8)].

Therefore, the conditions of Theorem 9.5.12 hold, and so we conclude that the decision function f̂ ∈ F_ca is bias-optimal. □

10.8 Proof of Theorem 10.3.11


We already have several implications between the concepts in (a), (b), (c),
and (d), and AC optimality. For instance by (10.3.30),

IFbias C lFea elFAC. (10.8.1)

Similarly, by (10.1.5), it is clear that a weakly 0.0. policy is AC-optimal;


in fact, by (10.1.9) and (10.1.15),

1T* weakly 0.0. => 1T* ~C-optimal => 1T* AC-optimal, (10.8.2)

where the second implication holds if 1T* has a finite opportunity cost.
To obtain further relations between the above concepts, let J~(x) and
J*(x) be as in (10.1.16) and (10.1.10), respectively, and define the upper
and lower limit functions

LU(x) .- limsup[J~(x) - nJ*(x)], (10.8.3)


n-+oo
liminf[J~(x) - nJ*(x)]. (10.8.4)
n-+oo

Then, adding ±nJ*(x) to I n (1T,X) - J~(x), from (10.1.7) and (10.1.12) we


get

D(1T,X) -LU(x) ~ OC(1T,X) ~ D(1T, x) -L1(x) V1T E II, x EX, (10.8.5)

which, of course, holds in particular when 1T = foo is a deterministic sta-


tionary policy, i.e.,

(10.8.6)

for each f^∞ in Π_DS and x ∈ X. Moreover, as a special case of (10.1.17), if f^∞ is not AC-optimal, then D(f^∞, ·) and OC(f^∞, ·) are both infinite.

In fact, in the context of Theorem 10.3.11, we can get a relation more explicit than (10.8.6) if f^∞ is AC-optimal [i.e., J(f) = ρ*, or f ∈ F_AC]; namely,

h_f(·) = D(f^∞, ·) = OC(f^∞, ·) + L^l(·) ∀f ∈ F_AC. (10.8.7)

Indeed, as J*(·) ≡ ρ*, from (10.2.19) and (10.1.12) we obtain

h_f(x) = lim_{n→∞} [J_n(f^∞, x) − nρ*] = D(f^∞, x) ∀f ∈ F_AC, x ∈ X, (10.8.8)

which yields the first equality in (10.8.7). Similarly, by the definition (10.1.7) of opportunity cost,

OC(f^∞, x) = limsup_{n→∞} {[J_n(f^∞, x) − nρ*] − [J_n*(x) − nρ*]}
= h_f(x) − L^l(x) [by (10.8.8) and (10.8.4)],

and (10.8.7) follows.

Thus, in view of (10.1.17) and (10.8.7), we have the equivalence of (a), (b) and (c) in Theorem 10.3.11, i.e.,

(a) ⇔ (b) ⇔ (c). (10.8.9)

In particular, note that (10.8.7) and the definition (10.3.27) of the optimal bias function yield

ĥ(x) = inf_{F_AC} D(f^∞, x) = inf_{F_AC} OC(f^∞, x) + L^l(x) ∀x ∈ X. (10.8.10)
Finally, to prove the equivalence between (d) and, say, Dutta's optimality (c), we already have (d) ⇒ (c), by (10.1.14). To prove the converse, first note that (10.1.5) is equivalent to

limsup_{n→∞} [J_n(π*, x) − J_n(π, x)] ≤ 0 ∀π ∈ Π, x ∈ X. (10.8.11)

Thus, to prove that (c) ⇒ (d) we need to show that, with f^∞ as in (c),

limsup_{n→∞} [J_n(f^∞, x) − J_n(g^∞, x)] ≤ 0 ∀g^∞ ∈ Π_DS, x ∈ X. (10.8.12)

In turn, to get (10.8.12) we consider two cases.

Case 1: g^∞ is not AC-optimal, i.e., J(g) > ρ*. Then, by (10.1.12) and (10.2.19),

D(g^∞, x) ≥ liminf_{n→∞} [J_n(g^∞, x) − nρ*]
≥ h_g(x) + liminf_{n→∞} n[J(g) − ρ*]
= ∞ ∀x ∈ X.

Therefore,

limsup_{n→∞} [J_n(f^∞, x) − J_n(g^∞, x)] = −∞ < 0 ∀x ∈ X,

and (10.8.12) follows for g ∉ F_AC.

Case 2: g^∞ is AC-optimal, i.e., J(g) = ρ*. Then, as in (10.8.8),

D(g^∞, x) = lim_{n→∞} [J_n(g^∞, x) − nρ*],

so that, for each x ∈ X,

limsup_{n→∞} [J_n(f^∞, x) − J_n(g^∞, x)] = D(f^∞, x) − D(g^∞, x) ≤ 0, (10.8.13)

where the last inequality follows from the assumption on f^∞.

This completes the proof of (10.8.12) and, therefore, the proof of Theorem 10.3.11. □

Observe that, by (10.8.10), the inequality in (10.8.13) is equivalent to the inequality informally introduced in (10.3.6).

10.9 Examples

The calculations in Example 10.9.1 can be done using the value iteration equation J_n* = T J_{n−1}* in (9.5.4) for the optimal n-stage cost, i.e., for each x ∈ X and n = 1, 2, …,

J_n*(x) = min_{a∈A(x)} [ c(x, a) + ∫_X J_{n−1}*(y) Q(dy|x, a) ], (10.9.1)

with J₀*(·) := 0. We will also use the fact that for a deterministic stationary policy f^∞ the n-stage expected total cost

J_n(f^∞, x) = E_x^{f^∞} Σ_{t=0}^{n−1} c_f(x_t)

can be recursively computed as

J_n(f^∞, x) = c_f(x) + ∫_X J_{n−1}(f^∞, y) Q_f(dy|x) (10.9.2)

for all x ∈ X, n = 1, 2, …, with J₀(f^∞, ·) := 0. More generally, for an arbitrary policy π, the n-stage expected total cost

J_n(π, x) = E_x^π Σ_{t=0}^{n−1} c(x_t, a_t) (10.9.3)

satisfies a similar recursion for all x ∈ X and n = 1, 2, …, with J₀(π, ·) := 0.


The following example shows that, without the appropriate hypotheses, some of the results in the previous sections may fail. Specifically, the example exhibits a deterministic stationary policy f*^∞ that is canonical but, in contrast to, say, Theorem 10.3.11, is not weakly O.O., nor OC-optimal, nor D-optimal. In addition, there is a deterministic stationary policy f^∞ that is AC-optimal and F-strong AC-optimal [see (10.3.45), (10.3.46)] but is not canonical.
10.9.1 Example. Consider a MCP with state space X = {0, 1, …}, the set of nonnegative integers, with the discrete topology. The control (or action) sets are A(x) = A := {1, 2} for every state x. The state x = 0 is absorbing with zero cost; that is, writing Q({y}|x, a) as Q(y|x, a), we have

c(0, a) = 0 and Q(0|0, a) = 1 ∀a ∈ A. (10.9.4)

On the other hand, for x ≥ 1 we have

c(x, 1) = 1/x − 1 and Q(0|x, 1) = 1; (10.9.5)

and

c(x, 2) = 0 and Q(x + 1|x, 2) = 1. (10.9.6)

From (10.9.4) and (10.9.3), J_n(π, 0) = 0 for each policy π and n = 0, 1, …, so that J_n*(0) = 0 for all n. Moreover, as J₀*(·) := 0, we can use (10.9.1) to obtain

J_n*(x) = (x + n − 1)^{−1} − 1 ∀x ≥ 1, n ≥ 1.
Let us now consider the decision functions f*(x) := 2 and f(x) := 1 for all x ∈ X. Then, by (10.9.2),

J_n(f*^∞, x) = 0 ∀n ≥ 0, x ∈ X,

and

J_n(f^∞, x) = 1/x − 1 ∀n ≥ 1, x ≥ 1.

We can also see that f*^∞ is a canonical policy (i.e., f* is in F_ca) since (ρ*, h*, f*) with

ρ* = 0, h*(0) = 0, and h*(x) = −1 for x ≥ 1

is a canonical triplet. However, f*^∞ is not weakly O.O. since it does not satisfy (10.8.11) [which is equivalent to (10.1.5)]; namely,

limsup_{n→∞} [J_n(f*^∞, x) − J_n(f^∞, x)] = −(1/x − 1) > 0 ∀x > 1.

Furthermore, f*^∞ is neither OC-optimal nor D-optimal since (10.1.7) and (10.1.12) yield

OC(f*^∞, x) = 1 ∀x ≥ 1, and D(f*^∞, x) = 0 ∀x ∈ X,

whereas

OC(f^∞, x) = 1/x ∀x ≥ 1, and D(f^∞, x) = 1/x − 1 ∀x ≥ 1;

hence

inf_π OC(π, x) ≤ OC(f^∞, x) < OC(f*^∞, x) ∀x > 1,

and

inf_π D(π, x) ≤ D(f^∞, x) < D(f*^∞, x) ∀x > 1.

Finally, it is evident that f^∞ is AC-optimal and F-strong AC-optimal [see (10.3.45)], but it is not canonical. □
Example 10.9.1, which comes from Hernández-Lerma and Vega-Amaya [1], is a slight modification of an example by Nowak [1].
10.9.2 Remark. It is clear that, because of the condition on Q in (10.9.6), the MCP in Example 10.9.1 does not satisfy the w-geometric ergodicity requirement in Assumption 10.2.2. Thus one might be willing to conjecture that the conclusions in the example do not coincide with the results in §10.3 precisely because Assumption 10.2.2 fails. However, Brown [1] shows a simple (two states, two actions) MCP in which the uniform geometric ergodicity condition (7.3.16) holds but there is no deterministic stationary policy which is weakly O.O. in the class Π of all policies. What is even more interesting in Brown's MCP is that there exists a deterministic stationary policy which is strongly O.O. [see (10.1.4)] in the subclass Π_DS of deterministic stationary policies.

On the other hand, Nowak and Vega-Amaya [1] give an example in which again, as in Brown's MCP, (7.3.16) holds and yet there is no deterministic stationary policy which is strongly O.O., not even in the subclass Π_DS! The examples by Brown and by Nowak and Vega-Amaya both show in particular that Theorem 6.2 in Fernández-Gaucherand et al. [1] is false. (The latter theorem supposedly gives conditions for a deterministic stationary policy to be strongly O.O. in all of Π.)

Finally, in connection with (10.1.9), it is worth noting that Flynn [1, Example 1] exhibits a weakly O.O. policy (hence an OC-optimal one) whose opportunity cost is infinite, so that every policy is OC-optimal! In cases like this, (10.1.9) remains valid, of course, but it does not necessarily give "relevant" information. A similar comment is valid for (10.1.14). □
10.9.3 Example. (Example 8.6.2 continued.) Consider the inventory-production system in Example 8.6.2, with system equation (8.6.5), state space X = [0, ∞), and control sets

A(x) = A := [0, θ] ∀x ∈ X. (10.9.7)

We again suppose that the demand process {z_t} satisfies Assumption 8.6.1 and the condition 8.6.3, but in addition we will now suppose that, with z̄ := E(z₀),

θ < z̄. (10.9.8)

[Note that (10.9.8) states that the average demand z̄ should exceed the maximum allowed production θ. Hence it excludes some frequently encountered cases that require the opposite, θ ≥ z̄.] By the results in Example 8.6.2 we see that all of the conditions (a) to (f) in Assumption 10.2.1 are satisfied, except that the constant β in (10.2.1) is greater than 1; in fact, after (8.6.14) we obtained β = 1 + c. We will next use the new assumption (10.9.8) to see that β can be chosen to satisfy β < 1, as required in (10.2.1).

Let ψ(r) := E exp[r(θ − z₀)], r ≥ 0, be the moment generating function of θ − z₀. Then, as ψ(0) = 1 and ψ′(0) = E(θ − z₀) = θ − z̄ < 0 [by (10.9.8)], there is a positive number r* such that

ψ(r*) < 1.

[Compare the latter inequality with (8.6.11).] Therefore, defining the new weight function

w(x) := exp[r*(x + 2z̄)], x ∈ X, (10.9.9)

we see that w(·) satisfies (8.6.13) and (8.6.14) when r̄ is replaced by r*. In particular, (8.6.14) becomes

w′(x, a) ≤ βw(x) + b ∀x ∈ X, a ∈ A, (10.9.10)

with

β := ψ(r*) < 1 and b := w(0). (10.9.11)

Thus, as (10.9.10)-(10.9.11) yield (10.2.1), we have that Assumption 10.2.1 holds in the present case. [Alternatively, to verify (10.2.1) we could use (10.9.20) below because l_f(·) ≤ 1.]
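The existence of r* with ψ(r*) < 1 is easy to illustrate numerically. In the sketch below we pick an illustrative demand distribution z₀ ~ Exp(λ), so that z̄ = 1/λ, and a production bound θ < z̄; the closed form ψ(r) = e^{rθ}λ/(λ + r) then dips below 1, and for concreteness we take the grid minimizer as r* (any r with ψ(r) < 1 would serve):

```python
import math

# Illustrative data: production bound theta, demand z0 ~ Exp(lam),
# so zbar = 1/lam; condition (10.9.8) requires theta < zbar.
theta, lam = 0.5, 1.0                      # zbar = 1.0 > theta

def psi(r):
    # psi(r) = E exp[r(theta - z0)] = e^{r theta} * lam / (lam + r).
    return math.exp(r * theta) * lam / (lam + r)

assert psi(0.0) == 1.0
# psi'(0) = theta - zbar < 0, so psi dips below 1 for small r > 0;
# pick the best r on a grid:
r_star = min((psi(k / 100.0), k / 100.0) for k in range(1, 200))[1]
beta = psi(r_star)
print(r_star, beta)                        # beta < 1 as in (10.9.11)
```

With these numbers the minimizing r* sits near 1 and gives β ≈ 0.82, so the drift inequality (10.9.10) holds with a genuine contraction constant.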
We will next verify Assumption 10.2.2 (using Proposition 10.2.5) and Assumption 10.3.5. We begin with the following lemma.

10.9.4 Lemma. For each decision function f ∈ F, let {x_t^f, t = 0, 1, …} be the Markov chain defined by (8.6.5) when a_t := f(x_t) for all t, i.e.,

x_{t+1}^f = (x_t^f + f(x_t^f) − z_t)^+, t = 0, 1, …. (10.9.12)

Then, for each f ∈ F, {x_t^f} is positive recurrent, and so it has a unique i.p.m. μ_f.

Proof. Let {x_t^θ} be the Markov chain given by (10.9.12) when f(x) := θ for every state x, i.e.,

x_{t+1}^θ = (x_t^θ + θ − z_t)^+, t = 0, 1, …. (10.9.13)

Observe that, defining y_t := θ − z_t, we can rewrite (10.9.13) as a random walk of the form (7.4.4), i.e.,

x_{t+1}^θ = (x_t^θ + y_t)^+. (10.9.14)

Hence, as E|y₀| ≤ θ + z̄ < ∞ and E(y₀) = θ − z̄ < 0 [by (10.9.8)], the Markov chain {x_t^θ} is positive recurrent (see Example 7.4.2). This implies in particular that E₀(τ_θ) < ∞, where τ_θ denotes the time of first return to x = 0 given the initial state x₀ = 0.

Now choose an arbitrary decision function f ∈ F, and let τ_f be the time of first return to x = 0 given x₀^f = 0. By (10.9.7), f(x) ≤ θ for all x ∈ X and, therefore, x_t^f ≤ x_t^θ for all t = 0, 1, …. This implies that

E₀(τ_f) ≤ E₀(τ_θ) < ∞,

which yields that {x_t^f} is positive (in fact, positive Harris) recurrent; see, for instance, Corollary 5.13 in Nummelin [1]. Thus, as f ∈ F was arbitrary, the lemma follows. □
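The drift argument in the proof can also be seen in simulation: with θ < z̄ the chain (10.9.13) keeps returning to 0, and the empirical mean return time stabilizes at a finite value. A Monte Carlo sketch with illustrative parameters of our own (θ = 0.5, z_t ~ Exp(1), so z̄ = 1):

```python
import random

# Simulate x_{t+1} = (x_t + theta - z_t)^+ and record return times to 0.
random.seed(0)
theta, T = 0.5, 200_000
x, t_last, returns = 0.0, 0, []
for t in range(1, T + 1):
    x = max(x + theta - random.expovariate(1.0), 0.0)   # zbar = 1 > theta
    if x == 0.0:                        # chain back at the origin
        returns.append(t - t_last)
        t_last = t
mean_return = sum(returns) / len(returns)
print(len(returns), round(mean_return, 2))   # finite E_0(tau), cf. the proof
```

The chain visits 0 on a large fraction of the steps, consistent with a finite expected return time; with θ ≥ z̄ the same simulation drifts off to infinity instead.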
Lemma 10.9.4 gives the existence of the invariant probability measures μ_f required in Proposition 10.2.5. We will next verify the hypotheses (i)-(iv).

Let δ₀ be the Dirac measure at x = 0, and define

ν(·) := δ₀(·) and l₀(x) := 1 − G(x + θ), x ∈ X, (10.9.15)

where G denotes the common probability distribution of the demand variables z_t. Recall that G is supposed to satisfy the condition 8.6.3, the Assumption 8.6.1, and also (10.9.8). Moreover, for each decision function f, let

l_f(x) := 1 − G(x + f(x)).

By (10.9.7), x + f(x) ≤ x + θ for all x ∈ X and f ∈ F, so that

l_f(x) ≥ l₀(x) ∀x ∈ X, f ∈ F. (10.9.16)

Hence, if in (8.6.10) we replace the function u(·) by the indicator function 1_B(·) of a Borel set B ⊂ X, we see that (8.6.10) [and (8.6.2)] yield

Q_f(B|x) ≥ l_f(x) ν(B) (10.9.17)

for each f ∈ F, x ∈ X, and B ∈ B(X); that is, hypothesis (i) in Proposition 10.2.5 is satisfied.
On the other hand, from (10.9.15), defining

γ := ν(l₀) = l₀(0) = 1 − G(θ) > 0, (10.9.18)

we get hypothesis (ii) since [by (10.9.16)]

ν(l_f) = l_f(0) ≥ l₀(0) = γ ∀f ∈ F. (10.9.19)



Furthermore, (iii) is obvious since

ν(w) = w(0) = exp(2r*z̄) < ∞ [see (10.9.9)].

Finally, hypothesis (iv) follows from (8.6.13) with r* in lieu of r̄, which gives

∫_X w(y) Q_f(dy|x) ≤ w(0) l_f(x) + ψ(r*) w(x) (10.9.20)
= ν(w) l_f(x) + βw(x) [by (10.9.11)].

Thus, all of the hypotheses of Proposition 10.2.5 are satisfied, and so Assumption 10.2.2 holds.

Finally, to verify Assumption 10.3.5, simply note that (10.9.16) and (10.9.17) yield

Q_f(·|x) ≥ ν(·) l₀(x) ∀f ∈ F, x ∈ X, (10.9.21)

which implies Assumption 10.3.5 with λ(·) := ν(·).

To conclude, the inventory-production system in Example 8.6.2, with the additional condition (10.9.8), satisfies Assumptions 10.2.1, 10.2.2, and 10.3.5 and, by consequence, all of the results in §10.3 hold in this case. □

Example 10.9.3 is due to Vega-Amaya [1], [2]; see also Hernández-Lerma and Vega-Amaya [1].
10.9.5 Example. (Example 8.6.4 continued.) Let us again consider the queueing system (8.6.16) in Example 8.6.4. From that example we can see that all of the conditions in Assumption 10.2.1 are satisfied, except perhaps for part (f). On the other hand, by (10.2.4), to verify (10.2.1) it suffices to check that (10.2.25) holds because then, since l_f(·) ≤ 1, we may take b := ν(w) to obtain (10.2.1).

Hence we will proceed directly to verify Assumption 10.2.2 (via Proposition 10.2.5) and Assumption 10.3.5.

Let z := θη₀ − ξ₀ be as in Assumption 8.6.5(d), and let {x_t^θ} be the Markov chain obtained from (8.6.16) when a_t := θ for all t. Moreover, let {z_t} be a sequence of i.i.d. random variables with the same distribution as z. Then we can write {x_t^θ} as

x_{t+1}^θ = (x_t^θ + z_t)^+, (10.9.22)

which is a random walk of the form (10.9.14) [or (7.4.4)]. Hence, exactly as in Lemma 10.9.4, we can verify that the Markov chain {x_t^f}, f ∈ F, obtained from (8.6.16) with a_t := f(x_t) for all t = 0, 1, …, i.e.,

x_{t+1}^f := [x_t^f + f(x_t^f) η_t − ξ_t]^+, t = 0, 1, …,

is positive recurrent. For that reason, {x_t^f} has a unique i.p.m. μ_f for each f ∈ F.

We will now verify the hypotheses (i) to (iv) in Proposition 10.2.5.


Let ψ(r) := E exp(rz), r ≥ 0, be the moment generating function of
z := θη₀ − ξ₀, and let w be the weight function in (8.6.18), i.e.,

w(x) := e^{rx},   x ∈ X := [0,∞),   (10.9.23)

where r > 0 is such that ψ(r) < 1. Furthermore, let ν(·) := δ₀(·) be the
Dirac measure at x = 0, and let l₀ and l_f, for f ∈ 𝔽, be defined as

l₀(x) := P(x + θη₀ − ξ₀ ≤ 0) = P(x + z ≤ 0),
l_f(x) := P(x + f(x)η₀ − ξ₀ ≤ 0).   (10.9.24)
Now in (8.6.22) replace u(·) and a ∈ A by 1_B(·) and f(x), respectively,
where B ⊂ X is a Borel set. This yields

Q_f(B|x) ≥ l_f(x) ν(B)   ∀f ∈ 𝔽, B ∈ B(X), x ∈ X,   (10.9.25)

which is precisely the hypothesis (i) in Proposition 10.2.5.
On the other hand, by Assumption 8.6.5(a), we have a ≤ θ for all a ∈ A;
hence,
l_f(x) ≥ l₀(x)   ∀f ∈ 𝔽, x ∈ X,   (10.9.26)
so that
ν(l_f) := l_f(0) ≥ l₀(0)   ∀f ∈ 𝔽,
which implies the hypothesis (ii) with γ := l₀(0).
The hypothesis (iii) trivially holds since [by (10.9.23)] ν(w) = w(0) = 1,
and so does the hypothesis (iv) because taking a = f(x) in (8.6.23) we
get (10.2.25) with β := ψ(r) < 1 [see (8.6.24)], l_f(·) as in (10.9.24), and
ν(w) = 1.
Thus, Assumption 10.2.2 follows from Proposition 10.2.5.
Finally, by (10.9.25) and (10.9.26),

Q_f(·|x) ≥ l₀(x) ν(·)   ∀f ∈ 𝔽, x ∈ X,

which yields Assumption 10.3.5 with λ(·) := ν(·).
Therefore, all of the results in §10.3 hold for the queueing system in Ex-
ample 8.6.4. □
Example 10.9.5 (as well as Example 8.6.4) is due to Gordienko and
Hernandez-Lerma [1]. Other examples can be found, for instance, in Dynkin
and Yushkevich [1], and Yushkevich [1]. For MCPs with a countable state
space see, e.g., Bertsekas [1], Haviv and Puterman [1], Puterman [1], Ross
[1].
One final comment: some of the optimality criteria discussed in §10.1
and §10.3 can be called "horizon sensitive" because they compare control
policies on the basis of their performance on finite n-stage horizons. An
alternative is to compare policies according to their performance with re-
spect to the infinite-horizon α-discounted cost (0 < α < 1) as α tends to
1. These alternative criteria are called discount-sensitive criteria. (See, for
instance, Puterman [1], Yushkevich [1].)
11
Sample Path Average Cost

11.1 Introduction
In this chapter we study AC-related criteria, some of which have already
been studied in previous chapters from a different viewpoint. We begin by
introducing some notation and definitions, and then we outline the contents
of this chapter.

A. Definitions

Let M = (X, A, {A(x) | x ∈ X}, Q, c) be a general Markov control model,
and let

J_n(π,ν) := E_ν^π[Σ_{t=0}^{n−1} c(x_t, a_t)]   (11.1.1)

be the expected n-stage total cost when using the control policy π, given
the initial distribution ν ∈ P(X). We can also write (11.1.1) as

J_n(π,ν) = ∫_X J_n(π,x) ν(dx),   (11.1.2)

where J_n(π,x) is given by (11.1.1) when ν = δ_x (the Dirac measure con-
centrated at x₀ = x), or as

J_n(π,ν) = E_ν^π[J_n⁰(π,ν)],   (11.1.3)

O. Hernández-Lerma et al., Further Topics on Discrete-Time Markov Control Processes


© Springer Science+Business Media New York 1999

where

J_n⁰(π,ν) := Σ_{t=0}^{n−1} c(x_t, a_t)   (11.1.4)

is the pathwise (or sample path) n-stage total cost.
In addition to the usual "limit supremum" expected average cost (AC)

J(π,ν) := lim sup_{n→∞} J_n(π,ν)/n,   (11.1.5)

in this chapter we also consider


• the sample path AC

  J⁰(π,ν) := lim sup_{n→∞} J_n⁰(π,ν)/n,   (11.1.6)

• and the "limit infimum" expected AC

  J^I(π,ν) := lim inf_{n→∞} J_n(π,ν)/n.   (11.1.7)

In §11.3 we also study a particular case of the limiting average variance

Var(π,ν) := lim sup_{n→∞} (1/n) var[J_n⁰(π,ν)],   (11.1.8)

where (by the definition of the variance of a random variable)

var[J_n⁰(π,ν)] := E_ν^π[J_n⁰(π,ν) − J_n(π,ν)]².   (11.1.9)

In (10.1.16) we defined a policy π* to be AC-optimal if

J(π*,x) = J*(x)   ∀x ∈ X,   (11.1.10)

where
J*(x) := inf_Π J(π,x)   (11.1.11)

is the optimal expected AC function (also known as the AC value function).
We now wish to relate AC optimality with the performance criteria in the
following definition.
11.1.1 Definition. Let π* be a control policy and ν* an initial distribution.
Then
(a) (π*,ν*) is a minimum pair if

J(π*,ν*) = ρ_min,

where
ρ_min := inf_{P(X)} J*(ν) = inf_{P(X)} inf_Π J(π,ν)   (11.1.12)

is the minimum average cost, and [as in (11.1.11)] J*(ν) := inf_Π J(π,ν).

(b) π* is sample path AC-optimal if [with J⁰ as in (11.1.6)]

J⁰(π*,ν) = ρ_min   P_ν^{π*}-a.s.   ∀ν ∈ P(X),   (11.1.13)

and, furthermore,

J⁰(π,ν) ≥ ρ_min   P_ν^π-a.s.   ∀π ∈ Π, ν ∈ P(X).   (11.1.14)

(See Remark 11.1.2 and Definition 11.1.3.)

(c) π* is strong AC-optimal if it is AC-optimal and, in addition,

J(π*,x) ≤ J^I(π,x)   ∀π ∈ Π, x ∈ X,

with J^I as in (11.1.7).
11.1.2 Remark. The optimality concepts in Definition 11.1.1(a), (c) were
already introduced in Chapter 5. On the other hand, Definition 11.1.1(b)
is related to pathwise AC optimality in Definition 5.7.6(b), which only re-
quires (11.1.13).
In §11.4 we study a class of Markov control problems for which there
exists a sample path AC-optimal policy. In general, however, Definition
11.1.1(b) turns out to be extremely demanding, in the sense that requiring
(11.1.13) and (11.1.14) to hold for all π ∈ Π and all ν ∈ P(X) is a very
strong condition. It is thus convenient to consider a weaker form of sample
path AC optimality, as follows.
11.1.3 Definition. Let Π̂ ⊂ Π be a subclass of control policies, and
P̂(X) ⊂ P(X) a subclass of probability measures ("initial distributions")
on X. Let
ρ̂ := inf_{P̂(X)} inf_{Π̂} J(π,ν).   (11.1.15)

A policy π̂ ∈ Π̂ is said to be sample path AC-optimal with respect
to Π̂ and P̂(X) if

J⁰(π̂,ν) = ρ̂   P_ν^{π̂}-a.s.   ∀ν ∈ P̂(X),   (11.1.16)

and
J⁰(π,ν) ≥ ρ̂   P_ν^π-a.s.   ∀π ∈ Π̂, ν ∈ P̂(X).   (11.1.17)

If Π̂ = Π and P̂(X) = P(X), then, as in Definition 11.1.1(b), we simply
say that π̂ is sample path AC-optimal.
For example, in §11.3 we consider a class of Markov control problems in
which there exists a sample path AC-optimal policy with respect to

Π̂ = Π_DS and P̂(X) = P(X),   (11.1.18)

and another class of problems for which

Π̂ = Π and P̂(X) = {δ_x | x ∈ X} =: P_δ(X),   (11.1.19)

where δ_x is the Dirac measure at the initial state x₀ = x. More explicitly,
in the former case (11.1.18), there is a deterministic stationary policy f₀^∞
such that
J⁰(f₀^∞,ν) = ρ₀   P_ν^{f₀^∞}-a.s.   ∀ν ∈ P(X),   (11.1.20)
and
J⁰(π,ν) ≥ ρ₀   P_ν^π-a.s.   ∀π ∈ Π_DS, ν ∈ P(X),   (11.1.21)
where
ρ₀ := inf_{P(X)} inf_{Π_DS} J(f^∞,ν).   (11.1.22)
Similarly, in case (11.1.19) there is a policy π* ∈ Π such that, with

ρ* := inf_x J*(x) = inf_x inf_Π J(π,x),   (11.1.23)

we have
J⁰(π*,x) = ρ*   P_x^{π*}-a.s.   ∀x ∈ X,   (11.1.24)
and
J⁰(π,x) ≥ ρ*   P_x^π-a.s.   ∀π ∈ Π, x ∈ X.   (11.1.25)
Note that the condition "for all x ∈ X" in (11.1.24) and (11.1.25) can also
be expressed as "for all ν in P_δ(X)", with P_δ(X) as in (11.1.19).
B. Outline of the chapter

The rest of the chapter consists of four sections. Section 11.2 presents
background material on positive Harris recurrence and the limiting average
variance (11.1.8). The reader may go directly to §11.3 and refer to the con-
cepts and results in §11.2 as they are needed. In §11.3 we consider a Markov
control model which is "w-geometrically ergodic" (Definition 11.3.1) with
respect to some weight function w. In this case we show the existence of de-
terministic stationary policies that satisfy (11.1.20) and (11.1.21), whereas
under an additional condition (Assumption 11.3.4) they satisfy (11.1.24)
and (11.1.25), and, moreover, the concepts in Definition 11.1.1 turn out to
be "essentially" equivalent (see Theorem 11.3.5 for a precise statement).
Also in §11.3 we prove the existence of a policy that minimizes the limiting
average variance within the class of canonical policies (Theorem 11.3.8).
In §11.4 we turn our attention to Markov control models with a strictly
unbounded cost-per-stage function c(x,a) [see Assumption 11.4.1(c) and
Remark 11.4.2(a)J. The main result in that section, Theorem 11.4.6, in
particular gives conditions ensuring the existence of a sample path AC-
optimal policy.
The chapter concludes in §11.5 with some examples that illustrate the
results of §11.3 and §11.4.

11.2 Preliminaries
This section reviews background material that can be omitted on a first
reading; the reader may refer to it as needed.
A. Positive Harris recurrence
Let {x_t, t = 0,1,...} be a time-homogeneous X-valued Markov chain
with transition probability function P(B|x). The chain is said to be posi-
tive Harris recurrent if:
(i) {x_t} is Harris recurrent [Definition 7.3.1(b)], and
(ii) it has an i.p.m., which [by Theorem 7.3.4(a)] is necessarily the unique
i.p.m. of {x_t}.
The next theorem presents, in parts (a) and (b), two characterizations of
positive Harris recurrence.
11.2.1 Theorem. (Characterization and properties of positive Har-
ris recurrence.)
(a) Suppose that {x_t} has an i.p.m. μ. Then the chain is positive Harris
recurrent if and only if the strong Law of Large Numbers (LLN)
holds for each function in L₁(μ) := L₁(X,B(X),μ); that is, for each
function g in L₁(μ) and each initial distribution ν in P(X),

lim_{n→∞} (1/n) Σ_{t=0}^{n−1} g(x_t) = μ(g)   P_ν-a.s.,   (11.2.1)

where
μ(g) := ∫_X g dμ.   (11.2.2)

(b) The chain {x_t} is positive Harris recurrent if and only if for each
Borel set B in B(X) there is a nonnegative number α_B such that

lim_{n→∞} P^{(n)}(B|x) = α_B   ∀x ∈ X,   (11.2.3)

where
P^{(n)}(B|x) := (1/n) Σ_{t=0}^{n−1} P^t(B|x),   n = 1,2,...,   (11.2.4)
denotes the expected average occupation measures.
(c) If {x_t} is positive Harris recurrent with i.p.m. μ and g is in L₁(μ),
then

lim_{n→∞} (1/n) E_x[Σ_{t=0}^{n−1} g(x_t)] = μ(g)   for μ-a.a. x ∈ X,

with μ(g) as in (11.2.2).

Proof. For parts (a) and (c) see, for instance, Revuz [1, pp. 139, 140];
for (b) see Glynn [1] or Hernandez-Lerma and Lasserre [12]. (Glynn [2]
proves (b) in the continuous-time case.) Part (a) is given also in Meyn and
Tweedie [1], Theorem 17.1.7 and Proposition 17.1.6. (These references pro-
vide other characterizations of positive Harris recurrence. For additional
comments see Note 1 at the end of this section.)
As an application of Theorem 11.2.1(b), let w ≥ 1 be a weight function,
and suppose that the Markov chain {x_t} is w-geometrically ergodic (Defi-
nition 7.3.9). If in (7.3.8) we replace the function u by an indicator function
1_B, then we get
P^t(B|x) → μ(B) as t → ∞,
which of course implies (11.2.3) with α_B = μ(B). Thus we have:
11.2.2 Corollary. A w-geometrically ergodic Markov chain is positive Har-
ris recurrent.
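As a toy illustration of the LLN (11.2.1): a finite irreducible chain is positive Harris recurrent, so its empirical occupation frequencies converge a.s. to the i.p.m. The two-state transition probabilities below are arbitrary choices for the sketch; the i.p.m. is known in closed form, which allows a direct comparison.

```python
import random

def lln_demo(p=0.3, q=0.4, n=200000, seed=1):
    """Two-state chain with P(0->1) = p and P(1->0) = q. Its unique
    i.p.m. is mu = (q/(p+q), p/(p+q)). Return the empirical frequency
    of state 1 along one trajectory, together with mu_1."""
    rng = random.Random(seed)
    x, visits_1 = 0, 0
    for _ in range(n):
        if x == 0:
            x = 1 if rng.random() < p else 0
        else:
            x = 0 if rng.random() < q else 1
        visits_1 += x  # counts time spent in state 1
    return visits_1 / n, p / (p + q)

emp, mu1 = lln_demo()
print(f"empirical: {emp:.4f}, stationary: {mu1:.4f}")
```

Taking g = 1_{1} in (11.2.1), the empirical frequency converges P_ν-a.s. to μ({1}) = p/(p+q), whatever the initial state.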
B. Limiting average variance

In this subsection we suppose the following.


11.2.3 Assumption. The Markov chain {x_t} is w-geometrically ergodic
with limiting i.p.m. μ. Moreover, c(·) and h(·) are two functions in B_w(X),
with h as in (7.5.30).
Under this assumption, Theorem 7.5.10 states that the pair (g,h) is the
unique solution of the strictly unichain Poisson equation

c − μ(c) = h − Ph   (11.2.5)

that satisfies
μ(h) = 0 and g(x) = μ(c)   ∀x ∈ X.   (11.2.6)
Equivalently, by Theorem 7.5.5(a), (b) [and (7.5.7) or (7.5.6)],

E_x[Σ_{t=0}^{n−1} c(x_t)] = nμ(c) + h(x) − E_x h(x_n)   ∀x ∈ X, n ≥ 1,   (11.2.7)

which gives the mean value E_x[s_n(x)] of the random sum

s_n(x) := Σ_{t=0}^{n−1} c(x_t),   n = 1,2,...,   (11.2.8)

for each initial state x₀ = x.
We now wish to compute the limiting average variance

σ²(c,x) := lim sup_{n→∞} (1/n) var[s_n(x)],   (11.2.9)

where [by the definition of variance, var(ξ) := E(ξ − Eξ)²]

var[s_n(x)] := E_x[s_n(x) − E_x s_n(x)]².   (11.2.10)

The next theorem gives conditions under which σ²(c,·) is the (finite) con-
stant

σ_c² := μ[h² − (Ph)²] = ∫_X [h²(x) − (Ph(x))²] μ(dx).   (11.2.11)

This result is well known (see Doob [1], Duflo [1], Meyn and Tweedie [1],
etc.) but we will give a proof of it because some of the arguments are also
needed in later sections.
11.2.4 Theorem. Suppose that Assumption 11.2.3 holds and, furthermore,
c²(·) is in B_w(X). Let ψ be the function on X defined by

ψ(x) := ∫_X h²(y) P(dy|x) − [∫_X h(z) P(dz|x)]²,   (11.2.12)

which in short can be written as

ψ = P(h²) − (Ph)².   (11.2.13)

Then:
(a) ψ is in B_w(X), and
(b) the limiting average variance satisfies

σ²(c,x) = lim_{n→∞} E_x[ψ(x_n)] = σ_c²   ∀x ∈ X,   (11.2.14)

where σ_c² is the constant in (11.2.11).


Of course, if part (a) in Theorem 11.2.4 is valid, then the second equality
in (11.2.14) is evident because

μ(ψ) := ∫_X ψ dμ = σ_c²,   (11.2.15)

and (7.3.8) gives

E_x[ψ(x_t)] → μ(ψ) = σ_c²   ∀x ∈ X.   (11.2.16)

Thus the proof of Theorem 11.2.4, given below, essentially reduces to verifying
(a) and the first equality in (11.2.14). To prove the latter we will repeatedly
use the following elementary properties of conditional expectations.
11.2.5 Remark. Let ξ and ξ′ be integrable random variables on a proba-
bility space (Ω,ℱ,P), and let 𝒢 and 𝒢′ be sub-σ-algebras of ℱ.

(a) E(ξ) = E[E(ξ|𝒢)].

(b) If ξ is 𝒢-measurable, then

  (b₁) E(ξξ′|𝒢) = ξE(ξ′|𝒢), and

  (b₂) E(ξ|𝒢) = ξ.

(c) If 𝒢 ⊂ 𝒢′, then E[E(ξ|𝒢)|𝒢′] = E[E(ξ|𝒢′)|𝒢] = E(ξ|𝒢).
(d) If ξ is square-integrable, then [E(ξ|𝒢)]² ≤ E(ξ²|𝒢). □
Proof of Theorem 11.2.4. (a) As c² is in B_w(X) (by assumption), (7.3.8)
and (7.5.30) give that h² is in B_w(X). This implies [by (7.3.8) again] that
P(h²) is in B_w(X), and so is ψ, since (11.2.12) gives 0 ≤ ψ ≤ P(h²).
(b) By a previous remark [see (11.2.15), (11.2.16)], it only remains to
prove the first equality in (11.2.14). To do this we will first show that the
random variables

Y_t := h(x_t) − E_x[h(x_t) | F_{t−1}],   t = 1,2,...,   (11.2.17)

satisfy

E_x(Y_t²) = E_x[ψ(x_{t−1})]   ∀x ∈ X, t = 1,2,...,   (11.2.18)

and also that the variance in (11.2.10) can be written as

var[s_n(x)] = E_x[s_n(x) − nμ(c)]² + δ₁(x,n),   (11.2.19)

and as

var[s_n(x)] = E_x[Σ_{t=1}^{n−1} Y_t²] + δ₂(x,n),   (11.2.20)

where, for i = 1,2, δ_i(x,n) is a sequence such that, as n → ∞,

δ_i(x,n)/n → 0   ∀x ∈ X.   (11.2.21)

Hence, by the definition (11.2.9) of the limiting average variance, the first
equality in (11.2.14) will follow from (11.2.20), (11.2.21) and (11.2.18).
Proof of (11.2.18). Let

F_t := σ{x₀,...,x_t}   for t = 0,1,...   (11.2.22)

be the σ-algebra generated by {x₀,...,x_t}. Then, by Remark 11.2.5(a) and
the Markov property, for t ≥ 1,

E_x(Y_t²) = E_x[E_x(Y_t²|F_{t−1})]
          = E_x[E_x(Y_t²|x_{t−1})]
          = E_x[ψ(x_{t−1})],

where the latter equality, which gives (11.2.18), follows from (11.2.17) and
the fact that ψ(x) in (11.2.12) can also be written as

ψ(x) = ∫_X [h(y) − ∫_X h(z)P(dz|x)]² P(dy|x).   (11.2.23)

Proof of (11.2.19). From (11.2.7) and (11.2.8),

s_n(x) − E_x s_n(x) = [s_n(x) − nμ(c)] − [h(x) − E_x h(x_n)].   (11.2.24)

Thus, squaring and taking expectations E_x(·), we see that (11.2.10) can be
written as in (11.2.19) with

δ₁(x,n) := −[h(x) − E_x h(x_n)]².   (11.2.25)

Now, to prove (11.2.21) with i = 1, use the elementary inequality

(a + b)² ≤ 2(a² + b²)   ∀a,b ∈ ℝ   (11.2.26)

and Remark 11.2.5(d) to get

|δ₁(x,n)| ≤ 2[h²(x) + E_x h²(x_n)],

which yields (11.2.21) with i = 1 because, as h² is in B_w(X), (7.3.8) gives
E_x h²(x_n) → μ(h²)   ∀x ∈ X.

Proof of (11.2.20). We begin by noting that the random variables Y_t in
(11.2.17) satisfy

E_x(Y_t Y_s) = 0   for t ≠ s.   (11.2.27)

Indeed, with F_t as in (11.2.22), the Markov property and Remark 11.2.5(b₂)
give
E_x(Y_s | F_{s−1}) = 0,

and, therefore, (11.2.27) follows from Remark 11.2.5(a), (b₁) because, for t < s,

E_x(Y_t Y_s) = E_x[E_x(Y_t Y_s | F_{s−1})]
             = E_x[Y_t E_x(Y_s | F_{s−1})]
             = 0.

We also note that, by the Markov property,

Y_t = h(x_t) − Ph(x_{t−1}),   (11.2.28)

and so (11.2.5) yields

c(x_t) − μ(c) = Y_t − [Ph(x_t) − Ph(x_{t−1})]   ∀t = 1,2,...,


and

s_n(x) − nμ(c) = M_{n−1} + [h(x₀) − Ph(x_{n−1})]   ∀n ≥ 2,   (11.2.29)

where, for every n ≥ 1,

M_n := Σ_{t=1}^{n} Y_t.   (11.2.30)

Observe that (11.2.27) implies that

E_x(M_n²) = E_x[Σ_{t=1}^{n} Y_t²],   (11.2.31)

and, on the other hand,

E_x(M_n) = 0,   (11.2.32)

because [by Remark 11.2.5(a)]

E_x(Y_t) = E_x h(x_t) − E_x h(x_t) = 0.   (11.2.33)

Moreover, using (11.2.29) we can rewrite (11.2.24) as

s_n(x) − E_x s_n(x) = M_{n−1} + [h(x₀) − Ph(x_{n−1})] − [h(x) − E_x h(x_n)].

From this relation, together with (11.2.31) and (11.2.32), we obtain (11.2.20)
with
δ₂(x,n) := δ₁(x,n) + A(x,n) + 2B(x,n),
with δ₁(x,n) as in (11.2.25), and

A(x,n) := E_x[h(x₀) − Ph(x_{n−1})]²,
B(x,n) := E_x{[h(x₀) − Ph(x_{n−1})] · M_{n−1}}.

Thus, to complete the proof of Theorem 11.2.4, it only remains to prove
(11.2.21) with i = 2. In fact, as δ₁(x,n)/n → 0, we only need to show that,
for every initial state x, as n → ∞,

A(x,n)/n → 0   (11.2.34)

and
B(x,n)/n → 0.   (11.2.35)

To prove (11.2.34), use (11.2.26) and Remark 11.2.5(a), (d) to get

A(x,n) ≤ 2[h²(x) + E_x h²(x_n)].

This yields (11.2.34), as in the proof of (11.2.19).

Finally, to obtain (11.2.35) we may use the Cauchy–Schwarz inequality
to see that

B(x,n)² ≤ E_x[h(x₀) − Ph(x_{n−1})]² · E_x(M_{n−1}²)
        = A(x,n) · E_x[Σ_{t=1}^{n−1} Y_t²]   [by (11.2.31)].

Therefore, (11.2.35) follows from (11.2.34), (11.2.18), and the second equal-
ity in (11.2.14).
This completes the proof of Theorem 11.2.4. □
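Schematically, the proof just given rests on the decomposition of the centered sum into a martingale with orthogonal increments plus boundary terms. As a recap (all symbols as above; this adds nothing beyond (11.2.18)–(11.2.21)):

```latex
\operatorname{var}[s_n(x)]
  = E_x\Bigl[\sum_{t=1}^{n-1} Y_t^2\Bigr] + \delta_2(x,n)
  = \sum_{t=1}^{n-1} E_x[\psi(x_{t-1})] + \delta_2(x,n),
\qquad \frac{\delta_2(x,n)}{n} \to 0,
```

so that n⁻¹ var[s_n(x)] is a Cesàro average of the terms E_x[ψ(x_t)], which converges to μ(ψ) = σ_c² by (11.2.16).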
11.2.6 Remark. (a) Let M_n and F_n be as in (11.2.30) and (11.2.22),
respectively. Then it is clear that {M_n,F_n} is a martingale. In §11.3 and
§11.4 we will use a martingale of this form in combination with the following
result.
(b) (The Martingale Stability Theorem.) Let {Y_t} be a sequence
of random variables on a probability space (Ω,ℱ,P) and let {F_t} be a
nondecreasing sequence of sub-σ-algebras of ℱ such that {M_n,F_n, n = 1,2,...},
with M_n := Σ_{t=1}^{n} Y_t, is a martingale. If 1 ≤ q ≤ 2 and

Σ_{t=1}^{∞} t^{−q} E(|Y_t|^q | F_{t−1}) < ∞   P-a.s.,   (11.2.36)

then
lim_{n→∞} (1/n) M_n = 0   P-a.s.

For a proof of this fact see, for instance, Hall and Heyde [1], Theorem 2.18.
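The i.i.d. case already gives intuition for this theorem: bounded zero-mean increments satisfy (11.2.36) with q = 2, since Σ_t t⁻² E(Y_t²|F_{t−1}) ≤ (sup_t E Y_t²) Σ_t t⁻² < ∞, and then M_n/n → 0 a.s. A quick numerical sketch with ±1 increments (an arbitrary choice of distribution):

```python
import random

def martingale_average(n=200000, seed=2):
    """M_n = sum of n i.i.d. +-1 increments, a martingale with
    E(Y_t^2 | F_{t-1}) = 1. Return M_n / n, which should be near 0."""
    rng = random.Random(seed)
    m = 0
    for _ in range(n):
        m += 1 if rng.random() < 0.5 else -1
    return m / n

print(f"M_n/n = {martingale_average():.5f}")
```

Of course, in §11.3 the theorem is applied to dependent increments Y_t; condition (11.2.36) is what replaces boundedness there.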
(c) (Alternative expressions for σ_c².) Letting c̄ be the centered func-
tion
c̄(·) := c(·) − μ(c),
we can also write (11.2.11) as

σ_c² = μ(2c̄h − c̄²),   (11.2.37)

σ_c² = μ(c̄²) + 2μ[Σ_{t=1}^{∞} c̄ P^t c̄],   (11.2.38)

and
σ_c² = E_μ[c̄²(x₀)] + 2 Σ_{t=1}^{∞} E_μ[c̄(x₀) c̄(x_t)].   (11.2.39)

Indeed, to obtain (11.2.37), we write the Poisson equation (11.2.5) as Ph =
h − c̄, so that
(Ph)² = h² − 2c̄h + c̄².   (11.2.40)

This equality and (11.2.11) yield (11.2.37). On the other hand, to get
(11.2.38), first write (7.5.30) as

h = Σ_{t=0}^{∞} P^t c̄ = c̄ + Σ_{t=1}^{∞} P^t c̄.

Then the right-hand side of (11.2.40) becomes

2c̄h − c̄² = c̄² + 2 Σ_{t=1}^{∞} c̄ P^t c̄,

and so (11.2.38) follows from (11.2.37). Finally, note that for any initial
state x₀ = x and t = 0,1,...,

E_x[c̄(x₀) c̄(x_t)] = c̄(x) (P^t c̄)(x),

and integration with respect to μ gives

E_μ[c̄(x₀) c̄(x_t)] = μ(c̄ P^t c̄).

Thus (11.2.39) follows from (11.2.38). □
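The agreement between (11.2.11) and (11.2.38) can be checked numerically on a small chain. The sketch below takes an arbitrary two-state transition matrix and cost (not taken from the text), obtains h by summing the series h = Σ_{t≥0} P^t c̄ from part (c) above (the series converges geometrically), and evaluates both expressions:

```python
def variance_two_ways(P, c, pi, n_terms=200):
    """Compare sigma_c^2 computed from (11.2.11) and from (11.2.38)
    for a finite chain. P: transition matrix (list of rows), c: cost
    vector, pi: the chain's i.p.m."""
    n = len(c)
    mu_c = sum(pi[i] * c[i] for i in range(n))
    cbar = [ci - mu_c for ci in c]          # centered cost

    def apply_P(v):                          # (Pv)(i) = sum_j P[i][j] v(j)
        return [sum(P[i][j] * v[j] for j in range(n)) for i in range(n)]

    h, term = [0.0] * n, cbar[:]             # h = sum_{t>=0} P^t cbar
    series = 0.0                             # sum_{t>=1} mu(cbar * P^t cbar)
    for _ in range(n_terms):
        h = [h[i] + term[i] for i in range(n)]
        term = apply_P(term)
        series += sum(pi[i] * cbar[i] * term[i] for i in range(n))
    Ph = apply_P(h)
    v1 = sum(pi[i] * (h[i] ** 2 - Ph[i] ** 2) for i in range(n))     # (11.2.11)
    v2 = sum(pi[i] * cbar[i] ** 2 for i in range(n)) + 2 * series    # (11.2.38)
    return v1, v2

P = [[0.7, 0.3], [0.4, 0.6]]   # arbitrary example chain
pi = [4/7, 3/7]                # its i.p.m.
v1, v2 = variance_two_ways(P, [0.0, 1.0], pi)
print(v1, v2)
```

For this chain both expressions evaluate to the same constant, as the remark asserts; the truncation error of the series is of order (second eigenvalue)^n_terms, which is negligible here.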
Notes on §11.2

1. The conclusion of Theorem 11.2.1(c) remains valid if the hypothesis
"{x_t} is positive Harris recurrent" is replaced by "the state space X is a
locally compact separable metric space and {x_t} has a unique i.p.m. μ".
(See Hernández-Lerma and Lasserre [13], Lemma 4.2.) On the other hand,
it follows from Theorem 11.2.1(b) that if {x_t} is positive Harris recurrent
with i.p.m. μ and g is a bounded measurable function on X, then

lim_{n→∞} (1/n) E_x[Σ_{t=0}^{n−1} g(x_t)] = μ(g)   for all x ∈ X.   (11.2.41)

This can be used to obtain additional results. For instance, if g is a non-
negative function in L₁(μ), then

lim inf_{n→∞} (1/n) E_x[Σ_{t=0}^{n−1} g(x_t)] ≥ μ(g)   for all x ∈ X.

This follows from (11.2.41) and the fact that g ∈ L₁(μ)⁺ is the pointwise
limit of a nondecreasing sequence of bounded measurable functions.
On the other hand, writing g(x_n) as

Σ_{t=0}^{n} g(x_t) − Σ_{t=0}^{n−1} g(x_t),

it is easily deduced from (11.2.1) that if {xt} is positive Harris recurrent


with i.p.m. /-l and g is in L1 (/-l), then

lim Ig(xn)l/n
n-+oo
=0 Pv-a.s. for each lJ in P(X). (11.2.42)

In fact, in (11.2.42) we may replace Ig(xn)1 by max1<k<n Ig(Xk)l; see, for


instance, Meyn and Tweedie [1, Theorem 17.3.3]. --
2. (The Central Limit Theorem for Markov chains.) Suppose
that the hypotheses of Theorem 11.2.4 are satisfied, and let Z(·) denote
the Gaussian distribution with mean 0 and variance 1, i.e.,

Z(y) := (2π)^{−1/2} ∫_{−∞}^{y} e^{−u²/2} du.

If σ_c² > 0, then for each initial state x ∈ X and each y ∈ ℝ,

lim_{n→∞} P_x{[s_n(x) − nμ(c)]/(σ_c n^{1/2}) ≤ y} = Z(y).   (11.2.43)

[For a proof see, for example, the references given after (11.2.11).]
3. The reader should be warned that there are different definitions of
limiting average variance. For instance, Baykal-Gürsoy and Ross [1], Filar
et al. [1], Puterman [1, p. 408], etc., define the "limiting average variance"
as

lim sup_{n→∞} (1/n) E_x[Σ_{t=0}^{n−1} (c(x_t) − μ(c))²],   (11.2.44)

which in general [despite (11.2.19)] does not coincide with (11.2.9). In par-
ticular, observe that under the hypotheses of Theorem 11.2.4 the limiting
value in (11.2.44) is μ(c̄²),
which is not the same as σ_c² in (11.2.11) [or (11.2.37)].

11.3 The w-geometrically ergodic case


Let w ≥ 1 be a given weight function. In this section we study the optimal-
ity criteria in Definition 11.1.1 for a w-geometrically ergodic control model,
by which we mean the following.
11.3.1 Definition. A Markov control model M = (X,A,{A(x)|x ∈ X},Q,c)
is said to be w-geometrically ergodic if Assumptions 10.2.1 and 10.2.2
are both satisfied.
We already have a lot of information on such a model. For instance,
by Proposition 10.2.3(b), the "lim sup" and "lim inf" average costs [see
(11.1.5), (11.1.7)] coincide on the class Π_DS of deterministic stationary
policies; that is, for each f^∞ in Π_DS (equivalently, for each decision function
f ∈ 𝔽),

J(f^∞,x) = J^I(f^∞,x) = J(f)   ∀x ∈ X,   (11.3.1)

where J(f) = μ_f(c_f) is the constant in (10.2.18). Moreover, in (11.3.1) we
may replace the initial state x₀ = x by any initial distribution ν in the
family

P_w(X) := {ν ∈ P(X) | ν(w) := ∫_X w dν < ∞}.   (11.3.2)

Indeed, by (10.4.3), the n-step cost In(foo, x) satisfies

where k := c[1 + b/(1 - ,8)]. Hence, if v is in Pw(X), from (11.3.1) and the
Dominated Convergence Theorem we obtain

J(f) = Ix J(foo,x)v(dx)

= lim
n-too}x
r n-1Jn(f00,x)v(dx)
= lim n- 1 I n (foo , v)
n-too
[by (11.1.2)]
= J(fOO, v),

which gives (11.3.3).


A result similar to (11.3.1) and (11.3.3) is obtained if we replace the
expected AC J(foo,') by the sample path AC J0(foo,.) in (11.1.6):
11.3.2 Proposition. If the Markov control model is w-geometrically er-
godic, then for each deterministic stationary policy /00 E lIDS
(a) the state (Markov) process is positive Harris recurrent, and

(b) for each initial distribution v in P(X),

~(foo,v) = lim ~(foo,v)/n = J(f) P[>O_a.s. (11.3.4)


n-too

Proof. In (10.2.9) replace u by an indicator function IB, with B in B(X).


This gives
lim Q}(Blx) = f.1./(B) Vx E X, B E B(X),
t-too
11.3 The w-geometrically ergodic case 177

which clearly implies (11.2.3) with QJ(·lx) and f..LJ(B) in lieu of P(·lx) and
aB, respectively. Thus part (a) follows from Theorem 11.2.1(b), and (b)
follows from (11.2.1). 0
Despite these facts, however, w-geometric ergodicity is not enough to
guarantee "good behavior" of the Markov control model with respect to
the optimality criteria in Definition 11.1.1. In particular, it does not ensure
the existence of sample path AC-optimal policies [Definition 11.1.1(b)]. It
is thus convenient to consider sample path AC-optimality in the restricted
sense of Definition 11.1.3. We shall consider two cases:

(1) Π̂ := Π_DS and P̂(X) := P(X);

(2) Π̂ := Π and P̂(X) := P_δ(X), the set defined in (11.1.19).

Case (1), which is dealt with in subsection A, is quite straightforward, but
it is explicitly stated here (Theorem 11.3.3) mainly for the purpose of com-
parison with the more technically demanding case (2). The latter case is
studied in subsection B, and, finally, in subsection C we consider a variance
minimization problem.

A. Optimality in Π_DS

By Theorem 10.3.1 we know that for a w-geometrically ergodic con-
trol model there exists a deterministic stationary policy f₀^∞ which is AC-
optimal in Π_DS, that is,

J(f₀) = ρ₀, with ρ₀ := inf_{f ∈ 𝔽} J(f).   (11.3.5)

[Furthermore, f₀^∞ is AC-optimal (in all of Π) if the one-stage cost c(x,a)
is nonnegative; see Remark 10.3.2.] We now have the following.
11.3.3 Theorem. Suppose that the Markov control model is w-geometrically
ergodic and let f₀^∞ be as in (11.3.5). Then:

(a) f₀^∞ is sample path AC-optimal with respect to Π_DS and P(X); that is,

J⁰(f₀^∞,ν) = ρ₀   P_ν^{f₀^∞}-a.s.   ∀ν ∈ P(X),

and
J⁰(f^∞,ν) ≥ ρ₀   P_ν^{f^∞}-a.s.   ∀f^∞ ∈ Π_DS, ν ∈ P(X).

(b) (f₀^∞,ν) is a "minimum pair in Π_DS" for each initial distribution ν in
P_w(X), the set defined in (11.3.2); that is,

J(f₀^∞,ν) = ρ₀   ∀ν ∈ P_w(X).

Proof. Part (a) follows from (11.3.5) and Proposition 11.3.2(b), and part
(b) from (11.3.3). □
Observe that we can write (11.3.1) as

J(f^∞,ν) = J(f)   ∀ν ∈ P_δ(X) := {δ_x | x ∈ X}.   (11.3.6)

Hence, as P_δ(X) is contained in P_w(X), we can see that Theorem 11.3.3(b)
gives a statement stronger than (11.3.5). More explicitly, the former is a
result valid for any initial distribution in P_w(X), whereas in Theorem 10.3.1
we proved (11.3.5) for "initial states" only, that is, for "initial distributions"
δ_x in P_δ(X) ⊂ P_w(X).

B. Optimality in Π

The following assumption is supposed to hold throughout the rest of this
section.
11.3.4 Assumption. The hypotheses of Theorem 10.3.6(a) are satisfied
and, in addition, there is a constant r̄ ≥ 0 such that

sup_{a ∈ A(x)} c²(x,a) ≤ r̄ w(x)   ∀x ∈ X.   (11.3.7)

In other words, Assumption 11.3.4 combines Assumptions 10.2.1, 10.2.2
and 10.3.5, but in addition to Assumption 10.2.1(d), the cost-per-stage
c(x,a) is also required to satisfy the "second order" condition (11.3.7).
By Theorem 10.3.6(a) and Theorem 10.3.4, there exists a canonical
triplet (ρ*, h*, f*) with h* in B_w(X); that is, ρ* ∈ ℝ, h* ∈ B_w(X), and
f*^∞ ∈ Π_DS satisfy (10.3.13), (10.3.14) and (10.3.15). Moreover, from (10.3.15)
it follows that ρ* satisfies (11.1.23). The following theorem states, in par-
ticular, that under the additional condition (11.3.7) there is a policy π* for
which (11.1.24) and (11.1.25) also hold.
11.3.5 Theorem. Suppose that Assumption 11.3.4 holds. Then:

(a) For each policy π ∈ Π and initial state x ∈ X,

J⁰(π,x) ≥ J̲⁰(π,x) ≥ ρ*   P_x^π-a.s.,   (11.3.8)

and the following statements (b), (c) and (d) are equivalent for a deter-
ministic stationary policy f^∞:

(b) f^∞ is AC-optimal;

(c) π* := f^∞ satisfies (11.1.24) and (11.1.25); that is, f^∞ is sample
path AC-optimal with respect to Π and P_δ(X) [see (11.1.19)];

(d) J(f^∞,ν) = ρ* for every initial distribution ν in P_w(X), the set de-
fined in (11.3.2).

Hence, by Theorem 10.3.6(a), there exists a deterministic stationary policy
f^∞ that satisfies (b), (c), (d). Furthermore, if the cost-per-stage function
c(x,a) is such that
c(x,a) is bounded below,   (11.3.9)
then (b), (c) and (d) are equivalent to:

(e) f^∞ is strong AC-optimal [Definition 11.1.1(c)].

The proof of Theorem 11.3.5 is given in subsection D.
11.3.6 Remark. (a) It is worth noting that Theorem 11.3.5(c) and the
second inequality in (11.3.8) give that π* := f^∞ is in fact strong sample
path AC-optimal in Π and P_δ(X), where "strong" means that (11.1.24)
and (11.1.25) are satisfied when the "lim sup" sample path cost J⁰(π,x) is
replaced by the "lim inf" sample path AC

J̲⁰(π,x) := lim inf_{n→∞} (1/n) J_n⁰(π,x).   (11.3.10)

(b) From (11.1.12) and (11.1.23) it is evident that

ρ_min ≤ ρ*.   (11.3.11)

Let us now suppose that (11.3.9) holds, and let π ∈ Π and ν ∈ P(X) be
arbitrary. Then, as

J_n⁰(π,ν) = ∫_X J_n⁰(π,x) ν(dx),

Fatou's Lemma and (11.3.10) yield that, P_ν^π-a.s.,

J̲⁰(π,ν) := lim inf_{n→∞} (1/n) J_n⁰(π,ν)
          ≥ ∫_X J̲⁰(π,x) ν(dx)
          ≥ ρ*   [by (11.3.8)].

From this inequality and (11.1.6) it follows that

J⁰(π,ν) ≥ J̲⁰(π,ν) ≥ ρ*   P_ν^π-a.s.   (11.3.12)

for arbitrary π ∈ Π and ν ∈ P(X). On the other hand, from (11.1.5),
(11.1.7), and again using Fatou's Lemma,

J(π,ν) ≥ J^I(π,ν) ≥ E_ν^π[J̲⁰(π,ν)] ≥ ρ*,

where the third inequality follows from (11.3.12). Thus

J(π,ν) ≥ ρ*   ∀π ∈ Π, ν ∈ P(X),   (11.3.13)

which implies the reverse of inequality (11.3.11), that is, ρ_min ≥ ρ*. Hence,
under (11.3.9),
ρ_min = ρ*.   (11.3.14)
Combining these facts we can easily obtain the following corollary of Theorem
11.3.5. □
11.3.7 Corollary. If Assumption 11.3.4 and also (11.3.9) are satisfied,
then there exists a deterministic stationary policy f*^∞ such that:

(a) (f*^∞,ν) is a minimum pair for all ν in P_w(X), and

(b) f*^∞ is sample path AC-optimal [in the sense of Definition 11.1.1(b)].

Proof. By Theorem 11.3.5, there exists an AC-optimal deterministic sta-
tionary policy f^∞. Thus, by (11.3.14) and (11.3.3),

J(f^∞,ν) = ρ* = ρ_min   ∀ν ∈ P_w(X),

and (a) follows. On the other hand, Proposition 11.3.2(b) yields

J⁰(f^∞,ν) = ρ* = ρ_min   P_ν^{f^∞}-a.s.   ∀ν ∈ P(X),

which together with (11.3.12) gives (b). □


C. Variance minimization

To state the variance-minimization problem we are interested in, let
(ρ*, h_*) be a solution to the ACOE (10.3.13)–(10.3.14), with h_* ≡ h*. Also,
for each state x ∈ X, let A*(x) ⊂ A(x) be the set defined in (10.3.35), and
recall (10.3.36): f ∈ 𝔽 is a canonical decision function if and only if f(x) is
in A*(x) for all x ∈ X. Moreover, in analogy with (11.2.12) [or (11.2.23)],
consider the function

ψ(x,a) := ∫_X h_*²(y) Q(dy|x,a) − [∫_X h_*(y) Q(dy|x,a)]².   (11.3.15)

As usual, for each f ∈ 𝔽 we shall write

ψ_f(x) := ψ(x, f(x)).   (11.3.16)

Then, under the same hypotheses as Theorem 11.3.5, we get the follow-
ing.
11.3.8 Theorem. (Existence of minimum-variance policies.) If As-
sumption 11.3.4 is satisfied, then there exist a constant σ_*² ≥ 0, a canonical
decision function f_* ∈ 𝔽_ca, and a function V*(·) in B_w(X) such that for
each x ∈ X:

σ_*² + V*(x) = min_{a ∈ A*(x)} [ψ(x,a) + ∫_X V*(y) Q(dy|x,a)]   (11.3.17)

             = ψ_{f_*}(x) + ∫_X V*(y) Q_{f_*}(dy|x).

Furthermore, f_*^∞ minimizes the limiting average variance Var(f^∞,x) in
(11.1.8) over the set of AC-optimal policies f^∞, and Var(f_*^∞,·) ≡ σ_*²; in
fact,
Var(f_*^∞,x) = μ_{f_*}(ψ_{f_*}) = σ_*²   ∀x ∈ X,   (11.3.18)
and
σ_*² ≤ Var(f^∞,x)   ∀f ∈ 𝔽_AC, x ∈ X,   (11.3.19)
where 𝔽_AC is the set of AC-optimal decision functions [see (10.9.2)].
Comparing (11.3.17) with the ACOE (10.3.13)–(10.3.14), we see that
(11.3.17) looks very much like an "ACOE" for some Markov control problem.
This is indeed the case, as shown in the proof of Theorem 11.3.8 (see
subsection E).
D. Proof of Theorem 11.3.5

We shall begin with some preliminary results concerning the weight func-
tion
v := w^{1/2}.   (11.3.20)

Consider the inequality (10.2.4), which is equivalent to (10.2.1), with b(·)
a constant [for instance, replace b(·) by the constant b := ‖b(·)‖], i.e.,

∫_X w(y) Q_f(dy|x) ≤ βw(x) + b   ∀f ∈ 𝔽, x ∈ X.   (11.3.21)

Then, "taking the square root" of both sides of (11.3.21) and using Jensen's
inequality, we see that v := w^{1/2} satisfies

∫_X v(y) Q_f(dy|x) ≤ β₀v(x) + b₀   ∀f ∈ 𝔽, x ∈ X,   (11.3.22)

with β₀ := β^{1/2} < 1 and b₀ := b^{1/2}.
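The "square root" step can be spelled out: it is Jensen's inequality for the concave map t ↦ t^{1/2} applied to the probability measure Q_f(·|x), followed by the subadditivity bound (a+b)^{1/2} ≤ a^{1/2} + b^{1/2} for a, b ≥ 0:

```latex
\int_X v\,dQ_f(\cdot\mid x)
  = \int_X w^{1/2}\,dQ_f(\cdot\mid x)
  \le \Bigl(\int_X w\,dQ_f(\cdot\mid x)\Bigr)^{1/2}
  \le \bigl(\beta w(x) + b\bigr)^{1/2}
  \le \beta^{1/2}\,v(x) + b^{1/2},
```

which is precisely (11.3.22) with β₀ := β^{1/2} and b₀ := b^{1/2}.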


Now, as in (7.2.1), consider the v-norm

‖u‖_v := sup_x |u(x)|/v(x)

of a real-valued function u on X. Then, since w(·) ≥ v(·) ≥ 1, we have

‖u‖_w ≤ ‖u‖_v.

On the other hand, if R(dy|x) denotes a signed kernel on X [Definition
7.2.3(b)], then its v-norm [see (7.2.8)]

‖R‖_v := sup_x v(x)^{−1} ∫_X v(y) |R|(dy|x)

satisfies
‖R‖_v² ≤ ‖R‖_w,

by (11.3.20) and Jensen's inequality. Finally, as in the proof of Theorem
11.2.4(a), the inequality (11.3.7), which can also be written as

c_f²(x) ≤ r̄w(x)   ∀x ∈ X, f ∈ 𝔽,

implies that h_f² is in B_w(X), where h_f is the function in (10.2.19); hence, h_f
is in the space B_v(X) of measurable functions on X with a finite v-norm.
From these remarks and (10.2.7) [or (10.2.9)] we obtain the following.
11.3.9 Lemma. For each deterministic stationary policy f^∞:

(a) The state (Markov) process {x_t} is v-geometrically ergodic; that is,

|∫_X u(y) Q_f^t(dy|x) − μ_f(u)| ≤ ‖u‖_v R₀ ρ₀^t v(x)   [cf. (10.2.9)]

for all x ∈ X and t = 0,1,..., where ρ₀ := ρ^{1/2} < 1 and R₀ := R^{1/2};

(b) The unique solution (J(f), h_f) of (10.2.21)–(10.2.22) is such that the
bias function h_f is in B_v(X).
The next two lemmas are crucial for the proof of Theorem 11.3.5.
11.3.10 Lemma. For each policy π ∈ Π and initial state x ∈ X:

(a) E_x^π[Σ_{t=1}^{∞} t^{−2} w(x_t)] < ∞;

hence the following statements hold P_x^π-a.s.:

(b) Σ_{t=1}^{∞} t^{−2} w(x_t) < ∞,

(c) t^{−2} w(x_t) → 0, and

(d) t^{−1} v(x_t) → 0.

Proof. By (11.3.20), it is clear that (a) ⇒ (b) ⇒ (c) ⇒ (d). Thus, it suffices
to prove (a). This, however, follows directly from (10.4.2), which yields

E_x^π[w(x_t)] ≤ k w(x)   ∀t = 0,1,...,

for some constant k. □


Choose an arbitrary policy π ∈ Π and initial state x ∈ X, and let

    F_t := σ(x₀, a₀, …, x_t, a_t)   (11.3.23)

be the σ-algebra generated by the state and control variables up to time t.
Moreover, let h_* be as in (11.3.15) [that is, (ρ*, h_*) is the solution to the
ACOE (10.3.13)], and define the random variables, for t = 1, 2, …,

    Y_t(π, x) := h_*(x_t) − E_x^π[h_*(x_t) | F_{t−1}]   (11.3.24)

and

    M_n(π, x) := Σ_{t=1}^n Y_t(π, x).   (11.3.25)

11.3.11 Lemma. For each policy π ∈ Π and initial state x ∈ X, the
sequence {M_n(π, x), F_n} is a P_x^π-martingale, and

    lim_{n→∞} (1/n) M_n(π, x) = 0   P_x^π-a.s.   (11.3.26)

Proof. Choose an arbitrary policy π ∈ Π and initial state x ∈ X. For
notational ease, in (11.3.23)–(11.3.25) we shall write

    Y_t := Y_t(π, x)  and  M_n := M_n(π, x).   (11.3.27)

Now note that (ρ*, h_*) is a solution to the Poisson equation (10.3.14) and
so, by Lemma 11.3.9(b),

    h_* is in B_v(X), that is, |h_*(·)| ≤ ‖h_*‖_v v(·).   (11.3.28)

Hence, by (11.3.24), (11.3.27) and (11.3.28),

    |Y_t| ≤ |h_*(x_t)| + E_x^π[|h_*(x_t)| | F_{t−1}],

so that, by (11.3.28),

    |Y_t| ≤ ‖h_*‖_v { v(x_t) + E_x^π[v(x_t) | F_{t−1}] },   (11.3.29)

and it follows that [by Remark 11.2.5(a)] E_x^π|Y_t| < ∞ for each t.
This inequality and (11.3.25) show that M_n is P_x^π-integrable for each n. On


the other hand, it is clear that M_n is F_n-measurable and that [by Remark
11.2.5(b₂)]

    E_x^π[Y_t | F_{t−1}] = 0   for t = 1, 2, ….

Therefore, {M_n, F_n} is a martingale, which proves the first part of the
lemma.

To prove (11.3.26) we shall use the Martingale Stability Theorem in
Remark 11.2.6(b). Hence, it suffices to show that [as in (11.2.36) with
q = 2]

    Σ_{t=1}^∞ t⁻² E_x^π(Y_t² | F_{t−1}) < ∞   P_x^π-a.s.   (11.3.30)

To prove this, first we use (11.2.26) and the fact that v² = w to see that
(11.3.29) yields [using Remark 11.2.5(d)]

    Y_t² ≤ 2 ‖h_*‖_v² { w(x_t) + E_x^π[w(x_t) | F_{t−1}] }.

Hence [by Remark 11.2.5(b₂)]

    E(Y_t² | F_{t−1}) ≤ 4 ‖h_*‖_v² E[w(x_t) | F_{t−1}] ≤ k w(x_{t−1}),   (11.3.31)

where k is a constant and the second inequality is obtained from the first
inequality in the proof of Lemma 10.4.1 together with the fact that w(·) ≥ 1.
Finally, (11.3.30) follows from (11.3.31) and Lemma 11.3.10(a) because

    Σ_{t=1}^∞ t⁻² w(x_{t−1}) ≤ w(x₀) + Σ_{t=1}^∞ t⁻² w(x_t) < ∞   P_x^π-a.s.

This completes the proof of Lemma 11.3.11. □


We are now ready for the proof of Theorem 11.3.5.

Proof of Theorem 11.3.5. (a) Choose an arbitrary policy π ∈ Π and
initial state x ∈ X, and consider the so-called AC-discrepancy function

    D(x, a) := c(x, a) + ∫_X h_*(y) Q(dy|x, a) − h_*(x) − ρ*.   (11.3.32)

Observe that the ACOE (10.3.13) can be written as

    min_{a ∈ A(x)} D(x, a) = 0   ∀x ∈ X,

and that D is nonnegative. Moreover, using again the notation (11.3.27),
we may write (11.3.24) as

    Y_t = h_*(x_t) − ∫_X h_*(y) Q(dy|x_{t−1}, a_{t−1})
        = h_*(x_t) − h_*(x_{t−1}) − D(x_{t−1}, a_{t−1}) + c(x_{t−1}, a_{t−1}) − ρ*,

and (11.3.25) becomes

    M_n = h_*(x_n) − h_*(x₀) − Σ_{t=1}^n D(x_{t−1}, a_{t−1}) + J_n⁰(π, x) − nρ*.

Thus, since D ≥ 0,

    J_n⁰(π, x) ≥ nρ* + M_n + h_*(x₀) − h_*(x_n).   (11.3.33)

Finally, note that (11.3.28) and Lemma 11.3.10(d) imply that |h_*(x_n)|/n →
0 P_x^π-a.s. as n → ∞, and, similarly, M_n/n → 0 P_x^π-a.s. by (11.3.26). There-
fore, multiplying both sides of (11.3.33) by 1/n and taking lim inf as n → ∞
we obtain the second inequality in (11.3.8). As the first inequality is obvi-
ous, we thus have (11.3.8).
We will next prove the equivalence of (b), (c) and (d).

(b) ⇒ (c). Let f^∞ ∈ Π_DS be an AC-optimal policy, the existence of
which is ensured by Theorem 10.3.6(a). Then, by Proposition 11.3.2(b),

    J⁰(f^∞, x) = J(f) = ρ*   P_x^{f^∞}-a.s. ∀x ∈ X.

This fact and (11.3.8) yield (c).

(c) ⇒ (b). Suppose that f^∞ is sample path AC-optimal with respect to
Π and P₀(X); that is, π* := f^∞ satisfies (11.1.24) and, moreover, (11.1.25)
holds. Then, by (11.3.1),

    J(f^∞, x) = J(f) = ρ*   ∀x ∈ X,   (11.3.34)

which together with (11.1.23) implies (b).

(b) ⇔ (d). This follows from (11.3.34) and (11.3.3).

Finally, let us suppose that (11.3.9) is satisfied. Then, by Definition
11.1.1(c), we have (e) ⇒ (b). Conversely, suppose that f^∞ ∈ Π_DS is AC-
optimal. Then f^∞ satisfies (11.3.34) and, on the other hand, (11.3.8) and
Fatou's Lemma give

    ρ* ≤ lim inf_{n→∞} E_x^π[J_n⁰(π, x)]/n = J₁(π, x)   [by (11.1.7) and (11.1.3)]

for any policy π and initial state x. Therefore, (b) ⇒ (e).

This completes the proof of Theorem 11.3.5. □
E. Proof of Theorem 11.3.8

We shall first state some preliminary facts.

Let f ∈ 𝔽 be an arbitrary decision function. By Lemma 11.3.9(b), the
bias function h_f is in B_v(X), and, therefore, the function

    (11.3.35)

for x ∈ X, belongs to B_w(X)—recall (11.3.20). Furthermore, comparing
(11.3.35) with (11.2.12), we obtain from Theorem 11.2.4(b)

    (11.3.36)

where, by (11.2.11),
    (11.3.37)

Let us now suppose that f ∈ 𝔽 is a canonical decision function, that
is, f is in 𝔽_ca. Then (J(f), h_f) = (ρ*, h_f) satisfies the ACOE (10.3.13),
(10.3.14), and, in addition, Theorem 10.3.7 yields that

    (11.3.38)

for some constant k_f, where h_* (:= h*) is the function in (10.3.13)–(10.3.14).
Thus, from (11.3.38), (11.3.35) and (11.3.15)–(11.3.16) we obtain that

    (11.3.39)

On the other hand, if f^∞ ∈ Π_DS is AC-optimal, Theorem 10.3.6(b) gives
that (11.3.38) holds μ_f-almost everywhere; that is, there exists a Borel set
N = N_f in X such that μ_f(N) = 0 and

    (11.3.40)

where N^c denotes the complement of N. We also get the following.


11.3.12 Lemma. Suppose that f^∞ ∈ Π_DS is AC-optimal, but it is not
canonical, that is, f is in 𝔽_AC\𝔽_ca [see (10.3.2)]. Then there exists a canon-
ical decision function f̂ ∈ 𝔽_ca such that μ_f̂ = μ_f, and

    Var(f̂^∞, x) = Var(f^∞, x) = σ²(f)   ∀x ∈ X,   (11.3.41)

where σ²(f) is the constant in (11.3.36), (11.3.37).

Proof. Let g ∈ 𝔽_ca be a canonical decision function—whose existence is
assured by Theorem 10.3.6(a)—and define a new function f̂ as

    f̂ := g on N,  and  f̂ := f on N^c,

where N = N_f is the μ_f-null set in (11.3.40). Then f̂ is canonical, and

    (11.3.42)

Hence, on the one hand, (11.3.42) yields (by definition of i.p.m.)

    (11.3.43)

and, on the other hand, (11.3.42) and (11.3.40) give [by (11.3.35) and
(11.3.16)]

    (11.3.44)

Therefore, (11.3.41) follows from (11.3.43)–(11.3.44) and (11.3.36)–(11.3.37).
□
With these preliminaries we can now easily prove Theorem 11.3.8.

Proof of Theorem 11.3.8. Let A*(x) ⊂ A(x) and ψ(x, a) be as in
(10.3.35) and (11.3.15), respectively, and recall (10.3.36) and (11.3.16).
Consider the new Markov control model

    M_var := (X, A, {A*(x) | x ∈ X}, Q, c̄),

where c̄(x, a) := ψ(x, a). It is easily verified that M_var satisfies the hy-
potheses of Theorem 10.3.6(a), replacing c(x, a) and A(x) with c̄(x, a) and
A*(x). Therefore, by Theorem 10.3.6(a), there exists a canonical triplet
(σ*², V*, f*) for M_var, with V* in B_w(X); that is, there exists a constant
σ*² ≥ 0, a function V* in B_w(X), and a canonical decision function f* ∈ 𝔽_ca
that satisfy (11.3.17). Moreover, as in (10.3.15),

    σ*² = μ_{f*}(ψ_{f*}) = Var(f*^∞, x)   ∀x ∈ X,   (11.3.45)

and
    σ*² ≤ μ_f(ψ_f) = Var(f^∞, x)   ∀f ∈ 𝔽_ca, x ∈ X,   (11.3.46)

where the second equality in (11.3.45) and (11.3.46) follows from (11.3.39)
and (11.3.36)–(11.3.37). Finally, to verify (11.3.19), observe that (11.3.46)
and Lemma 11.3.12 yield

    σ*² ≤ Var(f^∞, x)   for every AC-optimal f^∞ ∈ Π_DS and x ∈ X.

This completes the proof of Theorem 11.3.8. □

Notes on §11.3

1. This section is based on Hernández-Lerma, Vega-Amaya and Carrasco
[1]. For related results concerning Markov control models with a finite state
space X, see Mandl [1], [2], and Mandl and Lausmanová [1]. The equality
preceding (11.3.33) was perhaps first noted by Mandl [1]. Kurano [1] con-
siders the variance-minimization problem for Markov control models with
a transition law Q(·|x, a) that satisfies a Doeblin-like condition, and that
are absolutely continuous with respect to some given reference measure. In
the finite state case, the minimization of the "limiting average variance"
defined as in (11.2.43) has been studied in the references given in Note 3
of §11.2. For additional references on sample path AC optimality see §5.7,
and Note 1 in §11.4.

2. By (10.3.31), the optimal bias function ĥ [defined in (10.3.27)] and
the function h_* (:= h*) in (11.3.15) satisfy ĥ(·) = h_*(·) − k for some
constant k. Hence, by (11.3.39) the variance σ²(f) in (11.3.36) or (11.3.37)
can also be written as

    σ²(f) = ∫_X { ĥ²(x) − [∫_X ĥ(y) Q_f(dy|x)]² } μ_f(dx),

for f ∈ 𝔽_ca. Similarly, the "cost-per-stage" function ψ(x, a) in (11.3.15)
can be written as

    ψ(x, a) = ∫_X [ ĥ(y) − ∫_X ĥ(z) Q(dz|x, a) ]² Q(dy|x, a).

These expressions suggest that the bias-optimal policies and the minimum-
variance policies f*^∞ in Theorem 11.3.8 should be related in some sense,
but it is an open question what this relation (if any) should be.

11.4 Strictly unbounded costs

In this section we consider another Markov control model M for which
there exists a sample path AC-optimal policy [Definition 11.1.1(b)]. We shall
suppose that M satisfies the following assumption, which was already used
in §5.7 (Condition 5.7.4).

11.4.1 Assumption. (a) J(π, x) < ∞ for some policy π and some initial
state x.

(b) The one-stage cost function c(x, a) is nonnegative and lower semicon-
tinuous (l.s.c.), and the set {a ∈ A(x) | c(x, a) ≤ r} is compact for each
x ∈ X and r ∈ ℝ.

(c) c(x, a) is strictly unbounded, that is [with 𝕂 as in (8.2.1)], there is a
nondecreasing sequence of compact sets K_n ↑ 𝕂 such that

    lim_{n→∞} inf{ c(x, a) | (x, a) ∉ K_n } = ∞.

(d) The transition law Q is weakly continuous, that is,

    (x, a) ↦ ∫ u(y) Q(dy|x, a)

is a continuous bounded function on 𝕂 for every continuous bounded
function u on X.

11.4.2 Remark. (a) A function that satisfies Assumption 11.4.1(c) is also
known as a norm-like function or as a moment function. Each of the fol-
lowing conditions implies that c(x, a) is strictly unbounded:

(a₁) The state space X is compact.

(a₂) c is inf-compact, that is, the level set {(x, a) ∈ 𝕂 | c(x, a) ≤ r} is
compact for every number r ≥ 0.

(a₃) X and A are σ-compact (Borel) spaces; A(x) is compact for every
x ∈ X, and the set-valued mapping x ↦ A(x) is u.s.c. Moreover, the
function c*(x) := inf_{a∈A(x)} c(x, a) satisfies that for every r ≥ 0 there is
a compact set K_r for which

    c*(x) ≥ r   ∀x ∉ K_r.

(These conditions were already discussed in Remark 5.7.5.)

(b) We will use Assumption 11.4.1(c) here and in Chapter 12 in the same
way we did in Chapters 5 and 6: If M is a family of probability measures
on 𝕂 such that

    sup_{μ∈M} ∫_𝕂 c(x, a) μ(d(x, a)) < ∞,   (11.4.1)

then by Theorems 12.2.15 and 12.2.16, for each sequence {μ_n} in M there
is a subsequence {μ_{n_i}} and a p.m. μ on 𝕂 (but not necessarily in M) such
that {μ_{n_i}} converges weakly to μ, which means that

    lim_{i→∞} ∫ v dμ_{n_i} = ∫ v dμ   ∀v ∈ C_b(𝕂),   (11.4.2)

where C_b(𝕂) denotes the Banach space of continuous bounded functions on
𝕂 with the sup norm. In this section "weak" convergence refers to (11.4.2),
but in Chapter 12 we consider other forms of "weak" convergence of mea-
sures. □
Let Φ and Π_RS be as in Definitions 8.2.1 and 8.2.2. Instead of (8.2.5),
sometimes we shall use the notation

    c_φ(x) := ∫_A c(x, a) φ(da|x),   Q_φ(·|x) := ∫_A Q(·|x, a) φ(da|x).   (11.4.3)

For a decision function f ∈ 𝔽 (⊂ Φ), the notation (11.4.3) reduces to
(10.2.2), (10.2.15).

11.4.3 Definition. A randomized stationary policy φ^∞ (in particular, a
deterministic stationary policy f^∞ ∈ Π_DS ⊂ Π_RS) is said to be

(a) stable if [using the notation (11.4.3)] the transition law Q_φ admits
an i.p.m. p_φ, and the AC J(φ^∞, p_φ) is finite and such that

    J(φ^∞, p_φ) = ∫_X c_φ(x) p_φ(dx).   (11.4.4)

(b) positive Harris recurrent if φ^∞ is stable and Q_φ is Harris recurrent.

The following proposition states some useful properties of stable poli-
cies. In part (b) of the proposition, ρ_min and ρ* are the numbers defined in
(11.1.12) and (11.1.23), respectively.

11.4.4 Proposition. (a) If φ^∞ ∈ Π_RS is stable, then

    J(φ^∞, p_φ) = ∫_X J(φ^∞, x) p_φ(dx).   (11.4.5)

(b) If there exists a stable policy φ^∞ ∈ Π_RS such that (φ^∞, p_φ) is a
minimum pair, that is [by Definition 11.1.1(a)],

    J(φ^∞, p_φ) = ρ_min,   (11.4.6)

then
    ρ_min = ρ*.   (11.4.7)

(c) Let φ^∞ ∈ Π_RS be a stable policy. Then (φ^∞, p_φ) is a minimum pair if
and only if

    J(φ^∞, x) = ρ*   for p_φ-a.a. x ∈ X.   (11.4.8)

Proof. (a) By (11.1.5) and the Individual Ergodic Theorem (7.5.24), the
limit

    J(φ^∞, x) = lim_{n→∞} J_n(φ^∞, x)/n

exists p_φ-a.e. and satisfies

    ∫_X J(φ^∞, x) p_φ(dx) = ∫_X c_φ(x) p_φ(dx).

The latter equality and (11.4.4) give (11.4.5).

(b) By the definitions (11.1.12) and (11.1.23) of ρ_min and ρ*, it is evident
that ρ_min ≤ ρ*. On the other hand, J(φ^∞, x) ≥ ρ* for all x ∈ X and,
therefore, (11.4.6) and (11.4.5) give the reverse inequality,

    ρ_min = ∫_X J(φ^∞, x) p_φ(dx) ≥ ρ*.

(c) This part is a direct consequence of (a) and (b). □
Before stating the main result in this section, Theorem 11.4.6, let us re-
call (from Chapter 5) the facts in the following lemma. Observe that part
(a) in the lemma ensures the existence of a policy φ^∞ that satisfies the hy-
potheses of Proposition 11.4.4(b)—hence (11.4.7) holds. Moreover, part (c)
states that if, among other things, (11.4.8) holds for all states x, then there
exists a deterministic stationary policy which is AC-optimal [see (11.1.10)].

11.4.5 Lemma. Suppose that Assumption 11.4.1 is satisfied. Then:

(a) There exists a stable policy φ_*^∞ ∈ Π_RS such that (φ_*^∞, p_{φ_*}) is a mini-
mum pair.

(b) If in addition the policy φ_*^∞ in (a) is positive Harris recurrent, then
its sample path AC J⁰(φ_*^∞, ·) satisfies

    J⁰(φ_*^∞, ν) = ρ*   P_ν^{φ_*^∞}-a.s. ∀ν ∈ P(X),   (11.4.9)

where ρ* = ρ_min [by (a) and Proposition 11.4.4(b)].

(c) Suppose that there exists a randomized stationary policy φ^∞ such that

    J(φ^∞, x) = ρ*   ∀x ∈ X,   (11.4.10)

and let

    h(x) := lim inf_{N→∞} (1/N) Σ_{n=0}^{N−1} h_n(x)   (≥ 0)   (11.4.11)

where, for n ≥ 1,

    h_n(x) := J_n(φ^∞, x) − M_n

and h₀(·) = M₀ := 0. If h is a real-valued function, then there exists a
deterministic stationary policy f^∞ which is AC-optimal with value
function ρ*, that is,

    J(f^∞, x) = J*(x) = ρ*   ∀x ∈ X.   (11.4.12)

Proof. Part (a) is the same as Theorem 5.7.9(a), whereas (11.4.9) follows
from Theorem 11.2.1(a). Furthermore, part (c) is a consequence of Theorem
5.4.3(ii), which states that (11.4.10) and (11.4.11) yield the Average Cost
Optimality Inequality (ACOI)

    ρ* + h(x) ≥ c_φ(x) + ∫_X h(y) Q_φ(dy|x)
             ≥ min_{a ∈ A(x)} [ c(x, a) + ∫_X h(y) Q(dy|x, a) ].   (11.4.13)

Hence, by the usual argument, there exists a decision function f ∈ 𝔽 such
that f(x) ∈ A(x) attains the minimum in (11.4.13) for every x ∈ X, that
is,

    ρ* + h(x) ≥ c_f(x) + ∫_X h(y) Q_f(dy|x)   ∀x ∈ X,

which yields (11.4.12). □
We shall now state our main result, which, in particular, in part (c) gives
conditions for the existence of sample path AC-optimal policies. Also note
that (11.4.16) is a statement stronger than (11.4.7) because it uses the lim
inf expected AC in (11.1.7).

11.4.6 Theorem. Suppose that Assumption 11.4.1 is satisfied. Then:

(a) For each policy π and initial distribution ν,

    lim inf_{n→∞} J_n⁰(π, ν)/n ≥ ρ*   P_ν^π-a.s.;   (11.4.14)

hence
    J₁(π, ν) ≥ ρ*,   (11.4.15)

and
    ρ* = inf_{ν∈P(X)} inf_{π∈Π} J₁(π, ν).   (11.4.16)

(b) If π* ∈ Π is an AC-optimal policy and the AC-value function J*(·)
equals ρ*, that is,

    J(π*, x) = J*(x) = ρ*   ∀x ∈ X,   (11.4.17)

then π* is strong AC-optimal and

    lim inf_{n→∞} J_n⁰(π*, ν)/n = ρ*   P_ν^{π*}-a.s. ∀ν ∈ P(X).   (11.4.18)

(c) If the policy φ_*^∞ in Lemma 11.4.5 is positive Harris recurrent, then
it is sample path AC-optimal; in fact, every positive Harris recurrent
policy φ^∞ in Π_RS (or in Π_DS) for which (φ^∞, p_φ) is a minimum pair
is also sample path AC-optimal.

Proof. As the proof of (11.4.14) is quite "technical", to simplify the expo-
sition we will first suppose that it holds and prove the remaining parts of
the theorem; then we will prove (11.4.14).

Suppose that (11.4.14) is satisfied, and choose an arbitrary policy π and
initial distribution ν. Then (11.1.3), (11.1.7), and Fatou's Lemma yield

    J₁(π, ν) ≥ E_ν^π [ lim inf_{n→∞} J_n⁰(π, ν)/n ] ≥ ρ*;   (11.4.19)

that is, (11.4.15) holds. Moreover, as [by (11.1.5)]

    J(π, ν) ≥ J₁(π, ν),   (11.4.20)

(11.4.16) follows from (11.4.19) and the definition (11.1.12) of ρ*.

Proof of (b). If π* satisfies (11.4.17), then (11.4.16) yields that π* is
strong AC-optimal, and, on the other hand, using (11.4.20) and (11.4.19)
with ν = δ_x,

    ρ* = J(π*, x) ≥ E_x^{π*}[ lim inf_{n→∞} J_n⁰(π*, x)/n ] ≥ ρ*,

i.e.,
    E_x^{π*}[ lim inf_{n→∞} J_n⁰(π*, x)/n ] = ρ*   ∀x ∈ X.

This equality and (11.4.14) give (11.4.18).

Proof of (c). Part (c) follows from (11.4.14) and (11.4.8).

We have now completed the proof of Theorem 11.4.6 except for the key
fact (11.4.14). The proof of the latter is based on Remark 11.4.2(b), Lemma
9.4.4, and the next two lemmas. (Lemma 11.4.7 combines Propositions A.2
and E.2 in Appendices A and E, respectively.)

11.4.7 Lemma. Let S be a metric space with Borel σ-algebra B(S), and
let C_b(S) be the Banach space of real-valued continuous bounded functions
on S.

(a) A function u : S → ℝ is l.s.c. and bounded below if and only if there
exists a nondecreasing sequence of functions u_n in C_b(S) such that
u_n ↑ u pointwise.

(b) Let u : S → ℝ be l.s.c. and bounded below, and let μ, μ_n (n = 1, 2, …)
be probability measures on B(S) such that μ_n converges weakly to μ,
that is,

    lim_{n→∞} ∫_S v dμ_n = ∫_S v dμ   ∀v ∈ C_b(S).

Then
    lim inf_{n→∞} ∫_S u dμ_n ≥ ∫_S u dμ.

In Lemma 11.4.8 below we use the following terminology and notation.
Let (S, T) be a separable metrizable space, that is, a separable topological
space for which there exists a metric d on S consistent with the topology T.
For each metric d on S we denote by U(S, d) the subfamily of functions in
C_b(S) which are uniformly continuous with respect to d. We take U(S, d)
to have the relative topology of C_b(S).

11.4.8 Lemma. Let (S, T) be a separable metrizable space. Then there
exists a metric d* on S consistent with T such that:

(a) the family U(S, d*) is separable;

(b) for each function u in C_b(S) there exist sequences {u_n′} and {u_n″} in
U(S, d*) such that u_n′ ↑ u and u_n″ ↓ u pointwise as n → ∞.

Proof. See Bertsekas and Shreve [1], Corollary 7.6.1 (p. 113), Proposition
7.9 (p. 116), and Lemma 7.7 (p. 125). □
Proof of (11.4.14). Choose an arbitrary policy π and initial distribu-
tion ν, and let (Ω, F, P_ν^π) be the "canonical" probability space in Remark
8.2.3(c). Furthermore, define on Ω a random variable J as in the left-hand
side of (11.4.14), that is,

    J := lim inf_{n→∞} J_n⁰(π, ν)/n   (11.4.21)

with J_n⁰(π, ν) as in (11.1.4). If for some sample path ω = (x₀, a₀, x₁, a₁, …)
of the state-action process it occurs that J(ω) = +∞, then (11.4.14) triv-
ially holds. Thus without loss of generality we may restrict to sample paths
in the set Ω′ := {ω | J(ω) < ∞}. Now consider the empirical measures

    γ_n(Γ) := (1/n) Σ_{t=0}^{n−1} I_Γ(x_t, a_t)   for Γ ∈ B(X × A), n = 1, 2, ….

By (8.2.3), each γ_n is a (random) probability measure on X × A concen-
trated on the set 𝕂 defined in (8.2.1), and, moreover, we can write J as

    J = lim inf_{n→∞} ∫_𝕂 c(x, a) γ_n(d(x, a)).
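The empirical measures γ_n are just occupation frequencies of state-action pairs, and (1/n)J_n⁰ = ∫ c dγ_n by construction. A toy simulation makes the identity concrete (the chain, cost, and dynamics below are hypothetical placeholders, not from the text):

```python
import random
from collections import Counter

random.seed(0)
c = {(x, a): x + 2 * a for x in range(3) for a in range(2)}  # toy cost on K

# Simulate a toy state-action path (these dynamics are arbitrary).
path, x = [], 0
for t in range(1000):
    a = random.randrange(2)
    path.append((x, a))
    x = (x + a + random.randrange(2)) % 3

n = len(path)
gamma_n = {pair: cnt / n for pair, cnt in Counter(path).items()}  # empirical p.m.

avg_cost = sum(c[xa] for xa in path) / n                # (1/n) * J_n^0
integral = sum(c[xa] * gamma_n[xa] for xa in gamma_n)   # integral of c d(gamma_n)
assert abs(avg_cost - integral) < 1e-9                  # the two agree
assert abs(sum(gamma_n.values()) - 1.0) < 1e-9          # gamma_n is a p.m.
```

So the liminf average cost J is exactly the liminf of the integrals of c against the (random) occupation measures, which is what the two-step argument below exploits.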

To prove (11.4.14) we will proceed in two steps. First we will show that:

(i) For each ω ∈ Ω′ there exists a probability measure γ_ω on X × A,
concentrated on 𝕂, such that

    J(ω) ≥ ∫_𝕂 c(x, a) γ_ω(d(x, a)).   (11.4.22)

Thus, by Lemma 9.4.4, there exists a stochastic kernel φ_ω ∈ Φ such that

    γ_ω(d(x, a)) = φ_ω(da|x) γ̄_ω(dx),   (11.4.23)

where γ̄_ω(·) := γ_ω(· × A) is the marginal of γ_ω on X.

In the second step we will show that:

(ii) For P_ν^π-almost all ω, the randomized stationary policy φ_ω^∞ is stable,
with i.p.m. p_{φ_ω} = γ̄_ω.

Therefore, by Definition 11.4.3(a) and (11.4.22),

    J(ω) ≥ ∫_X c_{φ_ω}(x) γ̄_ω(dx) = J(φ_ω^∞, p_{φ_ω}) ≥ ρ*,

and so the proof of (11.4.14) will be complete.

Proof of (i). Choose an arbitrary ω ∈ Ω′ and a sequence {n_i} such that

    J(ω) = lim_{i→∞} ∫_𝕂 c(x, a) γ_{n_i}^ω(d(x, a)).

Then

    sup_i ∫_𝕂 c(x, a) γ_{n_i}^ω(d(x, a)) < ∞,

which [as in Remark 11.4.2(b)] implies the existence of a p.m. γ_ω on 𝕂 and
a subsequence {m_i} of {n_i} such that {γ_{m_i}^ω} converges weakly to γ_ω, that
is, as i → ∞,

    ∫_𝕂 v dγ_{m_i}^ω → ∫_𝕂 v dγ_ω   ∀v ∈ C_b(𝕂).   (11.4.24)

Thus, since c(x, a) is l.s.c. and nonnegative [Assumption 11.4.1(b)], we ob-
tain (11.4.22) from Lemma 11.4.7(b).

Proof of (ii). In Lemma 11.4.8 take S = X, the state space, and let 𝒰
be a countable dense subset of U(X, d*). For each function u ∈ 𝒰 define on
𝕂 the function

    Lu(x, a) := u(x) − ∫_X u(y) Q(dy|x, a),

and let us consider two random sequences {M_n(u)} and {Y_t(u)} as in
(11.3.24) and (11.3.25), that is,

    M_n(u) := Σ_{t=1}^n Y_t(u),

with
    Y_t(u) := u(x_t) − E_ν^π[u(x_t) | x_{t−1}, a_{t−1}].

Observe that we can also write Y_t(u) as

    Y_t(u) = Lu(x_t, a_t) + E_ν^π[u(x_{t+1}) | x_t, a_t] − E_ν^π[u(x_t) | x_{t−1}, a_{t−1}],

so that

    M_n(u) = Σ_{t=1}^n Lu(x_t, a_t) + E_ν^π[u(x_{n+1}) | x_n, a_n] − E_ν^π[u(x₁) | x₀, a₀].   (11.4.25)

As in Lemma 11.3.11, one can show that {M_n(u), F_n} is a martingale and,
on the other hand, as u is bounded, so is the sequence {E_ν^π(Y_t(u)² | F_{t−1})}.
Hence, by the Martingale Stability Theorem in Remark 11.2.6(b),

    lim_{n→∞} (1/n) M_n(u) = 0   P_ν^π-a.s.

Equivalently, from (11.4.25) and noting that Lu and the two conditional
expectations in (11.4.25) are bounded, we have

    lim_{n→∞} ∫_𝕂 Lu(x, a) γ_n^ω(d(x, a)) = 0   ∀ω ∈ Ω_u,

where Ω_u is a measurable subset of Ω such that P_ν^π(Ω_u) = 1. Thus

    lim_{n→∞} ∫_𝕂 Lu(x, a) γ_n^ω(d(x, a)) = 0   ∀u ∈ 𝒰, ω ∈ Ω*,   (11.4.26)

where
    Ω* := ∩_{u∈𝒰} Ω_u.

Moreover, by Assumption 11.4.1(d), the function Lu is in C_b(𝕂) for every
u in 𝒰. Hence, for each ω in Ω* there is a sequence {m_i(ω)} as in (11.4.24),
so that, by (11.4.26),

    ∫_𝕂 Lu(x, a) γ_ω(d(x, a)) = 0   ∀u ∈ 𝒰.

In fact, by Lemma 11.4.8(b), the latter equality holds for all u in C_b(X),
i.e.,

    ∫_𝕂 Lu(x, a) γ_ω(d(x, a)) = 0   ∀u ∈ C_b(X),

which [by (11.4.23)] can also be written as

    ∫_X u(x) γ̄_ω(dx) = ∫_X [∫_X u(y) Q_{φ_ω}(dy|x)] γ̄_ω(dx)   ∀u ∈ C_b(X),

i.e., γ̄_ω is an i.p.m. for Q_{φ_ω}, and (ii) follows. The proof of Theorem 11.4.6
is now complete. □

Notes on §11.4

1. Theorem 11.4.6 comes from Vega-Amaya [2, 3]. Related results are
obtained by Lasserre [3] using a different approach. In addition to these
works and the paper by Hernandez-Lerma, Vega-Amaya and Carrasco [1]
mentioned in Note 1 of §11.3, we know of no previous works on sample
path AC-optimality for MCPs on general (uncountable) Borel spaces.
2. Vega-Amaya [2, Theorem 6.3.1] gives a proof of Lemma 11.4.5(a)
different from our proof in §5.7.

11.5 Examples

11.5.1 Example. (Examples 10.9.3 and 8.6.2, continued.) Let us
consider the inventory-production system (8.6.5) with cost-per-stage (8.6.7),
namely,

    x_{t+1} = (x_t + a_t − z_t)⁺   for t = 0, 1, …,   (11.5.1)

and

    c(x, a) := p·a + m·(x + a) − s·E[min(x + a, z₀)].   (11.5.2)

The state and control spaces are X := [0, ∞) and A = A(x) = [0, θ] for all
x in X. In Example 10.9.3 we saw that Assumption 8.6.1, together with
the condition 8.6.3 and (10.9.8), implies that the system is w-geometrically
ergodic with respect to the weight function

    w(x) = exp[r*(x + 2z̄)],   (11.5.3)

where z̄ := E(z₀), and r* is a positive number such that ψ(r*) < 1, with
ψ(r) := E exp[r(θ − z₀)] the moment generating function of θ − z₀.
Moreover, (10.9.21) implies that Assumption 10.3.5 is satisfied. Therefore,
to complete the verification of, say, Assumption 11.3.4, it only remains to
check that (11.3.7) holds.

To verify that (11.3.7) is satisfied, first note that, as A := [0, θ],

    0 ≤ E[min(x + a, z₀)] ≤ x + θ   ∀x ∈ X.   (11.5.4)

Using this inequality and (11.5.2), a straightforward calculation shows that
there are constants k_i (i = 0, 1, 2) such that

    c²(x, a) ≤ k₀ + k₁x + k₂x²   ∀x ∈ X.

This yields (11.3.7) for some r sufficiently large, with w(x) as in (11.5.3).
In other words, Assumption 11.3.4 holds, and so all of the results in §11.3
hold as well, except perhaps for the last statement in Theorem 11.3.5. The
problem in the latter case is that (11.3.9) might not be true. In fact, from
(11.5.4) and (11.5.2) it can be seen that c(x, a) is minorized by (m − s)x − sθ,
which is not bounded below unless m = s; see (8.6.8).

Concerning the inventory system (11.5.1), see also Remark 11.5.3, below.
□
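A rough Monte Carlo sketch of the system (11.5.1)–(11.5.2) under one fixed stationary policy shows the sample-path average cost J_n⁰/n settling down; the unit costs, the exponential demand distribution, and the base-stock rule below are illustrative assumptions, not part of the example:

```python
import random

random.seed(1)
p_, m_, s_, theta = 1.0, 0.5, 2.0, 4.0   # assumed unit costs and max production

def cost(x, a, z_samples):
    # c(x,a) = p*a + m*(x+a) - s*E[min(x+a, z0)]; expectation by Monte Carlo.
    e = sum(min(x + a, z) for z in z_samples) / len(z_samples)
    return p_ * a + m_ * (x + a) - s_ * e

z_samples = [random.expovariate(1.0) for _ in range(2000)]  # demand z0 ~ Exp(1)

x, total, n = 0.0, 0.0, 20000
for t in range(n):
    a = min(theta, max(0.0, 2.0 - x))              # a hypothetical base-stock rule
    total += cost(x, a, z_samples)
    x = max(0.0, x + a - random.expovariate(1.0))  # x_{t+1} = (x_t + a_t - z_t)^+

print("sample-path average cost J_n/n ~", total / n)
```

Running the loop for increasing n gives a trajectory of J_n⁰/n that stabilizes, which is the sample-path average-cost behavior that §11.3 and §11.4 study rigorously.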
11.5.2 Example. (Examples 10.9.5 and 8.6.4, continued.) We shall
consider again the controlled queueing system (8.6.16), namely

    (11.5.5)

under the Assumptions 8.6.5. In Examples 8.6.4 and 10.9.5 we already
verified Assumption 11.3.4 except for (11.3.7), which in the present case
refers to the weight function w in (8.6.18) or (10.9.23), that is,

    w(x) := e^{rx}   for x ∈ X := [0, ∞),   (11.5.6)

for some positive number r.

The hypotheses in Examples 8.6.4 and 10.9.5 imply the hypotheses of
Theorem 10.3.6(a). However, since we did not give a specific form to the
cost-per-stage c(x, a), in addition to (8.6.19) we shall suppose that

    c(x, a) is nonnegative and satisfies (11.3.7).   (11.5.7)

With this additional requirement, the queueing system (11.5.5) satisfies
Assumption 11.3.4 and (11.3.9); hence all of the results in §11.3 hold in
this case.

On the other hand, the hypotheses in the previous paragraphs also imply
parts (a) and (d) in Assumption 11.4.1. To obtain (b) and (c) in that as-
sumption, we may further suppose that, for instance, c(x, a) is continuous
and satisfies either (a₂) or (a₃) in Remark 11.4.2. [Note that in the present
example the control set A ≡ A(x) is compact and independent of x, and
so the requirement on A(·) in Remark 11.4.2(a₃) trivially holds.] In other
words, in the latter case, Assumption 11.4.1 is valid, and, therefore, so are
the corresponding results in Lemma 11.4.5 and Theorem 11.4.6. In particu-
lar, in lieu of the minimum pair (φ_*^∞, p_{φ_*}) in Lemma 11.4.5(a) we may take
a pair (f^∞, μ_f) consisting of an AC-optimal deterministic stationary policy
f^∞—as in Theorem 10.3.6(a)—and the associated i.p.m. μ_f. Then f^∞ is
a positive Harris recurrent policy [see Definition 11.4.3(b) and Proposition
11.3.2(a)] for which the conclusion of Theorem 11.4.6(c) is true. This would
be, in other words, an alternative proof (instead of using Theorem 11.3.5)
that f^∞ is sample path AC-optimal. □
11.5.3 Remark. In the last part of the previous example we easily verified
Assumption 11.4.1 by imposing suitable conditions on the cost-per-stage
c(x, a). This can also be done in Example 11.5.1. For instance, consider
the system (11.5.1), with the same control sets, A = A(x) = [0, θ] for all
x ∈ X, but with the cost function c(x, a) in (11.5.2) replaced by, say,

    c(x, a) := c₁(x − x*)² + c₂(a − a*)²,   (11.5.8)

where c₁ and c₂ are given positive constants. In (11.5.8), x* ∈ X and a* ∈
A are fixed nominal values of the inventory level and production rate,
respectively, and the interpretation of an AC-optimal control policy π =
{a_t} is that, in the long run, it minimizes the mean-square distance of the
inventory levels {x_t} and production rates {a_t} to the given nominal values
x* and a*. We can now see that part (b) in Assumption 11.4.1 is trivially
satisfied, whereas part (c) follows from either (a₂) or (a₃) in Remark 11.4.2.
Furthermore, since we already verified (d) in Example 8.6.2, to complete the
proof that Assumption 11.4.1 holds we only need to check part (a). To do
this, take for instance the initial state x̄ := 0, and let π̄ be the stationary
policy f^∞ such that f(x) := 0 for all x ∈ X. Then a straightforward
calculation using (10.9.2) shows that

    J_n(f^∞, 0) = n(c₁x*² + c₂a*²)   ∀n = 0, 1, …,

and, therefore, the corresponding average cost is

    J(f^∞, 0) = c₁x*² + c₂a*² < ∞.

This completes the verification of Assumption 11.4.1 for the system (11.5.1)
with the cost-per-stage in (11.5.8). □
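The closed-form claim for the do-nothing policy is easy to replay numerically: under f ≡ 0 started at x = 0 the state stays at 0 forever, so each stage costs c(0, 0) = c₁x*² + c₂a*². The constants below (c₁, c₂, x*, a*, and the exponential demand) are arbitrary illustrations:

```python
import random

random.seed(2)
c1, c2, x_star, a_star = 1.0, 3.0, 2.0, 0.5   # hypothetical constants

def c(x, a):
    # quadratic cost of the form (11.5.8)
    return c1 * (x - x_star) ** 2 + c2 * (a - a_star) ** 2

x, J_n, n = 0.0, 0.0, 5000
for t in range(n):
    a = 0.0                                        # the policy f(x) := 0
    J_n += c(x, a)
    x = max(0.0, x + a - random.expovariate(1.0))  # (x + 0 - z)^+ = 0 when x = 0

assert x == 0.0                                    # the state never leaves 0
assert abs(J_n / n - (c1 * x_star**2 + c2 * a_star**2)) < 1e-9
```

The second assertion is exactly J_n(f^∞, 0)/n = c₁x*² + c₂a*², the finite average cost used above to verify Assumption 11.4.1(a).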
In Example 11.5.6 below we wish to illustrate the results in §11.4 for an
ℝ^d-valued additive-noise control system

    x_{t+1} = F(x_t, a_t) + z_t,   t = 0, 1, …,   (11.5.9)

with a quadratic cost-per-stage function

    c(x, a) = x′Σx + a′Θa,   (11.5.10)

where a is in ℝ^q and Σ and Θ are suitable matrices; x′ and a′ denote the
transposes of x and a, respectively. Since the analysis of this control system
relies on the properties of the noncontrolled Markov chain

    x_{t+1} = F(x_t) + z_t,   t = 0, 1, …,   (11.5.11)

we shall first present the following result, which is a variant of Example
7.4.6.

11.5.4 Proposition. Consider the Markov chain (11.5.11) and the follow-
ing conditions:

(a) F : ℝ^d → ℝ^d is locally bounded; that is, for each compact set K ⊂ ℝ^d
there is a constant m = m(K) such that |F(x)| ≤ m for all x ∈ K.

(b) The disturbance sequence {z_t} consists of i.i.d. random vectors in ℝ^d,
independent of the initial state x₀, and whose common distribution
has a density g which is positive λ-a.e., where λ denotes Lebesgue
measure. In addition, E(z₀) = 0.

(c) There exist positive constants s ≥ 1, M₁, and b such that

    E|F(x) + z₀|^s ≤ |x|^s − b   ∀|x| > M₁.   (11.5.12)

(d) There exist positive constants s ≥ 1, β < 1, and M₂ such that E|z₀|^s <
∞ and

    E|F(x) + z₀|^s ≤ β|x|^s   ∀|x| > M₂.   (11.5.13)

Then the following holds:

(i) Under (a) and (b), the Markov chain is aperiodic, λ-irreducible and
Harris recurrent with respect to λ.

(ii) Under (a), (b) and (c), the chain has a unique i.p.m.—hence, by (i),
it is positive Harris recurrent.

(iii) Under (a), (b) and (d), there exist a p.m. μ̂, and positive numbers
ρ < 1 and R such that

    (11.5.14)

[cf. (7.3.7)], where w is the weight function w(x) = 1 + |x|^s, and

    ∫ w dμ̂ < ∞.   (11.5.15)

Proof. Part (i) follows from (7.4.18). For the proof of (ii) see Tweedie [1]
or Mokkadem [1, Proposition 1], and for the proof of (iii) see Tweedie [2]
or Mokkadem [1, Proposition 3]. □
11.5.5 Remark. (a) The main difference between Proposition 11.5.4(iii)
and Example 7.4.6 is that the latter requires F to be continuous. Moreover,
from (i), and comparing (11.5.14) with (7.3.7), it can be seen that the
measure μ̂ in (11.5.14) is the unique i.p.m. for the Markov chain. This can
also be deduced from the conclusion (ii) by noting that the inequality (11.5.13)
implies that (11.5.12) holds with M₁ := max{M₂, [b/(1 − β)]^{1/s}}, because

    β|x|^s ≤ |x|^s − b   ⇔   |x| ≥ [b/(1 − β)]^{1/s}.

(b) Using Minkowski's inequality it can be seen that (11.5.12) and (11.5.13)
are both satisfied if there are positive constants β* < 1 and M such that

    |F(x)| + ‖z₀‖_s ≤ β*|x|   ∀|x| > M,   (11.5.16)

where ‖z₀‖_s := (E|z₀|^s)^{1/s}.

(c) Proposition 11.5.4 is of course applicable to the linear case in which
F(x) = Γx for some matrix Γ. □
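For the linear case F(x) = Γx of part (c), the contraction condition (11.5.16) with s = 2 can be checked by direct computation; the matrix and Gaussian noise below are assumed for illustration only:

```python
import math
import random

random.seed(3)
G = [[0.5, 0.1], [0.0, 0.6]]              # hypothetical stable matrix Gamma
sigma = 0.5                               # z0 has i.i.d. N(0, sigma^2) entries

def norm(v): return math.sqrt(sum(c * c for c in v))
def F(x):   return [sum(G[i][j] * x[j] for j in range(2)) for i in range(2)]

z_norm2 = sigma * math.sqrt(2)            # ||z0||_2 = (E|z0|^2)^{1/2} for this noise
beta_star, M = 0.75, 6.0                  # candidate constants for (11.5.16)

# Check |F(x)| + ||z0||_2 <= beta* |x| on random test points with |x| > M.
for _ in range(1000):
    x = [random.uniform(-1, 1) for _ in range(2)]
    r = random.uniform(M + 0.1, 100.0)
    x = [r * c / norm(x) for c in x]      # rescale so that |x| = r > M
    assert norm(F(x)) + z_norm2 <= beta_star * norm(x)
```

Here the operator norm of Γ is about 0.62, so the additive noise term is absorbed once |x| exceeds M; this is exactly the mechanism by which (11.5.16) forces the drift inequalities (11.5.12)–(11.5.13).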
In the following example we use Proposition 11.5.4 (with s = 2) as a
guide to derive conditions ensuring that (11.5.9) and (11.5.10) satisfy the
hypotheses of §11.4.
11.5.6 Example. (An additive-noise quadratic cost system.) Con-
sider the stochastic control system (11.5.9) with cost-per-stage (11.5.10).
The state space is X = lR d , and we shall assume that the control constraint
sets A (x), for each x EX, are closed subsets of (say) lR q and such that

IK := {(x, a)lx E X, a E A(x)}

is a convex set. Firstly, we will also assume:


(a) The matrices ~ and e in (11.5.10) are symmetric and positive definite.
This condition clearly implies that the quadratic cost (11.5.10) satisfies
Assumptions 11.4.1(b) and 11.4.1(c).
To verify Assumption 11.4.1(d) we suppose:
11.5 Examples 201

(b) {ztl satisfies the condition (b) in Proposition 11.5.4, and

(c) F : lK -t X is continuous.

Then, since

Ix u(y)Q(dylx, a) = E(u[F(x, a) + zo]), (11.5.17)

the Bounded Convergence Theorem and (c) yield Assumption 11.4.1(d).


[For this to be true we only need the first part of (b)-we do not require
the density 9 to be positive A-a.e.]
To derive a sufficient condition for Assumption 11.4.1(a) we will use
Proposition 11.5.4(iii) and the following notation: IT cpoo E llRS is a ran-
domized stationary policy we write

F(x,cp) := L F(x,a)cp(dalx) for x E X, (11.5.18)

where A := IRq. [By Definitions 8.2.1 and 8.2.2(b), in (11.5.18) we may


replace A by A(x), if necessary.] Let us now consider the conditions:
(d) F ( ., cp) is locally bounded for each randomized stationary policy cpoo.

(e) Elzol2 < 00 and, moreover, there exists a randomized stationary policy
cpoo and positive constants f3 < 1, M, and k such that
(et} EIF(x, (j5) + ZOl2 ~ f3lxl 2 Vlxl > Mj
(e2) fA(a'9a){j5(dalx) ~ klxl 2 Vx E X.
From (b), (d), and (e) we can see that φ̄^∞ is stable [Definition 11.4.3(a)].
Indeed, by Proposition 11.5.4(iii), the transition law Q_φ̄ has an i.p.m. p :=
p_φ̄ that, in particular, satisfies (11.5.15) with w(x) = 1 + |x|² and μ = p.
Therefore, from (11.5.10) and (e₂), there is a constant k such that

J(φ̄, p) ≤ k ∫_X |x|² p(dx) < ∞.

Thus (11.4.4) holds, which in turn [by (11.4.5)] gives Assumption 11.4.1(a)
with π = φ̄^∞ and some initial state x.
Summarizing, the current conditions (a) to (e) imply that (11.5.9) and
(11.5.10) satisfy Assumption 11.4.1, and so the corresponding results in
§11.4 are applicable. Furthermore, if we wish to use, for instance, Lemma
11.4.5(b) or Theorem 11.4.6(c), we then need conditions for the policy φ^∞
in those results to be positive Harris recurrent. One way of getting this is to
assume (or to verify, when a specific control system is given) that (e) holds
for every randomized stationary policy φ^∞, with constants β, M, and k
that may depend on φ^∞. Note that we may replace (e₁) by the analogues
of (11.5.12) or (11.5.16), namely,

E|F(x, φ) + z₀|² ≤ |x|² − b  ∀|x| > M₁

or

|F(x, φ)| + ‖z₀‖₂ ≤ β*|x|  ∀|x| > M,

with constants b, M₁, β*, and M that may depend on φ^∞. □
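To make the drift condition (e₁) concrete, here is a minimal scalar sketch (not from the text): for x_{t+1} = a·x_t + z_t with |a| < 1 and mean-zero noise of variance sigma2, E|ax + z₀|² = a²x² + sigma2, so (e₁) holds with any β ∈ (a², 1) once |x| exceeds a computable threshold M. The values of a, sigma2, and beta below are illustrative assumptions.

```python
# Hypothetical scalar analogue of condition (e1): x_{t+1} = a*x_t + z_t with
# |a| < 1, E z_0 = 0, E z_0^2 = sigma2.  Then E|a*x + z_0|^2 = a^2 x^2 + sigma2,
# so (e1) holds with beta in (a^2, 1) for |x| > M := sqrt(sigma2/(beta - a^2)).
import math

a, sigma2 = 0.8, 1.0      # illustrative system gain and noise variance
beta = 0.9                # a^2 = 0.64 < beta < 1
M = math.sqrt(sigma2 / (beta - a ** 2))

def second_moment_after_one_step(x):
    """E|F(x) + z_0|^2 for F(x) = a*x and mean-zero noise of variance sigma2."""
    return a ** 2 * x ** 2 + sigma2

# the drift inequality (e1) holds strictly outside the ball of radius M
for x in (1.01 * M, 2 * M, 10 * M):
    assert second_moment_after_one_step(x) <= beta * x ** 2
# ... while the noise term keeps the one-step moment positive at the origin
assert second_moment_after_one_step(0.0) > 0.0
```

Inside the ball of radius M the inequality may fail (the additive term sigma2 dominates), which is why (e₁) is only required for |x| > M.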
Notes on §11.5

The control model in Example 11.5.1 and Remark 11.5.3 has been studied
by Vega-Amaya [2, 3]. These references contain other examples related to
§11.3 and §11.4.
Example 11.5.6 comes from Hernández-Lerma and Lasserre [14], where
the reader can find additional references on results related to Proposition
11.5.4 (on which Example 11.5.6 is based).
12
The Linear Programming Approach

12.1 Introduction
In this chapter we study the linear programming (LP) approach to Markov
control problems. Our ultimate goal is to show how a Markov control prob-
lem can be approximated by finite linear programs.
To reach this goal, we shall first proceed to find a suitable linear program
associated to the Markov control problem. Here, by a "suitable" linear
program we mean a linear program (P) that together with its dual (P*)
satisfies

sup(P*) ≤ (MCP)* ≤ inf(P),   (12.1.1)

where (using terminology specified in the following section)

inf(P) := value of the primal program (P),
sup(P*) := value of the dual program (P*),
(MCP)* := value function of the Markov control problem.

In particular, if there is no duality gap for (P), so that


sup(P*) = inf(P), (12.1.2)
then of course the values of (P) and of (P*) yield the desired value function
(MCP)*.
However, to find an optimal policy for the Markov control problem,
(12.1.1) and (12.1.2) are not good enough because they do not guarantee
that (P) or (P*) is solvable. If it can be ensured that, say, the primal

O. Hernández-Lerma et al., Further Topics on Discrete-Time Markov Control Processes


© Springer Science+Business Media New York 1999

(P) is solvable-in which case we write its value as min (P)-and that

min(P) = (MCP)*, (12.1.3)

then an optimal solution for (P) can be used to determine an optimal policy
for the Markov control problem. Likewise, if the dual (P*) is solvable and
its value-which in this case is written as max (P*)-satisfies

max(P*) = (MCP)*, (12.1.4)

then we can use an optimal solution for (P*) to find an optimal policy for
the Markov control problem. In fact, one of the main results in this chapter
(Theorem 12.4.2) gives conditions under which (12.1.3) and (12.1.4) are
both satisfied, so that in particular strong duality for (P) holds, that is,

max(P*) = min(P). (12.1.5)

The LP approach to Markov control problems was already studied in
Chapter 6, but from a very different viewpoint. Namely, in Chapter 6 we
first introduced linear programs (P_α) associated to α-discounted MCPs,
with 0 < α < 1, and derived properties such as (12.1.1)-(12.1.5). Then to
study the average cost (AC) problem we suitably modified (P_α) to obtain
"modified" programs (MP_α), and finally we obtained and analyzed the AC-
related linear program (MP₁) as the "limit" of (MP_α) as α ↑ 1. In the
present chapter, however, we do not use α-discounted programs to study
(MP₁), which we now call (P) [as in (12.1.1)]. Instead, we go directly to
(P) ≡ (MP₁) and analyze it without using α-discounted programs.
Another key difference with respect to Chapter 6 is that here we ob-
tain necessary and sufficient conditions for (P) to be consistent (Theorem
12.3.7), as opposed to Chapter 6 that only gives sufficient conditions. More-
over, we study minimizing sequences for (P) and maximizing sequences for
(P*), and, more importantly, we prove the convergence of an approximation
scheme for (P) based on finite linear programs (Theorem 12.5.7).
A. Outline of the chapter

Section 12.2 presents background material that can be omitted on a first


reading. It contains, in particular, a brief introduction to infinite LP and
some important facts on the concept of "tightness". In §12.3 we introduce
the program (P) associated to the AC problem, and we show that (P) is
solvable and that there is no duality gap, so that (12.1.2) becomes

sup(P*) = min(P).

Several equivalent formulations of the consistency of (P) are also proved.


Section 12.4 deals with approximating sequences for (P) and its dual
(P*). In particular, it is shown that if a suitable maximizing sequence for

(P*) exists, then the strong duality condition (12.1.5) is satisfied. Section
12.5 presents an approximation scheme for (P) using finite-dimensional
programs. The scheme consists of three main steps. In step 1 we introduce
an "increasing" sequence of aggregations of (P), each one with finitely many
constraints. In step 2 each aggregation is relaxed (from an equality to an
inequality), and, finally, in step 3, each aggregation-relaxation is combined
with an inner approximation that has a finite number of decision variables.
Thus the resulting aggregation-relaxation-inner approximation turns out to
be a finite linear program, that is, a program with finitely many constraints
and decision variables. The corresponding convergence theorems are stated
in §12.5, and they are all proved in the final section 12.6.
To fix ideas, we shall consider only the so-called "unichain" AC problem.
However, from the proof of our main results it should be clear that sim-
ilar results are valid for other Markov control problems, in particular for
discounted and for "multichain" AC MCPs.

12.2 Preliminaries
This section contains background material that can be omitted on a first
reading; the reader may refer to it as needed.
The material is divided into four subsections. Subsection A reviews some
basic definitions and facts related to dual pairs of vector spaces and linear
operators. Subsections B and C summarize the main results on infinite
LP needed in later sections. Finally, Subsection D reviews the notion of
"tightness" and its connection to the existence of i.p.m.'s for Markov chains.
A. Dual pairs of vector spaces

Let X and Y be two arbitrary (real) vector spaces, and let ⟨·,·⟩ be a
bilinear form on X × Y, that is, a real-valued function on X × Y such
that
• the map x ↦ ⟨x, y⟩ is linear on X for every y ∈ Y, and

• the map y ↦ ⟨x, y⟩ is linear on Y for every x ∈ X.

Then the pair (X, Y) is called a dual pair if the bilinear form "separates
points" in X and Y, that is,
• for each x ≠ 0 in X there is some y ∈ Y with ⟨x, y⟩ ≠ 0, and
• for each y ≠ 0 in Y there is some x ∈ X with ⟨x, y⟩ ≠ 0.
If (X,Y) is a dual pair, then so is (Y,X).
If (X₁, Y₁) and (X₂, Y₂) are two dual pairs of vector spaces with bilinear
forms ⟨·,·⟩₁ and ⟨·,·⟩₂, respectively, then the product (X₁ × X₂, Y₁ × Y₂) is
endowed with the bilinear form

⟨(x₁, x₂), (y₁, y₂)⟩ := ⟨x₁, y₁⟩₁ + ⟨x₂, y₂⟩₂;   (12.2.1)

this definition can be extended to the product of three or more dual pairs.
12.2.1 Examples. (a) If X = Y = ℝⁿ for some n = 1, 2, ..., then ⟨x, y⟩
will denote the usual "inner product" x · y of the vectors x, y, that is,

⟨x, y⟩ := x · y = x₁y₁ + ... + xₙyₙ.


(b) Let S be a Borel space with Borel σ-algebra 𝔅(S), and let X := M(S)
be a vector space of finite signed measures on 𝔅(S). In the following sections
M(S) will be the Banach space M(S) of finite signed measures on 𝔅(S),
endowed with the total variation norm ‖·‖_TV, or the subspace M_w(S) of finite
signed measures μ with finite w-norm

‖μ‖_w := ∫_S w d|μ|,   (12.2.2)

for some weight function w ≥ 1. (See §7.2.)

Now let Y := F(S) be a vector space of real-valued measurable functions
on S. In the following sections F(S) will be one of the Banach spaces

𝔹_w(S) ⊃ 𝔹(S) ⊃ C_b(S) ⊃ C₀(S),   (12.2.3)

where

• 𝔹_w(S) is the Banach space of measurable functions u with finite w-
norm

‖u‖_w := sup_s |u(s)|/w(s),   (12.2.4)

for some weight function w ≥ 1 (see §7.2);

• 𝔹(S) is the subspace of bounded measurable functions u with finite
supremum (or sup) norm

‖u‖ := sup_s |u(s)|   (12.2.5)

[obtained from (12.2.4) with w(·) ≡ 1];

• C_b(S) ⊂ 𝔹(S) is the subspace of continuous bounded functions, and

• C₀(S) ⊂ C_b(S) is the subspace of continuous functions u vanishing at
infinity, that is, for each ε > 0 there is a compact set K_ε such that

|u(s)| < ε  ∀s ∉ K_ε.   (12.2.6)

In the latter case sometimes we shall simply write

lim_{s→∞} u(s) = 0.   (12.2.7)

The spaces C_b(S) and C₀(S) coincide if S is compact.


In any of the above cases, the dual pair (X, Y) = (M(S), F(S)) is en-
dowed with the bilinear form

⟨μ, u⟩ := ∫_S u dμ.   (12.2.8)

Thus, by (12.2.1) and part (a), the bilinear form corresponding to the dual
pair (ℝⁿ × M(S), ℝⁿ × F(S)) is

⟨(x, μ), (y, u)⟩ = x · y + ⟨μ, u⟩. □   (12.2.9)

Given a dual pair (X, Y), we denote by σ(X, Y) the weak topology
on X (also referred to as the σ-topology on X), namely, the coarsest (or
weakest) topology on X under which all the elements of Y are continu-
ous when regarded as linear forms ⟨·, y⟩ on X. Equivalently, the base of
neighborhoods of the origin of the σ-topology is the family of all sets of the
form

N(I, ε) := {x ∈ X : |⟨x, y⟩| ≤ ε  ∀y ∈ I},   (12.2.10)

where ε > 0 and I is a finite subset of Y. (See, for instance, Robertson and
Robertson [1], p. 32.)
Let {xₙ} be a sequence or a net in X. (See Note 1 at the end of this
section for the definition of "net.") Then xₙ converges to x in the weak
topology σ(X, Y) if

⟨xₙ, y⟩ → ⟨x, y⟩  ∀y ∈ Y.   (12.2.11)

For instance, for the dual pair (M(S), F(S)) in Example 12.2.1(b), a se-
quence or a net of measures μₙ converges to μ in the weak topology
σ(M(S), F(S)) if

⟨μₙ, u⟩ → ⟨μ, u⟩  ∀u ∈ F(S),   (12.2.12)

where ⟨·,·⟩ stands for the bilinear form in (12.2.8).
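As a concrete instance of (12.2.12), take S = [0, 1] and μₙ := δ_{1/n}, the Dirac measure at 1/n: for every bounded continuous u, ⟨μₙ, u⟩ = u(1/n) → u(0) = ⟨δ₀, u⟩, so μₙ → δ₀ weakly. The sketch below checks this numerically; the test functions are illustrative choices, not taken from the text.

```python
# Weak convergence (12.2.12) of Dirac measures delta_{1/n} -> delta_0 on [0, 1]:
# the pairing <delta_p, u> = integral of u d(delta_p) is just u(p).
import math

def pair_dirac(point, u):
    """Bilinear form <delta_point, u> for a Dirac measure at `point`."""
    return u(point)

test_functions = [math.cos, lambda s: s * s, lambda s: 1.0 / (1.0 + s)]
for u in test_functions:
    gaps = [abs(pair_dirac(1.0 / n, u) - pair_dirac(0.0, u)) for n in (1, 10, 1000)]
    # the pairing gap shrinks along the sequence and is tiny for n = 1000
    assert gaps[-1] < gaps[0] or gaps[0] == 0.0
    assert gaps[-1] < 1e-3
```

Note that the same sequence does not converge in total variation (‖δ_{1/n} − δ₀‖_TV = 2 for every n), which is precisely why the weak topology σ(M(S), F(S)) is the relevant one here.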
12.2.2 Remark. (a) Let (X, Y) be a dual pair such that Y is a Banach
space and X = Y* is the topological dual of Y. In this case, the weak
topology σ(X, Y) is called the weak* (weak-star) topology on X, and so
(12.2.11) is referred to as the weak* convergence of xₙ to x.
(b) For instance, let X = M(S) and Y := C₀(S) be as in Example 12.2.1.
If S is a locally compact separable metric (LCSM) space, then, by the
Riesz Representation Theorem (see, for example, Rudin [1]), M(S) is the
topological dual of the (separable) Banach space C₀(S), and so the weak
topology σ(M(S), C₀(S)) on M(S) is in fact the weak* topology.
The fact that C₀(S) is a separable Banach space will be used later on in
conjunction with the following result.

(c) (The Alaoglu or Banach-Alaoglu-Bourbaki Theorem; see Ash [1] or
Brezis [1].) Let Y be a Banach space with topological dual Y*. Then:

(c₁) The closed unit ball U := {y ∈ Y* : ‖y‖ ≤ 1} in Y* is compact in
the weak* topology σ(Y*, Y).

(c₂) If in addition Y is separable, then the weak* topology of U is metriz-
able; hence U is weakly* sequentially compact.

A consequence of the metrizability in (c₂) is, in particular, that to verify
that a subset, say V, of U is closed in the weak* topology, it suffices to use
the usual criterion in metric spaces (if yₙ is in V and yₙ → y in the weak*
topology, then y is in V) for sequences yₙ, rather than nets.
An advantage of using sequences instead of nets is shown in the following
proposition. □
12.2.3 Proposition. Let (X, Y) be a dual pair of normed vector spaces. If
{xₙ} is a net converging to x in the weak topology σ(X, Y), then

‖x‖ ≤ lim inf_n ‖xₙ‖.   (12.2.13)

If {xₙ} is a sequence, instead of a net, then (12.2.13) holds and, in ad-
dition, the sequence {‖xₙ‖} is bounded.
Proof. (12.2.13) follows from the fact that the map x ↦ ‖x‖ is lower semi-
continuous (l.s.c.) in the weak topology. The result for sequences is well
known; see, for instance, Ash [1, p. 145] or Brezis [1, p. 41]. □
12.2.4 Definition. Let (X, Y) and (Z, W) be two dual pairs of vector
spaces, and G : X → Z a linear map.

(a) G is said to be weakly continuous if it is continuous with respect
to the weak topologies σ(X, Y) and σ(Z, W); that is, if {xₙ} is a net
in X such that xₙ → x in the weak topology σ(X, Y) [see (12.2.11)],
then Gxₙ → Gx in the weak topology σ(Z, W), i.e.,

⟨Gxₙ, w⟩ → ⟨Gx, w⟩  ∀w ∈ W.   (12.2.14)

(b) The adjoint G* of G is defined by the relation

⟨Gx, w⟩ = ⟨x, G*w⟩  ∀x ∈ X, w ∈ W.   (12.2.15)

The following proposition gives a well-known (easy-to-use) criterion for
the map G in Definition 12.2.4 to be weakly continuous; for a proof see,
for instance, Robertson and Robertson [1], p. 38.
12.2.5 Proposition. The linear map G is weakly continuous if and only if
its adjoint G* maps W into Y, that is, G*(W) ⊂ Y.

12.2.6 Example. Let X and Y be two Borel spaces, and let w₀(x) and
w(x, y) be weight functions on X and X × Y, respectively, such that

1 ≤ w₀(x) ≤ w(x, y)  ∀x ∈ X, y ∈ Y.   (12.2.16)

We shall consider the spaces 𝔹_w(X × Y), M_w(X × Y), 𝔹_w₀(X), and M_w₀(X)
as in Example 12.2.1(b).
(a) Consider the dual pairs (M_w(X × Y), 𝔹_w(X × Y)) and (ℝ, ℝ), and
the linear map

L₀ : M_w(X × Y) → ℝ,  μ ↦ L₀μ := ⟨μ, 1⟩.   (12.2.17)

By (12.2.8) with u(x, y) ≡ 1,

L₀μ = ⟨μ, 1⟩ = μ(X × Y)  ∀μ ∈ M_w(X × Y).

In particular,

L₀μ = ‖μ‖_TV  ∀μ ∈ M_w(X × Y)⁺,   (12.2.18)

where M_w(X × Y)⁺ stands for the convex cone of nonnegative measures in
M_w(X × Y) and ‖·‖_TV denotes the total variation norm.
Since 𝔹_w(X × Y) contains the constant functions [see (12.2.3)], the ad-
joint

r ↦ (L₀*r)(x, y) ≡ r  ∀r ∈ ℝ,

obviously maps ℝ into 𝔹_w(X × Y), and so L₀ is weakly continuous, by
Proposition 12.2.5.
(b) Consider the dual pairs (M_w(X × Y), 𝔹_w(X × Y)) and (M_w₀(X),
𝔹_w₀(X)), and the linear map

G₁ : M_w(X × Y) → M_w₀(X),  μ ↦ G₁μ := μ̂,

where μ̂ denotes the marginal (also known as the projection) of μ on X,
that is,

μ̂(B) := μ(B × Y)  ∀B ∈ 𝔅(X).   (12.2.19)

The adjoint u ↦ G₁*u, with

(G₁*u)(x, y) := u(x)  ∀u ∈ 𝔹_w₀(X), (x, y) ∈ X × Y,   (12.2.20)

maps 𝔹_w₀(X) into 𝔹_w(X × Y), because (12.2.16) gives

|G₁*u|/w = (|u|/w₀) · (w₀/w) ≤ |u|/w₀,

and so ‖G₁*u‖_w ≤ ‖u‖_w₀ < ∞. Thus G₁ is weakly continuous, by Proposi-
tion 12.2.5.
(c) Consider the dual spaces in part (b), and let P(B|x, y) be a stochastic
kernel on X given X × Y (see Definition 7.2.3). Moreover, assume that

∫_X w₀(x′) P(dx′|·) is in 𝔹_w(X × Y),

that is, there is a constant k such that

∫_X w₀(x′) P(dx′|x, y) ≤ k w(x, y)  ∀(x, y) ∈ X × Y.   (12.2.21)

Then the linear map G₂ : M_w(X × Y) → M_w₀(X), μ ↦ G₂μ, defined by

(G₂μ)(B) := ∫_{X×Y} P(B|x, y) μ(d(x, y))  for B ∈ 𝔅(X),   (12.2.22)

is weakly continuous. Indeed, for each function u in 𝔹_w₀(X), (12.2.21) yields

|∫_X u(x′) P(dx′|x, y)| ≤ ‖u‖_w₀ ∫_X w₀(x′) P(dx′|x, y)
≤ ‖u‖_w₀ k w(x, y),

which means that the adjoint G₂*, u ↦ G₂*u, given by

(G₂*u)(x, y) := ∫_X u(x′) P(dx′|x, y),   (12.2.23)

maps 𝔹_w₀(X) into 𝔹_w(X × Y). Thus, the weak continuity of G₂ follows
from Proposition 12.2.5.
(d) As a consequence of (b) and (c), if the inequality (12.2.21) holds,
then the linear map

L₁ := G₁ − G₂ : M_w(X × Y) → M_w₀(X),

i.e.,

(L₁μ)(B) := μ̂(B) − ∫_{X×Y} P(B|x, y) μ(d(x, y))  for B ∈ 𝔅(X),   (12.2.24)

is weakly continuous. Furthermore, with L₀ as in (12.2.17), the linear map

L : M_w(X × Y) → ℝ × M_w₀(X),

i.e.,

Lμ := (L₀μ, L₁μ)  for μ in M_w(X × Y),   (12.2.25)

is weakly continuous.
Note that the adjoints

L₁* : 𝔹_w₀(X) → 𝔹_w(X × Y)  and  L* : ℝ × 𝔹_w₀(X) → 𝔹_w(X × Y)

of L₁ and L are given by L₁* = G₁* − G₂* and L*(r, u) = L₀*r + L₁*u; that is,

(L₁*u)(x, y) = u(x) − ∫_X u(x′) P(dx′|x, y)   (12.2.26)

[see (12.2.20) and (12.2.23)] and

L*(r, u)(x, y) = (L₀*r)(x, y) + (L₁*u)(x, y)
= r + u(x) − ∫_X u(x′) P(dx′|x, y),   (12.2.27)

respectively. □
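On finite spaces the defining relation (12.2.15) for the adjoint can be verified by direct summation. The sketch below, with arbitrary illustrative data on X = Y = {0, 1}, checks ⟨Lμ, (r, u)⟩ = ⟨μ, L*(r, u)⟩ for the maps L₀, L₁, and L* of (12.2.17), (12.2.24), and (12.2.27).

```python
# Finite-space check of <L mu, (r, u)> = <mu, L*(r, u)>; all data illustrative.
X, Y = (0, 1), (0, 1)
mu = {(0, 0): 0.2, (0, 1): 0.1, (1, 0): 0.4, (1, 1): 0.3}   # measure on X x Y
P = {(0, 0): [0.5, 0.5], (0, 1): [0.9, 0.1],                # kernel rows P(.|x, y)
     (1, 0): [0.3, 0.7], (1, 1): [0.6, 0.4]}
r, u = 2.0, {0: 1.0, 1: -3.0}

L0_mu = sum(mu.values())                                     # (12.2.17)
marginal = {x: sum(mu[(x, y)] for y in Y) for x in X}        # (12.2.19)
L1_mu = {x2: marginal[x2]
         - sum(P[(x, y)][x2] * mu[(x, y)] for x in X for y in Y)
         for x2 in X}                                        # (12.2.24)
lhs = L0_mu * r + sum(L1_mu[x] * u[x] for x in X)            # <L mu, (r, u)>

def L_star(x, y):                                            # (12.2.27)
    return r + u[x] - sum(P[(x, y)][x2] * u[x2] for x2 in X)

rhs = sum(mu[(x, y)] * L_star(x, y) for x in X for y in Y)   # <mu, L*(r, u)>
assert abs(lhs - rhs) < 1e-12
```

The equality holds identically in μ, r, and u, since both sides expand to the same triple sum; the numerical data merely exercise one instance.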
12.2.7 Remark. (a) Let P(B|x, y) be the stochastic kernel in Example
12.2.6(c), and consider the Banach spaces C_b(·) and C₀(·) in Example
12.2.1(b). By a standard abuse of terminology, the kernel P is said to
be weakly continuous if the adjoint G₂* in (12.2.23) maps C_b(X) into
C_b(X × Y), that is,

G₂*u is in C_b(X × Y) if u is in C_b(X).   (12.2.28)

[Observe that this is a Feller-like condition; see (12.2.47).]
(b) Suppose, on the other hand, that P is weakly continuous and, more-
over,

P(K|·) vanishes at infinity for each compact K ⊂ X;   (12.2.29)

that is [as in (12.2.6)], for each ε > 0 there is a compact set K′ = K′(ε, K)
in X × Y such that

P(K|x, y) ≤ ε  ∀(x, y) ∉ K′.

Then a straightforward calculation shows that, in addition to (12.2.28), G₂*
maps C₀(X) into C₀(X × Y), that is,

G₂*u is in C₀(X × Y) if u is in C₀(X).   (12.2.30)

In other words, suppose that (12.2.28) and (12.2.29) are satisfied, and that
X and Y (hence the product X × Y) are locally compact separable met-
ric spaces. Then, in view of Remark 12.2.2(b) and Proposition 12.2.5, the
condition (12.2.30) states that the map G₂ : M(X × Y) → M(X) defined
by (12.2.22) is weakly* continuous, that is, continuous with respect to the
weak* topologies σ(M(X × Y), C₀(X × Y)) and σ(M(X), C₀(X)). □
12.2.8 Remark. (Positive and dual cones.) (a) Let (X, Y) be a dual
pair of vector spaces, and K a convex cone in X, that is, x + x′ and λx
belong to K whenever x and x′ are in K and λ > 0. Unless explicitly stated
otherwise, we shall assume that K ≠ X and the origin (that is, the zero
vector, 0) is in K. In this case, K defines a partial order ≥ on X such that

x ≥ x′ ⇔ x − x′ ∈ K,

and K will be referred to as a positive cone. The dual cone of K is the
convex cone K* in Y defined by

K* := {y ∈ Y : ⟨x, y⟩ ≥ 0  ∀x ∈ K}.   (12.2.31)

(b) If X = M(S) is any of the measure spaces in Example 12.2.1(b), we
will denote by M(S)⁺ the "natural" positive cone in M(S), which consists
of all the nonnegative measures in M(S), that is,

M(S)⁺ := {μ ∈ M(S) : μ ≥ 0}.

The corresponding dual cone of M(S)⁺ in (any of the spaces) F(S) coincides
with the "natural" positive cone

F(S)⁺ := {u ∈ F(S) : u ≥ 0}. □

B. Infinite linear programming

An infinite linear program requires the following components:

• two dual pairs (X, Y) and (Z, W) of real vector spaces;
• a weakly continuous linear map L : X → Z, with adjoint L* : W → Y;
• a positive cone K in X, with dual cone K* in Y [see (12.2.31)]; and
• vectors b ∈ Z and c ∈ Y.

Then the primal linear program is

ℙ: minimize ⟨x, c⟩
subject to: Lx = b, x ∈ K.   (12.2.32)

The corresponding dual problem is

ℙ*: maximize ⟨b, w⟩
subject to: c − L*w ∈ K*, w ∈ W.   (12.2.33)

An element x of X is called feasible for ℙ if it satisfies (12.2.32), and ℙ
is said to be consistent if it has a feasible solution. If ℙ is consistent, then
its value is defined as

inf ℙ := inf{⟨x, c⟩ : x is feasible for ℙ};   (12.2.34)

otherwise, inf ℙ := +∞. The program ℙ is solvable if there is a feasible
solution x* that achieves the infimum in (12.2.34). In this case, x* is called
an optimal solution for ℙ and, instead of inf ℙ, the value of ℙ is written
as

min ℙ = ⟨x*, c⟩.

Similarly, w ∈ W is feasible for the dual program ℙ* if it satisfies
(12.2.33), and ℙ* is said to be consistent if it has a feasible solution.
If ℙ* is consistent, then its value is defined as

sup ℙ* := sup{⟨b, w⟩ : w is feasible for ℙ*};   (12.2.35)

otherwise, sup ℙ* := −∞. The dual ℙ* is solvable if there is a feasible
solution w* that attains the supremum in (12.2.35), in which case we write
the value of ℙ* as

max ℙ* = ⟨b, w*⟩.
The next theorem can be proved as in elementary (finite-dimensional)
LP.
12.2.9 Theorem.
(a) (Weak duality.) If ℙ and ℙ* are both consistent, then their values
are finite and satisfy

sup ℙ* ≤ inf ℙ.   (12.2.36)

(b) (Complementary slackness.) If x is feasible for ℙ, w is feasible for
ℙ*, and

⟨x, c − L*w⟩ = 0,   (12.2.37)

then x is optimal for ℙ and w is optimal for ℙ*.

The converse of Theorem 12.2.9(b) does not hold in general [as shown in
Example 6.2.5(b)]. It does hold, however, if there is no duality gap for ℙ,
which means that equality holds in (12.2.36), i.e.,

sup ℙ* = inf ℙ.   (12.2.38)

On the other hand, it is said that the strong duality condition for ℙ
holds if ℙ and its dual are both solvable and

max ℙ* = min ℙ.   (12.2.39)
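Both parts of Theorem 12.2.9 can be seen on a two-variable toy program (illustrative data, not from the text): minimize c·x subject to Lx = b, x ≥ 0, with L = [1 1], b = 1, c = (1, 2). Here x* = (1, 0) and w* = 1 are optimal, and (12.2.37) certifies this.

```python
# Toy finite-dimensional instance of P and P* in (12.2.32)-(12.2.33).
c = (1.0, 2.0)
b = 1.0

def primal_value(x):
    """<x, c>."""
    return c[0] * x[0] + c[1] * x[1]

def is_primal_feasible(x):
    """Lx = b and x in the positive cone K = R^2_+."""
    return abs(x[0] + x[1] - b) < 1e-12 and min(x) >= 0.0

def is_dual_feasible(w):
    """c - L*w in K*, i.e. c_i - w >= 0 for i = 1, 2."""
    return c[0] - w >= 0.0 and c[1] - w >= 0.0

x_star, w_star = (1.0, 0.0), 1.0
assert is_primal_feasible(x_star) and is_dual_feasible(w_star)

# weak duality (12.2.36): every dual value is below every primal value
for w in (0.0, 0.5, 1.0):
    for x in ((1.0, 0.0), (0.5, 0.5), (0.0, 1.0)):
        assert b * w <= primal_value(x) + 1e-12

# complementary slackness (12.2.37): <x*, c - L*w*> = 0 certifies optimality
slack = (c[0] - w_star) * x_star[0] + (c[1] - w_star) * x_star[1]
assert abs(slack) < 1e-12
```

In this instance there is also no duality gap: max ℙ* = b·w* = 1 = ⟨x*, c⟩ = min ℙ, matching (12.2.39).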

The following theorem gives conditions under which ℙ is solvable and
there is no duality gap; for a proof see Anderson and Nash [1, Theorem
3.9].
12.2.10 Theorem. Let H be the set in Z × ℝ defined as

H := {(Lx, ⟨x, c⟩ + r) : x ∈ K, r ≥ 0}.

If ℙ is consistent and H is weakly closed [that is, closed in the weak topology
σ(Z × ℝ, W × ℝ)], then ℙ is solvable and there is no duality gap, so that
(12.2.38) becomes

sup ℙ* = min ℙ.

The following Generalized Farkas Theorem of Craven and Koliha [1, The-
orem 2] gives a necessary and sufficient condition for ℙ to be consistent.
The result is similar to Theorem 12.2.10 in that it also requires a certain
set to be weakly closed.
12.2.11 Theorem. (Craven and Koliha [1].) If L is weakly continuous and
L(K) ⊂ Z is weakly closed, then the following conditions are equivalent:

(a) (12.2.32) is satisfied; that is, the equation Lx = b has a solution x in
K.

(b) L*w ∈ K* ⇒ ⟨b, w⟩ ≥ 0.

We can view Theorem 12.2.11 as an "alternative theorem." Namely, if L
is weakly continuous and L(K) is weakly closed, then either

(i) (12.2.32) is satisfied, or

(ii) there exists w ∈ W such that L*w is in K* and ⟨b, w⟩ < 0.

Due to this fact, the Generalized Farkas Theorem 12.2.11 is sometimes
referred to as the Farkas Alternative Theorem.
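In finite dimensions the alternative is easy to exhibit. The sketch below takes L to be the 2×2 identity and K = K* the nonnegative orthant, both illustrative choices: then Lx = b has a solution x ≥ 0 exactly when b ≥ 0, and otherwise a coordinate direction w serves as the certificate in case (ii).

```python
# Finite-dimensional Farkas alternative with L = identity, K = K* = R^2_+.

def solves_case_i(b):
    """Case (i): does Lx = b admit a solution x in K (here: is b >= 0)?"""
    return all(component >= 0.0 for component in b)

def certificate_case_ii(b):
    """Case (ii): a w with L*w = w in K* and <b, w> < 0, if one exists.
    For L = identity, the coordinate directions are enough to search."""
    for w in ((1.0, 0.0), (0.0, 1.0)):
        if b[0] * w[0] + b[1] * w[1] < 0.0:
            return w
    return None

feasible_b, infeasible_b = (2.0, 3.0), (1.0, -1.0)
# exactly one of the two alternatives holds for each b
assert solves_case_i(feasible_b) and certificate_case_ii(feasible_b) is None
w = certificate_case_ii(infeasible_b)
assert not solves_case_i(infeasible_b) and w == (0.0, 1.0)
assert infeasible_b[0] * w[0] + infeasible_b[1] * w[1] < 0.0   # <b, w> < 0
```

The weak-closedness hypothesis on L(K) is automatic here (closed convex cones in ℝ²); in the infinite-dimensional setting of Theorem 12.2.11 it is the substantive assumption.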

C. Approximation of linear programs

An important practical question is how to obtain, or at least estimate,
the value of a linear program. In later sections we shall consider two ap-
proaches related to the following definitions.
12.2.12 Definition. (Minimizing and maximizing sequences.)

(a) A sequence {xₙ} in X is called a minimizing sequence for ℙ if each
xₙ is feasible for ℙ and ⟨xₙ, c⟩ ↓ inf ℙ.

(b) A sequence {wₙ} in W is called a maximizing sequence for the
dual problem ℙ* if each wₙ is feasible for ℙ* and ⟨b, wₙ⟩ ↑ sup ℙ*.

Note that if ℙ is consistent with a finite value inf ℙ, then [by definition
(12.2.34) of inf ℙ] there exists a minimizing sequence. A similar remark
holds for ℙ*.
12.2.13 Definition. (Aggregations and inner approximations.)

(a) Let W̄ be a subset of W. Then the linear program

ℙ(W̄): minimize ⟨x, c⟩
subject to: ⟨Lx − b, w⟩ = 0 ∀w ∈ W̄, x ∈ K,   (12.2.40)

is called an aggregation (of constraints) of ℙ.

(b) If K′ ⊂ K is a subset of the positive cone K ⊂ X, then the program

ℙ(K′): minimize ⟨x, c⟩
subject to: Lx = b, x ∈ K′,   (12.2.41)

is called an inner approximation of ℙ.



As K′ is contained in K, we have inf ℙ ≤ inf ℙ(K′). On the other hand,
if x satisfies (12.2.32), then it satisfies (12.2.40), and so inf ℙ(W̄) ≤ inf ℙ.
Hence

inf ℙ(W̄) ≤ inf ℙ ≤ inf ℙ(K′).

Thus, we can use an aggregation (of constraints) to approximate inf ℙ from
below, whereas an inner approximation can be used to approximate inf ℙ
from above. One can also easily get the following (for a proof see Hernández-
Lerma and Lasserre [15]):
12.2.14 Proposition. Suppose that ℙ is solvable.
(a) If W̄ is weakly dense in W, then ℙ(W̄) is equivalent to ℙ in the sense
that ℙ(W̄) is also solvable and

min ℙ(W̄) = min ℙ.
(b) If K′ is weakly dense in K, then there is a sequence {xₙ} in K′ such
that

⟨xₙ, c⟩ → min ℙ.
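The sandwich inf ℙ(W̄) ≤ inf ℙ ≤ inf ℙ(K′) can be observed on a toy program solved by brute force over a grid (all data illustrative): minimize x₁ + 2x₂ subject to x₁ + x₂ = 1, x ≥ 0. Aggregating away every constraint (W̄ empty) relaxes ℙ; restricting to the ray K′ = {x : x₁ = 0} is an inner approximation.

```python
# Aggregation (lower bound) and inner approximation (upper bound) of a toy LP.
def cost(x):
    return x[0] + 2.0 * x[1]

grid = [(i / 100.0, j / 100.0) for i in range(201) for j in range(201)]

# inf P: full constraint Lx = b, i.e. x1 + x2 = 1, over the nonnegative grid
inf_P = min(cost(x) for x in grid if abs(x[0] + x[1] - 1.0) < 1e-9)
# aggregation with W-bar empty: every constraint dropped, only x in K remains
inf_P_aggregated = min(cost(x) for x in grid)
# inner approximation: same constraints, but x restricted to the ray x1 = 0
inf_P_inner = min(cost(x) for x in grid
                  if abs(x[0] + x[1] - 1.0) < 1e-9 and x[0] == 0.0)

assert inf_P_aggregated <= inf_P <= inf_P_inner
assert abs(inf_P - 1.0) < 1e-9            # attained at x = (1, 0)
assert abs(inf_P_inner - 2.0) < 1e-9      # the ray forces x = (0, 1)
```

Here the bounds are 0 ≤ 1 ≤ 2; refining W̄ toward W tightens the lower bound, and enlarging K′ toward K tightens the upper bound, which is the idea behind the aggregation-relaxation-inner-approximation scheme of §12.5.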
D. Tightness and invariant measures

The Alaoglu Theorem in Remark 12.2.2(c) gives conditions for a cer-
tain set to be compact in the weak* topology. This is particularly useful
when dealing, for example, with the dual pair (M(S), C₀(S)) in Remark
12.2.2(b). However, we will also need to consider compact sets in the weak
topology σ(M(S), C_b(S)), and so in this subsection we will briefly review
some related notions.
Let S be a Borel space, and let M(S)⁺ be the positive cone of (finite) non-
negative measures in M(S) [see Example 12.2.1(b) and Remark 12.2.8(b)].
A measure γ in M(S)⁺ is said to be tight if for each ε > 0 there is a com-
pact set K ⊂ S such that γ(K^c) < ε, where K^c denotes the complement
of K.
For example, suppose that either (i) S is σ-compact, or (ii) S is a Polish
space. Then any finite measure γ on the Borel σ-algebra 𝔅(S) is tight.
Similarly, a family Γ in M(S)⁺ is said to be tight if for each ε > 0 there
is a compact set K such that γ(K^c) < ε for all γ in Γ.
Tightness turns out to be closely related to the existence of a strictly
unbounded (also known as a moment or norm-like) function g ≥ 0 on
S, which means that there exists an increasing sequence of compact sets
Kₙ ↑ S such that

lim_{n→∞} inf_{x∉Kₙ} g(x) = +∞.   (12.2.42)

For instance, if g ≥ 0 is inf-compact, that is, if the level set {x : g(x) ≤ r}
is compact for every number r, then g is strictly unbounded.
is compact for every number r-then 9 is strictly unbounded.

The connection between tightness and strictly unbounded functions is
provided by the following theorem (see, for instance, Balder [1, §2] or Bour-
baki [1, p. 109]).
12.2.15 Theorem. Let Γ be a bounded family of measures in M(S)⁺. Then
Γ is tight if and only if there is a strictly unbounded function g ≥ 1 such
that [using the notation (12.2.8)]

sup_Γ ⟨γ, g⟩ < ∞.   (12.2.43)

Moreover, if Γ consists of probability measures only, then the condition
g ≥ 1 can be replaced by g ≥ 0.

As an elementary application of Theorem 12.2.15, if S is a compact metric
space, then each bounded family Γ of measures in M(S)⁺ is tight. Indeed,
let g(·) ≡ ḡ > 0 be a constant function on S, and M a constant such
that ‖γ‖_TV ≤ M for all γ in Γ. Then (12.2.42) trivially holds (because the
infimum over the empty set S^c is +∞), and (12.2.43) becomes

sup_Γ ⟨γ, g⟩ ≤ ḡM < ∞.
On the other hand, the connection between tightness and compactness
in the weak topology σ(M(S), C_b(S)) is provided by Prohorov's Theorem
(see Billingsley [1] or Parthasarathy [1]):
12.2.16 Theorem. (Prohorov's Theorem.) Let P(S) be the family of
probability measures in M(S)⁺, and Γ a subset of P(S). If Γ is tight, then
it is sequentially relatively compact in the weak topology σ(M(S), C_b(S));
that is, for each sequence {μₙ} in Γ there is a subsequence {μₘ} and a
probability measure μ (not necessarily in Γ) such that

⟨μₘ, u⟩ → ⟨μ, u⟩  ∀u ∈ C_b(S).   (12.2.44)

Prohorov's Theorem 12.2.16 is true for a general metric, not necessarily
Borel, space S. Furthermore, the converse holds (weak sequential relative
compactness implies tightness) if S is a Polish space.
Finally, we will present a theorem of Beneš [1, 2] that relates the con-
cepts of tightness, strictly unbounded functions, and invariant probability
measures (i.p.m.'s) for a Markov chain {xₙ} on the Borel space S, with
transition kernel P(B|x). Recall that if ν is a measure on S, then νPⁿ
denotes the measure

(νPⁿ)(B) := ∫_S Pⁿ(B|x) ν(dx)  ∀n = 0, 1, ..., B ∈ 𝔅(S),   (12.2.45)

whereas if u is a function on S, then Pu stands for the function

(Pu)(x) := ∫_S u(y) P(dy|x)  for x ∈ S.   (12.2.46)

Moreover, the chain {xₙ}, or the transition kernel P, is said to satisfy the
(weak) Feller property if

Pu is in C_b(S) whenever u is in C_b(S).   (12.2.47)

12.2.17 Theorem. Suppose that the Borel space S is σ-compact, and that
{xₙ} is a Markov chain on S that satisfies the Feller property. Then the
following conditions are equivalent:
(a) {xₙ} has an i.p.m.
(b) There is a p.m. ν such that the sequence {νPⁿ, n = 0, 1, ...} is tight.
(c) There is a p.m. ν and a strictly unbounded function g ≥ 0 such that

sup_n ⟨νPⁿ, g⟩ < ∞.

(d) There is a p.m. ν and a compact set K in S such that

lim sup_{N→∞} (1/N) Σ_{n=0}^{N−1} νPⁿ(K) > 0.
12.2.18 Remark. Beneš [1] proves Theorem 12.2.17 under the following,
stronger, assumptions:
(i) S is a LCSM space;
(ii) P satisfies the Feller property; and
(iii) for each compact set K, the function x ↦ P(K|x) vanishes at infinity.
As in Remark 12.2.7(b), it is easy to see that (ii) and (iii) imply that

Pu is in C₀(S) if u is in C₀(S) [cf. (12.2.30)],   (12.2.48)

and so (12.2.45), with n = 1, defines a map P : M(S) → M(S) which is
continuous in the weak* topology σ(M(S), C₀(S)). Beneš uses this fact and
the Alaoglu Theorem [Remark 12.2.2(c)] to relate (a) and (d) in Theorem
12.2.17, as well as a fifth condition (e) not included here. Without the
latter condition (e), it can be verified that the proof by Beneš also yields
Theorem 12.2.17 in its present form, assuming σ-compactness and the Feller
property, rather than (i), (ii), (iii). In fact, the relations

(a) ⇒ (b) ⇔ (c) ⇒ (d)

are immediate. Indeed, if (a) holds and γ denotes an i.p.m., then taking
ν = γ in (b), the sequence νPⁿ = γ (n = 0, 1, ...) is tight because any
single finite measure γ on a σ-compact metric space is tight. Hence (a)
implies (b). On the other hand, the equivalence of (b) and (c) follows from
Theorem 12.2.15, whereas (b) ⇒ (d) follows from the definition of tightness.
□
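For a finite state space, tightness in (b) is automatic, and the Cesàro averages appearing in (d) in fact converge to an i.p.m. A sketch with an illustrative two-state kernel (the exact i.p.m. here is (2/3, 1/3)):

```python
# Cesaro averages (1/N) sum_{n<N} nu P^n for a finite-state Feller chain
# converge to an i.p.m. gamma, i.e. gamma P = gamma.
P = [[0.9, 0.1],
     [0.2, 0.8]]        # P[x][x2] = P(x2 | x)

def step(nu):
    """nu -> nu P, as in (12.2.45) with n = 1."""
    return [sum(nu[x] * P[x][x2] for x in range(2)) for x2 in range(2)]

nu = [1.0, 0.0]          # initial p.m. concentrated at state 0
cesaro = [0.0, 0.0]
N = 20000
for _ in range(N):
    cesaro = [cesaro[0] + nu[0], cesaro[1] + nu[1]]
    nu = step(nu)
gamma = [mass / N for mass in cesaro]

# gamma is (numerically) invariant, and matches the exact i.p.m. (2/3, 1/3)
gamma_P = step(gamma)
assert max(abs(gamma_P[x] - gamma[x]) for x in range(2)) < 1e-4
assert abs(gamma[0] - 2 / 3) < 1e-3 and abs(gamma[1] - 1 / 3) < 1e-3
```

On a general Borel space this limiting step is exactly where the Feller property and tightness are needed, since weak limits of the Cesàro averages must be extracted via Prohorov's Theorem.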

Notes on §12.2

1. A partially ordered set (D, ≤) is said to be directed if every finite
subset of D has an upper bound; that is, if a, b are in D, then there exists
c ∈ D such that a ≤ c and b ≤ c. A net (also known as a generalized
sequence) in a topological space X is a function from a directed set D
into X, and it is denoted as {xₙ, n ∈ D} or simply {xₙ}. The net {xₙ} is
said to converge to the point x if for every neighborhood N of x there is
an n₀ in D such that xₙ ∈ N for all n ∈ D such that n ≥ n₀. (For further
properties of nets see, for instance, Ash [1].)
2. The material on infinite LP (Subsection B) is borrowed from Anderson
and Nash [1], except for the Generalized Farkas Theorem 12.2.11, which
is due to Craven and Koliha [1]. For applications of Theorem 12.2.11 see,
for instance, Hernández-Lerma and Lasserre [3, 5, 7], and Theorem 12.3.7
below. A Farkas-like result different from Theorem 12.2.11 can be found in
Hernández-Lerma and Lasserre [4].
3. For additional comments and references related to Theorem 12.2.17 see
Hernández-Lerma and Lasserre [2], which also gives necessary and sufficient
conditions for the existence of i.p.m.'s.

12.3 Linear programs for the AC problem


We will now consider the average cost (AC) criterion J(π, ν) from the view-
point of LP. Throughout the rest of this chapter we suppose that Assumption
11.4.1 is satisfied. Thus, by Lemma 11.4.5(a) and Proposition 11.4.4(b),
we already have in particular the existence of a stable randomized station-
ary policy φ^∞ ∈ Π_RS such that (φ^∞, p_φ) is a minimum pair, namely [by
Definition 11.1.1(a)],

J(φ^∞, p_φ) = ρ_min,   (12.3.1)

where

ρ_min := inf_{ν∈P(X)} J*(ν) = inf_{ν∈P(X)} inf_π J(π, ν).   (12.3.2)

In this section we begin by introducing a linear program (P) that satisfies
(12.1.1) with (MCP)* = ρ_min, that is,

sup(P*) ≤ ρ_min ≤ inf(P).   (12.3.3)

Then we will show that (P) is solvable and that there is no duality gap
[see (12.2.38)], so that instead of (12.3.3) we will actually have the stronger
relation

sup(P*) = ρ_min = min(P).   (12.3.4)

Finally, we will use the Generalized Farkas Theorem 12.2.11 to obtain nec-
essary and sufficient conditions for (P) to be consistent, which will require
a set of hypotheses different from Assumption 11.4.1.

A. The linear programs

We will first proceed as at the beginning of subsection 12.2.B to intro-


duce the components of the linear program we are interested in.
The dual pairs. Let OC C X x A be the set defined in (8.2.1), and
let w(x,a) and wo(x) be the weight functions on OC and X, respectively,
defined as

w(x, a) := 1 + c(x, a), wo(x) := min w(x, a). (12.3.5)


A(z)

(By a well-known result of Rieder [1], also stated in Proposition D.6(a) of
Volume I, Assumption 11.4.1(b) implies that w₀(x) is measurable.) Then
the dual pairs we are concerned with are

    (M_w(𝕂), B_w(𝕂))                                                  (12.3.6)

and

    (ℝ × M_w₀(X), ℝ × B_w₀(X)),                                       (12.3.7)

where M_w(𝕂) and B_w(𝕂) are the weighted-norm spaces in Example
12.2.1(b), and similarly for M_w₀(X) and B_w₀(X). In particular, the bi-
linear form on (M_w(𝕂), B_w(𝕂)) is [as in (12.2.8)]

    ⟨μ, u⟩ := ∫_𝕂 u dμ,                                               (12.3.8)

and on (ℝ × M_w₀(X), ℝ × B_w₀(X)) is [by (12.2.1)]

    ⟨(r, ν), (ρ, v)⟩ := r·ρ + ∫_X v dν.                               (12.3.9)

Note that, since c(x, a) is nonnegative [Assumption 11.4.1(b)], (12.3.5)
yields

    0 ≤ c(x, a) ≤ w(x, a)   ∀(x, a) ∈ 𝕂,

which implies that the cost-per-stage function c is in B_w(𝕂), and, on the
other hand,

    1 ≤ w₀(x) ≤ w(x, a)   ∀(x, a) ∈ 𝕂,                                (12.3.10)

which is the same as (12.2.16) with 𝕂 in lieu of X × Y. Moreover, the policy
π̂ and the initial state x̂ in Assumption 11.4.1(a) satisfy

    lim sup_{n→∞} (1/n) Σ_{t=0}^{n-1} E_x̂^π̂[w(x_t, a_t)] = 1 + J(π̂, x̂) < ∞.   (12.3.11)

We will suppose that w and w₀ satisfy a condition of the form (12.2.21),
with the kernel P(·|x, y) being replaced by the transition law Q(·|x, a);

namely:

12.3.1 Assumption. There is a constant k such that

    ∫_X w₀(y) Q(dy|x, a) ≤ k·w(x, a)   ∀(x, a) ∈ 𝕂.

This assumption is equivalent to saying that [as in Example 12.2.6(c)]
the function

    (x, a) ↦ ∫_X w₀(y) Q(dy|x, a)   is in B_w(𝕂).
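For a concrete feel of this drift condition, the following sketch verifies Assumption 12.3.1 in closed form for a hypothetical linear-quadratic example; the system F, the cost c, and the weights w, w₀ below are all assumptions made purely for illustration, not data from the book.

```python
# Hypothetical system x' = 0.5*x + a + z, z ~ N(0, 1), with cost
# c(x, a) = x**2 + a**2, weights w = 1 + c and w0(x) = 1 + x**2
# (the minimum of w over a, assuming 0 is an admissible action).
# Then  int w0(y) Q(dy|x, a) = 1 + E[(0.5*x + a + z)**2]
#                            = 2 + (0.5*x + a)**2,
# and (u + v)**2 <= 2*u**2 + 2*v**2 gives the drift bound with k = 2:
#   2 + (0.5*x + a)**2 <= 2*(1 + x**2 + a**2) = 2*w(x, a).

def lhs(x, a):
    return 2.0 + (0.5 * x + a) ** 2    # int w0 dQ, in closed form

def rhs(x, a, k=2.0):
    return k * (1.0 + x * x + a * a)   # k * w(x, a)

grid = [i * 0.5 - 10.0 for i in range(41)]   # x, a in [-10, 10]
ok = all(lhs(x, a) <= rhs(x, a) for x in grid for a in grid)
print(ok)   # True: Assumption 12.3.1 holds with k = 2 for this example
```

The grid check is redundant given the algebraic bound, but it is a cheap sanity test of the closed form for ∫ w₀ dQ.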

The linear maps. In (12.2.17) and (12.2.24) replace X × Y by 𝕂. This
yields the linear maps

    L₀ : M_w(𝕂) → ℝ  and  L₁ : M_w(𝕂) → M_w₀(X),

with

    L₀μ := ⟨μ, 1⟩ = μ(𝕂)                                              (12.3.12)

and

    (L₁μ)(B) := μ̄(B) − ∫_𝕂 Q(B|x, a) μ(d(x, a))  for B ∈ B(X),        (12.3.13)

where μ̄ denotes the marginal of μ on X [see (12.2.19)]. Finally, let

    L : M_w(𝕂) → ℝ × M_w₀(X)

be the linear map in (12.2.25), i.e.,

    Lμ := (L₀μ, L₁μ).                                                 (12.3.14)

As in (12.2.27), the adjoint

    L* : ℝ × B_w₀(X) → B_w(𝕂)

of L is given by

    L*(ρ, u)(x, a) := ρ + u(x) − ∫_X u(y) Q(dy|x, a)                  (12.3.15)

for every pair (ρ, u) in ℝ × B_w₀(X) and (x, a) in 𝕂. Hence, Assumption
12.3.1 and Proposition 12.2.5 yield [as in Example 12.2.6(d)] that

    the linear map L in (12.3.14) is weakly continuous,               (12.3.16)

that is, continuous with respect to the weak topologies σ(M_w(𝕂), B_w(𝕂))
and σ(ℝ × M_w₀(X), ℝ × B_w₀(X)).


The linear programs. To complete the description of our linear pro-
gram as at the beginning of Subsection 12.2.B, we introduce the "vectors"

    b := (1, 0) in ℝ × M_w₀(X),  and  c in B_w(𝕂),

where c is the cost-per-stage function, as well as the positive cone

    K := M_w(𝕂)⁺,                                                     (12.3.17)

whose dual cone is

    K* := B_w(𝕂)⁺.                                                    (12.3.18)

[See Remark 12.2.8(b).] Then the primal linear program is

    (P)  minimize ⟨μ, c⟩
         subject to: Lμ = (1, 0),  μ ∈ M_w(𝕂)⁺.                       (12.3.19)

More explicitly, by (12.3.12)-(12.3.14), the constraint (12.3.19) is satis-
fied if

    L₀μ = μ(𝕂) = 1                                                    (12.3.20)

and L₁μ = 0, i.e.,

    μ̄(B) − ∫_𝕂 Q(B|x, a) μ(d(x, a)) = 0  ∀B ∈ B(X), with μ ∈ M_w(𝕂)⁺.
                                                                      (12.3.21)

Observe that, in particular, (12.3.20) requires μ to be a probability measure
(p.m.). Moreover, recalling Lemma 9.4.4 [see also Remark 12.3.2(a) below],
(12.3.21) can be written as

    μ̄(B) = ∫_X Q(B|x, φ) μ̄(dx)   ∀B ∈ B(X),

for some stochastic kernel φ ∈ Φ, which means that μ is feasible for (P)
if μ is a p.m. on 𝕂 such that its marginal μ̄ on X is an i.p.m. for the
transition kernel Q(·|x, φ).
On the other hand, observe that

    ⟨b, w⟩ = ⟨(1, 0), (ρ, u)⟩ = ρ   ∀w = (ρ, u) ∈ ℝ × B_w₀(X).

Hence, by (12.3.18) and (12.3.15), the dual of (P) is [as in (12.2.33)]

    (P*)  maximize ρ
          subject to: ρ + u(x) − ∫_X u(y) Q(dy|x, a) ≤ c(x, a)        (12.3.22)
                      ∀(x, a) ∈ 𝕂,  with (ρ, u) ∈ ℝ × B_w₀(X).

This completes the specification of the linear programs associated to the
AC problem.
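To make the abstract program (P) concrete, the following sketch specializes it to a hypothetical finite MDP with two states and two actions; all transition probabilities and costs are made-up illustrative numbers, not an example from the book. With finitely many state-action pairs, μ is a vector of nonnegative weights, (12.3.20) says the weights sum to one, and (12.3.21) says the marginal μ̄ is an invariant p.m. of Q(·|x, φ), exactly as noted after (12.3.21).

```python
from itertools import product

# A hypothetical 2-state, 2-action MDP; all numbers are illustrative.
# Q[s][a][y] = transition probability Q({y} | s, a); c[s][a] = one-stage cost.
Q = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
c = [[1.0, 3.0], [2.0, 0.5]]

def invariant_pm(P):
    # Invariant p.m. of an irreducible 2-state transition matrix P.
    p01, p10 = P[0][1], P[1][0]
    pi0 = p10 / (p01 + p10)
    return [pi0, 1.0 - pi0]

def occupation_measure(f):
    # mu(s, a) = 1{a = f(s)} * mu_bar(s), where mu_bar is the i.p.m.
    # of the kernel Q(.|., f); the marginal of mu on X is then mu_bar.
    mu_bar = invariant_pm([Q[s][f[s]] for s in (0, 1)])
    return {(s, a): mu_bar[s] * (a == f[s]) for s in (0, 1) for a in (0, 1)}

def is_feasible(mu, tol=1e-9):
    # Primal constraints: mu(K) = 1 (12.3.20) and, for each state y,
    # mu_bar(y) - sum_{s,a} Q(y|s,a) mu(s,a) = 0, i.e. L1 mu = 0 (12.3.21).
    total = sum(mu.values())
    residuals = [
        sum(mu[(y, a)] for a in (0, 1))
        - sum(Q[s][a][y] * mu[(s, a)] for s in (0, 1) for a in (0, 1))
        for y in (0, 1)
    ]
    return abs(total - 1.0) < tol and all(abs(r) < tol for r in residuals)

# Each deterministic stationary policy f induces a feasible mu; minimizing
# <mu, c> over them attains rho_min here, since every induced chain is unichain.
values = []
for f in product((0, 1), repeat=2):
    mu = occupation_measure(f)
    assert is_feasible(mu)
    values.append(sum(c[s][a] * m for (s, a), m in mu.items()))
rho_min = min(values)
print(rho_min)   # 0.75, attained by f(0) = 0, f(1) = 1
```

Restricting to deterministic policies is enough in this finite unichain setting; in general the minimum of (P) is taken over all occupation measures in M_w(𝕂)⁺.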

B. Solvability of (P)

Before proceeding to verify (12.3.3) and (12.3.4), let us note the following.
12.3.2 Remark. We will use the following conventions:

(a) A measure μ on 𝕂 ⊂ X × A may (and will) be viewed as a measure
on all of X × A by defining μ(𝕂^c) := 0, where 𝕂^c stands for the
complement of 𝕂 in X × A.

(b) We will regard c : 𝕂 → ℝ₊ as a function on all of X × A with c(x, a) :=
+∞ if (x, a) is in 𝕂^c. Observe that this convention is consistent with
Assumption 11.4.1(c), and, moreover, by (12.3.5), the weight function
w = +∞ on 𝕂^c. Any other function u in B_w(𝕂) can be arbitrarily
extended to X × A, for example, as u := 0 on 𝕂^c.

(c) 0 · (+∞) := 0.

(d) As in (12.2.20), a function u in B_w₀(X) will also be seen as a function
in B_w(𝕂) given by u(x, a) := u(x) for all (x, a) in 𝕂.

Then, in particular, we may write the bilinear form in (12.3.8) as

    ⟨μ, u⟩ = ∫_{X×A} u dμ

for any measure μ in M_w(𝕂) and any function u in B_w(𝕂) or in B_w₀(X).  □
We will next show that (P) is solvable and that instead of (12.3.3) we
have

    sup(P*) ≤ ρ_min = min(P).                                         (12.3.23)

[Concerning part (b) in the next theorem, see Note 2 at the end of this
section.]

12.3.3 Theorem. Suppose that Assumptions 11.4.1 and 12.3.1 are satis-
fied. Then:

(a) [Solvability of (P)]. There exists an optimal solution μ* for (P),
and

    min(P) = ρ_min = ⟨μ*, c⟩.                                         (12.3.24)

(b) [Consistency of (P*)]. The dual problem (P*) is consistent and it
satisfies the inequality in (12.3.23).
Proof. (a) By Lemma 11.4.5(a), there exists a stable randomized policy
φ*^∞ such that (φ*^∞, p_φ*) is a minimum pair. That is, by Definitions 11.4.3(a)
and 11.1.1(a), p_φ* is an i.p.m. for the transition kernel

    Q_φ*(B|x) := ∫_A Q(B|x, a) φ*(da|x),

and

    J(φ*^∞, p_φ*) = ∫_X c_φ*(x) p_φ*(dx) = ρ_min,                       (12.3.25)

where

    c_φ*(x) := ∫_A c(x, a) φ*(da|x).

Furthermore, as p_φ* is an i.p.m. for Q_φ*, for every B in B(X) we have
p_φ*(B) = ∫_X Q_φ*(B|x) p_φ*(dx), i.e.,

    p_φ*(B) = ∫_X ∫_A Q(B|x, a) φ*(da|x) p_φ*(dx).                      (12.3.26)

Now let μ* be the measure on X × A defined as

    μ*(B × C) := ∫_B φ*(C|x) p_φ*(dx)   ∀B ∈ B(X), C ∈ B(A).

Then, by Definition 8.2.1 (of Φ), μ* is a p.m. on X × A, concentrated on
𝕂 [that is, μ*(𝕂) = 1], and its marginal on X coincides with p_φ*:

    μ̄*(B) := μ*(B × A) = p_φ*(B)   ∀B ∈ B(X).

It follows that we may rewrite (12.3.26) and (12.3.25) as

    μ̄*(B) − ∫_𝕂 Q(B|x, a) μ*(d(x, a)) = 0   ∀B ∈ B(X),

and

    ⟨μ*, c⟩ = ρ_min,                                                  (12.3.27)

which means that we already have the second equality in (12.3.24), as well
as the equalities μ*(𝕂) = 1 and L₁μ* = 0 in (12.3.20) and (12.3.21).

Therefore, to complete the proof of part (a) it suffices to show that

(i) μ* is in M_w(𝕂) [see (12.2.2)], so that μ* is indeed feasible for (P); and

(ii) ⟨μ, c⟩ ≥ ρ_min for any feasible solution μ for (P), which would yield
inf(P) ≥ ρ_min.

In other words, (i), (ii) and (12.3.27) will give that μ* is feasible for (P)
and

    ρ_min = ⟨μ*, c⟩ ≥ inf(P) ≥ ρ_min,  i.e.,  ⟨μ*, c⟩ = ρ_min.

Let us, then, prove (i), (ii).

Proof of (i). This is easy because, by (12.3.5) and (12.3.27),

    ⟨μ*, w⟩ = 1 + ⟨μ*, c⟩ < ∞.



Proof of (ii). If μ satisfies (12.3.20) and (12.3.21), then, in particular, μ is
a probability measure on X × A concentrated on 𝕂; see Remark 12.3.2(a).
Thus, by Lemma 9.4.4, there is a stochastic kernel φ ∈ Φ such that

    μ(B × C) = ∫_B φ(C|x) μ̄(dx)   ∀B ∈ B(X), C ∈ B(A).

Furthermore, taking (φ^∞, p_φ) := (φ^∞, μ̄), (12.3.21) gives that φ^∞ is a
stable randomized policy, and, therefore, by (11.4.4) and the definition of
ρ_min,

    ⟨μ, c⟩ = J(φ^∞, μ̄) ≥ ρ_min.

This proves (ii), which completes the proof of part (a).

(b) By (a) and the weak duality property (12.2.36), to prove (b) it suffices
to show that (P*) is consistent. This, however, is obvious: for example, the
pair (ρ, u) with ρ = u(·) ≡ 0 satisfies (12.3.22).  □
C. Absence of duality gap

To continue with the program for this section, we now turn our attention
to proving (12.3.4).
12.3.4 Theorem. (Absence of duality gap.) If Assumptions 11.4.1 and
12.3.1 are satisfied, then (12.3.4) holds.
Proof. We wish to use Theorem 12.2.10 with Z and L as in (12.3.7) and
(12.3.14), respectively. Hence, we wish to show that the set

    H := {((L₀μ, L₁μ), ⟨μ, c⟩ + r) | μ ∈ M_w(𝕂)⁺, r ∈ ℝ₊}

is closed in the weak topology

    σ(ℝ × M_w₀(X) × ℝ, ℝ × B_w₀(X) × ℝ).

[See (12.2.1) for the bilinear form on a product of dual pairs, and Note 1
in §12.2 for the definition of "nets", which are used next.] Thus, let (D, ≤)
be a directed set, and consider a net {(μ_α, r_α), α ∈ D} in M_w(𝕂)⁺ × ℝ₊
such that

    L₀μ_α := μ_α(𝕂) → r*,                                             (12.3.28)
    ⟨L₁μ_α, u⟩ → ⟨ν*, u⟩   ∀u ∈ B_w₀(X),  and                          (12.3.29)
    ⟨μ_α, c⟩ + r_α → ρ*.                                              (12.3.30)

We will show that ((r*, ν*), ρ*) is in H; that is, there exists a measure μ in
M_w(𝕂)⁺ and a number r ≥ 0 such that

    r* = L₀μ := μ(𝕂),                                                 (12.3.31)
    ν* = L₁μ,  and                                                    (12.3.32)
    ρ* = ⟨μ, c⟩ + r.                                                  (12.3.33)

We shall consider two cases, r* = 0 and r* > 0.

Case 1: r* = 0. As μ_α is nonnegative,

    ‖μ_α‖_TV = μ_α(𝕂) = L₀μ_α.                                         (12.3.34)

[See (12.2.18).] Therefore, if r* = 0 in (12.3.28), it follows easily that
(12.3.31)-(12.3.33) hold with μ(·) ≡ 0 and r = ρ*.

Case 2: r* > 0. By (12.3.28) [together with (12.3.34)] and (12.3.30),
there exists α₀ in D such that

    μ_α(𝕂) ≤ 2r*  and  ⟨μ_α, c⟩ ≤ 2ρ*   ∀α ≥ α₀.                       (12.3.35)

Hence, as ⟨μ_α, c + 1⟩ = ⟨μ_α, c⟩ + ‖μ_α‖_TV, we get that Γ := {μ_α, α ≥ α₀} is
a bounded set of measures, which combined with Assumption 11.4.1(c) and
Theorem 12.2.15 yields that Γ is tight. Moreover, if μ_α(𝕂) > 0, we may
"normalize" μ_α, rewriting it as μ_α(·)/μ_α(𝕂), and so we may assume that Γ
is a (tight) family of probability measures. Then, by Prohorov's Theorem
12.2.16, for each sequence {μ_n} in Γ there is a subsequence {μ_m} and a
p.m. μ on 𝕂 such that

    ⟨μ_m, v⟩ → ⟨μ, v⟩   ∀v ∈ C_b(𝕂).                                   (12.3.36)

In particular, taking v(·) ≡ 1, (12.3.28) yields that μ satisfies (12.3.31).


We will next show that

(i) I' is in Mw(OC)+, that is, IIJ.tllw := (1-', w) < 00 [see (12.2.2)], and
(ii) I' satisfies (12.3.32).

Proof of (i). As w := 1 + c, to prove (i) we need to show that (1-', c) is


finite. We will prove the latter by showing that

[(12.3.36), c ~ 0 and l.s.c.] =* liminf(J.tm,c) ~ (I', c). (12.3.37)

Indeed, if c ~ 0 and l.s.c. [as in Assumption 11.4.1(b)], then there exists


an increasing sequence of functions Vk in Cb(OC) such that Vk t c. It follows
from (12.3.36) that for each k

Thus, letting k ~ 00, the Monotone Convergence Theorem gives (12.3.37).


Proof of (ii). As in Example 12.2.6(d), the weak continuity condition on
Q [Assumption 11.4.1(d)] implies that the adjoint of L₁, namely,

    (L₁*u)(x, a) := u(x) − ∫_X u(y) Q(dy|x, a),

maps C_b(X) into C_b(𝕂). Therefore, (12.3.36) and (12.3.29) yield that for
any function u in C_b(X)

    ⟨L₁μ, u⟩ = ⟨μ, L₁*u⟩ = lim_{m→∞} ⟨μ_m, L₁*u⟩     [by (12.3.36)]
             = lim_{m→∞} ⟨L₁μ_m, u⟩
             = ⟨ν*, u⟩                               [by (12.3.29)].

That is, ⟨L₁μ, u⟩ = ⟨ν*, u⟩ for any function u in C_b(X), which implies
(12.3.32). This proves (ii).

Summarizing, we have shown that μ is a measure in M_w(𝕂)⁺ that sat-
isfies (12.3.31) and (12.3.32). Finally, from (12.3.37) and (12.3.30) we see
that

    ρ* ≥ ⟨μ, c⟩ + lim inf_{m→∞} r_m ≥ ⟨μ, c⟩,   as r_m ≥ 0 ∀m.

Thus, defining r := ρ* − ⟨μ, c⟩ (≥ 0), we conclude that μ and r satisfy
(12.3.31), (12.3.32) and (12.3.33). This shows that H is indeed weakly
closed, and so (12.3.4) follows.  □
Having (12.3.4), in the following sections we consider conditions for the
solvability of the dual problem (P*) and for the convergence of approx-
imations to the optimal values max(P*) and min(P). First, however, we
shall conclude this section by showing a different approach to obtain the
consistency of (P).
D. The Farkas alternative

Theorem 12.2.11 is important because, in particular, it can be used to


obtain necessary and sufficient conditions for (P) to be consistent. It re-
quires, however, that a certain set should be weakly closed, which turns
out to be technically demanding. Hence, to apply Theorem 12.2.11 to our
linear program (P) we shall proceed first to introduce a set of hypotheses,
Assumption 12.3.5, different from Assumption 11.4.1, and then we shall
proceed to "perturbate" (P) in a suitable form.
12.3.5 Assumption. Assumption 12.3.1 is satisfied and, in addition:

(a) X and 𝕂 are locally compact separable metric (LCSM) spaces.

(b) The one-stage cost c(x, a) is nonnegative and l.s.c.

(c) Q is weakly continuous [see Assumption 11.4.1(d)].

(d) For every compact subset C of X, the function (x, a) ↦ Q(C|x, a)
vanishes at infinity; that is, for each ε > 0 there is a compact set
C' = C'(ε, C) such that Q(C|x, a) ≤ ε for all (x, a) ∉ C'.
12.3.6 Remark. (a) Observe that Assumption 12.3.5 and Assumption
11.4.1 are not directly "comparable". For instance, the latter requires X
and 𝕂 to be Borel spaces, which is a condition weaker than Assumption
12.3.5(a). We now need X and 𝕂 to be locally compact separable metric
(LCSM) spaces because we wish to consider dual pairs (X, Y) as in Remark
12.2.2(b), (c), and Remark 12.2.18. A sufficient condition for 𝕂 ⊂ X × A
to be as in Assumption 12.3.5(a) is that X and A are both LCSM (which
implies that X × A is LCSM) and that 𝕂 is either open or closed in X × A.
For a proof of the latter fact see, for instance, Dieudonné [1, pp. 66, 75].

Similarly, Assumption 12.3.5(b) is weaker than 11.4.1(b), and, further-
more, Assumption 12.3.5 requires neither 11.4.1(a) nor 11.4.1(c), but it
does require the "vanishing-at-infinity" condition 12.3.5(d).

(b) As in Remark 12.2.7(b) [see (12.2.29) and (12.2.30)], Assumptions
12.3.5(c) and 12.3.5(d) imply that

    (x, a) ↦ ∫_X u(y) Q(dy|x, a)  is in C₀(𝕂) for each u ∈ C₀(X).     (12.3.38)

For example, consider a general discrete-time system

    x_{t+1} = F(x_t, a_t, z_t),   t = 0, 1, ...,

with values in, say, X = ℝ^d, and i.i.d. disturbances z_t in Z = ℝ^m. Then,
since

    ∫_X u(y) Q(dy|x, a) = E[u(F(x, a, z₀))],

Assumptions 12.3.5(c) and 12.3.5(d) are both satisfied if, for every z in
Z, the function F(x, a, z) is continuous in (x, a) and F(x, a, z) → ∞ as
(x, a) → ∞.  □
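As a numerical illustration of the vanishing-at-infinity condition 12.3.5(d), take a hypothetical scalar system x_{t+1} = 0.5x_t + a_t + z_t with standard Gaussian noise and a compact action set, so that Q(C|x, a) has a closed form through the normal CDF. All numbers below are illustrative assumptions, not the book's data.

```python
from math import erf, sqrt

def norm_cdf(t):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

def Q_C(x, a, lo=-1.0, hi=1.0):
    # For x' = 0.5*x + a + z with z ~ N(0, 1):
    # Q([lo, hi] | x, a) = Phi(hi - m) - Phi(lo - m), where m = 0.5*x + a.
    m = 0.5 * x + a
    return norm_cdf(hi - m) - norm_cdf(lo - m)

# With actions confined to a compact set, say A = [-1, 1], the mean
# m = 0.5*x + a tends to infinity with |x|, so Q(C|x, a) vanishes at
# infinity for the compact set C = [-1, 1].
vals = [Q_C(x, 1.0) for x in (0.0, 10.0, 50.0)]
print(vals)   # decreasing toward 0
```

The decay is geometric in m² here, which is much stronger than 12.3.5(d) requires; the condition only asks that Q(C|·,·) be small outside some compact set C'(ε, C).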

A "perturbation" of (P). Let v₀ be a strictly positive function in
C₀(X), and consider the linear operator

    ℒ : M_w(𝕂) × ℝ² → M_w₀(X) × ℝ²

defined as

    ℒ(μ, r₁, r₂) := (L₁μ, ⟨μ, 1⟩ + r₁, ⟨μ, v₀⟩ − r₂),                  (12.3.39)

with L₁ as in (12.3.13). [To write ⟨μ, v₀⟩ in (12.3.39) we have used Remark
12.3.2(d).] The adjoint

    ℒ* : B_w₀(X) × ℝ² → B_w(𝕂) × ℝ²

is given by

    ℒ*(u, ρ₁, ρ₂) := (L₁*u + ρ₁ + ρ₂v₀, ρ₁, −ρ₂).                      (12.3.40)

We will next use ℒ and the Generalized Farkas Theorem 12.2.11 to obtain
the following.
12.3.7 Theorem. [Equivalent formulations of the consistency of
(P).] If Assumption 12.3.5 holds, then the following statements are equiv-
alent:

(a) (P) is consistent, that is, there is a measure μ that satisfies (12.3.19).

(b) The linear equation

    ℒ(μ, r₁, r₂) = (0, 1, ε)  has a solution (μ, r₁, r₂) in M_w(𝕂)⁺ × ℝ₊²
                                                                      (12.3.41)
for some ε > 0.

(c) The condition

    L₁*u + ρ₁ + ρ₂v₀ ≥ 0  with u ∈ B_w₀(X), ρ₁ ≥ 0, and ρ₂ ≤ 0        (12.3.42)

implies

    ρ₁ + ερ₂ ≥ 0  for some ε > 0.                                     (12.3.43)

In the proof of Theorem 12.3.7 we will use the following lemma, where
we use the notation in Remark 12.2.2(a), (b), and Remark 12.3.2(d).

12.3.8 Lemma. Suppose that Assumption 12.3.5(a) holds, and let {μ_n}
be a bounded sequence of measures on 𝕂. If μ_j converges to μ in the weak*
topology σ(M(𝕂), C₀(𝕂)), then the marginals μ̄_j on X converge to μ̄ in the
weak* topology σ(M(X), C₀(X)); that is, if

    ⟨μ_j, v⟩ → ⟨μ, v⟩   ∀v ∈ C₀(𝕂),                                    (12.3.44)

then

    ⟨μ̄_j, u⟩ → ⟨μ̄, u⟩   ∀u ∈ C₀(X).                                    (12.3.45)

Proof. Under Assumption 12.3.5(a), 𝕂 is σ-compact; that is, there is
an increasing sequence of compact sets K_n ↑ 𝕂. Moreover, by Urysohn's
Lemma (see, for instance, Rudin [1], p. 39), for any given ε > 0 and each
n = 1, 2, ..., there is a function α_n in C₀(𝕂) such that 0 ≤ α_n ≤ 1, with
α_n = 1 on K_n and α_n(x, a) = 0 if the distance from (x, a) to K_n is ≥ ε.
Now choose an arbitrary function u in C₀(X) and define the functions

    v_n(x, a) := α_n(x, a) u(x)   for (x, a) ∈ 𝕂.

Then v_n is in C₀(𝕂) and, for (x, a) in 𝕂,

    |v_n(x, a)| ≤ |u(x)| ≤ ‖u‖ (:= sup norm of u),  and
    v_n(x, a) → u(x)  as n → ∞.

Hence, by the Bounded Convergence Theorem, for every fixed j,

    lim_n ⟨μ_j, v_n⟩ = ⟨μ_j, u⟩ = ⟨μ̄_j, u⟩,                             (12.3.46)

and, on the other hand,

    lim_n ⟨μ, v_n⟩ = ⟨μ, u⟩ = ⟨μ̄, u⟩,                                   (12.3.47)

where in the latter equality we have used Remark 12.3.2(d) to write

    ⟨μ, u⟩ = ∫_𝕂 u dμ = ∫_X u dμ̄ = ⟨μ̄, u⟩   for u ∈ C₀(X),             (12.3.48)

and similarly for the second equality in (12.3.46). Moreover, for every fixed
n, (12.3.44) yields

    lim_{j→∞} ⟨μ_j, v_n⟩ = ⟨μ, v_n⟩.                                    (12.3.49)

Finally, the desired conclusion (12.3.45) follows from (12.3.46)-(12.3.49)
and the inequality [which uses (12.3.48) again]

    |⟨μ̄_j, u⟩ − ⟨μ̄, u⟩| = |⟨μ_j, u⟩ − ⟨μ, u⟩|
        ≤ |⟨μ_j, u⟩ − ⟨μ_j, v_n⟩| + |⟨μ_j, v_n⟩ − ⟨μ, v_n⟩| + |⟨μ, v_n⟩ − ⟨μ, u⟩|.  □

We are now ready for the proof of Theorem 12.3.7.

Proof of Theorem 12.3.7. (a) ⇔ (b). If μ satisfies (12.3.19), then

    Lμ := (⟨μ, 1⟩, L₁μ) = (1, 0),  i.e.,  ⟨μ, 1⟩ = 1 and L₁μ = 0,

and so (μ, 0, 0) satisfies (12.3.41) with ε := ⟨μ, v₀⟩. Conversely, suppose
that (μ, r₁, r₂) satisfies (12.3.41), that is,

    L₁μ = 0,  ⟨μ, 1⟩ + r₁ = 1,  and  ⟨μ, v₀⟩ − r₂ = ε.

Then, in particular, ⟨μ, v₀⟩ ≥ ε, which implies that ⟨μ, 1⟩ = μ(𝕂) > 0.
Therefore, the measure μ* := μ/⟨μ, 1⟩ satisfies (12.3.19).

(b) ⇔ (c). In this proof we use the Generalized Farkas Theorem 12.2.11
with the following identifications:

    (X, Y) := (M_w(𝕂) × ℝ², B_w(𝕂) × ℝ²),
    (Z, W) := (M_w₀(X) × ℝ², B_w₀(X) × ℝ²),
    K := M_w(𝕂)⁺ × ℝ₊²,                                               (12.3.50)
    L := ℒ [in (12.3.39)],  and  b := (0, 1, ε).

Then condition (a) in Theorem 12.2.11 turns out to be precisely (12.3.41),
whereas condition (b) in Theorem 12.2.11 becomes

    ℒ*(u, ρ₁, ρ₂) ≥ 0  ⟹  ⟨(0, 1, ε), (u, ρ₁, ρ₂)⟩ ≥ 0,

which is the same as "the condition (12.3.42) implies (12.3.43)". Therefore,
to prove the equivalence of (b) and (c) in Theorem 12.3.7 we only need to
verify the hypotheses of Theorem 12.2.11, namely:

(i) ℒ is weakly continuous,

(ii) ℒ(K) is weakly closed.

In fact, part (i) is obvious because the adjoint ℒ* [in (12.3.40)] maps W
into Y; see Proposition 12.2.5. Thus, it only remains to prove (ii).

Proof of (ii). To prove that ℒ(K) is closed, with K as in (12.3.50),
consider a directed set (D, ≤), and a net {(μ_α, r₁^α, r₂^α), α ∈ D} in K such
that ℒ(μ_α, r₁^α, r₂^α) converges weakly to, say, (ν, ρ₁, ρ₂) in M_w₀(X) × ℝ²; that
is,

    ⟨L₁μ_α, u⟩ → ⟨ν, u⟩   ∀u ∈ B_w₀(X),                                (12.3.51)
    ⟨μ_α, 1⟩ + r₁^α → ρ₁,  and                                         (12.3.52)
    ⟨μ_α, v₀⟩ − r₂^α → ρ₂.                                             (12.3.53)

We wish to show that the limiting triplet (ν, ρ₁, ρ₂) is in ℒ(K); that is,
there exists (μ⁰, r₁⁰, r₂⁰) in K such that

    ℒ(μ⁰, r₁⁰, r₂⁰) = (ν, ρ₁, ρ₂).                                      (12.3.54)

We shall consider two cases, ρ₁ = 0 and ρ₁ > 0.

If ρ₁ = 0, then (12.3.52) yields, in particular,

    ⟨μ_α, 1⟩ = μ_α(𝕂) → 0,

which in turn yields ν = 0, by (12.3.51) and the weak continuity of L₁.
Hence (12.3.54) holds for (μ⁰, r₁⁰, r₂⁰) = (0, 0, −ρ₂). Let us now consider the
case ρ₁ > 0.

Suppose that ρ₁ > 0. Then, by (12.3.52), there exists α₀ ∈ D such that

    μ_α(𝕂) ≤ 2ρ₁   ∀α ≥ α₀.                                           (12.3.55)

Hence, by Remark 12.2.2(b), (c) there is a (nonnegative) measure μ⁰ on
𝕂 and a sequence {j} in D such that μ_j → μ⁰ in the weak* topology
σ(M(𝕂), C₀(𝕂)), i.e.,

    ⟨μ_j, v⟩ → ⟨μ⁰, v⟩   ∀v ∈ C₀(𝕂).                                   (12.3.56)

Moreover, μ⁰ is in M(𝕂)⁺ since, by (12.3.55) and Proposition 12.2.3,

    μ⁰(𝕂) ≤ lim inf_j μ_j(𝕂) ≤ 2ρ₁ < ∞.

On the other hand, from (12.3.56), Lemma 12.3.8 and (12.3.8) we get that
L₁μ_j converges to L₁μ⁰ in the weak* topology σ(M(X), C₀(X)), i.e.,

    ⟨L₁μ_j, u⟩ = ⟨μ_j, L₁*u⟩ → ⟨μ⁰, L₁*u⟩ = ⟨L₁μ⁰, u⟩   ∀u ∈ C₀(X).

This fact and (12.3.51) yield the first equality in (12.3.54), L₁μ⁰ = ν.
Finally, the second and third equalities in (12.3.54) hold with

    r₁⁰ := ρ₁ − ⟨μ⁰, 1⟩  and  r₂⁰ := ⟨μ⁰, v₀⟩ − ρ₂,



which concludes the proof of (ii), and Theorem 12.3.7 follows.  □


From Theorem 12.3.7 we can obtain a sufficient condition for (P) to be
consistent under Assumption 12.3.5, as well as a connection with Theorem
12.2.17-see Remark 12.3.10.
12.3.9 Corollary. [A sufficient condition for the consistency of
(P).] Suppose that Assumption 12.3.5 is satisfied, and let v₀ be as in
(12.3.39), a strictly positive function in C₀(X). Let w₀ be the weight func-
tion on X and suppose, in addition, that there exist ε > 0, a randomized
stationary policy φ^∞, and an initial state x ∈ X such that

    lim inf_{n→∞} E_x^φ∞[w₀(x_n)]/n = 0                                (12.3.57)

and

    lim inf_{n→∞} (1/n) Σ_{t=0}^{n-1} E_x^φ∞[v₀(x_t)] ≥ ε.             (12.3.58)

Then (P) is consistent.
Proof. The idea is to prove that part (c) in Theorem 12.3.7 is satisfied.
More precisely, we wish to show that, under (12.3.57) and (12.3.58), the con-
dition (12.3.42) implies (12.3.43). To do this, let us first note that (12.3.57)
yields

    lim inf_{n→∞} E_x^φ∞|u(x_n)|/n = 0   ∀u ∈ B_w₀(X),                 (12.3.59)

which is obvious because |u(·)| ≤ ‖u‖_w₀ w₀(·) for u in B_w₀(X).

Let us now use the definition of L₁* [see (12.2.26)] to rewrite (12.3.42) as

    u(x) ≥ ∫_X u(y) Q(dy|x, a) − ρ₁ − ρ₂v₀(x)   ∀x ∈ X, a ∈ A(x),

which integrated with respect to φ(·|x) yields

    u(x) ≥ ∫_X u(y) Q(dy|x, φ) − ρ₁ − ρ₂v₀(x)   ∀x ∈ X.

Iteration of the latter inequality gives, for all x ∈ X and n = 1, 2, ...,

    u(x) ≥ E_x^φ∞ u(x_n) − nρ₁ − ρ₂ Σ_{t=0}^{n-1} E_x^φ∞ v₀(x_t),

i.e.,

    u(x) + nρ₁ + ρ₂ Σ_{t=0}^{n-1} E_x^φ∞ v₀(x_t) ≥ E_x^φ∞ u(x_n).

Thus, multiplying by 1/n and taking lim inf as n → ∞, (12.3.59) and
(12.3.58) give (12.3.43), as ρ₂ ≤ 0.  □

12.3.10 Remark. The connection between Corollary 12.3.9 and Theorem
12.2.17 (see also Remark 12.2.18) is as follows. Let ε > 0 in (12.3.58) be
such that ε ≤ ‖v₀‖, and choose 0 < ε₀ < ε. As v₀(·) > 0 is in C₀(X),
there is a compact set C in X such that 0 < v₀(x) ≤ ε₀ for all x not in
C. Then, writing X as the union of C and its complement, for each x ∈ X
and t = 0, 1, ... we obtain

    E_x^φ∞[v₀(x_t)] = ∫_X v₀(y) Q^t(dy|x, φ)
                    ≤ (‖v₀‖ − ε₀) Q^t(C|x, φ) + ε₀.

Therefore, with x as in (12.3.58),

    lim inf_{n→∞} (1/n) Σ_{t=0}^{n-1} Q^t(C|x, φ) ≥ (ε − ε₀)/(‖v₀‖ − ε₀) > 0,   (12.3.60)

which gives part (d) in Theorem 12.2.17 for the transition kernel P(·|x) :=
Q(·|x, φ) on S := X, with ν := δ_x, and the compact set K := C. Thus, part
(a) in Theorem 12.2.17 implies the existence of an i.p.m. μ̄ for Q(·|x, φ),
and if we could show that the p.m. μ(d(x, a)) := φ(da|x) μ̄(dx) is in M_w(𝕂),
then we would have the same conclusion of Corollary 12.3.9, the consistency
of (P), by a quite different approach. Finally, it is worth noting (and easy
to prove) that (12.3.57) and (12.3.58) are also necessary for (P) to be
consistent.

For further comments on, and references related to, Theorem 12.2.17
see Hernández-Lerma and Lasserre [2].  □
12.3.11 Remark. (Absence of duality gap.) Theorem 12.3.4 remains
valid if Assumption 12.3.1 holds, but Assumption 11.4.1 is replaced with:

(a) Assumption 12.3.5 is satisfied;

(b) c(x, a) is inf-compact [see Remark 11.4.2(a2)];

(c) (P) is consistent.

The proof of Theorem 12.3.4 is the same under this new set of hypotheses.  □
Notes on §12.3

1. In subsections A, B, C we essentially followed Hernández-Lerma and
Lasserre [14]. Subsection D comes from Hernández-Lerma and González-
Hernández [1]. The latter reference and also Hordijk and Lasserre [1] deal
with AC problems in the multichain case.

The approach in this section to prove (12.3.4) is quite different from the
approach in Chapter 6, where we used a variant of the "vanishing discount"
approach. Namely, for each discount factor 0 < α < 1 we introduced a linear
program, say (P_α), related to the α-discount Markov control problem, and
then we studied (P) as the "limit" of (P_α) as α ↑ 1.
2. Historically speaking, it is interesting to note that the LP formulation
of the AC problem, as well as of other Markov control problems, was born
trying to solve the corresponding dynamic programming equation, which
in our present case is the Average Cost Optimality Equation (ACOE)

    ρ* + h*(x) = min_{a ∈ A(x)} [c(x, a) + ∫_X h*(y) Q(dy|x, a)],     (12.3.61)

where [by (11.4.7) and (11.1.23)]

    ρ* := inf_{x ∈ X} J*(x) = ρ_min.                                  (12.3.62)

Observe that if h* is a function in B_w₀(X), then the ACOE (12.3.61) implies
that the pair (ρ*, h*) satisfies (12.3.22), that is, (ρ*, h*) is feasible for the
dual program (P*). On the other hand, if (ρ, u) satisfies (12.3.22), then
using straightforward arguments [or using (12.3.23)] one can see that

    ρ ≤ ρ* (= ρ_min).

It is due to the latter inequality that the pairs (ρ, u) that are feasible for
(P*) are also called subsolutions to the ACOE. Thus the equality

    sup(P*) = ρ* (= ρ_min)

in (12.3.4) is sometimes stated in the stochastic control literature by saying
that "ρ* is the supremum of the subsolutions to the ACOE".

See Chapter 6 for early references (going back to about 1960) on the LP
formulation of Markov control problems.
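The subsolution characterization can be checked by hand on a small hypothetical example. For a finite MDP, each deterministic stationary policy f yields a candidate pair (ρ_f, h_f) by solving the Poisson equation ρ + h(s) = c(s, f(s)) + Σ_y Q(y|s, f(s)) h(y) with the normalization h(0) = 0; among these candidates, exactly the feasible ones for (P*), i.e. the subsolutions (12.3.22), have ρ ≤ ρ*, and the largest feasible ρ equals ρ*. The sketch below uses made-up MDP data and is only an illustration of this principle.

```python
from itertools import product

# Hypothetical 2-state, 2-action MDP (all numbers illustrative).
Q = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
c = [[1.0, 3.0], [2.0, 0.5]]

def poisson_solve(f):
    # Solve rho + h(s) = c(s, f(s)) + sum_y Q(y|s, f(s)) h(y), h(0) = 0.
    # With two states this reduces to two linear equations in (rho, h1):
    #   rho - q0*h1 = c0   and   rho + (1 - q1)*h1 = c1,
    # where q_s := Q(1 | s, f(s)) and c_s := c(s, f(s)).
    q0, q1 = Q[0][f[0]][1], Q[1][f[1]][1]
    c0, c1 = c[0][f[0]], c[1][f[1]]
    h1 = (c1 - c0) / (1.0 - q1 + q0)
    rho = c0 + q0 * h1
    return rho, [0.0, h1]

def is_subsolution(rho, h, tol=1e-9):
    # Dual feasibility (12.3.22): rho + h(s) - sum_y Q(y|s,a) h(y) <= c(s,a).
    return all(
        rho + h[s] - sum(Q[s][a][y] * h[y] for y in (0, 1)) <= c[s][a] + tol
        for s in (0, 1) for a in (0, 1)
    )

# Supremum over the feasible candidates: only the policy whose (rho, h)
# solves the ACOE survives the feasibility test.
feasible = [poisson_solve(f) for f in product((0, 1), repeat=2)
            if is_subsolution(*poisson_solve(f))]
rho_star = max(rho for rho, _ in feasible)
print(rho_star)   # approximately 0.75: "rho* is the sup of the subsolutions"
```

Restricting the candidate subsolutions to Poisson-equation pairs is a shortcut that works on this example; in general sup(P*) is taken over all (ρ, u) satisfying (12.3.22).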

12.4 Approximating sequences and strong duality

In the rest of this chapter we are mainly interested in the approximation
of the AC-related linear program (P) and its dual (P*). In this section we
first study minimizing sequences for (P), and then maximizing sequences
for (P*).
A. Minimizing sequences for (P)

By Definition 12.2.12(a), a sequence of measures μ_n in M_w(𝕂)⁺ is said
to be a minimizing sequence for (P) if each μ_n is feasible for (P), that
is, it satisfies (12.3.19), and in addition

    ⟨μ_n, c⟩ ↓ min(P),                                                (12.4.1)



where we have used that (P) is solvable [Theorem 12.3.3(a)] to write its
value as min(P) rather than inf(P).

12.4.1 Theorem. Suppose that Assumptions 11.4.1 and 12.3.1 are satis-
fied. If {μ_n} is a minimizing sequence for (P), then there exists a subse-
quence {j} of {n} such that {μ_j} converges in the weak topology σ(M(𝕂),
C_b(𝕂)) to an optimal solution for (P).
Proof. Let {μ_n} be a minimizing sequence for (P); that is [by (12.3.19)],

    ⟨μ_n, 1⟩ = 1  and  L₁μ_n = 0   ∀n,                                 (12.4.2)

and (12.4.1) holds. In particular, (12.4.1) implies that for any given ε > 0
there exists n(ε) such that

    min(P) ≤ ⟨μ_n, c⟩ ≤ min(P) + ε   ∀n ≥ n(ε).                        (12.4.3)

By the second inequality [together with Assumption 11.4.1(c) and Theo-
rems 12.2.15 and 12.2.16], there exists a p.m. μ* on 𝕂 and a subsequence
{j} of {n} such that

    ⟨μ_j, v⟩ → ⟨μ*, v⟩   ∀v ∈ C_b(𝕂).                                  (12.4.4)

Moreover, by (12.3.37),

    ⟨μ*, c⟩ ≤ lim inf_{j→∞} ⟨μ_j, c⟩ ≤ min(P) + ε.                     (12.4.5)

Thus, as ε was arbitrary, the latter inequality and (12.4.3) yield

    min(P) = ⟨μ*, c⟩.                                                 (12.4.6)

This will prove that μ* is optimal for (P) provided that μ* is feasible for
(P); in other words, provided that μ* is a measure in M_w(𝕂)⁺ and that

    ⟨μ*, 1⟩ = 1  and  L₁μ* = 0.                                        (12.4.7)

This, however, is obvious because (12.4.5) yields ⟨μ*, w⟩ = 1 + ⟨μ*, c⟩ < ∞,
whereas (12.4.7) follows from (12.4.2) and (12.4.4).  □
B. Maximizing sequences for (P*)

By Definition 12.2.12(b) and the definition of the dual program (P*) [see
(12.3.22)], a sequence (ρ_n, u_n) in ℝ × B_w₀(X) is a maximizing sequence for
(P*) if

    ρ_n + u_n(x) ≤ c(x, a) + ∫_X u_n(y) Q(dy|x, a)                     (12.4.8)

for all n and (x, a) ∈ 𝕂, and, in addition,

    ρ_n = ⟨(1, 0), (ρ_n, u_n)⟩ ↑ sup(P*).                              (12.4.9)



The following theorem shows that the existence of a suitable maximizing
sequence for (P*) implies, in particular, that the strong duality condition
for (P) holds [see (12.2.39)].

12.4.2 Theorem. [Solvability of (P*), strong duality and the
ACOE.] Suppose that Assumptions 11.4.1 and 12.3.1 are satisfied, and,
furthermore, there exists a maximizing sequence (ρ_n, u_n) for (P*) with {u_n}
bounded in the w₀-norm, that is,

    ‖u_n‖_w₀ ≤ k   ∀n,                                                 (12.4.10)

for some constant k. Then:

(a) The dual problem (P*) is solvable.

(b) The strong duality condition holds, that is, max(P*) = min(P).

(c) If μ* is an optimal solution for the primal program (P), then the
ACOE (12.3.61) holds μ̄*-a.e., where μ̄* is the marginal of μ* on
X; in fact, there is a function h* in B_w₀(X) and a deterministic
stationary policy f*^∞ such that

    ρ* + h*(x) = min_{a ∈ A(x)} [c(x, a) + ∫_X h*(y) Q(dy|x, a)]      (12.4.11)
               = c(x, f*) + ∫_X h*(y) Q(dy|x, f*)

for μ̄*-almost all x ∈ X.
Proof. (a) By (12.3.62) and Theorem 12.3.4 we have

    sup(P*) = ρ* = min(P)                                             (12.4.12)

and, moreover, we can write (12.4.9) as

    ρ_n ↑ ρ*.                                                         (12.4.13)

Now define the function

    h*(x) := lim sup_{n→∞} u_n(x),

which belongs to B_w₀(X), by (12.4.10). Therefore [by (12.4.13) and Fatou's
Lemma 8.3.7(b)], taking lim sup_n in (12.4.8) we obtain

    ρ* + h*(x) ≤ c(x, a) + ∫_X h*(y) Q(dy|x, a)   ∀(x, a) ∈ 𝕂.

This yields that (ρ*, h*) is feasible for (P*) [see (12.3.22)], which together
with the first equality in (12.4.12) shows that (ρ*, h*) is in fact optimal for
(P*).

(b) This part follows from (a) and (12.4.12).

(c) Let us first note that if μ is feasible for (P) and (ρ, u) is feasible for
(P*), then

    ⟨Lμ, (ρ, u)⟩ = ⟨(1, 0), (ρ, u)⟩ = ρ,

or, equivalently,

    ⟨μ, L*(ρ, u)⟩ = ρ,                                                (12.4.14)

where L* is the adjoint of L, in (12.3.15). Now let μ* be an optimal solution
for (P), and (ρ*, h*) an optimal solution for (P*). By part (b) we have

    ⟨μ*, c⟩ = ρ*,

whereas (12.4.14) gives

    ⟨μ*, L*(ρ*, h*)⟩ = ρ*.

Thus, subtracting the last two equalities we get

    ⟨μ*, c − L*(ρ*, h*)⟩ = 0,

i.e.,

    ∫_𝕂 [c(x, a) − L*(ρ*, h*)(x, a)] μ*(d(x, a)) = 0.                  (12.4.15)

By Lemma 9.4.4 we may disintegrate μ* as μ*(d(x, a)) = φ(da|x) μ̄*(dx)
for some stochastic kernel φ ∈ Φ, and then [using (12.3.15)] we can rewrite
(12.4.15) as

    ∫_X [c(x, φ) − ρ* − h*(x) + ∫_X h*(y) Q(dy|x, φ)] μ̄*(dx) = 0.

Therefore, as the integrand is nonnegative [by (12.3.22)], we get that for
μ̄*-a.a. (almost all) x in X

    ρ* + h*(x) = c(x, φ) + ∫_X h*(y) Q(dy|x, φ)
               = ∫_A [c(x, a) + ∫_X h*(y) Q(dy|x, a)] φ(da|x),

and so

    ρ* + h*(x) ≥ c(x, f*) + ∫_X h*(y) Q(dy|x, f*)   μ̄*-a.a. x ∈ X     (12.4.16)

for some decision function f* ∈ 𝔽, whose existence is guaranteed by Lemma
9.4.7. Finally, as (12.3.22) implies

    ρ* + h*(x) ≤ min_{a ∈ A(x)} [c(x, a) + ∫_X h*(y) Q(dy|x, a)]   for all x ∈ X,

we get that, by (12.4.16), for μ̄*-a.a. x ∈ X

    ρ* + h*(x) ≥ c(x, f*) + ∫_X h*(y) Q(dy|x, f*)
              ≥ min_{a ∈ A(x)} [c(x, a) + ∫_X h*(y) Q(dy|x, a)]
              ≥ ρ* + h*(x),

and (12.4.11) follows.  □

12.4.3 Remark. Theorems 12.4.1 and 12.4.2 remain valid if Assumption
12.3.1 holds, but Assumption 11.4.1 is replaced by the conditions (a), (b),
and (c) in Remark 12.3.11.  □

Notes on §12.4

1. The results in this section are from Hernández-Lerma and González-
Hernández [1], in which similar results for multichain AC-problems are also
obtained.

2. Minimizing sequences and policy iteration. Let f^∞ be a deter-
ministic stationary policy for which the transition law Q_f(·|x) := Q(·|x, f)
admits an i.p.m. μ_f, that is,

    μ_f(B) = ∫_X Q(B|x, f) μ_f(dx)   ∀B ∈ B(X).                        (12.4.17)

Now, for every x ∈ X, let δ_f(x)(·) be the Dirac measure at f(x), and let μ^f
be the p.m. on X × A, concentrated on 𝕂, given by

    μ^f(B × C) := ∫_B δ_f(x)(C) μ_f(dx)

for B and C in B(X) and B(A), respectively. Then the marginal of μ^f on
X is μ̄^f = μ_f, and, on the other hand,

    ⟨μ^f, c⟩ = ∫_X c_f(x) μ_f(dx) = J(f^∞, μ_f),                        (12.4.18)

where c_f(x) := c(x, f) := c(x, f(x)) for all x ∈ X. Finally, let {f_n^∞} be the
sequence of deterministic stationary policies defined by (10.5.17), (10.5.18).
Thus, under the assumptions of Theorem 10.5.2, the sequence {μ^f_n} can be
seen as a minimizing sequence for (P). In particular, observe that (12.4.17)
is the same as the equation after (12.3.21), with φ := f and μ̄ := μ_f.

Similarly, it can be seen that the value iteration procedure in §5.6 gives
a maximizing sequence for (P*).
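On a finite example one can watch the costs ⟨μ^f_n, c⟩ of policy-iteration iterates decrease to min(P). The sketch below uses a made-up 2-state, 2-action MDP; the evaluation step solves the Poisson equation with h(0) = 0 and the improvement step minimizes c(s, a) + Σ_y Q(y|s, a) h(y). This is a generic plausible rendering of the idea, not the book's specific procedure (10.5.17)-(10.5.18).

```python
# Hypothetical 2-state, 2-action MDP (illustrative numbers).
Q = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
c = [[1.0, 3.0], [2.0, 0.5]]

def evaluate(f):
    # Policy evaluation: average cost rho_f and bias h_f with h_f(0) = 0,
    # from rho + h(s) = c(s, f(s)) + sum_y Q(y|s, f(s)) h(y).
    q0, q1 = Q[0][f[0]][1], Q[1][f[1]][1]
    c0, c1 = c[0][f[0]], c[1][f[1]]
    h1 = (c1 - c0) / (1.0 - q1 + q0)
    return c0 + q0 * h1, [0.0, h1]

def improve(h):
    # Policy improvement: f'(s) in argmin_a [c(s,a) + sum_y Q(y|s,a) h(y)].
    return tuple(
        min((0, 1),
            key=lambda a: c[s][a] + sum(Q[s][a][y] * h[y] for y in (0, 1)))
        for s in (0, 1)
    )

f = (1, 0)        # a deliberately bad initial policy
costs = []        # <mu^{f_n}, c> = rho_{f_n}, cf. (12.4.18)
for _ in range(10):
    rho, h = evaluate(f)
    costs.append(rho)
    f_next = improve(h)
    if f_next == f:
        break
    f = f_next

print(costs)      # nonincreasing, ending at min(P)
```

Each iterate f_n induces an occupation measure μ^f_n whose cost is exactly the recorded ρ_{f_n}, so the printed sequence is the minimizing sequence ⟨μ^f_n, c⟩ ↓ min(P) in miniature.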

12.5 Finite LP approximations

We will now show a procedure to approximate the AC-related primal linear
program (P) by finite-dimensional linear programs. For the sake of conti-
nuity in the exposition, in this section we describe the procedure, and the
proof of its convergence is postponed to §12.6.

We will work in essentially the same setting of the previous sections, ex-
cept that now we shall require the spaces X and 𝕂 to be locally compact
separable metric (LCSM) spaces. Hence throughout the following we sup-
pose:

12.5.1 Assumption. Assumptions 11.4.1 and 12.3.1 are satisfied, and in
addition X and 𝕂 are LCSM spaces.

A sufficient condition for 𝕂 to be LCSM is given in Remark 12.3.6(a). On
the other hand, the hypothesis that X and 𝕂 are LCSM spaces ensures that
C₀(X) and C₀(𝕂) are both separable Banach spaces [see Remark 12.2.2(b)].
In particular, C₀(X) contains a countable subset C(X) which is dense in
C₀(X). This is a key fact to proceed with the first step of our approximation
procedure.
A. Aggregation

Let P_w(𝕂) be the family of probability measures (p.m.'s) in M_w(𝕂)⁺,
which, in other words, is the family of measures μ that satisfy the constraint
⟨μ, 1⟩ = μ(𝕂) = 1 in (12.3.19), (12.3.20). Thus we may rewrite (P) as:

(P)   minimize ⟨μ, c⟩

      subject to: L₁μ = 0, μ ∈ P_w(𝕂),                         (12.5.1)

where L₁μ is the signed measure in M_w₀(X) ⊂ M(X) defined by (12.3.13).
We also have:
12.5.2 Lemma. Let C(X) ⊂ C₀(X) be a countable dense subset of C₀(X).
Then the following are equivalent conditions for μ in P_w(𝕂):

(a) L₁μ = 0.

(b) ⟨L₁μ, u⟩ = 0 ∀u ∈ C₀(X).

(c) ⟨L₁μ, u⟩ = 0 ∀u ∈ C(X).

Proof. The equivalence of (a) and (b) is due to the fact that (M(X), C₀(X))
is a dual pair; in fact, M(X) is the topological dual of C₀(X) [Remark
12.2.2(b)]. Finally, the implication (b) ⇒ (c) is obvious, whereas the
converse follows from the denseness of C(X) in C₀(X). ∎
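To spell out the converse (c) ⇒ (b), a standard density argument added here for completeness: given u ∈ C₀(X), choose u_j ∈ C(X) with ‖u − u_j‖_∞ → 0; since L₁μ is a finite signed measure on X,

```latex
\[
  |\langle L_1\mu,\, u\rangle|
  \;\le\; |\langle L_1\mu,\, u_j\rangle|
          + |\langle L_1\mu,\, u - u_j\rangle|
  \;\le\; 0 + \|L_1\mu\|_{TV}\,\|u - u_j\|_{\infty}
  \;\longrightarrow\; 0,
\]
```

so ⟨L₁μ, u⟩ = 0, which is (b).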
By Lemma 12.5.2, we may further rewrite (P) in the equivalent form:

(P)   minimize ⟨μ, c⟩

      subject to: ⟨L₁μ, u⟩ = 0 ∀u ∈ C(X); μ ∈ P_w(𝕂).          (12.5.2)

Observe that (12.5.2) defines an aggregation (of constraints) of (P); see
Definition 12.2.13(a). In other words, the constraint L₁μ = 0 in (12.5.1) is
"aggregated" into countably many constraints ⟨L₁μ, u⟩ = 0 with u in C(X).
We will next reaggregate (12.5.2) into finitely many constraints as follows.
Let {Cₖ} be an increasing sequence of finite sets Cₖ ↑ C(X). For each k,
consider the aggregation

ℙ(Cₖ):   minimize ⟨μ, c⟩

         subject to: ⟨L₁μ, u⟩ = 0 ∀u ∈ Cₖ; μ ∈ P_w(𝕂).         (12.5.3)

This linear program has indeed a finite number of constraints, namely, the
cardinality |Cₖ| of Cₖ. We also have our first approximation result:
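The aggregation step is easy to see on a finite toy model. The following sketch is entirely invented for illustration (a 2-state, 2-action model, not the book's example): it solves the occupation-measure LP once with the indicators of all states (so ℙ(W) is (P) itself) and once with a single aggregated test function, confirming that aggregation can only lower the value, in line with (12.6.2).

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical 2-state, 2-action MCP; all data invented for illustration.
P = {(0, 0): [0.8, 0.2], (0, 1): [0.3, 0.7],
     (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}      # transition law Q(.|x, a)
c = {(0, 0): 1.0, (0, 1): 0.6, (1, 0): 0.2, (1, 1): 1.5}
keys = sorted(P)                                   # ordering of mu(x, a)

def aggregated_lp(test_functions):
    """min <mu, c> s.t. <L_1 mu, u> = 0 for each u, <mu, 1> = 1, mu >= 0."""
    # <L_1 mu, u> = sum_{x,a} mu(x, a) * [u(x) - sum_y u(y) Q(y|x, a)]
    A_eq = [[u[x] - np.dot(u, P[(x, a)]) for (x, a) in keys]
            for u in test_functions]
    A_eq.append([1.0] * len(keys))                 # normalization <mu, 1> = 1
    b_eq = [0.0] * len(test_functions) + [1.0]
    res = linprog([c[k] for k in keys], A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    assert res.status == 0
    return res.fun

# Indicators of every state give (P) itself; a single aggregated test
# function enlarges the feasible set, so min P(W) <= min(P).
full = aggregated_lp([[1.0, 0.0], [0.0, 1.0]])
agg = aggregated_lp([[1.0, -1.0]])
print(round(agg, 4), round(full, 4))
```

Here the full program's optimum is attained at a deterministic stationary policy, as the theory predicts for finite models.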
12.5.3 Theorem. Suppose that Assumption 12.5.1 is satisfied. Then

(a) ℙ(Cₖ) is solvable for each k = 1, 2, ...; in fact, the aggregation ℙ(W)
    is solvable for any subset W of C₀(X).

(b) For each k = 1, 2, ..., let μₖ be an optimal solution for ℙ(Cₖ), i.e.,

        ⟨μₖ, c⟩ = min ℙ(Cₖ) for k = 1, 2, ....

    Then
        ⟨μₖ, c⟩ ↑ min(P) = ρ_min,                              (12.5.4)

    where the equality is due to Theorem 12.3.3(a). Furthermore, there
    is a subsequence {μₘ} of {μₖ} that converges in the weak topology
    σ(M(𝕂), C_b(𝕂)) to an optimal solution μ* for (P), i.e.,

        ⟨μₘ, v⟩ → ⟨μ*, v⟩ ∀v ∈ C_b(𝕂);                         (12.5.5)

    in fact, any weak-σ(M(𝕂), C_b(𝕂)) accumulation point of {μₖ} is an
    optimal solution for (P).

Proof. See §12.6.


B. Aggregation-relaxation

The equality constraint ⟨L₁μ, u⟩ = 0 in (12.5.3) will now be "relaxed" to
inequalities of the form |⟨L₁μ, u⟩| ≤ ε with ε > 0.
Let Cₖ ↑ C(X) be as in (12.5.3), and let {εₖ} be a sequence of numbers
εₖ ↓ 0. For each k = 1, 2, ..., consider the linear program

ℙ(Cₖ, εₖ):   minimize ⟨μ, c⟩

             subject to: |⟨L₁μ, u⟩| ≤ εₖ ∀u ∈ Cₖ; μ ∈ P_w(𝕂).  (12.5.6)

12.5.4 Remark. If ε > 0 and I ⊂ C₀(X) is a finite subset of C₀(X), then
[by (12.2.10)] the set

    N(I, ε) := {ν ∈ M(X) : |⟨ν, u⟩| ≤ ε ∀u ∈ I}

defines a (closed) weak, actually weak*, neighborhood of the "origin"
(that is, the null measure) in M(X). In particular, if we take ε and I
as εₖ and Cₖ, respectively, then the constraint (12.5.6) states that L₁μ is
in the weak* neighborhood N(Cₖ, εₖ), i.e.,

    L₁μ ∈ N(Cₖ, εₖ).                                           (12.5.7)

This provides a natural interpretation of ℙ(Cₖ, εₖ) as an approximation of
the original program (P) in the weak* topology σ(M(X), C₀(X)). ∎
The following result states that Theorem 12.5.3 remains basically
unchanged when ℙ(Cₖ) is replaced by ℙ(Cₖ, εₖ).
12.5.5 Theorem. Suppose that Assumption 12.5.1 is satisfied. Then

(a) ℙ(Cₖ, εₖ) is solvable for each k = 1, 2, ....

(b) If μₖ is an optimal solution for ℙ(Cₖ, εₖ), i.e.,

        ⟨μₖ, c⟩ = min ℙ(Cₖ, εₖ) for k = 1, 2, ...,

    then {μₖ} satisfies the same conclusion as Theorem 12.5.3(b); in
    particular,
        ⟨μₖ, c⟩ ↑ min(P) = ρ_min.                              (12.5.8)

Proof. See §12.6.


C. Aggregation-relaxation-inner approximations

The programs ℙ(Cₖ) and ℙ(Cₖ, εₖ) have a finite number of constraints and
give "nice" approximation results (Theorems 12.5.3 and 12.5.5). However,
they are still not good enough for our present purpose because the "decision
variable" μ lies in the infinite-dimensional space M_w(𝕂) ⊂ M(𝕂). (For the
latter spaces to be finite-dimensional we would need the state and action
sets, X and A, to both be finite sets.) Now to obtain finite-dimensional
approximations of (P) we will combine ℙ(Cₖ, εₖ) with a suitable sequence
of inner approximations [see Definition 12.2.13(b)]. These are based on the
following well-known result (for a proof see, for instance, Billingsley [1, p.
237, Theorem 4] or Parthasarathy [1, p. 44, Theorem 6.3]). We shall use
the notation introduced in Remark 12.2.1(b).

12.5.6 Proposition. [Existence of a weakly dense set in P(S).] Let
S be a separable metric space and D ⊂ S a countable dense subset of S.
Then the family of p.m.'s whose supports are finite subsets of D is dense
in P(S) in the weak topology σ(M(S), C_b(S)).
We will now apply Proposition 12.5.6 to the space S := 𝕂. Let D ⊂ 𝕂
be a countable dense subset of 𝕂, and let {Dₙ} be an increasing sequence
of finite sets Dₙ ↑ D. For each n = 1, 2, ..., let Δₙ := P(Dₙ) be the family
of p.m.'s on Dₙ; that is, an element of Δₙ is a convex combination of the
Dirac measures concentrated at points of Dₙ. Then, as Dₙ ↑ D, the sets
Δₙ form an increasing sequence (of sets of p.m.'s) whose limit

        Δ := ∪_{n=1}^∞ Δₙ                                      (12.5.9)

is dense in P(𝕂) in the weak topology σ(M(𝕂), C_b(𝕂)); that is, for each
p.m. μ in P(𝕂), there is a sequence {νₖ} in Δ such that

        ⟨νₖ, v⟩ → ⟨μ, v⟩ ∀v ∈ C_b(𝕂).                          (12.5.10)
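Proposition 12.5.6 is easy to check numerically. A minimal sketch with invented data (not from the book): on S = [0, 1], take μ to be the p.m. with density 2x and approximate it by p.m.'s supported on the dyadic grid Dₙ = {j/2ⁿ}, each grid point carrying the μ-mass of the cell to its right; the integrals of a bounded continuous v then converge as in (12.5.10).

```python
import numpy as np

# Toy check of Proposition 12.5.6: S = [0, 1], mu with density 2x,
# D the dyadic rationals, D_n = {j / 2**n : j = 0, ..., 2**n}.
def finitely_supported(n):
    """p.m. nu_n on D_n: each grid point carries the mu-mass of its cell."""
    pts = np.arange(2 ** n + 1) / 2 ** n
    weights = pts[1:] ** 2 - pts[:-1] ** 2     # mu([a, b]) = b**2 - a**2
    return pts[:-1], weights

v = lambda x: x ** 2                            # a bounded continuous function
exact = 0.5                                     # integral of x**2 * 2x dx on [0, 1]
errs = []
for n in (2, 5, 8):
    supp, w = finitely_supported(n)
    errs.append(abs(float(np.dot(w, v(supp))) - exact))
print(errs)   # the error shrinks as the support becomes denser
```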
Let us now consider a linear program as ℙ(Cₖ, εₖ) except that the p.m.'s
μ in (12.5.6) are replaced by p.m.'s in Δₙ ∩ P_w(𝕂). That is, instead of
ℙ(Cₖ, εₖ) consider the finite program

ℙ(Cₖ, εₖ, Δₙ):   minimize ⟨μ, c⟩

                 subject to: |⟨L₁μ, u⟩| ≤ εₖ ∀u ∈ Cₖ;
                             μ ∈ Δₙ ∩ P_w(𝕂).                  (12.5.11)

This is indeed a finite linear program because it has a finite number |Cₖ| of
constraints, and a finite number |Dₙ| of "decision variables", namely, the
coefficients of a measure in Δₙ ∩ P_w(𝕂).
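As an illustration, here is a minimal sketch of ℙ(Cₖ, εₖ, Δₙ); every ingredient of the model is invented (it is not the book's example): X = A = [0, 1], a kernel moving x deterministically to (x + a)/2, cost c(x, a) = (x − 0.5)² + a², a uniform grid Dₙ ⊂ 𝕂 = [0, 1]², three test functions in Cₖ, and tolerance εₖ = 10⁻³, solved with scipy.optimize.linprog.

```python
import numpy as np
from scipy.optimize import linprog

# Invented toy model: Q(.|x, a) = Dirac measure at (x + a)/2.
def cost(x, a):
    return (x - 0.5) ** 2 + a ** 2

def next_state(x, a):
    return 0.5 * (x + a)

# Inner approximation Delta_n: measures supported on a finite grid D_n.
grid = [(x, a) for x in np.linspace(0, 1, 6) for a in np.linspace(0, 1, 6)]

# Aggregation C_k: finitely many test functions; relaxation level eps_k.
tests = [lambda y: y, lambda y: y ** 2, lambda y: np.cos(np.pi * y)]
eps = 1e-3

# <L_1 mu, u> = sum_i mu_i * [u(x_i) - u(y_i)], y_i the successor of (x_i, a_i).
A = np.array([[u(x) - u(next_state(x, a)) for (x, a) in grid] for u in tests])
c_vec = np.array([cost(x, a) for (x, a) in grid])

# |<L_1 mu, u>| <= eps becomes two one-sided inequalities per test function.
A_ub = np.vstack([A, -A])
b_ub = np.full(2 * len(tests), eps)
A_eq = np.ones((1, len(grid)))          # <mu, 1> = 1: mu is a p.m.
res = linprog(c_vec, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
              bounds=(0, None))
print(res.status, round(res.fun, 4))
```

Refining the grid, enlarging the set of test functions, and letting εₖ ↓ 0 then reproduces the scheme whose convergence is Theorem 12.5.7.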
The corresponding approximation result is as follows.
12.5.7 Theorem. [Finite approximations for (P).] If Assumption
12.5.1 is satisfied then:

(a) For each k = 1, 2, ..., there exists n(k) such that, for all n ≥ n(k),
    the finite linear program ℙ(Cₖ, εₖ, Δₙ) is solvable and

        min ℙ(Cₖ, εₖ, Δₙ) ≥ min ℙ(Cₖ, εₖ).                     (12.5.12)

(b) Suppose that, in addition, the cost-per-stage function c(x, a) is
    continuous. Then for each k = 1, 2, ... there exists n*(k) such that

        min ℙ(Cₖ, εₖ, Δₙ) ≤ min(P) + εₖ ∀n ≥ n*(k);            (12.5.13)

    hence [by (12.5.12) and (12.5.8)]

        lim_{k→∞} lim_{n→∞} min ℙ(Cₖ, εₖ, Δₙ) = min(P),        (12.5.14)

where of course the limit is taken over values of n ≥ n*(k). Moreover,
if μₖₙ [for k ≥ 1, n ≥ n*(k)] is an optimal solution for ℙ(Cₖ, εₖ, Δₙ),
then every weak accumulation point of {μₖₙ} is an optimal solution
for (P).

Proof. See §12.6.


Notes on §12.5

The notes for this section are given at the end of §12.6.

12.6 Proof of Theorems 12.5.3, 12.5.5, 12.5.7


In this section we prove Theorems 12.5.3, 12.5.5 and 12.5.7.
Proof of Theorem 12.5.3. (a) Let W be an arbitrary subset of C₀(X),
and in (12.5.3) replace Cₖ by W. This yields the following aggregation of
(P):

ℙ(W):   minimize ⟨μ, c⟩

        subject to: ⟨L₁μ, u⟩ = 0 ∀u ∈ W, μ ∈ P_w(𝕂).           (12.6.1)

We wish to show that ℙ(W) is solvable. First note that as (P) is consistent
[Theorem 12.3.3(a)], so is ℙ(W). More explicitly, there exists a p.m. μ that
satisfies (12.5.1), which, by Lemma 12.5.2(b), yields that μ satisfies (12.6.1).
Moreover, as (12.5.1) implies (12.6.1), we have

        0 ≤ inf ℙ(W) ≤ min(P),                                 (12.6.2)

where the first inequality holds because c ≥ 0. Now let {μₙ} be a minimizing
sequence for ℙ(W); that is, each μₙ satisfies (12.6.1) and

        ⟨μₙ, c⟩ ↓ inf ℙ(W).                                    (12.6.3)

Thus, by (12.6.2), there exist constants M and N such that

        ⟨μₙ, w⟩ = 1 + ⟨μₙ, c⟩ ≤ M ∀n ≥ N.

Therefore, applying Theorem 12.2.15 and then Prohorov's Theorem 12.2.6,
there is a subsequence {μₘ} of {μₙ} and a p.m. μ on 𝕂 such that

        ⟨μₘ, v⟩ → ⟨μ, v⟩ ∀v ∈ C_b(𝕂).                          (12.6.4)

This implies [as in (12.3.37)]

        lim inf_{m→∞} ⟨μₘ, c⟩ ≥ ⟨μ, c⟩,                        (12.6.5)

and so, by (12.6.3),

        ⟨μ, c⟩ ≤ inf ℙ(W).

It follows that to prove that ℙ(W) is solvable, it suffices to show that μ is
feasible for ℙ(W), that is,

   (i) ⟨L₁μ, u⟩ = 0 ∀u ∈ W, and (ii) μ is in P_w(𝕂),           (12.6.6)

which would yield the reverse inequality ⟨μ, c⟩ ≥ inf ℙ(W). To prove (12.6.6)
observe that the condition (ii) is obvious because

        ⟨μ, w⟩ = 1 + ⟨μ, c⟩ < ∞

by (12.6.5) and the definition of w in (12.3.5). Concerning (i), first note that,
by Assumption 11.4.1(d), L₁ maps C_b(X) into C_b(𝕂) and, consequently,
(12.6.4) implies that

        ⟨L₁μₘ, u⟩ → ⟨L₁μ, u⟩ ∀u ∈ C_b(X),                      (12.6.7)

because

        ⟨L₁μₘ, u⟩ = ⟨μₘ, L₁u⟩ → ⟨μ, L₁u⟩ = ⟨L₁μ, u⟩.           (12.6.8)

In particular, (12.6.7) holds for all u in W ⊂ C₀(X) ⊂ C_b(X), which implies
(i) since ⟨L₁μₘ, u⟩ = 0 for all u ∈ W.
Thus, as ℙ(W) is solvable for any subset W of C₀(X), it follows that
ℙ(Cₖ) is solvable for each k.
(b) For each k, let μₖ be an optimal solution for ℙ(Cₖ), that is, μₖ satisfies
(12.5.3) and

        ⟨μₖ, c⟩ = min ℙ(Cₖ) ≤ min(P),

where the second inequality follows from (12.6.2). Furthermore, as the
sequence Cₖ is increasing, so is the sequence of values ⟨μₖ, c⟩. Therefore,
there is a number ρ such that

        ⟨μₖ, c⟩ ↑ ρ, and ρ ≤ min(P).                           (12.6.9)

Thus, as ⟨μₖ, c⟩ ≤ ρ for all k, the same arguments used to obtain (12.6.4)
and (12.6.5) yield a subsequence {μₘ} of {μₖ} and a p.m. μ on 𝕂 such
that

        ⟨μₘ, v⟩ → ⟨μ, v⟩ ∀v ∈ C_b(𝕂),                          (12.6.10)

and
        lim inf_{m→∞} ⟨μₘ, c⟩ ≥ ⟨μ, c⟩.

In fact, as ⟨μₘ, c⟩ satisfies (12.6.9), we have

        ⟨μ, c⟩ ≤ lim_{m→∞} ⟨μₘ, c⟩ = ρ ≤ min(P),

so that
        ⟨μ, c⟩ ≤ min(P).                                       (12.6.11)

Therefore, if we can show that μ is a feasible solution for (P), then we shall
have that ⟨μ, c⟩ ≥ min(P), which combined with (12.6.11) will give that μ
is optimal for (P), that is, ⟨μ, c⟩ = min(P). Now to prove that μ is feasible
for (P) we need to check that the p.m. μ satisfies (12.5.1); equivalently, by
Lemma 12.5.2(c), we need to check

   (i) ⟨L₁μ, u⟩ = 0 ∀u ∈ C(X), (ii) μ is in P_w(𝕂).            (12.6.12)

The condition (ii) follows from (for instance) (12.6.11) and the definition
of w in (12.3.5), that is, ⟨μ, w⟩ = 1 + ⟨μ, c⟩ < ∞. To prove (i) recall first
that C(X) is the limit of the increasing sequence {Cₖ}, i.e.,

        C(X) = ∪_{k=1}^∞ Cₖ.

Therefore, if u is in C(X), then there exists N such that u is in Cₖ for all
k ≥ N, and so the subsequence {μₘ} in (12.6.10) satisfies

        ⟨L₁μₘ, u⟩ = 0 ∀m ≥ N,

because μₘ is feasible for ℙ(Cₘ). Finally [arguing as in (12.6.7), (12.6.8)],
from (12.6.10) we obtain

        ⟨L₁μ, u⟩ = 0,

which implies (12.6.12)(i) since u was an arbitrary function in C(X). This
completes the proof that μ is feasible for (P), which, as was already
mentioned, yields that μ is an optimal solution for (P). ∎
Proof of Theorem 12.5.5. Many of the arguments in this proof are very
similar to those in the proof of Theorem 12.5.3, and, consequently, in several
places we only sketch the main facts.
(a) If a measure μ satisfies (12.5.3), then it obviously satisfies (12.5.6).
This fact and Theorem 12.5.3(a) yield

        inf ℙ(Cₖ, εₖ) ≤ min ℙ(Cₖ) ≤ min(P).                    (12.6.13)

Now fix k, and let {μₙ} be a minimizing sequence for ℙ(Cₖ, εₖ). Then, as
in (12.6.3)-(12.6.5), there exists a subsequence {μₘ} of {μₙ} and a p.m. μ
on 𝕂 such that

        ⟨μₘ, v⟩ → ⟨μ, v⟩ ∀v ∈ C_b(𝕂),                          (12.6.14)

and
        ⟨μ, c⟩ ≤ lim inf_{m→∞} ⟨μₘ, c⟩ = inf ℙ(Cₖ, εₖ).        (12.6.15)

Thus, to complete the proof of part (a) it suffices to show that μ is feasible
for ℙ(Cₖ, εₖ), since this would yield the reverse inequality in (12.6.15). On
the other hand, since (12.6.15) implies ⟨μ, w⟩ = 1 + ⟨μ, c⟩ < ∞, to show
that μ satisfies (12.5.6) it only remains to prove that

        |⟨L₁μ, u⟩| ≤ εₖ ∀u ∈ Cₖ.                               (12.6.16)

To prove this, observe that

        |⟨L₁μ, u⟩| ≤ |⟨L₁μ, u⟩ − ⟨L₁μₘ, u⟩| + |⟨L₁μₘ, u⟩|
                   ≤ |⟨L₁μ, u⟩ − ⟨L₁μₘ, u⟩| + εₖ ∀u ∈ Cₖ,      (12.6.17)

because each μₘ satisfies (12.5.6). Thus letting m → ∞ in the latter
inequality we obtain (12.6.16) since [as in (12.6.7), (12.6.8)] the weak
convergence (12.6.14) implies

        ⟨L₁μₘ, u⟩ → ⟨L₁μ, u⟩ ∀u ∈ C_b(X).


(b) For every k = 1, 2, ..., let μₖ be an optimal solution for ℙ(Cₖ, εₖ).
Clearly the sequence of optimal values ⟨μₖ, c⟩ = min ℙ(Cₖ, εₖ) is
nondecreasing, which combined with (12.6.13) implies [as in (12.6.9)-(12.6.11)]
the existence of a number ρ ≤ min(P), a subsequence {μₘ} of {μₖ}, and a
p.m. μ on 𝕂 such that μₘ → μ weakly [as in (12.6.10)] and

        ⟨μ, c⟩ ≤ lim_{m→∞} ⟨μₘ, c⟩ = ρ ≤ min(P).

Hence to prove that μ is optimal for (P) it suffices to show that μ is feasible
for (P), which would yield ⟨μ, c⟩ ≥ min(P). In fact, as it is evident that μ is
in P_w(𝕂) [i.e., ⟨μ, w⟩ = 1 + ⟨μ, c⟩ < ∞], to show that μ satisfies (12.5.1) it
only remains to prove that L₁μ = 0, or equivalently [by Lemma 12.5.2(c)],
that
        ⟨L₁μ, u⟩ = 0 ∀u ∈ C(X).

This, however, can be proved almost exactly as (12.6.12)(i) [simply replace
the equality "⟨L₁μₘ, u⟩ = 0 ∀m ≥ N" by the inequality "|⟨L₁μₘ, u⟩| ≤ εₘ
∀m ≥ N"], and so we shall omit the details. ∎
Proof of Theorem 12.5.7. (a) Fix an arbitrary k ≥ 1. Observe that if
ℙ(Cₖ, εₖ, Δₙ) is solvable for some n, then the inequality (12.5.12) trivially
follows from (12.5.11) and (12.5.6), because Δₙ ∩ P_w(𝕂) ⊂ P_w(𝕂). Thus
to prove part (a) we shall concentrate on the first statement, the solvability
of ℙ(Cₖ, εₖ, Δₙ) for all n ≥ n(k).
Let μ* be an optimal solution for (P); that is, μ* satisfies

        ⟨μ*, c⟩ = min(P) = ρ_min                               (12.6.18)

and [by (12.5.1) and Lemma 12.5.2(c)]

        ⟨L₁μ*, u⟩ = 0 ∀u ∈ C(X), and μ* is in P_w(𝕂).          (12.6.19)

Now, as the family Δ in (12.5.9) is weakly dense in P(𝕂) [in the weak
topology σ(M(𝕂), C_b(𝕂))] and

        Δ ∩ P_w(𝕂) ⊂ P_w(𝕂) ⊂ P(𝕂),

we see that Δ ∩ P_w(𝕂) is weakly dense in P_w(𝕂). Hence, there is a sequence
{μⱼ} in Δ ∩ P_w(𝕂) such that

        ⟨μⱼ, v⟩ → ⟨μ*, v⟩ ∀v ∈ C_b(𝕂).

This implies [as in (12.6.7), (12.6.8)]

        ⟨L₁μⱼ, u⟩ → ⟨L₁μ*, u⟩ ∀u ∈ C_b(X);

in particular, by (12.6.19),

        ⟨L₁μⱼ, u⟩ → ⟨L₁μ*, u⟩ = 0 ∀u ∈ Cₖ                      (12.6.20)

because Cₖ ⊂ C(X) ⊂ C₀(X) ⊂ C_b(X). Therefore, as Cₖ is a finite set,
letting εₖ be as in (12.5.6), we see from (12.6.20) that there exists j(k)
such that

        |⟨L₁μⱼ, u⟩| ≤ εₖ ∀u ∈ Cₖ and j ≥ j(k).                 (12.6.21)

That is, for all j ≥ j(k), the measure μⱼ ∈ Δ ∩ P_w(𝕂) is feasible for
ℙ(Cₖ, εₖ). On the other hand, as Δ is the limit of the increasing sequence
Δₙ [see (12.5.9)], there exists n(k) such that μ_{j(k)} is in

        Δ_{n(k)} ∩ P_w(𝕂) ⊂ Δₙ ∩ P_w(𝕂) ∀n ≥ n(k).

It follows from (12.6.21) and (12.5.11) that ℙ(Cₖ, εₖ, Δₙ) is consistent for
all n ≥ n(k), which in turn implies that ℙ(Cₖ, εₖ, Δₙ) is solvable for all
n ≥ n(k), because it is a finite linear program with a finite value (≥ 0); see
any reference on elementary (finite-dimensional) LP. This completes the
proof of part (a).
(b) As in the proof of part (a), fix an arbitrary k ≥ 1, and let μ* be
an optimal solution for (P) [see (12.6.18), (12.6.19)]. Moreover, for each
j = 1, 2, ..., consider the set

        Eⱼ := {(x, a) ∈ 𝕂 : c(x, a) ≤ j}.

As c ≥ 0 is l.s.c. and strictly unbounded [by Assumption 11.4.1(b), (c)], Eⱼ
is a closed set contained in some compact set; hence, Eⱼ itself is compact.
Now let μⱼ be the p.m. on 𝕂 defined by

        μⱼ(B) := μ*(B ∩ Eⱼ)/μ*(Eⱼ) for B ∈ B(𝕂),

which of course is well defined [that is, μ*(Eⱼ) > 0] for all j sufficiently
large. Furthermore, as

        μ*(Eⱼ) ↑ μ*(𝕂) = 1 as j ↑ ∞,                           (12.6.22)

μⱼ converges weakly to μ* in the weak topology σ(M(𝕂), C_b(𝕂)), i.e.,

        ⟨μⱼ, v⟩ = μ*(Eⱼ)⁻¹ ∫_{Eⱼ} v dμ* → ⟨μ*, v⟩ ∀v ∈ C_b(𝕂). (12.6.23)

Thus [as in (12.6.7), (12.6.8)]

        ⟨L₁μⱼ, u⟩ → ⟨L₁μ*, u⟩ ∀u ∈ C_b(X).

This implies [as in (12.6.20)]

        ⟨L₁μⱼ, u⟩ → 0 ∀u ∈ Cₖ.

Therefore, as Cₖ is a finite set, there exists j₁(k) such that

        |⟨L₁μⱼ, u⟩| ≤ εₖ/2 ∀u ∈ Cₖ and j ≥ j₁(k).              (12.6.24)

On the other hand, by (12.6.22) and the definition of μⱼ,

        ⟨μⱼ, c⟩ → ⟨μ*, c⟩ = min(P).

Hence, there exists j₂(k) such that

        ⟨μⱼ, c⟩ ≤ min(P) + εₖ/2 ∀j ≥ j₂(k).                    (12.6.25)

Now fix j₀ ≥ max{j₁(k), j₂(k)}, and let P(E_{j₀}) be the family of p.m.'s on
𝕂 concentrated on E_{j₀}. Then μ_{j₀} is a p.m. in P(E_{j₀}) and satisfies
(12.6.24) and (12.6.25). We now wish to approximate μ_{j₀} by a suitable
sequence {νₙ} in Δ ∩ P(E_{j₀}).
Let D ⊂ 𝕂 be the countable dense subset in the definition of ℙ(Cₖ, εₖ, Δₙ).
Then D ∩ E_{j₀} is dense in E_{j₀}, and so, by Proposition 12.5.6,

        Δ ∩ P(E_{j₀})

is dense in P(E_{j₀}) in the weak topology σ(M(𝕂), C_b(𝕂)). Therefore, there
exists a sequence {νₙ} in Δ ∩ P(E_{j₀}) such that

        ⟨νₙ, v⟩ → ⟨μ_{j₀}, v⟩ ∀v ∈ C_b(𝕂),                     (12.6.26)

and [as in (12.6.7), (12.6.8)]

        ⟨L₁νₙ, u⟩ → ⟨L₁μ_{j₀}, u⟩ ∀u ∈ C_b(X).

In particular, as Cₖ ⊂ C_b(X) is a finite set, there exists n₁(k) such that

        |⟨L₁νₙ, u⟩ − ⟨L₁μ_{j₀}, u⟩| ≤ εₖ/2 ∀u ∈ Cₖ and n ≥ n₁(k).

Combining this fact with (12.6.24), taking j = j₀, we obtain

        |⟨L₁νₙ, u⟩| ≤ εₖ ∀u ∈ Cₖ and n ≥ n₁(k).                (12.6.27)

On the other hand, since the restriction of c to E_{j₀} is a continuous bounded
function, (12.6.26) yields

        ⟨νₙ, c⟩ → ⟨μ_{j₀}, c⟩.

From this fact and (12.6.25), with j = j₀, there exists n₂(k) such that

        ⟨νₙ, c⟩ ≤ min(P) + εₖ ∀n ≥ n₂(k),                      (12.6.28)

which in particular implies that νₙ is in P_w(𝕂), that is,

        ⟨νₙ, w⟩ = 1 + ⟨νₙ, c⟩ < ∞.                             (12.6.29)

Finally, to verify that νₙ satisfies (12.5.11), note that since each νₙ is in
the set Δ ∩ P(E_{j₀}) ⊂ Δ, (12.6.29) and (12.5.9) imply the existence of n₃
such that

        νₙ ∈ Δₙ ∩ P_w(𝕂) ∀n ≥ n₃.                              (12.6.30)

To summarize, define n*(k) := max{n₁(k), n₂(k), n₃}. Then (12.6.30) and
(12.6.27) give that νₙ is a feasible solution for ℙ(Cₖ, εₖ, Δₙ) for all n ≥
n*(k), which together with (12.6.28) and part (a) yields (12.5.13) and
(12.5.14).
It only remains to prove the last statement in (b). To do this, for each k ≥
1 fix n ≥ n*(k) and let μₖ := μₖₙ be an optimal solution for ℙ(Cₖ, εₖ, Δₙ),
that is, μₖ satisfies (12.5.11) and

        ⟨μₖ, c⟩ = min ℙ(Cₖ, εₖ, Δₙ).

Then, by (12.5.14) and the argument used to obtain (12.6.4) and (12.6.5),
there exists a subsequence {μₘ} of {μₖ} and a p.m. μ on 𝕂 such that

        ⟨μₘ, v⟩ → ⟨μ, v⟩ ∀v ∈ C_b(𝕂),                          (12.6.31)

and
        min(P) = lim inf_{m→∞} ⟨μₘ, c⟩ ≥ ⟨μ, c⟩,               (12.6.32)

which in particular gives that μ is in P_w(𝕂). Thus, to conclude that μ is an
optimal solution for (P), it only remains to check that μ satisfies L₁μ = 0
in (12.5.1), or, equivalently [by Lemma 12.5.2(c)], that

        ⟨L₁μ, u⟩ = 0 ∀u ∈ C(X).                                (12.6.33)

This would yield that μ is feasible for (P), so that ⟨μ, c⟩ ≥ min(P), which
together with (12.6.32) would show that μ is optimal for (P). To prove
(12.6.33), first note that (12.6.31) implies [as in (12.6.7), (12.6.8)]

        ⟨L₁μₘ, u⟩ → ⟨L₁μ, u⟩ ∀u ∈ C_b(X).                      (12.6.34)

Now fix an arbitrary function u in C(X) and note that, as Cₖ ↑ C(X), there
exists N such that u is in Cₖ for all k ≥ N. Therefore, by (12.5.11),

        |⟨L₁μₘ, u⟩| ≤ εₘ → 0 as m → ∞,

which combined with (12.6.34) gives ⟨L₁μ, u⟩ = 0. Thus, as u ∈ C(X) was
arbitrary, (12.6.33) follows. ∎
Notes on §12.5 and §12.6

1. Sections 12.5 and 12.6 are based on Hernandez-Lerma and Lasserre
[16]. A similar approach, combining aggregations, relaxations and inner
approximations, can be used to approximate general (not necessarily MCP-
related) infinite linear programs, as in Hernandez-Lerma and Lasserre [15].
These two papers provide many related references.
2. The approximation schemes in §12.5 are somewhat similar in spirit to
schemes proposed by Vershik [1] and Vershik and Temel't [1], but with a
basic difference. Namely, we use weak and weak* topologies (see Remark
12.5.4 and Proposition 12.5.6), whereas Vershik and Temel't use stronger,
for instance normed, topologies. This is a key fact because we only need
"reasonable" things, for example (12.6.4) and (12.6.7), whereas their context
would require convergence in the total variation norm, which is obviously
too restrictive. For instance, for an uncountable metric space, the density
result in Proposition 12.5.6, with finitely supported measures, is, in general,
virtually impossible to get in the total variation norm.
Finally, it is worth noting that the approach in §12.5 can be used to
approximately compute an i.p.m. for a noncontrolled Markov chain on a
LCSM space whose transition kernel satisfies the (weak) Feller condition in
Assumption 11.4.1(d). The idea would be to introduce an "artificial" MCP
with a singleton control set A and with a continuous "cost" function that
satisfies the hypotheses of Theorem 12.5.7.
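For a chain with finitely many states the "artificial MCP" idea becomes elementary linear algebra. A minimal sketch (the transition matrix below is invented for illustration): with a singleton action set, the constraint L₁μ = 0 in (12.5.1) reduces to invariance, μ = μP, together with ⟨μ, 1⟩ = 1.

```python
import numpy as np

# Hypothetical noncontrolled chain on three states; solving the linear
# system mu (I - P) = 0, sum(mu) = 1 gives the invariant p.m.
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])
n = P.shape[0]
A = np.vstack([(np.eye(n) - P).T, np.ones(n)])   # invariance + normalization
b = np.zeros(n + 1)
b[-1] = 1.0
mu = np.linalg.lstsq(A, b, rcond=None)[0]        # least squares of a consistent system
print(np.round(mu, 4))
```

For an uncountable state space one would instead solve the finite programs ℙ(Cₖ, εₖ, Δₙ) with a zero (or any continuous) cost, as the note suggests.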
References

Alden, J. M. and Smith, R. L.


[1] Rolling horizon procedures in nonhomogeneous Markov decision pro-
cesses. Oper. Res. 40 (1992), 183-194.
Altman, E.
[1] Constrained Markov Decision Processes. Chapman & Hall/CRC, Boca
Raton, FL, to appear.
Anderson, E. J. and Nash, P.
[1] Linear Programming in Infinite-Dimensional Spaces. Wiley, Chich-
ester, U.K., 1987.
Ash, R. B.
[1] Real Analysis and Probability. Academic Press, New York, 1972.
Balder, E.
[1] Lectures on Young Measures, Cahiers de mathematiques de la decision.
CEREMADE, Universite Paris IX-Dauphine, Paris, 1995.
Baykal-Gursoy, M. and Ross, K. W.
[1] Variability sensitive Markov decision processes. Math. Oper. Res. 17
(1992), 558-571.
Bensoussan, A.
[1] Stochastic control in discrete time and applications to the theory of
production. Math. Programm. Study 18 (1982),43-60.

Bertsekas, D.P.
[1] Dynamic Programming: Deterministic and Stochastic Models. Pren-
tice-Hall, Englewood Cliffs, N. J., 1987.
Bertsekas, D. P. and Shreve, S. E.
[1] Stochastic Optimal Control: The Discrete Time Case. Academic Press,
New York, 1978.
Bes, C. and Lasserre, J. B.
[1] An on-line procedure in discounted infinite-horizon stochastic optimal
control. J. Optim. Theory Appl. 50 (1986),61-67.
Bes, C. and Sethi, S. P.
[1] Concepts of forecast and decision horizons: applications to dynamic
stochastic optimization problems. Math. Oper. Res. 13 (1988), 295-
310.
Bhattacharya, R. N. and Majumdar, M.
[1] Controlled semi-Markov models-the discounted case. J. Statist. Plann.
and Inference 21 (1989), 365-381.
Billingsley, P.
[1] Convergence of Probability Measures. Wiley, New York, 1968.
Blackwell, D.
[1] Memoryless strategies in finite-stage dynamic programming. Ann. Math.
Statist. 35 (1964), 863-865.
Bourbaki, N.
[1] Integration, Chap. IX. Hermann, Paris, 1969.
Brezis, H.
[1] Analyse Fonctionnelle: Theorie et Applications, 4e tirage. Masson,
Paris, 1993.
Brown, B. W.
[1] On the iterative method of dynamic programming on a finite space
discrete time Markov process. Ann. Math. Statist. 33 (1965), 719-726.
Cavazos-Cadena, R.
[1] Finite-state approximations to denumerable discounted Markov deci-
sion processes. Appl. Math. Optim. 14 (1986), 1-26.
Cavazos-Cadena, R. and Montes-de-Oca, R.
[1] Optimal stationary policies in controlled Markov chains with the ex-
pected total-reward criterion. Preprint, Departamento de Matematicas,
UAM-Iztapalapa, Mexico, 1997.
Chen, C.-T.
[1] Linear System Theory and Design. Holt, Rinehart and Winston, New
York, 1984.

Craven, B. D. and Koliha, J. J.


[1] Generalizations of Farkas' theorem. SIAM J. Math. Anal. 8
(1977), 983-997.
Dekker, R., Hordijk, A. and Spieksma, F.
[1] On the relation between recurrence and ergodicity properties in denu-
merable Markov chains. Math. Oper. Res. 19 (1994),539-559.
Derman, C.
[1] Finite State Markovian Decision Processes. Academic Press, New
York, 1970.
Derman, C. and Strauch, R. E.
[1] A note on memoryless rules for controlling sequential control processes.
Ann. Math. Statist. 37 (1966),276-278.
Doob, J. L.
[1] Measure Theory. Springer-Verlag, New York, 1994.
Duflo, M.
[1] Methodes Recursives Aleatoires. Masson, Paris, 1990.
Dutta, P. K.
[1] What do discounted optima converge to? A theory of discount rate
asymptotics in economic models. J. Economic Theory 55 (1991),64-
94.
Dynkin, E. B. and Yushkevich, A. A.
[1] Controlled Markov Processes. Springer-Verlag, New York, 1979.
Easley, D. and Spulber, D.F.
[1] Stochastic equilibrium and optimality with rolling plans. International
Econ. Rev. 22 (1981), 79-103.
Fernandez-Gaucherand, E., Ghosh, M. K., and Marcus, S. I.
[1] Controlled Markov processes on the infinite planning horizon: weighted
and overtaking cost criteria. ZOR: Math. Methods in Oper. Res. 39
(1994),131-155.

Filar, J. A., Kallenberg, L. C. M. and Huey-Miin, L.


[1] Variance-penalized Markov decision processes. Math. Oper. Res. 14
(1989),147-161.
Flynn, J.
[1] On optimality criteria for dynamic programs with long finite horizons.
J. Math. Anal. Appl. 76 (1980), 202-208.

Gale, D.
[1] On optimal development in a multi-sector economy. Rev. of Economic
Studies 34 (1965), 1-19.
Glynn, P. W.
[1] Simulation output analysis for general state space Markov chains. Ph.
D. Dissertation, Dept. of Operations Research, Stanford University,
1989.
[2] Some topics in regenerative steady-state simulation. Acta Appl. Math.
34 (1994), 225-236.
Glynn, P. W. and Meyn, S. P.
[1] A Lyapunov bound for solutions of the Poisson equation. Ann. Prob.
24 (1996), 916-931.
Gonzalez-Hernandez, J. and Hernandez-Lerma, O.
[1] Envelopes of sets of measures, tightness, and Markov control processes.
Appl. Math. Optim., to appear.
Gordienko, E. and Hernandez-Lerma, O.
[1] Average cost Markov control processes with weighted norms: value
iteration. Appl. Math. (Warsaw) 23 (1995), 219-237.
[2] Average cost Markov control processes with weighted norms: existence
of canonical policies. Appl. Math. (Warsaw) 23 (1995), 199-218.
Gordienko, E., Montes-de-Oca, R. and Minjarez-Sosa, A.
[1] Average cost optimization in Markov control processes with unbounded
costs: ergodicity and finite horizon approximation. Preprint, Departa-
mento de Matematicas, UAM-Iztapalapa, Mexico, 1995.
Hall, P. and Heyde, C. C.
[1] Martingale Limit Theory and Its Applications. Academic Press, New
York, 1980.
Haviv, M. and Puterman, M. L.
[1] Bias optimality in controlled queueing systems. J. Appl. Prob. 35
(1998), 136-150.
Hernandez-Lerma, O.
[1] Adaptive Markov Control Processes. Springer-Verlag, New York, 1989.
Hernandez-Lerma, O., Carrasco, G. and Perez-Hernandez, R.
[1] Markov control processes with the expected total-cost criterion: op-
timality, stability, and transient models. Reporte Interno, Depto. de
Matematicas, CINVESTAV-IPN, 1998. (Submitted.)

Hernandez-Lerma, O. and Gonzalez-Hernandez, J.


[1] Infinite linear programming and multichain Markov control processes
in uncountable spaces. SIAM J. Control Optim. 36 (1998),313-335.
Hernandez-Lerma, O. and Lasserre, J. B.
[1] Discrete-Time Markov Control Processes: Basic Optimality Criteria.
Springer-Verlag, New York, 1996.
[2] Invariant probabilities for Feller-Markov chains. J. Appl. Math. and
Stoch. Anal. 8 (1995), 341-345.
[3] Existence of bounded invariant probability densities for Markov chains.
Stat. Prob. Lett. 28 (1996), 359-366.
[4] Cone-constrained linear equations in Banach spaces. J. Convex Anal.
4 (1997), 149-164.
[5] Existence and uniqueness of fixed points for Markov operators and
Markov processes. Proc. London Math. Soc. 76 (1998), 711-736.
[6] On the probabilistic multichain Poisson equation. LAAS Report N°
97155, LAAS-CNRS, Toulouse, 1997.
[7] Existence of solutions to the Poisson equation in Lp spaces. Proc. IEEE
Conference on Decision and Control (CDC), Kobe, Japan, 1996, vol.
4, pp. 4190-4195.
[8] Error bounds for rolling horizon policies in discrete-time Markov con-
trol processes. IEEE Trans. Autom. Control 35 (1990), 1118-1124.
[9] Value iteration and rolling plans for Markov control processes with
unbounded rewards. J. Math. Anal. Appl. 177 (1993), 38-55.
[10] A forecast horizon and a stopping rule for general Markov decision
processes. J. Math. Anal. Appl. 132 (1988), 388-400.
[11] Policy iteration for average cost Markov control processes on Borel
spaces. Acta Appl. Math. 47 (1997), 125-154.
[12] New criteria for positive Harris recurrence of Markov chains. LAAS
Report N° 97431, LAAS-CNRS, Toulouse, 1997.
[13] Ergodic theorems and ergodic decomposition for Markov chains. Acta
Appl. Math. 54 (1998), 99-199.
[14] Linear programming and average optimality for Markov control pro-
cesses in Borel spaces-unbounded costs. SIAM J. Control Optim. 32
(1994),480-500.
[15] Approximation schemes for infinite linear programs. SIAM J. Optim.
8 (1998), 973-988.
[16] Linear programming approximations for Markov control processes in
metric spaces. Acta Appl. Math. 51 (1998), 123-139.
Hernandez-Lerma, O., Montes-de-Oca, R. and Cavazos-Cadena, R.
[1] Recurrence conditions for Markov decision processes with Borel state
space: a survey. Ann. Oper. Res. 28 (1991), 29-46.

Hernandez-Lerma, O. and Vega-Amaya, O.


[1] Infinite-horizon Markov control processes with undiscounted cost cri-
teria: from average to overtaking optimality. Appl. Math. (Warsaw)
25 (1998), 153-178.

Hernandez-Lerma, O., Vega-Amaya, O. and Carrasco, G.


[1] Sample-path optimality and variance-minimization of average cost
Markov control processes. SIAM J. Control Optim., to appear.
Hinderer, K.
[1] Foundations of Non-Stationary Dynamic Programming with Discrete-
Time Parameter. Lecture Notes in Oper. Res. and Math. Syst. 33,
Springer-Verlag, Berlin, 1970.
Hinderer, K. and Hiibner, G.
[1] An improvement of J. F. Shapiro's turnpike theorem for the horizon
of finite stage discrete dynamic programs. In Trans. 7th Prague Conf.
on Information Theory, Statist. Dec. Func. and Random Proc., Vol.
A, 1974 (Academia, Prague, 1977), pp. 245-255.
Hopp, W.
[1] Identifying forecast horizons in non-homogeneous Markov decision
processes. Oper. Res. 37 (1989), 339-344.
Hordijk, A.
[1] Dynamic Programming and Markov Potential Theory, 2nd ed. Math-
ematical Centre Tracts No. 51, Mathematisch Centrum, Amsterdam,
1977.
Hordijk, A. and Lasserre, J. B.
[1] Linear programming formulation of MDPs in countable state space:
the multichain case. Z. Oper. Res. 40 (1994),91-108.
Hordijk, A. and Spieksma, F.
[1] A new formula for the deviation matrix. In Probability, Statistics and
Optimization (F. P. Kelly, ed.), Wiley, New York, 1994, pp. 497-507.
Johansen, L.
[1] Lectures on Macroeconomic Planning. North-Holland, Amsterdam,
1977.
Kallenberg, L. C. M.
[1] Linear Programming and Finite Markovian Control Problems. Mathe-
matical Centre Tracts No. 148, Mathematisch Centrum, Amsterdam,
1983.

Kartashov, N. V.
[1] Criteria for uniform ergodicity and strong stability of Markov chains
with a common phase space. Theory Probab. and Math. Statist. 30
(1985),71-89.
[2] Inequalities in theorems of ergodicity and stability for Markov chains
with common phase space. I. Theory Probab. Appl. 30 (1985), 247-259.
[3] Inequalities in theorems of ergodicity and stability for Markov chain
with common phase space. II. Theory Probab. Appl. 30 (1985), 507-
515.
[4] Strongly stable Markov chains. J. Soviet Math. 34 (1986), 1493-1498.
[5] Strong Stable Markov Chains. VSP, Utrecht, The Netherlands, 1996.
Kleinman, D.
[1] An easy way to stabilize a linear control system. IEEE Trans. Autom.
Control 15 (1970), p. 692.
Kurano, M.
[1] Markov decision processes with a minimum-variance criterion. J. Math.
Anal. Appl. 123 (1987), 572-583.
Kurano, M. and Kawai, M.
[1] Existence of optimal stationary policies in discounted decision pro-
cesses: approaches by occupation measures. Computers Math. Appl.
27 (1994), 95-101.
Kwon, W. H., Bruckstein, A. M. and Kailath, T.
[1] Stabilizing state feedback design via the moving horizon method. In-
ternat. J. Control 37 (1983), 631-643.
Lasota, A. and Mackey, M. C.
[1] Chaos, Fractals, and Noise: Stochastic Aspects of Dynamics, 2nd ed.
Springer-Verlag, New York, 1994.
Lasserre, J. B.
[1] Existence and uniqueness of an invariant probability measure for a
class of Feller-Markov chains. J. Theoret. Prob. 9 (1996), 595-612.
[2] Invariant probabilities for Markov chains on a metric space. Stat. Prob.
Lett., to appear.
[3] Sample-path average optimality for Markov control processes. IEEE
Trans. Autom. Control, to appear.
Lippman, S. A.
[1] On dynamic programming with unbounded rewards. Manage. Sci. 21
(1975), 1225-1233.
Luenberger, D. G.
[1] Optimization by Vector Space Methods. Wiley, New York, 1969.

Makowski, A. M. and Shwartz, A.


[1] On the Poisson equation for countable Markov chains. Tech. Rpt.,
Dept. of Electrical Engineering, University of Maryland, 1994.
Mandl, P.
[1] On the variance in controlled Markov chains. Kybemetika 7 (1971),
1-12.
[2] A connection between controlled Markov chains and martingales. Ky-
bemetika 9 (1973), 237-241.
Mandl, P. and Lausmanova, M.
[1] Two extensions of asymptotic methods in controlled Markov chains.
Ann. Oper. Res. 28 (1991), 67-80.
McKenzie, L. W.
[1] Turnpike theory. Econometrica 44 (1976),841-865.
Metivier, M. and Priouret, P.
[1] Theoremes de convergence presque sure pour une classe d'algorithmes
stochastiques a pas decroissant. Prob. Th. Rel. Fields 74 (1987), 403-
428.
Meyn, S. P.
[1] The policy iteration algorithm for average reward Markov decision
processes with general state space. IEEE Trans. Autom. Control 42
(1997), 1663-1679.
Meyn, S. P. and Tweedie, R. L.
[1] Markov Chains and Stochastic Stability. Springer-Verlag, London, 1993.
[2] Computable bounds for geometric convergence rates of Markov chains.
Ann. Appl. Prob. 4 (1994), 981-1011.
Mokkadem, A.
[1] Sur un modele autoregressif nonlineaire. Ergodicite et ergodicite geo-
metrique. J. Time Series Anal. 8 (1987), 195-205.
Neveu, J.
[1] Mathematical Foundations of the Calculus of Probability. Holden-Day,
San Francisco, 1965.
Nowak, A. S.
[1] Stationary overtaking optimal strategies in Markov decision processes
with general state space. Preprint, Institute of Mathematics, Technical
University of Wroclaw, Poland, 1992.
Nowak, A. S. and Vega-Amaya, O.
[1] A counterexample on overtaking optimality. Preprint, Institute of
Mathematics, Technical University of Wroclaw, Poland; Departamento
de Matematicas, Universidad de Sonora, Mexico, 1997.
Nummelin, E.
[1] General Irreducible Markov Chains and Non-Negative Operators. Cam-
bridge University Press, Cambridge, 1984.
[2] On the Poisson equation in the potential theory of a single kernel.
Math. Scand. 68 (1991), 59-82.
Orey, S.
[1] Limit Theorems for Markov Chain Transition Probabilities. Van Nos-
trand Reinhold, London, 1971.
Parthasarathy, K. R.
[1] Probability Measures on Metric Spaces. Academic Press, New York,
1967.
Piunovski, A. B.
[1] General Markov models with the infinite horizon. Problems of Control
and Infor. Theory 18 (1989), 169-182.
Pliska, S. R.
[1] On the transient case for Markov decision chains with general state
spaces. In: Puterman [2], pp. 335-349.
Puterman, M. L.
[1] Markov Decision Processes. Wiley, New York, 1994.
[2] (Editor) Dynamic Programming and Its Applications. Academic Press,
New York, 1979.
Quelle, G.
[1] Dynamic programming of expectation and variance. J. Math. Anal.
Appl. 55 (1976), 239-252.
Ramsey, F. P.
[1] A mathematical theory of savings. Economic J. 38 (1928), 543-559.
Rempala, R.
[1] Forecast horizon in a dynamic family of one-dimensional control prob-
lems. Diss. Math. 315 (1991).
Revuz, D.
[1] Markov Chains, revised ed. North-Holland, Amsterdam, 1984.
Rieder, U.
[1] Measurable selection theorems for optimization problems. Manuscripta
Math. 24 (1978), 115-131.
[2] On optimal policies and martingales in dynamic programming. J.
Appl. Prob. 13 (1976), 507-518.
[3] On Howard's policy improvement method. Math. Operationsforsch.
Statist., Ser. Optimization 8 (1977), 227-236.
Robertson, A. P. and Robertson, W.
[1] Topological Vector Spaces. Cambridge University Press, Cambridge,
UK, 1964.
Ross, S. M.
[1] Applied Probability Models with Optimization Applications. Holden-
Day, San Francisco, 1970.
Royden, H. L.
[1] Real Analysis, 2nd ed. Macmillan, New York, 1968.
Rudin, W.
[1] Real and Complex Analysis, 3rd ed. McGraw-Hill, New York, 1986.
Schal, M.
[1] Conditions for optimality and for the limit of n-stage optimal policies
to be optimal. Z. Wahrs. Verw. Geb. 32 (1975), 179-196.
Schweitzer, P. J.
[1] On undiscounted Markovian decision processes with compact action
spaces. RAIRO Rech. Oper./Oper. Res. 19 (1985), 71-86.
Shapiro, J. M.
[1] Turnpike planning horizons for a Markovian decision model. Manage.
Sci. 14 (1968), 292-300.
Spieksma, F. and Tweedie, R. L.
[1] Strengthening ergodicity to geometric ergodicity for Markov chains.
Stoch. Models 10 (1994), 45-75.
Strauch, R.
[1] Negative dynamic programming. Ann. Math. Statist. 37 (1966), 871-
890.
Syski, R.
[1] Ergodic potential. Stoch. Proc. Appl. 7 (1978), 311-336.
Tijms, H. C.
[1] Stochastic Models: An Algorithmic Approach. Wiley, New York, 1994.
Tijms, H. C. and Wessels, J.
[1] (Editors) Markov Decision Theory. Mathematical Centre Tracts No.
93, Mathematisch Centrum, Amsterdam, 1977.

Tweedie, R. L.
[1] Sufficient conditions for ergodicity and geometric ergodicity of Markov
chains on a general state space. Stoch. Proc. Appl. 3 (1975), 385-403.
[2] The existence of moments for stationary Markov chains. J. Appl.
Probab. 20 (1983), 191-196.
van Hee, K. M. Hordijk, A., and van der Wal, J.
[1] Successive approximations for convergent dynamic programming. In:
Tijms and Wessels [1], pp. 183-211.
van Nunen, J. A. E. E. and Wessels, J.
[1] A note on dynamic programming with unbounded rewards. Manage.
Sci. 24 (1978), 576-580.
[2] Markov decision processes with unbounded rewards. In: Tijms and
Wessels [1], pp. 1-24.
Vega-Amaya, O.
[1] Overtaking optimality for a class of production-inventory systems.
Preprint, Departamento de Matematicas, Universidad de Sonora, Me-
xico, 1996.
[2] Markov Control Processes In Borel Spaces: Undiscounted Criteria.
Doctoral Thesis, UAM-Iztapalapa, Mexico, 1998. (In Spanish.)
[3] Sample path average optimality of Markov control processes with
strictly unbounded cost. Appl. Math. (Warsaw), to appear.
Veinott, A. F. (Jr.)
[1] Discrete dynamic programming with sensitive discount optimality cri-
teria. Ann. Math. Statist. 40 (1969), 1635-1660.
[2] On finding optimal policies in discrete dynamic programming with no
discounting. Ann. Math. Statist. 37 (1965), 1284-1294.
Vershik, A. M.
[1] Some remarks on the infinite-dimensional problems of linear program-
ming. Russian Math. Surveys 29 (1970), 117-124.
Vershik, A. M. and Temel't, V.
[1] Some questions concerning the approximation of the optimal value of
infinite-dimensional problems in linear programming. Siberian Math.
J. 9 (1968), 591-601.
von Weizsacker, C. C.
[1] Existence of optimal programs of accumulation for an infinite horizon.
Rev. of Economic Studies 32 (1965), 85-104.
Wakuta, K.
[1] Arbitrary state semi-Markov decision processes with unbounded re-
wards. Optimization 18 (1987), 447-454.
Wessels, J.
[1] Markov programming by successive approximations with respect to
weighted supremum norms. J. Math. Anal. Appl. 58 (1977), 326-335.
Yosida, K.
[1] Functional Analysis, 6th ed. Springer-Verlag, Berlin, 1980.
Yushkevich, A. A.
[1] Blackwell optimality in Borelian continuous-in-action Markov decision
processes. SIAM J. Control Optim. 35 (1997), 2157-2182.
Abbreviations

a.a. almost all
a.e. almost everywhere
a.s. almost surely
i.i.d. independent and identically distributed
l.s.c. lower semicontinuous
u.s.c. upper semicontinuous
p.m. probability measure
i.p.m. invariant probability measure
AC average cost
ACOE Average Cost Optimality Equation
ACOI Average Cost Optimality Inequality
ADO asymptotic discount optimality
DC discounted cost
DCOE discounted cost optimality equation
DP dynamic programming
ETC expected total cost
IFS iterated function system
LCSM locally compact separable metric
LP linear programming
LLN Law of Large Numbers
MCM Markov control model
MCP Markov control process
o.o. overtaking optimality
P.E. Poisson equation
PI policy iteration
PIA policy iteration algorithm
RH rolling horizon
VI value iteration
Glossary of notation

□            end of proof or example or remark
:=           equality by definition
1_B          indicator function of a set B, defined as 1_B(x) = 1 if x ∈ B, and 0 otherwise
r^+ := max(r, 0),   r^- := -min(r, 0)

Chapter 7

Section 7.1
X            Borel (state) space
B(X)         Borel σ-algebra

Section 7.2
w            weight function
‖u‖          sup norm of a function u
‖u‖_w        w-norm of a function u
B(X)         Banach space of bounded measurable functions on X
B_w(X)       Banach space of measurable functions on X with finite w-norm
‖µ‖_TV       total variation norm of a measure µ
‖µ‖_w        w-norm of a measure µ
M(X)         Banach space of signed measures µ on X with ‖µ‖_TV < ∞
M_w(X)       Banach space of signed measures µ on X with ‖µ‖_w < ∞
Q(B|x)       signed kernel on X
Qu(x) := ∫_X u(y) Q(dy|x)
µQ(B) := ∫_X Q(B|x) µ(dx)
‖Q‖_w        w-norm of a signed kernel Q
δ_x          Dirac measure at x ∈ X
QR           composition of the signed kernels Q and R
Q^n := QQ^{n-1}
Q^0(B|x) := δ_x(B)

Section 7.3
P(B|x)       transition probability function of a Markov chain
P^t(B|x)     t-step transition probability
τ_B          hitting time of the set B
η_B          occupation time of the set B
L(x,B) := P_x(τ_B < ∞)
U(x,B) := E_x(η_B)
ν ≪ ρ        the measure ν is absolutely continuous with respect to the measure ρ
λ            maximal irreducibility measure
B(X)^+       family of sets B ∈ B(X) with λ(B) > 0
∫ v dρ ≡ ρ(v) ≡ ⟨ρ, v⟩
C_b(X)       Banach space of continuous bounded functions on X

Chapter 8

Section 8.2
M := (X, A, {A(x) | x ∈ X}, Q, c)   Markov control model
K            set of feasible state-action pairs
F            set of decision functions (or selectors)
Φ            set of randomized decision functions
H_t          family of admissible histories up to time t
π            control policy
φ^∞          randomized stationary policy, φ ∈ Φ
f^∞          deterministic stationary policy, f ∈ F
Π            set of all control policies
Π_RM         set of randomized Markov policies
Π_RS         set of randomized stationary policies φ^∞
Π_D          set of deterministic policies
Π_DM         set of deterministic Markov policies
Π_DS         set of deterministic stationary policies f^∞
c(x,φ) ≡ c_φ(x) := ∫_A c(x,a) φ(da|x)   for φ ∈ Φ
c(x,f) ≡ c_f(x) := c(x, f(x))   for f ∈ F
Q(·|x,φ) ≡ Q_φ(·|x) := ∫_A Q(·|x,a) φ(da|x)   for φ ∈ Φ
Q(·|x,f) ≡ Q_f(·|x) := Q(·|x, f(x))   for f ∈ F
(Ω, F)       canonical measurable space
P_ν^π        p.m. on (Ω, F) determined by the policy π and the initial distribution ν
E_ν^π        expectation with respect to P_ν^π
E_x^π := E_ν^π if ν = δ_x

Section 8.3
V(π,x)       α-discounted cost (0 < α < 1) when using the policy π, given the initial state x
V*(x)        α-discount value function
V_n(π,x)     α-discounted n-stage expected cost
v_n(x)       α-value iteration (α-VI) function, n = 1, 2, ...
‖Q‖_w        w-norm of the transition law Q
T            DP operator

Section 8.4
F_n          set of α-VI decision functions, n = 1, 2, ...
F_α          set of α-discount optimal decision functions
D(x,a)       α-discount discrepancy function
A_α(x)       α-discount optimal control actions in the state x
A_n(x)       α-VI optimal control actions in the state x; n = 1, 2, ...
D_n(x,a)     α-VI discrepancy function, n = 1, 2, ...

Section 8.5
L(X)         family of l.s.c. functions on X
L_w(X) := L(X) ∩ B_w(X)
C(X)         family of continuous functions on X
C_w(X) := C(X) ∩ B_w(X)
C_b(X)       family of continuous bounded functions on X

Chapter 9

Section 9.1
V(π,x)       expected total cost (ETC) when using the policy π, given the initial state x
V*(x)        ETC value function

Section 9.2
R̄            extended real numbers
r^+ := max(r, 0)
r^- := max(-r, 0) = -min(r, 0)

Section 9.3
J_n(π,x)     n-stage ETC when using the policy π, given the initial state x
J_n^*(x)     n-stage optimal ETC
V_α(π,x) ≡ V(π,x)   α-discounted cost, 0 < α < 1
V_α^*(x) ≡ V*(x)   α-discount value function
V^(+)(π,x) := E_x^π ( Σ_{t=0}^∞ c_t^+ )
V^(-)(π,x) := E_x^π ( Σ_{t=0}^∞ c_t^- )
V^n(π,x)     ETC from time n onwards when using the policy π, given the initial state x_0 = x

Section 9.4
V(π,ν)       ETC when using the policy π, given the initial distribution ν
µ_ν^π        ETC-expected occupation measure on X × A when using the policy π, given the initial distribution ν
µ̂_ν^π        marginal of µ_ν^π on X
µ_ν^{π,t}    distribution of (x_t, a_t) when using the policy π, given the initial distribution ν, t = 0, 1, ...
µ̂_ν^{π,t}    marginal of µ_ν^{π,t} on X
π^(1)        1-shift policy determined by π

Section 9.5
T            dynamic programming operator (when α = 1)
             see Definition 9.5.1
D(x,a)       ETC-discrepancy function
π^(n)        n-shift policy determined by π (n = 0, 1, ...)
M'_n         see (9.5.12)
Section 9.6
Q_φ(·|x) ≡ Q(·|x,φ) := ∫_A Q(·|x,a) φ(da|x)
Q_t(·|x) := Q_{φ_t}(·|x)
Q^t := Q_0 Q_1 ⋯ Q_{t-1}   for t = 1, 2, ...
‖Q_φ‖_w      w-norm of Q_φ

Chapter 10

Section 10.1
OC(π,x)      opportunity cost of policy π, given the initial state x
D(π,x)       Dutta's criterion
J(π,x)       expected average cost (AC) when using the policy π, given the initial state x
J*(x)        optimal expected AC

Section 10.2
Q_f^t(·|x) ≡ Q^t(·|x,f)   t-step transition probability, t = 0, 1, ...
‖Q_f‖_w      w-norm of Q_f
µ_f          i.p.m. of Q_f
L_1(µ) := L_1(X, B(X), µ)
µ_f(u) := ∫_X u dµ_f
c_f(x) ≡ c(x,f) := c(x, f(x))
J(f) := µ_f(c_f)
h_f          bias function of f ∈ F

Section 10.3
             family of AC-optimal decision functions
             family of canonical decision functions
             family of bias-optimal decision functions
(ρ*, h*)     solution of the ACOE
J_n(π,x,h)   n-stage ETC with terminal cost function h
J_n^*(x,h)   value function for J_n(π,x,h)
λ            irreducibility measure
h(x)         optimal bias function
A*(x)        AC-canonical control actions at state x

Section 10.4
u_α(·) := V_α^*(·) - V_α^*(z)
ρ(α) := (1 - α) V_α^*(z)

Section 10.7
             see (10.7.3)

Chapter 11

Section 11.1
J_n^0(π,ν)   n-stage sample-path cost when using the policy π, given the initial distribution ν
J^0(π,ν)     long-run sample-path AC
J^1(π,ν)     limit-infimum expected AC
Var(π,ν)     limiting average variance when using the policy π, given the initial distribution ν
ρ_min        see (11.1.12)
ρ            see (11.1.15)
ρ*           see (11.1.23)
P(X)         set of probability measures on X
P_δ(X) := {δ_x | x ∈ X}

Section 11.2
P^(n)(·|x)   expected average occupation measure
σ²(c,x)      limiting average variance for a Markov chain
             see (11.2.11)
ψ(x)         see (11.2.12)
Y_t          see (11.2.17)
M_n := Σ_{t=1}^n Y_t
F_t := σ{x_0, ..., x_t}

Section 11.3
             see (11.3.2)
(ρ0, h0)     solution of the ACOE
J0(π,x)      lim-inf sample path AC
ψ(x,a)       see (11.3.5)
ψ_f(x) := ψ(x, f(x))
(q~, v0)     see (11.3.17)
v(x) := w(x)^{1/2}
F_t(π,x)     see (11.3.23)
Y_t(π,x)     see (11.3.24)
M_n(π,x)     see (11.3.25)
D(x,a)       AC-discrepancy function
σ_f²(x)      see (11.3.35)
σ²(f) := Var(f^∞, ·)
M_var        MCM for the variance minimization problem

Chapter 12

Section 12.2
(X,Y)        dual pair of vector spaces X, Y
σ(X,Y)       weak topology on X
σ(Y*,Y)      weak* topology on Y*
C_0(S)       Banach space of continuous functions vanishing at infinity
(µ, u) := ∫ u dµ
G*           adjoint of linear map G
w_0          weight function on X
w            weight function on X × Y
M_w(X)^+     positive cone in M_w(X)
B_w(X)^+     positive cone in B_w(X)
K*           dual cone of the positive cone K
µ̂            marginal on X of the measure µ on X × Y
ℙ            primal linear program
inf ℙ        value of ℙ
min ℙ        optimal value of ℙ
ℙ*           dual of ℙ
sup ℙ*       value of ℙ*
max ℙ*       optimal value of ℙ*

Section 12.3
w(x,a)       weight function on K; see (12.3.5)
w_0(x)       weight function on X; see (12.3.5)
L_0µ         see (12.3.12)
L_1µ         see (12.3.13)
Lµ           see (12.3.14)
L*(µ, u)     see (12.3.15)
(P)          AC-related primal linear program
(P*)         dual of (P)
L̂            perturbation of L

Section 12.5
C(X)         a countable dense subset of C_0(X)
{C_k}        increasing sequence of finite sets C_k ↑ C(X)
ℙ(C_k)       aggregation of constraints of (P)
{c_k}        sequence of numbers c_k ↓ 0
ℙ(C_k, c_k)  aggregation-relaxation of (P)
D            a countable dense subset of K
{D_n}        increasing sequence of finite sets D_n ↑ D
Δ_n := P(D_n)
Δ := ∪_{n≥1} Δ_n
ℙ(C_k, c_k, Δ_n)   aggregation-relaxation-inner approximation of (P)

Section 12.6
K_j := {(x, a) ∈ K : c(x, a) ≤ j}
Index

The numbers in this index refer to sections

α-discounted cost, 8.3
    discrepancy function, 8.4
    optimality equation, 8.3
    optimal policy, 8.3
    value function, 8.3
α-value iteration (VI), 8.3
    asymptotic optimality of, 8.4
    discrepancy functions, 8.4
    VI-policy, 8.4
Abelian theorem, 10.4
absorbing MCM, 9.6
action set, 8.2
additive-noise system, 7.4, 11.5
adjoint of linear map, 12.2
Alaoglu Theorem, 12.2
Alden, 8.4
Altman, 9.3
Anderson, 12.2
aperiodic Markov chain, 7.3
Ash, 12.2
asymptotic discount optimality (ADO), 8.4
average cost (AC), 10.1, 10.3
    discrepancy function, 11.3
    lim inf, 11.1
    lim inf sample path, 11.3
    lim sup, 10.1, 11.1
    minimum, 11.1
    related linear programs, 12.3
        approximation of, 12.4, 12.5
    sample path, 11.1, 11.3, 11.4
    value function, 10.1, 11.1
Average Cost Optimality Equation (ACOE), 10.3
    uniqueness of solutions to, 10.3
    subsolutions to the, 12.3
Average Cost Optimality Inequality (ACOI), 10.3, 11.4
average variance
    different definitions of, 11.2
    of Markov chains, 11.2
        alternative expressions for, 11.2
    of MCPs, 11.1, 11.3
Balder, 12.2
Banach-Alaoglu-Bourbaki Theorem, see Alaoglu Theorem
Banach's Fixed Point Theorem, 7.2, 8.3
Baykal-Gürsoy, 11.2
Benes, 12.2
Bensoussan, 8.3
Bertsekas, 9.3, 9.5, 9.6, 10.9, 11.4
Bess, 8.4, 8.6
Bhattacharya, 8.3
bias function, 10.2
    optimal, 10.3
bias minimization, 10.3
    optimality equations for, 10.3
    vs. ETC problem, 10.7
bilinear form, 12.2
Billingsley, 12.2, 12.5
Blackwell, 9.4
Borel space, 7.1
bounding function, see weight function
Bourbaki, 12.2
Brezis, 12.2
Brown, 10.3, 10.9
C-set, 7.3
canonical pair, 7.5
canonical policy, 10.3
canonical triplet, 7.5, 10.3
    vs. ACOE, 10.3
    for bias-optimality, 10.7
Carrasco, 9.5, 9.6, 11.3, 11.4
Cavazos-Cadena, 7.3, 8.3, 9.5
Central Limit Theorem for Markov chains, 11.2
Cesaro sums, 7.3, 7.5
Chen, 7.4
complementary slackness, 12.2
composition of signed kernels, 7.2
conditional expectation
    properties of, 11.2
continuous component, 7.3
contraction map, 7.2
control policy, 8.2
    AC-optimal, 10.1
    a.e. canonical, 10.3
    bias-optimal, 10.3
    canonical, 10.3
    catching-up, 10.1
    D-optimal, 10.1
    deterministic, 8.2
    deterministic Markov, 8.2
    deterministic stationary, 8.2
    F-strong AC-optimal, 10.3
    minimum variance, 11.3
    n-shift, 9.5
    DC-optimal, 10.1
    1-shift, 9.4
    overtaking, 10.1
    pointwise-ADO, 8.4
    positive Harris recurrent, 11.4
    randomized, 8.2
    randomized Markov, 8.2
    randomized stationary, 8.2
    sample path AC-optimal, 11.1
    stable, 11.4
    strong AC-optimal, 11.1
    strongly o.o., 10.1
    weakly o.o., 10.1
control set, see action set
convex cone, 12.2
cost-per-stage c(x, a), 8.2
Craven, 12.2
dam model, 7.4
decision function, 8.2
Dekker, 7.3
Derman, 9.4
deviation matrix, 7.5
Dieudonne, 12.3
Dirac measure, 7.2
directed set, 12.2
discounted cost, see α-discounted cost
discrepancy function
    discounted, 8.4
    for ETC criterion, 9.5
    for the PIA, 10.5
Doeblin's condition, 7.3
Doob, 11.2
Drazin inverse, 7.5
dual cone, 12.2
dual pair of vector spaces, 12.2
Duflo, 7.3, 11.2
Dutta, 10.1
Dutta's criterion, 10.1
dynamic programming equation
    for ETC criterion, 9.5
dynamic programming operator
    as a contraction, 8.3
    discounted, 8.3
    ETC-, 9.5
Dynkin, 9.4, 10.9
Easley, 8.4
elimination of nonoptimal actions, 8.4
expected occupation measure, 9.4
    α-discount, 9.4
    expected total cost, 9.4
    marginal on X, 9.4
expected total cost (ETC), 9.1, 9.3
    and bias minimization, 10.7
    expected occupation measure, 9.4
    n-stage, 9.3
    optimal n-stage, 9.3
    optimal policy, 9.1
    optimality equation, 9.5
    value function, 9.1, 9.3
extended real numbers, 9.2
    convergence of series of, 9.2
Farkas Theorem, 7.3
    Generalized, 12.2
Fatou's Lemma, 8.3
feasible actions, 8.2
    set of, 8.2
feasible controls, see feasible actions
feasible state-action pairs, 8.2
    set of, 8.2
Feller property, 7.3
    strong, 7.3, 8.5
    weak, 7.3, 8.5, 12.2
Fernandez-Gaucherand, 10.9
Filar, 11.2
fixed point, 7.2, 8.3
Flynn, 10.1, 10.3, 10.9
forecast horizon, 8.4
    detection problem of, 8.4
    (x,π)-, 8.4
function
    bounded, 7.2
    w-bounded, 7.2
fundamental kernel, 7.5
fundamental matrix, 7.5
Gale, 10.1
Generalized Farkas Theorem, 12.2
generalized sequence, see net
geometric ergodicity in the total variation norm, 7.3
Glynn, 7.5, 11.2
Gonzalez-Hernandez, 9.4, 12.3, 12.4
Gordienko, 7.3, 8.6, 10.2, 10.3, 10.5, 10.9
Hall, 11.2
harmonic function, see invariant function
Harris recurrence, 7.3
    null, 7.3
    positive, 7.3, 11.2
Haviv, 10.3, 10.9
Hernandez-Lerma, 7.3, 7.5, 8.5, 8.6, 9.4, 9.5, 9.6, 10.2, 10.3, 10.5, 10.9, 11.2, 11.3, 11.4, 11.5, 12.2, 12.3, 12.4, 12.6
Heyde, 11.2
Hinderer, 8.4, 9.2, 9.3, 9.4, 9.5
hitting time, 7.3
Hordijk, 7.3, 7.5, 12.3
Hübner, 8.4
Individual Ergodic Theorem, 7.5
inf-compact function, 11.4, 12.2
    vs. tightness, 12.2
integrability, 9.2
invariant function, 7.5, 10.2
invariant probability measure (i.p.m.), 7.3
inventory-production system, 7.4, 8.5, 11.5
irreducibility measure, 7.3
    maximal, 7.3
iterated function system (IFS), 7.4
    Poisson equation for, 7.5
Johansen, 8.4
Kallenberg, 9.6
Kartashov, 7.2, 7.3, 7.4, 10.2
Kawai, 9.4
Kleinman, 8.4
Koliha, 12.2
Kurano, 9.4, 11.3
Kwon, 8.4
λ-irreducible, 7.3
Lasota, 7.3, 7.4
Lasserre, 7.3, 7.5, 8.5, 8.6, 10.5, 11.2, 11.4, 11.5, 12.2, 12.3, 12.6
Lausmanova, 11.3
Law of Large Numbers for Markov chains, 11.2
limiting average variance, see average variance
linear program
    aggregation of constraints of, 12.2, 12.5
    consistent, 12.2
    dual, 12.2
    feasible solution for, 12.2
    for the AC problem, 12.3
        approximation of, 12.4, 12.5
    inner approximation of, 12.2, 12.5
    maximizing sequence for, 12.2, 12.4
    minimizing sequence for, 12.2, 12.4
    no duality gap for, 12.2
    optimal solution for, 12.2
    primal, 12.2
    solvable, 12.2
    strong duality of, 12.2
    value of, 12.2
    weak duality of, 12.2
linear system, 7.4
    Poisson's equation for, 7.5
Lippman, 8.3
Luenberger, 7.2
Lyapunov function, see strictly unbounded
Mackey, 7.3, 7.4
Majumdar, 8.3
Makowski, 7.5
Mandl, 11.3
marginal measure, 12.2
Markov chain corresponding to Markov policy, 8.3
Markov control model (MCM), 8.2
    absorbing, 9.6
    convergent, 9.3
    negative, 9.3
    positive, 9.3
    transient, 9.6
    w-geometrically ergodic, 11.3
    zero-average cost, 9.3
Martingale Stability Theorem, 11.2
McKenzie, 8.4
Mean Ergodic Theorem, 7.5
measurable selection theorem, 8.3
Metivier, 7.5
Meyn, 7.2, 7.3, 7.4, 7.5, 10.5, 11.2
minimum average cost, 11.1
minimum pair, 11.1
Minjarez-Sosa, 7.3, 10.2
Mokkadem, 11.5
moment function, see strictly unbounded
moment generating function, 7.4
monotone operator, 7.2
Montes-de-Oca, 7.3, 9.5, 10.2
multifunction, 8.2
    u.s.c., 8.3
n-shift policy π^(n), 9.5
n-stage cost, 8.3
Nash, 12.2
net, 12.2
Neveu, 9.2
nonexpansive map, 7.2
norm-like function, see strictly unbounded
Nowak, 10.3, 10.9
Nummelin, 7.3, 7.4, 7.5, 10.9
occupation measures
    discounted, 9.4
    expected average, 11.2
    expected total, 9.4
occupation time, 7.3
1-shift control policy, 9.4
one-stage cost, see cost-per-stage
operator
    bounded, 7.5
    dynamic programming, 9.5
    nonexpansive, 7.5
    norm of, 7.5
    power-bounded, 7.5
opportunity cost, 10.1
optimality equation
    for AC problem, 10.3
    for DC problem, 8.3
    for ETC-criterion, 9.5
Orey, 10.5
Parthasarathy, 12.2, 12.5
pathwise AC-optimality, 11.1
Perez-Hernandez, 9.5, 9.6
petite set, 7.3
Piunovski, 8.3
Pliska, 9.6
Poisson equation (P.E.), 7.5, 9.5, 10.2
    for ETC problem, 9.5
    for iterated function systems, 7.5
    for linear systems, 7.5
    for MCPs, 10.2
    strictly unichain, 7.5
    uniqueness of solutions, 7.5, 10.2
policy, see control policy
policy iteration algorithm (PIA)
    for AC problem, 10.5
        and minimizing sequences, 12.4
    for ETC criterion, 9.6
positive cone, 12.2
Priouret, 7.5
product of signed kernels, 7.2
Prohorov's Theorem, 12.2
Puterman, 7.5, 9.3, 9.5, 9.6, 10.3, 10.5, 10.9, 11.2
quadratic cost, 11.5
quasi-integrable random variable, 9.2
Quelle, 9.3, 9.5
queueing system, 7.4, 8.6, 11.5
Ramsey, 9.1, 10.1
random walk, 7.4
recurrent, 7.3
    null, 7.3
    positive, 7.3
Rempala, 8.4
resolvent, 7.3
Revuz, 7.5, 11.2
Rieder, 8.3, 9.3, 9.5, 9.6
Robertson, 12.2
rolling horizon (RH), 8.4
Ross, K.W., 11.2
Ross, S.M., 7.2, 10.9
Royden, 10.5
Rudin, 12.2, 12.3
sample path
    AC-optimality, 11.1
    average cost (AC), 11.1
    n-stage total cost, 11.1
sampling distribution, 7.3
Schal, 8.4, 9.3, 9.6
Schweitzer, 9.5
selector, see decision function
set-valued function, see multifunction
Sethi, 8.4
Shapiro, 8.4
Shreve, 11.4
Shwartz, 7.5
signed kernel, 7.2
small set, 7.3
Smith, 8.4
Spieksma, 7.3, 7.5
Spulber, 8.4
state space, 8.2
stochastic kernel, 7.2
    weakly continuous, 12.2
Strauch, 9.3, 9.4
strictly unbounded function, 7.3, 11.4, 12.2
subinvariant function, 7.5
substochastic kernel, 7.3, 9.6
sufficiency problem, 9.4
sufficient set of policies, 9.3
sup norm, 7.2
superinvariant function, 7.5
support of a measure, 7.3
Syski, 7.5
T-chain, 7.3
Temel't, 12.6
tight family of measures, 12.2
tightness, 12.2
    vs. inf-compact function, 12.2
total variation norm, 7.2
    of transition kernel, 9.6
transient Markov chain, 7.3, 9.6
    uniformly, 7.3
transient MCM, 9.6
transition law, 8.2
    n-step, 8.2
    weakly continuous, 8.5, 11.4
transition probability function, 7.2
Tweedie, 7.2, 7.3, 7.4, 10.5, 11.2, 11.5
uniformly transient, 7.3
upper semicontinuous (u.s.c.) multifunction, 8.3
value iteration (VI)
    α-discounted, 8.3
    discrepancy functions, 8.4
    estimates of, 8.4
    ETC criterion, 9.5, 9.6
    policy, 8.4
van Hee, 9.3
van Nunen, 8.3, 9.3
Vega-Amaya, 8.6, 10.3, 10.9, 11.3, 11.4, 11.5
Veinott, 9.6
Vershik, 12.6
von Weizsacker, 10.1
w-bounded, 7.2
w-geometric ergodicity, 7.3, 10.2
    implies positive Harris recurrence, 11.2
w-norm
    of functions, 7.2
    of measures, 7.2
    of signed kernel, 7.2
Wakuta, 8.3
weak convergence, 12.2
    of measures, 11.4
weak topology, 12.2
weak* (weak-star) topology, 12.2
weakly continuous linear map, 12.2
weight function, 7.2
Wessels, 8.3, 9.3
Yosida, 7.5
Yushkevich, 9.4, 10.9
Applications of Mathematics
(continued from page ii)

33 Embrechts/Klüppelberg/Mikosch, Modelling Extremal Events (1997)


34 Duflo, Random Iterative Models (1997)
35 KushnerlYin, Stochastic Approximation Algorithms and Applications (1997)
36 Musiela/Rutkowski, Martingale Methods in Financial Modeling: Theory and
Application (1997)
37 Yin/Zhang, Continuous-Time Markov Chains and Applications (1998)
38 Dembo/Zeitouni, Large Deviations Techniques and Applications, Second Ed.
(1998)
39 Karatzas/Shreve, Methods of Mathematical Finance (1998)
40 Fayolle/Iasnogorodski/Malyshev, Random Walks in the Quarter Plane (1999)
41 Aven/Jensen, Stochastic Models in Reliability (1999)
42 Hernandez-Lerma/Lasserre, Further Topics on Discrete-Time Markov Control
Processes (1999)
