Further Topics on Discrete-Time Markov Control Processes
Springer
Onésimo Hernández-Lerma
CINVESTAV-IPN
Departamento de Matemáticas
Apartado Postal 14-740
07000 México DF, Mexico
ohernand@math.cinvestav.mx

Jean Bernard Lasserre
LAAS-CNRS
7 Av. du Colonel Roche
31077 Toulouse Cedex, France
lasserre@laas.fr
Managing Editors
I. Karatzas
Departments of Mathematics and Statistics
Columbia University
New York, NY 10027, USA
M. Yor
CNRS, Laboratoire de Probabilités
Université Pierre et Marie Curie
4, Place Jussieu, Tour 56
F-75252 Paris Cedex 05, France
This book presents the second part of a two-volume series devoted to a sys-
tematic exposition of some recent developments in the theory of discrete-
time Markov control processes (MCPs). As in the first part, hereafter re-
ferred to as "Volume I" (see Hernández-Lerma and Lasserre [1]), interest is
mainly confined to MCPs with Borel state and control spaces, and possibly
unbounded costs. However, an important feature of the present volume is
that it is essentially self-contained and can be read independently of Volume
I. The reason for this independence is that even though both volumes deal
with similar classes of MCPs, the assumptions on the control models are
usually different. For instance, Volume I deals only with nonnegative cost-
per-stage functions, whereas in the present volume we allow cost functions
to take positive or negative values, as needed in some applications. Thus,
many results in Volume I on, say, discounted or average cost problems are
not applicable to the models considered here.
On the other hand, we now consider control models that typically re-
quire more restrictive classes of control-constraint sets and/or transition
laws. This loss of generality is, of course, deliberate because it allows us
to obtain more "precise" results. For example, in a very general context,
in §4.2 of Volume I we showed the convergence of the value iteration (VI)
procedure for discounted-cost MCPs, whereas now, in a somewhat more
restricted setting, we actually get a lot more information on the VI proce-
dure, such as the rate of convergence (§8.3), which in turn is used to study
"rolling horizon" procedures, as well as the existence of "forecast horizons" ,
and criteria for the elimination of nonoptimal control actions. Similarly, in
Chapter 10 and Chapter 11, which deal with average cost problems, we
References
Abbreviations
Index
Contents of Volume I
Appendices
A. Miscellaneous results
B. Conditional expectation
C. Stochastic kernels
D. Multifunctions and selectors
E. Convergence of probability measures
7 Ergodicity and Poisson's Equation
7.1 Introduction
This chapter deals with noncontrolled Markov chains and presents impor-
tant background material used in later chapters. The reader may omit it
and refer to it as needed.
There are in particular two key concepts we wish to arrive at in this
chapter, and which will be used to study several classes of Markov control
problems. One is the concept of w-geometric ergodicity with respect to some
weight function w, and the other is the Poisson equation (P.E.), which can
be seen as a special case of the Average-Cost Optimality Equation. The
former is introduced in §7.3.D and the latter in §7.5. First we introduce,
in §7.2, some general notions on weighted norms and signed kernels, and
then, in §§7.3.A-7.3.C, we review some standard results from Markov chain
theory. The latter results are presented without proofs; they are introduced
here mainly for ease of reference. Finally, §§7.4 and 7.5.C contain some ex-
amples on w-geometric ergodicity and the P.E., respectively.
Throughout the following, X denotes a Borel space (that is, a Borel subset of a complete and separable metric space), unless explicitly stated otherwise. Its Borel σ-algebra is denoted by B(X).
A. Weighted-norm spaces
\[ \|u\|_w := \sup_{x\in X} w(x)^{-1}\,|u(x)|, \tag{7.2.1} \]
\[ \|\mu\|_{TV} := \sup_{\|u\|\le 1}\Big|\int_X u\,d\mu\Big| = |\mu|(X), \tag{7.2.3} \]
where |μ| = μ⁺ + μ⁻ denotes the total variation of μ, and μ⁺, μ⁻ stand for the positive and negative parts of μ, respectively. By analogy, the w-norm of μ is defined by
\[ \|\mu\|_w := \sup_{\|u\|_w\le 1}\Big|\int_X u\,d\mu\Big| = \int_X w\,d|\mu|. \tag{7.2.4} \]
The latter inequality can be used to show that the normed linear space M_w(X) of finite signed measures with a finite w-norm is a Banach space. Summarizing:
7.2.2 Proposition. M_w(X) is a Banach space and it is contained in M(X).
Remark. (a) In Chapter 6 we have already seen a particular class of
weighted norms; see Definitions 6.3.2 and 6.3.4.
(b) B_w(X) and M_w(X) are in fact ordered Banach spaces. Namely, as usual, for functions in B_w(X) "u ≤ v" means u(x) ≤ v(x) for all x in X, and for measures "μ ≤ ν" means μ(B) ≤ ν(B) for all B in B(X).
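On a finite state space these norms reduce to elementary computations. The following sketch (Python with NumPy; the weight vector and the data are hypothetical) evaluates the w-norm (7.2.1) of a function and the w-norm (7.2.4) of a signed measure:

```python
import numpy as np

# Finite state space {0, 1, 2} with an arbitrary weight function w >= 1.
w = np.array([1.0, 2.0, 4.0])

def w_norm_function(u, w):
    """||u||_w = sup_x w(x)^{-1} |u(x)|, cf. (7.2.1)."""
    return np.max(np.abs(u) / w)

def w_norm_measure(mu, w):
    """||mu||_w = integral of w d|mu|, cf. (7.2.4)."""
    return np.sum(w * np.abs(mu))

u = np.array([0.5, -3.0, 2.0])      # a function on X
mu = np.array([0.2, -0.5, 0.3])     # a finite signed measure on X
print(w_norm_function(u, w))
print(w_norm_measure(mu, w))
# With w = 1 the measure norm reduces to the total variation (7.2.3).
print(w_norm_measure(mu, np.ones(3)))   # = |mu|(X)
```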
B. Signed kernels
and
\[ \|Qu\|_w = \sup_x w(x)^{-1}\,|Qu(x)|, \]
which combined with (7.2.4) yields
\[ \|Q\|_w = \sup_x w(x)^{-1}\,\|Q(\cdot|x)\|_w = \sup_x w(x)^{-1}\int_X w(y)\,|Q|(dy|x). \tag{7.2.8} \]
On the other hand, replacing μ in (7.2.6) and (7.2.4) by the Dirac measure δ_x at the point x ∈ X [that is, δ_x(B) := 1 if x ∈ B and := 0 otherwise], we see that
\[ \delta_x Q(\cdot) = Q(\cdot|x) \qquad\text{and}\qquad \|\delta_x\|_w = w(x). \tag{7.2.9} \]
Then a direct calculation shows that (7.2.7) can also be written in the following equivalent form using measures μ in M_w(X):
Then
(7.2.12)
7.2 Weighted norms and signed kernels 5
and, therefore, (7.2.5) defines a linear map from B_w(X) into itself. Similarly, (7.2.6) defines a linear map from M_w(X) into itself, since
(7.2.13)
we define Qⁿ recursively as
Then, from Proposition 7.2.5 and standard results for bounded linear op-
erators, we obtain:
7.2.6 Proposition. If Q and R are signed kernels with finite w-norms,
then
(7.2.16)
In particular,
C. Contraction maps
where "I is the modulus of T, and Tn := T(Tn-l) for n = 1,2, ... , with
TO :=identity.
Proof. See Ross [1] or Luenberger [1], for instance. 0
As an example, the maps defined by (7.2.5) and (7.2.6) are both nonex-
pansive if IIQllw = 1, and they are contractions with modulus "I := IIQllw
if
IIQllw < 1. (7.2.18)
This follows from (7.2.12) and (7.2.13). For instance, viewing Q as a map
on lffiw(X), (7.2.12) gives
(7.2.19)
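For a concrete illustration of (7.2.8) and (7.2.18), the following sketch (hypothetical finite-state data) computes the w-norm of a kernel and checks the contraction estimate (7.2.19) numerically:

```python
import numpy as np

def kernel_w_norm(Q, w):
    """||Q||_w = sup_x w(x)^{-1} sum_y w(y)|Q(y|x)|, cf. (7.2.8)."""
    return np.max((np.abs(Q) @ w) / w)

P = np.array([[0.5, 0.5, 0.0],      # a stochastic kernel on {0,1,2}
              [0.2, 0.3, 0.5],
              [0.1, 0.4, 0.5]])
w = np.ones(3)
print(kernel_w_norm(P, w))          # = 1: P is nonexpansive on B_w(X)

Q = 0.9 * P                         # strictly substochastic: ||Q||_w < 1
u = np.array([1.0, -2.0, 0.5])
v = np.array([0.0, 1.0, -1.0])
lhs = np.max(np.abs(Q @ (u - v)) / w)           # ||Qu - Qv||_w
rhs = kernel_w_norm(Q, w) * np.max(np.abs(u - v) / w)
assert lhs <= rhs + 1e-12           # the contraction estimate (7.2.19)
```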
Notes on §7.2
1. Since the material for this chapter comes from several sources, to
avoid confusion with the related literature, it is important to keep in mind the
equivalence of the several forms (7.2.7), (7.2.8) and (7.2.10) for the w-norm
of Q. For instance, Kartashov [1]-[5] uses (7.2.10), while, say, Meyn and
Tweedie [1] use (7.2.7).
2. For Markov control processes with a nonnegative cost function c, in
Chapter 6 we used a weight function of the form w := 1 + c.
\[ \eta_B := \sum_{t=1}^{\infty} I_B(x_t), \]
7.3.4 Theorem.
(c) Suppose that P is Harris recurrent and let μ be as in (a). Then there exists a triplet (n, ν, l) consisting of an integer n ≥ 1, a p.m. ν, and a measurable function 0 ≤ l ≤ 1 such that:
We next introduce some concepts that will be used to give conditions for
irreducibility and recurrence of Markov chains.
Let B(X) be as in §7.2, the Banach space of bounded measurable functions on X endowed with the sup norm, and let C_b(X) be the subspace of continuous bounded functions.
continuous bounded functions. We shall use the notation Pu as in (7.2.5)
with Q = P.
7.3.5 Definition. The Markov chain {x_t} (or its stochastic kernel P) is said to satisfy the weak Feller property if P leaves C_b(X) invariant, i.e.,
\[ Pu \in C_b(X) \qquad \forall u \in C_b(X), \]
and the strong Feller property if P maps B(X) into C_b(X), i.e.,
\[ Pu \in C_b(X) \qquad \forall u \in B(X). \]
A more general concept than that of a small set is the following. A set C in B(X) is called petite if there is a sampling distribution q and a nontrivial measure μ′ such that
\[ \sum_{t=0}^{\infty} q(t)\,P^t(B|x) \ge \mu'(B) \qquad \forall x \in C,\ B \in B(X). \tag{7.3.5} \]
If the chain {x_t} is λ-irreducible, then a Borel set C is petite if and only if, for some n ≥ 1 and a nontrivial measure ν,
\[ \sum_{t=1}^{n} P^t(\cdot|x) \ge \nu(\cdot) \qquad \forall x \in C. \tag{7.3.6} \]
Further, for an aperiodic λ-irreducible chain, the class of petite sets is the same as the class of small sets; however, the measures μ and μ′ in (7.3.4) and (7.3.5) need not coincide.
Finally, we have the following important result, which will be used later on in conjunction with Theorem 7.3.7.
7.3.8 Theorem. If every compact set is petite, then {x_t} is a T-chain. Conversely, if {x_t} is a λ-irreducible T-chain, then every compact set is petite.
D. w-Geometric ergodicity
7.3.9 Definition. Let w ≥ 1 be a weight function such that ‖P‖_w < ∞. Then P (or the Markov chain {x_t}) is called w-geometrically ergodic if there is a p.m. μ in M_w(X) and nonnegative constants R and ρ, with ρ < 1, that satisfy
\[ \|P^t(\cdot|x) - \mu\|_w \le R\,\rho^t\,w(x) \qquad \forall x \in X,\ t = 0, 1, \dots. \tag{7.3.8} \]
Thus μ = μP.
We next present several results that guarantee (7.3.7).
7.3.10 Theorem. Let w ≥ 1 be a weight function. Suppose that the Markov chain is λ-irreducible and either
(i) there is a petite set C ∈ B(X) and constants β < 1 and b < ∞ such that
\[ Pw(x) \le \beta w(x) + b\,I_C(x) \qquad \forall x \in X; \tag{7.3.9a} \]
or
(ii) w is unbounded off petite sets (that is, the level set {x ∈ X : w(x) ≤ n} is petite for every n < ∞) and, for some β < 1 and b < ∞,
\[ Pw(x) \le \beta w(x) + b \qquad \forall x \in X. \tag{7.3.9b} \]
Then:
(a) The chain is positive Harris recurrent, and
(b) μ(w) < ∞, where μ denotes the unique i.p.m. for P.
If, in addition, the chain is aperiodic, then
(c) The chain is w-geometrically ergodic, with limiting p.m. μ as in (b).
Proof. See Meyn and Tweedie [1], §16.1. [Lemma 15.2.8 of the latter reference shows the equivalence of the conditions (i) and (ii).] □
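The drift inequality (7.3.9a) is easy to test numerically for a finite chain. A minimal sketch, assuming a reflected random walk with downward drift and the candidate petite set C = {0} (all data hypothetical):

```python
import numpy as np

N = 50                               # truncated state space {0,...,N}
P = np.zeros((N + 1, N + 1))
P[0, 0], P[0, 1] = 0.7, 0.3
for i in range(1, N):
    P[i, i - 1], P[i, i + 1] = 0.7, 0.3
P[N, N - 1], P[N, N] = 0.7, 0.3

delta = 0.5
w = np.exp(delta * np.arange(N + 1))     # weight w(i) = exp(delta * i)
Pw = P @ w

C = np.arange(N + 1) == 0                # candidate petite set C = {0}
beta = np.max(Pw[~C] / w[~C])            # smallest beta that works off C
b = np.max(Pw[C] - beta * w[C])          # then b := max_C (Pw - beta w)
print(beta, b)                           # here beta < 1
assert np.all(Pw <= beta * w + b * C + 1e-12)   # (7.3.9a) holds
```

Since the level sets of w here are finite (hence petite for an irreducible aperiodic chain), Theorem 7.3.10 then gives w-geometric ergodicity.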
7.3.11 Theorem. Let w ≥ 1 be a weight function. Suppose that the Markov chain has a unique i.p.m. μ and, further, there exists a p.m. ν, a measurable function 0 ≤ l ≤ 1, and a number 0 < β < 1 such that
\[ (Pw)/w \le \beta + b; \]
hence, by (7.2.8), ‖P‖_w ≤ β + b.
7.3.13 Remark. (See Meyn and Tweedie [1], Theorem 16.0.1.) If the Markov chain is λ-irreducible and aperiodic, then the following conditions are equivalent for any weight function w ≥ 1:
(b) The inequality (7.3.9a) holds for some function w₀ which is equivalent to w in the sense that c⁻¹w ≤ w₀ ≤ cw for some constant c ≥ 1.
On the other hand, Kartashov [2], [5, p. 20] studies (7.3.7) using the following equivalent conditions (c) and (d):
(c) There is an integer n ≥ 1 and a number 0 < ρ < 1 such that
\[ \|P^n(\cdot|x) - P^n(\cdot|x')\|_w \le \rho\,\|\delta_x - \delta_{x'}\|_w \tag{7.3.11} \]
for all x, x′ in X.
He shows that if P has a finite w-norm, then (c) [or (d)] is equivalent to (a).
Statements (c) and (d) are nice because they are easy to interpret. For instance [by Definition 7.2.7(b)], (7.3.11) means that Pⁿ is a contraction map (or, equivalently, P is an n-contraction) on the subspace M_w⁰(X) of M_w(X). Furthermore, (c) or (d) can be directly "translated" to obtain
(iii) P has a unique i.p.m. μ with a finite w-norm ‖μ‖_w ≤ b/(1 − ρ), and, moreover, (7.3.7) holds with R := 1 + ‖μ‖_w ≤ 1 + b/(1 − ρ).
Proof. (i) Let x* and w* be as in (a). Then, by (b1),
\[ \|P^m\|_w \le R' \quad \forall m = 1, 2, \dots, \qquad\text{with } R' := 1 + b/(1 - \rho). \tag{7.3.14} \]
(iii) To prove part (iii), we first show that, for every x ∈ X, the sequence {Pᵗ(·|x)} is a Cauchy sequence in (the Banach space) M_w(X). To prove this, fix an arbitrary state x and note that, for any given integer m ≥ 1, the finite signed measure
(7.3.15)
which, as m was arbitrary, clearly shows that {Pᵗ(·|x)} is a Cauchy sequence in M_w(X).
Therefore, as M_w(X) is a Banach space, Pᵗ(·|x) converges in the w-norm to some measure μ_x in M_w(X). In fact, μ_x ≡ μ is independent of x, because if we apply (7.3.15) to the signed measure σ(·) := δ_x(·) − δ_{x′}(·), we obtain
Finally, to prove the last statement in (iii), we can use the invariance of μ to write μ = μPⁿ for all n = 0, 1, ..., so that
in lieu of Pᵗ. Kartashov does not use any specific term for (7.3.7).
where r⁺ := max(r, 0). We wish to verify that {x_t} satisfies the hypotheses of Theorem 7.3.11, so that it is w-geometrically ergodic for some weight function w. We shall suppose that E|z₀| < ∞ and
Under the condition E|z₀| < ∞, it is well known that (7.4.5) implies that {x_t} is positive recurrent (see Meyn and Tweedie [1] or Nummelin [1]), and so {x_t} has a unique i.p.m., which we shall denote by μ. Let us also suppose that the moment generating function of z₀, namely,
is finite for all s in some interval [0, s̄], with s̄ > 0. Then, as m(0) = 1 and m′(0) = E(z₀) < 0, there is a number s such that
Equivalently, if 1₀ stands for the indicator function of the set {0}, we can write x_{t+1} as
On the other hand, to ensure that the conditions (i)–(iii) in Theorem 7.3.11 hold, we shall suppose that z₀ has a finite moment generating function m(s) in some interval [0, s̄], that is,
(b) m(s) := E[exp(s z₀)] < ∞ ∀s ∈ [0, s̄], for some s̄ > 0.
Now fix a number δ in (0, s̄), and define the number β := exp(−δ) and the functions
\[ l(i) := 1_0(i) \qquad\text{and}\qquad w(i) := \exp(\delta i), \quad i \ge 0, \]
where 1₀ is the indicator function of {0}. Then, taking the p.m. ν as the arrival distribution, that is, ν(i) = q(i) for all i, easy calculations show that the hypotheses (i)–(iii) of Theorem 7.3.11 are satisfied; hence, {x_t} is w-geometrically ergodic. □
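The computation behind this example can be mimicked numerically. A sketch with a hypothetical arrival law q of mean less than 1 (so that E z₀ < 0), checking that m(δ) < 1 and that the exponential weight satisfies the required drift inequality away from 0:

```python
import numpy as np

q = np.array([0.5, 0.3, 0.2])      # hypothetical arrival law, mean 0.7 < 1
N = 200                             # truncation level for the sketch
P = np.zeros((N + 1, N + 1))        # queue recursion x' = (x + A - 1)^+
for i in range(N + 1):
    for k, qk in enumerate(q):
        j = min(max(i + k - 1, 0), N)
        P[i, j] += qk

delta = 0.4                         # a number delta in (0, s_bar)
m = np.sum(q * np.exp(delta * (np.arange(3) - 1.0)))
print("m(delta) =", m)              # < 1, as m(0) = 1 and m'(0) < 0

w = np.exp(delta * np.arange(N + 1))
ratio = (P @ w) / w
print(ratio[1:-2].max())            # = m(delta) < 1 for all states i >= 1
```

Together with the minorization-type hypotheses of Theorem 7.3.11, which are verified in the text, this drift estimate yields the w-geometric ergodicity of the chain.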
7.4.4 Example: Linear systems. In this example the state and disturbance spaces are X = Z = ℝ^d and (7.4.1) is replaced by
(7.4.10)
\[ F'LF = L - I. \]
The existence of such a matrix L is ensured by the hypothesis (a); see, for instance, Chen [1], Theorem 8-22. We shall omit the calculations showing the w-geometric ergodicity and refer to the proofs of Proposition 12.5.1 and Theorem 17.6.2 of Meyn and Tweedie [1] for further details. Moreover, it is worth noting that these results are valid when (7.4.10) is replaced by the more general model
(7.4.12)
Then
\[ Pw(x) \le \beta w(x) + b\,I_C(x) \qquad \forall x \in X, \tag{7.4.13} \]
where
is w-geometrically ergodic.
7.4.6 Example: Additive-noise nonlinear systems. Let {x_t} be the Markov chain given by (7.4.15), with values in X = ℝ^d and disturbances satisfying Assumption 7.4.1. In addition, suppose that
(a) F : ℝ^d → ℝ^d is a continuous function;
(b) The disturbance distribution G is absolutely continuous with respect to Lebesgue measure λ(dz) = dz, and its density g [i.e., G(dz) = g(z)dz] is positive λ-a.e. and has a finite mean value;
(c) There exist positive constants β, γ and M such that β < 1, β + γ ≥ 1, and
\[ E|F(x) + z_0| \le \beta|x| - \gamma \qquad \forall\,|x| > M. \tag{7.4.16} \]
Then {x_t} is w-geometrically ergodic, where w(x) := |x| + 1.
Indeed, the conditions (a) and (c) yield that the hypotheses of Propo-
sition 7.4.5 are satisfied, and, therefore, we obtain the inequality (7.4.13),
which is the same as (7.3.9a). Hence, to obtain the desired conclusion it
suffices to verify that the chain satisfies the other hypotheses of Theorem
7.3.10. To do this, first note that (7.4.15) and (7.4.2) yield
(H) The zero vector 0 is in X and the functions F_i satisfy the Lipschitz conditions
By Proposition 12.8.1 of Lasota and Mackey [1], (H) implies that the IFS is weakly asymptotically stable, which means that the transition probability function P has a unique i.p.m. μ and, in addition, (νPᵗ)(u) → μ(u) for any initial distribution ν and any continuous bounded function u on X. (In other words, for any p.m. ν, νPᵗ converges weakly to μ.) On the other hand, the uniqueness of μ and the assumption that q_i > 0 yield that the IFS {x_t} is μ-irreducible and aperiodic. Hence, by Theorem 7.3.10, to prove that {x_t} is w-geometrically ergodic it suffices to show that the inequality (7.3.9b) holds. To see this observe that, by (H),
\[ Pu(x) := \int_X u(y)\,P(dy|x), \qquad x \in X, \]
\[ \bar q := \sum_{i=0}^{\infty} i\,q(i) < \infty. \]
Let 𝕏 = B(X) be the Banach space of bounded functions on X with the sup norm, and consider the P.E. (7.5.1) with charge c in B(X) given by
\[ E[g(x_{t+1})\mid x_t] = \int g(y)\,P(dy|x_t) = Pg(x_t) = g(x_t) \quad\text{by (7.5.1)(a)}. \]
E_x(M_n),
Canonical pairs, the sequence {M_n} in (7.5.6), and the P.E. (7.5.1) are related as follows.
7.5.5 Theorem. The following conditions are equivalent:
(a) (g, h) is a solution to the P.E. with charge c.
(b) (g, h) is a c-canonical pair.
(c) {M_n} is a martingale and g is invariant.
Proof. (a) ⇔ (b). The implication (a) ⇒ (b) follows from (7.5.4) and (7.5.3). Conversely, (7.5.1)(b) follows from (7.5.7) with n = 1. To obtain the invariance equation (7.5.1)(a), apply P to both sides of (7.5.1)(b) to get
\[ P^2 h = Pg + Ph - Pc, \]
and, on the other hand, observe that for n = 2 (7.5.7) becomes
\[ P^2 h = 2g + h - c - Pc. \]
(a) ⇔ (c). If (a) holds, then the invariance condition on g is obvious, whereas, by its very definition, M_n is measurable with respect to σ{x₀, ..., x_n} for every n, and E_x|M_n| ≤ n(‖c‖ + ‖g‖) + ‖h‖ < ∞ for each x ∈ X. Moreover, by the Markov property,
i.e., {M_n} is a martingale. The converse follows from (7.5.5) [or (7.5.4)] with n = 1, and the invariance of g. □
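For a finite state space, the equivalence (a) ⇔ (b) can be verified directly: solving the P.E. amounts to solving a linear system, and the martingale property reduces to the pointwise identity g + h = c + Ph. A minimal numerical sketch (the chain and the charge are hypothetical, and M_n is taken in the standard form h(x_n) + Σ_{t<n}[c(x_t) − g(x_t)], which is an assumption, since (7.5.6) is not reproduced above):

```python
import numpy as np

P = np.array([[0.1, 0.9, 0.0],      # an ergodic chain on {0,1,2}
              [0.4, 0.2, 0.4],
              [0.3, 0.3, 0.4]])
c = np.array([1.0, -2.0, 0.5])      # the charge

# Unique invariant p.m. mu from mu P = mu.
vals, vecs = np.linalg.eig(P.T)
mu = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
mu = mu / mu.sum()

g = np.full(3, mu @ c)              # constant function g = mu(c)
# Solve (I - P) h = c - g, fixing the additive constant by h(0) = 0.
A = np.vstack([np.eye(3) - P, [1.0, 0.0, 0.0]])
h = np.linalg.lstsq(A, np.append(c - g, 0.0), rcond=None)[0]

assert np.allclose(P @ g, g)            # g is invariant
assert np.allclose(g + h, c + P @ h)    # (7.5.1)(b): the P.E. holds
# Hence E[M_{n+1} | x_0,...,x_n] = Ph(x_n) + (c - g)(x_n) + ... = M_n,
# i.e., {M_n} is a martingale, as in Theorem 7.5.5(c).
```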
cf. (7.2.7) or (7.2.10).] The condition (7.5.8) holds, for instance, if ‖P‖ ≤ 1, in which case P is nonexpansive (Definition 7.2.7) with respect to the norm ‖·‖. For such a P, we have ‖Pⁿ‖ ≤ ‖P‖ⁿ ≤ 1 [cf. (7.2.16)] and (7.5.8) follows. On the other hand, note that obviously a power-bounded operator is bounded, i.e., ‖P‖ ≤ K for some constant K. An important consequence of power-boundedness is given in part (c) of the following corollary.
7.5.6 Corollary. Let (g, h) be a solution of the P.E. with charge c. Then:
(a) \( g = \lim_{n\to\infty} \frac{1}{n}\sum_{t=0}^{n-1} P^t g \) pointwise and in norm (that is, in the norm ‖·‖ of 𝕏).
(b) If
\[ P^n h/n \to 0 \quad\text{pointwise or in norm}, \tag{7.5.9} \]
then
\[ g = \lim_{n\to\infty} \frac{1}{n}\sum_{t=0}^{n-1} P^t g = \lim_{n\to\infty} \frac{1}{n}\sum_{t=0}^{n-1} P^t c \tag{7.5.10} \]
pointwise or in norm, respectively.
(c) If, further, P is power-bounded [with a constant K as in (7.5.8)], then, for all n ≥ 1,
\[ \Big\|\sum_{t=0}^{n-1} P^t(c - g)\Big\| = \Big\|\sum_{t=0}^{n-1} P^t c - ng\Big\| \le (1 + K)\|h\|; \]
hence
\[ \Big\|\frac{1}{n}\sum_{t=0}^{n-1} P^t c - g\Big\| \le (1 + K)\|h\|/n. \]
(d) Uniqueness of solutions: Let (g₁, h₁) and (g₂, h₂) be two solutions of the P.E. with charge c such that h₁ and h₂ satisfy (7.5.9). Then g₁ = g₂ and
\[ h_1 - h_2 = \lim_{n\to\infty} \frac{1}{n}\sum_{t=0}^{n-1} P^t(h_1 - h_2) \quad\text{pointwise and in norm}. \tag{7.5.11} \]
Proof. Part (a) follows from (7.5.3). Moreover, from (7.5.3) and (7.5.7) we get
\[ \sum_{t=0}^{n-1} P^t(c - g) = h - P^n h, \tag{7.5.12} \]
which, using part (a), yields (b). In fact, (7.5.12) also gives (c) because
\[ \Big\|\sum_{t=0}^{n-1} P^t(c - g)\Big\| \le (1 + K)\|h\|, \]
with K as in (7.5.8).
(d) The equality g₁ = g₂ results from (b), since
\[ g_1 = \lim_{n\to\infty} \frac{1}{n}\sum_{t=0}^{n-1} P^t c = g_2. \]
Let g := g₁ = g₂. Then, writing (7.5.1)(b) for (g, h₁) and for (g, h₂) and subtracting, we see that u := h₁ − h₂ is invariant, i.e., u = Pu. Therefore, (7.5.11) follows from the same argument used in (a). □
The following theorem gives another characterization of a solution to the P.E.
7.5.7 Theorem. Let c, g and h be functions in (𝕏, ‖·‖) and suppose that
(a) P is bounded (i.e., ‖P‖ ≤ K for some constant K), and
(b) (7.5.9) holds in norm for every u in 𝕏, i.e., ‖Pⁿu‖/n → 0.
Then the following two assertions are equivalent:
(i) (g, h) is the unique solution of the P.E. with charge c for which
\[ \lim_{n\to\infty} \frac{1}{n}\sum_{t=0}^{n-1} P^t h = 0 \quad\text{in norm}. \tag{7.5.13} \]
(ii) g satisfies (7.5.10) in norm and
\[ h = \lim_{N\to\infty} \frac{1}{N}\sum_{n=1}^{N}\sum_{t=0}^{n-1} P^t(c - g) \quad\text{in norm}. \tag{7.5.14} \]
Proof. (i) ⇒ (ii). Suppose that (i) holds. Then, by the hypothesis (b), h satisfies (7.5.9) and so the requirement on g follows from Corollary 7.5.6(b). On the other hand, by (7.5.7) and (7.5.3),
\[ Nh = \sum_{n=1}^{N}\sum_{t=0}^{n-1} P^t(c - g) + \sum_{n=1}^{N} P^n h \qquad \forall N = 1, 2, \dots. \tag{7.5.15} \]
so that
\[ (I - P)\sum_{t=0}^{n-1} P^t(c - g) = (I - P^n)(c - g). \tag{7.5.18} \]
Therefore, applying I − P to (7.5.14) and using (7.5.16) and (7.5.18) we get
\[ (I - P)h = \lim_{N\to\infty} \frac{1}{N}\sum_{n=1}^{N} (I - P)\sum_{t=0}^{n-1} P^t(c - g) = (c - g) - \lim_{N\to\infty} \frac{1}{N}\sum_{n=1}^{N} P^n(c - g) = c - g \quad\text{by (7.5.17)}; \]
that is, (7.5.1)(b) holds. Hence, (g, h) is a solution to the P.E. with charge c, and, by (7.5.14) and (7.5.15), the function h satisfies (7.5.13). The latter condition, (7.5.13), and Corollary 7.5.6(d) give the uniqueness of (g, h). □
7.5.8 Remark: Finite X. If the state space X is a finite set, in which case the stochastic kernel P is a square matrix, it is well known that the limiting matrix
\[ \bar P := \lim_{n\to\infty}\frac{1}{n}\sum_{t=0}^{n-1}P^t \quad\text{(componentwise)} \tag{7.5.19} \]
exists. Moreover,
\[ H = \lim_{N\to\infty}\frac{1}{N}\sum_{n=1}^{N}\sum_{t=0}^{n-1}(P^t - \bar P) \tag{7.5.21} \]
yields a solution to the P.E. for P with charge c [namely, (g, h) = (P̄c, Hc)], and it is precisely of the form given by Theorem 7.5.7(ii). In fact, in a suitable theoretical setting, all of the expressions (7.5.19)–(7.5.23) have a well-defined meaning in a much more general context than the finite-state case. (See Hernández-Lerma and Lasserre [6] for details.) □
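The finite-state objects (7.5.19)–(7.5.21) can be computed directly. A sketch with hypothetical data, using the exact limiting matrix (rows equal to the invariant p.m.) and the fundamental kernel Z = (I − P + P̄)⁻¹ that appears in Remark 7.5.11(b) below:

```python
import numpy as np

P = np.array([[0.2, 0.8, 0.0],
              [0.5, 0.0, 0.5],
              [0.2, 0.3, 0.5]])
c = np.array([3.0, -1.0, 2.0])
I = np.eye(3)

# For this ergodic chain, Pbar has identical rows equal to the unique
# invariant p.m. mu, obtained from mu P = mu.
vals, vecs = np.linalg.eig(P.T)
mu = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
mu = mu / mu.sum()
Pbar = np.tile(mu, (3, 1))

# Numerical sanity check of the Cesaro limit (7.5.19).
S, Pt = np.zeros((3, 3)), I.copy()
for _ in range(5000):
    S, Pt = S + Pt, Pt @ P
print(np.max(np.abs(S / 5000 - Pbar)))      # small

Z = np.linalg.inv(I - P + Pbar)             # fundamental kernel
H = Z - Pbar                                # deviation matrix, cf. (7.5.21)
g, h = Pbar @ c, H @ c
assert np.allclose(g + h, c + P @ h)        # (g, h) solves the P.E.
assert np.allclose(mu @ h, 0.0)             # normalization mu(h) = 0
```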
Suppose that P has an i.p.m. μ. Then, by the Individual Ergodic Theorem (Yosida [1]), for any given function u in L₁(μ) ≡ L₁(X, B(X), μ) there is a function u* in L₁(μ) such that
\[ \text{(a)}\ u^* = \lim_{n\to\infty}\frac{1}{n}\sum_{t=0}^{n-1}P^t u \quad \mu\text{-a.e.}, \qquad\text{and}\qquad \text{(b)}\ \int u^*\,d\mu = \int u\,d\mu. \tag{7.5.24} \]
On the other hand, the Mean Ergodic Theorem ensures that the convergence in (a) holds in L₁(μ); that is, for every u in L₁(μ),
\[ \text{(a)}\ u^* = \lim_{n\to\infty}\frac{1}{n}\sum_{t=0}^{n-1}P^t u \quad\text{in } L_1(\mu), \qquad\text{and}\qquad \text{(b)}\ Pu^* = u^*. \tag{7.5.25} \]
Further, if μ is the unique i.p.m. for P, then u* is a constant μ-a.e. and, in fact, by (7.5.24)(b),
\[ u^* = \int u\,d\mu \quad \mu\text{-a.e.} \tag{7.5.26} \]
Consider now the unichain P.E., so that P has a unique i.p.m. μ, and let (g, h) be a solution to the P.E. with charge c. Moreover, assume that:
\[ c \text{ is in } L_1(\mu), \text{ and } h \text{ satisfies (7.5.9)}. \tag{7.5.27} \]
Then, by (7.5.26) and Corollary 7.5.6(b), we see that g = c*, that is,
\[ g = \mu(c) := \int c\,d\mu \quad \mu\text{-a.e.} \tag{7.5.28} \]
Proof. (a) By definition (7.2.1) of the w-norm, for any function u in B_w(X) we have
\[ \int |u|\,d\mu \le \|u\|_w \int w\,d\mu = \|u\|_w\,\|\mu\|_w < \infty, \]
where the last equality is due to (7.2.4).
(b) This follows from (7.3.8) and the inequality
(c) This fact was already proved in the paragraph after Definition 7.3.9.
(d) If u = Pu, then u = Pᵗu for all t = 0, 1, .... Thus, the desired conclusion follows from (7.3.8).
(e) The statement (ii) follows from (d) and Corollary 7.5.6(d). Similarly, statement (i) follows from Theorem 7.5.7 if we can show that the functions in (7.5.14) and (7.5.30) coincide. In turn, to prove the latter, first note that if a sequence {S_n} converges in (some) norm to S, then the sequence of Cesàro sums
\[ \frac{1}{N}\sum_{n=1}^{N} S_n \]
converges to S as well. Now use (7.3.8) to show that {S_n} is a Cauchy sequence in B_w(X) and, therefore, it converges to a function, say, S := h in B_w(X). Finally, observe that (7.5.14), with g = μ(c) and the w-norm, is precisely the limit of the Cesàro sums of {S_n}, so that the function in (7.5.14) coincides with S = h. □
7.5.11 Remark. In the context of Theorem 7.5.10 we have the following:
(a) By part (ii) of Theorem 7.5.10(e), for any two solutions (g, h₁), (g, h₂) of the strictly unichain P.E. we have h₁ = h₂ + k, where k is the constant μ(h₁ − h₂). Thus, if we wish to guarantee that the P.E. has a unique solution, it suffices to have k = 0, which is precisely the role of condition (7.5.29). In general, to "fix" a unique h we only need to take any solution (g, h) of the P.E. and replace h by ĥ := h − μ(h). This is tacitly what we did in Theorems 7.5.10(e) and 7.5.7. Indeed, if we look at (7.5.15), we see that the "full form" of h is
\[ h = \lim_{N\to\infty}\frac{1}{N}\sum_{n=1}^{N}\sum_{t=0}^{n-1}P^t(c - g) + \lim_{N\to\infty}\frac{1}{N}\sum_{n=1}^{N}P^n h. \]
There are other ways one can "fix" a unique h. For instance, let x be an arbitrary fixed point in X and replace (7.5.29) by the condition h(x) = 0. Then, in the above notation, we again get μ(h₁ − h₂) = 0.
(b) We can replace the "convergence estimate" in Corollary 7.5.6(c) by the following estimate, which is obtained from (7.3.8): For all n ≥ 1,
\[ \Big\|\sum_{t=0}^{n-1}P^t(c - \mu(c))\Big\|_w \le \|c\|_w\,R\,(1-\rho^n)/(1-\rho) \le \|c\|_w\,R/(1-\rho). \tag{7.5.32} \]
Observe that (7.5.32) holds for any function c in B_w(X). On the other hand, again from (7.3.8) [or (7.3.7)], one can see that the function h in (7.5.4) can be written in the form (7.5.22)–(7.5.23), where the "limiting kernel (matrix)" P̄(B|x) is the limiting p.m. μ, that is, P̄(·|x) = μ(·) for all x ∈ X. In this case, the "fundamental kernel" Z = (I − P + P̄)⁻¹ is given by
Then u(x) = μ(u) = inf_x u(x) μ-a.e. in case (a), and u(x) = μ(u) = sup_x u(x) μ-a.e. in case (b). Hence, in either case, u(x) = μ(u) μ-a.e.
Proof. (a) Suppose that u ≥ Pu. This inequality yields u ≥ Pᵗu for all t ≥ 0, so that, by (7.3.8), u ≥ Pᵗu → μ(u). Thus u(x) ≥ μ(u) for all x ∈ X.
Hence, letting u_i := inf_x u(x), we obtain u_i ≥ ∫u dμ ≥ u_i, that is, ∫u dμ = u_i. Finally, since ∫(u − u_i) dμ = 0 and u ≥ u_i, we conclude that u = u_i = μ(u) μ-a.e. The proof in case (b) is similar [or apply (a) to −u]. □
C. Examples
some iterative procedure. For instance, consider the Markov chain in Examples 7.4.3 and 7.5.2. In the latter example we mentioned that a solution to the P.E. with charge c(0) := −q̄ and c(i) := 1 for all i ≥ 1 is given by the pair (g, h) with
where
\[ \bar q := \sum_{i=1}^{\infty} i\,q(i) < \infty. \tag{7.5.34} \]
The question is, how did we get the solution (7.5.33)? To see this, recall that it is shown in Example 7.4.3 that, under (7.5.34), the chain has the i.p.m.
\[ \mu(i) = \mu(0)\sum_{j=i}^{\infty} q(j), \qquad i \ge 0. \]
Therefore, the constant g(·) ≡ μ(c) is 0, since (7.5.34) and (7.5.35) yield
\[ \mu(c) = \sum_{i=0}^{\infty} c(i)\mu(i) = c(0)\mu(0) + \mu(0)\sum_{i=1}^{\infty}\sum_{j=i}^{\infty} q(j) = \mu(0)(-\bar q + \bar q) = 0. \]
Hence, to solve the P.E. (7.5.1) it suffices to consider (7.5.1)(b), which becomes
\[ h(i) = c(i) + Ph(i) \quad\text{for all } i = 0, 1, \dots. \tag{7.5.36} \]
To compute
\[ Ph(i) = \sum_{j=0}^{\infty} h(j)\,p(j|i), \]
recall from Example 7.4.3 that the transition probabilities p(j|i) are given by
\[ p(i-1|i) = 1 \quad \forall i \ge 1, \qquad\text{and}\qquad p(j|0) = q(j) \quad \forall j \ge 0. \]
Consequently,
\[ Ph(i) = h(i-1) \quad\text{for } i \ge 1, \qquad\text{and}\qquad Ph(0) = \sum_{j=0}^{\infty} h(j)\,q(j). \]
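The resulting recursion is easy to check numerically. A sketch with a hypothetical arrival law q (finite support, so the truncated model is exact), confirming that g = 0 and h(i) = i solve the P.E. for this chain:

```python
import numpy as np

q = np.array([0.4, 0.3, 0.2, 0.1])       # hypothetical arrival law
qbar = np.sum(np.arange(4) * q)          # qbar = sum_i i q(i)

N = 6                                    # states {0,...,N-1}
P = np.zeros((N, N))
P[0, :4] = q                             # p(j|0) = q(j)
for i in range(1, N):
    P[i, i - 1] = 1.0                    # p(i-1|i) = 1

c = np.ones(N)
c[0] = -qbar                             # the charge of Example 7.5.2
h = np.arange(N, dtype=float)            # candidate solution h(i) = i
g = 0.0                                  # g = mu(c) = 0

print(g + h - (c + P @ h))               # (7.5.36): residuals are all 0
```

Any additive constant may be added to h; cf. the nonuniqueness discussed in Remark 7.5.11(a).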
We assume, of course, that F ≠ 0. In addition, we will assume that the i.i.d. disturbances z_t have zero mean (which greatly simplifies the calculations) and finite second moment:
for some constant γ, and we take the weight function w := w₂ in (7.4.11).
Step 1. By (7.3.8), to "estimate" μ(c) it suffices to compute
i.e.,
\[ E_x c(x_t) = F^{2t} c(x) + \gamma\sigma^2\,(1 - F^{2t})/(1 - F^2), \qquad t \ge 1. \tag{7.5.44} \]
and consider the charge c(x) = x for all x E X. To solve the strictly
unichain P.E. with charge c we may proceed exactly as in Example 7.5.13.
In fact,
8 Discounted Dynamic Programming

8.1 Introduction
In this chapter we consider the infinite-horizon discounted cost problem for a Markov control model (X, A, {A(x) | x ∈ X}, Q, c). We already studied this problem using dynamic programming and linear programming in Chapters 4 and 6, respectively. Here we again use dynamic programming, so it is important to state at the outset the differences between this chapter and Chapter 4.
The main difference lies in the assumptions. In Chapter 4 we considered nonnegative cost-per-stage functions c, virtually without restriction on their "growth rate", and we allowed noncompact control-constraint sets A(x). In contrast, the hypotheses in this chapter (see Assumptions 8.3.1, 8.3.2 and 8.3.3, or 8.5.1, 8.5.2 and 8.5.3) require the sets A(x) to be compact, but the cost functions c are allowed to take positive and negative values, provided that they satisfy a certain growth condition (Assumption 8.3.2). The corresponding dynamic programming theorems (Theorems 4.2.3 and 8.3.6) turn out to be very similar, except that in the present context the dynamic programming operator is a contraction with respect to a weighted w-norm (Proposition 8.3.9), which yields the w-geometric convergence of the value iteration (VI) algorithm [see (8.3.15)]. The latter fact, the w-geometric convergence of the VI functions, gives many interesting results, such as the evaluation of rolling horizon procedures, criteria for the elimination of nonoptimal control actions, and the existence and detection of forecast horizons, which are practically impossible to get in a context as
with (x_i, a_i) ∈ 𝕂 for i = 0, ..., t−1, and x_t ∈ X. A (randomized) control policy is a sequence π = {π_t} of stochastic kernels π_t on the control set A given H_t that satisfy the constraint
(8.2.3)
The set of all control policies is denoted by Π. Moreover, a control policy π = {π_t} is said to be a:
(a) randomized Markov policy if there is a sequence {φ_t} of stochastic kernels φ_t ∈ Φ such that
(8.2.4)
which reduces to
\[ G(x, f) := G(x, f(x)) \quad\text{for } f \in \mathbb{F}. \]
In particular, for the cost function c and the transition law Q we write
if φ is in Φ, and
\[ c(x, f) := c(x, f(x)), \qquad Q(\cdot|x, f) := Q(\cdot|x, f(x)) \quad\text{if } f \in \mathbb{F}. \tag{8.2.6} \]
(c) Let π = {π_t} be an arbitrary control policy, and ν an arbitrary "initial distribution".
and similarly for Qⁿ(·|·, f). [For a proof of (8.2.10) see Proposition 2.3.5.] □
A. Assumptions
Indeed, it is obvious that (c) implies (c′). Conversely, suppose that (c′) holds and let u be an arbitrary function in B(X). Then u + ‖u‖ is nonnegative and, therefore, by (c′), the function
\[ \int C(y)\,Q(dy|x,a) = \sum_{t=0}^{\infty}\alpha_0^t \int c_t(y)\,Q(dy|x,a) \le \sum_{t=0}^{\infty}\alpha_0^t\,c_{t+1}(x) = \alpha_0^{-1}\,[C(x) - c_0(x)], \]
i.e.,
\[ \int C(y)\,Q(dy|x,a) \le \alpha_0^{-1}\,C(x). \]
This inequality, combined with the conditions (i), (ii), yields the desired conclusion. □
8.3.5 Remark. (a) In several places we shall consider inequalities of the form (7.3.9) or (7.3.10), so it is important to note that many results in this chapter remain valid if part (b) in Assumption 8.3.2 is replaced by an inequality of the form (8.3.11) below, where γ > 0 is not necessarily ≥ 1, in contrast to (8.3.5) where β ≥ 1. More precisely, we have:
Assumption 8.3.2 is satisfied if there exists a real-valued measurable function w′ ≥ 1 on X and positive constants m, γ and b such that αγ < 1 and, for every state x ∈ X,
Indeed, let C(x) and c_t(x) be as in (8.3.6) and (8.3.7), respectively, except that c₀ is redefined as
where the sup is over all feasible state-action pairs (x, a) in 𝕂. Hence, under (8.3.5), we have ‖Q‖_w ≤ β < ∞, and so Proposition 7.2.5 is applicable in the present "controlled" context. □
To state our main result in this section, we first recall from §4.2 the definition of the α-value iteration (or α-VI) functions
for all n ≥ 1 and x ∈ X, with v₀(·) ≡ 0. For every n = 1, 2, ..., v_n is the optimal n-stage cost [see (3.4.11)], i.e.,
(8.3.13)
where
The following theorem states, among other things, that the sequence {v_n} converges geometrically in the w-norm to V* [see (8.3.2)].
8.3.6 Theorem. Suppose that Assumptions 8.3.1, 8.3.2 and 8.3.3 hold. Let β be the constant in (8.3.5), and define γ := αβ. Then:
(b) There exists a selector f* ∈ 𝔽 such that f*(x) ∈ A(x) attains the minimum in (8.3.4) for every state x, that is [using the notation in (8.2.6)],
(c) A policy π* is α-discount optimal if and only if the corresponding cost function V(π*, ·) satisfies the α-DCOE.
Then, for any state x ∈ X and any sequence {a_n} in A(x) such that a_n → a in A(x), we have
and
Proof. (a) Let u be a function in B_w(X), so that |u(x)| ≤ m w(x) for all x ∈ X, where m := ‖u‖_w. Then u_m := u + mw is a nonnegative function in B_w(X), and so it is the limit of a nondecreasing sequence of measurable bounded functions u^k ∈ B(X). Now fix x ∈ X and let {a_n} be a sequence in A(x) converging to a ∈ A(x). Then, as u^k ↑ u_m, Assumption 8.3.1(c) yields, for every k,
\[ \int u^k(y)\,Q(dy|x,a), \]
and, therefore, ∫ u_m(y)Q(dy|x, ·) is l.s.c. on A(x), which implies that u′(x, ·) is l.s.c. on A(x). In other words, u′(x, ·) is l.s.c. on A(x) for every function u in B_w(X). Hence, if we now apply the latter fact to −u in lieu of u, we see that u′(x, ·) is also u.s.c. Thus u′(x, ·) is continuous on A(x).
(b) Write u_f as u_f(x) := lim_{k→∞} U_k(x), where
and, as U_k is in B_w(X) (in fact, ‖U_k‖_w ≤ K for all k), part (a) yields
Thus, letting k → ∞, we obtain (8.3.18) from (8.3.21) and monotone convergence. The proof of (8.3.19) is similar and is left to the reader. □
We will also need the following lemma, whose parts (a) and (b) are the same as the "measurable selection theorem" in Proposition D.5 (Appendix D). Recall that the set-valued mapping (or multifunction) x ↦ A(x) from X to A is said to be upper semicontinuous (u.s.c.) if {x ∈ X | A(x) ∩ F ≠ ∅} is a closed subset of X for every closed set F ⊂ A. [Equivalently, x ↦ A(x) is u.s.c. if {x ∈ X | A(x) ⊂ G} is an open set in X for every open set G ⊂ A.]
8.3.8 Lemma. Let 𝕂 and A(x) be as in (8.2.1) and Assumption 8.3.1(a), respectively, and let v : 𝕂 → ℝ be a given measurable function. Define
(a) If v(x, ·) is l.s.c. on A(x) for every x ∈ X, then there exists a selector f ∈ 𝔽 such that f(x) ∈ A(x) attains the minimum in (8.3.22a) for all x ∈ X, that is [using the notation in Remark 8.2.3(b)],
that is, v* is a l.s.c. function in the space B_w(X) and its w-norm satisfies ‖v*‖_w ≤ k.
Proof. For the proof of parts (a) and (b) see Rieder [1] or Schäl [1]. To prove (c), apply (b) to the nonnegative l.s.c. function
\[ u(x,a) := v(x,a) + k\,w(x). \qquad\Box \]
8.3.9 Proposition. For 0 < α < 1, suppose that Assumptions 8.3.1, 8.3.2 and 8.3.3 hold, and let T_α be the map defined by (8.3.17). Then:
(a) T_α is a contraction operator on B_w(X), with modulus γ := αβ < 1; that is, T_α maps B_w(X) into itself and
(8.3.23)
is l.s.c. in a ∈ A(x) for every x ∈ X. Hence, Lemma 8.3.8(a) yields that T_α u is a measurable function and that there exists f ∈ 𝔽 that satisfies (8.3.24). It is also clear that T_α u has a finite w-norm since, by Assumption 8.3.2,
(8.3.29)
and
(8.3.30)
Indeed, (8.3.29) is trivial for t = 0. Now, if t ≥ 1, it follows from (8.2.9) that
\[ \lim_{t\to\infty} \alpha^t E_x^{\pi} u(x_t) = 0 \qquad \forall\,\pi \in \Pi,\ x \in X,\ u \in B_w(X). \tag{8.3.35} \]
and (8.3.35) follows. Let us now consider the equality u* = T_α u* in (8.3.27). By Proposition 8.3.9(b), there exists a selector f ∈ 𝔽 such that
Hence, for any policy π ∈ Π and initial state x ∈ X, (8.2.9) and (8.3.38) yield
Finally, letting n → ∞ in the latter inequality and using (8.3.35), it follows that
\[ u^*(x) \le V(\pi, x), \]
so that, as π and x were arbitrary, we conclude that u*(x) ≤ V*(x) for all x ∈ X. This inequality and (8.3.37) yield (a2).
(b) The existence of a selector f* ∈ 𝔽 that satisfies (8.3.16) follows from part (a) and Proposition 8.3.9(b). Conversely, for any deterministic stationary policy f^∞, the corresponding α-discounted cost satisfies
8.3.10 Remark. Concerning (8.3.39), there are at least two ways in which one can show that, for any deterministic stationary policy f^∞ ∈ Π_DS,
by
\[ (R_f u)(x) := c(x, f) + \alpha \int_X u(y)\,Q(dy|x, f), \qquad x \in X. \tag{8.3.41} \]
(8.3.42)
From this equation and (8.3.41) we then have that u_f is the unique solution in B_w(X) of the equation
1. Except for the fact that the cost-per-stage c is allowed to take negative values, Assumption 8.3.2 and (8.3.6) are, respectively, the same as conditions (b) and (c) in Proposition 4.3.1. However, there is a misprint in Proposition 4.3.1(b): the inequality 1 ≤ k ≤ 1/α should be 1 ≤ k < 1/α.
2. Assumption 8.3.2 was introduced by Wessels [1], and it has been used
by other authors, including Piunovski [1], and Wakuta [1]. On the other
hand, van Nunen and Wessels [1] show that Assumption 8.3.2 is implied by
the following condition introduced by Lippman [1]:
(L) There is a measurable function w₀ ≥ 1 on X, a positive integer m, and positive constants b and M such that, for all (x, a) ∈ 𝕂,
\[ |c(x,a)| \le M\,w_0(x)^m \]
and
3. Also the condition (8.3.6) has been used by several authors; see, for instance, Bensoussan [1], Bhattacharya and Majumdar [1], Cavazos-Cadena [1].
4. If in Assumption 8.3.2 we allow β to be less than 1, then (8.3.29) would yield a contradiction. Namely, letting t → ∞ in (8.3.29) we get 1 ≤ 0, since w ≥ 1.
5. Let 𝕂 be as in (8.2.1). Then for every pair (x, a) in 𝕂 there is a decision function f ∈ 𝔽 such that a = f(x). (See Rieder [1], Example 2.6.) This fact and Proposition 8.3.9(b) yield that we can rewrite T_α u in (8.3.17) as
if
\[ v_n(x) = c(x, f) + \alpha \int_X v_{n-1}(y)\,Q(dy|x, f) \qquad \forall x \in X. \tag{8.4.1} \]
From the α-DCOE (8.3.4) we can see that D is a nonnegative function and that (8.3.4) can be rewritten as
\[ \min_{A(x)} D(x, a) = 0 \qquad \forall x \in X. \tag{8.4.3} \]
\[ D(x, f^*) = 0 \qquad \forall x \in X. \tag{8.4.4} \]
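These objects can be computed explicitly for a small model. A sketch with hypothetical data (3 states, 2 actions): it iterates the α-VI functions and then evaluates the discrepancy function, whose zeros identify the optimal selector as in (8.4.3)–(8.4.4):

```python
import numpy as np

# Q[a] is the transition matrix under action a; C[x, a] the cost c(x, a).
Q = np.array([[[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],
              [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]]])
C = np.array([[1.0, 2.0], [0.0, -1.0], [2.0, 0.5]])
alpha = 0.9

def T(v):
    """(Tv)(x) = min_a [c(x,a) + alpha * integral v dQ], cf. (8.3.12)."""
    return np.min(C + alpha * np.einsum('axy,y->xa', Q, v), axis=1)

v = np.zeros(3)                     # v_0 = 0
for n in range(200):                # alpha-VI: converges geometrically,
    v = T(v)                        # with modulus alpha here since the
V_star = v                          # costs are bounded (w = 1)

D = C + alpha * np.einsum('axy,y->xa', Q, V_star) - V_star[:, None]
print(np.round(D, 6))               # D >= 0 with one zero per row (8.4.3)
print(np.argmin(D, axis=1))         # the optimal selector f*, cf. (8.4.4)
```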
B. Estimates of VI convergence
(8.4.10)
To prove the latter inequality, first note that [as in (8.3.40)] we have
and so
The ideal goal in optimal control problems is, of course, to explicitly determine the optimal value function and an optimal control policy. Unfortunately, this goal is quite often very "difficult", if not impossible, to attain. Thus, there are many cases in which one prefers to use a suboptimal but more practical procedure, provided its global performance can be assessed and compared with that of an optimal policy. We shall now discuss one such procedure, which is of frequent use in engineering and economics applications, such as stabilization of control systems, production management, and economic growth and macroplanning problems.
In a rolling horizon (RH) procedure, also known as a moving, receding or sliding horizon procedure, we begin by fixing a positive integer N, which is called the rolling horizon, and proceed as follows:
(8.4.13)
Define h_k := f_{k,k}, the first optimal decision function for the N-stage problem.
Step 2. Substitute k by k + 1 and go back to Step 1.
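A compact sketch of Steps 1–2, reusing the toy model conventions of the earlier sketch (all data hypothetical): at each stage the N-stage problem is solved from the current state, only its first decision is applied, and the horizon "rolls" forward:

```python
import numpy as np

Q = np.array([[[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],
              [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]]])
C = np.array([[1.0, 2.0], [0.0, -1.0], [2.0, 0.5]])
alpha = 0.9

def first_decision(N):
    """N steps of alpha-VI; return the first optimal decision function
    of the N-stage problem (Step 1 of the RH procedure)."""
    v = np.zeros(C.shape[0])
    for _ in range(N - 1):
        v = np.min(C + alpha * np.einsum('axy,y->xa', Q, v), axis=1)
    return np.argmin(C + alpha * np.einsum('axy,y->xa', Q, v), axis=1)

rng = np.random.default_rng(0)
x = 0
for k in range(5):                       # Step 2: advance k and repeat
    a = first_decision(N=10)[x]          # apply only the first decision
    x = rng.choice(C.shape[0], p=Q[a, x])
    print("k =", k, "action =", a, "next state =", x)
```

Since the model here is stationary, the same decision function is recomputed at every stage; for time-varying data the recomputation is what makes the procedure adaptive.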
(8.4.16)
or
(8.4.17)
or even
(8.4.18)
If there is a positive integer N* such that (8.4.16) holds for all n ≥ N*, then N* is called a forecast horizon. Since f_n ∈ 𝔽 is the first optimal decision function for the n-stage problem [see (8.3.12)–(8.3.14)], the existence of a
Similarly, for n ≥ 1, A_n(x) denotes the set of actions a ∈ A(x) that attain the minimum in (8.3.12), i.e.,
then for every initial state x ∈ X and every α-VI policy π = {f_n} there exists an (x, π)-forecast horizon N₁. If, in addition, the state space X is also finite, then N₁ is independent of x and π; in other words, N₁ is a forecast horizon in the sense that (8.4.16) holds for all n ≥ N₁.
(b) In addition to (8.4.20), suppose that there exists a unique α-optimal control policy f*^∞; that is, 𝔽* consists of the single selector f* ∈ 𝔽:
(8.4.21)
Then for every initial state x ∈ X there exists a positive integer N₂ = N₂(x) such that f*(x) is in A_n(x) for all n ≥ N₂ [cf. (8.4.17)]; in other words,
(8.4.22)
Thus, if X is finite, there exists N₂ ≥ 1 such that (8.4.17) and (8.4.18) hold for all n ≥ N₂.
Proof. (a) Fix x and π = {f_n}, and suppose that (a) does not hold; that is, for every positive integer N ≥ 1 there exists n ≥ N such that, instead of (8.4.19), we have
On the other hand, as A(x) is a finite set [by (8.4.20)], there is a further subsequence {m_i} of {m} and a control action a_x ∈ A(x) such that
In other words,
and
\[ V^*(x) < c(x, a_x) + \alpha \int V^*(y)\,Q(dy|x, a_x). \tag{8.4.24} \]
which contradicts (8.4.24). This proves the first part of (a); that is, there exists an (x, π)-forecast horizon, say N₁(x, π).
Now, if A and X are both finite, then there are finitely many α-VI policies π. Hence, N₁ := max_{x,π} N₁(x, π) defines a forecast horizon.
(b) Fix the initial state x and suppose that (8.4.20) and (8.4.21) both hold. If (8.4.22) is not satisfied, then [arguing as in the proof of (a)] there is a subsequence {n_i} of {n}, and controls a_{n_i} ∈ A_{n_i}(x) and a_x ∈ A(x) such that a_{n_i} = a_x for all i, and
(8.4.26)
\[ < c(x, f^*) + \alpha \int v_{n_i-1}(y)\,Q(dy|x, f^*). \tag{8.4.27} \]
for n = 1, 2, .... Observe that, by (8.3.12), the functions D_n are nonnegative and that (8.3.12) can be rewritten as
\[ \lim_{n\to\infty} D_n(x, a) = D(x, a). \tag{8.4.31} \]
hence,
(8.4.37)
so that
\[ D(x,a) = \lim_{m\to\infty} D_{n+m}(x,a) \ge 2\bar c\,\gamma^{n-1}\,w(x) > 0; \]
hence, a is not in A*(x). Conversely, if (8.4.32) does not hold, then
Then N* is finite and it is an (x, π)-forecast horizon for every α-VI policy π.
Proof. Let π = {f_n} be an arbitrary α-VI policy, and let a := f_n(x). If a is not in A*(x), then, as A(x) ⊂ A is a finite set, it is clear that n(a) is also finite, and so is N*. Further, if n ≥ N*, then a is necessarily in A*(x) because, otherwise, D_n(x, a) = 0 would contradict (8.4.32). □
Finally, if both conditions (8.4.20) and (8.4.21) hold, then Proposition 8.4.7 and Corollary 8.4.8 yield the following (on-line) algorithm to detect N* in (8.4.38) and the optimal selector f* in (8.4.21) for a given initial state x ∈ X:
Initialization. Set n = 0 and define A₀ := A(x). If A₀ has a single element, say a*, then stop: a* = f*(x). Otherwise, go to step n = 1.
Step n: For every a in A_{n−1} compute D_n(x, a). If
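A sketch of this on-line test for the toy model used above (hypothetical data; the threshold K·γⁿ·w(x) stands in for the comparison with (8.4.32), with K and γ chosen for illustration rather than derived as in the text):

```python
import numpy as np

Q = np.array([[[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],
              [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]]])
C = np.array([[1.0, 2.0], [0.0, -1.0], [2.0, 0.5]])
alpha, gamma, K = 0.9, 0.9, 50.0
w = np.ones(3)
x = 0                                    # the given initial state

active = set(range(C.shape[1]))          # A_0 := A(x)
v = np.zeros(3)
for n in range(1, 100):
    Qv = np.einsum('axy,y->xa', Q, v)
    v_next = np.min(C + alpha * Qv, axis=1)
    D_n = C[x] + alpha * Qv[x] - v_next[x]       # D_n(x, .) >= 0
    active = {a for a in active if D_n[a] <= K * gamma**n * w[x]}
    v = v_next
    if len(active) == 1:
        print("stage", n, ": remaining action at x is", active.pop())
        break
```

Actions whose discrepancy D_n(x, a) stays bounded away from zero are eliminated once the geometric threshold falls below it, while an optimal action survives because its discrepancy decays at the same geometric rate.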
Notes on §8.4
and Rempala [1]. Perhaps the first paper on forecast horizons for (finite-state, finite-action) Markov control problems was the work of Shapiro [1], which soon afterwards was improved by Hinderer and Hübner [1]. In mathematical economics, "turnpike theorems" refer to results on asymptotic properties of optimal paths of capital accumulation in economic growth (see McKenzie [1]), and, by extension, forecast horizons are also known as turnpike-planning horizons.
4. Theorem 4.6.5, referred to at the beginning of §8.4.D, is based on the following result by M. Schäl [1]:
Let A(x) and 𝕂 be as in Assumption 8.3.1(a) and (8.2.1), respectively, and let {f_n} be a sequence in 𝔽. Then there exists a selector f ∈ 𝔽 such that, for each state x ∈ X, f(x) ∈ A(x) is an accumulation point of the sequence {f_n(x)}.
(a) A(x) is compact for every x ∈ X, and the set-valued mapping x ↦ A(x) is u.s.c.;
(c) Q is weakly continuous on 𝕂; that is, the function u′(x, a) in (8.5.1) is continuous on 𝕂 for every bounded continuous function u on X.
8.5.2 Assumption. This is the same as Assumption 8.3.2 except that the function w ≥ 1 is required to be continuous.
8.5.3 Assumption. The function w′(x, a) in Assumption 8.3.3 is continuous on 𝕂.
Under this new set of assumptions (basically a strengthening of Assumptions 8.3.1, 8.3.2, 8.3.3), Theorem 8.3.6 remains valid, but in addition we get that
\[ V^* \text{ is a l.s.c. function in } B_w(X). \tag{8.5.3} \]
To obtain (8.5.3) we need to make some changes in the proof of Theorem 8.3.6. First we introduce the following notation.
8.5.4 Definition. 𝕃(X) denotes the family of l.s.c. functions on X, and 𝕃_w(X) stands for the subfamily of l.s.c. functions that also belong to B_w(X), i.e.,
\[ \mathbb{L}_w(X) := \mathbb{L}(X) \cap B_w(X). \]
Similarly, C(X) ⊂ 𝕃(X) denotes the subfamily of continuous functions on X, and C_w(X) := C(X) ∩ B_w(X) is the subfamily of continuous functions in B_w(X). The family of continuous bounded functions on X is denoted by C_b(X).
To prove (8.5.3) we need the following lemma, whose parts (a) and (b) correspond to Lemma 8.3.7(a) and Proposition 8.3.9, respectively. On the other hand, Lemma 8.5.5(c) states that convergence in the w-norm preserves lower semicontinuity.
8.5.5 Lemma. Suppose that Assumptions 8.5.1, 8.5.2 and 8.5.3 are satisfied. Then:
(a) The function u′ in (8.5.1) is continuous on 𝕂 whenever u is in 𝕃_w(X);
(b) Proposition 8.3.9 remains valid if B_w(X) is replaced by 𝕃_w(X);
(c) If {v_n} is a sequence in 𝕃_w(X) that converges in w-norm to a function v, then v is in 𝕃_w(X).
Proof. (a) The proof of this part is essentially the same as the proof of Lemma 8.3.7(a), with the obvious changes. Let u be a function in 𝕃_w(X) and define u_m as in the proof of Lemma 8.3.7(a). Then u_m is a nonnegative l.s.c. function and, therefore, there is a nondecreasing sequence of continuous bounded functions u^k ∈ C_b(X) such that u^k ↑ u_m. Now let (x_n, a_n) be a sequence in 𝕂 converging to (x, a) ∈ 𝕂. Then Assumption 8.5.1(c) yields that, for every k,
\[ \int u^k(y)\,Q(dy|x, a). \]
(b₁) T_α u is a l.s.c. function in B_w(X) for every u in 𝕃_w(X), and that
Finally, letting n → ∞ we obtain lim inf_k v(x_k) ≥ v(x); that is, v is l.s.c. □
We can now see that (8.5.3) is a direct consequence of Lemma 8.5.5(b), (c) and (8.3.15). Namely, by Lemma 8.5.5(b) and a trivial induction argument, the α-VI functions v_n in (8.3.26) [or (8.3.12)] belong to 𝕃_w(X), which combined with Lemma 8.5.5(c) and (8.3.15) yields that V* is in 𝕃_w(X); that is, (8.5.3) holds.
The previous paragraphs illustrate a situation already mentioned in §§4.2 and 3.3: in addition to the features of the particular control problem we are dealing with, the choice of hypotheses basically depends on whether one wishes to (or can) work in a class of lower semicontinuous functions (as is the case under Assumptions 8.5.1, 8.5.2, 8.5.3) or in a class of measurable functions (Assumptions 8.3.1, 8.3.2, 8.3.3).
8.6 Examples
In this section we present a couple of examples of Markov control models
that satisfy the assumptions in §8.3 and §8.5. These examples are intended
to illustrate how one can proceed in similar cases.
When considering an X-valued controlled process {x_t} of the form
(8.6.1)
(a) The disturbance sequence {z_t} consists of i.i.d. random variables with values in a Borel space Z, and {z_t} is independent of the initial state x₀. The common distribution of the z_t is denoted by G.
when u is the indicator function 1_B. In general, we may use (8.6.1) and Assumption 8.6.1(a) to write (8.6.3) as
\[ u'(x, a) = \int_Z u[F(x, a, z)]\,G(dz), \tag{8.6.4} \]
and so we see that u′(x, a) is continuous in (x, a) ∈ 𝕂 for every bounded measurable function u on X. This implies part (c) in both Assumptions 8.3.1 and 8.5.1.
Verification of Assumptions 8.3.2 and 8.5.2. It suffices to find a continuous weight function w that satisfies the conditions (i) and (ii) in Remark 8.3.5(a). To do this, let us first consider the moment generating function ψ of the variable θ − z₀,
\[ w'(x, a) = w(0)\,[1 - G(x + a)] + w(x)\int^{x+a} \exp[r(a - z)]\,G(dz), \tag{8.6.13} \]
(a) The action (or control) set A = A(x) for all x ∈ X is a compact subset of an interval (0, θ] for some (finite) number θ.
(b) {η_t} and {ξ_t} are independent sequences of i.i.d. random variables.
(c) η₀ and ξ₀ have continuous bounded densities g₁ and g₂, respectively.
(d) The random variable z := θη₀ − ξ₀ has a (finite) negative mean and a moment generating function ψ(r) := E(e^{rz}) that is finite for some r > 0; that is,
\[ \text{(i) } E(z) < 0, \qquad\text{and}\qquad \text{(ii) } \psi(r) < \infty. \tag{8.6.17} \]
\[ \psi(r) < 1. \]
For such a number r, we define the continuous weight function
\[ w(x) := e^{rx}, \qquad x \in X. \tag{8.6.18} \]
(8.6.20)
(8.6.21)
\[ u'(x,a) = E\,u[(x + z_a)^+] = u(0)\,P(x + z_a \le 0) + \int_{-x}^{\infty} u(x + y)\,g_a(y)\,dy, \tag{8.6.22} \]
\[ \int_{-x}^{\infty} u(x + y)\,g_a(y)\,dy = \int_0^{\infty} u(y)\,g_a(y - x)\,dy, \]
which is continuous on 𝕂 = X × A.
Finally, to verify (8.3.5), observe that, by Assumptions 8.6.5(a), (d),
\[ z_a = a\eta_0 - \xi_0 \le \theta\eta_0 - \xi_0 = z \qquad \forall a \in A, \]
so that the integral in (8.6.23) satisfies
\[ E\exp(r z_a) \le \psi(r) \qquad \forall (x,a) \in \mathbb{K}. \tag{8.6.24} \]
Thus, as P(z_a ≤ −x) ≤ 1 ≤ w(x) for all (x, a) in 𝕂, (8.6.23) implies that (8.3.5) holds with β := 1 + ψ(r). The latter fact and (8.6.19) show that Assumptions 8.3.2 and 8.5.2 are satisfied for every discount factor α < 1/β.
In conclusion, all of the results in §8.3 and §8.5 are applicable to the queueing system (8.6.16). In fact, many results in §8.4 are also applicable since the compactness of A in Assumption 8.6.5(a) includes, for instance, the condition (8.4.20). □
Example 8.6.4 comes from Gordienko and Hernández-Lerma [1].
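The verification of (8.6.17) and of ψ(r) < 1 is straightforward to reproduce numerically. A Monte Carlo sketch with hypothetical service and arrival laws (η ~ U(0,1), ξ ~ Exp(1), θ = 1), chosen only so that z = θη₀ − ξ₀ has negative mean:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 1.0
eta = rng.uniform(0.0, 1.0, 10**6)       # hypothetical service variables
xi = rng.exponential(1.0, 10**6)         # hypothetical interarrival times
z = theta * eta - xi
print("E z =", z.mean())                 # negative, so (8.6.17)(i) holds

def psi(r):
    """Empirical moment generating function psi(r) = E exp(r z)."""
    return np.mean(np.exp(r * z))

r = 0.5
print("psi(r) =", psi(r))                # < 1 for this r
beta = 1.0 + psi(r)                      # the constant in (8.3.5)
print("results apply for discount factors alpha <", 1.0 / beta)
```

With such an r, the weight w(x) = e^{rx} of (8.6.18) satisfies the required growth conditions, as shown above.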
9 The Expected Total Cost Criterion

9.1 Introduction
Let M = (X, A, {A(x) | x ∈ X}, Q, c) be the Markov control model (MCM) in §8.2. In this chapter we study the expected total cost (ETC) criterion, defined as
(9.1.1)
\[ V_\alpha(\pi, x) := E_x^{\pi}\Big[\sum_{t=0}^{\infty} \alpha^t c(x_t, a_t)\Big] \tag{9.1.4} \]
9.2 Preliminaries
This section contains background material. The reader may skip the section
and refer to it as needed.
A. Extended real numbers
if r E iR, r > 0
r.oo=oo.r={ c;
-00
if r = 0
if r E iR, r < o.
(9.2.2)
The positive and negative parts of an extended real number r are defined as
\[ r^+ := \max(r, 0) \qquad\text{and}\qquad r^- := \max(-r, 0), \tag{9.2.3} \]
respectively, and satisfy
Let {r_n} be a sequence in ℝ̄ that may contain one of the numbers +∞, −∞, but not both. Then the "partial sum" S_N := Σ_{n=0}^N r_n is well defined for each N = 1, 2, ..., and we say that the series Σ_{n=0}^∞ r_n converges (in ℝ̄) to r ∈ ℝ̄ if the limit lim_{N→∞} S_N exists (in ℝ̄) and equals r. For example, if
\[ r_n \ge 0 \quad\text{for all } n = 0, 1, \dots, \tag{9.2.4} \]
then the series Σ r_n converges in ℝ̄ (the limit may be +∞). On the other hand, if {r_n} is such that
(9.2.7)
∀N = 0, 1, ....
(9.2.8)
Proof. If the proposition is true under (9.2.4), then [by (9.2.5) and (9.2.6)] it is also true under (9.2.5). Now, to prove the proposition assuming (9.2.4), it suffices to note that αⁿ r_n ≥ 0 for all n and, further, the partial sums
\[ \sum_{n=0}^{N} \alpha^n r_n \]
B. Integrability
Let (Ω, F, P) be a probability space, and ℝ̄ the set of extended real numbers. A random variable ζ : Ω → ℝ̄ is said to be integrable (with respect to P) if
\[ E(\zeta^+) < \infty \qquad\text{and}\qquad E(\zeta^-) < \infty. \]
In this case, the expectation (or expected value) of ζ is the real number
(9.2.9)
see, for instance, Neveu [1, p. 41]; for a proof of (f) see Hinderer [1, p. 146].
9.2.2 Proposition. Let ζ and ζ_n (n = 1, 2, ...) be quasi-integrable random variables. Then:
(a) E(kζ) = k E(ζ) for every finite constant k.
(9.3.1)
when using the policy π, given the initial state x₀ = x. The corresponding (optimal) value function is
(9.3.3)
The first step in our study of the ETC criterion will be to consider the following basic theoretical issues.
9.3.1 Questions
(a) Given a policy π, is V₁(π, ·) : X → ℝ (or ℝ̄) a measurable function? Similarly,
(b) Is V₁* : X → ℝ (or ℝ̄) a measurable function?
(c) For each policy π and each initial state x, let J₀(π, x) := 0 and
\[ V_1^{(+)}(\pi, x) := E_x^{\pi}\Big(\sum_{t=0}^{\infty} c_t^+\Big), \tag{9.3.7} \]
and we suppose the following:
9.3.2 Assumption. For each x ∈ X,
\[ \sup_{\pi} V_1^{(-)}(\pi, x) < \infty. \tag{9.3.8} \]
Proof. (a) For each policy π and initial state x, the condition (9.3.8) implies that V₁^(−)(π, x) < ∞, with V₁^(−) as in (9.3.7). Hence, as
(9.3.9)
and similarly for V_α(π, x), part (a) follows from Proposition 9.2.2(f) and the properties (8.2.7) and (8.2.9) of the p.m. P_ν^π with ν = δ_x, the Dirac measure concentrated at x₀ = x.
(b) This follows from Proposition 9.2.1. Incidentally, observe that since
taking the limit α ↑ 1 and using (b) and the definition (9.3.2) of V₁*, we obtain
\[ \limsup_{\alpha\uparrow 1} V_\alpha^*(x) \le V_1^*(x) \qquad \forall x \in X. \tag{9.3.10} \]
(c) As V₁^(+) ≥ 0, (9.3.9) yields V₁(π, x) ≥ −V₁^(−)(π, x). Thus, taking the infimum over all π and using (9.3.8), we obtain part (c). □
Concerning Question 9.3.1(c), we show below [Proposition 9.3.5(a)] that
Assumption 9.3.2 yields
(9.3.15)
(9.3.18)
(2) Assumptions 9.3.2 and 9.3.4 both hold and the functions J_n are measurable.
Proof. If (1) holds, then the measurability of V₁* follows from Proposition 9.3.3(a).
On the other hand, under (2), the measurability of V₁* follows from a well-known result in real analysis: a pointwise limit of Borel-measurable functions is Borel-measurable. (See, for instance, Ash [1], Theorem 1.5.4.) Indeed, if (2) holds, then V₁* is measurable because, by Theorem 9.3.5(b) and (9.3.6), it is the pointwise limit of the measurable functions J_n. □
Of course, Proposition 9.3.6 answers one question [Question 9.3.1(b)], but simultaneously it raises another: When are (1) or (2) satisfied? This question is dealt with in §9.5 and §9.6. First, however, in the next section we consider Question 9.3.1(e).
Notes on §9.3
1. Most of the works on the expected total cost (ETC) criterion deal with Markov control processes (MCPs) in which: (i) the state space X is a countable set, and/or (ii) the MCP is either positive (that is, c ≥ 0) or negative (c ≤ 0). For extensive bibliographies on these two cases see, for instance, Altman [1], Bertsekas [1], or Puterman [1]. Among the few works dealing with Borel state spaces and not distinguishing between positive and negative MCPs, we can mention the papers by Quelle [1], Rieder [2], Schäl [1], and Hinderer's [1] monograph.
2. With respect to Question 9.3.1(b), it is well known that, in a very general context, the value function V₁* is universally measurable (see, for instance, Hinderer [1]), which is a concept much weaker than measurability. To our knowledge, measurability of V₁* typically requires restrictive conditions, such as (1) and (2) in Proposition 9.3.6.
3. The Markov control model M = (X, A, {A(x) | x ∈ X}, Q, c) is called convergent if it satisfies Assumption 9.3.2 and, in addition,
\[ \sup_{\pi} E_x^{\pi}\Big(\sum_{t=0}^{\infty} |c_t|\Big) < \infty \quad\text{for each } x \in X. \tag{9.3.20} \]
\[ \lim_{n\to\infty} \frac{1}{n}\,E_x^{\pi}\Big(\sum_{t=0}^{n-1} |c_t|\Big) = 0 \qquad \forall \pi,\ x. \tag{9.3.21} \]
MCMs with this property are called zero-average cost models.
\[ V_1^*(\cdot) = \inf_{\Pi} V_1(\pi, \cdot) = \inf_{\Pi'} V_1(\pi, \cdot). \tag{9.4.1} \]
(9.4.2)
\[ V_1(\pi, \nu) = \int_{X\times A} c(x,a)\,\mu_\nu^{\pi}(d(x,a)). \tag{9.4.7} \]
To see this, suppose that c is nonnegative, as is the case for c⁺ and c⁻. In addition, in (9.3.4) replace x with ν, and let r_{n+1} be the measure in (9.3.5). Then, since c ≥ 0 and r_n ≤ μ_ν^π, we get
\[ \int c\,d\mu_\nu^{\pi} \ge V_1(\pi, \nu), \]
which completes the proof of (9.4.7) when c is nonnegative. Finally, replacing c with c⁺ and c⁻, we obtain (9.4.7) for a general c.
Strictly speaking, we should write (9.4.7) as an integral over 𝕂 instead of X × A, since c(x, a) is defined on 𝕂 only. However, we can always measurably extend c to all of X × A for (9.4.7) to be well defined. For instance, we may take c(x, a) := +∞ on the complement of 𝕂, and then [by (9.4.6)] the convention (9.2.2) would yield that (9.4.7) consists of the integral over 𝕂 plus a term equal to zero.
Also note that, for the given initial distribution ν, we may rewrite (9.3.8) as
\[ \sup_{\pi} \int c^-(x,a)\,\mu_\nu^{\pi}(d(x,a)) < \infty. \]
(c) If B and C are Borel sets in X and A, respectively, and if Γ is the measurable rectangle B × C, then (9.4.4) becomes
\[ E_\nu^{\pi}[I_B(x_t)\mid h_{t-1}, a_{t-1}] = Q(B|x_{t-1}, a_{t-1}). \]
So, taking expectation E_ν^π(·) and using (9.4.13), we get, for any B in B(X),
\[ E_\nu^{\pi}[I_B(x_t)] = E_\nu^{\pi}[Q(B|x_{t-1}, a_{t-1})] = \int_{X\times A} Q(B|x,a)\,\mu_{\nu,t-1}^{\pi}(d(x,a)). \]
Hence
(9.4.15)
and
(9.4.16)
for every B ∈ B(X), C ∈ B(A), and t = 0, 1, .... Now let π′ be the randomized Markov policy π′ := {φ₀, φ₁, ...}, with φ_t as in (9.4.18). Then (9.4.17) trivially holds for t = 0 since, by (9.4.11) and Lemma 9.4.3(a), for all B in B(X) and C in B(A) we have
\[ \int_B \varphi_0(C|x)\,\nu(dx) = \mu_{\nu,0}^{\pi}(B\times C), \]
where the last equality is due to (9.4.18) and Lemma 9.4.3(a) again. The proof now proceeds by induction: Suppose that (9.4.17) holds for some integer t ≥ 0. Then, by Lemma 9.4.3(b), the marginal of μ_{ν,t+1}^π on X satisfies
\[ \int_{X\times A} Q(\cdot|x, a)\,\mu_{\nu,t}^{\pi}(d(x,a)); \]
that is, the marginals (on X) of μ_{ν,t+1}^π and μ_{ν,t+1}^{π′} coincide. This implies that (9.4.17) holds for t + 1 because [by (9.4.11)]
\[ \int_B \varphi_{t+1}(C|x)\,\hat\mu_{\nu,t+1}^{\pi'}(dx) = \int_B \varphi_{t+1}(C|x)\,\hat\mu_{\nu,t+1}^{\pi}(dx) = \mu_{\nu,t+1}^{\pi}(B\times C) \quad\text{[by (9.4.18)]}. \]
This completes the proof of (9.4.17), which, as was already mentioned, gives (9.4.14). In turn, (9.4.14) and (9.4.7) give the equality (9.4.15), and also give (9.4.16) since π was an arbitrary policy. □
In connection with the question of "sufficiency" of sets of policies, we next present an interesting result which states that, under appropriate assumptions, the ETC corresponding to a randomized stationary policy φ^∞ ∈ Π_RS can always be "improved" (or "minorized") by the ETC of some deterministic stationary policy f^∞ ∈ Π_DS, in the sense that
where ν is the given initial distribution. In other words, this fact would yield another affirmative answer to Question 9.3.1(e) when Π and Π′ are replaced by Π_RS and Π_DS, respectively; see (9.4.21). The precise statement is as follows.
The proof of Theorem 9.4.6 requires two preliminary facts. The first one is the following lemma of Hinderer [1, Lemma 15.1], which is an extension of a result by Blackwell [1].
9.4.7 Lemma. Let 𝕂 and 𝔽, Φ be as in (8.2.1) and Definition 8.2.1, respectively. If φ is a stochastic kernel in Φ and v : 𝕂 → ℝ̄ is a measurable function such that
\[ x \mapsto \int v^-(x,a)\,\varphi(da|x) \]
The second fact we need is simply another expression for the ETC V₁ in (9.3.1), which will also be useful in later sections. [The expression (9.4.23) corresponds to the special case n = 1 of Lemma 9.5.6(a).] We will use the following notation: Given a policy π = {π_t, t = 0, 1, ...}, the 1-shift policy π^(1) = {π_t^(1), t = 0, 1, ...} is defined as
\[ \pi_0^{(1)}(\cdot|x_1) := \pi_1(\cdot|x_0, a_0, x_1), \]
and, for t = 1, 2, ...,
\[ \pi_t^{(1)}(\cdot|x_1, a_1, \dots, x_{t+1}) := \pi_{t+1}(\cdot|x_0, a_0, x_1, a_1, \dots, x_{t+1}). \]
In particular, if π = φ^∞ is a randomized stationary policy, then [by Definition 8.2.2(b)] the 1-shift policy is given by
(9.4.22)
9.4.8 Lemma. Suppose that Assumption 9.3.2 is satisfied. Then, for each
policy 7f = {7ft} and initial state x EX,
(9.4.23)
i c(x,a)7fo(dalx),
VI (cpoo, x) = i v(x,a)cp(dalx)
Ix
with
v(x,a) := c(x, a) + V1(cpoo,y)Q(dylx,a).
Therefore, by the hypothesis (*) and Lemma 9.4.7, there is a decision function $f \in \mathbb{F}$ such that, using the notation (8.2.6),
$$V_1(\varphi^\infty, x) \ge c(x, f) + \int_X V_1(\varphi^\infty, y)\,Q(dy\mid x, f) \quad \forall x \in X.$$
Iteration of this inequality gives, for every $n = 1,2,\dots$ [with $Q^n(\cdot\mid x, f)$ as in (8.2.11) and $f^\infty \in \Pi_{DS}$ the deterministic stationary policy determined by $f \in \mathbb{F}$; see Remark 8.2.3(a)],
$$V_1(\varphi^\infty, x) \ge J_n(f^\infty, x) + \int_X V_1(\varphi^\infty, y)\,Q^n(dy\mid x, f).$$
Therefore, by (9.3.4) (with $\pi = f^\infty$) and the assumption that $c \ge 0$, we get
$$V_1(\varphi^\infty, x) \ge J_n(f^\infty, x) \quad \forall n = 1,2,\dots,$$
and letting $n \to \infty$ we see that (9.4.19) follows from (9.3.14).
On the other hand, integration of both sides of (9.4.19) with respect to $\nu$ yields (9.4.20).
Finally, to prove (9.4.21), observe that (9.4.20) implies one inequality, whereas the reverse inequality follows from the fact that $\Pi_{DS}$ is contained in $\Pi_{RS}$ (since $\mathbb{F}$ is contained in $\Phi$; see the paragraph after Definition 8.2.1). $\Box$
Notes on §9.4
1. Combining Theorem 9.4.5 with Lemma 9.4.7, one can show that for every policy $\pi$ there is a deterministic Markov policy $\pi_d = \{f_0, f_1, \dots\}$ such that
$$V_1(\pi_d, x) \le V_1(\pi, x) \quad \forall x \in X. \tag{9.4.25}$$
To prove this result the idea is that, by Theorem 9.4.5, we may assume at the outset that $\pi$ is a randomized Markov policy, say $\pi = \{\varphi_t\}$. Then, by (9.4.23) and Lemma 9.4.7, there exists $f_0 \in \mathbb{F}$ such that
$$V_1(\pi, x) \ge c(x, f_0) + \int_X V_1(\pi^{(1)}, y)\,Q(dy\mid x, f_0) \quad \forall x \in X.$$
Next, applying the same argument to $V_1(\pi^{(1)}, \cdot)$ one obtains $f_1 \in \mathbb{F}$, and continuing in this manner we get $\pi_d = \{f_0, f_1, \dots\}$, which satisfies (9.4.25).
2. The ETC-expected occupation measure $\mu_\nu^\pi$ in (9.4.4) corresponds to the case $\alpha = 1$ of the $\alpha$-discount expected occupation measures (or state-action frequencies) in §6.3, namely, the measures defined as in (9.4.4) but with the $t$th summand weighted by $\alpha^t$; these satisfy the fixed-point equation
$$\widehat{\mu}(\cdot) = \nu(\cdot) + \alpha \int_{X\times A} Q(\cdot\mid x,a)\,\mu(d(x,a)), \tag{9.4.27}$$
where $\widehat{\mu}$ denotes the marginal of $\mu$ on $X$.
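In the finite case the fixed-point equation (9.4.27) is linear once a stationary policy is fixed, so the occupation measure can be computed directly. The following is a minimal sketch, not from the book: the transition matrix `P` (with the policy already folded in), the initial distribution `nu`, and the discount factor `alpha` are illustrative assumptions.

```python
import numpy as np

# Finite-state analogue of (9.4.27) with a stationary policy folded into P:
# mu = nu + alpha * P^T mu, hence mu = (I - alpha*P.T)^{-1} nu.
def discounted_occupation(P: np.ndarray, nu: np.ndarray, alpha: float) -> np.ndarray:
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - alpha * P.T, nu)

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])        # row-stochastic transition matrix (illustrative)
nu = np.array([1.0, 0.0])         # initial distribution
mu = discounted_occupation(P, nu, alpha=0.95)
# Total mass is 1/(1 - alpha), as expected for an alpha-discount occupation measure.
assert np.isclose(mu.sum(), 1 / (1 - 0.95))
```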
Observe that (9.5.4) is simply the value iteration equation that we have
already seen in several chapters; see, for instance, (8.3.12) or (8.3.26) for
the discounted case.
Under Assumption 9.5.2, which is supposed to hold throughout this section, Theorem 9.3.5(b) and Proposition 9.3.6 yield (9.3.6), i.e.,
$$\lim_{n\to\infty} J_n^*(x) = V_1^*(x) \quad \forall x \in X, \tag{9.5.6}$$
and that $V_1^*$ is a measurable function on $X$. Moreover, $V_1^*$ is quasi-integrable with respect to the measure $Q(\cdot\mid x,a)$ for each $x \in X$ and $a \in A(x)$. Indeed, this is obvious if condition (i) in Assumption 9.5.2(c) holds: in this case $V_1^*$ is nonnegative, and so the integral of its negative part is zero [see (9.2.9), or (9.3.7), (9.3.8)]. On the other hand, under condition (ii) we have [by (9.5.5) and Proposition 9.3.3(c)]
$$-\infty < V_1^*(x) \le W(x) \quad \forall x \in X; \tag{9.5.7}$$
hence, as $W$ belongs to the set $\mathcal{U}$ (see Definition 9.5.1), (9.5.7) and Proposition 9.2.2(c) justify passing to the limit inside the integral; that is,
$$V_1(\pi, x) \ge T V_1^*(x).$$
This inequality and (9.3.2) give (9.5.8)(a) since $\pi$ and $x$ were arbitrary.
To obtain (9.5.8)(b), use (9.5.4) and (9.5.1) to write
$$J_n^*(x) \le c(x,a) + \int_X J_{n-1}^*(y)\,Q(dy\mid x,a) \quad \forall a \in A(x). \tag{9.5.9}$$
Now consider condition (i) in Assumption 9.5.2(c). In this case the sequence $J_n^*$ is nondecreasing and, therefore, letting $n \to \infty$ in (9.5.9), we obtain [by (9.5.6) and the Monotone Convergence Theorem]
$$V_1^*(x) \le c(x,a) + \int_X V_1^*(y)\,Q(dy\mid x,a) \quad \forall a \in A(x), \tag{9.5.10}$$
which implies (9.5.8)(b). On the other hand, under condition (ii) in Assumption 9.5.2(c), we may take $\limsup_n$ in (9.5.9) and obtain (9.5.10) again, by (9.5.6) and Fatou's Lemma. This completes the proof. $\Box$
B. Optimality criteria
Having the optimality equation (9.5.3) [or (9.5.2)], we can proceed to obtain several optimality criteria, which informally can be obtained by taking a "discount factor" $\alpha = 1$ in Theorem 4.5.1 (on discounted cost problems).
(a) The discrepancy function for the ETC criterion is the nonnegative function $D_1$ on $\mathbb{K}$ [the set defined in (8.2.1)] given by
$$D_1(x,a) := c(x,a) + \int_X V_1^*(y)\,Q(dy\mid x,a) - V_1^*(x).$$
In particular,
$$\inf_{a\in A(x)} D_1(x,a) = 0 \quad \forall x \in X. \tag{9.5.13}$$
The discrepancy function vanishes along the state-action process of an optimal policy:
$$D_1(x_t, a_t) = 0 \quad \forall t = 0,1,\dots \quad (P_x^\pi\text{-a.s.}). \tag{9.5.14}$$
Moreover,
$$\limsup_{n\to\infty} E_x^\pi V_1^*(x_n) \le 0 \quad \forall x \in X,\ \pi \in \Pi, \tag{9.5.16}$$
$$\lim_{n\to\infty} E_x^\pi V_1^*(x_n) = 0, \tag{9.5.17}$$
and, furthermore, a telescoping identity along the state-action process yields (9.5.20).
Thus, part (a) follows from (9.5.20) and (9.3.13).
(b) This is obtained by adding and subtracting $E_x^\pi V_1^*(x_n)$ in (9.3.13), and using the definition (9.5.12) of $M_n^*$.
To prove (9.5.16), note that (9.3.2) and (9.3.20) yield (9.5.21), and so (9.5.16) follows from (9.3.15). Also note that (9.5.17) is a consequence of (9.5.16) and (9.5.15).
To prove (9.5.18), observe that (9.5.11) and (8.2.9) give, for every $t = 0,1,\dots$,
$$E_x^\pi D_1(x_t, a_t) = E_x^\pi c(x_t, a_t) + E_x^\pi V_1^*(x_{t+1}) - E_x^\pi V_1^*(x_t),$$
and then
$$\sum_{t=n}^{N-1} E_x^\pi D_1(x_t, a_t) = \sum_{t=n}^{N-1} E_x^\pi c(x_t, a_t) + E_x^\pi V_1^*(x_N) - E_x^\pi V_1^*(x_n).$$
Finally, letting $N \to \infty$ and using (9.5.17) we obtain (9.5.18). $\Box$
9.5.7 Remark. Suppose that the cost-per-stage $c(x,a)$ is nonnegative, as in part (i) of Assumption 9.5.2(c). Then (9.5.15) is obviously satisfied, and so conditions (a) to (d) in Theorem 9.5.5 are all equivalent and, moreover, (9.5.17) and (9.5.18) hold. $\Box$
9.5.8 Lemma. For each policy $\pi$ and initial state $x$ for which $V_1(\pi, x) < \infty$, the sequence $\{M_n^*, \sigma(h_n)\}$ is a $P_x^\pi$-submartingale; that is, for every $n$, $M_n^*$ is $P_x^\pi$-integrable, $\sigma(h_n)$-measurable, and
$$E_x^\pi[M_{n+1}^* \mid h_n] \ge M_n^* \quad (P_x^\pi\text{-a.s.}). \tag{9.5.22}$$
Taking expectations in (9.5.22) and iterating yields (9.5.23), and (9.5.25) follows. This shows that (a) implies (b). The converse is obviously true: take $n = 0$ in (b).
(c) $\Leftrightarrow$ (d). As
$$E_x^\pi D_1(x_n, a_n) = E_x^\pi\big\{E_x^\pi[D_1(x_n, a_n)\mid h_n]\big\},$$
the equivalence of (c) and (d) follows from (9.5.24); recall that $D_1$ is nonnegative.
Finally, if (9.5.15) holds, then (9.5.18) shows that (b) and (c) [hence (a) to (d)] are equivalent. This completes the proof of Theorem 9.5.5. $\Box$
The inequality (9.5.16) can be used to obtain a characterization of $V_1^*$ as the pointwise "maximal" solution of the optimality equation within a certain subclass of functions in $\mathcal{U}$ (Definition 9.5.1). This is the essential content of the following result. (In Theorem 9.5.13 we give conditions for $V_1^*$ to be the "unique" solution of the optimality equation.)
9.5.9 Theorem. (A "characterization" of $V_1^*$.) Suppose that Assumption 9.5.2 is satisfied. Let $u \in \mathcal{U}$ be a function that satisfies the optimality equation (9.5.3) and the inequality (9.5.16), i.e.,
$$u(x) = \min_{a\in A(x)}\Big[c(x,a) + \int_X u(y)\,Q(dy\mid x,a)\Big] \quad \forall x \in X, \tag{9.5.26}$$
and
$$\limsup_{n\to\infty} E_x^\pi u(x_n) \le 0 \quad \forall x \in X,\ \pi \in \Pi, \tag{9.5.27}$$
respectively. Then
$$u(\cdot) \le V_1^*(\cdot). \tag{9.5.28}$$
Hence, if $V_1^*$ belongs to the class $\mathcal{U}$, then $V_1^*$ is the maximal function in $\mathcal{U}$ that satisfies (9.5.26) and (9.5.27).
Proof. We will show that if $u \in \mathcal{U}$ satisfies (9.5.26) and (9.5.27), then (9.5.28) holds. From (9.5.26),
$$\int_X u(y)\,Q(dy\mid x_t, a_t) \ge u(x_t) - c(x_t, a_t),$$
so that, taking the expectation $E_x^\pi(\cdot)$, summing over $t$, and rearranging terms, we obtain
$$u(x) \le J_n(\pi, x) + E_x^\pi u(x_n).$$
As a result, taking $\limsup_n$, we obtain (9.5.29) from (9.5.27) and (9.3.14). $\Box$
C. Deterministic stationary policies
We conclude this section with some remarks on the expected total cost (ETC) when using a deterministic stationary policy $f^\infty \in \Pi_{DS}$. [Recall Definitions 8.2.1 and 8.2.2(e).]
First note that replacing the policy $\varphi^\infty$ in (9.4.24) by $f^\infty$, we get that the ETC $V_1(f^\infty, \cdot)$ satisfies [using the notation (8.2.6)]
$$V_1(f^\infty, x) = c(x, f) + \int_X V_1(f^\infty, y)\,Q(dy\mid x, f) \quad \forall x \in X. \tag{9.5.30}$$
Comparing this equation with (7.5.1) we see that (9.5.31) is the Poisson equation for the stochastic kernel $P$ and the charge $C$ given by
$$P(\cdot\mid x) := Q(\cdot\mid x, f), \quad C(\cdot) := c(\cdot, f), \quad g(\cdot) := 0. \tag{9.5.33}$$
Moreover, $f^\infty$ is ETC-optimal if and only if $f(x)$ attains the minimum in the optimality equation (9.5.3), i.e.,
$$V_1^*(x) = c(x, f) + \int_X V_1^*(y)\,Q(dy\mid x, f) \quad \forall x \in X, \tag{9.5.37}$$
together with the condition (9.5.38).
Proof. Suppose that $f^\infty$ is ETC-optimal, that is, $V_1(f^\infty, \cdot) = V_1^*(\cdot)$. Then (9.5.37) follows from Theorem 9.5.3 and the Poisson equation (9.5.30), whereas (9.5.38) is a consequence of (9.5.35).
Conversely, suppose that (9.5.37) and (9.5.38) are satisfied. Then, from (9.5.37) and (9.5.34), we have that $V_1(f^\infty, \cdot) = V_1^*(\cdot)$, i.e., $f^\infty$ is ETC-optimal. $\Box$
Notes on §9.5

9.6 The transient case

A. Transient models
Recall the notation for the expected occupation measure of a randomized stationary policy $\varphi^\infty$:
$$\widehat{\mu}_x(\cdot) := \sum_{t=0}^{\infty} Q_\varphi^t(\cdot\mid x). \tag{9.6.4}$$
Indeed, from Lemma 9.4.3(b) and equations (9.4.18) and (9.6.1) we have
$$\Big\|\sum_{t=0}^{\infty} Q_\varphi^t\Big\|_w = \sup_x\, w(x)^{-1} \sum_{t=0}^{\infty} \int_X w(y)\,Q_\varphi^t(dy\mid x) = \sup_x\, w(x)^{-1} \int_X w(y)\,\widehat{\mu}_x(dy) \quad\text{[by (9.4.12)]}. \tag{9.6.8}$$
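For a finite state space the $w$-norm in (9.6.8) reduces to a maximum over states, which gives a direct numerical check of the transience condition (9.6.7). A minimal sketch, not from the book; the substochastic kernel `Q` and weight `w` below are illustrative:

```python
import numpy as np

def w_norm(Q: np.ndarray, w: np.ndarray) -> float:
    """w-norm of a (sub)stochastic kernel: sup_x w(x)^{-1} * sum_y Q(x,y) w(y)."""
    return float(np.max((Q @ w) / w))

# A strictly substochastic kernel: mass 0.1 is "killed" in each row, so the
# series sum_t ||Q^t||_w is finite and the model is transient as in (9.6.7).
Q = 0.9 * np.array([[0.5, 0.5],
                    [0.3, 0.7]])
w = np.ones(2)
total, Qt = 0.0, np.eye(2)
for _ in range(500):            # truncation of the series in (9.6.8)
    total += w_norm(Qt, w)
    Qt = Qt @ Q
print(total)                    # close to 1/(1 - 0.9) = 10 for this example
```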
9.6.2 Remark. It is clear that for (9.6.7) to be true the transition kernel $Q(\cdot\mid x,a)$ has to be very "special" and, in general, it is convenient to think of it as being a substochastic kernel, that is, $Q(X\mid x,a) \le 1$ for all $x \in X$ and $a \in A(x)$. This is precisely the case for the discounted model in Proposition 9.6.3, below. A related situation occurs for absorbing MCMs, in which there exists a Borel set $X_0 \subset X$ such that
$$Q(X_0\mid x, a) = 1 \quad\text{and}\quad c(x,a) = 0 \quad \forall x \in X_0,\ a \in A(x), \tag{9.6.9}$$
and, in addition, the time to reach $X_0$ has finite expectation under every policy (9.6.10). The idea is that the state process $\{x_t\}$ "lives" in the complement of $X_0$, but once it reaches $X_0$ [which occurs in a finite expected time, by (9.6.10)], then, by (9.6.9), it remains there forever (it is "absorbed" by $X_0$) at zero cost. If the state and action spaces are both finite, then the transient and the absorbing models, as well as the class of so-called "contracting" models, are all equivalent; see Kallenberg [1]. $\Box$
We next show that a discounted model can be transformed into a tran-
sient model.
Consider the usual MCM $\mathcal{M} = (X, A, \{A(x)\mid x \in X\}, Q, c)$ and let $0 < \alpha < 1$ be a "discount factor". We suppose that the weight function $w$ satisfies Assumption 8.3.2(b); that is, there exists a constant $\beta$ such that $1 \le \beta < 1/\alpha$ and [as in (8.3.5)]
$$\sup_{a\in A(x)} \int_X w(y)\,Q(dy\mid x, a) \le \beta w(x) \quad \forall x \in X. \tag{9.6.11}$$
9.6.3 Proposition. Suppose that (9.6.11) holds and let $\widehat{\mathcal{M}}$ be a MCM that is the same as $\mathcal{M}$ except that the transition law $Q$ is replaced by $\widehat{Q} := \alpha Q$. Then
$$\Big\|\sum_{t=0}^{\infty} \widehat{Q}_\varphi^t\Big\|_w \le \frac{1}{1-\alpha\beta},$$
so $\widehat{\mathcal{M}}$ is transient.
Proof. By (9.6.11),
$$\int_X w(y)\,\widehat{Q}_\varphi^t(dy\mid x) = \alpha^t \int_X w(y)\,Q_\varphi^t(dy\mid x) \le (\alpha\beta)^t\, w(x),$$
so that
$$\sum_{t=0}^{\infty} \int_X w(y)\,\widehat{Q}_\varphi^t(dy\mid x) \le k\,w(x) \quad \forall x, \quad\text{with } k := 1/(1-\alpha\beta).$$
Hence
$$\sup_\pi E_x^\pi\Big(\sum_{t=0}^{\infty} |c_t|\Big) \le \bar{c}\,k\,w(x) \quad \forall x \in X; \tag{9.6.15}$$
indeed, for a randomized Markov policy $\pi = \{\varphi_t\}$,
$$E_x^\pi|c(x_t, a_t)| = \int_X |c(y, \varphi_t)|\,\widehat{Q}_\varphi^t(dy\mid x) \le \bar{c}\int_X w(y)\,\widehat{Q}_\varphi^t(dy\mid x).$$
Given a Borel set $B \in \mathcal{B}(X)$, consider the occupation time
$$\eta_B := \sum_{t=1}^{\infty} I_B(x_t). \tag{9.6.16}$$
Then, for any policy $\pi \in \Pi_{RM}$ and initial state $x \in X$, we see from (9.6.5) that
$$E_x^\pi[I_B(x_t)] = P_x^\pi(x_t \in B) = Q_\varphi^t(B\mid x),$$
so that the expected occupation time of $B$ satisfies [by (9.6.16), (9.6.8) and (9.6.7)]
$$E_x^\pi(\eta_B) = \sum_{t=1}^{\infty} Q_\varphi^t(B\mid x) < \infty.$$
such that for every randomized Markov policy $\pi = \{\varphi_t\}$ [using the notation (9.6.2)]
$$Q_\varphi^t \mathbf{1} \le \gamma_t \quad \forall t = 1,2,\dots. \tag{9.6.19}$$
More explicitly, we can write (9.6.19) as (9.6.21). Now observe that
$$Q_\varphi^0 w = w, \qquad Q_\varphi^1 w = Q_{\varphi_0} w \le \beta w + b\,\mathbf{1},$$
and for $t = 2,3,\dots$
$$Q_\varphi^t w \le \beta^t w + \beta^{t-1} b\,\mathbf{1} + b\sum_{j=1}^{t-1} \beta^{t-1-j} Q_\varphi^j \mathbf{1} \le \beta^t w + \beta^{t-1} b\,\mathbf{1} + b\sum_{j=1}^{t-1} \beta^{t-1-j} \gamma_j \quad\text{[by (9.6.19)]}.$$
It follows that
$$\sum_{t=0}^{\infty} Q_\varphi^t w \;\text{is finite},$$
so the model is transient.
B. Optimality conditions
9.6.9 Lemma. For every $x \in X$ and $n = 1,2,\dots$,
$$J_n^*(x) = \min_{a\in A(x)}\Big[c(x,a) + \int_X J_{n-1}^*(y)\,Q(dy\mid x,a)\Big] \quad \forall x \in X. \tag{9.6.23}$$
Proof. The proof follows from a direct induction argument, using Lemma 8.3.8(a) [as in the proof of Proposition 8.3.9(b)]. $\Box$
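For finite state and action sets the recursion (9.6.23) can be run directly. The following minimal sketch is an illustration only; the cost array `c` and the substochastic kernels `Q` are placeholder data:

```python
import numpy as np

def value_iteration(c, Q, n_iter=200):
    """Iterate J_n(x) = min_a [ c(x,a) + sum_y Q(y|x,a) J_{n-1}(y) ], J_0 = 0,
    as in (9.6.23). Shapes: c is (X, A); Q is (A, X, X), possibly substochastic."""
    J = np.zeros(c.shape[0])
    for _ in range(n_iter):
        J = (c.T + np.einsum('axy,y->ax', Q, J)).min(axis=0)
    return J

# Illustrative 2-state, 2-action transient model: each kernel loses mass 0.1.
c = np.array([[1.0, 2.0],
              [0.5, 1.5]])
Q = 0.9 * np.array([[[0.5, 0.5], [0.3, 0.7]],
                    [[1.0, 0.0], [0.0, 1.0]]])
print(value_iteration(c, Q))    # approximates the ETC value function
```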
From Lemma 9.6.9 and Corollary 9.6.5 we immediately deduce our main
optimality result in this section. Namely:
9.6.10 Theorem. Suppose that Assumption 9.6.8 holds. Then:
(a) The value function $V_1^*$ belongs to the space $\mathbb{B}_w(X)$ and there is a decision function $f^* \in \mathbb{F}$ such that $f^*(x) \in A(x)$ attains the minimum in the right-hand side of the optimality equation (9.5.3), i.e.,
$$V_1^*(x) = c(x, f^*) + \int_X V_1^*(y)\,Q(dy\mid x, f^*) \quad \forall x \in X. \tag{9.6.25}$$
9.6.11 Theorem. Suppose that the MCM $\mathcal{M}$ and the weight function $w$ satisfy conditions (a), (b) and (d) of Assumption 9.6.8. In addition, suppose that there is a constant $k \ge 0$ such that [using the notation (9.6.4) with $f \in \mathbb{F}$ in lieu of $\varphi \in \Phi$]
$$\sum_{t=0}^{\infty} Q_f^t w(x) \le k\,w(x) \quad \forall x \in X,\ f \in \mathbb{F}. \tag{9.6.26}$$
Then (9.6.7) holds, and so $\mathcal{M}$ is transient.
The proof of Theorem 9.6.11 is based on a clever idea of Veinott [1] (see also Pliska [1]), which consists in showing that there exists a deterministic stationary policy that maximizes the expected total "reward" function given by
$$W(\pi, x) := \sum_{t=0}^{\infty} Q_\pi^t w(x) = \sum_{t=0}^{\infty} E_x^\pi w(x_t);$$
hence
$$u^*(x) = W(f_*^\infty, x) = \sup_\Pi W(\pi, x) \quad \forall x \in X. \tag{9.6.29}$$
Proof. Let $R : \mathbb{B}_w(X) \to \mathbb{B}_w(X)$, $u \mapsto Ru$, be the operator in (9.6.27). This will give the second equality in (9.6.28) if we show that $u^*$ is indeed a function in $\mathbb{B}_w(X)$. To prove this we will use the value iteration approach. Let $\{u_n\}$ be the sequence in $\mathbb{B}_w(X)$ given by $u_0 := 0$ and $u_n := Ru_{n-1}$ for $n \ge 1$. As $R$ is a monotone operator ($u \le u'$ implies $Ru \le Ru'$), the sequence $\{u_n\}$ is nondecreasing; moreover,
$$v_j = \sum_{t=0}^{j-1} Q_f^t w + Q_f^j v_0.$$
Observe that, by (9.6.22), $Q_f^j v_0 \to 0$ since $v_0 := u_{n-1}$ is in $\mathbb{B}_w(X)$. Therefore, as $j \to \infty$,
$$v_j \uparrow \sum_{t=0}^{\infty} Q_f^t w \le k\,w \quad\text{[by (9.6.26)]},$$
which contradicts that $v_1 := u_n \not\le k\,w$. This proves (9.6.33).
Now, from (9.6.32) and (9.6.33), there exists a function $u^*$ in $\mathbb{B}_w(X)$ such that $u^* \le k\,w$ and $u_n(x) \uparrow u^*(x)$ for all $x \in X$. But, on the other hand, we also have
$$u_n(x) = Ru_{n-1}(x) \uparrow u^*(x) \quad\text{for all } x \in X;$$
hence (by Remark 9.6.13, below), $u^*$ satisfies $u^*(x) = Ru^*(x)$, which is the same as the first equality in (9.6.28), and the second equality follows from a previous remark; see (9.6.31).
Finally, the first equality in (9.6.29) follows from the Poisson equation in (9.6.28) combined with (9.6.22) and Proposition 9.5.11 (with the obvious changes in notation), whereas the second equality in (9.6.29) can be obtained from the "optimality equation" (9.6.28) by standard arguments [see for instance the proof of (9.5.29)]. $\Box$
Having Lemma 9.6.12, the proof of Theorem 9.6.11 is straightforward.
Proof of Theorem 9.6.11. As $u^* \le k\,w$, (9.6.29) yields (9.6.7), using the elementary fact that
$$\sup_y \sup_z v(y,z) = \sup_z \sup_y v(y,z). \quad\Box$$
Now let $T$ be the dynamic programming operator in (9.5.1), and let $f_{i+1} \in \mathbb{F}$ be a decision function such that
$$c(x, f_{i+1}) + \int_X V_i(y)\,Q(dy\mid x, f_{i+1}) = T V_i(x) \quad \forall x \in X, \tag{9.6.35}$$
where $V_{i+1}(\cdot) := V_1(f_{i+1}^\infty, \cdot)$. This algorithm, which starts with an arbitrary policy $f_0^\infty \in \Pi_{DS}$ and which at every step chooses the next policy $f_{i+1}$ to attain the minimum in (9.6.35), satisfies (9.6.37) and (9.6.38) for all $x \in X$; thus, by (9.5.1), each iteration improves the current policy.
For the sequence $\{f_i\}$ in (9.6.34), (9.6.35), the remark in the paragraph following the proof of Theorem 9.6.10 also holds; namely:
9.6.15 Corollary. Suppose that Assumption 9.6.8 is satisfied and let $\{f_i\}$ be the sequence of decision functions in (9.6.34), (9.6.35). Then there exists a decision function $f^* \in \mathbb{F}$ such that, for each $x \in X$, $f^*(x) \in A(x)$ is an accumulation point of $\{f_i(x)\}$ and, moreover, the deterministic stationary policy $f_*^\infty$ is ETC-optimal.
Notes on §9.6
At any rate, if we use this norm rather than $\|\cdot\|_w$ in (9.6.7), then all of the results in this section remain valid, but for MCMs with a bounded one-stage cost $c(x,a)$, i.e., such that for some constant $\bar{c}$,
$$|c(x,a)| \le \bar{c} \quad \forall (x,a) \in \mathbb{K}.$$
(a) There exists a number $k$ such that $\sum_{t=0}^{\infty} \|Q_f^t\|_w \le k$ for all $f \in \mathbb{F}$.
(b) For each $\gamma > 0$, there exists an integer $N$ such that $\|Q_f^t\|_w \le \gamma$ for all $t \ge N$ and all $f \in \mathbb{F}$.
(c) For each $\gamma > 0$, there exists an integer $t$ such that $\|Q_f^t\|_w \le \gamma$ for all $f \in \mathbb{F}$.
(d) There exist positive numbers $\gamma$ and $\delta$, with $\gamma < 1$, such that $\|Q_f^t\|_w \le \delta\gamma^t$ for all $f \in \mathbb{F}$ and $t = 0,1,\dots$.
Incidentally, since Pliska considers only models that satisfy (9.6.42), in the proof of Theorem 9.6.11 and Lemma 9.6.12 one may take $w(\cdot) \equiv 1$ in (9.6.27) and (9.6.28). On the other hand, he shows that, when using a "general" operator norm $\|\cdot\|$, if the transition kernel $Q$ is such that
$$\|Q_f\| < 1 \quad \forall f \in \mathbb{F}, \tag{9.6.43}$$
then the five conditions (a)-(d) and (9.6.26) are all equivalent. For the $w$-norm, (9.6.43) can also be written [by (7.2.8)] accordingly.
10.1 Introduction
A. Undiscounted criteria
The undiscounted problem concerns the expected total cost
$$V_1(\pi, x) := \lim_{n\to\infty} J_n(\pi, x),$$
where
$$J_n(\pi, x) := E_x^\pi\Big[\sum_{t=0}^{n-1} c(x_t, a_t)\Big], \quad n = 1,2,\dots \tag{10.1.2}$$
is the $n$-stage expected total cost. [See (9.3.14).] In this case a policy $\pi^*$ is "optimal", or ETC-optimal, if [as in (9.1.3)]
$$V_1(\pi^*, x) = \inf_\pi V_1(\pi, x) \quad \forall x \in X. \tag{10.1.3}$$
The ETC criterion, however, has at least two main drawbacks: (i) it might not be well defined for all policies, and (ii) it does not look at how the finite-horizon cost $J_n(\pi, x)$ varies with $n$.
A common way of coping with drawback (i) is to consider the long-run expected average cost (AC), already studied in Chapter 5 and also considered in the present chapter from a different perspective. But, as noted in Chapter 5, the AC criterion has the inconvenience of being extremely underselective, for it ignores what occurs in virtually any finite period of time.
Thus, to cope with (i) and (ii), it might be more "convenient" to put the undiscounted problem in the form introduced by Ramsey [1]: a policy $\pi^*$ is said to overtake a policy $\pi$ if for every initial state $x$ there exists an integer $N = N(\pi^*, \pi, x)$ such that
$$J_n(\pi^*, x) \le J_n(\pi, x) \quad \forall n \ge N. \tag{10.1.4}$$
Then a policy $\pi^*$ is called strongly overtaking optimal (strongly o.o.), or optimal in the sense of Ramsey, if $\pi^*$ overtakes any other policy $\pi$. Under suitable conditions, for instance if the sequences in (10.1.4) converge [as in (9.3.14)], strong overtaking optimality is equivalent to comparing policies with respect to the ETC criterion. In general, as is to be expected, strong overtaking optimality turns out to be extremely overselective: there are many well-known, elementary (finite-state) MCPs for which there is no strongly o.o. policy (see §10.9).
Hence, we have to go back to the original undiscounted problem and put it in a form weaker than Ramsey's. This was done by Gale [1] and von Weizsäcker [1] by introducing the notion of weak overtaking optimality: a policy $\pi^*$ is said to be weakly overtaking optimal (weakly o.o.) if for every policy $\pi$, initial state $x$, and $\varepsilon > 0$, there exists an integer $N = N(\pi^*, \pi, x, \varepsilon)$ such that
$$J_n(\pi^*, x) \le J_n(\pi, x) + \varepsilon \quad \forall n \ge N. \tag{10.1.5}$$
We thus arrive at Flynn's [1] opportunity cost of $\pi$, given the initial state $x$, which is defined as
$$OC(\pi, x) := \limsup_{n\to\infty}\,\big[J_n(\pi, x) - J_n^*(x)\big],$$
and
$$J(\pi, x) := \limsup_{n\to\infty} J_n(\pi, x)/n \tag{10.1.11}$$
is the long-run expected average cost (AC for short) when using $\pi$, given the initial state $x$. Thus, instead of (10.1.7), we have Dutta's [1] criterion
$$D(\pi, x) := \limsup_{n\to\infty}\,\big[J_n(\pi, x) - n\,J(\pi, x)\big].$$
In the terminology of Gale [1] and Dutta [1], a policy $\pi$ for which $D(\pi, \cdot)$ is finite is said to be a "good" policy.
A policy $\pi^*$ is called D-optimal, or optimal in the sense of Dutta, if
$$D(\pi^*, x) = \inf_\pi D(\pi, x) \quad \forall x \in X.$$
Again, in analogy with (10.1.9), it follows directly from the definitions that D-optimality is weaker than weak overtaking optimality, i.e.,
$$\pi^* \text{ weakly o.o.} \;\Rightarrow\; \pi^* \text{ D-optimal}. \tag{10.1.14}$$
B. AC criteria
In general, the converses of (10.1.14) and (10.1.9) do not hold. But, on the other hand, we do have that if $\pi^*$ is such that $OC(\pi^*, \cdot)$ is finite-valued, then
$$\pi^* \text{ OC-optimal} \;\Rightarrow\; \pi^* \text{ AC-optimal}, \tag{10.1.15}$$
and similarly for a D-optimal policy $\pi^*$, where AC-optimal means that
$$J(\pi^*, x) = \inf_\pi J(\pi, x) \quad \forall x \in X.$$
It follows that all of these criteria lead in an obvious manner to the AC criterion, but also that [by (10.1.17)] to find optimal policies with finite undiscounted costs it suffices to restrict ourselves to the class of AC-optimal policies.
In fact, one of the main objectives in this chapter is to show that, within the class $\Pi_{DS}$ of deterministic stationary policies, all of the following concepts are equivalent:
weakly o.o., OC-optimal, D-optimal, and bias-optimal, (10.1.18)
where bias optimality is an AC-related criterion defined in §10.3.D.
Section 10.2 introduces the Markov control model dealt with in this chapter. In §10.3 we present the main results, starting with AC-optimality and then going on to special classes of AC-optimal policies (namely, canonical and bias-optimal policies), and to the undiscounted criteria in subsection A, above.
For the sake of "continuity" in the exposition, §10.3 contains only the statements of the main results; the proofs are given in §10.4 to §10.8. The chapter closes in §10.9 with some examples.
10.1.1 Remark. Theorem 10.3.1 establishes the existence of solutions to the Average Cost Optimality Inequality (ACOI) by using the same approach already used to prove Theorem 5.4.3, namely, the "vanishing discount" approach. However, to obtain Theorem 10.3.1 we cannot just refer to Theorem 5.4.3, because the hypotheses of these theorems are different; an important difference is that the latter theorem assumes the cost-per-stage $c(x,a)$ to be nonnegative, a condition not required in the present chapter. Moreover, the hypotheses here (Assumptions 10.2.1 and 10.2.2) are on the components of the control model itself, whereas in Chapter 5 the assumptions are based on the associated $\alpha$-discounted cost problem.
10.1.2 Remark. The reader should be warned that not everyone uses the same terminology for the several optimality criteria in this chapter. For instance, what we call "strong overtaking optimality" [see (10.1.4)] is referred to as "overtaking optimality" by, say, Fernández-Gaucherand et al. [1], whereas our "weak o.o." [see (10.1.5)] is sometimes called "catching-up" in the economics literature, for example in Dutta [1] and Gale [1].
10.2 Preliminaries
Let $\mathcal{M} := (X, A, \{A(x)\mid x \in X\}, Q, c)$ be the Markov control model introduced in §8.2. In this chapter we shall impose two sets of hypotheses on $\mathcal{M}$. The first one, Assumption 10.2.1 below, is in fact a combination of hypotheses used in earlier chapters.
A. Assumptions
As usual, one of the main purposes of parts (a) through (e) in Assumption 10.2.1 is to ensure the existence of suitable "measurable selectors" $f \in \mathbb{F}$, as in, for instance, Lemma 8.3.8 and Proposition 8.3.9(b). On the other hand, the Lyapunov-like inequality in part (f) yields an "expected growth" condition on the weight function $w$, as in (8.3.29) when $b = 0$; see also Remark 8.3.5(a), Example 9.6.7, or (10.4.2) below.
We will next introduce an additional assumption according to which the Markov chains associated with deterministic stationary policies are $w$-geometrically ergodic (Definition 7.3.9), uniformly on $\Pi_{DS}$. To state this assumption it is convenient to slightly modify the notation (8.2.6) and (8.2.11) for the stochastic kernel $Q(\cdot\mid x, f)$, which will now be written as $Q_f(\cdot\mid x)$; that is, for every $f \in \mathbb{F}$, $B \in \mathcal{B}(X)$, $x \in X$, and $t = 0,1,\dots$,
$$Q_f(B\mid x) := Q(B\mid x, f(x)) \tag{10.2.2}$$
and
$$Q_f^t(B\mid x) := \int_X Q_f^{t-1}(B\mid y)\,Q_f(dy\mid x), \quad\text{with } Q_f^0(B\mid x) := I_B(x). \tag{10.2.3}$$
Observe that [by (10.2.23) below and the argument used to obtain (8.3.44)] the inequality (10.2.1) is equivalent to
$$\int_X w(y)\,Q_f(dy\mid x) \le \beta w(x) + b(x) \quad \forall f \in \mathbb{F},\ x \in X.$$
Thus, multiplying by $w(x)^{-1}$, one can see that the stochastic kernels $Q_f$ have a uniformly bounded $w$-norm. Actually, as in Remark 7.3.12, the $w$-norm of $Q_f$ is uniformly bounded, so that
$$\|\mu_f\|_w := \int w\,d\mu_f \le \|b\|/(1-\beta) \quad \forall f \in \mathbb{F}. \tag{10.2.8}$$
In particular,
$$\lim_{n\to\infty} \|Q_f^n u\|_w / n = 0 \quad \forall u \in \mathbb{B}_w(X),\ f \in \mathbb{F}, \tag{10.2.13}$$
where $J_n(f^\infty, x)$ denotes the $n$-stage expected total cost when using the deterministic stationary policy $f^\infty \in \Pi_{DS}$. Similarly, by (10.1.11), the long-run expected average cost (AC) is
$$J(f^\infty, x) = \limsup_{n\to\infty} J_n(f^\infty, x)/n. \tag{10.2.17}$$
(d) The pair $(J(f), h_f)$ in $\mathbb{R}\times\mathbb{B}_w(X)$ is the unique solution of the unichain Poisson equation
$$J(f) + h_f(x) = c_f(x) + \int_X h_f(y)\,Q_f(dy\mid x), \quad x \in X, \tag{10.2.21}$$
$$\mu_f(h_f) = 0. \tag{10.2.22}$$
Proof. Part (a) follows directly from (10.2.9) applied to $u = c_f$, together with the elementary fact
$$J_n(f^\infty, x) = \sum_{t=0}^{n-1} Q_f^t c_f(x).$$
Then
(iii) Assumption 10.2.2 holds for some constant $R \le 1 + b/(1-\rho)$, with $\rho$ as in (b$_1$), (b$_2$).
so that (10.3.1) holds.
The following theorem states the existence of a solution $(p^*, h_0)$ to the ACOI, as well as the existence of a deterministic stationary policy that is AC-optimal in $\Pi_{DS}$, i.e.,
$$p^* + h_0(x) \ge \min_{a\in A(x)}\Big[c(x,a) + \int_X h_0(y)\,Q(dy\mid x,a)\Big], \tag{10.3.9}$$
$$p^* + h^*(x) = \min_{a\in A(x)}\Big[c(x,a) + \int_X h^*(y)\,Q(dy\mid x,a)\Big], \tag{10.3.13}$$
and
$$p^* + h^*(x) = c_{f^*}(x) + \int_X h^*(y)\,Q_{f^*}(dy\mid x). \tag{10.3.14}$$
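For a finite model, a pair $(p^*, h^*)$ satisfying the ACOE (10.3.13) can be approximated numerically by relative value iteration, a standard alternative to the policy-iteration argument used in §10.5. The following is a minimal sketch under the usual unichain/aperiodicity assumptions; the data and the reference state `z` are illustrative, not from the book:

```python
import numpy as np

def relative_value_iteration(c, Q, z=0, n_iter=2000):
    """Approximate (p*, h*) in p* + h(x) = min_a [c(x,a) + sum_y Q(y|x,a) h(y)].
    c: (X, A) costs; Q: (A, X, X) stochastic kernels; z: reference state."""
    h = np.zeros(c.shape[0])
    for _ in range(n_iter):
        Th = (c.T + np.einsum('axy,y->ax', Q, h)).min(axis=0)
        p = Th[z]              # running estimate of the optimal average cost p*
        h = Th - p             # keep h(z) = 0 so the iterates stay bounded
    return p, h

c = np.array([[1.0, 2.0],
              [0.0, 3.0]])
Q = np.array([[[0.5, 0.5], [0.9, 0.1]],
              [[0.1, 0.9], [0.5, 0.5]]])
p_star, h_star = relative_value_iteration(c, Q)
print(p_star, h_star)
```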
Consider the modified $n$-stage cost
$$J_n(\pi, x, h) := E_x^\pi\Big[\sum_{t=0}^{n-1} c(x_t, a_t) + h(x_n)\Big]. \tag{10.3.16}$$
Of course, we have (10.3.17), where $J_n(\pi, x) = J_n(\pi, x, 0)$ is the $n$-stage cost in (10.1.2). The value function corresponding to $J_n(\pi, x, h)$ is
$$J_n^*(x, h) := \inf_{\pi\in\Pi} J_n(\pi, x, h). \tag{10.3.18}$$
In particular, $h^*$ is the unique solution of the ACOE (10.3.13) whose integral with respect to $\mu_{f^*}$ is zero.
D. Bias-optimal policies
This is easily seen from the second equality in (10.3.20) and the definition (10.3.18) of $J_n^*(x, h^*)$: they yield that for each decision function $f$ in $\mathbb{F}_{AC}$,
$$n p^* + h^*(x) \le J_n(f^\infty, x, h^*) = J_n(f^\infty, x) + E_x^{f^\infty} h^*(x_n) \quad \forall n, x,$$
or, equivalently,
$$J_n(f^\infty, x) - n p^* \ge h^*(x) - E_x^{f^\infty} h^*(x_n) \quad \forall n, x. \tag{10.3.32}$$
$$\int_X h^*\,d\mu_{\widehat f} = \sup_{f\in\mathbb{F}_{ca}} \int_X h^*\,d\mu_f, \tag{10.3.34}$$
or, equivalently,
$$p^* + h(x) = \min_{a\in A^*(x)}\Big[c(x,a) + \int_X h(y)\,Q(dy\mid x,a)\Big]. \tag{10.3.39}$$
The following theorem shows, among other things, that (10.3.40) together with an additional condition characterizes bias-optimal policies.
10.3.10 Theorem. (Existence and characterization of bias-optimal policies.) Suppose that the hypotheses of Theorem 10.3.6(a) are satisfied. Then:
(a) There exists a bias-optimal decision function $\widehat f \in \mathbb{F}_{\mathrm{bias}}$; moreover,
(b) $(p^*, \widehat h, \widehat f)$ is a canonical triplet [that is, it satisfies (10.3.39), or (10.3.38), and (10.3.40)], and there exists a function $h'$ in $\mathbb{B}_w(X)$ such that
$$\widehat h(x) + h'(x) = \min_{a\in A^*(x)} \int_X h'(y)\,Q(dy\mid x,a);$$
(c) conversely, if a pair $(h, f)$ satisfies the conditions in (b), then $f$ is bias-optimal and $h$ is the optimal bias function, i.e., $h = \widehat h$.
(d) The following statements are equivalent:
(d$_1$) $f \in \mathbb{F}$ is bias-optimal.
(d$_2$) $f \in \mathbb{F}$ is a canonical decision function and
Furthermore (as shown in the proof of the theorem, in §10.7), case (i) occurs when the bias-minimization problem is viewed as an "average cost" problem, and then (i) is a direct consequence of Theorem 10.3.6(a). Case (ii), on the other hand, appears when bias minimization is posed as an "expected total cost (ETC)" problem, which is done in Remark 10.7.1. The latter remark provides a second proof of Theorem 10.3.10(a), and it is based on the ETC results in §9.5. Interestingly enough, our (first) proof of Theorem 10.3.10(a), following an "average cost" approach [see (10.7.3), (10.7.4)], is basically the same as Nowak's [1] proof of the existence (in $\Pi_{DS}$) of weakly overtaking optimal policies! It was precisely this observation that suggested the equivalence of the several optimality concepts in (10.1.18), which is the content of Theorem 10.3.11 below.
E. Undiscounted criteria
$$D(f^\infty, x) = \inf_{g^\infty\in\Pi_{DS}} D(g^\infty, x) \quad\text{and}\quad D(f^\infty, x) < \infty \quad \forall x \in X.$$
Notes on §10.3
1. All of the results in this section are essentially from Vega-Amaya [2],
and Hernandez-Lerma and Vega-Amaya [1]. Theorems 10.3.1 and 10.3.6 are
also obtained in Gordienko and Hernandez-Lerma [2] but under additional
assumptions. In particular, the latter reference requires the cost-per-stage
c(x, a) to be nonnegative, which allows a direct application of the Abelian
theorem (10.4.13) to obtain the result mentioned in Remark 10.3.2. Moreover, the ACOE (10.3.13) is obtained via the Ascoli Theorem, which of course requires imposing suitable "equicontinuity" hypotheses on the control model. The proof of the ACOE presented here (in §10.5) uses a "policy iteration" argument instead of the Ascoli Theorem.
For additional comments (with references) on how to obtain the ACOE
see the Notes on §5.5.
2. Concerning the relation (10.3.2), we may recall from §5.2 that there are intermediate optimality concepts between "canonical" and "AC-optimal". For instance, a policy $\pi^* \in \Pi$ is said to be F-strong AC-optimal (or strong AC-optimal in the sense of Flynn [1]) if
3. Examples by Brown [1] and Nowak and Vega-Amaya [1] show that,
without additional assumptions, the results in Theorem 10.3.11 and Corol-
lary 10.3.12 cannot be extended to class II of all policies. (See Remark
10.9.2.)
4. Haviv and Puterman [1] use bias optimality to distinguish between
two AC-optimal policies for a certain admission control queueing system.
To discriminate AC-optimal policies one can use the minimum average
variance (see §11.3) instead of the minimum bias.
As the function $b(\cdot)$ in (10.2.1) satisfies $0 \le b(x) \le \|b\|$ for all $x \in X$, we will assume that $b(\cdot)$ is a constant, to be denoted by $b$ again, i.e., $b(\cdot) \equiv b$. Thus, instead of (10.2.1) we now have
$$\sup_{a\in A(x)} \int_X w(y)\,Q(dy\mid x,a) \le \beta w(x) + b \quad \forall x \in X, \tag{10.4.1}$$
which yields the bounds (10.4.2) and (10.4.3) below.
Proof. As in (8.3.31),
$$E_x^\pi[w(x_t)\mid h_{t-1}, a_{t-1}] = \int w(y)\,Q(dy\mid x_{t-1}, a_{t-1}) \le \beta w(x_{t-1}) + b \quad\text{[by (10.4.1)]}.$$
Hence, taking the expectation $E_x^\pi(\cdot)$,
$$E_x^\pi w(x_t) \le \beta\, E_x^\pi w(x_{t-1}) + b,$$
which iterated gives the first inequality in (10.4.2). The second inequality in (10.4.2) is obvious (recall that $w \ge 1$).
To obtain (10.4.3) it suffices to note that, by Assumption 10.2.1(d),
$$|c(x,a)| \le \bar{c}\,w(x) \quad \forall (x,a) \in \mathbb{K}. \tag{10.4.4}$$
and
$$V_\alpha^*(x) := \inf_\Pi V_\alpha(\pi, x). \tag{10.4.6}$$
From (10.4.3) it is evident that $V_\alpha(\pi, \cdot)$ is bounded in $w$-norm, and
$$|V_\alpha^*(x)| \le \widehat{b}\,w(x)/(1-\alpha), \quad\text{with } \widehat{b} := \bar{c}\,[1 + b/(1-\beta)]. \tag{10.4.7}$$
Thus, for each fixed $0 < \alpha < 1$, both functions $V_\alpha(\pi, \cdot)$ and $V_\alpha^*(\cdot)$ belong to $\mathbb{B}_w(X)$. On the other hand, note that the inequality (10.4.1) is of the same form as (8.3.11). Therefore, in view of Remark 8.3.5(a), all the results of Chapter 8 are valid in our present context. This means, in particular, that, by Theorem 8.3.6(b), we may rewrite $V_\alpha^*$ in (10.4.6) as an infimum over the class of deterministic stationary policies, i.e.,
$$V_\alpha^*(x) = \inf_{f\in\mathbb{F}} V_\alpha(f^\infty, x).$$
Now fix an arbitrary state $z$ in $X$, and for every $0 < \alpha < 1$ consider the function
$$u_\alpha(x) := V_\alpha^*(x) - V_\alpha^*(z). \tag{10.4.9}$$
We will next show that $u_\alpha$ belongs to the space $\mathbb{B}_w(X)$ for all $0 < \alpha < 1$.
10.4.2 Lemma. Let $z \in X$ be the (fixed) state in (10.4.9). Then for every $f^\infty$ in $\Pi_{DS}$, $x \in X$, and $t = 0,1,\dots$,
$$\big|E_x^{f^\infty} c_f(x_t) - E_z^{f^\infty} c_f(x_t)\big| \le \bar{c}\,R\,\rho^t\,[1 + w(z)]\,w(x), \tag{10.4.10}$$
and
$$|u_\alpha(x)| \le \bar{c}\,R\,(1-\rho)^{-1}[1 + w(z)]\,w(x). \tag{10.4.12}$$
and
$$p^* \le \inf_{\Pi_{DS}} J(f^\infty, x) = \inf_{\mathbb{F}} J(f) \quad \forall x. \tag{10.4.15}$$
If, in addition, $c(x,a)$ is nonnegative, then (10.4.16) holds.
Proof. Let $z \in X$ be the (fixed) state in (10.4.9), and for every $0 < \alpha < 1$ define
$$p(\alpha) := (1-\alpha)\,V_\alpha^*(z). \tag{10.4.17}$$
By (10.4.7), $p(\alpha)$ is bounded, since [with $\widehat{b}$ as in (10.4.7)]
$$|p(\alpha)| \le \widehat{b}\,w(z).$$
To prove that $p^*$ satisfies (10.4.14), observe that (10.4.9) and (10.4.17) yield
$$(1-\alpha)\,V_\alpha^*(x) = p(\alpha) + (1-\alpha)\,u_\alpha(x).$$
Hence, as $\pi$ and $x$ were arbitrary, the latter inequality and (10.4.14) give (10.4.16).
To complete the proof of the lemma, let us consider (10.4.15). We cannot proceed as in (10.4.19), because now $E_x^\pi c(x_t, a_t)$ may take negative values; instead, write the $\alpha$-discounted cost as
$$V_\alpha(\pi, x) = \sum_{t=0}^{\infty} \alpha^t\, E_x^\pi c(x_t, a_t).$$
Moreover, let $\{\alpha(n)\}$ be a sequence of discount factors increasing to 1, and define
$$h_0(x) := \liminf_{n\to\infty} u_{\alpha(n)}(x), \quad x \in X. \tag{10.4.25}$$
It is also clear that any decision function $f_0$ that satisfies (10.3.10) also satisfies (10.4.27) and, therefore, (10.3.11). This completes the proof of Theorem 10.3.1. $\Box$
10.4.4 Remark. If $c(x,a)$ is nonnegative, then the conclusion also follows from (10.4.27) and (10.4.16). $\Box$
We wish to show that there exists a canonical triplet or, equivalently (by Theorem 10.3.4), a solution $(p^*, h^*, f^*)$ to the ACOE (10.3.13), (10.3.14), with $h^*$ in $\mathbb{B}_w(X)$.
It will be convenient to use the dynamic programming operator $T$ in (9.5.1), with "min" instead of "inf", to write (10.3.13) in the form
$$p^* + h^*(x) = Th^*(x), \quad x \in X. \tag{10.5.1}$$
Moreover, to simplify the notation, given a sequence $\{f_n\}$ in $\mathbb{F}$ we shall write
$$c_{f_n},\ h_{f_n},\ Q_{f_n},\ \mu_{f_n},\ \dots \quad\text{as}\quad c_n,\ h_n,\ Q_n,\ \mu_n,\ \dots, \tag{10.5.2}$$
respectively, where $h_n \in \mathbb{B}_w(X)$ is the solution to the Poisson equation (10.2.21), (10.2.22) for $f_n$.
Now, to begin the proof itself, let $p^* \in \mathbb{R}$, $h_0 \in \mathbb{B}_w(X)$, and $f_0 \in \mathbb{F}$ be as in Theorem 10.3.1. In particular, as in (10.3.11) we have
$$J(f_0) = p^* = \inf_{\mathbb{F}} J(f), \tag{10.5.3}$$
and so we can write the Poisson equation (10.2.21) for $f_0$ as
and
$$h_0(\cdot) > h_1(\cdot) + \Delta_1 \quad\text{on } N_1^c,$$
where $N_1^c := X - N_1$ denotes the complement of $N_1$.
Repeating this procedure we obtain sequences $\{f_n\}$ in $\mathbb{F}$, $\{h_n\}$ in $\mathbb{B}_w(X)$, and $\{N_n\}$ in $\mathcal{B}(X)$ for which the following holds: for every $x \in X$ and $n = 0,1,\dots$ [and using the notation (10.5.2)]:
(i) $J(f_n) = p^*$;
(ii) $(p^*, h_n)$ satisfies the Poisson equation
$$p^* + h_n(x) = c_n(x) + \int_X h_n(y)\,Q_n(dy\mid x);$$
and define
$$N^* := \bigcap_{n=1}^{\infty} N_n. \tag{10.5.10}$$
Then
$$\sum_{n=1}^{\infty} \lambda(N_n^c) = 0,$$
and
$$h_n(x^*) = h_{n+1}(x^*) + \Delta_{n+1},$$
which implies that the functions $h_n'$ (suitable translates of the $h_n$) are well defined.
Finally, to obtain (10.5.1), first note that the Poisson equation (10.5.7) remains valid if we replace $h_n$ by $h_n'$; in particular,
$$p^* + h_n'(x) \ge \min_{a\in A(x)}\Big[c(x,a) + \int_X h_n'(y)\,Q(dy\mid x,a)\Big] \quad \forall x \in X.$$
Therefore, letting $n \to \infty$, (10.5.12) and the Fatou Lemma (8.3.8) yield
Hence
$$h_f(x) - h^*(x) \ge \int_X [h_f(y) - h^*(y)]\,Q_f(dy\mid x) \quad \forall x, \tag{10.5.18}$$
and hence (10.5.19), so that integration with respect to the i.p.m. $\mu_{n+1}$ yields [by (10.2.18) or Proposition 10.2.3(b)]
$$J(f_{n+1}) \le J(f_n); \tag{10.5.20}$$
that is, the sequence of average costs $J(f_n)$ is nonincreasing. Moreover, it is obviously bounded, since [by Assumption 10.2.1(d), (10.2.12), and (10.2.18)]
for some function $h$ on $X$. Then the PIA converges; in fact, the pair
$$(C_n, h_n)$$
is called the PIA's discrepancy pair at the $n$th iteration. Similarly, from (10.5.20) we get the cost decrease $C_n := J(f_n) - J(f_{n+1})$, which can also be written as
$$C_n = \int_X D_n\,d\mu_{n+1}, \tag{10.5.24}$$
which means that the pair $(C_n, h_n)$ is a solution to the Poisson equation for the transition kernel $Q_{n+1}(\cdot\mid x) = Q(\cdot\mid x, f_{n+1})$ with "cost" (or charge) function $D_n$. This fact can be used to prove the convergence of the PIA, at least when the state space $X$ is a finite set; see, for instance, Puterman [1, §8.6]. Alternatively, one could try to show that $D_n(x) \to 0$ for all $x \in X$ as $n \to \infty$.
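In the finite unichain case each PIA step is a linear-algebra computation: solve the Poisson equation (10.2.21), (10.2.22) for the current decision function and then improve it. The following is a minimal sketch with illustrative data, not the book's algorithmic statement:

```python
import numpy as np

def stationary_dist(P):
    """Invariant distribution of an irreducible row-stochastic matrix P."""
    X = P.shape[0]
    A = np.vstack([P.T - np.eye(X), np.ones(X)])
    b = np.zeros(X + 1); b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

def policy_iteration(c, Q, n_iter=50):
    """c: (X, A) costs, Q: (A, X, X) kernels. Returns (J(f), h_f, f)."""
    X = c.shape[0]
    f = np.zeros(X, dtype=int)
    for _ in range(n_iter):
        Pf = Q[f, np.arange(X), :]           # transition matrix under f
        cf = c[np.arange(X), f]
        mu = stationary_dist(Pf)
        rho = mu @ cf                         # average cost J(f)
        # Poisson equation (I - Pf) h = cf - rho, normalized by mu @ h = 0.
        h = np.linalg.lstsq(np.vstack([np.eye(X) - Pf, mu]),
                            np.append(cf - rho, 0.0), rcond=None)[0]
        f_new = (c.T + np.einsum('axy,y->ax', Q, h)).argmin(axis=0)
        if np.array_equal(f_new, f):
            break                             # discrepancy D_n vanished
        f = f_new
    return rho, h, f

c = np.array([[1.0, 2.0], [0.0, 3.0]])
Q = np.array([[[0.5, 0.5], [0.9, 0.1]],
              [[0.1, 0.9], [0.5, 0.5]]])
print(policy_iteration(c, Q))
```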
$$p^* + h_1(x) = \min_{a\in A(x)}\Big[c(x,a) + \int_X h_1(y)\,Q(dy\mid x,a)\Big] \tag{10.6.1}$$
and
$$p^* + h_2(x) = \min_{a\in A(x)}\Big[c(x,a) + \int_X h_2(y)\,Q(dy\mid x,a)\Big]. \tag{10.6.2}$$
Hence
$$h_1(x) - h_2(x) \ge \int_X [h_1(y) - h_2(y)]\,Q_1(dy\mid x) \quad \forall x,$$
so that [as in the argument used after (10.5.6) or to obtain (10.5.16)] Lemma 7.5.12(a) yields the existence of a set $N_1 \in \mathcal{B}(X)$ and a constant $k_1$ such that $\mu_1(N_1) = 1$ and
$$h_1 - h_2 = k_1 \quad\text{on } N_1.$$
We now repeat the above argument, interchanging the roles of (10.6.1) and (10.6.2) and using part (b) of Lemma 7.5.12 instead of part (a). That is, we take a decision function $f_2 \in \mathbb{F}$ that attains the minimum in (10.6.2), and we get a set $N_2 \in \mathcal{B}(X)$ and a constant $k_2$ such that $\mu_2(N_2) = 1$ and
$$h_1 - h_2 = k_2 \quad\text{on } N_2.$$
This fact and (10.7.1) yield that $\widehat f$ is bias-optimal.
Proof of (b). The bias-optimal decision function $\widehat f$ in (a) is canonical, and so it satisfies (10.3.39) [or (10.3.38)] and (10.3.40); that is, $(p^*, \widehat h, \widehat f)$ is a canonical triplet (see Theorem 10.3.4). Moreover, as was already mentioned in the proof of (a), $\mathcal{M}_{\mathrm{bias}}$ satisfies the hypotheses of Theorem 10.3.6(a). Therefore, there exist a function $h'$ in $\mathbb{B}_w(X)$ and a canonical decision function $f'$ in $\mathbb{F}_{ca}$ such that $(\widehat p, h', f')$ is a canonical triplet for $\mathcal{M}_{\mathrm{bias}}$, i.e. (by Theorem 10.3.4),
$$\widehat p + h'(x) = \min_{a\in A^*(x)}\Big[c'(x,a) + \int_X h'(y)\,Q(dy\mid x,a)\Big],$$
or [by (10.7.2)]
$$\widehat p + h'(x) = \min_{a\in A^*(x)}\Big[-h^*(x) + \int_X h'(y)\,Q(dy\mid x,a)\Big],$$
which [by (10.7.4) and (10.7.1)] yield (10.3.41) and (10.3.42), respectively. Finally, integrating both sides of (10.7.5) with respect to the i.p.m. $\mu_{f'}$, we get
$$\widehat p = \int_X (-h^*)\,d\mu_{f'} \quad\text{and}\quad \int_X \widehat h\,d\mu_{f'} = 0. \tag{10.7.6}$$
From this fact and the second equality in (10.3.43) we obtain (10.7.7), which together with (10.3.34) yields that $f$ is bias-optimal and that $h = h_f = \widehat h$.
Proof of (d). (d$_1$) $\Rightarrow$ (d$_2$). If $f$ is bias-optimal, then [by (10.3.31)] $f$ is in $\mathbb{F}_{ca}$ and
$$\widehat h(\cdot) = h^*(\cdot) - \int_X h^*\,d\mu_f.$$
Subtracting the latter equation from (10.3.40) we see that the function $u(\cdot) := \widehat h(\cdot) - h_f(\cdot)$ is invariant with respect to $Q_f$; hence [by (7.5.4) or (10.3.21)] the bias can be expressed as an ETC,
$$h_f(x) = \sum_{t=0}^{\infty} E_x^{f^\infty}\big[c_f(x_t) - p^*\big].$$
Therefore, by Theorem 9.5.12, to conclude that $\widehat f \in \mathbb{F}_{ca}$ is bias-optimal it only remains to verify that Assumption 9.5.2 holds in the present context, and also that $\widehat f$ and $\widehat h$ satisfy (9.5.38), i.e.,
$$\limsup_{n\to\infty} E_x^{\widehat f^\infty} \widehat h(x_n) = 0 \quad \forall x \in X. \tag{10.7.12}$$
Taking the one-stage cost and the bounding function as $c_{\widehat f}(\cdot) - p^*$ and
$$W(x) := \bar{c}\,R\,w(x)/(1-\rho) \quad\text{[see (10.2.20)]},$$
respectively, we obtain parts (b) and (c) in Assumption 9.5.2.
Finally, (10.7.12) follows from (10.7.11), which gives [as in (10.3.21) or (7.5.4)]
$$\sum_{t=n}^{\infty} E_x^{\widehat f^\infty}\big[c_{\widehat f}(x_t) - p^*\big] \to 0 \quad\text{as } n \to \infty \quad\text{[by (10.7.8)]}.$$
$$\pi^* \text{ weakly o.o.} \;\Rightarrow\; \pi^* \text{ OC-optimal} \;\Rightarrow\; \pi^* \text{ AC-optimal}, \tag{10.8.2}$$
where the second implication holds if $\pi^*$ has a finite opportunity cost.
To obtain further relations between the above concepts, let $J_n^*(x)$ and $J^*(x)$ be as in (10.1.16) and (10.1.10), respectively, and define the upper and lower limit functions
$$\overline{J}(x) := \limsup_{n\to\infty}\,[J_n^*(x) - n p^*], \qquad \underline{J}(x) := \liminf_{n\to\infty}\,[J_n^*(x) - n p^*]. \tag{10.8.6}$$
Thus, to prove that (c) $\Rightarrow$ (d) we need to show that, with $f^\infty$ as in (c),
$$\limsup_{n\to\infty}\,\big[J_n(f^\infty, x) - J_n(g^\infty, x)\big] \le 0 \quad \forall g^\infty \in \Pi_{DS},\ x \in X. \tag{10.8.12}$$
10.9 Examples
The calculations in Example 10.9.1 can be done using the value iteration equation $J_n^* = T J_{n-1}^*$ in (9.5.4) for the optimal $n$-stage cost, i.e., for each $x \in X$ and $n = 1,2,\dots$,
$$J_n^*(x) = \min_{a\in A(x)}\Big[c(x,a) + \int_X J_{n-1}^*(y)\,Q(dy\mid x,a)\Big], \tag{10.9.1}$$
with $J_0^*(\cdot) := 0$. We will also use the fact that for a deterministic stationary policy $f^\infty$ the $n$-stage expected total cost satisfies the recursion
$$J_n(f^\infty, x) = c(x, f) + \int_X J_{n-1}(f^\infty, y)\,Q(dy\mid x, f), \tag{10.9.2}$$
equivalently,
$$J_n(f^\infty, x) = \sum_{t=0}^{n-1} Q_f^t c_f(x). \tag{10.9.3}$$
Here the second action has
$$c(x,2) = 0 \quad\text{and}\quad Q(x+1\mid x, 2) = 1. \tag{10.9.6}$$
From (10.9.4) and (10.9.3), $J_n(\pi, 0) = 0$ for each policy $\pi$ and $n = 0,1,\dots$, so that $J_n^*(0) = 0$ for all $n$. Moreover, as $J_0^*(\cdot) := 0$, we can use (10.9.1) to obtain $J_n^*$ recursively.
Let us now consider the decision functions $f^*(x) := 2$ and $f(x) := 1$ for all $x \in X$. Then, by (10.9.2),
$$J_n(f_*^\infty, x) = 0 \quad \forall n \ge 0,\ x \in X.$$
We can also see that $f_*^\infty$ is a canonical policy (i.e., $f_*$ is in $\mathbb{F}_{ca}$), since $(p^*, h^*, f_*)$ with
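The $n$-stage costs of a fixed deterministic stationary policy obey the linear recursion $J_n = c_f + Q_f J_{n-1}$ [cf. (10.9.2), (10.9.3)], which makes computations such as the ones above easy to check numerically. A minimal sketch; the cost vector and kernel below are placeholders, not the data of Example 10.9.1:

```python
import numpy as np

def n_stage_costs(cf: np.ndarray, Pf: np.ndarray, n: int) -> np.ndarray:
    """J_n(f^infinity, .) via J_n = c_f + P_f J_{n-1}, with J_0 = 0."""
    J = np.zeros_like(cf)
    for _ in range(n):
        J = cf + Pf @ J
    return J

cf = np.array([0.0, 1.0, 1.0])          # illustrative one-stage costs under f
Pf = np.array([[1.0, 0.0, 0.0],          # illustrative transition kernel under f
               [1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0]])
print(n_stage_costs(cf, Pf, n=5))
```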
We again suppose that the demand process $\{z_t\}$ satisfies Assumption 8.6.1 and condition 8.6.3, but in addition we will now suppose that, with $\bar{z} := E(z_0)$,
$$\theta < \bar{z}. \tag{10.9.8}$$
[Note that (10.9.8) states that the average demand $\bar{z}$ should exceed the maximum allowed production $\theta$. Hence it excludes some frequently encountered cases that require the opposite, $\theta \ge \bar{z}$.] By the results in Example 8.6.2 we see that all of the conditions (a) to (f) in Assumption 10.2.1 are satisfied, except that the constant $\beta$ in (10.2.1) is greater than 1; in fact, after (8.6.14) we obtained $\beta = 1 + c$. We will next use the new assumption (10.9.8) to see that $\beta$ can be chosen to satisfy $\beta < 1$, as required in (10.2.1).
Let $\psi(r) := E\exp[r(\theta - z_0)]$, $r \ge 0$, be the moment generating function of $\theta - z_0$. Then, as $\psi(0) = 1$ and $\psi'(0) = E(\theta - z_0) = \theta - \bar{z} < 0$ [by (10.9.8)], there is a positive number $r_*$ such that
$$\psi(r_*) < 1.$$
[Compare the latter inequality with (8.6.11).] Therefore, defining the new weight function
$$w(x) := \exp[r_*(x + 2\bar{z})], \quad x \in X, \tag{10.9.9}$$
we see that $w(\cdot)$ satisfies (8.6.13) and (8.6.14) when $r$ is replaced by $r_*$. In particular, (8.6.14) becomes
$$\int_X w(y)\,Q(dy\mid x,a) \le \beta w(x) + b \quad \forall (x,a), \tag{10.9.10}$$
with
$$\beta := \psi(r_*) < 1 \quad\text{and}\quad b := w(0). \tag{10.9.11}$$
Thus, as (10.9.10)-(10.9.11) yield (10.2.1), we have that Assumption 10.2.1 holds in the present case. [Alternatively, to verify (10.2.1) we could use (10.9.20) below, because $w(\cdot) \ge 1$.]
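A suitable $r_*$ with $\psi(r_*) < 1$ can be located numerically once the demand distribution is specified. A minimal sketch, assuming purely for illustration that $z_0$ is exponentially distributed with mean $\bar{z} > \theta$, so that the moment generating function has the closed form used below:

```python
import numpy as np

theta, zbar = 1.0, 2.0            # production bound and mean demand, theta < zbar

def psi(r: float) -> float:
    """MGF of theta - z0 for exponential demand with mean zbar:
    E exp[r(theta - z0)] = exp(r*theta) / (1 + r*zbar)."""
    return np.exp(r * theta) / (1.0 + r * zbar)

# psi(0) = 1 and psi'(0) = theta - zbar < 0, so psi dips below 1 for small r > 0;
# scan a grid and pick a minimizer r_* with psi(r_*) < 1.
rs = np.linspace(1e-3, 1.0, 1000)
r_star = rs[np.argmin([psi(r) for r in rs])]
assert psi(r_star) < 1.0
print(r_star, psi(r_star))
```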
We will next verify Assumption 10.2.2 (using Proposition 10.2.5) and Assumption 10.3.5. We begin with the following lemma.
10.9.4 Lemma. For each decision function $f \in \mathbb{F}$, let $\{x_t^f,\ t = 0,1,\dots\}$ be the Markov chain defined by (8.6.5) when $a_t := f(x_t)$ for all $t$, i.e.,
$$x_{t+1}^f = \big(x_t^f + f(x_t^f) - z_t\big)^+, \quad t = 0,1,\dots. \tag{10.9.12}$$
Then, for each $f \in \mathbb{F}$, $\{x_t^f\}$ is positive recurrent, and so it has a unique i.p.m. $\mu_f$.
Proof. Let $\{x_t^\theta\}$ be the Markov chain given by (10.9.12) when $f(x) := \theta$ for every state $x$, i.e.,
$$x_{t+1}^\theta = (x_t^\theta + \theta - z_t)^+, \quad \forall t = 0,1,\dots. \tag{10.9.13}$$
This chain is driven by the i.i.d. variables $y_t := \theta - z_t$. Hence, as $E|y_0| \le \theta + \bar{z} < \infty$ and $E(y_0) = \theta - \bar{z} < 0$ [by (10.9.8)], the Markov chain $\{x_t^\theta\}$ is positive recurrent (see Example 7.4.2). This implies in particular that $E_0(\tau_0) < \infty$, where $\tau_0$ denotes the time of first return to $x = 0$ given the initial state $x_0^\theta = 0$.
Now choose an arbitrary decision function $f \in \mathbb{F}$, and let $\tau_f$ be the time of first return to $x = 0$ given $x_0^f = 0$. By (10.9.7), $f(x) \le \theta$ for all $x \in X$ and, therefore, $x_t^f \le x_t^\theta$ for all $t = 0,1,\dots$. This implies that
$$\tau_f \le \tau_0,$$
which yields that $\{x_t^f\}$ is positive (in fact, positive Harris) recurrent; see, for instance, Corollary 5.13 in Nummelin [1]. Thus, as $f \in \mathbb{F}$ was arbitrary, the lemma follows. $\Box$
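The drift argument in the proof is easy to visualize by simulation; a minimal sketch of the chain (10.9.13) with exponential demand and $\theta < \bar{z}$ (all numerical values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, zbar, T = 1.0, 2.0, 100_000     # theta < zbar, as in (10.9.8)

x, returns = 0.0, 0
for _ in range(T):
    z = rng.exponential(zbar)           # demand z_t with mean zbar
    x = max(x + theta - z, 0.0)         # (10.9.13) with f(x) := theta
    returns += (x == 0.0)

# With negative drift theta - zbar < 0 the chain keeps returning to 0,
# illustrating the positive (Harris) recurrence used in Lemma 10.9.4.
print(returns / T)                       # empirical frequency of visits to x = 0
```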
Lemma 10.9.4 gives the existence of the invariant probability measures $\mu_f$ required in Proposition 10.2.5. We will next verify the hypotheses (i)-(iv).
Let $\delta_0$ be the Dirac measure at $x = 0$, and define the required minorization in terms of $\delta_0$ for each $f \in \mathbb{F}$, $x \in X$, and $B \in \mathcal{B}(X)$; then hypothesis (i) in Proposition 10.2.5 is satisfied.
On the other hand, hypotheses (ii) and (iii) follow from (10.9.15).
Finally, hypothesis (iv) follows from (8.6.13) with $r_*$ in lieu of $r$, which gives
which is a random walk of the form (10.9.14) [or (7.4.4)]. Hence, exactly as in Lemma 10.9.4, we can verify that the Markov chain $\{x_t^f\}$, $f \in \mathbb{F}$, obtained from (8.6.16) with $a_t := f(x_t)$ for all $t = 0,1,\dots$, is positive recurrent. For that reason, $\{x_t^f\}$ has a unique i.p.m. $\mu_f$ for each $f \in \mathbb{F}$.
11.1 Introduction
In this chapter we study AC-related criteria, some of which have already
been studied in previous chapters from a different viewpoint. We begin by
introducing some notation and definitions, and then we outline the contents
of this chapter.
A. Definitions
Let
$$J_n(\pi, \nu) := E_\nu^\pi\Big[\sum_{t=0}^{n-1} c(x_t, a_t)\Big] \tag{11.1.1}$$
be the expected $n$-stage total cost when using the control policy $\pi$, given the initial distribution $\nu \in \mathcal{P}(X)$. We can also write (11.1.1) as
$$J_n(\pi, \nu) = \int_X J_n(\pi, x)\,\nu(dx), \tag{11.1.2}$$
where $J_n(\pi, x)$ is given by (11.1.1) when $\nu = \delta_x$ (the Dirac measure concentrated at $x_0 = x$), or as
$$J_n(\pi, \nu) = E_\nu^\pi\big[J_n^0(\pi, \nu)\big], \tag{11.1.3}$$
where
$$J_n^0(\pi, \nu) := \sum_{t=0}^{n-1} c(x_t, a_t) \tag{11.1.4}$$
is the pathwise (or sample-path) $n$-stage total cost.
In addition to the usual "limit supremum" expected average cost (AC)
$$J(\pi, x) := \limsup_{n\to\infty} J_n(\pi, x)/n,$$
where
$$J^*(x) := \inf_\Pi J(\pi, x) \tag{11.1.11}$$
is the optimal expected AC function (also known as the AC value function), we now wish to relate AC optimality with the performance criteria in the following definition.
11.1.1 Definition. Let $\pi^*$ be a control policy and $\nu^*$ an initial distribution. Then:
(a) $(\pi^*, \nu^*)$ is a minimum pair if
$$J(\pi^*, \nu^*) = p_{\min},$$
where
$$p_{\min} := \inf_{\nu\in\mathcal{P}(X)} J^*(\nu) = \inf_{\nu\in\mathcal{P}(X)}\ \inf_{\pi\in\Pi} J(\pi, \nu) \tag{11.1.12}$$
is the minimum average cost, and [as in (11.1.11)] $J^*(\nu) := \inf_\Pi J(\pi, \nu)$;
and, furthermore, the corresponding condition holds with $J'$ as in (11.1.7).
11.1.2 Remark. The optimality concepts in Definition 11.1.1(a), (c) were
already introduced in Chapter 5. On the other hand, Definition 11.1.1(b)
is related to pathwise AC optimality in Definition 5.7.6(b), which only re-
quires (11.1.13).
In §11.4 we study a class of Markov control problems for which there
exists a sample path AC optimal policy. In general, however, Definition
11.1.1(b) turns out to be extremely demanding in the sense that requiring
(11.1.13) and (11.1.14) to hold for all 11" E II and all v E P(X) is a very
strong condition. It is thus convenient to consider a weaker form of sample
path AC optimality as follows.
11.1.3 Definition. Let $\widehat\Pi \subset \Pi$ be a subclass of control policies, and $\widehat{\mathcal{P}}(X) \subset \mathcal{P}(X)$ a subclass of probability measures ("initial distributions") on $X$. Let
$$\widehat p := \inf_{\nu\in\widehat{\mathcal{P}}(X)}\ \inf_{\pi\in\widehat\Pi} J(\pi, \nu). \tag{11.1.15}$$
A policy $\widehat\pi$ is sample path AC-optimal with respect to $\widehat\Pi$ and $\widehat{\mathcal{P}}(X)$ if
$$J^0(\widehat\pi, \nu) = \widehat p \quad P_\nu^{\widehat\pi}\text{-a.s.}$$
and
$$J^0(\pi, \nu) \ge \widehat p \quad P_\nu^\pi\text{-a.s.} \quad \forall \pi \in \widehat\Pi,\ \nu \in \widehat{\mathcal{P}}(X). \tag{11.1.17}$$
If $\widehat\Pi = \Pi$ and $\widehat{\mathcal{P}}(X) = \mathcal{P}(X)$, then, as in Definition 11.1.1(b), we simply say that $\widehat\pi$ is sample path AC-optimal.
For example, in §11.3 we consider a class of Markov control problems in which there exists a sample path AC-optimal policy with respect to $\Pi$ and a suitable subclass of initial distributions; namely, we have
$$J^0(\pi^*, x) = p^* \quad P_x^{\pi^*}\text{-a.s.} \quad \forall x \in X \tag{11.1.24}$$
and
$$J^0(\pi, x) \ge p^* \quad P_x^\pi\text{-a.s.} \quad \forall \pi \in \Pi,\ x \in X. \tag{11.1.25}$$
Note that the condition "for all $x \in X$" in (11.1.24) and (11.1.25) can also be expressed as "for all $\nu$ in $\mathcal{P}_\delta(X)$", with $\mathcal{P}_\delta(X)$ as in (11.1.19).
B. Outline of the chapter
The rest of the chapter consists of four sections. Section 11.2 presents
background material on positive Harris recurrence and the limiting average
variance (11.1.8). The reader may go directly to §11.3 and refer to the con-
cepts and results in §11.2 as they are needed. In §11.3 we consider a Markov
control model which is "w-geometrically ergodic" (Definition 11.3.1) with
respect to some weight function w. In this case we show the existence of de-
terministic stationary policies that satisfy (11.1.20) and (11.1.21), whereas
under an additional condition (Assumption 11.3.4) they satisfy (11.1.24)
and (11.1.25), and, moreover, the concepts in Definition 11.1.1 turn out to
be "essentially" equivalent (see Theorem 11.3.5 for a precise statement).
Also in §11.3 we prove the existence of a policy that minimizes the limiting
average variance within the class of canonical policies (Theorem 11.3.8).
In §11.4 we turn our attention to Markov control models with a strictly unbounded cost-per-stage function $c(x,a)$ [see Assumption 11.4.1(c) and Remark 11.4.2(a)]. The main result in that section, Theorem 11.4.6, in particular gives conditions ensuring the existence of a sample path AC-optimal policy.
The chapter concludes in §11.5 with some examples that illustrate the
results of §11.3 and §11.4.
11.2 Preliminaries
This section reviews background material that can be omitted on a first
reading; the reader may refer to it as needed.
A. Positive Harris recurrence
Let {xt, t = 0,1, ... } be a time-homogeneous X-valued Markov chain
with transition probability function $P(B\mid x)$. The chain is said to be positive Harris recurrent if:
(i) $\{x_t\}$ is Harris recurrent [Definition 7.3.1(b)], and
(ii) it has an i.p.m., which [by Theorem 7.3.4(a)] is necessarily the unique i.p.m. of $\{x_t\}$.
The next theorem presents, in particular, in parts (a) and (b), two characterizations of positive Harris recurrence.
11.2.1 Theorem. (Characterization and properties of positive Harris recurrence.)
(a) Suppose that $\{x_t\}$ has an i.p.m. $\mu$. Then the chain is positive Harris recurrent if and only if the strong Law of Large Numbers (LLN) holds for each function in $L_1(\mu) := L_1(X, \mathcal{B}(X), \mu)$; that is, for each function $g$ in $L_1(\mu)$ and each initial distribution $\nu$ in $\mathcal{P}(X)$,
$$\lim_{n\to\infty} \frac{1}{n}\sum_{t=0}^{n-1} g(x_t) = \mu(g) \quad P_\nu\text{-a.s.}, \tag{11.2.1}$$
where
$$\mu(g) := \int_X g\,d\mu. \tag{11.2.2}$$
(b) The chain $\{x_t\}$ is positive Harris recurrent if and only if for each Borel set $B$ in $\mathcal{B}(X)$ there is a nonnegative number $\alpha_B$ such that
$$\lim_{n\to\infty} P^{(n)}(B\mid x) = \alpha_B \quad \forall x \in X, \tag{11.2.3}$$
where
$$P^{(n)}(B\mid x) := \frac{1}{n}\sum_{t=0}^{n-1} P^t(B\mid x), \quad n = 1,2,\dots \tag{11.2.4}$$
denotes the expected average occupation measures.
(c) If $\{x_t\}$ is positive Harris recurrent with i.p.m. $\mu$ and $g$ is in $L_1(\mu)$, then
$$\lim_{n\to\infty} \frac{1}{n}\,E_x\Big[\sum_{t=0}^{n-1} g(x_t)\Big] = \mu(g) \quad\text{for } \mu\text{-a.a. } x \in X,$$
with $\mu(g)$ as in (11.2.2).
Proof. For parts (a) and (c) see, for instance, Revuz [1, pp. 139, 140];
for (b) see Glynn [1] or Hernandez-Lerma and Lasserre [12]. (Glynn [2]
proves (b) in the continuous-time case.) Part (a) is given also in Meyn and
Tweedie [1], Theorem 17.1.7 and Proposition 17.1.6. (These references pro-
vide other characterizations of positive Harris recurrence. For additional
comments see Note 1 at the end of this section.)
As an application of Theorem 11.2.1(b), let $w \ge 1$ be a weight function, and suppose that the Markov chain $\{x_t\}$ is $w$-geometrically ergodic (Definition 7.3.9). If in (7.3.8) we replace the function $u$ by an indicator function $I_B$, then we get
$$P^t(B\mid x) \to \mu(B) \quad\text{as } t \to \infty,$$
which of course implies (11.2.3) with $\alpha_B = \mu(B)$. Thus we have:
11.2.2 Corollary. A w-geometrically ergodic Markov chain is positive Har-
ris recurrent.
B. Limiting average variance
Suppose that there is a solution $h$ of the Poisson equation
$$c - \mu(c) = h - Ph \tag{11.2.5}$$
that satisfies
$$\mu(h) = 0 \quad\text{and}\quad g(x) = \mu(c) \quad \forall x \in X. \tag{11.2.6}$$
Equivalently, by Theorem 7.5.5(a), (b) [and (7.5.7) or (7.5.6)], the limiting average variance of the $n$-stage sums $\sum_{t=0}^{n-1} c(x_t)$ is defined as
$$\sigma^2(c, x) := \limsup_{n\to\infty} \frac{1}{n}\,E_x\Big[\Big(\sum_{t=0}^{n-1}\big(c(x_t) - \mu(c)\big)\Big)^2\Big]. \tag{11.2.9}$$
The next theorem gives conditions under which $\sigma^2(c, \cdot)$ is the (finite) constant
$$\sigma_c^2 := \mu(\psi).$$
This result is well known (see Doob [1], Duflo [1], Meyn and Tweedie [1], etc.), but we will give a proof of it because some of the arguments are also needed in later sections.
11.2.4 Theorem. Suppose that Assumption 11.2.3 holds and, furthermore, $c^2(\cdot)$ is in $\mathbb{B}_w(X)$. Let $\psi$ be the function on $X$ defined by
$$\psi(x) := \int_X h^2(y)\,P(dy\mid x) - \Big(\int_X h(y)\,P(dy\mid x)\Big)^2. \tag{11.2.12}$$
Then:
(a) $\psi$ is in $\mathbb{B}_w(X)$, and
(b) the limiting average variance satisfies
$$\sigma^2(c, x) = \mu(\psi) = \sigma_c^2 \quad \forall x \in X. \tag{11.2.14}$$
Thus the proof, given below, of Theorem 11.2.4 essentially reduces to verifying (a) and the first equality in (11.2.14). To prove the latter we will repeatedly use the following elementary properties of conditional expectations.
11.2.5 Remark. Let $\xi$ and $\xi'$ be integrable random variables on a probability space $(\Omega, \mathcal{F}, P)$, and let $\mathcal{G}$ and $\mathcal{G}'$ be sub-$\sigma$-algebras of $\mathcal{F}$.
The martingale differences
$$Y_t := h(x_t) - \int_X h(y)\,P(dy\mid x_{t-1}), \quad t = 1,2,\dots,$$
satisfy (11.2.19). Hence, by the definition (11.2.9) of the limiting average variance, the first equality in (11.2.14) will follow from (11.2.20), (11.2.21) and (11.2.18).
Proof of (11.2.18). Let
$$\mathcal{F}_t := \sigma(x_0, \dots, x_t)$$
be the $\sigma$-algebra generated by $\{x_0, \dots, x_t\}$. Then, by Remark 11.2.5(a) and the Markov property, for $t \ge 1$,
$$E_x(Y_t^2) = E_x\big[E_x(Y_t^2\mid \mathcal{F}_{t-1})\big] = E_x\big[E_x(Y_t^2\mid x_{t-1})\big] = E_x[\psi(x_{t-1})],$$
where the latter equality, which gives (11.2.18), follows from (11.2.17) and the fact that $\psi(x)$ in (11.2.12) can also be written as
$$\psi(x) = E\big[\big(h(x_1) - Ph(x_0)\big)^2 \,\big|\, x_0 = x\big].$$
Thus, squaring and taking expectations $E_x(\cdot)$, we see that (11.2.10) can be written as in (11.2.19) with the remainder term in (11.2.25), and from (11.2.28) we get the martingale
$$M_n := \sum_{t=1}^{n} Y_t. \tag{11.2.30}$$
From this relation, together with (11.2.31) and (11.2.32), we obtain (11.2.20) with
$$D_2(x, n) := D_1(x, n) + A(x, n) + 2B(x, n),$$
with $D_1(x, n)$ as in (11.2.25), and where
$$A(x, n)/n \to 0 \tag{11.2.34}$$
and
$$B(x, n)/n \to 0. \tag{11.2.35}$$
To prove (11.2.34), use (11.2.26) and Remark 11.2.5(a), (d). Then (11.2.35) follows from (11.2.34), (11.2.18), and the second equality in (11.2.14).
This completes the proof of Theorem 11.2.4. $\Box$
11.2.6 Remark. (a) Let $M_n$ and $\mathcal{F}_n$ be as in (11.2.30) and (11.2.22), respectively. Then it is clear that $\{M_n, \mathcal{F}_n\}$ is a martingale. In §11.3 and §11.4 we will use a martingale of this form in combination with the following result.
(b) (The Martingale Stability Theorem.) Let $\{Y_t\}$ be a sequence of random variables on a probability space $(\Omega, \mathcal{F}, P)$, and let $\{\mathcal{F}_t\}$ be a nondecreasing sequence of sub-$\sigma$-algebras of $\mathcal{F}$ such that $\{M_n, \mathcal{F}_n,\ n = 1,2,\dots\}$, with $M_n := \sum_{t=1}^{n} Y_t$, is a martingale. If $1 \le q \le 2$ and
$$\sum_{t=1}^{\infty} t^{-q}\,E\big(|Y_t|^q \mid \mathcal{F}_{t-1}\big) < \infty \quad P\text{-a.s.},$$
then
$$\lim_{n\to\infty} \frac{1}{n} M_n = 0 \quad P\text{-a.s.}$$
For a proof of this fact see, for instance, Hall and Heyde [1], Theorem 2.18.
(c) (Alternative expressions for $\sigma_c^2$.) Letting $\widetilde c$ be the centered function
$$\widetilde c(\cdot) := c(\cdot) - \mu(c),$$
we can also write (11.2.11) as (11.2.37) and
$$\sigma_c^2 = E_\mu[\widetilde c^{\,2}(x_0)] + 2\sum_{t=1}^{\infty} E_\mu[\widetilde c(x_0)\,\widetilde c(x_t)]. \tag{11.2.39}$$
This equality and (11.2.11) yield (11.2.37). On the other hand, to get (11.2.38), first write (7.5.30) as
$$h = \sum_{t=0}^{\infty} P^t \widetilde c = \widetilde c + \sum_{t=1}^{\infty} P^t \widetilde c.$$
Then the right-hand side of (11.2.40) becomes
$$E_\mu[\widetilde c(x_0)\,\widetilde c(x_t)] = \mu(\widetilde c\,P^t \widetilde c).$$
Thus (11.2.39) follows from (11.2.38). $\Box$
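For a finite ergodic chain these expressions can be evaluated exactly: solve the Poisson equation (11.2.5) for $h$ and then use the representation $\sigma_c^2 = \mu(h^2) - \mu((Ph)^2)$ [cf. (11.2.12), (11.2.14), and the formula quoted in the Notes on §11.3]. A minimal sketch with illustrative data:

```python
import numpy as np

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])              # illustrative ergodic transition matrix
c = np.array([1.0, 3.0])                # illustrative cost per stage

# Invariant distribution mu: mu P = mu, sum(mu) = 1.
X = P.shape[0]
mu = np.linalg.lstsq(np.vstack([P.T - np.eye(X), np.ones(X)]),
                     np.append(np.zeros(X), 1.0), rcond=None)[0]

# Poisson equation (11.2.5): c - mu(c) = h - P h, normalized by mu(h) = 0.
h = np.linalg.lstsq(np.vstack([np.eye(X) - P, mu]),
                    np.append(c - mu @ c, 0.0), rcond=None)[0]

# Limiting average variance: sigma^2 = mu(h^2) - mu((P h)^2).
sigma2 = mu @ (h**2) - mu @ ((P @ h)**2)
print(sigma2)
```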
Notes on §11.2
1. For a positive Harris recurrent chain, the convergence in Theorem 11.2.1(c) in fact holds for every initial state:
$$\lim_{n\to\infty} \frac{1}{n}\,E_x\Big[\sum_{t=0}^{n-1} g(x_t)\Big] = \mu(g) \quad\text{for all } x \in X. \tag{11.2.41}$$
This can be used to obtain additional results. For instance, if $g$ is a nonnegative function in $L_1(\mu)$, then
$$\liminf_{n\to\infty} \frac{1}{n}\,E_x\Big[\sum_{t=0}^{n-1} g(x_t)\Big] \ge \mu(g) \quad\text{for all } x \in X.$$
This follows from (11.2.41) and the fact that $g \in L_1(\mu)^+$ is the pointwise limit of a nondecreasing sequence of bounded measurable functions.
On the other hand, writing $g(x_n)$ as
$$g(x_n) = \sum_{t=0}^{n} g(x_t) - \sum_{t=0}^{n-1} g(x_t),$$
we obtain
$$\lim_{n\to\infty} |g(x_n)|/n = 0 \quad P_\nu\text{-a.s. for each } \nu \text{ in } \mathcal{P}(X). \tag{11.2.42}$$
2. The related expression (11.2.43) also holds; for a proof see, for example, the references given after (11.2.11).
3. The reader should be warned that there are different definitions of limiting average variance. For instance, Baykal-Gürsoy and Ross [1], Filar et al. [1], Puterman [1, p. 408], etc., define the "limiting average variance" in a related but different way (11.2.44), which in general [despite (11.2.19)] does not coincide with (11.2.9). In particular, observe that under the hypotheses of Theorem 11.2.4 the limiting value in (11.2.44) is
where $k := \bar{c}\,[1 + b/(1-\beta)]$. Hence, if $\nu$ is in $\mathcal{P}_w(X)$, from (11.3.1) and the Dominated Convergence Theorem we obtain
$$J(f) = \int_X J(f^\infty, x)\,\nu(dx) = \lim_{n\to\infty}\int_X n^{-1} J_n(f^\infty, x)\,\nu(dx) = \lim_{n\to\infty} n^{-1} J_n(f^\infty, \nu) = J(f^\infty, \nu) \quad\text{[by (11.1.2)]},$$
which clearly implies (11.2.3) with $Q_f(\cdot\mid x)$ and $\mu_f(B)$ in lieu of $P(\cdot\mid x)$ and $\alpha_B$, respectively. Thus part (a) follows from Theorem 11.2.1(b), and (b) follows from (11.2.1). $\Box$
Despite these facts, however, w-geometric ergodicity is not enough to
guarantee a "good behavior" of the Markov control model with respect to
the optimality criteria in Definition 11.1.1. In particular, it does not ensure
the existence of sample path AC-optimal policies [Definition 11.1.1(b)]. It
is thus convenient to consider sample path AC-optimality in the restricted
sense of Definition 11.1.3. We shall consider two cases:
A. Optimality in $\Pi_{DS}$
[Furthermore, $f_0^\infty$ is AC-optimal (in all of $\Pi$) if the one-stage cost $c(x,a)$ is nonnegative; see Remark 10.3.2.] We now have the following.
11.3.3 Theorem. Suppose that the Markov control model is $w$-geometrically ergodic, and let $f_0^\infty$ be as in (11.3.5). Then:
(a) $f_0^\infty$ is sample path AC-optimal with respect to $\Pi_{DS}$ and $\mathcal{P}(X)$; that is,
$$J^0(f_0^\infty, \nu) = p_0 \quad P_\nu^{f_0^\infty}\text{-a.s.}$$
and
$$J^0(f^\infty, \nu) \ge p_0 \quad P_\nu^{f^\infty}\text{-a.s.} \quad \forall f^\infty \in \Pi_{DS},\ \nu \in \mathcal{P}(X).$$
(b) $(f_0^\infty, \nu)$ is a "minimum pair in $\Pi_{DS}$" for each initial distribution $\nu$ in $\mathcal{P}_w(X)$, the set defined in (11.3.2).
Proof. Part (a) follows from (11.3.5) and Proposition 11.3.2(b), and part (b) from (11.3.3). $\Box$
Observe that we can write (11.3.1) in the equivalent form (11.3.7).
B. Optimality in $\Pi$
$$\underline{J}^0(\pi, x) \ge p^* \quad P_x^\pi\text{-a.s.} \quad \forall \pi \in \Pi,\ x \in X, \tag{11.3.8}$$
and the following statements (b), (c) and (d) are equivalent for a deterministic stationary policy $f^\infty$:
(c) $\pi^* := f^\infty$ satisfies (11.1.24) and (11.1.25); that is, $f^\infty$ is sample path AC-optimal with respect to $\Pi$ and $\mathcal{P}_\delta(X)$ [see (11.1.19)];
(d) $J(f^\infty, \nu) = p^*$ for every initial distribution $\nu$ in $\mathcal{P}_w(X)$, the set defined in (11.3.2).
Here
$$\underline{J}^0(\pi, x) := \liminf_{n\to\infty} \frac{1}{n}\,J_n^0(\pi, x). \tag{11.3.10}$$
Let us now suppose that (11.3.9) holds, and let $\pi \in \Pi$ and $\nu \in \mathcal{P}(X)$ be arbitrary. Then, as
$$\underline{J}^0(\pi, \nu) = \int_X \underline{J}^0(\pi, x)\,\nu(dx),$$
Fatou's Lemma and (11.3.10) yield that, $P_\nu^\pi$-a.s.,
$$\liminf_{n\to\infty} \frac{1}{n}\,J_n^0(\pi, \nu) \ge p^* \quad\text{by (11.3.8)}.$$
From this inequality and (11.1.6) it follows that
$$J(\pi, \nu) \ge p^*$$
for arbitrary $\pi \in \Pi$ and $\nu \in \mathcal{P}(X)$. On the other hand, from (11.1.5), (11.1.7), and again using Fatou's Lemma, we obtain the reverse of inequality (11.3.11), that is, $p_{\min} \ge p^*$. Hence, under (11.3.9),
$$p_{\min} = p^*. \tag{11.3.14}$$
Combining these facts we can easily obtain the following corollary of Theorem 11.3.5. $\Box$
11.3.7 Corollary. If Assumption 11.3.4 and also (11.3.9) are satisfied, then there exists a deterministic stationary policy $f_*^\infty$ such that:
(a) $(f_*^\infty, \nu)$ is a minimum pair for all $\nu$ in $\mathcal{P}_w(X)$, and
then, under the same hypotheses of Theorem 11.3.5, we now get the following.
11.3.8 Theorem. (Existence of minimum-variance policies.) If Assumption 11.3.4 is satisfied, then there exist a constant $\sigma_*^2 \ge 0$, a canonical decision function $f_* \in \mathbb{F}_{ca}$, and a function $V^*(\cdot)$ in $\mathbb{B}_w(X)$ such that for each $x \in X$:
$$\sigma_*^2 + V^*(x) = \min_{a\in A^*(x)}\Big[\psi(x,a) + \int_X V^*(y)\,Q(dy\mid x,a)\Big]. \tag{11.3.17}$$
We shall begin with some preliminary results concerning the weight function
$$v := w^{1/2}. \tag{11.3.20}$$
Consider the inequality (10.2.4), which is equivalent to (10.2.1), with $b(\cdot)$ a constant [for instance, replace $b(\cdot)$ by the constant $b := \|b(\cdot)\|$], i.e.,
$$\int_X w(y)\,Q_f(dy\mid x) \le \beta w(x) + b. \tag{11.3.21}$$
Then, "taking the square root" of both sides of (11.3.21) and using Jensen's inequality, we see that $v := w^{1/2}$ satisfies
$$\int_X v(y)\,Q_f(dy\mid x) \le \big(\beta w(x) + b\big)^{1/2} \le \beta^{1/2} v(x) + b^{1/2}.$$
(a) The state (Markov) process $\{x_t\}$ is $v$-geometrically ergodic; that is, (11.3.23) and (11.3.24) hold,
and
$$M_n(\pi, x) := \sum_{t=1}^{n} Y_t(\pi, x). \tag{11.3.25}$$
Now note that $(p_*, h_*)$ is a solution to the Poisson equation (10.3.14) and so, by Lemma 11.3.9(b), the increments $Y_t$ are conditionally centered, so that, by (11.3.28),
$$E_x^\pi[Y_{n+1}\mid \mathcal{F}_n] = 0.$$
Therefore, $\{M_n, \mathcal{F}_n\}$ is a martingale, which proves the first part of the lemma.
To prove (11.3.26) we shall use the Martingale Stability Theorem in Remark 11.2.6(b). Hence, it suffices to show that [as in (11.2.36) with $q = 2$]
$$\sum_{t=1}^{\infty} t^{-2}\,E_x^\pi\big(Y_t^2 \mid \mathcal{F}_{t-1}\big) < \infty \quad P_x^\pi\text{-a.s.},$$
where the second inequality is obtained from the first inequality in the proof of Lemma 10.4.1 together with the fact that $w(\cdot) \ge 1$. Finally, (11.3.30) follows from (11.3.31) and Lemma 11.3.10(a), because
$$\sum_{t=1}^{\infty} t^{-2}\,E_x^\pi w(x_t) < \infty.$$
Thus, since the remaining term is nonnegative, (11.3.33) holds.
Finally, note that (11.3.28) and Lemma 11.3.10(d) imply that $|h_*(x_n)|/n \to 0$ $P_x^\pi$-a.s. as $n \to \infty$ and, similarly, $M_n/n \to 0$ $P_x^\pi$-a.s. by (11.3.26). Therefore, multiplying both sides of (11.3.33) by $1/n$ and taking $\liminf$ as $n \to \infty$, we obtain the second inequality in (11.3.8). As the first inequality is obvious, we thus have (11.3.8).
We will next prove the equivalence of (b), (c) and (d).
(b) $\Rightarrow$ (c). Let $f^\infty \in \Pi_{DS}$ be an AC-optimal policy, the existence of which is ensured by Theorem 10.3.6(a). Then, by Proposition 11.3.2(b),
$$J^0(f^\infty, x) = J(f) = p^* \quad P_x^{f^\infty}\text{-a.s.} \quad \forall x \in X.$$
This fact and (11.3.8) yield (c).
(c) $\Rightarrow$ (b). Suppose that $f^\infty$ is sample path AC-optimal with respect to $\Pi$ and $\mathcal{P}_\delta(X)$; that is, $\pi^* := f^\infty$ satisfies (11.1.24) and, moreover, (11.1.25) holds. Then, by (11.3.1),
$$J(f^\infty, x) = J(f) = p^* \quad \forall x \in X, \tag{11.3.34}$$
which together with (11.1.23) implies (b).
(b) $\Leftrightarrow$ (d). This follows from (11.3.34) and (11.3.3).
Finally, let us suppose that (11.3.9) is satisfied. Then, by Definition 11.1.1(c), we have (e) $\Rightarrow$ (b). Conversely, suppose that $f^\infty \in \Pi_{DS}$ is AC-optimal. Then $f^\infty$ satisfies (11.3.34) and, on the other hand, (11.3.8) and Fatou's Lemma give
$$p^* \le \liminf_{n\to\infty} E_x^\pi\big[J_n^0(\pi, x)\big]/n. \tag{11.3.35}$$
(11.3.36)
where, by (11.2.11), (11.3.37) holds.
Let us now suppose that $f \in \mathbb{F}$ is a canonical decision function, that is, $f$ is in $\mathbb{F}_{ca}$. Then $(J(f), h_f) = (p^*, h_f)$ satisfies the ACOE (10.3.13), (10.3.14) and, in addition, Theorem 10.3.7 yields (11.3.38)-(11.3.40). Define
$$\widehat f := g \ \text{on } N, \qquad \widehat f := f \ \text{on } N^c. \tag{11.3.42}$$
Then (11.3.43) holds and, on the other hand, (11.3.42) and (11.3.40) give [by (11.3.35) and (11.3.16)] (11.3.44).
Therefore, (11.3.41) follows from (11.3.43), (11.3.44) and (11.3.36), (11.3.37). $\Box$
With these preliminaries we can now easily prove Theorem 11.3.8.
Proof of Theorem 11.3.8. Let $A^*(x) \subset A(x)$ and $\psi(x,a)$ be as in (10.3.35) and (11.3.15), respectively, and recall (10.3.36) and (11.3.16). Consider the new Markov control model
$$\mathcal{M}_{\mathrm{var}} := (X, A, \{A^*(x)\mid x \in X\}, Q, \widehat c),$$
where $\widehat c(x,a) := \psi(x,a)$. It is easily verified that $\mathcal{M}_{\mathrm{var}}$ satisfies the hypotheses of Theorem 10.3.6(a), replacing $c(x,a)$ and $A(x)$ with $\widehat c(x,a)$ and $A^*(x)$. Therefore, by Theorem 10.3.6(a), there exists a canonical triplet $(\sigma_*^2, V^*, f_*)$ for $\mathcal{M}_{\mathrm{var}}$, with $V^*$ in $\mathbb{B}_w(X)$; that is, there exist a constant $\sigma_*^2 \ge 0$, a function $V^*$ in $\mathbb{B}_w(X)$, and a canonical decision function $f_* \in \mathbb{F}_{ca}$ that satisfy (11.3.17). Moreover, as in (10.3.15), (11.3.45) holds and
$$\sigma_*^2 \le \mu_f(\psi_f) = \mathrm{Var}(f^\infty, x) \quad \forall f \in \mathbb{F}_{ca},\ x \in X, \tag{11.3.46}$$
where the second equality in (11.3.45) and (11.3.46) follows from (11.3.39) and (11.3.36), (11.3.37). Finally, to verify (11.3.19), observe that (11.3.46) and Lemma 11.3.12 yield
Notes on §11.3
1. The limiting average variance of $f^\infty$ can be written as
$$\sigma^2(f) = \int_X \Big\{h^2(x) - \Big[\int_X h(y)\,Q_f(dy\mid x)\Big]^2\Big\}\,\mu_f(dx),$$
for $f \in \mathbb{F}_{ca}$. Similarly, the "cost-per-stage" function $\psi(x,a)$ in (11.3.15) can be written in the same form.
These expressions suggest that the bias-optimal policies and the minimum-variance policies $f_*^\infty$ in Theorem 11.3.8 should be related in some sense, but it is an open question what this relation (if any) should be.
$$\lim_{i\to\infty} \int v\,d\mu_{n_i} = \int v\,d\mu \quad \forall v \in C_b(\mathbb{K}), \tag{11.4.2}$$
(b) positive Harris recurrent if $\varphi^\infty$ is stable and $Q_\varphi$ is Harris recurrent.
The following proposition states some useful properties of stable policies. In part (b) of the proposition, $p_{\min}$ and $p^*$ are the numbers defined in (11.1.12) and (11.1.23), respectively.
11.4.4 Proposition. (a) If $\varphi^\infty \in \Pi_{RS}$ is stable, then (11.4.5) holds.
(b) If there exists a stable policy $\varphi^\infty \in \Pi_{RS}$ such that $(\varphi^\infty, p_\varphi)$ is a minimum pair, that is [by Definition 11.1.1(a)],
$$J(\varphi^\infty, p_\varphi) = p_{\min},$$
then
$$p_{\min} = p^*. \tag{11.4.7}$$
(c) Let $\varphi^\infty \in \Pi_{RS}$ be a stable policy. Then $(\varphi^\infty, p_\varphi)$ is a minimum pair if and only if
$$J(\varphi^\infty, x) = p^* \quad p_\varphi\text{-a.a. } x \in X. \tag{11.4.8}$$
Proof. (a) By (11.1.5) and the Individual Ergodic Theorem (7.5.24), the limit
(a) There exists a stable policy <p~ E IIRS such that (<p~ 'P<PJ is a mini-
mum pair.
(b) If in addition the policy <p~ in (aJ is positive Harris recurrent, then
its sample path AC JO(<p~,') satisfies that
Proof. Part (a) is the same as Theorem 5.7.9(a), whereas (11.4.9) follows
from Theorem 11.2.1(a). Furthermore, part (c) is a consequence of Theorem
5.4.3(ii), which states that (11.4.10) and (11.4.11) yield the Average Cost
Optimality Inequality (ACOI); that is,
p* + h(x) ≥ c_f(x) + ∫_X h(y) Q_f(dy|x)  ∀x ∈ X,
which yields (11.4.12). □
We shall now state our main result, which, in particular, in part (c) gives
conditions for the existence of sample path AC-optimal policies. Also note
that (11.4.16) is a statement stronger than (11.4.7) because it uses the lim
inf expected AC in (11.1.7).
11.4.6 Theorem. Suppose that Assumption 11.4.1 is satisfied. Then:
hence
(11.4.15)
and
p* = inf_{ν ∈ P(X)} inf_{π ∈ Π} J′(π, ν).  (11.4.16)
(c) If the policy φ*^∞ in Lemma 11.4.5 is positive Harris recurrent, then
it is sample path AC-optimal; in fact, every positive Harris recurrent
policy φ^∞ in Π_RS (or in Π_DS) for which (φ^∞, μ_φ) is a minimum pair
is also sample path AC-optimal.
Proof. As the proof of (11.4.14) is quite "technical", to simplify the expo-
sition we will first suppose that it holds and prove the remaining parts of
the theorem; then we will prove (11.4.14).
Suppose that (11.4.14) is satisfied, and choose an arbitrary policy π and
initial distribution ν. Then (11.1.3), (11.1.7), and Fatou's Lemma yield
i.e.,
E_x^{π*}[ liminf_{n→∞} J_n^0(π*, x)/n ] = p*  ∀x ∈ X.
This equality and (11.4.14) give (11.4.18).
(b) Let u : S → ℝ be l.s.c. and bounded below, and let μ, μ_n (n = 1, 2, ...)
be probability measures on B(S) such that μ_n converges weakly to μ.
Then
liminf_{n→∞} ∫_S u dμ_n ≥ ∫_S u dμ.
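A standard route to this inequality, assuming S is a metric space so that a bounded-below l.s.c. function u is the increasing pointwise limit of functions u_k ∈ C_b(S) [cf. Lemma 11.4.8(b) below]:

```latex
\liminf_{n\to\infty}\int_S u\,d\mu_n
\;\ge\;\lim_{n\to\infty}\int_S u_k\,d\mu_n
\;=\;\int_S u_k\,d\mu \qquad \forall k,
```

and letting k → ∞, monotone convergence yields liminf_n ∫_S u dμ_n ≥ ∫_S u dμ.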
In Lemma 11.4.8 below we use the following terminology and notation.
Let (S, T) be a separable metrizable space, that is, a separable topological
space for which there exists a metric d on S consistent with the topology T.
For each metric d on S we denote by U(S, d) the subfamily of functions in
C_b(S) which are uniformly continuous with respect to d. We take U(S, d)
to have the relative topology of C_b(S).
11.4.8 Lemma. Let (S, T) be a separable metrizable space. Then there
exists a metric d* on S consistent with T such that:
(b) for each function u in C_b(S) there exist sequences {u_n′} and {u_n″} in
U(S, d*) such that u_n′ ↑ u and u_n″ ↓ u pointwise as n → ∞.
Proof. See Bertsekas and Shreve [1], Corollary 7.6.1 (p. 113), Proposition
7.9 (p. 116), and Lemma 7.7 (p. 125). □
Proof of (11.4.14). Choose an arbitrary policy π and initial distribu-
tion ν, and let (Ω, 𝓕, P_ν^π) be the "canonical" probability space in Remark
8.2.3(c). Furthermore, define on Ω a random variable J as in the left-hand
side of (11.4.14), that is
with J_n^0(π, ν) as in (11.1.4). If for some sample path ω = (x_0, a_0, x_1, a_1, ...)
of the state-action process it occurs that J(ω) = +∞, then (11.4.14) triv-
ially holds. Thus without loss of generality we may restrict to sample paths
in the set Ω′ := {ω : J(ω) < ∞}. Now consider the empirical measures
γ_n(Γ) := (1/n) Σ_{t=0}^{n−1} I_Γ(x_t, a_t)  for Γ ∈ B(X × A), n = 1, 2, ....
J = liminf_{n→∞} ∫_𝕂 c(x, a) γ_n(d(x, a)).
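To make the construction concrete, the following sketch simulates a finite MCP and evaluates the cost against the empirical measures γ_n; the two-state model, kernel Q, policy, and cost below are hypothetical stand-ins, not data from the text.

```python
# A toy illustration of the empirical measures gamma_n and of the
# pathwise cost J as the liminf of the integrals of c against them.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
# Q[x, a] = distribution of the next state given the state-action pair (x, a)
Q = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.3, 0.7]]])
phi = np.array([[0.7, 0.3], [0.4, 0.6]])   # phi[x] = phi(.|x), a stationary policy
c = np.array([[1.0, 2.0], [4.0, 0.5]])     # cost-per-stage c(x, a)

T = 200_000
x = 0
counts = np.zeros((n_states, n_actions))   # n * gamma_n, stored as a matrix
averages = []
for t in range(1, T + 1):
    a = rng.choice(n_actions, p=phi[x])
    counts[x, a] += 1.0
    if t % 20_000 == 0:
        # integral of c against the empirical measure gamma_t = counts / t
        averages.append(float((counts / t * c).sum()))
    x = rng.choice(n_states, p=Q[x, a])

print(averages)   # the running averages settle near the stationary cost
```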
To prove (11.4.14) we will proceed in two steps. First we will show that:
(11.4.22)
Thus, by Lemma 9.4.4, there exists a stochastic kernel φ_ω ∈ Φ such that
(ii) For P_ν^π-almost all ω, the randomized stationary policy φ_ω^∞ is stable,
with i.p.m. μ_{φ_ω} = γ̂^ω.
J(ω) = lim_{i→∞} ∫_𝕂 c(x, a) γ_{n_i}^ω(d(x, a)).
Then
sup_i ∫_𝕂 c(x, a) γ_{n_i}^ω(d(x, a)) < ∞,
which [as in Remark 11.4.2(b)] implies the existence of a p.m. γ^ω on 𝕂 and
a subsequence {m_i} of {n_i} such that γ_{m_i}^ω converges weakly to γ^ω; that
is, as i → ∞,
(11.4.24)
so that
M_n(u) = Σ_{t=1}^n Lu(x_t, a_t) + E_ν^π[u(x_{n+1}) | x_n, a_n] − E_ν^π[u(x_1) | x_0, a_0].  (11.4.25)
we have
lim_{n→∞} ∫_𝕂 Lu(x, a) γ_n^ω(d(x, a)) = 0  ∀ω ∈ Ω_u,
and hence
lim_{n→∞} ∫_𝕂 Lu(x, a) γ_n^ω(d(x, a)) = 0  ∀u ∈ U, ω ∈ Ω*,  (11.4.26)
where
Ω* := ∩_{u ∈ U} Ω_u.
Moreover, by Assumption 11.4.1(c), the function Lu is in C_b(𝕂) for every
u in U. Hence, for each ω in Ω* there is a sequence {m_i(ω)} as in (11.4.24),
so that, by (11.4.26),
In fact, by Lemma 11.4.8(b), the latter equality holds for all u in C_b(X),
i.e.,
∫_𝕂 Lu(x, a) γ^ω(d(x, a)) = 0  ∀u ∈ C_b(X),
Notes on §11.4
1. Theorem 11.4.6 comes from Vega-Amaya [2, 3]. Related results are
obtained by Lasserre [3] using a different approach. In addition to these
works and the paper by Hernandez-Lerma, Vega-Amaya and Carrasco [1]
mentioned in Note 1 of §11.3, we know of no previous works on sample
path AC-optimality for MCPs on general (uncountable) Borel spaces.
2. Vega-Amaya [2, Theorem 6.3.1] gives a proof of Lemma 11.4.5(a)
different from our proof in §5.7.
11.5 Examples
11.5.1 Example. (Examples 10.9.3 and 8.6.2, continued.) Let us
consider the inventory-production system (8.6.5) with cost-per-stage (8.6.7),
namely,
x_{t+1} = (x_t + a_t − z_t)^+  for t = 0, 1, ...,  (11.5.1)
and
The state and control spaces are X := [0, ∞) and A = A(x) = [0, θ] for all
x in X. In Example 10.9.3 we saw that Assumption 8.6.1, together with
the condition 8.6.3 and (10.9.8), implies that the system is w-geometrically
ergodic with respect to the weight function
(11.5.5)
under the Assumptions 8.6.5. In Examples 8.6.4' and 10.9.5 we already
verified Assumption 11.3.4 except for (11.3.7), which in the present case
refers to the weight function w in (8.6.18) or (10.9.23), that is,
(11.5.8)
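The sample path AC of this system under a fixed policy is easy to estimate by simulation. The sketch below does so under illustrative assumptions: a hypothetical base-stock decision function, exponential demands, and a linear stand-in for the cost (8.6.7), which is not reproduced here.

```python
# Simulation estimate of the sample path average cost of the inventory-
# production system (11.5.1) under a fixed deterministic stationary policy.
import numpy as np

rng = np.random.default_rng(1)
theta = 5.0                              # production capacity, A(x) = [0, theta]

def f(x):
    # produce up to the (hypothetical) base-stock level 4
    return min(theta, max(0.0, 4.0 - x))

def cost(x, a):
    return 0.5 * x + 1.0 * a             # stand-in holding + production cost

x, total, T = 0.0, 0.0, 500_000
for _ in range(T):
    a = f(x)
    total += cost(x, a)
    z = rng.exponential(2.0)             # i.i.d. demand z_t
    x = max(x + a - z, 0.0)              # the recursion (11.5.1)

print(total / T)                         # estimate of J_0(f^infinity, x_0)
```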
(11.5.9)
where a is in ℝ^q and Σ and θ are suitable matrices; x′ and a′ denote the
transpose of x and a, respectively. Since the analysis of this control system
relies on the properties of the noncontrolled Markov chain
(11.5.11)
(11.5.12)
(d) There exist positive constants s ≥ 1, β < 1, and M₂ such that E|z_0|^s <
∞ and
(11.5.13)
Then the following holds:
(i) Under (a) and (b), the Markov chain is aperiodic, λ-irreducible and
Harris recurrent with respect to λ.
(ii) Under (a), (b) and (c), the chain has a unique i.p.m.; hence, by (i),
it is positive Harris recurrent.
(iii) Under (a), (b) and (d), there exist a p.m. μ and positive numbers
ρ < 1 and R such that
(11.5.14)
Proof. Part (i) follows from (7.4.18). For the proof of (ii) see Tweedie [1]
or Mokkadem [1, Proposition 1], and for the proof of (iii) see Tweedie [2]
or Mokkadem [1, Proposition 3]. □
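For a concrete feel for condition (d) and the geometric convergence in (11.5.14), the following sketch checks the drift inequality numerically for the scalar chain x_{t+1} = F(x_t) + z_t with F(x) = 0.5x and standard Gaussian noise; these choices, and the exponent s = 2, are illustrative assumptions.

```python
# Numerical check of a drift condition of the type in Proposition 11.5.4(d).
import numpy as np

rng = np.random.default_rng(2)

def F(x):
    return 0.5 * x

# E|F(x) + z_0|^2 = 0.25|x|^2 + 1, so the drift inequality holds with any
# beta in (0.25, 1) once |x| is large enough:
x = 100.0
z = rng.normal(size=1_000_000)
print(np.mean(np.abs(F(x) + z) ** 2), 0.3 * x ** 2)   # lhs below 0.3|x|^2

# The chain then settles geometrically into its i.p.m. [cf. (11.5.14)],
# here the N(0, 4/3) law; a long trajectory confirms the moments:
xs = np.empty(200_000)
x = 50.0
for t in range(xs.size):
    x = F(x) + rng.normal()
    xs[t] = x
print(xs.mean(), xs.var())               # approx. 0 and 1/(1 - 0.25) = 4/3
```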
11.5.5 Remark. (a) The main difference between Proposition 11.5.4(iii)
and Example 7.4.6 is that the latter requires F to be continuous. Moreover,
from (i), and comparing (11.5.14) with (7.3.7), it can be seen that the
measure in (11.5.14) is the unique i.p.m. for the Markov chain. This can also
be deduced from the conclusion (ii) by noting that the inequality (11.5.13)
implies that (11.5.12) holds with M₁ := max{M₂, [b/(1 − β)]^{1/s}}, because
(c) F : 𝕂 → X is continuous.
Then, since
(e) E|z_0|² < ∞ and, moreover, there exists a randomized stationary policy
φ̄^∞ and positive constants β < 1, M, and k such that
(e1) E|F(x, φ̄) + z_0|² ≤ β|x|²  ∀|x| > M;
(e2) ∫_A (a′θa) φ̄(da|x) ≤ k|x|²  ∀x ∈ X.
From (b), (d) and (e), we can see that φ̄^∞ is stable [Definition 11.4.3(a)].
Indeed, by Proposition 11.5.4(iii), the transition law Q_φ̄ has an i.p.m.
μ̄ := μ_φ̄ that, in particular, satisfies (11.5.15) with w(x) = 1 + |x|² and μ = μ̄.
Therefore, from (11.5.10) and (e2), there is a constant k such that
Thus (11.4.4) holds, which in turn [by (11.4.5)] gives Assumption 11.4.1(a)
with π = φ̄^∞ and some initial state x.
Summarizing, the current conditions (a) to (e) imply that (11.5.9) and
(11.5.10) satisfy Assumption 11.4.1, and so the corresponding results in
§11.4 are applicable. Furthermore, if we wish to use, for instance, Lemma
11.4.5(b) or Theorem 11.4.6(c), we then need conditions for the policy φ*^∞
in those results to be positive Harris recurrent. One way of getting this is to
assume (or to verify, when a specific control system is given) that (e) holds
for every randomized stationary policy φ^∞, with constants β, M, and k
that may depend on φ^∞; note that we may replace (e1) by the analogues
of (11.5.12) or (11.5.16), namely,
The control model in Example 11.5.1 and Remark 11.5.3 has been studied
by Vega-Amaya [2, 3]. These references contain other examples related to
§11.3 and §11.4.
Example 11.5.6 comes from Hernandez-Lerma and Lasserre [14], where
the reader can find additional references on results related to Proposition
11.5.4 (on which Example 11.5.6 is based).
12
The Linear Programming Approach
12.1 Introduction
In this chapter we study the linear programming (LP) approach to Markov
control problems. Our ultimate goal is to show how a Markov control prob-
lem can be approximated by finite linear programs.
To reach this goal, we shall first proceed to find a suitable linear program
associated to the Markov control problem. Here, by a "suitable" linear
program we mean a linear program (P) that together with its dual (P*)
satisfies that
sup(P*) ≤ (MCP)* ≤ inf(P),  (12.1.1)
where (using terminology specified in the following section)
inf(P) := value of the primal program (P),
sup(P*) := value of the dual program (P*),
(MCP)* := value function of the Markov control problem.
(P) is solvable, in which case we write its value as min(P), and that
then an optimal solution for (P) can be used to determine an optimal policy
for the Markov control problem. Likewise, if the dual (P*) is solvable and
its value, which in this case is written as max(P*), satisfies
then we can use an optimal solution for (P*) to find an optimal policy for
the Markov control problem. In fact, one of the main results in this chapter
(Theorem 12.4.2) gives conditions under which (12.1.3) and (12.1.4) are
both satisfied, so that in particular strong duality for (P) holds, that is,
sup(P*) = min(P).
(P*) exists, then the strong duality condition (12.1.5) is satisfied. Section
12.5 presents an approximation scheme for (P) using finite-dimensional
programs. The scheme consists of three main steps. In step 1 we introduce
an "increasing" sequence of aggregations of (P), each one with finitely many
constraints. In step 2 each aggregation is relaxed (from an equality to an
inequality), and, finally, in step 3, each aggregation-relaxation is combined
with an inner approximation that has a finite number of decision variables.
Thus the resulting aggregation-relaxation-inner approximation turns out to
be a finite linear program, that is, a program with finitely many constraints
and decision variables. The corresponding convergence theorems are stated
in §12.5, and they are all proved in the final section 12.6.
To fix ideas, we shall consider only the so-called "unichain" AC problem.
However, from the proof of our main results it should be clear that sim-
ilar results are valid for other Markov control problems, in particular for
discounted and for "multichain" AC MCPs.
12.2 Preliminaries
This section contains background material that can be omitted on a first
reading; the reader may refer to it as needed.
The material is divided into four subsections. Subsection A reviews some
basic definitions and facts related to dual pairs of vector spaces and linear
operators. Subsections B and C summarize the main results on infinite
LP needed in later sections. Finally, Subsection D reviews the notion of
"tightness" and its connection to the existence of i.p.m.'s for Markov chains.
A. Dual pairs of vector spaces
Let X and Y be two arbitrary (real) vector spaces, and let ⟨·, ·⟩ be a
bilinear form on X × Y, that is, a real-valued function on X × Y such
that
• the map x ↦ ⟨x, y⟩ is linear on X for every y ∈ Y, and
(12.2.1)
this definition can be extended to the product of three or more dual pairs.
12.2.1 Examples. (a) If X = Y = ℝ^n for some n = 1, 2, ..., then ⟨x, y⟩
will denote the usual "inner product" x · y of the vectors x, y; that is,
(12.2.3)
where
lim_{s→∞} u(s) = 0.  (12.2.7)
Thus, by (12.2.1) and part (a), the bilinear form corresponding to the dual
pair (ℝ^n × M(S), ℝ^n × F(S)) is
(12.2.13)
12.2.6 Example. Let X and Y be two Borel spaces, and let w_0(x) and
w(x, y) be weight functions on X and X × Y, respectively, such that
1 ≤ w_0(x) ≤ w(x, y)  ∀x ∈ X, y ∈ Y.  (12.2.16)
We shall consider the spaces B_w(X × Y), M_w(X × Y), B_{w0}(X), and M_{w0}(X)
as in Example 12.2.1(b).
(a) Consider the dual pairs (M_w(X × Y), B_w(X × Y)) and (ℝ, ℝ), and
the linear map
L_0 : M_w(X × Y) → ℝ,  μ ↦ L_0μ := ⟨μ, 1⟩.  (12.2.17)
By (12.2.8) with u(x, y) ≡ 1,
L_0μ = ⟨μ, 1⟩ = μ(X × Y)  ∀μ ∈ M_w(X × Y).
In particular,
(12.2.18)
where M_w(X × Y)^+ stands for the convex cone of nonnegative measures in
M_w(X × Y) and ‖·‖_TV denotes the total variation norm.
Since B_w(X × Y) contains the constant functions [see (12.2.3)], the ad-
joint
r ↦ (L_0*r)(x, y) ≡ r  ∀r ∈ ℝ,
obviously maps ℝ into B_w(X × Y), and so L_0 is weakly continuous, by
Proposition 12.2.5.
(b) Consider the dual pairs (M_w(X × Y), B_w(X × Y)) and (M_{w0}(X),
B_{w0}(X)), and the linear map
G_1 : M_w(X × Y) → M_{w0}(X),  μ ↦ G_1μ := μ̂,
where μ̂ denotes the marginal (also known as the projection) of μ on X,
that is,
μ̂(B) := μ(B × Y)  ∀B ∈ B(X).  (12.2.19)
The adjoint u ↦ G_1*u, with
(G_1*u)(x, y) := u(x)  ∀u ∈ B_{w0}(X), (x, y) ∈ X × Y,  (12.2.20)
maps B_{w0}(X) into B_w(X × Y), because (12.2.16) gives
|G_1*u|/w = (|u|/w_0)(w_0/w) ≤ |u|/w_0.
(G_2μ)(B) := ∫_{X×Y} P(B|x, y) μ(d(x, y))  for B ∈ B(X)  (12.2.22)
whose adjoint maps B_{w0}(X) into B_w(X × Y). Thus, the weak continuity of
G_2 follows from Proposition 12.2.5.
(d) As a consequence of (b) and (c), if the inequality (12.2.21) holds,
then the linear map
i.e.,
(L_1μ)(B) := μ̂(B) − ∫_{X×Y} P(B|x, y) μ(d(x, y))  for B ∈ B(X),  (12.2.24)
i.e.,
Lμ := (L_0μ, L_1μ)  for μ in M_w(X × Y),  (12.2.25)
is weakly continuous.
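On finite spaces the maps L_0, G_1, G_2, and L_1 = G_1 − G_2 reduce to elementary matrix operations, which may help fix ideas; the finite setting and the random data below are assumptions for illustration only (the text works with general Borel spaces).

```python
# The maps of Example 12.2.6 written out for finite X and Y, where a
# measure mu on X x Y is a nonnegative matrix and P(.|x, y) is a
# row-stochastic array.
import numpy as np

rng = np.random.default_rng(3)
nX, nY = 3, 2
mu = rng.random((nX, nY)); mu /= mu.sum()              # p.m. on X x Y
P = rng.random((nX, nY, nX)); P /= P.sum(axis=2, keepdims=True)

L0 = mu.sum()                        # L0(mu) = mu(X x Y)            (12.2.17)
G1 = mu.sum(axis=1)                  # marginal of mu on X           (12.2.19)
G2 = np.einsum('xy,xyb->b', mu, P)   # int P(B|x,y) mu(d(x,y))       (12.2.22)
L1 = G1 - G2                         # (12.2.24)

print(L0, L1)   # mu is "invariant" for the kernel P exactly when L1 = 0
```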
Note that the adjoints
respectively. □
12.2.7 Remark. (a) Let P(B|x, y) be the stochastic kernel in Example
12.2.6(c), and consider the Banach spaces C_b(·) and C_0(·) in Example
12.2.1(b). By a standard abuse of terminology, the kernel P is said to
be weakly continuous if the adjoint G_2* in (12.2.23) maps C_b(X) into
C_b(X × Y), that is,
(12.2.28)
[Observe that this is a Feller-like condition; see (12.2.47).]
(b) Suppose, on the other hand, that P is weakly continuous and, more-
over,
P(K|·) vanishes at infinity for each compact K ⊂ X;  (12.2.29)
that is [as in (12.2.6)], for each ε > 0 there is a compact set K′ = K′(ε, K)
in X × Y such that
P(K|x, y) ≤ ε  ∀(x, y) ∉ K′.
Then a straightforward calculation shows that, in addition to (12.2.28), G_2*
maps C_0(X) into C_0(X × Y), that is,
G_2*u is in C_0(X × Y) if u is in C_0(X).  (12.2.30)
In other words, suppose that (12.2.28) and (12.2.29) are satisfied, and that
X and Y (hence the product X × Y) are locally compact separable met-
ric spaces. Then, in view of Remark 12.2.2(b) and Proposition 12.2.5, the
condition (12.2.30) states that the map G_2 : M(X × Y) → M(X) defined
by (12.2.22) is weakly* continuous, that is, continuous with respect to the
weak* topologies σ(M(X × Y), C_0(X × Y)) and σ(M(X), C_0(X)). □
12.2.8 Remark. (Positive and dual cones.) (a) Let (X, Y) be a dual
pair of vector spaces, and K a convex cone in X, that is, x + x′ and λx
belong to K whenever x and x′ are in K and λ > 0. Unless explicitly stated
otherwise, we shall assume that K ≠ X and the origin (that is, the zero
vector, 0) is in K. In this case, K defines a partial order ≥ on X such that
x ≥ x′ ⇔ x − x′ ∈ K,
and K will be referred to as a positive cone. The dual cone of K is the
convex cone K* in Y defined by
K* := {y ∈ Y : ⟨x, y⟩ ≥ 0 ∀x ∈ K}.  (12.2.31)
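A standard finite-dimensional illustration (not from the text): in the dual pair of Example 12.2.1(a), the nonnegative orthant is its own dual cone,

```latex
K = \mathbb{R}^n_+ := \{x \in \mathbb{R}^n : x_i \ge 0 \ \forall i\}
\quad\Longrightarrow\quad
K^* = \{y \in \mathbb{R}^n : \langle x, y\rangle \ge 0 \ \forall x \in K\} = \mathbb{R}^n_+ .
```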
otherwise, inf ℙ := +∞. The program ℙ is solvable if there is a feasible
solution x* that achieves the infimum in (12.2.34). In this case, x* is called
an optimal solution for ℙ and, instead of inf ℙ, the value of ℙ is written
as
min ℙ = ⟨x*, c⟩.
Similarly, w ∈ W is feasible for the dual program ℙ* if it satisfies
(12.2.33), and ℙ* is said to be consistent if it has a feasible solution.
If ℙ* is consistent, then its value is defined as
otherwise, sup ℙ* := −∞. The dual ℙ* is solvable if there is a feasible
solution w* that attains the supremum in (12.2.35), in which case we write
the value of ℙ* as
max ℙ* = ⟨b, w*⟩.
The next theorem can be proved as in elementary (finite-dimensional)
LP.
12.2.9 Theorem.
(a) (Weak duality.) If ℙ and ℙ* are both consistent, then their values
are finite and satisfy
sup ℙ* ≤ inf ℙ.  (12.2.36)
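Assuming the primal constraints take the Anderson and Nash form Lx = b with x ∈ K, and the dual constraints the form c − L*w ∈ K* [a paraphrase of (12.2.32)-(12.2.33)], part (a) is the familiar one-line computation:

```latex
\langle b, w\rangle \;=\; \langle Lx, w\rangle \;=\; \langle x, L^{*}w\rangle \;\le\; \langle x, c\rangle ,
```

valid for every feasible x and w because x ∈ K and c − L*w ∈ K*; taking the supremum over feasible w and the infimum over feasible x gives (12.2.36).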
On the other hand, it is said that the strong duality condition for ℙ
holds if ℙ and its dual are both solvable and
The following theorem gives conditions under which ℙ is solvable and
there is no duality gap; for a proof see Anderson and Nash [1, Theorem
3.9].
12.2.10 Theorem. Let H be the set in Z × ℝ defined as
Note that if ℙ is consistent with a finite value inf ℙ, then [by definition
(12.2.34) of inf ℙ] there exists a minimizing sequence. A similar remark
holds for ℙ*.
12.2.13 Definition. (Aggregations and inner approximations.)
(b) If K′ ⊂ K is a subset of the positive cone K ⊂ X, then the program
min ℙ(W) = min ℙ.
(b) If K′ is weakly dense in K, then there is a sequence {x_n} in K′ such
that
12.2.17 Theorem. Suppose that the Borel space S is σ-compact, and that
{x_n} is a Markov chain on S that satisfies the Feller property. Then the
following conditions are equivalent:
(a) {x_n} has an i.p.m.
(b) There is a p.m. ν such that the sequence {νP^n, n = 0, 1, ...} is tight.
(c) There is a p.m. ν and a strictly unbounded function g ≥ 0 such that
sup_n ⟨νP^n, g⟩ < ∞.
12.2.18 Remark. Benes [1] proves Theorem 12.2.17 under the following,
stronger, assumptions:
(i) S is a LCSM space;
(ii) P satisfies the Feller property; and
(iii) for each compact set K, the function x ↦ P(K|x) vanishes at infinity.
As in Remark 12.2.7(b), it is easy to see that (ii) and (iii) imply that
Pu is in C_0(S) if u is in C_0(S) [cf. (12.2.30)],  (12.2.48)
and so (12.2.45), with n = 1, defines a map P : M(S) → M(S) which is
continuous in the weak* topology σ(M(S), C_0(S)). Benes uses this fact and
the Alaoglu Theorem [Remark 12.2.2(c)] to relate (a) and (d) in Theorem
12.2.17, as well as a fifth condition (e) not included here. Without the
latter condition (e), it can be verified that the proof by Benes also yields
Theorem 12.2.17 in its present form, assuming σ-compactness and the Feller
property, rather than (i), (ii), (iii). In fact, the relations
(a) ⇒ (b) ⇔ (c) ⇒ (d)
are immediate. Indeed, if (a) holds and γ denotes an i.p.m., then taking
ν = γ in (b), the sequence νP^n = γ (n = 0, 1, ...) is tight because any
single finite measure γ on a σ-compact metric space is tight. Hence (a)
implies (b). On the other hand, the equivalence of (b) and (c) follows from
Theorem 12.2.15, whereas (b) ⇒ (d) follows from the definition of tightness.
□
Notes on §12.2
(12.3.6)
and
(12.3.7)
where M_w(𝕂) and B_w(𝕂) are the weighted-norm spaces in Example
12.2.1(b), and similarly for M_{w0}(X) and B_{w0}(X). In particular, the bi-
linear form on (M_w(𝕂), B_w(𝕂)) is [as in (12.2.8)]
namely:
12.3.1 Assumption. There is a constant k such that
with
L_0μ := ⟨μ, 1⟩ = μ(𝕂)  (12.3.12)
and
(12.3.14)
of L is given by
for every pair (p, u) in ℝ × B_{w0}(X) and (x, a) in 𝕂. Hence, Assumption
12.3.1 and Proposition 12.2.5 yield [as in Example 12.2.6(d)] that
for some stochastic kernel φ ∈ Φ, which means that μ is feasible for (P)
if μ is a p.m. on 𝕂 such that its marginal μ̂ on X is an i.p.m. for the
transition kernel Q(·|x, φ).
On the other hand, observe that
⟨b, w⟩ = ⟨(1, 0), (p, u)⟩ = p  ∀w = (p, u) ∈ ℝ × B_{w0}(X).
Hence, by (12.3.18) and (12.3.15), the dual of (P) is [as in (12.2.33)]
(P*) maximize p
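For orientation, the primal-dual pair can be summarized as follows; this is a paraphrase assembled from (12.3.20), (12.3.21), and the feasibility condition (12.4.8), not a verbatim statement:

```latex
\text{(P)}:\quad \min_{\mu}\ \langle \mu, c\rangle
\quad\text{s.t.}\quad \mu \in M_w(\mathbb{K})^{+},\ \ \mu(\mathbb{K}) = 1,\ \ L_1\mu = 0;
\qquad
\text{(P}^{*}\text{)}:\quad \max_{(p,u)}\ p
\quad\text{s.t.}\quad p + u(x) \le c(x,a) + \int_X u(y)\,Q(dy\mid x,a)
\ \ \forall (x,a)\in\mathbb{K}.
```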
B. Solvability of (P)
Before proceeding to verify (12.3.3) and (12.3.4), let us note the following.
12.3.2 Remark. We will use the following conventions:
(a) A measure μ on 𝕂 ⊂ X × A may (and will) be viewed as a measure
on all of X × A by defining μ(𝕂^c) := 0, where 𝕂^c stands for the
complement of 𝕂 in X × A.
(b) We will regard c : 𝕂 → ℝ_+ as a function on all of X × A with c(x, a) :=
+∞ if (x, a) is in 𝕂^c. Observe that this convention is consistent with
Assumption 11.4.1(c), and, moreover, by (12.3.5), the weight function
w = +∞ on 𝕂^c. Any other function u in B_w(𝕂) can be arbitrarily
extended to X × A, for example, as u := 0 on 𝕂^c.
(c) 0 · (+∞) := 0.
(d) As in (12.2.20), a function u in B_{w0}(X) will also be seen as a function
in B_w(𝕂) given by u(x, a) := u(x) for all (x, a) in 𝕂.
Then, in particular, we may write the bilinear form in (12.3.8) as
⟨μ, u⟩ = ∫_{X×A} u dμ
Q_{φ*}(B|x) := ∫_A Q(B|x, a) φ*(da|x),
and
(12.3.25)
where
c_{φ*}(x) := ∫_A c(x, a) φ*(da|x).
i.e.,
μ̂_{φ*}(B) = ∫_X ∫_A Q(B|x, a) φ*(da|x) μ̂_{φ*}(dx).  (12.3.26)
and
(12.3.27)
which means that we already have the second equality in (12.3.24), as well
as the equalities μ*(𝕂) = 1 and L_1μ* = 0 in (12.3.20) and (12.3.21).
Therefore, to complete the proof of part (a) it suffices to show that
(i) μ* is in M_w(𝕂) [see (12.2.2)], so that μ* is indeed feasible for (P); and
(ii) ⟨μ, c⟩ ≥ p_min for any feasible solution μ for (P), which would yield
inf(P) ≥ p_min.
In other words, (i), (ii) and (12.3.27) will give that μ* is feasible for (P)
and
p_min = ⟨μ*, c⟩ ≥ inf(P) ≥ p_min,  i.e.,  ⟨μ*, c⟩ = p_min.
To continue with the program for this section, we now turn our attention
to proving (12.3.4).
12.3.4 Theorem. (Absence of duality gap.) If Assumptions 11.4.1 and
12.3.1 are satisfied, then (12.3.4) holds.
Proof. We wish to use Theorem 12.2.10 with Z and L as in (12.3.7) and
(12.3.14), respectively. Hence, we wish to show that the set
We will show that ((r*, v*), p*) is in H; that is, there exists a measure μ in
M_w(𝕂)^+ and a number r ≥ 0 such that
r* = L_0μ = μ(𝕂),  (12.3.31)
v* = L_1μ, and  (12.3.32)
p* = ⟨μ, c⟩ + r.  (12.3.33)
(12.3.34)
(12.3.35)
(12.3.36)
(i) μ is in M_w(𝕂)^+, that is, ‖μ‖_w := ⟨μ, w⟩ < ∞ [see (12.2.2)], and
(ii) μ satisfies (12.3.32).
maps C_b(X) into C_b(𝕂). Therefore, (12.3.36) and (12.3.29) yield that for
any function u in C_b(X)
That is, ⟨L_1μ, u⟩ = ⟨v*, u⟩ for any function u in C_b(X), which implies
(12.3.32). This proves (ii).
Summarizing, we have shown that μ is a measure in M_w(𝕂)^+ that sat-
isfies (12.3.31) and (12.3.32). Finally, from (12.3.37) and (12.3.30) we see
that
p* ≥ ⟨μ, c⟩ + liminf_{m→∞} r_m ≥ ⟨μ, c⟩,  since r_m ≥ 0 ∀m.
defined as
(12.3.39)
with L_1 as in (12.3.13). [To write ⟨μ, v_0⟩ in (12.3.39) we have used Remark
12.3.2(d).] The adjoint
C* : B_{w0}(X) × ℝ² → B_w(𝕂) × ℝ²
is given by
(12.3.40)
We will next use C and the Generalized Farkas Theorem 12.2.11 to obtain
the following.
12.3.7 Theorem [Equivalent formulations of the consistency of
(P).] If Assumption 12.3.5 holds, then the following statements are equiv-
alent:
(a) (P) is consistent, that is, there is a measure μ that satisfies (12.3.19).
(b) The linear equation
In the proof of Theorem 12.3.7 we will use the following lemma, where
we use the notation in Remark 12.2.2(a), (b), and Remark 12.3.2(d).
12.3.8 Lemma. Suppose that Assumption 12.3.5(a) holds, and let {μ_n}
be a bounded sequence of measures on 𝕂. If μ_j converges to μ in the weak*
topology σ(M(𝕂), C_0(𝕂)), then the marginals μ̂_j on X converge to μ̂ in the
weak* topology σ(M(X), C_0(X)); that is, if
(12.3.44)
then
(12.3.45)
(12.3.46)
(12.3.47)
and similarly for the second equality in (12.3.46). Moreover, for every fixed
n, (12.3.44) yields
lim_{j→∞} ⟨μ_j, u_n⟩ = ⟨μ, u_n⟩.  (12.3.49)
Then, in particular, ⟨μ, v_0⟩ ≥ ε, which implies that ⟨μ, 1⟩ = μ(𝕂) > 0.
Therefore, the measure μ* := μ/⟨μ, 1⟩ satisfies (12.3.19).
(b) ⇔ (c). In this proof we use the Generalized Farkas Theorem 12.2.11
with the following identifications:
In fact, part (i) is obvious because the adjoint C* [in (12.3.40)] maps W
into Y; see Proposition 12.2.5. Thus, it only remains to prove (ii).
Proof of (ii). To prove that C(K) is closed, with K as in (12.3.50),
consider a directed set (D, ≤) and a net {(μ^α, r_1^α, r_2^α), α ∈ D} in K such
that C(μ^α, r_1^α, r_2^α) converges weakly to, say, (ν, ρ_1, ρ_2) in M_{w0}(X) × ℝ²; that
is,
We wish to show that the limiting triplet (ν, ρ_1, ρ_2) is in C(K); that is,
there exists (μ^0, r_1^0, r_2^0) in K such that
(12.3.54)
(12.3.55)
(12.3.56)
On the other hand, from (12.3.56), Lemma 12.3.8 and (12.3.8), we get that
L_1μ_j converges to L_1μ^0 in the weak* topology σ(M(X), C_0(X)), i.e.,
This fact and (12.3.51) yield the first equality in (12.3.54), L_1μ^0 = ν.
Finally, the second and third equalities in (12.3.54) hold with
and
Iteration of the latter inequality gives, for all x ∈ X and n = 1, 2, ...,
u(x) ≥ E_x^{φ^∞}[u(x_n)] − n ρ_1 − ρ_2 Σ_{t=0}^{n−1} E_x^{φ^∞}[v_0(x_t)],
i.e.,
u(x) + n ρ_1 + ρ_2 Σ_{t=0}^{n−1} E_x^{φ^∞}[v_0(x_t)] ≥ E_x^{φ^∞}[u(x_n)].
Thus, multiplying by 1/n and taking liminf as n → ∞, (12.3.59) and
(12.3.58) give (12.3.43), since ρ_2 ≥ 0. □
E_x^{φ^∞}[v_0(x_t)] = ∫_X v_0(y) Q^t(dy|x, φ)
≤ (‖v_0‖ − c_0) Q^t(C|x, φ) + c_0.
Therefore, with x as in (12.3.58),
liminf_{n→∞} (1/n) Σ_{t=0}^{n−1} Q^t(C|x, φ) ≥ (ε_0 − c_0)/(‖v_0‖ − c_0) > 0,  (12.3.60)
which gives part (d) in Theorem 12.2.17 for the transition kernel P(·|x) :=
Q(·|x, φ) on S := X, with ν := δ_x and the compact set K := C. Thus, part
(a) in Theorem 12.2.17 implies the existence of an i.p.m. μ̂ for Q(·|x, φ),
and if we could show that the p.m. μ(d(x, a)) := φ(da|x) μ̂(dx) is in M_w(𝕂),
then we would have the same conclusion as Corollary 12.3.9, the consistency
of (P), by a quite different approach. Finally, it is worth noting (and easy
to prove) that (12.3.57) and (12.3.58) are also necessary for (P) to be
consistent.
For further comments on, and references related to, Theorem 12.2.17
see Hernandez-Lerma and Lasserre [2]. □
12.3.11 Remark. (Absence of duality gap.) Theorem 12.3.4 remains
valid if Assumption 12.3.1 holds, but Assumption 11.4.1 is replaced with:
(a) Assumption 12.3.5 is satisfied;
(b) c(x,a) is inf-compact [see Remark 11.4.2(a2)];
(c) (P) is consistent.
The proof of Theorem 12.3.4 is the same under this new set of hypotheses. □
Notes on §12.3
program, say (P_α), related to the α-discount Markov control problem, and
then we studied (P) as the "limit" of (P_α) as α ↑ 1.
2. Historically speaking, it is interesting to note that the LP formulation
of the AC problem, as well as of other Markov control problems, was born
from trying to solve the corresponding dynamic programming equation, which
in our present case is the Average Cost Optimality Equation (ACOE)
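In the present notation the ACOE reads as follows; the display is inferred from (12.3.22) and from the inequality shown at the end of §12.4, rather than quoted:

```latex
p^{*} + h^{*}(x) \;=\; \min_{a \in A(x)} \Bigl[\, c(x,a) + \int_X h^{*}(y)\, Q(dy \mid x,a) \Bigr],
\qquad x \in X. \qquad (12.3.61)
```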
Observe that if h* is a function in B_{w0}(X), then the ACOE (12.3.61) implies
that the pair (p*, h*) satisfies (12.3.22), that is, (p*, h*) is feasible for the
dual program (P*). On the other hand, if (p, u) satisfies (12.3.22), then
using straightforward arguments [or using (12.3.23)] one can see that
p ≤ p* (= p_min).
It is due to the latter inequality that the pairs (p, u) that are feasible for
(P*) are also called subsolutions to the ACOE. Thus the equality
where we have used that (P) is solvable [Theorem 12.3.3(a)] to write its
value as min(P) rather than inf(P).
12.4.1 Theorem. Suppose that Assumptions 11.4.1 and 12.3.1 are satis-
fied. If {μ_n} is a minimizing sequence for (P), then there exists a subse-
quence {j} of {n} such that {μ_j} converges in the weak topology σ(M(𝕂),
C_b(𝕂)) to an optimal solution for (P).
Proof. Let {μ_n} be a minimizing sequence for (P); that is [by (12.3.19)],
and (12.4.1) holds. In particular, (12.4.1) implies that for any given ε > 0
there exists n(ε) such that
Moreover, by (12.3.37),
This will prove that μ* is optimal for (P) provided that μ* is feasible for
(P); in other words, provided that μ* is a measure in M_w(𝕂)^+ and that
(12.4.7)
This, however, is obvious because (12.4.5) yields ⟨μ*, w⟩ = 1 + ⟨μ*, c⟩ < ∞,
whereas (12.4.7) follows from (12.4.2) and (12.4.4). □
B. Maximizing sequences for (P*)
By Definition 12.2.12(b) and the definition of the dual program (P*) [see
(12.3.22)], a sequence (p_n, u_n) in ℝ × B_{w0}(X) is a maximizing sequence for
(P*) if
p_n + u_n(x) ≤ c(x, a) + ∫_X u_n(y) Q(dy|x, a)  (12.4.8)
(12.4.10)
p_n ↑ p*.  (12.4.13)
h*(x) := limsup_{n→∞} u_n(x),
This yields that (p*, h*) is feasible for (P*) [see (12.3.22)], which together
with the first equality in (12.4.12) shows that (p*, h*) is in fact optimal for
(P*).
⟨μ*, L*(p*, h*)⟩ = p*,
and so
p* + h*(x) ≤ min_{a ∈ A(x)} [ c(x, a) + ∫_X h*(y) Q(dy|x, a) ]  for all (x, a) ∈ 𝕂,
Notes on §12.4
Now, for every x ∈ X, let δ_{f(x)}(·) be the Dirac measure at f(x), and let μ_f
be the p.m. on X × A, concentrated on 𝕂, given by
for B and C in B(X) and B(A), respectively. Then the marginal of μ_f on
X is μ̂_f, and, on the other hand,
(12.4.18)
where c_f(x) := c(x, f) := c(x, f(x)) for all x ∈ X. Finally, let {f_n^∞} be the
sequence of deterministic stationary policies defined by (10.5.17), (10.5.18).
Thus, under the assumptions of Theorem 10.5.2, the sequence {μ_{f_n}} can be
seen as a minimizing sequence for (P). In particular, observe that (12.4.17)
is the same as the equation after (12.3.21), with φ := f and μ̂ := μ̂_f.
Similarly, it can be seen that the value iteration procedure in §5.6 gives
a maximizing sequence for (P*).
(a) L_1μ = 0.
(b) ⟨L_1μ, u⟩ = 0  ∀u ∈ C_0(X).
This linear program has indeed a finite number of constraints, namely, the
cardinality |C_k| of C_k. We also have our first approximation result:
12.5.3 Theorem. Suppose that Assumption 12.5.1 is satisfied. Then:
(a) ℙ(C_k) is solvable for each k = 1, 2, ...; in fact, the aggregation ℙ(W)
is solvable for any subset W of C_0(X).
(b) For each k = 1, 2, ..., let μ_k be an optimal solution for ℙ(C_k), i.e.,
Then
⟨μ_k, c⟩ ↑ min(P) = p_min,  (12.5.4)
where the equality is due to Theorem 12.3.3(a). Furthermore, there
is a subsequence {μ_m} of {μ_k} that converges in the weak topology
σ(M(𝕂), C_b(𝕂)) to an optimal solution μ* for (P), i.e.,
(12.5.5)
minimize ⟨μ, c⟩
(12.5.6)
N(I, ε) := {ν ∈ M(X) : |⟨ν, u⟩| ≤ ε ∀u ∈ I}
(12.5.7)
The programs ℙ(C_k) and ℙ(C_k, ε_k) have a finite number of constraints and
give "nice" approximation results (Theorems 12.5.3 and 12.5.5). However,
they are still not good enough for our present purpose because the "decision
variable" μ lies in the infinite-dimensional space M_w(𝕂) ⊂ M(𝕂). (For the
latter spaces to be finite-dimensional we would need the state and action
sets, X and A, to both be finite sets.) Now to obtain finite-dimensional
approximations of (P) we will combine ℙ(C_k, ε_k) with a suitable sequence
of inner approximations [see Definition 12.2.13(b)]. These are based on the
following well-known result (for a proof see, for instance, Billingsley [1, p.
237, Theorem 4] or Parthasarathy [1, p. 44, Theorem 6.3]). We shall use
the notation introduced in Remark 12.2.1(b).
Δ := ∪_{n=1}^∞ Δ_n
is dense in P(𝕂) in the weak topology σ(M(𝕂), C_b(𝕂)); that is, for each
p.m. μ in P(𝕂), there is a sequence {ν_k} in Δ such that
(12.5.10)
Let us now consider a linear program as ℙ(C_k, ε_k) except that the p.m.'s
μ in (12.5.6) are replaced by p.m.'s in Δ_n ∩ P_w(𝕂). That is, instead of
ℙ(C_k, ε_k) consider the finite program
ℙ(C_k, ε_k, Δ_n):  minimize ⟨μ, c⟩
(12.5.11)
This is indeed a finite linear program because it has a finite number |C_k| of
constraints, and a finite number |D_n| of "decision variables", namely, the
coefficients of a measure in Δ_n ∩ P_w(𝕂).
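For a finite model the aggregation-relaxation-inner-approximation collapses to an ordinary LP that can be solved directly; the sketch below sets up that LP for a hypothetical finite unichain MCP (random data, for illustration), testing the invariance constraint against all state indicators, which plays the role of taking C_k rich enough.

```python
# Finite analogue of (P): minimize <mu, c> over p.m.'s mu on K = X x A
# subject to the invariance constraint L1 mu = 0.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(4)
nX, nA = 3, 2
Q = rng.random((nX, nA, nX)); Q /= Q.sum(axis=2, keepdims=True)
c = rng.random((nX, nA))

# Variables: mu flattened to length nX * nA.
# Constraints: mu(K) = 1 and, for each state y,
#   mu_hat({y}) - sum_{x,a} Q({y}|x,a) mu(x,a) = 0   [i.e., L1 mu = 0]
A_eq = [np.ones(nX * nA)]
for y in range(nX):
    row = np.zeros((nX, nA))
    row[y, :] += 1.0          # the marginal term mu_hat({y})
    row -= Q[:, :, y]         # minus the integral of Q({y}|., .) dmu
    A_eq.append(row.ravel())
b_eq = np.zeros(nX + 1); b_eq[0] = 1.0

res = linprog(c.ravel(), A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, None))
mu = res.x.reshape(nX, nA)
print(res.fun)                # min(P), the optimal average cost p_min
marg = mu.sum(axis=1)
print(mu[marg > 0] / marg[marg > 0, None])   # optimal policy on its support
```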
The corresponding approximation result is as follows.
12.5.7 Theorem. [Finite approximations for (P).] If Assumption
12.5.1 is satisfied, then:
(a) For each k = 1, 2, ..., there exists n(k) such that, for all n ≥ n(k),
the finite linear program ℙ(C_k, ε_k, Δ_n) is solvable and
(12.5.12)
where of course the limit is taken over values of n ≥ n*(k). Moreover,
if μ_{kn} [for k ≥ 1, n ≥ n*(k)] is an optimal solution for ℙ(C_k, ε_k, Δ_n),
then every weak accumulation point of {μ_{kn}} is an optimal solution
for (P).
The notes for this section are given at the end of §12.6.
We wish to show that ℙ(W) is solvable. First note that as (P) is consistent
[Theorem 12.3.3(a)], so is ℙ(W). More explicitly, there exists a p.m. μ that
satisfies (12.5.1), which, by Lemma 12.5.2(b), yields that μ satisfies (12.6.1).
Moreover, as (12.5.1) implies (12.6.1), we have
where the first inequality holds because c ≥ 0. Now let {μ_n} be a minimiz-
ing sequence for ℙ(W); that is, each μ_n satisfies (12.6.1) and
(12.6.4)
which would yield the reverse inequality ⟨μ, c⟩ ≥ inf ℙ(W). To prove (12.6.6)
observe that the condition (ii) is obvious because
(12.6.7)
because
(12.6.8)
Thus, as ⟨μ_k, c⟩ ≤ p for all k, the same arguments used to obtain (12.6.4)
and (12.6.5) yield a subsequence {μ_m} of {μ_k} and a p.m. μ on 𝕂 such
that
(12.6.10)
and
liminf_{m→∞} ⟨μ_m, c⟩ ≥ ⟨μ, c⟩.
so that
⟨μ, c⟩ ≤ min(P).  (12.6.11)
Therefore, if we can show that μ is a feasible solution for (P), then we shall
have that ⟨μ, c⟩ ≥ min(P), which combined with (12.6.11) will give that μ
is optimal for (P), that is, ⟨μ, c⟩ = min(P). Now to prove that μ is feasible
for (P) we need to check that the p.m. μ satisfies (12.5.1); equivalently, by
Lemma 12.5.2(c), we need to check
The condition (ii) follows from (for instance) (12.6.11) and the definition
of w in (12.3.5), that is, ⟨μ, w⟩ = 1 + ⟨μ, c⟩ < ∞. To prove (i) recall first
that C(X) is the limit of the increasing sequence {C_k}, i.e.,
C(X) = ∪_{k=1}^∞ C_k.  (12.6.13)
Now fix k, and let {μ_n} be a minimizing sequence for ℙ(C_k, ε_k). Then, as
in (12.6.3)-(12.6.5), there exists a subsequence {μ_m} of {μ_n} and a p.m. μ
on 𝕂 such that
(12.6.14)
and
(12.6.15)
Thus, to complete the proof of part (a) it suffices to show that μ is feasible
for ℙ(C_k, ε_k), since this would yield the reverse inequality in (12.6.15). On
the other hand, since (12.6.15) implies ⟨μ, w⟩ = 1 + ⟨μ, c⟩ < ∞, to show
that μ satisfies (12.5.6) it only remains to prove that
(12.6.16)
because each μ_m satisfies (12.5.6). Thus letting m → ∞ in the latter in-
equality we obtain (12.6.16), since [as in (12.6.7), (12.6.8)] the weak con-
vergence (12.6.14) implies
Now, as the family Δ in (12.5.9) is weakly dense in P(𝕂) [in the weak
topology σ(M(𝕂), C_b(𝕂))] and
Δ ∩ P_w(𝕂) ⊂ P_w(𝕂) ⊂ P(𝕂),
we see that Δ ∩ P_w(𝕂) is weakly dense in P_w(𝕂). Hence, there is a sequence
{μ_j} in Δ ∩ P_w(𝕂) such that
⟨μ_j, v⟩ → ⟨μ*, v⟩  ∀v ∈ C_b(𝕂).
This implies [as in (12.6.7), (12.6.8)]
μ_j converges weakly to μ* in the weak topology σ(M(𝕂), C_b(𝕂)), i.e.,
⟨μ_j, v⟩ = μ*(E_j)^{-1} ∫_{E_j} v dμ* → ⟨μ*, v⟩  ∀v ∈ C_b(𝕂),  (12.6.23)
⟨L_1μ_j, u⟩ → 0  ∀u ∈ C_k.
Therefore, as C_k is a finite set, there exists j_1(k) such that
(12.6.25)
Now fix j_0 ≥ max{j_1(k), j_2(k)}, and let P(E_{j_0}) be the family of p.m.'s on
𝕂 concentrated on E_{j_0}. Then μ_{j_0} is a p.m. in P(E_{j_0}) and satisfies (12.6.24)
and (12.6.25). We now wish to approximate μ_{j_0} by a suitable sequence {ν_n}
in Δ ∩ P(E_{j_0}).
Let D ⊂ 𝕂 be the countable dense subset in the definition of ℙ(C_k, ε_k, Δ_n).
Then D ∩ E_{j_0} is dense in E_{j_0}, and so, by Proposition 12.5.6,
(12.6.27)
⟨ν_n, c⟩ → ⟨μ_{j_0}, c⟩.
From this fact and (12.6.25), with j = j_0, there exists n_2(k) such that
(12.6.28)
(12.6.29)
(12.6.30)
⟨μ_k, c⟩ = min ℙ(C_k, ε_k, Δ_n).
Then, by (12.5.14) and the argument used to obtain (12.6.4) and (12.6.5),
there exists a subsequence {μ_m} of {μ_k} and a p.m. μ on 𝕂 such that
(12.6.31)
and
min(P) = liminf_{m→∞} ⟨μ_m, c⟩ ≥ ⟨μ, c⟩,  (12.6.32)
which in particular gives that μ is in P_w(𝕂). Thus, to conclude that μ is an
optimal solution for (P), it only remains to check that μ satisfies L_1μ = 0
in (12.5.1) or, equivalently [by Lemma 12.5.2(c)], that
This would yield that μ is feasible for (P), so that ⟨μ, c⟩ ≥ min(P), which
together with (12.6.32) would show that μ is optimal for (P). To prove
(12.6.33), first note that (12.6.31) implies [as in (12.6.7), (12.6.8)]
(12.6.34)
Now fix an arbitrary function u in C(X) and note that, as C_k ↑ C(X), there
exists N such that u is in C_k for all k ≥ N. Therefore, by (12.5.11),
Bertsekas, D.P.
[1] Dynamic Programming: Deterministic and Stochastic Models. Pren-
tice-Hall, Englewood Cliffs, N. J., 1987.
Bertsekas, D. P. and Shreve, S. E.
[1] Stochastic Optimal Control: The Discrete Time Case. Academic Press,
New York, 1978.
Bes, C. and Lasserre, J. B.
[1] An on-line procedure in discounted infinite-horizon stochastic optimal
control. J. Optim. Theory Appl. 50 (1986), 61-67.
Bes, C. and Sethi, S. P.
[1] Concepts of forecast and decision horizons: applications to dynamic
stochastic optimization problems. Math. Oper. Res. 13 (1988), 295-
310.
Bhattacharya, R. N. and Majumdar, M.
[1] Controlled semi-Markov models-the discounted case. J. Statist. Plann.
and Inference 21 (1989), 365-381.
Billingsley, P.
[1] Convergence of Probability Measures. Wiley, New York, 1968.
Blackwell, D.
[1] Memoryless strategies in finite-stage dynamic programming. Ann. Math.
Statist. 35 (1964), 863-865.
Bourbaki, N.
[1] Integration, Chap. IX. Hermann, Paris, 1969.
Brezis, H.
[1] Analyse Fonctionnelle: Theorie et Applications, 4e tirage. Masson,
Paris, 1993.
Brown, B. W.
[1] On the iterative method of dynamic programming on a finite space
discrete time Markov process. Ann. Math. Statist. 33 (1965), 719-726.
Cavazos-Cadena, R.
[1] Finite-state approximations to denumerable discounted Markov deci-
sion processes. Appl. Math. Optim. 14 (1986), 1-26.
Cavazos-Cadena, R. and Montes-de-Oca, R.
[1] Optimal stationary policies in controlled Markov chains with the ex-
pected total-reward criterion. Preprint, Departamento de Matemati-
cas, UAM-Iztapalapa, Mexico, 1997.
Chen, C.-T.
[1] Linear System Theory and Design. Holt, Rinehart and Winston, New
York, 1984.
Gale, D.
[1] On optimal development in a multi-sector economy. Rev. of Economic
Studies 34 (1965), 1-19.
Glynn, P. W.
[1] Simulation output analysis for general state space Markov chains. Ph.
D. Dissertation, Dept. of Operations Research, Stanford University,
1989.
[2] Some topics in regenerative steady-state simulation. Acta Appl. Math.
34 (1994), 225-236.
Glynn, P. W. and Meyn, S. P.
[1] A Lyapunov bound for solutions of the Poisson equation. Ann. Prob.
24 (1996), 916-931.
Gonzalez-Hernandez, J. and Hernandez-Lerma, O.
[1] Envelopes of sets of measures, tightness, and Markov control processes.
Appl. Math. Optim., to appear.
Gordienko, E. and Hernandez-Lerma, O.
[1] Average cost Markov control processes with weighted norms: value
iteration. Appl. Math. (Warsaw) 23 (1995), 219-237.
[2] Average cost Markov control processes with weighted norms: existence
of canonical policies. Appl. Math. (Warsaw) 23 (1995), 199-218.
Gordienko, E., Montes-de-Oca, R. and Minjarez-Sosa, A.
[1] Average cost optimization in Markov control processes with unbounded
costs: ergodicity and finite horizon approximation. Preprint, Departa-
mento de Matematicas, UAM-Iztapalapa, Mexico, 1995.
Hall, P. and Heyde, C. C.
[1] Martingale Limit Theory and Its Applications. Academic Press, New
York, 1980.
Haviv, M. and Puterman, M. L.
[1] Bias optimality in controlled queueing systems. J. Appl. Prob. 35
(1998), 136-150.
Hernandez-Lerma, O.
[1] Adaptive Markov Control Processes. Springer-Verlag, New York, 1989.
Hernandez-Lerma, O., Carrasco, G. and Perez-Hernandez, R.
[1] Markov control processes with the expected total-cost criterion: op-
timality, stability, and transient models. Reporte Interno, Depto. de
Matematicas, CINVESTAV-IPN, 1998. (Submitted.)
Kartashov, N. V.
[1] Criteria for uniform ergodicity and strong stability of Markov chains
with a common phase space. Theory Probab. and Math. Statist. 30
(1985), 71-89.
[2] Inequalities in theorems of ergodicity and stability for Markov chains
with common phase space. I. Theory Probab. Appl. 30 (1985), 247-259.
[3] Inequalities in theorems of ergodicity and stability for Markov chains
with common phase space. II. Theory Probab. Appl. 30 (1985), 507-
515.
[4] Strongly stable Markov chains. J. Soviet Math. 34 (1986), 1493-1498.
[5] Strong Stable Markov Chains. VSP, Utrecht, The Netherlands, 1996.
Kleinman, D.
[1] An easy way to stabilize a linear control system. IEEE Trans. Autom.
Control 15 (1970), p. 692.
Kurano, M.
[1] Markov decision processes with a minimum-variance criterion. J. Math.
Anal. Appl. 123 (1987), 572-583.
Kurano, M. and Kawai, M.
[1] Existence of optimal stationary policies in discounted decision pro-
cesses: approaches by occupation measures. Computers Math. Appl.
27 (1994), 95-101.
Kwon, W. H., Bruckstein, A. M. and Kailath, T.
[1] Stabilizing state feedback design via the moving horizon method. In-
ternat. J. Control 37 (1983), 631-643.
Lasota, A. and Mackey, M. C.
[1] Chaos, Fractals, and Noise: Stochastic Aspects of Dynamics, 2nd ed.
Springer-Verlag, New York, 1994.
Lasserre, J. B.
[1] Existence and uniqueness of an invariant probability measure for a
class of Feller-Markov chains. J. Theoret. Prob. 9 (1996), 595-612.
[2] Invariant probabilities for Markov chains on a metric space. Stat. Prob.
Lett., to appear.
[3] Sample-path average optimality for Markov control processes. IEEE
Trans. Autom. Control, to appear.
Lippman, S. A.
[1] On dynamic programming with unbounded rewards. Manage. Sci. 21
(1975), 1225-1233.
Luenberger, D. G.
[1] Optimization by Vector Space Methods. Wiley, New York, 1969.
Nummelin, E.
[1] General Irreducible Markov Chains and Non-Negative Operators. Cam-
bridge University Press, Cambridge, 1984.
[2] On the Poisson equation in the potential theory of a single kernel.
Math. Scand. 68 (1991), 59-82.
Orey, S.
[1] Limit Theorems for Markov Chain Transition Probabilities. Van Nos-
trand Reinhold, London, 1971.
Parthasarathy, K. R.
[1] Probability Measures on Metric Spaces. Academic Press, New York,
1967.
Piunovski, A. B.
[1] General Markov models with the infinite horizon. Problems of Control
and Infor. Theory 18 (1989), 169-182.
Pliska, S. R.
[1] On the transient case for Markov decision chains with general state
spaces. In: Puterman [2], pp. 335-349.
Puterman, M. L.
[1] Markov Decision Processes. Wiley, New York, 1994.
[2] (Editor) Dynamic Programming and Its Applications. Academic Press,
New York, 1979.
Quelle, G.
[1] Dynamic programming of expectation and variance. J. Math. Anal.
Appl. 55 (1976), 239-252.
Ramsey, F. P.
[1] A mathematical theory of savings. Economic J. 38 (1928), 543-559.
Rempala, R.
[1] Forecast horizon in a dynamic family of one-dimensional control prob-
lems. Diss. Math. 315 (1991).
Revuz, D.
[1] Markov Chains, revised ed. North-Holland, Amsterdam, 1984.
Rieder, U.
[1] Measurable selection theorems for optimization problems. Manuscripta
Math. 24 (1978), 115-131.
[2] On optimal policies and martingales in dynamic programming. J.
Appl. Prob. 13 (1976), 507-518.
[3] On Howard's policy improvement method. Math. Operationsforsch.
Statist., Ser. Optimization 8 (1977), 227-236.
Tweedie, R. L.
[1] Sufficient conditions for ergodicity and geometric ergodicity of Markov
chains on a general state space. Stoch. Proc. Appl. 3 (1975), 385-403.
[2] The existence of moments for stationary Markov chains. J. Appl.
Probab. 20 (1983), 191-196.
PI policy iteration
PIA policy iteration algorithm
RH rolling horizon
VI value iteration
Glossary of notation

I_B(x) := 1 if x ∈ B; := 0 otherwise (indicator function of the set B)
r^+ := max(r, 0), r^- := −min(r, 0)
𝔹(X)  measurable functions on X
𝔹_w(X)  Banach space of measurable functions on X with finite w-norm
μQ(B) := ∫_X Q(B|x) μ(dx)
‖Q‖_w  w-norm of a signed kernel Q

Section 8.5
𝔽_α  set of α-discount optimal decision functions
D_α(x, a)  α-discount discrepancy function
A_α(x)  α-discount optimal control actions in the state x
A_n(x)  α-VI optimal control actions in the state x, n = 1, 2, ...
D_n(x, a)  α-VI discrepancy function, n = 1, 2, ...
𝕃(X)  family of l.s.c. functions on X
𝕃_w(X) := 𝕃(X) ∩ 𝔹_w(X)
C(X)  family of continuous functions on X
C_w(X) := C(X) ∩ 𝔹_w(X)
C_b(X)  family of continuous bounded functions on X

Chapter 9

Section 9.1
V^1(π, x)  expected total cost (ETC) when using the policy π, given the initial state x
V^1*(x)  ETC value function

Section 9.2
ℝ̄  extended real numbers
r^+ := max(r, 0)
r^- := max(−r, 0) = −min(r, 0)

Section 9.3
J_n(π, x)  n-stage ETC when using the policy π, given the initial state x
V^{1(+)}(π, x) := E_x^π Σ_{t=0}^∞ c^+(x_t, a_t)
V^{1(-)}(π, x) := E_x^π Σ_{t=0}^∞ c^-(x_t, a_t)
V^1_n(π, x)  ETC from time n onwards when using the policy π, given the initial state x

Section 9.4
V^1(π, ν)  ETC when using the policy π, given the initial distribution ν
μ_ν^π  ETC-expected occupation measure on X × A when using the policy π, given the initial distribution ν
μ̂_ν^π  marginal of μ_ν^π on X
μ_ν^{π,t}  distribution of (x_t, a_t) when using the policy π, given the initial distribution ν, t = 0, 1, ...
μ̂_ν^{π,t}  marginal of μ_ν^{π,t} on X
π^{(1)}  1-shift policy determined by π

Section 9.5
T  dynamic programming operator (when α = 1); see Definition 9.5.1
D(x, a)  ETC-discrepancy function
π^{(n)}  n-shift policy determined by π, n = 0, 1, ...
M_n′  see (9.5.12)