Stochastic Programming
First Edition
Peter Kall
Institute for Operations Research
and Mathematical Methods of Economics
University of Zurich
CH-8044 Zurich
Stein W. Wallace
Department of Managerial Economics
and Operations Research
Norwegian Institute of Technology
University of Trondheim
N-7034 Trondheim
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
1 Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 An Illustrative Example . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Stochastic Programs: General Formulation . . . . . . . . . . . . 15
1.3.1 Measures and Integrals . . . . . . . . . . . . . . . . . . . 16
1.3.2 Deterministic Equivalents . . . . . . . . . . . . . . . . . 25
1.4 Properties of Recourse Problems . . . . . . . . . . . . . . . . . 31
1.5 Properties of Probabilistic Constraints . . . . . . . . . . . . . . 41
1.6 Linear Programming . . . . . . . . . . . . . . . . . . . . . . . . 48
1.6.1 The Feasible Set and Solvability . . . . . . . . . . . . . 49
1.6.2 The Simplex Algorithm . . . . . . . . . . . . . . . . . . 59
1.6.3 Duality Statements . . . . . . . . . . . . . . . . . . . . . 64
1.6.4 A Dual Decomposition Method . . . . . . . . . . . . . . 70
1.7 Nonlinear Programming . . . . . . . . . . . . . . . . . . . . . . 75
1.7.1 The Kuhn–Tucker Conditions . . . . . . . . . . . . . . . 77
1.7.2 Solution Techniques . . . . . . . . . . . . . . . . . . . . 84
1.7.2.1 Cutting-plane methods . . . . . . . . . . . . . 84
1.7.2.2 Descent methods . . . . . . . . . . . . . . . . . 88
1.7.2.3 Penalty methods . . . . . . . . . . . . . . . . . 91
1.7.2.4 Lagrangian methods . . . . . . . . . . . . . . . 93
1.8 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . 97
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
5.1 Problem Reduction . . . . . . . . . . . . . . . . . . . . . . . . . 249
5.1.1 Finding a Frame . . . . . . . . . . . . . . . . . . . . . . 250
5.1.2 Removing Unnecessary Columns . . . . . . . . . . . . . 251
5.1.3 Removing Unnecessary Rows . . . . . . . . . . . . . . . 252
5.2 Feasibility in Linear Programs . . . . . . . . . . . . . . . . . . . 253
5.2.1 A Small Example . . . . . . . . . . . . . . . . . . . . . . 260
5.3 Reducing the Complexity of Feasibility Tests . . . . . . . . . . 261
5.4 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . 262
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
Preface
Over the last few years, both of the authors, and also most others in the field
of stochastic programming, have said that what we need more than anything
just now is a basic textbook—a textbook that makes the area available not
only to mathematicians, but also to students and other interested parties who
cannot or will not try to approach the field via the journals. We also felt
the need to provide an appropriate text for instructors who want to include
the subject in their curriculum. It is probably not possible to write such a
book without assuming some knowledge of mathematics, but it has been our
clear goal to avoid writing a text readable only for mathematicians. We want
the book to be accessible to any quantitatively minded student in business,
economics, computer science and engineering, plus, of course, mathematics.
So what do we mean by a quantitatively minded student? We assume that
the reader of this book has had a basic course in calculus, linear algebra
and probability. Although most readers will have a background in linear
programming (which replaces the need for a specific course in linear algebra),
we provide an outline of all the theory we need from linear and nonlinear
programming. We have chosen to put this material into Chapter 1, so that
the reader who is familiar with the theory can drop it, and the reader who
knows the material, but wonders about the exact definition of some term, or
who is slightly unfamiliar with our terminology, can easily check how we see
things. We hope that instructors will find enough material in Chapter 1 to
cover specific topics that may have been omitted in the standard book on
optimization used in their institution. By putting this material directly into
the running text, we have made the book more readable for those with the
minimal background. But, at the same time, we have found it best to separate
what is new in this book—stochastic programming—from more standard
material of linear and nonlinear programming.
Despite this clear goal concerning the level of mathematics, we must
admit that when treating some of the subjects, like probabilistic constraints
(Section 1.5 and Chapter 4), or particular solution methods for stochastic
programs, like stochastic decomposition (Section 3.8) or quasi-gradient
methods (Section 3.9), we have had to use a slightly more advanced language
in probability. Although the actual information found in those parts of the
book is made simple, some terminology may here and there not belong to
the basic probability terminology. Hence, for these parts, the instructor must
either provide some basic background in terminology, or the reader should at
least consult carefully Section 1.3.1, where we have tried to put together those
terms and concepts from probability theory used later in this text.
Within the mathematical programming community, it is common to split
the field into topics such as linear programming, nonlinear programming,
network flows, integer and combinatorial optimization, and, finally, stochastic
programming. Convenient as that may be, it is conceptually inappropriate.
It puts forward the idea that stochastic programming is distinct from integer
programming the same way that linear programming is distinct from nonlinear
programming. The counterpart of stochastic programming is, of course,
deterministic programming. We have stochastic and deterministic linear
programming, deterministic and stochastic network flow problems, and so on.
Although this book mostly covers stochastic linear programming (since that is
the best developed topic), we also discuss stochastic nonlinear programming,
integer programming and network flows.
Since we have let subject areas guide the organization of the book, the
chapters are of rather different lengths. Chapter 1 starts out with a simple
example that introduces many of the concepts to be used later on. Tempting as
it may be, we strongly discourage skipping these introductory parts. If these
parts are skipped, stochastic programming will come across as merely an
algorithmic and mathematical subject, which will serve to limit the usefulness
of the field. In addition to the algorithmic and mathematical facets of the
field, stochastic programming also involves model creation and specification
of solution characteristics. All instructors know that modelling is harder to
teach than are methods. We are sorry to admit that this difficulty persists
in this text as well. That is, we do not provide an in-depth discussion of
modelling stochastic programs. The text is not free from discussions of models
and modelling, however, and it is our strong belief that a course based on this
text is better (and also easier to teach and motivate) when modelling issues
are included in the course.
Chapter 1 contains a formal approach to stochastic programming, with a
discussion of different problem classes and their characteristics. The chapter
ends with linear and nonlinear programming theory that weighs heavily in
stochastic programming. The reader will probably get the feeling that the
parts concerned with chance-constrained programming are mathematically
more complicated than some parts discussing recourse models. There is a
good reason for that: whereas recourse models transform the randomness
contained in a stochastic program into one special parameter of some random
vector’s distribution, namely its expectation, chance constrained models deal
PREFACE xi
more explicitly with the distribution itself. Hence the latter models may
be more difficult, but at the same time they also exhaust more of the
information contained in the probability distribution. However, with respect to
applications, there is no generally valid justification for claiming that either of
the two basic model types is "better" or "more relevant". As a matter of fact, we
know of applications for which the recourse model is very appropriate, of
others for which chance constraints have to be modelled, and even of applications
for which recourse terms were designed for one part of the stochastic constraints
and chance constraints for another part. Hence, in a first reading
or an introductory course, one or the other proof appearing too complicated
can certainly be skipped without harm. However, to get a valid picture about
stochastic programming, the statements about basic properties of both model
types as well as the ideas underlying the various solution approaches should be
noticed. Although the basic linear and nonlinear programming is put together
in one specific part of the book, the instructor or the reader should pick up
the subjects as they are needed for the understanding of the other chapters.
That way, it will be easier to pick out exactly those parts of the theory that
the students or readers do not know already.
Chapter 2 starts out with a discussion of the Bellman principle for
solving dynamic problems, and then discusses decision trees and dynamic
programming in both deterministic and stochastic settings. There then follows
a discussion of the rather new approach of scenario aggregation. We conclude
the chapter with a discussion of the value of using stochastic models.
Chapter 3 covers recourse problems. We first discuss some topics from
Chapter 1 in more detail. Then we consider decomposition procedures
especially designed for stochastic programs with recourse. We next turn to
the questions of bounds and approximations, outlining some major ideas
and indicating the direction for other approaches. The special case of simple
recourse is then explained, before we show how decomposition procedures for
stochastic programs fit into the framework of branch-and-cut procedures for
integer programs. This makes it possible to develop an approach for stochastic
integer programs. We conclude the chapter with a discussion of Monte-Carlo
based methods, in particular stochastic decomposition and quasi-gradient
methods.
Chapter 4 is devoted to probabilistic constraints. Based on convexity
statements provided in Section 1.5, one particular solution method is described
for the case of joint chance constraints with a multivariate normal distribution
of the right-hand side. For separate probabilistic constraints with a joint
normal distribution of the coefficients, we show how the problem can be
transformed into a deterministic convex nonlinear program. Finally, we
address a problem very relevant in dealing with chance constraints: the
problem of how to construct efficiently lower and upper bounds for a
multivariate distribution function, and give a first sketch of the ideas used
in this area.
Preprocessing is the subject of Chapter 5. “Preprocessing” is any analysis
that is carried out before the actual solution procedure is called. Preprocessing
can be useful for simplifying calculations, but the main purpose is to facilitate
a tool for model evaluation.
We conclude the book with a closer look at networks (Chapter 6). Since
these are nothing else than specially structured linear programs, we can draw
freely from the topics in Chapter 3. However, the added structure of networks
allows many simplifications. We discuss feasibility, preprocessing and bounds.
We conclude the chapter with a closer look at PERT networks.
Each chapter ends with a short discussion of where more literature can be
found, some exercises, and, finally, a list of references.
Writing this book has been both interesting and difficult. Since it is the first
basic textbook totally devoted to stochastic programming, we both enjoyed
and suffered from the fact that there is, so far, no experience to suggest how
such a book should be constructed. Are the chapters in the correct order?
Is the level of difficulty even throughout the book? Have we really captured
the basics of the field? In all cases the answer is probably NO. Therefore,
dear reader, we appreciate all comments you may have, be they regarding
misprints, plain errors, or simply good ideas about how this should have been
done. And also, if you produce suitable exercises, we shall be very happy to
receive them, and if this book ever gets revised, we shall certainly add them,
and allude to the contributor.
About 50% of this text served as a basis for a course in stochastic
programming at The Norwegian Institute of Technology in the fall of 1992. We
wish to thank the students for putting up with a very preliminary text, and
for finding such an astonishing number of errors and misprints. Last but not
least, we owe sincere thanks to Julia Higle (University of Arizona, Tucson),
Diethard Klatte (University of Zurich), Janos Mayer (University of Zurich) and
Pavel Popela (Technical University of Brno), who read the manuscript¹
very carefully and not only fixed linguistic bugs but also saved us from quite a
number of crucial mistakes. Finally, we highly appreciate the good cooperation
and very helpful comments provided by our publisher. The remaining errors
are obviously the sole responsibility of the authors.
¹ Written in LaTeX.
1 Basic Concepts
1.1 Preliminaries
During the last four decades, progress in computational methods for solving
mathematical programs has been impressive, and problems of considerable size
can now be solved efficiently and with high reliability.
In many modelling situations it is unreasonable to assume that the
coefficients cj , aij , bi or the functions gi (and the set X) respectively in
problems (1.1) and (1.3) are deterministically fixed. For instance, future
productivities in a production problem, inflows into a reservoir connected
to a hydro power station, demands at various nodes in a transportation
network, and so on, are often appropriately modelled as uncertain parameters,
which are at best characterized by probability distributions. The uncertainty
about the realized values of those parameters cannot always be wiped out
just by inserting their mean values or some other (fixed) estimates during
the modelling process. That is, depending on the practical situation under
consideration, problems (1.1) or (1.3) may not be the appropriate models
for describing the problem we want to solve. In this chapter we emphasize—
and possibly clarify—the need to broaden the scope of modelling real life
decision problems. Furthermore, we shall provide from linear programming
and nonlinear programming the essential ingredients absolutely necessary for
the understanding of the subsequent chapters. Obviously these latter sections
may be skipped—or used as a quick revision—by readers who are already
familiar with the related optimization courses.
Before coming to a more general setting we first derive some typical
stochastic programming models, using a simplified production problem to
illustrate the various model types.
Let us consider the following problem, idealized for the purpose of easy
presentation. From two raw materials, raw1 and raw2, we may simultaneously
produce two different goods, prod1 and prod2 (as may happen for example in
a refinery). The output of products per unit of the raw materials as well
as the unit costs of the raw materials c = (craw1 , craw2 )T (yielding the
production cost γ), the demands for the products h = (hprod1 , hprod2 )T and
the production capacity b̂, i.e. the maximal total amount of raw materials that
can be processed, are given in Table 1.
Table 1

               Products
  Raws       prod1   prod2     c      b̂
  raw1         2       3       2      1
  raw2         6       3       3      1
  relation     ≥       ≥       =      ≤
  h           180     162      γ     100

According to this formulation of our production problem, we have to deal
with the following linear program:
min  2x_raw1 + 3x_raw2
s.t. x_raw1 +  x_raw2 ≤ 100,
     2x_raw1 + 6x_raw2 ≥ 180,          (2.1)
     3x_raw1 + 3x_raw2 ≥ 162,
     x_raw1 ≥ 0,
     x_raw2 ≥ 0.
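As a small computational aside (not part of the original presentation): (2.1) can be handed to any LP solver. The sketch below uses the linprog routine from SciPy—assuming that library is available—after negating the two "≥" rows, since linprog expects "≤" constraints; it reproduces the optimal plan x̂ = (36, 18) with cost 126.

import numpy as np
from scipy.optimize import linprog

c = np.array([2.0, 3.0])                    # production costs of raw1, raw2
A_ub = np.array([[ 1.0,  1.0],              # x1 + x2 <= 100 (capacity)
                 [-2.0, -6.0],              # 2x1 + 6x2 >= 180 (prod1 demand)
                 [-3.0, -3.0]])             # 3x1 + 3x2 >= 162 (prod2 demand)
b_ub = np.array([100.0, -180.0, -162.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 2)
print(res.x, res.fun)                       # expected: [36. 18.] 126.0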
Figure 2 LP: feasible production plans and cost function for γ = 290.
• the output of gas from raw1 and the output of fuel from raw2 may change
randomly (whereas the other productivities are deterministic);
• simultaneously, the weekly demands of the clients, h_prod1 for gas and h_prod2
for fuel, are varying randomly;
• the weekly production plan (x_raw1, x_raw2) has to be fixed in advance and
cannot be changed during the week, whereas
• the actual productivities are only observed (measured) during the
production process itself, and
• the clients expect their actual demand to be satisfied during the
corresponding week.
h_prod1 = 180 + ζ̃1,
h_prod2 = 162 + ζ̃2,          (2.3)
π(raw1, prod1) = 2 + η̃1,
π(raw2, prod2) = 3.4 − η̃2,
where the random variables ζ̃j are modelled using normal distributions, and
η̃1 and η̃2 are distributed uniformly and exponentially respectively, with the
following parameters:¹

ζ̃1 ∼ N(0, 12),
ζ̃2 ∼ N(0, 9),          (2.4)
η̃1 ∼ U[−0.8, 0.8],
η̃2 ∼ EXP(λ = 2.5).
For simplicity, we assume that these four random variables are mutually
independent. Since the random variables ζ̃1, ζ̃2 and η̃2 are unbounded,
we restrict our considerations to their respective 99% confidence intervals
(except for the uniformly distributed η̃1, which is bounded anyway). So we
have for the above random variables' realizations

ζ1 ∈ [−30.91, 30.91],
ζ2 ∈ [−23.18, 23.18],          (2.5)
η1 ∈ [−0.8, 0.8],
η2 ∈ [0.0, 1.84].
Hence, instead of the linear program (2.1), we are dealing with the stochastic
linear program
min  2x_raw1 + 3x_raw2
s.t. x_raw1 + x_raw2 ≤ 100,
     (2 + η̃1)x_raw1 + 6x_raw2 ≥ 180 + ζ̃1,          (2.6)
     3x_raw1 + (3.4 − η̃2)x_raw2 ≥ 162 + ζ̃2,
     x_raw1 ≥ 0,
     x_raw2 ≥ 0.
This is not a well-defined decision problem, since it is not at all clear what
the meaning of “min” can be before knowing a realization (ζ1 , ζ2 , η1 , η2 ) of
(ζ̃1 , ζ̃2 , η̃1 , η̃2 ).
Geometrically, the consequence of our random parameter changes may
be rather complex. The effect of only the right-hand sides ζi varying
over the intervals given in (2.5) corresponds to parallel translations of the
corresponding facets of the feasible set as shown in Figure 3.
We may instead consider the effect of only the ηi changing their values
within the intervals mentioned in (2.5). That results in rotations of the related
facets. Some possible situations are shown in Figure 4, where the centers of
rotation are indicated by small circles.
Allowing for all the possible changes in the demands and in the
productivities simultaneously yields a superposition of the two geometrical
motions, i.e. the translations and the rotations. It is easily seen that the
¹ We use N(µ, σ) to denote the normal distribution with mean µ and variance σ².
a fat solution exists at the intersection of the two rightmost constraints for
prod1 and prod2, which is easily computed as
To introduce another possibility, let us assume that the refinery has made
the following arrangement with its clients. In principle, the clients expect
the refinery to satisfy their weekly demands. However, very likely—according
to the production plan and the unforeseen events determining the clients’
demands and/or the refinery’s productivity—the demands cannot be covered
by the production, which will cause “penalty” costs to the refinery. The
amount of shortage has to be bought from the market. These penalties are
supposed to be proportional to the respective shortage in products, and we
assume that per unit of undeliverable products they amount to
Figure 5 LP: feasible set varying with productivities and demands; some wait-and-see solutions.
More precisely, we may want to find a production plan that minimizes the
sum of our original first-stage (i.e. production) costs and the expected recourse
costs. To formalize this approach, we abbreviate our notation. Instead of the
four single random variables ζ̃1, ζ̃2, η̃1 and η̃2, it seems convenient to use the
random vector ξ̃ = (ζ̃1, ζ̃2, η̃1, η̃2)^T. Further, we introduce for each of the
two stochastic constraints in (2.6) a recourse variable y_i(ξ̃), i = 1, 2, which
simply measures the corresponding shortage in production if there is any;
since shortage depends on the realizations of our random vector ξ̃, so do the
corresponding recourse variables, i.e. the y_i(ξ̃) are themselves random variables.
Following the approach sketched so far, we now replace the vague stochastic
program (2.6) by the well-defined stochastic program with recourse, using
h1(ξ̃) := h_prod1 = 180 + ζ̃1, h2(ξ̃) := h_prod2 = 162 + ζ̃2, and—for the random
productivities in (2.6)—α(ξ̃) := 2 + η̃1, β(ξ̃) := 3.4 − η̃2:
min  2x_raw1 + 3x_raw2 + E_ξ̃[7y1(ξ̃) + 12y2(ξ̃)]
s.t. x_raw1 + x_raw2 ≤ 100,
     α(ξ̃)x_raw1 + 6x_raw2 + y1(ξ̃) ≥ h1(ξ̃),
     3x_raw1 + β(ξ̃)x_raw2 + y2(ξ̃) ≥ h2(ξ̃),          (2.10)
     x_raw1 ≥ 0,
     x_raw2 ≥ 0,
     y1(ξ̃) ≥ 0,
     y2(ξ̃) ≥ 0.
In (2.10), E_ξ̃ stands for the expected value with respect to the distribution
of ξ̃, and, in general, it is understood that the stochastic constraints have
to hold almost surely (a.s.), i.e. they are to be satisfied with probability 1.
Note that if ξ̃ has a finite discrete distribution {(ξ^i, p_i), i = 1, · · · , r}
(p_i > 0 ∀i) then (2.10) is just an ordinary linear program with a so-called
dual decomposition structure:
min  2x_raw1 + 3x_raw2 + Σ_{i=1}^{r} p_i [7y1(ξ^i) + 12y2(ξ^i)]
s.t. x_raw1 + x_raw2 ≤ 100,
     α(ξ^i)x_raw1 + 6x_raw2 + y1(ξ^i) ≥ h1(ξ^i) ∀i,
     3x_raw1 + β(ξ^i)x_raw2 + y2(ξ^i) ≥ h2(ξ^i) ∀i,          (2.11)
     x_raw1 ≥ 0,
     x_raw2 ≥ 0,
     y1(ξ^i) ≥ 0 ∀i,
     y2(ξ^i) ≥ 0 ∀i.
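To see the dual decomposition structure of (2.11) concretely, the following sketch assembles and solves the extensive form for a purely hypothetical three-scenario distribution (the scenario values and probabilities below are invented for illustration; they are not the discretization of Figure 6). Variables are ordered as (x_raw1, x_raw2, y1(ξ^1), y2(ξ^1), · · ·).

import numpy as np
from scipy.optimize import linprog

# hypothetical scenarios: (alpha_i, beta_i, h1_i, h2_i, p_i)
scen = [(1.8, 3.2, 190.0, 165.0, 0.3),
        (2.0, 3.0, 180.0, 162.0, 0.5),
        (2.4, 2.8, 170.0, 158.0, 0.2)]
r = len(scen)
n = 2 + 2 * r                          # x plus one (y1, y2) pair per scenario

c = np.zeros(n); c[:2] = [2.0, 3.0]
rows, rhs = [[1.0, 1.0] + [0.0] * (2 * r)], [100.0]   # x1 + x2 <= 100
for i, (a, bta, h1, h2, p) in enumerate(scen):
    c[2 + 2*i], c[3 + 2*i] = 7.0 * p, 12.0 * p        # expected recourse costs
    row1 = [0.0] * n; row1[0], row1[1], row1[2 + 2*i] = -a, -6.0, -1.0
    row2 = [0.0] * n; row2[0], row2[1], row2[3 + 2*i] = -3.0, -bta, -1.0
    rows += [row1, row2]; rhs += [-h1, -h2]           # '>=' rows negated

res = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs),
              bounds=[(0, None)] * n)
print(res.x[:2], res.fun)              # first-stage plan and total expected cost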
Figure 6 Discrete distribution generated from N(0, 12), N(0, 9); (r1, r2) = (15, 15).
whereas the solution of our original LP (2.1) would yield as total expected
costs
γ(x̂) = 204.561.
min  2x_raw1 + 3x_raw2
s.t. x_raw1 + x_raw2 ≤ 100,
     x_raw1 ≥ 0,
     x_raw2 ≥ 0,
     P( 2x_raw1 + 6x_raw2 ≥ h1(ξ̃),
        3x_raw1 + 3x_raw2 ≥ h2(ξ̃) ) ≥ 0.95.
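For any fixed production plan, the probability on the left-hand side of this joint chance constraint is easy to estimate by Monte Carlo sampling from the demand distributions in (2.3)/(2.4) (recall footnote 1: the second parameter of N(µ, σ) is the standard deviation). A minimal sketch, with an arbitrarily chosen candidate plan:

import numpy as np

rng = np.random.default_rng(0)
x1, x2 = 40.0, 20.0                       # hypothetical candidate plan
N = 100_000
h1 = 180.0 + rng.normal(0.0, 12.0, N)     # h_prod1 = 180 + zeta1
h2 = 162.0 + rng.normal(0.0, 9.0, N)      # h_prod2 = 162 + zeta2
ok = (2*x1 + 6*x2 >= h1) & (3*x1 + 3*x2 >= h2)
print(ok.mean())                          # estimate of P(both demands covered)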
This problem can be solved with appropriate methods, one of which will be
presented later in this text. It seems worth mentioning that in this case
using the normal distributions instead of their discrete approximations is
appropriate owing to theoretical properties of probabilistic constraints to be
discussed later on. The solution of the probabilistically constrained program
is
In the same way as random parameters in (2.1) led us to the stochastic (linear)
program (2.6), random parameters in (1.3) may lead to the problem

"min" g0(x, ξ̃)
s.t.  g_i(x, ξ̃) ≤ 0, i = 1, · · · , m,          (3.1)
      x ∈ X ⊂ IR^n,
• an interval if k = 1,
• a rectangle if k = 2,
• a cube if k = 3,
while for k > 3 there is no common language term for these objects since
geometric imagination obviously ends there.
Sometimes we want to know something about the “size” of a set in IRk , e.g.
the length of a beam, the area of a piece of land or the volume of a building;
in other words, we want to measure it. One possibility to do this is to fix
first how we determine the measure of intervals, and a "natural" choice of a
measure µ would be

in IR¹: µ(I_[a,b)) = b − a if a ≤ b, and 0 otherwise;
in IR²: µ(I_[a,b)) = (b1 − a1)(b2 − a2) if a ≤ b, and 0 otherwise;          (3.2)
in IR³: µ(I_[a,b)) = (b1 − a1)(b2 − a2)(b3 − a3) if a ≤ b, and 0 otherwise.
Obviously, for a set A that is the disjoint finite union of intervals, i.e.
A = ∪_{n=1}^{M} I^{(n)}, with I^{(n)} intervals such that I^{(n)} ∩ I^{(m)} = ∅
for n ≠ m, we define its measure as µ(A) = Σ_{n=1}^{M} µ(I^{(n)}). In order to
measure a set A that is not just an interval or a finite union of disjoint
intervals, we may proceed as follows.
Any finite collection of pairwise-disjoint intervals contained in A forms
a packing C of A, C being the union of those intervals, with a well-
defined measure µ(C) as mentioned above. Analogously, any finite collection
of pairwise disjoint intervals, with their union containing A, forms a covering
D of A with a well-defined measure µ(D).
i.e. the half-circle illustrated in Figure 7, which also shows a first possible
packing C1 and covering D1. Obviously we learned in high school that the
area of A_circ is computed as µ(A_circ) = ½ × π × (radius)² = 25.1327, whereas
we easily compute µ(C1) = 13.8564 and µ(D1) = 32. If we forgot all our
wisdom from high school, we would only be able to conclude that the measure
of the half-circle A_circ is between 13.8564 and 32. To obtain a more precise
estimate, we can try to improve the packing and the covering in such a way
that the new packing C2 exhausts more of the set A_circ and the new covering
D2 becomes a tighter outer approximation of A_circ. This is shown in Figure 8,
for which we get µ(C2) = 19.9657 and µ(D2) = 27.9658.
Hence the measure of A_circ is between 19.9657 and 27.9658. If this is still
not precise enough, we may further improve the packing and covering. For
the half-circle A_circ, it is easily seen that we may determine its measure in this
way with any arbitrary accuracy.
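The packing/covering construction can be imitated numerically for the half-circle above (radius 4, so µ(A_circ) = ½π·4² ≈ 25.1327): lay a grid of small squares over it, count the squares contained in A_circ for a packing and the squares meeting A_circ for a covering. A sketch of this idea (it relies on the half-disc being convex, so that a square lies inside as soon as its four corners do):

R, n = 4.0, 200                      # radius 4; n grid cells per unit direction
h = R / n                            # cell edge length

pack = cover = 0
for i in range(-n - 1, n + 1):       # grid over [-R-h, R+h] x [0, R+h]
    for j in range(0, n + 1):
        x0, y0 = i * h, j * h
        # packing: the half-disc is convex, so the cell is contained in it
        # as soon as all four corners are (y >= 0 holds by construction)
        if all(x*x + y*y <= R*R for x in (x0, x0 + h) for y in (y0, y0 + h)):
            pack += 1
        # covering: the cell meets the half-disc iff its point closest to the
        # origin (clamp 0 into [x0, x0+h]; closest y-side is y0) lies inside
        cx = min(max(0.0, x0), x0 + h)
        if cx*cx + y0*y0 <= R*R:
            cover += 1

print(pack * h * h, cover * h * h)   # lower and upper bounds for 25.1327...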
In general, for any closed bounded set A ⊂ IR^k, we may try a similar
procedure to measure A. Denote by C_A the set of all packings for A and by
D_A the set of all coverings of A. Then A is called measurable iff²

sup{µ(C) | C ∈ C_A} = inf{µ(D) | D ∈ D_A},

and this common value is defined to be the measure µ(A). The construction
extends to unbounded closed sets as well.
This implies that IR^k itself is measurable. Observing that there always exist
collections of countably many pairwise-disjoint intervals I_[a^ν,b^ν), ν = 1, 2, · · ·,
covering IR^k, i.e. ∪_{ν=1}^{∞} I_[a^ν,b^ν) = IR^k (e.g. take intervals with all edges
having length 1), we get µ(A) = Σ_{ν=1}^{∞} µ(A ∩ I_[a^ν,b^ν)) as the measure of A.
Obviously µ(A) = ∞ may happen, as it does for instance with A = IR²₊ (i.e. the
positive orthant of IR²) or with A = {(x, y) ∈ IR² | x ≥ 1, 0 ≤ y ≤ 1/x}. But we
may also find unbounded sets with finite measure, e.g. A = {(x, y) ∈ IR² | x ≥
0, 0 ≤ y ≤ e^{−x}} (see the exercises at the end of this chapter).
² "iff" stands for "if and only if".
The measure introduced this way for closed sets, based on the elementary
measure for intervals as defined in (3.2), may be extended as a "natural"
measure for the class A of measurable sets in IR^k, and will be denoted
throughout by µ. We just add that A is characterized by the following
properties:

if A ∈ A then also IR^k − A ∈ A;          (3.3 i)
if A_i ∈ A, i = 1, 2, · · ·, then also ∪_{i=1}^{∞} A_i ∈ A.          (3.3 ii)

This implies that with A_i ∈ A, i = 1, 2, · · ·, also ∩_{i=1}^{∞} A_i ∈ A.
As a consequence of the above construction, we have, for the natural
measure µ defined on IR^k, that

µ(A) ≥ 0 ∀A ∈ A and µ(∅) = 0;          (3.4 i)
if A_i ∈ A, i = 1, 2, · · ·, and A_i ∩ A_j = ∅ for i ≠ j,
then µ(∪_{i=1}^{∞} A_i) = Σ_{i=1}^{∞} µ(A_i).          (3.4 ii)

In other words, the measure of a countable disjoint union of measurable sets
equals the countable sum of the measures of these sets.
These properties are also familiar from probability theory: there we have
some space Ω of outcomes ω (e.g. the results of random experiments), a
collection F of subsets F ⊂ Ω called events, and a probability measure (or
probability distribution) P assigning to each F ∈ F the probability with
which it occurs. To set up probability theory, it is then required that
(i) Ω is an event, i.e. Ω ∈ F, and, with F ∈ F, it holds that also Ω − F ∈ F,
i.e. if F is an event then so also is its complement (or not F);
(ii) the countable union of events is an event.
Observe that these formally coincide with (3.3) except that Ω can be any
space of objects and need not be IR^k.
For the probability measure, it is required that
(i) P(F) ≥ 0 ∀F ∈ F and P(Ω) = 1;
(ii) if F_i ∈ F, i = 1, 2, · · ·, and F_i ∩ F_j = ∅ for i ≠ j, then
P(∪_{i=1}^{∞} F_i) = Σ_{i=1}^{∞} P(F_i).
The only difference with (3.4) is that P is bounded to P (F ) ≤ 1 ∀F ∈ F ,
whereas µ is unbounded on IRk . The triple (Ω, F, P ) with the above properties
is called a probability space.
In addition, in probability theory we find random variables and random
vectors ξ̃. With A the collection of naturally measurable sets in IR^k, a random
vector is a function (i.e. a single-valued mapping)

ξ̃ : Ω −→ IR^k such that, for all A ∈ A, ξ̃^{−1}[A] := {ω | ξ̃(ω) ∈ A} ∈ F.          (3.5)
This requires the "inverse" (with respect to the function ξ̃) of any measurable
set in IR^k to be an event in Ω.
Observe that a random vector ξ̃ : Ω −→ IR^k induces a probability measure P_ξ̃
on A according to

P_ξ̃(A) = P({ω | ξ̃(ω) ∈ A}) ∀A ∈ A.
Example 1.2 At a market hall for the fruit trade you find a particular species
of apples. These apples are traded in certain lots (e.g. of 1000 lb). Buying a lot
involves some risk with respect to the quality of apples contained in it. What
does “quality” mean in this context? Obviously quality is a conglomerate of
criteria described in terms like size, ripeness, flavour, colour and appearance.
Some of the criteria can be expressed through quantitative measurement, while
others cannot (they have to be judged upon by experts). Hence the set Ω of
all possible “qualities” cannot as such be represented as a subset of some IRk .
Having bought a lot, the trader has to sort his apples according to their
"outcomes" (i.e. qualities), which could fall into "events" like "unusable"
(e.g. rotten or too unripe), "cooking apples" and "low (medium, high) quality
eatable apples". Having sorted out the "unusable" and the "cooking apples",
for the remaining apples experts could be asked to judge on ripeness, flavour,
colour and appearance, by assigning real values between 0 and 1 to parameters
r, f, c and a respectively, corresponding to the “degree (or percentage) of
achieving” the particular criterion.
Now we can construct a scalar value for any particular outcome (quality)
ω, for instance as

ṽ(ω) := 0 if ω ∈ "unusable",
        1/2 if ω ∈ "cooking apples",
        (1 + r)(1 + f)(1 + c)(1 + a) otherwise.

Obviously ṽ has the range ṽ[Ω] = {0} ∪ {1/2} ∪ [1, 16]. Denoting the events
"unusable" by U and "cooking apples" by C, we may define the collection F
“unusable” by U and “cooking apples” by C, we may define the collection F
of events as follows. With G denoting the family of all subsets of Ω − (U ∪ C)
let F contain all unions of U, C, ∅ or Ω with any element of G. Assume that
after long series of observations we have a good estimate for the probabilities
P (A), A ∈ F.
According to our scale, we could classify the apples as
• eatable, and
  – 1st class for ṽ(ω) ∈ [12, 16] (high selling price),
  – 2nd class for ṽ(ω) ∈ [8, 12) (medium price),
  – 3rd class for ṽ(ω) ∈ [1, 8) (low price);
• good for cooking for ṽ(ω) = 1/2 (cheap);
using the fact that ṽ is single-valued and that [1, 8) and {1/2}, and hence
ṽ^{−1}[[1, 8)] and ṽ^{−1}[{1/2}] = C, are disjoint. For an illustration, see Figure 9. □
χ_{A_i}(x) = 1 if x ∈ A_i, and 0 otherwise.

Then the integral ∫_B ϕ(x)dµ is defined as

∫_B ϕ(x)dµ = Σ_{i=1}^{r} c_i µ(A_i).          (3.6)
In Figure 10 the integral would result by accumulating the shaded areas with
their respective signs as indicated.
Observe that the sum (or difference) of simple functions ϕ1 and ϕ2 is again
a simple function and that

∫_B [ϕ1(x) + ϕ2(x)]dµ = ∫_B ϕ1(x)dµ + ∫_B ϕ2(x)dµ,

and, moreover,

|∫_B ϕ(x)dµ| ≤ ∫_B |ϕ(x)|dµ.
then the integral ∫_B ψ(x)dµ is defined by

∫_B ψ(x)dµ = lim_{n→∞} ∫_B ϕ_n(x)dµ.
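As a numerical illustration of this limit construction (our example, not the book's): approximating ψ(x) = x² on B = [0, 1) by simple functions that are constant on n subintervals, the sums from (3.6) converge to ∫_B ψ dµ = 1/3.

def simple_integral(psi, a, b, n):
    """Integral of a simple approximation of psi on [a, b):
    phi_n is constant on the n half-open subintervals A_i = [x_i, x_{i+1}),
    taking the value c_i = psi(x_i); the integral is sum_i c_i * mu(A_i)."""
    h = (b - a) / n
    return sum(psi(a + i * h) * h for i in range(n))

psi = lambda x: x * x
for n in (10, 100, 1000):
    print(n, simple_integral(psi, 0.0, 1.0, n))   # tends to 1/3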
If there exists a function f_ξ̃ : Ξ −→ IR such that the distribution function can
be represented by an integral with respect to the natural measure µ as

F_ξ̃(x̂) = ∫_{x≤x̂} f_ξ̃(x)dµ, x̂ ∈ IR^k,

then f_ξ̃ is called the density function of P. In this case the distribution function
is called of continuous type. It follows that for any event A ∈ F we have
P(A) = ∫_A f_ξ̃(x)dµ. This implies in particular that for any A ∈ F such that
µ(A) = 0 also P(A) = 0 has to hold. This fact is referred to by saying that
the probability measure P is absolutely continuous with respect to the natural
measure µ. It can be shown that the reverse statement is also true: given a
probability space (Ξ, F, P) in IR^k with P absolutely continuous with respect
to µ (i.e. every event A ∈ F with the natural measure µ(A) = 0 also has a
probability of zero), there exists a density function f_ξ̃ for P.
the ith constraint of (3.1) is violated if and only if g_i⁺(x, ξ) > 0 for a given
decision x and realization ξ of ξ̃. Hence we could provide for each constraint a
recourse or second-stage activity y_i(ξ) that, after observing the realization ξ,
is chosen such as to compensate its constraint's violation—if there is one—by
satisfying g_i(x, ξ) − y_i(ξ) ≤ 0. This extra effort is assumed to cause an extra
cost or penalty of q_i per unit, i.e. our additional costs (called the recourse
function) amount to

Q(x, ξ) = min_y { Σ_{i=1}^{m} q_i y_i(ξ) | y_i(ξ) ≥ g_i⁺(x, ξ), i = 1, · · · , m },          (3.7)
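Note that, since each y_i(ξ) enters (3.7) only through the constraint y_i(ξ) ≥ g_i⁺(x, ξ) and through its unit cost q_i, the minimum is attained at y_i(ξ) = g_i⁺(x, ξ), so that Q(x, ξ) = Σ_i q_i max{0, g_i(x, ξ)}. A two-line sketch of this evaluation (generic names, assuming nonnegative penalties q_i):

def recourse_cost(q, g_values):
    """Q(x, xi) = sum_i q_i * max(0, g_i(x, xi)) for the penalty model (3.7)."""
    return sum(qi * max(0.0, gi) for qi, gi in zip(q, g_values))

# example: q = (7, 12), g(x, xi) = (-3.0, 2.5) -> only the second constraint
# is violated, so Q = 12 * 2.5 = 30
print(recourse_cost([7.0, 12.0], [-3.0, 2.5]))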
and we have to decide on xτ such that the constraint(s) (with vector valued
constraint functions gτ )
gτ (x0 , · · · , xτ , ξ1 , · · · , ξτ ) ≤ 0
are satisfied, which—as stated—at this stage can only be achieved by the
proper choice of xτ , based on the knowledge of the previous decisions and
realizations. Hence, assuming a cost function qτ (xτ ), at stage τ ≥ 1 we have
a recourse function
indicating that the optimal recourse action x̂τ at time τ depends on the
previous decisions and the realizations observed until stage τ , i.e.
Hence, taking into account the multiple stages, we get as total costs for the
multistage problem

f_0(x_0, ξ_1, · · · , ξ_K) = g_0(x_0) + Σ_{τ=1}^{K} Q_τ(x_0, x̂_1, · · · , x̂_{τ−1}, ξ_1, · · · , ξ_τ).          (3.12)
X = {x ∈ IRn | Ax = b, x ≥ 0},
where the fi are constructed from the objective and the constraints in (3.1)
or (3.14) respectively. So far, f0 represented the total costs (see (3.8) or (3.12))
and f1 , · · · , fm̄ could be used to describe the first-stage feasible set X.
However, depending on the way the functions fi are derived from the problem
functions gj in (3.1), this general formulation also includes other types of
deterministic equivalents for the stochastic program (3.1).
To give just two examples showing how other deterministic equivalent
problems for (3.1) may be generated, let us choose first α ∈ [0, 1] and define
a “payoff” function for all constraints as
½
1 − α if gi (x, ξ) ≤ 0, i = 1, · · · , m,
ϕ(x, ξ) :=
−α otherwise.
Consequently, for x infeasible at ξ we have an absolute loss of α, whereas for
x feasible at ξ we have a return of 1 − α. It seems natural to aim for decisions
on x that, at least in the mean (i.e. on average), avoid an absolute loss. This
is equivalent to the requirement
Z
˜
Eξ̃ ϕ(x, ξ) = ϕ(x, ξ)dP ≥ 0.
Ξ
where, with the vector-valued function g(x, ξ) = (g1(x, ξ), · · · , gm(x, ξ))^T,

E_ξ̃ f1(x, ξ̃) = ∫_Ξ f1(x, ξ)dP
             = ∫_{g(x,ξ)≤0} (α − 1)dP + ∫_{g(x,ξ)≰0} α dP
If, in particular, we have that the functions g_i(x, ξ) are linear in x, and if
furthermore the set X is convex polyhedral, i.e. we have the stochastic linear
program

"min" c^T(ξ̃)x
s.t.  Ax = b,
      T(ξ̃)x ≥ h(ξ̃),
      x ≥ 0,

then problems (3.21) and (3.22) become

min_{x∈X} E_ξ̃ c^T(ξ̃)x
s.t. P({ξ | T(ξ)x ≥ h(ξ)}) ≥ α,          (3.23)

and, with T_i(·) and h_i(·) denoting the ith row and ith component of T(·) and
h(·) respectively,

min_{x∈X} E_ξ̃ c^T(ξ̃)x
s.t. P({ξ | T_i(ξ)x ≥ h_i(ξ)}) ≥ α_i, i = 1, · · · , m,          (3.24)

the stochastic linear programs with joint and with single chance constraints
respectively.
Obviously there are many other possibilities to generate types of
deterministic equivalents for (3.1) by constructing the f_i in different ways
out of the objective and the constraints of (3.1).
Formally, all the problems derived, i.e. all the above deterministic
equivalents, are mathematical programs. The first question is whether, or
under which assumptions, they have properties like convexity and smoothness,
such that we have a reasonable chance to deal with them computationally
using the toolkit of mathematical programming methods.
Convexity may be shown easily for the recourse problem (3.11) under rather
mild assumptions (given the integrability of g0 + Q).
Proposition 1.1 If g0 (·, ξ) and Q(·, ξ) are convex in x ∀ξ ∈ Ξ, and if X is
a convex set, then (3.11) is a convex program.
Proof For x̂, x̄ ∈ X, λ ∈ (0, 1) and x̌ := λx̂ + (1 − λ)x̄ we have
g0(x̌, ξ) + Q(x̌, ξ) ≤ λ[g0(x̂, ξ) + Q(x̂, ξ)] + (1 − λ)[g0(x̄, ξ) + Q(x̄, ξ)] ∀ξ ∈ Ξ,

implying

E_ξ̃{g0(x̌, ξ̃) + Q(x̌, ξ̃)} ≤ λE_ξ̃{g0(x̂, ξ̃) + Q(x̂, ξ̃)} + (1 − λ)E_ξ̃{g0(x̄, ξ̃) + Q(x̄, ξ̃)}.
Remark 1.1 Observe that for Y = IR^n₊ the convexity of Q(·, ξ) can
immediately be asserted for the linear case (3.16), and that it also holds for the
nonlinear case (3.10) if the functions q(·) and g_i(·, ξ) are convex and the H_i(·)
are concave. Just to sketch the argument, assume that ȳ and y̌ solve (3.10)
for x̄ and x̌ respectively, at some realization ξ ∈ Ξ. Then, by the convexity of
g_i and the concavity of H_i, i = 1, · · · , m, we have, for any λ ∈ (0, 1),
Smoothness (i.e. partial differentiability of Q(x) = ∫_Ξ Q(x, ξ)dP) of
recourse problems may also be asserted under fairly general conditions. For
example, suppose that ϕ : IR² −→ IR, so that ϕ(x, y) ∈ IR. Recalling that ϕ is
partially differentiable at some point (x̂, ŷ) with respect to x, this means that
there exists a function, called the partial derivative and denoted by ∂ϕ(x, y)/∂x,
such that

[ϕ(x̂ + h, ŷ) − ϕ(x̂, ŷ)]/h = ∂ϕ(x̂, ŷ)/∂x + r(x̂, ŷ; h)/h,
[Q(x̂ + he_j, ξ̂) − Q(x̂, ξ̂)]/h = ∂Q(x̂, ξ̂)/∂x_j + ρ_j(x̂, ξ̂; h)/h

with

ρ_j(x̂, ξ̂; h)/h −→ 0 as h → 0,

where e_j is the jth unit vector. The vector (∂Q(x, ξ)/∂x_1, · · · , ∂Q(x, ξ)/∂x_n)^T
is called the gradient of Q(x, ξ) with respect to x and is denoted by ∇_x Q(x, ξ).
Now we are not only interested in the partial differentiability of the recourse
function Q(x, ξ) but also in that of the expected recourse function Q(x). Provided
that Q(x, ξ̃) is partially differentiable at x̂ a.s., we get

[Q(x̂ + he_j) − Q(x̂)]/h = ∫_Ξ [Q(x̂ + he_j, ξ) − Q(x̂, ξ)]/h dP
                        = ∫_{Ξ−N_δ} [∂Q(x̂, ξ)/∂x_j + ρ_j(x̂, ξ; h)/h] dP
                        = ∫_{Ξ−N_δ} ∂Q(x̂, ξ)/∂x_j dP + ∫_{Ξ−N_δ} ρ_j(x̂, ξ; h)/h dP,
Remark 1.2 In the linear case (3.16) with complete fixed recourse it is known
from linear programming (see Section 1.6) that the optimal value function
Q(x, ξ) is continuous and piecewise linear in h(ξ) − T(ξ)x. In other words,
there exist finitely many convex polyhedral cones B_l ⊂ IR^{m1} with nonempty
interiors such that any two of them have at most boundary points in common
and ∪_l B_l = IR^{m1}, and Q(x, ξ) is given as Q(x, ξ) = d_l^T(h(ξ) − T(ξ)x) + δ_l
for h(ξ) − T(ξ)x ∈ B_l. Then, for h(ξ) − T(ξ)x ∈ int B_l (i.e. for h(ξ) − T(ξ)x
an interior point of B_l), the function Q(x, ξ) is partially differentiable with
respect to any component of x. Hence for the gradient with respect to x we
get from the chain rule that ∇_x Q(x, ξ) = −T^T(ξ)d_l for h(ξ) − T(ξ)x ∈ int B_l.
Assume for simplicity that Ξ is a bounded interval in IR^k and keep x fixed.
Then, by (3.15), we have a linear affine mapping ψ(·) := h(·) − T(·)x : Ξ −→
IR^{m1}. Therefore the sets
C^l z ≤ 0,

i.e. for any fixed j there exists a τ̂_{lj} > 0 such that

or, equivalently,

e = (1, · · · , 1)^T. This implies that for γ := max_{ξ∈Ξ} γ(ξ) there exists a
t_0 > 0 such that

C^l[h(ξ) − T(ξ)x] ≤ −|t|γe ∀|t| < t_0

(choose, for example, t_0 = t_l/γ). In other words, there exists a t_0 > 0 such
that

D_l(x; t) := {ξ | C^l[h(ξ) − T(ξ)x] ≤ −|t|γe} ≠ ∅ ∀|t| < t_0,

and obviously D_l(x; t) ⊂ D_l(x). Furthermore, by elementary geometry, the
natural measure µ satisfies
and, since

|∫_{D_l(x)−D_l(x;t)} [Q(x + te_j, ξ) − Q(x, ξ)]/t · ϕ(ξ)dµ| ≤ β max_{ξ∈Ξ} ϕ(ξ) |t| v −→ 0 as t → 0,
∇E_ξ̃ Q(x, ξ̃) = Σ_l ∫_{D_l(x)} ∇_x Q(x, ξ) ϕ(ξ)dξ
             = Σ_l ∫_{D_l(x)} −T^T(ξ) d_l ϕ(ξ)dξ.
Hence for the linear case—observing (3.15)—we get the differentiability
statement of Proposition 1.2 provided that (4.1) is satisfied and P has a
continuous density on Ξ. □
c^T      q^T    q^T    · · ·    q^T
T(ξ^1)   W                              h(ξ^1)
T(ξ^2)          W                       h(ξ^2)
  ⋮                      ⋱                ⋮
T(ξ^r)                          W       h(ξ^r)
From the definition of a support, it follows that x ∈ IRn allows for a feasible
solution of the second-stage program for all ξ ∈ Ξ if and only if this is true
for all ξ j , j = 1, · · · , r. In other words, the induced first-stage feasibility set
K is given as

K = {x | ∃ y^j ≥ 0 : T(ξ^j)x + W y^j = h(ξ^j), j = 1, · · · , r}.
From this formulation of K (which obviously also holds if ξ̃ has a finite discrete
distribution, i.e. Ξ = {ξ^1, · · · , ξ^r}), we evidently get the following.

Proposition 1.3 If the support Ξ of the distribution of ξ̃ is either a finite
set or a (bounded) convex polyhedron, then the induced first-stage feasibility
set K is a convex polyhedral set. The first-stage decisions are restricted to
x ∈ X ∩ K.
Example 1.3 Consider the following first-stage feasible set:
and a random vector ξ̃ with the support Ξ = [4, 19] × [13, 21]. Then the
constraints to be satisfied for all ξ ∈ Ξ are

W y = ξ − T x, y ≥ 0.

Since these inequalities have to be satisfied for all ξ ∈ Ξ, choosing the minimal
right-hand sides (for ξ ∈ Ξ) yields the induced constraints. The first-stage
feasible set X together with the induced feasible set are illustrated in
Figure 16. □
It might happen that X ∩ K = ∅; then we should check our model very
carefully to figure out whether we really modelled what we had in mind or
whether we can find further possibilities for compensation that are not yet
contained in our model. On the other hand, we have already mentioned the
case of a complete fixed recourse matrix (see (3.17) on page 28), for which
K = IR^n and therefore the problem of induced constraints does not exist.
Hence it seems interesting to recognize complete recourse matrices.

Proposition 1.4 An m1 × n matrix W is a complete recourse matrix iff

{z | z = W y, y ≥ 0} = IR^{m1}.
With

y_i = y̌_i + 1 for i = 1, · · · , m1, and y_i = y̌_i for i > m1,

this implies that the constraints (4.4) are necessarily feasible.
To show that the above conditions are also sufficient for complete recourse,
let us choose an arbitrary z̄ ∈ IR^{m1}. Since the columns W_1, · · · , W_{m1} are
linearly independent, the system of linear equations

Σ_{i=1}^{m1} W_i y_i = z̄

has a unique solution, which—after adding a sufficiently large multiple of the
strictly positive solution y̌ of the homogeneous system from above—yields a y
that solves W y = z̄, y ≥ 0. □
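Computationally, completeness of W, i.e. {z | z = Wy, y ≥ 0} = IR^{m1}, can also be tested with a sequence of LP feasibility problems: the cone {Wy | y ≥ 0} is convex, so it equals IR^{m1} iff it contains ±e_i for every unit vector e_i. A sketch of such a test (our own device, not a method proposed in the text):

import numpy as np
from scipy.optimize import linprog

def is_complete_recourse(W):
    """Check pos(W) = IR^{m1} by testing W y = +-e_i, y >= 0 for all i."""
    m1, n = W.shape
    for i in range(m1):
        for sign in (1.0, -1.0):
            e = np.zeros(m1); e[i] = sign
            res = linprog(np.zeros(n), A_eq=W, b_eq=e,
                          bounds=[(0, None)] * n)
            if not res.success:        # some +-e_i is not in pos(W)
                return False
    return True

W = np.array([[1.0, -1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, -1.0]])   # simple recourse matrix: complete
print(is_complete_recourse(W))          # True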
is the union of all those vectors x feasible according to (5.2), and consequently
may be rewritten as

B(α) = ∪_{G∈G} ∩_{ξ∈G} {x | g(x, ξ) ≤ 0}.          (5.3)
Example 1.4 Assume that in our refinery problem (2.1) the demands are
random with the following discrete joint distribution:

P(h1(ξ^1) = 160, h2(ξ^1) = 135) = 0.85,
P(h1(ξ^2) = 150, h2(ξ^2) = 195) = 0.08,
P(h1(ξ^3) = 200, h2(ξ^3) = 120) = 0.07.
It follows that the feasible set for the above constraints is nonconvex, as shown
in Figure 17. □
On the other hand, F_ξ̃ being a quasi-concave function does not in general imply
that the corresponding probability measure P is quasi-concave. For instance,
in IR¹ every monotone function is easily seen to be quasi-concave, so that
every distribution function of a random variable (always being monotonically
increasing) is quasi-concave. But not every probability measure P on IR is
quasi-concave (see Figure 19 for a counterexample).
Hence we stay with the question of when a probability measure—or its
distribution function—is quasi-concave. This question was answered first for
the subclass of log-concave probability measures, i.e. measures satisfying

P(λS1 + (1 − λ)S2) ≥ [P(S1)]^λ [P(S2)]^{1−λ}

for all convex S_i ∈ F and λ ∈ [0, 1]. That the class of log-concave measures is
really a subclass of the class of quasi-concave measures is easily seen.
Lemma 1.2 If P is a log-concave measure on F then P is quasi-concave.

Proof Let S_i ∈ F, i = 1, 2, be convex sets such that P(S_i) > 0, i = 1, 2
(otherwise there is nothing to prove, since P(S) ≥ 0 ∀S ∈ F). By assumption,
for any λ ∈ (0, 1) we have

P(λS1 + (1 − λ)S2) ≥ [P(S1)]^λ [P(S2)]^{1−λ} ≥ [min{P(S1), P(S2)}]^λ [min{P(S1), P(S2)}]^{1−λ},

and hence

P(λS1 + (1 − λ)S2) ≥ min[P(S1), P(S2)]. □
The proof has to be omitted here, since it would require a rather advanced
knowledge of measure theory.
Proof Consider any sequence {x^ν} such that x^ν −→ x̂ and x^ν ∈ B(α) ∀ν.
To prove the assertion, we have to show that x̂ ∈ B(α). Define A(x) := {ξ |
g(x, ξ) ≤ 0}. Let V_k be the open ball with center x̂ and radius 1/k. Then we
show first that

A(x̂) = ∩_{k=1}^{∞} cl ∪_{x∈V_k} A(x).          (5.4)
Here the inclusion "⊂" is obvious since x̂ ∈ V_k ∀k, so we have only to show
that

A(x̂) ⊃ ∩_{k=1}^{∞} cl ∪_{x∈V_k} A(x).

Assume that ξ̂ ∈ ∩_{k=1}^{∞} cl ∪_{x∈V_k} A(x). This means that for every k we have
ξ̂ ∈ cl ∪_{x∈V_k} A(x); in other words, for every k there exists a ξ^k ∈ ∪_{x∈V_k} A(x)
and hence some x^k ∈ V_k with ξ^k ∈ A(x^k) such that ‖ξ^k − ξ̂‖ ≤ 1/k (and
obviously ‖x^k − x̂‖ ≤ 1/k since x^k ∈ V_k). Hence (x^k, ξ^k) −→ (x̂, ξ̂). Since
ξ^k ∈ A(x^k), we have g(x^k, ξ^k) ≤ 0 ∀k and therefore, by the continuity of g(·, ·),
ξ̂ ∈ A(x̂), which proves (5.4) to be true.
The sequence of sets

B_K := ∩_{k=1}^{K} cl ∪_{x∈V_k} A(x)
For stochastic programs with joint chance constraints the situation appears
to be more difficult than for stochastic programs with recourse. But, at least
under certain additional assumptions, we may assert convexity and closedness
of the feasible sets as well (Proposition 1.5, Remark 1.3 and Proposition 1.7).
For stochastic linear programs with single chance constraints, convexity
statements have been derived without the joint convexity assumption on
g_i(x, ξ) := h_i(ξ) − T_i(ξ)x, for special distributions and special intervals for
the values of α_i. In particular, if T_i(ξ) ≡ T_i (constant), the situation becomes
rather convenient: with F_i the distribution function of h_i(ξ̃), we have

P({ξ | T_i x ≥ h_i(ξ)}) = F_i(T_i x) ≥ α_i,

or equivalently

T_i x ≥ F_i^{−1}(α_i),

where F_i^{−1}(α_i) is assumed to be the smallest real value η such that F_i(η) ≥ α_i.
Hence in this special case any single chance constraint turns out to be just a
linear constraint, and the only additional work to do is to compute F_i^{−1}(α_i).
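For instance, if h_i(ξ̃) is normally distributed, then F_i^{−1}(α_i) is just a normal quantile and the single chance constraint collapses to the linear constraint T_i x ≥ F_i^{−1}(α_i). A sketch using scipy.stats, with illustrative values for the mean, standard deviation and level (not taken from the text):

from scipy.stats import norm

m, sigma, alpha = 180.0, 12.0, 0.95         # hypothetical h_i ~ N(180, 12^2)
rhs = norm.ppf(alpha, loc=m, scale=sigma)   # F_i^{-1}(alpha_i)
print(rhs)   # T_i x >= rhs is the deterministic linear equivalent
# here: 180 + 12 * 1.6449 ~ 199.74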
where the vectors c ∈ IR^n, b ∈ IR^m and the m × n matrix A are given,
and x ∈ IR^n is to be determined. Any other LP formulation can easily be
transformed to assume the form (6.1). If, for instance, we have the problem

min c^T x
s.t. Ax ≥ b,
     x ≥ 0,

then, introducing a vector y of surplus variables, we get the equivalent problem

min c^T x
s.t. Ax − y = b,
     x ≥ 0,
     y ≥ 0,
which is of the form (6.1). This LP is equivalent to (6.1) in the sense that
the x part of its solution set and the solution set of (6.1) as well as the two
optimal values obviously coincide. Instead, we may have the problem

min c^T x
s.t. Ax ≥ b,

where x is not required to be nonnegative. Writing x as the difference of two
nonnegative vectors, x = z⁺ − z⁻, and introducing surplus variables y again,
we get

min{c^T z⁺ − c^T z⁻}
s.t. Az⁺ − Az⁻ − y = b,
     z⁺ ≥ 0,
     z⁻ ≥ 0,
     y ≥ 0,

which is again of the form (6.1). Furthermore, it is easily seen that this
transformed LP and its original formulation are equivalent in the sense that
• given any solution (ẑ⁺, ẑ⁻, ŷ) of the transformed LP, x̂ := ẑ⁺ − ẑ⁻ is a
solution of the original version, and
• given any solution x̌ of the original LP, the vectors y̌ := Ax̌ − b and
ž⁺, ž⁻ ∈ IR^n₊, chosen such that ž⁺ − ž⁻ = x̌, solve the transformed version,
and the optimal values of both versions of the LP coincide.
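Both transformations are mechanical, so they are easily automated. The sketch below (function and variable names are our own) converts min{c^T x | Ax ≥ b} with sign-unrestricted x into the standard form (6.1) over u = (z⁺, z⁻, y):

import numpy as np

def to_standard_form(c, A, b):
    """min c^T x s.t. Ax >= b, x free  -->  min c'^T u s.t. A'u = b', u >= 0,
    with u = (z_plus, z_minus, y), x = z_plus - z_minus and y = Ax - b."""
    m, n = A.shape
    c_std = np.concatenate([c, -c, np.zeros(m)])
    A_std = np.hstack([A, -A, -np.eye(m)])
    return c_std, A_std, b.copy()

# tiny usage example
c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0], [1.0, -1.0]])
b = np.array([1.0, 0.0])
c_std, A_std, b_std = to_standard_form(c, A, b)
print(A_std.shape)   # (2, 6): z_plus, z_minus and two surplus variables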
is satisfied. Given this condition, it may happen that rk(A) < m, but then
we may drop one or more equations from the system without changing its
solution set. Therefore we assume throughout this section that
rk(A) = m, (6.3)
B := {x | Ax = b, x ≥ 0}
⁷ According to this definition, for I(x̂) = ∅, i.e. x̂ = 0 and hence b = 0, it follows that
x̂ is a feasible basic solution as well.
In general, the set I(x̂) and hence also the column set {A_i | i ∈ I(x̂)} may
have fewer than m elements, which can cause some inconvenience—at least in
formulating the statements we want to present.

Proposition 1.8 Given assumption (6.3), for any basic solution x̂ of B there
exists at least one index set I_B(x̂) ⊃ I(x̂) such that the corresponding column
set {A_i | i ∈ I_B(x̂)} is a basis of IR^m. The components x̂_i, i ∈ I_B(x̂), of
x̂ uniquely solve the linear system Σ_{i∈I_B(x̂)} A_i x_i = b with the nonsingular
matrix (A_i | i ∈ I_B(x̂)).
Proof Assume that x̌ ∈ B is a basic solution and that {A_i | i ∈ I(x̌)}
contains k columns, k < m, of A. By (6.3), there exists at least one index
set J_m ⊂ {1, · · · , n} with m elements such that the columns {A_i | i ∈ J_m}
are linearly independent and hence form a basis of IR^m. A standard result in
linear algebra asserts that, given a basis of an m-dimensional vector space and
a linearly independent subset of k < m vectors, it is possible, by adding m − k
properly chosen vectors from the basis, to complement the subset to become
a basis itself. Hence in our case it is possible to choose m − k indices from J_m
and to add them to I(x̌), yielding I_B(x̌) such that {A_i | i ∈ I_B(x̌)} is a basis
of IR^m. □
B = (A_i | i ∈ I_B(x̂))

Bx^B + Nx^{NB} = b,

or equivalently as

x^B = B^{−1}b − B^{−1}Nx^{NB},          (6.5)

which—using the assignment (6.4)—yields for any choice of the nonbasic
variables x^{NB} a solution of our system Ax = b, and in particular for
x^{NB} = 0 reproduces our feasible basic solution x̂.
Proposition 1.9 If B ≠ ∅ then there exists at least one feasible basic solution.

Proof Let x̂ ∈ B, i.e.

Ax̂ = b, x̂ ≥ 0.

If for I(x̂) = {i | x̂_i > 0} the column set {A_i | i ∈ I(x̂)} is linearly dependent,
then the linear homogeneous system of equations

Σ_{i∈I(x̂)} A_i y_i = 0,
y_i = 0, i ∉ I(x̂),

has a solution y̌ ≠ 0 with y̌_i < 0 for at least one i ∈ I(x̂)—if this does not
hold for y̌, we could take −y̌, which solves the above homogeneous system as
well. Hence for

λ̄ := max{λ | x̂ + λy̌ ≥ 0}

we have 0 < λ̄ < ∞. Since Ay̌ = 0 obviously holds for y̌, it follows—observing
the definition of λ̄—that for z := x̂ + λ̄y̌

Az = Ax̂ + λ̄Ay̌ = b, z ≥ 0,

i.e. z ∈ B, and I(z) ⊂ I(x̂), I(z) ≠ I(x̂), such that we have "reduced" our
original feasible solution x̂ to another one with fewer positive components.
Now either z is a basic solution or we repeat the above "reduction" with
x̂ := z. Obviously only finitely many reductions of the number of positive
components in feasible solutions are possible. Hence we have to end up—
after finitely many of these steps—with a feasible basic solution. □
has a solution ỹ ≠ 0, for which at least one component is strictly negative and
another is strictly positive, since otherwise we could assume ỹ ≥ 0, ỹ ≠ 0,
to solve the homogeneous system Ay = 0, implying that x̂ + λỹ ∈ B ∀λ ≥ 0,
which, according to the inequality ‖x̂ + λỹ‖ ≥ λ‖ỹ‖ − ‖x̂‖, contradicts the
assumed boundedness of B. Hence we find for

α := max{λ | x̂ + λỹ ≥ 0},
β := min{λ | x̂ + λỹ ≥ 0}

that 0 < α < ∞ and 0 > β > −∞. Defining v := x̂ + αỹ and w := x̂ + βỹ,
we have v, w ∈ B and—by the definitions of α and β—that |I(v)| ≤ k
and |I(w)| ≤ k, such that, according to our induction assumption, with
{x^{i}, i = 1, · · · , r} the set of all feasible basic solutions, v = Σ_{i=1}^{r} λ_i x^{i},
where Σ_{i=1}^{r} λ_i = 1, λ_i ≥ 0 ∀i, and w = Σ_{i=1}^{r} µ_i x^{i}, where
Σ_{i=1}^{r} µ_i = 1, µ_i ≥ 0 ∀i. As is easily checked, we have x̂ = ρv + (1 − ρ)w with
ρ = −β/(α − β) ∈ (0, 1). This implies immediately that x̂ is a convex linear
combination of {x^{i}, i = 1, · · · , r}. □
The convex hull of finitely many points {x^{1}, · · · , x^{r}}, formally denoted
by conv{x^{1}, · · · , x^{r}}, is called a convex polyhedron or a bounded convex
polyhedral set (see Figure 20). Take for instance in IR² the points z¹ =
(2, 2), z² = (8, 1), z³ = (4, 3), z⁴ = (7, 7) and z⁵ = (1, 6). In Figure 21 we
have P̃ = conv{z¹, · · · , z⁵}, and it is obvious that z³ is not necessary to
generate P̃; in other words, P̃ = conv{z¹, z², z³, z⁴, z⁵} = conv{z¹, z², z⁴, z⁵}.
Hence we may drop z³ without any effect on the polyhedron P̃, whereas
omitting any other of the five points would essentially change the shape of
the polyhedron. The points that really count in the definition of a convex
polyhedron are its vertices (z¹, z², z⁴ and z⁵ in the example). Whereas in two-
or three-dimensional spaces we know by intuition what we mean by a vertex,
we need a formal definition for higher-dimensional cases: A vertex of a convex
polyhedron P is a point x̂ ∈ P such that the line segment connecting any two
points in P, both different from x̂, does not contain x̂. Formally,
It may be easily shown that for an LP with a bounded feasible set B the
feasible basic solutions x^{i}, i = 1, · · · , r, coincide with the vertices of B.
By Proposition 1.10, the feasible set of a linear program is a convex
polyhedron provided that B is bounded. Hence we have to find out under
what conditions B is bounded or unbounded respectively. For B 6= ∅ we have
seen already in the proof of Proposition 1.10 that the existence of a ỹ 6= 0 such
that Aỹ = 0, ỹ ≥ 0, would imply that B is unbounded. Therefore, for B to be
bounded, the condition {y | Ay = 0, y ≥ 0} = {0} is necessary. Moreover, we
have the following.
Proof Given the above observations, it is only left to show that the condition
{y | Ay = 0, y ≥ 0} = {0} is sufficient for the boundedness of B. Assume
in contrast that B is unbounded. This means that we have feasible solutions
arbitrarily large in norm. Hence for any natural number K there exists an
x^K ∈ B such that ‖x^K‖ ≥ K. Defining

z^K := x^K/‖x^K‖ ∀K,

we have

z^K ≥ 0, ‖z^K‖ = 1, Az^K = b/‖x^K‖ ∀K,          (6.6)

and hence

‖Az^K‖ ≤ ‖b‖/K
Proof Since for C = {0} the statement is trivial, we assume that C ≠ {0}.
For any arbitrary ŷ ∈ C such that ŷ ≠ 0, and hence Σ_{i=1}^{n} ŷ_i > 0, we have,
with µ := 1/Σ_{i=1}^{n} ŷ_i and ỹ := µŷ, that ỹ ∈ C̄ := {y | Ay = 0,
Σ_{i=1}^{n} y_i = 1, y ≥ 0}. Obviously C̄ ⊂ C and, owing to the constraints
Σ_{i=1}^{n} y_i = 1, y ≥ 0, the set C̄
{y | Ay = 0, y_i = 0, i ∉ I(x̂), y ≥ 0} = {0}

B_1 := {x | Ax = b, x_i = 0, i ∉ I(x̂), x ≥ 0}

Ax = b, x ≥ 0
polyhedral set. The set of boundary points again consists of different convex
polyhedral sets, namely sides (two-dimensional), edges (one-dimensional) and
vertices (zero-dimensional). The sides are called facets. In general, consider
an arbitrary convex polyhedral set B ⊂ IR^n. Without loss of generality,
assume that 0 ∈ B (if not, one could, for any fixed z ∈ B, consider the
translation B − {z}, which obviously contains the origin). The dimension of B,
dim B, is the smallest dimension of all linear spaces (in IR^n) containing B.
Therefore dim B ≤ n. For any linear subspace U ⊂ IR^n and any ẑ ∈ B the
intersection B_{ẑ,U} := [{ẑ} + U] ∩ B (if nonempty) is again a convex polyhedral
set. This set is called a facet if
B = {x | Ax = b, x ≥ 0} ≠ ∅          (6.8)

and

c^T y ≥ 0 ∀y ∈ C = {y | Ay = 0, y ≥ 0}.          (6.9)
Given that these two conditions are satisfied, there is at least one feasible basic
solution that is an optimal solution.
where {x^{1}, · · · , x^{r}} is the set of all feasible basic solutions in B and
{y^{1}, · · · , y^{s}} is a set of elements generating C, for instance as described
in Proposition 1.12. Hence solving
min cT x
s.t. Ax = b,
x≥0
The objective value of this latter program can be driven to −∞ if and only
if we have c^T y^{j} < 0 for at least one j ∈ {1, · · · , s}; otherwise, i.e. if
Observe that in general the solution of a linear program need not be unique.
Given the solvability conditions of Proposition 1.14 and the notation of its
proof, if c^T y^{j_0} = 0, we may choose µ_{j_0} > 0, and x^{i_0} + µ_{j_0} y^{j_0} is a solution
as well; and obviously it also may happen that min_{1≤i≤r} {c^T x^{i}} is attained
by more than just one feasible basic solution. In any case, if there is more
than one (different) solution for our linear program then there are infinitely
many, owing to the fact that, given the optimal value γ, the set Γ of optimal
solutions is characterized by the linear constraints

Ax = b,
c^T x ≤ γ,
x ≥ 0,

and therefore Γ is itself a convex polyhedral set.
|I(x^{i})| = m, i = 1, · · · , r,          (6.10)

i.e. that all feasible basic solutions are nondegenerate. For the case of
degenerate basic solutions, and the adjustments that might be necessary in
this case, the reader may consult the wide selection of books devoted to
linear programming in particular. Referring to our former presentation (6.5),
we have, owing to (6.10), that I_B(x̂) = I(x̂), and, with the basic part
B = (A_i | i ∈ I(x̂)) and the nonbasic part N = (A_i | i ∉ I(x̂)) of the matrix
A, the constraints of (6.1) may be rewritten—using the basic and nonbasic
variables as introduced in (6.4)—as

x^B = B^{−1}b − B^{−1}Nx^{NB},
x^B ≥ 0,          (6.11)
x^{NB} ≥ 0.
Obviously this system yields our feasible basic solution x̂ iff x^{NB} = 0, and then we have, by our assumption (6.10), that x^{B} = B^{−1} b > 0. Rearranging the components of c analogously to (6.4) into the two vectors

c_k^{B} = c_i, i the kth element of I(x̂), k = 1, · · · , m,
c_l^{NB} = c_i, i the lth element of {1, · · · , n} − I(x̂), l = 1, · · · , n − m,

owing to (6.11), the objective may now be expressed as a function of the nonbasic variables:

c^T x = (c^{B})^T x^{B} + (c^{NB})^T x^{NB}
      = (c^{B})^T B^{−1} b + [(c^{NB})^T − (c^{B})^T B^{−1} N] x^{NB}. (6.12)
This representation of the objective connected to the particular feasible basic solution x̂ implies the optimality condition for linear programming—the so-called simplex criterion.

Proposition 1.15 Under the assumption (6.10), the feasible basic solution resulting from (6.11) for x^{NB} = 0 is optimal iff

[(c^{NB})^T − (c^{B})^T B^{−1} N]^T ≥ 0. (6.13)

Proof By assumption (6.10), the feasible basic solution given by

x^{B} = B^{−1} b − B^{−1} N x^{NB}, x^{NB} = 0

satisfies x^{B} = B^{−1} b > 0. Therefore any nonbasic variable x_l^{NB} may be increased to some positive amount without violating the constraints x^{B} ≥ 0. Furthermore, increasing the nonbasic variables is the only feasible change applicable to them, owing to the constraints x^{NB} ≥ 0. From the objective presentation in (6.12), we see immediately that

c^T x̂ = (c^{B})^T B^{−1} b ≤ (c^{B})^T B^{−1} b + [(c^{NB})^T − (c^{B})^T B^{−1} N] x^{NB} ∀x^{NB} ≥ 0

iff [(c^{NB})^T − (c^{B})^T B^{−1} N]^T ≥ 0. □
Simplex method.

Step 1 Determine a feasible basis B = (A_i | i ∈ I_B) for (6.1) and N = (A_i | i ∉ I_B).

Step 2 If the simplex criterion (6.13) is satisfied then stop with

x^{B} = B^{−1} b, x^{NB} = 0

being an optimal solution; otherwise, there is some ρ ∈ {1, · · · , n − m} such that for the ρth component of [(c^{NB})^T − (c^{B})^T B^{−1} N]^T we have

[(c^{NB})^T − (c^{B})^T B^{−1} N]^T_ρ < 0,

and we increase the ρth nonbasic variable x_ρ^{NB}.
If increasing x_ρ^{NB} is not “blocked” by the constraints x^{B} ≥ 0, i.e. if x_ρ^{NB} → ∞ is feasible, then inf_B γ = −∞ such that our problem has no (finite) optimal solution.
If, on the other hand, increasing x_ρ^{NB} is “blocked” by one of the constraints x_i^{B} ≥ 0, i = 1, · · · , m, such that, for instance, for some µ ∈ {1, · · · , m} the basic variable x_µ^{B} is the first one to become x_µ^{B} = 0 while increasing x_ρ^{NB}, then go to step 3.

Step 3 Exchange the µth column of B with the ρth column of N, yielding new basic and nonbasic parts B̃ and Ñ of A such that B̃ contains N_ρ as its µth column and Ñ contains B_µ as its ρth column. Redefine B := B̃ and N := Ñ, rearrange x^{B}, x^{NB}, c^{B} and c^{NB} correspondingly, and then return to step 2.
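To make these steps concrete, the following minimal sketch (in Python with NumPy, used here purely for illustration; the function name simplex and all data names are ours, mirroring (6.1) and (6.11)-(6.13)) carries out the iteration under the nondegeneracy assumption (6.10), letting the most negative reduced cost enter. It is a didactic sketch, not a robust implementation; in particular, step 1 (finding an initial feasible basis) is assumed done.

import numpy as np

def simplex(A, b, c, basis, max_iter=100):
    m, n = A.shape
    basis = list(basis)
    for _ in range(max_iter):
        nonbasis = [j for j in range(n) if j not in basis]
        B, N = A[:, basis], A[:, nonbasis]
        x_B = np.linalg.solve(B, b)                   # x^B = B^{-1} b
        y = np.linalg.solve(B.T, c[basis])
        red = c[nonbasis] - N.T @ y                   # reduced costs, cf. (6.12)
        if np.all(red >= -1e-10):                     # simplex criterion (6.13)
            x = np.zeros(n)
            x[basis] = x_B
            return x, c @ x
        rho = int(np.argmin(red))                     # entering column N_rho
        d = np.linalg.solve(B, A[:, nonbasis[rho]])   # B^{-1} N_rho
        if np.all(d <= 1e-10):
            return None, -np.inf                      # not blocked: inf gamma = -inf
        ratios = [x_B[i] / d[i] if d[i] > 1e-10 else np.inf for i in range(m)]
        mu = int(np.argmin(ratios))                   # first basic variable to hit 0
        basis[mu] = nonbasis[rho]                     # step 3: exchange columns
    raise RuntimeError("iteration limit reached")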
Remark 1.5 The following comments on the single steps of the simplex method may be helpful for a better understanding of this procedure:

Step 1 Obviously we assume that B ≠ ∅. The existence of a feasible basis B follows from Propositions 1.9 and 1.8. Because of our assumption (6.10), we have B^{−1} b > 0.

Step 2 (a) If for a feasible basis B we have

[(c^{NB})^T − (c^{B})^T B^{−1} N]^T ≥ 0
Proposition 1.16 If the linear program (6.1) is feasible then the simplex method yields—under the assumption of nondegeneracy (6.10)—after finitely many steps either a solution or else the information that there is no finite solution, i.e. that inf_B γ = −∞.
Remark 1.6 In step 2 of the simplex method it may happen that the simplex criterion is not satisfied and that we discover that inf_B γ = −∞. It is worth mentioning that in this situation we may easily find a generating element of the cone C associated with B, as discussed in Proposition 1.12. With the above notation, we then have a feasible basis B, and for some column N_ρ ≠ 0 we have B^{−1} N_ρ ≤ 0. Then, with e = (1, · · · , 1)^T of appropriate dimension, for (ŷ^{B}, ŷ^{NB}) satisfying

ŷ^{B} = −B^{−1} N_ρ ŷ_ρ^{NB},
ŷ_ρ^{NB} = 1/(−e^T B^{−1} N_ρ + 1),
ŷ_l^{NB} = 0 for l ≠ ρ,
it follows that

B ŷ^{B} + N ŷ^{NB} = 0,
e^T ŷ^{B} + e^T ŷ^{NB} = −e^T B^{−1} N_ρ ŷ_ρ^{NB} + ŷ_ρ^{NB} = (−e^T B^{−1} N_ρ + 1) ŷ_ρ^{NB} = 1,
ŷ^{B} ≥ 0, ŷ^{NB} ≥ 0.

Finally, with v = B^{−1} N_ρ ≤ 0, and hence 1 − e^T v ≥ 1, we have

rk ( B_1 · · · B_m 0 ; 1 · · · 1 1 − e^T v ) = rk ( B_1 · · · B_m 0 ; 1 · · · 1 1 ) = rk ( B_1 · · · B_m 0 ; 0 · · · 0 1 ).

It follows that

( B_1 B_2 · · · B_m N_ρ ; 1 1 · · · 1 1 )

is a basis of IR^{m+1}. Hence (ŷ^{B}, ŷ^{NB}) is one of the generating elements of the convex polyhedral cone {(y^{B}, y^{NB}) | B y^{B} + N y^{NB} = 0, y^{B} ≥ 0, y^{NB} ≥ 0}, as derived in Proposition 1.12. □
program to the standard form (6.14), followed by the assignment of the linear program (6.15) as its dual. Let us just give some examples. The program

min c^T x s.t. Ax ≥ b, x ≥ 0

has, after the introduction of slack variables y, the standard form

min c^T x s.t. Ax − Iy = b, x ≥ 0, y ≥ 0,

and hence we obtain the dual pair

min c^T x s.t. Ax ≥ b, x ≥ 0;   max b^T u s.t. A^T u ≤ c, u ≥ 0. □
Analogously, the program

min c^T x s.t. Ax ≤ b, x ≥ 0

has the standard-form dual

max b^T u s.t. A^T u ≤ c, u ≤ 0,

which is (with v := −u) equivalent to

max (−b^T v) s.t. A^T v ≥ −c, v ≥ 0.

Therefore we now have the following pair of a primal and the corresponding dual program:

min c^T x s.t. Ax ≤ b, x ≥ 0;   max (−b^T v) s.t. A^T v ≥ −c, v ≥ 0. □
Finally, consider the program

max g^T x s.t. Dx ≤ f.

This program is of the same form as the dual of our standard linear program (6.14) and—using the fact that for any function ϕ defined on some set M we have sup_{x∈M} ϕ(x) = − inf_{x∈M} {−ϕ(x)}—its standard form is written as

− min (−g^T x^+ + g^T x^−) s.t. Dx^+ − Dx^− + Iy = f, x^+ ≥ 0, x^− ≥ 0, y ≥ 0,

with the dual program

− max f^T z s.t. D^T z ≤ −g, −D^T z ≤ g, Iz ≤ 0,

which is (with w := −z) equivalent to

min f^T w s.t. D^T w = g, w ≥ 0.

Hence, by comparison with our standard forms of the primal program (6.14) and the dual program (6.15), it follows that the dual of the dual is the primal program. □
There are close relations between a primal linear program and its dual
program. Let us denote the feasible set of the primal program (6.14) by B and
that of its dual program by D. Furthermore, let us introduce the convention
that

inf_{x∈B} c^T x = +∞ if B = ∅,   sup_{u∈D} b^T u = −∞ if D = ∅. (6.16)

Then we have as a first statement the following so-called weak duality theorem:

Proposition 1.17 For the primal linear program (6.14) and its dual (6.15)

inf_{x∈B} c^T x ≥ sup_{u∈D} b^T u.
6x_2 − 16x_3 = 6,

and hence

x_2 = 1 + (8/3) x_3,

which, on insertion into the first equation, yields

x_1 = (1/5)(2 − 3 − 8x_3 + 8x_3) = −1/5,

showing that the primal program is not feasible.

Looking at the dual constraints, we get from the second and third inequalities that

u_1 + u_2 ≤ 1,   u_1 + u_2 ≥ 2,

such that also the dual constraints do not allow a feasible solution. Hence, by our convention (6.16), we have for this dual pair

inf_{x∈B} c^T x = +∞ > −∞ = sup_{u∈D} b^T u.
However, the so-called duality gap in the above example does not occur
so long as at least one of the two problems is feasible, as is asserted by the
following strong duality theorem of linear programming.
Proposition 1.18 Consider the feasible sets B and D of the dual pair of linear programs (6.14) and (6.15) respectively. If either B ≠ ∅ or D ≠ ∅ then it follows that

inf_{x∈B} c^T x = sup_{u∈D} b^T u.

If one of these two problems is solvable then so is the other, and we have

min_{x∈B} c^T x = max_{u∈D} b^T u.
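As a quick numeric illustration of Propositions 1.17 and 1.18 (a sketch on made-up data, using SciPy's LP solver), the construction below makes the primal (6.14) feasible by the choice of b and the dual (6.15) feasible by taking c > 0 (so that u = 0 is dual feasible); both problems are then solvable and the optimal values must coincide:

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n = 3, 6
A = rng.standard_normal((m, n))
b = A @ rng.random(n)                      # primal feasible by construction
c = rng.random(n) + 0.1                    # c > 0: u = 0 is dual feasible

# primal (6.14): min c^T x  s.t. Ax = b, x >= 0
primal = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * n)
# dual (6.15):   max b^T u  s.t. A^T u <= c  (u free)
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=[(None, None)] * m)

print(primal.fun, -dual.fun)               # equal up to solver tolerance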
max b^T u s.t. B^T u ≤ c^{B}, N^T u ≤ c^{NB}.
{x | Ax = b, x ≥ 0} ≠ ∅

if and only if

A^T u ≥ 0 implies that b^T u ≥ 0.

Proof Assume that ũ satisfies A^T ũ ≥ 0 and that {x | Ax = b, x ≥ 0} ≠ ∅. Then let x̂ be a feasible solution, i.e. we have Ax̂ = b, x̂ ≥ 0, and hence

ũ^T b = (ũ^T A) x̂ ≥ 0,

since ũ^T A ≥ 0 and x̂ ≥ 0; i.e. A^T u ≥ 0 implies that b^T u ≥ 0.
min c^T x + q^T y
s.t. Ax = b,
     Tx + Wy = h,
     x ≥ 0, y ≥ 0. (6.17)
In addition, we assume that the problem is solvable and that the set {x |
Ax = b, x ≥ 0} is bounded. The above problem may be restated as
min{cT x + f (x)}
s.t. Ax = b,
x ≥ 0,
with
f (x) := min{q T y | W y = h − T x, y ≥ 0}.
Our recourse function f(x) is easily seen to be piecewise linear and convex. It is also immediate that the above problem can be replaced by the equivalent problem

min{c^T x + θ} s.t. Ax = b, θ − f(x) ≥ 0, x ≥ 0;
however, this would require that we know the function f (x) explicitly in
advance. This will not be the case in general. Therefore we may try to
construct a sequence of new (additional) linear constraints that can be used to define a monotonically decreasing feasible set B_1 of (n + 1)-vectors (x_1, · · · , x_n, θ)^T such that finally, with B_0 := {(x^T, θ)^T | Ax = b, x ≥ 0, θ ∈ IR}, the problem min_{(x,θ)∈B_0∩B_1} {c^T x + θ} yields a (first-stage) solution of our problem (6.17).
After these preparations, we may describe the following particular method.
and hence

ũ^T h ≤ ũ^T T x,

which has to hold for any feasible x, and obviously does not hold for x̂, since ũ^T (h − T x̂) > 0. Therefore we introduce the feasibility cut, cutting off the infeasible solution x̂:

ũ^T (h − T x) ≤ 0.

Then we redefine B_1 := B_1 ∩ {(x^T, θ)^T | ũ^T (h − T x) ≤ 0} and go on to step 3.
(b) Otherwise, if f(x̂) is finite, we have for the recourse problem (see the proof of Proposition 1.18) simultaneously—for x̂—a primal optimal basic solution ŷ and a dual optimal basic solution û. From the dual formulation of the recourse problem, it is evident that

f(x̂) = (h − T x̂)^T û,

whereas for any x we have

f(x) = sup{(h − T x)^T u | W^T u ≤ q} ≥ (h − T x)^T û = û^T (h − T x).

Hence any (x, θ) with θ ≥ f(x) has to satisfy the constraint θ ≥ û^T (h − T x), which is violated by (x̂^T, θ̂)^T iff (h − T x̂)^T û > θ̂; in this case we introduce the optimality cut (see Figure 25), cutting off the nonoptimal solution (x̂^T, θ̂)^T:

θ ≥ û^T (h − T x).

Correspondingly, we redefine B_1 := B_1 ∩ {(x^T, θ)^T | θ ≥ û^T (h − T x)} and continue with step 3; otherwise, i.e. if f(x̂) ≤ θ̂, we stop, with x̂ being an optimal first-stage solution.
Step 3 Solve the updated problem

min{c^T x + θ | (x^T, θ)^T ∈ B_0 ∩ B_1},

yielding the optimal solution (x̃^T, θ̃)^T. With (x̂^T, θ̂)^T := (x̃^T, θ̃)^T, we return to step 2.
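The following compact sketch (Python with SciPy; dual_decomposition and all other names are ours) implements this method for the complete-recourse case, where f(x) is finite for every feasible x, so that only optimality cuts arise and the feasibility cuts of step 2(a) can be omitted; theta0 is an assumed known lower bound, and f is evaluated through its dual f(x) = max{(h − Tx)^T u | W^T u ≤ q}:

import numpy as np
from scipy.optimize import linprog

def dual_decomposition(c, A, b, q, W, h, T, theta0=-1e6, tol=1e-7):
    n = len(c)
    cuts = []                                  # dual solutions u defining cuts
    A_eq = np.hstack([A, np.zeros((A.shape[0], 1))])   # variables (x, theta)
    while True:
        # step 3: master problem min c^T x + theta over B_0 and all cuts,
        # each cut theta >= u^T(h - Tx) written as -(T^T u)^T x - theta <= -u^T h
        A_ub = np.array([np.append(-(T.T @ u), -1.0) for u in cuts]) if cuts else None
        b_ub = np.array([-(u @ h) for u in cuts]) if cuts else None
        master = linprog(np.append(c, 1.0), A_ub=A_ub, b_ub=b_ub,
                         A_eq=A_eq, b_eq=b,
                         bounds=[(0, None)] * n + [(theta0, None)])
        x_hat, theta_hat = master.x[:n], master.x[n]
        # step 2(b): evaluate f(x_hat) via the dual recourse problem
        dual = linprog(-(h - T @ x_hat), A_ub=W.T, b_ub=q,
                       bounds=[(None, None)] * W.shape[0])
        f_hat, u_hat = -dual.fun, dual.x
        if f_hat <= theta_hat + tol:
            return x_hat                       # stopping rule: f(x_hat) <= theta_hat
        cuts.append(u_hat)                     # optimality cut theta >= u^T(h - Tx)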
{(x, y) | Ax = b, Tx + Wy = h, x ≥ 0, y ≥ 0} ≠ ∅,
{v | Wv = 0, q^T v < 0, v ≥ 0} = ∅.

In addition, we have assumed {x | Ax = b, x ≥ 0} to be bounded. Hence inf{f(x) | Ax = b, x ≥ 0} is finite such that the lower bound θ_0 exists. This (and the boundedness of {x | Ax = b, x ≥ 0}) implies that

min{c^T x + θ | Ax = b, θ ≥ θ_0, x ≥ 0}

is solvable.
Step 2 If f(x̂) = +∞, we know from Proposition 1.14 that {u | W^T u ≤ 0, (h − T x̂)^T u > 0} ≠ ∅, and, according to Remark 1.6, for the convex polyhedral cone {u | W^T u ≤ 0} we may find with the simplex method one of the generating elements ũ mentioned in Proposition 1.12 that satisfies (h − T x̂)^T ũ > 0. By Proposition 1.12, we have finitely many generating elements for the cone {u | W^T u ≤ 0} such that, after having used all of them to construct feasibility cuts, for all feasible x we should have (h − T x)^T u ≤ 0 ∀u ∈ {u | W^T u ≤ 0} and hence solvability of the recourse problem. This shows that f(x̂) = +∞ may appear only finitely many times within this method.
If f(x̂) is finite, the simplex method yields primal and dual optimal feasible basic solutions ŷ and û respectively. Assume that we already had introduced the optimality cut θ ≥ ũ^T (h − T x), generated in an earlier cycle by ũ = û; then our present θ̂ has to satisfy this constraint for x = x̂, such that

θ̂ ≥ ũ^T (h − T x̂) = û^T (h − T x̂) = f(x̂),

and we would stop. In other words, once cuts have been generated from the dual basic solutions u^{1}, · · · , u^{k}, say, any later iterate satisfies

θ̂ ≥ (h − T x̂)^T u^{i}, i = 1, · · · , k.

Given our stopping rule f(x̂) ≤ θ̂, with the set of all feasible basic solutions, {u^{1}, · · · , u^{k}, · · · , u^{r}}, of {u | W^T u ≤ q}, it follows that each u^{i} can yield an optimality cut at most once, and hence the method has to stop after finitely many cycles.
We have described this method for the data structure of the linear program (6.17) that would result if a stochastic linear program with recourse had just one realization of the random data. To this end, we introduced the feasibility and optimality cuts for the recourse function f(x) := min{q^T y | Wy = h − Tx, y ≥ 0}. The modification for a finite discrete distribution with K realizations will be discussed later.
B := {x | g_i(x) ≤ 0, i = 1, · · · , m}.

Obviously, problems given in one of the forms

min f(x) s.t. g_i(x) ≤ 0, i = 1, · · · , m, x ≥ 0,

or

min f(x) s.t. g_i(x) ≤ 0, i = 1, · · · , m_1, g_i(x) = 0, i = m_1 + 1, · · · , m, x ≥ 0,

or

min f(x) s.t. g_i(x) ≥ 0, i = 1, · · · , m, x ≥ 0,

may be transformed into the standard form (7.1).
Proposition 1.21 The differentiable function ϕ : IR^n −→ IR is convex iff for all arbitrarily chosen x, y ∈ IR^n we have

ϕ(y) − ϕ(x) ≥ (y − x)^T ∇ϕ(x).

For x̂ to be a (local) minimum of ϕ it is necessary that

∇ϕ(x̂) = 0.

If, moreover, the function ϕ is convex then, owing to Proposition 1.21, this condition is also sufficient for x̂ to be a global minimum, since then for any x ∈ IR^n

ϕ(x) − ϕ(x̂) ≥ (x − x̂)^T ∇ϕ(x̂) = 0,

and hence

ϕ(x̂) ≤ ϕ(x) ∀x ∈ IR^n.
Whereas the above optimality condition is necessary for unconstrained minimization, the situation may become somewhat different for constrained minimization. Consider, for instance, the problem

min ψ(x) = x² s.t. x ≥ 1,

with the obvious solution

x̂ = 1, where ∇ψ(x̂) = (dψ/dx)(x̂) = 2 ≠ 0.

Hence we cannot just transfer the optimality conditions for unconstrained optimization to the constrained case. □
Therefore we shall first deal with the necessary and/or sufficient conditions
for some x̂ ∈ IRn to be a local or global solution of the program (7.1).
c^T x + b^T u = c^T x + u^T Ax − u^T Ax + b^T u
             = (c + A^T u)^T x + (b − Ax)^T u
             = [∇f(x) + Σ_{i=1}^m u_i ∇g_i(x)]^T x − Σ_{i=1}^m u_i g_i(x). (7.6)
Remark 1.10 The optimality condition derived in Remark 1.9 for the linear
case could be formulated as follows:
(1) For the feasible x̂ the negative gradient of the objective f —i.e. the
direction of the greatest (local) descent of f —is equal (with the multipliers
ûi ≥ 0) to a nonnegative linear combination of the gradients of those
constraint functions gi that are active at x̂, i.e. that satisfy gi (x̂) = 0.
(2) This corresponds to the fact that the multipliers satisfy the complemen-
tarity conditions ûi gi (x̂) = 0, i = 1, · · · , m, stating that the multipliers
ûi are zero for those constraints that are not active at x̂, i.e. that satisfy
gi (x̂) < 0.
In conclusion, this optimality condition says that −∇f (x̂) must be contained
in the convex polyhedral cone generated by the gradients ∇gi (x̂) of the con-
straints being active in x̂. This is one possible formulation of the Kuhn–Tucker
conditions, illustrated in Figure 27. □
Let us now return to the more general nonlinear case and consider the following question. Given that x̂ is a (local) solution, under what assumption do the Kuhn–Tucker conditions (7.7) hold? Hence we ask under what assumption are the conditions (7.7) necessary for x̂ to be a (locally) optimal solution of the program (7.1). To answer this question, let I(x̂) := {i | g_i(x̂) = 0}, such that the optimality conditions (7.7) are equivalent to

{u | Σ_{i∈I(x̂)} u_i ∇g_i(x̂) = −∇f(x̂), u_i ≥ 0 for i ∈ I(x̂)} ≠ ∅.

Observing that ∇g_i(x̂) and ∇f(x̂) are constant vectors when x is fixed at x̂, the condition of Farkas' lemma (Proposition 1.19) is satisfied if and only if the following regularity condition holds in x̂:

RC 0

z^T ∇g_i(x̂) ≤ 0, i ∈ I(x̂), implies that z^T ∇f(x̂) ≥ 0. (7.8)
Hence we have the rigorous formulation of the Kuhn–Tucker conditions:

Proposition 1.22 Given that x̂ is a (local) solution of the nonlinear program (7.1), under the assumption that the regularity condition RC 0 is satisfied in x̂ it necessarily follows that

∃û ≥ 0 such that ∇f(x̂) + Σ_{i=1}^m û_i ∇g_i(x̂) = 0,
                 Σ_{i=1}^m û_i g_i(x̂) = 0.
Example 1.10 The Kuhn–Tucker conditions need not hold if the regularity condition cannot be asserted. Consider the following simple problem (x ∈ IR¹):

min{x | x² ≤ 0}.

Its unique solution is x̂ = 0. Obviously we have

∇f(x̂) = (1), ∇g(x̂) = (0),

and there is no way to represent ∇f(x̂) as a (positive) multiple of ∇g(x̂). (Needless to say, the regularity condition RC 0 is not satisfied in x̂.) □
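Conversely, when the Kuhn–Tucker conditions do hold, they can be checked numerically. The following tiny Python sketch does so for the made-up convex problem min (x_1 − 2)² + (x_2 − 2)² s.t. x_1 + x_2 ≤ 2, whose solution is x̂ = (1, 1) by symmetry:

import numpy as np

x_hat = np.array([1.0, 1.0])
grad_f = 2 * (x_hat - 2)           # = (-2, -2)
grad_g = np.array([1.0, 1.0])      # gradient of g(x) = x1 + x2 - 2
g_val = x_hat.sum() - 2            # = 0: the constraint is active

u_hat = 2.0                        # multiplier solving grad_f + u * grad_g = 0
assert np.allclose(grad_f + u_hat * grad_g, 0)      # stationarity in (7.7)
assert u_hat >= 0 and np.isclose(u_hat * g_val, 0)  # sign and complementarity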
We just mention that for the case of linear constraints the Kuhn–Tucker
conditions are necessary for optimality, without the addition of any regularity
condition.
Instead of condition RC 0, there are various other regularity conditions
popular in optimization theory, only two of which we shall mention here. The
first is stated as
RC 1

∀z ≠ 0 s.t. z^T ∇g_i(x̂) ≤ 0, i ∈ I(x̂), ∃{x^k | x^k ≠ x̂, k = 1, 2, · · ·} ⊂ B such that

lim_{k→∞} x^k = x̂,   lim_{k→∞} (x^k − x̂)/‖x^k − x̂‖ = z/‖z‖.
The second—used frequently for the convex case, i.e. if the functions gi are
convex—is the Slater condition
RC 2
∃x̃ ∈ B such that gi (x̃) < 0 ∀i. (7.9)
Observe that there is an essential difference among these regularity conditions:
to verify RC 0 or RC 1, we need to know the (locally) optimal point for which
we want the Kuhn–Tucker conditions (7.7) to be necessary, whereas the Slater
condition RC 2—for the convex case—requires the existence of an x̃ such that
gi (x̃) < 0 ∀i, but does not refer to any optimal solution. Without proof we
might mention the following.
Proposition 1.23
(a) The regularity condition RC 1 (in any locally optimal solution) implies
the regularity condition RC 0.
(b) For the convex case the Slater condition RC 2 implies the regularity
condition RC 1 (for every feasible solution).
In Figure 28 we indicate how the proof of the implication RC 2 =⇒ RC 1 can
be constructed.
Based on these facts we immediately get the following.
Proposition 1.24
(a) If x̂ (locally) solves problem (7.1) and satisfies RC 0 then the Kuhn–
Tucker conditions (7.7) necessarily hold in x̂.
(b) If the functions f, gi , i = 1, · · · , m, are convex and the Slater condition
RC 2 holds, then x̂ ∈ B (globally) solves problem (7.1) if and only if the
Kuhn–Tucker conditions (7.7) are satisfied for x̂.
f(x) − f(x̂) ≥ (x − x̂)^T ∇f(x̂)
           = − Σ_{i∈I(x̂)} û_i (x − x̂)^T ∇g_i(x̂)
           ≥ − Σ_{i∈I(x̂)} û_i [g_i(x) − g_i(x̂)]
           ≥ 0,

since û_i ≥ 0 and g_i(x) − g_i(x̂) = g_i(x) ≤ 0 for all x ∈ B and i ∈ I(x̂).
Observe that, by the Kuhn–Tucker conditions, ∇_x L(x̂, û) = 0, and that in the convex case L(·, û) is convex for û ≥ 0, and hence

L(x̂, û) ≤ L(x, û) ∀x ∈ IR^n.

On the other hand, since ∇_u L(x̂, û) ≤ 0 is equivalent to g_i(x̂) ≤ 0 ∀i, and the Kuhn–Tucker conditions assert that û^T ∇_u L(x̂, û) = Σ_{i=1}^m û_i g_i(x̂) = 0, it follows that

L(x̂, u) ≤ L(x̂, û) ∀u ≥ 0.

Hence we have the following.
It is an easy exercise to show that for any saddle point (x̂, û), with û ≥ 0,
of the Lagrange function, the Kuhn–Tucker conditions (7.10) are satisfied.
Therefore, if we knew the right multiplier vector û in advance, the task to
solve the constrained optimization problem (7.1) would be equivalent to
that of solving the unconstrained optimization problem minx∈IRn L(x, û). This
observation can be seen as the basic motivation for the development of a class
of solution techniques known in the literature as Lagrangian methods.
In the following we shall briefly sketch four classes of solution techniques:

• cutting-plane methods;
• methods of descent;
• penalty methods;
• Lagrangian methods.
Assume that the feasible set

B = {x | g_i(x) ≤ 0, i = 1, · · · , m}

is bounded. Furthermore, assume that ∃ŷ ∈ int B—which for instance would be true if the Slater condition (7.9) held. Then, instead of the original problem

min_{x∈B} f(x),

we may solve the equivalent problem

min θ s.t. g_i(x) ≤ 0, i = 1, · · · , m, f(x) − θ ≤ 0,

whose objective is linear; we may therefore assume the objective of (7.11) to be linear, min{c^T x | x ∈ B}, where the bounded convex set B is assumed to contain an interior point ŷ.
Under the assumptions mentioned, it is possible to include the feasible set
B of problem (7.11) in a convex polyhedron P, which—after our discussions
in Section 1.6—we may expect to be able to represent by linear constraints.
Observe that the inclusion P ⊃ B implies the inequality

min_{x∈P} c^T x ≤ min_{x∈B} c^T x.
Step 2 Determine an optimal solution x̂^k of

min{c^T x | x ∈ P_k},

and, if x̂^k ∉ B, choose λ_k ∈ (0, 1) such that

z^k := λ_k ŷ + (1 − λ_k) x̂^k

is a boundary point of B. (Obviously we have z^k ∈ B, and moreover z^k is a boundary point of B on the line segment between the interior point ŷ of B and the point x̂^k, which is “external” to B.)

Step 3 Determine a “supporting hyperplane” of B in z^k (i.e. a hyperplane being tangent to B at the boundary point z^k). Let this hyperplane be given as

H_k := {x | (a^k)^T x = α_k}
such that the inequalities

(a^k)^T x ≤ α_k ∀x ∈ B,   (a^k)^T x̂^k > α_k

hold, and set P_{k+1} := P_k ∩ {x | (a^k)^T x ≤ α_k}. By construction B ⊂ P_{k+1} ⊂ P_k, and hence

c^T x̂^k ≤ min_{x∈B} c^T x, k = 0, 1, 2, · · · .

With z^{l_k} the best boundary point found so far (i.e. c^T z^{l_k} = min_{j≤k} c^T z^j), the differences

∆_k := c^T z^{l_k} − c^T x̂^k, k = 0, 1, 2, · · · ,

bound the error, since c^T x̂^k ≤ min_{x∈B} c^T x ≤ c^T z^{l_k}.
On the other hand, for any x ∈ B, g_{i_0}(x) ≤ 0. Again by Proposition 1.21, it follows (observing that g_{i_0}(z^k) = 0 for a constraint active at the boundary point z^k) that

0 ≥ g_{i_0}(x) ≥ g_{i_0}(z^k) + (x − z^k)^T ∇g_{i_0}(z^k) = (x − z^k)^T ∇g_{i_0}(z^k),

so that a^k := ∇g_{i_0}(z^k) and α_k := (a^k)^T z^k indeed satisfy (a^k)^T x ≤ α_k for all x ∈ B.
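A minimal runnable rendering of the whole scheme (Python with SciPy; the disc constraint, the starting box P_0 and all names are made-up illustrations) finds the boundary point z^k on the segment from ŷ to x̂^k by root finding and adds the supporting hyperplane with a^k = ∇g(z^k), α_k = (a^k)^T z^k as a cut:

import numpy as np
from scipy.optimize import linprog, brentq

c = np.array([1.0, 1.0])
g = lambda x: x @ x - 1.0                  # B = unit disc, g(x) <= 0
grad_g = lambda x: 2 * x
y_hat = np.zeros(2)                        # interior point of B

A_ub = np.vstack([np.eye(2), -np.eye(2)])  # P_0: the box [-2, 2]^2, containing B
b_ub = np.full(4, 2.0)
for k in range(25):
    x_hat = linprog(c, A_ub=A_ub, b_ub=b_ub,
                    bounds=[(None, None)] * 2).x
    if g(x_hat) <= 1e-9:                   # x_hat already in B: optimal
        break
    # boundary point z^k on the segment from y_hat to x_hat
    t = brentq(lambda t: g(y_hat + t * (x_hat - y_hat)), 0.0, 1.0)
    z = y_hat + t * (x_hat - y_hat)
    a = grad_g(z)                          # supporting hyperplane of B at z
    A_ub = np.vstack([A_ub, a])
    b_ub = np.append(b_ub, a @ z)          # cut: a^T x <= a^T z
print(x_hat)                               # approaches (-1/sqrt 2, -1/sqrt 2)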
(a) If z is optimal then the Kuhn–Tucker conditions have to hold. For (7.12) these are

∇f(z) + A^T u − w = 0, z^T w = 0, w ≥ 0,

or—with J(z) := {j | z_j > 0}—equivalently

A^T u − w = −∇f(z), w_j = 0 for j ∈ J(z), w ≥ 0.

Applying Farkas' Lemma 1.19 tells us that this system (and hence the above Kuhn–Tucker system) is feasible if and only if

Ad = 0, d_j ≥ 0 ∀j ∉ J(z) implies that [∇f(z)]^T d ≥ 0.
(b) If the feasible point z is not optimal then the Kuhn–Tucker conditions
cannot hold, and, according to (a), there exists a direction d such that
Ad = 0, dj ≥ 0 ∀j : zj = 0 and [∇f (z)]T d < 0. A direction like
this is called a feasible descent direction at z, which has to satisfy the
following two conditions: ∃λ0 > 0 such that z + λd ∈ B ∀λ ∈ [0, λ0 ]
and [∇f (z)]T d < 0. Hence, having at a feasible point z a feasible descent
direction d (for which, by its definition, d 6= 0 is obvious), it is possible to
move from z in direction d with some positive step length without leaving
B and at the same time at least locally to decrease the objective’s value.
Remark 1.12 It is worth mentioning that not every choice of feasible descent
directions would lead to a well-behaved algorithm. By construction we should
get—in any case—a sequence of feasible points {z (k) } with a monotonically
(strictly) decreasing sequence {f (z (k) )} such that for the case that f is
bounded below on B the sequence {f (z (k) )} has to converge to some value
γ. However, there are examples in the literature showing that if we do not
restrict the choice of the feasible descent directions in an appropriate way, it
may happen that γ > inf_{x∈B} f(x), which is certainly not the kind of result we want to achieve.
Let us assume that B 6= ∅ is bounded, implying that our problem (7.12)
is solvable. Then there are various possibilities of determining the feasible
descent direction, each of which defines its own algorithm for which a
“reasonable” convergence behaviour can be asserted in the sense that the
sequence {f (z (k) )} converges to the true optimal value and any accumulation
point of the sequence {z (k) } is an optimal solution of our problem (7.12). Let
us just mention two of those algorithms:
(a) The feasible direction method For this algorithm we determine in step 2 the direction d^{(k)} as the solution of the following linear program (a normalized search for a feasible descent direction):

min [∇f(z^{(k)})]^T d
s.t. Ad = 0,
     d_j ≥ 0 ∀j : z_j^{(k)} = 0,
     −e ≤ d ≤ e,

with e = (1, · · · , 1)^T. Then for [∇f(z^{(k)})]^T d^{(k)} < 0 we have a feasible descent direction, whereas for [∇f(z^{(k)})]^T d^{(k)} = 0 the point z^{(k)} is an optimal solution of (7.12).
(b) The reduced gradient method Assume that B is bounded and every feasible basic solution of (7.12) is nondegenerate. Then for z^{(k)} we find a basis B in A such that the components of z^{(k)} belonging to B are strictly positive. Rewriting A—after the necessary rearrangements of columns—as (B, N) and correspondingly presenting z^{(k)} as (x^B, x^{NB}), we have

B x^B + N x^{NB} = b,

or equivalently

x^B = B^{−1} b − B^{−1} N x^{NB}.

We also may rewrite the gradient ∇f(z^{(k)}) as (∇_B f(z^{(k)}), ∇_{NB} f(z^{(k)})). Then, rearranging d accordingly into (u, v), for a feasible direction we need to have

Bu + Nv = 0,

and hence

u = −B^{−1} N v.
For the directional derivative [∇f(z^{(k)})]^T d it follows, using u = −B^{−1} N v, that

[∇f(z^{(k)})]^T d = [∇_B f(z^{(k)})]^T u + [∇_{NB} f(z^{(k)})]^T v
                 = ([∇_{NB} f(z^{(k)})]^T − [∇_B f(z^{(k)})]^T B^{−1} N) v
                 =: (r^{NB})^T v,

the vector r^{NB} being called the reduced gradient.
Defining v as

v_j := −r_j^{NB} if r_j^{NB} ≤ 0,
v_j := −x_j^{NB} r_j^{NB} if r_j^{NB} > 0,
we obtain a feasible descent direction whenever v ≠ 0, whereas v = 0 holds iff, with w := (B^T)^{−1} ∇_B f(z^{(k)}),

r^B = ∇_B f(z^{(k)}) − B^T w = 0,
r^{NB} = ∇_{NB} f(z^{(k)}) − N^T w ≥ 0,

and

(r^B)^T x^B = 0, (r^{NB})^T x^{NB} = 0,

i.e. v = 0 is equivalent to satisfying the Kuhn–Tucker conditions.
It is known that the reduced gradient method with the above definition
of v may fail to converge to a solution (so-called “zigzagging”). However,
we can perturb v as follows:
v_j := −r_j^{NB} if r_j^{NB} ≤ 0,
v_j := −x_j^{NB} r_j^{NB} if r_j^{NB} > 0 and x_j^{NB} ≥ ε,
v_j := 0 if r_j^{NB} > 0 and x_j^{NB} < ε.
Then a proper control of the perturbation ε > 0 during the procedure can
be shown to enforce convergence.
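For concreteness, the following small Python sketch computes one reduced-gradient direction with the perturbed rule above; the data (f(x) = ‖x‖², a single equality constraint) and the basis choice are made up for the illustration:

import numpy as np

A = np.array([[1.0, 1.0, 1.0]]); b = np.array([3.0])
z = np.array([2.0, 1.0, 0.0])                   # feasible: Az = b, z >= 0
grad = 2 * z                                    # gradient of f(x) = x.x
basic, nonbasic = [0], [1, 2]                   # z_1 > 0 serves as the basis
B, N = A[:, basic], A[:, nonbasic]

w = np.linalg.solve(B.T, grad[basic])
r_NB = grad[nonbasic] - N.T @ w                 # reduced gradient
eps = 1e-3
v = np.where(r_NB <= 0, -r_NB,                  # perturbed direction rule
             np.where(z[nonbasic] >= eps, -z[nonbasic] * r_NB, 0.0))
u = -np.linalg.solve(B, N @ v)                  # keeps A d = 0
d = np.zeros(3); d[basic] = u; d[nonbasic] = v
print(d, grad @ d)                              # feasible descent: grad.d < 0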
The feasible direction and the reduced gradient methods have been extended
to the case of nonlinear constraints. We omit the presentation of the general
case here for the sake of better readability.
B_1^0 ∩ B_2 ≠ ∅

and that B = B_1 ∩ B_2 is bounded. Then for {r_k} and {s_k} strictly monotone sequences decreasing to zero there exists an index k_0 such that for all k ≥ k_0 the modified objective function F_{r_k s_k} attains its (free) minimum at some point x^{(k)} where x^{(k)} ∈ B_1^0.

The sequence {x^{(k)} | k ≥ k_0} is bounded, and any of its accumulation points is a solution of the original problem (7.1). With γ the optimal value of (7.1), the following relations hold:

lim_{k→∞} f(x^{(k)}) = γ,
lim_{k→∞} r_k Σ_{i∈I} ϕ(g_i(x^{(k)})) = 0,
lim_{k→∞} (1/s_k) Σ_{i∈J} ψ(g_i(x^{(k)})) = 0.
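To illustrate the structure of such a modified objective, here is a minimal Python sketch of a barrier-penalty scheme in this spirit: a logarithmic barrier weighted by r_k for one inequality constraint (playing the role of the set I) and a quadratic loss weighted by 1/s_k for one equality constraint (the set J). The test problem, the concrete choices of ϕ and ψ and the halving schedules are illustrative assumptions, not the text's specific construction.

import numpy as np
from scipy.optimize import minimize

f = lambda x: (x[0] - 2) ** 2 + (x[1] - 2) ** 2
g = lambda x: x[0] + x[1] - 2            # inequality constraint, g(x) <= 0
h = lambda x: x[0] - x[1]                # equality constraint, h(x) = 0

x = np.array([0.5, 0.4])                 # start in the interior B_1^0
r, s = 1.0, 1.0
for k in range(30):
    def F(x, r=r, s=s):                  # modified objective F_{r_k s_k}
        if g(x) >= 0:
            return np.inf                # log-barrier is +inf outside B_1^0
        return f(x) - r * np.log(-g(x)) + h(x) ** 2 / s
    x = minimize(F, x, method="Nelder-Mead").x
    r, s = 0.5 * r, 0.5 * s
print(x)                                 # approaches the solution (1, 1)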
Consider again the problem

min f(x) s.t. g_i(x) ≤ 0, i = 1, · · · , m.

To simplify the description, let us first consider the optimization problem with equality constraints

min f(x) s.t. g_i(x) = 0, i = 1, · · · , m. (7.14)

Knowing for this problem the proper multiplier vector û, or at least a good approximation u of it, we should find a solution of (7.14) by the free minimization (7.15):

min_{x∈IR^n} [f(x) + u^T g(x)].

A penalty approach with quadratic loss, on the other hand, would consist in solving (7.16):

min_{x∈IR^n} [f(x) + ½ λ ‖g(x)‖²]

and driving the parameter λ towards +∞, with ‖g(x)‖ being the Euclidean norm of g(x) = (g_1(x), · · · , g_m(x))^T.

One idea is to combine the two approaches (7.15) and (7.16) such that we are dealing with the so-called augmented Lagrangian as our modified objective:

min_{x∈IR^n} [f(x) + u^T g(x) + ½ λ ‖g(x)‖²].
Observe that for u^{(k)} = 0 ∀k we should get back the penalty method with a quadratic loss function, which, according to Proposition 1.26, is known to “converge” in the sense asserted there.
For the method (7.17) in general the following two statements can be proved,
showing
(a) that we may expect a convergence behaviour as we know it already for
penalty methods; and
(b) how we should successively adjust the multiplier vector u(k) to get the
intended convergence to the proper Kuhn–Tucker multipliers.
Proposition 1.27 If f and g_i, i = 1, · · · , m, are continuous and x^{(k)}, k = 1, 2, · · ·, are global solutions of

min_x L_{λ_k}(x, u^{(k)}),

where {u^{(k)}} is bounded and λ_k → ∞, then any accumulation point of {x^{(k)}} solves (7.14).
The following statement also shows that it would be sufficient to solve the free optimization problems min_x L_{λ_k}(x, u^{(k)}) only approximately.

Proposition 1.28 Let f and g_i, i = 1, · · · , m, be continuously differentiable, and let the approximate solutions x^{(k)} to the free minimization problems in (7.17) satisfy

‖∇_x L_{λ_k}(x^{(k)}, u^{(k)})‖ ≤ ε_k ∀k,

where ε_k ≥ 0 ∀k and ε_k → 0. For some K ⊂ IN let {x^{(k)}, k ∈ K} converge to some x* (i.e. x* is an accumulation point of {x^{(k)}, k ∈ IN}), and let {∇g_1(x*), · · · , ∇g_m(x*)} be linearly independent. Then ∃u* such that

u^{(k)} + λ_k g(x^{(k)}) → u* for k ∈ K, with ∇f(x*) + Σ_{i=1}^m u_i* ∇g_i(x*) = 0.
λ_1 := 1, λ_{k+1} := 1.1 λ_k ∀k ≥ 1,
To treat the original inequality-constrained problem

min f(x) s.t. g_i(x) ≤ 0, i = 1, · · · , m,

we may introduce slack variables z_i and rewrite it equivalently as

min f(x) s.t. g_i(x) + z_i² = 0, i = 1, · · · , m.
yielding

ỹ_i = −[u_i/λ + g_i(x)].

Hence we have for the solution of (7.21)

y_i* = ỹ_i if ỹ_i ≥ 0, and y_i* = 0 otherwise,
     = max{0, −[u_i/λ + g_i(x)]}, (7.22)

implying

g_i(x) + y_i* = max[g_i(x), −u_i/λ], (7.23)

which, with ẑ_i² = y_i*, after an elementary algebraic manipulation reduces our extended Lagrangian (7.20) to

L̃_λ(x, u) = f(x) + (1/(2λ)) Σ_{i=1}^m ([max{0, u_i + λ g_i(x)}]² − u_i²).
Minimization for some given u^{(k)} and λ_k of the Lagrangian (7.20) with respect to x and z will now be achieved by solving the problem

min_x L̃_{λ_k}(x, u^{(k)}),

and, with a solution x^{(k)} of this problem, our update formula (7.19) for the multipliers—recalling that we now have the equality constraints g_i(x) + z_i² = 0 instead of g_i(x) = 0 as before—becomes by (7.23)

u^{(k+1)} := u^{(k)} + λ_k “max”[g(x^{(k)}), −u^{(k)}/λ_k]
          = “max”[0, u^{(k)} + λ_k g(x^{(k)})], (7.24)

where “max” is to be understood componentwise.
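A minimal sketch of the complete iteration (Python with SciPy; the quadratic test problem, the starting values and the schedule λ_{k+1} := 1.1 λ_k are illustrative assumptions) minimizes L̃ numerically in an inner loop and applies the update (7.24):

import numpy as np
from scipy.optimize import minimize

f = lambda x: (x[0] - 2) ** 2 + (x[1] - 1) ** 2
g = lambda x: np.array([x[0] + x[1] - 2])     # constraint g(x) <= 0

x, u, lam = np.zeros(2), np.zeros(1), 1.0
for k in range(25):
    # L~_lam(x, u) = f(x) + (1/(2 lam)) sum(max(0, u + lam g(x))^2 - u^2)
    L_tilde = lambda x: f(x) + np.sum(
        np.maximum(0.0, u + lam * g(x)) ** 2 - u ** 2) / (2 * lam)
    x = minimize(L_tilde, x, method="Nelder-Mead").x
    u = np.maximum(0.0, u + lam * g(x))       # update (7.24), componentwise
    lam *= 1.1
print(x, u)   # x approaches (1.5, 0.5), u the Kuhn-Tucker multiplier 1.0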
The observation that some data in real-life optimization problems could be random, which is the origin of stochastic programming, dates back to the 1950s.
Without any attempt at completeness, we might mention from the early
contributions to this field Avriel and Williams [3], Beale [5, 6], Bereanu [8],
Dantzig [11], Dantzig and Madansky [13], Tintner [43] and Williams [49].
For more detailed discussions of the situation of the decision maker facing
random parameters in an optimization problem we refer for instance to
Dempster [14], Ermoliev and Wets [16], Frauendorfer [18], Kall [22], Kall and
Prékopa [24], Kolbin [28], Sengupta [42] and Vajda [45].
Wait-and-see problems have led to investigations of the distribution of the
optimal value (and the optimal solution); as examples of these efforts, we
mention Bereanu [8] and King [26].
The linear programs resulting as deterministic equivalents in the recourse
case may become (very) large in scale, but their particular block structure
is amenable to specially designed algorithms, which are until now under
investigation and for which further progress is to be expected in view of the
possibilities given with parallel computers (see e.g. Zenios [51]). For those
problems the particular decomposition method QDECOM—which will be
described later—was proposed by Ruszczyński [41].
The idea of approximating stochastic programs with recourse (with a
continuous type distribution) by discretizing the distribution, as mentioned in
Section 1.2, is related to special convergence requirements for the (discretized)
expected recourse functions, as discussed for example by Attouch and Wets [2]
and Kall [23].
More on probabilistically constrained models and corresponding
applications may be found for example in Dupačová et al. [15], Ermoliev and
Wets [16] and Prékopa et al. [36]. The convexity statement of Proposition 1.5 can be found in Wets [48]. The probabilistically constrained program at the end of Section 1.2 (page 14) was solved by PROCON. This solution method for problems with a joint chance constraint (with normally distributed right-hand side) was described first by Mayer [30], and has its theoretical base in Prékopa [35].
Statements on the induced feasible set K and induced constraints are
found in Rockafellar and Wets [39], Walkup and Wets [46] and Kall [22].
The requirement that the decision on x does not depend on the outcome of
ξ˜ is denoted as nonanticipativity, and was discussed rigorously in Rockafellar
and Wets [40]. The conditions for complete recourse matrices were proved in
Kall [21], and may be found in [22].
Necessary and sufficient conditions for log-concave distributions were
derived first in Prékopa [35]; later corresponding conditions for quasi-concave
measures were derived in Borell [10] and Rinott [37].
More details on stochastic linear programs may be found in Kall [22];
multistage stochastic programs are still under investigation, and were
discussed early by Olsen [31, 32, 33]; useful results on the deterministic
equivalent of recourse problems and for the expectation functionals arising
in Section 1.3 are due to Wets [47, 48].
There is a wide literature on linear programming, which cannot be listed
here to any reasonable degree of completeness. Hence we restrict ourselves to
mentioning the book of Dantzig [12] as a classic reference.
For a rigorous development of measure theory and the foundations of
probability theory we mention the standard reference Halmos [19].
The idea of feasibility and optimality cuts in the dual decomposition method
may be traced back to Benders [7].
There is a great variety of good textbooks on nonlinear programming
(theory and methods) as well. Again we have to restrict ourselves, and just
mention Bazaraa and Shetty [4] and Luenberger [29] as general texts.
Cutting-plane methods have been proposed in various publications, differing
in the way the cuts (separating hyperplanes) are defined. An early version
was published by Kelley [25]; the method we have presented is due to
Kleibohm [27].
The method of feasible directions is due to Zoutendijk [52, 53]; an extension
to nonlinear constraints was proposed by Topkis and Veinott [44].
The reduced gradient method can be found in Wolfe [50], and its extension
to nonlinear constraints was developed by Abadie and Carpentier [1].
A standard reference for penalty methods is the monograph of Fiacco and
McCormick [17].
The update formula (7.19) for the multipliers in the augmented Lagrangian
method for equality constraints motivated by Proposition 1.28 goes back
to Hestenes [20] and Powell [34], whereas the update (7.24) for inequality-
constrained problems is due to Rockafellar [38]. For more about Lagrangian
methods we refer the reader to the book of Bertsekas [9].
Exercises
References
Dynamic Systems
Figure 1 Basic set-up for a dynamic program with four states, four stages
and three possible decisions.
as follows. With
t the stages, t = 1, · · · , T ,
Gt (zt , xt ) the transformation (or transition) of the system from the state zt
and the decision taken at stage t into the state zt+1 at the next
stage, i.e. zt+1 = Gt (zt , xt ),
rt (zt , xt ) the immediate return if at stage t the system is in state zt and the
decision xt is taken,
Xt (zt ) the set of feasible decisions at stage t, (which may depend on the
state zt ),
Observe that owing to the relation zt+1 = Gt (zt , xt ), the objective function
can be rewritten in the form Φ(z1 , x1 , x2 , · · · , xT ). To get an idea of the
possible structures we can face, let us revisit the example in Figure 1. The
purpose of the example is not to be realistic, but to illustrate a few points. A
more realistic problem will be discussed in the next section.
Example 2.1 Assume that stages are years, and that the system is inspected annually, so that the four stages correspond to 1 January of the first, second and third years, and 31 December of the third year. Assume further that four different levels are distinguished as states for the system, i.e. at any stage one may observe the state z_t = 1, 2, 3 or 4. Finally, depending on the state of the system in stages 1, 2 and 3, one of the following decisions may be made:

x_t = 1, leading to the immediate return r_t = 2,
x_t = 0, leading to the immediate return r_t = 1,
x_t = −1, leading to the immediate return r_t = −1.

The state then changes according to

z_{t+1} = z_t + x_t.
(a) Let

F(r_1, · · · , r_4) := r_1 + r_2 + r_3 + r_4

and assume that the initial state is z_1 = 4. This is illustrated in Figure 2, which has the same structure as Figure 1. Using the figure, we can check that an optimal policy (i.e. sequence of decisions) is x_1 = x_2 = x_3 = 0, keeping us in z_t = 4 for all t, with the optimal value F(r_1, · · · , r_4) = 1 + 1 + 1 − 2 = 1.
We may determine this optimal policy iteratively as follows. First, we
determine the decision for each of the states in stage 3 by determining
Figure 2 Dynamic program: additive composition. The solid lines show the
result of the backward recursion.
(b) Now let F(r_1, · · · , r_4) := r_1 r_2 r_3 r_4.
where the composition operation “⊕” was chosen as addition in case (a) and multiplication in case (b). For the backward recursion we have made use of the so-called separability of F. That is, there exist two functions ϕ_1, ψ_2 such that

F(r_1, · · · , r_T) = ϕ_1(r_1, ψ_2(r_2, · · · , r_T)).
Proposition 2.1 “An optimal policy has the property that whatever the
initial state and initial decision are, the remaining decisions must constitute
an optimal policy with regard to the state resulting from the first decision.”
for all x1 . Therefore this also holds when the right-hand side of this inequality
is maximized with respect to x1 .
On the other hand, it is also obvious that
max_{x_t∈X_t, t≥2} ψ_2(r_2(z_2, x_2), · · · , r_T(z_T, x_T))
The purpose of this section is to look at certain aspects of the field of dynamic
programming. The example we looked at in the previous section is an example
of a dynamic programming problem. It will not represent a fair description
of the field as a whole, but we shall concentrate on aspects that are useful in
our context. This section will not consider randomness. That will be discussed
later.
We shall be interested in dynamic programming as a means of solving
problems that evolve over time. Typical examples are production planning
under varying demand, capacity expansion to meet an increasing demand
and investment planning in forestry. Dynamic programming can also be used
to solve problems that are not sequential in nature. Such problems will not
be treated in this text.
Important concepts in dynamic programming are the time horizon, state
variables, decision variables, return functions, accumulated return functions,
optimal accumulated returns and transition functions. The time horizon refers
to the number of stages (time periods) in the problem. State variables describe
the state of the system, for example the present production capacity, the
present age and species distribution in a forest or the amount of money
one has in different accounts in a bank. Decision variables are the variables
under one’s control. They can represent decisions to build new plants, to cut
a certain amount of timber, or to move money from one bank account to
another. The transition function shows how the state variables change as a
function of decisions. That is, the transition function dictates the state that
will result from the combination of the present state and the present decisions.
For example, the transition function may show how the forest changes over
the next period as a result of its present state and of cutting decisions, how the
amount of money in the bank increases, or how the production capacity will change as a result of its present size, investments (and deterioration). A return
function shows the immediate returns (costs or profits) as a result of making
a specific decision in a specific state. Accumulated return functions show the
accumulated effect, from now until the end of the time horizon, associated with
a specific decision in a specific state. Finally, optimal accumulated returns show
the value of making the optimal decision based on an accumulated return
function, or in other words, the best return that can be achieved from the
present state until the end of the time horizon.
Figure 4 Graphical description of a simple investment problem: account A pays 10% interest in the first year and 7% in the second (fixed fee 20 per year, fee 10 on withdrawals); account B pays 7% in the first year and 5% in the second.
the option of moving the money to account A. You will there face an interest
rate of 10% the first year and 7% the second year. However, there is a fixed
charge of 20 per year and a charge of 10 each time we withdraw money from
account A. The fixed charge is deducted from the account at the end of a year,
whereas the charges on withdrawals are deducted immediately. The question
is: Should we move our money to account A for the first year, the second year
or both years? In any case, money left in account A at the end of the second
year will be transferred to account B. The goal is to solve the problem for all
initial S0 > 1000. Figure 4 illustrates the example.
Note that all investments will result in a case where the wealth increases,
and that it will never be profitable to split the money between the accounts
(why?).
Let us first define the two-dimensional state variables zt = (zt1 , zt2 ). The
first state variable, zt1 , refers to the account name (A or B); the second
state variable, zt2 , refers to the amount of money St in that account. So
zt = (B, St ) refers to a state where there is an amount St in account B
in stage t. Decisions are where to put the money for the next time period. If
xt is our decision variable then xt ∈ {A, B}. The transition function will be
denoted by Gt (zt , xt ), and is defined via interest rates and charges. It shows
what will happen to the money over one year, based on where the money is
now, how much there is, and where it is put next. Since the state has two components, the function G_t is two-valued. For example,

z_{t+1}^1 = G_t^1((A, S_t)^T, A) = A,   z_{t+1}^2 = G_t^2((A, S_t)^T, A) = S_t × 1.07 − 20.
The calculations for our example are as follows. Note that we have three
stages, which we shall denote Stage 0, Stage 1 and Stage 2. Stage 2 represents
the point in time (after two years) when all funds must be transferred to
account B. Stage 1 is one year from now, where we, if we so wish, may move
the money from one account to another. Stage 0 is now, where we must decide
if we wish to keep the money in account B or move it to account A.
Stage 2 At Stage 2, all we can do is to transfer whatever money we have in account A to account B:

f_2*(A, S_2) = S_2 − 10,
f_2*(B, S_2) = S_2.
Stage 0 Since we start out with all our money in account B, we only need
to check that account. Initially we have S0 . If we transfer to A, we get
S1 = S0 × 1.1 − 20, and if we keep it in B, S1 = S0 × 1.07. The accumulated
returns are
f_0(B, S_0, A) = f_1*(A, S_1) = f_1*(A, S_0 × 1.1 − 20)
             = (S_0 × 1.1 − 20) × 1.07 − 30 = 1.177 × S_0 − 51.4,

f_0(B, S_0, B) = f_1*(B, S_1) = f_1*(B, S_0 × 1.07)
             = S_0 × 1.1449 − 30 if S_0 ≥ 1402,
             = S_0 × 1.1235 if S_0 ≤ 1402.
Comparing the two options, we see that account A is always best, yielding

f_0*(B, S_0) = S_0 × 1.177 − 51.4.
So we should move our money to account A and keep it there until the end
of the second period. Then we move it to B as required. We shall be left with
a total interest of 17.7% and fixed charges of 51.4 (including lost interest on
charges).
□
As we can see, the main idea behind dynamic programming is to take one
stage at a time, starting with the last stage. For each stage, find the optimal
decision for all possible states, thereby calculating the optimal accumulated
return from then until the end of the time horizon for all possible states. Then
move one step towards the present, and calculate the returns from that stage
until the end of the time horizon by adding together the immediate returns,
and the returns for all later periods based on the calculations made at the
previous stage. In the example we found that f1∗ (A, S1 ) = S1 × 1.07 − 30.
This shows us that if we end up in stage 1 with S1 in account A, we shall (if
we behave optimally) end up with S1 × 1.07 − 30 in account B at the end of
the time horizon. However, f1∗ does not tell us what to do, since that is not
needed to calculate optimal decisions at stage 0.
Formally speaking, we are trying to solve the following problem, where x = (x_0, . . . , x_T)^T:

max_x F(r_0(z_0, x_0), · · · , r_T(z_T, x_T), Q(z_{T+1}))
s.t. z_{t+1} = G_t(z_t, x_t), x_t ∈ X_t(z_t), t = 0, . . . , T.

Here r_t denotes the return function for all but the last stage, Q the return function for the last stage, G_t the transition function, T the time horizon, z_t the (possibly multi-dimensional) state variable in stage t and x_t the (possibly multi-dimensional) decision variable in stage t. The accumulated return function f_t(z_t, x_t) and optimal accumulated returns f_t*(z_t) are not part of the problem formulation, but rather part of the solution procedure. The solution procedure, justified by the Bellman principle, runs as follows. We find f_0*(z_0) by solving recursively

f_t*(z_t) = max_{x_t∈X_t(z_t)} [r_t(z_t, x_t) ⊕ f*_{t+1}(z_{t+1})]

with

z_{t+1} = G_t(z_t, x_t) for t = T, . . . , 0,
f*_{T+1}(z_{T+1}) = Q(z_{T+1}).

In each case the problem must be solved for all possible values of the state variable z_t, which might be multi-dimensional.
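For the two-year investment example this backward recursion takes only a few lines of Python (a sketch assuming the rates and fees quoted above; f2_star, f1_star and f0_star are our names for the functions f_t*):

def f2_star(account, s):                  # stage 2: transfer everything to B
    return s - 10 if account == "A" else s

def f1_star(account, s):                  # stage 1: choose the better account
    to_A = f2_star("A", s * 1.07 - 20)    # year 2 in A: 7%, fee 20
    if account == "A":
        return max(to_A, f2_star("B", (s - 10) * 1.05))  # withdrawal fee 10
    return max(to_A, f2_star("B", s * 1.05))             # keep in B at 5%

def f0_star(s0):                          # stage 0: all money sits in B
    return max(f1_star("A", s0 * 1.10 - 20),   # move to A: 10%, fee 20
               f1_star("B", s0 * 1.07))        # keep in B at 7%

print(f0_star(1000.0))                    # 1.177 * 1000 - 51.4 = 1125.6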
Problems that are not dynamic programming problems (unless rewritten
with a large expansion of the state space) would be problems where, for
example,
zt+1 = Gt (z0 , . . . , zt , x0 , . . . , xt ),
or where the objective function depends in an arbitrary way on the whole history up to stage t, represented by

r_t(z_0, . . . , z_t, x_0, . . . , x_t).
Such problems may more easily be solved using other approaches, such as
decision trees, where these complicated functions cause little concern.
Figure 5 Decision tree for the investment problem: starting with S_0 in account B, we choose account A or B at stages 0 and 1; at stage 2 all money is transferred to account B, yielding S_3 in each of the four leaves.
The tree must not become too large for this method to be useful, since there is one leaf in the tree for each possible sequence of decisions.
The tree indicates that at stage 0 we have S0 in account B. We can then
decide to put them into A (go left) or keep them in B (go right). Then at stage
1 we have the same possible decisions. At stage 2 we have to put them into
B, getting S3 , the final amount of money. As before we could have skipped
the last step. To be able to solve this problem, we shall first have to follow
each path in the tree from the root to the bottom (the leaves) to find S3
in all cases. In this way, we enumerate all possible sequences of decisions
that we can possibly make. (Remember that this is exactly what we avoid in
dynamic programming.) The optimal sequence must, of course, be one of these sequences. Let (AAB) refer to the path in the tree with the corresponding
indices on the arcs. We then get
We have now obtained numbers in all leaves of the tree (for some reason decision trees always grow with the root up). We are now going to move back
towards the root, using a process called folding back. This implies moving one
step up the tree at a time, finding for each node in the tree the best decision
for that node.
This first step is not really interesting in this case (since we must move the
money to account B), but, even so, let us go through it. We find that the best
we can achieve after two decisions is as follows.
We can then fold back to the top, finding that it is best going left, obtaining
the given S3 = S0 × 1.177 − 51.4. Of course, we recognize most of these
computations from Section 2.2 on dynamic programming.
You might feel that these computations are not very different from those in
the dynamic programming approach. However, they are. For example, assume
that we had 10 periods, rather than just 2. In dynamic programming we would then have to calculate the optimal accumulated return as a function of S_t for both accounts in 10 periods, a total of 2 × 10 = 20 calculations, each involving a maximization over the two possible decisions. In the decision tree case the number of such calculations will be 2^10 + 2^9 + . . . + 1 = 2^11 − 1 = 2047. (The counting depends a little on how we treat the last period.)
strength of dynamic programming. It investigates many fewer cases. It should
be easy to imagine situations where the use of decision trees is absolutely
impossible due to the mere size of the tree.
On the other hand, the decision tree approach certainly has advantages.
Assume, for example, that we were not to find the optimal investments for
all S0 > 1000, but just for S0 = 1000. That would not help us much in the
dynamic programming approach, except that f1∗ (B, S1 ) = S1 × 1.05, since
S_1 < 1500. But that is a minor help. The decision tree case, on the other hand, works with the specific value S_0 = 1000 from the start, and would therefore be solved just as before.
We shall now see how decision trees can be used to solve certain classes
of stochastic problems. We shall initiate this with a look at our standard
investment problem in Example 2.2. In addition, let us now assume that the
interest on account B is unchanged, but that the interest rate on account A
is random, with the previously given rates as expected values. Charges on
account A are unchanged. The distribution for the interest rate is given in
Table 1. We assume that the interest rates in the two periods are described
by independent random variables.
Based on this information, we can give an update of Figure 4, where we
show the deterministic and stochastic parameters of the problem. The update
is shown in Figure 6.
Consider the decision tree in Figure 7. As in the deterministic case, square
nodes are decision nodes, from which we have to choose between account A
and B. Circular nodes are called chance nodes, and represent points at which
something happens, in this case that the interest rates become known.
Start at the top. In stage 0, we have to decide whether to put the money
into account A or into B. If we choose A, we shall experience an interest rate
of 8% or 12% for the first year. After that we shall have to make a new decision
for the second year. That decision will be allowed to depend on what interest
rate we experienced in the first period. If we choose A, we shall again face
an uncertain interest rate. Whenever we choose B, we shall know the interest
rate with certainty.
Having entered a world of randomness, we need to specify what our decisions
will be based on. In the deterministic setting we maximized the final amount
in account B. That does not make sense in a stochastic setting. A given series
Figure 6 Simple investment problem with uncertain interest rates: in account A the rate is 8% or 12% in the first year and 5% or 9% in the second (fees as before); account B still pays 7% and then 5%.
Figure 7 Stochastic decision tree for the investment problem: square decision nodes (choose A or B) alternate with chance nodes at which the year's interest rate (8% or 12%, then 5% or 9% in account A) becomes known.
Figure 8 Stochastic decision tree for the investment problem when we maximize the expected amount in account B at the end of stage 2. Folding back gives the expected value 1126 for choosing account A at stage 0, against 1124 for account B; the leaf values range from 1083 to 1169.
You might have observed that the solution derived here is exactly the same
as we found in the deterministic case. This is caused by two facts. First, the
interest rate in the deterministic case equals the expected interest rate in the
stochastic case, and, secondly, the objective function is linear. In other words,
if ξ̃ is a random variable and a and b are constants then

E_ξ̃(aξ̃ + b) = a E ξ̃ + b.
For the stochastic case we calculated the left-hand side of this expression, and
for the deterministic case the right-hand side.
In many cases it is natural to maximize expected profits, but not always.
One common situation for decision problems under uncertainty is that
the decision is repeated many times, often, in principle, infinitely many.
Investments in shares and bonds, for example, are usually of this kind. The
situation is characterized by long time series of data, and by many minor
decisions. Should we, or should we not, maximize expected profits in such a
case?
Economics provides us with a tool to answer that question, called a utility function. Although it is not going to be a major point in this book, we should
like to give a brief look into the area of utility functions. It is certainly an
area very relevant to decision making under uncertainty. If you find the topic
interesting, consult the references listed at the end of this chapter. The area
is full of pitfalls and controversies, something you will probably not discover
from our little glimpse into the field. More than anything, we simply want to
give a small taste, and, perhaps, something to think about.
We may think of a utility function as a function that measures our happiness
(utility) from a certain wealth (let us stick to money). It does not measure
utility in any fixed unit, but is only used to compare situations. So we can say
that one situation is preferred to another, but not that one situation is twice
as good as another. An example of a utility function is found in Figure 9.
Note that the utility function is concave. Let us see what that means.
Assume that our wealth is w0 , and we are offered a game. With 50%
probability we shall win δw; with 50% probability we shall lose the same
amount. It costs nothing to take part. We shall therefore, after the game, either
have a wealth of w0 + δw or a wealth of w0 − δw. If the function in Figure 9 is
our utility function, and we calculate the utility of these two possible future
situations, we find that the decrease in utility caused by losing δw is larger
than the increase in utility caused by winning δw. What has happened is that
we do not think that the advantage of possibly increasing our wealth by δw
is good enough to offset our worry about losing the same amount. In other
words, our expected utility after having taken part in the game is smaller
than our certain utility of not taking part. We prefer w0 with certainty to a
distribution of possible wealths having expected value w0 . We are risk-averse.
If we found the two situations equally good, we are risk-neutral. If we prefer
Figure 9 Example of a typical concave utility function representing risk aversion.
single project is large. The reason behind this argument is that, with a very
large number of projects at hand (which certainly the government has), some
will win, some will lose. Over all, owing to offsetting effects, the government
will face very little risk. (It is like in a life insurance company, where the death of customers is not considered a random event. With a large number of customers, they “know” how many will die the next year.)
In all, as we see, we must argue in each case whether or not a linear or
concave utility function is appropriate. Clearly, in most cases a linear utility
function creates easier problems to solve. But in some cases risk should indeed
be taken into account.
Let us now continue with our example, and assume we are faced with a
concave utility function
u(s) = ln(s − 1000)
and that we wish to maximize the expected utility of the final wealth s. In
the deterministic case we found that it would never be profitable to split
the money between the two accounts. The argument is the same when we
simply maximized the expected value of S3 as outlined above. However, when
maximizing expected utility, that might no longer be the case. On the other
hand, the whole set-up used in this chapter assumes implicitly that we do
not split the funding. Hence in what follows we shall assume that all the
money must be in one and only one account. The idea in the decision tree is
to determine which decisions to make, not how to combine them. Figure 10
shows how we fold back with expected utilities. The numbers in the leaves
represent the utility of the numbers in the leaves of Figure 8. For example
u(1083) = ln(1083 − 1000) = ln 83 = 4.419.
We observe that, with this utility function, it is optimal to use account B.
The reason is that we fear the possibility of getting only 8% in the first period
combined with the charges. The result is that we choose to use B, getting
the certain amount S3 = 1124. Note that if we had used account A in the
first period (which is not optimal), the optimal second-stage decision would
depend on the actual outcome of the interest on account A in the first period.
With 8%, we pick B in the second period; with 12%, we pick A.
Figure 10 Stochastic decision tree for the investment problem when we maximize the expected utility of the amount in account B at the end of period 2. Folding back yields the expected utility 4.820 for account B at stage 0, against 4.807 for account A.
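The folding-back computation can be reproduced by a small recursive Python sketch (assuming the rates, fees and 50/50 probabilities of the example; fold, year and move are our helper names). Since the text rounds intermediate amounts to whole units, the printed value differs from Figure 10 in the third decimal:

import math

u = lambda s: math.log(s - 1000)           # utility of the final wealth

RATES = {1: {"A": [0.08, 0.12], "B": [0.07]},
         2: {"A": [0.05, 0.09], "B": [0.05]}}

def year(account, s, rate):                # one year in the given account
    return s * (1 + rate) - 20 if account == "A" else s * (1 + rate)

def move(frm, to, s):                      # fee 10 on withdrawals from A
    return s - 10 if frm == "A" and to == "B" else s

def fold(stage, account, s):
    """Expected utility of an optimal policy from (stage, account, s) on."""
    if stage == 3:
        return u(move(account, "B", s))    # final transfer to account B
    best = float("-inf")
    for nxt in ("A", "B"):                 # decision node: pick an account
        cash = move(account, nxt, s)
        rates = RATES[stage][nxt]          # chance node: expected utility
        best = max(best, sum(fold(stage + 1, nxt, year(nxt, cash, r))
                             for r in rates) / len(rates))
    return best

print(fold(1, "B", 1000.0))                # about 4.82: account B is optimal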
since we must move the money into account B at the end of the
second year.
Stage 1 We have to consider the two accounts separately.

Account A If we keep the money in account A, we get the following expected return:

f_1(A, S_1, A) = 0.5 [f_2*(A, S_1 × 1.05 − 20) + f_2*(A, S_1 × 1.09 − 20)]
             = 0.5 ln[(S_1 × 1.05 − 1030)(S_1 × 1.09 − 1030)],
Stage 0 We here have to consider only the case when the amount S0 > 1000
sits in account B. The basis for these calculations will be the
following two expressions. The first calculates the expected result
of using account A, the second the certain result of using account B.
To find the value of this expression for f0∗ (B, S0 ), we must make sure
that we use the correct expressions for f1∗ from stage 1. To do that,
we must know how conditions on S1 relate to conditions on S0 . There
are three different ways S0 and S1 can be connected (see e.g. the top
part of Figure 10):
From this, we see that three different cases must be discussed, namely
1000 < S0 < 1016, 1016 < S0 < 1437 and 1437 < S0 .
Case 1 Here 1000 < S_0 < 1016. In this case

f_0*(B, S_0) = ln(S_0 × 1.1235 − 1000),

which means that we always put the money into account B. (Make sure you understand this by actually performing the calculations.)
Case 2 Here 1016 < S_0 < 1437. In this case

f_0*(B, S_0) = ln(S_0 × 1.1235 − 1000) if S_0 < 1022,
f_0*(B, S_0) = 0.25 × ln[(S_0 × 1.134 − 1051) × (S_0 × 1.1772 − 1051.8) × (S_0 × 1.176 − 1051) × (S_0 × 1.2208 − 1051.8)] if S_0 > 1022,
which means that we use account B for small amounts and account A for large amounts within the given interval.

Figure 11 Optimal decisions for the investment problem: at stage 0 the money is moved to account A iff S_0 > 1022; at stage 1 it is kept in A iff S_1 > 1077, and moved from B to A iff S_1 > 1538.
Case 3 Here we have S_0 > 1437. In this case

f_0*(B, S_0) = (1/4) ln[(S_0 × 1.134 − 1051) × (S_0 × 1.1772 − 1051.8) × (S_0 × 1.176 − 1051) × (S_0 × 1.2208 − 1051.8)],
If we put these results into Figure 4, we obtain Figure 11. From the latter,
we can easily construct a solution similar to the one in Figure 10 for any
S0 > 1000. Verify that we do indeed get the solution shown in Figure 10 if
S0 = 1000.
But we see more than that from Figure 11. We see that if we choose account
B in the first period, we shall always do the same in the second period. There
is no way we can start out with S0 < 1022 and get S1 > 1538.
Formally, what we are doing is as follows. We use the vocabulary of
Section 2.2. Let the random vector for stage t be given by ξ˜t and let the
return and transition functions become rt (zt , xt , ξt ) and zt+1 = Gt (zt , xt , ξt ).
Given this, the procedure becomes: find f_0*(z_0) by recursively calculating

f_t*(z_t) = max_{x_t∈X_t(z_t)} E_{ξ̃_t} [r_t(z_t, x_t, ξ̃_t) ⊕ f*_{t+1}(z_{t+1})] for t = T, . . . , 0,

with

z_{t+1} = G_t(z_t, x_t, ξ_t),
f*_{T+1}(z_{T+1}) = Q(z_{T+1}),
where the functions satisfy the requirements of Proposition 2.2. In each stage
the problem must be solved for all possible values of the state zt . It is possible
to replace expectations (represented by E above) by other operators with
respect to ξ˜t , such as max or min. In such a case, of course, probability
distributions are uninteresting—only the support matters.
So far we have looked at two different methods for formulating and solving
multistage stochastic problems. The first, stochastic decision trees, requires a
tree that branches off for each possible decision xt and each possible realization
of ξ˜t . Therefore these must both have finitely many possible values. The state
zt is not part of the tree, and can therefore safely be continuous. A stochastic
decision tree easily grows out of hand.
The second approach was stochastic dynamic programming. Here we must
make a decision for each possible state zt in each stage t. Therefore, it is clearly
an advantage if there are finitely many possible states. However, the theory is
also developed for a continuous state space. Furthermore, a continuous set of
decisions xt is acceptable, and so is a continuous distribution of ξ˜t , provided
we are able to perform the expectation with respect to ξ˜t .
The method we shall look at in this section is different from those mentioned
above with respect to where the complications occur. We shall now operate
on an event tree (see Figure 12 for an example). This is a tree that branches
off for each possible value of the random variable ξ˜t in each stage t. Therefore,
compared with the stochastic decision tree approach, the new method has
similar requirements in terms of limitations on the number of possible values
of $\tilde\xi_t$. Both need finite discrete distributions. In terms of $x_t$, the decision tree must have finitely many values, whereas the new method actually prefers continuous variables. Neither of them has any special requirements on $z_t$.
The second approach we have discussed so far is stochastic dynamic programming. The new method we are about to outline is called scenario aggregation. We shall see that stochastic dynamic programming is more flexible than scenario aggregation in some respects. As before, let the dynamics of the system be given by the transition function
$$z_{t+1} = G_t(z_t, x_t, \xi_t),$$
with $z_0$ given. Let $\alpha$ be a discount factor, and let $S$ denote the set of all scenarios. What is often done in this case is
to solve for each s ∈ S the following problem
$$\begin{array}{ll}
\min & \sum_{t=0}^{T} \alpha^t r_t(z_t, x_t, \xi_t^s) + \alpha^{T+1} Q(z_{T+1})\\
\text{s.t.} & z_{t+1} = G_t(z_t, x_t, \xi_t^s) \ \text{for } t = 0, \ldots, T, \ \text{with } z_0 \ \text{given},\\
& A_t(z_t) \le x_t \le B_t(z_t) \ \text{for } t = 0, \ldots, T,
\end{array} \tag{6.1}$$
where $Q(z)$ represents the value of ending the problem in state $z$, yielding an optimal solution $x^s = (x_0^s, x_1^s, \ldots, x_T^s)$. Now what? We have a number of different solutions, one for each $s \in S$. Shall we take the average and calculate for each $t$
$$\bar{x}_t = \sum_{s\in S} p^s x_t^s,$$
where $p^s$ is the probability that we end up on scenario $s$? This is very often
done, either by explicit probabilities or by more subjective methods based on
“looking at the solutions”. However, several things can go wrong. First, if $\bar{x}$ is chosen as our policy, there might be cases (values of $s$) for which it is not even feasible. We should not like to suggest to our superiors a solution that might be infeasible (infeasible probably means “going broke”, “breaking down” or something like that). But even if feasibility is no problem, is using $\bar{x}$ a good idea?
In an attempt to answer this, let us again turn to event trees. In Figure 12 we
have T = 1. The top node represents “today”. Then one out of three things can
happen, or, in other words, we have a random variable with three outcomes.
The second row of nodes represents “tomorrow”, and after tomorrow a varying
[Figure 12: Example of an event tree for T = 1. The top node is labelled "Today", the second row of nodes "Tomorrow" (reached through the second random variable), and the bottom row "The future".]
number of things can happen, depending on what happens today. The bottom
row of nodes takes care of the rest of the time—the future.
This tree represents six scenarios, since the tree has six leaves. In the setting
of optimization that we have discussed, there will be two decisions to be made,
namely one “today” and one “tomorrow”. However, note that what we do
tomorrow will depend on what happens today, so there is not one decision for
tomorrow, but rather one for each of the three nodes in the second row. Hence
$\bar{x}_0$ works as a suggested first decision, but $\bar{x}_1$ is not very interesting. However, if we are in the leftmost node representing tomorrow, we can talk about an $\bar{x}_1$ for the two scenarios going through that node. We can therefore calculate, for each version of “tomorrow”, an average $\bar{x}_1$, where the expectation is conditional upon being on one of the scenarios that goes through the node.
Hence we see that the nodes in the event tree are decision points and the arcs
are realizations of random variables. From our scenario solutions xs we can
therefore calculate decisions for each node in the tree, and these will all make
sense, because they are all possible decisions, or what are called implementable
decisions.
For each time period $t$, let $\{s\}_t$ be the set of all scenarios having $\xi_0^s, \ldots, \xi_{t-1}^s$ in common with scenario $s$. In Figure 12, $\{s\}_0 = S$, whereas each $\{s\}_2$ contains only one scenario. There are three sets $\{s\}_1$. Let $p(\{s\}_t)$ be the sum of the probabilities of all scenarios in $\{s\}_t$. Hence, after solving (6.1) for all $s$, we calculate for all $\{s\}_t$
$$\bar{x}(\{s\}_t) = \frac{\sum_{s'\in\{s\}_t} p^{s'} x_t^{s'}}{p(\{s\}_t)}.$$
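As a small numerical illustration of this conditional averaging, the following Python sketch (with made-up probabilities, decisions and bundle structure) computes $\bar{x}(\{s\}_t)$ for each information set:

import numpy as np

# Probability-weighted averaging of scenario decisions over the bundles {s}_t.
# All data below are made-up numbers for illustration only.
p = np.array([0.2, 0.3, 0.1, 0.4])        # scenario probabilities p^s
x_t = np.array([1.0, 2.0, 3.0, 4.0])      # stage-t decisions x_t^s from the scenario problems
bundle = np.array([0, 0, 1, 1])           # scenarios 0,1 share history up to t; so do 2,3

for b in np.unique(bundle):
    mask = bundle == b
    xbar = np.sum(p[mask] * x_t[mask]) / np.sum(p[mask])   # the average over {s}_t
    print("bundle", b, "-> implementable decision", xbar)  # 1.6 and 3.8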
In full, the problem of finding implementable decisions can be written as
$$\min \sum_{s\in S} p(s)\left[\sum_{t=0}^{T} \alpha^t r_t(z_t^s, x_t^s, \xi_t^s) + \alpha^{T+1} Q(z_{T+1}^s)\right]$$
subject to
$$z_{t+1}^s = G_t(z_t^s, x_t^s, \xi_t^s) \ \text{for } t = 0, \ldots, T, \ \text{with } z_0^s = z_0 \ \text{given},$$
$$A_t(z_t^s) \le x_t^s \le B_t(z_t^s) \ \text{for } t = 0, \ldots, T,$$
$$x_t^s = \frac{\sum_{s'\in\{s\}_t} p^{s'} x_t^{s'}}{p(\{s\}_t)} \ \text{for } t = 0, \ldots, T \text{ and all } s. \tag{6.2}$$
Attaching multipliers $w_t^s$ to the implementability constraints, the objective of the corresponding Lagrangian becomes
$$\sum_{s\in S} p(s)\left\{\sum_{t=0}^{T} \alpha^t\left[r_t(z_t^s, x_t^s, \xi_t^s) + w_t^s\left(x_t^s - \frac{\sum_{s'\in\{s\}_t} p^{s'} x_t^{s'}}{p(\{s\}_t)}\right)\right] + \alpha^{T+1} Q(z_{T+1}^s)\right\} \tag{6.3}$$
with
$$\bar{x}(\{s\}_t) = \frac{\sum_{s'\in\{s\}_t} p^{s'} x_t^{s'}}{p(\{s\}_t)}.$$
But since, for a fixed $w$, the terms $w_t^s\,\bar{x}(\{s\}_t)$ are fixed, we can as well drop them. If we then add an augmented Lagrangian term, we are left with
$$\sum_{s\in S} p(s)\left\{\sum_{t=0}^{T} \alpha^t\left[r_t(z_t^s, x_t^s, \xi_t^s) + w_t^s x_t^s + \tfrac12\rho\,[x_t^s - \bar{x}(\{s\}_t)]^2\right] + \alpha^{T+1} Q(z_{T+1}^s)\right\}.$$
Our problem is now totally separable in the scenarios. That is what we need to define the scenario aggregation method. See the algorithms in Figures 13 and 14 for details. The scenario subproblem (Figure 13) is as follows:

procedure scenario(s, x̄, x^s, z^s);
begin
  Solve the problem
    min { Σ_{t=0}^{T} α^t [ r_t(z_t, x_t, ξ_t^s) + w_t^s x_t + ½ρ(x_t − x̄_t)² ] + α^{T+1} Q(z_{T+1}) }
  subject to the constraints of (6.1), yielding x^s and z^s;
end;

A few comments are in place. First, to find an initial x̄({s}_t), we can solve (6.1) using expected values for all random variables. Finding the correct value of ρ, and knowing how to update it, is very hard.
procedure scen-agg;
begin
  for all s and t do w_t^s := 0;
  Find an initial x̄({s}_t);
  Initiate ρ > 0;
  repeat
    for all s ∈ S do scenario(s, x̄({s}_t), x^s, z^s);
    for all {s}_t do
      x̄({s}_t) := Σ_{s'∈{s}_t} p^{s'} x_t^{s'} / p({s}_t);
    Update ρ if needed;
    for all s and t do
      w_t^s := w_t^s + ρ[x_t^s − x̄({s}_t)];
  until result good enough;
end;
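To see the whole loop at work, here is a compact Python sketch of procedure scen-agg for a toy problem with a single first-stage decision. The quadratic scenario cost, the targets, the probabilities and ρ are all assumptions chosen so that the scenario subproblem of Figure 13 can be solved in closed form; this is a sketch of the mechanics, not of any particular application.

import numpy as np

# Scenario aggregation on a toy problem: each scenario s would like x = targets[s];
# the implementability constraint forces a single decision x-bar.
targets = np.array([1.0, 2.0, 4.0])   # scenario-dependent ideal decisions (made up)
p = np.array([0.5, 0.3, 0.2])         # scenario probabilities
rho = 1.0
w = np.zeros(3)                       # multipliers w^s

xbar = p @ targets                    # initial x-bar from the expected value problem
for it in range(100):
    # scenario subproblem: min_x 0.5*(x - t)^2 + w*x + 0.5*rho*(x - xbar)^2,
    # solved in closed form by setting the derivative to zero
    x = (targets - w + rho * xbar) / (1.0 + rho)
    xbar = p @ x                      # conditional average -> implementable decision
    w = w + rho * (x - xbar)          # multiplier update, as in procedure scen-agg
print("implementable decision:", xbar)   # converges to p @ targets = 1.9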
is given by
$$s z_t\Bigl(1 - \frac{z_t}{K}\Bigr),$$
where $s$ is a growth ratio and $K$ is the carrying capacity of the environment.
Note that if zt = K there is no net change in the stock size. Also note that
if zt > K, then there is a negative net effect, decreasing the size of the stock,
and if zt < K, then there is a positive net effect. Hence zt = K is a stable
situation (as zt = 0 is), and the fish stock will, according to the model, stabilize
at z = K if no fishing takes place.
If fish are caught, the catch has to be subtracted from the existing stock,
giving us the following transition function:
$$z_{t+1} = z_t - x_t z_t + s z_t\Bigl(1 - \frac{z_t}{K}\Bigr).$$
This transition function is clearly nonlinear, with both a $z_t x_t$ term and a $z_t^2$ term. If the goal is to catch as much as possible, we might choose to maximize
$$\sum_{t=0}^{\infty} \alpha^t x_t z_t.$$
Assume that, from period $T+1$ onwards, we fix the catch rate at
$$x_t = \xi\Bigl(1 - \frac{z_t}{K}\Bigr).$$
But since this leaves $z_t = z_{T+1}$ for all $t \ge T+1$, and therefore all $x_t$ for $t \ge T+1$ equal, we can let
$$Q(z_{T+1}) = \sum_{t=T+1}^{\infty} \alpha^{t-T-1} x_t z_t = \frac{\xi z_{T+1}(1 - z_{T+1}/K)}{1-\alpha}.$$
With these assumptions on the horizon, the existence of $Q(z_{T+1})$ and a finite discretization of the random variables, we arrive at the following optimization problem (the objective function amounts to the expected catch, discounted over the horizon of the problem; of course, it is easy to bring this into monetary terms):
$$\max \sum_{s\in S} p(s)\left[\sum_{t=0}^{T} \alpha^t z_t^s x_t^s + \alpha^{T+1} Q(z_{T+1}^s)\right]$$
$$\text{s.t. } z_{t+1}^s = z_t^s\left[1 - x_t^s + \xi_t^s\Bigl(1 - \frac{z_t^s}{K}\Bigr)\right], \ \text{with } z_0^s = z_0 \ \text{given},$$
$$0 \le x_t^s \le 1,$$
$$x_t^s = \frac{\sum_{s'\in\{s\}_t} p^{s'} x_t^{s'}}{p(\{s\}_t)} \ \text{for all } t \text{ and } s.$$
First, the deterministic model lacks all elements of dynamics (it has several time periods, but all decisions are
made here and now). Therefore decisions that have elements of options in
them will never be of any use. In a deterministic world there is never a need
to do something just in case.
Secondly, replacing random variables by their means will in itself have an
effect, as we shall discuss in much more detail in the next chapter.
Therefore, even if these two models come out with about the same optimal
objective value, one does not really know much about whether or not it is
wise to work with a stochastic model. These models are simply too different
to say much in most situations.
From this short discussion, you may have observed that there are really two major issues when solving a model: one is the optimal objective value, the other the optimal solution. Which of these is more important depends on the situation. Sometimes one's major concern is whether one should do something at all; in other cases the question is not whether to do something, but what to do. As we continue, we shall be careful to distinguish these cases.
Example 2.4 Assume that we have a container that can take up to 10 units,
and that we have two possible items that can be put into the container. The
items are called A and B, and some of their properties are given in Table 2.
Table 2 Properties of the two items A and B.
The goal is to fill the container with as valuable items as possible. However,
the size of an item is uncertain. For simplicity, we assume that each item can
have two different sizes, as given in Table 2. All sizes occur with the same
probability of 0.5. As is always the case with a stochastic model, we must
decide on how the stages are defined. We shall assume that we must pick an
item before we learn its size, and that once it is picked, it must be put into
the container. If the container becomes overfull, we obtain a penalty of 2 per
unit in excess of 10. We have the choice of picking only one item, and they
can be picked in any order.
A stochastic decision tree for the problem is given in Figure 15, where we have already folded back and crossed out nonoptimal decisions. We see that the expected value is 7.5. That is obtained by first picking item A and then, if item A turns out to be small, also picking item B. If item A turns out to be large, we choose not to pick item B.
If we assume that the event tree (or the stochastic part of the stochastic
decision tree) is a fair description of the randomness of a model, the following
simple approach gives a reasonable measure of how good the deterministic
model really is. Start in the root of the event tree, and solve the deterministic
model. (Probably this means replacing random variables by their means.
However, this approach can be used for any competing deterministic model.)
Take that part of the deterministic solution that corresponds to the first stage
of the stochastic model, and let it represent an implementable solution in the
root of the event tree. Then go to each node at level two of the event tree and
repeat the process. Taking into consideration what has happened in stage 1
(which is different for each node), solve the deterministic model from stage
2 onwards, and use that part of the solution that corresponds to stage 2 as
an implementable solution. Continue until you have reached the leaves of the
event tree.
This is a fair comparison, since even people who prefer deterministic models will re-solve them as time passes and new information becomes available.
Table 3 The four possible wait-and-see solutions for the container problem in
Example 2.4.
With each case in Table 3 equally probable, the expected value of the wait-and-see solution is 8, which is 0.5 more than what we found in Figure 15. Hence the EVPI equals 0.5: the value of knowing the true sizes of the items before making decisions is 0.5. This is therefore also the maximal price one would pay for this knowledge.
What if we were offered the chance to pay for knowing the size of A or B before making our first pick? In other words, does it help to know the size of, for example, item B before choosing what to do? This is illustrated in Figure 16. We see that the EVPI for knowing the size of item B is 0.5, which is the same as that for knowing both A and B. The calculation for item A is left as an exercise.
Figure 16 Stochastic decision tree for the container problem when we know
the size of B before making decisions.
Example 2.5 Let us conclude this section with another similar example. You are to throw a die twice, and you win 1 if you can guess the total number of eyes from these two throws. The optimal guess is 7 (if you did not know that already, check it out!), which gives you a winning probability of 1/6. So the expected win is also 1/6.
Now, you are offered the chance to pay for knowing the result of the first throw. How much would you pay (or alternatively, what is the EVPI for the first throw)? A close examination shows that knowing the result of the first throw does not help at all. Even if you knew it, guessing a total of 7 would still be optimal (though no longer uniquely so), and the probability of winning would still be 1/6. Hence the EVPI for the first stage is zero.
Alternatively, you are offered the chance to pay for learning the values of both throws before “guessing”. In that case you will of course make a correct guess, and be certain of winning 1. The expected gain has then increased from 1/6 to 1, so the EVPI for knowing the values of both random variables is 5/6. □
As you see, EVPI is not one number for a stochastic program, but can
be calculated for any combination of random variables. If only one number is
given, it usually means the value of learning everything, in contrast to knowing
nothing.
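The EVPI numbers of Example 2.5 are easy to verify by enumeration; the following Python sketch computes the winning probabilities with no information, with the first throw known, and with both throws known.

from fractions import Fraction

p6 = Fraction(1, 6)

# Probability of winning with the best single guess (no information).
win_none = max(sum(p6 * p6 for a in range(1, 7) for b in range(1, 7) if a + b == g)
               for g in range(2, 13))

# Knowing the first throw a: choose the best conditional guess for each a.
win_first = sum(p6 * max(sum(p6 for b in range(1, 7) if a + b == g)
                         for g in range(2, 13))
                for a in range(1, 7))

win_both = Fraction(1)   # knowing both throws, we always guess correctly

print(win_none, win_first, win_both)                  # 1/6 1/6 1
print("EVPI(first throw) =", win_first - win_none)    # 0
print("EVPI(both throws) =", win_both - win_none)     # 5/6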
Exercises
2. Look back at Example 2.3. Assume that T = 1. Use this fisheries example
to write down all necessary functions needed in the scenario aggregation
method, detailed in Figure 14.
3. Look back at Example 2.3. Assume that we change the model slightly
to take into account that there are young and adult fish, and that the
characteristics of catch and recruitment depend on the age composition of
the stock. We now need a two-dimensional state space:
• z_t^1: the number of young fish in period t;
• z_t^2: the number of adult fish in period t.
(a) Give a verbal interpretation of the two transition functions given above,
including the decision variable xt .
(b) What is now a natural objective function, in your view?
(c) Assume s is random, and formalize the use of scenario aggregation for
solving the new version of the Schaefer model.
3 Recourse Problems
$$\min\ c^T x + Q(x) \quad \text{s.t. } Ax = b,\ x \ge 0,$$
where
$$Q(x) = \sum_j p_j\,Q(x, \xi^j)$$
and
$$Q(x, \xi) = \min\{q(\xi)^T y \mid W(\xi)y = h(\xi) - T(\xi)x,\ y \ge 0\},$$
where $p_j$ is the probability that $\tilde\xi = \xi^j$, the $j$th realization of $\tilde\xi$, and $h(\xi) = h_0 + H\xi = h_0 + \sum_i h_i\xi_i$, $T(\xi) = T_0 + \sum_i T_i\xi_i$ and $q(\xi) = q_0 + \sum_i q_i\xi_i$.
Figure 1 A map showing potential plant sites and actual fishing grounds for Southern Norway and the North Sea.
The function Q(x, ξ) is called the recourse function, and Q(x) therefore the
expected recourse function.
In this chapter we shall look at only the case with fixed recourse, i.e. the
case where W (ξ) ≡ W . Let us repeat a few terms from Section 1.3, in order
to prepare for the next section. The cone pos W , mentioned in (3.17) of
Chapter 1, is defined by
$$\text{pos } W = \{t \mid t = Wy,\ y \ge 0\}.$$
In other words,
$$Wy = h,\ y \ge 0 \ \text{is feasible} \iff h \in \text{pos } W.$$
We say that the problem has complete recourse if
$$\text{pos } W = \mathbb{R}^m.$$
But that is definitely more than we need in most cases. Usually, it is more than enough to know that $h(\xi) - T(\xi)x \in \text{pos } W$ for all possible $\xi$ and all feasible $x$; this property is called relatively complete recourse.
Figure 2 The cone pos W, spanned by the columns W_1, W_2, W_3, W_4, for a case where W has three rows and four columns.
3.2 The L-shaped Decomposition Method
This section contains a much more detailed version of the material found
in Section 1.6.4. In addition to adding more details, we have now added
randomness more explicitly, and have also chosen to view some of the aspects
from a different perspective. It is our hope that a new perspective will increase
the understanding.
3.2.1 Feasibility
The material treated here coincides with step 2(a) in the dual decomposition
method of Section 1.6.4. Let the second-stage problem be given by
$$Q(x, \xi) = \min\{q(\xi)^T y \mid Wy = h(\xi) - T(\xi)x,\ y \ge 0\},$$
Figure 3 Illustration showing that if infeasibility is to occur for a fixed x̂, it must occur for an extreme point of the support of Hξ̃, and hence of ξ̃. In this example T(ξ) is assumed to equal T_0.
where $W$ is fixed. Assume we are given an $\hat{x}$ and should like to know whether that $\hat{x}$ yields a feasible second-stage problem for all possible values of $\tilde\xi$. We assume that $\tilde\xi$ has a rectangular and bounded support. Consider Figure 3. We have there drawn pos W plus a parallelogram that represents all possible values of $h_0 + H\tilde\xi - T_0\hat{x}$. We have assumed that $T(\xi) \equiv T_0$, only to make the illustration simpler.
Figure 3 should be interpreted as representing a case where $H$ is a $2\times 2$ matrix, so that the extreme points of the parallelogram correspond to the extreme points of the support $\Xi$ of $\tilde\xi$. This is a known result from linear algebra: if one polyhedron is a linear transformation of another, then every extreme point of the image is the image of an extreme point of the original polyhedron.
What is important to note from Figure 3 is that if the second-stage problem
is to be infeasible for some realizations of ξ˜ then at least one of these
realizations will correspond to an extreme point of the support. The figure
shows such a case. And conversely, if all extreme points of the support produce
feasible problems, all other possible realizations of ξ˜ will also produce feasible
problems. Therefore, to check feasibility, we shall in the worst case have to
check all extreme points of the support. With $k$ random variables, and $\Xi$ a $k$-dimensional rectangle, we get $2^k$ points. Let us define $\mathcal{A}$ to be a set containing
these points. In Chapter 5 we shall discuss how we can often reduce the number
By Farkas' lemma,
$$\{y \mid Wy = h,\ y \ge 0\} \ne \emptyset$$
if and only if
$$W^T u \ge 0 \ \text{implies that } h^T u \ge 0.$$
The first of these equivalent statements is just an alternative way of saying that $h \in \text{pos } W$, which we now know means that $h$ represents a feasible problem.
By changing the sign of $u$, the second of the equivalent statements can be rewritten as
$$W^T u \le 0 \ \text{implies that } h^T u \le 0,$$
or equivalently
$$h^T t \le 0 \ \text{whenever } t \in \{u \mid W^T u \le 0\}.$$
The set appearing here is the polar cone of pos W:
$$\{u \mid W^T u \le 0\} = \{u \mid u^T Wy \le 0 \ \text{for all } y \ge 0\} = \{u \mid u^T h \le 0 \ \text{for all } h \in \text{pos } W\}.$$
Using Figure 4, we can now restate Farkas' lemma in the following way: the system $Wy = h,\ y \ge 0$, is feasible if and only if the right-hand side $h$ has a non-positive inner product with all vectors in the cone pol pos W, in particular with its generators. Generators were discussed in Chapter 1 (see e.g. Remark 1.6, page 63). The matrix $W^*$, containing as columns all generators of pol pos W, is called the polar matrix of $W$.
We shall see in Chapter 5 how this understanding can be used to generate
relatively complete recourse in a problem that does not possess that property.
For now, it is enough to understand that if we knew all the generators of pol pos W, i.e. the polar matrix $W^*$, then we could check the feasibility of a second-stage problem by performing a number of inner products (one for each generator); if at least one of them gave a positive value, we could conclude that the problem was indeed infeasible.
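As a tiny illustration of this test, the sketch below checks $h \in \text{pos } W$ by computing the inner products of $h$ with the generators of pol pos W. The matrix of generators is a made-up example (pos W equal to the positive quadrant), not one derived from a particular $W$.

import numpy as np

# Columns of W_star are generators of pol pos W; here pos W is the positive
# quadrant of R^2, so pol pos W is generated by (-1,0) and (0,-1) (assumed example).
W_star = np.array([[-1.0,  0.0],
                   [ 0.0, -1.0]])

def represents_feasible_problem(h, W_star, tol=1e-9):
    # W y = h, y >= 0 is feasible iff h has a non-positive inner product
    # with every generator of pol pos W
    return bool(np.all(W_star.T @ h <= tol))

print(represents_feasible_problem(np.array([2.0, 1.0]), W_star))   # True
print(represents_feasible_problem(np.array([2.0, -1.0]), W_star))  # False: infeasible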
[Figure 4: The cones pos W and pol pos W.]
If we do not know all the generators of pol pos W, and the problem is not known to have relatively complete recourse, then for a given $\hat{x}$ we must check feasibility for all $\xi \in \mathcal{A}$. We should like to check for feasibility in such a way that if the given problem is not feasible, we automatically come up with a generator of pol pos W. For the discussion, we shall use Figure 5.
Figure 5 Generation of feasibility cuts.
We should like to find a $\sigma$ such that $\sigma^T W \le 0$ while $\sigma^T[h(\xi) - T(\xi)\hat{x}] > 0$, since such a $\sigma$ shows that $h(\xi) - T(\xi)\hat{x} \notin \text{pos } W$ and defines a cut. This can be done by solving
$$\max_\sigma\left\{\sigma^T[h(\xi) - T(\xi)\hat{x}] \,\middle|\, \sigma^T W \le 0,\ \|\sigma\| \le 1\right\},$$
where the last constraint has been added to bound $\sigma$. We can do that because otherwise the maximal value would be $+\infty$, which does not interest us since we are looking for the direction defined by $\sigma$. If we had chosen the $\ell_2$ norm, the maximization would have made sure that $\sigma$ came as close to $h(\xi) - T(\xi)\hat{x}$ as possible (see Figure 5). Computationally, however, we should not like to work with quadratic constraints. Let us therefore see what happens if we choose the $\ell_1$ norm. Let us write our problem differently to see the details better. To do that, we need to replace the unconstrained $\sigma$ by $\sigma^1 - \sigma^2$, where $\sigma^1, \sigma^2 \ge 0$. We then get the following:
Since $\sigma^T T_0$ is a vector and the right-hand side is a scalar, this can conveniently be written as $-\gamma^T x \ge \delta$. The $\hat{x}$ we started out with will not satisfy this constraint.
Example 3.1 We present this little example to indicate why the $\ell_1$ and $\ell_2$ norms give different results when we generate feasibility cuts. The important point is how the two norms limit the possible $\sigma$ values. The $\ell_1$ norm is shown in the left part of Figure 6, the $\ell_2$ norm in the right part. For simplicity, we have assumed that pol pos W equals the positive quadrant, so that the constraints $\sigma^T W \le 0$ reduce to $\sigma \ge 0$. Since at the same time $\|\sigma\| \le 1$, we get that $\sigma$ must lie within the shaded part of the two figures.
For convenience, let us denote the right-hand side by $h$, and let $\sigma = (\sigma_x, \sigma_y)^T$, to reflect the $x$ and $y$ parts of the vector. In this example $h = (4, 2)^T$. For the $\ell_1$ norm the problem now becomes
$$\max_\sigma\{4\sigma_x + 2\sigma_y \mid \sigma_x + \sigma_y \le 1,\ \sigma \ge 0\}.$$
The optimal solution here is $\sigma = (1, 0)^T$. Graphically, this can be seen from the fact that an inner product equals the length of one vector multiplied by the length of the projection of the second vector onto the first. If we take the $h$ vector as the fixed first vector, the feasible $\sigma$ vector with the largest projection on $h$ is $\sigma = (1, 0)^T$.
For the $\ell_2$ norm the problem becomes
$$\max_\sigma\{4\sigma_x + 2\sigma_y \mid \sigma_x^2 + \sigma_y^2 \le 1,\ \sigma \ge 0\},$$
with optimal solution $\sigma = h/\|h\| = (4, 2)^T/\sqrt{20}$, the normalized $h$ itself.
3.2.2 Optimality
The material discussed here concerns step 1(b) of the dual decomposition
method in Section 1.6.4. Let us first note that if we have relatively complete
recourse, or if we have checked that h(ξ) − T (ξ)x ∈ pos W for all ξ ∈ A, then
the second-stage problem
Figure 7 LP solver.
As long as q(ξ) ≡ q0 , the dual is either feasible or infeasible for all x and ξ, since
x and ξ do not enter the constraints. We see that this is more complicated
if q is also affected by randomness. But even when ξ enters the objective
function, we can at least say that if the dual is feasible for one x and a
given ξ then it is feasible for all x for that value of ξ, since x enters only
the objective function. Therefore, from standard linear programming duality,
since the primal is feasible, the primal must be unbounded if and only if the
dual is infeasible, and that would happen for all x for a given ξ, if randomness
affects the objective function. If q(ξ) ≡ q0 then it would happen for all x
and ξ. Therefore we can check in advance for unboundedness, and this is
particularly easy if randomness does not affect the objective function. Note
that this discussion relates to Proposition 1.18. Assume we know that our
problem is bounded.
Now consider
$$Q(x) = \sum_j p_j\,Q(x, \xi^j), \quad\text{with}\quad Q(x, \xi) = \min\{q(\xi)^T y \mid Wy = h(\xi) - T(\xi)x,\ y \ge 0\},$$
and the master problem
$$\min\{c^T x + \theta \mid Ax = b,\ -\gamma_k^T x \ge \delta_k \ \text{for } k = 1, \ldots, K,\ \theta \ge Q(x),\ x \ge 0\}.$$
Of course, computationally we cannot use θ ≥ Q(x) as a constraint since
Q(x) is only defined implicitly by a large number of optimization problems.
Instead, let us for the moment drop it, and solve the above problem without
it, simply hoping it will be satisfied (assuming so far that all feasibility cuts
−γkT x ≥ δk are there, or that we have relatively complete recourse). We then
get some x̂ and θ̂ (the first time θ̂ = −∞). Now we calculate Q(x̂), and then
check if θ̂ ≥ Q(x̂). If it is, we are done. If not, our x̂ is not optimal—dropping
θ ≥ Q(x) was not acceptable.
Now
$$Q(\hat{x}) = \sum_j p_j Q(\hat{x}, \xi^j) = \sum_j p_j q(\xi^j)^T y^j = \sum_j p_j (\hat{\pi}^j)^T [h(\xi^j) - T(\xi^j)\hat{x}],$$
where $\hat{\pi}^j$ is the optimal dual solution yielding $Q(\hat{x}, \xi^j)$. The constraints in the dual of the second-stage problem do not contain $x$, so for an arbitrary $x$ we have
$$Q(x, \xi^j) \ge (\hat{\pi}^j)^T [h(\xi^j) - T(\xi^j)x],$$
since $\hat{\pi}^j$ is feasible but not necessarily optimal, and the dual problem is a maximization problem. Since what we dropped from the constraint set was $\theta \ge Q(x)$, we now add in its place
procedure L-shaped;
begin
K := 0; L := 0;
θ̂ := −∞;
LP(A, b, c, x̂, feasible);
stop := not (feasible);
while not (stop) do begin
feascut(A, x̂,newcut);
if not (newcut) then begin
Find Q(x̂);
stop := (θ̂ ≥ Q(x̂));
if not (stop) then begin
(* Create an optimality cut—see page 155. *)
L := L + 1;
Construct the cut −β_L^T x + θ ≥ α_L;
end;
end;
if not (stop) then begin
master(K, L, x̂, θ̂,feasible);
stop := not (feasible);
end;
end;
end;
$$\theta \ge \sum_j p_j (\hat{\pi}^j)^T [h(\xi^j) - T(\xi^j)x] = \alpha + \beta^T x,$$
or
$$-\beta^T x + \theta \ge \alpha.$$
Since there are finitely many feasible bases coming from the matrix W , there
are only finitely many such cuts.
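The assembly of such a cut from the dual solutions is mechanical; the Python sketch below shows it for two realizations, with made-up values for the probabilities, h(ξ^j), T(ξ^j) and the optimal duals π̂^j.

import numpy as np

# Build the optimality cut -beta^T x + theta >= alpha from
# theta >= sum_j p_j (pi_j)^T [h_j - T_j x] = alpha + beta^T x.
p  = [0.5, 0.5]                                        # probabilities (made up)
h  = [np.array([1.0, 2.0]), np.array([2.0, 1.0])]      # h(xi^j)
T  = [np.eye(2), np.eye(2)]                            # T(xi^j)
pi = [np.array([3.0, 1.0]), np.array([1.0, 2.0])]      # optimal duals at x-hat

alpha = sum(pj * (pij @ hj) for pj, pij, hj in zip(p, pi, h))
beta  = -sum(pj * (Tj.T @ pij) for pj, pij, Tj in zip(p, pi, T))
print("cut: -beta^T x + theta >= alpha, with beta =", beta, "and alpha =", alpha)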
We are now ready to present the basic setting of the L-shaped decomposition
algorithm. It is shown in Figure 10. To use it, we shall need a procedure
that solves LPs. It can be found in Figure 7. Also, to avoid too complicated
expressions, we shall define a special procedure for solving the master problem;
see Figure 8. Furthermore, we refer to procedure pickξ(A, ξ), which simply
[Figure 11: illustration of the L-shaped decomposition algorithm on an example, showing cx + θ, the function Q(x), the iterates (x̂_k, θ̂_k) for k = 1, …, 5, feasibility cuts 1 and 2 and optimality cuts 3 to 5.]
picks an element ξ from the set A, and, finally, we use procedure feascut
which is given in Figure 9. The set A was defined on page 150.
In the algorithms to follow, let $-\Gamma x \ge \Delta$ represent the $K$ feasibility cuts $-\gamma_k^T x \ge \delta_k$, and let $-\beta x + I\theta \ge \alpha$ represent the $L$ optimality cuts $-\beta_l^T x + \theta \ge \alpha_l$. Furthermore, let $e$ be a column of 1s of appropriate size.
The example in Figure 11 can be useful in understanding the L-shaped
decomposition algorithm. The first five solutions and cuts are shown. The initial $\hat{x}_1$ was chosen arbitrarily. Cuts 1 and 2 are feasibility cuts; the rest are optimality cuts. $\hat\theta_1 = \hat\theta_2 = \hat\theta_3 = -\infty$. To see if you understand this, try to find $(\hat{x}_6, \hat\theta_6)$, cut 6, and then the final optimal solution.
As mentioned at the end of Section 1.6.4, the recourse problem (for a discrete distribution) looks like
$$\begin{array}{rll}
\min & c^T x + \sum_{i=1}^{K} p_i (q^i)^T y^i &\\
\text{s.t.} & Ax = b, &\\
& T^i x + W y^i = h^i, & i = 1, \ldots, K,\\
& x \ge 0, &\\
& y^i \ge 0, & i = 1, \ldots, K.
\end{array} \tag{3.1}$$
yielding $(\hat{x}, \hat\theta_1, \ldots, \hat\theta_K)$ as a solution. With this solution, try to construct further cuts for the blocks.
• If there are no further cuts to generate, then stop (optimal solution);
• otherwise repeat the cycle.
The advantage of a method like this lies in the fact that we obviously make use of the particular structure of problem (3.1): we have to deal in the master program only with $n + K$ variables instead of $n + \sum_i n_i$, if $y^i \in \mathbb{R}^{n_i}$.
The drawback is easy to see as well: we may have to add very many cuts,
and so far we have no reliable criterion to drop cuts that are obsolete for
further iterations. Moreover, initial iterations are often inefficient. This is not
surprising, since in the master (3.4) we deal only with
$$\theta_i \ge \max_{j\in J_i}[(\gamma^{ij})^T x + \delta_{ij}],$$
with $J_i$ denoting the set of optimality cuts generated so far for block $i$ (with the related dual basic solutions $\hat\pi^{ij}$ according to (3.3)), and not, as we intend to, with
$$\theta_i \ge f_i(x) = \max_{j\in \hat{J}_i}[(\gamma^{ij})^T x + \delta_{ij}],$$
where $\hat{J}_i$ enumerates all dual feasible basic solutions for block $i$. Hence
we are working in the beginning with a piecewise linear convex function $\max_{j\in J_i}[(\gamma^{ij})^T x + \delta_{ij}]$ supporting $f_i(x)$ that does not sufficiently reflect the shape of $f_i$ (see e.g. Figure 25 of Chapter 1, page 73). The effect may be, and often is, that even if we start a cycle with an (almost) optimal first-stage solution $x^*$ of (3.1), the first-stage solution $\hat{x}$ of the master (3.4) may be far away from $x^*$, and it may take many further cycles to come back towards $x^*$. The reason for this is now obvious: if the set of available optimality cuts, $J_i$, is a small subset of the collection $\hat{J}_i$, then the piecewise linear approximation of $f_i(x)$ may be inadequate near $x^*$. Therefore it seems desirable to modify the master program in such a way that, when starting with some overall feasible first-stage iterate $z^k$, its solution $x^k$ does not move too far away from $z^k$. Thereby we can expect to improve the approximation of $f_i(x)$ by an optimality cut for block $i$ at $x^k$. This can be achieved by introducing into the objective of
the master the term $\|x - z^k\|^2$, yielding a so-called regularized master program
$$\min\left\{\frac{1}{2\rho}\|x - z^k\|^2 + c^T x + \sum_{i=1}^{K} p_i \theta_i \,\middle|\, (x, \theta_1, \ldots, \theta_K) \in B_0 \cap \Bigl(\bigcap_{i=1}^{K} B_1^i\Bigr)\right\}, \tag{3.5}$$
with a control parameter $\rho > 0$. To avoid too many constraints in (3.5), let us start with some $z^0 \in B_0$ such that $f_i(z^0) < \infty$ for all $i$, and with $G_0$ the feasible set defined by the first-stage equations $Ax = b$ and all optimality cuts at $z^0$. Hence we start (for $k = 0$) with the reduced regularized master program
$$\min\left\{\frac{1}{2\rho}\|x - z^k\|^2 + c^T x + \sum_{i=1}^{K} p_i \theta_i \,\middle|\, (x, \theta_1, \ldots, \theta_K) \in G_k\right\}. \tag{3.6}$$
² Recall that in $\mathbb{R}^{n+K}$ never more than $n + K$ independent hyperplanes intersect at one point.
$$F(x) := c^T x + \sum_{i=1}^{K} p_i f_i(x),$$
go to step 2.
Step 2 Delete from (3.6) some constraints that are inactive at (xk , θk ) such
that no more than n + K constraints remain.
Step 3 If xk satisfies the first-stage constraints (i.e. xk ≥ 0) then go to
step 4; otherwise add to (3.6) no more than K violated (first-stage)
3.4 Bounds
Section 3.2 was devoted to the L-shaped decomposition method. We note that such deterministic methods very quickly run into dimensionality problems with respect to the number of random variables: with much more than 10 random variables, we are in trouble.
This section discusses bounds on stochastic problems. These bounds can be
useful and interesting in their own right, or they can be used as subproblems
in larger settings. An example of where we might need to bound a problem,
and where this problem is not a subproblem, is the following. Assume that a
company is facing a decision problem. The decision itself will be made next
year, and at that time all parameters describing the problem will be known.
However, today a large number of relevant parameters are unknown, so it
is difficult to predict how profitable the operation described by the decision
problem will actually be. It is desired to know the expected profitability of the
operation. The reason is that, for planning purposes, the firm needs to know
the expected activities and profits for the next year. Given the large number of
uncertain parameters, it is not possible to calculate the exact expected value.
However, using bounding techniques it may be possible to identify an interval
that contains the expected value. Technically speaking, one needs to find the
expected value of the “wait-and-see” solution discussed in Chapter 1, and also
in Example 2.4. Another example, which we shall see later in Section 6.6, is
that of calculating the expected project duration time in a project consisting
of activities with random durations.
Bounding methods are also useful if we wish to use deterministic decomposition methods (such as the L-shaped decomposition method or scenario aggregation) on problems with a large number of random variables. That will be discussed later in Section 3.5.2. One alternative to bounding involves the development of approximations using stochastic methods. We shall outline two of them later: stochastic decomposition (Section 3.8) and stochastic quasi-gradient methods (Section 3.9).
As discussed above, bounds can be used either to approximate the expected
value of some linear program or to bound the second-stage problem in a
two-stage problem. These two settings are principally the same, and we
shall therefore consider the problem of finding the expected value of a linear
program. We shall discuss this in terms of a function φ(ξ), which in the two-
stage case represents Q(x̂, ξ) for a fixed x̂. To illustrate, we shall look at the
refinery example of Section 1.2. The problem is repeated here for convenience:
$$\begin{array}{rl}
\phi(\xi) = \text{“min”} & 2x_{\mathrm{raw1}} + 3x_{\mathrm{raw2}}\\
\text{s.t.} & x_{\mathrm{raw1}} + x_{\mathrm{raw2}} \le 100,\\
& 2x_{\mathrm{raw1}} + 6x_{\mathrm{raw2}} \ge 180 + \xi_1,\\
& 3x_{\mathrm{raw1}} + 3x_{\mathrm{raw2}} \ge 162 + \xi_2,\\
& x_{\mathrm{raw1}} \ge 0, \quad x_{\mathrm{raw2}} \ge 0.
\end{array} \tag{4.1}$$
[Figure: the function φ(ξ) together with linear functions supporting it from below.]
A linear function $L(\xi) = \phi'(\hat\xi)\xi + b$ supporting $\phi$ at the point $\hat\xi$ satisfies
$$\phi(\hat\xi) = \phi'(\hat\xi)\hat\xi + b,$$
since $\phi(\hat\xi) = L(\hat\xi)$. Hence, in total, the lower-bounding function is given by
$$L(\xi) = \phi(\hat\xi) + \phi'(\hat\xi)(\xi - \hat\xi).$$
Since this is a linear function, we easily calculate the expected value of the lower-bounding function:
$$EL(\tilde\xi) = \phi(\hat\xi) + \phi'(\hat\xi)(E\tilde\xi - \hat\xi) = L(E\tilde\xi).$$
In other words, we find the expected lower bound by evaluating the lower-bounding function at $E\tilde\xi$. From this, it is easy to see that we obtain the best (largest) lower bound by letting $\hat\xi = E\tilde\xi$. This can be seen not only from the
fact that no linear function that supports $\phi(\xi)$ can have a value larger than $\phi(E\tilde\xi)$ at $E\tilde\xi$, but also from the following simple differentiation:
$$\frac{d}{d\hat\xi}\,L(E\tilde\xi) = \phi'(\hat\xi) - \phi'(\hat\xi) + \phi''(\hat\xi)(E\tilde\xi - \hat\xi) = \phi''(\hat\xi)(E\tilde\xi - \hat\xi).$$
If we set this equal to zero, we find that $\hat\xi = E\tilde\xi$. What we have developed is the so-called Jensen lower bound, or the Jensen inequality.
Proposition 3.1 If $\phi(\xi)$ is convex over the support of $\tilde\xi$, then
$$E\phi(\tilde\xi) \ge \phi(E\tilde\xi).$$
This best lower bound is illustrated in Figure 14. We can see that the Jensen lower bound can be viewed in two different ways. First, it can be seen as a bound where a distribution is replaced by its mean and the problem itself is unchanged. This is when we calculate $\phi(E\tilde\xi)$. Secondly, it can be viewed as a bound where the distribution is left unchanged and the function is replaced by a linear affine function, represented by a straight line. This is when we integrate $L(\xi)$ over the support of $\tilde\xi$. Depending on the given situation, both these views can be useful.
There is even a third interpretation, which we shall see used later in the stochastic decomposition method. Assume we first solve the dual of $\phi(E\tilde\xi)$ to obtain an optimal basis $B$. Since $\xi$ does not enter the constraints of the dual of $\phi$, this basis is dual feasible for all possible values of $\xi$. Assume now that we solve the dual version of $\phi(\xi)$ for all $\xi$, but constrain the optimization so that we are allowed to use only the given basis $B$. In such a setting, we might claim that we use the correct function and the correct distribution, but optimize only in an approximate way. (In stochastic decomposition we use not one, but a finite number of bases.) The Jensen lower bound can in this setting be interpreted as representing approximate optimization using the correct problem and correct distribution, but only one dual feasible basis.
It is worth pointing out that these interpretations of the Jensen lower bound
are put forward to help you see how a bound can be interpreted in different
ways, and that these interpretations can lead you in different directions when
trying to strengthen the bound. An interpretation is not necessarily motivated
by computational efficiency.
Looking back at our example in (4.1), we find the Jensen lower bound by calculating $\phi(E\tilde\xi) = \phi(0)$. That has been solved already in Section 1.2, where we found that $\phi(0) = 126$.
3.4.2 The Edmundson–Madansky Upper Bound
(Remember that x is fixed at x̂.) Consider Figure 14, where we have drawn a linear function
U (ξ) between the two points (a, φ(a)) and (b, φ(b)). The line is clearly above
φ(ξ) for all ξ ∈ Ξ. Also this straight line has the formula cξ + d, and since we
know two points, we can calculate
$$c = \frac{\phi(b) - \phi(a)}{b - a}, \qquad d = \frac{b}{b-a}\,\phi(a) - \frac{a}{b-a}\,\phi(b).$$
We can now integrate, and find (using the linearity of $U(\xi)$)
$$EU(\tilde\xi) = \frac{\phi(b) - \phi(a)}{b - a}\,E\tilde\xi + \frac{b}{b-a}\,\phi(a) - \frac{a}{b-a}\,\phi(b) = \phi(a)\,\frac{b - E\tilde\xi}{b - a} + \phi(b)\,\frac{E\tilde\xi - a}{b - a}.$$
In other words, if we have a function that is convex in $\xi$ over a bounded support $\Xi = [a, b]$, it is possible to replace an arbitrary distribution by a two-point distribution such that we obtain an upper bound. The important parameter is
$$p = \frac{E\tilde\xi - a}{b - a},$$
so that we can replace the original distribution with the two-point distribution taking the value $a$ with probability $1 - p$ and the value $b$ with probability $p$.
As for the Jensen lower bound, we have now shown that the Edmundson–Madansky upper bound can be seen as either changing the distribution and keeping the problem, or changing the problem and keeping the distribution.
Looking back at our example in (4.1), we have two independent random variables. Hence we have $2^2 = 4$ LPs to solve to find the Edmundson–Madansky upper bound. Since both distributions are symmetric, the probabilities attached to these four points will all be 0.25. Calculating this, we find an upper bound of
$$\tfrac14(106.6825 + 129.8625 + 122.1375 + 145.3175) = 126.$$
This is exactly the same as the lower bound, and hence it is the true value of $E\phi(\tilde\xi)$. We shall shortly comment on this situation where the bounds turn out to be equal.
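Both numbers are easy to reproduce. The following Python sketch, using scipy's linprog, computes the Jensen bound φ(Eξ̃) and the Edmundson–Madansky average over the four corners; the support limits [−30.91, 30.91] and [−23.18, 23.18] are the ones used for this example later in the text, and symmetric independent distributions are assumed.

import numpy as np
from scipy.optimize import linprog

def phi(xi1, xi2):
    # phi(xi) = min 2x1 + 3x2 s.t. x1+x2 <= 100, 2x1+6x2 >= 180+xi1, 3x1+3x2 >= 162+xi2
    res = linprog(c=[2, 3],
                  A_ub=[[1, 1], [-2, -6], [-3, -3]],
                  b_ub=[100, -(180 + xi1), -(162 + xi2)],
                  bounds=[(0, None), (0, None)])
    return res.fun

a1, b1 = -30.91, 30.91
a2, b2 = -23.18, 23.18

jensen = phi(0.0, 0.0)                                   # phi(E xi) = 126
corners = [(x1, x2) for x1 in (a1, b1) for x2 in (a2, b2)]
edm_mad = sum(0.25 * phi(x1, x2) for x1, x2 in corners)  # also 126: phi is linear here

print("Jensen lower bound:            ", jensen)
print("Edmundson-Madansky upper bound:", edm_mad)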
In higher dimensions, the Jensen lower bound corresponds to a hyperplane,
while the Edmundson–Madansky bound corresponds to a more general
polynomial. A two-dimensional illustration of the Edmundson–Madansky
bound is given in Figure 15. Note that if we fix the value of all but one of the
Figure 14 The Jensen lower bound L(ξ) and the Edmundson–Madansky upper bound U(ξ) over [a, b] in a minimization problem. Note that x is fixed.
at the end points. This is illustrated in Figure 16, where we have shown the
case for one random variable. The support of ξ˜ has been partitioned into
two parts, called cells. For each of these cells, we have drawn the straight
lines corresponding to the Jensen lower bound and the Edmundson–Madansky
upper bound. Corresponding to each cell, there is a one-point distribution that
gives a lower bound, and a two-point distribution that gives an upper bound,
just as we have outlined earlier.
If the random variables have continuous (but bounded) distributions, we
use these conditional bounds to replace the original distribution with discrete
distributions. If the distribution is already discrete, we can remove some of
the outcomes by using the Edmundson–Madansky inequality conditionally on
parts of the support, again pushing probability mass to the end points of
the intervals. Of course, the Jensen inequality can be used in the same way
to construct conditional lower bounds. The point with these changes is not
to create bounds per se, but to simplify distributions in such a way that we
have control over what we have done to the problem when simplifying. The
idea is outlined in Figure 17. Whatever the original distribution was, we now
have two distributions: one giving an overall lower bound, the other an overall
upper bound.
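A minimal sketch of this construction for a Uniform[0, 1] random variable split into two cells: per cell, the lower-bounding (Jensen) distribution puts the conditional mass at the cell's conditional mean, while the upper-bounding (Edmundson–Madansky) distribution pushes it to the cell end points. The uniform distribution and the cell boundaries are assumptions for illustration.

cells = [(0.0, 0.5), (0.5, 1.0)]
lower, upper = [], []
for a, b in cells:
    mass = b - a                       # P(cell) under Uniform[0,1]
    mean = 0.5 * (a + b)               # conditional mean within the cell
    lower.append((mean, mass))         # one-point (Jensen) distribution per cell
    p_b = (mean - a) / (b - a)         # Edmundson-Madansky weight on the right end point
    upper += [(a, mass * (1 - p_b)), (b, mass * p_b)]
print("lower-bounding distribution:", lower)
print("upper-bounding distribution:", upper)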
Figure 16 Illustration of the effect on the Jensen lower bound and the Edmundson–Madansky upper bound of partitioning the support into two cells.
3.4.3 Combinations
If we have randomness in the objective function, but not in the right-hand
side (so h(ξ)−T (ξ)x ≡ h0 −T0 x), then, by simple linear programming duality,
we can obtain the dual of Q(x, ξ) with all randomness again in the right-hand
side, but now in a setting of maximization. In such a setting the Jensen bound
is an upper bound and the Edmundson–Madansky bound a lower bound.
If we have randomness in both the objective and the right-hand side, and the
random variables affecting these two positions are different and independent,
then we get a lower bound by applying the Jensen rule on the right-hand side
random variables and the Edmundson–Madansky rule in the objective. If we
do it the other way around, we get an overall upper bound.
3.4.4 A Piecewise Linear Upper Bound
Consider the second-stage problem
$$\phi(\xi) = \min_y\{q^T y \mid Wy = b + \xi,\ 0 \le y \le c\},$$
where all components in the random vector $\tilde\xi^T = (\tilde\xi_1, \tilde\xi_2, \ldots)$ are mutually independent. Furthermore, let the support be given by $\Xi(\tilde\xi) = [A, B]$. For convenience, but without any loss of generality, we shall assume that $E\tilde\xi = 0$. The goal is to create a piecewise linear, separable and convex function in $\xi$:
$$U(\xi) = \phi(0) + \sum_i \begin{cases} d_i^+ \xi_i & \text{if } \xi_i \ge E\tilde\xi_i = 0,\\ d_i^- \xi_i & \text{if } \xi_i < E\tilde\xi_i = 0. \end{cases} \tag{4.3}$$
There is a very good reason for such a choice. Note how U (ξ) is separable in
its components ξi . Therefore, for almost all distribution functions, U is simple
to integrate.
To appreciate the bound, we must understand its basic motivation. If we
take some minimization problem, like the one here, and add extra constraints,
the resulting problem will bound the original problem from above. What we
shall do is to add restrictions with respect to the upper bounds c. We shall do
this by viewing φ(ξ) as a parametric problem in ξ, and reserve portions of the
upper bound c for the individual random variables ξi . We may, for example,
end up by saying that two units of $c_j$ are reserved for variable $\xi_i$, meaning that these two units can be used in the parametric analysis only when we consider $\xi_i$. For all other variables $\xi_k$ these two units will be viewed as nonexistent.
The crux of the bound is to introduce the best possible set of such constraints, such that the resulting problem is easy to solve (and gives a good bound).
First, let us calculate $\phi(E\tilde\xi) = \phi(0)$ by finding
$$\phi(0) = \min_y\{q^T y \mid Wy = b,\ 0 \le y \le c\} = q^T y^0.$$
This can be interpreted as the basic setting, and all other values of $\xi$ will be seen as deviations from $E\tilde\xi = 0$. (Of course, any other starting point will also do; for example solving $\phi(A)$, where, as stated before, $A$ is the lowest possible value of $\xi$.) Note that since $y^0$ is "always" there, we can in the following operate with bounds $-y^0 \le y \le c - y^0$. For this purpose we define $\alpha^1 = -y^0$ and $\beta^1 = c - y^0$. Let $e_i$ be a unit vector of appropriate size with a $+1$ in position $i$.
Next, define a counter $r$ and let $r := 1$. Now check out the case $\xi_r > 0$ by solving (remembering that $B_r$ is the maximal value of $\xi_r$)
$$\min_y\{q^T y \mid Wy = e_r B_r,\ \alpha^r \le y \le \beta^r\} = q^T y^{r+} = d_r^+ B_r. \tag{4.4}$$
Note that $d_r^+$ represents the per unit cost of increasing the right-hand side from 0 to $e_r B_r$. Similarly, check out the case $\xi_r < 0$ by solving
$$\min_y\{q^T y \mid Wy = e_r A_r,\ \alpha^r \le y \le \beta^r\} = q^T y^{r-} = d_r^- A_r. \tag{4.5}$$
These problems measure new bound usage, with what we had before subtracted off. There are three possibilities. Both (4.4) and (4.5) may yield non-negative values for the variable $y_i$; in that case nothing is used of the available "negative bound" $\alpha_i^r$, and $\alpha_i^{r+1} = \alpha_i^r$. Alternatively, if (4.4) has $y_i^{r+} < 0$, then it will in the worst case use $y_i^{r+}$ of the available "negative bound". Finally, if (4.5) has $y_i^{r-} < 0$, then in the worst case we use $y_i^{r-}$ of the bound. Therefore
$$\alpha_i^{r+1} = \alpha_i^r - \min\{y_i^{r+},\ y_i^{r-},\ 0\} \tag{4.6}$$
is what is left for the next random variable. Similarly, we find
$$\beta_i^{r+1} = \beta_i^r - \max\{y_i^{r+},\ y_i^{r-},\ 0\}, \tag{4.7}$$
where $\beta_i^{r+1}$ shows how much is still available of bound $i$ in the forward (positive) direction.
We next increase the counter r by one and repeat (4.4)–(4.7). This takes
care of the piecewise linear functions in ξ.
Note that it is possible to solve (4.4) and (4.5) by parametric linear
programming, thereby getting not just one linear piece above E ξ˜ and one
below, but rather piecewise linearity on both sides. Then (4.6) and (4.7) must
be updated to “worst case” analysis of bound usage. That is simple to do.
Let us turn to our example (4.1). Since we have developed the piecewise
linear upper bound for equality constraints, we shall repeat the problem with
slack variables added explicitly.
First, we have already calculated $\phi(0, 0) = 126$, with $x_{\mathrm{raw1}} = 36$, $x_{\mathrm{raw2}} = 18$ and $s_1 = 46$. Next, let us try to find $d_1^\pm$. To do that, we need $\alpha^1$, which equals $(-36, -18, -46, 0, 0)$. We must then formulate (4.4), using $\xi_1 \in [-30.91, 30.91]$:
min{2xraw1 + 3xraw2 }
s.t. xraw1 + xraw2 + s1 = 0,
2xraw1 + 6xraw2 − s2 = 30.91,
3xraw1 + 3xraw2 − s3 = 0,
xraw1 ≥ −36,
xraw2 ≥ −18,
s1 ≥ −46,
s2 ≥ 0,
s3 ≥ 0.
The solution is $y^{1+} = (-7.7275, 7.7275, 0, 0, 0)^T$, giving $d_1^+ = 7.7275/30.91 = 0.25$. Solving the corresponding problem (4.5), with right-hand side $-30.91$, gives $y^{1-} = (7.7275, -7.7275, 0, 0, 0)^T$, and hence
$$d_1^- = \frac{(2, 3, 0, 0, 0)\,y^{1-}}{-30.91} = 0.25.$$
The next step is to update $\alpha$ according to (4.6) to find out how much is left of the negative bounds on the variables. For $x_{\mathrm{raw1}}$ we get
$$\alpha^2_{\mathrm{raw1}} = -36 - \min\{-7.7275,\ 7.7275,\ 0\} = -28.2725,$$
and similarly $\alpha^2_{\mathrm{raw2}} = -18 - \min\{7.7275,\ -7.7275,\ 0\} = -10.2725$. For the three other variables, $\alpha_i^2$ equals $\alpha_i^1$. We can now turn to (4.4) for random variable 2. The problem to solve is as follows, remembering that $\xi_2 \in [-23.18, 23.18]$:
min{2xraw1 + 3xraw2 }
s.t. xraw1 + xraw2 + s1 = 0,
2xraw1 + 6xraw2 − s2 = 0,
3xraw1 + 3xraw2 − s3 = 23.18,
xraw1 ≥ −28.2725,
xraw2 ≥ −10.2725,
s1 ≥ −46,
s2 ≥ 0,
s3 ≥ 0.
The solution to this is $y^{2+} = (11.59, -3.863, -7.727, 0, 0)^T$, with a total cost of 11.59. This gives us
$$d_2^+ = \frac{(2, 3, 0, 0, 0)\,y^{2+}}{23.18} = 0.5.$$
Next, we solve the same problem, just with 23.18 replaced by $-23.18$. This amounts to problem (4.5), and gives us the solution $y^{2-} = (-11.59, 3.863, 7.727, 0, 0)^T$, with a total cost of $-11.59$. Hence we get
$$d_2^- = \frac{(2, 3, 0, 0, 0)\,y^{2-}}{-23.18} = 0.5.$$
This finishes the calculation of the (piecewise) linear functions in the upper bound. What we have now found is that
$$U(\xi_1, \xi_2) = 126 + \begin{cases} \tfrac14\xi_1 & \text{if } \xi_1 \ge 0\\ \tfrac14\xi_1 & \text{if } \xi_1 < 0 \end{cases} + \begin{cases} \tfrac12\xi_2 & \text{if } \xi_2 \ge 0\\ \tfrac12\xi_2 & \text{if } \xi_2 < 0, \end{cases}$$
that is,
$$U(\xi_1, \xi_2) = 126 + \tfrac14\xi_1 + \tfrac12\xi_2.$$
This is as it should be: the expected contribution from a random variable over which $U$ (and therefore $\phi$) is linear is zero, since $E\tilde\xi_i = 0$. This holds for both $\xi_1$ and $\xi_2$, and therefore the upper bound is 126, which equals the Jensen lower bound.
Now that we have seen how things go in the linear case, let us try to see how the results look when linearity is not present. Hence assume that we have developed the necessary parameters $d_i^\pm$ for (4.3). Let us integrate with respect to the random variable $\tilde\xi_i$, assuming that $\Xi_i = [A_i, B_i]$:
$$\int_{A_i}^{0} d_i^- \xi_i f(\xi_i)\,d\xi_i + \int_{0}^{B_i} d_i^+ \xi_i f(\xi_i)\,d\xi_i = d_i^- E\{\tilde\xi_i \mid \tilde\xi_i \le 0\}\,P\{\tilde\xi_i \le 0\} + d_i^+ E\{\tilde\xi_i \mid \tilde\xi_i > 0\}\,P\{\tilde\xi_i > 0\}.$$
This result should not come as much of a surprise: when one integrates a linear function, one gets the function evaluated at the expected value of the variable. Hence, if $d_i^+ = d_i^-$ for all $i$, then the contribution to the upper bound from $\tilde\xi$ equals $\phi(E\tilde\xi)$, which equals the contribution to the Jensen lower bound.
Let us repeat why this is an upper bound. What we have done is to distribute the bounds $c$ on the variables among the different random variables. They have been given separate pieces, which they will not share with others, even if they, for a given realization of $\tilde\xi$, do not need the capacities themselves.
This partitioning of the bounds among the random variables represents
a set of extra constraints on the problem, and hence, since we have a
minimization problem, the extra constraints yield an upper bound. If we
run out of capacities before all random variables have received their parts,
we must conclude that the upper bound is +∞. This cannot happen with
the Edmundson–Madansky upper bound. If φ(ξ) is feasible for all ξ then the
Edmundson–Madansky bound is always finite. However, as for the Jensen and
Edmundson–Madansky bounds, the piecewise linear upper bound is also exact
when the recourse function turns out to be linear.
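To tie the pieces together, here is a Python sketch of the $d_r^\pm$ computations (4.4)–(4.5), with the bound update (4.6), for the slack-formulated example above; it reproduces $d_1^\pm = 0.25$ and $d_2^\pm = 0.5$. The use of scipy's linprog and the data layout are implementation assumptions.

import numpy as np
from scipy.optimize import linprog

q = np.array([2.0, 3.0, 0.0, 0.0, 0.0])           # objective, y = (x_raw1, x_raw2, s1, s2, s3)
W = np.array([[1.0, 1.0, 1.0,  0.0,  0.0],
              [2.0, 6.0, 0.0, -1.0,  0.0],
              [3.0, 3.0, 0.0,  0.0, -1.0]])
y0 = np.array([36.0, 18.0, 46.0, 0.0, 0.0])       # optimal y for phi(0) = 126
alpha = -y0                                        # remaining "negative" bounds, alpha^1

def solve(row, rhs, alpha):
    # min q^T y  s.t.  W y = e_row * rhs,  alpha <= y (no finite upper bounds in (4.1))
    b_eq = np.zeros(3); b_eq[row] = rhs
    res = linprog(q, A_eq=W, b_eq=b_eq, bounds=[(a, None) for a in alpha])
    return res.x, res.fun

# xi_1 enters row 1; solve (4.4) and (4.5) with B_1 = 30.91, A_1 = -30.91
y1p, c1p = solve(1, 30.91, alpha)
y1m, c1m = solve(1, -30.91, alpha)
print("d_1^+ =", c1p / 30.91, " d_1^- =", c1m / -30.91)   # both 0.25

# update alpha as in (4.6): subtract the worst-case use of the negative bounds
alpha = alpha - np.minimum(np.minimum(y1p, y1m), 0.0)

# xi_2 enters row 2; B_2 = 23.18, A_2 = -23.18
y2p, c2p = solve(2, 23.18, alpha)
y2m, c2m = solve(2, -23.18, alpha)
print("d_2^+ =", c2p / 23.18, " d_2^- =", c2m / -23.18)   # both 0.5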
As mentioned before, we shall consider random upper bounds in Chapter 6,
in the setting of networks.
3.5 Approximations
We can now look at U − L to see if we are happy with the result or not.
If we are not, there are basically two approaches that can be used. Either
we might resort to a better bounding procedure (probably more expensive in
terms of CPU time) or we might start using the old bounding methods on a
partition of the support, thereby making the bounds tighter. Since we know
only finitely many different methods, we shall eventually be left with only the
second option.
The set-up for such an approach to bounding will be as follows. First,
partition the support of the random variables into an arbitrary selection of
cells—possibly only one cell initially. We shall only consider cells that are
rectangles, so that they can be described by intervals on the individual random
variables. Figure 18 shows an example in two dimensions with five cells. Now,
apply the bounding procedures on each of the cells, and add up the results.
[Figure 18: a two-dimensional support partitioned into five rectangular cells.]
increased our workload. It is now harder to achieve a given error bound than it was before the partition. Note also that we shall never recover from the error, in the sense that intelligent choices later on will not counteract one bad choice. Each time we make a bad partition, the workload from there onwards basically doubles for the cell from which we started. Since we do not want to increase the workload unnecessarily, we must be careful with how we partition.
Now that we know that bad choices can increase the workload, what should we do? The first observation is that choosing at random is not a good idea, because every now and then we shall make bad choices. On the other hand, it is clear that the partitioning procedure will have to be a heuristic. Hence we must make sure that we have a heuristic rule that, we hope, never makes really bad choices.
By knowing our problem well, we may be able to order the random variables
according to their importance in the problem. Such an ordering could be used
as is, or in combination with other ideas. For some network problems, such as
the PERT problem (see Section 6.6), the network structure may present us
with such a list. If we can compile the list, it seems reasonable to ask, from a
modelling point of view, if the random variables last on the list should really
have been there in the first place. They do not appear to be important.
Over the years, some attempts to understand the problem of partitioning
have been made. Most of them are based on the assumption that the
Edmundson–Madansky bound was used to calculate the upper bound. The
reason is that the dual variables associated with the solution of the recourse
function tell us something about its curvature. With the Edmundson–
Madansky bound, we solve the recourse problem at all extreme points of
the support, and thus get a reasonably good idea of what the function looks
like.
To introduce some formality, assume we have only one random variable $\tilde\xi$, with support $\Xi = [A, B]$. When finding the Edmundson–Madansky upper bound, we calculated $\phi(A) = Q(\hat{x}, A)$ and $\phi(B) = Q(\hat{x}, B)$, obtaining dual solutions $\pi^A$ and $\pi^B$. We know from duality that
$$\alpha = \phi(A) - (\pi^B)^T[h(A) - T(A)\hat{x}] \ge 0$$
and
$$\beta = \phi(B) - (\pi^A)^T[h(B) - T(B)\hat{x}] \ge 0.$$
[Figures 19 and 20: plots of φ(ξ) over the support, with the parameters α and β indicated.]
If we have much curvature in the sense of the slopes of $\phi$ at the end points, but still almost linearity (as in Figure 20), then the smaller of the two parameters will be small. Hence the conclusion seems to be to calculate both $\alpha$ and $\beta$, pick the smaller of the two, and use that as a measure of nonlinearity.
Using $\alpha$ and $\beta$, we have a good measure of nonlinearity in one dimension. However, with more than one dimension, we must again be careful. We can certainly perform tests corresponding to those illustrated in Figures 19 and 20 for one random variable at a time, but the question is what values we should give the other random variables during the test. If we have $k$ random variables, and have the Edmundson–Madansky calculations available, there are $2^{k-1}$ different ways we can fix all but one variable and then compare dual solutions. There are at least two possible approaches.
A first possibility is to calculate α and β for all neighbouring pairs of
extreme points in the support, and pick the one for which the minimum of
α and β is the largest. We then have a random variable for which φ is very
nonlinear, at least in parts of the support. We may, of course, have picked a
variable for which φ is linear most of the time, and this will certainly happen
once in a while, but the idea is tested and found sound.
An alternative, which tries to check average rather than maximal nonlinearity, is to use all $2^{k-1}$ pairs of neighbouring extreme points involving variation in only one random variable, find the minimum of $\alpha$ and $\beta$ for each such pair, and then calculate the average of these minima. Then we pick the random variable for which this average is maximal.
The number of pairs of neighbouring extreme points is fairly large. With $k$ random variables, we have $k\,2^{k-1}$ pairs to compare, and each comparison requires the calculation of two inner products. We indicated earlier that the Edmundson–Madansky upper bound cannot be used for much more than 10 random variables; in such a case we must perform 5120 pairwise comparisons.
What we need to know to utilize this information is how the right-hand side changes as a given random variable $\tilde\xi_i$ changes. This is easy to calculate, since all we have to do is to find the derivative of
$$h(\xi) - T(\xi)\hat{x} = h_0 + \sum_j h_j\xi_j - \Bigl(T_0 + \sum_j T_j\xi_j\Bigr)\hat{x}$$
with respect to $\xi_i$, namely
$$h_i - T_i\hat{x} \equiv \delta_i.$$
Note that this expression is independent of the value of $\tilde\xi$, and hence it is the same at all extreme points of the support. Now, if $\pi(\xi^j)$ is the optimal dual solution at an extreme point of the support, represented by $\xi^j$, then the slope of $\phi(\xi) = Q(\hat{x}, \xi)$ with respect to $\xi_i$ is given by $\pi(\xi^j)^T\delta_i$. Hence, with $\delta$ the matrix whose $i$th column is $\delta_i$, the vector
$$\pi(\xi^j)^T\delta \equiv (\pi^j)^T\delta \tag{5.1}$$
characterizes how $\phi(\xi) = Q(\hat{x}, \xi)$ changes with respect to all random variables.
Since these calculations are performed at each extreme point of the support, and each extreme point has a probability according to the Edmundson–Madansky calculations, we can interpret the vectors $\pi^j$ as outcomes of a random vector $\tilde\pi$ that has $2^k$ possible values and the corresponding Edmundson–Madansky probabilities. If, for example, the random variable $\tilde\pi_i$ has only one possible value, we know that $\phi(\xi)$ is linear in $\xi_i$. If $\tilde\pi_i$ has several possible values, its variance will tell us quite a bit about how the slope varies over the support. Since the random variables $\tilde\xi_i$ may have very different units, and the dual variables measure changes in the objective function per unit change of the right-hand side, some care is needed before these variances can be compared.
Let
$$U_i^\ell - L_i^\ell$$
be the error on the "left" cell if we partition random variable $i$, and let
$$U_i^r - L_i^r$$
be the error on the "right" cell. If $p_i^\ell$ is the probability of being in the left cell, given that we are in the original cell, when we partition coordinate $i$, we choose to partition the random variable $i$ that minimizes
$$p_i^\ell(U_i^\ell - L_i^\ell) + p_i^r(U_i^r - L_i^r).$$
Such a look-ahead means bounding every candidate cell before choosing, and much of that work is thrown away in the end. With a cheap (in terms of CPU time) upper bound, the approach seems more reasonable, since checking all possibilities is not particularly costly; but, even so, bad partitions will still double the workload locally.
indicate that this approach is very good even with the Edmundson–Madansky
upper bound, and the reason seems to be that it produces so few cells. Of
course, without Edmundson–Madansky, we cannot calculate α, β and π̃, so if
we do not like the look-ahead, we are in need of a new heuristic.
We have pointed out before that the piecewise linear upper bound can
obtain the value +∞. That happens if one of the problems (4.4) or (4.5)
becomes infeasible. If that takes place, the random variable being treated
when it happens is clearly a candidate for partitioning.
So far we have not really defined what constitutes a good partition. We shall
return to that after the next subsection. But first let us look at an example
illustrating the partitioning ideas.
Let us assume that Ξ1 = [0, 20] and Ξ2 = [0, 10]. For simplicity, we shall
assume uniform and independent distributions. We do that because the form
of the distribution is rather unimportant for the heuristics we are to explain.
The feasible set for the problem, except the upper bounds, is given in
Figure 21. The circled numbers refer to the numbering of the inequalities.
For all problems we have to solve (for varying values of ξ), it is reasonably
easy to read the solution directly from the figure.
Since we are maximizing, the Jensen bound is an upper bound, and the Edmundson–Madansky bound a lower bound. We easily find the Jensen upper bound from
$$\phi(10, 5) = 20.$$
Figure 21 Set of feasible solutions for the example used to illustrate the
piecewise linear upper bound.
LL:UL We first must test the optimal dual solution $\pi^{LL}$ together with the right-hand side $b^{UL}$. We get
$$\alpha = (\pi^{LL})^T b^{UL} - \phi(U, L) = (0, 0, 0, 0, 0, 1, 2)(6, 21, 49, 120, 45, 20, 0)^T - \phi(U, L) = 20 - 10.5 = 9.5,$$
$$\beta = (\pi^{UL})^T b^{LL} - \phi(L, L) = (0, \tfrac12, 0, 0, 0, 0, \tfrac72)(6, 21, 49, 120, 45, 0, 0)^T - \phi(L, L) = 10.5 - 0 = 10.5.$$
If we were to pick the pair with the largest minimum of α and β, we should
pick the pair UL:UU, over which it is ξ2 that varies. In such a case we have
tried to find that part of the function that is the most nonlinear. When we
look at Figure 21, we see that as ξ2 increases (with ξ1 = 20), the optimal
solution moves from F to E and then to D, where it stays when ξ2 comes
above the y coordinate in D. It is perhaps not so surprising that this is the
most serious nonlinearity in φ.
If we try to find the random variable with the highest average nonlinearity, by summing the errors over those pairs for which the given random variable varies, we find that for $\tilde\xi_1$ the sum is $9.5 + 15.143 = 24.643$, and for $\tilde\xi_2$ it is $8 + 16.643$, which also equals 24.643. In other words, we have no conclusion.
The next approach we suggested was to look at the dual variables as in
(5.1). The right-hand side structure is very simple in our example, so it is
easy to find the connections. We define two random variables: π̃1 for the row
constraining x, and π̃2 for the row constraining y. With the simple kind of
uniform distributions we have assumed, each of the four values for π̃1 and π̃2
will have probability 0.25. Using Table 1, we see that the possible values for
π̃1 are 0, 1 and 3 (with 0 appearing twice), while for π̃2 they are 0, 2 and 3.5
(also with 0 appearing twice). There are different ideas we can follow.
1. We can find out how the dual variables vary between the extreme points.
The largest individual change is that π̃2 falls from 3.5 to 0 as we go from UL
to UU. This should again confirm that $\tilde\xi_2$ is a candidate for partitioning.
2. We can calculate $E\tilde\pi = (1, \frac{11}{8})$, and the individual variances to be 1.5 and 2.17. If we choose based on variance, we pick $\tilde\xi_2$.
3. We also argued earlier that the size of the support was of some importance.
A way of accommodating that is to multiply all outcomes by the length of the support. (That way, all dual variables are, in a sense, a measure of change per total support.) That should make the dual variables comparable. The calculations are left to the reader. We now end up with $\tilde\pi_1$ having the largest variance. (And if we now look at the biggest change in dual variable over pairs of neighboring extreme points, $\tilde\xi_1$ will be the one to partition.)
The function values needed for the look-ahead are
$$\begin{array}{ll}
\phi(5, 5) = 15, & \phi(10, 7.5) = 25, \\
\phi(15, 5) = 25, & \phi(10, 2.5) = 15, \\
\phi(10, 10) = 27.143, & \phi(0, 5) = 10, \\
\phi(20, 5) = 25, & \phi(10, 0) = 10.
\end{array}$$
Based on this, we can find the total error after splitting to be about 4.5 for $\tilde\xi_1$ and 5 for $\tilde\xi_2$. Therefore, based on "look-ahead", we should choose $\tilde\xi_1$, since that reduces the error the most.
□
with $p_j$ being the probability of being in cell $j$. Let $\hat x$ be the optimal solution to (5.3). Clearly, if $x$ is the optimal solution to the original problem, then
$$c^T \hat x + L(\hat x) \le c^T x + L(x) \le c^T x + Q(x),$$
Figure 22 Example illustrating the use of bounds in the L-shaped decomposition method. An initial partition corresponds to the lower bounding function $L_1(x)$ and the upper bounding function $U_1(x)$. For all $x$ we have $L_1(x) \le Q(x) \le U_1(x)$. We minimize $cx + L_1(x)$ to obtain $\hat x$. We find the error $U_1(\hat x) - L_1(\hat x)$, and we decide to refine the partition. This will cause $L_1$ to be replaced by $L_2$ and $U_1$ by $U_2$. Then the process can be repeated.
so that the optimal value we found by solving (5.3) is really a lower bound
on min cT x + Q(x). The first inequality follows from the observation that x̂
minimizes cT x + L(x). The second inequality holds because L(x) ≤ Q(x) for
all $x$ (Jensen's inequality). Next, we use some method to calculate $U(\hat x)$, for example the Edmundson–Madansky or the piecewise linear upper bound. Note that, with more cells, and hence a larger $\ell$, the function $L(x)$ is now closer to $Q(x)$. Feasibility cuts are still valid and tight. Figure 22 illustrates how the approximating functions $L(x)$ and $U(x)$ change as the partition is refined.
In total, this gives us the procedure in Figure 23. The procedure refine(Ξ)
will not be detailed, since there are so many options. We refer to our earlier
discussion of the subject in Section 3.5.1. Note that, for simplicity, we have
assumed that, after a partitioning, the procedure starts all over again in the
repeat loop. That is of course not needed, since we already have checked the
present x̂ for feasibility. If we replace the set A by Ξ in the call to procedure
feascut, the procedure Bounding L-shaped must stay as it is. In many cases
this may be a useful change, since A might be very large. (In this case old
feasibility cuts might no longer be tight.)
Numerical experience indicates that a good partitioning rule is one that minimizes the final number of cells, and that it is worthwhile to pay quite a lot per iteration to achieve this goal.
Hence we assume
$$W = (I, -I), \qquad T(\xi) \equiv T \ \text{(constant)}, \qquad h(\xi) \equiv \xi,$$
and in addition
$$q := q^+ + q^- \ge 0.$$
In other words, we consider the case where only the right-hand side is random, and we shall see that in this case, using our former presentation $h(\xi) = h_0 + \sum_i h_i \xi_i$, we only need to know the marginal distributions of the components $h_j(\xi)$ of $h(\xi)$. However, stochastic dependence or independence of these components does not matter at all. This justifies the above setting $h(\xi) \equiv \xi$.
By linear programming duality, we have for the recourse function
$$Q(x, \xi) = \min\{q^{+T} y^+ + q^{-T} y^- \mid y^+ - y^- = \xi - Tx,\ y^+ \ge 0,\ y^- \ge 0\} = \max\{(\xi - Tx)^T \pi \mid -q^- \le \pi \le q^+\}. \tag{6.2}$$
Hence, with
$$\hat Q_i(\chi_i, \xi_i) = \begin{cases} (\xi_i - \chi_i)\,q_i^+ & \text{if } \chi_i < \xi_i, \\ -(\xi_i - \chi_i)\,q_i^- & \text{if } \chi_i \ge \xi_i, \end{cases}$$
we have
$$Q(x, \xi) = \sum_i \hat Q_i(\chi_i, \xi_i) \quad \text{with } \chi = Tx.$$
The last expression shows that knowledge of the marginal distributions of the $\tilde\xi_i$ is sufficient to evaluate the expected recourse. Moreover, $E_{\tilde\xi} Q(x, \tilde\xi)$ is a so-called separable function in $(\chi_1, \cdots, \chi_{m_1})$, i.e. $E_{\tilde\xi} Q(x, \tilde\xi) = \sum_{i=1}^{m_1} Q_i(\chi_i)$, where, owing to $q^+ + q^- = q$,
$$\begin{aligned}
Q_i(\chi_i) &= q_i^+ \int_{\xi_i > \chi_i} (\xi_i - \chi_i)\, P_{\tilde\xi}(d\xi) - q_i^- \int_{\xi_i \le \chi_i} (\xi_i - \chi_i)\, P_{\tilde\xi}(d\xi) \\
&= q_i^+ \int_\Xi (\xi_i - \chi_i)\, P_{\tilde\xi}(d\xi) - (q_i^+ + q_i^-) \int_{\xi_i \le \chi_i} (\xi_i - \chi_i)\, P_{\tilde\xi}(d\xi) \\
&= q_i^+ \bar\xi_i - q_i^+ \chi_i - q_i \int_{\xi_i \le \chi_i} (\xi_i - \chi_i)\, P_{\tilde\xi}(d\xi),
\end{aligned} \tag{6.3}$$
showing that for $\chi_i < \alpha_i$ and $\chi_i > \beta_i$ the functions $Q_i(\chi_i)$ are linear (see Figure 24). In particular, we have $Q_i(\chi_i) = q_i^+(\bar\xi_i - \chi_i)$ for $\chi_i \le \alpha_i$ and $Q_i(\chi_i) = -q_i^-(\bar\xi_i - \chi_i)$ for $\chi_i \ge \beta_i$.
If, on the other hand, $\alpha_i < \hat\chi_i < \beta_i$, we partition the interval $(\alpha_i, \beta_i]$ into the two subintervals $(\alpha_i, \hat\chi_i]$ and $(\hat\chi_i, \beta_i]$ with the conditional expectations
$$\bar\xi_i^1 = E_{\tilde\xi}(\tilde\xi_i \mid \xi_i \in (\alpha_i, \hat\chi_i]), \qquad \bar\xi_i^2 = E_{\tilde\xi}(\tilde\xi_i \mid \xi_i \in (\hat\chi_i, \beta_i]).$$
Hence, instead of using the E–M upper bound, we can easily determine the exact value $Q_i(\hat\chi_i)$. With $\hat Q_i(\chi_i, \bar\xi_i^1, \bar\xi_i^2) := p_i^1 \hat Q_i(\chi_i, \bar\xi_i^1) + p_i^2 \hat Q_i(\chi_i, \bar\xi_i^2)$, the resulting situation is demonstrated in Figure 25.
Assume now that for a partition of the intervals $(\alpha_i, \beta_i]$ into subintervals $I_{i\nu} := (\delta_{i\nu}, \delta_{i,\nu+1}]$, $\nu = 0, \cdots, N_i - 1$, with $\alpha_i = \delta_{i0} < \delta_{i1} < \cdots < \delta_{iN_i} = \beta_i$, we have minimized the Jensen lower bound (see Section 3.4.1), letting $p_{i\nu} = P(\xi_i \in I_{i\nu})$ and $\bar\xi_{i\nu} = E_{\tilde\xi}(\tilde\xi_i \mid \xi_i \in I_{i\nu})$:
$$\min_{x,\chi}\ \Big[ c^T x + \sum_{i=1}^{k} \sum_{\nu=0}^{N_i - 1} p_{i\nu}\, \hat Q_i(\chi_i, \bar\xi_{i\nu}) \Big] \quad \text{s.t.}\ Ax = b,\ Tx - \chi = 0,\ x \ge 0,$$

Figure 25 Simple recourse: supporting $Q_i(\chi_i)$ by $\hat Q_i(\chi_i, \bar\xi_i^1, \bar\xi_i^2)$.
yielding the solution x̂ and χ̂ = T x̂. Obviously relation (6.5) holds for
conditional expectations Qiν (χ̂i ) (with respect to Iiν ) as well. Then for each
component of χ̂ there are three possibilities.
(a) If $\hat\chi_i \le \alpha_i$, then
$$\hat Q_i(\hat\chi_i, \bar\xi_{i\nu}) = Q_{i\nu}(\hat\chi_i) = E_{\tilde\xi}(\hat Q_i(\hat\chi_i, \tilde\xi_i) \mid \xi_i \in I_{i\nu}), \quad \nu = 0, \cdots, N_i - 1,$$
and hence $\sum_{\nu=0}^{N_i - 1} p_{i\nu}\, \hat Q_i(\hat\chi_i, \bar\xi_{i\nu}) = Q_i(\hat\chi_i)$.
(c) If $\hat\chi_i \in I_{i\mu}$ for exactly one $\mu$, with $0 \le \mu < N_i$, then there are two cases. First, if $\delta_{i\mu} < \hat\chi_i < \delta_{i,\mu+1}$, partition $I_{i\mu} = (\delta_{i\mu}, \delta_{i,\mu+1}]$ into
$$J_{i\mu}^1 = (\delta_{i\mu}, \hat\chi_i] \quad \text{and} \quad J_{i\mu}^2 = (\hat\chi_i, \delta_{i,\mu+1}].$$
Then
$$Q_i(\hat\chi_i) = \sum_{\nu \ne \mu} p_{i\nu}\, \hat Q_i(\hat\chi_i, \bar\xi_{i\nu}) + \sum_{\rho=1}^{2} p_{i\mu}^\rho\, \hat Q_i(\hat\chi_i, \bar\xi_{i\mu}^\rho),$$
where
$$p_{i\mu}^\rho = P(\xi_i \in J_{i\mu}^\rho), \quad \bar\xi_{i\mu}^\rho = E_{\tilde\xi}(\tilde\xi_i \mid \xi_i \in J_{i\mu}^\rho), \quad \rho = 1, 2.$$
If, on the other hand, $\hat\chi_i = \delta_{i,\mu+1}$, we again have $\sum_{\nu=0}^{N_i - 1} p_{i\nu}\, \hat Q_i(\hat\chi_i, \bar\xi_{i\nu}) = Q_i(\hat\chi_i)$.
In conclusion, having determined the minimal point $\hat\chi$ for the Jensen lower bound, we immediately get the exact expected recourse at this point and decide whether for all components the relative error fits into a prescribed tolerance, or in which component the refinement (partitioning the subinterval containing $\hat\chi_i$ by dividing it exactly at $\hat\chi_i$) seems appropriate for a further improvement of the approximate solution of (6.1). Many empirical tests have shown this approach to be very efficient. In particular, for this special problem type higher dimensions of $\tilde\xi$ do not cause severe computational difficulties, as they did for general stochastic programs with recourse, as discussed in Section 3.5.
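As a small numeric illustration of these formulas (not taken from the book), the following sketch evaluates $Q_i(\chi_i)$ exactly for a one-dimensional discrete distribution and shows that the Jensen value becomes exact once the cell containing $\chi_i$ is split at $\chi_i$; all numbers below are invented for the illustration.

```python
# q_plus and q_minus are the simple recourse costs for one component.

def Q_hat(chi, xi, q_plus, q_minus):
    """Recourse cost of one outcome: shortage priced at q_plus, surplus at q_minus."""
    return q_plus * (xi - chi) if chi < xi else -q_minus * (xi - chi)

def Q_exact(chi, dist, q_plus, q_minus):
    """Exact expected recourse: dist is a list of (outcome, probability) pairs."""
    return sum(p * Q_hat(chi, xi, q_plus, q_minus) for xi, p in dist)

def jensen_value(chi, cells, q_plus, q_minus):
    """Jensen bound: apply Q_hat to the conditional mean of each cell.
    cells is a list of (cell probability, conditional mean) pairs."""
    return sum(p * Q_hat(chi, mean, q_plus, q_minus) for p, mean in cells)

dist = [(0.0, 0.25), (1.0, 0.25), (2.0, 0.25), (3.0, 0.25)]
chi = 1.2
exact = Q_exact(chi, dist, q_plus=2.0, q_minus=1.0)            # 1.65
loose = jensen_value(chi, [(1.0, 1.5)], q_plus=2.0, q_minus=1.0)   # one cell: 0.6
# Splitting the support exactly at chi makes the bound exact, as in the text:
tight = jensen_value(chi, [(0.5, 0.5), (0.5, 2.5)], q_plus=2.0, q_minus=1.0)
assert loose <= exact and abs(tight - exact) < 1e-12
```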
This book deals almost exclusively with convex problems. The only exception
is this section, where we discuss, very briefly, some aspects of integer
programming. The main reason for doing so is that some solution procedures
for integer programming fit very well with some decomposition procedures
for (continuous) stochastic programming. Because of that we can achieve
two goals: we can explain some connections between stochastic and integer
programming, and we can combine the two subject areas. This allows us to
arrive at a method for stochastic integer programming. Note that talking
about stochastic and integer programming as two distinct areas is really
meaningless, since stochastic programs can contain integrality constraints, and
integer programs can be stochastic. But we still do it, with some hesitation.
We next continue to work with the problem in one of these waiting nodes.
We shall call this problem the present problem. When doing so, a number of
different situations can occur.
• In the integer case we partition the solution space, and in the stochastic
case the input data (support of random variables).
• In the integer case we must find a branching variable, and in the stochastic
case a random variable for partitioning.
• In the integer case we must find a value dj of xj (see (7.2)) for the branching,
and in the stochastic case we must determine a point dj in the support
through which we want to partition.
• Both methods therefore operate with a situation as depicted in Figure 18,
but in one case the rectangle is the solution space, while in the other it is
the support of the random variables.
• Both problems can be seen as building up a tree. For integer programming
we build a branch-and-bound tree. For stochastic programming we build a
splitting tree. The branch-and-bound tree in Figure 26 could have been a
splitting tree as well. In that case we should store the error rather than the
objective value.
• In the integer case we fathom a problem (corresponding to a cell in Figure 18, or a leaf in the tree) when it has nothing more to tell us; in the stochastic case we do this when the bounds (in the cell or leaf) are close enough.
From this, it should be obvious that anyone who understands the ins and outs of integer programming will also have a lot to say about bounding
stochastic programs. Of course there are differences, but they are smaller
than one might think.
So far, what we have compared is really the problem of bounding the two problem types. Now consider the stochastic program
$$\begin{array}{rl} \min & c^T x \\ \text{s.t.} & Ax = b, \\ & x \ge 0, \\ & Wy(\xi) = h(\xi) - T(\xi)x,\ y(\xi) \ge 0. \end{array} \tag{7.3}$$
To use the L-shaped method to solve (7.3), we should begin solving the
problem
$$\min\ c^T x \quad \text{s.t.}\ Ax = b,\ x \ge 0,$$
i.e. (7.3) without the last set of constraints added. Then, if the resulting x̂
makes the last set of constraints in (7.3) feasible for all ξ, we are done. If not,
an implied feasibility cut is added.
An integer program, on the other hand, could be written as
$$\min\ c^T x \quad \text{s.t.}\ Ax = b, \quad x_i \in \{a_i, \ldots, b_i\}\ \text{for all } i. \tag{7.4}$$
A cutting-plane procedure for (7.4) will solve the problem with the constraints
a ≤ x ≤ b so that the integrality requirement is relaxed. Then, if the resulting
x̂ is integral in all its elements, we are done. If not, an integrality cut is added.
This cut will, if possible, be a facet of the solution space with all extreme points
integer.
By now, realizing that integrality cuts are also feasibility cuts, the
connection should be clear. Integrality cuts in integer programming are just
a special type of feasibility cuts.
For the bounding version of the L-shaped decomposition method we
combined bounding (with partitioning of the support) with cuts. In the same
way, we can combine branching and cuts in the branch-and-cut algorithm for
integer programs (still deterministic). The idea is fairly simple (but requires
a lot of details to be efficient). For all waiting nodes, before or after we
have solved the relaxed LP, we add an appropriate number of cuts, before
we (re)solve the LP. How many cuts we add will often depend on how well we
know the facets of the (integer) solution space. This new LP will have a smaller
(continuous) solution space, and is therefore likely to give a better result—
either in terms of a nonintegral optimal solution with a higher objective value
(increasing the probability of bounding), or in terms of an integer solution.
So, finally, we have reached the ultimate question. How can all of this be
used to solve integer stochastic programs? Given the simplification that we
have integrality only in the first-stage problem, the procedure is given in
Figure 27. In the procedure we operate with a set of waiting nodes P. These
are nodes in the cut-and-branch tree that are not yet fathomed or bounded.
The procedure feascut was presented earlier in Figure 9, whereas the new
procedure intcut is outlined in Figure 28. Let us try to compare the L-shaped
integer programming method with the continuous one presented in Figure 10.
3.7.1 Initialization
In the continuous case we started by assuming the existence of an x̂, feasible
in the first stage. It can be found, for example, by solving the expected value
problem. This is not how we start in the integer case. The reason is partly
that finding a feasible solution is more complicated in that setting. On the
other hand, it might be argued that if we hope to solve the integer stochastic
problem, we should be able to solve the expected value problem (or at least
find a feasible solution to the master problem), thereby being able to start out
with a feasible solution (and a z better than ∞). But, even in this case, we shall
not normally be calling procedure master with a feasible solution at hand. If
we have just created a feasibility cut, the present x̂ is not feasible. Therefore
the difference in initialization is natural. This also affects the generation of
feasibility cuts.
$$\min\{\varphi(x) \equiv c^T x + Q(x)\} \quad \text{s.t.}\ Ax = b,\ x \ge 0,$$
where
$$Q(x) = \int Q(x, \xi) f(\xi)\, d\xi.$$
Again we note that ξ and x do not enter the constraints of the dual
formulation, so that if a given ξ and x produce a solvable problem, the problem
is dual feasible for all ξ and x. Furthermore, if π 0 is a dual feasible solution
then
Q(x, ξ) ≥ (π 0 )T [ξ − T (ξ)x]
for any ξ and x, since π 0 is feasible but not necessarily optimal in a
maximization problem. This observation is a central part of SD. Refer back
to our discussion of how to interpret the Jensen lower bound in Section 3.4.1,
where we gave three different interpretations, one of which was approximate
optimization using a finite number of dual feasible bases, rather than all
possible dual feasible bases. In SD we shall build up a collection of dual
feasible bases, and in some of the optimizations use this subset rather than
all possible bases. In itself, this will produce a lower-bounding solution.
But SD is also a sampling technique. By $\xi_k$ we shall understand the sample drawn in iteration $k$. At the same time, $x_k$ will refer to the iterate (i.e. the presently best guess of the optimal solution) in iteration $k$. The first thing to do after a new sample has been made available is to evaluate $Q(x_k, \xi_j)$ for the new iterate and all samples $\xi_j$ found so far. First we solve, for the newest sample $\xi_k$,
$$\max\{\pi^T[\xi_k - T(\xi_k)x_k] \mid \pi^T W \le q^T\}$$
to obtain an optimal dual solution πk . Note that this optimization, being the
first involving ξk , is exact. If we let V be the collection of all dual feasible
solutions obtained so far, we now add πk to V . Next, instead of evaluating
$Q(x_k, \xi_j)$ for $j = 1, \ldots, k - 1$ (i.e. for the old samples) exactly, we simply solve
$$\pi_j^k = \arg\max\{\pi^T[\xi_j - T(\xi_j)x_k] \mid \pi \in V\}.$$
Figure 29 Illustration of how stochastic decomposition performs exact
optimization for the latest (third) sample point, but inexact optimization for
the two old points.
Since $V$ contains a finite number of vectors, this operation is very
simple. Note that for all samples but the new one we perform approximate
optimization using a limited set of dual feasible bases. The situation is
illustrated in Figure 29. There we see the situation for the third sample point.
We first make an exact optimization for the new sample point, ξ3 , obtaining
a true optimal dual solution π3 . This is represented in Figure 29 by the
supporting hyperplane through $(\xi_3, Q(x_3, \xi_3))$. Afterwards, we solve inexactly
for the two old sample points. There are three bases available for the inexact
optimization. These bases are represented by the three thin lines. As we see,
neither of the two old sample points finds its true optimal basis.
If $\Xi(\tilde\xi) = \{\xi_1, \xi_2, \xi_3\}$, with each outcome having the same probability $\frac{1}{3}$, we could now calculate a lower bound on $Q(x_3)$ by computing
$$L(x_3) = \frac{1}{3} \sum_{j=1}^{3} (\pi_j^3)^T (\xi_j - T(\xi_j) x_3),$$
using the inexact dual solutions for the old sample points. However, the three sample points probably do not represent the true distribution well, and hence what we have is only something that in expectation is a lower bound. Since, eventually, this term will converge towards $Q(x)$, we shall in what follows write
$$Q(x_k) = \frac{1}{k} \sum_{j=1}^{k} (\pi_j^k)^T (\xi_j - T(\xi_j) x_k).$$
Remember, however, that this is not the true value of $Q(x_k)$, just an estimate.
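A compact sketch of this estimate follows, assuming for simplicity a fixed technology matrix T; solve_dual_exactly is a stand-in for the exact dual solve, and none of the names come from the book.

```python
import numpy as np

# For the newest sample solve the recourse dual exactly; for old samples
# only pick the best dual vector already stored in V (inexact optimization).

def sd_lower_estimate(x_k, samples, V, T, solve_dual_exactly):
    """Return the sampled lower-bound estimate of Q(x_k).
    samples: list of xi vectors; V: growing list of stored dual vectors."""
    pi_new = solve_dual_exactly(samples[-1], x_k)   # exact for the newest sample
    V.append(pi_new)
    total = 0.0
    for xi in samples:
        rhs = xi - T @ x_k
        total += max(pi @ rhs for pi in V)          # argmax over the finite set V
    return total / len(samples)
```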
In other words, we have now observed two major differences from the
exact L-shaped method (page 159). First, we operate on a sample rather
than on all outcomes, and, secondly, what we calculate is an estimate of
a lower bound on Q(xk ) rather than Q(xk ) itself. Hence, since we have a
lower bound, what we are doing is more similar to what we did when we
used the L-shaped decomposition method within approximation schemes (see
page 192). However, the reason for the lower bound is somewhat different. In
the bounding version of L-shaped, the lower bound was based on conditional
expectations, whereas here it is based on inexact optimization. On the other
hand, we have earlier pointed out that the Jensen lower bound has three
different interpretations, one of which is to use conditional expectations (as
in procedure Bounding L-shaped) and another that is inexact optimization
(as in SD). So what is actually the principal difference?
For the three interpretations of the Jensen bound to be equivalent, the
limited set of bases must come from solving the recourse problem in the points
of conditional expectations. That is not the case in SD. Here the points are
random (according to the sample ξj ). Using a limited number of bases still
produces a lower bound, but not the Jensen lower bound.
Therefore SD and the bounding version of L-shaped are really quite
different. The reason for the lower bound is different, and the objective value
in SD is only a lower bound in terms of expectations (due to sampling). One
method picks the limited number of points in a very careful way, the other
at random. One method has an exact stopping criterion (error bound), the
other has a statistically based stopping rule. So, more than anything else,
they are alternative approaches. If one cannot solve the exact problem, one
either resorts to bounds or to sample-based methods.
In the L-shaped method we demonstrated how to find optimality cuts. We
can now find a cut corresponding to xk (which is not binding and might even
not be a lower bound, although it represents an estimate of a lower bound).
As for the L-shaped method, we shall replace Q(x) in the objective by θ, and
then add constraints. The cut generated in iteration k is given by
$$\theta \ge \frac{1}{k} \sum_{j=1}^{k} (\pi_j^k)^T [\xi_j - T(\xi_j)x] = \alpha_k^k + (\beta_k^k)^T x.$$
The double set of indices on $\alpha$ and $\beta$ indicates that the cut was generated in iteration $k$ (the subscript) and that it has been updated in iteration $k$ (the superscript).
In contrast to the L-shaped decomposition method, we must now also look
at the old cuts. The reason is that, although we expect these cuts to be loose
(since we use inexact optimization), they may in fact be far too tight (since
they are based on a sample). Also, being old, they are based on a sample that
is smaller than the present one, and hence probably not very good. We shall therefore want to phase them out, but not by throwing them away. Assume that there exists a lower bound $\underline Q$ on $Q(x, \xi)$, i.e. $Q(x, \xi) \ge \underline Q$ for all $x$ and $\xi$. Then the old cuts $\theta \ge \alpha_j^{k-1} + (\beta_j^{k-1})^T x$ will be replaced by
$$\theta \ge \frac{k-1}{k}\big[\alpha_j^{k-1} + (\beta_j^{k-1})^T x\big] + \frac{1}{k}\,\underline Q = \alpha_j^k + (\beta_j^k)^T x \quad \text{for } j = 1, \ldots, k - 1. \tag{8.1}$$
This defines the function φk (x) and shows more clearly than (8.2) that we do
indeed have a function in x that we are minimizing. Also φk (x) is the present
estimate of φ(x) = cT x + Q(x).
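The bookkeeping behind (8.1) is a one-line update per cut. A minimal sketch, with Q_low denoting the lower bound on Q(x, xi):

```python
# Old cuts are dampened towards the lower bound Q_low as the sample grows,
# exactly as in the update rule (8.1).

def update_old_cut(alpha_prev, beta_prev, k, Q_low):
    """Turn cut coefficients from iteration k-1 into their iteration-k form."""
    factor = (k - 1) / k
    alpha_new = factor * alpha_prev + Q_low / k
    beta_new = tuple(factor * b for b in beta_prev)
    return alpha_new, beta_new
```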
$$\min\ c^T x + q_0^T y \quad \text{s.t.}\ Ax = b,\ Wy = \xi_0 - T(\xi_0)x,\ x, y \ge 0.$$
In addition, we need to update the incumbent cut $i_k$. This is done just the way we found cut $k$.
We are still dealing with recourse problems stated in the somewhat more general form
$$\min_{x \in X}\Big[ f(x) + \int_\Xi Q(x, \xi)\, P_{\tilde\xi}(d\xi) \Big]. \tag{9.1}$$
This formulation also includes the stochastic linear program with recourse,
letting
X = {x | Ax = b, x ≥ 0},
f (x) = cT x,
Q(x, ξ) = min{(q(ξ))T y | W y = h(ξ) − T (ξ)x, y ≥ 0}.
Observe that for stochastic linear programs with recourse the assump-
tions (9.3) are satisfied if, for instance,
Any vector g satisfying (9.7) is called a subgradient of ϕ at z, and the set of all
vectors satisfying (9.7) is called the subdifferential of ϕ at z and is denoted by
∂ϕ(z). If ϕ is differentiable at z then ∂ϕ(z) = {∇ϕ(z)}; otherwise, i.e. in the
nondifferentiable case, ∂ϕ(z) may contain more than one element as shown
for instance in Figure 32. Furthermore, in view of (9.7), it is easily seen that
∂ϕ(z) is a convex set.
If $\varphi$ is convex and $g \ne 0$ is a subgradient of $\varphi$ at $z$ then, by (9.7), for $\lambda > 0$ it follows that
$$\varphi(z + \lambda g) \ge \varphi(z) + g^T(z + \lambda g - z) = \varphi(z) + g^T(\lambda g) = \varphi(z) + \lambda \|g\|^2 > \varphi(z).$$
Then for ẑ = (0, 3)T we have g = (1, 1)T ∈ ∂ψ(ẑ), since for all ε > 0 the
gradient ∇ψ(ε, 3) exists and is equal to g. Hence, by (9.6), we have, for all
(u, v),
$$\left[\begin{pmatrix} u \\ v \end{pmatrix} - \begin{pmatrix} \varepsilon \\ 3 \end{pmatrix}\right]^T g = \begin{pmatrix} u - \varepsilon \\ v - 3 \end{pmatrix}^T \begin{pmatrix} 1 \\ 1 \end{pmatrix} = u - \varepsilon + v - 3 \le |u| + |v| - |\varepsilon| - |3|,$$
which is obviously true for all $\varepsilon \ge 0$, such that $g$ is a subgradient at $(0, 3)^T$. Then for $0 < \lambda < 3$ and $\hat z - \lambda g = (-\lambda, 3 - \lambda)^T$ it follows that
$$\psi(\hat z - \lambda g) = |-\lambda| + |3 - \lambda| = \lambda + (3 - \lambda) = 3 = \psi(\hat z),$$
and therefore, in this particular case, −g is not a strict descent direction for
ψ in ẑ. Nevertheless, as we see in Figure 33, moving from ẑ along the ray
ẑ − λg, λ > 0, for any λ < 3 we would come closer—with respect to the
Euclidean norm—to arg min ψ = {(0, 0)T } than we are at ẑ.
Since, by our assumption, $\varphi(z) - \varphi(x^\star) > 0$, we may choose a step size $\rho = \bar\rho > 0$ such that $\bar\rho < 2[\varphi(z) - \varphi(x^\star)]/\|g\|^2$, implying that $z - \bar\rho g$ is closer to $x^\star \in \arg\min \varphi$ than $z$. This property provides
the motivation for the iterative procedures known as subgradient methods,
which minimize convex functions even in the nondifferentiable case.
Obviously for the above procedure (9.4) we may not expect any reasonable
convergence statement without further assumptions on the search direction
v ν and on the step size ρν . Therefore let v ν be a so-called stochastic quasi-
gradient, i.e. assume that
$$E(v^\nu \mid x^0, \cdots, x^\nu) \in \partial_x E_{\tilde\xi} F(x^\nu, \tilde\xi) + b^\nu, \tag{9.8}$$
$$E_{\tilde\xi} F(x^*, \tilde\xi) - E_{\tilde\xi} F(x^\nu, \tilde\xi) \ge g^{\nu T}(x^* - x^\nu), \tag{9.9}$$
where
$$\gamma_\nu = -b^{\nu T}(x^* - x^\nu). \tag{9.11}$$
or more generally
$$v^\nu = \frac{1}{N_\nu} \sum_{\mu=1}^{N_\nu} w^\mu, \quad w^\mu \in \partial_x F(x^\nu, \xi^{\nu\mu}), \tag{9.13}$$
With the choices (9.12) or (9.13), for uniformly bounded v ν this assumption
could obviously be replaced by the step size assumption
$$\rho_\nu \ge 0, \quad \sum_{\nu=0}^{\infty} \rho_\nu = \infty, \quad \sum_{\nu=0}^{\infty} \rho_\nu^2 < \infty. \tag{9.15}$$
With these prerequisites, it can be shown that, under the assumptions (9.3),
(9.8) and (9.14) (or (9.3), (9.12) or (9.13), and (9.15)) the iterative
method (9.4) converges almost surely (a.s.) to a solution of (9.2).
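A minimal sketch of such an iteration under the step size rule (9.15); sample_subgradient and project_X are assumed helpers standing in for a stochastic quasi-gradient oracle and the projection onto X, and are not names from the text.

```python
import numpy as np

# Projected stochastic quasi-gradient step: draw a subgradient estimate,
# take a diminishing step, project back onto the feasible set X.

def sqg(x0, sample_subgradient, project_X, iterations=1000):
    x = np.asarray(x0, dtype=float)
    for nu in range(iterations):
        v = sample_subgradient(x)      # estimate of a subgradient at x (fresh sample)
        rho = 1.0 / (nu + 1)           # satisfies sum rho = inf, sum rho^2 < inf
        x = project_X(x - rho * v)
    return x
```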
What we observe here is that the part that varies, h(ξ) − T (ξ)x, appears only
in the objective. As a consequence, if (10.1) is feasible for one value of x and
ξ, it is feasible for all values of x and ξ. Of course, the problem might be
unbounded (meaning that the primal is infeasible) for some x and ξ. For the
moment we shall assume that this does not occur. (But if it does, it simply shows that we need a feasibility cut, not an optimality cut.)
In a given iteration of the L-shaped decomposition method, x will be fixed,
and all we are interested in is the selection of right-hand sides resulting from
all possible values of ξ. Let us therefore simplify notation, and assume that
we have a selection of right-hand sides $\mathcal B$, so that, instead of (10.1), we solve
$$\min\{q^T y \mid Wy = h,\ y \ge 0\} \tag{10.2}$$
for all $h \in \mathcal B$. Assume (10.2) is solved for one value of $h \in \mathcal B$ with optimal basis $B$. Then $B$ is a dual feasible basis for all $h \in \mathcal B$. Therefore, for all $h \in \mathcal B$ for which $B^{-1}h \ge 0$, the basis $B$ is also primal feasible, and hence optimal. The idea behind bunching is simply to start out with some $h \in \mathcal B$, find the optimal basis $B$, and then check $B^{-1}h$ for all other $h \in \mathcal B$. Whenever $B^{-1}h \ge 0$, we have found the optimal solution for that $h$, and these right-hand sides are bunched together. We then remove these right-hand sides from $\mathcal B$, and repeat the process, of course with a warm start from $B$, using the dual simplex method, for one of the remaining right-hand sides in $\mathcal B$. We continue until all right-hand sides are bunched. That gives us all the information needed to find $Q$ and the necessary optimality cut.
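A minimal sketch of plain bunching follows; solve_for stands in for a (warm-started) dual simplex solve returning the inverse of an optimal basis, and is an assumption of this sketch rather than a procedure from the book.

```python
import numpy as np

# Repeatedly: solve for one leftover right-hand side, then sweep all
# leftovers and bunch those for which the basis is primal feasible.

def bunch(rhs_list, solve_for):
    bunches = []                                   # (B_inv, [bunched h's]) pairs
    remaining = list(rhs_list)
    while remaining:
        B_inv = solve_for(remaining[0])            # optimal basis for one leftover h
        in_bunch = [h for h in remaining if np.all(B_inv @ h >= 0)]
        remaining = [h for h in remaining if not np.all(B_inv @ h >= 0)]
        bunches.append((B_inv, in_bunch))
    return bunches
```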
This procedure has been followed up in several directions. An important
one is called trickling down. Again, we start out with $\mathcal B$, and we solve (10.2) for some right-hand side to obtain a dual feasible basis $B$. This basis is stored in the root of a search tree that we are about to build. Now, for one $h \in \mathcal B$ at a time, do the following. Start in the root of the tree, and calculate $B^{-1}h$. If $B^{-1}h \ge 0$, register that this right-hand side belongs to the bunch associated with $B$, and go to the next $h \in \mathcal B$. If $B^{-1}h \not\ge 0$, pick a row for which primal feasibility is not satisfied. Perform a dual pivot step to obtain a new basis $B'$ (still dual feasible). Create a new node in the search tree associated with this new $B'$. If the pivot was made in row $i$, we let the new node be the $i$th child of the node containing the previous basis. Continue until optimality is found. This situation is illustrated in Figure 34, where a total of eight bases are stored. The numbers on the arcs refer to the rows where pivoting took place; the $B$ in the nodes indicates that a basis is stored in each node.
This might not seem efficient. However, the real purpose comes after some
iterations. If a right-hand side $h$ is such that $B^{-1}h \not\ge 0$, and one of the negative
primal variables corresponds to a row index i such that the ith child of the
given node in the search tree already exists, we simply move to that child
without having to price. This is why we use the term trickling down. We try
to trickle a given h as far down in the tree as possible, and only when there
is no negative primal variable that corresponds to a child node of the present
node do we price and pivot explicitly, thereby creating a new branch in the
tree.
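The tree walk can be sketched as follows; dual_pivot is a stand-in for one dual simplex pivot in a chosen row (an assumption of this sketch, not the book's procedure), and B_inv and h are NumPy arrays.

```python
# Each tree node stores a basis inverse; children are indexed by the
# pivot row that created them, so revisiting a row needs no pricing.

class Node:
    def __init__(self, B_inv):
        self.B_inv = B_inv
        self.children = {}      # pivot row index -> Node
        self.bunch = []         # right-hand sides optimal for this basis

def trickle(root, h, dual_pivot):
    node = root
    while True:
        x_B = node.B_inv @ h
        if (x_B >= 0).all():                 # primal feasible: h is bunched here
            node.bunch.append(h)
            return node
        rows = [i for i, v in enumerate(x_B) if v < 0]
        # reuse an existing child for some infeasible row if possible
        row = next((i for i in rows if i in node.children), None)
        if row is None:                      # otherwise pivot explicitly, grow the tree
            row = rows[0]
            node.children[row] = Node(dual_pivot(node.B_inv, row))
        node = node.children[row]
```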
Attempts have been made to first create the tree, and then trickle down the
right-hand sides in the finished tree. This was not successful for two reasons.
If we try to enumerate all dual feasible bases, then the tree gets out of hand
(this corresponds to extreme point enumeration), and if we try to find the
correct selection of such bases, then that in itself becomes an overwhelming
problem. Therefore a pre-defined tree does not seem to be a good idea.
It is worth noting that the idea of storing a selection of dual feasible bases,
as was done in the stochastic decomposition method, is also related to the
above approach. In that case the result is a lower bound on Q(x).
Figure 34 Example of a bunching tree.
A variant of these methods is as follows. Start out with one dual feasible basis $B$ as in the trickling down procedure. Pick a leading right-hand side. Now solve the problem corresponding to this leading right-hand side using the dual simplex method. While pivoting along, create a branch of the search tree just as for trickling down. The difference is as follows. For each basis $B$ encountered, check $B^{-1}h$ for all $h \in \mathcal B$. Then split the right-hand sides remaining in $\mathcal B$ into three sets. Those that have $B^{-1}h \ge 0$ are bunched with that $B$, and removed from $\mathcal B$. Those that have a primal infeasibility in the same row as the one chosen to be the pivot row for the leading problem are kept in $\mathcal B$ and hence carried along at least one more dual pivot step. The remaining right-hand sides are left behind in the given node, to be picked up later on.
When the leading problem has been solved to optimality, and bunching has been performed with respect to its optimal basis, check if there are any right-hand sides left in $\mathcal B$. If there are, let one of them be the leading right-hand side, and continue the process. Eventually, when a leading problem has been solved to optimality, $\mathcal B = \emptyset$. At that time, start backtracking the search tree. Whenever a selection of right-hand sides left behind is encountered, pick one of them as the leading problem, and repeat the process. On returning to the root, and finding no right-hand sides left behind there, the process is finished. All right-hand sides are bunched. Technically, what has now been done is to traverse the search tree in pre-order.
Benders’ [1] decomposition is the basis for all decomposition methods in this
chapter. In stochastic programming, as we have seen, it is more common to
refer to Benders’ decomposition as the L-shaped decomposition method. That
approach is outlined in detail in Van Slyke and Wets [63]. An implementation
of the L-shaped decomposition method, called MSLiP, is presented in
Gassmann [31]. It solves multistage problems based on nested decomposition.
Alternative computational methods are also discussed in Kall [44].
The regularized decomposition method has been implemented under the
name QDECOM. For further details on the method and QDECOM, in
particular for a special technique to solve the master (3.6), we refer to the
original publication of Ruszczyński [61]; the presentation in this chapter is
close to the description in his recent paper [62].
Some attempts have also been made to use interior point methods. As
examples consider Birge and Qi [7], Birge and Holmes [6], Mulvey and
Ruszczyński [60] and Lustig, Mulvey and Carpenter [55]. The latter two
combine interior point methods with parallel processing.
Parallel techniques have been tried by others as well; see e.g. Berland [2]
and Jessup, Yang and Zenios [42]. We shall mention some others in Chapter 6.
The idea of combining branch-and-cut from integer programming with
primal decomposition in stochastic programming was developed by Laporte
and Louveaux [53]. Although the method is set in a strict setting of integrality only in the first stage, it can be expanded to cover (via a reformulation) multistage problems that possess the so-called block-separable recourse property; see Louveaux [54] for details.
Stochastic quasi-gradient methods were developed by Ermoliev [20, 21],
and implemented by, among others, Gaivoronski [27, 28]. Besides stochastic
quasi-gradients several other possibilities for constructing stochastic descent
directions have been investigated, e.g. in Marti [57] and in Marti and
Fuchs [58, 59].
The Jensen lower bound was developed in 1906 [41]. The Edmundson–
Madansky upper bound is based on work by Edmundson [19] and
Madansky [56]. It has been extended to the multidimensional case by
Gassmann and Ziemba [33]; see also Hausch and Ziemba [36] and Edirisinghe
and Ziemba [17, 18]. Other references in this area include Huang, Vertinsky
and Ziemba [39] and Huang, Ziemba and Ben-Tal [40]. The Edmundson–
Madansky bound was generalized to the case of stochastically dependent
components by Frauendorfer [23].
The piecewise linear upper bound is based on two independent approaches,
namely those of Birge and Wets [11] and Wallace [66]. These were later
combined and strengthened in Birge and Wallace [8].
There is a large collection of bounds based on extreme measures (see e.g.
Dulá [12, 13], Hausch and Ziemba [36], Huang, Ziemba and Ben-Tal [40] and
Kall [48]). Both the Jensen and Edmundson–Madansky bounds can be put
into this category. For a fuller description of these methods, consult Birge and
Wets [10], Dupačová [14, 15, 16] and Kall [47]; more on extreme measures
may be found in Karr [51] and Kemperman [52].
Bounds can also be found when limited information is available. Consult
e.g. Birge and Dulá [5]. An upper bound based on structure can be found in
Wallace and Yan [68].
Stochastic decomposition was developed by Higle and Sen [37, 38].
The ideas presented about trickling down and similar methods come from
different authors, in particular Wets [70, 72], Haugland and Wallace [35],
Wallace [65, 64] and Gassmann and Wallace [32]. A related approach is that
of Gartska and Rutenberg [29], which is based on parametric optimization.
Partitioning has been discussed several times during the years. Some general
ideas are presented in Birge and Wets [9]. More detailed discussions (with
numerical results), on which the discussions in this book are based, can be
found in Frauendorfer and Kall [26] and Berland and Wallace [3, 4]. Other
texts about approximation by discretization include for example those of
Kall [43, 45, 46], and Kall, Ruszczyński and Frauendorfer [49].
When partitioning the support to tighten bounds, it is possible to use more
complicated cells than we have done. For example, Frauendorfer [24, 25] uses
simplices. It is also possible to use more general polyhedra.
For simple recourse, the separability of the objective, which facilitates
computations substantially, was discovered by Wets [69]. The ability to replace
the Edmundson–Madansky upper bound by the true objective’s value was
discussed in Kall and Stoyan [50]. Wets [71] has derived a special pivoting
scheme that avoids the tremendous increase of the problem size known from
general recourse problems according to the number of blocks (i.e. realizations).
See also discussions by Everitt and Ziemba [22] and Hansotia [34].
The fisheries example in the beginning of the chapter comes from
Wallace [67]. Another application concerning natural resources is presented
by Gassmann [30].
Exercises
where $\tilde\xi$ is a random variable with support $\Xi = [0, 1]$. Write down the LP
(both primal and dual formulation) needed to check if a given x produces
a feasible second-stage problem. Do it in such a way that if the problem
is not feasible, you obtain an inequality in x that cuts off the given x. If
you have access to an LP code, perform the computations, and find the
inequality explicitly for x̂ = (1, 1, 1)T .
2. Look back at problem (4.1) we used to illustrate the bounds. Add one extra
constraint, namely
xraw1 ≤ 40.
(a) Find the Jensen lower bound after this constraint has been added.
(b) Find the Edmundson–Madansky upper bound.
(c) Find the piecewise linear upper bound.
(d) Try to find a good variable for partitioning.
(a) Assume that you solve the integer program with branch-and-bound.
Your first step is then to solve the integer program above, but with
xi ∈ {0, . . . , bi } ∀i replaced by 0 ≤ x ≤ b. Assume that you get x̂.
Explain why x̂ can be a good partitioning point if you wanted to find
Eφ(x̃) by repeatedly partitioning the support, and finding bounds on
each cell. [Hint: It may help to draw a little picture.]
(b) We have earlier referred to Figure 18, stating that it can be seen
as both the partitioning of the support for the stochastic program,
and partitioning the solution space for the integer program. Will the
number of cells be largest for the integer or the stochastic program
above? Note that there is not necessarily a clear answer here, but you
should be able to make arguments on the subject. Question (a) may be
of some help.
8. Look back at Figure 17. There we replaced one distribution by two others:
one yielding an upper bound, and one a lower bound. The possible values
for these two new distributions were not the same. How would you use the
ideas of Jensen and Edmundson–Madansky to achieve, as far as possible,
the same points? You can assume that the distribution is bounded. [Hint:
The Edmundson–Madansky distribution will have two more points than
the Jensen distribution.]
References
[41] Jensen J. L. (1906) Sur les fonctions convexes et les inégalités entre les
valeurs moyennes. Acta Math. 30: 173–177.
[42] Jessup E. R., Yang D., and Zenios S. A. (1993) Parallel factorization
of structured matrices arising in stochastic programming. Report 93-02,
Department of Public and Business Administration, University of Cyprus,
Nicosia, Cyprus.
[43] Kall P. (1974) Approximations to stochastic programs with complete fixed
recourse. Numer. Math. 22: 333–339.
[44] Kall P. (1979) Computational methods for solving two-stage stochastic
linear programming problems. Z. Angew. Math. Phys. 30: 261–271.
[45] Kall P. (1986) Approximation to optimization problems: An elementary
review. Math. Oper. Res. 11: 9–18.
[46] Kall P. (1987) On approximations and stability in stochastic programming.
In Guddat J., Jongen H. T., Kummer B., and Nožička F. (eds) Parametric
Optimization and Related Topics, pages 387–407. Akademie-Verlag, Berlin.
[47] Kall P. (1988) Stochastic programming with recourse: Upper bounds and
moment problems—a review. In Guddat J., Bank B., Hollatz H., Kall
P., Klatte D., Kummer B., Lommatzsch K., Tammer K., Vlach M., and
Zimmermann K. (eds) Advances in Mathematical Optimization (Dedicated
to Prof. Dr. Dr. hc. F. Nožička), pages 86–103. Akademie-Verlag, Berlin.
[48] Kall P. (1991) An upper bound for SLP using first and total second
moments. Ann. Oper. Res. 30: 267–276.
[49] Kall P., Ruszczyński A., and Frauendorfer K. (1988) Approximation
techniques in stochastic programming. In Ermoliev Y. M. and Wets R.
J.-B. (eds) Numerical Techniques for Stochastic Optimization, pages 33–64.
Springer-Verlag, Berlin.
[50] Kall P. and Stoyan D. (1982) Solving stochastic programming problems
with recourse including error bounds. Math. Operationsforsch. Statist., Ser.
Opt. 13: 431–447.
[51] Karr A. F. (1983) Extreme points of certain sets of probability measures,
with applications. Math. Oper. Res. 8: 74–85.
[52] Kemperman J. M. B. (1968) The general moment problem, a geometric
approach. Ann. Math. Statist. 39: 93–122.
[53] Laporte G. and Louveaux F. V. (1993) The integer l-shaped method for
stochastic integer programs. Oper. Res. Lett. 13: 133–142.
[54] Louveaux F. V. (1986) Multistage stochastic linear programs with block
separable recourse. Math. Prog. Study 28: 48–62.
[55] Lustig I. J., Mulvey J. M., and Carpenter T. J. (1991) Formulating two-
stage stochastic programs for interior point methods. Oper. Res. 39: 757–
770.
[56] Madansky A. (1959) Bounds on the expectation of a convex function of a
multivariate random variable. Ann. Math. Statist. 30: 743–746.
[57] Marti K. (1988) Descent Directions and Efficient Solutions in Discretely Distributed Stochastic Programs. Springer-Verlag, Berlin.
Probabilistic Constraints
As we have seen in Sections 1.4 and 1.5, at least under appropriate assump-
tions, chance-constrained problems such as (3.21), or particularly (3.23), as
well as recourse problems such as (3.11), or particularly (3.16), (all from Chap-
ter 1), appear as ordinary convex smooth mathematical programming prob-
lems. This might suggest that these problems may be solved using known
nonlinear programming methods. However, this viewpoint disregards the fact
that in the direct application of those methods to problems like
$$\min_{x \in X} E_{\tilde\xi}\, c^T(\tilde\xi)x \quad \text{s.t.}\ P(\{\xi \mid T(\xi)x \ge h(\xi)\}) \ge \alpha$$
or
$$\min_{x \in X} E_{\tilde\xi}\{c^T x + Q(x, \tilde\xi)\},$$
where
$$Q(x, \xi) = \min\{q^T y \mid Wy \ge h(\xi) - T(\xi)x,\ y \in Y\},$$
we would repeatedly have to obtain gradients and evaluations of functions like
$$P(\{\xi \mid T(\xi)x \ge h(\xi)\})$$
or
$$E_{\tilde\xi}\{c^T x + Q(x, \tilde\xi)\}.$$
Each of these evaluations requires multivariate numerical integration, so that
up to now this seems to lie outside the set of efficiently solvable problems.
Hence we may try to follow the basic ideas of some of the known nonlinear
programming methods, but at the same time we have to find ways to evade
the exact evaluation of the integral functions contained in these problems.
On the other hand we also know from the example illustrated in Figure 17
of Chapter 1 that chance constraints may easily define nonconvex feasible
sets. This leads to severe computational problems if we intend to find a global
optimum. There is one exception to this general problem worth mentioning.
$B(1)$ is convex.
Proof Assume that x, y ∈ B(1) and that λ ∈ (0, 1). Then for Ξx := {ξ |
T (ξ)x ≥ h(ξ)} and Ξy := {ξ | T (ξ)y ≥ h(ξ)} we have P (Ξx ) = P (Ξy ) = 1.
As is easily shown, this implies for Ξ∩ := Ξx ∩ Ξy that P (Ξ∩ ) = 1.
Obviously, for z := λx + (1 − λ)y we have T (ξ)z ≥ h(ξ) ∀ξ ∈ Ξ∩ such that
$\{\xi \mid T(\xi)z \ge h(\xi)\} \supset \Xi_\cap$. Hence we have $z \in B(1)$. □
$$\min_{x \in X} c^T x \quad \text{s.t.}\ T(\xi^j)x \ge h(\xi^j),\ j = 1, \cdots, r.$$
This observation may be helpful for some particular chance-constrained
problems with discrete distributions. However, it also tells us that for chance-
constrained problems stated with continuous-type distributions and requiring
a reliability level α < 1, we cannot expect—as discussed in Section 3.5 for the
recourse problem—approximating the continuous distribution by successively
refined discrete ones to be a successful approach. The reason should now be
obvious: refining the discrete (approximating) distributions would imply at
some stage that minj pj < 1−α such that the “approximating” problems were
likely to become nonconvex—even if the original problem with its continuous
distribution were convex. And approximating convex problems by nonconvex
ones should certainly not be our aim!
In the next two sections we shall describe, under special assumptions (multivariate normal distributions), how chance-constrained programs can be solved. We begin with joint chance constraints of the type
$$P(\{\xi \mid Tx \ge \xi\}) \ge \alpha.$$
To see how this may be realized, let us briefly sketch one iteration of the
reduced gradient method’s variant implemented in PROCON, a computer
program for minimizing a function under PRObabilistic CONstraints.
With the notation
G(x) := P ({ξ | T x ≥ ξ}),
let x be feasible in
$$\min\ c^T x \quad \text{s.t.}\ G(x) \ge \alpha,\ Dx = d,\ x \ge 0, \tag{1.2}$$
and, assuming $D$ to have full row rank, let $D$ be partitioned as $D = (B, N)$ into basic and nonbasic parts, and accordingly partition $x^T = (y^T, z^T)$ and $c^T = (f^T, g^T)$. With $u = -B^{-1}Nv$,
$$r^T = g^T - f^T B^{-1} N, \qquad s^T = \nabla_z G(x)^T - \nabla_y G(x)^T B^{-1} N$$
are the reduced gradients of the objective and the probabilistic constraint
function. Problem (1.5), and hence (1.4), is always solvable owing to its nonempty and bounded feasible set. Depending on the obtained solution $(\tau^*, u^{*T}, v^{*T})$ the method proceeds as follows.
If a search direction w∗T = (u∗T , v ∗T ) has been found, a line search follows
using bisection. Since the line search in this case amounts to determining the
intersection of the ray x + µw∗ , µ ≥ 0 with the boundary bdB(α) within the
tolerance ε, the evaluation of G(x) becomes important. For this purpose a
special Monte Carlo technique is used, which allows efficient computation of
upper and lower bounds of G(x) as well as the gradient ∇G(x).
If the next iterate x̌, resulting from the line search, still satisfies strict
nondegeneracy, the whole step is repeated with the same partition of D into
basic and nonbasic parts; otherwise, a basis exchange is attempted to reinstall
strict nondegeneracy for a new basis.
Let us now consider stochastic linear programs with separate (or single) chance
constraints as introduced at the end of Section 1.3. Using the formulation given
there we are dealing with the problem
$$\min_{x \in X} E_{\tilde\xi}\, c^T(\tilde\xi)x \quad \text{s.t.}\ P(\{\xi \mid T_i(\xi)x \ge h_i(\xi)\}) \ge \alpha_i,\ i = 1, \cdots, m, \tag{2.1}$$
where Ti (ξ) is the ith row of T (ξ). The main question is whether or under
what assumptions the feasibility set defined by any one of the constraints
in (2.1),
$$\{x \mid P(\{\xi \mid T_i(\xi)x \ge h_i(\xi)\}) \ge \alpha_i\},$$
is convex. As we know from Section 1.5, this question is very simple to answer for the special case where $T_i(\xi) \equiv T_i$, i.e. where only the right-hand side $h_i(\tilde\xi)$ is random. That is, with $F_i$ the distribution function of $h_i(\tilde\xi)$,
where $(\tilde t^T, \tilde h)^T$ is a random vector. Assume now that $(\tilde t^T, \tilde h)^T$ has a joint normal distribution with expectation $\mu \in \mathbb{R}^{n+1}$ and $(n+1) \times (n+1)$
covariance matrix $S$. For any fixed $x$, let $\tilde\zeta(x) := x^T \tilde t - \tilde h$. It follows that our feasible set may be rewritten in terms of the random variable $\tilde\zeta(x)$ as $B_i(\alpha_i) = \{x \mid P(\tilde\zeta(x) \ge 0) \ge \alpha_i\}$. From probability theory, we know that, because $\tilde\zeta(x)$ is a linear combination of jointly normally distributed random variables, it has a (one-dimensional) normal distribution function $F_{\tilde\zeta}$ with expectation $m_{\tilde\zeta}(x) = \sum_{j=1}^{n} \mu_j x_j - \mu_{n+1}$ and, using the $(n+1)$-vector $z(x) := (x_1, \cdots, x_n, -1)^T$, the variance $\sigma^2_{\tilde\zeta}(x) = z(x)^T S z(x)$. Since the covariance matrix $S$ of a (nondegenerate) multivariate normal distribution is positive definite, it follows that the variance $\sigma^2_{\tilde\zeta}(x)$ and, as can easily be shown, the standard deviation $\sigma_{\tilde\zeta}(x)$ are convex in $x$ (and $\sigma_{\tilde\zeta}(x) > 0\ \forall x$ in view of $z_{n+1}(x) = -1$). Hence we have
$$B_i(\alpha_i) = \{x \mid P(\tilde\zeta(x) \ge 0) \ge \alpha_i\} = \Big\{x \;\Big|\; P\Big(\frac{\tilde\zeta(x) - m_{\tilde\zeta}(x)}{\sigma_{\tilde\zeta}(x)} \ge \frac{-m_{\tilde\zeta}(x)}{\sigma_{\tilde\zeta}(x)}\Big) \ge \alpha_i\Big\}.$$
Observing that for the normally distributed random variable $\tilde\zeta(x)$ the random variable $[\tilde\zeta(x) - m_{\tilde\zeta}(x)]/\sigma_{\tilde\zeta}(x)$ has the standard normal distribution function $\Phi$, it follows that
$$B_i(\alpha_i) = \Big\{x \;\Big|\; 1 - \Phi\Big(\frac{-m_{\tilde\zeta}(x)}{\sigma_{\tilde\zeta}(x)}\Big) \ge \alpha_i\Big\}.$$
Hence
$$\begin{aligned}
B_i(\alpha_i) &= \Big\{x \;\Big|\; 1 - \Phi\Big(\frac{-m_{\tilde\zeta}(x)}{\sigma_{\tilde\zeta}(x)}\Big) \ge \alpha_i\Big\} \\
&= \Big\{x \;\Big|\; \Phi\Big(\frac{-m_{\tilde\zeta}(x)}{\sigma_{\tilde\zeta}(x)}\Big) \le 1 - \alpha_i\Big\} \\
&= \Big\{x \;\Big|\; \frac{-m_{\tilde\zeta}(x)}{\sigma_{\tilde\zeta}(x)} \le \Phi^{-1}(1 - \alpha_i)\Big\} \\
&= \big\{x \;\big|\; -\Phi^{-1}(1 - \alpha_i)\,\sigma_{\tilde\zeta}(x) - m_{\tilde\zeta}(x) \le 0\big\}.
\end{aligned}$$
Here $m_{\tilde\zeta}(x)$ is affine linear in $x$ and $\sigma_{\tilde\zeta}(x)$ is convex in $x$. Therefore the left-hand side of the constraint
$$-\Phi^{-1}(1 - \alpha_i)\,\sigma_{\tilde\zeta}(x) - m_{\tilde\zeta}(x) \le 0$$
is convex iff $\Phi^{-1}(1 - \alpha_i) \le 0$, which is exactly the case iff $\alpha_i \ge 0.5$. Hence, under the assumption of normal distributions and $\alpha_i \ge 0.5$, we have instead of (2.1) a deterministic convex program with constraints of the type
$$-\Phi^{-1}(1 - \alpha_i)\,\sigma_{\tilde\zeta}(x) - m_{\tilde\zeta}(x) \le 0,$$
which can be solved with standard tools of nonlinear programming.
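As a numeric sketch (not from the text), the deterministic constraint can be checked against the probability it encodes using the standard normal quantile function; the data below are invented for the illustration.

```python
import numpy as np
from scipy.stats import norm

# For jointly normal (t, h) with mean mu and covariance S, the constraint
# P(t^T x >= h) >= alpha becomes -Phi^{-1}(1-alpha)*sigma(x) - m(x) <= 0.

def chance_constraint_value(x, mu, S, alpha):
    """Nonpositive exactly when x satisfies the probabilistic constraint."""
    z = np.append(x, -1.0)               # z(x) = (x_1, ..., x_n, -1)
    m = z @ mu                           # m(x) = sum mu_j x_j - mu_{n+1}
    sigma = np.sqrt(z @ S @ z)           # sigma(x) = sqrt(z^T S z), convex in x
    return -norm.ppf(1.0 - alpha) * sigma - m

def probability(x, mu, S):
    """Direct evaluation of P(zeta(x) >= 0) with zeta(x) ~ N(m(x), sigma(x)^2)."""
    z = np.append(x, -1.0)
    m, sigma = z @ mu, np.sqrt(z @ S @ z)
    return 1.0 - norm.cdf(-m / sigma)

mu = np.array([1.0, 2.0, 0.5]); S = 0.1 * np.eye(3)
x = np.array([1.0, 1.0])
assert (chance_constraint_value(x, mu, S, alpha=0.9) <= 0) \
    == (probability(x, mu, S) >= 0.9)
```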
contained in the constraints of problem (1.1). Here Fξ̃ (·) denotes the
˜ In the following we sketch some
distribution function of the random vector ξ.
ideas underlying these bounding methods. For a more technical presentation,
the reader should consult the references provided below.
To simplify the notation, let us assume that $\tilde\xi$ is a random vector with support $\Xi \subset \mathbb{R}^n$. For any $z \in \mathbb{R}^n$, we have $F_{\tilde\xi}(z) = P(A_1 \cap \cdots \cap A_n)$ with $A_i := \{\xi \mid \xi_i \le z_i\}$. Defining
$$B_i := A_i^c = \{\xi \mid \xi_i > z_i\},$$
we get
$$A_1 \cap \cdots \cap A_n = (B_1 \cup \cdots \cup B_n)^c,$$
and consequently
$$F_{\tilde\xi}(z) = P(A_1 \cap \cdots \cap A_n) = P((B_1 \cup \cdots \cup B_n)^c) = 1 - P(B_1 \cup \cdots \cup B_n).$$
Therefore asking for the value of Fξ̃ (z) is equivalent to looking for the
probability that at least one of the events B1 , · · · , Bn occurs. Defining the
counter $\tilde\nu : \Xi \to \mathbb{N}$ by $\tilde\nu(\xi) :=$ the number of the events $B_1, \cdots, B_n$ occurring at $\xi$, we have $P(\tilde\nu \ge 1) = P(B_1 \cup \cdots \cup B_n)$. Hence finding a good approximation for $P(\tilde\nu \ge 1)$ yields at the same time a satisfactory approximation of $F_{\tilde\xi}(z)$.
Since $\binom{i}{0} = 1$, $i = 0, 1, \cdots, n$, it follows that $S_{0,n} = 1$. Furthermore, choosing $v \in \mathbb{R}^{n+1}$ according to $v_i := P(\{\xi \mid \tilde\nu(\xi) = i\})$, $i = 0, 1, \cdots, n$, it is obvious from (3.1) that $v$ solves the system of linear equations
$$\begin{array}{rcl}
v_0 + v_1 + v_2 + \cdots + v_n &=& S_{0,n}, \\
v_1 + 2v_2 + \cdots + n v_n &=& S_{1,n}, \\
v_2 + \cdots + \binom{n}{2} v_n &=& S_{2,n}, \\
\vdots \qquad\qquad & & \ \vdots \\
v_n &=& S_{n,n}.
\end{array} \tag{3.2}$$
These linear programs are feasible and bounded, and therefore solvable. So,
there exist optimal feasible 2 × 2 bases B.
Consider an arbitrary $2 \times 2$ matrix of the form
$$B = \begin{pmatrix} i & i+r \\ \binom{i}{2} & \binom{i+r}{2} \end{pmatrix},$$
for all i and r such that 1 ≤ i < n and 1 ≤ r ≤ n − i. Hence any two
columns of the coefficient matrix of (3.3) (or equivalently of (3.4)) form a
basis. The question is which one is feasible and optimal. Let us consider the
second property first. According to Proposition 1.15, Section 1.6 (page 60), a
basis B of (3.3) satisfies the optimality condition if
$$1 - e^T B^{-1} N_j \ge 0 \quad \forall j \ne i, i+r,$$
where $e^T = (1, 1)$ and $N_j$ is the $j$th column of the coefficient matrix of (3.3). Obviously, for (3.4) we have the reverse inequality as optimality condition:
$$1 - e^T B^{-1} N_j \le 0 \quad \forall j \ne i, i+r.$$
$$e^T B^{-1} N_j = j\,\frac{2i + r - j}{i(i+r)}. \tag{3.5}$$
Proposition 4.3 The basis
$$B = \begin{pmatrix} i & i+r \\ \binom{i}{2} & \binom{i+r}{2} \end{pmatrix}$$
satisfies the optimality condition
(a) for (3.3) if and only if $r = 1$ ($i$ arbitrary);
(b) for (3.4) if and only if $i = 1$ and $i + r = n$.
Proof
(a) If $r \ge 2$, we get from (3.5) for $j = i + 1$
$$e^T B^{-1} N_{i+1} = (i+1)\,\frac{2i + r - (i+1)}{i(i+r)} = \frac{i(i+r) + r - 1}{i(i+r)} > 1,$$
so that the optimality condition for (3.3) is not satisfied for $r > 1$, showing that $r = 1$ is necessary.
Now let $r = 1$. Then for $j < i$ we have, according to (3.5),
$$e^T B^{-1} N_j = j\,\frac{2i + 1 - j}{i(i+1)} = \frac{j + i^2 - (j-i)^2}{i(i+1)} < \frac{i(i+1) - (j-i)^2}{i(i+1)} < 1,$$
¹ $BB^{-1} = I$, the identity matrix!
the last inequality resulting from the fact that subtracting the denominator from the numerator yields $-(j-i)^2 < 0$. Hence in both cases the optimality condition for (3.3) is strictly satisfied.
(b) If $i + r < n$ then we get from (3.5) for $j = n$
$$e^T B^{-1} N_n = \frac{n(i+r) + n(i-n)}{i(i+r)} < 1,$$
since
$$\{\text{numerator}\} - \{\text{denominator}\} = n(i+r) + n(i-n) - i(i+r) = (n-i)(i+r-n) < 0.$$
Hence the only possible choice for a basis satisfying the optimality condition for problem (3.4) is $i = 1$, $r = n - 1$. □
As can be seen from the simplex method, a basis that satisfies the optimality
condition strictly does determine a unique optimal solution if it is feasible.
and hence
$$B^{-1}\begin{pmatrix} S_{1,n} \\ S_{2,n} \end{pmatrix} = \begin{pmatrix} S_{1,n} - \dfrac{2}{n-1}\,S_{2,n} \\[1ex] \dfrac{2}{n(n-1)}\,S_{2,n} \end{pmatrix}.$$
The last vector is nonnegative, since the definition of the binomial moments implies $(n-1)S_{1,n} - 2S_{2,n} \ge 0$ and $S_{2,n} \ge 0$. This yields for (3.4) the optimal value $S_{1,n} - \frac{2}{n}S_{2,n}$. Therefore we finally get an upper bound for $P(\tilde\nu \ge 1)$ as
$$P(\tilde\nu \ge 1) \le S_{1,n} - \frac{2}{n}\,S_{2,n}. \tag{3.7}$$
In conclusion, recalling that $F_{\tilde\xi}(z) = 1 - P(\tilde\nu \ge 1)$, we have
$$F_{\tilde\xi}(z) \ge 1 - \Big(S_{1,n} - \frac{2}{n}\,S_{2,n}\Big)$$
and
$$F_{\tilde\xi}(z) \le 1 - \Big(\frac{2}{i+1}\,S_{1,n} - \frac{2}{i(i+1)}\,S_{2,n}\Big), \quad \text{with } i - 1 = \Big\lfloor \frac{2S_{2,n}}{S_{1,n}} \Big\rfloor.$$
Another way to introduce these moments is the following. With the same
notation as at the beginning of this section, let us define new random variables
$\tilde\chi_i : \Xi \to \mathbb{R}$, $i = 1, \cdots, n$, as the indicator functions
$$\tilde\chi_i(\xi) := \begin{cases} 1 & \text{if } \xi \in B_i, \\ 0 & \text{otherwise.} \end{cases}$$
Then clearly $\tilde\nu = \sum_{i=1}^{n} \tilde\chi_i$, and
$$\binom{\tilde\nu}{k} = \binom{\tilde\chi_1 + \cdots + \tilde\chi_n}{k} = \sum_{1 \le i_1 < \cdots < i_k \le n} \tilde\chi_{i_1}\tilde\chi_{i_2}\cdots\tilde\chi_{i_k}.$$
Taking the expectation on both sides yields for the binomial moments $S_{k,n}$
$$E_{\tilde\xi}\Big[\binom{\tilde\nu}{k}\Big] = \sum_{1 \le i_1 < \cdots < i_k \le n} E_{\tilde\xi}(\tilde\chi_{i_1}\tilde\chi_{i_2}\cdots\tilde\chi_{i_k}) = \sum_{1 \le i_1 < \cdots < i_k \le n} P(B_{i_1} \cap \cdots \cap B_{i_k}).$$
such that we get from (3.7) for $P(\tilde\nu \ge 1)$ the upper bound
$$P_U = 0.24 - \frac{2}{4} \times 0.0193 = 0.23035.$$
According to (3.6), we find $i - 1 = \lfloor 2 \times 0.0193 / 0.24 \rfloor = 0$ and hence $i = 1$, so that (3.6) yields the lower bound
$$P_L = \frac{2}{2} \times 0.24 - \frac{2}{2} \times 0.0193 = 0.2207.$$
In conclusion, we get for $F_{\tilde\xi}(z) = 0.778734$ the bounds $1 - P_U \le F_{\tilde\xi}(z) \le 1 - P_L$, and hence
$$0.76965 \le F_{\tilde\xi}(z) \le 0.7793.$$
PROBABILISTIC CONSTRAINTS 245
Observe that these bounds could be derived without any specific information
about the type of the underlying probability distribution (except the
assumption of independent components made only for the sake of a simple
presentation). □
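The two bounds are easy to compute from $S_{1,n}$ and $S_{2,n}$; a minimal sketch using the example's numbers (the function names are this sketch's own):

```python
from math import floor

def upper_bound_P(S1, S2, n):
    """Upper bound (3.7) on P(nu >= 1)."""
    return S1 - (2.0 / n) * S2

def lower_bound_P(S1, S2):
    """Lower bound (3.6) on P(nu >= 1), with i - 1 = floor(2*S2/S1)."""
    i = floor(2.0 * S2 / S1) + 1
    return (2.0 / (i + 1)) * S1 - (2.0 / (i * (i + 1))) * S2

S1, S2, n = 0.24, 0.0193, 4
P_U, P_L = upper_bound_P(S1, S2, n), lower_bound_P(S1, S2)
print(P_U, P_L)           # 0.23035 and 0.2207, as in the example above
print(1 - P_U, 1 - P_L)   # the resulting sandwich around F(z)
```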
Further bounds have been derived for P (ν̃ ≥ 1) using binomial moments up
to the order m, 2 < m < n, as well as for P (ν̃ ≥ r), r > 1. For some of them
explicit formulae could also be derived, while others require the computational
solution of optimization problems with algorithms especially designed for the
particular problem structures.
Exercises
(a) Show that in any case (provided that $S_{1,n}$ and $S_{2,n}$ are binomial moments) for the optimal solution $\hat v$ of (3.3), $\sum_{i=1}^{n} \hat v_i \le 1$.
(b) If for the optimal solution $\hat v$ of (3.3) $\sum_{i=1}^{n} \hat v_i < 1$ then we have $v_0 = 1 - \sum_{i=1}^{n} \hat v_i > 0$. What does this mean with respect to $F_{\tilde\xi}(z)$?
(c) Solving (3.4) can result in $\sum_{i=1}^{n} \hat v_i > 1$. To what extent does this result improve your knowledge about $F_{\tilde\xi}(z)$?
References
[1] Borell C. (1975) Convex set functions in d-space. Period. Math. Hungar.
6: 111–136.
[2] Boros E. and Prékopa A. (1989) Closed-form two-sided bounds for
probabilities that at least r and exactly r out of n events occur. Math.
Oper. Res. 14: 317–342.
[3] Brascamp H. J. and Lieb E. H. (1976) On extensions of the Brunn–Minkowski and Prékopa–Leindler theorems, including inequalities for log concave functions, and with an application to the diffusion equation. J. Funct. Anal. 22: 366–389.
[4] Charnes A. and Cooper W. W. (1959) Chance-constrained programming.
Management Sci. 5: 73–79.
[5] Marti K. (1971) Konvexitätsaussagen zum linearen stochastischen Programmieren.
Preprocessing
pos W = {t | t = W y, y ≥ 0}.
that is, a matrix containing the coefficient matrix, the cost vector and an extra
column. To see the importance of the extra column, consider the following
interpretation of pos W (remember that pos W equals the set of all positive
linear combinations of columns from W ):
$$\operatorname{pos}\begin{pmatrix} q_1 & \cdots & q_n & 1 \\ W_1 & \cdots & W_n & 0 \end{pmatrix} = \Big\{\begin{pmatrix} q \\ W \end{pmatrix} \;\Big|\; W = \sum_{\lambda_k \ge 0} \lambda_k W_k,\ q \ge \sum_{\lambda_k \ge 0} \lambda_k q_k \Big\}.$$
in a sequential manner until we are left with a minimal (but not necessarily
unique) set of columns. A column thrown out in this process will never be
part of an optimal solution, and is hence not needed. It can be dropped. From
a modelling point of view, this means that the modeller has added an activity
that is clearly inferior. Knowing that it is inferior should add to the modeller's
understanding of his model.
A column that is not a part of the frame of $\operatorname{pos} W$, but is a part of the frame of the augmented cone (with the cost row included), is one that does not add to our production possibilities, but whose existence might add to our profit.
W y ≤ h, y ≥ 0.
Let W j be the jth row of W , such that the jth inequality is given by
W j y ≤ hj . A row j is not needed if there exists a vector α ≥ 0 such that
$$\sum_{i \ne j} \alpha_i W^i = W^j \quad \text{and} \quad \sum_{i \ne j} \alpha_i h_i \le h_j.$$
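This test is itself a small linear program: minimize $\sum_{i \ne j} \alpha_i h_i$ subject to the equality constraints, and compare the optimal value with $h_j$. A hedged sketch using scipy (the function name and tolerance are this sketch's own):

```python
import numpy as np
from scipy.optimize import linprog

def row_is_redundant(W, h, j):
    """Row j of W y <= h is unnecessary if a nonnegative combination of the
    other rows reproduces W^j with right-hand side no larger than h_j."""
    others = [i for i in range(W.shape[0]) if i != j]
    res = linprog(c=h[others],                 # minimize sum alpha_i h_i
                  A_eq=W[others].T,            # sum alpha_i W^i = W^j
                  b_eq=W[j],
                  bounds=[(0, None)] * len(others))
    return res.success and res.fun <= h[j] + 1e-9
```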
There is another important aspect of the polar cone pos W ∗ that we have
not yet discussed. It is indicated in Figure 3 by showing that the generators
are pairwise normals. However, that is slightly misleading, so we have to turn
to a three-dimensional figure to understand it better. We shall also need the
term facet. Let a cone pos W have dimension k. Then every cone K positively
spanned by k−1 generators from pos W , such that K belongs to the boundary
of pos W , is called a facet. Consider Figure 4.
What we note in Figure 4 is that the generators are not pairwise normals,
but that the facets of one cone have generators of the other as normals. This
goes in both directions. Therefore, when we state that $h \in \operatorname{pos} W$ if and only if $h^T y \le 0$ for all generators $y$ of $\operatorname{pol}\operatorname{pos} W$, we are in fact saying that $h$ represents a feasible problem either because it is a positive linear combination of columns in $W$, or because it satisfies the inequalities implied by the facets of $\operatorname{pos} W$. In still
other words, the point of finding W ∗ is not so much to describe a new cone,
but to replace the description of pos W in terms of generators with another
in terms of inequalities.
This is useful if the number of facets is not too large. Generally speaking,
performing an inner product of the form bT y is very cheap. For those who
know anything about parallel processing, it is worth noting that it can be
pipelined on a vector processor and the different inner products can be done
in parallel. And, of course, as soon as we find one positive inner product, we
can stop—the given recourse problem is infeasible.
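The test then amounts to a handful of inner products; a minimal sketch (the function name is an assumption of this sketch):

```python
import numpy as np

def feasible(h, W_star, tol=1e-9):
    """h lies in pos W exactly when h has nonpositive inner product with
    every generator (column) of the polar cone matrix W_star."""
    products = W_star.T @ h          # one inner product per generator
    return bool(np.all(products <= tol))
```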
Readers familiar with extreme point enumeration will see that going from the generator description of $\operatorname{pos} W$ to the facet description $W^*$ is closely related to enumerating the extreme points of a polyhedron.
$$W^* := \begin{pmatrix} 1 & 0 & \cdots & 0 & -1 \\ 0 & 1 & \cdots & 0 & -1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & -1 \end{pmatrix} \qquad \text{or} \qquad W^* := \begin{pmatrix} 1 & 0 & \cdots & 0 & -1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 & 0 & -1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & 0 & 0 & \cdots & -1 \end{pmatrix}$$
Figure 6 The cones pos W and pol pos W before any column has been added
to W .
Example 5.1 Let us turn to a small example to see how procedure support
progresses. Since pos W and pol pos W live in the same dimension, we can
draw them side by side.
Let us initially assume that
$$W = \begin{pmatrix} 3 & 1 & -1 & -2 \\ 1 & 1 & 2 & 1 \end{pmatrix}.$$
The first thing to do, according to procedure support, is to subject W to a
frame finding algorithm, to see if some columns are not needed. If we do that
(check it to see that you understand frames) we end up with
$$W = \begin{pmatrix} 3 & -2 \\ 1 & 1 \end{pmatrix}.$$
Having reduced W , we then initialize W ∗ to span the whole space. Consult
Figure 6 for details. We see there that
$$W^* = \begin{pmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{pmatrix}.$$
Consult procedure support. From there, it can be seen that the approach
is to take one column from W at a time, and with it perform some calculations.
Figure 6 shows the situation before we consider the first column of W . Calling
it pos W is therefore a bit imprecise. The main point, however, is that the
left and right parts correspond. If W has no columns then pol pos W spans
the whole space.
Now, let us take the first column from W . It is given by W1 = (3, 1)T . We
next find the inner products between W1 and all four columns of W ∗ . We get
α = (3, 1, −3, −1)T .
Figure 7 The cones pos W and pol pos W after one column has been added
to W .
In other words, the sets $I_+ = \{1, 2\}$ and $I_- = \{3, 4\}$ have two members each, while $I_0 = \emptyset$. What this means is that two of the columns must be removed, namely those in $I_+$, and two kept, namely those in $I_-$. But to avoid losing parts of the space, we now calculate four columns $C_{kj}$. First, we get $C_{13} = C_{24} = 0$. They are not interesting. But the other two are useful:
$$C_{14} = \begin{pmatrix} 1 \\ 0 \end{pmatrix} + 3\begin{pmatrix} 0 \\ -1 \end{pmatrix} = \begin{pmatrix} 1 \\ -3 \end{pmatrix}, \qquad C_{23} = \begin{pmatrix} 0 \\ 1 \end{pmatrix} + \frac{1}{3}\begin{pmatrix} -1 \\ 0 \end{pmatrix} = \begin{pmatrix} -\frac{1}{3} \\ 1 \end{pmatrix}.$$
Since we are only interested in directions, we scale the latter to $(-1, 3)^T$. This
brings us into Figure 7. Note that one of the columns in pos $W^*$ is drawn
with dots. This is done to indicate that if procedure framebylp is applied to
$W^*$, that column will disappear. (However, that is not a unique choice.)
Note that if W had had only this one column, then $W^*$, as it appears in
Figure 7, would be the polar matrix of that one-column W. This is a general property
of procedure support: at any iteration, the present $W^*$ is the polar matrix
of the matrix containing those columns we have looked at so far.
Now let us turn to the second column of W. We find
$$\alpha^T = (-2, 1)\,W^* = (-2, 1)\begin{pmatrix} -1 & 1 & -1 \\ 3 & -3 & 0 \end{pmatrix} = (5, -5, 2).$$
We must now calculate two extra columns, namely $C_{12}$ and $C_{32}$. The first
gives 0, so it is not of interest. For the latter we get
$$C_{32} = \begin{pmatrix} -1 \\ 0 \end{pmatrix} + \tfrac{2}{5}\begin{pmatrix} 1 \\ -3 \end{pmatrix} = \begin{pmatrix} -\tfrac{3}{5} \\ -\tfrac{6}{5} \end{pmatrix},$$
which we scale to $(-1, -2)^T$. This gives us Figure 8. To the left we have pos W,
with W being the matrix we started out with, and to the right its polar cone.
A column represents a feasible problem if it is inside pos W, or equivalently, if
it has a nonpositive inner product with all generators of pos $W^*$ = pol pos W.
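The update step used in these calculations can be sketched as follows. This is a simplified rendering of one iteration of procedure support (names are ours); the combination rule is the one that produced $C_{14}$, $C_{23}$ and $C_{32}$ above, and the result should afterwards be reduced to a frame, as in the text.

import numpy as np

def add_column(w, W_star, tol=1e-9):
    # One iteration: add column w of W and update the generators
    # of the polar cone (the columns of W_star).
    alpha = w @ W_star
    I_plus = np.where(alpha > tol)[0]        # generators to be removed
    I_minus = np.where(alpha < -tol)[0]      # generators to be kept
    I_zero = np.where(np.abs(alpha) <= tol)[0]
    new_cols = [W_star[:, i] for i in np.concatenate((I_minus, I_zero))]
    for k in I_plus:
        for j in I_minus:
            # The combination is orthogonal to w: w @ c = 0 by construction
            c = W_star[:, k] - (alpha[k] / alpha[j]) * W_star[:, j]
            if np.linalg.norm(c) > tol:
                new_cols.append(c)
    return np.column_stack(new_cols)

W_star = np.array([[1.0, 0.0, -1.0, 0.0],
                   [0.0, 1.0, 0.0, -1.0]])
W_star = add_column(np.array([3.0, 1.0]), W_star)   # first column of W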
Figure 8 The cones pos W and pol pos W after two columns have been added
to W .
Hence
$$(w^*)^T T(\xi) x \ge (w^*)^T (h_0 + H\xi) \quad \text{for all } \xi.$$
If randomness affects both h and T, as indicated above, we must, at least in
principle, create one inequality per ξ for each column from $W^*$. However, if
$T(\xi) \equiv T_0$, we get a much easier set-up by calculating
$$(w^*)^T T_0 x \;\ge\; \max_{t \in \Xi}\, (w^*)^T \bigl[ h_0 + H t \bigr],$$
where Ξ is the support of $\tilde{\xi}$. If we do this for all columns of $W^*$ and add the
resulting inequalities in terms of x to Ax = b, we achieve relatively complete
recourse. Hence relatively complete recourse can always be generated, which is
useful because testing for it directly is very hard. With relatively complete
recourse we never have to worry about feasibility.
Since the inequalities resulting from the columns of $W^*$ can be dominated
by others (in particular, if $T(\xi)$ is truly random), the new rows, together with
those in Ax = b, should be subjected to row removal, as outlined earlier in
this chapter.
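In code, the construction of these induced constraints for the fixed-T case might look as follows (our own naming; a box support Ξ = [lo, hi] is assumed, so the maximum of a linear function over Ξ can be taken coordinate by coordinate):

import numpy as np

def induced_constraints(W_star, T0, h0, H, lo, hi):
    # For each generator w* of pol pos W, build the inequality
    #   (w*^T T0) x >= max_{t in [lo, hi]} w*^T (h0 + H t),
    # returned as pairs (a, rhs) meaning a @ x >= rhs.
    rows = []
    for wstar in W_star.T:
        a = wstar @ T0
        g = wstar @ H        # coefficients of t in w*^T (h0 + H t)
        rhs = wstar @ h0 + np.sum(np.where(g >= 0, g * hi, g * lo))
        rows.append((a, rhs))
    return rows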
Figure 9 Illustration of feasibility: the cone pos W and the vectors $h_1$, $h_2$ and $h_3$.
In Chapter 3 (page 150) we discussed the set A, a set of ξ values such
that if $h_0 + H\xi - T(\xi)x$ produces a feasible second-stage problem for all $\xi \in A$,
then the problem will be feasible for all possible values of $\tilde{\xi}$. We pointed out
that in the worst case A had to contain all extreme points of the support of $\tilde{\xi}$.
Assume that the second stage is given by
$$Q(x, \xi) = \min_y \{ q(\xi)^T y \mid W y = h_0 + H\xi - T_0 x,\; y \ge 0 \},$$
where W is fixed and $T(\xi) \equiv T_0$. This covers many situations. In $R^2$ consider
the example in Figure 9, where $\tilde{\xi} = (\tilde{\xi}_1, \tilde{\xi}_2, \tilde{\xi}_3)$.
Since $h_1 \in \text{pos } W$, we can safely fix $\tilde{\xi}_1$ at its lowest possible value, since if
things are going to go wrong, they must go wrong for $\xi_1^{\min}$. In other
words, if $h_0 + H\hat{\xi} - T_0 x \in \text{pos } W$ for $\hat{\xi} = (\xi_1^{\min}, \hat{\xi}_2, \hat{\xi}_3)$, then so is any other
vector with $\tilde{\xi}_2 = \hat{\xi}_2$ and $\tilde{\xi}_3 = \hat{\xi}_3$, regardless of the value of $\tilde{\xi}_1$. Similarly, since
$-h_2 \in \text{pos } W$, we can fix $\tilde{\xi}_2$ at its largest possible value. Neither $h_3$ nor $-h_3$
is in pos W, so there is nothing to do with $\tilde{\xi}_3$.
Hence, to check whether x yields a feasible solution, we must check whether
$h_0 + H\xi - T_0 x \in \text{pos } W$ for $\xi = (\xi_1^{\min}, \xi_2^{\max}, \xi_3^{\min})^T$ and $\xi = (\xi_1^{\min}, \xi_2^{\max}, \xi_3^{\max})^T$.
Hence in this case A will contain only two points instead of $2^3 = 8$. In general,
we see that whenever a column from H, in either its positive or negative
version, belongs to pos W, the corresponding random variable can be fixed at
one of its extreme values, thereby halving the number of points needed in A.
Exercises
1. Let W be the coefficient matrix for the following set of linear equations:
$$x + \tfrac{1}{2} y - z + s_1 = 0, \qquad 2x + z + s_2 = 0, \qquad x, y, z, s_1, s_2 \ge 0.$$
$$x + y + z \le 4, \qquad 2x + z \le 5, \qquad y + z \le 8, \qquad x, y, z \ge 0.$$
(a) Are there any columns that are not needed for feasibility? (Remember
the slack variables!)
(b) Let W contain the columns that were needed from question (a),
including the slacks. Try to find the generators of pol pos W by
geometric arguments, i.e. draw a picture.
6 Network Problems
The purpose of this chapter is to look more specifically at networks. There are
several reasons for doing this. First, networks are often easier to understand.
Some of the results we have outlined earlier will be repeated here in a network
setting, and that might add to understanding of the results. Secondly, some
results that are stronger than the corresponding LP results can be obtained
by utilizing the network structure. Finally, some results can be obtained that
do not have corresponding LP results to go with them. For example, we shall
spend a section on PERT problems, since they provide us with the possibility
of discussing many important issues.
The overall setting will be as before. We shall be interested in two-
or multistage problems, and the overall solution procedures will be the
same. Since network flow problems are nothing but specially structured LPs,
everything we have said before about LPs still holds. The bounds we have
outlined can be used, and the L-shaped decomposition method, with and
without bounds, can be applied as before. We should like to point out,
though, that there exists one special case where scenario aggregation looks
more promising for networks than for general LPs: that is the situation where
the overall problem is a network. This may require some more explanation.
When we discuss networks in this chapter, we refer to a situation in which
the second stage (or the last stage in a multistage setting) is a network.
We shall mostly allow the first stage to be a general linear program. This
rather limited view of a network problem is caused by properties of the L-
shaped decomposition method (see page 159). The computational burden in
that algorithm is the calculation of Q(x̂), the expected recourse cost, and to
some extent the check of feasibility. Both those calculations concern only the
recourse problem. Therefore, if that problem is a network, network algorithms
can be used to speed up the L-shaped algorithm.
What if the first-stage problem is also a network? Example 2.2 (page 112)
was such an example. If we apply the L-shaped decomposition method to
that problem, the network structure of the master problem is lost as soon as
feasibility and optimality cuts are added. This is where scenario aggregation,
266 STOCHASTIC PROGRAMMING
outlined in Section 2.6, can be of some use. The reason is that, throughout the
calculations, individual scenarios remain unchanged in terms of constraints, so
that structure is not lost. A nonlinear term is added to the objective function,
however, so if the original problem was linear, we are now in a setting of
quadratic objectives and linear (network) constraints. If the original problem
was a nonlinear network, the added terms will not increase complexity at all.
6.1 Terminology
Furthermore, we have $F^+(1) = \{1, 2, 3\}$ and $F^*(1) = \{1, 2, 3, 4\}$,
since we can reach nodes 2 and 3 in one step, but we need two steps to reach
node 4. Node 1 itself is in both sets by definition.
Two examples of predecessors of a node are $B^+(1) = \{1\}$ and $B^+(2) = \{1, 2\}$,
since node 1 has no predecessors, and nodes 2 and 3 can be reached from node
1.
A common problem in network flows is the min cost network flow problem. It
is given as follows: minimize the total flow cost $q^T y$ subject to conservation of
flow in every node (flow in, including external flow, equals flow out) and, in the
capacitated case, arc capacities $0 \le y \le c$. The coefficient matrix for this
problem has rank 3. Therefore the node–arc incidence matrix with one row
removed has three rows, and is given by
$$W' = \begin{pmatrix} 1 & 1 & 0 & 0 & 0 \\ -1 & 0 & 1 & 0 & 1 \\ 0 & -1 & 0 & 1 & -1 \end{pmatrix}.$$
□
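Such a problem is an ordinary LP, so a general solver can handle it directly. In the sketch below the costs, external flows and the common arc capacity are invented for illustration; only the matrix W′ is from the example above.

import numpy as np
from scipy.optimize import linprog

W_prime = np.array([[ 1,  1, 0, 0,  0],     # node 1
                    [-1,  0, 1, 0,  1],     # node 2
                    [ 0, -1, 0, 1, -1]])    # node 3 (node 4 is the slack node)
q = np.array([1.0, 2.0, 1.0, 1.0, 1.0])     # hypothetical arc costs
b = np.array([2.0, 0.0, 0.0])               # hypothetical external flows
res = linprog(q, A_eq=W_prime, b_eq=b, bounds=(0, 3.0), method="highs")
print(res.x, res.fun)   # min cost flow and its cost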
Proposition 6.1 A capacitated network flow problem with total supply equal
to total demand is feasible iff for every cut $Q = [Y, N \setminus Y]$ we have
$b(Y)^T \beta \le a(Q^+)^T \gamma$.
If the arcs in the network are uncapacitated, the result is somewhat simpler,
namely the following.

Proposition 6.2 An uncapacitated network flow problem with total supply
equal to total demand is feasible iff $b(Y)^T \beta \le 0$ for every cut $Q = [Y, N \setminus Y]$
with $Q^+ = \emptyset$.
The latter proposition is of course just a special case of the first where we take
into account that whenever a cut contains an uncapacitated arc, the capacity
of that cut is +∞, and hence the inequality from Proposition 6.1 is always
satisfied.
The above two propositions are very simple in nature. However, from a
computational point of view, they are not very useful. Both require that we
look at all subsets Y of N, in other words $2^n$ subsets. For reasonably large n it
is not computationally feasible to enumerate subsets this way. Another
problem, which might not be obvious when reading the two propositions, is
that they are not "if and only if" statements in a very useful sense. There is no
guarantee that inequalities arising from the propositions are indeed needed.
We might, and most probably will, end up with inequalities that are implied
by other inequalities. A key issue in this respect is the connectedness of a
network. We defined earlier that a network is connected if for all $Y \subset N$ we
have $Q = [Y, N \setminus Y] \neq \emptyset$. It is reasonably easy to check the connectedness of
a network. Details are given in function Connected in Figure 2. Note that
we use $F^*$ and $B^*$. If they are not available, we can also use $F^+$ and $B^+$, or
calculate $F^*$ and $B^*$, which is quite simple.
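Function Connected itself is not shown here, but checking connectedness is a standard graph search. A compact stand-in (our own naming), treating arcs as undirected for the purpose of the test:

def connected(node_set, arcs):
    # True iff the subnetwork induced by node_set is connected,
    # where arcs is a list of (i, j) pairs.
    nodes = set(node_set)
    if not nodes:
        return True
    adj = {i: set() for i in nodes}
    for i, j in arcs:
        if i in nodes and j in nodes:   # keep only arcs inside the node set
            adj[i].add(j)
            adj[j].add(i)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for k in adj[stack.pop()]:
            if k not in seen:
                seen.add(k)
                stack.append(k)
    return seen == nodes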
Using the property of connectedness, it is possible to prove the following
stronger result.

Proposition 6.3 The inequalities $b(Y)^T \beta \le a(Q^+)^T \gamma$ and
$b(N \setminus Y)^T \beta \le a(Q^-)^T \gamma$ are both needed if and only if G(Y) and $G(N \setminus Y)$ are both connected.
As we can see, these results are very similar. They both state that an
inequality satisfying the requirements of Proposition 6.1 or 6.2 must be kept
if and only if both G(Y) and $G(N \setminus Y)$ are connected.
Proposition 6.3 states that the latter inequality is not needed, because
$G(\{2, 3\})$ is not connected. From the inequalities themselves, we easily see
that if the first two are satisfied, then the third is automatically true. It is
perhaps slightly less obvious that, for the very same reason, the inequality
is also not needed. It is implied by the requirement that total supply must
equal total demand, together with the companions of the first two inequalities above.
(Remember that each node set gives rise to two inequalities.) More specifically,
the inequality can be obtained by adding the following two inequalities and
the requirement that total supply equals total demand.
Figure 3 Example network 1 (nodes 1–5; arcs a–g).
Once you have looked at this for a while, you will probably realize that the
part of Proposition 6.3 that says that if G(Y ) or G(N \ Y ) is disconnected
then we do not need any of the inequalities is fairly obvious. The other part of
the proposition is much harder to prove, namely that if G(Y ) and G(N \ Y )
are both connected then the inequalities corresponding to Y and N \ Y are
both needed. We shall not try to outline the proof here.
Propositions 6.3 and 6.4 might not seem very useful. A straightforward use
of them could still require the enumeration of all subsets of N , and for each
such subset check if G(Y ) and G(N \ Y ) are both connected. However, we can
obtain more than that.
The first important observation is that the results refer to the connectedness
of two networks—both the one generated by Y and the one generated by N \Y .
Consider the capacitated case. Let $Y_1 = N \setminus Y$. If both networks are connected,
we have two inequalities that we need, namely
$$b(Y)^T \beta \le a(Q^+)^T \gamma$$
and
$$b(Y_1)^T \beta = b(N \setminus Y)^T \beta \le a(Q^-)^T \gamma.$$
On the other hand, if at least one of the networks is disconnected, neither
inequality will be needed. Therefore checking each subset of N means doing
twice as much work as needed. If we are considering Y and discover that both
G(Y ) and G(Y1 = N \ Y ) are connected, we write down both inequalities at
the same time. An easy way to achieve this is to disregard some node (say
node n) from consideration in a full enumeration. This way, we will achieve
n ∈ N \ Y for all Y we investigate. Then for each cut where the connectedness
requirement is satisfied we write down two inequalities. This will halve the
number of subsets to be checked.
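A brute-force rendering of this halving scheme (our own naming, building on the connected sketch above; it is only meant for small node sets, since the enumeration is still exponential):

from itertools import combinations

def needed_cuts(nodes, arcs):
    # Fix the last node to stay outside Y; every Y with both G(Y) and
    # G(N \ Y) connected then yields two needed inequalities
    # (Proposition 6.3), one for Y and one for N \ Y.
    nodes = list(nodes)
    for r in range(1, len(nodes)):
        for Y in combinations(nodes[:-1], r):
            Y = set(Y)
            rest = set(nodes) - Y
            if connected(Y, arcs) and connected(rest, arcs):
                yield Y, rest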
For uncapacitated networks there is no corresponding result. If we have a
Y with Q+ = ∅ then Q− 6= ∅, since otherwise we should have a disconnected
network to start out with. But, of course, the uncapacitated problem is easier
in other respects.
In some cases it is possible to reduce the complexity of a calculation by
collapsing nodes. By this we understand the process of replacing a set of
nodes by one new node. Any other node that had an arc to or from one of the
collapsed nodes will afterwards have an arc to or from the new node: one for
each original arc. Of course, in the uncapacitated case it is enough to have one
arc from one of the original nodes to the new node. A simple but important
use of node collapsing is given by the following proposition.
Figure 4 Example network 2, assumed to be uncapacitated.

[Collapsed version of the network: nodes 1, 2, 3 and 4 merged into a single node, joined to node 5 by the arcs f and g.]
Proposition 6.7 If $B^+(i) \cup F^+(i) = \{i, j\}$ then nodes i and j can be collapsed
after the inequalities generated by CreateIneq({i}) have been created.

This result holds for both the capacitated and uncapacitated cases, as long
as we use the relevant version of CreateIneq. The only set Y with $i \in Y$ but
$j \notin Y$, such that both G(Y) and $G(N \setminus Y)$ are connected, is $Y = \{i\}$.
The reason is that node j blocks node i's connection to all
other nodes. Therefore, after calling CreateIneq({i}), we can safely collapse
node i into node j. Examples of this can be found in Figure 8 (see e.g. nodes 4
and 5). This result is easy to implement, since all we have to do is run through
all nodes, one at a time, and look for nodes satisfying $B^+(i) \cup F^+(i) = \{i, j\}$.
procedure AllFacets;
begin
  TreeRemoval;      (* collapse away tree-like parts of the network, cf. Proposition 6.7 *)
  CreateIneq(∅);    (* the inequalities for the empty node set *)
  Y := ∅;
  if network capacitated then W := N \ {n};    (* leave node n out, cf. the halving argument above *)
  if network uncapacitated then W := N;
  Facets(Y, W);     (* enumerate candidate node sets Y within W *)
end;
The interesting property of the cone pos $W^*$ is that the recourse problem is
feasible if and only if a given right-hand side has a nonpositive inner product
with all generators of the cone. And if there are not too many generators, it
is much easier to perform inner products than to check if a linear program is
feasible. Refer to Figure 4 for an illustration in three dimensions.¹ To find the
polar cone, we used procedure support in Figure 5. The major computational
burden in that procedure is the call to procedure framebylp, outlined in
Figure 1. In principle, to determine if a column is part of the frame, we must
remove the column from the matrix, put it as a right-hand side, and see if
the corresponding system of linear equations has a nonnegative solution or not. If it
has such a solution, the column is not part of the frame, and can be removed. An
important property of this procedure is that to determine if a column can be
¹ Figures and procedures referred to in this subsection are contained in Chapter 5.
discarded, we have to use all other columns in the test. This is a major reason
why procedure framebylp is so slow when the number of columns gets very
large.
So, a generator $w^*$ of the cone pos $W^*$ has the property that a right-hand
side h must satisfy $h^T w^* \le 0$ to be feasible. In the uncapacitated network
case we saw that a right-hand side β had to satisfy $b(Y)^T \beta \le 0$ to represent a
feasible problem. Therefore the index vector b(Y) corresponds exactly to the
column $w^*$. And calling procedure framebylp to remove the columns that
are not in the frame of the cone pos $W^*$ corresponds to using Proposition 6.4.
In other words, the index vector of a node set satisfying Proposition 6.4
corresponds to a column of $W^*$.
Computationally there are major differences, though. First, to find a
candidate for W ∗ , we had to start out with W , and use procedure support,
which is an iterative procedure. The network inequalities, on the other hand,
are produced more directly by looking at all subsets of nodes. But the
most important difference is that, while the use of procedure framebylp,
as just explained, requires all columns to be available in order to determine if
one should be discarded, Proposition 6.4 is totally local. We can pick up
an inequality and determine if it is needed without looking at any other
inequalities. With possibly millions of candidates, this difference is crucial.
We did not develop the LP case for explicit bounds on variables. If such
bounds exist, they can, however, be put in as explicit constraints. If so, a
column $w^*$ from $W^*$ corresponds to the index vector
$$\begin{pmatrix} b(Y) \\ -a(Q^+) \end{pmatrix}.$$
Let us now discuss how the results obtained in the previous section can help
us, and how they can be used in a setting that deserves the term preprocessing.
Let us first repeat some of our terminology, in order to see how this fits in
with our discussions in the LP setting.
A two-stage stochastic linear programming problem where the second-stage
problem is a directed capacitated network flow problem can be formulated as
follows:
$$\min_x \; c^T x + Q(x) \quad \text{s.t.} \quad Ax = b, \; x \ge 0,$$
where
$$Q(x) = \sum_j Q(x, \xi^j) p_j$$
and
$$Q(x, \xi) = \min_{y^1} \bigl\{ (q^1)^T y^1 \bigm| W' y^1 = h_0^1 + H^1 \xi - T^1(\xi) x, \; 0 \le y^1 \le h_0^2 + H^2 \xi - T^2(\xi) x \bigr\},$$
where $W'$ is the node–arc incidence matrix for the network. To fit into a more
general setting, let
$$W = \begin{pmatrix} W' & 0 \\ I & I \end{pmatrix},$$
so that $Q(x, \xi)$ can also be written as
$$Q(x, \xi) = \min_y \{ q^T y \mid W y = h_0 + H\xi - T(\xi) x, \; y \ge 0 \},$$
where $y = \begin{pmatrix} y^1 \\ y^2 \end{pmatrix}$, $y^2$ is the slack of $y^1$, $q = \begin{pmatrix} q^1 \\ 0 \end{pmatrix}$, $h_0 = \begin{pmatrix} h_0^1 \\ h_0^2 \end{pmatrix}$, $T(\xi) = \begin{pmatrix} T^1(\xi) \\ T^2(\xi) \end{pmatrix}$ and $H = \begin{pmatrix} H^1 \\ H^2 \end{pmatrix}$. Given our definition of β and γ, we have, for a
given $\hat{x}$,
$$\begin{pmatrix} \beta \\ \gamma \end{pmatrix} = h_0 + H\xi - T(\xi)\hat{x} = h_0 + \sum_i h_i \xi_i - T(\xi)\hat{x}.$$
Using the inequalities derived in the previous section, we can proceed to
transform these inequalities into inequalities in terms of x. By adding these
inequalities to the first-stage constraints Ax = b, we get relatively complete
recourse, i.e. we guarantee that any x satisfying the (expanded) first-stage
constraints will yield a feasible second-stage problem for any value of $\tilde{\xi}$. An
inequality has the form
$$b[A(Y)]^T \beta = \sum_{i \in A(Y)} \beta(i) \;\le\; \sum_{k \in Q^+} \gamma(k) = a(Q^+)^T \gamma.$$
Collecting all x terms on the left-hand side and all other terms on the right-hand
side, we get the following expression:
$$\Bigl[-b[A(Y)]^T \Bigl(T_0^1 + \sum_j T_j^1 \xi_j\Bigr) + a(Q^+)^T \Bigl(T_0^2 + \sum_j T_j^2 \xi_j\Bigr)\Bigr] x
\;\le\; \sum_i \bigl\{-b[A(Y)]^T h_i^1 + a(Q^+)^T h_i^2\bigr\} \xi_i \;-\; b[A(Y)]^T h_0^1 + a(Q^+)^T h_0^2.$$
In particular, if $T(\xi) \equiv T_0$, it suffices to require
$$\bigl\{-b[A(Y)]^T T_0^1 + a(Q^+)^T T_0^2\bigr\} x
\;\le\; \min_{\xi \in \Xi} \sum_i \bigl\{-b[A(Y)]^T h_i^1 + a(Q^+)^T h_i^2\bigr\} \xi_i \;-\; b[A(Y)]^T h_0^1 + a(Q^+)^T h_0^2.$$
Consider the simple network in Figure 11. It represents the flow of sewage (or
some other waste) from three cities, represented by nodes 1, 2 and 3.
All three cities produce sewage, and they have local treatment plants to take
care of some of it. Both the amount of sewage from a city and its treatment
capacity vary, and the net variation from a city is given next to the node
representing the city. For example, City 1 always produces more than it can
treat, and the surplus varies between 10 and 20 units per unit time. City 2,
on the other hand, sometimes can treat up to 5 units of sewage from other
cities, but at other times has as much as 15 units it cannot itself treat. City
3 always has extra capacity, and that varies between 5 and 15 units per unit
time.
The solid lines in Figure 11 represent pipes through which sewage can be
pumped (at a cost). Assume all pipes have a capacity of up to 5 units per unit
time. Node 4 is a common treatment site for the whole area, and its capacity
is so large that for practical purposes we can view it as being infinite. Until
now, whenever a city had sewage that it could not treat itself, it first tried to
send it to other cities, or site 4, but if that was not possible, the sewage was
simply dumped in the ocean. (It is easy to see that that can happen. When
City 1 has more than 10 units of untreated sewage, it must dump some of it.)
New rules are being introduced, and within a short period of time dumping
sewage will not be allowed. Four projects have been suggested.
• Increase the capacity of the pipe from City 1 (via City 2) to site 4 by $x_1$
units (per unit time).
• Increase the capacity of the pipe from City 2 to City 3 by $x_2$ units (per
unit time).
• Increase the capacity of the pipe from City 1 (via City 3) to site 4 by $x_3$
units (per unit time).
• Build a new treatment plant in City 1 with a capacity of $x_4$ units (per unit
time).
It is not quite clear if capacity increases can take on any values, or just some
predefined ones. Also, the cost structure of the possible investments is not
yet clear. Even so, we are asked to analyse the problem, and create a better
basis for decisions.
The first thing we must do, to use the procedures of this chapter, is to
make sure that, technically speaking, we have a network (as defined at the
start of the chapter). A close look will reveal that a network must have equality
constraints at each node, i.e. flow in must equal flow out. That is not the case
in our little network. If City 3 has spare capacity, we do not have to send
extra sewage to the city, we simply leave the capacity unused if we do not
need it. The simplest way to take care of this is to introduce some new arcs
in the network. They are shown with dotted lines in Figure 11. Finally, to
have supply equal to demand in the network (remember from Proposition 6.1
that this is needed for feasibility), we let the external flow in node 4 be the
negative of the sum of external flows in the other three nodes.
You may wonder if this rewriting makes sense. What does it mean when
“sewage” is sent along a dotted line in the figure? The simple answer is that
the amount exactly equals the unused capacity in the city to which the arc
goes. (Of course, with the given numbers, we realize that no arc will be needed
from node 4 to node 1, but we have chosen to add it for completeness.)
Now, to learn something about our problem, let us apply Proposition 6.3
to arrive at a number of inequalities. You may find it useful to try to write
them down. We shall write down only some of them. The reason for leaving
out some is the following observation: any node set Y that is such that Q+
contains a dotted arc from Figure 11 will be uninteresting, because
a(Q+ )T γ = ∞,
so that the inequality says nothing interesting. The remaining inequalities are
as follows (where we have used that all existing pipes have a capacity of 5 per
unit time).
β1 ≤ 10 + x1 + x3 + x4 ,
β2 ≤ 10 + x1 + x2 ,
β3 ≤ 5 + x3 ,
β1 + β2 + β3 ≤ 10 + x1 + x3 + x4 , (4.1)
β1 + β2 ≤ 15 + x1 + x2 + x3 + x4 ,
β1 + β3 ≤ 10 + x1 + x3 + x4 ,
β2 + β3 ≤ 10 + x1 + x3 .
Let us first note that if we set all xi = 0 in (4.1), we end up with a number
of constraints that are not satisfied for all possible values of β. Hence, as we
already know, there is presently a chance that sewage will be dumped.
However, our interest is mainly to find out about which investments to
make. Let us therefore rewrite (4.1) in terms of xi rather than βi :
x1 + x3 + x4 ≥ β1 − 10 ≥ 10,
x1 + x2 ≥ β2 − 10 ≥ 5,
x3 ≥ β3 − 5 ≥ −10,
x1 + x3 + x4 ≥ β1 + β2 + β3 − 10 ≥ 20,      (4.2)
x1 + x2 + x3 + x4 ≥ β1 + β2 − 15 ≥ 20,
x1 + x3 + x4 ≥ β1 + β3 − 10 ≥ 5,
x1 + x3 ≥ β2 + β3 − 10 ≥ 0.

Most of these constraints are implied by the others. What we are left with is

x1 + x2 ≥ 5,
x1 + x3 + x4 ≥ 20.      (4.3)
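The reduction from (4.2) to (4.3) is easy to check numerically. The sketch below recomputes the worst-case right-hand sides from the supports β1 ∈ [10, 20], β2 ∈ [−5, 15] and β3 ∈ [−15, −5]:

ineqs = [((1, 0, 0), -10), ((0, 1, 0), -10), ((0, 0, 1), -5),
         ((1, 1, 1), -10), ((1, 1, 0), -15), ((1, 0, 1), -10),
         ((0, 1, 1), -10)]                      # (beta coefficients, constant)
support = [(10, 20), (-5, 15), (-15, -5)]       # ranges of beta1, beta2, beta3

for coeff, const in ineqs:
    # worst case: maximize the beta expression over the box support
    rhs = sum(c * (hi if c > 0 else lo) for c, (lo, hi) in zip(coeff, support))
    print(coeff, "->", rhs + const)   # prints 10, 5, -10, 20, 20, 5, 0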
Even though we know nothing so far about investment costs and pumping
costs through the pipes, we know a lot about what limits the options.
Investments of at least five units must be made on a combination of x1 and x2 .
What this seems to say is that the capacity out of City 2 must be increased by
at least 5 units. It is slightly more difficult to interpret the second inequality. If
we see both building pipes and a new plant in City 1 as increases in treatment
capacity (although they are of different types), the second inequality seems to
say that a total of 15 units must be built to facilitate City 1. However, a closer
look at which cut generated the inequality reveals that a more appropriate
interpretation is to say that the three cities, when they are seen as a whole,
must obtain extra capacity of 15 units. It was the node set Y = {1, 2, 3} that
generated the cut.
The two constraints (4.3) are all we need to pass on to the planners. If these
two, very simple, constraints are taken care of, sewage will never have to be
dumped. Of course, if the investment problem is later formulated as a linear
program, the two constraints can be added, thereby guaranteeing feasibility,
and, from a technical point of view, relatively complete recourse.
Figure 12 Network illustrating the different bounds.
6.5 Bounds
$$f = (2, 0, 0, 0, 2, 2, 1, 1)^T,$$
with a cost of 18.
Although the Edmundson–Madansky distribution is very useful, it still has
the problem that the objective function must be evaluated in an exponential
number of points.
Figure 13 Example network with arc capacities and external flows
corresponding to the Jensen lower bound.
where all elements of the random vectors $\tilde{\xi} = (\tilde{\xi}_1^T, \tilde{\xi}_2^T, \ldots)^T$ and $\tilde{\eta} = (\tilde{\eta}_1^T, \tilde{\eta}_2^T, \ldots)^T$ are mutually independent. Furthermore, let the supports be
given by $\Xi(\tilde{\xi}) = [A, B]$ and $\Xi(\tilde{\eta}) = [0, C]$. The matrix $W'$ is the node–arc
incidence matrix for a network, with one row removed. That row represents
the slack node. The external flow in the slack node equals the negative sum of
the external flows in the other nodes. The goal is to create an upper bounding
function $U(\xi, \eta)$ that is piecewise linear, separable and convex in ξ, as well as
easily integrable in η:
$$U(\xi, \eta) = \phi(E\tilde{\xi}, 0) + H(\eta) + \sum_i \begin{cases} d_i^+ (\xi_i - E\tilde{\xi}_i) & \text{if } \xi_i \ge E\tilde{\xi}_i, \\ d_i^- (E\tilde{\xi}_i - \xi_i) & \text{if } \xi_i < E\tilde{\xi}_i, \end{cases}$$
for some parameters $d_i^\pm$. The principles of the ξ part of this bound were
outlined in Section 3.4.4 and will not be repeated in all details here. We shall
use the developments from that section here, simply by letting η = 0 while
developing the ξ part. Because this is a restriction (constraint) on the original
problem, it produces an upper bound. Then, afterwards, we shall develop
H(η). In Section 3.4.4 we assumed that $E\tilde{\xi} = 0$. We shall now drop that
assumption, just to illustrate that it was not needed, and to show how many
parameters can be varied in this method.
Let us first see how we can find the ξ part of the function, leaving η = 0.
First, let us calculate
$$\phi(E\tilde{\xi}, 0) = \min_y \{ q^T y \mid W' y = b + E\tilde{\xi}, \; 0 \le y \le c \} = q^T y^0.$$
This is our basic setting, and all other values of ξ will be seen as deviations
from $E\tilde{\xi}$. Note that since $y^0$ is "always" there, we shall update the arc
capacities to become $-y^0 \le y \le c - y^0$. For this purpose, we define $\alpha^1 = -y^0$
and $\beta^1 = c - y^0$. Let $e_i$ be a unit vector of appropriate dimension with a +1
in position i.
Next, define a counter r and let r := 1. Now, check out the case when
$\xi_1 > E\tilde{\xi}_1$ by solving (5.1).

Figure 14 Example network with arc capacities at their lower bounds and
external flows at their means.
What we are doing here is to find, for each variable, how much $\tilde{\xi}_r$, in the
worst case, uses of arc i in the negative direction. That is then subtracted
from what we had before. There are three possibilities. If both
(5.1) and (5.2) yield nonnegative values for variable i, then nothing is
used of the available "negative capacity" $\alpha_i^r$, and $\alpha_i^{r+1} = \alpha_i^r$. Alternatively,
when (5.1) has $y_i^{r+} < 0$, it will in the worst case use $y_i^{r+}$ of the available
"negative capacity". Finally, when (5.2) has $y_i^{r-} < 0$, in the worst case we
use $y_i^{r-}$ of the capacity. Therefore, $\alpha_i^{r+1}$ is what is left for the next random
variable. Similarly, we find $\beta_i^{r+1}$, which shows how much is still available of
the capacity on arc i in the forward (positive) direction.
We next increase the counter r by one and repeat (5.1)–(5.4). This takes
care of the piecewise linear functions in ξ.
Let us now look at our example in Figure 12. To calculate the ξ part of the
bound, we put all arc capacities at their lowest possible value and external
flows at their means. This is shown in Figure 14.
The optimal solution in Figure 14 is given by
$$y^0 = (2, 0, 0, 0, 3, 2, 2, 0)^T,$$
with a cost of 22. The next step is to update the arc capacities in Figure 14 to
account for this solution. The result is shown in Figure 15.
Since the external flow in node 1 varies between 1 and 3, and we have so
far solved the problem for a supply of 2, we must now find the cost associated
with a supply of 1 and a demand of 1 in node 1. For a supply of 1 we get the
Figure 15 Arc capacities after the update based on $\phi(E\tilde{\xi}, 0)$.
Figure 16 Arc capacities after the update based on $\phi(E\tilde{\xi}, 0)$ and the external flow in node 1.
solution
$$y^{1+} = (0, 1, 0, 0, 0, 0, 1, 0)^T,$$
with a cost of 5. Hence $d_1^+ = 5$. For a demand of 1 we get
$$y^{1-} = (-1, 0, 0, 0, 0, -1, 0, 0)^T,$$
with a cost of −3, so that $d_1^- = 3$. Hence we have used one unit of the forward
capacity of arcs 2 and 7, and one unit of the reverse capacity of arcs 1 and
6. Note that both solutions correspond to paths between node 1 and node 5
(the slack node). We update to get Figure 16.
For node 2 the external flow varies between −1 and 1, so we shall now check
the supply of 1 and demand of 1 based on the arc capacities of Figure 16. For
supply we get
$$y^{2+} = (0, 0, 0, 1, 0, 0, 1, 0)^T,$$
Figure 17 Arc capacities after the update based on $\phi(E\tilde{\xi}, 0)$ and nodes 1
and 2.
$$y^{4+} = (0, 0, 0, 0, 0, 0, 1, 0)^T,$$
Figure 18 Arc capacities after the update based on $\phi(E\tilde{\xi}, 0)$ and external flow
in all nodes.
If, for simplicity, we assume that all distributions are uniform, we easily
integrate the upper-bounding function U(ξ, η) to obtain
$$\begin{aligned}
U &= 22 + H(\eta) \\
&\quad + \int_1^2 3(\xi_1 - 2)\tfrac{1}{2}\, d\xi_1 + \int_2^3 5(\xi_1 - 2)\tfrac{1}{2}\, d\xi_1 \\
&\quad + \int_{-1}^0 \xi_2 \tfrac{1}{2}\, d\xi_2 + \int_0^1 3\xi_2 \tfrac{1}{2}\, d\xi_2 \\
&\quad + \int_{-2}^{-1} 2(\xi_4 + 1)\tfrac{1}{2}\, d\xi_4 + \int_{-1}^0 2(\xi_4 + 1)\tfrac{1}{2}\, d\xi_4 \\
&= 22 + H(\eta) - 3 \times \tfrac{1}{4} + 5 \times \tfrac{1}{4} - 1 \times \tfrac{1}{4} + 3 \times \tfrac{1}{4} - 2 \times \tfrac{1}{4} + 2 \times \tfrac{1}{4} \\
&= 23 + H(\eta).
\end{aligned}$$
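The integrals are simple enough to do by hand, but a quick numerical check (using scipy) confirms the constant 23:

from scipy.integrate import quad

U = 22.0
U += quad(lambda x: 3 * (x - 2) * 0.5, 1, 2)[0]
U += quad(lambda x: 5 * (x - 2) * 0.5, 2, 3)[0]
U += quad(lambda x: x * 0.5, -1, 0)[0]
U += quad(lambda x: 3 * x * 0.5, 0, 1)[0]
U += quad(lambda x: 2 * (x + 1) * 0.5, -2, -1)[0]
U += quad(lambda x: 2 * (x + 1) * 0.5, -1, 0)[0]
print(U)   # 23.0, i.e. U = 23 + H(eta)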
Note that there is no contribution from $\xi_4$ to the upper bound. The reason
is that the recourse function φ(ξ, η) is linear in $\xi_4$. This ability to discover
that the recourse function is linear in some random variable is shared with
the Jensen and Edmundson–Madansky bounds.
We then turn to the η part of the bound. Note that if (5.3) and (5.4) were
calculated after the final $y^{r\pm}$ had been found, the α and β show what is left
of the deterministic arc capacities after all random variables $\tilde{\xi}_i$ have received
their shares. Let us call these $\alpha^*$ and $\beta^*$. If we add to each upper bound in
Figure 18 the value C (remember that the support of the upper arc capacities
was Ξ = [0, C]), we get the arc capacities of Figure 19. Now we solve the
problem
$$\min_y \{ q^T y \mid W' y = 0, \; \alpha^* \le y \le \beta^* + C \} = q^T y^*. \qquad (5.5)$$
Figure 19 Arc capacities used to calculate H(η) for the example in Figure 12.

With zero external flow in Figure 19, we get an optimal solution $y^*$ of (5.5)
with a cost of −4. This represents cycle flow with negative cost. The cycle
became available as a result of arc 8 acquiring a positive capacity.
If, again for simplicity, we assume that $\eta_8$ is uniformly distributed over [0, 2],
we find that the capacity of that cycle equals 1 with probability 0.5, while the
remaining probability mass is uniformly distributed over [0, 1]. We
therefore get
$$E H(\eta) = -4 \times 1 \times \tfrac{1}{2} - 4 \int_0^1 x \tfrac{1}{2}\, dx = -2 - 1 = -3.$$
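Again a short numerical check confirms the value: with η8 uniform on [0, 2] the capacity of the cycle is min{η8, 1}, and the cycle gains −4 per unit of capacity.

from scipy.integrate import quad

EH = quad(lambda e: -4 * min(e, 1.0) * 0.5, 0, 2)[0]
print(EH)   # -3.0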
The total upper bound for this example is thus 23 − 3 = 20, compared with
the Jensen lower bound of 18.
In this example the solution $y^*$ of (5.5) contained only one cycle. In general,
$y^*$ may consist of several cycles, possibly sharing arcs. It is then necessary to
pick $y^*$ apart into individual cycles. This can be done in such a way that all
cycles have nonpositive costs (those with zero costs can then be discarded),
and such that all cycles that use a common arc use it in the same direction.
We shall not go into details of that here.
Since $\pi_n$ is the time at which the project finishes, we can calculate the
minimal project completion time by solving
$$\begin{array}{ll} \min & \pi_n \\ \text{s.t.} & \pi_j - \pi_i \ge q_k \ \text{ for all } k \sim (i, j), \\ & \pi_1 = 0. \end{array} \qquad (6.1)$$
It is worth noting that (6.1) is not really a decision problem: there are
no decisions to make. We are only calculating the consequences of an existing
setting of relations and durations.
$$\tilde{\pi}_j = \max_{i \in B^+(j) \setminus \{j\}} \bigl\{ \tilde{\pi}_i + q_k(\tilde{\xi}) \bigr\}, \quad \text{with } k \sim (i, j).$$
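For a fixed realization of the durations, this recursion is a single pass through the nodes in topological order. A minimal sketch (our own naming; the project graph is given as a dict from arcs (i, j) to durations, and every node except node 1 is assumed to have a predecessor):

def completion_time(n, arcs):
    # arcs maps (i, j) to the duration q_k of activity k ~ (i, j);
    # nodes are assumed numbered in topological order, starting at 1.
    pi = {1: 0.0}
    for j in range(2, n + 1):
        pi[j] = max(pi[i] + q for (i, jj), q in arcs.items() if jj == j)
    return pi[n]

# Two parallel activity chains 1->2->4 and 1->3->4:
print(completion_time(4, {(1, 2): 3.0, (2, 4): 2.0,
                          (1, 3): 1.0, (3, 4): 5.0}))   # 6.0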
Example 6.3 Assume we have two random variables $\tilde{\xi}_1$ and $\tilde{\xi}_2$, with joint
distribution as in Table 1. Note that both random variables have the same
marginal distributions; namely, each of them can take on the values 1 or 2,
each with probability 0.5. Therefore $E \max\{\tilde{\xi}_1, \tilde{\xi}_2\} = 1.7$ from Table 1, but
0.25(1 + 2 + 2 + 2) = 1.75 if we treat the marginal distributions as independent.
Therefore, if $\tilde{\xi}_1$ and $\tilde{\xi}_2$ represent two paths sharing an arc,
disregarding the dependences will create an upper bound.
□
Table 1 Joint distribution for $\tilde{\xi}_1$ and $\tilde{\xi}_2$, plus the calculation of $\max\{\tilde{\xi}_1, \tilde{\xi}_2\}$.

  ξ1   ξ2   P(ξ1, ξ2)   max{ξ1, ξ2}
   1    1      0.3           1
   1    2      0.2           2
   2    1      0.2           2
   2    2      0.3           2
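A two-line check of the example (the joint probabilities in Table 1 are the ones forced by the stated marginals together with E max = 1.7):

joint = {(1, 1): 0.3, (1, 2): 0.2, (2, 1): 0.2, (2, 2): 0.3}
E_joint = sum(p * max(a, b) for (a, b), p in joint.items())
E_indep = sum(0.25 * max(a, b) for a in (1, 2) for b in (1, 2))
print(E_joint, E_indep)   # 1.7 versus 1.75: independence gives an upper bound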
6.7 Bibliographical Notes

The vocabulary in this chapter is mostly taken from Rockafellar [25], which
also contains an extremely good overview of deterministic network problems.
A detailed look at network recourse problems is found in Wallace [28].
The original feasibility results for networks were developed by Gale [10]
and Hoffman [13]. The stronger versions using connectedness were developed
by Wallace and Wets. The uncapacitated case is given in [31], while the
capacitated case is outlined in [33] (with a proof in [32]). More details
of the algorithms in Figures 9 and 10 can also be found in these papers.
Similar results were developed by Prékopa and Boros [23]. See also Kall and
Prékopa [14].
As for the LP case, model formulations and infeasibility tests have of course
been performed in many contexts apart from ours. In addition to the references
given in Chapter 5, we refer to Greenberg [11, 12] and Chinneck [3].
The piecewise linear upper bound is taken from Wallace [30]. At the very
end of our discussion of the piecewise linear upper bound, we pointed out that
the solution y ∗ to (5.5) could consist of several cycles sharing arcs. A detailed
discussion of how to pick y ∗ apart, to obtain a conformal realization can be
found in Rockafellar [25], page 476. How to use it in the bound is detailed
in [30]. The bound has been strengthened for pure arc capacity uncertainty
by Frantzeskakis and Powell [8].
Special algorithms for stochastic network problems have also been
developed; see e.g. Qi [24] and Sun et al. [27].
We pointed out at the beginning of this chapter that scenario aggregation
(Section 2.6) could be particularly well suited to problems that have network
structure in all periods. This has been utilized by Mulvey and Vladimirou
for financial problems, which can be formulated in a setting of generalized
networks. For details see [19, 20]. For a selection of papers on financial
problems (not all utilizing network structures), consult Zenios [36, 37], and,
for a specific application, see Dempster and Ireland [5].
The above methods are well suited for parallel processing. This has been
done in Mulvey and Vladimirou [18] and Nielsen and Zenios [21].
Another use of network structure to achieve efficient methods is described
in Powell [22] for the vehicle routing problem.
The PERT formulation was introduced by Malcolm et al. [17].
Exercises
3. The max flow problem is very similar to the PERT problem, in that paths
in the latter correspond to cuts in the max flow problem. Use the bounding
ideas listed in Section 6.6.3 to find bounds on the expected max flow in a
network with random arc capacities.
4. In our example about sewage treatment in Section 6.4 we introduced four
investment options.
(a) Assume that a fifth investment is suggested, namely to build a pipe
with capacity x5 directly from City 1 to site 4. What are the constraints
on xi for i = 1, . . . , 5 that must now be satisfied for the problem to be
feasible?
(b) Disregard the suggestion in question (a). Instead, it is suggested to
view the earlier investment 1, i.e. increasing the pipe capacity from City
1 to site 4 via City 2, as two different investments. Now let $x_1$ be the
increased capacity from City 1 to City 2, and $x_5$ the increased capacity
from City 2 to site 4 (the dump). What are now the constraints on $x_i$
for i = 1, ..., 5 that must be satisfied for the problem to be feasible?
Make sure you interpret the constraints.