Stochastic Programming
Stochastic Programming
Second Edition
Peter Kall
Institute for Operations Research
and Mathematical Methods of Economics
University of Zurich
CH-8044 Zurich
Stein W. Wallace
Molde University College
P.O. Box 2110
N-6402 Molde, Norway
Contents

Preface   ix

1 Basic Concepts   1
  1.1 Motivation   1
    1.1.1 A numerical example   1
    1.1.2 Scenario analysis   2
    1.1.3 Using the expected value of p   3
    1.1.4 Maximizing the expected value of the objective   4
    1.1.5 The IQ of hindsight   5
    1.1.6 Options   5
  1.2 Preliminaries   7
  1.3 An Illustrative Example   10
  1.4 Stochastic Programs: General Formulation   21
    1.4.1 Measures and Integrals   21
    1.4.2 Deterministic Equivalents   31
  1.5 Properties of Recourse Problems   36
  1.6 Properties of Probabilistic Constraints   46
  1.7 Linear Programming   53
    1.7.1 The Feasible Set and Solvability   54
    1.7.2 The Simplex Algorithm   64
    1.7.3 Duality Statements   70
    1.7.4 A Dual Decomposition Method   75
  1.8 Nonlinear Programming   80
    1.8.1 The Kuhn-Tucker Conditions   83
    1.8.2 Solution Techniques   89
      1.8.2.1 Cutting-plane methods   90
      1.8.2.2 Descent methods   93
      1.8.2.3 Penalty methods   97
      1.8.2.4 Lagrangian methods   98
  1.9 Bibliographical Notes   102
  Exercises   104
  References   105

2 Dynamic Systems   110
  2.1 The Bellman Principle   110
  2.2 Dynamic Programming   117
  2.3 Deterministic Decision Trees   121
  2.4 Stochastic Decision Trees   124
  2.5 Stochastic Dynamic Programming   130
  2.6 Scenario Aggregation   134
    2.6.1 Approximate Scenario Solutions   141
  2.7 Financial Models   141
    2.7.1 The Markowitz model   142
    2.7.2 Weak aspects of the model   143
    2.7.3 More advanced models   145
      2.7.3.1 A scenario tree   145
      2.7.3.2 The individual scenario problems   145
      2.7.3.3 Practical considerations   147
  2.8 Hydro power production   147
    2.8.1 A small example   148
    2.8.2 Further developments   150
  2.9 The Value of Using a Stochastic Model   151
    2.9.1 Comparing the Deterministic and Stochastic Objective Values   151
    2.9.2 Deterministic Solutions in the Event Tree   152
    2.9.3 Expected Value of Perfect Information   154
  References   156

3 Recourse Problems
  3.6 Simple Recourse   205
  3.7 Integer First Stage   209
    3.7.1 Initialization   216
    3.7.2 Feasibility Cuts   216
    3.7.3 Optimality Cuts   217
    3.7.4 Stopping Criteria   217
  3.8 Stochastic Decomposition   217
  3.9 Stochastic Quasi-Gradient Methods   225
  3.10 Solving Many Similar Linear Programs   229
    3.10.1 Randomness in the Objective   232
  3.11 Bibliographical Notes   233
  Exercises   235
  References   237

4 Probabilistic Constraints   243
  4.1 Joint Chance Constrained Problems   245
  4.2 Separate Chance Constraints   247
  4.3 Bounding Distribution Functions   249
  4.4 Bibliographical Notes   257
  Exercises   258
  References   258

5 Preprocessing   261
  5.1 Problem Reduction   261
    5.1.1 Finding a Frame   262
    5.1.2 Removing Unnecessary Columns   263
    5.1.3 Removing Unnecessary Rows   264
  5.2 Feasibility in Linear Programs   265
    5.2.1 A Small Example   271
  5.3 Reducing the Complexity of Feasibility Tests   273
  5.4 Bibliographical Notes   274
  Exercises   274
  References   275

6 Network Problems   277
  6.1 Terminology   278
  6.2 Feasibility in Networks   280
    6.2.1 The uncapacitated case   286
    6.2.2 Comparing the LP and Network Cases   287
  6.3 Generating Relatively Complete Recourse   288
  6.4 An Investment Example   290
  6.5 Bounds   294
    6.5.1 Piecewise Linear Upper Bounds   295
  6.6 Project Scheduling   301
    6.6.1 PERT as a Decision Problem   303
    6.6.2 Introduction of Randomness   303
    6.6.3 Bounds on the Expected Project Duration   304
      6.6.3.1 Series reductions   305
      6.6.3.2 Parallel reductions   305
      6.6.3.3 Disregarding path dependences   305
      6.6.3.4 Arc duplications   306
      6.6.3.5 Using Jensen's inequality   306
  6.7 Bibliographical Notes   307
  Exercises   308
  References   309

Index   313
Preface
Over the last few years, both of the authors, and also most others in the field
of stochastic programming, have said that what we need more than anything
just now is a basic textbook: a textbook that makes the area available not
only to mathematicians, but also to students and other interested parties who
cannot or will not try to approach the field via the journals. We also felt
the need to provide an appropriate text for instructors who want to include
the subject in their curriculum. It is probably not possible to write such a
book without assuming some knowledge of mathematics, but it has been our
clear goal to avoid writing a text readable only for mathematicians. We want
the book to be accessible to any quantitatively minded student in business,
economics, computer science and engineering, plus, of course, mathematics.
So what do we mean by a quantitatively minded student? We assume that
the reader of this book has had a basic course in calculus, linear algebra
and probability. Although most readers will have a background in linear
programming (which replaces the need for a specific course in linear algebra),
we provide an outline of all the theory we need from linear and nonlinear
programming. We have chosen to put this material into Chapter 1, so that
the reader who is familiar with the theory can drop it, and the reader who
knows the material, but wonders about the exact definition of some term, or
who is slightly unfamiliar with our terminology, can easily check how we see
things. We hope that instructors will find enough material in Chapter 1 to
cover specific topics that may have been omitted in the standard book on
optimization used in their institution. By putting this material directly into
the running text, we have made the book more readable for those with the
minimal background. But, at the same time, we have found it best to separate
what is new in this book, stochastic programming, from the more standard
material of linear and nonlinear programming.
Despite this clear goal concerning the level of mathematics, we must
admit that when treating some of the subjects, like probabilistic constraints
(Section 1.6 and Chapter 4), or particular solution methods for stochastic
programs, like stochastic decomposition (Section 3.8) or quasi-gradient
methods (Section 3.9), we have had to use a slightly more advanced language
in probability. Although the actual information found in those parts of the
book is made simple, some terminology may here and there not belong to
the basic probability terminology. Hence, for these parts, the instructor must
either provide some basic background in terminology, or the reader should at
least consult carefully Section 1.4.1, where we have tried to put together those
terms and concepts from probability theory used later in this text.
Within the mathematical programming community, it is common to split
the field into topics such as linear programming, nonlinear programming,
network flows, integer and combinatorial optimization, and, finally, stochastic
programming. Convenient as that may be, it is conceptually inappropriate.
It puts forward the idea that stochastic programming is distinct from integer
programming the same way that linear programming is distinct from nonlinear
programming. The counterpart of stochastic programming is, of course,
deterministic programming. We have stochastic and deterministic linear
programming, deterministic and stochastic network flow problems, and so on.
Although this book mostly covers stochastic linear programming (since that is
the best developed topic), we also discuss stochastic nonlinear programming,
integer programming and network flows.
Since we have let subject areas guide the organization of the book, the
chapters are of rather different lengths. Chapter 1 starts out with a simple
example that introduces many of the concepts to be used later on. Tempting as
it may be, we strongly discourage skipping these introductory parts. If they
are skipped, stochastic programming will come across as a merely
algorithmic and mathematical subject, which would limit the usefulness
of the field. In addition to the algorithmic and mathematical facets of the
field, stochastic programming also involves model creation and specification
of solution characteristics. All instructors know that modelling is harder to
teach than are methods. We are sorry to admit that this difficulty persists
in this text as well. That is, we do not provide an in-depth discussion of
modelling stochastic programs. The text is not free from discussions of models
and modelling, however, and it is our strong belief that a course based on this
text is better (and also easier to teach and motivate) when modelling issues
are included in the course.
Chapter 1 contains a formal approach to stochastic programming, with a
discussion of different problem classes and their characteristics. The chapter
ends with linear and nonlinear programming theory that weighs heavily in
stochastic programming. The reader will probably get the feeling that the
parts concerned with chance-constrained programming are mathematically
more complicated than some parts discussing recourse models. There is a
good reason for that: whereas recourse models transform the randomness
contained in a stochastic program into one special parameter of the random
vector's distribution, namely its expectation, chance-constrained models deal
more explicitly with the distribution itself. Hence the latter models may
be more difficult, but at the same time they also exhaust more of the
information contained in the probability distribution. However, with respect to
applications, there is no generally valid justification for stating that either of
the two basic model types is better or more relevant. As a matter of fact, we
know of applications for which the recourse model is very appropriate, of
others for which chance constraints have to be modelled, and even of
applications that combine recourse terms for one part of the stochastic
constraints with chance constraints for another part. Hence, in a first reading
or an introductory course, any proof that appears too complicated can
certainly be skipped without harm. However, to get a valid picture of
stochastic programming, the statements about the basic properties of both
model types, as well as the ideas underlying the various solution approaches,
should be noted. Although the basic linear and nonlinear programming is put together
in one specific part of the book, the instructor or the reader should pick up
the subjects as they are needed for the understanding of the other chapters.
That way, it will be easier to pick out exactly those parts of the theory that
the students or readers do not know already.
Chapter 2 starts out with a discussion of the Bellman principle for
solving dynamic problems, and then discusses decision trees and dynamic
programming in both deterministic and stochastic settings. There then follows
a discussion of the rather new approach of scenario aggregation. We conclude
the chapter with a discussion of the value of using stochastic models.
Chapter 3 covers recourse problems. We first discuss some topics from
Chapter 1 in more detail. Then we consider decomposition procedures
especially designed for stochastic programs with recourse. We next turn to
the questions of bounds and approximations, outlining some major ideas
and indicating the direction for other approaches. The special case of simple
recourse is then explained, before we show how decomposition procedures for
stochastic programs fit into the framework of branch-and-cut procedures for
integer programs. This makes it possible to develop an approach for stochastic
integer programs. We conclude the chapter with a discussion of Monte-Carlo
based methods, in particular stochastic decomposition and quasi-gradient
methods.
Chapter 4 is devoted to probabilistic constraints. Based on convexity
statements provided in Section 1.6, one particular solution method is described
for the case of joint chance constraints with a multivariate normal distribution
of the right-hand side. For separate probabilistic constraints with a joint
normal distribution of the coefficients, we show how the problem can be
transformed into a deterministic convex nonlinear program. Finally, we
address a problem very relevant in dealing with chance constraints: the
problem of how to construct efficiently lower and upper bounds for a
multivariate distribution function, and give a first sketch of the ideas used
in this area.
Preprocessing is the subject of Chapter 5. Preprocessing is any analysis
that is carried out before the actual solution procedure is called. Preprocessing
can be useful for simplifying calculations, but its main purpose is to provide
a tool for model evaluation.
We conclude the book with a closer look at networks (Chapter 6). Since
these are nothing else than specially structured linear programs, we can draw
freely from the topics in Chapter 3. However, the added structure of networks
allows many simplifications. We discuss feasibility, preprocessing and bounds.
We conclude the chapter with a closer look at PERT networks.
Each chapter ends with a short discussion of where more literature can be
found, some exercises, and, finally, a list of references.
Writing this book has been both interesting and difficult. Since it is the first
basic textbook totally devoted to stochastic programming, we both enjoyed
and suffered from the fact that there is, so far, no experience to suggest how
such a book should be constructed. Are the chapters in the correct order?
Is the level of difficulty even throughout the book? Have we really captured
the basics of the field? In all cases the answer is probably NO. Therefore,
dear reader, we appreciate all comments you may have, be they regarding
misprints, plain errors, or simply good ideas about how this should have been
done. And also, if you produce suitable exercises, we shall be very happy to
receive them, and if this book ever gets revised, we shall certainly add them,
and allude to the contributor.
About 50% of this text served as a basis for a course in stochastic
programming at The Norwegian Institute of Technology in the fall of 1992. We
wish to thank the students for putting up with a very preliminary text, and
for finding such an astonishing number of errors and misprints. Last but not
least, we owe sincere thanks to Julia Higle (University of Arizona, Tucson),
Diethard Klatte (University of Zurich), Janos Mayer (University of Zurich) and
Pavel Popela (Technical University of Brno), who have read the manuscript¹
very carefully, not only fixing linguistic bugs but also preventing quite a
number of crucial mistakes. Finally we highly appreciate the good cooperation
and very helpful comments provided by our publisher. The remaining errors
are obviously the sole responsibility of the authors.
¹ Written in LaTeX.
P. K. and S.W.W.
1 Basic Concepts

1.1 Motivation

1.1.1 A numerical example
You own two lots of land. Each of them can be developed with the necessary
infrastructure, and a plant can then be built. In fact, there are nine possible
decisions. Eight of them are given in Figure 1, the ninth is to do nothing.
The cost structure is given in the following table. For each lot of land we
give the cost of developing the land and building the plant. The extra column
will be explained shortly.
Figure 1  Eight of the nine possible decisions. The area surrounded by thin
lines corresponds to Lot 1, the area surrounded by thick lines to Lot 2. For
example, Decision 6 is to develop both lots and build a plant on Lot 1.
Decision 9 is to do nothing.
         developing the land   building the plant   building the plant later
Lot 1            600                   200                    220
Lot 2            100                   600                    660
1.1.2 Scenario analysis

The scenario solutions turn out to be Decision 9 (do nothing) for the low
price, Decision 4 for the expected price, and Decision 7 for the high price.
1.1.4 Maximizing the expected value of the objective

We just calculated the expected value of using the expected value solution. It
was 30. We can also calculate the expected value of using any of the possible
scenario solutions. We find that for doing nothing (Decision 9) the expected
value is 0, and for Decision 7 the expected value equals

−1500 + ½ · 420 + ½ · 2500 = −40.
In other words, among the three scenario solutions, the expected value
solution has the best expected performance. But is it the solution with the
best expected performance overall? Let us answer this question by simply
listing all nine possible solutions and calculating their expected values. In
all cases, if the land is developed before p becomes known, we will consider
the option of building the plant at the 10% penalty if that is profitable. The
results are given in Table 1.
Table 1  The expected value of all nine possible solutions. The income is the
value of the product if the plant is already built. If not, it is the value of the
product minus the construction cost at the 10% penalty.

Decision   Investment   Income if p = 210   Income if p = 1250   Expected profit
   1           600              0                 ½·1030               −85
   2           800            ½·210               ½·1250               −70
   3           100              0                 ½·590                195
   4           700            ½·210               ½·1250                30
   5          1300            ½·210               ½·2280               −55
   6           900            ½·210               ½·1840               125
   7          1500            ½·420               ½·2500               −40
   8           700              0                 ½·1620               110
   9             0              0                     0                  0
As we see from Table 1, the optimal solution is to develop Lot 2, then wait to
see what the price turns out to be. If the price turns out to be low, do nothing;
if it turns out to be high, build the plant on Lot 2. The solution that truly
maximizes the expected value of the objective function will be called the
stochastic solution. Note also that two more solutions are substantially better
than the expected value solution.
All three solutions that are better than the expected value solution are
solutions with options in them. That is, they mean that we develop some land
in anticipation of high prices. Of course, there is a chance that the investment
1.1.5 The IQ of hindsight
In hindsight, that is, after the fact, it will always be the case that one of the
scenario solutions turns out to be the best choice. In particular, the expected
value solution will be optimal for any 700 < p < 800. (We did not have any
probability mass there in our example, but we could easily have constructed
such a case.) The problem is that it is not the same scenario solution that is
optimal in all cases. In fact, most of them are very bad in all but the situation
where they are best.
The stochastic solution, on the other hand, is normally never optimal after
the fact. But, at the same time, it is also hardly ever really bad.
In our example, with the given probability distribution, the decision of doing
nothing (which has an expected value of zero) and the decision of building
both plants (with an expected value of -40) both have a probability of 50%
of being optimal after p has become known. The stochastic solution, with an
expected value of 195, on the other hand, has zero probability of being optimal
in hindsight.
This is an important observation. If you base your decisions on stochastic
models, you will normally never do things really well. Therefore, people who
prefer to evaluate after the fact can always claim that you made a bad decision.
If you base your decisions on scenario solutions, there is a certain chance that
you will do really well. It is therefore possible to claim that in certain cases
the most risky decision one can make is the one with the highest expected
value, because you will then always be proven wrong after the fact. The IQ of
hindsight is very high.
1.1.6 Options
We have already hinted at it several times, but let us repeat the observation
that the value of a stochastic programming approach to a problem lies in
the explicit evaluation of flexibility. Flexible solutions will always lose in
deterministic evaluations.
Another area where these observations have been made for quite a while is
option theory. This theory is mostly developed for financial models, but a
theory of real options (for example, investments) is emerging. Let us consider
our extremely simple example in the light of options.
We observed from Table 1 that the expected Net Present Value (NPV)
of Decision 4, i.e. the decision to develop Lot 2 and build a plant, equals
30. Standard theory tells us to invest if a project has a positive NPV, since
that means the project is profitable. And, indeed, Decision 4 represents an
investment which is profitable in terms of expected profits. But as we have
observed, Decision 3 is better, and it is not possible to make both decisions;
they exclude each other. The expected NPV for Decision 3 is 195. The
difference of 165 is the value of an option, namely the option not to build
the plant. Or, to put it differently: if your only possibilities were to develop
Lot 2 and build the plant at the same time, or to do nothing, and you were
asked how much you would be willing to pay to be allowed to delay the
building of the plant (at the 10% penalty), the answer would be at most 165.
Another possible setting is to assume that the right to develop Lot 2 and
build the plant is for sale. This right can be seen as an option. This option is
worth 195 in the setting where delayed construction of the plant is allowed.
(If delays were not allowed, the right to develop and build would be worth 30,
but that is not an option.)
So what is it that gives an option a value? Its value stems from the right
to do something in the future under certain circumstances, but to drop it in
others if you so wish. And, even more importantly, to evaluate an option you
must model explicitly the future decisions. This is true in our simple model,
but it is equally true in any complex option model. It is not enough to describe
a stochastic future, this stochastic future must contain decisions.
So what are the important aspects of randomness? We may conclude that
there are at least three (all related, of course).
1. Randomness is needed to obtain a correct evaluation of the future income
and costs, i.e. to evaluate the objective.
2. Flexibility only has value (and meaning) in a setting of randomness.
3. Only by explicitly evaluating future decisions can decisions containing
flexibility (options) be correctly valued.
1.2 Preliminaries
min{c1x1 + c2x2 + · · · + cnxn}
s.t.  a11x1 + a12x2 + · · · + a1nxn = b1,
        ·    ·    ·    ·    ·    ·                       (2.1)
      am1x1 + am2x2 + · · · + amnxn = bm,
      x1, x2, · · · , xn ≥ 0.

Using matrix-vector notation, the shorthand formulation of problem (2.1)
would read as

min cᵀx
s.t.  Ax = b,                                            (2.2)
      x ≥ 0.

More generally, we shall also consider nonlinear programs of the form

min g0(x)
s.t.  gi(x) ≤ 0, i = 1, · · · , m,                       (2.3)
      x ∈ X ⊂ IRⁿ.
1.3 An Illustrative Example
Let us consider the following problem, idealized for the purpose of easy
presentation. From two raw materials, raw1 and raw2, we may simultaneously
produce two different goods, prod1 and prod2 (as may happen, for example, in
a refinery). The output of products per unit of the raw materials, as well
as the unit costs of the raw materials c = (craw1, craw2)ᵀ (yielding the
production cost γ), the demands for the products h = (hprod1, hprod2)ᵀ and
the production capacity b, i.e. the maximal total amount of raw materials that
can be processed, are given in Table 2.
Table 2

            Products
Raws      prod1   prod2     c     b
raw1        2       3       2     1
raw2        6       3       3     1
relation    ≥       ≥             ≤
h          180     162           100

According to this formulation of our production problem, we have to deal
with the following linear program:

min(2xraw1 + 3xraw2)
s.t.   xraw1 +  xraw2 ≤ 100,
      2xraw1 + 6xraw2 ≥ 180,                             (3.1)
      3xraw1 + 3xraw2 ≥ 162,
       xraw1 ≥ 0,  xraw2 ≥ 0.
Figure 2

…for instance, can vary within certain limits (for our discussion, randomly), and
that we have to make our decision on the production plan before knowing the
exact values of those data.
To be more specific, let us assume that

• our model describes the weekly production process of a refinery relying
  on two countries for the supply of crude oil (raw1 and raw2, respectively),
  supplying one big company with gasoline (prod1) for its distribution system
  of gas stations and another with fuel oil (prod2) for its heating and/or power
  plants;
• it is known that the productivities π(raw1, prod1) and π(raw2, prod2), i.e.
  the output of gas from raw1 and the output of fuel from raw2, may change
  randomly (whereas the other productivities are deterministic);
• simultaneously, the weekly demands of the clients, hprod1 for gas and hprod2
  for fuel, are varying randomly;
• the weekly production plan (xraw1, xraw2) has to be fixed in advance and
  cannot be changed during the week, whereas
• the actual productivities are only observed (measured) during the
  production process itself, and
• the clients expect their actual demand to be satisfied during the
  corresponding week.
Figure 3

Assume that this randomness may be described as follows:

h̃prod1 = 180 + ξ̃1,
h̃prod2 = 162 + ξ̃2,
π̃(raw1, prod1) = 2 + η̃1,                                (3.3)
π̃(raw2, prod2) = 3.4 − η̃2,

where the random variables ξ̃j are modelled using normal distributions, and
η̃1 and η̃2 are distributed uniformly and exponentially respectively, with the
following parameters:

distr ξ̃1 ~ N(0, 12),
distr ξ̃2 ~ N(0, 9),                                     (3.4)
distr η̃1 ~ U[−0.8, 0.8],
distr η̃2 ~ EXP(λ = 2.5).

For simplicity, we assume that these four random variables are mutually
independent. Since the random variables ξ̃1, ξ̃2 and η̃2 are unbounded,
we restrict our considerations to their respective 99% confidence intervals
(except for U). So we have for the above random variables the realizations

ξ1 ∈ [−30.91, 30.91],
ξ2 ∈ [−23.18, 23.18],
η1 ∈ [−0.8, 0.8],                                        (3.5)
η2 ∈ [0.0, 1.84].
Hence, instead of the linear program (3.1), we are dealing with the stochastic
linear program

min(2xraw1 + 3xraw2)
s.t.        xraw1 +          xraw2 ≤ 100,
      (2 + η̃1)xraw1 +      6xraw2 ≥ 180 + ξ̃1,          (3.6)
           3xraw1 + (3.4 − η̃2)xraw2 ≥ 162 + ξ̃2,
            xraw1 ≥ 0,  xraw2 ≥ 0.

This is not a well-defined decision problem, since it is not at all clear what
the meaning of "min" can be before knowing a realization (ξ1, ξ2, η1, η2) of
(ξ̃1, ξ̃2, η̃1, η̃2).
Geometrically, the consequence of our random parameter changes may
be rather complex. The effect of only the right-hand sides ξi varying
over the intervals given in (3.5) corresponds to parallel translations of the
corresponding facets of the feasible set, as shown in Figure 4.

We may instead consider the effect of only the ηi changing their values
within the intervals mentioned in (3.5). That results in rotations of the related
facets. Some possible situations are shown in Figure 5, where the centers of
rotation are indicated by small circles.

Allowing for all the possible changes in the demands and in the
productivities simultaneously yields a superposition of the two geometrical
motions, i.e. the translations and the rotations. It is easily seen that the
variation of the feasible set may be substantial, depending on the actual
realizations of the random data. The same is also true for the so-called
wait-and-see solutions, i.e. for those optimal solutions we should choose if we
knew the realizations of the random parameters in advance. In Figure 6 a few
possible situations are indicated. In addition to the deterministic solution
x̂ = (x̂raw1, x̂raw2) = (36, 18),  γ = 126,

production plans such as

ŷ = (ŷraw1, ŷraw2) = (20, 30),  γ = 130,
ẑ = (ẑraw1, ẑraw2) = (50, 22),  γ = 166,                 (3.7)
v̂ = (v̂raw1, v̂raw2) = (58, 6),  γ = 134
Figure 4

Figure 5
To introduce another possibility, let us assume that the refinery has made
the following arrangement with its clients. In principle, the clients expect
the refinery to satisfy their weekly demands. However, very likely, according
to the production plan and the unforeseen events determining the clients'
demands and/or the refinery's productivity, the demands cannot be covered
by the production, which will cause penalty costs to the refinery. The
amount of shortage has to be bought from the market. These penalties are
supposed to be proportional to the respective shortage in products, and we
assume that per unit of undeliverable products they amount to

qprod1 = 7,  qprod2 = 12.                                (3.9)
Figure 6  LP: feasible set varying with productivities and demands; some
wait-and-see solutions.
Under this arrangement we may replace the stochastic linear
program (3.6) by the well-defined stochastic program with recourse, using
h1(ξ̃) := h̃prod1 = 180 + ξ̃1, h2(ξ̃) := h̃prod2 = 162 + ξ̃2,
α(η̃) := π̃(raw1, prod1) = 2 + η̃1, β(η̃) := π̃(raw2, prod2) = 3.4 − η̃2:

min{2xraw1 + 3xraw2 + E[7y1(ξ̃, η̃) + 12y2(ξ̃, η̃)]}
s.t.       xraw1 +        xraw2                    ≤ 100,
      α(η̃)xraw1 +       6xraw2 + y1(ξ̃, η̃)       ≥ h1(ξ̃),
          3xraw1 +  β(η̃)xraw2 + y2(ξ̃, η̃)        ≥ h2(ξ̃),        (3.10)
           xraw1 ≥ 0,  xraw2 ≥ 0,  y1(ξ̃, η̃) ≥ 0,  y2(ξ̃, η̃) ≥ 0.

In (3.10) E stands for the expected value with respect to the distribution
of (ξ̃, η̃), and, in general, it is understood that the stochastic constraints have
to hold almost surely (a.s.) (i.e., they are to be satisfied with probability
1). Note that if (ξ̃, η̃) has a finite discrete distribution {((ξⁱ, ηⁱ), pi), i = 1, · · · , r}
(pi > 0 ∀i) then (3.10) is just an ordinary linear program with a so-called
dual decomposition structure:
min{2xraw1 + 3xraw2 + Σ_{i=1}^{r} pi [7y1(ξⁱ, ηⁱ) + 12y2(ξⁱ, ηⁱ)]}
s.t.        xraw1 +        xraw2                     ≤ 100,
      α(ηⁱ)xraw1 +       6xraw2 + y1(ξⁱ, ηⁱ)        ≥ h1(ξⁱ)  ∀i,
          3xraw1 +  β(ηⁱ)xraw2 + y2(ξⁱ, ηⁱ)         ≥ h2(ξⁱ)  ∀i,        (3.11)
           xraw1 ≥ 0,  xraw2 ≥ 0,
           y1(ξⁱ, ηⁱ) ≥ 0  ∀i,  y2(ξⁱ, ηⁱ) ≥ 0  ∀i.
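To make the structure of (3.11) concrete, the following sketch builds and solves such an LP with scipy.optimize.linprog (assuming SciPy is available). The three-point marginals are our own coarse stand-ins for the distributions (3.4); the text itself works with a 15 × 15 discretization of (ξ̃1, ξ̃2), so the numbers produced here differ from those reported below.

    import numpy as np
    from itertools import product
    from scipy.optimize import linprog

    # Illustrative three-point marginals (assumptions, not the book's data):
    xi1  = [(-15.0, 0.25), (0.0, 0.50), (15.0, 0.25)]
    xi2  = [(-10.0, 0.25), (0.0, 0.50), (10.0, 0.25)]
    eta1 = [(-0.4, 0.25), (0.0, 0.50), (0.4, 0.25)]
    eta2 = [(0.1, 0.25), (0.4, 0.50), (0.9, 0.25)]

    scenarios = [(180.0 + a, 162.0 + b, 2.0 + e1, 3.4 - e2, p1 * p2 * q1 * q2)
                 for (a, p1), (b, p2), (e1, q1), (e2, q2)
                 in product(xi1, xi2, eta1, eta2)]
    r = len(scenarios)                        # 81 scenarios
    n = 2 + 2 * r                             # xraw1, xraw2 and y1^i, y2^i

    c = np.zeros(n); c[0], c[1] = 2.0, 3.0    # first-stage costs
    A_ub, b_ub = [], []
    cap = np.zeros(n); cap[0] = cap[1] = 1.0  # capacity: x1 + x2 <= 100
    A_ub.append(cap); b_ub.append(100.0)
    for i, (h1, h2, alpha, beta, p) in enumerate(scenarios):
        c[2 + 2 * i], c[3 + 2 * i] = 7.0 * p, 12.0 * p   # expected penalties
        r1 = np.zeros(n)                      # alpha*x1 + 6*x2 + y1 >= h1
        r1[0], r1[1], r1[2 + 2 * i] = -alpha, -6.0, -1.0
        A_ub.append(r1); b_ub.append(-h1)
        r2 = np.zeros(n)                      # 3*x1 + beta*x2 + y2 >= h2
        r2[0], r2[1], r2[3 + 2 * i] = -3.0, -beta, -1.0
        A_ub.append(r2); b_ub.append(-h2)

    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=(0, None))
    print(res.x[:2], res.fun)                 # production plan, expected cost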
Figure 7  Discrete approximations of the distributions N(0, 12) and N(0, 9)
(15 × 15 realizations).
Solving the recourse problem with this discrete distribution yields the
solution

x̄ = (37.566, 22.141),  γ(x̄) = 144.179,  γI(x̄) = 141.556,

whereas the solution x̂ of our original LP (3.1) would yield as total expected
costs

γ(x̂) = 204.561.

For the reliability, we now get

ρ(x̄) = 0.9497,

in contrast to

ρ(x̂) = 0.2983

for the LP solution x̂.
(),
h1 ()
and h2 ()
20
STOCHASTIC PROGRAMMING
xraw2 100,
0,
xraw2 0,
+ 6xraw2 h1 ()
0.95.
+ 3xraw2 h2 ()
xraw1 +
xraw1
P
2xraw1
3xraw1
This problem can be solved with appropriate methods, one of which will be
presented later in this text. It seems worth mentioning that in this case
using the normal distributions instead of their discrete approximations is
appropriate, owing to theoretical properties of probabilistic constraints to be
discussed later on. The solution of the probabilistically constrained program
is

z = (37.758, 21.698),  γI(z) = 140.612.

So the costs, i.e. the first-stage costs, are only slightly increased compared
with the LP solution if we observe the drastic increase of reliability. There
seems to be a contradiction on comparing this last result with the solution
(3.12) in that γI(x̃) < γI(z) and ρ(x̃) > 0.95; however, this discrepancy is due
to the discretization error made by replacing the true normal distribution of
(ξ̃1, ξ̃2) by the 15 × 15 discrete distribution used for the computation of the
solution (3.12). Using the correct normal distribution would obviously yield
γI(x̃) = 138.694 (as in (3.12)), but only ρ(x̃) = 0.9115!
1.4 Stochastic Programs: General Formulation

In the same way as random parameters in (3.1) led us to the stochastic (linear)
program (3.6), random parameters in (2.3) may lead to the problem

min g0(x, ξ̃)
s.t.  gi(x, ξ̃) ≤ 0,  i = 1, · · · , m,                   (4.1)
      x ∈ X ⊂ IRⁿ,
1.4.1 Measures and Integrals

For the natural measure μ of half-open intervals I[a,b) = {x | a ≤ x < b} we
have:

in IR¹:  μ(I[a,b)) = b − a                    if a ≤ b, and 0 otherwise;
in IR²:  μ(I[a,b)) = (b1 − a1)(b2 − a2)       if a ≤ b, and 0 otherwise;        (4.2)
in IR³:  μ(I[a,b)) = ∏_{i=1}^{3} (bi − ai)    if a ≤ b, and 0 else.
Obviously, for a set A that is the disjoint finite union of intervals, i.e.
A = ∪_{n=1}^{M} I⁽ⁿ⁾, the I⁽ⁿ⁾ being intervals such that I⁽ⁿ⁾ ∩ I⁽ᵐ⁾ = ∅ for n ≠ m,
we define its measure as μ(A) = Σ_{n=1}^{M} μ(I⁽ⁿ⁾). In order to measure a set
A that is not just an interval or a finite union of disjoint intervals, we may
proceed as follows.

Any finite collection of pairwise-disjoint intervals contained in A forms
a packing C of A, C being the union of those intervals, with a well-defined
measure μ(C) as mentioned above. Analogously, any finite collection
of pairwise-disjoint intervals, with their union containing A, forms a covering
D of A with a well-defined measure μ(D).
Take for example in IR² the set

Acirc = {(x, y) | x² + y² ≤ 16, y ≥ 0},

i.e. the half-circle illustrated in Figure 8, which also shows a first possible
packing C1 and covering D1. Obviously we learned in high school that the
area of Acirc is computed as μ(Acirc) = ½π(radius)² = 25.1327, whereas
we easily compute μ(C1) = 13.8564 and μ(D1) = 32. If we forgot all our
wisdom from high school, we would only be able to conclude that the measure
of the half-circle Acirc is between 13.8564 and 32. To obtain a more precise
estimate, we can try to improve the packing and the covering in such a way
that the new packing C2 exhausts more of the set Acirc and the new covering
D2 becomes a tighter outer approximation of Acirc. This is shown in Figure 9,
for which we get μ(C2) = 19.9657 and μ(D2) = 27.9658.

Hence the measure of Acirc is between 19.9657 and 27.9658. If this is still
not precise enough, we may further improve the packing and covering. For
the half-circle Acirc, it is easily seen that we may determine its measure in this
way with any arbitrary accuracy.
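The packing/covering construction is easy to mimic numerically. The sketch below (our own illustration, not from the text) uses the cells of a square grid of width h: cells lying entirely inside Acirc form a packing, cells meeting Acirc form a covering, and refining h drives both bounds towards ½π·4² = 25.1327.

    import math

    def bounds(h):
        # mu(packing) <= mu(Acirc) <= mu(covering) for an h-grid.
        n = int(math.ceil(4.0 / h))
        packing = covering = 0
        for i in range(-n, n):
            for j in range(0, n):
                x0, x1 = i * h, (i + 1) * h
                y0, y1 = j * h, (j + 1) * h
                # farthest corner inside the disk => cell entirely in Acirc
                far = max(abs(x0), abs(x1)) ** 2 + max(abs(y0), abs(y1)) ** 2
                # nearest point of the cell to the origin => cell meets Acirc
                dx = max(x0, -x1, 0.0)
                near = dx * dx + y0 * y0
                if far <= 16.0:
                    packing += 1
                if near <= 16.0:
                    covering += 1
        return packing * h * h, covering * h * h

    for h in (2.0, 1.0, 0.5, 0.1):
        print(h, *bounds(h))     # both bounds tend to 8*pi = 25.1327...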
In general, for any closed bounded set A ⊂ IRᵏ, we may try a similar
procedure to measure A. Denote by C_A the set of all packings for A and by
D_A the set of all coverings.

Figure 8

Figure 9
The collection A of measurable sets in IRᵏ satisfies

IRᵏ ∈ A, and A ∈ A implies Aᶜ = IRᵏ − A ∈ A;                     (4.3 i)
Ai ∈ A, i = 1, 2, · · · , implies ∪_{i=1}^{∞} Ai ∈ A.            (4.3 ii)

This implies that, with Ai ∈ A, i = 1, 2, · · · , also ∩_{i=1}^{∞} Ai ∈ A.
As a consequence of the above construction, we have, for the natural
measure μ defined in IRᵏ, that

μ(A) ≥ 0 ∀A ∈ A and μ(∅) = 0;                                    (4.4 i)
if Ai ∈ A, i = 1, 2, · · · , and Ai ∩ Aj = ∅ for i ≠ j, then
μ(∪_{i=1}^{∞} Ai) = Σ_{i=1}^{∞} μ(Ai).                           (4.4 ii)
These properties are also familiar from probability theory: there we have
some space Ω of outcomes ω (e.g. the results of random experiments), a
collection F of subsets F ⊂ Ω called events, and a probability measure (or
probability distribution) P assigning to each F ∈ F the probability with
which it occurs. To set up probability theory, it is then required that

(i) Ω is an event, i.e. Ω ∈ F, and, with F ∈ F, it holds that also
    Fᶜ = Ω − F ∈ F, i.e. if F is an event then so also is its complement
    (or "not F");
(ii) the countable union of events is an event.

Observe that these formally coincide with (4.3) except that Ω can be any
space of objects and need not be IRᵏ.

For the probability measure, it is required that

(i) P(F) ≥ 0 ∀F ∈ F and P(Ω) = 1;
(ii) if Fi ∈ F, i = 1, 2, · · · , and Fi ∩ Fj = ∅ for i ≠ j, then
     P(∪_{i=1}^{∞} Fi) = Σ_{i=1}^{∞} P(Fi).

Given a random vector ξ : Ω → IRᵏ, the probability measure induced on the
measurable sets in IRᵏ is

P(A) = P({ω | ξ(ω) ∈ A})  ∀A ∈ A.
Example 1.2  At a market hall for the fruit trade you find a particular species
of apples. These apples are traded in certain lots (e.g. of 1000 lb). Buying a lot
involves some risk with respect to the quality of apples contained in it. What
does "quality" mean in this context? Obviously quality is a conglomerate of
criteria described in terms like size, ripeness, flavour, colour and appearance.
Some of the criteria can be expressed through quantitative measurement, while
others cannot (they have to be judged upon by experts). Hence the set Ω of
all possible qualities cannot as such be represented as a subset of some IRᵏ.
Having bought a lot, the trader has to sort his apples according to their
outcomes (i.e. qualities), which could fall into events like "unusable"
(e.g. rotten or too unripe), "cooking apples" and "low (medium, high) quality
eatable apples". Having sorted out the unusable and the cooking apples,
for the remaining apples experts could be asked to judge on ripeness, flavour,
colour and appearance, by assigning real values between 0 and 1 to parameters
r, f, c and a respectively, corresponding to the degree (or percentage) of
achieving the particular criterion.

Now we can construct a scalar value for any particular outcome (quality)
ω, for instance as

v(ω) := { 0                              if ω ∈ "unusable",
          1/2                            if ω ∈ "cooking apples",
          (1 + r)(1 + f)(1 + c)(1 + a)   otherwise.

Obviously v has the range v[Ω] = {0} ∪ {1/2} ∪ [1, 16]. Denoting the events
"unusable" by U and "cooking apples" by C, we may define the collection F
of events as follows. With G denoting the family of all subsets of Ω − (U ∪ C),
let F contain all unions of U, C or ∅ with any element of G. Assume that
after a long series of observations we have a good estimate for the probabilities
P(A), A ∈ F.

According to our scale, we could classify the apples as

• eatable, and
  1st class for v(ω) ∈ [12, 16] (high selling price),
  2nd class for v(ω) ∈ [8, 12) (medium price),
  3rd class for v(ω) ∈ [1, 8) (low price);
• good for cooking for v(ω) = 1/2 (cheap);
• waste for v(ω) = 0.
Figure 10

…since obviously {ω | ξ(ω) ∈ A} = A if A ∈ F. Choosing Ξ ∈ A such that
P({ω | ξ(ω) ∈ Ξ}) = 1 (observe that Ξ = IRᵏ always satisfies this, but there
may be smaller sets in A that do so), with F' = {B | B = A ∩ Ξ, A ∈ A},
instead of the abstract probability space (Ω, F, P) we may equivalently
consider the induced probability space (Ξ, F', P'), which we shall use
henceforth and therefore simply denote as (Ξ, F, P). We shall use ξ̃ for the
random vector and ξ for its realizations.
Figure 11

Next let us briefly review integrals. Consider first IRᵏ with A, its measurable
sets, and the natural measure μ, and choose some bounded measurable set
B ∈ A. Further, let {A1, · · · , Ar} be a partition of B into measurable sets,
i.e. Ai ∈ A, Ai ∩ Aj = ∅ for i ≠ j, and ∪_{i=1}^{r} Ai = B. Given the indicator
functions χ_Ai : B → IR defined by

χ_Ai(x) = { 1 if x ∈ Ai,
            0 otherwise,

a simple function on B has the form

φ(x) = Σ_{i=1}^{r} ci χ_Ai(x),

i.e. φ(x) = ci for x ∈ Ai. Then the integral ∫_B φ(x)dμ is defined as

∫_B φ(x)dμ = Σ_{i=1}^{r} ci μ(Ai).                       (4.6)

In Figure 11 the integral would result by accumulating the shaded areas with
their respective signs as indicated.
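A toy instance of (4.6), with data of our own choosing: B = [0, 3) partitioned into three unit intervals and φ taking the values 2, −1 and 4 on them.

    partition = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]   # the sets A_i
    c = [2.0, -1.0, 4.0]                               # phi(x) = c_i on A_i
    integral = sum(ci * (b - a) for ci, (a, b) in zip(c, partition))
    print(integral)   # 2*1 + (-1)*1 + 4*1 = 5.0: signed areas accumulated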
Figure 12

Observe that the sum (or difference) of simple functions φ1 and φ2 is again
a simple function and that

∫_B [φ1(x) + φ2(x)]dμ = ∫_B φ1(x)dμ + ∫_B φ2(x)dμ,

|∫_B φ(x)dμ| ≤ ∫_B |φ(x)|dμ,

and, for a partition of B into measurable sets B1, · · · , Bs,

∫_B φ(x)dμ = Σ_{j=1}^{s} ∫_Bj φ(x)dμ.
Given a sequence {φn} of simple functions converging a.e. to a function φ
and such that {∫_B φn(x)dμ} is a Cauchy sequence (the convergence a.e. can
be replaced by another type of convergence, which we omit here), the integral
∫_B φ(x)dμ is defined by

∫_B φ(x)dμ = lim_{n→∞} ∫_B φn(x)dμ.

Therefore lim_{n→∞} ∫_B φn(x)dμ exists. It can be shown that this definition
yields a uniquely determined value for the integral, i.e. it cannot happen that
a choice of another mean fundamental sequence of simple functions
converging a.e. to φ yields a different value for the integral.

The boundedness of B is not absolutely essential here; with a slight
modification of the assumption "φn(x) → φ(x) a.e." the integrability of φ
may be defined analogously.

Now it should be obvious that, given a probability space (Ξ, F, P)
assumed to be introduced by a random vector ξ̃ in IRᵏ, and a function
ψ : Ξ → IR, the integral with respect to the probability measure P, denoted
by

E_ξ̃ ψ(ξ̃) = ∫_Ξ ψ(ξ)dP,

may be introduced analogously; it is the expected value of ψ with respect to
the distribution of the random vector ξ̃.
Finally, we recall that in probability theory the probability measure P of a
probability space (Ξ, F, P) in IRᵏ is equivalently described by the distribution
function F_ξ̃ defined by

F_ξ̃(x) = P({ξ | ξ ≤ x}),  x ∈ IRᵏ.

If there exists a function f_ξ̃ : IRᵏ → IR such that the distribution function can
be represented by an integral with respect to the natural measure μ as

F_ξ̃(x̂) = ∫_{x ≤ x̂} f_ξ̃(x)dμ,  x̂ ∈ IRᵏ,

then f_ξ̃ is called the density function of P. In this case the distribution function
is called of continuous type. It follows that for any event A ∈ F we have
P(A) = ∫_A f_ξ̃(x)dμ. This implies in particular that for any A ∈ F such that
μ(A) = 0 also P(A) = 0 has to hold. This fact is referred to by saying that
the probability measure P is absolutely continuous with respect to the natural
measure μ. It can be shown that the reverse statement is also true: given a
probability space (Ξ, F, P) in IRᵏ with P absolutely continuous with respect
to μ (i.e. every event A ∈ F with the natural measure μ(A) = 0 has also a
probability of zero), there exists a density function f_ξ̃ for P.
1.4.2 Deterministic Equivalents

Let us now come back to deterministic equivalents for (4.1). For instance, in
analogy to the particular stochastic linear program with recourse (3.10), for
problem (4.1) we may proceed as follows. With

gi⁺(x, ξ) = { 0          if gi(x, ξ) ≤ 0,
              gi(x, ξ)   otherwise,

the ith constraint of (4.1) is violated if and only if gi⁺(x, ξ) > 0 for a given
decision x and realization ξ of ξ̃. Hence we could provide for each constraint a
recourse or second-stage activity yi(ξ) that, after observing the realization ξ,
is chosen such as to compensate its constraint's violation, if there is one, by
satisfying gi(x, ξ) − yi(ξ) ≤ 0. This extra effort is assumed to cause an extra
cost or penalty of qi per unit, i.e. our additional costs (called the recourse
function) amount to

Q(x, ξ) = min_y { Σ_{i=1}^{m} qi yi(ξ) | yi(ξ) ≥ gi⁺(x, ξ), i = 1, · · · , m },   (4.7)

where g⁺(x, ξ) = (g1⁺(x, ξ), · · · , gm⁺(x, ξ))ᵀ, yielding the total costs

f0(x, ξ) = g0(x, ξ) + Q(x, ξ).                            (4.8)
If we think of a factory producing m products, gi(x, ξ) could be understood
as the difference {demand} − {output} of product i. Then gi⁺(x, ξ) >
0 means that there is a shortage of product i, relative to the demand.
Assuming that the factory is committed to cover the demands, problem (4.7)
could for instance be interpreted as buying the shortage of products at the
market.
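Observe that for nonnegative unit penalties qi the minimization in (4.7) is trivial: the best recourse is yi(ξ) = gi⁺(x, ξ), so that Q(x, ξ) = Σ_i qi gi⁺(x, ξ). A minimal sketch (the constraint functions below are toy data of our own, in the "demand minus output" spirit of the factory interpretation):

    def Q(x, xi, g, q):
        """Recourse cost (4.7) for decision x at realization xi; assumes q_i >= 0."""
        return sum(qi * max(0.0, gi(x, xi)) for qi, gi in zip(q, g))

    # two constraints of the form {demand} - {output}
    g = [lambda x, xi: xi[0] - 2.0 * x[0] - 6.0 * x[1],
         lambda x, xi: xi[1] - 3.0 * x[0] - 3.0 * x[1]]
    q = [7.0, 12.0]
    print(Q((36.0, 18.0), (190.0, 170.0), g, q))   # 7*10 + 12*8 = 166.0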
More generally, the recourse program may itself be nonlinear: with a cost
function q(·), a recourse feasible set Y and functions Hi(·) describing the
compensating capacity of the recourse activities, we get

Q(x, ξ) = min { q(y) | Hi(y) ≥ gi(x, ξ), i = 1, · · · , m, y ∈ Y },   (4.10)

and the deterministic equivalent of (4.1), the stochastic program with
recourse,

min_{x∈X} E_ξ̃ { g0(x, ξ̃) + Q(x, ξ̃) }.                   (4.11)
Hence, taking into account the multiple stages, we get as total costs for the
multistage problem

f0(x0, ξ1, · · · , ξK) = g0(x0) + Σ_{τ=1}^{K} Qτ(x0, x̂1, · · · , x̂τ−1, ξ1, · · · , ξτ),   (4.12)

yielding the deterministic equivalent

min_{x0∈X} [ g0(x0) + Σ_{τ=1}^{K} E_{ξ1,···,ξτ} Qτ(x0, x̂1, · · · , x̂τ−1, ξ1, · · · , ξτ) ].   (4.13)
In particular, consider the stochastic linear program

min cᵀx
s.t.  Ax = b,
      T(ξ̃)x = h(ξ̃),                                     (4.14)
      x ≥ 0.

Comparing this with the general stochastic program (4.1), we see that the set
X ⊂ IRⁿ is specified as

X = {x ∈ IRⁿ | Ax = b, x ≥ 0},

where the m0 × n matrix A and the vector b are assumed to be deterministic.
In contrast, the m1 × n matrix T(·) and the vector h(·) are allowed to depend
on the random vector ξ̃, and therefore to have random entries themselves. In
general, we assume that this dependence on ξ ∈ IRᵏ is given as

T(ξ) = T⁰ + ξ1T¹ + · · · + ξkTᵏ,
h(ξ) = h⁰ + ξ1h¹ + · · · + ξkhᵏ,                          (4.15)

with deterministic matrices T⁰, · · · , Tᵏ and vectors h⁰, · · · , hᵏ. Observing that
the stochastic constraints in (4.14) are equalities (instead of inequalities, as
in the general problem formulation (4.1)), it seems meaningful to equate their
deficiencies, which, using linear recourse and assuming that Y = {y ∈ IRⁿ |
y ≥ 0}, according to (4.9) yields the stochastic linear program with fixed
recourse

min_{x∈X} { cᵀx + E_ξ̃ min_y { qᵀy | Wy = h(ξ̃) − T(ξ̃)x, y ≥ 0 } }.   (4.16)
In general, the deterministic equivalents considered so far may be written as

min E_ξ̃ f0(x, ξ̃)
s.t.  E_ξ̃ fi(x, ξ̃) ≤ 0,  i = 1, · · · , s,
      E_ξ̃ fi(x, ξ̃) = 0,  i = s + 1, · · · , m,           (4.19)
      x ∈ X ⊂ IRⁿ,

where the fi are constructed from the objective and the constraints in (4.1)
or (4.14) respectively. So far, f0 represented the total costs (see (4.8) or (4.12))
and f1, · · · , fm could be used to describe the first-stage feasible set X.
However, depending on the way the functions fi are derived from the problem
functions gj in (4.1), this general formulation also includes other types of
deterministic equivalents for the stochastic program (4.1).
To give just two examples showing how other deterministic equivalent
problems for (4.1) may be generated, let us choose first α ∈ [0, 1] and define
a "payoff" function for all constraints as

φ(x, ξ) := { 1 − α   if gi(x, ξ) ≤ 0, i = 1, · · · , m,
             −α      otherwise.

Requiring the expected payoff to be nonnegative and setting
f1(x, ξ) := −φ(x, ξ), we get

f1(x, ξ) = { α − 1   if gi(x, ξ) ≤ 0, i = 1, · · · , m,
             α       otherwise,

and the constraint

E f1(x, ξ̃) = −E_ξ̃ φ(x, ξ̃) ≤ 0,                          (4.20)

where, with the vector-valued function g(x, ξ) = (g1(x, ξ), · · · , gm(x, ξ))ᵀ,

E f1(x, ξ̃) = ∫_Ξ f1(x, ξ)dP
           = ∫_{ {ξ | g(x,ξ) ≤ 0} } (α − 1)dP + ∫_{ {ξ | g(x,ξ) ≰ 0} } α dP
           = α − P({ξ | g(x, ξ) ≤ 0}),

so that E f1(x, ξ̃) ≤ 0 in (4.19) is equivalent to the probabilistic (chance)
constraint P({ξ | gi(x, ξ) ≤ 0, i = 1, · · · , m}) ≥ α. For the stochastic linear
program (4.14) this yields the problem with joint probabilistic constraints

min cᵀx
s.t.  Ax = b,
      P({ξ | T(ξ)x ≥ h(ξ)}) ≥ α,                          (4.23)
      x ≥ 0,

and, with Ti(ξ) and hi(ξ) denoting the ith row and ith component of T(ξ) and
h(ξ) respectively, choosing a level αi for each constraint instead, the problem
with single (or separate) probabilistic constraints

min cᵀx
s.t.  Ax = b,
      P({ξ | Ti(ξ)x ≥ hi(ξ)}) ≥ αi,  i = 1, · · · , m1,   (4.24)
      x ≥ 0.
1.5 Properties of Recourse Problems

Convexity may be shown easily for the recourse problem (4.11) under rather
mild assumptions (given the integrability of g0 + Q).

Proposition 1.1  If g0(·, ξ) and Q(·, ξ) are convex in x ∀ξ ∈ Ξ, and if X is
a convex set, then (4.11) is a convex program.
BASIC CONCEPTS
Proof
37
For x
, x X, (0, 1) and x :=
x + (1 )
x we have
g0 (
x, ) + Q(
x, ) [g0 (
x, ) + Q(
x, )] + (1 )[g0 (
x, ) + Q(
x, )]
implying
E {g0 (
E{g0 (
x, )+Q(
x, )}
x, )+Q(
x, )}+(1)E
x, )+Q(
x, )}.
{g0 (
2
Remark 1.1  Observe that for Y = IRⁿ₊ the convexity of Q(·, ξ) can
immediately be asserted for the linear case (4.16), and that it also holds for the
nonlinear case (4.10) if the functions q(·) and gi(·, ξ) are convex and the Hi(·)
are concave. Just to sketch the argument, assume that ŷ and ỹ solve (4.10)
for x̂ and x̃ respectively, at some realization ξ. Then, by the convexity of the
gi and the concavity of the Hi, i = 1, · · · , m, we have, for any λ ∈ (0, 1),

gi(λx̂ + (1 − λ)x̃, ξ) ≤ λgi(x̂, ξ) + (1 − λ)gi(x̃, ξ)
                      ≤ λHi(ŷ) + (1 − λ)Hi(ỹ)
                      ≤ Hi(λŷ + (1 − λ)ỹ).

Hence ȳ = λŷ + (1 − λ)ỹ is feasible in (4.10) for x̄ = λx̂ + (1 − λ)x̃, and
therefore, by the convexity of q,

Q(x̄, ξ) ≤ q(ȳ) ≤ λq(ŷ) + (1 − λ)q(ỹ) = λQ(x̂, ξ) + (1 − λ)Q(x̃, ξ).   □
Smoothness (i.e. partial differentiability of Q(x) = ∫_Ξ Q(x, ξ)dP) of
recourse problems may also be asserted under fairly general conditions. For
example, suppose that φ : IR² → IR, so that φ(x, y) ∈ IR. Recalling that φ is
partially differentiable at some point (x̂, ŷ) with respect to x, this means that
there exists a function, called the partial derivative and denoted by ∂φ(x, y)/∂x,
such that

[φ(x̂ + h, ŷ) − φ(x̂, ŷ)] / h = ∂φ(x̂, ŷ)/∂x + r(x̂, ŷ; h) / h,

where the residuum r satisfies

r(x̂, ŷ; h) / h → 0 as h → 0.

The recourse function is partially differentiable with respect to xj in (x̂, ξ) if
there is a function ∂Q(x, ξ)/∂xj such that

[Q(x̂ + h ej, ξ) − Q(x̂, ξ)] / h = ∂Q(x̂, ξ)/∂xj + ρj(x̂, ξ; h) / h

with

ρj(x̂, ξ; h) / h → 0 as h → 0,

where ej is the jth unit vector. The vector (∂Q(x, ξ)/∂x1, · · · , ∂Q(x, ξ)/∂xn)ᵀ
is called the gradient of Q(x, ξ) with respect to x and is denoted by ∇x Q(x, ξ).
Now we are not only interested in the partial differentiability of the recourse
function Q(x, ξ) but also in that of the expected recourse function Q(x).
Provided that Q(·, ξ) is partially differentiable at x̂ a.s. (i.e. for all ξ outside
some set N with P(N) = 0), we get

[Q(x̂ + h ej) − Q(x̂)] / h
   = ∫_Ξ [Q(x̂ + h ej, ξ) − Q(x̂, ξ)] / h dP
   = ∫_{Ξ−N} [∂Q(x̂, ξ)/∂xj + ρj(x̂, ξ; h)/h] dP
   = ∫_{Ξ−N} ∂Q(x̂, ξ)/∂xj dP + ∫_{Ξ−N} ρj(x̂, ξ; h)/h dP,
Figure 13

Figure 14
Since the Bl are convex polyhedral cones in IR^m1 (see Section 1.7) with
nonempty interiors, they may be represented by inequality systems

Cˡz ≤ 0,

where Cˡ ≠ 0 is an appropriate matrix with no row equal to zero. Fix l and
let ξ ∈ Dl(x) such that, by (5.1), h(ξ) − T(ξ)x ∈ int Bl. Then

Cˡ[h(ξ) − T(ξ)x] < 0,

i.e. for any fixed j there exists a τ̄lj > 0 such that

Cˡ[h(ξ) − T(ξ)(x − τlj ej)] ≤ 0

or, equivalently,

Cˡ[h(ξ) − T(ξ)x] ≤ −τlj Cˡ T(ξ)ej  ∀τlj ∈ [0, τ̄lj].

Hence for γ(ξ) = maxi (Cˡ T(ξ)ej)i there is a

t̄l > 0 :  Cˡ[h(ξ) − T(ξ)x] ≤ −|t|γ(ξ)e  ∀|t| < t̄l,

e = (1, · · · , 1)ᵀ. This implies that for γ := max_{ξ∈Ξ} γ(ξ) there exists a t0 > 0
such that

Cˡ[h(ξ) − T(ξ)x] ≤ −|t|γe  ∀|t| < t0

(choose, for example, t0 = t̄l γ(ξ)/γ). In other words, there exists a t0 > 0 such
that

Dl(x; t) := {ξ | Cˡ[h(ξ) − T(ξ)x] ≤ −|t|e} ≠ ∅  ∀|t| < t0,

and obviously Dl(x; t) ⊂ Dl(x). Furthermore, by elementary geometry, the
natural measure satisfies

μ(Dl(x) − Dl(x; t)) ≤ |t|v

for some constant v.

Figure 15

With dl denoting the corresponding basic dual solutions, let

max{|dlᵀ T(ξ)ej| | ξ ∈ Ξ, ∀l} =: β.

Assuming now that we have a continuous density φ(ξ) for P, we know already
from (5.1) that P({ξ | h(ξ) − T(ξ)x ∈ Bl − int Bl}) = 0. Hence it follows that

E Q(x, ξ̃) = Σl ∫_{Dl(x)} Q(x, ξ)φ(ξ)dξ
           = Σl ∫_{Dl(x)} dlᵀ[h(ξ) − T(ξ)x] φ(ξ)dξ,

and, since

∫_{Dl(x)−Dl(x;t)} [Q(x + tej, ξ) − Q(x, ξ)]/t · φ(ξ)dξ ≤ β max_ξ φ(ξ)|t|v → 0 as t → 0,

we get

∇x E Q(x, ξ̃) = Σl ∫_{Dl(x)} ∇x Q(x, ξ)φ(ξ)dξ = −Σl ∫_{Dl(x)} Tᵀ(ξ)dl φ(ξ)dξ.

Hence for the linear case, observing (4.15), we get the differentiability
statement of Proposition 1.2 provided that (5.1) is satisfied and P has a
continuous density on Ξ.   □
Summarizing the statements given so far, we see that stochastic programs
with recourse are likely to have such properties as convexity (Proposition 1.1)
and, given continuous-type distributions, differentiability (Proposition 1.2),
which, from the viewpoint of mathematical programming, are appreciated.
On the other hand, if we have a joint finite discrete probability distribution
{(ξᵏ, pk), k = 1, · · · , r} of the random data then, for example, problem (4.16)
becomes, similarly to the special example (3.11), a linear program

min_{x∈X} { cᵀx + Σ_{k=1}^{r} pk qᵀyᵏ }
s.t.  T(ξᵏ)x + Wyᵏ = h(ξᵏ),  k = 1, · · · , r,            (5.2)
      yᵏ ≥ 0,  k = 1, · · · , r,

having the so-called dual decomposition structure, as mentioned already for our
special example (3.11) and demonstrated in Figure 16 (see also Section 1.7.4).
Figure 16  Dual decomposition data structure.

     cᵀ       qᵀ       qᵀ      · · ·    qᵀ
    T(ξ¹)     W                                  h(ξ¹)
    T(ξ²)              W                         h(ξ²)
      ·                         · · ·              ·
    T(ξʳ)                               W        h(ξʳ)
For a continuous distribution, on the other hand, the second-stage (recourse)
function of (4.16),

Q(x, ξ) = min_{y≥0} { qᵀy | Wy = h(ξ) − T(ξ)x },

has to be dealt with for all ξ in the support Ξ. Assume that Ξ is a bounded
convex polyhedron with vertices ξ¹, · · · , ξʳ, so that every ξ ∈ Ξ has a
representation

ξ = Σ_{j=1}^{r} λj ξʲ,  Σ_{j=1}^{r} λj = 1,  λj ≥ 0 ∀j.

From the definition of a support, it follows that x ∈ IRⁿ allows for a feasible
solution of the second-stage program for all ξ ∈ Ξ if and only if this is true
for all ξʲ, j = 1, · · · , r. In other words, the induced first-stage feasibility set
K is given as

K = {x | T(ξʲ)x + Wyʲ = h(ξʲ), yʲ ≥ 0, j = 1, · · · , r}.

From this formulation of K (which obviously also holds if ξ̃ has a finite discrete
distribution, i.e. Ξ = {ξ¹, · · · , ξʳ}), we evidently get the following.

Proposition 1.3  If the support Ξ of the distribution of ξ̃ is either a finite
set or a (bounded) convex polyhedron, then the induced first-stage feasibility
set K is a convex polyhedral set. The first-stage decisions are restricted to
x ∈ X ∩ K.
Example 1.3  Consider a first-stage feasible set X (as illustrated in
Figure 17), the recourse matrix

W = ( −1  3  5
       2  2  2 ),

the technology matrix

T = ( 2  3
      3  1 ),

and a random vector ξ̃ with the support Ξ = [4, 19] × [13, 21]. Then the
constraints to be satisfied for all ξ ∈ Ξ are

Wy = ξ − Tx,  y ≥ 0.

Observing that the second column W2 of W is a positive linear combination
of W1 and W3, namely W2 = (1/3)W1 + (2/3)W3, the above second-stage
constraints reduce to the requirement that for all ξ ∈ Ξ the right-hand side
ξ − Tx can be written as

ξ − Tx = αW1 + γW3,  α, γ ≥ 0,

or in detail as

ξ1 − 2x1 − 3x2 = −α + 5γ,
ξ2 − 3x1 − x2 = 2α + 2γ,  α, γ ≥ 0.

Multiplying this system of equations with the regular matrix

S = (  2  1
      −2  5 ),

which corresponds to adding 2 times the first equation to the second and
adding −2 times the first to 5 times the second, respectively, we get the
equivalent system

2ξ1 + ξ2 − 7x1 − 7x2 = 12γ ≥ 0,
−2ξ1 + 5ξ2 − 11x1 + x2 = 12α ≥ 0.

Because of the required nonnegativity of α and γ, this is equivalent to the
system of inequalities

7x1 + 7x2 ≤ 2ξ1 + ξ2  (≥ 21 on Ξ),
11x1 − x2 ≤ −2ξ1 + 5ξ2  (≥ 27 on Ξ).

Since these inequalities have to be satisfied for all ξ ∈ Ξ, choosing the minimal
right-hand sides (for ξ ∈ Ξ) yields the induced constraints as

K = {x | 7x1 + 7x2 ≤ 21, 11x1 − x2 ≤ 27}.

The first-stage feasible set X together with the induced feasible set are
illustrated in Figure 17.   □
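The reduction in this example is easy to cross-check numerically (a sketch assuming SciPy): second-stage feasibility for all ξ ∈ Ξ needs to be tested only at the four vertices of Ξ, as argued before Proposition 1.3, and the outcome must agree with membership in K.

    import numpy as np
    from scipy.optimize import linprog

    W = np.array([[-1.0, 3.0, 5.0],
                  [ 2.0, 2.0, 2.0]])      # note W2 = (1/3)W1 + (2/3)W3
    T = np.array([[2.0, 3.0],
                  [3.0, 1.0]])
    vertices = [(4.0, 13.0), (4.0, 21.0), (19.0, 13.0), (19.0, 21.0)]

    def second_stage_feasible(x):
        for xi in vertices:
            rhs = np.array(xi) - T @ np.array(x)
            res = linprog(np.zeros(3), A_eq=W, b_eq=rhs, bounds=(0, None))
            if res.status != 0:           # no y >= 0 with W y = xi - T x
                return False
        return True

    for x in [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0), (3.0, 0.0)]:
        in_K = 7 * x[0] + 7 * x[1] <= 21 and 11 * x[0] - x[1] <= 27
        print(x, second_stage_feasible(x), in_K)   # the two tests agree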
T
It might happen that X K = ; then we should check our model very
carefully to figure out whether we really modelled what we had in mind or
whether we can find further possibilities for compensation that are not yet
contained in our model. On the other hand, we have already mentioned the
case of a complete fixed recourse matrix (see (4.17) on page 34), for which
K = IRn and therefore the problem of induced constraints does not exist.
Hence it seems interesting to recognize complete recourse matrices.
Figure 17  Induced constraints K.

It can be shown that W is a complete recourse matrix if rk(W) = m1 and the
linear constraints

Wy = 0,
yi ≥ 1, i = 1, · · · , m1,                               (5.4)
y ≥ 0

have a feasible solution.

Proof  Since rk(W) = m1, we may assume that the columns W1, · · · , Wm1 are
linearly independent. Given an arbitrary z ∈ IR^m1, let y satisfy

Σ_{i=1}^{m1} Wi yi + Σ_{i=m1+1}^{n} Wi yi = z − Σ_{i=1}^{m1} Wi,  yi ≥ 0, i = 1, · · · , n.

With

ỹi = { yi + 1,  i = 1, · · · , m1,
       yi,      i > m1,

we get Σ_{i=1}^{n} Wi ỹi = z with ỹ ≥ 0, i.e. Wỹ = z has a feasible solution
for every right-hand side z.
1.6 Properties of Probabilistic Constraints

For the chance-constrained problems introduced above, the feasible set
determined by a joint probabilistic constraint at level α is

B(α) = {x | P({ξ | g(x, ξ) ≤ 0}) ≥ α},                   (6.2)

which, with G := {G ∈ F | P(G) ≥ α}, may also be written as

B(α) = ∪_{G∈G} ∩_{ξ∈G} {x | g(x, ξ) ≤ 0}.                (6.3)
Suppose, for instance, that the demand constraints of our refinery example,

2xraw1 + 6xraw2 ≥ h1(ξ̃),
3xraw1 + 3xraw2 ≥ h2(ξ̃),

are required to hold with prescribed probabilities. It follows that the feasible
set for the above constraints is nonconvex, as shown in Figure 18.   □

Figure 18
As above, define S(x) := {ξ | g(x, ξ) ≤ 0}. If g(·, ·) is jointly convex in (x, ξ)
then, with xi ∈ B(α), i = 1, 2, ξi ∈ S(xi) and λ ∈ [0, 1], for (x̄, ξ̄) =
λ(x1, ξ1) + (1 − λ)(x2, ξ2) it follows that

g(x̄, ξ̄) ≤ λg(x1, ξ1) + (1 − λ)g(x2, ξ2) ≤ 0,

i.e. ξ̄ = λξ1 + (1 − λ)ξ2 ∈ S(x̄), and hence

S(x̄) ⊃ [λS(x1) + (1 − λ)S(x2)],

implying

P(S(x̄)) ≥ P(λS(x1) + (1 − λ)S(x2)).

By our assumption on g (joint convexity), any set S(x) is convex. Now we
conclude immediately that B(α) is convex ∀α ∈ [0, 1] if

P(λS1 + (1 − λ)S2) ≥ min[P(S1), P(S2)]  ∀λ ∈ [0, 1]

for all convex sets Si ∈ F, i = 1, 2, i.e. if P is quasi-concave. Hence we have
proved the following (Proposition 1.5).

Figure 19
On the other hand, F_ξ̃ being a quasi-concave function does not imply in
general that the corresponding probability measure P is quasi-concave. For
instance, …

and hence

P(λS1 + (1 − λ)S2) ≥ min[P(S1), P(S2)].   □
As mentioned above, for the log-concave case necessary and sufficient
conditions were derived first, and later corresponding conditions for quasi-concave
measures were found.

Proposition 1.6  Let P on Ξ = IRᵏ be of the continuous type, i.e. have a
density f. Then the following statements hold:

• P is log-concave iff f is log-concave (i.e. if the logarithm of f is a concave
  function);
• P is quasi-concave iff f^(−1/k) is convex.

The proof has to be omitted here, since it would require a rather advanced
knowledge of measure theory.
Remark 1.4  Consider

(a) the k-dimensional uniform distribution on a convex body S ⊂ IRᵏ (with
    positive natural measure μ(S)) given by the density

    φ_U(x) := { 1/μ(S)  if x ∈ S,
                0       otherwise

    (μ is the natural measure in IRᵏ; see Section 1.4.1);

(b) the exponential distribution with density

    φ_EXP(x) := { 0          if x < 0,
                  λe^(−λx)   if x ≥ 0

    (λ > 0 is constant);

(c) the multivariate normal distribution in IRᵏ described by the density

    φ_N(x) := (2π)^(−k/2)(det Σ)^(−1/2) e^(−(1/2)(x−m)ᵀΣ⁻¹(x−m)).

(a) On its support S we have φ_U^(−1/k)(x) = (μ(S))^(1/k), a constant and
hence convex function, implying by Proposition 1.6 that the corresponding
probability measure P_U is quasi-concave.

(b) Since

ln[φ_EXP(x)] = { −∞           if x < 0,
                 ln λ − λx    if x ≥ 0,

the density of the exponential distribution is obviously log-concave,
implying by Proposition 1.6 that the corresponding measure P_EXP is
log-concave and hence, by Lemma 1.2, also quasi-concave.

(c) Taking the logarithm

ln[φ_N(x)] = const − (1/2)(x − m)ᵀΣ⁻¹(x − m)

and observing that the covariance matrix Σ and hence its inverse Σ⁻¹ are
positive definite, we see that this density is log-concave, and therefore the
corresponding measure P_N is log-concave (by Proposition 1.6) as well as
quasi-concave (by Lemma 1.2).

There are many other classes of widely used continuous-type probability
measures which, according to Proposition 1.6, are either log-concave or at
least quasi-concave.   □
In addition to Proposition 1.5, we have the following statement, which is of
interest because, for mathematical programs in general, we cannot assert the
existence of solutions if the feasible sets are not known to be closed.
Proposition 1.7 If g : IRn IRm is continuous then the feasible set
B() is closed.
Proof
Consider any sequence {x } such that x x
and x B() .
To prove the assertion, we have to show that x
B(). Define A(x) := { |
g(x, ) 0}. Let Vk be the open ball with center x
and radius 1/k. Then we
show first that
\
[
A(
x) =
cl
A(x).
(6.4)
k=1
xVk
\
[
cl
A(x).
A(
x)
k=1
xVk
Assume that k=1 cl xVk A(x). This means that for every k we have
S
S
cl xVk A(x); in other words, for every k there exists a k xVk A(x)
1/k (and
and hence some xk Vk with k A(xk ) such that k k k
k
k
k k
Since
obviously kx x
k 1/k since x Vk ). Hence (x , ) (
x, ).
k A(xk ), g(xk , k ) 0 k and therefore, by the continuity of g(, ),
A(
x), which proves (6.4) to be true.
BASIC CONCEPTS
53
K
\
k=1
cl
A(x)
xVk
1.7
Linear Programming
min cT x
s.t. Ax = b,
(7.1)
x 0,
where the vectors c IRn , b IRm and the m n matrix A are given
and x IRn is to be determined. Any other LP6 formulation can easily be
54
STOCHASTIC PROGRAMMING
transformed to assume the form (7.1). If, for instance, we have the problem
min cT x
s.t. Ax b
x 0,
then, by introducing a vector y IRm
+ of slack variables, we get the problem
min cT x
s.t. Ax y = b
x0
y 0,
which is of the form (7.1). This LP is equivalent to (7.1) in the sense that
the x part of its solution set and the solution set of (7.1) as well as the two
optimal values obviously coincide. Instead, we may have the problem
min cT x
s.t. Ax b,
where the decision variables are not required to be nonnegativeso-called free
variables. In this case we may introduce a vector y IRm
+ of slack variables
andobserving that any real number may be presented as the difference of two
nonnegative numbersreplace the original decision vector x by the difference
z + z of the new decision vectors z + , z IRn+ yielding the problem
min{cT z + cT z }
s.t. Az + Az y
z+
z
y
= b,
0,
0,
0,
which is again of the form (7.1). Furthermore, it is easily seen that this
transformed LP and its original formulation are equivalent in the sense that
given any solution (
z + , z , y) of the transformed LP, x := z+ z is a
solution of the original version,
given any solution x of the original LP, the vectors y := A
x b and
, solve the transformed version,
z+ , z IRn+ , chosen such that z+ z = x
and the optimal values of both versions of the LP coincide.
1.7.1
(7.2)
BASIC CONCEPTS
55
is satisfied. Given this condition, it may happen that rk(A) < m, but then
we may drop one or more equations from the system without changing its
solution set. Therefore we assume throughout this section that
rk(A) = m,
(7.3)
56
STOCHASTIC PROGRAMMING
xk
{N B}
xl
(7.4)
x{B} = B 1 b B 1 N x{N B} ,
(7.5)
whichusing the assignment (7.4) yields for any choice of the nonbasic
variables x{N B} a solution of our system Ax = b, and in particular for
x{N B} = 0 reproduces our feasible basic solution x
.
Proposition 1.9 If B =
6 then there exists at least one feasible basic solution.
Proof
A
x = b, x
0.
If for I(
x) = {i | x
i > 0} the column set {Ai | i I(
x)} is linearly dependent,
then the linear homogeneous system of equations
P
iI(
x) Ai yi = 0,
yi = 0, i 6 I(
x),
has a solution y 6= 0 with yi < 0 for at least one i I(
x)if this does not
hold for y, we could take
y, which solves the above homogeneous system as
well. Hence for
:= max{ | x +
y 0}
< . Since A
we have 0 <
y = 0 obviously holds for y, it followsobserving
y
the definition of that
for z := x
+
y
Az = A
x + A
= b,
z 0,
i.e. z B, and I(z) I(
x), I(z) 6= I(
x), such that we have reduced our
original feasible solution x to another one with fewer positive components.
BASIC CONCEPTS
57
58
STOCHASTIC PROGRAMMING
Figure 21
x
,
i
i=1
Pr
P
r
r
{i}
=
1,
0
i,
and
w
=
x
,
where
=
where
i
i
i
i
i=1
i=1
i=1
1, i 0 i. As is easily checked, we have x
= v + (1 )w with
= /( ) (0, 1). This implies immediately that x
is a convex linear
combination of {x{i} , i = 1, , r}.
2
The convex hull of finitely many points {x{1} , , x{r} }, formally denoted
by conv{x{1} , , x{r} }, is called a convex polyhedron or a bounded convex
polyhedral set (see Figure 21). Take for instance in IR2 the points z 1 =
(2, 2), z 2 = (8, 1), z 3 = (4, 3), z 4 = (7, 7) and z 5 = (1, 6). In Figure 22 we
have P = conv{z 1 , , z 5 }, and it is obvious that z 3 is not necessary to
in other words, P = conv{z 1 , z 2 , z 3 , z 4 , z 5 } = conv{z 1 , z 2 , z 4 , z 5 }.
generate P;
whereas
Hence we may drop z 3 without any effect on the polyhedron P,
omitting any other of the five points would essentially change the shape of
the polyhedron. The points that really count in the definition of a convex
polyhedron are its vertices (z 1 , z 2 , z 4 and z 5 in the example). Whereas in twoor three-dimensional spaces, we know by intuition what we mean by a vertex,
we need a formal definition for higher-dimensional cases: A vertex of a convex
polyhedron P is a point x P such that the line segment connecting any two
points in P, both different from x
, does not contain x
. Formally,
6 y, z P, y 6= x
6= z, (0, 1), such that x = y + (1 )z.
It may be easily shown that for an LP with a bounded feasible set B the
feasible basic solutions x{i} , i = 1, , r, coincide with the vertices of B.
By Proposition 1.10, the feasible set of a linear program is a convex
BASIC CONCEPTS
Figure 22
59
xK
K,
kxK k
z K 0,
kz K k = 1,
Az K = b/kxK k,
and hence
kAz K k kbk/K
K.
(7.6)
60
A
z = 0, z 0, z 6= 0.
STOCHASTIC PROGRAMMING
Proof
Since for C = {0} the statement is trivial,P
we assume that C =
6 {0}.
n
yi > 0 we have, with
For any P
arbitrary y C such that y 6= 0 and hence i=1P
n
n
:= 1/ i=1 yi for y :=
y , that y C := {y | Ay
i=1 yi = 1, y 0}.
Pn= 0,
Obviously C C and, owing to the constraints i=1 yi = 1, y 0, the set C
is bounded. Hence, by Proposition 1.10, C is a convex polyhedron generated
{1}
{s}
by itsPfeasible basic solutions
Ps {y , , y } such that y has a representation
s
{i}
y =
with i=1 i = 1, i 0 i,Pimplying that y = (1/)
yP
= i=1 i y
s
{i}
. This shows that C = {y | y = si=1 i y {i} , i 0 i}. 2
i=1 (i /)y
In Figure 23 we see a convex polyhedral cone C and its intersection C
with the hyperplane H = {y | eT y = 1} (e = (1, , 1)T ). The vectors
y {1} , y {2} and y {3} are the generating elements (feasible basic solutions) of C,
as discussed in the proof of Proposition 1.12, and therefore they are also the
generating elements of the cone C .
Now we are ready to describe the feasible set B of the linear program (7.1)
in general. Given the convex polyhedron P := conv{x{1} , , x{r} } generated
by the feasible basic solutions {x{1} , , x{r} } B and the convex
polyhedral cone C = {y | Ay = 0, y 0}given by its generating elements
as pos{y {1} , , y {s} } as discussed in Proposition 1.12we get the following.
Proposition 1.13 B is the algebraic sum of P and C, formally B = P + C,
meaning that every x B may be represented as x
= z + y, where z P and
y C.
Proof Choose an arbitrary x
B. Since {y | Ay = 0, 0 y x
} is compact,
the continuous function (y) := eT y, where e = (1, , 1)T , attains its
maximum on this set. Hence there exists a y such that
A
y = 0,
y x,
(7.7)
y 0,
eT y = max{eT y | Ay = 0, 0 y x
}.
BASIC CONCEPTS
Figure 23
61
Let x
:= x
y. Then x
B and {y | Ay = 0, 0 y x
} = {0}, since otherwise
we should have a contradiction to (7.7). Hence for I(
x) = {i | xi > 0} we have
{y | Ay = 0, yi = 0, i 6 I(
x), y 0} = {0}
and therefore, by Proposition 1.11, the feasible set
B1 := {x | Ax = b, xi = 0, i 6 I(
x), x 0}
is bounded and, observing that x
B1 , nonempty. From Proposition 1.10, it
follows that x
is a convex linear combination of the feasible basic solutions of
Ax = b
xi = 0, i 6 I(
x)
x0
which are obviously feasible basic solutions of our original constraints
Ax = b
x0
as well. It follows that x
P, and, by the above construction, we have y C
and x = x
+ y.
2
According to this proposition, the feasible set of any LP is constructed
as follows. First we determine the convex hull P of all feasible basic
solutions, which might look like that in Figure 21, for example; then we
add (algebraically) the convex polyhedral cone C (owing to Proposition 1.10
62
Figure 24
STOCHASTIC PROGRAMMING
BASIC CONCEPTS
Figure 25
63
(7.8)
cT y 0 y C = {y | Ay = 0, y 0}.
(7.9)
and
Given that these two conditions are satisfied, there is at least one feasible basic
solution that is an optimal solution.
Proof Obviously condition (7.9) is necessary for the existence of an optimal
solution. If B =
6 then we know from Proposition 1.13 that x B iff
Ps
Pr
x = i=1 i x{i} + j=1 j y {j}
Pr
with i 0 i, j 0 j and
i=1 i = 1
where {x{1} , , x{r} } is the set of all feasible basic solutions in B and
{y {1} , , y {s} } is a set of elements generating C, for instance as described
in Proposition 1.12. Hence solving
min cT x
s.t. Ax = b,
x0
64
STOCHASTIC PROGRAMMING
If we have the task of solving a linear program of the form (7.1) then, by
Proposition 1.14, we may restrict ourselves to feasible basic solutions. Let
x
B be any basic solution and, as before, I(
x) = {i | x
i > 0}. Under the
assumption (7.3), the feasible basic solution is called
nondegenerate if |I(
x)| = m, and
degenerate if |I(
x)| < m.
To avoid lengthy discussions, we assume in this section that for all feasible
basic solutions x{1} , , x{r} of the linear program (7.1) we have
|I(x{i} )| = m, i = 1, , r,
(7.10)
i.e. that all feasible basic solutions are nondegenerate. For the case of
degenerate basic solutions, and the adjustments that might be necessary in
BASIC CONCEPTS
65
this case, the reader may consult the wide selection of books devoted to
linear programming in particular. Referring to our former presentation (7.5),
we have, owing to (7.10), that IB (
x) = I(
x), and, with the basic part
B = (Ai | i I(
x)) and the nonbasic part N = (Ai | i 6 I(
x)) of the matrix
A, the constraints of (7.1) may be rewrittenusing the basic and nonbasic
variables as introduced in (7.4)as
x{B} = B 1 b B 1 N x{N B} ,
(7.11)
x{B} 0,
{N B}
x
0.
(7.12)
(7.13)
{N B}
x{N B} 0
66
STOCHASTIC PROGRAMMING
BASIC CONCEPTS
67
Remark 1.5 The following comments on the single steps of the simplex
method may be helpful for a better understanding of this procedure:
Step 1 Obviously we assume that B =
6
. The existence of a feasible
basis B follows from Propositions 1.9 and 1.8. Because of our
assumption (7.10), we have B 1 b > 0.
Step 2 (a) If for a feasible basis B we have
[(c{N B} )T (c{B} )T B 1 N ]T 0
then by Proposition 1.15 this basis (i.e. the corresponding basic
solution) is optimal.
(b) If the simplex criterion is violated for the feasible basic solution
belonging to B given by x{B} = B 1 b, x{N B} = 0, then
there must be an index {1, , n m} such that 0 :=
[(c{N B} )T (c{B} )T B 1 N ]T
< 0, and, keeping all but the th
{N B}
= 0, j 6= , with
nonbasic variables on their present values xj
:= B 1 N , the objective and the basic variables have the
representations
{N B}
= (c{B} )T B 1 b + 0 x
,
{N B}
{B}
x
= B 1 b
+ x
.
According to these formulae, we conclude immediately that for
0 the nonnegativity of the basic variables would never
{N B}
be violated by increasing x
arbitrarily such that we had
inf B = , whereas for 6 0 it would follow that the
set of rows {i | i < 0, 1 i m} 6= , and consequently, with
{N B}
:= B 1 b, the constraints x{B} = + x
0 would
{N B}
block the increase of x
at some positive value (remember
that, by the assumption (7.10), we have > 0). More precisely,
we now have to observe the constraints
i + i x{N B} 0 for i {i | i < 0, 1 i m}
68
STOCHASTIC PROGRAMMING
or equivalently
x{N B}
i
for i {i | i < 0, 1 i m}.
i
i
= min
i < 0, 1 i m ,
i
{B}
{N B}
x
is the first basic variable to decrease to zero if x
is
increased to the value /( ), and we observe that at the
same time the objective value is changed to
>0
z}|{
{B} T
= (c
) + 0
|{z} | {z}
<0
< (c
{B} T
>0
BASIC CONCEPTS
69
bases for any linear program of the form (7.1), the simplex method must end
after finitely many cycles.
2
Remark 1.6 In step 2 of the simplex method it may happen that the simplex
criterion is not satisfied and that we discover that inf B = . It is worth
mentioning that in this situation we may easily find a generating element of
the cone C associated with B, as discussed in Proposition 1.12. With the above
notation, we then have a feasible basis B, and for some column N 6= 0 we
have B 1 N 0. Then, with e = (1, , 1)T of appropriate dimensions, for
(
y {B} , y{N B} ) satisfying
{N B}
y{B} = B 1 N y
,
1
{N B}
y
=
,
eT B 1 N + 1
{N B}
yl
= 0 for l 6=
it follows that
B y{B} + N y{N B} = 0
{N B}
{N B}
eT y{B} + eT y{N B} = eT B 1 N y
+ y
{N
B}
= (eT B 1 N + 1)
y
= 1,
y{B} 0,
{N B}
y
0.
Observe that, with B = (B1 , , Bm ) a basis of IRm , owing to
v = B 1 N 0, and hence 1 eT v 1,
we have
B1
rk
1
It follows that
Bm
B1
1
0
1 eT v
B2
1
B1
= rk
1
B1
= rk
0
Bm
N
1
Bm
0
1
Bm
0
1
70
1.7.3
STOCHASTIC PROGRAMMING
Duality Statements
min cT x
s.t. Ax = b,
x 0,
(7.14)
(7.15)
max bT u
s.t. AT u c,
u 0.
Hence for this case the pair of the primal and its dual program looks like
min cT x
s.t. Ax b,
x 0;
max bT u
s.t. AT u c,
u 0.
BASIC CONCEPTS
71
2
Example 1.6 Considering the primal program
min cT x
s.t. Ax b,
x0
in its standard form
min cT x
s.t. Ax + Iy = b,
x 0,
y0
72
STOCHASTIC PROGRAMMING
max f T z
s.t. DT z g,
DT z g,
Iz 0
min f T w
s.t. DT w = g,
w 0.
2
Hence, by comparison with our standard forms of the primal program (7.14)
and the dual program (7.15), it follows that the dual of the dual is the primal
program.
2
There are close relations between a primal linear program and its dual
program. Let us denote the feasible set of the primal program (7.14) by B and
that of its dual program by D. Furthermore, let us introduce the convention
that
inf xB cT x = + if B = ,
(7.16)
supuD bT u = if D = .
Then we have as a first statement the following so-called weak duality theorem:
Proposition 1.17 For the primal linear program (7.14) and its dual (7.15)
inf cT x sup bT u.
xB
uD
BASIC CONCEPTS
73
x B, u D,
and hence
inf xB cT x supuD bT u.
2
In view of this proposition, the question arises as to whether or when it
might happen that
inf cT x > sup bT u.
xB
uD
74
STOCHASTIC PROGRAMMING
such that also the dual constraints do not allow a feasible solution. Hence, by
our convention (7.16), we have for this dual pair
inf cT x = + > sup bT u = .
xB
uD
2
However, the so-called duality gap in the above example does not occur
so long as at least one of the two problems is feasible, as is asserted by the
following strong duality theorem of linear programming.
Proposition 1.18 Consider the feasible sets B and D of the dual pair of
linear programs (7.14) and (7.15) respectively. If either B =
6 or D 6= then
it follows that
inf cT x = sup bT u.
xB
uD
If one of these two problems is solvable then so is the other, and we have
min cT x = max bT u.
xB
uD
BASIC CONCEPTS
75
1.7.4
76
STOCHASTIC PROGRAMMING
min{cT x + q T y}
s.t. Ax
= b,
T x + W y = h,
(7.17)
x 0,
y 0.
In addition, we assume that the problem is solvable and that the set {x |
Ax = b, x 0} is bounded. The above problem may be restated as
min{cT x + f (x)}
s.t. Ax = b,
x 0,
with
f (x) := min{q T y | W y = h T x, y 0}.
Our recourse function f (x) is easily seen to be piecewise linear and convex. It
is also immediate that the above problem can be replaced by the equivalent
problem
min{cT x + }
s.t.
Ax = b
f (x) 0
x 0;
however, this would require that we know the function f (x) explicitly in
advance. This will not be the case in general. Therefore we may try to
construct a sequence of new (additional) linear constraints that can be
used to define a monotonically decreasing feasible set B1 of (n + 1)-vectors
(x1 , , xn , )T such that finally, with B0 := {(xT , )T | Ax = b, x 0,
IR}, the problem min(x,)B0 B1 {cT x + } yields a (first-stage) solution of our
problem (7.17).
After these preparations, we may describe the following particular method.
BASIC CONCEPTS
77
0 0
and hence
uT h u
T T x,
which has to hold for any feasible x, and obviously does not hold
for x
, since u
T (h T x
) > 0. Therefore we introduce the feasibility
cut, cutting off the infeasible solution x:
u
T (h T x) 0.
T
Then we redefine B1 := B1 {(xT , ) | uT (h T x) 0} and go on
to step 3.
(b) Otherwise, if f (
x) is finite, we have for the recourse problem (see
the proof of Proposition 1.18) simultaneouslyfor x
a primal
optimal basic solution y and a dual optimal basic solution u
. From
the dual formulation of the recourse problem, it is evident that
f (
x) = (h T x
)T u
,
whereas for any x we have
f (x) = sup{(h T x)T u | W T u q}
(h T x)T u
=u
T (h T x).
78
STOCHASTIC PROGRAMMING
Figure 26
BASIC CONCEPTS
79
{v | W v = 0, q T v < 0, v 0} = .
In addition, we have assumed {x | Ax = b, x 0} to be bounded.
Hence inf{f (x) | Ax = b, x 0} is finite such that the lower bound
0 exists. This (and the boundedness of {x | Ax = b, x 0}) implies
that
min{cT x + | Ax = b, 0 , x 0}
is solvable.
Step 2 If f (
x) = +, we know from Proposition 1.14 that {u | W T u
0, (h T x
)T u > 0} 6= , and, according to Remark 1.6, for the convex
polyhedral cone {u | W T u 0} we may find with the simplex method
one of the generating elements u
mentioned in Proposition 1.12 that
satisfies (h T x
)T u
> 0. By Proposition 1.12, we have finitely many
generating elements for the cone {u | W T u 0} such that, after
having used all of them to construct feasibility cuts, for all feasible
x we should have (h T x)T u 0 u {u | W T u 0} and hence
solvability of the recourse problem. This shows that f (
x) = + may
appear only finitely many times within this method.
If f (
x) is finite, the simplex method yields primal and dual optimal
feasible basic solutions y and u
respectively. Assume that we already
had the same dual basic solution u
:= u
in a previous step to construct
an optimality cut
uT (h T x);
then our present has to satisfy this constraint for x = x
such that
uT (h T x
)
= uT (h T x
)
holds, or equivalently we have f (
x) and stop the procedure. From
the above inequalities, it follows that
(h T x
)T u{i} , i = 1, , k,
if u{1} , , u{k} denote the feasible basic solutions in {u | W T u q}
used so far for optimality cuts. Observing that in step 3 for any x we
minimize with respect to B1 this implies that
= max (h T x
)T u{i} .
1ik
80
STOCHASTIC PROGRAMMING
s.t. Ax
=b
T i x + W y i = hi , i = 1, , K
x 0,
y i 0, i = 1, , K.
Thus we may simply introduce feasibility and optimality cuts for all the recourse functions fi (x) := min{q iT y i | W y i = hi T i x, y i 0}, i = 1, , K,
yielding the so-called multicut version of the dual decomposition method. Alternatively, combining the single cuts corresponding to the particular blocks
i = 1, , K with their respective probabilities leads to the so-called L-shaped
method.
1.8
Nonlinear Programming
BASIC CONCEPTS
81
82
Figure 27
STOCHASTIC PROGRAMMING
If, moreover, the function is convex then, owing to Proposition 1.21, this
condition is also sufficient for x
to be a global minimum, since then for any
arbitrary x IRn we have
0 = (x x
)T (
x) (x) (
x)
and hence
(
x) (x) x IRn .
d
(
x) = 2.
dx
Hence we cannot just transfer the optimality conditions for unconstrained optimization to the constrained case.
2
Therefore we shall first deal with the necessary and/or sufficient conditions
for some x
IRn to be a local or global solution of the program (8.1).
BASIC CONCEPTS
1.8.1
83
(8.3)
(8.4)
max{bT u}
X
ai ui = c,
s.t.
(8.5)
i=1
u 0.
(8.6)
i=1
84
STOCHASTIC PROGRAMMING
u 0 such that f (
x) +
m
X
u
i gi (
x)
i=1
m
X
= 0,
u
i gi (
x) = 0.
i=1
u 0 such that f (
x) +
m
X
u
i gi (
x)
i=1
m
X
= 0,
u
i gi (
x) = 0
i=1
is sufficient for x
to be a solution of the program (8.4).
2
Remark 1.10 The optimality condition derived in Remark 1.9 for the linear
case could be formulated as follows:
(1) For the feasible x
the negative gradient of the objective f i.e. the
direction of the greatest (local) descent of f is equal (with the multipliers
u
i 0) to a nonnegative linear combination of the gradients of those
constraint functions gi that are active at x
, i.e. that satisfy gi (
x) = 0.
(2) This corresponds to the fact that the multipliers satisfy the complementarity conditions u
i gi (
x) = 0, i = 1, , m, stating that the multipliers
u
i are zero for those constraints that are not active at x
, i.e. that satisfy
gi (
x) < 0.
In conclusion, this optimality condition says that f (
x) must be contained
in the convex polyhedral cone generated by the gradients gi (
x) of the constraints being active in x
. This is one possible formulation of the KuhnTucker
conditions illustrated in Figure 28.
2
Let us now return to the more general nonlinear case and consider the
following question. Given that x
is a (local) solution, under what assumption
BASIC CONCEPTS
Figure 28
85
KuhnTucker conditions.
u 0 such that f (
x) +
m
X
ui gi (
x) = 0,
i=1
m
X
i=1
(8.7)
ui gi (
x) = 0,
hold? Hence we ask under what assumption are the conditions (8.7) necessary
for x
to be a (locally) optimal solution of the program (8.1). To answer this
question, let I(
x) := {i | gi (
x) = 0}, such that the optimality conditions (8.7)
are equivalent to
o
n X
ui gi (
x) = f (
x), ui 0 for i I(
x) 6= .
u
iI(
x)
Observing that gi (
x) and f (
x) are constant vectors when x is fixed at x
,
the condition of Farkas lemma (Proposition 1.19) is satisfied if and only if
the following regularity condition holds in x
:
RC 0
z T gi (
x) 0, i I(
x) implies that z T f (
x) 0.
(8.8)
86
STOCHASTIC PROGRAMMING
satisfied in x
it necessarily follows that
u 0 such that f (
x) +
m
X
u
i gi (
x)
i=1
m
X
= 0,
ui gi (
x) = 0.
i=1
Example 1.10 The KuhnTucker conditions need not hold if the regularity
condition cannot be asserted. Consider the following simple problem (x IR1 ):
min{x | x2 0}.
Its unique solution is x
= 0. Obviously we have
f (
x) = (1), g(
x) = (0),
and there is no way to represent f (
x) as (positive) multiple of g(
x). (Needless to say, the regularity condition RC 0 is not satisfied in x.)
2
We just mention that for the case of linear constraints the KuhnTucker
conditions are necessary for optimality, without the addition of any regularity
condition.
Instead of condition RC 0, there are various other regularity conditions
popular in optimization theory, only two of which we shall mention here. The
first is stated as
RC 1
z 6= 0 s.t. z T gi (
x) 0, i I(
x), {xk | xk 6= x, k = 1, 2, } B
such that
lim xk = x
,
xk x
z
=
.
k
k kx x
k
kzk
lim
The secondused frequently for the convex case, i.e. if the functions gi are
convexis the Slater condition
RC 2
x B such that gi (
x) < 0 i.
(8.9)
BASIC CONCEPTS
Figure 29
87
such that
condition RC 2for the convex caserequires the existence of an x
gi (
x) < 0 i, but does not refer to any optimal solution. Without proof we
might mention the following.
Proposition 1.23
(a) The regularity condition RC 1 (in any locally optimal solution) implies
the regularity condition RC 0.
(b) For the convex case the Slater condition RC 2 implies the regularity
condition RC 1 (for every feasible solution).
In Figure 29 we indicate how the proof of the implication RC 2 = RC 1 can
be constructed.
Based on these facts we immediately get the following.
Proposition 1.24
(a) If x (locally) solves problem (8.1) and satisfies RC 0 then the Kuhn
Tucker conditions (8.7) necessarily hold in x
.
(b) If the functions f, gi , i = 1, , m, are convex and the Slater condition
RC 2 holds, then x
B (globally) solves problem (8.1) if and only if the
KuhnTucker conditions (8.7) are satisfied for x
.
Proof:
Referring to Proposition 1.23, the necessity of the KuhnTucker
conditions has already been demonstrated. Hence we need only show that in
the convex case the KuhnTucker conditions are also sufficient for optimality.
88
STOCHASTIC PROGRAMMING
m
X
u
i gi (
x) = 0,
i=1
m
X
u
i gi (
x) = 0.
i=1
Then, with I(
x) = {i | gi (
x) = 0}, we have
f (
x) =
u
i gi (
x)
iI(
x)
iI(
x)
u
i [gi (x) gi (
x)]
|{z}
|
{z
}
0 x B, i I(
x)
Observe that
to show the necessity of the KuhnTucker conditions we had to use the
regularity condition RC 0 (or one of the other two, being stronger), but we
did not need any convexity assumption;
to demonstrate that in the convex case the KuhnTucker conditions are
sufficient for optimality we have indeed used the assumed convexity, but we
did not need any regularity condition at all.
Defining the Lagrange function for problem (8.1),
L(x, u) := f (x) +
m
X
ui gi (x)
i=1
BASIC CONCEPTS
89
x L(
x, u
) = 0,
u L(
x, u
) 0,
(8.10)
uT u L(
x, u
) = 0,
u 0.
L(
x, u
) L(x, u) x IRn .
Solution Techniques
90
STOCHASTIC PROGRAMMING
cutting-plane methods;
methods of descent;
penalty methods;
Lagrangian methods
Cutting-plane methods
Assume that for problem (8.1) the functions f and gi , i = 1, , m, are convex
and that theconvexfeasible set
B = {x | gi (x) 0, i = 1, , m}
is bounded. Furthermore, assume that
y int Bwhich for instance would
be true if the Slater condition (8.9) held. Then, instead of the original problem
min f (x),
xB
(8.11)
BASIC CONCEPTS
91
xB
{x | (ak )T x k },
92
Figure 30
STOCHASTIC PROGRAMMING
and hence
cT x
k min cT x, k = 0, 1, 2, ,
xB
such that cT z k cT x
k could be taken after the kth iteration as an upper bound
on the distance of either the feasible (but in general nonoptimal) objective
value cT z k or the optimal (but in general nonfeasible) objective value cT xk to
the feasible optimal value minxB cT x. Observe that in general the sequence
{cT z k } need not be monotonically decreasing, whereas Pk+1 Pk k ensures
that the sequence {cT x
k } is monotonically increasing. Thus we may enforce
a monotonically decreasing error bound
k := cT z lk cT x
k , k = 0, 1, 2, ,
by choosing z lk from the boundary points of B constructed in step 2 up to
iteration k such that
cT z lk = min cT z l .
l{0,,k}
BASIC CONCEPTS
93
y + (1 )
xk B, there is at least one constraint i0 active in z k meaning
that gi0 (z k ) = 0. The convexity of gi0 implies, owing to Proposition 1.21, that
y z k )T gi0 (z k ),
y ) gi0 (z k ) (
y ) = gi0 (
0 > gi0 (
and therefore that ak := gi0 (z k ) 6= 0.
Observing that zk = k y + (1 k )
xk with 0 < k < 1 is equivalent to
x
k z k =
k
(
y z k ),
1 k
1.8.2.2
Descent methods
For the sake of simplicity, we consider the special case of minimizing a convex
function under linear constraints
min f (x)
s.t. Ax = b,
(8.12)
x 0.
Assume that we have a feasible point z B = {x | Ax = b, x 0}. Then
there are two possibilities.
(a) If z is optimal then the KuhnTucker conditions have to hold. For (8.12)
these are
f (z) + AT u w = 0,
z T w = 0,
w 0,
orwith J(z) := {j | zj > 0}equivalently
AT u w = f (z),
wj = 0 for j J(z),
w 0.
94
STOCHASTIC PROGRAMMING
Applying Farkas Lemma 1.19 tells us that this system (and hence the
above KuhnTucker system) is feasible if and only if
[f (z)]T d 0 d {d | Ad = 0, dj 0 for j 6 J(z)};
(b) If the feasible point z is not optimal then the KuhnTucker conditions
cannot hold, and, according to (a), there exists a direction d such that
Ad = 0, dj 0 j : zj = 0 and [f (z)]T d < 0. A direction like
this is called a feasible descent direction at z, which has to satisfy the
following two conditions: 0 > 0 such that z + d B [0, 0 ]
and [f (z)]T d < 0. Hence, having at a feasible point z a feasible descent
direction d (for which, by its definition, d 6= 0 is obvious), it is possible to
move from z in direction d with some positive step length without leaving
B and at the same time at least locally to decrease the objectives value.
From these brief considerations, we may state the following.
Conceptual method of descent directions
Step 1 Determine a feasible solution z (0) , let k := 0.
Step 2 If there is no feasible descent direction at z (k) then stop (z (k) is
optimal).
Otherwise, choose a feasible descent direction d(k) at z (k) and go to
step 3.
Step 3 Solve the so-called line search problem
min{f (z (k) + d(k) ) | (z (k) + d(k) ) B},
BASIC CONCEPTS
95
xB = B 1 b B 1 N xN B .
u = B 1 N v.
= [B f (z (k) )]T , [N B f (z (k) )]T [B f (z (k) )]T B 1 (B, N ),
96
STOCHASTIC PROGRAMMING
we have
rB = 0, rN B = ([N B f (z (k) )]T [B f (z (k) )]T B 1 N )T ,
and hence
[f (z
(k)
)] d = (u , v )
rB
rN B
rjN B
if rjN B 0,
B NB
xN
rj
j
if rjN B > 0,
and
(rB )T xB = 0, (rN B )T xN B = 0,
i.e. v = 0 is equivalent to satisfying the KuhnTucker conditions.
It is known that the reduced gradient method with the above definition
of v may fail to converge to a solution (so-called zigzagging). However,
we can perturb v as follows:
if rjN B 0,
rjN B
NB NB
B
xj rj
if rjN B > 0 and xN
,
vj :=
j
B
0
if rjN B > 0 and xN
<
.
j
Then a proper control of the perturbation > 0 during the procedure can
be shown to enforce convergence.
2
The feasible direction and the reduced gradient methods have been extended
to the case of nonlinear constraints. We omit the presentation of the general
case here for the sake of better readability.
BASIC CONCEPTS
1.8.2.3
97
Penalty methods
The term penalty reflects the following attempt. Replace the original
problem (8.1)
min f (x)
s.t. gi (x) 0, i = 1, , m,
by appropriate free (i.e. unconstrained) optimization problems
(gi (x)),
Frs (x) := f (x) + r
(gi (x)) +
s
iI
(8.13)
iJ
< 0,
1
+,
s
98
STOCHASTIC PROGRAMMING
B2 := {x | gi (x) 0, i J}
iI
1 X
(gi (x(k) )) = 0.
k sk
lim
iJ
1.8.2.4
Lagrangian methods
xIRn
BASIC CONCEPTS
99
To simplify the description, let us first consider the optimization problem with
equality constraints
min f (x)
(8.14)
s.t. gi (x) = 0, i = 1, , m.
Knowing for this problem the proper multiplier vector u or at least a good
approximate u of it, we should find
min [f (x) + uT g(x)],
(8.15)
xIRn
Pm
where uT g(x) =
i=1 ui gi (x). However, at the beginning of any solution
procedure we hardly have any knowledge about the numerical size of the
multipliers in a KuhnTucker point of problem (8.14), and using some guess
for u might easily result in an unsolvable problem (inf x L(x, u) = ).
On the other hand, we have just introduced penalty methods. Using for
problem (8.14) a quadratic loss function for violating the equality constraints
seems to be reasonable. Hence we could think of a penalty method using as
modified objective
minn [f (x) + 21 kg(x)k2 ]
(8.16)
xIR
and driving the parameter towards +, with kg(x)k being the Euclidean
norm of g(x) = (g1 (x), , gm (x))T .
One idea is to combine the two approaches (8.15) and (8.16) such that we are
dealing with the so-called augmented Lagrangian as our modified objective:
min [f (x) + uT g(x) + 12 kg(x)k2 ].
xIRn
(8.17)
100
STOCHASTIC PROGRAMMING
Observe that for u(k) = 0 k we should get back the penalty method with
a quadratic loss function, which, according to Proposition 1.26, is known to
converge in the sense asserted there.
For the method (8.17) in general the following two statements can be proved,
showing
(a) that we may expect a convergence behaviour as we know it already for
penalty methods; and
(b) how we should successively adjust the multiplier vector u(k) to get the
intended convergence to the proper KuhnTucker multipliers.
Proposition 1.27 If f and gi , i = 1, , m, are continuous and x(k) , k =
1, 2, , are global solutions of
min Lk (x, u(k) )
x
m
X
(8.18)
ui gi (x ) = 0, g(x ) = 0.
i=1
(8.19)
BASIC CONCEPTS
101
Now let us come back to our original nonlinear program (8.1) with inequality
constraints and show how we can make use of the above results for the case
of equality constraints. The key to this is the observation that our problem
with inequality constraints
min f (x)
s.t. gi (x) 0, i = 1, , m
is equivalent to the following one with equality constraints:
min f (x)
s.t. gi (x) + zi2 = 0, i = 1, , m.
Now applying the augmented Lagrangian method (8.17) to this equalityconstrained problem requires that for
L (x, z, u) := f (x) +
m
X
2
ui gi (x) + zi2 + 21 gi (x) + zi2
(8.20)
i=1
zIRm
m
n
X
(k)
2 o
ui gi (x) + zi2 + 21 k gi (x) + zi2
= minm f (x) +
zIR
i=1
= f (x) +
m
X
i=1
2
(k)
min ui gi (x) + zi2 + 21 k gi (x) + zi2 .
zi
yi =
hu
i
+ gi (x) .
102
STOCHASTIC PROGRAMMING
implying
(8.22)
h
ui i
gi (x) + yi = max gi (x),
(8.23)
and, with a solution x(k) of this problem, our update formula (8.19) for the
multipliersrecalling that we now have the equality constraints gi (x)+zi2 = 0
instead of gi (x) = 0 as beforebecomes by (8.23)
u(k)
u(k+1) := u(k) + k max g(x(k) ),
k
= max 0, u(k) + k g(x(k) ) ,
(8.24)
1.9
Bibliographical Notes
The observation that some data in real life optimization problems could be
random, i.e. the origin of stochastic programming, dates back to the 1950s.
Without any attempt at completeness, we might mention from the early
contributions to this field Avriel and Williams [3], Beale [5, 6], Bereanu [8],
Dantzig [11], Dantzig and Madansky [13], Tintner [43] and Williams [49].
For more detailed discussions of the situation of the decision maker facing
random parameters in an optimization problem we refer for instance to
Dempster [14], Ermoliev and Wets [16], Frauendorfer [18], Kall [22], Kall and
Prekopa [24], Kolbin [28], Sengupta [42] and Vajda [45].
BASIC CONCEPTS
103
104
STOCHASTIC PROGRAMMING
Exercises
1. Show that from (4.3) on page 24 it follows that with Ai A, i = 1, 2, ,
Ai A and Ai Aj A i, j.
i=1
BASIC CONCEPTS
105
m
X
ui gi (x).
i=1
Show that x
is a global solution of
min f (x)
s.t. gi (x) 0, i = 1, , m.
(See Proposition 1.25 for the definition of a saddle point.)
References
[1] Abadie J. and Carpentier J. (1969) Generalization of the Wolfe reduced
gradient method to the case of nonlinear constraints. In Fletcher R. (ed)
Optimization, pages 3747. Academic Press, London.
[2] Attouch H. and Wets R. J.-B. (1981) Approximation and convergence
in nonlinear optimization. In Mangasarian O. L., Meyer R. M., and
Robinson S. M. (eds) NLP 4, pages 367394. Academic Press, New York.
[3] Avriel M. and Williams A. (1970) The value of information and stochastic
programming. Oper. Res. 18: 947954.
[4] Bazaraa M. S. and Shetty C. M. (1979) Nonlinear ProgrammingTheory
and Algorithms. John Wiley & Sons, New York.
[5] Beale E. M. L. (1955) On minimizing a convex function subject to linear
inequalities. J. R. Stat. Soc. B17: 173184.
[6] Beale E. M. L. (1961) The use of quadratic programming in stochastic
linear programming. Rand Report P-2404, The RAND Corporation.
[7] Benders J. F. (1962) Partitioning procedures for solving mixed-variables
programming problems. Numer. Math. 4: 238252.
106
STOCHASTIC PROGRAMMING
BASIC CONCEPTS
[27]
[28]
[29]
[30]
[31]
[32]
[33]
[34]
[35]
[36]
[37]
[38]
[39]
[40]
[41]
[42]
[43]
107
108
STOCHASTIC PROGRAMMING
Dynamic Systems
2.1
DYNAMIC SYSTEMS
111
Figure 1 Basic set-up for a dynamic program with four states, four stages
and three possible decisions.
as follows. With
t
the stages, t = 1, , T ,
zt
xt
Gt (zt , xt ) the transformation (or transition) of the system from the state zt
and the decision taken at stage t into the state zt+1 at the next
stage, i.e. zt+1 = Gt (zt , xt ),
rt (zt , xt ) the immediate return if at stage t the system is in state zt and the
decision xt is taken,
F
Xt (zt )
112
STOCHASTIC PROGRAMMING
2 if zT = 4,
1 if zT = 3,
rT =
1 if zT = 2,
2 if zT = 1.
To solve max F (r1 , , r4 ), we have to fix the overall objective F as a
function of the immediate returns r1 , r2 , r3 , r4 . To demonstrate possible effects
of properties of F on the solution procedure, we choose two variants.
(a) Let
F (r1 , , r4 ) := r1 + r2 + r3 + r4
and assume that the initial state is z1 = 4. This is illustrated in Figure 2,
which has the same structure as Figure 1. Using the figure, we can check
that an optimal policy (i.e. sequence of decisions), is x1 = x2 = x3 = 0
keeping us in zt = 4 for all t with the optimal value F (r1 , , r4 ) =
1 + 1 + 1 2 = 1.
We may determine this optimal policy iteratively as follows. First, we
determine the decision for each of the states in stage 3 by determining
f3 (z3 ) := max[r3 (z3 , x3 ) + r4 (z4 )]
x3
DYNAMIC SYSTEMS
113
Figure 2 Dynamic program: additive composition. The solid lines show the
result of the backward recursion.
114
STOCHASTIC PROGRAMMING
(zt+1 )]
ft (zt ) := max[rt (zt , xt )ft+1
xt
(1.1)
DYNAMIC SYSTEMS
115
x2 X2 ,,xT XT
116
STOCHASTIC PROGRAMMING
Proof
{xt Xt , t1}
1 (r1 (z1 , x1 ),
max
{xt Xt , t2}
for all x1 . Therefore this also holds when the right-hand side of this inequality
is maximized with respect to x1 .
On the other hand, it is also obvious that
max
{xt Xt , t2}
max
{xt Xt ,t2}
DYNAMIC SYSTEMS
2.2
117
Dynamic Programming
The purpose of this section is to look at certain aspects of the field of dynamic
programming. The example we looked at in the previous section is an example
of a dynamic programming problem. It will not represent a fair description
of the field as a whole, but we shall concentrate on aspects that are useful in
our context. This section will not consider randomness. That will be discussed
later.
We shall be interested in dynamic programming as a means of solving
problems that evolve over time. Typical examples are production planning
under varying demand, capacity expansion to meet an increasing demand
and investment planning in forestry. Dynamic programming can also be used
to solve problems that are not sequential in nature. Such problems will not
be treated in this text.
Important concepts in dynamic programming are the time horizon, state
variables, decision variables, return functions, accumulated return functions,
optimal accumulated returns and transition functions. The time horizon refers
to the number of stages (time periods) in the problem. State variables describe
the state of the system, for example the present production capacity, the
present age and species distribution in a forest or the amount of money
one has in different accounts in a bank. Decision variables are the variables
under ones control. They can represent decisions to build new plants, to cut
a certain amount of timber, or to move money from one bank account to
another. The transition function shows how the state variables change as a
function of decisions. That is, the transition function dictates the state that
will result from the combination of the present state and the present decisions.
For example, the transition function may show how the forest changes over
the next period as a result of its present state and of cutting decisions, how the
amount of money in the bank increases, or how the production capacity will
change as a result of its present size, investments (and detoriation). A return
function shows the immediate returns (costs or profits) as a result of making
a specific decision in a specific state. Accumulated return functions show the
accumulated effect, from now until the end of the time horizon, associated with
a specific decision in a specific state. Finally, optimal accumulated returns show
the value of making the optimal decision based on an accumulated return
function, or in other words, the best return that can be achieved from the
present state until the end of the time horizon.
Example 2.2 Consider the following simple investment problem, where it is
clear that the Bellman principle holds. We have some money S0 in a bank
account, called account B. We shall need the money two years from now, and
today is the first of January. If we leave the money in the account we will face
an interest rate of 7% in the first year and 5% in the second. You also have
118
STOCHASTIC PROGRAMMING
10%
fee 20
fee 10
S0
fee 10
B
B
7%
Figure 4
7%
fee 20
S3
5%
the option of moving the money to account A. You will there face an interest
rate of 10% the first year and 7% the second year. However, there is a fixed
charge of 20 per year and a charge of 10 each time we withdraw money from
account A. The fixed charge is deducted from the account at the end of a year,
whereas the charges on withdrawals are deducted immediately. The question
is: Should we move our money to account A for the first year, the second year
or both years? In any case, money left in account A at the end of the second
year will be transferred to account B. The goal is to solve the problem for all
initial S0 > 1000. Figure 4 illustrates the example.
Note that all investments will result in a case where the wealth increases,
and that it will never be profitable to split the money between the accounts
(why?).
Let us first define the two-dimensional state variables zt = (zt1 , zt2 ). The
first state variable, zt1 , refers to the account name (A or B); the second
state variable, zt2 , refers to the amount of money St in that account. So
zt = (B, St ) refers to a state where there is an amount St in account B
in stage t. Decisions are where to put the money for the next time period. If
xt is our decision variable then xt {A, B}. The transition function will be
denoted by Gt (zt , xt ), and is defined via interest rates and charges. It shows
what will happen to the money over one year, based on where the money is
now, how much there is, and where it is put next. Since the state space has
two elements, the function Gt is two-valued. For example
A
A
2
2
1
1
, A = St 1.07 20.
, A = A,
zt+1 = Gt
zt+1 = Gt
St
St
Accumulated return functions will be denoted by ft (zt1 , zt2 , xt ). They
describe how the amount zt2 in account zt1 will grow, up to the end of the
time horizon, if the money is put into account xt in the next period, and
optimal decisions are made thereafter. So if f1 (A, S1 , B) = S, we know that
in stage 1 (i.e. at the end of period 1), if we have S1 in account A and then
DYNAMIC SYSTEMS
119
max
x1 {A,B}
f1 (A, S1 , x1 ).
The calculations for our example are as follows. Note that we have three
stages, which we shall denote Stage 0, Stage 1 and Stage 2. Stage 2 represents
the point in time (after two years) when all funds must be transferred to
account B. Stage 1 is one year from now, where we, if we so wish, may move
the money from one account to another. Stage 0 is now, where we must decide
if we wish to keep the money in account B or move it to account A.
Stage 2 At Stage 2, all we can do is to transfer whatever money we have in
account A to account B:
f2 (A, S2 ) = S2 10,
f2 (B, S2 ) = S2 ,
indicating that a cost of 10 is incurred if the money is in account A and needs
to be transferred to account B.
Stage 1 Let us first consider account A, and assume that the account contains
S1 . We can keep the money in account A, making S2 = S1 1.07 20 (this is
the transition function), or move it to B, making S2 = (S1 10) 1.05. This
generates the following two evaluations of the accumulated return function:
f1 (A, S1 , A) = f2 (A, S1 1.07 20) = S1 1.07 30,
f1 (A, S1 , B) = f2 (B, (S1 10) 1.05) = S1 1.05 10.5.
By comparing these two, we find that, as long as S1 975 (which is always
the case since we have assumed that S0 > 1000), account A is best, making
f1 (A, S1 ) = S1 1.07 30.
Next, consider account B. If we transfer the amount S1 to account A, we
get S2 = S1 1.07 20. If it stays in B, we get S2 = S1 1.05. This gives us
f1 (B, S1 , A) = f2 (A, S1 1.07 20) = S1 1.07 30,
f1 (B, S1 , B) = f2 (B, S1 1.05) = S1 1.05.
By comparing these two, we find that
S1 1.07 30 if S1 1500,
f1 (B, S1 ) =
S1 1.05
if S1 1500.
120
STOCHASTIC PROGRAMMING
Stage 0 Since we start out with all our money in account B, we only need
to check that account. Initially we have S0 . If we transfer to A, we get
S1 = S0 1.1 20, and if we keep it in B, S1 = S0 1.07. The accumulated
returns are
f0 (B, S0 , A) = f1 (A, S1 ) = f1 (A, S0 1.1 20)
= (S0 1.1 20) 1.07 30 = 1.177 S0 51.4,
f0 (B, S0 , B) = f1 (B, S1 ) = f1 (B, S0 1.07)
S0 1.1449 30 if S0 1402,
=
S0 1.1235
if S0 1402.
Comparing the two options, we see that account A is always best, yielding
f0 (B, S0 ) = 1.177 S0 51.4.
So we should move our money to account A and keep it there until the end
of the second period. Then we move it to B as required. We shall be left with
a total interest of 17.7% and fixed charges of 51.4 (including lost interest on
charges).
2
As we can see, the main idea behind dynamic programming is to take one
stage at a time, starting with the last stage. For each stage, find the optimal
decision for all possible states, thereby calculating the optimal accumulated
return from then until the end of the time horizon for all possible states. Then
move one step towards the present, and calculate the returns from that stage
until the end of the time horizon by adding together the immediate returns,
and the returns for all later periods based on the calculations made at the
previous stage. In the example we found that f1 (A, S1 ) = S1 1.07 30.
This shows us that if we end up in stage 1 with S1 in account A, we shall (if
we behave optimally) end up with S1 1.07 30 in account B at the end of
the time horizon. However, f1 does not tell us what to do, since that is not
needed to calculate optimal decisions at stage 0.
Formally speaking, we are trying to solve the following problem, where
x = (x0 , . . . , xT )T :
maxx F (r0 (z0 , x0 ), . . . , rT (zT , xT ), Q(zT +1 ))
s.t. zt+1 = Gt (zt , xt )
for t = 0, . . . , T,
At (zt ) xt Bt (zt ) for t = 0, . . . , T,
where F satisfies the requirements of Proposition 2.2. This is to be solved
for one or more values of the initial state z0 . In this set-up, rt is the return
DYNAMIC SYSTEMS
121
function for all but the last stage, Q the return function for the last stage, Gt
the transition function, T the time horizon, zt the (possibly multi-dimensional)
state variable in stage t and xt the (possibly multi-dimensional) decision
variable in stage t. The accumulated return function ft (zt , xt ) and optimal
accumulated returns ft (zt ) are not part of the problem formulation, but
rather part of the solution procedure. The solution procedure, justified by
the Bellman principle, runs as follows.
Find f0 (z0 )
by solving recursively
ft (zt ) =
=
with
max
ft (zt , xt )
max
In each case the problem must be solved for all possible values of the state
variable zt , which might be multi-dimensional.
Problems that are not dynamic programming problems (unless rewritten
with a large expansion of the state space) would be problems where, for
example,
zt+1 = Gt (z0 , . . . , zt , x0 , . . . , xt ),
or where the objective function depends in an arbitrary way on the whole
history up til stage t, represented by
rt (z0 , . . . , zt , x0 , . . . , xt ).
Such problems may more easily be solved using other approaches, such as
decision trees, where these complicated functions cause little concern.
2.3
122
STOCHASTIC PROGRAMMING
Stage 0
B,S 0
B
Stage 2
A,S2
B
B, S3
Figure 5
B,S1
A,S1
Stage 1
B, S2
A,S2
B, S2
B, S3
B, S3
B, S3
this method to be useful, since there is one leaf in the tree for each possible
sequence of decisions.
The tree indicates that at stage 0 we have S0 in account B. We can then
decide to put them into A (go left) or keep them in B (go right). Then at stage
1 we have the same possible decisions. At stage 2 we have to put them into
B, getting S3 , the final amount of money. As before we could have skipped
the last step. To be able to solve this problem, we shall first have to follow
each path in the tree from the root to the bottom (the leaves) to find S3
in all cases. In this way, we enumerate all possible sequences of decisions
that we can possibly make. (Remember that this is exactly what we avoid in
dynamic programming). The optimal sequence, must, of course, be one of these
sequences. Let (AAB) refer to the path in the tree with the corresponding
indices on the arcs. We then get
(ABB)
(AAB)
(BAB)
(BBB)
We have now achieved numbers in all leaves of the tree (for some reason
decision trees always grow with the root up). We are now going to move back
DYNAMIC SYSTEMS
123
towards the root, using a process called folding back. This implies moving one
step up the tree at a time, finding for each node in the tree the best decision
for that node.
This first step is not really interesting in this case (since we must move the
money to account B), but, even so, let us go through it. We find that the best
we can achieve after two decisions is as follows.
(AB)
: S3 = S0 1.155 31.5,
(AA)
(BA)
: S3 = S0 1.177 51.4,
: S3 = S0 1.1449 30,
(BB)
: S3 = S0 1.1235.
124
STOCHASTIC PROGRAMMING
Period
1
2
Outcome 1
8
5
8% or 12%
fee 20
fee 10
S0
2.4
5% or 9%
fee 20
fee 10
B
B
7%
Figure 6
Outcome 2
12
9
S3
5%
We shall now see how decision trees can be used to solve certain classes
of stochastic problems. We shall initiate this with a look at our standard
investment problem in Example 2.2. In addition, let us now assume that the
interest on account B is unchanged, but that the interest rate on account A
is random, with the previously given rates as expected values. Charges on
account A are unchanged. The distribution for the interest rate is given in
Table 1. We assume that the interest rates in the two periods are described
by independent random variables.
Based on this information, we can give an update of Figure 4, where we
show the deterministic and stochastic parameters of the problem. The update
is shown in Figure 6.
Consider the decision tree in Figure 7. As in the deterministic case, square
nodes are decision nodes, from which we have to choose between account A
and B. Circular nodes are called chance nodes, and represent points at which
something happens, in this case that the interest rates become known.
Start at the top. In stage 0, we have to decide whether to put the money
DYNAMIC SYSTEMS
125
Stage 0
B
A
12%
8%
7%
Stage 1
5%
9%
5%
5%
9%
9%
5%
5%
5%
Stage 2
Figure 7
126
STOCHASTIC PROGRAMMING
Stage 0
1126
A
1126
Stage 1
1124
12%
8%
1104
5%
Stage 2 1083
1125
1103
9%
1124
1147
1104
7%
5%
1103
1145
1147
5%
9%
B
1124
1115
9%
5%
5%
1125
1169
1145
1094
1136
5%
1124
Let us see how this works. First we do as we have done before: we follow each
path down the tree to see what amount we end up with in account B. We
have assumed S0 = 1000. That is shown in the leaves of the tree in Figure 8.
We then fold back. Since the next node is a chance node, we take the
expected value of the two square nodes below. Then for stage 1 we check
which of the two possible decisions has the largest expectation. In the far left
of Figure 8 it is to put the money into account A. We therefore cross out
the other alternative. This process is repeated until we reach the top level.
In stage 0 we see that it is optimal to use account A in the first period and
regardless of the interest rate in the first period, we shall also use account
A in the second period. In general, the second-stage decision depends on the
outcome in the first stage, as we shall see in a moment.
You might have observed that the solution derived here is exactly the same
as we found in the deterministic case. This is caused by two facts. First, the
interest rate in the deterministic case equals the expected interest rate in the
stochastic case, and, secondly, the objective function is linear. In other words,
if is a random variable and a and b are constants then
E(a + b) = aE + b.
For the stochastic case we calculated the left-hand side of this expression, and
for the deterministic case the right-hand side.
In many cases it is natural to maximize expected profits, but not always.
One common situation for decision problems under uncertainty is that
the decision is repeated many times, often, in principle, infinitely many.
DYNAMIC SYSTEMS
127
Utility
u(w )
0
w
0
Figure 9
aversion.
Wealth
Investments in shares and bonds, for example, are usually of this kind. The
situation is characterized by long time series of data, and by many minor
decisions. Should we, or should we not, maximize expected profits in such a
case?
Economics provide us with a tool to answer that question, called a utility
function. Although it is not going to be a major point in this book, we should
like to give a brief look into the area of utility functions. It is certainly an
area very relevant to decision making under uncertainty. If you find the topic
interesting, consult the references listed at the end of this chapter. The area
is full of pitfalls and controversies, something you will probably not discover
from our little glimpse into the field. More than anything, we simply want to
give a small taste, and, perhaps, something to think about.
We may think of a utility function as a function that measures our happiness
(utility) from a certain wealth (let us stick to money). It does not measure
utility in any fixed unit, but is only used to compare situations. So we can say
that one situation is preferred to another, but not that one situation is twice
as good as another. An example of a utility function is found in Figure 9.
Note that the utility function is concave. Let us see what that means.
Assume that our wealth is w0 , and we are offered a game. With 50%
probability we shall win w; with 50% probability we shall lose the same
amount. It costs nothing to take part. We shall therefore, after the game, either
have a wealth of w0 + w or a wealth of w0 w. If the function in Figure 9 is
our utility function, and we calculate the utility of these two possible future
128
STOCHASTIC PROGRAMMING
DYNAMIC SYSTEMS
129
Stage 0
4.820
A
4.807
Stage 1
4.820
12%
8%
4.634
Stage 2 4.419
4.979
9%
4.828
5%
4.634
5%
4.820
4.634
4.624
5%
4.979
7%
4.977
9%
4.683
4.820
9%
5%
5%
4.828
5.130
4.977
4.543
4.913
5%
4.820
and that we wish to maximize the expected utility of the final wealth s. In
the deterministic case we found that it would never be profitable to split
the money between the two accounts. The argument is the same when we
simply maximized the expected value of S3 as outlined above. However, when
maximizing expected utility, that might no longer be the case. On the other
hand, the whole set-up used in this chapter assumes implicitly that we do
not split the funding. Hence in what follows we shall assume that all the
money must be in one and only one account. The idea in the decision tree is
to determine which decisions to make, not how to combine them. Figure 10
shows how we fold back with expected utilities. The numbers in the leaves
represent the utility of the numbers in the leaves of Figure 8. For example
u(1083) = ln(1083 1000) = ln 83 = 4.419.
We observe that, with this utility function, it is optimal to use account B.
The reason is that we fear the possibility of getting only 8% in the first period
combined with the charges. The result is that we choose to use B, getting
the certain amount S3 = 1124. Note that if we had used account A in the
first period (which is not optimal), the optimal second-stage decision would
depend on the actual outcome of the interest on account A in the first period.
With 8%, we pick B in the second period; with 12%, we pick A.
130
2.5
STOCHASTIC PROGRAMMING
DYNAMIC SYSTEMS
131
if S1 < 1077,
f1 (A, S1 ) =
0.5 ln[(S1 1.05 1030)(S1 1.09 1030)]
if S1 > 1077.
if S1 < 1538
f1 (B, S1 ) =
0.5 ln[(S1 1.05 1030)(S1 1.09 1030)]
if S1 > 1538.
Stage 0 We here have to consider only the case when the amount S0 > 1000
sits in account B. The basis for these calculations will be the
following two expressions. The first calculates the expected result
of using account A, the second the certain result of using account B.
f0 (B, S0 , A) = 0.5[f1 (A, S0 1.08 20) + f1 (A, S0 1.12 20)],
f0 (B, S0 , B) = f1 (B, S0 1.07).
132
STOCHASTIC PROGRAMMING
which means that we use account B for small amounts and account
A for large amounts within the given interval.
Case 3 Here we have S0 > 1437. In this case
f0 (B, S0 ) =
1
4
DYNAMIC SYSTEMS
133
S 0 >1022
S0
S1<1077
S 0 <1022
Stage 0
S1 >1077
S1 >1538
B
S 1<1538
Stage 1
S3
Stage 2
If we put these results into Figure 4, we obtain Figure 11. From the latter,
we can easily construct a solution similar to the one in Figure 10 for any
S0 > 1000. Verify that we do indeed get the solution shown in Figure 10 if
S0 = 1000.
But we see more than that from Figure 11. We see that if we choose account
B in the first period, we shall always do the same in the second period. There
is no way we can start out with S0 < 1022 and get S1 > 1538.
Formally, what we are doing is as follows. We use the vocabulary of
Section 2.2. Let the random vector for stage t be given by ξt, and let the
return and transition functions become rt(zt, xt, ξt) and zt+1 = Gt(zt, xt, ξt).
Given this, the procedure becomes

find f0(z0)

by recursively calculating

ft(zt) = min_{xt} ft(zt, xt) = min_{xt} E_{ξt}{φt(rt(zt, xt, ξt), ft+1(zt+1))}, t = T, ..., 0,

with

zt+1 = Gt(zt, xt, ξt) for t = 0, ..., T,
fT+1(zT+1) = Q(zT+1),

where the functions φt satisfy the requirements of Proposition 2.2. In each stage
the problem must be solved for all possible values of the state zt. It is possible
to replace expectations (represented by E above) by other operators with
respect to ξt, such as max or min. In such a case, of course, probability
distributions are uninteresting: only the support matters.
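To make the recursion concrete, here is a minimal Python sketch under stated
assumptions: a small finite state space, two decisions, a two-point distribution
for ξt, and φt taken to be plain addition (one admissible choice of φt). All
numbers are illustrative, not data from the example above.

# Backward recursion f_t(z) = min_x E[ r_t(z,x,xi) + f_{t+1}(G_t(z,x,xi)) ],
# with f_{T+1}(z) = Q(z).  All ingredients below are illustrative assumptions.
T = 2
states = range(6)                       # z_t in {0,...,5}
decisions = [0, 1]                      # possible x_t
xi_dist = [(0.5, -1), (0.5, 1)]         # (probability, realization) pairs

def r(t, z, x, xi):                     # return (cost) function
    return (z - 2 * x) ** 2 + xi * x

def G(t, z, x, xi):                     # transition, clipped to the state space
    return max(0, min(5, z + x + xi))

def Q(z):                               # value of ending in state z
    return 0.0

f = {(T + 1, z): Q(z) for z in states}
policy = {}
for t in range(T, -1, -1):              # t = T, ..., 0
    for z in states:
        best = None
        for x in decisions:
            val = sum(p * (r(t, z, x, xi) + f[(t + 1, G(t, z, x, xi))])
                      for p, xi in xi_dist)
            if best is None or val < best:
                best, policy[(t, z)] = val, x
        f[(t, z)] = best

print(f[(0, 0)], policy[(0, 0)])   # optimal value and first decision from z0 = 0

Note how the work grows with the number of states: the recursion must visit
every (t, zt) pair, which is exactly the dimensionality problem discussed later.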
2.6 Scenario Aggregation
So far we have looked at two different methods for formulating and solving
multistage stochastic problems. The first, stochastic decision trees, requires a
tree that branches off for each possible decision xt and each possible realization
of ξt. Therefore these must both have finitely many possible values. The state
zt is not part of the tree, and can therefore safely be continuous. A stochastic
decision tree easily grows out of hand.
The second approach was stochastic dynamic programming. Here we must
make a decision for each possible state zt in each stage t. Therefore, it is clearly
an advantage if there are finitely many possible states. However, the theory is
also developed for a continuous state space. Furthermore, a continuous set of
decisions xt is acceptable, and so is a continuous distribution of ξt, provided
we are able to compute the expectation with respect to ξt.
The method we shall look at in this section is different from those mentioned
above with respect to where the complications occur. We shall now operate
on an event tree (see Figure 12 for an example). This is a tree that branches
off for each possible value of the random variable ξt in each stage t. Therefore,
compared with the stochastic decision tree approach, the new method has
similar requirements in terms of limitations on the number of possible values
of ξt: both need finite discrete distributions. In terms of xt, the decision tree
must have finitely many values, whereas the new method prefers continuous
variables. Neither of them has any special requirements on zt.

The new method we are about to outline is called scenario aggregation. We
shall see that stochastic dynamic programming is more flexible than scenario
aggregation in terms of distributions of ξt, similar with respect to xt, but much
more restrictive with respect to the state variable zt, in the sense that the state
space is hardly of any concern in scenario aggregation.

If we have T time periods and ξt is a vector describing what happens in
time period t (i.e. a realization of ξ̃t), then we call

s = (ξ0^s, ξ1^s, ..., ξT^s)
a scenario. It represents one possible future. So assume we have a set of
scenarios S describing all (or at least the most interesting) possible futures.
What do we do? Assume our world can be described by state variables zt
and decision variables xt and that the cost (i.e. the return function) in time
period t is given by rt(zt, xt, ξt). Furthermore, as before, the state variables
can be calculated from

zt+1 = Gt(zt, xt, ξt),
with z0 given. Let α be a discount factor. What is often done in this case is
to solve for each s ∈ S the following problem:

min Σ_{t=0}^T α^t rt(zt, xt, ξt^s) + α^{T+1} Q(zT+1)
s.t. zt+1 = Gt(zt, xt, ξt^s) for t = 0, ..., T, with z0 given,     (6.1)
where Q(z) represents the value of ending the problem in state z, yielding
an optimal solution x^s = (x0^s, x1^s, ..., xT^s). Now what? We have a number of
different solutions, one for each s ∈ S. Shall we take the average and calculate
for each t

x̄t = Σ_{s∈S} p^s xt^s,

where p^s is the probability of scenario s?
Figure 12 An event tree: the first random variable branches between today
and tomorrow, the second between tomorrow and the future.

Averages of this kind at least make sense, because they are all possible
decisions, or what are called implementable decisions.
For each time period t let {s}t be the set of all scenarios having ξ0^s, ..., ξ_{t−1}^s
in common with scenario s. In Figure 12, {s}0 = S, whereas each {s}2 contains
only one scenario. There are three sets {s}1. Let p({s}t) be the sum of the
probabilities of all scenarios in {s}t. Hence, after solving (6.1) for all s, we
calculate for all {s}t

x̄({s}t) = Σ_{s'∈{s}t} p^{s'} xt^{s'} / p({s}t).
The overall problem, with implementability required explicitly, can then be
written as

min Σ_{s∈S} p(s) [ Σ_{t=0}^T α^t rt(zt^s, xt^s, ξt^s) + α^{T+1} Q(zT+1^s) ]

subject to

zt+1^s = Gt(zt^s, xt^s, ξt^s) for t = 0, ..., T, with z0^s = z0 given,
At(zt^s) ≤ xt^s ≤ Bt(zt^s) for t = 0, ..., T,                          (6.2)
xt^s = Σ_{s'∈{s}t} p^{s'} xt^{s'} / p({s}t) for t = 0, ..., T and all s.
If we relax the implementability constraints with multipliers wt^s, the objective
becomes

min Σ_{s∈S} p(s) { Σ_{t=0}^T [ α^t rt(zt^s, xt^s, ξt^s) + wt^s (xt^s − x̄({s}t)) ] + α^{T+1} Q(zT+1^s) }     (6.3)

with

x̄({s}t) = Σ_{s'∈{s}t} p^{s'} xt^{s'} / p({s}t).
But since, for a fixed w, the terms wt^s x̄({s}t) are fixed, we can as well drop
them. If we then add an augmented Lagrangian term, we are left with

Σ_{s∈S} p(s) { Σ_{t=0}^T [ α^t rt(zt^s, xt^s, ξt^s) + wt^s xt^s + ½ρ (xt^s − x̄({s}t))² ] + α^{T+1} Q(zT+1^s) }.
procedure scenario(s, x̄, x^s);
begin
  Solve the problem

    min Σ_{t=0}^T [ α^t rt(zt, xt, ξt^s) + wt^s xt + ½ρ (xt − x̄t)² ] + α^{T+1} Q(zT+1)

  subject to the constraints of (6.1), yielding the solution x^s;
end;

Figure 13 The procedure for solving individual scenario problems.
Our problem is now totally separable in the scenarios. That is what we need
to define the scenario aggregation method. See the algorithms in Figures 13
and 14 for details. A few comments are in order. First, to find an initial
x̄({s}t), we can solve (6.1) using expected values for all random variables.
Finding the correct value of ρ, and knowing how to update it, is very hard.
We discussed that to some extent in Chapter 1: see in particular (8.17). This is
a general problem for augmented Lagrangian methods, and will not be discussed
here. Also, we shall not go into the discussion of stopping criteria, since the
details are beyond the scope of the book. Roughly speaking, though, the goal
is to have the scenario problems produce implementable solutions, so that x^s
equals x̄({s}t).
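A minimal Python sketch of the loop in Figure 14 is given below, for a toy
two-stage problem min E(x − ξ)² with a single stage-0 decision, so that there
is one information set and x̄ is simply the probability-weighted mean. The
scenario problems are solved in closed form here; in general they would be
optimization problems as in Figure 13. All data are made up.

import numpy as np

scenarios = np.array([1.0, 2.0, 4.0])   # xi^s
p = np.array([0.25, 0.5, 0.25])         # scenario probabilities
rho = 1.0
w = np.zeros(len(scenarios))            # multipliers w^s, initially 0

x_bar = p @ scenarios                   # initial implementable solution
for it in range(50):
    # Scenario problems: min_x (x - xi)^2 + w*x + (rho/2)(x - x_bar)^2,
    # quadratic in x, hence solvable in closed form.
    x_s = (2 * scenarios - w + rho * x_bar) / (2 + rho)
    x_bar = p @ x_s                     # new implementable solution
    w = w + rho * (x_s - x_bar)         # multiplier update, as in Figure 14
    if np.max(np.abs(x_s - x_bar)) < 1e-8:
        break

print(x_bar)   # converges to E[xi] = 2.25, the minimizer of E[(x - xi)^2]

The "fight" described in Section 2.6.1 below is visible here as well: early
iterates x_s sit near the scenario optima ξ^s, and only as w adjusts do they
agree with x̄.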
Example 2.3 This small example concerns a very simple fisheries
management model. For each time period we have one state variable, one
decision variable, and one random variable. Let zt be the state variable,
representing the biomass of a fish stock in time period t, and assume that
z0 is known. Furthermore, let xt be a decision variable, describing the portion
of the fish stock caught in a given year. The implicit assumption made here
is that it requires a fixed effort (measured, for example, in the number of
participating vessels) to catch a fixed portion of the stock. This seems to be
a fairly accurate description of demersal fisheries, such as the cod fisheries.
The catch in a given year is hence zt xt.
During a year, fish grow, some die, and there is a certain recruitment. A
common model for the total effect of these factors is the so-called Schaefer
model, where the total change in the stock, due to the natural effects listed
above, is given by

s zt (1 − zt / K),

where s is a growth ratio and K is the carrying capacity of the environment.
Note that if zt = K there is no net change in the stock size. Also note that
if zt > K, then there is a negative net effect, decreasing the size of the stock,
and if zt < K, then there is a positive net effect. Hence zt = K is a stable
situation (as zt = 0 is), and the fish stock will, according to the model, stabilize
at z = K if no fishing takes place.

procedure scen-agg;
begin
  for all s and t do wt^s := 0;
  Find initial x̄({s}t);
  Initiate ρ > 0;
  repeat
    for all s ∈ S do scenario(s, x̄({s}t), x^s);
    for all x̄({s}t) do
      x̄({s}t) := Σ_{s'∈{s}t} p^{s'} xt^{s'} / p({s}t);
    Update ρ if needed;
    for all s and t do
      wt^s := wt^s + ρ [xt^s − x̄({s}t)];
  until result good enough;
end;

Figure 14 The scenario aggregation algorithm.

If fish are caught, the catch has to be subtracted from the existing stock,
giving us the following transition function:

zt+1 = zt − xt zt + s zt (1 − zt / K).

This transition function is clearly nonlinear, with both a zt xt term and a zt²
term. If the goal is to catch as much as possible, we might choose to maximize

Σ_{t=0}^∞ α^t zt xt,
A natural assumption about what happens after the horizon is that the stock
is kept constant, with the catch in each period equal to the natural growth.
But since this leaves zt = zT+1 for all t ≥ T + 1, and therefore all xt for
t ≥ T + 1 equal, we can let

Q(zT+1) = Σ_{t=T+1}^∞ α^{t−T−1} xt zt = s zT+1 (1 − zT+1 / K) / (1 − α).
With these assumptions on the horizon, the existence of Q(zT+1) and a finite
discretization of the random variables (the growth ratio is now random,
represented by ξt^s), we arrive at the following optimization problem (the
objective function amounts to the expected catch, discounted over the horizon
of the problem; of course, it is easy to bring this into monetary terms):

max Σ_{s∈S} p(s) [ Σ_{t=0}^T α^t zt^s xt^s + α^{T+1} Q(zT+1^s) ]
s.t. zt+1^s = zt^s [1 − xt^s + ξt^s (1 − zt^s / K)], with z0^s = z0 given,
     0 ≤ xt^s ≤ 1,
     xt^s = Σ_{s'∈{s}t} p^{s'} xt^{s'} / p({s}t).
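As an illustration of the model, the following Python sketch simulates the
stochastic Schaefer dynamics and evaluates the expected discounted catch of a
constant harvest fraction x; this is only a policy evaluation, not the full
scenario-aggregation solution. K, z0, α, the growth distribution and the use of
the mean growth rate in the end term are all illustrative assumptions.

import numpy as np

K, z0, alpha, T = 100.0, 50.0, 0.95, 9
rng = np.random.default_rng(1)
growth = rng.uniform(0.3, 0.7, size=(200, T + 1))   # xi_t^s for 200 scenarios

def expected_catch(x):
    total = 0.0
    for s in range(growth.shape[0]):
        z, val = z0, 0.0
        for t in range(T + 1):
            val += alpha ** t * z * x                     # discounted catch z_t * x_t
            z = z * (1 - x + growth[s, t] * (1 - z / K))  # transition function
        # end effect: sustainable yield forever after, discounted;
        # 0.5 is the mean growth rate (an assumption standing in for s)
        q_end = 0.5 * z * (1 - z / K) / (1 - alpha)
        val += alpha ** (T + 1) * q_end
        total += val
    return total / growth.shape[0]

for x in (0.1, 0.25, 0.5):
    print(x, round(expected_catch(x), 1))

Sweeping x shows the usual trade-off: harvesting too hard depletes the stock
and the end-of-horizon value, while harvesting too little wastes growth.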
2.6.1 Approximate Scenario Solutions
Consider the algorithm just presented. If the problem being solved is genuinely
a stochastic problem (in the sense that the optimal decisions change compared
with the optimal decisions in the deterministic, or expected value, setting),
we should expect the scenario solutions x^s to be very different initially, before
the dual variables w^s obtain their correct values. Therefore, particularly in early
iterations, it seems a waste of energy to solve scenario problems to optimality.
What will typically happen is that we see a sort of fight between the scenario
solutions x^s and the implementable solution x̄({s}t). The scenario solutions
try to pull away from the implementable solutions, and only when the penalty
(in terms of wt^s) becomes properly adjusted will the scenario solutions agree
with the implementable solutions. In fact, the convergence criterion, vaguely
stated, is exactly that the scenario solutions and the implementable solutions
agree.
From this observation, it seems reasonable to solve scenario problems only
approximately, but precisely enough to capture the direction in which the
scenario problem moves relative to the implementable solution. Of course, as
the iterations progress, and the dual variables wt^s adjust to their correct values,
the scenario solutions and the implementable solutions agree more and more.
In the end, if things are properly organized, the overall set-up converges. It
must be noted that the convergence proof for the scenario aggregation method
does indeed allow for approximate scenario solutions. From an algorithmic
point of view, this would mean that we replaced the solution procedure in
Figure 13 by one that found only an approximate solution.
It has been observed that by solving scenario problems only very
approximately, instead of solving them to optimality, one obtains a method
that converges much faster, also in terms of the number of outer iterations. It
simply is not wise to solve scenario problems to optimality. Not only can one
solve scenario problems approximately, one should solve them approximately.
2.7
Financial Models
Optimization models involving uncertainty have been used for a long time.
One of the best known is the mean-variance model of Markowitz, for which he
later received the Nobel prize in economics. In this section we shall first
discuss the main principles behind the Markowitz model. We shall then discuss
some of the weaknesses of the model, mostly in light of the subjects of this
book.

2.7.1 The Markowitz model
The purpose of the Markowitz model is to help investors distribute their funds
in a way that does not represent a waste of money. It is quite clear that
when you invest, there is a tradeoff between the expected payoff from your
investment, and the risk associated with it. Normally, the higher the expected
payoff, the higher the risk. However, for a given payoff you would normally
want as little risk as possible, and for a given risk level, you would want the
expected payoff to be as large as possible. If you, for example, have a higher
risk than necessary for a given expected payoff, you are wasting money, and
this is what the Markowitz model is constructed to help you avoid.
But what is risk? It clearly has something to do with the spread of the
possible payoffs. A portfolio (collection of investments) is riskier the higher the
spread, all other aspects equal. In the Markowitz model, the risk is measured
by the variance of the (random) payoffs from the investment. The model will
not tell us in what way we should combine expected payoffs with variance,
only make sure that we do not waste money. How to actually pick a portfolio
is left to other theories, such as for example utility theory, as discussed briefly
in Section 2.4.
Financial instruments such as bonds, stocks, options and bank deposits all
have random payoffs, although the uncertainty varies a lot.
Furthermore, the instruments are not statistically independent, but rather
strongly correlated. It is obvious that if the value of a 3-year bond increases,
so will normally the value of, say, a 5-year bond. The correlation is almost, but
not quite, perfect. In the same way, stocks from companies in similar sectors
of the economy often move together. On the other hand, if energy prices rise
internationally, the value of an oil company may increase, whereas the value of
an aluminum producer may decrease. If the interest rates increase, bonds will
normally decrease in value. In other words, we must in this setting operate
with dependent random variables.
Assume we have n possible investment instruments. Let xi be the proportion
of our funds invested in instrument i. Hence Σi xi = 1. Let the payoff of
instrument i be ξi (with ξ = (ξ1, ..., ξn)), and let V be the variance-covariance
matrix for the investment, i.e.

Vij = E{(ξi − E ξi)(ξj − E ξj)}.

The variance of a portfolio is now x^T V x, and the mean payoff is x^T E ξ. We
now solve the following problem (letting e be a vector of 1s):
Figure 15 The mean-variance plane, with the efficient frontier separating the
infeasible region from the inefficient region.
min x^T V x
s.t. x^T E ξ = v,
     x^T e = 1,     (7.4)
     x ≥ 0.
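A small Python sketch of (7.4) using scipy is given below; sweeping the
required mean payoff v traces out the efficient frontier of Figure 15. The
payoff means and the covariance matrix are made-up illustrative data.

import numpy as np
from scipy.optimize import minimize

mu = np.array([1.05, 1.10, 1.08])           # E xi (illustrative)
V = np.array([[0.01, 0.002, 0.001],
              [0.002, 0.04, 0.003],
              [0.001, 0.003, 0.02]])        # variance-covariance matrix
v = 1.07                                     # required mean payoff

res = minimize(
    lambda x: x @ V @ x,                     # portfolio variance x^T V x
    x0=np.ones(3) / 3,
    constraints=[{'type': 'eq', 'fun': lambda x: x @ mu - v},   # x^T E xi = v
                 {'type': 'eq', 'fun': lambda x: x.sum() - 1}], # x^T e = 1
    bounds=[(0, None)] * 3)                  # x >= 0

print(res.x.round(4), res.fun)               # efficient portfolio, its variance

Since the objective is a convex quadratic and the constraints linear, any local
solver of this kind finds the global optimum.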
The above model, despite the fact that most investors consider it advanced,
has a number of shortcomings that relate to the subject of this book. The
first we note is that this is a two-stage model: we make decisions under
uncertainty (the investments), we then observe what happens, and finally we
obtain payoffs according to what happened. An important question is therefore
to what extent the problem is well modelled as a two-stage problem. More and
more people, both in industry and academia, tend to think that this is not the
case. The reasons are many, and they represent a valid way of thinking for any
decision problem under uncertainty.
2.7.3 More advanced models

2.7.3.1 A scenario tree
Figure 16 A generalized network for four investment categories (bonds, stocks,
real estate and cash) over three stages.

The model is a generalized network, in which time runs horizontally. For
each stage we first have a column with one node for each
instrument. In the example these are four investment categories. The arcs
entering from the left bring the initial portfolio into the model, measured in
the amount of money held in each category. The arcs that run horizontally
between nodes of the same category represent investments held from one
period to the next. The node which is alone in a column represents trading.
Arcs that run to or from this cash node represent the selling or buying of
instruments. A stage consists of one column of 4 nodes plus the single cash
node.
We mentioned that this is a generalized network. That means that the
amount of money that enters an arc is not the same as the amount that leaves
it. For example, if you put money in the bank, and the interest rate is 10%,
then the flow into the arc is multiplied by 1.1 to produce the flow out. This
parameter is called the multiplier of the arc. This way the investment generally
increases over time. For most categories this parameter is uncertain, normally
with a mean greater than one. This is how we represent uncertainty in the
model. For a given scenario, these multipliers are known.
Arcs going to the cash trading node from all nodes but the cash node to its
left will have multipliers less than 1, to represent variable transaction costs.
The arcs that leave the cash trading node have the same multipliers as the
horizontal arcs for the same investment categories, reduced (deterministically)
for variable transaction costs. Fixed transaction costs are not hard to model,
but they would make the models very difficult to solve.
We can also have arcs going backwards in time. They represent borrowings.
Since you must pay interest on borrowings (deterministic or stochastic), these
arcs have multipliers less than one, meaning that if you want 100 USD now,
you must pay back more than 100 USD in a later time period.
If we transform all investments into cash in the last period (maybe without
transaction costs), a natural objective is to maximize this value. This way
we have set the scene for using scenario aggregation on a financial model. It
appears that these models are very promising.
2.7.3.3
Practical considerations
2.8 Hydro power production
The production of electric power from rivers and reservoirs represents an area
where stochastic programming methodology has been used for a long time.
The reason is simply that the environment in which planning must take place
is very uncertain. In particular, the inflow of water to the rivers and reservoirs
varies a lot, both in the short and the long term. This is caused by variation in
rainfall, but even more by the uncertainty related to the time of snow melting
in the spring. Furthermore, the demand for power is also random, depending
on factors such as temperature, the price of oil and general economic conditions. The
actual setting for the planners will vary a lot from country to country. Norway,
with close to 100% of her electricity coming from hydro, is in a very different
situation from for example France with a high dependency on nuclear power,
or the US with a more mixed system. In addition, general market regulations
will affect modeling. Some countries have strictly regulated markets, others
have full deregulation and competition.
We shall now present a very simple model for electricity production. As an
example, we shall assume that electricity can be sold at fixed prices, which
could be interpreted as if we were a small producer in a competitive market.
It is worth noting that in many contexts it is necessary to treat the price as
random as well. In still other contexts, the goal is not at all to maximize profit,
but rather to satisfy demand. So, there are many variations of this problem.
We shall present one in order to illustrate the basics.
2.8.1 A small example
Let us look at a rather simple version of the problem. Let there be two
reservoirs, named A and B. The reservoirs are connected by a river, with
A being upstream of B. We shall assume that the periods are long enough
for water released from reservoir A in a period to reach reservoir B in the
same period. This implies either that the reservoirs are close or that the time
periods are long. It will be easy to change the model if it is more reasonable
to let the water arrive in reservoir B in the next period. We shall also assume
that both water released for production and water spilled from reservoir A
(purposely or as a result of a full reservoir) will reach reservoir B. Sometimes
spilled water is lost.
There are three sets of variables. Let
vij be the volume of water in reservoir i (i ∈ {A, B}) at the beginning of
period j (j ∈ {0, 1, 2, ..., T}), where vi0 is given as the initial volume
in each reservoir, and
uij be the volume of water in reservoir i released to power station i during
period j, and
rij be the amount of water spilling out of reservoir i in period j.
There is one major set of parameters for the constraints, which we
eventually will interpret as random variables. Let
qij be the volume of water flowing into reservoir i during period j.
Bounds on the variables uij and vij are also given. They typically represent
such things as reservoir size, production capacity and legal restrictions:

uij ∈ [ui, ūi] and vij ∈ [vi, v̄i].
What we now lack is a description of the objective function plus the end
effects. To facilitate that, let
Figure 17 The two connected reservoirs and their inflows.
cj denote the value of the electricity generated from one unit of water in
period j,
Φ(vAT, vBT) denote the value function for water at the final stage, and

Φi denote the marginal value of water in reservoir i at the final stage (the
partial derivative of Φ).

The objective we wish to maximize then has the form (assuming discounting
is contained in cj)

Σ_{j=1}^T cj (uAj + uBj) + Φ(vAT, vBT).
The function Φ is very important in this model. The reason is that a major
feature of the model is that it distributes water between the periods covered
by the model and all later periods. Hence, if Φ underestimates the future
value of water, the model will most likely suggest an empty reservoir after
stage T, and if it is set too high, the model will almost only save water. The
estimation of Φ, which is normally done in a model with a very long time
horizon (often infinite), has been the subject of research for several decades.
Very often it is the partial derivatives Φi that are estimated, rather than the
function Φ itself.
Now, if the inflow is random, we can set up an event tree. Most likely, the
inflows to the reservoirs are dependent, and if the periods are short, there
may also be dependence over time. The model we are left with can, at least
in principle, be solved with scenario aggregation.
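As an illustration, the following Python sketch solves a deterministic single-
scenario version of the model as an LP, with the end effect linearized as
ΦA vA,T + ΦB vB,T (the marginal water values). All data, and the bookkeeping
choice of substituting the volumes out through the balance equations, are
assumptions made for the sketch.

import numpy as np
from scipy.optimize import linprog

T = 3
c  = [1.0, 1.2, 0.9]        # c_j: value of one unit of water in period j
qA = [4.0, 6.0, 2.0]        # inflow to reservoir A (one scenario)
qB = [1.0, 2.0, 1.0]        # inflow to reservoir B
phiA = phiB = 0.8           # marginal end-of-horizon values of stored water
v0A, v0B = 10.0, 8.0        # initial volumes
vmax, umax = 20.0, 6.0      # reservoir size and release capacity

# Decision vector: releases and spills [uA_0.., uB_0.., rA_0.., rB_0..].
uA, uB = slice(0, T), slice(T, 2 * T)
rA, rB = slice(2 * T, 3 * T), slice(3 * T, 4 * T)

# A unit released from A earns c_j, leaves A's end storage (-phiA) and
# enters B's (+phiB); linprog minimizes, so the profit is negated.
obj = np.zeros(4 * T)
obj[uA] = [-(cj - phiA + phiB) for cj in c]
obj[uB] = [-(cj - phiB) for cj in c]
obj[rA], obj[rB] = phiA - phiB, phiB

A_ub, b_ub = [], []
for j in range(1, T + 1):                   # volume bounds 0 <= v_ij <= vmax
    rowA = np.zeros(4 * T)
    rowA[uA][:j] = rowA[rA][:j] = 1
    baseA = v0A + sum(qA[:j])
    A_ub += [rowA, -rowA]; b_ub += [baseA, vmax - baseA]
    rowB = np.zeros(4 * T)
    rowB[uB][:j] = rowB[rB][:j] = 1
    rowB[uA][:j] = rowB[rA][:j] = -1        # A's releases and spill reach B
    baseB = v0B + sum(qB[:j])
    A_ub += [rowB, -rowB]; b_ub += [baseB, vmax - baseB]

res = linprog(obj, A_ub=np.array(A_ub), b_ub=b_ub,
              bounds=[(0, umax)] * (2 * T) + [(0, None)] * (2 * T))
print(res.x[uA], res.x[uB], -res.fun)   # releases and the variable profit part

(The constant terms phiA(v0A + Σ qA) + phiB(v0B + Σ qB) are omitted from the
printed objective value.) With random inflows, one such LP would be solved per
scenario inside the scenario aggregation loop.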
2.8.2 Further developments
The model shown above is very simplified. Modeling of real systems must take
into account a number of other aspects as well. In this section, we list some
of them to give you a feeling for what may happen.
First, these models are traditionally set in a context where the major goal is
to meet demand, rather than maximize profit. In a pure hydro based system,
the goal is then to obtain as much energy as possible from the available water
(which of course is still uncertain). In a system with other sources for energy
as well, we also have to take into account the cost of these sources, for example
natural gas, oil or nuclear power.
Obviously, in a model as simple as ours, maximizing the amount of energy
obtained from the available water resources makes little sense, as we have
(implicitly) assumed that the amount of energy we get from 1 m³ of water
is fixed. The reality is normally different. First, the turbines are not equally
efficient at all production levels. They have optimal (below maximal)
production levels where the amount of energy per m³ of water is maximized.
Generally, the function describing energy production as a result of water usage
in a power plant with several turbines is neither convex nor monotone. In
particular, the non-convexity is serious. It stems from physical properties of
the turbines.
But there is more than that. The energy production also depends on the
head (hydrostatic pressure) that applies at a station during a period. It is
common to measure water pressure as the height of the water column having
the given pressure at the bottom. This is particularly complicated if the
water released from one power plant is submerged in the reservoir of the
downstream power plant. In this case the head of the upper station will depend
on the reservoir level of the lower station, generating another source of
non-convexities.
Traditionally, these models have been solved using stochastic dynamic
programming. This can work reasonably well as long as the dimension of
the state space is small. A requirement in stochastic dynamic programming
is that there is independence between periods. Hence, if the water inflow in one
period (stage) is correlated with that of the previous period(s), the state space
must be expanded to contain the inflows in these previous period(s). If this
happens, SDP is soon out of business.
Furthermore, in deregulated markets it may be necessary to include price
as a random variable. Price is correlated with inflow in the present period, but
even more with inflow in earlier periods, through the reservoir levels. This creates
dependencies which are very hard to tackle in SDP.
Hence, researchers have turned to other methods, for example scenario
aggregation, where dependencies are of no concern. So far, it is not clear
how successful this will be.
2.9

2.9.2
Table 2 Values and possible sizes of the two items.

Item    Value    Minimum size    Maximum size
A       6        5               8
B       4        3               6
The goal is to fill the container with as valuable items as possible. However,
the size of an item is uncertain. For simplicity, we assume that each item can
have two different sizes, as given in Table 2. All sizes occur with the same
probability of 0.5. As is always the case with a stochastic model, we must
decide on how the stages are defined. We shall assume that we must pick an
item before we learn its size, and that once it is picked, it must be put into
the container. If the container becomes overfull, we obtain a penalty of 2 per
unit in excess of 10. We have the choice of picking only one item, and they
can be picked in any order.
A stochastic decision tree for the problem is given in Figure 18, where we
have already folded back and crossed out nonoptimal decisions. We see that
the expected value is 7.5. That is obtained by first picking item A and then,
if item A turns out to be small, also picking item B. If item A turns out to be
large, we choose not to pick item B. □

Figure 18 The stochastic decision tree for the container problem, folded back,
with nonoptimal decisions crossed out.
If we assume that the event tree (or the stochastic part of the stochastic
decision tree) is a fair description of the randomness of a model, the following
simple approach gives a reasonable measure of how good the deterministic
model really is. Start in the root of the event tree, and solve the deterministic
model. (Probably this means replacing random variables by their means.
However, this approach can be used for any competing deterministic model.)
Take that part of the deterministic solution that corresponds to the first stage
of the stochastic model, and let it represent an implementable solution in the
root of the event tree. Then go to each node at level two of the event tree and
repeat the process. Taking into consideration what has happened in stage 1
(which is different for each node), solve the deterministic model from stage
2 onwards, and use that part of the solution that corresponds to stage 2 as
an implementable solution. Continue until you have reached the leaves of the
event tree.
This is a fair comparison, since even people who prefer deterministic models
resolve them as new information becomes available (represented by the event
tree). In this setting we can compare both decisions and (expected) optimal
objective values. What we may observe is that although the solutions are
different, the optimal values are almost the same. If that is the case, we are
observing flat objective functions with many (almost) optimal solutions. If we
observe large differences in objective values, we have a clear indication that
solving a stochastic model is important.
Let us return to Example 2.4. Let the following simple deterministic
algorithm be an alternative to the stochastic programming approach in
Figure 18. Consider all items not put into the container so far. For each item,
calculate the value of adding it to the container, given that it has its expected
size. If at least one item adds a positive value to the content of the container,
pick the one with the highest added value. Then put it in, and repeat.
This is not meant to be an especially efficient algorithm; it is only presented
for its simplicity, to help us make a few points. If we apply this algorithm to
our case, we see that with an empty container, item A will add 6 to the value
of the container and item B will add 4. Hence we pick item A. The algorithm
will next determine if B should be picked or not.
For simplicity, assume that we have a two-stage model. Now compare the
optimal objective value of the stochastic model with the expected value of
the wait-and-see solutions. The latter is calculated by finding the optimal
solution for each possible realization of the random variables. Clearly, it is
better to know the value of the random variable before making a decision than
having to make the decision before knowing. The difference between these two
expected objective values is called the expected value of perfect information
(EVPI), since it shows how much one could expect to win if one were told
what would happen before making one's decisions. Another interpretation is
that this difference is what one would be willing to pay for that information.
What does it mean to have a large EVPI? Does it mean that it is important
to solve a stochastic model? The answer is no! It shows that randomness plays
an important role in the problem, but it does not necessarily show that a
deterministic model cannot function well. By resorting to the set-up of the
previous subsection, we may be able to find that out. We can be quite sure,
however, that a small EVPI means that randomness plays a minor role in the
model.
In the multistage case the situation is basically the same. It is, however,
possible to have a very low EVPI, but at the same time have a node far down
in the tree with a very high EVPI (but low probability).
Let us again turn to Example 2.4. Table 3 shows the optimal solutions for
the four cases that can occur, if we make the decisions after the true values
have become known. Please check that you agree with the numbers.
Table 3 The four possible wait-and-see solutions for the container problem in
Example 2.4.

Size of A    Size of B    Solution    Value
5            3            A, B        10
5            6            A, B        8
8            3            A, B        8
8            6            A           6

With each case in Table 3 equally probable, the expected value of the
wait-and-see solution is 8, which is 0.5 more than what we found in Figure 18.
Hence the EVPI equals 0.5: the value of knowing the true sizes of the items
before making decisions is 0.5. This is therefore also the maximal price one
would pay to know this.
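The rows of Table 3, the wait-and-see value 8 and the EVPI of 0.5 can be
checked with a few lines of Python; the enumeration below uses nothing beyond
the data of Example 2.4.

from itertools import product

value = {'A': 6, 'B': 4}
sizes = {'A': (5, 8), 'B': (3, 6)}
CAP, PENALTY = 10, 2

def best_subset(size_of):
    # With the sizes known, picking becomes deterministic:
    # try all subsets and keep the most valuable one.
    best = (0.0, ())
    for subset in [(), ('A',), ('B',), ('A', 'B')]:
        load = sum(size_of[i] for i in subset)
        val = sum(value[i] for i in subset) - PENALTY * max(0, load - CAP)
        best = max(best, (val, subset))
    return best

ws = 0.0
for sA, sB in product(sizes['A'], sizes['B']):   # four equally likely cases
    val, picks = best_subset({'A': sA, 'B': sB})
    print(sA, sB, picks, val)                    # reproduces Table 3
    ws += 0.25 * val

print('wait-and-see value:', ws)   # 8.0
print('EVPI:', ws - 7.5)           # 0.5, with 7.5 taken from Figure 18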
What if we were offered the chance to pay for knowing the size of A or B before
making our first pick? In other words, does it help to know the size of, for
example, item B before choosing what to do? This is illustrated in Figure 19.
Figure 19 Stochastic decision tree for the container problem when we know
the size of B before making decisions.
We see that the EVPI for knowing the size of item B is 0.5, which is the
same as that for knowing both A and B. The calculation for item A is left as
an exercise.
Example 2.5 Let us conclude this section with another similar example. You
are to throw a die twice, and you will win 1 if you can guess the total number
of eyes from these two throws. The optimal guess is 7 (if you did not know
that already, check it out!), and that gives you a chance of winning of 1/6. So
the expected win is also 1/6.

Now you are offered the chance to pay for knowing the result of the first
throw. How much will you pay (or alternatively, what is the EVPI for the first
throw)? A close examination shows that knowing the result of the first throw
does not help at all. Even if you knew, guessing a total of 7 would still be
optimal (though no longer uniquely optimal), and the probability of winning
is still 1/6. Hence the EVPI for the first stage is zero.

Alternatively, you are offered the chance to learn the value of both throws
before guessing. In that case you will of course make a correct guess, and
be certain of winning. Therefore the expected gain has increased from 1/6
to 1, so the EVPI for knowing the value of both random variables is 5/6. □
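The EVPI computations of Example 2.5 can be verified directly; the sketch
below uses exact fractions.

from fractions import Fraction

sixth = Fraction(1, 6)

# P(total = g) without any information:
p_total = {g: sum(sixth * sixth for a in range(1, 7) for b in range(1, 7)
                  if a + b == g) for g in range(2, 13)}
no_info = max(p_total.values())                     # 1/6, attained at g = 7

# Knowing the first throw a, the best guess still wins with probability 1/6:
first_known = sum(sixth * max(sum(sixth for b in range(1, 7) if a + b == g)
                              for g in range(2, 13)) for a in range(1, 7))

both_known = Fraction(1)                            # always guess correctly

print(no_info, first_known - no_info, both_known - no_info)
# 1/6, 0, 5/6: the EVPI of the first throw alone is zero.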
As you see, EVPI is not one number for a stochastic program, but can
be calculated for any combination of random variables. If only one number is
given, it usually means the value of learning everything, in contrast to knowing
nothing.
References
[1] Bellman R. (1957) Dynamic Programming. Princeton University Press,
Princeton, New Jersey.
[2] Helgason T. and Wallace S. W. (1991) Approximate scenario solutions in
the progressive hedging algorithm. Ann. Oper. Res. 31: 425–444.
[3] Howard R. A. (1960) Dynamic Programming and Markov Processes. MIT
Press, Cambridge, Massachusetts.
[4] Nemhauser G. L. (1966) Dynamic Programming. John Wiley & Sons, New
York.
[5] Rockafellar R. T. and Wets R. J.-B. (1991) Scenarios and policy aggregation
in optimization under uncertainty. Math. Oper. Res. 16: 119–147.
[6] Schaefer M. B. (1954) Some aspects of the dynamics of populations
important to the management of the commercial marine fisheries. Inter-Am.
Trop. Tuna Comm. Bull. 1: 27–56.
[7] Wallace S. W. and Helgason T. (1991) Structural properties of the
progressive hedging algorithm. Ann. Oper. Res. 31: 445–456.
[8] Watson S. R. and Buede D. M. (1987) Decision Synthesis: The Principles
and Practice of Decision Analysis. Cambridge University Press, Cambridge.
3 Recourse Problems
The purpose of this chapter is to discuss principal questions of linear recourse
problems. We shall cover general formulations, solution procedures and
bounds and approximations.
Figure 1 shows a simple example from the fisheries area. The assumption is
that we know the position of the fishing grounds, and potential locations for
plants. The cost of building a plant is known, and so are the distances between
grounds and potential plants. The fleet capacity is also known, but quotas,
and therefore catches, are only known in terms of distributions. Where should
the plants be built, and how large should they be?
This is a typical two-stage problem. In the first stage we determine which
plants to build (and how big they should be), and in the second stage we catch
and transport the fish when the quotas for a given year are known. Typically,
quotas can vary as much as 50% from one year to the next.
3.1 Outline of Structure

The recourse problems studied in this chapter have the general form

min c^T x + Q(x)
s.t. Ax = b, x ≥ 0,

where

Q(x) = Σ_j pj Q(x, ξ^j)

and

Q(x, ξ) = min{ q(ξ)^T y | W(ξ)y = h(ξ) − T(ξ)x, y ≥ 0 },

where pj is the probability that ξ̃ = ξ^j, the jth realization of ξ̃,
h(ξ) = h0 + Hξ = h0 + Σi hi ξi, T(ξ) = T0 + Σi Ti ξi and q(ξ) = q0 + Σi qi ξi.
Figure 1 A map showing potential plant sites and actual fishing grounds for
Southern Norway and the North Sea.
The function Q(x, ξ) is called the recourse function, and Q(x) therefore the
expected recourse function.

In this chapter we shall look at only the case with fixed recourse, i.e. the
case where W(ξ) ≡ W. Let us repeat a few terms from Section 1.4, in order
to prepare for the next section. The cone pos W, mentioned in (4.17) of
Chapter 1, is defined by

pos W = {t | t = W y, y ≥ 0}.

The cone pos W is illustrated in Figure 2. Note that

W y = h, y ≥ 0 is feasible ⟺ h ∈ pos W.

Recall that a problem has complete recourse if

pos W = R^m.

Among other things, this implies that

h(ξ) − T(ξ)x ∈ pos W for all ξ and all x.

But that is definitely more than we need in most cases. Usually, it is more
than enough to know that h(ξ) − T(ξ)x ∈ pos W for all possible values of ξ
for the x at hand.

Figure 2 The cone pos W for a case where W has three rows and four
columns.
3.2 The L-shaped Decomposition Method
This section contains a much more detailed version of the material found
in Section 1.7.4. In addition to adding more details, we have now added
randomness more explicitly, and have also chosen to view some of the aspects
from a different perspective. It is our hope that a new perspective will increase
the understanding.
3.2.1
Feasibility
The material treated here coincides with step 2(a) in the dual decomposition
method of Section 1.7.4. Let the second-stage problem be given by
Q(x, ξ) = min{ q(ξ)^T y | W y = h(ξ) − T(ξ)x, y ≥ 0 },

where W is fixed. Assume we are given an x̂, and we should like to know if
that x̂ yields a feasible second-stage problem for all possible values of ξ. We
assume that the support Ξ of ξ̃ is bounded, and we let A denote the set of its
extreme points.
Figure 3 Illustration showing that if infeasibility is to occur for a fixed x̂, it
must occur for an extreme point of the support of H ξ̃, and hence of ξ̃. In this
example T(ξ) is assumed to be equal to T0.
Figure 4 The cone pos W and its polar cone pol pos W.
We should like to check for feasibility in such a way that, if the given problem
is not feasible, we automatically come up with a generator of pol pos W. For
the discussion, we shall use Figure 5.
We should like to find a σ such that

σ^T t ≤ 0 for all t ∈ pos W.

This is equivalent to requiring that σ^T W ≤ 0. In other words, σ should be
in the cone pol pos W. But, assuming that the right-hand side h(ξ) − T(ξ)x̂
produces an infeasible problem, we should at the same time require that

σ^T [h(ξ) − T(ξ)x̂] > 0,

because if we later add the constraint σ^T [h(ξ) − T(ξ)x] ≤ 0 to our problem,
we shall exclude the infeasible right-hand side h(ξ) − T(ξ)x̂ without leaving
out any feasible solutions. Hence we should like to solve

max{ σ^T (h(ξ) − T(ξ)x̂) | σ^T W ≤ 0, ‖σ‖ ≤ 1 },

where the last constraint has been added to bound σ. We can do that, because
otherwise the maximal value will be +∞, and that does not interest us, since
we are looking for the direction defined by σ. If we had chosen the ℓ2 norm, the
maximization would have made sure that σ came as close to h(ξ) − T(ξ)x̂ as
possible (see Figure 5). Computationally, however, we should not like to work
with quadratic constraints. Let us therefore see what happens if we choose
the ℓ1 norm. Let us write our problem differently to see the details better. To
do that, we need to let the unconstrained σ be replaced by σ1 − σ2, where
σ1, σ2 ≥ 0. We then get the following:

max{ (σ1 − σ2)^T (h(ξ) − T(ξ)x̂) | (σ1 − σ2)^T W ≤ 0, e^T (σ1 + σ2) ≤ 1, σ1, σ2 ≥ 0 },

where e is a vector of ones. To more easily find the dual of this problem, let
us write it down in a more standard format:

max (σ1 − σ2)^T (h(ξ) − T(ξ)x̂)        dual variables
    W^T σ1 − W^T σ2 ≤ 0                 y
    e^T σ1 + e^T σ2 ≤ 1                 t
    σ1, σ2 ≥ 0

From this, we find the dual linear program to be

min{ t | W y + et ≥ h(ξ) − T(ξ)x̂, −W y + et ≥ −(h(ξ) − T(ξ)x̂), y, t ≥ 0 }.
Figure 5 An infeasible right-hand side h(ξ) − T(ξ)x̂ lying outside pos W,
together with the separating vector σ.

Note that if the optimal value in this problem is zero, we have W y =
h(ξ) − T(ξ)x̂, so that we do indeed have h(ξ) − T(ξ)x̂ ∈ pos W, contrary
to our assumption. We also see that if t gets large enough, the problem is
always feasible. This is what we solve for all ξ ∈ A. If for some ξ we find a
positive optimal value, we have found a ξ for which h(ξ) − T(ξ)x̂ ∉ pos W,
and we create the cut

σ^T (h(ξ) − T(ξ)x) ≤ 0, i.e. σ^T T(ξ)x ≥ σ^T h(ξ).     (2.1)

The σ used here is a generator of pol pos W, but it is not in general as close
to h(ξ) − T(ξ)x̂ as possible. This is in contrast to what would have happened
had we used the ℓ2 norm. (See Example 3.1 below for an illustration of this
point.)
Note that if T(ξ) ≡ T0, the expression σ^T T0 x in (2.1) does not depend on
ξ. Since at the same time (2.1) must be true for all ξ, we can for this special
case strengthen the inequality by calculating

σ^T T0 x ≥ σ^T h0 + max_{ξ∈Ξ} σ^T Hξ.

Since σ^T T0 is a vector and the right-hand side is a scalar, this can conveniently
be written as γ^T x ≥ δ. The x̂ we started out with will not satisfy this
constraint.
Example 3.1 We present this little example to indicate why the ℓ1 and ℓ2
norms give different results when we generate feasibility cuts. The important
point is how the two norms limit the possible σ values. The ℓ1 norm is given
in the left part of Figure 6, the ℓ2 norm in the right part.

For simplicity, we have assumed that pol pos W equals the positive
quadrant, so that the constraints σ^T W ≤ 0 reduce to σ ≥ 0. Since at the
same time ‖σ‖ ≤ 1, σ must be within the shaded part of the two figures.

For convenience, let us denote the right-hand side by h, and let σ =
(σx, σy)^T, to reflect the x and y parts of the vector. In this example h =
(4, 2)^T. For the ℓ1 norm the problem now becomes

max{ 4σx + 2σy | σx + σy ≤ 1, σ ≥ 0 }.

The optimal solution here is σ = (1, 0)^T. Graphically, this can be seen from
the figure from the fact that an inner product equals the length of one vector
multiplied by the length of the projection of the second vector on the first. If
we take the h vector as the fixed first vector, the feasible σ vector with the
largest projection on h is σ = (1, 0)^T.

For the ℓ2 norm the problem becomes

max{ 4σx + 2σy | (σx)² + (σy)² ≤ 1, σ ≥ 0 }.

The optimal solution here is σ = (1/√5)(2, 1)^T, which is a vector in the same
direction as h.

In this example we see that if σ is found using the ℓ1 norm, it becomes a
generator of pol pos W, but it is not as close to h as possible. With the ℓ2
norm, we did not get a generator, but we got a vector as close to h as possible.
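Both norm choices in Example 3.1 can be checked numerically; the ℓ1 version
is the small LP below (solved with scipy), while the ℓ2 version has the
closed-form solution h/‖h‖.

import numpy as np
from scipy.optimize import linprog

# l1 version: max 4*sx + 2*sy  s.t.  sx + sy <= 1, s >= 0.
# (pol pos W is the positive quadrant here, so sigma^T W <= 0 becomes s >= 0.)
res = linprog(c=[-4, -2],                  # linprog minimizes, so negate
              A_ub=[[1, 1]], b_ub=[1],
              bounds=[(0, None), (0, None)])
print(res.x)                               # [1. 0.]: a generator of pol pos W

# l2 version: the maximizer over the unit ball is h / ||h||.
h = np.array([4.0, 2.0])
print(h / np.linalg.norm(h))               # (2,1)/sqrt(5), same direction as h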
Figure 7 The procedure LP, a general-purpose LP solver.
3.2.2
Optimality
The material discussed here concerns step 1(b) of the dual decomposition
method in Section 1.7.4. Let us first note that if we have relatively complete
recourse, or if we have checked that h(ξ) − T(ξ)x ∈ pos W for all ξ ∈ A, then
the second-stage problem

min{ q(ξ)^T y | W y = h(ξ) − T(ξ)x, y ≥ 0 }

is feasible. Its dual formulation is given by

max{ π^T (h(ξ) − T(ξ)x) | π^T W ≤ q(ξ)^T }.

As long as q(ξ) ≡ q0, the dual is either feasible or infeasible for all x and ξ, since
x and ξ do not enter the constraints. We see that this is more complicated
if q is also affected by randomness. But even when ξ enters the objective
function, we can at least say that if the dual is feasible for one x and a
given ξ, then it is feasible for all x for that value of ξ, since x enters only
the objective function. Therefore, from standard linear programming duality,
since the primal is feasible, the primal must be unbounded if and only if the
dual is infeasible; that would happen for all x for a given ξ if randomness
affects the objective function, and for all x and ξ if q(ξ) ≡ q0. Therefore we
can check in advance for unboundedness, and this is particularly easy if
randomness does not affect the objective function. Note that this discussion
relates to Proposition 1.18. Assume we know that our problem is bounded.
Figure 8 The procedure master: it solves the current master problem,
including the K feasibility cuts and the L optimality cuts generated so far,
and returns x̂ and θ̂ (θ̂ stays at −∞ as long as no optimality cuts exist).
Now consider

Q(x) = Σ_j pj Q(x, ξ^j),

with

Q(x, ξ) = min{ q(ξ)^T y | W y = h(ξ) − T(ξ)x, y ≥ 0 }.

It is clear from standard linear programming theory that Q(x, ξ) is piecewise
linear and convex in x (for fixed ξ). Provided that q(ξ) ≡ q0, Q(x, ξ) is
also piecewise linear and convex in ξ (for fixed x). (Remember that T(ξ) =
T0 + Σi Ti ξi.) Similarly, if h(ξ) − T(ξ)x ≡ h0 − T0 x, while q(ξ) = q0 + Σi qi ξi,
then, from duality, Q(x, ξ) is piecewise linear and concave in ξ. Each linear
piece corresponds to a basis (possibly several in the case of degeneracy).
Therefore Q(x), being a finite sum of such functions, will also be convex and
piecewise linear in x. If, instead of minimizing, we were maximizing, convexity
and concavity would change places in these statements.

In order to arrive at an algorithm for our problem, let us now reformulate
the latter by introducing a new variable θ:

min c^T x + θ
s.t. Ax = b,
     θ ≥ Q(x),
     γk^T x ≥ δk for k = 1, ..., K,
     x ≥ 0,
procedure feascut(A: set; x̂: real; newcut: boolean; K: integer);
begin
  A' := A; newcut := false;
  while A' ≠ ∅ and not (newcut) do begin
    pick(A', ξ); A' := A' \ {ξ};
    Solve the LP
      min{ t | W y + et ≥ h(ξ) − T(ξ)x̂, −W y + et ≥ −(h(ξ) − T(ξ)x̂), y, t ≥ 0 };
    newcut := (t > 0);
    if newcut then begin
      (* Create a feasibility cut; see page 161. *)
      K := K + 1;
      Construct the cut γK^T x ≥ δK;
    end;
  end;
end;

Figure 9 The procedure feascut, which checks feasibility and generates
feasibility cuts.
where, as before,

Q(x) = Σ_j pj Q(x, ξ^j)

and

Q(x, ξ) = min{ q(ξ)^T y | W y = h(ξ) − T(ξ)x, y ≥ 0 }.

Of course, computationally we cannot use θ ≥ Q(x) as a constraint, since
Q(x) is only defined implicitly by a large number of optimization problems.
Instead, let us for the moment drop it, and solve the above problem without
it, simply hoping it will be satisfied (assuming so far that all feasibility cuts
γk^T x ≥ δk are there, or that we have relatively complete recourse). We then
get some x̂ and θ̂ (the first time θ̂ = −∞). Now we calculate Q(x̂), and then
check if θ̂ ≥ Q(x̂). If it is, we are done. If not, our x̂ is not optimal: dropping
θ ≥ Q(x) was not acceptable. Now

Q(x̂) = Σ_j pj Q(x̂, ξ^j) = Σ_j pj q(ξ^j)^T ŷ^j = Σ_j pj (π̂^j)^T [h(ξ^j) − T(ξ^j)x̂],
procedure L-shaped;
begin
  K := 0; L := 0;
  θ̂ := −∞;
  LP(A, b, c, x̂, feasible);
  stop := not (feasible);
  while not (stop) do begin
    feascut(A, x̂, newcut, K);
    if not (newcut) then begin
      Find Q(x̂);
      stop := (θ̂ ≥ Q(x̂));
      if not (stop) then begin
        (* Create an optimality cut; see page 168. *)
        L := L + 1;
        Construct the cut −βL^T x + θ ≥ αL;
      end;
    end;
    if not (stop) then begin
      master(K, L, x̂, θ̂, feasible);
      stop := not (feasible);
    end;
  end;
end;

Figure 10 The L-shaped decomposition algorithm.
where π̂^j is the optimal dual solution yielding Q(x̂, ξ^j). The constraints in the
dual problem are, as mentioned before, π^T W ≤ q(ξ^j)^T, which are independent
of x. Therefore, for a general x, and corresponding optimal dual vectors π^j(x),
we have

Q(x) = Σ_j pj (π^j(x))^T [h(ξ^j) − T(ξ^j)x] ≥ Σ_j pj (π̂^j)^T [h(ξ^j) − T(ξ^j)x],

since π̂^j is feasible but not necessarily optimal, and the dual problem is a
maximization problem. Since what we dropped from the constraint set was
θ ≥ Q(x), we now add in its place

θ ≥ Σ_j pj (π̂^j)^T [h(ξ^j) − T(ξ^j)x] = α + β^T x,

or

−β^T x + θ ≥ α.
Figure 11 Example of the progress of the L-shaped decomposition algorithm:
the function cx + Q(x), the iterates x1, ..., x5, the cuts 1-5 and the
corresponding pairs (x̂, θ̂) produced by the master problem.
Since there are finitely many feasible bases coming from the matrix W , there
are only finitely many such cuts.
We are now ready to present the basic setting of the L-shaped decomposition
algorithm. It is shown in Figure 10. To use it, we shall need a procedure
that solves LPs. It can be found in Figure 7. Also, to avoid too complicated
expressions, we shall define a special procedure for solving the master problem;
see Figure 8. Furthermore, we refer to procedure pick(A, ξ), which simply
picks an element ξ from the set A, and, finally, we use procedure feascut,
which is given in Figure 9. The set A was defined on page 162.

In the algorithms to follow, let Γx ≥ δ represent the K feasibility cuts
γk^T x ≥ δk, and let −Bx + eθ ≥ α represent the L optimality cuts
−βl^T x + θ ≥ αl. Furthermore, let e be a column of 1s of appropriate size.
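To see the pieces of Figure 10 working together, here is a toy L-shaped run in
Python for a one-dimensional problem with relatively complete recourse (so
procedure feascut is not needed). The problem data, and the closed-form dual
solutions used for the optimality cuts, are illustrative assumptions.

import numpy as np
from scipy.optimize import linprog

# min 0.5*x + Q(x),  Q(x) = E[max(xi - x, 0)],  xi in {1,3} w.p. 0.5 each.
# The second stage Q(x,xi) = min{ y | y >= xi - x, y >= 0 } has dual
# solution pi = 1 if xi > x else 0, which supplies the optimality cuts.
c0, xis, probs = 0.5, [1.0, 3.0], [0.5, 0.5]
cuts = []                                        # list of (alpha, beta)
x_hat, theta_hat = 0.0, -np.inf

for it in range(20):
    Q = sum(p * max(xi - x_hat, 0.0) for p, xi in zip(probs, xis))
    if theta_hat >= Q - 1e-9:
        break                                    # theta >= Q(x): optimal
    pi = [1.0 if xi > x_hat else 0.0 for xi in xis]
    alpha = sum(p * g * xi for p, g, xi in zip(probs, pi, xis))
    beta = -sum(p * g for p, g in zip(probs, pi))
    cuts.append((alpha, beta))
    # Master: min c0*x + theta  s.t.  theta >= alpha + beta*x for all cuts.
    A = [[b, -1.0] for a, b in cuts]             # beta*x - theta <= -alpha
    rhs = [-a for a, b in cuts]
    res = linprog([c0, 1.0], A_ub=A, b_ub=rhs,
                  bounds=[(0, 10), (None, None)])
    x_hat, theta_hat = res.x

print(x_hat, theta_hat)   # any x in [1,3] is optimal, with value 1.5

Each pass adds one optimality cut; since only finitely many dual bases exist,
only finitely many distinct cuts can appear, which is why the loop terminates.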
3.3
Regularized Decomposition
As mentioned at the end of Section 1.7.4, the recourse problem (for a discrete
distribution) looks like

min{ c^T x + Σ_{i=1}^K pi (q^i)^T y^i }
s.t. Ax = b,
     T^i x + W y^i = h^i, i = 1, ..., K,     (3.1)
     x ≥ 0, y^i ≥ 0, i = 1, ..., K.

During the solution procedure, cuts of two types are added to the master
program:

γ^T x ≥ δ     (3.2)

and

θi ≥ α + β^T x,     (3.3)

where (3.2) denotes a feasibility cut and (3.3) denotes an optimality cut, the
coefficients resulting from step 2 of the dual decomposition method of Section 1.7.4,
as explained further in Section 3.2. Of course, the matrix T and the right-hand
side vector h will vary, depending on the block i for which the cut is derived.
One cycle of a multicut solution procedure for problem (3.1) looks as follows.
Let B1^i = {(x, θ1, ..., θK) | …}, i = 1, ..., K, be the sets of points feasible for
the cuts generated so far for block i (obviously, for block i, restricting only
(x, θi)). Given B0 = {(x, θ) | Ax = b, x ≥ 0, θ ∈ R^K} and the sets B1^i, solve the
master program

min{ c^T x + Σ_{i=1}^K pi θi | (x, θ1, ..., θK) ∈ B0 ∩ ⋂_{i=1}^K B1^i },     (3.4)

yielding (x̂, θ̂1, ..., θ̂K) as a solution. With this solution, try to construct
further cuts for the blocks.
So far the optimality cuts describe each fi only through

θi ≥ max_{j∈Ĵi} [(β^{ij})^T x + α^{ij}],

for Ĵi denoting the set of optimality cuts generated so far for block i with the
related dual basic solutions π̂^{ij} according to (3.3), and not, as we intend to,
with

θi ≥ fi(x) = max_{j∈Ji} [(β^{ij})^T x + α^{ij}],

where Ji enumerates all dual feasible basic solutions for block i. Hence
we are working in the beginning with a piecewise linear convex function
(max_{j∈Ĵi}[(β^{ij})^T x + α^{ij}]) supporting fi(x) that does not sufficiently reflect the
shape of fi (see e.g. Figure 26 of Chapter 1, page 78). The effect may be, and
often is, that even if we start a cycle with an (almost) optimal first-stage
solution x* of (3.1), the first-stage solution x̂ of the master (3.4) may be far
away from x*, and it may take many further cycles to come back towards x*.
The reason for this is now obvious: if the set of available optimality cuts, Ĵi, is
a small subset of the collection Ji, then the piecewise linear approximation of
fi(x) may be inadequate near x*. Therefore it seems desirable to modify the
master program in such a way that, when starting with some overall feasible
first-stage iterate z^k, its solution x^k does not move too far away from z^k.
Thereby we can expect to improve the approximation of fi(x) by an optimality
cut for block i at x^k. This can be achieved by introducing into the objective of
the master a term penalizing ‖x − z^k‖², yielding the so-called regularized
master program
min{ (1/(2ρ)) ‖x − z^k‖² + c^T x + Σ_{i=1}^K pi θi | (x, θ1, ..., θK) ∈ B0 ∩ ⋂_{i=1}^K B1^i },     (3.5)

with a control parameter ρ > 0. To avoid too many constraints in (3.5), let
us start with some z^0 ∈ B0 such that fi(z^0) < ∞ for all i, and G^0 being the
feasible set defined by the first-stage equations Ax = b and all optimality cuts
at z^0. Hence we start (for k = 0) with the reduced regularized master program

min{ (1/(2ρ)) ‖x − z^k‖² + c^T x + Σ_{i=1}^K pi θi | (x, θ1, ..., θK) ∈ G^k }.     (3.6)
Figure 12 The function F(x) = c^T x + Σ_{i=1}^K pi fi(x) and its piecewise
linear approximation by optimality cuts.

One cycle of the regularized decomposition method then looks as follows.

Step 1 Solve (3.6), yielding x^k and the recourse approximates
(θ1^k, ..., θK^k)^T. If, for Fk := c^T x^k + p^T θ^k, we have Fk = F(z^k), then
stop: z^k is optimal. Otherwise go to step 2.
Step 2 Delete from (3.6) some constraints that are inactive at (x^k, θ^k), such
that no more than n + K constraints remain.
Step 3 If x^k satisfies the first-stage constraints (i.e. x^k ≥ 0) then go to
step 4; otherwise add to (3.6) no more than K violated (first-stage)
constraints, yielding the feasible set G^{k+1}, put z^{k+1} := z^k, k := k + 1,
and go to step 1.
Step 4 For i = 1, ..., K solve the second-stage problems at x^k and
(a) if fi(x^k) = +∞ then add to (3.6) a feasibility cut;
(b) otherwise, if fi(x^k) > θi^k then add to (3.6) an optimality cut.
Step 5 If fi(x^k) = +∞ for at least one i then put z^{k+1} := z^k and go to step 7.
Otherwise, go to step 6.
Step 6 If F(x^k) = Fk, or else if F(x^k) ≤ μF(z^k) + (1 − μ)Fk (for some fixed
μ ∈ (0, 1)) and exactly n + K constraints were active at (x^k, θ^k), then put
z^{k+1} := x^k; otherwise, put z^{k+1} := z^k.
Step 7 Determine G^{k+1} as resulting from G^k after deleting and adding
constraints due to step 2 and step 4 respectively. With k := k + 1,
go to step 1.
It can be shown that this algorithm converges in finitely many steps.
The parameter ρ can be controlled during the procedure so as to increase
it whenever steps (i.e. ‖x^k − z^k‖) seem too short, and to decrease it when
F(x^k) > F(z^k).
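A one-dimensional sketch of the effect of the proximal term in (3.6) is given
below: with two made-up optimality cuts, the essentially unregularized master
jumps to the minimizer of the cut model, while the regularized one stays near
the incumbent z^k. The proximal coefficient is written 1/(2ρ) as in (3.5).

import numpy as np
from scipy.optimize import minimize

cuts = [(1.0, -1.0), (0.0, 0.2)]           # theta >= a + b*x (made-up cuts)
z_k = 0.0                                  # incumbent

def master(rho):
    # min (1/(2 rho)) (x - z_k)^2 + theta  s.t.  theta >= a + b*x
    def obj(v):
        x, theta = v
        return (x - z_k) ** 2 / (2 * rho) + theta
    cons = [{'type': 'ineq', 'fun': lambda v, a=a, b=b: v[1] - (a + b * v[0])}
            for a, b in cuts]
    return minimize(obj, x0=[z_k, 2.0], constraints=cons).x

print(master(1e6))   # essentially unregularized: x jumps to about 0.83
print(master(0.5))   # regularized: x stays closer to z_k (about 0.5)

Increasing ρ weakens the proximal pull and lengthens the steps, matching the
control rule just described.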
3.4
Bounds
Section 3.2 was devoted to the L-shaped decomposition method. We note that
the deterministic methods very quickly run into dimensionality problems with
respect to the number of random variables. With much more than 10 random
variables, we are in trouble.
This section discusses bounds on stochastic problems. These bounds can be
useful and interesting in their own right, or they can be used as subproblems
in larger settings. An example of where we might need to bound a problem,
and where this problem is not a subproblem, is the following. Assume that a
company is facing a decision problem. The decision itself will be made next
year, and at that time all parameters describing the problem will be known.
However, today a large number of relevant parameters are unknown, so it
must plan while these parameters are still unknown. As an illustration, consider
the following problem, a version of the example from Section 1.3:

φ(ξ) = min 2 xraw1 + 3 xraw2
s.t. 2 xraw1 + 6 xraw2 ≥ 180 + ξ1,     (4.1)
     3 xraw1 + 3 xraw2 ≥ 162 + ξ2,
     xraw1 ≥ 0, xraw2 ≥ 0,

where both ξ1 and ξ2 are normally distributed with mean 0. As discussed in
Section 1.3, we shall look at the 99% intervals for both (as if that was the
support). This gives us

ξ1 ∈ [−30.91, 30.91],  ξ2 ∈ [−23.18, 23.18].
Figure 13 The function φ(ξ) and two linear functions L1(ξ) and L2(ξ)
supporting it from below.
Each unit of oil from the first country gives 2 units of Product 1 and 3 units
of Product 2; oil from the second country gives 6 and 3 units of the same
products. Company 1 wants at least 180 + ξ1 units of Product 1, and Company 2
at least 162 + ξ2 units of Product 2. The goal now is to find the expected value
of φ(ξ̃); in other words, we seek the expected value of the wait-and-see
solution. Note that this interpretation is not the one we adopted in Section 1.3.
3.4.1 The Jensen Lower Bound

Assume that q(ξ) ≡ q0, so that randomness affects only the right-hand side.
The purpose of this section is to find a lower bound on Q(x̂, ξ̃) for fixed x̂,
and for that purpose we shall, as just mentioned, use φ(ξ) ≡ Q(x̂, ξ) for a
fixed x̂.

Since φ(ξ) is a convex function, we can bound it from below by a linear
function L(ξ) = cξ + d. Since the goal will always be to find a lower bound
that is as large as possible, we shall require that the linear lower bound be
tangent to φ(ξ) at some point ξ̂. Figure 13 shows two examples of such
lower-bounding functions. But the question is which one we should pick. Is L1(ξ)
or L2(ξ) the better?

If we let the lower-bounding function L(ξ) be tangent to φ(ξ) at ξ̂, the slope
must be φ′(ξ̂), and since φ(ξ̂) = L(ξ̂) we must have d = φ(ξ̂) − φ′(ξ̂)ξ̂.
Hence, in total, the lower-bounding function is given by

L(ξ) = φ(ξ̂) + φ′(ξ̂)(ξ − ξ̂).
Since this is a linear function, we easily calculate the expected value of the
lower-bounding function:

E L(ξ̃) = φ(ξ̂) + φ′(ξ̂)(E ξ̃ − ξ̂) = L(E ξ̃).

In other words, we find the expected lower bound by evaluating the lower-
bounding function in E ξ̃. From this, it is easy to see that we obtain the best
(largest) lower bound by letting ξ̂ = E ξ̃. This can be seen not only from the
fact that no linear function that supports φ(ξ) can have a value larger than
φ(E ξ̃) in E ξ̃, but also from the following simple differentiation:

(d/dξ̂) L(E ξ̃) = φ′(ξ̂) − φ′(ξ̂) + φ″(ξ̂)(E ξ̃ − ξ̂) = φ″(ξ̂)(E ξ̃ − ξ̂).

If we set this equal to zero, we find that ξ̂ = E ξ̃. What we have developed is
the so-called Jensen lower bound, or the Jensen inequality.

Proposition 3.1 If φ(ξ) is convex over the support of ξ̃, then

φ(E ξ̃) ≤ E φ(ξ̃).
This best lower bound is illustrated in Figure 14. The Jensen lower bound
can be viewed in two different ways. First, it can be seen as a bound where the
distribution is replaced by its mean and the problem itself is unchanged; this
is when we calculate φ(E ξ̃). Secondly, it can be viewed as a bound where the
distribution is left unchanged and the function φ is replaced by a linear affine
function, represented by a straight line; this is when we integrate L(ξ) over
the support of ξ̃. Depending on the given situation, both these views can be
useful.
There is even a third interpretation; we shall see it used later in the
stochastic decomposition method. Assume we first solve the dual of φ(E ξ̃) to
obtain an optimal basis B. This basis, since ξ does not enter the constraints of
the dual of φ, is dual feasible for all possible values of ξ. Assume now that we
solve the dual version of φ(ξ) for all ξ, but constrain our optimization so that
we are allowed to use only the given basis B. In such a setting, we might claim
that we use the correct function and the correct distribution, but optimize only in
an approximate way. (In stochastic decomposition we use not one, but a finite
number of bases.) The Jensen lower bound can in this setting be interpreted as
representing approximate optimization using the correct problem and correct
distribution, but only one dual feasible basis.
It is worth pointing out that these interpretations of the Jensen lower bound
are put forward to help you see how a bound can be interpreted in different
ways, and that these interpretations can lead you in different directions when
trying to strengthen the bound. An interpretation is not necessarily motivated
by computational efficiency.
Looking back at our example in (4.1), we find the Jensen lower bound by
calculating φ(E ξ̃) = φ(0). That has been solved already in Section 1.3, where
we found that φ(0) = 126.
3.4.2 The Edmundson-Madansky Upper Bound

Again let ξ̃ be a random variable with support Ξ = [a, b], and assume that
q(ξ) ≡ q0. As in the previous section, we define φ(ξ) ≡ Q(x̂, ξ). (Remember
that x is fixed at x̂.) Consider Figure 14, where we have drawn a linear function
U(ξ) between the two points (a, φ(a)) and (b, φ(b)). The line is clearly above
φ(ξ) for all ξ ∈ Ξ. Also, this straight line has the formula cξ + d, and since we
know two points, we can calculate

c = (φ(b) − φ(a)) / (b − a),    d = (b/(b − a)) φ(a) − (a/(b − a)) φ(b).

The expected value of the upper-bounding function is therefore

E U(ξ̃) = c E ξ̃ + d = φ(a) (b − E ξ̃)/(b − a) + φ(b) (E ξ̃ − a)/(b − a).

In other words, we can find the expected upper bound by evaluating φ in the
two points a and b, using the probabilities

P{ξ̃ = a} = (b − E ξ̃)/(b − a) = 1 − p,    P{ξ̃ = b} = (E ξ̃ − a)/(b − a) = p.     (4.2)
(4.2)
As for the Jensen lower bound, we have now shown that the Edmundson
Madansky upper bound can be seen as either changing the distribution and
keeping the problem, or changing the problem and keeping the distribution.
Looking back at our example in (4.1), we have two independent random
variables. Hence we have 22 = 4 LPs to solve to find the Edmundson
Madansky upper bound. Since both distributions are symmetric, the
probabilities attached to these four points will all be 0.25. Calculating this
we find an upper bound of
1
4
182
STOCHASTIC PROGRAMMING
Figure 14 The Jensen lower bound L(ξ) and the Edmundson-Madansky upper
bound U(ξ) on φ(ξ) over the support [a, b], in a minimization problem. Note
that x is fixed.
This is exactly the same as the lower bound, and hence it is the true value
of E φ(ξ̃). We shall shortly comment on this situation, where the bounds turn
out to be equal.
In higher dimensions, the Jensen lower bound corresponds to a hyperplane,
while the EdmundsonMadansky bound corresponds to a more general
polynomial. A two-dimensional illustration of the EdmundsonMadansky
bound is given in Figure 15. Note that if we fix the value of all but one of the
variables, we get a linear function. This polynomial is therefore generated by
straight lines. From the viewpoint of computations, we do not have to relate to
this general polynomial. Instead, we take one (independent) random variable
at a time, and calculate (4.2). This way we end up with 2 possible values for
each random variable, and hence, 2k possible values of for which we have to
evaluate the recourse function.
Assume that the function () in Figure 14 is linear. Then it appears from
the figure that both the Jensen lower bound and the EdmundsonMadansky
upper bound are exact. This is indeed a correct observation: both bounds are
exact whenever the function is linear. And, in particular, this means that if
the function is linear, the error is zero. In the example (4.1) used to illustrate
the Jensen and EdmundsonMadansky bounds we observed that the bounds
where equal. This shows that the function () is linear over the support we
used.
One special use of the Jensen lower bound and EdmundsonMadansky
upper bound is worth mentioning. Assume we have a random vector,
containing a number of independent random variables, and a function that is
convex with respect to that random vector, but the random vector either has
a continuous distribution, or a discrete distribution with a very large number
of outcomes. In both cases we might have to simplify the distribution before
RECOURSE PROBLEMS
183
184
STOCHASTIC PROGRAMMING
Edmundson-Madansky
()
Jensen
b
cell 1
cell 2
Figure 16 Illustration of the effect on the Jensen lower bound and the
EdmundsonMadansky upper bound of partitioning the support into two cells.
idea is outlined in Figure 17. Whatever the original distribution was, we now
have two distributions: one giving an overall lower bound, the other an overall
upper bound.
Since the random variables in the vector were assumed to be independent,
this operation has produced discrete distributions for the random vector as
well.
3.4.3
Combinations
RECOURSE PROBLEMS
185
then we get a lower bound by applying the Jensen rule on the right-hand side
random variables and the EdmundsonMadansky rule in the objective. If we
do it the other way around, we get an overall upper bound.
3.4.4
186
STOCHASTIC PROGRAMMING
i = 0.
d
if
<
E
i
i
i
i
There is a very good reason for such a choice. Note how U () is separable in
its components i . Therefore, for almost all distribution functions, U is simple
to integrate.
To appreciate the bound, we must understand its basic motivation. If we
take some minimization problem, like the one here, and add extra constraints,
the resulting problem will bound the original problem from above. What we
shall do is to add restrictions with respect to the upper bounds c. We shall do
this by viewing () as a parametric problem in , and reserve portions of the
upper bound c for the individual random variables i . We may, for example,
end up by saying that two units of cj are reserved for variable i , meaning that
these two units can be used in the parametric analysis, only when we consider
i . For all other variables k these two units will be viewed as nonexisting.
The clue of the bound is to introduce the best possible set of such constraints,
such that the resulting problem is easy to solve (and gives a good bound).
= (0) by finding
First, let us calculate (E )
(0) = min{q T y | W y = b, 0 y c} = q T y 0 .
y
This can be interpreted as the basic setting, and all other values of will be
seen as deviations from E = 0. (Of course, any other starting point will also
dofor example solving Q(A), where, as stated before, A is the lowest possible
value of .) Note that since y 0 is always there, we can in the following operate
with bounds y 0 y c y 0 . For this purpose, we define 1 = y 0 and
1 = c y 0 . Let ei be a unit vector of appropriate size with a +1 in position
i.
Next, define a counter r and let r := 1. Now check out the case when r > 0
by solving (remembering that Br is the maximal value of r )
min{q T y | W y = er Br , r y r } = q T y r+ = d+
r Br .
y
(4.4)
Note that d+
r represents the per unit cost of increasing the right-hand side
from 0 to er Br . Similarly, check out the case with r < 0 by solving
min{q T y | W y = er Ar , r y r } = q T y r = d
r Ar .
y
(4.5)
RECOURSE PROBLEMS
187
means of the following problem, where we calculate what is left for the next
random variable:
r+1
= ri min{yir+ , yir , 0}.
i
(4.6)
What we are doing here is to find, for each variable, how much r , in the worst
case, uses of the bound on variable i in the negative direction. That is then
subtracted off what we had before. There are three possibilities. Both (4.4) and
(4.5) may yield non-negative values for the variable yi . In that case nothing
is used of the available negative bound ri . Then r+1
= ri . Alternatively,
i
r+
if (4.4) has yi < 0, then it will in the worst case use yir+ of the available
negative bound. Finally, if (4.5) has yir < 0 then in the worst case we use
yir of the bound. Therefore r+1
is what is left for the next random variable.
i
Similarly, we find
ir+1 = ir max{yir+ , yir , 0},
(4.7)
ir+1
where
shows how much is still available of bound i in the forward
(positive) direction.
We next increase the counter r by one and repeat (4.4)(4.7). This takes
care of the piecewise linear functions in .
Note that it is possible to solve (4.4) and (4.5) by parametric linear
programming, thereby getting not just one linear piece above E and one
below, but rather piecewise linearity on both sides. Then (4.6) and (4.7) must
be updated to worst case analysis of bound usage. That is simple to do.
Let us turn to our example (4.1). Since we have developed the piecewise
linear upper bound for equality constraints, we shall repeat the problem with
slack variables added explicitly.
(1 , 2 ) = min{2xraw1 + 3xraw2 }
s.t. xraw1 + xraw2 + s1
= 100,
2xraw1 + 6xraw2
s2
= 180 + 1 ,
3xraw1 + 3xraw2
s3 = 162 + 2 ,
xraw1
0,
xraw2
0,
s1
0,
s2
0,
s3 0.
In this setting, what we need to develop is the following:
+
+
d1 1 if 1 0,
d2 2
U (1 , 2 ) = (0, 0) +
+
d
d
1 1 if 1 < 0,
2 2
if 2 0,
if 2 < 0.
First, we have already calculated (0, 0) = 126 with xraw1 = 36, xraw2 =
1
18 and s1 = 46. Next, let us try to find d
1 . To do that, we need ,
188
STOCHASTIC PROGRAMMING
which equals (36, 18, 46, 0, 0). We must then formulate (4.4), using 1
[30.91, 30.91]:
min{2xraw1 + 3xraw2 }
s.t. xraw1 + xraw2 + s1
2xraw1 + 6xraw2
s2
3xraw1 + 3xraw2
s3
xraw1
xraw2
s1
s2
s3
= 0,
= 30.91,
= 0,
36,
18,
46,
0,
0.
(2, 3, 0, 0, 0)y 1+
= 0.25.
30.91
Next, we solve the same problem, just with 30.91 replaced by 30.91.
This amounts to problem (4.5), and gives us the solution is y 1 =
(7.7275, 7.7275, 0, 0, 0)T, with a total cost of 7.7275. Hence, we get
d
1 =
(2, 3, 0, 0, 0)y 1
= 0.25.
30.91
The next step is to update according to (4.6) to find out how much is left
of the negative bounds on the variables. For xraw1 we get
2raw1 = 36 min{7.7275, 7.7275, 0} = 28.2725.
For xraw2 we get in a similar manner
2raw2 = 18 min{7.7275, 7.7275, 0} = 10.2725.
For the three other variables, 2i equals 1i . We can now turn to (4.4) for
random variable 2. The problem to solve is as follows, when we remember the
2 [23.18, 23.18].
min{2xraw1 + 3xraw2 }
s.t. xraw1 + xraw2 + s1
2xraw1 + 6xraw2
s2
3xraw1 + 3xraw2
s3
xraw1
xraw2
s1
s2
s3
= 0,
= 0,
= 23.18,
28.2725,
10.2725,
46,
0,
0.
RECOURSE PROBLEMS
189
The solution to this is y 2+ = (11.59, 3.863, 7.727, 0, 0)T, with a total cost
of 11.59. This gives us
d+
2 =
(2, 3, 0, 0, 0)y 2+
= 0.5.
23.18
Next, we solve the same problem, just with 23.18 replaced by 23.18.
This amounts to problem (4.5), and gives us the solution y 2 =
(11.59, 3.863, 7.727, 0, 0)T, with a total cost of 11.59. Hence we get
d
2 =
(2, 3, 0, 0, 0)y 2
= 0.5.
23.18
This finishes the calculation of the (piecewise) linear functions in the upper
bound. What we have now found is that
(
(
1
1
if 1 0,
2 if 2 0,
4 1
U (1 , 2 ) = 126 + 1
+ 12
if 1 < 0,
4 1
2 2 if 2 < 0,
which we easily see can be written as
U (1 , 2 ) = 126 + 41 1 + 21 2 .
In other words, as we already knew from calculating the Edmundson
Madansky upper bound and Jensen lower bound, the recourse function is
linear in this example. Let us, for illustration, integrate with respect to 1 .
Z 30.91
1
f (1 ) d1 = 41 E 1 = 0.
4 1
30.91
This is how it should be for linearity, the contribution from a random variable
over which U (and therefore ) is linear is zero. We should of course get the
same result with respect to 2 , and therefore the upper bound is 126, which
equals the Jensen lower bound.
Now that we have seen how things go in the linear case, let us try to see
how the results will be when linearity is not present. Hence assume that we
have now developed the necessary parameters d
i for (4.3). Let us integrate
with respect to the random variable i , assuming that i = [Ai , Bi ]:
Z
Ai
d
1 i f (i )di +
Bi
d+
1 i f (i )di
d
i E{i | i 0}P {i 0} + di E{i | i > 0}P {i > 0}.
This result should not come as much of a surprise. When one integrates a
linear function, one gets the function evaluated at the expected value of the
190
STOCHASTIC PROGRAMMING
From this, we also see, as we have already claimed a few times, that if d+
i = di
which
for all i, then the contribution to the upper bound from equals (E ),
equals the contribution to the Jensen lower bound.
Let us repeat why this is an upper bound. What we have done is to
distribute the bounds c on the variables among the different random variables.
They have been given separate pieces, which they will not share with others,
do not need the capacities themselves.
even if they, for a given realization of ,
This partitioning of the bounds among the random variables represents
a set of extra constraints on the problem, and hence, since we have a
minimization problem, the extra constraints yield an upper bound. If we
run out of capacities before all random variables have received their parts,
we must conclude that the upper bound is +. This cannot happen with
the EdmundsonMadansky upper bound. If () is feasible for all then the
EdmundsonMadansky bound is always finite. However, as for the Jensen and
EdmundsonMadansky bounds, the piecewise linear upper bound is also exact
when the recourse function turns out to be linear.
As mentioned before, we shall consider random upper bounds in Chapter 6,
in the setting of networks.
3.5
3.5.1
Approximations
Refinements of the bounds on the Wait-and-SeeSolution
Let us, also in this section, assume that x = x, and as before define
() = Q(
x, ). Using any of the above (or other) methods we can find bounds
on the recourse function. Assume we have calculated L and U such that
U.
L E()
We can now look at U L to see if we are happy with the result or not.
If we are not, there are basically two approaches that can be used. Either
we might resort to a better bounding procedure (probably more expensive in
terms of CPU time) or we might start using the old bounding methods on a
partition of the support, thereby making the bounds tighter. Since we know
only finitely many different methods, we shall eventually be left with only the
second option.
The set-up for such an approach to bounding will be as follows. First,
partition the support of the random variables into an arbitrary selection of
cellspossibly only one cell initially. We shall only consider cells that are
rectangles, so that they can be described by intervals on the individual random
variables. Figure 18 shows an example in two dimensions with five cells. Now,
apply the bounding procedures on each of the cells, and add up the results.
RECOURSE PROBLEMS
191
Cell2
Cell 1
Cell 3
Cell 4
Figure 18
Cell 5
Partitioning of cells.
192
STOCHASTIC PROGRAMMING
increased our work load. It is now harder to achieve a given error bound than
it was before the partition. And note, we shall never recover from the error,
in the sense that intelligent choices later on will not counteract this one bad
choice. Each time we make a bad partition, the workload from there onwards
basically doubles for the cell from which we started. Since we do not want to
unnecessarily increase the workload too often, we must be careful with how
we partition.
Now that we know that bad choices can increase the workload, what should
we do? The first observation is that chosing at random is not a good idea,
because, every now and then, we shall make bad choices. On the other hand,
it is clear that the partitioning procedure will have to be a heuristic. Hence,
we must make sure that we have a heuristic rule that we hope never makes
really bad choices.
By knowing our problem well, we may be able to order the random variables
according to their importance in the problem. Such an ordering could be used
as is, or in combination with other ideas. For some network problems, such as
the PERT problem (see Section 6.6), the network structure may present us
with such a list. If we can compile the list, it seems reasonable to ask, from a
modelling point of view, if the random variables last on the list should really
have been there in the first place. They do not appear to be important.
Over the years, some attempts to understand the problem of partitioning
have been made. Most of them are based on the assumption that the
EdmundsonMadansky bound was used to calculate the upper bound. The
reason is that the dual variables associated with the solution of the recourse
function tell us something about its curvature. With the Edmundson
Madansky bound, we solve the recourse problem at all extreme points of
the support, and thus get a reasonably good idea of what the function looks
like.
To introduce some formality, assume we have only one random variable
with support = [A, B]. When finding the EdmundsonMadansky upper
,
bound, we calculated (A) = Q(
x, A) and (B) = Q(
x, B), obtaining dual
solutions A and B . We know from duality that
(A) = ( A )T [h(A) T (A)
x],
(B) = ( B )T [h(B) T (B)
x].
We also know that, as long as q() q0 (which we are assuming in this
section), a that is dual feasible for one is dual feasible for all , since
does not enter the dual constraints. Hence, we know that
= (A) ( B )T [h(A) T (A)
x] 0
and
= (B) ( A )T [h(B) T (B)
x] 0.
RECOURSE PROBLEMS
193
194
STOCHASTIC PROGRAMMING
Figure 20
we have much curvature in the sense of the slopes of at the end point, but
still almost linearity (as in Figure 20), then the smaller of the two parameters
will be small. Hence the conclusion seems to be to calculate both and ,
pick the smaller of the two, and use that as a measure of nonlinearity.
Using and , we have a good measure of nonlinearity in one dimension.
However, with more than one dimension, we must again be careful. We can
certainly perform tests corresponding to those illustrated in Figures 19 and 20,
for one random variable at a time. But the question is what value should
we give the other random variables during the test. If we have k random
variables, and have the EdmundsonMadansky calculations available, there
are 2k1 different ways we can fix all but one variable and then compare dual
solutions. There are at least two possible approaches.
A first possibility is to calculate and for all neighbouring pairs of
extreme points in the support, and pick the one for which the minimum of
and is the largest. We then have a random variable for which is very
nonlinear, at least in parts of the support. We may, of course, have picked a
variable for which is linear most of the time, and this will certainly happen
once in a while, but the idea is tested and found sound.
An alternative, which tries to check average nonlinearity rather than
maximal nonlinearity, is to use all 2k1 pairs of neighbouring extreme points
involving variation in only one random variable, find the minimum of and
for each such pair, and then calculate the average of these minima. Then
we pick the random variable for which this average is maximal.
The number of pairs of neighbouring extreme points is fairly large. With k
random variables, we have k2k1 pairs to compare. Each comparison requires
the calculation of two inner products. We have earlier indicated that the
EdmundsonMadansky upper bound cannot be used for much more than 10
random variables. In such a case we must perform 5120 pairwise comparisons.
RECOURSE PROBLEMS
195
Tj j x
hj j T0 +
h() T ()
x = h0 +
j
(5.1)
characterizes how () = Q(
x, ) changes with respect to all random variables.
Since these calculations are performed at each extreme point of the support,
and each extreme point has a probability according to the Edmundson
Madansky calculations, we can interpret the vectors j as outcomes of
a random vector
that has 2k possible values and the corresponding
EdmundsonMadansky probabilities. If, for example, the random variable
i
has only one possible value, we know that () is linear in i . If
i has several
possible values, its variance will tell us quite a bit about how the slope varies
over the support. Since the random variables i may have very different units,
and the dual variables measure changes in the objective function per unit
196
STOCHASTIC PROGRAMMING
(5.2)
In other words, we perform all possible partitions, keep the best, and discard
the remaining information. If the upper bound we are using is expensive in
terms of CPU time, such an idea of look-ahead has two effects, which pull
in different directions. On one hand, the information we are throwing away
has cost a lot, and that seems like a waste. On the other hand, the very
fact that the upper bound is costly makes it crucial to have few cells in the
RECOURSE PROBLEMS
197
end. With a cheap (in terms of CPU time) upper bound, the approach seems
more reasonable, since checking all possibilities is not particularly costly, but,
even so, bad partitions will still double the work load locally. Numerical tests
indicate that this approach is very good even with the EdmundsonMadansky
upper bound, and the reason seems to be that it produces so few cells. Of
course, without EdmundsonMadansky, we cannot calculate , and
, so if
we do not like the look-ahead, we are in need of a new heuristic.
We have pointed out before that the piecewise linear upper bound can
obtain the value +. That happens if one of the problems (4.4) or (4.5)
becomes infeasible. If that takes place, the random variable being treated
when it happens is clearly a candidate for partitioning.
So far we have not really defined what constitutes a good partition. We shall
return to that after the next subsection. But first let us look at an example
illustrating the partitioning ideas.
Example 3.2 Consider the following function:
(1 , 2 ) = max{x + 2y}
s.t. x + y 6,
2x 3y 21,
3x + 7y 49,
x + 12y 120,
2x + 3y 45,
x
1 ,
y 2 .
Let us assume that 1 = [0, 20] and 2 = [0, 10]. For simplicity, we shall
assume uniform and independent distributions. We do that because the form
of the distribution is rather unimportant for the heuristics we are to explain.
The feasible set for the problem, except the upper bounds, is given in
Figure 21. The circled numbers refer to the numbering of the inequalities.
For all problems we have to solve (for varying values of ), it is reasonably
easy to read the solution directly from the figure.
Since we are maximizing, the Jensen bound is an upper bound, and the
EdmundsonMadansky bound a lower bound. We easily find the Jensen upper
bound from
(10, 5) = 20.
To find a lower bound, and also to calculate some of the information
needed to use the heuristics, we first calculate at all extreme points of the
support. Note that in what follows we view the upper bounds on the variables
as ordinary constraints. The results for the extreme-point calculations are
summed up in Table 1.
198
Figure 21
STOCHASTIC PROGRAMMING
(1 , 2 )
(0, 0) = (L, L)
0.000
0.000
0.000
(0, 0, 0, 0, 0, 1, 2)
(20, 0) = (U, L)
10.500
0.000
10.500
(0, 12 , 0, 0, 0, 0, 27 )
0.000
6.000
12.000
(2, 0, 0, 0, 0, 3, 0)
8.571
9.286
27.143
RECOURSE PROBLEMS
199
The first idea we wish to test is based on comparing pairs of extreme points,
to see how well the optimal dual solution (which is dual feasible for all righthand sides) at one extreme-point works at a neighbouring extreme point. We
use the indexing L and U to indicate Low and Up of the support.
LL:UL We first must test the optimal dual solution LL together with the
right-hand side bUL . We get
= ( LL )T bUL (U, L)
= (0, 0, 0, 0, 0, 1, 2)(6, 21, 49, 120, 45, 20, 0)T (U, L)
= 20 10.5 = 9.5.
We then do the opposite, to find
= ( UL )T bLL (L, L)
= (0, 12 , 0, 0, 0, 0, 72 )(6, 21, 49, 120, 45, 0, 0)T (L, L)
= 10.5 0 = 10.5.
The minimum is therefore 9.5 for the pair LL:UL.
LL:LU Following a similar logic, we get the following:
= ( LL )T bLU (L, U )
= (0, 0, 0, 0, 0, 1, 2)(6, 21, 49, 120, 45, 0, 10)T (L, U )
= 20 12 = 8,
= ( LU )T bLL (L, L)
= (2, 0, 0, 0, 0, 3, 0)(6, 21, 49, 120, 45, 0, 0)T (L, L)
= 12 0 = 12.
The minimal value for the pair LL:LU is therefore 8.
LU:UU For this pair we get the following:
= ( UU )T bLU (L, U )
= (0, 0, 0, 0.0476, 0.476, 0, 0)(6, 21, 49, 120, 45, 0, 10)T (L, U )
= 27.143 12 = 15.143
= ( LU )T bUU (L, L)
= (2, 0, 0, 0, 0, 3, 0)(6, 21, 49, 120, 45, 20, 10)T (U, U )
= 72 27.143 = 44.857.
The minimal value for the pair LU:UU is therefore 15.143.
UL:UU For the final pair the results are given by
= ( UU )T bUL (U, L)
= (0, 0, 0, 0.0476, 0.476, 0, 0)(6, 21, 49, 120, 45, 20, 0)T (U, L)
= 27.143 10.5 = 16.643,
= ( UL )T bUU (U, U )
= (0, 21 , 0, 0, 0, 0, 27 )(6, 21, 49, 120, 45, 20, 10)T (U, U )
= 46.5 27.143 = 18.357.
200
STOCHASTIC PROGRAMMING
RECOURSE PROBLEMS
Table 2
201
(5,5)
(15,5)
(10,10)
(20,5)
15
25
27.143
25
(10,7.5)
(10,2.5)
(0,5)
(10,0)
25
15
10
10
3.5.2
We have now investigated how to bound Q(x) for a fixed x. We have done
that by combining upper and lower bounding procedures with partitioning of
On the other hand, we have earlier discussed (exact) solution
the support of .
procedures, such as the L-shaped decomposition method (Section 3.2) and the
scenario aggregation (Section 2.6). These methods take a full event/scenario
tree as input and solve this (at least in principle) to optimality. We shall now
see how these methods can be combined.
The starting point is a set-up like Figure 18. We set up an initial partition
of the support, possibly containing only one cell. We then find all conditional
expectations (in the example there are five), and give each of them a
probability equal to that of being in their cell, and we view this as our
true distribution. The L-shaped method is then applied. Let i denote the
given that is contained in the ith cell. Then
conditional expectation of ,
the partition gives us the support { 1 , . . . }. We then solve
where
min cT x + L(x)
s.t. Ax = b,
x 0,
L(x) =
X
j=1
pj Q(x, j ),
(5.3)
202
STOCHASTIC PROGRAMMING
cx + U1 ( x )
cx + U 2 ( x )
cx + Q( x )
cx + L2 ( x )
cx + L1 ( x )
cx$ + U1 ( x$ )
cx$ + Q ( x$ )
cx$ + L1 ( x$ )
x$
Figure 22 Example illustrating the use of bounds in the L-shaped
decomposition method. An initial partition corresponds to the lower bounding function L1 (x) and the upper bounding function U1 (x). For all x we have
L1 (x) Q(x) U1 (x). We minimize cx + L1 (x) to obtain x
. We find the error
U1 (
x) L1 (
x), and we decide to refine the partition. This will cause L1 to be
replaced by L2 and U1 by U2 . Then the process can be repeated.
with pj being the probability of being in cell j. Let x be the optimal solution
to (5.3). Clearly if x is the optimal solution to the original problem then
cT x + L(
x) cT x + L(x) cT x + Q(x),
so that the optimal value we found by solving (5.3) is really a lower bound
on min cT x + Q(x). The first inequality follows from the observation that x
minimizes cT x + L(x). The second inequality holds because L(x) Q(x) for
all x (Jensens inequality). Next, we use some method to calculate U(
x), for
example the EdmundsonMadansky or piecewise linear upper bound. Note
that
x) cT x + U(
x),
cT x + Q(x) cT x + Q(
so cT x
+U(
x) is indeed an upper bound on cT x+Q(x). Here the first inequality
holds because x minimizes cT x + Q(x), and the second because, for all x,
Q(x) U(x).
We then have a solution x
and an error U(
x) L(
x). If we are not satisfied
with the precision, we refine the partition of the support, and repeat the use of
RECOURSE PROBLEMS
203
the L-shaped method. It is worth noting that the old optimality cuts generated
in the L-shaped method are still valid, but generally not tight. The reason is
that, with more cells, and hence a larger , the function L(x) is now closer to
Q(x). Feasibility cuts are still valid and tight. Figure 22 illustrates how the
approximating functions L(x) and U(x) change as the partition is refined.
In total, this gives us the procedure in Figure 23. The procedure refine()
will not be detailed, since there are so many options. We refer to our earlier
discussion of the subject in Section 3.5.1. Note that, for simplicity, we have
assumed that, after a partitioning, the procedure starts all over again in the
repeat loop. That is of course not needed, since we already have checked the
present x for feasibility. If we replace the set A by in the call to procedure
feascut, the procedure Bounding L-shaped must stay as it is. In many cases
this may be a useful change, since A might be very large. (In this case old
feasibility cuts might no longer be tight.)
3.5.3
We have now seen partitioning used in two different settings. In the first we
just wanted to bound a one-stage stochastic program, while in the second we
used it in combination with the L-shaped decomposition method. The major
difference is that in the latter case we solve a two-stage stochastic program
between each time we partition. Therefore, in contrast to the one-stage setting,
the same partition (more and more refined) is used over and over again.
In the two-stage setting a new question arises. How many partitions should
we make between each new call to the L-shaped decomposition method? If we
make only one, the overall CPU time will probably be very large because a
new LP (only slightly changed from last time) must be solved each time we
make a new cell. On the other hand, if we make many partitions per call to
L-shaped, we might partition extensively in an area where it later turns out
that partitioning is not needed (remember that x enters the right-hand side
of the second-stage constraints, moving the set of possible right-hand sides
around). We must therefore strike a balance between getting enough cells and
not getting them in the wrong places.
This brings us to the question of what is a good partitioning strategy.
It should clearly be one that minimizes CPU time for solving the problem
at hand. Tests indicate that for the one-stage setting, using the idea of the
variance of the (random) dual variables on page 195, is a good idea. It creates
quite a number of cells, but because it is cheap (given that we already use
the EdmundsonMadansky upper bound) it is quite good overall. But, in
the setting of the L-shaped decomposition method, this large number of cells
become something of a problem. We have to carry them along from iteration
to iteration, repeatedly finding upper and lower bounds on each of them. Here
it is much more important to have few cells for a given error level. And that
204
STOCHASTIC PROGRAMMING
:= {E };
K := 0, L := 0;
:= , LP(A, b, c, x
, feasible);
stop := not (feasible);
while not (stop) do begin
feascut(A, x,newcut);
if not (newcut) then begin
Find L(
x);
newcut := (L(
x) > 1 );
if newcut then begin
(* Create an optimality cutsee page 168 *)
L := L + 1;
T
Construct the cut L
x + L ;
end;
end;
if newcut then begin
master(K, L, x
, ,feasible);
stop := not (feasible);
end
else begin
Find U(
x);
stop := (U(
x) L(
x) 2 );
Figure 23 The L-shaped decomposition algorithm in a setting of approximations and bounds. The procedures that we refer to start on page 168, and the
set A was defined on page 162.
RECOURSE PROBLEMS
205
is best achieved by looking ahead using (5.2). Our general advice is therefore
that in the setting of two (or more) stages one should seek a strategy that
minimizes the final number of cells, and that it is worthwhile to pay quite a
lot per iteration to achieve this goal.
3.6
Simple Recourse
(6.1)
where
Q(x, ) = min{q +T y + + q T y | y + y = T x, y + 0, y 0}.
Hence we assume
W = (I, I),
T () T (constant),
h() ,
and in addition
q = q + + q 0.
In other words, we consider the case where only the right-hand side is
random, andPwe shall see that in this case, using our former presentation
h() = h0 + i hi i , we only need to know the marginal distributions of the
components hj () of h(). However, stochastic dependence or independence
of these components does not matter at all. This justifies the above setting
h() .
By linear programming duality, we have for the recourse function
Q(x, )
= min{q +T y + + q T y | y + y = T x, y + 0, y 0}
= max{( T x)T | q q + }.
(6.2)
Observe that our assumption q 0 is equivalent to solvability of the secondstage problem. Defining
:= T x,
the dual solution of (6.2) is obvious:
qi+ if i i > 0,
i =
qi if i i 0.
206
STOCHASTIC PROGRAMMING
Hence, with
i (i , i ) =
Q
we have
Q(x, ) =
(i i )qi+
(i i )qi
if i < i ,
if i i ,
i (i , i ) with = T x.
Q
i
Z
X Z
+
qi
(i i )P(d) qi
=
i
i >i
i i
(i i )P(d) .
The last expression shows that knowledge of the marginal distributions of the
is a
i is sufficient to evaluate the expected recourse. Moreover, EQ(x, )
P
m
1
=
so-called separable function in (1 , , m1 ), i.e. EQ(x, )
i=1 Qi (i ),
+
where, owing to q + q = q,
R
R
R
R
= qi+ (i i )P(d) (qi+ + qi ) i i (i i )P(d)
(6.3)
+
+
= qi i qi i q i i i (i i )P(d)
with i = Ei .
The reformulation (6.3) reveals the shape of the functions Qi (i ). Assume
that is bounded such that i < i i i, . Then we have
+
if i i ,
q qi+ i
i i
+
+
qi i qi i q i
(i i )P(d) if i < i < i , (6.4)
Qi (i ) =
i i
qi i + qi i
if i i ,
showing that for i < i and i > i the functions Qi (i ) are linear (see
Figure 24). In particular, we have
i (i , i ) if i i or i i .
Qi (i ) = Q
(6.5)
RECOURSE PROBLEMS
207
i (i , i ).
Simple recourse: supporting Qi (i ) by Q
Figure 24
i = E(i | i (i ,
i ]), i = E(i | i (
i , i ]).
Obviously relation (6.5) also applies analogously to the conditional
expectations
i (
Q1i (
i ) = E(Q
i , i ) | i (i ,
i ])
and
i (
Q2i (
i ) = E(Q
i , i ) | i (
i , i ]).
Therefore
1
2
i (
i (
i ) = Q
i , i ),
Q1i (
i ) = Q
i , i ), Q2i (
i (i , 1i , 2i ) := p1 Q
exact value Qi (
i ). With Q
i i (i , i ) + pi Qi (i , i ), the
resulting situation is demonstrated in Figure 25.
Assume now that for a partition of the intervals (i , i ] into subintervals
Ii := (i , i+1 ], = 0, , Ni 1, with i = i0 < i1 < < iNi = i ,
we have minimized the Jensen lower bound (see Section 3.4.1), letting pi =
208
STOCHASTIC PROGRAMMING
Figure 25
i (i , 1i , 2i ).
Simple recourse: supporting Qi (i ) by Q
P (i Ii ), i = E(i | i Ii ):
k N
i 1
i
h
X
X
i (i , i )
pi Q
minx, cT x +
i=1 =0
s.t. Ax
= b,
T x = 0,
x
0,
N
i 1
X
i (
i , i ),
pi Q
=0
N
i 1
X
=0
i (
i , i ).
pi Q
RECOURSE PROBLEMS
209
(c) If
i Ii for exactly one , with 0 < Ni , then there are two cases.
First, if i <
i < i+1 , partition Ii = (i , i+1 ] into
1
2
Ji
= (i ,
i ] and Ji
= (
i , i+1 ].
i (
i , i ) +
pi Q
i (
pi Q
i , i ),
=1
6=
where
2
X
), = 1, 2.
pi = P (i Ji
), i = E(i | i Ji
N
i 1
X
i (
i , i ).
pi Q
=0
3.7
This book deals almost exclusively with convex problems. The only exception
is this section, where we discuss, very briefly, some aspects of integer
programming. The main reason for doing so is that some solution procedures
for integer programming fit very well with some decomposition procedures
for (continuous) stochastic programming. Because of that we can achieve
two goals: we can explain some connections between stochastic and integer
programming, and we can combine the two subject areas. This allows us to
arrive at a method for stochastic integer programming. Note that talking
about stochastic and integer programming as two distinct areas is really
meaningless, since stochastic programs can contain integrality constraints, and
integer programs can be stochastic. But we still do it, with some hesitation,
210
STOCHASTIC PROGRAMMING
min cT x
s.t. Ax = b
(7.1)
s.t. Ax = b,
xj {aj , . . . , dj },
and
(7.2)
min cT x
s.t. Ax = b,
xj {dj + 1, . . . , bj }.
What we have done is to branch. We have replaced the original problem by
two similar problems that each investigate their part of the solution space.
The two problems are now put into a collection of waiting nodes. The term
waiting node is used because the branching can be seen as building up a
tree, where the original problem sits in the root and the new problems are
stored in child nodes. Waiting nodes are then leaves in the tree, waiting to be
analysed. Leaves can also be fathomed or bounded, as we shall see shortly.
We next continue to work with the problem in one of these waiting nodes.
We shall call this problem the present problem. When doing so, a number of
different situations can occur.
RECOURSE PROBLEMS
211
with the best-so-far objective value z (we initiate z at +). If the new
objective value is better, we keep x
and update z so that z = cT x. We then
fathom the present problem.
3. The present problem might have a nonintegral solution x
with cT x
z. In
this case the present problem cannot possibly contain an optimal integral
solution, and it is therefore dropped, or bounded. (This is the process that
gives half of the name of the method.)
4. The present problem has a solution x
that does not satisfy any of the above
criteria. If so, we branch as we did in (7.2), creating two child nodes. We
then add them to the tree, making them waiting nodes.
An example of an intermediate stage for a branch-and-bound tree can be
found in Figure 26. Three branchings have taken place, and we are left with
two fathomed, one bounded and one waiting node. The next step will now be
to branch on the waiting node.
Note that as branching proceeds, the interval over which we solve the
continuous version must eventually contain only one point. Therefore, sooner
212
STOCHASTIC PROGRAMMING
RECOURSE PROBLEMS
213
method for (continuous) stochastic programs. It must be noted that cuttingplane methods are hardly ever used in their pure form for solving integer
programs. They are usually combined with other methods. For the sake of
exposition, however, we shall biefly sketch some of the ideas.
When we solve the relaxed linear programming version of (7.1), we
have difficulties because we have increased the solution space. However,
all points that we have added are non-integral. In principle, it is possible
to add extra constraints to the linear programming relaxation to cut off
some of these noninteger solutions, namely those that are not convex
combinations of feasible integral points. These cuts will normally be added in
an iterative manner, very similarly to the way we added cuts in the L-shaped
decomposition method. In fact, the L-shaped decomposition method is known
as Benders decomposition in other areas of mathematical programming, and
its original goal was to solve (mixed) integer programming problems. However,
it was not cast in the way we are presenting cuts below.
So, in all its simplicity, a cutting-plane method will run through two major
steps. The first is to solve a relaxed linear program; the second is to evaluate
the solution, and if it is not integral, add cuts that cut away nonintegral
points (including the present solution). These cuts are then added to the
relaxed linear program, and the cycle is repeated. Cuts can be of different
types. Some come from straightforward arithmetic operations based on the
LP solution and the LP constraints. These are not necessarily very tight.
Others are based on structure. For a growing number of problems, knowledge
about some or all facets of the (integer) solution space is becoming available.
By a facet in this case, we understand the following. The solution space of the
relaxed linear program contains all integral feasible points, and none extra. If
we add a minimal number of new inequalities, such that no integral points are
cut off, and such that all extreme points of the new feasible set are integers,
then the intersection between a hyperplane representing such an inequality
and the new set of feasible solutions is called a facet. Facets are sometimes
added as they are found to be violated, and sometimes before the procedure
is started.
How does this relate to the L-shaped decomposition procedure? Let us be
a bit formal. If all costs in a recourse problem are zero, and we choose to use
the L-shaped decomposition method, there will be no optimality cuts, only
feasibility cuts. Such a stochastic linear program could be written as
min cT x
s.t.
Ax = b,
x 0,
W y() = h() T ()x, y() 0.
(7.3)
To use the L-shaped method to solve (7.3), we should begin solving the
214
STOCHASTIC PROGRAMMING
problem
min cT x
s.t. Ax = b,
x 0,
i.e. (7.3) without the last set of constraints added. Then, if the resulting x
makes the last set of constraints in (7.3) feasible for all , we are done. If not,
an implied feasibility cut is added.
An integer program, on the other hand, could be written as
min cT x
s.t. Ax = b,
(7.4)
A cutting-plane procedure for (7.4) will solve the problem with the constraints
a x b so that the integrality requirement is relaxed. Then, if the resulting
x
is integral in all its elements, we are done. If not, an integrality cut is added.
This cut will, if possible, be a facet of the solution space with all extreme points
integer.
By now, realizing that integrality cuts are also feasibility cuts, the
connection should be clear. Integrality cuts in integer programming are just
a special type of feasibility cuts.
For the bounding version of the L-shaped decomposition method we
combined bounding (with partitioning of the support) with cuts. In the same
way, we can combine branching and cuts in the branch-and-cut algorithm for
integer programs (still deterministic). The idea is fairly simple (but requires
a lot of details to be efficient). For all waiting nodes, before or after we
have solved the relaxed LP, we add an appropriate number of cuts, before
we (re)solve the LP. How many cuts we add will often depend on how well we
know the facets of the (integer) solution space. This new LP will have a smaller
(continuous) solution space, and is therefore likely to give a better result
either in terms of a nonintegral optimal solution with a higher objective value
(increasing the probability of bounding), or in terms of an integer solution.
So, finally, we have reached the ultimate question. How can all of this be
used to solve integer stochastic programs? Given the simplification that we
have integrality only in the first-stage problem, the procedure is given in
Figure 27. In the procedure we operate with a set of waiting nodes P. These
are nodes in the cut-and-branch tree that are not yet fathomed or bounded.
The procedure feascut was presented earlier in Figure 9, whereas the new
procedure intcut is outlined in Figure 28. Let us try to compare the L-shaped
integer programming method with the continuous one presented in Figure 10.
RECOURSE PROBLEMS
215
master(K, L, x
, ,feasible);
fathom := not (feasible) or (cT x + > z);
if not (fathom) then begin
feascut(A, x
,newcut);
if not (newcut) then intcut(
x, newcut);
if not (newcut) then begin
if x integral then begin
Find Q(
x);
z := min{z, cT x
+ Q(
x)};
fathom := ( Q(
x));
if not (fathom) then begin
L := L + 1;
T
Create the cut L
x + L ;
end;
end
else begin
Use branching to create 2 new problems P1 and P2 ;
Let P := P {P1 , P2 };
end;
end;
end;
until fathom;
end; ( while )
end;
216
STOCHASTIC PROGRAMMING
procedure intcut(
x:real; newcut:boolean);
begin
if violated integrality constraints found then begin
K := K + 1;
T
Create a cut K
x + K ;
newcut := true;
end
else newcut := false;
end;
Figure 28
3.7.1
Initialization
Feasibility Cuts
Both approaches operate with feasibility cuts. In the continuous case these
are all implied constraints, needed to make the second-stage problem feasible
for all possible realizations of the random variables. For the integer case,
we still use these, and we add any cuts that are commonly used in branchand-cut procedures in integer programming, preferably facets of the solution
space with integral extreme points. To reflect all possible kinds of such cuts
(some concerning second-stage feasibility, some integrality), we use a call
to procedure feascut plus the new procedure intcut. Typically, implied
constraints are based on an x
that is nonintegral, and therefore infeasible.
In the end, though, integrality will be there, based on the branching part of
the algorithm, and then the cuts will indeed be based on a feasible (integral)
solution.
RECOURSE PROBLEMS
3.7.3
217
Optimality Cuts
The creation of optimality cuts is the same in both cases, since in the integer
case we create such cuts only for feasible (integer) solutions.
3.7.4
Stopping Criteria
The stopping criteria are basically the same, except that what halts the whole
procedure in the continuous case just fathoms a node in the integer case.
3.8
Stochastic Decomposition
218
STOCHASTIC PROGRAMMING
where
Q(x) =
Q(x, )f () d
RECOURSE PROBLEMS
219
illustrated in Figure 29. There we see the situation for the third sample point.
We first make an exact optimization for the new sample point, 3 , obtaining
a true optimal dual solution 3 . This is represented in Figure 29 by the
supporting hyperplane through 3 , Q(x3 , 3 ). Afterwards, we solve inexactly
for the two old sample points. There are three bases available for the inexact
optimization. These bases are represented by the three thin lines. As we see,
neither of the two old sample points find their true optimal basis.
= {1 , 2 , 3 }, with each outcome having the same probability 1 , we
If ()
3
could now calculate a lower bound on Q(x3 ) by computing
3
L(x3 ) =
1X 3 T
( ) (j T (j )x3 ).
3 j=1 j
k
1X k T
( ) (j T (j )xk ).
k j=1 j
220
STOCHASTIC PROGRAMMING
Remember, however, that this is not the true value of Q(xk )just an estimate.
In other words, we have now observed two major differences from the
exact L-shaped method (page 171). First, we operate on a sample rather
than on all outcomes, and, secondly, what we calculate is an estimate of
a lower bound on Q(xk ) rather than Q(xk ) itself. Hence, since we have a
lower bound, what we are doing is more similar to what we did when we
used the L-shaped decomposition method within approximation schemes, (see
page 204). However, the reason for the lower bound is somewhat different. In
the bounding version of L-shaped, the lower bound was based on conditional
expectations, whereas here it is based on inexact optimization. On the other
hand, we have earlier pointed out that the Jensen lower bound has three
different interpretations, one of which is to use conditional expectations (as
in procedure Bounding L-shaped) and another that is inexact optimization
(as in SD). So what is actually the principal difference?
For the three interpretations of the Jensen bound to be equivalent, the
limited set of bases must come from solving the recourse problem in the points
of conditional expectations. That is not the case in SD. Here the points are
random (according to the sample j ). Using a limited number of bases still
produces a lower bound, but not the Jensen lower bound.
Therefore SD and the bounding version of L-shaped are really quite
different. The reason for the lower bound is different, and the objective value
in SD is only a lower bound in terms of expectations (due to sampling). One
method picks the limited number of points in a very careful way, the other
at random. One method has an exact stopping criteria (error bound), the
other has a statistically based stopping rule. So, more than anything else,
they are alternative approaches. If one cannot solve the exact problem, one
either resorts to bounds or to sample-based methods.
In the L-shaped method we demonstrated how to find optimality cuts. We
can now find a cut corresponding to xk (which is not binding and might even
not be a lower bound, although it represents an estimate of a lower bound).
As for the L-shaped method, we shall replace Q(x) in the objective by , and
then add constraints. The cut generated in iteration k is given by
k
1X k T
( ) [j T (j )x] = kk + (kk )T x.
k j=1 j
The double set of indices on and indicate that the cut was generated in
iteration k (the subscript) and that it has been updated in iteration k (the
superscript).
In contrast to the L-shaped decomposition method, we must now also look
at the old cuts. The reason is that, although we expect these cuts to be loose
(since we use inexact optimization), they may in fact be far too tight (since
they are based on a sample). Also, being old, they are based on a sample that
RECOURSE PROBLEMS
221
is smaller than the present one, and hence, probably not too good. We shall
therefore want to phase them out, but not by throwing them away. Assume
that there exists a lower bound on Q(x, ) such that Q(x, ) Q for all x and
. Then the old cuts
jk1 + (jk1 )T x for j = 1, . . . , k 1
will be replaced by
k 1 k1
1
[j + (jk1 )T x] + Q
k
k
= kj + (jk )T x for j = 1, . . . , k 1.
(8.1)
min cT x +
s.t.
Ax = b,
(8.2)
k T
k
(j ) x + j for j = 1, . . . , k,
x 0,
yielding the next iterate xk+1 . Note that, since we assume relatively complete
recourse, there are no feasibility cuts. The above format is the one to be used
for computations. To understand the method better, however, let us show an
alternative version of (8.2) that is less useful computationally but is more
illustrative (see Figure 30 for an illustration):
min k (x) cT x + maxj{1,,k} [kj + (jk )T x]
s.t. Ax = b, x 0.
This defines the function k (x) and shows more clearly than (8.2) that we do
indeed have a function in x that we are minimizing. Also k (x) is the present
estimate of (x) = cT x + Q(x).
The above set-up has one major shortcoming: it might be difficult to
extract a converging subsequence from the sequence xk . A number of changes
therefore have to be made. These make the algorithm look more messy, but
the principles are not lost. To make it simpler (empirically) to extract a
converging subsequence, we shall introduce a sequence of incumbent solutions
xk . Following the incumbent, there will be an index ik that shows in which
iteration the current xk was found.
We initiate the method by setting the counter k := 0, choose an r (0, 1)
Thus we solve
(to be explained later), and let 0 := E .
min cT x + q0T y
s.t. Ax = b,
W y = 0 T (0 )x, x, y 0,
222
STOCHASTIC PROGRAMMING
()
Figure 30
k
1X
(j )T [j T (j )x] = kk + (kk )T x.
k j=1
RECOURSE PROBLEMS
223
In addition, we need to update the incumbent cut ik . This is done just the
way we found cut k. We solve
max{ T [j T (j )xk1 ] | V }
to obtain j , and replace the old cut ik by
k
1X
( j )T [j T (j )x] = kik1 + (ikk1 )T x.
k j=1
k 1 k1
1
[j +(jk1 )T x]+ Q = kj +(jk )T x for j = 1, . . . , k1, j 6= ik1 .
k
k
224
STOCHASTIC PROGRAMMING
cx + Q( x)
f2 ( x2 )
f2 ( x )
x3
x2
cx + Q( x)
f3 ( x 2 )
f3 ( x )
f3 ( x 3 )
b
Figure 31
x3
x2
c
a
RECOURSE PROBLEMS
3.9
225
We are still dealing with recourse problems stated in the somewhat more
general form
Z
Q(x, ) P(d) .
(9.1)
min f (x) +
xX
This formulation also includes the stochastic linear program with recourse,
letting
X = {x | Ax = b, x 0},
f (x) = cT x,
Q(x, ) = min{(q())T y | W y = h() T ()x, y 0}.
To describe the so-called stochastic quasi-gradient method (SQG), we
simplify the notation by defining
F (x, ) := f (x) + Q(x, )
and hence considering the problem
min EF (x, ),
xX
(9.2)
(9.3 i)
(9.3 ii)
Observe that for stochastic linear programs with recourse the assumptions (9.3) are satisfied if, for instance,
we have relatively complete recourse, the recourse function Q(x, ) is a.s.
finite x, and the components of are square-integrable (i.e. their second
moments exist);
X = {x | Ax = b, x 0} is bounded.
Then, starting from some feasible point x0 X, we may define an iterative
process by
x+1 = X (x v ),
(9.4)
where v is a random vector, 0 is some step size and X is the projection
onto X, i.e. for y IRn , with k k the Euclidean norm,
X (y) = arg min ky xk.
xX
(9.5)
226
STOCHASTIC PROGRAMMING
(9.6)
has to hold x, z (see Figure 27 in Chapter 1). But, even if the convex function
is not differentiable at some point z, e.g. if it has a kink there, it is shown
in convex analysis that there exists at least one vector g such that
(x z)T g (x) (z) x.
(9.7)
Any vector g satisfying (9.7) is called a subgradient of at z, and the set of all
vectors satisfying (9.7) is called the subdifferential of at z and is denoted by
(z). If is differentiable at z then (z) = {(z)}; otherwise, i.e. in the
nondifferentiable case, (z) may contain more than one element as shown
for instance in Figure 32. Furthermore, in view of (9.7), it is easily seen that
(z) is a convex set.
If is convex and g 6= 0 is a subgradient of at z then, by (9.7) for > 0,
it follows that
(z + g) (z) + g T (x z)
= (z) + g T (g)
= (z) + kgk2
> (z).
Hence any subgradient, g , such that g 6= 0 is a direction of ascent,
although not necessarily the direction of steepest ascent as the gradient would
be if were differentiable in z. However, in contrast to the differentiable case,
g need not be a direction of strict descent for in z. Consider for example
the convex function in two variables
(u, v) := |u| + |v|.
T
u
1
g=
v
3
v3
1
=u+v3
|u| + |v| || |3|,
RECOURSE PROBLEMS
Figure 32
227
228
Figure 33
STOCHASTIC PROGRAMMING
(9.8)
(9.9)
RECOURSE PROBLEMS
229
that
E F (x , )
E(v | x0 , , x )T (x x ) + ,
0 EF (x , )
where
= bT (x x ).
(9.10)
(9.11)
N
1 X
w , w x F (x , ),
N =1
(9.13)
would yield b = 0,
where the or are independent samples of ,
= 0 , provided that the operations of integration and differentiation may
be exchanged, as asserted for example by Proposition 1.2 for the differentiable
case.
Finally, assume that for the step size together with v and we have
0,
= ,
=0
E( | | + 2 kv k2 ) < .
(9.14)
=0
With the choices (9.12) or (9.13), for uniformly bounded v this assumption
could obviously be replaced by the step size assumption
0,
=0
= ,
2 < .
(9.15)
=0
With these prerequisites, it can be shown that, under the assumptions (9.3),
(9.8) and (9.14) (or (9.3), (9.12) or (9.13), and (9.15)) the iterative
method (9.4) converges almost surely (a.s.) to a solution of (9.2).
3.10
230
STOCHASTIC PROGRAMMING
(10.1)
What we observe here is that the part that varies, h() T ()x, appears only
in the objective. As a consequence, if (10.1) is feasible for one value of x and
, it is feasible for all values of x and . Of course, the problem might be
unbounded (meaning that the primal is infeasible) for some x and . For the
moment we shall assume that that does not occur. (But if it does, it simply
shows that we need a feasibility cut, not an optimality cut).
In a given iteration of the L-shaped decomposition method, x will be fixed,
and all we are interested in is the selection of right-hand sides resulting from
all possible values of . Let us therefore simplify notation, and assume that
we have a selection of right-hand sides B, so that, instead of (10.1), we solve
max{ T h | T W q0T }
(10.2)
for all h B. Assume (10.2) is solved for one value of h B with optimal
basis B. Then B is a dual feasible basis for all h B. Therefore, for all
h B for which B 1 h 0, the basis B is also primal feasible, and hence
optimal. The idea behind bunching is simply to start out with some h B,
find the optimal basis B, and then check B 1 h for all other h B. Whenever
B 1 h 0, we have found the optimal solution for that h, and these righthand sides are bunched together. We then remove these right-hand sides from
B, and repeat the process, of course with a warm start from B, using the dual
simplex method, for one of the remaining right-hand sides in B. We continue
until all right-hand sides are bunched. That gives us all information needed
to find Q and the necessary optimality cut.
This procedure has been followed up in several directions. An important
one is called trickling down. Again, we start out with B, and we solve (10.2)
for some right-hand side to obtain a dual feasible basis B. This basis is stored
in the root of a search tree that we are about to make. Now, for one h B at
a time do the following. Start in the root of the tree, and calculate B 1 h. If
B 1 h 0, register that this right-hand side belongs to the bunch associated
RECOURSE PROBLEMS
231
B
1
B
2
B
8
B
5
B
Figure 34
232
STOCHASTIC PROGRAMMING
The discussion of trickling down etc. was carried out in a setting of right-hand
side randomness only. However, as with many other problems we have faced
in this book, pure objective function randomness can be changed into pure
right-hand side randomness by using linear programming duality. Therefore
the discussions of right-hand side randomness apply to objective function
RECOURSE PROBLEMS
233
randomness as well.
Then, one may ask what happens if there is randomness in both the
objective and the right-hand side. Trickling down cannot be performed the
way we have outlined it in that case. This is because a basis that was optimal
for one will, in general, be neither primal nor dual feasible for some other .
On the other hand, the basis may be good, not far from the optimal one. Hence
warm starts based on an old basis, performing a combination of primal and
dual simplex steps, will almost surely be better than solving the individual
LPs from scratch.
3.11
Bibliographical Notes
Benders [1] decomposition is the basis for all decomposition methods in this
chapter. In stochastic programming, as we have seen, it is more common to
refer to Benders decomposition as the L-shaped decomposition method. That
approach is outlined in detail in Van Slyke and Wets [63]. An implementation
of the L-shaped decomposition method, called MSLiP, is presented in
Gassmann [31]. It solves multistage problems based on nested decomposition.
Alternative computational methods are also discussed in Kall [44].
The regularized decomposition method has been implemented under the
name QDECOM. For further details on the method and QDECOM, in
particular for a special technique to solve the master (3.6), we refer to the
original publication of Ruszczy
nski [61]; the presentation in this chapter is
close to the description in his recent paper [62].
Some attempts have also been made to use interior point methods. As
examples consider Birge and Qi [7], Birge and Holmes [6], Mulvey and
Ruszczy
nski [60] and Lustig, Mulvey and Carpenter [55]. The latter two
combine interior point methods with parallel processing.
Parallel techniques have been tried by others as well; see e.g. Berland [2]
and Jessup, Yang and Zenios [42]. We shall mention some others in Chapter 6.
The idea of combining branch-and-cut from integer programming with
primal decomposition in stochastic programming was developed by Laporte
and Louveaux [53]. Although the method is set in a strict setting of
integrality only in the first stage, it can be expanded to cover (via a
reformulation) multistage problems that possess the so-called block-separable
recourse property, see Louveaux [54] for details.
Stochastic quasi-gradient methods were developed by Ermoliev [20, 21],
and implemented by, among others, Gaivoronski [27, 28]. Besides stochastic
quasi-gradients several other possibilities for constructing stochastic descent
directions have been investigated, e.g. in Marti [57] and in Marti and
Fuchs [58, 59].
The Jensen lower bound was developed in 1906 [41]. The Edmundson
234
STOCHASTIC PROGRAMMING
RECOURSE PROBLEMS
235
Exercises
1. The second-stage constraints of a two-stage problem look as follows:
1
3 1 0
6
5 1 0
y=
+
x
2 1
2 1
4
0
2 4
y0
where is a random variable with support = [0, 1]. Write down the LP
(both primal and dual formulation) needed to check if a given x produces
a feasible second-stage problem. Do it in such a way that if the problem
is not feasible, you obtain an inequality in x that cuts off the given x. If
you have access to an LP code, perform the computations, and find the
inequality explicitly for x
= (1, 1, 1)T .
2. Look back at problem (4.1) we used to illustrate the bounds. Add one extra
constraint, namely
xraw1 40.
(a)
(b)
(c)
(d)
Find the Jensen lower bound after this constraint has been added.
Find the EdmundsonMadansky upper bound.
Find the piecewise linear upper bound.
Try to find a good variable for partitioning.
236
STOCHASTIC PROGRAMMING
5. Show that for a convex function and any arbitrary z the subdifferential
(z) is a convex set. [Hint: For any subgradient (9.7) has to hold.]
6. Assume that you are faced with a large number of linear programs that
you need to solve. They represent all recourse problems in a two-stage
stochastic program. There is randomness in both the objective function
and the right-hand side, but the random variables affecting the objective
are different from, and independent of, the random variables affecting the
right-hand side.
(a) Argue why (or why not) it is a good idea to use some version of
bunching or trickling down to solve the linear programs.
(b) Given that you must use bunching or trickling down in some version,
how would you organize the computations?
7. First consider the following integer programming problem:

   min_x {c^T x | Ax ≥ h, x_i ∈ {0, …, b_i} ∀i}.

   (a) Assume that you solve the integer program with branch-and-bound.
   Your first step is then to solve the integer program above, but with
   x_i ∈ {0, …, b_i} ∀i replaced by 0 ≤ x ≤ b. Assume that you get x̂.
   Explain why x̂ can be a good partitioning point if you wanted to find
   Q(x̂) by repeatedly partitioning the support, and finding bounds on
   each cell. [Hint: It may help to draw a little picture.]
   (b) We have earlier referred to Figure 18, stating that it can be seen
   as both the partitioning of the support for the stochastic program,
   and partitioning the solution space for the integer program. Will the
   number of cells be largest for the integer or the stochastic program
   above? Note that there is not necessarily a clear answer here, but you
   should be able to make arguments on the subject. Question (a) may be
   of some help.
8. Look back at Figure 17. There we replaced one distribution by two others:
   one yielding an upper bound, and one a lower bound. The possible values
   for these two new distributions were not the same. How would you use the
   ideas of Jensen and Edmundson–Madansky to achieve, as far as possible,
   the same points? You can assume that the distribution is bounded. [Hint:
   The Edmundson–Madansky distribution will have two more points than
   the Jensen distribution.]
References
[1] Benders J. F. (1962) Partitioning procedures for solving mixed-variables programming problems. Numer. Math. 4: 238–252.
[2] Berland N. J. (1993) Stochastic optimization and parallel processing. PhD thesis, Department of Informatics, University of Bergen.
[3] Berland N. J. and Wallace S. W. (1993) Partitioning of the support to tighten bounds on stochastic PERT problems. Working paper, Department of Managerial Economics and Operations Research, Norwegian Institute of Technology, Trondheim.
[4] Berland N. J. and Wallace S. W. (1993) Partitioning the support to tighten bounds on stochastic linear programs. Working paper, Department of Managerial Economics and Operations Research, Norwegian Institute of Technology, Trondheim.
[5] Birge J. R. and Dulá J. H. (1991) Bounding separable recourse functions with limited distribution information. Ann. Oper. Res. 30: 277–298.
[6] Birge J. R. and Holmes D. (1992) Efficient solution of two-stage stochastic linear programs using interior point methods. Comp. Opt. Appl. 1: 245–276.
[7] Birge J. R. and Qi L. (1988) Computing block-angular Karmarkar projections with applications to stochastic programming. Management Sci. pages 1472–1479.
[8] Birge J. R. and Wallace S. W. (1988) A separable piecewise linear upper bound for stochastic linear programs. SIAM J. Control and Optimization 26: 725–739.
[9] Birge J. R. and Wets R. J.-B. (1986) Designing approximation schemes for stochastic optimization problems, in particular for stochastic programs with recourse. Math. Prog. Study 27: 54–102.
[10] Birge J. R. and Wets R. J.-B. (1987) Computing bounds for stochastic programming problems by means of a generalized moment problem. Math. Oper. Res. 12: 149–162.
[11] Birge J. R. and Wets R. J.-B. (1989) Sublinear upper bounds for stochastic programs with recourse. Math. Prog. 43: 131–149.
[12] Dulá J. H. (1987) An upper bound on the expectation of sublinear functions of multivariate random variables. Preprint, CORE.
[13] Dulá J. H. (1992) An upper bound on the expectation of simplicial functions of multivariate random variables. Math. Prog. 55: 69–80.
[14] Dupačová J. (1976) Minimax stochastic programs with nonconvex nonseparable penalty functions. In Prékopa A. (ed) Progress in Operations Research, pages 303–316. North-Holland, Amsterdam.
[15] Dupačová J. (1980) Minimax stochastic programs with nonseparable penalties. In Iracki K., Malanowski K., and Walukiewicz S. (eds) Optimization Techniques, Part I, volume 22 of Lecture Notes in Contr. Inf. Sci., pages 157–163. Springer-Verlag, Berlin.
Probabilistic Constraints
As we have seen in Sections 1.5 and 1.6, at least under appropriate assumptions, chance-constrained problems such as (4.21), or particularly (4.23), as well as recourse problems such as (4.11), or particularly (4.16) (all from Chapter 1), appear as ordinary convex smooth mathematical programming problems. This might suggest that these problems may be solved using known nonlinear programming methods. However, this viewpoint disregards the fact that in the direct application of those methods to problems like

   min_{x∈X} E Q(x, ξ̃),

where

   Q(x, ξ) = min{q^T y | W y ≥ h(ξ) − T(ξ)x, y ∈ Y},

we had repeatedly to obtain gradients and evaluations for functions like

   P({ξ | T(ξ)x ≥ h(ξ)})

or E Q(x, ξ̃).
4.1 Joint Chance Constrained Problems

Consider the problem

   min c^T x
   s.t. P({ξ | T x ≥ ξ}) ≥ α,
        Dx = d,
        x ≥ 0.        (1.1)

For this problem we know from Propositions 1.5–1.7 in Section 1.6 that if the distribution function F is quasi-concave then the feasible set B(α) is a closed convex set.
Under the assumption that ξ̃ has a (multivariate) normal distribution, we know that F is even log-concave. We therefore have a smooth convex program. For this particular case there have been attempts to adapt penalty and cutting-plane methods to solve (1.1). Further, variants of the reduced gradient method as sketched in Section 1.8.2 have been designed.

These approaches all attempt to avoid the exact numerical integration associated with the evaluation of F(Tx) = P({ξ | Tx ≥ ξ}) and its gradient ∇_x F(Tx) by relaxing the probabilistic constraint

   P({ξ | Tx ≥ ξ}) ≥ α.

To see how this may be realized, let us briefly sketch one iteration of the reduced gradient method's variant implemented in PROCON, a computer program for minimizing a function under PRObabilistic CONstraints.

With the notation

   G(x) := P({ξ | Tx ≥ ξ}),

let x̂ be feasible in

   min c^T x
   s.t. G(x) ≥ α,
        Dx = d,
        x ≥ 0.        (1.2)
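To see concretely what the avoided work looks like, the sketch below (assumed data; this is not PROCON) evaluates G(x) = F(Tx) for a multivariate normal ξ̃ with SciPy and approximates ∇G by finite differences:

import numpy as np
from scipy.stats import multivariate_normal

T = np.array([[1.0, 2.0],
              [3.0, 1.0]])
mu = np.array([1.0, 1.0])                 # assumed mean of xi
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])            # assumed covariance of xi

def G(x):
    # G(x) = P(xi <= T x) = F(T x), a multivariate normal cdf
    return multivariate_normal(mean=mu, cov=Sigma).cdf(T @ x)

def grad_G(x, eps=1e-5):
    # crude forward differences; every entry costs one more cdf evaluation,
    # which is exactly why such evaluations are worth avoiding
    g0 = G(x)
    return np.array([(G(x + eps * e) - g0) / eps for e in np.eye(x.size)])

x = np.array([1.0, 1.0])
print(G(x), grad_G(x))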
Given x̂, partition it as x̂^T = (ŷ^T, ẑ^T) into basic variables ŷ (with basis B) and nonbasic variables ẑ (with matrix N). After appropriate transformations of the original direction-finding problem (1.3), a linear program of the following form, equivalent to (1.4), is solved:

   max θ
   s.t. −r^T u ≥ θ,  u_j ≥ 0 if ẑ_j = 0,  ‖u‖ ≤ 1,
        s^T v ≥ θ if G(x̂) ≤ α + ε,  v_j ≥ 0 if ẑ_j = 0,  ‖v‖ ≤ 1,        (1.5)

where

   r^T = ∇_z f(x̂)^T − ∇_y f(x̂)^T B^{-1} N,
   s^T = ∇_z G(x̂)^T − ∇_y G(x̂)^T B^{-1} N

are the reduced gradients of the objective and of the probabilistic constraint function. Problem (1.5), and hence (1.4), is always solvable owing to its nonempty and bounded feasible set. Depending on the obtained solution (θ*, u*^T, v*^T) the method proceeds as follows.

Case 1 When θ* = 0, ε is replaced by 0 and (1.5) is solved again. If θ* = 0 again, the feasible solution x̂^T = (ŷ^T, ẑ^T) is obviously optimal. Otherwise the steps of case 2 below are carried out, starting with the original ε > 0.

Case 2 When 0 < θ*, the following cycle is entered:
Step 1 Set λ := 0.5.
4.2 Separate Chance Constraints

Let us now consider stochastic linear programs with separate (or single) chance constraints as introduced at the end of Section 1.4. Using the formulation given there, we are dealing with the problem

   min c^T x
   s.t. P({ξ | T_i x ≥ h_i(ξ)}) ≥ α_i, i = 1, …, m,
        x ≥ 0,        (2.1)

for the special case where T_i(ξ) ≡ T_i, i.e. where only the right-hand side h_i(ξ) is random.
Our feasible set may be rewritten in terms of the random variable ζ_i(x) := h_i(ξ̃) − T_i x as B_i(α_i) = {x | P(ζ_i(x) ≤ 0) ≥ α_i}. From probability theory we know that, with m_i(x) the expectation and σ_i(x) the standard deviation of ζ_i(x),

   B_i(α_i) = {x | P( (ζ_i(x) − m_i(x))/σ_i(x) ≤ −m_i(x)/σ_i(x) ) ≥ α_i}.

Hence, Φ denoting the standard normal distribution function,

   B_i(α_i) = {x | −m_i(x)/σ_i(x) ≥ Φ^{-1}(α_i)}
            = {x | m_i(x)/σ_i(x) ≤ −Φ^{-1}(α_i)}
            = {x | m_i(x)/σ_i(x) ≤ Φ^{-1}(1 − α_i)}
            = {x | Φ^{-1}(1 − α_i) σ_i(x) − m_i(x) ≥ 0}.

Here m_i(x) is linear affine in x and σ_i(x) is convex in x. Therefore the left-hand side of the constraint

   Φ^{-1}(1 − α_i) σ_i(x) − m_i(x) ≥ 0

is concave iff Φ^{-1}(1 − α_i) ≤ 0, which is exactly the case iff α_i ≥ 0.5. Hence we have, under the assumption of normal distributions and α_i ≥ 0.5, instead of (2.1) a deterministic convex program with constraints of the type

   Φ^{-1}(1 − α_i) σ_i(x) − m_i(x) ≥ 0,

which can be solved with standard tools of nonlinear programming.
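As a small illustration (assumed data, our sketch), the deterministic constraint can be evaluated directly with the standard normal quantile function:

import numpy as np
from scipy.stats import norm

T_i = np.array([2.0, 1.0])        # assumed row T_i
mean_h, std_h = 10.0, 2.0         # assumed moments of h_i(xi)
alpha_i = 0.95                    # note alpha_i >= 0.5

def chance_constraint_ok(x):
    # P(T_i x >= h_i(xi)) >= alpha_i  iff  m(x) + Phi^{-1}(alpha_i)*sigma <= 0,
    # with m(x) = E h_i - T_i x; this is the constraint derived above
    return mean_h - T_i @ x + norm.ppf(alpha_i) * std_h <= 0.0

print(chance_constraint_ok(np.array([5.0, 4.0])))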
4.3 Bounding Distribution Functions
The binomial moments of a random variable ξ̃ with values in {0, 1, …, n} are defined as

   S_{k,n} := E C(ξ̃, k) = Σ_{i=0}^n C(i, k) P({ω | ξ̃(ω) = i}),  k = 0, 1, …, n,   (3.1)

where C(i, k) := i!/(k!(i − k)!) denotes the binomial coefficient.
Since C(i, 0) = 1 for all i, we have in particular

   S_{0,n} = 1.        (3.2)

Bounds on P(ξ̃ ≥ 1), given the first two binomial moments, result from the linear programs

   min{ Σ_{i=1}^n v_i | Σ_{i=1}^n i v_i = S_{1,n}, Σ_{i=1}^n C(i, 2) v_i = S_{2,n}, v ≥ 0 }   (3.3)

and

   max{ Σ_{i=1}^n v_i | Σ_{i=1}^n i v_i = S_{1,n}, Σ_{i=1}^n C(i, 2) v_i = S_{2,n}, v ≥ 0 }.   (3.4)
These linear programs are feasible and bounded, and therefore solvable. So there exist optimal feasible 2×2 bases B. Consider an arbitrary 2×2 matrix of the form

   B = ( i        i+r
         C(i,2)   C(i+r,2) ),

for which

   B^{-1} = ( (i+r−1)/(ir)       −2/(ir)
              −(i−1)/((i+r)r)    2/((i+r)r) ).

For N_j = (j, C(j,2))^T we get

   e^T B^{-1} N_j = j (2i + r − j) / (i(i+r)).        (3.5)

(a) If r > 1 then for any j with i < j < i + r we get from (3.5)

   e^T B^{-1} N_j = j(2i + r − j)/(i(i+r)) ≥ (i(i+r) + r − 1)/(i(i+r)) > 1,

so that the optimality condition for (3.3) is not satisfied for r > 1, showing that r = 1 is necessary.

Now let r = 1. Then for j < i we have, according to (3.5),

   e^T B^{-1} N_j = j(2i+1−j)/(i(i+1)) = (j + i² − (j−i)²)/(i(i+1)) < (i(i+1) − (j−i)²)/(i(i+1)) < 1,

whereas for j > i + 1

   e^T B^{-1} N_j = j(2i+1−j)/(i(i+1)) = (j(i+1) + j(i−j))/(i(i+1)) < 1,

the last inequality resulting from the fact that subtracting the denominator from the numerator yields

   j(i+1) + j(i−j) − i(i+1) = (j−i)·[(i+1) − j] < 0,

since (j − i) > 1 and (i + 1) − j < 0. Hence in both cases the optimality condition for (3.3) is strictly satisfied.

(b) If i + r < n then we get from (3.5) for j = n

   e^T B^{-1} N_n = (n(i+r) + n(i−n))/(i(i+r)) < 1

since

   {numerator} − {denominator} = n(i+r) + n(i−n) − i(i+r) = (n−i)(i+r−n) < 0.

Finally, if i > 1 then, with (3.5), we have for j = 1

   e^T B^{-1} N_1 = (2i+r−1)/(i(i+r)) = ((i−1) + (i+r))/(i(i+r)) = (i−1)/(i(i+r)) + 1/i ≤ 1/3 + 1/2 < 1.

Hence the only possible choice for a basis satisfying the optimality condition for problem (3.4) is i = 1, r = n − 1. □
As can be seen from the simplex method, a basis that satisfies the optimality
condition strictly does determine a unique optimal solution if it is feasible.
To determine i for problem (3.3), we have to check feasibility of the basis. With r = 1,

   B = ( i        i+1
         C(i,2)   C(i+1,2) ),

and the basis is feasible if

   B^{-1} ( S_{1,n}
            S_{2,n} ) = ( S_{1,n} − (2/i) S_{2,n}
                          −((i−1)/(i+1)) S_{1,n} + (2/(i+1)) S_{2,n} ) ≥ 0,

or, equivalently, if

   (i − 1) S_{1,n} ≤ 2 S_{2,n} ≤ i S_{1,n}.

Hence we have to choose i such that i − 1 = ⌊2S_{2,n}/S_{1,n}⌋, where ⌊β⌋ is the integer part of β (i.e. the greatest integer less than or equal to β). With this particular i the optimal value of (3.3) amounts to

   ( S_{1,n} − (2/i) S_{2,n} ) + ( −((i−1)/(i+1)) S_{1,n} + (2/(i+1)) S_{2,n} ) = (2/(i+1)) S_{1,n} − (2/(i(i+1))) S_{2,n}.

Thus we have found a lower bound for P(ξ̃ ≥ 1) as

   P(ξ̃ ≥ 1) ≥ (2/(i+1)) S_{1,n} − (2/(i(i+1))) S_{2,n},  with i − 1 = ⌊2S_{2,n}/S_{1,n}⌋.   (3.6)

For (3.4), with i = 1 and r = n − 1, we have

   B = ( 1   n
         0   C(n,2) ),

   B^{-1} = ( 1   −2/(n−1)
              0   2/(n(n−1)) ),
and hence

   B^{-1} ( S_{1,n}
            S_{2,n} ) = ( S_{1,n} − (2/(n−1)) S_{2,n}
                          (2/(n(n−1))) S_{2,n} ).

The last vector is nonnegative since the definition of the binomial moments implies (n−1)S_{1,n} − 2S_{2,n} ≥ 0 and S_{2,n} ≥ 0. This yields for (3.4) the optimal value S_{1,n} − (2/n)S_{2,n}. Therefore we finally get an upper bound for P(ξ̃ ≥ 1) as

   P(ξ̃ ≥ 1) ≤ S_{1,n} − (2/n) S_{2,n}.        (3.7)

In conclusion, recalling that

   F(z) = 1 − P(ξ̃ ≥ 1),

we have shown the following.

Proposition 4.4 The distribution function F(z) is bounded according to

   F(z) ≥ 1 − S_{1,n} + (2/n) S_{2,n}

and

   F(z) ≤ 1 − (2/(i+1)) S_{1,n} + (2/(i(i+1))) S_{2,n},  with i − 1 = ⌊2S_{2,n}/S_{1,n}⌋.
Recall the definition of the binomial moments,

   S_{k,n} := E C(ξ̃, k) = Σ_{i=0}^n C(i, k) P({ω | ξ̃(ω) = i}),  k = 0, 1, …, n.

Another way to introduce these moments is the following. With the same notation as at the beginning of this section, let us define new random variables

   ξ̃_i(ω) := 1 if ω ∈ B_i,
              0 otherwise.

Then clearly ξ̃ = Σ_{i=1}^n ξ̃_i, and

   C(ξ̃, k) = C(ξ̃_1 + ⋯ + ξ̃_n, k) = Σ_{1≤i_1<⋯<i_k≤n} ξ̃_{i_1} ξ̃_{i_2} ⋯ ξ̃_{i_k}.

Taking the expectation on both sides yields for the binomial moments S_{k,n}

   S_{k,n} = E C(ξ̃, k) = Σ_{1≤i_1<⋯<i_k≤n} E(ξ̃_{i_1} ⋯ ξ̃_{i_k}) = Σ_{1≤i_1<⋯<i_k≤n} P(B_{i_1} ∩ ⋯ ∩ B_{i_k}).
For n = 4 independent events B_i with probabilities q_i one gets

   S_{1,4} = Σ_{i=1}^4 q_i = 0.24,
   S_{2,4} = Σ_{i=1}^3 Σ_{j=i+1}^4 q_i q_j = 0.0193,

and hence the lower bound

   P_L = (2/(i+1)) S_{1,4} − (2/(i(i+1))) S_{2,4} = 0.1757.

Observe that these bounds could be derived without any specific information about the type of the underlying probability distribution (except the assumption of independent components, made only for the sake of a simple presentation). □
Further bounds have been derived for P(ξ̃ ≥ 1) using binomial moments up to order m, 2 < m < n, as well as for P(ξ̃ ≥ r), r > 1. For some of them explicit formulae could also be derived, while others require the computational solution of optimization problems with algorithms especially designed for the particular problem structures.
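Proposition 4.4 is easy to apply computationally; the sketch below (with assumed moment values, not those of the example above) evaluates the bounds (3.6) and (3.7) on P(ξ̃ ≥ 1):

import math

def bounds_P_at_least_one(S1, S2, n):
    # upper bound (3.7)
    upper = S1 - 2.0 * S2 / n
    # lower bound (3.6), with i - 1 = floor(2 S2 / S1)
    i = math.floor(2.0 * S2 / S1) + 1
    lower = 2.0 * S1 / (i + 1) - 2.0 * S2 / (i * (i + 1))
    return lower, upper

lo, up = bounds_P_at_least_one(S1=0.30, S2=0.02, n=4)   # assumed values
# the corresponding bounds on F(z) are 1 - up <= F(z) <= 1 - lo
print(lo, up, 1 - up, 1 - lo)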
4.4 Bibliographical Notes
Exercises

1. Given a random vector ξ̃ with support Ξ ⊂ IR^k, assume that for A ⊂ Ξ and B ⊂ Ξ we have P(A) = P(B) = 1. Show that then also P(A ∩ B) = 1.

2. Under the assumptions of Proposition 4.2, the support of the distribution is Ξ = {ξ¹, …, ξ^r}, with P(ξ̃ = ξ^j) = p_j > 0 ∀j. Show that for α > 1 − min_{j∈{1,…,r}} p_j the only event A satisfying P(A) ≥ α is A = Ξ.
Preprocessing
The purpose of this chapter is to discuss different aspects of preprocessing
the data associated with a stochastic program. The term preprocessing
is rather vague, but whatever it could possibly mean, our intention here
is to discuss anything that will enhance the model understanding and/or
simplify the solution procedures. Thus preprocessing refers to any analysis
of a problem that takes place before the final solution of the problem. Some
tools will focus on the issue of model understanding, while others will focus
on issues related to choice of solution procedures. For example, if it can be
shown that a problem has (relatively) complete recourse, we can apply solution
procedures where that is required. At the same time, the fact that a problem
has complete recourse is of value to the modeller, since it says something about
the underlying problem (or at least the model of the underlying problem).
5.1 Problem Reduction

5.1.1 Finding a Frame

Figure 1 Finding a frame.
Figure 2 The cone pos W spanned by the columns W1, W2, W3, W4.
5.1.2 Removing Unnecessary Columns
This can be useful in a couple of different settings. Let us first see what happens if we simply apply the frame algorithm to the recourse matrix W. We shall then remove columns that are not needed to describe feasibility. This is illustrated in Figure 2. Given the matrix W = (W1, W2, W3, W4), we find that the shaded region represents pos W, and the output of a frame algorithm is either W = (W1, W2, W4) or W = (W1, W3, W4). The procedure framebylp will produce the first of these two cases.

Removing columns not needed for feasibility can be of use when verifying feasibility in the L-shaped decomposition method (see page 171). We are there to solve a given LP for all ξ ∈ A. If we apply frame to W before checking feasibility, we get a simpler problem to look at, without losing information, since the removed columns add nothing in terms of feasibility. If we are willing to live with two versions of the recourse matrix, we can therefore reduce work while computing.
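A frame algorithm of the framebylp type can be sketched as follows (our illustration, not the book's implementation): for each column, one LP decides whether it is a nonnegative combination of the remaining columns.

import numpy as np
from scipy.optimize import linprog

def framebylp(W, tol=1e-9):
    keep = list(range(W.shape[1]))
    for j in range(W.shape[1]):
        if j not in keep:
            continue
        others = [k for k in keep if k != j]
        if not others:
            break
        # is W_j = sum_k lambda_k W_k with lambda >= 0 solvable?
        res = linprog(np.zeros(len(others)), A_eq=W[:, others],
                      b_eq=W[:, j], bounds=[(0, None)] * len(others))
        if res.status == 0:      # yes: column j adds nothing to pos W
            keep.remove(j)
    return keep

W = np.array([[3.0, -1.0, 1.0, -2.0],
              [1.0,  1.0, 2.0,  1.0]])
print(framebylp(W))   # the example matrix reappearing in Section 5.2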
From the modelling perspective, note that columns thrown out are only needed if the cost of the corresponding linear combination is higher than that of the column itself. The variable represented by the column does not add to our production possibilities; it can only, possibly, lower our costs. In what follows in this subsection let us assume that we have only right-hand side randomness, and let us, for simplicity, denote the cost vector by q. To see if a column can reduce our costs, we define

   W̄ := ( q^T  1
           W    0 ),

that is, a matrix containing the coefficient matrix, the cost vector and an extra column. To see the importance of the extra column, consider the following interpretation of pos W̄ (remember that pos W̄ equals the set of all positive linear combinations of columns from W̄):

   pos W̄ = pos ( q_1 ⋯ q_n  1
                  W_1 ⋯ W_n  0 )
          = { ( q ; W ) | q = Σ_k λ_k q_k + μ, W = Σ_k λ_k W_k, λ_k ≥ 0, μ ≥ 0 }
          = { ( q ; W ) | q ≥ Σ_k λ_k q_k, W = Σ_k λ_k W_k, λ_k ≥ 0 }.

A frame algorithm can now be applied to W̄, removing columns
in a sequential manner until we are left with a minimal (but not necessarily
unique) set of columns. A column thrown out in this process will never be
part of an optimal solution, and is hence not needed. It can be dropped. From
a modelling point of view, this means that the modeller has added an activity
that is clearly inferior. Knowing that it is inferior should add to the modeller's
understanding of his model.
A column that is not a part of the frame of pos W, but is a part of the frame of pos W̄, is one that does not add to our production possibilities, but its existence might add to our profit.
5.1.3 Removing Unnecessary Rows

With constraints of the form W_i y ≤ h_i (rows W_i), row j is implied by the remaining rows if there exist λ_i ≥ 0 such that

   Σ_{i≠j} λ_i W_i = W_j  and  Σ_{i≠j} λ_i h_i ≤ h_j.
5.2

Figure 3 The cones pos W and pol pos W.
There is another important aspect of the polar cone pol pos W that we have not yet discussed. It is indicated in Figure 3 by showing that the generators are pairwise normals. However, that is slightly misleading, so we have to turn to a three-dimensional figure to understand it better. We shall also need the term facet. Let a cone pos W have dimension k. Then every cone K positively spanned by k − 1 generators from pos W, such that K belongs to the boundary of pos W, is called a facet. Consider Figure 4.

What we note in Figure 4 is that the generators are not pairwise normals, but that the facets of one cone have generators of the other as normals. This goes in both directions. Therefore, when we state that h ∈ pos W if and only if h^T y ≤ 0 for all generators y of pol pos W, we are in fact saying that h represents a feasible problem either because it is a linear combination of columns in W or because it satisfies the inequalities implied by the facets of pos W. In still other words, the point of finding W* is not so much to describe a new cone, but to replace the description of pos W in terms of generators with another in terms of inequalities.

This is useful if the number of facets is not too large. Generally speaking, performing an inner product of the form b^T y is very cheap. In parallel processing, an inner product can be pipelined on a vector processor, and the different inner products can be done in parallel. And, of course, as soon as we find one positive inner product, we can stop: the given recourse problem is infeasible.
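Once the generators of pol pos W are at hand, the whole feasibility test is indeed nothing but a loop of inner products with early exit; a minimal sketch (with assumed generator data):

import numpy as np

# assumed generators of pol pos W for some recourse matrix W
polar_generators = np.array([[ 1.0, -3.0],
                             [-1.0, -2.0]])

def feasible(h, tol=1e-9):
    for w in polar_generators:
        if h @ w > tol:          # one positive product: infeasible, stop
            return False
    return True

print(feasible(np.array([1.0, 1.0])), feasible(np.array([0.0, -1.0])))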
Readers familiar with extreme point enumeration will see that going from the description of pos W in terms of generators to one in terms of facets is essentially such an enumeration problem.

Figure 4 The cones pos W and pol pos W; the facets of one cone have the generators of the other as normals.
procedure support;
begin
  for all columns w of W do begin
    v := w^T W*; determine I+, I0, I− from the signs of v;
    if I+ ≠ ∅ then
      if I− = ∅ then W* := W*_{I0}
      else begin
        C_kj := v_k W*_j − v_j W*_k for all k ∈ I+, j ∈ I−;
        W* := W*_{I0} ∪ W*_{I−} ∪ {C_kj};
        framebylp(W*);
      end; (* else *)
  end; (* for *)
end;

Figure 5 Procedure support.
To initialize, W* is set to positively span the whole space, e.g.

   W* := ( 1 0 ⋯ 0 −1  0 ⋯  0
           0 1 ⋯ 0  0 −1 ⋯  0
           ⋮          ⋮
           0 0 ⋯ 1  0  0 ⋯ −1 )

or

   W* := ( 1 0 ⋯ 0 −1
           0 1 ⋯ 0 −1
           ⋮          ⋮
           0 0 ⋯ 1 −1 ).

Figure 6 The cones pos W and pol pos W* before any column has been added to W.

In this way we can follow the algorithm as it progresses. Since pos W and pol pos W* live in the same dimension, we can draw them side by side.
Let us initially assume that

   W = ( 3 −1 1 −2
         1  1 2  1 ).

The first thing to do, according to procedure support, is to subject W to a frame finding algorithm, to see if some columns are not needed. If we do that (check it to see that you understand frames) we end up with

   W = ( 3 −2
         1  1 ).

Having reduced W, we then initialize W* to span the whole space. Consult Figure 6 for details. We see there that

   W* = ( 1 0 −1  0
          0 1  0 −1 ).

Consult procedure support. From there, it can be seen that the approach is to take one column from W at a time, and with it perform some calculations. Figure 6 shows the situation before we consider the first column of W. Calling it pos W is therefore not quite correct. The main point, however, is that the left and right parts correspond. If W has no columns then pol pos W spans the whole space.

Now, let us take the first column from W. It is given by W_1 = (3, 1)^T. We next find the inner products between W_1 and all four columns of W*. We get

   v = (3, 1, −3, −1)^T.

In other words, the sets I+ = {1, 2} and I− = {3, 4} have two members each, while I0 = ∅. What this means is that two of the columns must be
each, while I0 = . What this means is that two of the columns must be
270
Figure 7
to W .
STOCHASTIC PROGRAMMING
The cones pos W and pol pos W after one column has been added
removed, namely those in I + , and two kept, namely those in I . But to avoid
losing parts of the space, we now calculate four columns Ckj . First, we get
C13 = C24 = 0. They are not interesting. But the other two are useful:
1
1
1
0
3
1
0
1
.
C14 =
+3
=
=
, C23 =
+3
1
0
0
1
3
1
Since our only interests are directions, we scale the latter to (−1, 3)^T. This brings us into Figure 7. Note that one of the columns in pos W* is drawn with dots. This is done to indicate that if procedure framebylp is applied to W*, that column will disappear. (However, that is not a unique choice.)

Note that if W had had only this one column then W*, as it appears in Figure 7, is the polar matrix of that one-column W. This is a general property of procedure support. At any iteration, the present W* is the polar matrix of the matrix containing those columns we have so far looked at.
Now let us turn to the second column of W. It is given by W_2 = (−2, 1)^T. We again find the inner products between W_2 and the current columns of W*, and update W* in the same way.

Figure 8 The cones pos W and pol pos W* after two columns have been added to W.
A Small Example

Let us return to the example we discussed in Section 1.3. We have now named the right-hand side elements b1, b2 and b3, since they are the focus of the discussion here (in the numerical example they had the values 100, 180 and 162):

   min 2xraw1 + 3xraw2
   s.t.  xraw1 +  xraw2 ≤ b1,
        2xraw1 + 6xraw2 ≥ b2,
        3xraw1 + 3xraw2 ≥ b3,
         xraw1 ≥ 0,
         xraw2 ≥ 0.
The interpretation is that b1 is the production limit of a refinery, which refines
crude oil from two countries. The variable xraw1 represents the amount of
crude oil from Country 1 and xraw2 the amount from Country 2. The quality
of the crudes is different, so one unit of crudes from Country 1 gives two units
of Product 1 and three units of Product 2, whereas the crudes from the second
country give 6 and 3 units of the same products. Company 1 wants at least
b2 units of Product 1 and Company 2 at least b3 units of Product 2.
If we now calculate the inequalities describing pos W, or alternatively the generators of pol pos W, we find that there are three of them:

   b1 ≥ 0,
   6b1 − b2 ≥ 0,
   3b1 − b3 ≥ 0.
The first should be easy to interpret, and it says something that is not very
surprising: the production capacity must not be negative. That we already
knew. The second one is more informative. Given appropriate units on crudes
and products, it says that the demand of Company 1 must not exceed six times
the production capacity of the refinery. Similarly, the third inequality says
that the demand of Company 2 must not exceed three times the production
capacity of the refinery. (The inequalities are not as meaningless as they
might appear at first sight: remember that the units for refinery capacity and
finished products are not the same.) These three inequalities, one of which
was obvious, are examples of constraints that are not explicitly written down
by the modeller, but still are implied by him or her. And they should give the
modeller extra information about the problem.
In case you wonder where the feasibility constraints are, what we have just
discussed was a one-stage deterministic model, and what we obtained was
three inequalities that can be used to check feasibility of certain instances of
that model. For example, the numbers used in Section 1.3 satisfy all three
constraints, and hence that problem was feasible. (In the example b1 = 100,
b2 = 180 and b3 = 162.)
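As a quick mechanical check (our snippet), the numbers from Section 1.3 indeed satisfy all three induced inequalities:

def refinery_feasible(b1, b2, b3):
    # the three inequalities describing pos W derived above
    return b1 >= 0 and 6 * b1 - b2 >= 0 and 3 * b1 - b3 >= 0

print(refinery_feasible(100, 180, 162))   # True, as stated in the text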
Figure 9 Illustration of feasibility.

5.3
In Chapter 3 (page 162) we discussed the set A, a set of ξ-values with the property that if h0 + Hξ − T(ξ)x produces a feasible second-stage problem for all ξ ∈ A, then the problem will be feasible for all possible values of ξ̃. We pointed out that in the worst case A had to contain all extreme points in the support of ξ̃.

Assume that the second stage is given by

   Q(x, ξ) = min{q(ξ)^T y | W y = h0 + Hξ − T0 x, y ≥ 0},

where W is fixed and T(ξ) ≡ T0. This covers many situations. In IR² consider the example in Figure 9, where ξ = (ξ1, ξ2, ξ3).

Since h1 ∈ pos W, we can safely fix ξ1 at its lowest possible value, since if things are going to go wrong, then they must go wrong for ξ1min. Or, in other words, if h0 + Hξ − T0 x ∈ pos W for ξ = (ξ1min, ξ̄2, ξ̄3) then so is any other vector with ξ2 = ξ̄2 and ξ3 = ξ̄3, regardless of the value of ξ1. Similarly, since −h2 ∈ pos W, we can fix ξ2 at its largest possible value. Neither h3 nor −h3 is in pos W, so there is nothing to do with ξ3.

Hence to check if x yields a feasible solution, we must check if

   h0 + Hξ − T0 x ∈ pos W for ξ = (ξ1min, ξ2max, ξ3min)^T and ξ = (ξ1min, ξ2max, ξ3max)^T.

Hence in this case A will contain only two points instead of 2³ = 8. In general, we see that whenever a column from H, in either its positive or negative version, lies in pos W, the corresponding random variable can be fixed at one of its bounds when testing feasibility.
5.4 Bibliographical Notes
Exercises

1. Let W be the coefficient matrix for the following set of linear equations:

    x + ½y − z + s1      = 0,
   2x      + z      + s2 = 0,
    x, y, z, s1, s2 ≥ 0.

   (a) Find a frame of pos W.
   (b) Draw a picture of pos W, and find the generators of pol pos W by simple geometric arguments.
   (c) Find the generators of pol pos W by using procedure support in Figure 5. Make sure you draw the cones pos W and pol pos W after each iteration of the algorithm, so that you see how it proceeds.
2. Consider the constraints

   … ≤ 4,
   … ≤ 5,
   … ≤ 8,
   x ≥ 0.

   (a) Are there any columns that are not needed for feasibility? (Remember the slack variables!)
   (b) Let W contain the columns that were needed from question (a), including the slacks. Try to find the generators of pol pos W by geometric arguments, i.e. draw a picture.
3. Consider the following recourse problem constraints:

   ( 1 3 2 ; 0 1 4 ; 2 2 1 ) y = ( 5 ; 7 ; 3 ) + ( 2 ; 3 ; 1 ) ξ1 + ( 1 ; 1 ; 1 ) ξ2.
References

[1] Chinneck J. W. and Dravnieks E. W. (1991) Locating minimal infeasible constraint sets in linear programs. ORSA J. Comp. 3: 157–168.
[2] Dulá J. H., Helgason R. V., and Hickman B. L. (1992) Preprocessing schemes and a solution method for the convex hull problem in multidimensional space. In Balci O. (ed) Computer Science and Operations Research: New Developments in their Interfaces, pages 59–70. Pergamon Press, Oxford.
[3] Greenberg H. J. (1982) A tutorial on computer-assisted analysis. In Greenberg H. J., Murphy F. H., and Shaw S. H. (eds) Advanced Techniques in the Practice of Operations Research. Elsevier, New York.
[4] Greenberg H. J. (1983) A functional description of ANALYZE: A computer-assisted analysis. ACM Trans. Math. Software 9: 18–56.
Network Problems
The purpose of this chapter is to look more specifically at networks. There are
several reasons for doing this. First, networks are often easier to understand.
Some of the results we have outlined earlier will be repeated here in a network
setting, and that might add to understanding of the results. Secondly, some
results that are stronger than the corresponding LP results can be obtained
by utilizing the network structure. Finally, some results can be obtained that
do not have corresponding LP results to go with them. For example, we shall
spend a section on PERT problems, since they provide us with the possibility
of discussing many important issues.
The overall setting will be as before. We shall be interested in two- or multistage problems, and the overall solution procedures will be the same. Since network flow problems are nothing but specially structured LPs, everything we have said before about LPs still holds. The bounds we have outlined can be used, and the L-shaped decomposition method, with and without bounds, can be applied as before. We should like to point out, though, that there exists one special case where scenario aggregation looks more promising for networks than for general LPs: that is the situation where the overall problem is a network. This may require some more explanation.

When we discuss networks in this chapter, we refer to a situation in which the second stage (or the last stage in a multistage setting) is a network. We shall mostly allow the first stage to be a general linear program. This rather limited view of a network problem is caused by properties of the L-shaped decomposition method (see page 171). The computational burden in that algorithm is the calculation of Q(x̂), the expected recourse cost, and to some extent the check of feasibility. Both those calculations concern only the recourse problem. Therefore, if that problem is a network, network algorithms can be used to speed up the L-shaped algorithm.
What if the first-stage problem is also a network? Example 2.2 (page 117)
was such an example. If we apply the L-shaped decomposition method to
that problem, the network structure of the master problem is lost as soon as
feasibility and optimality cuts are added. This is where scenario aggregation,
outlined in Section 2.6, can be of some use. The reason is that, throughout the
calculations, individual scenarios remain unchanged in terms of constraints, so
that structure is not lost. A nonlinear term is added to the objective function,
however, so if the original problem was linear, we are now in a setting of
quadratic objectives and linear (network) constraints. If the original problem
was a nonlinear network, the added terms will not increase complexity at all.
6.1 Terminology

Figure 1 Example network.

Furthermore, we have

   F+({1}) = {1, 2, 3},

since we can reach nodes 2 and 3 in one step, but we need two steps to reach node 4. Node 1 itself is in both sets by definition.
Two examples of predecessors of a node are

   B+({1}) = {1},

since node 1 has no predecessors, and nodes 2 and 3 can be reached from node 1.

A common problem in network flows is the min cost network flow problem. It is given as follows:

   min q(1)y(1) + q(2)y(2) + q(3)y(3) + q(4)y(4) + q(5)y(5)
   s.t.  y(1) + y(2)        = β(1),
        −y(1) + y(3) + y(5) = β(2),
        −y(2) + y(4) − y(5) = β(3),
        −y(3) − y(4)        = β(4),
         y(k) ≤ γ(k), k = 1, …, 5,
         y(k) ≥ 0,    k = 1, …, 5.

The coefficient matrix for this problem has rank 3. Therefore the node-arc incidence matrix has three rows, and is given by

   W = (  1  1 0 0  0
         −1  0 1 0  1
          0 −1 0 1 −1 ). □
6.2 Feasibility in Networks

The inequalities

   b(Y)^T β ≤ a(Q+)^T γ,
   b(N \ Y)^T β ≤ a(Q−)^T γ

are both needed if and only if G(Y) and G(N \ Y) are both connected. Otherwise, none of the inequalities are needed.
Example 6.2 Let us look at the small example network in Figure 3 to at least partially see the relevance of the last proposition.

Figure 3 Example network 1.

The following three inequalities are examples of inequalities describing feasibility for the example network:

   β(2)        ≤ γ(d) + γ(f),
   β(3)        ≤ γ(e),
   β(2) + β(3) ≤ γ(d) + γ(e) + γ(f).

Proposition 6.2 states that the latter inequality is not needed, because G({2, 3}) is not connected. From the inequalities themselves, we easily see that if the first two are satisfied, then the third is automatically true. It is perhaps slightly less obvious that, for the very same reason, the inequality

   β(1) + β(4) + β(5) ≤ γ(a) + γ(c)

is also not needed. It is implied by the requirement that total supply must equal total demand plus the companions of the first two inequalities above. (Remember that each node set gives rise to two inequalities.) More specifically, the inequality can be obtained by adding the following two inequalities and one equality (representing supply equals demand):

   β(1) + β(2) + β(4) + β(5) ≤ γ(c),
   β(1) + β(3) + β(4) + β(5) ≤ γ(a),
   −β(1) − β(2) − β(3) − β(4) − β(5) = 0. □
Once you have looked at this for a while, you will probably realize that the part of Proposition 6.2 that says that if G(Y) or G(N \ Y) is disconnected then we do not need any of the inequalities is fairly obvious. The other part of the proposition is much harder to prove, namely that if G(Y) and G(N \ Y) are both connected then the inequalities corresponding to Y and N \ Y are both needed. We shall not try to outline the proof here.

Proposition 6.2 might not seem very useful. A straightforward use could still require the enumeration of all subsets of N, and for each such subset a check to see if G(Y) and G(N \ Y) are both connected. However, we can obtain more than that.

The first important observation is that the result refers to the connectedness of two networks: both the one generated by Y and the one generated by N \ Y. Let Y1 = N \ Y. If both networks are connected, we have two inequalities that we need, namely

   b(Y)^T β ≤ a(Q+)^T γ

and

   b(Y1)^T β ≤ a(Q−)^T γ.
Figure 5 Example network 2.
The only set Y where i ∈ Y but j ∉ Y, at the same time as both G(Y) and G(N \ Y) are connected, is the set where Y = {i}. The reason is that node j blocks node i's connections to all other nodes. Therefore, after calling CreateIneq({i}), we can safely collapse node i into node j. Examples of this can be found in Figure 5 (see e.g. nodes 4 and 5). This result is easy to implement, since all we have to do is run through all nodes, one at a time, and look for nodes satisfying B+(i) ∪ F+(i) = {i, j}. Whenever collapses take place, F+ and B+ (or, alternatively, F− and B−) must be updated for the remaining nodes.

By repeatedly using this proposition, we can remove from the network all trees (and trees include double arcs like those between nodes 2 and 5). We are then left with circuits and paths connecting circuits. The circuits can be both directed and undirected. In the example in Figure 5 we are left with the circuits and the paths connecting them.
procedure AllFacets;
begin
  TreeRemoval;
  CreateIneq(∅);
  Y := ∅;
  W := N \ {n};
  Facets(Y, W);
end;

Figure 6 Main program for full enumeration of inequalities satisfying Proposition 6.2.
6.2.1

Figure 8

Figure 9

6.2.2
If the LP has a solution, the column is not part of the frame, and can be removed. An
important property of this procedure is that to determine if a column can be
discarded, we have to use all other columns in the test. This is a major reason
why procedure framebylp is so slow when the number of columns gets very
large.
So a generator w of the cone pol pos W has the property that a right-hand side h must satisfy h^T w ≤ 0 to be feasible. In the uncapacitated network case we saw that a right-hand side had to satisfy b(Y)^T β ≤ 0 to represent a feasible problem. Therefore the index vector b(Y) corresponds exactly to the column w. And calling procedure framebylp to remove those columns that are not in the frame of the cone pos W* corresponds to using Proposition 6.5. Therefore the index vector of a node set from Proposition 6.5 corresponds to the columns in W*.
Computationally there are major differences, though. First, to find a
candidate for W*, we had to start out with W, and use procedure support,
which is an iterative procedure. The network inequalities, on the other hand,
are produced more directly by looking at all subsets of nodes. But the
most important difference is that, while the use of procedure framebylp,
as just explained, requires all columns to be available in order to determine if
one should be discarded, Proposition 6.5 is totally local. We can pick up
an inequality and determine if it is needed without looking at any other
inequalities. With possibly millions of candidates, this difference is crucial.
We did not develop the LP case for explicit bounds on variables. If such bounds exist, they can, however, be put in as explicit constraints. If so, a column w from W* corresponds to the index vector

   ( b(Y)
     a(Q+) ).

6.3
Let us now discuss how the results obtained in the previous section can help us, and how they can be used in a setting that deserves the term preprocessing. Let us first repeat some of our terminology, in order to see how this fits in with our discussions in the LP setting.

A two-stage stochastic linear programming problem where the second-stage problem is a directed capacitated network flow problem can be formulated as follows:

   min_x c^T x + Q(x)
   s.t. Ax = b, x ≥ 0,

where

   Q(x) = Σ_j Q(x, ξ^j) p_j
and

   Q(x, ξ) = min_{y¹} {(q¹)^T y¹ | W′ y¹ = h¹0 + H¹ξ − T¹(ξ)x, 0 ≤ y¹ ≤ h²0 + H²ξ − T²(ξ)x},

where W′ is the node-arc incidence matrix for the network. To fit into a more general setting, let

   W = ( W′ 0
         I  I ),

so that Q(x, ξ) can also be written as

   Q(x, ξ) = min_y {q^T y | W y = h0 + Hξ − T(ξ)x, y ≥ 0},

where

   y = ( y¹ ; y² ),  y² is the slack of y¹,  q = ( q¹ ; 0 ),  h0 = ( h¹0 ; h²0 ),
   T(ξ) = ( T¹(ξ) ; T²(ξ) )  and  H = ( H¹ ; H² ).

Given our definition of β and γ, we have, for a given x̂,

   ( β ; γ ) = h0 + Hξ − T(ξ)x̂ = h0 + Σ_i h_i ξ_i − T(ξ)x̂,

where h_i denotes the ith column of H.
Let us replace β and γ with their expressions in terms of x and ξ. An inequality then says that the following must be true for all values of x and all realizations of ξ:

   b[A(Y)]^T [ h¹0 + Σ_i h¹_i ξ_i − T¹(ξ)x ] ≤ a(Q+)^T [ h²0 + Σ_i h²_i ξ_i − T²(ξ)x ].

Collecting all x terms on the left-hand side and all other terms on the right-hand side, we get the following expression:

   [ a(Q+)^T (T²0 + Σ_j T²_j ξ_j) − b[A(Y)]^T (T¹0 + Σ_j T¹_j ξ_j) ] x
      ≤ Σ_i [ a(Q+)^T h²_i − b[A(Y)]^T h¹_i ] ξ_i + a(Q+)^T h²0 − b[A(Y)]^T h¹0.

Since this must hold for all realizations, the right-hand side may be replaced by its minimum over the support, yielding (for the case T(ξ) ≡ T0) the deterministic constraint

   [ a(Q+)^T T²0 − b[A(Y)]^T T¹0 ] x
      ≤ min_ξ { Σ_i [ a(Q+)^T h²_i − b[A(Y)]^T h¹_i ] ξ_i } + a(Q+)^T h²0 − b[A(Y)]^T h¹0.
6.4 An Investment Example

Consider the simple network in Figure 10. It represents the flow of sewage (or some other waste) from three cities, represented by nodes 1, 2 and 3.

Figure 10 The network discussed in Section 6.4.
All three cities produce sewage, and they have local treatment plants to take
care of some of it. Both the amount of sewage from a city and its treatment
capacity vary, and the net variation from a city is given next to the node
representing the city. For example, City 1 always produces more than it can
treat, and the surplus varies between 10 and 20 units per unit time. City 2,
on the other hand, sometimes can treat up to 5 units of sewage from other
cities, but at other times has as much as 15 units it cannot itself treat. City
3 always has extra capacity, and that varies between 5 and 15 units per unit
time.
The solid lines in Figure 10 represent pipes through which sewage can be
pumped (at a cost). Assume all pipes have a capacity of up to 5 units per unit
time. Node 4 is a common treatment site for the whole area, and its capacity
is so large that for practical purposes we can view it as being infinite. Until
now, whenever a city had sewage that it could not treat itself, it first tried to
send it to other cities, or site 4, but if that was not possible, the sewage was
simply dumped in the ocean. (It is easy to see that that can happen. When
City 1 has more than 10 units of untreated sewage, it must dump some of it.)
New rules are being introduced, and within a short period of time dumping
sewage will not be allowed. Four projects have been suggested.
• Increase the capacity of the pipe from City 1 (via City 2) to site 4 by x1 units (per unit time).
• Increase the capacity of the pipe from City 2 to City 3 by x2 units (per unit time).
• Increase the capacity of the pipe from City 1 (via City 3) to site 4 by x3 units (per unit time).
• Build a new treatment plant in City 1 with a capacity of x4 units (per unit time).
It is not quite clear if capacity increases can take on any values, or just some
predefined ones. Also, the cost structure of the possible investments is not
yet clear. Even so, we are asked to analyse the problem, and create a better
basis for decisions.
The first thing we must do, to use the procedures of this chapter, is to
make sure that, technically speaking, we have a network (as defined at the
start of the chapter). A close look will reveal that a network must have equality
constraints at the node, i.e. flow in must equal flow out. That is not the case
in our little network. If City 3 has spare capacity, we do not have to send
extra sewage to the city, we simply leave the capacity unused if we do not
need it. The simplest way to take care of this is to introduce some new arcs
in the network. They are shown with dotted lines in Figure 10. Finally, to
have supply equal to demand in the network (remember from Proposition 6.1
that this is needed for feasibility), we let the external flow in node 4 be the
negative of the sum of external flows in the other three nodes.
You may wonder if this rewriting makes sense. What does it mean when
sewage is sent along a dotted line in the figure? The simple answer is that
the amount exactly equals the unused capacity in the city to which the arc
goes. (Of course, with the given numbers, we realize that no arc will be needed
from node 4 to node 1, but we have chosen to add it for completeness.)
Now, to learn something about our problem, let us apply Proposition 6.2
to arrive at a number of inequalities. You may find it useful to try to write
them down. We shall write down only some of them. The reason for leaving
out some is the following observation: any node set Y that is such that Q+
contains a dotted arc from Figure 10 will be uninteresting, because
a(Q+ )T = ,
so that the inequality says nothing interesting. The remaining inequalities are
as follows (where we have used that all existing pipes have a capacity of 5 per
unit time).
1
10 + x1
+ x3 + x4 ,
2
10 + x1 + x2 ,
3 5
+ x3 ,
1 + 2 + 3 10 + x1
+ x3 + x4 ,
(4.1)
1 + 2
15 + x1 + x2 + x3 + x4 ,
1
+ 3 10 + x1
+ x3 + x4 ,
2 + 3 10 + x1
+ x3 .
Let us first note that if we set all xi = 0 in (4.1), we end up with a number
of constraints that are not satisfied for all possible values of . Hence, as we
already know, there is presently a chance that sewage will be dumped.
However, our interest is mainly to find out about which investments to
make. Let us therefore rewrite (4.1) in terms of xi rather than i :
   x1      + x3 + x4 ≥ ξ1 − 10           → 10,
   x1 + x2           ≥ ξ2 − 10           →  5,
             x3      ≥ ξ3 − 5            → −10,
   x1      + x3 + x4 ≥ ξ1 + ξ2 + ξ3 − 10 → 20,
   x1 + x2 + x3 + x4 ≥ ξ1 + ξ2 − 15      → 20,        (4.2)
   x1      + x3 + x4 ≥ ξ1 + ξ3 − 10      →  5,
   x1      + x3      ≥ ξ2 + ξ3 − 10      →  0,

where the number after the arrow is the worst case of the right-hand side over the support of ξ. Dropping the constraints that are void or dominated by others, we are left with

   x1 + x2 ≥ 5,
   x1 + x3 + x4 ≥ 20.        (4.3)
Even though we know nothing so far about investment costs and pumping
costs through the pipes, we know a lot about what limits the options.
Investments of at least five units must be made on a combination of x1 and x2 .
What this seems to say is that the capacity out of City 2 must be increased by
at least 5 units. It is slightly more difficult to interpret the second inequality. If
we see both building pipes and a new plant in City 1 as increases in treatment
capacity (although they are of different types), the second inequality seems to
say that a total of 20 units must be built to facilitate City 1. However, a closer
look at which cut generated the inequality reveals that a more appropriate
interpretation is to say that the three cities, when they are seen as a whole,
must obtain extra capacity of 20 units. It was the node set Y = {1, 2, 3} that
generated the cut.
The two constraints (4.3) are all we need to pass on to the planners. If these
two, very simple, constraints are taken care of, sewage will never have to be
dumped. Of course, if the investment problem is later formulated as a linear
program, the two constraints can be added, thereby guaranteeing feasibility,
and, from a technical point of view, relatively complete recourse.
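A mechanical check of a proposed investment plan against (4.3) is then trivial (our snippet):

def investments_ok(x1, x2, x3, x4):
    # the two constraints (4.3) that guarantee no sewage is ever dumped
    return x1 + x2 >= 5 and x1 + x3 + x4 >= 20

print(investments_ok(5, 0, 10, 5))   # True
print(investments_ok(2, 2, 10, 5))   # False: capacity out of City 2 too small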
Figure 11 Example network, with random external flows given as intervals at the nodes and random arc capacities given as intervals on the arcs.

6.5 Bounds
Figure 12 The example network with expectations inserted.

Each random external flow ξ_i then contributes to the upper bounding function the amount

   d+_i (ξ_i − E ξ_i)  if ξ_i ≥ E ξ_i,
   d−_i (E ξ_i − ξ_i)  if ξ_i < E ξ_i.
This is our basic setting, and all other values of ξ will be seen as deviations from E ξ̃. Note that since y⁰ is always there, we shall update the arc capacities to become −y⁰ ≤ y ≤ γ − y⁰. For this purpose, we define α¹ = −y⁰ and β¹ = γ − y⁰. Let e_i be a unit vector of appropriate dimension with a +1 in position i.

Next, define a counter r and let r := 1. Now, check out the case when ξ_r > E ξ_r by solving

   min_y {q^T y | W y = e_r (B_r − E ξ_r), α^r ≤ y ≤ β^r} = q^T y^{r+} = d+_r (B_r − E ξ_r),        (5.1)

where [A_r, B_r] is the support of ξ_r, and correspondingly

   min_y {q^T y | W y = e_r (A_r − E ξ_r), α^r ≤ y ≤ β^r} = q^T y^{r−} = d−_r (A_r − E ξ_r).        (5.2)

Now, based on y^r, we shall assign portions of the arc capacities to the random variable ξ_r. These portions will be given to ξ_r and left unused by other random variables, even when ξ_r does not need them. The portions will correspond to paths in the network connecting node r to the slack node (node 5 in the example). That is done by means of the following computation, where we calculate what is left for the next random variable:

   α^{r+1}_i = α^r_i − min{y^{r+}_i, y^{r−}_i, 0}.        (5.3)

What we are doing here is to find, for each variable, how much ξ_r, in the worst case, uses of arc i in the negative direction. That is then subtracted from what we had before. There are three possibilities. We may have both (5.1) and (5.2) yielding nonnegative values for variable i. Then nothing is used of the available negative capacity α^r_i, and α^{r+1}_i = α^r_i. Alternatively,
Figure 13 The example network with arc capacities at their lowest possible values and external flows at their means.

when (5.1) has y^{r+}_i < 0, it will in the worst case use y^{r+}_i of the available negative capacity. Finally, when (5.2) has y^{r−}_i < 0, in the worst case we use y^{r−}_i of the capacity. Therefore α^{r+1}_i is what is left for the next random variable. Similarly, we find

   β^{r+1}_i = β^r_i − max{y^{r+}_i, y^{r−}_i, 0},        (5.4)

where β^{r+1}_i shows how much is still available of the capacity on arc i in the forward (positive) direction.

We next increase the counter r by one and repeat (5.1)–(5.4). This takes care of the piecewise linear functions in ξ.
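The bookkeeping in (5.3) and (5.4) is easily expressed in code; the sketch below (with assumed vectors purely for illustration) performs one update of the remaining backward and forward capacities:

import numpy as np

def update_capacities(alpha, beta, y_plus, y_minus):
    # (5.3): worst-case use of each arc in the negative direction
    used_neg = np.minimum.reduce([y_plus, y_minus, np.zeros_like(y_plus)])
    # (5.4): worst-case use of each arc in the positive direction
    used_pos = np.maximum.reduce([y_plus, y_minus, np.zeros_like(y_plus)])
    return alpha - used_neg, beta - used_pos

alpha = np.array([-2.0, -1.0, 0.0])     # assumed alpha^r
beta  = np.array([ 3.0,  2.0, 2.0])     # assumed beta^r
y_plus  = np.array([ 1.0, -1.0, 0.0])   # assumed solution of (5.1)
y_minus = np.array([-1.0,  0.0, 1.0])   # assumed solution of (5.2)
print(update_capacities(alpha, beta, y_plus, y_minus))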
Let us now look at our example in Figure 11. To calculate the ξ part of the bound, we put all arc capacities at their lowest possible value and external flows at their means. This is shown in Figure 13.

The optimal solution in Figure 13 is given by

   y⁰ = (2, 0, 0, 0, 3, 2, 2, 0)^T,

with a cost of 22. The next step is to update the arc capacities in Figure 13 to account for this solution. The result is shown in Figure 14.

Since the external flow in node 1 varies between 1 and 3, and we have so far solved the problem for a supply of 2, we must now find the cost associated with a supply of 1 and a demand of 1 in node 1. For a supply of 1 we get the solution

   y^{1+} = (0, 1, 0, 0, 0, 0, 1, 0)^T,

with a cost of 5. Hence d+_1 = 5. For a demand of 1 we get

   y^{1−} = (1, 0, 0, 0, 0, 1, 0, 0)^T.
Figure 14 Arc capacities after the update based on (E ξ̃, y⁰).
Figure 15 Arc capacities after the update based on (E ξ̃, y⁰) and node 1.
Figure 16 Arc capacities after the update based on (E ξ̃, y⁰) and nodes 1 and 2.
The external flow in node 4 varies between −2 and 0 units, and we have so far solved for a demand of 1. Therefore we must now look at a demand of 1 and a supply of 1 in node 4, based on the arc capacities in Figure 16. In that figure we have updated the capacities from Figure 15 based on the solutions for node 2.

A supply in node 4 gives us the solution

   y^{4+} = (0, 0, 0, 0, 0, 0, 1, 0)^T,

with a cost of 2. One unit of demand, on the other hand, gives us

   y^{4−} = (0, 0, 0, 0, 0, 1, 0)^T.

Altogether, we have arrived at the upper bounding function

   θ(ξ, η) = 22 + H(η) + { 5(ξ1 − 2) if ξ1 ≥ 2,
                           3(ξ1 − 2) if ξ1 < 2, }
                       + { 3ξ2 if ξ2 ≥ 0,
                           2ξ2 if ξ2 < 0, }
                       + { 2(ξ4 + 1) if ξ4 ≥ −1,
                           2(ξ4 + 1) if ξ4 < −1. }
If, for simplicity, we assume that all distributions are uniform, we easily calculate the expected value of θ(ξ̃, η̃).

Figure 17 Arc capacities after the update based on (E ξ̃, y⁰) and external flows in all nodes.

Each one-sided deviation has expected value 1/4 under the uniform assumption, so the three piecewise linear terms contribute

   5 · (1/4) + 3 · (−1/4),   3 · (1/4) + 2 · (−1/4)   and   2 · (1/4) + 2 · (−1/4)

to the expectation.
Note that there is no contribution from ξ4 to the upper bound. The reason is that the function θ(ξ, η) is linear in ξ4. This property of discovering that the recourse function is linear in some random variable is shared with the Jensen and Edmundson–Madansky bounds.

We then turn to the η part of the bound. Note that if (5.3) and (5.4) were calculated after the final y^r had been found, the resulting α and β show what is left of the deterministic arc capacities after all random variables ξ_i have received their shares. Let us call these α̂ and β̂. If we add to each upper bound in Figure 17 the value C (remember that the support of the upper arc capacities was [0, C]), we get the arc capacities of Figure 18. Now we solve the problem

   min_y {q^T y | W y = 0, α̂ ≤ y ≤ β̂ + C} = q^T ŷ.        (5.5)

With zero external flow in Figure 18, we get the optimal solution

   ŷ = (0, 0, 0, 0, 1, 0, 1, 1)^T,
Figure 18 Arc capacities used to calculate H(η) for the example in Figure 11.

with a cost of −4. This represents cycle flow with negative cost. The cycle became available as a result of giving arc 8 a positive arc capacity. If, again for simplicity, we assume that η8 is uniformly distributed over [0, 2], we find that the capacity of that cycle has a probability of being 1 equal to 0.5. The remaining probability mass is uniformly distributed over [0, 1]. We therefore get

   E H(η̃) = −4 · 1 · (1/2) − 4 ∫₀¹ x · (1/2) dx = −2 − 1 = −3.

The total upper bound for this example is thus 23 − 3 = 20, compared with the Jensen lower bound of 18.
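The final integration can be checked mechanically; this snippet reproduces E H(η̃) = −3 for η8 uniform on [0, 2]:

from scipy.integrate import quad

cycle_cost = -4.0
# the cycle capacity is min(eta_8, 1): it equals 1 with probability 1/2,
# and is uniform on [0, 1] with the remaining mass (density 1/2)
point_mass = cycle_cost * 1.0 * 0.5
density_part, _ = quad(lambda x: cycle_cost * x * 0.5, 0.0, 1.0)
print(point_mass + density_part)      # -3.0, hence the bound 23 - 3 = 20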
In this example the solution ŷ of (5.5) contained only one cycle. In general, ŷ may consist of several cycles, possibly sharing arcs. It is then necessary to pick ŷ apart into individual cycles. This can be done in such a way that all cycles have nonpositive costs (those with zero cost can then be discarded), and such that all cycles that use a common arc use it in the same direction. We shall not go into details of that here.
6.6 Project Scheduling

The deterministic PERT problem can be written as

   min π_n
   s.t. π_j − π_i ≥ q_k for all k ~ (i, j),
        π_1 = 0.        (6.1)

It is worth noting that (6.1) is not really a decision problem; there are no decisions. We are only calculating consequences of an existing setting of relations and durations.
6.6.1

As pointed out, (6.1) is not a decision problem, since there are no decisions to be made. Very often, activity durations are not given by nature, but can be affected by how much resources we put into them. For example, it takes longer to build a house with one carpenter than with two. Assume we have available a budget of B units of resources, and that if we spend one unit on activity k, its duration will decrease by a_k time units. A possible decision problem is then to spend the budget in such a way that the project duration is minimized. This can be achieved by solving the following problem:

   min π_n
   s.t. π_j − π_i ≥ q_k − a_k x_k for all k ~ (i, j),
        Σ_k x_k ≤ B,        (6.2)
        π_1 = 0,
        x_k ≥ 0.
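Problem (6.2) is an ordinary LP; the sketch below sets it up with SciPy for an assumed four-node project (the data are ours, for illustration only):

import numpy as np
from scipy.optimize import linprog

# arcs k ~ (i, j) with durations q_k and crash rates a_k; nodes 0..3
arcs = [(0, 1, 4.0, 1.0), (0, 2, 3.0, 0.5), (1, 3, 2.0, 1.0), (2, 3, 5.0, 1.0)]
B, n = 3.0, 4
c = np.zeros(n + len(arcs)); c[n - 1] = 1.0        # minimize pi_n
A_ub, b_ub = [], []
for k, (i, j, q, a) in enumerate(arcs):
    row = np.zeros(n + len(arcs))
    row[i], row[j], row[n + k] = 1.0, -1.0, -a     # pi_j - pi_i >= q_k - a_k x_k
    A_ub.append(row); b_ub.append(-q)
budget = np.zeros(n + len(arcs)); budget[n:] = 1.0 # sum_k x_k <= B
A_ub.append(budget); b_ub.append(B)
A_eq = np.zeros((1, n + len(arcs))); A_eq[0, 0] = 1.0   # start node fixed at 0
bounds = [(None, None)] * n + [(0, None)] * len(arcs)
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[0.0], bounds=bounds)
print(res.x[:n], res.fun)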
6.6.2 Introduction of Randomness
It seems natural to assume that activity durations are random. If so, the project duration is also random, and we can no longer talk about finding the minimal project duration time. However, a natural alternative seems to be to look for the expected (minimal) project duration time. In (6.1) and (6.2) the goal would then be to minimize E π_n. However, we must now be careful about how we interpret the problems. Problem (6.1) is simple enough. There are still no decisions, so we are only trying to calculate when, in expectation, the project will finish, if all activities start as soon as they can. But when we turn to (6.2) we must be careful. In what order do things happen? Do we first decide on x, and then simply sit back (as we did with (6.1)) and observe what happens? Or do we first observe what happens, and then make decisions on x? These are substantially different situations. It is of importance that you understand the modelling aspects of this difference. (There are solution differences as well, but they are less interesting now.) In a sense, the two interpretations bound the correct problem from above and below.

If we interpret (6.2) as a problem where x is determined before the activity durations are known, we have in fact a standard two-stage stochastic program. The first-stage decision is to find x, and the second-stage decision is to find the project duration given x and a realization of q(ξ̃). (We write q(ξ̃) to show that q is indeed a random variable.) But, and this is perhaps the most important question to ask in this section, is this a good model? What does it mean?
First, it is implicit in the model that, while the original activity durations
are random, the changes ak xk are not. In terms of probability distributions,
therefore what we have done is to reduce the means of the distributions
describing activity durations, but without altering the variances. This might
or might not be a reasonable model. Clearly, if we find this unreasonable, we
could perhaps let ak be a random variable as well, thereby making also the
effect of the investment xk uncertain.
The above discussion is more than anything a warning that whenever we
introduce randomness in a model, we must make sure we know what the
randomness means. But there is a much more serious model interpretation if
we see (6.2) as a two-stage problem. It means that we think we are facing
a project where, before it is started, we can make investments, but where
afterwards, however badly things go, we shall never interfere in order to fix
shortcomings. Also, even if we are far ahead of schedule, we shall not cut back
on investments to save money. We may ask whether such projects exist
projects where we are free to invest initially, but where afterwards we just sit
back and watch, whatever happens.
From this discussion you may realize (as you have before, we hope) that the definition of stages is important when making models with stochasticity. In our view, project scheduling with uncertainty is a multistage problem, where decisions are made each time new information becomes available. This makes the problem extremely hard to solve (and even formulate; just try!). But this complexity cannot prevent us from pointing out the difficulties facing anyone trying to formulate PERT problems with only two stages.
We said earlier that there were two ways of interpreting (6.2) in a setting of uncertainty. We have just discussed one. The other is different, but has similar problems. We could interpret (6.2) with uncertainties as if we first observed the values of q and then made investments. This is the wait-and-see solution. It represents a situation where we presently face uncertainty, but where all uncertainty will be resolved before decisions have to be made. What does that mean in our context? It means that before the project starts, all uncertainty related to activities disappears, everything becomes known, and we are faced with investments of the type (6.2). If the previous interpretation of our problem was odd, this one is probably even worse. In what sort of project will we have initial uncertainty, but before the first activity starts, everything, up to the finish of the project, becomes known? This seems almost as unrealistic as having a deterministic model of the project in the first place.
6.6.3

Despite our own warnings in the previous subsection, we shall now show how the extra structure of PERT problems allows us to find bounds on the expected project duration time if activity durations are random. Technically speaking, we are looking for the expected value of the objective function in (6.1) with respect to the random variables q(ξ̃). There is a very large collection of different methods for bounding PERT problems. Some papers are listed at the end of this chapter. However, most, if not all, of them can be categorized as belonging to one or more of the following groups.
6.6.3.1 Series reductions
If there is a node with only one incoming and one outgoing arc, the node is
removed, and the arcs replaced by one arc with a duration equal to the sum
of the two arc durations. This is an exact reformulation.
6.6.3.2 Parallel reductions

If two arcs run in parallel with durations q̃1 and q̃2 then they are replaced with one arc having duration max{q̃1, q̃2}. This is also an exact reformulation.
6.6.3.3 Disregarding path dependences

Let π̃_i be a random variable describing when event i takes place. Then we can calculate

   π̃_j = max_{i∈B+(j)\{j}} { π̃_i + q̃_k(ξ̃) },  where k ~ (i, j),
Table 1

   q̃1   q̃2   Prob.   max
   1     1    0.3     1
   1     2    0.2     2
   2     1    0.2     2
   2     2    0.3     2

Figure 19 Arc duplication.

6.6.3.4 Arc duplications
If there is a node i′ with B+(i′) = {i′, i}, so that the node has only one incoming arc k ~ (i, i′), remove node i′, and for each j ∈ F+(i′) \ {i′} replace k′ ~ (i′, j) by k′′ ~ (i, j). The new arc has associated with it the random duration q̃_{k′′}(ξ̃) := q̃_k(ξ̃) + q̃_{k′}(ξ̃). If arc k had a deterministic duration, this is an exact reformulation. If not, we get an upper bound based on the previous principle of disregarding path dependences. (This method is called arc duplication because we duplicate arc k and use one copy for each arc k′.) An exactly equal result applies if there is only one outgoing arc. This result is illustrated in Figure 19, where F+(i′) = {i′, 1, 2, 3}.

If there are several incoming and several outgoing arcs, we may pair up all incoming arcs with all outgoing arcs. This always produces an upper bound based on the principle of disregarding path dependences.
6.6.3.5 Using the Jensen inequality

Since our problem is convex in q, we get a lower bound whenever a qk(ξ), or a collection of them, is replaced by its expectation; this is just the Jensen inequality.
6.7 Bibliographical Notes
The vocabulary in this chapter is mostly taken from Rockafellar [25], which
also contains an extremely good overview of deterministic network problems.
A detailed look at the network recourse problem is found in Wallace [28].
The original feasibility results for networks were developed by Gale [10]
and Hoffman [13]. The stronger versions using connectedness were developed
by Wallace and Wets. The uncapacitated case is given in [31], while the
capacitated case is outlined in [33] (with a proof in [32]). More details of the
algorithms in Figures 6 and 7 can also be found in these papers. Similar results
were developed by Prékopa and Boros [23]. See also Kall and Prékopa [14].
As for the LP case, model formulations and infeasibility tests have of course
been performed in many contexts apart from ours. In addition to the references
given in Chapter 5, we refer to Greenberg [11, 12] and Chinneck [3].
The piecewise linear upper bound is taken from Wallace [30]. At the very
end of our discussion of the piecewise linear upper bound, we pointed out that
the solution y to (5.5) could consist of several cycles sharing arcs. A detailed
discussion of how to pick y apart to obtain a conformal realization can be found
found in Rockafellar [25], page 476. How to use it in the bound is detailed
in [30]. The bound has been strengthened for pure arc capacity uncertainty
by Frantzeskakis and Powell [8].
Special algorithms for stochastic network problems have also been
developed; see e.g. Qi [24] and Sun et al. [27].
We pointed out at the beginning of this chapter that scenario aggregation
(Section 2.6) could be particularly well suited to problems that have network
structure in all periods. This has been utilized by Mulvey and Vladimirou
for financial problems, which can be formulated in a setting of generalized
networks. For details see [19, 20]. For a selection of papers on financial
problems (not all utilizing network structures), consult Zenios [36, 37], and,
for a specific application, see Dempster and Ireland [5].
The above methods are well suited for parallel processing. This has been
done in Mulvey and Vladimirou [18] and Nielsen and Zenios [21].
Another use of network structure to achieve efficient methods is described
in Powell [22] for the vehicle routing problem.
The PERT formulation was introduced by Malcolm et al. [17]. An overview
of project scheduling methods can be found in Elmaghraby [7]. A selection of
bounding procedures based on the different ideas listed above can be found in
the following: Fulkerson [9], Kleindorfer [16], Shogan [26], Kamburowski [15]
and Dodin [6]. The PERT problem as an investment problem is discussed in
Wollmer [34].
The max flow problem is another special network flow problem that is much
studied in terms of randomness. We refer to the following papers, which discuss
both bounds and a two-stage setting: Cleef and Gaul [4], Wollmer [35], Aneja
and Nair [1], Carey and Hendrickson [2] and Wallace [29].
Exercises
1. Consider the network in Figure 20. The interpretation of the parameters is as for Figure 11, except that the arc capacities are simply written as numbers next to the arcs. All lower bounds on flow are zero.
Calculate the Jensen lower bound, the Edmundson-Madansky upper
bound, and the piecewise linear upper bound for the expected minimal
cost in the network.
2. When outlining the piecewise linear upper bound, we found a function that
was linear both above and below the expected value of the random variable.
Show how (5.1) and (5.2) can be replaced by a parametric linear program
to get not just one linear piece above the expectation and one below, but
rather piecewise linearity on both sides. Also, show how (5.3) and (5.4)
must then be updated to account for the change.
3. The max flow problem is the problem of finding the maximal amount of
flow that can be sent from node 1 to node n in a capacitated network. This
problem is very similar to the PERT problem, in that paths in the latter
correspond to cuts in the max flow problem. Use the bounding ideas listed
in Section 6.6.3 to find bounds on the expected max flow in a network with
random arc capacities.
4. In our example about sewage treatment in Section 6.4 we introduced four
investment options.
(a) Assume that a fifth investment is suggested, namely to build a pipe
with capacity x5 directly from City 1 to site 4. What are the constraints
on xi for i = 1, . . . , 5 that must now be satisfied for the problem to be
feasible?
(b) Disregard the suggestion in question (a). Instead, it is suggested to view the earlier investment 1, i.e. increasing the pipe capacity from City 1 to site 4 via City 2, as two different investments. Now let x1 be the increased capacity from City 1 to City 2, and x5 the increased capacity from City 2 to site 4 (the dump). What are now the constraints on xi for i = 1, . . . , 5 that must be satisfied for the problem to be feasible?
Make sure you interpret the constraints.
5. Develop procedures for uncapacitated networks corresponding to those in
Figures 4, 6 and 7.
References
[1] Aneja Y. P. and Nair K. P. K. (1980) Maximal expected flow in a network
subject to arc failures. Networks 10: 45–57.
[2] Carey M. and Hendrickson C. (1984) Bounds on expected performance
of networks with links subject to failure. Networks 14: 439–456.
[3] Chinneck J. W. (1990) Localizing and diagnosing infeasibilities in
networks. Working paper, Systems and Computer Engineering, Carleton
University, Ottawa, Ontario.
[4] Cleef H. J. and Gaul W. (1980) A stochastic flow problem. J. Inf. Opt.
Sci. 1: 229–270.
[5] Dempster M. A. H. and Ireland A. M. (1988) A financial expert decision
support system. In Mitra G. (ed) Mathematical Methods for Decision
Support, pages 415–440. Springer-Verlag, Berlin.
[6] Dodin B. (1985) Reducibility of stochastic networks. OMEGA Int. J. Management 13: 223–232.
[7] Elmaghraby S. (1977) Activity Networks: Project Planning and Control
by Network Models. John Wiley & Sons, New York.
[8] Frantzeskakis L. F. and Powell W. B. (1989) An improved polynomial
bound for the expected network recourse function. Technical report,
Statistics and Operations Research Series, SOR-89-23, Princeton University.
Index
absolutely continuous, 31
accumulated return function, see dynamic programming
almost everywhere (a.e.), 27
almost surely (a.s.), 16, 28
approximate optimization, 218
augmented Lagrangian, see Lagrange function
backward recursion, see dynamic programming
barrier function, 97
basic solution, see feasible
basic variables, 56, 65
Bellman, 110, 115–117, 121
optimality principle, 115
solution procedure, 121
Benders decomposition, 213, 233
block-separable recourse, 233
bounds
Edmundson–Madansky upper bound, 181–185, 192, 194, 203, 234
Jensen lower bound, 179–182, 184, 185, 218, 220, 233
limited information, 234
piecewise linear upper bound, 185–190, 234
example, 187–189
stopping criterion, 212
bunching, 230
cell, 183, 190, 196, 201, 203, 212, 234
chance constraints, see stochastic program with
chance node, see decision tree
complementarity conditions, 84, 89
duality theorem
strong, 74
weak, 72
dynamic programming
accumulated return function, 117
backward recursion, 114
deterministic, 117–121
solution procedure, 121
immediate return, 110
monotonicity, 115
return function, 117
separability, 114
stage, 110
state, 110, 117
stochastic, 130–133
solution procedure, 133
time horizon, 117
transition function, 117
dynamic systems, 110–116
Edmundson–Madansky upper bound, see bounds
event, 25
event tree, 134, 135
EVPI, 154–156
expectation, 30
expected profit, 126, 128
expected value of perfect information,
see EVPI
expected value solution, 3
facet, 62, 213, 216
Farkas lemma, 75, 163
fat solution, 15
feasibility cut, 77, 103, 161–168, 173, 177, 203, 214
example, 166–167
feasible
basic solution, 55
degenerate, 64
nondegenerate, 64
basis, 55
set, 55
feasible direction
method, see methods (nlp)
financial models, 141–147
efficient frontier, 143, 144
Markowitz mean-variance, 142–143
weak aspects, 143–144
multistage, 145–147
portfolio, 142
transaction costs, 146
first-stage
costs, 15, 31
fishery model, 138, 159, 234
forestry model, 234
free variables, 54
function
differentiable, 37, 81
integrable, 30
separable, 206
simple, 28
gamblers, 128
generators, see convex
global optimization, 8
gradient, 38, 84, 226
here-and-now, 151
hindsight, 5
hydro power production, 147–150
additional details, 150
numerical example, 148–149
immediate return, see dynamic programming
implementable decision, 136, 141
indicator function, 28
induced
constraints, 43, 214
feasible set, 43
integer programming, see program
integral, 28, 30
interior point method, 233
Jensen inequality, 180, 202
Jensen lower bound, see bounds
Kuhn–Tucker conditions, 83–89
L-shaped method, 80, 161–173, 213, 217, 220, 229, 233
algorithms, 168–170
example, 172–173
MSLiP, 233
within approximation scheme, 201–203
algorithm, 203
Lagrange function, 88
augmented, 99
multipliers, 84
saddle point, 89
Lagrangian methods, see methods (nlp)
linear program
dual program, 70
multiple right-hand sides, 229–233
primal program, 70
standard form, 53
linear programming, 53–80, 103
parametric, 187
log-concave
measure, 50, 51, 103
loss function, 97
Markowitz model, 142–143
options, 144
transaction costs, 144
weak aspects, 143–144
maximum
relative, 8
mean fundamental, 29
measurable set, 24
measure, 22
natural, 24
extreme, 234
probability, 25
measure theory, 103
methods (nlp), 89–102
augmented Lagrangian, 99, 104, 137
update, 100
cutting planes, 90–93, 104
descent directions, 93–96
feasible directions, 95, 104
Lagrangian, 98–102
penalties, 97–98, 104
reduced gradients, 95, 104
minimization
constrained, 82
unconstrained, 82
minimum
global, 9
local, 8
relative, 8
model understanding, 145
monotonicity, see dynamic programming
MSLiP, 233
multicut method, 80
multipliers, see Lagrange function
multistage, see stochastic program with
nested decomposition, 233
networks
financial model, 145
PERT, see PERT
node–arc incidence matrix, see networks
nonbasic variables, 56, 65
nonlinear programming, 80–102, 104
optimality condition, 65
necessary, 84
sufficient, 84
optimality cut, 78, 103, 168–173, 177, 203
optimality principle, see Bellman
option, 46
options, 144
outcome, 25
parallel processing, 233
partition, 28, 234
curvature of function, 192
example, 197–201
look-ahead, 196, 205
point of, 212
quality of, 203–205
refinement of, 190–201, 212
penalty method, see methods (nlp)
piecewise linear upper bound, see bounds
pivot step, 68
polar cone, see convex
polar matrix, 163
generating elements, 163
polyhedral cone, see convex
polyhedral set, see convex
polyhedron, see convex
positive hull, 60
preprocessing
induced constraints, see induced constraints
simplified feasibility test, see feasibility cut
probabilistic constraints, see stochastic program with
probability
distribution, 25
space, 25
theory, 103
PROCON, 103
program
convex, 8
integer, 8, 209
best-so-far, 211
bounding, 210, 211
branch, 210
branch-and-bound, 210–212
branch-and-cut, 214, 216
branching variable, 210, 212
cutting-plane, 212, 214
facet, 213, 216
fathom, 210–212
partition, 212
relaxed linear program, 213
waiting node, 210, 211
linear, 7
mathematical, 7
nonconvex, 8, 209
nonlinear, 8
progressive hedging, see scenario aggregation
project scheduling, see PERT
QDECOM, 103, 175, 233
quasi-concave
function, 49
measure, 48, 51, 103
random
variable, 25
vector, 25
recourse
activity, 31
costs, 15, 31
expected, 15
function, 31, 160
expected, 160
matrix, 31
complete, 45, 160
fixed, 160
relatively complete, 161
simple, 34
program, 31
variable, 15
vector, 31
reduced gradient, 95
method, see methods (nlp)
regularity condition, 85, 86
regularized decomposition, 173–177, 233
master program, 174
method, 176
reliability, 18, 47
removing columns, see preprocessing
removing rows, see preprocessing
return function, see dynamic programming
risk averse, 128
risk-neutral, 128
saddle point, see Lagrange function
sampling, see stochastic decomposition
scenario, see scenario aggregation
scenario aggregation, 134–141
approximate solution, 141
event tree, 134, 135, 145
scenario, 134
scenario solution, 141
scenario analysis, 2
Schur complement, 232
second-stage
activity, 31
program, 32
separability, see dynamic programming
simplex
criterion, 65
method, 64–69
eta-vector, 232
slack node, see networks
slack variables, 54
Slater condition, 86
stage, see dynamic programming
state, see dynamic programming
stochastic decomposition, 217–223, 229, 232, 234
cut, 220–222
estimate of lower bound, 220
incumbent, 221, 223
relatively complete recourse, 217, 221
sampling, 218, 220
stopping criterion, 217, 220, 223
stochastic method, 217
stochastic program, 13
approximation schemes, 103
general formulation, 21–36
linear, 13, 33, 36, 159–161
nonlinear, 32
value of, 151–156
stochastic program with
chance constraints, see probabilistic constraints
complete recourse, 34, 160
fixed recourse, 34, 160
integer first stage, 209–217, 233
algorithms, 214
feasibility cut, 216
initialization, 216
optimality cut, 217
stopping criterion, 217
probabilistic constraints
applications, 103
closedness, 52
convexity, 49
joint, 20, 35, 36, 46
methods, 103
models, 103
properties, 46–53
separate, 36
single, 36
recourse, 16, 32, 159–236
approximations, 190–205
bounds, 177–190, 212
convexity, 36
differentiability, 38
methods, 103
multistage, 32, 33, 103, 145
nonanticipativity, 103
properties, 36–46
smoothness, 37
relatively complete recourse, 46, 161,
217
simple recourse, 34, 205–209, 234
stochastic programming
models, 9, 21–36
stochastic quasi-gradient, 228, 233
methods, 225–229
stochastic solution, 4, 5
stochastic subgradient, 229
subdifferential, 226
subgradient, 226
methods, 228
support of probability measure, 43
supporting hyperplane
of convex function, 81
of convex set, 91
time horizon, see dynamic programming
transition function, see dynamic programming
trickling down, 230, 232, 234
unbounded solution, 168
utility function, 127
vertex, see convex
wait-and-see solution, 13, 103, 178, 190