SMo Notes1

Stochastic Modelling
ii
Preface
These notes are based on Dr. Burak Büke’s notes. The material is also covered and expanded
in the excellent book
• R. Durrett: Essentials of Stochastic Processes, Springer, 2012.
A much more extensive discussion can be found in
• V. Kulkarni: Modeling and Analysis of Stochastic Systems, CRC Press, 2010.
To brush up your basic probability knowledge revisit your second year textbook or any equiv-
alent textbook
• S. Ross: A First Course in Probability, Pearson, 2013.
• R.B. Ash: Basic Probability Theory, Dover, 2008.
This course does not use measure theoretic foundations, and we’ll do just fine without it. It is,
however, covered in many excellent books about rigorous probability, for example
• R.B. Ash, C. A. Doléans-Dade: Probability and Measure Theory, Dover, 2008.
• R. Durrett: Probability: Theory and Examples, Cambridge University Press in 2010.
• K.B. Athreya, S.N. Lahiri: Measure Theory and Probability Theory, Springer, 2006.
in which books short but rigorous treatments of stochastic processes can also be found.
Please report mistakes to tibor.antal@ed.ac.uk. You can find me in my office at JCMB

5403.
Edinburgh, January 10, 2019 Tibor Antal
iii
iv
Contents
Preface iii
1 Preliminaries 1
1.1 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Stochastic Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Discrete Time Markov Chains 9

2.1 Basic Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Modelling Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 More Complicated Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Chapman-Kolmogorov Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 Classification of States . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6 Transience and Recurrence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.7 Positive and Null Recurrence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.8 Periodicity of Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.9 Stationary Probabilities: Aperiodic Case . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.10 Stationary Probabilities: Periodic Case . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.11 First Passage Probabilities and Times . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.12 Costs and Rewards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.12.1 Long-run Average Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.12.2 Cost in transient states . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.13 Reversibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3 Poisson Processes 37
3.1 Exponential Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.1.1 Memoryless Property . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.1.2 Properties of Minimum of Two Exponentials . . . . . . . . . . . . . . . . . . 38
3.1.3 Strong Memoryless Property . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.1.4 Sums of I.I.D. Exponentials . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2 Poisson Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3 Superpositioning and Splitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4 Campbell’s Theorem: Uniform Order Statistics . . . . . . . . . . . . . . . . . . . . . 45
3.5 Nonhomogeneous Poisson Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.5.1 Event Times for Nonhomogeneous PP . . . . . . . . . . . . . . . . . . . . . . 49
v
3.6 Compound Poisson Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4 Continuous Time Markov Chains 51

4.1 Basic Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2 Chapman-Kolmogorov Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3 Backward and Forward Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4 Transience and recurrence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.5 Stationary probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.6 First Passage Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.7 Costs and Rewards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Appendices
Appendix A Review of Probability 71

A.1 Axioms of Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
A.2 Conditional Probability and Independence . . . . . . . . . . . . . . . . . . . . . . . . 71
A.3 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
A.4 Continuous random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
A.5 Jointly Distributed RVs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
vi
Chapter 1
Preliminaries
Dynamical systems that are driven by random factors appear in many contexts in real life.
Apart from many problems in physics, chemistry and biology; manufacturing and service sys-
tems, telecommunication systems and financial markets are just a few examples. In this course,
our goal will be to introduce the basic tools for modelling and analyzing such systems. For this
purpose, our main tools will be the concepts of conditional probability and conditional expec-
tation. Hence we start by reviewing these subjects. The notes below assume that the reader is
familiar with the basic terminology of probability theory, which is reviewed in Appendix A.
1.1 Conditional Probability

The probability that we assign to an event depends on the information we have about the
experiment. As we gather more information about the outcome of the experiment, we need a
way to update the probabilities.
Example 1.1.1. Suppose a fair die has been tossed. All the outcomes in the sample space Ω =
{1, 2, 3, 4, 5, 6} are equally likely. The probability of an even number appearing is 1/2. Now sup-
pose we are given the information that the outcome was a prime number. Based on this information
our sample space reduces to be {2, 3, 5}. Since all numbers are equally likely we would want to
assign a probability of 1/3 for getting an even number after this new information is revealed.
We need to come up with a general tool that will handle all such cases. Hence, conditional
probability is introduced.
Definition 1.1.2. The conditional probability of event A given B (such that P(B) > 0) has occured,
denoted P(A|B), is defined as
P(A ∩ B)
P(A|B) ≡ .
P(B)
Now let’s check if this definition matches our intuition. First, if we are given that event B
has happened, then probability of B should be updated as 1. We can see this as
P(B ∩ B) P(B)
P(B|B) = = = 1.
P(B) P(B)
1
We can also confirm the result we derived intuitively for Example 1.1.1 by taking A = {2, 4, 6}
and B = {2, 3, 5}.
1
P(A ∩ B) P({2}) 6 1
P(A|B) = = = 1 = .
P(B) P({2, 3, 5}) 2
3
Two events, A and B, are said to be independent if P(A ∩ B) = P(A)P(B). Intuitively, inde-
pendence means whether A happens or not does not depend on the occurrence of B. In other
words, the information we get about the occurrence of B should not affect the probability of
A. By calculating
P(A ∩ B) P(A)P(B)
P(A|B) = = = P(A),
P(B) P(B)
we see that this is indeed the case.
Remark. Above (and throughout the course) we use our intuition to guess the result that
we should obtain. However, intuition may mislead us in some occasions. Therefore, we can
use our intuition to determine how to approach the problem and to guess the possible results,
but we should stick to mathematical definitions and theoretical proofs on our conclusions.
Exercise 1.1.3. If A and B are independent events and P(B) 6= 1, prove that P(A|B c ) = P(A),
where B c is the complement of event B.
We can also proceed in the opposite way to deduce the overall probability of an event using
conditional probabilities.
P(A) = P(A ∩ Ω) = P(A ∩ (B ∪ B c ))

= P((A ∩ B) ∪ (A ∩ B c )) = P(A ∩ B) + P(A ∩ B c ) (1.1)
= P(A|B)P(B) + P(A|B c )P(B c ).
For conditional probabilities to make sense, we needed 0 < P(B) < 1. Generalizing this idea,
we get the following theorem.
Theorem 1.1.4. Law of Total Probability. For events B1 , B2 , · · · , Bn (all withSn non-zero proba-
bilities) that partition the sample space Ω ( that is Bi ∩ B j = ; for i 6= j, and i=1 Bi = Ω) and for
any event A
Xn
P(A) = P(A|Bi )P(Bi ).
i=1
Proof. Repeat the steps in (1.1) for general n.

Example 1.1.5. We have three urns with white and black balls in them. In urn 1, there are 2
black and 6 white balls, there are 5 black and 4 white balls in urn 2, and 5 black balls and 3
white balls in urn 3. If we choose one of the urns at random and pick a ball from it, what is the
probability that it will be black?
To solve this example, we use the law of total probability. Let A be the event for choosing a
black ball, and Bi be the event for choosing urn i. Then
P(A) = P(A|B1 )P(B1 ) + P(A|B2 )P(B2 ) + P(A|B3 )P(B3 )

= 82 31 + 59 13 + 85 13
= 103
216 .
2
Note that if we aggregated all the balls in one urn, the probability of choosing a black ball would be
12
25 , which is considerably different than the result we found. This is because we perform a different
experiment with a different setting.
Now we will consider a simple model of stock market. As this example includes time dy-
namic behaviour, it will give us a flavour of the upcoming subjects.
Example 1.1.6. Suppose you currently have 100 pounds invested in the stock market. At each
day, you will either gain or lose 1 pound each with probability 0.25 or stay at the same wealth
with probability 0.5. Find the probability of having 101 pounds at the end of day 2.
We denote our wealth at the end of day n to be X n , hence our current wealth is X 0 = 100. Now
use the law of total probability conditioning on the wealth at the end of day 1, X 1 . (That is the
three events we’ll use, which partition Ω are {X 1 = 99}, {X 1 = 100}, and {X 1 = 101}.)
P(X 2 = 101|X 0 = 100) = P(X 2 = 101|X 1 = 99, X 0 = 100)P(X 1 = 99|X 0 = 100)
+P(X 2 = 101|X 1 = 100, X 0 = 100)P(X 1 = 100|X 0 = 100)
+P(X 2 = 101|X 1 = 101, X 0 = 100)P(X 1 = 101|X 0 = 100)
= 0 × 0.25 + 0.25 × 0.5 + 0.5 × 0.25
= 0.25.
Note that we used the total probability rule here for a conditional probability, which we can
do, since conditional probability is also a probability. We can also calculate P(X 2 = 98|X 0 =
100), P(X 2 = 99|X 0 = 100), P(X 2 = 100|X 0 = 100) and P(X 2 = 102|X 0 = 100) in the same
manner. Then, we can calculate e.g. P(X 3 = 99|X 0 = 100) via
102
X
P(X 3 = 99|X 0 = 100) = P(X 3 = 99|X 2 = m, X 0 = 100)P(X 2 = m|X 0 = 100).
m=98
Note that the limits of summation covers all the possible values that X 2 can take. So we didn’t sum
over all possible paths here, but chose a simpler way: we conditioned on the values of X 2 . Using
the same methodology iteratively, we can calculate the probabilities for all X i .
Example 1.1.7. The number of people arriving at an ice cream parlor in a day is a Poisson random
variable with rate λ. The arriving person is a girl with probability p, and a boy otherwise. What
is the probability that exactly i girls and j boys arrive in a day?
Let X , Y and N be the number of girls, boys and the total number of people arriving in a day.
Obviously, N = X + Y . We are asking for the probability
P(X = i, Y = j) = P(X = i, Y = j, N = i + j) = P(X = i, Y = j|N = i + j)P(N = i + j)
For the first term, if N = i + j then the number of girls is a binomial (i + j, p). For the second term
we just use the Poisson (λ) distribution. Hence
i+ j i λi+ j (λp)i −λp (λ(1 − p)) j −λ(1−p)

P(X = i, Y = j) = p (1 − p) j × e−λ = e × e
i (i + j)! i! j!
Calculating the marginal distributions P(X = i) and P(Y = j) by summation, we obtain that X
and Y are Poisson variables with parameters λp and λ(1 − p)
(λp)i −λp (λ(1 − p)) j −λ(1−p)
P(X = i) = e , P(Y = j) = e
i! j!
3
and they are independent, since P(X = i, Y = j) = P(X = i)P(Y = j). This result states that
the number of girls and boys arriving at an ice cream parlor in a day follow independent Poisson
distributions with rate pλ and (1 − p)λ respectively.
The conclusion of Example 1.1.7 is important and will be used extensively when we are
studying Poisson processes and continuous time Markov chains. There is also an interesting
aspect of the derivation. The randomness in this example is observed in two folds, the total
number of people and number of girls. We realized that we can find the distribution of the
number of girls if we were to know the number of people. Then we used conditioning argu-
ment. The conclusion is when we have several sources of randomness, we can use conditioning
to get randomness under control.
Conditional probability can also be used with continuous random variables. Let f X (x) and
f Y ( y) be the marginal densities for continuous random variables X and Y respectively and
f X ,Y (x, y) (often denoted simply f (x, y)) isRthe joint density. The conditional density function
of X is a density f (x|B), so P(X ∈ A|B) = A f (x|B)d x for any A ⊂ R, and B event. It can be
defined even if P(B) = 0 as
f X ,Y (x, y)
f (x|Y = y) = .
f Y ( y)
To justify this formula consider
f (x|Y = y)d x ≈ P(x ≤ X ≤ x + d x| y ≤ Y ≤ y + d y)
P(x≤X ≤x+d x, y≤Y ≤ y+d y) f X ,Y (x, y) d x d y
= P( y≤Y ≤ y+d y) ≈ f Y ( y) d y
To help memorize this formula note that it is identical to that of the discrete case for the joint
mass function pX ,Y (x, y) = P(X = x, Y = y), the marginal mass function pY ( y) = P(Y = y),
and the conditional probability mass function p(x|Y = y) = P(X = x|Y = y), where
pX ,Y (x, y)
p(x|Y = y) = .
pY ( y)
Note that in both the continuous and the discrete case the conditional probability is just the
probability of the event and the condition happening together, normalized to one (so “proba-
bilities add up to one"). Check this normalization in both cases.
With these new tools we can generalise the total probability rule to continuous variables.
Theorem 1.1.8.
= y p(x|Y = y)p( y)
P
p(x) X , Y discrete
R
p(x) = Pp(x|Y = y) f ( y)d y X discrete, Y continuous
f (x) = y f (x|Y = y)p( y) X continuous, Y discrete
R
f (x) = f (x|Y = y) f ( y)d y X , Y continuous
Proof. We’ve already seen the first case, so let’s prove only the last case here. For any A ⊂ R
let’s write P(X ∈ A) in two ways
Z Z∞
P(X ∈ A) = f (x)d x = f (x, y) d x d y
−∞
ZA∞ Z Z Z ∞
= f (x|Y = y) f ( y) d x d y = f (x|Y = y) f ( y) d y d x
−∞ A A −∞
4
R
which is true for any A, hence f (x) = f (x|Y = y) f ( y)d y.
The next example is an application of the second case of Theorem 1.1.8.
Example 1.1.9. Suppose n independent Bernoulli experiments are performed with unknown suc-
cess probability Q, where Q is uniformly distributed between 0 and 1. What is the probability of
getting exactly m successes out of these n trials?
Similar to Example 1.1.7, we have two sources of randomness and once we condition on the
value of Q, we can easily calculate the probability of m successes. Hence, for m = 0, . . . , n we have
R1
P(X = m) = P(X = m|Q = p) f (p)d p
R01 n

= 0 m p m (1 − p)n−m d p
1 1 1
R
= m!(n−m)!
n! m m−1
n−m+1 n−m+2 · · · n 0 (1 − p) d p =
n
n+1
The last but one equality follows by using integration-by-parts repeatedly. (There is also a nice
intuitive explanation for this simple result; try to find it out).
1.2 Conditional Expectation

The expectation of a random variable after some information is revealed, called conditional
expectation, is calculated by using the conditional probabilities. Letting S be the set of all
values X can take, p(x|Y = y) be the conditional probability mass function, and f (x|Y = y)
be the conditional density, then
E(X |Y = y) = x∈S x p(x|Y = y)

P
if X is discrete
R∞
E(X |Y = y) = −∞ x f (x|Y = y)d x if X is continuous
The value of the conditional expectation depends on the value of Y , that is it is a function
of Y . Hence, if we do not specify the value of Y but keep it random, we can treat conditional
expectation E(X |Y ) as a random variable.
Example 1.2.1. X is a Poisson random variable with unknown rate Λ. Λ is uniformly distributed
between 3 and 5. If we were to know the exact value of Λ, we would say E(X |Λ) = Λ, hence we
treat E(X |Λ) as a random variable distributed uniformly between 3 and 5.
Example 1.2.2. Flip two fair coins, and let X i be one if the i-th coins comes up heads, and zero
otherwise. Let us condition on the values of X 1 and calculate the conditional expectation of the
total number of heads
E(X 1 + X 2 |X 1 = 0) = E(X 1 |X 1 = 0) + E(X 2 |X 1 = 0) = 0 + EX 2 = 1/2

E(X 1 + X 2 |X 1 = 1) = 1 + EX 2 = 3/2
Hence E(X 1 + X 2 |X 1 ) is a random variable which is 1/2 if the first coin is tails, and 3/2 if the first
coin is heads. In other words, E(X 1 + X 2 |X 1 ) = X 1 + EX 2 .
5
Theorem 1.2.3. Tower Property. For X and Y random variables
E[E(X |Y )] = E(X ),
or written out in detail
y∈S E(X |Y = y)pY ( y)

P
if Y is discrete
E[E(X |Y )] = R ∞ .
−∞
E(X |Y = y) f Y ( y)d y if Y is continuous
Proof. The proof for the discrete case is quite short

X
E[E(X |Y )] = E(X |Y = y)P(Y = y)
y
XX
= x P(X = x|Y = y)P(Y = y)
y x
X X
= x P(X = x, Y = y)
x y
X
= x P(X = x) = EX
x
where the summations go over all possible values. For the continuous case one should replace
the sums with integrals. For infinite state spaces though the change of the order of summation
or integration should be justified.
Note that the tower property is just the total probability rule for expectations. An alternative
proof is to start from the total probability rule
X
P(X = x) = P(X = x|Y = y)P(Y = y)
y
multiply both sides by x, and sum over all values of x. To complete this proof, we assume
again that we can change the order of summation even for infinite state spaces.
Example 1.2.4. A student will choose to read a book out of two books. The expected time to finish
the first book alone is 1 week, but for the longer second book alone it is 3 weeks. The student is
inclined to choose the second book and will start reading it with probability 23 . What is the expected
time to finish the chosen book?
Let X be the time in weeks to finish the chosen book and Y be 1 or 2 depending on the book.
Then using tower property
1 2 7
E(X ) = E(X |Y = 1)P(Y = 1) + E(X |Y = 2)P(Y = 2) = 1 × + 3 × = weeks.
3 3 3
Tower property is a powerful tool that sometimes enables us to compute expectations even
when we do not know conditional expectations explicitly.
Example 1.2.5. A miner is trapped in a mine. There are three paths that the miner can choose
to pursue. The first path takes the miner back to the mine after 7 minutes, the second path brings
the miner back after 15 minutes and the third path leads out in 5 minutes. If at each time, the
6
miner chooses one of the paths arbitrarily, what is the expected time for the miner to get out of the
mine?
Let X be the time to get out of the cave and Y be the path the miner chooses initially. For the
first and second paths, the miner spends some time (7 or 15 minutes) and then he is back to his
starting situation. Hence, we can write
E(X |Y = 1) = 7 + E(X )
E(X |Y = 2) = 15 + E(X )
E(X |Y = 3) = 5
E(X ) = E(X |Y = 1)P(Y = 1) + E(X |Y = 2)P(Y = 2) + E(X |Y = 3)P(Y = 3)
= (7 + E(X )) 13 + (15 + E(X )) 13 + 5 13
3E(X ) = 2E(X ) + 27
E(X ) = 27.
Example 1.2.6. Conditional Variance Formula. Consider two random variables X and Y . We
define conditional variance as
Var(X |Y = y) = E(X 2 |Y = y) − (E(X |Y = y))2 .
Similarly, we treat Var(X |Y ) as a random variable. Then
Var(X ) = EX 2 − (EX )2
= E E(X 2 |Y ) − (E E(X |Y ))2
= E E(X 2 |Y ) − E(E(X |Y )2 ) + E(E(X |Y )2 ) − (E E(X |Y ))2
= E[E(X 2 |Y ) − (E(X |Y ))2 ] + Var(E(X |Y ))
= E(Var(X |Y )) + Var(E(X |Y ))
1.3 Stochastic Processes

Roughly speaking, a stochastic process is some randomly changing quantity. For example in
Example 1.1.6, we can consider the set of random variables corresponding to the wealth at the
end of each day. The subscript (index in general) n shows us which day each random variable
is referring to. In real life we encounter many such occasions where we need to consider an
indexed collection of random variables.
Definition 1.3.1. A stochastic process (X t ) t∈T is an indexed collection of random variables. Set
T is called the index set. As in Example 1.1.6, the index is in general interpreted as time and the
value X t is referred as the state of the system at time t. The set S of all possible states is referred
as the state space of the process.
If the state space S is countable, we refer the stochastic process as a discrete state space
process. If the state space is continuum, then it is called a continuous state space process. In
this class, we will only deal with discrete state space processes. Similarly, if index set T is
countable, then the process is said to be a discrete time process, and if T is continuum it is
called a continuous time process.
7
Example 1.3.2. Consider the system in Example 1.1.6 where X n is our wealth at stock market
at the end of day n and assume that we stay at 0 wealth once we reach lose all of our wealth.
The indexing is done per day hence index set is T = {0, 1, 2, · · · }. We can only get whole pounds
hence the state space is S = {0, 1, 2, · · · }. (X n )n∈N is a discrete time, discrete state space stochastic
process.
Example 1.3.3. Suppose we are recording whether a day is mostly sunny(S), cloudy(C), rainy(R)
or snowy(W ) each day. Hence, (X n )n∈N is a discrete time (each day) and discrete state space
process.
Example 1.3.4. Just like the example above, each day we are recording the average temperature.
Since the measurements are daily this is a discrete time process, however since temperature can be
any real number over an interval state space is continuous
Now consider a queue in front of a cash machine. People arrive at the cash machine, and if
nobody is there they withdraw money and leave, if the cash machine is being used, they form
a queue and wait till they withdraw their cash. We will describe some stochastic processes
related to this phenomenon.
Example 1.3.5. Waiting Time. Number the customers in the order they used the cash machine
since it started operating. Let Wn be the amount of time nth customer waited before being served.
Index set is T = {1, 2, 3, · · · } and the waiting time can be any nonnegative real number. So (Wn )n∈N
is said to be a discrete time continuous state space process (even though index set does not represent
time).
Example 1.3.6. Arrival Time. Using the same index set as above, let An be the arrival time of the
nth customer. Since customers can arrive at any point in time, we have a discrete time continuous
state space process.
Example 1.3.7. Let Cn be the number of customers waiting in front of the queue when nth cus-
tomer arrived. For this process, the state space is S = {0, 1, 2, 3 · · · }, hence this is a discrete time
discrete state space process.
Example 1.3.8. Queue Size. Let Nt be the number of customers in front of the cash machine at
time t. This is a continuous time discrete state space process. Notice that even though the state
of the system changes at discrete time points when customers arrive or leave, there is a queue size
corresponding to each time point.
8
Chapter 2
Discrete Time Markov Chains
Our goal is to analyze the stochastic process to obtain expected values and probabilities of
interest. We can develop methodologies if we impose some special structure over the process.
In this chapter, we consider discrete time discrete state space processes which has a special
property called “Markov Property”.
2.1 Basic Definitions

Definition 2.1.1. A stochastic process is said to have Markov property if given the present state, the
future events are independent of the past. For discrete time discrete state space processes (X n )n∈N
this property can be stated as
P(X n+1 = j|X n = i, X n−1 = in−1 , · · · , X 0 = i0 ) = P(X n+1 = j|X n = i)
for all j, i, in−1 , . . . , i0 ∈ S and n ∈ N, and we also define pi j (n) ≡ P(X n+1 = j|X n = i) and refer to
it as the (one step) transition probability from i to j at time n.
Henceforth, a discrete time, discrete state space process with Markov property will be re-
ferred to as Discrete Time Markov Chain (DTMC). Since we are dealing with probabilities we
should have for all n ∈ N that
X
pi j (n) ≥ 0, for all i, j ∈ S and pi j (n) = 1.
j∈S
Another important point is the dependency of transition probabilities on the time point n. If
the transition probabilities do not depend on the time point, that is pi j (n) = pi j for all i, j ∈ S
and n, we say that the DTMC is time-homogeneous. In this course, we assume that we deal
with time-homogeneous DTMCs unless otherwise stated.
2.2 Modelling Examples

To model a real life system as a DTMC, we first need to define the state space S in such a
way to have Markov property. After the state space is properly defined, then the transition
9
probabilities should be stated. The transition probabilities can be written in matrix form
 
p11 p12 · · ·
P =  p21 p22 · · ·  .
.. .. . .
. . .
Note that the indexing of the matrix element may start from zero, or we may also use negative
integers for convenience. The visual tool named “state diagram” can also be used to describe
a DTMC model.
Example 2.2.1. Simple Weather Forecast Model. Suppose we have only two possibilities for
weather in a day, namely sunny(s) and rainy(r). If a day is sunny the next day is rainy with
probability α and if a day is rainy the next day is sunny with probability β. Model this system as
a DTMC.
First we need to define the state space. Let’s use the intuitive definition S = {s, r}, where X n
will show whether day n is sunny or rainy. At this point we need to be careful to make sure that
this state space definition provides us with Markov property. If X n is given to be rainy or sunny
that is enough to determine what will happen later. Hence, we have the Markov property. The
transition matrix is then given as
1−α α

P= .
β 1−β
Instead of writing the P matrix, we can use the following state diagram to describe our DTMC
1−α 1−β
α
s r
β
Figure 2.1: State diagram for the basic weather model
Example 2.2.2. Gambler’s Ruin Problem. A gambler starts gamblings with some initial wealth
0 ≤ X 0 ≤ N . At each game, she wins 1 pound with probability p and loses with probability 1 − p.
The gambler decides to leave the game either when she is bankrupt, that is reaches 0 wealth, or
when she makes a wealth of N pounds. Model this system as a DTMC.
We can define X n as the wealth of the gambler at the end of n-th game, and the state space as
S = {0, 1, 2, · · · , N }. Since each game is independent of the others, we can easily see that we have
Markov property. The transition probabilities can be written as
pi,i−1 = 1 − p and pi,i+1 = p if 1 ≤ i ≤ N − 1,

p00 = 1,
pN N = 1,
10
1 1
p p p
0 1 2 3 4
1−p 1−p 1−p
Figure 2.2: State diagram for gambler’s ruin problem for N = 4
and all the other transition probabilities are 0. The state diagram is depicted on Figure 2.2.
To specify the example further, we take for N = 4 and p = 0.7. Now The 5 by 5 transition
matrix is  
1 0 0 0 0
 0.3 0 0.7 0 0 
P =  0 0.3 0 0.7 0 .
 
 0 0 0.3 0 0.7 
0 0 0 0 1
Notice how all rows sum up to one.
Example 2.2.3. Random Walk – Walk of a Drunken Man. Random walk is one of the simplest
stochastic processes which has applications in many different areas. It is basically the Gambler’s
Ruin problem without the “boundaries". Consider a man trying to walk on a straight line. However,
as drunk as he is, at each step he is moving right or left with probabilities p and 1 − p respectively.
Model the distance from desired path as a DTMC.
Let X n be the position of our drunken man with respect to his desired path. Then the state
space is S = {· · · , −2, −1, 0, 1, 2} , that is X n = −3 means that at the end of nth step, he is 3 steps
left to his desired path. Since each step is independent of others we have the Markov property. And
the transition probabilities are given by
pi,i+1 = p,
pi,i−1 = 1 − p
and all the other probabilities are 0.
Exercise 2.2.4. Important! Consider the random walk example with p 6= (1 − p) 6= 0.5 and
define X n to be the distance of the man from his desired path with S = {0, 1, 2, · · · }, that is X n = 2
means the distance to his desired path is 2 steps, but we do not know whether he is on right or left
of his desired path. Calculate the transition probability pi j = P(X n+1 = j|X n = i). Does process
(X n )n∈N have Markov property? (The answer to this question is obviously “no”, but why?)
Example 2.2.5. Branching Process. At each hour, a bacterial cell can divide into two to form two
young bacterias with probability p. Otherwise it stays undivided with probability 1 − p. Assume
that whether a bacteria divides or not is independent of other bacterial cells, and also independent
of the cell’s behavior in the previous hours. We model the systems as X n , the number of bacteria in
the population at the end of n-th hour as a DTMC.
11
From the description of the problem, we can see that a bacteria either divides or stays the
same and there are no other options. The number of cells can be any non-negative integer, so the
state space is S = {0, 1, 2, · · · }. Markov property follows from the assumption of independence. If
X n = i, we can conclude the following
• Since no deaths occur, X n+1 ≥ i
• Since at most i divisions can occur X n+1 ≤ 2i
• If X n+1 = j, i ≤ j ≤ 2i, then this means j − i cell divisions have occurred. We can think
of this situation as out of i trials (possible divisions), j − i of them has actually succeeded,
which indicates a binomial distribution.
Hence, we can write the transition probabilities as

i
j−i p
j−i
(1 − p)2i− j if i ≤ j ≤ 2i
pi j = .
0 otherwise
This includes the particular case of no initial cell, for which p00 = 1 and p0 j = 0 for j ≥ 1.
Example 2.2.6. Infection model. Consider a population of N people. One person from the
population gets infected with a disease that has never been seen by the population at time 0. The
person stays sick for one period and passes the disease to the people he contacted over this period,
that is contacted individuals will be sick in the next period. Supposing that the probability that
any two individuals contact each other over a given period is p, develop a DTMC to model the
situation of the disease in the population.
In this problem, we define a three dimensional state X n = (I n , Sn , R n ), where
I n : Number of infected individuals at period n.

Sn : Number of susceptible individuals at period n.
R n : Number of removed individuals at period n.
Notice that number of people in the society stays the same over all periods, which means I n + Sn +
R n = N , for all n ≥ 0. Hence, state space is S = {(i, s, r) ∈ Z3≥0 |i + s + r = N }, where Z3≥0 is
three dimensional vectors of nonnegative integers. Suppose we are at state (i, s, r) at time period
n. Then, we know for sure that R n+1 = r + i, since all the currently infected individuals will be
removed. Each of the s susceptible individuals get away without contacting any infected individual
with probability (1 − p)i . So each individual gets infected in the next period with probability
q = 1 − (1 − p)i . Notice that the probability of j individuals getting infected in the next period is
a binomial random variable and

s j
P(X n+1 = ( j, s − j, r + i)|X n = (i, s, r)) = q (1 − q)s− j for j = 0, · · · , s.
j
The other cases have 0 probability.
Example 2.2.7. Wright Fisher model. This is the most popular model in Population Genetics.
It describes the evolution of M individuals (their genes) where each of them can be either normal
(called wild-type) or mutant. Let X n be the number of mutants at the n-th generation, so the state
12
1st Day 2nd Day P(Sunny in 3rd day) P(Rainy in 3rd day)
S S 0.9 0.1
S R 0.4 0.6
R S 0.6 0.4
R R 0.2 0.8
Table 2.1: Probabilities for Weather Forecasting-2
space is S = {0, 1, . . . , M }. In the next generation there are again M individuals, and each of
them independently choose a parent from the previous generation (the parents then die). Normal
parents have normal children, mutant parents have mutant children. If there are i mutants in the
old generation, each individual in the next generation will be mutant with probability p = i/M ,
hence the total number of mutants is a binomial (M , p), that is the probability that there are j
mutants in the next generation is
i M−j
j
M i
pi, j = 1−
j M M
for all i, j ∈ S.
2.3 More Complicated Examples

We defined stochastic processes in such a way that at a given time n the state of the system is
a random variable. This property can be extended, and in many applications the state can be
described as a random vector. We consider now two examples for this.
Example 2.3.1. Weather Forecasting-2 In this example, we are trying to model the system where
weather condition on a given day depends on the previous two days, that is the probability of a
day being rainy or sunny depends on previous two days as shown in Table 2.1. Model this system
as a DTMC.
0.1
0.9 (s, s) (s, r)
0.4
0.6 0.6
0.4
(r, s) 0.2
(r, r) 0.8
Figure 2.3: State diagram for Weather Forecast-2
13
Notice that if the system state on day n is defined to be the weather condition at time n, then
the stochastic process do not have Markov property. The trick here is to append the state definition
to capture the weather condition in the last two days. Hence define
X n = (i1 , i2 ) if weather condition an day n − 1 is i1 and on day n is i2 .
To see this suppose X n = (s, r), this means the condition on day n − 1 is sunny and on day n is
rainy. The probability that day n + 1 is observed to be rainy is 0.6 and if this is observed then
X n+1 = (r, r). So our state space is
S = {(s, s), (s, r), (r, s), (r, r)}
and transition probabilities are given on the figure.
Example 2.3.2. Pólya Urn model In this model we take an urn (just a box really) and put a
black and a white ball in it. At each time step we draw a ball randomly and put it back with an
additional ball of the same color. Hence the total number of balls increases by one at each time step.
But to describe the state of the system at time n we need to know more, for example the number of
black and white balls, Bn and Wn . Our process is then (Bn , Wn )n∈N . Initially (B0 , W0 ) = (1, 1). The
state space is S = {(b, w) : b, w ∈ Z+ }, that is all possible pairs of positive integers. At each time
step we either draw a white ball, hence increase the number of white balls, or we draw a black ball
hence increase the number of black balls. The one step transtion probabilities from (b, w) are
p(b,w),(b+1,w) = b/(b + w)
p(b,w),(b,w+1) = w/(b + w)
for b, w ≥ 1, and all other transition probabilities are zero.

Alternatively, we could model this urn as a simple time dependent DTMC, and follow only the
number of black balls Bn at time n, so the state space is Z+ . At time n = 0 we have a black and a
white ball, so B0 = 1. Since at each time we add an extra ball, the total number of balls is always
n + 2. Hence
pi,i+1 (n) = i/(n + 2)
pi,i (n) = (n + 2 − i)/(n + 2)
for i ≥ 1, and all other transition probabilities are zero. Note that in this case pi, j (n) generally
depends on time, so this chain is not time homogeneous.
2.4 Chapman-Kolmogorov Equations

For time-homogeneous systems, we have introduced one-step transition probability to be pi j =
P(X n+1 = j|X n = i) = P(X 1 = j|X 0 = i), that is if the state at a given period is i, then the
probability of observing j in the following period is pi j . Generalizing this statement, now
we ask “If the state of a system is given to be i at a certain period, what is the probability
(n)
that n periods later we will observe state j ?”. This probability will be denoted pi j and in
mathematical terms,
(n)
pi j ≡ P(X n = j|X 0 = i) = P(X m+n = j|X m = i)
14
for a time-homogeneous system.
Let us try to break down the probability of being somewhere after n + m steps into what
happens in the first n steps and what happens in the remaining m steps. For this we use the
law of total probability on the probability of being in state j after m + n periods given that we
start at i, by conditioning on all possible states after n steps.
X
(n+m)
pi j = P(X n+m = j|X n = k, X 0 = i)P(X n = k|X 0 = i).
(n)
k∈S,pik 6=0
(n)
Here we didn’t sum over unreachable states for which pik = 0, so the conditional probability
is defined. Using Markov property and time-homogeneity the first factor becomes
(m)
P(X n+m = j|X n = k, X 0 = i) = P(X n+m = j|X n = k) = P(X m = j|X 0 = k) = pk j
Putting this back to the sum we get the following.
Theorem 2.4.1. Chapman-Kolmogorov Equations.

X
(n+m) (n) (m)
pi j = pik pk j .
k∈S
(n)
Defining, P (n) to be the matrix with pi j as the entry in ith row and jth column, in matrix
form we get
P (n+m) = P (n) P (m) .
Note that we extended the sum to all states, as that just means adding zeros. Intuitively the
above equation says that in order to get from i to j in two days, we have to pass through some
state k after one day.
Corollary 2.4.2. If P n is defined to be the nth power of matrix P, then P (n) = P n .
Proof. We will use induction to prove our corollary. First P (1) = P and using Chapman-Kolmogorov
equations P (2) = P P = P 2 . Now assume P (n) = P n , then
P (n+1) = P (n) P = P n P = P n+1 .
Example 2.4.3. Simple Weather Example-Revisited. Find the probability that it rains two days
after a sunny day, P(X 2 = 1|X 0 = 0). Here we interpret s ≡ 0 as sunny, and r ≡ 1 as rainy. We
know
1−α α

P= .
β 1−β
Then,
(1 − α)2 + αβ α(2 − α − β)

(2)
P =P = 2
β(2 − α − β) (1 − β)2 + αβ
So P(X 2 = 1|X 0 = 0) = α(2 − α − β). By simply calculating higher powers of the matrix P we can
(n)
calculate all probabilities pi j .
15
Using Chapman-Kolmogorov equations we can answer questions about finite dimensional
probabilities.
Example 2.4.4. If it’s rainy on Monday, what’s the probability that the weekend will be all sunny?
An optimist’s formulation of the question. This translates to
P(X 7 = 0, X 6 = 0|X 1 = 1) = P(X 7 = 0|X 6 = 0, X 1 = 1)P(X 6 = 0|X 1 = 1) =
where we used the definition of conditional probability (in this form also called the multiplication
rule). Now let’s use the Markov property and the time homogeneity
= P(X 7 = 0|X 6 = 0)P(X 6 = 0|X 1 = 1) = P(X 1 = 0|X 0 = 0)P(X 5 = 0|X 0 = 1)
Now we can express the result in terms of powers of P as

(5)
P(X 7 = 0, X 6 = 0|X 1 = 1) = p10 p00 = [P 5 ]1,0 [P]0,0
where for any matrix A by [A]i, j we mean the ai, j element.
What if we don’t know the initial condition precisely? But we at least know the probability
of all possible initial conditions (say, in the office we are not sure if it’s sunny outside today,
but we think there’s a 90% chance that it’s sunny). Let us denote the probabilities of the initial
(0)
conditions as ak = P(X 0 = k). Now consider a(0) to be a row vector, e.g. in the special case of
S = {0, 1, 2, . . . m} we have
(0) (0) (0)
a(0) = (a0 , a1 , . . . , am )
Example 2.4.5. What is the probability that it’s rainy in two days, if it’s rainy today with 90%
chance? We use again the Total Probability rule
P(X 2 = 1) = P(X 2 = 1|X 0 = 0)P(X 0 = 0) + P(X 2 = 1|X 0 = 1)P(X 0 = 1)

(2) (0) (2) (0)
= p0,1 a0 + p1,1 a1
(0) (2)
= i∈S ai pi,1
P
= [a(0) P 2 ]1
Here a(0) P 2 is a row vector times a square matrix, so the result is another row vector, and we take
the first component of that.
Similarly to the initial condition we can define the row vector a(n) with component k being
(n)
the probability of being in state k at time n, that is ak = P(X n = k). Now these probabilities
change in one step – by definition – according to the one-step transition probabilities
a(n+1) = a(n) P
In the weather example this means that

(n+1) (n+1) (n) (n)a00 a01
(a0 , a1 ) = (a0 , a1 )
a10 a11
(n) (n) (n) (n)
= (a0 a00 + a1 a10 , a0 a01 + a1 a11 )
16
This has an intuitive meaning: We are in state 0 in the next step, if we were in state 0 in the
previous step and we stayed there, or if we were in state 1 but we moved to state 0, that is
(n+1) (n) (n)
a0 = a0 a00 + a1 a10 . Same can be said for state 1.
Since the n-steps probabilities are given by P n , the previous Example 2.4.5 generalizes into
the elegant formula
a(n) = a(0) P n
Example 2.4.6. For a more involved example consider a DTMC and find an expression for P(X 2 =
i2 , X 5 = i5 , X 10 = i10 ). Write
P(X 2 = i2 , X 5 = i5 , X 10 = i10 ) = P(X 10 = i10 |X 2 = i2 , X 5 = i5 )P(X 2 = i2 , X 5 = i5 )

= P(X 10 = i10 |X 5 = i5 )P(X 5 = i5 |X 2 = i2 )P(X 2 = i2 )
= P(X 5 = i10 |X 0 = i5 )P(X 3 = i5 |X 0 = i2 )P(X 2 = i2 )
= P(X 5 = i10 |X 0 = P i5 )P(X 3 = i5 |X 0 = i2 )×
k∈S P(X 2 = i2 |X 0 = k)P(X 0 = k)
(5) (3) (0) (2)
= pi5 i10 pi2 i5 k∈S ak pki2
P
= [P 5 ]i5 i10 [P 3 ]i2 i5 [a(0) P 2 ]i2
The first equality is just using the definition of the conditional probability. For the second equality
we also used the Markov property. The third equation is written using time-homogeneity and in
the fourth equation we used law of total probability to write P(X 2 = i2 ). The final equation is
just replacing the notation introduced for DTMC’s. We know that all the conditional probabilities,
(5) (3) (2)
pi5 i10 , pi2 i5 and pki2 , can be calculated by using Chapman-Kolmogorov equations, that is taking
the appropriate power of P. We only need to know the distribution of X 0 and one-step transition
matrix P to calculate the probability above.
After all these examples we come up with the following theorem.
Theorem 2.4.7. One-step transition matrix P and the initial distribution a(0) completely charac-
terizes the DTMC, that is all finite dimensional probabilities can be calculated.
Proof. The details of the proof is left as an exercise. You only need to calculate
P(X n1 = i1 , X n2 = i2 , · · · X nk = ik ) = [P nk −nk−1 ]ik−1 ,ik · · · [P n2 −n1 ]i1 ,i2 [a0 P n1 ]i1
using the reasoning above.
2.5 Classification of States

By now we can calculate the probability that our chain is in any given state at any given time
by simply multiplying matrices. From now on we are more interested in what happens to the
chain for large times. To explain the concepts defined in this section, we will refer to the DTMC
example specified in Figure 2.4.
Definition 2.5.1. Accessibility. A state j is said to be accessible from state i, denoted i → j, if

(n)
there exists an n ≥ 0 such that pi j > 0.
17
0.3
1
0.2 0.7
2 4
0.8 1
0.5 0.5 0.5
1
0.2
0.3 3 5 6
Figure 2.4: State diagram for a simple DTMC
For the DTMC in Figure 2.4, state 1 is accessible from states 2, 3 and 5, but state 1 is not
accessible from states 4 or 6.
Definition 2.5.2. Communication. Two states i and j are said to communicate, denoted i ↔ j,
if and only if i → j and j → i.
For the DTMC in Figure 2.4, states 4 and 6 communicate, also states 2 and 5 communicate
but states 1 and 2 do not communicate.
Theorem 2.5.3. Communication is an equivalence relation, that is
1. i ↔ i for all i ∈ S (reflexive)
2. If i ↔ j then j ↔ i (symmetric)
3. If i ↔ j and j ↔ k then i ↔ k (transitive).
(0)
Proof. Since pii = P(X 0 = i|X 0 = i) = 1, communication is reflexive. Symmetry follows from
(n)
the definition. Using definition of accessibility we know there exists m, n ≥ 0 such that pi j > 0
(m)
and p jk > 0. Hence
X
(n+m) (n) (m) (n) (m)
pik = pil plk ≥ pi j p jk > 0
l∈S
and i → k. We can prove k → i similarly and hence transitivity follows.
Equivalence relations partition a set into equivalence classes, and in our case we’ll call these
classes communicating classes. To summarize let’s define them explicitly.
Definition 2.5.4. Let C be a subset of the state space S. C is said to be a communicating class if
1. i ∈ C and j ∈ C then i ↔ j.
18
2. i ∈ C and i ↔ j then j ∈ C.
/ C we
If in addition to these properties we cannot leave C, that is for any i ∈ C and any k ∈
have i 9 k, then C is said to be a closed communicating class.
The communicating classes for the example in Figure 2.4 are {2, 3, 5}, {1}, {4, 6} where only
{4, 6} is closed and the other two are not closed.
Definition 2.5.5. A DTMC is said to be irreducible if the state space S is a single (closed) commu-
nicating class, and it is called reducible if it is composed of several communicating classes.
Among the examples, we have seen the models for weather examples and random walk are
irreducible Markov chains and the models for gambler’s ruin problem, branching process and
epidemic spreading are reducible. (Why?)
2.6 Transience and Recurrence

To analyze the long-run behaviour of Markov chains we will ask the following question: Given
that we start with X 0 = i, what is the probability that we will ever return to state i again?
Define T j = min{n ≥ 1 : X n = j}, that is the first arrival time to state j. Note that T j > 0,
so even if we start from j the first arrival to j (which is in fact the first return to j in this case)
is always positive. Note that if we never reach j we consider T j to be infinite. In fact we allow
random variables to be extended as X : Ω → {R, ∞}. Given that we start at i the probability of
ever reaching state j (or more precisely reaching it after finite number of steps) can be written
as
∞
X
%i j = P(T j < ∞|X 0 = i) = P(T j = n|X 0 = i).
n=1
Note that for i 6= j, %i j > 0 is equivalent to i → j.1 But with %i j we can say more.
Definition 2.6.1. A state i is said to be recurrent if %ii = 1 and it is said to be transient if %ii < 1.
So a state is recurrent if starting from there we’ll eventually return there with probability
one. We can also return to a transient state but with probability less than one. Note that
the definition implies that each state is either recurrent or transient. From our probability
studies we know that if an experiment is repeated over and over again and an event A has
positive probability, then this event A will eventually occur with probability one. When i is
transient, then there is a positive probability of not returning. Hence, in the long run there
will be sometime where we will leave and not return to state i. However, for recurrent states
we know that we will return for sure each time we leave state i. This leads to the following
lemma.
(n)
1
To prove this assume that i → j with i 6= j, hence there is n ≥ 1 such that pi j > 0, and hence ρi j =
(n)
P(T j < ∞|X 0 = i) ≥ P(T j ≤ n|X 0 = i) ≥ pi j > 0. On the other hand, if 0 < ρi j then there is n ≥ 1 such that
(n)
0 < P(T j = n|X 0 = i) ≤ pi j , so i → j.
19
Lemma 2.6.2. Let Ni be the number of visits to state i (including the starting visit). Then Ni
is infinite if i is recurrent, while Ni is a geometric (1 − %ii ) random variable if i is transient.
Consequently, ¨
1 if i is recurrent
P(Ni = ∞|X 0 = i) =
0 if i is transient
and the mean number of visits to i is

∞ if i is recurrent
E(Ni |X 0 = i) = 1 .
1−%ii if i is transient
Proof. For recurrent states Ni cannot be bounded, as it is always visited once more with proba-
bility one, hence it is visited infinitely many times. For transient states, after each visit whether
there is a next visit is an independent trial with success probability 1 − %ii (escape), hence Ni
is a geometric (1 − %ii ) random variable, and the lemma follows.
To find out if a state is transient or recurrent the following theorem helps: if you can go
somewhere from where you don’t return with probability one, then you are in a transient state.
Theorem 2.6.3. If i → j but % ji < 1, then i is transient.
(n)
Proof. Since i → j , there is an n such that pi, j > 0. To show that %ii < 1 we show that the
following is positive
1 − %ii = P(Ti = ∞|X 0 = i) ≥ P(Ti = ∞, X n = j|X 0 = i)

(n)
= P(Ti = ∞|X 0 = j)P(X n = j|X 0 = i) = (1 − % ji )pi, j > 0
hence %ii < 1, and i is transient.
There are several simple Corollaries of the above Theorem.
Corollary 2.6.4. (a) If i → j and i is recurrent then % ji = 1.
(b) If i → j and i is recurrent then j is also recurrent.
(c) If i → j and j is transient then i is transient too.
(d) Recurrence and transience are class properties.
Proof. (a) It is just the contrapositive of Theorem 2.6.3.
(b) From Corollary (a) we know that starting from j we arrive at i (in finite time) with prob-
ability one, and since i is recurrent we return to i infinitely many times. Every time from
i we’ll have a chance to return to j, hence it will eventually happen with probability one.
(c) It is just the contrapositive of Corollary (b).
(d) If i and j are in the same communicating class, then i ↔ j, which implies i → j, and by
Corollary (b) and (c) we obtain the result.
20
Since recurrence and transience are class properties, each communicating class is either
recurrent or transient, so S is partitioned into recurrent and transient communicating classes.
The above theorem and corollaries allow us to identify transient and recurrent classes in simple
finite chains.
Example 2.6.5. Consider the example of Figure 2.4. Find the communicating classes, and decide
on their recurrence or transience.
Example 2.6.6. Success runs: Suppose you are answering questions in a competition. Let X n
be the number of correct answers since the last incorrect answer. If X n = i, then you can answer
the next question correctly with probability pi which means X n+1 = i + 1. If you cannot answer
the question correctly then X n+1 = 0. The state diagram for this DTMC is given in Figure 2.5.
p0 p1 p2
1 − p0 0 1 2 ...
1 − p1
1 − p2
Figure 2.5: State diagram for Competition
To determine whether this system is transient or recurrent we can use the definition of %ii .
Considering state 0, never returning back to state 0 means answering all the questions correctly.
Hence
%00 = 1 − P(Answer
Q∞ all the questions correctly|X 0 = 0)
= 1 − i=0 pi .
Q∞
This means if i=0 pi = 0, then the system is recurrent. For example suppose all questions are at
Q∞
the same level of difficulty, that pi = p < 1, then i=0 p = limn→∞ p n = 0, hence the system is
recurrent.
The following theorem connects recurrence and transition probabilities.
P∞ (n)
Theorem 2.6.7. State i is recurrent if and only if n=0 pii = ∞.
Proof. We first define the indicator random variable for an event A as

§
1 if A occurs
1A = .
0 otherwise
P∞
Then we can write the time spent in state i as Ni = n=0 1X n =i , and take its expected value
term by term
X∞ ∞
X ∞
X
(n)
E(Ni |X 0 = i) = E( 1X n =i |X 0 = i) = E(1X n =i |X 0 = i) = pii .
n=0 n=0 n=0
By using Lemma 2.6.2 the theorem follows 2 .

2
We can change the order of expectation and a possibly infinite summation due to the Monotone Convergence
Pm Pm+1
Theorem from measure theory. Basically, one needs to check that 0 < 1X n =i < 1X n =i ↑ Ni .
21
Example 2.6.8. In this example, we will try to find out under which conditions a random walk
(n)
is transient or recurrent. We will use Theorem 2.6.7 for this purpose, hence we need to find p00 .
We know that if n is odd, then it is impossible to return back to same state. If n is even, then
the probability is equal to taking half of the steps towards right and the other half towards left.
Defining steps towards right as success, we can see that this is the same as probability of getting
n/2 successes out of n trials. Defining k = n/2, we get
2k k

(2k)
p00 = p (1 − p)k
k
P∞ (2k)
To test whether k=0 p00 diverges or converges we will use the following ratio test.
P∞
Lemma 2.6.9. k=0 ak converges if limk→∞ ak+1 /ak < 1, and it diverges if limk→∞ ak+1 /ak > 1.
The lemma says nothing when the ratio converges to one.
Hence, taking the ratio

2k+2

(2k+2)
p00 k+1 p
k+1
(1 − p)k+1
limk→∞ (2k) = limk→∞ 2k

k p (1 − p)
p00 k k
(2k+2)!
(k+1)!(k+1)! p
k+1
(1 − p)k+1
= limk→∞ (2k)! k
k!k! p (1 − p)
k
(2k+2)(2k+1)
= limk→∞ (k+1)(k+1) p(1 − p)
= 4p(1 − p)
< 1 if p 6= 1/2
§
=
= 1 if p = 1/2
This means when p 6= 1/2, the sum converges to a finite number and hence the system is transient.
However, when p = 1/2 we need to use another p method. Using p = 1/2, and using Stirling
k+1/2 −k
approximation for factorials, k! ≈ k e 2π, we get
(2k) (2k)! 1 k

p00 = k!k! 4 p
(2k2k+1/2 )e−2k 2π 1 k

≈ k k+1/2 e −k
p
2πk k+1/2
p
e−k 2π 4
1
= p .
πk
P∞ (2k) P∞
Hence, we conclude that k=0 p00 ≈ k=0 p1πk = ∞ and the system is recurrent.
In fact, you can perform the sum exactly3 for any p to obtain
X 2k 1
EN0 = p k (1 − p)k = p
k≥0
k 1 − 4p(1 − p)
which is finite for p 6= 1/2 (hence all states are transient), but infinite for the symmetric walk
(hence all states are recurrent).
2k

3
One needs to show that k≥0 k x k = (1 − 4x)−1/2 where x = p(1 − p). One can do this by e.g. Taylor
P
expanding (1 − 4x)−1/2 around x = 0.
22
2.7 Positive and Null Recurrence
Even if a DTMC returns back to state i in finite time almost surely, that is %ii = 1, this does
not mean that it happens in reasonable time. As anP example, consider a DTMC where P(Ti =
∞
n|X 0 = i) = C/n , where C = 6/π is such that n=1 C/n2 = 1. We know that state i is
2 2
recurrent (why?), however

∞
X C
E(Ti |X 0 = i) = n 2 = ∞.
n=1
n
Hence, we define mi j = E(T j |X 0 = i), so mii is the mean time to return to state i, and classify
recurrent state i as positive recurrent when mii < ∞ and null recurrent when mii = ∞. It is
also an easy exercise to show that mii = ∞ when i is transient.
Heuristic Approach: On average at every mii steps, state i will be visited once. Hence,
the long run average proportion of time spent in state i should be 1/mii . Now the mean (long
run average) proportion time spent in state i is 1/mii and we can write it as 4
1 1
PN 1
PN
= E(lim 1
n=0 X n =i
|X = i) = lim E( 1 |X = i)
mii N →∞ N +1 0 N →∞
1
PN n=0 X n =i 0
N +1
= limN →∞ N +1 n=0 E(1X n =i |X 0 = i)
(n)
PN
= limN →∞ N 1+1 n=0 pii
= pii∗
where we used the notation

N
1 X (n)
pi∗j = lim pi j
N →∞ N + 1
n=0
for the mean proportion of time spent in state j when starting from i. One should pay special
attention to the algebraic manipulations above. In the second equality of the first line, the
limit and the expectation are interchanged, which can be justified in measure theory.5 The next
equality only uses the linearity of expectation and the fourth equality just uses the expectation
of a binary random variable. Since we obtained that pii∗ = 1/mii , and positive recurrency means
finite mean return time mii , we (heuristically) arrive at the following theorem.
Theorem 2.7.1. A recurrent state i is positive recurrent if and only if
1
pii∗ = > 0.
mii
P∞ (n)
Note that for a transient state n=0 pii is finite due to Theorem 2.6.7, hence pii∗ would be
zero. The following small lemma basically allows us to ignore finite number of terms in the
infinite term limit, and it will soon be useful.
PN PN
Lemma 2.7.2. limN →∞ N 1+1 n=0 f n = limN →∞ N 1+1 n=0 f n+m , for any ( f n )n∈N sequence which
is bounded | f n | ≤ b for all n, and for any m ∈ N.
PN
4
In fact more is true, N 1+1 n=0 1X n =i → m1ii almost surely.
5
This is due to the Dominated Convergence Theorem from measure theory. Basically one needs to check that
PN (n)
all terms are bounded, which is true here, since N 1+1 n=0 pi j ≤ 1.
23
Proof. Write
+m
N
N m−1 N
N
1 X 1 X X X 1 X
lim f n+m = lim fn − fn + f n = lim fn
N →∞ N + 1 N →∞ N + 1 N →∞ N + 1
n=0 n=0 n=0 n=N +1 n=0
Pm−1 PN +m
where the last equality comes from the boundedness of | n=0
f n | ≤ bm, | f |
n=N +1 n
≤ bm.
Theorem 2.7.3. Positive recurrence (null recurrence) is a class property, that is if recurrent states
i ↔ j then i and j are both positive recurrent or both null recurrent.
(n)
Proof. Suppose i is positive recurrent. If i ↔ j, then there exists n, m > 0 such that pi j > 0
(m)
and p ji > 0. Hence for any s ≥ 0
(m+s+n) (m) (s) (n)
pj j ≥ p ji pii pi j .
Summing over s
(s) (m+n+s)
1
PN 1
PN
p∗j j = limN →∞ N +1 p
s=0 j j
= limN →∞ N +1 s=0
pj j
(m) (n) (s)
1
PN
≥ p ji pi j limN →∞ N +1 s=0
pii
(m) (n)
= p ji pi j pii∗ > 0.
where we used Lemma 2.7.2. Hence from Theorem 2.7.1 the result follows.
Theorem 2.7.4. All states in a finite closed communicating class C are positive recurrent.
(n)
pi j = 1 for every n ≥ 0 and every i ∈ C. Hence
P
Proof. Using the fact that j∈C
(n)
PN P
1 = limN →∞ N 1+1 n=0 j∈C pi j
(n)
PN
= limN →∞ j∈C N 1+1 n=0 pi j
P
N (n)
= P j∈C limN →∞ N 1+1 n=0 pi j
P P
= j∈C pi∗j
Notice that we can easily interchange limit and summation using the fact that C is finite. If C
was not finite this could not be done! Now since j∈C pi∗j = 1, there exists a j ∈ C such that
P
pi∗j > 0. However, to prove positive recurrency of a state j, we need p∗j j > 0. Since i ↔ j, there
(m)
exists m > 0 such that p ji > 0 and
1 N (n)
p∗j j = limN →∞
P
N +1 Pn=0 p j j
N (n+m)
= limN →∞ N 1+1 n=0 p j j
(m) (n)
PN
≥ limN →∞ N 1+1 n=0 p ji pi j
(m) N (n)
= p ji limN →∞ N 1+1 n=0 pi j
P
(m)
= p ji pi∗j > 0
Hence j is positive recurrent, and using communication property all the other states are positive
recurrent.
24
Note that the recurrence part of the previous theorem can be proven in a more intuitive
way. If C is finite, the system cannot spend finite time in each state, so there must be a state j
where it spends infinite amount of time, that is for any initial condition i,
P(N j = ∞|X 0 = i) > 0
(m)
Now, as before, since i ↔ j, there exists m > 0 such that p ji > 0, so also
P(N j = ∞|X 0 = j) ≥ P(N j = ∞|X 0 = i)P(X m = i|X 0 = j) > 0
hence j is recurrent, and so are all other states.
Theorem 2.7.5. All states in an open communicating class C are transient.
/ C. Since i and j do not

Proof. Since C is open (non-closed), there is a state i ∈ C and i → j ∈
communicate j 9 i, so ρ ji = 0, and hence from Theorem 2.6.3 i is transient, and from 2.6.4
(d) all states in C are transient.
To summarize this section so far in words, S is partitioned into communicating classes (The-
orem 2.5.3), where each class is either transient, positive recurrent or null recurrent (Corollary
2.6.4, Theorem 2.7.3). If a class is “leaking out", then that’s a transient class (Theorem 2.7.5),
that is all recurrent classes are closed. On the other hand finite closed classes are positive
recurrent (Theorem 2.7.4). We also have two important conditions for recurrence (Theorem
2.6.7) and for positive recurrence (Theorem 2.7.1) in terms of transition probabilities.
2.8 Periodicity of Chains

Consider a DTMC with states S = {0, 1} and the following transition matrix

0 1
P= .
1 0
(n) (n)
We can easily see that P00 = 0 if n is odd and P00 > 0 if n is even. We observe a similar
behaviour for random walk. We need to take equal number of steps towards right and left
(n)
to return back to our original position, hence the probability p00 is positive if and only if n is
even. Motivated by this behaviour we introduce the following definition.
(n)
Definition 2.8.1. The period d of state i is the greatest common factor of {n ≥ 0 : pii > 0}. If
d = 1, the state is called aperiodic, and for d ≥ 2, it is called periodic.
For random walk the period for state 0 is 2. Considering the DTMC’s represented in Fig-
ure 2.6, for (a) state 1 has period d = 3 because it is only possible to return after three steps.
For (b), it is also impossible to return to state 1 after one step, however you can return in two
or three steps. Taking the greatest common factor we conclude that state 1 is aperiodic for (b).
Interestingly we find the same period for all communicating states. Our next theorem is about
this fact.
25
1
1 0.5
0.5 0.5
0.5
1 1 0.5
2 3
2 1
3 0.5
Figure 2.6: DTMC’s with periods d = 3 (on the left) and d = 1 (on the right)
Theorem 2.8.2. Period is a class property, that is if i ↔ j then di = d j .
(m)
Proof. Since i and j communicate, we know that there exists m and n such that pi j > 0 and
(n)
p ji > 0. First
X
(n+m) (m) (n) (m) (n)
pii = pik pki ≥ pi j p ji > 0,
k∈S
(s)
hence, n + m is a multiple of period of i, di . Suppose s is chosen such that p j j > 0.
(n+m+s) (m) (s) (n)

pii ≥ pi j p j j p ji > 0,
(s)
so n + m + s is also a multiple of di , and hence di is a factor for all s such that p j j > 0. Since d j
is the greatest such common factor, di ≤ d j . Proving the opposite inequality in the same way,
we conclude di = d j .
2.9 Stationary Probabilities: Aperiodic Case

(n)
In this section we are interested in the large time limit of the probability p j j . An easy observa-
tion will help us a lot: we can start the chain from a special random initial condition a(0) = π
such that all probabilities a(n) = a(0) = π remain unchanged in time.
Definition 2.9.1. A distribution π = (π1 , π2 · · · ) ≥ 0 is stationary if it satisfies the global balance

equations, i.e. X
π = πP, and π j = 1.
j∈S
A chain with a stationary distribution is said to be in a stationary state or steady state.
Proposition 2.9.2. If a chain is initially in a stationary distribution, a(0) = π, then a(n) = π for
all n ≥ 0.
Proof. Use induction, with a(n+1) = a(n) P = πP = π.
26
We have seen so far the state space is partitioned into transient, 0-recurrent and +recurrent
(n)
communicating classes. For any transient class limn→∞ p j j = 0 since the chain will leave this
class forever after a final visit. Since any open class is transient (see Theorem 2.7.5), the only
interesting case is if we entered a closed communicating class, that is an irreducible chain,
which we are going to focus on from here.
Using the tools of Renewal Theory, one can show that the large time limit probability
(n)
limn→∞ p j j exists for aperiodic, irreducible chains. From the convergence it is easy to see
(do it!) that this limit is the same as p∗j j and together with Theorem 2.7.1 we find that for
positive recurrent chains the limiting probability π j for state j is positive
(n) 1
lim p j j = p∗j j = >0
n→∞ mj j
We will also show that in this case this limit is the unique stationary solution.
We can understand the existence of such a limit intuitively. Think about our simple weather
example and ask for the probability that in a million years it will rain if it rains today. Now
what is this probability a million years and one day later? Intuitively, we can argue that these
probabilities should be equal after a really long time and it should be equal to the long run
average proportion of days that are rainy.
Theorem 2.9.3. (a) For aperiodic, irreducible chains the limiting probabilities are independent
of the initial state, that is, for every i, j ∈ S,
(n) (n)
lim p j j = lim pi j
n→∞ n→∞
and we call this limit π j .

(b) If the chain is also positive recurrent then the limiting probability distribution π =
(. . . , π j , . . . ) is the unique stationary distribution.
Proof. (a) For transient chains each state is only visited a finite number of times, hence the
(n)
probabilities limn→∞ pi j = 0, so the statement for transient chains is proved. Let us assume
then that the chain is recurrent. Define rk = P(T j = k|X 0 = i), that is probability of first visit
to state j being exactly after k steps starting from state i (note that r0 = 0 by the definition
of T j ). Since the DTMC is assumed to be irreducible and recurrent, from Corollary 2.6.4 (a)
P∞
it follows that k=1 rk = ρi j = 1. We also assume that π j = limn→∞ p j j exists, hence for any
ε > 0 we can find N such that for all n > N
∞
X
(n)
rk < ε/2 and |p j j − π j | < ε/2.
k=n+1
We can also write (why? Try to interpret the equation!)

n
X
(n) (n−k)
pi j = rk p j j .
k=1
27
Then for all n > 2N
(n) (n−k)
Pn P∞
|pi j − π j | = | k=1
rk p j j − π j k=1 rk |
(n−k) (n−k)
Pn−N −1 Pn P∞
= | k=1 rk (p j j − π j ) + k=n−N rk (p j j − π j ) − π j k=n+1 rk |
(n−k) (n−k)
Pn−N −1 Pn P∞
≤ k=1
rk |p j j − π j | + k=n−N rk |p j j − π j | + π j k=n+1 rk
Pn−N −1 Pn P∞
≤ k=1
rk ε/2 + k=n−N rk + k=n+1 rk
Pn−N −1 P∞
≤ ε/2 k=1 rk + k=n−N rk < ε/2 + ε/2 = ε.
(n)
where from the second to third line we used the triangular inequality. Hence if p j j → π j then
(n)
also pi j → π j .
(b) Since X
(n) (n−1)
pk j = pki pi j
i
by taking n to infinity we get the first global balance equation

X
πj = πi pi j .
i
πi = 1, we first apply the transition probabilities consecutively,

P
To prove that i∈S
π = πP = πP 2 = · · · = πP n
that is X
(n)
πj = πi pi j
i∈S
and take the n → ∞ limit to get π j = π j i∈S πi . For transient orP null recurrent chains π j = 0
P
for all j, but for positive recurrent chains π j = 1/m j j > 0, hence i∈S πi = 1.
We prove uniqueness similarly. Let π̄ be another solution of the equations, so we get again
X
(n)
π̄ j = π̄i pi j
i∈S
Taking the n → ∞ limit, we get X

π̄ j = π̄i π j = π j .
i∈S
One should be extremely careful when interchanging the limit and the infinite sums. For all
the limits in this theorem, we can change the order of limit and summation using bounded
convergence theorem from measure theory.
Heuristic Explanation: The number of times the DTMC in state j is certainly equal to the
number of times it goes “into” state j (apart from the initial position). Let N be a really large
number and consider the first N time steps. Then approximately the number of steps spent in
state j will be N π j . Similarly, the number of steps from state i to state j will be approximately
N πi pi j . Summing the number of transitions into state j over all states
X
N πj ≈ N πi pi j .
i∈S
28
Canceling N s on both sides, we see that the equations define a balance between number of
visits to state j and number of transitions into state j. That is why these equations are called
the global balance equations.
Example 2.9.4. Consider the DTMC with state space S = {1, 2} and transition probabilities

0.3 0.7
P= .
0.4 0.6
Find the steady-state probabilities.
First let’s write the global balance equations:
π1 = 0.3π1 + 0.4π2
π2 = 0.7π1 + 0.6π2
π1 + π2 = 1.
Solving these equations (the first two are not independent), we get π1 = 4/11 = 0.363636 . . . , π2 =
7/11 = 0.636364 . . . . To demonstrate the power of this result, if we take the fifth power of P, we
see that the weather just after 5 days is already quite well described by the stationary distribution

0.36363 0.63637
P =
5
.
0.36364 0.63636
Example 2.9.5. Consider the model of success runs described in Example 2.6.6. Find steady-state
probability distribution.
First, let’s define qi = 1 − pi for ease of notation. We can write the global balance equations as
π0 = q0 π0 + q1 π1 + q2 π2 · · ·
π1 = p0 π0
π2 = p1 π1
..
P∞ .
i=0
πi = 1.
We know that one of the equations above is redundant, hence we can disregard the first equation
as it is more complicated than the others. By solving πi in terms of π0 , we get
π1 = p0 π0
π2 = p1 π1 = p1 p0 π0
..
.
P∞ πi = pi−1 pi−2 · · · p1 p0 π0
π =1
i=0 i
Qi−1
By defining αi = j=0
p j for i > 0 and α0 = 1, the above equations can be rewritten as πi = αi π0
for all i, and
∞
X
αi π0 = 1
i=0
P∞
Hence, there is a stationary solution (i.e. the DTMC is positive recurrent) if and only if i=0
αi <
P∞ P∞
∞ and in this case π0 = 1/ i=0 αi and πi = αi / i=0 αi .
29
2.10 Stationary Probabilities: Periodic Case
The results of the previous section are slightly modified for periodic chains. For motivation,
consider the DTMC with state space S = {1, 2} and transition probabilities

0 1
P= .
1 0
We see that §
(n) (n) 1 if n is even
p00 = p11 =
0 if n is odd
(n)
Hence, we cannot talk about limn→∞ p00 as it does not exist! However, we know from our
previous calculations that p∗j j exists and is equal to π j . Using a similar methodology, we can
prove an analog of Theorem 2.9.3 for periodic irreducible DTMCs.
Theorem 2.10.1. (a) For irreducible chains, the steady-state probabilities π j are independent of
the initial state, that is,
π j = p∗j j = pi∗j
for every i, j ∈ S.
(b) if the chain is also positive recurrent then the limiting probability distribution π is the
unique stationary distribution.
So the developments for positive recurrent aperiodic and periodic DTMCs can be summa-
rized as follows.
Aperiodic Periodic
(n)
limn→∞ pi j Exist Do not exist
pi∗j Exist Exist
Interpretation for π j Stationary probability Stationary probability
Limiting probability
It is not hard to prove that if you find a solution to the π = πP, π j = 1 equations for
P
an irreducible chain, then that’s the Plong term average probability. For aperiodic chains that’s
also the limiting probability. Since π j = 1, then π j > 0 for at least one j, which state is
then positive recurrent, and since that’s a class property, all states in the chain are positive
recurrent. Note that for transient and null-recurrent states the probability of being there goes
to zero, even if these states are periodic.
Example 2.10.2. Consider the periodic DTMC with state space S = {0, 1, 2} and transition prob-
ability matrix  
0 1 0
P =  0.5 0 0.5 
0 1 0
Find the steady-state probabilities.
30
First let’s write the global balance equations
π0 = 0.5π1
π1 = π0 + π2
π2 = 0.5π1
π0 + π1 + π2 = 1.
Solving these equations, we get π0 = π2 = 0.25 and π1 = 0.5.
Example 2.10.3. Reflecting Random Walk: Consider a DTMC with state space S = {0, 1, 2, · · · }
and transition probabilities for i ≥ 1

 qi if j = i − 1
pi j = ri if j = i
 p if j = i + 1
i
with p00 = r0 and p01 = p0 and all other transition probabilities are zero. Note that this chain is
periodic if and only if ri = 0 for all i. The global balance equations are
π0 = r0 π0 + q1 π1
π1 = p0 π0 + r1 π1 + q2 π2
π2 = p1 π1 + r2 π2 + q3 π3
..
.
P∞ p0
with i=0 πi = 1. Starting from the first equation, and using p0 + r0 = 1, we see that π1 = q1 π0 .
p p
Replacing it in the second equation, we get π2 = q01 q21 π0 . Proceeding similarly and defining
p0 p1 · · · pi−1
ρi = for all i > 0
q1 q2 · · · qi
and ρ0 = 1, we get
πi = ρi π0 .
P∞ P∞
Using the fact that i=0 πi = i=0 ρi π0 = 1, we see that the reflected random walk is positive
P∞ ρ
recurrent if and only if i=0 ρi < ∞ and then π0 = P∞1 ρ and πi = P∞i ρ . Notice that (some-
i=0 i j=0 j
what surprisingly) the stationary probabilities do not depend on ri . Note also that for constant
bias, p = pi and q = qi for all i, the walk is positive recurrent only for p < q, that is when it’s
biased towards the origin. Similarly to Exercise 2.6.8 it is easy to see that this walk is also null
recurrent for p = q and transient for p > q, for which cases there is no stationary solution, all
probabilities converge to zero.
2.11 First Passage Probabilities and Times

In this section we are interested in the following type of questions: If a gambler has i pounds
and the bank has N − i pounds, what’s the probability that the gambler gets ruined? And how
long does the gambling take on average? In the Wright-Fisher Model with M individuals and i
31
mutants, what is the probability that mutants eventually take over? And how long does it take
until the population becomes homogeneous: either all normals or all mutants?
Let us define the first passage time of a chain to a set A ⊂ S as T̂A = min{n ≥ 0 : X n ∈ A}.
Note that T̂A = 0 for X 0 ∈ A, that is if we are already in A, it takes no time to get there.
Otherwise, the first passage time T̂A is identical to the first arrival time TA = min{n > 0 : X n ∈ A}
for X 0 ∈ S − A. We defined this variation for convenience.
Theorem 2.11.1. Let A, B ⊂ S, with P(min{ T̂A, T̂B } < ∞|X 0 = i) = 1 for all i. Then the
probability hi = P( T̂A < T̂B |X 0 = i) of reaching set A before set B when starting from state i
satisfies

0 if i ∈ B
hi = 1 if i ∈ A
if i ∈ S − (A ∪ B)
P
j∈S pi j h j
Proof. The i ∈ A, B parts are trivial. For i ∈ S − (A ∪ B), let us use the total probability rule by
conditioning on the first step as
hi = P
P( T̂A < T̂B |X 0 = i)
= j∈S P( T̂A < T̂B |X 1 = j, X 0 = i)P(X 1 = j, X 0 = i)
= j∈S P( T̂A < T̂B |X 1 = j)pi j
P
= P j∈S P( T̂A < T̂B |X 0 = j)pi j

P
= j∈S h j pi j
where we used the Markov property.
Theorem 2.11.2. Let A ⊂ S, , with P( T̂A < ∞|X 0 = i) = 1 for all i. Then the mean time
g i = E( T̂A|X 0 = i) to reach set A when starting from state i satisfies
¨
0 if i ∈ A
gi =
1 + j∈S pi j g j if i ∈ S − A
P
Proof. We use again the total probability rule (tower property) for i ∈ / A and condition on the
first step as
gi = P E( T̂A|X 0 = i)
= j∈S E( T̂A|X 1 = j, X 0 = i)P(X 1 = j, X 0 = i)
= j∈S E( T̂A|X 1 = j)pi j
P
= P j∈S E( T̂A + 1|X 0 = j)pi j

P
= j∈SP(1 + g j )pi j
= 1 + j∈S g j pi j
where we used again the Markov property.
Without a proof we mention that there is a unique solution to the above equations, hence
if one finds a solution, that is indeed the first passage probability or time.
32
Example 2.11.3. One of the simplest examples is the Gambler’s ruin problem discussed in Example
2.2.2. The gambler has initially i Pounds, each step wins or loses a pound with equal probability,
and quits playing when having zero or N = 3 Pounds. We are interested in the probability of the
gambler getting ruined. For this, let A = {0} and B = {3}, hence the equations for the first passage
probabilities are
h0 = 1
h1 = 21 h0 + 12 h2
h2 = 12 h1 + 12 h3
h3 = 0
By solving these equations we get that h1 = 2/3 and h2 = 1/3. So it’s better to have more money.
More seriously, you can check that for general N the probabilities hi = 1 − i/N are the (unique)
solution. Hence if you start gambling with 1 pound, and the bookie has N − 1 Pounds, where N is
huge, then the probability of you eventually getting ruined is close to one.
How much time does it take to quit playing, that is reaching either 0 or 3 Pounds? For this let
A = {0, 3}, hence the equations for the mean first passage times are
g0 =0
g1 = 1 + 21 g0 + 12 g2
g2 = 1 + 21 g1 + 12 g3
g3 =0
and by solving these equations we obtain that g1 = g2 = 2. In this very simple case we can even
check this result by using
P∞ directly the definition of expectation, and summing over all possible
paths, which leads to n=1 n/2 = 2.
n
2.12 Costs and Rewards

Many situations can be modelled by assigning a cost (or a reward) to each state: each time I
visit state j I have to pay a cost c( j). That is at step n we incur a cost of c(X n ). If I wander in
the state space forever, my cost will be infinite. We can, however, talk about average cost.
2.12.1 Long-run Average Cost

We can write the long run average cost starting from state i to be
N
1 X
ψi = lim E c(X n )|X 0 = i
N →∞ N + 1
n=0
Suppose c(i) ≤ B < ∞ and the DTMC is irreducible and positive recurrent.
PN
ψi = limN →∞ N 1+1 E( n=0 c(X n )|X 0 = i)
PN
= limN →∞ N 1+1 n=0 E(c(X n )|X 0 = i)
(n)
PN P
= limN →∞ N 1+1 n=0 j∈S c( j)pi j
N (n)
= P j∈S c( j) limN →∞ N 1+1 n=0 pi j
P P
= j∈S c( j)π j
33
We conclude that the long-run average cost is independent of the initial state and calculated
using steady-state probabilities.
Note that you’ll get the same expression for the long run mean cost
X
lim E(c(X n )|X 0 = i) = c( j)π j
n→∞
j∈S
for chains which are also aperiodic.
Example 2.12.1. Find the long-run average cost for brand-switching example.
First calculate the steady-state probabilities to be π1 = 0.132, π2 = 0.319 and π3 = 0.549
and
ψ1 = ψ2 = ψ3 = π1 c(1) + π2 c(2) + π3 c(3) = 1.709.
2.12.2 Cost in transient states

Another situation is when only transient states have P nonzero costs, so we can calculate the
∞
total final cost. In this case we define the total cost as n=0 c(X n ). Let φi be the total expected
cost if we start from state X 0 = i. Hence, we can use our first step conditioning tool again
P∞
φi = E( n=0 c(XPn )|X 0 = i)
∞
= E(c(X 0 ) +
P∞n=1 c(X n )|X 0 = i)
= c(i) + P
E( n=1 P c(X n )|X 0 = i)
∞
= c(i) + j∈S E( n=1 c(X n )|X 0 = i, X 1 = j)P(X 1 = j|X 0 = i)
P∞
= c(i) + j∈S E( n=1 c(X n )|X 1 = j)pi j
P
P∞
= c(i) + P j∈S E( n=0 c(X n )|X 0 = j)pi j
P
= c(i) + j∈S pi j φ j
Example 2.12.2. Consider the gambler’s ruin with N = 3 from Examples 2.2.2 and 2.11.3. Now
if we set c(1) = 1 and all other c( j) = 0, the cost φi will be the time spent in state 1 when starting
from state i, and the equations will be
φ1 = 1 + 12 φ2
φ2 = 0 + 21 φ1
which leads to φ1 = 4/3 and φ2 = 2/3. We can similarly calculate φ̃i for the visits to state 2, and
get φ̃1 = 2/3 and φ̃2 = 4/3, hence the total time spent in states 1 or 2 is φ1 + φ̃1 = φ2 + φ̃2 = 2,
in accordance with Examples 2.11.3.
2.13 Reversibility
Recall that if we start a chain from its stationary distribution then the chain will be in its
stationary state for all times. We call such a chain a stationary DTMC. What happens if we
watch such a chain backwards in time? What we mean by that is the following.
34
Definition 2.13.1. Reversed Process. If (X n )n∈N is a stationary DTMC and we fix an m, then the
process (X̃ n )0≤n≤m where X̃ n = X m−n is called the reversed process of X .
Theorem 2.13.2. The reversed process (X̃ n )0≤n≤m is a DTMC with transition probabilities
π j p ji
p̃i j =
πi
Proof. We should show that the reversed process has Markov property to prove that it is a
DTMC. Hence,
P(X̃ n+1 = j, X̃ n = i, X̃ n−1 = k, · · · )

P(X̃ n+1 = j|X̃ n = i, X̃ n−1 = k, · · · ) =
P(X̃ n = i, X̃ n−1 = k, · · · )
P(X m−n−1 = j, X m−n = i, X m−n+1 = k, · · · )
=
P(X m−n = i, X m−n+1 = k, · · · )
P(X m−n+1 = k, · · · |X m−n = i)P(X m−n = i, X m−n−1 = j)
=
P(X m−n+1 = k, · · · |X m−n = i)P(X m−n = i)
P(X m−n = i|X m−n−1 = j)P(X m−n−1 = j)
=
P(X m−n = i)
p ji π j
= .
πi
This result shows that for a stationary DTMC, the reversed process has Markov property with
transition probabilities p̃i j .
Definition 2.13.3. Reversibility. A stationary DTMC is said to be reversible if the reversed process
is stochastically the same as the original process, that is p̃i j = pi j , which implies
πi pi j = π j p ji , for all i, j ∈ S.
πi = 1 are called the detailed balance equations.

P
This equation together with i∈S
The detailed balance equations imply that for a stationary DTMC to be reversible, out of
all transitions the long run proportion of transitions from i to j should be equal to the long
run proportion of transitions from j to i. This seems like a particularly unlikely scenario, but
many real world models satisfy detailed balance. (For example all statistical physics models of
systems in thermal equilibrium). If we can find a distribution that satisfy the detailed balance
equations, then we found the unique stationary distribution as well (for irreducible chains)
from the following.
Corollary 2.13.4. If π satisfies the detailed balance equations, then it also satisfies the global
balance equations.
Proof. Left as a tutorial exercise.
Example 2.13.5. Consider the DTMC with two states S = {1, 2} with transition matrix

0.7 0.3
P=
0.5 0.5
35
Writing down the global balance equations we get
π1 = 0.7π1 + 0.5π2
π2 = 0.3π1 + 0.5π2
π1 + π2 = 1.
We find π1 = 5/8 and π2 = 3/8. By checking
π1 p12 = π2 p21
5/8 × 0.3 = 3/8 × 0.5
3/16 = 3/16.
Hence, the DTMC is reversible. (In fact, by using the above interpretation of detailed balance
equations, we can argue that any irreducible DTMC with only two states is reversible. Why?)
Example 2.13.6. Now, let’s consider a DTMC which is not reversible. For this purpose assume
that our DTMC has S = {1, 2, 3} and
 
0 1 0
P =  0 0 1 .
1 0 0
We know that the sample path will be exactly · · · , 1, 2, 3, 1, 2, 3, 1, 2, 3, · · · Hence, reversed process
will have the sample path · · · , 3, 2, 1, 3, 2, 1, 3, 2, 1, · · · . By looking at these we can see that this
DTMC is not reversible because p̃i j 6= pi j (for example, p̃31 = 0 6= 1 = p31 ). Please check to see the
detailed balance equations also fail to hold.
2 4
3 5 6
Figure 2.7: A sample state diagram for a tree DTMC
A DTMC is said to be a tree DTMC if pi j > 0 implies p ji > 0 and there are no cycles in the
state diagram. A sample tree DTMC is given in Figure 2.7. Since we know that in a stationary
DTMC the number of transitions from state i to state j should be equal to transitions from
state j to state i in the long run. Hence, using the intuitive interpretation of detailed balance
equation, we can conclude
Theorem 2.13.7. A stationary tree DTMC is reversible
Since, random walk is a tree DTMC, that is there are no cycles, we have the following
corollary.
Corollary 2.13.8. A stationary random walk, that is a positive recurrent random walk in sta-
tionarity, is reversible.
36
Chapter 3
Poisson Processes
In this chapter, exponential random variables with some given rate λ will play an important
role, hence we start with overviewing some of their properties. Then we’ll define our first
continuous time stochastic model: the Poisson process, and we’ll discuss its main features.
3.1 Exponential Random Variable

Definition 3.1.1. A continuous nonnegative random variable X is called exponential with rate λ
if its cumulative distribution functions is
P(X ≤ x) = F (x) = 1 − e−λx .
Consequently, its density is
f (x) = λe−λx
both for x ≥ 0, and zero otherwise.
Theorem 3.1.2. The r-th moment of the exponential random variable with rate λ is given by
r!
E(X r ) = .
λr
Proof. Use integration-by-parts.
3.1.1 Memoryless Property

Exponential distribution has a unique place in the analysis of stochastic processes due to the
so-called “memoryless property”, which can be stated as
P(X > s + t, X > s) e−λ(s+t)
P(X > s + t|X > s) = = −λs = e−λt = P(X > t).
P(X > s) e
Intuitively, suppose the lifetime of a lightbulb is exponentially distributed with rate λ. When
this lightbulb is installed the expected lifetime is 1/λ. After this lightbulb has been operating
for s years, the remaining lifetime is still exponential with rate λ and the expected remaining
lifetime is 1/λ. This means we can forget about the history of the lightbulb and treat it as a
brand new one!
37
Theorem 3.1.3. The only continuous distribution which has a support [0, ∞) with memoryless
property is the exponential.
Proof. If X has a distribution with memoryless property, then by definition P(X > t + s|X >
s) = P(X > t), which means P(X > t + s) = P(X > t)P(X > s). Hence,
P(X >t+h)−P(X >t) P(X >t)P(X >h)−P(X >t)
limh→0 h = limh→0 h
P(X >t)(P(X >h)−1)
= limh→0 h
P(X >h)−1
= P(X > t) limh→0 h .
Hence defining F̄ (t) = 1 − F (t), F̄ (t) should satisfy the ordinary differential equation
F̄ 0 (t) = F̄ (t) F̄ 0 (0). The unique solution to this differential equation is the exponential function
and hence the theorem follows.
3.1.2 Properties of Minimum of Two Exponentials

First let’s suppose that we have two independent exponential random variables X 1 ∼ exp(λ1 )
and X 2 ∼ exp(λ2 ). What is the probability that {X 1 < X 2 }? By conditioning on X 1 we can write
R∞
P(X 1 < X 2 ) = 0 P(X 2 > X 1 |X 1 = x)λ1 e−λ1 x d x
R∞
= 0 P(X 2 > x)λ1 e−λ1 x d x
R∞
= 0 λ1 e−λ2 x e−λ1 x d x
λ1
= λ1 +λ 2
.
Now define Z = min{X 1 , X 2 }, then what is the distribution of Z?
P(Z > x) = P(min{X 1 , X 2 } > x)

= P(X 1 > x)P(X 2 > x)
= e−λ1 x e−λ2 x
= e−(λ1 +λ2 )x .
This is an interesting result, because it says the minimum of two independent exponentials is
distributed exponential with rate λ1 + λ2 .
A surprising fact is that these two events above are independent.
P(Z > x, X 1 < X 2 ) = P(min{X 1 , X 2 } > x, X 1 < X 2 )

= P(X 1 < X 2 , x < X 1 , x < X 2 )
= P(X 1 < X 2 , x < X 1 )
R∞
= −∞ P(X 1 < X 2 , x < X 1 |X 1 = y) f X 1 ( y)d y
R∞
= x P( y < X 2 ) f X 1 ( y)d y
R∞
= x λ1 e−λ2 y e−λ1 y d y
λ1
= e−(λ1 +λ2 )x λ1 +λ 2
= P(Z > x)P(X 1 < X 2 ).
Hence if you know which one is smaller, the size of it is independent of that. If you have two
light bulbs, one burns out at rate 1 and the other at rate 10, the first burning out (no matter
38
which one) occurs at rate 11, or in 1/11 time on average. That is true even if you know that the
rate 1 lightbulb burnt out first. This is a surprising but fundamental property of exponential
variables.
Generalizing these results
P we get that the minimum of k independent exponentials is again
and exponential with rate j λ j , and that the event that it’s the ith exponential is independent
with probability λi / j λ j .
P
3.1.3 Strong Memoryless Property

The memoryless property stated previously is about the conditional probability of X given
X > s. Here, s is a real number which is not random. Strong memoryless property is a statement
when this s is also a random variable.
Theorem 3.1.4. If X 2 is an exponential random variable with rate λ and X 1 is an independent
non-negative continuous random variable, then for any x ≥ 0
P(X 2 > X 1 + x|X 2 > X 1 ) = P(X 2 > x) = e−λx .

Proof.
P(X 2 > X 1 + x, X 2 > X 1 ) = P(X 2 > X 1 + x)
R∞
= 0 P(X 2 > X 1 + x|X 1 = y) f ( y)d y
R∞
= 0 P(X 2 > y + x) f ( y)d y
R∞
= 0 e−λ(x+ y) f ( y)d y
R∞
= e−λx 0 e−λ y f ( y)d y.
R∞
and consequently, P(X 2 > X 1 ) = 0 e−λ y f ( y)d y. Hence
P(X 2 > X 1 + x, X 2 > X 1 )
P(X 2 > X 1 + x|X 2 > X 1 ) = = e−λx
P(X 2 > X 1 )
3.1.4 Sums of I.I.D. Exponentials

Theorem 3.1.5. If Z = X 1 + X 2 + · · · X n , where X i ∼ exp(λ) for all i and independent, then Z is
called the gamma (n, λ) random variable and its density function is given by
(λz)n−1
f n (z) = λe −λz
.
(n − 1)!
Proof. We will use induction. If n = 1, then f1 (z) = λe−λz by definition. Now assume that the
result holds for n − 1, then for n
Rz
f n (z) = 0 f n−1 (x) f1 (z − x)d x
Rz (λx)n−2
= 0 λe−λx (n−2)! λe−λ(z−x) d x
R z x n−2
= λe−λz λn−1 0 (n−2)! dx
n−1
= λe−λz λn−1 (n−1)!
z
(λz)n−1
= λe−λz (n−1)! .
39
By using integration-by-parts, we can find the cumulative distribution as
n−1
−λt (λt)
X r
P(Z ≤ t) = 1 − e .
r=0
r!
3.2 Poisson Processes

We first define the Poisson process as the number of arrivals (say of customers to a cafe shop)
up to time t where the wait between any two consecutive arrivals are independent exponential
random variables. We later give two other equivalent definitions as well.
Definition 3.2.1. Let τi be independent exponential (λ) random variables, S0 = 0, Sn = τ1 +τ2 +

· · · + τn and Nt = max{n ≥ 0 : Sn ≤ t}. Then (Nt ) t∈R≥0 is a Poisson Process with rate parameter
λ, or briefly PP(λ).
Here τn represents the time between (n − 1)st and nth arrivals, Sn = τ1 + τ2 + · · · + τn

denotes the time in which nth arrival occurs, and Nt is the number of arrivals by time t. It can
be illustrated as
τ1 τ2 τ3 Nt = 2
0
*
S
*S t
*S -
time
1 2 3
Note that a Poisson Process is a Counting process, that is Nt ∈ Z≥0 for all t, and Nt is
a nondecreasing function of t. The following theorem justifies why the process is called a
Poisson process.
Theorem 3.2.2. If (Nt ) t≥0 is a PP(λ), then Nt follows a Poisson distribution with rate λt for any
t, that is
e−λt (λt)k
P(Nt = k) = .
k!
Proof.
P(Nt = k) = P(Nt ≥ k) − P(Nt ≥ k + 1)
= P(Sk ≤ t) − P(Sk+1 ≤ t)
(λt) r (λt) r
Pk−1 Pk
= 1 − r=0 e−λt r! − 1 − r=0 e−λt r!
(λt)k
= e−λt k!
Above, the third equality follows as Sk ∼ gamma(k, λ) as it is the sum of k i.i.d. exponential
random variables. Hence, the theorem follows.
Example 3.2.3. Suppose Nonna’s kitchen in Morningside opens at 9pm. Arrivals to the kitchen
follow a Poisson process with rate λ = 10/hour.
1. Find the expected number of arrivals between 9pm and midnight.
40
2. What is the probability that there are exactly 8 arrivals between 9.00pm and 9.30pm?
The fact that arrivals follow a Poisson process with rate λ = 10/hour, intuitively means that
• Expected inter-arrival time for customers is 0.1 hours=6 minutes.
• Expected number of arrivals per hour is 10 customers.
The number of arrivals between 9pm and midnight is N3 which follows a Poisson distribution
with rate λt = 10 × 3 = 30. Since the expectation of a Poisson random variable is the rate,
expected number of customers till midnight is 30. This also matches with our intuition.
The second part of the question asks
e−0.5λ (0.5λ)8 e−5 58

P(N0.5 = 8) = = ≈ 0.065.
8! 8!
A PP(λ) counts the events starting from time 0. Suppose you reset your counter at time s
and start counting the future events. We see that this new process can be defined as
Nt(s) = Ns+t − Ns for all t ≥ 0,
(s)
which means that (Nt ) t≥0 counts the number of events between s and t + s. What can be said
about this new stochastic process?
By simple observation, we can see that the time between first and second events, second and
third events, ..., are all i.i.d. exponentials with rate λ. Using the memoryless property, we can
also see that first inter-event time is also exponential with rate λ. Hence, from the definition,
(s)
we can see that (Nt ) t≥0 is a PP(λ). The following theorem summarizes these results
(s)
Theorem 3.2.4. The process (Nt ) t≥0 is a PP(λ) and it is independent of (Nu )0≤u≤s .
Proof. As stated above the only event we need to worry about is the time for the first event.
We can write
P(SNs +1 − s > x) = P(SNs + τNs +1 − s > x|SNs + τNs +1 − s > 0)

= P(τNs +1 > s + x − SNs |τNs +1 > s − SNs )
= P(τNs +1 > x) = e−λx
where we used the strong memoryless property of the exponential. Note also that in the first
equality a condition seemingly appeared out of nowhere, but that condition was implicitly
assumed on the left hand side.
Note here that by repeatedly using the above theorem, you can break up a time interval
into any finite number of non-overlapping intervals and get independent Poisson processes.
We exploit this feature in the following example.
Example 3.2.5. What is the probability of P(Nt 1 = n1 , Nt 2 = n2 , . . . , Nt k = nk )?
41
This probability can be written as
P(Nt 1 = n1 , Nt 2 = n2 , . . . , Nt k = nk )
= P(Nt 1 = n1 , Nt 2 − Nt 1 = n2 − n1 , . . . , Nt k − Nt k−1 = nk − nk−1 )
= P(Nt 1 = n1 )P(Nt 2 − Nt 1 = n2 − n1 ) · · · P(Nt k − Nt k−1 = nk − nk−1 )
(t 1 ) (t k−1 )
= P(Nt 1 = n1 )P(Nt 2 −t 1
= n2 − n1 ) · · · P(Nt k −t k−1
= nk − nk−1 )
n −n
(λt 1 )n1 −λ(t 2 −t 1 ) (λ(t 2 −t 1 ))n2 −n1 −λ(t k −t k−1 ) (λ(t k −t k−1 )) k k−1
= e−λt 1 n1 ! e (n2 −n1 )! · · · e (nk −nk−1 )!
n1 n −n
−λt k nk t 1 (t 2 −t 1 ) 2 1 (t k −t k−1 )nk −nk−1
= e λ n1 ! (n2 −n1 )! · · · (nk −nk−1 )!
Definition 3.2.6. 1. A process (Nt ) t≥0 is said to have stationary increments if Ns+t − Ns is
identically distributed for all s, that is the distribution does not dependent on s.
2. A process (Nt ) t≥0 is said to have independent increments if the increments of the distribu-
tion is independent for non-overlapping intervals, that is Ns1 +t 1 − Ns1 and Ns2 +t 2 − Ns2 are
independent if [s1 , s1 + t 1 ] ∩ [s2 , s2 + t 2 ] = ;.
The following theorem is an “if and only if" statement, which means it gives another defi-
nition for the Poisson process.
Theorem 3.2.7. (Nt ) t≥0 is a PP(λ) if and only if
1. it has stationary and independent increments.
2. Nt is a Poisson (λt) random variable for all t.
Proof. The “only if" part is proven above, that is if we have a PP(λ), then it should have station-
ary and independent increments (Theorem 3.2.4), and the increments should follow a Poisson
distribution. To prove the “if" part, consider
e−λt (λt)0
P(τ1 > t) = P(Nt = 0) = = e−λt .
0!
This means that the first inter-event time has exponential distribution with rate λ. For the
n + 1th interarrival time (we take n + 1 for notational simplicity only), let f (s) ≡ fSn (s) denote
the density function for Sn and we get
P(τn+1 > t) = P(NSn +t − NSn = 0)

R∞
= 0 P(NSn +t − NSn = 0|Sn = s) f (s)ds
R∞
= 0 P(Ns+t − Ns = 0) f (s)ds
R∞
= 0 P(Nt = 0) f (s)ds
R∞
= P(Nt = 0) 0 f (s)ds = e−λt .
Notice that third and fourth equality follows using stationary and independent increments.
Hence, we know that all the inter-event times are exponential with rate λ, which means if the
conditions stated in the theorem hold, then (Nt ) t≥0 is a PP(λ).
42
So far we have seen two different definitions of Poisson processes, which will help us to
recognize them. We will give a third definition for Poisson processes, but before doing so we
need to define some related concepts.
f (x)
Definition 3.2.8. A function f (x) is said to be a o(x) (read “little o x” ) function, if lim x→0 x =0.
The definition means that as the parameter goes to 0 the function goes to 0 faster than
linearly. Here are some examples
Example 3.2.9. 1. f (x) = ax 2 is a o(x) function, because lim x→0 ax 2 /x = lim x→0 ax =
0. (Similarly you can prove that any function with f (x) = ax n , where n > 1 is a o(x)
function).
2. f (x) = ax is not o(x), because lim x→0 ax/x = a.
3. f (h) = sin h is not o(h). We see that limh→0 sinh h = 00 , hence we need to use L’hospital’s rule,
that is we need to take the derivative of the numerator and denominator to find the limit.
f (h)
This implies if limh→0 h = 0/0, then
f (h) f 0 (h) cos(h)

lim = lim = lim = 1.
h→0 h h→0 1 h→0 1
Now, we are ready to state a third definition of Poisson processes.
Theorem 3.2.10. (Nt ) t≥0 is a PP(λ) if and only if
1. it has stationary and independent increments.
2.
P(Nh = 0) = 1 − λh + o(h)
P(Nh = 1) = λh + o(h)
P(Nh ≥ 2) = o(h).
This theorem implies that as we get shorter intervals the probability of observing two or
more events in the interval decreases very fast (faster than linearly). Apart from the o(h)
(“negligible") terms, with a small probability λh we have a single arrival, but most likely there
is no arrival.
Proof. First let’s prove the “only if" part, that is if (Nt ) t≥0 is a PP(λ) then the conditions stated
are true. We need not worry about stationary and independent increments as it is already
proven. Playing with the equations, we need to show P(Nh = 0) − 1 + λh is a o(h) function.
P(Nh = 0) − 1 + λh e−λh − 1 + λh −λe−λh + λ

lim = lim = lim = 0.
h→0 h h→0 h h→0 1
where we used L’Hospital’s rule. To prove the same thing for P(Nh = 1), we need to use the
same trick. Then third equation follows as the sum is equal to 1.
43
To prove the “if" part, that is if conditions 1,2 hold (Nt ) t≥0 is a PP(λ), we need to show that
the conditions above imply Nt is a Poisson random variable. For shorthand notation define
pn (t) = P(Nt = n), then using stationary and independent increments, we can write
p0 (t + h) − p0 (t) p0 (t)p0 (h) − p0 (t) p0 (t)(−λh + o(h))

= =
h h h
and using the law of total probabiliy for n ≥ 1

Pn
pn (t + h) − pn (t) i=0
P(Nt+h = n|Nt = i)P(Nt = i) − pn (t)
=
h Pn h
p
i=0 n−i
(h)p i (t) − p n (t)
=
h Pn−2
(1 − λh + o(h))pn (t) + λhpn−1 (t) + i=0 o(h)pi (t) − pn (t)
=
h Pn−2
(−λh + o(h))pn (t) + λhpn−1 (t) + i=0 o(h)pi (t)
=
h
Taking the h → 0 limit in above equations, we get
p00 (t) = −λp0 (t)

pn0 (t) = −λpn (t) + λpn−1 (t), for n ≥ 1.
By solving the first equation we can see that p0 (t) = P(Nt = 0) = e−λt . Substituting this in the
equation for n = 1 we can solve for p1 (t). We can finish the proof by induction.
3.3 Superpositioning and Splitting

(A) (B)
Suppose we have two independent Poisson processes (Nt ) t≥0 and (Nt ) t≥0 with rates λA and
(A) (B)
λB respectively. What can we say about the combined process Nt = Nt + Nt ?
First, we can say that the new process has stationary and independent increments. Remem-
bering that the sum of Poisson random variables are again Poisson with rates summed up we
can see that Nt ∼ Poisson (t(λA + λB )). Hence, (Nt ) t≥0 is a PP(λA + λB ).
We can argue the same result using the properties of exponential distribution. We see that
(A) (B)
the first event time for Nt is the minimum of the first event times of Nt and Nt . Hence,
first event time is exponential with rate λA + λB . Then, using the strong memoryless property
the subsequent event is also exponential with rate λA + λB . Hence, using the first definition of
Poisson processes we again conclude that (Nt ) t≥0 is a PP(λA + λB ). Generalizing results for k
different processes, we state the following theorem:
(i) (1) (k)

Theorem 3.3.1. Let (Nt ) t≥0 be PP(λi ) for each i = 1, . . . , k. Then, with Nt = Nt + · · · + Nt ,
(Nt ) t≥0 is a PP(λ1 + · · · + λk ).
Proof. Use induction along with the above argument.
44
Now consider the opposite way. Suppose (Nt ) t≥0 is a PP(λ). When each event occurs, it is
of type A with probability p and type B with probability (1 − p). Then the number of A type
(A)
events occurring up to time t, Nt , and the number of B type events occurring up to time
(B)
t, Nt has stationary and independent increments (Why?). Also using the methodology in
(A) (B)
Example 1.1.7 we can prove that Nt and Nt have independent Poisson distributions with
rate pλt and (1− p)λt respectively. This is called the splitting (or thinning) property of Poisson
processes. The splitting mechanism described above, that is assigning each event to be of type
i with probability pi independent of other events, is called a “Bernoulli splitting mechanism”.
Theorem 3.3.2. Suppose (Nt ) t≥0 is a PP(λ). If an event in this process is of type i
Pk
with probability pi independent of the other events, where p = 1, then the processes
i=1 i
(1) (2) (k)
(Nt ) t≥0 , (Nt ) t≥0 , . . . , (Nt ) t≥0 are independent Poisson processes with rates λp1 , λp2 , . . . λpk
respectively.
Example 3.3.3. Suppose there are two machines operating in a factory. The failure time for
machine A is exponential with rate 4/year and with probability 0.4 this is a failure that requires
replacement. Similarly, failure time for machine B is exponential with rate 6/year and with
probability 0.2 the failed machine needs to be replaced. If replacements and repairs are done
instantaneously, what is the probability that exactly 7 machine replacements will be needed over
two years?
First, the failures for machine A and machine B follow a PP(4) and PP(6) respectively. Then
using the splitting property, replacements for machine A and B follow PP(0.4×4) and PP(0.2×6).
Now using superpositioning we can see that the total number of replacements is a PP(0.4 × 4 +
0.2 × 6)=PP(2.8). Hence if Nt denotes the number of replacements by time t then
e−2.8×2 (2.8 × 2)7
P(N2 = 7) = ≈ 0.1267.
7!
3.4 Campbell’s Theorem: Uniform Order Statistics

In this section we study the distribution of arrival times conditioned on the number of arrivals.
Let us consider a single arrival first, and ask if it occured before time s (with s ≤ t), that is
P(Ns = 1, Nt = 1) P(Ns = 1, Nt − Ns = 0)
P(S1 < s|Nt = 1) = P(Ns = 1|Nt = 1) = =
P(Nt = 1) P(Nt = 1)
(s)
P(Ns = 1, Nt−s = 0) P(Ns = 1)P(Nt−s = 0) λse−λs e−λ(t−s) s
= = = =
P(Nt = 1) P(Nt = 1) λt e−λt t
hence S1 is uniform on [0, t]. In general, what is the probability that k events happened before
time s with the condition that n events happened till time t?
(s)
P(Ns = k, Nt = n) P(Ns = k, Nt−s = n − k)
P(Ns = k|Nt = n) = =
P(Nt = n) P(Nt = n)
P(Ns = k)P(Nt−s = n − k) (λs)k −λs (λ(t − s))n−k −λ(t−s) n! λt
= = e e e
P(N t = n) k! (n − k)! (λt)n
n s k s n−k

= 1−
k t t
45
which is the binomial distribution with parameters n and success probability s/t. The reason is
that, as we’ll see, each arrival occurs independently with probability s/t before s, that is each
event occurs independently and uniformly on [0, t].
Note that this is only true if we do not consider the order at which these events occurred
(as in our last calculation). For example if there are two events, the first one always occurs
before the second one, which is not true for two independent uniform variables. In practice,
generate n uniform random variables on [0, t], and the smallest one corresponds to the first
arrival, the second smallest to the second arrival, and so on.
In mathematical terms, let Ui for i = 1, . . . n be independent uniform variables on [0, t]. Let
us order them, that is we define U(1) to be the minimum of Ui s, U(2) to be the second smallest
element, and so on. U(i) is called the ith order statistic of the uniform distribution. Let us
summarize the result we got.
Theorem 3.4.1. Campbell’s Theorem. Let Sn be the event times for a Poisson process. If Nt = n
is given then the vector (S1 , S2 , . . . , Sn ) follows the distribution of ordered independent uniform
variables (U(1) , U(2) , . . . , U(n) ). Consequently, the unordered set of arrival times {S1 , . . . , Sn } has
the same distribution as {U1 , . . . , Un }.
Proof. We find the conditional distribution of event times, Si , 0 < i ≤ n, conditioned on Nt = n.
Choosing any 0 = t 0 < t 1 < t 2 < · · · t n < t and hi infinitesimal,
P(Si ∈ [t i , t i + hi ], 1 ≤ i ≤ n; Nt = n)
P(Si ∈ [t i , t i + hi ], 1 ≤ i ≤ n|Nt = n) =
P(Nt = n)
P(exactly one event in [t i , t i + hi ], 1 ≤ i ≤ n; no event elsewhere in [0, t])
=
P(Nt = n)
−λhn −λ(t−h1 −···−hn )
λh1 e −λh1
· · · λhn e e
=
(λt) /n! × e−λt
n
n!
= n h1 · · · h n .
t
Hence the conditional density is
n!
fS1 ,...,Sn (t 1 , . . . , t n |Nt = n) =
tn
This density doesn’t depend on t i , hence it is uniform. Its value is one over the volume that the
ordered uniform (U(1) , U(2) , . . . , U(n) ) variables span. The unordered uniforms span t n volume,
and we have to divide that into n! equal parts, corresponding to the n! different orderings.
Example 3.4.2. Suppose arrivals to a real estate agent follows a PP(λ/hour) during an 8-hour
day. If exactly 20 customers arrived in a given day, what is the probability that exactly 5 of them
has arrived during the first hour?
Campbell’s theorem implies that if N8 = 20, then these 20 arrivals can be placed using uniform
distribution over these 8 hours. So the question reduces to “Out of 20 trials what is the probability
that exactly 5 of them hits the first hour?” Using our binomial distribution argument,
5 15
20 1 7
P(N1 = 5|N8 = 20) = .
5 8 8
Also notice that we were not given λ and we did not need to use it!
46
Example 3.4.3. Suppose customers arrive to an amusement park according to a Poisson process
with rate 20000 customers/year throughout the year. The company charges 30 pounds for entry
between June and August and 10 pounds for other months. For simplicity assume all the months
are of equal length and if we are given that 10000 people have arrived during the year, what is
the expected revenue for the amusement park.
Since N1 = 10000 is given, we can use Campbell’s theorem and conclude that a given customer
is equally to arrive any day of the year. Hence the expected amount a customer pays is
30 × P(arrives on June, July or August) + 10 × P(arrives on other months)

= 30 × 3/12 + 10 × 9/12
= 15 pounds.
Multiplying by the total number of customers 10000 × 15 = 150000.

If we did *not* know N1 = 10000, then the expected revenue would be
E(Revenue) = 10 × E(N5/12 + (N1 − N8/12 )) + 30 × E(N8/12 − N5/12 )

= 10 × 9/12 × 20000 + 30 × 3/12 × 20000
= 150000 + 150000
= 300000.
3.5 Nonhomogeneous Poisson Process

In this section, we relax the stationarity assumption of Poisson Processes. Consider the second
condition for the third definition of the Poisson processes (given in Theorem 3.2.10) in the
following format
P(Nt+h − Nt = 0) = 1 − λh + o(h)
P(Nt+h − Nt = 1) = λh + o(h)
P(Nt+h − Nt ≥ 2) = o(h).
Now relaxing the stationary assumption, we can make λ depend on time t.
Definition 3.5.1. A counting process (Nt ) t≥0 is a nonhomogeneous Poisson process if
1. it has independent increments
2.
P(Nt+h − Nt = 0) = 1 − λ(t)h + o(h)
P(Nt+h − Nt = 1) = λ(t)h + o(h)
P(Nt+h − Nt ≥ 2) = o(h).
Interestingly, the number of arrivals at any fixed time t is still Poisson.
Theorem 3.5.2. Define Z t
Λ(t) = λ(u)du.
0
Then Nt is a Poisson Λ(t) random variable for any t.
47
Proof. The proof follows the same lines as the "if" part of the proof of Theorem 3.2.10. First,
we derive the differential equations governing the probabilities P(Nt = k). Let us start again
with the k = 0 case.
P(N =0)−P(N =0)
d
dt P(Nt = 0) = limh→0 t+h h t
P(N =0)[P(Nt+h −Nt =0)−1]

= limh→0 t h
1−λ(t)h+o(h)−1
= P(Nt = 0) limh→0 h = −λ(t)P(Nt = 0)
with initial condition P(N0 = 0) = 1. Instead of solving it, we just substitute the solution
P(Nt = 0) = e−Λ(t) to verify it
d −Λ(t)
e = −λ(t)e−Λ(t)
dt
We proceed similarly for k ≥ 1, as we did in the proof of Theorem 3.2.10.
It will be convenient to rewrite the previous theorem for an interval.

Rt
Corollary 3.5.3. Nt − Ns is a Poisson random variable with parameter Λ(t) − Λ(s) = s
λ(u)du.
Proof. Use the independent increments property.
Example 3.5.4. Suppose the arrivals to a bank follow a nonhomogeneous PP, and the arrival rates
are piecewise constant as given in the table below.
Time Interval 9-10.30 10.30-12.00 12.00-1.00 1.00-3.00 3.00-5.00

Arrival Rate(per hour) 10 15 12 20 15
• What is the expected number of arrivals between 10am and 1.30pm?
• What is the variance of arrivals between 10am and 1.30pm?
• What is the probability that there will be 25 arrivals between 10am and 1.30pm?
First, let’s label time as t = 0 at 9.00am and t = 8 at 5.00pm. The arrivals between 10am
and 1.30pm has a Poisson distribution with rate
R 4.5
Λ(4.5) − Λ(1) = λ(t)d t
R11.5 R3 R4 R 4.5
= 1 10d t + 1.5 15d t + 3 12d t + 4 20d t
= 5 + 22.5 + 12 + 10
= 49.5.
So the mean and variance of the arrivals between 10am and 1.30pm is equal to 49.5. For the
third question
49.525
P(N4.5 − N1 = 25) = e−49.5 .
25!
Nonhomogeneous processes inherit many features from homogeneous processes.
48
Corollary 3.5.5. The superpositioning and splitting properties are true for nonhomogeneous Pois-
son processes as well.
Proof. You can show these by using the definition of the nonhomogeneous Poisson process. It
is left as an exercise.
On the other hand, while event times are still independent for nonhomogeneous processes,
they are no longer uniform as we’ll see below.
3.5.1 Event Times for Nonhomogeneous PP

Similarly to the uniform order statistics property of homogeneous PP, we can derive order
statistics property for nonhomogeneous PP. We proceed similarly to the homogeneous case and
calculate the probability that exactly k events happened before time s if n events happened
before t, that is
P(Ns = k)P(Nt − Ns = n − k)
P(Ns = k|Nt = n) =
P(Nt = n)
Λ(s)k −Λ(s) (Λ(t) − Λ(s))n−k −(Λ(t)−Λ(s)) Λ(t)n −Λ(t)
Á
= e e e
k! (n − k)! n!
Λ(s) k Λ(s) n−k

n
= 1−
k Λ(t) Λ(t)
This is again a binomial with n trials but this time the success probability is Λ(s)/Λ(t). Since
this is true for all k and all s, we conclude that the unordered arrival times are distributed
independently on [0, t] with a distribution Λ(s)/Λ(t) and density λ(s)/Λ(t). To state this as a
theorem, define i.i.d. random variables U1 , U2 , . . . , Uk by
Λ(u)
P(Ui ≤ u) = , 0 ≤ u ≤ t.
Λ(t)
and let U(i) be their order statistics.
Theorem 3.5.6. Let (Nt ) t≥0 be a nonhomogeneous PP, then if Nt = k is given then
(S1 , S2 , . . . , Sk ) ∼ (U(1) , U(2) , . . . , U(k) ).

Example 3.5.7. Consider Example 3.5.4. If in a given day we observed 90 customers, N8 = 90,
what is the probability that exactly 20 of them were between 12pm and 1pm?
So we need to first find
R8 R 1.5 R3 R4 R6 R8
0
λ(t)d t = 0
10d t + 1.5
15d t + 3
12d t + 4
20d t + 6
15d t
= 15 + 22.5 + 12 + 40 + 30
= 119.5.
R4
λ(t)d t
The probability of generating a random variable between t = 3 and t = 4 is 3
119.5 =
12/119.5.
Hence,
12 20 107.5 70

90
P(N4 − N3 = 20|N8 = 90) = .
20 119.5 119.5
49
3.6 Compound Poisson Processes
Now, we return back to the homogeneous setting. Suppose events occur according to a P P(λ).
When event n occurs we incur a random cost of Yn , which is independent and identically
distributed for all n. Then the total cost incurred up to time t is given by
Nt
X
Zt = Yn .
n=1
and (Z t ) t≥0 is called a Compound Poisson Process. The expected value of Z t can be calculated
as P∞
E(Z t ) = Pk=0 E(Z t |Nt = k)P(Nt = k)
∞
= k=0 kE(Y
P∞ 1 )P(Nt = k)
= E(Y1 ) k=0 kP(Nt = k)
= λt E(Y1 ).
To find Var(Z t ), we will use the conditional variance formula derived in 1.2.6.
Var(Z t ) = E(Var(Z t |Nt )) + Var(E(Z t |Nt ))

= E(Nt Var(Y1 )) + Var(Nt E(Y1 ))
= λtVar(Y1 ) + λt E(Y1 )2
= λt(E(Y12 ) − E(Y1 )2 + E(Y1 )2 )
= λt E(Y12 ).
Example 3.6.1. The number of customers arrive at a grocery store as a PP(10), where time is
measured in hours. Each customer spends a random amount uniformly distributed between 0 and
100 Pounds. What is the expected value and variance of the income of the store during an 8 hour
day?
The spending of each customer is Uniform on [0, 100], with mean 50 and variance 1002 /12.
Hence, by using the above formulas, the mean and the variance of the income is
1002

E Z8 = 10 × 8 × 50 = 4000, VarZ8 = 10 × 8 × + 50 ≈ 266, 667
2
12
p
So the mean square deviation of the total income is VarZ8 ≈ 516 Pounds.
50
Chapter 4
Continuous Time Markov Chains
In this chapter, we will analyze continuous time Markov processes with discrete state space.
These processes will be referred to as Continuous Time Markov Chains (CTMCs). Many con-
cepts introduced for discrete time chains in Chapter 2 will be generalized here.
4.1 Basic Definitions

The definition of a continuous time discrete state space process is based on an underlying
discrete time process (X̃ n )n∈N , the embedded chain with jump probabilities p̃i j . We put a˜on
any quantity related to the embedded chain. The continuous time process then starts at state
X̃ 0 , but doesn’t stay there for a unit time. Instead, after a random τ1 time, the system makes
a transition to another state X̃ 1 . Here τ1 and the state X̃ 1 depend only on X̃ 0 . τ2 time after
this transition, the system makes a transition to another state X̃ 2 . Again, τ2 and the state X̃ 2 ,
depend only on X̃ 1 and so on. It can be illustrated as
S0 = 0 state X̃ S1 state X̃ 1 S2 state X̃ 2 S3

0
-
time
duration τ1 duration τ2 duration τ3
The time τn spent in state X̃ n−1 is called the inter-event time. We can find Sn , the time when
the nth transition occur, by setting S0 = 0 and adding up the inter-event times
n
X
Sn = τn .
i=1
For times between transitions Sn ≤ t < Sn+1 , the system is in state X̃ n . By denoting the state of
the system at any real (continuous) time as X t , we can write that X t = X̃ n for Sn ≤ t < Sn+1 .
Definition 4.1.1. The continuous time stochastic process (X t ) t≥0 is called a Continuous Time
Markov Chain (CTMC) if
1. each duration τn is an exponential random variable with rate qi > 0 which depends only
on state X̃ n−1 = i the process leaves,
51
2. the corresponding (embedded) discrete time process (X̃ n )n∈N is a DTMC with p̃ii = 0 for all
i.
More formally
P(X̃ n = j, τn > y|X̃ n−1 = i, τn−1 , X̃ n−2 , τn−2 , . . . , X̃ 0 ) = P(X̃ n = j, τn > y|X̃ n−1 = i)
= p̃i j e−qi y .
This definition ensures that our continuous time chain is Markov, by which we mean the
following.
Definition 4.1.2. A continuous time process (X t ) t≥0 has the Markov property if for any 0 ≤ s0 <
s1 < · · · < sn < s, any t ≥ 0 and any possible states i0 , . . . , in , i, j we have
P(X s+t = j|X s = i, X sn = in , . . . , X s0 = i0 ) = P(X s+t = j|X s = i)
Using the Markov property of the embedded DTMC and the memoryless property of the
exponential, we can prove the following theorem.
Theorem 4.1.3. The CTMC (X t ) t≥0 has Markov property.
Example 4.1.4. Consider our simple weather example with two states S = {s, r}. Here our
stochastic process (X t ) t≥0 gives us whether it is raining at time point t. Suppose when it starts
raining, it rains an exponential amount of time with rate q r , and after rain stops, we observe
sunny weather for an exponential amount of time with rate qs . Notice that after a rainy weather,
we are sure to observe a sunny weather and after a sunny weather, we are sure to observe a rainy
weather, hence the embedded DTMC has the transition matrix

0 1
P̃ = .
1 0
Here the important point is to realize that embedded DTMC shows the states just after a
transition occurs!
Example 4.1.5. Single Machine Repair: This example is mathematically the same as the exam-
ple before. Suppose we have a machine that fails after an exponential amount of time with rate
µ. After it fails, it takes an exponential amount of time with rate λ to repair the machine. Define
X t = 1, if the machine is running at time t, and X t = 0, if it is being repaired. Then we can see
that q1 = µ and q0 = λ and again the embedded DTMC has the transition matrix

0 1
P̃ = .
1 0
Since we have an exponential inter-event time at each state, the properties of exponential
distribution will be used a lot in this chapter. The example below shows how the properties
for minimum of two exponentials is used in modelling a system as a CTMC.
52
Example 4.1.6. Two Machines-Two Repairmen: We have two machines and two repairmen in
the system. Similar to the example above, failure time of a machine is exponential with rate µ.
When a machine fails, only one repairman works on the machine and repairs the machine in an
exponential time with rate λ. If X t is defined to be the number of machines operating at time t,
Model this system as a CTMC.
We need to find the rates and the probability for the embedded DTMC. When X t = 0, this
means that both machines are down and both repairmen are working. Hence, the system will go
to state 1, when a repairman finishes the repair. You can see that inter-event time in this state
is the minimum of two exponentials each with rate λ, hence it is exponential with 2λ, q0 = 2λ.
Similarly, if both machines are up and running, X t = 2, then the transition will occur to state 1,
if one of the machines fail. The inter-event time is minimum of two exponentials each with rate µ,
hence q2 = 2µ. When there is only one machine running, the system will change state when either
the running machine fails or the down machine is repaired. Hence, the inter-event time is again
the minimum of two exponentials one with rate µ and the other with rate λ, that is inter-event
time is exponential with rate q1 = λ + µ.
When X t = 1, the next state will be either 0 or 2 depending on whether a failure or a repair
occurs first. In the previous chapter, we learned that if we consider two exponentials, the probability
of one of them occurring before the other can be stated using the ratio of the rates. Hence, we can
conclude that the embedded DTMC has the transition probability matrix
 
0 1 0
µ λ
P̃ =  λ+µ 0 λ+µ 
0 1 0
As it is mentioned above, the embedded DTMC records the state after a transition occurs.
Hence, the possibility of the embedded DTMC staying at the same state is 0. In relation with
the transition probabilities of the embedded DTMC and the rates for the inter-event times we
define the following matrix.
Definition 4.1.7. The rate matrix or the generator Q of the CTMC (X t ) t≥0 is defined through its
elements
qi j = qi p̃i j , if iP
6= j
qii = −qi = − j∈S, j6=i qi j .
Here qi j is called the jump rate from i to j.
Notice that if we are given the Q matrix, we can find the rates for inter-event times and
transition probabilities for the embedded DTMC. Hence, if we want to model a system as a
CTMC, it will be enough to just give the state space and the rate matrix.
Similarly to the state diagram for DTMCs, we can construct a visual tool for CTMC. This
visual tool is called the rate diagram and shows the states as the nodes and on the arc between
state i and j, we write qi j . Notice that we need not show qii .
Example 4.1.8. Two Machine–Two Repairmen: The rate matrix for the two machine–two re-
pairmen problem can be written as
 
−2λ 2λ 0
Q= µ −(λ + µ) λ .
0 2µ −2µ
53
The rate diagram for the problem is
2λ λ
0 1 2
µ 2µ
Example 4.1.9. Poisson Processes: We see that the PP(λ), (Nt ) t≥0 is a special type of CTMC.
That is the inter-event time at any state i is exponential with rate λ and we jump to state i + 1
with probability 1. Hence the rate diagram can be shown as
λ λ λ ...
0 1 2
Example 4.1.10. Pure Birth Processes: Suppose when there are n people in a population, the
next birth occurs after an exponential amount of time with rate λn . Letting X t be the number of
people in the population at time t, we can model the system as a CTMC with state space {0, 1, 2, . . . }
and rate diagram
λ0 λ1 λ2
0 1 2 ...
Example 4.1.11. Birth-Death Processes: Similar to pure birth processes, suppose when there
are n people in a population, the next birth occurs after an exponential amount of time with rate
λn . Also, when there are n people in the system we observe deaths with rate µn . We assume
µn , λn > 0 for all n. Then we can see that qn = λn + µn , and calculate that
λn µn
p̃n,n+1 = , p̃n,n−1 = and p̃n j = 0 otherwise.
λn + µn λn + µn
Consequently, qn,n+1 = λn and qn,n−1 = µn . Notice that these are equal to the rates of the occasions
that are causing these transitions. You can argue that this is always the case. Hence the rate
diagram is
λ0 λ1 λ2
0 1 2 ...
µ1 µ2 µ3
Example 4.1.12. M /M /1 Queue: Suppose that customers arrive to a bank according to a PP(λ).
There is one bankteller in the bank, who is serving the customers in the order the customers arrive.
The service times for the customers are exponential random variables with rate µ. Model the
number of customers in the bank as a CTMC.
Notice that the time between arrivals are exponential with rate λ. When X t = i, we know that
we will jump to state i + 1 when an arrival occurs, hence the rate for jumping from i to i + 1 is λ.
Similarly, when X t = i, we jump to state i − 1 when a service is finished, which occurs with rate
µ. Hence, the rate diagram for the number of customers in an M /M /1 queue is given
54
λ λ λ
0 1 2 ...
µ µ µ
Example 4.1.13. M /M /∞ Queue (Ample Service): Similar to the previous example, suppose
that customers arrive to a bank according to a PP(λ). But suppose that there are infinitely many
servers, which means the banktellers start serving a customer as soon as the customer arrives. The
service times for the customers are again exponential random variables with rate µ. Model the
number of customers in the bank as a CTMC.
Notice that the time between arrivals are exponential with rate λ. When X t = i, we know that
we will jump to state i + 1 when an arrival occurs, hence the rate for jumping from i to i + 1 is
λ. Different than the above example, when there are i customers in the system, i banktellers are
serving them. This means there will be a transition from i to i − 1 when one of these customers
are served, (minimum of i exponentials). Hence, the rate for jumping from state i to i − 1 is iµ,
as seen on this rate diagram
λ λ λ
0 1 2 ...
µ 2µ 3µ
Example 4.1.14. M /M /c Queue : Similar to the previous examples, suppose that customers
arrive at a bank according to a PP(λ). In this case, there are c banktellers in the system. This
means when a customer arrives, if there are less than c customers in the system, one of the idle
banktellers start serving the customer. However, if there are c or more customers in the system,
all the banktellers are busy and the arriving customer joins the queue. The service times for the
customers are again exponential random variables with rate µ. Model the number of customers
in the bank as a CTMC.
Again, when X t = i, we know that we will jump to state i +1 when an arrival occurs, hence the
rate for jumping from i to i + 1 is λ. If i ≤ c, this means that i banktellers are serving customers,
this means rate for transition to i − 1 is iµ. However, when system state is i > c, then we know
that c customers are being served and the remaining i − c are waiting in the queue. There will be
a transition to state i − 1 if one of these c services finishes, hence the rate for transition to i − 1 is
cµ, as seen on this rate diagram
λ λ λ λ λ
0 µ 1 2µ 2 3µ
...
cµ c cµ
...
Back in the communist days in Hungary you had to queue in a grocery store for requesting
your bread and milk, then you went to another queue to pay for it, and to even another queue
to pick up your goods. Such a system and many other real world systems can be modeled as
queuing networks. The simplest such network models two consecutive queues.
Example 4.1.15. Network of Tandem Queues: Suppose customers arriving to a system needs
to go through a two stage service process. The customers are arriving to the system according to
55
a PP(λ). An arriving customer joins the queue for the first service. In the first service point there
is a single server which is serving the customers at an exponential time with rate µ1 . After the
customer is done with the first service, the customer joins the queue for the second service. There
is again a single server serving at exponential time with rate µ2 . The customer who is done with
the second service leaves the system. Model the number of customers in the system as a CTMC.
λ
µ1 µ2
To model this system as a CTMC, we need to use two dimensional state space (X t , Yt ), where X t
and Yt are number of customers waiting (including the ones being served) in server 1 and server 2
respectively. If the system is at state (i, j), then the system jumps to state (i + 1, j) when an arrival
to the system occurs, which occurs exponentially with rate λ. If i > 0, then the system jumps to
state (i − 1, j + 1) when a customer is done in server 1 and goes to server 2, which occurs with rate
µ1 . If j > 0, then the system will go to state (i, j − 1), when a customer is done at server 2, which
occurs exponentially with rate µ2 . Summarizing these
q(i, j)(i+1, j) = λ for all i and j,

q(i, j)(i−1, j+1) = µ1 for all i > 0 and j,
q(i, j)(i, j−1) = µ2 for all i and j > 0.
All other rates are 0. Note that in the illustration the boxes do not represent states, but the queues
themselves.
Example 4.1.16. Inspection Model: A machine can be in 4 states. The machine is in state 1 if
it is brand new. After being used for an exponential amount of time with rate λ1 , the machine
is worn out to state 2. Similarly after an exponential amount of time with rate λ2 the machine
becomes state 3. When it is at state 3, after an exponential time with rate λ3 it goes into state 4.
A repairman inspects the machine at random times, where the time between two inspections are
exponentially distributed with rate µ. If during inspection, the repairman finds the machine in
states 3 or 4, the machine is replaced with a brand new machine immediately. Let X t be the state
of the machine at time t and model this system as a CTMC.
Obviously, the state space is S = {1, 2, 3, 4}. The rates are given as
q12 = λ1 , q23 = λ2 , q34 = λ3 , q31 = µ, q41 = µ.
4.2 Chapman-Kolmogorov Equations

In the examples above, we saw that modelling systems as CTMC can be really useful. Now, our
goal will be to analyze these systems. For this purpose we define the transition matrix P(t)
with entries
pi j (t) = P(X t = j|X 0 = i).
This is the probability of being at state j at time t, if we know that we start at state i at time
0. Similar to DTMCs, we define the initial distribution to be
(0)
ai = P(X 0 = i).
56
Theorem 4.2.1. The matrix P(t) and the initial distribution a(0) completely characterizes the
CTMC, that is the probability
P(X t 1 = i1 , X t 2 = i2 , . . . , X t k = ik )
can be assessed by only knowing P(t) and a(0) .

Proof. So far, we know some properties of CTMCs. First, we know that we have Markov prop-
erty. Also, since the rate to jump out of state i and the embedded DTMC only depends on i, we
see that we have time-homogeneity. We will use these two properties along with law of total
probability to prove the theorem. We can see that
P(X t 1 = i1 , . . . , X t k−1 = ik−1 , X t k = ik )
= P(X t k = ik |X t 1 = i1 , . . . , X t k−1 = ik−1 )P(X t 1 = i1 , . . . , X t k−1 = ik−1 )
= P(X t k = ik |X t k−1 = ik−1 )P(X t 1 = i1 , . . . , X t k−1 = ik−1 )
= P(X t k −t k−1 = ik |X 0 = ik−1 )P(X t 1 = i1 , . . . , X t k−1 = ik−1 )
= pik−1 ik (t k − t k−1 )P(X t 1 = i1 , . . . , X t k−1 = ik−1 )
Carrying on in this manner, we can see that
P(X t 1 = i1 , . . . , X t k−1 = ik−1 , X t k = ik ) = pik−1 ik (t k − t k−1 ) · · · pi1 i2 (t 2 − t 1 )P(X t 1 = i1 )
All the terms in the above equation can be determined by P(t) except for P(X t 1 = i1 ). Using
the law of total probability, we see that
X X
(0) (0)
P(X t 1 = i1 ) = P(X t 1 = i1 |X 0 = i)ai = pii1 (t 1 )ai .
i∈S i∈S
(0)
Hence, the probability can be calculated using P(t) and a and the theorem follows.
The following theorem follows from the definition of the matrix P(t).
Theorem 4.2.2. P(t) has the following properties
1. pi j (t) ≥ 0
j∈S pi j (t) = 1 for all t ≥ 0

P
2.
3. (Chapman-Kolmogorov Equations) P
P(t + s) = P(t)P(s), that is pi j (t + s) = k∈S pik (t)pk j (s).
Proof. The first two properties follow directly from the definition of pi j (t). For Chapman-
Kolmogorov equations
pi j (t + s) = P
P(X t+s = j|X 0 = i)
= Pk∈S P(X t+s = j, X t = k|X 0 = i)
= Pk∈S P(X t+s = j|X t = k, X 0 = i)P(X t = k|X 0 = i)
= Pk∈S P(X t+s = j|X t = k)P(X t = k|X 0 = i)
= Pk∈S P(X s = j|X 0 = k)P(X t = k|X 0 = i)
= k∈S pk j (s)pik (t).
57
Now, suppose X 0 = i. The time for the first transition is exponentially distributed with rate
qi . This is just like the Poisson processes, which enables us to conclude for small h,
pii (h) = P(X h = i|X 0 = i) = 1 − qi h + o(h).
More precisely one has to condition on the number of jumps until time h, and realize that
starting from i we are at i at time h if we haven’t moved (which has probability e−qi h = 1 −
qi h + o(h)), or if we jumped at least twice (which has a probability o(h)). Similarly, for small
enough h, X h = j, means that there was a transition (with probability qi h + o(h)) and this
transition was from i to j (with probability p̃i j ). Hence,
pi j (h) = P(X h = j|X 0 = i) = (qi h + o(h))p̃i j = qi j h + o(h).
Noticing that pii (0) = 1 and pi j (0) = 0 for j 6= i, we obtain the following theorem.
Theorem 4.2.3. P 0 (0) = Q, that is
pii0 (0) = −qi = qii , pi0 j (0) = qi j for i 6= j
Proof.
pii (h) − pii (0) 1 − qi h + o(h) − 1
pii0 (0) = lim = lim = −qi = qii .
h→0 h h→0 h
Similarly,
pi j (h) − pi j (0) qi j h + o(h)
pi0 j (0) = lim = lim = qi j .
h→0 h h→0 h
4.3 Backward and Forward Equations

We just obtained in Theorem 4.2.3 that P 0 (0) = Q. We can generalize this formula for any
time to get a differential equation for P(t). We will have not one, but two such equations right
away.
Theorem 4.3.1. Let P(t) be the transition matrix and Q be the generator of a CTMC. Then P(t)
is the unique solution of both the forward Kolmogorov equation
X
P 0 (t) = P(t)Q that is pi0 j (t) = pik (t)qk j
k∈S
and the backward Kolmogorov equation

X
P 0 (t) = QP(t) that is pi0 j (t) = qik pk j (t)
k∈S
Both backward and forward equations have the initial conditions
P(0) = 1 that is pii (0) = 1, pi j (0) = 0 for j 6= i
58
Proof. We prove the forward equations first. First, by using the Chapman-Kolmogorov equa-
tions, we write X
pi j (t + h) = pik (t)pk j (h).
k∈S
Hence,
pi j (t + h) − pi j (t)
pi0 j (t) = lim
h→0 h
pik (t)pk j (h) − pi j (t)
P
k∈S
= lim
h→0 h
pik (t)(qk j h + o(h)) + pi j (t)(1 + q j j h + o(h)) − pi j (t)
P
k∈S,k6= j
= lim
h→0 h
k∈S pik (t)(qk j h + o(h))
P
= lim
h→0 h
= k∈S pik (t)qk j .
P
The solution is unique as it always is for constant coefficient first order linear equations. Proof
for the backward equation follows the same lines, but we condition at the small time h
X
pi j (t + h) = pik (h)pk j (t).
k∈S
We leave the completion of this part of the proof as an exercise.
Example 4.3.2. Consider the single machine, single repairman problem. The generator is
−λ λ

Q= .
µ −µ
Note that this is the most general 2-state CTMC. Let’s write the forward equations
0
p00 (t) = −λp00 (t) + µp01 (t)
0
p01 (t) = λp00 (t) − µp01 (t)
p10 (t)
0
= −λp10 (t) + µp11 (t)
0
p11 (t) = λp10 (t) − µp11 (t)
Now notice p01 (t) = 1 − p00 (t), replacing this in the first equation
0
p00 (t) = −λp00 (t) + µ(1 − p00 (t))
p00 (t)
0
= µ − (λ + µ)p00 (t)
0
p00 (t) + (λ + µ)p00 (t) =µ
e(λ+µ)t (p00
0
(t) + (λ + µ)p00 (t)) = µe(λ+µ)t
(p00 (t)e(λ+µ)t )0 = µe(λ+µ)t
µ (λ+µ)t
p00 (t)e(λ+µ)t = e +C
λ+µ
µ
p00 (t) = + C e−(λ+µ)t
λ+µ
59
λ
Using p00 (0) = 1, we can find C = λ+µ . Hence,
µ λ −(λ+µ)t
p00 (t) = + e .
λ+µ λ+µ
Also using p01 (t) = 1 − p00 (t), we get
λ
p01 (t) = (1 − e−(λ+µ)t ).
λ+µ
After solving the other equations in a similar way, we summarize the results in
µ λ λ
λ+µ + λ+µ e λ+µ (1 − e )
−(λ+µ)t −(λ+µ)t

P(t) = µ λ µ
λ+µ (1 − e ) λ+µ + λ+µ e−(λ+µ)t
−(λ+µ)t
Example 4.3.3. In this example, we derive P(Nt = k) for PP(λ) using CTMC techniques. Since
we know that N (0) = 0 for Poisson processes, we are only interested in deriving p0k (t). Remember
that the generator matrix for a PP(λ) is
−λ λ 0 0 · · ·
 
Q =  0 −λ λ 0 · · · 
.. .. .. .. . .
. . . . .
We use again the forward equation P 0 (t) = P(t)Q. Define pk (t) = p0k (t), then
p00 (t) = −λp0 (t)

p10 (t) = −λp1 (t) + λp0 (t)
pk0 (t) = −λpk (t) + λpk−1 (t)
e−λt (λt)k−1
It is easy to see that p0 (t) = e−λt . Now, we will use induction. Assume pk−1 (t) = (k−1)! , and
check whether a similar equation holds for k,
e−λt (λt)k−1
pk0 (t) = −λpk (t) + λ (k−1)!
e−λt (λt)k−1
pk0 (t) + λpk (t) = λ (k−1)!
(λt)k−1
eλt pk0 (t) + eλt λpk (t) = λ (k−1)!
(λt)k−1
(eλt pk (t))0 = λ (k−1)!
(λt)k
pk (t) = e−λt k!
We can solve the Kolmogorov equations also in general, at least for finite state spaces.
Theorem 4.3.4. For finite state spaces the solution of both backward and forward equations is
P(t) = eQt .
Proof. We first verify the solution by differentiating it
tn
P 0 (t) = ddt eQt = ddt n≥0 Q n n!
P
t n−1 tn
= n≥1 Q n (n−1)! = Q n≥0 Q n n! = QeQt = QP(t)
P P
60
which is the backward equation. Here we used the uniform convergence of the infinite series.
The forward equation is also satisfied since the matrices commute QP(t) = P(t)Q. Matrices
do not commute in general, but in our case
X tn X n tn
QeQt = Q · Qn = Q · Q = eQt Q
n≥0
n! n≥0 n!
The solution is unique as it always is for constant coefficient first order linear equations.
Example 4.3.5. We revisit example 4.3.2 of the single machine, single repairman. Using the
general solution of the Kolmogorov equations in Theorem 4.3.4 we know that P(t) = eQt , where
Q is given in example 4.3.2. To compute the exponential, we rewrite Q in its Eigen bases as
−λ λ 1 µ λ

1 −λ 0 0
Q= = UΛU =−1
µ −µ 1 µ 0 −(λ + µ) λ + µ −1 1
and hence
UΛt U −1 Λt 1 0
P=e Qt
=e = Ue U −1
=U U −1
0 e−(λ+µ)t
which leads to the same expression as in example 4.3.2.
In the previous examples we saw just how complicated the time evolution can be even in
such simple cases as the weather model. This motivates our next section: we’ll obtain the large
time behavior of these systems without solving them for finite times.
4.4 Transience and recurrence

In this section we’ll study the large time behavior of continuous chains. We motivate this
section by an example.
Example 4.4.1. Consider the single machine, single repairman problem. In Exercise 4.3.2 we
found that µ λ λ
+ −(λ+µ)t
(1 −(λ+µ)t
)

e − e
P(t) = λ+µ µ
λ+µ λ+µ
λ µ
λ+µ (1 − e −(λ+µ)t
) λ+µ + λ+µ e
−(λ+µ)t
Hence µ λ

λ+µ λ+µ
lim P(t) = µ λ
t→∞
λ+µ λ+µ
We see that the limit is independent of the initial state.
When we were studying the limiting behavior of DTMCs in Chapter 2, we defined the
concepts of (i) accessibility and communication, (ii) communicating classes and closed com-
municating class, (iii) recurrence and transience, (iv) positive recurrence and null recurrence.
In this chapter, we will use the continuous time analogs of these concepts. You are strongly
encouraged to review these topics.
The only difference in the continuous time case is that we have to define the first arrival
time to state j differently, that is T j = min{t ≥ τ1 : X t = j}. We required t ≥ τ1 , where τ1 is
61
the time of the first jump, so if we start from state j we don’t get zero. Then the definitions for
transience and recurrence are identical to the discrete case, and all we need are
ρi j = P(T j < ∞|X 0 = i), mi j = E(T j |X 0 = i).
to recall the old definitions.
Definition 4.4.2. State i is
1. transient if ρii < 1,
2. null recurrent if ρii = 1 and mii = ∞,
3. positive recurrent if ρii = 1 and mii < ∞,
The case of transience and recurrence is particularly simple, as it is the same for a CTMC
and its embedded DTMC.
Theorem 4.4.3. A state i for a CTMC is recurrent if and only if the embedded DTMC is recurrent.
Proof. If state i is recurrent for the embedded DTMC, this means with probability 1 the CTMC
will return back to state i after finitely many jumps. Hence, the return time to state i is sum
of finitely many exponential random variables (with different rates). Consequently, the return
time is finite with probability one.
Similarly, if state i is transient for the embedded DTMC, this means with positive probability
the CTMC will not return back to state i after finitely many steps, so the CTMC is also transient.
What can we say about positive recurrence? Positive recurrence of the embedded DTMC
gives no information about that of a CTMC. The positive recurrence of the embedded DTMC
gives us some information about the number of jumps, but we also need to account for the
amount of time spent at each state by considering the qi rates. In the following theorem π̃ is
not the stationary distribution, as we do not require the normalization π̃i = 1.
P
Theorem 4.4.4. Let (X t ) t≥0 be an irreducable CTMC. Suppose π̃ is a positive solution of π̃ = π̃ P̃

where P̃ is the transition matrix of the embedded DTMC. Then the CTMC is positive recurrent if
and only if
X π̃
i
< ∞.
i∈S
q i
Proof. We use the first step conditioning idea and the one step probabilities p̃i j of the embedded
chain
mi j = E(T j |X 0 = i)
= E(T j |X τ1 = = + k6= j E(T j |X τ1 = k, X 0 = i)p̃ik
P
j, X 0 i)p̃ i j
= Eτ1 p̃i j P
+ k E(τ1 + T j |X 0 = k)p̃ik
P
= Eτ1 P + k E(T j |X 0 = k)p̃ik

1
= qi + k6= j mk j p̃ik
62
Now multiplying by π̃i and summing over i
π̃ = π̃ /q + j π̃i mk j p̃ik
P P P P
m
i∈S i i j Pi∈S i i Pi∈S k6=P
= Pi∈S π̃i /qi + Pk6= j mk j i∈S π̃i p̃ik
= Pi∈S π̃i /qi + k6= j mk j π̃k
π̃ j m j j = i∈S P π̃i /qi
1
m j j = π̃ j i∈S π̃i /qi
π̃i /qi is finite (note that π̃ j > 0).

P
which means that m j j is finite if and only if i∈S
Example 4.4.5. Here we return to our old familiar model of “Success runs" we met in Example
2.6.6 and again in Example 2.9.5. We consider two cases; case 1: positive recurrent CTMC with
null recurrent embedded DTMC, and case 2: null recurrent CTMC with positive recurrent embedded
DTMC. In the success runs DTMC model S = N, and the nonzero transition probabilities are
p̃i,i+1 = pi , and p̃i0 = 1 − pi for all i ≥ 0.
Case 1: We chose pi = (i + 1)/(i + 2), then we can check that π̃i = 1/(i + 1) is a solution of
π̃ = π̃ P̃. Indeed, for i ≥ 1 we have
1 i 1
π̃i−1 p̃i−1,i = = = π̃i
i i+1 i+1
and for i = 0 we have
X X 1
π̃0 = 1 = π̃i p̃i0 = = 1.
i≥0 i≥0
(i + 1)(i + 2)
These π̃i cannot be normalized (multiplied by a constant, so probabilities for all options add up
to one) though since
X X 1
π̃i = =∞
i≥0 i≥0
i + 1
hence the DTMC is null recurrent. Define a CTMC by adding that we leave state i at rate qi = i +1,
so from Theorem 4.4.4 we find that
X π̃ X 1 π2
i
= = <∞
i≥0
qi i≥0
(i + 1)2 6
that is the CTMC is positive recurrent.

Case 2: For the DTMC pick the same pi = p ∈ (0,P 1) for all i, and you can check that π̃i = p
i
solves π̃ = π̃ P̃. This π̃ can be normalized, since i≥0 π̃i = 1/(1 − p), so there is a stationary
distribution, and hence the DTMC is positive recurrent. Let us choose the rate to leave state i to be
qi = p i to define our CTMC. Since
X π̃ X
i
= 1=∞
i≥0
qi i≥0
according to Theorem 4.4.4 our CTMC is null recurrent.
63
4.5 Stationary probabilities
It is easy to show that for a transient or even for a null-recurrent state the probability of be-
ing there goes to zero for large times, lim t→∞ pii (t) = 0. The most interesting cases are the
positive-recurrent states. Also, if such states belong to multiple communicating classes, these
closed communicating classes can be treated separately. So here we’ll focus on irreducible
positive-recurrent CTMCs.
Remember that the stationary (or limiting) probabilities π̃i s for a DTMC are calculated from
π̃ = π̃ P̃. This equation means that as time progresses the probability of being at state i does
not change. If we impose the same condition on the CTMC, we should require that pi0 j (t) = 0.
Also for an irreducible CTMC, we should have
lim pi j (t) = π j , for all i.
t→∞
(Notice that we can directly speak about limiting probabilities without worrying about period-
icity due to being continuous time. You are encouraged to think about why periodicity is not
observed in a CTMC.) Using the above two observations, and the forward equation we see that
lim P 0 (t) = lim P(t)Q = 0.
t→∞ t→∞
Also using the fact that all the rows of lim t→∞ P(t) are the same π, we can conclude the
following theorem.
Theorem 4.5.1. Let (X t ) t≥0 be an irreducible CTMC with limiting distribution πi . The limiting
distribution is the unique stationary distribution, that is the unique solution of the global balance
equations X
πQ = 0, πi = 1.
i∈S
if and only if the CTMC is positive recurrent.

Note that the stationary distribution π of a CTMC is different than the stationary distri-
bution π̃ of the embedded DTMC. Imagine for example our two state weather model, and
assume that rainy gets sunny fast, but sunny turns rainy slowly. That means that for large
times sunny weather is more probable. For the embedded DTMC however sunny and rainy
states are equally probable in the stationary state.
Example 4.5.2. Single Machine-Single Repairman Problem. We know that the rate matrix
for the problem is
−λ λ

Q= .
µ −µ
Hence we can write the global balance equations as
−λ λ

(π0 , π1 ) = 0 and π0 + π1 = 1.
µ −µ
Writing these explicitly we get
−π0 λ + π1 µ = 0
π0 λ − π1 µ = 0
π0 + π1 = 1.
64
Notice that one of the equations is redundant. Solving these equations we get π0 = µ/(λ + µ) and
π1 = λ/(λ + µ), as we have already obtained in Example 4.4.1.
We can notice that when we are writing the ith equation, it is essentially
X X X
π j q ji = πi qii + π j q ji = −πi qi + π j q ji = 0.
j∈S j∈S, j6=i j∈S, j6=i
By rearranging terms X
πi qi = π j q ji .
j∈S, j6=i
We see that on the left hand side we have the stationary probability πi times the rate qi of
jumping out of state i, and on the other side we have the sum of the stationary probabilities π j
times the rate to jump from j to i. So the flow of probability out of a state is the same as the
total flow into that state. Noticing this fact will enable us to write the global balance equations
easier using a rate diagram.
Also note that the order of factors in equation πQ = 0 matters, so we need to find a
left eigenvalue of Q for eigenvalue zero (the right eigenvalue is just the column vector 1 =
(1, 1, . . . , 1)∗ as we’ve already seen.). That’s why we used the forward Kolmogorov equation
for the derivation, and not the backward, which would give Q(π1 1, π2 1 . . . ) = 0, which is
satisfied for any π.
Example 4.5.3. Birth-Death Process. Using the rate diagram of the birth-death processes given
in Example 4.1.11 we can write the global balance equations as
λ0 π0 = µ1 π1
(λ1 + µ1 )π1 = λ0 π0 + µ2 π2
(λn P
+ µn )πn = λn−1 πn−1 + µn+1 πn+1 for all n > 0
∞
π
n=0 n
= 1.
λ0
Note that all rates are positive. From the first equation we get π1 = µ1 π0 . Plugging this into
λ0 λ1
second equation we get π2 = µ1 µ2 π0 and carrying on in a similar fashion
λ0 λ1 · · · λn−1
πn = π0 .
µ1 µ2 · · · µ n
λ0 λ1 ···λn−1
Now, letting α0 = 1 and αn = µ1 µ2 ···µn ,
∞ ∞
X X 1
πn = αn π0 = 1 ⇒ π0 = P∞ .
n=0 n=0 α
n=0 n
P∞
We know that these equations make sense if and only if n=0 αn < ∞, which means the steady-
state probabilities exist if and only if this condition hold. Hence, the CTMC is positive recurrent if
and only if this sum converges! In that case the general solution is
αn
πn = P∞ .
n=0
αn
65
We can also have a birth-death process on a finite state space S = {0, 1, . . . , M }, with λ M = 0.
The solution always exist for positive rates and it is simply
αn
πn = P M .
α
n=0 n
As in the discrete time case, it is easier to solve the above exercise using the detailed balance
equations.
Theorem 4.5.4. If the detailed balance equations are satisfied, that is

X
πi qi j = π j q ji for all i 6= j, and πi = 1
i
then also the global balance equations are satisfied, so π is the stationary distribution.
Proof. We sum the first equation over all j

X X
πi qi j = π j q ji
j j
qi j = 0, we get the first global balance equation

P
and since j
X
0= π j q ji
j
πi = 1 was already assumed.

P
and i
Example 4.5.5. Birth-Death Process again. We solve Exercise 4.5.3 again but with the detailed
balance equations. For each i ≥ 0 we have the detailed balance condition
πi λi = πi+1 µi+1
from which
λi
πi+1 = πi
µi+1
Applying this rule repeatedly we again arrive at
λ0 λ1 · · · λn−1
πn = π0 .
µ1 µ2 · · · µ n
and we get π0 from the normalization as before in Exercise 4.5.3. Since the chain has a tree
structure, we knew in advance that the detailed balance condition will work. But if we don’t know
this in advance, we can still try; if we find a solution, then we are lucky, since that’s the stationary
state as well thanks to Theorem 4.5.4.
66
Example 4.5.6. M/M/1 Queue. We know that M/M/1 queue is a specific type of birth
n and death
λ
process, where λn = λ and µn = µ for all n. Hence, we can conclude that αn = µ . The CTMC
is positive recurrent if and only if
∞ ∞ n
X X λ λ
αn < ∞ ⇔ <∞⇔ < 1.
n=0 n=0
µ µ
By defining ρ = λµ , if ρ < 1 then πn = (1 − ρ)ρ n .
Heuristic Explanation of Results for M/M/1: The best way to interpret the equations
is to consider a numerical example. Suppose our time unit is hours and µ = 4/hour. This
essentially means the expected service time is 15 minutes per customer.
Now suppose that λ = 5, which means λ/µ = 5/4 > 1. We know that this means system
is not positive recurrent. Let’s see why. The incoming customers arriving in 1 hour demands
a total of 5/4 hours of processing time on expected. That is, on expected, there is 15 minutes
that we are not able to serve in an hour and this work stays in the system and accumulate. This
means in the long run the queue blows up.
If λ = 3, then we see that the CTMC is positive recurrent which means the system is stable.
On expected, 45 minutes of work (3 × 1/4 hours) arrives per hour. This means that we are
utilizing 3/4 of our capacity. Hence, we can interpret ρ = λ/µ as utilization. Similarly, 1 − ρ
should be the idle time. Since we know that the system is idle only when there are no customers
in the system, this explains why π0 = 1 − ρ intuitively.
Example 4.5.7. Barbershop. A barber finishes haircuts at rate 3, measured in hours, so on

average it takes him 20 minutes to cut a person’s hair. Customers arrive at rate 2. There is,
however, only a two chair waiting room. When an arriving customer sees that both waiting room
chairs are taken, he leaves. What fraction of customers leave without a haircut in the long run?
A state describes the total number of customers in the shop, so the state space is S = {0, 1, 2, 3}.
The detailed balance equations are
2π0 = 3π1 , 2π1 = 3π2 , 2π2 = 3π3
The solution is
2 2 4 8
π1 = π0 , π2 = π1 = π0 , π3 = π0
3 3 9 27
and we get π0 from the normalization
2 4 8 65

1 = π0 + π1 + π2 + π3 = π0 1 + + + = π0
3 9 27 27
that is π0 = 27/65, and all of them are
27 18 12 8
π0 = , π1 = , π2 = , π3 = .
65 65 65 65
Hence in the stationary state a customer finds the shop full with probability π3 = 8/65, so 8/65
fraction of customers leave without a haircut. Note that in this example we just recovered a special
case of the general formula for a finite state space birth and death chain from Example 4.5.5.
67
4.6 First Passage Properties
Here we generalize the problem of first passage probabilities and times we discussed for the
discrete case in Section 2.11.
First we notice that the probabilities of visiting certain states before other states is the same
for a CTMC and its embedded DTMC. Hence the first passage probabilities can be calculated
from the embedded DTMC using the method of Section 2.11.
We can also work directly with the generator Q. For this we define the first passage time
to reach a set A ⊂ S as T̂A = min{t ≥ 0 : X t ∈ A}. Then we can state a theorem parallel to
Theorem 2.11.1 as follows.
Theorem 4.6.1. Let A, B ⊂ S, such that P(min{ T̂A, T̂B } < ∞|X 0 = i) = 1 for all i. Then the
probability hi = P( T̂A < T̂B |X 0 = i) of reaching set A before set B when starting from state i
satisfies
hi = 0 if i ∈ B
h
Pi = 1 if i ∈ A
j∈S qi j h j = 0 if i ∈ S − (A ∪ B)
Proof. For i ∈ A∪ B the statement is obvious. For i ∈ S − (A∪ B) we use the discrete time result
from Theorem 2.11.1 for the embedded chain with transition probabilities P̃, that is
X X qi j
hi = p̃i j h j = hj
j6=i j6=i
qi
by noting that the embedded chain never stays in the same state. If we multiply both sides by
qi = −qii we obtain X
−qii hi = qi j h j
j6=i
and by rearranging we complete the proof.
For the mean first passage times we can also rely on the embedded chain, but we need to
combine it with the mean time spent in each state. It is again simpler to work with Q directly.
Theorem 4.6.2. Let A ⊂ S, such that P( T̂A < ∞|X 0 = i) = 1 for all i. Then the mean time
g i = E( T̂A|X 0 = i) to reach set A when starting from state i satisfies
gi = 0 if i ∈ A
j∈S qi j g j = −1 if i ∈ S − A
P
Proof. Since the i ∈ A case is obvious, we assume that i ∈ S − A. Conditioning on the first step,
we can use the discrete time result from Theorem 2.11.2 by noting that the mean time spent
in state i before the first step is 1/qi
1 X 1 X qi j
gi = + p̃i j g j = + gj
qi j6=i
qi j6=i
qi
68
We again multiply both sides by qi to get
X
−qii g i = 1 + qi j g j
j6=i
and by rearranging we complete the proof.
Example 4.6.3. Barbershop. Let us revisit Example 4.5.7. The generator is
−2 2 0 0
 
 3 −5 2 0
Q=
0 3 −5 2 
0 0 3 −3
What is the probability hi that the shop becomes empty before it becomes full, given we have i
customers now? We know that h0 = 1 and h3 = 0. From Theorem 4.6 we also know that
3h0 −5h1 +2h2 =0

3h1 −5h2 +2h3 = 0
Note that we would obtain the exact same equations for the embedded chain (check!). These
equations can be solved immediately
15 9
h1 = h2 =
19 19
We got h1 > h2 , which agrees with intuition.
How long do we have to wait on average, g i , until the shop becomes empty if we have i cus-
tomers at the moment? We know that g0 = 0, and from Theorem 4.6.2 for the other three
probabilities for i = 1, 2, 3 we have
3g0 −5g1 +2g2 = −1

3g1 −5g2 +2g3 = −1
3g2 −3g3 = −1
which we can solve to obtain

19 34 43
g1 = , g2 = , g3 = .
27 27 27
The waiting times are monotone growing with i, which agrees with intuition. Also, for example g3
means that even if we have 3 customers, the shop will be empty within 2 hours on average. Note
that this result is very different from what we could have obtained for the embedded chain!
4.7 Costs and Rewards

Even though it is possible to consider the discounted cost case with a similar approach as in
DTMCs, in this section our main goal will be to analyze the long run average cost.
69
Suppose we are incurring a cost of c(i) units for each time unit we spend at state i. Here
important thing is to see that c(i) is cost per time unit. We define the long run average cost
per unit time Z T
1

ψi = lim E c(X t )d t X 0 = i

T →∞ T
0
Theorem 4.7.1. For an irreducible positive recurrent chain ψi = π j c( j).

P
j∈S
Proof. The proof is analogous to the discrete case proved in Section 2.12.1. First note that
X
E(c(X t )|X 0 = i) = pi j (t)c( j)
j∈S
and since lim t→∞ pi j (t) = π j , we find that

Z T
X 1 X
ψi = c( j) lim pi j (t)d t = π j c( j).
T →∞ T
j∈S 0 j∈S
The interchange of limits can be justified even for infinite state spaces due to positive recur-
rence.
Example 4.7.2. Single Machine-Single Repairman. Now suppose that our machine produces
n items per hour, and each item can be sold for c pounds. When the machine is down, we pay the
repairman r pounds per hour to repair the machine. This means when the machine is down we
are incurring a cost of r pounds per hour and when the machine is up, we are gaining nc pounds
per hour. Hence, the long run average gain is
ncλ rµ
ncπ1 − rπ0 = −
λ+µ λ+µ
Example 4.7.3. M/M/1 Expected Queue Length. To calculate the expected queue length, note
that when there are n people in the system, we can think as if we are incurring a cost of n, that is
c(n) = n. Since from Exercise 4.5.6 we know that the stationary distribution is πn = ρ n (1 − ρ)
where ρ = λ/µ, the expected queue length is the mean of a geometric variable minus one, with
success probability 1 − ρ, that is
X 1 ρ λ
ψi = nπn = −1= = .
n∈S
1−ρ 1−ρ µ−λ
We see that it is finite and positive for µ > λ, and goes to infinity as µ → λ.
THE END
70
Appendix A
Review of Probability
Here we review the basics of probability theory we learnt as year two undergrads.
A.1 Axioms of Probability

The sample space Ω is the set of all possible outcomes of an experiment. An event is a
subset of the sample space. In the experiment of tossing a coin twice, the sample space is
Ω = {hh, ht, th, t t}. Events are subsets of Ω. For example an event can be that you tossed the
same face of the coin twice A = {hh, t t}.
For events A and B, we have the union A ∪ B, the intersection A ∩ B (sometimes shortened
to AB), and the complement Ac . Two events are disjoint (or mutually exclusive) if A ∩ B = ;,
where ; is the empty set (an impossible event). The DeMorgan laws are
(∪Ai )c = ∩Aci (∩Ai )c = ∪Aci .
Each event A has a probability P(A) ∈ [0, 1] such that P(Ω) = 1, and that for disjoint events
Ai ∞
[
∞
X
P Ai = P(Ai ).
i=1 i=1
The probability of the complement is P(A ) = 1 − P(A).

c
The inclusion–exclusion formula is P(A ∪ B) = P(A) + P(B) − P(A ∩ B), or in general

n
X X
P(∪ni=1 Ai ) = (−1) r+1 P(Ei1 · · · Eir ).
r=1 i1 <···<i r
If all outcomes are equally likely in a finite sample space, then P(A) = |A|/|Ω|.
A.2 Conditional Probability and Independence

The probability of event A conditioned on event B is
P(A ∩ B)
P(A|B) =
P(B)
71
if P(B) 6= 0. It follows that P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A). The multiplication rule is
P(A1 ∩ · · · ∩ An ) = P(A1 )P(A2 |A1 )P(A3 |A1 ∩ A2 ) · · · P(An |A1 ∩ · · · ∩ An−1 ).
Total probability rule: P(A) = P(A|B)P(B) + P(A|B c )P(B c ). More generally, if Bi for i =
1, . . . , k partition the sample space (i.e. ∪Bi = Ω and Bi ∩ B j = ; for all i 6= j), then
k
X
P(A) = P(A|Bi )P(Bi ).
i=1
If we combine this with the definition of conditional probability then we obtain the Bayes’s rule
P(A|Bi )P(Bi )
P(Bi |A) = Pk .
j=1
P(A|B j )P(B j )
Events A and B are independent if P(A ∩ B) = P(A)P(B). This condition is equivalent to

P(A|B) = P(A) and to P(B|A) = P(B). Events Ai for i = 1, . . . n are independent if for any subset
of them
P(Ai1 ∩ · · · ∩ Aik ) = P(Ai1 ) · · · P(Aik ).
Conditional probability P(A|B) is also a probability, in the sense that it satisfies all axioms
of probability. Consequently all rules derived for probability apply, e.g. the total probability
rule conditioned on an event C becomes
P(A|C) = P(A|B ∩ C)P(B|C) + P(A|B c ∩ C)P(B c |C)
Note that two conditions can be written as P(A|B|C) = P(A|B ∩ C)
A.3 Random Variables

A random variable (RV) is a random number which depends on the outcome of an experi-
ment. More precisely a random variable is a function X : Ω → R. Its (cumulative) distribution
(function) is
F (x) = FX (x) = P(X ≤ x).
F (x) is non-decreasing, right-continuous, lim x→−∞ F (x) = 0, lim x→∞ F (x) = 1. Also
P(a < X ≤ b) = P(X ≤ b) − P(X ≤ a) = F (b) − F (a)
and
P(X = x) = P(X ≤ x) − lim− P(X ≤ y) = F (x) − lim− F ( y).
y→x y→x
Discrete RVs: A random variable X is discrete if it can take only a countable number of
possible values. The (probability) mass function of X is
p(x) = P(X = x)
72
the expected (or mean) value of X is
X
EX = x P(X = x)
the expected value of a function of X is

X
E g(X ) = g(x)P(X = x)
where the summation goes over all possible values of X . The variance of X is
VarX = E(X − EX )2 = EX 2 − (EX )2 .
Most important discrete RVs, also expressed as sums of independent RVs:

• Bernoulli (p), P(X = 1) = p, P(X = 0) = 1 − p, EX = p, VarX = p(1 − p).
Pn n

• binomial (n, p) = i=1 Bernoulli (p), P(X = i) = i p i (1 − p)n−i , EX = np, VarX =
np(1 − p).
• Poisson (λ), which is the n → ∞ limit of a Binomial (n, p) with λ = np, P(X = n) =
e−λ λn /n!, EX = λ, VarX = λ.
• geometric (p): # independent Bernoulli (p) trials till first success, P(X = n) = (1 −
p)n−1 p, EX = 1/p, VarX = (1 − p)/p2 .
Pr
• negative binomial (r, p) = i=1 geometric (p), # independent Bernoulli (p) trials till the
n−1

r-th success, P(X = n) = r−1 (1 − p)n−r p r , EX = r/p, VarX = r(1 − p)/p2 .
• hypergeometric (N , m, n): # red balls

drawn
N when we draw n balls out of m red balls
m N −m
and N − m blue balls, P(X = i) = i n−i / n , EX = nm/N , VarX = np(1 − p)[1 − (n −
1)/(N − 1)], with p = m/N .
A.4 Continuous random variables

A RV is continuous if its distribution function is continuous, that is if it has a density function
f : R → R such that for any B ⊂ R
Z
P(X ∈ B) = f (x)d x.
B
The distribution of X is
Z x
F (x) = FX (x) = P(X ≤ x) = f (x)d x

−∞
and hence where F (x) is differentiable, f (x) = d

dx F (x). The density is normalized to one
Z∞
FX (∞) = P(X ≤ ∞) = f (x)d x = 1,
−∞
73
also Z b
P(a < X ≤ b) = f (x)d x

a
and for any a ∈ R Z a
P(X = a) = f (x)d x = 0.
a
The expected value of X is Z ∞
EX = x f (x)d x
−∞
and the expected value of a function of X is

Z ∞
E g(X ) = g(x) f (x)d x.

−∞
The variance is the same as for discrete RVs.

Again, conditional probability is also a probability, so conditioned on event C, X continuous
RV has a conditional density function f (·|C) : R → R such that for any B ⊂ R
Z
P(X ∈ B|C) = f (x|C)d x.
B
Most important continuous RVs, also expressed as sums of independent RVs:
• uniform on [a,b], f (x) = 1/(b − a) for a < x < b, and 0 otherwise. EX = (b + a)/2,
VarX = (b − a)2 /12.
2 2
• normal (µ, σ2 ), f (x) = p
1
2πσ2
e−(x−µ) /2σ , EX = µ, VarX = σ2 .
2
• standard normal (0,1), f (x) = p12π e−x /2 , EX = 0, VarX = 1. Its distribution is F (x) =
Φ(x) is given in tables. Also, µ + σX is normal (µ, σ2 ).
• exponential (λ), f (x) = λe−λx for x > 0, and 0 otherwise, EX = 1/λ, VarX = 1/λ2 . The
exponential random variable is memoryless, that is P(X > t + s|X > t) = P(X > s).
Pn
• gamma (n, λ) = i=1 exponential (λ), f (x) = λe−λx (λx)n−1 /(n − 1)! for x > 0, and 0
otherwise, EX = n/λ, VarX = n/λ2 .
For both discrete and continuous RVs: E(b + aX ) = b + aEX , and Var(b + aX ) = a2 Var(X ).
A.5 Jointly Distributed RVs

The joint distribution of X and Y RVs is
F (a, b) = F (X ≤ a, Y ≤ b)
74
and the marginal distributions are FX (a) = P(X ≤ a) = P(X ≤ a, Y ≤ ∞) = F (a, ∞), and
FY (b) = F (∞, b). X and Y are independent if for all A, B ⊂ R, P(X ∈ A, Y ∈ B) = P(X ∈
A)P(Y ∈ B).
joint mass function p(x, y) = P(X = x,
For discrete RVs there is a P PY = y), with marginal
masses pX (x) = P(X = x) = y p(x, y), and pY ( y) = P(Y = y) = x p(x, y). X and Y are
independent if p(x, y) = pX (x)pY ( y).
For continuous RVs there is a joint density function f (x, y), such that for any C ⊂ R2
ZZ
P((X , Y ) ∈ C) = f (x, y)d x d y.
C
R∞ R∞
The marginal densities are f X (x) = −∞ f (x, y)d y, and f Y ( y) = −∞ f (x, y)d x. X and Y are
if f (x, y)
independent P P= f X (x) f Y ( y).
Always E X
Pi = EX i , and for independent X i variables also E(X 1 · · · X n ) = EX 1 · · · EX n
and Var X i = VarX i .
P
The mass function of the sum of independent discrete variables is

X
P(X + Y = a) = P(X = x)P(Y = a − x)
x
The density function of the sum of independent continuous variables is

Z∞
f X +Y (a) = f X (a − y) f Y ( y)d y.
−∞
In particular
• normal (µ1 , σ12 ) + normal (µ2 , σ22 ) = normal (µ1 + µ2 , σ12 + σ22 ),
Pn
• i=1
exponential (λ) = gamma(n, λ),
Pn
• i=1
Bernoulli (p) = binomial (n, p),
Pn
• i=1
geometric (p) = negative binomial (n, p),
• Poisson (λ1 ) + Poisson (λ2 ) = Poisson (λ1 + λ2 ).
75

SMo Notes1

Uploaded by

SMo Notes1

Uploaded by

Stochastic Modelling

• R. Durrett: Essentials of Stochastic Processes, Springer, 2012.

A much more extensive discussion can be found in

• V. Kulkarni: Modeling and Analysis of Stochastic Systems, CRC Press, 2010.

• S. Ross: A First Course in Probability, Pearson, 2013.

• R.B. Ash: Basic Probability Theory, Dover, 2008.

• R.B. Ash, C. A. Doléans-Dade: Probability and Measure Theory, Dover, 2008.

• R. Durrett: Probability: Theory and Examples, Cambridge University Press in 2010.

Please report mistakes to tibor.antal@ed.ac.uk. You can find me in my office at JCMB

Edinburgh, January 10, 2019 Tibor Antal

2 Discrete Time Markov Chains 9

4 Continuous Time Markov Chains 51

Appendix A Review of Probability 71

1.1 Conditional Probability

P(A) = P(A ∩ Ω) = P(A ∩ (B ∪ B c ))

Proof. Repeat the steps in (1.1) for general n.

P(A) = P(A|B1 )P(B1 ) + P(A|B2 )P(B2 ) + P(A|B3 )P(B3 )

The next example is an application of the second case of Theorem 1.1.8.

1.2 Conditional Expectation

E(X |Y = y) = x∈S x p(x|Y = y)

E(X 1 + X 2 |X 1 = 0) = E(X 1 |X 1 = 0) + E(X 2 |X 1 = 0) = 0 + EX 2 = 1/2

or written out in detail

y∈S E(X |Y = y)pY ( y)

Proof. The proof for the discrete case is quite short

Var(X |Y = y) = E(X 2 |Y = y) − (E(X |Y = y))2 .

Similarly, we treat Var(X |Y ) as a random variable. Then

1.3 Stochastic Processes

Discrete Time Markov Chains

2.1 Basic Definitions

P(X n+1 = j|X n = i, X n−1 = in−1 , · · · , X 0 = i0 ) = P(X n+1 = j|X n = i)

2.2 Modelling Examples

Figure 2.1: State diagram for the basic weather model

pi,i−1 = 1 − p and pi,i+1 = p if 1 ≤ i ≤ N − 1,

Figure 2.2: State diagram for gambler’s ruin problem for N = 4

and all the other probabilities are 0.

• Since no deaths occur, X n+1 ≥ i

• Since at most i divisions can occur X n+1 ≤ 2i

Hence, we can write the transition probabilities as

I n : Number of infected individuals at period n.

The other cases have 0 probability.

Table 2.1: Probabilities for Weather Forecasting-2

2.3 More Complicated Examples

Figure 2.3: State diagram for Weather Forecast-2

X n = (i1 , i2 ) if weather condition an day n − 1 is i1 and on day n is i2 .

S = {(s, s), (s, r), (r, s), (r, r)}

and transition probabilities are given on the figure.

for b, w ≥ 1, and all other transition probabilities are zero.

2.4 Chapman-Kolmogorov Equations

Putting this back to the sum we get the following.

Theorem 2.4.1. Chapman-Kolmogorov Equations.

Corollary 2.4.2. If P n is defined to be the nth power of matrix P, then P (n) = P n .

P (n+1) = P (n) P = P n P = P n+1 .

P(X 7 = 0, X 6 = 0|X 1 = 1) = P(X 7 = 0|X 6 = 0, X 1 = 1)P(X 6 = 0|X 1 = 1) =

= P(X 7 = 0|X 6 = 0)P(X 6 = 0|X 1 = 1) = P(X 1 = 0|X 0 = 0)P(X 5 = 0|X 0 = 1)

Now we can express the result in terms of powers of P as

where for any matrix A by [A]i, j we mean the ai, j element.

P(X 2 = 1) = P(X 2 = 1|X 0 = 0)P(X 0 = 0) + P(X 2 = 1|X 0 = 1)P(X 0 = 1)

In the weather example this means that

P(X 2 = i2 , X 5 = i5 , X 10 = i10 ) = P(X 10 = i10 |X 2 = i2 , X 5 = i5 )P(X 2 = i2 , X 5 = i5 )

= [P 5 ]i5 i10 [P 3 ]i2 i5 [a(0) P 2 ]i2

After all these examples we come up with the following theorem.

P(X n1 = i1 , X n2 = i2 , · · · X nk = ik ) = [P nk −nk−1 ]ik−1 ,ik · · · [P n2 −n1 ]i1 ,i2 [a0 P n1 ]i1

using the reasoning above.

2.5 Classification of States

Definition 2.5.1. Accessibility. A state j is said to be accessible from state i, denoted i → j, if

Theorem 2.5.3. Communication is an equivalence relation, that is

1. i ↔ i for all i ∈ S (reflexive)

3. If i ↔ j and j ↔ k then i ↔ k (transitive).

and i → k. We can prove k → i similarly and hence transitivity follows.

2.6 Transience and Recurrence

1 − %ii = P(Ti = ∞|X 0 = i) ≥ P(Ti = ∞, X n = j|X 0 = i)

hence %ii < 1, and i is transient.

There are several simple Corollaries of the above Theorem.

Corollary 2.6.4. (a) If i → j and i is recurrent then % ji = 1.

(b) If i → j and i is recurrent then j is also recurrent.

(c) If i → j and j is transient then i is transient too.