Learning Probabilistic Networks
Paul J Krause
PHILIPS RESEARCH LABORATORIES
CROSSOAK LANE
REDHILL, SURREY RH1 5HA
UNITED KINGDOM
Abstract
A probabilistic network is a graphical model that encodes probabilistic relationships
between variables of interest. Such a model records qualitative influences between var-
iables in addition to the numerical parameters of the probability distribution. As such it
provides an ideal form for combining prior knowledge, which might be limited solely
to experience of the influences between some of the variables of interest, and data. In
this paper, we first show how data can be used to revise initial estimates of the parame-
ters of a model. We then progress to showing how the structure of the model can be
revised as data is obtained. Techniques for learning with incomplete data are also cov-
ered. In order to make the paper as self contained as possible, we start with an introduc-
tion to probability theory and probabilistic graphical models. The paper concludes with
a short discussion on how these techniques can be applied to the problem of learning
causal relationships between variables in a domain of interest.
1. Motivation
Two well-cited tutorials on learning probabilistic networks are extant. The first, by Buntine [11], is primarily an annotated bibliography. The second, by Heckerman (available in its most extended form as [37]), is quite technical. There does seem to be a need for a more accessible introductory paper. This paper aims to provide a broad survey of the field in a limited space, and as such has had to strike a balance between technical content and accessibility. The
primary goal has been to produce a paper which will provide a reader with an introduction to
the approaches that are being taken and the main issues arising: the references cited will need
to be consulted in order to obtain full technical details.
A personal acknowledgement is also needed. This has been brought to the beginning of the
paper in order to be clear about the lineage of this tutorial from the outset. This paper initially
started as a joint effort between myself and David Heckerman. Because of this I have drawn on
David’s tutorials [36] and [37] at several points. In addition David has provided extensive com-
ments on three early versions. However, at several points in this paper I put forward my own
perspective of subjective probabilities as “expert judgements” of a true, physical, probability,
rather than as a degree of belief. Whilst David encouraged me to air this viewpoint, it is not one
to which he personally subscribes. In addition, as things turned out I essentially ended up writ-
ing the entire paper; although the topics of several sections are the same as David’s tutorials,
the content, wording and presentation style are very different. For these reasons, we agreed that
I should be sole author. I mention this at some length because merely to “thank David Heckerman for helpful comments on earlier drafts” would be a poor acknowledgement of his help and support; I simply could not have produced a paper that was worthy of submission without this help.
Good discussions of subjective probabilities as degrees of belief can be found in, for example,
[7], [36] and [37], and again I emphasise that this paper is necessarily but a first step on the
road to understanding this subject. Look upon it as an exercise in scientific journalism.
3. Introduction to Probability
abstraction of a cancer patient, a geometric point, a chance outcome.
A sample space, or universe, is the set of all possible sample points in a situation of interest. It
is usual to use Ω to designate a specific sample space. The sample points in a sample space
must be mutually exclusive and collectively exhaustive.
A probability measure, p(⋅), is a function on subsets of a sample space Ω. These subsets are
called events. We can refer to the values of p(A), p(A ∪ B), p(Ω) as the probabilities of the
respective events (for A, B ⊆ Ω). But the function p(⋅) is a measure with the following proper-
ties:
Definition:
A probability measure on a sample space Ω is a function mapping subsets of Ω to the interval
[0, 1] such that:
1. For each A ⊆ Ω, p(A) ≥ 0.
2. p(Ω) = 1
3. For any countably infinite collection of disjoint subsets of Ω, Ak (k = 1, 2, …):
p( ∪k Ak ) = ∑k p(Ak)
In general, we will need to check that the sets (events) themselves satisfy certain properties to
ensure that they are measurable. More details on this can be obtained, for example, from [14],
[95].
Sometimes we can estimate the probabilities we want by counting (the ratio of the number of
cancer patients cured to the total number treated, for example) or some other form of direct
measurement. In this sense, we are saying that probability is an attribute or a property of the
real world. The term physical probability is often used to denote this interpretation of probabil-
ity [32].
gold standard by which the probability estimates may be revised in the light of experience.
This is discussed clearly and at a fundamental level in Bernardo and Smith, [7], Chapter 4. It is
also a theme expanded on through much of this paper.
We should be clear at this stage that our interest is in using whatever sources of information are
available to us to produce a model of the real world. We are not interested in modelling the
expert. Nevertheless, expert judgement can be a useful starting point for the ultimate derivation
of physical probabilities.
Although often written in the form p(A, B) = p(A | B)p(B) and referred to as the “product rule”,
this is in fact the simplest form of Bayes’ Theorem. It is important to realise that this form of
the rule is not, as often stated, a definition. Rather, it is a theorem derivable from simpler
assumptions.
Bayes’ Theorem can easily be rewritten in a form which tells us how to obtain a posterior prob-
ability in a hypothesis A after observation of some evidence B, given the prior probability in A
and the likelihood of observing B were A to be the case:
p(A | B) = p(B | A) p(A) / p(B)
This simple formula has immense practical importance in a domain such as diagnosis. It is
often easier to elicit the probability, for example, of observing a symptom given a disease than
that of a disease given a symptom. Yet, operationally it is usually the latter which is required.
In its general form, Bayes’ Theorem is stated as follows.
Proposition
Suppose ∪n An = Ω is a partition of a sample space into disjoint sets. Then:

p(An | B) = p(B | An) p(An) / ∑n p(B | An) p(An)
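As a small concrete sketch of this general form in code (the diseases, priors and likelihoods below are invented purely for illustration):

```python
# Posterior over a partition {A_n} of the sample space after observing
# evidence B, using Bayes' Theorem in its general form.
priors = {"flu": 0.10, "cold": 0.25, "healthy": 0.65}       # p(A_n)
likelihoods = {"flu": 0.90, "cold": 0.60, "healthy": 0.05}  # p(B | A_n), B = "fever"

evidence = sum(likelihoods[a] * priors[a] for a in priors)  # the denominator, p(B)
posteriors = {a: likelihoods[a] * priors[a] / evidence for a in priors}

print(posteriors)  # p(flu | fever) is roughly 0.33: the evidence raised it from 0.10
```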
It is important to appreciate that Bayes’ Theorem is as applicable at the ‘meta-level’ as it is at
the domain level. It can be used to handle the case where the hypothesis is a proposition in the
knowledge domain (a specific disease, perhaps) and the evidence is observation of some condi-
tion (perhaps a symptom). However, it can also handle the case where a hypothesis is that a
parameter in a knowledge model has a certain value (or distribution of values) or that the
model has a certain structure, and the evidence is some incoming case data.
1. The term “random variable” is often used. Here we reserve the term “random variable” for the situation
where there are repeated observations, which is not strictly the case for the variables used in probabilis-
tic networks as discussed in this paper.
Figure 1: The variables of interest in the example:
A: visit to Asia
B: Bronchitis
D: Dyspnoea
L: Lung cancer
S: Smoking
T: Tuberculosis
X: positive X-ray
4. Graph theory
Very many problem domains can be structured through using a graphical representation.
Essentially, one identifies the concepts or items of information which are relevant to the prob-
lem at hand (nodes in a graph), and then makes explicit the influences between concepts. This
section introduces some of the terminology associated with the use of graphs.
At its most abstract, a graph G is simply a collection of vertices V and edges E between verti-
ces: G = (V, E). We can associate a graph G with a set of variables U = {X1, X2, …, Xn} by
establishing a one to one relationship between the nodes in the graph and the variables in U.
One might, for example, label the nodes from 1 to n, with nodes being associated with the
appropriately subscripted variable in U. An edge e(i,j) might be directed from node i to node j.
In this case, the edge e(j,i) cannot simultaneously belong to E and we say that node i is a parent
of its child node j. If both e(i,j) and e(j,i) belong to E, we say the edge is undirected1.
For example, the graph shown in figure 2 is associated with the set of variables introduced in
1. There are graphs in which there can be directed arcs from i to j and from j to i. However, we do not con-
sider them in this paper.
Figure 2: A graph associated with the variable set of figure 1. An extra node (Either T or L) has been introduced in order to simplify the graph. The example is due to Lauritzen and Spiegelhalter [53].
figure 1. Here, one additional node, E(ither T or L), has been introduced in order to simplify the graph. Note that in this graph all the edges are directed, and there are no nodes from which it is possible to follow a sequence of directed edges and return to the starting point (no directed cycles).
A graph which contains only directed edges is known as a directed graph. Those graphs which
contain no directed cycles have been particularly studied in the context of probabilistic expert
systems. These are referred to as directed acyclic graphs (DAGs).
A graph which contains only undirected edges is known as an undirected graph. These have
not received so much attention in the expert systems literature, but are of importance in statisti-
cal modelling. Graphs with a mixture of directed and undirected edges are referred to as mixed
graphs (e.g. [72]).
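To make the terminology concrete, here is one way the graph of figure 2 might be encoded, mapping each node to its set of parents, together with a simple acyclicity check (a sketch; this particular encoding is an illustrative choice, not something prescribed by the text):

```python
# The "visit to Asia" DAG of figure 2: node -> set of parents.
parents = {
    "A": set(), "S": set(),          # visit to Asia, Smoking
    "T": {"A"}, "L": {"S"}, "B": {"S"},
    "E": {"T", "L"},                 # the extra "Either T or L" node
    "X": {"E"}, "D": {"E", "B"},     # positive X-ray, Dyspnoea
}

def is_dag(parents):
    """True if the directed graph contains no directed cycles (Kahn's algorithm)."""
    indegree = {v: len(ps) for v, ps in parents.items()}
    children = {v: [c for c, ps in parents.items() if v in ps] for v in parents}
    removable = [v for v, d in indegree.items() if d == 0]
    removed = 0
    while removable:
        v = removable.pop()
        removed += 1
        for c in children[v]:
            indegree[c] -= 1
            if indegree[c] == 0:
                removable.append(c)
    return removed == len(parents)   # every node peeled off => no directed cycle

print(is_dag(parents))  # True
```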
As mentioned in the opening to this section, the important point about a graphical representa-
tion of a set of variables is that the edges can be used to indicate relevance or influences
between variables. Absence of an edge between two variables, on the other hand, provides
some form of independence statement; nothing about the state of one variable can be inferred from the state of the other.
There is a direct relationship between the independence relationships that can be expressed
graphically and the independence relationships that can be defined in terms of probability dis-
tributions.
5. Independence
Figure 3: A DAG encoding the conditional independence of X and Y given Z (X ← Z → Y).

Figure 4: A DAG in which X (“rain”) and Y (“sprinkler on”) are marginally independent, but conditionally dependent given Z (“wet lawn”) (X → Z ← Y).
probability theory. It is this combination of qualitative information with the quantitative infor-
mation of the numerical parameters that makes probability theory so expressive. Detailed stud-
ies of conditional independence properties can be found in Dawid [21] and Pearl [67]. For
completeness, we include definitions of the basic notions here.
We shall use the notation introduced by Dawid. Let X and Y be variables. Then X ⊥⊥ Y denotes that X and Y are independent. The corresponding probabilistic expression of this is:

p(x, y) = p(x)p(y).

Now introduce a further variable Z. Then X ⊥⊥ Y | Z denotes that X is conditionally independent of Y given Z. One expression of this in terms of probability distributions is:

p(x, y | z) = p(x | z)p(y | z).
We can draw a directed acyclic graph that directly encodes this assertion of conditional inde-
pendence. This is shown in figure 3. A significant feature of the structure in figure 3 is that we
can now decompose the joint probability distribution for the variables X, Y and Z into the
product of terms involving at most two variables:
p(x, y, z) = p(x, y | z)p(z) = p(x | z)p(y | z)p(z).
As a concrete example, think of the variable Z as representing a disease such as measles. The
variables X and Y represent distinct symptoms; perhaps “red spots” and “Koplik’s spots”1
respectively. Then if we observe the disease (measles) as present, the probability of either
symptom being present is determined. Actual confirmation of one symptom being present will
not alter the probability of occurrence of the other.
A different scenario is illustrated in figure 4. Here X and Y are marginally independent, but
Figure 5: A serial connection, X → Z → Y (discussed in the third example below).
conditionally dependent given Z. This is best illustrated with another simple (often used)
example. Both “rain” (X) and “sprinkler on” (Y) may cause the lawn to become wet. Before
any observation of the lawn is made, the probability of rain and the probability of the sprinkler
being on are independent. However, once the lawn is observed to be wet, confirmation of it
raining may influence the probability of the sprinkler being on (they are, by and large, alterna-
tive causes). This is an example of “explaining away” (p. 447 [80]); the presence of one cause
making an alternative less likely. See [67] for further discussion of this.
Figure 4 provides the graphical representation of this situation. The probability distribution can
again be factorised. This time as:
p(x, y, z) = p(z | x, y)p(x)p(y)
Note that this is again making use of the (marginal) independence of X and Y (p(x, y) =
p(x)p(y)).
A third and final example completes the cases of interest. The disease X (“Kawasaki disease”)
is known to cause the pathological process Z (“myocardial ischemia”). This, in turn, has an
associated symptom Y (“chest pain”). Here, if through some additional test, myocardial
ischemia is diagnosed as present, observation of chest pain will have no further influence on
the probability of Kawasaki disease being the underlying cause of the expression of the patho-
logical process. That is, X and Y are again conditionally independent given Z. In this case, the
probability distribution factorises as:
p(x, y, z) = p(y| z)p(z| x)p(x)
A pattern is emerging. We can now provide a generalisation of what is happening.
Proposition
Let U = {X1, X2, …, Xn} have an associated graph G = (V, E), where G is a DAG. Then the
joint probability distribution p(U) admits a direct factorisation:
p(u) = ∏i=1…n p(xi | pa(xi))    eqn. 1.
Here pa(xi) denotes a value assignment for the parents of xi.
A slightly larger example will help to bring out the importance of this property. In figure 2 we
took the variables listed in figure 1 and generated a DAG which represents the influences
between these variables (one additional variable has been introduced to simplify the graph). A
factorisation of the probability distribution can now be written down directly using equation 1:
p(a, b, d, e, l, s, t, x) = p(a)p(s)p(t| a)p(l| s)p(b| s)p(e| l, t)p(d| e, b)p(x| e)
The net result is that the probability distribution for a large set of variables may be represented
by a product of conditional probability relationships between small clusters of semantically
related propositions. Now, instead of needing to elicit a joint probability distribution over a set
of complex events, the problem is broken down into the assessment of these conditional proba-
bilities as parameters of the graphical representation.
The population of these relationships with numerical parameters is now much more tractable.
It should now be clear that the elicitation of this qualitative graphical structure is of fundamen-
tal importance in easing the problem of eliciting the probability distribution for a knowledge
model.
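A minimal sketch of equation 1 at work, using the three-variable structure of figure 3 (the conditional probability values below are invented for illustration only):

```python
import itertools

# Structure X <- Z -> Y (figure 3): the joint factorises as p(x|z) p(y|z) p(z).
p_z = {True: 0.01, False: 0.99}      # p(Z), e.g. measles
p_x_z = {True: 0.80, False: 0.10}    # p(X=true | Z=z), e.g. red spots
p_y_z = {True: 0.90, False: 0.01}    # p(Y=true | Z=z), e.g. Koplik's spots

def joint(x, y, z):
    """Equation 1: multiply each variable's probability given its parents."""
    px = p_x_z[z] if x else 1 - p_x_z[z]
    py = p_y_z[z] if y else 1 - p_y_z[z]
    return px * py * p_z[z]

# Five free parameters (one for Z, two each for X and Y) replace the seven
# needed for a full joint over 2^3 states, and the joint still sums to 1.
print(sum(joint(*v) for v in itertools.product([True, False], repeat=3)))  # 1.0
```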
This is best illustrated by means of an example. Figure 6 illustrates a simple DAG which is a
composition of the earlier structures. Let X = {2} and Y = {3}. First of all, we will consider the
case where Z = {1}. d-separation is a little hard to apply when it is first encountered because it
is defined in terms of a negative condition (“there is no path such that …”). This simply means
that in establishing whether d-separation holds we must consider all possible paths between the
two sets of variables. In this case there are two: 2 ← 1 → 3, and 2 → 4 ← 3. In the first case,
the node in Z has no converging arrows; in the second the node with converging arrows and its
descendant are outside Z. That is, neither path satisfies the two conditions (they are said to be
blocked by Z) and X and Y are d-separated by Z = {1}. However, consider now Z' = {1, 5}.
Here a descendant of node 4, node 5, is in Z'. This potentially opens up a pathway between X
and Y; if we learn the value of 5, its causes, 2 and 3, will be rendered dependent. In this case
Figure 6: A simple DAG composed of the earlier structures, with paths 2 ← 1 → 3 and 2 → 4 ← 3, and node 5 a descendant of node 4.
path 2 → 4 ← 3 is said to be active, and X and Y are not d-separated by Z' = {1,5}.
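This check can be mechanised. The sketch below uses the ancestral-moral-graph criterion (equivalent to d-separation; see the discussion of the global directed Markov property below) and assumes the figure 6 graph has edges 1 → 2, 1 → 3, 2 → 4, 3 → 4 and 4 → 5, consistent with the paths just described:

```python
def d_separated(parents, X, Y, Z):
    """Does Z d-separate X from Y in a DAG given as {node: set of parents}?
    Uses the ancestral-moral-graph test, equivalent to path-based d-separation."""
    # 1. Restrict to the ancestors of X | Y | Z (including those nodes themselves).
    keep, frontier = set(), set(X) | set(Y) | set(Z)
    while frontier:
        v = frontier.pop()
        if v not in keep:
            keep.add(v)
            frontier |= parents[v]
    # 2. Moralise: link each node to its parents, marry co-parents, drop directions.
    adj = {v: set() for v in keep}
    for v in keep:
        for p in parents[v]:
            adj[v].add(p)
            adj[p].add(v)
        for p in parents[v]:
            for q in parents[v]:
                if p != q:
                    adj[p].add(q)
    # 3. Z d-separates X and Y iff every undirected path from X to Y hits Z.
    reached, frontier = set(), set(X) - set(Z)
    while frontier:
        v = frontier.pop()
        reached.add(v)
        frontier |= {w for w in adj[v] if w not in reached and w not in Z}
    return not (reached & set(Y))

# Figure 6, assuming edges 1->2, 1->3, 2->4, 3->4, 4->5:
g = {1: set(), 2: {1}, 3: {1}, 4: {2, 3}, 5: {4}}
print(d_separated(g, {2}, {3}, {1}))     # True: both paths are blocked by Z = {1}
print(d_separated(g, {2}, {3}, {1, 5}))  # False: observing 5 activates 2 -> 4 <- 3
```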
Some further worked examples of d-separation can be found in [45], and we recommend that
the reader examine additional examples to gain further understanding of d-separation. But the
general principle is that it provides a mechanism for reading off the dependencies and independencies implicit in a DAG. Furthermore, we have the following result due to Geiger and
Pearl [30]:
Theorem
For every DAG G, there exists a probability distribution p such that for each triple X, Y, Z of disjoint subsets of variables the following holds:

Z d-separates X and Y iff X ⊥⊥ Y | Z
This shows that no rule for detecting independence in a DAG can improve on d-separation in
terms of completeness.
Note that the equivalence cannot be taken in the other direction; there are some probability dis-
tributions with independence properties that cannot be represented using a DAG. See [87] for
some examples of this, and [27], [104] for more general discussions on modelling probability
distributions using graphical representations.
We should mention that the literature contains a number of alternative criteria for assessing
independence. One overview of these can be found in Lauritzen et al. [54], which contains a
detailed study of the independence properties of factorisable distributions. They show that for a
directed acyclic graph G and probability measure p (subject to certain constraints), the follow-
ing properties are equivalent:
• p admits a recursive factorisation according to G;
• p obeys the global directed Markov property, relative to G;
• p obeys the local directed Markov property, relative to G.
The local directed Markov property is easily expressed in words as: any variable is conditionally independent of its non-descendants, given its parents. In the case of the global directed
Markov property, two sets of variables A and B are conditionally independent given a third, S,
if S separates A and B in graph G. Here, separates denotes that all paths between A and B pass
through S. However, we do not wish to dwell on these criteria here. The important point is to
emphasise that Lauritzen et al. also demonstrate an equivalence between the global directed
Markov property and Pearl’s d-separation. They then extend the generality of their results by
dropping reference to all features of probability other than conditional independence. All these
(equivalent) criteria can be thought of as generalising to an algebra of relevance, not just one
of probability. Further discussion of this in the specific context of d-separation can be found in,
e.g. [30], [70], [71].
Definition:
In order to fully specify the model, one needs a graph G, and the set of parameters Θ which
make up the factorised probability distribution. We shall refer to M = (G, Θ) as a graphical
model.
This certainly helps, but it is not a complete solution. In general, there will still be a large
number of parameters to elicit. What is more, given that the assignment of values to these
parameters will be based on varying amounts of information and experience, they will be of
varying precision and reliability. It should also be emphasised that there is nothing sacred
about the underlying graphical representation; there will almost always be an element of sub-
jectivity in its specification.
A number of questions immediately arise. If there is some imprecision in the values of the
parameters, can we update them in the light of experience? Equally, can we update the struc-
ture in the light of experience? Indeed, to remove any element of subjectivity, could we even
learn the structure and the parameters directly from statistical data? In the remainder of the
paper, we will first look at some aids to the knowledge elicitation problem, and then progress
to reviewing the work that has been carried out to date to address these last three questions.
5.5 Some pointers to further reading on decomposable models and factorisable
distributions
The primary focus of this paper is on exploiting the relationships between graphical represen-
tations and factorisable distributions for knowledge representation and elicitation. However,
these relationships have also been exploited to develop computationally efficient techniques
for revising probability distributions in the light of evidence. Basically the revision of the glo-
bal probability distribution is decomposed into a sequence of local computations by exploiting
the relevance properties implied by the model. Three general approaches are the arc reversal/
node reduction technique of Shachter [83], the message passing algorithm due to Pearl [66],
and the “clique tree” approach of Lauritzen and Spiegelhalter [53]. Shachter [84] demonstrated
that the arc reversal/node reduction technique, Pearl’s algorithm and the Lauritzen & Spiegel-
halter algorithm are essentially identical for any DAG.
The Lauritzen and Spiegelhalter algorithm was further developed by Jensen and others to form
the basis of the HUGIN expert system shell. Jensen et al. [47] extends earlier work restricted to
singly connected trees to cover multiply connected trees (in which there may be more than one
path between any two nodes) by introducing a secondary structure called a junction tree.
Jensen et al. [46] provides an accessible account of the algorithm used as the basis of the
HUGIN shell.
In a slightly different vein, Smyth et al. [87] reviews the applicability and utility of techniques
for graphical modelling to hidden Markov models (HMMs). Its particular merit in the context
of this paper is that it contains an up to date and self-contained review of the basic principles of
probabilistic independence networks. Short discussions of parameter estimation and model
selection techniques are also included.
Lauritzen [52] provides an up to date text on graphical models.
ables may be seen as making some state more likely, for example). This information is then
applied as constraints on the distribution hyperspace of all possible joint distributions over a
set of variables. Second-order probability distributions are then obtained over the parameters
(probabilities) to be assessed (provided the constraints are consistent). In general, the elicita-
tion of the second-order probability distribution on the parameters may be an iterative process.
The expert(s) may need to be “confronted” with any inconsistencies in the constraints in order
to refine the elicitation. This seems to be a very interesting approach, but at the time of writing
has yet to be tried out on a substantial problem [Druzdzel, pers. comm.].
All of this is to emphasise that eliciting probabilities directly from experts is still an active area of
research. Once elicited, one is not necessarily sure about their reliability. Learning techniques
can be used to critique probabilities used as parameters in a network, and revise them if neces-
sary.
1. A probability density function represents a probability distribution over a continuous set of outcomes.
Figure 7: (a) The two possible outcomes (“heads” and “tails”) from flips of a drawing pin (“thumb tack”). (b) A probability density function p(θ | ξ) for the parameter value θ, the long run fraction of heads associated with a drawing pin. Redrawn after [37].

Figure 8: The prior density for θ, multiplication by f(θ) = θ, and the renormalised posterior density (θ running from 0 to 1 in each panel).
That is, we obtain the posterior probability density for θ by multiplying its prior density by the
function f(θ) = θ and renormalising. This procedure is depicted graphically in Figure 8. As we
might expect, the resulting probability density function is slightly tighter and shifted slightly to
the right.
It is quite straightforward to see that in general1, if we observe h heads and t tails, then

p(θ | h heads, t tails, ξ) = c θ^h (1 − θ)^t p(θ | ξ)
That is, once we have assessed a prior density function for θ, all that is relevant in a random
1. Provided that: (1) observations are independent given Θ = θ, and (2) the probability p(X = heads | θ) is constant.
sample of events is the number of heads and the number of tails. h and t are said to be sufficient
statistics for this model; they provide a summarisation of the data that is sufficient to compute
the posterior from the prior.
So far, we have only considered the case where the outcome variable has two states. The results generalise quite straightforwardly to cover situations where the outcome variable has r > 2 states. In this case, we denote the probabilities of the outcomes ΘX = {θX=1, …, θX=r}, where each state is assumed possible so that θX=k > 0 for all 1 ≤ k ≤ r. In addition, we have ∑k θX=k = 1. If we know these probabilities, then the outcome xi of any event will be independent of the outcomes of all other events (Figure 9). Any database D = {x1, …, xm} of outcomes that satisfies these conditions is called a random sample from an (r−1)-dimensional multinomial distribution with parameters ΘX [32]. In the specific case where r = 2 the sequence is often referred to as a binomial sample.
Referring back to our earlier example, we have seen how the probability density function for θ
may be updated given a binomial sample and some prior density function for θ. There are no
formal constraints on the choice of the prior density function p(θ | ξ). However, in practice
there are certain advantages to using what is known as the beta distribution in the two state
case. A variable Θ is said to have a beta distribution with hyperparameters1 α, β when its prob-
ability density function is given by
p(θ | α, β) = c θ^(α−1) (1 − θ)^(β−1),  α, β > 0    eqn. 2.
This is the density function that was used in Figure 7(b), with α = 3 and β = 2. The first prop-
erty to notice is that on observing a heads in our drawing pin example, the prior distribution is
revised to become
p(θ | heads, α, β) = c θ^((α+1)−1) (1 − θ)^(β−1)
In general, after h heads and t tails, we will have:
p(θ | h heads, t tails, α, β) = c θ^(h+α−1) (1 − θ)^(t+β−1)    eqn. 3.
Clearly, the result after sampling remains a beta distribution.
The second property to notice is that the expectation with respect to this distribution has a sim-
ple form:
Figure 9: A random sample x1, x2, …, xn; given the parameters, each outcome is independent of all the others.
E_p(θ | α, β)(θ) = ∫ θ p(θ | α, β) dθ = α / (α + β)    eqn. 4.
Note that for clarity, E is subscripted with the distribution used for expectation. We may take this as the probability of observing a heads on the next flip, in the drawing pin example. In particular, with the prior distribution of Figure 7, E_p(θ | α, β)(θ) = 3/5. After observation of one heads, the revised density function on the right hand side of Figure 8 has expectation 4/6.
Note that we may think of the denominator of Equation 4 as an equivalent sample size. That is,
the prior has been assessed “as if” α + β observations had been made.
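The bookkeeping in equations 2 to 4 is easy to mechanise. A sketch for the drawing pin example, with the α = 3, β = 2 prior of Figure 7(b):

```python
from dataclasses import dataclass

@dataclass
class Beta:
    """A beta distribution with hyperparameters alpha, beta (eqn. 2)."""
    alpha: float
    beta: float

    def update(self, heads: int, tails: int) -> "Beta":
        # Eqn. 3: observing h heads and t tails just shifts the hyperparameters.
        return Beta(self.alpha + heads, self.beta + tails)

    @property
    def mean(self) -> float:
        # Eqn. 4: the probability of heads on the next flip.
        return self.alpha / (self.alpha + self.beta)

prior = Beta(alpha=3, beta=2)       # equivalent sample size 3 + 2 = 5
print(prior.mean)                   # 0.6, i.e. 3/5 as in the text
print(prior.update(1, 0).mean)      # 4/6 after observing a single heads
```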
The generalisation of the beta distribution to r > 2 states is known as the Dirichlet distribution
(with hyperparameters α1, …, αr). The above two properties generalise fully. In particular, the set of Dirichlet distributions is a conjugate family of distributions for sampling from a multinomial distribution: a Dirichlet prior, updated on multinomial data, remains a Dirichlet.
Further details of the Dirichlet distribution can be found in [102]. For simplicity we will refer
here to a Dirichlet distribution with hyperparameters α1, …, αr as
D[ α1, …, αr]
We now have the basic machinery needed to learn the probabilities for a network in the situa-
tion where a network structure S is assumed known and where all the variables are observable,
provided we make one additional assumption. Let Sh denote the hypothesis that the joint prob-
ability can be factored according to S, and V be the set of case variables in the network S.
Spiegelhalter and Lauritzen [88] introduced the idea of considering the conditional probabili-
ties of a network such as S as being generated by a set of parameters {θv | V ∈ V}. The param-
eters θv are uncertain, with some prior distribution p(θv | Sh). Then, given a random data sample
D = {x1, …, xN} we wish to update the overall parameterisation θ (whose components are θv)
given Sh. The additional simplifying assumption that is made in this situation is that of param-
eter independence. That is, the parameters θv are assumed a priori independent variables. In
this case, p(θ | Sh) = ∏v p(θv | Sh), and the joint distribution for the variables V and parameters θ, given Sh, can be written as:

p(V, θ | Sh) = ∏v p(V | pa(V), θv, Sh) p(θv | Sh)    eqn. 5.
From equation 5 it is clear that for any node V ∈ V, θv can be considered to be just another par-
ent of V in an extended network. Figure 10 gives an example of such a parameterised network
for the often used “visit to Asia” example.
In addition to parameter independence, we assume:
• No missing data;
• Conjugate priors.
If we can obtain data on each variable V ∈ V, then assuming that the parameters are independ-
ent, we can update each parameter θv independently using just the same techniques as we used
in the single variable case. In particular, if each parameter set has a Dirichlet prior distribution,
then a posteriori, the distribution remains Dirichlet.
Consider a node V with K states. For a specific parent configuration pa(V)*, the variable will be parameterised by θV = (θ1, …, θK), the probabilities of its K states, with a Dirichlet prior D[α1, …, αK].
Figure 10: Parameterisation of the “Visit to Asia” example to account for uncertainty in the values of the conditional probabilities for variables in the network; each node V gains its parameter set θV as an additional parent (after [88]).
Note that the lesson of this section can be quite simply summarised. That is, the case of learn-
ing parameters from complete data with known structure can be reduced to maintaining suffi-
cient statistics for certain distributions (the Dirichlet being one example).
This outlines the updating scheme for a single observation. Details of the generalisation to a
database of N observations can be found in [36]. Although this is quite straightforward, it is
often the case that certain variables in a network may be physically impossible or too expen-
sive to observe with certainty. Techniques to handle such cases are considered in the next sec-
tion.
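To summarise the scheme in code: with complete data, learning reduces to keeping one Dirichlet count vector per (node, parent configuration) pair. A minimal sketch, assuming a uniform D[1, …, 1] prior for every configuration (that prior choice is illustrative, not prescribed by the text):

```python
from collections import defaultdict

class NodeParameters:
    """The Dirichlet sufficient statistics for one node of the network."""
    def __init__(self, states, prior_count=1.0):
        self.states = list(states)
        # One hyperparameter vector per parent configuration, lazily created.
        self.counts = defaultdict(lambda: {s: prior_count for s in self.states})

    def observe(self, parent_config, state):
        # A complete observation increments a single sufficient statistic.
        self.counts[parent_config][state] += 1

    def predictive(self, parent_config, state):
        # Posterior mean: the Dirichlet generalisation of eqn. 4.
        row = self.counts[parent_config]
        return row[state] / sum(row.values())

# Node D (Dyspnoea) with parents E and B, as in figure 10.
theta_d = NodeParameters(states=["present", "absent"])
theta_d.observe(("E=1", "B=1"), "present")
theta_d.observe(("E=1", "B=1"), "present")
print(theta_d.predictive(("E=1", "B=1"), "present"))  # (1 + 2) / (2 + 2) = 0.75
```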
cannot be unintrusively observed. During routine use, the values of these variables could be too
expensive (in terms of cost or of risk to the patient) to observe on a regular basis. Analogous
situations can arise in most other domains, and so there is a need to extend the work of the pre-
vious section to cover the situation where a subset of the variables in the model are unobserva-
ble. For simplicity, we will restrict the discussion in this section to the case where the
availability or otherwise of an observation is independent of the actual states of the variables.
(Methods for addressing dependencies on omissions have been studied in, for example, [69],
[75] and [78]. We will come back to this in the next section).
The two variable case illustrates the general situation quite simply. We have the variables X1 and X2 with states {x1, x̄1} and {x2, x̄2} respectively. X2 is observed to be in state x2, whilst the state of X1 is unknown. Hence there are two possible completions of the database, {x1, x2} and {x̄1, x2}; one for each possible instantiation of the unknown X1. Then the posterior distribution for θ2|1 is obtained by summing over the possible states of X1 as follows:

p(θ2|1 | x2) = p(θ2|1 | x1, x2)p(x1 | x2) + p(θ2|1 | x̄1, x2)p(x̄1 | x2)

In this case, the terms p(θ2|1 | x1, x2) and p(θ2|1 | x̄1, x2) would be Beta distributions, and the resulting posterior distribution is a “mixed distribution” with mixing coefficients p(x1 | x2) and p(x̄1 | x2).
In general (see, e.g. [36], [88]), the posterior distribution for each of the parameters θv will be a
linear combination of Dirichlet distributions, or a Dirichlet mixture, with mixing coefficients
that will need to be computed from the prior distributions. What one is doing is computing the
joint posterior distribution given each possible completion of the database, and then mixing
these over all possible completions. Note that this means that the posterior over parameters are
no longer independent. In addition, as the process continues and further cases are observed,
one faces a combinatorial explosion (e.g. [19], [89]); in general, the complexity of the exact
computation is exponential in the number of missing variable entries in the database [18].
Thus, in practice some approximation is required.
A number of such approximations have been described in the literature. One can broadly
divide them into two classes: deterministic and stochastic. In an early example, referred
to as quasi-Bayes [86] or fractional updating [97], for each unobserved variable it was assumed
that a fractional number of instances of each state of that variable had been observed. This
early work has been criticised because it falsely increases the equivalent sample sizes [6], and
hence implicit precision, of the Dirichlet distributions. Consequently, this approach has subse-
quently been refined so that the approximating distribution attempts to match the moments of
the correct mixture [19], [89], [98].
These deterministic approximations process the data in the database sequentially, and also
make use of the assumption of parameter independence and properties of the Dirichlet distribu-
tion. In contrast, the various stochastic methods process all the data at once, and can handle
continuous domain variables and dependent parameters. (However, it should be mentioned that
the comparison reported in [19] indicates that the deterministic methods can perform well in
comparison with the stochastic methods).
A widely studied stochastic method is Gibbs sampling. This is a special case of the general
Markov chain Monte-Carlo methods for approximate inference (e.g. [33], [62]). Gibbs sam-
pling can be used to approximate any function of an initial joint distribution p(X) provided cer-
tain conditions are met. Firstly, the Gibbs sampler must be irreducible. That is, the distribution
p(X) must be such that we can sample any possible state of X given any possible initial state of
X. This condition is satisfied if, for example, the full joint distribution has no zeros (i.e. every
variable instantiation is possible). Secondly, strictly each instantiation must be chosen infi-
nitely often. In practice, an algorithm for deterministically rotating through the variables is
used to meet this requirement. If these conditions hold, then the average value of the sampled
function approaches the expectation with respect to p(X) with probability 1 as the number of
samples tends to infinity [15]. The question, of course, arises of how many samples to take
in a given situation, and how to estimate the error in the resulting posterior distribution (which
is what we are interested in here). Heuristic strategies exist for answering these questions ([62],
[73]), although there is no easy general solution.
For more information, an outline of the Gibbs sampling algorithm can be found in [10], whilst
[58] and [62] contain good discussions of Gibbs sampling, including methods for initialisation
and a discussion of convergence.
An alternative approximation algorithm is the expectation-maximization (EM) algorithm [24].
This can be viewed as a deterministic version of Gibbs sampling, and can be used to search for
the maximum a posteriori (MAP) estimate of model parameters. The EM algorithm iterates
through two steps; the expectation step and the maximisation step. In the first step, as a com-
plete database is unavailable, the expected sufficient statistics for the missing entries of the
database D are computed. The computational effort needed to perform this computation can be
intense. However, any Bayesian network inference algorithm may be used for this evaluation
(e.g. [51]). In the second step, the expected sufficient statistics are taken as though they were
the actual sufficient statistics of a database D', and the mean or mode of the parameters θ are
calculated. This computation is such that the probability of observation of D' given the network
structure and parameters is maximised; hence the term maximisation step.
The EM algorithm is fast, but it has the disadvantage of not providing a distribution over the
parameters θ. In addition, it was reported in [51] that when a substantial amount of data was
missing, the likelihood function has a number of local maxima leading to poor results (this is
also a problem with gradient ascent methods). Modifications have been suggested to overcome
this, and some authors suggest that a switch also be made to alternative algorithms when near a
solution in order to overcome the slow convergence of EM when near local maxima ([10],
[61]).
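For intuition, here is a toy version of EM for the simplest possible case: a single binary variable with some entries missing (assumed missing independently of the value). The E-step fills in expected counts, the M-step re-estimates the parameter; this is a sketch of the idea only, not the general network algorithm:

```python
# Records are 1, 0, or None (missing). EM estimates theta = p(X = 1).
data = [1, 1, 0, 1, None, None, 0, 1, None]

theta = 0.5                                   # arbitrary starting value
for _ in range(50):
    # E-step: expected sufficient statistic; a missing entry contributes
    # its current expected value, an observed entry contributes itself.
    expected_ones = sum(theta if x is None else x for x in data)
    # M-step: the maximising re-estimate given the expected counts.
    theta = expected_ones / len(data)

print(round(theta, 4))  # converges to 4/6, the frequency among the observed cases
```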
One of the algorithms suggested as an alternative to switch to in the above is gradient descent.
A demonstration of the pure use of gradient descent is reported in [81]. This paper raises rather
an interesting point. Firstly, let us introduce the notion of a hidden variable. In the case of a
database with missing variables, a particular variable may be observed in some cases, or it may
never be observed. The term hidden variable refers to the latter situation. Now, from the
preceding discussion it is clear that learning parameters for a network with hidden variables is
more complex than the situation where all the variables are observable in the same structure. In
contrast, the work reported in [81] indicates that learning parameters for a network containing
one or more hidden variables, provided they are cognitively meaningful, can be more statisti-
cally efficient than learning parameters for an alternative network in which all the variables are observable. Note that here, “efficiency” refers to the need for relatively small amounts of
data and not to speed of computation. Figure 11 indicates the two alternative networks consid-
ered. Network structure (a) showed a far more rapid reduction in mean square error per output
value as a function of number of training cases than did network structure (b). The lesson to be
learnt is that a soundly structured model is likely to be easier to train as well as having an
inherently higher diagnostic accuracy. This demonstrates one of the advantages of probabilistic
networks over neural networks; the exploitation of known structural information can dramati-
cally reduce the amount of learning data needed. Basically, the number of parameters needed is
reduced and hence learning becomes more statistically efficient.
Figure 11: In these diagrams, all nodes except H are three-valued. H is two-valued. (a) represents a probabilistic network with a hidden variable, H. It requires 45 parameters. (b) is the corresponding fully observable network. This requires 168 parameters (after [81]).
estimate the effect of Veteran status in the Vietnam era on mortality.
Figure 12: Predictive probability (0.0–1.0) against observation number, under an assessed prior and under a reference prior.
discrete models, models based on Dirichlet distributions and models of the logistic regression
type. Spiegelhalter et al. [91] continues the development of the Lauritzen & Spiegelhalter algo-
rithm with a discussion of the use of data to revise the conditional probabilities in a given
structure. Details are given on the use of Dirichlet priors for learning about parameters. Data is
assumed not to be available on all the nodes in the network. This paper also includes some dis-
cussion of the use of data to compare models.
Dawid and Lauritzen [22] introduced the notion of a hyper Markov law. This is a probability
distribution over a set of probability measures on a multivariate space. Their philosophy is sim-
ilar to [88], in that the role of the expert is seen as a provider of a prior distribution expressing
their uncertainty about the numerical parameters of a graphical model. These parameters may
then be revised and eventually superseded, using case data. A major distinct contribution of
[22] is to explore the details of this Bayesian approach to learning parameters in the context of
undirected rather than directed graphs.
Figure 13: Some possible network structures, (a)–(f), for describing the influences between Smoking (Sm), Sex (S) and Lung Cancer (L).
A naive approach might be to enumerate all possible network structures and then select that (or
those) which maximise some suitable criterion. A Bayesian criterion, for example, will esti-
mate the posterior probability of each network structure given the observed database. Other
criteria, such as minimum message length (e.g. [100], and see below), have also been explored.
Unfortunately, as the number of nodes in the network increases, it rapidly becomes infeasible to enumerate exhaustively all the possible structures. Even for a number of nodes as small as 10, the number of possible structures is approximately 4.2 × 10^18 (the number of probabilistic-network structures for a given number of nodes n can be determined using a function published by Robinson [76]). So, in general, unless one has prior grounds for eliminating significant classes of possible structures, some more efficient search strategy will be required.
A simple example can be used to illustrate this, and one other important point. Suppose we are
interested in studying the relationships between Smoking, Sex and Lung Cancer. After study-
ing a small number of cases, we might have a database expressed as the relational table, Table
1:
Table 1: An Example Database of Cases.
case Sm S L
1 T M T
2 F M F
3 T F F
4 F F F
5 T F T
There are 25 different networks which might be used to describe this problem as a directed graph. Some of these are shown in figure 13. Given that a meaning has been given to the vari-
able names, many of the possible structures will not seem intuitively plausible; one would not
consider a directed influence from Smoking to Sex as being at all realistic, for example. How-
ever, if we ignore the semantics of the nodes at this stage, the additional point we wish to make
is that several of the possible networks will be equivalent in the sense that they represent equiv-
alent independence statements. Consider, for example, networks (d), (e) and (f) in figure 13.
The probability distributions decompose respectively as:
pd(Sm, S, L) = pd(L | Sm, S)pd(Sm)pd(S);
pe(Sm, S, L) = pe(S| L)pe(L| Sm)pe(Sm);
pf(Sm, S, L) = pf(Sm | L)pf(L | S)pf(S)
Apply Bayes’ rule repeatedly to this last decomposition, and rearrange:
p(Sm | L)p(L | S)p(S) = p(Sm | L)p(L)p(S | L) = p(S | L)p(L | Sm)p(Sm)
It is then clear that networks (e) and (f) have equivalent functional decompositions, and hence
express equivalent independence properties.
This notion of equivalence of network structures [99] is important and should be kept in mind during the remaining discussion1. This leads to an important property which is expected to hold of Bayesian network structures. Let BS^h represent the hypothesis that the physical probabilities for some specified joint space can be encoded in the network structure BS. Then the hypotheses associated with two equivalent network structures must be identical; the structures consequently have the same prior and posterior probabilities. This property is referred to as hypothesis equivalence. The implication is that we should associate each hypothesis BS^h with an equivalence class of hypotheses, rather than a single hypothesis. It should be borne in mind in most of the following that the learning methods strictly pertain to learning equivalence classes of structures.
Nevertheless, in this example we can use expert knowledge to categorically eliminate a large
number of the proposed network structures from consideration. All those with arrows from
Lung Cancer to Sex, or from Smoking to Sex should certainly go. It might also be reasonably
safe to remove all those with arrows from Lung Cancer to Smoking, but this is perhaps begin-
ning to impose a judgement rather than a “technological” constraint. Now, given the remaining possible network structures, let BS^h represent the hypothesis that the database D of Table 1 is a random sample from the (equivalence class of a) network structure BS. Then, we simply select that hypothesis whose posterior probability given the data is a maximum. If ξ represents the current background knowledge, as before, and c is a normalisation constant, then this can be calculated from the data in Table 1 using

p(BS^h | D, ξ) = c p(BS^h | ξ) p(D | BS^h, ξ)    eqn. 9.

Full details for the computation of this expression can be found in [36].
At a high level view, that is all there is to learning structure in probabilistic networks. In prac-
tice, if the user believes that only a few network structures are possible, they can directly assess
the priors for the possible network structures and their parameters and then compute the poste-
rior probabilities as described. However, if one is trying to learn a model of a situation about
which there is little prior knowledge, then some extra mechanism must be in place to control
the explosion of possible structures.
1. More on testing for and characterising equivalent network structures can be found in [12] and [99].
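To make eqn. 9 concrete: with complete binary data and Dirichlet priors on the parameters, p(D | BS^h) has a simple closed form (the counting formula of Cooper and Herskovits [18]), and with a uniform prior over structures, comparing this quantity across candidates is equivalent to comparing posteriors. A sketch; the encoding of Table 1 and the uniform prior choices here are illustrative assumptions:

```python
from math import lgamma
from itertools import product

def log_marginal_likelihood(structure, data, alpha=1.0):
    """log p(D | BS^h) for complete binary data, with one Dirichlet(alpha, alpha)
    prior per (node, parent configuration); a sketch, not the full machinery of [36]."""
    total = 0.0
    for node, parents in structure.items():
        for cfg in product([0, 1], repeat=len(parents)):
            rows = [r for r in data
                    if all(r[p] == v for p, v in zip(parents, cfg))]
            counts = [sum(1 for r in rows if r[node] == k) for k in (0, 1)]
            # Dirichlet-multinomial marginal likelihood for this configuration.
            total += lgamma(2 * alpha) - lgamma(2 * alpha + len(rows))
            total += sum(lgamma(alpha + n) - lgamma(alpha) for n in counts)
    return total

# Table 1, with T (and M) encoded as 1, F as 0.
data = [
    {"Sm": 1, "S": 1, "L": 1},
    {"Sm": 0, "S": 1, "L": 0},
    {"Sm": 1, "S": 0, "L": 0},
    {"Sm": 0, "S": 0, "L": 0},
    {"Sm": 1, "S": 0, "L": 1},
]
candidates = {
    "no edges": {"Sm": (), "S": (), "L": ()},
    "(d) Sm -> L <- S": {"Sm": (), "S": (), "L": ("Sm", "S")},
}
for name, s in candidates.items():
    print(name, round(log_marginal_likelihood(s, data), 3))
```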
Note that, in contrast to the previous section, the remainder of this section only covers the situ-
ation where complete data is available.
cumstances [50].
A potential difficulty with the posterior-probability criteria is the need to assign prior probabil-
ities to each possible network structure. However, a statistically efficient (in terms of data
demands) method for doing this has been described by Heckerman et al. [40]. This requires
only the assessment of a prior network structure for the domain, the user’s “best guess”, and a
single constant. However, if a user is willing, they can provide more detailed knowledge by
assessing different penalties for different nodes, and for different parent configurations for
each node [9]. Alternatively, they might categorically assert that some arcs in the prior network
must be present. Similar practical approaches to the necessary assignment of priors to parame-
ters for all possible network structures have been discussed in [9], [17], [18], [40], [91].
A scoring criterion is said to be separable if it can be written as a product of terms, one per node:

p(D, BS^h | ξ) = ∏i=1…n s(Xi | Πi)    eqn. 10.
where s(Xi | Πi) is only a function of Xi and its parents. So, given such a separable criterion, we
can compare the score for two network structures that differ by the addition or deletion of arcs
pointing to Xi by computing only the term s(Xi | Πi) for both structures. If an arc between, for
example, nodes Xi and Xj is reversed then the terms s(Xi | Πi) and s(Xj | Πj) need to be calcu-
lated. A general technique for search strategies is then to make successive arc changes to the
network, and employ the property of decomposability to evaluate the merit of each change.
Cooper & Herskovits [18] describes the use of a greedy search algorithm for identifying the
most probable structure given some test data. Aliferis & Cooper [2] evaluates the accuracy of
K2, a specific instantiation of a greedy search algorithm, using simulated data. The use of sim-
ulated data from well-specified (gold-standard) models allows the accuracy with which a spe-
cific technique learns the model structure to be measured directly. The alternative is to measure
the accuracy indirectly by assessing the predictive accuracy of the resulting model [42]. Using
their direct measurements, Aliferis & Cooper report that the mean percentage of correctly
found arcs was 91.6%, whilst the mean ratio of superfluous arcs was 4.7%. Further empirical
studies leading to refined search algorithms can be found in [13] and [92].
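A sketch of a greedy searcher in this spirit (a generic arc-toggling hill-climber rather than K2 itself, which also assumes a node ordering). It exploits separability: toggling an arc into Xi changes only the local term s(Xi | Πi), so only that term is recomputed. The local score reuses the Dirichlet-multinomial form of the earlier sketch:

```python
from math import lgamma
from itertools import product

def local_score(node, parents, data, alpha=1.0):
    """s(Xi | Pi_i): the Dirichlet-multinomial term for one family (binary data)."""
    total = 0.0
    for cfg in product([0, 1], repeat=len(parents)):
        rows = [r for r in data if all(r[p] == v for p, v in zip(parents, cfg))]
        counts = [sum(1 for r in rows if r[node] == k) for k in (0, 1)]
        total += lgamma(2 * alpha) - lgamma(2 * alpha + len(rows))
        total += sum(lgamma(alpha + n) - lgamma(alpha) for n in counts)
    return total

def would_cycle(parents, new_parent, child):
    """Would adding new_parent -> child create a directed cycle?"""
    stack, seen = [new_parent], set()
    while stack:
        v = stack.pop()
        if v == child:
            return True               # child is an ancestor of new_parent
        if v not in seen:
            seen.add(v)
            stack.extend(parents[v])
    return False

def greedy_search(variables, data):
    parents = {v: set() for v in variables}
    score = {v: local_score(v, (), data) for v in variables}
    improved = True
    while improved:
        improved = False
        for child in variables:
            for parent in variables:
                if parent == child or would_cycle(parents, parent, child):
                    continue
                trial = parents[child] ^ {parent}      # toggle one arc
                s = local_score(child, tuple(trial), data)
                if s > score[child]:                   # only one local term changes
                    parents[child], score[child] = trial, s
                    improved = True
    return parents

# e.g. greedy_search(["Sm", "S", "L"], data) with the Table 1 data above.
```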
Figure 14: A simple network structure (nodes A, B and E are each parents of S) and its associated conditional probability table:

A B E | p(S=1 | A, B, E)
1 1 1 | 0.95
1 1 0 | 0.95
1 0 1 | 0.20
1 0 0 | 0.05
0 1 1 | 0.00
0 1 0 | 0.00
0 0 1 | 0.00
0 0 0 | 0.00
As indicated earlier, the last two assumptions are made so that the distributions of the parame-
ters stay within the same conjugate family of distributions, with sampling. Interestingly, the
first three assumptions, together with one additional assumption do in fact imply the fifth
assumption. This was demonstrated by Geiger and Heckerman [31]. The additional assumption
is that data can not help to discriminate two network structures that encode the same sets of
probability distributions (likelihood equivalence).
As has been indicated, a wide variety of learning algorithms are extant. Buntine [10] provides a
framework from which a wide variety of data analysis and learning algorithms can be con-
structed. Here graphical models are used at a meta-level to represent the task of learning object
level models. Buntine hopes that this work will form the basis of a computational theory of
Bayesian learning.
Figure 15: Two alternative representations of a local CPT structure. The default table in (a) requires 5 parameters, whilst the decision tree of (b) requires 4.

(a) Default table:
A B E | p(S=1 | A, B, E)
1 1 1 | 0.95
1 1 0 | 0.95
1 0 1 | 0.20
1 0 0 | 0.05
*     | 0.00 (default)

(b) Decision tree: branch first on A (A = 0 gives 0.00), then on B (B = 1 gives 0.95), then on E (E = 1 gives 0.20, E = 0 gives 0.05).
probability table (CPT) on the right hand side of the figure requires 2^3 = 8 parameters. Yet, by inspecting the table, we can see that there is scope for a representation which uses fewer parameters. Firstly, we have a default condition that if the alarm is not armed (A=0) there will be no alarm sound (S=0) whatever the state of the other two parents. In addition, if the alarm is armed and there is a burglary (B=1), then the loud alarm sound has the same probability of occurrence whether an earthquake is happening or not.
Figure 15 shows two alternative representations of this local structure which use fewer parameters than the original structure; a default table requiring 5 parameters and a decision tree with 4. In the default table, a single default probability is provided for all those parameters that are not explicitly listed. In the case of the decision tree, each leaf describes a probability for node S. The arcs are labelled with the possible states of each parent node, and one travels down the tree taking the appropriate branches to find the probability of S given the selected states of the parent nodes. The default table captures only the default condition, whilst the decision tree also captures the overriding impact of a burglary whenever the alarm is armed; hence the difference in the parameters required.
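A small sketch of the default-table representation of figure 15(a), making the parameter saving explicit (the class and method names are illustrative):

```python
class DefaultTable:
    """A CPT with explicit rows for some parent configurations and one
    shared default probability for all the rest (cf. figure 15(a))."""
    def __init__(self, explicit, default):
        self.explicit = explicit       # {(a, b, e): p(S=1 | a, b, e)}
        self.default = default

    def p_s1(self, a, b, e):
        return self.explicit.get((a, b, e), self.default)

    def n_parameters(self):
        return len(self.explicit) + 1  # the "+1" is the shared default entry

cpt = DefaultTable(
    explicit={(1, 1, 1): 0.95, (1, 1, 0): 0.95, (1, 0, 1): 0.20, (1, 0, 0): 0.05},
    default=0.00,                      # alarm not armed (A = 0) => never sounds
)
print(cpt.p_s1(0, 1, 1))   # 0.0, served by the default row
print(cpt.n_parameters())  # 5, versus 2**3 = 8 for the full table of figure 14
```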
There is an important benefit to be gained by modifying the learning algorithms so that they
can recognise such a local structure where appropriate. Since fewer parameters are needed, the
estimation of these parameters is more efficient for a given amount of data. Furthermore, most
scoring criteria aim for a balance between complexity of the preferred network and the accu-
racy with which it represents the frequencies in the database from which it is learnt. This
means that in the example just given, a simpler network might be preferred using, say, just two
of the three possible parents as this will reduce the number of parameters needed in the naive
tabular representation of the CPT. Thus the search algorithm may show a bias against the
“true” more complex structure.
Friedman and Goldszmidt [28] have proposed two modifications to the minimum description
length (MDL) criterion for learning probabilistic networks. These allow for a compact encod-
ing of default tables and decision trees, respectively. Both criteria are separable (see Section
8.2) and so may be used in conjunction with a search strategy employing local additions,
removals and reversals of edges to learn a structure in a similar way to those already discussed.
Friedman and Goldszmidt then carried out a series of experiments to evaluate the effectiveness
of these new criteria. These experiments supported the hypotheses that the structures learnt
using the new criteria reduce error by preferring more complex structures in those situations
where the associated CPT can be described using fewer than exponential parameters. In addi-
tion, the estimation of these parameters was more robust as they are based on relatively larger
samples. Analogous changes to the BD metric are also proposed, although not explored.
The compact representation of local CPT structure is a very important topic. Further discus-
sions of the fundamental knowledge representation issues can be found in [9], [38], [67] and
[94].
The last statement is of particular significance. Causal information of this kind is needed in
order to make predictions in the face of intervention. Conversely, as in the case of good experi-
mental science, causal information can be learned with interventional data. There are cases in
the social sciences and in medicine where interventional data can be obtained through, for
example, randomised trials. However, there are also many cases where this is impractical, or
even immoral. The alternative then is to try to learn causal information from data alone.
The notion of conditional independence provides a link between learning graphical models
from data and learning causality. For many domains, there is a direct connection between lack
of cause and conditional independence. This connection is sometimes called the causal
Markov assumption [93]:
• A domain with causal relationships given by graph G exhibits the conditional independence
assumptions determined by d-separation applied to G.
For example, measles “causes” Koplik’s spots, measles “causes” red spots, but the probabili-
ties of the two symptoms are independent once measles is confirmed to be present. Several
authors have reported that the causal Markov assumption is appropriate for many domains (e.g.
[35], [67], [93]). Learning causality is essentially just the inversion of this; given certain condi-
tional independence assertions learned from data, determine those graphs which could have
produced those assertions. By the causal Markov assumption, those graphs must include the
true causal graph.
Caution is needed here. It is possible that two or more alternative network structures may
record the same conditional independence assertions. Consider, for example, the set of varia-
bles X = {X, Y, Z} and the three possible network structures X → Y→ Z, X← Y → Z, and
X ← Y ← Z. All three structures represent only the independence assertion that X and Z are
conditionally independent given Y. In this sense, they are equivalent structures. This was illus-
trated in Section 8. Verma and Pearl define two Bayesian-network structures for a set of varia-
bles X as independence equivalent if they represent the same set of independence assertions
[99]. Verma and Pearl also provide a simple condition for assessing whether two network
structures are independence equivalent. Strictly, we need the stronger notion of distribution
equivalence; two network structures for X are distribution equivalent if they encode the same
set of probability distributions for X. However, for the purposes of this discussion, the impor-
tant point is to remember that one is strictly learning equivalence classes of structures rather
than individual structures. Further discussion of the above two forms of equivalence can be
found in [37].
In [60] Meek defines a complete causal explanation of a dependency model M (a set of condi-
tional independence statements) as a directed acyclic graph G where the set of conditional
independence facts entailed by G is exactly the set of facts in M. It follows that if G is a com-
plete causal explanation of a dependency model M, and G ' is equivalent to G, then G ' must
also be a complete causal explanation of M. Although one cannot discriminate between equiv-
alent causal explanations, one can usefully ask, what are the causal relationships that are com-
mon to every causal explanation of the set of independence facts in a sample? Meek provides
algorithms for answering this question in the case where the causal explanations are consistent
with some background knowledge consisting of a set of directed edges that are required, and a
set of directed edges that are forbidden.
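Meek's algorithms are efficient and handle such background knowledge; purely as an illustration of the question being answered, the brute-force sketch below enumerates every DAG equivalent to a given one and intersects their edge sets, yielding the arrows shared by all complete causal explanations. The v-structure test is repeated from the previous sketch so that the fragment runs on its own; none of this is Meek's method.

    import itertools
    import networkx as nx

    def v_structures(g):
        """Unshielded colliders, as in the previous sketch."""
        return {(a, c, b) for c in g.nodes
                for a, b in itertools.combinations(sorted(g.predecessors(c)), 2)
                if not (g.has_edge(a, b) or g.has_edge(b, a))}

    def compelled_edges(dag):
        """Directed edges shared by every DAG equivalent to `dag`, i.e. the causal
        arrows common to all complete causal explanations of its independencies.
        Brute force: try every orientation of the skeleton."""
        skel = list(dag.edges)
        target, common = v_structures(dag), None
        for flips in itertools.product([False, True], repeat=len(skel)):
            g = nx.DiGraph()
            g.add_nodes_from(dag.nodes)
            g.add_edges_from((v, u) if f else (u, v)
                             for (u, v), f in zip(skel, flips))
            # same skeleton by construction; require acyclicity and v-structures
            if nx.is_directed_acyclic_graph(g) and v_structures(g) == target:
                common = set(g.edges) if common is None else common & set(g.edges)
        return common

    print(compelled_edges(nx.DiGraph([("X", "Y"), ("Y", "Z")])))  # set(): chains fix nothing
    print(compelled_edges(nx.DiGraph([("X", "Z"), ("Y", "Z")])))  # both collider arrows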
Because of its importance in epidemiology, economics, the social sciences and other “soft” sciences, learning causal networks has a long history. The seminal references for this topic are the book by Spirtes et al. [93] and the paper by Verma & Pearl [99]. We shall refer to this general approach as independence search-based techniques for learning causality (ISC), to distinguish it from the Bayesian approaches to learning structure of the previous section. The two families of techniques appear very different. However, as the size of the data set approaches infinity there is an asymptotic correspondence between them. So it would be foolish to regard the two approaches as competitors; learning causality is a deep problem which warrants viewing from different perspectives.
A possible concern with the use of collected data to learn causality is that a fundamental
premise of scientific methodology is that some form of intervention or experimentation is nec-
essary in order to discover causality. For example, in the case of smoking and lung cancer of
Section 8, one might conclude that smoking had a causal relationship with lung cancer if, by
intervening to stop smoking, one reduced the incidence of lung cancer. However, Spirtes et al.
[93] and Pearl and Verma [71] have shown that, under certain assumptions, passive observa-
tion is sufficient to determine all or a part of (equivalence classes of) the causal structure of a
system under study¹. As mentioned earlier in this section, this is the goal one is aiming for.
Wermuth and Lauritzen [101] discuss the use of graphical chain models in the formulation of
substantive research hypotheses. Their interest came from the social sciences, where the aim is to produce statistical models “capturing characteristics, behaviour, abilities, attitudes of people or historical and
environmental conditions”. In [101] they describe simple criteria for identifying equivalent sta-
tistical models from graphs. This is an important point to take on board when discussing the
learning of causality, as it is not possible to differentiate between alternative substantive
research structures if they correspond to equivalent graphical models.
The Bayesian learning techniques discussed in this paper can be viewed as providing a “soft”
version of the independence search-based techniques, using some expert judgement to initiate
the search. The required technical material, using a Bayesian approach, has essentially all been
covered in the preceding sections. So the main motivation behind this section was to air the
issues rather than describe further technology. For further reading, a small scale, but informa-
tive, example of learning causality using real-world data can be found in [37]. This data has
also been studied using non-Bayesian techniques by Whittaker [104] and by Spirtes et al. [93].
The papers on the Rubin Causal Model referred to in section 7.3 are also relevant, although this
work does not make significant use of graphical models as yet.
A number of problems still remain to be solved. For example, the techniques work best if all
variables in the domain of interest are observable. Difficulties arise if hidden variables are per-
mitted. By way of illustration, the graph X ← a → Z ← b → Y implies the same set of condi-
tional independence statements on the variables X, Y and Z as the graph X → Z ← Y. However, the
former does not present X as a cause of Z, whilst the latter does. The independence search-
based techniques can be used to identify when an association must be attributed to a hidden
common cause [71], [93]. The Bayesian techniques, on the other hand, still have difficulties
when hidden variables are involved.
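The claim about these two graphs can be checked mechanically: enumerate every statement "x is independent of y given S" over the observed variables only, in each graph, and compare. The sketch below does exactly that; the moral-graph d-separation test is repeated from the earlier sketch so that the fragment is self-contained.

    import itertools
    import networkx as nx

    def d_separated(dag, xs, ys, zs):
        """Moral-graph d-separation test, as in the earlier sketch."""
        involved = set(xs) | set(ys) | set(zs)
        ancestral = set(involved)
        for n in involved:
            ancestral |= nx.ancestors(dag, n)
        sub = dag.subgraph(ancestral)
        moral = nx.Graph(sub.to_undirected())
        for n in sub.nodes:
            for u, v in itertools.combinations(sorted(sub.predecessors(n)), 2):
                moral.add_edge(u, v)
        moral.remove_nodes_from(zs)
        return not any(nx.has_path(moral, x, y) for x in xs for y in ys)

    def independence_signature(dag, observed):
        """All statements 'x _||_ y | S' with x, y and S drawn from observed only."""
        sig = set()
        for x, y in itertools.combinations(sorted(observed), 2):
            rest = sorted(observed - {x, y})
            for r in range(len(rest) + 1):
                for s in itertools.combinations(rest, r):
                    if d_separated(dag, {x}, {y}, set(s)):
                        sig.add((x, y, s))
        return sig

    latent = nx.DiGraph([("a", "X"), ("a", "Z"), ("b", "Z"), ("b", "Y")])
    collider = nx.DiGraph([("X", "Z"), ("Y", "Z")])
    observed = {"X", "Y", "Z"}
    # Same signature over X, Y, Z, yet only the collider presents X as a cause of Z.
    print(independence_signature(latent, observed)
          == independence_signature(collider, observed))  # True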
1. These are the Causal Markov Assumption, and a further assumption called faithfulness [92]. In the
Bayesian learning case, the latter follows from the assumption that the parameters have a probability
density function [37].
10. Conclusions
This paper has described an area of research in expert systems which is rapidly maturing, and
which is beginning to find significant real-world application. Applications of the techniques described in this paper have been reported by Madigan and Raftery [57], Lauritzen et al. [55],
Singh and Provan [85], and Friedman and Goldszmidt [28]. In addition, a number of research
groups have developed software systems specifically for learning graphical models. The work
described by Spirtes et al. [93] has been further developed into a program called TETRAD II
for learning about cause and effect, as reported by Scheines and colleagues. Badsberg [5] and Højsgaard et al. [43] have built systems which can learn mixed graphical models (chain graphs) using a variety of criteria for model selection. A widely used system in the learning literature is BUGS, created by Thomas, Spiegelhalter and Gilks [96].
BUGS takes a learning problem specified as a Bayesian network and compiles this problem
into a Gibbs-sampler computer program. Finally, a software package developed by Radford
Neal is available on the internet (http://www.cs.toronto.edu/~radford/fbm.software.html).
Neal’s package supports Bayesian regression and classification models [63]. A web page list-
ing currently available commercial and research software for learning belief networks can be
found at http://bayes.stat.washington.edu/almond/belfit.html.
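To give a flavour of the kind of program BUGS generates, the sketch below Gibbs-samples the textbook cloudy/sprinkler/rain/wet-grass network by hand: each unobserved variable is resampled in turn from its full conditional given the current state, obtained here simply by renormalising the joint. The network and its numbers are the standard classroom illustration, not anything taken from the systems cited above.

    import random

    # CPTs for the textbook sprinkler network (illustrative numbers only).
    p_cloudy = 0.5
    p_sprinkler = {True: 0.1, False: 0.5}              # P(S=1 | Cloudy)
    p_rain = {True: 0.8, False: 0.2}                   # P(R=1 | Cloudy)
    p_wet = {(True, True): 0.99, (True, False): 0.9,   # P(W=1 | S, R)
             (False, True): 0.9, (False, False): 0.0}

    def joint(c, s, r, w):
        """P(C=c, S=s, R=r, W=w) read off the CPTs."""
        def f(p, x):
            return p if x else 1.0 - p
        return (f(p_cloudy, c) * f(p_sprinkler[c], s)
                * f(p_rain[c], r) * f(p_wet[(s, r)], w))

    def gibbs(n_samples, w=True, burn=1000):
        """Estimate P(Rain=1 | WetGrass=w) by resampling each unobserved
        variable from its full conditional, with the evidence w clamped."""
        c, s, r, rains = True, True, True, 0
        for i in range(burn + n_samples):
            p1 = joint(True, s, r, w); c = random.random() < p1 / (p1 + joint(False, s, r, w))
            p1 = joint(c, True, r, w); s = random.random() < p1 / (p1 + joint(c, False, r, w))
            p1 = joint(c, s, True, w); r = random.random() < p1 / (p1 + joint(c, s, False, w))
            if i >= burn:
                rains += r
        return rains / n_samples

    print(gibbs(20000))  # close to the exact value of about 0.69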
This all goes to show that learning probabilistic networks from data is a healthy and buoyant
research area with great potential for application. We hope that this paper will go some way to
raising awareness of the possibilities for exploiting this work.
11. References
[1] Akaike H. 1974. A New Look at Statistical Model Identification. IEEE Trans. Automatic
Control, 19, 716-723.
[2] Aliferis C.F. & Cooper G.F. 1994 An evaluation of an algorithm for inductive learning of
Bayesian belief networks using simulated data sets. In: Uncertainty in Artificial Intelli-
gence - Proceedings of the 10th Conference, 8-14.
[3] Allen J., Fikes R. & Sandewall E. 1991 KR-91, Principles of Knowledge Representation and Reasoning: Proceedings of the Second International Conference. Cambridge, MA: Morgan Kaufmann.
[4] Angrist J.D., Imbens G.W. & Rubin D.B. 1996 Identification of Causal Effects Using
Instrumental Variables. Journal of the American Statistical Association, 91, 444-455.
[5] Badsberg J. 1992. Model search in contingency tables in CoCo. In: Dodge Y. & Whittaker J. (eds), Computational Statistics, Heidelberg: Physica Verlag, 251-256.
[6] Bernardo J.M. and Giron F.J. 1988. A Bayesian analysis of simple mixture problems. In
Bayesian Statistics 3 (Bernardo J.M., DeGroot M.H., Lindley D.V. and Smith A.F.M.,
eds), Oxford University Press, 67-78.
[7] Bernardo J.M. & Smith A.F.M. 1994 Bayesian Theory. Chichester: John Wiley.
[8] Besnard P. & Hanks S. 1995. Uncertainty in Artificial Intelligence: Proceedings of the
Eleventh Conference, San Francisco: Morgan Kaufmann.
[9] Buntine W. 1991. Theory Refinement in Bayesian Networks. In Proc. Seventh Confer-
ence on Uncertainty in Artificial Intelligence, San Francisco: Morgan Kaufmann, 52-60.
[10] Buntine W. 1994 Operations for Learning with Graphical Models. Journal of Artificial Intelligence Research, 2, 159-225.
[11] Buntine W. 1996 A guide to the literature on learning probabilistic networks from data.
IEEE Transactions on Knowledge and Data Engineering, 8, 195-210.
[12] Chickering D. 1995 A Transformational Characterisation of Equivalent Bayesian Net-
work Structures. In Besnard & Hanks (eds), 87-98.
[13] Chickering D. 1996 Learning equivalence classes of Bayesian-network structures. In
Horvitz E. & Jensen F. (eds.).
[14] Chung K.L. 1979 Elementary Probability Theory with Stochastic Processes. New York:
Springer-Verlag.
[15] Çinlar E. 1975. Introduction to Stochastic Processes. Prentice Hall.
[16] Cook T.D. & Campbell D.T. 1979. Quasi-Experimentation: Design & Analysis Issues for
Field Settings. Chicago: Rand McNally College Publishing Company.
[17] Cooper G.F. & Herskovits E. 1992 Bayesian Method for the Induction of Probabilistic
Networks from Data. Technical Report SMI-91-1, Section on Medical Informatics, Stan-
ford University.
[18] Cooper G.F. & Herskovits E. 1992 Bayesian Method for the Induction of Probabilistic
Networks from Data. Machine Learning, 9, 309-347.
[19] Cowell R.G., Dawid A.P. and Sebastiani P. 1996. A comparison of sequential learning
methods for incomplete data. In Bayesian Statistics 5, pp. 533-542, Oxford: Clarendon
Press.
[20] Cox D. 1993. Causality and Graphical Models. Bull. Int. Stat. Inst., Proc. 49th Session 1,
365-372.
[21] Dawid A.P. 1979 Conditional Independence in Statistical Theory. J. R. Statist. Soc. B, 41, 1-31.
[22] Dawid A.P. & Lauritzen S.L. 1993 Hyper Markov Laws in the Statistical Analysis of
Decomposable Graphical Models. The Annals of Statistics, 21, 1272-1317.
[23] DeGroot M.H. 1970 Optimal Statistical Decisions. New York: McGraw-Hill.
[24] Dempster A., Laird N. & Rubin D. 1977. Maximum likelihood from incomplete data via
the EM algorithm. J. Roy. Statist. Soc. B, 39, 1-38.
[25] Druzdzel M.J. & van der Gaag L.C. 1995 Elicitation of Probabilities for Belief Net-
works: Combining Qualitative and Quantitative Information. In: Besnard & Hanks (eds),
141-148.
[26] Dubois D., Wellman M.P., D’Ambrosio B.D. & Smets P. 1992 Uncertainty in Artificial
Intelligence: Proceedings of the Eighth Conference, San Mateo, CA: Morgan Kaufmann.
[27] Edwards D. 1995 Introduction to Graphical Modelling. New York: Springer-Verlag.
[28] Friedman N. & Goldszmidt M. 1996a. Learning Bayesian Networks with Local Struc-
ture. In: Horvitz E. & Jensen F. (eds.).
[29] Friedman N. & Goldszmidt M. 1996b. Building classifiers using Bayesian networks. In
Proceedings of AAAI-96, Menlo Park, CA: AAAI Press, 1277-1284.
[30] Geiger D. & Pearl J. 1988. On the logic of causal models. Proc. 4th Workshop on Uncer-
tainty in AI, St Paul, MN, 136-147.
[31] Geiger D. & Heckerman D. 1995 A Characterisation of the Dirichlet Distribution with
Application to Learning Bayesian Networks. In: Besnard & Hanks (eds), 196-207.
[32] Good I.J. 1965 The Estimation of Probabilities. Cambridge, MA: MIT Press.
[33] Hastings W. 1970. Monte Carlo Sampling Methods Using Markov Chains and Their
Applications. Biometrika, 57, 97-109.
[34] Heckerman D. 1990 Probabilistic similarity networks. Networks, 20, 607-636.
[35] Heckerman D. Mamdani A. & Wellman M. 1995 Real-world applications of Bayesian
networks, Comm. ACM, 38.
[36] Heckerman D. 1996a Bayesian Networks for Knowledge Discovery. In: Fayyad U.M.,
Piatetsky-Shapiro G., Smyth P. and Uthurusamy R. (eds), Advances in Knowledge Dis-
covery and Data Mining. Cambridge, MA: MIT Press, 273-305.
[37] Heckerman D. 1996b. A Tutorial on Learning with Bayesian Networks. Technical Report
MSR-TR-95-06, Microsoft Corporation, Redmond, USA.
[38] Heckerman D. & Breese J.S. 1994. A new look at causal independence. In Lopez de
Mantaras & Poole (eds), 286-292.
[39] Heckerman D. & Shachter R. 1995 Decision-Theoretic Foundations for Causal Reason-
ing. Journal of Artificial Intelligence Research, 3, 405-430.
[40] Heckerman D., Geiger D. & Chickering D. 1995. Learning Bayesian Networks: The
Combination of Knowledge and Statistical Data. Machine Learning, 20, 197-243.
[41] Henrion M., Shachter R.D., Kanal L. & Lemmer J.F. (eds.) 1990 Uncertainty in Artificial
Intelligence 5. Amsterdam: Elsevier Science Publishers B.V. (North Holland).
[42] Herskovits E. 1991 Computer-Based Probabilistic-Network Construction. PhD thesis,
Stanford University.
[43] Højsgaard S., Skjøth F. and Thiesson B. 1994. User’s guide to BIFROST. Technical
report, Department of Mathematics and Computer Science, Aalborg, Denmark.
[44] Horvitz E. & Jensen F. (eds.) 1996. Uncertainty in Artificial Intelligence: Proceedings of
the Twelfth Conference, San Francisco: Morgan Kaufmann.
[45] Jensen F.V. 1996 An introduction to Bayesian Networks. London: UCL Press Ltd.
[46] Jensen F.V., Lauritzen S.L. & Olesen K.G. 1989 Bayesian Updating in Recursive Graph-
ical Models by Local Computations. Technical Report R 89-15, Department of Mathe-
matics and Computer Science, University of Aalborg, Denmark.
[47] Jensen F.V., Olesen K.G. & Andersen S.K. 1990 An Algebra of Bayesian Belief Uni-
verses for Knowledge-Based Systems. Networks, 20, 637-659.
[48] Holland P.W. 1986 Statistics and Causal Inference. Journal of the American Statistical
Association, 81, 945-960.
[49] Kass R. & Raftery A. 1993 Bayes Factors and Model Uncertainty. Technical Report 571,
Dept. of Statistics, Carnegie Mellon University.
[50] Kass R. & Raftery A. 1995. Bayes Factors. Journal of the American Statistical Associa-
tion, 90, 773-795.
[51] Lauritzen S.L. 1995. The EM algorithm for graphical association models with missing
data. Computational Statistics & Data Analysis, 19, 191-210.
[52] Lauritzen S.L. 1996. Graphical Models. Oxford: Clarendon Press.
[53] Lauritzen S.L. & Spiegelhalter D.J. 1988. Local computations with probabilities on
graphical structures and their application to expert systems (with discussion). J. R. Stat.
Soc. Ser. B, 50, 157-224.
[54] Lauritzen S.L., Dawid A.P., Larsen B.N. & Leimer H-G. 1990 Independence Properties of Directed Markov Fields. Networks, 20, 491-505.
[55] Lauritzen S.L., Thiesson B. & Spiegelhalter D. 1994. Diagnostic systems created by
model selection methods: a case study. In: Cheeseman P. & Oldford R (eds) AI and Sta-
tistics IV, New York: Springer-Verlag Lecture Notes in Statistics vol 89, 143-152.
[56] Lopez de Mantaras R. & Poole D. (eds) 1994. Uncertainty in Artificial Intelligence: Pro-
ceedings of the Tenth Conference, San Francisco: Morgan Kaufmann.
[57] Madigan D. & Raftery A. 1994. Model Selection and Accounting for Model Uncertainty
in Graphical Models Using Occam’s Window. J. Am. Statist. Association, 89, 1535-1546.
[58] Madigan D. & York J. 1995. Bayesian graphical models for discrete data. International
Statistical Review, 63, 215-232.
[59] Madigan D., Raftery A., Volinsky C. & Hoeting J. 1996. Bayesian model averaging. In
Proceedings of the AAAI Workshop on Integrating Multiple Learned Models, Portland
OR.
[60] Meek C. 1995 Causal inference and causal explanation with background knowledge. In:
Besnard & Hanks (eds), 403-410.
[61] Meilijson I. 1989. A fast improvement to the EM algorithm on its own terms. J. Roy. Sta-
tist. Soc. B., 51(1), 127-138.
[62] Neal R. 1993. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Dept. of Computer Science, University of Toronto.
[63] Neal R. 1996 Bayesian Learning for Neural Networks. New York: Springer-Verlag.
[64] Nilsson N. 1986 Probabilistic logic. Artificial Intelligence, 28, 71-87.
[65] Olesen K.G., Lauritzen S.L. & Jensen F.V. 1992 aHugin: A System Creating Adaptive
Causal Probabilistic Networks. In: Dubois et al. (eds), 223-229.
[66] Pearl J. 1986 A constraint-propagation approach to probabilistic reasoning. In: Kanal
L.N. and Lemmer J.F. (eds), Uncertainty in Artificial Intelligence, Amsterdam: North-Holland, 371-382.
[67] Pearl J. 1988 Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. San Mateo, CA: Morgan Kaufmann.
[68] Pearl J. 1995 On the Testability of Causal Models with Latent and Instrumental Varia-
bles. In: Besnard & Hanks (eds), 435-443.
[69] Pearl J. 1995 Causal diagrams for empirical research. Biometrika, 82, 669-710.
[70] Pearl J., Geiger D. & Verma T. 1990. The Logic of Influence Diagrams. In: Oliver R.M.
& Smith J.Q. (eds) 1990, Influence Diagrams, Belief Nets and Decision Analysis, Chich-
ester: John Wiley.
[71] Pearl J. & Verma T.S. 1991 A theory of inferred causation. In: Allen J., Fikes R. & Sand-
ewall E. (eds), 441-452.
[72] Richardson T. 1997. Extensions of undirected and acyclic, directed graphical models. In: Proc. 6th Conference on Artificial Intelligence and Statistics, Ft Lauderdale, 407-419.
[73] Ripley B. 1987. Stochastic Simulation. Chichester: John Wiley & Sons.
[74] Rissanen J. 1987. Stochastic Complexity (with discussion). J. Roy. Statist. Soc. B, 49,
223-239.
[75] Robins J. 1986 A new approach to causal inference in mortality studies with sustained
exposure results. Mathematical Modelling, 7, 1393-1512.
[76] Robinson R.W. 1977 Counting unlabelled acyclic digraphs. In: Dold A. & Eckmann B. (eds), Lecture Notes in Mathematics 622: Combinatorial Mathematics V. Berlin: Springer-Verlag.
[77] Rubin D.B. 1974 Estimating Causal Effects of Treatments in Randomized and Nonrand-
omized Studies. Journal of Educational Psychology, 66, 688-701.
[78] Rubin D.B. 1978 Bayesian inference for causal effects: The role of randomisation.
Annals of Statistics, 6, 34-58.
[79] Rubin D.B. 1991 Practical Implications of Modes of Statistical Inference for Causal
Effects and the Critical Role of the Assignment Mechanism. Biometrics, 47, 1213-1234.
[80] Russell S. and Norvig P. 1995. Artificial Intelligence: A Modern Approach. New Jersey:
Prentice-Hall.
[81] Russell S., Binder J., Koller D. & Kanazawa K. 1995 Local learning in probabilistic net-
works with hidden variables. In: Proceedings of the Fourteenth International Joint Con-
ference on Artificial Intelligence, San Francisco: Morgan Kaufmann, 1146-1152.
[82] Schwarz G. 1978. Estimating the Dimensions of a Model. Annals of Statistics, 6, 461-
464.
[83] Shachter R.D. 1986. Evaluating Influence Diagrams. Operations Research, 34, 871-882.
[84] Shachter R.D. 1990 Evidence absorption and propagation through evidence reversals. In: Henrion et al. (eds), Uncertainty in Artificial Intelligence 5, 173-190.
[85] Singh M. and Provan G. 1995. Efficient learning of selective Bayesian network classifi-
ers. Technical report MS-CS-95-36, Computer and Information Science Department,
University of Pennsylvania, Philadelphia, PA.
[86] Smith A.F.M. and Makov U.E. 1978 A quasi-Bayes sequential procedure for mixtures. J.
R. Statist. Soc. Ser. B, 40, 106-111.
[87] Smyth P., Heckerman D. & Jordan M.I. 1996 Probabilistic Independence Networks for
Hidden Markov Probability Models. Microsoft Research Technical Report MSR-TR-96-
03.
[88] Spiegelhalter D.J. & Lauritzen S.L. 1990 Sequential Updating of Conditional Probabili-
ties on Directed Graphical Structures. Networks, 20, 579-605.
[89] Spiegelhalter D.J. & Cowell R. 1992 Learning in Probabilistic Expert Systems. In Bayesian Statistics 4 (Bernardo J.M., Berger J.O., Dawid A.P. and Smith A.F.M., eds), Oxford
University Press, 447-465.
[90] Spiegelhalter D.J., Harris N., Bull K. & Franklin R. 1991 Empirical evaluation of prior
beliefs about frequencies: methodology and a case study in congenital heart disease.
BAIES Report BR-24, MRC Biostatistics unit, Cambridge, England.
[91] Spiegelhalter D.J., Dawid A.P., Lauritzen S.L. & Cowell R.G. 1993 Bayesian Analysis in
Expert Systems (with discussion). Statistical Science, 8, 219-283.
[92] Spirtes P. & Meek C. 1995 Learning Bayesian networks with discrete variables from
data. In: Proc. First International Conference on Knowledge Discovery and Data Mining,
Montreal QU, Morgan Kaufmann.
[93] Spirtes P., Glymour, C. & Scheines R. 1993 Causation, Prediction and Search. New
York: Springer-Verlag.
[94] Srinivas S. 1993. A generalisation of the noisy-or model. In Uncertainty in Artificial Intelligence: Proceedings of the Ninth Conference, San Francisco: Morgan Kaufmann, 208-215.
[95] Taylor S.J. 1966 Introduction to Measure and Integration. Cambridge University Press.
[96] Thomas A., Spiegelhalter D.J. & Gilks W.R. 1992. BUGS: A program to perform Bayesian inference using Gibbs sampling. In Bayesian Statistics 4 (Bernardo J.M., Berger J.O., Dawid A.P. and Smith A.F.M., eds), Oxford University Press, 837-842.
[97] Titterington D.M. 1976 Updating a diagnostic system using unconfirmed cases. Applied
Statistics, 25, 238-247.
[98] Titterington D.M., Smith A.F.M. and Makov U.E. 1985. Statistical Analysis of Finite
Mixture Distributions. Chichester: John Wiley.
[99] Verma T. & Pearl J. 1990. Equivalence and Synthesis of Causal Models. In Proc. Sixth
Conference on Uncertainty in Artificial Intelligence, San Francisco: Morgan Kaufmann,
220-227.
[100] Wallace C.S. & Korb K. 1996. Learning a Linear Causal Model by MML. Proc. UNI-
COM Seminar on Intelligent Data Management, Chelsea Village, London, UK.
[101] Wermuth N. & Lauritzen S.L. 1990 On Substantive Research Hypotheses, Conditional
Independence Graphs and Graphical Chain Models. J. R. Stat. Soc. B, 52, 21-50.
[102] Wilks S.S. 1963. Mathematical Statistics. New York: John Wiley.
[103] Winkler R. 1967 The assessment of prior distributions in Bayesian analysis. Journal of the American Statistical Association, 62, 776-800.
[104] Whittaker J. 1990 Graphical Models in Applied Multivariate Statistics. Chichester: John Wiley.