Chapter 2 - Static Bayesian Networks:
Bayesian Networks in the Absence of
Temporal Information
Static Bayesian networks, or simply Bayesian networks, are graphical models that provide a concise representation of the probabilistic dependencies between a given set of random variables X = {X1, X2, ..., Xp} as a DAG G = (V, A) in which each node corresponds to a random variable Xi.
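The DAG encodes a factorization of the global distribution of X into local distributions, one for each variable: writing PI(Xi) for the set of parents of Xi in G,
P(X1, X2, ..., Xp) = P(X1 | PI(X1)) P(X2 | PI(X2)) ... P(Xp | PI(Xp)).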
2.1.1 Graph structure
Independence map (I-map): G is an independence map of the probabilistic dependence structure P of X if there is a one-to-one correspondence between the random variables in X and the nodes V of G, such that every graphical separation in G implies the corresponding conditional independence in P.
2.1.1 Fundamental connections
2.1.2 D-separation
V-structure (converging connection, A -> C <- B): A and B are not independent given C, even if they are marginally independent.
2.1.3 Equivalent structures
Serial (A -> C -> B) and diverging (A <- C -> B) connections: A and B are independent given C.
2.1.3 Continued. Serial and diverging connections result in equivalent factorizations, that is, each can be obtained from the other by repeated application of Bayes' theorem.
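For three variables, for instance, the serial connection A -> C -> B factorizes as P(A) P(C | A) P(B | C) and the diverging connection A <- C -> B as P(C) P(A | C) P(B | C); the two coincide because P(A) P(C | A) = P(A, C) = P(C) P(A | C) by Bayes' theorem.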
2.1.4 Markov blankets
The Markov blanket of a node consists of its parents, its children, and its children's other parents; conditional on its Markov blanket, the node is independent of all the other nodes in the network. Markov blankets facilitate the comparison of Bayesian networks with graphical models based on undirected graphs, which are known as Markov networks or Markov random fields.
2.2 Static Bayesian Network Modeling
1st step: Structure learning - identifying the structure of the Bayesian network.
Constraint based
Score based
Hybrid
2nd step: Parameter learning - estimating the parameters of the global distribution. This can be performed efficiently by estimating the parameters of the local distributions implied by the structure obtained in the previous step.
2.2.1 Step 1: Constraint-based structure learning
1. Identify which pairs of variables are connected by an arc, regardless of its direction. These variables cannot be independent given any other subset of variables, because they cannot be d-separated. This step can also be seen as a backward selection procedure starting from the saturated model with a complete graph and pruning it based on statistical tests for conditional independence.
2. Identify the v-structures among all pairs of nonadjacent nodes A and B with a common neighbor C. By definition, v-structures are the only fundamental connection in which the two nonadjacent nodes are not independent conditional on the third one.
3. Identify compelled arcs and orient them recursively to obtain the completed partially directed acyclic graph (CPDAG) describing the equivalence class identified by the previous steps.
A naive implementation of this approach cannot be applied to real-world problems, due to the exponential number of possible conditional independence relationships.
All these algorithms, with the exception of PC, first learn the Markov blanket of each node in the network. This preliminary step greatly simplifies the identification of the neighbors of each node, as the search can be limited to its Markov blanket.
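As a minimal sketch of constraint-based learning with bnlearn (the choice of the package's marks data set here is ours, for illustration):

    library(bnlearn)
    data(marks)
    # Grow-Shrink: learns the Markov blanket of each node first, then
    # uses it to restrict the candidate neighbors tested for each node
    dag.gs <- gs(marks)
    # Incremental Association (IAMB), another Markov-blanket-based algorithm
    dag.iamb <- iamb(marks)

Both return the equivalence class identified by the conditional independence tests, so some arcs may remain undirected.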
2.2.2 Step 1: Score-based structure learning
Score-based structure learning algorithms (also known as search-and-score algorithms) represent the application of general heuristic optimization techniques to the problem of learning the structure of a Bayesian network.
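A minimal sketch with bnlearn, again assuming the marks data set; hc() and tabu() are the package's hill-climbing and tabu search implementations:

    library(bnlearn)
    data(marks)
    # greedy hill-climbing on the space of DAGs, maximizing a network score
    dag.hc <- hc(marks, score = "bic-g")     # Gaussian BIC
    # tabu search keeps a list of recently visited structures to help
    # escape local optima
    dag.tabu <- tabu(marks, score = "bic-g")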
2.2.3 Step 1: Hybrid structure learning
Hybrid algorithms combine constraint-based and score-based approaches to offset their respective weaknesses and produce reliable network structures in a wide variety of situations. They are implemented in two steps, called restrict and maximize (see the sketch below):
The restrict step limits the set of candidate parents of each node
The maximize step explores the resulting smaller search space to optimize a given score function
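For example, bnlearn implements this template in rsmax2(), with mmhc() as the special case using Max-Min Parents and Children for the restrict step and hill-climbing for the maximize step; a sketch on the marks data:

    library(bnlearn)
    data(marks)
    # Max-Min Hill-Climbing, a common hybrid algorithm
    dag.mmhc <- mmhc(marks)
    # the general restrict-maximize template with user-chosen components
    dag.rs <- rsmax2(marks, restrict = "si.hiton.pc", maximize = "tabu")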
2.2.4 Choosing Distributions, Conditional Independence Tests, and Network Scores
The choice will ultimately depend on the data. The choice of a particular set of global and local distributions also determines which conditional independence tests and which network scores can be used to learn the structure:
Mutual information - a conditional independence test based on an information-theoretic distance measure, typically used for discrete data
Pearson's X² - the classic conditional independence test for discrete data
Fisher's Z test - a conditional independence test for Gaussian data, based on partial correlations
Bayesian Dirichlet equivalent (BDe) score - the posterior density associated with a uniform prior over both the space of the network structures and the parameters of each local distribution
Bayesian information criterion (BIC) - a penalized likelihood score, with the penalty determined by the number of parameters of the global distribution
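In bnlearn these choices map to the test and score arguments; a sketch, with the package's discrete learning.test and Gaussian marks data sets as stand-ins:

    library(bnlearn)
    data(learning.test)   # discrete data
    data(marks)           # continuous (Gaussian) data
    # conditional independence tests for constraint-based learning
    gs(learning.test, test = "mi")   # mutual information
    gs(learning.test, test = "x2")   # Pearson's X^2
    gs(marks, test = "zf")           # Fisher's Z
    # network scores for score-based learning
    hc(learning.test, score = "bde", iss = 10)   # BDe, imaginary sample size 10
    hc(marks, score = "bic-g")                   # Gaussian BIC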
2.2.5 Step 2: Parameter learning
The number of parameters needed to uniquely identify the global distribution, which is the sum of the numbers of parameters of the local distributions, is also reduced, because the conditional independence relationships encoded in the network structure fix large parts of the parameter space. Estimation can still be problematic for some applications, such as microarrays, which are typical of high-throughput biological data: tens or hundreds of observations versus thousands of genes.
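In bnlearn this step is performed by bn.fit() on a previously learned structure; a sketch with the discrete learning.test data:

    library(bnlearn)
    data(learning.test)
    dag <- hc(learning.test)
    # estimate each local distribution implied by the structure,
    # here by maximum likelihood
    fitted <- bn.fit(dag, data = learning.test, method = "mle")
    fitted$A   # the conditional probability table of node A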
2.2.6 Discretization
Discretization (or binning) converts all continuous variables into discrete ones, so that the techniques described in the previous sections can then be applied. It sidesteps the problem of defining a suitable probabilistic model for the data. The discretization itself can be performed by:
Using prior knowledge of the data
Using heuristics
Choosing the number of intervals directly
Performing learning and discretization iteratively until no further change occurs
Discretization can also be applied to continuous data when one or more variables present severe departures from normality (skewness, heavy tails, etc.).
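bnlearn's discretize() implements quantile, interval, and Hartemink discretization; a sketch on the marks data:

    library(bnlearn)
    data(marks)
    # quantile discretization into 3 intervals per variable
    dmarks <- discretize(marks, method = "quantile", breaks = 3)
    # Hartemink's method: start from a finer discretization (ibreaks
    # intervals, created with the idisc method) and collapse intervals
    # while losing as little mutual information as possible
    dmarks2 <- discretize(marks, method = "hartemink", breaks = 3,
                          ibreaks = 20, idisc = "quantile")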
2.3 Static Bayesian Network Modeling in R
2.3 bnlearn
Create an empty graph with the nodes
corresponding to the variables
2.3 bnlearn
Add the arcs by assigning a two-column matrix
containing the labels of their end-nodes
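A sketch of these two calls on the marks variables; the edge set below is illustrative:

    library(bnlearn)
    data(marks)
    # empty graph whose nodes correspond to the variables
    ug <- empty.graph(names(marks))
    # an undirected edge is stored as a pair of arcs in both directions;
    # assign them as a two-column matrix of end-node labels
    arcs(ug) <- matrix(
      c("MECH", "VECT",  "VECT", "MECH",
        "MECH", "ALG",   "ALG",  "MECH",
        "VECT", "ALG",   "ALG",  "VECT",
        "ANL",  "ALG",   "ALG",  "ANL",
        "STAT", "ALG",   "ALG",  "STAT",
        "STAT", "ANL",   "ANL",  "STAT"),
      ncol = 2, byrow = TRUE,
      dimnames = list(NULL, c("from", "to")))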
2.3 bnlearn
The resulting ug object belongs to the class bn and contains:
learning: a list containing some information about the results of the structure learning algorithm and its tuning parameters, including the conditional independence tests and network scores used in the analysis. It is never changed after the object is created.
nodes: a list containing one element per node. Each element contains the
Markov blanket, the neighborhood, the parents, and the children of that
particular node.
arcs: the arcs present in the network, in the same two-column format used in
the call to arcs above.
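Each component has a corresponding accessor function; for instance, for the ALG node of the ug object above:

    mb(ug, "ALG")        # Markov blanket
    nbr(ug, "ALG")       # neighborhood
    parents(ug, "ALG")   # parents
    children(ug, "ALG")  # children
    arcs(ug)             # the two-column arc matrix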
2.5.1 Model Averaging using protein signaling study
The data consist of simultaneous measurements of 11 phosphorylated proteins and phospholipids, derived from thousands of individual primary immune system cells subjected to both general and specific molecular interventions.
Structure learning was repeated several times to reduce the impact of locally optimal (but globally suboptimal) networks on learning and subsequent inference.
The networks learned in this way were then averaged to produce a more robust model (model averaging). The averaged network structure was created using the arcs present in at least 85% of the networks. This proportion measures the strength of each arc and provides the means to establish its significance given a threshold (85% in this case).
The validity of the averaged network was evaluated against the literature.
2.5.1 Model Averaging using protein signaling study
Choosing good values for the discretization arguments is a trade-off between quality and speed: high values of ibreaks preserve the characteristics of the original data to a greater extent, whereas smaller values result in much smaller memory usage and shorter running times.
2.5.1 Model Averaging using protein signaling study
Bootstrap resampling is used to learn a set of 500 network structures. Arcs are considered significant if they appear in at least 85% of the networks, and in the direction that appears most frequently.
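A sketch of this step with bnlearn's boot.strength(), assuming the discretized data is in a data frame called dsachs (a hypothetical name):

    library(bnlearn)
    # learn 500 networks from bootstrap samples with hill-climbing
    boot <- boot.strength(data = dsachs, R = 500, algorithm = "hc",
                          algorithm.args = list(score = "bde", iss = 10))
    # arcs present in at least 85% of the networks, in their most
    # frequent direction
    boot[boot$strength > 0.85 & boot$direction >= 0.5, ]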
2.5.1 Model Averaging using protein signaling study
The averaged network structure is built from the significant arcs. Alternatively, we can learn the networks with hill-climbing started from several different initial networks, instead of resampling the data.
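Continuing the sketch above (dsachs again hypothetical), averaged.network() builds the averaged structure, and custom.strength() covers the alternative of averaging networks learned from different random starting points:

    # averaged network from the arcs meeting the 85% threshold
    avg.boot <- averaged.network(boot, threshold = 0.85)
    # alternative: hill-climbing from 500 random starting networks
    start <- random.graph(nodes = names(dsachs), method = "ic-dag", num = 500)
    netlist <- lapply(start, function(net)
        hc(dsachs, score = "bde", iss = 10, start = net))
    rnd <- custom.strength(netlist, nodes = names(dsachs))
    avg.start <- averaged.network(rnd, threshold = 0.85)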
2.5.2 Choosing a significance threshold
Any threshold that falls between two consecutive observed arc strength values results in the same averaged network.
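In bnlearn, calling averaged.network() without an explicit threshold uses one estimated from the distribution of the arc strengths (stored as the threshold attribute of the boot.strength() output):

    avg.default <- averaged.network(boot)   # data-driven threshold
    attr(boot, "threshold")                 # the estimated value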
Exercises
2.1. Consider the asia synthetic data set from Lauritzen and Spiegelhalter (1988),
which describes the diagnosis of a patient at a chest clinic who has just come back
from a trip to Asia and is showing dyspnea.
(a) Load the data set from the bnlearn package and investigate its characteristics
using the exploratory analysis techniques covered in Chap. 1.
(b) Create a bn object with the network structure described in the manual page of asia.
(c) Derive the skeleton, the moral graph, and the CPDAG representing the equivalence
class of the network. Plot them using graphviz.plot.
(d) Identify the parents, the children, the neighbors, and the Markov blanket of each
node.
Exercises
2.2. Using the network structures created in Exercise 2.1 for the asia data set, produce the
following plots with graphviz.plot:
(a) A plot of the CPDAG of the equivalence class in which the arcs belonging to a
v-structure are highlighted (either with a different color or using a thicker line width).
(b) Fill the nodes with different colors according to their role in the diagnostic process:
causes (“visit to Asia” and “smoking”), effects (“Tuberculosis,” “lung cancer,”and “bronchitis”)
and the diagnosis proper (“chest X-ray,” “dyspnea,” and “either tuberculosis or lung
cancer/bronchitis”).
(c) Explore different layouts by changing the layout and shape arguments.
Exercises
2.3. Consider the marks data set analyzed in Sect. 2.3.
(a) Discretize the data using a quantile transform and different numbers of intervals
(say, from 2 to 5). How does the network structure learned from the resulting data sets
change as the number of intervals increases?
(b) Repeat the discretization using interval discretization with up to five intervals, and compare the resulting networks with those obtained previously with quantile discretization.
(c) Does Hartemink’s discretization algorithm perform better than either quantile or
interval discretization? How does its behavior depend on the number of initial
breaks?
Exercises
2.4. The ALARM network (Beinlich et al., 1989) is a Bayesian network designed to provide
an alarm message system for patients hospitalized in intensive care units (ICU). Since
ALARM is commonly used as a benchmark in the literature, a synthetic data set of 5,000
observations generated from this network is available from bnlearn as alarm.
(a) Create a bn object for the “true” structure of the network using the model string provided
in its manual page.
(b) Compare the networks learned with different constraint-based algorithms with the true
one, both in terms of structural differences and using either BIC or BDe.
(c) The overall performance of constraint-based algorithms suggests that the asymptotic
Χ² conditional independence tests may not be appropriate for analyzing alarm. Are
permutation or shrinkage tests better choices?
(d) How are the above learning strategies affected by changes to alpha?
Exercises
2.5. Consider again the alarm network used in Exercise 2.4.
(a) Learn its structure with hill-climbing and tabu search, using the posterior density BDe as
a score function. How does the network structure change with the
imaginary sample size iss?
(b) Does the length of the tabu list have a significant impact on the network structures
learned with tabu?
(c) How does the BIC score compare with BDe at different sample sizes in terms of structure
and score of the learned network?
Exercises
2.6. Consider the observational data set from Sachs et al. (2005) used in Sect. 2.5.1
(the original data set, not the discretized one).
(a) Evaluate the networks learned by hill-climbing with BIC and BGe using cross-validation and the log-likelihood loss function.
(b) Use bootstrap resampling to evaluate the distribution of the number of arcs present in
each of the networks learned in the previous point. Do they differ significantly?
(c) Compute the averaged network structure for sachs using hill-climbing with BGe and
different imaginary sample sizes. How does the value of the significance threshold change
as iss increases?