# Dynamic Bayesian Networks

## Exercises

These are answers and solutions to the exercises at the end of chapter 3 in Bayesian Networks in R with Applications in Systems Biology by by Radhakrishnan Nagarajan, Marco Scutari & Sophie Lèbre. Much of my inspiration for these solutions, where necessary, by consulting the solutions provided by the authors themselves as in the appendix.

### R Environment

For today’s exercise, I load the following packages:

library(vars)
library(lars)
library(GeneNet)
library(G1DBN) # might have to run remotes::install_version("G1DBN", "3.1.1") first


### Nagarajan 3.1

Consider the Canada data set from the vars package, which we analyzed in Sect. 3.5.1.

data(Canada)


#### Part A

Load the data set from the vars package and investigate its properties using the exploratory analysis techniques covered in Chap. 1.

str(Canada)

##  Time-Series [1:84, 1:4] from 1980 to 2001: 930 930 930 931 933 ...
##  - attr(*, "dimnames")=List of 2
##   ..$: NULL ## ..$ : chr [1:4] "e" "prod" "rw" "U"

summary(Canada)

##        e              prod             rw              U
##  Min.   :928.6   Min.   :401.3   Min.   :386.1   Min.   : 6.700
##  1st Qu.:935.4   1st Qu.:404.8   1st Qu.:423.9   1st Qu.: 7.782
##  Median :946.0   Median :406.5   Median :444.4   Median : 9.450
##  Mean   :944.3   Mean   :407.8   Mean   :440.8   Mean   : 9.321
##  3rd Qu.:950.0   3rd Qu.:410.7   3rd Qu.:461.1   3rd Qu.:10.607
##  Max.   :961.8   Max.   :418.0   Max.   :470.0   Max.   :12.770


#### Part B

Estimate a VAR(1) process for this data set.

(var1 <- VAR(Canada, p = 1, type = "const"))

##
## VAR Estimation Results:
## =======================
##
## Estimated coefficients for equation e:
## ======================================
## Call:
## e = e.l1 + prod.l1 + rw.l1 + U.l1 + const
##
##          e.l1       prod.l1         rw.l1          U.l1         const
##    1.17353629    0.14479389   -0.07904568    0.52438144 -192.56360758
##
##
## Estimated coefficients for equation prod:
## =========================================
## Call:
## prod = e.l1 + prod.l1 + rw.l1 + U.l1 + const
##
##         e.l1      prod.l1        rw.l1         U.l1        const
##   0.08709510   1.01970070  -0.02629309   0.32299246 -81.55109611
##
##
## Estimated coefficients for equation rw:
## =======================================
## Call:
## rw = e.l1 + prod.l1 + rw.l1 + U.l1 + const
##
##        e.l1     prod.l1       rw.l1        U.l1       const
##  0.06381103 -0.13551199  0.96872851 -0.19538479 11.61375726
##
##
## Estimated coefficients for equation U:
## ======================================
## Call:
## U = e.l1 + prod.l1 + rw.l1 + U.l1 + const
##
##         e.l1      prod.l1        rw.l1         U.l1        const
##  -0.19293575  -0.08086896   0.07538624   0.47530976 186.80892410


#### Part C

Build the auto-regressive matrix $A$ and the constant matrix $B$ defining the VAR(1) model.

## base object creation
base_mat <- matrix(0, 4, 5)
colnames(base_mat) <- c("e", "prod", "rw", "U", "constant")
p <- 0.05
## object filling
pos <- which(coef(var1)$e[, "Pr(>|t|)"] < p) base_mat[1, pos] <- coef(var1)$e[pos, "Estimate"]
pos <- which(coef(var1)$prod[, "Pr(>|t|)"] < p) base_mat[2, pos] <- coef(var1)$prod[pos, "Estimate"]
pos <- which(coef(var1)$rw[, "Pr(>|t|)"] < p) base_mat[3, pos] <- coef(var1)$rw[pos, "Estimate"]
pos <- which(coef(var1)$U[, "Pr(>|t|)"] < p) base_mat[4, pos] <- coef(var1)$U[pos, "Estimate"]
## final objects
(A <- base_mat[, 1:4])

##               e        prod          rw         U
## [1,]  1.1735363  0.14479389 -0.07904568 0.5243814
## [2,]  0.0000000  1.01970070  0.00000000 0.0000000
## [3,]  0.0000000 -0.13551199  0.96872851 0.0000000
## [4,] -0.1929358 -0.08086896  0.07538624 0.4753098

(B <- base_mat[, 5])

##  -192.5636    0.0000    0.0000  186.8089


#### Part D

Compare the results with the LASSO matrix when estimating the L1-penalty with cross-validation.

## data preparation
## Lasso
y <- Canada[-1, gene] # remove first row of data, and select only target gene
lars(y = y, x = data_df, type = "lasso") # LASSO matrix
})
## Cross-validation
y <- Canada[-1, gene] # remove first row of data, and select only target gene
lasso.cv <- cv.lars(y = y, x = data_df, mode = "fraction")
frac <- lasso.cv$index[which.min(lasso.cv$cv)]
predict(Lasso_ls[[gene]], s = frac, type = "coef", mode = "fraction")
})
## output
rbind(
CV_ls[]$coefficients, CV_ls[]$coefficients,
CV_ls[]$coefficients, CV_ls[]$coefficients
)

##                e        prod           rw          U
## [1,]  1.17353629  0.14479389 -0.079045685  0.5243814
## [2,]  0.02570001  1.02314558 -0.004878295  0.1994059
## [3,]  0.09749788 -0.11991692  0.954389035 -0.1023845
## [4,] -0.17604953 -0.08192783  0.069502065  0.5086115


And for comparison the previously identified $A$:

A

##               e        prod          rw         U
## [1,]  1.1735363  0.14479389 -0.07904568 0.5243814
## [2,]  0.0000000  1.01970070  0.00000000 0.0000000
## [3,]  0.0000000 -0.13551199  0.96872851 0.0000000
## [4,] -0.1929358 -0.08086896  0.07538624 0.4753098


#### Part E

What can you conclude?

The whole point of LASSO, as far as I understand it, is to shrink parameter estimates towards 0 often times reaching 0 exactly. In the above this has not happened for many parameters, but is the case with the estimation provided by vars. I assume this might be because there just aren’t enough variables and/or observations in time.

### Nagarajan 3.2

Consider the arth800 data set from the GeneNet package, which we analyzed in Sects. 3.5.2 and 3.5.3.

data(arth800)
data(arth800.expr)


#### Part A

Load the data set from the GeneNet package. The time series expression of the 800 genes is included in a data set called arth800.expr. Investigate its properties using the exploratory analysis techniques covered in Chap. 1.

str(arth800.expr)

##  'longitudinal' num [1:22, 1:800] 10.04 10.11 9.77 10.06 10.02 ...
##  - attr(*, "dimnames")=List of 2
##   ..$: chr [1:22] "0-1" "0-2" "1-1" "1-2" ... ## ..$ : chr [1:800] "AFFX-Athal-GAPDH_3_s_at" "AFFX-Athal-Actin_3_f_at" "267612_at" "267520_at" ...
##  - attr(*, "time")= num [1:11] 0 1 2 4 8 12 13 14 16 20 ...
##  - attr(*, "repeats")= num [1:11] 2 2 2 2 2 2 2 2 2 2 ...

summary(arth800.expr)

## Longitudinal data:
##  800 variables measured at 11 different time points
##  Total number of measurements per variable: 22
##  Repeated measurements: yes
##
##  To obtain the measurement design call 'get.time.repeats()'.


#### Part B

For this practical exercise, we will work on a subset of variables (one for each gene) having a large variance. Compute the variance of each of the 800 variables, plot the various variance values in decreasing order, and create a data set with the variables greater than 2.

## variance calculation
variance <- diag(var(arth800.expr))
## plotting
plot(sort(variance, decreasing = TRUE), type = "l", ylab = "Variance")
abline(h = 2, lty = 2) ## variables with variances greater than 2
dataVar2 <- arth800.expr[, which(variance > 2)]
dim(dataVar2)

##  22 49


#### Part C

Can you fit a VAR process with a usual approach from this data set?

I don’t think so. There are more variables (genes) than there are samples (time steps):

dim(dataVar2)

##  22 49


#### Part D

Which alternative approaches can be used to fit a VAR process from this data set?

The chapter discusses these alternatives:

• LASSO
• James-Stein Shrinkage
• Low-order conditional dependency approximation

#### Part E

Estimate a dynamic Bayesian network with each of the alternative approaches presented in this chapter.

First, I prepare the data by re-ordering them:

## make the data sequential for both repetitions
dataVar2seq <- dataVar2[c(seq(1, 22, by = 2), seq(2, 22, by = 2)), ]


LASSO with the lars package:

x <- dataVar2seq[-c(21:22), ] # remove final rows (end of sequences)
Lasso_ls <- lapply(colnames(dataVar2seq), function(gene) {
y <- dataVar2seq[-(1:2), gene]
lars(y = y, x = x, type = "lasso")
})
CV_ls <- lapply(1:ncol(dataVar2seq), function(gene) {
y <- dataVar2seq[-(1:2), gene]
lasso.cv <- cv.lars(y = y, x = x, mode = "fraction", plot.it = FALSE)
frac <- lasso.cv$index[which.min(lasso.cv$cv)]
predict(Lasso_ls[[gene]], s = frac, type = "coef", mode = "fraction")
})
Lasso_mat <- matrix(0, dim(dataVar2seq), dim(dataVar2seq))
for (i in 1:dim(Lasso_mat)) {
Lasso_mat[i, ] <- CV_ls[i][]$coefficients } sum(Lasso_mat != 0) # number of arcs  ##  456  plot(sort(abs(Lasso_mat), decr = TRUE)[1:500], type = "l", ylab = "Absolute coefficients") James-Stein shrinkage with the GeneNet package: DBNGeneNet <- ggm.estimate.pcor(dataVar2, method = "dynamic")  ## Estimating optimal shrinkage intensity lambda (correlation matrix): 0.0539  DBNGeneNet.edges <- network.test.edges(DBNGeneNet) # p-values, q-values and posterior probabilities for each potential arc  ## Estimate (local) false discovery rates (partial correlations): ## Step 1... determine cutoff point ## Step 2... estimate parameters of null distribution and eta0 ## Step 3... compute p-values and estimate empirical PDF/CDF ## Step 4... compute q-values and local fdr ## Step 5... prepare for plotting plot(DBNGeneNet.edges[, "prob"], type = "l") # arcs probability by decreasing order sum(DBNGeneNet.edges$prob > 0.95) # arcs with prob > 0.95

##  8


First-order conditional dependency with the G1DBN package:

G1DB_BN <- DBNScoreStep1(dataVar2seq, method = "ls")

## Treating 49 vertices:
## 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

G1DB_BN <- DBNScoreStep2(G1DB_BN\$S1ls, dataVar2seq, method = "ls", alpha1 = 0.5)
plot(sort(G1DB_BN, decreasing = TRUE), type = "l", ylab = "Arcs’ p-values") ### Nagarajan 3.3

Consider the dimension reduction approaches used in the previous exercise and the arth800 data set from the GeneNet package.

data(arth800)
data(arth800.expr)


#### Part A

For a comparative analysis of the different approaches, select the top 50 arcs for each approach (function BuildEdges from the G1DBN package can be used to that end).

LASSO

lasso_tresh <- mean(sort(abs(Lasso_mat), decreasing = TRUE)[50:51]) # Lasso_mat from exercise 3.2
lasso_50 <- BuildEdges(score = -abs(Lasso_mat), threshold = -lasso_tresh)


James-Stein shrinkage with the GeneNet package:

DBNGeneNet_50 <- cbind(DBNGeneNet.edges[1:50, "node1"], DBNGeneNet.edges[1:50, "node2"])


First-order conditional dependency with the G1DBN package:

G1DBN_tresh <- mean(sort(G1DB_BN)[50:51])
G1DBN.edges <- BuildEdges(score = G1DB_BN, threshold = G1DBN_tresh, prec = 3)


#### Part B

Plot the four inferred networks with the function plot from package G1DBN.

Four inferred networks? I assume the exercise so far wanted me to also analyse the data using the LASSO approach with the SIMoNe (simone) package. I will skip over that one and continue with the three I have:

par(mfrow = c(1, 3))

## LASSO
LASSO_plot <- graph.edgelist(cbind(lasso_50[, 1], lasso_50[, 2]))
Lasso_layout <- layout.fruchterman.reingold(LASSO_plot)
plot(LASSO_plot,
layout = Lasso_layout,
edge.arrow.size = 0.5, vertex.size = 10,
main = "LASSO"
)

## James-Stein
DBN_plot <- graph.edgelist(DBNGeneNet_50)
# DBN_layout <- layout.fruchterman.reingold(DBN_plot)
plot(DBN_plot,
layout = Lasso_layout,
edge.arrow.size = 0.5, vertex.size = 10,
main = "GeneNet"
)

## First-order conditional
G1DBN_plot <- graph.edgelist(cbind(G1DBN.edges[, 1], G1DBN.edges[, 2]))
# G1DBN_layout = layout.fruchterman.reingold(G1DBN_plot)
plot(G1DBN_plot,
layout = Lasso_layout,
edge.arrow.size = 0.5, vertex.size = 10,
main = "G1DBN"
) #### Part C

How many arcs are common to the four inferred networks?

## extract edges
LASSO_el <- as_edgelist(LASSO_plot)
DBN_el <- as_edgelist(DBN_plot)
G1DBN_el <- as_edgelist(G1DBN_plot)

## number of repeated edges in pairwise comparisons
sum(duplicated(rbind(LASSO_el, DBN_el)))

##  0

sum(duplicated(rbind(LASSO_el, G1DBN_el)))

##  6

sum(duplicated(rbind(DBN_el, G1DBN_el)))

##  1

### all at once
sum(duplicated(rbind(LASSO_el, DBN_el, G1DBN_el)))

##  7


#### Part D

Are the top 50 arcs of each inferred network similar? What can you conclude?

No, they are not. I can conclude that different dimension reductions produce different DAG structures.

## Session Info

sessionInfo()

## R version 4.2.1 (2022-06-23 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
##
## Matrix products: default
##
## locale:
##  LC_COLLATE=English_Germany.utf8  LC_CTYPE=English_Germany.utf8    LC_MONETARY=English_Germany.utf8 LC_NUMERIC=C                     LC_TIME=English_Germany.utf8
##
## attached base packages:
##  stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
##   G1DBN_3.1.1         igraph_1.3.4        GeneNet_1.2.16      fdrtool_1.2.17      longitudinal_1.1.13 corpcor_1.6.10      lars_1.3            vars_1.5-6          lmtest_0.9-40
##  urca_1.3-3          strucchange_1.5-3   sandwich_3.0-2      zoo_1.8-10          MASS_7.3-58.1
##
## loaded via a namespace (and not attached):
##   highr_0.9         bslib_0.4.0       compiler_4.2.1    jquerylib_0.1.4   R.methodsS3_1.8.2 R.utils_2.12.0    tools_4.2.1       digest_0.6.29     jsonlite_1.8.0    evaluate_0.16
##  nlme_3.1-159      R.cache_0.16.0    lattice_0.20-45   pkgconfig_2.0.3   rlang_1.0.5       cli_3.3.0         rstudioapi_0.14   yaml_2.3.5        blogdown_1.13     xfun_0.33
##  fastmap_1.1.0     styler_1.8.0      stringr_1.4.1     knitr_1.40        vctrs_0.4.1       sass_0.4.2        grid_4.2.1        R6_2.5.1          rmarkdown_2.16    bookdown_0.29
##  purrr_0.3.4       magrittr_2.0.3    htmltools_0.5.3   stringi_1.7.8     cachem_1.0.6      R.oo_1.25.0

Previous