Using the ggmcmc package

There is a pdf version of this document.

Follow the development of ggmcmc at github.

Purpose

This documents presents the usage of ggmcmc, an R package with graphical tools for analyzing Markov Chain Monte Carlo simulations from Bayesian inference.

First, a fake dataset is produced. Then, JAGS is used to get samples from the posterior distribution and finally ggmcmc functions are used to generate the graphics needed for visual diagnostics of convergence of MCMC chains.

The current version (0.2) has minimal capabilities and will be extended in the future.

Why ggmcmc?

ggplot (based on the grammar of graphics) empowers R users by allowing them to flexibly crate graphics. Based on this idea, ggmcmc is aimed at bringing the ideas and implementation of ggmcmc to MCMC diagnostics, allowing mainly Bayesian inference to have better and more flexible visual diagnostic tools.

Produce a fake dataset and sample from its posterior distribution

The code produces a minimal fake datased for a simple linear regression with an intercept (beta[1]) and a slope (beta[2]sigma) is also simulated.

require(rjags, quietly = TRUE)
require(ggmcmc)
set.seed(14718)
Define the true parameters of the model:
beta.0 <- 3
beta.1 <- 5
sigma <- 2
Prepare the fake dataset:
N <- 100
x <- rnorm(N, 0, 3)
y <- beta.0 + beta.1 * x + rnorm(N, mean = 0, sd = sigma)
X <- matrix(x, ncol = 1)
Define the model in JAGS by writing it in a file using R and then invoking it from the same file when sampling:
write("
  model {
    for (i in 1:N) {
      y[i] ~ dnorm(mu[i], tau)
      mu[i] <- beta[1] + beta[2] * X[i,1]
    }
    tau <- pow(sigma, -2)
    sigma ~ dunif(0, 100)
    beta[1] ~ dnorm(0, 0.001)
    beta[2] ~ dnorm(0, 0.001)
  }", "m-simple_linear_regression.bug")
Parameters to pass to JAGS, such as data, number of simulations, nodes to monitor, and initial values.
n.simu <- 2000
n.burnin <- n.simu/2
inits <- c(1, 5, -5)  # tau (1) and betas (2)
D <- list(y = y, X = X, N = N)
par <- c("sigma", "beta")
Sample from the posterior distribution using JAGS.
m.jags <- jags.model("m-simple_linear_regression.bug", data = D, n.adapt = n.burnin, 
    quiet = TRUE, n.chains = 2)
S <- coda.samples(m.jags, par, n.iter = n.simu - n.burnin, quiet = TRUE)

Convert from JAGS output (coda) to ggmcmc format

Now the ``S'' object (which is a coda object) must be converted into a data frame with certain attributes (number of chains, parameters, iterations, and whether parallel computing can be used) that the graphical ggmcmc functions will understand. The ggs() function does that.

D <- ggs(S, parallel = FALSE)
The resulting object is a data frame with four variables, namely: \begin{description} \item[Iteration] Number of iteration \item[Parameter] Name of the parameter \item[value] value sampled \item[Chain] Number of the chain \end{description}
head(D)
##   Iteration Parameter value Chain
## 1         1   beta[1] 2.852     1
## 2         2   beta[1] 2.883     1
## 3         3   beta[1] 3.139     1
## 4         4   beta[1] 2.888     1
## 5         5   beta[1] 2.678     1
## 6         6   beta[1] 2.866     1
In addition to that, the object also has several attributes that are used by the plotting functions.
str(D)
## 'data.frame':	6000 obs. of  4 variables:
##  $ Iteration: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Parameter: Factor w/ 3 levels "beta[1]","beta[2]",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ value    : num  2.85 2.88 3.14 2.89 2.68 ...
##  $ Chain    : int  1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "nChains")= int 2
##  - attr(*, "nParameters")= int 3
##  - attr(*, "nIterations")= int 1000
##  - attr(*, "parallel")= logi FALSE

By default ggs() tries to use parallel computing (which is a reasonable expectation for any Bayesian practitioner to have a multicore machine in 2012) .

It this case, a parallel backend is expected to be registered.
library(doMC, quietly = TRUE)
registerDoMC()
D <- ggs(S)

What slows down ggmcmc is the current plotting, not the computation of several intermediate quantities needed for the plotting, so in this case parallel computing does not make substantial differences (only a 5\% increase in speed). But with larger parameters to analyze more chains and/or longer iterations it can make a difference.

Using ggmcmc()

ggmcmc() is a wrapper to several plotting functions that allows to create very easily a report of the diagnostics in a single pdf file. This output can be used to inspect the results more comfortable than using the plots that appear in the screen.

ggmcmc(D)
By default ggmcmc() plots 5 parameters in each page, and the dump file is called ``ggmcmc-output.pdf'', but we can change them.
ggmcmc(D, file = "model_simple-diag.pdf", param.page = 1)

Using individual functions

We can also use the functions interactively to get the plots in the screen. The parameter to be plotted is the data frame that results from the application of the ggs() function to the original coda object that JAGS produces.

Notice that some of the functions differentiate the chains, but a few of them mix the chains to produce a single plot (ggs_histogram() and ggs_caterpillar().

Histograms

Produces histograms with the posterior distribution of the parameters. Combines the values of all the chains.
ggs_histogram(D)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust
## this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust
## this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust
## this.

Density plots

Produces density plots of the posterior distribution of the parameters. Allows to compare easily between chains.
ggs_density(D)

Traceplots

Produces a time series of the parameter.
ggs_traceplot(D)

Running means

Produces a time series of the running mean of the chain, and also shows a horizontal line with the mean of the chain.
ggs_running(D)

Comparison of the whole chain with the latest part

Produces density plots of the whole chain (black) with the density plot of only the latest part of the chain (green). By default, it uses the latest 10\% of the chain as a comparison.
ggs_compare_partial(D)

Autocorrelation plots

Produces autocorrelation plots of each parameter, bounded between -1 and 1.
ggs_autocorrelation(D)

Crosscorrelation plot

Produces a crosscorrelation plot with the correlations between all the parameters.
ggs_crosscorrelation(D)

Caterpillar plots

Produces caterpillar plots of the highest posterior densities of the parameters. By default the thick line shows the 50\% HPD whereas the thin line shows the 95\% HPD.
ggs_caterpillar(D)

Future releases of ggmcmc will include the possibility to plot only the parameters belonging to a family of parameters (as in beta[1], beta[2], ...).