JMM’s notes on

Stan

(and rstan)

Stan is a probabilistic programming language for Bayesian inference using Markov chain Monte Carlo sampling. It’s what I use for Bayesian inference for my PhD research.

Back in undergrad, I used JAGS, but Stan uses Hamiltonian Monte Carlo which can allow it to more efficiently sample complicated distributions.

Tips

Startup occasionally fails on some chains

TODO: Need to look up the error message on this again

We had a problem where some chains would fail to start on one of our models that used a softmax noise parameter σ.

The problem (with our model, not Stan) turned out to be that random initialization would cause some values of σ to be really low. This low noise would create a really steep logistic function, basically a step function. This step function would be really hard to sample in, since the gradient was essentially zero, meaning the chain couldn’t find which direction to move.

The solution was to start initialization at high noise values, that way there’d always be a gradient. This doesn’t prevent σ from contracting to steeper values, it just prevents it from starting with a steep value in a location with little probability density.

Issues

Generated quantities with transformed parameters

rstan has an issue (see Issue #714) where gqs() (which samples generated quantities) doesn’t work with transformed parameters. I submitted a patch back in June 2021 that can fix this issue, but I may need to ping someone again about that. This is pretty low priority, since I think most people won’t run gqs() by itself.

Riemannian sampling?

This would be complicated and I don’t think it’ll be solved soon, but some models with scale parameters would benefit from a non-Euclidean/Riemannian sampler. Scale parameters can squeeze some areas of space more than others, but the problem is that the step size doesn’t shrink to accomodate this. So in tight spaces, your step size will be too big and you’ll jump out of the typical set, and in looser/bigger spaces your step size will be inefficient.

IIRC, step size is calculated during one of the slow phases of warmup where a matrix (either called a “metric” or “mass matrix”) is measured. I also think this metric M only uses diagonal values currently. It might be possible for the metric to slightly adjust step sizes if it also used covariance, but I need to double check this.

Parallel Nix build hanging issue?

It’s been years since I’ve tried this, so I’m just writing from memory, but I remember an issue where compiling Stan models could hang when compiling in a Nix build in parallel. Not sure if this is still a problem. But what was really surprising about this is that the Nix builders were separate, and Nix builds are run in hermetic sandboxes. You would think that there’d be no way for two builds to interefere with each other. They shouldn’t really have any IPC mechanism or visibility outside the sandbox, so I’m not sure what happened there.

I don’t remember exactly how I solved the problem either. Did I set Nix’s --max-jobs=1 when compiling Stan models to prevent this from happening? Or did I find a specific critical section that I separated into its own derivation? Not sure.

Nix

Here’s a basic Nix setup which you can use in a direnv (or just build myenv):

{ pkgs ? import <nixpkgs> {}
, ... }:

with pkgs;
{
  myenv = let
    my-R-packages = with rPackages; [
      R
      tidyverse
      jsonlite
      data_table
      rstan
      ggplot2
    ];
    newR = rWrapper.override {
      packages = my-R-packages;
    };
  in
    buildEnv {
      name = "myenv";
      paths = [
        newR
        gcc                     # For rstan
        gnumake                 # Ditto
      ];
    };
}

Compiling Stan models

In Nix, you probably don’t want to have to recompile your Stan models every time you fit data. Instead, you should have a derivation that compiles a Stan model from the code, and use the compiled version via the sampling() method.

Here’s an Rscript that’ll compile a Stan model to an RDS file (compressed with xz, which may not be prudent given the recent backdoor in xz). I save it as compile-stan-model.R:

## Compile a Stan model and save it

## Rscript compile-stan-model.R stan-model.stan output-compiled.rds
library(rstan)
library(tidyverse)
args <- commandArgs(trailingOnly = TRUE)
stanmodel <- args[1]
output <- args[2]

## TODO: We might want to alter the makeconf_path() to set CXXFLAGS to
## be "-O3" for better optimization.
compiledmodel <- stan_model(file = stanmodel)

write_rds(compiledmodel, path = output, compress = "xz")

And here’s a “compile-stan-model.nix” package that uses it:

# Compile a Stan model and save it in an RDS
{ runCommand, r-env, lib }:

path:
let
  name = builtins.baseNameOf path;
  rdsname = (lib.strings.removeSuffix ".stan" name) + ".rds";
in
runCommand rdsname
  { buildInputs = [ r-env ];
  }
  ''
  Rscript ${./compile-stan-model.R} ${path} $out
  ''

Then you can make a function “compile-stan-model” like so:

  compile-stan-model = callPackage ./compile-stan-model.nix { };

Calling the function like (an example from my neuroeconomics research):

risk-wsls-mixture-stan-model = compile-stan-model ./fit-risk-wsls-mixture.stan;

Then you can pass this model as an argument to other R scripts. For example, I’ll usually pass some data, the model, and an output directory to an R script through arguments. Here’s what the R for that looks like:

cores <- as.integer(Sys.getenv("NIX_BUILD_CORES", unset = 1))
options(mc.cores = cores)
args = commandArgs(trailingOnly = TRUE)

## args <- c("subject-risk-data-1108.csv.gz", "risk-wsls-mixture-stan-model.rds", "outputs")
dt <- data.table(read_csv(args[1]))
stanmodel <- read_rds(args[2])
outputdir <- args[3]

## … do something to clean the data and make a “data” variable…

fit1 <- sampling(stanmodel, data = data, cores = cores,
                 warmup = 1000, seed = 12345,
                 include = FALSE, pars = c("log_lik"))

And here’s a function that I’ll need to explain later at some point:

{ runCommand, lib, r-env, keepFailures ? true }:

# Take in some CSV. Possibly allow failed fits to "succeed" in
# building (since they're likely to fail again anyway.)
prefix: rscript: stanmodel:
with lib;
name: deriv:
let
  adjustedname = if lib.strings.hasPrefix "subject-" name
                 then "subj${lib.strings.removePrefix "subject-" name}" # "nix build -L" cuts off numbers with a dash
                 else name;
in
runCommand "${prefix}-${adjustedname}"
  { buildInputs = [ r-env ];
    # TODO: Should allowSubstitutes be set to "false" here? Or will we
    # eventually want to serve them from a cache?
    # Probably just set --no-substitute on building
  }
  ''
  mkdir outputs
  if Rscript ${rscript} ${deriv} ${stanmodel} outputs > >(tee outputs/stdout.log) 2> >(tee outputs/stderr.log >&2)
  then
    mv outputs $out
  else
  ${if keepFailures then
   ''
     mkdir -p $out
     touch $out/FAILED
     mv outputs/stdout.log $out
     mv outputs/stderr.log $out
   ''
    else
      "false"
   }
  fi
  ''

Basically it takes in an Rscript (that cleans data) and a Stan model, then returns a function that can be used to fit data. That function takes in (for me) a subject’s name (a string) and data (which is a derivation pointing to a .csv.gz) and then returns a derivation that is the output of the Rscript.