Stan is a probabilistic programming language for Bayesian inference using Markov chain Monte Carlo sampling. It’s what I use for Bayesian inference for my PhD research.
Back in undergrad I used JAGS, but Stan uses Hamiltonian Monte Carlo, which lets it sample complicated posterior distributions more efficiently.
Links
- Stan’s homepage is mc-stan.org.
- Documentation is at mc-stan.org/users/documentation. In particular, check out the user guide.
- The Stan project source code is hosted at github.com/stan-dev.
- Michael Betancourt has pretty good posts on Stan, and generally provides good intuition for how Hamiltonian Monte Carlo samplers work (like the idea of sampling from the “typical set”).
Internal links
- brms notes (I don’t really use brms currently)
Tips
Startup occasionally fails on some chains
TODO: Need to look up the error message on this again
We had a problem where some chains would fail to start on one of our models that used a softmax noise parameter σ.
The problem (with our model, not Stan) turned out to be that random initialization would sometimes draw very low values of σ. Low noise makes the logistic function extremely steep, basically a step function, which is very hard to sample: the gradient is essentially zero almost everywhere, so the chain can’t tell which direction to move.
The solution was to initialize at high noise values, so there’d always be a usable gradient. This doesn’t prevent σ from contracting toward steep values later; it just prevents a chain from starting steep in a region with little probability density.
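A minimal sketch of that fix in rstan, assuming the noise parameter is literally declared as `sigma` in the Stan model (the name will differ in yours):

```r
## Start every chain at a high noise value so the logistic stays
## shallow and there is always a usable gradient. "sigma" here is an
## assumed parameter name; adjust to match your model.
high_noise_init <- function(chain_id) {
  list(sigma = 5)  # deliberately high; sigma can still shrink during sampling
}

## With a compiled model and data list (as in the scripts below):
## fit <- sampling(stanmodel, data = standata, init = high_noise_init)
```

Passing a function to `init` gives each chain its own starting values, which is what you want when only some chains were failing.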
Issues
Generated quantities with transformed parameters
rstan has an issue (see Issue #714) where gqs() (which samples generated quantities) doesn’t work with transformed parameters.
I submitted a patch back in June 2021 that can fix this issue, but I may need to ping someone again about that.
This is pretty low priority, since I think most people won’t run gqs() by itself.
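For context, the usual gqs() workflow looks something like the following (shown as a commented sketch since the file names are made up; the generated-quantities program replays the draws from an existing fit):

```r
## Typical gqs() workflow in rstan (hypothetical file names):
##   fit   <- sampling(stan_model("model.stan"), data = standata)
##   gqmod <- stan_model("generated-quantities.stan")  # gq-only program
##   preds <- gqs(gqmod, data = standata, draws = as.matrix(fit))
## The failure shows up when generated-quantities.stan references a
## transformed parameter from the original model.
```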
Riemannian sampling?
This would be complicated and I don’t think it’ll be solved soon, but some models with scale parameters would benefit from a non-Euclidean/Riemannian sampler. Scale parameters squeeze some regions of parameter space more than others, but the step size doesn’t shrink to accommodate this. So in tight regions the step size is too big and you jump out of the typical set, while in looser/bigger regions the same step size is inefficiently small.
IIRC, the metric (also called the “mass matrix”) is estimated during the slow windows of warmup, with the step size adapted alongside it. The default metric M is diagonal (diag_e), but a dense metric (dense_e) that also captures covariance is available; even a dense metric only rescales directions globally, though, so it can’t fix the position-dependent squeezing above.
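For the covariance point: rstan does let you request a dense metric through the control list today. A hedged sketch (stanmodel/standata stand in for a compiled model and data):

```r
## A dense (covariance) metric lets warmup estimate full posterior
## covariance rather than just the diagonal. This is still a single
## global rescaling, not a position-dependent Riemannian metric.
dense_control <- list(metric = "dense_e")  # the default is "diag_e"

## With a compiled model and data (as in the scripts below):
## fit <- sampling(stanmodel, data = standata, control = dense_control)
```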
Parallel Nix build hanging issue?
It’s been years since I’ve tried this, so I’m just writing from memory, but I remember an issue where compiling Stan models could hang when compiling in a Nix build in parallel. Not sure if this is still a problem. But what was really surprising about this is that the Nix builders were separate, and Nix builds are run in hermetic sandboxes. You would think that there’d be no way for two builds to interfere with each other. They shouldn’t really have any IPC mechanism or visibility outside the sandbox, so I’m not sure what happened there.
I don’t remember exactly how I solved the problem either. Did I set Nix’s --max-jobs=1 when compiling Stan models to prevent this from happening? Or did I find a specific critical section that I separated into its own derivation? Not sure.
Nix
Here’s a basic Nix setup which you can use in a direnv (or just build myenv):
{ pkgs ? import <nixpkgs> {}
, ... }:
with pkgs;
{
myenv = let
my-R-packages = with rPackages; [
R
tidyverse
jsonlite
data_table
rstan
ggplot2
];
newR = rWrapper.override {
packages = my-R-packages;
};
in
buildEnv {
name = "myenv";
paths = [
newR
gcc # For rstan
gnumake # Ditto
];
};
}
Compiling Stan models
In Nix, you probably don’t want to have to recompile your Stan models every time you fit data.
Instead, you should have a derivation that compiles a Stan model from the code, and use the compiled version via the sampling() method.
Here’s an Rscript that’ll compile a Stan model to an RDS file (compressed with xz, which may not be prudent given the recent backdoor in xz). I save it as compile-stan-model.R:
## Compile a Stan model and save it
## Rscript compile-stan-model.R stan-model.stan output-compiled.rds
library(rstan)
library(tidyverse)
args <- commandArgs(trailingOnly = TRUE)
stanmodel <- args[1]
output <- args[2]
## TODO: We might want to alter the makeconf_path() to set CXXFLAGS to
## be "-O3" for better optimization.
compiledmodel <- stan_model(file = stanmodel)
write_rds(compiledmodel, file = output, compress = "xz")  ## readr >= 1.4 uses `file`; older versions used `path`
And here’s a “compile-stan-model.nix” package that uses it:
# Compile a Stan model and save it in an RDS
{ runCommand, r-env, lib }:
path:
let
name = builtins.baseNameOf path;
rdsname = (lib.strings.removeSuffix ".stan" name) + ".rds";
in
runCommand rdsname
{ buildInputs = [ r-env ];
}
''
Rscript ${./compile-stan-model.R} ${path} $out
''
Then you can make a function “compile-stan-model” like so:
compile-stan-model = callPackage ./compile-stan-model.nix { };
Call the function like this (the example is from my neuroeconomics research):
risk-wsls-mixture-stan-model = compile-stan-model ./fit-risk-wsls-mixture.stan;
Then you can pass this model as an argument to other R scripts. For example, I’ll usually pass some data, the model, and an output directory to an R script through arguments. Here’s what the R for that looks like:
library(rstan)
library(readr)
library(data.table)
cores <- as.integer(Sys.getenv("NIX_BUILD_CORES", unset = "1"))
options(mc.cores = cores)
args <- commandArgs(trailingOnly = TRUE)
## args <- c("subject-risk-data-1108.csv.gz", "risk-wsls-mixture-stan-model.rds", "outputs")
dt <- data.table(read_csv(args[1]))
stanmodel <- read_rds(args[2])
outputdir <- args[3]
## … do something to clean the data and make a “data” variable…
fit1 <- sampling(stanmodel, data = data, cores = cores,
                 warmup = 1000, seed = 12345,
                 include = FALSE, pars = c("log_lik"))
And here’s a function that I’ll need to explain later at some point:
{ runCommand, lib, r-env, keepFailures ? true }:
# Take in some CSV. Possibly allow failed fits to "succeed" in
# building (since they're likely to fail again anyway.)
prefix: rscript: stanmodel:
with lib;
name: deriv:
let
adjustedname = if lib.strings.hasPrefix "subject-" name
then "subj${lib.strings.removePrefix "subject-" name}" # "nix build -L" cuts off numbers with a dash
else name;
in
runCommand "${prefix}-${adjustedname}"
{ buildInputs = [ r-env ];
# TODO: Should allowSubstitutes be set to "false" here? Or will we
# eventually want to serve them from a cache?
# Probably just set --no-substitute on building
}
''
mkdir outputs
if Rscript ${rscript} ${deriv} ${stanmodel} outputs > >(tee outputs/stdout.log) 2> >(tee outputs/stderr.log >&2)
then
mv outputs $out
else
${if keepFailures then
''
mkdir -p $out
touch $out/FAILED
mv outputs/stdout.log $out
mv outputs/stderr.log $out
''
else
"false"
}
fi
''
Basically it takes in an Rscript (that cleans data) and a Stan model, then returns a function that can be used to fit data.
That function takes in (for me) a subject’s name (a string) and data (a derivation pointing to a .csv.gz) and then returns a derivation that is the output of the Rscript.