Statistical Foundations
Annotated bibliography of the statistical and causal inference literature underlying Stoa's experimentation platform.
1. Bayesian Inference and Workflow
Core
Gelman, Carlin, Stern, Dunson, Vehtari, Rubin (2013) — Bayesian Data Analysis, Third Edition. Chapter 5 (hierarchical models) is the foundation for pooling across segments — the Bayesian answer to multiple comparisons. Chapter 6 (model checking) informs simulation-based calibration.
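The partial-pooling idea from Chapter 5 can be sketched without full MCMC: an empirical-Bayes beta-binomial shrinks each segment's raw conversion rate toward the overall mean, with small segments shrunk more strongly. The counts below are hypothetical, and the moment-matched prior is a stand-in for the full hierarchical posterior.

```python
import numpy as np

# Hypothetical per-segment conversion data (successes, trials).
successes = np.array([12, 45, 3, 90])
trials = np.array([100, 400, 20, 1000])

# Empirical-Bayes Beta(a, b) prior fit by moment matching on raw rates.
rates = successes / trials
m, v = rates.mean(), rates.var()
common = m * (1 - m) / v - 1  # method-of-moments for Beta parameters
a, b = m * common, (1 - m) * common

# Partial pooling: each segment's posterior mean is a weighted average
# of its raw rate and the prior mean; small segments get pulled harder.
pooled = (successes + a) / (trials + a + b)

for raw, post, n in zip(rates, pooled, trials):
    print(f"n={n:4d}  raw={raw:.3f}  pooled={post:.3f}")
```

The n=20 segment moves most of the way toward the global mean, which is exactly the multiple-comparisons protection the chapter describes.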
McElreath (2020) — Statistical Rethinking, 2nd edition (with accompanying lecture videos). The best pedagogical introduction to Bayesian modeling. The DAG-first approach to causal reasoning applies directly to experiment design. Won the 2024 ISBA De Groot Prize.
Gelman, Vehtari, Simpson, et al. (2020) — "Bayesian Workflow". The methodological backbone for model development: specify → simulate from prior → fit → diagnose → check posterior predictions → expand if needed.
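The "simulate from prior" step can be sketched as a prior predictive check; the priors and visitor counts below are illustrative, not any library's actual defaults.

```python
import numpy as np

rng = np.random.default_rng(0)

# Prior predictive check for a hypothetical conversion-lift model:
# before fitting anything, simulate data implied by the prior alone
# and confirm it covers plausible e-commerce conversion counts.
n_sims, n_visitors = 1000, 500
baseline = rng.beta(2, 50, size=n_sims)        # prior on control rate
lift = rng.normal(0.0, 0.02, size=n_sims)      # prior on additive lift
treat_rate = np.clip(baseline + lift, 0, 1)
sim_conversions = rng.binomial(n_visitors, treat_rate)

# If most simulated counts fall outside the plausible range for the
# domain, revise the priors before touching real data.
print("2.5%/97.5% simulated conversions:",
      np.percentile(sim_conversions, [2.5, 97.5]))
```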
Gelman and Loken (2013) — "The Garden of Forking Paths". Core motivation for preferring Bayesian decision theory over frequentist NHST. Also motivates the operator-honesty governance layer.
Computational Methods
Betancourt (2020) — Towards a Principled Bayesian Workflow. The most rigorous treatment of end-to-end Bayesian workflow available. Prior pushforward checks, computational faithfulness diagnostics, posterior retrodictive checks — each stage with concrete code examples. Complements Gelman et al.'s "Bayesian Workflow" with a deeper computational emphasis.
Talts, Betancourt, Simpson, Vehtari, Gelman (2018) — "Validating Bayesian Inference Algorithms with Simulation-Based Calibration". SBC is the gold standard for verifying that the inference engine recovers known parameters. If the sampler can't recover ground truth from simulated data, the posteriors it produces on real data are suspect. Pytyche's validation pipeline implements this.
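The SBC loop can be sketched on a toy conjugate model where the exact posterior is known, so correctly computed rank statistics come out uniform by construction:

```python
import numpy as np

rng = np.random.default_rng(1)

# SBC (Talts et al. 2018) for a toy conjugate model:
# theta ~ N(0, 1), y | theta ~ N(theta, 1), so the exact posterior
# is N(y/2, 1/2). Calibrated inference => uniform rank statistics.
n_runs, n_draws = 2000, 99
ranks = np.empty(n_runs, dtype=int)
for i in range(n_runs):
    theta = rng.normal(0, 1)                          # draw from prior
    y = rng.normal(theta, 1)                          # simulate data
    post = rng.normal(y / 2, np.sqrt(0.5), n_draws)   # posterior draws
    ranks[i] = np.sum(post < theta)                   # rank in 0..n_draws

# Flat histogram = calibrated; a U-shape or mound signals biased or
# over/under-dispersed posteriors from the inference algorithm.
hist, _ = np.histogram(ranks, bins=10, range=(0, n_draws + 1))
print(hist)
```

In a real pipeline the exact-posterior line is replaced by the sampler under test; everything else stays the same.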
Decision Theory
Stucchio (2015) — "Bayesian A/B Testing at VWO". Expected loss as the decision criterion: the most concise treatment of why expected loss beats hypothesis testing for business decisions. This is pytyche's primary decision metric.
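The expected-loss criterion can be computed directly from posterior samples; the Beta posteriors below assume hypothetical counts of 120/1000 and 140/1000 conversions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Expected loss from posterior samples: the average conversion rate we
# forfeit if we ship a variant and it turns out to be the worse one.
post_a = rng.beta(1 + 120, 1 + 880, size=100_000)   # A: 120/1000
post_b = rng.beta(1 + 140, 1 + 860, size=100_000)   # B: 140/1000

loss_choose_b = np.mean(np.maximum(post_a - post_b, 0))
loss_choose_a = np.mean(np.maximum(post_b - post_a, 0))
print(f"expected loss if we ship B: {loss_choose_b:.5f}")
print(f"expected loss if we ship A: {loss_choose_a:.5f}")
# Decision rule: ship a variant once its expected loss drops below a
# threshold of caring, rather than waiting for p < 0.05.
```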
Kruschke (2013) — "Bayesian Estimation Supersedes the t-Test". The ROPE (Region of Practical Equivalence) concept maps to pytyche's minimum practical effect threshold. "82% probability variant B is better" not "p < 0.05."
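A minimal ROPE computation, assuming an illustrative posterior for the conversion-rate difference and a hypothetical one-point practical-equivalence threshold:

```python
import numpy as np

rng = np.random.default_rng(3)

# ROPE check (Kruschke 2013): what is the probability the lift is
# practically equivalent to zero? The posterior draws of the
# difference below are illustrative, not from a fitted model.
rope = 0.01
diff = rng.normal(0.018, 0.008, size=100_000)  # posterior of B - A

p_better = np.mean(diff > 0)
p_in_rope = np.mean(np.abs(diff) < rope)
print(f"P(B better than A)    = {p_better:.2f}")
print(f"P(effect inside ROPE) = {p_in_rope:.2f}")
# Report these probabilities directly instead of a p-value.
```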
2. Causal Inference
Imbens and Rubin (2015) — Causal Inference for Statistics, Social, and Biomedical Sciences. The conceptual foundation. The fundamental problem of causal inference (we only observe one potential outcome per unit) motivates the entire experimental design.
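The fundamental problem can be made concrete in a few lines of simulation: each unit's individual effect is unobservable, yet randomization lets the difference in group means recover the average effect. The data-generating numbers are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)

# Potential outcomes in miniature: every unit has both y0 and y1,
# but assignment reveals only one of them.
n = 100_000
y0 = rng.normal(10, 2, n)          # potential outcome under control
y1 = y0 + 1.5                      # potential outcome under treatment
z = rng.integers(0, 2, n)          # random assignment
y_obs = np.where(z == 1, y1, y0)   # the one outcome we actually see

# Individual effects y1 - y0 are never observed, but randomization
# makes the simple difference in means an unbiased ATE estimator.
ate_hat = y_obs[z == 1].mean() - y_obs[z == 0].mean()
print(f"true ATE = 1.5, estimated = {ate_hat:.3f}")
```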
Facure (2022) — Causal Inference for the Brave and True. The most accessible Python-first treatment. Part II on personalization and CATE models is directly applicable.
Athey and Imbens (2016) — "Recursive Partitioning for Heterogeneous Causal Effects". Introduced "honest estimation" via sample splitting — the frequentist solution to algorithmic overfitting. BCF's prior regularization achieves the same goal more naturally.
Wager and Athey (2018) — "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests". Theoretical backing for frequentist confidence intervals in causal forests. The asymptotic normality result requires sufficient n — exactly the assumption that breaks down for small e-commerce stores, motivating the Bayesian alternative.
3. Bayesian Causal Forests
The core of the heterogeneous treatment effect (HTE) methodology. BCF addresses algorithmic honesty through prior regularization, produces native credible intervals, and integrates naturally with PyMC.
The BART Foundation
Chipman, George, McCulloch (2010) — "BART: Bayesian Additive Regression Trees". The computational foundation of BCF. Understanding BART's regularization mechanism (a prior on tree depth plus a prior on leaf values) is a prerequisite to understanding why BCF works.
Hill (2011) — "Bayesian Nonparametric Modeling for Causal Inference". Showed that BART's regularization produces good treatment effect estimates without manual tuning — the insight that Hahn et al. formalized into BCF.
Core BCF
Hahn, Murray, Carvalho (2020) — "Bayesian Regression Tree Models for Causal Inference" Bayesian Analysis, 15(3), 965-1056
The primary methodological target. BCF's separate priors on the prognostic forest (μ, flexible) and the treatment effect forest (τ, regularized toward zero) encode the correct domain assumption: treatment effects are smaller and simpler than baseline variation. Solves the algorithmic garden of forking paths via regularization instead of sample splitting.
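As presented in Hahn et al., the response decomposes into the two separately regularized forests, with the estimated propensity score π̂ᵢ included in the prognostic term to mitigate regularization-induced confounding:

```latex
y_i = \mu(x_i, \hat{\pi}_i) + \tau(x_i)\, z_i + \varepsilon_i,
\qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)
```

Here μ carries a flexible BART prior while τ gets a deliberately more conservative one (fewer trees, tighter leaf-value prior), so heterogeneity must earn its way into the posterior.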
Krantsevich, Hahn, He (2023) — "Bayesian Causal Forests & the 2022 ACIC Data Challenge". Empirical evidence: BART-based methods were top performers in ACIC 2022, and BCF credible intervals achieved better frequentist coverage than frequentist causal forest CIs, with shorter widths.
Linero (2018) — "Bayesian Regression Trees for High-Dimensional Prediction and Variable Selection". Sparse BART via Dirichlet priors on splitting probabilities. Heavy inspiration for pytyche's feature selection in the treatment effect forest — when the covariate space is large relative to sample size, BART's default uniform splitting prior wastes splits on noise variables; Linero's approach concentrates splitting on informative covariates automatically.
Extensions
Caron, Baio, Manolopoulou (2022) — "Shrinkage Bayesian Causal Forests". Dirichlet priors on splitting probabilities for fully Bayesian feature shrinkage. Useful as the segment attribute set grows.
Bayesian Causal Forests for Multivariate Outcomes (2025) — JRSS-A. Joint BCF across multiple outcomes. Directly serves the dual-metric principle: a joint posterior over conversion AND revenue treatment effects.
4. Platform Engineering
Kohavi, Tang, Xu (2020) — Trustworthy Online Controlled Experiments. The practical engineering bible: SRM detection, metric taxonomy, guardrail metrics. The operational concerns are universal even though our statistical methodology diverges from their frequentist framework.
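An SRM check in the spirit of Kohavi et al. is a chi-square goodness-of-fit test of observed assignment counts against the configured split; the counts and the p < 0.001 alert threshold below are illustrative.

```python
import math

# Sample-ratio-mismatch (SRM) check: test hypothetical observed
# visitor counts against the intended 50/50 allocation.
observed = [5124, 4779]
expected = sum(observed) / 2

stat = sum((o - expected) ** 2 / expected for o in observed)
# Chi-square survival function with 1 degree of freedom:
# P(X > stat) = erfc(sqrt(stat / 2)).
p = math.erfc(math.sqrt(stat / 2))

if p < 0.001:
    print(f"SRM detected (p={p:.2g}); distrust the experiment results.")
else:
    print(f"no SRM evidence (p={p:.2f}).")
```

A significant mismatch means the assignment mechanism is broken, and no amount of downstream modeling, Bayesian or otherwise, rescues the experiment.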
Kaufman et al. (2017) — "Democratizing Online Controlled Experiments at Booking.com". An organizational model for accessible experimentation; informs the ops-integration design.
Essential Reading Path
Phase 1: Foundations
- McElreath (2020) — Statistical Rethinking — best path into Bayesian thinking
- Gelman and Loken (2013) — "Garden of Forking Paths" — why we avoid frequentist NHST
- Stucchio (2015) — "Bayesian A/B Testing at VWO" — expected loss as the decision criterion
- Imbens and Rubin (2015) — Causal Inference — potential outcomes framework
Phase 2: Core Methodology
- Gelman et al. (2020) — "Bayesian Workflow" — the model development process
- Betancourt (2020) — "Principled Bayesian Workflow" — rigorous computational diagnostics
- Hahn, Murray, Carvalho (2020) — "BCF" — the HTE methodology
- Chipman et al. (2010) — "BART" — understanding the computational foundation
- Talts et al. (2018) — "SBC" — validating the inference pipeline
Phase 3: Extensions
- Gelman et al. (2013) — BDA3, Chapter 5 — hierarchical models for multiplicity
- Multivariate BCF (2025) — joint treatment effects across outcomes
- Kohavi, Tang, Xu (2020) — Trustworthy OCE — platform engineering patterns