Statistical Foundations
Annotated bibliography of the statistical and causal inference literature underlying Stoa's experimentation platform.
1. Bayesian Inference and Workflow
Core
Gelman, Carlin, Stern, Dunson, Vehtari, Rubin (2013) — Bayesian Data Analysis, Third Edition. Chapter 5 (hierarchical models) is the foundation for pooling across segments — the Bayesian answer to multiple comparisons. Chapter 6 (model checking) informs simulation-based calibration.
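The partial-pooling idea from Chapter 5 can be sketched without full MCMC: an empirical-Bayes beta-binomial shrinks each segment's raw conversion rate toward the overall mean, with small segments shrunk more strongly. The counts below are hypothetical, and the moment-matched prior is a stand-in for the full hierarchical posterior.

```python
import numpy as np

# Hypothetical per-segment conversion data (successes, trials).
successes = np.array([12, 45, 3, 90])
trials = np.array([100, 400, 20, 1000])

# Empirical-Bayes Beta(a, b) prior fit by moment matching on raw rates.
rates = successes / trials
m, v = rates.mean(), rates.var()
common = m * (1 - m) / v - 1  # method-of-moments for Beta parameters
a, b = m * common, (1 - m) * common

# Partial pooling: each segment's posterior mean is a weighted average
# of its raw rate and the prior mean; small segments get pulled harder.
pooled = (successes + a) / (trials + a + b)

for raw, post, n in zip(rates, pooled, trials):
    print(f"n={n:4d}  raw={raw:.3f}  pooled={post:.3f}")
```

The n=20 segment moves most of the way toward the global mean, which is exactly the multiple-comparisons protection the chapter describes.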
McElreath (2020) — Statistical Rethinking, 2nd edition (with accompanying lecture videos). The best pedagogical introduction to Bayesian modeling. The DAG-first approach to causal reasoning applies directly to experiment design. Won the 2024 ISBA De Groot Prize.
Gelman, Vehtari, Simpson, et al. (2020) — "Bayesian Workflow". The methodological backbone for model development: specify → simulate from prior → fit → diagnose → check posterior predictions → expand if needed.
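The "simulate from prior" step can be sketched as a prior predictive check; the priors and visitor counts below are illustrative, not any library's actual defaults.

```python
import numpy as np

rng = np.random.default_rng(0)

# Prior predictive check for a hypothetical conversion-lift model:
# before fitting anything, simulate data implied by the prior alone
# and confirm it covers plausible e-commerce conversion counts.
n_sims, n_visitors = 1000, 500
baseline = rng.beta(2, 50, size=n_sims)        # prior on control rate
lift = rng.normal(0.0, 0.02, size=n_sims)      # prior on additive lift
treat_rate = np.clip(baseline + lift, 0, 1)
sim_conversions = rng.binomial(n_visitors, treat_rate)

# If most simulated counts fall outside the plausible range for the
# domain, revise the priors before touching real data.
print("2.5%/97.5% simulated conversions:",
      np.percentile(sim_conversions, [2.5, 97.5]))
```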
Gelman and Loken (2013) — "The Garden of Forking Paths". Core motivation for preferring Bayesian decision theory over frequentist NHST. Also motivates the operator-honesty governance layer.
Computational Methods
Betancourt (2020) — Towards a Principled Bayesian Workflow. The most rigorous treatment of end-to-end Bayesian workflow available. Prior pushforward checks, computational faithfulness diagnostics, posterior retrodictive checks — each stage with concrete code examples. Complements Gelman et al.'s "Bayesian Workflow" with a deeper computational emphasis.
Talts, Betancourt, Simpson, Vehtari, Gelman (2018) — "Validating Bayesian Inference Algorithms with Simulation-Based Calibration". SBC is the gold standard for verifying that the inference engine recovers known parameters. If the sampler can't recover ground truth from simulated data, the posteriors it produces on real data are suspect. Pytyche's validation pipeline implements this.
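The SBC loop can be sketched on a toy conjugate model where the exact posterior is known, so correctly computed rank statistics come out uniform by construction:

```python
import numpy as np

rng = np.random.default_rng(1)

# SBC (Talts et al. 2018) for a toy conjugate model:
# theta ~ N(0, 1), y | theta ~ N(theta, 1), so the exact posterior
# is N(y/2, 1/2). Calibrated inference => uniform rank statistics.
n_runs, n_draws = 2000, 99
ranks = np.empty(n_runs, dtype=int)
for i in range(n_runs):
    theta = rng.normal(0, 1)                          # draw from prior
    y = rng.normal(theta, 1)                          # simulate data
    post = rng.normal(y / 2, np.sqrt(0.5), n_draws)   # posterior draws
    ranks[i] = np.sum(post < theta)                   # rank in 0..n_draws

# Flat histogram = calibrated; a U-shape or mound signals biased or
# over/under-dispersed posteriors from the inference algorithm.
hist, _ = np.histogram(ranks, bins=10, range=(0, n_draws + 1))
print(hist)
```

In a real pipeline the exact-posterior line is replaced by the sampler under test; everything else stays the same.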
Decision Theory
Stucchio (2015) — "Bayesian A/B Testing at VWO". Expected loss as the decision criterion: the most concise treatment of why expected loss beats hypothesis testing for business decisions. This is pytyche's primary decision metric.
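The expected-loss criterion can be computed directly from posterior samples; the Beta posteriors below assume hypothetical counts of 120/1000 and 140/1000 conversions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Expected loss from posterior samples: the average conversion rate we
# forfeit if we ship a variant and it turns out to be the worse one.
post_a = rng.beta(1 + 120, 1 + 880, size=100_000)   # A: 120/1000
post_b = rng.beta(1 + 140, 1 + 860, size=100_000)   # B: 140/1000

loss_choose_b = np.mean(np.maximum(post_a - post_b, 0))
loss_choose_a = np.mean(np.maximum(post_b - post_a, 0))
print(f"expected loss if we ship B: {loss_choose_b:.5f}")
print(f"expected loss if we ship A: {loss_choose_a:.5f}")
# Decision rule: ship a variant once its expected loss drops below a
# threshold of caring, rather than waiting for p < 0.05.
```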
Kruschke (2013) — "Bayesian Estimation Supersedes the t-Test". The ROPE (Region of Practical Equivalence) concept maps to pytyche's minimum practical effect threshold. "82% probability variant B is better" not "p < 0.05."
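A minimal ROPE computation, assuming an illustrative posterior for the conversion-rate difference and a hypothetical one-point practical-equivalence threshold:

```python
import numpy as np

rng = np.random.default_rng(3)

# ROPE check (Kruschke 2013): what is the probability the lift is
# practically equivalent to zero? The posterior draws of the
# difference below are illustrative, not from a fitted model.
rope = 0.01
diff = rng.normal(0.018, 0.008, size=100_000)  # posterior of B - A

p_better = np.mean(diff > 0)
p_in_rope = np.mean(np.abs(diff) < rope)
print(f"P(B better than A)    = {p_better:.2f}")
print(f"P(effect inside ROPE) = {p_in_rope:.2f}")
# Report these probabilities directly instead of a p-value.
```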
2. Causal Inference
Imbens and Rubin (2015) — Causal Inference for Statistics, Social, and Biomedical Sciences. The conceptual foundation. The fundamental problem of causal inference (we only observe one potential outcome per unit) motivates the entire experimental design.
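The fundamental problem can be made concrete in a few lines of simulation: each unit's individual effect is unobservable, yet randomization lets the difference in group means recover the average effect. The data-generating numbers are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)

# Potential outcomes in miniature: every unit has both y0 and y1,
# but assignment reveals only one of them.
n = 100_000
y0 = rng.normal(10, 2, n)          # potential outcome under control
y1 = y0 + 1.5                      # potential outcome under treatment
z = rng.integers(0, 2, n)          # random assignment
y_obs = np.where(z == 1, y1, y0)   # the one outcome we actually see

# Individual effects y1 - y0 are never observed, but randomization
# makes the simple difference in means an unbiased ATE estimator.
ate_hat = y_obs[z == 1].mean() - y_obs[z == 0].mean()
print(f"true ATE = 1.5, estimated = {ate_hat:.3f}")
```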
Facure (2022) — Causal Inference for the Brave and True. The most accessible Python-first treatment. Part II on personalization and CATE models is directly applicable.
Athey and Imbens (2016) — "Recursive Partitioning for Heterogeneous Causal Effects". Introduced "honest estimation" via sample splitting — the frequentist solution to algorithmic overfitting. BCF's prior regularization achieves the same goal more naturally.
Wager and Athey (2018) — "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests". Theoretical backing for frequentist confidence intervals in causal forests. The asymptotic normality result requires sufficient n — exactly the assumption that breaks down for small e-commerce stores, motivating the Bayesian alternative.
3. Bayesian Causal Forests
The core of the heterogeneous treatment effect (HTE) methodology. BCF addresses algorithmic honesty through prior regularization, produces native credible intervals, and integrates naturally with PyMC.
The BART Foundation
Chipman, George, McCulloch (2010) — "BART: Bayesian Additive Regression Trees". The computational foundation of BCF. Understanding BART's regularization mechanism (a prior on tree depth plus a prior on leaf values) is a prerequisite to understanding why BCF works.
Hill (2011) — "Bayesian Nonparametric Modeling for Causal Inference". Showed that BART's regularization produces good treatment effect estimates without manual tuning — the insight that Hahn et al. formalized into BCF.
Core BCF
Hahn, Murray, Carvalho (2020) — "Bayesian Regression Tree Models for Causal Inference" Bayesian Analysis, 15(3), 965-1056
The primary methodological target. BCF's separate priors on the prognostic forest (μ, flexible) and the treatment effect forest (τ, regularized toward zero) encode the correct domain assumption: treatment effects are smaller and simpler than baseline variation. Solves the algorithmic garden of forking paths via regularization instead of sample splitting.
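As presented in Hahn et al., the response decomposes into the two separately regularized forests, with the estimated propensity score π̂ᵢ included in the prognostic term to mitigate regularization-induced confounding:

```latex
y_i = \mu(x_i, \hat{\pi}_i) + \tau(x_i)\, z_i + \varepsilon_i,
\qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)
```

Here μ carries a flexible BART prior while τ gets a deliberately more conservative one (fewer trees, tighter leaf-value prior), so heterogeneity must earn its way into the posterior.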
Krantsevich, Hahn, He (2023) — "Bayesian Causal Forests & the 2022 ACIC Data Challenge". Empirical evidence: BART-based methods were top performers in ACIC 2022, and BCF credible intervals achieved better frequentist coverage than frequentist causal forest CIs, with shorter widths.
Linero (2018) — "Bayesian Regression Trees for High-Dimensional Prediction and Variable Selection". Sparse BART via Dirichlet priors on splitting probabilities. Heavy inspiration for pytyche's feature selection in the treatment effect forest — when the covariate space is large relative to sample size, BART's default uniform splitting prior wastes splits on noise variables; Linero's approach concentrates splitting on informative covariates automatically.
Extensions
Caron, Baio, Manolopoulou (2022) — "Shrinkage Bayesian Causal Forests". Dirichlet priors on splitting probabilities for fully Bayesian feature shrinkage. Useful as the segment attribute set grows.
Bayesian Causal Forests for Multivariate Outcomes (2025) — JRSS-A. Joint BCF across multiple outcomes. Directly serves the dual-metric principle: a joint posterior over conversion AND revenue treatment effects.
4. Platform Engineering
Kohavi, Tang, Xu (2020) — Trustworthy Online Controlled Experiments. The practical engineering bible: SRM detection, metric taxonomy, guardrail metrics. The operational concerns are universal even though our statistical methodology diverges from their frequentist framework.
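An SRM check in the spirit of Kohavi et al. is a chi-square goodness-of-fit test of observed assignment counts against the configured split; the counts and the p < 0.001 alert threshold below are illustrative.

```python
import math

# Sample-ratio-mismatch (SRM) check: test hypothetical observed
# visitor counts against the intended 50/50 allocation.
observed = [5124, 4779]
expected = sum(observed) / 2

stat = sum((o - expected) ** 2 / expected for o in observed)
# Chi-square survival function with 1 degree of freedom:
# P(X > stat) = erfc(sqrt(stat / 2)).
p = math.erfc(math.sqrt(stat / 2))

if p < 0.001:
    print(f"SRM detected (p={p:.2g}); distrust the experiment results.")
else:
    print(f"no SRM evidence (p={p:.2f}).")
```

A significant mismatch means the assignment mechanism is broken, and no amount of downstream modeling, Bayesian or otherwise, rescues the experiment.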
Kaufman et al. (2017) — "Democratizing Online Controlled Experiments at Booking.com". An organizational model for accessible experimentation; informs the ops-integration design.
Essential Reading Path
Phase 1: Foundations
- McElreath (2020) — Statistical Rethinking — best path into Bayesian thinking
- Gelman and Loken (2013) — "Garden of Forking Paths" — why we avoid frequentist NHST
- Stucchio (2015) — "Bayesian A/B Testing at VWO" — expected loss as the decision criterion
- Imbens and Rubin (2015) — Causal Inference — potential outcomes framework
Phase 2: Core Methodology
- Gelman et al. (2020) — "Bayesian Workflow" — the model development process
- Betancourt (2020) — "Principled Bayesian Workflow" — rigorous computational diagnostics
- Hahn, Murray, Carvalho (2020) — "BCF" — the HTE methodology
- Chipman et al. (2010) — "BART" — understanding the computational foundation
- Talts et al. (2018) — "SBC" — validating the inference pipeline
Phase 3: Extensions
- Gelman et al. (2013) — BDA3, Chapter 5 — hierarchical models for multiplicity
- Multivariate BCF (2025) — joint treatment effects across outcomes
- Kohavi, Tang, Xu (2020) — Trustworthy OCE — platform engineering patterns