Statistical Honesty
How Stoa prevents its experimentation system from fooling itself — at both the algorithmic and operator levels.
Why This Matters
The experimentation stack makes promises: "82% chance variant B is better", "returning customers respond 3x more strongly", "expected loss of $0.03/visitor if you ship." These are powerful claims. If the system produces them carelessly — through algorithmic overfitting or operator cherry-picking — they become dangerous. Bad experiment results are worse than no experiment results, because they carry the authority of data.
The Two Gardens
Gelman and Loken (2013) described the "garden of forking paths" — the observation that researcher degrees of freedom create multiple-comparison problems even without conscious p-hacking. In an experimentation platform, this garden has two distinct sections:
┌─────────────────────────────────────────────────────────────┐
│ THE TWO HONESTY PROBLEMS │
│ │
│ GARDEN 1: The Algorithm's Paths GARDEN 2: The Operator │
│ ─────────────────────────────── ─────────────────────── │
│ │
│ Which features to split on? Which segments to │
│ How deep to grow trees? examine? │
│ How to weight observations? Which metric to lead │
│ Where to place split thresholds? with? │
│ When to declare │
│ → The model searches a space of "enough data"? │
│ possible explanations and may Whether to re-run with │
│ overfit its own search. different parameters? │
│ How to frame the │
│ narrative? │
│ │
│ → The human selects │
│ from valid outputs │
│ and may cherry-pick. │
│ │
│ DIFFERENT PROBLEMS REQUIRE DIFFERENT SOLUTIONS │
└─────────────────────────────────────────────────────────────┘
Most experimentation platforms conflate these or address only Garden 1. Stoa treats them as distinct problems with distinct solutions.
Garden 1: Algorithmic Honesty
The Problem
When Tyche discovers heterogeneous treatment effects (HTE), it searches for subgroups that respond differently to treatment. The splits that appear in the final model were selected because they produced large apparent effects — but some of that apparent effect is an artifact of the selection process itself.
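The selection artifact is easy to demonstrate with a toy simulation (illustrative numbers only, none of this is Tyche code): generate many candidate subgroups whose true treatment effects are all zero, then report the subgroup with the largest apparent effect.

```python
import numpy as np

# Toy demonstration of selection bias: K candidate subgroups, every
# true effect exactly zero. Picking the largest apparent effect yields
# a positively biased estimate, even though nothing is "real".
rng = np.random.default_rng(0)
K, n_sims = 20, 5_000
se = 0.05  # standard error of each subgroup's estimated effect

selected = []
for _ in range(n_sims):
    estimates = rng.normal(loc=0.0, scale=se, size=K)  # all null effects
    selected.append(estimates.max())                   # keep the "winner"

print(f"mean selected effect: {np.mean(selected):.3f}")
```

Even though every true effect is zero, the average selected effect lands well above zero; that surplus is exactly the selection artifact described here.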
The Solution: Bayesian Causal Forests
Following Hahn, Murray, and Carvalho (2020), Stoa uses Bayesian Causal Forests (BCF), which separate the model into two components:
BCF DECOMPOSITION
═════════════════
Y(x) = μ(x) + τ(x)·Z + ε
μ(x) — prognostic forest (250 trees, flexible)
"What is the baseline outcome for this visitor?"
τ(x) — treatment effect forest (50 trees, strong shrinkage)
"How does treatment change the outcome for this visitor?"
Z — treatment assignment indicator (0 = control, 1 = treatment)
The key: τ gets a MORE REGULARIZING PRIOR than μ.
Fewer trees, stronger shrinkage toward zero.
This encodes the correct assumption: treatment effects
are typically smaller and simpler than baseline variation.
The Bayesian prior on the treatment effect forest prevents overfitting by construction. Effects that are merely artifacts of the tree search get shrunk toward zero. Effects that are genuinely supported by data survive the shrinkage. No sample splitting required — the full dataset is available for both discovery and estimation.
Practical implications:
- Full posterior distributions (credible intervals) for CATEs at every observation
- Better small-sample performance (prior regularization is most valuable when data is sparse)
- Shorter, better-calibrated intervals than frequentist honest forests
- Natural integration with the rest of the Bayesian pipeline (PyMC models, ArviZ diagnostics)
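The differential-shrinkage idea can be sketched with ridge regression standing in for the two forests. This is a linear illustration of the prior structure under simulated data, not Stoa's actual BCF implementation; the penalty values are arbitrary.

```python
import numpy as np

# Simulate data from y = mu(x) + tau(x)*z + noise, with a strong baseline
# signal and a small, sparse treatment effect.
rng = np.random.default_rng(1)
n, p = 2000, 5
X = rng.normal(size=(n, p))
Z = rng.integers(0, 2, size=n)                    # treatment assignment (0/1)
beta_mu = np.array([2.0, -1.5, 1.0, 0.5, -0.5])   # baseline coefficients
beta_tau = np.array([0.3, 0.0, 0.0, 0.0, 0.0])    # small, sparse effect
y = X @ beta_mu + Z * (X @ beta_tau) + rng.normal(size=n)

# Design matrix: [prognostic block | treatment-effect block], with a weak
# penalty on the prognostic part and a strong penalty on the effect part,
# mirroring BCF's asymmetric priors.
D = np.hstack([X, Z[:, None] * X])
penalty = np.concatenate([np.full(p, 1.0), np.full(p, 200.0)])
beta_hat = np.linalg.solve(D.T @ D + np.diag(penalty), D.T @ y)
mu_hat, tau_hat = beta_hat[:p], beta_hat[p:]
print("tau estimates:", np.round(tau_hat, 3))
```

The genuinely nonzero treatment-effect coefficient survives the penalty while the null ones are shrunk toward zero; in BCF the same asymmetry is expressed through tree counts and prior scale rather than a ridge penalty.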
Garden 2: Operator Honesty
The Problem
Even with perfectly calibrated posteriors, an operator walking through HTE results can cherry-pick:
- Run discovery, see 4 segments, focus on the one with the largest effect
- Ignore segments where effects were null or negative
- Lead with the most impressive metric, downplay guardrail degradation
- Re-run analysis with different parameters until a satisfying story emerges
BCF handles the algorithm's honesty. It does not handle the operator's honesty. This is not a criticism of operators — it's how human cognition works. Confirmation bias is a feature of pattern-matching minds, not a character flaw. The system must account for it structurally.
The Solution: Governance Machinery
Claim-level labeling: Every analysis output carries one of three claim levels:
- exploratory — discovered on full sample, no held-out validation
- honest_estimate — discovered on one split, evaluated on another
- confirmed — reproduced across independent experiments
Segment freeze: In honest mode, segments discovered on the discovery split are frozen. The estimation split evaluates those exact definitions — no iterating, no tweaking boundaries.
Selection gap measurement: For each segment, the gap between discovery-phase effect and estimation-phase effect is computed and reported. A large gap signals something that doesn't replicate.
Decision gating: Only honest_estimate or confirmed claims are eligible for ship/stop language. Exploratory findings inform the next experiment; they don't drive deployment decisions.
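A minimal sketch of how these gating rules might look in code; the names (`SegmentClaim`, `can_drive_shipping`) are hypothetical, not Stoa's actual API.

```python
from dataclasses import dataclass

CLAIM_LEVELS = ("exploratory", "honest_estimate", "confirmed")

@dataclass(frozen=True)  # frozen: discovered segment claims can't be tweaked
class SegmentClaim:
    segment: str
    claim_level: str
    discovery_effect: float    # effect measured on the discovery split
    estimation_effect: float   # effect measured on the held-out split

    @property
    def selection_gap(self) -> float:
        """Discovery minus estimation effect; a large gap signals non-replication."""
        return self.discovery_effect - self.estimation_effect

def can_drive_shipping(claim: SegmentClaim) -> bool:
    """Only honest_estimate or confirmed claims may use ship/stop language."""
    return claim.claim_level in ("honest_estimate", "confirmed")

claim = SegmentClaim("returning_customers", "exploratory", 0.042, 0.011)
print(claim.selection_gap, can_drive_shipping(claim))
```

An exploratory claim reports its selection gap but is refused ship/stop language; it can only inform the next experiment.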
The Honesty Paradox
There is a counterintuitive dynamic at work:
Better CATE estimates (BCF)
→ More segments look "real" (supported by posteriors)
→ More credible stories to tell
→ More tempting forking paths for operators
→ MORE need for governance, not less
A weak estimator produces noisy, obviously uncertain results — operators naturally discount them. A strong estimator produces precise, credible results — operators trust them, including the ones that happen to be the most dramatic. The governance layer must scale with the estimator's power.
Future Direction: Hierarchical Pooling
The Bayesian framework offers an elegant long-term solution to the operator's garden: hierarchical modeling of segment effects.
When 4 segments are discovered, the hierarchical model says "these 4 effects are drawn from a common distribution." The segment with a dramatically large effect gets pulled toward the group mean. If the effect is genuinely different, the data overwhelms the prior and the estimate stays large. If it's a fluke amplified by selection, it shrinks.
This is the approach Gelman advocates in BDA3 (Chapter 5) as the Bayesian answer to multiple comparisons: don't correct for multiplicity with penalties — model it with a hierarchical prior.
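For the normal-normal case, the pooling arithmetic fits in a few lines. The segment effects below are made up, and the pooling scale `tau` is fixed for illustration; a real hierarchical model would estimate it from the data.

```python
import numpy as np

# Four segment effect estimates; the fourth looks dramatic but is also
# the least precisely measured.
effects = np.array([0.010, 0.015, 0.012, 0.080])
ses = np.array([0.008, 0.008, 0.008, 0.020])  # standard errors
mu = effects.mean()                           # group-level mean
tau = 0.01                                    # fixed pooling scale (illustrative)

# Posterior mean under the normal-normal model: a precision-weighted
# average of each segment's own estimate and the group mean.
w = tau**2 / (tau**2 + ses**2)   # weight on the segment's own data
pooled = w * effects + (1 - w) * mu
print("raw:   ", effects)
print("pooled:", np.round(pooled, 4))
```

The dramatic fourth segment is pulled sharply toward the group mean because its own estimate is noisy, while the precisely measured segments move much less. If that effect were backed by more data (a smaller standard error), the weight `w` would approach 1 and the estimate would stay large.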
Key References
| Reference | Contribution |
|---|---|
| Gelman and Loken (2013), "The Garden of Forking Paths" | Articulates the researcher degrees of freedom problem |
| Hahn, Murray, and Carvalho (2020), "BCF" | Addresses Garden 1 via prior regularization |
| Gelman et al. (2013), BDA3, Chapter 5 | Hierarchical modeling as the Bayesian answer to multiple comparisons |
| Stucchio (2015), "Bayesian A/B Testing at VWO" | Expected loss framework for decision-theoretic honesty |
| Talts et al. (2018), "Validating Bayesian Inference with SBC" | Simulation-based calibration for verifying honest posteriors |