Statistical Honesty

How Stoa prevents its experimentation system from fooling itself — at both the algorithmic and operator levels.

Why This Matters

The experimentation stack makes promises: "82% chance variant B is better", "returning customers respond 3x more strongly", "expected loss of $0.03/visitor if you ship." These are powerful claims. If the system produces them carelessly — through algorithmic overfitting or operator cherry-picking — they become dangerous. Bad experiment results are worse than no experiment results, because they carry the authority of data.
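Numbers like these come from posterior distributions over conversion rates. A minimal sketch of where "P(B better)" and "expected loss" come from, using a Beta-Binomial model with made-up counts (the variant numbers and dollar value below are illustrative, not from Stoa):

```python
import random

random.seed(0)

# Made-up conversion data for two variants (illustrative only).
conversions_a, visitors_a = 120, 1000
conversions_b, visitors_b = 150, 1000
value_per_conversion = 1.0  # dollars, illustrative

# Beta(1, 1) prior + binomial likelihood -> Beta posterior over each rate.
def posterior_samples(conv, n, draws=20000):
    return [random.betavariate(1 + conv, 1 + n - conv) for _ in range(draws)]

a = posterior_samples(conversions_a, visitors_a)
b = posterior_samples(conversions_b, visitors_b)

# P(B better than A): fraction of joint posterior draws where B's rate wins.
p_b_better = sum(rb > ra for ra, rb in zip(a, b)) / len(a)

# Expected loss per visitor if we ship B: average shortfall when A was better.
exp_loss_ship_b = (
    sum(max(ra - rb, 0.0) for ra, rb in zip(a, b)) / len(a) * value_per_conversion
)

print(f"P(B > A) = {p_b_better:.2f}")
print(f"Expected loss if shipping B = ${exp_loss_ship_b:.4f}/visitor")
```

This is the expected-loss framing Stucchio (2015) describes: ship when the cost of being wrong, averaged over the posterior, falls below a threshold you care about.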


The Two Gardens

Gelman and Loken (2013) described the "garden of forking paths" — the observation that researcher degrees of freedom create multiple-comparison problems even without conscious p-hacking. In an experimentation platform, this garden has two distinct sections:

    ┌─────────────────────────────────────────────────────────────┐
    │              THE TWO HONESTY PROBLEMS                       │
    │                                                             │
    │  GARDEN 1: The Algorithm's Paths    GARDEN 2: The Operator  │
    │  ───────────────────────────────    ──────────────────────  │
    │                                                             │
    │  Which features to split on?        Which segments to       │
    │  How deep to grow trees?             examine?               │
    │  How to weight observations?        Which metric to lead    │
    │  Where to place split thresholds?    with?                  │
    │                                     When to declare         │
    │  → The model searches a space of     "enough data"?         │
    │    possible explanations and may    Whether to re-run with  │
    │    overfit its own search.           different parameters?  │
    │                                     How to frame the        │
    │                                      narrative?             │
    │                                                             │
    │                                     → The human selects     │
    │                                       from valid outputs    │
    │                                       and may cherry-pick.  │
    │                                                             │
    │  DIFFERENT PROBLEMS REQUIRE DIFFERENT SOLUTIONS             │
    └─────────────────────────────────────────────────────────────┘

Most experimentation platforms conflate these or address only Garden 1. Stoa treats them as distinct problems with distinct solutions.


Garden 1: Algorithmic Honesty

The Problem

When Tyche discovers heterogeneous treatment effects (HTE), it searches for subgroups that respond differently to treatment. The splits that appear in the final model were selected because they produced large apparent effects — but some of that apparent effect is an artifact of the selection process itself.
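The inflation is easy to demonstrate. In the toy simulation below (sample sizes and the number of candidate splits are invented for illustration), the treatment has zero true effect in every subgroup, yet reporting the best of 20 candidate splits reliably manufactures a positive "effect":

```python
import random
import statistics

random.seed(1)

# Toy demo of selection bias: the true treatment effect is 0 in EVERY
# subgroup, but we search 20 candidate splits and keep the one with the
# largest apparent effect. The selection step alone creates an "effect".
def apparent_effect(n=100):
    control = [random.gauss(0, 1) for _ in range(n)]
    treated = [random.gauss(0, 1) for _ in range(n)]  # true effect = 0
    return statistics.mean(treated) - statistics.mean(control)

best_effects = []
for _ in range(300):  # 300 simulated experiments
    candidates = [apparent_effect() for _ in range(20)]  # 20 candidate splits
    best_effects.append(max(candidates))

# Any single unselected estimate is unbiased (mean ~ 0);
# the selected maximum is systematically positive.
print(f"mean selected 'best' effect: {statistics.mean(best_effects):.3f}")
```

This is exactly the bias that the regularizing prior in the next section is designed to absorb.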

The Solution: Bayesian Causal Forests

Following Hahn, Murray, and Carvalho (2020), Stoa uses Bayesian Causal Forests (BCF), which separate the model into two components:

    BCF DECOMPOSITION
    ═════════════════

    Y(x) = μ(x) + τ(x)·Z + ε

    μ(x) — prognostic forest (250 trees, flexible)
           "What is the baseline outcome for this visitor?"

    τ(x) — treatment effect forest (50 trees, strong shrinkage)
           "How does treatment change the outcome for this visitor?"

    The key: τ gets a MORE REGULARIZING PRIOR than μ.
    Fewer trees, stronger shrinkage toward zero.

    This encodes the correct assumption: treatment effects
    are typically smaller and simpler than baseline variation.

The Bayesian prior on the treatment effect forest prevents overfitting by construction. Effects that are merely artifacts of the tree search get shrunk toward zero. Effects that are genuinely supported by data survive the shrinkage. No sample splitting required — the full dataset is available for both discovery and estimation.
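The shrinkage mechanics can be seen in a one-dimensional conjugate normal-normal sketch; this is not the BCF tree model itself, just the prior-precision arithmetic behind "shrink toward zero unless the data insists":

```python
# Conjugate normal-normal shrinkage: a one-dimensional sketch of the BCF
# idea that tau gets a tighter prior than mu. All numbers are illustrative.
def posterior_mean(observed, obs_sd, prior_sd, prior_mean=0.0):
    # Precision-weighted average of the observation and the prior mean.
    obs_prec = 1.0 / obs_sd**2
    prior_prec = 1.0 / prior_sd**2
    return (obs_prec * observed + prior_prec * prior_mean) / (obs_prec + prior_prec)

observed_effect = 0.10  # apparent uplift surfaced by the tree search
obs_sd = 0.05           # sampling noise in that estimate

loose = posterior_mean(observed_effect, obs_sd, prior_sd=0.50)  # mu-like prior
tight = posterior_mean(observed_effect, obs_sd, prior_sd=0.02)  # tau-like prior

print(f"loose prior posterior mean: {loose:.3f}")  # barely shrunk
print(f"tight prior posterior mean: {tight:.3f}")  # pulled hard toward zero
```

Under the loose (mu-like) prior the estimate survives almost untouched; under the tight (tau-like) prior the same observation is pulled most of the way to zero, and only stronger data would pull it back out.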

Practical implications: the full dataset serves both discovery and estimation, so power is not halved by holding out data for the algorithm's sake; spurious splits are handled by shrinkage rather than post-hoc multiplicity corrections; and the posterior on τ(x) is available directly for downstream decision rules.


Garden 2: Operator Honesty

The Problem

Even with perfectly calibrated posteriors, an operator walking through HTE results can still cherry-pick: which segments to examine, which metric to lead with, when to declare "enough data", whether to re-run with different parameters, and how to frame the narrative.

BCF handles the algorithm's honesty. It does not handle the operator's honesty. This is not a criticism of operators — it's how human cognition works. Confirmation bias is a feature of pattern-matching minds, not a character flaw. The system must account for it structurally.

The Solution: Governance Machinery

Claim-level labeling: Every analysis output carries one of three claim levels: exploratory, honest_estimate, or confirmed.

Segment freeze: In honest mode, segments discovered on the discovery split are frozen. The estimation split evaluates those exact definitions — no iterating, no tweaking boundaries.

Selection gap measurement: For each segment, the gap between the discovery-phase effect and the estimation-phase effect is computed and reported. A large gap flags an effect that does not replicate.

Decision gating: Only honest_estimate or confirmed claims are eligible for ship/stop language. Exploratory findings inform the next experiment; they don't drive deployment decisions.
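The gating rule is simple enough to sketch in a few lines. The names below (ClaimLevel, can_drive_shipping, selection_gap) are hypothetical, not Stoa's actual API:

```python
from enum import Enum

# Hypothetical sketch of the governance rules described above.
class ClaimLevel(Enum):
    EXPLORATORY = "exploratory"
    HONEST_ESTIMATE = "honest_estimate"
    CONFIRMED = "confirmed"

def can_drive_shipping(level: ClaimLevel) -> bool:
    # Only honest estimates or confirmed claims may use ship/stop language;
    # exploratory findings can only inform the next experiment.
    return level in (ClaimLevel.HONEST_ESTIMATE, ClaimLevel.CONFIRMED)

def selection_gap(discovery_effect: float, estimation_effect: float) -> float:
    # Large gap => the discovered effect failed to replicate on fresh data.
    return discovery_effect - estimation_effect

level = ClaimLevel.EXPLORATORY
print(level.value, "ship-eligible:", can_drive_shipping(level))
```

The point of encoding this as machinery rather than policy is that the gate fires regardless of how compelling the exploratory story is.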

The Honesty Paradox

There is a counterintuitive dynamic at work:

    Better CATE estimates (BCF)
    → More segments look "real" (supported by posteriors)
    → More credible stories to tell
    → More tempting forking paths for operators
    → MORE need for governance, not less

A weak estimator produces noisy, obviously uncertain results — operators naturally discount them. A strong estimator produces precise, credible results — operators trust them, including the ones that happen to be the most dramatic. The governance layer must scale with the estimator's power.


Future Direction: Hierarchical Pooling

The Bayesian framework offers an elegant long-term solution to the operator's garden: hierarchical modeling of segment effects.

When 4 segments are discovered, the hierarchical model says "these 4 effects are drawn from a common distribution." The segment with a dramatically large effect gets pulled toward the group mean. If the effect is genuinely different, the data overwhelms the prior and the estimate stays large. If it's a fluke amplified by selection, it shrinks.
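A minimal sketch of the pooling arithmetic, assuming known sampling noise per segment and a fixed between-segment spread (a full hierarchical model would also infer that spread; all segment names and numbers are invented):

```python
# Partial pooling of 4 discovered segment effects toward a common mean.
# Illustrative numbers: "whale" has a dramatic but noisy estimate.
estimates = {"new": 0.02, "returning": 0.03, "mobile": 0.01, "whale": 0.15}
sampling_sd = {"new": 0.01, "returning": 0.01, "mobile": 0.01, "whale": 0.05}
between_sd = 0.02  # assumed spread of true segment effects

# Simple average as the group mean (a full model would infer this jointly).
grand_mean = sum(estimates.values()) / len(estimates)

pooled = {}
for seg, est in estimates.items():
    # Precision-weighted compromise between the segment's own estimate
    # and the group mean: noisy estimates get pulled harder.
    data_prec = 1.0 / sampling_sd[seg] ** 2
    group_prec = 1.0 / between_sd ** 2
    pooled[seg] = (data_prec * est + group_prec * grand_mean) / (data_prec + group_prec)

for seg in estimates:
    print(f"{seg:>9}: raw {estimates[seg]:+.3f} -> pooled {pooled[seg]:+.3f}")
```

The precisely measured segments barely move, while the dramatic-but-noisy one is pulled most of the way to the group mean, which is exactly the behavior described above: flukes shrink, genuine outliers backed by data survive.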

This is the approach Gelman advocates in BDA3 (Chapter 5) as the Bayesian answer to multiple comparisons: don't correct for multiplicity with penalties — model it with a hierarchical prior.


Key References

Gelman and Loken (2013), "The Garden of Forking Paths": articulates the researcher degrees of freedom problem.
Hahn, Murray, and Carvalho (2020), "BCF": addresses Garden 1 via prior regularization.
Gelman et al. (2013), BDA3, Chapter 5: hierarchical modeling as the Bayesian answer to multiple comparisons.
Stucchio (2015), "Bayesian A/B Testing at VWO": expected loss framework for decision-theoretic honesty.
Talts et al. (2018), "Validating Bayesian Inference with SBC": simulation-based calibration for verifying honest posteriors.
