A/B testing tells you averages. Stoa discovers which customers respond differently—and why—using Bayesian causal forests that run 10–60x faster than any publicly available implementation. Then it compounds that learning across every storefront it powers.

The Loop

The core principle is a segment → experiment → analyze → discover cycle where every pass produces better segments, which produce better experiments, which produce richer discoveries.

    ┌──────────────────────────────────────────────────┐
    │                                                  │
    ▼                                                  │
 SEGMENT ────► EXPERIMENT ────► ANALYZE ────► DISCOVER │
 (who are       (what should     (did it       (where  │
  our            we test,         work?)       does it │
  customers?)    for whom?)                    differ?) │
    ▲                                            │     │
    │                          ┌─────────────────┤     │
    │                          ▼                 ▼     │
    │                    REFINE SEGMENTS    ASK WHY    │
    │                    (new boundary      (voice of  │
    │                     discovered)       customer)  │
    │                          └────────┬────────┘     │
    │                                   ▼              │
    └──────────────── RICHER MODEL ────────────────────┘
                      of customer behavior
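The cycle above can be sketched in a few lines. This is a minimal illustration of the control flow only—every name here (`Model`, `run_cycle`, the `experiment`/`analyze` callables) is hypothetical, not Stoa's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Model:
    """Accumulated model of customer behavior (hypothetical structure)."""
    segments: list = field(default_factory=lambda: ["all_customers"])
    discoveries: list = field(default_factory=list)

def run_cycle(model, experiment, analyze):
    """One pass of segment -> experiment -> analyze -> discover."""
    for segment in list(model.segments):          # who are our customers?
        result = analyze(experiment(segment))     # what should we test? did it work?
        # Refine segments: a new boundary was discovered.
        model.segments.extend(result.get("new_boundaries", []))
        # Ask why: capture voice-of-customer findings.
        model.discoveries.extend(result.get("why", []))
    return model                                  # richer model feeds the next pass
```

Each pass widens `model.segments`, so the next pass runs finer-grained experiments—that is the compounding the loop describes.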

The Analysis Engine

Standard A/B testing tells you the probability of seeing data at least as extreme as yours if there's no effect. That's backwards. Stoa tells you what you actually care about: "82% chance variant B is better, expected lift $0.45/visitor."
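Statements like that fall straight out of posterior draws. Here is a minimal sketch with made-up data and a deliberately crude spend model—conjugate Beta posteriors for conversion, a rough normal approximation for spend per converter—not Stoa's actual engine:

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 100_000

# Hypothetical observed data per arm.
visitors = {"A": 5000, "B": 5000}
conversions = {"A": 240, "B": 270}
mean_spend = {"A": 38.0, "B": 39.5}   # mean spend per converting visitor

# Beta(1, 1) prior on conversion rate -> conjugate Beta posterior.
post_rate = {
    arm: rng.beta(1 + conversions[arm],
                  1 + visitors[arm] - conversions[arm],
                  n_draws)
    for arm in ("A", "B")
}

# Crude normal posterior on spend per converter (illustration only).
post_spend = {
    arm: rng.normal(mean_spend[arm], 2.0 / np.sqrt(conversions[arm]), n_draws)
    for arm in ("A", "B")
}

# Revenue per visitor = conversion rate x spend per converter.
rev = {arm: post_rate[arm] * post_spend[arm] for arm in ("A", "B")}
lift = rev["B"] - rev["A"]

print(f"P(B > A) = {np.mean(lift > 0):.0%}")
print(f"Expected lift = ${np.mean(lift):.2f}/visitor")
```

The point is the shape of the answer: a probability that B beats A, and an expected lift in dollars per visitor—quantities a decision-maker can act on directly, unlike a p-value.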

The core is heterogeneous treatment effect (HTE) discovery—automatically finding which customer segments respond differently to an intervention. This runs on a custom Bayesian causal forest (BCF) implementation built on bartz, a JAX-based BART library. At scale, HTE discovery runs 10–60x faster than publicly available BCF methods, with a batched optimization in progress targeting another 2–3x. That speed matters: it makes simulation-based calibration (SBC) practical—generating thousands of synthetic datasets with known ground truth to verify that the inference engine recovers the right answers. SBC is the gold standard for validating models this complex, and it's only feasible when each run is fast.
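To see what SBC actually checks, here is a toy version on a conjugate normal model (so the posterior is exact and each run is instant): draw a parameter from the prior, simulate data, fit, and record the rank of the true parameter among the posterior draws. If inference is calibrated, those ranks are uniform. This is the generic SBC recipe, not Stoa's BCF pipeline:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Model: theta ~ N(0, 1), y_i ~ N(theta, 1), i = 1..n_obs.
n_reps, n_obs, n_post = 2000, 20, 99
ranks = np.empty(n_reps, dtype=int)

for r in range(n_reps):
    theta = rng.normal(0.0, 1.0)                 # ground truth from the prior
    y = rng.normal(theta, 1.0, size=n_obs)       # synthetic dataset
    # Exact conjugate posterior: N(mu_n, sd_n^2).
    sd_n = 1.0 / np.sqrt(1.0 + n_obs)
    mu_n = sd_n**2 * y.sum()
    post = rng.normal(mu_n, sd_n, size=n_post)   # posterior draws
    ranks[r] = np.sum(post < theta)              # rank statistic in 0..n_post

# Calibrated inference => ranks uniform on {0, ..., n_post}.
counts = np.bincount(ranks, minlength=n_post + 1)
chi2, p = stats.chisquare(counts)
print(f"SBC uniformity p-value: {p:.3f}")
```

Swap the exact posterior for an MCMC fit and this same loop validates a real sampler—which is why per-run speed gates whether SBC is feasible at all.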

Revenue effects are decomposed via hurdle models into conversion rate lift and spend-per-converter lift—different problems that require different responses. The full pipeline is protected by BCF prior regularization, SBC validation, and claim-level governance to prevent operator cherry-picking. More on statistical honesty →
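The hurdle decomposition itself is simple arithmetic once you have both components: revenue per visitor factors as conversion rate times spend per converter, so a revenue lift splits exactly into a conversion piece and a spend piece. A sketch with hypothetical numbers:

```python
# Hypothetical point estimates (in practice these are posterior draws).
p_a, s_a = 0.048, 38.0   # control: conversion rate, spend per converter
p_b, s_b = 0.054, 39.5   # variant

rev_a, rev_b = p_a * s_a, p_b * s_b
total = rev_b - rev_a

# Exact decomposition: (p_b - p_a)*s_a + p_b*(s_b - s_a) == p_b*s_b - p_a*s_a.
conv_part = (p_b - p_a) * s_a    # lift from converting more visitors
spend_part = p_b * (s_b - s_a)   # lift from converters spending more

print(f"total lift      ${total:.3f}/visitor")
print(f"from conversion ${conv_part:.3f}")
print(f"from spend      ${spend_part:.3f}")
```

Here most of the lift comes from conversion, not basket size—the kind of distinction that calls for different responses (acquisition funnel vs. merchandising).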

The Stack

Each storefront shares the same analysis backbone. Insights compound across the portfolio. No per-transaction fees, no vendor lock-in—your data lives in your databases.

 ┌─────────────────────────────────────────────────────────┐
 │  Storefront          SSR, experiment-aware routing,     │
 │                      edge deploy (Cloudflare Workers)   │
 ├─────────────────────────────────────────────────────────┤
 │  Commerce            Medusa v2, custom vertical modules │
 ├─────────────────────────────────────────────────────────┤
 │  Operations          Odoo 17, bidirectional sync        │
 ├─────────────────────────────────────────────────────────┤
 │  Analytics           Umami + dbt → PostgreSQL           │
 ├─────────────────────────────────────────────────────────┤
 │  Inference           bartz-BCF + PyMC, hurdle models,   │
 │                      SBC, sequential monitoring         │
 ├─────────────────────────────────────────────────────────┤
 │  Infrastructure      Docker + Caddy / Cloudflare edge   │
 │                      €45/mo runs everything             │
 └─────────────────────────────────────────────────────────┘

Work With Me

I deploy Stoa for clients and run their experimentation programs—compounding learning across engagements. If you're running e-commerce and want experimentation that actually tells you something, or you're building in this space and want to talk architecture, I'd like to hear from you.