# The Experiment Decision Journey
What questions operators need to answer at each phase of the virtuous loop, and what tools answer them.
## The Loop, Annotated
```
        ┌──────────────────────────────────────────────────────┐
        │                                                      │
        ▼                                                      │
     SEGMENT ─────► EXPERIMENT ─────► ANALYZE ─────► DISCOVER  │
    "Who are       "What should      "Did it        "Where     │
     our            we test,          work?"         does it   │
     customers?"    for whom?"                       differ?"  │
        ▲                                               │      │
        │               ┌───────────────────────────────┤      │
        │               ▼                               ▼      │
        │        REFINE SEGMENTS                     ASK WHY   │
        │        "What new                           "Why did  │
        │         boundary                          that group │
        │         did we find?"                      respond   │
        │               │                        differently?" │
        │               └───────────────┬───────────────┘      │
        │                               ▼                      │
        └──────────────────────── RICHER MODEL ────────────────┘
                          "What do we understand now
                           that we didn't before?"
```
## Phase 1: SEGMENT — "Who are our customers?"
The operator needs to understand the current customer landscape before designing experiments.

| Question | What answers it |
|---|---|
| What segments exist today? | Segment definitions (rules + sizes) |
| How big is each segment? | Visitor counts per segment |
| Do segments behave differently? | Conversion rates, AOV, session depth by segment |
| Are segments targetable? | Whether attributes are available at assignment time |
| Is a segment large enough to experiment on? | Statistical power analysis |
Key insight: Segments are a practice, not a model. The quality of segments improves through loop iterations. First iteration uses rough cuts (new vs. returning). Later iterations incorporate boundaries discovered by Tyche.
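The "large enough to experiment on" check can be sketched with the standard two-proportion, normal-approximation sample-size formula. This is a minimal stdlib-only sketch, not the project's actual power tooling; the `visitors_needed` name and its defaults are illustrative.

```python
import math
from statistics import NormalDist

def visitors_needed(base_rate: float, rel_mde: float,
                    alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm visitors needed to detect a relative lift of rel_mde
    over base_rate with a two-proportion z-test (normal approximation)."""
    p1 = base_rate
    p2 = base_rate * (1 + rel_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided test
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# A 3%-converting segment needs tens of thousands of visitors per arm
# to detect a +10% relative lift -- which rules out small segments early.
n = visitors_needed(0.03, 0.10)
```

The relative (not absolute) minimum detectable effect matters here: low-conversion segments need dramatically more traffic for the same relative lift.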
## Phase 2: EXPERIMENT — "What should we test, for whom?"
The operator designs an experiment: choosing a surface, forming a hypothesis, selecting metrics, and configuring targeting.

| Question | What answers it |
|---|---|
| What surfaces are available to test? | Surface catalog (18 surfaces across 7 journey stages) |
| What's our hypothesis? | Domain knowledge + prior discoveries |
| What metric should we optimize? | Outcome metric taxonomy (11 metrics, 4 categories) |
| Are we measuring both revenue AND satisfaction? | Dual-metric enforcement |
| Which segments should see this experiment? | Targeting rules from prior discoveries |
| How much traffic can we allocate? | Power requirements vs. available traffic |
Key insight: The experiment config is a contract between the operator and the system. It declares: "for visitors matching these targeting rules, deliver these variants, and measure these metrics." The system enforces the contract.
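A sketch of what enforcing that contract might look like. Every field and metric name below is an illustrative assumption, not the real schema; the point is that the config declares targeting, variants, and metrics, and validation rejects configs that skip the dual-metric rule.

```python
import dataclasses
from dataclasses import dataclass

# Illustrative metric taxonomy -- names are assumptions, not the real catalog.
REVENUE_METRICS = {"revenue_per_visitor", "conversion_rate", "aov_lift"}
SATISFACTION_METRICS = {"csat", "survey_sentiment", "support_contact_rate"}

@dataclass(frozen=True)
class ExperimentConfig:
    surface: str        # which cataloged surface to vary
    hypothesis: str     # what we expect, and for whom
    targeting: dict     # attribute -> allowed values, checked at assignment time
    variants: tuple     # variant identifiers, control first
    metrics: frozenset  # outcome metrics to record

    def validate(self) -> None:
        """Dual-metric enforcement: the contract must measure both
        revenue AND satisfaction, or the system rejects it."""
        if not self.metrics & REVENUE_METRICS:
            raise ValueError("config needs at least one revenue metric")
        if not self.metrics & SATISFACTION_METRICS:
            raise ValueError("config needs at least one satisfaction metric")

config = ExperimentConfig(
    surface="pdp_recommendation_shelf",
    hypothesis="Returning customers convert more with social proof",
    targeting={"visitor_type": ["returning"]},
    variants=("control", "social_proof"),
    metrics=frozenset({"conversion_rate", "csat"}),
)
config.validate()  # passes: one revenue metric, one satisfaction metric
```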
## Phase 3: ANALYZE — "Did it work?"
The experiment has run. The operator needs to understand the overall result.

| Question | What answers it |
|---|---|
| Did the variant win? | Posterior probability P(lift > 0) |
| By how much? | Lift distribution (%, $) |
| How confident are we? | Credible interval width, posterior density shape |
| What's the risk of shipping? | Expected loss distribution |
| Is it worth shipping at this threshold? | P(lift > threshold) for business-relevant thresholds |
| Should we keep running? | Sequential monitoring recommendation (stop/continue/ship) |
| What's the decomposition? | Frequency (conversion rate) vs. severity (AOV) lift |
Key insight: The answer to "did it work?" is always a distribution, not a yes/no. The expected loss distribution makes the decision explicit: "shipping variant B when it's actually worse would cost us $X/visitor, and there's a Y% chance of that."
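A minimal Monte Carlo sketch of that distributional answer, assuming a simple Beta-Binomial conversion model with uniform priors. The real engine's model and this `analyze` interface are assumptions; the sketch only shows how P(lift > threshold) and expected loss fall out of the same posterior draws.

```python
import random

def analyze(conv_a: int, n_a: int, conv_b: int, n_b: int,
            threshold: float = 0.0, draws: int = 100_000, seed: int = 7):
    """Summarize an A/B conversion test under Beta(1, 1) priors:
    P(relative lift > threshold), and the expected loss of shipping B,
    i.e. the conversion given up per visitor when B is actually worse."""
    rng = random.Random(seed)
    wins = 0
    loss = 0.0
    for _ in range(draws):
        a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)  # posterior draw, control
        b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)  # posterior draw, variant
        if (b - a) / a > threshold:
            wins += 1
        loss += max(a - b, 0.0)  # only draws where B is worse contribute
    return wins / draws, loss / draws

p_win, expected_loss = analyze(300, 10_000, 350, 10_000)
# p_win answers "did it win?"; expected_loss answers "what's the risk of shipping?"
```

Running the same function with a business-relevant `threshold` answers "is it worth shipping at this threshold?" from the identical posterior.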
## Phase 4: DISCOVER — "Where does it differ?"
Beyond the overall result, the operator needs to understand treatment effect heterogeneity.

| Question | What answers it |
|---|---|
| Does the effect vary by segment? | GATE estimates per discovered segment |
| Are there segments we didn't hypothesize? | CausalForest policy tree segmentation |
| How confident are we in segment-level effects? | Per-segment posterior width |
| Is the segment large enough to act on? | Segment size + posterior precision |
| Is the segment stable? | Bootstrap stability score (≥0.80 threshold) |
| What SQL captures this group? | Auto-generated WHERE clause from feature rules |
Key insight: The n-problem is the central tension. Small segments have wide posteriors — that's uncertainty made visible, not a failure. "We found a difference for returning customers, but we only have 47 visitors in that segment, so the posterior is wide." This drives the next iteration: target that segment specifically to narrow the posterior.
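The n-problem is easy to see directly: under the same Beta-Binomial model, the 95% credible interval on a small segment's conversion rate is several times wider than on a large one. A stdlib sketch (the 47-visitor figure echoes the example above; the function name is illustrative):

```python
import random

def credible_interval_width(conversions: int, visitors: int,
                            draws: int = 20_000, seed: int = 11) -> float:
    """Width of the central 95% credible interval for a segment's
    conversion rate under a Beta(1, 1) prior."""
    rng = random.Random(seed)
    samples = sorted(
        rng.betavariate(1 + conversions, 1 + visitors - conversions)
        for _ in range(draws)
    )
    return samples[int(0.975 * draws)] - samples[int(0.025 * draws)]

# The same ~8.5% observed rate, at two segment sizes:
wide = credible_interval_width(4, 47)         # 47 returning customers
narrow = credible_interval_width(400, 4_700)  # 100x the traffic
# The wide posterior is uncertainty made visible -- the cue to target
# this segment specifically on the next loop iteration.
```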
## Phase 5: REFINE — "What boundary did we find?"
Discovery produced segment-level results. The operator decides which discoveries to incorporate.

| Question | What answers it |
|---|---|
| Should this become a formal segment? | GATE magnitude + stability + segment size + business relevance |
| What's the targeting rule? | Auto-generated from feature rules |
| Can the storefront evaluate this at assignment time? | Whether attributes are in assignable_attributes |
| What experiment should target this segment next? | The question that the discovery raised |
Key insight: This is where the loop actually closes — or doesn't. The segment registry connects Tyche's discovery output to dbt's segment rules and the experiment engine's targeting. Without it, the operator must manually translate between three representations of the same concept.
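A sketch of the "auto-generated WHERE clause" step: translating discovered feature rules into SQL the segment registry can hand to dbt. The rule-tuple shape and the naive quoting are illustrative; real generation should escape values through the query layer.

```python
def rules_to_where(rules) -> str:
    """Render (feature, operator, value) rules as a SQL WHERE clause.
    NOTE: naive string quoting for illustration only -- never interpolate
    untrusted values like this in production."""
    clauses = []
    for feature, op, value in rules:
        if op == "in":
            rendered = ", ".join(f"'{v}'" for v in value)
            clauses.append(f"{feature} IN ({rendered})")
        else:
            clauses.append(f"{feature} {op} {value}")
    return "WHERE " + "\n  AND ".join(clauses)

sql = rules_to_where([
    ("orders_90d", ">=", 2),  # boundary discovered by the policy tree
    ("device_class", "in", ["mobile", "tablet"]),
])
# WHERE orders_90d >= 2
#   AND device_class IN ('mobile', 'tablet')
```

Keeping the rules as data (not as a SQL string) is what lets the same discovery drive dbt segment rules and the experiment engine's targeting without manual translation.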
## Phase 6: ASK WHY — "Why did that group respond differently?"
Not every discovery has an obvious explanation. The operator needs qualitative signal.

| Question | What answers it |
|---|---|
| Why did segment X respond differently? | Qualitative feedback from that segment |
| What do these customers have in common? | Open-ended survey responses |
| Is there a confound we're not measuring? | Domain knowledge + qualitative signal |
Key insight: This phase is the human part of the loop. Quantitative analysis tells you what happened; qualitative feedback tells you why. The VoC system enriches the loop but doesn't block it — the loop can turn without it, but turns better with it.
## Phase 7: RICHER MODEL — "What do we understand now?"
After one or more loop iterations, the operator reflects on what's changed.

| Question | What answers it |
|---|---|
| What segments have we added through discovery? | Segment registry changelog |
| How has our model of customer behavior evolved? | Comparing early vs. current segment definitions |
| What questions remain open? | Discoveries without clear next steps |
| What's the next most valuable experiment? | Operator judgment, informed by accumulated evidence |
Key insight: The richer model is not a deliverable — it's an emergent property of the loop turning. Each iteration deposits understanding. The value compounds. Iteration 5's experiment is qualitatively different from iteration 1's because the segmentation model, the hypotheses, and the operator's intuition have all improved.