# The Experiment Decision Journey
What questions operators need to answer at each phase of the virtuous loop, and what tools answer them.
## The Loop, Annotated
```
        ┌──────────────────────────────────────────────────────┐
        │                                                      │
        ▼                                                      │
     SEGMENT ─────► EXPERIMENT ─────► ANALYZE ─────► DISCOVER  │
    "Who are       "What should      "Did it        "Where     │
     our            we test,          work?"         does it   │
     customers?"    for whom?"                       differ?"  │
        ▲                                               │      │
        │               ┌───────────────────────────────┤      │
        │               ▼                               ▼      │
        │        REFINE SEGMENTS                     ASK WHY   │
        │        "What new                           "Why did  │
        │         boundary                          that group │
        │         did we find?"                      respond   │
        │               │                        differently?" │
        │               └───────────────┬───────────────┘      │
        │                               ▼                      │
        └──────────────────────── RICHER MODEL ────────────────┘
                          "What do we understand now
                           that we didn't before?"
```
## Phase 1: SEGMENT — "Who are our customers?"
The operator needs to understand the current customer landscape before designing experiments.

| Question | What answers it |
|---|---|
| What segments exist today? | Segment definitions (rules + sizes) |
| How big is each segment? | Visitor counts per segment |
| Do segments behave differently? | Conversion rates, AOV, session depth by segment |
| Are segments targetable? | Whether attributes are available at assignment time |
| Is a segment large enough to experiment on? | Statistical power analysis |
Key insight: Segments are a practice, not a model. The quality of segments improves through loop iterations. First iteration uses rough cuts (new vs. returning). Later iterations incorporate boundaries discovered by Tyche.
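The "large enough to experiment on" check can be sketched with the standard two-proportion, normal-approximation sample-size formula. This is a minimal stdlib-only sketch, not the project's actual power tooling; the `visitors_needed` name and its defaults are illustrative.

```python
import math
from statistics import NormalDist

def visitors_needed(base_rate: float, rel_mde: float,
                    alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm visitors needed to detect a relative lift of rel_mde
    over base_rate with a two-proportion z-test (normal approximation)."""
    p1 = base_rate
    p2 = base_rate * (1 + rel_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided test
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# A 3%-converting segment needs tens of thousands of visitors per arm
# to detect a +10% relative lift -- which rules out small segments early.
n = visitors_needed(0.03, 0.10)
```

The relative (not absolute) minimum detectable effect matters here: low-conversion segments need dramatically more traffic for the same relative lift.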
## Phase 2: EXPERIMENT — "What should we test, for whom?"
The operator designs an experiment: choosing a surface, forming a hypothesis, selecting metrics, and configuring targeting.

| Question | What answers it |
|---|---|
| What surfaces are available to test? | Surface catalog (18 surfaces across 7 journey stages) |
| What's our hypothesis? | Domain knowledge + prior discoveries |
| What metric should we optimize? | Outcome metric taxonomy (11 metrics, 4 categories) |
| Are we measuring both revenue AND satisfaction? | Dual-metric enforcement |
| Which segments should see this experiment? | Targeting rules from prior discoveries |
| How much traffic can we allocate? | Power requirements vs. available traffic |
Key insight: The experiment config is a contract between the operator and the system. It declares: "for visitors matching these targeting rules, deliver these variants, and measure these metrics." The system enforces the contract.
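A sketch of what enforcing that contract might look like. Every field and metric name below is an illustrative assumption, not the real schema; the point is that the config declares targeting, variants, and metrics, and validation rejects configs that skip the dual-metric rule.

```python
import dataclasses
from dataclasses import dataclass

# Illustrative metric taxonomy -- names are assumptions, not the real catalog.
REVENUE_METRICS = {"revenue_per_visitor", "conversion_rate", "aov_lift"}
SATISFACTION_METRICS = {"csat", "survey_sentiment", "support_contact_rate"}

@dataclass(frozen=True)
class ExperimentConfig:
    surface: str        # which cataloged surface to vary
    hypothesis: str     # what we expect, and for whom
    targeting: dict     # attribute -> allowed values, checked at assignment time
    variants: tuple     # variant identifiers, control first
    metrics: frozenset  # outcome metrics to record

    def validate(self) -> None:
        """Dual-metric enforcement: the contract must measure both
        revenue AND satisfaction, or the system rejects it."""
        if not self.metrics & REVENUE_METRICS:
            raise ValueError("config needs at least one revenue metric")
        if not self.metrics & SATISFACTION_METRICS:
            raise ValueError("config needs at least one satisfaction metric")

config = ExperimentConfig(
    surface="pdp_recommendation_shelf",
    hypothesis="Returning customers convert more with social proof",
    targeting={"visitor_type": ["returning"]},
    variants=("control", "social_proof"),
    metrics=frozenset({"conversion_rate", "csat"}),
)
config.validate()  # passes: one revenue metric, one satisfaction metric
```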
## Phase 3: ANALYZE — "Did it work?"
The experiment has run. The operator needs to understand the overall result.

| Question | What answers it |
|---|---|
| Did the variant win? | Posterior probability P(lift > 0) |
| By how much? | Lift distribution (%, $) |
| How confident are we? | Credible interval width, posterior density shape |
| What's the risk of shipping? | Expected loss distribution |
| Is it worth shipping at this threshold? | P(lift > threshold) for business-relevant thresholds |
| Should we keep running? | Sequential monitoring recommendation (stop/continue/ship) |
| What's the decomposition? | Frequency (conversion rate) vs. severity (AOV) lift |
Key insight: The answer to "did it work?" is always a distribution, not a yes/no. The expected loss distribution makes the decision explicit: "shipping variant B when it's actually worse would cost us $X/visitor, and there's a Y% chance of that."
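A minimal Monte Carlo sketch of that distributional answer, assuming a simple Beta-Binomial conversion model with uniform priors. The real engine's model and this `analyze` interface are assumptions; the sketch only shows how P(lift > threshold) and expected loss fall out of the same posterior draws.

```python
import random

def analyze(conv_a: int, n_a: int, conv_b: int, n_b: int,
            threshold: float = 0.0, draws: int = 100_000, seed: int = 7):
    """Summarize an A/B conversion test under Beta(1, 1) priors:
    P(relative lift > threshold), and the expected loss of shipping B,
    i.e. the conversion given up per visitor when B is actually worse."""
    rng = random.Random(seed)
    wins = 0
    loss = 0.0
    for _ in range(draws):
        a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)  # posterior draw, control
        b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)  # posterior draw, variant
        if (b - a) / a > threshold:
            wins += 1
        loss += max(a - b, 0.0)  # only draws where B is worse contribute
    return wins / draws, loss / draws

p_win, expected_loss = analyze(300, 10_000, 350, 10_000)
# p_win answers "did it win?"; expected_loss answers "what's the risk of shipping?"
```

Running the same function with a business-relevant `threshold` answers "is it worth shipping at this threshold?" from the identical posterior.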
## Phase 4: DISCOVER — "Where does it differ?"
Beyond the overall result, the operator needs to understand treatment effect heterogeneity.

| Question | What answers it |
|---|---|
| Does the effect vary by segment? | GATE estimates per discovered segment |
| Are there segments we didn't hypothesize? | CausalForest policy tree segmentation |
| How confident are we in segment-level effects? | Per-segment posterior width |
| Is the segment large enough to act on? | Segment size + posterior precision |
| Is the segment stable? | Bootstrap stability score (≥0.80 threshold) |
| What SQL captures this group? | Auto-generated WHERE clause from feature rules |
Key insight: The n-problem is the central tension. Small segments have wide posteriors — that's uncertainty made visible, not a failure. "We found a difference for returning customers, but we only have 47 visitors in that segment, so the posterior is wide." This drives the next iteration: target that segment specifically to narrow the posterior.
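The n-problem is easy to see directly: under the same Beta-Binomial model, the 95% credible interval on a small segment's conversion rate is several times wider than on a large one. A stdlib sketch (the 47-visitor figure echoes the example above; the function name is illustrative):

```python
import random

def credible_interval_width(conversions: int, visitors: int,
                            draws: int = 20_000, seed: int = 11) -> float:
    """Width of the central 95% credible interval for a segment's
    conversion rate under a Beta(1, 1) prior."""
    rng = random.Random(seed)
    samples = sorted(
        rng.betavariate(1 + conversions, 1 + visitors - conversions)
        for _ in range(draws)
    )
    return samples[int(0.975 * draws)] - samples[int(0.025 * draws)]

# The same ~8.5% observed rate, at two segment sizes:
wide = credible_interval_width(4, 47)         # 47 returning customers
narrow = credible_interval_width(400, 4_700)  # 100x the traffic
# The wide posterior is uncertainty made visible -- the cue to target
# this segment specifically on the next loop iteration.
```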
## Phase 5: REFINE — "What boundary did we find?"
Discovery produced segment-level results. The operator decides which discoveries to incorporate.

| Question | What answers it |
|---|---|
| Should this become a formal segment? | GATE magnitude + stability + segment size + business relevance |
| What's the targeting rule? | Auto-generated from feature rules |
| Can the storefront evaluate this at assignment time? | Whether attributes are in assignable_attributes |
| What experiment should target this segment next? | The question that the discovery raised |
Key insight: This is where the loop actually closes — or doesn't. The segment registry connects Tyche's discovery output to dbt's segment rules and the experiment engine's targeting. Without it, the operator must manually translate between three representations of the same concept.
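A sketch of the "auto-generated WHERE clause" step: translating discovered feature rules into SQL the segment registry can hand to dbt. The rule-tuple shape and the naive quoting are illustrative; real generation should escape values through the query layer.

```python
def rules_to_where(rules) -> str:
    """Render (feature, operator, value) rules as a SQL WHERE clause.
    NOTE: naive string quoting for illustration only -- never interpolate
    untrusted values like this in production."""
    clauses = []
    for feature, op, value in rules:
        if op == "in":
            rendered = ", ".join(f"'{v}'" for v in value)
            clauses.append(f"{feature} IN ({rendered})")
        else:
            clauses.append(f"{feature} {op} {value}")
    return "WHERE " + "\n  AND ".join(clauses)

sql = rules_to_where([
    ("orders_90d", ">=", 2),  # boundary discovered by the policy tree
    ("device_class", "in", ["mobile", "tablet"]),
])
# WHERE orders_90d >= 2
#   AND device_class IN ('mobile', 'tablet')
```

Keeping the rules as data (not as a SQL string) is what lets the same discovery drive dbt segment rules and the experiment engine's targeting without manual translation.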
## Phase 6: ASK WHY — "Why did that group respond differently?"
Not every discovery has an obvious explanation. The operator needs qualitative signal.

| Question | What answers it |
|---|---|
| Why did segment X respond differently? | Qualitative feedback from that segment |
| What do these customers have in common? | Open-ended survey responses |
| Is there a confound we're not measuring? | Domain knowledge + qualitative signal |
Key insight: This phase is the human part of the loop. Quantitative analysis tells you what happened; qualitative feedback tells you why. The VoC system enriches the loop but doesn't block it — the loop can turn without it, but turns better with it.
## Phase 7: RICHER MODEL — "What do we understand now?"
After one or more loop iterations, the operator reflects on what's changed.

| Question | What answers it |
|---|---|
| What segments have we added through discovery? | Segment registry changelog |
| How has our model of customer behavior evolved? | Comparing early vs. current segment definitions |
| What questions remain open? | Discoveries without clear next steps |
| What's the next most valuable experiment? | Operator judgment, informed by accumulated evidence |
Key insight: The richer model is not a deliverable — it's an emergent property of the loop turning. Each iteration deposits understanding. The value compounds. Iteration 5's experiment is qualitatively different from iteration 1's because the segmentation model, the hypotheses, and the operator's intuition have all improved.