Experimentation Framework
An experimentation framework is a repeatable system for planning, running, and learning from tests (e.g., A/B, feature flags, holdouts). It covers how you form hypotheses, choose success metrics, size samples, run the test, analyze results, decide, and document learnings so teams make decisions based on evidence, not opinions. Industry guides describe it as a structured, data-driven approach used across product and marketing.
A strong framework also defines an OEC (Overall Evaluation Criterion)—the primary “north-star” metric you optimize plus guardrail metrics to protect user experience and long-term value. Choosing an OEC early and aligning on it is a hallmark of trustworthy experimentation.
Why It Matters
Better decisions, faster: Standard steps reduce bias and speed iteration.
Trustworthy results: Built-in checks (e.g., SRM tests for traffic imbalance, no “peeking” mid-test unless you use sequential methods) prevent false wins; a minimal SRM check is sketched just after this list.
More power with less traffic: Techniques like CUPED (variance reduction) improve sensitivity so you detect real effects sooner.
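To make the SRM check concrete, here is a minimal sketch using a chi-square goodness-of-fit test against the planned split. The function name, the example counts, and the 0.001 threshold are illustrative assumptions, not any particular platform's API.

# Minimal SRM check: compare observed assignment counts to the planned split
# with a chi-square goodness-of-fit test. A very small p-value (commonly
# < 0.001) flags a likely Sample Ratio Mismatch worth investigating.
from scipy.stats import chisquare

def srm_check(observed_counts, expected_ratios, alpha=0.001):
    """observed_counts: e.g., [50210, 49790]; expected_ratios: e.g., [0.5, 0.5]."""
    total = sum(observed_counts)
    expected = [r * total for r in expected_ratios]
    stat, p_value = chisquare(f_obs=observed_counts, f_exp=expected)
    return p_value, p_value < alpha  # (p-value, SRM suspected?)

p, srm_suspected = srm_check([50210, 49790], [0.5, 0.5])
print(f"SRM p-value: {p:.4f}, investigate: {srm_suspected}")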
Examples (What a Simple Framework Looks Like)
Hypothesis: “If we add trust badges near the CTA, checkout conversion will increase because buyers feel safer.”
Metrics: OEC: purchase conversion. Guardrails: add-to-cart rate, refund rate, page speed.
Design: 50/50 split; power/sample size calculated up front; fixed horizon or a sequential method (no ad-hoc peeking). See the sample-size sketch after this list.
QA: Validate firing/eligibility; check for Sample Ratio Mismatch (SRM) after launch.
Run & analyze: Use CUPED where available; read OEC + guardrails; decide ship/iterate/kill.
Document & share: Log setup, results, and next steps in your experiment library (improves program maturity).
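As a rough illustration of the sizing step for the checkout example, here is a sketch using statsmodels' normal-approximation power calculation. The 3.0% baseline conversion and 0.3-point absolute MDE are assumed numbers chosen for illustration.

# Sample-size sketch: how many visitors per arm are needed to detect the
# stated MDE with 80% power at alpha = 0.05 (two-sided).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.030           # current purchase conversion (assumed)
mde_abs = 0.003            # smallest absolute lift worth detecting (assumed)
effect = proportion_effectsize(baseline + mde_abs, baseline)  # Cohen's h

n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80,
    ratio=1.0, alternative="two-sided",
)
print(f"~{int(round(n_per_arm)):,} visitors per arm for 80% power")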
Best Practices
Define your OEC & guardrails before testing (e.g., revenue per user as the OEC, with long-term retention as a guardrail against “short-term wins”).
Pre-plan stats: Choose fixed-horizon (don’t peek) or sequential testing (allows continuous looks with proper error control).
Size for power: Use a sample-size calculator; write the minimum detectable effect (MDE) into the plan.
Run validity checks: Watch for SRM; investigate allocation bugs or eligibility leaks if detected.
Increase sensitivity: Apply variance reduction (e.g., CUPED) when your platform supports it; a minimal CUPED sketch follows this list.
Balance short vs. long-term: Don’t pick OECs that reward short-term spikes but hurt lifetime value.
Build program maturity: Standardize intake, prioritization, QA, decision rules, and a results library to scale learning.
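Here is a minimal sketch of the CUPED adjustment: each user's in-experiment metric is adjusted using a pre-experiment covariate (often the same metric measured before the test), which shrinks variance without biasing the treatment comparison. The simulated numbers are purely illustrative.

# Minimal CUPED sketch: theta is estimated on data pooled across arms,
# then the adjusted metric is compared between arms as usual.
import numpy as np

def cuped_adjust(y, x):
    """y: in-experiment metric per user; x: pre-experiment covariate per user."""
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

# Illustrative usage with simulated data (pre-period spend predicts spend).
rng = np.random.default_rng(0)
x = rng.normal(10, 3, 20_000)              # pre-experiment spend
y = 0.8 * x + rng.normal(0, 2, 20_000)     # in-experiment spend
y_adj = cuped_adjust(y, x)
print(f"variance: raw {y.var():.2f} -> CUPED {y_adj.var():.2f}")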
Related Terms
A/B Testing / Controlled Experiments
FAQs
Q1. What is an OEC and why is it important?
It’s the primary metric you agree to optimize (e.g., revenue/user, retention). Clear OECs align teams and prevent “metric shopping.”
Q2. Fixed-horizon vs. sequential tests: what’s the difference?
Fixed-horizon requires running to the planned sample and not peeking; sequential lets you monitor continuously while controlling error rates (as used in Optimizely’s Stats Engine).
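To make the fixed-horizon side concrete, here is a sketch of a single post-hoc z-test on the OEC once the pre-planned sample is reached. The conversion counts are made up; sequential methods with always-valid error control are not shown here.

# Fixed-horizon analysis sketch: one two-sided z-test on conversion counts.
from statsmodels.stats.proportion import proportions_ztest

conversions = [1530, 1621]        # control, treatment purchases (assumed)
visitors = [50_000, 50_000]       # pre-planned sample per arm (assumed)
z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors,
                                    alternative="two-sided")
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")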
Q3. What is SRM?
Sample Ratio Mismatch means the observed traffic split differs statistically from what was planned (e.g., 50/50), often due to bugs or targeting/eligibility issues. It signals the test isn’t trustworthy until the cause is found and fixed.
Q4. How do we get significance with limited traffic?
Reduce noise with CUPED, run the test longer, or test bigger changes (a larger MDE needs less traffic). CUPED uses pre-experiment data to shrink variance.
Q5. What belongs in our experiment “template”?
Hypothesis, OEC & guardrails, audience/eligibility, assignment & split, sample size/power & MDE, run rules (duration/peeking policy), analysis plan (one- or two-tailed, corrections), decision rules, and documentation links.
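One lightweight way to standardize that template is as a structured record every experiment must fill in. Below is a sketch as a Python dataclass; the field names and filled-in values are illustrative (loosely echoing the checkout example above), not a specific tool's schema.

# Illustrative experiment-plan template; adapt fields to your own program.
from dataclasses import dataclass, field

@dataclass
class ExperimentPlan:
    hypothesis: str
    oec: str
    guardrails: list[str]
    audience: str                  # eligibility / targeting rules
    assignment: str                # e.g., "50/50, user-level randomization"
    mde: float                     # minimum detectable effect
    sample_size_per_arm: int
    run_rules: str                 # duration, peeking policy
    analysis_plan: str             # test type, tails, corrections
    decision_rule: str             # ship / iterate / kill criteria
    doc_links: list[str] = field(default_factory=list)

plan = ExperimentPlan(
    hypothesis="Trust badges near the CTA lift checkout conversion",
    oec="purchase conversion",
    guardrails=["add-to-cart rate", "refund rate", "page speed"],
    audience="all visitors reaching checkout",
    assignment="50/50, user-level randomization",
    mde=0.003,
    sample_size_per_arm=53_000,
    run_rules="fixed horizon, no peeking",
    analysis_plan="two-sided z-test on OEC; guardrails reviewed",
    decision_rule="ship if OEC improves and guardrails hold",
)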