ab-test-setup

Structured guide for setting up A/B tests with mandatory gates for hypothesis, metrics, and execution readiness.

A/B Test Setup

1️⃣ Purpose & Scope

Ensure every A/B test is valid, rigorous, and safe before a single line of code is written.

  • Prevents "peeking" (judging or stopping the test on interim results)

  • Enforces statistical power

  • Blocks invalid hypotheses

2️⃣ Prerequisites

You must have:

  • A clear user problem

  • Access to an analytics source

  • Roughly estimated traffic volume

Hypothesis Quality Checklist

A valid hypothesis includes (a worked example follows the list):

  • Observation or evidence

  • Single, specific change

  • Directional expectation

  • Defined audience

  • Measurable success criteria
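
To make the checklist concrete, here is a minimal sketch of a hypothesis captured as a structured record. The field names and example values are hypothetical, not a required format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: once written, the hypothesis does not change
class Hypothesis:
    observation: str     # evidence that motivates the test
    change: str          # the single, specific change
    direction: str       # expected direction of effect
    audience: str        # defined audience
    primary_metric: str  # measurable success criterion
    mde_relative: float  # Minimum Detectable Effect, e.g. 0.05 = 5% relative lift

# Illustrative example (all values hypothetical):
checkout_test = Hypothesis(
    observation="40% of mobile users abandon at the shipping form",
    change="reduce the shipping form from 8 fields to 4",
    direction="increase",
    audience="mobile visitors entering checkout",
    primary_metric="checkout completion rate",
    mde_relative=0.05,
)
```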

3️⃣ Hypothesis Lock (Hard Gate)

Before designing variants or metrics, you MUST:

  • Present the final hypothesis

  • Specify:
    - Target audience
    - Primary metric
    - Expected direction of effect
    - Minimum Detectable Effect (MDE)

Ask explicitly:

> “Is this the final hypothesis we are committing to for this test?”

Do NOT proceed until confirmed.


4️⃣ Assumptions & Validity Check (Mandatory)

Explicitly list assumptions about:

  • Traffic stability

  • User independence

  • Metric reliability

  • Randomization quality

  • External factors (seasonality, campaigns, releases)

If assumptions are weak or violated:

  • Warn the user

  • Recommend delaying or redesigning the test
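
Randomization quality in particular can be validated once data starts flowing, via a sample ratio mismatch (SRM) check. A minimal sketch, assuming a planned 50/50 split and using SciPy's chi-square test; the counts are hypothetical.

```python
from scipy.stats import chisquare

def srm_check(control_n: int, variant_n: int, alpha: float = 0.001) -> bool:
    """Return True if the observed split is consistent with a 50/50 allocation."""
    total = control_n + variant_n
    _, p_value = chisquare([control_n, variant_n], f_exp=[total / 2, total / 2])
    return p_value >= alpha  # a tiny p-value signals a sample ratio mismatch

# Hypothetical counts: this imbalance fails the check, so the assignment
# mechanism should be investigated before trusting any results.
print(srm_check(50_412, 48_901))  # False
```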

5️⃣ Test Type Selection

Choose the simplest valid test:

  • A/B Test – single change, two variants

  • A/B/n Test – multiple variants, higher traffic required

  • Multivariate Test (MVT) – interaction effects, very high traffic

  • Split URL Test – major structural changes

Default to A/B unless there is a clear reason otherwise.


6️⃣ Metrics Definition

Primary Metric (Mandatory)

  • Single metric used to evaluate success

  • Directly tied to the hypothesis

  • Pre-defined and frozen before launch

Secondary Metrics

  • Provide context

  • Explain _why_ results occurred

  • Must not override the primary metric

Guardrail Metrics

  • Metrics that must not degrade

  • Used to prevent harmful wins

  • Trigger test stop if significantly negative
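
Writing all three tiers down before launch keeps them from shifting mid-test. One possible shape for such a record, with hypothetical metric names and thresholds:

```python
# Hypothetical metric plan, frozen before launch.
metric_plan = {
    "primary": "checkout_completion_rate",  # the single success criterion
    "secondary": [                          # context only; never override primary
        "add_to_cart_rate",
        "time_to_purchase_seconds",
    ],
    "guardrails": {
        # relative drop that triggers a test stop (for both, lower is worse)
        "average_order_value": -0.02,
        "pages_per_session": -0.05,
    },
}
```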

7️⃣ Sample Size & Duration

Define upfront:

  • Baseline rate

  • MDE

  • Significance level (typically α = 0.05, i.e., 95% confidence)

  • Statistical power (typically 80%)

Estimate:

  • Required sample size per variant

  • Expected test duration

Do NOT proceed without a realistic sample size estimate (a sample calculation is sketched below).
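
A minimal sketch of the estimate using statsmodels' two-proportion power calculation. The baseline rate, MDE, and traffic figures are hypothetical.

```python
import math

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10        # hypothetical baseline conversion rate
mde_relative = 0.05    # smallest lift worth detecting: 5% relative
target = baseline * (1 + mde_relative)

# Cohen's h effect size for comparing two proportions
effect_size = proportion_effectsize(target, baseline)

n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,        # significance level (95% confidence)
    power=0.80,        # statistical power
    alternative="two-sided",
)

daily_traffic = 4_000  # hypothetical eligible users per day across both variants
duration_days = math.ceil(2 * n_per_variant / daily_traffic)
print(f"{math.ceil(n_per_variant):,} users per variant, roughly {duration_days} days")
```

With these illustrative numbers the test needs roughly 28,500 users per variant, or about two weeks of traffic; if that duration is unrealistic, revisit the MDE rather than shortening the test.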


8️⃣ Execution Readiness Gate (Hard Stop)

You may proceed to implementation only if all of the following are true:

  • Hypothesis is locked

  • Primary metric is frozen

  • Sample size is calculated

  • Test duration is defined

  • Guardrails are set

  • Tracking is verified

If any item is missing, stop and resolve it.
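
If it helps, the gate is mechanical enough to check in code; the item names below simply restate the list above, and the flags are hypothetical.

```python
# Hypothetical readiness gate: every item must be True before implementation.
launch_gate = {
    "hypothesis_locked": True,
    "primary_metric_frozen": True,
    "sample_size_calculated": True,
    "test_duration_defined": True,
    "guardrails_set": True,
    "tracking_verified": False,  # still unverified, so the launch is blocked
}

blocked_on = [item for item, done in launch_gate.items() if not done]
if blocked_on:
    print("Stop and resolve before implementation:", blocked_on)
```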


Running the Test

During the Test

DO:

  • Monitor technical health

  • Document external factors

DO NOT:

  • Stop early due to “good-looking” results

  • Change variants mid-test

  • Add new traffic sources

  • Redefine success criteria
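
Guardrail monitoring is the one legitimate reason to look at interim data: its purpose is stopping harm, not declaring an early winner. A simplified sketch, reusing the relative-drop thresholds from the hypothetical metric plan above; a real check would also require the degradation to be statistically significant.

```python
def guardrail_breached(control_value: float, variant_value: float,
                       max_relative_drop: float) -> bool:
    """True if the variant degrades a guardrail metric beyond its threshold."""
    relative_change = (variant_value - control_value) / control_value
    return relative_change < max_relative_drop  # thresholds are negative

# Hypothetical: average order value fell about 3% against a -2% threshold,
# so the test should be stopped and investigated.
print(guardrail_breached(52.00, 50.44, max_relative_drop=-0.02))  # True
```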

Analyzing Results

Analysis Discipline

When interpreting results:

  • Do NOT generalize beyond the tested population

  • Do NOT claim causality beyond the tested change

  • Do NOT override guardrail failures

  • Separate statistical significance from business judgment

Interpretation Outcomes

| Result | Action |
| --- | --- |
| Significant positive | Consider rollout |
| Significant negative | Reject variant, document learning |
| Inconclusive | Consider more traffic or a bolder change |
| Guardrail failure | Do not ship, even if primary wins |
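
Once the planned sample size is reached, the primary metric can be evaluated with a standard two-proportion z-test. A minimal sketch using statsmodels; all counts are hypothetical.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical final counts at the planned end of the test.
conversions = [3_150, 2_980]  # variant, control
exposures = [28_600, 28_550]

z_stat, p_value = proportions_ztest(conversions, exposures)
lift = conversions[0] / exposures[0] - conversions[1] / exposures[1]

# Significance is only one input: guardrails and business judgment
# still decide whether the variant ships.
print(f"absolute lift = {lift:.4f}, p = {p_value:.4f}")
```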


Documentation & Learning

Test Record (Mandatory)

Document:

  • Hypothesis

  • Variants

  • Metrics

  • Planned vs. achieved sample size

  • Results

  • Decision

  • Learnings

  • Follow-up ideas

Store records in a shared, searchable location to avoid repeated failures.
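
One lightweight way to keep records searchable is a structured entry per experiment; the shape below simply mirrors the list above, and every value is hypothetical.

```python
# Hypothetical test record mirroring the fields above.
test_record = {
    "hypothesis": "Shorter shipping form increases mobile checkout completion",
    "variants": {"control": "8-field form", "variant": "4-field form"},
    "metrics": {"primary": "checkout_completion_rate"},
    "sample_size": {"planned_per_variant": 28_500, "achieved_per_variant": 28_600},
    "results": {"absolute_lift": 0.0058, "p_value": 0.026},
    "decision": "rolled out to all mobile traffic",
    "learnings": "most of the lift came from removing optional fields",
    "follow_up_ideas": ["test autofill for the address fields"],
}
```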


Refusal Conditions (Safety)

Refuse to proceed if:

  • Baseline rate is unknown and cannot be estimated

  • Traffic is insufficient to detect the MDE

  • Primary metric is undefined

  • Multiple variables are changed without proper design

  • Hypothesis cannot be clearly stated

In each case, explain why the test is blocked and recommend next steps.


Key Principles (Non-Negotiable)

  • One hypothesis per test

  • One primary metric

  • Commit before launch

  • No peeking

  • Learning over winning

  • Statistical rigor first

Final Reminder

A/B testing is not about proving ideas right.
It is about learning the truth with confidence.

If you feel tempted to rush, simplify, or “just try it”,
that is the signal to slow down and re-check the design.