ab-test-setup

Structured guide for setting up A/B tests with mandatory gates for hypothesis, metrics, and execution readiness.

Install

Download and extract to your skills directory, or copy the following command and send it to OpenClaw for auto-install:

Download and install this skill https://openskills.cc/api/download?slug=sickn33-skills-ab-test-setup&locale=en&source=copy

A/B Test Setup - A Rigorous Guide to A/B Test Design and Execution

Skill Overview

A/B Test Setup is a structured set of guidelines for designing A/B tests. Through mandatory hypothesis locking, metric definition, and execution readiness checks, it ensures that every experiment has statistical rigor and operational readiness from the design phase onward, preventing invalid experiments caused by design flaws.

Applicable Scenarios

  • Product feature validation: Before officially releasing a new feature or redesign, verify the real impact of changes through controlled experiments to avoid decisions based on intuition.
  • Growth experiment design: For key metrics like conversion rate optimization and user retention improvement, design experiments with sufficient statistical power to ensure the expected effect can be detected.
  • Experiment process standards: Establish unified experiment design standards for product, data, and analytics teams to reduce experiment failures caused by vague hypotheses, insufficient samples, or inappropriate metrics.

Core Features

  • Hypothesis locking mechanism: Require clearly defined and locked final hypotheses before designing variants and metrics, including target audience, primary metric, expected direction, and minimum detectable effect (MDE), to prevent arbitrary changes during the experiment.
  • Sample size and duration calculation: Estimate the required sample size per variant and expected test duration based on baseline rate, MDE, significance level, and statistical power, avoiding inconclusive experiments due to insufficient traffic.
  • Execution readiness checks: Perform mandatory checks before implementation to ensure hypotheses are locked, metrics are frozen, sample sizes are calculated, guardrails are set, and tracking is validated. If any condition is not met, the experiment is blocked from starting.
  • Result interpretation guidelines: Provide clear interpretation principles and a decision matrix that distinguish statistical significance from business judgment, preventing the mistaken promotion of a "winning" variant when guardrail metrics have failed.
  • Rejection conditions and safety valves: Proactively reject continuation and explain the reasons when the baseline rate is unknown, traffic is insufficient, the primary metric is undefined, or multivariate confounding exists, avoiding execution of invalid or harmful experiments.
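
The execution readiness check described above can be sketched as a simple gate. This is an illustrative sketch only; the field names and check list here are assumptions, not the skill's actual schema:

```python
from dataclasses import dataclass

@dataclass
class ExperimentPlan:
    # Hypothetical fields; a real plan would carry the full definitions.
    hypothesis_locked: bool
    primary_metric_frozen: bool
    sample_size_calculated: bool
    guardrails_defined: bool
    tracking_validated: bool

def readiness_check(plan: ExperimentPlan) -> list[str]:
    """Return the list of unmet conditions; an empty list means the experiment may start."""
    failures = []
    if not plan.hypothesis_locked:
        failures.append("hypothesis not locked")
    if not plan.primary_metric_frozen:
        failures.append("primary metric not frozen")
    if not plan.sample_size_calculated:
        failures.append("sample size not calculated")
    if not plan.guardrails_defined:
        failures.append("guardrail metrics not defined")
    if not plan.tracking_validated:
        failures.append("tracking not validated")
    return failures

plan = ExperimentPlan(True, True, False, True, True)
blockers = readiness_check(plan)
if blockers:
    # Any unmet condition blocks the experiment from starting.
    print("Experiment blocked:", ", ".join(blockers))
```

The point of modeling the gate as an all-or-nothing check is that a single unmet condition blocks the start, mirroring the mandatory nature of the readiness review.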

Frequently Asked Questions

    How much traffic is needed to run an A/B test?

    Required traffic depends on the baseline conversion rate and the minimum effect you want to detect. For example, if the baseline conversion rate is 5% and you want to detect a 10% relative increase (i.e., an absolute increase of 0.5%), at a 95% confidence level and 80% statistical power, each variant needs about 30,000 samples. If daily traffic is limited, you may need to extend the test duration or reassess the reasonableness of the MDE.
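
The figures in this answer can be reproduced with the standard two-proportion sample size formula. The sketch below uses only the Python standard library and is not code from the skill itself:

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Samples needed per variant for a two-sided two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Baseline 5%, detecting a 10% relative lift (0.5% absolute):
n = sample_size_per_variant(0.05, 0.10)
print(n)  # roughly 31,000 per variant, in line with the ~30,000 quoted above
```

Dividing this per-variant figure by expected daily traffic per variant gives the test duration; if the result is unacceptably long, the MDE should be revisited rather than the rigor relaxed.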

    If I see clearly positive results during the experiment, can I stop early?

    No. Stopping early undermines the statistical validity of the experiment; even if the results look good, they may be false positives. You must adhere to the sample size determined during the design phase unless the experiment is terminated early due to severe guardrail metric failures. This skill explicitly reminds teams not to stop early just because it "looks good."

    What are guardrail metrics and why are they important?

    Guardrail metrics are key indicators that must not significantly decline during an experiment; they prevent systemic harm from "local optimizations." For example, a redesign that increases click-through rate might reduce user retention or increase load times. If guardrail metrics show significant negative changes, even if the primary metric wins, that variant should not be promoted. This skill requires guardrail metrics to be clearly defined during the experiment design phase and gives them veto power during result analysis.
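
The veto power described here can be expressed as a small decision rule. The boolean inputs below are simplified stand-ins for real significance tests, not the skill's actual decision matrix:

```python
def decide(primary_significant: bool, primary_positive: bool,
           guardrail_failed: bool) -> str:
    """Simplified decision matrix: a guardrail failure vetoes any 'win'."""
    if guardrail_failed:
        return "do not ship: guardrail metric regressed"
    if primary_significant and primary_positive:
        return "ship the winning variant"
    if primary_significant and not primary_positive:
        return "do not ship: variant is significantly worse"
    return "inconclusive: keep the control"

print(decide(primary_significant=True, primary_positive=True,
             guardrail_failed=True))
# prints "do not ship: guardrail metric regressed"
```

Note that the guardrail check comes first: even a statistically significant win on the primary metric cannot override a guardrail regression.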