In 1948, the British Medical Research Council ran a trial on streptomycin as a treatment for tuberculosis. What made it remarkable was not the drug - it was the design. Patients were assigned to treatment or control groups using sealed envelopes drawn from a random sequence, so no clinician could influence who received what. The trial produced the first rigorous clinical evidence that streptomycin worked. It also produced the template that every A/B test since has borrowed, whether the stakes are a patient's lungs or a button's color.
That template starts before any code is written or any user is enrolled. It starts with a hypothesis.
What Makes a Hypothesis an Experiment
A hypothesis is not a hunch. A hunch says "I think a green button will do better." A hypothesis says: "Changing the checkout button from grey to green will increase the click-to-purchase conversion rate among new visitors, because higher contrast against the white background will increase visual salience and reduce hesitation." Those are different statements. The second one contains a mechanism, a specific population, a specific metric, and - crucially - a specific way to be wrong.
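One way to keep yourself honest is to write the hypothesis down as structured data rather than prose. The sketch below is illustrative, not a prescribed format - the field names are hypothetical, but each maps to a part of the statement above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hypothesis:
    change: str      # the intervention
    population: str  # who it applies to
    metric: str      # the number expected to move
    direction: str   # "increase" or "decrease"
    mechanism: str   # why you expect the effect

checkout_button = Hypothesis(
    change="checkout button grey -> green",
    population="new visitors",
    metric="click_to_purchase_conversion_rate",
    direction="increase",
    mechanism="higher contrast against white background increases salience",
)
```

If you cannot fill in every field, you have a hunch, not a hypothesis.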
Every A/B experiment runs on exactly that structure: a null hypothesis and an alternative hypothesis. The null hypothesis is the skeptic's position. It says your change does nothing - that any difference between the variation and the control is indistinguishable from chance. The alternative hypothesis is your actual claim: that the change produces a real, directional effect on the metric you care about. You never try to prove the alternative directly. You collect data and ask whether a result that extreme would be too unlikely in a world where the null is true.
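Concretely, that question is answered with a significance test. Here is a minimal sketch, assuming a two-sided two-proportion z-test on conversion counts; the numbers are made up for illustration:

```python
from scipy.stats import norm

def two_proportion_ztest(conversions_a, n_a, conversions_b, n_b):
    """Test H0: both groups share a single underlying conversion rate."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    p_pool = (conversions_a + conversions_b) / (n_a + n_b)  # rate under the null
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    return z, 2 * norm.sf(abs(z))  # two-sided p-value

# Hypothetical counts: control converts 500/10,000, variant 560/10,000.
z, p = two_proportion_ztest(500, 10_000, 560, 10_000)
print(f"z = {z:.2f}, p = {p:.3f}")  # p is roughly 0.058: the null survives
```

The p-value is the probability of seeing a difference at least this large if the null were true; here the data are suggestive but do not clear the conventional 0.05 bar.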
This adversarial framing is not academic formality. It exists because humans are extraordinary pattern-finders, and we will see meaningful differences in noise if we go looking for them without a pre-specified standard to beat.
Key Point: A testable hypothesis must be falsifiable - meaning there must be a possible data result that would prove it wrong. If every conceivable outcome confirms your idea, you are not running an experiment; you are running a launch with extra steps.
Metrics Are Not Optional
Before any test begins, you need to answer one question with precision: what does winning look like? This sounds obvious. It is regularly botched.
Think of your metric hierarchy as three layers. The primary metric is the single number that decides whether the experiment succeeds. It must be directly connected to the hypothesis, sensitive enough to move in the time you have, and chosen before the test runs. Choosing it afterward is called p-hacking, and it will generate false positives with impressive regularity.
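The regularity is easy to quantify. A quick simulation, assuming a dozen candidate metrics that are all pure noise under the null, shows how often at least one looks "significant" if you pick the winner after the fact:

```python
import numpy as np

rng = np.random.default_rng(42)
n_trials, n_metrics, alpha = 10_000, 12, 0.05

# Under the null, each metric's p-value is uniform on [0, 1].
p_values = rng.uniform(size=(n_trials, n_metrics))
hit_rate = (p_values.min(axis=1) < alpha).mean()
print(f"{hit_rate:.0%}")  # ~46%, matching 1 - 0.95**12
```

Pre-registering a single primary metric collapses that 46% back down to the 5% you signed up for.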
Secondary metrics explain the primary. If your primary metric is conversion rate, your secondary metrics might be average session duration and items viewed before purchase. They tell you whether the conversion lift came from genuine engagement or from some confusing design that made users click by accident.
Guardrail metrics are the safety net. They watch for damage you did not anticipate. A new feature that improves conversion by 4% but increases page load time by two seconds is not a win - it is a trade-off you need to see clearly before you ship. If you are not tracking the guardrails, you will miss the fire.
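In practice, the hierarchy is worth declaring explicitly before launch. Here is a sketch of what that declaration might look like; the metric names and guardrail thresholds are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricPlan:
    primary: str                  # the one number that decides the test
    secondary: tuple[str, ...]    # explain *why* the primary moved
    guardrails: dict[str, float]  # metric -> worst acceptable change

checkout_plan = MetricPlan(
    primary="click_to_purchase_conversion_rate",
    secondary=("avg_session_duration", "items_viewed_before_purchase"),
    guardrails={
        "p95_page_load_ms": 200,   # no more than 200 ms slower
        "refund_rate": 0.002,      # no more than 0.2 points higher
    },
)
```

A test that wins on the primary but breaches a guardrail should fail the review, not ship with a footnote.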
The Minimum Detectable Effect
Before you launch, you need to specify the smallest improvement you actually care about. This is the Minimum Detectable Effect (MDE), and it controls everything downstream: how long the test runs, how many users you need, and how confident you can be in the result.
Setting the MDE too small is a trap. A 0.1% improvement in conversion sounds nice until you realize detecting it requires weeks of traffic and a sample size so large that a dozen other variables will have drifted by the time you finish. Setting the MDE too large means you might ship a version that is genuinely worse, because the degradation was smaller than your detection threshold.
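The standard sample-size formula for a two-proportion test makes the trap concrete. A sketch, assuming 95% confidence and 80% power; the 5% baseline and the MDE values are illustrative:

```python
import math
from scipy.stats import norm

def sample_size_per_arm(baseline, mde, alpha=0.05, power=0.80):
    """Users needed in each arm to detect an absolute lift of `mde`."""
    p1, p2 = baseline, baseline + mde
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil(z ** 2 * variance / mde ** 2)

print(sample_size_per_arm(0.05, 0.001))  # 0.1-point MDE: ~753,000 per arm
print(sample_size_per_arm(0.05, 0.005))  # 0.5-point MDE: ~31,000 per arm
```

Because the MDE sits squared in the denominator, a fivefold increase in the effect you are willing to detect cuts the required sample by a factor of roughly twenty-four.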
The right MDE is the smallest effect that would change a real decision. If your team would not act on a 1% conversion lift - would not allocate engineering, would not reconsider the roadmap - then 1% is not your MDE. Set the threshold at the number that actually matters to the business.