Holdout Experiments for Shopify Merchants: Enterprise Testing at 1/100th the Price

Rami Omran · 8 min read

In 2019, Procter & Gamble cut $200 million from their digital ad budget. Revenue didn't drop. The ads they cut weren't driving sales — they were just showing up in dashboards as conversions.

P&G knew this because they ran holdout experiments at massive scale. They systematically turned off ads for segments of their audience and measured what happened. When nothing happened, they knew those ads were waste.

Nike does the same thing. So does Unilever, Booking.com, and every sophisticated advertiser with a measurement team. The methodology isn't new — it's been the gold standard for advertising effectiveness since direct mail experiments in the 1960s.

What is new is that Shopify merchants can now access the same methodology. Here's how it works and why it matters more than any attribution model.

What is a holdout experiment?

The concept is simple. The execution requires care.

Imagine you're spending $150/day on a Meta campaign. You suspect some of those conversions would happen without the ad. To test this, you split your audience into two groups:

Control group (90%): Sees your ads as normal. Nothing changes for them.

Holdout group (10%): Your ads are suppressed. They don't see any Meta ads from your account during the test period.

You run this for 14 days. Then you compare the two groups' conversion rates:

Control group: 2.3% conversion rate

Holdout group: 1.8% conversion rate

The difference of 0.5 percentage points is the incremental lift: the ads caused a 0.5-point increase in conversion rate. The holdout group's 1.8% is your organic baseline, the conversions that happen with no ads running at all.

That means 78% of the control group's conversions (1.8 of every 2.3) would have happened organically. Extrapolated to your full audience, a significant portion of what Meta reports as "ad-driven conversions" were customers who would have bought anyway.
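The arithmetic above is simple enough to sketch directly. This snippet uses the example's numbers; the variable names are ours, not from any specific tool:

```python
# Worked example: incremental lift from a holdout test.
control_rate = 0.023   # control group: saw ads as normal
holdout_rate = 0.018   # holdout group: ads suppressed

# Absolute lift in percentage points: conversions the ads actually caused.
incremental_lift = control_rate - holdout_rate   # 0.005, i.e. 0.5 pp

# Share of the control group's conversions that were organic,
# i.e. would have happened even with no ads running.
organic_share = holdout_rate / control_rate      # ~0.78, i.e. 78%

print(f"Incremental lift: {incremental_lift:.1%}")
print(f"Organic share of conversions: {organic_share:.0%}")
```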

Why attribution can't tell you this

This distinction matters enormously. Here's why:

Attribution is a zero-sum game. It takes a conversion that happened and distributes credit among the touchpoints that preceded it. If a customer clicked a Meta ad and then a Google ad before buying, attribution decides whether Meta or Google gets credit (or splits it).

But attribution never asks: "What if neither ad existed? Would the customer have bought anyway?"

For a loyal customer who buys from you every month, the answer is probably yes. They would have bought regardless. But if they happened to click an ad before purchasing, the platform claims the conversion. Attribution assigns credit. Your ROAS goes up. Your actual incremental revenue: zero.

No attribution model — not last-click, not data-driven, not multi-touch — can detect this. They all start from the assumption that the conversion was influenced by marketing. The only way to test that assumption is to remove the marketing and see what happens.

That's what holdout experiments do.

The anatomy of a well-designed experiment

Running a holdout experiment isn't just turning off ads and hoping for the best. A well-designed experiment needs five components:

1. Power analysis

Before the experiment starts, you need to calculate: how large does the holdout group need to be, and how long does the experiment need to run, to detect the effect size you care about?

This depends on three factors:

Your baseline conversion rate. Low-conversion stores need more traffic to separate signal from noise.

The minimum lift you care about detecting. Seeing a 0.2-point lift takes far more data than seeing a 2-point lift.

Your traffic volume. More daily visitors means the required sample accumulates faster, so the experiment can be shorter.

For most Shopify merchants spending $5K–$15K/month, a properly powered experiment needs 10–20% holdout for 14–21 days.
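The power calculation itself is textbook statistics. Here is a minimal sketch using the standard normal-approximation sample-size formula for comparing two proportions; the function name and defaults are illustrative, not taken from any vendor's tooling:

```python
from math import ceil, sqrt
from statistics import NormalDist

def required_sample_per_group(p_base, p_lifted, alpha=0.05, power=0.80):
    """Visitors needed in EACH group to detect a lift from p_base to
    p_lifted, using the pooled normal approximation for two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_base + p_lifted) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p_base * (1 - p_base)
                                 + p_lifted * (1 - p_lifted))) ** 2
    return ceil(numerator / (p_base - p_lifted) ** 2)

# Detecting a lift from 1.8% to 2.3% at 95% confidence and 80% power:
n = required_sample_per_group(0.018, 0.023)
print(n)  # roughly 12,600 visitors per group
```

This is why small stores struggle to reach significance: detecting a half-point lift on a ~2% baseline takes on the order of ten thousand visitors per group.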

2. Clean audience segmentation

The holdout group must be randomly selected and genuinely prevented from seeing your ads. On Meta, this is done through audience exclusions. On Google, through campaign experiments or geographic holdouts.

The key requirement: the groups must be comparable. If your holdout group is systematically different from your control group (different geography, different demographics, different purchase history), the results are contaminated.
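One common way to get comparable, stable groups is deterministic hash-based bucketing, so the same customer lands in the same group on every visit. A minimal sketch (the salt string and 10% split are our own illustrative choices):

```python
import hashlib

def assign_group(customer_id: str, holdout_pct: int = 10,
                 salt: str = "holdout-test-q1") -> str:
    """Deterministically bucket a customer into 'holdout' or 'control'.
    Hashing (rather than a random draw per session) keeps assignment
    stable across sessions and devices, which prevents contamination."""
    digest = hashlib.sha256(f"{salt}:{customer_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # uniform bucket in 0..99
    return "holdout" if bucket < holdout_pct else "control"

# The same customer always gets the same answer:
print(assign_group("cust_1842"))
print(assign_group("cust_1842"))
```

Because the hash is effectively uniform, the split is random with respect to geography, demographics, and purchase history, which is exactly the comparability requirement above.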

3. Controlled test duration

The experiment needs to run long enough to capture full purchase cycles. If your typical customer takes 3 days from first visit to purchase, a 2-day experiment won't capture the full effect. Most experiments need at least 7 days, and 14–28 is standard for reliable results.

4. Statistical significance threshold

You need to decide in advance what level of confidence you require. The industry standard is p < 0.05 (95% confidence). This means: if the ads truly had no effect, there would be less than a 5% chance of seeing a difference this large between the control and holdout groups from random noise alone.

If the experiment doesn't reach this threshold, the result is INCONCLUSIVE — not wrong, just not proven.

5. Clear success metrics

What are you measuring? Revenue? Conversion rate? Orders? Customer acquisition? Define this before the experiment starts. Post-hoc metric selection (looking at the results and then deciding which metric to report) is a statistical sin that invalidates results.

Why this was enterprise-only

Until recently, holdout experiments were accessible only to brands spending $50K+ per month on ads. Three barriers kept smaller merchants out:

Cost. Enterprise measurement platforms like Haus ($24,000/year) and Measured ($36,000/year) were the only tools that could design and run holdout experiments. For a merchant spending $8,000/month on ads, the measurement tool costs more than the ads.

Expertise. Designing a valid experiment requires understanding of power analysis, statistical significance, audience segmentation, and contamination risks. Most Shopify merchants don't have a data scientist on staff.

Platform access. Running holdout experiments requires programmatic access to ad platform APIs — the ability to create audience exclusions, modify campaigns, and pull granular performance data. This requires API credentials and developer resources.

The methodology itself isn't complicated. But the tooling, expertise, and access barriers made it impractical for any merchant without an enterprise budget.

The democratization moment

What's changed is that all three barriers are falling simultaneously.

Self-serve tooling. Tools designed for Shopify merchants can automate experiment design, power analysis, and execution for $149–$249/month. The methodology is identical to what enterprise brands use — the interface is just designed for merchants instead of data scientists.

Platform APIs are accessible. Both Meta and Google now offer API access that enables programmatic audience exclusions and campaign experiments. A tool that connects via OAuth can manage the experiment without the merchant touching the APIs directly.

Automated statistical analysis. The math behind power analysis and significance testing is well-established. It doesn't need a data scientist — it needs software that implements the formulas correctly and presents the results in plain language.

For the first time, a Shopify merchant spending $5,000/month on ads can answer the same question that P&G answered with their $200 million cut: "Are these ads actually driving sales, or am I paying for conversions that would have happened anyway?"

Want to see what the ROI of running incrementality tests would be for your store? Try our Incrementality ROI Calculator.

Honest limitations

We believe in transparency about what holdout experiments can and can't do:

Small stores may not reach significance. If you're spending under $2K/month on ads with fewer than 200 orders/month, your sample size may be too small to detect effects reliably. Experiments might return INCONCLUSIVE.

Experiments measure a moment in time. A holdout experiment tells you the incremental value of your ads during the test period. Seasonality, promotions, and market changes can affect results. A finding that's valid in February may not be valid in December.

There's always an opportunity cost. During the experiment, the holdout group isn't seeing your ads. If those ads are genuinely valuable, you're sacrificing some revenue to learn the truth. For most merchants, the knowledge gained vastly outweighs the temporary cost — but it's honest to acknowledge the tradeoff.
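That opportunity cost can be estimated up front. Under the worst-case assumption that the ads really are fully incremental, the revenue at risk is the holdout share of incremental revenue over the test window (the figures below are purely illustrative):

```python
# Back-of-envelope cost of a holdout, assuming the ads are fully
# incremental (the worst case for the merchant). Figures are illustrative.
daily_incremental_revenue = 400.0  # $ per day truly caused by ads (assumed)
holdout_share = 0.10               # 10% of the audience held out
test_days = 14

revenue_at_risk = daily_incremental_revenue * holdout_share * test_days
print(f"Maximum revenue sacrificed: ${revenue_at_risk:,.0f}")  # $560
```

If the ads turn out not to be incremental, the actual cost is lower, and the finding itself is worth far more.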

Not all waste types are equally testable. Creative fatigue is visible in your data without an experiment. Channel overlap can be estimated by cross-referencing platforms. But brand cannibalization — the hardest one to accept emotionally — is also the one that most benefits from experimental validation.

The tools exist. The methodology is proven. The only question is whether you'd rather know the truth about your ad spend or keep trusting the dashboard numbers. For more context on what those numbers might be hiding, read about the 29% problem.

See how much of your ad spend is wasted

Connect your Shopify store and ad accounts. Get your audit in 24 hours. No pixel required.

See Your Wasted Spend