Holdout Experiments for Shopify Merchants: Enterprise Testing at 1/100th the Price

Rami Omran · 8 min read

In 2019, Procter & Gamble cut $200 million from their digital ad budget. Revenue didn't drop. The ads they cut weren't driving sales — they were just showing up in dashboards as conversions.

P&G knew this because they ran holdout experiments at massive scale. They systematically turned off ads for segments of their audience and measured what happened. When nothing happened, they knew those ads were waste.

Nike does the same thing. So does Unilever, Booking.com, and every sophisticated advertiser with a measurement team. The methodology isn't new — it's been the gold standard for advertising effectiveness since direct mail experiments in the 1960s.

What is new is that Shopify merchants can now access the same methodology. Here's how it works and why it matters more than any attribution model.

What is a holdout experiment?

The concept is simple. The execution requires care.

Imagine you're spending $150/day on a Meta campaign. You suspect some of those conversions would happen without the ad. To test this, you split your audience into two groups:

Control group (90%): Sees your ads as normal. Nothing changes for them.

Holdout group (10%): Your ads are suppressed. They don't see any Meta ads from your account during the test period.

You run this for 14 days. Then you compare the two groups' conversion rates:

Control group: 2.3% conversion rate

Holdout group: 1.8% conversion rate

The difference of 0.5 percentage points is the incremental lift: the ads caused a 0.5-point increase in conversion rate. The holdout group's 1.8% is your organic baseline, the conversions that happen with no ads running at all.

That means 78% of the control group's conversions (1.8 of every 2.3) would have happened organically. Extrapolated to your full audience, a significant portion of what Meta reports as "ad-driven conversions" were customers who would have bought anyway.
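The arithmetic above is simple enough to sketch directly. This snippet uses the example's numbers; the variable names are ours, not from any specific tool:

```python
# Worked example: incremental lift from a holdout test.
control_rate = 0.023   # control group: saw ads as normal
holdout_rate = 0.018   # holdout group: ads suppressed

# Absolute lift in percentage points: conversions the ads actually caused.
incremental_lift = control_rate - holdout_rate   # 0.005, i.e. 0.5 pp

# Share of the control group's conversions that were organic,
# i.e. would have happened even with no ads running.
organic_share = holdout_rate / control_rate      # ~0.78, i.e. 78%

print(f"Incremental lift: {incremental_lift:.1%}")
print(f"Organic share of conversions: {organic_share:.0%}")
```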

Why attribution can't tell you this

This distinction matters enormously. Here's why:

Attribution is a zero-sum game. It takes a conversion that happened and distributes credit among the touchpoints that preceded it. If a customer clicked a Meta ad and then a Google ad before buying, attribution decides whether Meta or Google gets credit (or splits it).

But attribution never asks: "What if neither ad existed? Would the customer have bought anyway?"

For a loyal customer who buys from you every month, the answer is probably yes. They would have bought regardless. But if they happened to click an ad before purchasing, the platform claims the conversion. Attribution assigns credit. Your ROAS goes up. Your actual incremental revenue: zero.

No attribution model — not last-click, not data-driven, not multi-touch — can detect this. They all start from the assumption that the conversion was influenced by marketing. The only way to test that assumption is to remove the marketing and see what happens.

That's what holdout experiments do.

The anatomy of a well-designed experiment

Running a holdout experiment isn't just turning off ads and hoping for the best. A well-designed experiment needs five components:

1. Power analysis

Before the experiment starts, you need to calculate: how large does the holdout group need to be, and how long does the experiment need to run, to detect the effect size you care about?

This depends on three factors:

Your baseline conversion rate. Low-conversion stores need more traffic to separate signal from noise.

The minimum lift you care about detecting. Seeing a 0.2-point lift takes far more data than seeing a 2-point lift.

Your traffic volume. More daily visitors means the required sample accumulates faster, so the experiment can be shorter.

For most Shopify merchants spending $5K–$15K/month, a properly powered experiment needs 10–20% holdout for 14–21 days.
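The power calculation itself is textbook statistics. Here is a minimal sketch using the standard normal-approximation sample-size formula for comparing two proportions; the function name and defaults are illustrative, not taken from any vendor's tooling:

```python
from math import ceil, sqrt
from statistics import NormalDist

def required_sample_per_group(p_base, p_lifted, alpha=0.05, power=0.80):
    """Visitors needed in EACH group to detect a lift from p_base to
    p_lifted, using the pooled normal approximation for two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_base + p_lifted) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p_base * (1 - p_base)
                                 + p_lifted * (1 - p_lifted))) ** 2
    return ceil(numerator / (p_base - p_lifted) ** 2)

# Detecting a lift from 1.8% to 2.3% at 95% confidence and 80% power:
n = required_sample_per_group(0.018, 0.023)
print(n)  # roughly 12,600 visitors per group
```

This is why small stores struggle to reach significance: detecting a half-point lift on a ~2% baseline takes on the order of ten thousand visitors per group.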

2. Clean audience segmentation

The holdout group must be randomly selected and genuinely prevented from seeing your ads. On Meta, this is done through audience exclusions. On Google, through campaign experiments or geographic holdouts.

The key requirement: the groups must be comparable. If your holdout group is systematically different from your control group (different geography, different demographics, different purchase history), the results are contaminated.
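One common way to get comparable, stable groups is deterministic hash-based bucketing, so the same customer lands in the same group on every visit. A minimal sketch (the salt string and 10% split are our own illustrative choices):

```python
import hashlib

def assign_group(customer_id: str, holdout_pct: int = 10,
                 salt: str = "holdout-test-q1") -> str:
    """Deterministically bucket a customer into 'holdout' or 'control'.
    Hashing (rather than a random draw per session) keeps assignment
    stable across sessions and devices, which prevents contamination."""
    digest = hashlib.sha256(f"{salt}:{customer_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # uniform bucket in 0..99
    return "holdout" if bucket < holdout_pct else "control"

# The same customer always gets the same answer:
print(assign_group("cust_1842"))
print(assign_group("cust_1842"))
```

Because the hash is effectively uniform, the split is random with respect to geography, demographics, and purchase history, which is exactly the comparability requirement above.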

3. Controlled test duration

The experiment needs to run long enough to capture full purchase cycles. If your typical customer takes 3 days from first visit to purchase, a 2-day experiment won't capture the full effect. Most experiments need at least 7 days, and 14–28 is standard for reliable results.

4. Statistical significance threshold

You need to decide in advance what level of confidence you require. The industry standard is p < 0.05 (95% confidence). This means: if the ads truly had no effect, there would be less than a 5% chance of seeing a difference this large between the control and holdout groups from random noise alone.

If the experiment doesn't reach this threshold, the result is INCONCLUSIVE — not wrong, just not proven.

5. Clear success metrics

What are you measuring? Revenue? Conversion rate? Orders? Customer acquisition? Define this before the experiment starts. Post-hoc metric selection (looking at the results and then deciding which metric to report) is a statistical sin that invalidates results.

Why this was enterprise-only

Until recently, holdout experiments were accessible only to brands spending $50K+ per month on ads. Three barriers kept smaller merchants out:

Cost. Enterprise measurement platforms like Haus ($24,000/year) and Measured ($36,000/year) were the only tools that could design and run holdout experiments. For a merchant spending $8,000/month on ads, the measurement tool costs more than the ads.

Expertise. Designing a valid experiment requires understanding of power analysis, statistical significance, audience segmentation, and contamination risks. Most Shopify merchants don't have a data scientist on staff.

Platform access. Running holdout experiments requires programmatic access to ad platform APIs — the ability to create audience exclusions, modify campaigns, and pull granular performance data. This requires API credentials and developer resources.

The methodology itself isn't complicated. But the tooling, expertise, and access barriers made it impractical for any merchant without an enterprise budget.

The democratization moment

What's changed is that all three barriers are falling simultaneously.

Self-serve tooling. Tools designed for Shopify merchants can automate experiment design, power analysis, and execution for $149–$249/month. The methodology is identical to what enterprise brands use — the interface is just designed for merchants instead of data scientists.

Platform APIs are accessible. Both Meta and Google now offer API access that enables programmatic audience exclusions and campaign experiments. A tool that connects via OAuth can manage the experiment without the merchant touching the APIs directly.

Automated statistical analysis. The math behind power analysis and significance testing is well-established. It doesn't need a data scientist — it needs software that implements the formulas correctly and presents the results in plain language.

For the first time, a Shopify merchant spending $5,000/month on ads can answer the same question that P&G answered with their $200 million cut: "Are these ads actually driving sales, or am I paying for conversions that would have happened anyway?"

Want to see what the ROI of running incrementality tests would be for your store? Try our Incrementality ROI Calculator.

Honest limitations

We believe in transparency about what holdout experiments can and can't do:

Small stores may not reach significance. If you're spending under $2K/month on ads with fewer than 200 orders/month, your sample size may be too small to detect effects reliably. Experiments might return INCONCLUSIVE.

Experiments measure a moment in time. A holdout experiment tells you the incremental value of your ads during the test period. Seasonality, promotions, and market changes can affect results. A finding that's valid in February may not be valid in December.

There's always an opportunity cost. During the experiment, the holdout group isn't seeing your ads. If those ads are genuinely valuable, you're sacrificing some revenue to learn the truth. For most merchants, the knowledge gained vastly outweighs the temporary cost — but it's honest to acknowledge the tradeoff.
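That opportunity cost can be estimated up front. Under the worst-case assumption that the ads really are fully incremental, the revenue at risk is the holdout share of incremental revenue over the test window (the figures below are purely illustrative):

```python
# Back-of-envelope cost of a holdout, assuming the ads are fully
# incremental (the worst case for the merchant). Figures are illustrative.
daily_incremental_revenue = 400.0  # $ per day truly caused by ads (assumed)
holdout_share = 0.10               # 10% of the audience held out
test_days = 14

revenue_at_risk = daily_incremental_revenue * holdout_share * test_days
print(f"Maximum revenue sacrificed: ${revenue_at_risk:,.0f}")  # $560
```

If the ads turn out not to be incremental, the actual cost is lower, and the finding itself is worth far more.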

Not all waste types are equally testable. Creative fatigue is visible in your data without an experiment. Channel overlap can be estimated by cross-referencing platforms. But brand cannibalization — the hardest one to accept emotionally — is also the one that most benefits from experimental validation.

The tools exist. The methodology is proven. The only question is whether you'd rather know the truth about your ad spend or keep trusting the dashboard numbers. For more context on what those numbers might be hiding, read about the 29% problem.

See how much of your ad spend is wasted

Connect your Shopify store and ad accounts. Get your audit in 24 hours. No pixel required.

See Your Wasted Spend