The Problem With A/B Testing When Your GA4 Is Broken
A D2C personal care brand came to us after three months of A/B testing that produced exactly nothing.
Not failed tests. They had plenty of "winners." Variant B on the product page hero image: +18% conversion rate. A new checkout CTA: +22% conversion rate. They'd shipped both changes site-wide. Revenue didn't budge.
They assumed their CRO strategy needed more creativity. More radical test ideas. A better testing tool. What they actually needed was for someone to look at their GA4 setup before running another test.
What we found when we did is something we've seen at more than one D2C brand, and it's the kind of problem that makes every test result from the preceding months essentially meaningless.
The Setup: What They Were Testing and How
The brand, a mid-sized D2C company selling personal care products, had been using Omniconvert to run experiments on their Shopify storefront. Traffic volumes were healthy, test durations were reasonable, and they were hitting the 95% statistical significance threshold before calling winners. On paper, the process looked solid.
Conversion rate was their primary test metric, pulled directly from GA4. Every time a variant showed a higher conversion rate in GA4 than the control, they called it a win and shipped the change.
The problem was sitting upstream of all of it: their GA4 purchase event was firing twice for a meaningful percentage of orders.
What the Audit Found
When we ran our GA4 ecommerce tracking audit, the issue surfaced within the first thirty minutes.
Their Shopify theme had a post-purchase page that loaded order confirmation details and triggered the purchase event on page load. So far, standard. But they also had a "track your order" widget on the same page that made an API call on load, and the GTM trigger configuration meant the purchase event fired a second time when that widget resolved.
The result: roughly 30–35% of orders were being counted as two purchases in GA4.
Their reported conversion rate was inflated by that same margin. A real CVR of around 1.4% was showing as 1.8–1.9% in GA4. Not so dramatic that it looked wrong, just plausible enough to trust.
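The arithmetic behind that inflation can be sketched in a few lines. This assumes each duplicated order adds exactly one extra purchase event and that session counts are unaffected; the function name and inputs are illustrative, not taken from the client's setup.

```python
def reported_cvr(true_cvr: float, duplicate_share: float) -> float:
    """Conversion rate GA4 would show if a share of orders fire `purchase` twice.

    Assumes each duplicated order adds exactly one extra event and that
    the session count (the denominator) is unaffected.
    """
    return true_cvr * (1 + duplicate_share)


# Figures from the case above: ~1.4% real CVR, 30-35% of orders double-counted.
for share in (0.30, 0.35):
    print(f"{share:.0%} duplicates -> reported CVR {reported_cvr(0.014, share):.2%}")
```

With these inputs the reported rate lands at roughly 1.8–1.9%, the range GA4 was showing.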
But here's where it gets costly: the double-firing wasn't uniform across the site. The widget loaded faster on desktop than on mobile, and only fired the second event if the page stayed open long enough. So mobile conversion rates were less inflated than desktop rates. And Variant B in their product page test happened to have a faster-loading layout that kept users on the page slightly longer, which meant more second-event fires, which meant an artificially higher reported conversion rate.
The "winning" variant hadn't converted better. It had kept the confirmation page open long enough to trigger the duplicate event more reliably.
Why the Tests Looked Valid But Weren't
This is the part that trips most brands up: if your analytics are broken in a way that touches both the control and the variant, the test results can still look internally consistent. The percentages move. The significance thresholds get hit. The tool shows a winner.
What it doesn't show is that the baseline you're measuring from is wrong, and the differential between variants is being driven by something other than what you're testing.
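To make that concrete, here is a minimal sketch with made-up numbers. Both arms have identical real orders, but the variant is assumed to trigger the duplicate event more often. The duplicate shares (25% and 40%), the session counts, and the helper function are all hypothetical, and the z-test treats every GA4 event as an independent conversion, which is exactly the mistake a naive readout makes.

```python
from math import erf, sqrt


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    # Standard normal CDF expressed through the error function.
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))


# Hypothetical test: both arms have the SAME number of real orders.
sessions = 50_000
real_orders = 700  # 1.4% true CVR in each arm

# Assumed duplicate shares: the variant fires the second event more often.
control_events = round(real_orders * 1.25)
variant_events = round(real_orders * 1.40)

apparent_lift = variant_events / control_events - 1
p_reported = two_proportion_p_value(control_events, sessions, variant_events, sessions)
p_real = two_proportion_p_value(real_orders, sessions, real_orders, sessions)

print(f"apparent lift: {apparent_lift:.1%}, p-value on GA4 counts: {p_reported:.3f}")
print(f"p-value on real orders: {p_real:.3f}")
```

On the GA4 counts the variant shows a 12% lift with a p-value below 0.05, while the real orders show no difference at all. The significance test is doing its job; it is simply being fed the wrong counts.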
This problem has a name in the CRO industry. Nebulab, in their analysis of common ecommerce A/B testing mistakes, call it the Instrumentation Effect: your analytics or test infrastructure doesn't work correctly and skews results at the source. Their framing is direct: when bugs exist at the source of your data, no statistical method can fix them. You can't math your way out of a measurement problem.
CXL, one of the most cited sources in the CRO industry, similarly notes that internal traffic contamination and broken goal tracking are among the most common reasons A/B test results don't hold after deployment. Tests produce a statistically significant result and the real-world impact never materialises, because the significant result was measuring noise.
What Happened After We Fixed the Tracking
Fixing the double-fire was a GTM trigger adjustment: we scoped the purchase event to fire only on the initial page load, not on the widget callback. Straightforward once identified.
After the fix, their GA4 conversion rate dropped from ~1.8% back to ~1.4%. This was not a performance regression; it had always been the real number. The previous month's "improvement" had been an artefact.
We then reconciled GA4 revenue against Shopify order exports for the prior 90 days. The gap confirmed what DebugView had shown: GA4 had been recording roughly 30% more conversion events than actual orders placed.
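A reconciliation like this can be sketched at the transaction level. The example assumes you can export one transaction ID per GA4 purchase event and one order ID per Shopify order, and that the two use the same identifier. The function name, field names, and sample IDs are hypothetical.

```python
from collections import Counter


def find_duplicate_purchases(ga4_transaction_ids, shopify_order_ids):
    """Compare GA4 purchase events with Shopify orders by transaction ID.

    ga4_transaction_ids: one entry per `purchase` event (repeats allowed).
    shopify_order_ids: one entry per real order.
    """
    event_counts = Counter(ga4_transaction_ids)
    orders = set(shopify_order_ids)
    total_events = sum(event_counts.values())
    return {
        "ga4_events": total_events,
        "shopify_orders": len(orders),
        # Orders whose purchase event was recorded more than once.
        "duplicated_orders": sorted(t for t, n in event_counts.items() if n > 1),
        # Orders GA4 never saw (consent, ad blockers, closed tabs, etc.).
        "missing_from_ga4": sorted(orders - set(event_counts)),
        "excess_event_pct": 100 * (total_events - len(orders)) / len(orders),
    }


# Hypothetical sample: 10 orders, 3 of which fired `purchase` twice.
shopify = [f"#{n}" for n in range(1001, 1011)]
ga4 = shopify + ["#1002", "#1005", "#1009"]
report = find_duplicate_purchases(ga4, shopify)
print(report["duplicated_orders"], f'{report["excess_event_pct"]:.0f}% excess events')
```

Listing the duplicated IDs, rather than only the totals, makes it easier to spot patterns such as duplicates clustering on one device type.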
Once tracking was clean and the revenue figures reconciled to within 4% of Shopify actuals, they re-ran the product page hero image test with the same traffic volumes and duration. The variant showed no statistically significant difference this time. The earlier "winner" had been an instrumentation artefact, not a real user preference signal.
The checkout CTA test, when re-run, did show a modest but genuine lift of around 8%. That one held in production. One real win out of several declared ones — which is closer to an honest A/B testing batting average for most D2C brands anyway.
Three Signs Your GA4 Might Be Doing the Same Thing
You don't need to wait for a failed test rollout to check whether this is happening on your store. Three things worth verifying now:
1. GA4 revenue vs. Shopify revenue mismatch: Pull both for the last 30 days. If GA4 is reporting more revenue than Shopify, you almost certainly have duplicate events. A gap above 5–8% warrants investigation. Our GA4 ecommerce tracking audit checklist walks through exactly how to do this comparison.
2. GA4 Conversions vs. Shopify Order Count: Go to GA4 Reports → Monetization → Ecommerce Purchases. Take the total purchase event count for a date range and compare it to Shopify's total orders for the same period. If GA4 shows materially more, you have a duplicate purchase event.
3. DebugView on the Order Confirmation Page: Open GA4 DebugView, then complete a test purchase on your store. Watch how many times the purchase event fires after the confirmation page loads. It should fire exactly once. If you see it fire twice, even with a delay, your data is inflated.
If any of these checks surface an issue, pause A/B testing until it's fixed. Every test running on top of bad conversion data is producing results you cannot act on reliably.
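The first two checks reduce to a pair of percentage gaps, which can be turned into a quick gate before each test. This is a rough sketch only: the function name, the 5% default tolerance (the lower end of the range mentioned above), and the sample totals are assumptions, and the DebugView check still has to be done by hand.

```python
def tracking_health(ga4_revenue, shopify_revenue, ga4_purchases, shopify_orders,
                    max_gap_pct=5.0):
    """Return a list of warnings; an empty list means the aggregates reconcile.

    max_gap_pct is an assumed tolerance, not a GA4 or Shopify setting.
    """
    warnings = []
    revenue_gap = 100 * (ga4_revenue - shopify_revenue) / shopify_revenue
    event_gap = 100 * (ga4_purchases - shopify_orders) / shopify_orders
    # A large gap in either direction needs explaining.
    if abs(revenue_gap) > max_gap_pct:
        warnings.append(f"revenue gap {revenue_gap:+.1f}% vs Shopify")
    # More purchase events than orders points to duplicate firing.
    if event_gap > max_gap_pct:
        warnings.append(f"{event_gap:+.1f}% more purchase events than orders")
    return warnings


# Hypothetical 30-day totals: an inflated store, then a healthy one.
inflated = tracking_health(ga4_revenue=131_000, shopify_revenue=100_000,
                           ga4_purchases=1_310, shopify_orders=1_000)
healthy = tracking_health(ga4_revenue=98_500, shopify_revenue=100_000,
                          ga4_purchases=990, shopify_orders=1_000)
print(inflated)
print(healthy)
```

A small negative gap is treated as healthy here, since GA4 typically misses a few orders; what matters for this check is GA4 reporting noticeably more than Shopify.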
The Broader Issue: CRO Needs Clean Data to Work
This story isn't unusual. We've seen variations of the same problem at multiple D2C brands. The root causes differ, but the outcome is the same: a testing programme running on top of tracking it can't trust, producing winners that don't hold, and teams that conclude their CRO strategy isn't working when really their measurement isn't.
A/B testing is fundamentally an exercise in measurement. The statistical rigour that makes it credible (significance thresholds, sample sizes, test durations) only protects you against random variation in real user behaviour. It has no mechanism to protect you against systematic errors in how that behaviour is being measured.
This is why we consistently recommend a GA4 implementation audit before any structured CRO programme begins. Not because testing is the wrong approach (it's the right one), but because the value of every test you run is directly capped by the quality of the data you're measuring it against.
If you're already deep into a testing programme and starting to question why your winners aren't winning in production, the right first step isn't more creative hypotheses. It's checking whether the measuring instrument is broken.
What to Do Before Your Next Test
Before you start or resume A/B testing:
Reconcile GA4 revenue with Shopify: any gap above 5% needs explaining before it becomes the denominator for your test metrics
Check DebugView for event frequency: purchase should fire once per transaction, not on every page interaction
Validate your test metric source: if your A/B tool is pulling conversion data from GA4, and GA4 is inflated, so are your test results
Run a GA4 funnel audit: confirm the full funnel is instrumented correctly before any test begins
Clean data doesn't guarantee winning tests. But broken data guarantees that even your winners are unreliable, and there's no amount of testing velocity that makes up for that.
If your A/B test results have consistently failed to hold after deployment, broken tracking is often the first place to look. Talk to FunnelFreaks: we audit GA4 implementations before CRO programmes begin, so your test results mean what they say they mean.
Editor's note: Client details in this post have been kept general to protect confidentiality. The tracking issue, audit process, and outcome described reflect a real engagement. Specific numbers (conversion rates, revenue gaps) have been approximated to illustrative ranges.