At a previous Measurefest conference, one of the speakers, Craig Sullivan, recommended a classic research paper from Microsoft on common pitfalls in running conversion rate experiments.

It details five surprising results which took ‘multiple-person weeks to properly analyse’ at Microsoft and were published for the benefit of all. As the authors point out, this stuff is worth spending a few weeks getting right, because ‘multi-million-pound business decisions’ rest on the outcomes. Ultimately, the research highlights the importance of A/A testing.


Here follows an executive overview, cutting out some of the technical analysis:

1. Beware of conflicting short-term metrics

Bing’s management had two high-level goals: query share and revenue per search. The problem is that it is possible to increase both of those and still create a bad long-term outcome for the company, simply by making the search algorithm worse.

If users have to make more searches because they can’t find an answer, Bing’s share of queries goes up, and those users will click on more adverts as well.

“If the goal of a search engine is to allow users to find their answer or complete their task quickly, then reducing the distinct queries per task is a clear goal, which conflicts with the business objective of increasing share.”

The authors suggest that a better metric in most cases is lifetime customer value, and that executives should try to understand where shorter-term metrics might conflict with that long-term goal.
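To make the conflict concrete, here is a minimal sketch of the arithmetic. All the numbers and the function name `search_metrics` are hypothetical, chosen only to illustrate the point; they are not taken from the paper.

```python
def search_metrics(tasks, queries_per_task, ad_clicks_per_query, revenue_per_ad_click):
    """Return the two headline metrics plus the user-centric one (queries per task)."""
    queries = tasks * queries_per_task
    revenue = queries * ad_clicks_per_query * revenue_per_ad_click
    return {
        "queries": queries,                      # drives query share
        "revenue_per_search": revenue / queries,
        "queries_per_task": queries_per_task,    # lower is better for the user
    }

# Hypothetical baseline: users need 1.5 queries on average to finish a task.
baseline = search_metrics(tasks=1000, queries_per_task=1.5,
                          ad_clicks_per_query=0.05, revenue_per_ad_click=0.40)

# Hypothetical worse ranking: more queries per task, and more ad clicks
# because the organic results are less helpful.
degraded = search_metrics(tasks=1000, queries_per_task=2.0,
                          ad_clicks_per_query=0.06, revenue_per_ad_click=0.40)

for name, metrics in (("baseline", baseline), ("degraded", degraded)):
    print(name, metrics)
```

Both short-term metrics improve in the degraded scenario, while queries per task (the thing users actually care about) gets worse.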

2. Beware of technical reasons for experiment results

The Hotmail link on the MSN home page was changed to open Hotmail in a separate tab/window. The naïve experiment results showed that users clicked more on the Hotmail link when it opened in a new window, but the majority of the observed effect was artificial.

Many browsers kill the previous page’s tracking JavaScript when a new page loads in the same window, so some clicks are never recorded. Safari blocked the tracking script in 50% of pages opening in the same window.

The “success” of getting users to click more was not real, but rather an instrumentation difference.

In other words, it wasn’t that more people were clicking on the link; more of the clicks were simply being tracked in the ‘open in new tab’ variant.
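A small simulation shows how this kind of instrumentation gap can manufacture a "lift". This is only a sketch: the true click rate and the 30% tracking loss are assumed values for illustration, and `observed_click_rate` is a hypothetical helper, not anything from Microsoft's tooling.

```python
import random

def observed_click_rate(n_users, true_click_rate, tracking_loss, rng):
    """Share of users whose click was actually recorded by the tracker."""
    recorded = 0
    for _ in range(n_users):
        clicked = rng.random() < true_click_rate
        tracked = rng.random() >= tracking_loss  # tracking script survived
        if clicked and tracked:
            recorded += 1
    return recorded / n_users

rng = random.Random(42)

# Both variants have the SAME true click rate (assumed 10%).
# Same window: an assumed 30% of clicks are lost when the page unloads.
same_window = observed_click_rate(100_000, 0.10, 0.30, rng)
# New tab: the original page stays open, so no clicks are lost.
new_tab = observed_click_rate(100_000, 0.10, 0.00, rng)

print(f"same window: {same_window:.3f}, new tab: {new_tab:.3f}")
```

The new-tab variant appears to win even though user behaviour is identical in both groups.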

3. Beware of peeking at results too early

When we release a new feature as an experiment, it is really tempting to peek at the results after a couple of days and see if the test confirms our expectation of success (confirmation bias). With the initial small sample, a big percentage change in either direction is likely.

Humans then have an innate tendency to see trends where there aren’t any.

The authors give the example of this chart:

[Chart: Extrapolating the trend]

Most experimenters would see the results and, even though they are negative, extrapolate the graph along the green line to a positive result in four days.


What actually happens is regression to the mean. This chart is actually from an A/A test (i.e. the two versions being tested are exactly the same). The random differences are biggest at the start and then tail off, so the long-term result tends towards a 0% difference as the sample size increases.
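You can see this effect for yourself by simulating an A/A test. The sketch below is an illustration only: the 5% conversion rate, 500 users per variant per day and 14-day duration are assumptions, and both function names are made up for this example.

```python
import random

def aa_cumulative_diff(days, users_per_day, conversion_rate, seed):
    """Simulate an A/A test; return the cumulative % difference (B vs A) after each day."""
    rng = random.Random(seed)
    users = [0, 0]
    conversions = [0, 0]
    diffs = []
    for _ in range(days):
        for arm in (0, 1):  # both arms use the identical conversion rate
            users[arm] += users_per_day
            conversions[arm] += sum(
                rng.random() < conversion_rate for _ in range(users_per_day)
            )
        rate_a = conversions[0] / users[0]
        rate_b = conversions[1] / users[1]
        diffs.append(100.0 * (rate_b - rate_a) / rate_a if rate_a else 0.0)
    return diffs

def mean_abs_diff_by_day(runs, days, users_per_day, conversion_rate):
    """Average size of the gap on each day, across many simulated A/A tests."""
    totals = [0.0] * days
    for seed in range(runs):
        daily = aa_cumulative_diff(days, users_per_day, conversion_rate, seed)
        for day, diff in enumerate(daily):
            totals[day] += abs(diff)
    return [total / runs for total in totals]

average = mean_abs_diff_by_day(runs=50, days=14, users_per_day=500, conversion_rate=0.05)
for day in (1, 7, 14):
    print(f"day {day:2d}: typical gap between identical variants = {average[day - 1]:.1f}%")
```

The gap between two identical variants is typically much larger on day 1 than on day 14, which is exactly the pattern that tempts people to extrapolate a trend.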

The simple advice is to wait until there are enough test results to draw a statistically significant conclusion. That generally means more than a week and hundreds of individual tests.
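As a rough illustration of what "statistically significant" means here, the sketch below uses a standard two-proportion z-test. This is a common textbook choice rather than necessarily the method used in the paper, and the visitor and conversion counts are invented for the example.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("no variation in the data; the test is undefined")
    z = (rate_b - rate_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

# Early peek (hypothetical): a 60% relative lift, but on only 200 users per variant.
z_small, p_small = two_proportion_z_test(10, 200, 16, 200)

# Later (hypothetical): a 20% relative lift on 10,000 users per variant.
z_large, p_large = two_proportion_z_test(500, 10_000, 600, 10_000)

print(f"early peek: z={z_small:.2f}, p={p_small:.3f}")
print(f"full test:  z={z_large:.2f}, p={p_large:.4f}")
```

The dramatic early lift is not significant at the usual 5% level, while the smaller lift on the larger sample is. Note that checking a test like this repeatedly as data arrives increases the chance of a false positive, so it is best run once, at a sample size decided in advance.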

4. Beware of the carryover effect from previous experiments

Many A/B test systems use a bucketing system to assign users into one experiment or another. At the end of one test the same buckets of users may be reused for the second test.

The problem is that if users return to your product regularly (multiple times daily in the case of Bing), then a highly positive or negative experience in one of the tests will affect all of that bucket for many weeks.

In one Bing experiment, which accidentally introduced a nasty bug, users who saw the buggy version were still making fewer searches 6 months after the experiment ended.

Ideally, your test system would re-randomise users for the start of every new test, so those carryover effects are spread as wide as possible.
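One common way to re-randomise is to hash the user ID together with an experiment-specific identifier, so each new test shuffles users independently of the last one. The sketch below assumes that approach; `assign_bucket` and the experiment names are hypothetical, and this is not a description of how Bing's system works.

```python
import hashlib

def assign_bucket(user_id: str, experiment_id: str, n_buckets: int = 2) -> int:
    """Deterministically assign a user to a bucket, salted by the experiment ID."""
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return int(digest, 16) % n_buckets

user_ids = [f"user-{i}" for i in range(10_000)]

# Users who saw the control (bucket 0) in a first, hypothetical experiment...
first_control = [u for u in user_ids if assign_bucket(u, "exp-homepage") == 0]

# ...are spread across both buckets of the next experiment.
moved_to_treatment = sum(assign_bucket(u, "exp-checkout") == 1 for u in first_control)
share = moved_to_treatment / len(first_control)

print(f"{share:.1%} of the first test's control group land in the second test's treatment")
```

Because the experiment ID is part of the hash input, any carryover effect from the first test ends up split roughly evenly between the variants of the second test, rather than concentrated in one of them.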


For me the biggest theme coming out of their research is the importance of A/A tests: seeing what kind of variation and results you get if you don’t change anything. That makes you more aware of the random fluctuations inherent in statistical tests.

In conclusion, you need to think about the possible sources of bias before acting on your tests. Even the most experienced analysts make mistakes!
