Bing Slams “Freakonomics” Bing It On Challenge Critique


Yesterday, we reported on a study appearing on the “Freakonomics” blog that disputed the “Bing It On” claim that people prefer Bing to Google in a blind comparison of search results. Study author Ian Ayres sought to replicate the Bing It On challenge methodology and argued that Bing’s claims were false and its messaging deceptive.

Bing pushed back hard yesterday in several ways. Matt Wallaert, a behavioral scientist at Bing, posted a lengthy point-by-point refutation of the Ayres report in comments on the story I wrote at Search Engine Land. Wallaert also responded to a similar story about the Ayres study at Search Engine Roundtable.

Microsoft later issued a formal statement from Wallaert:

The professor’s analysis is flawed and based on an incomplete understanding of both the claims and the Challenge. The Bing It On claim is 100% accurate and we’re glad to see we’ve nudged Google into improving their results. Bing It On is intended to be a lightweight way to challenge people’s assumptions about which search engine actually provides the best results. Given our share gains, it’s clear that people are recognizing our quality and unique approach to what has been a relatively static space dominated by a single service.

Later in the day, Microsoft published a blog post about the Ayres study. It echoed the points Wallaert made in his comments on the earlier stories. Below is most of the Wallaert post:

A couple of notes are important before I talk about Ayres’ claims. There are two separate claims that have been used with the Bing It On challenge. The first is “People chose Bing web search results over Google nearly 2:1 in blind comparison tests”. We blogged about the method here and it was used back in 2012. In 2013, we updated the claim to “People prefer Bing over Google for the web’s top searches”, which I blogged about here. Ayres frequently goes back and forth between the two claims in his post, so I wanted to make sure both were represented. Now, on to Ayres’ issues and my explanations.

First, he’s annoyed by the sample size, contending that 1,000 people is too few to obtain a representative sample on which to base a claim. Interestingly, Ayres then links to a paper he put together with his grad students, in which they also use a sample size of 1,000 people. They then subdivide the sample into thirds for different treatment conditions and yet still manage to meet conventional statistical tests using their sample.

If you’re confused, you’re not alone. A sample of 1,000 people doing the same task has more statistical power than a sample of 300 people doing the same task. Which is why statistics are so important; they help us understand whether the data we see is an aberration or a representation. A 1,000 person, truly representative sample is actually fairly large. As a comparison, the Gallup poll on presidential approval is just 1,500 people.

Next, Ayres is bothered that we don’t release the data from the Bing It On site on how many times people choose Bing over Google. The answer here is pretty simple: we don’t release it because we don’t track it. Microsoft takes a pretty strong stance on privacy and unlike in an experiment, where people give informed consent to having their results tracked and used, people who come to the Bing It On site are not agreeing to participate in research; they’re coming for a fun challenge. It isn’t conducted in a controlled environment, people are free to try and game it one way or another, and it has Bing branding all over it.

So we simply don’t track their results, because the tracking itself would be incredibly unethical. And we aren’t basing the claim on the results of a wildly uncontrolled website, because that would also be incredibly unethical (and entirely unscientific).

Ayres’ final issue is the fact that the Bing It On site suggests queries you can use to take the challenge. He contends that these queries inappropriately bias visitors towards queries that are likely to result in Bing favorability.

First, I think it is important to note: I have no idea if he is right. Because as noted in the previous answer, we don’t track the results from the Bing It On challenge. So I have no idea if people are more likely to select Bing when they use the suggested queries or not.

Here is what I can tell you. We have the suggested queries because a blank search box, when you’re not actually trying to use it to find something, can be quite hard to fill. If you’ve ever watched anyone do the Bing It On challenge at a Seahawks game, there is a noted pause as people try to figure out what to search for. So we give them suggestions, which we source from topics that are trending now on Bing, on the assumption that trending topics are things that people are likely to have heard of and be able to evaluate results about.

Which means that if Ayres is right and those topics are in fact biasing the results, it may be because we provide better results for current news topics than Google does. This is supported somewhat by the second claim; “the web’s top queries” are pulled from Google’s 2012 Zeitgeist report, which reflects a lot of timely news that occurred throughout that year.

To make it clear, in the actual controlled studies used to determine what claims we made, we used different approaches to suggesting queries. For the first claim (2:1), participants self-generated their own queries with no suggestions from us. In the second claim (web’s top queries), we suggested five queries of which they could select one. These five queries were randomly drawn from a list of roughly 500 from the Google 2012 Zeitgeist, and they could easily get five more if they didn’t like any queries from the five they were being shown.
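The sample-size argument in Wallaert's post can be checked with some quick arithmetic. The sketch below is illustrative only: it uses the standard normal-approximation margin of error for a proportion, assumes a simple random sample and the worst case of a 50/50 split, and is not the analysis that either Bing or Ayres actually ran. The function name `margin_of_error` and the group size of 333 (one third of 1,000) are assumptions made for the example.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion.

    Assumes a simple random sample of size n and uses the normal
    approximation; p=0.5 is the worst case (widest interval).
    """
    return z * math.sqrt(p * (1 - p) / n)

# Sample sizes mentioned in the post: a third of 1,000 (assumed to be 333),
# the full 1,000, and the 1,500 cited for the Gallup approval poll.
for n in (333, 1000, 1500):
    print(f"n={n}: +/- {margin_of_error(n) * 100:.1f} percentage points")
```

Under these assumptions a sample of 1,000 gives a margin of roughly three percentage points, and splitting it into thirds widens the margin for each group. The calculation says nothing about whether a given sample was representative, which is a separate question from its size.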

Google’s Matt Cutts reacted to the Ayres study on Google+:

Freakonomics looked into Microsoft’s “Bing It On” challenge. From the blog post: “tests indicate that Microsoft selected suggested search words that it knew were more likely to produce Bing-preferring results. …. The upshot: Several of Microsoft’s claims are a little fishy.  Or to put the conclusion more formally, we think that Google has a colorable deceptive advertising claim.”

I have to admit that I never bothered to debunk the Bing It On challenge, because the flaws (small sample size; bias in query selection; stripping out features of Google like geolocation, personalization, and Knowledge Graph; wording of the site; selective rematches) were pretty obvious.

Regardless of whether Bing or its critics are right about whose study methodology is more flawed, the thing that was most interesting to me about the Ayres findings was the fact that Bing won 41 percent of the time. That suggests that, in the context of an arguably antagonistic study, Bing did very well and is almost at parity with Google.

It would also seem to support what Microsoft has been claiming: that the Google brand, and not necessarily search quality, is now what sustains Google’s dominance in search.