Why is Double Blind Testing Controversial?


I noticed that the concept of "double blind testing" of cables is a controversial topic. Why? A/B switching seems like the only definitive way of determining how one cable compares to another, or any other component, such as speakers. While A/B testing (and particularly double blind testing, where you don't know which cable is A or B) does not show the long term listenability of a cable or other component, it does show the specific and immediate differences between the two. It shows whether there are differences at all, how slight they are, how important they are, etc. It seems obvious that without knowing which cable you are listening to, you eliminate bias and preconceived notions as well. So, why is this a controversial notion?
moto_man
I think Hearhere summed up the issue well in his last post, but I would come at it from a slightly different angle. Simply put, DBT is not, in and of itself, "controversial." However, there is a great deal of misunderstanding/disagreement regarding its use and applicability. More particularly, DBT is simply a tool, the results of which are interpreted based on statistical analysis, and must be understood in that context. While DBT does have some applicability in the audio context, it is not the be-all and end-all that some make it out to be.

There are two main problems with how DBTs are used/viewed by certain audiophiles. First and foremost, what many do not understand (but what anyone with experience in statistics can tell you) is that if a result is not statistically significant, the DBT has not "proven" there are no differences between conditions! Rather, all that can be concluded is that the DBT failed to reject the null hypothesis in favor of the alternative hypothesis.

Second, small-trial (aka "small-N") listening tests analyzed at commonly used statistical significance levels (e.g., .05) lead to large Type 2 error risks, thereby masking the very differences the tests are supposed to reveal.

Now breaking that down into English is a pain, but I'll give it a shot (I'm an engineer, as opposed to a statistician - thus any stats guys feel free to correct me). In a simple DBT, one attempts to determine if there are audible differences between two conditions (such as by inserting a new interconnect in a given system). This is more commonly called a hypothesis test - the goal is to determine whether you can reject a "null hypothesis" (there are in fact no differences between the two conditions) in favor of an "alternative hypothesis" (there are in fact differences between the two conditions).

In a DBT, there are four possible results: 1) there are differences and the listener correctly identifies that there are differences; 2) there are no differences and the listener correctly identifies there are no differences; 3) there are no differences, but the listener believes there are differences; and 4) there are differences, but the listener believes there are no differences. Obviously, 1 and 2 are correct results. Circumstance 3 (concluding that differences exist when in reality they don't) is commonly referred to as "Type 1 error". Circumstance 4 (missing a true difference) is commonly referred to as "Type 2 error". Put in terms of the hypothesis test stated above, Type 1 error occurs when the null hypothesis is true but is wrongly rejected, and Type 2 error occurs when the null hypothesis is false but is not rejected.

Now, things get a little complicated. First we need to introduce a variable, p_u, which is the probability of success of the underlying process. In the listening context, this is the probability that a listener can identify a difference between conditions, which is based on the acuity of the listener, the magnitude of the differences, and the conditions of the trial (e.g. the quality of the components, recording, ambient noise, etc). Unfortunately, we can never “know” p_u, but can only make reasonable guesses at it.

We also need to introduce the variable "alpha". Alpha, or the significance level, is the threshold the test result must clear before we reject the null hypothesis in favor of the alternative hypothesis. By selecting a suitable significance level during the data analysis, you choose the risk of Type 1 error that you are willing to tolerate. A common significance level used in DBT testing is .05.

Finally, we need to look at the probability value. In hypothesis testing, the probability value is the probability of obtaining data as extreme as or more extreme than the results achieved by the experiment, assuming the null hypothesis is true (put another way, it is the likelihood of seeing a result at least that extreme purely by chance, given the sampling distribution under the null).

Once the DBT is performed, one compares the probability value to alpha to determine whether the result of the test is statistically significant, such that we can reject the null hypothesis. In our example, if the null hypothesis is rejected, we can conclude there are in fact audible differences between ICs.
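
For anyone who wants to see the mechanics, here is a minimal sketch of that decision rule in plain Python (the function name and the 9-of-12 score are purely hypothetical illustrations, not from any real test):

    from math import comb

    def p_value(k_correct, n_trials, p_chance=0.5):
        # One-sided exact binomial p value: the probability of getting k_correct
        # or more trials right if the listener is merely guessing (null hypothesis).
        return sum(comb(n_trials, k) * p_chance**k * (1 - p_chance)**(n_trials - k)
                   for k in range(k_correct, n_trials + 1))

    alpha = 0.05
    p = p_value(9, 12)    # say a listener got 9 of 12 trials correct
    print(round(p, 4))    # ~0.073
    print("reject the null (audible difference)" if p < alpha
          else "fail to reject the null (inconclusive)")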

Now, here comes the fun part. It might seem that you want to set the smallest possible significance level to test the data, thereby producing the smallest possible risk of Type 1 error (i.e., set alpha to .01 as opposed to .05). However, this doesn't work, because, for a fixed number of trials, as you reduce the risk of Type 1 error (lower alpha), the risk of Type 2 error necessarily increases.

Further, and a greater impediment to practical DBT testing, the risk of Type 2 error increases not only as you reduce the Type 1 error risk, but also as the number of trials (N) shrinks and as the listener's true ability to hear the differences under test (p_u) drops. Since you really never know p_u, and can only speculate on how to increase it (e.g., by selecting only high quality recordings of unamplified music and using a high quality system to test the ICs), the best ways to reduce the risk of Type 2 error in a practical listening test are to increase either N or the risk of Type 1 error you are willing to accept.
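
To make that concrete, here is a rough sketch of how the Type 2 error risk can be worked out for a given test design (the helper names are my own, and p_u is of course just an assumed value, since we can never know it):

    from math import comb

    def binom_sf(k, n, p):
        # P(X >= k) for a binomial(n, p) variable
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

    def critical_value(n_trials, alpha=0.05, p_chance=0.5):
        # smallest number of correct answers whose one-sided p value falls below alpha
        for k in range(n_trials + 1):
            if binom_sf(k, n_trials, p_chance) < alpha:
                return k
        return n_trials + 1   # significance unreachable at this alpha

    def type2_risk(n_trials, p_u, alpha=0.05, p_chance=0.5):
        # probability the test misses a real difference: the listener truly hears
        # it with probability p_u per trial, yet scores below the critical value
        k_crit = critical_value(n_trials, alpha, p_chance)
        return 1 - binom_sf(k_crit, n_trials, p_u)

The examples that follow are just this arithmetic worked out for a few combinations of N, alpha and an assumed p_u.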

Now for some examples. Let's assume we use 16 tests on the IC in question. For purposes of the example, further assume that the probability of randomly guessing correctly whether the new IC was inserted is 0.5. Finally, we must make a guess at “p_u”, which we could say is 0.7. In this instance, the minimum number of correct results for the probability value to exceed .05 is 12 (our type I error in this case is = 0.0384). However, our type II error in this case goes through the roof - in this example, it is .5501, which is huge! Thus, this test suffers from a high level of type 2 error, and is therefore unlikely to resolve differences that actually exist between the interconnects.
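
Those two numbers are easy to check exactly (plain Python, nothing fancy):

    from math import comb

    # 16 trials, criterion of 12 correct, assumed true hit rate p_u = 0.7
    type1 = sum(comb(16, k) * 0.5**16 for k in range(12, 17))             # ~0.0384
    type2 = sum(comb(16, k) * 0.7**k * 0.3**(16 - k) for k in range(12))  # ~0.5501
    print(round(type1, 4), round(type2, 4))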

What happens if there were only 11 correct results? Our p value is then .1051, which exceeds alpha. Thus, we are not able to reject the null hypothesis in favor of the alternative hypothesis. However, this does not allow us to conclude that there are in fact no audible differences between ICs. In other words, data not sufficient to show convincingly that a difference between conditions is not zero do not prove that the difference is zero.
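
Again, the same kind of quick check:

    from math import comb

    # p value for only 11 of 16 correct under pure guessing
    p_11 = sum(comb(16, k) * 0.5**16 for k in range(11, 17))
    print(round(p_11, 4))   # ~0.1051, which is above alpha = .05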

So now let's increase the number of trials to 50. Now, the number of correct results needed to yield statistically significant results is 32 (p value = .0325). Assuming again p_u is 70%, our Type 2 error drops to ~ 0.14, which is more acceptable, and thus differences between conditions are more likely to be revealed by the test.
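
The same arithmetic for the 50-trial case:

    from math import comb

    # 50 trials, criterion of 32 correct, assumed true hit rate p_u = 0.7
    type1 = sum(comb(50, k) * 0.5**50 for k in range(32, 51))             # ~0.03
    type2 = sum(comb(50, k) * 0.7**k * 0.3**(50 - k) for k in range(32))  # ~0.14
    print(round(type1, 4), round(type2, 4))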

OK, one last variation. Let's assume that the differences are really minor, or we are using a boom box to test the interconnects, such that p_u is only 60%. What happens to Type 2 error? It goes up - in the 50 trial example above, it goes from .1406 to .6644 - again, the test likely masks any true difference between ICs.
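
And the boom-box case, keeping the same 32-of-50 criterion but dropping the assumed p_u to 0.6:

    from math import comb

    type2 = sum(comb(50, k) * 0.6**k * 0.4**(50 - k) for k in range(32))
    print(round(type2, 4))   # ~0.66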

To sum up, DBT is a tool that can be very useful in the audio context if used and understood correctly. Indeed, this is where I take issue with Bomarc, when he says "I don't want to get into statistics, except to say that's usually not the weak link in a DBT". Rather, the (mis)understanding of statistics is precisely the weak link in the applicability of DBTs.
Thanks, Rzado, for the refresher course. Let me try to summarize for anyone who fell asleep in class. In a DBT, if you get a statistically significant result (at least 12 correct out of 16 in one of Rzado's examples), you can safely conclude that you heard a difference between the two sounds you were comparing. If you don't score that high, however, you can't be sure whether you heard a difference or not. And the fewer trials you do, the more uncertain you should be.

This doesn't mean that DBTs are hopelessly inconclusive, however. Some, especially those that use a panel of subjects, involve a much higher number of trials. Also, there's nothing to stop anyone who gets an inconclusive result from conducting the test again. This can get statistically messy, because the tests aren't independent, and if you repeat the test often enough you're liable to get a significant result through dumb luck. But if you keep getting inconclusive results, the probability that you're missing something audible goes way down.
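
To put a rough number on the "dumb luck" point (treating repeated tests as if they were fully independent, which as noted they are not quite):

    alpha = 0.05
    for repeats in (1, 5, 10, 20):
        # chance of at least one spurious "significant" result in this many retests
        print(repeats, round(1 - (1 - alpha)**repeats, 3))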

To summarize, a single DBT can prove that a difference is audible. A thousand DBTs can't prove that it's inaudible--but the inference is pretty strong.

As for my statement about statistics not being the weak link, I meant that there are numerous ways to do a DBT poorly. There are also numerous ways to misinterpret statistics, in this or any other field. Most of the published results that I am familiar with handle the statistics properly, however.
Good post, Bomarc - I agree with 98% of what you had to say. I guess the one thing I'm not sure about is the point you are making with respect to multiple inconclusive tests lending support to a strong inference that a difference is inaudible. If you have multiple tests with high Type 2 error (e.g. Beta ~.4-.7), I do not believe this is accurate. However, if you have multiple tests where you take steps to minimize Type 2 error (high N trials), I can see where you are going. But you are correct, that can start getting messy.

Thanks for clarifying your point about statistics, though. In general, I tend to give experimenters the benefit of the doubt with respect to setting up the DBT, unless I have a specific problem with the setup. But I agree, there are numerous ways to screw it up.

However, the few studies in high-end audio with which I am familiar (e.g. the ones done by Audio magazine back in the 80's) in general suffered from the problems outlined above (small N leading to high Type 2 error, erroneous conclusions based on non-rejection of the null hypothesis due to tests not achieving p value < .05). There have been a couple of AES studies with which I'm familiar where the setup was such that p_u was probably no better than chance - in that circumstance, you can say either the setup is screwed up or the interpretation of the statistics is screwed up. At least one or two studies, though, were pretty demonstrative (e.g. the test of the Genesis Digital Lens, which resulted in 124 out of 124 correct identifications).

My biggest beef with DBT in audio is that to do it right you really have to do the work - i.e. run high-N trials - and that is a lot easier said than done.
Rzado: My point on retesting is this: If something really is audible, sooner or later somebody is going to hear it, and get a significant response, for the same reason that sooner or later, somebody is going to flip heads instead of tails. If you keep getting tails, eventually you start to suspect that maybe this coin doesn't have a heads. Similarly, if you keep getting non-significant results in a DBT, it becomes reasonable to infer that you probably (and we can only say probably) can't hear a difference.

As for published studies, the ones I've seen (which may not be the same ones you've seen) generally did get the statistics right. What usually happens is that readers misinterpret those studies--and both sides of The Great Debate have been guilty of that.