Why is Double Blind Testing Controversial?


I noticed that the concept of "double blind testing" of cables is a controversial topic. Why? A/B switching seems like the only definitive way of determining how one cable compares to another, or to any other component such as speakers. While A/B testing (and particularly double blind testing, where neither you nor the person running the test knows which cable is playing) does not show the long-term listenability of a cable or other component, it does show the specific and immediate differences between the two: whether there are differences at all, how slight they are, and how much they matter. It seems obvious that by not knowing which cable you are listening to, you eliminate bias and preconceived notions as well. So why is this a controversial notion?
moto_man
Thanks, Rzado, for the refresher course. Let me try to summarize for anyone who fell asleep in class. In a DBT, if you get a statistically significant result (at least 12 correct out of 16 in one of Rzado's examples), you can safely conclude that you heard a difference between the two sounds you were comparing. If you don't score that high, however, you can't be sure whether you heard a difference or not. And the fewer trials you do, the more uncertain you should be.
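
If you want to check the arithmetic behind that 12-out-of-16 figure yourself, a few lines of Python will do it. This is just a sketch of my own; it assumes each trial is an independent 50/50 guess under the null hypothesis of no audible difference, and the function name is mine:

    # Exact one-sided binomial test for a forced-choice DBT: under the
    # null hypothesis (no audible difference) every trial is a 50/50
    # guess, so we ask how likely "at least this many correct" is by luck.
    from math import comb

    def p_value(correct, trials):
        """P(X >= correct) when X ~ Binomial(trials, 0.5)."""
        return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2**trials

    print(p_value(12, 16))  # ~0.038, below the usual 0.05 cutoff
    print(p_value(11, 16))  # ~0.105, not good enough

So 12 of 16 just clears the usual 0.05 bar, while 11 of 16 does not - which is why a near-miss score leaves you unable to conclude much either way.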

This doesn't mean that DBTs are hopelessly inconclusive, however. Some, especially those that use a panel of subjects, involve a much higher number of trials. Also, there's nothing to stop anyone who gets an inconclusive result from conducting the test again. This can get statistically messy, because the tests aren't independent, and if you repeat the test often enough you're liable to get a significant result through dumb luck. But if you keep getting inconclusive results, the probability that you're missing something audible goes way down.
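
To put rough numbers on both halves of that - the dumb-luck risk from repeating tests and the shrinking chance of missing something real - here's an illustrative sketch. The alpha and power figures are assumptions of mine, and it treats the repeated tests as independent, which in practice they often aren't:

    # Rough numbers for both effects; alpha and power here are assumed
    # values for illustration, and the repeats are treated as independent.
    alpha = 0.05   # chance a single test shows a "significant" fluke
    power = 0.5    # chance a single test catches a difference that is real

    for n in (1, 5, 20):
        false_alarm = 1 - (1 - alpha) ** n   # at least one fluke "hit"
        missed_all = (1 - power) ** n        # a real difference missed every time
        print(n, round(false_alarm, 3), round(missed_all, 6))
    # With 20 repeats there is a ~64% chance of at least one lucky "significant"
    # result, but less than a one-in-a-million chance of missing a real
    # difference every single time.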

To summarize, a single DBT can prove that a difference is audible. A thousand DBTs can't prove that it's inaudible--but the inference is pretty strong.

As for my statement about statistics not being the weak link, I meant that there are numerous ways to do a DBT poorly. There are also numerous ways to misinterpret statistics, in this or any other field. Most of the published results that I am familiar with handle the statistics properly, however.
Good post, Bomarc - I agree with 98% of what you had to say. I guess the one thing I'm not sure about is the point you are making with respect to multiple inconclusive tests lending support to a strong inference that a difference is inaudible. If you have multiple tests with high Type 2 error (e.g. Beta ~.4-.7), I do not believe this is accurate. However, if you have multiple tests where you take steps to minimize Type 2 error (a high N of trials), I can see where you are going. But you are correct, that can start getting messy.
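
To make the Type 2 error point concrete, here's a rough power sketch. The assumption that a listener who genuinely hears a difference picks correctly 70% of the time is purely illustrative, and the function names are mine:

    # Sketch of beta (Type 2 error) as a function of N, assuming a listener
    # who would pick correctly 70% of the time if the difference is real.
    # That 0.7 is an assumed figure, chosen only for illustration.
    from math import comb

    def tail(k, n, p):
        """P(X >= k) for X ~ Binomial(n, p)."""
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

    def beta(n, true_p=0.7, alpha=0.05):
        # smallest score that counts as significant under pure guessing
        k_crit = next(k for k in range(n + 1) if tail(k, n, 0.5) <= alpha)
        # Type 2 error: chance this listener still falls short of that score
        return 1 - tail(k_crit, n, true_p)

    print(beta(16))   # roughly 0.55 - an inconclusive result is the likely outcome
    print(beta(64))   # roughly 0.07 - now a miss really does tell you something

In other words, with only 16 trials this hypothetical listener flunks the significance bar more often than not, which is exactly why a pile of small-N inconclusive results doesn't tell you much.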

Thanks for clarifying your point about statistics, though. In general, I tend to give experimenters the benefit of the doubt with respect to setting up the DBT, unless I have a specific problem with the setup. But I agree, there are numerous ways to screw it up.

However, the few studies in high-end audio with which I am familiar (e.g. the ones done by Audio magazine back in the '80s) generally suffered from the problems outlined above (small N leading to high Type 2 error, and erroneous conclusions drawn from non-rejection of the null hypothesis when the test failed to reach p < .05). There have been a couple of AES studies with which I'm familiar where the setup was such that p_u was probably no better than chance - in that circumstance, you can say either the setup is screwed up or the interpretation of the statistics is screwed up. At least one or two studies, though, were pretty demonstrative (e.g. the test of the Genesis Digital Lens, which resulted in 124 out of 124 correct identifications).
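
Just for scale on that last one - this is my own arithmetic, not anything from the study - the odds of a run like that under pure guessing are:

    # For scale: the chance of guessing 124 trials in a row correctly
    # when there is truly nothing to hear.
    print(0.5 ** 124)   # about 5e-38 - effectively impossible as a fluke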

My biggest beef with DBT in audio is that you really do need to do the work - i.e. run a high N of trials - which is a lot easier said than done.
Rzado: My point on retesting is this: If something really is audible, sooner or later somebody is going to hear it, and get a significant response, for the same reason that sooner or later, somebody is going to flip heads instead of tails. If you keep getting tails, eventually you start to suspect that maybe this coin doesn't have a heads. Similarly, if you keep getting non-significant results in a DBT, it becomes reasonable to infer that you probably (and we can only say probably) can't hear a difference.

As for published studies, the ones I've seen (which may not be the same ones you've seen) generally did get the statistics right. What usually happens is that readers misinterpret those studies--and both sides of The Great Debate have been guilty of that.