Saturday, 23 February 2013

Biased Debugging

We all make mistakes - Russ Poldrack's recent blog post is an excellent example of how even the most experienced scientists are liable to miss a pernicious bug in complex code. It could be the mental equivalent of missing a single double negative in a 10,000-word essay, a split infinitive that Microsoft Word fails to detect, or even a blatant typo underlined in red that goes unnoticed by the over-familiar eyes of the author.

In the case reported by Russ last week, although there was an error in the analysis, the actual result fit their experimental hypothesis and slipped through undetected. It was only when someone else independently analysed the same data, but failed to reproduce the exact result, that alarm bells sounded. Luckily, in this case the error was detected before anything was committed to print, but the warning is clear. Obviously, we need to be more careful, and to cross-check our results more thoroughly.

Here, I argue that we also need to think a bit more carefully about bias in the debugging process. Almost certainly, it was no coincidence that Russ's undetected error also yielded a result that was consistent with the experimental hypothesis. I argue that the debugging process is inherently biased, and will tend to seek out false positive findings that conform to our prior hopes and expectations.

Data analysis is noisy

Writing complex customised analysis routines is crucial in leading-edge scientific research, but is also error prone. Perfect coding is as unrealistic as perfect prose - errors are simply part of the creative process. When composing a manuscript, we may have multiple co-authors to help proofread numerous versions of the paper, and yet even then a few persistent grammatical errors, split infinitives, and double negatives slip through the net. Analysis scripts, however, are rarely so well scrutinised, line by line, variable by variable.

If we are lucky, coding errors just cause our analyses to crash, or throw up a clearly outrageous result. Either way, we will know that we have made a mistake, and roughly where we erred - we can then switch directly to debugging mode. But what if the erroneous result looks sensible? What if, just by chance, the spurious result supports your experimental hypothesis? What are the chances that you will continue to search for errors in your code when the results make perfect sense?

Your analysis script might contain hundreds of lines of code, and even if you do go through each one, we are notoriously bad at detecting errors in familiar script. Just think of the last time you asked someone else to read draft prose because you had become blind to typos in the text that you have read a million times before. By that stage, you know exactly what the text should say, and that is the only thing you can read any more. Unless you recruit fresh eyes from a willing proofreader, or your attention is directed to specific candidate errors, you will be pretty bad at seeing even blatant mistakes right in front of you.

Debugging is non-random

OK, analysis is noisy - so what? Data are noisy too, isn't it all just part of the messy business of empirical science? Perhaps, but the real problem is that the noise is not random. On the contrary, debugging is systematically biased to favour results that conform to our prior hopes and expectations, that is, our theoretical hypotheses.

If an error yields a plausible result by chance, it is far less likely to be detected and corrected than if the error throws up a crazy result. Worse, if the result is not even crazy, but just non-significant or otherwise 'uninteresting', then the dejected researcher will presumably spend longer looking for potential mistakes that could 'explain' the 'failed analysis'. In contrast, if the results look just fine, why rock the boat? This is like a drunkard's walk that veers systematically toward wine bottles to the left, and away from police to the right.
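
To make the mechanism concrete, here is a toy simulation (my own invented numbers, not from any real study) of many studies of a truly null effect. The 'biased' workflow hunts for bugs only when the result is disappointing, which quietly grants every buggy, non-significant analysis a second draw:

```python
import numpy as np

rng = np.random.default_rng(3)
n_studies, bug_rate, alpha = 20000, 0.3, 0.05

def run_analysis():
    # Toy analysis of a truly null effect: the p-value is uniform,
    # whether the code is correct or a bug replaced the result with noise.
    return rng.uniform()

def biased_workflow():
    buggy = rng.uniform() < bug_rate
    p = run_analysis()
    if p >= alpha and buggy:
        # Only a 'disappointing' result triggers the bug hunt; the fixed
        # analysis is then re-run, giving the study a second chance.
        p = run_analysis()
    return p

def unbiased_workflow():
    # Debugging happens before the result is inspected: one draw, full stop.
    return run_analysis()

biased = np.mean([biased_workflow() < alpha for _ in range(n_studies)])
unbiased = np.mean([unbiased_workflow() < alpha for _ in range(n_studies)])
print(biased, unbiased)  # the biased rate creeps above the nominal 5%
```

The inflation is modest per study, but it is systematic: it only ever pushes results toward the 'interesting' side.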

More degrees of freedom for generating false positives

With recent interest in myriad bad practices that boost false positive rates far beyond the assumed statistical probabilities (e.g., see Alok Jha's piece in the Guardian), I suggest that biased debugging could also contribute to the proliferation of false positives in the literature, especially in neuroimaging. Biased debugging is perhaps more insidious, because the pull towards false positives is not as obvious as it is with cherry-picking, data peeking, etc. Moreover, it is less obvious how to avoid the bias in debugging practices. As Russ notes in his post, code sharing is a good start, but it is not sufficient - errors can remain undetected even in shared code, especially if it is not widely used. The best possible safeguard is independent reanalysis - reproducing identical results using independently written analysis scripts. In this respect, it is more important to share the data than the analysis scripts, which should not be re-run with blind faith!

  1. Great post and another (less often considered) form of confirmation bias at work in science. The best defence we have is to accept its existence and institute mechanisms to counteract it. This is one reason why we're requiring authors to release raw data as part of the Cortex Registered Reports initiative.

    Another defence is to run control analyses on data sets in which genuine effects cannot exist. We've done this in the past with TMS-fMRI datasets because there are several artefacts of TMS in the scanner that can appear, insidiously, as activations beneath the TMS coil. So we run the full experiment on a phantom as well and repeat the exact analysis scripts. If any activations are found we know we have a problem!

    1. Thanks Chris for the comment. Making data freely available will certainly allow others to attempt to replicate the published results using the stated analysis pipeline given the same data. If actually put to this test, independent reanalysis based on the described methods would provide powerful confirmation that 1) the methods are sufficiently described in the paper, and 2) there was no coding error in the original implementation.

      Testing on null data is a great way to check results, I use permutations and noise data all the time to test code for spurious effects. The key is to apply this test systematically, not more often when things look 'bad' (i.e., implausible, unexpected, non-significant, disappointing, etc) relative to 'good' results. That is where the bias lies...
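
A minimal sketch of that kind of null check, in Python/NumPy (the `analysis` function here is just a stand-in for a real pipeline): re-run the identical code many times on label-shuffled data and see whether the observed 'effect' stands out from the permutation null. The key, as above, is to apply it whether the first-pass result looks good or bad:

```python
import numpy as np

rng = np.random.default_rng(0)

def analysis(group_a, group_b):
    # Stand-in for a full pipeline: difference of group means.
    return group_a.mean() - group_b.mean()

data = rng.standard_normal(40)            # pure noise data
labels = np.array([0] * 20 + [1] * 20)

observed = analysis(data[labels == 0], data[labels == 1])

# Permutation null: run the identical analysis with shuffled labels.
null = np.array([
    analysis(data[perm == 0], data[perm == 1])
    for perm in (rng.permutation(labels) for _ in range(2000))
])

p = (np.sum(np.abs(null) >= np.abs(observed)) + 1) / (len(null) + 1)
print(f"permutation p = {p:.3f}")
```

On noise data like this, a small p should make us suspect the code, not celebrate the result.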

  2. Good post. Debugging is in effect another (especially nasty) source of undisclosed flexibility.

    I wonder if the solution is to establish a canonical set of positive and negative controls? Make available online a dataset that clearly contains an activation or whatever, and another one that doesn't (maybe a phantom, or a resting-state time-series treated as an event-related one, like in this study).

    That would let you directly compare different methods to each other. e.g. you could show that although your method does create some false positives, it is only half as bad as the other guy's approach.
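
Even without a shared benchmark, that kind of comparison can be sketched on pure noise as the negative control. A hypothetical toy in Python/NumPy (z-scores stand in for voxel statistics; uncorrected vs. Bonferroni thresholds stand in for the two methods being compared):

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(1)
n_voxels, n_runs, alpha = 1000, 200, 0.05

def family_wise_error_rate(threshold):
    # Fraction of pure-noise runs in which at least one 'voxel' passes.
    hits = 0
    for _ in range(n_runs):
        z = rng.standard_normal(n_voxels)                   # no true effect
        p = np.array([erfc(abs(v) / sqrt(2)) for v in z])   # two-sided p
        hits += p.min() < threshold
    return hits / n_runs

uncorrected = family_wise_error_rate(alpha)
bonferroni = family_wise_error_rate(alpha / n_voxels)
print(uncorrected, bonferroni)  # uncorrected near 1.0, Bonferroni near alpha
```

The same harness would let any two correction schemes be scored head-to-head on identical null data.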

    1. Thanks for your comment!

      This could be useful for different approaches to standard kinds of activation studies, especially to assess the validity of a new correction for multiple comparisons. But it might be harder for more complex types of analysis, such as connectivity, network metrics, pattern analysis, decoding, etc. It is hard to imagine what a good test set would entail, though a phantom (or just a noise matrix) is probably pretty good in most cases.

      As I mentioned in response to Chris, I think the problem is not so much how to debug code (there are lots of great and clever tricks), but how to be systematic in debugging code that produces good or bad results.

      The problem is not even unique to complex and novel neuroimaging analyses: even boring old data entry into spreadsheets (e.g., SPSS) is error prone, and again, on average, there will be bias in which kinds of errors are tracked down by the researcher.

  3. Great post! I wonder if one way to address this issue would be to start coding any problem of interest using synthetic data, and only once the code is working on synthetic data (with both positive and null examples) would one then move to applying it to the actual data of interest.
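
A rough sketch of that synthetic-data-first discipline (Python/NumPy; the `pipeline` function is an invented toy standing in for a real analysis): the code must first recover an effect deliberately built into a positive control, and stay quiet on a null control, before it is allowed near the real data:

```python
import numpy as np

rng = np.random.default_rng(2)

def pipeline(a, b):
    # Toy analysis under development: group difference in pooled-SD units.
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

# Positive control: synthetic data with a large effect built in.
pos_a = rng.standard_normal(200) + 1.0
pos_b = rng.standard_normal(200)

# Negative control: synthetic data with no effect at all.
null_a = rng.standard_normal(200)
null_b = rng.standard_normal(200)

assert pipeline(pos_a, pos_b) > 0.5, "pipeline misses a built-in effect"
assert abs(pipeline(null_a, null_b)) < 0.5, "pipeline invents an effect"
# Only once both checks pass would the pipeline touch the real data.
```

Because both checks run unconditionally, the debugging effort no longer depends on whether the real result looks pleasing.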

  4. Thanks Russ - Great idea - I also agree with comments on your blog that there should be better systems for developing and testing new pipelines, as in other software development areas.

    If we reduce the noise in the system, we reduce the room for bias. And if we promise to develop our scripts only on data independent of that to which they will be applied, then we can hope to avoid capitalising on the vagaries of the data (so long as we promise to accept the result of the critical test, and don't re-check if things look funny or boring).

    This could be less practical for smaller scale adaptations, minor hacks, batch scripts, 'tweaks', etc. It would be great if every lab could employ a full-time programmer :-)