Bell focuses on this latter problem, highlighting in particular the problem of multiple comparisons. Essentially, the more we look, the more likely we are to find something by chance (i.e., some segment of random noise that doesn't look random - e.g., when a thousand monkeys eventually string together a few words from Hamlet). This is an extremely well-known problem in neuroscience, and indeed in any other science fortunate enough to have at its disposal methods for collecting so much data. Various statistical methods have been introduced, and debated, to deal with this problem. Some of these have been criticised for not doing what they say on the tin (i.e., overestimating the true statistical significance, e.g., see here), but there is also an issue of appropriateness. Most neuroimagers know the slightly annoying feeling you get when you apply the strictest correction to your data set and find an empty brain. Surely there must be some brain area active in my task? Or have I discovered a new form of cognition that does not depend on the physical properties of the brain! So we lower the threshold a bit, and suddenly some sensible results emerge.
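To make this concrete, here is a toy simulation (a rough Python sketch of my own, not taken from any particular analysis package) of a mass-univariate analysis on pure noise: with tens of thousands of voxel-wise tests, an uncorrected threshold is guaranteed to deliver 'activations' by chance alone.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_voxels = 50_000    # a whole-brain mass-univariate analysis
n_subjects = 20      # pure noise: no real effect anywhere

# One-sample t-test at every "voxel"
data = rng.standard_normal((n_subjects, n_voxels))
t, p = stats.ttest_1samp(data, popmean=0.0, axis=0)

# Uncorrected thresholds vs. a Bonferroni-corrected threshold
for alpha in (0.05, 0.001, 0.05 / n_voxels):
    print(f"p < {alpha:.2g}: {np.sum(p < alpha)} 'active' voxels "
          f"(expected by chance: {alpha * n_voxels:g})")
```

At p < .001 uncorrected you still expect around 50 spurious voxels; only the Bonferroni-corrected threshold leaves the brain empty, which here is the right answer, because there is no real effect anywhere.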
This is where we need to be extremely careful. In some sense, the eye can perform some pretty valid statistical operations. We can immediately see whether there is any structure in the image (e.g., symmetry), and we can also tell whether there seems to be a lot of 'noise' (e.g., other random-looking blobs). But here we are strongly influenced by our hopes and expectations. We ran the experiment to test some hypothesis, and our eye is bound to be more sympathetic to seeing something interesting in noise (especially as we have spent a lot of hard-earned grant money to run the experiment, and are under a lot of pressure to show something for it!). While expectations can be useful (i.e., the expert eye), they can also perpetuate bad science - once falsehoods slip into the collective consciousness of the neuroscientific community, they can be hard to dispel. Finally, structure is a truly deceptive beast. We are often completely captivated by its beauty, even when the structure comes from something quite banal (e.g., a smoothing kernel, a respiratory artifact).
So, we need to be conservative. But how conservative? To be completely sure we don't say anything wrong, we should probably just stay at home and run no experiments - zero chance of false positives. But if we want to find something out about the brain, we need to take some risks. However, we don't need to be complete cowboys about it either. Plenty of pioneers have already laid the groundwork for us to explore data whilst controlling for many of the problems of multiple comparisons, so we can start to make some sense of the beautiful and rich brain imaging data now clogging up hard drives all around the world.
These issues are not in any way unique to brain imaging. Exactly the same issues arise in any science lucky enough to suffer an embarrassment of riches (genetics, meteorology, epidemiology, to name just a few). And I would always defend mass data collection as inherently good. Although it raises problems, how can we really complain about having too much data? Many neuroimagers today even feel that fMRI is too limited - if only we could measure with high temporal resolution as well! Progress in neuroscience (or indeed any empirical science) is absolutely dependent on our ability to collect the best data we can, but we also need clever analysis tools to make sense of it all.
Update - For even more related discussion, see:
http://mindhacks.com/2012/05/28/a-bridge-over-troubled-waters-for-fmri/
http://thermaltoy.wordpress.com/2012/05/28/devils-advocate-uncorrected-stats-and-the-trouble-with-fmri/
http://www.danielbor.com/dilemma-weak-neuroimaging/
In the field of genetics, this problem has been massive. Jonathan Flint and colleagues described the rash of non-replicable findings in the early days of association studies in their book "How Genes Influence Behaviour". There have been two ways forward. One is to look at all possible comparisons and consider whether the distribution of p-values is in line with expectation (using a QQ plot) - i.e., if you do 1000 comparisons, then you would expect about one to be significant at .001. I realise it is complicated in imaging because voxels aren't independent, but I agree with you that new analytic tools are needed. But the other approach is simply to insist that all new findings of association are replicated.
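For illustration, here is a minimal sketch of that QQ-plot check (my own toy Python example, assuming independent tests - which, as noted, fMRI voxels are not):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)

# 1000 independent null tests: we expect about one p-value below .001,
# about fifty below .05, and a QQ plot hugging the identity line.
n_tests = 1000
p_values = stats.ttest_1samp(rng.standard_normal((20, n_tests)), 0.0, axis=0).pvalue

# Observed vs. expected quantiles on the -log10 scale
expected = -np.log10((np.arange(1, n_tests + 1) - 0.5) / n_tests)
observed = -np.log10(np.sort(p_values))

plt.plot(expected, observed, '.', label='observed')
plt.plot(expected, expected, 'k--', label='null expectation')
plt.xlabel('expected -log10(p)')
plt.ylabel('observed -log10(p)')
plt.legend()
plt.show()
```

Points rising above the dashed line would indicate more small p-values than chance predicts; points hugging it are exactly what 1000 null comparisons should look like.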
Overall it seems to me that the whole ethos of publishing in this area needs to change, so that we value replicability over novelty. Until that happens, it will be hard to progress. See http://deevybee.blogspot.co.uk/2012/01/novelty-interest-and-replicability.html
deevybee, thanks for the input. I agree that genetics is a particularly important area for statisticians to sharpen their tools. As Flint points out, the expected effect sizes are often so small that many of the published results are almost certainly false positives, given the lack of statistical power. Dealing with the non-independence of the noise in fMRI data does raise some specific problems. Temporal smoothness limits the degrees of freedom for within-participant tests (i.e., you don't have as many samples as you think), whereas spatial smoothness influences the number of comparisons (i.e., you haven't performed as many independent tests as you think). These go in opposite directions (not correcting for the degrees of freedom over-estimates significance, whereas correcting for each voxel test as if it were independent is unnecessarily conservative). It can be difficult to strike the right balance.
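To illustrate the first of these points, here is a toy Python sketch of my own (using a simple AR(1) noise model, nothing like real fMRI noise): correlate an arbitrary smooth regressor with autocorrelated noise, test it as if the time points were independent, and the false positive rate climbs well above the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def ar1_noise(n_timepoints, rho, n_series):
    """Generate AR(1) noise: x[t] = rho * x[t-1] + e[t]."""
    x = rng.standard_normal((n_series, n_timepoints))
    for t in range(1, n_timepoints):
        x[:, t] += rho * x[:, t - 1]
    return x

n_timepoints, rho, n_sims = 200, 0.5, 5000

# A smooth, arbitrary "task regressor" and pure-noise time series
regressor = np.sin(np.linspace(0, 4 * np.pi, n_timepoints))
noise = ar1_noise(n_timepoints, rho, n_sims)

# Test each correlation at p < .05, wrongly assuming independent samples
fp_rate = np.mean([stats.pearsonr(regressor, y)[1] < 0.05 for y in noise])
print(f"nominal alpha = 0.05, observed false positive rate ~ {fp_rate:.3f}")
```

The temporal smoothness means there are fewer effective samples than time points, so the naive within-participant test is far too liberal.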
But as you point out, the most important thing is replicability. This is true of all science, including behavioural psychology (see Ed Yong's excellent Nature blog: http://www.nature.com/news/replication-studies-bad-copy-1.10634). I agree that the main reason bad imaging papers get in is poor publication practice/culture, rather than simply because the field lacks awareness of basic inferential statistics. Too much pressure to magic up novel findings - beauty before truth!
Nice discussion - but I disagree that the problem is a lack of appropriate methods. Methods for correcting for multiple comparisons that account for the temporal and spatial structure you mention have been around for many years (cf. http://www.fmri-data-analysis.org/). In addition, for group fMRI analyses, the temporal autocorrelation issues just don't matter too much (http://www.ncbi.nlm.nih.gov/pubmed/19463958). The bigger problem in my view (as I mentioned to Vaughan) is that papers can still get published without using the appropriate corrections, or that researchers can use the many available degrees of analytic flexibility to torture the data until something comes out significant at that level.
Absolutely, I don't think we lack appropriate methods. There is a wealth of knowledge out there that can be, and has been, applied to brain imaging.
Moreover, I think brain imaging has been especially alert to the potential problems, and the community has developed some very good statistical methods for dealing with imaging data.
There is still room for improvement (e.g., assumption-free permutation approaches), but mainly in the use rather than invention of statistical approaches. It is the review/publication process that ultimately decides what gets through, and as in many fields (not just brain imaging), scientific/statistical rigour at this key stage can be a bit hit and miss.
Also, I agree that temporal autocorrelation hardly matters in group analyses where only effect sizes (i.e., beta parameters) are passed to the second level, and spatial smoothness can only help reduce the multiple comparison problem (if estimated correctly).
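As a footnote to the permutation point above, here is a minimal sign-flipping sketch (my own toy Python illustration in the spirit of max-statistic permutation tests, not a substitute for proper tools such as FSL's randomise): the permutation null distribution of the maximum |t| across voxels gives FWE-corrected p-values without parametric assumptions about the spatial structure.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def max_stat_permutation(betas, n_perms=1000):
    """One-sample group test across voxels, FWE-corrected by sign-flipping.

    betas : (n_subjects, n_voxels) array of first-level effect sizes.
    Returns the observed t-map and FWE-corrected p-values based on the
    permutation null distribution of the maximum |t| over voxels.
    """
    n_subjects = betas.shape[0]
    t_obs = stats.ttest_1samp(betas, 0.0, axis=0).statistic

    max_null = np.empty(n_perms)
    for i in range(n_perms):
        # Randomly flip the sign of each subject's effect map
        signs = rng.choice([-1.0, 1.0], size=(n_subjects, 1))
        t_perm = stats.ttest_1samp(betas * signs, 0.0, axis=0).statistic
        max_null[i] = np.max(np.abs(t_perm))

    # FWE-corrected p: how often the null maximum exceeds each observed |t|
    p_fwe = np.mean(np.abs(t_obs)[:, None] <= max_null[None, :], axis=1)
    return t_obs, p_fwe

# Toy example: 20 subjects, 2000 voxels, a real effect in the first 50 only
betas = rng.standard_normal((20, 2000))
betas[:, :50] += 1.5
t_map, p_fwe = max_stat_permutation(betas)
print(f"voxels surviving FWE-corrected p < .05: {np.sum(p_fwe < 0.05)} (true effect in 50)")
```

Because the maximum is taken over the image as it actually is, spatial smoothness automatically makes the corrected threshold less conservative, without having to estimate the number of independent tests.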