Monday 28 May 2012

A Tale of Two Evils: Bad statistical inference and just bad inference

Evil 1:  Flawed statistical inference

There has been a recent lively debate on the hazards of functional magnetic resonance imaging (fMRI), and which claims to believe or not in the scientific and/or popular literature [here, and here]. The focus has been on flawed statistical methods for assessing fMRI data, in particular the failure to correct for multiple comparisons [see also here at the Brain Box]. There was good consensus within this debate that the field is well attuned to the problem, and has taken sound and serious steps to preserve the validity of statistical inferences in the face of mass data collection. Agreed, there are certainly papers out there that have failed to use appropriate corrections, and the resulting statistical inferences are therefore flawed. But hopefully these can be identified, and reconsidered by the field. A freer and more dynamic system of publication could really help in this kind of situation [e.g., see here]. The same problems, and solutions, apply to fields beyond brain imaging [e.g., see here].

But I feel it may be worth pointing out that the consequence of such failures is a matter of degree, not kind. Although statistical significance is often presented as a categorical value (significant vs non-significant), the threshold is of course arbitrary, as undergraduates are often horrified to learn (why P<.05? yes, why indeed??). When we fail to correct for multiple comparisons, the expected false-positive rate rises, so the reported statistical significance misrepresents the true probability of the result arising by chance. Yes, this is bad, this is Evil 1. But perhaps there is a greater, more insidious evil to beware.
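To make the degree of the problem concrete, here is a minimal simulation sketch (the voxel count and experiment count are illustrative assumptions, not figures from any real study): if we run many independent tests on pure noise at an uncorrected threshold of P<.05, at least one "significant" result is virtually guaranteed.

```python
import random

random.seed(1)

ALPHA = 0.05
N_TESTS = 1000        # hypothetical number of independent voxel tests
N_EXPERIMENTS = 200   # simulated null experiments (no true effect anywhere)

# Under the null, each test is "significant" with probability ALPHA.
# Count how often a whole experiment produces at least one false positive.
experiments_with_hit = 0
for _ in range(N_EXPERIMENTS):
    hits = sum(1 for _ in range(N_TESTS) if random.random() < ALPHA)
    if hits > 0:
        experiments_with_hit += 1

print(experiments_with_hit / N_EXPERIMENTS)   # virtually always 1.0

# The analytic family-wise error rate agrees:
print(1 - (1 - ALPHA) ** N_TESTS)             # ~1.0 for 1000 tests
```

The point is the matter-of-degree one made above: nothing categorical changes at 1, 10, or 1000 tests; the expected false-positive rate simply climbs with every additional comparison.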

Evil 2: Flawed inference, period.

Whatever our statistical tests say, or do not say, ultimately it is the scientist, journalist, politician, skeptic, whoever, who interprets the result. One of the most serious and common problems is flawed causal inference: "because brain area X lights up when I think about/do/say/hear/dream/hallucinate Y, area X must cause Y". Again, this is a very well known error, undergraduates typically have it drilled into them, and most should be able to recite it like a mantra: "fMRI is correlational, not causal". Yet time and again we see this flawed logic hanging around, causing trouble.

There are of course other conceptual errors at play in the literature (e.g., that there must be a direct mapping between function and structure; that each cognitive concept we can imagine must have its own dedicated bit of brain, etc.), but I would argue that fMRI is actually doing more to banish than to reinforce ideas that we largely inherited from the 19th Century. The mass of brain imaging data, corrected or otherwise, will only further challenge these old ideas, as it becomes increasingly obvious that function is mediated via a distributed network of interrelated brain areas (ironically, ultra-conservative statistical approaches may actually obscure the network approach to brain function). However, brain imaging, even in principle, cannot disentangle correlation from causality. Other methods can, but as Vaughan Bell poetically notes:
Perhaps the most important problem is not that brain scans can be misleading, but that they are beautiful. Like all other neuroscientists, I find them beguiling. They have us enchanted and we are far from breaking their spell. [from here]
In contrast, the handful of methods (natural lesions, TMS, tDCS, animal ablation studies) that allow us to test the causal role of brain function do not readily generate beautiful pictures, and perhaps therefore suffer a prejudice that keeps them under-represented in peer-reviewed journals and/or the popular press. It would be interesting to assess the role of beauty in publication bias...

Update - For even more related discussion, see:

1 comment:


    "No adjustments are needed for multiple comparisons" - Rothman, Epidemiology, 1990

    An interesting take on the multiple comparisons issue.

    This strong argument might not generalize very well to large imaging datasets - but it does highlight the arbitrary way in which these corrections are made in publications (often at the level of the experiment - but one could equally correct for familywise errors over a paper, a year's papers... a scientific career...). Rather than setting an arbitrary (or arbitrarily corrected) threshold for significance, and ignoring everything that falls below that cut-off, surely we should take stock of all of our data (mindful of effect size) and decide how to proceed or what to conclude on that basis?

    We need to present our data in a way that lets us make realistic decisions about what it means - which boils down to summarizing effect size for individual findings. The number of tests we happen to have done in any particular experiment isn't necessarily relevant when trying to assess the meaning of a given result.
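The commenter's point about the arbitrariness of the correction "family" can be illustrated numerically. A Bonferroni-style correction divides the significance threshold by the number of tests in the family, so the per-test threshold depends entirely on where you (arbitrarily) draw the family boundary (the family sizes below are invented for illustration):

```python
# Bonferroni correction: per-test threshold = ALPHA / number of tests
# in the chosen "family". The family sizes here are hypothetical.
ALPHA = 0.05
families = {
    "one experiment (20 tests)": 20,
    "one paper (200 tests)": 200,
    "a year's papers (5,000 tests)": 5_000,
    "a scientific career (500,000 tests)": 500_000,
}

for label, n_tests in families.items():
    threshold = ALPHA / n_tests
    print(f"{label}: per-test threshold = {threshold:.2e}")
```

The same individual result can count as "significant" or not depending purely on which of these families we decide to correct over, which is the commenter's argument for reporting effect sizes rather than leaning on a single corrected cut-off.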