Showing posts with label statistics. Show all posts
Showing posts with label statistics. Show all posts

Tuesday, 16 April 2013

Statistical power is truth power

This week, Nature Reviews Neuroscience published an important article by Kate Button and colleagues quantifying the extent to which experiments in neuroscience may be statistically underpowered. For a number of excellent, and accessible summaries of the research, see here, here, here and this one in the Guardian from the lead author of the research.

The basic message is clear - collect more data! Data collection is expensive, and time consuming, but underpowered experiments are a waste of both time and money. Noisy data will decrease the likelihood detecting important effects (false negative), which is obviously disappointing for all concerned. But noisy datasets are also more likely to be over-interpreted, as the disheartened experimenter attempts to find something interesting to report. With enough time, and effort, trying lots of different analyses, something 'worth reporting' will inevitably emerge, even by chance (false positive). Put a thousand monkeys to a thousand typewriters, or leave an enthusiastic researcher alone long enough with a noisy data set, and eventually something that reads like a coherent story will emerge. If you are really lucky (and/or determined), it might even sound like a pretty good story, and end up published in a high-impact journal.

This is the classic Type 1 error, the bogeyman of undergraduate Statistics 101. But the problem of  false positives is very real, and continues to plague empirical research, from biological oncology to social psychology. Failure to replicate published results is the diagnostic marker of a systematic failure to separate signal from noise.

There are many bad scientific practices that increase the likelihood of false positives entering the literature, such as peeking, parameter tweaking, and publication bias, and there are some excellent initiatives out there to clean up these common forms of bad research practice. For example, Cortex has introduced a Registered Report format that should bring some rigour back to hypothesis testing, Psychological Science in now hoping to encourage replications and Nature Neuroscience has drawn up clearer guidelines to improve statistical practices.

These are all excellent initiatives, but I think we also need to consider simply increasing the margin of error. In a previous post, I argued that the accepted statistical threshold is far too lax. A 1-in-20 false discovery rate already seems absurdly permissive, but if we consider in all the other factors that invalidate basic statistical assumptions, then the true rate of false positives must be extremely high (perhaps 'Why Most Published Research Findings are False'). To increase the safety margin seems like an obvious first step to improving the reliability of published findings.

The downside, of course, to a more stringent threshold for separating signal from noise is that it demands a lot more data. Obviously, this will reduce the total number of experiments that can be conducted for the same amount of money. But as I recently argue in the Guardian, science on a shoestring budget can lead to more harm than good. If the research is important enough to fund, then it is even more important that it is funded properly. Spreading resources too thinly will only add noise and confusion to the process, leading further research down expensive and time-consuming blind alleys opened up by false positives.

So, the take home message is simple - collect more data! But how much more?

Matt Wall recently posted his thoughts on power analyses. These are standardised procedures for estimating the probability that you will be able to detect a significant effect, given a certain effect size and variance, for a given number of subjects. This approach is used widely for planning clinical studies, and is essentially the metric that Kate and colleagues use for demonstrate the systematic lack of statistical power in the neuroscience literature. But there's an obvious catch 22, as Matt points out. How are you supposed to know the effect size (and variance) if you haven't done the experiment? Indeed, isn't that exactly why you have proposed to conduct the experiment? To sample the distribution for an estimate of effect size (and variance)? Also, in a typical experiment, you might be interested in a number of possible effects, so which one do you base your power analysis on?

I tend to think that power analysis is best served for clinical studies, in which there is already a clear idea of the effect size you should be looking for (as it is bounded by practical concerns of clinical relevance). In contrast, basic science is often interested in whether there is an effect, in principle. Even if very small, it could be of major theoretical interest. In this case, there may be no lower bound effect size to impose, so without pre-cognition, it seems difficult to see how to establish the necessary sample size. Power calculations would clearly benefit replication studies, but it difficult to see how they could be applied for planning new experiments. Researchers can make a show of power calculations, by basing effect size estimations on some randomly selected previous study, but this is clearly a pointless exercise.

Instead, researchers often adopt rules of thumb, but I think the new rule of thumb should be: double your old rule of thumb! If you were previously content with 20 participants for fMRI, then perhaps you should recruit 40. If you have always relied on 100 cells, then perhaps you should collect data from 200 cells instead. Yes, these are essentially still just numbers, but there is nothing arbitrary about improving statistical power. And you can be absolutely sure that the extra time and effort (and cost) will pay dividends in the long run. You will spend less time analysing your data trying to find something interesting to report, and you will be less likely to send some other research down the miserable path of persistent failures to replicate your published false positive.


Saturday, 23 February 2013

Biased Debugging


We all make mistakes - Russ Poldrack's recent blog post is an excellent example of how even the most experienced scientists are liable to miss a malicious bug in complex code. It could be the mental equivalent of missing a single double negative in a 10,000 word essay, or a split-infinite that Microsoft word fails to detect or even a bald-faced typo underlined in red that remains unnoticed by the over-familiar eyes of the author.

In the case reported by Russ last week, although there was an error in the analysis, the actual result fit their experimental hypothesis and slipped through undetected. It was only when someone else independently analysed the same data, but failed to reproduce the exact result, that alarm bells sounded. Luckily, in this case the error was detected before anything was committed to print, but the warning is clear. Obviously, we need to be more careful, and cross-check our results more carefully.

Here, I argue that we also need to think a bit more carefully about bias in the debugging process. Almost certainly, it was no coincidence that Russ's undetected error also yielded a result that was consistent with the experimental hypothesis. I argue that the debugging process is inherently biased, and will tend to seek out false positive findings that conform to our prior hopes and expectations.

Data analysis is noisy


Writing complex customised analysis routines is crucial in leading-edge scientific research, but is also error prone. Perfect coding is as unrealistic as perfect prose - errors are simply part of the creative process. When composing a manuscript, we may have multiple co-authors to help proofread numerous versions of the paper, and yet even then we often find a few persistent grammatical errors, split infinitives, double negatives slip through the net. Analysis scripts, however, are less often so well scrutinised, line by line, variable by variable.

If lucky, coding errors just cause our analyses to crash, or throw up a clearly outrageous result. Either way, we will know that we have made a mistake, and roughly where we erred - we can then switch directly to debugging mode. But what if the erroneous result looks sensible? Just by chance, what if the spurious result supports your experimental hypothesis? What are the chances that you will continue to search for errors in your code when the results make perfect sense?

Your analysis script might contain hundreds of lines of code, and even if you do go through each one, we are notoriously bad at detecting errors in familiar script. Just think of the last time you asked someone else to read draft prose because you had become blind to typos in the text that you have read a million times before. By that stage, you know exactly what the text should say, and that is the only thing you can read any more. Unless you recruit fresh eyes from a willing proofreader, or your attention is directed to specific candidate errors, you will be pretty bad at seeing even blatant mistakes right in front of you.

Debugging is non-random


OK, analysis is noisy - so what? Data are noisy too, isn't it all just part of the messy business of empirical science? Perhaps, but the real problem is that the noise is not random. On the contrary, debugging is systematically biased to favour results that conform to our prior hopes and expectations, that is, our theoretical hypotheses.

If an error yields a plausible result by chance, it is far less likely to be detected and corrected than if the error throws up a crazy result. Worse, if the result is not even crazy, but just non-significant or otherwise 'uninteresting', then the dejected researcher will presumably spend longer looking for potential mistakes that could 'explain' the 'failed analysis'. In contrast, if the results looks just fine, why rock the boat? This is like a drunkard's walk that veers systematically toward wine bottles to the left, and away from police to the right.

More degrees of freedom for generating false positives


With recent interest in myriad bad practises that boost false positive rates far beyond the assumed statistical probabilities (e.g., see Alok Jha's piece in the Guardian), I suggest that biased debugging could also contribute to the proliferation of false positives in the literature, especially in the neuroimaging literature. Biased debugging is also perhaps more insidious, because the pull towards false positives is not as obvious in debugging as it is with cherry-picking, data peeking, etc. Moreover, it is perhaps less obvious how to avoid the bias in debugging practices. As Russ notes in his post, code sharing is a good start, but it is not sufficient - errors can remain undetected even in shared code, especially if not widely used. The best possible safeguard is independent reanalysis - to reproduced identical results using independently written analysis scripts. In this respect, it is more important to share the data rather than the analysis scripts, which should not be re-run with blind faith!


See also: http://www.russpoldrack.org/2013/02/anatomy-of-coding-error.html

Saturday, 15 September 2012

Must we really accept a 1-in-20 false positive rate in science?

There has been some very interesting and extremely important discussion recently addressing a fundamental problem in science: can we believe what we read?

After a spate of high-profile cases of scientific misdemeanours and outright fraud (see Alok Jha's piece in the Guardian), people are rightly looking for solutions to restore credibility to the scientific process [e.g., see Chris Chambers and Petroc Sumner's Guardian response here].

These include more transparency (especially pre-registering experiments), encouraging replication, promoting the dissemination of null effects, shifting career rewards from new findings (neophilia) to genuine discoveries, abolishing the cult of impact factors, etc. All these are important ideas, and many are more or less feasible to implement, especially with the right top-down influence. However, it seems to me that one of the most basic problems is staring us right in the face, and would require absolutely no structural change to correct. The fix is as simple as re-drawing a line in the sand.

Critical p-value: line in the sand

Probability estimates are inherently continuous, yet we typically divide our observations into two classes: significant (i.e., true, real, bona fide, etc) and non-significant (i.e., the rest). This reduces the mental burden of assessing experimental results - all we need to know is whether an effect is real, i.e., passes a statistical threshold. And so there are conventions, the most widely used being p<.05. If our statistical test falls below a probability of 5% chance level, we may assert that our conclusion is justified. Ideally, this threshold ensures that our inference is correct with at least 95% certainty. But turn this around, and it also means that at worst, the assertion could be wrong (i.e., false positive) one time in twenty (about the same odds as being awarded a research grant in the current climate). That already seems pretty high odds for accepting false positive claims in science. But worse, this is also only the ideal theoretical case. There are many dubious scientific practices that dramatically inflate the false discovery rate, such as cherry picking and peeking during data collection (see here).

These kinds of fishy goings-on are evident in statistical anomalies, such as the preponderance of just-significant effects reported in the literature (see here for blog review of empirical paper). Although it is difficult to estimate the true false positive rate out there, it can only be higher than the ideal one in twenty rate assumed by our statistical convention. So, even before worrying about outright fraud, it is actually quite likely that many of the results we read about in the peer-reviewed literature are in fact false positives.

Boosting the buffer zone

The obvious solution is to tighten up the accepted statistical threshold. Take physics, for example. Those folk only accept a new particle into their text books if the evidence reaches a statistical threshold of  5 sigma (i.e., p<0.0000003). Although the search for the Higgs boson involved plenty of peeking along the way, at 5 sigma the resultant inflation of the false discovery rate is hardly likely to matter. We can still believe the effect. A strict threshold level provides a more comfortable buffer between false positive and true effect. Although there are good and proper ways to correct for peeking, multiple comparisons, etc., all these assume full disclosure. It would clearly be safer just to adopt a conservative threshold. Perhaps not one quite as heroic as 5 sigma (after all, we aren't trying to find the God particle), but surely we can do better than a one-in-twenty false discovery rate as the minimal and ideal threshold.

Too conservative?

Of course, tightening the statistical threshold would necessarily increase the number of failures to detect a true effect, so-called type II errors. However, it is probably fair to say that most fields in science are suffering more from false positives (type I errors) than type II errors. False positives are more influential than false negatives, and harder to dispel. In fact, we are probably more likely to consider a null effect as a real effect cloaked in noise, especially if there is already a false positive lurking about somewhere in the published literature. It is notoriously difficult to convince your peers that your non-significant test indicates a true null effect. Increasingly, Bayesian methods are being developed to test for sameness between distributions, but this is another story.

The main point is that we can easily afford to be more conservative when bestowing statistical significance to putative effects, without stifling scientific progress. Sure, it would be harder to demonstrate evidence for really small effects, but not impossible if they are important enough to pursue. After all, the effect that betrayed the Higgs particle was very small indeed, but that didn't stop them from finding it. Valuable research could focus on validating trends of interest (i.e., strongly predicted results), rather than chasing down the next new positive effect leaving behind a catalogue of potentially suspect "significant effects" in your wake. Science cannot progress as a house of cards.

Too expensive?

Probably not. Currently, we are almost certainly wasting research money chasing down the dead ends that are opened up by false positives. A reduced, but more reliable corpus of highly reliable results would almost certainly increase the productivity of many scientific fields. At present, the pressure to publish has precipitated a flood of peer-reviewed scientific papers reporting any number of significant effects, many of which will almost certainly not stand the test of time. It would seem a far more sensible use of resources to focus on producing fewer, but more reliable scientific manuscripts. Interim findings and observations could be made readily available via any number of suggested internet-based initiatives. These more numerous 'leads' could provide a valuable source of possible research directions, without yet falling into the venerable category of immutable (i.e., citable) scientific fact. Like conference proceedings, they could adopt a more provisional status until they are robustly validated.

Raise the bar for outright fraud

Complete falsification is hard to detect in the absence of actual whistleblowers. In Simonsohn's words: "outright fraud is somewhat impossible to estimate, because if you're really good at it you wouldn't be detectable" (from Alok Jha). Even publishing the raw data is no guarantee of catching out the fraudster, as there are clever ways to generate plausible-looking data sets that would pass veracity testing.

However, fraudsters presumably start their life of crime in the grey area of routine misdemeanour. A bit of peeking here, some cherry picking there, before they are actually making up data points. Moreover, they know that even if their massaged results fail to replicate, benefit of the doubt should reasonably allow them to claim to be unwitting victims of an innocent false positive. After all, at p<0.05 there is already a 1-in-20 chance of a false positive, even if you do everything by the letter!

Like rogue traders, scientific fraudsters presumably start with a small, spur-of-the-moment act that they reasonably believe they can get away with. If we increase the threshold that needs to be crossed, fewer unscrupulous researchers will be tempted down the dark and ruinous path of scientific fraud. And if they did, it would be much harder for them to claim innocence after their 5 sigma results fail to replicate.

Why impose any statistical threshold at all?

Finally, it is worth noting some arguments that the statistical threshold should be abolished altogether. Maybe we should be more interested in the continua of effect sizes and confidence intervals, rather than discrete hypothesis testing  [e.g., see here]. I have a lot of sympathy for this argument. A more quantitative approach to inferential statistics would more accurately reflect the continuous nature of evidence and certainly, and also more readily suit meta-analyses. However, it is also useful to have a standard against which we can hold up putative facts for the ultimate test: true or false.

Monday, 28 May 2012

A Tale of Two Evils: Bad statistical inference and just bad inference

Evil 1:  Flawed statistical inference

There has been a recent lively debate on the hazards of functional magnetic resonance imaging (fMRI), and what claims to believe or not in the scientific and/or popular literature [here, and here]. The focus has been on flawed statistical methods for assessing fMRI data, and in particular failure to correct for multiple comparisons [see also here at the Brain Box]. There was quite good consensus within this debate that the field is pretty well attuned to the problem, and has taken sound and serious steps to preserve the validity of statistical inferences in the face of mass data collection. Agreed, there are certainly papers out there that have failed to use appropriate corrections, and therefore the resulting statistical inferences are certainly flawed. But hopefully these can be identified, and reconsidered by the field. A freer and more dynamic system of publication could really help in this kind of situation [e.g., see here]. The same problems, and solutions apply to non-brain imaging field [e.g., see here].

But I feel that it may be worth pointing out that the consequence of such failures is a matter of degree, not kind. Although statistical significance is often presented as a category value (sig vs ns), the threshold is of course arbitrary, as undergraduates are often horrified to learn (why P<.05? yes, why indeed??). When we fail to correct for multiple comparisons, the expected probabilities change, therefore the reported statistical significance is incorrectly represented. Yes, this is bad, this is Evil 1. But perhaps there is a greater, more insidious evil to beware.

Evil 2: Flawed inference, period.

Whatever our statistical test say, or do not say, ultimately it is the scientist, journalist, politician, skeptic, whoever, who interprets the result. One of the most serious and common problems is flawed causal inference: "because brain area X lights up when I think about/do/say/hear/dream/hallucinate Y, area X must cause Y". Again, this is a very well known error, undergraduates typically have it drilled into them, and most should be able to recite like mantra: "fMRI is correlational, not causal". Yet time and again we see this flawed logic hanging around, causing trouble.

There are of course other conceptual errors at play in the literature (e.g., there must be a direct mapping between function and structure; each cognitive concept that we can imagine must have its own dedicated bit of brain, etc), but I would argue perhaps that fMRI is actually doing more to banish than reinforce ideas that we largely inherited from the 19th Century. The mass of brain imaging data, corrected or otherwise, will only further challenge these old ideas, as it becomes increasingly obvious that function is mediated via a distributed network of interrelated brain areas (ironically, ultra-conservative statistical approaches may actually obscure the network approach to brain function). However, brain imaging, even in principle, cannot disentangle correlation from causality. Other methods can, but as Vaughan Bell poetically notes:
Perhaps the most important problem is not that brain scans can be misleading, but that they are beautiful. Like all other neuroscientists, I find them beguiling. They have us enchanted and we are far from breaking their spell. [from here]
In contrast, the handful of methods (natural lesions, TMS, tDCS, animal ablation studies) that allow us to test the causal role of brain function do not readily generate beautiful pictures, and perhaps, therefore suffer a prejudice that keeps them under-represented in peer-review journals, and/or popular press. It would be interesting to assess the role of beauty in publication bias...

Update - For even more related discussion, see:
http://thermaltoy.wordpress.com/2012/05/28/devils-advocate-uncorrected-stats-and-the-trouble-with-fmri/
http://www.danielbor.com/dilemma-weak-neuroimaging/
http://neuroskeptic.blogspot.co.uk/2012/04/fixing-science-systems-and-politics.html

Sunday, 27 May 2012

Truth before beauty? Making sense of mass data

Modern brain imaging methods can produce some remarkably beautiful images, and sometimes we are won over by them. In his opinion piece in today's Observer, Vaughan Bell highlights some of the multifarious problems that may arise in brain imaging, and in particular, functional magnetic resonance imaging (fMRI). During a typical fMRI experiment, we record many thousands of estimates of neural activity across the whole brain every second. At the end, we have an awful lot of data, and potentially an embarrassment of riches. Firstly, where do we begin looking for interesting effects? And when we find something that could be interesting, how do we know that it is 'real', and not just the kind of lucky find that is bound to accompany an exhaustive search?

Bell focuses on this later problem, highlighting in particular the problem of multiple comparisons. Essentially, the more we look, the more we are likely to find something by chance (i.e., some segment of random noise that doesn't look random - e.g., when eventually a thousand monkeys string together a few words from Hamlet). This is an extremely well known problem in neuroscience, and indeed any other science that is fortunate to have at its disposal methods for collecting so much data. Various statistical methods have been introduced, and debated, to deal with this problem. Some of these have been criticised for not doing what it says on the tin (i.e., overestimating the true statistical significance, e.g., see here), but there is also an issue of appropriateness. Most neuroimagers know the slightly annoying feeling you get when you apply the strictest correction to your data set and find an empty brain. Surely there must be some brain area active in my task? Or have I discovered a new form of cognition that does not depend on the physical properties of brain! So we lower the threshold a bit, and suddenly some sensible results emerge. 

This is where we need to be extremely careful. In some sense, the eye can perform some pretty valid statistical operations. We can immediately see if there is any structure in the image (e.g., symmetry, etc), we can also tell whether there seems to be a lot of 'noise' (e.g., other random looking blobs). But now we are strongly influenced by our hopes and expectations. We ran the experiment to test some hypothesis, and our eye is bound to be more sympathetic to seeing something interesting in noise (especially as we have spent a lot of hard earned grant money to run the experiment, and under a lot of pressure to show something for it!). While expectations can be useful (i.e., the expert eye), they can also perpetuate bad science - once falsehoods slip into the collective consciousness of the neuroscientific community, they can be hard to dispel. Finally, structure is a truly deceptive beast. We are often completely captivated by it's beauty, even when the structure comes from something quite banal (e.g., smoothing kernel, respiratory artifact, etc).

So, we need to be conservative. But how conservative? To be completely sure we don't say anything wrong, we should probably just stay at home and run no experiments - zero chance of false positives. But if we want to find something out about the brain, we need to take some risks. However, we don't need to be complete cowboys about it either. Plenty of pioneers have already laid the groundwork for us to explore data whilst controlling for many of the problems of multiple comparisons, so we can start to make some sense of the beautiful and rich brain imaging data now clogging up hard drives all around the world.

These issues are not in any way unique to brain imaging. Exactly the same issues arise in any science lucky enough to suffer the embarrassment of riches (genetics, meteorology, epidemiology, to name just a few). And I would always defend mass data collection as inherently good. Although it raises problems, how can we really complain about having too much data? Many neuroimagers today even feel that fMRI is too limited, if only we could measure with high-temporal resolution as well! Progress in neuroscience (or indeed any empirical science) is absolutely dependent on our ability to collect the best data we can, but we also need clever analysis tools to make some sense of it all.



Update - For even more related discussion, see:
http://mindhacks.com/2012/05/28/a-bridge-over-troubled-waters-for-fmri/
http://thermaltoy.wordpress.com/2012/05/28/devils-advocate-uncorrected-stats-and-the-trouble-with-fmri/
http://www.danielbor.com/dilemma-weak-neuroimaging/