Saturday, 15 September 2012

Must we really accept a 1-in-20 false positive rate in science?

There has been some very interesting and extremely important discussion recently addressing a fundamental problem in science: can we believe what we read?

After a spate of high-profile cases of scientific misdemeanours and outright fraud (see Alok Jha's piece in the Guardian), people are rightly looking for solutions to restore credibility to the scientific process [e.g., see Chris Chambers and Petroc Sumner's Guardian response here].

These include more transparency (especially pre-registering experiments), encouraging replication, promoting the dissemination of null effects, shifting career rewards from new findings (neophilia) to genuine discoveries, abolishing the cult of impact factors, etc. All these are important ideas, and many are more or less feasible to implement, especially with the right top-down influence. However, it seems to me that one of the most basic problems is staring us right in the face, and would require absolutely no structural change to correct. The fix is as simple as re-drawing a line in the sand.

Critical p-value: line in the sand

Probability estimates are inherently continuous, yet we typically divide our observations into two classes: significant (i.e., true, real, bona fide, etc.) and non-significant (i.e., the rest). This reduces the mental burden of assessing experimental results - all we need to know is whether an effect is real, i.e., whether it passes a statistical threshold. And so there are conventions, the most widely used being p<.05. If the probability of obtaining our result by chance alone falls below 5%, we may assert that our conclusion is justified. Ideally, this threshold ensures that when there is truly no effect, we will wrongly declare one no more than once in twenty tests. But turn this around, and it also means that, at worst, the assertion could be wrong (i.e., a false positive) one time in twenty (about the same odds as being awarded a research grant in the current climate). That already seems a pretty generous allowance for false positive claims in science. But worse, this is only the ideal theoretical case. There are many dubious scientific practices that dramatically inflate the false positive rate, such as cherry picking and peeking during data collection (see here).
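
To make that inflation concrete, here is a minimal simulation sketch (the sample sizes and the check-every-ten-subjects rule are illustrative choices, not from any particular study). Both analyses run on pure noise, so every "significant" result is by construction a false positive: the planned fixed-sample test hovers around the nominal 5%, while stopping as soon as p dips below .05 pushes the rate far higher.

```python
# A toy demonstration that "peeking" inflates the nominal 5% false positive rate.
# All data are pure noise (mean zero), so every significant result is a false positive.
# The sample sizes and the check-every-10-subjects rule are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_max, batch = 5000, 100, 10

fixed_hits = peeking_hits = 0
for _ in range(n_experiments):
    data = rng.normal(0.0, 1.0, n_max)          # null data: no real effect

    # Planned analysis: one t-test at the pre-specified sample size.
    if stats.ttest_1samp(data, 0.0).pvalue < 0.05:
        fixed_hits += 1

    # Peeking: test after every batch of 10 and stop as soon as p < .05.
    for n in range(batch, n_max + 1, batch):
        if stats.ttest_1samp(data[:n], 0.0).pvalue < 0.05:
            peeking_hits += 1
            break

print(f"planned analysis: {fixed_hits / n_experiments:.3f}")    # close to the nominal 0.05
print(f"with peeking:     {peeking_hits / n_experiments:.3f}")  # well above 0.05
```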

These kinds of fishy goings-on are evident in statistical anomalies, such as the preponderance of just-significant effects reported in the literature (see here for a blog review of the empirical paper). Although it is difficult to estimate the true false positive rate out there, it can only be higher than the ideal one-in-twenty rate assumed by our statistical convention. So, even before worrying about outright fraud, it is actually quite likely that many of the results we read about in the peer-reviewed literature are in fact false positives.
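
Some back-of-the-envelope arithmetic shows how quickly this can happen. If only a fraction of the hypotheses we test are actually true, and our studies are modestly powered, then even a by-the-book p<.05 lets false positives pile up among the "significant" findings. A minimal sketch (the power and prior values below are purely illustrative assumptions):

```python
# Proportion of nominally significant results that reflect real effects, given
# alpha, the average statistical power, and the prior probability that a tested
# hypothesis is true. The power and prior values are illustrative assumptions.
def prop_true_positives(alpha, power, prior):
    """P(effect is real | p < alpha)."""
    true_pos = power * prior
    false_pos = alpha * (1 - prior)
    return true_pos / (true_pos + false_pos)

print(prop_true_positives(alpha=0.05, power=0.8, prior=0.5))   # ~0.94: the optimistic case
print(prop_true_positives(alpha=0.05, power=0.5, prior=0.2))   # ~0.71
print(prop_true_positives(alpha=0.05, power=0.3, prior=0.1))   # ~0.40: most "effects" are false
print(prop_true_positives(alpha=0.005, power=0.3, prior=0.1))  # ~0.87: a stricter alpha helps
```

The answer depends as much on statistical power and on the plausibility of the hypotheses being tested as on the threshold itself - a point picked up in the comments below.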

Boosting the buffer zone

The obvious solution is to tighten up the accepted statistical threshold. Take physics, for example. Those folk only accept a new particle into their textbooks if the evidence reaches a statistical threshold of 5 sigma (i.e., p<0.0000003). Although the search for the Higgs boson involved plenty of peeking along the way, at 5 sigma the resulting inflation of the false positive rate is hardly likely to matter. We can still believe the effect. A stricter threshold provides a more comfortable buffer between false positive and true effect. There are good and proper ways to correct for peeking, multiple comparisons, etc., but all of these assume full disclosure. It would clearly be safer just to adopt a conservative threshold. Perhaps not one quite as heroic as 5 sigma (after all, we aren't trying to find the God particle), but surely we can do better than a one-in-twenty false positive rate as our minimal, best-case standard.
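
For a sense of scale, 5 sigma is just the one-tailed tail probability of a standard normal distribution beyond five standard deviations, and our familiar p<.05 sits at well under two sigma; a quick sketch of the conversion:

```python
# Where the "5 sigma" figure comes from: the one-tailed tail probability of a
# standard normal distribution, which is how particle physics reports it.
from scipy import stats

print(stats.norm.sf(5))          # ~2.9e-07, i.e. roughly p < 0.0000003
print(stats.norm.isf(0.05))      # ~1.64: one-tailed p < .05 is crossed at ~1.6 sigma
print(stats.norm.isf(0.05 / 2))  # ~1.96: two-tailed p < .05 corresponds to ~2 sigma
```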

Too conservative?

Of course, tightening the statistical threshold would necessarily increase the number of failures to detect a true effect, so-called type II errors. However, it is probably fair to say that most fields in science are suffering more from false positives (type I errors) than from type II errors. False positives are more influential than false negatives, and harder to dispel. In fact, we are probably more likely to treat a null effect as a real effect cloaked in noise, especially if there is already a false positive lurking about somewhere in the published literature. It is notoriously difficult to convince your peers that your non-significant test indicates a true null effect. Increasingly, Bayesian methods are being developed to test for evidence in favour of the null (i.e., sameness between distributions), but this is another story.
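
To put rough numbers on that trade-off, here is a minimal sketch using a simple normal approximation (the effect size, the sample size, and the helper approx_power are illustrative assumptions, not taken from the post): tightening alpha at a fixed sample size does cost a lot of power, but that power can be bought back with more data.

```python
# A rough look at the type I / type II trade-off using a normal approximation to a
# two-sided one-sample test. The effect size d, sample size n, and the helper
# approx_power are illustrative assumptions, not taken from the post.
from scipy import stats

def approx_power(d, n, alpha):
    """Approximate P(reject H0 | true standardised effect d, n observations)."""
    z_crit = stats.norm.isf(alpha / 2)
    return stats.norm.sf(z_crit - d * n ** 0.5)

d, n = 0.5, 40                                    # a "medium" effect, 40 subjects
print(round(approx_power(d, n, alpha=0.05), 2))   # ~0.89
print(round(approx_power(d, n, alpha=0.001), 2))  # ~0.45: stricter alpha, many more misses

# Restoring the original power under the stricter threshold simply takes more data:
n = 40
while approx_power(d, n, alpha=0.001) < 0.89:
    n += 1
print(n)                                          # 82: roughly double the sample size
```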

The main point is that we can easily afford to be more conservative when bestowing statistical significance on putative effects, without stifling scientific progress. Sure, it would be harder to demonstrate evidence for really small effects, but not impossible if they are important enough to pursue. After all, the effect that betrayed the Higgs particle was very small indeed, but that didn't stop physicists from finding it. Valuable research could focus on validating trends of interest (i.e., strongly predicted results), rather than chasing down the next new positive effect and leaving behind a catalogue of potentially suspect "significant effects" in its wake. Science cannot progress as a house of cards.

Too expensive?

Probably not. Currently, we are almost certainly wasting research money chasing down the dead ends opened up by false positives. A smaller but more reliable corpus of results would very likely increase the productivity of many scientific fields. At present, the pressure to publish has precipitated a flood of peer-reviewed papers reporting any number of significant effects, many of which will not stand the test of time. It would seem a far more sensible use of resources to focus on producing fewer, but more dependable, scientific manuscripts. Interim findings and observations could be made readily available via any number of the suggested internet-based initiatives. These more numerous 'leads' could provide a valuable source of possible research directions, without yet being admitted to the venerable category of immutable (i.e., citable) scientific fact. Like conference proceedings, they could retain a more provisional status until they are robustly validated.

Raise the bar for outright fraud

Complete falsification is hard to detect in the absence of actual whistleblowers. In Simonsohn's words: "outright fraud is somewhat impossible to estimate, because if you're really good at it you wouldn't be detectable" (from Alok Jha). Even publishing the raw data is no guarantee of catching out the fraudster, as there are clever ways to generate plausible-looking data sets that would pass veracity testing.

However, fraudsters presumably start their life of crime in the grey area of routine misdemeanour: a bit of peeking here, some cherry picking there, before they ever actually make up data points. Moreover, they know that even if their massaged results fail to replicate, the benefit of the doubt should reasonably allow them to claim to be unwitting victims of an innocent false positive. After all, at p<0.05 there is already a 1-in-20 chance of a false positive, even if you do everything by the book!

Like rogue traders, scientific fraudsters presumably start with a small, spur-of-the-moment act that they reasonably believe they can get away with. If we increase the threshold that needs to be crossed, fewer unscrupulous researchers will be tempted down the dark and ruinous path of scientific fraud. And if they did, it would be much harder for them to claim innocence after their 5 sigma results fail to replicate.

Why impose any statistical threshold at all?

Finally, it is worth noting some arguments that the statistical threshold should be abolished altogether. Maybe we should be more interested in the continua of effect sizes and confidence intervals, rather than discrete hypothesis testing [e.g., see here]. I have a lot of sympathy for this argument. A more quantitative approach to inferential statistics would more accurately reflect the continuous nature of evidence and certainty, and would also more readily suit meta-analyses. However, it is also useful to have a standard against which we can hold up putative facts for the ultimate test: true or false.
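
To give a flavour of that alternative, here is a minimal sketch on simulated data (group sizes and effect magnitude are arbitrary): the effect size and confidence interval convey how big the effect is and how precisely it is estimated, whereas the p-value alone feeds only the binary verdict.

```python
# Reporting the estimated effect and its uncertainty, rather than only whether p
# crossed a line. The data are simulated; group sizes, means and SDs are arbitrary.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(0.3, 1.0, 30)   # group with a smallish true effect
b = rng.normal(0.0, 1.0, 30)   # control group

res = stats.ttest_ind(a, b)
diff = a.mean() - b.mean()
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
lo, hi = diff - 1.96 * se, diff + 1.96 * se              # approximate 95% CI
d = diff / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)  # Cohen's d

print(f"p = {res.pvalue:.3f}")  # the binary verdict hangs on this number alone
print(f"difference = {diff:.2f}, 95% CI [{lo:.2f}, {hi:.2f}], Cohen's d = {d:.2f}")
```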

7 comments:

  1. Excellent post, Mark.

    So the million dollar question is: what level of alpha would be appropriate in psychology / neuroscience / biology? I wonder what the most conservative threshold is that everyone could agree on.

    Perhaps we need an enterprising journal editor to try an RCT. Manipulate acceptable alpha levels for a random 50% of papers for 12 months. Then see which ones replicate better.

  2. Thanks Chris - absolutely, that is the million dollar question!

    And surely there can be no right answer - I can't imagine there will be an obvious sweet spot in the trade-off between type I and type II errors.

    I would say p<.0001 is starting to feel reasonably safe, given all the multifarious nefarious ways to inflate the advertised false discovery rate, while still being within a feasible range for most researchers and research questions.

    However, in the end it is just an issue of priority and resources. We can be as strict as we want, it will just require more data collection to be convincing - and that's a matter of time and money. But the same goes for validation by replication, or perhaps any other proposal to repair some of the recently damaged credibility in the empirical (esp. biological) sciences. There is unlikely to be a cheap and easy solution.

  3. I'm not sure if I agree that false positives are more problematic than false negatives in psychology / cog neuro.

    In my own work (faces) the big elephant in the room is that even if you get a relatively selective set of blobs for faces > houses and houses > faces with univariate methods (FFA, PPA, etc), anyone who has tried a multivariate searchlight of face vs house classification will have noticed that with this more flexible analysis most of ventral temporal cortex can be used to decode face/house category at similar statistical thresholds to the univariate map. So what gives? One interpretation might be that by setting a too _conservative_ threshold in the initial univariate studies (perhaps necessitated by the low SNR of early fMRI) we ended up with an overly circumscribed set of discrete 'regions' that may just turn out to be peaks in a big contiguous blob. You can see a similar story in the results of recent large-sample group fMRI studies (see Neuroskeptic: http://neuroskeptic.blogspot.co.uk/2012/03/brain-scanning-just-tip-of-iceberg.html). So one possibility is that neuroimaging studies have an excess of false negatives which originates in setting thresholds so that a nice tidy group of blobs emerges. You see considerable variation in reported thresholds, presumably because the investigators end up using whatever produces a 'real'-looking blob size. If the real blobs are smaller than we think (or non-existent) we will produce false positives in this way, and if the real blobs are larger than we think we will produce false negatives.

    Given that we have no real way of knowing how big the real blobs are, the statistical significance criterion is just an arbitrary judgment call. But since it is nearly impossible to publish a 'non-significant' result, setting a more conservative significance threshold will actually tend to distort published effect sizes more than lenient thresholds do. So if we are going to put our faith in replication, as Chris pointed out in that Guardian story the other day, there's a pretty good argument for sticking with our current lenient criteria and just reserving judgment until the meta-analysis comes in...

    I think the thing we're all still learning is to appropriately weight our confidence in any single study. A blob in a neuroimaging study at p<0.05 (corrected) is emphatically _not_ the Higgs boson, but there is still a worrying tendency both inside and outside academia to treat single studies as conclusive evidence regardless of the strength of the presented evidence. This is something we have inherited from psychology, where no distinction is made between statistical and practical significance.

    Replies
    1. Hi Johan,

      Thanks for your comments. I agree that false negatives are also a serious problem, but simply relaxing the statistical threshold is not necessarily the best solution. Surely it is preferable to increase statistical power to detect weak but potentially important effects. Similarly, the studies discussed by Neuroskeptic use very large sample sizes to show that weak activity patterns can pass a statistical threshold given enough statistical power. Informally, I find it very useful to consider the full activity patterns, not just the highly significant clusters. But this should only really be used as an informal guide - at the end of the day, even weak effects should be shown to be statistically reliable.

      Replication is a form of test/re-test reliability, and can also increase confidence. But as you point out, if the threshold is set too conservatively at the first (or second/group) level, and replications/meta-analyses are performed with reference only to thresholded data, then weak effects are unlikely to be detected, no matter how many studies you string together. In such cases, it only really makes sense to include unthresholded data in such third-level analyses. But this is then roughly equivalent to just collecting more data within a single study (i.e., boosting within-study statistical power). Moreover, while the definition of replication itself is still unclear (http://neurochambers.blogspot.co.uk/2012/03/you-cant-replicate-concept.html), increasing statistical power within a single study seems like a pretty sensible first step.

      Regarding your point about pattern analyses, I doubt that the fundamental statistical properties have changed [except that the first-level (i.e., within-subject) analysis is almost certainly non-parametric, which does not assume normality, etc.]. Presumably, the benefit comes from a more general test of condition-specific differences within a set of measures (voxels), irrespective of homogeneity within that feature set (see Niko's paper here: http://www.pnas.org/content/103/10/3863.abstract). So this is not really a statistical question, but a more general question about what we should be looking for in the fMRI signal.

      I agree with your final point that the current imaging literature is best evaluated by weighting the reported significance values, rather than (arbitrarily) thresholded clusters. But I am afraid that in practice, most people don't really think in terms of graded effects, but instead categorise effects as real or spurious. For quantitative approaches such as meta-analyses, the gradation can be used properly (assuming enough information is actually provided in each paper), but I can think of many cases where a particular phenomenon has been elevated into the ever-expanding pantheon of real and true effects on the basis of a single study. Similarly, a single published result is often considered sufficient in subsequent research to support new claims, thereby contributing to the house of cards.

      I think a recurring theme here is whether we can expect a single study to provide sufficient evidence for a particular effect, or whether the process should be much more distributed (i.e., replications and meta-analyses). High-powered single studies are clearly more controlled (exactly the same parameters, etc.), but could place too high a burden on single research groups, which may not have the resources to carry out large-scale data collection. Alternatively, pooling results over multiple studies, potentially conducted by multiple groups, reduces experimental control, but would be cheaper for any given group, and could also avoid "lab-specific" biases.

  4. Many people are focusing on false positives, and more generally Type I vs Type II error. We also need to think about true positives. This isn't trivial. Among those studies which declare statistical significance (at whatever alpha), some will be true positives and some will be false positives. Increasing the stringency of the alpha will reduce the rate of false positives, but won't do anything to improve the rate of true positives. It's an often under-appreciated fact that underpowered (i.e., smaller) studies will result in a higher proportion of false to true positives among these nominally significant findings, because the alpha is constant while the rate of true positives is a function of statistical power and the proportion of hypotheses tested which are in fact true. So increasing the power of our studies (e.g., making them bigger), and improving the likelihood that the hypotheses we test will be true (e.g., less exploratory research) will also improve the situation.

    Replies
    1. Thanks Marcus for highlighting this important issue. I agree that it would be counter-productive to lower the statistical threshold without increasing statistical power.

      Also, thanks again for pointing me toward the literature discussing more general arguments against significance testing - very helpful.

  5. Nice post Mark! For those interested in the historical perspective of how the 5% level came to be the general standard, there's a lovely paper from 1982 that examines Fisher's (1925) statements on the subject, but also traces the idea back further to the late 19th Century. PDF here: http://www.radford.edu/~jaspelme/611/Spring-2007/Cowles-n-Davis_Am-Psyc_orignis-of-05-level.pdf
