Saturday, 15 September 2012

Must we really accept a 1-in-20 false positive rate in science?

There has been some very interesting and extremely important discussion recently addressing a fundamental problem in science: can we believe what we read?

After a spate of high-profile cases of scientific misdemeanours and outright fraud (see Alok Jha's piece in the Guardian), people are rightly looking for solutions to restore credibility to the scientific process [e.g., see Chris Chambers and Petroc Sumner's Guardian response here].

These include more transparency (especially pre-registering experiments), encouraging replication, promoting the dissemination of null effects, shifting career rewards from new findings (neophilia) to genuine discoveries, abolishing the cult of impact factors, etc. All these are important ideas, and many are more or less feasible to implement, especially with the right top-down influence. However, it seems to me that one of the most basic problems is staring us right in the face, and would require absolutely no structural change to correct. The fix is as simple as re-drawing a line in the sand.

Critical p-value: line in the sand

Probability estimates are inherently continuous, yet we typically divide our observations into two classes: significant (i.e., true, real, bona fide, etc.) and non-significant (i.e., the rest). This reduces the mental burden of assessing experimental results - all we need to know is whether an effect is real, i.e., whether it passes a statistical threshold. And so there are conventions, the most widely used being p<.05. If the probability of obtaining our result by chance alone falls below 5%, we may assert that our conclusion is justified. Ideally, this threshold ensures that when there is truly no effect, we will wrongly declare one no more than once in twenty tests. But turn this around, and it also means that, at worst, the assertion could be wrong (i.e., a false positive) one time in twenty (about the same odds as being awarded a research grant in the current climate). That already seems a pretty generous allowance for false positive claims in science. But worse, this is only the ideal theoretical case. There are many dubious scientific practices that dramatically inflate the false positive rate, such as cherry picking and peeking during data collection (see here).
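
To make that inflation concrete, here is a minimal simulation sketch (the sample sizes and the check-every-ten-subjects rule are illustrative choices, not from any particular study). Both analyses run on pure noise, so every "significant" result is by construction a false positive: the planned fixed-sample test hovers around the nominal 5%, while stopping as soon as p dips below .05 pushes the rate far higher.

```python
# A toy demonstration that "peeking" inflates the nominal 5% false positive rate.
# All data are pure noise (mean zero), so every significant result is a false positive.
# The sample sizes and the check-every-10-subjects rule are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_max, batch = 5000, 100, 10

fixed_hits = peeking_hits = 0
for _ in range(n_experiments):
    data = rng.normal(0.0, 1.0, n_max)          # null data: no real effect

    # Planned analysis: one t-test at the pre-specified sample size.
    if stats.ttest_1samp(data, 0.0).pvalue < 0.05:
        fixed_hits += 1

    # Peeking: test after every batch of 10 and stop as soon as p < .05.
    for n in range(batch, n_max + 1, batch):
        if stats.ttest_1samp(data[:n], 0.0).pvalue < 0.05:
            peeking_hits += 1
            break

print(f"planned analysis: {fixed_hits / n_experiments:.3f}")    # close to the nominal 0.05
print(f"with peeking:     {peeking_hits / n_experiments:.3f}")  # well above 0.05
```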

These kinds of fishy goings-on are evident in statistical anomalies, such as the preponderance of just-significant effects reported in the literature (see here for a blog review of the empirical paper). Although it is difficult to estimate the true false positive rate out there, it can only be higher than the ideal one-in-twenty rate assumed by our statistical convention. So, even before worrying about outright fraud, it is actually quite likely that many of the results we read about in the peer-reviewed literature are in fact false positives.
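
Some back-of-the-envelope arithmetic shows how quickly this can happen. If only a fraction of the hypotheses we test are actually true, and our studies are modestly powered, then even a by-the-book p<.05 lets false positives pile up among the "significant" findings. A minimal sketch (the power and prior values below are purely illustrative assumptions):

```python
# Proportion of nominally significant results that reflect real effects, given
# alpha, the average statistical power, and the prior probability that a tested
# hypothesis is true. The power and prior values are illustrative assumptions.
def prop_true_positives(alpha, power, prior):
    """P(effect is real | p < alpha)."""
    true_pos = power * prior
    false_pos = alpha * (1 - prior)
    return true_pos / (true_pos + false_pos)

print(prop_true_positives(alpha=0.05, power=0.8, prior=0.5))   # ~0.94: the optimistic case
print(prop_true_positives(alpha=0.05, power=0.5, prior=0.2))   # ~0.71
print(prop_true_positives(alpha=0.05, power=0.3, prior=0.1))   # ~0.40: most "effects" are false
print(prop_true_positives(alpha=0.005, power=0.3, prior=0.1))  # ~0.87: a stricter alpha helps
```

The answer depends as much on statistical power and on the plausibility of the hypotheses being tested as on the threshold itself - a point picked up in the comments below.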

Boosting the buffer zone

The obvious solution is to tighten up the accepted statistical threshold. Take physics, for example. Those folk only accept a new particle into their textbooks if the evidence reaches a statistical threshold of 5 sigma (i.e., p<0.0000003). Although the search for the Higgs boson involved plenty of peeking along the way, at 5 sigma the resulting inflation of the false positive rate is hardly likely to matter. We can still believe the effect. A stricter threshold provides a more comfortable buffer between false positive and true effect. There are good and proper ways to correct for peeking, multiple comparisons, etc., but all of these assume full disclosure. It would clearly be safer just to adopt a conservative threshold. Perhaps not one quite as heroic as 5 sigma (after all, we aren't trying to find the God particle), but surely we can do better than a one-in-twenty false positive rate as our minimal, best-case standard.
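
For a sense of scale, 5 sigma is just the one-tailed tail probability of a standard normal distribution beyond five standard deviations, and our familiar p<.05 sits at well under two sigma; a quick sketch of the conversion:

```python
# Where the "5 sigma" figure comes from: the one-tailed tail probability of a
# standard normal distribution, which is how particle physics reports it.
from scipy import stats

print(stats.norm.sf(5))          # ~2.9e-07, i.e. roughly p < 0.0000003
print(stats.norm.isf(0.05))      # ~1.64: one-tailed p < .05 is crossed at ~1.6 sigma
print(stats.norm.isf(0.05 / 2))  # ~1.96: two-tailed p < .05 corresponds to ~2 sigma
```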

Too conservative?

Of course, tightening the statistical threshold would necessarily increase the number of failures to detect a true effect, so-called type II errors. However, it is probably fair to say that most fields in science are suffering more from false positives (type I errors) than from type II errors. False positives are more influential than false negatives, and harder to dispel. In fact, we are probably more likely to treat a null effect as a real effect cloaked in noise, especially if there is already a false positive lurking about somewhere in the published literature. It is notoriously difficult to convince your peers that your non-significant test indicates a true null effect. Increasingly, Bayesian methods are being developed to test for evidence in favour of the null (i.e., sameness between distributions), but this is another story.
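
To put rough numbers on that trade-off, here is a minimal sketch using a simple normal approximation (the effect size, the sample size, and the helper approx_power are illustrative assumptions, not taken from the post): tightening alpha at a fixed sample size does cost a lot of power, but that power can be bought back with more data.

```python
# A rough look at the type I / type II trade-off using a normal approximation to a
# two-sided one-sample test. The effect size d, sample size n, and the helper
# approx_power are illustrative assumptions, not taken from the post.
from scipy import stats

def approx_power(d, n, alpha):
    """Approximate P(reject H0 | true standardised effect d, n observations)."""
    z_crit = stats.norm.isf(alpha / 2)
    return stats.norm.sf(z_crit - d * n ** 0.5)

d, n = 0.5, 40                                    # a "medium" effect, 40 subjects
print(round(approx_power(d, n, alpha=0.05), 2))   # ~0.89
print(round(approx_power(d, n, alpha=0.001), 2))  # ~0.45: stricter alpha, many more misses

# Restoring the original power under the stricter threshold simply takes more data:
n = 40
while approx_power(d, n, alpha=0.001) < 0.89:
    n += 1
print(n)                                          # 82: roughly double the sample size
```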

The main point is that we can easily afford to be more conservative when bestowing statistical significance on putative effects, without stifling scientific progress. Sure, it would be harder to demonstrate evidence for really small effects, but not impossible if they are important enough to pursue. After all, the effect that betrayed the Higgs particle was very small indeed, but that didn't stop physicists from finding it. Valuable research could focus on validating trends of interest (i.e., strongly predicted results), rather than chasing down the next new positive effect and leaving behind a catalogue of potentially suspect "significant effects" in its wake. Science cannot progress as a house of cards.

Too expensive?

Probably not. Currently, we are almost certainly wasting research money chasing down the dead ends opened up by false positives. A smaller but more reliable corpus of results would very likely increase the productivity of many scientific fields. At present, the pressure to publish has precipitated a flood of peer-reviewed papers reporting any number of significant effects, many of which will not stand the test of time. It would seem a far more sensible use of resources to focus on producing fewer, but more dependable, scientific manuscripts. Interim findings and observations could be made readily available via any number of the suggested internet-based initiatives. These more numerous 'leads' could provide a valuable source of possible research directions, without yet being admitted to the venerable category of immutable (i.e., citable) scientific fact. Like conference proceedings, they could retain a more provisional status until they are robustly validated.

Raise the bar for outright fraud

Complete falsification is hard to detect in the absence of actual whistleblowers. In Simonsohn's words: "outright fraud is somewhat impossible to estimate, because if you're really good at it you wouldn't be detectable" (from Alok Jha). Even publishing the raw data is no guarantee of catching out the fraudster, as there are clever ways to generate plausible-looking data sets that would pass veracity testing.

However, fraudsters presumably start their life of crime in the grey area of routine misdemeanour: a bit of peeking here, some cherry picking there, before they ever actually make up data points. Moreover, they know that even if their massaged results fail to replicate, the benefit of the doubt should reasonably allow them to claim to be unwitting victims of an innocent false positive. After all, at p<0.05 there is already a 1-in-20 chance of a false positive, even if you do everything by the book!

Like rogue traders, scientific fraudsters presumably start with a small, spur-of-the-moment act that they reasonably believe they can get away with. If we increase the threshold that needs to be crossed, fewer unscrupulous researchers will be tempted down the dark and ruinous path of scientific fraud. And if they did, it would be much harder for them to claim innocence after their 5 sigma results fail to replicate.

Why impose any statistical threshold at all?

Finally, it is worth noting some arguments that the statistical threshold should be abolished altogether. Maybe we should be more interested in the continua of effect sizes and confidence intervals, rather than discrete hypothesis testing [e.g., see here]. I have a lot of sympathy for this argument. A more quantitative approach to inferential statistics would more accurately reflect the continuous nature of evidence and certainty, and would also more readily suit meta-analyses. However, it is also useful to have a standard against which we can hold up putative facts for the ultimate test: true or false.
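
To give a flavour of that alternative, here is a minimal sketch on simulated data (group sizes and effect magnitude are arbitrary): the effect size and confidence interval convey how big the effect is and how precisely it is estimated, whereas the p-value alone feeds only the binary verdict.

```python
# Reporting the estimated effect and its uncertainty, rather than only whether p
# crossed a line. The data are simulated; group sizes, means and SDs are arbitrary.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(0.3, 1.0, 30)   # group with a smallish true effect
b = rng.normal(0.0, 1.0, 30)   # control group

res = stats.ttest_ind(a, b)
diff = a.mean() - b.mean()
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
lo, hi = diff - 1.96 * se, diff + 1.96 * se              # approximate 95% CI
d = diff / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)  # Cohen's d

print(f"p = {res.pvalue:.3f}")  # the binary verdict hangs on this number alone
print(f"difference = {diff:.2f}, 95% CI [{lo:.2f}, {hi:.2f}], Cohen's d = {d:.2f}")
```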

7 comments:

  1. Excellent post, Mark.

    So the million dollar question is: what level of alpha would be appropriate in psychology / neuroscience / biology? I wonder what the most conservative threshold is that everyone could agree on.

    Perhaps we need an enterprising journal editor to try an RCT. Manipulate acceptable alpha levels for a random 50% of papers for 12 months. Then see which ones replicate better.

  2. Thanks Chris - absolutely, that is the million dollar question!

    And surely there can be no right answer - I can't imagine there will be an obvious sweet spot in the trade-off between type I and type II errors.

    I would say p<.0001 is starting to feel reasonably safe, given all the multifarious nefarious ways to inflate the advertised false discovery rate, while still being within a feasible range for most researchers and research questions.

    However, in the end it is just an issue of priority and resources. We can be as strict as we want, it will just require more data collection to be convincing - and that's a matter of time and money. But the same goes for validation by replication, or perhaps any other proposal to repair some of the recently damaged credibility in the empirical (esp. biological) sciences. There is unlikely to be a cheap and easy solution.

  3. I'm not sure if I agree that false positives are more problematic than false negatives in psychology / cog neuro.

    In my own work (faces) the big elephant in the room is that even if you get a relatively selective set of blobs for faces > houses and houses > faces with univariate methods (FFA, PPA, etc), anyone who has tried a multivariate searchlight of face vs house classification will have noticed that with this more flexible analysis most of ventral temporal cortex can be used to decode face/house category at similar statistical thresholds to the univariate map. So what gives? One interpretation might be that by setting a too _conservative_ threshold in the initial univariate studies (perhaps necessitated by the low SNR of early fMRI) we ended up with an overly circumscribed set of discrete 'regions' that may just turn out to be peaks in a big contiguous blob. You can see a similar story in the results of recent large-sample group fMRI studies (see Neuroskeptic: http://neuroskeptic.blogspot.co.uk/2012/03/brain-scanning-just-tip-of-iceberg.html). So one possibility is that neuroimaging studies have an excess of false negatives which originates in setting thresholds so that a nice tidy group of blobs emerges. You see considerable variation in reported thresholds, presumably because the investigators end up using whatever produces a 'real'-looking blob size. If the real blobs are smaller than we think (or non-existent) we will produce false positives in this way, and if the real blobs are larger than we think we will produce false negatives.

    Given that we have no real way of knowing how big the real blobs are, the statistical significance criterion is just an arbitrary judgment call. But since it is nearly impossible to publish a 'non-significant' result, setting a more conservative significance threshold will actually tend to distort published effect sizes more than lenient thresholds do. So if we are going to put our faith in replication, as Chris pointed out in that Guardian story the other day, there's a pretty good argument for sticking with our current lenient criteria and just reserving judgment until the meta-analysis comes in...

    I think the thing we're all still learning is to appropriately weight our confidence in any single study. A blob in a neuroimaging study at p<0.05 (corrected) is emphatically _not_ the Higgs boson, but there is still a worrying tendency both inside and outside academia to treat single studies as conclusive evidence regardless of the strength of the presented evidence. This is something we have inherited from psychology, where no distinction is made between statistical and practical significance.

    Replies
    1. Hi Johan,

      Thanks for your comments. I agree that false negatives are also a serious problem, but simply relaxing the statistical threshold is not necessarily the best solution. Surely it is preferable to increase statistical power to detect weak but potentially important effects. Similarly, the studies discussed by Neuroskeptic use very large sample sizes to show that weak activity patterns can pass a statistical threshold given enough statistical power. Informally, I find it very useful to consider the full activity patterns, not just the highly significant clusters. But this should only really be used as an informal guide - at the end of the day, even weak effects should be shown to be statistically reliable.

      Replication is a form of test/re-test reliability, and can also increase confidence. But as you point out, if the threshold is set too conservatively at the first (or second/group) level, and replications/meta-analyses are performed with reference only to thresholded data, then weak effects are unlikely to be detected, no matter how many studies you string together. In such cases, it only really makes sense to include unthresholded data in such third-level analyses. But this is then roughly equivalent to just collecting more data within a single study (i.e., boosting within-study statistical power). Moreover, while the definition of replication itself is still unclear (http://neurochambers.blogspot.co.uk/2012/03/you-cant-replicate-concept.html), increasing statistical power within a single study seems like a pretty sensible first step.

      Regarding your point about pattern analyses, I doubt that the fundamental statistical properties have changed [except that the first-level (i.e., within-subject) analysis is almost certainly non-parametric, which does not assume normality, etc.]. Presumably, the benefit comes from a more general test of condition-specific differences within a set of measures (voxels), irrespective of homogeneity within that feature set (see Niko's paper here: http://www.pnas.org/content/103/10/3863.abstract). So this is not really a statistical question, but a more general question about what we should be looking for in the fMRI signal.

      I agree with your final point that the current imaging literature is best evaluated by weighting the reported significance values, rather than (arbitrarily) thresholded clusters. But I am afraid that in practice, most people don't really think in terms of graded effects, but instead categorise effects as real or spurious. For quantitative approaches such as meta-analyses, the gradation can be used properly (assuming enough information is actually provided in each paper), but I can think of many cases where a particular phenomenon has been elevated into the ever-expanding pantheon of real and true effects on the basis of a single study. Similarly, a single published result is often considered sufficient in subsequent research to support new claims, thereby contributing to the house of cards.

      I think a recurring theme here is whether we can expect a single study to provide sufficient evidence for a particular effect, or whether the process should be much more distributed (i.e., replications and meta-analyses). High-powered single studies are clearly more controlled (exactly the same parameters, etc.), but could place too high a burden on single research groups, which may not have the resources to carry out large-scale data collection. Alternatively, pooling results over multiple studies, potentially conducted by multiple groups, reduces experimental control, but would be cheaper for any given group, and could also avoid "lab-specific" biases.

  4. Many people are focusing on false positives, and more generally Type I vs Type II error. We also need to think about true positives. This isn't trivial. Among those studies which declare statistical significance (at whatever alpha), some will be true positives and some will be false positives. Increasing the stringency of the alpha will reduce the rate of false positives, but won't do anything to improve the rate of true positives. It's an often under-appreciated fact that underpowered (i.e., smaller) studies will result in a higher proportion of false to true positives among these nominally significant findings, because the alpha is constant while the rate of true positives is a function of statistical power and the proportion of hypotheses tested which are in fact true. So increasing the power of our studies (e.g., making them bigger), and improving the likelihood that the hypotheses we test will be true (e.g., less exploratory research) will also improve the situation.

    Replies
    1. Thanks Marcus for highlighting this important issue. I agree that it would be counter-productive to lower the statistical threshold without increasing statistical power.

      Also, thanks again for pointing me toward the literature discussing more general arguments against significance testing - very helpful.

  5. Nice post Mark! For those interested in the historical perspective of how the 5% level came to be the general standard, there's a lovely paper from 1982 that examines Fisher's (1925) statements on the subject, but also traces the idea back further to the late 19th Century. PDF here: http://www.radford.edu/~jaspelme/611/Spring-2007/Cowles-n-Davis_Am-Psyc_orignis-of-05-level.pdf
