Tuesday 16 April 2013

Statistical power is truth power

This week, Nature Reviews Neuroscience published an important article by Kate Button and colleagues quantifying the extent to which experiments in neuroscience may be statistically underpowered. For a number of excellent and accessible summaries of the research, see here, here, here and this one in the Guardian from the lead author of the research.

The basic message is clear - collect more data! Data collection is expensive and time-consuming, but underpowered experiments are a waste of both time and money. Noisy data will decrease the likelihood of detecting important effects (false negative), which is obviously disappointing for all concerned. But noisy datasets are also more likely to be over-interpreted, as the disheartened experimenter attempts to find something interesting to report. With enough time and effort, trying lots of different analyses, something 'worth reporting' will inevitably emerge, even by chance (false positive). Put a thousand monkeys to a thousand typewriters, or leave an enthusiastic researcher alone long enough with a noisy data set, and eventually something that reads like a coherent story will emerge. If you are really lucky (and/or determined), it might even sound like a pretty good story, and end up published in a high-impact journal.
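To put a rough number on 'trying lots of different analyses', here is a minimal sketch of how quickly chance findings accumulate. It assumes every analysis is an independent test at the same threshold, which is a simplification: re-analyses of a single data set are usually correlated, so the real inflation will differ. The function name is just illustrative.

```python
def chance_of_false_positive(alpha: float, n_analyses: int) -> float:
    """Probability that at least one of n_analyses independent tests
    comes out 'significant' when there is no real effect anywhere."""
    # Assumption: the tests are independent and each uses threshold alpha.
    return 1.0 - (1.0 - alpha) ** n_analyses


if __name__ == "__main__":
    for k in (1, 5, 10, 20):
        rate = chance_of_false_positive(0.05, k)
        print(f"{k:2d} analyses: {rate:.3f}")
```

Under these assumptions, twenty different analyses at the usual 0.05 threshold give roughly a two-in-three chance of at least one spurious 'finding'.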

This is the classic Type 1 error, the bogeyman of undergraduate Statistics 101. But the problem of false positives is very real, and continues to plague empirical research, from biological oncology to social psychology. Failure to replicate published results is the diagnostic marker of a systematic failure to separate signal from noise.

There are many bad scientific practices that increase the likelihood of false positives entering the literature, such as peeking, parameter tweaking, and publication bias, and there are some excellent initiatives out there to clean up these common forms of bad research practice. For example, Cortex has introduced a Registered Report format that should bring some rigour back to hypothesis testing, Psychological Science is now hoping to encourage replications, and Nature Neuroscience has drawn up clearer guidelines to improve statistical practices.

These are all excellent initiatives, but I think we also need to consider simply increasing the safety margin. In a previous post, I argued that the accepted statistical threshold is far too lax. A 1-in-20 false positive rate already seems absurdly permissive, but if we factor in all the other things that invalidate basic statistical assumptions, then the true rate of false positives must be extremely high (perhaps 'Why Most Published Research Findings are False'). To increase the safety margin seems like an obvious first step to improving the reliability of published findings.

The downside, of course, to a more stringent threshold for separating signal from noise is that it demands a lot more data. Obviously, this will reduce the total number of experiments that can be conducted for the same amount of money. But as I recently argued in the Guardian, science on a shoestring budget can lead to more harm than good. If the research is important enough to fund, then it is even more important that it is funded properly. Spreading resources too thinly will only add noise and confusion to the process, leading further research down expensive and time-consuming blind alleys opened up by false positives.

So, the take home message is simple - collect more data! But how much more?

Matt Wall recently posted his thoughts on power analyses. These are standardised procedures for estimating the probability that you will be able to detect a significant effect, given a certain effect size and variance, for a given number of subjects. This approach is used widely for planning clinical studies, and is essentially the metric that Kate and colleagues use to demonstrate the systematic lack of statistical power in the neuroscience literature. But there's an obvious catch-22, as Matt points out. How are you supposed to know the effect size (and variance) if you haven't done the experiment? Indeed, isn't that exactly why you have proposed to conduct the experiment? To sample the distribution for an estimate of effect size (and variance)? Also, in a typical experiment, you might be interested in a number of possible effects, so which one do you base your power analysis on?
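For readers who have not run one, a power calculation can be sketched in a few lines. This is only an illustration, not the procedure used by Kate and colleagues: it assumes a two-group comparison with equal group sizes, a two-sided test, and a normal approximation in place of the exact t distribution. The effect size is a standardised mean difference (Cohen's d), and it is exactly the number you have to guess in advance.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-group comparison.

    Assumptions (for illustration only): equal group sizes, known unit
    variance, and a normal approximation rather than the exact t-test.
    """
    z = NormalDist()  # standard normal distribution
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)  # critical value of the test
    shift = effect_size * sqrt(n_per_group / 2.0)  # expected test statistic
    # Probability that the statistic lands in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


if __name__ == "__main__":
    for d in (0.2, 0.5, 0.8):
        print(f"d = {d}: power with 20 per group = {approx_power(d, 20):.2f}")
```

The catch-22 is visible in the function signature: the answer depends entirely on `effect_size`, which is the very thing the experiment is meant to estimate.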

I tend to think that power analysis is best suited to clinical studies, in which there is already a clear idea of the effect size you should be looking for (as it is bounded by practical concerns of clinical relevance). In contrast, basic science is often interested in whether there is an effect, in principle. Even if very small, it could be of major theoretical interest. In this case, there may be no lower bound effect size to impose, so without precognition, it seems difficult to see how to establish the necessary sample size. Power calculations would clearly benefit replication studies, but it is difficult to see how they could be applied for planning new experiments. Researchers can make a show of power calculations, by basing effect size estimations on some randomly selected previous study, but this is clearly a pointless exercise.

Instead, researchers often adopt rules of thumb, but I think the new rule of thumb should be: double your old rule of thumb! If you were previously content with 20 participants for fMRI, then perhaps you should recruit 40. If you have always relied on 100 cells, then perhaps you should collect data from 200 cells instead. Yes, these are essentially still just numbers, but there is nothing arbitrary about improving statistical power. And you can be absolutely sure that the extra time and effort (and cost) will pay dividends in the long run. You will spend less time analysing your data trying to find something interesting to report, and you will be less likely to send some other researcher down the miserable path of persistent failures to replicate your published false positive.
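As a rough illustration of what doubling buys you, the sketch below compares 20 and 40 participants in a within-subject design. Everything here is an assumption made for the example: a one-sample (paired) two-sided test, a normal approximation, and a hypothetical standardised effect size of 0.5, which no real study is guaranteed to have.

```python
from math import sqrt
from statistics import NormalDist


def within_subject_power(effect_size: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided one-sample (paired) test.

    Illustrative only: uses a normal approximation and treats the
    standardised effect size as known in advance.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    shift = effect_size * sqrt(n)  # expected test statistic with n subjects
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


if __name__ == "__main__":
    assumed_effect = 0.5  # hypothetical value, chosen only for the example
    for n in (20, 40):
        power = within_subject_power(assumed_effect, n)
        print(f"n = {n}: approximate power = {power:.2f}")
```

Under these assumptions, going from 20 to 40 participants lifts power from roughly 0.6 to almost 0.9. The exact figures depend entirely on the assumed effect size, but the direction does not.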


  1. Great post.
    But how about we ditch the idea of Type I and Type II errors altogether and go Bayesian instead? With Bayes, whatever the effect you're looking for, you can just keep adding samples (e.g. subjects) until your Bayes factor is either less than 0.33 or greater than 3. Keep adding water until an acceptable degree of certainty is reached one way or the other.

    It's also a nice way of estimating the probability of a theory being true given the data across multiple studies (by simply multiplying the Bayes factors)

    The thresholds for what constitutes 'strong' evidence for the null or the alternative hypothesis are still arbitrary (and B>3 is roughly aligned with alpha=.05, so perhaps should be stricter according to your argument) but it avoids problems with false positives and negatives.
    See Dienes 2011: http://www.lifesci.sussex.ac.uk/home/Zoltan_Dienes/Dienes%202011%20Bayes.pdf

  2. Mark,
    I like the overview of bad scientific practices that let false reports enter the literature, especially 'peeking'.

    I just want to add that power analysis isn't only for clinical studies. I've seen many a grant (usually, big grants) that do incorporate power estimations in their rationale, which are used for basic research goals.

    It's an important aspect of project planning, as I describe here.

    Thanks for coming by my site!


  3. There's one other factor that is especially problematic in fMRI. (It may be a problem in other fields, too, but I am only familiar with fMRI.) Many people don't acquire their data in anything like the optimal way that will give them the best chance of detecting an effect - whether it's there or not. Calling the failure a type II error is disingenuous to statistics. It's a crap experiment!!! (A type C error?) The power analyses proposed assume that the experimenter will do the same sort of measurement 5n times as n times. In my experience that is often not the case. Why? Because a lot of fMRI experiments are run by relatively inexperienced people and, like drivers, they tend to improve with time.

    Sadly, though, I have no idea how to tackle the type C problem. We don't have an easy way to measure the quality of data. We can use proxies, such as how much a subject may have moved during a run, or how much M1 activation results from a button push, but these are imperfect because there is so much they don't capture. An example: in a resting state scan the operator accidentally instructs the subject not to attend to the scanner noise. Pah, it's only one subject, where's the harm...? Presumably, it's now in the group variance.

    So power analyses are great - as far as they go. Pre-registration is great - as far as it goes. How, though, do we ensure that an fMRI experiment is conducted rigorously? Reviewing the methods, and pointing out the potential risks to the experimenter, may be a start and could eliminate some systematic problems. How do we then ensure that the data that's included passes some minimum quality standard, and what is that standard? If I'm doing chemistry then I can specify chemicals with particular purity, and predict the likely downstream effects. What are our data quality metrics for fMRI?