The basic message is clear - collect more data! Data collection is expensive and time-consuming, but underpowered experiments are a waste of both time and money. Noisy data will decrease the likelihood of detecting important effects (false negatives), which is obviously disappointing for all concerned. But noisy datasets are also more likely to be over-interpreted, as the disheartened experimenter attempts to find something interesting to report. With enough time and effort, trying lots of different analyses, something 'worth reporting' will inevitably emerge, even by chance (false positives). Put a thousand monkeys at a thousand typewriters, or leave an enthusiastic researcher alone long enough with a noisy dataset, and eventually something that reads like a coherent story will emerge. If you are really lucky (and/or determined), it might even sound like a pretty good story, and end up published in a high-impact journal.
This is the classic Type 1 error, the bogeyman of undergraduate Statistics 101. But the problem of false positives is very real, and continues to plague empirical research, from cancer biology to social psychology. Failure to replicate published results is the diagnostic marker of a systematic failure to separate signal from noise.
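The monkeys-at-typewriters problem is easy to quantify. If each analysis carries an independent 5% chance of a false positive, the probability that at least one of several analyses comes up 'significant' by chance alone climbs quickly. A minimal sketch (the 5% threshold and the independence of the tests are simplifying assumptions):

```python
# Probability of at least one false positive across k independent
# analyses, each run at the conventional alpha = 0.05 threshold.
def familywise_error(k, alpha=0.05):
    return 1 - (1 - alpha) ** k

for k in (1, 5, 10, 20):
    print(f"{k:2d} analyses -> {familywise_error(k):.0%} chance of a spurious 'finding'")
```

Run twenty different analyses on pure noise, and a spurious 'finding' becomes more likely than not.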
There are many bad scientific practices that increase the likelihood of false positives entering the literature, such as peeking, parameter tweaking, and publication bias, and there are some excellent initiatives out there to clean up these common forms of bad research practice. For example, Cortex has introduced a Registered Report format that should bring some rigour back to hypothesis testing, Psychological Science is now hoping to encourage replications, and Nature Neuroscience has drawn up clearer guidelines to improve statistical practices.
These are all excellent initiatives, but I think we also need to consider simply increasing the margin of error. In a previous post, I argued that the accepted statistical threshold is far too lax. A 1-in-20 false positive rate already seems absurdly permissive, but if we factor in all the other problems that invalidate basic statistical assumptions, then the true rate of false positives must be extremely high (see 'Why Most Published Research Findings Are False'). Increasing the safety margin seems like an obvious first step towards improving the reliability of published findings.
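To see why the true rate of false positives can dwarf the nominal 1-in-20, it helps to work through the arithmetic behind 'Why Most Published Research Findings Are False'. The numbers below (a 10% base rate of true hypotheses, 50% power) are purely illustrative assumptions, not estimates for any particular field:

```python
# Of all the 'significant' results, what fraction are false positives?
# Illustrative assumptions: 10% of tested hypotheses are true, 50% power.
alpha = 0.05   # nominal false positive rate
power = 0.50   # probability of detecting a true effect
prior = 0.10   # proportion of tested hypotheses that are actually true

false_pos = alpha * (1 - prior)   # null effects that cross the threshold
true_pos = power * prior          # real effects that cross the threshold
fdr = false_pos / (false_pos + true_pos)
print(f"False discovery rate among 'significant' results: {fdr:.0%}")
```

Under these (hypothetical) assumptions, nearly half of all 'discoveries' would be false - an order of magnitude worse than the nominal 5% suggests.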
The downside, of course, of a more stringent threshold for separating signal from noise is that it demands a lot more data. Obviously, this will reduce the total number of experiments that can be conducted for the same amount of money. But as I recently argued in the Guardian, science on a shoestring budget can lead to more harm than good. If the research is important enough to fund, then it is even more important that it is funded properly. Spreading resources too thinly will only add noise and confusion to the process, leading further research down the expensive and time-consuming blind alleys opened up by false positives.
So, the take-home message is simple - collect more data! But how much more?
Matt Wall recently posted his thoughts on power analyses. These are standardised procedures for estimating the probability that you will be able to detect a significant effect, given a certain effect size and variance, for a given number of subjects. This approach is widely used for planning clinical studies, and is essentially the metric that Kate and colleagues used to demonstrate the systematic lack of statistical power in the neuroscience literature. But there's an obvious catch-22, as Matt points out. How are you supposed to know the effect size (and variance) if you haven't done the experiment? Indeed, isn't that exactly why you have proposed to conduct the experiment - to sample the distribution for an estimate of the effect size (and variance)? Also, in a typical experiment you might be interested in a number of possible effects, so which one do you base your power analysis on?
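The catch-22 is easy to see if you write the calculation down. Here is a sketch using the standard normal approximation for a two-sample comparison (dedicated tools such as G*Power, or the t-distribution-based routines in statsmodels, do this more carefully, but the shape of the problem is the same):

```python
import math

def power_two_sample(d, n_per_group):
    """Approximate power of a two-sample test for standardised effect size d,
    using the normal approximation with a two-sided alpha of 0.05."""
    z_crit = 1.96  # two-sided critical value at alpha = 0.05
    z = d * math.sqrt(n_per_group / 2) - z_crit
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF

# The same n = 30 per group gives wildly different power
# depending on the effect size you assumed in advance:
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: power = {power_two_sample(d, 30):.0%}")
```

With 30 subjects per group, power ranges from roughly 10% for a small assumed effect to nearly 90% for a large one - and without knowing the effect size in advance, the calculation cannot tell you which of these worlds you are actually in.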
I tend to think that power analysis is best suited to clinical studies, in which there is already a clear idea of the effect size you should be looking for (as it is bounded by practical concerns of clinical relevance). In contrast, basic science is often interested in whether there is an effect at all, in principle. Even a very small effect could be of major theoretical interest. In that case, there may be no lower-bound effect size to impose, so without precognition it is difficult to see how to establish the necessary sample size. Power calculations would clearly benefit replication studies, but it is difficult to see how they could be applied to planning new experiments. Researchers can make a show of power calculations by basing effect size estimates on some arbitrarily selected previous study, but this is clearly a pointless exercise.
Instead, researchers often adopt rules of thumb, but I think the new rule of thumb should be: double your old rule of thumb! If you were previously content with 20 participants for fMRI, then perhaps you should recruit 40. If you have always relied on 100 cells, then perhaps you should collect data from 200 cells instead. Yes, these are still essentially just numbers, but there is nothing arbitrary about improving statistical power. And you can be absolutely sure that the extra time and effort (and cost) will pay dividends in the long run. You will spend less time analysing your data trying to find something interesting to report, and you will be less likely to send some other researcher down the miserable path of persistent failures to replicate your published false positive.
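The benefit of doubling is easy to check by simulation. The sketch below makes some illustrative assumptions - a one-sample design, a true effect of half a standard deviation, and a simple z-style test - and just counts how often each sample size detects the effect:

```python
import random
import statistics

def detection_rate(n, true_effect=0.5, n_sims=4000, seed=1):
    """Fraction of simulated experiments reaching p < .05 (two-sided),
    using a z approximation to the one-sample t-test."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # One simulated experiment: n observations, true mean = true_effect
        sample = [rng.gauss(true_effect, 1.0) for _ in range(n)]
        se = statistics.stdev(sample) / n ** 0.5
        if abs(statistics.mean(sample) / se) > 1.96:
            hits += 1
    return hits / n_sims

print(f"n = 20: power ~ {detection_rate(20):.0%}")
print(f"n = 40: power ~ {detection_rate(40):.0%}")
```

In this toy example, doubling the sample from 20 to 40 lifts the detection rate from roughly 60% to nearly 90% - the same experiment, with a far better chance of finding what is actually there.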