Is statistics science’s dirtiest secret?
Science News is a respectable magazine in the mould of New Scientist. So I was surprised to be pointed to a recent article by Tom Siegfried which comes close to blaming all the world’s woes on…we humble statisticians. I suggest that you read the article from beginning to end before returning to my comments.
It is easy to be defensive when statistics is polemically described as
a mutant form of math that has deflected science’s heart from the modes of calculation that had long served so faithfully. Science was seduced by statistics, the math rooted in the same principles that guarantee profits for Las Vegas casinos
and I will not disappoint! But ultimately a better response is to understand how others view and misinterpret statistics and sketch out an appropriate response.
The first paragraph seems to contain a naïve suggestion that we had some better system in the past. We didn’t. Prior to statistics there was no systematic way of dealing with inherently variable systems. Scientists could just about handle astronomical measurement error by averaging it out. But evaluating alternative medical treatments? Not a hope! In a world rife with unaccounted-for variance, science cannot operate without analysis of uncertainty.
My main criticisms of the position put by Siegfried are, first, that he offers no alternatives to current best practice; second, that the problems described (and they are mostly genuine problems) are a consequence of scientific practice, not statistics; and third, that statisticians are at the forefront of solving, or at least addressing, these problems.
Publication bias
The most prominent claim of the piece – Siegfried calls it science’s “dirtiest secret” – is that many, perhaps the majority, of statistically significant results that are published in the literature are false, by which he presumably means that the null hypothesis is true. Leaving aside how you would ever determine this, the source of the problem, and I think it is a genuine one, is publication bias. Non-significant P-values are not published, so we only get to see those that are smaller than 0.05. If there were no real effects left to discover, then these would be 100% false alarms. If real effects are rare, then we would not be surprised to find that most low P-values are false alarms.
Under the null hypothesis then, the distribution of a published P-value is presumably uniform on the interval [0,0.05] rather than [0,1]. If this is as big a problem as claimed then journals might start requiring P-values smaller than 0.05. But the best way to weed out false alarms is with replication. For the replicated experiment, the P-value is uniform on [0,1] since it will be automatically published (you would think) regardless of its value. If the original claim was at all important then it will be replicated. If it isn’t, it will sit there as an incorrect claim about something nobody cares about. Sounds like science at work to me.
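A small simulation makes the two uniform distributions concrete. This is only a sketch under stated assumptions: every study is taken to have a true null, the journal is assumed to publish exactly those P-values below 0.05, each published false alarm is assumed to get one independent replication, and the study count and seed are arbitrary.

```python
import random

ALPHA = 0.05  # the conventional threshold discussed in the post


def null_p_values(n_studies, rng):
    """P-values of studies whose null is true: uniform on [0, 1]."""
    return [rng.random() for _ in range(n_studies)]


rng = random.Random(2010)  # arbitrary seed, used only for reproducibility

# Assumption: 200,000 studies, none of which has a real effect.
all_p = null_p_values(200_000, rng)

# Publication filter: only "significant" results reach the literature.
published = [p for p in all_p if p < ALPHA]

# Assumption: each published false alarm is replicated once, independently,
# and the replication is reported whatever its P-value.
replications = null_p_values(len(published), rng)
replicated = [p for p in replications if p < ALPHA]

print(f"share of studies published:     {len(published) / len(all_p):.3f}")
print(f"mean published P-value:         {sum(published) / len(published):.4f}")
print(f"largest published P-value:      {max(published):.4f}")
print(f"share surviving a replication:  {len(replicated) / len(published):.3f}")
```

Every published P-value sits below 0.05, with a mean near the middle of [0,0.05], while the replications are spread over [0,1], so only about one false alarm in twenty survives a single replication.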
The article also criticizes meta-analysis, which is an attempt to get the benefits of replication. Of course, meta-analyses are only applied to published studies, which are all subject to publication bias. Statisticians have some tools for diagnosing publication bias – for instance the funnel plot – but a lack of genuine replication is the responsibility of the scientists.
Confused conditional: What does a P-value mean?
It is true that many people interpret a P-value as the probability of the null given the data. I actually think that in a lot of contexts this confusion doesn’t matter, i.e. in those contexts where you might have a prior probability of 1/2 for the null. But in science, where most nulls are probably true, it is seriously misleading. So I agree that, on top of the publication bias problem, the likely prevalence of true nulls makes this interpretation a problem. However, what alternative is Siegfried offering here?
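To see how much the prior matters, here is a rough Bayes' theorem calculation. It is a simplification, not an exact treatment: it conditions on the event "P-value below 0.05" rather than on the exact P-value, and the power of 0.8 and the three prior values are illustrative assumptions, not figures from Siegfried's article.

```python
def prob_null_given_significant(prior_null, power, alpha=0.05):
    """P(null is true | result significant at level alpha), via Bayes' theorem.

    prior_null: assumed prior probability that the null is true
    power:      assumed probability of significance when the effect is real
    """
    false_alarm = prior_null * alpha      # null true, yet P-value < alpha
    true_hit = (1 - prior_null) * power   # real effect, and it was detected
    return false_alarm / (false_alarm + true_hit)


# Illustrative priors: an even bet, and two "most nulls are true" settings.
for prior_null in (0.5, 0.9, 0.99):
    posterior = prob_null_given_significant(prior_null, power=0.8)
    print(f"prior P(null) = {prior_null:.2f}  ->  "
          f"P(null | significant) = {posterior:.3f}")
```

With a prior of 1/2 the answer is about 0.06, close enough to 0.05 that the confusion does little harm. With priors of 0.9 and 0.99 it rises to about 0.36 and 0.86 respectively.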
I can’t see a better solution to this than better education of scientists. The problem though is that many scientists are not that interested. They want a P-value less than 0.05 to get their publication accepted in order to progress their careers. In the real world of a scientist, a P-value does not actually mean the probability of the observed result or worse under the null. What it actually means is
the probability of this research being published has just increased from 0 to a much larger value!
Not good epistemology perhaps but the practice of science nevertheless.
Multiplicity
With many scientists out there looking for significant P-values, the harder they look the more they will find. This is really a variant of the publication bias problem. But it is probably better understood by the scientists who are involved in these kinds of data dredging for P-values.
When searching for genetic markers of disease, not even a scientist thinks that the P-value represents the chance the gene is a cause. Again, what is the alternative being offered here?
There are numerous adjustments to P-values to make them better represent significance. There is a relatively recent emphasis on false discovery rates. This is still a hot research area for statisticians, and it is we statisticians who are the most likely to solve the scientists’ problem. We are surely not the cause.
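As an illustration of such adjustments, the sketch below compares the Bonferroni correction with the Benjamini-Hochberg false discovery rate procedure. These two are chosen only as well-known examples; the ten P-values are made up for the purpose and come from no real study.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only P-values at or below alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure, controlling FDR at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose ordered P-value is at most (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject the hypotheses with the cutoff_rank smallest P-values.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical P-values from ten tests (invented for illustration).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
            0.060, 0.074, 0.205, 0.212, 0.216]

unadjusted = [p < 0.05 for p in p_values]
print("unadjusted rejections:", sum(unadjusted))
print("Benjamini-Hochberg:   ", sum(benjamini_hochberg(p_values)))
print("Bonferroni:           ", sum(bonferroni(p_values)))
```

On these made-up numbers, five tests are significant at the unadjusted 0.05 level, two survive Benjamini-Hochberg and one survives Bonferroni.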
Clinical trials
There are two points made by Siegfried. One is that randomization in clinical trials does not get rid of selection bias. This appears to be nonsense. A properly randomized clinical trial gives samples that are unbiased, under the sampling scheme, with respect to all unobserved confounders. And observed confounders can be deliberately balanced. What I think Siegfried is arguing is that the researcher may – by bad luck – end up with the two samples not being balanced with respect to some unobserved confounder.
But the P-value takes this into account. It is one of the great early achievements of statistics that by the device of randomization we can make specific and unequivocal statements about the likelihood of a given result. False alarms could possibly still happen from unlucky distributions of unobserved confounders. The P-value tells us how likely this is to happen.
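A randomization test shows how this works. If the treatment does nothing, each patient's outcome is fixed and only the allocation is random, so every way the treatment labels could have been dealt out, including every unlucky split of an unobserved confounder, is already part of the reference distribution. The sketch below enumerates all allocations exactly; the eight outcome values are invented, and the one-sided difference in means is just one possible choice of test statistic.

```python
from itertools import combinations


def randomization_p_value(treated, control):
    """Exact one-sided randomization P-value for a difference in means.

    Under the null of no treatment effect the outcomes are fixed, so we
    recompute the statistic for every possible allocation of the labels.
    """
    pooled = treated + control
    n_treated = len(treated)
    n_control = len(control)
    total = sum(pooled)
    observed = sum(treated) / n_treated - sum(control) / n_control

    at_least_as_extreme = 0
    n_allocations = 0
    for idx in combinations(range(len(pooled)), n_treated):
        t_sum = sum(pooled[i] for i in idx)
        diff = t_sum / n_treated - (total - t_sum) / n_control
        n_allocations += 1
        # Small tolerance guards against floating-point ties.
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_allocations


# Hypothetical outcomes for a tiny trial (made up for illustration).
treated = [12, 14, 15, 17]
control = [8, 9, 10, 11]

p = randomization_p_value(treated, control)
print(f"randomization P-value: {p:.4f}")
```

Here every treated outcome exceeds every control outcome, so the observed allocation is the single most extreme of the 70 possible ones and the P-value is 1/70, about 0.014.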
I’ll say it again: What is the alternative being offered here?
The second point relates to clinical trial results only giving treatments that are appropriate to the average patient. He makes the point that
…trial results are reported as averages that may obscure individual differences, masking beneficial or harmful effects and possibly leading to approval of drugs that are deadly for some and denial of effective treatment to others.
Well, any statistical analysis I know of will include all known patient specific information, not only to adjust for these factors and obtain reduced variability in the assessment of the treatment, but also to identify interactions of the treatment with these factors. While it may be ideal to have treatments individually tailored to each and every patient, in a world where we cannot conduct experiments on our own personal clones we will have to make do with less individualized medicine.
Siegfried quotes scientists as saying that
…reporting a single number gives the misleading impression that the treatment-effect is a property of the drug rather than of the interaction between the drug and the complex risk-benefit profile of a particular group of patients.
This is possibly a fair call, though I have just noted that many analyses will specifically include terms which measure this and are actually called interactions. But I think this touches on a deep but irrational psychological reaction that people have to medicine. If I tell someone that a particular drug I give them has a 1% chance of killing them and 99% chance of curing them, they will probably take it. If I tell them that 1 in 100 people carry a gene which interacts with the drug and they will definitely be killed by the drug, while everyone else is definitely cured by the drug, most people somehow feel different. The thought of that genetic marker invisibly tattooed on their forehead as the doctor sticks the syringe into their vein just freaks them out.
Of course, a one-sided rant against conventional statistics would not be complete without rolling out some false claims about the wonders of Bayesian statistics. Frequentist statisticians are, according to Siegfried, unable to take into account disease prevalence in the way Bayesian math can. Sigh! However, he then goes on to complain that Bayesians have introduced
confusion into the actual meaning of the mathematical concept of “probability” in the real world.
Now why is probability in quotes? Perhaps even he realises that in his “real world” probability is not such a well-defined concept as in smoky casinos. Bayes is a red herring in the entire discussion. Bayesian methods do not, of themselves, overcome publication bias, or multiplicity, or treatment interactions. They have really become just an additional set of tools and models in the statistical armoury.
July 21st, 2010 at 11:23 am
I found it very funny to read this on this blog today, since that same article was discussed a while ago by Gelman and others, and with a positive view of the article!
July 21st, 2010 at 2:34 pm
This article combines elements of a beat-up with some well-made points. First, the beat-up.
“Science fails to face the shortcomings of statistics”
Science maybe fails to face the shortcomings of statistics as used in much of the scientific literature. This is surely a failing of science, as indeed the article makes clear.
1) It relies heavily on careful critiques, authored by statisticians whose concern has been to draw attention to failings in study design and/or statistical methodology, to make its points!
2) The author’s own arguments rely heavily on probability and statistics!
3) The author is vague about possible answers. But there is a strong suggestion that much greater subtlety is required in the way that statistical arguments are used in the scientific literature than is often evident. The author warns, rightly, against reliance on a single statistical summary measure.
4) The author does think that Rev Thomas Bayes might provide counselling. Sure, Bayes might have some useful comment, but by no means does he have a cure. Already, Bayesian methods are widely used, and sometimes abused, not least in the meta-analyses that the author had, a few paragraphs earlier, in his sights.
5) The major focus of criticism seems to be p-values, not statistics per se. So why was the subtitle not “Science fails to face the shortcomings of p-values”? I’d not quibble with that, but that is not what the subtitle says. Would such a subtitle not have been sufficiently attention-getting?
The article is not altogether fair to Fisher. Here is a quote from Fisher:
“If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent. point), or one in a hundred (the 1 per cent. point). Personally, the writer prefers to set a low standard of significance at the 5 per cent. point, and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.”
— Ronald Aylmer Fisher ‘The Arrangement of Field Experiments’, The Journal of the Ministry of Agriculture, 1926, 33, 504.
That is very different from the sort of reliance on p-values that Mr Siegfried criticises.
The criticism of the reporting of gene-disease association studies is on target. But the primary fault lies with trial design. Someone has remarked that, in any study conducted in San Francisco, any genetic variant that is largely unique to the Chinese population will have an association with the use of chopsticks. Studies that are not robust against such local and (for establishing causation) meaningless associations are just naive.
Here are steps that could and should be taken to improve the scientific process:
1) All trials, in cases where publication bias may be an issue, should be registered before the event.
2) All data, and code used to analyse the data, should be published.
3) Whenever there are large issues of statistical interpretation, editors should ensure that articles get careful professional statistical scrutiny. There should be a regular, perhaps annual, audit of the statistical content of papers.
4) Every journal should carry regular warnings that p-values are not to be interpreted as probabilities that the null hypothesis is false. This should be accompanied by a note giving, for a couple of different priors, the probability that the null hypothesis is false given p=0.05.
5) In all elementary statistics courses, students should be exposed to the probability calculations that are noted in 4.
6) A couple of weeks of every elementary course should be spent on showing the scope for misuse of elementary methods and misinterpretation of results when data are not independent. This need not get much beyond pointing out the importance, in a clinical trial, of using multiple centres and checking for consistency across centres.
Items 1-4 are issues for the scientific community and for the scientific process. I am not the first statistician to argue for such reforms. Why, in these matters, is progress so slow? Maybe because fallible humans must cooperate to make the necessary changes to the scientific process.
Whatever is done (and there is a great deal that can be done), the scientific process will remain fraught. There is no magic bullet that can give certainty where certainty is impossible. Some may hope for a magic bullet that can put the scientific process at arm’s length from the frailties of the imperfect humans who manage it. That is also a pipe dream.
July 21st, 2010 at 5:30 pm
I just want to say I found it quite funny that Chris Lloyd thinks the New Scientist, the National Enquirer of science magazines, is a respectable science magazine. After all, this is where a world-shattering amazing physics and biology discovery can be found every week, the theory to end all theories and tell us all, etc. etc.
As for the criticisms he levels at the article, most of them are spot-on and Science News, like New Scientist, cannot be taken to be serious in any real sense of the word. Do they have power to influence debate? Unfortunately yes.
July 22nd, 2010 at 12:06 am
Curiously, publication bias - probably the most egregious problem in medical statistics (at least) - is not explicitly mentioned in the article. I’ve reviewed Ioannidis’ article and I agree with his overall conclusion - most published findings are probably wrong.
Replication is the key. Setting higher thresholds (p<0.01, p<0.001) will just encourage more abuse and make legitimate research far more expensive. A naive experimenter will just keep doing experiments until she/he reaches p<0.05.
“What I think Siegfried is arguing is that the researcher may – by bad luck – end up with the two samples not being balanced with respect to some unobserved confounder.” - I agree.
“But the P-value takes this into account…the device of randomization we can make specific and unequivocal statements about the likelihood of a given result. False alarms could possibly still happen from unlucky distributions of unobserved confounders. The P-value tells us how likely this is to happen.” - I’m not yet convinced of this. While I’m sure randomisation is the answer (what was the question?), p-values are the probability of the observed outcome under the null hypothesis. The probability of an unobserved confounder would need a Bayesian approach. Admittedly the expected mean will be zero, but the variance may increase.
I agree with John Maindonald, that whilst Bayesian statistics may tend to increase confusion, the implications of Bayes’ Theorem should be much more widely taught and understood by scientists and not just statisticians.
July 22nd, 2010 at 3:27 pm
You are a bit harsh on New Scientist. They sex it up a little I admit, but science probably needs a bit of image therapy
July 26th, 2010 at 3:52 pm
I write essentially simply to raise the issue of when this notion of publication bias was first observed and published or otherwise recorded. The rest of this note pertains to it first being drawn to my attention earlier than mid-2004, although I acknowledge that it might have been noted (and published?) well before then.
Chris Wallace first drew this notion of publication bias to my attention well before his death in August 2004, as I note in sec. 0.2.5, pp538-539 of my
D. L. Dowe (2008a), “Foreword re C. S. Wallace”, Computer Journal (Oxford Univ Press), Vol. 51, No. 5 (Sept. 2008) [Christopher Stewart WALLACE (1933-2004) memorial special issue], pp523-560
and again on p446 of my
D. L. Dowe (2008b), “Minimum Message Length and statistically consistent invariant (objective?) Bayesian probabilistic inference - from (medical) “evidence””, Social Epistemology, Vol. 22, No. 4, pp433-460.
Largely changing topic but perhaps at least still vaguely relatedly, I strongly sympathise with those who wish to bring a Bayesian flavour to hypothesis testing.