In other posts I have had a bit of a go at Bayesian zealotry. This post is about frequentist zealotry. I happen to believe that any method which has bad frequentist properties is straight out wrong. If you don’t care about how your method works in general, or in the long run, you should keep away from science and data. That is not to say that there are no major problems with frequentist inference. Nor does it mean that a method with good frequentist properties is necessarily a good method. In this short post I want to argue the following proposition:
If two methods have almost identical frequentist properties, that does not mean they are equally good. In fact, one can be just plain wrong.
My example is from a paper I had rejected about a year ago, and the basis upon which it was rejected. (This is not sour grapes, because the paper has now been accepted – and in a better journal than the one that rejected it!) I had devised a new method of testing for association in matched pairs. The method gives a P-value which is guaranteed to respect any nominal size chosen – a so-called exact test. There was a competing test which was also exact. The power properties of the two tests were almost identical – mine is a tiny bit better. But mine theoretically takes more computation (though for sample sizes up to 100 I can compute my P-value in less than 1 second).
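To make the notion of exactness concrete, here is a minimal sketch. I am assuming, for illustration only, that the competing test is an exact McNemar-type test which conditions on the discordant pairs and doubles the smaller binomial tail; my own method (EM) is in the paper and is not reproduced here. The defining property of an exact test is that P(p ≤ α) ≤ α under the null for every α, however discrete the sample space.

```python
from scipy.stats import binom

def exact_mcnemar_p(b, c):
    """Exact P-value for association in matched binary pairs.

    Illustrative assumption: b and c count the two kinds of discordant
    pairs, and under the null their split is Binomial(b + c, 1/2).
    Doubling the smaller tail (capped at 1) gives a P-value that
    respects any nominal size: P(p <= alpha) <= alpha for every alpha.
    """
    n_d = b + c
    p = 2 * binom.cdf(min(b, c), n_d, 0.5)
    return min(p, 1.0)

# Example: 15 discordant pairs split 12 vs 3
print(exact_mcnemar_p(12, 3))  # 0.0352
```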
Below is a plot of the values of my P-value (called EM) and the existing P-value (called M) for all possible data sets when n = 26. I have just focused on the interesting part of the sample space, where the P-values are smaller than 0.05. You can see that in each and every case my P-value is smaller, and since it is exact this means its power cannot be worse. But you can also see that for most data sets there is almost no difference. Indeed, it took me a great deal of searching to find a published data set where my method gave a practically different answer from the standard method.
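For the curious, the M side of that comparison can be regenerated by brute force. The sketch below enumerates the sample space at n = 26, still under my assumption above that M is the exact McNemar-type test (whose P-value depends on a data set only through its discordant split), and picks out the region below 0.05; the EM values are from the paper and are not reproduced here.

```python
from scipy.stats import binom

# Enumerate all data sets of n = 26 matched pairs. Under the assumed
# McNemar-type test the P-value depends only on the discordant split
# (b, c) with b + c <= n, so those splits index the sample space.
n = 26
interesting = []
for n_d in range(n + 1):          # total number of discordant pairs
    for b in range(n_d + 1):      # pairs discordant in one direction
        p = min(1.0, 2 * binom.cdf(min(b, n_d - b), n_d, 0.5))
        if p < 0.05:              # the "interesting" part of the space
            interesting.append((b, n_d - b, p))

print(f"{len(interesting)} splits with M < 0.05 at n = {n}")
```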
The referee, who no doubt considers him/herself a “practical statistician”, took the view that since my method is practically identical to the old method 99% of the time, and since mine takes more computation, it is not worth the trouble. If you are an extreme frequentist then this is correct: all you care about are the average properties, and here they are very close. Bayesians, on the other hand, take a quite different view. They are concerned with making the correct inference for this particular data set and no other. I happen to agree that this is important, in addition to frequentist performance.
So in my case I would say that in the 1% of cases where my method is better, my method is correct and the competing method is incorrect, because mine gives the most efficient measure of the evidence against the null for this particular data set. If computation really became impossible then that would be a practical issue of implementation, but my method would still be correct and the competing method incorrect.
Do you agree with me? To focus your thoughts, you might like to ponder the following question.
If you had a statistics package which was known to produce a random and completely incorrect P-value 1% of the time, and you had the option of purchasing an expensive add-on that would remove the bug but slow down computations by a factor of 2, would you buy the add-on? Would you publish the paper that gave rise to the corrective add-on? Would you agree with me that the original algorithm, by being wrong some of the time, can be given the general label of just being plain wrong?