The theory of probabilities is at bottom nothing but common sense reduced to calculus —  Laplace.

Are You an Extreme Frequentist?

In other posts I have had a bit of a go at Bayesian zealotry. This post is about frequentist zealotry. I happen to believe that any method which has bad frequentist properties is straight out wrong. If you don’t care about how your method works in general or in the long run you should keep away from science and data. That is not to say that there are not major problems with frequentist inference. Nor does it mean that good frequentist properties mean a method is good. In this short post I want to argue the following proposition:

If two methods have almost identical frequentist properties that does not mean they are equally good. In fact, one can just be plain wrong.

My example is a paper of mine that was rejected about a year ago, and the grounds on which it was rejected. (This is not sour grapes because the paper is now accepted - and in a better journal than the one that rejected it!) I had devised a new method of testing association of matched pairs. The method gives a P-value which is guaranteed to respect any nominal size chosen, a so-called exact test. There was a competing test which was also exact. The power properties of the two tests were almost identical, with mine a tiny bit better. But mine takes more computation theoretically (though for sample sizes up to 100 I can compute my P-value in less than 1 second).
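To make "exact" concrete, here is a minimal sketch of how the size guarantee can be checked by brute-force enumeration. It is not either of the two tests compared in this post: as a stand-in it uses a simple two-sided binomial test on the discordant pairs, and the function names and the choice of doubling the smaller tail are assumptions made purely for illustration.

```python
from math import comb


def binom_pmf(k, m, p=0.5):
    """Probability of k successes in m trials with success probability p."""
    return comb(m, k) * p**k * (1 - p) ** (m - k)


def exact_two_sided_p(b, m):
    """Two-sided P-value for observing b of m discordant pairs in one
    direction, under the null that either direction is equally likely.
    (Illustrative stand-in: doubles the smaller tail, capped at 1.)"""
    smaller_tail = sum(binom_pmf(k, m) for k in range(min(b, m - b) + 1))
    return min(1.0, 2.0 * smaller_tail)


def actual_size(m, alpha):
    """Null probability of rejecting at level alpha, found by enumerating
    every possible outcome b = 0, ..., m."""
    return sum(
        binom_pmf(b, m)
        for b in range(m + 1)
        if exact_two_sided_p(b, m) <= alpha
    )


if __name__ == "__main__":
    for m in (5, 10, 26):
        print(m, actual_size(m, 0.05))
```

For every number of discordant pairs the enumerated size stays at or below the nominal 0.05, which is all that "exact" promises. It is usually strictly below, and that slack is where two exact tests can differ in power.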

Below is a plot of the values of my P-value (called EM) and the existing P-value (called M) for all possible data sets when n=26. I have just focused on the interesting part of the sample space where the P-values are smaller than 0.05. You can see that in each and every case my P-value is smaller and since it is exact this means power cannot be worse. But you can also see that for most data sets there is almost no difference. Indeed, it took me a great deal of search effort to find a published data set where my method gave a practically different answer to the standard method.

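[Figure emvm1.bmp: EM P-value against M P-value for all possible data sets with n=26, restricted to P-values below 0.05]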
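The step from "smaller in every case" to "power cannot be worse" can be spelled out in one line. The notation is introduced here only for illustration and is not taken from the paper: $p_{EM}(x)$ and $p_M(x)$ are the two P-values for data set $x$, $\alpha \le 0.05$ is the nominal level, and $\theta$ indexes the possible true states.

```latex
p_{EM}(x) \le p_M(x) \ \text{whenever } p_M(x) \le \alpha
\;\Longrightarrow\;
\{x : p_M(x) \le \alpha\} \subseteq \{x : p_{EM}(x) \le \alpha\}
\;\Longrightarrow\;
P_\theta\bigl(p_{EM}(X) \le \alpha\bigr) \ge P_\theta\bigl(p_M(X) \le \alpha\bigr)
\ \text{for every } \theta .
```

So EM rejects whenever M does, under every alternative. Exactness of EM is what keeps this from being a cheat: the larger rejection region still has null probability at most $\alpha$.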

The referee, who no doubt considers him/herself a “practical statistician”, took the view that since my method is practically identical to the old method 99% of the time and since mine takes more computation, it is not worth the trouble. If you are an extreme frequentist then this is correct. All you care about are the average properties and here they are very close. Bayesians on the other hand take a quite different view. They are concerned with making the correct inference for this particular data set and no other. I happen to agree that this is important, in addition to frequentist performance.

So in my case I would say that in the 1% of cases where my method is better, my method is correct and the competing method is incorrect, because my method is giving the most efficient measure of the evidence against the null for this data set. If computation really becomes impossible then this would be a practical issue of implementation but my method would still be correct and the competing method incorrect.

Do you agree with me? To focus your thoughts, you might like to ponder the following question.

If you had a statistics package which was known to produce a random and completely incorrect P-value 1% of the time and you had the option of purchasing an expensive add-on that would remove the bug and slow down computations by a factor of 2, would you buy the add-on? Would you publish the paper that gave rise to the corrective add-on? Would you agree with me that the original algorithm, by being wrong sometimes, can be given the general label of just being plain wrong?



4 Responses to “Are You an Extreme Frequentist?”

  1. Frank Tuyl Says:

    An excellent source for Bayesian zealotry:) and examples of the potential inadequacy of frequentist properties is “Confidence intervals vs Bayesian intervals” by E.T. Jaynes (1976), which can be downloaded from http://bayes.wustl.edu/etj/node1.html

  2. Max Moldovan Says:

    There is a simple common sense argument for selecting the best method. Most clinical trials (where the M and EM tests mentioned here are commonly applied) are very expensive. So, as a “practical statistician”, I would like to maximize the chances of detecting the difference. Even without saying anything about correctness of M, it is an ordinary risk-reward situation. I would simply tend to select the method (EM, in this case) that gives the best chances to confirm the efficiency (subject to coverage error, of course).

    One more argument against dismissing EM because ‘it is not worth the trouble’ is the ability to test against non-standard null values, e.g. noninferiority tests. In this case 1% can easily increase to a more substantial figure.

    My question is: what if I am NOT interested in detecting the difference, e.g. when testing whether a new treatment has the same toxicity as a standard competitor? This is partly a question of ethics, but shouldn’t I be more prone to avoid ‘the trouble’ and go for sub-efficient M?

  3. David Jones Says:

    I completely agree: a statistical method that is randomly and substantially incorrect 1% of the time is pretty useless.

    Perhaps mitigating factors in this case lie in the questions:
    1) Are the incorrect results produced by the old method clearly incorrect? (That is to say, is there a very real risk that the incorrect results will be unknowingly used, or are they so catastrophically wrong that only an MBA student wouldn’t notice?)
    2) Is there a pattern to the datasets that produce incorrect results using the old method and can it be used predictively? If there is no way to practically predict with which datasets the old method will fail, then it is essentially failing randomly - as you suggest.

    I suppose that if the answer to either of the above questions is ‘yes’ then it could be argued that your new method should be applied only selectively.

  4. Chris Lloyd Says:

    Thanks for these comments David.

    Regarding (2), there is no way that I know of to predict which data sets will give different results for the two methods. If there were, then it would save a great deal of computation since one could compute the quick old method in the majority of cases where the methods are practically identical. In the sense that I cannot predict which data sets will lead to different results, the old method does fail randomly in a practical sense, though not in the usual sense since the behaviour of both statistical procedures is completely enumerable.

    Regarding (1), I am not clear in my own mind what constitutes an incorrect inference for a particular data set. So, I am taking the view that if my P-value is smaller than the standard P-value and is also sensitive to departures from the null (as revealed by power calculations) then my method is “correct” in that it is detecting departure from the null more clearly. More broadly, I am calling any method that is inefficient, incorrect. I believe that the notion that inference based on inefficient use of data is wrong is an important principle of inference.
