Frequentists and prior information
Here is another post about wrong-headed justifications of Bayesian over frequentist statistics. As suggested by David Dowe in his comment on my previous post, it is worth pointing out at the beginning rather than the end that nowhere below will you find an argument against Bayesian statistics per se (though I think there are some).
In the previous post I mentioned that there are two claims that (some) Bayesians make about their approach that get me annoyed. The first is that Bayesian thinking is natural and people will naturally apply probability to unknowns if not brain-washed by a frequentist education. The second is that only Bayesians, and not frequentists, can make use of prior information. Wrong.
I claim that frequentists can include prior information in a very similar manner to Bayesians. It might clarify things to consider a single-parameter problem, so as not to get into what is, to my mind, the main difference between the two paradigms: how Bayesians can integrate out nuisance parameters.
Imagine I observe x=50 successes from n=100 trials. Frequentists need to specify a data generating mechanism (DGM) and a log-likelihood function (which follows from the DGM) to complete an inference. My log-likelihood can be written down as
x log(p) + (n - x) log(1 - p)
The sampling distribution of x is binomial. How is prior information about p to be included?
In the best case scenario, I go to the authors of the published study about p and obtain their log-likelihood function, perhaps from the raw data and model if necessary. I then add their log-likelihood to my own. I know the distribution of my data. I know the distribution of their data. So I know the distribution of the multiplied likelihoods (in principle anyway). No problem at all. Prior incorporated.
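A minimal sketch of this best case in Python, assuming the earlier study was also binomial. The counts 38 out of 95 for that study are an assumption made for illustration (they match the figures used further down), and the function names are hypothetical. The two log-likelihoods are simply added and the sum is maximised over a grid.

```python
import math

def binom_loglik(p, x, n):
    """Binomial log-likelihood in p, dropping the constant log C(n, x)."""
    return x * math.log(p) + (n - x) * math.log(1 - p)

def joint_loglik(p):
    # My data: 50 of 100. Their data: 38 of 95 (assumed for illustration).
    return binom_loglik(p, 50, 100) + binom_loglik(p, 38, 95)

# Maximise over a grid from 0.001 to 0.999 in steps of 0.0001.
grid = [0.001 + i * 0.0001 for i in range(9981)]
p_joint = max(grid, key=joint_loglik)

print(p_joint, 88 / 195)
```

With two binomial studies the summed log-likelihood is just the log-likelihood of the pooled counts, so the grid maximiser lands next to 88/195.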
Say what you like about frequentist inference being good or bad – but I can clearly include the previous knowledge. Indeed, whenever we have a sample that can be divided into two parts you can consider the full likelihood as being generated from the first chunk of data updated by the second chunk given the first. So frequentists do include “prior” information every time they analyse a time series.
OK. So what if you don’t have access to the previous study, but instead just have an estimate of p and a standard error, for instance phat=0.4 with standard error 0.05? If the estimate can be assumed to have come from a binomial experiment then we could solve phat=x/n and standard error=sqrt(phat(1-phat)/n) for x and n and conclude that x=38.4 and n=96. So we have a slight problem right away with a fractional x. Maybe our estimate of 0.4 was rounded. We might make it x=38 out of n=95, which errs slightly on the side of conservatism since it gives a slightly higher standard error. From here on, our likelihood and frequentist inference become those that follow from x=50+38=88 successes from n=100+95=195 trials.
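The arithmetic can be checked with a short Python sketch. The function name `implied_binomial` is hypothetical; it just inverts the binomial standard error formula.

```python
import math

def implied_binomial(p_hat, se):
    """Solve se^2 = p_hat (1 - p_hat) / n for n, then x = n * p_hat."""
    n = p_hat * (1 - p_hat) / se ** 2
    return n * p_hat, n

x_prior, n_prior = implied_binomial(0.4, 0.05)   # about 38.4 and 96

# Rounded version: 38 out of 95 keeps the estimate at 0.4
# and gives a standard error slightly above 0.05.
se_rounded = math.sqrt(0.4 * 0.6 / 95)

print(x_prior, n_prior, se_rounded)
```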
So now to the more realistic case – that we just have the estimate and standard error, perhaps not even from a single study but from a Cochrane meta-analysis. So we actually have imperfect prior information, from a frequentist point of view. The estimate and standard error tell us about the location and curvature of the likelihood that led to the estimate. We might thus approximate this likelihood by a normal likelihood term
-200 (p - 0.4)^2
and then just add this to our own log-likelihood. You will have great difficulty in distinguishing a normal from a binomial prior log-likelihood. The ML estimate we obtain that incorporates prior information will no longer be exactly 88/195, but it differs from this by only a little: theoretically a second order term. The standard error from the joint likelihood also differs from the inverse-variance weighted standard error by a second order term.
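To see how small the differences are, here is a rough Python sketch. It adds the normal term, written as -(p - 0.4)^2 / (2 * 0.05^2), to the binomial log-likelihood for 50 out of 100 and solves the score equation by bisection. The function names are hypothetical, and the standard error is taken from the observed information, which is one choice among several.

```python
import math

X, N = 50, 100                      # my data
PRIOR_EST, PRIOR_SE = 0.4, 0.05     # reported estimate and standard error

def score(p):
    """Derivative in p of the binomial log-likelihood plus the normal term."""
    return X / p - (N - X) / (1 - p) - (p - PRIOR_EST) / PRIOR_SE ** 2

def information(p):
    """Minus the second derivative of the same joint log-likelihood."""
    return X / p ** 2 + (N - X) / (1 - p) ** 2 + 1 / PRIOR_SE ** 2

def solve_score(lo=1e-6, hi=1 - 1e-6, tol=1e-12):
    # The score is strictly decreasing on (0, 1), so bisection finds its root.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

p_joint = solve_score()
se_joint = 1 / math.sqrt(information(p_joint))

# Inverse-variance weighting of 0.5 (se 0.05) and 0.4 (se 0.05).
se_ivw = 1 / math.sqrt(1 / 0.05 ** 2 + 1 / PRIOR_SE ** 2)

print(p_joint, 88 / 195, se_joint, se_ivw)
```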
But what is the DGM, you ask? How can we claim any frequentist properties for this ML estimator? We know that most estimates are asymptotically normal so we might argue that the prior Cochrane estimate is generated by a DGM which is very close to normal with standard deviation close to the standard error. To first order, you actually don’t have to worry about exactly what it is. It is approximately normal. But if you formally assume that it is exactly normal (with mean p and standard deviation 0.05) then you can even do an exact frequentist likelihood inference. The full DGM is a combination of a continuous and discrete component. But this will only become important if we want to do second order or exact inference.
You might further refine the prior model term by allowing for the variance to differ with the true value like a binomial does. You can do this by replacing the standard deviation by the square root of p(1-p)/95, perhaps rescaled to equal the standard error of 0.05 when p=0.4. This gives an extra term depending on p in the total log-likelihood, and leads to a slightly different final inference because the variance weights are slightly different.
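A sketch of this refinement in Python, under the assumption that the prior estimate 0.4 is treated as normal with mean p and a standard deviation proportional to sqrt(p(1-p)), rescaled to equal 0.05 at p=0.4. The names are hypothetical and the maximisation is a crude grid search rather than anything clever.

```python
import math

X, N = 50, 100
PRIOR_EST, PRIOR_SE = 0.4, 0.05

def prior_sd(p):
    """Binomial-shaped sd, rescaled to equal PRIOR_SE at p = PRIOR_EST."""
    scale = PRIOR_SE / math.sqrt(PRIOR_EST * (1 - PRIOR_EST))
    return scale * math.sqrt(p * (1 - p))

def total_loglik(p):
    own = X * math.log(p) + (N - X) * math.log(1 - p)
    sd = prior_sd(p)
    # Normal log-density of the prior estimate; -log(sd) is the extra term in p.
    prior = -math.log(sd) - (PRIOR_EST - p) ** 2 / (2 * sd ** 2)
    return own + prior

grid = [i / 10000 for i in range(100, 9901)]   # 0.01 to 0.99
p_refined = max(grid, key=total_loglik)

print(p_refined)
```

Because the prior variance now shrinks or grows with p, the weights change a little and the maximiser moves slightly compared with the fixed-variance normal term.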
So the only difficulty I see for frequentists including prior information is in small sample problems where one wants exact i.e. non-asymptotic inference, and where you do not know how the prior information was generated. Formally, frequentists have to approximate not only the prior likelihood term but also a DGM for it. Whereas in the Bayesian paradigm it is sufficient just to specify the prior log-likelihood term (without a DGM) and then proceed automatically to the “exact” inference i.e. compute the posterior distribution with barely a moment’s pause.
The process of thinking through how I would include prior information is actually useful I think. In having to invent a log-likelihood and a DGM, I all of a sudden wonder if I should really be making all this shit up! Perhaps I should analyse the present data to see what it says. If someone wants to combine this with previous information then they can do a random effects meta-analysis later and separately.
Just to finish on a light note, it seems that the Bayesian conspiracy has even got to Bill Gates. The thing that really pisses me off is that Microsoft Word underlines the word frequentist as a spelling mistake but not Bayesian!
October 23rd, 2009 at 4:55 pm
I would just like to say that to Bayesians it’s strange to go for an approximation that fails when x is close to 0 or n, when there is no need: the likelihood is proportional to beta(x+1,n-x+1), and the natural credible interval is here the highest posterior density (HPD) interval. This interval is easy to construct, even in Excel, and it outperforms popular “approximate” frequentist methods, such as Wald and Score, from a frequentist point of view. It has mean (frequentist or unconditional) coverage exactly equal to nominal, which as Agresti and others have pointed out is more appropriate than minimum coverage being equal to nominal, such as is achieved by the Clopper-Pearson interval (which typically has mean coverage of 97-98%, when the minimum is 95%): e.g. its minimum coverage is better than that of the Score interval, the interval preferred by Agresti.
When including prior information, based on a beta(a,b) prior, which can be seen to represent a-1 prior successes and b-1 prior failures, the posterior is simply beta(x+a,n-x+b), and for the purpose of sensitivity analysis resulting intervals should be compared with those based on the above consensus posterior (which follows from the uniform or Bayes-Laplace prior).
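For readers who want to try this, a grid-based Python sketch of the HPD interval follows. It is an approximation using only the standard library, not the Excel construction mentioned above, and `beta_hpd` is a hypothetical name. It assumes a and b are both greater than 1 so that the density is unimodal.

```python
import math

def beta_hpd(a, b, level=0.95, steps=20001):
    """Grid approximation to the HPD interval of a beta(a, b) density (a, b > 1)."""
    grid = [(i + 0.5) / steps for i in range(steps)]
    log_dens = [(a - 1) * math.log(p) + (b - 1) * math.log(1 - p) for p in grid]
    peak = max(log_dens)
    weights = [math.exp(v - peak) for v in log_dens]
    total = sum(weights)
    # Add grid cells in order of decreasing density until the mass reaches `level`.
    order = sorted(range(steps), key=lambda i: weights[i], reverse=True)
    mass, kept = 0.0, []
    for i in order:
        kept.append(i)
        mass += weights[i] / total
        if mass >= level:
            break
    return grid[min(kept)], grid[max(kept)]

x, n = 50, 100
# Uniform prior beta(1, 1): posterior beta(x + 1, n - x + 1).
lo_u, hi_u = beta_hpd(x + 1, n - x + 1)
# Prior beta(39, 58), i.e. 38 prior successes and 57 prior failures (illustrative).
lo_i, hi_i = beta_hpd(x + 39, n - x + 58)

print((lo_u, hi_u), (lo_i, hi_i))
```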
October 31st, 2009 at 6:22 pm
Frank, firstly it is not true that anything breaks down at x=0 or x=n. I do not think there is anything in my post that suggests this. Clopper-Pearson intervals work fine for any x. In fact they have a strong optimality property.
I do not happen to agree with Alan Agresti about coverage functions only having to equal nominal on average. Would it be OK if an interval has 100% coverage for 95% of the parameter space and 0% for the rest? Obviously not for a frequentist. So we then have to start saying how close the coverage curve is to nominal. Perhaps one could look at MSE of the coverage curve with respect to some measure on the parameter. But then you are not a frequentist. Measures on parameters are not inferentially meaningful for frequentists. Having said that, in theoretical studies I have personally done this kind of thing but only in addition to more traditional frequentist measures of validity.
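To make "coverage curve" concrete, here is a small Python sketch that computes the coverage of the Clopper-Pearson interval at each p on a grid. The choice n=20, the grid of p values and all function names are assumptions for illustration. The limits are found by bisection on the binomial tail probabilities, with the limits at x=0 and x=n set to 0 and 1.

```python
import math

def binom_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def upper_tail(x, n, p):
    """P(X >= x) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(k, n, p) for k in range(x, n + 1))

def lower_tail(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(k, n, p) for k in range(0, x + 1))

def bisect(f, lo=0.0, hi=1.0, iters=60):
    # Assumes f is increasing with f(lo) < 0 < f(hi).
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(x, n, alpha=0.05):
    lower = 0.0 if x == 0 else bisect(lambda p: upper_tail(x, n, p) - alpha / 2)
    upper = 1.0 if x == n else bisect(lambda p: alpha / 2 - lower_tail(x, n, p))
    return lower, upper

n = 20
intervals = [clopper_pearson(x, n) for x in range(n + 1)]

def coverage(p):
    """Probability, under true value p, that the interval contains p."""
    return sum(binom_pmf(x, n, p)
               for x, (lo, hi) in enumerate(intervals) if lo <= p <= hi)

p_grid = [i / 100 for i in range(1, 100)]
curve = [coverage(p) for p in p_grid]
min_cov = min(curve)
mean_cov = sum(curve) / len(curve)   # average over the grid, i.e. a uniform measure on p

print(min_cov, mean_cov)
```

The minimum is the traditional frequentist summary of the curve. The mean requires a measure on p, here a uniform grid, which is exactly the step at issue.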
November 5th, 2009 at 1:30 pm
Chris, I just meant to say that the approximate Normal likelihood you referred to is unnecessary when we know the exact form: it’s called a beta:-) Of course the Wald interval does break down for x=0 and n, and in fact so does the Clopper-Pearson lower limit when x=0 and the upper limit when x=n; but those limits are just set to 0 and 1, respectively - understandable, but leading to further conservativeness, and the usual dilemma of whether to go for a one-sided interval after all.
BTW prominent Bayesians (Berger, Bernardo etc) do care about frequentist coverage, and I’m trying to convince them that for this reason, among others, they should adopt the uniform prior rather than the Jeffreys prior for the binomial and Poisson parameters: minimum coverage is 92.7% based on HPD and the uniform prior, and 84.0% based on the Jeffreys prior – still just beating the 83.8% of the Score interval.
As for Clopper & Pearson’s “exact” interval, I believe they contradicted themselves in their 1934 paper: after first stating “In our statistical experience it is likely that we shall meet many values of n and x”, they explained their method for fixed n. However, when evaluating frequentist coverage when allowing n to vary, minimum coverage goes well above nominal, which seems undesirable. Generally, Stevens (1950) expressed the issue very well: “It is the very basis of any theory of estimation, that the statistician shall be permitted to be wrong a certain proportion of times. Working within that permitted proportion, it is his job to find a pair of limits as narrow as he can possibly make them. If, however, when he presents us with his calculated limits, he says that his probability of being wrong is less than his permitted probability, we can only reply that his limits are unnecessarily wide and that he should narrow them until he is running the stipulated risk. Thus we reach the important, if at first sight paradoxical conclusion, that it is the statistician’s duty to be wrong the stated proportion of times, and failure to reach this proportion is equivalent to using an inefficient in place of an efficient method of estimation.”
Stevens (1950), “Fiducial limits of the parameter of a discontinuous distribution”, Biometrika 37(1-2), 117-129.