LR tests and fairness
Suppose you are testing for a disease. The hypothesis testing framework seems natural for this problem: maximise the probability of diagnosing the disease, subject to a limit on the false positives. Most of us will be pretty comfortable with recommending the likelihood ratio (LR) test in most cases. And I am not saying that there is anything wrong with it. But it is an automatic prescription. The issue I want to look at in this post is what kind of testing regime the LR prescription imposes on us when the disease is one that progresses and behaves differently in males and females.
Let’s suppose that you test for the disease by measuring a normally distributed variable X whose mean is 100 for healthy individuals (H0) and 104 for diseased individuals (H1). There are two asymmetries with respect to gender that I want to suppose. First, it is known that 70% of patients presenting for testing are male. Secondly, the measurement X has a standard deviation of 1 for males and 2 for females, so it is going to be harder to diagnose females. The data comprise two pieces of information: the measurement X and the gender G, which equals M or F. Because the disease is pretty serious we are prepared to tolerate a fairly high level of false alarms. This is formalised as a target test size of 20%.
Any reasonable test will be of the form “Reject H0 if X > CG”, where the critical value CG depends on the gender G, so we need to come up with upper critical values CM and CF for males and females. A so-called similar test is one which has the same size conditional on (in this case) gender. The original paper by Lehmann is HERE and details of his book HERE. Similarity is typically imposed because it is a necessary condition for a uniformly most powerful unbiased test. In this case we get the test defined by
CM =100.842 and CF =101.684.
The conditional size is then 20% in each case but the conditional probability of a type 2 error (failure to detect the disease) is 0.079% for males and 12.336% for females. The average probability of type 2 error is 3.76% but this is unevenly distributed between the genders, the rate being 155 times higher for females.
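The similar-test figures can be checked with a few lines of Python. This is only a sketch using the standard library's `statistics.NormalDist`. The variable names are my own, and the numbers are the ones stated in the setup (means 100 and 104, standard deviations 1 and 2, 70% males, 20% size).

```python
from statistics import NormalDist

MU0, MU1 = 100.0, 104.0          # mean of X under H0 (healthy) and H1 (diseased)
SD = {"M": 1.0, "F": 2.0}        # standard deviation of X by gender
P_GENDER = {"M": 0.7, "F": 0.3}  # share of patients presenting for testing
SIZE = 0.20                      # target size, imposed separately for each gender

# Upper 20% point of the standard normal distribution.
z = NormalDist().inv_cdf(1 - SIZE)

# Similar test: each gender gets its own critical value with conditional size 20%.
crit = {g: MU0 + z * SD[g] for g in SD}

# Type 2 error: a diseased patient falls below the critical value.
type2 = {g: NormalDist(MU1, SD[g]).cdf(crit[g]) for g in SD}
avg_type2 = sum(P_GENDER[g] * type2[g] for g in SD)

for g in SD:
    print(f"{g}: critical value {crit[g]:.3f}, type 2 error {100 * type2[g]:.3f}%")
print(f"average type 2 error: {100 * avg_type2:.2f}%")
print(f"female/male type 2 ratio: {type2['F'] / type2['M']:.0f}")
```

The results should agree with the figures quoted above up to rounding in the last digit.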
So you might say:
“Let’s drop this whole similar test idea. It’s mainly a theoretical requirement for unbiased tests when the non-null mean takes a continuum of values. In this case the non-null mean is 104 and most tests will be unbiased.”
The obvious next candidate is the LR test. This is the most powerful test according to the Neyman-Pearson lemma. In this case it is not too difficult to show that the test is defined by the condition
CM =102-c/4 and CF =102-c
where c is chosen to achieve the given size. For 20% size we need c=2.04 and the critical values are
CM =101.49 and CF =99.96.
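For readers who want the missing step, here is a sketch of where the form CM = 102 - c/4, CF = 102 - c comes from. It assumes the likelihood ratio is taken over the pair (X, G) and that the gender split is the same under H0 and H1, so the gender probabilities cancel. The symbols \sigma_g, \Lambda and k are my own notation.

```latex
% Log likelihood ratio of H1 (mean 104) against H0 (mean 100) for gender g
\log \Lambda(x, g)
  = \frac{(x-100)^2 - (x-104)^2}{2\sigma_g^2}
  = \frac{8(x-102)}{2\sigma_g^2}
  = \frac{4(x-102)}{\sigma_g^2},
  \qquad \sigma_M = 1,\ \sigma_F = 2.

% Rejecting when \log\Lambda > k, with one common k for both genders:
\text{males: } 4(x-102) > k \iff x > 102 + \tfrac{k}{4},
\qquad
\text{females: } (x-102) > k \iff x > 102 + k.

% Writing k = -c gives the critical values in the text:
C_M = 102 - \tfrac{c}{4}, \qquad C_F = 102 - c.
```

The single constant c is what ties the two critical values together: the female threshold moves four times as fast as the male one, because the same change in X carries a quarter of the evidence when the variance is four times larger.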
The probability of type 2 errors is now 0.603% for males (higher than before) and 2.17% for females (lower than before) and 1.07% overall (lower than before and as low as possible). But more than 50% of healthy females are now diagnosed with the disease and the probability of type 2 error is still three and a half times that of males.
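The constant c has no closed form, but it is easy to find numerically. The sketch below uses a plain bisection on the overall size. The function names and the bracketing interval [0, 4] are my own choices, and the critical-value formulas are the ones given above.

```python
from statistics import NormalDist

MU0, MU1 = 100.0, 104.0          # mean of X under H0 and H1
SD = {"M": 1.0, "F": 2.0}        # standard deviation of X by gender
P_GENDER = {"M": 0.7, "F": 0.3}  # share of patients presenting for testing
SIZE = 0.20                      # target overall size


def critical_values(c):
    """LR critical values from the text: CM = 102 - c/4, CF = 102 - c."""
    return {"M": 102.0 - c / 4.0, "F": 102.0 - c}


def overall_size(c):
    """Probability of rejecting H0 for a healthy patient, averaged over gender."""
    crit = critical_values(c)
    return sum(P_GENDER[g] * (1 - NormalDist(MU0, SD[g]).cdf(crit[g])) for g in SD)


# The overall size increases with c, so bisect on an interval that brackets 20%.
lo, hi = 0.0, 4.0
for _ in range(60):
    mid = (lo + hi) / 2
    if overall_size(mid) < SIZE:
        lo = mid
    else:
        hi = mid
c = (lo + hi) / 2

crit = critical_values(c)
size = {g: 1 - NormalDist(MU0, SD[g]).cdf(crit[g]) for g in SD}
type2 = {g: NormalDist(MU1, SD[g]).cdf(crit[g]) for g in SD}
overall_type2 = sum(P_GENDER[g] * type2[g] for g in SD)

print(f"c = {c:.3f}")
for g in SD:
    print(f"{g}: critical value {crit[g]:.2f}, size {100 * size[g]:.1f}%, "
          f"type 2 error {100 * type2[g]:.3f}%")
print(f"overall type 2 error: {100 * overall_type2:.2f}%")
```

The conditional sizes printed here are the other side of the trade: the LR test holds the overall size at 20% by testing males much more strictly than females.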
While it is inevitable that females are diagnosed less accurately than males in this example, because the standard deviation is twice as large for females, the LR test makes a specific trade-off of errors without asking our permission. We could put costs into this problem and use decision theory to get a different answer, of course.
What if we just ignore gender altogether? Then the distribution of X is a normal mixture and we require a critical value of 101.016 to obtain size 20%. The test sizes are 15.5% for males and 30.5% for females and the probability of failing to detect the disease (a type 2 error) is 0.14% for males and 6.79% for females. The overall probability of type 2 error is 2.13%, almost double that of the LR test, so we pay a big penalty on average by ignoring the important gender information.
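The gender-blind critical value can be found the same way, by solving for the point where the mixture of the two null distributions has 20% in its upper tail. Again this is only a sketch with my own names and an assumed bracketing interval of [100, 104].

```python
from statistics import NormalDist

MU0, MU1 = 100.0, 104.0          # mean of X under H0 and H1
SD = {"M": 1.0, "F": 2.0}        # standard deviation of X by gender
P_GENDER = {"M": 0.7, "F": 0.3}  # share of patients presenting for testing
SIZE = 0.20                      # target overall size


def pooled_size(x_crit):
    """Upper tail of the H0 normal mixture beyond a single critical value."""
    return sum(P_GENDER[g] * (1 - NormalDist(MU0, SD[g]).cdf(x_crit)) for g in SD)


# The pooled size decreases as the critical value rises, so bisect accordingly.
lo, hi = 100.0, 104.0
for _ in range(60):
    mid = (lo + hi) / 2
    if pooled_size(mid) > SIZE:
        lo = mid
    else:
        hi = mid
x_crit = (lo + hi) / 2

# One critical value for everyone, but the error rates still differ by gender.
size = {g: 1 - NormalDist(MU0, SD[g]).cdf(x_crit) for g in SD}
type2 = {g: NormalDist(MU1, SD[g]).cdf(x_crit) for g in SD}
overall_type2 = sum(P_GENDER[g] * type2[g] for g in SD)

print(f"pooled critical value: {x_crit:.3f}")
for g in SD:
    print(f"{g}: size {100 * size[g]:.1f}%, type 2 error {100 * type2[g]:.2f}%")
print(f"overall type 2 error: {100 * overall_type2:.2f}%")
```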
However, I would like to leave you with some non-statistical questions. We naturally want to minimise the difference in the error probabilities for males and females. But what if the two groups were not defined by gender (a visible and politically charged genetic difference) but by an invisible but clinically measurable genetic marker Y? In testing individuals we would use X but also measure Y because it allows us to design a better test. My questions are:
Would we actually care that the error rates were different between the two groups? Would we tell the patient their genetic marker? When we quoted the diagnostic error rates to the patient would we use error rates conditional on their genetic marker?
May 4th, 2007 at 8:46 am
If we know the disease is serious, to the extent that we can quantify the additional degree of type I error we’re prepared to tolerate, presumably we can estimate the relative cost of Type I and Type II errors.
In this case, it’s not obvious to me that there’s a particular reason to fix the type I error rate across sex, since that will make for a relatively much larger Type II error rate for the sex with the larger std deviation.
If the ratio of costs of the two errors was similar across sex, the ratio of the proportion of the two error types should also be similar, meaning that you would prefer (in terms of minimising the overall cost of the decision) to take a somewhat larger type I error in the group with the larger std dev.
May 4th, 2007 at 11:16 am
(and yes, I realize that’s what you were talking about when you mentioned decision theory - but to me it seems to be the place to start looking at this sort of problem, not something to mention parenthetically, as if it was some sort of oddity)
I guess I’m not a fan of hypothesis testing unless it makes sense to do so, which it tends to do better at when both kinds of costs are at least considered.
May 4th, 2007 at 2:52 pm
I mainly agree, Glen. Certainly costs are important. But the optimum decision will still impose a specific difference in how males and females are treated (in striving for an optimum average cost). The optimum decision rule is just based on the LR multiplied by a risk-reward ratio. So I thought it was simpler to illustrate the issue with the bare LR test.
What if people did not know whether they were male or female and it was only revealed by a subtle test? Would we care about equal treatment then? Presumably not. So I think we would use the (cost adjusted) LR test then and also use their revealed gender in reporting to them the error rates.