Suppose you are testing for a disease. The hypothesis testing framework seems natural for this problem: maximise the probability of diagnosing the disease, subject to a limit on the false positives. Most of us will be pretty comfortable with recommending the likelihood ratio (LR) test in most cases. And I am not saying that there is anything wrong with it. But it is an automatic prescription. The issue I want to look at in this post is what kind of testing regime the LR prescription imposes on us when the disease is one that progresses and behaves differently in males and females.
Let’s suppose that you test for the disease by measuring a normally distributed variable X whose mean is 100 for healthy individuals (H0) and 104 for diseased individuals (H1). There are two asymmetries with respect to gender that I want to suppose. First, it is known that 70% of patients presenting for testing are male. Secondly, the measurement X has a standard deviation of 1 for males and 2 for females, so it is going to be harder to diagnose females. The data comprise two pieces of information: the measurement X and the gender G, which equals M or F. Because the disease is pretty serious, we are prepared to tolerate a fairly high level of false alarms. This is formalised as a target test size of 20%.
Any reasonable test will be of the form “Reject H0 if X>CG” so we need to come up with upper critical values CM and CF for males and females. A so-called similar test is one which has the same size conditional on (in this case) gender. The original paper by Lehmann is HERE and details of his book HERE. Similarity is typically imposed because it is a necessary condition for a uniformly most powerful unbiased test. In this case we get the test defined by
CM =100.842 and CF =101.684.
The conditional size is then 20% in each case but the conditional probability of a type 2 error (failure to detect the disease) is 0.079% for males and 12.336% for females. The average probability of type 2 error is 3.76% but this is unevenly distributed between the genders, the rate being 155 times higher for females.
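The similar-test numbers above can be checked with a short sketch. It uses only the quantities stated in the text (means 100 and 104, standard deviations 1 and 2, a 70/30 split, size 20%); the function and variable names are my own, chosen for illustration.

```python
from statistics import NormalDist

MU0, MU1 = 100.0, 104.0          # healthy and diseased means (from the text)
SIGMA = {"M": 1.0, "F": 2.0}     # standard deviation of X by gender
WEIGHT = {"M": 0.7, "F": 0.3}    # share of patients presenting
ALPHA = 0.20                     # target test size

def similar_test():
    """Critical values with size ALPHA within each gender, plus type 2 error rates."""
    z = NormalDist().inv_cdf(1 - ALPHA)                 # upper 20% point of N(0, 1)
    crit = {g: MU0 + SIGMA[g] * z for g in SIGMA}       # reject H0 if X > crit[G]
    # Type 2 error: a diseased patient falls below the critical value.
    beta = {g: NormalDist(MU1, SIGMA[g]).cdf(crit[g]) for g in SIGMA}
    overall = sum(WEIGHT[g] * beta[g] for g in SIGMA)   # average over genders
    return crit, beta, overall

if __name__ == "__main__":
    crit, beta, overall = similar_test()
    print(crit, beta, overall)
```

Running it should give critical values and error rates that agree with the figures quoted above up to rounding.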
So you might say:
Let’s drop this whole similar test idea. It’s mainly a theoretical requirement for unbiased tests when the non-null mean takes a continuum of values. In this case the non-null mean is 104 and most tests will be unbiased.
The obvious next candidate is the LR test. This is the most powerful test according to the Neyman-Pearson lemma. In this case it is not too difficult to show that the test is defined by the condition
CM =102-c/4 and CF =102-c
where c is chosen to achieve the given size. For 20% size we need c=2.04 and the critical values are
CM =101.49 and CF =99.96.
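For readers who want the step behind the form of the LR critical values, here is a sketch of the derivation. The symbols f_0 and f_1 (the densities of X under H0 and H1), σ_G (the standard deviation for gender G) and k (the threshold on the likelihood ratio) are my notation, not the post's. I am also assuming that the 70/30 gender split is the same for healthy and diseased patients, so that it cancels from the ratio.

```latex
\frac{f_1(x \mid G)}{f_0(x \mid G)}
  = \exp\!\left(\frac{(x-100)^2 - (x-104)^2}{2\sigma_G^2}\right)
  = \exp\!\left(\frac{4\,(x-102)}{\sigma_G^2}\right) > k
\quad\Longleftrightarrow\quad
x > 102 + \frac{\sigma_G^2 \log k}{4} = 102 - \frac{c\,\sigma_G^2}{4},
\qquad c = -\log k .
```

Setting σ_M = 1 and σ_F = 2 gives the two critical values 102 - c/4 and 102 - c.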
The probability of type 2 errors is now 0.603% for males (higher than before) and 2.17% for females (lower than before) and 1.07% overall (lower than before and as low as possible). But more than 50% of healthy females are now diagnosed with the disease and the probability of type 2 error is still three and a half times that of males.
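The value of c has no closed form, but it is easy to find numerically. Below is a minimal sketch that solves for c by bisection and then evaluates the type 2 error rates. The helper names (`lr_crit`, `overall_size`, `solve_c`, `type2`) and the bracketing interval [0, 10] are my own choices, not part of the original analysis.

```python
from statistics import NormalDist

MU0, MU1 = 100.0, 104.0          # healthy and diseased means (from the text)
SIGMA = {"M": 1.0, "F": 2.0}     # standard deviation of X by gender
WEIGHT = {"M": 0.7, "F": 0.3}    # share of patients presenting
ALPHA = 0.20                     # target overall test size

def lr_crit(c):
    """LR critical values: 102 - c/4 for males, 102 - c for females."""
    return {g: 102.0 - c * SIGMA[g] ** 2 / 4.0 for g in SIGMA}

def overall_size(c):
    """Probability of rejecting H0 for a healthy patient, averaged over gender."""
    crit = lr_crit(c)
    return sum(WEIGHT[g] * (1 - NormalDist(MU0, SIGMA[g]).cdf(crit[g]))
               for g in SIGMA)

def solve_c(lo=0.0, hi=10.0, tol=1e-10):
    """Bisection: overall_size is increasing in c, below ALPHA at lo, above at hi."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if overall_size(mid) < ALPHA:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def type2(crit):
    """Type 2 error by gender and averaged over gender."""
    beta = {g: NormalDist(MU1, SIGMA[g]).cdf(crit[g]) for g in SIGMA}
    return beta, sum(WEIGHT[g] * beta[g] for g in SIGMA)

if __name__ == "__main__":
    c = solve_c()
    crit = lr_crit(c)
    print(c, crit, type2(crit))
```

Note that the size constraint here is on the overall (gender-averaged) false positive rate, which is what lets the LR test trade a large size for females against a small one for males.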
While it is inevitable that females are diagnosed less accurately than males in this example, because the standard deviation is twice as large for females, the LR makes a specific trade off of errors without asking our permission. We could put costs into this problem and use decision theory to get a different answer of course.
What if we just ignore gender altogether? Then the distribution of X is a normal mixture and we require a critical value of 101.016 to obtain size 20%. The test sizes are 15.5% for males and 30.5% for females, and the probability of failing to detect the disease (a type 2 error) is 0.14% for males and 6.79% for females. The overall probability of type 2 error is 2.13%, almost double that of the LR test, so we pay a big penalty on average by ignoring the important gender information.
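The gender-blind critical value can be found the same way, by solving for the point where the mixture tail probability under H0 equals 20%. This is a sketch under the same stated setup; the function names and the search interval [100, 104] are my own assumptions.

```python
from statistics import NormalDist

MU0, MU1 = 100.0, 104.0          # healthy and diseased means (from the text)
SIGMA = {"M": 1.0, "F": 2.0}     # standard deviation of X by gender
WEIGHT = {"M": 0.7, "F": 0.3}    # share of patients presenting
ALPHA = 0.20                     # target overall test size

def sizes(crit):
    """False positive rate by gender for a single common critical value."""
    return {g: 1 - NormalDist(MU0, SIGMA[g]).cdf(crit) for g in SIGMA}

def mixture_size(crit):
    """False positive rate under the 70/30 normal mixture."""
    s = sizes(crit)
    return sum(WEIGHT[g] * s[g] for g in SIGMA)

def pooled_crit(lo=100.0, hi=104.0, tol=1e-10):
    """Bisection: mixture_size is decreasing in the critical value."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mixture_size(mid) > ALPHA:
            lo = mid      # too many false positives, raise the threshold
        else:
            hi = mid
    return (lo + hi) / 2

def type2(crit):
    """Type 2 error by gender and averaged over gender."""
    beta = {g: NormalDist(MU1, SIGMA[g]).cdf(crit) for g in SIGMA}
    return beta, sum(WEIGHT[g] * beta[g] for g in SIGMA)

if __name__ == "__main__":
    crit = pooled_crit()
    print(crit, sizes(crit), type2(crit))
```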
However, I would like to leave you with some non-statistical questions. We naturally want to minimise the difference in the error probabilities for males and females. But what if the two groups were not defined by gender (a visible and politically charged genetic difference) but by an invisible but clinically measurable genetic marker Y? In testing individuals we would use X but also measure Y because it allows us to design a better test. My questions are:
Would we actually care that the error rates were different between the two groups? Would we tell the patient their genetic marker? When we quoted the diagnostic error rates to the patient would we use error rates conditional on their genetic marker?