The Perfect Instrument
Untangling the web of dependencies between variables is the statistician's bread and butter - or it should be. But I must somewhat sheepishly admit that, until a few years ago, I had not thought much about these issues and had little sensible to say beyond the old rubric “association is not causation”. Perhaps this was because most of my research was on methodologies applied to designed experiments, where causation is fairly clear. On the other hand, if you go to any econometrics seminar you will find that 90% of the regressions involve instrumental variables. They even use “instrument” as a verb! I was dimly aware of the idea for a long time without really understanding why IVs were necessary or when they could be a bad idea. This short note is about instrumental variables and how they can be used to untangle the effects of unseen confounders.
If you know all about instrumental variables already you will probably find this pretty straightforward. But you might still like to read the excellent paper I link to at the end.
The simplest way to look at the problem is to imagine that you do a regression of Y on X but you are worried about confounders C which may make the estimated relationship between Y and X “spurious”. The obvious solution is to include C in the regression. But what if you cannot measure C? What if there are a whole bunch of things either poorly measurable or just not available which you know could drive both Y and X?
The IV approach is to find another variable Z which is correlated with X but is not vulnerable to the confounding. The relationship between Z and Y is then a watered down version of the relationship between X and Y, because Z is a watered down version of X. Under fairly natural assumptions, you can show that the association of Z with Y is equal to the product of the association of Z with X and that of X with Y. So to estimate the association of X with Y (which is the whole aim of the exercise) you divide the association of Z with Y by the association of Z with X.
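The ratio idea above can be checked on simulated data. The following is a minimal sketch, not anything taken from a particular paper: the linear model, the coefficients (0.8, 3.0), the true effect `beta` and the helper `slope` are all assumptions made up for the illustration, and "association" is taken to mean a least-squares slope.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta = 2.0  # true effect of X on Y (an assumption of this simulation)

C = rng.normal(size=n)                 # unseen confounder, drives both X and Y
Z = rng.normal(size=n)                 # instrument, generated independently of C
X = 0.8 * Z + C + rng.normal(size=n)   # X depends on the instrument and on C
Y = beta * X + 3.0 * C + rng.normal(size=n)

def slope(a, b):
    """Least-squares slope from regressing b on a (hypothetical helper)."""
    return np.cov(a, b)[0, 1] / np.var(a, ddof=1)

naive = slope(X, Y)                # regression of Y on X, contaminated by C
iv = slope(Z, Y) / slope(Z, X)     # (Z with Y) divided by (Z with X)

print(f"naive slope: {naive:.3f}, IV ratio: {iv:.3f}, truth: {beta}")
```

In this setup the naive slope lands well above the true value because C pushes X and Y in the same direction, while the ratio recovers something close to `beta` even though C is never used in the estimate.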
None of this theory would be any use unless you can find a suitable variable Z – the instrumental variable. Econometricians have all sorts of clever ideas for finding such variables, some of which involve the supposedly random assignment of dates to events of interest (for instance birth dates). But I recently found a very interesting application in medical statistics.
Last year Stephen Walter gave a talk in Melbourne about a very common issue that arises in medical trials. You assign people to be given the treatment or the control. The intention is to measure the association of the outcome (say survival or improvement) with the treatment. The problem is that human subjects, being the recalcitrant critters they are, sometimes do not accept the treatment they are given.
There are two standard but extreme approaches. Approach 1 is to compare the group that took the treatment with the group that took the control. This can involve substantial bias if the decision to accept or not accept the treatment is based on the severity of the disease. If sicker people are more likely to accept treatment, the treatment will look bad, because the treated group was worse off to begin with. Approach 2 is to compare those who were assigned to treatment with those assigned to control, called intention to treat analysis (ITT). This is not subject to the same bias and some might argue it measures the real world effect of prescribing a treatment, but medical scientists want to know what the intrinsic effect of the actual treatment is, not whether patients use it correctly.
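To make the two extremes concrete, here is a small simulated trial. It is only a sketch under invented assumptions: the true benefit `effect`, the logistic acceptance rule, the rule that control patients cannot obtain the treatment, and all variable names are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
effect = 1.0                             # true benefit of treatment (assumed)

severity = rng.normal(size=n)            # unmeasured disease severity
assigned = rng.integers(0, 2, size=n)    # randomized assignment: 0 or 1

# Assumption: sicker patients are more likely to accept the treatment when
# it is offered, and patients assigned to control cannot obtain it.
p_accept = 1.0 / (1.0 + np.exp(-2.0 * severity))
taken = assigned * (rng.random(n) < p_accept)

# Higher is better; severity hurts the outcome, treatment helps it.
outcome = effect * taken - 2.0 * severity + rng.normal(size=n)

# Approach 1: compare by treatment actually taken.
as_treated = outcome[taken == 1].mean() - outcome[taken == 0].mean()
# Approach 2: compare by treatment assigned (intention to treat).
itt = outcome[assigned == 1].mean() - outcome[assigned == 0].mean()

print(f"as-treated: {as_treated:.3f}, ITT: {itt:.3f}, truth: {effect}")
```

Under these assumptions the as-treated comparison comes out negative, so a beneficial treatment looks harmful, while the ITT difference has the right sign but is diluted because only about half of those assigned to treatment actually take it.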
Walter described a pretty complicated analysis that involved looking at all sorts of combinations of treatment assignment, treatment taken and outcome, and assuming certain conditional dependencies were not present. Ultimately, one will have to make some assumptions of this kind and his method may well be very good.
But to me, the sensible analysis is the one detailed in this marvellous paper by Sander Greenland. It falls between the above extremes and involves a great example, nay the killer example, of an instrumental variable. Here X is the treatment received and Y is the patient outcome. The raw association of X and Y is biased by a confounder C such as disease severity, or any factor which could be associated both with a patient’s decision to accept treatment and with their medical outcome. The instrumental variable Z here is treatment assignment. It is independent of C because it is randomized!! It is highly correlated with X unless you have a very recalcitrant set of subjects. So if IV does not work for this example it will never work.
The paper goes through the algebra and shows that the estimated direct effect of treatment X on outcome Y is just a ratio: the effect of Z on Y divided by the effect of Z on X. I have not seen a more intuitive or convincing use of an instrumental variable than this. It also makes transparent when the method will work poorly - namely when the correlation between Z and X is small. One then ends up with an estimate of a small correlation in the denominator, which is bound to result in poor performance. This is the well known issue of weak instruments. So the answer to the public bar question “what is the perfect instrument” is neither Stradivarius nor Gibson Les Paul. It is “intention to treat”.
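The weak instrument problem is easy to see by repeating a simulated trial many times at two levels of compliance. This is a rough sketch under assumed settings rather than the calculation in the paper: the function `iv_estimate`, its `compliance` parameter, the trial size and the acceptance model are all hypothetical.

```python
import numpy as np

def iv_estimate(rng, n, compliance, effect=1.0):
    """Simulate one trial and return the ratio (IV) estimate of the effect.

    compliance caps the probability that a patient assigned to treatment
    takes it; within that cap, sicker patients accept more often.
    """
    severity = rng.normal(size=n)
    assigned = rng.integers(0, 2, size=n)
    p_accept = compliance / (1.0 + np.exp(-2.0 * severity))
    taken = assigned * (rng.random(n) < p_accept)
    outcome = effect * taken - 2.0 * severity + rng.normal(size=n)

    # Effect of Z on Y: the intention-to-treat difference in mean outcome.
    z_on_y = outcome[assigned == 1].mean() - outcome[assigned == 0].mean()
    # Effect of Z on X: the difference in the proportion actually treated.
    z_on_x = taken[assigned == 1].mean() - taken[assigned == 0].mean()
    return z_on_y / z_on_x

rng = np.random.default_rng(2)
strong = np.array([iv_estimate(rng, 5000, 0.9) for _ in range(500)])
weak = np.array([iv_estimate(rng, 5000, 0.05) for _ in range(500)])

print(f"strong instrument: mean {strong.mean():.2f}, sd {strong.std():.2f}")
print(f"weak instrument:   mean {weak.mean():.2f}, sd {weak.std():.2f}")
```

With high compliance the estimates cluster around the assumed true effect of 1. With low compliance the denominator is close to zero, so the same amount of noise in the numerator is magnified and the estimates scatter far more widely.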
April 9th, 2007 at 3:12 pm
I thought this worked example of Phil Ender’s http://www.gseis.ucla.edu/courses/ed231c/notes3/instrumental.html
was illuminating
and the article at Wikipedia (effectively Daniel McFadden’s explanation) also very good.
http://en.wikipedia.org/wiki/Instrumental_variable
April 10th, 2007 at 2:28 pm
I did read the Ender notes and found them pretty unilluminating - which is partly why I wrote this post. I guess others may read his notes and get something more from them.
My key problem with those notes is that there is no explanation given as to why one would not fit an ordinary multiple regression with math, female, read and write all included. If one wanted to estimate the association of math and science adjusting for other factors, this is what one would surely do (or use a partial correlation). The point of IV methods, it seems to me, is mainly to adjust for unseen confounders. If you can measure them then you just throw them into the regression.
Ender also does not explain what the IV parameter estimate for Math actually means compared to the multiple regression coefficient.