Why I am in favour of logging
A colleague recently brought to me some alternative fits he had done for a paper he was writing. The alternative fits looked very strange but had been strongly suggested by a referee. He was fitting a regression model to inter-country trade data and trying to explain patterns in terms of various measures of cultural fit. The referee was pointing to some papers in econometrics that had argued about the relative merits of multiplicative regression models fitted on the direct scale, rather than on the log-scale. The referee wanted a direct fit on the basis that the random errors may be more normal and additive on the direct scale.
One of the papers he was pointing to is HERE, which contains the unequivocal recommendation:
Overall, except under very special circumstances, estimation based on the log-linear model cannot be recommended.
Sounds like complete bollocks to me. I do not recall ever having a real econometric data set where regression on the log-scale was worse than the direct. I have had some data sets where it did not seem to matter – typically when the mean response was large and the variation of the noise was small.
Why is the log-transform better? Let me count the ways.
Leverage effects can be huge on the direct scale. This was a case in point with my colleague’s data. He actually had data over about 35 years and was getting crazy results from around 1990. It was the China effect dragging all the other estimates all over the place.
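To make the leverage point concrete, here is a minimal sketch with made-up numbers (not my colleague's data): ten hypothetical economies, one of them enormous. It computes the usual hat-matrix diagonal for a straight-line fit, once on the direct scale and once after taking logs. The `scale` values and the helper name `leverages` are assumptions for illustration only.

```python
import math

# Hypothetical economic-scale values (arbitrary units). The last value plays
# the role of one giant economy. These numbers are invented for illustration.
scale = [1, 2, 3, 5, 8, 12, 20, 30, 50, 1000]

def leverages(x):
    """Hat-matrix diagonals for a straight-line fit with an intercept."""
    n = len(x)
    mean = sum(x) / n
    sxx = sum((v - mean) ** 2 for v in x)
    return [1 / n + (v - mean) ** 2 / sxx for v in x]

lev_direct = leverages(scale)
lev_log = leverages([math.log(v) for v in scale])

max_direct = max(lev_direct)
max_log = max(lev_log)
print(f"max leverage, direct scale: {max_direct:.3f}")
print(f"max leverage, log scale:    {max_log:.3f}")
```

On the direct scale the big economy has leverage close to 1, so it more or less dictates the fitted line by itself. After logging it is still the most influential point, but the other nine get a say.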
Collinearity, which is pretty closely related to the leverage effect, is also much higher on the direct scale. The model had several surrogates for economic scale and these were almost 100% correlated on the direct scale – but only about 90% on the log-scale. Again, because of China and the US.
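The same effect can be sketched for collinearity. Below, two hypothetical surrogates for economic scale (invented numbers, standing in for things like GDP and total trade) are correlated on the direct scale and again on the log scale. The variable names and values are assumptions, and the exact correlations will of course differ from those in the real data.

```python
import math

# Two made-up surrogates for economic scale across ten hypothetical countries.
# The last country is vastly larger than the rest on both measures.
size_a = [1, 2, 3, 5, 8, 12, 20, 30, 50, 1000]
size_b = [2, 1, 5, 3, 12, 8, 30, 20, 80, 900]

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r_direct = pearson(size_a, size_b)
r_log = pearson([math.log(v) for v in size_a], [math.log(v) for v in size_b])

print(f"correlation, direct scale: {r_direct:.4f}")
print(f"correlation, log scale:    {r_log:.4f}")
```

On the direct scale the single giant country lines the two surrogates up almost perfectly. On the log scale the smaller countries contribute real information about how the two measures differ, so the correlation drops.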
Fitting algorithms are so much easier on the log-scale. It is a linear model. The direct model is non-linear. The leverage and collinearity also kick in at the same time. Anyone with experience of non-linear models with near-collinearity will know what a bad thing this is.
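As a rough illustration of how little machinery the log-scale fit needs, here is a sketch that simulates a multiplicative model y = a * x^b * exp(error) and recovers a and b by ordinary least squares on the logs, using nothing but the closed-form formulas. The true values, the sample size and the noise level are all assumptions chosen for the example.

```python
import math
import random

random.seed(0)

# Assumed "true" parameters of a simulated multiplicative model.
TRUE_A, TRUE_B, SIGMA = 3.0, 0.8, 0.5

# x spans roughly three orders of magnitude; the errors are multiplicative.
x = [math.exp(random.uniform(0, 7)) for _ in range(200)]
y = [TRUE_A * xi ** TRUE_B * math.exp(random.gauss(0, SIGMA)) for xi in x]

# Taking logs turns the model into log(y) = log(a) + b * log(x) + error,
# which is a straight line, so the closed-form OLS formulas apply.
lx = [math.log(v) for v in x]
ly = [math.log(v) for v in y]
n = len(lx)
mean_lx, mean_ly = sum(lx) / n, sum(ly) / n
sxx = sum((u - mean_lx) ** 2 for u in lx)
sxy = sum((u - mean_lx) * (v - mean_ly) for u, v in zip(lx, ly))

b_hat = sxy / sxx
log_a_hat = mean_ly - b_hat * mean_lx
a_hat = math.exp(log_a_hat)

print(f"estimated b: {b_hat:.3f} (true {TRUE_B})")
print(f"estimated a: {a_hat:.3f} (true {TRUE_A})")
```

No starting values, no iterations, no convergence checks. Fitting the same model on the direct scale needs an iterative non-linear least squares routine, which is where the leverage and collinearity problems start to bite.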
The main reason the referee put forward for the direct model was that the errors would be more normal on that scale. This is an empirical matter and, in my experience, residuals usually look more normal on the log-scale. More importantly, we all know that normality of residuals actually does not matter much at all (for moderate sized data sets), yet the misconception persists even within highly quantitative disciplines like econometrics. Especially for regression (or two-sample tests) there is a cancelling out of the first order skewness term that makes even highly skewed errors have a limited effect on the fit.
The scale of random errors is surely proportional to the mean in practice for real economic data sets. Does anyone really believe that the intrinsic variability of the GDP of Kiribati is the same as that of China? Therefore, to do the direct non-linear model properly you would have to weight the fit with respect to variance. These variances would have to be estimated from the non-linear, almost collinear model. So you are iterating and I can see this procedure never converging at all.
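A quick simulation shows what "errors proportional to the mean" does to the two scales. This is only a sketch under an assumed multiplicative error model, with two hypothetical economies whose mean levels differ by a factor of 1000. The function name `simulate` and all the numbers are made up.

```python
import math
import random
import statistics

random.seed(1)

def simulate(mean_level, n=2000, sigma=0.2):
    """Draw values whose random error is multiplicative around mean_level."""
    return [mean_level * math.exp(random.gauss(0, sigma)) for _ in range(n)]

small = simulate(1.0)      # a tiny hypothetical economy
large = simulate(1000.0)   # a huge hypothetical economy

# Ratio of standard deviations, large economy to small economy.
sd_ratio_direct = statistics.stdev(large) / statistics.stdev(small)
sd_ratio_log = (statistics.stdev([math.log(v) for v in large])
                / statistics.stdev([math.log(v) for v in small]))

print(f"SD ratio on the direct scale: {sd_ratio_direct:.1f}")
print(f"SD ratio on the log scale:    {sd_ratio_log:.2f}")
```

On the direct scale the large economy's noise is about a thousand times that of the small one, so an unweighted fit is hopeless. On the log scale the two have about the same spread and plain least squares is a reasonable thing to do.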
So there you have it. When I get data in an Excel spreadsheet with $-signs anywhere I just log the hell out of everything and only then start snooping around. I would be interested if anyone can tell of experiences where direct modeling was actually successful.