How do you detect election fraud? A recent article in the Washington Post describes a novel statistical idea. I wish I had thought of it.
Most people believe the election results in Iran were rigged. They base this on a couple of arguments. The most obvious is that Ahmadinejad did unreasonably well, particularly in areas where you would expect him to poll poorly, for instance in the home seat of his main opponent Mousavi. This argument would sway most people but it is not scientific. Ahmadinejad could still say that he ran a great campaign and that the people decided he was a safer pair of hands.
Another, more statistical, argument is that the variability in Ahmadinejad’s vote is too small. By too small, I think they mean less than one would expect from the regional variation that one normally sees. So this does intersect with the first argument and Ahmadinejad’s high vote in some unlikely seats. But it is a distinct statistical view in that we focus on variation across seats rather than on the overall mean level. Certainly, if one could actually show that the variability of Ahmadinejad’s vote was less than binomial variation, this would be pretty damning and would suggest that someone had just made the figures up.
Well, it turns out that the variability of the vote is about 100 times higher than binomial; there is plenty of regional variation overall. I suppose we could compare the variability with a different benchmark, such as the variability of the vote for winners of previous elections, but the argument starts to lose force as there are other explanations of why variability might decline.
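The post does not say how "100 times higher than binomial" was computed. One common way to make the comparison concrete is the Pearson chi-square statistic divided by its degrees of freedom, which is close to 1 when regional vote shares differ only by binomial sampling noise. The sketch below assumes that definition; the function name `dispersion_ratio` and the figures are invented for illustration and are not the Iranian data.

```python
def dispersion_ratio(votes, totals):
    """Pearson chi-square over its degrees of freedom.

    votes[i]  = votes for the candidate in region i
    totals[i] = all valid votes cast in region i
    A value near 1 is what pure binomial variation would give.
    """
    if len(votes) != len(totals) or len(votes) < 2:
        raise ValueError("need matching lists with at least two regions")
    p = sum(votes) / sum(totals)  # pooled vote share across regions
    chi2 = sum(
        (x - n * p) ** 2 / (n * p * (1 - p))
        for x, n in zip(votes, totals)
    )
    return chi2 / (len(votes) - 1)


# Invented toy figures, NOT the real election counts.
votes = [5200, 6100, 4300, 7000]
totals = [10000, 10000, 10000, 10000]
ratio = dispersion_ratio(votes, totals)
print(round(ratio, 1))
```

With regions this large, even modest differences in vote share swamp the binomial benchmark, which is why a ratio far above 1 says little on its own.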
So, if we are interested in revealing whether the election count data has been concocted, let’s focus more on the process of human beings making figures up. Humans are pretty bad at making figures up. Human-generated data typically looks too good to be true and follows theory too well. It is well known, for instance, that Mendel’s famous pea data was probably concocted, perhaps by his assistants. Forged signatures can often be recognised by experts as being too consistent. Real signatures are not perfect and vary from day to day.
Humans are especially terrible at generating random numbers. And for a large voting count, for instance 325911, which was Ahmadinejad’s count in the region of Ardabil, the last few digits should be essentially random. On the other hand, if someone were making the numbers up and not concentrating too hard on the unimportant final digits, you might expect to see some tell-tale signs of non-randomness in those final digits.
This idea is due to Alexandra Scacco and Bernd Beber, who have suggested that there is indeed such evidence in the data. They claim that human-generated random numbers tend to have too many 7s and not enough 5s. And looking at pairs of digits, they claim that human-generated digits will have too many adjacent sequences such as 23 and 76.
The data for the 2009 Iranian presidential election are HERE and a graphic of the marginal distribution of the last digit is below. There are indeed too many 7s and not enough 5s. The overall goodness-of-fit test of a uniform distribution has a P-value around 8% but this may underestimate the evidence. If we concentrate on the a priori hypothesis of too many 7s and not enough 5s then the evidence against uniformity is much, much stronger.
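The mechanics of the last-digit check can be sketched as follows. This is a minimal illustration rather than the analysis behind the 8% figure: it tallies final digits and computes the chi-square statistic against a uniform distribution. The helper names are hypothetical, and apart from 325911 (the Ardabil count quoted above) the sample counts are invented.

```python
from collections import Counter


def last_digit_counts(counts):
    """Return how often each final digit 0..9 occurs in a list of vote counts."""
    tally = Counter(abs(c) % 10 for c in counts)
    return [tally.get(d, 0) for d in range(10)]


def chi_square_uniform(observed):
    """Pearson chi-square statistic against equal expected frequencies."""
    total = sum(observed)
    expected = total / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)


# Only 325911 comes from the post; the other counts are invented.
sample = [325911, 104857, 88217, 45127, 67203,
          91377, 120435, 33097, 58769, 77140]
digits = last_digit_counts(sample)
stat = chi_square_uniform(digits)
print(digits, stat)
```

The statistic is referred to a chi-square distribution with 9 degrees of freedom, whose 5% critical value is about 16.92. Ten counts are far too few for that approximation to be trusted; the toy sample only shows the arithmetic, and the real check needs the full set of 116 counts.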
But is the hypothesis of too many 7s and not enough 5s really a priori? I could not find any evidence on the web for the assertion of not enough 5s but there is some prior reason to look for too many 7s. If we just look at the excess of 7s the P-value is around 0.4%. The excess of 20/116=17.2% sevens over the expected 10% is very suspicious indeed.
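The post does not say which test produced the 0.4% figure. One way to check the excess of 7s, assuming each of the 116 final digits is independently a 7 with probability 0.1, is an exact one-sided binomial tail. The function name below is hypothetical, and an exact tail need not match a figure obtained from a normal approximation, so any gap should be read as a difference of method rather than of data.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly term by term."""
    return sum(
        comb(n, i) * p**i * (1 - p) ** (n - i)
        for i in range(k, n + 1)
    )


share = 20 / 116  # observed proportion of 7s quoted in the post
p_value = binomial_upper_tail(20, 116, 0.1)
print(f"share of 7s: {share:.3f}, one-sided P-value: {p_value:.4f}")
```

This is a one-sided test of a single digit chosen in advance. If the digit had been picked after looking at the data, the P-value would need to be adjusted for having ten digits to choose from.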
One might obtain even stronger results if we concentrate only on those electorates where Ahmadinejad’s vote was likely to be poor. One assumes that the fraudsters would not alter the counts in the electorates where he won. So the random digits in these true counts might dilute the non-randomness in the fraudulent counts.
I also had a look at the last pairs of digits hoping to find something even stronger but I could not recover the results quoted in the Washington Post article. Moreover, I could not find any evidence for their claims about adjacent sequences, though I have heard that two-digit primes tend to get preferentially chosen. Anyway, I challenge readers to look at the digits and find some really damning evidence of non-randomness (which we would have to correct for the effort you put into the search!).
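For readers taking up the challenge, here is a minimal sketch of the adjacent-pair check. It assumes "adjacent" means the last two digits differ by exactly one, with no wrap-around between 9 and 0, which is a reading of the examples 23 and 76 rather than a definition taken from the article. The function names are hypothetical.

```python
def is_adjacent_pair(count):
    """True if the last two digits of count differ by exactly one (e.g. 23, 76)."""
    tens, units = divmod(abs(count) % 100, 10)
    return abs(tens - units) == 1


def adjacent_share(counts):
    """Fraction of vote counts whose final two digits form an adjacent pair."""
    return sum(is_adjacent_pair(c) for c in counts) / len(counts)


# Share expected if the last two digits were uniform on 00..99,
# obtained by enumerating all 100 possible pairs.
baseline = sum(
    abs(a - b) == 1 for a in range(10) for b in range(10)
) / 100
print(baseline)
```

Comparing `adjacent_share` on the real counts with `baseline` is one test among many that could be tried. Every additional pattern searched for is another comparison, so a small P-value from the best of them overstates the evidence.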
If I were going to fudge some numbers, I would start by multiplying the real counts by some scaling factor, or leave the last digits alone, or even use a random number generator on my mobile phone! I guess criminals and fraudsters are not always very smart.
Cross posted at CoreEcon.