Some of my colleagues recently ran in the Melbourne Half Marathon. The folks who administer the event are good enough to provide an Excel spreadsheet listing the finishing time of each competitor with their registered age and gender. You might find it useful in the classroom. There are some interesting patterns in the data but not for the reason you might first think.
Actually there were several races in one. More than half of the starters continued on from 10km to the full half marathon and some only ran 5km. I tabulated the times by age and gender for the 10km race. Below is a plot (with standard error bars). What I find most interesting about this graph is how flat the curves are until the mid-forties. There wouldn’t be too many sports where 40 year olds are competitive with 20 year olds, especially a sport such as long distance running where injury is a significant factor, I would have thought. There is hope for us after all!
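For anyone who wants to repeat the tabulation in class, here is a minimal sketch of the mean and standard error calculation behind a plot like this. It assumes the spreadsheet has already been reduced to (age group, gender, finishing time in minutes) records; the function name, the record layout and the example times are all made up for illustration and are not the race data.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def summarise_times(records):
    """Return {(age_group, gender): (n, mean_minutes, standard_error)}."""
    groups = defaultdict(list)
    for age_group, gender, minutes in records:
        groups[(age_group, gender)].append(minutes)

    summary = {}
    for key, times in groups.items():
        n = len(times)
        # The standard error of the mean is s / sqrt(n); it needs n >= 2.
        se = stdev(times) / sqrt(n) if n > 1 else float("nan")
        summary[key] = (n, mean(times), se)
    return summary


# Hypothetical finishing times in minutes, NOT taken from the spreadsheet.
example = [
    ("20-24", "M", 48.0), ("20-24", "M", 52.0), ("20-24", "M", 50.0),
    ("40-44", "M", 49.0), ("40-44", "M", 53.0),
]
summary = summarise_times(example)
```

The error bars on the plot would then run one standard error either side of each group mean.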
However, if you think about it for more than 10 seconds, these data probably do not imply that 40 year olds are as good as 20 year olds. There would be a truly huge selection effect here. Yours truly did not run, partly because I had better things to do but also because I have a bad knee. We would surely expect that many of those whose performance is reduced by the ravages of age would not even bother to compete, especially while they are still in an age group - say up to about 40 - in which they kid themselves that they are still near their prime. I’ll go in the next one when I have done a bit of training. Don’t want the guys at work finding out I couldn’t break 60 minutes….
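The selection effect is easy to demonstrate with a toy simulation. The sketch below assumes two groups of potential 10km runners whose times are normally distributed, with the older group truly 8 minutes slower on average, and assumes nobody enters unless they expect to break 60 minutes. The means, the spread and the cutoff are all invented for illustration; none of them is estimated from the race data.

```python
import random
from statistics import mean


def simulate_entrants(true_mean, sd, cutoff, n, rng):
    """Draw n potential runners and keep only those faster than the cutoff."""
    times = [rng.gauss(true_mean, sd) for _ in range(n)]
    entrants = [t for t in times if t < cutoff]
    return times, entrants


rng = random.Random(1)  # fixed seed so the classroom demo is repeatable

# Assumed parameters: times in minutes, 60 minute "pride" cutoff.
young_all, young_in = simulate_entrants(50.0, 8.0, 60.0, 20_000, rng)
old_all, old_in = simulate_entrants(58.0, 8.0, 60.0, 20_000, rng)

true_gap = mean(old_all) - mean(young_all)      # gap in the whole population
observed_gap = mean(old_in) - mean(young_in)    # gap among those who enter
young_rate = len(young_in) / len(young_all)     # share of each group entering
old_rate = len(old_in) / len(old_all)

print(f"true gap {true_gap:.1f} min, observed gap {observed_gap:.1f} min")
print(f"participation: young {young_rate:.0%}, old {old_rate:.0%}")
```

Under these assumptions the gap among entrants comes out well short of the true gap, while the older group's participation rate falls, which is the combination of flat curves and falling participation that a selection effect would produce.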
Each mean is the mean of a truncated distribution, so these means tell us pretty much nothing unless we know something about the truncation. A first pass at this selection effect is to check the participation rate amongst various age groups. So I downloaded file 320102 from the ABS, which gives the 2008 Victorian population by age and gender, and I calculated the number of runners per 10,000 people within each age grouping. I did the same for the 22km race as well. The results are pretty telling (though not broken down by gender).
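The participation calculation is just a ratio scaled to 10,000 people. Here is a minimal sketch, assuming the runner counts and the population figures have already been tallied into the same age groupings; the function name and the numbers are placeholders, not the race counts or the ABS figures.

```python
def runners_per_10k(runner_counts, population):
    """Return runners per 10,000 people for each age group."""
    rates = {}
    for age_group, runners in runner_counts.items():
        people = population[age_group]  # raises KeyError if groupings differ
        rates[age_group] = 10_000 * runners / people
    return rates


# Placeholder figures for illustration only.
runner_counts = {"25-29": 900, "45-49": 300}
population = {"25-29": 400_000, "45-49": 375_000}

rates = runners_per_10k(runner_counts, population)
for age_group, rate in rates.items():
    print(f"{age_group}: {rate:.1f} runners per 10,000")
```

The age bands in the two sources have to line up before dividing, which is why the sketch fails loudly on a missing group rather than skipping it.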
Participation drops off drastically from a peak at 25-29. The obese and the ill do not arrive at the starting line. The participation in the 22km run is higher. These are all more serious athletes - by definition. If you examine the vertical scale in the plot below you will note that the average speed for the 22km is about half a minute per kilometer faster than for 10km. Partly, this will be due to the traffic jam at the beginning of the race, which washes out more in a longer race. They are also better runners to start with - they are not all of a sudden getting a second wind at the 10km mark.
So in the final analysis this data set is probably not much good for estimating anything but I think it will make a good case for describing sampling bias in class. When sampling is not controlled and randomised, you need to think about how the data points get into the study. When human beings select themselves, the raw data tell you as much about the psychology of selection as about performance. Sport is always a great vehicle for discussing statistics and sampling design is, let’s face it, a topic that easily sends people to sleep.
We do expect to see times increasing with age eventually, even with truncation. And that is what we see though it looks like male 22km runners just won’t enter if their speed is going to be worse than 6mins/km. I think it would be interesting to track down the reasons for the degrading performance at higher ages. Finishing time data is obviously useless for this. It is a medical issue. For myself, I feel that I am pretty fit in heart and lung but on the occasions when I run with my son I am disadvantaged by a very inefficient running style (that is what they all say!) mainly a result of general stiffness and joint pain. I use a lot more energy per (short) stride than he does.*
The main pattern in the data that I cannot explain is the large difference between males and females - around 1 minute per kilometer consistent across all ages and both distances. I would have thought the differences between genders would be much less since this is a test of fitness rather than strength. My speculative explanation is that this is again a selection effect - that females have a lower pride factor and are willing to participate even at a low level of performance. I find this much easier to believe than that 50 year old males are a minute faster than 50 year old females.
* I am reminded of some of the lap swimmers I have seen who barely expend an erg, and even enhance their speed with hand paddles and flippers. If you were trying to get the maximum exercise per minute, wouldn’t you attach weights to your feet rather than flippers?