In February 1662 (352 years ago!), John Graunt presented 50 copies of his Natural and Political Observations on the London Bills of Mortality to the Royal Society, and was inducted forthwith. I'm not sure about the exact date, but it represents the approximate birthday of modern demography. Here's an indexed html conversion of that document. Apparently this was printed on Jan. 25th of that year (maybe that's the birthday?), which means that poor John Graunt was toiling away in the preceding months, and possibly would have noticed the male-biased sex ratio at birth the summer before (is that the birthday?). Working out the first lifetable ever sounds like nasty fall work to me. (Come on, let me imagine that that's how it went down!)
The cover page to citizen Graunt's momentous treatise:
The cover to the annual Bills of Mortality (London death stats) that he based it on was pretty awesome back then:
Anyway, we (Berkeley Demography) are going to pretend it's modern demography's birthday today and have called a tea time. Well done, says I!
During today's dreary walk to work I cooked up a little drinking song in memoriam (for some demography happy hour somewhere):
we stand upon John Graunt's head
we drink to the Bills of the Dead!
raise a glass in poor John Graunt's stead
raise a glass in poor John Graunt's stead!
It's short, I know, but with a barrel voice it could be cool.
[[Edit, Friday Feb 28]]
Ken Wachter informed us that Graunt may not have been the original author of his 'observations', and that the work may have been lifted from Graunt's boyhood friend, William Petty. Hervé Le Bras has done some digging, and he comes to some conclusions on the matter in this book.
I probably heard of Benford's law first on an episode of Radiolab. Benford's law is an odd empirical regularity about the way numbers appear in the world. It says that if we take the first digit (1-9) of each number in a large pool of numbers, the share of times that any given digit appears as the first digit should asymptotically follow a particular distribution. 1 occurs the most, 2 the next most, and so forth:
If d is a given digit, then log10(1 + 1/d) (that's the base-10 log) gives the proportion of all first digits that you'd expect digit d to take up (above).
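A minimal sketch of that formula, in Python rather than the R used for the actual analysis. The function name is made up for illustration; the only thing taken from the text is the base-10 log expression.

```python
import math

def benford_expected():
    """Expected first-digit proportions under Benford's law, digits 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

expected = benford_expected()
for digit, proportion in expected.items():
    # One line per digit: the digit and its expected share of first digits
    print(digit, round(proportion, 3))
```

The nine proportions sum to 1, since the product of the (d+1)/d terms telescopes to 10.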
Question: do demographic data obey this law? Test data: HMD death counts (Deaths_lexis.txt). Here are the results:
I'd say that yes, HMD death counts obey Benford's law. The dashed blue line simply takes the count from each death triangle and counts how many times each first digit appears for males and females (1918722 total numbers as of this writing). I've not looked at how the pattern unfolds by sex, age or over time. We do see here that some HMD populations follow the law more closely than others. This raises the question: is there something fishy about the data for countries whose death counts don't follow this law (look how lumpy some of those lines are)? To check, I measured how much each grey line departs from the red line (half the sum of the absolute deviations from the reference distribution) and ranked the HMD countries. Scores are bounded by 0 (exact match) and 1 (totally different distributions, no overlap). Therefore larger scores mean larger departures from Benford:
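For the curious, the departure score can be sketched as follows. This is a Python illustration, not the R code used for the figures. The function names are hypothetical, the toy counts are made up, and I assume counts are treated as whole numbers of at least 1 (anything smaller is skipped, which may not match how the real script handles fractional triangle counts).

```python
import math
from collections import Counter

def first_digit(n):
    """Leading digit of a count, after truncating to a whole number."""
    return int(str(int(n))[0])

def first_digit_distribution(counts):
    """Observed share of each leading digit 1..9 (counts below 1 are skipped)."""
    digits = [first_digit(n) for n in counts if n >= 1]
    if not digits:
        raise ValueError("no counts of 1 or more to tabulate")
    tally = Counter(digits)
    return [tally.get(d, 0) / len(digits) for d in range(1, 10)]

def benford_departure(counts):
    """Half the sum of absolute deviations from the Benford proportions."""
    observed = first_digit_distribution(counts)
    expected = [math.log10(1 + 1 / d) for d in range(1, 10)]
    return 0.5 * sum(abs(o - e) for o, e in zip(observed, expected))

# Made-up death counts, standing in for one country's triangles
toy_counts = [1, 12, 130, 19, 2, 25, 3, 47, 5, 8, 0]
print(benford_departure(toy_counts))
```

Running this per country and sorting the scores would give a ranking of the kind shown below.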
To put things in perspective, 0.12 is not a huge difference between two distributions, so let's just look at the rankings. These ranking results are unexpected to me, and I suspect will be to others. The greatest departures from Benford's law are found in Scandinavian countries, and the most Benford-obedient populations are in the former Soviet Republics and Switzerland. Go figure. It is widely held among demographers that the Nordic countries in general have the best data (and most abundant, longest-collected, etc), but perhaps this metric is ill-conceived...
Now a flurry of questions: Do we learn something from this departure? Is there any particular reason why the Benford distribution must apply? Are there too few numbers from any given country for this asymptotic property to shine through? Is the discrepancy large enough to merit further digging? Of course, looking at death triangle output from the HMD means that the data have been massaged to some degree prior to running this test (splitting 5-year age groups, stuff like that). However, if we do the same test on abridged death counts (Deaths_5x1), which resemble lowest common denominator database inputs, the overall ranking is very similar. Does an anomalous digit distribution indicate anomalous data, an anomalous population or population process, or an anomalous application of Benford's law?
R code to completely reproduce this is on github, here. It uses the DemogBerkeley package (installed from github) to grab necessary data from the web.
Age distributions of 'hedonic wellbeing' (a.k.a. subjective wellbeing, happiness in a superficial non-Aristotelian way) have been reported from time to time. Graphical evidence is more appealing to me. I'd never before seen the GSS (General Social Survey) until I saw it cited as the source for earlier happy-curves. Enter AJ Damico's awesome R utilities for social surveys (scripts on github here, overview and blog here). Lo and behold, AJ had a helper script for that survey. I couldn't resist.
Here are US male and female happy surfaces, because why not (click to enbiggen):
Some details on where the values for each 5-year AP square come from: There is a question 'General Happiness', coded as 'happy' when loaded into R, which is asked as follows:
157. Taken all together, how would you say things are these days--would you say that you are very happy,
pretty happy, or not too happy?
So we get those categories, plus NAs, 'I don't knows', etc. I, boldly or blithely, gave 'very' a score of 1, 'pretty' a score of .5, and 'not too' a score of 0, and ignored the other values, taking the (duly weighted) distribution of these 3 responses within each age-year-sex block. The sum of the three proportions, each multiplied by its score, is the block's score, ranging from 0 to 1, most often between .6 and .8. Visually, we see slight patterns in age, calendar year and cohorts (pre-WWII cohorts very happy in this snapshot, would be worth busting down to triangles). It also would have been practical to simply take the proportion saying 'very' as our measure.
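In code, the scoring for a single age-year-sex block might look like the sketch below. It's Python rather than the R actually used, and the response labels, weights and function name are all made up for illustration (the real GSS variable codes and weight column are not shown in this post).

```python
# Assumed labels for the three scorable answers; the GSS coding may differ
SCORES = {"very happy": 1.0, "pretty happy": 0.5, "not too happy": 0.0}

def happiness_score(responses, weights):
    """Survey-weighted mean score for one block, ignoring unscorable answers."""
    numerator = 0.0
    denominator = 0.0
    for response, weight in zip(responses, weights):
        if response in SCORES:  # skips NAs, "don't know", etc.
            numerator += SCORES[response] * weight
            denominator += weight
    return numerator / denominator if denominator > 0 else None

# Hypothetical respondents in one block, with hypothetical survey weights
cell_responses = ["very happy", "pretty happy", "not too happy", "don't know", "pretty happy"]
cell_weights = [1.2, 0.8, 1.0, 0.5, 1.0]
score = happiness_score(cell_responses, cell_weights)
print(score)
```

Dividing by the weight of scorable answers only is the same thing as taking the weighted distribution over the three responses and summing score times proportion.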
If we're OK with these hypothetical happiness units, one might ask how much happiness would be experienced in an average lifetime, something like happiness expectancy. To get a handle on that, I've taken the survival-weighted age-specific happiness for each year, and summed over age. We get this as the time trend:
So, females have a higher happiness expectancy too! Actually the gap here is often lower than the life expectancy gap, but a hypothetical THR (total happiness rate) is too noisy to read a believable trend from. It remains to be seen whether Venezuela's Ministry of Supreme Social Happiness takes demography seriously ;-P. Bhutan probably does, but I haven't looked closely. There are certainly other measures of well-being that might mean more to most social scientists...
To be precise, this is the same as total lifetime happiness in the hypothetical stationary population, and is a synthetic measure that indexes this value for a particular year, but for no particular cohort. We could decompose the time trends or sex differences using old-school Kitagawa decomposition, and we'd see that the upward trend in happiness expectancy (h0) is mostly due to mortality gains.
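Here's a rough sketch of what such a decomposition could look like, assuming happiness expectancy is the sum over age of life table person-years (Lx, radix 1) times happiness (hx). It's in Python, the function name is mine, and the three-age-group numbers are invented purely so the arithmetic is easy to follow; they say nothing about the real result.

```python
def kitagawa(L1, h1, L2, h2):
    """Split the change in sum(L*h) into a mortality part and a happiness part.

    Each part weights one factor's change by the other factor's average,
    so the two parts add up exactly to the total change.
    """
    rows = list(zip(L1, L2, h1, h2))
    mortality = sum((Lb - La) * (ha + hb) / 2 for La, Lb, ha, hb in rows)
    happiness = sum((hb - ha) * (La + Lb) / 2 for La, Lb, ha, hb in rows)
    return mortality, happiness

# Invented person-years and happiness for two time points
L_then, h_then = [1.0, 0.90, 0.5], [0.8, 0.70, 0.7]
L_now, h_now = [1.0, 0.95, 0.7], [0.8, 0.72, 0.7]

mortality_part, happiness_part = kitagawa(L_then, h_then, L_now, h_now)
total_change = (sum(L * h for L, h in zip(L_now, h_now))
                - sum(L * h for L, h in zip(L_then, h_then)))
print(mortality_part, happiness_part, total_change)
```

The same function would serve for sex differences: feed it the male and female schedules for one year instead of two time points.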
Ugly details: we don't have info for kids or for the oldest old, so I assumed kids 0-14 were a constant rather-happy .8, and that people aged 90-110+ were constant at the age 85 level for a given year. The latter choice has little leverage on results, but the former will make a difference.
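Putting those pieces together, the calculation can be sketched like this. It is a simplified Python illustration with single-year ages, not the R script itself: the function names are hypothetical, the flat happiness curve and the everyone-dies-at-80 life table are toy inputs, and "last observed level" stands in for the age 85 level mentioned above.

```python
def pad_happiness(observed, child_level=0.8, max_age=110):
    """Fill ages 0..max_age from a dict of observed age -> happiness.

    Ages below the first observed age get child_level; ages above the
    last observed age are held constant at the last observed level.
    Assumes every age between the first and last is present.
    """
    first_age, last_age = min(observed), max(observed)
    padded = []
    for age in range(max_age + 1):
        if age < first_age:
            padded.append(child_level)
        elif age > last_age:
            padded.append(observed[last_age])
        else:
            padded.append(observed[age])
    return padded

def happiness_expectancy(Lx, hx):
    """Sum of person-years lived at each age times happiness at that age."""
    return sum(L * h for L, h in zip(Lx, hx))

# Toy inputs: flat happiness of 0.7 at observed ages 15-89,
# and a life table (radix 1) where everyone lives exactly 80 years
observed = {age: 0.7 for age in range(15, 90)}
hx = pad_happiness(observed)
Lx = [1.0 if age < 80 else 0.0 for age in range(111)]

h0 = happiness_expectancy(Lx, hx)
print(h0)
```

With these toy inputs the childhood assumption accounts for 12 of the 57.5 happy-years, which shows why that choice carries some weight.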
Full reproducible R script on github here. It'll download the necessary data. Sorry, you'll need to change local file paths to get it running, as well as potentially install a few packages (code given).
Here's an experimental visualization that will require a bit of explanation, below:
Look first at the top part, mostly in green. The uppermost contour tracks the total birth count in Sweden from 1891 to 2010, according to my most recent download of HFD data. The upper bands in turn indicate mothers' cohorts (in 5-year groups). I've only shaded those 5-year cohorts whose fertility is fully observed (or almost so). The grayed-out cohorts on the left are those whose early-career fertility precedes 1891, while the grayed-out cohorts on the right include those whose fertility careers are not yet complete. The darker the green, the greater the career fertility of the given 5-year cohort. Colors were determined by just breaking up cohorts into quantiles, and you can see them matched to the y scale on the lower y axis (since they refer to vertical slice heights on the bottom). In any case, the x-axis for the upper 1/2 indicates year of occurrence.
Take a vertical slice from the top: that's a birth cohort. If you reflect it down to the bottom (blues) you see the career fertility of that very cohort, the 2nd generation. The lowermost contour tracks total birth counts born to each cohort (whose starting size is seen above), and it is only fully observed for the areas that are not indicated with a gray background. The banded areas in the bottom indicate the 5-calendar-year groups in which the births occurred. That might be a bit to wrap your mind around: Take the 1920 birth cohort. You see its starting size on the top and the total births produced by that cohort on the bottom (no account is made for in- or out-migration). Looking at the bottom blue 1/2 for the 1920 cohort, we see that it contains several shades of blue: its births were spread out over several 5-year groups, starting around 1935. However, 1935 was a dearth-year, due to the depression, and we see it's colored in light yellow on the bottom (two consecutive bands). For reference, the darkest blue shade in the center (around 1920) refers to births between 1945 and 1949. Now we see the relative contribution from birth cohorts to both the depression dearth and the post-war boom, in two different perspectives.
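If it helps, the two halves are just two ways of slicing the same births. Here's a rough Python sketch of the bookkeeping (not the code behind the figure): the function name and the toy records are made up, and I approximate mother's cohort as year minus age, which can be off by one depending on how the HFD data are tabulated.

```python
from collections import defaultdict

def tabulate_births(records, width=5):
    """Tabulate births two ways from (year, mother_age, births) records.

    by_year[year][cohort_band]: top half, one vertical slice per year
    of occurrence, banded by mother's 5-year cohort.
    by_cohort[cohort][year_band]: bottom half, one vertical slice per
    mother cohort, banded by 5-year period of occurrence.
    """
    by_year = defaultdict(lambda: defaultdict(float))
    by_cohort = defaultdict(lambda: defaultdict(float))
    for year, age, births in records:
        cohort = year - age  # approximate mother's birth cohort
        by_year[year][cohort - cohort % width] += births
        by_cohort[cohort][year - year % width] += births
    return by_year, by_cohort

# Made-up records: (year of occurrence, mother's age, births)
toy_records = [(1945, 25, 100), (1946, 26, 110), (1945, 20, 80), (1950, 30, 90)]
by_year, by_cohort = tabulate_births(toy_records)

print(sum(by_year[1945].values()))    # height of the top contour in 1945
print(sum(by_cohort[1920].values()))  # depth of the bottom contour for the 1920 cohort
print(dict(by_cohort[1920]))          # the 1920 cohort's births by 5-year period
```

Summing each inner dict gives the outer contours, and the inner keys give the bands stacked within each slice.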
Each boom (on the top) has an echo (on the bottom), i.e. literally this is a 1-generation step, before the effects of ergodicity have time to step in. This is an effect of population structure and not necessarily of rates. One could recreate the above figure with rates instead of counts and likely tell a different story. I'll leave that for another day. Note the boom around 1990: it'd take a huge rate swing to not expect an echo boom from those cohorts-- that's a no-brainer.
* the same visualization would work for deaths ;-), in either case, with rates and counts, according to your whim.
** my apologies if this explanation isn't efficient!
*** this design is indeed inspired by the stream-graph, but I had to make some design deviations from the popular design used in the NYT: 1) I kept stacking order strictly chronological because a) the figure is about time and quantum equally and b) the bands are not categorically different but rather a discretization of an imaginably continuous flow.
2) I removed the meander because *two* flows are plotted, not one. For this reason, neither flow crosses the x axis.