In February 1662 (352 years ago!), John Graunt presented 50 copies of his Natural and Political Observations on the London Bills of Mortality
to the Royal Society, and was inducted forthwith. I'm not sure about the exact date, but it represents the approximate birthday of modern demography. Here's an indexed html conversion of that document
. Apparently this was printed on Jan. 25th of that year (maybe that's the birthday?), which means that poor John Graunt was toiling away in the preceding months, and possibly would have noticed the male-biased sex ratio at birth the summer before (is that the birthday?). Working out the first lifetable ever sounds like nasty fall work to me.
(Come on, let me imagine that that's how it went down!)
The cover page to citizen Graunt's momentous treatise:
The cover to the annual Bills of Mortality (London death stats) that he based it on was pretty awesome back then:
Anyway, we (Berkeley Demography) are going to pretend it's modern demography's birthday today and have called a tea time. Well done, says I!
During today's dreary walk to work I cooked up a little drinking song in memoriam (for some demography happy hour somewhere):
It's short, I know, but with a barrel voice it could be cool.
[[Edit, Friday Feb 28]]
Ken Wachter informed us that Graunt may not have been the original author of his 'observations', and that the work may have been lifted from Graunt's boyhood friend, William Petty. Hervé le Bras has done some digging and he comes to some conclusions on the matter in this book
(possibly soon to be published in English?).
I probably heard of Benford's law
first on an episode of Radiolab
. Benford's law is an odd empirical regularity about the way numbers appear in the world. It says that if we take the first
digit (1-9) of each number in a large pool of numbers, the number of times that any given digit appears as the first digit should asymptotically follow a particular distribution. 1 occurs the most, and so forth:
If d is a given digit, then log10(1 + 1/d) gives the expected proportion out of all first digits you'd expect digit d to take up (above).
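For concreteness, the expected distribution is one line of R:

```r
# Benford's expected first-digit proportions: P(d) = log10(1 + 1/d)
d <- 1:9
benford <- log10(1 + 1 / d)
names(benford) <- d
round(benford, 3)
# digit 1 leads about 30.1% of the time, digit 9 only about 4.6%
```

The nine proportions sum to exactly 1, since the product of (d+1)/d over d = 1..9 telescopes to 10.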
Question: do demographic data obey this law? Test data: HMD death counts (Deaths_lexis.txt). Here are the results:
I'd say that yes, HMD death counts obey Benford's law. The dashed blue line simply takes the count from each death triangle and counts how many times each first digit appears for males and females (1918722 total numbers as of this writing). I've not looked at how the pattern unfolds by sex, age or over time. We do see here that some HMD populations follow the law more closely than others. This raises the question: is there something fishy about the data for countries whose death counts don't follow this law (look how lumpy some of those lines are)? To check, I measured how much each grey line departs from the red line (half the sum of the absolute deviations from the reference distribution), and ranked the HMD countries. Scores are bounded by 0 (exact match) and 1 (totally different distributions, no overlap), so larger scores mean larger departures from Benford:
To put things in perspective, .12 is not a huge difference between two distributions, so let's just look at the rankings. These ranking results are unexpected to me, and I suspect will be to others. The greatest departures from Benford's law are found in the Scandinavian countries, and the most Benford-obedient populations are in the former Soviet republics and Switzerland. Go figure. It is widely held among demographers that the Nordic countries in general have the best data (and most abundant, longest-collected, etc.), but perhaps this metric is ill-conceived...
Now a flurry of questions: Do we learn something from this departure? Is there any particular reason why the Benford distribution must apply? Are there too few numbers from any given country for this asymptotic property to shine through? Is the discrepancy large enough to merit further digging? Of course, looking at death triangle output from the HMD means that the data have been massaged to some degree prior to running this test (splitting 5-year age groups, stuff like that). However, if we do the same test on abridged death counts (Deaths_5x1) --- those that resemble lowest-common-denominator database inputs --- the overall ranking is very similar. Does an anomalous digit distribution indicate anomalous data, an anomalous population or population process, or an anomalous application of Benford's law?
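The departure score itself is easy to sketch in base R; first_digit() and benford_distance() below are illustrative names, not the ones in the linked script, and the HMD-specific data handling is left out:

```r
# Half the sum of absolute deviations between observed first-digit
# frequencies and Benford's expectation (a.k.a. total variation distance).
benford <- log10(1 + 1 / (1:9))

first_digit <- function(x) {
  x <- x[x > 0]                     # first digits are defined for positive values
  floor(x / 10 ^ floor(log10(x)))   # e.g. 1918722 -> 1, 35 -> 3
}

benford_distance <- function(x) {
  obs <- table(factor(first_digit(x), levels = 1:9))
  obs <- as.numeric(obs) / sum(obs)
  sum(abs(obs - benford)) / 2       # 0 = exact match, 1 = disjoint distributions
}
```

Applying benford_distance() to each country's pooled death counts gives scores like those in the ranking shown earlier.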
R code to completely reproduce this is on github, here
. It uses the DemogBerkeley package (installed from github) to grab necessary data from the web.
Age distributions of 'hedonic wellbeing' (a.k.a. subjective wellbeing; happiness in a superficial, non-Aristotelian sense) have been reported from time to time. Graphical evidence is more appealing to me. I'd never looked at the GSS (General Social Survey) until I saw it cited as the source for earlier happy-curves. Enter AJ Damico's awesome R utilities for social surveys (scripts on github here
, overview and blog here
). Lo and behold, AJ had a helper script for that survey. I couldn't resist.
Here are US male and female happy surfaces, because why not (click to enbiggen):
Some details on where the values for each 5-year AP square come from: there is a question 'General Happiness', coded as 'happy' when loaded into R, which is asked as follows:
157. Taken all together, how would you say things are these days--would you say that you are very happy,
pretty happy, or not too happy?
So we get those categories, plus NAs, 'I don't knows', etc. I, boldly or blithely, gave 'very' a score of 1, 'pretty' a score of .5, and 'not too' a score of 0, ignored the other values, and took the (duly weighted) distribution of these 3 responses within each age-year-sex block. The score is the weighted sum of these three numbers, ranging from 0 to 1, most often between .6 and .8. Visually, we see slight patterns in age, calendar year and cohorts (pre-WWII cohorts very happy in this snapshot; would be worth busting down to triangles). It also would have been practical to simply take the proportion saying 'very' as our measure.
If we're OK with these hypothetical happiness units, one might ask how much happiness would be experienced in an average lifetime, something like a happiness expectancy. To take a stab at that, I've taken the survival-weighted age-specific happiness for each year and summed. We get this as the time trend:
So, females have a higher happiness expectancy too! Actually the gap here is often smaller than the life expectancy gap, but a hypothetical THR (total happiness rate) is too noisy to read a believable
trend from. It remains to be seen whether Venezuela's Ministry of Supreme Social Happiness
takes demography seriously ;-P. Bhutan
probably does, but I haven't looked closely. There are certainly other measures of well-being that might mean more to most social scientists...
To be precise, this is the same as total lifetime happiness in the hypothetical stationary population, and is a synthetic measure that indexes this value for a particular year, but for no particular cohort. We could decompose the time trends or sex differences using old-school Kitagawa decomposition, and we'd see that the upward trend in happiness expectancy (h0) is mostly due to mortality gains.
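A toy sketch of both pieces: the survival-weighted sum (a Sullivan-style calculation) and a Kitagawa-style split of a change in h0 into a mortality part and a happiness part. All numbers, and the names h_expect and decomp_h0, are invented for illustration:

```r
# h0 = sum over age groups of Lx * hx, where Lx = person-years lived
# (radix 1) and hx = average happiness score in [0, 1] for that group.
h_expect <- function(Lx, hx) sum(Lx * hx)

# Kitagawa-style decomposition of a change in h0 into the part due to
# changing survival (Lx) and the part due to changing happiness (hx).
decomp_h0 <- function(Lx1, hx1, Lx2, hx2) {
  c(mortality = sum((Lx2 - Lx1) * (hx1 + hx2) / 2),
    happiness = sum((hx2 - hx1) * (Lx1 + Lx2) / 2))
}

# Two invented year-schedules over 5 broad age groups:
Lx1 <- c(4.9, 4.8, 4.6, 4.0, 2.5); hx1 <- c(0.80, 0.70, 0.65, 0.70, 0.75)
Lx2 <- c(4.95, 4.9, 4.7, 4.3, 3.0); hx2 <- c(0.80, 0.70, 0.66, 0.71, 0.74)

h_expect(Lx2, hx2) - h_expect(Lx1, hx1)  # total change in h0
decomp_h0(Lx1, hx1, Lx2, hx2)            # the two additive pieces
```

The two pieces sum exactly to the total change, and in this made-up example (as in the real trend) the mortality component dominates.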
Ugly details: we don't have info for kids or the elderly, so I assumed kids 0-14 were a constant rather-happy .8, and that people aged 90-110+ were constant at the age-85 level for a given year. The latter choice has little leverage on results, but the former will make a difference.
Full reproducible R script on github here
. It'll download the necessary data. Sorry, you'll need to change local file paths to get it running, as well as potentially install a few packages (code given).
Here's an experimental visualization that will require a bit of explanation, below:
Look first at the top part, mostly in green. The uppermost contour tracks the total birth count in Sweden from 1891 to 2010, according to my most recent download of HFD data. The upper bands in turn indicate mothers' cohorts (in 5-year groups). I've only shaded those 5-year cohorts whose fertility is fully observed (or almost so). The grayed-out cohorts on the left are those whose early-career fertility precedes 1891, while the grayed-out cohorts on the right side are those whose fertility careers are not yet complete. The darker the green, the greater the career fertility of the given 5-year cohort. Colors were determined by simply breaking the cohorts into quantiles, and you can see them matched to the y scale on the lower y axis (since they refer to vertical slice heights on the bottom). In any case, the x-axis for the upper half indicates year of occurrence.
Take a vertical slice from the top -- that's a birth cohort. If you reflect it down to the bottom (blues) you see the career fertility of that very cohort -- the 2nd generation. The lowermost contour tracks total birth counts born to each cohort (whose starting size is seen above), and it is only fully observed for the areas without a gray background. The banded areas in the bottom indicate the 5-calendar-year groups in which the births occurred. That might be a bit to wrap your mind around: take the 1920 birth cohort. You see its starting size on the top and the total births produced by that cohort on the bottom (no account is made for in- or out-migration). Looking at the bottom blue half for the 1920 cohort, we see that it contains several shades of blue -- its births were spread out over several 5-year groups, starting around 1935. However, 1935 was a dearth year, due to the depression, and we see it colored in light yellow on the bottom (two consecutive bands). For reference, the darkest blue shade in the center (around 1920) refers to births between 1945 and 1949. Now we see the relative contribution of birth cohorts to both the depression dearth and the post-war boom, in two different perspectives.
Each boom (on the top) has an echo (on the bottom); literally this is a 1-generation step, before the effects of ergodicity have time to kick in. This is an effect of population structure and not necessarily of rates. One could recreate the above figure with rates instead of counts and likely tell a different story. I'll leave that for another day. Note the boom around 1990: it'd take a huge rate swing not to expect an echo boom from those cohorts -- that's a no-brainer.
* the same visualization would work for deaths ;-), in either case, with rates and counts, according to your whim.
** my apologies if this explanation isn't efficient!
*** this design is indeed inspired by the stream-graph
, but I had to make some design deviations from the popular design used in the NYT
1) I kept stacking order strictly chronological
a) the figure is about time and quantum equally and
b) the bands are not categorically different but rather a discretization of an imaginably continuous flow.
2) I removed the meander because *two* flows are plotted, not one. For this reason, neither flow crosses the x axis.
[edited Jan 27, 2014]
I'll be going to PAA this year in Boston, and the EPC in Budapest, with 3 papers in the mix:
3) Decomposing and recomposing the population pyramid by remaining years of life (with J. Spijker and J. MacInnes) [PAA Poster Session 6, EPC Poster session 1]
I will link to the papers when final uploads are made, although a motivated individual could find them on the respective conference websites.
As of last Wednesday I'm now officially a Dr. of Demography! I've posted the completed dissertation here
(already corrected for Errata and typos that I'm aware of). There are a couple known problems in it that I've not yet attended to, but I'll just replace the file in Google Drive when the beast gets updated. Most of the presentation got filmed, but the files are really big, so I need to figure out how to make them a manageable size before posting here. In lieu of that, here's a fun animation (it's a 13Mb file, might take a while to load, if at all- see attached gif at footer):
a section of this made it into the defense presentation. This figure is 1975 US using HMD data, in case anyone is curious.
It's been half a year since I've posted anything. The gap is explained by my having put 100% of my free energy into finishing my dissertation, which is now written, but not yet defended. I thank the HMD for even giving me a huge 3 month block of time to work (almost) exclusively on the dissertation (many more thanks are found in the acknowledgments, of course). In Spain, defenses are big public events- in fact anyone can go. So, I thought why not announce it here!
The defense will take place at:
After the defense I'll post the pdf here somewhere after accounting for errata that the tribunal (yes, it's called a tribunal here!) brings to my attention, and I'll post instructions for getting the code on github. Likely the next several posts on this page will be dissertation-related. Woot!
Sala d’Actes B7/1056 inside the Facultat de Ciències Socials i Lletres (Edifici B) of the UAB main campus (Bellaterra), at 11:00 am this June 26th (in about 3 weeks from this writing).
If you come to see it, be aware that my presentation is not to the tribunal but rather to the audience, and so is retooled to be accessible. That lasts about 45 minutes. Then the three tribunal members, Dr. Julio Pérez Díaz, Dr. Iñaki Permanyer and Dr. Trifon Missov, will each speak, basically for as long as they like, bringing the likely total time commitment to something like 3 hours. Two supplemental jury members are on standby, Dr. Daniel Devolder and Dr. Jeroen Spijker. The two external evaluators (to qualify for a European Mention) were Dr. Vladimir Canudas Romo (currently changing institutions) and Dr. John MacInnes.
I'm not too worried about sneaky thieves getting at it before I get a chance to publish from it, since it's all in the public domain (as is the git-tracked entire process of writing it), and it has passed under a number of eyes.
a short abstract:
The two-sex problem in populations structured by remaining years of life.
by Timothy L. M. Riffe
One of the foremost problems in formal demography has been including information on the vital rates from both males and females in models of
population renewal and growth, the so-called two-sex problem. The two-sex problem may be conceived as a subset of the analytical problems entailed by multigroup population modeling. This dissertation characterizes the two-sex problem by means of decomposing the vital rate components to the sex-gap between the male and female single-sex stable growth rates. A suite of two-sex methods for age-structured models from the literature are presented in a standard reproducible format. A new variety of age-structure, age based on remaining years of life, is presented. Analogous models of population growth for the single-sex and two-sex cases are developed for populations structured by remaining years of life. It is found that populations structured by remaining years of life produce less sex-divergence than age-structured models, thereby reducing some of the trade-offs inherent in two-sex modeling decisions. In general, populations structured by remaining years are found to be more stable over time and closer to their ultimate model stable structures than age-structured populations. Models of population growth based on remaining-years structure are found to diverge from like-designed age-structured models. This divergence is characterized in terms of the two-sex problem and we call it the two-age problem.
and a teaser graphic (feel free to take a guess at what exactly this is showing)
Today I'm going to share one of my oft-used Lexis surface markup functions, used below to make super geeky graph paper. All it does is draw Lexis reference lines on top of an already-open plot, much as grid() would do, but with the cohort diagonals added. This function gets used in several of the new HMD internal diagnostic plots, plus I use it a ton.
So let's say you had an image() plot of an age-by-year matrix, using proper age and year coordinates- then you could just mark it up using LexisRefN(ages, years, N = 5), for instance, to get a 5x5 grid over it. The ... argument passes further arguments on to segments(), such as lty, col, etc.
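For anyone who doesn't want to fetch the gist, here's a minimal sketch of such a function; the real LexisRefN has more options, so treat the argument names and defaults here as my guesses:

```r
# Draw an N x N Lexis reference grid (horizontals, verticals and cohort
# diagonals) over an already-open age-by-year plot. Extra arguments in
# ... go to segments() (lty, col, lwd, ...).
LexisRefN <- function(ages, years, N = 5, ...) {
  a0 <- min(ages); a1 <- max(ages)
  y0 <- min(years); y1 <- max(years)
  # horizontal age lines and vertical year lines every N units:
  segments(y0, seq(a0, a1, by = N), y1, seq(a0, a1, by = N), ...)
  segments(seq(y0, y1, by = N), a0, seq(y0, y1, by = N), a1, ...)
  # cohort diagonals: for birth year s, age = year - s, clipped to the window
  for (s in seq(y0 - (a1 - a0), y1, by = N)) {
    x <- c(max(s + a0, y0), min(s + a1, y1))
    if (x[1] < x[2]) segments(x[1], x[1] - s, x[2], x[2] - s, ...)
  }
  invisible(NULL)
}
```

Usage would look like LexisRefN(0:100, 1900:2000, N = 5, col = "white") over an open image() plot.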
So some time ago I got inspired to print out some notepads of Lexis graph paper. The HMD team burns through these things... See the final copy that I used attached at the bottom of this post, for 8.5 by 11 inch paper, meant to be cut in half for 5.5 by 8.5 inch pads, of which I have made an excessively large number- if you're a demographer and I know you and we'll see each other in the next year or so, then I've got one for you! (I still don't know if I'll make it to the PAA in 2013.) If I won't see you and you want some, let me know, or else use the proof below to make some up. If you do it yourself, here's my advice: have a print shop do it in the off-season and price places out... The lines are light-weight- it needs to be printed at a minimum of 600 dpi, max quality, to turn out legible and smooth, and you may need to play with printer settings. You'll need to tinker to change the grid size, line weights, or adapt it to A4 paper. Let them recommend the paper and backing. The ones I did are just glued on the edge, so pages tear right off. The print shop will try to get you to go for 50 pages per notepad, but they bind just fine with 30.
Here's the code I used for the attached page; modify at will:
Happy New Year!
Hot off the press here. I find myself wasting lines of code to switch matrices back and forth from Age-Period to Age-Cohort (rows, columns), so I finally brought myself to write some helper functions:
And its buddy:
To be clear, these are thought out for matrices containing either upper or lower triangle values, not squares or parallelograms! It's niftier to work with triangles cuz they can be summed in any which way. Lexis can either be a matrix attribute containing the value 1 or 2, or else specified as 1 or 2, where 1 means lower triangle (i.e. the Dec 31st population) and 2 means upper triangle (i.e. the Jan 1st population). If there is no attribute and nothing is specified, it defaults to 2. If this is in error, the column names will be off by 1, FYI. Also, if going from AP to AC, the unobserved ages from incomplete(ly observed) cohorts are filled in with NAs.
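To make the bookkeeping concrete, here's a stripped-down sketch of the AP-to-AC reindexing (the real functions also read the Lexis attribute and handle more edge cases; this version just takes Lexis as an argument):

```r
# Move an Age x Period matrix of triangle values to Age x Cohort layout.
# Lexis = 2 (upper triangle): cohort = year - age - 1;
# Lexis = 1 (lower triangle): cohort = year - age.
# Unobserved cells of incomplete cohorts come out as NA.
AP2AC <- function(APmat, Lexis = 2) {
  stopifnot(Lexis %in% 1:2)
  ages   <- as.integer(rownames(APmat))
  years  <- as.integer(colnames(APmat))
  offset <- if (Lexis == 2) 1L else 0L
  cohorts <- (min(years) - max(ages) - offset):(max(years) - min(ages) - offset)
  ACmat <- matrix(NA_real_, length(ages), length(cohorts),
                  dimnames = list(ages, cohorts))
  for (j in seq_along(years)) {
    coh <- as.character(years[j] - ages - offset)
    ACmat[cbind(seq_along(ages), match(coh, colnames(ACmat)))] <- APmat[, j]
  }
  ACmat
}
```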
I wouldn't qualify any of the kind of data I work with as 'big data', but it could easily get there. Even now, some files, like NCHS fixed width files for death and birth microdata (here
), or many-Gb IPUMSi
files can outstrip the memory on my ageing laptop. I've usually used the read.fwf()
function for this, then end up saving the resulting data.frame
as R binary files for easier reading later. That's still probably a decent way to go, but is in any case memory intensive. It's also unnecessary if all you're going to need is a custom table. The trick is to use sql from R, and I learned this from J D Long's answer to this question
on StackOverflow. It took some fiddling on my part to figure out how to make it work with fixed-width files. Here goes:
My example data are 2010 deaths
from the US NCHS. In the zipped file there's a fixed width plain text file that's about 1.2 Gb as such, but gets much bigger when read into R as a data.frame. read.fwf() on this file could take 1.5 hrs+ for me, and I doubt the process would even finish because I'd hit my 4Gb RAM ceiling. Let's say all I want is a cross table of some variables in there and I'm stubborn about getting this from R:
Here, let's say I want deaths by age, sex, month and the 39-cause recode. Look in the manual and find the start positions of these variables and their widths. We'll use the R package sqldf, which externalizes the data-reading and table-making to SQL and simply returns the results to R as a data.frame. SQL does its stuff out of memory, apparently, so you're no longer held back by system RAM, and it's much faster:
Now this is something I can work with! [And many thanks to Gabor Grothendieck
for the improvements!]
Let's break down what's happening in the sqldf() call, as it looks rather foreign to monolingual R users, such as myself. As you can see, we pass the SQL commands as a single text string.
- This string starts with select
- The next elements, separated by commas, are the columns to include in the result.
- The first, 'substr(V1, 160, 2) Cause', selects a variable 2 characters wide starting at position 160 in the flat file, and gives it the name Cause.
- That's the form that the next few columns take too, until you get to Deaths, which is wrapped in a count() function. This is the part that actually does the tabulation- it doesn't really matter what variable you put in there, as far as I can tell, since the values don't matter- I think it just counts cases. In this file one line is one case, so it's enough to just count lines of something, so I put in the substr() statement from one of the other variables. There's likely a better SQL-er way to do it, but this works for me for now.
- The from f part tells R the name of the file connection used,
- group by is the table specification- Here it's really redundant, because we don't have column names to work with, but these are the same file positions of the grouping factors that end up in the columns.
This all looks much cleaner and parsimonious if you happen to be working with files that use a separator and have a column header!
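If the SQL feels too foreign, the substr() logic is easy to sanity-check in base R on a tiny mock file; the field positions below are invented for the example, not the real NCHS layout:

```r
# Mirror the sqldf tabulation in base R on a tiny mock fixed-width file.
# Invented layout: cause recode in chars 1-2, sex in char 3.
mock <- tempfile()
writeLines(c("05M", "11F", "05M", "12F"), mock)

V1    <- readLines(mock)      # each line is one death record
Cause <- substr(V1, 1, 2)     # like "substr(V1, 1, 2) Cause" in the SQL
Sex   <- substr(V1, 3, 3)
tab   <- as.data.frame(table(Cause, Sex), responseName = "Deaths")
tab[tab$Deaths > 0, ]         # the group-by counts, sqldf-style
```

Fine for toy checks, but it reads the whole file into memory, which is exactly what the sqldf route avoids.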