Demog Blog
My Demography meets Art submissions
So we have a bunch of frames hanging in the institute presently containing reproductions of some really awesome historical choropleth maps, like this one: These were hung because there used to be a historical demography group here. They've been left hanging well after the dissolution of that group because they're awesome. They still are. But we want the art here to reflect the present labs a little more (labor, fertility & wellbeing, health: sounds like the ingredients of a good toast, right?). So a call has been put out to staff to propose art. I sent in 3 things; this post describes the process of making one of them.

The point of departure for this piece was a Jan 2014 blog post here, "Birth flow with cohort reflection". Here's that original: It's a fun thing to look at. The colors are straight Brewer palettes matched to size bins of the polygons on either side. Darker is bigger. Yeah. All the labeling was done in native R. That was a big old hunk of code, I can tell ya. To take part in a (fab) data viz course by Jonas Schöley last summer I had to propose a project, so I made a touchup of that piece my project. This mainly included 3 new things.
The first subplot explains the meander. The second explains the top vs bottom, and the third is a cross-section slice. Otherwise, the main image is the same as before, just longer and with a meander. I wasn't satisfied. It was still hard to explain, and I managed to confuse myself explaining it every time. Also there is a problem in that the bottom half of the graduated early part of the series is smooth, where we would expect first differences to reflect down from the top (as in the right side of the series, where counts are directly observed in single age bins, respecting natural fluctuations). I've still not resolved this, but I think that solving it (here an aesthetic issue) might be a real contribution if these data are destined to enter the HFD at some point (which I understand to be the case). Mainly I wanted to remove the need for side graphs to explain it, as they suck attention from the centerpiece and don't help much anyway. I've opted to annotate a particular lineage found online in the public domain by Christian Helinski (2nd pic in that report!). I wanted a 5-generation female lineage to superimpose. I figured there's no problem since the lineage is extinguished, and only first names are used. Then the other main problem to solve was the colors. These are only intended to give a rough idea of relative bin size. They don't need to be perfectly perceptually balanced to convey this. I think the green-blue palettes are pretty, but I wanted some more flair. So I used the paletter R package to extract palettes from paintings and photographs. The main function in that package lets you choose between categorical and sequential palettes, but I couldn't get a usable sequential palette, possibly because it was getting stuck in single hue bands, not sure. So I went with categorical and decided to attempt to perturb colors toward an even ramp. I decided hue doesn't matter, but darkness and saturation do.
So I settled on some pictures, extracted palettes, then post-processed colors, optimizing on the saturation channel in HSV (the colorspace package) and on the inferred grayness estimated by the to.grey() function of spatstat. Here is a better idea of that:
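For a concrete idea of the post-processing step, here's a minimal sketch. The palette extraction itself would come from paletter; everything below (the helper name, the target ramps, the luminance weights standing in for to.grey()) is made up for illustration, not the code actually used for the piece:

```r
# Sketch: take a categorical palette (imagine it came from paletter) and
# nudge it toward an even ramp in darkness and saturation, keeping hue.
even_ramp <- function(pal) {
  n       <- length(pal)
  hsv_mat <- rgb2hsv(col2rgb(pal))                 # rows "h", "s", "v"
  # crude perceived-grayness proxy: luminance weights, similar in spirit
  # to what spatstat's to.grey() does
  gray <- as.vector(c(0.299, 0.587, 0.114) %*% col2rgb(pal)) / 255
  ord  <- order(gray, decreasing = TRUE)           # light to dark
  hsv_mat <- hsv_mat[, ord, drop = FALSE]
  # impose even target ramps on saturation and value; hue stays put
  target_s <- seq(0.2, 1, length.out = n)
  target_v <- seq(0.95, 0.25, length.out = n)
  hsv(h = hsv_mat["h", ], s = target_s, v = target_v)
}

pal     <- c("#D7E2C4", "#8FBF9F", "#5E8C61", "#335C3B", "#11291B")  # made-up input
new_pal <- even_ramp(pal)
new_pal
```

The result is still a categorical palette hue-wise, but it reads as a rough sequential ramp when mapped to size bins.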
So this is semi-final. Still need to think on the title. And man, it'd be great to solve the over-smoothness of the bottom part of the early series, but not going to happen in time. A high res version of this might be hanging sometime in the MPI, we'll see. This one is befittingly sized to fit in a frame presently holding a Sweden map! [[Edit: Dec 13, 2017: I am presently trying to figure out how to 'adjust' earlier cohort fertility (bottom left) to account for size fluctuations in mother cohorts (top axis). May or may not succeed]] [[Edit: Dec 15, 2017: got a decent solution]] Got it! Here's the new figure: (click for 100 dpi) Data for years 1775-1890 have now been perturbed to account for cohort size. This is essentially a step 2 to the count graduation I had done for earlier years. I turned it into an optimization problem, with the following steps: 1) graduate to single ages using pclm. This is constrained in age groups already. 2) for years 1891-1959 we can already compare first differences in mother and daughter cohort sizes, and relative first differences of mother and daughter cohort sizes have a near 1:1 relationship (slope of a linear model) and a high correlation (0.93). I can perturb the offspring (lower) cohort series to approach the relative first differences observed in the top cohort sizes by creating a multiplier, discounted somehow. It's the discount that I optimize; the result turns out to be .44, applied like so: C * (1 - rdt * .44), where rdt is the relative first difference in the mother birth cohort (top series). Then things are reconstrained to match 5-year age groups in the original data. That's a crappy explanation. The bottom line is there are lots of knowns, and I'm minimizing on just the linear slope and correlation coefficients between the top and perturbed bottom series, where the only moving part is the coefficient that turns out to be .44. 3) regenerate and touch up in Inkscape.
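The optimization in step 2 can be sketched roughly like so, on toy data. The series, the smoothing, and the exact objective are all stand-ins; only the perturbation formula and the idea of minimizing on slope and correlation come from the post:

```r
# Toy stand-ins for the real series:
set.seed(1)
top    <- 1000 + cumsum(rnorm(60, 0, 25))   # mother birth cohort sizes
rdt    <- c(0, diff(top) / head(top, -1))   # relative first differences (top)
bottom <- lowess(top, f = 1/3)$y * 0.5      # over-smooth offspring series

# Perturb the smooth bottom series by C * (1 - rdt * k), then score how
# well its relative first differences track the top series (slope near 1,
# correlation high). k is the only moving part.
objective <- function(k) {
  adj     <- bottom * (1 - rdt * k)
  rdt_adj <- c(0, diff(adj) / head(adj, -1))
  slope   <- coef(lm(rdt_adj ~ rdt))[2]
  (slope - 1)^2 + (1 - cor(rdt_adj, rdt))^2
}
k_hat <- optimize(objective, interval = c(-1, 1))$minimum
k_hat
```

In the real application a final step reconstrains the perturbed counts to match the original 5-year age groups; that bookkeeping is omitted here.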
Now we have a similar degree of reverberation between top and bottom series. Again, compare the bottom for years 1775-1890. I'm pleased with the result. The method is too hackish to make it into the methods protocol for the HFD, but my hunch is that these results are a closer approximation to reality than a standard rate graduation without single-age offsets. *Methodological speculation: You could in principle daisy-chain this perturbation back to exposures, since we have an implied smooth rate and an adjusted count series: count / rate = exposure. Often, in the absence of single-age observations, mortality numerators and denominators are both graduated to be smooth in one way or another. However, for populations with long preceding birth series, there is a source of information on relative cohort size that ought to follow the cohort fairly well over time. It's maybe less problematic to trust such an adjustment through the reproductive span than it is through the whole life span. ** No, I didn't validate this adjustment on the years 1891-1959, where this would certainly be possible: group to 5-year ages, pclm-graduate, then adjust like the earlier part. How well would that match? Not sure yet. Maybe I'll check it out one day.
the joy of fertility
So I've been helping co-teach an R for demographers workshop back at the CED in Barcelona, and one of the sessions I was responsible for was about base plotting. I gave as an exercise to make a stacked plot similar to the Joy Division Unknown Pleasures album cover, but made of fertility curves from a data object I gave them. There are many such dataviz projects out there, like this, this, and this, just to list some that come to mind. This way they'd get some experience with using the primitive plotting element polygon(), would have to write a function in order to draw one for each fertility curve, and would need to set up some sort of iteration in order to do so. So, three R concepts covered in this exercise: plot device management, functions, and iteration. Not bad, says I. Yes, it's a lot for beginners, but that's why the exercises are done with the instructors present. The point isn't necessarily to get the final polished product, but to learn via trying rather than replicating code that 'just works'. So I hope it succeeded somehow there. Anyway, later on I got a more comprehensive set of fertility curves from the Human Fertility Collection, regenerated, then marked it up in Inkscape (for better labeling). Each fertility curve here is drawn at a spacing of .06 units of TFR; ergo the spacing is on scale with the data, and you can treat the baselines as grid lines in this funny way. Really there's no sense trying to read values out of it. Curves for the different countries are sorted by TFR, with the lowest TFR at the top and the highest (in this set) at the bottom. You see what a variety of shapes fertility can take under similar values of TFR. I suppose that's what the plot is good at showing. Some curves are of similar shape and size, but at different locations (ages). Some are of different shape, but the same size. And so forth. Anyway, here's a vector pdf version of the plot: And the R code to produce this (prior to Inkscape markup) can be found here.
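Here's a minimal version of the exercise, using made-up gamma-shaped fertility curves in place of the HFC data; the function and parameter choices are illustrative only:

```r
# Toy ASFR curves: gamma-shaped, scaled so each sums to its TFR
ages <- 12:55
asfr_curve <- function(tfr, mode, spread) {
  shape <- (mode - 12) / spread + 1          # puts the gamma mode at `mode`
  d <- dgamma(ages - 12, shape = shape, scale = spread)
  tfr * d / sum(d)                           # single-year bins sum to TFR
}
set.seed(2)
tfrs   <- seq(1.2, 3.0, length.out = 15)     # sorted, lowest TFR first
curves <- lapply(tfrs, function(tfr)
  asfr_curve(tfr, mode = runif(1, 24, 32), spread = runif(1, 3, 6)))

# One polygon per curve, baselines spaced .06 TFR units apart,
# lowest TFR on top; later (lower) polygons occlude earlier ones.
spacing <- 0.06
n <- length(curves)
plot(NULL, xlim = range(ages), ylim = c(0, n * spacing + 0.3),
     xlab = "Age", ylab = "", axes = FALSE)
for (i in seq_along(curves)) {
  base <- (n - i) * spacing
  polygon(c(ages, rev(ages)),
          c(base + curves[[i]], rep(base, length(ages))),
          col = "white", border = "black")
}
axis(1)
```

The exercise's three concepts are all in there: an open device to manage, a curve-drawing function, and a loop to iterate it.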
And here's an update, only a short while later, due to suggestions received on Twitter from Carl Schmertmann, Alison Taylor, Nikola Sander, and Ramon Bauer. Thanks all! 
a higher order time identity
Most demographers are familiar with the age-period-cohort identity. 'Identity' might not be the first thing that comes to mind when we think of APC. Instead we think of the Lexis diagram, or else of the identification problem, which then leads back to the fact that these three measures form an identity. Any two of them will do, and you'll get the third via implication. Well, you can go bigger than that. Say we had period, cohort, start of employment, and retirement. By taking the pairwise differences between these four events, we'd end up with 6 further durations. Just as APC relate in a simple graph, in the form of a triangle, these 4+6 time measures relate in the form of a denser graph with 5 vertices and 10 (4+6) edges (think of the vertices as a time origin plus the four events, and the edges as the 10 time measures connecting them). It's a complete graph, meaning all vertices are connected directly. Via Cayley's formula there are 125 ways to select 4 edges and still touch all vertices. And coming back to the time identity, there are therefore 125 ways to start with 4 time measures (of these 10) and end up generating the remaining 6 (i.e. such that the remaining 6 are implied). Here's a mini poster showing them all: I think I'd try sorting these somehow, but need to figure out how. For now these are unordered 'solutions'.
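The 125 claim is easy to check by brute force: enumerate every way to choose 4 of the 10 edges of K5 and count the subsets that connect all 5 vertices. A quick sketch (not from the original post):

```r
edges <- t(combn(5, 2))             # the 10 edges of the complete graph K5

# 4 edges on 5 vertices form a spanning tree iff they are acyclic;
# test that with a tiny union-find
is_tree <- function(idx) {
  e <- edges[idx, , drop = FALSE]
  parent <- 1:5
  find <- function(x) { while (parent[x] != x) x <- parent[x]; x }
  for (i in seq_len(nrow(e))) {
    a <- find(e[i, 1]); b <- find(e[i, 2])
    if (a == b) return(FALSE)       # adding this edge would close a cycle
    parent[a] <- b
  }
  TRUE
}

n_trees <- sum(combn(10, 4, is_tree))
n_trees                             # 125, matching Cayley's 5^(5-2)
```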
Demography, Uruguay, Futbol
So Victoria Prieto kindly invited me to give a lecture via Skype (because the class is in Uruguay) on applications of the Lexis diagram. And I must say that Skype quality just stinks for this kind of thing. They were able to get clear audio of me IFF we made it so that I was just doing screen share and their audio connection was turned off. We made it work somehow, but got lost in a sea of white noise during the Q&A a couple times. Victoria asked me to try to make them love Lexis. So, aside from trying to make a real case on the utility of the Lexis diagram, I also found some atypical applications here and there, and made my own silly (but fun!) example of lifelines in the Lexis diagram, based on futbol (I have to write it like that cuz football means US football to me, no matter how long I live in the EU) players on the Uruguay national team in all of Uruguay's appearances in World Cup and Copa America. Uruguay, you ask? It's here:
(By Connormah, own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=6913529) Uruguay punches (or bites? mwahaha) above their weight class in the realm of futbol. It's a mostly urban country, and they have a demography master's program, and a PhD specialization in pop studies. And Victoria teaches there, ergo the Lexis lecture, and that's why the Skype hassle. OK, now back to the point of this post:
I just wanted to show that you can represent any population of durations on the Lexis diagram (or its analogues).
Here are the plots I showed them, without first telling them what it was. I asked them to guess the data:
1) guess what this is showing.
It's right skewed, between ages 15 and 40: possibly fertility?! It does look that way, doesn't it? This turns out to be the aggregate of all ages in all teams in all cups, standardized in 2.5-year-width age groups. Meh. I didn't reveal the subject here yet.
2) Maybe this would tip them off? What if I just say that the data in the above histogram came from the points in this Lexis diagram? (click to embiggen)
This image still didn't settle it. I guess x ticks every 20 years kind of obfuscate recognizing the particular years of the cups. Like, 1930 and 1950, which I think Uruguay won, but there aren't any ticks there to indicate them. So then came the reveal, but since it was a one-way connection I'm not sure if it tanked or if they thought it was cool. But they did ask for a blog post on it, with the code to get the data, etc. The truth is you could augment these data in so many ways. I think sports data is ripe for demographic analysis, though one might need to play with the definition of age, like this:
Same data, but the lifelines only connect the first and last cups played by each player. The lifelines are of course still aligned to age 0 (birth), but you could certainly realign them to make the first cup age zero, or the last cup age omega... Further, you could augment these to make the points games rather than simply team membership, add other cups, regular season play, and the pre-data territory: youth teams. The Lexis diagram isn't the limiting factor here, but the more detail that is added, the less useful lifelines become, and the more likely you are to gain insight from some kind of aggregation (perhaps on a metric rather than on team membership, and aligned to one or another definition of age). Then patterns will emerge that are otherwise invisible. Woot.
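A minimal sketch of such first-to-last-cup lifelines, with made-up players standing in for the scraped rosters:

```r
# Toy roster: birth year plus first and last cup years per player
players <- data.frame(
  name  = c("A", "B", "C", "D"),        # hypothetical players
  birth = c(1976, 1987, 1990, 1995),
  first = c(1999, 2007, 2011, 2015),    # year of first cup
  last  = c(2011, 2018, 2022, 2022)     # year of last cup
)

# Lexis diagram: year on x, age on y; each segment has slope 1 because
# age and year advance together along a lifeline
plot(NULL, xlim = c(1995, 2025), ylim = c(0, 50),
     xlab = "Year", ylab = "Age", asp = 1)
with(players,
     segments(first, first - birth, last, last - birth, lwd = 2))
with(players,
     points(c(first, last), c(first - birth, last - birth), pch = 16))
```

Realigning to "first cup = age 0" would just mean subtracting `first` instead of `birth` on the y coordinates.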
I hypothesized that you could use the Lexis diagram to guess at the heroes of any given team. Whaaa? Take the lifelines of the most recent team, and follow the lines back in time (downward left) to sometime between ages 5 and 15, assuming that that's when futbol impressions are made in the most lasting way, then look up to see which players appeared in those cups. That's the population of likely homebrew heroes. Not entirely novel: any player will just tell you their childhood references anyway.
But possibly more interesting than this subject matter was the mungery required to get these data. The data were scraped from Wikipedia entries, like this one. And indeed, using the code annotated here:
Feel free to use it, expand it, make cool plots, and share them.
and here's the full presentation:
Thanks again Vicky!

Reprojections of the Lexis surface
If we think of age and period as lat-long coordinates, then we can also borrow the analogy of map projections. There is a large but finite number of ways to project the Lexis surface as it is (rotations, reflections, isotropic or not, other kinds of time measures, etc.), but there are an infinite number of ways to reproject any of the time measures used as axes, thereby reprojecting whatever data are being shown on the surface. I'll use the example of an age pattern. This is very much in line with the previous post, where age itself was reprojected, and it's a basic extension of the Sanderson-Scherbov method of standardizing age patterns by distorting age in one of them. Let's start with the following standard Lexis surface of fertility. Behold 120+ years of Sweden: (HFD) There are plenty of stories one can tell here. Now let's use the trick from the previous post, and transform the y-axis. The trick to transforming the y-axis is that the y-coordinate for each x-coordinate must remain comparable with respect to the function that you use to transform. I'll take the example of survival quantiles, because for each year we have a period survival function, l(x) (HMD), which tells us the proportion surviving at each age x. If you invert this, then it tells you the age at which a given proportion of the synthetic population remains alive. Ergo, the ages that correspond to quantiles. These change over time, as mortality jumps about and/or gradually improves. We want the quantiles themselves as the new y-axis (last post it was remaining life expectancy, but now survival quantiles, just because). So, 1) for a given quantile and year, find the exact age that it corresponds to, 2) find the fertility rate that corresponds to that exact age (I use splines for both steps), 3) wash, rinse, repeat over the whole surface. Voila: Same data! OK, we know the story: reproductive ages are now almost completely survived through. But it wasn't always the case!
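Steps 1) through 3) can be sketched like this on toy data; the survival and fertility schedules below are invented stand-ins for the HMD and HFD inputs:

```r
years <- 1900:1910
ages  <- 0:110
# toy l(x): Gompertz-ish survival, improving slightly each year
lx <- sapply(seq_along(years), function(i)
  exp(-(5e-5 * (1 - 0.02 * i) / 0.09) * (exp(0.09 * ages) - 1)))
# toy ASFR schedule, held constant over years for simplicity
asfr <- sapply(seq_along(years), function(i)
  ifelse(ages >= 12 & ages <= 55, dnorm(ages, 28, 5) * 0.4, 0))

quantiles <- seq(0.995, 0.600, by = -0.005)   # the new y-axis
reproj <- sapply(seq_along(years), function(i) {
  # 1) exact age at each survival quantile (monotone spline inversion)
  age_at_q <- splinefun(rev(lx[, i]), rev(ages), method = "hyman")(quantiles)
  # 2) fertility rate at that exact age
  splinefun(ages, asfr[, i])(age_at_q)
})
# 3) reproj is now a quantile-by-year surface for image() or contour()
dim(reproj)
```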
But that's not the point of this post. The point is the idea of reprojecting one or more axes of the Lexis surface, and its variants. All you need are two time series of data that are structured on the same age-like index (lifespan, time-to-death, time since X, time until X, etc.). Here I chose lifespan quantiles, but it could have easily been another (preferably monotonic, but not necessarily so) lifetable function (or a function of a lifetable function!). Nor must it be a lifetable function: in this case we could have used ANY age pattern to reproject. You could use fertility to reproject any aspect of mortality, for instance (but there might be more details to sort out). Maybe I'll post another example sometime. I presented this idea to some graphics pros yesterday at the Fraunhofer Institute in Rostock, and you can find the code in a repo for that presentation.
A prospective age Lexis surface? Convoluted Spielen
One could simply plot remaining life expectancy on a Lexis surface. That'd be the smart thing to do. Not what I'm going to do here, which is Sunday goofing off. First let's swap out age with prospective age. A prospective age is the age where you hit some fixed level of remaining life expectancy. Since mortality changes depending on where and when you are, prospective ages change too. This is a concept that Warren Sanderson and Sergei Scherbov use a lot. Here's the main go-to paper that explains it all in a compelling way. It's the basis of demographers starting to say things like "60 is the new 50", etc. People had been saying that kind of thing already, and noticing active aged people more and more, but the lifetable gives a nice objective basis for sayings like that. The idea is to take some fixed remaining life expectancy, like 40 or whatever, and ask which age it belongs to. That usually requires some interpolation. So here's a calculator to do it (as well as a survival quantile calculator), and the image I want to explain (it's US females, by the way, HMD, as usual). So, we have remaining life expectancy on the y-axis, and calendar year on the x-axis. The z-coordinate that is plotted is chronological age, which is represented both with color (darker blue = higher age) and with labelled black contour lines. All the contours are increasing, which just means that remaining life expectancy is increasing at all ages. Where the contours are steeper it increased faster. There is a light 5x5 grid for e(x) and year. Following any of the horizontals will tell you the age at which the given remaining life expectancy was hit. So, for example, follow the e(x) = 25 line. Around 1945 it hits age 50, and in the most recent year it hits 60. Ergo, 60-year-olds today have the remaining life expectancy of 50-year-olds in 1945, hence "60 is the new 50".
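A minimal version of such a prospective-age lookup might look like this; the lifetable below is a toy schedule, and the function name is made up:

```r
# Find the exact age at which e(x) equals some target, by monotone
# spline inversion of the (here, strictly decreasing) e(x) schedule
prospective_age <- function(Age, ex, target_ex) {
  splinefun(rev(ex), rev(Age), method = "hyman")(target_ex)
}

# toy remaining life expectancy: declines smoothly from 80 at age 0
Age <- 0:100
ex  <- 80 * (1 - Age / 105)^0.9

a50 <- prospective_age(Age, ex, target_ex = 25)
a50   # the age at which 25 years of remaining life expectancy are left
```

Repeating this for every year of an HMD series, for a grid of e(x) targets, gives exactly the z-coordinate plotted on the surface.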
It's awesome to find quick increases, but this plot doesn't have many (many other populations of the world have had more dramatic gains than the USA, by the way). Finally, the yellow descending lines are birth cohorts, and their wavy pattern shows how we've not only flipped, but also irregularly distorted the Lexis diagram. These allow for more comparisons. Of course the surface uses period mortality, so there's another level of distortion in it that simply isn't revealed. One could use cohort data for either a restricted range of e(x) and/or age, or else switch to a country like Sweden, which has a long series of data and lets you do pretty much any formal demog you want. Here's the code, including a survival quantile lookup function, also based on a spline. You could also repeat the same Lexis-ish transformation with survival quantiles on the y-axis... And you could plot e(x) with one contour plot, and age with another contour plot above that. Oh dear, the possibilities are endless. The code is posted in a github gist, here. Sorry, gists can appear in posts directly using gadgets, but sometimes gadgets go missing...
Decomposing the population pyramid à la Vaupel & Yashin (1987)
I'm a fan of Vaupel & Yashin (1987), "Repeated resuscitation: How lifesaving alters lifetables" (this paper, possibly paywalled). Here is a JSTOR link to the same article. Under some strict assumptions, they derive how hypothetical improvements in mortality rates can be translated to saved lives, and how the process can be repeated indefinitely. Among the items derived is the composition of the stationary population by the number of times individuals have been saved. In the article, they decompose period l(x) schedules by making period comparisons. One thing that I'm always messing around with is the idea that cohort longevity has typically (always?) been better than that belonging to the period lifetables in which cohorts are born. For example, I was born in 1981, with a period e(0) of 70.81. Ya right! My cohort is going to outperform that by a mile! One could already say that the force of mortality that I've been winning against so far my whole life is a perturbation of that which was observed in the USA in 1981. In this way, the 1981 period mortality schedule is the one I'm comparing to. I'm now 34 (I think...), so I have 35 single-age values of cohort m(x) of my own that can be compared with the period m(x) from ages 0 to 34 in 1981. Make sense?
The difference in these two series of m(x) can translate into the uppercase Lambda from equation 10 and onwards in the above-linked paper. I can also get l(x) from these period and cohort m(x) series for the sake of comparison, and this gives everything needed for eq 10 in the paper. Ergo, I can decompose my cohort into those whose lives have been saved 0, 1, 2, ... times due to improvements in mortality since we were born. And likewise for all the other cohorts passing through the population pyramid this year.
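Here's a heavily hedged sketch of that decomposition step. This is my reading of how eq. 10 could be instrumentalized (treating the number of times saved among current survivors as Poisson with mean Lambda(x)); the function name is made up and the schedules are toys, not the HMD series used for the figure:

```r
decompose_saved <- function(mx_period, mx_cohort, k_max = 5) {
  # cumulative mortality improvement since birth (the capital Lambda)
  Lambda <- cumsum(pmax(mx_period - mx_cohort, 0))
  lx     <- exp(-cumsum(mx_cohort))   # cohort survivorship to current age
  x      <- length(mx_cohort)
  # proportion of the cohort alive today and saved k = 0, 1, ... times
  sapply(0:k_max, function(k) lx[x] * dpois(k, Lambda[x]))
}

ages      <- 0:34
mx_period <- 0.001 * exp(0.08 * ages)   # toy 1981 period schedule
mx_cohort <- 0.8 * mx_period            # toy cohort schedule, 20% lower
p <- decompose_saved(mx_period, mx_cohort)
round(p, 4)
```

Stacking these proportions for every cohort in a given year would produce the shaded bands of the pyramid.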
In order to break down the upper ages of the pyramid into how many times they've hypothetically been saved thus far, we need a mortality series stretching far enough back in time. The HMD contains a few such series. I'll take Sweden, because it's everyone's toy dataset for stuff like this: you just know it's going to work before you even start!
So, following my period-cohort comparison of mortality schedules to derive cumulative rate improvements, we get the following decomposition of the 2012 Swedish population pyramid:
The central off-white area are those whose lives have 'never' been saved, and then each successive shade of purple increments the number of times saved. Looks like most 80-year-old females (right side) have been saved at least once, for instance.
This is all hypothetical: it assumes that ongoing mortality risk is the same for those that were saved 0 or 1 or 2 or more times. Inclusion of frailty would change the whole picture. But then, we've never seen a pyramid decomposed by a frailty distribution, because frailty distributions are hypothetical too: not really observable directly unless you make more assumptions to instrumentalize the notion.
Here's the code: (click on post title to display) Note that one of the data objects required to reproduce this is uploaded at the bottom of this post. You could download the cohort death rates from the HMD, but these are not available for cohorts with fewer than 30 observations, which we need. So I derived them straight from the HMD raw data, which is a bit of extra work.

FAQ: The 1/3 and 2/3 in the HMD version 5 exposure formula
It doesn't take that much work to convince oneself that the HMD population exposure formula (version 5, soon to be incremented) makes sense (if you accept the assumptions). The assumptions are simple: assume that deaths in the upper and lower Lexis triangles are uniformly distributed (and no migration). This is the formula: Exposure = 1/2 * (January 1st population) - 2/3 * (deaths in upper triangle) + 1/2 * (December 31st population) + 1/3 * (deaths in lower triangle). It's easy to accept the (Jan + Dec) / 2 part of the formula, but a bit more difficult to wrap one's mind around the -2/3 and the +1/3 parts for the triangles. If you look in Appendix E of the version 5 protocol, you'll get the calculus-based explanation. Fair is fair, but it's not intuitive to that many people, so I'll show it numerically (you need to play along in R), and then give a non-rigorous geometric explanation, which is however intuitive (I think). Next some code and an inspirational Lexis square to prove it to yourself: Jan 1 pop is P1, and Dec 31 pop is P2. Deaths in the upper triangle are DU, and DL is the lower. This Lexis square is a bit awkward for proving the 1/3, 2/3 thing graphically, so I'll put an equilateral one at the end, which might help. First note that all the deaths in DU should have been counted alive in P1. So, 1/2 * P1 would be an overestimate of the years those people lived in the triangle, and we need to subtract some portion of DU to account for the years not lived in the triangle due to deaths. Under uniformity, the number of years they lived in the triangle is on average 1/3, so we need to subtract 2/3. We still need to show that 1/3 is the average. In the same way, the deaths in DL are not counted in P2, which means 1/2 * P2 is an underestimate, so we need to add some portion of DL, and the average years they lived in the triangle is also 1/3. Here's R code to simulate this numerically, which I'll explain a bit here.
First, the years lived by individuals passing through the square are diagonals, but this way of drawing the Lexis diagram stretches lifelines by sqrt(2), because they're shown as hypotenuses... So that's why the image is awkward. Instead, from this image, imagine the life lived by those dying in the upper triangle as the distance in x from P1, whereas the life lived by those dying in the lower triangle is the distance lived in y from the lower bound of the square. Follow along in the code to see that the average years lived in each triangle are 1/3, and that the rest therefore makes sense. (This is only visible if you're viewing the post directly, apparently.) And here is a more visually appealing proof using the equilateral version of Lexis (annotation below). If we constrain age, period, and cohort to use the same units, we end up with equilateral Lexis triangles. Convince yourself that the orange circles are the center of gravity of each triangle (they are!). Now focus on the lower triangle. The lower line is 1 year long (as is the diagonal, because it's equilateral...). If I cut the lower line into thirds, we get the points a, b, c, etc. We know that the distance from a to b is 1/3, and from b to c is 1/3. Now notice that the points b, c, d form a new equilateral triangle. And so we know that the segment b, d is 1/3 long as well. Ah! And that's the one that lifelines are parallel to! And since the uniform average of the triangle is at d, we know that it took on average 1/3 of a year to get to d. Voila, we are now convinced about the 1/3 thing in the triangles. Move along, nothing to see here.
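The original simulation code is embedded as a gist and may not display in this export, so here is a hedged reconstruction of the numerical check: scatter points uniformly over a unit Lexis square, split at the diagonal, and verify that the average time lived in each triangle is 1/3:

```r
set.seed(1)
n <- 1e6
u <- runif(n)        # elapsed calendar time within the year
v <- runif(n)        # age attained above exact age x
upper <- v > u       # point falls in the upper triangle

# deaths in the upper triangle were counted alive in P1;
# their time lived in the triangle is the elapsed time u
mean(u[upper])       # ~ 1/3

# deaths in the lower triangle entered by crossing age x mid-year;
# their time lived in the triangle is the attained age v
mean(v[!upper])      # ~ 1/3
```

Both means land at the centroid coordinate of their triangle, which is the geometric argument made with the equilateral picture above.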
#PAA2016 where to find me
Here is my PAA 2016 program entry: (you need to be in the post to see this...) You can also find me on Wednesday, March 30 at the HMD side meeting in the morning, and at the dataviz workshop in the afternoon. I intend to sleep at night. Most paper submissions currently on the website will be updated before March 7 too. 