Recent Entries

My Demography meets Art submissions
So we have a bunch of frames hanging in the institute presently containing reproductions of some really awesome historical choropleth maps, like this one: These were hung because there used to be a historical demography group here. They've been left hanging well after the dissolution of that group because they're awesome. They still are. But we want the art here to reflect the present labs a little more (labor, fertility & wellbeing, health: sounds like the ingredients of a good toast, right?). So a call has been put out to staff to propose art. I sent in 3 things, and I'll dedicate this post to describing the process of making one of them.
The point of departure for this piece was a Jan 2014 blog post here "Birth flow with cohort reflection". Here's that original: It's a fun thing to look at. The colors are straight Brewer palettes matched to size bins of the polygons on either side. Darker is bigger. Yeah. All the labeling was done in native R. That was a big old hunk of code I can tell ya.
To take part in a (fab) data viz course by Jonas Schöley last summer I had to propose a project, so I made a touchup of that piece my project. This mainly included 3 new things:
1) Get more data to extend the series further back in time. HT Sebastian Klüsener for sending me the data. These were in 5-year age groups. I used the pclm R package to graduate birth counts (HT Marius Pascariu and Silvia Rizzi).
2) Add a meander to the centerline based on gross reproductive ratio (size of child cohort on bottom axis / size of mothers' cohort on top axis). This is a point cloud that follows a sort of linear pattern, so I smoothed it and made that the centerline.
3) Add some sideplots to explain the central figure.
This was as far as I got it for the Rostock Retreat (here at 100 dpi; I always keep it vector for printing; click to embiggen). The first subplot explains the meander. The second explains the top vs bottom, and the third is a cross-section slice. Otherwise, the main image is the same as before, just longer and with a meander. I wasn't satisfied. It was still hard to explain, and I managed to confuse myself explaining it every time. Also there is a problem in that the bottom half of the graduated early part of the series is smooth, where we would expect first differences to reflect down from the top (as in the right side of the series where counts are directly observed in single age bins, respecting natural fluctuations). I've still not resolved this, but I think that solving it (here an aesthetic issue) might be a real contribution if these data are destined to enter the HFD at some point (which I understand to be the case).
Mainly I wanted to remove the need for sidegraphs to explain it, as they suck attention from the centerpiece and don't help much anyway. I've opted to annotate a particular lineage found online in the public domain by Christian Helinski (2nd pic in that report!). I wanted a 5-generation female lineage to superimpose. I figured there's no problem since the lineage is extinguished, and only first names are used.
The other main problem to solve was the colors. These are only intended to give a rough idea of relative bin size. They don't need to be perfectly perceptually balanced to convey this. I think the green-blue palettes are pretty, but I wanted some more flair. So I used the paletter R package to extract palettes from paintings and photographs. The main function in that package lets you choose between categorical and sequential palettes, but I couldn't get a usable sequential palette, possibly because it was getting stuck in single hue bands, not sure. So I went with categorical and decided to attempt to perturb colors toward an even ramp. I decided hue doesn't matter, but darkness and saturation do. So I settled on some pictures, extracted palettes, then post-processed colors, optimizing on the saturation channel in HSV (colorspace) and on the inferred grayness estimated by the to.grey() function of spatstat. Here is a better idea of that, in four panels: extracted palette (raw), sorted on darkness, sorted on saturation, approximately equispaced.
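For the curious, the two sorting steps can be sketched roughly like this. This is Python rather than the R used for the piece, the palette is made up, and the luma weights are an assumption standing in for whatever to.grey() does by default, so treat it as an illustration of the idea and not the actual post-processing code.

```python
import colorsys

def hex_to_rgb(h):
    """Convert '#RRGGBB' to an (r, g, b) tuple of floats in [0, 1]."""
    h = h.lstrip("#")
    return tuple(int(h[i:i + 2], 16) / 255 for i in (0, 2, 4))

def darkness(h):
    """1 minus a luma-weighted grey value (the weights are an assumption)."""
    r, g, b = hex_to_rgb(h)
    return 1 - (0.299 * r + 0.587 * g + 0.114 * b)

def saturation(h):
    """Saturation channel of the colour in HSV space."""
    return colorsys.rgb_to_hsv(*hex_to_rgb(h))[1]

# Hypothetical palette standing in for one extracted from a painting.
palette = ["#F2E8C9", "#7A2E1D", "#D98E32", "#3B1F14", "#B5541F"]

by_darkness = sorted(palette, key=darkness)      # lightest first
by_saturation = sorted(palette, key=saturation)  # dullest first
```

The two orderings generally disagree, which is why a further nudge toward an even ramp in both channels is needed after sorting.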
The final "equispaced" colors are only approximately so, based on linear trends in both darkness and saturation (steeper darkness gradient). Good enough. All we need to convey is the 'heavy' spot in the middle of the series, and for occasional curiously large or small cohorts to 'pop' wrt surroundings. Otherwise, we get an autumn palette. Wrong symbolism for births, but I like it, sorry. Here's the final result (again click for 100 dpi, where you can start to see grid lines and lineage paths). That title may need to change again.
So this is semifinal. Still need to think on the title. And man it'd be great to solve the oversmoothness of the bottom part of the early series, but not going to happen in time. A high res version of this might be hanging sometime in the mpi, will see. This one is befittingly sized to fit in a frame presently holding a Sweden map!
[[Edit: Dec 13, 2017 I am presently trying to figure out how to 'adjust' earlier cohort fertility (bottom left) to account for size fluctuations in mother cohorts (top axis). May or may not succeed]]
[[Edit: Dec 15, 2017, got a decent solution]] Got it! Here's the new figure: (click for 100 dpi) Data for years 1775-1890 have now been perturbed to account for cohort size. This is essentially a step 2 to the count graduation I had done for earlier years. I turned it into an optimization problem, with the following steps:
1) graduate to single ages using pclm. This is constrained in age groups already.
2) for years 1891-1959 we can already compare first differences in mother and daughter cohort sizes, and relative first differences of mother and daughter cohort sizes have a near 1:1 relationship (slope of linear model) and a high correlation (0.93). I can perturb the offspring (lower) cohort size to approach the relative first differences observed in the top cohort sizes by creating a multiplier, discounted somehow. It's the discount that I optimize for; the result turns out to be .44, applied like so: C * (1 - rdt * .44), where rdt is the relative first difference in the mother birth cohort (top series). Then things are reconstrained to match 5-year age groups in the original data. That's a crappy explanation. The bottom line is there are lots of knowns, and I'm minimizing on just the linear slope and correlation coefficients between the top and perturbed bottom series, where the only moving part is the coefficient that approaches .44.
3) regenerate and touch up in Inkscape. Now we have a similar degree of reverberation between top and bottom series. Again, compare the bottom for years 1775-1890. I'm pleased with the result.
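Since that explanation is admittedly rough, here is a minimal sketch of just the perturb-then-reconstrain part of step 2, for one year of data. It's Python, not the R I actually used, the numbers are invented, the function name is hypothetical, and the sign convention of rdt is assumed. The discount is taken as given (.44) and is not re-estimated here.

```python
def perturb_and_reconstrain(counts, rdt, discount=0.44, width=5):
    """Scale each single-age count by (1 - rdt * discount), then rescale
    every block of `width` ages so its total matches the original block."""
    adjusted = [c * (1 - r * discount) for c, r in zip(counts, rdt)]
    out = []
    for start in range(0, len(counts), width):
        block_orig = counts[start:start + width]
        block_adj = adjusted[start:start + width]
        # Rescaling keeps the observed 5-year age-group totals intact.
        scale = sum(block_orig) / sum(block_adj)
        out.extend(a * scale for a in block_adj)
    return out

# Invented numbers: graduated births at mothers' ages 20-29 in one year ...
counts = [100, 110, 120, 130, 140, 150, 155, 160, 162, 163]
# ... and invented relative first differences of the matching mother cohorts
# (element i belongs to the cohort that is aged 20 + i in this year).
rdt = [0.05, -0.02, 0.00, 0.03, -0.04, 0.01, 0.02, -0.03, 0.00, 0.01]

result = perturb_and_reconstrain(counts, rdt)
```

Within each 5-year group the totals are unchanged, so only the distribution across single ages moves.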
The method is too hackish to make it into the methods protocol for HFD, but my hunch is that these results are a closer approximation to reality than a standard rate graduation without single-age offsets.
*Methodological speculation: You could in principle daisy chain this perturbation back to exposures, since we have an implied smooth rate, and an adjusted count series. count / rate = exposure. Often in the absence of single age observations, mortality numerators and denominators are both graduated to be smooth in one way or another. However, for populations with long preceding birth series, there is a source of information on relative cohort size that ought to follow the cohort fairly well over time. It's maybe less problematic to trust such an adjustment through the reproductive span than it is through the whole life span.
** No, I didn't validate this adjustment on the years 1891-1959, where this would certainly be possible: group to 5-year ages, pclm graduate, then adjust like the earlier part. How well would that match? Not sure yet. Maybe I'll check it out one day.

Posted Dec 15, 2017, 6:07 AM by Tim Riffe

the joy of fertility
So I've been helping co-teach an R for demographers workshop back at the CED in Barcelona, and one of the sessions I was responsible for was about base plotting. I gave as an exercise to make a stacked plot similar to the Joy Division Unknown Pleasures album cover but made of fertility curves from a data object I gave them. There are many such dataviz projects out there, like this, this, and this, just to list some that come to mind. This way they'd get some experience with using the primitive plotting element polygon(), would have to write a function in order to draw one for each fertility curve, and would need to set up some sort of iteration in order to do so. So, three R concepts covered in this exercise: plot device management, functions, and iteration. Not bad says I. Yes it's a lot for beginners, but that's why the exercises are done with the instructors present. The point isn't necessarily to get the final polished product, but to learn via trying rather than replicating code that 'just works'. So I hope it succeeded somehow there.
Anyway, later on I got a more comprehensive set of fertility curves from the Human Fertility Collection, regenerated, then marked it up in Inkscape (for better labeling). Each fertility curve here is drawn at a spacing of .06 units of TFR, ergo the spacing is on scale with the data and you can treat the baselines as grid lines in this funny way. Really there's no sense trying to read values out of it. Curves for the different countries are sorted by TFR, with the lowest TFR at the top and the highest (in this set) on the bottom. You see what a variety of shapes fertility can take under similar values of TFR. I suppose that's what the plot is good at showing. Some curves are of similar shape and size, but at different locations (ages). Some are of different shape, but the same size. And so forth. Anyway, here's a vector pdf version of the plot: And the R code to produce this (prior to Inkscape markup) can be found here.
And here's an update, only a short while later, due to suggestions received on Twitter from Carl Schmertmann, Alison Taylor, Nikola Sander, and Ramon Bauer. Thanks all!

Posted Jul 11, 2017, 11:00 AM by Tim Riffe

a higher order time identity
Most demographers are familiar with the age-period-cohort identity. 'Identity' might not be the first thing that comes to mind when we think of APC. Instead we think of the Lexis diagram, or else the identification problem, which then leads back to the fact that these three measures form an identity. Any two of them will do, and you'll get the third via implication. Well, you can go bigger than that. Say we had period, cohort, start of employment, and retirement. By taking the pairwise differences between these four events, we'd end up with 6 further durations. Just as APC relate in a simple graph, the form of a triangle, these 4+6 time measures relate in the form of a denser graph with 5 vertices and 10 (4+6) edges. It's a complete graph, meaning all vertices are connected directly. Via Cayley's formula (5^(5-2) = 125) there are 125 ways to select 4 edges and still touch all vertices. And coming back to the time identity, there are therefore 125 ways to start with 4 time measures (of these 10) and end up generating the remaining 6 (i.e. such that the remaining 6 are implied). Here's a mini poster showing them all: I think I'd try sorting these somehow, but need to figure out how. For now these are unordered 'solutions'.
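If you'd rather not take Cayley's word for it, the count is small enough to brute force. Here's a quick sketch (in Python, with vertex labels 0 to 4 as arbitrary stand-ins) that tries every set of 4 edges out of the 10 and keeps those that connect all 5 vertices:

```python
from itertools import combinations

def spans_all(vertices, edges):
    """True if the edges connect every vertex (checked with union-find)."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            v = parent[v]
        return v

    for a, b in edges:
        parent[find(a)] = find(b)
    return len({find(v) for v in vertices}) == 1

vertices = range(5)
all_edges = list(combinations(vertices, 2))  # the 10 time measures
n_solutions = sum(
    spans_all(vertices, subset) for subset in combinations(all_edges, 4)
)
print(len(all_edges), n_solutions)  # prints: 10 125
```

Of the 210 possible 4-edge subsets, the 85 that fail are exactly those containing a cycle, i.e. a set of measures where one is already implied by the others.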

Posted Jul 11, 2017, 7:47 AM by Tim Riffe

Demography, Uruguay, Futbol
So Victoria Prieto kindly invited me to give a lecture via Skype (because the class is in Uruguay) on applications of the Lexis diagram. And I must say that Skype quality just stinks for this kind of thing. They were able to get clear audio of me IFF we made it so that I was just doing screen share and their audio connection was turned off. We made it work somehow, but got lost in a sea of white noise during the Q&A a couple times. Victoria asked me to try to make them love Lexis. So, aside from trying to make a real case on the utility of the Lexis diagram, I also found some atypical applications here and there, and made my own silly (but fun!) example of lifelines in the Lexis diagram, based on futbol (I have to write it like that cuz football means US football to me, no matter how long I live in the EU) players on the Uruguay national team in all of Uruguay's appearances in World Cup and Copa America. Uruguay, you ask? It's here:
(By Connormah - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=6913529)
Uruguay punches (or bites? mwahaha) above their weight class in the realm of futbol. It's a mostly urban country, and they have a demography masters program, and a PhD specialization in pop studies. And Victoria teaches there, ergo the Lexis lecture, and that's why the Skype hassle. OK, now back to the point of this post:
I just wanted to show that you can represent any population of durations on the Lexis diagram (or its analogues).
Here are the plots I showed them, without first telling them what it was. I asked them to guess the data:
1) guess what this is showing.
It's right skewed, between ages 15 and 40, possibly fertility?! It does look that way, doesn't it? This turns out to be the aggregate of all ages in all teams in all cups, standardized in 2.5-year-wide age groups. Meh. I didn't reveal the subject here yet.
2) Maybe this would tip them off? If I just say that the data in the above histogram came from the points in this Lexis diagram? (click to embiggen)
This image still didn't settle it. I guess x ticks every 20 years kind of obfuscates recognizing the particular years of the cups. Like, 1930 and 1950 I think Uruguay won, but there aren't any ticks there to indicate it. So then came the reveal, but since it was a one-way connection I'm not sure if it tanked or if they thought it was cool. But they did ask for a blog post on it with the code to get the data, etc. The truth is you could augment these data in so many ways. I think sports data is ripe for demographic analysis, though one might need to play with the definition of age, like this:
Same data, but the lifelines only connect the first and last cups played by each player. The lifelines are of course still aligned to age 0 (birth), but you could certainly realign them to make first cup be age zero, or last cup age omega... Further you could augment these to make the points games rather than simply being a team member, add other cups, regular season, and ... the pre-data territory: youth teams. The Lexis diagram isn't the limiting factor here, but the more detail that is added, the less useful lifelines become, and the more likely you are to gain insight from some kind of aggregation (perhaps on a metric rather than on team membership, and aligned to one or another definition of age). Then patterns will emerge that are otherwise invisible. Woot.
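To make the realignment idea concrete, here's a small sketch of the coordinates involved. It's Python, the players and dates are invented, and the function names are hypothetical, so it's not the code used for the plots. A lifeline is a 45-degree segment in (year, age) space, and realigning just changes where age zero sits.

```python
# Invented players: birth date (decimal year) and the cup years they played.
players = {
    "Player A": {"born": 1905.5, "cups": [1930, 1935]},
    "Player B": {"born": 1925.0, "cups": [1950, 1954]},
}

def lifeline(born, cups):
    """Lexis segment from first to last cup as ((year, age), (year, age))."""
    first, last = min(cups), max(cups)
    return (first, first - born), (last, last - born)

def realign_to_first_cup(cups):
    """Same segment, but with 'age' counted from the first cup played."""
    first, last = min(cups), max(cups)
    return (first, 0), (last, last - first)

for name, p in players.items():
    print(name, lifeline(p["born"], p["cups"]), realign_to_first_cup(p["cups"]))
```

Aligning to last cup instead would be the mirror image: subtract the last cup year so that every lifeline ends at zero.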
I hypothesized that you could use the Lexis diagrams to guess at the heroes of any given team. Whaaa? Take the lifelines of the most recent team, and follow the lines back in time (downward left) to sometime between ages 5 and 15, assuming that that's when futbol impressions are made in the most lasting way, then look up to which players appeared in those cups. That's the population of likely homebrew heroes. Not entirely novel. Any player will just tell you their childhood references anyway.
But possibly more interesting than this subject matter was the mungery required to get these data. The data were scraped from Wikipedia entries, like this one, using the code annotated here:
Feel free to use it, expand it, make cool plots, and share them.
and here's the full presentation:
Thanks again Vicky!

Posted Jul 21, 2016, 2:15 AM by Tim Riffe

Many lifelines
Posted Jun 14, 2016, 1:15 PM by Tim Riffe
The inelegant old-school 1-page blog has now migrated to the 'blog' tab on the left. This page only shows the 5 most recent entries.

