Two weeks ago the Japanese Mortality Database went live http://www.ipss.go.jp/p-toukei/JMD/index-en.html
). Welcome JMD and thanks to Futoshi Ishii and his team! This is laid out just like the HMD, but with data for Japanese prefectures (Canada
also has one for provinces (Thanks Nadine!), and the USA has one in the works or states (Thanks Mila!)). Currently JMD data is available for years 1975 to 2012, but they're looking into pushing that back further in time. Lifetables are given for 5-year intervals, possibly to smooth out noise. I'm OK with some noise, so I've gone an done some calculations on single years:
Here are some results that will make you say wow WTF right around 2011:
This plot shows period e0 for males and females for all 47 prefectures. I highlighted the consistently highest and lowest performers: Aomori seems to underperform rather consistently for both males and females, whereas the highest performer for males and females is different: Okinawa for females, Nagano for males.
Two crisis events stand out: the 1995 Great Hanshin earthquake
, which dealt a 2.5 year blow to e0 in Hyogo. As usual, the next year things snapped back to normal. The more recent (and still very alive in everyone's memory) 2011 Tohoku earlthquake and tsunami
had a massive impact on 2011 mortality conditions. The two prefectures that took the largest aggregate mortality hit (we're not talking about raw numbers here, but rather rates) were Iwate and Miyagi, followed by Fukushima. The size of the 2011 hit in Awate and Miyagi is on the order of 8-10 years for males and females -- that's as large or larger than the 1918 influenza impact in many HMD countries. For these two provinces, it was the same as briefly erasing 30+ years of progress.
I've included all code used to produce the above, plus some code to associate this data with a map and make a choropleth. Map data come from GADM
and are loaded on the fly from R. Here's a super quick workup of year 2000 Male e0 in Japan:
Plenty of details could be added to this map to make it better, aside from a spatial scale legend- (Mark major cities, only include outline for the coast, show other landmasses in light gray, be thoughtful about the projection, etc). You get the idea. 2000 was a typical year on the whole: The red prefecture on the top is Aomori, also indicated in the other figure.
I annotated it pretty well.
Question: we knew there would be a big mortality dip in 2011, but has this size of e0 impact been previously reported?
Thanks again to Futoshi for bringing these data live!
Edit April 1, 2014: Futoshi has got back to me with some comments, which I paraphrase here:
Two things to be aware of with the 2011 mortality shock 1) this is an underestimate of the shock (possibly a large underestimate), since many likely deaths are still classified as missing persons. It is unclear whether death statistics will take retrospective account of this. 2) there was a lot of migration out of the affected prefectures in the weeks and months following the tsunami. The JMD does not currently take explicit account of internal migrations between censuses, but rather at this time applied the same procedure as the HMD to distribute errors uniformly over the intercensal years. This means that the population estimate for Iwate, Miyagi and Fukushima for the period after the tsunami may be too high, which would in turn make the mortality estimate too low for 2012. The JMD is investigating data and methods to better account for this in the future. For the present, just be aware of this caveat when using the data. Also note the single-year e0 is just fine- the reason why the JMD does not publish single-age, single-year lifetables is because the age-pattern of mortality is very erratic in the smaller prefectures. For e0 we don't worry so much, but for other lifetable functions we would definitely care.
A salute to Luigi Perozzo! Dr Perozzo* (1856-1916) was an Italin engineer, statistician and mathematician. This is a story of accelerated development of a graphical idea. To put things in perspective, the first age pyramid is typically attributed to Francis Amasa Walker in 1874, then director of the US Census (link). The first stereogram (figure using 3-d effects to represent data) was due to Gustav Zeuner, but I've not yet found an image of that to show, and apparently neither has the milestones project. Check out that first pyramid from 1874. It's pretty simple. Now compare it to Dr. Perozzo's 1880 population stereogram (click to enlarge- that goes for all the images, I captured them at pretty high res)
I annotate: ages come out at us, calendar years move right, and height is population size (except the very upper line, which is births- very high infant mortality). Censuses are marked with the lines that appear to go down and left, while cohorts are traced going down and right with the thicker black lines. The relationship between age, period and cohort time axes appears to be equilateral, judging from other annotations not shown and the coordinate legend -- see the original document to see the other coordinate systems considered: http://lipari.istat.it/digibib/Annali/ - it's the one called TO00175363_Serie2Vol12ed1880+OCRott.pdf
. Just looking at it, I can't tell whether the angle between age and period on the flat bottom plane is 60 degrees, as the legend says, or 90 degrees, which is what it seems like. This is a masterpiece, but not the pinnacle of his graphical
Apparently this piece went over well, because he had another version in the same publication the next year (1881 volume 22, look at the same index page):
It's the same design, but easier to point out that the censuses are the red lines, and the dark black lines mark cohorts. The coordinate legend is the same.
Now master Perozzo takes things to the next level. Imagine the same red lines from above as cross-sectional planes, but instead of lined up in parallel they stick out in regular angle intervals from a sphere or cylinder. Here's his sphere coordinate setup. The curve below is cross-section from a different surface (a marriage cross-tabulation by age combinations of spouses), which he shows later on. I'm just going to show the population surface projection (not the forecast kind of projection). So the hemisphere facing us below shows a span of 100 years total from left to right- it's a normal cylindrical projection where latitude lines represent age (the equator is ages 50-54, births are squished together at the North Pole, and ages 100+ down in Antarctica), and longitude lines are calendar years, while cohorts trace along the S-shaped thick black lines. Data around the periphery (young & old, beginning and end of period) are the most distorted, while age 50 in 1800 would be shown in approximately standard Lexis proportions.
Perozzo does not show us what the above population surface would look like on a sphere. It would be bad form by today's graphical standards anyway, due to the distortion.
He goes on to experiment: now let's take a single census snapshot, here the census of 1750. Ages are now angles, and each of the three panels shows the same census with different age-angle conversions: Fig 1 shows 100 years of age spread over 180 degrees (left half of circle). Fig 2 shows the same info squashed into 90 degrees total. In both of those figures, absolute numbers are represented by the radius. The far right figure is also squished into 90 degrees, but instead of radius it uses a cartesian y-axis to mark absolute counts (that will only work if the census is squished into 90 degrees). Each of these representations distort counts in some way similar to how radial pie charts distort counts -- it's the area of each bar that communicates quantity, not it's distance from the center -- but this was way before we were clear about that. His panel 3 would be totally confusing to interpret: old ages get expanded more than young ages, but in a non-linear way! The same graphic of an aged population would go way off the charts to the left...
Saving the best for last! All the above is just to point out that Perozzo was being very thoughtful about angles and coordinates before he threw down on the following behemoth of a data-graphic, the reason why this blog post was actually written. BEHOLD (you must zoom in):
How does this one work you ask? It took me a while to figure this out. Clearly the red lines are censuses. Imagine them as slices, specifically represented in the manner of the center panel from the previous figure. The little arrow at the bottom points to the origin. Young ages are at the top and old ages are at the bottom. Cohorts trace along the thick black lines, as before, spiraling down towards the bottom right, and then wrapping back in. If this were a living population representation it would rotate clockwise from the top as time passes.
My thoughts are that this is somewhere on the spectrum between art and statistical graphics. It clearly conforms to a particular coordinate system, but incurs and insane amount of distortion, such that's it's difficult to interpret, but it shows a view of population like we had never seen before or hence. It probably does not contribute to understanding population, except perhaps in that it invites the idea of spinning more than a standard surface (his earlier work). In a growing population, the new censuses would keep coming around, and then overlap the earlier censuses like a conch shell. My first impressions were that this is how demographers would represent data in some early industrialization steam-punk parallel universe (because it's beautiful monstrosity involving an insane amount of misplaced ingenuity to construct).
It's tempting to try to imitate this in rgl if only out of historical curiosity, but we have no time for such things.
A tribute to the man of the day: Luigi Perozzo: we salute you!
A salute to Luigi Perozzo
We're like'n the thoughts that you thunk
What would you draw us Luigi Perozzo?
If we could get you good 'n drunk?
[Edit March 28, 2014]
That ditty was for you, Julio! -- practice makes perfect!
*See also Julio Perez' tribute to Luigi Perozzo from October, 2013: http://apuntesdedemografia.wordpress.com/2013/10/02/estereograma-de-perozzo-1880/#more-6714
Julio's demography blog is a great resource, so go check it out! (just have Google translate it for you if you don't know Spanish)
* I don't know whether he had a PhD, but surely you'll agree that if not, he deserves one!
In February 1662 (352 years ago!), John Graunt presented 50 copies of his Natural and Political Observations on the London Bills of Mortality
to the Royal Society, and was inducted forthwith. I'm not sure about the exact date, but it represents the approximate birthday of modern demography. Here's an indexed html conversion of that document
. Apparently this was printed on Jan. 25th of that year (maybe that's the birthday?), which means that poor John Graunt was toiling away in the preceding months, and possibly would have noticed the male-biased sex ratio at birth the summer before (is that the birthday?). Working out the first lifetable ever sounds like nasty fall work to me.
(Come on, let me imagine that that's how it went down!)
The cover page to citizen Graunt's momentous treatise:
The cover to the annual Bills of Mortality (London death stats) that he based it on was pretty awesome back then:
Anyway, we (Berkeley Demography) are going to pretend it's modern demography's birthday today and have called a tea time. Well done, says I!
During today's dreary walk to work I cooked up a little drinking song in memoriam (for some demography happy hour somewhere):
It's short, I know, but with a barrel voice it could be cool.
[[Edit, Friday Feb 28]]
Ken Wachter informed us that Graunt may not have been the original author of his 'observations', and that the work may have been lifted from Graunt's boyhood friend, William Petty. Hervé le Bras has done some digging and he comes to some conclusions on the matter in this book
(possibly soon to be published in English?).
I probably head of Benford's law
first on an episode of Radiolab
. Benford's law is an odd empirical regularity about the way numbers appear in the world. It says that if we take the first
digit (1-9) of each number in a large pool of numbers, the number of times that any given digit appears as the first digit should asymptotically follow a particular distribution. 1 occurs the most, and so forth:
is a given digit, then log(1+1/d)
gives the expected proportion out of all first digits you'd expect digit d
to take up (above).
Question: do demographic data obey this law? Test data: HMD death counts (Deaths_lexis.txt). Here are the results:
I'd say that yes, HMD death counts obey Benford's law. The dashed blue line simply takes the count from each death triangle and counts how many times each first digits appears for males and females (1918722 total numbers as of this writing). I've not looked how the pattern unfolds by sex, age or over time. We do see here that some HMD populations follow the law more closely than others. This begs the question: is there something fishy about the data for countries whose death counts don't follow this law (look how lumpy some of those lines are)? To check, I measured how much each grey line departs from the red line (half the sum of the absolute deviations from the reference distribution), and we rank the HMD countries. Scores are bounded by 0 (exact match) and 1 (totally different distributions, no overlap). Therefore larger scores mean larger departures from Benford:
To put things in perspective, .12 is not a huge difference between two distributions, so let's just looking at the rankings. These ranking results are unexpected to me, and I suspect will be to others. The greatest departures from Benford's law are found in Scandinavian countries, and the most Benford-obedient populations are in the former Soviet Republics and Switzerland. Go figure. It is widely held among demographers that the Nordic countries in general have the best data (and most abundant, longest-collected, etc), but perhaps this metric is ill-conceived...
Now a flurry of questions: Do we learn something from this departure? Is there any particular particular reason why the Benford distribution must apply? Are there too few numbers from any given country in order for this asymptotic property to shine through? Is the discrepancy large enough to merit further digging? Of course, looking at death triangle output from the HMD means that the data have been massaged to some degree prior to running this test (splitting 5-year age groups, stuff like that). However, if we do the same test on abridged death counts (Deaths_5x1) --- those that resemble lowest common denominator database inputs ---, the overall ranking is very similar. Does an anomolous digit distribution indicate anomalous data
, an anomalous population or population process, or an anomalous application of Benford's law?
R code to completely reproduce this is on github, here
. It uses the DemogBerkeley package (installed from github) to grab necessary data from the web.
Age distributions of 'hedonic wellbeing' (a.k.a. subjective wellbeing, happiness in a superficial not-aristotelian way) have been reported from time
. Graphical evidence is more appealing to me. I'd never before seen the GSS (general social survey) until I saw it cited as the source for earlier happy-curves. Enter AJ Damico's awesome R utilities for social surveys (scripts on github here
, overview and blog here
). Lo and behold, AJ had a helper script for that survey. I couldn't resist.
Here are US male and female happy surfaces, because why not (click to enbiggen):
Some details on where the values for each 5-year AP square come from: There is a question 'General Happiness' coded as 'happy' when loaded into R, the is asked as follows:
157. Taken all together, how would you say things are these days--would you say that you are very happy,
pretty happy, or not too happy?
So we get those categories, plus NAs, 'I don't knows', etc. I, boldly or blithely, gave 'very' a score of 1, 'pretty' a score of .5, and 'not very' a score of 0, and ignored the other values, taking the (duly weighted) distribution of each of these 3 responses within each age-year-sex block. The sum of these three numbers is the score, ranging from 0 to 1, most often between .6 and .8. Visually, we see slight patterns in age, calendar year and cohorts (pre WWII cohorts very happy in this snapshot, would be worth busting down to triangles). It also would have been practical to simply take the proportion saying 'very' as our measure.
If we're OK with these hypothetical happiness units, one might ask how much happiness would be experienced in an average lifetime, something like happiness expectancy. To get a tack on that, I've taken the survival-weighted age-specific happiness for each year, and summed. We get this as the time trend:
So, females have a higher happiness expectancy too! Actually the gap here is often lower than the life expectancy gap, but a
hypothetical THR (total happiness rate) is too noisy to read a believable
trend from. It remains to be seen whether Venezuela's Ministry of Supreme Social Happiness
takes demography seriously ;-P. Bhutan
probably does, but I haven't looked closely. There are certainly other measures of well-being that might mean more to most social scientists...
To be precise, this is the same as total lifetime happiness in the hypothetical stationary population, and is a synthetic measure that indexes this value for a particular year, but for no particular cohort. We could decompose the time trends or sex differences using old-school Kitagawa decomposition, and we'd see that the upward trend in happiness expectancy (h0) is mostly due to mortality gains.
Ugly details: we don't have info for kids or people or the elderly, so I assumed kids 0-14 were a constant rather-happy .8, and that people aged 90-110+ were constant at the age 85 level for a given year. The latter choice has little leverage on results, but the former will make a difference.
Full reproducible R script on github here
. It'll download the necessary data. Sorry, you'll need to change local file paths to get it running, as well as potentially install a few packages (code given).
Here's an experimental visualization that will require a bit of explanation, below:
Look first at the top part, mostly in green. The uppermost contour tracks the total birth count in Sweden from 1891 to 2010, according to my most recent download of HFD data. The upper bands in turn indicate mothers' cohorts (in 5-year groups). I've only shaded those 5-year cohorts whose fertility is fully observed (or almost so). The grayed-out cohorts on the left are those whose early-career fertiliy precedes 1891, while the right side gray-ed out cohorts include those whose fertility-careers are not yet complete. The darker the green, the greater the career-fertility of the given 5-year cohort. Colors were determined by just breaking up cohorts into quantiles, and you can see them matched to the y scale on the lower y axis (since they refer to vertical slice heights on the bottom). In any case, the x-axis for the upper 1/2 indicates year of ocurrence.
Take a vertical slice from the top-- That's a birth cohort. If you reflect it down to the bottom (blues) you see the career fertility of that very cohort -- the 2nd generation. The lowermost contour tracks total birth counts born to each cohort (whose starting size is seen above), and it is only fully observed for the areas that are not indicated witha gray background. The banded-areas in the bottom indicate the 5-calendar-year groups in which the births ocurred. That might be a bit to wrap you mind around: Take the 1920 birth cohort. You see its starting size on the top and the total births produced by that cohort on the bottom (no account is made for in or out-migration). Looking at the bottom blue 1/2 for the 1920 cohort, we see that it contains several shades of blue- it's births were spread out over several 5-year groups, starting around 1935. However, 1935 was a dearth-year, due to the depression, and we see it's colored in light yellow on the bottom (two consecutive bands). For reference, the darkest blue shade in the center (around 1920) refers to births between 1945 and 1949. Now we see the relative contribution from birth cohorts to both the depression dearth and the post-war boom, in two different perspectives.
Each boom (on the top) has an echo (on the bottom), i.e. literally this is a 1-generation step, before the effects of ergodicity have time to step in. This is an effect of population structure and not necessarily of rates. One could recreate the above figure with rates instead of counts and likely tell a different story. I'll leave that for another day. Note the boom around 1990: it'd take a huge rate swing to not expect an echo boom from those cohorts-- that's a no-brainer.
* the same visualization would work for deaths ;-), in either case, with rates and counts, according to your whim.
** my apologies if this explanation isn't efficient!
*** this design is indeed inspired by the stream-graph
, but I had to make some design deviations from the popular design used in the NYT
1) I kept stacking order strictly chronological
a) the figure is about time and quantum equally and
b) the bands are not categorically different but rather a discretization of an imaginably continuous flow.
2) I removed the meander because *two* flows are plotted, not one. For this reason, neither flow crosses the x axis.
[edited Jan 27, 2014]
I'll being going to PAA this year in Boston, and the EPC in Budapest, with 3 papers in the mix:
3) Decomposing and recomposing the population pyramid by remaining years of life (with J. Spijker and J. MacInnes) [PAA Poster Session 6, EPC Poster session 1]
I will link to papers when final uploaded are made, although a motivated individual could find them on the respective conference websites.
As of last Wednesday I'm now officially a Dr. of Demography! I've posted the completed dissertation here
(already corrected for Errata and typos that I'm aware of). There are a couple known problems in it that I've not yet attended to, but I'll just replace the file in Google Drive when the beast gets updated. Most of the presentation got filmed, but the files are really big, so I need to figure out how to make them a manageable size before posting here. In lieu of that, here's a fun animation (it's a 13Mb file, might take a while to load, if at all- see attached gif at footer):
a section of this made it into the defense presentation. This figure is 1975 US using HMD data, in case anyone is curious.
It's been half a year since I've posted anything. The gap is explained by my having put 100% of my free energy into finishing my dissertation, which is now written, but not yet defended. I thank the HMD for even giving me a huge 3 month block of time to work (almost) exclusively on the dissertation (many more thanks are found in the acknowledgments, of course). In Spain, defenses are big public events- in fact anyone can go. So, I thought why not announce it here!
The defense will take place at:
After the defense I'll post the pdf here somewhere after accounting for errata that the tribunal (yes, it's called a tribunal here!) brings to my attention, and I'll post instructions for getting the code on github. Likely the next several posts on this page will be dissertation-related. Woot!
Sala d’Actes B7/1056 inside the Facultat de Ciències Socials i Lletres (Edifici B) of the UAB main campus (Bellaterra), at 11:00 am this June 26th (in about 3 weeks from this writing).
If you come to see it, be aware that my presentation is not to the tribunal, but rather to the audience and so is retooled to be accessible. That lasts like 45 minutes. Then the three tribunal members, Dr. Julio Peréz Díaz, Dr. Iñaki Permanyer and Dr. Trifon Missov, will each speak, basically for as long as they like, bringing the likely total time commitment is something like 3 hours. Two supplemental jury members are on standby, Dr. Daniel Devolder and Dr. Jeroen Spijker. Two external evaluators (to qualify for a European Mention) have been Dr. Vladimir Canudas Romo (currently changing institutions) and Dr. John MacInnes.
I'm not too worried about sneaky thieves getting at it before I get a chance to publish from it, since it's all in the public domain (as is the git-tracked entire process of writing it), and it has passed under a number of eyes.
a short abstract:
The two-sex problem in populations structured by remaining years of life.
by Timothy L. M. Riffe
One of the foremost problems in formal demography has been including information on the vital rates from both males and females in models of
population renewal and growth, the so-called two-sex problem. The two-sex problem may be conceived as a subset of the analytical problems entailed by multigroup population modeling. This dissertation characterizes the two-sex problem by means of decomposing the vital rate components to the sex-gap between the male and female single-sex stable growth rates. A suite of two-sex methods for age-structured models from the literature are presented in a standard reproducible format. A new variety of age-structure, age based on remaining years of life, is presented. Analogous models of population growth for the single-sex and two-sex cases are developed for populations structured by remaining years of life. It is found that populations structured by remaining years of life produce less sex-divergence than age-structured models, thereby reducing some of the trade-offs inherent in two-sex modeling decisions. In general, populations structured by remaining years are found to be more stable over time and closer to their ultimate model stable structures than age-structured populations. Models of population growth based on remaining-years structure are found to diverge from like-designed age-structured models. This divergence is characterized in terms of the two-sex problem and we call it the two-age problem.
and a teaser graphic (feel free to take a guess at what exactly this is showing)
Today I'm going to share one of my öfter used lexis surface markup functions, used below to make super geeky graph paper. All it does is draw lexis reference lines on top of an already-open plot, such as grid()
would do but with the cohort diagonals. This function gets used in several of the new HMD internal diagnostic plots, plus I use it a ton.
So let's say you had an image() plot of an age by year matrix, using proper age and year coordinates- then you could just mark it up using LexisRefN(ages, years, N = 5) , for instance, to get a 5x5 grid over it. The ... argument is passes on further arguments to segments(), such as lty, col, etc.
So some time ago I got inspired to print out some notepads of Lexis graph paper. The HMD team burns through these things... See the final copy that I used attached at the bottom of this post for 8.5 by 11 inch paper, meant to be cut in 1/2 for 5.5 by 8 inch pads, of which I have made an excessively large number- if you're a demographer and I know you and we'll see each other in the next year or so, then I've got one for you! (still don't know if I'll make it to the PAA in 2013). If I won't see you and you want some, let me know, or else use the proof below to make some up. If you do it yourself, here's my advice: Have a print shop do it in the off-season and price places out... The lines are light-weight- it needs to be printed minimaly at 600 dpi, max quality to turn out legible and smooth, and you may need to play with printer settings. You'll need to tinker to change the grid size, line weights, or adapt it to A4 paper. Let them recommend the paper and backing. The ones I did are just with glue on the edge, so pages just tear off. The print shop will try to get you to go for 50 pages per notepad, but they bind just fine with 30.
Here's the code I used for the attached page, modfy at will:
Happy New Year!