Two weeks ago the Japanese Mortality Database went live (http://www.ipss.go.jp/p-toukei/JMD/index-en.html). Welcome JMD and thanks to Futoshi Ishii and his team! This is laid out just like the HMD, but with data for Japanese prefectures (Canada also has one for provinces (Thanks Nadine!), and the USA has one in the works for states (Thanks Mila!)). Currently JMD data is available for years 1975 to 2012, but they're looking into pushing that back further in time. Lifetables are given for 5-year intervals, possibly to smooth out noise. I'm OK with some noise, so I've gone and done some calculations on single years:
Here are some results that will make you say wow WTF right around 2011:
This plot shows period e0 for males and females for all 47 prefectures. I highlighted the consistently highest and lowest performers: Aomori seems to underperform rather consistently for both males and females, whereas the highest performer for males and females is different: Okinawa for females, Nagano for males.
Two crisis events stand out. The first is the 1995 Great Hanshin earthquake, which dealt a 2.5 year blow to e0 in Hyogo. As usual, the next year things snapped back to normal. The second is the more recent (and still very alive in everyone's memory) 2011 Tohoku earthquake and tsunami, which had a massive impact on 2011 mortality conditions. The two prefectures that took the largest aggregate mortality hit (we're not talking about raw numbers here, but rather rates) were Iwate and Miyagi, followed by Fukushima. The size of the 2011 hit in Iwate and Miyagi is on the order of 8-10 years for males and females -- that's as large or larger than the 1918 influenza impact in many HMD countries. For these two prefectures, it was the same as briefly erasing 30+ years of progress.
I've included all code used to produce the above, plus some code to associate this data with a map and make a choropleth. Map data come from GADM and are loaded on the fly from R. Here's a super quick workup of year 2000 Male e0 in Japan:
Plenty of details could be added to this map to make it better, aside from a spatial scale legend: mark major cities, only include an outline for the coast, show other landmasses in light gray, be thoughtful about the projection, etc. You get the idea. 2000 was a typical year on the whole: the red prefecture on the top is Aomori, also indicated in the other figure.
Edit April 1, 2014: Futoshi has got back to me with some comments, which I paraphrase here:
Two things to be aware of with the 2011 mortality shock: 1) This is an underestimate of the shock (possibly a large underestimate), since many likely deaths are still classified as missing persons. It is unclear whether death statistics will take retrospective account of this. 2) There was a lot of migration out of the affected prefectures in the weeks and months following the tsunami. The JMD does not currently take explicit account of internal migrations between censuses, but rather at this time applies the same procedure as the HMD to distribute errors uniformly over the intercensal years. This means that the population estimates for Iwate, Miyagi and Fukushima for the period after the tsunami may be too high, which would in turn make the mortality estimate too low for 2012. The JMD is investigating data and methods to better account for this in the future. For the present, just be aware of this caveat when using the data. Also note that the single-year e0 is just fine: the reason why the JMD does not publish single-age, single-year lifetables is that the age pattern of mortality is very erratic in the smaller prefectures. For e0 we don't worry so much, but for other lifetable functions we would definitely care.
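To make the second caveat concrete, here is a minimal Python sketch (not the R code mentioned above) of how an overstated population pushes e0 up. Everything in it is an assumption for illustration: the Gompertz-shaped toy rates, the 5% population overestimate, and the helper name `e0_from_rates`, which builds a bare-bones single-age lifetable with a constant hazard inside each age. It is not the JMD or HMD method.

```python
import math

def e0_from_rates(mx):
    """Period life expectancy at birth from single-age death rates.

    Assumes a constant hazard within each age and treats the last
    age as an open interval (person-years there = survivors / rate).
    """
    lx = 1.0   # survivors to exact age x, radix 1
    e0 = 0.0   # running sum of person-years lived
    for i, m in enumerate(mx):
        if i == len(mx) - 1:
            e0 += lx / m                      # open age interval
        else:
            px = math.exp(-m)                 # survival through the age
            e0 += lx * (1.0 - px) / m         # person-years in the age
            lx *= px
    return e0

# Toy (made-up) mortality schedule for ages 0..110.
ages = range(0, 111)
true_rates = [0.0001 * math.exp(0.09 * x) for x in ages]

# Hypothetical: the population estimate is 5% too high at every age,
# so each rate (deaths / exposure) comes out 5% too low.
inflation = 1.05
estimated_rates = [m / inflation for m in true_rates]

e0_true = e0_from_rates(true_rates)
e0_estimated = e0_from_rates(estimated_rates)
print(round(e0_true, 2), round(e0_estimated, 2))
```

The direction of the bias is the point here, not the size: deaths are counted, but the denominator is estimated, so out-migration that goes unrecorded makes the post-tsunami years look healthier than they were.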
A salute to Luigi Perozzo! Dr Perozzo* (1856-1916) was an Italian engineer, statistician and mathematician. This is a story of accelerated development of a graphical idea. To put things in perspective, the first age pyramid is typically attributed to Francis Amasa Walker in 1874, then director of the US Census (link). The first stereogram (figure using 3-d effects to represent data) was due to Gustav Zeuner, but I've not yet found an image of that to show, and apparently neither has the milestones project. Check out that first pyramid from 1874. It's pretty simple. Now compare it to Dr. Perozzo's 1880 population stereogram (click to enlarge- that goes for all the images, I captured them at pretty high res)
I annotate: ages come out at us, calendar years move right, and height is population size (except the very upper line, which is births- very high infant mortality). Censuses are marked with the lines that appear to go down and left, while cohorts are traced going down and right with the thicker black lines. The relationship between age, period and cohort time axes appears to be equilateral, judging from other annotations not shown and the coordinate legend -- see the original document to see the other coordinate systems considered: http://lipari.istat.it/digibib/Annali/ - it's the one called TO00175363_Serie2Vol12ed1880+OCRott.pdf. Just looking at it, I can't tell whether the angle between age and period on the flat bottom plane is 60 degrees, as the legend says, or 90 degrees, which is what it seems like. This is a masterpiece, but not the pinnacle of his graphical risk-taking enterprise.
Apparently this piece went over well, because he had another version in the same publication the next year (1881 volume 22, look at the same index page):
It's the same design, but easier to point out that the censuses are the red lines, and the dark black lines mark cohorts. The coordinate legend is the same.
Now master Perozzo takes things to the next level. Imagine the same red lines from above as cross-sectional planes, but instead of being lined up in parallel they stick out at regular angle intervals from a sphere or cylinder. Here's his sphere coordinate setup. The curve below is a cross-section from a different surface (a marriage cross-tabulation by age combinations of spouses), which he shows later on. I'm just going to show the population surface projection (not the forecast kind of projection). So the hemisphere facing us below shows a span of 100 years total from left to right- it's a normal cylindrical projection where latitude lines represent age (the equator is ages 50-54, births are squished together at the North Pole, and ages 100+ down in Antarctica), and longitude lines are calendar years, while cohorts trace along the S-shaped thick black lines. Data around the periphery (young & old, beginning and end of period) are the most distorted, while age 50 in 1800 would be shown in approximately standard Lexis proportions.
Perozzo does not show us what the above population surface would look like on a sphere. It would be bad form by today's graphical standards anyway, due to the distortion.
He goes on to experiment: now let's take a single census snapshot, here the census of 1750. Ages are now angles, and each of the three panels shows the same census with different age-angle conversions: Fig 1 shows 100 years of age spread over 180 degrees (left half of circle). Fig 2 shows the same info squashed into 90 degrees total. In both of those figures, absolute numbers are represented by the radius. The far right figure is also squished into 90 degrees, but instead of radius it uses a cartesian y-axis to mark absolute counts (that will only work if the census is squished into 90 degrees). Each of these representations distorts counts in some way similar to how radial pie charts distort counts -- it's the area of each bar that communicates quantity, not its distance from the center -- but this was way before we were clear about that. His panel 3 would be totally confusing to interpret: old ages get expanded more than young ages, but in a non-linear way! The same graphic of an aged population would go way off the charts to the left...
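The area-versus-radius point can be checked with a few lines of Python. This is only a sketch of the geometry, assuming each single year of age gets an equal slice of the angle span and that the count is drawn as the radius, as in Fig 1 and Fig 2; the function names and the toy counts of 1000 and 2000 are made up for illustration.

```python
import math

def age_to_angle(age, max_age=100.0, span_deg=180.0):
    """Linear age-to-angle conversion: max_age years spread over span_deg."""
    return age / max_age * span_deg

def wedge_area(radius, width_deg):
    """Area of a circular sector with the given radius and angular width."""
    return 0.5 * radius ** 2 * math.radians(width_deg)

# Angular width of one year of age in each layout.
width_fig1 = age_to_angle(1.0, span_deg=180.0)  # 100 years over 180 degrees
width_fig2 = age_to_angle(1.0, span_deg=90.0)   # 100 years over 90 degrees

# Doubling a count doubles the radius, but the inked area grows
# with the square of the radius.
ratio = wedge_area(2000.0, width_fig1) / wedge_area(1000.0, width_fig1)
print(width_fig1, width_fig2, ratio)
```

So an age with twice the population gets a wedge with four times the area, which is the same trap as encoding counts in the radius of a pie slice.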
Saving the best for last! All the above is just to point out that Perozzo was being very thoughtful about angles and coordinates before he threw down on the following behemoth of a data-graphic, the reason why this blog post was actually written. BEHOLD (you must zoom in):
How does this one work you ask? It took me a while to figure this out. Clearly the red lines are censuses. Imagine them as slices, specifically represented in the manner of the center panel from the previous figure. The little arrow at the bottom points to the origin. Young ages are at the top and old ages are at the bottom. Cohorts trace along the thick black lines, as before, spiraling down towards the bottom right, and then wrapping back in. If this were a living population representation it would rotate clockwise from the top as time passes.
My thoughts are that this is somewhere on the spectrum between art and statistical graphics. It clearly conforms to a particular coordinate system, but incurs an insane amount of distortion, such that it's difficult to interpret, but it shows a view of population like we had never seen before or since. It probably does not contribute to understanding population, except perhaps in that it invites the idea of spinning more than a standard surface (his earlier work). In a growing population, the new censuses would keep coming around, and then overlap the earlier censuses like a conch shell. My first impressions were that this is how demographers would represent data in some early-industrialization steam-punk parallel universe (because it's a beautiful monstrosity involving an insane amount of misplaced ingenuity to construct).
It's tempting to try to imitate this in rgl if only out of historical curiosity, but we have no time for such things.
A tribute to the man of the day: Luigi Perozzo: we salute you!
A salute to Luigi Perozzo
We're like'n the thoughts that you thunk
What would you draw us Luigi Perozzo?
If we could get you good 'n drunk?
[Edit March 28, 2014]
That ditty was for you, Julio! -- practice makes perfect!
*See also Julio Perez' tribute to Luigi Perozzo from October, 2013: http://apuntesdedemografia.wordpress.com/2013/10/02/estereograma-de-perozzo-1880/#more-6714
Julio's demography blog is a great resource, so go check it out! (just have Google translate it for you if you don't know Spanish)
In February 1662 (352 years ago!), John Graunt presented 50 copies of his Natural and Political Observations on the London Bills of Mortality to the Royal Society, and was inducted forthwith. I'm not sure about the exact date, but it represents the approximate birthday of modern demography. Here's an indexed html conversion of that document. Apparently this was printed on Jan. 25th of that year (maybe that's the birthday?), which means that poor John Graunt was toiling away in the preceding months, and possibly would have noticed the male-biased sex ratio at birth the summer before (is that the birthday?). Working out the first lifetable ever sounds like nasty fall work to me. (Come on, let me imagine that that's how it went down!)
The cover page to citizen Graunt's momentous treatise:
The cover to the annual Bills of Mortality (London death stats) that he based it on was pretty awesome back then:
Anyway, we (Berkeley Demography) are going to pretend it's modern demography's birthday today and have called a tea time. Well done, says I!
During today's dreary walk to work I cooked up a little drinking song in memoriam (for some demography happy hour somewhere):
we stand upon John Graunt's head
we drink to the Bills of the Dead!
raise a glass in poor John Graunt's stead
raise a glass in poor John Graunt's stead!
It's short, I know, but with a barrel voice it could be cool.
[[Edit, Friday Feb 28]]
Ken Wachter informed us that Graunt may not have been the original author of his 'observations', and that the work may have been lifted from Graunt's boyhood friend, William Petty. Hervé le Bras has done some digging and he comes to some conclusions on the matter in this book.
I probably heard of Benford's law first on an episode of Radiolab. Benford's law is an odd empirical regularity about the way numbers appear in the world. It says that if we take the first digit (1-9) of each number in a large pool of numbers, the number of times that any given digit appears as the first digit should asymptotically follow a particular distribution. 1 occurs the most, and so forth:
If d is a given digit, then log10(1+1/d) (that's the base-10 log) gives the expected proportion out of all first digits you'd expect digit d to take up (above).
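For anyone who wants the reference distribution as numbers rather than a picture, here is a minimal Python sketch of the formula above, using the base-10 logarithm. The function name `benford_expected` is just a label for this example.

```python
import math

def benford_expected():
    """Expected first-digit proportions under Benford's law, digits 1-9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

expected = benford_expected()
for digit, proportion in expected.items():
    print(digit, round(proportion, 3))
```

The nine proportions sum to 1 because the logs telescope: the sum of log10((d+1)/d) over d = 1..9 is log10(10).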
Question: do demographic data obey this law? Test data: HMD death counts (Deaths_lexis.txt). Here are the results:
I'd say that yes, HMD death counts obey Benford's law. The dashed blue line simply takes the count from each death triangle and counts how many times each first digit appears for males and females (1918722 total numbers as of this writing). I've not looked at how the pattern unfolds by sex, age or over time. We do see here that some HMD populations follow the law more closely than others. This begs the question: is there something fishy about the data for countries whose death counts don't follow this law (look how lumpy some of those lines are)? To check, I measured how much each grey line departs from the red line (half the sum of the absolute deviations from the reference distribution), and ranked the HMD countries. Scores are bounded by 0 (exact match) and 1 (totally different distributions, no overlap). Therefore larger scores mean larger departures from Benford:
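The departure score is easy to sketch in Python (this is an illustration, not the R code behind the figures). Assumptions to flag: the toy counts are invented, counts are treated as whole numbers, and zero counts are dropped because they have no first digit in 1-9. How the real script handles zeros or fractional triangle counts may differ.

```python
import math
from collections import Counter

def first_digit_distribution(counts):
    """Proportion of each leading digit 1-9 among positive integer counts."""
    digits = Counter(int(str(n)[0]) for n in counts if n > 0)
    total = sum(digits.values())
    return [digits.get(d, 0) / total for d in range(1, 10)]

def benford_departure(counts):
    """Half the sum of absolute deviations from the Benford proportions."""
    benford = [math.log10(1 + 1 / d) for d in range(1, 10)]
    observed = first_digit_distribution(counts)
    return 0.5 * sum(abs(o - b) for o, b in zip(observed, benford))

# Made-up death counts standing in for one country's triangles.
toy_counts = [1, 12, 130, 17, 2, 25, 3, 41, 5, 67, 8, 9, 0, 110]
score = benford_departure(toy_counts)
print(round(score, 3))
```

Applied per country, sorting the resulting scores gives a ranking of the kind shown here. Halving the sum of absolute deviations is what keeps the score between 0 and 1.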
To put things in perspective, .12 is not a huge difference between two distributions, so let's just look at the rankings. These ranking results are unexpected to me, and I suspect will be to others. The greatest departures from Benford's law are found in Scandinavian countries, and the most Benford-obedient populations are in the former Soviet Republics and Switzerland. Go figure. It is widely held among demographers that the Nordic countries in general have the best data (and most abundant, longest-collected, etc), but perhaps this metric is ill-conceived...
Now a flurry of questions: Do we learn something from this departure? Is there any particular reason why the Benford distribution must apply? Are there too few numbers from any given country for this asymptotic property to shine through? Is the discrepancy large enough to merit further digging? Of course, looking at death triangle output from the HMD means that the data have been massaged to some degree prior to running this test (splitting 5-year age groups, stuff like that). However, if we do the same test on abridged death counts (Deaths_5x1), those that resemble lowest common denominator database inputs, the overall ranking is very similar. Does an anomalous digit distribution indicate anomalous data, an anomalous population or population process, or an anomalous application of Benford's law?
R code to completely reproduce this is on github, here. It uses the DemogBerkeley package (installed from github) to grab necessary data from the web.
Age distributions of 'hedonic wellbeing' (a.k.a. subjective wellbeing, happiness in a superficial not-aristotelian way) have been reported from time to time. Graphical evidence is more appealing to me. I'd never before seen the GSS (general social survey) until I saw it cited as the source for earlier happy-curves. Enter AJ Damico's awesome R utilities for social surveys (scripts on github here, overview and blog here). Lo and behold, AJ had a helper script for that survey. I couldn't resist.
Here are US male and female happy surfaces, because why not (click to enbiggen):
Some details on where the values for each 5-year AP square come from: There is a question 'General Happiness', coded as 'happy' when loaded into R, which is asked as follows:
157. Taken all together, how would you say things are these days--would you say that you are very happy,
pretty happy, or not too happy?
So we get those categories, plus NAs, 'I don't knows', etc. I, boldly or blithely, gave 'very' a score of 1, 'pretty' a score of .5, and 'not too' a score of 0, and ignored the other values, taking the (duly weighted) distribution of these 3 responses within each age-year-sex block. The score is the sum of the three proportions, each multiplied by its response value, and it ranges from 0 to 1, most often falling between .6 and .8. Visually, we see slight patterns in age, calendar year and cohorts (pre WWII cohorts very happy in this snapshot, would be worth busting down to triangles). It also would have been practical to simply take the proportion saying 'very' as our measure.
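Here is a minimal Python sketch of that scoring for a single age-year-sex block (the linked script is in R, and this is not a port of it). The response labels follow the question wording above; the survey weights and the helper name `happiness_score` are made up for the example.

```python
# Response values as described above; anything else is ignored.
SCORES = {"very happy": 1.0, "pretty happy": 0.5, "not too happy": 0.0}

def happiness_score(responses):
    """Weighted mean response value for one block.

    responses: iterable of (answer, survey_weight) pairs. Answers
    outside the three scored categories (NA, don't know) are skipped.
    Returns None if the block has no scored answers.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for answer, weight in responses:
        if answer not in SCORES:
            continue
        weighted_sum += SCORES[answer] * weight
        weight_total += weight
    if weight_total == 0:
        return None
    return weighted_sum / weight_total

# Hypothetical respondents in one 5-year age-period square.
cell = [
    ("very happy", 1.2),
    ("pretty happy", 2.0),
    ("not too happy", 0.8),
    ("don't know", 1.0),
    (None, 0.5),
]
score = happiness_score(cell)
print(score)
```

Dividing by the summed weights of the scored answers only is what "ignored the other values" amounts to here: the three proportions are rescaled to sum to 1 before being scored.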
If we're OK with these hypothetical happiness units, one might ask how much happiness would be experienced in an average lifetime, something like happiness expectancy. To get a tack on that, I've taken the survival-weighted age-specific happiness for each year, and summed. We get this as the time trend:
So, females have a higher happiness expectancy too! Actually the gap here is often lower than the life expectancy gap, but a hypothetical THR (total happiness rate) is too noisy to read a believable trend from. It remains to be seen whether Venezuela's Ministry of Supreme Social Happiness takes demography seriously ;-P. Bhutan probably does, but I haven't looked closely. There are certainly other measures of well-being that might mean more to most social scientists...
To be precise, this is the same as total lifetime happiness in the hypothetical stationary population, and is a synthetic measure that indexes this value for a particular year, but for no particular cohort. We could decompose the time trends or sex differences using old-school Kitagawa decomposition, and we'd see that the upward trend in happiness expectancy (h0) is mostly due to mortality gains.
Ugly details: we don't have info for kids or for the oldest old, so I assumed kids 0-14 were a constant rather-happy .8, and that people aged 90-110+ were constant at the age 85 level for a given year. The latter choice has little leverage on results, but the former will make a difference.
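Put together, the calculation looks roughly like this in Python. It is a sketch under loud assumptions: 5-year age groups, a made-up survival curve in place of real lifetable values, a flat observed happiness of .7 for ages 15-89, and hypothetical helper names. The fill-in rules for kids and for ages 90+ are the ones just described.

```python
def fill_schedule(observed, ages):
    """Complete the happiness schedule over all age groups.

    observed: dict mapping age-group start (15, 20, ..., 85) to a score.
    Ages under 15 get a constant 0.8; ages 90+ reuse the age-85 value.
    """
    filled = []
    for a in ages:
        if a < 15:
            filled.append(0.8)
        elif a >= 90:
            filled.append(observed[85])
        else:
            filled.append(observed[a])
    return filled

def happiness_expectancy(Lx, hx):
    """Sum of person-years lived (per newborn) times happiness, by age."""
    if len(Lx) != len(hx):
        raise ValueError("Lx and hx must cover the same age groups")
    return sum(L * h for L, h in zip(Lx, hx))

def toy_survival(a):
    """Made-up survivorship curve l(a) with radix 1, not real data."""
    return max(0.0, 1.0 - (a / 115.0) ** 3)

ages = list(range(0, 115, 5))  # age-group starts 0, 5, ..., 110
# Person-years lived in each group, trapezoid approximation.
Lx = [2.5 * (toy_survival(a) + toy_survival(a + 5)) for a in ages]

observed = {a: 0.7 for a in range(15, 90, 5)}  # hypothetical scores
hx = fill_schedule(observed, ages)

e0 = sum(Lx)
h0 = happiness_expectancy(Lx, hx)
print(round(e0, 2), round(h0, 2))
```

If everyone scored 1 at every age, h0 would equal e0, which is one way to read the measure: life expectancy discounted by age-specific happiness. It also shows why the .8 assumption for kids matters: ages 0-14 carry nearly full survivorship, so those 15 years get close to full weight.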
Full reproducible R script on github here. It'll download the necessary data. Sorry, you'll need to change local file paths to get it running, as well as potentially install a few packages (code given).