posted Mar 7, 2012 9:14 AM by Tim Riffe
[
updated Mar 7, 2012 9:19 AM
]
This version update fixed a couple bugs, but includes no structural overhaul. Most notable bug fix, credit to the most notable bug finder: Felix Rößger! The drop.tadj argument was only working for single population counts, but not 5-year age groups. Fixed and fixed! It's not a solid or exemplary piece of code, but the more bugs people tell me about, the better it gets! Also, just a reminder, the most up to date downloads are always on the downloads page, not the attachments at the bottom of the R Code page, which include nearly all earlier versions of like everything, and not necessarily the newest stuff.
If you see this and will be at the 2012 PAA or live in the Bay Area, shoot me a line! |
posted Feb 25, 2012 10:42 AM by Tim Riffe
------------------------------- // Another note in the name of weird demography \\ ------------------------------
Recall that there has been 1 generation recorded whose cohort life expectancy was sqrt(2) times life expectancy at birth, the "Da Vinci"-like 1866 cohort of Danish males that I was so excited to locate in the HMD and mentioned here. In the name of sanity, I ought to point out that that allusion was not all that accurate, since calendar time does not get worked in at all. The cohort/age relation works, but somewhere time itself gets lots in the mix. Well, there is one condition under which all three relations would hold: 1 calendar year = 1 year of age = sqrt(2) years lived, and that is the following:
Imagine a large spaceship, with a large population of humans on it producing babies and some demographers that want to study these humans and their offspring. IFF the spaceship is moving at a constant speed of sqrt(1/2)*c (.707 times the speed of light [this is theoretically feasible]), and successive birth cohorts are dropped off on static space stations with automatic data measurement time-synchronized perfectly with the demographers' clocks at the time off drop-off and with little gravity (e.g. Deep Space 9; Death Star too big) along the way, but the demographers remain on the ship in calendar/age reference space, then standard Lexis proportions would roughly represent the 1-1-sqrt(2) relationship that we see on paper. Namely demographers would experience perfect AP relationships on the spaceship frame of reference, but when returning to space stations to collect their automatically generated demographic data (say after the cohorts are extinct, or in periodic data pick-ups), they would notice that the cohorts had lived sqrt(2) years for each year that had passed on the demographers' clocks. This is known as the twin paradox. These temporal relationships they would render on paper just like the Lexis diagrams that we know and love.
* this is all assuming that acceleration, deceleration do not matter; i.e. when the demographers turn around to pick up the data, they decelerate instantly and are instantly moving back towards the space stations in the same speed, something that is impossible (unless of course the universe is cyclindrical)- so technically this is all only true in the limit, and even then the Lexis surfaces they derive would still have a small Tuftian lie factor.
|
posted Feb 8, 2012 6:04 AM by Tim Riffe
I've mentioned the Applied Demography Toolbox before, just wanted to reiterate that it has grown since my last mention, and that it is continually undergoing expansion. Check it out if you haven't already.
The methods included there are mostly indirect and semidirect (i.e. exciting) for dealing with deficient and/or defective data, which is par for the course in many contemporary countries, such as DHS countries, as well as in historical demography, such as I've been fiddling around with of late. It's really user-friendly too- they go into the assumptions, the checks, everything, a proper manual. I dig it and will likely look to it in the future for guidance.
I'd like to give a heads up to other R-loving demographers that Fans Willekens will be coming out (no date announcement yet) with a book in the Springer UseR series for his Biograph package, which includes tools for estimating transition rates from surveys, and other neat life course tools. For now there is the manual, which includes plenty of examples. Awesome.
|
posted Feb 8, 2012 5:01 AM by Tim Riffe
[
updated Feb 10, 2012 1:11 AM
]
At the Centre d'Estudis Demographics, students learn a bit about the concepts behind Louis Henry's idea of 'reproduction of years lived' ( here's the main reference, including commentary from Jean Bourgeois-Pichat). There is a lineage to this idea, as the center director, Anna Cabré is of the French school, and students of hers, such as Julio Pérez have carried on the idea. And to this I add Juan Antonio Fernández Cordón. Now me! The neat point behind this concept is that if fertility rates fall below the conventional wisdom level of replacement while longevity rises, a population may still be replacing itself- the replacement is just stretched over time. Eh? Say a birth cohort of 1000 people gives birth over it's lifetime to 900 children. Conventionally, we'd say this this cohort failed to reproduce itself. Boo. Say the e0 of the original cohort was 65. They then lived a total of 65000 years. Say their childrens' cohort life expectancies averaged out to 73. Then the children will have lived for 72 * 900 = 65700 years. MORE YEARS.
Is this not progress? That is to say, potentially less effot devoted to biological reproduction, but more years of human life on the planet? Such lifespan increases are not unprecedented: take the Vaupel average rate of e0 improvement of .25 (hours per hour, years per year). If the children of the first cohort were born on average, say 32 years after the birth of the 'parent' cohort, we'd expect 8 years more of life expectancy for the children. I don't know what's going to happen in the future, but this kind of number trick has plenty of precendent in the historical record.
It's a pain in the booty to calculate (directly) though. I did my best using the 1870 Swedish cohort as a reference, and these were my steps:
1) e0 for 1870 Sweden (males and females together) was 50.14 (taken straight from the HMD)
2) We want to compare the 1870 cohort's e0 with the average e0 of their offspring. This means we need to take the weighted mean of several cohort e0s, those 12-55 years afterward, which are available from the HMD up until 1919 as of this writing. To fill in for 1920-1925 I just did a linear extrapolation, since the rate of improvement over these cohorts was linear (good luck for me). To weight these, I decided to use NTFR, discussed in the prior post- makes sense to me.. I thus needed 2 more items to do the weighting- 1870 cohort ASFR and 1870, available from the HFD starting in 1892. Again, boo, since I want these rates starting in 1882. I assumed a rate of 0 for age 12 and used a monotonic spline to interpolate (the Hyman spline, available from Rob J Hyndman, and which I just call into R using source("http://robjhyndman.com/Rfiles/interpcode.R"). This is an awesome tool for filling out the tails of rate distributions. With this, I had the full fertility curve for 1870 females, which when multiplied with the vector of lifetable exposures, female Lx (also HMD), gives the net age-specific fertility rates. This vector, weights the children's cohort e0s. Final number: 60.35.
3) the initital size of the 1870 cohort (all together) was 119838.
4) the 1870 cohort gave birth to a total of 118504.6 children +/- a few (HFD cohort births, plus some data mungery for years 1882-1891).
* So, 119838 people produced 118504.6 people, meaning, so only 98.887% of the job completed? * However, the children lived 60.35/50.14 = 1.203684 = 120% longer, on average. * and so the children lived a total of .98887 * 1.203684 = 1.190291 = 119% more total years than the 1870 cohort.
There you have it. The years produced in 1870 Sweden contributed an additional 19% more years to Swedish humanity, despite not having replaced their own cohort size.
|
posted Jan 31, 2012 7:56 AM by Tim Riffe
[
updated Feb 6, 2012 10:41 AM
]
Today something classic. A neat little exercise. Usually to look at the reproductive efficiency of a population we would caluclate the net reproduction rate, R0, sum(fxf*Lx), where Lx are the lifetable exposures of females calculated with a starting population (radix) of 1, and where fxf are fertility rates, but only taking into account female births. I have the impression that it's not so common to cite R0 when writing demographic reports for contemporary countries, but it really is interesting, and very informative when decomposed over time or between populations.
I've been thinking with a friend of mine about what kind of measure one might like to use to approximate the reproductive efficiency of a population. Total Fertility (TFR) for a given population is a measure of how many births a mortality-free cohort of females would have per capita, assuming the rates observed over all ages and cohorts in a particular year. Of course, it's fraught with tempo and quantum danger, but it's still a good point of departure.
Let's make TFR not mortality-free. Like, assume 1000 women (err. 1000 female person years) ages 20-24 had 150 births, for a fertility rate of 0.15. If 1100 females had to be born in order for those 1000 person years to happen, then 100 were lost along the way (this is an island and they don't have boats). Imagine that the females that died would have had the same fertility as the ones that lived, and now we make them not die: Then the rate of 0.15 gets multiplied by 1100 instead of by 1000, and we would have seen 165 births, 15 more than those actually observed. Hmmmm.
Now let's make our lives easier. Where TFR is the sum of observed age-specific fertility rates in a year, which is the same as sum(fx*1). TFR discounted for mortality looks strikingly similar to R0: sum(fx*Lx). Let's call this last one net total fertility (it probably already has a name, but I'm not going to leaf through a manual just now). Net TFR is always lower than TFR. The difference can be thought of as potential births lost to the mortality of potential mothers. Here's what they look like for almost 120 years in Sweden: Hopefully my terminology isn't too far off. Anyone perceive phantom grid lines in the white space? Pretty cool eh? The ratio of these two components is my running definition of reproductive efficiency: That about sums it up. This second graph is purged of the overall level of fertility, just a pure measurement of the efficiency of fertility in a given year. Not all that representative of any particular cohort, though it could be calculated in the same way for cohorts. And now I have to run unexpectedly. I'll post the R code and data tomorrow.
----- And back! I've omitted the data-munging aspects and just start with a single matrix that contains everything we need, attached at the bottom of this post.. For reference, the Lx column is just the Lx calculated by the HMD in the file called fltper_1x1.txt for Sweden- I divided it by 10000 to give it a radix of 1. ASFR is from the HFD for Sweden, using the file called asfrRR.txt (period). Calculating some off-beat measuresload("SE.Rdata") head(SE) # console output: Year Age Lx ASFR [1,] 1891 0 0.93621 0 [2,] 1891 1 0.88843 0 [3,] 1891 2 0.86543 0 [4,] 1891 3 0.85036 0 [5,] 1891 4 0.83887 0 [6,] 1891 5 0.82990 0
years <- sort(unique(SE[, "Year"])) N <- length(years) TFR <- NETTFR <- OH <- Eff <- c() for (i in 1:N){ ind <- SE[,"Year"] == years[i] Fx <- SE[ind, "ASFR"] Lx <- SE[ind, "Lx"] TFR[i] <- sum(Fx) # TFR NETTFR[i] <- sum(Fx * Lx) # net TFR OH[i] <- TFR[i] - NETTFR[i] # overhead Eff[i] <- OH[i] / TFR[i] # efficiency } Code for the first plot: mortality overhead for period fertilityplot(years, rep(0, length(years)), type = "n", ylim = c(0, 6), xaxs = "i", yaxs = "i", axes = FALSE, xlab = "", ylab = "", sub = "Conventional TFR is the sum of these two components", main = "Period fertility discounted by female morality\nSweden 1891-2008, HFD&HMD") polygon(c(years, rev(years)), c(rep(0, N), rev(TFR)), col = gray(.7), border = NA) polygon(c(years, rev(years)), c(TFR, rev(TFR + OH)),col = gray(.3), border = gray(.3), lwd = .1) abline(v = seq(1900, 2000, by = 20), col = "white", lwd = 1.5) abline(v = seq(1890, 1990, by = 20), col = "white", lwd = .5) abline(h = seq(1, 5, by = 1), col = "white") text(1890, seq(0, 5, by = 1), seq(0, 5, by = 1), pos = 2, xpd = TRUE) text(seq(1900, 2000, by = 20), - .2, seq(1900, 2000, by = 20), xpd = TRUE) legend("topright", fill = c(gray(.3), gray(.7)), legend = c("TFR lost to mortality", "net TFR"), bty = "n") text(1885, 5.5, "Births", xpd = TRUE) I did my best to follow Tuftian principles in these figures, comments welcome. Perhaps a bit too artisanal for the likes of most.
Code for the second plot: Fertility efficiencyplot(years, Eff, type = 'n', main = "% of potential TFR lost to female mortality\nSweden, HMD&HFD", xaxs = "i", yaxs = "i", ylim = c(0, .3), axes = FALSE, xlab = "", ylab = "") polygon(c(years, rev(years)), c(Eff, rep(0, length(years))), col = gray(.7), border = NA) abline(v = seq(1900, 2000, by = 20), col = "white", lwd = 1.5) abline(v = seq(1890, 1990, by = 20), col = "white", lwd = .5) abline(h = seq(.05, .25, by = .05), col = "white") lines(years, Eff, col = gray(.5)) text(1890, seq(0, .25, by = .05), c("0%", "5%", "10%", "15%", "20%", "25%"), pos = 2, xpd = TRUE) text(seq(1900, 2000, by = 20), - .01, seq(1900, 2000, by = 20), xpd = TRUE) text(1950, - .03, "Year", xpd = TRUE) So why this measure instead of net reproduction, R0?: Because I think the idea of a monosex population is weird! On the other hand, the present measure isn't all that great either because I only discounted for female mortality. The two sex problem is EVERYWHERE. Look over your shoulder NOW!
--- and again, with a fertility retrojection! --------- So, the HMD give mortality estimates back to 1751 in Sweden, but the HFD goes back 'just' to 1891. With no ASFR or estimate thereof prior to 1891, we can't bring this indicator back any farther in time. So here's my proposal:
1) we take the HFD asfr curves for 1891-1901, and turn each into a pdf (i.e divide each rate by the sum of all rates, such that now all curves sum to 1). 2) now we have 10 fertility PDFs, while there is a time trend in how they change (bummer), it isn't that great. Here's an image of all 10 curves with the age-specific mean (which yay arithmetic also sums to 1): just a diagnostic. The black line will be the 'standard' pdf that we retroject.
3) Now for each year prior to 1891, multiply the standard curve, above, with the female exposures given in the HMD, then sum to arrive at a false sum of births 4) haha! The HMD has already given us a long time series of estimated births, back to 1749, so we'll treat these totals as the 'real' births. Then the ratio of (real/false) is our scale factor for the standard curve. 5) voila, in each year we rescale the standard pdf according to the ratio in step 4, arriving at an ASFR estimate. 6) the Lx is already in the HMD back to 1751, and now we have an estimate of ASFR back to 1751, and that's all we need to calculate THIS: Note that I went ahead and did 1 - Eff so that it's easier to understand the percentage as the concept of 'efficiency'. I totally dig this result. The famines and such really shine, but really, the 'efficiency' concept really lends itself more to cohorts than to periods, so sometime I'll have to come back and recalculate this- when I do, we'll see a much less noisy graph I think.
[edit on Feb 6th, paraphrasing some comments I received today from Adrien Remund and my responses :] 1) isn't this basically the same as R0/fxf?: yes, they should covary almost perfectly. No big deal which measure we take I don't think, I just prefer to speak of the whole population. 2) what about other fertiltiy shapes- is this a strong assumption?: I did a bit of sensitivity testing by moving the above standard curve 2 ages to the left and 2 ages to the right. When reproducing this last figure, the results don't vary all that much. A later fertility curve is subject to higher mortality, and so needs to have a higher TFR, but will have a lower reproductive efficiency and vice versa. When shifting years in either direction, it doesn't make that big of a difference, and definitely doesn't change the story. I may come back and post those results at a later date, and when more complete. Notably, I didn't check for other shapes, such as those that are left-leaning, i.e. sharp increases in fertility at the beginning and a fat tail on the right, rather than the near-bellcurve used here (which may have been coincidental to the reference years). I'll try out different shapes, and maybe some assumptions of changing shape over time, but don't anticipate major changes in results. Good call though Adrien.
3) Adrien rocks, because he goes ahead and busts out an Excel to try things out, and he even sent me along the hutterite schedule to compare that shape! He's full of golden references. By the way, as soon as he says 'jump' I'm going to write up the awesome paper he'll presenting this year at the EPC in Stockholm. Stay tuned fertility and second demographic transition folks! |
posted Jan 23, 2012 3:30 PM by Tim Riffe
[
updated Jan 23, 2012 3:33 PM
]
Earlier I posted how to plot Lexis surfaces of APC triangles in ggplot2, and now I've got the trick in lattice! The main difference, aside from the plot calls, is in the data prep. I can also say the lattice renders the surface blazi'n fast compared with either base or ggplot2. This makes a big difference when you need to generate a plot several times, what with the iterative nature of getting things right. I'm pretty sold on the lattice method for the time being. I've attached the same data to this post- only difference is that I already took out all the rows with ASFR values of 0, to save a line of code. Same ID variable has also been taken care of earlier. Here goes:
Load the data: load in the data (Sweden APCSFR from HFD)load("DATA.Rdata") head(DATA) # console output: ID Code Year Age Cohort ASFR 85830 SWE1891131878 SWE 1891 13 1878 0.00001 85831 SWE1891131877 SWE 1891 13 1877 0.00003 85832 SWE1891141877 SWE 1891 14 1877 0.00002 85833 SWE1891141876 SWE 1891 14 1876 0.00002 85834 SWE1891151876 SWE 1891 15 1876 0.00027 85835 SWE1891151875 SWE 1891 15 1875 0.00064 Define colors and breaks, this time separately for the plot and for the legend. These colors can definitely be improved upon... They might be a little different than the ggplot2 version, I forget if they were defined the same. Colors and breaksmyfxHCLramp <- function(H,C=95,L,N=5){ # H and L must be of equal length colsi <- c() for (i in 1:(length(H)-1)){ Hi <- seq(H[i],H[i+1],length=(N+1)) Hi[Hi<0] <- Hi[Hi<0]+360 colsi <- c(colsi,hcl(h=Hi,c=C,l=seq(L[i],L[i+1],length=(N+1)))[ifelse(i==1,1,2):(N+1)]) } colsi }
H <- seq(255,-60,length=6) L <- seq(75,15,length=6) # L is luminance L[1] <- 99 # colors for polygons: cols <- myfxHCLramp(H=H,C=95,L=L,N=50) brks <- seq(0,.25,length=251) # colors for legend: colsl <- myfxHCLramp(H=H,C=95,L=L,N=5) brksl <- seq(0,.25,length=26) Now we need to make a couple of functions to help us get the data into a format that lattice will like. First, I just realized today that there's never a need to repeatedly call polygon(): you can make multiple polygons in a single polygon call by separating them with NAs. Hence our triangle coordinate functions adds a row of NAs below each output. This little trick it what picks up the speed I think, more than the simple fact of this being lattice. We'll make a separate vector of colors later, since each 3 coordinates only needs a single color- there's no sense putting them in the same data.frame. LexTriLatticePrep and Color-Picker-OuterLexTriLatticePrep <- function(x){ ID <- x["ID"] X <- as.integer(unlist(x[c("Year","Age","Cohort")])) # lower if (diff(c(X[2],X[1]))==X[3]){ xcoord <- c(X[1],X[1]+1,X[1]+1) ycoord <- c(X[2],X[2],X[2]+1) } else { #upper xcoord <- c(X[1],X[1]+1,X[1]) ycoord <- c(X[2],X[2]+1,X[2]+1) } # the NA on the bottom separates data.frame(ID=c(rep(ID,3),NA),Year=c(xcoord,NA),Age=c(ycoord,NA)) }
getCols <- function(ASFR,cols=cols,brks=brks){ rev(cols[! (brks-ASFR) > 0])[1] } Now we make 2 biggish data.frames, one for the triangle coordinates (triangles separated by NAs) and another for our colors. ColDF for colors and POS for coordinateslibrary(plyr) ColDF <- data.frame(ID=DATA$ID,Color=sapply(DATA$ASFR,getCols,cols=cols,brks=brks),stringsAsFactors=FALSE) # rbind.fill rbinds each element of the massive list produced by the call to apply POS <- rbind.fill(apply(DATA,1,LexTriLatticePrep)) head(POS) # console output: ID Year Age 1 SWE1891131878 1891 13 2 SWE1891131878 1892 13 3 SWE1891131878 1892 14 4 <NA> NA NA 5 SWE1891131877 1891 13 6 SWE1891131877 1892 14 I don't find this the most intuitive way to do things, but with lattice anything 'fancy' needs to be done via panel functions. Here we're going to make one panel function that will do pretty much all of out fancy stuff: it will draw the triangles and all the reference lines. This function then gets called with the main lattice plotting function, xyplot(). important panel functionlibrary(lattice) cohint <- -(unique(DATA$Cohort[DATA$Cohort%%5==0])) lexisreferencelines <- function(...){ # I think this ... start is required panel.xyplot(...) # plot all triangles in a single call! no iterating! panel.polygon(x=POS$Year,y=POS$Age,col=ColDF$Color,border=NA) # year lines panel.abline(v = seq(1895,2010,by=5),col="#80808030") # age lines panel.abline(h = seq(15,55,by=5),col="#80808030") # cohort lines for (i in 1:length(cohint)){ panel.abline(a=cohint[i],b=1,col="#80808030") } } Nice, still no plotting has happened yet: the panel function is just part of the setup. Now we will make the plot object, but not plot it yet. I'm separating these two steps so that it's easier to understand for peo The main xyplot() callLatticeLexis <- xyplot(Year~Age, data=POS, col=NA, xlim=c(1891,2010), ylim=c(12,56), aspect=diff(c(12,56))/diff(c(1891,2010)), # different definition of aspect ratio... #set up x axis then y using scales: scales=list(x=list(at=seq(1895,2010,by=5),labels=seq(1895,2010,by=5)), # set up y axis: y=list(at=seq(15,55,by=5)),labels=seq(15,55,by=5) ), # to get color strip legend on the right: legend = list(right = list(fun = draw.colorkey, args = list(key = list(col = colsl,at = brksl),draw = FALSE))), panel = lexisreferencelines, main="Sweden, APC Fertility 1891-2009 (HMD)" ) Notice the panel function gets called at the very end. Also, the aspect ratio is assured in a pretty strange way: you really need to specify a ratio. You can't just say 1. Scales and the legend are handled in their own special lattice way, but those are just details here. To plot it: plotting the lattice objectdev.new(height=6,width=14) print(LatticeLexis) and this is what you get: this is a low resolution image, but you get the idea. I can say that this thing plots in a split second, totally impressive. Not the most impressive colors, but that's an improvement to be made another day.
There you have it, APC triangle Lexis surfaces in base (all over this site), ggplot2 (earlier post) and now lattice.
|
posted Jan 23, 2012 8:00 AM by Tim Riffe
[
updated Jan 23, 2012 12:10 PM
]
I've been meaning to get started with ggplot2 for some time now, but have been using weird plotting problems like Lexis surfaces as an excuse not to. It turns out there's no excuse, since it's still possible. Since it's not an obvious way to go about plotting, I'm going to put in a full example, using the attached DATA.Rdata file below. It contains Swedish age specific fertility rates classified by both cohort and year, except I took out the "-" and "+", and I added an ID column. Let's get started the example data# download the data attached to this post load("DATA.Rdata") head(DATA) # console output: ID Code Year Age Cohort ASFR 85829 SWE1891121878 SWE 1891 12 1878 0.00000 85830 SWE1891131878 SWE 1891 13 1878 0.00001 85831 SWE1891131877 SWE 1891 13 1877 0.00003 85832 SWE1891141877 SWE 1891 14 1877 0.00002 85833 SWE1891141876 SWE 1891 14 1876 0.00002 85834 SWE1891151876 SWE 1891 15 1876 0.00027 Each row of this data.frame is a triangle to be rendered, meaning that we need to determine 3 x positions and 3 y positions for each row. We will be using the geom_polygon geom of ggplot2, which knows where to stop and start new polygons by using the ID variable. The strategy: for each row of DATA (there are 10320 rows here), we want to create a new 3x3 mini-data.frame with a column identifying the triangle ID, a column with x coordinates (Year) and a column with y coordinates (Age). Here's my function for that: LexTriCoords# where the argument x is a row of DATA LexTriCoords <- function(x){ ID <- x["ID"] X <- as.integer(unlist(x[c("Year","Age","Cohort")])) # lower if (diff(c(X[2],X[1]))==X[3]){ xcoord <- c(X[1],X[1]+1,X[1]+1) ycoord <- c(X[2],X[2],X[2]+1) } else { #upper xcoord <- c(X[1],X[1]+1,X[1]) ycoord <- c(X[2],X[2]+1,X[2]+1) } data.frame(ID=ID,Year=xcoord,Age=ycoord,check.rows=FALSE) } Now, we apply() this function to DATA to get a data.frame 3 times longer, but organized to give us coordinates the way ggplot wants to receive them: Apply LexTriCoords to DATA# use the plyr package to collapse output from the call to apply- There are other ways, but this is faster. library(plyr) POS <- rbind.fill(apply(DATA,1,LexTriCoords))head(POS) # console output: ID Year Age 1 SWE1891121878 1891 12 2 SWE1891121878 1892 13 3 SWE1891121878 1891 13 4 SWE1891131878 1891 13 5 SWE1891131878 1892 13 6 SWE1891131878 1892 14 See, each ID is repeated for 3 lines. Now we add in the ASFR info (this could have been done within the function we applied, but I forgot to include it). Then take out all the ASFR values of 0, since we don't want them to plot. datapoly is the data.frame to be plottedASFR <- data.frame(ID = as.factor(DATA$ID),ASFR = as.numeric(DATA$ASFR)) datapoly <- merge(ASFR, POS, by=c("ID")) datapoly <- datapoly[datapoly$ASFR > 0,] Now, before we get started plotting, there are some details to take care of. First of all, we want to be able to plot 5-year reference lines over the surface (we'll do it in semitransparent light gray, no worries)- for the diagonal ones we'll need another data.frame with info for the lines. The trick is to use ggplot2's version of abline(), using slope intercept form... a and b for cohort abline'sa <- -rev(unique(DATA$Cohort)[unique(DATA$Cohort)%%5==0]) # we pick out the cohorts and remember that a is the 'y' intercept b <- rep(1,length(a)) DF <- data.frame(a,b) # I'd prefer not to do it this way, but the abline function wants a data.frame later. and finally, let's work out the colors to be used in the plot. We'll be specifying custom colors using the scale_fill_gradientn scale. This thing takes care of both the colors in the plot (which are continuously interpolated), and the color strip legend (over which you don't have much control, as far as I can tell). We'll need to give it a vector of colors for the legend, the breakpoints for the legend and for purposes of color interpolation, and the legend labels: colors and breaks# my dorky HCL ramp function, useful pretty much only for this kind of plot. # assuming you have clear values between which you want specific colors to interpolate, this thing will set things up right: myfxHCLramp <- function(H,C=95,L,N=5){ # H and L must be of equal length colsi <- c() for (i in 1:(length(H)-1)){ Hi <- seq(H[i],H[i+1],length=(N+1)) Hi[Hi<0] <- Hi[Hi<0]+360 colsi <- c(colsi,hcl(h=Hi,c=C,l=seq(L[i],L[i+1],length=(N+1)))[ifelse(i==1,1,2):(N+1)]) } colsi }
H <- seq(255,-60,length=6) L <- seq(75,15,length=6) # L is luminance L[1] <- 99 # we blend in for near-white # cols = colors, brks = breaks, labs = labels cols <- myfxHCLramp(H=H,L=L,N=2) brks <- seq(0,.25,length.out=11) labs <- c(">0","0.025", "0.050", "0.075", "0.100", "0.125", "0.150", "0.175", "0.200", "0.225", "0.250")
Now we're ready to make a big ggplot() function call. Really this can all be done additively, but the plotting itself takes a few minutes because the data.frame is quite big and the surface is made of a large number of triangles. Explanations to follow: the ggplot() calllibrary(ggplot2) dev.new(width=14,height=6) (LexFx <- ggplot(datapoly, aes(x=Year, y=Age)) + geom_polygon(aes(fill=ASFR, group=ID)) + scale_fill_gradientn(colour= cols,limits=c(0,.25),breaks=brks,labels=labs,space="Lab") + theme_bw() + coord_equal(ratio = 1) + scale_x_continuous(expand=c(0,0),breaks=seq(1895,2010,by=5),labels=seq(1895,2010,by=5),limits=c(1891,2010)) + scale_y_continuous(expand=c(0,0),breaks=seq(15,55,by=5),labels=seq(15,55,by=5),limits=c(12,56)) + geom_vline(xintercept=seq(1895,2010,by=5),colour="#80808030") + geom_hline(yintercept=seq(15,55,by=5),colour="#80808030") + geom_abline(aes(intercept=a, slope=b),data=DF,colour="#80808030") ) looks like a lot of lines, but it's fewer than doing this in base, I can vouch for that. ggplot() does take longer to actually make the thing though... OK, the lines: - ggplot(datapoly, aes(x=Year, y=Age)) : lays the thing out, starts out object (LexFx)
- geom_polygon(aes(fill=ASFR, group=ID)) : tells it that we'll be plotting custom polygons, and that the fill color is a function of ASFR, it also knows to split polygons using the ID variable.
- scale_fill_gradientn(colour= cols,limits=c(0,.25),breaks=brks,labels=labs,space="Lab") : interpolates colors for the polygon fills AND defines the color legend. cols was a vector of 11 colors defined earlier, limits is probably redundantl breaks would be the legend labels, but we leave it here so that the 11 color vector lines up correctly with particular break values; we want custom labels to be able to write ">0", and finally we prefer to interpolate over Lab space rather than RGB (RGB doesn't blend well here). I wish it could HCL interpolate, but I guess that's for another day.
- theme_bw() : otherwise you get the ggplot2 background grey sticking out on the edges.
- coord_equal(ratio = 1) : this is a way to set the aspect ratio to 1 in ggplot2 (probably not the only way)
- scale_x_continuous(expand=c(0,0),breaks=seq(1895,2010,by=5),labels=seq(1895,2010,by=5),limits=c(1891,2010)) : this lets us be particular about the x axis- we want ticks and labels every 5 years, not every single year.
- scale_y_continuous(expand=c(0,0),breaks=seq(15,55,by=5),labels=seq(15,55,by=5),limits=c(12,56)) : likewise for the y axis ticks and labels
- geom_vline(xintercept=seq(1895,2010,by=5),colour="#80808030") : draws vertical reference lines every 5 years in transparent medium gray
- geom_hline(yintercept=seq(15,55,by=5),colour="#80808030") : same for age lines
- geom_abline(aes(intercept=a, slope=b),data=DF,colour="#80808030") : a trick to draw the cohort lines, using the object we made earlier containg slopes and intercepts. Unfortunately I don't know how to put cohort labels on (upper and right side). This is one of the (few) reasons why I've been making these in base graphics so far, but it's a small tradeoff here.
It's not bad, though I still need to iron out some details (still new to this game). Hopefully this will help get some demographer (other than myself) more into ggplot2.
A word on color. For the time being take this as just my opinion, but it is based on some stuff I've been reading: surfaces appearing earlier on my site have naively used colors interpolated over the RGB colorspace, but this is bad for a few different reasons. * It makes some colors appear brighter than others in no particular order * They print terribly in grayscale * colors get dimmer when blending, creating a wavy ringy look * in short, RGB ramps don't match perception well
This particular ramp was an hcl (hue, chroma, luminosity) ramp made in a convoluted sort of way. There are ramp functions for hcl colors in R, but they didn't produce 'exactly' the kinds of ramps I wanted. There is a big problem in the particular colors used above, and that is the combo of red and green, which shuts out about 10% of males from properly enjoying the plot. I'm sorry! On the other hand, color blind folks should still be able to sense the gradients between reds and greens because there is still a luminosity gradient going on there- it's just less clear than also having colors to reference. In general, using hcl colors is better practice, I'm just trying to get the hang of it. I may think more and write more on this topic at a later time.
For now, enjoy lexis surface plots in ggplot2! |
posted Jan 21, 2012 3:33 AM by Tim Riffe
[
updated Jan 21, 2012 5:40 AM
]
I'm proud to announce the winner of the Lexis generation contest: Iceland males, 1866! The ratio of cohort life expectancy for males born in 1866 Iceland (36.31) to period life expectancy in 1866 (25.61), 1.4178055, is VERY close to the square root of 2: 1.414214. 1866 Iceland males on average outperformed lifespan expectations at the time of their births by slightly over 41%. I cannot emphasize enough how incredibly awesome this is. It means that in a da Vinci sort of way, 1866 Iceland males' mortality conformed to this shape nearly perfectly: A testament to the perfection of man.
now all we have to do is imagine that calendar years for these males passed by at sqrt(2) times faster than the speed of time so that the period dimension is correct? Oh man, now it's all messed up again.. |
posted Jan 20, 2012 12:44 PM by Tim Riffe
[
updated Jan 20, 2012 1:39 PM
]
This post is just as much a question I'm posing as it is a demonstration of something I recently discovered. I'm not primarily a statistician, so beware, lest I give bad advice in what follows. ----------------------------------------------------------- Lately I've been making figures for a paper in preparation that falls under my awesome adviser's project, WORLDFAM. I won't get into the substance, just my recent realizations on form in graphics and how I just (yesterday) realized that simple bivariate OLS regression does not have the properties that I unknowingly assumed that it did. This is probably something that seasoned students of statistics have either heard explicitly or stumbled upon. When it happened to me I was flabbergasted and thought there was a typo in my code. No ma'am.
The back story: I prepared figures for a presentation in Vienna a couple months ago. These were flashy colorful scatterplots of x and y, where x and y were both percentages, such as percentage of females attending school versus percentage that are mothers or in union, with 70+ countries. We focused on young ages, like 12-24, and so in the scatterplots, each age was distinguished with its own color, and an OLS was fit to each age separately. This was just a data exploration exercise, but ended up being very useful for the presentation. Here's an example of one such scatterplot from the presentation in Vienna: The points represent age-specific aggregate percentages for two variables for females from several dozen countries, and you can see how the relation between these two variables changes over age, and assuming you like the idea of using the slope of each line as indicative of the relationship, you can really get a quick overview in the legend. Lots of information, perhaps too much? Anyway, it's straightforward to understand what the figure is telling us.
Act II-------------------- I recently read Tufte's "Visual Display of Quantitaive Information", and can honestly say it's changed the way I think about creating and using statistical graphics. These days I'm going around telling my quant friends to read it. Now you know too! Bottom line: we're now finishing up the paper from that conference to be reviewed for publication and I've been working to redesign all the figures to be intelligible in grayscale. The above plot fails miserably when printed in grayscale. Here's my Tufte-inspired (hopefully I don't befoul his name) redraft: (this is a pre-draft, I'm still going to label a few of the slopes, but you get the idea) There are several non-trivial changes here! 1) no points, still considering this, perhaps symbols and creative labeling for the lines? 2) no gridlines, no frame 3) confidence intervals plotted rather than significance stated with stars 4) The big change: the axes have been swapped.
Everything to follow derives from point 4, and I hope not to ramble too much. When I first swapped the x and y, I naively fit OLS lines in the same way. Great, except when it came time to interpret the slopes, I didn't come to the same conclusions as the first time. I checked, the slope of the fit after swapping x and y was NOT the inverse of the original. A code typo? 1 hour wasted determining that no, no code typo. Enter this post from MathExchange. To put it simply, OLS works by minimizing the sum of the squared *vertical* residuals- when I swapped x and y, it also switched out which residuals were to be minimized, and so the two fits are unique and not inverse. Bummer. Is there no symmetrical way to this? Do I simply average the two fits in some way? The answer goes under many names: Orthogonal Least Squares, Total Least Squares, Deming Regression, and probably more. The idea is to fit by minimizing the sum of the squared residuals that are measured as the *perpendicular* distance from the line. If you do so, then you can flip the axes and the line comes along for the ride! I couldn't find such a function in base R, although I come to understand that it is identical to the first principal component in princomp(), and may look into that more. I ended up using the Deming() function in the MethComp package by Bendix Carstensen, who kindly gave me a tip on how to bootstrap the confidence bands shown above, since the function was lacking a predict method. There are apparently other ways of doing this in R, and for personal enrichment I also wrote a function that minimizes the sum of the squared orthogonal distances in a way I can understand mechanically: OrthogLSOrthogLS <- function(x,y){ SumofSquaredOrthogonalResids <- function(par,x,y){ m <- par["m"] ; b <- par["b"] sum((abs(y-m*x-b)/sqrt(m^2+1))^2) } starts <- lm(y~x)$coef pars <- list(b=starts[1],m=starts[2]) optim(par=pars,fn=SumofSquaredOrthogonalResids,x=x,y=y)$par } This I now feel validated in having done this, since I've this post from Bill Venables on RHelp. He then goes on to say "the resulting line is not optimal for either predicting y for a new x or x for a new y", although Prof. Ripley (in the same thread) goes on to justify the practice in certain circumstances, such as calibrating measurements of the same quantity to each other if the two measurement sets also happen to have the same variances.
So, here's my big question: the linear fits in the grayscale plot above have been done using orthogonal least squares fitting (using the Deming() function. I think that the propensity to err in the measurement of those two variables is about equal. I am obliged by peer pressure to not make statements about the direction of causality when talking about this plot -> when I throw and OLS line on there though it makes a really big difference which is my x and which is my y. So, is this a happy compromise? Does the symmetrically fit line describe the data better? Is the slope of a such a line a more reliable talking point? Here's a toy example, in case someone would like to jump in: Toy exampley <- c(49.239766, 47.037915, 22.425033, 92.161081, 75.298100, 66.200922, 53.815491, 54.190434, 79.998622, 12.516807, 78.462177, 77.903252, 54.442483, 61.951685, 65.908460, 64.253315, 7.808119, 71.269211, 15.849700, 72.450935, 59.193253,87.117728, 76.141267, 68.661506, 55.576316, 26.420523, 66.800155, 62.750947, 32.613086, 73.722081, 64.426035, 58.145996, 62.629611, 67.529446, 37.414968,88.191353, 76.035453, 54.515327, 84.309417, 56.477642, 29.438982, 85.168253, 60.545666, 87.712789, 64.851906, 71.420197, 60.502092, 69.147231, 42.337102,46.409740, 20.562824, 36.565379, 61.446797, 64.155112, 62.427641, 85.020644, 68.879758, 78.993385, 84.826404, 35.151776, 57.114365, 55.071618, 59.806793,76.823016) x <- c(6.159844, 29.100150, 19.448066 ,1.138328, 9.642995, 6.325263, 20.488188, 13.147119, 18.185829, 26.841392 , 6.526878, 11.348219, 21.178090, 26.431718, 34.804482, 21.044291, 38.772400, 5.543061, 31.594873, 29.173062, 5.080554, 4.207480, 10.163295, 13.940471, 6.415701, 12.562935, 4.863480, 20.434148, 7.613086, 5.291560, 9.301775, 4.218237, 6.904172, 7.995674, 12.752680, 5.223810, 6.610673, 12.054681, 4.674597, 12.230650, 8.993638, 9.288743, 9.676719, 1.858038 ,16.042552,17.137030, 26.464564, 24.128718, 22.716664, 34.129038, 29.597630, 16.353631, 6.020501, 11.528146, 7.624661, 6.064460, 5.579599 , 2.837447, 7.791589, 26.908167, 27.000846, 3.281563, 19.595554, 3.068228)
# to illustrate the "problem": plot(x,y,pch=19,col="#00FF5550",main="x and y can't flip with lm()",xlim=c(0,100),ylim=c(0,100)) points(y,x,pch=19,col="#0055FF50") # note that the x and y coords of the points are simply swapped. LMyx <- lm(y~x) LMxy <- lm(x~y) abline(LMyx,col="#00FF55") abline(LMxy,col="#0055FF")
# the lines can't be flipped and maintain the same interpretation. # : only vertical residuals are used to optimize, and the vertical # variable changes when we swap x and y. LMyx$coef[2];1/LMxy$coef[2] LMxy$coef[2];1/LMyx$coef[2]

So here is how to do the fit using Deming(): [SAME POINTS] Text Box# install.packages("MethComp") library(MethComp)
plot(x,y,pch=19,col="#00FF5550",main="Deming slopes are inverse when x and y are swapped",xlim=c(0,100),ylim=c(0,100)) points(y,x,pch=19,col="#0055FF50") DM <- Deming(x,y)[1:2] abline(a=DM[1],b=DM[2],col="#00FF55") DM2 <- Deming(y,x)[1:2] abline(a=DM2[1],b=DM2[2],col="#0055FF") # perfect inverse slopes: DM2[2] ; 1/DM[2] 
To restate, much of what I'm learning about statistics is on the street, and I truly hope that I'm not misusing this line fitting procedure, but in this context I think it's pretty innocent. The fact that I haven't seen it elsewhere in the social sciences makes me think that it's not the right tool, but then the "problem" I had earlier was in my mind problematic, and this appears to be a decent compromise. I know the naive OLS fits could also take into account heteroskedasticity, be robust, and a variety of other things; none of these fancier fits would solve the initial observation of regression not spitting back inverse answers when x and y are swapped. If I had a million dollars I'd go back to the ICPSR. I may revisit this post in the future.
|
posted Dec 13, 2011 2:59 AM by Tim Riffe
[
updated Dec 13, 2011 7:00 AM
]
My friend Adrien Remund informed me of another index measure of digit-preference age-heaping proposed by Thomas Spoorenberg (2007) here. It produces results that are rather consistent with Myers' blended method, but that are comparable with the flawed but more commonly used Whipple index. Spoorenberg discusses this in the linked article, as well as some other indices prior to his own. He proposes a measure that also blends the absolute deviation from uniformity over all digits. It works like this (See the full formulas in his article): bias for declaring age 30 can be roughly measured by taking (5*P30)/sum(5P28), where 5P28 means the age interval from 28 to 32, i.e. centered on the age in question. Preference for terminal digit 0 is then calculated as five times the sum of all relevant numerators, i.e. 5*(P30+P40+P50+P60), divided by the sum of all the 5-year age intervals centered on the ages in the numerator. This ratio he calls W0, or Wi in general, as it can be calculated for any terminal digit. If there is no particular preference for the digit in question, then Wi should be close to 1. If there is bias for or against declaring a particular digit, the ratio will deviate from 1. Spoorenberg proposes to repeat this excercise for each digit 0-9, and defines Wtot as the sum of the absolute deviation from parity of the digit-specific indicators, Wi. He then goes to show that results are consistent with the Myers index but comparable in order and magnitude to Whipple indices. I do not agree with him that Wtot is easier to calculate than the Myers index- it is indeed easier to wrap your mind around the calculation itself, but the Myers index at least has a palatable result: 'the minimum proportion [siq: it's actually a percentage estimate ] of the population for whom an age with an incorrect final digit is reported.''(Shyrock et al (1976 condensed edition)). It ranges from 0 (no detectable heaping) to 90 (everyone reports the same terminal digit).
Anyway, here's an R-tastic Wtot calculator: Wtot (Spoorenberg (2007)Wtot_calc <- function(Pop,Ages=0:90,Minage=23,Maxage=62){ # used to shift the indices in question in order to # grab the relevant 5-year age groups lagger <- function(series,lag){ if (lag != 0){ if (lag > 0){ series <- c(rep(FALSE,lag),series[-c((length(series)-lag+1):length(series))]) } if (lag < 0){ series <- c(series[-c(1:abs(lag))],rep(FALSE,abs(lag))) } } series } # I couldn't remember the better way to vectorize indexing Getthe5s <- function(ind,x){ x[ind] } # calculates W for a given i (using the proportion centered on the given digit +/-2 Wi_calc <- function(i,ages=Ages,pop=Pop,minage=Minage,maxage=Maxage){ if (missing(ages)) { ages <- 0:(length(pop)-1) } ind_i <- ages %% 10 == i & ages >= minage & ages <= maxage ind_I <- sapply(-2:2,lagger,series=ind_i) (5*sum(pop[ind_i]))/sum(apply(ind_I,2,Getthe5s,x=pop)) } # so if we do Wi_calc for each digit, take the absolute value of the difference from 1, and then sum: sum(abs(sapply(0:9,Wi_calc,minage=Minage,maxage=Maxage,ages=Ages)-1)) # this number is Spoorenberg's Wtot }
(there's gotta be a more parsimonious way to implement this, I wouldn't suggest trying to figure out how it works unless you've got some time)
So, using the same data from 1960 philippines, we get: Wtot Philippines 1960PH1960 <- c(786464, 888180, 963230, 969309, 965232, 957698, 928673, 938899, 841636, 702492, 841356, 581400, 796786, 619293, 596592, 565714, 566942, 538891, 651318, 491441, 565801, 494895, 515823, 456892, 425212, 522203, 358549, 376221, 395766, 300610, 535924, 222086, 318481, 246260, 233700, 401936, 242659, 242462, 316210, 225207, 434156, 126632, 217881, 169167, 151142, 319118, 160329, 160855, 237287, 155094, 313636, 78534, 128935, 93279, 95715, 163093, 87754, 71828, 93049, 72206, 275436, 31299, 49634, 40154, 34381, 102440, 26445, 35311, 40711, 20921, 136771, 13000, 28017, 16662, 14490, 50558, 15010, 11878, 23353, 9212, 73741, 5532, 9331, 5653, 5089, 18604, 4803, 5617, 4388, 4000, 57111)
Wtot_calc(Pop=PH1960) # 2.389128 I propose 2.5 little changes to how it's calculated:
1) I have one problem with summing absolute deviations from parity: .5 ought to count as the same magnitude as 2, since 1/2 is just as far from parity as 2/1. That's not that big of a deal (I guess, maybe) if we want results to be biased toward positive deviations. I'd prefer to sum absolute deviations that are also of comparable magintude and then divide by two...
So I thought of this little trick: take the sum of the deviations from 1 of exp(abs(log((5*P30)/sum(5P28)))). Look funky? Well it gives really consistent results: 1/2 spits back 2 and 2/1 spits back 2: this I like.
After making this change, I'd divide the sum of Wi by 2, since we'd be double-counting the proportion falsely declaring their terminal digit. i.e. 100 too few 24-years olds is also 100 too many 25-year olds, and those absent, inumerate or forgetful folks are the same folks whether missing in age 24 or showing up in age 25, right? So there, divide that final sum by 2 just because.
2) It just so happens that populations whose pyramids are shaped like pyramids tend to also be populations with lots of age heaping, for one reason or another. So, in the Wtot calculation, there will be more 20-year olds than 30-year olds, and so forth. If we first sum the numerator and denominator prior to taking the ratio, then the younger false-declarers get more weight than the older false-declarers. Would we not like a measure of age-heaping where the age-groups don't get weighted by their sizes but rather by their propensity to, err, display bias in age declaration? If that's our style, then I'd propose: taking the mean of: (5*P30)/sum(5P28), (5*P40)/sum(5P38), (5*P50)/sum(5P48), (5*P60)/sum(5P58) to get W0, and so forth for each terminal digit.
If that all makes sense to you, then maybe you'd prefer this function: Wtot meanWtot_mean_calc <- function(Pop,Ages=0:90,Minage=23,Maxage=62){ # used to shift the indices in question in order to # grab the relevant 5-year age groups lagger <- function(series,lag){ if (lag != 0){ if (lag > 0){ series <- c(rep(FALSE,lag),series[-c((length(series)-lag+1):length(series))]) } if (lag < 0){ series <- c(series[-c(1:abs(lag))],rep(FALSE,abs(lag))) } } series } # I couldn't remember the better way to vectorize indexing Getthe5s <- function(ind,x){ x[ind] } # changes: 1) assure equal magnitude for ratios <1 and >1 # 2) take each ratio separately and calculate the mean # 3) take deviation from 1 (always positive) Wi_mean_calc <- function(i,ages=Ages,pop=Pop,minage=Minage,maxage=Maxage){ if (missing(ages)) { ages <- 0:(length(pop)-1) } ind_i <- ages %% 10 == i & ages >= minage & ages <= maxage ind_I <- sapply(-2:2,lagger,series=ind_i) mean(exp(abs(log((5*pop[ind_i])/rowSums(apply(ind_I,2,Getthe5s,x=pop))))))-1 } # take sum of these (strictly positive) Wi's and divide by 2 sum(sapply(0:9,Wi_mean_calc,minage=Minage,maxage=Maxage,ages=Ages))/2 # this number is my modified Wtot }
Again, how is it different: Pyramidy-shaped pyramids won't bias the results to younger ages, ratios greater than or less than 1 are counted equally, and the sum of all the Wi's gets divided by two.
on the same data we get: comparing Wtot and Wtot meanWtot_calc(Pop=PH1960) # [1] 2.389128 Wtot_calc(Pop=PH1960,Minage=13,Maxage=72) # [1] 1.761708 Wtot_calc(PH1960,Minage=23,Maxage=72) # [1] 2.544538 Wtot_calc(PH1960,Minage=13,Maxage=82) # [1] 1.834131
Wtot_mean_calc(PH1960) # [1] 1.91717 Wtot_mean_calc(PH1960,Minage=13,Maxage=72) # [1] 2.098257 Wtot_mean_calc(PH1960,Minage=23,Maxage=72) # [1] 2.462614 Wtot_mean_calc(PH1960,Minage=13,Maxage=82) # [1] 2.630706 These are qualitatively different results, so perhaps a deeper comparison is in order.
|
|