R-stuff
The importance of R: an introduction
"R is really important to the point that it’s hard to overvalue it"
Daryl Pregibon, research scientist at Google
I have been working with R for some years, and I must confess that it is now the only statistical software I use for my analyses. For those who do not (yet) know R: I promise that once you start using it and leave other software behind, you will find a number of built-in mechanisms for organizing data, running calculations, and creating graphical representations of data sets.
I am not a "fundamentalist" of this software, but I am still searching for a good reason to be "converted". I assure you that no learning curve would stop me. Obviously, this web page is not meant to explain the capabilities of R, but I would like to list some non-academic articles that may give you some reasons for moving to R.
Furthermore, in recent years I have developed an R package for smoothing and forecasting mortality (and any Poisson-distributed data). It can be installed directly from within R. See subsection MortalitySmooth and the associated paper (Camarda 2012, JSS, 50, 1-24) for more information.
Some years ago, I published an article in which we proposed a novel methodology for smoothing densities with evident digit preference patterns (Camarda et al. 2008, Stat Model, 8(4), 385–401). In subsection Digit Preference Model, I provide simple code for implementing the suggested approach.
To speed up my own research, I also developed two other packages, which are not freely accessible for copyright reasons. They collect mortality data from the Human Mortality Database and provide user-friendly functions for extracting specific subsets of these data. See subsections HMDdata and HMDdataLT for information and some examples.
Data Analysts Captivated by R’s Power by Ashlee Vance (January 6, 2009). Bits Blog (technology section of the New York Times)
R You Ready for R? by Ashlee Vance (January 8, 2009). Bits Blog (technology section of the New York Times)
Names You Need to Know in 2011: R Data Analysis Software by Steve McNally (October 11, 2010). Forbes.
The Wikipedia page which compares statistical software: hard to beat R.
'R' language bringing statistical analytics to the masses by Dave Rosenberg (June 3, 2010). CNet.
How Google and Facebook are using R by Mike (February 19, 2009). Dataspora. A Via Science Company.
An R function to determine if you are a data scientist by leekgroup (October 10, 2011). Simply Statistics.
R Tops Data Mining Software Poll by David Smith (May 31, 2012). SyS-Con Media.
The Popularity of Data Analysis Software. Robert A. Muenchen (2010-2013). r4stats.com.
Digit Preference Model
Here I provide the link to the R code implementing the model presented in
Camarda, C. G., P. Eilers and J. Gampe (2008)
Modelling General Patterns of Digit Preference
Statistical Modelling 8, 385-401
(link)
Two files are provided:
- functionsPCLML1.R: a collection of functions for estimating the model
- DPexample.R: an example with simulated data
I commented these files as much as possible and used simple simulated data, but if something is not clear, please let me know.
HMDdata
HMDdata is a package I compiled to speed up my work on mortality analysis. It does not aim to be as universal and fully tested as a CRAN package. It is rather a user-friendly tool for extracting 1x1 mortality data from the Human Mortality Database (HMD). More complete and general R code for similar purposes is available on Tim Riffe's website.
So, HMDdata is nothing complex and no models or statistics are involved. It is the product of an afternoon's work, after I got fed up with the tedious job of selecting specific data from the HMD in matrix format every time. I therefore decided to include all HMD populations in an R package and to provide a user-friendly R function for extracting them. I also update my own database twice a year by adding the latest years for populations already included, as well as newly added populations.
Unfortunately, the package HMDdata cannot be made publicly available because it includes a lot of information from the Human Mortality Database (HMD), and distributing the data would violate the HMD user agreement (paragraph 3: "Please do not pass your copy of these data to other users...."). Hence, if you need it or would like to have a look, please write me an email: carlo-giovanni.camarda at ined.fr
Specifically, data for deaths, population, exposures and rates are available in 1x1 age-year intervals (HMDdata). Two functions are also provided:
selectHMDdata: select a specific dataset and create an HMDdata object
plot.HMDdata: simple plot of an HMDdata object
The manual can be found here; some examples and the installation instructions are given below.
Needless to say, any suggestion, comment or criticism is more than welcome. I hope you'll find it useful and easy to use. Otherwise have a nice day!
## EXAMPLE:
## load the package
library(HMDdata)
## check available populations
names(HMDdata)
## select Danish females deaths, ages 50-100, years 1950-2009
D <- selectHMDdata(country="Denmark",
                   data="Death",
                   sex="Females",
                   ages=50:100,
                   years=1950:2009)
## plot the data
plot(D)
D is a matrix where rows and columns are indexed by age and year, respectively.
Plotting an HMDdata object produces a shaded contour-map or a scatter plot for 2D and 1D datasets, respectively (log-scale in case of rates).
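Since the package itself is not public, the age-by-year layout described above can be illustrated with a plain simulated matrix. This is only a sketch: the object D.sim, its Poisson counts and its dimnames are assumptions made for illustration, and an actual HMDdata object may carry additional attributes.

```r
## Simulated stand-in for an age-by-year matrix of death counts
## (hypothetical data, NOT taken from the HMD or from HMDdata)
set.seed(1)
ages  <- 50:100
years <- 1950:2009
D.sim <- matrix(rpois(length(ages) * length(years), lambda = 100),
                nrow = length(ages), ncol = length(years),
                dimnames = list(ages, years))

## rows are indexed by age, columns by year
dim(D.sim)

## deaths at age 65 in 1980: index by dimnames (character strings)
D.sim["65", "1980"]

## all ages in a single year (a 1D dataset) and total deaths by year
d1980   <- D.sim[, "1980"]
totYear <- colSums(D.sim)
```

Indexing by name rather than by position avoids off-by-one mistakes when the age range does not start at zero.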
## INSTALLATION:
## with Linux (source package)
setwd("~/your_path/")
install.packages("HMDdata_1.0.tar.gz", repos=NULL, type="source")
## with Windows (binary package)
setwd("~\\your_path\\")
install.packages("HMDdata_1.0.zip", repos=NULL)
HMDdataLT
This is a mirror package of HMDdata. It is a user-friendly tool for collecting and extracting 1x1 life-table functions from the Human Mortality Database (HMD) in R. It works similarly to HMDdata and, likewise, cannot be made publicly available. Simply send me an email to get it: carlo-giovanni.camarda at ined.fr
Life Expectancy Confidence Intervals
While working with relatively small sub-populations, I encountered the issue of assessing the uncertainty around their life expectancy. In other words, I needed to construct confidence intervals around life expectancy. Chiang (1984) proposed a solution based on binomial assumptions for the probability of dying within the life table. Andreev and Shkolnikov (2010) presented a spreadsheet for calculating confidence limits for any life table.
Mimicking this last work, my contribution is a simple R routine for building confidence intervals for life expectancy by bootstrapping life-table deaths under the binomial assumption. As a by-product, you will also find an R function for constructing a life table from a series of deaths and exposures.
Two scripts and three datasets for testing the method are provided:
- LifeTableFUN.R: a collection of functions for building any life table and for constructing confidence intervals for life expectancy, with user-defined confidence level, number of simulated life tables, age and sex.
- ConfidenceIntervalLifeExpectancy.R: a script that runs the previous routines on the three datasets provided below.
- ExampleLT.csv, ExampleLT2.txt and ExampleLT3.txt: three datasets that can be used for testing the approach.
As always, I commented these files as much as possible and used simple data, but if something is not clear, please let me know.
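The general idea of bootstrapping life-table deaths under a binomial assumption can be sketched in a few lines of base R. This is not the code in LifeTableFUN.R: the function names lifetable.sketch and e0.boot, the choice a_x = 0.5, the treatment of the open age group and the simulated Gompertz data are all assumptions made for illustration.

```r
## Life expectancy from single-year death probabilities.
## Assumptions of this sketch: a_x = 0.5 in all closed intervals,
## last age group open-ended with a_x = 1 / m_x.
lifetable.sketch <- function(qx, mx.last) {
  n     <- length(qx)
  qx[n] <- 1                         # everyone dies in the open interval
  ax    <- c(rep(0.5, n - 1), 1 / mx.last)
  lx    <- c(1, cumprod(1 - qx)[-n]) # survivors at exact age x
  dx    <- lx * qx                   # life-table deaths
  Lx    <- lx - (1 - ax) * dx        # person-years lived
  Tx    <- rev(cumsum(rev(Lx)))
  Tx / lx                            # life expectancy at each age
}

## Percentile bootstrap for life expectancy at the first age
e0.boot <- function(Dx, Ex, B = 1000, level = 0.95) {
  n  <- length(Dx)
  mx <- Dx / Ex
  qx <- mx / (1 + 0.5 * mx)
  Nx <- round(Ex + 0.5 * Dx)         # implied binomial trials, Dx / qx
  e.sim <- replicate(B, {
    D.star <- rbinom(n, size = Nx, prob = qx)
    ## simplification: the open-interval rate is kept fixed
    lifetable.sketch(D.star / Nx, mx.last = mx[n])[1]
  })
  alpha <- 1 - level
  c(estimate = lifetable.sketch(qx, mx[n])[1],
    quantile(e.sim, c(alpha / 2, 1 - alpha / 2)))
}

## Hypothetical small population with Gompertz mortality
set.seed(3)
ages    <- 0:100
Ex      <- rep(1000, length(ages))
mx.true <- 0.00005 * exp(0.09 * ages)
Dx      <- rpois(length(ages), Ex * mx.true)
res     <- e0.boot(Dx, Ex, B = 500)
res
```

The smaller the exposures, the wider the resulting interval, which is the situation described above for small sub-populations.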
MortalitySmooth
Under construction
Smooth Constrained Mortality Forecasting
This page provides the link to the R code implementing the model presented in
Camarda, C. G. (2019)
Smooth Constrained Mortality Forecasting
Demographic Research. 41 (38), 1091-1130
DOI: 10.4054/DemRes.2019.41.38
Prepared on 2019.09.03 using R version 3.6.1. These files differ only slightly from those published on the Demographic Research web page.
Requirements in terms of R code and packages are described in the preamble of each file.
Specifically, registration with the Human Mortality Database is required for running all presented examples. The code uses the R package "HMDHFDplus" and will prompt for the HMD user name and password.
The code is extensively commented and object names follow the notation of the publication as closely as possible, but if something is not clear, please let me know.
SmoothConstrainedMortalityForecasting_MainProgram.R: R-code for estimating CP-splines as presented in the publication
SmoothConstrainedMortalityForecasting_Functions.R: R-code with a set of functions useful for modelling and forecasting based on CP-splines
SmoothConstrainedMortalityForecasting_LifeTableFunctions.R: R-code with a set of functions useful for building life tables from different inputs and extracting e-dagger from them
SmoothConstrainedMortalityForecasting_SmoothLeeCarterFunctions.R: R-code for estimating a smooth version of the Lee-Carter model as in Delwarde et al. (2007). Smoothing the Lee-Carter and Poisson log-bilinear models for mortality forecasting: A penalized log-likelihood approach. Statistical Modelling 7, 29–48.
SmoothConstrainedMortalityForecasting_OutOfSample.R: R-code for running the out-of-sample forecast exercise for comparing CP-splines with alternative forecasting approaches as presented in the publication (and Supplementary Material)
SmoothConstrainedMortalityForecasting_ChangingTimeConstraints.R: R-code for assessing the effect of the change in confidence level in rate-of-change over time for CP-splines as presented in the Supplementary Material of the publication
SmoothConstrainedMortalityForecasting_TimeWindow.R: R-code for assessing the effect of changing the time window for CP-splines and the Hyndman-Ullah model as presented in the Supplementary Material of the publication
Transition Coefficients
This page provides the link to the R code implementing the model presented in
Camarda, C. G. (2013).
Estimating Transition Coefficients in Reconstructing Continuous Series of Mortality by Cause of Death.
Modicod. Kick-off Seminar, MPIDR, Rostock (Germany), April 2013.
(slides)
Afterwards, the estimation procedure of the model was modified, though the concept remained unchanged. Specifically, I reduced the number of transition coefficients that need to be estimated by incorporating the equality constraints into the regression framework. Additionally, I implemented a quadratic programming approach via the R function solve.QP (package quadprog) instead of the constrained linear model fitted with lsei (package limSolve) shown in the original slides.
Two files are provided:
CfunCoD.R: function for building the transition matrix C as presented on slide 11
EstimCoefExample.R: an example with Russian data on digestive diseases
I commented these files as much as possible and used a simple dataset, but if something is not clear, please let me know.
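To give a flavour of what incorporating equality constraints into the regression can mean, here is a toy sketch in base R. It is not the code in CfunCoD.R or EstimCoefExample.R: the two-item setting, the names a and b and the noise-free simulated data are assumptions made for illustration. Bounds between 0 and 1 are not enforced here; that is presumably where a quadratic programming solver such as solve.QP comes in.

```r
## Hypothetical example: two old items (A, B) redistributed into two
## new items (X, Y) across a classification change:
##   X =      a  * A +      b  * B
##   Y = (1 - a) * A + (1 - b) * B
## Each old item is fully redistributed, so the coefficients sum to
## one per old item and only a and b need to be estimated.
set.seed(2)
A <- runif(20, 50, 200)   # simulated deaths by age, old item A
B <- runif(20, 50, 200)   # simulated deaths by age, old item B
a.true <- 0.7
b.true <- 0.2
X <- a.true * A + b.true * B
Y <- (1 - a.true) * A + (1 - b.true) * B

## Substitute the constraints: Y - A - B = -a * A - b * B,
## then stack both equations in one regression without intercept
y.stack  <- c(X, Y - A - B)
M.stack  <- rbind(cbind(A, B), -cbind(A, B))
fit      <- lm.fit(x = M.stack, y = y.stack)
coef.hat <- fit$coefficients
coef.hat
```

With four coefficients and two sum-to-one constraints, only two free parameters remain, which is the reduction in the number of coefficients mentioned above.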
Please note that I also generalized the model by assuming that the transition coefficients change smoothly over age. This general approach was presented in
Camarda, C. G. (2014)
Reconstructing Mortality Series by Cause of Death: Two alternative approaches
In Kneib, T., Sobotka, F., Fahrenholz, J. and Irmer, H.: Proceedings of the 29th International Workshop on Statistical Modelling
Göttingen (Germany). 14-18 July, 2014. 69-74
(paper)
and the related R code will (soon) be available.