Missing data
The main idea...
Missing data is a concern in virtually all empirical research, from psychology (Graham, 2009) to phylogenomics (Roure et al., 2013) and ecology and evolution (Nakagawa and Freckleton, 2008). Data may be absent for many reasons, for example because the phenomenon cannot be observed at all times or because of technical failures. If missing values can be considered randomly distributed in a data set and are handled appropriately, missing data need not prevent a sound analysis. However, if missing data is ignored, systematically generated, or badly handled, it can distort the interpretations and conclusions of most analyses.
Some types of missing data
The type of missing data you may encounter in an analysis depends, of course, on the data itself and the underlying phenomena generating that data. Below, the three major categories of missing data are described. Efforts should be made to identify what kind of 'missingness' is in effect as this will help determine what actions can be taken.
Missing completely at random (MCAR)
Missing at random (MAR)
Missing not at random (MNAR)
For data to be MCAR, a value's missingness must be in no way related to any other value in the data set, whether observed or missing. Little's MCAR test (Little, 1988) may be used to evaluate whether data is MCAR.
MAR data is random in the sense that, once other variables in the data set are taken into account, missingness has no relation to the value that 'should' be there; the fact that a value is missing does, however, depend on those other variables. For example, if a sensor that measures a given variable, y, only operates above a certain temperature, tmin, and all values of y are missing below tmin, the missing data can be said to be MAR. Thus, other variables must be taken into account for MAR data to be considered random (i.e. missingness is conditional on other data in the data set).
MNAR data is absent in some systematic way that depends on the value of the variable of interest. In terms of the example above, whether a given data point is missing would depend directly on the value of y itself. This is the most 'dangerous' kind of missing data as it biases analytical results. If the missingness of a variable (coded as a dummy variable) can be shown to be associated with that variable's values (e.g. by regression), there is a good chance that the data is MNAR. Failure to find a clear association should not be taken as 'proof' that the data is not MNAR; it only suggests that the data could be MAR or MCAR. Where possible, the process responsible for MNAR data must be identified and corrected for, or the experiment re-run in a manner that addresses the measurement bias in y.
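The three mechanisms described above can be illustrated with a small simulation. The following is a minimal sketch, not a diagnostic procedure: the data are synthetic, and the function names, the linear relationship between y and temperature, the deletion probability, and the MNAR cutoff are all assumptions made for illustration. It uses only the Python standard library.

```python
import random
import statistics


def simulate(n=1000, seed=42):
    """Return (temperature, y_full) for a hypothetical sensor data set."""
    rng = random.Random(seed)
    temperature = [rng.uniform(0.0, 30.0) for _ in range(n)]
    # Assumption: y rises with temperature, plus Gaussian noise.
    y_full = [5.0 + 0.2 * t + rng.gauss(0.0, 1.0) for t in temperature]
    return temperature, y_full


def make_mcar(y, p=0.3, seed=1):
    """MCAR: delete each value with the same probability p, ignoring all data."""
    rng = random.Random(seed)
    return [None if rng.random() < p else v for v in y]


def make_mar(y, temperature, tmin=10.0):
    """MAR: delete y whenever the observed covariate (temperature) is below tmin."""
    return [None if t < tmin else v for v, t in zip(y, temperature)]


def make_mnar(y, cutoff):
    """MNAR: delete y whenever y itself exceeds the cutoff."""
    return [None if v > cutoff else v for v in y]


def observed_mean(y):
    """Mean of the values that are still present."""
    return statistics.mean(v for v in y if v is not None)


def covariate_gap(y, covariate):
    """Mean covariate where y is missing minus mean covariate where y is observed."""
    missing = [c for v, c in zip(y, covariate) if v is None]
    observed = [c for v, c in zip(y, covariate) if v is not None]
    return statistics.mean(missing) - statistics.mean(observed)


temperature, y_full = simulate()
y_mcar = make_mcar(y_full)
y_mar = make_mar(y_full, temperature)
y_mnar = make_mnar(y_full, cutoff=statistics.median(y_full))

print("full-data mean of y:", round(statistics.mean(y_full), 2))
for name, y in [("MCAR", y_mcar), ("MAR", y_mar), ("MNAR", y_mnar)]:
    print(name,
          "| observed mean of y:", round(observed_mean(y), 2),
          "| temperature gap (missing - observed):",
          round(covariate_gap(y, temperature), 2))
```

Comparing a covariate between rows where y is missing and rows where it is observed is one simple way of using a missingness dummy variable. A clear gap argues against MCAR, but it cannot by itself separate MAR from MNAR, because the deleted values of y are not available for inspection in real data.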
Dealing with missing data
Multiple methods to deal with missing data exist. Some resort to simple deletion of objects or variables with missing values. Others attempt to "impute" (i.e. "fill in") reasonable substitute values based on either other values in the data set or a suitable distribution that models the variable of interest. While deletion is straightforward, imputation methods conserve sample size and hence prevent a loss of power. However, most imputation methods are subject to several assumptions and require careful handling. Below, a few common approaches to handling missing data are briefly described.
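As a minimal sketch of the two families of approaches mentioned above, the following Python snippet applies listwise deletion and single-value (mean) imputation to a tiny, invented data set. The data, the use of None as the missing-value marker, and the function names are assumptions chosen for illustration only.

```python
import statistics

# Hypothetical data set: each row is (x, y); None marks a missing value.
rows = [(1.0, 2.0), (2.0, None), (3.0, 6.0),
        (None, 8.0), (5.0, 10.0), (6.0, None)]


def listwise_delete(rows):
    """Deletion: keep only rows that have no missing values."""
    return [r for r in rows if all(v is not None for v in r)]


def mean_impute(rows):
    """Imputation: replace each missing value with its column's observed mean."""
    columns = list(zip(*rows))
    means = [statistics.mean(v for v in col if v is not None)
             for col in columns]
    return [tuple(m if v is None else v for v, m in zip(r, means))
            for r in rows]


complete = listwise_delete(rows)
imputed = mean_impute(rows)

y_observed = [r[1] for r in rows if r[1] is not None]
y_imputed = [r[1] for r in imputed]

print("rows kept after deletion:", len(complete), "of", len(rows))
print("rows kept after imputation:", len(imputed), "of", len(rows))
print("variance of observed y:", statistics.variance(y_observed))
print("variance of y after mean imputation:", statistics.variance(y_imputed))
```

Deletion discards half of the rows here, while mean imputation keeps all of them. The imputed values sit exactly at the column mean, however, so the variance of y after imputation is smaller than the variance of the observed values. This is one reason why single-value imputation needs careful handling.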
Warnings
MNAR data requires special handling. It is likely to introduce bias into many analyses.
Single-value imputation methods are convenient, but are often seen as "overconfident": because they re-use existing data, they lead to underestimates of uncertainty.
When applicable (e.g. in multiple imputation, MI), it is important that the model used for imputing data is compatible with the analysis that the imputed data will be subjected to. For example, if the analysis will use four variables, then all four variables should be included in the imputation model.
MI and expectation-maximization (EM) methods generally assume multivariate normality and require that data is missing at random (MCAR or MAR). The choice of distribution or model is pivotal and must reflect the nature of the variables in question. Note that one can transform variables to conform to a distribution, impute values, and then back-transform the data.
The EM algorithm can be used with maximum a posteriori (MAP) estimators rather than maximum likelihood to avoid overfitting.
There is a risk that the EM algorithm will get stuck in a local optimum.
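For the multiple imputation workflow referred to in the warnings above, the same analysis is run on each of m imputed data sets and the results are then combined. A common way to combine them is known as Rubin's rules: the pooled estimate is the mean of the m estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The sketch below shows only this pooling step. The function name and the input numbers are hypothetical, and the imputation step itself is assumed to have been done elsewhere.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool m per-imputation estimates and their variances with Rubin's rules.

    estimates: one point estimate per imputed data set
    variances: the squared standard error of each of those estimates
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates, each with a variance")
    pooled = statistics.mean(estimates)        # mean of the m estimates
    within = statistics.mean(variances)        # average within-imputation variance
    between = statistics.variance(estimates)   # between-imputation variance (m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total


# Hypothetical results from m = 3 imputed data sets.
estimates = [10.0, 12.0, 11.0]
variances = [1.0, 1.0, 1.0]

pooled, total_variance = pool_rubin(estimates, variances)
print("pooled estimate:", pooled)
print("pooled standard error:", math.sqrt(total_variance))
```

The between-imputation term is what single-value imputation leaves out: it carries the extra uncertainty that arises because the imputed values are not known.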
Implementations
R
Functions such as is.na(), na.omit(), na.exclude(), na.pass(), and na.fail() are useful in handling missing data.
The aregImpute() function from the Hmisc package allows predictive mean matching, regression-based imputation, and weighted sampling from related objects using bootstrapping to assess uncertainty.
The Amelia II package offers a range of imputation techniques including MI and bootstrapped EM and can support time-series and longitudinal data. It may be used with the Zelig package to combine imputation results.
The mitools package offers a set of tools to create and combine multiple imputation data sets.
The robCompositions package offers, among other robust methods, robust imputation of missing values for compositional data (i.e. where the data are relative and describe parts of some whole), which does not require multivariate normality.
The function LittleMCAR() in the BaylorEdPsych package can run Little's MCAR test on a data set with no more than 50 variables.
The mice package includes several advanced imputation functions.
SPSS
The SPSS MVA module offers imputation and augmentation techniques and related tests (e.g. Little's MCAR test).
References
Dempster AP, Laird NM, Rubin DB (1977) Maximum Likelihood from Incomplete Data via the EM Algorithm. J Royal Stat Soc Ser B. 39(1):1–38.
Do CB, Batzoglou S (2008) What is the expectation maximization algorithm? Nat Biotechnol. 26:897–899.
Graham JW (2009) Missing Data Analysis: Making It Work in the Real World. Annu Rev Psychol. 60:549–576.
Horton NJ, Kleinman KP (2007) Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models. Am Stat. 61(1):79–90.
Little RJA (1988) A test of missing completely at random for multivariate data with missing values. J Am Stat Assoc. 83(404):1198–1202.
Nakagawa S, Freckleton RP (2008) Missing inaction: the dangers of ignoring missing data. Trends Ecol Evol. 23(11):592–596.
Roure B, Baurain D, Philippe H (2013) Impact of Missing Data on Phylogenies Inferred from Empirical Phylogenomic Data Sets. Mol Biol Evol. 30(1):197–214.
Rubin DB (1978) Multiple imputations in sample surveys: A phenomenological Bayesian approach to nonresponse (with discussion). Proc Survey Research Methods Section. 20–34. Amer Statist Assoc. Alexandria, VA, USA.
Rubin DB (1996) Multiple imputation after 18+ years. J Am Stat Assoc. 91:473–489.
Schafer JL (1999) Multiple imputation: a primer. Stat Methods Med Res. 8(1):3–15.