Missing data

The main idea...

Missing data is a concern in virtually all empirical research, from psychology (Graham, 2009) to phylogenomics (Roure et al., 2013) and ecology and evolution (Nakagawa and Freckleton, 2008). Data may be absent for a wide variety of reasons, such as a phenomenon that cannot be observed at all times or technical failures during measurement. If missing values can be considered randomly distributed in a data set and are handled appropriately, missing data need not prevent quality analysis. However, if missing data is ignored, systematically generated, or badly handled, it can distort the interpretations and conclusions of most analyses.

Some types of missing data

The type of missing data you may encounter in an analysis depends, of course, on the data itself and the underlying phenomena generating that data. Below, the three major categories of missing data are described. Efforts should be made to identify what kind of 'missingness' is in effect as this will help determine what actions can be taken.

Missing completely at random (MCAR)

Missing at random (MAR)

Missing not at random (MNAR)

For data to be MCAR, a value's missingness must be in no way related to any other value in the data set, whether observed or missing. Little's MCAR test (Little, 1988) may be used to evaluate whether data is MCAR.

MAR data is random in the sense that missingness has no relation to the value that 'should' be there; however, the fact that a value is missing does depend on other variables in the data set. For example, if a sensor that measures a given variable, y, only operates above a certain temperature, tmin, and all values of y are missing below tmin, the missing data can be said to be MAR. Thus, MAR data can be considered random only once other variables are taken into account (i.e. missingness is conditional on other, observed data in the data set).
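The sensor example above can be sketched as a small simulation. This is only an illustration under stated assumptions: the function name simulate_mar, the temperature range, the threshold t_min = 10 and the distribution of y are all hypothetical choices, not taken from any particular study.

```python
import random

def simulate_mar(n=1000, t_min=10.0, seed=0):
    """Return (temperature, y) pairs where y is None whenever temperature < t_min.

    Hypothetical setup: temperature is always recorded, while the sensor
    measuring y only operates at or above t_min.
    """
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        temperature = rng.uniform(0.0, 30.0)
        y = rng.gauss(5.0, 1.0)  # assumed distribution of y, for illustration only
        # Missingness is decided by temperature alone, never by the value of y.
        records.append((temperature, y if temperature >= t_min else None))
    return records

records = simulate_mar()
missing_temps = [t for t, y in records if y is None]
observed_temps = [t for t, y in records if y is not None]
```

In this sketch, every missing y falls below the temperature threshold and every observed y lies at or above it, so the missingness is fully explained by a variable that was itself observed.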

MNAR data is absent in some systematic way which depends on the value of the variable of interest. Referring to the example above, this would mean that whether a value of y is missing depends directly on the value y would have taken. This is the most 'dangerous' kind of missing data as it can bias analytical results. If the missingness of a variable (coded as a dummy variable) can be shown to be associated with that variable's values (e.g. by regression), then there is a good chance that it is MNAR. Failure to find a clear association should not be considered 'proof' that the data is not MNAR, but suggests that it could be MAR or MCAR. The process responsible for MNAR data must be identified and corrected for if possible, or the experiment re-run in a manner that addresses the measurement bias in y.
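To illustrate why MNAR data is a concern, the following minimal sketch simulates a variable whose large values are preferentially lost. The function name simulate_mnar, the cutoff, the drop probability and the distribution of y are hypothetical assumptions chosen for illustration. Note that the comparison with the true values is only possible here because the data is simulated; in a real data set the missing values are, by definition, unavailable.

```python
import random
import statistics

def simulate_mnar(n=5000, cutoff=6.0, p_drop=0.8, seed=1):
    """Return (true_y, observed_y); values above cutoff are dropped with probability p_drop."""
    rng = random.Random(seed)
    true_y = [rng.gauss(5.0, 1.0) for _ in range(n)]  # assumed distribution of y
    observed_y = [
        None if (value > cutoff and rng.random() < p_drop) else value
        for value in true_y
    ]
    return true_y, observed_y

true_y, observed_y = simulate_mnar()

# Dummy variable coding missingness: 1 if missing, 0 if observed.
missing_flag = [1 if value is None else 0 for value in observed_y]

kept = [value for value in observed_y if value is not None]
lost = [t for t, flag in zip(true_y, missing_flag) if flag == 1]

true_mean = statistics.mean(true_y)
observed_mean = statistics.mean(kept)  # what an analyst would actually compute
lost_mean = statistics.mean(lost)      # only known because this is a simulation
```

Because only large values are lost in this setup, the mean of the values that remain underestimates the true mean, which is the kind of bias described above.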

Dealing with missing data

Multiple methods to deal with missing data exist. Some resort to simple deletion of objects or variables with missing values. Others attempt to "impute" (i.e. "fill in") reasonable substitute values based on either other values in the data set or a suitable distribution that models the variable of interest. While deletion is straightforward, imputation methods conserve sample size and hence prevent a loss of power. However, most imputation methods are subject to several assumptions and require careful handling. Below, a few common approaches to handling missing data are briefly described.
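The contrast between deletion and imputation can be sketched as follows. This is a minimal illustration, not a recommended procedure: the helper names listwise_delete and mean_impute and the toy data are hypothetical, and replacing missing values with the column mean is used here only because it is one of the simplest ways of filling in values from other values in the data set.

```python
import statistics

def listwise_delete(rows):
    """Keep only the objects (rows) that have no missing values."""
    return [row for row in rows if all(value is not None for value in row)]

def mean_impute(rows):
    """Replace each missing value with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        present = [row[j] for row in rows if row[j] is not None]
        col_means.append(statistics.mean(present))
    return [
        [col_means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]

# Toy data set: four objects, two variables, None marks a missing value.
data = [
    [1.0, 2.0],
    [None, 4.0],
    [3.0, None],
    [5.0, 6.0],
]

complete_cases = listwise_delete(data)  # two objects remain
imputed = mean_impute(data)             # all four objects remain
```

Deletion halves the sample size in this toy case, while imputation keeps all four objects. The filled-in values sit exactly at the column means, however, which reduces the apparent variance of each variable; this is one example of why imputation requires careful handling.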

Warnings

Implementations

References