In statistics, missing data refers to the absence of observations or values in a dataset. There are different types of missingness that can occur, and understanding these types is important when deciding how to handle missing data.
Missing Completely at Random (MCAR)
The probability of a value being missing is the same for all observations in the dataset, regardless of the observed values.
The missingness is completely random and unrelated to any other variable in the dataset.
Missing at Random (MAR)
The probability of a value being missing depends on other variables in the dataset, but not on the missing value itself.
The missingness is related to observed variables in the dataset.
Missing Not at Random (MNAR)
The probability of a value being missing depends on the missing value itself or on some unobserved variable in the dataset.
The missingness is related to the value that is missing or to some other variable that was not measured or recorded.
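The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the variables `age` and `los`, the missingness rates, and the logistic form are assumptions chosen for the example, not properties of the dataset analyzed here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
age = rng.normal(60, 15, n)                   # fully observed covariate (hypothetical)
los = 2 + 0.05 * age + rng.normal(0, 1, n)    # variable that will receive missing values

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# MCAR: every value has the same 20% chance of being missing
mcar_mask = rng.random(n) < 0.2
# MAR: the chance of missing los depends only on the observed age
mar_mask = rng.random(n) < sigmoid((age - 60) / 10 - 1.5)
# MNAR: the chance of missing los depends on the value of los itself
mnar_mask = rng.random(n) < sigmoid((los - los.mean()) * 2 - 1.5)

def mean_gap(mask):
    """Observed-only mean of los minus its true mean."""
    return los[~mask].mean() - los.mean()

for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    print(f"{name}: missing rate={mask.mean():.2f}, mean gap={mean_gap(mask):+.3f}")
```

Under MCAR the observed mean stays close to the true mean. Under MAR and MNAR it drifts, but only in the MAR case can the drift be corrected using the observed `age`.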
Understanding the type of missingness is important because it determines the most appropriate method for handling missing data. Under MCAR, most standard approaches, including complete-case analysis, give unbiased estimates of means, at the cost of some efficiency. Under MAR, the imputation method must incorporate the observed variables that predict missingness. MNAR missingness is more challenging to handle, and it often requires more sophisticated imputation methods or the use of sensitivity analysis to examine the impact of the missingness on the results.
Fill the nulls of EMTALA Y/N(Patient Status Details) with "No" where "Request Status" equals "Accepted"
Fill LOS Outlier with "Not Outlier" where "Request Status" equals "Accepted" and LOS is not null
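The two fill rules above could be expressed in pandas roughly as follows. This is a sketch on a toy frame: the column names follow the rules as written, the "Declined" status is a hypothetical placeholder for any non-accepted status, and the second rule is assumed to apply only to cells that are currently null.

```python
import numpy as np
import pandas as pd

# Toy data; "Declined" is a hypothetical non-accepted status
df = pd.DataFrame({
    "Request Status": ["Accepted", "Accepted", "Declined", "Accepted"],
    "EMTALA Y/N(Patient Status Details)": [np.nan, "Yes", np.nan, np.nan],
    "LOS": [3.0, 5.0, np.nan, np.nan],
    "LOS Outlier": [np.nan, "Outlier", np.nan, np.nan],
})

accepted = df["Request Status"] == "Accepted"
emtala = "EMTALA Y/N(Patient Status Details)"

# Rule 1: null EMTALA values become "No" for accepted requests
df.loc[accepted & df[emtala].isna(), emtala] = "No"

# Rule 2: null LOS Outlier becomes "Not Outlier" for accepted requests with a known LOS
# (assumption: existing non-null labels are left untouched)
rule2 = accepted & df["LOS"].notna() & df["LOS Outlier"].isna()
df.loc[rule2, "LOS Outlier"] = "Not Outlier"

print(df)
```

Rows that are not accepted, and accepted rows without a LOS, keep their nulls and are left for the imputation step.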
Distribution of the whole dataset.
Columns with a lot of missing values
County
Infection Precaution
Special Considerations
Missing value matrix sorted by Length of Stay (LOS)
DRG Program, Payor, LOS, ICU LOS, and Discharge Disposition are variables in the dataset that exhibit a similar missingness pattern.
Assuming the heat map shows the correlation between the missingness of pairs of columns on a diverging red-to-blue scale from -1 to 1, the "redder" cells indicate negative correlation: when one column is missing, the other tends to be present. Cells close to zero indicate that the missingness of the two columns is largely unrelated.
Conversely, the "bluer" cells indicate positive correlation between the missing values of columns, meaning that the missing values tend to occur together in patterns or clusters across the dataset.
There are 11 dark blue cells.
Cluster leaves linked together at a distance of zero have identical missingness patterns: the variables are indistinguishable under the distance metric used for clustering, so the presence of one fully predicts the presence of the other.
Variables in the same cluster are always filled together or always empty together, which indicates dependence between their missingness (conditional missingness).
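The quantities behind the heat map and the tree diagram can be approximated directly in pandas. The sketch below uses a small made-up frame (the values are illustrative, only some column names are borrowed from the dataset) and computes the correlation between missingness indicators, plus the pairs of columns whose missingness is identical.

```python
import numpy as np
import pandas as pd

# Toy data; values are made up for illustration
df = pd.DataFrame({
    "LOS": [3.0, np.nan, 5.0, np.nan, 2.0, 4.0],
    "ICU LOS": [1.0, np.nan, 2.0, np.nan, 0.0, 1.0],
    "Payor": ["A", None, "B", "C", None, "A"],
    "County": [None, "X", None, "Y", "Z", None],
})

is_null = df.isna()

# Correlation is undefined for columns that are fully observed or fully missing
varying = is_null.loc[:, is_null.nunique() > 1]
nullity_corr = varying.astype(int).corr()

# Pairs with identical missingness correspond to leaves joined at distance zero
cols = list(is_null.columns)
identical = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if (is_null[a] == is_null[b]).all()
]

print(nullity_corr.round(2))
print(identical)
```

In this toy frame LOS and ICU LOS are always missing together (correlation 1), while County is never missing in the same row as LOS (negative correlation).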
Rows where "Request Status" is "Accepted"
Distribution of the dataset where "Request Status" is Accepted.
Columns with a lot of missing values
County
Infection Precaution
Special Considerations
Missing value matrix sorted by Length of Stay (LOS)
DRG Program, Payor, LOS, ICU LOS, and Discharge Disposition are variables in the dataset that exhibit a similar missingness pattern.
Assuming the heat map shows the correlation between the missingness of pairs of columns on a diverging red-to-blue scale from -1 to 1, the "redder" cells indicate negative correlation: when one column is missing, the other tends to be present. Cells close to zero indicate that the missingness of the two columns is largely unrelated.
Conversely, the "bluer" cells indicate positive correlation between the missing values of columns, meaning that the missing values tend to occur together in patterns or clusters across the dataset.
There are 16 dark blue cells.
Cluster leaves linked together at a distance of zero have identical missingness patterns: the variables are indistinguishable under the distance metric used for clustering, so the presence of one fully predicts the presence of the other.
Variables in the same cluster are always filled together or always empty together, which indicates dependence between their missingness (conditional missingness).
Analyzing the missingness patterns with the matrix, heat map, and tree diagram suggests that the missing values in both datasets are consistent with the Missing at Random (MAR) type. This means that the probability of a value being missing depends on the observed values in the dataset, but not on the missing values themselves. Visual inspection cannot rule out MNAR, so this is a working assumption rather than a proven result.
KNN imputation is a method where the missing values in a dataset are imputed based on the values of the k-nearest neighbors of the observation with the missing value. This method can be used for both numeric and categorical data.
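To make the mechanics concrete, here is a minimal NumPy sketch of KNN imputation for numeric data. It is not the implementation used in this analysis: the function name `knn_impute` is hypothetical, distances are computed only over features observed in both rows, and features are left unscaled for brevity (in practice they would normally be standardized first). scikit-learn provides a production version as `KNNImputer`.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill NaNs in a numeric 2-D array with the mean of the k nearest rows."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    missing = np.isnan(X)
    for i in np.flatnonzero(missing.any(axis=1)):
        for j in np.flatnonzero(missing[i]):
            # Candidate donors: rows that have column j observed
            donors = np.flatnonzero(~missing[:, j])
            dists = []
            for d in donors:
                shared = ~missing[i] & ~missing[d]   # features observed in both rows
                if shared.any():
                    diff = X[i, shared] - X[d, shared]
                    dists.append((np.sqrt(np.mean(diff ** 2)), d))
            nearest = [d for _, d in sorted(dists)[:k]]
            if nearest:
                filled[i, j] = X[nearest, j].mean()
    return filled

X = np.array([
    [1.0, 10.0, 100.0],
    [1.1, 11.0, np.nan],
    [5.0, 50.0, 500.0],
    [1.2, 12.0, 120.0],
])
X_filled = knn_impute(X, k=2)
print(X_filled[1, 2])   # mean of the two closest donors (rows 0 and 3)
```

For categorical variables the same idea applies with a suitable encoding and the most frequent neighbor value in place of the mean.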
MICE (Multiple Imputation by Chained Equations) is a method where missing values are imputed iteratively: each variable with missing values is modeled conditional on the other variables, and its missing values are imputed from that model. The cycle is repeated over all variables until the imputations stabilize. This method is also applicable to both numeric and categorical data.
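The chained-equations loop can be sketched in a few lines of NumPy. This is a deliberately simplified, deterministic variant: it uses plain linear regression, produces a single completed dataset, and does not draw from a predictive distribution as full MICE does. The function name `chained_regression_impute` is hypothetical. scikit-learn offers a related, still experimental, `IterativeImputer`.

```python
import numpy as np

def chained_regression_impute(X, n_iter=10):
    """Single-imputation sketch of the chained-equations idea for numeric data."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Step 1: start from column means
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        # Step 2: revisit each incomplete column in turn
        for j in np.flatnonzero(missing.any(axis=0)):
            others = np.delete(filled, j, axis=1)
            A = np.column_stack([np.ones(len(filled)), others])   # intercept + predictors
            obs = ~missing[:, j]
            # Fit on rows where column j is observed, predict where it is missing
            coef, *_ = np.linalg.lstsq(A[obs], filled[obs, j], rcond=None)
            filled[~obs, j] = A[~obs] @ coef
    return filled

X = np.array([
    [1.0, 3.0],
    [2.0, np.nan],
    [3.0, 7.0],
    [4.0, 9.0],
    [np.nan, 11.0],
    [6.0, 13.0],
])
X_filled = chained_regression_impute(X)
print(X_filled)
```

Repeating the procedure with random draws and different seeds would give the multiple completed datasets that the "multiple" in MICE refers to.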
KNN imputed categories and KNN imputed numerics separately on original data:
Applied KNN imputation separately for the categorical and numeric variables in the dataset.
This approach can work well if the missingness patterns for these two types of variables are different. However, if the missingness patterns are similar, it may be more efficient to apply KNN imputation on the entire dataset, regardless of variable type.
KNN imputed the whole original data:
Applied KNN imputation on the entire dataset, including both categorical and numeric variables.
This approach can be effective if the missingness patterns are similar across all variables.
MICE imputed on data where request status is accepted:
Applied MICE imputation only on the subset of the data where the request status was accepted.
This approach can work well if the missingness patterns are different for accepted requests compared to rejected requests.
KNN imputed categories and MICE imputed numerics on data where request status is accepted:
Applied KNN imputation for the categorical variables and MICE imputation for the numeric variables, but only on the subset of the data where the request status was accepted.
This approach can be effective if the missingness patterns for these two types of variables are different, and if the missingness patterns for accepted requests are different from those for rejected requests.
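The subset-then-split workflow shared by the last two strategies might look like the following in pandas. This is only a scaffold: the toy values and the "Declined" status are hypothetical, and the median and mode fills are stand-ins marking where a KNN or MICE imputer would be plugged in.

```python
import numpy as np
import pandas as pd

# Toy data; "Declined" is a hypothetical non-accepted status
df = pd.DataFrame({
    "Request Status": ["Accepted", "Declined", "Accepted", "Accepted"],
    "LOS": [3.0, np.nan, np.nan, 5.0],
    "Age": [40.0, 55.0, 61.0, np.nan],
    "Payor": ["A", None, None, "A"],
})

# Restrict to accepted requests before imputing
accepted = df[df["Request Status"] == "Accepted"].copy()

# Split columns by type so each group can use its own imputer
numeric_cols = accepted.select_dtypes(include="number").columns
categorical_cols = accepted.columns.difference(numeric_cols)

# Stand-in imputers: replace with the KNN or MICE step of choice
accepted[numeric_cols] = accepted[numeric_cols].fillna(accepted[numeric_cols].median())
for col in categorical_cols:
    accepted[col] = accepted[col].fillna(accepted[col].mode().iloc[0])

print(accepted)
```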
Overall, the choice of imputation method and strategy will depend on the specific characteristics of the data and the research question being addressed. It is important to carefully evaluate the performance of different imputation methods and strategies, and to consider the assumptions underlying each method.
Distribution of original rows after imputation
The kernel density plots suggest that imputation on the filtered data (request status accepted) performs better.
A kernel density plot with four different curves, representing the density of the variable "LOS" after different imputation techniques have been applied.
A kernel density plot with four different curves, representing the density of the variable "Age" after different imputation techniques have been applied.
Based on the kernel density plots, both KNN and MICE imputation techniques on the filtered data appear to have similar performance.
A kernel density plot with three different curves, representing the density of the variable "LOS" after different imputation techniques have been applied.
A kernel density plot with three different curves, representing the density of the variable "Age" after different imputation techniques have been applied.
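The visual comparison of density curves can be backed by a number. The sketch below is an assumption-laden illustration, not part of the original analysis: it builds a Gaussian kernel density estimate with a normal-reference bandwidth and measures the area between two curves. The synthetic gamma samples stand in for LOS, and `density_gap` is a hypothetical helper.

```python
import numpy as np

def gaussian_kde_on_grid(values, grid):
    """Gaussian kernel density estimate with a normal-reference bandwidth."""
    values = np.asarray(values, dtype=float)
    n = values.size
    bandwidth = 1.06 * values.std(ddof=1) * n ** (-1 / 5)
    z = (grid[:, None] - values[None, :]) / bandwidth
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
observed = rng.gamma(shape=2.0, scale=3.0, size=500)     # stand-in for observed LOS
close_fill = rng.gamma(shape=2.0, scale=3.0, size=500)   # imputation that resembles the data
# Mean imputation: half of the values replaced by the observed mean
mean_fill = np.concatenate([observed[:250], np.full(250, observed.mean())])

grid = np.linspace(0, 30, 301)
step = grid[1] - grid[0]
base = gaussian_kde_on_grid(observed, grid)

def density_gap(sample):
    """Approximate area between the sample's density curve and the observed one."""
    return np.abs(gaussian_kde_on_grid(sample, grid) - base).sum() * step

print(density_gap(close_fill), density_gap(mean_fill))
```

A smaller gap means the imputed distribution tracks the observed one more closely; mean imputation produces a spike at the mean and therefore a much larger gap.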
Based on the analysis of the missingness patterns, the missing values in both datasets appear to follow the Missing at Random (MAR) mechanism. Various imputation methods were applied to fill the missing values, and imputation on the data where the request status was "accepted" produced distributions closer to the original data. The kernel density plots for KNN and MICE on this filtered data showed that both methods had comparable performance, with no visible difference between them.
Use statistical tests and modeling techniques to assess whether the missingness in the data is plausibly Missing at Random (MAR)
Full-Information Maximum Likelihood (FIML)
A technique for estimating the parameters of a model using all available observed data, including rows that contain missing values, without imputing them.
FIML works by using the observed data to estimate the likelihood function and then maximizing it to obtain the parameter estimates.
Pattern-Mixture Models
These are a class of models that account for different missingness patterns by assuming that the data is generated by different sub-populations, each with its own distribution.
The sub-populations are defined based on the missingness pattern, and the model parameters are estimated separately for each sub-population.
Little's MCAR Test
A test that can be used to determine whether the missingness pattern is completely random (MCAR), which is a special case of MAR.
The test compares the means of the observed variables across groups of rows that share the same missingness pattern. If these group means do not differ significantly from the overall estimates, MCAR cannot be rejected and the missingness can be treated as MCAR.
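A simpler check in the same spirit, shown here instead of Little's test itself, compares one fully observed variable between rows where another variable is missing and rows where it is present. The sketch computes a Welch t statistic in NumPy; the function name and the synthetic `age` and `los` variables are assumptions for illustration. Little's test pools comparisons of this kind across all variables and missingness patterns into a single chi-square statistic.

```python
import numpy as np

def missingness_t_statistic(observed_var, target_var):
    """Welch t statistic for observed_var: rows with target_var missing vs present."""
    observed_var = np.asarray(observed_var, dtype=float)
    is_missing = np.isnan(np.asarray(target_var, dtype=float))
    a = observed_var[is_missing]
    b = observed_var[~is_missing]
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return (a.mean() - b.mean()) / se

rng = np.random.default_rng(0)
n = 2000
age = rng.normal(60, 15, n)          # fully observed (synthetic)
los = rng.gamma(2.0, 3.0, n)         # will receive missing values (synthetic)

los_mcar = los.copy()
los_mcar[rng.random(n) < 0.3] = np.nan                              # missing regardless of age

los_mar = los.copy()
los_mar[rng.random(n) < np.where(age > 60, 0.5, 0.1)] = np.nan      # older patients missing more often

t_mcar = missingness_t_statistic(age, los_mcar)
t_mar = missingness_t_statistic(age, los_mar)
print(t_mcar, t_mar)
```

A large absolute t value is evidence against MCAR. A small one does not prove MCAR, and no test on observed data alone can distinguish MAR from MNAR.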
Try more imputation methods
Joint Multiple Imputation (JMI)
Imputes missing values simultaneously across all variables in the dataset, taking into account the correlations between them.
This method can be particularly useful when the correlations between variables are strong and need to be preserved in the imputed values.
Evaluate the imputation results
Since the true values are not available, the following methods can be used to evaluate the imputation results.
Compare the imputed data with the observed data
One approach is to compare the imputed data with the observed data by calculating the correlation coefficient, mean, and standard deviation for both datasets.
If the correlation coefficient is high and the mean and standard deviation are similar between the imputed and observed datasets, then it suggests that the imputation method has performed well.
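A side-by-side summary of this kind could be produced as follows. The data are toy values, `compare_summary` is a hypothetical helper, and mean imputation is used only as a convenient stand-in for the imputed dataset.

```python
import numpy as np
import pandas as pd

def compare_summary(original, imputed, columns):
    """Mean and standard deviation before and after imputation, per column."""
    rows = {}
    for col in columns:
        rows[col] = {
            "mean_observed": original[col].mean(),   # pandas skips NaN by default
            "mean_imputed": imputed[col].mean(),
            "std_observed": original[col].std(),
            "std_imputed": imputed[col].std(),
        }
    return pd.DataFrame.from_dict(rows, orient="index")

original = pd.DataFrame({
    "LOS": [2.0, np.nan, 4.0, 6.0, np.nan, 8.0],
    "Age": [30.0, 40.0, np.nan, 60.0, 70.0, 80.0],
})
imputed = original.fillna(original.mean())   # stand-in imputation

summary = compare_summary(original, imputed, ["LOS", "Age"])
corr_observed = original["LOS"].corr(original["Age"])   # uses pairwise-complete rows
corr_imputed = imputed["LOS"].corr(imputed["Age"])

print(summary)
print(corr_observed, corr_imputed)
```

Here the means match exactly but the standard deviation shrinks after imputation, which is the typical signature of mean imputation and the kind of distortion this comparison is meant to reveal.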
Multiple imputation
Multiple imputation can be used to generate several imputed datasets, which can be analyzed together to estimate the uncertainty in the imputed values.
This can provide a measure of the variability in the imputed values and the uncertainty in the estimates based on those values.
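The usual way to combine results from several imputed datasets is Rubin's rules, which the text above does not name explicitly. The sketch below pools one estimate (for example, a mean LOS) and its variance across `m` imputations; the input numbers are made up for illustration.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and their variances with Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = estimates.size
    pooled = estimates.mean()               # pooled point estimate
    within = variances.mean()               # average within-imputation variance
    between = estimates.var(ddof=1)         # variance of estimates across imputations
    total = within + (1 + 1 / m) * between  # total variance of the pooled estimate
    return pooled, total

# Hypothetical estimates and variances from three imputed datasets
pooled, total_var = pool_rubin([4.0, 5.0, 6.0], [1.0, 1.0, 1.0])
print(pooled, total_var)
```

The between-imputation term is what captures the extra uncertainty due to the missing values; a single imputed dataset would report only the within-imputation part.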
Cross-validation
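The dataset can be split into training and test sets. A portion of the observed values in the test set is deliberately hidden, the imputation method is fitted on the training set, and its imputations for the hidden values are compared with the values that were hidden.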
This can provide an estimate of how well the imputation method will perform on new, unseen data.
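One way to get a ground truth for such an evaluation is to hide values that are actually observed and score the imputer on them. The sketch below shows this mask-and-score step on synthetic data; `holdout_score` and `mean_impute` are hypothetical helpers, and mean imputation is only a baseline to be replaced by the method under test.

```python
import numpy as np

def holdout_score(X, impute_fn, holdout_frac=0.2, seed=0):
    """Hide a fraction of observed cells, impute, and return RMSE on the hidden cells."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    observed = np.argwhere(~np.isnan(X))                 # (row, col) of observed cells
    n_hide = max(1, int(holdout_frac * len(observed)))
    hidden = observed[rng.choice(len(observed), size=n_hide, replace=False)]
    rows, cols = hidden[:, 0], hidden[:, 1]
    X_masked = X.copy()
    X_masked[rows, cols] = np.nan                        # hide known values
    X_filled = impute_fn(X_masked)
    return np.sqrt(np.mean((X_filled[rows, cols] - X[rows, cols]) ** 2))

def mean_impute(X):
    """Baseline imputer: replace NaNs with column means."""
    return np.where(np.isnan(X), np.nanmean(X, axis=0), X)

rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(10, 2, 200)])
X[rng.random(X.shape) < 0.1] = np.nan                    # pre-existing missing values
score = holdout_score(X, mean_impute)
print(score)
```

Repeating the call with different seeds, or over several folds, gives a more stable estimate of imputation error.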
Imputation quality metrics
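There are several imputation quality metrics that can be used to evaluate the performance of the imputation method, such as the mean squared error, root mean squared error, mean absolute error, and the coefficient of determination. Because they require known true values, they are computed on values that were observed and then deliberately hidden before imputation.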
These metrics can be used to compare the performance of different imputation methods and to select the best method for the dataset at hand.
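For reference, the four metrics named above can be computed as follows. The helper name `imputation_metrics` and the example values are illustrative; the inputs are assumed to be the hidden true values and the corresponding imputed values.

```python
import numpy as np

def imputation_metrics(true_values, imputed_values):
    """MSE, RMSE, MAE, and R^2 between hidden true values and their imputations."""
    t = np.asarray(true_values, dtype=float)
    p = np.asarray(imputed_values, dtype=float)
    err = p - t
    mse = np.mean(err ** 2)
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((t - t.mean()) ** 2)
    return {
        "mse": mse,
        "rmse": np.sqrt(mse),
        "mae": np.mean(np.abs(err)),
        "r2": 1 - ss_res / ss_tot,
    }

m = imputation_metrics([2.0, 4.0, 6.0, 8.0], [3.0, 4.0, 5.0, 8.0])
print(m)
```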