Home health care covers a wide range of health care services that can be given in your home for an illness or injury. It is usually less expensive, more convenient, and just as effective as care received in a hospital or skilled nursing facility (SNF)[1]. A home health agency (HHA) is primarily engaged in providing skilled nursing and other therapeutic services to patients at home. In resource-constrained settings, it is important for HHAs to maintain quality of care so that patients avoid hospitalization or admission to long-term care institutions. The utilization outcomes of HHAs, namely the hospital admission (HA) rate and the unplanned emergency room visit (ER) rate, have received particular attention because of the associated cost and health concerns[2]. Understanding the relationship between quality/process measures and these utilization outcomes can therefore help policymakers and government agencies target improvement initiatives.
The project aims to understand which characteristics of HHAs, such as timely start of care and depression screening, are associated with utilization outcomes.
Identifying the aspects of HHAs that correspond to lower and higher utilization outcomes, namely the Hospital Admission rate and the Emergency Room Visit rate, can help decide which improvement initiatives to focus on and prioritize.
The data is published quarterly for all Home Health Agencies in the United States under the Medicare program and is publicly available on the Centers for Medicare and Medicaid Services website. The website lists archived files for each quarter from 2010 to 2020; this project uses the provider data from those archives. Each column heading is the specific measure name as it appears on Home Health Compare. Each quality measure is expressed as a percentage calculated for a specific quarter.
The dataset contains 11,212 rows for each quarter from 2010 to 2020.
The number of columns changes over time as measures are added to or dropped from Home Health Compare.
Each row belongs to a Home Health Agency giving information about the type of services offered to patients, such as:
How often they began the patient care in a timely manner
Taught patients about their drugs
How often they checked patient risk of falling.
Outcome Variables: Admission Rate and Emergency Room Visit Rate.
For convenience, I renamed several columns:
Zip - Zip Code of Home Health Agency
CCN - CMS Certification Number
state
name - Name of the Home Health Agency
address
city
phone
type - Type of Ownership
off.nursing - Offers Nursing Care Services
off.physical - Offers Physical Therapy Services
off.occupational - Offers Occupational Therapy Services
off.speech - Offers Speech Pathology Services
off.medical - Offers Medical Social Services
off.hha - Offers Home Health Aide Services
date - Date Certified
rating - Quality of patient care star rating
timely - How often the home health team began their patients' care in a timely manner
taughtdrugs - How often the home health team taught patients (or their family caregivers) about their drugs
checkfall - How often the home health team checked patients' risk of falling
checkdepression - How often the home health team checked patients for depression
flushot - How often the home health team determined whether patients received a flu shot for the current flu season
pneumococcal - How often the home health team determined whether their patients received a pneumococcal vaccine
taughtfootcare - How often the home health team got doctor's orders, gave foot care, and taught patients about foot care
betterwalking - How often patients got better at walking or moving around
betterbed - How often patients got better at getting in and out of bed
betterbathing - How often patients got better at bathing
lesspain - How often patients had less pain when moving around
betterbreathing - How often patients' breathing improved
betterheal - How often patients' wounds improved or healed after an operation
betterdrug - How often patients got better at taking their drugs correctly by mouth
admitted - How often home health patients had to be admitted to the hospital
ER - How often patients receiving home health care needed urgent, unplanned care in the ER without being admitted
ruca - Whether the HHA is in an urban or rural area
episode - Indicates how many patients and how much workload an HHA completes each year, which gives a rough indication of the size of the HHA
mean - Mean household income in that zip code
median - Median household income in that zip code
Data Cleaning and Preprocessing
Modeling with supervised machine learning algorithms such as linear regression and regression trees
Variable Subset Selection (Forward and Backward)
The model will be evaluated for performance using a cross-validation approach
Identifying the features related to utilization outcomes and comparing the results
Data Preprocessing
Merging multiple files: The data was collected quarterly from 2010 to 2020 and is available as multiple files. To analyze it all at once, I combined the files into a single data frame.
Data cleaning: Before merging, each file needed some cleaning. Each file had a few footnote columns meant to give information about the characteristics of the Home Health Agencies. These columns were empty, so I dropped them. In addition, the number of columns changed over time as measures were added to or dropped from the Home Health Compare datasets, so I retained only the columns common to all years, for consistency and to avoid introducing missing data. After combining the data, I performed a few data type conversions. The data frame had many missing values, some of which were represented by codes such as 199 and 201. I plotted the missing values to check for any pattern in how the data was missing; since no significant pattern was observed, I removed all rows with missing values.
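The merging and cleaning steps described above can be sketched roughly as follows. This is a minimal illustration in Python using only the standard library, not the code used in the project: the function names, the toy rows, the rule that a footnote column is any column whose name contains "footnote", and the choice to treat the codes 199 and 201 as missing in every column are all assumptions made for the example.

```python
# Codes treated as missing; 199 and 201 come from the text, "" is an assumed blank cell.
MISSING_CODES = {"", "199", "201"}

def drop_footnotes(rows):
    """Remove columns whose name contains 'footnote' (assumed naming rule)."""
    return [{k: v for k, v in row.items() if "footnote" not in k.lower()}
            for row in rows]

def merge_quarters(quarters):
    """Stack quarterly tables, keeping only the columns shared by every quarter.

    quarters maps a quarter label to a non-empty list of row dicts.
    """
    cleaned = {q: drop_footnotes(rows) for q, rows in quarters.items()}
    common = set.intersection(*(set(rows[0]) for rows in cleaned.values()))
    merged = []
    for quarter, rows in cleaned.items():
        for row in rows:
            record = {"quarter": quarter}
            for col in sorted(common):
                value = row[col]
                # Recode the special codes as None so they count as missing.
                record[col] = None if value in MISSING_CODES else value
            merged.append(record)
    return merged

def drop_incomplete(rows):
    """Keep only rows with no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

# Toy data with hypothetical values, only to show the behaviour.
quarters = {
    "2019Q4": [
        {"CCN": "017000", "timely": "95.1", "admitted": "15.2", "timely footnote": ""},
        {"CCN": "017008", "timely": "199", "admitted": "14.0", "timely footnote": ""},
    ],
    "2020Q1": [
        {"CCN": "017000", "timely": "96.0", "admitted": "201", "newmeasure": "1.0"},
    ],
}
merged = merge_quarters(quarters)
complete = drop_incomplete(merged)
print(len(merged), len(complete))
```

In real data the special codes could collide with legitimate values in identifier columns, so the recoding would normally be restricted to the measure columns.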
These were the final datatypes of the features present in my data frame
Missing values present in the data frame
To understand the outcome variables and see how many data points fall within a given range of values, I plotted histograms of the Hospital Admission Rate and the Unplanned Emergency Room Visit Rate.
Most of the data points have an admission rate between 10 and 20.
Most of the data points have an ER rate between 5 and 20.
Model Fitting
Since this is a regression problem, I used linear regression as a baseline model. Before fitting the model, I plotted a correlation matrix to check for correlation between the features.
The matrix shows that betterwalking, betterbed and betterbathing are highly correlated. Similarly, mean, median and pop are highly correlated, so I dropped redundant columns from these groups during the cleaning phase. The final data frame has 21 features.
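As a rough illustration of how highly correlated pairs can be read off a correlation matrix, the sketch below computes Pearson correlations for every pair of columns and lists those above a cutoff. It is written in Python with the standard library only; the 0.8 cutoff and the toy values are assumptions for the example, not figures from the project.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def highly_correlated(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs

# Hypothetical values for three of the renamed columns.
columns = {
    "betterwalking": [60, 65, 70, 75, 80],
    "betterbed": [58, 66, 69, 77, 79],
    "timely": [90, 85, 95, 88, 92],
}
flagged = highly_correlated(columns)
for a, b, r in flagged:
    print(a, b, round(r, 3))
```

A list like this makes it explicit which groups of columns carry nearly the same information before deciding which of them to keep.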
Linear Regression
Linear regression assumes that the observations are independent of each other, so I aggregated all the rows belonging to each Home Health Agency into a single row per agency before fitting the model. Because the aim of this project is to understand which characteristics of HHAs, such as timely care and depression screening, are associated with utilization outcomes, the model summary is a natural place to look. In the summary, the features with the lowest p-values (marked with ***), such as checkfall, taughtdrugs and checkdepression, followed by the features marked with **, appear to be the most important in predicting the Hospital Admission Rate.
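The aggregation step can be sketched as below. The text does not say which summary was used to collapse the quarterly rows, so averaging each measure per agency is an assumption here, as are the function name and the toy values. The sketch is in Python with the standard library only.

```python
from collections import defaultdict
from statistics import mean

def aggregate_by_agency(rows, key="CCN", measures=("timely", "admitted")):
    """Collapse repeated quarterly rows into one row per agency.

    Each measure is averaged over the agency's rows (assumed summary).
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row)
    return {
        agency: {m: mean(r[m] for r in group) for m in measures}
        for agency, group in grouped.items()
    }

# Hypothetical quarterly rows for two agencies.
rows = [
    {"CCN": "017000", "timely": 94.0, "admitted": 15.0},
    {"CCN": "017000", "timely": 96.0, "admitted": 17.0},
    {"CCN": "017008", "timely": 90.0, "admitted": 14.0},
]
per_agency = aggregate_by_agency(rows)
print(per_agency)
```

After this step each agency contributes exactly one observation, which is what the independence assumption of linear regression calls for.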
Tree Model
Tree-based models help us explore the structure of a data set while developing easy-to-visualize decision rules for predicting a categorical (classification tree) or continuous (regression tree) outcome[4]. The plot shows that timeindex played the largest role in the splits, followed by whether the agency was in a rural or urban area, betterdrug, timely care, and so on.
Problem with choosing features related to Admission Rate or Emergency Room Visit Rate using Linear Regression
It might seem that if any one of the p-values for the individual variables is very small, then at least one of the predictors must be related to the response. However, this logic is flawed, especially when there are many predictors: even if no predictor is truly associated with the response, about 5% of the individual p-values will fall below 0.05 by chance. Hence, if we use the individual t-statistics and their p-values to decide whether there is any association between the variables and the response, there is a high chance that we will incorrectly conclude that there is a relationship[5]. The F-statistic does not suffer from this problem because it adjusts for the number of predictors.
We therefore use the F-statistic and its p-value to conclude whether at least one predictor is related to the response. This still does not tell us which variables are important, so we use variable selection methods to determine which predictors are associated with the response.
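The argument above can be made concrete with a small calculation. The Python sketch below shows how quickly the chance of at least one spurious "significant" p-value grows with the number of predictors, under the simplifying assumption that the individual tests are independent and no predictor is truly related to the response. It also gives the standard F-statistic for testing whether all p coefficients are zero. The function names and the predictor counts in the loop are chosen for illustration only.

```python
def prob_any_false_positive(n_predictors, alpha=0.05):
    """Chance that at least one of n independent null tests has p-value < alpha."""
    return 1 - (1 - alpha) ** n_predictors

def f_statistic(tss, rss, n, p):
    """Overall F-statistic: ((TSS - RSS) / p) / (RSS / (n - p - 1)).

    tss: total sum of squares, rss: residual sum of squares,
    n: number of observations, p: number of predictors.
    """
    return ((tss - rss) / p) / (rss / (n - p - 1))

for p in (1, 5, 19):
    print(p, round(prob_any_false_positive(p), 3))
```

With a single predictor the false-positive chance is just alpha, but with around twenty predictors it is well above one half, which is why individual p-values alone are a weak basis for claiming that some relationship exists.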
Variable Selection
It is often the case that some or many of the variables used in a multiple regression model are in fact not associated with the response. Including such irrelevant variables leads to unnecessary complexity in the resulting model. By removing these variables (that is, by setting the corresponding coefficient estimates to zero) we can obtain a model that is more easily interpreted. However, least squares is extremely unlikely to yield any coefficient estimates that are exactly zero, so a separate variable selection step is needed. [5]
Forward selection and backward selection each produce a set of models, each containing a subset of the p predictors. To use these methods, we need a way to decide which of these models is best. The model containing all of the predictors will always have the smallest RSS and the largest R2, because these quantities measure the fit on the training data. We therefore need other metrics, such as Cp, BIC and adjusted R2, which adjust the training error for model size and are better estimates of the test error than RSS and R2.
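To show how these metrics penalize model size, here is a short Python sketch of one common form of Cp, BIC and adjusted R2 for a least squares model with d predictors fitted to n observations. In it, sigma2 stands for an estimate of the error variance (typically taken from the full model). The exact scaling of Cp and BIC varies between textbooks and software, so the formulas below are an assumption about the form, and the numbers in the example are made up.

```python
from math import log

def mallows_cp(rss, n, d, sigma2):
    """Cp = (RSS + 2 * d * sigma2) / n."""
    return (rss + 2 * d * sigma2) / n

def bic(rss, n, d, sigma2):
    """BIC = (RSS + log(n) * d * sigma2) / n (up to constants)."""
    return (rss + log(n) * d * sigma2) / n

def adjusted_r2(rss, tss, n, d):
    """Adjusted R2 = 1 - (RSS / (n - d - 1)) / (TSS / (n - 1))."""
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))

# Made-up numbers, only to show how the penalties compare.
n, d, rss, tss, sigma2 = 100, 3, 200.0, 400.0, 2.0
print(mallows_cp(rss, n, d, sigma2))
print(bic(rss, n, d, sigma2))
print(adjusted_r2(rss, tss, n, d))
```

Because log(n) exceeds 2 once n is larger than 7, BIC places a heavier penalty on each extra variable than Cp does and so tends to select smaller models.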
Forward Stepwise Selection
Forward stepwise selection begins with a model containing no predictors and then adds predictors to the model one at a time until all of the predictors are in the model. At each step, the variable that gives the greatest additional improvement to the fit is added.[5]
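The procedure just described can be sketched in a few dozen lines. The Python example below is a minimal illustration, not the code used for the results in this report: it fits each candidate model by ordinary least squares with an intercept (solving the normal equations directly, which is fine for a toy example but not numerically ideal), and at each step adds the predictor that gives the lowest RSS. The function names and the toy data are assumptions.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def rss(columns, y):
    """Residual sum of squares of a least squares fit of y on an intercept plus columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)] for a in range(k)]
    xty = [sum(row[a] * y[i] for i, row in enumerate(design)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in design]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def forward_stepwise(data, y):
    """Return [(selected predictors, RSS)] for model sizes 1..p."""
    remaining, chosen, path = list(data), [], []
    while remaining:
        # Add the predictor that lowers the RSS the most.
        best = min(remaining, key=lambda name: rss([data[c] for c in chosen + [name]], y))
        chosen.append(best)
        remaining.remove(best)
        path.append((list(chosen), rss([data[c] for c in chosen], y)))
    return path

# Toy data: y depends strongly on x1, weakly on x3, and not on x2.
data = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [1, -1, 1, -1, 1, -1, 1, -1],
    "x3": [3, 1, 4, 1, 5, 9, 2, 6],
}
noise = [0.1, -0.2, 0.05, 0.1, -0.1, 0.15, -0.05, -0.05]
y = [3 * a + 0.5 * c + e for a, c, e in zip(data["x1"], data["x3"], noise)]
path = forward_stepwise(data, y)
order = [selected[-1] for selected, _ in path]
print(order)
```

The sequence of models in path is what a criterion such as BIC, Cp or cross-validation would then be applied to in order to pick a model size.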
Forward Selection For Admission Rate
With forward stepwise selection, the graphs show that model performance did not improve significantly beyond 10 features. The 10-variable model had the lowest BIC. The features displayed below are the ten features included in that model, so we can consider them to be related to the admission rate.
Forward Selection For Emergency Room Visit Rate
The 9-variable model had the lowest BIC and Cp values. The features displayed below are the 9 features included in that model, so we can consider them to be associated with the Emergency Room Visit Rate.
Backward Stepwise Selection
Forward selection follows a greedy approach and may include variables early on that become redundant later. Backward stepwise selection instead begins with the full least squares model containing all p predictors and then iteratively removes the least useful predictor, one at a time. The results show that backward stepwise selection gives results similar to forward stepwise selection.
Backward Selection for Admission Rate
These are the features related to Admission Rate in the model which performed best.
Backward Selection for Emergency Room Visit Rate
These are the features related to Emergency Room Visit Rate in the model which performed best.
Cross-Validation Approach
To choose between models during the selection process, I used metrics such as BIC and Cp. Since these are only indirect estimates of the test error, and since forward and backward selection gave similar results, I also used 10-fold cross-validation in combination with subset selection. This approach is somewhat involved, as the subset selection must be performed within each of the 10 training sets. The result is a 10×19 matrix whose (i, j)th element is the test MSE of the best j-variable model on the ith cross-validation fold. Using the apply() function to average over the columns of this matrix gives a vector whose jth element is the cross-validation error for the j-variable model.
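The averaging step at the end of this procedure is simple to illustrate. The Python sketch below takes a folds-by-model-size matrix of test errors, averages each column, and picks the model size with the lowest mean error. It mirrors what the column-wise apply() call does, but it is only an illustration: the matrix here is a made-up 3×4 example rather than the 10×19 matrix from the analysis, and the function name is an assumption.

```python
def mean_cv_error(cv_errors):
    """Average each column of a (folds x model sizes) error matrix."""
    n_folds = len(cv_errors)
    return [sum(column) / n_folds for column in zip(*cv_errors)]

# Rows are folds, columns are model sizes 1..4 (made-up test MSE values).
cv_errors = [
    [9.0, 6.0, 5.0, 5.5],
    [8.0, 7.0, 4.0, 4.5],
    [10.0, 5.0, 6.0, 6.5],
]
means = mean_cv_error(cv_errors)
# Model sizes are 1-based, so add 1 to the index of the smallest mean error.
best_size = min(range(len(means)), key=means.__getitem__) + 1
print(means, best_size)
```

Plotting the mean errors against the number of variables gives the kind of curve used to pick the final model size.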
Mean cross-validation error for predicting the admission rate vs number of features
From the above image, it is evident that the model with 10 features had the least cross-validation error while predicting the Admission Rate.
Mean cross-validation error for predicting the Emergency Room Visit Rate vs number of features
From the above image, it is evident that the model with 12 features had the least cross-validation error while predicting the Emergency Room Visit Rate.
Results:
Admission Rate: These are some of the features which were related to the Admission Rate from the 10 Variable model selected by Cross-Validation.
taughtdrugs
checkfall
checkdepression
betterbathing
lesspain
betterbreathing
rucaUrban
off.hha
These features are common to the models chosen by forward and backward selection, which gives some confidence that they are related to the Admission Rate. The results also seem meaningful: for example, teaching patients about their drugs helps them take the drugs correctly, which may help them recover faster.
Emergency Room Visit Rate: These are some of the features which were related to the Emergency Room Visit Rate from the 12 Variable model selected by Cross-Validation.
timely
taughtdrugs
checkfall
checkdepression
pneumococcal
lesspain
betterbreathing
rucaUrban
off.medical
betterdrug
season
median
These features are common to the models chosen by forward and backward selection, except that the forward and backward models missed two features from this list. This gives some confidence that these features are related to the Emergency Room Visit Rate.
Limitations and What can be done next:
One limitation of my approach is that during the initial data preprocessing phase I randomly dropped one column from each group of highly correlated columns. A dropped column might have been important. In future work, instead of dropping features at random, I would use selection techniques to decide which features to remove.
While choosing between models during forward and backward selection, I used the widely used metrics BIC and Cp, but I did not have sufficient evidence that these metrics are suitable for my data and problem. In the future, it would be better to find a metric suited to the data for choosing between models.
Missing Data - Many rows with missing values had to be dropped because the reason the values were missing was unknown.
Pre-COVID and COVID timeframe
Out of curiosity, I wanted to see whether the features related to utilization outcomes differed between the pre-COVID and post-COVID timeframes. I applied backward stepwise selection to examine the relation between the input features and the admission rate.
Backward selection on pre Covid data for Admission Rate
By observing the BIC value we can see that the 11 variable model performed best
Backward selection on post Covid data for Admission Rate
By observing the BIC value we can see that the 4 variable model performed best
From the above results, we can see that some of the features that were important in the pre-COVID timeframe were not selected when the procedure was run on the post-COVID data. I believe this can be attributed to the lack of sufficient data: the COVID timeframe has not ended yet, and I could currently find data for only three quarters.
Some of the features which were not included in the model when performed on data during the COVID timeframe are:
taughtdrugs
checkfall
betterdrug
off.hha
I believe these features should be included because they seem very important in determining the Hospital Admission Rate, so these results are not satisfactory. Future work can repeat the modeling once more data from after the start of COVID has been collected.
Conclusion:
We were able to find features related to both Admission Rate and Emergency Room Visit Rate and make comparisons using different selection techniques.
In conclusion, a few features, such as taughtdrugs, checkdepression and checkfall, were related to both the Admission Rate and the Emergency Room Visit Rate. Improving these measures may therefore help Home Health Agencies improve quality of care, thereby decreasing patient hospitalization.
As far as I know, very little research has used the entire data from 2010 to 2020 to predict the utilization outcomes and compare the results. Some work was done on this in the research group at the HIT Lab at UMBC two years ago, using only one quarter of the data. In addition, this project compares results for the pre-COVID and post-COVID time frames (here, post-COVID means the time frame after COVID occurred).
References:
[1] What's home health care? (n.d.). Retrieved March 07, 2021, from https://www.medicare.gov/what-medicare-covers/whats-home-health-care
[2] Koru, G., Parameshwarappa, P., Alhuwail, D., & Aifan, A. Facilitating focused process improvement efforts in home health agencies to improve utilization outcomes effectively and efficiently. Home Health Care Management & Practice.
[3] Zhang, Y., & Koru, G. Characterization of US Home Health Agencies with Respect to Utilization Outcomes.
[4] Kabacoff, R. (n.d.). Tree-based models. Retrieved April 12, 2021, from https://www.statmethods.net/advstats/cart.html
[5] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An introduction to statistical learning: With applications in R. Boston: Springer.