Over 90% of the data are categorical
Improve model performance
Make it easier to interpret the results of statistical analyses
Reduce the computational complexity of models
Feature importance
Importance visualization
P-value
Confidence interval error bar
Training data: the training set contains 80% of the total dataset (a minimal split sketch follows the model list below)
Model selection:
Classification models: predict the length-of-stay group for each patient
Regression models: predict the length of stay as a number of days
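As a minimal sketch of the 80/20 split and the two problem framings, assuming a pandas DataFrame `df` with a hypothetical `length_of_stay` column (the real file and column names may differ):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("hospital_data.csv")        # hypothetical file name
X = df.drop(columns=["length_of_stay"])      # input features
y_reg = df["length_of_stay"]                 # regression target: number of days
y_clf = y_reg.astype(str)                    # classification target: each value as a group

# 80% of the data for training, 20% held out for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y_reg, test_size=0.2, random_state=42
)
```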
Linear Regressor
The goal of a Linear Regressor is to minimize the squared differences between the predicted values and the actual values of the target variable.
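A minimal scikit-learn sketch, assuming the `X_train`/`y_train` split from above:

```python
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)            # minimizes the sum of squared residuals
print(lin_reg.score(X_test, y_test))     # R-squared on the held-out 20%
```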
Ridge Regularization
Ridge regularization adds a penalty term to the linear regression objective function, which discourages the model from over-relying on any one input feature. With this penalty, the model is encouraged to spread weight across all the input features when making predictions.
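A sketch of the same fit with an L2 penalty; `alpha` is an assumed, untuned value that controls how strongly large coefficients are penalized:

```python
from sklearn.linear_model import Ridge

# Objective: squared error + alpha * sum of squared coefficients (L2 penalty),
# which shrinks all coefficients toward zero without eliminating any of them.
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
print(ridge.score(X_test, y_test))
```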
Lasso Regularization
Lasso regularization also adds a penalty term to the linear regression objective function. Unlike Ridge, however, it encourages the model to use only the most important input features, driving the remaining coefficients to zero.
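The corresponding Lasso sketch; again `alpha` is illustrative, and larger values push more coefficients exactly to zero:

```python
from sklearn.linear_model import Lasso

# Objective: squared error + alpha * sum of absolute coefficients (L1 penalty),
# which sets the coefficients of less informative features exactly to zero.
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
print((lasso.coef_ != 0).sum(), "features kept by Lasso")
```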
Random Forest Regressor
Random Forest Regressor works by building a large number of decision trees on random subsets of the training data. Each decision tree makes a prediction for the target variable based on the values of the input features, and the forest averages these individual predictions to produce the final output.
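A sketch with scikit-learn's RandomForestRegressor; the number of trees is an assumed default, not the project's setting:

```python
from sklearn.ensemble import RandomForestRegressor

# Each of the 200 trees is fit on a bootstrap sample of the training data;
# the forest averages the per-tree predictions to produce the final estimate.
rf_reg = RandomForestRegressor(n_estimators=200, random_state=42)
rf_reg.fit(X_train, y_train)
print(rf_reg.score(X_test, y_test))
```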
SGD
SVM
Decision Tree
Random Forest
We first framed this as a classification problem, treating each length-of-stay value as a group, and used 4 different models to predict the result. Random Forest achieved the best performance with 98.48% accuracy, followed by Decision Tree at 95.45%, while SVM and SGD stayed below 60% accuracy.
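A hedged sketch of the four-classifier comparison; hyperparameters are illustrative defaults, not the settings that produced the accuracies above:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

classifiers = {
    "SGD": SGDClassifier(random_state=42),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# y_clf treats each length-of-stay value as its own class (see the split sketch above)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X, y_clf, test_size=0.2, random_state=42
)
for name, clf in classifiers.items():
    clf.fit(Xc_train, yc_train)
    print(name, accuracy_score(yc_test, clf.predict(Xc_test)))
```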
Lasso
Ridge
Linear Regressor
Random Forest Regressor
Among the regression models, the Random Forest Regressor again achieved the best performance, at 80.86%. Overall, the classification models performed better than the regression models in this case, since over 90% of the data are text (categorical).
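The analogous regressor comparison, scored with R-squared via `.score()`; again a sketch reusing the split from above:

```python
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor

regressors = {
    "Lasso": Lasso(alpha=0.1),
    "Ridge": Ridge(alpha=1.0),
    "Linear Regressor": LinearRegression(),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=200, random_state=42),
}
for name, reg in regressors.items():
    reg.fit(X_train, y_train)
    print(name, reg.score(X_test, y_test))   # .score() returns R-squared for regressors
```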
Moreover, we tried four different methods of imputing the missing values, refit the model with each, and compared the results to find the best prediction:
KNN Imputed Categories and KNN Imputed Numerics on original data
KNN to the whole
MICE on Request Status is Accepted
KNN Imputed Categories and MICE Imputed Numerics on Request Status is Accepted
The results show that KNN Imputed Categories and MICE Imputed Numerics on Request Status is Accepted performs best for the model, yielding an R-squared of 81.54%.
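A sketch of this best-performing scheme, using KNN imputation for the encoded categorical columns and MICE (scikit-learn's IterativeImputer) for the numeric columns, restricted to accepted requests; the column lists and the `Request Status` filter value are placeholders:

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import KNNImputer, IterativeImputer

accepted = df[df["Request Status"] == "Accepted"].copy()   # hypothetical column/value

cat_cols = ["SNF", "Inpatient", "ER"]    # illustrative encoded categorical columns
num_cols = ["Age"]                       # illustrative numeric columns

accepted[cat_cols] = KNNImputer(n_neighbors=5).fit_transform(accepted[cat_cols])
accepted[num_cols] = IterativeImputer(random_state=42).fit_transform(accepted[num_cols])
```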
Hospital data comes in a complex form, so it needs extensive data cleaning to become organized. Since the columns mostly contain text, we had to convert the 18 categorical columns to numbers for further analysis, which results in 193 columns in total. We then selected the best features to make the model's predictions more efficient.
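A sketch of the categorical-to-numeric conversion with pandas one-hot encoding, which is what expands the 18 text columns into the 193 total columns mentioned above (assuming the DataFrame `df` from earlier):

```python
import pandas as pd

encoded = pd.get_dummies(df)             # expands each text column into indicator columns
print(encoded.shape[1], "columns after encoding")
```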
From the results, we can conclude that the Random Forest Regressor is a powerful machine learning algorithm, since it performs better than the other regressors. The length of stay is most strongly related to the features ‘Ecmo’, ‘Age’, ‘SNF’, ‘Inpatient’, ‘ER’, ‘Long Term Care’ and ‘Home/Self Care’. Based on the features most important to the target, we refined the model and increased its accuracy to 81.60%. The model is highly accurate in predicting a patient's length of stay. With this capability, it can be applied to larger and more complex medical datasets, which will assist hospitals in making informed decisions about accepting transfer requests. This, in turn, will not only enhance the patient experience but also reduce costs.
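One way to recover the ranking described above is the impurity-based importances of the fitted forest (a sketch assuming `rf_reg` and `X_train` from the earlier blocks):

```python
import pandas as pd

importances = pd.Series(rf_reg.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))   # top-10 drivers of length of stay
```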
For further analysis, we will work more on feature selection. We have used mutual information regression and mRMR selection, which slightly improved the prediction by 0.06%. We can also try backward selection to see whether it makes a difference. In addition, since a stay of more than 15 days usually indicates an outlier, we can use classification models to predict whether a patient stays more than 15 days, as they tend to perform better than regression models in this case.
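A sketch of the mutual-information step with scikit-learn; `k` is an assumed cutoff, and the mRMR step would require a separate package:

```python
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Keep the k features sharing the most mutual information with length of stay
selector = SelectKBest(score_func=mutual_info_regression, k=20)   # k is illustrative
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)
```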