Over 90% of the data are categorical
Improve model performance
Make it easier to interpret the results of statistical analyses
Reduce the computational complexity of models
Feature importance
Importance visualization
P-value
Confidence interval error bar
Training data: the training set contains 80% of the total dataset (a minimal split sketch follows the model list below)
Model selection:
Classification models: predict the length-of-stay group for each patient
Regression models: predict the length of stay as a number of days
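As a minimal sketch of the 80/20 split and the two problem framings, assuming a pandas DataFrame `df` with a hypothetical `length_of_stay` column (the real file and column names may differ):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("hospital_data.csv")        # hypothetical file name
X = df.drop(columns=["length_of_stay"])      # input features
y_reg = df["length_of_stay"]                 # regression target: number of days
y_clf = y_reg.astype(str)                    # classification target: each value as a group

# 80% of the data for training, 20% held out for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y_reg, test_size=0.2, random_state=42
)
```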
Linear Regressor
The goal of a Linear Regressor is to minimize the squared differences between the predicted values and the actual values of the target variable.
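A minimal scikit-learn sketch, assuming the `X_train`/`y_train` split from above:

```python
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)            # minimizes the sum of squared residuals
print(lin_reg.score(X_test, y_test))     # R-squared on the held-out 20%
```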
Ridge Regularization
Ridge regularization adds a penalty term to the linear regression objective function, which discourages the model from over-relying on any one input feature. With this penalty, the model is encouraged to spread weight across all the input features when making predictions.
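A sketch of the same fit with an L2 penalty; `alpha` is an assumed, untuned value that controls how strongly large coefficients are penalized:

```python
from sklearn.linear_model import Ridge

# Objective: squared error + alpha * sum of squared coefficients (L2 penalty),
# which shrinks all coefficients toward zero without eliminating any of them.
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
print(ridge.score(X_test, y_test))
```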
Lasso Regularization
Lasso regularization also adds a penalty term to the linear regression objective function. Unlike Ridge, however, it encourages the model to use only the most important input features, driving the remaining coefficients to zero.
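The corresponding Lasso sketch; again `alpha` is illustrative, and larger values push more coefficients exactly to zero:

```python
from sklearn.linear_model import Lasso

# Objective: squared error + alpha * sum of absolute coefficients (L1 penalty),
# which sets the coefficients of less informative features exactly to zero.
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
print((lasso.coef_ != 0).sum(), "features kept by Lasso")
```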
Random Forest Regressor
Random Forest Regressor works by building a large number of decision trees on random subsets of the training data. Each decision tree makes a prediction for the target variable based on the values of the input features, and the forest averages these individual predictions to produce the final output.
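A sketch with scikit-learn's RandomForestRegressor; the number of trees is an assumed default, not the project's setting:

```python
from sklearn.ensemble import RandomForestRegressor

# Each of the 200 trees is fit on a bootstrap sample of the training data;
# the forest averages the per-tree predictions to produce the final estimate.
rf_reg = RandomForestRegressor(n_estimators=200, random_state=42)
rf_reg.fit(X_train, y_train)
print(rf_reg.score(X_test, y_test))
```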
SGD
SVM
Decision Tree
Random Forest
We first framed this as a classification problem, treating each length-of-stay value as a group, and used 4 different models to predict the result. Random Forest achieved the best performance with 98.48% accuracy, followed by Decision Tree at 95.45%, while SVM and SGD stayed below 60% accuracy.
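A hedged sketch of the four-classifier comparison; hyperparameters are illustrative defaults, not the settings that produced the accuracies above:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

classifiers = {
    "SGD": SGDClassifier(random_state=42),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# y_clf treats each length-of-stay value as its own class (see the split sketch above)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X, y_clf, test_size=0.2, random_state=42
)
for name, clf in classifiers.items():
    clf.fit(Xc_train, yc_train)
    print(name, accuracy_score(yc_test, clf.predict(Xc_test)))
```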
Lasso
Ridge
Linear Regressor
Random Forest Regressor
Among the regression models, the Random Forest Regressor again achieved the best performance, at 80.86%. Overall, the classification models performed better than the regression models in this case, since over 90% of the data are text (categorical).
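The analogous regressor comparison, scored with R-squared via `.score()`; again a sketch reusing the split from above:

```python
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor

regressors = {
    "Lasso": Lasso(alpha=0.1),
    "Ridge": Ridge(alpha=1.0),
    "Linear Regressor": LinearRegression(),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=200, random_state=42),
}
for name, reg in regressors.items():
    reg.fit(X_train, y_train)
    print(name, reg.score(X_test, y_test))   # .score() returns R-squared for regressors
```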
Moreover, we tried four different methods of imputing the missing values, refit the model with each, and compared the results to find the best prediction:
KNN Imputed Categories and KNN Imputed Numerics on original data
KNN to the whole
MICE on Request Status is Accepted
KNN Imputed Categories and MICE Imputed Numerics on Request Status is Accepted
The results show that KNN Imputed Categories and MICE Imputed Numerics on Request Status is Accepted performs best for the model, yielding an R-squared of 81.54%.
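A sketch of this best-performing scheme, using KNN imputation for the encoded categorical columns and MICE (scikit-learn's IterativeImputer) for the numeric columns, restricted to accepted requests; the column lists and the `Request Status` filter value are placeholders:

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import KNNImputer, IterativeImputer

accepted = df[df["Request Status"] == "Accepted"].copy()   # hypothetical column/value

cat_cols = ["SNF", "Inpatient", "ER"]    # illustrative encoded categorical columns
num_cols = ["Age"]                       # illustrative numeric columns

accepted[cat_cols] = KNNImputer(n_neighbors=5).fit_transform(accepted[cat_cols])
accepted[num_cols] = IterativeImputer(random_state=42).fit_transform(accepted[num_cols])
```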
Hospital data comes in a complex form, so it needs extensive data cleaning to become organized. Since the columns mostly contain text, we had to convert the 18 categorical columns to numbers for further analysis, which results in 193 columns in total. We then selected the best features to make the model's predictions more efficient.
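A sketch of the categorical-to-numeric conversion with pandas one-hot encoding, which is what expands the 18 text columns into the 193 total columns mentioned above (assuming the DataFrame `df` from earlier):

```python
import pandas as pd

encoded = pd.get_dummies(df)             # expands each text column into indicator columns
print(encoded.shape[1], "columns after encoding")
```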
From the results, we can conclude that the Random Forest Regressor is a powerful machine learning algorithm, since it performs better than the other regressors. The length of stay is most strongly related to the features ‘Ecmo’, ‘Age’, ‘SNF’, ‘Inpatient’, ‘ER’, ‘Long Term Care’ and ‘Home/Self Care’. Based on the features most important to the target, we refined the model and increased its accuracy to 81.60%. The model is highly accurate in predicting a patient's length of stay. With this capability, it can be applied to larger and more complex medical datasets, which will assist hospitals in making informed decisions about accepting transfer requests. This, in turn, will not only enhance the patient experience but also reduce costs.
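One way to recover the ranking described above is the impurity-based importances of the fitted forest (a sketch assuming `rf_reg` and `X_train` from the earlier blocks):

```python
import pandas as pd

importances = pd.Series(rf_reg.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))   # top-10 drivers of length of stay
```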
For further analysis, we will work more on feature selection. We have used mutual information regression and mRMR selection, which slightly improved the prediction by 0.06%. We can also try backward selection to see whether it makes a difference. In addition, since a stay of more than 15 days usually indicates an outlier, we can use classification models to predict whether a patient stays more than 15 days, as they tend to perform better than regression models in this case.
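A sketch of the mutual-information step with scikit-learn; `k` is an assumed cutoff, and the mRMR step would require a separate package:

```python
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Keep the k features sharing the most mutual information with length of stay
selector = SelectKBest(score_func=mutual_info_regression, k=20)   # k is illustrative
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)
```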