Team 1

Team 1 - Healthcare Analytics

Babajide Omobo, Stephen Rule, and Jumman Hossain

{omobo1,srule1,jumman.hossain}@umbc.edu

Final Team Report

Abstract

This project addresses a problem in the healthcare industry. At a high level, it aims to predict patients' length of stay (LOS) at hospitals.

Hospitals often find it difficult to know when ICU beds will become available for new patients because they cannot be sure how long admitted patients will stay. This became a serious issue during the COVID-19 pandemic. This project examines the problem and looks for solutions using machine learning.

The project uses a dataset of patient hospitalization information to train models that predict a patient's length of stay.

Background/Motivation/Benefits

This project is useful to the healthcare industry because it lets hospital management and local governments know when beds will become available and communicate this to the necessary stakeholders.

If a hospital agent can enter a patient's admission information and get an estimated length of stay for that patient, the hospital can use that information to maximize the use of its ICU beds and operate more efficiently.

Data Source Description

The training dataset consists of 18 columns and 318,439 rows of data. We plan to split it into two datasets, one for training and one for testing. We also plan to reduce the dataset to 25,000 rows so that the program runs quickly.
The dataset contains hospital information such as case id, hospital codes, hospital region, available rooms, departments, ward type and codes, bed grade, type of admission, severity of illness, visitors with the patient, age, admission deposit, and the number of days the patient stayed.
The dataset is listed on Kaggle and currently has 1 task, 67 code notebooks, and 11 discussions. Although the main task is to predict length of stay, our team aims to predict length of stay and possibly the admission deposit.
Link: https://www.kaggle.com/nehaprabhavalkar/av-healthcare-analytics-ii/download (Note: Kaggle account required to download)
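
The subsampling and train/test split described above could be sketched as follows. This is a minimal illustration on synthetic data, not the team's actual code: the column names `Admission_Deposit` and `Stay`, the class proportions, the 80/20 split ratio, and the small sample sizes are all assumptions chosen to keep the example quick. Stratifying on the label is one way to keep the class distribution unchanged after sampling.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Kaggle file; column names and values are assumptions.
rng = np.random.default_rng(0)
n_rows = 2000
df = pd.DataFrame({
    "Admission_Deposit": rng.normal(4800, 1000, size=n_rows),
    "Stay": rng.choice(["0-10", "11-20", "21-30", "31-40"],
                       size=n_rows, p=[0.1, 0.3, 0.4, 0.2]),
})

# Draw a smaller sample that preserves the label proportions
# (the report reduces the data to 25,000 rows; 500 is used here for speed).
sample_size = 500
sample, _ = train_test_split(
    df, train_size=sample_size, stratify=df["Stay"], random_state=42
)

# Split the sample into training and test sets, again stratified by the label.
train, test = train_test_split(
    sample, test_size=0.2, stratify=sample["Stay"], random_state=42
)

print(len(sample), len(train), len(test))
print(sample["Stay"].value_counts(normalize=True).round(3))
```

Comparing `df["Stay"].value_counts(normalize=True)` with the same call on `sample` is a quick check that the reduced dataset keeps the original distribution.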

Approach

We will perform exploratory data analysis to understand the distribution of the data.
We will apply several machine learning algorithms to the dataset to predict the label (length of stay).
We will use accuracy to evaluate the classification models.
We will then transform the label into a binary label and use accuracy and ROC AUC to evaluate the new predictions.

Exploratory Data Analysis

This plot visualizes which age groups stay longer at the hospital. Patients above age 50 tend to stay longer, while patients below 50 tend to have shorter stays. This matched our prior intuition.

This plot visualizes why patients stay longer at the hospital. Patients with extreme illness stay longer, while patients with minor illness have short stays. This matched our intuition.

This plot shows the type of admission against the length of stay. Patients admitted for trauma stay longer at the hospital, while patients with urgent admissions have short stays.

This plot shows that, for all departments, stays are heavily concentrated in the 11-20, 21-30, and 31-40 day ranges. The surgery department had a large number of patients staying 51-60 days.

We plotted the length of stay of admitted patients by department. To show the magnitude, we used the count of patients on the y-axis.

As the plot shows, most patients admitted to hospitals stay less than 30 days regardless of the department. However, the gynecology department has the most patients staying more than 30 days.
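
The patient counts behind a department-by-stay plot like this one can be tabulated with a cross-tabulation. The sketch below uses a handful of made-up rows purely for illustration; the column names `Department` and `Stay` are assumptions about the dataset's headers.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset.
df = pd.DataFrame({
    "Department": ["gynecology", "gynecology", "gynecology",
                   "surgery", "radiotherapy", "radiotherapy"],
    "Stay": ["11-20", "31-40", "21-30", "51-60", "0-10", "21-30"],
})

# Rows are departments, columns are stay ranges, cells are patient counts.
counts = pd.crosstab(df["Department"], df["Stay"])
print(counts)
```

With matplotlib installed, `counts.plot(kind="bar")` would draw the counts as grouped bars with the number of patients on the y-axis.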

Model Development

Null values were found in two features, "Bed Grade" and "City_Code_Patient". We chose to drop the rows with null values rather than impute them because we already had plenty of data to work with, and dropping those rows did not change the distribution of the dataset.
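
A minimal sketch of this step, assuming pandas and a small synthetic table in place of the real data (the `Stay` column name, the row counts, and the number of missing values are invented for illustration). It drops rows with nulls in the two features and measures how much the label distribution shifts as a result.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in; values and the "Stay" column name are assumptions.
rng = np.random.default_rng(1)
n_rows = 1000
df = pd.DataFrame({
    "Bed Grade": rng.choice([1.0, 2.0, 3.0, 4.0], size=n_rows),
    "City_Code_Patient": rng.integers(1, 38, size=n_rows).astype(float),
    "Stay": rng.choice(["0-10", "11-20", "21-30"], size=n_rows),
})

# Blank out a few random cells to imitate the missing values.
df.loc[rng.choice(n_rows, size=10, replace=False), "Bed Grade"] = np.nan
df.loc[rng.choice(n_rows, size=15, replace=False), "City_Code_Patient"] = np.nan

print(df.isnull().sum())  # null count per column

# Drop every row that has a null in either feature.
cleaned = df.dropna(subset=["Bed Grade", "City_Code_Patient"])

# Compare label proportions before and after dropping.
before = df["Stay"].value_counts(normalize=True)
after = cleaned["Stay"].value_counts(normalize=True)
max_shift = (before - after).abs().max()
print(f"rows kept: {len(cleaned)}, largest change in class share: {max_shift:.4f}")
```

A small `max_shift` supports the claim that dropping rows left the distribution essentially unchanged.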

We dropped the columns case_id, City_Code_Hospital, Hospital_region_code, Available Extra Rooms in Hospital, Ward_Facility_Code, Bed Grade, patientid, and City_Code_Patient. Our team found that these features had the least effect on the models or were not appropriate or useful to include.

We encoded all non-numerical values, normalized the admission deposit, removed outliers, and prepared the data for the models listed in the diagram to the left.
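
The report does not say which encoding, normalization, or outlier method was used, so the sketch below makes its own choices and should be read as one possible version of this step: integer category codes for encoding, the 1.5 x IQR rule for outliers, and min-max scaling for the admission deposit. The data is synthetic, and the order of the steps (outliers first, then scaling) is also an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in; column names and values are assumptions.
rng = np.random.default_rng(2)
n_rows = 500
df = pd.DataFrame({
    "Department": rng.choice(["gynecology", "surgery", "radiotherapy"], size=n_rows),
    "Severity of Illness": rng.choice(["Minor", "Moderate", "Extreme"], size=n_rows),
    "Admission_Deposit": rng.normal(4800, 1000, size=n_rows),
})
df.loc[0, "Admission_Deposit"] = 50000.0  # plant one obvious outlier

# 1) Remove outliers with the 1.5 * IQR rule (assumed method).
q1, q3 = df["Admission_Deposit"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = df["Admission_Deposit"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[in_range].copy()

# 2) Encode each text column as integer category codes (assumed method).
for col in ["Department", "Severity of Illness"]:
    df[col] = df[col].astype("category").cat.codes

# 3) Min-max scale the deposit into the range [0, 1] (assumed method).
deposit = df["Admission_Deposit"]
df["Admission_Deposit"] = (deposit - deposit.min()) / (deposit.max() - deposit.min())

print(df.head())
```

Category codes are assigned in alphabetical order, so an ordered feature such as severity would need an explicit category order if its ranking should be preserved.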

Results and Insights

Our team ran the following models: ZeroR, OneR, Logistic Regression, Linear Discriminant Analysis, K Nearest Neighbor, CART (Decision Tree), Naive Bayes, SVM, Random Forest, and Extreme Gradient Boosting. Below are the accuracy scores for each model.
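
A comparison of this kind can be sketched with scikit-learn as below. It is an illustration under stated assumptions rather than the team's code: the data comes from `make_classification`, only a subset of the listed models is included, ZeroR is approximated by `DummyClassifier(strategy="most_frequent")`, and OneR is very roughly approximated by a depth-1 decision tree because scikit-learn has no OneR estimator.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-class data standing in for the hospital dataset.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "ZeroR (most frequent class)": DummyClassifier(strategy="most_frequent"),
    "OneR (rough stand-in: depth-1 tree)": DecisionTreeClassifier(max_depth=1, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear Discriminant Analysis": LinearDiscriminantAnalysis(),
    "K Nearest Neighbor": KNeighborsClassifier(),
    "CART (Decision Tree)": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# Fit every model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Print the models from most to least accurate.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

The ZeroR score is a useful floor: a model that does not clearly beat it has learned little beyond the majority class.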

As the scores above show, accuracy is very low. The dataset is high-dimensional: the label column has 11 categories, which is very challenging to predict accurately, and the other input fields are also high-dimensional.

Because of the low accuracy scores, our team explored reducing the dataset from 313,793 rows to 25,000 while keeping the same distribution. We also decided to transform the label from 11 categories (0-10, 11-20, 21-30, 31-40, 41-50, 51-60, 61-70, 71-80, 81-90, 91-100, and more than 100 days) to 2 categories (under/over 30 days).

We did this because there is interest in distinguishing "short term patients" (stays of less than 30 days) from "long term patients" (stays of more than 30 days).
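
The label transformation can be sketched in a few lines of pandas. The exact label strings and the `Stay` column name are assumptions about how the dataset spells its categories, and placing the 21-30 bucket on the short-stay side is an assumed reading of the 30-day cutoff.

```python
import pandas as pd

# The three buckets treated as short stays (assumed label spellings).
short_stay = {"0-10", "11-20", "21-30"}

# A few example labels; any other bucket counts as a long stay.
stay = pd.Series(["0-10", "21-30", "31-40", "More than 100 Days", "11-20"], name="Stay")

# 0 = short term (up to 30 days), 1 = long term (over 30 days).
long_stay = (~stay.isin(short_stay)).astype(int)
print(long_stay.tolist())  # [0, 0, 1, 1, 0]
```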

Below are the results after the label transformation.

As the results show, Extreme Gradient Boosting is the best model, with an accuracy of 83.8%.

The ROC curves of the XGB model and of a no-skill baseline are plotted above.

Final Test Data Result:

We used the XGB model on the test dataset and got an accuracy of 77.74% and a ROC AUC of 0.814.
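
The two reported metrics can be computed as in the sketch below. It runs on synthetic binary data and swaps in scikit-learn's `GradientBoostingClassifier` for XGBoost so that no extra package is needed; it therefore does not reproduce the numbers above. The no-skill baseline is modeled as a classifier that gives every patient the same score.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the under/over 30 days label.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Stand-in for the XGB model (assumption: not the team's actual estimator).
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Accuracy uses hard predictions; ROC AUC uses the positive-class probability.
accuracy = accuracy_score(y_test, model.predict(X_test))
positive_scores = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, positive_scores)

# A no-skill model scores every case identically, which yields an AUC of 0.5.
no_skill_auc = roc_auc_score(y_test, [0.5] * len(y_test))

# Points of the ROC curve, ready to plot as tpr against fpr.
fpr, tpr, _ = roc_curve(y_test, positive_scores)

print(f"accuracy={accuracy:.3f}  ROC AUC={auc:.3f}  no-skill AUC={no_skill_auc:.1f}")
```

Plotting `fpr` against `tpr` next to the diagonal from (0, 0) to (1, 1) gives the model-versus-no-skill comparison.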

Code: https://github.com/stephenrule/Healthcare-Analytics

Conclusion

Our team explored different models by including and removing different input features and found that these features had no significant weight on prediction. Initially the team tried to predict LOS across 11 categories but could not do so accurately, recording accuracy below 40%. After we transformed the stay label from 11 categories to 2 (under/over 30 days), prediction accuracy rose above 70%, with some models reaching the 80% range. We chose under/over 30 days because a hospital would want to know whether patients will stay short term (less than a month) or long term (more than a month). With this information, a hospital can better inform government agencies and patients of availability.

Future Work

Below are some potential tasks and improvements that could help attain this project's objective.

We can try to reduce the dimensionality of the inputs to improve prediction of the 11-category label.
We can also try some more sophisticated models such as deep learning/neural networks.
We can collect more data that might be more helpful to track the length of stay of a patient such as:

- If the patient has a pre-existing condition (Hospitalization History).
- Type of injury/disease.
- We can talk to domain experts to figure out other features that can be helpful.

We can implement the project in a real-world setting, such as a hospital, to gain insight into its application.

Team Reflection

The team's approach was to split the tasks into three main categories (data exploration, data cleaning, and model application).

We performed these tasks following a Box's loop approach, iterating over the models repeatedly.

We learned that datasets often have high-dimensionality problems and inconsistencies, and that speaking to a domain expert is crucial to making sense of certain situations.

Reference

Our dataset came from Kaggle.com: https://www.kaggle.com/nehaprabhavalkar/av-healthcare-analytics-ii

Our team read through the problem set (Description) and downloaded the dataset from the above link. All other information used in our project was from IS 733 homework assignments, Scikit-Learn: https://scikit-learn.org/stable/, and Pandas: https://pandas.pydata.org/.