Hospital Death Analysis

Drop columns that contain a single constant value
- readmission_status, gcs_unable_apache
Drop identifier columns that carry no predictive meaning
- patient_id, icu_id, hospital_id
Drop columns that show no association with the target
- tested with the chi-square test
- gender, ethnicity, aids, and others, as the graph on the right shows
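
The column screening above can be sketched roughly as follows. This is a minimal illustration on toy data, not the project's actual code: the target name hospital_death, the helper names, and the toy values are all assumptions made for the example.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def constant_columns(df):
    """Return columns that hold a single distinct value (NaN counts as a value)."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]


def chi_square_pvalue(df, feature, target):
    """p-value of the chi-square test of independence between two categorical columns."""
    table = pd.crosstab(df[feature], df[target])
    _, p_value, _, _ = chi2_contingency(table)
    return p_value


# Toy data; "hospital_death" is an assumed name for the target column.
toy = pd.DataFrame({
    "readmission_status": [0] * 8,
    "gender": ["M", "F", "M", "F", "M", "F", "M", "F"],
    "ventilated_apache": [1, 1, 1, 1, 0, 0, 0, 0],
    "hospital_death": [1, 1, 1, 0, 0, 0, 0, 0],
})

dropped = constant_columns(toy)
p_gender = chi_square_pvalue(toy, "gender", "hospital_death")
p_vent = chi_square_pvalue(toy, "ventilated_apache", "hospital_death")

print("constant columns:", dropped)
print(f"gender p-value: {p_gender:.3f}, ventilated_apache p-value: {p_vent:.3f}")
```

A large p-value means the test finds no evidence of association with the target, which is the criterion used here for dropping a column. With only eight toy rows the p-values are not meaningful in themselves; they only show the mechanics.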

Feature Engineering

Numeric:

PCA

dimensionality reduction
removes the correlation between numeric features

Categorical:

dummy variables

From PCA, we keep 26 components, which together retain about 90% of the information (variance) in the data.
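
As a rough sketch of this step, the example below keeps enough principal components to reach 90% of the variance and one-hot encodes a categorical column. The toy columns (x1, icu_type, and so on) are hypothetical, and standardizing before PCA is an assumption: the report does not say whether the numeric features were scaled.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_rows = 200

# Six numeric columns built from three latent factors, so pairs are highly correlated.
latent = rng.normal(size=(n_rows, 3))
numeric = pd.DataFrame({
    "x1": latent[:, 0],
    "x1_dup": latent[:, 0] + 0.01 * rng.normal(size=n_rows),
    "x2": latent[:, 1],
    "x2_dup": latent[:, 1] + 0.01 * rng.normal(size=n_rows),
    "x3": latent[:, 2],
    "x3_dup": latent[:, 2] + 0.01 * rng.normal(size=n_rows),
})
categorical = pd.DataFrame({"icu_type": np.resize(["MICU", "CCU", "SICU"], n_rows)})

# Assumption: standardize first so that no column dominates because of its units.
scaled = StandardScaler().fit_transform(numeric)

# A float n_components keeps the smallest number of components reaching that variance share.
pca = PCA(n_components=0.90, svd_solver="full")
components = pca.fit_transform(scaled)
component_df = pd.DataFrame(
    components, columns=[f"pc{i + 1}" for i in range(pca.n_components_)]
)

# One dummy (0/1) column per category level.
dummies = pd.get_dummies(categorical)

new_dataset = pd.concat([component_df, dummies], axis=1)
print(pca.n_components_, round(pca.explained_variance_ratio_.sum(), 4), new_dataset.shape)
```

On the toy data the six correlated columns collapse to three components, in the same way that the report's numeric features collapse to 26.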

New dataset

About 20,000 rows and 61 columns: 26 principal components and 35 dummy variables.

Score Set Preparation

The same techniques are applied to the test set
Fill missing numeric values with the column median
Fill missing categorical values with the column mode
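
A minimal sketch of this imputation is shown below. The helper fill_missing and the toy columns are hypothetical, and the fill values are computed from the frame being filled, which is one reading of the description above; computing them from the training set instead would also be reasonable.

```python
import numpy as np
import pandas as pd


def fill_missing(df):
    """Fill numeric columns with their median and other columns with their mode."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            modes = out[col].mode()
            if not modes.empty:  # a column that is entirely missing has no mode
                out[col] = out[col].fillna(modes.iloc[0])
    return out


# Toy score set with one missing value per column (hypothetical values).
score_set = pd.DataFrame({
    "age": [50.0, np.nan, 70.0, 90.0],
    "icu_type": ["MICU", None, "MICU", "CCU"],
})

filled = fill_missing(score_set)
print(filled)
```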

Model Choosing

Analysis preparation

- split the data into training and test sets

- balance the target variable in the training set with SMOTE

Before preparation

Most patients survived, so the target is imbalanced.

After SMOTE

The training set is resampled so that the target is balanced, which should help improve model performance.
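
To illustrate what SMOTE does, here is a simplified NumPy sketch of its core idea: each synthetic minority sample is an interpolation between a minority point and one of its nearest minority neighbours. The report does not say which implementation was used; a library class such as imbalanced-learn's SMOTE would be the usual choice in practice. The function name smote_sketch and the toy data are assumptions.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by neighbour interpolation."""
    X_min = np.asarray(X_min, dtype=float)
    if len(X_min) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)

    # Pairwise Euclidean distances between minority samples.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a base point, one of its k neighbours, and a random position between them.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Toy training set: 90 majority rows (label 0) and 10 minority rows (label 1).
rng = np.random.default_rng(0)
X_major = rng.normal(loc=0.0, size=(90, 2))
X_minor = rng.normal(loc=3.0, size=(10, 2))

synthetic = smote_sketch(X_minor, n_new=len(X_major) - len(X_minor))
X_balanced = np.vstack([X_major, X_minor, synthetic])
y_balanced = np.array([0] * len(X_major) + [1] * (len(X_minor) + len(synthetic)))

print(np.bincount(y_balanced))
```

Only the training split is resampled. The test split keeps its original class ratio so that the evaluation reflects real data.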

Model Compare

From the graph on the left, we find that

XGBoost has the highest accuracy but the lowest ROC score.
Logistic Regression has the highest ROC score, but its accuracy is not as good.
After trying all the models, we find that Random Forest has the best overall performance.

So we choose Random Forest for tuning.
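
The comparison can be sketched as below on synthetic data. It assumes that "ROC score" means the area under the ROC curve (ROC AUC). The models are limited to scikit-learn classifiers plus a majority-class baseline, which is added only to show why accuracy and ROC AUC can disagree on an imbalanced target. XGBoost is left out to keep the example dependency-free.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the cleaned data (about 10% positives).
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

models = {
    "Majority baseline": DummyClassifier(strategy="most_frequent"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    # ROC AUC is computed from predicted probabilities, not from hard labels.
    roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    scores[name] = {"accuracy": accuracy, "roc_auc": roc_auc}

for name, s in scores.items():
    print(f"{name}: accuracy={s['accuracy']:.3f}, roc_auc={s['roc_auc']:.3f}")
```

The baseline always predicts "not dead", so its accuracy is high while its ROC AUC is 0.5. This is why a model can lead on one metric and trail on the other.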

From the graph on the left, we can see that the third model, with max depth = 50, min samples split = 80, and min samples leaf = 10, is the best Random Forest model.

It performed well on the test set, with a ROC score of 0.80.
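
A tuning loop of this kind might look like the sketch below. Only the third parameter set comes from the report; the first two candidates are hypothetical placeholders, and the data is synthetic, so which candidate wins here says nothing about the real results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

candidates = [
    {"max_depth": 10, "min_samples_split": 2, "min_samples_leaf": 1},    # hypothetical
    {"max_depth": 30, "min_samples_split": 40, "min_samples_leaf": 5},   # hypothetical
    {"max_depth": 50, "min_samples_split": 80, "min_samples_leaf": 10},  # from the report
]

results = []
for params in candidates:
    model = RandomForestClassifier(n_estimators=100, random_state=0, **params)
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    results.append((auc, params))

# Keep the candidate with the highest validation ROC AUC.
best_auc, best_params = max(results, key=lambda r: r[0])
print(f"best ROC AUC={best_auc:.3f} with {best_params}")
```

Larger min_samples_split and min_samples_leaf values stop the trees from growing very small leaves, which acts as regularization.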

Feature importance is computed for this model.

From the graph above, we find that ventilated_apache and intubated_apache are among the most important features for predicting patient outcomes, which makes sense.

If a patient needs to be ventilated or intubated, he or she probably has a serious health problem, which means a higher probability of dying in the hospital.
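
For reference, feature importance can be read from a fitted scikit-learn Random Forest as sketched below. The toy columns are hypothetical: one column fully determines the label and the other two are noise, so the expected ranking is known in advance.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=500),   # drives the label
    "noise_1": rng.normal(size=500),  # unrelated to the label
    "noise_2": rng.normal(size=500),  # unrelated to the label
})
y = (X["signal"] > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, one per feature; they sum to 1.
importance = pd.Series(model.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```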

Limitation

We simply deleted the rows with missing values in the training set; there may be a better way to handle them.
After cleaning, we had only about 20,000 rows, and we use them to predict roughly 30,000 rows in the test set, which is questionable.
In feature engineering, we simply used PCA; there may be better ways to deal with correlated data (some health measurements describe essentially the same thing, which could be handled better than with PCA alone).
We could also try other classification models, such as a neural network, which may improve model performance.

Code
