Dataset: Kaggle Dataset
Software: Python
Dataset Description: Over 90,000 rows and 186 columns
Target: Hospital Death
Variables: fall into two types according to the stakeholder:
Hospital features: ICU floor...
Patient demographics and health metrics: age, ethnicity, blood pressure...
drop columns with more than 50% missing values
drop columns with variance less than 1
drop columns that hold a single constant value
readmission_status,gcs_unable_apache
drop identifier columns that carry no specific meaning
patient_id, icu_id, hospital_id
drop columns with no effect on the target
chi-square test
gender, ethnicity, aids..., as shown in the graph on the right
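The column filters above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names, the `protect` argument (which exempts the binary target from the variance filter) and the toy data are assumptions, and the significance level used to decide "no effect" is not stated in the source.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def drop_uninformative_columns(df, id_cols, protect=(), max_missing=0.5, min_variance=1.0):
    """Rule-based filters: identifiers, sparse, constant, and low-variance columns."""
    df = df.drop(columns=[c for c in id_cols if c in df.columns])
    # Columns where more than `max_missing` of the values are missing.
    sparse = df.isna().mean() > max_missing
    df = df.loc[:, ~sparse]
    # Columns holding at most one distinct non-missing value.
    constant = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]
    df = df.drop(columns=constant)
    # Numeric columns with variance below the threshold; `protect` exempts
    # columns such as the binary target (an assumption of this sketch).
    variances = df.select_dtypes(include="number").var()
    low_var = [c for c in variances.index
               if variances[c] < min_variance and c not in protect]
    return df.drop(columns=low_var)


def chi_square_p_value(df, feature, target):
    """p-value of a chi-square independence test between a categorical feature and the target."""
    table = pd.crosstab(df[feature], df[target])
    return chi2_contingency(table)[1]


# Made-up toy data, only to exercise the helpers.
toy = pd.DataFrame({
    "patient_id": range(8),
    "mostly_missing": [1.0, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 2.0],
    "readmission_status": [0] * 8,
    "age": [20, 35, 50, 65, 80, 45, 70, 30],
    "gender": ["M", "F", "M", "F", "M", "F", "M", "F"],
    "hospital_death": [0, 0, 0, 1, 1, 0, 1, 0],
})
cleaned = drop_uninformative_columns(toy, id_cols=["patient_id"], protect=["hospital_death"])
p_gender = chi_square_p_value(cleaned, "gender", "hospital_death")
print(list(cleaned.columns), p_gender)
```

A large p-value means the test finds no evidence that the feature and the target are related, which is the argument for dropping the column.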
PCA
dimension reduction
remove correlation between features
dummy variables
From PCA, we keep 26 components, which explain about 90% of the variance in the data
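A minimal sketch of picking the number of components by explained variance, using scikit-learn. The synthetic data and the standardization step are assumptions; the source does not say whether features were scaled before PCA.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for correlated numeric health metrics:
# 10 observed columns driven by 3 latent factors plus noise.
mixing = rng.normal(size=(3, 10))
X_train = rng.normal(size=(500, 3)) @ mixing + 0.1 * rng.normal(size=(500, 10))
X_test = rng.normal(size=(100, 3)) @ mixing + 0.1 * rng.normal(size=(100, 10))

# Fit the scaler and PCA on the training data only, then reuse them on the test data.
scaler = StandardScaler().fit(X_train)
# A float n_components keeps the smallest number of components whose
# cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.90, svd_solver="full").fit(scaler.transform(X_train))

train_components = pca.transform(scaler.transform(X_train))
test_components = pca.transform(scaler.transform(X_test))
print(pca.n_components_, pca.explained_variance_ratio_.sum())
```

The resulting training components are uncorrelated with each other, which is why PCA also addresses the correlation between the original columns.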
New dataset
about 20,000 rows and 61 columns: 26 principal components and 35 dummy variables.
the same techniques are applied to the test set
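One way to build the dummy variables and keep the test set's columns consistent with the training set is sketched below. The column name `icu_type` and its values are hypothetical placeholders, not taken from the source.

```python
import pandas as pd

# Hypothetical categorical column; the values are made up for illustration.
train = pd.DataFrame({"icu_type": ["MICU", "SICU", "MICU", "CCU"]})
test = pd.DataFrame({"icu_type": ["MICU", "Neuro"]})

train_dummies = pd.get_dummies(train, columns=["icu_type"], dtype=int)
# Reindex the test dummies to the training columns: categories missing from
# the test set become all-zero columns, categories unseen in training are dropped.
test_dummies = pd.get_dummies(test, columns=["icu_type"], dtype=int).reindex(
    columns=train_dummies.columns, fill_value=0
)
print(test_dummies)
```

Aligning on the training columns keeps the column count identical between the two sets, which a fitted model requires.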
fill missing numeric values with the column median
fill missing categorical values with the column mode
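The median and mode filling can be sketched with pandas as below. The helper name `fit_fill_values` and the toy frames are assumptions; the source does not say which frame the fill values were computed from, so this sketch computes them on one frame and reuses them on the other.

```python
import numpy as np
import pandas as pd


def fit_fill_values(frame):
    """Median for numeric columns, most frequent value for everything else."""
    fills = {}
    for col in frame.columns:
        if pd.api.types.is_numeric_dtype(frame[col]):
            fills[col] = frame[col].median()
        else:
            # mode() can return several values on ties; take the first.
            # This raises IndexError if the column is entirely missing.
            fills[col] = frame[col].mode(dropna=True).iloc[0]
    return fills


# Made-up data for illustration only.
train = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 60.0],
    "icu_type": ["MICU", None, "MICU", "SICU"],
})
test = pd.DataFrame({"age": [np.nan, 33.0], "icu_type": [None, "SICU"]})

fills = fit_fill_values(train)
train_filled = train.fillna(fills)
test_filled = test.fillna(fills)
print(test_filled)
```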
- split into training and test sets
- use SMOTE to balance the target variable
Before preparation
Most patients did not die in the hospital, so the target is imbalanced
After SMOTE
The training set is resampled so that the target classes are balanced, which should help improve model performance
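To show what SMOTE does, here is a simplified NumPy sketch of the idea: each synthetic minority sample is an interpolation between a minority sample and one of its nearest minority neighbours. It assumes a binary target and is not the implementation used in the project; the imbalanced-learn library offers a maintained `SMOTE` class for real use.

```python
import numpy as np


def smote_oversample(X, y, minority_label=1, k=5, seed=0):
    """Simplified SMOTE for a binary target: add synthetic minority rows until classes match."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X_min = X[y == minority_label]
    n_new = int((y != minority_label).sum() - len(X_min))
    if n_new <= 0 or len(X_min) < 2:
        return X, y
    k = min(k, len(X_min) - 1)
    # Pairwise distances among minority samples (quadratic memory, fine for a sketch);
    # the diagonal is set to infinity so a point is never its own neighbour.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]
    # Pick a random base sample and one of its k neighbours for each new row.
    base = rng.integers(0, len(X_min), size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    # Place the new point at a random position on the segment between the two.
    gap = rng.random((n_new, 1))
    synthetic = X_min[base] + gap * (X_min[chosen] - X_min[base])
    X_out = np.vstack([X, synthetic])
    y_out = np.concatenate([y, np.full(n_new, minority_label, dtype=y.dtype)])
    return X_out, y_out


# Synthetic imbalanced training data: 10 positives, 90 negatives.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(100, 4))
y_train = np.array([1] * 10 + [0] * 90)
X_bal, y_bal = smote_oversample(X_train, y_train)
print(np.bincount(y_train), np.bincount(y_bal))  # class counts before and after
```

Resampling is applied to the training portion only, so the test set keeps the original class distribution.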
Feature importance was computed for this model.
From the graph above, ventilated_apache and intubated_apache are among the most important features for patient outcome, which makes sense.
A patient who needs ventilation or intubation likely has a serious health problem, and therefore a higher probability of dying in the hospital.
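A feature-importance ranking of this kind can be produced as sketched below. The source does not name the model, so the random forest here is an assumption, and the data and column names are synthetic placeholders rather than the project's results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
# Synthetic features: one binary column that drives the target, two unrelated columns.
X = pd.DataFrame({
    "feature_signal": rng.integers(0, 2, size=n),
    "feature_noise_1": rng.integers(0, 2, size=n),
    "feature_noise_2": rng.normal(size=n),
})
# The target copies the signal column, with about 5% of labels flipped.
flip = rng.random(n) < 0.05
y = np.where(flip, 1 - X["feature_signal"], X["feature_signal"])

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
# Impurity-based importances; they sum to 1 across all features.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances tend to favour continuous or high-cardinality columns, so a ranking like this is best read alongside domain knowledge, as the slide does.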
We simply deleted the rows with missing values in the training set; there may be a better way to handle them.
After cleaning, we had only 20,000 rows, and we used them to predict 30,000 rows in the test set, which is hard to justify.
In feature engineering, we only used PCA; there may be better ways to handle correlated data (some health metrics essentially measure the same thing, and could be handled more carefully than by simply applying PCA).
We could still try other classification models, such as a neural network, which may improve model performance.