This machine learning project starts with an exploratory data analysis and then trains several binary classification algorithms (AdaBoost, Logistic Regression, Random Forest, XGBoost, CatBoost, Gradient Boosting, Decision Tree, KNN and a neural network), evaluating each one with its ROC AUC curve and confusion matrix. The goal is to detect, as part of a secondary intervention and based on certain variables, those employees who may develop burnout syndrome within a company. The project uses datasets from Kaggle and was built with Python and Visual Studio Code, using libraries such as Pandas, NumPy, Seaborn, Matplotlib, Scikit-learn, TensorFlow and Keras, among others.
NOTE: What follows is only part of a complete machine learning case study that involved a long process of feature engineering, data analysis and cleaning, and training of different models. The complete case study can be found on my GitHub (see section 7).
In our company, 60.00% of the employees are male and 40.00% are female.
The age distribution is fairly typical for a company: most of the people are between 30 and 40 years old.
The target variable is highly imbalanced. I am going to address this problem in the following steps.
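The write-up does not say here which balancing technique is used, so the following is only a minimal sketch of one common option: random oversampling of the minority class with scikit-learn's `resample`, applied to the training split only. The toy data and the column names `age` and `burnout` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy data (assumption): 90 "no burnout" rows (0) and 10 "burnout" rows (1).
df = pd.DataFrame({
    "age": list(range(20, 70)) * 2,
    "burnout": [0] * 90 + [1] * 10,
})

# Stratify so that both splits keep the original class ratio.
train, test = train_test_split(
    df, test_size=0.2, stratify=df["burnout"], random_state=42
)

majority = train[train["burnout"] == 0]
minority = train[train["burnout"] == 1]

# Oversample the minority class with replacement, in the training split only,
# so the test split still reflects the real class distribution.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
train_balanced = pd.concat([majority, minority_upsampled]).sample(
    frac=1, random_state=42  # shuffle the rows
)

print("before:", train["burnout"].value_counts().to_dict())
print("after: ", train_balanced["burnout"].value_counts().to_dict())
```

Resampling after the split keeps duplicated minority rows out of the test set, which would otherwise inflate the metrics. Class weights or synthetic oversampling are alternatives that may suit some models better.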
Now it's time to focus on the bivariate analysis (the target versus the other features). Having computed the feature importance (see below), I decided to visualize the relationship between the highest-scoring categorical variables and the target (burnout).
Statistical feature importance:
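As an illustration of how such a ranking can be computed, here is a small sketch that scores categorical features against the target with mutual information (`mutual_info_classif` from scikit-learn). The scoring function used in the original study is not stated, so this choice is an assumption, as are the synthetic data and the column names `overtime` and `department`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import OrdinalEncoder

rng = np.random.default_rng(0)
n = 500

# Synthetic data (assumption): burnout is made more likely when overtime is "Yes",
# while department is pure noise.
overtime = rng.choice(["Yes", "No"], size=n)
burnout = np.where(
    overtime == "Yes", rng.random(n) < 0.6, rng.random(n) < 0.1
).astype(int)
df = pd.DataFrame({
    "overtime": overtime,
    "department": rng.choice(["Sales", "R&D", "HR"], size=n),
    "burnout": burnout,
})

features = ["overtime", "department"]

# Encode categories as integer codes; discrete_features=True tells the scorer
# to treat them as categories rather than continuous values.
X = OrdinalEncoder().fit_transform(df[features])
scores = mutual_info_classif(
    X, df["burnout"], discrete_features=True, random_state=0
)

ranking = pd.Series(scores, index=features).sort_values(ascending=False)
print(ranking)
```

The features at the top of `ranking` are the natural candidates for the bivariate plots against the target.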
Heatmap of the numerical features, used to check for linear dependence between them.
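A heatmap of this kind can be produced roughly as follows. This is a sketch on synthetic data, assuming Pearson correlation and Seaborn's `heatmap`. The column names are placeholders, not the dataset's real ones.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)

# Synthetic numerical features (assumption): working years grow with age,
# income is independent of both.
age = rng.integers(20, 60, size=200)
df = pd.DataFrame({
    "age": age,
    "total_working_years": age - 20 + rng.integers(0, 3, size=200),
    "monthly_income": rng.normal(5000, 1500, size=200),
})

# Pearson correlation measures linear dependence between pairs of columns.
corr = df.corr(method="pearson")

fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
fig.tight_layout()
```

Fixing `vmin=-1` and `vmax=1` keeps the colour scale comparable across heatmaps. Call `fig.savefig(...)` to write the figure to disk.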
The following heatmap also takes into account the correlation between the categorical features using PhiK correlation.
There is linear dependence between:
age & job level; age & total working years; department & monthly income; job level & job role; job level & monthly income; job level & total working years; job level & years at company; job role & monthly income!!!; monthly income & total working years; total working years & years at company; years at company & years since last promotion; years at company & years with current manager.
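Rather than reading such pairs off the heatmap by eye, they can also be extracted from the correlation matrix. The helper `correlated_pairs` below is hypothetical, and the 0.7 threshold is an assumption, since the study does not state a cutoff. The synthetic columns only imitate the kind of relationship listed above.

```python
import numpy as np
import pandas as pd


def correlated_pairs(corr: pd.DataFrame, threshold: float = 0.7) -> pd.Series:
    """Return feature pairs whose absolute correlation is at least `threshold`."""
    # Keep only the upper triangle (k=1 excludes the diagonal) so that each
    # pair appears once and self-correlations are dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    strong = pairs[pairs.abs() >= threshold]
    return strong.sort_values(key=np.abs, ascending=False)


rng = np.random.default_rng(1)
job_level = rng.integers(1, 6, size=300)

# Synthetic data (assumption): income is driven by job level,
# distance from home is unrelated to both.
df = pd.DataFrame({
    "job_level": job_level,
    "monthly_income": job_level * 2000 + rng.normal(0, 500, size=300),
    "distance_from_home": rng.integers(1, 30, size=300),
})

pairs = correlated_pairs(df.corr(), threshold=0.7)
print(pairs)
```

The same function works on any square correlation matrix, so it could also be applied to a PhiK matrix that includes categorical features.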
Once the EDA, feature engineering, training and analysis of the model metrics are done, it is time to decide which model classifies employees best. In this case, the model with the best ROC curve and confusion matrix is AdaBoost, since it correctly classifies 41 employees and misclassifies only 10. Consequently, the AdaBoost classifier is the model that will be sent to production.
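To make the evaluation step concrete, here is a minimal sketch of training an AdaBoost classifier and computing its confusion matrix and ROC AUC with scikit-learn. It runs on synthetic imbalanced data from `make_classification`, so the numbers it prints have nothing to do with the results reported above, and the hyperparameters are placeholders rather than the ones used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (assumption): about 15% positives.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.85, 0.15], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = AdaBoostClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)                # hard labels for the confusion matrix
y_score = model.predict_proba(X_test)[:, 1]   # positive-class scores for ROC AUC

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
auc = roc_auc_score(y_test, y_score)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"ROC AUC = {auc:.3f}")
```

With an imbalanced target, the false negatives (employees with burnout that the model misses) deserve at least as much attention as overall accuracy when comparing models.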
In this image we can see the model correctly classifying an employee (with these characteristics) as suffering from burnout.
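Scoring a single employee could look like the sketch below. It assumes a scikit-learn `Pipeline` that bundles the preprocessing with the classifier, which is one reasonable way to ship a model but not necessarily the way the original project does it. The features `age` and `overtime` and the synthetic training data are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 400

# Synthetic training data (assumption): overtime raises the burnout probability.
train = pd.DataFrame({
    "age": rng.integers(20, 60, size=n),
    "overtime": rng.choice(["Yes", "No"], size=n),
})
p_burnout = np.where(train["overtime"] == "Yes", 0.8, 0.1)
train["burnout"] = (rng.random(n) < p_burnout).astype(int)

# One-hot encode the categorical column and pass the numeric one through.
preprocess = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["overtime"])],
    remainder="passthrough",
)
clf = Pipeline([
    ("prep", preprocess),
    ("model", AdaBoostClassifier(random_state=0)),
])
clf.fit(train[["age", "overtime"]], train["burnout"])

# A single new employee, given as a one-row DataFrame with the training columns.
employee = pd.DataFrame([{"age": 35, "overtime": "Yes"}])
label = int(clf.predict(employee)[0])
proba = float(clf.predict_proba(employee)[0, 1])

print(f"predicted burnout label: {label}, estimated probability: {proba:.2f}")
```

Keeping the encoder inside the pipeline means the new employee's raw characteristics pass through exactly the same transformations as the training data.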