The goals of this project are to analyze the data, predict whether an employee will leave the company, and identify the factors that contribute to their leaving.
Because the outcome (whether an employee leaves the company) is a categorical variable with two values, this is a binary classification task. The most appropriate models for it are a binomial logistic regression model or tree-based machine learning models (decision tree and random forest). The plan is to implement all three and see how they compare.
The dataset was obtained from the Google Advanced Data Analytics Certificate Program on Coursera. The project was done in Python. Here is the code.
The dataset consists of 14,999 rows and 10 columns. It contains no missing values but has 3,008 duplicate rows, about 20% of the data; these duplicates were removed. The tenure column contains 824 outliers. The columns were renamed. Three columns (last_evaluation, number_project and average_monthly_hours) are highly correlated. The target variable has an approximately 83%-17% split, which is not perfectly balanced but not too imbalanced. The categorical variables were encoded after exploring the data.
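The cleaning checks described above can be sketched as follows. This is a minimal illustration on invented toy data, not the project's actual code: the text does not say how the tenure outliers were identified, so the 1.5 * IQR rule used here is an assumption, and only the column names tenure and left come from the write-up.

```python
import pandas as pd

# Toy stand-in for the HR dataset; all values are invented for illustration.
df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.38, 0.72, 0.37, 0.41, 0.92],
    "tenure":             [3,    6,    4,    3,    5,    3,    2,    10],
    "left":               [1,    1,    1,    1,    0,    0,    0,    0],
})

# Count exact duplicate rows, then drop them, keeping the first occurrence.
n_duplicates = int(df.duplicated().sum())
df_clean = df.drop_duplicates(keep="first").reset_index(drop=True)

# Flag tenure outliers with the 1.5 * IQR rule (an assumed method).
q1 = df_clean["tenure"].quantile(0.25)
q3 = df_clean["tenure"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (df_clean["tenure"] < lower) | (df_clean["tenure"] > upper)
n_outliers = int(is_outlier.sum())

# Class balance of the target after deduplication, as proportions.
class_balance = df_clean["left"].value_counts(normalize=True)

print(n_duplicates, n_outliers)
print(class_balance)
```

On the real data, the same three checks would be expected to report the duplicate count, the outlier count and the class split quoted above, provided the outlier rule matches the one the project used.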
Employees who worked on more projects also worked longer hours. All employees who worked on seven projects left the company. The optimal number of projects for an employee to work on appears to be 3-4.
Some employees who worked longer hours had satisfaction levels close to zero, some who worked normal hours had satisfaction levels around 0.4, and some who worked longer hours had satisfaction levels ranging from 0.7 to 0.9.
Four-year employees who left seem to have an unusually low satisfaction level. The longest-tenured employees did not leave, and their satisfaction levels were in line with those of newer employees who stayed. There are relatively few longer-tenured employees.
The mean and median satisfaction scores of employees who left are lower than those of employees who stayed.
Employees who were overworked and performed very well left. Employees who worked slightly under the nominal monthly average of 166.67 hours and had lower evaluation scores also left. Only a small percentage of employees who worked under the nominal monthly average had high evaluation scores, but working long hours does not guarantee a good evaluation score. Most of the employees work well over 167 hours per month.
Very few employees who were promoted in the last five years left. Very few employees who worked the most hours were promoted. Most of the employees who left were working the longest hours and were not promoted.
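The text does not say where the nominal monthly average of 166.67 hours comes from. One schedule that reproduces it, offered here purely as an assumption, is 50 working weeks per year at 40 hours per week, spread evenly over 12 months:

```python
# Assumed work schedule (not stated in the text): 50 working weeks per year
# at 40 hours per week, averaged over 12 months.
WEEKS_PER_YEAR = 50
HOURS_PER_WEEK = 40
MONTHS_PER_YEAR = 12

nominal_monthly_hours = WEEKS_PER_YEAR * HOURS_PER_WEEK / MONTHS_PER_YEAR
print(round(nominal_monthly_hours, 2))  # 166.67
```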
For the binomial logistic regression, the data did not meet two of the model's assumptions: no extreme outliers, and little to no multicollinearity. The outliers were removed, and the data still maintained an approximately 83%-17% split. The dataset was split into a 75% training set and a 25% testing set. The model achieved a precision of 79%, recall of 82%, F1-score of 80% (all weighted averages), and accuracy of 82%. However, for employees who would leave, the model achieved a precision of only 44%, recall of 27%, and F1-score of 33%. These scores are considerably lower.
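The gap between the weighted averages and the scores for employees who leave is easier to see when both are read from the same report. The sketch below uses scikit-learn on synthetic data from make_classification, generated with roughly the same 83%-17% split; it is not the project's code or data, and the stratified split and max_iter value are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly an 83%-17% class split.
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=4,
    weights=[0.83, 0.17], random_state=42,
)

# 75% / 25% split; stratify keeps the class ratio similar in both sets
# (stratification is an assumption, the text does not mention it).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

clf = LogisticRegression(max_iter=500, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# output_dict=True exposes per-class and weighted-average scores by key.
report = classification_report(y_test, y_pred, output_dict=True)
leavers = report["1"]              # scores for the positive class only
weighted = report["weighted avg"]  # scores averaged by class support

print(f"class 1 recall:  {leavers['recall']:.2f}")
print(f"weighted recall: {weighted['recall']:.2f}")
```

Weighted-average recall is mathematically equal to accuracy, which is consistent with both being reported as 82%. Because the majority class dominates the weighted averages, the per-class row is the one to watch when the minority class is the target of interest.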
For the tree-based models, the dataset was also split into a 75% training set and a 25% testing set. Cross-validation was used to evaluate the performance of the models, and AUC was the deciding metric for selecting the champion model. Both models had strong AUC scores in the first round, but the random forest outperformed the decision tree. The other evaluation scores of the random forest model were also better than those of the decision tree model, with the exception of recall, which was slightly lower. The random forest model was then evaluated on the test data and scored very well.
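The selection procedure described above can be sketched with scikit-learn's GridSearchCV, scoring several metrics but refitting on AUC. The data is synthetic, and the hyperparameter grids, fold count and random seeds are illustrative assumptions, not the values the project used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with roughly an 83%-17% class split.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.83, 0.17], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

scoring = ["roc_auc", "precision", "recall", "f1", "accuracy"]
# Hypothetical grids, kept small so the sketch runs quickly.
candidates = {
    "decision_tree": (
        DecisionTreeClassifier(random_state=0),
        {"max_depth": [4, 6, None], "min_samples_leaf": [1, 5]},
    ),
    "random_forest": (
        RandomForestClassifier(n_estimators=50, random_state=0),
        {"max_depth": [4, None], "min_samples_leaf": [1, 5]},
    ),
}

searches, cv_auc = {}, {}
for name, (estimator, grid) in candidates.items():
    # refit="roc_auc" makes AUC the metric that picks the best parameters.
    search = GridSearchCV(estimator, grid, scoring=scoring, cv=4, refit="roc_auc")
    search.fit(X_train, y_train)
    searches[name] = search
    cv_auc[name] = search.best_score_  # mean cross-validated AUC

# The champion is chosen on cross-validated AUC only.
champion_name = max(cv_auc, key=cv_auc.get)
champion = searches[champion_name].best_estimator_

# The held-out test set is used once, after the champion is chosen.
test_auc = roc_auc_score(y_test, champion.predict_proba(X_test)[:, 1])
print(champion_name, cv_auc, round(test_auc, 3))
```

Keeping the test set out of the comparison means the final score is an estimate on data that played no part in model selection.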
On the assumption that some data leakage might be occurring, a feature was dropped and another feature was engineered to indicate whether an employee was overworked. The evaluation scores dropped slightly in this round but were still very good. The random forest again outperformed the decision tree, with AUC as the deciding metric. Finally, the champion model was evaluated on the test data and achieved a precision of 87.1%, recall of 90.9%, F1 of 88.9%, accuracy of 96.2% and AUC of 94.1%. These scores indicate a stable, well-performing model.
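A feature-engineering step of this kind might look like the sketch below. The text does not name the dropped feature or the cutoff for being overworked, so the add_overworked_flag helper, the 175-hour threshold, the average_monthly_hours column name and the choice of satisfaction_level as the dropped column are all hypothetical.

```python
import pandas as pd

def add_overworked_flag(df, hours_col="average_monthly_hours",
                        threshold=175, drop_cols=()):
    """Return a copy of df with a binary `overworked` column added.

    threshold and drop_cols are illustrative parameters; the write-up
    does not state the cutoff or which feature was dropped.
    """
    out = df.drop(columns=list(drop_cols))  # drop() returns a new frame
    out["overworked"] = (out[hours_col] > threshold).astype(int)
    return out

# Toy rows with invented values.
toy = pd.DataFrame({
    "average_monthly_hours": [150, 260, 175, 201],
    "satisfaction_level":    [0.60, 0.10, 0.45, 0.82],
})

result = add_overworked_flag(toy, drop_cols=["satisfaction_level"])
print(result)
```

Because the helper works on a copy, the original frame is left unchanged and the step can be rerun with a different threshold.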
The plot below shows that number_project, last_evaluation, tenure, and overworked have the highest importance, in that order. These variables are most helpful in predicting the outcome variable, left.
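A ranking like this is typically read from the fitted model's impurity-based importances. The following is a minimal sketch on synthetic data: the four feature names are borrowed from the text only as labels, so the ordering it prints says nothing about the real dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Names borrowed from the text as labels; the values are synthetic.
feature_names = ["number_project", "last_evaluation", "tenure", "overworked"]
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0,
    random_state=1,
)

rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# Impurity-based importances, sorted from most to least important.
importances = (
    pd.Series(rf.feature_importances_, index=feature_names)
    .sort_values(ascending=False)
)
print(importances)
```

Impurity-based importances are normalized to sum to 1, so each value can be read as a share of the model's total split quality.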
To retain employees, the following recommendations could be presented to the stakeholders:
Cap the number of projects that employees can work on.
Consider a proportionate scale for rewarding employees, and do not reserve high evaluation scores for employees who work longer hours.
Consider promoting employees who have been with the company for at least four years, or investigate further why four-year employees are so dissatisfied.
Either reward employees for working longer hours, or do not require them to do so.
Hold company-wide and within-team discussions to understand and address the company's work culture, both across the board and in specific contexts.