Predicting Employee Attrition (A Classification Problem Where Recall Matters)


Background and Objective:


This project uses a fictional employee dataset to help determine an employee's likelihood of attrition. The dataset was prepared by a group of IBM data scientists and can be found here. I will explore how various factors may contribute to employee attrition, including an employee's distance from home, relationship satisfaction, and working overtime, and I will create two classification models to predict employee attrition: logistic regression and support vector machine (SVM). This project will also explore precision-recall curves to help determine the probability threshold for each algorithm that provides the best balance between precision and recall, keeping in mind that recall is a particularly useful metric for a company trying to predict attrition among its employees.

Import the necessary libraries
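A typical set of imports for this kind of analysis, plus the data load (the CSV file name is assumed; adjust the path as needed):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import (classification_report, confusion_matrix,
                             precision_recall_curve, recall_score)

# Load the dataset and check for missing values (file name assumed).
df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
df.info()
```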

There are 1470 entries, and all columns have non-null values, indicating no missing data.

Dataset:

The dataset contains the following 35 columns of information regarding our employees:

EmployeeCount, Over18, and StandardHours all have the same value for every employee and therefore will serve little purpose for prediction. We will therefore drop these columns.

EmployeeNumber appears to contain unique identifiers (there are 1470 unique numbers for 1470 entries). Let's use it as our index.

For columns with only two categorical values (Attrition, Gender, OverTime), let's use a label encoder to convert them into 0s and 1s, as sketched below.
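A sketch of these three preparation steps together (assuming the dataframe is named df):

```python
# Drop the constant columns, index by EmployeeNumber, and label-encode
# the binary categorical columns.
df = df.drop(columns=["EmployeeCount", "Over18", "StandardHours"])
df = df.set_index("EmployeeNumber")

le = LabelEncoder()
for col in ["Attrition", "Gender", "OverTime"]:
    df[col] = le.fit_transform(df[col])
```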

Statistical observations on our numerical data:
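These observations are based on a summary along the following lines:

```python
# Summary statistics for the numerical columns, transposed for readability.
df.describe().T
```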

The data appears plausible: the max and min values do not reflect any impossibilities or extreme outliers. The mean and median age are nearly the same, and the company does not appear to have many older employees; the maximum age is 60, while the 75th-percentile age is only 43. The median distance traveled for work is 7 km and the largest is 29 km, indicating that the employees don't traverse particularly large distances. The salary range runs from what might be part-time employees at the lower end (1009/month) to an executive salary (19999/month).

Here is a visualization of the distribution of our numerical values:
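A minimal sketch of such a plot (figure size and bin count are arbitrary choices):

```python
# Histograms of every numerical feature.
df.hist(figsize=(20, 16), bins=20)
plt.tight_layout()
plt.show()
```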

Age, education, and job involvement appear to be approximately normally distributed. Some of our features are right-skewed, including distance from home (indicating that a majority of workers live nearer to work), monthly income, and years at the company. Our unexplained "monthly rate" and "hourly rate" values appear roughly uniform across the bins of their respective histograms, and these distributions do not match the distribution of any of our other variables (e.g., monthly income, which is right-skewed). It is therefore not clear whether these 'rates' are really related to anything else (e.g., income).

Visualizing the distribution of some of our categorical variables:
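One way to produce these plots (the column selection here is illustrative):

```python
# Count plots for a few categorical features.
cat_cols = ["Department", "MaritalStatus", "OverTime", "BusinessTravel"]
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for ax, col in zip(axes.ravel(), cat_cols):
    sns.countplot(x=col, data=df, ax=ax)
    ax.tick_params(axis="x", rotation=30)
plt.tight_layout()
plt.show()
```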

We see that the largest number of employees work in research and development, followed by sales. Most are married, which is consistent with the median age observed earlier. Most employees do not work overtime, but a sizeable number do. We also see that most employees rarely travel.

Let's now see a correlation matrix to see how attrition is connected to other variables.
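A sketch of the heatmap:

```python
# Correlation matrix of the numerical columns, visualized as a heatmap.
plt.figure(figsize=(16, 12))
sns.heatmap(df.corr(numeric_only=True), cmap="coolwarm", center=0)
plt.show()
```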

Some variables do have noteworthy correlations with attrition. Working overtime, for example, has some positive correlation with attrition. There are negative correlations between total working years and attrition, monthly income and attrition, job level and attrition, and age and attrition. The positive correlation between working overtime and attrition, along with the negative correlations noted for the other variables, makes sense. For example, having a lower salary would make one less committed to remaining with a company, as would being a relatively newer employee (i.e., having fewer total working years). Let's see how attrition manifests in some of these variables:

Visualizing how attrition plays out with select variables:
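These plots can be produced along the following lines (the feature list is illustrative):

```python
# Stack attrition (the hue) within the distribution of selected features.
for col in ["OverTime", "JobLevel", "MonthlyIncome", "Age"]:
    plt.figure(figsize=(10, 4))
    sns.histplot(data=df, x=col, hue="Attrition", multiple="stack")
    plt.show()
```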

Our visualizations confirm our previous comments about the positive correlation between attrition and working overtime, and the negative correlation between attrition and the other variables noted. Note that the x-axis labels for monthly income are not visible for the obvious reason that this continuous variable has many unique values, but the visualization shows how the orange coloring (representing attrition) is more concentrated in the lower monthly incomes.

The average values for those who attrite (1) and those who do not (0), separated out by variable:
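These averages come from a groupby along these lines:

```python
# Mean of every numerical feature, grouped by attrition status (0 / 1).
df.groupby("Attrition").mean(numeric_only=True).T
```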

It is clear from these averaged values that the two sets of employees differ on many of these variables. For example, those who attrite have, ON AVERAGE, lower stock option levels, are more likely to work overtime, have a lower monthly income, are younger, live slightly farther from work, and are slightly less satisfied with their job, environment, relationships, and work-life balance. They have also worked at the company for less time, on average.

Building Our Models and Tuning for Improved Recall

Preparing data for modeling

Some of our categorical variables, where we do not want to imply ordinality, should be expanded into dummy (one-hot) columns with get_dummies, including BusinessTravel, Department, Education, EducationField, JobInvolvement, JobLevel, JobRole, and MaritalStatus.
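A sketch of the encoding step:

```python
# One-hot encode the nominal (and treated-as-nominal) categorical columns.
dummy_cols = ["BusinessTravel", "Department", "Education", "EducationField",
              "JobInvolvement", "JobLevel", "JobRole", "MaritalStatus"]
df = pd.get_dummies(df, columns=dummy_cols)
```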

Separating our independent (X) and dependent (Y) variables
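In code:

```python
# The target is Attrition; everything else is a predictor.
Y = df["Attrition"]
X = df.drop(columns=["Attrition"])
```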

Because our independent variables are on different scales (e.g., the range of income values is much larger than our gender values, which are only 0 and 1), our algorithm may incorrectly give weight to variables simply because of their larger magnitude. So we will scale all of our non-categorical values with the StandardScaler class, which transforms each feature to have mean 0 and standard deviation 1.
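A sketch of the scaling step (the list of continuous columns is an assumption; adjust it to the columns kept above):

```python
# Scale only the continuous columns to mean 0 and standard deviation 1.
num_cols = ["Age", "DailyRate", "DistanceFromHome", "HourlyRate",
            "MonthlyIncome", "MonthlyRate", "NumCompaniesWorked",
            "PercentSalaryHike", "TotalWorkingYears", "TrainingTimesLastYear",
            "YearsAtCompany", "YearsInCurrentRole",
            "YearsSinceLastPromotion", "YearsWithCurrManager"]
scaler = StandardScaler()
X[num_cols] = scaler.fit_transform(X[num_cols])
```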

Splitting into Train/Test Data [80%/20%]

In classification problems like this one, our train and test data need to be appropriately sampled or we may end up with an imbalance in the distribution of classes between them. One way of addressing this is stratified sampling, which will be passed as an argument to the train_test_split function below. For more information, see here. Note that while stratified sampling ensures similar class proportions in the train and test data, it doesn't solve the problem of major imbalances in the representation of classes in the dataset as a whole.

While roughly 20% attrition is clearly a minority class, it is not an extreme case of imbalance, so for the purposes of this study we will not rebalance the "Attrite" and "Not Attrite" classes with the available resampling methods.
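The split itself (random_state is an arbitrary choice for reproducibility):

```python
# 80/20 split, stratified on Y so both sets keep the same attrition proportion.
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.20, stratify=Y, random_state=1)
```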

Moving on to the data at hand, it's important that our metrics align with our objectives. In this classification case, we especially want to reduce false negatives, i.e., the times the model says an employee won't attrite when in fact they do. In a real-life situation, the opposite mistake (predicting someone will attrite when they actually do not) is less costly. Because we want to reduce false negatives, we have to focus on improving our recall metric. Recall measures the proportion of employees who actually attrited that the model correctly flagged as attriting. The more actual attriters the model misses, the more false negatives and the lower the recall; the more it catches, the fewer false negatives and the higher the recall, which is what we want. Our evaluation metrics will therefore take recall into account.
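In confusion-matrix terms, recall = TP / (TP + FN). A toy example with made-up labels:

```python
# Four employees actually attrite (1); the model catches three of them,
# so recall = 3 / (3 + 1) = 0.75.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
print(recall_score(y_true, y_pred))  # 0.75
```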

Building the models

I will build and compare two different models:

Logistic Regression Model
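A sketch of the fit (max_iter is a reasonable default, not necessarily the exact configuration used here):

```python
# Fit the logistic regression and print classification reports for both sets.
lg = LogisticRegression(max_iter=1000)
lg.fit(X_train, y_train)

print(classification_report(y_train, lg.predict(X_train)))
print(classification_report(y_test, lg.predict(X_test)))
```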

Here is an explanation of the values in the printed classification report:

Regarding our confusion matrix:
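The matrix can be drawn as an annotated heatmap, e.g.:

```python
# Confusion matrix for the test set; rows are actual, columns are predicted.
cm = confusion_matrix(y_test, lg.predict(X_test))
sns.heatmap(cm, annot=True, fmt="d",
            xticklabels=["Not Attrite", "Attrite"],
            yticklabels=["Not Attrite", "Attrite"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
```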

Observations based on our results:

Based on the above metrics, we see that our train and test predictions have a similar accuracy of close to 90%. However, the recall for class 1 (will attrite) is only about 50% on our train data, and less than 40% on our test data. This means we have a sizeable number of false negatives, and thus the model is not satisfactory: there are many employees who will attrite that this model cannot catch, even though the overall accuracy appears to be good. As one can see from the heat map, there is a sizeable number of false negatives that needs to be reduced (predicted: "Not Attrite"; actual: "Attrite"). This can be done by responsibly improving recall.

We now refer to the coef_ attribute, which gives us the variables most significant for classifying our y variable of attrition.
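One way to rank the features by coefficient:

```python
# Pair each coefficient with its feature name and sort.
coefs = pd.Series(lg.coef_[0], index=X.columns).sort_values(ascending=False)
print(coefs)
```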

Based on the above, we find that the features that most positively affect the likelihood of an employee leaving the company are those listed at the top (e.g., overtime, having a job involvement score of 1, a job level of 5, highly frequent travel for the company, etc.). Those with a negative effect are listed at the bottom (e.g., having a job level of 2, a stock option level of 2, not traveling for the business, having a job level of 4, etc.). Overtime is the most important feature driving attrition in our model. We also see that having a stock option level of 0 appears to have a significant impact. The company may want to explore its overtime and stock option policies for employees. Being a newer employee (job level of 1) would also be a flag, as would having worked at many companies before. These values are consistent with the correlation matrix we observed earlier, as well as with our own intuition.

The coefficients obtained above for our logistic regression model are log-odds. Exponentiating them converts them to odds ratios:
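In code:

```python
# Exponentiate the log-odds coefficients to obtain odds ratios.
odds = np.exp(coefs).sort_values(ascending=False)
print(odds)
```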

We can read the values above as telling us how many times more likely someone is to attrite if they have a given feature. E.g., an employee working overtime is 6.67 times more likely to attrite than someone who does not. Someone whose relationship satisfaction is scored as a 1 is 1.87 times more likely to attrite than someone with a different score, etc.

Balancing Precision and Recall for the Best Predictions (Tuning Our Predictive Threshold)

The logistic regression model determines the probability of each instance being labeled 'attrite' or not. By default the threshold is 0.5, meaning that whenever the probability of an instance being 'attrite' is 0.5 or higher, the model labels it as such. In cases of class imbalance (as we noticed in our data earlier), this default threshold may be a poor choice, and adjusting it may improve our metric scores.

One of the tools to help us achieve this is the precision-recall curve, which lets us see the tradeoff between precision and recall at different probability thresholds for a single class (in our case, "attrite"/1).

This is different from an ROC curve (ROC = "receiver operating characteristic"), which instead considers the false positive and true positive rate of the model.
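A sketch of the curve construction with scikit-learn, plotted against the thresholds to make the trade-off easier to read:

```python
# Precision-recall trade-off for the positive class (attrite = 1) on train data.
y_scores = lg.predict_proba(X_train)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_train, y_scores)

# The precision/recall arrays have one more entry than thresholds; drop the last.
plt.plot(thresholds, precisions[:-1], label="precision")
plt.plot(thresholds, recalls[:-1], label="recall")
plt.xlabel("Probability threshold")
plt.legend()
plt.show()
```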

In the curve, a threshold of 0 would mean every instance is classified as attrition by the algorithm (since anything with over 0% probability of being a case of attrition would be labeled as such). This would of course result in poor precision (a little under 0.2 in the chart above), but not exactly 0 precision, since the predictions would still be correct for the minority of employees who actually attrite. It would also mean the lowest possible number of false negatives, since every instance would be labeled 'attrite', so our recall would be perfect, 1.0. But note that even though recall would be perfect here, the predictions would still be poor overall because the precision is so bad.

On the other hand, a very high threshold approaching 1.0 would mean that only instances meeting a stringent probability bar get mapped to the attrition class. Our recall would head toward 0 as the threshold rises, since the majority of instances would be labeled 'no attrite' given the higher bar to be deemed 'attrite'. This would result in very high numbers of false negatives, and thus very low recall. Conversely, precision would go up as the threshold rises, because precision is concerned with reducing false positives. However, beyond a certain point, raising the threshold no longer improves precision.

We ultimately want to set our threshold at a point where precision and recall are balanced, and according to the above chart, that is around 0.38. Let's see how it impacts the model:
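Applying the tuned threshold is just a comparison against the predicted probabilities:

```python
# Re-label the training predictions using the tuned threshold of 0.38.
threshold = 0.38
y_train_pred = (lg.predict_proba(X_train)[:, 1] >= threshold).astype(int)
print(classification_report(y_train, y_train_pred))
```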

The recall has improved for class 1, as can be seen in the heatmap. As expected, the precision has gone down (i.e., more false positives).

Now let's see the results on our test data:
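The same comparison on the held-out data:

```python
# Apply the same 0.38 threshold to the test set.
y_test_pred = (lg.predict_proba(X_test)[:, 1] >= threshold).astype(int)
print(classification_report(y_test, y_test_pred))
```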

Similarly, on our test data we see an improvement in recall and a reduction in precision with our new threshold of 0.38.

SVM Model
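A sketch of the SVM fit (the RBF kernel is an assumption; probability=True enables the predict_proba calls needed for threshold tuning later):

```python
# Fit an SVM classifier and report its out-of-the-box performance.
svm = SVC(kernel="rbf", probability=True, random_state=1)
svm.fit(X_train, y_train)

print(classification_report(y_train, svm.predict(X_train)))
print(classification_report(y_test, svm.predict(X_test)))
```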

For our class of 1, the SVM produced slightly better training results than our logistic regression model with the threshold left unchanged.

Our SVM model has similar accuracy on the train and test data, and thus there is not a big overfitting problem. Its recall on the test data is better than the logistic regression model's with the original threshold, so we are moving in the direction of our original objective of reducing false negatives that would be costly for the company. Let us determine an optimal threshold for this model as we did before:
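The procedure mirrors the one above: draw the precision-recall curve from svm.predict_proba, pick the crossover threshold, and re-label. With the 0.25 threshold referenced in the conclusion below:

```python
# Apply the tuned SVM threshold to the test set.
y_test_pred_svm = (svm.predict_proba(X_test)[:, 1] >= 0.25).astype(int)
print(classification_report(y_test, y_test_pred_svm))
```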

Conclusion

Compared to our baseline-threshold model, the new threshold of 0.25 on our SVM model has improved our recall substantially (though with some reduction in precision, as expected). Based on a discussion with company stakeholders, the SVM model may save the company money by identifying those who will likely attrite (to whom further resources can be allocated to keep them as employees), and it reduces false negatives the most of all the models we looked at. Therefore, we recommend the SVM with the given threshold as a starting point for the company to identify those at risk of attrition. Perhaps just as importantly, the discussion above of features and their respective coefficients derived from our logistic regression model has helped us identify several key factors that appear correlated with an employee leaving the company, including working overtime, traveling frequently for work, how new an employee is to the company, an employee's monthly income, and which department they work in, among others. These can all be addressed in different ways through new policies, including:

Reducing the need for employees to work overtime; figuring out how employees can accomplish their tasks locally without having to travel extensively; developing programs to help newer employees integrate into a positive work environment at the company; ensuring salaries are competitive where possible; and investigating the culture of departments with greater employee attrition.