Here we make use of the HR Analytics dataset available on Kaggle. The dataset was created to help understand the factors behind employee attrition and can be used to train a model for predicting employee churn.
The Python code for the two models is on GitHub.
We can start by importing the required libraries and reading the HR dataset CSV file into a pandas DataFrame.
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
csv_location = 'HR_capstone_dataset.csv'
df = pd.read_csv(csv_location)
First, we should check that the data imported into the pandas DataFrame looks good:
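A quick sanity check might look like this:
# Preview the first few rows and confirm the overall shape
print(df.head())
print(df.shape) # expect (14999, 10) for this dataset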
We can also run df.columns to view the columns of the data frame and df.values to display the values of the data frame as an array.
The pandas describe() method gives us descriptive statistics on the data frame. Passing include='all' makes it return all columns rather than the default of numeric columns only, though it displays NaN (Not a Number) for statistics that don't apply to non-numeric columns.
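For example:
# Descriptive statistics for every column, not just the numeric ones
print(df.describe(include='all'))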
Here we can verify that all columns have the same count of 14999, so there appears to be no missing data. We can also compare the standard deviation (std) to the mean to get an idea of the variance in the data. For example:
Mean average_monthly_hours = 201 hours
1 standard deviation average monthly hours = 49.94 hours
Therefore we can infer that:
Approx. 68% of employees work 201 ± 49.94 hours, i.e. roughly 151 to 251 hours a month.
Approx. 95% of employees work 201 ± 100 hours (rounded), i.e. roughly 101 to 301 hours a month.
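These ranges can also be computed directly from the data frame (a minimal sketch; the exact spelling of the hours column is assumed here and may differ in the actual CSV):
# Empirical rule (68%/95%) ranges for monthly hours
hours = df['average_monthly_hours'] # column name assumed from the text
mean, std = hours.mean(), hours.std()
print(f"68% range: {mean - std:.0f} to {mean + std:.0f} hours")
print(f"95% range: {mean - 2*std:.0f} to {mean + 2*std:.0f} hours")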
We need to check that the data types are in the correct format before we build a model from them.
The dependent variable that we want to predict is the 'left' field, which is a binary field of 0 or 1. A value of 1 means the employee left the company (attrition).
There are 9 other columns which make up the independent variables:
There are 2 floats, 5 integers, and 2 object data types.
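We can confirm the data types with a quick check:
# List the data type of each column
print(df.dtypes)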
The salary field is categorical and includes 3 levels:
low, medium, and high
Because there is an inherent order, we can use ordinal encoding to convert it to an integer:
salary_map = {'low': 0, 'medium': 1, 'high': 2}
df['salary'] = df['salary'].map(salary_map)
The Department field consists of non-ordered categories:
Accounting, HR, IT, management, marketing, product_mng, RandD, sales.
We use one-hot encoding on this field, which creates 8 new binary fields, one for each department name. The code is as follows:
df = pd.get_dummies(df, columns=['Department'])
This results in many more columns in the data frame, but now we're ready to build a model.
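To verify the encoding, we can list the new columns (by default, get_dummies prefixes each new field with the original column name):
# The Department column is replaced by one binary Department_* field per category
print([col for col in df.columns if col.startswith('Department_')])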
Building the Random Forest model
The steps for building the model are as follows:
1. Split the dataset into X and y variables, where X consists of all the independent variable fields and y contains only the dependent variable (the 'left' field).
2. Split the X and y dataframes into training and test sets. The test set is a randomized 20% of the data.
3. Initialize the random forest classifier
4. Train the model with the training data (X_train and y_train). This represents 80% of the data.
5. Make predictions from the X_test set. The model tries to predict the actual values that are in the y_test set.
6. Measure the accuracy by comparing the predictions of y (y_pred) with the actual y_test values.
# Split the dataset into features (X) and target variable (y)
X = df.drop('left', axis=1) # Features
y = df['left'] # Target variable
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the Random Forest classifier with 100 trees
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
# Train the classifier on the training data
rf_classifier.fit(X_train, y_train)
# Make predictions on the test data
y_pred = rf_classifier.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
The resulting accuracy is 0.9887, which represents roughly 99% accuracy.
If we re-run the code it gives similar results. The model is almost perfect!
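Accuracy alone can be misleading when the target classes are imbalanced, so it may be worth cross-checking with a confusion matrix and per-class metrics. A minimal sketch using scikit-learn (not part of the original scripts):
from sklearn.metrics import classification_report, confusion_matrix
# Break the overall accuracy down by class (0 = stayed, 1 = left)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))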
We can compare this random forest model to a logistic regression model. The code is below:
The resulting accuracy score is only 0.785, or 78.5% accurate.
So the random forest model wins!
Logistic Regression Model Code
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load the dataset
csv_location = 'HR_capstone_dataset.csv'
df = pd.read_csv(csv_location)
# Convert 'salary' into numerical values
salary_map = {'low': 0, 'medium': 1, 'high': 2}
df['salary'] = df['salary'].map(salary_map)
# One-hot encode 'Department'
df = pd.get_dummies(df, columns=['Department'])
# Split the dataset into features (X) and target variable (y)
X = df.drop('left', axis=1) # Features
y = df['left'] # Target variable
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the Logistic Regression classifier
log_reg_classifier = LogisticRegression(max_iter=1000, random_state=42)
# Train the classifier on the training data
log_reg_classifier.fit(X_train, y_train)
# Make predictions on the test data
y_pred = log_reg_classifier.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)