Tag: data-science

  • How to Build a Human Resources Employee Attrition Model

    Here we make use of the HR Analytics dataset available on Kaggle. The dataset was created to help understand the factors behind employee attrition and can be used to train a model for predicting employee churn.

    The Python code for the two models is on GitHub.

    We can start by importing the required libraries and reading the HR dataset CSV file into a pandas dataframe.

    import seaborn as sns
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    
    csv_location = 'HR_capstone_dataset.csv'
    df = pd.read_csv(csv_location)


    First, we should check that the data imported into the pandas data frame looks good:
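    A quick way to do that (a minimal check using standard pandas calls) is to preview the first few rows and the overall shape:

    print(df.head())   # first five rows
    print(df.shape)    # expect 14999 rows and 10 columns, as described below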


    We can also run df.columns to view the columns of the data frame and df.values to display the values of the data frame as an array.

    The pandas describe() method gives us descriptive statistics on the data frame. Note that include='all' is passed here so that all columns are described (the default is numeric columns only), but you will notice it displays NaN (Not a Number) where a statistic doesn't apply to a column's type.
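    As a short sketch of those inspection calls:

    print(df.columns)                  # column names
    print(df.values)                   # underlying values as a NumPy array
    print(df.describe(include='all'))  # summary statistics for all columns, not just numeric ones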



    Here we can verify that all columns have the same count of 14999, so there appears to be no missing data. We can also view the standard deviation (std) in relation to the mean to get an idea of the variance in the data. For example:

    Mean average_monthly_hours = 201 hours
    1 standard deviation of average monthly hours = 49.94 hours
    Therefore we can infer that:
    Approx. 68% of employees work 201 ± 49.94 hours, i.e. roughly 151 to 251 hours a month.
    Approx. 95% of employees work 201 ± 100 hours (rounded), i.e. roughly 101 to 301 hours a month.

    We need to check that the datatypes are in the correct format if we want to build a model from them.
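    A minimal way to inspect the datatypes:

    print(df.dtypes)  # datatype of each column
    df.info()         # datatypes plus non-null counts in one view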


    The dependent variable that we want to predict is the ‘left’ field, which is a binary field of 0 or 1. A value of 1 means the employee left the company (attrition).
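    As a quick sanity check (not part of the original listing), we can confirm the field only takes the values 0 and 1 and see how many employees left versus stayed:

    print(df['left'].value_counts())  # counts of 0 (stayed) and 1 (left)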

    There are 9 other columns which make up the independent variables: 2 floats, 5 integers, and 2 object data types.

    The salary field is categorical and includes 3 levels:
    low, medium, and high.
    Because there is an inherent order, we can use ordinal encoding to convert it to an integer:
    salary_map = {'low': 0, 'medium': 1, 'high': 2}
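    A short sketch of applying that mapping (the same preprocessing appears in the logistic regression listing further down):

    # Ordinal-encode the salary level using the mapping above
    df['salary'] = df['salary'].map(salary_map)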

    The Department field consists of non-ordered categories:
    accounting, HR, IT, management, marketing, product_mng, RandD, and sales.

    We use one-hot encoding on this field which creates 8 new binary fields, one for each department name. The code is as follows:

    df = pd.get_dummies(df, columns=['Department'])


    This results in a lot more columns in the data frame as below, but now we’re ready to build a model.

    Building the Random Forests model

    The steps for building the model are as follows:
    1. Split the dataset into X and y variables, where X consists of all the independent variable fields and y contains only the ‘left’ field (the dependent variable).
    2. Split the X and y dataframes into 4 sets for training and testing. The test sets are a randomized 20% of the data.
    3. Initialize the random forest classifier.
    4. Train the model with the training data (X_train and y_train). This represents 80% of the data.
    5. Make predictions on the X_test set. The model tries to predict the actual values that are in the y_test set.
    6. Measure the accuracy by comparing the predictions of y (y_pred) with the actual y_test values.

    # Split the dataset into features (X) and target variable (y)
    X = df.drop('left', axis=1)  # Features
    y = df['left']  # Target variable
    
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Initialize the Random Forest classifier with 100 trees
    rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
    
    # Train the classifier on the training data
    rf_classifier.fit(X_train, y_train)
    
    # Make predictions on the test data
    y_pred = rf_classifier.predict(X_test)
    
    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy:", accuracy)


    The resulting accuracy is 0.9886666666666667, which represents roughly 99% accuracy.
    If I re-run the code it gives similar results. The model is almost perfect!

    We can compare this random forest model to a logistic regression model; the code is below.
    The resulting accuracy score is only 0.785, or 78.5% accurate.
    So the random forest model wins!

    Logistic Regression Model Code

    
    import seaborn as sns
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    
    # Load the dataset
    csv_location = 'HR_capstone_dataset.csv'
    df = pd.read_csv(csv_location)
    
    # Convert 'salary' into numerical values
    salary_map = {'low': 0, 'medium': 1, 'high': 2}
    df['salary'] = df['salary'].map(salary_map)
    
    # One-hot encode 'Department'
    df = pd.get_dummies(df, columns=['Department'])
    
    # Split the dataset into features (X) and target variable (y)
    X = df.drop('left', axis=1)  # Features
    y = df['left']  # Target variable
    
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Initialize the Logistic Regression classifier
    log_reg_classifier = LogisticRegression(max_iter=1000, random_state=42)
    
    # Train the classifier on the training data
    log_reg_classifier.fit(X_train, y_train)
    
    # Make predictions on the test data
    y_pred = log_reg_classifier.predict(X_test)
    
    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy:", accuracy)
    
  • An Example of Poor Correlation using Python

    In order to find the best data to build our model, we need to run correlations on the data. In the initial predictive model we built, we guessed the fields, namely age, gender, income, and first purchase amount, but the model gave a very poor MSE result, so we need to go back to the data and look for fields that correlate well with the sales amount (as well as checking the existing fields we have used).
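    One quick way to scan for candidate fields (a sketch, assuming the same outliercheck.csv file and LifetimeSales column used in the code below) is to correlate every numeric column against the sales field:

    import pandas as pd

    df = pd.read_csv('outliercheck.csv')
    # Correlation of each numeric column with the first 12 month sales field
    correlations = df.corr(numeric_only=True)['LifetimeSales'].sort_values(ascending=False)
    print(correlations)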


    We can check age and first purchase amount against the first 12 month sales amount just to confirm this.

    Age vs. Sales (first 12 month sales) code:

    
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from scipy import stats
    
    csvlocation = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
    df = pd.read_csv(csvlocation)
    df.columns  # display the columns available
    
    # Create a scatter plot with regression line
    sns.regplot(data=df, x='Age', y='LifetimeSales')
    plt.title("First 12 Month Sales by Customer Age")  # graph title
    plt.xlabel('Age')  # x axis label
    plt.ylabel('First 12 Month Sales')  # y axis label
    
    # Fit a linear regression line
    slope, intercept, r_value, p_value, std_err = stats.linregress(df['Age'], df['LifetimeSales'])
    
    # Add equation of the regression line to the plot
    plt.text(df['Age'].min(), df['LifetimeSales'].max(), f'y = {slope:.2f}x + {intercept:.2f}', ha='left')
    
    # Calculate correlation coefficient
    correlation_coefficient = df['Age'].corr(df['LifetimeSales'])
    # Add correlation coefficient to plot
    plt.text(df['Age'].max(), df['LifetimeSales'].min(), f'Correlation coefficient: {correlation_coefficient:.2f}', ha='right')
    
    
    plt.show()
    


    The output of graphing customer age against first 12 month sales confirms the poor correlation. Here the correlation coefficient is only 0.08.



    Next we can look at first purchase amount against first 12 month sales amount.
    Here is the code below. It includes removing the major outlier with CustomerID = 1.

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from scipy import stats
    
    csvlocation = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
    df = pd.read_csv(csvlocation)
    df = df[df['CustomerID'] != 1]
    df.columns  # display the columns available
    
    # Create a scatter plot with regression line
    sns.regplot(data=df, x='FirstPurchaseAmount', y='LifetimeSales')
    plt.title("First 12 Month Sales by FirstPurchaseAmount")  # graph title
    plt.xlabel('FirstPurchaseAmount')  # x axis label
    plt.ylabel('First 12 Month Sales')  # y axis label
    
    # Fit a linear regression line
    slope, intercept, r_value, p_value, std_err = stats.linregress(df['FirstPurchaseAmount'], df['LifetimeSales'])
    
    # Add equation of the regression line to the plot
    plt.text(df['FirstPurchaseAmount'].min(), df['LifetimeSales'].max(), f'y = {slope:.2f}x + {intercept:.2f}', ha='left')
    
    # Calculate correlation coefficient
    correlation_coefficient = df['FirstPurchaseAmount'].corr(df['LifetimeSales'])
    # Add correlation coefficient to plot
    plt.text(df['FirstPurchaseAmount'].max(), df['LifetimeSales'].min(), f'Correlation coefficient: {correlation_coefficient:.2f}', ha='right')
    
    
    plt.show()
    
    


    And here is the graph. The correlation coefficient is only 0.07.

  • Some basic Python Graphing Data Examples with the Matplotlib library

    In the previous article ‘Creating a Customer Lifetime Value Model’ we imported and transformed a table of customer data from an MS SQL Database, which included the ‘first 12 months sales amount’. At the end of the exercise, we exported the data frame to a CSV file, which we can check out using the Matplotlib library in Python.

    Histogram:
    We can start by having a look at the distribution of customers’ ages with a histogram.
    Here is the code:

    import pandas as pd
    import matplotlib.pyplot as plt
    
    csvlocation = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
    df = pd.read_csv(csvlocation)
    df.columns # display the columns available
    
    # Create a histogram
    df['Age'].plot(kind='hist', bins=20) # Adjust the number of bins as needed
    plt.title("Distribution of Customer Age")
    plt.xlabel('Age')
    plt.ylabel('Frequency')
      
    plt.show()

    The graph of the output is below. We immediately notice that all customers are above 45. This isn't an issue; it's just a reflection of the database being old and the age calculation being based on the current time and the customers' dates of birth. We can say there are more customers between the ages of 50 and 65. We would expect there to be fewer customers as age increases after that, so I think that is fairly reflective of the general population.





    Scatter Plot
    Next, we can have a look at the first 12-month sales amount against age.
    Here is the Python code:

    
    import pandas as pd
    import matplotlib.pyplot as plt
    
    csvpath = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
    df = pd.read_csv(csvpath)
    df.columns # display the columns available
    
    #kind options: 'scatter', 'hist', 'bar',  
    df.plot(kind = 'scatter', x = 'Age', y = 'LifetimeSales')
    plt.title("First 12 Month Sales by Customer Age") # graph title
    plt.xlabel('Age') #x axis label
    plt.ylabel('First 12 Month Sales') #y axis label
      
    plt.show()


    Here is the graph:


    We can immediately see that the first 12-month sales amounts range from about 10K to 70K. That's a very high amount and makes me want to go back and check the data. When I go and look at the data, it does look correct, so we can continue.

    All customers are over 40. This reflects the database being over 15 years old with no new customers added, but it also means we have some really old customers with an estimated age of 100.
    Customers' ages look to be fairly evenly distributed between 40 and 80, with the 80-plus group having lower sales.

    Pie Graph

    The following code gives us a pie chart showing the split of gender in the dataset.

    import pandas as pd
    import matplotlib.pyplot as plt
    
    csvpath = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
    df = pd.read_csv(csvpath)
    df.columns # display the columns available
    # Replace 0 and 1 in the 'Gender' column with 'female' and 'male'
    df['Gender'] = df['Gender'].replace({0: 'female', 1: 'male'})
    
    # Count the number of customers of each gender
    gender_counts = df['Gender'].value_counts()
    #print(gender_counts)
    
    # Define colors for each category
    colors = {'female': 'pink', 'male': 'blue'}
    
    # Plotting
    plt.figure(figsize=(4, 4))  # Adjust figure size if needed
    plt.pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%', 
            colors=[colors[g] for g in gender_counts.index])
    plt.title('Distribution of Gender')
    plt.show()


    The output looks like this. The 0s and 1s in the data have been replaced with ‘female’ and ‘male’.
    The data looks pretty normal, as we would expect. There are no nulls, which is good.
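    A quick check to back up the "no nulls" observation (a simple sketch, not part of the original listing):

    print(df.isnull().sum())  # missing values per column; all zeros means no nulls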


    Another Scatter Plot
    Next, we look at the ‘FirstPurchaseAmount’ field.
    We create a scatter graph with the code below.

    import pandas as pd
    import matplotlib.pyplot as plt
    
    csvpath = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
    df = pd.read_csv(csvpath)
    df.columns # display the columns available
    
    #kind options: 'scatter', 'hist', 'bar',  
    df.plot(kind = 'scatter', x = 'CustomerID', y = 'FirstPurchaseAmount')
    plt.title("Customers by First Purchase Amount") # graph title
    plt.xlabel('CustomerID') #x axis label
    plt.ylabel('First Purchase Amount') #y axis label
      
    plt.show()


    The first thing we notice when we run the graph is that there is one major outlier in the data.

    Removing this customer should help improve the model, and we should be able to see more outliers in the graph once it is removed.

  • Examples of using the Python Seaborn Graphs Library

    This is pretty powerful. You can spend your time creating graphs one at a time, or you can create a pairs plot using the Seaborn library. The pairs plot graphs every combination of variables to create multiple graphs in one go. The scatter graphs are below and the simple code is further down the page. There is a big outlier for the first purchase amount that stands out. We need to fish that one out first and re-run.



    import seaborn as sns
    import matplotlib.pyplot as plt
    import pandas as pd
    
    csvlocation = 'outliercheck.csv'
    df = pd.read_csv(csvlocation)
    #df.columns # display the columns available
    
    pairplot = sns.pairplot(df)
    
    # Display the plot
    plt.show()

    If we sort the data frame by the ‘FirstPurchaseAmount’ field, we can have a look at the output.

    # Sort the DataFrame by the 'firstpurchasedate' column
    df_sorted = df.sort_values(by='FirstPurchaseAmount', ascending=False)
    
    # Display the first few rows of the sorted DataFrame
    print(df_sorted.head())

    The row with CustomerID = 1 is the big outlier. It was possibly a test record, which raises another question: is there test data that needs to be removed? Anyway, let's remove it.


    We can remove just this one customer using the following code. We can also remove the Age, Gender, and CustomerID fields, as they looked normal in the earlier checks, and dropping them will reduce the number of graphs in the pair plot.

    #remove row where CustomerID = 1
    df_filtered = df[df['CustomerID'] != 1]
    # Remove the 'age', 'gender' and CustomerID columns
    df_filtered = df_filtered.drop(['Age', 'Gender', 'CustomerID'], axis=1)


    Below is the output of re-running the graphs after the major outlier has been removed; the Age, CustomerID, and Gender columns have also been dropped from the output as they aren't of interest.

    Now we have a clearer view of the data. As expected, there are more outliers to remove. We can use the standard deviation method to remove extreme values, for example dropping rows that fall more than three standard deviations from the mean (roughly the most extreme 0.3% of normally distributed data).
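    A minimal sketch of that filter, assuming we apply it to the FirstPurchaseAmount column after dropping the known outlier:

    import pandas as pd

    df = pd.read_csv('outliercheck.csv')
    df = df[df['CustomerID'] != 1]  # remove the major outlier found earlier

    col = 'FirstPurchaseAmount'
    mean, std = df[col].mean(), df[col].std()

    # Keep only rows within three standard deviations of the mean
    df_no_outliers = df[(df[col] - mean).abs() <= 3 * std]
    print(len(df), len(df_no_outliers))  # rows before and after filtering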