Category: Machine Learning

  • How to check for model assumptions with Python Seaborn Graphics


    We need to check model assumptions to give us confidence that our models have integrity and are not biased or overfitting the data.

    We check three assumptions in this example, using sub-plotted seaborn graphics for a linear regression model.
    The code for creating the linear regression model can be found in this post.
    You can run the code below once you have built the model. The model captures the relationship between radio advertising spend and sales.
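
    If you haven't built the model yet, here is a minimal sketch of how df, y_pred and residuals could be produced; the file name ('advertising.csv') and the use of sklearn are assumptions, so follow the original model post where they differ.

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    
    # Assumed file and column names - adjust to match the dataset from the model post
    df = pd.read_csv('advertising.csv')
    
    X = df[['Radio']]            # predictor: radio advertising spend
    y = df['Sales']              # outcome: sales
    
    model = LinearRegression().fit(X, y)
    
    y_pred = model.predict(X)    # fitted values used in the plots below
    residuals = y - y_pred       # residuals = actuals minus predictions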


    Graph 1: Checking the Linearity Assumption
    The linearity assumption is as follows: ‘Each predictor variable (x) is linearly related to the outcome variable (y).’
    In the first graph, we plot radio advertising spend against sales and can see there is a linear relationship, so we can conclude the linearity assumption is met.

    Graph 2: Checking the Homoscedasticity Assumption with a Scatterplot

    The homoscedasticity assumption (extra points if you can spell it correctly) is as follows: ‘The residuals have equal, or almost equal, variance across the regression line.’
    Here, y_pred are the predicted y values from the regression line.

    In the second graph, we plot the residuals of the model, which are the differences between the actual values and the model forecasts, against y_pred.

    By plotting the error terms against the predicted values, we can check that there is no pattern in the error terms. Good homoscedasticity therefore looks like a balanced scatter of residuals above and below zero.

    Graph 3: Checking the Normality Assumption
    In the third graph, a histogram is used to plot the residuals of the regression line (the actual y values minus the predicted y values). If the model is unbiased, the residuals should be normally distributed, and we see that here.

    The fourth graph is a Q-Q plot which is also used to check the normality assumption.

    import matplotlib.pyplot as plt
    import seaborn as sns
    import statsmodels.api as sm
    
    # df, y_pred and residuals come from the linear regression model built earlier
    fig, ax = plt.subplots(2, 2, figsize=(18, 10))
    fig.suptitle('Assumption Checks')
    
    # Check for linearity
    sns.regplot(ax=ax[0, 0], data=df, x='Radio', y='Sales')
    ax[0, 0].set_title('Radio Sales')
    ax[0, 0].set_xlabel('Radio Spend ($K)')
    ax[0, 0].set_ylabel('Sales ($)')
    
    # Check for homoscedasticity
    # Plot residuals against the fitted values
    sns.scatterplot(ax=ax[0, 1], x=y_pred, y=residuals)
    ax[0, 1].set_title("Residuals vs Fitted Values")
    ax[0, 1].set_xlabel("Fitted Values")
    ax[0, 1].set_ylabel("Residuals")
    ax[0, 1].axhline(0, linestyle='--', color='red')
    
    # Check for normality: histogram of residuals
    sns.histplot(ax=ax[1, 0], x=residuals)
    ax[1, 0].set_title("Histogram of Residuals")
    ax[1, 0].set_xlabel("Residual Value")
    
    # Check for normality: Q-Q plot
    sm.qqplot(residuals, line='s', ax=ax[1, 1])
    ax[1, 1].set_title("Q-Q Plot")
    
    plt.show()

  • A Summary of the data process used in Classification Models

    Introduction:

    Classification models are machine learning models that are used to predict binary outcome scenarios such as:

    Spam / Not Spam
    Fraudulent Transaction / Non-Fraudulent Transaction
    Customer churn / customer will not churn
    Customer high value / customer low value
    Loan approval / non-approval

    The Data Process

    Planning

    1. Understand the business requirements. Understand what the measure of success is and what needs to be measured,
      e.g. for binary outcomes, precision, recall, or F1 score. Identify what Type 1 and Type 2 errors mean for the business.
    2. Identify the key stakeholders and subject matter experts relevant to the project.
    3. Understand where the data is and how it can be accessed. For larger data projects, if the data comes from many different sources, can it be brought together in a data warehouse such as Google BigQuery?
    4. Understand the technology required for the project. Are extra resources required?
    5. Is there a data dictionary describing all the data field types and purposes?

    Exploratory Data Analysis (Python)

    1. Explore the data, and list the number of columns, rows, and data types. If there are any questions, these may need to be referred back to the business.
    2. Explore the data ranges (df.describe()). Are the data counts complete? Do the means and ranges make sense? Do the min and max statistics flag any potential errors or outliers in the data?
    3. Explore null values. If there are null values, either look to fill the data or drop the rows.
    4. Remove or adjust outliers.
    5. Summarize and graph the data (a short sketch follows this list):
    6. Use boxplots to look for outliers in columns.
    7. Use histograms to understand the distributions of the data.
    8. Use a correlation matrix and pair plot to understand the covariance between columns.
    9. Visualize the data with interactive tools such as Tableau or Power BI for the initial analysis of data for clients.
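
    A minimal sketch of these EDA checks, assuming the data is already in a pandas dataframe; the file name is a placeholder:

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    
    df = pd.read_csv('your_dataset.csv')      # placeholder file name
    
    print(df.info())                          # columns, row counts and data types
    print(df.describe())                      # ranges, means, min/max for outlier checks
    print(df.isnull().sum())                  # null counts per column
    
    numeric = df.select_dtypes('number')
    
    sns.boxplot(data=numeric)                 # boxplots to look for outliers
    plt.show()
    
    numeric.hist(figsize=(10, 8))             # histograms of the distributions
    plt.show()
    
    sns.heatmap(numeric.corr(), annot=True)   # correlation matrix
    plt.show()
    
    sns.pairplot(df)                          # pair plot of relationships
    plt.show()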

    Model Selection (Classification)

    Classification Models: Naive Bayes, Logistic Regression, Decision Tree, Random Forests, XG Boost

    1. Identify the variable to be predicted: Is it a continuous variable or a categorical variable?
    2. Select the machine learning model relevant to the task and 1 or 2 additional models to compare results to.
    3. Confirm the assumptions required by the model and check the data to confirm they meet the requirements.
    4. Feature selection: Select the features (the independent variable columns and the dependent variable column) to be used in the model.
    5. Feature transformation: transform categorical data into numeric data using:
      a) one-hot encoding for non-ordered categories, e.g. departments
      b) ordinal encoding for ordered data such as ‘low’, ‘medium’, ‘high’
      Where numeric features sit on very different scales, e.g. small values between 0 and 1 alongside large values of 100+, consider applying scaling to the data:
      a) Log normalization (taking a logarithm of the data)
      b) Standardization (which converts each feature to Z-scores on a standard distribution, with a mean of zero within each feature)
    6. Feature extraction: Create new features from existing features; an example is weekly hours.

    7. Check for class imbalance in the data. In the case of a binary dependent variable (True, False), we would ideally like an even split, but as a minimum the smaller class should be at least 10% of the data. Class imbalances can be addressed with either of the following (see the sketch after this list):
    a) Downsampling: levelling the majority class down by random sampling.
    b) Upsampling: levelling the minority class up by random sampling.
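
    A rough sketch of these transformation steps, assuming df is the cleaned dataframe from the EDA stage; the column names ('department', 'salary', 'income', 'age', 'churned') are placeholders:

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.utils import resample
    
    # One-hot encode a non-ordered category and ordinal encode an ordered one
    df = pd.get_dummies(df, columns=['department'])
    df['salary'] = df['salary'].map({'low': 0, 'medium': 1, 'high': 2})
    
    # Log normalization of a large-valued, skewed feature
    df['log_income'] = np.log1p(df['income'])
    
    # Standardization to Z-scores (mean 0, standard deviation 1)
    scaler = StandardScaler()
    df[['age', 'log_income']] = scaler.fit_transform(df[['age', 'log_income']])
    
    # Class imbalance: upsample the minority class of a binary target 'churned'
    majority = df[df['churned'] == 0]
    minority = df[df['churned'] == 1]
    minority_upsampled = resample(minority, replace=True,
                                  n_samples=len(majority), random_state=42)
    df_balanced = pd.concat([majority, minority_upsampled])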

    Model Execution

    1. Set up the X variables and y variable in separate data frames.
    2. Decide whether to use cross-validation or a separate validation set (see Model Validation below).
    3. Use train_test_split to create training and test sets, and potentially a third validation set of data. The test dataset should be around 25%, but can be larger if the dataset is large. Should the split be stratified?
    4. Select the hyperparameter requirements of the model. GridSearchCV is a powerful sklearn function that takes a list of hyperparameters, scoring metrics and the X values, and runs multiple models to find the best one (see the sketch after this list).
    5. Build and run the model.
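
    A minimal sketch of steps 3 and 4, assuming X and y were set up in step 1 and using a random forest purely as an example model:

    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.ensemble import RandomForestClassifier
    
    # Stratified split so both sets keep the class balance of y
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=42)
    
    # Search a small hyperparameter grid, scoring each candidate model on F1
    param_grid = {'n_estimators': [100, 200], 'max_depth': [None, 5, 10]}
    grid = GridSearchCV(RandomForestClassifier(random_state=42),
                        param_grid, scoring='f1', cv=5)
    grid.fit(X_train, y_train)
    
    print(grid.best_params_)
    best_model = grid.best_estimator_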

    Model Validation

    In a simple scenario, the best performing model is run against the hold-out sample (e.g. 25% of the data) that the model has not been trained on.

    a) In cross-validation, the training data is split into several folds; the model is trained on all but one fold and validated against the remaining fold, repeating until every fold has been used. Cross-validation is slower, but makes better use of limited data.

    b) Separate validation set: a random sample is held out from the training data and used to validate candidate models, keeping the test set untouched until the final check. This is simpler and faster than cross-validation and works well when there is plenty of data, since we don't need to repeatedly resample from the training data.

    Measurement
    1. Metrics: Accuracy, Precision, Recall, F1, AUC Score, Classification Report (see the sketch below).

    2. Visualise: Confusion Matrix, ROC Curve
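
    A short sketch of these measurements, continuing from the grid search sketch above (best_model, X_test and y_test are assumed from that step):

    import matplotlib.pyplot as plt
    from sklearn.metrics import (classification_report, confusion_matrix,
                                 ConfusionMatrixDisplay, RocCurveDisplay, roc_auc_score)
    
    y_pred = best_model.predict(X_test)
    
    print(classification_report(y_test, y_pred))   # accuracy, precision, recall, F1
    print('AUC:', roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1]))
    
    ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred)).plot()
    RocCurveDisplay.from_estimator(best_model, X_test, y_test)
    plt.show()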

    Check Assumptions:

    Linearity: All independent and dependent variables should exhibit a degree of linearity (use a pairplot).
    Independent observations: This is business specific, so it requires an understanding of how the variables are generated.
    No multicollinearity: There should be no collinearity between independent variables, or this will undermine the reliability of the model (see the VIF sketch below).
    No extreme outliers: extreme outliers should be excluded from the model.
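
    One common way to check multicollinearity is the variance inflation factor (VIF); a minimal sketch, assuming X is the dataframe of independent variables:

    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    
    # A VIF above roughly 5-10 is a common rule of thumb for problematic multicollinearity
    vif = pd.DataFrame({
        'feature': X.columns,
        'VIF': [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    })
    print(vif)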



  • An Example of Using K-Means Cluster modelling

    Supermarket Example

    Import the libraries and read the CSV file into a data frame.
    The data comes from Kaggle here (mall customer data).

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.cluster import KMeans
    
    df = pd.read_csv("Supermarket Customers.csv")
    df.info()


    First, we run some basic checks on the data for data integrity.
    The data includes 200 rows by 5 columns.



    We can change Gender to numeric with the following (mapping the labels to 0 and 1 so that describe() includes the column):

    df['Gender'] = df['Gender'].replace({'Female': 0, 'Male': 1})
    df.describe()





    Then we check for nulls as follows:

    
    df.isnull().sum() #check for nulls - gives a count of nulls values by column name.

    Get counts of unique values

    
    len(df['CustomerID'].unique()) #how many unique values are there


    CustomerID: 200
    Gender: 2
    Age: 51
    Annual Income (k$): 65
    Spending Score (1-100): 85

    Fixing Null Values
    Annual Income: We have 2 values missing from the Annual Income field. We can remove these rows from the data or use the averages to fill in the gaps.

    Drop rows where annual income (k$) is null

    df.dropna(subset=['Annual Income (k$)'], inplace=True)
    df.isnull().sum() #re-run check for missing data

    Then do the same for spending score:

    df.dropna(subset=['Spending Score (1-100)'], inplace=True)
    df.isnull().sum() #re-run check for missing data
    df.info() # the number of rows is reduced to 197.

    We can run the seaborn pairplot to plot the graphs of the combination of variables.

    pairplot = sns.pairplot(df)
    plt.show()


    From here we can see there are some very interesting, distinct-looking clusters around
    annual income and spending score. It also makes sense that these variables would be related, so we can use them in the k-means model.

    For the K-means model, we need to determine the value of K which is the number of clusters we want to identify.
    We use the elbow method to do this as follows:

    # Extracting features for clustering
    X = df[['Annual Income (k$)', 'Spending Score (1-100)']]
    
    # Using the elbow method to determine the optimal number of clusters
    wcss = []
    for i in range(1, 11):
        kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=42)
        kmeans.fit(X)
        wcss.append(kmeans.inertia_)
    
    # Plotting the elbow graph
    plt.plot(range(1, 11), wcss)
    plt.title('Elbow Method')
    plt.xlabel('Number of clusters')
    plt.ylabel('WCSS')
    plt.show()

    The graph produced from the elbow method is below. For the value of K, we select the point where the rate of decrease in WCSS slows dramatically (the elbow). In this case, we select 5 clusters.

    K-Means Model

    Now we have the number of clusters to be used in the model, we can run the K-Means model.

    ## Use 5 clusters based on the elbow graph
    
    # Fitting K-Means to the dataset with the optimal number of clusters (5 in this case)
    kmeans = KMeans(n_clusters=5, init='k-means++', max_iter=300, n_init=10, random_state=42)
    y_kmeans = kmeans.fit_predict(X)
    
    # Visualizing the clusters
    plt.scatter(X.iloc[y_kmeans == 0, 0], X.iloc[y_kmeans == 0, 1], s=100, c='red', label='Cluster 1')
    plt.scatter(X.iloc[y_kmeans == 1, 0], X.iloc[y_kmeans == 1, 1], s=100, c='blue', label='Cluster 2')
    plt.scatter(X.iloc[y_kmeans == 2, 0], X.iloc[y_kmeans == 2, 1], s=100, c='green', label='Cluster 3')
    plt.scatter(X.iloc[y_kmeans == 3, 0], X.iloc[y_kmeans == 3, 1], s=100, c='orange', label='Cluster 4')
    plt.scatter(X.iloc[y_kmeans == 4, 0], X.iloc[y_kmeans == 4, 1], s=100, c='purple', label='Cluster 5')
    
    # Plotting the centroids of the clusters
    plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroids')
    plt.title('Clusters of customers')
    plt.xlabel('Annual Income (k$)')
    plt.ylabel('Spending Score (1-100)')
    plt.legend()
    plt.show()

    The graph is below. The 5 identified clusters are colored accordingly. Once the clusters are identified, we can use the labels to segment our customers, which can then be used to determine, for example, the best marketing campaign for each segment, using A/B testing and t-tests for significance.
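
    As a possible next step, the cluster labels can be attached back to the dataframe and each segment profiled; a short sketch continuing from the code above:

    # Attach the cluster labels back to the customer dataframe
    df['Cluster'] = y_kmeans
    
    # Profile each segment: average income and spending score, plus segment size
    print(df.groupby('Cluster')[['Annual Income (k$)', 'Spending Score (1-100)']].mean())
    print(df['Cluster'].value_counts())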

  • How to Build a Human Resources Employee Attrition Model

    Here we make use of the HR Analytics dataset available on Kaggle. The dataset was created to understand the factors behind employee attrition and can be used to train a model for predicting employee churn.

    The Python Code for the 2 models is on Github

    We can start by importing the required libraries and reading the HR dataset CSV file into a pandas dataframe.

    import seaborn as sns
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    
    csv_location = 'HR_capstone_dataset.csv'
    df = pd.read_csv(csv_location)


    First, we should check that the data imported into the pandas data frame looks good:


    We can also run df.columns to view the columns of the data frame and df.values to display the values of the data frame in an array.

    The pandas data frame describe() method gives us descriptive statistics on the data frame. Note that include='all' is used here to return all columns; the default is numeric columns only, and you will see NaN (Not a Number) displayed where a statistic does not apply.




    Here we can verify that all columns have the same count of 14999, so there appears to be no missing data. We can also view the standard deviation (std) in relation to the mean, to get an idea of the variance in the data, for example:

    Mean average_monthly_hours = 201 hours
    1 standard deviation of average monthly hours = 49.94 hours
    Therefore we can infer that:
    Approx. 68% of employees work 201 +/- 49.94 hours = roughly 151 to 251 hours a month.
    Approx. 95% of employees work 201 +/- 100 hours (rounded) = 101 to 301 hours a month.
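
    These ranges can be reproduced directly from the dataframe; a quick sketch (the column name is assumed from the dataset and may differ in yours):

    # Assumed column name - adjust to match the dataset
    mean_hours = df['average_montly_hours'].mean()
    std_hours = df['average_montly_hours'].std()
    
    print(f"68% of employees: {mean_hours - std_hours:.0f} to {mean_hours + std_hours:.0f} hours")
    print(f"95% of employees: {mean_hours - 2*std_hours:.0f} to {mean_hours + 2*std_hours:.0f} hours")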

    We need to check the datatypes are in the correct format if we want to build a model from them.


    The dependent variable that we want to predict is the ‘left’ field, which is a binary field of 0 or 1. A value of 1 means the employee left the company (attrition).

    There are 9 other columns which make up the independent variables:
    There are 2 floats, 5 integers, and 2 object data types.

    The salary field is categorical and includes 3 levels:
    low, medium, and high
    Because there is an inherent order, we can use ordinal encoding to convert it to an integer:

    salary_map = {'low': 0, 'medium': 1, 'high': 2}
    df['salary'] = df['salary'].map(salary_map)

    The Department field consists of non-ordered categories:
    Accounting, HR, IT, management, marketing, product_mgt, RandomD, sales.

    We use one-hot encoding on this field which creates 8 new binary fields, one for each department name. The code is as follows:

    df = pd.get_dummies(df, columns=['Department'])


    This results in a lot more columns in the data frame as below, but now we’re ready to build a model.

    Building the Random Forests model

    The steps for building the model are as follows:
    1. Split the dataset into X and y variables, where X consists of all the independent variable fields
    and y contains only the ‘left’ field (the dependent variable).
    2. Split the X and y dataframes into 4 sets for training and testing. The test sets are a randomized 20% of the data.
    3. Initialize the random forest classifier.
    4. Train the model with the training data (X_train and y_train). This represents 80% of the data.
    5. Make predictions from the X_test set. The model tries to predict the actual values that are in the y_test set.
    6. Measure the accuracy by comparing the predictions of y (y_pred) with the actual y_test values.

    # Split the dataset into features (X) and target variable (y)
    X = df.drop('left', axis=1)  # Features
    y = df['left']  # Target variable
    
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Initialize the Random Forest classifier
    #We initialize the Random Forest classifier with 100 trees and fit it to the training data.
    rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
    
    # Train the classifier on the training data
    rf_classifier.fit(X_train, y_train)
    
    # Make predictions on the test data
    y_pred = rf_classifier.predict(X_test)
    
    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy:", accuracy)


    The resulting accuracy is 0.9886666666666667, which represents roughly 99% accuracy.
    If I re-run the code it gives similar results. The model is almost perfect!
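
    Accuracy alone can flatter a model when one class dominates, so it's also worth looking at the precision, recall and F1 scores mentioned in the classification post above; a quick sketch reusing the existing y_test and y_pred:

    from sklearn.metrics import classification_report, confusion_matrix
    
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))   # precision, recall and F1 for both classes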

    We can compare this random forest model to a logistic regression model. The code is below.
    The resulting accuracy score is only 0.785, or 78% accurate.
    So the random forest model wins!

    Logistic Regression Model Code

    
    import seaborn as sns
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    
    # Load the dataset
    csv_location = 'HR_capstone_dataset.csv'
    df = pd.read_csv(csv_location)
    
    # Convert 'salary' into numerical values
    salary_map = {'low': 0, 'medium': 1, 'high': 2}
    df['salary'] = df['salary'].map(salary_map)
    
    # One-hot encode 'Department'
    df = pd.get_dummies(df, columns=['Department'])
    
    # Split the dataset into features (X) and target variable (y)
    X = df.drop('left', axis=1)  # Features
    y = df['left']  # Target variable
    
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Initialize the Logistic Regression classifier
    log_reg_classifier = LogisticRegression(max_iter=1000, random_state=42)
    
    # Train the classifier on the training data
    log_reg_classifier.fit(X_train, y_train)
    
    # Make predictions on the test data
    y_pred = log_reg_classifier.predict(X_test)
    
    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy:", accuracy)
    
  • Creating a Customer Lifetime Value (LTV) Prediction Model


    In this project I’m trying to predict customer lifetime sales (well, actually first 12-month sales), initially using sklearn's linear regression model. The code can be downloaded from Github here. The results haven’t been great so far, and I wonder how the data from the Contoso Retail data warehouse was created, as I have an r-squared value of only around 10%. Still, it’s been a good exercise, and adding additional fields has improved the model gradually.

    A lifetime value model is particularly useful in business for predicting the value of newly acquired customers.

    The benefits of the model are as follows:

    1. It can guide the forecast for future sales, based on the predicted value of new customers and the forecast of new customers being acquired.
    2. It can guide the acquisition team in both targeting higher-value customers and understanding how much they can spend on acquiring new customers.

    3. It helps the business understand what the maximum CPA (Cost Per Acquisition) should be to keep the business profitable while still growing.

    Lifetime Timeframe
    One of the immediate challenges you will face with creating an LTV model is that your older customers are naturally more likely to have a higher lifetime value than more recently acquired customers. To move forward, we can look to create lifetime models based on the first x amount of time from the customer acquisition date. For example, predicted 12-month LTV. An analysis of lifetime data can of course give you a much better understanding of how long a customer’s lifetime is likely to last and what are the key timeframes you can use in creating the prediction.

    Building a Model with Linear Regression

    In this example, we use a Linear Regression Model as the data we want to predict (lifetime sales) is continuous data. In this example, we’ll guess which fields we think will predict the highest lifetime sales and we’ll soon find out that some of them have very low correlations, forcing us to go back to the drawing board as we’re trying to skip the first 2 stages of analytics: Exploration and Diagnosis (correlation).

    Here are the main steps for building the model:
    1. Create an SQL Query for importing the data into Python.
    2. Create the Python connection and import the data using the SQL Query.
    3. Transform and clean the data.
    4. Build the model
    5. Evaluate the model

    6. Improve the model.

    1. Create SQL Query to import the historical customer data for training the model. This includes the independent variables and the dependent variable (first 12-month lifetime sales), which we want to predict for future new customers. 12 months is used as we don’t have a lot of historical data.

    I’m choosing to start using the following demographic fields:
    Age, Gender, Yearly Income, and First Purchase Amount

    The first part of building the code is to create an SQL script that will get us the historical data we need to train our model.

    The SQL code can be found in this post

    Now we have the SQL, we can start building the Python script which will pull in the data and build the model.

    2. Python – Importing the Data (Extract)
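
    The import step itself isn't reproduced here, but a minimal sketch using pyodbc and pandas might look like the following; the driver, server and database names and the SQL file name are assumptions:

    import pandas as pd
    import pyodbc
    
    # Placeholder connection details - replace with your own SQL Server instance
    conn = pyodbc.connect(
        'DRIVER={ODBC Driver 17 for SQL Server};'
        'SERVER=localhost;DATABASE=ContosoRetailDW;Trusted_Connection=yes;'
    )
    
    # The SQL script from the earlier post, saved to a file
    customer_ltv_query = open('customer_ltv.sql').read()
    data = pd.read_sql(customer_ltv_query, conn)
    
    # Alternatively, export the query results to CSV and load them as in the code below:
    # data = pd.read_csv("customerltvdata.csv")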


    4. Building the model

    Now for the exciting bit, where we get to build the model using the sklearn machine learning module.
    The process is as follows:

    1. Separate the independent variables (Age, Gender, first purchase amount, Income) and the dependent variable (LifetimeSales) into their own dataframes.
    2. The data is then split further into train and test segments. In this case, 20% of the data is held out for testing.
    3. The next step is to create the linear regression model.
    4. Then evaluate the model.
    5. Save the model to a .pkl file for future use (see the sketch after the model code).

    Code for building the model

    # -*- coding: utf-8 -*-
    """
    Created on Tue May 28 09:29:07 2024
    
    """
    
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, r2_score
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    # Load the data
    data = pd.read_csv("customerltvdata.csv")
    data = data[~data.CustomerKey.isin({1, 0})]  #use tilde with isin for is not in
    
    # Preprocess the data
    data = data.dropna()
    data = pd.get_dummies(data, drop_first=True)
    
    # Define features and target variable
    X = data[['Age', 'Gender', 'NumberCarsOwned', 'HouseOwnerFlag', 'YearlyIncome', 'MaritalStatus', 'FirstPurchaseAmount']]  # Replace with actual features
    y = data['LifetimeSales']  # Replace with the target variable
    
    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Train the model
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    # Make predictions
    y_train_predictions = model.predict(X_train)
    y_test_predictions = model.predict(X_test)
    
    # Evaluate the model
    mse_train = mean_squared_error(y_train, y_train_predictions)
    mse_test = mean_squared_error(y_test, y_test_predictions)
    
    r2 = r2_score(y_test, y_test_predictions)
    
    print("Train MSE:", mse_train)
    print("Test MSE:", mse_test)
    
    print(f'R-squared: {r2}')
    
    # Step 5: Interpret the model coefficients
    coefficients = pd.DataFrame({'Variable': X.columns, 'Coefficient': model.coef_})
    print(coefficients)
    
    # Analyze the results
    plt.scatter(y_test, y_test_predictions)
    plt.xlabel('Actual Values')
    plt.ylabel('Predicted Values')
    plt.title('Actual vs Predicted Values')
    plt.show()
    
    sns.residplot(x=y_test, y=y_test_predictions, lowess=True)
    plt.xlabel('Actual Values')
    plt.ylabel('Residuals')
    plt.title('Residuals vs Actual Values')
    plt.show()
    

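    Step 5 of the list above mentions saving the model to a .pkl file; that isn't in the script yet, but a minimal sketch with joblib would be:

    import joblib
    
    # Persist the trained model so new customers can be scored later
    joblib.dump(model, 'ltv_linear_regression.pkl')
    
    # Later: reload the model and predict for newly acquired customers
    # model = joblib.load('ltv_linear_regression.pkl')
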

    5. Model evaluation output (At time of writing).

    Train MSE: 98009842.04325004
    Test MSE: 98328891.39303765

    R-squared: 0.09431790148611996


    Variable Coefficient
    0 Age 17.114500
    1 Gender 224.837535
    2 NumberCarsOwned 446.586582
    3 HouseOwnerFlag 283.217963
    4 YearlyIncome 0.080253
    5 MaritalStatus 1577.068714
    6 FirstPurchaseAmount 1.023711

    We can ask ChatGPT and Gemini to interpret the results. Here is a response from Gemini:

    Model Performance

    • Train MSE (Mean Squared Error): 98009842.04325
    • Test MSE: 98328891.393038
    • R-squared: 0.0943

    Both the Train MSE and Test MSE are relatively high. This indicates that the model has a high average error in predicting lifetime sales, both on the data it was trained on and unseen test data. Ideally, you want the MSE to be as low as possible.

    The R-squared value is also quite low (around 0.1). This means that the model only explains a small portion of the variance in lifetime sales. In other words, there are other factors that influence lifetime sales that are not captured by this model.

    I keep adding additional fields, which is slightly improving the model, but there is a long way to go. As the Contoso data is not real, it’s possible this isn’t the best data to use.