Creating a Customer Lifetime Value (LTV) Prediction Model


In this project I’m trying to predict customer lifetime sales (more precisely, sales in the first 12 months), initially using scikit-learn’s linear regression model. The code can be downloaded from GitHub here. The results haven’t been great so far, and I wonder how the data in the Contoso Retail data warehouse was generated, as the model has an R-squared value of only around 10%. Still, it’s been a good exercise, and adding additional fields has improved the model gradually.

A lifetime value model is particularly useful in business for predicting the value of newly acquired customers.

The benefits of the model are as follows:

1. It can guide the forecast for future sales, based on the predicted value of new customers and the forecast number of new customers being acquired.
2. It can guide the acquisition team in both targeting higher-value customers and understanding how much they can spend on acquiring new customers.

3. It helps the business understand what the maximum CPA (Cost Per Acquisition) should be to stay profitable while still growing the business.
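For example, if the model predicts a 12-month LTV of around £1,000 for a customer segment and the gross margin is roughly 30%, then spending much more than about £300 to acquire a customer in that segment means they are unlikely to pay back within the first year (illustrative figures only).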

Lifetime Timeframe
One of the immediate challenges with creating an LTV model is that your older customers are naturally more likely to have a higher lifetime value than more recently acquired customers. To get around this, we can build lifetime models based on the first x months from the customer acquisition date, for example a predicted 12-month LTV. An analysis of the lifetime data can of course give you a much better understanding of how long a customer’s lifetime is likely to last and which timeframes are most useful for the prediction.
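As an illustration, a 12-month LTV target can be derived from a transactions table by summing each customer’s sales in the 12 months following their first purchase. Here is a minimal pandas sketch, assuming a file of transactions with CustomerKey, OrderDate and SalesAmount columns (the names are assumptions for this sketch, not the exact Contoso schema):

import pandas as pd

# Transactions data: one row per order line (assumed column names)
transactions = pd.read_csv("transactions.csv", parse_dates=["OrderDate"])

# Acquisition date = each customer's first purchase date
first_purchase = transactions.groupby("CustomerKey")["OrderDate"].min().rename("AcquisitionDate")
transactions = transactions.join(first_purchase, on="CustomerKey")

# Sum each customer's sales in the 12 months following acquisition
in_window = transactions["OrderDate"] < transactions["AcquisitionDate"] + pd.DateOffset(months=12)
ltv_12m = (transactions[in_window]
           .groupby("CustomerKey")["SalesAmount"]
           .sum()
           .rename("LifetimeSales12M"))
print(ltv_12m.head())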

Building a Model with Linear Regression

In this example, we use a linear regression model because the value we want to predict (lifetime sales) is continuous. We’ll start by guessing which fields we think will best predict lifetime sales, and we’ll soon find that some of them have very low correlations with the target, forcing us back to the drawing board because we tried to skip the first two stages of analytics: Exploration and Diagnosis (correlation).
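A quick way to do that exploration is to check how strongly each candidate field correlates with the target before fitting anything. A minimal sketch, assuming the same customerltvdata.csv extract used later in the post:

import pandas as pd

data = pd.read_csv("customerltvdata.csv")

# Correlation of each numeric field with the target (LifetimeSales)
correlations = data.select_dtypes("number").corr()["LifetimeSales"].sort_values(ascending=False)
print(correlations)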

Here are the main steps for building the model:
1. Create an SQL Query for importing the data into Python.
2. Create the Python connection and import the data using the SQL Query.
3. Transform and clean the data.
4. Build the model.
5. Evaluate the model.
6. Improve the model.

1. Create an SQL query to import the historical customer data for training the model. This includes the independent variables and the dependent variable (first 12-month lifetime sales), which is what we want to predict for future new customers. A 12-month window is used because we don’t have a lot of historical data.

I’m choosing to start using the following demographic fields:
Age, Gender, Yearly Income and First Purchase Amount.

The first part of building the code is to create an SQL script that will get us the historical data we need to train our model.

The SQL code can be found in this post.

Now that we have the SQL, we can start building the Python script that will pull in the data and build the model.

2. Python – Importing the Data (Extract)
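The original import code isn’t reproduced here, but a minimal sketch of this step might look like the following, assuming a local SQL Server instance hosting the ContosoRetailDW database and the query from step 1 saved to a file (the connection string and file names are illustrative):

import pandas as pd
from sqlalchemy import create_engine

# Connection string is illustrative; adjust server, database and driver to your environment
engine = create_engine(
    "mssql+pyodbc://localhost/ContosoRetailDW?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes"
)

# The SQL query from step 1, loaded from a file (or pasted in as a string)
with open("customer_ltv_query.sql") as f:
    query = f.read()

data = pd.read_sql(query, engine)

# Save a local copy so the modelling script below can read customerltvdata.csv
data.to_csv("customerltvdata.csv", index=False)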


4. Building the model

Now for the exciting bit, where we get to build the model using the sklearn machine learning module.
The process is as follows:

1. Separate the independent variables (Age, Gender, First Purchase Amount, Income) and the dependent variable (LifetimeSales) into their own dataframes.
2. Split the data into train and test segments. In this case, 20% of the data is held back for testing.
3. Create and fit the linear regression model.
4. Evaluate the model.
5. Save the model to a .pkl file for future use.

Code for building the model

# -*- coding: utf-8 -*-
"""
Created on Tue May 28 09:29:07 2024

"""

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data
data = pd.read_csv("customerltvdata.csv")
data = data[~data.CustomerKey.isin({1, 0})]  #use tilde with isin for is not in

# Preprocess the data
data = data.dropna()
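# Note: if Gender or MaritalStatus are stored as text, get_dummies below will rename those
# columns (e.g. Gender_M), so the feature list in X would need the encoded column names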
data = pd.get_dummies(data, drop_first=True)

# Define features and target variable
X = data[['Age', 'Gender', 'NumberCarsOwned', 'HouseOwnerFlag', 'YearlyIncome', 'MaritalStatus', 'FirstPurchaseAmount']]  # Replace with actual features
y = data['LifetimeSales']  # Replace with the target variable

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_train_predictions = model.predict(X_train)
y_test_predictions = model.predict(X_test)

# Evaluate the model
mse_train = mean_squared_error(y_train, y_train_predictions)
mse_test = mean_squared_error(y_test, y_test_predictions)

r2 = r2_score(y_test, y_test_predictions)

print("Train MSE:", mse_train)
print("Test MSE:", mse_test)

print(f'R-squared: {r2}')

# Step 5: Interpret the model coefficients
coefficients = pd.DataFrame({'Variable': X.columns, 'Coefficient': model.coef_})
print(coefficients)

# Analyze the results
plt.scatter(y_test, y_test_predictions)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Values')
plt.show()

sns.residplot(x=y_test, y=y_test_predictions, lowess=True)
plt.xlabel('Actual Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Actual Values')
plt.show()
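Step 5 from the list above (saving the model to a .pkl file) isn’t shown in the listing; a minimal way to do it is with joblib (the filename is illustrative):

import joblib

# Persist the trained model so it can be reused to score new customers
joblib.dump(model, "ltv_linear_regression.pkl")

# Later: reload the model and predict for new customers
# model = joblib.load("ltv_linear_regression.pkl")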


5. Model evaluation output (At time of writing).

Train MSE: 98009842.04325004
Test MSE: 98328891.39303765

R-squared: 0.09431790148611996


Variable Coefficient
0 Age 17.114500
1 Gender 224.837535
2 NumberCarsOwned 446.586582
3 HouseOwnerFlag 283.217963
4 YearlyIncome 0.080253
5 MaritalStatus 1577.068714
6 FirstPurchaseAmount 1.023711

We can ask ChatGPT and Gemini to interpret the results. Here is a response from Gemini:

Model Performance

  • Train MSE (Mean Squared Error): 98009842.04325
  • Test MSE: 98328891.393038
  • R-squared: 0.0943

Both the Train MSE and Test MSE are relatively high. This indicates that the model has a high average error in predicting lifetime sales, both on the data it was trained on and unseen test data. Ideally, you want the MSE to be as low as possible.

The R-squared value is also quite low (around 0.1). This means that the model only explains a small portion of the variance in lifetime sales. In other words, there are other factors that influence lifetime sales that are not captured by this model.

I keep adding additional fields, which is slightly improving the model, but there is a long way to go. As the Contoso data is not real, it’s possible this isn’t the best data to use.
