In the previous article ‘Creating a Customer Lifetime Value Model’ we imported and transformed a table of customer data from an MS SQL Database, which included the ‘first 12 months sales amount’. At the end of the exercise, we exported the data frame to a CSV file, which we can check out using the Matplotlib library in Python.
Histogram
We can start by having a look at the distribution of customers’ ages with a histogram.
Here is the code:
import pandas as pd
import matplotlib.pyplot as plt
csvlocation = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
df = pd.read_csv(csvlocation)
print(df.columns)  # display the columns available
# Create a histogram
df['Age'].plot(kind='hist', bins=20) # Adjust the number of bins as needed
plt.title("Distribution of Customer Age")
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
The graph of the output is below. We immediately notice that all customers are above 45. This isn’t an issue; it’s just a reflection of the database being old and the age calculation being based on the current date and each customer’s date of birth. Most customers are between the ages of 50 and 65. We would expect there to be fewer customers as age increases after that, so I think the distribution is fairly reflective of the general population.
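To back up what we read off the histogram with numbers, a quick summary of the 'Age' column can help. The snippet below is only a sketch: the ages are made-up values standing in for the real CSV, and the 50 to 65 range is simply the band mentioned above.

```python
import pandas as pd

# Made-up ages standing in for the 'Age' column of outliercheck.csv
df = pd.DataFrame({'Age': [47, 52, 55, 58, 61, 63, 64, 70, 78, 101]})

# Count, mean, min, max and quartiles in one call
age_summary = df['Age'].describe()
print(age_summary)

# Share of customers aged 50 to 65 (both ends included)
share_50_65 = df['Age'].between(50, 65).mean()
print(f"Share aged 50 to 65: {share_50_65:.0%}")
```

With the real data, replace the sample data frame with the one loaded by pd.read_csv.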
Scatter Plot
Next, we can have a look at the first 12-month sales amount against age.
Here is the Python code:
import pandas as pd
import matplotlib.pyplot as plt
csvpath = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
df = pd.read_csv(csvpath)
print(df.columns)  # display the columns available
#kind options: 'scatter', 'hist', 'bar',
df.plot(kind = 'scatter', x = 'Age', y = 'LifetimeSales')
plt.title("First 12 Month Sales by Customer Age") # graph title
plt.xlabel('Age') #x axis label
plt.ylabel('First 12 Month Sales') #y axis label
plt.show()
Here is the graph:
We can immediately see that the first 12-month sales amounts range from 10K to 70K. That’s a very high amount and makes me want to go back and check the data. When I go and look at the data, it does look correct, so we can continue.
All customers are over 40. This reflects the database being over 15 years old with no new customers added, but it also means we have some really old customers with an estimated age of 100.
Customers’ ages look to be fairly evenly distributed between 40 and 80, with customers aged 80 and over having lower sales.
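One way to check the impression that sales drop off after 80 is to group customers into age bands and compare the average sales in each band. This is a sketch only: the rows are made-up sample values, the band edges are my own choice, and 'LifetimeSales' is the column name used in the scatter plot code.

```python
import pandas as pd

# Made-up sample rows standing in for outliercheck.csv
df = pd.DataFrame({
    'Age': [42, 48, 55, 63, 67, 74, 79, 83, 91, 100],
    'LifetimeSales': [30000, 45000, 52000, 38000, 61000,
                      27000, 44000, 15000, 12000, 18000],
})

# Band edges are an assumption; right=False makes each band include
# its lower edge and exclude its upper edge, e.g. 40 <= age < 50
bins = [40, 50, 60, 70, 80, 110]
labels = ['40-49', '50-59', '60-69', '70-79', '80+']
df['AgeBand'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)

# Number of customers and average sales per age band
sales_by_band = (
    df.groupby('AgeBand', observed=True)['LifetimeSales']
      .agg(['count', 'mean'])
)
print(sales_by_band)
```

If the average for the 80+ band is clearly lower than the others on the real data, that supports what the scatter plot suggests.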
Pie Graph
The following code gives us a pie chart showing the split of gender in the dataset.
import pandas as pd
import matplotlib.pyplot as plt
csvpath = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
df = pd.read_csv(csvpath)
print(df.columns)  # display the columns available
# replace 0 and 1 in the 'Gender' column with female and male
df['Gender'] = df['Gender'].replace({0: 'female', 1: 'male'})
# return the labels and counts of the unique gender values
gender_counts = df['Gender'].value_counts()
# print(gender_counts)
# Define colors for each category
colors = {'female': 'pink', 'male': 'blue'}
# Plotting
plt.figure(figsize=(4, 4)) # Adjust figure size if needed
plt.pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%',
        colors=[colors[g] for g in gender_counts.index])
plt.title('Distribution of Gender')
plt.show()
The output looks like this. The 0s and 1s in the data were replaced with ‘female’ and ‘male’ by the code.
The data looks pretty normal, as we would expect. There are no nulls, which is good.
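One thing to be aware of is that value_counts() leaves out missing values by default, so a pie chart built from it will not show nulls even if they exist. A short explicit check is a safer way to confirm there are none. The sketch below uses a made-up 'Gender' column that deliberately contains one missing value to show the difference.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing value, for illustration only
df = pd.DataFrame({'Gender': [0, 1, 1, 0, np.nan, 1]})

# Count the missing values directly
null_count = df['Gender'].isna().sum()
print(f"Missing gender values: {null_count}")

# dropna=False keeps missing values as their own category
counts_with_nulls = df['Gender'].value_counts(dropna=False)
print(counts_with_nulls)
```

On the real data, a null count of zero confirms what the pie chart suggests.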
Another Scatter Plot
Next, we look at the ‘FirstPurchaseAmount’ field.
We create a scatter graph with the code below.
# -*- coding: utf-8 -*-
"""
Created on Wed May 8 12:17:26 2024
@author: Admin
"""
import pandas as pd
import matplotlib.pyplot as plt
csvpath = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
df = pd.read_csv(csvpath)
print(df.columns)  # display the columns available
#kind options: 'scatter', 'hist', 'bar',
df.plot(kind = 'scatter', x = 'CustomerID', y = 'FirstPurchaseAmount')
plt.title("Customers by First Purchase Amount") # graph title
plt.xlabel('CustomerID') #x axis label
plt.ylabel('First Purchase Amount') #y axis label
plt.show()
The first thing we notice when we run the graph is that there is one major outlier in the data.
Removing this customer should help improve the model, and we should be able to see more outliers in the graph once it is removed.
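A minimal sketch of how that single customer could be found and removed is shown below. It assumes the outlier is simply the row with the largest 'FirstPurchaseAmount', and it uses made-up sample rows rather than the real CSV, so treat it as one possible approach rather than the only way to do it.

```python
import pandas as pd

# Made-up sample rows with one obviously extreme purchase amount
df = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 5, 6],
    'FirstPurchaseAmount': [120.0, 95.0, 150.0, 9800.0, 110.0, 130.0],
})

# Index label of the row with the largest first purchase amount
outlier_idx = df['FirstPurchaseAmount'].idxmax()
print(df.loc[[outlier_idx]])  # inspect the row before dropping it

# Keep every row except the outlier; the original df is left unchanged
df_clean = df.drop(index=outlier_idx)
print(df_clean['FirstPurchaseAmount'].describe())
```

Re-running the scatter plot on df_clean rescales the y axis, which makes any remaining outliers easier to spot.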