An Example of Poor Correlation using Python

To find the best data to build our model, we need to run correlations on the data. In the initial predictive model we built, we guessed the fields (age, gender, income and first purchase amount), but the model gave a very poor MSE result. We therefore need to go back to the data and look for fields that correlate well with sales amount, as well as checking the fields we have already used.

We can check age and first purchase amount against first 12 month sales amount to confirm that these guessed fields correlate poorly.

Age vs. Sales (first 12 month sales) code:


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

csvlocation = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
df = pd.read_csv(csvlocation)
print(df.columns)  # display the columns available

# Create a scatter plot with regression line
sns.regplot(data=df, x='Age', y='LifetimeSales')
plt.title("First 12 Month Sales by Customer Age")  # graph title
plt.xlabel('Age')  # x axis label
plt.ylabel('First 12 Month Sales')  # y axis label

# Fit a linear regression line
slope, intercept, r_value, p_value, std_err = stats.linregress(df['Age'], df['LifetimeSales'])

# Add equation of the regression line to the plot
plt.text(df['Age'].min(), df['LifetimeSales'].max(), f'y = {slope:.2f}x + {intercept:.2f}', ha='left')

# Calculate correlation coefficient
correlation_coefficient = df['Age'].corr(df['LifetimeSales'])
# Add correlation coefficient to plot
plt.text(df['Age'].max(), df['LifetimeSales'].min(), f'Correlation coefficient: {correlation_coefficient:.2f}', ha='right')


plt.show()

Graphing customer age against first 12 month sales confirms the poor correlation. Here the correlation coefficient is only 0.08.
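To put that number in perspective: for a straight-line fit with a single predictor, the square of the correlation coefficient is the share of the variation in the target that the predictor explains. The short sketch below applies this to the 0.08 reported above (the variable names are illustrative only) and shows that age explains well under 1% of the variation in sales.

```python
# Correlation coefficient reported for Age vs. first 12 month sales
r = 0.08

# For a simple (one-predictor) linear regression, R squared equals r squared
r_squared = r ** 2

print(f"Share of variance explained: {r_squared:.2%}")
```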

Next we can look at First Purchase Amount against First 12 Month Sales Amount.
The code is below. It removes the major outlier, the row with CustomerID = 1, before plotting.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

csvlocation = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
df = pd.read_csv(csvlocation)
df = df[df['CustomerID'] != 1]
print(df.columns)  # display the columns available

# Create a scatter plot with regression line
sns.regplot(data=df, x='FirstPurchaseAmount', y='LifetimeSales')
plt.title("First 12 Month Sales by FirstPurchaseAmount")  # graph title
plt.xlabel('FirstPurchaseAmount')  # x axis label
plt.ylabel('First 12 Month Sales')  # y axis label

# Fit a linear regression line
slope, intercept, r_value, p_value, std_err = stats.linregress(df['FirstPurchaseAmount'], df['LifetimeSales'])

# Add equation of the regression line to the plot
plt.text(df['FirstPurchaseAmount'].min(), df['LifetimeSales'].max(), f'y = {slope:.2f}x + {intercept:.2f}', ha='left')

# Calculate correlation coefficient
correlation_coefficient = df['FirstPurchaseAmount'].corr(df['LifetimeSales'])
# Add correlation coefficient to plot
plt.text(df['FirstPurchaseAmount'].max(), df['LifetimeSales'].min(), f'Correlation coefficient: {correlation_coefficient:.2f}', ha='right')


plt.show()

And here is the graph. The correlation coefficient is only 0.07.
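Since neither field correlates well, the next step is to look across the other fields for better candidates. One possible way to do that, sketched below, is to rank every numeric column by its correlation with the sales column in one pass. The helper name rank_correlations, the sample values and the OrdersInFirstMonth column are all made up for illustration; in practice you would pass in the DataFrame loaded from the CSV.

```python
import pandas as pd


def rank_correlations(df: pd.DataFrame, target: str) -> pd.Series:
    """Return the Pearson correlation of each numeric column with `target`, strongest first."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    # Order by absolute value so strong negative correlations are not overlooked
    return corr.reindex(corr.abs().sort_values(ascending=False).index)


# Made-up data for illustration only (OrdersInFirstMonth is a hypothetical field)
demo = pd.DataFrame({
    "Age": [25, 40, 31, 58, 46, 36],
    "FirstPurchaseAmount": [20, 35, 50, 15, 40, 30],
    "OrdersInFirstMonth": [1, 2, 3, 4, 5, 6],
    "LifetimeSales": [110, 190, 320, 410, 480, 610],
})

print(rank_correlations(demo, "LifetimeSales"))
```

Fields near the top of the ranking are the ones worth plotting individually, as was done for age and first purchase amount above.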
