Tag: regression

  • An Example of Poor Correlation using Python

    In order to find the best data to build our model, we need to run correlations on the data. In the initial predictive model we built, we guessed the fields (age, gender, income and first purchase amount), but the model gave a very poor MSE result, so we need to go back to the data and look for fields that correlate well with sales amount, as well as check the fields we have already used.
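
    As a rough sketch of what "running correlations" could look like, the snippet below ranks every numeric column by the strength of its Pearson correlation with a target column. The function name rank_correlations and the small demo frame are made up for illustration; they are not the data used in this post.

```python
import pandas as pd


def rank_correlations(df: pd.DataFrame, target: str) -> pd.Series:
    """Return the Pearson correlation of each numeric column with `target`,
    ordered from strongest to weakest by absolute value."""
    numeric = df.select_dtypes("number")          # ignore text columns such as gender
    corr = numeric.corr()[target].drop(target)    # drop the target's correlation with itself
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Made-up values for illustration only; not the post's dataset.
demo = pd.DataFrame({
    "Age": [25, 32, 47, 51, 38, 29],
    "Income": [30, 45, 80, 85, 60, 40],
    "FirstPurchaseAmount": [20, 35, 15, 40, 25, 30],
    "LifetimeSales": [110, 160, 290, 310, 215, 150],
})
ranking = rank_correlations(demo, "LifetimeSales")
print(ranking)
```

    On real data, the same call can be pointed at the DataFrame loaded from the CSV, which gives a quick shortlist of candidate fields before plotting any of them.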


    We can start by checking age and first purchase amount against first 12 month sales amount, to confirm that these fields correlate poorly.

    Age vs. Sales (first 12 month sales) code:

    
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from scipy import stats
    
    csvlocation = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
    df = pd.read_csv(csvlocation)
    print(df.columns)  # display the columns available
    
    # Create a scatter plot with regression line
    sns.regplot(data=df, x='Age', y='LifetimeSales')
    plt.title("First 12 Month Sales by Customer Age")  # graph title
    plt.xlabel('Age')  # x axis label
    plt.ylabel('First 12 Month Sales')  # y axis label
    
    # Fit a linear regression line
    slope, intercept, r_value, p_value, std_err = stats.linregress(df['Age'], df['LifetimeSales'])
    
    # Add equation of the regression line to the plot
    plt.text(df['Age'].min(), df['LifetimeSales'].max(), f'y = {slope:.2f}x + {intercept:.2f}', ha='left')
    
    # Calculate correlation coefficient
    correlation_coefficient = df['Age'].corr(df['LifetimeSales'])
    # Add correlation coefficient to plot
    plt.text(df['Age'].max(), df['LifetimeSales'].min(), f'Correlation coefficient: {correlation_coefficient:.2f}', ha='right')
    
    
    plt.show()
    


    The output of graphing customer age against first 12 month sales confirms the poor correlation. Here the correlation coefficient is only 0.08.
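
    To put a coefficient like 0.08 in perspective, it helps to square it: r squared is the share of variance in one variable that a straight-line fit on the other accounts for. The sketch below uses scipy.stats.pearsonr on synthetic, seeded data (not the post's CSV), and the helper name describe_correlation is an assumption for illustration.

```python
import numpy as np
from scipy import stats


def describe_correlation(x, y):
    """Return Pearson r, r squared, and the two-sided p-value for x against y."""
    r, p_value = stats.pearsonr(x, y)
    return {"r": r, "r_squared": r ** 2, "p_value": p_value}


# Synthetic data, seeded for repeatability; sales are generated independently of age.
rng = np.random.default_rng(0)
age = rng.uniform(18, 80, size=500)
sales = rng.normal(1000, 300, size=500)
summary = describe_correlation(age, sales)
print(summary)

# Applying the same idea to the coefficient reported above (r = 0.08):
explained = 0.08 ** 2
print(f"Share of variance explained: {explained:.2%}")
```

    With r = 0.08, r squared is 0.0064, so a linear fit on age accounts for well under 1% of the variation in sales.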



    Next we can look at first purchase amount against first 12 month sales amount.
    The code is below. It also removes a major outlier, the customer with CustomerID = 1.

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from scipy import stats
    
    csvlocation = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
    df = pd.read_csv(csvlocation)
    df = df[df['CustomerID'] != 1]
    print(df.columns)  # display the columns available
    
    # Create a scatter plot with regression line
    sns.regplot(data=df, x='FirstPurchaseAmount', y='LifetimeSales')
    plt.title("First 12 Month Sales by FirstPurchaseAmount")  # graph title
    plt.xlabel('FirstPurchaseAmount')  # x axis label
    plt.ylabel('First 12 Month Sales')  # y axis label
    
    # Fit a linear regression line
    slope, intercept, r_value, p_value, std_err = stats.linregress(df['FirstPurchaseAmount'], df['LifetimeSales'])
    
    # Add equation of the regression line to the plot
    plt.text(df['FirstPurchaseAmount'].min(), df['LifetimeSales'].max(), f'y = {slope:.2f}x + {intercept:.2f}', ha='left')
    
    # Calculate correlation coefficient
    correlation_coefficient = df['FirstPurchaseAmount'].corr(df['LifetimeSales'])
    # Add correlation coefficient to plot
    plt.text(df['FirstPurchaseAmount'].max(), df['LifetimeSales'].min(), f'Correlation coefficient: {correlation_coefficient:.2f}', ha='right')
    
    
    plt.show()
    
    


    And here is the graph. The correlation coefficient is only 0.07.
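
    The reason for dropping the outlier before measuring is that a single extreme customer can dominate a Pearson coefficient. The sketch below compares the coefficient with and without one ID, using made-up numbers rather than the post's data; the helper name corr_with_and_without is hypothetical.

```python
import pandas as pd


def corr_with_and_without(df, x, y, id_col, outlier_id):
    """Return (correlation on all rows, correlation with one ID excluded)."""
    full = df[x].corr(df[y])
    trimmed_df = df[df[id_col] != outlier_id]
    trimmed = trimmed_df[x].corr(trimmed_df[y])
    return full, trimmed


# Made-up data: customer 1 is an extreme point, customers 2-7 show little trend.
demo = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5, 6, 7],
    "FirstPurchaseAmount": [5000, 20, 35, 15, 40, 25, 30],
    "LifetimeSales": [90000, 110, 160, 290, 310, 215, 150],
})
full, trimmed = corr_with_and_without(
    demo, "FirstPurchaseAmount", "LifetimeSales", "CustomerID", 1
)
print(f"All customers: {full:.2f}")
print(f"Without customer 1: {trimmed:.2f}")
```

    In this toy example the single extreme customer makes the two fields look almost perfectly correlated, while the remaining customers show only a weak relationship, which is why the outlier is removed before reading anything into the coefficient.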