Creating a simple linear regression model and preparing for multiple linear regression.
In this example, we use a sample of marketing spend vs. sales data and inspect the relationship between radio spend and total sales. The regression line is fitted using the ols function from statsmodels.formula.api.
import pandas as pd
import seaborn as sns
from statsmodels.formula.api import ols
import statsmodels.api as sm
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)
df = pd.read_csv("marketing_sales_data.csv")
# Drop rows with any null values
df.dropna(inplace=True)
# Drop duplicate rows, if there are any
if df.duplicated().sum() > 0:
    df.drop_duplicates(inplace=True)
# Rename columns containing spaces to snake_case so they can be used in the ols formula
df.rename(columns={'Social Media': 'Social_Media'}, inplace=True)
# Ordinal encoding: map the ordered TV categories to integers
tv_dict = {'Low': 1, 'Medium': 2, 'High': 3}
df['TV'] = df['TV'].map(tv_dict)
# One-hot encoding for non-ordinal variable
df = pd.get_dummies(df, columns=['Influencer'], dtype=int)
# Define and fit the model
ols_formula = "Sales ~ Radio"
OLS = ols(formula=ols_formula, data=df)
model = OLS.fit()
# Print the statistical summary, including R-squared and the beta coefficients
summary = model.summary()
print(summary)
# Calculate residuals and predicted values
residuals = model.resid
y_pred = model.predict(df[['Radio']])  # pass a DataFrame so the formula can look up Radio by name
The results come from the summary() method of the fitted OLS (ordinary least squares) model in statsmodels. R-squared is 0.757, meaning that about 76% of the variability in y (sales) is accounted for by radio spend. However, if we look at other media, we will see that other variables, such as TV, are also strongly correlated with sales.
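To make the R-squared figure more concrete, here is a minimal sketch that computes it by hand as 1 - SS_res / SS_tot for a least-squares line with one predictor. The five data points are invented for illustration and are not taken from marketing_sales_data.csv; the variable names are also hypothetical.

```python
# Toy data, invented for illustration only (not from the marketing dataset)
radio = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(radio)
mean_x = sum(radio) / n
mean_y = sum(sales) / n

# Least-squares slope and intercept for a single predictor
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(radio, sales))
s_xx = sum((x - mean_x) ** 2 for x in radio)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Residual sum of squares (unexplained) and total sum of squares
predicted = [intercept + slope * x for x in radio]
ss_res = sum((y - p) ** 2 for y, p in zip(sales, predicted))
ss_tot = sum((y - mean_y) ** 2 for y in sales)

# Share of the variability in sales accounted for by the fitted line
r_squared = 1 - ss_res / ss_tot
print(f"slope={slope:.2f}, intercept={intercept:.2f}, R-squared={r_squared:.4f}")
```

On the real data, the same quantity should be available directly as model.rsquared, so the manual calculation is only meant to show what the number measures.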
The beta coefficient for radio spend is 8.17, which means that each additional $1 million in radio spend is associated with an average increase of $8.17 million in sales.
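The interpretation of the coefficient can be checked with a small sketch: on a fitted line, two predictions whose radio spend differs by one unit differ by exactly the slope. The slope of 8.17 is the value quoted above; the intercept of 40.0 and the two budgets are hypothetical placeholders, since the real intercept is not reported in this section (it can be read from model.params["Intercept"]).

```python
SLOPE = 8.17      # radio coefficient quoted in the text, in $ millions of sales
INTERCEPT = 40.0  # hypothetical placeholder, not the fitted value


def predicted_sales(radio_spend):
    """Predicted sales ($ millions) for a given radio spend ($ millions)."""
    return INTERCEPT + SLOPE * radio_spend


# Two hypothetical budgets that are $1 million apart
increase = predicted_sales(11.0) - predicted_sales(10.0)
print(round(increase, 2))
```

Whatever the intercept is, it cancels out in the difference, which is why the slope alone describes the change in predicted sales per extra $1 million of radio spend.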