Creating a simple linear regression model and preparing for multiple linear regression.

In this example, we use a sample of marketing spend and sales data and inspect the relationship between radio spend and total sales. The regression line is fitted using the ols function from statsmodels.formula.api.

```
import pandas as pd
import seaborn as sns
from statsmodels.formula.api import ols
import statsmodels.api as sm
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)
df = pd.read_csv("marketing_sales_data.csv")
# Drop rows with any null values
df.dropna(inplace=True)
# Check and handle duplicates if needed
if df.duplicated().sum() > 0:
    df.drop_duplicates(inplace=True)
# Rename columns so they contain no spaces and can be used in a formula
df.rename(columns = {'Social Media': 'Social_Media'}, inplace = True)
# Ordinal encoding for the ordered TV categories
tv_dict = {'Low': 1, 'Medium': 2, 'High': 3}
df['TV'] = df['TV'].replace(tv_dict)
# One-hot encoding for non-ordinal variable
df = pd.get_dummies(df, columns=['Influencer'], dtype=int)
# Define and fit the model
ols_formula = "Sales ~ Radio"
OLS = ols(formula=ols_formula, data=df)
model = OLS.fit()
summary = model.summary()
print(summary)  # Prints the statistical summary, including R-squared and the beta coefficients
# Calculate residuals and predicted values
residuals = model.resid
y_pred = model.predict(df[['Radio']])  # Pass a DataFrame so the formula can look up the column by name
```
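
The residuals and predicted values computed at the end of the listing can be sanity-checked on a small synthetic frame. The sketch below is only an illustration: the `demo` data, its coefficients, and the names `demo_model`, `demo_pred`, and `demo_resid` are made up and are not part of the marketing dataset.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in for the marketing data (all values are invented for illustration)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"Radio": rng.uniform(0, 50, size=200)})
demo["Sales"] = 40 + 8 * demo["Radio"] + rng.normal(0, 20, size=200)

demo_model = ols("Sales ~ Radio", data=demo).fit()

# Predictions for the training rows and the residuals from the fit
demo_pred = demo_model.predict(demo[["Radio"]])
demo_resid = demo_model.resid

# Residuals are observed minus predicted, so the two pieces rebuild Sales
rebuilt = demo_pred + demo_resid
print(np.allclose(rebuilt, demo["Sales"]))

# Predicting on the training data reproduces the stored fitted values
print(np.allclose(demo_pred, demo_model.fittedvalues))
```

Both checks should print True, which confirms that the residuals are simply the gap between observed and predicted sales.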

These results come from the summary() method of the fitted OLS (ordinary least squares) model in statsmodels. R-squared is 0.757, meaning that about 76% of the variability in y (sales) is accounted for by radio spend. However, other variables such as TV are also strongly correlated with sales, which is the motivation for moving on to multiple linear regression.
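
As a rough illustration of how R-squared relates to correlation, the sketch below fits the same kind of one-predictor model on synthetic data and compares R-squared with the squared Pearson correlation. The `demo` frame and every number in it are invented, with TV already encoded as 1 to 3; only the column names mirror the marketing data.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in for the marketing data; every number here is made up
rng = np.random.default_rng(1)
n = 300
demo = pd.DataFrame({"TV": rng.integers(1, 4, size=n)})  # 1=Low, 2=Medium, 3=High
demo["Radio"] = 6 * demo["TV"] + rng.normal(0, 3, size=n)
demo["Sales"] = 8 * demo["Radio"] + 30 * demo["TV"] + rng.normal(0, 20, size=n)

demo_model = ols("Sales ~ Radio", data=demo).fit()

# Pearson correlation of each predictor with Sales
corr_with_sales = demo.corr()["Sales"].drop("Sales")
print(corr_with_sales)

# With one predictor and an intercept, R-squared equals the squared correlation
r_radio = corr_with_sales["Radio"]
print(demo_model.rsquared, r_radio ** 2)
```

The two printed values on the last line should agree, and the correlation table shows at a glance which other predictors are candidates for a multiple regression.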

The beta coefficient for radio spend is 8.17, which means that each additional $1 million in radio spend is associated with an increase of about $8.17 million in sales, on average.
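
One way to see this interpretation is to predict sales for two radio budgets that differ by one unit and compare the gap with the fitted slope. This is a minimal sketch on synthetic data: the `demo` frame, its coefficients, and the budgets of 10 and 11 are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in for the marketing data (invented values)
rng = np.random.default_rng(2)
demo = pd.DataFrame({"Radio": rng.uniform(0, 50, size=200)})
demo["Sales"] = 40 + 8 * demo["Radio"] + rng.normal(0, 20, size=200)

demo_model = ols("Sales ~ Radio", data=demo).fit()

# Fitted coefficients are stored by name on the results object
slope = demo_model.params["Radio"]
intercept = demo_model.params["Intercept"]

# Two hypothetical budgets that differ by exactly one unit of radio spend
two_budgets = pd.DataFrame({"Radio": [10.0, 11.0]})
pred = demo_model.predict(two_budgets)

# The difference in predicted sales equals the slope
gain = pred.iloc[1] - pred.iloc[0]
print(slope, gain)
```

The same pattern applies to the fitted marketing model: `model.params["Radio"]` holds the coefficient that the summary reports.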