Creating a simple Linear regression model and preparing for multi-linear regression.
In this example, we use a sample of marketing spend data vs. sales and inspect the correlation between radio spend and total sales. The regression line is fitted using the ols function from statsmodels.formula.api
You can download the original file from Kaggle here then just replace the location you save it to in the (df = pd.read_csv() line.
Results return from the model.summary() method from the OLS (ordinary least squares) function from the statsmodels module. R squared is calculated as 0.757 meaning 76% of the variability in y (sales) is accounted for by radio. However, if we look at other media, we will see that other variables (TV) also have a strong correlation.
The Beta coefficient for radio spend is 8.17, which means that for every $1 million in Radio spend, we get $8.17 million in sales.
A sample of the population is tested for height. Do they match the known population? Population mean and standard deviation are known, sample > 30
1
n < 30 and unknown variance
One-sample t-test
1 and 2
t-score
Students t
n-1
A company wants to compare the average weekly sales of a new product to the historical average weekly sales of similar products, which is $5000.
2
Same 2 groups (before and after). 2 Dependent samples.
Paired t-test
1 and 2
t-score
Students t
n-1
A company implements a new training program and wants to determine its effectiveness by comparing employee performance before and after the training.
2*
2 Independent groups, equal variance
Independent t-test (equal variance – pooled)
1 or 2
t-score
Students t
n1+n2−2
A retailer wants to compare the average sales of two different stores located in different regions.
2*
2 Independent groups, unequal variance
Independent t-test (Unequal/Welch’s)
1 and 2
t-score
Students t
n1+n2−2
3+
3+ groups – look at one variable
One-Way ANOVA
1 Only
F-score
F
between groups k-1 (where k is num of groups).
Three groups are given different drugs in a test to see the improvement in blood sugar.
3+
3+ groups – look at multiple variables
Two-way ANOVA
1 Only
F-score
F
Effect of water and sun on height of sunflowers, where 3+ combinations of water and sun (3 + groups)
2 Categories
2 Categories – comparison is to see if they are related
Pearsons Chi-Square
neither, just different
Chi-Squared
Chi-Squared
(no. of rows – 1) * (no. of columns – 1).
Individuals that received social media vs. those that received email – 2 Groups purchased and not purchased.
Correlation
2
Correlation
Pearsons Correlation Coefficient (R)
1 or 2
t-statistic
t-distribution table
n−2
Relationship between daily hours of sunlight and temperature. Best for continuous data with linear relationships.
Provides both the direction (positive or negative) and the strength of a linear relationship between two variables. Note for simple linear regression. R squared is the Pearson r squared. It ranges between -1 and 1
2
Correlation
Spearmans Rank
1 or 2
t-statistic?
Spearmans Rho?
n-2
Relationship between the ranks of employees’ performance and their years of experience. Best for ordinal data or when data does not meet Pearson’s assumptions.
r Squared (used in regression analysis)
One (simple linear regression) or more independent variables (Multiple linear regression) + 1 dependent variable
Not used in significance testing directly
NA
Explaining the variance in academic performance based on study hours, attendance, and other factors. Used in the context of regression analysis to explain the proportion of variance in the dependent variable.
Indicates the proportion of the variance in the dependent variable that can be explained by the independent variable(s). It does not provide information about the direction of the relationship. Ranges from 0 to 1
Regression Analysis
F-Statistic
F-Distribution
Fit a regression model and computer squared. Perform an F-test to determine overall significance.
Non-Parametric Tests
Not Normal distribution
2
Compares distributions of the 2 groups
Mann Whitney U
U Statistic
Mann-Whitney U Distribution
NA
2 Independent groups (like the t-test)
2
Compares Medians of the 2 groups
Wilcoxen signed rank
T Statistic
Wilcoxen signed rank table
NA
When comparing two related samples, matched samples, or repeated measurements on a single sample.
* Note: Uses Levenes test to determine equal/unequal variance
Significant Testing Steps
Null hypothesis: There is no significant difference between variables between the two groups.
Examples: 1. In a drug efficacy study, the null hypothesis might state that the drug does not affect patients compared to a placebo. 2. The mean cholesterol levels of patients taking the new drug are equal to the mean cholesterol levels of patients taking a placebo. 3. The mean sales after the advertising campaign are equal to the mean sales before the campaign. 4. The mean sales of the product with the new packaging design are equal to the mean sales of the product with the old packaging design.
Alternative hypothesis: There is a significant difference between the means of the groups.
The significance level (𝛼α) is the probability threshold for rejecting the null hypothesis. Commonly used significance levels include 0.05, 0.01, and 0.10. e.g. 5%: The result of the test only has a 5% chance of occurring.
Calculate test statistic (t-value, z-value, etc). The test statistic is used to look up the p-value from the relevant distribution tables.
4. Calculate the p-value using the test statistic. The p-value is used to determine if the test is significant. e.g. a p-value of less than 5% (0.05) is significant and we can reject the null hypothesis. If the p-value is higher than the significance level, then the results are reported as statistically significant.
1 or 2 Tailed Tests One-tailed test if measuring greater or less than Left or Right Sided Two-tailed if measuring the difference
Type 1 and Type 2 Errors Type 1 Error: We reject the null hypothesis even though it was true (Alpha) Type 2 Error: We accept the null hypothesis even though it was false (Beta)
Two Methods of accepting or rejecting Null Hypothesis 1. Critical Values (taken from normal or t-distribution) 2. P-values (if p < or > than 0.05)
Linear Regression using Salary and Years of Experience Data
Data Source: Salary_dataset.csv Kaggle The salary data set includes 2 columns: Years Experience which will be our independent variable (X) and Salary (Y).
Linear regression is a fundamental statistical method used to model the relationship between a dependent variable and one or more independent variables. The primary goal of linear regression is to predict the value of the dependent variable based on the values of the independent variables (Chat GPT)
For this example: First, we want to see if there is a correlation between the 2 variables by building a regression line and calculating r squared. Then we want to assess the significance of the relationship using the p-value to test the null hypothesis that there is no relationship between X and Y (X does not predict Y). Simple Linear Regression Formula Formula: Y = b + aX + e
Dependent Variable (Y): The outcome or the variable we aim to predict or explain.
Independent Variable(s) (X): The variable(s) used to predict or explain changes in the dependent variable.
a: is the slope change in Y for a one-unit change in X
b: is the intercept of the Y axis. This represents the value of Y when X is zero.
The following Python code (which I used ChatGTP to optimize) calculates the regression line, and p-value and evaluates the null hypothesis.
The steps are as follows: 1. Use Pandas to import the CSV file to a data frame and convert each series to an array. 2. Fit a linear regression model using Sklearn that returns the slope and intercept of the regression line. 3. The ‘Stats’ module library is then used to calculate the R-squared value and p-value of the slope. 4. The null hypothesis is then evaluated based on the p-value 5. Scipy.stats and Matplotlib.pyplot are then used to calculate the used to plot the regression line on a graph.
importpandasaspdimportnumpyasnpimportmatplotlib.pyplotaspltfromsklearn.linear_modelimportLinearRegressionimportstatsmodels.apiassmfromscipyimportstatsdefload_data(csv_location):"""Load CSV data into a pandas DataFrame."""df = pd.read_csv(csv_location)returndfdefprepare_data(df):"""Prepare independent and dependent variables for regression."""X = df[['YearsExperience']].values # Extractas 2Darrayy = df['Salary'].values # Extractas 1DarrayreturnX,ydeffit_sklearn_model(X,y):"""Fit a linear regression model using sklearn."""model = LinearRegression()model.fit(X,y)returnmodeldeffit_statsmodels_ols(X,y):"""Fit a linear regression model using statsmodels OLS."""X_with_const = sm.add_constant(X) # Addanintercepttothemodelmodel = sm.OLS(y,X_with_const).fit()returnmodeldefplot_regression_line(df,intercept,slope):"""Plot the regression line along with data points."""plt.scatter(df['YearsExperience'],df['Salary'],color='blue',label='Data points')plt.plot(df['YearsExperience'],intercept + slope*df['YearsExperience'],color='red',label='Regression line')plt.title("Salary by Years of Experience")plt.xlabel('Years of Experience')plt.ylabel('Salary')plt.legend()plt.show()defmain():csv_location = "salary_dataset.csv"df = load_data(csv_location) # Displaybasicstatistics #print(df.describe())X,y = prepare_data(df) # Fitthemodelusingsklearnsklearn_model = fit_sklearn_model(X,y)intercept,slope = sklearn_model.intercept_,sklearn_model.coef_[0]print("Calculation of Regression Line:\n")print(f"Intercept is: {intercept}")print(f"Slope is: {slope}") # Fitthemodelusingstatsmodelstogetp-valuesandR-squaredstatsmodels_model = fit_statsmodels_ols(X,y) # print(statsmodels_model.summary()) # ExtractR-squaredandp-valuesr_squared = statsmodels_model.rsquaredp_values = statsmodels_model.pvaluesprint(f"R-squared: {r_squared}") #print(f"P-values: {p_values}") # Extractingspecificp-valuesbyindexintercept_p_value = p_values[0] # Firstp-value (intercept)slope_p_value = p_values[1] # Secondp-value (YearsExperience) #print(f"Intercept p-value: {intercept_p_value}")print(f"p-value (YearsExperience): {slope_p_value}")print("\nThe p-value is the probability of observing a t-statistic as extreme as, or more extreme than, the one calculated from your sample data, under the assumption that the null hypothesis is true.") print("This is obtained from the t-distribution with n−2 degrees of freedom ")print("where n is the number of observations\n")ifslope_p_value > 0.05:print("P-value is not signficant and therefore we accept the null hypothesis")ifslope_p_value < 0.05:print("P-value is less than 0.05 and therefore we reject the null hypothesis. This means there is strong evidence that the predictor 𝑋 has a statistically significant effect on the outcome 𝑌") # Plottingtheregressionlineplot_regression_line(df,intercept,slope) # Fitalinearregressionlineusingscipy.stats (forcomparison)slope,intercept,r_value,p_value,std_err = stats.linregress(df['YearsExperience'],df['Salary']) # plt.text(df['YearsExperience'].min(),df['Salary'].max(),f'y = {slope:.2f}x + {intercept:.2f}',ha='left')if__name__ == "__main__":main()
In business, the most commonly used t-test is often the Independent Two-Sample t-test. This test is widely utilized to compare the means of two independent groups or samples to determine if there is a significant difference between them.
The Independent Two-Sample t-test is particularly useful in various business scenarios, such as:
Comparing the performance metrics of two different products or services.
Analyzing the effectiveness of different marketing strategies or campaigns.
Assessing the impact of changes in processes or procedures on business outcomes.
Evaluating the differences in customer satisfaction levels between different segments or demographics.
Comparing the financial performance of different investment portfolios or asset classes.
One-Sample t-Test
Purpose:
To determine if the mean of a single sample differs significantly from a known or hypothesized population mean.
Example:
Scenario: A company wants to compare the average weekly sales of a new product to the historical average weekly sales of similar products, which is $5000.
Hypothesis: The mean weekly sales of the new product is different from $5000.
Application: The company collects weekly sales data for the new product over several weeks and uses the one-sample t-test to determine if the average sales differ from $5000.
Calculation x-X/sd/SQRT(n) Where x = sample mean, X = est., population mean, n = sample size
Nominator: Mean – hypothesized population mean Denominator: sample standard deviation/SQRT(sample size n)
2. Independent Two-sample t-test
Purpose:
To compare the means of two independent groups to see if they differ significantly.
Null Hypothesis: The means of 2 groups are equal. If rejected, there is a significant difference between the 2 means.
Types:
Equal Variance (Pooled) t-Test: Assumes that both groups have the same variance.
Unequal Variance (Welch’s) t-Test: Does not assume equal variances between the groups.
How do we determine if they have the same variance? Use the Levenes Test first to determine if the variances are equal between groups
Example (Equal Variance) or Pooled t-Test:
Scenario: A retailer wants to compare the average sales of two different stores located in different regions.
Hypothesis: The mean sales of Store A are equal to the mean sales of Store B.
Application: Sales data from Store A and Store B are collected over the same time period. An independent two-sample t-test assuming equal variances is used to compare the average sales.
t-Value = Complex Calculation
Degrees of Freedom df=n1+n2−2
Example (Unequal Variance, Welch’s t-test):
Scenario: A company wants to compare the average salaries of employees in two different departments.
Hypothesis: The mean salary in Department X is equal to the mean salary in Department Y.
Application: Salary data from both departments are collected. An independent two-sample t-test with unequal variances (Welch’s t-test) is used to compare the average salaries.
Calculation Usage Independent t-test is used when the number of samples and the variance of the two data sets is different.
To compare the means of two related groups (e.g., the same subjects measured at two different times).
To test if the sample mean is significantly greater than or less than the hypothesized population mean (used for directional hypotheses).
Business Example:
Scenario: A company implements a new training program and wants to determine its effectiveness by comparing employee performance before and after the training.
Hypothesis: The mean performance score after training is different from the mean performance score before training.
Application: Performance scores of employees are recorded before and after the training. A paired sample t-test is used to compare the average performance scores before and after the training.
More Examples:
example: Comparing the mean blood pressure of patients before and after treatment example: Testing if a new drug increases recovery rates (greater than) compared to the known recovery rate without the drug.
Calculation T-Value = nominator: mean1 – mean2 (difference between the means) denominator: s(Diff)/SQRT(n)
s(Diff) = (standard deviation of the differences of the paired data values) n = sample size (number of paired differences)
Degrees of Freedom:
n-1 = degrees of freedom
One Side or Two-Sided:
a) One-Sided t-Test
Purpose:
To test if the sample mean is significantly greater than or less than the hypothesized population mean (used for directional hypotheses).
Example:
Scenario: A company believes that a new marketing campaign will increase the average number of customers visiting their store per day compared to the previous average of 100 customers.
Hypothesis: The mean number of customers after the campaign is greater than 100.
Application: The company collects daily customer visit data after the campaign starts. A one-sided t-test is used to determine if the average number of customers is significantly greater than 100.
Example 2: Comparing the average test score of a class to a national average
b) Two-Sided t-Test Purpose:
To test if the sample mean is significantly different from the hypothesized population mean (used for non-directional hypotheses).
Example:
Scenario: A company wants to compare the customer satisfaction ratings of two different products to see if there is any difference.
Hypothesis: The mean satisfaction rating of Product A is different from the mean satisfaction rating of Product B.
Application: Customer satisfaction ratings for both products are collected. A two-sided t-test is used to compare the average satisfaction ratings of the two products.
Levenes Test – Python Code: Test if the variances of two independent groups are equal – use for determining whether to use the Equal or Unequal Independent Samples test
If the p-value for the Levene test is greater than .05, then the variances are not significantly different from each other (i.e., the homogeneity assumption of the variance is met). In this case, we can use the Independent Samples Equal Variance T-Test If they are significantly different (p<0.05), then we use the Independent Samples Equal Variance T-Test
Analysis: UK House Price Index Data Source: Office Of National Statistics Period: Monthly Graph: Showing December Yearly Prices
Linear Trendline
Polynomial Trendline (order 3)
Observations (from the raw data): Over the 19 years between YE 2005 and YE 2023, average UK house prices have increased from: £160,209 in 2006 to £284,681 in 2023 or 78% with an average yearly increase of £6,551
The maximum Average House Price was: £288,784 in December 2022.
The drop from 2007 to 2008 of £28,239 is evident and goes against the positive trend relating to the 2007 financial crash. The drop from 2022 to 2023 of £4053 relates to the higher interest rates in the UK and the cost of living crisis.
Trendlines
If we add a linear regression line (left graph), to the data points, we get an R-squared score of 0.85%, meaning 85% of the variability of average house prices is accounted for by year, which is a pretty strong correlation as we would expect.
If we add the forecast based on the linear regression line equation (not included here), we get £303,618 in 2028. (I’ve used the FORECAST.LINEAR() Excel function to generate the forecasts).
An alternative is to use the polynomial regression line which is used when there are peaks and valleys in the data. In this example, we are using an order 3 polynomial trendline which accounts for 2 peaks or valleys. This gives an R-squared value of 0.96, which appears to be a very good fit, but I’m somewhat hesitant as we’re right at the point of a 2nd drop. One could look back at the years following 2007 and apply the growth to the years following 2022. This would be similar to a seasonal adjustment, but I’m not sure what tool will pick this up at present (and of course, it’s not the same situation).
If we apply the forecasting from the Polynomial regression trendline, we get £389,243 for 2028.
So the linear regression forecast gives £303K for 2018 and the Polynomial regression method gives £389K, that’s a pretty big difference. As you can see the polynomial regression is less impacted by the peaks and troughs than the regression line.
If we use the Excel forecast tool which uses the ‘AAA Exponential Triple Smoothing (ETS) algorithm’, for the next 5 years, we can see the forecast of £319,647 for 2028, however, the upper and lower (non-bold) lines indicate the confidence levels, which indicate we can the estimate will lie between £204,369 and £434,925 with 95% certainty, which is very uncertain.
Some of the reasons for this are that we only have 18 years (18 data points) to work with, we have 2 dips in the data, and a somewhat flat period, this doesn’t help the model and hence the confidence limit is less certain.
The algorithm is best used for linear seasonal fluctuating data, which isn’t quite what we have here.
To create a better model, we could create a multi-variate model, including additional features such as interest rate, population, housing supply, and region. We can also see if there is more data available pre-2005.
Descriptive statistics are general statistics used to describe data. Statistics such as average customer value and average sales amount are typically used as Key Performance Indicators for the business.
In the graph below we plot each sales amount for a fictitious electrical store. It could also represent a supermarket’s sales, charitable donations, purchases from a manufacturer, or any other business where you have this kind of data.
This is just a graph created from Excel using a list table of 115 sales amounts. Looking at the data visually, we can make some observations. The average sales look to be around £60 at a guess, there are a few high sales amounts that look like outliers. Outliers lie outside the normal range and so we need to calculate what that normal range is, to determine what to exclude.
Mean Mean or average is probably the most commonly used descriptive statistic. Everyone understands averages, so they are commonly used throughout analysis and reporting.
In the example above the mean is £74.18, so the guess wasn’t far out. The means is just the total sum of all the sale amounts divided by the count of sale amounts. I’m lazy so I just selected the column of sales amounts in Excel and looked at the average at the bottom of the sheet. This also gives you the total and the count, used to calculate the sum. The formula in Excel looks like =AVERAGE(A2:A116) where A2:A116 is the range of cells holding the range of sales amounts.
Examples of using the mean are as follows:
Average Order Value (AOV) – A commonly used statistic to gauge the average value of orders. Businesses will try and improve this number by increasing prices or selling more to existing customers. Typically the value you be presented along with other KPIs such as Revenue, Number of customers, etc.
Average Customer Value (ACV) – A measure that can be used over time, such as a month, quarter, or year to determine the the average value of customers. This helps planning, in that to achieve x amount of revenue a business may need to sell to x amount of customers with an AOV of x.
Average Open Rate, Average Clickthrough Rate – Common statistics used in email marketing.
Average Daily Revenue, Average Monthly Revenue – Overall averages that a business can use to forecast its monthly revenue. e.g. Monthly revenue = Average daily revenue x 30/31 (assuming a flat sales pattern, which is often not the case). Using trend lines or phasing revenue is a better way of forecasting or breaking down budgets by day, week, or month.
If you use the average to predict the sales for the rest of the month, you can see due to the outlier, you run the risk of grossly overestimating. Total sales for the week commencing 1st November is £8531, which you can see in the screenshot above. So the daily run rate is £1,219 for the first week in November (£8531/7). So to forecast the month I could do something like £1219 * 30 as there are 30 days in November. So that gives me £36,560. Businesses love talking about run-rate as they are always focused on revenue. Using historical data, you should be able to get a picture of the historical run rate, to compare how you are currently performing. Businesses will want to look at this vs. the previous year as seasonality impacts a lot of businesses. November is different from October.
Note: for newbies, you might wonder how you get all the sale amounts. Well, essentially what you’re doing is picking up all the sales receipts and entering the totals onto a list from which you can perform your calculations. In the modern world of cause, this is all captured in a database. Organizations typically develop data warehouses, which take a copy of this information into a safe non-live database. SQL is used to report on the data or to hook the data into platforms such as data cubes or Power BI workflows, where analysts can access the data easily. Note, that you may not have to do all the nitty gritty work to get your data. It all depends on how the business is organized.
Median
Median is less used, but can still be useful when displayed alongside the Medium. An average can be influenced more by high or low outliers, which can distort the picture.
In our Electric Now sales store example, the medium is £61.44. It’s quite a bit less than the average of £74.18 and as we thought, the outliers are skewing the data up a bit. Having flagged the medium you may want to remove outliers to give a forecast of to give you a forecast range. In Excel, the formula is: =MEDIAN(A2:A116) where A2:A116 is the range of sales amounts.
To manually get the median, sort your sales amounts from lowest to highest and find the number in the middle.
Mode The mode is useful when you want to know the most frequent points in a set of data.
For example, if you had a subscription option to a website with variable rates available. Looking at the mode will tell you the most common subscription amount.
If your data varies significantly, you might find it helps to group your data into bands in order to create more frequent groups. For example, if you had 1000 sales amounts with all varying pence amounts, these could be banded into groups of £10-15, £20-25, £30-35, etc. This would also be much easier to graph.
If we pivot the data from the Electric Now shop in the spreadsheet and graph the data, we can see that £39.44 is the most common sales amount, but not by much. I can also just use the formula =MODE(A2:A116)
As this isn’t very informative we can choose to group, bucket, or band the data together as in the graph below. From this we can make some better higher-level observations, such as sales amounts are evenly distributed from 25 to 100. We don’t have many sales below 25.
But to be more accurate, we want to say what % of sales are in these ranges. As in the table below.
Min and Max Depending on the data in question, the Min and the Max can be useful, when used in combination with the other statistics. The Min is just the minimum value in your data range and the Max is the Maximum.
In the Electric Now example, the Min is £2.12 and the Max is £412.
What it does tell you is that some orders can be quite small amounts. If you wanted to have a report on the maximum sale amount that you wanted to beat, it would be useful to display the MAX sale amount.
Once you’ve understood all these descriptive statistics work, you can skip ahead by using the Excel data analysis tool-pack (File > More > Options > Add-ins > Analysis Toolpack > Go. You can then summarize the data using the descriptive statistics tool in a table like the one below. Note the range is here is the difference between the Min and the Max. The standard deviation we can look at next.
Standard Deviation The standard deviation helps describe how data points are distributed relative to the mean. Are they all close to the mean, do they distribute evenly from the mean in the form of a normal distribution? In a normal distribution, 68% of data lies within 1 standard deviation of the mean, 95% of data lies within 2 standard deviations of the mean and 97.5% of data lies within 3 standard deviations of the mean.
As we saw in the description statistics table, the standard deviation is 59.6, so with our mean of £74.2, The range of values between £44.4 to £104 are within the first standard deviation of the mean and include 68% of our data. Likewise, values between £14.6 and £134 are within two standard deviations of the mean and include 95% of our data. -Lastly, values between -£15 and £163 are within three standard deviations of the mean and include 97.5% of our data.
Manually calculations of the Standard deviation go like this: 1. Sum up all the differences of the values from the mean and square the result. In this example that equals: 404479 2. Next Divide that number by Count -1 or 114. In this example we get 3548 this is the variance. 3. The Standard deviation is the square root of the variance which is 59.6
Plotting all the sales amounts on a single axis and overlaying the standard distribution points (blue is 1st, red is 2nd, and yellow is 3rd – each including the whole of the range in between, we get a graph as below.