Category: Statistics

  • Using Python for Simple Linear Regression

    Creating a simple Linear regression model and preparing for multi-linear regression.

    In this example, we use a sample of marketing spend data vs. sales and inspect the correlation between radio spend and total sales. The regression line is fitted using the ols function from statsmodels.formula.api

    You can download the original file from Kaggle here then just replace the location you save it to in the (df = pd.read_csv() line.

    
    import pandas as pd
    import seaborn as sns
    from statsmodels.formula.api import ols
    import statsmodels.api as sm
    import matplotlib.pyplot as plt
    pd.set_option('display.max_columns', None)
    df = pd.read_csv("marketing_sales_data.csv")
    
    # Drop rows with any null values
    df.dropna(inplace=True)
    
    # Check and handle duplicates if needed
    if df.duplicated().sum() > 0:
        df.drop_duplicates(inplace=True)
    
    #rename columns to snake 
    df.rename(columns = {'Social Media': 'Social_Media'}, inplace = True)
    
    # Simple order encoding
    tv_dict = {'Low': 1, 'Medium': 2, 'High': 3}
    df['TV'] = df['TV'].replace(tv_dict)
    
    # One-hot encoding for non-ordinal variable
    df = pd.get_dummies(df, columns=['Influencer'], dtype=int)
    
    
    # Define and fit the model
    ols_formula = "Sales ~ Radio"
    OLS = ols(formula=ols_formula, data=df)
    model = OLS.fit()
    summary = model.summary()
    print(summary)  #Prints off the statistical summary including R squared and the Beta coefficients.
    
    # Calculate residuals and predicted values
    residuals = model.resid
    y_pred = model.predict(df['Radio'])
    

    Results return from the model.summary() method from the OLS (ordinary least squares) function from the statsmodels module. R squared is calculated as 0.757 meaning 76% of the variability in y (sales) is accounted for by radio. However, if we look at other media, we will see that other variables (TV) also have a strong correlation.

    The Beta coefficient for radio spend is 8.17, which means that for every $1 million in Radio spend, we get $8.17 million in sales.

  • How to Select the Correct Statistical Testing Test (A Table)

    Significance Tests (normal distribution assumed)

    GroupsRequirementsTestTailsStatisticDistributionDegrees of freedomExample
    1normal dist, n > 30 and known variance z-test1 and 2z-scoreNormalNAA sample of the population is tested for height. Do they match the known population?
    Population mean and standard deviation are known, sample > 30
    1n < 30 and unknown varianceOne-sample t-test1 and 2t-scoreStudents tn-1 A company wants to compare the average weekly sales of a new product to the historical average weekly sales of similar products, which is $5000.
    2Same 2 groups (before and after).
    2 Dependent samples.
    Paired t-test 1 and 2t-scoreStudents t

    n-1A company implements a new training program and wants to determine its effectiveness by comparing employee performance before and after the training.
    2*2 Independent groups, equal varianceIndependent t-test (equal variance – pooled)1 or 2t-scoreStudents t

    n1​+n2​−2A retailer wants to compare the average sales of two different stores located in different regions.
    2*2 Independent groups, unequal varianceIndependent t-test (Unequal/Welch’s)1 and 2t-scoreStudents t

    n1​+n2​−2
    3+3+ groups – look at one variableOne-Way ANOVA1 OnlyF-scoreFbetween groups
    k-1 (where k is num of groups).
    Three groups are given different drugs in a test to see the improvement in blood sugar.
    3+3+ groups – look at multiple variablesTwo-way ANOVA1 Only
    F-scoreFEffect of water and sun on height of sunflowers, where 3+ combinations of water and sun (3 + groups)
    2 Categories2 Categories – comparison is to see if they are relatedPearsons Chi-Squareneither, just differentChi-SquaredChi-Squared(no. of rows – 1) * (no. of columns – 1). Individuals that received social media vs. those that received email – 2 Groups purchased and not purchased.
    Correlation
    2CorrelationPearsons Correlation Coefficient (R)1 or 2t-statistic
    t-distribution table
    n−2Relationship between daily hours of sunlight and temperature. Best for continuous data with linear relationships.

    Provides both the direction (positive or negative) and the strength of a linear relationship between two variables. Note for simple linear regression. R squared is the Pearson r squared.
    It ranges between -1 and 1
    2 Correlation Spearmans Rank1 or 2t-statistic?Spearmans Rho?n-2Relationship between the ranks of employees’ performance and their years of experience.
    Best for ordinal data or when data does not meet Pearson’s assumptions.
    r Squared (used in regression analysis)

    One (simple linear regression) or
    more independent variables
    (Multiple linear regression) + 1 dependent
    variable
    Not used in significance testing directlyNAExplaining the variance in academic performance based on study hours, attendance, and other factors.
    Used in the context of regression analysis to explain the proportion of variance in the dependent variable.

    Indicates the proportion of the variance in the dependent variable that can be explained by the independent variable(s). It does not provide information about the direction of the relationship.
    Ranges from 0 to 1
    Regression AnalysisF-StatisticF-DistributionFit a regression model and computer squared.
    Perform an F-test to determine overall significance.
    Non-Parametric TestsNot Normal distribution
    2 Compares distributions of the 2 groupsMann Whitney U
    U Statistic Mann-Whitney U DistributionNA2 Independent groups (like the t-test)
    2Compares Medians of the 2 groupsWilcoxen signed rankT StatisticWilcoxen signed rank tableNAWhen comparing two related samples, matched samples, or repeated measurements on a single sample.

    * Note: Uses Levenes test to determine equal/unequal variance

    Significant Testing Steps

    1. Null hypothesis: There is no significant difference between variables between the two groups.

      Examples:
      1. In a drug efficacy study, the null hypothesis might state that the drug does not affect patients compared to a placebo.
      2. The mean cholesterol levels of patients taking the new drug are equal to the mean cholesterol levels of patients taking a placebo.
      3. The mean sales after the advertising campaign are equal to the mean sales before the campaign.
      4. The mean sales of the product with the new packaging design are equal to the mean sales of the product with the old packaging design.

      Alternative hypothesis: There is a significant difference between the means of the groups.
    2. The significance level (𝛼α) is the probability threshold for rejecting the null hypothesis. Commonly used significance levels include 0.05, 0.01, and 0.10. e.g. 5%: The result of the test only has a 5% chance of occurring.
    3. Calculate test statistic (t-value, z-value, etc). The test statistic is used to look up the p-value from the relevant distribution tables.

    4. Calculate the p-value using the test statistic. The p-value is used to determine if the test is significant. e.g. a p-value of less than 5% (0.05) is significant and we can reject the null hypothesis. If the p-value is higher than the significance level, then the results are reported as statistically significant.

    1 or 2 Tailed Tests
    One-tailed test if measuring greater or less than Left or Right Sided
    Two-tailed if measuring the difference

    Type 1 and Type 2 Errors
    Type 1 Error: We reject the null hypothesis even though it was true (Alpha)
    Type 2 Error: We accept the null hypothesis even though it was false (Beta)

    Two Methods of accepting or rejecting Null Hypothesis
    1. Critical Values (taken from normal or t-distribution)
    2. P-values (if p < or > than 0.05)


  • Simple Linear Regression Example with Python

    Linear Regression using Salary and Years of Experience Data

    Data Source: Salary_dataset.csv Kaggle
    The salary data set includes 2 columns: Years Experience which will be our independent variable (X) and Salary (Y).

    Linear regression is a fundamental statistical method used to model the relationship between a dependent variable and one or more independent variables. The primary goal of linear regression is to predict the value of the dependent variable based on the values of the independent variables (Chat GPT)

    For this example: First, we want to see if there is a correlation between the 2 variables by building a regression line and calculating r squared. Then we want to assess the significance of the relationship using the p-value to test the null hypothesis that there is no relationship between X and Y (X does not predict Y).

    Simple Linear Regression
    Formula
    Formula: Y = b + aX + e

    1. Dependent Variable (Y): The outcome or the variable we aim to predict or explain.
    2. Independent Variable(s) (X): The variable(s) used to predict or explain changes in the dependent variable.
    3. a: is the slope change in Y for a one-unit change in X
    4. b: is the intercept of the Y axis. This represents the value of Y when X is zero.

      The following Python code (which I used ChatGTP to optimize) calculates the regression line, and p-value and evaluates the null hypothesis.

      The steps are as follows:
      1. Use Pandas to import the CSV file to a data frame and convert each series to an array.
      2. Fit a linear regression model using Sklearn that returns the slope and intercept of the regression line.
      3. The ‘Stats’ module library is then used to calculate the R-squared value and p-value of the slope.
      4. The null hypothesis is then evaluated based on the p-value
      5. Scipy.stats and Matplotlib.pyplot are then used to calculate the used to plot the regression line on a graph.
    
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LinearRegression
    import statsmodels.api as sm
    from scipy import stats
    
    def load_data(csv_location):
        """Load CSV data into a pandas DataFrame."""
        df = pd.read_csv(csv_location)
        return df
    
    def prepare_data(df):
        """Prepare independent and dependent variables for regression."""
        X = df[['YearsExperience']].values  # Extract as 2D array
        y = df['Salary'].values  # Extract as 1D array
        return X, y
    
    def fit_sklearn_model(X, y):
        """Fit a linear regression model using sklearn."""
        model = LinearRegression()
        model.fit(X, y)
        return model
    
    def fit_statsmodels_ols(X, y):
        """Fit a linear regression model using statsmodels OLS."""
        X_with_const = sm.add_constant(X)  # Add an intercept to the model
        model = sm.OLS(y, X_with_const).fit()
        return model
    
    def plot_regression_line(df, intercept, slope):
        """Plot the regression line along with data points."""
        plt.scatter(df['YearsExperience'], df['Salary'], color='blue', label='Data points')
        plt.plot(df['YearsExperience'], intercept + slope * df['YearsExperience'], color='red', label='Regression line')
        plt.title("Salary by Years of Experience")
        plt.xlabel('Years of Experience')
        plt.ylabel('Salary')
        plt.legend()
        plt.show()
    
    def main():
        csv_location = "salary_dataset.csv"
        df = load_data(csv_location)
    
        # Display basic statistics
        #print(df.describe())
    
        X, y = prepare_data(df)
    
        # Fit the model using sklearn
        sklearn_model = fit_sklearn_model(X, y)
        intercept, slope = sklearn_model.intercept_, sklearn_model.coef_[0]
        
        print("Calculation of Regression Line:\n")
        print(f"Intercept is: {intercept}")
        print(f"Slope is: {slope}")
    
        # Fit the model using statsmodels to get p-values and R-squared
        statsmodels_model = fit_statsmodels_ols(X, y)
      #  print(statsmodels_model.summary())
    
        # Extract R-squared and p-values
        r_squared = statsmodels_model.rsquared
        p_values = statsmodels_model.pvalues
    
        print(f"R-squared: {r_squared}")
        #print(f"P-values: {p_values}")
    
        # Extracting specific p-values by index
        intercept_p_value = p_values[0]  # First p-value (intercept)
        slope_p_value = p_values[1]  # Second p-value (YearsExperience)
    
        #print(f"Intercept p-value: {intercept_p_value}")
        print(f"p-value (YearsExperience): {slope_p_value}")
    
        print("\nThe p-value is the probability of observing a t-statistic as extreme as, or more extreme than, the one calculated from your sample data, under the assumption that the null hypothesis is true.") 
        print("This is obtained from the t-distribution with n−2 degrees of freedom ")
        print("where n is the number of observations\n")
    
        if slope_p_value > 0.05:
            print("P-value is not signficant and therefore we accept the null hypothesis")
        if slope_p_value < 0.05:
            print("P-value is less than 0.05 and therefore we reject the null hypothesis. This means there is strong evidence that the predictor 𝑋 has a statistically significant effect on the outcome 𝑌")
        # Plotting the regression line
        plot_regression_line(df, intercept, slope)
    
        # Fit a linear regression line using scipy.stats (for comparison)
        slope, intercept, r_value, p_value, std_err = stats.linregress(df['YearsExperience'], df['Salary'])
       # plt.text(df['YearsExperience'].min(), df['Salary'].max(), f'y = {slope:.2f}x + {intercept:.2f}', ha='left')
    
    if __name__ == "__main__":
        main()
    



  • How to use the T-Test Statistical Test

    In business, the most commonly used t-test is often the Independent Two-Sample t-test. This test is widely utilized to compare the means of two independent groups or samples to determine if there is a significant difference between them.

    The Independent Two-Sample t-test is particularly useful in various business scenarios, such as:

    1. Comparing the performance metrics of two different products or services.
    2. Analyzing the effectiveness of different marketing strategies or campaigns.
    3. Assessing the impact of changes in processes or procedures on business outcomes.
    4. Evaluating the differences in customer satisfaction levels between different segments or demographics.
    5. Comparing the financial performance of different investment portfolios or asset classes.


    1. One-Sample t-Test

    Purpose:

    • To determine if the mean of a single sample differs significantly from a known or hypothesized population mean.

    Example:

    • Scenario: A company wants to compare the average weekly sales of a new product to the historical average weekly sales of similar products, which is $5000.
    • Hypothesis: The mean weekly sales of the new product is different from $5000.
    • Application: The company collects weekly sales data for the new product over several weeks and uses the one-sample t-test to determine if the average sales differ from $5000.

      Calculation
      x-X/sd/SQRT(n)
      Where x = sample mean, X = est., population mean, n = sample size

      Nominator: Mean – hypothesized population mean
      Denominator: sample standard deviation/SQRT(sample size n)

    2. Independent Two-sample t-test

    Purpose:

    • To compare the means of two independent groups to see if they differ significantly.
    • Null Hypothesis: The means of 2 groups are equal. If rejected, there is a significant difference between the 2 means.

    Types:

    • Equal Variance (Pooled) t-Test: Assumes that both groups have the same variance.
    • Unequal Variance (Welch’s) t-Test: Does not assume equal variances between the groups.

      How do we determine if they have the same variance?
      Use the Levenes Test first to determine if the variances are equal between groups


    Example (Equal Variance) or Pooled t-Test:

    • Scenario: A retailer wants to compare the average sales of two different stores located in different regions.
    • Hypothesis: The mean sales of Store A are equal to the mean sales of Store B.
    • Application: Sales data from Store A and Store B are collected over the same time period. An independent two-sample t-test assuming equal variances is used to compare the average sales.

      t-Value = Complex Calculation

      Degrees of Freedom
      df=n1​+n2​−2

    Example (Unequal Variance, Welch’s t-test):

    • Scenario: A company wants to compare the average salaries of employees in two different departments.
    • Hypothesis: The mean salary in Department X is equal to the mean salary in Department Y.
    • Application: Salary data from both departments are collected. An independent two-sample t-test with unequal variances (Welch’s t-test) is used to compare the average salaries.

      Calculation
      Usage Independent t-test is used when the number of samples and the variance of the two data sets is different.

      T-Value: t = (X1-X2)/SQRT((var1/n1)+(var2/n2))

      Nominator: mean1 – mean2
      Denominator: SQRT((var1/n1)+(var2/n2))

      Degrees of Freedom Calculation more complex.

    3. Paired Sample t-Test

    Purpose:

    • To compare the means of two related groups (e.g., the same subjects measured at two different times).
    • To test if the sample mean is significantly greater than or less than the hypothesized population mean (used for directional hypotheses).

    Business Example:

    • Scenario: A company implements a new training program and wants to determine its effectiveness by comparing employee performance before and after the training.
    • Hypothesis: The mean performance score after training is different from the mean performance score before training.
    • Application: Performance scores of employees are recorded before and after the training. A paired sample t-test is used to compare the average performance scores before and after the training.

      More Examples:

      example: Comparing the mean blood pressure of patients before and after treatment
      example: Testing if a new drug increases recovery rates (greater than) compared to the known recovery rate without the drug.

      Calculation
      T-Value =
      nominator: mean1 – mean2 (difference between the means)
      denominator: s(Diff)/SQRT(n)

      s(Diff) = (standard deviation of the differences of the paired data values)
      n = sample size (number of paired differences)
    • Degrees of Freedom:
    • n-1 = degrees of freedom

    One Side or Two-Sided:

    a) One-Sided t-Test

    Purpose:

    • To test if the sample mean is significantly greater than or less than the hypothesized population mean (used for directional hypotheses).

    Example:

    • Scenario: A company believes that a new marketing campaign will increase the average number of customers visiting their store per day compared to the previous average of 100 customers.
    • Hypothesis: The mean number of customers after the campaign is greater than 100.
    • Application: The company collects daily customer visit data after the campaign starts. A one-sided t-test is used to determine if the average number of customers is significantly greater than 100.

      Example 2: Comparing the average test score of a class to a national average

    b) Two-Sided t-Test

    Purpose:

    • To test if the sample mean is significantly different from the hypothesized population mean (used for non-directional hypotheses).


    Example:

    • Scenario: A company wants to compare the customer satisfaction ratings of two different products to see if there is any difference.
    • Hypothesis: The mean satisfaction rating of Product A is different from the mean satisfaction rating of Product B.
    • Application: Customer satisfaction ratings for both products are collected. A two-sided t-test is used to compare the average satisfaction ratings of the two products.


      Levenes Test – Python Code:
      Test if the variances of two independent groups are equal – use for determining whether to use the Equal or Unequal Independent Samples test

      If the p-value for the Levene test is greater than .05, then the variances are not significantly different from each other (i.e., the homogeneity assumption of the variance is met).
      In this case, we can use the Independent Samples Equal Variance T-Test
      If they are significantly different (p<0.05), then we use the Independent Samples Equal Variance T-Test
    import scipy.stats as stats
    
    # Data for Set 1 and Set 2
    set1 = [19.7, 20.4, 19.6, 17.8, 18.5, 18.9, 18.3, 18.9, 19.5, 21.95]
    set2 = [28.3, 26.7, 20.1, 23.3, 25.2, 22.1, 17.7, 27.6, 20.6, 13.7, 23.2, 17.5, 20.6, 18, 23.9, 21.6, 24.3, 20.4, 23.9, 13.3]
    
    # Check if the variances are equal
    # Levene's test for equal variances
    levene_stat, p_levene = stats.levene(set1, set2)
    
    # Print the results of the Levene's test
    print("Levene's test for Equal Variances:")
    print(f"F statistic: {levene_stat:.3f}")
    print(f"p-value: {p_levene:.3f}")
  • Excel Analysis and Forecast of the UK House Pricing

    Analysis: UK House Price Index
    Data Source: Office Of National Statistics
    Period: Monthly
    Graph: Showing December Yearly Prices


    Linear Trendline

    Polynomial Trendline (order 3)


    Observations (from the raw data):
    Over the 19 years between YE 2005 and YE 2023, average UK house prices have increased from:
    £160,209 in 2006 to £284,681 in 2023 or 78% with an average yearly increase of £6,551

    The maximum Average House Price was: £288,784 in December 2022.

    The drop from 2007 to 2008 of £28,239 is evident and goes against the positive trend relating to the 2007 financial crash. The drop from 2022 to 2023 of £4053 relates to the higher interest rates in the UK and the cost of living crisis.

    Trendlines

    If we add a linear regression line (left graph), to the data points, we get an R-squared score of 0.85%, meaning 85% of the variability of average house prices is accounted for by year, which is a pretty strong correlation as we would expect.

    If we add the forecast based on the linear regression line equation (not included here), we get £303,618 in 2028.
    (I’ve used the FORECAST.LINEAR() Excel function to generate the forecasts).

    An alternative is to use the polynomial regression line which is used when there are peaks and valleys in the data. In this example, we are using an order 3 polynomial trendline which accounts for 2 peaks or valleys. This gives an R-squared value of 0.96, which appears to be a very good fit, but I’m somewhat hesitant as we’re right at the point of a 2nd drop. One could look back at the years following 2007 and apply the growth to the years following 2022. This would be similar to a seasonal adjustment, but I’m not sure what tool will pick this up at present (and of course, it’s not the same situation).

    If we apply the forecasting from the Polynomial regression trendline, we get £389,243 for 2028.



    So the linear regression forecast gives £303K for 2018 and the Polynomial regression method gives £389K, that’s a pretty big difference.
    As you can see the polynomial regression is less impacted by the peaks and troughs than the regression line.

    If we use the Excel forecast tool which uses the ‘AAA Exponential Triple Smoothing (ETS) algorithm’, for the next 5 years, we can see the forecast of £319,647 for 2028, however, the upper and lower (non-bold) lines indicate the confidence levels, which indicate we can the estimate will lie between £204,369 and £434,925 with 95% certainty, which is very uncertain.

    Some of the reasons for this are that we only have 18 years (18 data points) to work with, we have 2 dips in the data, and a somewhat flat period, this doesn’t help the model and hence the confidence limit is less certain.

    The algorithm is best used for linear seasonal fluctuating data, which isn’t quite what we have here.

    To create a better model, we could create a multi-variate model, including additional features such as interest rate, population, housing supply, and region. We can also see if there is more data available pre-2005.

  • An Overview of Basic Descriptive Statistics

    Mean
    Median
    Mode

    Min and Max
    Variance
    Standard Deviation


    Descriptive statistics are general statistics used to describe data. Statistics such as average customer value and average sales amount are typically used as Key Performance Indicators for the business.

    In the graph below we plot each sales amount for a fictitious electrical store. It could also represent a supermarket’s sales, charitable donations, purchases from a manufacturer, or any other business where you have this kind of data.

    Here is the basic spreadsheet for the fictitious Electric Now shop I used to create this data.

    This is just a graph created from Excel using a list table of 115 sales amounts. Looking at the data visually, we can make some observations. The average sales look to be around £60 at a guess, there are a few high sales amounts that look like outliers. Outliers lie outside the normal range and so we need to calculate what that normal range is, to determine what to exclude.

    Mean
    Mean or average is probably the most commonly used descriptive statistic. Everyone understands averages, so they are commonly used throughout analysis and reporting.

    In the example above the mean is £74.18, so the guess wasn’t far out. The means is just the total sum of all the sale amounts divided by the count of sale amounts. I’m lazy so I just selected the column of sales amounts in Excel and looked at the average at the bottom of the sheet. This also gives you the total and the count, used to calculate the sum. The formula in Excel looks like =AVERAGE(A2:A116) where A2:A116 is the range of cells holding the range of sales amounts.


    Examples of using the mean are as follows:

    Average Order Value (AOV) – A commonly used statistic to gauge the average value of orders. Businesses will try and improve this number by increasing prices or selling more to existing customers. Typically the value you be presented along with other KPIs such as Revenue, Number of customers, etc.

    Average Customer Value (ACV) – A measure that can be used over time, such as a month, quarter, or year to determine the the average value of customers. This helps planning, in that to achieve x amount of revenue a business may need to sell to x amount of customers with an AOV of x.

    Average Open Rate, Average Clickthrough Rate – Common statistics used in email marketing.

    Average Daily Revenue, Average Monthly Revenue – Overall averages that a business can use to forecast its monthly revenue. e.g. Monthly revenue = Average daily revenue x 30/31 (assuming a flat sales pattern, which is often not the case). Using trend lines or phasing revenue is a better way of forecasting or breaking down budgets by day, week, or month.

    If you use the average to predict the sales for the rest of the month, you can see due to the outlier, you run the risk of grossly overestimating.
    Total sales for the week commencing 1st November is £8531, which you can see in the screenshot above. So the daily run rate is £1,219 for the first week in November (£8531/7).
    So to forecast the month I could do something like £1219 * 30 as there are 30 days in November. So that gives me £36,560.
    Businesses love talking about run-rate as they are always focused on revenue. Using historical data, you should be able to get a picture of the historical run rate, to compare how you are currently performing. Businesses will want to look at this vs. the previous year as seasonality impacts a lot of businesses. November is different from October.

    Note: for newbies, you might wonder how you get all the sale amounts. Well, essentially what you’re doing is picking up all the sales receipts and entering the totals onto a list from which you can perform your calculations. In the modern world of cause, this is all captured in a database. Organizations typically develop data warehouses, which take a copy of this information into a safe non-live database. SQL is used to report on the data or to hook the data into platforms such as data cubes or Power BI workflows, where analysts can access the data easily. Note, that you may not have to do all the nitty gritty work to get your data. It all depends on how the business is organized.

    Median

    Median is less used, but can still be useful when displayed alongside the Medium. An average can be influenced more by high or low outliers, which can distort the picture. 

    In our Electric Now sales store example, the medium is £61.44. It’s quite a bit less than the average of £74.18 and as we thought, the outliers are skewing the data up a bit. Having flagged the medium you may want to remove outliers to give a forecast of to give you a forecast range.
    In Excel, the formula is: =MEDIAN(A2:A116) where A2:A116 is the range of sales amounts.

    To manually get the median, sort your sales amounts from lowest to highest and find the number in the middle.

    Mode
    The mode is useful when you want to know the most frequent points in a set of data.

    For example, if you had a subscription option to a website with variable rates available. Looking at the mode will tell you the most common subscription amount.

    If your data varies significantly, you might find it helps to group your data into bands in order to create more frequent groups.
    For example, if you had 1000 sales amounts with all varying pence amounts, these could be banded into groups of £10-15, £20-25, £30-35, etc. This would also be much easier to graph.

    If we pivot the data from the Electric Now shop in the spreadsheet and graph the data, we can see that £39.44 is the most common sales amount, but not by much.
    I can also just use the formula =MODE(A2:A116)


    As this isn’t very informative we can choose to group, bucket, or band the data together as in the graph below.
    From this we can make some better higher-level observations, such as sales amounts are evenly distributed from 25 to 100.
    We don’t have many sales below 25.

    But to be more accurate, we want to say what % of sales are in these ranges. As in the table below.

    Min and Max
    Depending on the data in question, the Min and the Max can be useful, when used in combination with the other statistics.
    The Min is just the minimum value in your data range and the Max is the Maximum.

    In the Electric Now example, the Min is £2.12 and the Max is £412.

    What it does tell you is that some orders can be quite small amounts. If you wanted to have a report on the maximum sale amount that you wanted to beat, it would be useful to display the MAX sale amount.

    Once you’ve understood all these descriptive statistics work, you can skip ahead by using the Excel data analysis tool-pack (File > More > Options > Add-ins > Analysis Toolpack > Go. You can then summarize the data using the descriptive statistics tool in a table like the one below.
    Note the range is here is the difference between the Min and the Max. The standard deviation we can look at next.

    Standard Deviation
    The standard deviation helps describe how data points are distributed relative to the mean. Are they all close to the mean, do they distribute evenly from the mean in the form of a normal distribution? In a normal distribution, 68% of data lies within 1 standard deviation of the mean, 95% of data lies within 2 standard deviations of the mean and 97.5% of data lies within 3 standard deviations of the mean.

    As we saw in the description statistics table, the standard deviation is 59.6, so with our mean of £74.2,
    The range of values between £44.4 to £104 are within the first standard deviation of the mean and include 68% of our data.
    Likewise, values between £14.6 and £134 are within two standard deviations of the mean and include 95% of our data.
    -Lastly, values between -£15 and £163 are within three standard deviations of the mean and include 97.5% of our data.

    Manually calculations of the Standard deviation go like this:
    1. Sum up all the differences of the values from the mean and square the result. In this example that equals: 404479
    2. Next Divide that number by Count -1 or 114. In this example we get 3548 this is the variance.
    3. The Standard deviation is the square root of the variance which is 59.6

    Plotting all the sales amounts on a single axis and overlaying the standard distribution points (blue is 1st, red is 2nd, and yellow is 3rd – each including the whole of the range in between, we get a graph as below.