This is pretty powerful. You can spend your time creating graphs one at a time or you can create a pairs plot using the Seaborn library. The pairs plot graphs every combination of variables, to create multiple graphs. The scatter graphs are below and the simple code is further down the page. There is a big outlier for the first purchase amount that stands out. We need to fish that one out first and re-run.
import seaborn as sns
import matplotlib.pyplot as plt
#from sklearn.datasets import fetch_california_housing
import pandas as pd
csvlocation = 'outliercheck.csv'
df = pd.read_csv(csvlocation)
#df.columns # display the columns available
pairplot = sns.pairplot(df)
# Display the plot
plt.show()
If we sort the data frame by the 'FirstPurchaseAmount' field we can have a look at the output.
# Sort the DataFrame by the 'FirstPurchaseAmount' column, largest first
df_sorted = df.sort_values(by='FirstPurchaseAmount', ascending=False)
# Display the first few rows of the sorted DataFrame
print(df_sorted.head())
The record with CustomerID = 1 is the big outlier. It was possibly a test, which raises another question: is there test data that needs to be removed? Anyway, let's remove it.
We can remove just this one customer using the following code. We can also remove the Age and Gender fields as they look normal from a previous look and will reduce the number of graphs in the pair plot.
#remove row where CustomerID = 1
df_filtered = df[df['CustomerID'] != 1]
# Remove the 'age', 'gender' and CustomerID columns
df_filtered = df_filtered.drop(['Age', 'Gender', 'CustomerID'], axis=1)
Below is the output of re-running the graphs after the major outlier is removed. I've also removed the Age, CustomerID, and Gender columns from the output as they aren't of interest.
Now we have a clearer view of the data. As expected, there are more outliers to remove. We can use the standard deviation method to trim the extremes: filtering at roughly 2 standard deviations from the mean removes about the top and bottom 2.5% of normally distributed data, while 3 standard deviations is a stricter cut-off that only removes the most extreme values.
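As a rough sketch of that step (assuming the df_filtered dataframe from above and the FirstPurchaseAmount column; adjust the column and threshold to suit your data):
# Standard deviation based outlier filter (sketch)
# n_std = 3 keeps values within 3 standard deviations of the mean;
# use 2 for an approximate top/bottom 2.5% cut on normally distributed data
col = 'FirstPurchaseAmount'
n_std = 3
mean = df_filtered[col].mean()
std = df_filtered[col].std()
df_trimmed = df_filtered[(df_filtered[col] > mean - n_std * std) & (df_filtered[col] < mean + n_std * std)]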
In this project I'm trying to predict customer lifetime sales (well, actually first 12-month sales), initially using scikit-learn's linear regression model. The code can be downloaded from GitHub here. The results haven't been great so far, and I wonder how the data in the Contoso Retail data warehouse was created, as I have an r-squared value of only around 10%. Still, it's been a good exercise, and adding additional fields has improved the model gradually.
A lifetime value model is particularly useful in business for predicting the value of newly acquired customers.
The benefits of the model are as follows:
1. It can both guide the forecast for future sales, based on the predicted value of new customers, and forecast new customers being acquired.
2. It can guide the acquisition team in both targeting higher-value customers and understanding how much they can spend on acquiring new customers.
3. It helps the business understand what the maximum CPA (Cost Per Acquisition) should be to keep the business profitable, while still growing the business.
Lifetime Timeframe
One of the immediate challenges you will face when creating an LTV model is that your older customers are naturally more likely to have a higher lifetime value than more recently acquired customers. To move forward, we can create lifetime models based on the first x amount of time from the customer acquisition date: for example, predicted 12-month LTV. An analysis of lifetime data can of course give you a much better understanding of how long a customer's lifetime is likely to last and which timeframes are the most useful for the prediction.
Building a Model with Linear Regression
In this example, we use a linear regression model as the data we want to predict (lifetime sales) is continuous. We'll guess which fields we think will predict the highest lifetime sales, and we'll soon find out that some of them have very low correlations, forcing us back to the drawing board because we tried to skip the first two stages of analytics: exploration and diagnosis (correlation).
Here are the main steps for building the model:
1. Create an SQL query for importing the data into Python.
2. Create the Python connection and import the data using the SQL query.
3. Transform and clean the data.
4. Build the model.
5. Evaluate the model.
6. Improve the model.
1. Create SQL Query to import the historical customer data for training the model. This includes the independent variables and the dependent variable (first 12-month lifetime sales), which we want to predict for future new customers. 12 months is used as we don’t have a lot of historical data.
I'm choosing to start with the following demographic fields: Age, Gender, Yearly Income, and First Purchase Amount.
The first part of building the code is to create an SQL script that will get us the historical data we need to train our model.
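I won't reproduce the exact script here, but a minimal sketch of the idea, using the Contoso tables referenced elsewhere in this project, might look like the query below. The 12-month sales derivation, the age calculation, and the column aliases are my own assumptions, not the original query, and the first purchase amount could be derived in a similar way.
SELECT c.CustomerKey
    ,DATEDIFF(YEAR, c.BirthDate, GETDATE()) AS Age        -- assumed derivation of Age
    ,c.Gender
    ,c.YearlyIncome
    ,SUM(o.SalesAmount) AS LifetimeSales                  -- first 12-month sales, the value to predict
FROM dbo.DimCustomer c
JOIN dbo.FactOnlineSales o
    ON o.CustomerKey = c.CustomerKey
WHERE o.DateKey < DATEADD(MONTH, 12, c.DateFirstPurchase) -- keep the first 12 months only
GROUP BY c.CustomerKey, c.BirthDate, c.Gender, c.YearlyIncome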
Now we have the SQL, we can start building the Python script which will pull in the data and build the model.
2. Python – Importing the Data (Extract)
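The import step itself is short. A sketch using pyodbc and pandas, assuming a local SQL Server install with Windows authentication (the connection details are placeholders):
import pandas as pd
import pyodbc

# Placeholder connection string for a local SQL Server instance
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=ContosoRetailDW;Trusted_Connection=yes;"
)

query = "SELECT ..."  # the SQL script created in step 1

# Pull the training data into a DataFrame
df = pd.read_sql(query, conn)
print(df.head())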
4. Building the model
Now for the exciting bit, where we get to build the model using the sklearn (scikit-learn) machine learning module. The process is as follows:
1. Separate the independent variables (Age, Gender, First Purchase Amount, Income) and the dependent variable (LifetimeSales) into their own dataframes.
2. Split the data further into train and test segments. In this case, 20% of the data is held back for testing.
3. Create the linear regression model.
4. Evaluate the model.
5. Save the model to a .pkl file for future use.
(A condensed code sketch of these steps follows below.)
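This sketch assumes df already holds the imported, cleaned data; the column names follow the fields above, and Gender would need to be numerically encoded (e.g. 0/1) before fitting:
import joblib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# 1. Independent variables and the target (Gender assumed already encoded)
X = df[['Age', 'Gender', 'YearlyIncome', 'FirstPurchaseAmount']]
y = df['LifetimeSales']

# 2. Hold back 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Create and fit the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# 4. Evaluate on train and test data
print('Train MSE:', mean_squared_error(y_train, model.predict(X_train)))
print('Test MSE:', mean_squared_error(y_test, model.predict(X_test)))
print('R-squared:', r2_score(y_test, model.predict(X_test)))

# 5. Save the model for future use
joblib.dump(model, 'ltv_model.pkl')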
We can ask ChatGPT and Gemini to interpret the results. Here is a response from Gemini:
Model Performance
Train MSE (Mean Squared Error): 98009842.04325
Test MSE: 98328891.393038
R-squared: 0.0943
Both the Train MSE and Test MSE are relatively high. This indicates that the model has a high average error in predicting lifetime sales, both on the data it was trained on and unseen test data. Ideally, you want the MSE to be as low as possible.
The R-squared value is also quite low (around 0.1). This means that the model only explains a small portion of the variance in lifetime sales. In other words, there are other factors that influence lifetime sales that are not captured by this model.
I keep adding additional fields, which is slightly improving the model, but there is a long way to go. As the Contoso data is not real, it’s possible this isn’t the best data to use.
The Power BI KPI demo report is a demonstration of how a typical KPI report will look. The report uses data from the fictitious Microsoft Contoso Retail Data Warehouse, which is stored in a SQL Server database and which I've renamed for demo purposes.
The report is currently divided into 5 pages.
1. The first page is a typical KPI report, leading with the financial KPIs, then breaking down to store type, and then some online KPIs. As the database is quite old (going back to 2009), it doesn't have a lot of person-level information, but it does for online sales (hence online KPIs are included here). The KPIs are presented in comparison to the previous year's performance, but targets would also typically be included. The month selector allows the user to go back to any point in time in the year (or years, if more data were included).
2. The second page covers sales MTD, QTD, and YTD vs. the previous year. There are graphs showing the YoY breakdown by store type and a table allowing more drill-down into individual store performance YoY. This interactive tool allows users to quickly pinpoint some of the reasons responsible for higher or lower YoY performance. The measure calculations can be found lower down this page.
3. The third page goes into more detail on stores, broken down by country and city, and includes the sales per square foot calculation. Again, this helps the user explore the data to identify the best and worst-performing stores in more detail.
4. The fourth page shows sales by product category and product, allowing you to drill down from the top category levels. There are then graphs for monthly and yearly performance.
5. The fifth page focuses on activity. This can be developed further to show actives by activity segment (e.g., new, active, reactivated, lapsed), once activity segmentation is added to the model.
The Data Model:
Time Intelligence DAX Measures
The DAX measures for MTD, QTD, and YTD, and the comparisons to the previous year, are below:
//Month to Date Sales
Sales MTD = CALCULATE(SUM(FactSales[SalesAmount]), DATESMTD('Date'[Date]))

//Quarter to Date Sales
Sales QTD = CALCULATE(SUM(FactSales[SalesAmount]), DATESQTD('Date'[Date]))

//Year to Date Sales
Sales YTD = CALCULATE(SUM(FactSales[SalesAmount]), DATESYTD('Date'[Date]))

//Last Year Month to Date Sales
Sales LY MTD = CALCULATE(SUM(FactSales[SalesAmount]), SAMEPERIODLASTYEAR(DATESMTD('Date'[Date])))

//Last Year Quarter to Date Sales
Sales LY QTD = CALCULATE(SUM(FactSales[SalesAmount]), SAMEPERIODLASTYEAR(DATESQTD('Date'[Date])))

//Last Year Year to Date Sales
Sales LY YTD = CALCULATE(SUM(FactSales[SalesAmount]), SAMEPERIODLASTYEAR(DATESYTD('Date'[Date])))

//Variance Month to Date vs. Last Year Sales
Sales VAR MTD vs. Last Year = [Sales MTD]-[Sales LY MTD]

//Variance Quarter to Date vs. Last Year Sales
Sales VAR QTD vs. Last Year = [Sales QTD]-[Sales LY QTD]

//Variance Year to Date vs. Last Year Sales
Sales VAR YTD vs. Last Year = [Sales YTD]-[Sales LY YTD]

//Variance % Month to Date vs. Last Year Sales
Sales VAR% MTD vs. Last Year = ([Sales MTD]-[Sales LY MTD])/[Sales LY MTD]

//Variance % Quarter to Date vs. Last Year Sales
Sales VAR% QTD vs. Last Year = ([Sales QTD]-[Sales LY QTD])/[Sales LY QTD]

//Variance % Year to Date vs. Last Year Sales
Sales VAR% YTD vs. Last Year = ([Sales YTD]-[Sales LY YTD])/[Sales LY YTD]
Analysis: UK House Price Index
Data Source: Office for National Statistics
Period: Monthly
Graph: Showing December Yearly Prices
Linear Trendline
Polynomial Trendline (order 3)
Observations (from the raw data): Over the 19 years between YE 2005 and YE 2023, average UK house prices increased from £160,209 (2006) to £284,681 (2023), a rise of 78%, with an average yearly increase of £6,551.
The maximum Average House Price was: £288,784 in December 2022.
The drop of £28,239 from 2007 to 2008 stands out against the positive trend and relates to the 2007 financial crash. The drop of £4,053 from 2022 to 2023 relates to the higher interest rates in the UK and the cost of living crisis.
Trendlines
If we add a linear regression line (left graph) to the data points, we get an R-squared score of 0.85, meaning 85% of the variability in average house prices is accounted for by year, which is a pretty strong correlation, as we would expect.
If we add the forecast based on the linear regression line equation (not included here), we get £303,618 in 2028. (I’ve used the FORECAST.LINEAR() Excel function to generate the forecasts).
An alternative is to use the polynomial regression line which is used when there are peaks and valleys in the data. In this example, we are using an order 3 polynomial trendline which accounts for 2 peaks or valleys. This gives an R-squared value of 0.96, which appears to be a very good fit, but I’m somewhat hesitant as we’re right at the point of a 2nd drop. One could look back at the years following 2007 and apply the growth to the years following 2022. This would be similar to a seasonal adjustment, but I’m not sure what tool will pick this up at present (and of course, it’s not the same situation).
If we apply the forecasting from the Polynomial regression trendline, we get £389,243 for 2028.
So the linear regression forecast gives £303K for 2028 and the polynomial regression method gives £389K; that's a pretty big difference. As you can see, the linear trendline is less impacted by the peaks and troughs, while the polynomial line follows them more closely.
If we use the Excel forecast tool, which uses the 'AAA Exponential Triple Smoothing (ETS) algorithm', for the next 5 years, we get a forecast of £319,647 for 2028. However, the upper and lower (non-bold) lines indicate the confidence interval, which tells us the estimate will lie between £204,369 and £434,925 with 95% certainty, which is very uncertain.
Some of the reasons for this are that we only have 18 years (18 data points) to work with, there are 2 dips in the data, and a somewhat flat period. This doesn't help the model, hence the confidence interval is so wide.
The algorithm is best used for linear seasonal fluctuating data, which isn’t quite what we have here.
To create a better model, we could create a multi-variate model, including additional features such as interest rate, population, housing supply, and region. We can also see if there is more data available pre-2005.
Most businesses will have sales and marketing departments as well as top-level C-Suite/director levels. They are all going to be interested in a bunch of similar reports from organization to organization.
One of the most important reports, that sales and marketing teams are going to be interested in is: Who are our best customers?
Top customers often contribute a significant amount of revenue to any organization. The 80/20 rule, where 80% of the revenue comes from 20% of customers, is quite common. Identifying top customers allows teams to segment their audience. Top customers can be given special treatment, such as dedicated account managers and special offers.
In the model from the Contoso database we use here, the DimCustomer table is joined to the FactOnlineSales table, and the FactOnlineSales table is joined to the Date table. We use the FactOnlineSales table rather than the FactSales table as it includes a CustomerKey we can join to the DimCustomer table to pull out customer data at the person level (the FactSales table appears to include store-level data only).
To start creating the report, we can drag in the SalesAmount field from the FactOnlineSales table and the FirstName, LastName, and CustomerKey from the DimCustomer table as well as DateFirstPurchased. These are some of the fields a sales team is going to be interested in and they will likely ask for more to be added as time goes by.
Adding the Year field from the Date table (which is already joined to the FactOnlineSales table by the unique date key) as a filter visual allows us to filter the sales amount by year (and we could add months, for example, as required). We filter the report at the person level by adding the CustomerType field in the Filters panel:
You will notice here I just dragged in the SalesAmount field from the FactOnlineSales table, but that’s a bit lazy and we should create a measure for the field, like this:
OnlineSales = SUM(FactOnlineSales[SalesAmount])
The sales amount filter is useful, but teams are likely to be interested in several different sales calculations, which we can calculate using additional measures. For example, we might want to see the current year-to-date sales and the lifetime sales of a customer in the same report as follows:
Using REMOVEFILTERS('Date') removes all filters applied from the Date table, such as the year selected in the drop-down in this report, so the SalesAmount is calculated over the customer's lifetime. This now makes the report more interesting. The year filter still applies to the OnlineSales measure, but does not apply to the OnlineSales Lifetime measure, as below:
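A sketch of that lifetime measure, building on the OnlineSales measure defined above:
OnlineSales Lifetime = CALCULATE([OnlineSales], REMOVEFILTERS('Date'))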
Comparing year-to-date vs. previous year-to-date is another common request. To calculate year-to-date sales amount, we use the CALCULATE() function again, but this time combined with the DATESYTD() function:
Calculating Year-to-Date and Previous Year-to-Date Sales with DAX
To calculate Year to date sales we can use the DATESYTD function with CALCULATE:
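For example (the measure names match those referenced later; the previous-year version wraps DATESYTD in SAMEPERIODLASTYEAR):
OnlineSales YTD = CALCULATE([OnlineSales], DATESYTD('Date'[Date]))
OnlineSales Prev YTD = CALCULATE([OnlineSales], SAMEPERIODLASTYEAR(DATESYTD('Date'[Date])))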
Adding these measures to our report table, we should now have something looking like this :
Here we now have a top customer report for 2009. We can see which customers have brought in the highest revenue this year vs. last year to date, so we are comparing equal periods. We can also see customers' lifetime value. In this example, Jon Yang is our best online customer, with $61K in online sales, which is equal to his lifetime sales. As you will notice, his first date of purchase is 22nd July 2001, which is a long time ago, but the database doesn't have a record of those earlier transactions, as 2009 YTD sales equal lifetime sales. Presumably, this data was not recorded historically, which is often the case with older databases. The data you work with is often not perfect and we need to be aware of these specifics.
Now that we have this year's YTD measure and the previous year's YTD measure, it is easy for us to calculate the variance % (or difference) between the two years.
We have an infinity value for the first customer as they had no sales in the previous year. We can tidy that up using the DIVIDE() function, which automatically handles the division errors and infinity:
Online Sales Var YTD% = DIVIDE(([OnlineSales YTD]-[OnlineSales Prev YTD]),[OnlineSales Prev YTD])
Let’s now add the month as an additional filter. As you can see the YTD and Prev YTD, are recalculated to show the online sales amount up until the end of the month selected. That’s useful and pretty powerful.
In a real-world scenario, sales teams are going to want the most up-to-date information, so we want to look to add previous days’ sales, week-to-date sales, and month-to-date sales. The issue here is that our data is old, so we need to do some juggling first to bring the Contoso sales data into the future to use for experimentation. The process of adding a future date is outlined here.
DAX Previous Day Sales Calculation
Once we have the future date in our FactOnlineSales table, as in the link above, and have joined it to the Date table, our report is brought into the future and we can create a measure calculating yesterday's sales amount:
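A sketch of such a measure, following the same pattern as the other measures here (the name matches the OnlineSalesYesterday field used below):
OnlineSalesYesterday = CALCULATE(SUM(FactOnlineSales[SalesAmount]), DATESBETWEEN('Date'[Date], TODAY()-1, TODAY()-1))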
As you can see I can now filter on the current year, which is 2023, and remove the month filter. I can now see the top customers for yesterday if I sort by the OnlineSalesYesterday field. This is now giving the sales teams up-to-date, daily reporting, so they can keep their finger on the pulse of the business:
Now that we're using more 'current' data, we can see we need Week-to-Date and Month-to-Date measures.
If you look around the DAX functions, you are going to be disappointed: there is no DAX week-to-date function like there is in Tableau. There are different methods you can Google, but let's create our own here from scratch. In this example, we are using the FactOnlineSales table from the Contoso Retail DW. Before we start, we can, as an option, bring the old data into the future using the method outlined here. This is not a requirement, but it means we can create live-looking scenarios with sales data up until yesterday.
To verify the measure we are going to create, we can create a fixed date range measure that we can compare our dynamic measure to:
This measure uses the DATESBETWEEN() function as the filter for the CALCULATE() function. Note that sales are inclusive of the start and end dates. You can change the date range to reflect the current week you are in. The dates used in this example reflect the time when I created the function.
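As a sketch, with an example week hard-coded (the dates here are placeholders; swap in the Monday and yesterday of your current week):
OnlineSales WTD Check = CALCULATE(SUM(FactOnlineSales[SalesAmount]), DATESBETWEEN('Date'[Date], DATE(2023, 11, 13), DATE(2023, 11, 17)))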
To create a dynamic measure, we can start by adding a new measure OnlineSales WTD (week to date). Let’s start by adding in the TODAY() function
OnlineSales WTD = TODAY()-1
We are going to use TODAY()-1 as we usually want the data up until the end of the previous day, and not to show any partial data for the current day that may be imported to the database. If we add it to a card visual, we can verify the output as today-1 (at the time of writing):
Next, we can use the WEEKDAY function to get the day number from today. We also use variables here to tidy up the code:
OnlineSales WTD =
VAR ThisDay = TODAY()-1
VAR ThisDayNum = WEEKDAY(ThisDay)
RETURN ThisDayNum
The measure now displays 6. Today is the 16th of November 2023 (at the time of writing), which is a Saturday, but we added the -1, which gives Friday, so we want it to display 5. To get the WEEKDAY() function to start on a Monday, we can add 2 as the second argument, which sets the first day of the week (1) to Monday and the last day of the week (7) to Sunday. You can change this according to your needs.
OnlineSales WTD =
VAR ThisDay = TODAY()-1
VAR ThisDayNum = WEEKDAY(ThisDay, 2)
RETURN ThisDayNum
This now gives us 5, which is what we want:
Next, we want to calculate the first day of the week, which we can do as follows:
OnlineSales WTD =
VAR ThisDay = TODAY()-1
VAR ThisDayNum = WEEKDAY(ThisDay, 2)
VAR FirstDayOfWeek = ThisDay - (ThisDayNum - 1)
RETURN FirstDayOfWeek
So now we have the logic for calculating the first day of the week and the current reporting day, and we want to calculate the sales between these 2 dates. Let's use this logic to create a dynamic WTD measure:
OnlineSales WTD =
VAR ThisDay = TODAY()-1
VAR ThisDayNum = WEEKDAY(ThisDay, 2)
VAR FirstDayOfWeek = ThisDay - (ThisDayNum - 1)
VAR MonSales = CALCULATE(SUM(FactOnlineSales[SalesAmount]), DATESBETWEEN('Date'[Date], ThisDay, ThisDay))
VAR WTDSales = CALCULATE(SUM(FactOnlineSales[SalesAmount]), DATESBETWEEN('Date'[Date], FirstDayOfWeek, ThisDay))
RETURN IF(ThisDayNum=1, MonSales, IF(ThisDayNum>1, WTDSales))
The code uses two different calculations for WTD: if the previous day is a Monday, it just calculates the Monday sales; if it is after Monday, it calculates the WTD sales.
Now let’s say we want to go back to any point in time and get the Week to Date sales, we can tweak the code as follows:
OnlineSales WTD =
VAR ThisDay = MAX('Date'[Date])-1
VAR ThisDayNum = WEEKDAY(ThisDay, 2)
VAR FirstDayOfWeek = ThisDay - (ThisDayNum - 1)
VAR MonSales = CALCULATE(SUM(FactOnlineSales[SalesAmount]), DATESBETWEEN('Date'[Date], ThisDay, ThisDay))
VAR WTDSales = CALCULATE(SUM(FactOnlineSales[SalesAmount]), DATESBETWEEN('Date'[Date], FirstDayOfWeek, ThisDay))
RETURN IF(ThisDayNum=1, MonSales, IF(ThisDayNum>1, WTDSales))
Now if we add the day filter and select today’s date, the week-to-date measure should still match the check, but we can also go back to any point in time and calculate week-to-date.
Lastly, to calculate the previous year's week-to-date sales, we can create a measure based on our existing OnlineSales WTD measure, using the SAMEPERIODLASTYEAR() function to calculate the previous year's week-to-date without having to re-write all the code.
OnlineSales WTD Prev Year = CALCULATE([OnlineSales WTD], SAMEPERIODLASTYEAR('Date'[Date]))
You can check out my DAX function reference guide here
In the Contoso Retail database, we have old historical data which only goes up until the end of 2009. For our demo reports, we’d like to use more up-to-date data, so we want to add a date field to the FactOnlineSales (and FactSales) tables, so we can mimic more current reports.
To add what we’ll call a ‘future date’ field, we want to find out the number of days from the present day to the last date in the FactOnlineSales table, then we can recalculate all the old dates bringing them forward by this difference.
Getting the Max Date in Power Query
There doesn’t appear to be a direct M code function for max dates, but there is a List.Max() function which works with lists, so we need to create a list from the date field in the FactOnlineSales table. The steps are as follows:
1. Create a list from the date field (right-click on the DateKey field and select 'Add as New Query').
2. This will create a list of dates from the table – here I renamed it to FactOnlineSalesDates.
3. We can now reference this list in the FactOnlineSales table using the List.Max() function, so we need to create a new custom column in the FactOnlineSales table as follows:
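The custom column formula is a one-liner referencing the list query created in step 2 (I've called the column MaxDate):
= List.Max(FactOnlineSalesDates)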
We then just need to transform it into a Date type:
That will give us a column containing the max date (the last date) in the FactOnlineSales table. The MaxDate is 31/12/2009. We can now use that date in our calculations. Sure, we could have just looked it up, but if the data is dynamic, it's good to be able to calculate it, so we are sure it is correct.
4. Next we can add another custom column containing today's date and transform it to a Date type as well. We use the Date.From() function applied to the DateTime.LocalNow() function:
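The custom column formula (I've called the column Today) is:
= Date.From(DateTime.LocalNow())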
5. Next, add another custom column to calculate the date difference between today and the max date in the FactOnlineSales table. We can use the Duration.Days() function to calculate the difference in days.
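Assuming the two columns created above are called MaxDate and Today, the custom column formula is along these lines:
= Duration.Days([Today] - [MaxDate])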
This gives us the number we are looking for.
Now all we need to do is add another custom column to transform the dates into the present day. We can call this 'DateFuture' and transform it to Date type.
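Assuming the difference column created in the previous step is called DateDiff, the DateFuture custom column is along these lines:
= Date.AddDays(Date.From([DateKey]), [DateDiff])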
We’ve added 4 additional columns here to a big table. To limit the impact on performance, we should remove the 3 columns we don’t need, just leaving us with the DateFuture column.
Now close & apply the data from Power Query to load it into the model. In the model view, we can now edit the relationship between the Date table and the FactOnlineSales table, so they are joined using the DateFuture key, which will bring all the data into the future, when we use filters from the Date table.
If we now look at the FactOnlineSales table, we can see the latest date is today (at the time of writing).
In practice, a data warehouse is updated overnight, so the latest complete day's worth of data should be the day before, so we want to tweak our calculation.
If we go back into Power Query, we can subtract a day from today's date using the Date.AddDays() function, so this feeds into the future date calculation:
Date.AddDays(Date.From(DateTime.LocalNow()), -1)
Now we see the data is up to yesterday.
So this is useful if we want to mimic a real-life scenario, where data runs up until yesterday relative to today, but it would be confusing for financial reports if our future date keeps changing, so it makes sense to have a fixed date option as well.
So to fix today's date, we can hard-code it using the #date() function:
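For example, replacing the DateTime.LocalNow() call with a hard-coded date (the date here is just an illustration):
= #date(2023, 11, 16)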
There are three steps to get up and running with a Contoso Retail Datawarehouse that you can use for experimenting with Power BI.
Step 1. If you don’t already have access to an installation of Microsoft SQL server you can install one on your local machine.
There are 2 free non-production versions available:
Microsoft SQL Server Express – This includes just the database engine required to get up and running with the database.
Microsoft SQL Server Developer – In addition to the database engine, this also includes SQL Server Analysis Services, Reporting Services, Integration Services, and Machine Learning Services. It's over a GB, so a larger installation than the Express version. Of note, Analysis Services is included in the Developer version, so it's useful to install this one.
Microsoft SQL Server Express and Developer versions are available with instructions here. Installation is very straightforward, just follow the instructions.
Step 2. Install SQL Server Management Studio (SSMS). You will use this tool to connect to the SQL Server installation and run queries on the database. You can also download this from Microsoft – here.
This is the connection information for connecting to SQL Server Developer on my local machine. Note that if I use localhost\MSSQLSERVER for the server name I get an error; it works with localhost only.
For further troubleshooting, open the Services app (Run > services.msc) and check that the SQL Server services are running, starting them as required.
If you connect OK, you should see a screen like this one. You only need to worry about the Databases folder, so don't worry about all the other stuff for now. Note: this is the Developer version; the Express version will look slightly different.
Step 3: The Contoso Retail Datawarehouse database can be downloaded from Microsoft here. There are two files: ContosoBIdemoABF.exe and ContosoBIdemoBAK.exe .
The ContosoRetailDW.bak file (from ContosoBIdemoBAK.exe) is the raw database backup, which you can get started with; it only requires the database engine (which is included in SQL Server Express).
The ContosoBIdemoABF.exe contains the OLAP version of the database, which requires SQL Server Analysis Services. For this, you will need either a full version or the free Developer version.
To restore the .bak file, connect to the SQL Server Database Engine (open SSMS and select Database Engine), then right-click on Databases and select 'Restore Database'.
Select Device from the next screen.
You can then select ‘Add’ to add the file. I placed the ContosoRetailDW.bak file in my C: drive to make it easier to find:
Then just select the file and click OK, OK, and if all goes well, the database will be restored.
Now, to check it is really there, return to the main screen, expand the ContosoRetailDW database, and check the tables are there.
Now you can start running some queries against it and you can connect the Power BI desktop to it.
A standard starting point for creating customer insight reports is to combine sales data with your customer table. In the Contoso Retail DW, we have the FactSales and DimCustomer tables that we can use. The DimCustomer table has a field called CustomerKey, but if we look in the main FactSales table we cannot see a CustomerKey.
So what we can do is search all the tables for a field name like 'CustomerKey'.
Here is the code:
SELECT c.name AS 'FieldName'
,(SCHEMA_NAME(t.schema_id) + '.' + t.name) AS 'TableName'
FROM sys.columns c
JOIN sys.tables t ON c.object_id = t.object_id
WHERE c.name LIKE 'CustomerKey%'
ORDER BY FieldName, TableName
The output is below:
We can see FactOnlineSales table does have a CustomerKey.
There is a CustomerType field in the DimCustomer table which segments the data into 2 different types: Person and Company.
SELECT customerType
FROM DimCustomer
GROUP BY CustomerType
Going back to the FactSales table, we can see there is a StoreKey field, which looks useful and there is a table called DimStore.
A basic join across these tables (see below) returns 306 store names, so we know that we can at least work at the store level. Whether online sales data includes customer-level data is still to be investigated. If so, this will help us structure the reports, so we will potentially have Store vs. Online (with online presumably being a single website for the Contoso company). At this stage, we don’t know yet.
SELECT st.StoreName, SUM(s.SalesAmount) as SalesAmount
FROM
[ContosoRetailDW].[dbo].[DimStore] st LEFT OUTER JOIN dbo.FactSales s
ON s.[StoreKey] = st.[StoreKey]
GROUP BY st.StoreName
With this new information, we can import the DimStore table, as we know we are going to use it in the report. In this example, we are importing the table directly into the report, but with the Power BI service, we would want to set up the table imports in Power BI dataflows, so they become available to all reports.
On exploring the DimStore table, we find there is a StoreType field with 4 values: Catalog, Online, Reseller, and Store.
We can now have a look at sales under these new segments, to see how relevant they are:
We can use the following query:
SELECT st.StoreType, format(SUM(s.SalesAmount),'$0,,M') as SalesAmount
FROM
[ContosoRetailDW].[dbo].[DimStore] st LEFT OUTER JOIN dbo.FactSales s
ON s.[StoreKey] = st.[StoreKey]
GROUP BY st.StoreType
This gives us a really good new understanding of how the business is organized. We can see that, as well as sales via Catalog, Online, and Reseller, there are also Store sales, which may indicate that the database includes sales by store as opposed to sales by customer.
Now, let’s go back to that question about individual customer-level data. The following query will return sales from the FactOnlineSales table by Customer Type
Select [CustomerType], format(SUM(o.SalesAmount),'$0,,M') as SalesAmount
from
[ContosoRetailDW].[dbo].[DimCustomer] c
LEFT OUTER JOIN [ContosoRetailDW].[dbo].[FactOnlineSales] o
ON c.CustomerKey =o.CustomerKey
GROUP BY [CustomerType]
So we can see that sales in the FactOnlineSales table by company are over 3 times that of person.
Let's see if the FactSales table contains the data in the FactOnlineSales table. When we try to join the FactSales table to the FactOnlineSales table, we find there is no relationship between the SalesKey and OnlineSalesKey, as we might have expected.
SELECT
[SalesKey]
FROM
[ContosoRetailDW].[dbo].[FactSales] s
INNER JOIN [dbo].[FactOnlineSales] o
ON s.[SalesKey] = o.[OnlineSalesKey]
The output is blank so there is no relationship here.
So, let's take all the online sales from the FactSales table by date and see how closely this matches the sales by date in the FactOnlineSales table. We have 2 queries:
FactSales
SELECT s.DateKey, SUM(s.SalesAmount) as SalesAmount
FROM
[ContosoRetailDW].[dbo].[FactSales] s LEFT OUTER JOIN [ContosoRetailDW].[dbo].[DimStore] st
ON s.StoreKey = st.StoreKey
WHERE st.StoreType = 'Online'
GROUP BY s.DateKey
FactOnlineSales
SELECT [DateKey], SUM(SalesAmount) as SalesAmount
FROM
[ContosoRetailDW].[dbo].[FactOnlineSales]
group by [DateKey]
If we grab the output of these queries and graph them in Excel, we can see the sales are similar, but with differences in the 2 tables.
So now, we just need to figure out what the differences are. Again, if we could reach out to someone in the company, that would be great, but we can’t so we need to figure it out ourselves.
The next step is to get a bit more granular, so select a specific date from the 2 tables and see if we can figure out the differences.
Here is the code. We're going to see if filtering both tables by StoreName = 'Contoso Europe Online Store' gives us matching sales across the 2 tables.
SELECT d.StoreName, SUM(SalesAmount) as SalesAmount
FROM
[ContosoRetailDW].[dbo].[FactOnlineSales] s
LEFT OUTER JOIN [ContosoRetailDW].[dbo].[DimCustomer] c
ON s.CustomerKey=c.CustomerKey
LEFT OUTER JOIN [ContosoRetailDW].[dbo].[DimStore] d
ON s.StoreKey = d.StoreKey
WHERE DateKey = '2009-01-01' AND d.StoreName = 'Contoso Europe Online Store'
group by d.StoreName
SELECT s.DateKey, SUM(s.SalesAmount) as SalesAmount
FROM
[ContosoRetailDW].[dbo].[FactSales] s LEFT OUTER JOIN [ContosoRetailDW].[dbo].[DimStore] st
ON s.StoreKey = st.StoreKey
WHERE StoreType = 'Online' AND s.DateKey = '2009-01-01' AND st.StoreName = 'Contoso Europe Online Store'
GROUP BY s.DateKey
The first query output is as follows:
StoreName: Contoso Europe Online Store
SalesAmount: 432284.288
So that's a shame: there is still a big difference between the 2 tables. The next step seems to be to go down to the product level, so we try the following two queries:
Both tables also include two fields of interest: ReturnAmount and Discount. It would be good to check they are similar across tables. We can run both queries together.
SELECT SUM(s.SalesAmount) as SalesAmount, SUM(s.[ReturnAmount]) as ReturnAmount, SUM(s.[DiscountAmount])
as DiscountAmount
FROM
[ContosoRetailDW].[dbo].[FactSales] s LEFT OUTER JOIN [ContosoRetailDW].[dbo].[DimStore] st
ON s.StoreKey = st.StoreKey
WHERE StoreType = 'Online' AND s.DateKey = '2009-01-01' AND st.StoreName = 'Contoso Europe Online Store'
GROUP BY s.DateKey
SELECT SUM(s.SalesAmount) as SalesAmount, SUM(s.[ReturnAmount]) as ReturnAmount, SUM(s.[DiscountAmount])
as DiscountAmount
FROM
[ContosoRetailDW].[dbo].[FactOnlineSales] s LEFT OUTER JOIN [ContosoRetailDW].[dbo].[DimStore] st
ON s.StoreKey = st.StoreKey
WHERE StoreType = 'Online' AND s.DateKey = '2009-01-01' AND st.StoreName = 'Contoso Europe Online Store'
GROUP BY s.DateKey
What we see here is that the return amounts are similar but the discount amounts on the FactOnlineSales table are over 3x that of the FactSales table.
The first thing I do is add the ReturnAmount and the DiscountAmount to both queries. The discount amount for the first two products, ordered by sales amount, is much higher in the FactOnlineSales table than in the FactSales table. If I add the discount amount and return amount to the sales amount in each table, the totals get much closer.
That's it for now. I'm pretty convinced all the sales in FactOnlineSales are also in the FactSales table. As FactOnlineSales includes a CustomerKey, we can use this table for customer-level analysis, bearing in mind it is online sales data only.
Probably the most essential table you need in your Power BI model is a date or calendar table. You join your date table to any other table containing dates, such as your main FactSales table.
The advantage of using a calendar table is that it contains associated fields such as Year, Month, Start of Month, Start of Quarter, and relevant Financial Periods that will be bespoke to the business.
The date table should be stored in a data flow so everyone can use it. Here we create one using DAX, which could be exported and imported as a table in the Power Query service. You could also create a date table with Power Query directly. Creating one in DAX gives us practice with Time Intelligence functions.
The steps to create a date table are as follows:
1. Open Table view and select 'New Table'.
2. Start by creating the first column, which will be our date key. We can use the CALENDAR() function to auto-create a list of dates in the table (see the sketch below). The date range should run from the minimum date relevant to your data up to any dates you need in the future. You could use TODAY() as the end date, but sometimes, for example with forecasting, it can be useful to have future dates. Of course, a big table may slow performance, so bear that in mind.
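A minimal sketch of that CALENDAR() step, with an example date range (adjust it to your data):
Date = CALENDAR(DATE(2005, 1, 1), DATE(2025, 12, 31))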
Be careful with the day of the week number; it's always best to double-check against another calendar. The default is 1 for Sunday. To set the first day of the week to Monday, use WEEKDAY('Date'[Date], 2).
Once you have the day number set up, which can be useful in itself, you can then create a column for the day name:
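A sketch of the two columns (the column names here are my own):
Day Number = WEEKDAY('Date'[Date], 2)
Day Name = FORMAT('Date'[Date], "dddd")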
Once we've finished creating our date table, we can join it within the model to the relevant tables, such as the Contoso FactSales table, which has a date key. Note that the Date field in the Date table is unique.
An interesting dimension to report on for demographics is the age of customers. Understanding what type of customers buy what types of products can help inform our customer insight that can be used for targeting marketing offers more effectively.
In the DimCustomer table of the Contoso Retail DW, we have a field called BirthDate, but of course this is unique to each customer. What we want to do is group these ages into buckets or age bands, so we can create a more informative report. We can then use the grouping in graphs or as filters, as in the graph below:
To create the categorization, I created a new calculated column in the DimCustomer table with the DAX code following the steps outlined below. Note the exact categories you use are up to you or the business. There are no definitive rules on what age bands you should follow, but by studying other reports, you will get the gist of the most common methods.
The first best-practice step is to create variables. The _dob variable is assigned the [BirthDate] field. The _now variable is assigned today's date, using the TODAY() function. The _agedays variable is then assigned the difference in days between the date of birth and today's date, which is essentially the person's age in days. The _ageyears variable divides the age in days by 365, to get the age in years. The section after the RETURN statement then returns the age band category, based on nested IF logic. It's important to put the conditions in the correct order, working upwards from the youngest band. Note that there are some blank BirthDates, which is why I have used IF(_ageyears = BLANK(), "None"). I only knew there were blank dates by visually inspecting the BirthDate column; if I hadn't handled this, the blanks would have been picked up as "<18", so it's important to sense-check as much as possible.
DAX for the calculated column (note Age Range is the column name):
Age Range =
VAR _dob = DimCustomer[BirthDate]
VAR _now = TODAY()
VAR _agedays = DATEDIFF(_dob, _now, DAY)
VAR _ageyears = _agedays/365
RETURN
IF(_ageyears = BLANK(), "None",
IF(_ageyears<18, "<18",
IF(_ageyears<30, "18 to 29",
IF(_ageyears<40, "30 to 39",
IF(_ageyears<50, "40 to 49",
IF(_ageyears<60, "50 to 59",
IF(_ageyears<70, "60 to 69",
IF(_ageyears<80, "70 to 79",
IF(_ageyears>=80, "80 plus")))))))))
It's very useful for users, administrators, and developers to have the last refresh date added to Power BI reports, as this helps confirm how up to date the data is. Sometimes, when a Power BI file is refreshed, you will notice the data doesn't actually refresh and there is no error, which can be confusing.
This is how to add the refresh date:
1. Open Power Query and add a blank query
2. In the formula bar of the blank query enter = DateTime.LocalNow(). This function gets the current time. Click on the source on the right-hand side to activate the function.
3. You should now have a second Transform tab, to transform the blank query to a table. Select To Table > Table.
4. You can then change the data type to Date/Time.
5. Finally, rename the table and the column to something meaningful, like Last Refresh Date.
6. Then Close & Apply to return to the report view.
7. Add a card to your page and add the field from the Last Refresh Date table you just created. Adjust the size and you should get something like this.
8. The date and time of the last refresh should now update each time you refresh the report (as the last refresh table is refreshed).
So I'm going on a quest to find a video generation tool that I can use to make money on TikTok. It's interesting that the top 5 sites my first Google search for 'AI video generation' brings up are .io and .ai domains. Something to think about.
Synthesia is my first port of call: turn your text into a video in minutes, with natural-sounding AI voices and 140+ avatars. It's $22/month.
AI Script Generator AI Voices AI Avatar AI Video Generator
The free version only lets you input text to generate an AI video that it sends you via email.
AI Video Generator Free Review and Price Comparison
Most of them offer more expensive full-business packages in addition to the prices below. Quoted on 15/11/2023. Prices are likely to change. When subscribing to a full package, watch out for the annual billing, amount of video minutes included, the resolution, and any other features you may need that may not be included.
All the sites give prices monthly, but bill annually, which is a bit well… Internet. So I’m just putting the prices here, so you know what to expect.
The script only generated an avatar that didn't move.
$30/month for 10 mins of video
$225/month for 90 mins
Prices excl. VAT. Note: the price dropped once I took the free trial.
The trial experience was not good. I had a non-animated avatar and some separate speech generated from ChatGPT. It said to export to see the animated version, but it didn't work.
Text, voice, tweet, or PowerPoint to video and animation.
I chose text to animation. It generated a video with a Getty Images logo, and it appeared more like an image that was just panned over to give the feel of motion than AI. Maybe it's just the free trial, but it wasn't what I expected.
To learn Power BI we need data to work with. The Contoso Retail DW database from Microsoft is freely available, so we'll use that one. You can just install a copy on a local desktop version of SQL Server Express. A description of the dataset from Microsoft is as follows:
‘The Contoso BI Demo dataset is used to demonstrate DW/BI functionalities across the entire Microsoft Office product family. This dataset includes C-level, sales/marketing, IT, and common finance scenarios for the retail industry and support map integration. In addition, this dataset offers large volumes of transactions from OLTP and well-structured aggregations from OLAP, along with reference and dimension data’.
So, although the data warehouse will be different when you work in business, in a retail business it will have many similarities to the Contoso data warehouse. Indeed, there will still be similarities across other types of organisations, such as business-to-business (B2B) and charitable organisations. They will all, for example, have a customer table and a sales/revenue table. Customers will have similar characteristics, such as their first order date, last order date, and demographics, such as age, gender, and location. Field names might be different, but there will be similarities. Some organizations will have additional datasets bespoke to their industry. Online gambling companies, for example, will have tables for bonus amounts, although this might be compared to rewards given by retail companies. Another example is VIP/loyalty levels, which are also used in many industries. Supermarkets have loyalty programs, and charities and B2B organizations will identify high-level donors or high-value customers. Gambling companies will have high-value losing customers.
So let’s have a look at the Contoso Retail Datawarehouse and see what’s in it:
Dimension Tables
dbo.DimAccount SELECT TOP 100 * FROM dbo.DimAccount Inspecting the information, we can see a list of accounts. This will likely have been pulled from an accounting system (Finance department).
dbo.DimChannel SELECT TOP 100 * FROM dbo.DimChannel This table is just a list of 4 channels: Store, Online, Catalog, and Reseller. We would expect to use this information when querying transaction information to add what channel the transactions are from. We would expect to join using the ChannelKey field.
dbo.DimCurrency SELECT TOP 100 * FROM dbo.DimCurrency Again a similar table to DimChannel but for currency. We would expect a sales table to include the currency key to which we can join on to add currency to the query.
dbo.DimCustomer SELECT TOP 100 * FROM dbo.DimCustomer The most important table (in addition to sales) in the data warehouse: the big list of the business's customers. We would expect to join this to a sales table, to report on what customers are buying. Fields in this table include gender, date of birth, email address, address, phone numbers, and first purchase date. All sensitive personal information, so you need to think about how you work with this data and who has access to it.
dbo.DimDate SELECT TOP 100 * FROM dbo.DimDate An extremely useful and important table that you would expect to work with a lot in your reporting. By joining the table to sales orders, you can report not just on the relevant calendar year/quarter/month, but also the financial year, quarter, month, and week, which are often specific to the organization.
dbo.DimEmployee SELECT TOP 100 * FROM dbo.DimEmployee A useful table of employee information likely used for a variety of report requirements, not just related to pay, but maybe related to sales.
dbo.DimEntity SELECT TOP 100 * FROM dbo.DimEntity A table of global region and country sites, which are relevant to multinational organizations, used for Global group reporting.
dbo.DimGeography SELECT TOP 100 * FROM dbo.DimGeography A table with countries and regions, probably used to classify countries by global region in some way.
dbo.DimMachine SELECT TOP 100 * FROM dbo.DimMachine Appears to relate to certain types of equipment in stores.
dbo.DimOutage SELECT TOP 100 * FROM dbo.DimOutage A table of computer outages, possibly related to DimMachine
dbo.DimProduct SELECT TOP 100 * FROM dbo.DimProduct A key table, listing all the products presumably for sales, with a product key to join it to the sales table.
dbo.DimProductCategory SELECT TOP 100 * FROM dbo.DimProductCategory A table one would expect to join to DimProduct to allow for category level reporting.
dbo.DimProductSubCategory SELECT TOP 100 * FROM dbo.DimProductSubCategory Similar to dbo.DimProductCategory but presumably the subcategory level
dbo.DimPromotion SELECT TOP 100 * FROM dbo.DimPromotion A table of promotions, we can join it to the sales table and add marketing information to sales
dbo.DimSalesTerritory SELECT TOP 100 * FROM dbo.DimSalesTerritory A table of territories, presumably relevant to the sales team, which may also join to dbo.DimEmployee
dbo.DimScenario SELECT TOP 100 * FROM dbo.DimScenario A Finance table with keys for Actual, Budget, and Forecast. There should be sales information we can join to, to determine what type of data it is (Actual, Budget, and forecast). Top-level reporting for example will want to show actual revenue against budget and forecast.
dbo.DimStore SELECT TOP 100 * FROM dbo.DimStore A table of store names and addresses.
Fact Tables
dbo.FactExchangeRate SELECT TOP 100 * FROM dbo.FactExchangeRate A table of exchange rates. With a multi-national organization, the business will want to be able to report in Global sales (perhaps in dollars), so this table will be key. One would expect this to be updated on a regular basis, but it might be just once a quarter.
dbo.FactInventory SELECT TOP 100 * FROM dbo.FactInventory An essential table keeping a count of all products in stock.
dbo.FactITMachine SELECT TOP 100 * FROM dbo.FactITMachine A table relating to the costs of machines (presumably check-outs).
dbo.FactITSLA SELECT TOP 100 * FROM dbo.FactITSLA A table with information about the uptime of machines
dbo.FactOnlineSales SELECT TOP 100 * FROM dbo.FactOnlineSales A transaction table of online sales
dbo.FactSales SELECT TOP 100 * FROM dbo.FactSales The most important table of all (along with the customers). The transaction table, possibly including the Online sales data. More investigation is needed.
dbo.FactSalesQuota SELECT TOP 100 * FROM dbo.FactSalesQuota The sales quotas by store, with scenarios set next to them (Actual, Budget, Forecast), presumably used for reporting on which stores are hitting their targets.
dbo.FactStrategyPlan SELECT TOP 100 * FROM dbo.FactStrategyPlan An empty table
So that’s a brief look at all the tables. As you can see we’ve covered data relevant to C-Suite (targets, store performance, global/regional sales), Sales (transaction/region/customer, employee/quotas), Marketing (Promotions), Finance (Accounts, Sales Quotas, exchange rates, costs).
So I’m going to learn how to create videos in order to present my Power BI visuals in an interesting medium. Hey, people are moving more into watching videos than reading content, so I might as well play catchup. My plan is also to be able to do this as a digital nomad, so I start off with some reading. I know I’m going to need to learn to use some video editing software, but what software?
The first notes are about how it talks about investing in software. I was actually thinking about open source, but I guess needs must. The blog mentions Adobe products. I know the more modern licensing model, where you can pay a monthly fee is a lot more accessible than the old school, one-off several hundred to a thousand type costs, I remember from 15 years ago when I worked in a digital marketing agency as an email marketing bod. It also mentions how video editing is both a technical and a creative skill. I guess I like graphic design. It seems like a long time ago, there was Macromedia Flash, which seemed revolutionary at the time, but I guess it was more of a motion graphics tool than video editing. Still, I imagine some of the editing skills, such as using a timeline, will be relevant. Then there are separate audio editing tools. I can’t remember the tool off the top of my head. Vestax or something was the tool. Technology has changed an awful lot in the last 20 years. Costs have come down, the software has advanced significantly due to memory, and computer speeds have increased dramatically. I am writing this on a £150 Dell Latitude 7490, Core I7 8th Generation Intel laptop. It’s got 16 GB of RAM, but I don’t know if the built-in graphics gizmo will be powerful enough for what I need. I’ve got 187GB of free space. Not long ago, that would have been the combined capacity of several hundred computers. It’s insane.
And then there is AI. I’ve started using TikTok recently, which I’ve found to be a pretty useful educational platform. The Twitter app search functionality stopped working, which I only realised needed an upgrade to fix, but in the few days in between, I found myself on another social media tool. I always thought TikTok was for kids, but it’s obviously evolved to include content from all realms, and I found it really educational, for tutorials and great for music. I found myself considering doing a self-video of ‘Do you know the muffin man’. Anyway, to cut a long story short, I found myself over the last few days rethinking my vague plan about building a Power BI DAX website and trying to get some money from Amazon affiliate links to learn video editing with a view to opening up more opportunities.
So, where to start?
Well, I know I will need to get learning ASAP and stop watching Twitter and TikTok videos about how to make money online. I also need to stop reading the news constantly, which is another addiction I have picked up. I need to find out which software to learn and how much money it will cost. To become a freelancer, I will need to build a website or a YouTube channel to showcase my work (which seems like a long way off). I need a website too (hey, how about this one), where I can stick some affiliate links and maybe generate some money. Being an oldie newbie, I'm not really expecting this site to gain a lot of traction, but who knows, maybe I'll get lucky. I also know I'm going to need perseverance. I'm enthusiastic this morning, but I've got a day's work ahead of me (as a Power BI developer). How will I feel at the end of the day? Will I reflect on this post I wrote at the start of the day and think, 'Well, that was just another one of my random ideas that seemed like a good idea at the time and then I just forgot about it'?
Motivation – I need motivation:
Reasons to learn video editing.
1. It's cool – People will be interested in what I'm doing. There is nothing wrong with enjoying communication and interest. This is what drives community.
2. It's interesting – Well, I'm pretty sure it will be. There is a lot to learn and the projects will be endless. There are only so many financial reports I can create in Power BI.
3. It's technical – If you're technically inclined, you should enjoy this. There is going to be a lot of detail, and precision is going to be important, as will understanding how it all works.
4. It could set you free – Yes, although we will still be slaves to the money system, being a freelancer is certainly a step in the right direction towards freedom.
5. It can be done anywhere – If you can't afford a house in the UK, then think abroad. If you can work abroad, then there is no reason to buy a £300K house and spend the next 30 years paying it off.
6. It's more natural – What I mean by that is that visual things are more natural to us. Hey, most of us dream at night.
7. It's funky – Well, potentially.
8. It's not fully readable by AI – AI may be able to read all text, and computer vision is giving it the ability to interpret video, but I'm guessing we will still have an edge over AI for a while.
Well, that’s it for now. It’s 08:33 UK 6th July 2023. I start work at home at 09:00. Maybe I’ll listen to some video editing podcasts today when I work.
So I continue. I do a bit of learning from some podcasts on Castbox about video editing. I learned about the de facto industry Premiere tool, which is around $50 a month and used for making TV adverts, films and the like. I don't think it's what I need for creating YouTube videos. DaVinci Resolve is named as a top free tool, and I make a note. Adobe Premiere Pro was also mentioned as another top tool. I learned about the top 3 video editing jobs in demand, which are:
1. Micro-content for YouTube/Shorts/TikTok 2. YouTube full video editing. 3. Something else, which I think was making ads for digital marketing.
I also make a note of some of the industry jargon being mentioned, such as storytelling, workflow, proxies, editing, cutting, and creating a show reel for your CV. By the sounds of it, to get into making a big movie, the industry wants degrees in film and the like, but to get a freelance editing job doing short-form work, a reel is critical to getting hired. There is no bragging about video editing skills; you can either do it or not. It would take time to develop the skills, maybe six months to get good, but the better you get, the more in demand your skills will be. I also read about lulls in work that you need to save for, but I think that mainly applied to bigger projects like adverts.
So that was it. The goal is to make a show reel consisting of a video highlighting your ability across content. I listened to a YouTube video (just listened, as I was working on something else). It was about using Adobe Premiere Pro. I heard about panels, timelines, and using zoom a lot. It kind of made sense; I would have to build my knowledge up slowly until it all sank in. Lastly, I listened a little to a piece about the monetisation of YouTube ads and how pay depends on the audience attracted by the content produced. This would be relevant if I were going to produce my own content. An example was given of financial videos, which attract audiences that stockbrokers want to target. Advertisers will spend a lot to reach this audience, so it pays more, but I imagine the audience would be smaller. Still, something to think about.
Email marketing, which is closely related to and sits under Customer Relationship Management (CRM), has been the number one method for acquisition and retention marketing ever since email was born, due to its low cost. It is also the easiest contact method for website messages such as transaction and registration confirmations and inquiries.
GDPR (General Data Protection Regulation), however, has been making email marketing more difficult, due to the requirement for recipients to have explicitly opted in to receive an email, and learning about the law is now essential for anyone dealing with email marketing. A bit about that later.
So what is email marketing? Note: the image below was created by AI; this article was written by a human.
At its most basic, email marketing is the process of selecting data, creating content, and sending emails to a list of contacts, which may range from a handful to millions (although if you are sending to millions and are not a blue chip, you are probably a spammer). Email marketing typically uses a dedicated mass-send platform (an online software-as-a-service provider), where you can send an email to many contacts individually, without the whole audience being able to see each other’s email addresses. Email marketing is subject to law, so it must follow country-specific guidelines for features such as unsubscribe and privacy links. Yes, you can send an email from Outlook to 50 people with a list of your latest cars in stock, but doing so while following the rules isn’t so easy, and how will you know your message has even arrived without tracking? That’s where the dedicated cloud email platforms come in.
So what about the data?
So, as with all CRM activities, everything starts with the data. With no data, there is no one to contact and with no one to contact, there is no one to send anything to.
So where does the data come from?
With tighter regulation, people must have opted in to receive communications from the company you are sending from. This means a real interaction: typically a tick box on your website, a tick on a returned post-based communication such as a sales flyer with an offer on it, or a phone conversation in which the opt-in is requested. The most abundant form of data collection is via the website, when people sign up to register their interest or to buy. The data is collected in the website database, which may be integrated directly into an email marketing platform, or it may be exported separately and entered into a CRM or uploaded manually to an email platform.
The days of buying lists of email contacts for acquiring new customers and leads are over. This was common practice in the old days, when less-regulated methods of email collection were used. These days marketers rely more on mediums such as social media advertising, PPC (pay-per-click) advertising, and content designed to draw individuals in, typically shared on social networks and picked up by search engines (Google/Bing) through SEO (Search Engine Optimization) methods.
Investigate
To get a better idea of how data is collected, try some of the following for your own research:
1. Sign up to a popular website to study the data collection process and notice the email opt-in and privacy policy requirements.
2. Look through the marketing messages you receive, and check the header and footer of the emails for information on how your data was collected, along with the associated privacy policies and opt-out links.
3. Search through your unopened ‘junk mail’ – no, it’s not junk mail, it’s marketing communications! Make a note of the various contact methods available, and if there is a data entry section to collect your details, note the email and related opt-in fields.
4. Phone up a company from a TV ad and sign up for a charity or order a set of kitchenware! Note the point at which your email details are collected and the language the phone operator uses to request your permission to opt in.
5. Go to your local sports shop or DIY outlet and buy something over the counter. Do they ask for your email address?
These are some of the main methods employed to collect your email address. It’s planned, organized, and will be actioned; you will be receiving an email in the next month or so to get you to buy again. The email address, being a unique identifier, can also be readily matched against any data already stored about you, and your offline purchases matched against your online purchase data. This enriches the intelligence the organization has about you, enabling it to target you with more personal messaging and improve your experience, so you buy more, again and again and again.
Email Data Job Roles
So if you are employed to work with email data, you could be in one of the following positions:
1. An email marketing executive or manager – This will be a dedicated person in a business or agency who is responsible for handling part or all of the email process. They may simply be sent lists of email contacts already processed by a marketing person or ‘data bod’, or they may be responsible for maintaining the data themselves. More professional setups have the data stored on an email platform where lists can be segmented into more targeted lists using the tools available on the platform itself. A simple example would be selecting an age and gender range to target an offer on a clothing item, such as a pair of ski goggles for women.
2. A CRM manager – This person will be responsible for managing the company’s CRM. The CRM is typically a cloud-based database solution with a web-based portal for managing all the company’s contacts and transaction history outside of any live website environment. Examples are Salesforce and Microsoft Dynamics.
Descriptive statistics are general statistics used to describe data. Statistics such as average customer value and average sales amount are typically used as Key Performance Indicators for the business.
In the graph below we plot each sales amount for a fictitious electrical store. It could also represent a supermarket’s sales, charitable donations, purchases from a manufacturer, or any other business where you have this kind of data.
This is just a graph created in Excel from a table of 115 sales amounts. Looking at the data visually, we can make some observations: the average sale looks to be around £60 at a guess, and there are a few high sales amounts that look like outliers. Outliers lie outside the normal range, so we need to calculate what that normal range is to determine what to exclude.
Mean
The mean, or average, is probably the most commonly used descriptive statistic. Everyone understands averages, so they are commonly used throughout analysis and reporting.
In the example above the mean is £74.18, so the guess wasn’t far out. The mean is just the total sum of all the sale amounts divided by the count of sale amounts. I’m lazy, so I just selected the column of sale amounts in Excel and looked at the average at the bottom of the sheet. This also gives you the total and the count used to calculate the mean. The formula in Excel is =AVERAGE(A2:A116), where A2:A116 is the range of cells holding the sale amounts.
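If you’d rather do this in Python than Excel, a minimal pandas sketch looks like the one below. The file name salesamounts.csv and the column name SaleAmount are assumptions for illustration only; swap in your own.
import pandas as pd
# Assumed file and column names, for illustration only
df = pd.read_csv('salesamounts.csv')
# The mean is the total of the sale amounts divided by their count
total = df['SaleAmount'].sum()
count = df['SaleAmount'].count()
print(total / count)            # manual calculation
print(df['SaleAmount'].mean())  # pandas equivalent of =AVERAGE(A2:A116)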
Examples of using the mean are as follows:
Average Order Value (AOV) – A commonly used statistic to gauge the average value of orders. Businesses will try to improve this number by increasing prices or selling more to existing customers. Typically the value will be presented alongside other KPIs such as revenue, number of customers, etc.
Average Customer Value (ACV) – A measure that can be used over a time period, such as a month, quarter, or year, to determine the average value of customers. This helps planning: to achieve a given amount of revenue, a business may need to sell to a certain number of customers at a given average value.
Average Open Rate, Average Clickthrough Rate – Common statistics used in email marketing.
Average Daily Revenue, Average Monthly Revenue – Overall averages that a business can use to forecast its monthly revenue, e.g. Monthly revenue = Average daily revenue × 30/31 (assuming a flat sales pattern, which is often not the case). Using trend lines or phasing revenue by day, week, or month is a better way of forecasting or breaking down budgets.
If you use the average to predict the sales for the rest of the month, you can see that, due to the outlier, you run the risk of grossly overestimating. Total sales for the week commencing 1st November are £8,531, which you can see in the screenshot above, so the daily run rate for the first week of November is £1,219 (£8,531/7). To forecast the month I could multiply the daily rate by 30, as there are 30 days in November, which gives me roughly £36,560. Businesses love talking about run rate as they are always focused on revenue. Using historical data, you should be able to get a picture of the historical run rate to compare against current performance. Businesses will also want to look at this versus the previous year, as seasonality impacts a lot of businesses; November is different from October.
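As a quick sketch, the run-rate arithmetic above looks like this in Python (the figures are the ones from the example, not pulled from any data source):
# Run-rate forecast using the example figures above
week_one_sales = 8531                # total sales, week commencing 1st November
daily_run_rate = week_one_sales / 7  # roughly £1,219 per day
days_in_november = 30
forecast = daily_run_rate * days_in_november
print(round(daily_run_rate), round(forecast))  # 1219 and 36561, i.e. roughly £36,560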
Note for newbies: you might wonder how you get all the sale amounts. Essentially, what you’re doing is picking up all the sales receipts and entering the totals onto a list from which you can perform your calculations. In the modern world, of course, this is all captured in a database. Organizations typically develop data warehouses, which take a copy of this information into a safe, non-live database. SQL is used to report on the data or to hook it into platforms such as data cubes or Power BI workflows, where analysts can access it easily. Note that you may not have to do all the nitty-gritty work to get your data; it all depends on how the business is organized.
Median
The median is less used, but it can still be useful when displayed alongside the mean. The mean can be influenced by high or low outliers, which can distort the picture.
In our Electric Now sales store example, the median is £61.44. That’s quite a bit less than the mean of £74.18, so, as we thought, the outliers are skewing the data upwards. Having flagged this, you may want to remove the outliers to give yourself a forecast range. In Excel, the formula is =MEDIAN(A2:A116), where A2:A116 is the range of sale amounts.
To manually get the median, sort your sales amounts from lowest to highest and find the number in the middle.
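In Python, assuming the same hypothetical salesamounts.csv file as before, the median can be taken directly or reproduced manually with a sort:
import pandas as pd
df = pd.read_csv('salesamounts.csv')  # assumed file name
# Equivalent of =MEDIAN(A2:A116)
print(df['SaleAmount'].median())
# The manual approach: sort the values and take the middle one
sorted_amounts = df['SaleAmount'].sort_values().reset_index(drop=True)
middle = len(sorted_amounts) // 2     # 115 values -> index 57, the middle value
print(sorted_amounts[middle])         # with an even count you would average the two middle values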
Mode
The mode is useful when you want to know the most frequent value in a set of data.
For example, if you had a subscription option on a website with several variable rates available, looking at the mode would tell you the most common subscription amount.
If your data varies significantly, you might find it helps to group the values into bands in order to create more frequent groups. For example, if you had 1,000 sales amounts, all with varying pence amounts, these could be banded into groups of £10–15, £20–25, £30–35, etc. This would also be much easier to graph.
If we pivot the data from the Electric Now shop in the spreadsheet and graph it, we can see that £39.44 is the most common sales amount, but not by much. I can also just use the formula =MODE(A2:A116).
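The pandas equivalent is below, again assuming the hypothetical salesamounts.csv file. Note that .mode() returns a Series, because a dataset can have more than one value tied for most frequent.
import pandas as pd
df = pd.read_csv('salesamounts.csv')  # assumed file name
# Equivalent of =MODE(A2:A116); returns every value tied for most frequent
print(df['SaleAmount'].mode())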
As this isn’t very informative, we can choose to group, bucket, or band the data together, as in the graph below. From this we can make some better higher-level observations, such as that sales amounts are fairly evenly distributed from £25 to £100 and we don’t have many sales below £25.
To be more precise, we can state what percentage of sales falls within each range, as in the table below.
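In pandas, the banding and percentage calculation can be sketched as below, with band edges chosen purely for illustration:
import pandas as pd
df = pd.read_csv('salesamounts.csv')  # assumed file name
# Band the sale amounts into buckets (edges chosen for illustration)
bands = pd.cut(df['SaleAmount'], bins=[0, 25, 50, 75, 100, 125, 500])
# Count of sales per band, then the same as a percentage of all sales
print(bands.value_counts().sort_index())
print((bands.value_counts(normalize=True).sort_index() * 100).round(1))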
Min and Max
Depending on the data in question, the min and the max can be useful when used in combination with the other statistics. The min is just the minimum value in your data range and the max is the maximum.
In the Electric Now example, the Min is £2.12 and the Max is £412.
What the min does tell you is that some orders can be quite small amounts. If you wanted a report showing the maximum sale amount you want to beat, it would be useful to display the max sale amount.
Once you’ve understood how all these descriptive statistics work, you can skip ahead by using the Excel Analysis ToolPak (File > More > Options > Add-ins > Analysis ToolPak > Go). You can then summarize the data using the Descriptive Statistics tool in a table like the one below. Note that the range here is the difference between the min and the max. The standard deviation we can look at next.
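If you’re working in Python rather than Excel, pandas’ describe() gives a rough equivalent of the Descriptive Statistics table (count, mean, standard deviation, min, quartiles, and max); the range you add yourself. The file and column names are assumptions as before.
import pandas as pd
df = pd.read_csv('salesamounts.csv')  # assumed file name
# Rough equivalent of Excel's Descriptive Statistics output
print(df['SaleAmount'].describe())
# Range = max - min, which describe() doesn't report directly
print(df['SaleAmount'].max() - df['SaleAmount'].min())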
Standard Deviation
The standard deviation helps describe how data points are distributed relative to the mean. Are they all close to the mean, or do they spread out evenly from the mean in the form of a normal distribution? In a normal distribution, roughly 68% of the data lies within 1 standard deviation of the mean, 95% lies within 2 standard deviations, and 99.7% lies within 3 standard deviations.
As we saw in the descriptive statistics table, the standard deviation is 59.6. With our mean of £74.2, values between £14.60 and £133.80 are within one standard deviation of the mean and should cover roughly 68% of our data. Likewise, values between -£45.00 and £193.40 are within two standard deviations, covering roughly 95% of the data, and values between -£104.60 and £253.00 are within three standard deviations, covering roughly 99.7% of the data. The negative lower bounds are a hint that the data isn’t perfectly normally distributed, since a sale amount can’t be negative.
A manual calculation of the standard deviation goes like this:
1. Take each value’s difference from the mean, square it, and sum the results. In this example that equals 404,479.
2. Divide that number by the count minus 1, i.e. 114. In this example we get 3,548; this is the (sample) variance.
3. The standard deviation is the square root of the variance, which is 59.6.
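The same steps can be sketched in pandas against the hypothetical salesamounts.csv file. Pandas’ .std() divides by count − 1 by default, matching the manual calculation above, and the loop prints the one, two, and three standard deviation ranges around the mean.
import pandas as pd
df = pd.read_csv('salesamounts.csv')  # assumed file name
amounts = df['SaleAmount']
# Manual calculation: squared differences from the mean, summed,
# divided by count - 1 (sample variance), then square-rooted
mean = amounts.mean()
variance = ((amounts - mean) ** 2).sum() / (amounts.count() - 1)
std_dev = variance ** 0.5
print(std_dev, amounts.std())  # these two should match
# The 1, 2 and 3 standard deviation ranges around the mean
for k in (1, 2, 3):
    print(f"{k} std dev: {mean - k * std_dev:.1f} to {mean + k * std_dev:.1f}")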
Plotting all the sales amounts on a single axis and overlaying the standard deviation ranges (blue is the 1st, red the 2nd, and yellow the 3rd, each including the whole of the range in between), we get a graph like the one below.