Demonstrating the use of the DAX INDEX function with the Contoso Retail Data Warehouse. We can start off by building a virtual table in the DAX query editor, which we can use to apply the INDEX function. The table created is a list of the first 10 customers by customer key from the DimCustomer table.
If we evaluate the CustomerSalesSample table first we can see the table we are working with.
The syntax for the INDEX function is as follows:
-- INDEX(<position>[, <relation> or <axis>][, <orderBy>][, <blanks>][, <partitionBy>][, <matchBy>][, <reset>])
In our example, we use a <position> of 1 to get the first sales amount, which is the highest sales amount, as the Total Sales column is sorted in descending order (DESC).
And then when we evaluate the Index expression (variable Ind), we get the following, which is the correct output we are looking for.
To get the last position, we could either sort the data by ascending order (ASC), or we could use the INDEX of -1 as the following example:
When we use -1 as the index, we get the blank sales returned, which isn’t what we wanted, so we need to modify the code.
One way of filtering out the blanks from the sales table is to add a filter over the table as an additional variable, as in the code below.
DEFINE
-- Syntax: INDEX(<position>[, <relation> or <axis>][, <orderBy>][, <blanks>][, <partitionBy>][, <matchBy>][, <reset>])
VAR CustomerSalesSample =
    CALCULATETABLE(
        SELECTCOLUMNS(
            DimCustomer,
            "CustomerKey", DimCustomer[CustomerKey],
            "FirstName", DimCustomer[FirstName],
            "LastName", DimCustomer[LastName],
            "Total Sales", SUMX(RELATEDTABLE(FactOnlineSales), FactOnlineSales[SalesAmount])
        ),
        TOPN(10, DimCustomer, DimCustomer[CustomerKey], ASC)
    )
VAR CustomerSalesNotBlank =
    FILTER(CustomerSalesSample, NOT(ISBLANK([Total Sales])))
VAR Ind =
    INDEX(
        -1,
        CustomerSalesNotBlank,
        ORDERBY([Total Sales], DESC)
    )

EVALUATE
Ind
Evaluating this code now gives us the least total sales amount that is not blank:
To demonstrate the use of the DAX RANK function, we'll start by creating a simple sales table with the sales amount and the year. Then we'll add additional columns to create examples of using the RANK function.
In the first example (SimpleSalesRank), we’ll just create a simple ranking. The default function ranks the Sales column from the least amount to the highest amount, and you’ll notice the blank sales value is included as rank one. Note, we can order by more than one column if partitioning by more than one column.
SimpleSalesRank=RANK(ORDERBY(SalesRank[Sales]))
The first thing we can do is move the blank value to the end of the ranking, using the LAST parameter.
In the next example, we create a similar Sales table with a Sales column and a year column. We can then use the RANK functions to create our rankings.
First, we’ll create a simple ranking as before, but with RANKX(). Included here is the function syntax in the comments. As you can see from the table above, the RANKX function defaults to sorting the highest sales value first, whereas the RANK function sorts it last. The RANKX function also defaults to putting the blank last, whereas the RANK function ordered it first.
We can also apply the DENSE clause for the ties, as the default is to SKIP the ranking when there have been equal values, for example, there are 2 sales of value 94, which are both ranked 6, as the default treatment of ties is set to SKIP, the next rank value jumps to 8. With the DENSE clause, the next rank does not jump.
RankXFunctionDENSE =
-- RANKX(<table>, <expression>[, <value>[, <order>[, <ties>]]])
-- note: default ties is SKIP
RANKX(SalesRankX2, SalesRankX2[Sales], , ASC, DENSE)
In the next example, we use RANKX with RELATEDTABLE(). We start off by creating a sample of the DimCustomer table, joining it to the FactOnlineSales table, and then adding a ranking column to the new customer table for total sales. We then check the ranking by adding a total sales check column.
Step 1: Create a sample customer table. In the code below, we create a simple selection of CustomerKey, FirstName, and LastName, and then filter the table to the first 10 customers by CustomerKey. This gives us a table of the first 10 customers in the DimCustomer table. The data comes from the Contoso Retail Data Warehouse.
The table below is created (first 3 columns), then we add the SalesRank and CustomerTotalSalesCheck columns to demonstrate the use of the RANKX function with the RELATEDTABLE function.
The code for ranking sales is below. We then add the CustomerTotalSalesCheck to confirm that our ranking is working correctly, and as you can see, it is. The first rank of 1 is allocated to total sales of £3,932. The last value, which is blank, is ranked 10. You will notice there is no rank 8, as the default treatment of ties is to SKIP a rank when there is a tie. We can change this by adding the DENSE clause.
Creating a Measure using RANKX()
Now that we have some good examples of using the RANK and RANKX functions, we can use them in a measure, which creates a temporary table in memory.
To create the measure, we can use the DAX editor in Power BI, Tabular Editor, or DAX Studio. My preference is Tabular Editor, but the code will work in all three, allowing us to see the data as we build the measure. If you just try to build the measure directly, you can't be sure what's going on unless you build the table in Power BI, and physical tables are slower to run.
Here we borrow the code from the previous example that created a physical table, but ensure the table argument in the RANK function is set to DimCustomer.
We should now be able to use the DAX in a measure, so we transfer the DAX code into a measure as follows. The code using DEFINE and EVALUATE is a DAX query in the DAX query editor; when we create a measure, we are creating a DAX expression. DAX expressions are converted to DAX queries when they are evaluated. The code for creating the measure is below.
We can create a more interesting sales report using window functions and additional calculations in the query, including the gross margin %, total profit, and ranks for product price, sales quantity, and sales amount. The query uses a Common Table Expression (CTE) to group the data first (as we can't group using window functions).
This is the kind of report decision-makers will be interested in and as more questions come up, additional fields can be added.
USE [ContosoRetailDW];
-- Note: Qty * UnitPrice - Discount = TotalSalesAmount (small discrepancy)
-- CTE
WITH sub1 AS (
    SELECT
        p.ProductName,
        s.UnitCost,
        s.UnitPrice,
        SUM(SalesQuantity) AS TotalSalesQuantity,
        (s.UnitCost * SUM(SalesQuantity)) AS TotalCost,
        SUM(DiscountAmount) AS TotalDiscount,
        SUM(SalesAmount) AS TotalSalesAmount
    FROM [dbo].[FactOnlineSales] AS s
    LEFT OUTER JOIN [dbo].[DimProduct] AS p ON s.[ProductKey] = p.[ProductKey]
    LEFT OUTER JOIN [dbo].[DimDate] d ON s.DateKey = d.DateKey
    LEFT OUTER JOIN [dbo].[DimStore] st ON s.StoreKey = st.StoreKey
    WHERE d.CalendarYear = 2007
      AND d.CalendarMonthLabel = 'January'
      AND st.StoreName = 'Contoso Europe Online Store'
    GROUP BY p.ProductName, s.UnitCost, s.UnitPrice
)
-- Main query referencing the CTE
SELECT
    ProductName,
    UnitCost,
    UnitPrice,
    (UnitPrice - UnitCost) / UnitCost AS Margin,
    TotalSalesQuantity,
    FORMAT(TotalCost, '$0,,.0M') AS TotalCost,
    FORMAT(TotalSalesAmount, '$0,,.0M') AS TotalSales,
    FORMAT(TotalSalesAmount - TotalCost, '$0,,.0M') AS TotalProfit,
    TotalDiscount,
    RANK() OVER (ORDER BY UnitPrice DESC) AS PriceRank,
    RANK() OVER (ORDER BY TotalSalesQuantity DESC) AS QtyRank,
    RANK() OVER (ORDER BY TotalSalesAmount DESC) AS SalesRank
FROM sub1
ORDER BY SalesRank
Creating a simple Linear regression model and preparing for multi-linear regression.
In this example, we use a sample of marketing spend data vs. sales and inspect the correlation between radio spend and total sales. The regression line is fitted using the ols function from statsmodels.formula.api
You can download the original file from Kaggle here, then just replace the file location in the df = pd.read_csv() line.
The results are returned from the model.summary() method of the OLS (ordinary least squares) model in the statsmodels module. R-squared is calculated as 0.757, meaning 76% of the variability in y (sales) is accounted for by radio. However, if we look at the other media, we will see that another variable (TV) also has a strong correlation.
The Beta coefficient for radio spend is 8.17, which means that for every $1 million in Radio spend, we get $8.17 million in sales.
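As a rough sketch of how a fit like this can be produced with the ols function from statsmodels.formula.api; the column names ('Radio', 'Sales') and the file name are assumptions here, so adjust them to match your copy of the Kaggle file:

import pandas as pd
from statsmodels.formula.api import ols

df = pd.read_csv("marketing_sales_data.csv")   # hypothetical file name: point this at your downloaded CSV
model = ols("Sales ~ Radio", data=df).fit()    # simple linear regression of sales on radio spend
print(model.summary())                         # R-squared, coefficients and p-values
print(model.params)                            # intercept and the radio (beta) coefficient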
We need to check the model assumptions to give us confidence that our models have integrity and are not biased or overfitting the data.
We check three assumptions in this example, with sub-plotted seaborn graphics for a linear regression model. The code for creating the linear regression model can be found in this post; you can run the code below once you have built the model. The model captures the relationship between radio advertising spend and sales.
Graph 1: Checking the Linearity Assumption The linearity assumption is as follows: 'Each predictor variable (x) is linearly related to the outcome variable (y).' In the first graph, we plot radio advertising spend against sales and can see there is a linear relationship, so we can conclude the linearity assumption is met.
Graph 2: Checking Homoscedacity assumption with a scatterplot
The homoscedasticity assumption (extra points if you can spell it correctly) concerns the residuals plotted against y_pred, the predicted y values from the regression line.
In the second graph, we plot the residuals of the model, which are the differences between the actual values and the model's predictions.
Homoscedasticity means that the residuals have equal, or almost equal, variance across the regression line. By plotting the error terms against the predicted values, we can check that there is no pattern in the error terms. Good homoscedasticity is therefore a balanced scatter of residuals above and below zero.
Graph 3: Check for Normality Assumption In the third graph, the histogram is used to plot the residuals of the regression line (the actual y values vs. the predicted y values) for x. If the model is unbiased, the residuals should be normally distributed (and we see that).
The fourth graph is a Q-Q plot which is also used to check the normality assumption.
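A minimal sketch of the four assumption checks described above, assuming df and the fitted model come from the previous sketch (the variable and column names are assumptions, not the post's original code):

import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

residuals = model.resid          # actual minus predicted sales
fitted = model.fittedvalues      # predicted sales (y_pred)

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
sns.scatterplot(x=df["Radio"], y=df["Sales"], ax=axes[0, 0])   # 1. linearity
axes[0, 0].set_title("Radio spend vs Sales (linearity)")
sns.scatterplot(x=fitted, y=residuals, ax=axes[0, 1])          # 2. homoscedasticity
axes[0, 1].axhline(0, color="red")
axes[0, 1].set_title("Residuals vs predicted (homoscedasticity)")
sns.histplot(residuals, ax=axes[1, 0])                         # 3. normality: histogram of residuals
axes[1, 0].set_title("Residual distribution")
sm.qqplot(residuals, line="s", ax=axes[1, 1])                  # 4. normality: Q-Q plot
axes[1, 1].set_title("Q-Q plot of residuals")
plt.tight_layout()
plt.show()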
Classification models are machine learning models that are used to predict binary outcome scenarios such as:
Spam / Not spam
Fraudulent transaction / Non-fraudulent transaction
Customer churn / Customer will not churn
Customer high value / Customer low value
Loan approval / Non-approval
The Data Process
Planning
Understand the business requirements. Understand what the measure of success is and what needs to be measured, e.g. for binary outcomes: precision, recall, or F1 score. Identify Type 1 and Type 2 errors.
Identify the key stakeholders and subject matter experts relevant to the project
Understand where the data is and how it can be accessed. For larger data projects, if the data comes from many different data sources, can it be brought together in a data warehouse such as Google BigQuery?
Understand the technology required for the project. Are extra resources required?
Is there a data dictionary describing all the data field types and purposes?
Exploratory Data Analysis (Python)
Explore the data, and list the number of columns, rows, and data types. If there are any questions, these may need to be referred back to the business.
Explore the data ranges (df.describe()). Are the data counts complete? Do the means and ranges make sense? Do the min and max statistics flag any potential errors or outliers in the data?
Explore null values. If there are null values, either look to fill the data or drop the rows.
Remove or adjust outliers.
Summarize and graph the data:
Use boxplots to look for outliers in columns.
Use histograms to understand the distributions of data.
Use a correlation matrix and pair plot to understand co-variance between columns.
Visualize the data with interactive tools such as Tableau or Power BI for the initial analysis of data for clients
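A quick sketch of the exploration and graphing steps above in Python; the file and column names are placeholders, not taken from a specific dataset:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("your_data.csv")    # hypothetical file

print(df.shape)            # rows and columns
print(df.dtypes)           # data types
print(df.describe())       # counts, means, ranges, min/max
print(df.isnull().sum())   # null values per column

sns.boxplot(x=df["some_numeric_column"])              # outliers in one column (hypothetical name)
plt.show()
df.hist(figsize=(10, 8))                              # distributions of all numeric columns
plt.show()
sns.heatmap(df.corr(numeric_only=True), annot=True)   # correlation matrix
plt.show()
sns.pairplot(df)                                      # pairwise relationships between columns
plt.show()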
Identify the variable to be predicted: Is it a continuous variable or a categorical variable?
Select the machine learning model relevant to the task and 1 or 2 additional models to compare results to.
Confirm the assumptions required by the model and check the data to confirm they meet the requirements.
Feature selection: Select the features (Independent variables and the dependent variable (columns)) to be used in the model.
Feature transformation: transform categorical data into numeric data using: a) one-hot encoding for non-ordered categories, e.g. departments; b) ordinal encoding for ordered data such as 'low', 'medium', 'high'. Where numeric data across the features has high variance, e.g. small numbers (0-1) and large numbers (100+), consider applying scaling to the data: a) log normalization (using a logarithm of the data); b) standardization (which converts data to Z-scores on a standard distribution, with a mean of zero within each feature). See the sketch after this list.
Feature extraction: create new features from existing features, for example a weekly hours feature derived from existing time data.
Check for class imbalance in the data. In the case of a binary dependent variable (True/False), we would ideally like an even split, but as a minimum the smaller class should be around 10% of the data. Class imbalances can be addressed with either: a) downsampling, to bring the majority class down by random sampling; or b) upsampling, to bring the minority class up by random sampling (see the sketch below).
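A minimal sketch of the transformation, scaling and resampling steps above, using a toy data frame with hypothetical column names (salary, department, monthly_hours, churn):

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Toy data frame for illustration only
df = pd.DataFrame({
    "salary": ["low", "medium", "high", "low", "medium", "low", "high", "low"],
    "department": ["sales", "hr", "it", "sales", "it", "hr", "sales", "it"],
    "monthly_hours": [160, 210, 280, 150, 190, 305, 170, 200],
    "churn": [0, 0, 1, 0, 0, 1, 0, 0],
})

# Ordinal encoding for ordered categories
df["salary"] = df["salary"].map({"low": 0, "medium": 1, "high": 2})

# One-hot encoding for non-ordered categories
df = pd.get_dummies(df, columns=["department"])

# Log normalization and standardization for a high-variance numeric feature
df["log_hours"] = np.log(df["monthly_hours"])
df["hours_scaled"] = StandardScaler().fit_transform(df[["monthly_hours"]]).ravel()

# Class imbalance: downsample the majority class of the binary target
majority = df[df["churn"] == 0]
minority = df[df["churn"] == 1]
majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=42)
df_balanced = pd.concat([majority_down, minority])
print(df_balanced["churn"].value_counts())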
Model Execution
Setup the X variables and Y variables in a separate data frame.
Decide whether to use a simple hold-out test set, cross-validation, or a separate validation set (see Model Validation below).
Use train_test_split to create training and test sets, and potentially a third validation set of data. The test dataset should be around 25%, but can be larger if the dataset is large. Should the split be stratified?
Select the hyperparameter requirements of the model. GridSearchCV is a powerful sklearn function that takes a grid of hyperparameters, a scoring metric, and the training data, and runs multiple models to find the best one (see the sketch after this list).
Build and run the model.
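A sketch of the split, grid search and model build described in this list, using generated toy data in place of the real X and y; the hyperparameter values are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for the prepared X and y
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Hold out 25% as a stratified test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Grid of hyperparameters to try
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}

grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, scoring="f1", cv=5)
grid.fit(X_train, y_train)                # fits one model per combination using cross-validation
print(grid.best_params_, grid.best_score_)

best_model = grid.best_estimator_
print(best_model.score(X_test, y_test))   # accuracy on the held-out test set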
Model Validation
In a simple scenario, the best-performing model is run against the hold-out sample (e.g. 25% of the data) that the model has not been trained on.
a) In cross-validation, repeated samples are taken from the training data and the model is validated against the held-out portion each time. Cross-validation is slower.
b) Separate validation set: a random sample is held out from the training data and the model is tested against it a set number of times, e.g. 5. This is advantageous when the main dataset is small and we don't want to keep sampling from the training data.
Linearity: all independent and dependent variables should exhibit a degree of linearity (use a pairplot).
Independent observations: this is business specific, so it requires understanding of how the variables are generated.
No multicollinearity: there should be no collinearity between the independent variables, or this will reduce the reliability of the model.
No extreme outliers: extreme outliers should be excluded from the model.
Fixing Null Values Annual Income: We have 2 values missing from the Annual Income field. We can remove these rows from the data or use the averages to fill in the gaps.
Drop rows where annual income (k$) is null
df.dropna(subset=['Annual Income (k$)'], inplace=True)
df.isnull().sum()   # re-run the check for missing data
We can run the seaborn pairplot to plot the graphs of the combination of variables.
pairplot = sns.pairplot(df)
plt.show()
From here, we can see there are some very interesting, distinct-looking clusters around annual income and spending score. It also makes sense that they would be related, so we can use them in the k-means model.
For the K-means model, we need to determine the value of K which is the number of clusters we want to identify. We use the elbow method to do this as follows:
# Extracting features for clustering
X = df[['Annual Income (k$)', 'Spending Score (1-100)']]

# Using the elbow method to determine the optimal number of clusters
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

# Plotting the elbow graph
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
The graph produced by the elbow method is below. For the value of K, we select the point where the rate of decrease in WCSS drops dramatically (the elbow). In this case, we select 5 clusters.
K-Means Model
Now we have the number of clusters to be used in the model, we can run the K-Means model.
## Use 5 clusters based on the elbow graph
# Fitting K-Means to the dataset with the optimal number of clusters (5 in this example)
kmeans = KMeans(n_clusters=5, init='k-means++', max_iter=300, n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(X)

# Visualizing the clusters
plt.scatter(X.iloc[y_kmeans == 0, 0], X.iloc[y_kmeans == 0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(X.iloc[y_kmeans == 1, 0], X.iloc[y_kmeans == 1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(X.iloc[y_kmeans == 2, 0], X.iloc[y_kmeans == 2, 1], s=100, c='green', label='Cluster 3')
plt.scatter(X.iloc[y_kmeans == 3, 0], X.iloc[y_kmeans == 3, 1], s=100, c='orange', label='Cluster 4')
plt.scatter(X.iloc[y_kmeans == 4, 0], X.iloc[y_kmeans == 4, 1], s=100, c='purple', label='Cluster 5')

# Plotting the centroids of the clusters
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroids')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
The graph is below, with the 5 identified clusters colored accordingly. Once the clusters are identified, we can use them to segment our data, which can then be used to determine, for example, the best marketing campaign for each segment, using A/B testing and t-test significance testing.
One-sample z-test (population mean and standard deviation known, sample > 30)
Example: A sample of the population is tested for height. Do they match the known population?

One-sample t-test
Samples: 1. When: n < 30 and unknown variance. Tails: 1 or 2. Test statistic: t-score. Distribution: Student's t. Degrees of freedom: n - 1.
Example: A company wants to compare the average weekly sales of a new product to the historical average weekly sales of similar products, which is $5000.

Paired t-test
Samples: 2 (the same group before and after; two dependent samples). Tails: 1 or 2. Test statistic: t-score. Distribution: Student's t. Degrees of freedom: n - 1.
Example: A company implements a new training program and wants to determine its effectiveness by comparing employee performance before and after the training.

Independent t-test (equal variance, pooled)*
Samples: 2 independent groups with equal variance. Tails: 1 or 2. Test statistic: t-score. Distribution: Student's t. Degrees of freedom: n1 + n2 - 2.
Example: A retailer wants to compare the average sales of two different stores located in different regions.

Independent t-test (unequal variance, Welch's)*
Samples: 2 independent groups with unequal variance. Tails: 1 or 2. Test statistic: t-score. Distribution: Student's t. Degrees of freedom: approximated (Welch-Satterthwaite).

One-way ANOVA
Groups: 3+, looking at one variable. Tails: 1 only. Test statistic: F-score. Distribution: F. Degrees of freedom: between groups, k - 1 (where k is the number of groups).
Example: Three groups are given different drugs in a test to see the improvement in blood sugar.

Two-way ANOVA
Groups: 3+, looking at multiple variables. Tails: 1 only. Test statistic: F-score. Distribution: F.
Example: Effect of water and sun on the height of sunflowers, where there are 3+ combinations of water and sun (3+ groups).

Pearson's chi-square
Groups: 2 categories, compared to see if they are related. Tails: neither, just different. Test statistic: chi-squared. Distribution: chi-squared. Degrees of freedom: (no. of rows - 1) * (no. of columns - 1).
Example: Individuals that received social media vs. those that received email, with two outcomes: purchased and not purchased.

Correlation

Pearson's correlation coefficient (r)
Variables: 2. Tails: 1 or 2. Test statistic: t-statistic. Distribution: t-distribution table. Degrees of freedom: n - 2.
Example: Relationship between daily hours of sunlight and temperature. Best for continuous data with linear relationships.
Note: Provides both the direction (positive or negative) and the strength of a linear relationship between two variables; it ranges between -1 and 1. For simple linear regression, R squared is Pearson's r squared.

Spearman's rank correlation
Variables: 2. Tails: 1 or 2. Test statistic: t-statistic (approximation; exact tables for small samples). Degrees of freedom: n - 2.
Example: Relationship between the ranks of employees' performance and their years of experience. Best for ordinal data or when data does not meet Pearson's assumptions.

R squared (used in regression analysis)
Variables: one (simple linear regression) or more (multiple linear regression) independent variables plus one dependent variable. Not used in significance testing directly; for overall significance, fit the regression model, compute R squared, and perform an F-test (F-statistic, F-distribution).
Example: Explaining the variance in academic performance based on study hours, attendance, and other factors. Used in the context of regression analysis to explain the proportion of variance in the dependent variable.
Note: Indicates the proportion of the variance in the dependent variable that can be explained by the independent variable(s). It does not provide information about the direction of the relationship. Ranges from 0 to 1.

Non-Parametric Tests (for data that is not normally distributed)

Mann-Whitney U
Groups: 2 independent groups (like the t-test); compares the distributions of the 2 groups. Test statistic: U statistic. Distribution: Mann-Whitney U distribution. Degrees of freedom: NA.

Wilcoxon signed-rank
Groups: 2; compares the medians of the 2 groups. Test statistic: T statistic. Distribution: Wilcoxon signed-rank table. Degrees of freedom: NA.
Use when comparing two related samples, matched samples, or repeated measurements on a single sample.

* Note: Use Levene's test to determine equal/unequal variance.
Significance Testing Steps
Null hypothesis: there is no significant difference between the two groups.
Examples:
1. In a drug efficacy study, the null hypothesis might state that the drug does not affect patients compared to a placebo.
2. The mean cholesterol levels of patients taking the new drug are equal to the mean cholesterol levels of patients taking a placebo.
3. The mean sales after the advertising campaign are equal to the mean sales before the campaign.
4. The mean sales of the product with the new packaging design are equal to the mean sales of the product with the old packaging design.
Alternative hypothesis: There is a significant difference between the means of the groups.
The significance level (α) is the probability threshold for rejecting the null hypothesis. Commonly used significance levels include 0.05, 0.01, and 0.10. e.g. at 5%, a result this extreme would only have a 5% chance of occurring by chance if the null hypothesis were true.
Calculate test statistic (t-value, z-value, etc). The test statistic is used to look up the p-value from the relevant distribution tables.
Calculate the p-value using the test statistic. The p-value is used to determine if the test is significant, e.g. a p-value of less than 5% (0.05) is significant and we can reject the null hypothesis. If the p-value is higher than the significance level, then the results are reported as not statistically significant.
One- or Two-Tailed Tests: use a one-tailed test (left- or right-sided) if measuring greater than or less than; use a two-tailed test if measuring whether there is any difference.
Type 1 and Type 2 Errors
Type 1 error: we reject the null hypothesis even though it was true (alpha).
Type 2 error: we fail to reject the null hypothesis even though it was false (beta).
Two methods of accepting or rejecting the null hypothesis: 1. Critical values (taken from the normal or t-distribution). 2. P-values (compare p to the significance level, e.g. 0.05).
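A small sketch of the two decision methods with scipy, using made-up numbers for the test statistic and degrees of freedom:

from scipy import stats

t_stat = 2.31     # hypothetical test statistic calculated from your sample
df_free = 24      # hypothetical degrees of freedom
alpha = 0.05      # significance level

# Method 1: critical value from the t-distribution (two-tailed)
t_crit = stats.t.ppf(1 - alpha / 2, df_free)
print("Reject H0" if abs(t_stat) > t_crit else "Fail to reject H0")

# Method 2: p-value from the test statistic (two-tailed)
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df_free))
print("Reject H0" if p_value < alpha else "Fail to reject H0")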
import pandas as pd

# read csv files into pandas DataFrames
titanic = pd.read_csv("titanic.csv")
air_quality = pd.read_csv("air_quality_no2.csv", index_col=0, parse_dates=True)

# save as Excel; index=False removes row index numbers from the export
titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False)

# attributes and methods of the DataFrame
titanic.shape        # row and column counts
titanic.info()       # summary of columns, dtypes and memory
titanic.columns      # column names
titanic.dtypes       # data types
titanic.head(5)      # head of df
titanic.tail(5)      # tail of df
titanic.describe()   # statistics of the file
# Create a sub-set of the DataFrame as a Series
ages = titanic["Age"]         # single brackets create a pandas Series
type(titanic["Age"])          # confirm the type of the Age object

# Create a sub-set of the DataFrame as a DataFrame
age_sex = titanic[["Age", "Sex"]]   # inner brackets are a list, outer brackets select a DataFrame
# Filter rows: get passenger list with Age > 35
above_35 = titanic[titanic["Age"] > 35]

# Filter passenger list with Pclass equal to 2 or 3 (two ways of doing this)
titanic["Pclass"].isin([2, 3])                               # returns a Boolean Series: is Pclass 2 or 3?
titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)] # equivalent filter using an OR condition

# Filter passenger list where Age is not NA
age_no_na = titanic[titanic["Age"].notna()]

# Filter passenger list and return the Names of passengers over 35
adult_names = titanic.loc[titanic["Age"] > 35, "Name"]
# Filter specified rows and columns
titanic.iloc[9:25, 2:5]

# change the first 3 rows of the 4th column to "anonymous"
titanic.iloc[0:3, 3] = "anonymous"
7. Combining Data from tables
import pandas as pd
import os

# check the current working directory
cwd = os.getcwd()
cwd

# Concatenate sub-sets - creates a union between the 2 tables.
# The keys argument adds another hierarchical index to show which table the data comes from
air_quality = pd.concat([air_quality_pm25, air_quality_no2], axis=0, keys=["PM25", "NO2"])
air_quality_no2.head(5)
air_quality_pm25.head(5)
air_quality_no2.shape    # 2068 rows, 4 columns
air_quality_pm25.shape   # 5274 rows, 4 columns
air_quality.shape        # 7340 rows, 4 columns
# merge the columns from the stations_coord DataFrame onto the air_quality df;
# use location as the key to join the two DataFrames; how="left" is a left join with air_quality on the left
air_quality = pd.merge(air_quality, stations_coord, how="left", on="location")
air_quality.head(10)
air_quality.to_csv("checkqualitymerge.csv")

air_quality_parameters.head(5)
# merge (left join) using the parameter column on the left and id on the right
air_quality = pd.merge(air_quality, air_quality_parameters, how="left", left_on="parameter", right_on="id")
# unique cities in the air_quality df
air_quality.city.unique()

# convert the datetime column to datetime objects instead of text
air_quality["datetime"] = pd.to_datetime(air_quality["datetime"])
air_quality["datetime"]

# can also read the dates into datetime format at import time
# pd.read_csv("../data/air_quality_no2_long.csv", parse_dates=["datetime"])

# check the start and end datetime
air_quality["datetime"].min(), air_quality["datetime"].max()

# datetime difference using pandas Timestamp
air_quality["datetime"].max() - air_quality["datetime"].min()

# Add a month column using the dt accessor
air_quality["month"] = air_quality["datetime"].dt.month

# Average NO2 concentration by day of week, by location
air_quality.groupby([air_quality["datetime"].dt.weekday, "location"])["value"].mean()
# Plot a bar graph using hour on the x-axis and average NO2 on the y-axis
fig, axs = plt.subplots(figsize=(12, 4))
air_quality.groupby(air_quality["datetime"].dt.hour)["value"].mean().plot(kind="bar", rot=0, ax=axs)
plt.xlabel("Hour of the day")    # custom x label using Matplotlib
plt.ylabel("$NO_2 (µg/m^3)$")
# use index.year and index.weekday of the datetime object
no_2.index.year, no_2.index.weekday

# plot using a time range
no_2["2019-05-20":"2019-05-21"].plot()

# the resample method of a datetime object works like groupby; it takes an
# aggregation function, and "M" is time-based grouping by Month (month end)
monthly_max = no_2.resample("M").max()
monthly_max
monthly_max.index.freq    # tells you the frequency attribute, e.g. MonthEnd

# Plot the average NO2 levels per day for each station
no_2.resample("D").mean().plot(style="-o", figsize=(10, 5))
9. Textual Data
import pandas as pd

titanic = pd.read_csv("titanic.csv")

titanic["Name"].str.lower()

# Split the Name field on the comma separator
titanic["Name"].str.split(",")
# create a Surname series using the first element of the split Name field
titanic["Surname"] = titanic["Name"].str.split(",").str.get(0)
titanic["Surname"]

# Return a Boolean series: True where Name contains "Countess"
titanic["Name"].str.contains("Countess")
# return rows where Name contains "Countess"
titanic[titanic["Name"].str.contains("Countess")]

# string length of each Name
titanic["Name"].str.len()
# find the row index of the longest name
titanic["Name"].str.len().idxmax()
# return the longest name
titanic.loc[titanic["Name"].str.len().idxmax(), "Name"]
Create a list

# Simple list
mylist = ["apple", "orange", "banana", "mango", "pineapple"]
mylist[2]      # returns 'banana'
mylist[-1]     # returns 'pineapple'
mylist[2:4]    # returns ['banana', 'mango'] (end index is exclusive)

if "apple" in mylist:
    print("Yes, 'apple' is in the fruits list")

mylist.insert(2, "watermelon")   # insert at the position specified
mylist.append("grapes")          # adds to the end of the list

mylist = ["apple", "banana", "cherry"]
tropical = ["mango", "pineapple", "papaya"]
mylist.extend(tropical)          # adds the tropical list to mylist

mylist.remove("banana")          # removes the first occurrence
mylist.pop(1)                    # removes the element at the specified index
del mylist[0]                    # also removes an element by index
mylist.clear()                   # clears (empties) the list
del mylist                       # deletes the list entirely
for i in thislist:        # loop over the list
    print(i)

i = 0
while i < len(thislist):  # while loop
    print(thislist[i])
    i = i + 1

newlist = []
for x in fruits:          # add list items containing "a" to a new list
    if "a" in x:
        newlist.append(x)

thislist.sort()                # sort alphabetically or numerically depending on data type
thislist.sort(reverse=True)    # sort in reverse
thislist.sort(key=str.lower)   # case-insensitive sort
thislist.reverse()             # reverse the current order
mylist = thislist.copy()       # copy a list
mylist = list(thislist)        # also makes a list copy
list3 = list1 + list2          # concatenate lists
A list of lists (5 rows, 2 columns):

b1 = [[2308, 6], [2408, 6.2], [2508, 5.8], [2608, 5.6], [2708, 5.9]]   # create the list
print(b1)         # print the list
print(len(b1))    # print the length of the list
print(type(b1))   # print the data type of the list
print(b1[:2])     # print the first 2 elements of the list
print(b1[3:])     # print from index 3 (the 4th element) onwards
Reference an index in the list

b1[3]      # returns the fourth element (index 3)
b1[0:2]    # returns the first and second elements of the list
b1[-1]     # returns the last element
Add, delete and update
del(b1[0])    # delete the first element
b1.pop(1)     # remove the element at index 1

Insert
b1.insert(1, [24082, 111])   # insert an element at position 1

Update
b1[3] = [2808, 6.7]          # update the value at an existing index

Append
Additions = [[29008, 6.6], [3008, 6], [3108, 6.1]]
b1.extend(Additions)    # adds the Additions list to b1
b1.append([2808, 6.6])  # appends a single element

Clear the list
b1.clear()    # empties the list

Check if an element is in the list
if [2308, 6.2] in b1:
    print("yes")

Loop
for x in b1:
    print(x)
def country_select(countrylist):
    count = 1
    for i in countrylist:                    # iterate over the keys of a country dictionary
        print(f"{i}: {countrylist[i]}")
        count = count + 1
    return countrylist
Numpy
baseball = [180, 215, 210, 210, 188, 176, 209, 200]

import numpy as np
np_baseball = np.array(baseball)
print(type(np_baseball))

mean = np_baseball.mean()
print("mean is: " + str(mean))
med = np.median(np_baseball)
print("median is: " + str(med))
std = np.std(np_baseball)
print("standard deviation is: " + str(std))
Linear Regression using Salary and Years of Experience Data
Data Source: Salary_dataset.csv (Kaggle). The salary dataset includes 2 columns: YearsExperience, which will be our independent variable (X), and Salary (Y).
Linear regression is a fundamental statistical method used to model the relationship between a dependent variable and one or more independent variables. The primary goal of linear regression is to predict the value of the dependent variable based on the values of the independent variables (Chat GPT)
For this example, we first want to see if there is a correlation between the 2 variables by building a regression line and calculating R squared. Then we want to assess the significance of the relationship, using the p-value to test the null hypothesis that there is no relationship between X and Y (X does not predict Y).
Simple Linear Regression Formula
Formula: Y = b + aX + e
Dependent Variable (Y): The outcome or the variable we aim to predict or explain.
Independent Variable(s) (X): The variable(s) used to predict or explain changes in the dependent variable.
a: the slope, i.e. the change in Y for a one-unit change in X.
b: is the intercept of the Y axis. This represents the value of Y when X is zero.
The following Python code (which I used ChatGPT to optimize) calculates the regression line and p-value, and evaluates the null hypothesis.
The steps are as follows:
1. Use pandas to import the CSV file to a data frame and convert each series to an array.
2. Fit a linear regression model using sklearn, which returns the slope and intercept of the regression line.
3. The statsmodels module is then used to calculate the R-squared value and the p-value of the slope.
4. The null hypothesis is then evaluated based on the p-value.
5. scipy.stats and matplotlib.pyplot are then used to calculate a comparison regression line and plot it on a graph.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from scipy import stats


def load_data(csv_location):
    """Load CSV data into a pandas DataFrame."""
    df = pd.read_csv(csv_location)
    return df


def prepare_data(df):
    """Prepare independent and dependent variables for regression."""
    X = df[['YearsExperience']].values   # Extract as a 2D array
    y = df['Salary'].values              # Extract as a 1D array
    return X, y


def fit_sklearn_model(X, y):
    """Fit a linear regression model using sklearn."""
    model = LinearRegression()
    model.fit(X, y)
    return model


def fit_statsmodels_ols(X, y):
    """Fit a linear regression model using statsmodels OLS."""
    X_with_const = sm.add_constant(X)    # Add an intercept to the model
    model = sm.OLS(y, X_with_const).fit()
    return model


def plot_regression_line(df, intercept, slope):
    """Plot the regression line along with the data points."""
    plt.scatter(df['YearsExperience'], df['Salary'], color='blue', label='Data points')
    plt.plot(df['YearsExperience'], intercept + slope * df['YearsExperience'], color='red', label='Regression line')
    plt.title("Salary by Years of Experience")
    plt.xlabel('Years of Experience')
    plt.ylabel('Salary')
    plt.legend()
    plt.show()


def main():
    csv_location = "salary_dataset.csv"
    df = load_data(csv_location)

    # Display basic statistics
    # print(df.describe())

    X, y = prepare_data(df)

    # Fit the model using sklearn
    sklearn_model = fit_sklearn_model(X, y)
    intercept, slope = sklearn_model.intercept_, sklearn_model.coef_[0]
    print("Calculation of Regression Line:\n")
    print(f"Intercept is: {intercept}")
    print(f"Slope is: {slope}")

    # Fit the model using statsmodels to get p-values and R-squared
    statsmodels_model = fit_statsmodels_ols(X, y)
    # print(statsmodels_model.summary())

    # Extract R-squared and p-values
    r_squared = statsmodels_model.rsquared
    p_values = statsmodels_model.pvalues
    print(f"R-squared: {r_squared}")
    # print(f"P-values: {p_values}")

    # Extracting specific p-values by index
    intercept_p_value = p_values[0]   # First p-value (intercept)
    slope_p_value = p_values[1]       # Second p-value (YearsExperience)
    # print(f"Intercept p-value: {intercept_p_value}")
    print(f"p-value (YearsExperience): {slope_p_value}")

    print("\nThe p-value is the probability of observing a t-statistic as extreme as, or more extreme than, the one calculated from your sample data, under the assumption that the null hypothesis is true.")
    print("This is obtained from the t-distribution with n-2 degrees of freedom")
    print("where n is the number of observations\n")

    if slope_p_value > 0.05:
        print("P-value is not significant and therefore we accept the null hypothesis")
    if slope_p_value < 0.05:
        print("P-value is less than 0.05 and therefore we reject the null hypothesis. This means there is strong evidence that the predictor X has a statistically significant effect on the outcome Y")

    # Plotting the regression line
    plot_regression_line(df, intercept, slope)

    # Fit a linear regression line using scipy.stats (for comparison)
    slope, intercept, r_value, p_value, std_err = stats.linregress(df['YearsExperience'], df['Salary'])
    # plt.text(df['YearsExperience'].min(), df['Salary'].max(), f'y = {slope:.2f}x + {intercept:.2f}', ha='left')


if __name__ == "__main__":
    main()
df.shape          # outputs rows and columns
df.sample(5)      # output a sample of 5 rows
df.columns        # output columns
titanic.head(5)   # print the head of the dataframe (top five rows)
titanic.tail(5)   # print the tail of the dataframe (bottom five rows)
type(df)          # output the object type
Convert types
gg["YearString"] = gg["Year"].astype("string")        # convert a series to string
gg["YearInterval"] = gg["YearString"].str[:3]         # LEFT-type function: get the first 3 characters of the string
gg["YearInterval"] = gg["YearInterval"].astype(int)   # convert a series to integer
# Concatenate sub-sets - creates a union between the 2 tables. The keys argument adds another
# hierarchical index to show which table the data comes from
air_quality = pd.concat([air_quality_pm25, air_quality_no2], axis=0, keys=["PM25", "NO2"])   # creates a union

# merge the columns from the stations_coord dataframe onto the air_quality df;
# use location as the key to join the two dataframes; how="left" is a left join with air_quality on the left
air_quality = pd.merge(air_quality, stations_coord, how="left", on="location")

# merge (left join) using the parameter column on the left and id on the right
titanic.iloc[9:25, 2:5]              # Return the specified rows and columns
titanic.iloc[0:3, 3] = "anonymous"   # Change the first 3 rows of the 4th column to anonymous
def city_select(country):                         # function with country as a parameter
    df2 = df[df["country"] == country]            # slice the dataframe where country == country, e.g. United States
    df2 = df2.sort_values(by=["population"], ascending=False)   # sort the dataframe by population, descending
    df3 = df2.iloc[0:5]                           # new dataframe with the top 5 rows by population
    df4 = df3["city"]                             # select the city series (top 5 cities by population)
    return df4
def get_lat(selection):            # define function
    latitude = selection["lat"]    # latitude = series from the "lat" column
    lat = latitude.iloc[0]         # get the first row of the latitude series
    return lat                     # return the first row
Groupby
df.groupby("version").count()   # count by version
df.groupby("version").sum()     # sum columns by version
air_quality.groupby([air_quality["datetime"].dt.weekday, "location"])["value"].mean()   # average by day and location
DateTime
df["datetime"] = pd.to_datetime(air_quality["datetime"])        # change dates in the dataframe to datetime format
pd.read_csv("../data/df_long.csv", parse_dates=["datetime"])    # does the same at import time
df["datetime"].min(), df["datetime"].max()                      # start and end datetime
df["datetime"].max() - df["datetime"].min()                     # difference between the 2 dates
df["month"] = df["datetime"].dt.month                           # add a month column based on another column
df.index.year, df.index.weekday                                 # return the year and weekday of the datetime index
df_pivoted = df.pivot(columns="location", values="value").reset_index()   # pivot and reset the index
df = df_pivoted.melt(id_vars="date.utc")                        # unpivot (melt) back to long format
gg["YearMonth"] = pd.to_datetime(dict(year=gg.Year, month=gg.Month, day=1))   # create a date from the Year and Month columns, with 1 as the day

gg3 = gg.groupby(["YearInterval", "Month"], as_index=False)   # group by YearInterval and Month
gg4 = gg3.max()                                               # aggregate the groups with max
air_quality.city.unique()                                            # return the unique cities in the city column
air_quality = air_quality.rename(columns={"date.utc": "datetime"})   # rename columns

Split columns in a DataFrame
titanic["Name"].str.split(",")                                   # split the Name column on the comma
titanic["Surname"] = titanic["Name"].str.split(",").str.get(0)   # split the Name column and return the first value in the array

df["Name"].str.lower()                # change the Name column to lowercase
df["Name"].str.contains("Countess")   # return True where the column contains "Countess"
df["Name"].str.len()                  # return the string length of the Name column
df["Name"].str.len().idxmax()         # return the index of the longest name
df.loc[titanic["Name"].str.len().idxmax(), "Name"]   # return the longest name in the column using loc

# rename the gender column values to M and F in the dataframe
df["gender_short"] = df["gender"].replace({"male": "M", "female": "F"})   # replace values in the gender column with M and F

titanic["Pclass"].isin([2, 3])                           # checks if Pclass is 2 or 3
adult_names = titanic.loc[titanic["Age"] > 35, "Name"]   # return the Names of passengers over 35
Print
print("\n1 Day Retention by Version")        # \n for a new line
print(f"You searched for: {domain_name}\n")  # an f-string allows placeholders in print
Loops
for link in a:
    print(link.get("href"))   # e.g. print each link's href attribute (assumed; the original snippet is truncated)
Input
domain = input("Enter domain to search for: ")
In business, the most commonly used t-test is often the Independent Two-Sample t-test. This test is widely utilized to compare the means of two independent groups or samples to determine if there is a significant difference between them.
The Independent Two-Sample t-test is particularly useful in various business scenarios, such as:
Comparing the performance metrics of two different products or services.
Analyzing the effectiveness of different marketing strategies or campaigns.
Assessing the impact of changes in processes or procedures on business outcomes.
Evaluating the differences in customer satisfaction levels between different segments or demographics.
Comparing the financial performance of different investment portfolios or asset classes.
One-Sample t-Test
Purpose:
To determine if the mean of a single sample differs significantly from a known or hypothesized population mean.
Example:
Scenario: A company wants to compare the average weekly sales of a new product to the historical average weekly sales of similar products, which is $5000.
Hypothesis: The mean weekly sales of the new product is different from $5000.
Application: The company collects weekly sales data for the new product over several weeks and uses the one-sample t-test to determine if the average sales differ from $5000.
Calculation: t = (x̄ - μ) / (s / √n), where x̄ = the sample mean, μ = the hypothesized population mean, s = the sample standard deviation, and n = the sample size.
Numerator: sample mean minus the hypothesized population mean. Denominator: sample standard deviation / √(sample size n).
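A small sketch of this calculation in Python, with made-up weekly sales figures; the manual formula and scipy's ttest_1samp should agree:

import numpy as np
from scipy import stats

weekly_sales = np.array([5200, 4800, 5100, 5300, 4950, 5250, 5400, 4900])  # hypothetical sample
mu = 5000                                                                  # hypothesized population mean

# By hand: t = (sample mean - mu) / (sample sd / sqrt(n))
n = len(weekly_sales)
t_manual = (weekly_sales.mean() - mu) / (weekly_sales.std(ddof=1) / np.sqrt(n))

# With scipy (degrees of freedom = n - 1)
t_stat, p_value = stats.ttest_1samp(weekly_sales, mu)
print(t_manual, t_stat, p_value)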
2. Independent Two-sample t-test
Purpose:
To compare the means of two independent groups to see if they differ significantly.
Null Hypothesis: The means of 2 groups are equal. If rejected, there is a significant difference between the 2 means.
Types:
Equal Variance (Pooled) t-Test: Assumes that both groups have the same variance.
Unequal Variance (Welch’s) t-Test: Does not assume equal variances between the groups.
How do we determine if they have the same variance? Use Levene's test first to determine whether the variances are equal between the groups.
Example (Equal Variance) or Pooled t-Test:
Scenario: A retailer wants to compare the average sales of two different stores located in different regions.
Hypothesis: The mean sales of Store A are equal to the mean sales of Store B.
Application: Sales data from Store A and Store B are collected over the same time period. An independent two-sample t-test assuming equal variances is used to compare the average sales.
t-Value = Complex Calculation
Degrees of Freedom df=n1+n2−2
Example (Unequal Variance, Welch’s t-test):
Scenario: A company wants to compare the average salaries of employees in two different departments.
Hypothesis: The mean salary in Department X is equal to the mean salary in Department Y.
Application: Salary data from both departments are collected. An independent two-sample t-test with unequal variances (Welch’s t-test) is used to compare the average salaries.
Calculation/Usage: the independent t-test is used when the number of samples and the variance of the two data sets are different.

3. Paired Sample t-Test

Purpose:
To compare the means of two related groups (e.g., the same subjects measured at two different times).
Business Example:
Scenario: A company implements a new training program and wants to determine its effectiveness by comparing employee performance before and after the training.
Hypothesis: The mean performance score after training is different from the mean performance score before training.
Application: Performance scores of employees are recorded before and after the training. A paired sample t-test is used to compare the average performance scores before and after the training.
More Examples:
Example: comparing the mean blood pressure of patients before and after treatment.
Example: testing if a new drug increases recovery rates (greater than) compared to the known recovery rate without the drug.
Calculation: t = d̄ / (s_diff / √n)
where d̄ is the mean of the paired differences (mean1 - mean2), s_diff is the standard deviation of the paired differences, and n is the sample size (the number of paired differences).
Degrees of Freedom:
n-1 = degrees of freedom
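A short sketch of the paired test with scipy, using invented before/after scores:

import numpy as np
from scipy import stats

before = np.array([72, 68, 75, 80, 66, 71, 77, 69])   # hypothetical scores before training
after = np.array([75, 70, 78, 83, 69, 74, 80, 72])    # hypothetical scores after training

t_stat, p_value = stats.ttest_rel(after, before)      # paired t-test, df = n - 1
print(t_stat, p_value)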
One Side or Two-Sided:
a) One-Sided t-Test
Purpose:
To test if the sample mean is significantly greater than or less than the hypothesized population mean (used for directional hypotheses).
Example:
Scenario: A company believes that a new marketing campaign will increase the average number of customers visiting their store per day compared to the previous average of 100 customers.
Hypothesis: The mean number of customers after the campaign is greater than 100.
Application: The company collects daily customer visit data after the campaign starts. A one-sided t-test is used to determine if the average number of customers is significantly greater than 100.
Example 2: Comparing the average test score of a class to a national average
b) Two-Sided t-Test

Purpose:
To test if the sample mean is significantly different from the hypothesized population mean (used for non-directional hypotheses).
Example:
Scenario: A company wants to compare the customer satisfaction ratings of two different products to see if there is any difference.
Hypothesis: The mean satisfaction rating of Product A is different from the mean satisfaction rating of Product B.
Application: Customer satisfaction ratings for both products are collected. A two-sided t-test is used to compare the average satisfaction ratings of the two products.
Levene's Test (Python sketch below): tests whether the variances of two independent groups are equal; use it to determine whether to run the equal-variance or unequal-variance independent samples t-test.
If the p-value for the Levene test is greater than 0.05, the variances are not significantly different from each other (i.e. the homogeneity of variance assumption is met), and we can use the equal-variance (pooled) independent samples t-test. If they are significantly different (p < 0.05), we use the unequal-variance (Welch's) independent samples t-test.
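A sketch of that workflow with scipy, using invented store sales figures:

import numpy as np
from scipy import stats

store_a = np.array([102, 98, 110, 95, 105, 99, 107, 101])   # hypothetical daily sales, Store A
store_b = np.array([90, 94, 88, 97, 92, 89, 95, 91])        # hypothetical daily sales, Store B

levene_stat, levene_p = stats.levene(store_a, store_b)
if levene_p > 0.05:
    # variances not significantly different: pooled (equal variance) t-test
    t_stat, p_value = stats.ttest_ind(store_a, store_b, equal_var=True)
else:
    # variances significantly different: Welch's (unequal variance) t-test
    t_stat, p_value = stats.ttest_ind(store_a, store_b, equal_var=False)
print(levene_p, t_stat, p_value)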
SELECT
  CURRENT_DATE() AS Current_Date,              -- returns today's date
  DATE(2024, 01, 24) AS Constructed_Date,      -- constructs the date from year, month, day
  LAST_DAY('2008-12-25') AS Last_Date,         -- returns 2008-12-31, the last day of the month
  DATE_ADD(DATE "2024-01-24", INTERVAL 1 DAY), -- 1 day later
  DATE_SUB(DATE "2024-01-24", INTERVAL 1 DAY)  -- 1 day before
FORMAT_DATE and PARSE_DATE
SELECT
  CURRENT_DATE() AS today,                                                                -- returns today
  FORMAT_DATE('%d/%m/%Y', CURRENT_DATE()) AS formattedday,                                -- UK format
  PARSE_DATE('%d-%m-%Y', '01-05-2024') AS passstringdate,                                 -- convert a string to a date object
  FORMAT_DATE('%d/%m/%Y', PARSE_DATE('%d-%m-%Y', '01-05-2024')) AS formattedpassstring    -- convert a string to a date and then format it
EXTRACT
-- CURRENT_DATE() is 15/05/2024
SELECT
  EXTRACT(DAY FROM CURRENT_DATE()) AS day,                          -- 15/05/2024
  EXTRACT(DAYOFWEEK FROM CURRENT_DATE()) AS DayOfWeek,              -- 4
  EXTRACT(WEEK FROM CURRENT_DATE()) AS Week,                        -- 19
  EXTRACT(WEEK(MONDAY) FROM CURRENT_DATE()) AS WeekStartingMonday,  -- 20
  EXTRACT(ISOWEEK FROM CURRENT_DATE()) AS WeekStartingMonday2,      -- 20
  EXTRACT(MONTH FROM CURRENT_DATE()) AS Mth,                        -- 5
  EXTRACT(QUARTER FROM CURRENT_DATE()) AS Qtr,                      -- 2
  EXTRACT(YEAR FROM CURRENT_DATE()) AS Yr,                          -- 2024
  EXTRACT(ISOYEAR FROM CURRENT_DATE()) AS YrWkStartMon              -- 2024
In this example, we start with a sales table and a date table from which to create our running total of sales. We start by creating a sales measure as follows:
Sales = SUM(FactSales[SalesAmount])
Next, we create a quick measure to create the Running Total in Sales and add it to our table:
The code for the quick measure 'Sales running total in Year' is as follows:
Sales running total in Year =
CALCULATE(
    [Sales],
    FILTER(
        ALLSELECTED('Date'[Year]),
        ISONORAFTER('Date'[Year], MAX('Date'[Year]), DESC)
    )
)
It looks pretty simple, but there is quite a lot to it, so let’s break it down:
1. CALCULATE() Syntax: CALCULATE(Expression, [Filter1]…) The CALCULATE() function takes the [Sales] measure as the first argument. [Sales] is just the sum of sales.
Next, we have the main filter argument, which filters the [Sales] measure on a range of years from the 'Date'[Year] column. The filter starts with the FILTER function, which takes ALLSELECTED('Date'[Year]) as the column to filter on, and ISONORAFTER('Date'[Year], MAX('Date'[Year]), DESC) as the filter argument.
ALLSELECTED('Date'[Year]) The ALLSELECTED function removes the filters that come from inside the visual on the 'Date'[Year] column (while keeping external filters such as slicers), so in our table the year on each row is ignored and all years in the 'Date' table are included in the column. We can prove this by creating another measure:
When added to our table visual, we can see the count is 24 for every year row in the table. That is because this particular table has 24 years of date information in it.
The next part is the tricky part, how we filter ALLSELECTED(‘Date'[Year])
ISONORAFTER('Date'[Year], MAX('Date'[Year]), DESC) The ISONORAFTER function compares each element of the first argument (the 'Date'[Year] column) with the second argument, MAX('Date'[Year]), using either ascending (ASC) or descending (DESC) sort order. To understand what MAX('Date'[Year]) evaluates to in this context (it may not be what you think), we can create a measure:
Max Date Year = MAX('Date'[Year])
As you can see in the table below, Max Date Year used in this context returns the same year as the Year column of the table visual.
The comparison works as follows:
‘Based on the sort order, the first parameter is compared with the second parameter. If the sort order is ascending, the comparison to be done is the first parameter greater than the second parameter. If the sort order is descending, the comparison to be done is the second parameter less than the first parameter.’
The tables below show what is going on. If I filter the page to the years 2007-2009 (which effectively filters the years in the table down from 24 to 3), we can create the following tables. Essentially, the function checks whether each element of the Date[Year] column is on or after the max year, which is derived from the year on the row of the table visual. If the Date[Year] value is less than or equal to that max year, it returns TRUE.
The years in the Date[Year] column that return TRUE are then passed as a filter to CALCULATE to filter the [Sales] measure, creating the 'Running Total in Sales'.
Year: 2007
Date[Year] | Max('Date'[Year]) | ISONORAFTER (TRUE if Date[Year] <= Max, DESC) | Sales Total
2007 | 2007 | TRUE | 4.56M
2008 | 2007 | FALSE |
2009 | 2007 | FALSE |
Running Total: 4.56M

Year: 2008
Date[Year] | Max('Date'[Year]) | ISONORAFTER (TRUE if Date[Year] <= Max, DESC) | Sales Total
2007 | 2008 | TRUE | 4.56M
2008 | 2008 | TRUE | 4.11M
2009 | 2008 | FALSE |
Running Total: 8.63M

Year: 2009
Date[Year] | Max('Date'[Year]) | ISONORAFTER (TRUE if Date[Year] <= Max, DESC) | Sales Total
2007 | 2009 | TRUE | 4.56M
2008 | 2009 | TRUE | 4.11M
2009 | 2009 | TRUE | 3.74M
Running Total: 12.41M
That's pretty complicated if you ask me, even though it is correct. We can create our own running total in sales measure, which is easier to understand. The code is below:
Running Total human =
VAR CurrentYear = SELECTEDVALUE('Date'[Year])    // the year on the current row (not used below)
VAR MaxDateInFilterContext = MAX('Date'[Year])
VAR DatesLessThanMaxDate =
    CALCULATE([Sales], 'Date'[Year] <= MaxDateInFilterContext)
RETURN
    DatesLessThanMaxDate
The first part of building the code to pull the data for the lifetime value model is to create an SQL script that will get the historical data we need to train our model on.
The SQL code selects 6 fields from the Contoso Retail Data Warehouse. The full SQL code is at the bottom of the page.
First Section of Code (WITH)
USE ContosoRetailDW;
WITH FirstTransactionDates AS (
SELECT customerkey,MIN([DateKey]) AS first_transaction_date
FROM [dbo].[FactOnlineSales]
GROUP BY customerkey)
,
Orders AS (SELECT
[CustomerKey] as CustomerKey,
[SalesOrderNumber] as SalesOrderNumber,
ROW_NUMBER() OVER (PARTITION BY [CustomerKey] ORDER BY [SalesOrderNumber] ASC) AS OrderNumber,
SUM([SalesAmount]) as SalesAmount
FROM
[dbo].[FactOnlineSales]
GROUP BY [CustomerKey], [SalesOrderNumber]
)
After the database is selected in the USE statement, the code starts with a WITH clause.
Two sub-queries are created in the WITH clause. This makes the code easier to manage than nesting sub-queries within sub-queries in the main code body.
The first sub-query, FirstTransactionDates, creates a temporary table of customers and their first transaction date from the FactOnlineSales table.
The second sub-query, Orders, creates a temporary table of customers and their orders, with the OrderNumber field ranking each customer's orders from 1 upwards (1 being the first order). I will create a separate post explaining the line: ROW_NUMBER() OVER (PARTITION BY [CustomerKey] ORDER BY [SalesOrderNumber] ASC) AS OrderNumber
Second Section of Code
SELECT
--selection of CustomerKey from DimCustomer
c.CustomerKey as CustomerKey,
-- Calculate Age as difference between [BirthDate] selected from DimCustomer and today (using the GETDATE() function), then divide by 365
DATEDIFF(DAY, [BirthDate], GETDATE())/365 as Age,
--Gender selected as the [gender] field from DimCustomer and converted to 0 and 1 for m and f respectively
CASE
WHEN [gender] = 'm' THEN 0
WHEN [gender] = 'f' THEN 1
END as Gender,
--YearlyIncome selected from DimCustomer table
c.YearlyIncome as YearlyIncome,
--FirstPurchaseAmount from sub-query a
a.FirstPurchaseAmount as FirstPurchaseAmount,
--LifetimeSales from sub-query b
b.first12months_salesamount as LifetimeSales
-- select from DimCustomer with left join on.... tables to follow...
FROM dbo.DimCustomer c LEFT OUTER JOIN
As you can see from the above, there are 6 field selections: 1. The CustomerKey from the DimCustomer table (alias c). 2. Age, calculated as the difference in days between the customer's birthdate ([BirthDate]) and today's date (GETDATE()), divided by 365 to get the age in years. 3. Gender, which uses the [gender] field from the DimCustomer table and swaps 'm' for 0 and 'f' for 1, so we have numeric data for the model. 4. The YearlyIncome field from the DimCustomer table.
Then the last 2 fields FirstPurchaseAmount and first12months_salesamount which are calculated using sub-queries.
Third Section of Code (Sub-query a)
(
SELECT CustomerKey, SalesAmount as FirstPurchaseAmount
FROM ORDERS
WHERE OrderNumber = 1
) a
ON c.CustomerKey = a.CustomerKey
First is sub-query a, which the DimCustomer table joins to on c.CustomerKey = a.CustomerKey. The SELECT query selects CustomerKey and SalesAmount from the Orders table, the temporary table created with a sub-query in the WITH clause at the start of the code. The Orders table is filtered by OrderNumber = 1 to return the sum of sales for each customer on their first order. This gives us the FirstPurchaseAmount field.
The Fourth Section of code is sub-query B
LEFT OUTER JOIN
-- Build First 12 Month Sales Table b
(
SELECT
f.customerkey,
SUM(CASE WHEN DATEDIFF(month, ft.first_transaction_date, f.DateKey) < 12 THEN f.salesamount
ELSE 0
END) AS first12months_salesamount
FROM [dbo].[FactOnlineSales] f
LEFT OUTER JOIN FirstTransactionDates ft ON
f.customerkey = ft.customerkey
GROUP BY f.CustomerKey
) as b
ON c.CustomerKey = b.CustomerKey
WHERE c.CustomerType = 'Person'
The fourth section of code shows the LEFT OUTER JOIN of the DimCustomer table to sub-query b. Sub-query b gets us the first 12 month sales amount of customers, which will be our dependent variable in our final Model.
The SELECT code starts by selecting the CustomerKey from the FactOnlineSales table.
The second line selects the first_transaction_date from the FirstTransactionDates sub-query created in the WITH clause, and the DateKey from the FactOnlineSales table, and looks at the difference in months between the two. If the DateKey (the transaction date) is less than 12 months after the first transaction, then the sales amount is included in the sum first12months_salesamount.
SUM(CASE WHEN DATEDIFF(month, ft.first_transaction_date, f.DateKey) < 12 THEN f.salesamount ELSE 0 END) AS first12months_salesamount
This is the field used in the main SELECT query in code section 2. b.first12months_salesamount as LifetimeSales
Full SQL Code
USE ContosoRetailDW;
WITH FirstTransactionDates AS (
SELECT customerkey,MIN([DateKey]) AS first_transaction_date
FROM [dbo].[FactOnlineSales]
GROUP BY customerkey)
,
Orders AS (SELECT
[CustomerKey] as CustomerKey,
[SalesOrderNumber] as SalesOrderNumber,
ROW_NUMBER() OVER (PARTITION BY [CustomerKey] ORDER BY [SalesOrderNumber] ASC) AS OrderNumber,
SUM([SalesAmount]) as SalesAmount
FROM
[dbo].[FactOnlineSales]
GROUP BY [CustomerKey], [SalesOrderNumber]
)
SELECT
c.CustomerKey as CustomerKey,
DATEDIFF(DAY, [BirthDate], GETDATE())/365 as Age,
CASE
WHEN [gender] = 'm' THEN 0
WHEN [gender] = 'f' THEN 1
END as Gender,
c.YearlyIncome as YearlyIncome,
a.FirstPurchaseAmount as FirstPurchaseAmount,
b.first12months_salesamount as LifetimeSales
FROM
dbo.DimCustomer c
--Build FirstPurchaseAmount table a
LEFT OUTER JOIN
(
SELECT CustomerKey, SalesAmount as FirstPurchaseAmount
FROM ORDERS
WHERE OrderNumber = 1
) a
ON c.CustomerKey = a.CustomerKey
LEFT OUTER JOIN
-- Build First 12 Month Sales Table b
(
SELECT
f.customerkey,
SUM(CASE WHEN DATEDIFF(month, ft.first_transaction_date, f.DateKey) < 12 THEN f.salesamount
ELSE 0
END) AS first12months_salesamount
FROM [dbo].[FactOnlineSales] f
LEFT OUTER JOIN FirstTransactionDates ft ON
f.customerkey = ft.customerkey
GROUP BY f.CustomerKey
) as b
ON c.CustomerKey = b.CustomerKey
WHERE c.CustomerType = 'Person'
Here we make use of the HR Analytics dataset available on Kaggle. The dataset was created to understand the factors behind employee attrition and can be used to train a model for predicting employee churn.
First, we should check the data imported into the Pandas data frame looks good:
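A minimal sketch of that initial check, assuming the dataset has been downloaded from Kaggle to a local CSV (the file name and path below are illustrative):
import pandas as pd
# Illustrative path and file name for the Kaggle HR Analytics CSV
csvpath = r'c:\Users\Admin\Documents\Github\Python Code\HR_comma_sep.csv'
df = pd.read_csv(csvpath)
# Quick sanity checks on the import
print(df.head())   # first five rows
print(df.shape)    # (rows, columns)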
We can also run df.columns to view the columns of the data frame and df.values to display the values of the data frame as an array.
The pandas data frame describe() method gives us descriptive statistics on the data frame. Note that include='all' is used here so that all columns are returned; the default is numeric columns only, and you will notice that the non-numeric columns display NaN (Not a Number) for statistics that don't apply to them.
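As a quick sketch of that call (assuming the data frame is named df, as above):
# Summary statistics for every column, not just the numeric ones;
# non-numeric columns show NaN for statistics such as mean and std
print(df.describe(include='all'))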
Here we can verify that all columns have the same count of 14999, so there appears to be no missing data. We can also view the standard deviation (std) in relation to the mean, to get an idea of the variance in the data, for example:
Mean average_monthly_hours = 201 hours; 1 standard deviation of average monthly hours = 49.94 hours. Therefore we can infer that approximately 68% of employees work 201 ± 49.94 hours, i.e. roughly 151 to 251 hours a month, and approximately 95% of employees work 201 ± 100 hours (rounded), i.e. roughly 101 to 301 hours a month.
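These ranges can be checked with a couple of lines of arithmetic (the mean and standard deviation below are taken from the describe() output quoted above):
# Empirical-rule ranges for average_monthly_hours
mean_hours = 201
std_hours = 49.94
print(mean_hours - std_hours, mean_hours + std_hours)          # ~68% of employees: roughly 151 to 251 hours
print(mean_hours - 2 * std_hours, mean_hours + 2 * std_hours)  # ~95% of employees: roughly 101 to 301 hours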
We need to check that the datatypes are in the correct format if we want to build a model from them.
The dependent variable that we want to predict is the 'left' field, which is a binary field of 0 or 1. A value of 1 means the employee left the company (attrition).
There are 9 other columns which make up the independent variables: 2 floats, 5 integers, and 2 object data types.
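A quick way to confirm this (a sketch, assuming the data frame is named df):
# Inspect the column datatypes before any encoding
print(df.dtypes)
# The target is the binary 'left' column
print(df['left'].value_counts())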
The salary field is categorical and includes 3 levels: low, medium, and high. Because there is an inherent order, we can use ordinal encoding to convert it to an integer: salary_map = {'low': 0, 'medium': 1, 'high': 2}
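Applied to the data frame, the mapping might look like this (assuming the column is named 'salary', as in the Kaggle dataset):
# Ordinal encoding for the salary column (low < medium < high)
salary_map = {'low': 0, 'medium': 1, 'high': 2}
df['salary'] = df['salary'].map(salary_map)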
The Department field consists of non-ordered categories: Accounting, HR, IT, management, marketing, product_mgt, RandomD, and sales.
We use one-hot encoding on this field, which creates 8 new binary fields, one for each department name. The code is as follows:
df = pd.get_dummies(df, columns=['Department'])
This results in a lot more columns in the data frame as below, but now we’re ready to build a model.
Building the Random Forests model
The steps for building the model are as follows (a sketch of these steps is shown after the list):
1. Split the dataset into X and y variables, where X consists of all the independent variable fields and y contains only the dependent variable (the 'left' field).
2. Split X and y into 4 sets for training and testing. The test sets are a randomized 20% of the data.
3. Initialize the random forest classifier.
4. Train the model with the training data (X_train and y_train). This represents 80% of the data.
5. Make predictions from the X_test set. The model tries to predict the actual values that are in the y_test set.
6. Measure the accuracy by comparing the predictions of y (y_pred) with the actual y_test values.
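A minimal sketch of these six steps with scikit-learn (the column names and parameters such as random_state are assumptions, not taken from the original code):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# 1. Independent variables (everything except the target) and the target 'left'
X = df.drop('left', axis=1)
y = df['left']
# 2. 80/20 split; the test set is a randomized 20% of the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 3. Initialize the random forest classifier
rf = RandomForestClassifier(random_state=42)
# 4. Train the model on the training data (80% of the rows)
rf.fit(X_train, y_train)
# 5. Make predictions for the hold-out set
y_pred = rf.predict(X_test)
# 6. Compare the predictions with the actual y_test values
print('Accuracy:', accuracy_score(y_test, y_pred))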
The resulting accuracy is 0.9886666666666667, which represents roughly 99% accuracy. If I re-run the code, it gives similar results. The model is almost perfect!
We can compare this random forest model to a logistic regression model (a sketch is shown below). The resulting accuracy score is only 0.785, or about 78% accurate, so the random forest model wins!
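The original comparison code isn't reproduced here, but a minimal sketch, continuing from the same train/test split as above, would be:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Swap in a logistic regression classifier on the same split
logreg = LogisticRegression(max_iter=1000)  # max_iter raised so the solver converges on this data
logreg.fit(X_train, y_train)
y_pred_lr = logreg.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred_lr))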
Power BI can be useful for displaying and storing survey results for your organization. In this example we build some data for the results of a fictitious Airline survey, using questions taken from a real online airport survey.
The data was built in Excel with a sheet for questions and a sheet for answers. You can download the Excel file here.
It’s easy to publish the report as a PDF, so you can share it with users who don’t have access to Power BI. Here is the airport survey report as PDF
The questions are a simple list with a Question ID and section.
The answers on the second sheet are formatted so that each row represents all the answers for each respondent.
Steps for Building the Airline Survey Report.
Step 1 The data is imported into Power BI Desktop using Power Query.
The only change to the question table is a simple column rename.
The answer table is transformed so that each line of the table contains the respondent ID, the question, and the Value (the answer). This prepares the data so it can be joined to the question table and easily visualized. The applied steps are as follows:
As you can see, the first step after the import is simply to promote the headers, so the first row becomes the column names. The second step just changes the column types to the correct types. The key third step is to unpivot the data, so the table is transformed, as below, into a separate row for each respondent's answer. This makes the report easy to build.
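For readers more comfortable in pandas, the reshape that the Power Query unpivot step performs can be sketched with melt (illustrative only; the report itself uses Power Query, and the column names below are made up):
import pandas as pd
# Wide format: one row per respondent, one column per question
answers_wide = pd.DataFrame({
    'Respondent': [1, 2],
    'Q1': [4, 5],
    'Q2': [3, 2],
})
# Long format: one row per (respondent, question, answer), as produced by the unpivot step
answers_long = answers_wide.melt(id_vars='Respondent', var_name='Question', value_name='Value')
print(answers_long)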
The report only requires one measure, which is just a unique count of the respondent ID:
Answer Count = DISTINCTCOUNT(Answers[Respondent])
Basic Graphs can be added.
And Likert Scale style graphs using a 100% Stacked bar graph visual.
In order to find the best data to build our model, we need to run correlations on the data. In the initial predictive model we built, we guessed the fields, namely age, gender, income and first purchase amount, but the model gave a very poor MSE accuracy result, so we need to go back to the data and look for fields that correlate well with the sales amount (as well as checking the existing fields we have used).
We can check age and first purchase amount against first 12 month sales amount just to confirm this.
Age vs. Sales (first 12 month sales) code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
csvlocation = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
df = pd.read_csv(csvlocation)
df.columns # display the columns available
# Create a scatter plot with regression line
sns.regplot(data=df, x='Age', y='LifetimeSales')
plt.title("First 12 Month Sales by Customer Age") # graph title
plt.xlabel('Age') # x axis label
plt.ylabel('First 12 Month Sales') # y axis label
# Fit a linear regression line
slope, intercept, r_value, p_value, std_err = stats.linregress(df['Age'], df['LifetimeSales'])
# Add equation of the regression line to the plot
plt.text(df['Age'].min(), df['LifetimeSales'].max(), f'y = {slope:.2f}x + {intercept:.2f}', ha='left')
# Calculate correlation coefficient
correlation_coefficient = df['Age'].corr(df['LifetimeSales'])
# Add correlation coefficient to plot
plt.text(df['Age'].max(), df['LifetimeSales'].min(), f'Correlation coefficient: {correlation_coefficient:.2f}', ha='right')
plt.show()
The output of graphing customer age against first 12 month sales confirms the poor correlation. Here the correlation coefficient is only 0.08.
Next, we can look at FirstPurchaseAmount against the first 12 month sales amount. Here is the code below; it includes removing the major outlier, the customer with CustomerID = 1.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
csvlocation = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
df = pd.read_csv(csvlocation)
df = df[df['CustomerID'] != 1]
df.columns # display the columns available
# Create a scatter plot with regression line
sns.regplot(data=df, x='FirstPurchaseAmount', y='LifetimeSales')
plt.title("First 12 Month Sales by FirstPurchaseAmount") # graph title
plt.xlabel('FirstPurchaseAmount') # x axis label
plt.ylabel('First 12 Month Sales') # y axis label
# Fit a linear regression line
slope, intercept, r_value, p_value, std_err = stats.linregress(df['FirstPurchaseAmount'], df['LifetimeSales'])
# Add equation of the regression line to the plot
plt.text(df['FirstPurchaseAmount'].min(), df['LifetimeSales'].max(), f'y = {slope:.2f}x + {intercept:.2f}', ha='left')
# Calculate correlation coefficient
correlation_coefficient = df['FirstPurchaseAmount'].corr(df['LifetimeSales'])
# Add correlation coefficient to plot
plt.text(df['FirstPurchaseAmount'].max(), df['LifetimeSales'].min(), f'Correlation coefficient: {correlation_coefficient:.2f}', ha='right')
plt.show()
And here is the graph. The correlation coefficient is only 0.07.
In the previous article ‘Creating a Customer Lifetime Value Model’ we imported and transformed a table of customer data from an MS SQL Database, which included the ‘first 12 months sales amount’. At the end of the exercise, we exported the data frame to a CSV file, which we can check out using the Matplotlib library in Python.
Histogram: We can start by having a look at the distribution of customers’ ages with a histogram. Here is the code:
import pandas as pd
import matplotlib.pyplot as plt
csvlocation = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
df = pd.read_csv(csvlocation)
df.columns # display the columns available
# Create a histogram
df['Age'].plot(kind='hist', bins=20) # Adjust the number of bins as needed
plt.title("Distribution of Customer Age")
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
The graph of the output is below. We immediately notice that all customers are above 45. This isn't an issue; it's just a reflection of the database being old and the age calculation being based on the current date and the customers' dates of birth. We can see there are more customers between the ages of 50 and 65. We would expect there to be fewer customers as age increases after that, so I think that is fairly reflective of the general population.
Scatter Plot Next, we can have a look at the first 12-month sales amount against age. Here is the Python code:
import pandas as pd
import matplotlib.pyplot as plt
csvpath = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
df = pd.read_csv(csvpath)
df.columns # display the columns available
#kind options: 'scatter', 'hist', 'bar',
df.plot(kind = 'scatter', x = 'Age', y = 'LifetimeSales')
plt.title("First 12 Month Sales by Customer Age") # graph title
plt.xlabel('Age') # x axis label
plt.ylabel('First 12 Month Sales') # y axis label
plt.show()
Here is the graph:
We can immediately see the range of first 12-month sales amounts (roughly 10K to 70K). That's a very high amount and makes me want to go back and check the data. When I go and look at the data, it does look correct, so we can continue.
All customers are over 40. This is reflective of the database being over 15 years old with no new customers added, but it also means we have some really old customers with an estimated age of 100. Customers' ages look to be fairly evenly distributed between 40 and 80, with the 80-plus group having lower sales.
Pie Graph The following code gives us a pie chart showing the split of gender in the dataset.
import pandas as pd
import matplotlib.pyplot as plt
csvpath = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
df = pd.read_csv(csvpath)
df.columns # display the columns available
# replace 0 and 1 in the 'Gender' column with 'female' and 'male'
df['Gender'] = df['Gender'].replace({0: 'female', 1: 'male'})
# return the labels and counts of the unique gender values
gender_counts = df['Gender'].value_counts()
# print(gender_counts)
# Define colors for each category
colors = {'female': 'pink', 'male': 'blue'}
# Plotting
plt.figure(figsize=(4, 4)) # Adjust figure size if needed
plt.pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%',
colors=[colors[g] for g in gender_counts.index])
plt.title('Distribution of Gender')
plt.show()
The output looks like this. In the code, we replaced the 0s and 1s with 'female' and 'male'. The data looks pretty normal, as we would expect, and there are no nulls, which is good.
Another Scatter Plot Next, we look at the ‘FirstPurchaseAmount’ field. We create a scatter graph with the code below.
import pandas as pd
import matplotlib.pyplot as plt
csvpath = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
df = pd.read_csv(csvpath)
df.columns # display the columns available
#kind options: 'scatter', 'hist', 'bar',
df.plot(kind = 'scatter', x = 'CustomerID', y = 'FirstPurchaseAmount')
plt.title("Customers by First Purchase Amount") # graph title
plt.xlabel('CustomerID') #x axis label
plt.ylabel('First Purchase Amount') #y axis label
plt.show()
The first thing we notice when we run the graph is that there is one major outlier in the data.
Removing this customer should help improve the model, and we should be able to see more outliers in the graph once it is removed.