Demonstrating the use of the DAX INDEX function with the Contoso Retail Data Warehouse. We can start off by building a virtual table in the DAX query editor, which we can use to apply the INDEX function. The table created is a list of the first 10 customers by customer key from the DimCustomer table.
If we evaluate the CustomerSalesSample table first we can see the table we are working with.
The syntax for the INDEX function is as follows:
-- INDEX(<position>[, <relation> or <axis>][, <orderBy>][, <blanks>][, <partitionBy>][, <matchBy>][, <reset>])
In our example, we use a <position> of 1 to get the first sales amount, which is the highest sales amount, as the Total Sales column is sorted in descending order (DESC).
And then when we evaluate the Index expression (variable Ind), we get the following, which is the correct output we are looking for.
To get the last position, we could either sort the data by ascending order (ASC), or we could use the INDEX of -1 as the following example:
When we use -1 as the index, we get the blank sales returned, which isn’t what we wanted, so we need to modify the code.
One way of filtering out the blanks from the sales table is to add a filter over the table as an additional variable, as in the code below.
DEFINE
-- Syntax: INDEX(<position>[, <relation> or <axis>][, <orderBy>][, <blanks>][, <partitionBy>][, <matchBy>][, <reset>])
VAR CustomerSalesSample =
    CALCULATETABLE(
        SELECTCOLUMNS(
            DimCustomer,
            "CustomerKey", DimCustomer[CustomerKey],
            "FirstName", DimCustomer[FirstName],
            "LastName", DimCustomer[LastName],
            "Total Sales", SUMX(RELATEDTABLE(FactOnlineSales), FactOnlineSales[SalesAmount])
        ),
        TOPN(10, DimCustomer, DimCustomer[CustomerKey], ASC)
    )
VAR CustomerSalesNotBlank =
    FILTER(CustomerSalesSample, NOT(ISBLANK([Total Sales])))
VAR Ind =
    INDEX(
        -1,
        CustomerSalesNotBlank,
        ORDERBY([Total Sales], DESC)
    )

EVALUATE
Ind
Evaluating this code now gives us the least total sales amount that is not blank:
To demonstrate the use of the DAX RANK function, we'll start by creating a simple sales table with the sales amount and the year. Then we'll add additional columns to create examples of using the RANK function.
In the first example (SimpleSalesRank), we’ll just create a simple ranking. The default function ranks the Sales column from the least amount to the highest amount, and you’ll notice the blank sales value is included as rank one. Note, we can order by more than one column if partitioning by more than one column.
SimpleSalesRank=RANK(ORDERBY(SalesRank[Sales]))
The first thing we can do is move the blank value to the end of the ranking, using the LAST parameter.
In the next example, we create a similar Sales table with a Sales column and a year column. We can then use the RANK functions to create our rankings.
First, we’ll create a simple ranking as before, but with RANKX(). Included here is the function syntax in the comments. As you can see from the table above, the RANKX function defaults to sorting the highest sales value first, whereas the RANK function sorts it last. The RANKX function also defaults to putting the blank last, whereas the RANK function ordered it first.
We can also apply the DENSE clause for the ties, as the default is to SKIP the ranking when there have been equal values, for example, there are 2 sales of value 94, which are both ranked 6, as the default treatment of ties is set to SKIP, the next rank value jumps to 8. With the DENSE clause, the next rank does not jump.
RankXFunctionDENSE =
-- RANKX(<table>, <expression>[, <value>[, <order>[, <ties>]]])
-- note: default ties is SKIP
RANKX(SalesRankX2, SalesRankX2[Sales], , ASC, DENSE)
In the next example, we use RANKX with RELATEDTABLE(). We start off by creating a sample of the DimCustomer table, joining it to the FactOnlineSales table, and then adding a ranking column to the new customer table for total sales. We then check the ranking by adding a total sales check column.
Step 1: Create a sample customer table. In the code below, we create a simple selection of CustomerKey, FirstName, and LastName, and then filter the table to the first 10 customers by CustomerKey. This gives us a table of the first 10 customers in the DimCustomer table. The data comes from the Contoso Retail Data Warehouse.
The table below is created (first 3 columns), then we add the SalesRank and CustomerTotalSalesCheck columns to demonstrate the use of the RANKX function with the RELATEDTABLE function.
The code for ranking sales is below. We then add the CustomerTotalSalesCheck to confirm that our ranking is working correctly, and as you can see, it is. The first rank of 1 is allocated to total sales of £3,932. The last value, which is blank, is ranked 10. You will notice there is no rank 8, as the default treatment of ties is to SKIP a rank when there is a tie. We can change this by adding the DENSE clause.
Creating a Measure using RANKX()
Now that we have some good examples of using the RANK and RANKX functions, we can use them in a measure, which creates a temporary table in memory.
To create the measure, we can use the DAX editor in Power BI, Tabular Editor, or DAX Studio. My preference is Tabular Editor, but the code will work in all three, allowing us to see the data as we build the measure. If you just try to build the measure directly, you can't be sure what's going on unless you build the table in Power BI, and physical tables are slower to run.
Here we borrow the code from the previous example that created a physical table, but ensure the table argument in the RANK function is set to DimCustomer.
We should now be able to use the DAX in a measure, so we transfer the DAX code into a measure as follows. The code using DEFINE and EVALUATE is a DAX query in the DAX query editor; when we create a measure, we are creating a DAX expression. DAX expressions are converted to DAX queries when they are evaluated. The code for creating the measure is below.
We can create a more interesting sales report using window functions and additional calculations in the query, including the gross margin %, total profit, and ranks for product price, sales quantity, and sales amount. The query uses a Common Table Expression (CTE) to group the data first (as we can't group using window functions).
This is the kind of report decision-makers will be interested in and as more questions come up, additional fields can be added.
USE [ContosoRetailDW];
-- Note: Qty * UnitPrice - Discount = TotalSalesAmount (small discrepancy)
-- CTE
WITH sub1 AS (
    SELECT
        p.ProductName,
        s.UnitCost,
        s.UnitPrice,
        SUM(SalesQuantity) AS TotalSalesQuantity,
        (s.UnitCost * SUM(SalesQuantity)) AS TotalCost,
        SUM(DiscountAmount) AS TotalDiscount,
        SUM(SalesAmount) AS TotalSalesAmount
    FROM [dbo].[FactOnlineSales] AS s
    LEFT OUTER JOIN [dbo].[DimProduct] AS p ON s.[ProductKey] = p.[ProductKey]
    LEFT OUTER JOIN [dbo].[DimDate] d ON s.DateKey = d.DateKey
    LEFT OUTER JOIN [dbo].[DimStore] st ON s.StoreKey = st.StoreKey
    WHERE d.CalendarYear = 2007
      AND d.CalendarMonthLabel = 'January'
      AND st.StoreName = 'Contoso Europe Online Store'
    GROUP BY p.ProductName, s.UnitCost, s.UnitPrice
)
-- Main query referencing the CTE
SELECT
    ProductName,
    UnitCost,
    UnitPrice,
    (UnitPrice - UnitCost) / UnitCost AS Margin,
    TotalSalesQuantity,
    FORMAT(TotalCost, '$0,,.0M') AS TotalCost,
    FORMAT(TotalSalesAmount, '$0,,.0M') AS TotalSales,
    FORMAT(TotalSalesAmount - TotalCost, '$0,,.0M') AS TotalProfit,
    TotalDiscount,
    RANK() OVER (ORDER BY UnitPrice DESC) AS PriceRank,
    RANK() OVER (ORDER BY TotalSalesQuantity DESC) AS QtyRank,
    RANK() OVER (ORDER BY TotalSalesAmount DESC) AS SalesRank
FROM sub1
ORDER BY SalesRank
Creating a simple Linear regression model and preparing for multi-linear regression.
In this example, we use a sample of marketing spend data vs. sales and inspect the correlation between radio spend and total sales. The regression line is fitted using the ols function from statsmodels.formula.api
You can download the original file from Kaggle here, then just replace the file location in the df = pd.read_csv() line.
The results are returned from the model.summary() method of the OLS (ordinary least squares) model in the statsmodels module. R-squared is calculated as 0.757, meaning 76% of the variability in y (sales) is accounted for by radio. However, if we look at the other media, we will see that another variable (TV) also has a strong correlation.
The Beta coefficient for radio spend is 8.17, which means that for every $1 million in Radio spend, we get $8.17 million in sales.
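As a rough sketch of how a fit like this can be produced with the ols function from statsmodels.formula.api; the column names ('Radio', 'Sales') and the file name are assumptions here, so adjust them to match your copy of the Kaggle file:

import pandas as pd
from statsmodels.formula.api import ols

df = pd.read_csv("marketing_sales_data.csv")   # hypothetical file name: point this at your downloaded CSV
model = ols("Sales ~ Radio", data=df).fit()    # simple linear regression of sales on radio spend
print(model.summary())                         # R-squared, coefficients and p-values
print(model.params)                            # intercept and the radio (beta) coefficient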
We need to check the model assumptions to give us confidence that our models have integrity and are not biased or overfitting the data.
We check three assumptions in this example, with sub-plotted seaborn graphics for a linear regression model. The code for creating the linear regression model can be found in this post; you can run the code below once you have built the model. The model captures the relationship between radio advertising spend and sales.
Graph 1: Checking the Linearity Assumption The linearity assumption is as follows: 'Each predictor variable (x) is linearly related to the outcome variable (y).' In the first graph, we plot radio advertising spend against sales and can see there is a linear relationship, so we can conclude the linearity assumption is met.
Graph 2: Checking Homoscedacity assumption with a scatterplot
The homoscedasticity assumption (extra points if you can spell it correctly) concerns the residuals plotted against y_pred, the predicted y values from the regression line.
In the second graph, we plot the residuals of the model, which are the differences between the actual values and the model's predictions.
Homoscedasticity means that the residuals have equal, or almost equal, variance across the regression line. By plotting the error terms against the predicted values, we can check that there is no pattern in the error terms. Good homoscedasticity is therefore a balanced scatter of residuals above and below zero.
Graph 3: Check for Normality Assumption In the third graph, the histogram is used to plot the residuals of the regression line (the actual y values vs. the predicted y values) for x. If the model is unbiased, the residuals should be normally distributed (and we see that).
The fourth graph is a Q-Q plot which is also used to check the normality assumption.
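A minimal sketch of the four assumption checks described above, assuming df and the fitted model come from the previous sketch (the variable and column names are assumptions, not the post's original code):

import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

residuals = model.resid          # actual minus predicted sales
fitted = model.fittedvalues      # predicted sales (y_pred)

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
sns.scatterplot(x=df["Radio"], y=df["Sales"], ax=axes[0, 0])   # 1. linearity
axes[0, 0].set_title("Radio spend vs Sales (linearity)")
sns.scatterplot(x=fitted, y=residuals, ax=axes[0, 1])          # 2. homoscedasticity
axes[0, 1].axhline(0, color="red")
axes[0, 1].set_title("Residuals vs predicted (homoscedasticity)")
sns.histplot(residuals, ax=axes[1, 0])                         # 3. normality: histogram of residuals
axes[1, 0].set_title("Residual distribution")
sm.qqplot(residuals, line="s", ax=axes[1, 1])                  # 4. normality: Q-Q plot
axes[1, 1].set_title("Q-Q plot of residuals")
plt.tight_layout()
plt.show()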
Classification models are machine learning models that are used to predict binary outcome scenarios such as:
Spam / Not spam
Fraudulent transaction / Non-fraudulent transaction
Customer churn / Customer will not churn
Customer high value / Customer low value
Loan approval / Non-approval
The Data Process
Planning
Understand the business requirements. Understand what the measure of success is and what needs to be measured, e.g. for binary outcomes: precision, recall, or F1 score. Identify Type 1 and Type 2 errors.
Identify the key stakeholders and subject matter experts relevant to the project
Understand where the data is and how it can be accessed. For larger data projects, if the data comes from many different data sources, can it be brought together in a data warehouse such as Google BigQuery?
Understand the technology required for the project. Are extra resources required?
Is there a data dictionary describing all the data field types and purposes?
Exploratory Data Analysis (Python)
Explore the data, and list the number of columns, rows, and data types. If there are any questions, these may need to be referred back to the business.
Explore the data ranges (df.describe()). Are the data counts complete? Do the means and ranges make sense? Do the min and max statistics flag any potential errors or outliers in the data?
Explore null values. If there are null values, either look to fill the data or drop the rows.
Remove or adjust outliers.
Summarize and graph the data:
Use boxplots to look for outliers in columns.
Use histograms to understand the distributions of data.
Use a correlation matrix and pair plot to understand co-variance between columns.
Visualize the data with interactive tools such as Tableau or Power BI for the initial analysis of data for clients
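A quick sketch of the exploration and graphing steps above in Python; the file and column names are placeholders, not taken from a specific dataset:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("your_data.csv")    # hypothetical file

print(df.shape)            # rows and columns
print(df.dtypes)           # data types
print(df.describe())       # counts, means, ranges, min/max
print(df.isnull().sum())   # null values per column

sns.boxplot(x=df["some_numeric_column"])              # outliers in one column (hypothetical name)
plt.show()
df.hist(figsize=(10, 8))                              # distributions of all numeric columns
plt.show()
sns.heatmap(df.corr(numeric_only=True), annot=True)   # correlation matrix
plt.show()
sns.pairplot(df)                                      # pairwise relationships between columns
plt.show()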
Identify the variable to be predicted: Is it a continuous variable or a categorical variable?
Select the machine learning model relevant to the task and 1 or 2 additional models to compare results to.
Confirm the assumptions required by the model and check the data to confirm they meet the requirements.
Feature selection: Select the features (Independent variables and the dependent variable (columns)) to be used in the model.
Feature transformation: transform categorical data into numeric data using: a) one-hot encoding for non-ordered categories, e.g. departments; b) ordinal encoding for ordered data such as 'low', 'medium', 'high'. Where numeric data across the features has high variance, e.g. small numbers (0-1) and large numbers (100+), consider applying scaling to the data: a) log normalization (using a logarithm of the data); b) standardization (which converts data to Z-scores on a standard distribution, with a mean of zero within each feature). See the sketch after this list.
Feature extraction: create new features from existing features, for example a weekly hours feature derived from existing time data.
Check for class imbalance in the data. In the case of a binary dependent variable (True/False), we would ideally like an even split, but as a minimum the smaller class should be around 10% of the data. Class imbalances can be addressed with either: a) downsampling, to bring the majority class down by random sampling; or b) upsampling, to bring the minority class up by random sampling (see the sketch below).
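A minimal sketch of the transformation, scaling and resampling steps above, using a toy data frame with hypothetical column names (salary, department, monthly_hours, churn):

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Toy data frame for illustration only
df = pd.DataFrame({
    "salary": ["low", "medium", "high", "low", "medium", "low", "high", "low"],
    "department": ["sales", "hr", "it", "sales", "it", "hr", "sales", "it"],
    "monthly_hours": [160, 210, 280, 150, 190, 305, 170, 200],
    "churn": [0, 0, 1, 0, 0, 1, 0, 0],
})

# Ordinal encoding for ordered categories
df["salary"] = df["salary"].map({"low": 0, "medium": 1, "high": 2})

# One-hot encoding for non-ordered categories
df = pd.get_dummies(df, columns=["department"])

# Log normalization and standardization for a high-variance numeric feature
df["log_hours"] = np.log(df["monthly_hours"])
df["hours_scaled"] = StandardScaler().fit_transform(df[["monthly_hours"]]).ravel()

# Class imbalance: downsample the majority class of the binary target
majority = df[df["churn"] == 0]
minority = df[df["churn"] == 1]
majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=42)
df_balanced = pd.concat([majority_down, minority])
print(df_balanced["churn"].value_counts())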
Model Execution
Setup the X variables and Y variables in a separate data frame.
Decide whether to use a simple hold-out test set, cross-validation, or a separate validation set (see Model Validation below).
Use train_test_split to create training and test sets, and potentially a third validation set of data. The test dataset should be around 25%, but can be larger if the dataset is large. Should the split be stratified?
Select the hyperparameter requirements of the model. GridSearchCV is a powerful sklearn function that takes a grid of hyperparameters, a scoring metric, and the training data, and runs multiple models to find the best one (see the sketch after this list).
Build and run the model.
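A sketch of the split, grid search and model build described in this list, using generated toy data in place of the real X and y; the hyperparameter values are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for the prepared X and y
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Hold out 25% as a stratified test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Grid of hyperparameters to try
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}

grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, scoring="f1", cv=5)
grid.fit(X_train, y_train)                # fits one model per combination using cross-validation
print(grid.best_params_, grid.best_score_)

best_model = grid.best_estimator_
print(best_model.score(X_test, y_test))   # accuracy on the held-out test set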
Model Validation
In a simple scenario, the best-performing model is run against the hold-out sample (e.g. 25% of the data) that the model has not been trained on.
a) In cross-validation, repeated samples are taken from the training data and the model is validated against the held-out portion each time. Cross-validation is slower.
b) Separate validation set: a random sample is held out from the training data and the model is tested against it a set number of times, e.g. 5. This is advantageous when the main dataset is small and we don't want to keep sampling from the training data.
Linearity: all independent and dependent variables should exhibit a degree of linearity (use a pairplot).
Independent observations: this is business specific, so it requires understanding of how the variables are generated.
No multicollinearity: there should be no collinearity between the independent variables, or this will reduce the reliability of the model.
No extreme outliers: extreme outliers should be excluded from the model.
Fixing Null Values Annual Income: We have 2 values missing from the Annual Income field. We can remove these rows from the data or use the averages to fill in the gaps.
Drop rows where annual income (k$) is null
df.dropna(subset=['Annual Income (k$)'], inplace=True)
df.isnull().sum()   # re-run the check for missing data
We can run the seaborn pairplot to plot the graphs of the combination of variables.
pairplot = sns.pairplot(df)
plt.show()
From here, we can see there are some very interesting, distinct-looking clusters around annual income and spending score. It also makes sense that they would be related, so we can use them in the k-means model.
For the K-means model, we need to determine the value of K which is the number of clusters we want to identify. We use the elbow method to do this as follows:
# Extracting features for clustering
X = df[['Annual Income (k$)', 'Spending Score (1-100)']]

# Using the elbow method to determine the optimal number of clusters
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

# Plotting the elbow graph
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
The graph produced by the elbow method is below. For the value of K, we select the point where the rate of decrease in WCSS drops dramatically (the elbow). In this case, we select 5 clusters.
K-Means Model
Now we have the number of clusters to be used in the model, we can run the K-Means model.
## Use 5 clusters based on the elbow graph
# Fitting K-Means to the dataset with the optimal number of clusters (5 in this example)
kmeans = KMeans(n_clusters=5, init='k-means++', max_iter=300, n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(X)

# Visualizing the clusters
plt.scatter(X.iloc[y_kmeans == 0, 0], X.iloc[y_kmeans == 0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(X.iloc[y_kmeans == 1, 0], X.iloc[y_kmeans == 1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(X.iloc[y_kmeans == 2, 0], X.iloc[y_kmeans == 2, 1], s=100, c='green', label='Cluster 3')
plt.scatter(X.iloc[y_kmeans == 3, 0], X.iloc[y_kmeans == 3, 1], s=100, c='orange', label='Cluster 4')
plt.scatter(X.iloc[y_kmeans == 4, 0], X.iloc[y_kmeans == 4, 1], s=100, c='purple', label='Cluster 5')

# Plotting the centroids of the clusters
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroids')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
The graph is below, with the 5 identified clusters colored accordingly. Once the clusters are identified, we can use them to segment our data, which can then be used to determine, for example, the best marketing campaign for each segment, using A/B testing and t-test significance testing.
One-sample z-test (population mean and standard deviation known, sample > 30)
Example: A sample of the population is tested for height. Do they match the known population?

One-sample t-test
Samples: 1. When: n < 30 and unknown variance. Tails: 1 or 2. Test statistic: t-score. Distribution: Student's t. Degrees of freedom: n - 1.
Example: A company wants to compare the average weekly sales of a new product to the historical average weekly sales of similar products, which is $5000.

Paired t-test
Samples: 2 (the same group before and after; two dependent samples). Tails: 1 or 2. Test statistic: t-score. Distribution: Student's t. Degrees of freedom: n - 1.
Example: A company implements a new training program and wants to determine its effectiveness by comparing employee performance before and after the training.

Independent t-test (equal variance, pooled)*
Samples: 2 independent groups with equal variance. Tails: 1 or 2. Test statistic: t-score. Distribution: Student's t. Degrees of freedom: n1 + n2 - 2.
Example: A retailer wants to compare the average sales of two different stores located in different regions.

Independent t-test (unequal variance, Welch's)*
Samples: 2 independent groups with unequal variance. Tails: 1 or 2. Test statistic: t-score. Distribution: Student's t. Degrees of freedom: approximated (Welch-Satterthwaite).

One-way ANOVA
Groups: 3+, looking at one variable. Tails: 1 only. Test statistic: F-score. Distribution: F. Degrees of freedom: between groups, k - 1 (where k is the number of groups).
Example: Three groups are given different drugs in a test to see the improvement in blood sugar.

Two-way ANOVA
Groups: 3+, looking at multiple variables. Tails: 1 only. Test statistic: F-score. Distribution: F.
Example: Effect of water and sun on the height of sunflowers, where there are 3+ combinations of water and sun (3+ groups).

Pearson's chi-square
Groups: 2 categories, compared to see if they are related. Tails: neither, just different. Test statistic: chi-squared. Distribution: chi-squared. Degrees of freedom: (no. of rows - 1) * (no. of columns - 1).
Example: Individuals that received social media vs. those that received email, with two outcomes: purchased and not purchased.

Correlation

Pearson's correlation coefficient (r)
Variables: 2. Tails: 1 or 2. Test statistic: t-statistic. Distribution: t-distribution table. Degrees of freedom: n - 2.
Example: Relationship between daily hours of sunlight and temperature. Best for continuous data with linear relationships.
Note: Provides both the direction (positive or negative) and the strength of a linear relationship between two variables; it ranges between -1 and 1. For simple linear regression, R squared is Pearson's r squared.

Spearman's rank correlation
Variables: 2. Tails: 1 or 2. Test statistic: t-statistic (approximation; exact tables for small samples). Degrees of freedom: n - 2.
Example: Relationship between the ranks of employees' performance and their years of experience. Best for ordinal data or when data does not meet Pearson's assumptions.

R squared (used in regression analysis)
Variables: one (simple linear regression) or more (multiple linear regression) independent variables plus one dependent variable. Not used in significance testing directly; for overall significance, fit the regression model, compute R squared, and perform an F-test (F-statistic, F-distribution).
Example: Explaining the variance in academic performance based on study hours, attendance, and other factors. Used in the context of regression analysis to explain the proportion of variance in the dependent variable.
Note: Indicates the proportion of the variance in the dependent variable that can be explained by the independent variable(s). It does not provide information about the direction of the relationship. Ranges from 0 to 1.

Non-Parametric Tests (for data that is not normally distributed)

Mann-Whitney U
Groups: 2 independent groups (like the t-test); compares the distributions of the 2 groups. Test statistic: U statistic. Distribution: Mann-Whitney U distribution. Degrees of freedom: NA.

Wilcoxon signed-rank
Groups: 2; compares the medians of the 2 groups. Test statistic: T statistic. Distribution: Wilcoxon signed-rank table. Degrees of freedom: NA.
Use when comparing two related samples, matched samples, or repeated measurements on a single sample.

* Note: Use Levene's test to determine equal/unequal variance.
Significance Testing Steps
Null hypothesis: there is no significant difference between the two groups.
Examples:
1. In a drug efficacy study, the null hypothesis might state that the drug does not affect patients compared to a placebo.
2. The mean cholesterol levels of patients taking the new drug are equal to the mean cholesterol levels of patients taking a placebo.
3. The mean sales after the advertising campaign are equal to the mean sales before the campaign.
4. The mean sales of the product with the new packaging design are equal to the mean sales of the product with the old packaging design.
Alternative hypothesis: There is a significant difference between the means of the groups.
The significance level (α) is the probability threshold for rejecting the null hypothesis. Commonly used significance levels include 0.05, 0.01, and 0.10. e.g. at 5%, a result this extreme would only have a 5% chance of occurring by chance if the null hypothesis were true.
Calculate test statistic (t-value, z-value, etc). The test statistic is used to look up the p-value from the relevant distribution tables.
Calculate the p-value using the test statistic. The p-value is used to determine if the test is significant, e.g. a p-value of less than 5% (0.05) is significant and we can reject the null hypothesis. If the p-value is higher than the significance level, then the results are reported as not statistically significant.
One- or Two-Tailed Tests: use a one-tailed test (left- or right-sided) if measuring greater than or less than; use a two-tailed test if measuring whether there is any difference.
Type 1 and Type 2 Errors
Type 1 error: we reject the null hypothesis even though it was true (alpha).
Type 2 error: we fail to reject the null hypothesis even though it was false (beta).
Two methods of accepting or rejecting the null hypothesis: 1. Critical values (taken from the normal or t-distribution). 2. P-values (compare p to the significance level, e.g. 0.05).
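A small sketch of the two decision methods with scipy, using made-up numbers for the test statistic and degrees of freedom:

from scipy import stats

t_stat = 2.31     # hypothetical test statistic calculated from your sample
df_free = 24      # hypothetical degrees of freedom
alpha = 0.05      # significance level

# Method 1: critical value from the t-distribution (two-tailed)
t_crit = stats.t.ppf(1 - alpha / 2, df_free)
print("Reject H0" if abs(t_stat) > t_crit else "Fail to reject H0")

# Method 2: p-value from the test statistic (two-tailed)
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df_free))
print("Reject H0" if p_value < alpha else "Fail to reject H0")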
import pandas as pd

# read csv files into pandas DataFrames
titanic = pd.read_csv("titanic.csv")
air_quality = pd.read_csv("air_quality_no2.csv", index_col=0, parse_dates=True)

# save as Excel; index=False removes row index numbers from the export
titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False)

# attributes and methods of the DataFrame
titanic.shape        # row and column counts
titanic.info()       # summary of columns, dtypes and memory
titanic.columns      # column names
titanic.dtypes       # data types
titanic.head(5)      # head of df
titanic.tail(5)      # tail of df
titanic.describe()   # statistics of the file
# Create a sub-set of the DataFrame as a Series
ages = titanic["Age"]         # single brackets create a pandas Series
type(titanic["Age"])          # confirm the type of the Age object

# Create a sub-set of the DataFrame as a DataFrame
age_sex = titanic[["Age", "Sex"]]   # inner brackets are a list, outer brackets select a DataFrame
# Filter rows: get passenger list with Age > 35
above_35 = titanic[titanic["Age"] > 35]

# Filter passenger list with Pclass equal to 2 or 3 (two ways of doing this)
titanic["Pclass"].isin([2, 3])                               # returns a Boolean Series: is Pclass 2 or 3?
titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)] # equivalent filter using an OR condition

# Filter passenger list where Age is not NA
age_no_na = titanic[titanic["Age"].notna()]

# Filter passenger list and return the Names of passengers over 35
adult_names = titanic.loc[titanic["Age"] > 35, "Name"]
# Filter specified rows and columns
titanic.iloc[9:25, 2:5]

# change the first 3 rows of the 4th column to "anonymous"
titanic.iloc[0:3, 3] = "anonymous"
7. Combining Data from tables
import pandas as pd
import os

# check the current working directory
cwd = os.getcwd()
cwd

# Concatenate sub-sets - creates a union between the 2 tables.
# The keys argument adds another hierarchical index to show which table the data comes from
air_quality = pd.concat([air_quality_pm25, air_quality_no2], axis=0, keys=["PM25", "NO2"])
air_quality_no2.head(5)
air_quality_pm25.head(5)
air_quality_no2.shape    # 2068 rows, 4 columns
air_quality_pm25.shape   # 5274 rows, 4 columns
air_quality.shape        # 7340 rows, 4 columns
# merge the columns from the stations_coord DataFrame onto the air_quality df;
# use location as the key to join the two DataFrames; how="left" is a left join with air_quality on the left
air_quality = pd.merge(air_quality, stations_coord, how="left", on="location")
air_quality.head(10)
air_quality.to_csv("checkqualitymerge.csv")

air_quality_parameters.head(5)
# merge (left join) using the parameter column on the left and id on the right
air_quality = pd.merge(air_quality, air_quality_parameters, how="left", left_on="parameter", right_on="id")
# unique cities in the air_quality df
air_quality.city.unique()

# convert the datetime column to datetime objects instead of text
air_quality["datetime"] = pd.to_datetime(air_quality["datetime"])
air_quality["datetime"]

# can also read the dates into datetime format at import time
# pd.read_csv("../data/air_quality_no2_long.csv", parse_dates=["datetime"])

# check the start and end datetime
air_quality["datetime"].min(), air_quality["datetime"].max()

# datetime difference using pandas Timestamp
air_quality["datetime"].max() - air_quality["datetime"].min()

# Add a month column using the dt accessor
air_quality["month"] = air_quality["datetime"].dt.month

# Average NO2 concentration by day of week, by location
air_quality.groupby([air_quality["datetime"].dt.weekday, "location"])["value"].mean()
# Plot a bar graph using hour on the x-axis and average NO2 on the y-axis
fig, axs = plt.subplots(figsize=(12, 4))
air_quality.groupby(air_quality["datetime"].dt.hour)["value"].mean().plot(kind="bar", rot=0, ax=axs)
plt.xlabel("Hour of the day")    # custom x label using Matplotlib
plt.ylabel("$NO_2 (µg/m^3)$")
# use index.year and index.weekday of the datetime object
no_2.index.year, no_2.index.weekday

# plot using a time range
no_2["2019-05-20":"2019-05-21"].plot()

# the resample method of a datetime object works like groupby; it takes an
# aggregation function, and "M" is time-based grouping by Month (month end)
monthly_max = no_2.resample("M").max()
monthly_max
monthly_max.index.freq    # tells you the frequency attribute, e.g. MonthEnd

# Plot the average NO2 levels per day for each station
no_2.resample("D").mean().plot(style="-o", figsize=(10, 5))
9. Textual Data
import pandas as pd

titanic = pd.read_csv("titanic.csv")

titanic["Name"].str.lower()

# Split the Name field on the comma separator
titanic["Name"].str.split(",")
# create a Surname series using the first element of the split Name field
titanic["Surname"] = titanic["Name"].str.split(",").str.get(0)
titanic["Surname"]

# Return a Boolean series: True where Name contains "Countess"
titanic["Name"].str.contains("Countess")
# return rows where Name contains "Countess"
titanic[titanic["Name"].str.contains("Countess")]

# string length of each Name
titanic["Name"].str.len()
# find the row index of the longest name
titanic["Name"].str.len().idxmax()
# return the longest name
titanic.loc[titanic["Name"].str.len().idxmax(), "Name"]
Create a list

# Simple list
mylist = ["apple", "orange", "banana", "mango", "pineapple"]
mylist[2]      # returns 'banana'
mylist[-1]     # returns 'pineapple'
mylist[2:4]    # returns ['banana', 'mango'] (end index is exclusive)

if "apple" in mylist:
    print("Yes, 'apple' is in the fruits list")

mylist.insert(2, "watermelon")   # insert at the position specified
mylist.append("grapes")          # adds to the end of the list

mylist = ["apple", "banana", "cherry"]
tropical = ["mango", "pineapple", "papaya"]
mylist.extend(tropical)          # adds the tropical list to mylist

mylist.remove("banana")          # removes the first occurrence
mylist.pop(1)                    # removes the element at the specified index
del mylist[0]                    # also removes an element by index
mylist.clear()                   # clears (empties) the list
del mylist                       # deletes the list entirely
for i in thislist:        # loop over the list
    print(i)

i = 0
while i < len(thislist):  # while loop
    print(thislist[i])
    i = i + 1

newlist = []
for x in fruits:          # add list items containing "a" to a new list
    if "a" in x:
        newlist.append(x)

thislist.sort()                # sort alphabetically or numerically depending on data type
thislist.sort(reverse=True)    # sort in reverse
thislist.sort(key=str.lower)   # case-insensitive sort
thislist.reverse()             # reverse the current order
mylist = thislist.copy()       # copy a list
mylist = list(thislist)        # also makes a list copy
list3 = list1 + list2          # concatenate lists
A list of lists (5 rows, 2 columns):

b1 = [[2308, 6], [2408, 6.2], [2508, 5.8], [2608, 5.6], [2708, 5.9]]   # create the list
print(b1)         # print the list
print(len(b1))    # print the length of the list
print(type(b1))   # print the data type of the list
print(b1[:2])     # print the first 2 elements of the list
print(b1[3:])     # print from index 3 (the 4th element) onwards
Reference an index in the list

b1[3]      # returns the fourth element (index 3)
b1[0:2]    # returns the first and second elements of the list
b1[-1]     # returns the last element
Add, delete and update
del(b1[0])    # delete the first element
b1.pop(1)     # remove the element at index 1

Insert
b1.insert(1, [24082, 111])   # insert an element at position 1

Update
b1[3] = [2808, 6.7]          # update the value at an existing index

Append
Additions = [[29008, 6.6], [3008, 6], [3108, 6.1]]
b1.extend(Additions)    # adds the Additions list to b1
b1.append([2808, 6.6])  # appends a single element

Clear the list
b1.clear()    # empties the list

Check if an element is in the list
if [2308, 6.2] in b1:
    print("yes")

Loop
for x in b1:
    print(x)
def country_select(countrylist):
    count = 1
    for i in countrylist:                    # iterate over the keys of a country dictionary
        print(f"{i}: {countrylist[i]}")
        count = count + 1
    return countrylist
Numpy
baseball = [180, 215, 210, 210, 188, 176, 209, 200]

import numpy as np
np_baseball = np.array(baseball)
print(type(np_baseball))

mean = np_baseball.mean()
print("mean is: " + str(mean))
med = np.median(np_baseball)
print("median is: " + str(med))
std = np.std(np_baseball)
print("standard deviation is: " + str(std))
Linear Regression using Salary and Years of Experience Data
Data Source: Salary_dataset.csv (Kaggle). The salary dataset includes 2 columns: YearsExperience, which will be our independent variable (X), and Salary (Y).
Linear regression is a fundamental statistical method used to model the relationship between a dependent variable and one or more independent variables. The primary goal of linear regression is to predict the value of the dependent variable based on the values of the independent variables (Chat GPT)
For this example, we first want to see if there is a correlation between the 2 variables by building a regression line and calculating R squared. Then we want to assess the significance of the relationship, using the p-value to test the null hypothesis that there is no relationship between X and Y (X does not predict Y).
Simple Linear Regression Formula
Formula: Y = b + aX + e
Dependent Variable (Y): The outcome or the variable we aim to predict or explain.
Independent Variable(s) (X): The variable(s) used to predict or explain changes in the dependent variable.
a: the slope, i.e. the change in Y for a one-unit change in X.
b: is the intercept of the Y axis. This represents the value of Y when X is zero.
The following Python code (which I used ChatGPT to optimize) calculates the regression line and p-value, and evaluates the null hypothesis.
The steps are as follows:
1. Use pandas to import the CSV file to a data frame and convert each series to an array.
2. Fit a linear regression model using sklearn, which returns the slope and intercept of the regression line.
3. The statsmodels module is then used to calculate the R-squared value and the p-value of the slope.
4. The null hypothesis is then evaluated based on the p-value.
5. scipy.stats and matplotlib.pyplot are then used to calculate a comparison regression line and plot it on a graph.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from scipy import stats


def load_data(csv_location):
    """Load CSV data into a pandas DataFrame."""
    df = pd.read_csv(csv_location)
    return df


def prepare_data(df):
    """Prepare independent and dependent variables for regression."""
    X = df[['YearsExperience']].values   # Extract as a 2D array
    y = df['Salary'].values              # Extract as a 1D array
    return X, y


def fit_sklearn_model(X, y):
    """Fit a linear regression model using sklearn."""
    model = LinearRegression()
    model.fit(X, y)
    return model


def fit_statsmodels_ols(X, y):
    """Fit a linear regression model using statsmodels OLS."""
    X_with_const = sm.add_constant(X)    # Add an intercept to the model
    model = sm.OLS(y, X_with_const).fit()
    return model


def plot_regression_line(df, intercept, slope):
    """Plot the regression line along with the data points."""
    plt.scatter(df['YearsExperience'], df['Salary'], color='blue', label='Data points')
    plt.plot(df['YearsExperience'], intercept + slope * df['YearsExperience'], color='red', label='Regression line')
    plt.title("Salary by Years of Experience")
    plt.xlabel('Years of Experience')
    plt.ylabel('Salary')
    plt.legend()
    plt.show()


def main():
    csv_location = "salary_dataset.csv"
    df = load_data(csv_location)

    # Display basic statistics
    # print(df.describe())

    X, y = prepare_data(df)

    # Fit the model using sklearn
    sklearn_model = fit_sklearn_model(X, y)
    intercept, slope = sklearn_model.intercept_, sklearn_model.coef_[0]
    print("Calculation of Regression Line:\n")
    print(f"Intercept is: {intercept}")
    print(f"Slope is: {slope}")

    # Fit the model using statsmodels to get p-values and R-squared
    statsmodels_model = fit_statsmodels_ols(X, y)
    # print(statsmodels_model.summary())

    # Extract R-squared and p-values
    r_squared = statsmodels_model.rsquared
    p_values = statsmodels_model.pvalues
    print(f"R-squared: {r_squared}")
    # print(f"P-values: {p_values}")

    # Extracting specific p-values by index
    intercept_p_value = p_values[0]   # First p-value (intercept)
    slope_p_value = p_values[1]       # Second p-value (YearsExperience)
    # print(f"Intercept p-value: {intercept_p_value}")
    print(f"p-value (YearsExperience): {slope_p_value}")

    print("\nThe p-value is the probability of observing a t-statistic as extreme as, or more extreme than, the one calculated from your sample data, under the assumption that the null hypothesis is true.")
    print("This is obtained from the t-distribution with n-2 degrees of freedom")
    print("where n is the number of observations\n")

    if slope_p_value > 0.05:
        print("P-value is not significant and therefore we accept the null hypothesis")
    if slope_p_value < 0.05:
        print("P-value is less than 0.05 and therefore we reject the null hypothesis. This means there is strong evidence that the predictor X has a statistically significant effect on the outcome Y")

    # Plotting the regression line
    plot_regression_line(df, intercept, slope)

    # Fit a linear regression line using scipy.stats (for comparison)
    slope, intercept, r_value, p_value, std_err = stats.linregress(df['YearsExperience'], df['Salary'])
    # plt.text(df['YearsExperience'].min(), df['Salary'].max(), f'y = {slope:.2f}x + {intercept:.2f}', ha='left')


if __name__ == "__main__":
    main()
df.shape          # outputs rows and columns
df.sample(5)      # output a sample of 5 rows
df.columns        # output columns
titanic.head(5)   # print the head of the dataframe (top five rows)
titanic.tail(5)   # print the tail of the dataframe (bottom five rows)
type(df)          # output the object type
Convert types
gg["YearString"] = gg["Year"].astype("string")        # convert a series to string
gg["YearInterval"] = gg["YearString"].str[:3]         # LEFT-type function: get the first 3 characters of the string
gg["YearInterval"] = gg["YearInterval"].astype(int)   # convert a series to integer
# Concatenate sub-sets - creates a union between the 2 tables. The keys argument adds another
# hierarchical index to show which table the data comes from
air_quality = pd.concat([air_quality_pm25, air_quality_no2], axis=0, keys=["PM25", "NO2"])   # creates a union

# merge the columns from the stations_coord dataframe onto the air_quality df;
# use location as the key to join the two dataframes; how="left" is a left join with air_quality on the left
air_quality = pd.merge(air_quality, stations_coord, how="left", on="location")

# merge (left join) using the parameter column on the left and id on the right
titanic.iloc[9:25, 2:5]              # Return the specified rows and columns
titanic.iloc[0:3, 3] = "anonymous"   # Change the first 3 rows of the 4th column to anonymous
def city_select(country):                         # function with country as a parameter
    df2 = df[df["country"] == country]            # slice the dataframe where country == country, e.g. United States
    df2 = df2.sort_values(by=["population"], ascending=False)   # sort the dataframe by population, descending
    df3 = df2.iloc[0:5]                           # new dataframe with the top 5 rows by population
    df4 = df3["city"]                             # select the city series (top 5 cities by population)
    return df4
def get_lat(selection):            # define function
    latitude = selection["lat"]    # latitude = series from the "lat" column
    lat = latitude.iloc[0]         # get the first row of the latitude series
    return lat                     # return the first row
Groupby
df.groupby("version").count()   # count by version
df.groupby("version").sum()     # sum columns by version
air_quality.groupby([air_quality["datetime"].dt.weekday, "location"])["value"].mean()   # average by day and location
DateTime
df["datetime"] = pd.to_datetime(air_quality["datetime"])        # change dates in the dataframe to datetime format
pd.read_csv("../data/df_long.csv", parse_dates=["datetime"])    # does the same at import time
df["datetime"].min(), df["datetime"].max()                      # start and end datetime
df["datetime"].max() - df["datetime"].min()                     # difference between the 2 dates
df["month"] = df["datetime"].dt.month                           # add a month column based on another column
df.index.year, df.index.weekday                                 # return the year and weekday of the datetime index
df_pivoted = df.pivot(columns="location", values="value").reset_index()   # pivot and reset the index
df = df_pivoted.melt(id_vars="date.utc")                        # unpivot (melt) back to long format
gg["YearMonth"] = pd.to_datetime(dict(year=gg.Year, month=gg.Month, day=1))   # create a date from the Year and Month columns, with 1 as the day

gg3 = gg.groupby(["YearInterval", "Month"], as_index=False)   # group by YearInterval and Month
gg4 = gg3.max()                                               # aggregate the groups with max
air_quality.city.unique()                                            # return the unique cities in the city column
air_quality = air_quality.rename(columns={"date.utc": "datetime"})   # rename columns

Split columns in a DataFrame
titanic["Name"].str.split(",")                                   # split the Name column on the comma
titanic["Surname"] = titanic["Name"].str.split(",").str.get(0)   # split the Name column and return the first value in the array

df["Name"].str.lower()                # change the Name column to lowercase
df["Name"].str.contains("Countess")   # return True where the column contains "Countess"
df["Name"].str.len()                  # return the string length of the Name column
df["Name"].str.len().idxmax()         # return the index of the longest name
df.loc[titanic["Name"].str.len().idxmax(), "Name"]   # return the longest name in the column using loc

# rename the gender column values to M and F in the dataframe
df["gender_short"] = df["gender"].replace({"male": "M", "female": "F"})   # replace values in the gender column with M and F

titanic["Pclass"].isin([2, 3])                           # checks if Pclass is 2 or 3
adult_names = titanic.loc[titanic["Age"] > 35, "Name"]   # return the Names of passengers over 35
Print
print("\n1 Day Retention by Version")        # \n for a new line
print(f"You searched for: {domain_name}\n")  # an f-string allows placeholders in print
Loops
for link in a:
    print(link.get("href"))   # e.g. print each link's href attribute (assumed; the original snippet is truncated)
Input
domain = input("Enter domain to search for: ")
In business, the most commonly used t-test is often the Independent Two-Sample t-test. This test is widely utilized to compare the means of two independent groups or samples to determine if there is a significant difference between them.
The Independent Two-Sample t-test is particularly useful in various business scenarios, such as:
Comparing the performance metrics of two different products or services.
Analyzing the effectiveness of different marketing strategies or campaigns.
Assessing the impact of changes in processes or procedures on business outcomes.
Evaluating the differences in customer satisfaction levels between different segments or demographics.
Comparing the financial performance of different investment portfolios or asset classes.
One-Sample t-Test
Purpose:
To determine if the mean of a single sample differs significantly from a known or hypothesized population mean.
Example:
Scenario: A company wants to compare the average weekly sales of a new product to the historical average weekly sales of similar products, which is $5000.
Hypothesis: The mean weekly sales of the new product is different from $5000.
Application: The company collects weekly sales data for the new product over several weeks and uses the one-sample t-test to determine if the average sales differ from $5000.
Calculation: t = (x̄ - μ) / (s / √n), where x̄ = the sample mean, μ = the hypothesized population mean, s = the sample standard deviation, and n = the sample size.
Numerator: sample mean minus the hypothesized population mean. Denominator: sample standard deviation / √(sample size n).
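A small sketch of this calculation in Python, with made-up weekly sales figures; the manual formula and scipy's ttest_1samp should agree:

import numpy as np
from scipy import stats

weekly_sales = np.array([5200, 4800, 5100, 5300, 4950, 5250, 5400, 4900])  # hypothetical sample
mu = 5000                                                                  # hypothesized population mean

# By hand: t = (sample mean - mu) / (sample sd / sqrt(n))
n = len(weekly_sales)
t_manual = (weekly_sales.mean() - mu) / (weekly_sales.std(ddof=1) / np.sqrt(n))

# With scipy (degrees of freedom = n - 1)
t_stat, p_value = stats.ttest_1samp(weekly_sales, mu)
print(t_manual, t_stat, p_value)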
2. Independent Two-sample t-test
Purpose:
To compare the means of two independent groups to see if they differ significantly.
Null Hypothesis: The means of 2 groups are equal. If rejected, there is a significant difference between the 2 means.
Types:
Equal Variance (Pooled) t-Test: Assumes that both groups have the same variance.
Unequal Variance (Welch’s) t-Test: Does not assume equal variances between the groups.
How do we determine if they have the same variance? Use Levene's test first to determine whether the variances are equal between the groups.
Example (Equal Variance) or Pooled t-Test:
Scenario: A retailer wants to compare the average sales of two different stores located in different regions.
Hypothesis: The mean sales of Store A are equal to the mean sales of Store B.
Application: Sales data from Store A and Store B are collected over the same time period. An independent two-sample t-test assuming equal variances is used to compare the average sales.
t-Value = Complex Calculation
Degrees of Freedom df=n1+n2−2
Example (Unequal Variance, Welch’s t-test):
Scenario: A company wants to compare the average salaries of employees in two different departments.
Hypothesis: The mean salary in Department X is equal to the mean salary in Department Y.
Application: Salary data from both departments are collected. An independent two-sample t-test with unequal variances (Welch’s t-test) is used to compare the average salaries.
Calculation/Usage: the independent t-test is used when the number of samples and the variance of the two data sets are different.

3. Paired Sample t-Test

Purpose:
To compare the means of two related groups (e.g., the same subjects measured at two different times).
Business Example:
Scenario: A company implements a new training program and wants to determine its effectiveness by comparing employee performance before and after the training.
Hypothesis: The mean performance score after training is different from the mean performance score before training.
Application: Performance scores of employees are recorded before and after the training. A paired sample t-test is used to compare the average performance scores before and after the training.
More Examples:
Example: comparing the mean blood pressure of patients before and after treatment.
Example: testing if a new drug increases recovery rates (greater than) compared to the known recovery rate without the drug.
Calculation: t = d̄ / (s_diff / √n)
where d̄ is the mean of the paired differences (mean1 - mean2), s_diff is the standard deviation of the paired differences, and n is the sample size (the number of paired differences).
Degrees of Freedom:
n-1 = degrees of freedom
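A short sketch of the paired test with scipy, using invented before/after scores:

import numpy as np
from scipy import stats

before = np.array([72, 68, 75, 80, 66, 71, 77, 69])   # hypothetical scores before training
after = np.array([75, 70, 78, 83, 69, 74, 80, 72])    # hypothetical scores after training

t_stat, p_value = stats.ttest_rel(after, before)      # paired t-test, df = n - 1
print(t_stat, p_value)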
One Side or Two-Sided:
a) One-Sided t-Test
Purpose:
To test if the sample mean is significantly greater than or less than the hypothesized population mean (used for directional hypotheses).
Example:
Scenario: A company believes that a new marketing campaign will increase the average number of customers visiting their store per day compared to the previous average of 100 customers.
Hypothesis: The mean number of customers after the campaign is greater than 100.
Application: The company collects daily customer visit data after the campaign starts. A one-sided t-test is used to determine if the average number of customers is significantly greater than 100.
Example 2: Comparing the average test score of a class to a national average
b) Two-Sided t-Test

Purpose:
To test if the sample mean is significantly different from the hypothesized population mean (used for non-directional hypotheses).
Example:
Scenario: A company wants to compare the customer satisfaction ratings of two different products to see if there is any difference.
Hypothesis: The mean satisfaction rating of Product A is different from the mean satisfaction rating of Product B.
Application: Customer satisfaction ratings for both products are collected. A two-sided t-test is used to compare the average satisfaction ratings of the two products.
Levene's Test (Python sketch below): tests whether the variances of two independent groups are equal; use it to determine whether to run the equal-variance or unequal-variance independent samples t-test.
If the p-value for the Levene test is greater than 0.05, the variances are not significantly different from each other (i.e. the homogeneity of variance assumption is met), and we can use the equal-variance (pooled) independent samples t-test. If they are significantly different (p < 0.05), we use the unequal-variance (Welch's) independent samples t-test.
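A sketch of that workflow with scipy, using invented store sales figures:

import numpy as np
from scipy import stats

store_a = np.array([102, 98, 110, 95, 105, 99, 107, 101])   # hypothetical daily sales, Store A
store_b = np.array([90, 94, 88, 97, 92, 89, 95, 91])        # hypothetical daily sales, Store B

levene_stat, levene_p = stats.levene(store_a, store_b)
if levene_p > 0.05:
    # variances not significantly different: pooled (equal variance) t-test
    t_stat, p_value = stats.ttest_ind(store_a, store_b, equal_var=True)
else:
    # variances significantly different: Welch's (unequal variance) t-test
    t_stat, p_value = stats.ttest_ind(store_a, store_b, equal_var=False)
print(levene_p, t_stat, p_value)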
SELECT
  CURRENT_DATE() AS Current_Date,              -- returns today's date
  DATE(2024, 01, 24) AS Constructed_Date,      -- constructs the date from year, month, day
  LAST_DAY('2008-12-25') AS Last_Date,         -- returns 2008-12-31, the last day of the month
  DATE_ADD(DATE "2024-01-24", INTERVAL 1 DAY), -- 1 day later
  DATE_SUB(DATE "2024-01-24", INTERVAL 1 DAY)  -- 1 day before
FORMAT_DATE and PARSE_DATE
SELECT
  CURRENT_DATE() AS today,                                                                -- returns today
  FORMAT_DATE('%d/%m/%Y', CURRENT_DATE()) AS formattedday,                                -- UK format
  PARSE_DATE('%d-%m-%Y', '01-05-2024') AS passstringdate,                                 -- convert a string to a date object
  FORMAT_DATE('%d/%m/%Y', PARSE_DATE('%d-%m-%Y', '01-05-2024')) AS formattedpassstring    -- convert a string to a date and then format it
EXTRACT
-- CURRENT_DATE() is 15/05/2024
SELECT
  EXTRACT(DAY FROM CURRENT_DATE()) AS day,                          -- 15/05/2024
  EXTRACT(DAYOFWEEK FROM CURRENT_DATE()) AS DayOfWeek,              -- 4
  EXTRACT(WEEK FROM CURRENT_DATE()) AS Week,                        -- 19
  EXTRACT(WEEK(MONDAY) FROM CURRENT_DATE()) AS WeekStartingMonday,  -- 20
  EXTRACT(ISOWEEK FROM CURRENT_DATE()) AS WeekStartingMonday2,      -- 20
  EXTRACT(MONTH FROM CURRENT_DATE()) AS Mth,                        -- 5
  EXTRACT(QUARTER FROM CURRENT_DATE()) AS Qtr,                      -- 2
  EXTRACT(YEAR FROM CURRENT_DATE()) AS Yr,                          -- 2024
  EXTRACT(ISOYEAR FROM CURRENT_DATE()) AS YrWkStartMon              -- 2024
In this example, we start with a sales table and a date table from which to create our running total of sales. We start by creating a sales measure as follows:
Sales = SUM(FactSales[SalesAmount])
Next, we create a quick measure to create the Running Total in Sales and add it to our table:
The code for the quick measure 'Sales running total in Year' is as follows:
Sales running total in Year =
CALCULATE(
    [Sales],
    FILTER(
        ALLSELECTED('Date'[Year]),
        ISONORAFTER('Date'[Year], MAX('Date'[Year]), DESC)
    )
)
It looks pretty simple, but there is quite a lot to it, so let’s break it down:
1. CALCULATE() Syntax: CALCULATE(Expression, [Filter1]…) The CALCULATE() function takes the [Sales] measure as the first argument. [Sales] is just the sum of sales.
Next, we have the main filter argument, which filters the [Sales] measure on a range of years from the 'Date'[Year] column. The filter starts with the FILTER function, which takes ALLSELECTED('Date'[Year]) as the column to filter on, and ISONORAFTER('Date'[Year], MAX('Date'[Year]), DESC) as the filter argument.
ALLSELECTED('Date'[Year]) The ALLSELECTED function removes the filters that come from inside the visual on the 'Date'[Year] column (while keeping external filters such as slicers), so in our table the year on each row is ignored and all years in the 'Date' table are included in the column. We can prove this by creating another measure:
When added to our table visual, we can see the count is 24 for every year row in the table. That is because this particular table has 24 years of date information in it.
The next part is the tricky part, how we filter ALLSELECTED(‘Date'[Year])
ISONORAFTER('Date'[Year], MAX('Date'[Year]), DESC) The ISONORAFTER function compares each element of the first argument (the 'Date'[Year] column) with the second argument, MAX('Date'[Year]), using either ascending (ASC) or descending (DESC) sort order. To understand what MAX('Date'[Year]) evaluates to in this context (it may not be what you think), we can create a measure:
Max Date Year = MAX('Date'[Year])
As you can see in the table below, Max Date Year used in this context returns the same year as the Year column of the table visual.
The comparison works as follows:
‘Based on the sort order, the first parameter is compared with the second parameter. If the sort order is ascending, the comparison to be done is the first parameter greater than the second parameter. If the sort order is descending, the comparison to be done is the second parameter less than the first parameter.’
The tables below show what is going on. If I filter the page to the years 2007-2009 (which effectively filters the years in the table down from 24 to 3), we can create the following tables. Essentially, the function checks whether each element of the Date[Year] column is on or after the max year, which is derived from the year on the row of the table visual. If the Date[Year] value is less than or equal to that max year, it returns TRUE.
The years in the Date[Year] column that return TRUE are then passed as a filter to CALCULATE to filter the [Sales] measure, creating the 'Running Total in Sales'.
Year: 2007
Date[Year] | Max('Date'[Year]) | ISONORAFTER (TRUE if Date[Year] <= Max, DESC) | Sales Total
2007 | 2007 | TRUE | 4.56M
2008 | 2007 | FALSE |
2009 | 2007 | FALSE |
Running Total: 4.56M

Year: 2008
Date[Year] | Max('Date'[Year]) | ISONORAFTER (TRUE if Date[Year] <= Max, DESC) | Sales Total
2007 | 2008 | TRUE | 4.56M
2008 | 2008 | TRUE | 4.11M
2009 | 2008 | FALSE |
Running Total: 8.63M

Year: 2009
Date[Year] | Max('Date'[Year]) | ISONORAFTER (TRUE if Date[Year] <= Max, DESC) | Sales Total
2007 | 2009 | TRUE | 4.56M
2008 | 2009 | TRUE | 4.11M
2009 | 2009 | TRUE | 3.74M
Running Total: 12.41M
That's pretty complicated if you ask me, even though it is correct. We can create our own running total in sales measure, which is easier to understand. The code is below:
Running Total human =
VAR CurrentYear = SELECTEDVALUE('Date'[Year])    // the year on the current row (not used below)
VAR MaxDateInFilterContext = MAX('Date'[Year])
VAR DatesLessThanMaxDate =
    CALCULATE([Sales], 'Date'[Year] <= MaxDateInFilterContext)
RETURN
    DatesLessThanMaxDate
The first part of building the code to pull the data for the lifetime value model is to create an SQL script that will get the historical data we need to train our model on.
The SQL code selects 6 fields from the Contoso Retail Data Warehouse. The full SQL code is at the bottom of the page.
First Section of Code (WITH)
USE ContosoRetailDW;
WITH FirstTransactionDates AS (
SELECT customerkey,MIN([DateKey]) AS first_transaction_date
FROM [dbo].[FactOnlineSales]
GROUP BY customerkey)
,
Orders AS (SELECT
[CustomerKey] as CustomerKey,
[SalesOrderNumber] as SalesOrderNumber,
ROW_NUMBER() OVER (PARTITION BY [CustomerKey] ORDER BY [SalesOrderNumber] ASC) AS OrderNumber,
SUM([SalesAmount]) as SalesAmount
FROM
[dbo].[FactOnlineSales]
GROUP BY [CustomerKey], [SalesOrderNumber]
)
After the database is selected in the USE statement, the code starts with a WITH clause.
Two sub-queries are created in the WITH clause. This makes the code easier to manage than nesting sub-queries within sub-queries in the main code body.
The first sub-query, FirstTransactionDates, creates a temporary table of customers and their first transaction date from the FactOnlineSales table.
The second sub-query, Orders, creates a temporary table of customers and their orders, with the OrderNumber field ranking each customer's orders from 1 upwards (1 being the first order). I will create a separate post explaining the line: ROW_NUMBER() OVER (PARTITION BY [CustomerKey] ORDER BY [SalesOrderNumber] ASC) AS OrderNumber
Second Section of Code
SELECT
--selection of CustomerKey from DimCustomer
c.CustomerKey as CustomerKey,
-- Calculate Age as difference between [BirthDate] selected from DimCustomer and today (using the GETDATE() function), then divide by 365
DATEDIFF(DAY, [BirthDate], GETDATE())/365 as Age,
--Gender selected as the [gender] field from DimCustomer and converted to 0 and 1 for m and f respectively
CASE
WHEN [gender] = 'm' THEN 0
WHEN [gender] = 'f' THEN 1
END as Gender,
--YearlyIncome selected from DimCustomer table
c.YearlyIncome as YearlyIncome,
--FirstPurchaseAmount from sub-query a
a.FirstPurchaseAmount as FirstPurchaseAmount,
--LifetimeSales from sub-query b
b.first12months_salesamount as LifetimeSales
-- select from DimCustomer with left join on.... tables to follow...
FROM dbo.DimCustomer c LEFT OUTER JOIN
As you can see from the above, there are 6 field selections: 1. The CustomerKey from the DimCustomer table (alias c). 2. Age, calculated as the difference in days between the customer's birthdate ([BirthDate]) and today's date (GETDATE()), divided by 365 to get the age in years. 3. Gender, which uses the [gender] field from the DimCustomer table and swaps 'm' for 0 and 'f' for 1, so we have numeric data for the model. 4. The YearlyIncome field from the DimCustomer table.
Then the last 2 fields FirstPurchaseAmount and first12months_salesamount which are calculated using sub-queries.
Third Section of Code (Sub-query a)
(
SELECT CustomerKey, SalesAmount as FirstPurchaseAmount
FROM ORDERS
WHERE OrderNumber = 1
) a
ON c.CustomerKey = a.CustomerKey
First is sub-query a, which the DimCustomer table joins to on c.CustomerKey = a.CustomerKey. The SELECT query selects CustomerKey and SalesAmount from the Orders table, the temporary table created with a sub-query in the WITH clause at the start of the code. The Orders table is filtered by OrderNumber = 1 to return the sum of sales for each customer on their first order. This gives us the FirstPurchaseAmount field.
The Fourth Section of code is sub-query B
LEFT OUTER JOIN
-- Build First 12 Month Sales Table b
(
SELECT
f.customerkey,
SUM(CASE WHEN DATEDIFF(month, ft.first_transaction_date, f.DateKey) < 12 THEN f.salesamount
ELSE 0
END) AS first12months_salesamount
FROM [dbo].[FactOnlineSales] f
LEFT OUTER JOIN FirstTransactionDates ft ON
f.customerkey = ft.customerkey
GROUP BY f.CustomerKey
) as b
ON c.CustomerKey = b.CustomerKey
WHERE c.CustomerType = 'Person'
The fourth section of code shows the LEFT OUTER JOIN of the DimCustomer table to sub-query b. Sub-query b gets us the first 12 month sales amount of customers, which will be our dependent variable in our final Model.
The SELECT code starts by selecting the CustomerKey from the FactOnlineSales table.
The second line selects the first_transaction_date from the FirstTransactionDates sub-query created in the WITH clause, and the DateKey from the FactOnlineSales table, and looks at the difference in months between the two. If the DateKey (the transaction date) is less than 12 months after the first transaction, then the sales amount is included in the sum first12months_salesamount.
SUM(CASE WHEN DATEDIFF(month, ft.first_transaction_date, f.DateKey) < 12 THEN f.salesamount ELSE 0 END) AS first12months_salesamount
This is the field used in the main SELECT query in code section 2. b.first12months_salesamount as LifetimeSales
Full SQL Code
USE ContosoRetailDW;
WITH FirstTransactionDates AS (
SELECT customerkey,MIN([DateKey]) AS first_transaction_date
FROM [dbo].[FactOnlineSales]
GROUP BY customerkey)
,
Orders AS (SELECT
[CustomerKey] as CustomerKey,
[SalesOrderNumber] as SalesOrderNumber,
ROW_NUMBER() OVER (PARTITION BY [CustomerKey] ORDER BY [SalesOrderNumber] ASC) AS OrderNumber,
SUM([SalesAmount]) as SalesAmount
FROM
[dbo].[FactOnlineSales]
GROUP BY [CustomerKey], [SalesOrderNumber]
)
SELECT
c.CustomerKey as CustomerKey,
DATEDIFF(DAY, [BirthDate], GETDATE())/365 as Age,
CASE
WHEN [gender] = 'm' THEN 0
WHEN [gender] = 'f' THEN 1
END as Gender,
c.YearlyIncome as YearlyIncome,
a.FirstPurchaseAmount as FirstPurchaseAmount,
b.first12months_salesamount as LifetimeSales
FROM
dbo.DimCustomer c
--Build FirstPurchaseAmount table a
LEFT OUTER JOIN
(
SELECT CustomerKey, SalesAmount as FirstPurchaseAmount
FROM ORDERS
WHERE OrderNumber = 1
) a
ON c.CustomerKey = a.CustomerKey
LEFT OUTER JOIN
-- Build First 12 Month Sales Table b
(
SELECT
f.customerkey,
SUM(CASE WHEN DATEDIFF(month, ft.first_transaction_date, f.DateKey) < 12 THEN f.salesamount
ELSE 0
END) AS first12months_salesamount
FROM [dbo].[FactOnlineSales] f
LEFT OUTER JOIN FirstTransactionDates ft ON
f.customerkey = ft.customerkey
GROUP BY f.CustomerKey
) as b
ON c.CustomerKey = b.CustomerKey
WHERE c.CustomerType = 'Person'
Here we make use of the HR Analytics dataset available on Kaggle. The dataset was created to understand the factors behind employee attrition and can be used to train a model for predicting employee churn.
First, we should check the data imported into the Pandas data frame looks good:
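A minimal sketch of that initial check, assuming the dataset has been downloaded from Kaggle to a local CSV (the file name and path below are illustrative):
import pandas as pd
# Illustrative path and file name for the Kaggle HR Analytics CSV
csvpath = r'c:\Users\Admin\Documents\Github\Python Code\HR_comma_sep.csv'
df = pd.read_csv(csvpath)
# Quick sanity checks on the import
print(df.head())   # first five rows
print(df.shape)    # (rows, columns)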
We can also run df.columns to view the columns of the data frame and df.values to display the values of the data frame as an array.
The pandas data frame describe() method gives us descriptive statistics on the data frame. Note that include='all' is used here so that all columns are returned; the default is numeric columns only, and you will notice that the non-numeric columns display NaN (Not a Number) for statistics that don't apply to them.
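As a quick sketch of that call (assuming the data frame is named df, as above):
# Summary statistics for every column, not just the numeric ones;
# non-numeric columns show NaN for statistics such as mean and std
print(df.describe(include='all'))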
Here we can verify that all columns have the same count of 14999, so there appears to be no missing data. We can also view the standard deviation (std) in relation to the mean, to get an idea of the variance in the data, for example:
Mean average_monthly_hours = 201 hours; 1 standard deviation of average monthly hours = 49.94 hours. Therefore we can infer that approximately 68% of employees work 201 ± 49.94 hours, i.e. roughly 151 to 251 hours a month, and approximately 95% of employees work 201 ± 100 hours (rounded), i.e. roughly 101 to 301 hours a month.
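These ranges can be checked with a couple of lines of arithmetic (the mean and standard deviation below are taken from the describe() output quoted above):
# Empirical-rule ranges for average_monthly_hours
mean_hours = 201
std_hours = 49.94
print(mean_hours - std_hours, mean_hours + std_hours)          # ~68% of employees: roughly 151 to 251 hours
print(mean_hours - 2 * std_hours, mean_hours + 2 * std_hours)  # ~95% of employees: roughly 101 to 301 hours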
We need to check that the datatypes are in the correct format if we want to build a model from them.
The dependent variable that we want to predict is the 'left' field, which is a binary field of 0 or 1. A value of 1 means the employee left the company (attrition).
There are 9 other columns which make up the independent variables: 2 floats, 5 integers, and 2 object data types.
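A quick way to confirm this (a sketch, assuming the data frame is named df):
# Inspect the column datatypes before any encoding
print(df.dtypes)
# The target is the binary 'left' column
print(df['left'].value_counts())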
The salary field is categorical and includes 3 levels: low, medium, and high. Because there is an inherent order, we can use ordinal encoding to convert it to an integer: salary_map = {'low': 0, 'medium': 1, 'high': 2}
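Applied to the data frame, the mapping might look like this (assuming the column is named 'salary', as in the Kaggle dataset):
# Ordinal encoding for the salary column (low < medium < high)
salary_map = {'low': 0, 'medium': 1, 'high': 2}
df['salary'] = df['salary'].map(salary_map)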
The Department field consists of non-ordered categories: Accounting, HR, IT, management, marketing, product_mgt, RandomD, and sales.
We use one-hot encoding on this field, which creates 8 new binary fields, one for each department name. The code is as follows:
df = pd.get_dummies(df, columns=['Department'])
This results in a lot more columns in the data frame as below, but now we’re ready to build a model.
Building the Random Forests model
The steps for building the model are as follows (a sketch of these steps is shown after the list):
1. Split the dataset into X and y variables, where X consists of all the independent variable fields and y contains only the dependent variable (the 'left' field).
2. Split X and y into 4 sets for training and testing. The test sets are a randomized 20% of the data.
3. Initialize the random forest classifier.
4. Train the model with the training data (X_train and y_train). This represents 80% of the data.
5. Make predictions from the X_test set. The model tries to predict the actual values that are in the y_test set.
6. Measure the accuracy by comparing the predictions of y (y_pred) with the actual y_test values.
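A minimal sketch of these six steps with scikit-learn (the column names and parameters such as random_state are assumptions, not taken from the original code):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# 1. Independent variables (everything except the target) and the target 'left'
X = df.drop('left', axis=1)
y = df['left']
# 2. 80/20 split; the test set is a randomized 20% of the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 3. Initialize the random forest classifier
rf = RandomForestClassifier(random_state=42)
# 4. Train the model on the training data (80% of the rows)
rf.fit(X_train, y_train)
# 5. Make predictions for the hold-out set
y_pred = rf.predict(X_test)
# 6. Compare the predictions with the actual y_test values
print('Accuracy:', accuracy_score(y_test, y_pred))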
The resulting accuracy is 0.9886666666666667, which represents roughly 99% accuracy. If I re-run the code, it gives similar results. The model is almost perfect!
We can compare this random forest model to a logistic regression model (a sketch is shown below). The resulting accuracy score is only 0.785, or about 78% accurate, so the random forest model wins!
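The original comparison code isn't reproduced here, but a minimal sketch, continuing from the same train/test split as above, would be:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Swap in a logistic regression classifier on the same split
logreg = LogisticRegression(max_iter=1000)  # max_iter raised so the solver converges on this data
logreg.fit(X_train, y_train)
y_pred_lr = logreg.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred_lr))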
Power BI can be useful for displaying and storing survey results for your organization. In this example we build some data for the results of a fictitious Airline survey, using questions taken from a real online airport survey.
The data was built in Excel with a sheet for questions and a sheet for answers. You can download the Excel file here.
It’s easy to publish the report as a PDF, so you can share it with users who don’t have access to Power BI. Here is the airport survey report as PDF
The questions are a simple list with a Question ID and section.
The answers on the second sheet are formatted so that each row represents all the answers for each respondent.
Steps for Building the Airline Survey Report.
Step 1 The data is imported into Power BI Desktop using Power Query.
The only change to the question table is a simple column rename.
The answer table is transformed so that each line of the table contains the respondent ID, the question, and the Value (the answer). This prepares the data so it can be joined to the question table and easily visualized. The applied steps are as follows:
As you can see, the first step after the import is simply to promote the headers, so the first row becomes the column names. The second step just changes the column types to the correct types. The key third step is to unpivot the data, so the table is transformed, as below, into a separate row for each respondent's answer. This makes the report easy to build.
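For readers more comfortable in pandas, the reshape that the Power Query unpivot step performs can be sketched with melt (illustrative only; the report itself uses Power Query, and the column names below are made up):
import pandas as pd
# Wide format: one row per respondent, one column per question
answers_wide = pd.DataFrame({
    'Respondent': [1, 2],
    'Q1': [4, 5],
    'Q2': [3, 2],
})
# Long format: one row per (respondent, question, answer), as produced by the unpivot step
answers_long = answers_wide.melt(id_vars='Respondent', var_name='Question', value_name='Value')
print(answers_long)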
The report only requires one measure, which is just a unique count of the respondent ID:
Answer Count = DISTINCTCOUNT(Answers[Respondent])
Basic Graphs can be added.
And Likert Scale style graphs using a 100% Stacked bar graph visual.
In order to find the best data to build our model, we need to run correlations on the data. In the initial predictive model we built, we guessed the fields, namely age, gender, income and first purchase amount, but the model gave a very poor MSE accuracy result, so we need to go back to the data and look for fields that correlate well with the sales amount (as well as checking the existing fields we have used).
We can check age and first purchase amount against first 12 month sales amount just to confirm this.
Age vs. Sales (first 12 month sales) code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
csvlocation = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
df = pd.read_csv(csvlocation)
df.columns # display the columns available
# Create a scatter plot with regression line
sns.regplot(data=df, x='Age', y='LifetimeSales')
plt.title("First 12 Month Sales by Customer Age") # graph title
plt.xlabel('Age') # x axis label
plt.ylabel('First 12 Month Sales') # y axis label
# Fit a linear regression line
slope, intercept, r_value, p_value, std_err = stats.linregress(df['Age'], df['LifetimeSales'])
# Add equation of the regression line to the plot
plt.text(df['Age'].min(), df['LifetimeSales'].max(), f'y = {slope:.2f}x + {intercept:.2f}', ha='left')
# Calculate correlation coefficient
correlation_coefficient = df['Age'].corr(df['LifetimeSales'])
# Add correlation coefficient to plot
plt.text(df['Age'].max(), df['LifetimeSales'].min(), f'Correlation coefficient: {correlation_coefficient:.2f}', ha='right')
plt.show()
The output of graphing customer age against first 12 month sales confirms the poor correlation. Here the correlation coefficient is only 0.08.
Next, we can look at FirstPurchaseAmount against the first 12 month sales amount. Here is the code below; it includes removing the major outlier, the customer with CustomerID = 1.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
csvlocation = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
df = pd.read_csv(csvlocation)
df = df[df['CustomerID'] != 1]
df.columns # display the columns available
# Create a scatter plot with regression line
sns.regplot(data=df, x='FirstPurchaseAmount', y='LifetimeSales')
plt.title("First 12 Month Sales by FirstPurchaseAmount") # graph title
plt.xlabel('FirstPurchaseAmount') # x axis label
plt.ylabel('First 12 Month Sales') # y axis label
# Fit a linear regression line
slope, intercept, r_value, p_value, std_err = stats.linregress(df['FirstPurchaseAmount'], df['LifetimeSales'])
# Add equation of the regression line to the plot
plt.text(df['FirstPurchaseAmount'].min(), df['LifetimeSales'].max(), f'y = {slope:.2f}x + {intercept:.2f}', ha='left')
# Calculate correlation coefficient
correlation_coefficient = df['FirstPurchaseAmount'].corr(df['LifetimeSales'])
# Add correlation coefficient to plot
plt.text(df['FirstPurchaseAmount'].max(), df['LifetimeSales'].min(), f'Correlation coefficient: {correlation_coefficient:.2f}', ha='right')
plt.show()
And here is the graph. The correlation coefficient is only 0.07.
In the previous article ‘Creating a Customer Lifetime Value Model’ we imported and transformed a table of customer data from an MS SQL Database, which included the ‘first 12 months sales amount’. At the end of the exercise, we exported the data frame to a CSV file, which we can check out using the Matplotlib library in Python.
Histogram: We can start by having a look at the distribution of customers’ ages with a histogram. Here is the code:
import pandas as pd
import matplotlib.pyplot as plt
csvlocation = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
df = pd.read_csv(csvlocation)
df.columns # display the columns available
# Create a histogram
df['Age'].plot(kind='hist', bins=20) # Adjust the number of bins as needed
plt.title("Distribution of Customer Age")
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
The graph of the output is below. We immediately notice that all customers are above 45. This isn't an issue; it's just a reflection of the database being old and the age calculation being based on the current date and the customers' dates of birth. We can see there are more customers between the ages of 50 and 65. We would expect there to be fewer customers as age increases after that, so I think that is fairly reflective of the general population.
Scatter Plot Next, we can have a look at the first 12-month sales amount against age. Here is the Python code:
import pandas as pd
import matplotlib.pyplot as plt
csvpath = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
df = pd.read_csv(csvpath)
df.columns # display the columns available
#kind options: 'scatter', 'hist', 'bar',
df.plot(kind = 'scatter', x = 'Age', y = 'LifetimeSales')
plt.title("First 12 Month Sales by Customer Age") # graph title
plt.xlabel('Age') # x axis label
plt.ylabel('First 12 Month Sales') # y axis label
plt.show()
Here is the graph:
We can immediately see the range of first 12-month sales amounts (roughly 10K to 70K). That's a very high amount and makes me want to go back and check the data. When I go and look at the data, it does look correct, so we can continue.
All customers are over 40. This is reflective of the database being over 15 years old with no new customers added, but it also means we have some really old customers with an estimated age of 100. Customers' ages look to be fairly evenly distributed between 40 and 80, with the 80-plus group having lower sales.
Pie Graph The following code gives us a pie chart showing the split of gender in the dataset.
import pandas as pd
import matplotlib.pyplot as plt
csvpath = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
df = pd.read_csv(csvpath)
df.columns # display the columns available
# replace 0 and 1 in the 'Gender' column with 'female' and 'male'
df['Gender'] = df['Gender'].replace({0: 'female', 1: 'male'})
# return the labels and counts of the unique gender values
gender_counts = df['Gender'].value_counts()
# print(gender_counts)
# Define colors for each category
colors = {'female': 'pink', 'male': 'blue'}
# Plotting
plt.figure(figsize=(4, 4)) # Adjust figure size if needed
plt.pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%',
colors=[colors[g] for g in gender_counts.index])
plt.title('Distribution of Gender')
plt.show()
The output looks like this. In the code, we replaced the 0s and 1s with 'female' and 'male'. The data looks pretty normal, as we would expect, and there are no nulls, which is good.
Another Scatter Plot Next, we look at the ‘FirstPurchaseAmount’ field. We create a scatter graph with the code below.
import pandas as pd
import matplotlib.pyplot as plt
csvpath = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
df = pd.read_csv(csvpath)
df.columns # display the columns available
#kind options: 'scatter', 'hist', 'bar',
df.plot(kind = 'scatter', x = 'CustomerID', y = 'FirstPurchaseAmount')
plt.title("Customers by First Purchase Amount") # graph title
plt.xlabel('CustomerID') #x axis label
plt.ylabel('First Purchase Amount') #y axis label
plt.show()
The first thing we notice when we run the graph is that there is one major outlier in the data.
Removing this customer should help improve the model, and we should be able to see more outliers in the graph once it is removed.