Author: CB

  • Using Python for Multi-Linear Regression

    Creating a simple linear regression model and preparing the data for multi-linear regression.

    In this example, we use a sample of marketing spend data vs. sales and inspect the correlation between radio spend and total sales. The regression line is fitted using the ols function from statsmodels.formula.api.

    You can download the original file from Kaggle here, then just replace the file location in the df = pd.read_csv() line.

    
    import pandas as pd
    import seaborn as sns
    from statsmodels.formula.api import ols
    import statsmodels.api as sm
    import matplotlib.pyplot as plt
    pd.set_option('display.max_columns', None)
    df = pd.read_csv("marketing_sales_data.csv")
    
    # Drop rows with any null values
    df.dropna(inplace=True)
    
    # Check and handle duplicates if needed
    if df.duplicated().sum() > 0:
        df.drop_duplicates(inplace=True)
    
    # Rename columns to snake_case
    df.rename(columns={'Social Media': 'Social_Media'}, inplace=True)
    
    # Simple ordinal encoding for the TV spend category
    tv_dict = {'Low': 1, 'Medium': 2, 'High': 3}
    df['TV'] = df['TV'].replace(tv_dict)
    
    # One-hot encoding for non-ordinal variable
    df = pd.get_dummies(df, columns=['Influencer'], dtype=int)
    
    
    # Define and fit the model
    ols_formula = "Sales ~ Radio"
    OLS = ols(formula=ols_formula, data=df)
    model = OLS.fit()
    summary = model.summary()
    print(summary)  #Prints off the statistical summary including R squared and the Beta coefficients.
    
    # Calculate residuals and predicted (fitted) values for the assumption checks
    residuals = model.resid
    y_pred = model.predict(df[['Radio']])  # pass a DataFrame so the formula can find the 'Radio' column
    

    The results are returned by the model.summary() method of the fitted OLS (ordinary least squares) model from the statsmodels module. R squared is calculated as 0.757, meaning roughly 76% of the variability in y (sales) is accounted for by radio spend. However, if we look at the other media, we will see that other variables (notably TV) also have a strong correlation with sales.

    The Beta coefficient for radio spend is 8.17, which means that for every $1 million in Radio spend, we get $8.17 million in sales.
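
    If you only need specific numbers rather than the whole summary, the fitted results object exposes them directly. A minimal sketch, assuming the `model` object fitted in the code above:

    print(model.rsquared)                 # R squared (reported as 0.757 above)
    print(model.params['Radio'])          # Beta coefficient for Radio
    print(model.pvalues['Radio'])         # p-value for the Radio coefficient
    print(model.conf_int().loc['Radio'])  # 95% confidence interval for the Radio coefficient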

  • How to check for model assumptions with Python Seaborn Graphics


    We need to check model assumptions to give us confidence that our models have integrity and are not biased or overfitting the data.

    We check three assumptions in this example, with sub-plotted seaborn graphics for a linear regression model.
    The code for creating the linear regression model can be found in this post.
    You can run the code below once you have built the model. The model captures the relationship between radio advertising spend and sales.

    (Figure: assumption-checking subplots produced by the code below)

    Graph 1: Checking the Linearity Assumption
    The linearity assumption is as follows: ‘Each predictor variable (x) is linearly related to the outcome variable (y).’
    In the first graph, we plot radio advertising spend against sales and can see that there is a linear relationship, so we can conclude the linearity assumption is met.

    Graph 2: Checking the Homoscedasticity Assumption with a Scatterplot

    The homoscedasticity assumption (extra points if you can spell it correctly) is as follows: the residuals have equal, or almost equal, variance across the regression line. Here, y_pred are the predicted y values from the regression line, and the residuals are the differences between the actual values and the model's forecasts.

    In the second graph, we plot the residuals of the model against the predicted values. By plotting the error terms against the predicted values, we can check that there is no pattern in the error terms: good homoscedasticity is a balanced scatter of residuals above and below zero.

    Graph 3: Checking the Normality Assumption
    In the third graph, a histogram is used to plot the residuals of the regression (the actual y values minus the predicted y values). If the model is unbiased, the residuals should be normally distributed (and we see that they are).

    The fourth graph is a Q-Q plot which is also used to check the normality assumption.

    fig, ax = plt.subplots(2, 2, figsize=(18, 10))
    
    fig.suptitle('Assumption Checks')
    
    # Check for linearity
    sns.regplot(
        ax=ax[0, 0],
        data=df,
        x='Radio',
        y='Sales',
        )
    ax[0, 0].set_title('Radio Spend vs. Sales')
    ax[0, 0].set_xlabel('Radio Spend ($K)')
    ax[0, 0].set_ylabel('Sales ($)')
    
    
    # Check for homoscedasticity
    # Plot residuals against the fitted (predicted) values
    sns.scatterplot(ax=ax[0, 1], x=y_pred, y=residuals)
    ax[0, 1].set_title("Residuals vs Fitted Values")
    ax[0, 1].set_xlabel("Fitted Values")
    ax[0, 1].set_ylabel("Residuals")
    ax[0, 1].axhline(0, linestyle='--', color='red')
    
    
    # Check for normality: histogram of residuals
    sns.histplot(ax=ax[1, 0], x=residuals)
    ax[1, 0].set_xlabel("Residual Value")
    ax[1, 0].set_title("Histogram of Residuals")
    
    # Check for normality: Q-Q plot
    sm.qqplot(residuals, line='s', ax=ax[1, 1])
    ax[1, 1].set_title("Q-Q Plot")
    
    plt.show()

  • A Summary of the data process used in Classification Models

    Introduction:

    Classification models are machine learning models that are used to predict binary outcome scenarios such as:

    Spam / Not spam
    Fraudulent transaction / Non-fraudulent transaction
    Customer churn / Customer will not churn
    Customer high value / Customer low value
    Loan approval / Non-approval

    The Data Process

    Planning

    1. Understand the business requirements. Understand what the measure of success is and what needs to be measured,
      e.g. for binary outcomes: precision, recall, or F1 score. Identify the cost of Type 1 and Type 2 errors.
    2. Identify the key stakeholders and subject matter experts relevant to the project.
    3. Understand where the data is and how it can be accessed. For larger data projects, if the data comes from many different data sources, can it be brought together in a data warehouse such as Google BigQuery?
    4. Understand the technology required for the project. Are extra resources required?
    5. Is there a data dictionary describing all the data field types and purposes?

    Exploratory Data Analysis (Python)

    1. Explore the data and list the number of columns, rows, and data types. If there are any questions, these may need to be referred back to the business.
    2. Explore the data ranges (df.describe()). Are the data counts complete? Do the means and ranges make sense? Do the min and max statistics flag any potential errors or outliers in the data?
    3. Explore null values. If there are null values, either look to fill the data or drop the rows.
    4. Remove or adjust outliers.
    5. Summarize and graph the data (a short code sketch follows this list):
    6. Use boxplots to look for outliers in columns.
    7. Use histograms to understand the distributions of the data.
    8. Use a correlation matrix and pair plot to understand the covariance between columns.
    9. Visualize the data with interactive tools such as Tableau or Power BI for the initial analysis of data for clients.
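
    A minimal pandas/seaborn sketch of these checks, assuming a DataFrame loaded from a placeholder file ("your_data.csv" is not from the original post):

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    
    df = pd.read_csv("your_data.csv")   # placeholder file name
    
    print(df.shape)                     # rows and columns
    print(df.dtypes)                    # data types
    print(df.describe())                # counts, means, ranges, min/max
    print(df.isnull().sum())            # null counts per column
    
    sns.boxplot(data=df.select_dtypes("number"))          # outliers per numeric column
    plt.show()
    sns.heatmap(df.corr(numeric_only=True), annot=True)   # correlation matrix
    plt.show()
    sns.pairplot(df)                                       # pairwise relationships
    plt.show()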

    Model Selection (Classification)

    Classification Models: Naive Bayes, Logistic Regression, Decision Tree, Random Forest, XGBoost

    1. Identify the variable to be predicted: is it a continuous variable or a categorical variable?
    2. Select the machine learning model relevant to the task and 1 or 2 additional models to compare results against.
    3. Confirm the assumptions required by the model and check the data to confirm it meets the requirements.
    4. Feature selection: select the features (independent variables and the dependent variable (columns)) to be used in the model.
    5. Feature transformation: transform categorical data into numeric data using:
      a) one-hot encoding for non-ordered categories, e.g. departments
      b) ordinal encoding for ordered data such as ‘low’, ‘medium’, ‘high’
      Where numeric data across the features has very different scales, e.g. small numbers 0-1 and large numbers 100+, consider applying scaling to the data:
      a) Log normalization (using a logarithm of the data)
      b) Standardization (which converts data to Z scores on a standard distribution, with a mean of zero within each feature)
    6. Feature extraction: create new features from existing features, e.g. weekly hours derived from daily records.

    7. Check for class imbalance in the data. In the case of a binary dependent variable (True, False), we would ideally like an even split, but as a minimum the smaller class should be around 10% of the data. Class imbalances can be addressed with either of the following (see the sketch after this list):
    a) Downsampling: level the majority class down by random sampling.
    b) Upsampling: level the minority class up by random sampling.
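
    A hedged sketch of these transformations with pandas and scikit-learn; the file and column names (department, priority, churn) are placeholders rather than data from the post:

    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.utils import resample
    
    df = pd.read_csv("your_data.csv")                       # placeholder file
    
    # One-hot encode a non-ordered category
    df = pd.get_dummies(df, columns=["department"], dtype=int)
    
    # Ordinal encode an ordered category
    df["priority"] = df["priority"].map({"low": 1, "medium": 2, "high": 3})
    
    # Standardize numeric features with very different scales (Z scores)
    num_cols = ["small_feature", "large_feature"]
    df[num_cols] = StandardScaler().fit_transform(df[num_cols])
    
    # Check the class balance, then upsample the minority class
    print(df["churn"].value_counts(normalize=True))
    majority = df[df["churn"] == 0]
    minority = df[df["churn"] == 1]
    minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
    df_balanced = pd.concat([majority, minority_up])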

    Model Execution

    1. Set up the X (feature) variables and y (target) variable in separate data frames.
    2. Decide whether to use a
    3. Use train_test_split to create training and test sets and potentially a third validation set of data. The test dataset should be around 25%, but it can be larger if the dataset is large. Should the split be stratified?
    4. Select the hyperparameter requirements of the model. GridSearchCV is a powerful sklearn function that takes a grid of hyperparameters, a scoring metric, and the training data, and runs multiple models to find the best one (see the sketch after this list).
    5. Build and run the model.
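
    A minimal sketch of the split and grid search, using a placeholder file and target column and a Random Forest as the example classifier (any of the models listed above could be swapped in):

    import pandas as pd
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.ensemble import RandomForestClassifier
    
    df = pd.read_csv("your_data.csv")          # placeholder file
    X = df.drop(columns=["target"])            # placeholder feature columns
    y = df["target"]                           # placeholder binary target
    
    # Hold out ~25% for testing, stratified so both classes keep their proportions
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=42)
    
    # GridSearchCV cross-validates every combination of hyperparameters and keeps the best model
    param_grid = {"n_estimators": [100, 200], "max_depth": [None, 5, 10]}
    grid = GridSearchCV(RandomForestClassifier(random_state=42),
                        param_grid, scoring="f1", cv=5)
    grid.fit(X_train, y_train)
    print(grid.best_params_, grid.best_score_)
    best_model = grid.best_estimator_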

    Model Validation

    In a simple scenario, the best-performing model is run against the hold-out sample (e.g. 25% of the data) that the model has not been trained on.

    a) Cross-validation: samples are repeatedly taken from the main training data and the model is validated against the held-out portion each time.
    Cross-validation is slower.

    b) Separate validation set: random samples are taken from the training set and the model is tested against them a set number of times, e.g. 5. This is advantageous when the main dataset is small and we don’t want to keep sampling from the training data.

    Measurement
    1. Metrics: Accuracy, Precision, Recall, F1, AUC Score, Classification Report.

    2. Visualise: Confusion Matrix, ROC Curve (see the sketch after this list).
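
    A short sketch of these metrics and visuals with scikit-learn, reusing the best_model, X_test, and y_test names from the sketch above (assumed placeholders, not from the original post):

    import matplotlib.pyplot as plt
    from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                                 roc_auc_score, classification_report,
                                 ConfusionMatrixDisplay, RocCurveDisplay)
    
    y_pred = best_model.predict(X_test)
    y_prob = best_model.predict_proba(X_test)[:, 1]
    
    print(accuracy_score(y_test, y_pred))
    print(precision_score(y_test, y_pred))
    print(recall_score(y_test, y_pred))
    print(f1_score(y_test, y_pred))
    print(roc_auc_score(y_test, y_prob))
    print(classification_report(y_test, y_pred))
    
    ConfusionMatrixDisplay.from_estimator(best_model, X_test, y_test)
    RocCurveDisplay.from_estimator(best_model, X_test, y_test)
    plt.show()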

    Check Assumptions:

    Linearity: all independent and dependent variables should exhibit a degree of linearity (use a pair plot).
    Independent observations: this is business specific, so it requires an understanding of how the variables are generated.
    No multicollinearity: there should be no collinearity between independent variables, or this will reduce the reliability of the model (a quick check is sketched below).
    No extreme outliers: extreme outliers should be excluded from the model.
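
    A quick, hedged check for the multicollinearity assumption, assuming a numeric feature DataFrame X like the one in the earlier sketches (VIF values above roughly 5-10 are commonly treated as a warning sign):

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    
    print(X.corr())                      # pairwise correlations between features
    
    X_const = sm.add_constant(X)         # VIF needs an intercept column
    vif = pd.Series(
        [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
        index=X_const.columns)
    print(vif)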



  • An Example of Using K-Means Cluster modelling

    Supermarket Example

    Import libraries and read in CSV file to data frame
    The data comes from Kaggle here (mall customer data).

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.cluster import KMeans
    
    df = pd.read_csv("Supermarket Customers.csv")
    df.info()


    First, we run some basic checks on the data to check for data integrity.
    The data includes 200 rows by 5 columns.



    We can encode Gender as numeric with the following:

    df['Gender'].replace('Female', 0, inplace=True)
    df['Gender'].replace('Male', 1, inplace=True)
    df.describe()





    Then we check for nulls as follows:

    
    df.isnull().sum() #check for nulls - gives a count of null values by column name.

    Get counts of unique values

    
    len(df['CustomerID'].unique()) #how many unique values are there


    CustomerID: 200
    Gender: 2
    Age: 51
    Annual Income (k$): 65
    Spending Score (1-100): 85

    Fixing Null Values
    Annual Income: We have 2 values missing from the Annual Income field. We can remove these rows from the data or use the averages to fill in the gaps.

    Drop rows where annual income (k$) is null

    df.dropna(subset=['Annual Income (k$)'], inplace=True)
    df.isnull().sum() #re-run check for missing data

    Then do the same for spending score:

    df.dropna(subset=['Spending Score (1-100)'], inplace=True)
    df.isnull().sum() #re-run check for missing data
    df.info() # the number of rows is reduced to 197.

    We can run the seaborn pairplot to plot the graphs of the combination of variables.

    pairplot = sns.pairplot(df)
    plt.show()


    From here we can see there are some very interesting, distinct-looking clusters around
    annual income and spending score. It also makes sense that these variables would be related, so we can use them in the k-means model.

    For the K-means model, we need to determine the value of K which is the number of clusters we want to identify.
    We use the elbow method to do this as follows:

    # Extracting features for clustering
    X = df[['Annual Income (k$)', 'Spending Score (1-100)']]
    
    # Using the elbow method to determine the optimal number of clusters
    wcss = []
    for i in range(1, 11):
        kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=42)
        kmeans.fit(X)
        wcss.append(kmeans.inertia_)
    
    # Plotting the elbow graph
    plt.plot(range(1, 11), wcss)
    plt.title('Elbow Method')
    plt.xlabel('Number of clusters')
    plt.ylabel('WCSS')
    plt.show()

    The graph produced by the elbow method is below. For the value of K, we select the point where the rate of decrease in WCSS drops off dramatically (the elbow). In this case, we select 5 clusters.

    K-Means Model

    Now we have the number of clusters to be used in the model, we can run the K-Means model.

    ##Use 5 clusters based on elbow graph
    
    # Fitting K-Means to the dataset with the optimal number of clusters (5, from the elbow graph)
    kmeans = KMeans(n_clusters=5, init='k-means++', max_iter=300, n_init=10, random_state=42)
    y_kmeans = kmeans.fit_predict(X)
    
    # Visualizing the clusters
    plt.scatter(X.iloc[y_kmeans == 0, 0], X.iloc[y_kmeans == 0, 1], s=100, c='red', label='Cluster 1')
    plt.scatter(X.iloc[y_kmeans == 1, 0], X.iloc[y_kmeans == 1, 1], s=100, c='blue', label='Cluster 2')
    plt.scatter(X.iloc[y_kmeans == 2, 0], X.iloc[y_kmeans == 2, 1], s=100, c='green', label='Cluster 3')
    plt.scatter(X.iloc[y_kmeans == 3, 0], X.iloc[y_kmeans == 3, 1], s=100, c='orange', label='Cluster 4')
    plt.scatter(X.iloc[y_kmeans == 4, 0], X.iloc[y_kmeans == 4, 1], s=100, c='purple', label='Cluster 5')
    
    # Plotting the centroids of the clusters
    plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroids')
    plt.title('Clusters of customers')
    plt.xlabel('Annual Income (k$)')
    plt.ylabel('Spending Score (1-100)')
    plt.legend()
    plt.show()

    The graph is below, with the 5 identified clusters colored accordingly. Once the clusters are identified, we can use the labels to segment our data, which can then be used, for example, to determine the best marketing campaign for each segment using A/B testing and t-test significance testing.
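
    To use the clusters for segmentation, one simple approach (a sketch continuing from the code above) is to attach the labels to the DataFrame and profile each segment:

    # Attach cluster labels to the customer data and profile each segment
    df['Cluster'] = y_kmeans
    print(df['Cluster'].value_counts())
    print(df.groupby('Cluster')[['Annual Income (k$)', 'Spending Score (1-100)', 'Age']].mean())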

  • How to Select the Correct Statistical Test (A Table)

    Significance Tests (normal distribution assumed)

    z-test
    Groups: 1. Requirements: normal distribution, n > 30 and known variance (population mean and standard deviation are known, sample > 30). Tails: 1 and 2. Statistic: z-score. Distribution: Normal. Degrees of freedom: NA.
    Example: A sample of the population is tested for height. Do they match the known population?

    One-sample t-test
    Groups: 1. Requirements: n < 30 and unknown variance. Tails: 1 and 2. Statistic: t-score. Distribution: Student's t. Degrees of freedom: n-1.
    Example: A company wants to compare the average weekly sales of a new product to the historical average weekly sales of similar products, which is $5000.

    Paired t-test
    Groups: 2 (the same group before and after; 2 dependent samples). Tails: 1 and 2. Statistic: t-score. Distribution: Student's t. Degrees of freedom: n-1.
    Example: A company implements a new training program and wants to determine its effectiveness by comparing employee performance before and after the training.

    Independent t-test (equal variance - pooled)*
    Groups: 2 independent groups with equal variance. Tails: 1 or 2. Statistic: t-score. Distribution: Student's t. Degrees of freedom: n1 + n2 - 2.
    Example: A retailer wants to compare the average sales of two different stores located in different regions.

    Independent t-test (unequal variance - Welch's)*
    Groups: 2 independent groups with unequal variance. Tails: 1 and 2. Statistic: t-score. Distribution: Student's t. Degrees of freedom: approximated (Welch-Satterthwaite).

    One-way ANOVA
    Groups: 3+, looking at one variable. Tails: 1 only. Statistic: F-score. Distribution: F. Degrees of freedom: between groups, k-1 (where k is the number of groups).
    Example: Three groups are given different drugs in a test to see the improvement in blood sugar.

    Two-way ANOVA
    Groups: 3+, looking at multiple variables. Tails: 1 only. Statistic: F-score. Distribution: F.
    Example: Effect of water and sun on the height of sunflowers, where there are 3+ combinations of water and sun.

    Pearson's Chi-Square
    Groups: 2 categories, compared to see if they are related. Tails: neither, just different. Statistic: Chi-squared. Distribution: Chi-squared. Degrees of freedom: (no. of rows - 1) * (no. of columns - 1).
    Example: Individuals that received social media vs. those that received email, split into purchased and not purchased.

    Correlation

    Pearson's Correlation Coefficient (R)
    Groups: 2 variables. Tails: 1 or 2. Statistic: t-statistic. Distribution: t-distribution table. Degrees of freedom: n-2.
    Example: Relationship between daily hours of sunlight and temperature. Best for continuous data with linear relationships.
    Provides both the direction (positive or negative) and the strength of a linear relationship between two variables; it ranges between -1 and 1. Note that for simple linear regression, R squared is the Pearson r squared.

    Spearman's Rank
    Groups: 2 variables. Tails: 1 or 2. Statistic: t-statistic. Distribution: Spearman's rho. Degrees of freedom: n-2.
    Example: Relationship between the ranks of employees' performance and their years of experience. Best for ordinal data or when data does not meet Pearson's assumptions.

    R squared (used in regression analysis)
    Variables: one (simple linear regression) or more (multiple linear regression) independent variables plus 1 dependent variable. Not used in significance testing directly (NA); for overall significance, fit a regression model, compute R squared, and perform an F-test (F-statistic, F-distribution).
    Example: Explaining the variance in academic performance based on study hours, attendance, and other factors.
    Used in the context of regression analysis: it indicates the proportion of the variance in the dependent variable that can be explained by the independent variable(s), but does not provide information about the direction of the relationship. Ranges from 0 to 1.

    Non-Parametric Tests (normal distribution not assumed)

    Mann-Whitney U
    Groups: 2 independent groups (like the t-test); compares the distributions of the 2 groups. Statistic: U statistic. Distribution: Mann-Whitney U distribution. Degrees of freedom: NA.

    Wilcoxon signed rank
    Groups: 2 related groups; compares the medians of the 2 groups. Statistic: T statistic. Distribution: Wilcoxon signed-rank table. Degrees of freedom: NA.
    Example: When comparing two related samples, matched samples, or repeated measurements on a single sample.

    * Note: Use Levene's test to determine equal/unequal variance.

    Significant Testing Steps

    1. Null hypothesis: There is no significant difference between the means of the two groups.

      Examples:
      1. In a drug efficacy study, the null hypothesis might state that the drug does not affect patients compared to a placebo.
      2. The mean cholesterol levels of patients taking the new drug are equal to the mean cholesterol levels of patients taking a placebo.
      3. The mean sales after the advertising campaign are equal to the mean sales before the campaign.
      4. The mean sales of the product with the new packaging design are equal to the mean sales of the product with the old packaging design.

      Alternative hypothesis: There is a significant difference between the means of the groups.
    2. The significance level (α) is the probability threshold for rejecting the null hypothesis. Commonly used significance levels include 0.05, 0.01, and 0.10. e.g. 5%: a result this extreme would only have a 5% chance of occurring if the null hypothesis were true.
    3. Calculate the test statistic (t-value, z-value, etc.). The test statistic is used to look up the p-value from the relevant distribution tables.

    4. Calculate the p-value using the test statistic. The p-value is used to determine if the test is significant, e.g. a p-value of less than 5% (0.05) is significant and we can reject the null hypothesis. If the p-value is higher than the significance level, then the results are reported as not statistically significant.

    1 or 2 Tailed Tests
    One-tailed test: used when measuring whether a value is greater than or less than (left- or right-sided).
    Two-tailed test: used when measuring whether there is a difference in either direction.

    Type 1 and Type 2 Errors
    Type 1 Error: We reject the null hypothesis even though it was true (alpha).
    Type 2 Error: We fail to reject the null hypothesis even though it was false (beta).

    Two Methods of accepting or rejecting the Null Hypothesis
    1. Critical values (taken from the normal or t-distribution)
    2. P-values (compare the p-value to the significance level, e.g. 0.05); a worked example follows
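
    A minimal scipy sketch of these steps, using a one-sample t-test on illustrative (made-up) weekly sales figures against a hypothesized mean of $5000, and showing both decision methods:

    from scipy import stats
    
    sample = [5120, 4890, 5340, 5010, 4760, 5230, 5485, 4920]   # illustrative data only
    alpha = 0.05
    
    # Test statistic and p-value (two-tailed by default)
    t_stat, p_value = stats.ttest_1samp(sample, popmean=5000)
    
    # Method 1: critical value from the t-distribution (n - 1 degrees of freedom)
    t_crit = stats.t.ppf(1 - alpha / 2, df=len(sample) - 1)
    print(abs(t_stat) > t_crit)   # True -> reject the null hypothesis
    
    # Method 2: p-value compared to the significance level
    print(p_value < alpha)        # True -> reject the null hypothesis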


  • Python Notes: Pandas from getting started Tutorials

    Notes from Getting Started Tutorial:
    https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html

    1. Create a Pandas Series and DataFrame
    2. Create Graphs with Matplotlib
    3. Adding new Columns from Existing Columns (inc. split)
    4. Summary Statistics including groupby
    5. Reshape layout of tables incl. Sorting and Pivot tables
    6. Sub-sets and filtering
    7. Combining data from multiple tables (concatenate (union) and merge (join))
    8. Working with timeseries (pd.to_datetime(), dt, DatetimeIndex and Resample)
    9. Textual data (.str and .replace and split)


    1. Creating a Dataframe and Series

    import pandas as pd
    
    #Create a series
    books = pd.Series(["Papillion", "Great Expectations", "Magic"], name="Books")
    
    #Create DataFrame with Key value pairs. In this case the values are lists of values.
    
    df = pd.DataFrame(
        {
            "Name": [
                "Humpfries, Mr. Bill",
                "Derby, Mrs. Sheila",
                "Winfrey, Miss. Elizabeth",
            ],
            "Age": [44, 66, 42],
            "Gender": ["male", "female", "female"],
            "City": ["London", "Manchester", "Leeds"],
            "Occupation": ["Doctor", "Architect", "Analyst"]
        }
    )
    
    #Using the DataFrame
    df["Age"] #Output Age Column
    #df["Age"].max() #outputs max age

    2. Create Plots with Matplotlib

    import pandas as pd
    import matplotlib.pyplot as plt
    
    #Create Pandas dataframe from csv file
    air_quality = pd.read_csv("air_quality_no2.csv", index_col=0, parse_dates=True)
    air_quality["station_paris"].plot() #create a plot of one column (a subset) from the DataFrame
    plt.show() 
    
    #Create scatter plot from the DataFrame
    air_quality.plot.scatter(x="station_london", y="station_paris", alpha=0.5)
    plt.show() 
    
    #Create box plot from the DataFrame
    air_quality.plot.box() #Create box plot
    plt.show() #Display boxplot
    
    #subplots - one area plot per column
    axs = air_quality.plot.area(figsize=(10, 4), subplots=True)
    plt.show()
    


    3. Add new columns from existing columns

    import pandas as pd
    #index_col=0 sets the first column as the index; parse_dates converts dates to date type
    air_quality = pd.read_csv("air_quality_no2.csv", index_col=0, parse_dates=True)
    
    #Add new column based on an existing column - this adds the column to the DataFrame
    air_quality["london_mg_per_cubic"] = air_quality["station_london"] * 1.882
    #Another example, this time a ratio
    air_quality["ratio_paris_antwerp"] = (air_quality["station_paris"] / air_quality["station_antwerp"])
    
    #renaming columns
    air_quality_renamed = air_quality.rename(columns={"station_antwerp": "BETR801","station_paris": "FR04014", "station_london": "London Westminster"})
    

    4. Summary Statistics


    import pandas as pd
    #read_csv function - also read_excel, read_sql, read_json, read_parquet and more
    
    titanic = pd.read_csv("titanic.csv") #read_csv function reads from csv
    titanic["Age"].mean() #average of Age column
    titanic[["Age", "Fare"]].median() #median of Age and Fare columns
    titanic[["Age", "Fare"]].describe() #summary stats of 2 columns
    titanic.agg({"Age": ["min", "max", "median", "skew"],"Fare": ["min", "max", "median", "mean"]})
     #multiple stats on Age and Fare columns
     
     #groupby to create summary tables
     titanic[["Sex", "Age"]].groupby("Sex").mean()
    
     #Average age by sex
     titanic.groupby("Sex").mean(numeric_only=True) 
     #average of all numeric columns by Sex
     titanic.groupby(["Sex", "Pclass"])["Fare"].mean() #table average fare by sex and PClass
     
     titanic["Pclass"].value_counts() #count passengers in Pclass
     titanic.groupby("Pclass")["Pclass"].count() #count longer method


    5. Re-shape layout of tables

    import pandas as pd
    titanic = pd.read_csv("titanic.csv") #read_csv function reads from csv
    air_quality = pd.read_csv("air_quality_long.csv", index_col="date.utc", parse_dates=True)
    
    #Sorting Tables
    #sort by Age
    sort1 = titanic.sort_values(by="Age").head() 
    
    #sort by Pclass then Age descending
    sort2 = titanic.sort_values(by=['Pclass', 'Age'], ascending=False)
    
    # filter for no2 data only
    no2 = air_quality[air_quality["parameter"] == "no2"]
    
    # take the first 2 measurements (head) per location (groupby) and sort by index
    no2_subset = no2.sort_index().groupby(["location"]).head(2)
    no2_subset
    #pivot table wide format
    air_quality.pivot_table(values="value", index="location", columns="parameter", aggfunc="mean")
    air_quality.pivot_table(values="value",index="location",columns="parameter",aggfunc="mean",margins=True)
    
    #pivot: long to wide format
    no2_pivoted = no2.pivot(columns="location", values="value").reset_index()
    no2_pivoted.head(5)
    #pandas melt() method converts the wide format back to long format
    no_2 = no2_pivoted.melt(id_vars="date.utc")
    no_2.head(5)


    6. Sub-sets and Filtering

    import pandas as pd
    #read csv file to pandas DataFrame
    titanic = pd.read_csv("titanic.csv")
    air_quality = pd.read_csv("air_quality_no2.csv", index_col=0, parse_dates=True)

    #save as excel, index=False removes row index numbers from export
    titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False)


    #attributes and methods of DataFrame
    titanic.shape #row and column count
    titanic.info() #summary of the DataFrame
    titanic.columns #column names
    titanic.dtypes #data types
    titanic.head(5) #head of df
    titanic.tail(5) #tail of df
    titanic.describe() #summary statistics of the DataFrame

    #Create sub-set of DataFrame as a Series
    ages = titanic["Age"] #create sub-set from dataframe: note a single bracket creates a pandas Series
    type(titanic["Age"]) #confirm type of the Age object

    #Create sub-set of DataFrame as a DataFrame
    age_sex = titanic[["Age", "Sex"]] #inner brackets are a list, outer brackets select a DataFrame

    #Filtering rows: get passenger list with Age > 35
    above_35 = titanic[titanic["Age"] > 35] #return rows with age over 35
    #Filter passenger list with Pclass = 2 or 3 (2 ways of doing this)
    titanic[titanic["Pclass"].isin([2, 3])] #isin returns True where Pclass is 2 or 3
    titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)] #same result using an OR condition

    #Filter passenger list with Age not NA
    age_no_na = titanic[titanic["Age"].notna()]
    #Filter passenger list and return a list of Names
    adult_names = titanic.loc[titanic["Age"] > 35, "Name"]

    #Filter specified rows and columns
    titanic.iloc[9:25, 2:5]

    #change the 4th column of the first 3 rows to "anonymous"
    titanic.iloc[0:3, 3] = "anonymous"



    7. Combining Data from tables

    import pandas as pd
    import os

    #check current working directory
    cwd = os.getcwd()
    cwd

    #Import data into DataFrames
    titanic = pd.read_csv("titanic.csv")
    air_quality_no2 = pd.read_csv("air_quality_no2_long.csv", parse_dates=True)
    air_quality_pm25 = pd.read_csv("air_quality_long.csv", parse_dates=True)
    stations_coord = pd.read_csv("air_quality_stations.csv")
    air_quality_parameters = pd.read_csv("air_quality_parameters.csv")

    #Create sub-sets
    air_quality_no2 = air_quality_no2[["date.utc", "location",
    "parameter", "value"]]

    air_quality_pm25 = air_quality_pm25[["date.utc", "location",
    "parameter", "value"]]

    #Concatenate sub-sets - creates a union between the 2 tables.
    #the keys argument adds another hierarchical index to show which table the data comes from
    air_quality = pd.concat([air_quality_pm25, air_quality_no2], axis=0, keys=["PM25", "NO2"])
    air_quality_no2.head(5)
    air_quality_pm25.head(5)
    air_quality_no2.shape #2068 rows, 4 columns
    air_quality_pm25.shape #5274 rows, 4 columns
    air_quality.shape #7340 rows, 4 columns

    air_quality = air_quality.sort_values("date.utc")
    air_quality.head(25)
    air_quality.to_csv("checkcombine.csv")

    #note you can reset the index to a column
    air_quality.reset_index(level=0)

    stations_coord = pd.read_csv("air_quality_stations.csv")

    stations_coord.head()

    #merge the columns from the stations_coord dataframe to the air_quality df
    #use location as the key to join the two dataframes
    #how="left" is a left join with air_quality on the left
    air_quality = pd.merge(air_quality, stations_coord, how="left", on="location")
    air_quality.head(10)
    air_quality.to_csv("checkqualitymerge.csv")

    air_quality_parameters.head(5)
    #merge left join using the parameter column on the left and id on the right
    air_quality = pd.merge(air_quality, air_quality_parameters,
    how='left', left_on='parameter', right_on='id')

    8. Timeseries

    import pandas as pd

    import matplotlib.pyplot as plt

    air_quality = pd.read_csv("air_quality_no2_long.csv")

    air_quality = air_quality.rename(columns={"date.utc": "datetime"})
    air_quality.head(5)
    air_quality.columns

    #unique cities in air_quality df
    air_quality.city.unique()

    #convert the datetime column to a datetime object instead of text
    air_quality["datetime"] = pd.to_datetime(air_quality["datetime"])

    air_quality["datetime"]

    #can also do this to read the dates into datetime format
    #pd.read_csv("../data/air_quality_no2_long.csv", parse_dates=["datetime"])

    #check start and end datetime
    air_quality["datetime"].min(), air_quality["datetime"].max()

    #datetime difference using pandas timestamp
    air_quality["datetime"].max() - air_quality["datetime"].min()

    #Add a month column using the dt accessor
    air_quality["month"] = air_quality["datetime"].dt.month

    #Average no2 concentration by day of week by location
    air_quality.groupby([air_quality["datetime"].dt.weekday, "location"])["value"].mean()

    #Plot bar graph using hour on the x axis and average no2 on the y
    fig, axs = plt.subplots(figsize=(12, 4))

    air_quality.groupby(air_quality["datetime"].dt.hour)["value"].mean().plot(
    kind='bar', rot=0, ax=axs)
    plt.xlabel("Hour of the day"); # custom x label using Matplotlib
    plt.ylabel("$NO_2 (µg/m^3)$");

    #create pivot table
    no_2 = air_quality.pivot(index="datetime", columns="location", values="value")
    no_2.head(5)

    #use index.year and index.weekday of the datetime object
    no_2.index.year, no_2.index.weekday
    #plot using a time range
    no_2["2019-05-20":"2019-05-21"].plot();

    #resample method of a datetime object works like groupby. It takes an
    #aggregation function and "M" is time-based grouping for Month (month end)
    monthly_max = no_2.resample("M").max()
    monthly_max

    monthly_max.index.freq #tells you the frequency attribute - e.g. MonthEnd

    #Plot average no2 levels for each station by day
    no_2.resample("D").mean().plot(style="-o", figsize=(10, 5));

    9. Textual Data

    import pandas as pd

    titanic = pd.read_csv("titanic.csv")

    titanic["Name"].str.lower()

    #Split the Name field on the comma separator
    titanic["Name"].str.split(",")
    #create a Surname series using the first element of the split Name field
    titanic["Surname"] = titanic["Name"].str.split(",").str.get(0)
    titanic["Surname"]

    #Return the Name series as True/False if it contains Countess
    titanic["Name"].str.contains("Countess")

    #return rows where Name contains Countess
    titanic[titanic["Name"].str.contains("Countess")]

    #string length
    titanic["Name"].str.len()
    #find the row index of the longest name
    titanic["Name"].str.len().idxmax()
    #Return longest name
    titanic.loc[titanic["Name"].str.len().idxmax(), "Name"]

    titanic["Sex_short"] = titanic["Sex"].replace({"male": "M", "female": "F"})

    titanic["Sex_short"]

  • Useful Power BI DAX Code Examples

     DAX Code Examples:
    1. Using Variables
    2. FORMAT()
    3. HASONEVALUE()
    4. AND, &&
    5. CALCULATETABLE() and SUMMARIZE()
    6. USERELATIONSHIP()
    7. SWITCH()
    8. ISFILTERED() and making visual transparent
    9. SELECTEDVALUE() and creating a dynamic Graph Title
    10. FILTER and ADDCOLUMNS
    11. RANK()

    VAR: Using Variables

    Running Total =
    VAR MaxDateInFilterContext = MAX ( Dates[Date] ) //variable 1: max date
    VAR MaxYear = YEAR ( MaxDateInFilterContext ) //variable 2: year of max date
    VAR DatesLessThanMaxDate = //variable 3: dates <= variable 1 within the year in variable 2
    FILTER (
    ALL ( Dates[Date], Dates[Calendar Year Number] ),
    Dates[Date] <= MaxDateInFilterContext
    && Dates[Calendar Year Number] = MaxYear
    )
    VAR Result = //variable 4: total sales filtered by variable 3
    CALCULATE (
    [Total Sales],
    DatesLessThanMaxDate
    )
    RETURN
    Result //return variable 4

    FORMAT: Formatting Numbers
    Actual Formatted =
    IF ( SUM ( [actual] ) > 1000000, FORMAT ( SUM ( [actual] ), "#,,.0M" ),
    IF ( SUM ( [actual] ) >= 1000, FORMAT ( SUM ( [actual] ), "#,.0K" ) ) )

    FORMAT ( MIN ( column ), "0.0%" )
    FORMAT ( MIN ( column ), "Percent" )

    e.g. if the matrix is filtered,
    IF ( ISFILTERED ( [field] ), SELECTEDVALUE ( [column] ) )

    HASONEVALUE: Check if a column has only one value in the filter context
    ValueCheck = IF ( HASONEVALUE ( [column] ), VALUES ( [field] ) )


    FILTER the table where the related SalesTerritoryCountry is not United States, and SUMX the SalesAmount_USD
    = SUMX(FILTER( 'InternetSales_USD' , RELATED('SalesTerritory'[SalesTerritoryCountry]) <> "United States" ) , 'InternetSales_USD'[SalesAmount_USD])

    AND, can also use &&

    Demand =
        SUMX (
            FILTER (
                RELATEDTABLE ( Assignments ),
                AND (
                    [AssignmentStartDate] <= [TimeByDay],
                    [TimeByDay] <= [AssignmentFinishDate]
                )
            ),
            Assignments[Av Per Day]
        )


    CALCULATETABLE, SUMMARIZE

    Calculate Table with Summarize and Filter

    Order Profile =
    CALCULATETABLE (
    SUMMARIZE (
    'Sales Table',
    'Sales Table'[Order_Num_Key],
    Customer[Sector],
    "Total Value", SUM ( 'Sales Table'[Net Invoice Value] ),
    "Order Count", DISTINCTCOUNT ( 'Sales Table'[Order_Num_Key] )
    ),
    YEAR ( DimDate[Datekey] ) = YEAR ( TODAY () )
    )

    USERELATIONSHIP Uses inactive relationship between tables

    CALCULATE (
    [Sales Amount],
    Customer[Gender] = "Male",
    Products[Color] IN { "Green", "Yellow", "White" },
    USERELATIONSHIP ( Dates[Date], Sales[Due Date] ),
    FILTER ( ALL ( Dates ), Dates[Date] < MAX ( Dates[Date] ) )
    )


    SWITCH

    SWITCH(<expression>, <value>, <result>[, <value>, <result>]...[, <else>])
    = SWITCH([Month], 1, "January", 2, "February", 3, "March", 4, "April", 5, "May", 6, "June", 7, "July", 8, "August", 9, "September", 10, "October", 11, "November", 12, "December", BLANK() ) //place on separate lines for readability

    SWITCH with Measure

    = SWITCH(TRUE(),
            [measure] = "turnover", [turnover],
            [measure] = "Profit", [Profit],
            BLANK()
    )

    Visuals
    ISFILTERED()
    Check Filtered = ISFILTERED([column])

    Dynamic Visual

    MakeTransparent =
    IF([Check Filtered], "#FFFFFF00", // 8-digit hex: the trailing 00 alpha makes it transparent
    "White")

    Message = IF([Check Filtered], "", "please select a row")

    Dynamic Graph Title
    Graph year title = SELECTEDVALUE([columnname]) & " - My Graph Title"

    ADDCOLUMNS
    New Table:
    Creates a new table and adds columns to it (and in this case also filters it)

    2013Sales = FILTER(ADDCOLUMNS(FactInternetSales, "Dates", FactInternetSales[OrderDate], "Sales2", SUM(FactInternetSales[SalesAmount])), FactInternetSales[OrderYear] = 2013)

    RANKX by 2 Columns (calculated column)
    Measure = RANKX( FILTER( ALL('Table'), 'Table'[customer] = EARLIER('Table'[Customer]) ), 'Table'[Txn Number], , DESC, DENSE)

    Creates a rank for each customer based on the txn number for each customer

  • Python Notes: Python Lists Examples


    Create a list
    Simple list
    mylist = ["apple", "orange", "banana", "mango", "pineapple"]
    mylist[2] # returns 'banana'
    mylist[-1] # returns 'pineapple'
    mylist[2:4] # returns 'banana', 'mango'
    if "apple" in mylist:
      print("Yes, 'apple' is in the fruits list")
    mylist.insert(2, "watermelon") # insert at the position specified
    mylist.append('grapes') # adds to the end of the list
    mylist = ["apple", "banana", "cherry"]
    tropical = ["mango", "pineapple", "papaya"]
    mylist.extend(tropical) # adds the tropical list to mylist
    mylist.remove("banana") # removes the first occurrence
    mylist.pop(1) # removes the element at the specified index
    del mylist[0] # also removes an element by index
    del mylist # deletes the whole list
    mylist.clear() # empties the list

    for i in mylist: # loop over the list
      print(i)

    i = 0
    while i < len(mylist): # while loop
      print(mylist[i])
      i = i + 1

    newlist = []
    for x in mylist: # adds list items containing "a" to a new list
      if "a" in x:
        newlist.append(x)

    mylist.sort() # sort alphabetically or numerically depending on data type
    mylist.sort(reverse=True) # sort in reverse
    mylist.sort(key=str.lower) # case-insensitive sort
    mylist.reverse() # reverse the list order
    mylist2 = mylist.copy() # copy the list
    mylist3 = list(mylist) # also makes a copy
    list3 = list1 + list2 # concatenate lists





    A list of lists: 5 rows, 2 columns

    b1 = [[2308, 6], [2408, 6.2], [2508, 5.8], [2608, 5.6], [2708, 5.9]] # create the list
    print(b1) # print the list
    print(len(b1)) # print the length of the list
    print(type(b1)) # print the data type of the list
    print(b1[:2]) # print the first 2 elements of the list (indexes 0 and 1)
    print(b1[3:]) # print from index 3 (the 4th element) onwards

    Reference an index in the list

    b1[3] # returns the fourth element (index 3)
    b1[0:2] # returns the first and second elements of the list
    b1[-1] # returns the last element

    Add, Delete, Update
    Delete
    del(b1[0]) # delete the first element
    or
    b1.pop(1) # remove the second element


    Insert
    b1.insert(1, [24082, 111]) # insert an element at position 1
    Update
    b1[2] = [2808, 6.7] # update the value at index 2

    Append
    Additions = [[29008, 6.6], [3008, 6], [3108, 6.1]]
    b1.extend(Additions) # adds the elements of the Additions list to b1
    b1.append([2808, 6.6]) # append a single element

    Clear the list
    b1.clear() #empties the list

    Check if element in array

    if [2308,6.2] in b1: print(“yes”)
    Loop
    for x in b1: print(x)

    def country_select(countrylist):
        count = 1
        for i in countrylist:
            print(f"{count}: {i}")
            count = count + 1
        return countrylist



    Numpy

    baseball = [180, 215, 210, 210, 188, 176, 209, 200]
    import numpy as np
    np_baseball = np.array(baseball)
    print(type(np_baseball))
    mean = np_baseball.mean()
    print("mean is: " + str(mean))
    med = np.median(np_baseball)
    print("median is: " + str(med))
    std = np.std(np_baseball)
    print("standard deviation is: " + str(std))

  • Simple Linear Regression Example with Python

    Linear Regression using Salary and Years of Experience Data

    Data Source: Salary_dataset.csv Kaggle
    The salary data set includes 2 columns: Years Experience which will be our independent variable (X) and Salary (Y).

    Linear regression is a fundamental statistical method used to model the relationship between a dependent variable and one or more independent variables. The primary goal of linear regression is to predict the value of the dependent variable based on the values of the independent variables (Chat GPT)

    For this example: First, we want to see if there is a correlation between the 2 variables by building a regression line and calculating r squared. Then we want to assess the significance of the relationship using the p-value to test the null hypothesis that there is no relationship between X and Y (X does not predict Y).

    Simple Linear Regression
    Formula
    Formula: Y = b + aX + e

    1. Dependent Variable (Y): The outcome or the variable we aim to predict or explain.
    2. Independent Variable(s) (X): The variable(s) used to predict or explain changes in the dependent variable.
    3. a: is the slope change in Y for a one-unit change in X
    4. b: is the intercept of the Y axis. This represents the value of Y when X is zero.

      The following Python code (which I used ChatGPT to optimize) calculates the regression line and p-value, and evaluates the null hypothesis.

      The steps are as follows:
      1. Use Pandas to import the CSV file to a data frame and convert each series to an array.
      2. Fit a linear regression model using sklearn, which returns the slope and intercept of the regression line.
      3. The statsmodels library is then used to calculate the R-squared value and the p-value of the slope.
      4. The null hypothesis is then evaluated based on the p-value.
      5. Matplotlib.pyplot is used to plot the regression line on a graph, and scipy.stats is used to fit the line a second time as a comparison.
    
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LinearRegression
    import statsmodels.api as sm
    from scipy import stats
    
    def load_data(csv_location):
        """Load CSV data into a pandas DataFrame."""
        df = pd.read_csv(csv_location)
        return df
    
    def prepare_data(df):
        """Prepare independent and dependent variables for regression."""
        X = df[['YearsExperience']].values  # Extract as 2D array
        y = df['Salary'].values  # Extract as 1D array
        return X, y
    
    def fit_sklearn_model(X, y):
        """Fit a linear regression model using sklearn."""
        model = LinearRegression()
        model.fit(X, y)
        return model
    
    def fit_statsmodels_ols(X, y):
        """Fit a linear regression model using statsmodels OLS."""
        X_with_const = sm.add_constant(X)  # Add an intercept to the model
        model = sm.OLS(y, X_with_const).fit()
        return model
    
    def plot_regression_line(df, intercept, slope):
        """Plot the regression line along with data points."""
        plt.scatter(df['YearsExperience'], df['Salary'], color='blue', label='Data points')
        plt.plot(df['YearsExperience'], intercept + slope * df['YearsExperience'], color='red', label='Regression line')
        plt.title("Salary by Years of Experience")
        plt.xlabel('Years of Experience')
        plt.ylabel('Salary')
        plt.legend()
        plt.show()
    
    def main():
        csv_location = "salary_dataset.csv"
        df = load_data(csv_location)
    
        # Display basic statistics
        #print(df.describe())
    
        X, y = prepare_data(df)
    
        # Fit the model using sklearn
        sklearn_model = fit_sklearn_model(X, y)
        intercept, slope = sklearn_model.intercept_, sklearn_model.coef_[0]
        
        print("Calculation of Regression Line:\n")
        print(f"Intercept is: {intercept}")
        print(f"Slope is: {slope}")
    
        # Fit the model using statsmodels to get p-values and R-squared
        statsmodels_model = fit_statsmodels_ols(X, y)
      #  print(statsmodels_model.summary())
    
        # Extract R-squared and p-values
        r_squared = statsmodels_model.rsquared
        p_values = statsmodels_model.pvalues
    
        print(f"R-squared: {r_squared}")
        #print(f"P-values: {p_values}")
    
        # Extracting specific p-values by index
        intercept_p_value = p_values[0]  # First p-value (intercept)
        slope_p_value = p_values[1]  # Second p-value (YearsExperience)
    
        #print(f"Intercept p-value: {intercept_p_value}")
        print(f"p-value (YearsExperience): {slope_p_value}")
    
        print("\nThe p-value is the probability of observing a t-statistic as extreme as, or more extreme than, the one calculated from your sample data, under the assumption that the null hypothesis is true.") 
        print("This is obtained from the t-distribution with n−2 degrees of freedom ")
        print("where n is the number of observations\n")
    
        if slope_p_value > 0.05:
            print("P-value is not significant and therefore we fail to reject the null hypothesis")
        if slope_p_value < 0.05:
            print("P-value is less than 0.05 and therefore we reject the null hypothesis. This means there is strong evidence that the predictor X has a statistically significant effect on the outcome Y")
        # Plotting the regression line
        plot_regression_line(df, intercept, slope)
    
        # Fit a linear regression line using scipy.stats (for comparison)
        slope, intercept, r_value, p_value, std_err = stats.linregress(df['YearsExperience'], df['Salary'])
       # plt.text(df['YearsExperience'].min(), df['Salary'].max(), f'y = {slope:.2f}x + {intercept:.2f}', ha='left')
    
    if __name__ == "__main__":
        main()
    



  • Python Notes: Pandas Code Examples

    Pandas Quick Guide

    Modules

    import pandas as pd #pandas module
    import scipy.stats as stats #stats sub-module
    from bs4 import BeautifulSoup #scrape module
    import requests #API module
    import whois as whois #whois module


    Files
    df = pd.read_csv("CookieData.csv") #import CSV file to dataframe df

    #define index column and convert dates to date format

    air_quality = pd.read_csv("air_quality_long.csv", index_col="date.utc", parse_dates=True)

    air_quality.to_csv("checkcombine.csv") #save dataframe to csv file
    tests.to_excel("tests.xlsx", sheet_name="expt", index=False)



    Dataframe info

    df.describe() #output descriptive statistics of dataframe (numeric columns)

    df.info() #display dataframe columns, datatypes, and non-null counts
    df.dtypes #output datatypes of dataframe

    df.shape #outputs rows and columns
    df.sample(5) #output a sample of 5 rows

    df.columns #output columns
    titanic.head(5) #print head of dataframe (top five rows)
    titanic.tail(5) #print tail of dataframe (bottom five rows)
    type(df) #output object type

    Convert
    gg['YearString'] = gg['Year'].astype("string") #convert series to string
    gg["YearInterval"] = gg["YearString"].str[:3] #LEFT-type function: get the first 3 characters of the string

    gg["YearInterval"] = gg["YearInterval"].astype(int) #convert series to integer



    Delete Columns
    del df2["Outlier_Flag"]

    Sub-set

    df2 = df[["version", "retention_1"]]

    Combine Sub-sets

    #Concatenate sub-sets - creates a union between the 2 tables. The keys argument adds another hierarchical index to show which table the data comes from

    air_quality = pd.concat([air_quality_pm25, air_quality_no2], axis=0, keys=["PM25", "NO2"]) # creates union

    #merge the columns from the stations_coord dataframe to the air_quality df. Use location as the key to join the two dataframes; how="left" is a left join with air_quality on the left
    air_quality = pd.merge(air_quality, stations_coord, how="left", on="location")

    #merge left join using the parameter column on the left and id on the right

    air_quality = pd.merge(air_quality, air_quality_parameters,
    how='left', left_on='parameter', right_on='id')



    Dataframe Aggregate Functions
    titanic["Age"].mean()
    titanic[["Age", "Fare"]].median()


    Sort Dataframe
    air_quality = air_quality.sort_values("date.utc")
    sort1 = titanic.sort_values(by="Age").head()
    sort2 = titanic.sort_values(by=['Pclass', 'Age'], ascending=False)
    no2_subset = no2.sort_index().groupby(["location"]).head(2) #group and sort by index


    Filtering Dataframe

    df2 = df[df.Outlier_Flag == 0]

    gate_30 = df[df.version == "gate_30"]
    no2 = air_quality[air_quality["parameter"] == "no2"]
    above_35 = titanic[titanic["Age"] > 35]
    class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]

    age_no_na = titanic[titanic["Age"].notna()] #return rows where Age is not NA

    #Return Names of passengers over 35

    adult_names = titanic.loc[titanic["Age"] > 35, "Name"]

    titanic.iloc[9:25, 2:5] #Return specified rows and columns
    titanic.iloc[0:3, 3] = "anonymous" #Change the 4th column of the first 3 rows to anonymous

    def city_select(country): #function with country as a parameter
        df2 = df[df["country"] == country] #slice dataframe where country = country e.g. United States
        df2 = df2.sort_values(by=["population"], ascending=False) #sort dataframe by population descending
        df3 = df2.iloc[0:5] #new dataframe with the top 5 rows by population
        df4 = df3["city"] #select the city series (top 5 cities by population)
        return df4

    def latlong(country, city): #define function
        selection = df[(df["country"] == country) & (df["city"] == city)] #filter dataframe by 2 criteria
        return selection #return the filtered dataframe

    def get_lat(selection): #define function
        latitude = selection["lat"] #latitude = series with "lat" as the column
        lat = latitude.iloc[0] #get the first row of the latitude series
        return lat #return the first row



    Groupby

    df.groupby("version").count() #count of rows by version
    df.groupby("version").sum() #sum of columns by version
    air_quality.groupby([air_quality["datetime"].dt.weekday, "location"])["value"].mean() #avg by day and location

    air_quality.groupby(air_quality["datetime"].dt.hour)["value"].mean().plot(kind='bar', rot=0, ax=axs)

    gg3 = gg.groupby(['YearInterval', 'Month'], as_index=False) #group by YearInterval and Month



    Create Pivot Table

    table = pd.pivot_table(gate_30, values='sum_gamerounds', index=['version'], columns=['retention_1'],
    aggfunc="mean", margins=True) #note margins give totals

    no_2 = air_quality.pivot(index="datetime", columns="location", values="value")
    air_quality.pivot_table(values="value", index="location", columns="parameter", aggfunc="mean") #wide format


    DateTime
    df["datetime"] = pd.to_datetime(df["datetime"]) #convert dates in the dataframe to datetime format

    pd.read_csv("../data/df_long.csv", parse_dates=["datetime"]) #does the same on read
    df["datetime"].min(), df["datetime"].max() #start and end datetime
    df["datetime"].max() - df["datetime"].min() #difference between the 2 datetimes
    df["month"] = df["datetime"].dt.month #add a month column based on the datetime column
    df.index.year, df.index.weekday #return the year and weekday of the datetime index
    df_pivoted = df.pivot(columns="location", values="value").reset_index() #reset the pivot table index
    df = df_pivoted.melt(id_vars="date.utc") #back to long format

    gg["YearMonth"] = pd.to_datetime(dict(year=gg.Year, month=gg.Month, day=1)) #create a date from the year and month columns, with 1 for the first of the month

    gg3 = gg.groupby(['YearInterval', 'Month'], as_index=False) #group by YearInterval and Month
    gg4 = gg3.max() #take the max of each group


    Plot

    df["2019-05-20":"2019-05-21"].plot(); #plot a time range



    Working with Columns

    air_quality.city.unique() #return unique cities in the city column
    air_quality = air_quality.rename(columns={"date.utc": "datetime"}) #rename columns
    Split Columns in Dataframe
    titanic["Name"].str.split(",") #split the Name column by comma
    titanic["Surname"] = titanic["Name"].str.split(",").str.get(0) #split the Name column by comma and return the first value in the array

    df["Name"].str.lower() #change the Name column to lowercase
    df["Name"].str.contains("Countess") #return True if the Name column contains "Countess"

    df["Name"].str.len() #return the string length of the Name column
    df["Name"].str.len().idxmax() #return the index of the longest name
    df.loc[df["Name"].str.len().idxmax(), "Name"] #return the longest name in the column using loc
    #replace genders with M and F in a new column
    df["gender_short"] = df["gender"].replace({"male": "M", "female": "F"})
    titanic["Pclass"].isin([2, 3]) #checks if Pclass is 2 or 3
    adult_names = titanic.loc[titanic["Age"] > 35, "Name"] #return Names of passengers over 35


    Print

    print("\n1 Day Retention by Version") #\n for a new line
    print(f"You searched for: {domain_name}\n") #an f-string allows placeholders in print

    Loops

    for link in a: #a is assumed to be a list of <a> tags from BeautifulSoup
        print(link.get("href")) #print the href attribute of each link

    Input
    domain = input("Enter domain to search for: ")

  • How to use the T-Test Statistical Test

    In business, the most commonly used t-test is often the Independent Two-Sample t-test. This test is widely utilized to compare the means of two independent groups or samples to determine if there is a significant difference between them.

    The Independent Two-Sample t-test is particularly useful in various business scenarios, such as:

    1. Comparing the performance metrics of two different products or services.
    2. Analyzing the effectiveness of different marketing strategies or campaigns.
    3. Assessing the impact of changes in processes or procedures on business outcomes.
    4. Evaluating the differences in customer satisfaction levels between different segments or demographics.
    5. Comparing the financial performance of different investment portfolios or asset classes.


    1. One-Sample t-Test

    Purpose:

    • To determine if the mean of a single sample differs significantly from a known or hypothesized population mean.

    Example:

    • Scenario: A company wants to compare the average weekly sales of a new product to the historical average weekly sales of similar products, which is $5000.
    • Hypothesis: The mean weekly sales of the new product is different from $5000.
    • Application: The company collects weekly sales data for the new product over several weeks and uses the one-sample t-test to determine if the average sales differ from $5000.

      Calculation
      t = (x̄ - μ) / (s / √n)
      where x̄ = sample mean, μ = hypothesized population mean, s = sample standard deviation, and n = sample size.

      Numerator: sample mean minus the hypothesized population mean
      Denominator: sample standard deviation / √(sample size n)
      A minimal scipy sketch follows.
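
      A hedged sketch of this calculation in Python, using illustrative weekly sales figures (not real data) and the scipy one-sample t-test as a cross-check:

      import numpy as np
      from scipy import stats
      
      sales = np.array([5120, 4890, 5340, 5010, 4760, 5230, 5485, 4920])  # illustrative data only
      mu = 5000                                                           # hypothesized population mean
      
      # Manual calculation: t = (x_bar - mu) / (s / sqrt(n))
      t_manual = (sales.mean() - mu) / (sales.std(ddof=1) / np.sqrt(len(sales)))
      
      # scipy equivalent
      t_scipy, p_value = stats.ttest_1samp(sales, popmean=mu)
      print(t_manual, t_scipy, p_value)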

    2. Independent Two-sample t-test

    Purpose:

    • To compare the means of two independent groups to see if they differ significantly.
    • Null Hypothesis: The means of 2 groups are equal. If rejected, there is a significant difference between the 2 means.

    Types:

    • Equal Variance (Pooled) t-Test: Assumes that both groups have the same variance.
    • Unequal Variance (Welch’s) t-Test: Does not assume equal variances between the groups.

      How do we determine if they have the same variance?
      Use Levene's test first to determine whether the variances are equal between the groups (see the Python code at the end of this post).


    Example (Equal Variance) or Pooled t-Test:

    • Scenario: A retailer wants to compare the average sales of two different stores located in different regions.
    • Hypothesis: The mean sales of Store A are equal to the mean sales of Store B.
    • Application: Sales data from Store A and Store B are collected over the same time period. An independent two-sample t-test assuming equal variances is used to compare the average sales.

      t-Value = a more complex pooled-variance calculation

      Degrees of Freedom
      df = n1 + n2 - 2

    Example (Unequal Variance, Welch’s t-test):

    • Scenario: A company wants to compare the average salaries of employees in two different departments.
    • Hypothesis: The mean salary in Department X is equal to the mean salary in Department Y.
    • Application: Salary data from both departments are collected. An independent two-sample t-test with unequal variances (Welch’s t-test) is used to compare the average salaries.

      Calculation
      Usage: Welch's independent t-test is used when the sample sizes and the variances of the two data sets are different.

      t-Value: t = (x̄1 - x̄2) / √((var1/n1) + (var2/n2))

      Numerator: mean1 - mean2
      Denominator: √((var1/n1) + (var2/n2))

      The degrees of freedom calculation is more complex (the Welch-Satterthwaite approximation). A minimal scipy sketch of both versions follows.
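
      A hedged scipy sketch of the independent two-sample t-test, with illustrative store sales figures (not real data), showing Levene's test and both the pooled and Welch's versions:

      from scipy import stats
      
      store_a = [200, 220, 215, 205, 230, 210, 225]   # illustrative daily sales
      store_b = [195, 185, 210, 190, 200, 205, 180]
      
      # Levene's test to choose between the pooled and Welch's versions
      levene_stat, levene_p = stats.levene(store_a, store_b)
      
      t_pooled, p_pooled = stats.ttest_ind(store_a, store_b, equal_var=True)    # equal variance (pooled)
      t_welch, p_welch = stats.ttest_ind(store_a, store_b, equal_var=False)     # unequal variance (Welch's)
      print(levene_p, p_pooled, p_welch)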

    3. Paired Sample t-Test

    Purpose:

    • To compare the means of two related groups (e.g., the same subjects measured at two different times).
    • To test whether the mean difference between the paired observations is significantly different from zero.

    Business Example:

    • Scenario: A company implements a new training program and wants to determine its effectiveness by comparing employee performance before and after the training.
    • Hypothesis: The mean performance score after training is different from the mean performance score before training.
    • Application: Performance scores of employees are recorded before and after the training. A paired sample t-test is used to compare the average performance scores before and after the training.

      More examples:

      • Comparing the mean blood pressure of patients before and after treatment.
      • Testing whether a new drug increases recovery rates (greater than) compared to the known recovery rate without the drug.

      Calculation
      t-Value: t = d̄ / (s(Diff) / SQRT(n))
      Numerator: d̄, the mean of the paired differences (equivalent to mean1 − mean2)
      Denominator: s(Diff) / SQRT(n)

      s(Diff) = standard deviation of the differences of the paired data values
      n = sample size (number of paired differences)

      Degrees of Freedom: n − 1
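
      A minimal sketch of the paired test in Python using scipy.stats (the before/after scores are made up for illustration):

      import scipy.stats as stats

      # Hypothetical performance scores for the same eight employees before and after training
      before = [62, 70, 58, 75, 66, 71, 60, 68]
      after  = [68, 74, 63, 77, 70, 75, 66, 71]

      # Paired (related samples) t-test
      t_stat, p_value = stats.ttest_rel(after, before)

      print(f"t statistic: {t_stat:.3f}")
      print(f"p-value: {p_value:.3f}")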

    One-Sided or Two-Sided:

    a) One-Sided t-Test

    Purpose:

    • To test if the sample mean is significantly greater than or less than the hypothesized population mean (used for directional hypotheses).

    Example:

    • Scenario: A company believes that a new marketing campaign will increase the average number of customers visiting their store per day compared to the previous average of 100 customers.
    • Hypothesis: The mean number of customers after the campaign is greater than 100.
    • Application: The company collects daily customer visit data after the campaign starts. A one-sided t-test is used to determine if the average number of customers is significantly greater than 100.

      Example 2: Comparing the average test score of a class to a national average
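
      A minimal sketch of the store-visits scenario in Python (the daily figures are made up for illustration; the alternative argument requires SciPy 1.6 or later):

      import scipy.stats as stats

      # Hypothetical daily customer visits after the campaign
      daily_visits = [104, 98, 112, 107, 101, 110, 99, 106, 103, 109]

      # One-sided test: is the mean number of visits greater than 100?
      t_stat, p_value = stats.ttest_1samp(daily_visits, popmean=100, alternative='greater')

      print(f"t statistic: {t_stat:.3f}")
      print(f"one-sided p-value: {p_value:.3f}")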

    b) Two-Sided t-Test

    Purpose:

    • To test if the sample mean is significantly different from the hypothesized population mean (used for non-directional hypotheses).


    Example:

    • Scenario: A company wants to compare the customer satisfaction ratings of two different products to see if there is any difference.
    • Hypothesis: The mean satisfaction rating of Product A is different from the mean satisfaction rating of Product B.
    • Application: Customer satisfaction ratings for both products are collected. A two-sided t-test is used to compare the average satisfaction ratings of the two products.


      Levene's Test – Python Code:
      Tests whether the variances of two independent groups are equal – use it to determine whether to run the Equal or Unequal Variance Independent Samples t-test.

      If the p-value for the Levene test is greater than .05, the variances are not significantly different from each other (i.e., the homogeneity of variance assumption is met).
      In this case, we can use the Independent Samples Equal Variance t-Test.
      If they are significantly different (p < 0.05), then we use the Independent Samples Unequal Variance (Welch's) t-Test.
    import scipy.stats as stats
    
    # Data for Set 1 and Set 2
    set1 = [19.7, 20.4, 19.6, 17.8, 18.5, 18.9, 18.3, 18.9, 19.5, 21.95]
    set2 = [28.3, 26.7, 20.1, 23.3, 25.2, 22.1, 17.7, 27.6, 20.6, 13.7, 23.2, 17.5, 20.6, 18, 23.9, 21.6, 24.3, 20.4, 23.9, 13.3]
    
    # Check if the variances are equal
    # Levene's test for equal variances
    levene_stat, p_levene = stats.levene(set1, set2)
    
    # Print the results of the Levene's test
    print("Levene's test for Equal Variances:")
    print(f"F statistic: {levene_stat:.3f}")
    print(f"p-value: {p_levene:.3f}")
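
      We can then use the Levene p-value to choose between the pooled and Welch's t-test. A minimal sketch continuing from the code above (it reuses set1, set2, and p_levene):

    # Choose the t-test based on the Levene result
    equal_variances = p_levene > 0.05
    
    t_stat, p_ttest = stats.ttest_ind(set1, set2, equal_var=equal_variances)
    
    test_name = "pooled (equal variance)" if equal_variances else "Welch's (unequal variance)"
    print(f"{test_name} t-test: t = {t_stat:.3f}, p = {p_ttest:.3f}")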
  • Examples of working with Dates in Google BigQuery

    References:

    Date Functions
    https://cloud.google.com/bigquery/docs/reference/standard-sql/date_functions
    Format Elements
    https://cloud.google.com/bigquery/docs/reference/standard-sql/format-elements#format_elements_date_time

    CURRENT_DATE, Date Construction, LAST_DAY, DATE_ADD and DATE_SUB


    SELECT
    
    CURRENT_DATE() as Current_Date, -- returns today's date
    
    DATE(2024,01,24) as Constructed_Date, -- constructs the date from Year, Month, Day
    
    LAST_DAY(DATE '2008-12-25') AS Last_Date, -- returns 2008-12-31, the last day of that month
    
    DATE_ADD(DATE "2024-01-24", INTERVAL 1 DAY), -- 1 day later
    
    DATE_SUB(DATE "2024-01-24", INTERVAL 1 DAY) -- 1 day before


    FORMAT_DATE and PARSE_DATE

    SELECT
    
    CURRENT_DATE() as today, -- returns today
    
    FORMAT_DATE('%d/%m/%Y', CURRENT_DATE()) AS formatteday, -- UK format
    
    PARSE_DATE('%d-%m-%Y', '01-05-2024') as passstringdate, -- convert string to date object
    
    FORMAT_DATE('%d/%m/%Y', PARSE_DATE('%d-%m-%Y', '01-05-2024')) AS formatedpassstring -- convert string to date and then format



    EXTRACT

    --CURRENT_DATE() is 15/05/2024
    
    SELECT
    
    EXTRACT(DAY from CURRENT_DATE()) as day, -- 15
    
    EXTRACT(DAYOFWEEK from CURRENT_DATE()) as DayOfWeek, -- 4
    
    EXTRACT(WEEK from CURRENT_DATE()) as Week, --19
    
    EXTRACT(WEEK(MONDAY) from CURRENT_DATE()) as WeekStartingMonday, -- 20
    
    EXTRACT(ISOWEEK from CURRENT_DATE()) as WeekStartingMonday2, -- 20
    
    EXTRACT(MONTH from CURRENT_DATE()) as  Mth, -- 5
    
    EXTRACT(QUARTER from CURRENT_DATE()) as Qtr, -- 2
    
    EXTRACT(YEAR from CURRENT_DATE()) as Yr, -- 2024
    
    EXTRACT(ISOYEAR from CURRENT_DATE()) as YrWkStartMon -- 2024


    DATE_DIFF

    SELECT DATE_DIFF(DATE '2024-01-25', DATE '2024-01-02', DAY) AS days_diff -- returns 23 (note format end_date, start_date)

    DATE() Constructor

    SELECT *
    FROM `bigquery-public-data.google_trends.top_terms`
    WHERE rank = 1
    --Date Constructor
    AND refresh_date >= DATE(2024, 01, 01)
    LIMIT 10


    DATE_SUB() with INTERVAL

    SELECT *
    
    FROM `bigquery-public-data.google_trends.top_terms`
    
    WHERE rank = 1
    -- DATE_SUB subtracts from given date
    
    AND refresh_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 2 WEEK)
    
    -- Limit result to 10 (works like TOP)
    
    LIMIT 10






  • DAX Running Total Quick Measure versus a Human-Created Measure

    Dax Running Total

    In this example, we start with a sales table and a date table to create our running total of sales from.
    We start by creating a sales measure as follows:

    Sales = SUM(FactSales[SalesAmount])

    Next, we create a quick measure for the ‘Running Total in Sales’ and add it to our table.

    The code for the quick measure ‘Sales running total in Year’ is as follows:

    Sales running total in Year = 
    CALCULATE(
    	[Sales],
    	FILTER(
    		ALLSELECTED( 'Date'[Year]),  
    		ISONORAFTER( 'Date'[Year], MAX('Date'[Year]), DESC
    	)
    ))


    It looks pretty simple, but there is quite a lot to it, so let’s break it down:

    1. CALCULATE() Syntax: CALCULATE(Expression, [Filter1]…)
    The CALCULATE() function takes the [Sales] measure as the first argument. [Sales] is just the sum of sales.

    Next, we have the main filter argument that filters the [Sales] measure on a range of years from the ‘Date'[Year] column.
    The filter starts with the FILTER function, which takes ALLSELECTED(‘Date'[Year]) as the table to filter and
    ISONORAFTER(‘Date'[Year], MAX(‘Date'[Year]), DESC) as the filter condition.

    ALLSELECTED(‘Date'[Year])
    The ALLSELECTED function removes filters external to the column ‘Date'[Year], so the year on each row of our table visual is ignored and all years in the ‘Date’ table are included.
    We can prove this by creating another measure:

    CountrowsAllSelectedYear = COUNTROWS(ALLSELECTED( ‘Date'[Year]))


    When added to our table visual, we can see the count is 24 for every year row in the table. That is because this particular table has 24 years of date information in it.

    The next part is the tricky part: how we filter ALLSELECTED(‘Date'[Year]).

    ISONORAFTER(‘Date'[Year], MAX(‘Date'[Year]), DESC)
    The ISONORAFTER function compares each element of the first argument (the ‘Date'[Year] column) with the second argument, MAX(‘Date'[Year]), using either ascending (ASC) or descending (DESC) sort order.
    We can check what MAX(‘Date'[Year]) evaluates to with the following measure:

    Max Date Year = MAX(‘Date'[Year])


    We can see in the table visual that Max Date Year, evaluated in each row's filter context, returns the same year as the Year column, which may not be what you expect.


    The comparison works as follows:

    ‘Based on the sort order, the first parameter is compared with the second parameter. If the sort order is ascending, the comparison to be done is the first parameter greater than the second parameter. If the sort order is descending, the comparison to be done is the second parameter less than the first parameter’

    The table below shows what is going on. If I filter the page to the years 2007-2009, which effectively filters the years in the table down from 24 to 3, we can create the following table.
    Essentially, for each row of the visual it checks whether each element of the ‘Date'[Year] column is on or after the max date, which is derived from the year in the table visual. If the year is less than or equal to MAX(‘Date'[Year]), it returns TRUE.

    The years that return TRUE are then passed as a filter to CALCULATE, which filters the [Sales] measure, creating the ‘Running Total in Sales’.

    (TRUE if Date[Year] is less than or equal to Max('Date'[Year]) when sorted DESC)

    Year: 2007
    Date[Year]   Max('Date'[Year])   ISONORAFTER(..., DESC)   Sales Total
    2007         2007                TRUE                     4.56M
    2008         2007                FALSE
    2009         2007                FALSE
    Running Total                                             4.56M

    Year: 2008
    Date[Year]   Max('Date'[Year])   ISONORAFTER(..., DESC)   Sales Total
    2007         2008                TRUE                     4.56M
    2008         2008                TRUE                     4.11M
    2009         2008                FALSE
    Running Total                                             8.63M

    Year: 2009
    Date[Year]   Max('Date'[Year])   ISONORAFTER(..., DESC)   Sales Total
    2007         2009                TRUE                     4.56M
    2008         2009                TRUE                     4.11M
    2009         2009                TRUE                     3.74M
    Running Total                                             12.41M


    That’s pretty complicated if you ask me, even though it is correct.
    We can create our own Running Total in Sales measure that is easier to understand.
    The code is below:

    Running Total human = 
    VAR MaxDateInFilterContext = MAX('Date'[Year])
    VAR DatesLessThanMaxDate =
        CALCULATE([Sales], 'Date'[Year] <= MaxDateInFilterContext)
    RETURN
        DatesLessThanMaxDate


    And the output looks good.


  • SQL Code for Lifetime Value Model

    The first part of building the code to pull the data for the lifetime value model is to create an SQL script that will get the historical data we need to train our model on.

    The SQL code selects six fields from the Contoso Retail Data Warehouse.
    The full SQL code is at the bottom of the page

    First Section of Code (WITH)

    USE ContosoRetailDW;
    
    WITH FirstTransactionDates AS (
        SELECT customerkey,MIN([DateKey]) AS first_transaction_date
        FROM [dbo].[FactOnlineSales]
        GROUP BY customerkey)
    	,
    	Orders AS (SELECT 
    		  [CustomerKey] as CustomerKey, 
    		  [SalesOrderNumber] as SalesOrderNumber,
    		  ROW_NUMBER() OVER (PARTITION BY [CustomerKey] ORDER BY [SalesOrderNumber] ASC) AS OrderNumber,
    		  SUM([SalesAmount]) as SalesAmount
    		  FROM
    		  [dbo].[FactOnlineSales]
    		  GROUP BY [CustomerKey], [SalesOrderNumber] 
    )
    
    


    After the database is selected in the USE statement, two sub-queries are created in the WITH clause. This makes the code easier to manage than nesting sub-queries within sub-queries in the main code body.

    The first sub-query, FirstTransactionDates, creates a temporary table of customers and their first transaction date from the FactOnlineSales table.

    The second sub-query, Orders, creates a temporary table of customers and their orders, with the OrderNumber field ranking each customer's orders from 1 upwards, 1 being the first order. I will create a separate post explaining the line:
    ROW_NUMBER() OVER (PARTITION BY [CustomerKey] ORDER BY [SalesOrderNumber] ASC) AS OrderNumber,


    Second Section of Code

    SELECT 
    --selection of CustomerKey from DimCustomer
    c.CustomerKey as CustomerKey, 
    
    -- Calculate Age as difference between [BirthDate] selected from DimCustomer and today (using the GETDATE() function), then divide by 365
    DATEDIFF(DAY, [BirthDate], GETDATE())/365 as Age, 
    
    --Gender selected as the [gender] field from DimCustomer and converted to 0 and 1 for m and f respectively
    CASE 
    WHEN [gender] = 'm' THEN 0
    WHEN [gender] = 'f' THEN 1
    END as Gender,
    
    --YearlyIncome selected from DimCustomer table
    c.YearlyIncome as YearlyIncome, 
    
    --FirstPurchaseAmount from sub-query a
    a.FirstPurchaseAmount as FirstPurchaseAmount,
    
    --LifetimeSales from sub-query b
    b.first12months_salesamount as LifetimeSales
    
    -- select from DimCustomer with left join on.... tables to follow...
    FROM dbo.DimCustomer c LEFT OUTER JOIN


    As you can see from the above, there are 6 field selections:
    1. The CustomerKey from the DimCustomer table (alias c)
    2. Age calculated as the difference in days between the customer's birthdate ([BirthDate]) and today's date (GETDATE()). This is then divided by 365 to get the age in years.
    3. The Gender, which uses the [gender] field from the DimCustomer table and swaps ‘m’ for 0 and ‘f’ for 1, so we have numeric data for the model.
    4. The YearlyIncome field from the DimCustomer table.

    Then the last 2 fields FirstPurchaseAmount and first12months_salesamount which are calculated using sub-queries.

    Third Section of Code (Sub-query a)



    First is sub-query a, which the DimCustomer table joins to ON c.CustomerKey = a.CustomerKey.
    The SELECT query selects CustomerKey and SalesAmount from the ORDERS table.
    The ORDERS table is a temporary table created with a sub-query in the WITH clause at the start of the code. It is filtered by OrderNumber = 1 to return the sum of sales for each customer on their first order. This gives us the FirstPurchaseAmount field.

    Fourth Section of Code (Sub-query b)

    LEFT OUTER JOIN
    
    -- Build First 12 Month Sales Table b
    (
    SELECT 
        f.customerkey,
        SUM(CASE WHEN DATEDIFF(month, ft.first_transaction_date, f.DateKey) < 12 THEN f.salesamount 
    	ELSE 0 
    	END) AS first12months_salesamount
    
    FROM [dbo].[FactOnlineSales] f
    LEFT OUTER JOIN FirstTransactionDates ft ON 
    f.customerkey = ft.customerkey
    GROUP BY f.CustomerKey
    ) as b
    ON c.CustomerKey = b.CustomerKey
    
    WHERE c.CustomerType = 'Person'


    The fourth section of code shows the LEFT OUTER JOIN of the DimCustomer table to sub-query b.
    Sub-query b gets us the first 12 month sales amount of customers, which will be our dependent variable in our final Model.

    The SELECT code starts by selecting the CustomerKey from the FactOnlineSales table.

    The SUM(CASE...) line takes the first_transaction_date from the FirstTransactionDates sub-query created in the WITH clause and the DateKey from the FactOnlineSales table and looks at the difference in months between the two. If the DateKey (transaction date) is less than 12 months after the first transaction date, the sales amount is included in the first12months_salesamount sum.

    SUM(CASE WHEN DATEDIFF(month, ft.first_transaction_date, f.DateKey) < 12 THEN f.salesamount
    ELSE 0
    END) AS first12months_salesamount


    This is the field used in the main SELECT query in code section 2.
    b.first12months_salesamount as LifetimeSales

    Full SQL Code

    USE ContosoRetailDW;
    
    WITH FirstTransactionDates AS (
        SELECT customerkey,MIN([DateKey]) AS first_transaction_date
        FROM [dbo].[FactOnlineSales]
        GROUP BY customerkey)
    	,
    	Orders AS (SELECT 
    		  [CustomerKey] as CustomerKey, 
    		  [SalesOrderNumber] as SalesOrderNumber,
    		  ROW_NUMBER() OVER (PARTITION BY [CustomerKey] ORDER BY [SalesOrderNumber] ASC) AS OrderNumber,
    		  SUM([SalesAmount]) as SalesAmount
    		  FROM
    		  [dbo].[FactOnlineSales]
    		  GROUP BY [CustomerKey], [SalesOrderNumber] 
    )
    
    SELECT 
    c.CustomerKey as CustomerKey, 
    DATEDIFF(DAY, [BirthDate], GETDATE())/365 as Age, 
    CASE 
    WHEN [gender] = 'm' THEN 0
    WHEN [gender] = 'f' THEN 1
    END as Gender,
    c.YearlyIncome as YearlyIncome, 
    a.FirstPurchaseAmount as FirstPurchaseAmount,
    b.first12months_salesamount as LifetimeSales
    
    FROM 
    dbo.DimCustomer c
    
    --Build FirstPurchaseAmount table a
    LEFT OUTER JOIN
    (
    SELECT CustomerKey, SalesAmount as FirstPurchaseAmount
    FROM ORDERS
    WHERE OrderNumber = 1
    ) a
    ON c.CustomerKey = a.CustomerKey
    
    LEFT OUTER JOIN
    
    -- Build First 12 Month Sales Table b
    (
    SELECT 
        f.customerkey,
        SUM(CASE WHEN DATEDIFF(month, ft.first_transaction_date, f.DateKey) < 12 THEN f.salesamount 
    	ELSE 0 
    	END) AS first12months_salesamount
    
    FROM [dbo].[FactOnlineSales] f
    LEFT OUTER JOIN FirstTransactionDates ft ON 
    f.customerkey = ft.customerkey
    GROUP BY f.CustomerKey
    ) as b
    ON c.CustomerKey = b.CustomerKey
    
    WHERE c.CustomerType = 'Person'
    

  • How to Build a Human Resources Employee Attrition Model

    Here we make use of the HR Analytics dataset available on Kaggle. The dataset was created to understand the factors behind employee attrition and can be used to train a model for predicting employee churn.

    The Python Code for the 2 models is on Github

    We can start by importing the required libraries and reading the HR dataset CSV file into a pandas dataframe.

    import seaborn as sns
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    
    csvlocation = 'HR_capstone_dataset.csv'
    df = pd.read_csv(csvlocation)


    First, we should check the data imported into the Pandas data frame looks good:


    We can also run df.columns to view the columns of the data frame and df.values to display the values of the data frame as an array.

    The pandas data frame describe() method gives us descriptive statistics on the data frame. Note that include='all' is passed so that all columns are returned; the default is numeric columns only. For the non-numeric columns, statistics that don't apply are displayed as NaN (Not a Number).



    Here we can verify that all columns have the same count of 14999, so there appears to be no missing data. We can also view the standard deviation (std) in relation to the mean, to get an idea of the variance in the data, for example:

    Mean average_monthly_hours = 201 hours
    1 standard deviation average monthly hours = 49.94 hours
    Therefore we can infer that:
    Approx. 68% of employees work 201 ± 49.94 hours = roughly 151 to 251 hours a month.
    Approx. 95% of employees work 201 ± 100 hours (rounded) = 101 to 301 hours a month.
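
    A quick way to check those figures in code (a sketch assuming the hours column is named average_monthly_hours as above; adjust the name if your copy of the Kaggle file spells it differently):

    import pandas as pd
    
    df = pd.read_csv('HR_capstone_dataset.csv')
    
    hours = df['average_monthly_hours']  # assumed column name, adjust to match the file
    mean_h, std_h = hours.mean(), hours.std()
    
    print(f"~68% of employees work between {mean_h - std_h:.0f} and {mean_h + std_h:.0f} hours")
    print(f"~95% of employees work between {mean_h - 2*std_h:.0f} and {mean_h + 2*std_h:.0f} hours")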

    We need to check the datatypes are in the correct format if we want to build a model from them.


    The dependent variable that we want to predict is the ‘left’ field, which is a binary field of 0 or 1. A value of 1 means the employee left the company (attrition).

    There are 9 other columns which make up the independent variables:
    There are 2 floats, 5 integers, and 2 object data types.

    The salary field is categorical and includes 3 levels:
    low, medium, and high.
    Because there is an inherent order, we can use ordinal encoding to convert it to an integer:
    salary_map = {‘low’: 0, ‘medium’: 1, ‘high’: 2}

    The Department field consists of non-ordered categories:
    Accounting, HR, IT, management, marketing, product_mgt, RandomD, sales.

    We use one-hot encoding on this field which creates 8 new binary fields, one for each department name. The code is as follows:

    df = pd.get_dummies(df, columns=[‘Department’])


    This results in a lot more columns in the data frame as below, but now we’re ready to build a model.
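
    For reference, the two encoding steps together look like this (a minimal sketch reusing the same CSV; the Random Forest and logistic regression listings below assume these transformations have already been applied):

    import pandas as pd
    
    df = pd.read_csv('HR_capstone_dataset.csv')
    
    # Ordinal encode the salary levels (low < medium < high)
    salary_map = {'low': 0, 'medium': 1, 'high': 2}
    df['salary'] = df['salary'].map(salary_map)
    
    # One-hot encode the non-ordinal Department field
    df = pd.get_dummies(df, columns=['Department'])
    
    print(df.dtypes)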

    Building the Random Forests model

    The steps for building the model are as follows:
    1. Split the dataset into X and y variables, where X consists of all the independent variable fields.
    The y series contains only the ‘left’ field (the dependent variable).
    2. Split the X and y dataframes into four sets for training and testing. The test sets are a randomized 20% of the data.
    3. Initialize the random forest classifier
    4. Train the model with the training data (X_train and y_train). This represents 80% of the data.
    5. Make predictions from the X_test set. The model tries to predict the actual values that are in the y_test set.
    6. Measure the accuracy by comparing the predictions of y (y_pred) with the actual y_test values.

    # Split the dataset into features (X) and target variable (y)
    X = df.drop('left', axis=1)  # Features
    y = df['left']  # Target variable
    
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Initialize the Random Forest classifier
    #We initialize the Random Forest classifier with 100 trees and fit it to the training data.
    rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
    
    # Train the classifier on the training data
    rf_classifier.fit(X_train, y_train)
    
    # Make predictions on the test data
    y_pred = rf_classifier.predict(X_test)
    
    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy:", accuracy)


    The resulting accuracy is 0.9886666666666667, which represents roughly 99% accuracy.
    If I re-run the code it gives similar results. The model is almost perfect!

    We can compare this random forest model to a logistic regression model; the code is below.
    The resulting accuracy score is only 0.785, or 78% accurate.
    So the random forest model wins!

    Logistic Regression Model Code

    
    import seaborn as sns
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    
    # Load the dataset
    csv_location = 'HR_capstone_dataset.csv'
    df = pd.read_csv(csv_location)
    
    # Convert 'salary' into numerical values
    salary_map = {'low': 0, 'medium': 1, 'high': 2}
    df['salary'] = df['salary'].map(salary_map)
    
    # One-hot encode 'Department'
    df = pd.get_dummies(df, columns=['Department'])
    
    # Split the dataset into features (X) and target variable (y)
    X = df.drop('left', axis=1)  # Features
    y = df['left']  # Target variable
    
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Initialize the Logistic Regression classifier
    log_reg_classifier = LogisticRegression(max_iter=1000, random_state=42)
    
    # Train the classifier on the training data
    log_reg_classifier.fit(X_train, y_train)
    
    # Make predictions on the test data
    y_pred = log_reg_classifier.predict(X_test)
    
    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy:", accuracy)
    
  • How to Build a Survey Results Report with Power BI

    Power BI can be useful for displaying and storing survey results for your organization. In this example we build some data for the results of a fictitious Airline survey, using questions taken from a real online airport survey.


    The data was built in Excel with a sheet for questions and a sheet for answers. You can download the Excel file here.

    It’s easy to publish the report as a PDF, so you can share it with users who don’t have access to Power BI. Here is the airport survey report as PDF

    The questions are a simple list with a Question ID and section.


    The answers on the second sheet are formatted so that each row represents all the answers for each respondent.



    Steps for Building the Airline Survey Report.

    Step 1
    The Data is imported to our Power BI Desktop survey using Power Query.

    The only change to the question table is a simple column rename.


    The answer table is transformed so that each line of the table contains the respondent ID, the question, and the Value (the answer). This prepares the data so it can be joined to the question table and easily visualized.
    The applied steps are as follows:

    As you can see the first step after the import is simply to promote the headers, so the column names are the actual column names.
    The second step is just to change the column types to the correct types.
    The key third step is to unpivot the data, so the table is transformed as below, with a separate row for each respondent's answer. This makes the report easy to build.
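
    As an aside, the shape of this unpivot is the same as a pandas melt; a minimal sketch for illustration only (the column names here are assumed, and this is not the Power Query step itself):

    import pandas as pd
    
    # Wide format: one row per respondent, one column per question (assumed names)
    answers_wide = pd.DataFrame({
        'Respondent': [1, 2],
        'Q1': ['Agree', 'Disagree'],
        'Q2': ['Strongly Agree', 'Neutral'],
    })
    
    # Long format: one row per respondent per question, matching the unpivoted answer table
    answers_long = answers_wide.melt(id_vars='Respondent', var_name='Question', value_name='Value')
    print(answers_long)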


    The report only requires one measure, which is just a unique count of the respondent ID

    Answer Count = DISTINCTCOUNT(Answers[Respondent])


    Basic Graphs can be added.

    And Likert Scale style graphs using a 100% Stacked bar graph visual.

    It's a simple but useful report.



  • An Example of Poor Correlation using Python

    In order to find the best data to build our model, we need to run correlations on the data. In the initial predictive model we built, we guessed the fields, namely age, gender, income, and first purchase amount, but the model gave a very poor MSE accuracy result. So we need to go back to the data and look for fields that correlate well with sales amount (as well as checking the existing fields we have used).


    We can check age and first purchase amount against first 12 month sales amount just to confirm this.

    Age vs. Sales (first 12 month sales) code:

    
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from scipy import stats
    
    csvlocation = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
    df = pd.read_csv(csvlocation)
    df.columns  # display the columns available
    
    # Create a scatter plot with regression line
    sns.regplot(data=df, x='Age', y='LifetimeSales')
    plt.title("First 12 Month Sales by Customer Age")  # graph title
    plt.xlabel('Age')  # x axis label
    plt.ylabel('First 12 Month Sales')  # y axis label
    
    # Fit a linear regression line
    slope, intercept, r_value, p_value, std_err = stats.linregress(df['Age'], df['LifetimeSales'])
    
    # Add equation of the regression line to the plot
    plt.text(df['Age'].min(), df['LifetimeSales'].max(), f'y = {slope:.2f}x + {intercept:.2f}', ha='left')
    
    # Calculate correlation coefficient
    correlation_coefficient = df['Age'].corr(df['LifetimeSales'])
    # Add correlation coefficient to plot
    plt.text(df['Age'].max(), df['LifetimeSales'].min(), f'Correlation coefficient: {correlation_coefficient:.2f}', ha='right')
    
    
    plt.show()
    


    The output of graphing customer age against first 12 month sales confirms the poor correlation. Here the correlation coefficient is only 0.08.



    Next we can look at First Purchase amount and First 12 Month Sales Amount.
    Here is the code below. It includes removing the major outlier with CustomerID = 1.

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from scipy import stats
    
    csvlocation = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
    df = pd.read_csv(csvlocation)
    df = df[df['CustomerID'] != 1]
    df.columns  # display the columns available
    
    # Create a scatter plot with regression line
    sns.regplot(data=df, x='FirstPurchaseAmount', y='LifetimeSales')
    plt.title("First 12 Month Sales by FirstPurchaseAmount")  # graph title
    plt.xlabel('FirstPurchaseAmount')  # x axis label
    plt.ylabel('First 12 Month Sales')  # y axis label
    
    # Fit a linear regression line
    slope, intercept, r_value, p_value, std_err = stats.linregress(df['FirstPurchaseAmount'], df['LifetimeSales'])
    
    # Add equation of the regression line to the plot
    plt.text(df['FirstPurchaseAmount'].min(), df['LifetimeSales'].max(), f'y = {slope:.2f}x + {intercept:.2f}', ha='left')
    
    # Calculate correlation coefficient
    correlation_coefficient = df['FirstPurchaseAmount'].corr(df['LifetimeSales'])
    # Add correlation coefficient to plot
    plt.text(df['FirstPurchaseAmount'].max(), df['LifetimeSales'].min(), f'Correlation coefficient: {correlation_coefficient:.2f}', ha='right')
    
    
    plt.show()
    
    


    And here is the graph. The correlation coefficient is only 0.07.

  • Some basic Python Graphing Data Examples with the Matplotlib library

    In the previous article ‘Creating a Customer Lifetime Value Model’ we imported and transformed a table of customer data from an MS SQL Database, which included the ‘first 12 months sales amount’. At the end of the exercise, we exported the data frame to a CSV file, which we can check out using the Matplotlib library in Python.

    Histogram:
    We can start by having a look at the distribution of customers’ ages with a histogram.
    Here is the code:

    import pandas as pd
    import matplotlib.pyplot as plt
    
    csvlocation = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
    df = pd.read_csv(csvlocation)
    df.columns # display the columns available
    
    # Create a histogram
    df['Age'].plot(kind='hist', bins=20) # Adjust the number of bins as needed
    plt.title("Distribution of Customer Age")
    plt.xlabel('Age')
    plt.ylabel('Frequency')
      
    plt.show()

    The graph of the output is below. We immediately notice that all customers are above 45. This isn't an issue; it's just a reflection of the database being old and the age calculation being based on the current date and the customers' dates of birth. We can say there are more customers between the ages of 50 and 65. We would expect there to be fewer customers as age increases after that, so I think that is fairly reflective of the general population.





    Scatter Plot
    Next, we can have a look at the first 12-month sales amount against age.
    Here is the Python code:

    
    import pandas as pd
    import matplotlib.pyplot as plt
    
    csvlocation = csvpath = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
    df = pd.read_csv(csvpath)
    df.columns # display the columns available
    
    #kind options: 'scatter', 'hist', 'bar',  
    df.plot(kind = 'scatter', x = 'Age', y = 'LifetimeSales')
    plt.title("First 12 Month Sales by Customer Age") # graph title
    plt.xlabel('Age') #x axis label
    plt.ylabel('First 12 Month Sales') #y axis label
      
    plt.show()


    Here is the graph:


    We can immediately see the range of first 12-month sales amounts (10K to 70K). That's a very high amount and makes me want to go back and check the data. When I go and look at the data, it does look correct, so we can continue.

    All customers are over 40; this reflects the database being over 15 years old with no new customers added, but it also means we have some really old customers with an estimated age of 100.
    Customers' ages look to be fairly evenly distributed between 40 and 80, with the 80-plus group having lower sales.

    Pie Graph

    The following code gives us a pie chart showing us the split of gender in the dataset.

    import pandas as pd
    import matplotlib.pyplot as plt
    
    csvlocation = csvpath = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
    df = pd.read_csv(csvpath)
    df.columns # display the columns available
    #replace 0 and 1 in the 'Gender' column with female and male
    df['Gender'] = df['Gender'].replace({0: 'female', 1: 'male'})
    
    #return labels and counts of the unique gender values
    age_counts = df['Gender'].value_counts()
    #print(age_counts)
    
    # Define colors for each category
    colors = {'female': 'pink', 'male': 'blue'}
    
    # Plotting
    plt.figure(figsize=(4, 4))  # Adjust figure size if needed
    plt.pie(age_counts, labels=age_counts.index, autopct='%1.1f%%', 
            colors=[colors[g] for g in df['Gender'].value_counts().index])
    plt.title('Distribution of Gender')
    plt.show()


    The output looks like this. The 0s and 1s in the data were replaced with ‘female’ and ‘male’ in the code.
    The data looks pretty normal, as we would expect, and there are no nulls, which is good.


    Another Scatter Plot
    Next, we look at the ‘FirstPurchaseAmount’ field.
    We create a scatter graph with the code below.

    # -*- coding: utf-8 -*-
    """
    Created on Wed May  8 12:17:26 2024
    
    @author: Admin
    """
    
    import pandas as pd
    import matplotlib.pyplot as plt
    
    csvlocation = csvpath = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
    df = pd.read_csv(csvpath)
    df.columns # display the columns available
    
    #kind options: 'scatter', 'hist', 'bar',  
    df.plot(kind = 'scatter', x = 'CustomerID', y = 'FirstPurchaseAmount')
    plt.title("Customers by First Purchase Amount") # graph title
    plt.xlabel('CustomerID') #x axis label
    plt.ylabel('First Purchase Amount') #y axis label
      
    plt.show()


    The first thing we notice when we run the graph is that there is one major outlier in the data.

    Removing this customer should help improve the model, and we should be able to see more outliers in the graph once it is removed.

  • Examples of using the Python Seaborn Graphs Library

    This is pretty powerful. You can spend your time creating graphs one at a time or you can create a pairs plot using the Seaborn library. The pairs plot graphs every combination of variables, to create multiple graphs. The scatter graphs are below and the simple code is further down the page. There is a big outlier for the first purchase amount that stands out. We need to fish that one out first and re-run.



    import seaborn as sns
    import matplotlib.pyplot as plt
    #from sklearn.datasets import fetch_california_housing
    import pandas as pd
    
    csvlocation = csvpath = 'outliercheck.csv'
    df = pd.read_csv(csvlocation)
    #df.columns # display the columns available
    
    pairplot = sns.pairplot(df)
    
    # Display the plot
    plt.show()

    If we sort the data frame by the ‘FirstPurchaseAmount’ field, we can have a look at the output.

    # Sort the DataFrame by the 'firstpurchasedate' column
    df_sorted = df.sort_values(by='FirstPurchaseAmount', ascending=False)
    
    # Display the first few rows of the sorted DataFrame
    print(df_sorted.head())

    The row with CustomerID = 1 is the big outlier. It was possibly a test order, which raises another question: is there test data that needs to be removed? Anyway, let's remove it.


    We can remove just this one customer using the following code. We can also remove the Age and Gender fields as they look normal from a previous look and will reduce the number of graphs in the pair plot.

    #remove row where CustomerID = 1
    df_filtered = df[df['CustomerID'] != 1]
    # Remove the 'age', 'gender' and CustomerID columns
    df_filtered = df_filtered.drop(['Age', 'Gender', 'CustomerID'], axis=1)


    Below is the output of re-running the graphs after the major outlier is removed; I've also removed the Age, CustomerID, and Gender columns from the output as they aren't of interest.

    Now, we have a clearer view of the data. As expected, there are more outliers to remove from the data. We can use the standard deviation method to remove extreme values, for example dropping points more than three standard deviations from the mean (roughly the top and bottom 0.15% of normally distributed data), or about two standard deviations to trim the top and bottom 2.5%.
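
    A minimal sketch of that kind of filter, assuming the same outliercheck.csv and the FirstPurchaseAmount column used above:

    import pandas as pd
    
    df = pd.read_csv('outliercheck.csv')
    
    # Keep rows within three standard deviations of the mean for FirstPurchaseAmount
    col = df['FirstPurchaseAmount']
    df_no_outliers = df[(col - col.mean()).abs() <= 3 * col.std()]
    
    print(f"Removed {len(df) - len(df_no_outliers)} outlier rows")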

  • Creating a Customer Lifetime Value (LTV) Prediction Model


    In this project I'm trying to predict customer lifetime sales (well, actually first 12-month sales), initially using sklearn's linear regression model. The code can be downloaded from Github here. The results haven't been great so far, and I wonder how the data in the Contoso Retail data warehouse was generated, as I have an r-squared value of only around 10%. Still, it's been a good exercise, and adding additional fields has improved the model gradually.

    A lifetime value model is particularly useful in business for predicting the value of newly acquired customers.

    The benefits of the model are as follows:

    1. It can guide the forecast of future sales, based on the predicted value of new customers combined with the forecast of new customers being acquired.
    2. It can guide the acquisition team in both targeting higher-value customers and understanding how much they can spend on acquiring new customers.

    3. It helps the business understand what the maximum CPA (Cost Per Acquisition) should be to keep the business profitable, while still growing the business.

    Lifetime Timeframe
    One of the immediate challenges you will face with creating an LTV model is that your older customers are naturally more likely to have a higher lifetime value than more recently acquired customers. To move forward, we can look to create lifetime models based on the first x amount of time from the customer acquisition date. For example, predicted 12-month LTV. An analysis of lifetime data can of course give you a much better understanding of how long a customer’s lifetime is likely to last and what are the key timeframes you can use in creating the prediction.

    Building a Model with Linear Regression

    In this example, we use a Linear Regression Model as the data we want to predict (lifetime sales) is continuous data. In this example, we’ll guess which fields we think will predict the highest lifetime sales and we’ll soon find out that some of them have very low correlations, forcing us to go back to the drawing board as we’re trying to skip the first 2 stages of analytics: Exploration and Diagnosis (correlation).

    Here are the main steps for building the model:
    1. Create an SQL Query for importing the data into Python.
    2. Create the Python connection and import the data using the SQL Query.
    3. Transform and clean the data.
    4. Build the model
    5. Evaluate the model

    6. Improve the model.

    1. Create SQL Query to import the historical customer data for training the model. This includes the independent variables and the dependent variable (first 12-month lifetime sales), which we want to predict for future new customers. 12 months is used as we don’t have a lot of historical data.

    I’m choosing to start using the following demographic fields:
    Age, Gender, Yearly Income, and First Purchase Amount.

    The first part of building the code is to create an SQL script that will get us the historical data we need to train our model.

    The SQL code can be found in this post

    Now we have the SQL, we can start building the Python script which will pull in the data and build the model.

    2. Python – Importing the Data (Extract)


    4. Building the model

    Now the exciting bit: we get to build the model using the sklearn machine learning module.
    The process is as follows:

    1. Separate independent variables (Age, Gender, first purchase amount, Income) and the dependent variable LifetimeSales into their own dataframes.
    2. The data is then split further into a training and a test segment. In this case, 20% of the data is held out as the test set.
    3. The next step is to create the linear regression model
    4. Then to evaluate the model
    5. Save the model to a .pkl file for future use.

    Code for building the model

    # -*- coding: utf-8 -*-
    """
    Created on Tue May 28 09:29:07 2024
    
    """
    
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, r2_score
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    # Load the data
    data = pd.read_csv("customerltvdata.csv")
    data = data[~data.CustomerKey.isin({1, 0})]  #use tilde with isin for is not in
    
    # Preprocess the data
    data = data.dropna()
    data = pd.get_dummies(data, drop_first=True)
    
    # Define features and target variable
    X = data[['Age', 'Gender', 'NumberCarsOwned', 'HouseOwnerFlag', 'YearlyIncome', 'MaritalStatus', 'FirstPurchaseAmount']]  # Replace with actual features
    y = data['LifetimeSales']  # Replace with the target variable
    
    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Train the model
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    # Make predictions
    y_train_predictions = model.predict(X_train)
    y_test_predictions = model.predict(X_test)
    
    # Evaluate the model
    mse_train = mean_squared_error(y_train, y_train_predictions)
    mse_test = mean_squared_error(y_test, y_test_predictions)
    
    r2 = r2_score(y_test, y_test_predictions)
    
    print("Train MSE:", mse_train)
    print("Test MSE:", mse_test)
    
    print(f'R-squared: {r2}')
    
    # Step 5: Interpret the model coefficients
    coefficients = pd.DataFrame({'Variable': X.columns, 'Coefficient': model.coef_})
    print(coefficients)
    
    # Analyze the results
    plt.scatter(y_test, y_test_predictions)
    plt.xlabel('Actual Values')
    plt.ylabel('Predicted Values')
    plt.title('Actual vs Predicted Values')
    plt.show()
    
    sns.residplot(x=y_test, y=y_test_predictions, lowess=True)
    plt.xlabel('Actual Values')
    plt.ylabel('Residuals')
    plt.title('Residuals vs Actual Values')
    plt.show()
    

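    Step 5 (saving the model) isn't shown in the listing above; a minimal sketch using joblib, assuming the fitted model object from the code above:

    import joblib
    
    # Persist the trained model so it can be reloaded later to score new customers
    joblib.dump(model, 'ltv_linear_regression.pkl')
    
    # Later: reload the model and reuse it
    loaded_model = joblib.load('ltv_linear_regression.pkl')
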

    5. Model evaluation output (At time of writing).

    Train MSE: 98009842.04325004
    Test MSE: 98328891.39303765

    R-squared: 0.09431790148611996


    Variable Coefficient
    0 Age 17.114500
    1 Gender 224.837535
    2 NumberCarsOwned 446.586582
    3 HouseOwnerFlag 283.217963
    4 YearlyIncome 0.080253
    5 MaritalStatus 1577.068714
    6 FirstPurchaseAmount 1.023711

    We can ask ChatGPT and Gemini to interpret the results. Here is a response from Gemini:

    Model Performance

    • Train MSE (Mean Squared Error): 98009842.04325
    • Test MSE: 98328891.393038
    • R-squared: 0.0943

    Both the Train MSE and Test MSE are relatively high. This indicates that the model has a high average error in predicting lifetime sales, both on the data it was trained on and unseen test data. Ideally, you want the MSE to be as low as possible.

    The R-squared value is also quite low (around 0.1). This means that the model only explains a small portion of the variance in lifetime sales. In other words, there are other factors that influence lifetime sales that are not captured by this model.

    I keep adding additional fields, which is slightly improving the model, but there is a long way to go. As the Contoso data is not real, it’s possible this isn’t the best data to use.