Category: python

  • Python Notes: Pandas from getting started Tutorials

    Notes from Getting Started Tutorial:
    https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html

    1. Create a Pandas Series and DataFrame
    2. Create Graphs with Matplotlib
    3. Adding new Columns from Existing Columns (inc. split)
    4. Summary Statistics including groupby
    5. Reshape layout of tables incl. Sorting and Pivot tables
    6. Sub-sets and filtering
    7. Combining data from multiple tables (concatenate (union) and merge (join))
    8. Working with timeseries (pd.to_datetime(), dt, DatetimeIndex and Resample)
    9. Textual data (.str and .replace and split)


    1. Creating a Dataframe and Series

    import pandas as pd
    
    #Create a series
    books = pd.Series(["Papillion", "Great Expectations", "Magic"], name="Books")
    
    #Create DataFrame with Key value pairs. In this case the values are lists of values.
    
    df = pd.DataFrame(
        {
            "Name": [
                "Humpfries, Mr. Bill",
                "Derby, Mrs. Sheila",
                "Winfrey, Miss. Elizabeth",
            ],
            "Age": [44, 66, 42],
            "Gender": ["male", "female", "female"],
            "City": ["London", "Manchester", "Leeds"],
            "Occupation": ["Doctor", "Architect", "Analyst"]
        }
    )
    
    #Select a column from the DataFrame
    df["Age"] #Output Age Column
    df["Age"].max() #outputs max age

    2. Create Plots with Matplotlib

    import pandas as pd
    import matplotlib.pyplot as plt
    
    #Create Pandas dataframe from csv file
    air_quality = pd.read_csv("air_quality_no2.csv", index_col=0, parse_dates=True)
    air_quality["station_paris"].plot() #craate plot of subset from DataFrame
    plt.show() 
    
    #Scatter plot with Matplotlib
    air_quality.plot.scatter(x="station_london", y="station_paris", alpha=0.5)
    plt.show() 
    
    #Create box plot from DataFrame
    air_quality.plot.box()
    plt.show() #Display boxplot
    
    #subplots - one area plot per column
    axs = air_quality.plot.area(figsize=(10, 4), subplots=True)
    plt.show()
    


    3. Add new columns from existing columns

    import pandas as pd
    #index_col=0, sets first column to index, parse_dates converts dates to date type
    air_quality = pd.read_csv("air_quality_no2.csv", index_col=0, parse_dates=True)
    
    #Add new column based on existing column - this adds column to DataFrame
    air_quality["london_mg_per_cubic"] = air_quality["station_london"] * 1.882
    #Another example, this time a ratio
    air_quality["ratio_paris_antwerp"] = (air_quality["station_paris"] / air_quality["station_antwerp"])
    
    #renaming columns
    air_quality_renamed = air_quality.rename(columns={"station_antwerp": "BETR801","station_paris": "FR04014", "station_london": "London Westminster"})
    

    4. Summary Statistics


    import pandas as pd
    #read_csv function - also read_excel, read_sql, read_json, read_parquet and more
    
    titanic = pd.read_csv("titanic.csv") #read_csv function reads from csv
    titanic["Age"].mean() #average of Age column
    titanic[["Age", "Fare"]].median() #median of Age and Fare Coluns
    titanic[["Age", "Fare"]].describe() #summary stats of 2 columns
    titanic.agg({"Age": ["min", "max", "median", "skew"],"Fare": ["min", "max", "median", "mean"]})
     #multiple stats on Age and Fare columns
     
    #groupby to create summary tables
    titanic[["Sex", "Age"]].groupby("Sex").mean() #Average age by sex
    
    titanic.groupby("Sex").mean(numeric_only=True) #average of all numeric columns by Sex
    titanic.groupby(["Sex", "Pclass"])["Fare"].mean() #average fare by Sex and Pclass
    
    titanic["Pclass"].value_counts() #count passengers in each Pclass
    titanic.groupby("Pclass")["Pclass"].count() #longer count method


    5. Re-shape layout of tables

    import pandas as pd
    titanic = pd.read_csv("titanic.csv") #read_csv function reads from csv
    air_quality = pd.read_csv("air_quality_long.csv", index_col="date.utc", parse_dates=True)
    
    #Sorting Tables
    sort1 = titanic.sort_values(by="Age").head() #sort by Age
    sort2 = titanic.sort_values(by=['Pclass', 'Age'], ascending=False) #sort by Pclass then Age, descending
    
    # filter for no2 data only
    no2 = air_quality[air_quality["parameter"] == "no2"]
    
    # first 2 measurements per location (sort by index, group by location)
    no2_subset = no2.sort_index().groupby(["location"]).head(2)
    no2_subset
    #pivot table wide format
    air_quality.pivot_table(values="value", index="location", columns="parameter", aggfunc="mean")
    air_quality.pivot_table(values="value",index="location",columns="parameter",aggfunc="mean",margins=True)
    
    #long to wide format
    no2_pivoted = no2.pivot(columns="location", values="value").reset_index()
    no2_pivoted.head(5)
    #pandas melt() method converts wide format back to long
    no_2 = no2_pivoted.melt(id_vars="date.utc")
    no_2.head(5)


    6. Sub-sets and Filtering

    import pandas as pd
    #read csv file to pandas DataFrame
    titanic = pd.read_csv("titanic.csv")
    air_quality = pd.read_csv("air_quality_no2.csv", index_col=0, parse_dates=True)

    #save as excel, index=False removes row index numbers from export
    titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False)


    #attributes and methods of DataFrame
    titanic.shape #row and column count
    titanic.info() #summary incl. dtypes and non-null counts
    titanic.columns #column names
    titanic.dtypes #data types
    titanic.head(5) #head of df
    titanic.tail(5) #tail of df
    titanic.describe() #summary statistics of numeric columns

    #Create Sub-set of DataFrame as series
    ages = titanic["Age"] #create sub-set from dataframe: note single bracket creates a pandas Series
    type(titanic["Age"]) #confirm type of Age object

    #Create Sub-set of DataFrame as DataFrame
    age_sex = titanic[["Age", "Sex"]] #inner brackets are a list, outer is DataFrame

    #Filtering rows: get passenger list with Age > 35
    above_35 = titanic[titanic["Age"] > 35] #return rows with age over 35
    #Filter passenger list with Pclass = 2 or 3 (2 ways of doing this)
    titanic[titanic["Pclass"].isin([2, 3])] #isin checks if Pclass is 2 or 3 and returns a Boolean mask
    titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)] #same result with OR conditions

    #Filter passenger list with Age not NA
    age_no_na = titanic[titanic["Age"].notna()]
    #Filter passenger list and return list of Names
    adult_names = titanic.loc[titanic["Age"] > 35, "Name"]

    #Filter specified rows and columns
    titanic.iloc[9:25, 2:5]

    #change the first 3 rows of column 3 to "anonymous"
    titanic.iloc[0:3, 3] = "anonymous"



    7. Combining Data from tables

    import pandas as pd
    import os

    #check current working directory
    cwd = os.getcwd()
    cwd

    #Import data into DataFrames
    titanic = pd.read_csv("titanic.csv")
    air_quality_no2 = pd.read_csv("air_quality_no2_long.csv", parse_dates=True)
    air_quality_pm25 = pd.read_csv("air_quality_long.csv", parse_dates=True)
    stations_coord = pd.read_csv("air_quality_stations.csv")
    air_quality_parameters = pd.read_csv("air_quality_parameters.csv")

    #Create Sub-sets
    air_quality_no2 = air_quality_no2[["date.utc", "location", "parameter", "value"]]

    air_quality_pm25 = air_quality_pm25[["date.utc", "location", "parameter", "value"]]

    #Concatenate Sub-sets - creates a union between the 2 tables.
    #the keys argument adds another hierarchical index to show which table the data comes from
    air_quality = pd.concat([air_quality_pm25, air_quality_no2], axis=0, keys=["PM25", "NO2"])
    air_quality_no2.head(5)
    air_quality_pm25.head(5)
    air_quality_no2.shape #2068 rows, 4 columns
    air_quality_pm25.shape #5274 rows, 4 columns
    air_quality.shape #7340 rows, 4 columns

    air_quality = air_quality.sort_values("date.utc")
    air_quality.head(25)
    air_quality.to_csv("checkcombine.csv")

    #note you can reset the index to a column
    air_quality.reset_index(level=0)

    stations_coord = pd.read_csv("air_quality_stations.csv")

    stations_coord.head()

    #merge the columns from the stations_coord dataframe to the air_quality df
    #use location as the key to join the two dataframes
    #how="left" is left join from air_quality on the left
    air_quality = pd.merge(air_quality, stations_coord, how="left", on="location")
    air_quality.head(10)
    air_quality.to_csv("checkqualitymerge.csv")

    air_quality_parameters.head(5)
    #merge left join using parameter column on left and id on right
    air_quality = pd.merge(air_quality, air_quality_parameters, how='left', left_on='parameter', right_on='id')

    8. Timeseries

    import pandas as pd

    import matplotlib.pyplot as plt

    air_quality = pd.read_csv("air_quality_no2_long.csv")

    air_quality = air_quality.rename(columns={"date.utc": "datetime"})
    air_quality.head(5)
    air_quality.columns

    #unique cities in air_quality df
    air_quality.city.unique()

    #convert the datetime column of the air_quality dataframe to a datetime object instead of text
    air_quality["datetime"] = pd.to_datetime(air_quality["datetime"])

    air_quality["datetime"]

    #can also do this to read the dates into datetime format
    #pd.read_csv("../data/air_quality_no2_long.csv", parse_dates=["datetime"])

    #check start and end datetime
    air_quality["datetime"].min(), air_quality["datetime"].max()

    #datetime difference using pandas timestamp
    air_quality["datetime"].max() - air_quality["datetime"].min()

    #Add a month column using the dt accessor
    air_quality["month"] = air_quality["datetime"].dt.month

    #Average no2 concentration by day of week by location
    air_quality.groupby([air_quality["datetime"].dt.weekday, "location"])["value"].mean()

    #Plot bar graph using hour on the x axis and average no2 on the y
    fig, axs = plt.subplots(figsize=(12, 4))

    air_quality.groupby(air_quality["datetime"].dt.hour)["value"].mean().plot(
        kind='bar', rot=0, ax=axs)
    plt.xlabel("Hour of the day"); # custom x label using Matplotlib
    plt.ylabel("$NO_2 (µg/m^3)$");

    #create pivot table
    no_2 = air_quality.pivot(index="datetime", columns="location", values="value")
    no_2.head(5)

    #use index.year and index.weekday of the datetime object
    no_2.index.year, no_2.index.weekday
    #plot using a time range
    no_2["2019-05-20":"2019-05-21"].plot();

    #resample method of a datetime index works like groupby. It takes an
    #aggregation function and the "M" is time-based grouping for Month (month end)
    monthly_max = no_2.resample("M").max()
    monthly_max

    monthly_max.index.freq #tells you the frequency attribute - e.g. MonthEnd

    #Create plot of average no2 levels for each station
    no_2.resample("D").mean().plot(style="-o", figsize=(10, 5));

    9. Textual Data

    import pandas as pd

    titanic = pd.read_csv("titanic.csv")

    titanic["Name"].str.lower()

    #Split the Name field by comma
    titanic["Name"].str.split(",")
    #create Surname series using the first element of the split Name field
    titanic["Surname"] = titanic["Name"].str.split(",").str.get(0)
    titanic["Surname"]

    #Return Name series with True/False if it contains Countess
    titanic["Name"].str.contains("Countess")

    #return rows where Name contains Countess
    titanic[titanic["Name"].str.contains("Countess")]

    #string length
    titanic["Name"].str.len()
    #find the row index of the longest name
    titanic["Name"].str.len().idxmax()
    #Return longest name
    titanic.loc[titanic["Name"].str.len().idxmax(), "Name"]

    titanic["Sex_short"] = titanic["Sex"].replace({"male": "M", "female": "F"})

    titanic["Sex_short"]

  • Python Notes: Python Lists Examples


    Create a list
    Simple list
    mylist = ["apple", "orange", "banana", "mango", "pineapple"]
    mylist[2] # returns 'banana'
    mylist[-1] # returns 'pineapple'
    mylist[2:4] # returns ['banana', 'mango']
    if "apple" in mylist:
      print("Yes, 'apple' is in the fruits list")
    mylist.insert(2, "watermelon") #insert at position specified
    mylist.append('grapes') #adds to end of list
    mylist = ["apple", "banana", "cherry"]
    tropical = ["mango", "pineapple", "papaya"]
    mylist.extend(tropical) # adds tropical list to mylist
    mylist.remove("banana") #removes first occurrence
    mylist.pop(1) #removes element at specified index
    del mylist[0] #also removes an element
    del mylist # deletes the list
    mylist.clear() # empties the list

    for i in mylist: #loop over list
      print(i)

    i = 0
    while i < len(mylist): #while loop
      print(mylist[i])
      i = i + 1

    newlist = []
    for x in mylist: #adds list items containing "a" to new list
      if "a" in x:
        newlist.append(x)

    mylist.sort() #sort alphabetically or numerically depending on data type
    mylist.sort(reverse=True) #sort in reverse order
    mylist.sort(key=str.lower) #case-insensitive sort
    mylist.reverse() #reverse the current order of the list
    newlist = mylist.copy() #copy list
    newlist = list(mylist) #also makes a copy
    list3 = list1 + list2 #concatenate lists





    List of lists: 5 rows, 2 columns

    b1 = [[2308, 6], [2408, 6.2], [2508, 5.8],[2608, 5.6], [2708, 5.9]] #create the list
    print(b1) #print the list
    print(len(b1)) #print length of list
    print(type(b1)) #print data type of list
    print(b1[:2]) #print the first 2 elements in the list (indexes 0 and 1)
    print(b1[3:]) #print elements from index 3 (the 4th element) onwards

    Reference Index in array

    b1[3] #returns fourth element
    b1[0:2] # returns first and second element of list
    b1[-1] # returns last element index

    Add and Delete, Update
    Delete
    del(b1[0]) # delete first element
    or
    b1.pop(1) #remove second element


    Insert
    b1.insert(1,[24082, 111]) #insert element at position 1
    Update
    b1[5] = [2808, 6.7] #update the element at index 5 (the last element after the insert above)

    Append
    Additions = [[29008, 6.6], [3008, 6], [3108, 6.1]]
    b1.extend(Additions) #Adds the Additions list to b1
    b1.append([2808, 6.6]) # append a single element to the end

    Clear the list
    b1.clear() #empties the list

    Check if element in array

    if [2308, 6] in b1: print("yes")
    Loop
    for x in b1: print(x)

    def country_select(countrylist):
        for count, country in enumerate(countrylist, start=1): #number each country
            print(f"{count}: {country}")
        return countrylist



    Numpy

    baseball = [180, 215, 210, 210, 188, 176, 209, 200]
    import numpy as np
    np_baseball = np.array(baseball)
    print(type(np_baseball))
    mean = np_baseball.mean()
    print("mean is: " + str(mean))
    med = np.median(np_baseball)
    print("median is: " + str(med))
    std = np.std(np_baseball)
    print("standard deviation is: " + str(std))

  • Simple Linear Regression Example with Python

    Linear Regression using Salary and Years of Experience Data

    Data Source: Salary_dataset.csv Kaggle
    The salary data set includes 2 columns: Years Experience, which will be our independent variable (X), and Salary (Y).

    Linear regression is a fundamental statistical method used to model the relationship between a dependent variable and one or more independent variables. The primary goal of linear regression is to predict the value of the dependent variable based on the values of the independent variables (Chat GPT)

    For this example: First, we want to see if there is a correlation between the 2 variables by building a regression line and calculating r squared. Then we want to assess the significance of the relationship using the p-value to test the null hypothesis that there is no relationship between X and Y (X does not predict Y).

    Simple Linear Regression
    Formula
    Formula: Y = b + aX + e

    1. Dependent Variable (Y): The outcome or the variable we aim to predict or explain.
    2. Independent Variable(s) (X): The variable(s) used to predict or explain changes in the dependent variable.
    3. a: is the slope, the change in Y for a one-unit change in X.
    4. b: is the intercept of the Y axis. This represents the value of Y when X is zero.
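
    As a quick worked example with made-up numbers (not taken from the dataset): if the intercept b is 30,000 and the slope a is 9,000, then someone with 5 years of experience has a predicted salary of 30,000 + 9,000 x 5 = 75,000, plus the error term e.

    #Hypothetical worked example of Y = b + aX (made-up numbers)
    b = 30000 #intercept: predicted salary with zero years of experience
    a = 9000 #slope: salary increase per additional year of experience
    years = 5
    predicted_salary = b + a * years
    print(predicted_salary) #75000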

      The following Python code (which I used ChatGPT to optimize) calculates the regression line and p-value and evaluates the null hypothesis.

      The steps are as follows:
      1. Use Pandas to import the CSV file to a data frame and convert each series to an array.
      2. Fit a linear regression model using Sklearn that returns the slope and intercept of the regression line.
      3. The statsmodels library is then used to calculate the R-squared value and p-value of the slope.
      4. The null hypothesis is then evaluated based on the p-value.
      5. Scipy.stats and Matplotlib.pyplot are then used to fit and plot the regression line on a graph.
    
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LinearRegression
    import statsmodels.api as sm
    from scipy import stats
    
    def load_data(csv_location):
        """Load CSV data into a pandas DataFrame."""
        df = pd.read_csv(csv_location)
        return df
    
    def prepare_data(df):
        """Prepare independent and dependent variables for regression."""
        X = df[['YearsExperience']].values  # Extract as 2D array
        y = df['Salary'].values  # Extract as 1D array
        return X, y
    
    def fit_sklearn_model(X, y):
        """Fit a linear regression model using sklearn."""
        model = LinearRegression()
        model.fit(X, y)
        return model
    
    def fit_statsmodels_ols(X, y):
        """Fit a linear regression model using statsmodels OLS."""
        X_with_const = sm.add_constant(X)  # Add an intercept to the model
        model = sm.OLS(y, X_with_const).fit()
        return model
    
    def plot_regression_line(df, intercept, slope):
        """Plot the regression line along with data points."""
        plt.scatter(df['YearsExperience'], df['Salary'], color='blue', label='Data points')
        plt.plot(df['YearsExperience'], intercept + slope * df['YearsExperience'], color='red', label='Regression line')
        plt.title("Salary by Years of Experience")
        plt.xlabel('Years of Experience')
        plt.ylabel('Salary')
        plt.legend()
        plt.show()
    
    def main():
        csv_location = "salary_dataset.csv"
        df = load_data(csv_location)
    
        # Display basic statistics
        #print(df.describe())
    
        X, y = prepare_data(df)
    
        # Fit the model using sklearn
        sklearn_model = fit_sklearn_model(X, y)
        intercept, slope = sklearn_model.intercept_, sklearn_model.coef_[0]
        
        print("Calculation of Regression Line:\n")
        print(f"Intercept is: {intercept}")
        print(f"Slope is: {slope}")
    
        # Fit the model using statsmodels to get p-values and R-squared
        statsmodels_model = fit_statsmodels_ols(X, y)
      #  print(statsmodels_model.summary())
    
        # Extract R-squared and p-values
        r_squared = statsmodels_model.rsquared
        p_values = statsmodels_model.pvalues
    
        print(f"R-squared: {r_squared}")
        #print(f"P-values: {p_values}")
    
        # Extracting specific p-values by index
        intercept_p_value = p_values[0]  # First p-value (intercept)
        slope_p_value = p_values[1]  # Second p-value (YearsExperience)
    
        #print(f"Intercept p-value: {intercept_p_value}")
        print(f"p-value (YearsExperience): {slope_p_value}")
    
        print("\nThe p-value is the probability of observing a t-statistic as extreme as, or more extreme than, the one calculated from your sample data, under the assumption that the null hypothesis is true.") 
        print("This is obtained from the t-distribution with nāˆ’2 degrees of freedom ")
        print("where n is the number of observations\n")
    
        if slope_p_value > 0.05:
            print("P-value is not significant and therefore we fail to reject the null hypothesis")
        else:
            print("P-value is less than 0.05 and therefore we reject the null hypothesis. This means there is strong evidence that the predictor X has a statistically significant effect on the outcome Y")
        # Plotting the regression line
        plot_regression_line(df, intercept, slope)
    
        # Fit a linear regression line using scipy.stats (for comparison)
        slope, intercept, r_value, p_value, std_err = stats.linregress(df['YearsExperience'], df['Salary'])
       # plt.text(df['YearsExperience'].min(), df['Salary'].max(), f'y = {slope:.2f}x + {intercept:.2f}', ha='left')
    
    if __name__ == "__main__":
        main()
    



  • Python Notes: Pandas Code Examples

    Pandas Quick Guide

    Modules

    import pandas as pd #pandas module
    import scipy.stats as stats #stats sub-module
    from bs4 import BeautifulSoup #scrape module
    import requests #API module
    import whois as whois #whois module


    Files
    df = pd.read_csv("CookieData.csv") #import CSV file to dataframe df

    #define index column and convert dates to date format

    air_quality = pd.read_csv("air_quality_long.csv", index_col="date.utc", parse_dates=True)

    air_quality.to_csv("checkcombine.csv") #save dataframe to csv file
    tests.to_excel("tests.xlsx", sheet_name="expt", index=False)



    Dataframe info

    df.describe() #output descriptive statistics of dataframe (numeric columns)

    df.info() #display dataframe columns datatypes No-Null
    df.dtypes #output datatypes of dataframe

    df.shape #outputs rows and columns
    df.sample(5) #output sample of 5 rows

    df.columns #output column names
    titanic.head(5) #print head of dataframe (top five rows)
    titanic.tail(5) #print tail of dataframe (bottom five rows)
    type(df) #output object type

    Convert
    gg['YearString'] = gg['Year'].astype("string") #convert series to string
    gg["YearInterval"] = gg["YearString"].str[:3] #LEFT-type function, get first 3 characters of string

    gg["YearInterval"] = gg["YearInterval"].astype(int) #convert series to integer



    Delete Columns
    del df2["Outlier_Flag"]

    Sub-set

    df2 = df[["version", "retention_1"]]

    Combine Sub-sets

    #Concatenate Sub-sets - creates a union between the 2 tables. The keys argument adds another hierarchical index to show which table the data comes from

    air_quality = pd.concat([air_quality_pm25, air_quality_no2], axis=0, keys=["PM25", "NO2"]) # creates union

    #merge the columns from the stations_coord dataframe to the air_quality df. Use location as the key to join the two dataframes, how="left" is left join from air_quality on the left
    air_quality = pd.merge(air_quality, stations_coord, how="left", on="location")

    #merge left join using parameter column on left and id on right

    air_quality = pd.merge(air_quality, air_quality_parameters, how='left', left_on='parameter', right_on='id')



    Dataframe Aggregate Functions
    titanic["Age"].mean()
    titanic[["Age", "Fare"]].median()


    Sort Dataframe
    air_quality = air_quality.sort_values("date.utc")
    sort1 = titanic.sort_values(by="Age").head()
    sort2 = titanic.sort_values(by=['Pclass', 'Age'], ascending=False)
    no2_subset = no2.sort_index().groupby(["location"]).head(2) #group and sort by index


    Filtering Dataframe

    df2 = df[df.Outlier_Flag == 0]

    gate_30 = df[df.version == "gate_30"]
    no2 = air_quality[air_quality["parameter"] == "no2"]
    above_35 = titanic[titanic["Age"] > 35]
    class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]

    age_no_na = titanic[titanic["Age"].notna()] #return rows where Age is not NA

    #Return Names of passengers over 35

    adult_names = titanic.loc[titanic["Age"] > 35, "Name"]

    titanic.iloc[9:25, 2:5] #Return specified rows and columns
    titanic.iloc[0:3, 3] = "anonymous" #Change the first 3 rows of column 3 to anonymous

    def city_select(country): #function with country as parameter
        df2 = df[df["country"] == country] #slice dataframe where country = country e.g. United States
        df2 = df2.sort_values(by=["population"], ascending=False) #sort dataframe by population descending
        df3 = df2.iloc[0:5] #new dataframe with the top 5 rows by population
        df4 = df3["city"] #select city series (top 5 cities by population)
        return df4

    def latlong(country, city): #define function
        selection = df[(df["country"] == country) & (df["city"] == city)] #filter dataframe by 2 criteria
        return selection #return filtered dataframe

    def get_lat(selection): #define function
        latitude = selection["lat"] #latitude = series with "lat" as column
        lat = latitude.iloc[0] #get first row of latitude series
        return lat #return first row



    Groupby

    df.groupby("version").count() #count rows per version
    df.groupby("version").sum() #sum columns per version
    air_quality.groupby([air_quality["datetime"].dt.weekday, "location"])["value"].mean() #avg by day and location

    air_quality.groupby(air_quality["datetime"].dt.hour)["value"].mean().plot(kind='bar', rot=0, ax=axs)

    gg3 = gg.groupby(['YearInterval', 'Month'], as_index=False) #group by YearInterval and Month



    Create Pivot Table

    table = pd.pivot_table(gate_30, values='sum_gamerounds', index=['version'], columns=['retention_1'],
    aggfunc="mean", margins=True) #note margins give totals

    no_2 = air_quality.pivot(index="datetime", columns="location", values="value")
    air_quality.pivot_table(values="value", index="location", columns="parameter", aggfunc="mean") #wide format


    DateTime
    df["datetime"] = pd.to_datetime(air_quality["datetime"]) #convert dates in dataframe to datetime format

    pd.read_csv("../data/df_long.csv", parse_dates=["datetime"]) #does the same
    df["datetime"].min(), df["datetime"].max() #start and end datetime
    df["datetime"].max() - df["datetime"].min() #difference in datetime between 2 dates
    df["month"] = df["datetime"].dt.month #add month column based on the datetime column
    df.index.year, df.index.weekday #return year and weekday of datetime index
    df_pivoted = df.pivot(columns="location", values="value").reset_index() #reset pivot table index
    df = df_pivoted.melt(id_vars="date.utc")

    gg["YearMonth"] = pd.to_datetime(dict(year=gg.Year, month=gg.Month, day=1)) #create date from year and month columns, with 1 for the first of the month

    gg3 = gg.groupby(['YearInterval', 'Month'], as_index=False) #group by YearInterval and Month
    gg4 = gg3.max() #aggregate each group with max


    Plot

    df["2019-05-20":"2019-05-21"].plot(); #plot timerange



    Working with Columns

    air_quality.city.unique() #return unique cities in column city
    air_quality = air_quality.rename(columns={"date.utc": "datetime"}) #rename columns
    Split Columns in Dataframe
    titanic["Name"].str.split(",") #split Name column by comma
    titanic["Surname"] = titanic["Name"].str.split(",").str.get(0) #split Name column by comma and return the first value in the array

    df["Name"].str.lower() #change Name column values to lowercase
    df["Name"].str.contains("Countess") #return True if column contains "Countess"

    df["Name"].str.len() #return the string length of each Name
    df["Name"].str.len().idxmax() #return the row index of the longest name
    df.loc[df["Name"].str.len().idxmax(), "Name"] #return longest name in column using loc
    #replace genders with M and F in a new column
    df["gender_short"] = df["gender"].replace({"male": "M", "female": "F"})
    titanic["Pclass"].isin([2, 3]) #checks if Pclass is 2 or 3
    adult_names = titanic.loc[titanic["Age"] > 35, "Name"] #return Names of passengers over 35


    Print

    print("\n1 Day Retention by Version") #\n for new line
    print(f"You searched for: {domain_name}\n") #f allows you to include placeholders in print

    Loops

    for link in a: #a is a list of link tags, e.g. from BeautifulSoup find_all("a")
        print(link.get("href")) #print the href attribute of each link

    Input
    domain = input("Enter domain to search for: ")

  • An Example of Poor Correlation using Python

    In order to find the best data to build our model, we need to run correlations on the data. In the initial predictive model we built, we guessed the fields, namely age, gender, income and first purchase amount, but the model gave a very poor MSE accuracy result, so we need to go back to the data and look for fields that correlate well with the sales amount (as well as checking the existing fields we have used).
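
    A quick way to scan for candidate fields is to compute the correlation of every numeric column against the sales amount. This is a minimal sketch, assuming the same outliercheck.csv file and LifetimeSales column used in the code further down:

    import pandas as pd
    
    df = pd.read_csv('outliercheck.csv')
    # correlation of each numeric column with first 12-month sales, strongest first
    print(df.corr(numeric_only=True)['LifetimeSales'].sort_values(ascending=False))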


    We can check age and first purchase amount against first 12 month sales amount just to confirm this.

    Age vs. Sales (first 12 month sales) code:

    
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from scipy import stats
    
    csvlocation = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
    df = pd.read_csv(csvlocation)
    df.columns  # display the columns available
    
    # Create a scatter plot with regression line
    sns.regplot(data=df, x='Age', y='LifetimeSales')
    plt.title("First 12 Month Sales by Customer Age")  # graph title
    plt.xlabel('Age')  # x axis label
    plt.ylabel('First 12 Month Sales')  # y axis label
    
    # Fit a linear regression line
    slope, intercept, r_value, p_value, std_err = stats.linregress(df['Age'], df['LifetimeSales'])
    
    # Add equation of the regression line to the plot
    plt.text(df['Age'].min(), df['LifetimeSales'].max(), f'y = {slope:.2f}x + {intercept:.2f}', ha='left')
    
    # Calculate correlation coefficient
    correlation_coefficient = df['Age'].corr(df['LifetimeSales'])
    # Add correlation coefficient to plot
    plt.text(df['Age'].max(), df['LifetimeSales'].min(), f'Correlation coefficient: {correlation_coefficient:.2f}', ha='right')
    
    
    plt.show()
    


    The output of graphing customer age against first 12-month sales confirms the poor correlation. Here the correlation coefficient is only 0.08.



    Next we can look at First Purchase Amount and First 12 Month Sales Amount.
    Here is the code below. It includes removing the major outlier with CustomerID = 1

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from scipy import stats
    
    csvlocation = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
    df = pd.read_csv(csvlocation)
    df = df[df['CustomerID'] != 1]
    df.columns  # display the columns available
    
    # Create a scatter plot with regression line
    sns.regplot(data=df, x='FirstPurchaseAmount', y='LifetimeSales')
    plt.title("First 12 Month Sales by FirstPurchaseAmount")  # graph title
    plt.xlabel('FirstPurchaseAmount')  # x axis label
    plt.ylabel('First 12 Month Sales')  # y axis label
    
    # Fit a linear regression line
    slope, intercept, r_value, p_value, std_err = stats.linregress(df['FirstPurchaseAmount'], df['LifetimeSales'])
    
    # Add equation of the regression line to the plot
    plt.text(df['FirstPurchaseAmount'].min(), df['LifetimeSales'].max(), f'y = {slope:.2f}x + {intercept:.2f}', ha='left')
    
    # Calculate correlation coefficient
    correlation_coefficient = df['FirstPurchaseAmount'].corr(df['LifetimeSales'])
    # Add correlation coefficient to plot
    plt.text(df['FirstPurchaseAmount'].max(), df['LifetimeSales'].min(), f'Correlation coefficient: {correlation_coefficient:.2f}', ha='right')
    
    
    plt.show()
    
    


    And here is the graph. The correlation coefficient is only 0.07.

  • Some basic Python Graphing Data Examples with the Matplotlib library

    In the previous article ‘Creating a Customer Lifetime Value Model’ we imported and transformed a table of customer data from an MS SQL Database, which included the ‘first 12 months sales amount’. At the end of the exercise, we exported the data frame to a CSV file, which we can check out using the Matplotlib library in Python.

    Histogram:
    We can start by having a look at the distribution of customers’ ages with a histogram.
    Here is the code:

    import pandas as pd
    import matplotlib.pyplot as plt
    
    csvlocation = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
    df = pd.read_csv(csvlocation)
    df.columns # display the columns available
    
    # Create a histogram
    df['Age'].plot(kind='hist', bins=20) # Adjust the number of bins as needed
    plt.title("Distribution of Customer Age")
    plt.xlabel('Age')
    plt.ylabel('Frequency')
      
    plt.show()

    The graph of the output is below. We immediately notice that all customers are above 45. This isn't an issue; it's just a reflection of the database being old and the age calculation being based on the current time and the customers' dates of birth. We can say there are more customers between the ages of 50 and 65. We would expect there to be fewer customers as age increases after that, so I think that is fairly reflective of the general population.





    Scatter Plot
    Next, we can have a look at the first 12-month sales amount against age.
    Here is the Python code:

    
    import pandas as pd
    import matplotlib.pyplot as plt
    
    csvpath = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
    df = pd.read_csv(csvpath)
    df.columns # display the columns available
    
    #kind options: 'scatter', 'hist', 'bar',  
    df.plot(kind = 'scatter', x = 'Age', y = 'LifetimeSales')
    plt.title("First 12 Month Sales by Customer Age") # graph title
    plt.xlabel('Age') #x axis label
    plt.ylabel('First 12 Month Sales') #y axis label
      
    plt.show()


    Here is the graph:


    We can immediately see the range of first 12-month sales amounts (from 10K to 70K). That's a very high amount and makes me want to go back and check the data. When I go and look at the data, it does look correct, so we can continue.
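
    As a quick sanity check (a minimal sketch, assuming the df loaded above), describe() shows the count, mean, spread and range of the sales column:

    print(df['LifetimeSales'].describe()) #count, mean, std, min, quartiles and max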

    All customers are over 40. This is reflective of the database being over 15 years old with no new customers added, but it also means we have some really old customers with an estimated age of 100.
    Customers' ages look to be fairly evenly distributed between 40 and 80, with the 80-plus group having lower sales.

    Pie Graph

    The following code gives us a pie chart showing the split of gender in the dataset.

    import pandas as pd
    import matplotlib.pyplot as plt
    
    csvpath = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
    df = pd.read_csv(csvpath)
    df.columns # display the columns available
    #replace 0 and 1 in the 'Gender' column with female and male
    df['Gender'] = df['Gender'].replace({0: 'female', 1: 'male'})
    
    #return labels and counts of the unique gender values
    gender_counts = df['Gender'].value_counts()
    #print(gender_counts)
    
    # Define colors for each category
    colors = {'female': 'pink', 'male': 'blue'}
    
    # Plotting
    plt.figure(figsize=(4, 4))  # Adjust figure size if needed
    plt.pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%', 
            colors=[colors[g] for g in gender_counts.index])
    plt.title('Distribution of Gender')
    plt.show()


    The output looks like this. The 0s and 1s in the data have been replaced with 'female' and 'male'.
    The data looks pretty normal, as we would expect, and there are no nulls, which is good.
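
    A minimal way to confirm there are no missing values (assuming the df loaded above):

    print(df.isna().sum()) #count of missing values per column - all zeros means no nulls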


    Another Scatter Plot
    Next, we look at the ‘FirstPurchaseAmount’ field.
    We create a scatter graph with the code below.

    
    import pandas as pd
    import matplotlib.pyplot as plt
    
    csvpath = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
    df = pd.read_csv(csvpath)
    df.columns # display the columns available
    
    #kind options: 'scatter', 'hist', 'bar',  
    df.plot(kind = 'scatter', x = 'CustomerID', y = 'FirstPurchaseAmount')
    plt.title("Customers by First Purchase Amount") # graph title
    plt.xlabel('CustomerID') #x axis label
    plt.ylabel('First Purchase Amount') #y axis label
      
    plt.show()


    The first thing we notice when we run the graph is that there is one major outlier in the data.

    Removing this customer should help improve the model, and we should be able to see more outliers in the graph once it is removed.

  • Examples of using the Python Seaborn Graphs Library

    This is pretty powerful. You can spend your time creating graphs one at a time, or you can create a pairs plot using the Seaborn library. The pairs plot graphs every combination of variables to create multiple graphs. The scatter graphs are below, and the simple code is further down the page. There is a big outlier for the first purchase amount that stands out. We need to fish that one out first and re-run.



    import seaborn as sns
    import matplotlib.pyplot as plt
    import pandas as pd
    
    csvpath = 'outliercheck.csv'
    df = pd.read_csv(csvpath)
    #df.columns # display the columns available
    
    pairplot = sns.pairplot(df)
    
    # Display the plot
    plt.show()

    If we sort the data frame by the 'FirstPurchaseAmount' field, we can have a look at the output.

    # Sort the DataFrame by the 'FirstPurchaseAmount' column
    df_sorted = df.sort_values(by='FirstPurchaseAmount', ascending=False)
    
    # Display the first few rows of the sorted DataFrame
    print(df_sorted.head())

    The record with CustomerID = 1 is the big outlier. It was possibly a test, which raises another question: is there test data that needs to be removed? Anyway, let's remove it.


    We can remove just this one customer using the following code. We can also remove the Age and Gender fields, as they looked normal in the earlier checks, and dropping them will reduce the number of graphs in the pair plot.

    #remove row where CustomerID = 1
    df_filtered = df[df['CustomerID'] != 1]
    # Remove the 'age', 'gender' and CustomerID columns
    df_filtered = df_filtered.drop(['Age', 'Gender', 'CustomerID'], axis=1)


    Below is the output of re-running the pair plot after removing the major outlier; I've also removed the Age, CustomerID, and Gender columns from the output as they aren't of interest.

    Now, we have a clearer view of the data. As expected, there are more outliers to remove. One common approach is the standard deviation method: for example, removing rows that fall more than 3 standard deviations from the mean trims the extreme values at both ends of the data. A minimal sketch of that approach is below.
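
    This sketch assumes the df_filtered dataframe from above, which contains only numeric columns after the drops; the imports are repeated so the snippet stands alone.

    import seaborn as sns
    import matplotlib.pyplot as plt
    
    cleaned = df_filtered.copy()
    for col in cleaned.columns:
        mean, std = cleaned[col].mean(), cleaned[col].std()
        #keep only rows within 3 standard deviations of the column mean
        cleaned = cleaned[(cleaned[col] - mean).abs() <= 3 * std]
    
    pairplot = sns.pairplot(cleaned) #re-run the pair plot on the cleaned data
    plt.show()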