Category: python

  • Python Notes: Pandas from getting started Tutorials

    Notes from Getting Started Tutorial:
    https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html

    1. Create a Pandas Series and DataFrame
    2. Create Graphs with Matplotlib
    3. Adding new Columns from Existing Columns (inc. split)
    4. Summary Statistics including groupby
    5. Reshape layout of tables incl. Sorting and Pivot tables
    6. Sub-sets and filtering
    7. Combining data from multiple tables (concatenate (union) and merge (join))
    8. Working with timeseries (pd.to_datetime(), dt, DatetimeIndex and Resample)
    9. Textual data (.str and .replace and split)


    1. Creating a Dataframe and Series

    import pandas as pd
    
    #Create a series
    books = pd.Series(["Papillion", "Great Expectations", "Magic"], name="Books")
    
    #Create DataFrame with Key value pairs. In this case the values are lists of values.
    
    df = pd.DataFrame(
        {
            "Name": [
                "Humpfries, Mr. Bill",
                "Derby, Mrs. Sheila",
                "Winfrey, Miss. Elizabeth",
            ],
            "Age": [44, 66, 42],
            "Gender": ["male", "female", "female"],
            "City": ["London", "Manchester", "Leeds"],
            "Occupation": ["Doctor", "Architect", "Analyst"]
        }
    )
    
    #Select a column from the DataFrame
    df["Age"] #Output Age Column
    df["Age"].max() #outputs max age

    2. Create Plots with Matplotlib

    import pandas as pd
    import matplotlib.pyplot as plt
    
    #Create Pandas dataframe from csv file
    air_quality = pd.read_csv("air_quality_no2.csv", index_col=0, parse_dates=True)
    air_quality["station_paris"].plot() #craate plot of subset from DataFrame
    plt.show() 
    
    #Scatter plot with Matplotlib
    air_quality.plot.scatter(x="station_london", y="station_paris", alpha=0.5)
    plt.show() 
    
    #Create box plot from DataFrame
    air_quality.plot.box()
    plt.show() #Display boxplot
    
    #subplots - one area plot per column
    axs = air_quality.plot.area(figsize=(10, 4), subplots=True)
    plt.show()
    


    3. Add new columns from existing columns

    import pandas as pd
    #index_col=0, sets first column to index, parse_dates converts dates to date type
    air_quality = pd.read_csv("air_quality_no2.csv", index_col=0, parse_dates=True)
    
    #Add new column based on existing column - this adds column to DataFrame
    air_quality["london_mg_per_cubic"] = air_quality["station_london"] * 1.882
    #Another example, this time a ratio
    air_quality["ratio_paris_antwerp"] = (air_quality["station_paris"] / air_quality["station_antwerp"])
    
    #renaming columns
    air_quality_renamed = air_quality.rename(columns={"station_antwerp": "BETR801","station_paris": "FR04014", "station_london": "London Westminster"})
    

    4. Summary Statistics


    import pandas as pd
    #read_csv function - also read_excel, read_sql, read_json, read_parquet and more
    
    titanic = pd.read_csv("titanic.csv") #read_csv function reads from csv
    titanic["Age"].mean() #average of Age column
    titanic[["Age", "Fare"]].median() #median of Age and Fare Coluns
    titanic[["Age", "Fare"]].describe() #summary stats of 2 columns
    titanic.agg({"Age": ["min", "max", "median", "skew"],"Fare": ["min", "max", "median", "mean"]})
     #multiple stats on Age and Fare columns
     
    #groupby to create summary tables
    titanic[["Sex", "Age"]].groupby("Sex").mean() #Average age by sex
    
    titanic.groupby("Sex").mean(numeric_only=True) #average of all numeric columns by Sex
    titanic.groupby(["Sex", "Pclass"])["Fare"].mean() #average fare by Sex and Pclass
    
    titanic["Pclass"].value_counts() #count passengers in each Pclass
    titanic.groupby("Pclass")["Pclass"].count() #longer count method


    5. Re-shape layout of tables

    import pandas as pd
    titanic = pd.read_csv("titanic.csv") #read_csv function reads from csv
    air_quality = pd.read_csv("air_quality_long.csv", index_col="date.utc", parse_dates=True)
    
    #Sorting Tables
    sort1 = titanic.sort_values(by="Age").head() #sort by Age
    sort2 = titanic.sort_values(by=['Pclass', 'Age'], ascending=False) #sort by Pclass then Age, descending
    
    # filter for no2 data only
    no2 = air_quality[air_quality["parameter"] == "no2"]
    
    # first 2 measurements per location (sort by index, group by location)
    no2_subset = no2.sort_index().groupby(["location"]).head(2)
    no2_subset
    #pivot table wide format
    air_quality.pivot_table(values="value", index="location", columns="parameter", aggfunc="mean")
    air_quality.pivot_table(values="value",index="location",columns="parameter",aggfunc="mean",margins=True)
    
    #long to wide format
    no2_pivoted = no2.pivot(columns="location", values="value").reset_index()
    no2_pivoted.head(5)
    #pandas melt() method converts wide format back to long
    no_2 = no2_pivoted.melt(id_vars="date.utc")
    no_2.head(5)


    6. Sub-sets and Filtering

    import pandas as pd
    #read csv file to pandas DataFrame
    titanic = pd.read_csv("titanic.csv")
    air_quality = pd.read_csv("air_quality_no2.csv", index_col=0, parse_dates=True)

    #save as excel, index=False removes row index numbers from export
    titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False)


    #attributes and methods of DataFrame
    titanic.shape #row and column count
    titanic.info() #summary incl. dtypes and non-null counts
    titanic.columns #column names
    titanic.dtypes #data types
    titanic.head(5) #head of df
    titanic.tail(5) #tail of df
    titanic.describe() #summary statistics of numeric columns

    #Create Sub-set of DataFrame as series
    ages = titanic["Age"] #create sub-set from dataframe: note single bracket creates a pandas Series
    type(titanic["Age"]) #confirm type of Age object

    #Create Sub-set of DataFrame as DataFrame
    age_sex = titanic[["Age", "Sex"]] #inner brackets are a list, outer is DataFrame

    #Filtering rows: get passenger list with Age > 35
    above_35 = titanic[titanic["Age"] > 35] #return rows with age over 35
    #Filter passenger list with Pclass = 2 or 3 (2 ways of doing this)
    titanic[titanic["Pclass"].isin([2, 3])] #isin checks if Pclass is 2 or 3 and returns a Boolean mask
    titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)] #same result with OR conditions

    #Filter passenger list with Age not NA
    age_no_na = titanic[titanic["Age"].notna()]
    #Filter passenger list and return list of Names
    adult_names = titanic.loc[titanic["Age"] > 35, "Name"]

    #Filter specified rows and columns
    titanic.iloc[9:25, 2:5]

    #change the first 3 rows of column 3 to "anonymous"
    titanic.iloc[0:3, 3] = "anonymous"



    7. Combining Data from tables

    import pandas as pd
    import os

    #check current working directory
    cwd = os.getcwd()
    cwd

    #Import data into DataFrames
    titanic = pd.read_csv("titanic.csv")
    air_quality_no2 = pd.read_csv("air_quality_no2_long.csv", parse_dates=True)
    air_quality_pm25 = pd.read_csv("air_quality_long.csv", parse_dates=True)
    stations_coord = pd.read_csv("air_quality_stations.csv")
    air_quality_parameters = pd.read_csv("air_quality_parameters.csv")

    #Create Sub-sets
    air_quality_no2 = air_quality_no2[["date.utc", "location", "parameter", "value"]]

    air_quality_pm25 = air_quality_pm25[["date.utc", "location", "parameter", "value"]]

    #Concatenate Sub-sets - creates a union between the 2 tables.
    #the keys argument adds another hierarchical index to show which table the data comes from
    air_quality = pd.concat([air_quality_pm25, air_quality_no2], axis=0, keys=["PM25", "NO2"])
    air_quality_no2.head(5)
    air_quality_pm25.head(5)
    air_quality_no2.shape #2068 rows, 4 columns
    air_quality_pm25.shape #5274 rows, 4 columns
    air_quality.shape #7340 rows, 4 columns

    air_quality = air_quality.sort_values("date.utc")
    air_quality.head(25)
    air_quality.to_csv("checkcombine.csv")

    #note you can reset the index to a column
    air_quality.reset_index(level=0)

    stations_coord = pd.read_csv("air_quality_stations.csv")

    stations_coord.head()

    #merge the columns from the stations_coord dataframe to the air_quality df
    #use location as the key to join the two dataframes
    #how="left" is left join from air_quality on the left
    air_quality = pd.merge(air_quality, stations_coord, how="left", on="location")
    air_quality.head(10)
    air_quality.to_csv("checkqualitymerge.csv")

    air_quality_parameters.head(5)
    #merge left join using parameter column on left and id on right
    air_quality = pd.merge(air_quality, air_quality_parameters, how='left', left_on='parameter', right_on='id')

    8. Timeseries

    import pandas as pd

    import matplotlib.pyplot as plt

    air_quality = pd.read_csv("air_quality_no2_long.csv")

    air_quality = air_quality.rename(columns={"date.utc": "datetime"})
    air_quality.head(5)
    air_quality.columns

    #unique cities in air_quality df
    air_quality.city.unique()

    #convert the datetime column of the air_quality dataframe to a datetime object instead of text
    air_quality["datetime"] = pd.to_datetime(air_quality["datetime"])

    air_quality["datetime"]

    #can also do this to read the dates into datetime format
    #pd.read_csv("../data/air_quality_no2_long.csv", parse_dates=["datetime"])

    #check start and end datetime
    air_quality["datetime"].min(), air_quality["datetime"].max()

    #datetime difference using pandas timestamp
    air_quality["datetime"].max() - air_quality["datetime"].min()

    #Add a month column using the dt accessor
    air_quality["month"] = air_quality["datetime"].dt.month

    #Average no2 concentration by day of week by location
    air_quality.groupby([air_quality["datetime"].dt.weekday, "location"])["value"].mean()

    #Plot bar graph using hour on the x axis and average no2 on the y
    fig, axs = plt.subplots(figsize=(12, 4))

    air_quality.groupby(air_quality["datetime"].dt.hour)["value"].mean().plot(
        kind='bar', rot=0, ax=axs)
    plt.xlabel("Hour of the day"); # custom x label using Matplotlib
    plt.ylabel("$NO_2 (µg/m^3)$");

    #create pivot table
    no_2 = air_quality.pivot(index="datetime", columns="location", values="value")
    no_2.head(5)

    #use index.year and index.weekday of the datetime object
    no_2.index.year, no_2.index.weekday
    #plot using a time range
    no_2["2019-05-20":"2019-05-21"].plot();

    #resample method of a datetime index works like groupby. It takes an
    #aggregation function and the "M" is time-based grouping for Month (month end)
    monthly_max = no_2.resample("M").max()
    monthly_max

    monthly_max.index.freq #tells you the frequency attribute - e.g. MonthEnd

    #Create plot of average no2 levels for each station
    no_2.resample("D").mean().plot(style="-o", figsize=(10, 5));

    9. Textual Data

    import pandas as pd

    titanic = pd.read_csv("titanic.csv")

    titanic["Name"].str.lower()

    #Split the Name field by comma
    titanic["Name"].str.split(",")
    #create Surname series using the first element of the split Name field
    titanic["Surname"] = titanic["Name"].str.split(",").str.get(0)
    titanic["Surname"]

    #Return Name series with True/False if it contains Countess
    titanic["Name"].str.contains("Countess")

    #return rows where Name contains Countess
    titanic[titanic["Name"].str.contains("Countess")]

    #string length
    titanic["Name"].str.len()
    #find the row index of the longest name
    titanic["Name"].str.len().idxmax()
    #Return longest name
    titanic.loc[titanic["Name"].str.len().idxmax(), "Name"]

    titanic["Sex_short"] = titanic["Sex"].replace({"male": "M", "female": "F"})

    titanic["Sex_short"]

  • Python Notes: Python Lists Examples


    Create a list
    Simple list
    mylist = ["apple", "orange", "banana", "mango", "pineapple"]
    mylist[2] # returns 'banana'
    mylist[-1] # returns 'pineapple'
    mylist[2:4] # returns ['banana', 'mango']
    if "apple" in mylist:
      print("Yes, 'apple' is in the fruits list")
    mylist.insert(2, "watermelon") #insert at position specified
    mylist.append('grapes') #adds to end of list
    mylist = ["apple", "banana", "cherry"]
    tropical = ["mango", "pineapple", "papaya"]
    mylist.extend(tropical) # adds tropical list to mylist
    mylist.remove("banana") #removes first occurrence
    mylist.pop(1) #removes element at specified index
    del mylist[0] #also removes an element
    del mylist # deletes the list
    mylist.clear() # empties the list

    for i in mylist: #loop over list
      print(i)

    i = 0
    while i < len(mylist): #while loop
      print(mylist[i])
      i = i + 1

    newlist = []
    for x in mylist: #adds list items containing "a" to new list
      if "a" in x:
        newlist.append(x)

    mylist.sort() #sort alphabetically or numerically depending on data type
    mylist.sort(reverse=True) #sort in reverse order
    mylist.sort(key=str.lower) #case-insensitive sort
    mylist.reverse() #reverse the current order of the list
    newlist = mylist.copy() #copy list
    newlist = list(mylist) #also makes a copy
    list3 = list1 + list2 #concatenate lists





    List of lists: 5 rows, 2 columns

    b1 = [[2308, 6], [2408, 6.2], [2508, 5.8],[2608, 5.6], [2708, 5.9]] #create the list
    print(b1) #print the list
    print(len(b1)) #print length of list
    print(type(b1)) #print data type of list
    print(b1[:2]) #print the first 2 elements in the list (indexes 0 and 1)
    print(b1[3:]) #print elements from index 3 (the 4th element) onwards

    Reference Index in array

    b1[3] #returns fourth element
    b1[0:2] # returns first and second element of list
    b1[-1] # returns last element index

    Add and Delete, Update
    Delete
    del(b1[0]) # delete first element
    or
    b1.pop(1) #remove second element


    Insert
    b1.insert(1,[24082, 111]) #insert element at position 1
    Update
    b1[5] = [2808, 6.7] #update the element at index 5 (the last element after the insert above)

    Append
    Additions = [[29008, 6.6], [3008, 6], [3108, 6.1]]
    b1.extend(Additions) #Adds the Additions list to b1
    b1.append([2808, 6.6]) # append a single element to the end

    Clear the list
    b1.clear() #empties the list

    Check if element in array

    if [2308, 6] in b1: print("yes")
    Loop
    for x in b1: print(x)

    def country_select(countrylist):
        for count, country in enumerate(countrylist, start=1): #number each country
            print(f"{count}: {country}")
        return countrylist



    Numpy

    baseball = [180, 215, 210, 210, 188, 176, 209, 200]
    import numpy as np
    np_baseball = np.array(baseball)
    print(type(np_baseball))
    mean = np_baseball.mean()
    print("mean is: " + str(mean))
    med = np.median(np_baseball)
    print("median is: " + str(med))
    std = np.std(np_baseball)
    print("standard deviation is: " + str(std))

  • Simple Linear Regression Example with Python

    Linear Regression using Salary and Years of Experience Data

    Data Source: Salary_dataset.csv Kaggle
    The salary data set includes 2 columns: Years Experience, which will be our independent variable (X), and Salary (Y).

    Linear regression is a fundamental statistical method used to model the relationship between a dependent variable and one or more independent variables. The primary goal of linear regression is to predict the value of the dependent variable based on the values of the independent variables (Chat GPT)

    For this example: First, we want to see if there is a correlation between the 2 variables by building a regression line and calculating r squared. Then we want to assess the significance of the relationship using the p-value to test the null hypothesis that there is no relationship between X and Y (X does not predict Y).

    Simple Linear Regression
    Formula
    Formula: Y = b + aX + e

    1. Dependent Variable (Y): The outcome or the variable we aim to predict or explain.
    2. Independent Variable(s) (X): The variable(s) used to predict or explain changes in the dependent variable.
    3. a: is the slope, the change in Y for a one-unit change in X.
    4. b: is the intercept of the Y axis. This represents the value of Y when X is zero.
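
    As a quick worked example with made-up numbers (not taken from the dataset): if the intercept b is 30,000 and the slope a is 9,000, then someone with 5 years of experience has a predicted salary of 30,000 + 9,000 x 5 = 75,000, plus the error term e.

    #Hypothetical worked example of Y = b + aX (made-up numbers)
    b = 30000 #intercept: predicted salary with zero years of experience
    a = 9000 #slope: salary increase per additional year of experience
    years = 5
    predicted_salary = b + a * years
    print(predicted_salary) #75000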

      The following Python code (which I used ChatGPT to optimize) calculates the regression line and p-value and evaluates the null hypothesis.

      The steps are as follows:
      1. Use Pandas to import the CSV file to a data frame and convert each series to an array.
      2. Fit a linear regression model using Sklearn that returns the slope and intercept of the regression line.
      3. The statsmodels library is then used to calculate the R-squared value and p-value of the slope.
      4. The null hypothesis is then evaluated based on the p-value.
      5. Scipy.stats and Matplotlib.pyplot are then used to fit and plot the regression line on a graph.
    
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LinearRegression
    import statsmodels.api as sm
    from scipy import stats
    
    def load_data(csv_location):
        """Load CSV data into a pandas DataFrame."""
        df = pd.read_csv(csv_location)
        return df
    
    def prepare_data(df):
        """Prepare independent and dependent variables for regression."""
        X = df[['YearsExperience']].values  # Extract as 2D array
        y = df['Salary'].values  # Extract as 1D array
        return X, y
    
    def fit_sklearn_model(X, y):
        """Fit a linear regression model using sklearn."""
        model = LinearRegression()
        model.fit(X, y)
        return model
    
    def fit_statsmodels_ols(X, y):
        """Fit a linear regression model using statsmodels OLS."""
        X_with_const = sm.add_constant(X)  # Add an intercept to the model
        model = sm.OLS(y, X_with_const).fit()
        return model
    
    def plot_regression_line(df, intercept, slope):
        """Plot the regression line along with data points."""
        plt.scatter(df['YearsExperience'], df['Salary'], color='blue', label='Data points')
        plt.plot(df['YearsExperience'], intercept + slope * df['YearsExperience'], color='red', label='Regression line')
        plt.title("Salary by Years of Experience")
        plt.xlabel('Years of Experience')
        plt.ylabel('Salary')
        plt.legend()
        plt.show()
    
    def main():
        csv_location = "salary_dataset.csv"
        df = load_data(csv_location)
    
        # Display basic statistics
        #print(df.describe())
    
        X, y = prepare_data(df)
    
        # Fit the model using sklearn
        sklearn_model = fit_sklearn_model(X, y)
        intercept, slope = sklearn_model.intercept_, sklearn_model.coef_[0]
        
        print("Calculation of Regression Line:\n")
        print(f"Intercept is: {intercept}")
        print(f"Slope is: {slope}")
    
        # Fit the model using statsmodels to get p-values and R-squared
        statsmodels_model = fit_statsmodels_ols(X, y)
      #  print(statsmodels_model.summary())
    
        # Extract R-squared and p-values
        r_squared = statsmodels_model.rsquared
        p_values = statsmodels_model.pvalues
    
        print(f"R-squared: {r_squared}")
        #print(f"P-values: {p_values}")
    
        # Extracting specific p-values by index
        intercept_p_value = p_values[0]  # First p-value (intercept)
        slope_p_value = p_values[1]  # Second p-value (YearsExperience)
    
        #print(f"Intercept p-value: {intercept_p_value}")
        print(f"p-value (YearsExperience): {slope_p_value}")
    
        print("\nThe p-value is the probability of observing a t-statistic as extreme as, or more extreme than, the one calculated from your sample data, under the assumption that the null hypothesis is true.") 
        print("This is obtained from the t-distribution with nāˆ’2 degrees of freedom ")
        print("where n is the number of observations\n")
    
        if slope_p_value > 0.05:
            print("P-value is not significant and therefore we fail to reject the null hypothesis")
        else:
            print("P-value is less than 0.05 and therefore we reject the null hypothesis. This means there is strong evidence that the predictor X has a statistically significant effect on the outcome Y")
        # Plotting the regression line
        plot_regression_line(df, intercept, slope)
    
        # Fit a linear regression line using scipy.stats (for comparison)
        slope, intercept, r_value, p_value, std_err = stats.linregress(df['YearsExperience'], df['Salary'])
       # plt.text(df['YearsExperience'].min(), df['Salary'].max(), f'y = {slope:.2f}x + {intercept:.2f}', ha='left')
    
    if __name__ == "__main__":
        main()
    



  • Python Notes: Pandas Code Examples

    Pandas Quick Guide

    Modules

    import pandas as pd #pandas module
    import scipy.stats as stats #stats sub-module
    from bs4 import BeautifulSoup #scrape module
    import requests #API module
    import whois as whois #whois module


    Files
    df = pd.read_csv("CookieData.csv") #import CSV file to dataframe df

    #define index column and convert dates to date format

    air_quality = pd.read_csv("air_quality_long.csv", index_col="date.utc", parse_dates=True)

    air_quality.to_csv("checkcombine.csv") #save dataframe to csv file
    tests.to_excel("tests.xlsx", sheet_name="expt", index=False)



    Dataframe info

    df.describe() #output descriptive statistics of dataframe (numeric columns)

    df.info() #display dataframe columns datatypes No-Null
    df.dtypes #output datatypes of dataframe

    df.shape #outputs rows and columns
    df.sample(5) #output sample of 5 rows

    df.columns #output column names
    titanic.head(5) #print head of dataframe (top five rows)
    titanic.tail(5) #print tail of dataframe (bottom five rows)
    type(df) #output object type

    Convert
    gg['YearString'] = gg['Year'].astype("string") #convert series to string
    gg["YearInterval"] = gg["YearString"].str[:3] #LEFT-type function, get first 3 characters of string

    gg["YearInterval"] = gg["YearInterval"].astype(int) #convert series to integer



    Delete Columns
    del df2["Outlier_Flag"]

    Sub-set

    df2 = df[["version", "retention_1"]]

    Combine Sub-sets

    #Concatenate Sub-sets - creates a union between the 2 tables. The keys argument adds another hierarchical index to show which table the data comes from

    air_quality = pd.concat([air_quality_pm25, air_quality_no2], axis=0, keys=["PM25", "NO2"]) # creates union

    #merge the columns from the stations_coord dataframe to the air_quality df. Use location as the key to join the two dataframes, how="left" is left join from air_quality on the left
    air_quality = pd.merge(air_quality, stations_coord, how="left", on="location")

    #merge left join using parameter column on left and id on right

    air_quality = pd.merge(air_quality, air_quality_parameters, how='left', left_on='parameter', right_on='id')



    Dataframe Aggregate Functions
    titanic["Age"].mean()
    titanic[["Age", "Fare"]].median()


    Sort Dataframe
    air_quality = air_quality.sort_values("date.utc")
    sort1 = titanic.sort_values(by="Age").head()
    sort2 = titanic.sort_values(by=['Pclass', 'Age'], ascending=False)
    no2_subset = no2.sort_index().groupby(["location"]).head(2) #group and sort by index


    Filtering Dataframe

    df2 = df[df.Outlier_Flag == 0]

    gate_30 = df[df.version == "gate_30"]
    no2 = air_quality[air_quality["parameter"] == "no2"]
    above_35 = titanic[titanic["Age"] > 35]
    class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]

    age_no_na = titanic[titanic["Age"].notna()] #return rows where Age is not NA

    #Return Names of passengers over 35

    adult_names = titanic.loc[titanic["Age"] > 35, "Name"]

    titanic.iloc[9:25, 2:5] #Return specified rows and columns
    titanic.iloc[0:3, 3] = "anonymous" #Change the first 3 rows of column 3 to anonymous

    def city_select(country): #function with country as parameter
        df2 = df[df["country"] == country] #slice dataframe where country = country e.g. United States
        df2 = df2.sort_values(by=["population"], ascending=False) #sort dataframe by population descending
        df3 = df2.iloc[0:5] #new dataframe with the top 5 rows by population
        df4 = df3["city"] #select city series (top 5 cities by population)
        return df4

    def latlong(country, city): #define function
        selection = df[(df["country"] == country) & (df["city"] == city)] #filter dataframe by 2 criteria
        return selection #return filtered dataframe

    def get_lat(selection): #define function
        latitude = selection["lat"] #latitude = series with "lat" as column
        lat = latitude.iloc[0] #get first row of latitude series
        return lat #return first row



    Groupby

    df.groupby("version").count() #count rows per version
    df.groupby("version").sum() #sum columns per version
    air_quality.groupby([air_quality["datetime"].dt.weekday, "location"])["value"].mean() #avg by day and location

    air_quality.groupby(air_quality["datetime"].dt.hour)["value"].mean().plot(kind='bar', rot=0, ax=axs)

    gg3 = gg.groupby(['YearInterval', 'Month'], as_index=False) #group by YearInterval and Month



    Create Pivot Table

    table = pd.pivot_table(gate_30, values='sum_gamerounds', index=['version'], columns=['retention_1'],
    aggfunc="mean", margins=True) #note margins give totals

    no_2 = air_quality.pivot(index="datetime", columns="location", values="value")
    air_quality.pivot_table(values="value", index="location", columns="parameter", aggfunc="mean") #wide format


    DateTime
    df["datetime"] = pd.to_datetime(air_quality["datetime"]) #convert dates in dataframe to datetime format

    pd.read_csv("../data/df_long.csv", parse_dates=["datetime"]) #does the same
    df["datetime"].min(), df["datetime"].max() #start and end datetime
    df["datetime"].max() - df["datetime"].min() #difference in datetime between 2 dates
    df["month"] = df["datetime"].dt.month #add month column based on the datetime column
    df.index.year, df.index.weekday #return year and weekday of datetime index
    df_pivoted = df.pivot(columns="location", values="value").reset_index() #reset pivot table index
    df = df_pivoted.melt(id_vars="date.utc")

    gg["YearMonth"] = pd.to_datetime(dict(year=gg.Year, month=gg.Month, day=1)) #create date from year and month columns, with 1 for the first of the month

    gg3 = gg.groupby(['YearInterval', 'Month'], as_index=False) #group by YearInterval and Month
    gg4 = gg3.max() #aggregate each group with max


    Plot

    df["2019-05-20":"2019-05-21"].plot(); #plot timerange



    Working with Columns

    air_quality.city.unique() #return unique cities in column city
    air_quality = air_quality.rename(columns={"date.utc": "datetime"}) #rename columns
    Split Columns in Dataframe
    titanic["Name"].str.split(",") #split Name column by comma
    titanic["Surname"] = titanic["Name"].str.split(",").str.get(0) #split Name column by comma and return the first value in the array

    df["Name"].str.lower() #change Name column values to lowercase
    df["Name"].str.contains("Countess") #return True if column contains "Countess"

    df["Name"].str.len() #return the string length of each Name
    df["Name"].str.len().idxmax() #return the row index of the longest name
    df.loc[df["Name"].str.len().idxmax(), "Name"] #return longest name in column using loc
    #replace genders with M and F in a new column
    df["gender_short"] = df["gender"].replace({"male": "M", "female": "F"})
    titanic["Pclass"].isin([2, 3]) #checks if Pclass is 2 or 3
    adult_names = titanic.loc[titanic["Age"] > 35, "Name"] #return Names of passengers over 35


    Print

    print("\n1 Day Retention by Version") #\n for new line
    print(f"You searched for: {domain_name}\n") #f allows you to include placeholders in print

    Loops

    for link in a: #a is a list of link tags, e.g. from BeautifulSoup find_all("a")
        print(link.get("href")) #print the href attribute of each link

    Input
    domain = input("Enter domain to search for: ")

  • An Example of Poor Correlation using Python

    In order to find the best data to build our model, we need to run correlations on the data. In the initial predictive model we built, we guessed the fields, namely age, gender, income and first purchase amount, but the model gave a very poor MSE accuracy result, so we need to go back to the data and look for fields that correlate well with the sales amount (as well as checking the existing fields we have used).
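
    A quick way to scan for candidate fields is to compute the correlation of every numeric column against the sales amount. This is a minimal sketch, assuming the same outliercheck.csv file and LifetimeSales column used in the code further down:

    import pandas as pd
    
    df = pd.read_csv('outliercheck.csv')
    # correlation of each numeric column with first 12-month sales, strongest first
    print(df.corr(numeric_only=True)['LifetimeSales'].sort_values(ascending=False))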


    We can check age and first purchase amount against first 12 month sales amount just to confirm this.

    Age vs. Sales (first 12 month sales) code:

    
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from scipy import stats
    
    csvlocation = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
    df = pd.read_csv(csvlocation)
    df.columns  # display the columns available
    
    # Create a scatter plot with regression line
    sns.regplot(data=df, x='Age', y='LifetimeSales')
    plt.title("First 12 Month Sales by Customer Age")  # graph title
    plt.xlabel('Age')  # x axis label
    plt.ylabel('First 12 Month Sales')  # y axis label
    
    # Fit a linear regression line
    slope, intercept, r_value, p_value, std_err = stats.linregress(df['Age'], df['LifetimeSales'])
    
    # Add equation of the regression line to the plot
    plt.text(df['Age'].min(), df['LifetimeSales'].max(), f'y = {slope:.2f}x + {intercept:.2f}', ha='left')
    
    # Calculate correlation coefficient
    correlation_coefficient = df['Age'].corr(df['LifetimeSales'])
    # Add correlation coefficient to plot
    plt.text(df['Age'].max(), df['LifetimeSales'].min(), f'Correlation coefficient: {correlation_coefficient:.2f}', ha='right')
    
    
    plt.show()
    


    The output of graphing customer age against first 12-month sales confirms the poor correlation. Here the correlation coefficient is only 0.08.



    Next we can look at First Purchase Amount and First 12 Month Sales Amount.
    Here is the code below. It includes removing the major outlier with CustomerID = 1

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from scipy import stats
    
    csvlocation = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
    df = pd.read_csv(csvlocation)
    df = df[df['CustomerID'] != 1]
    df.columns  # display the columns available
    
    # Create a scatter plot with regression line
    sns.regplot(data=df, x='FirstPurchaseAmount', y='LifetimeSales')
    plt.title("First 12 Month Sales by FirstPurchaseAmount")  # graph title
    plt.xlabel('FirstPurchaseAmount')  # x axis label
    plt.ylabel('First 12 Month Sales')  # y axis label
    
    # Fit a linear regression line
    slope, intercept, r_value, p_value, std_err = stats.linregress(df['FirstPurchaseAmount'], df['LifetimeSales'])
    
    # Add equation of the regression line to the plot
    plt.text(df['FirstPurchaseAmount'].min(), df['LifetimeSales'].max(), f'y = {slope:.2f}x + {intercept:.2f}', ha='left')
    
    # Calculate correlation coefficient
    correlation_coefficient = df['FirstPurchaseAmount'].corr(df['LifetimeSales'])
    # Add correlation coefficient to plot
    plt.text(df['FirstPurchaseAmount'].max(), df['LifetimeSales'].min(), f'Correlation coefficient: {correlation_coefficient:.2f}', ha='right')
    
    
    plt.show()
    
    


    And here is the graph. The correlation coefficient is only 0.07.

  • Some basic Python Graphing Data Examples with the Matplotlib library

    In the previous article ‘Creating a Customer Lifetime Value Model’ we imported and transformed a table of customer data from an MS SQL Database, which included the ‘first 12 months sales amount’. At the end of the exercise, we exported the data frame to a CSV file, which we can check out using the Matplotlib library in Python.

    Histogram:
    We can start by having a look at the distribution of customers’ ages with a histogram.
    Here is the code:

    import pandas as pd
    import matplotlib.pyplot as plt
    
    csvlocation = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
    df = pd.read_csv(csvlocation)
    df.columns # display the columns available
    
    # Create a histogram
    df['Age'].plot(kind='hist', bins=20) # Adjust the number of bins as needed
    plt.title("Distribution of Customer Age")
    plt.xlabel('Age')
    plt.ylabel('Frequency')
      
    plt.show()

    The graph of the output is below. We immediately notice that all customers are above 45. This isn't an issue; it's just a reflection of the database being old and the age calculation being based on the current time and the customers' dates of birth. We can say there are more customers between the ages of 50 and 65. We would expect there to be fewer customers as age increases after that, so I think that is fairly reflective of the general population.





    Scatter Plot
    Next, we can have a look at the first 12-month sales amount against age.
    Here is the Python code:

    
    import pandas as pd
    import matplotlib.pyplot as plt
    
    csvpath = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
    df = pd.read_csv(csvpath)
    df.columns # display the columns available
    
    #kind options: 'scatter', 'hist', 'bar',  
    df.plot(kind = 'scatter', x = 'Age', y = 'LifetimeSales')
    plt.title("First 12 Month Sales by Customer Age") # graph title
    plt.xlabel('Age') #x axis label
    plt.ylabel('First 12 Month Sales') #y axis label
      
    plt.show()


    Here is the graph:


    We can immediately see the range of first 12-month sales amounts (from 10K to 70K). That's a very high amount and makes me want to go back and check the data. When I go and look at the data, it does look correct, so we can continue.
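
    As a quick sanity check (a minimal sketch, assuming the df loaded above), describe() shows the count, mean, spread and range of the sales column:

    print(df['LifetimeSales'].describe()) #count, mean, std, min, quartiles and max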

    All customers are over 40. This is reflective of the database being over 15 years old with no new customers added, but it also means we have some really old customers with an estimated age of 100.
    Customers' ages look to be fairly evenly distributed between 40 and 80, with the 80-plus group having lower sales.

    Pie Graph

    The following code gives us a pie chart showing the split of gender in the dataset.

    import pandas as pd
    import matplotlib.pyplot as plt
    
    csvpath = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
    df = pd.read_csv(csvpath)
    df.columns # display the columns available
    #replace 0 and 1 in the 'Gender' column with female and male
    df['Gender'] = df['Gender'].replace({0: 'female', 1: 'male'})
    
    #return labels and counts of the unique gender values
    gender_counts = df['Gender'].value_counts()
    #print(gender_counts)
    
    # Define colors for each category
    colors = {'female': 'pink', 'male': 'blue'}
    
    # Plotting
    plt.figure(figsize=(4, 4))  # Adjust figure size if needed
    plt.pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%', 
            colors=[colors[g] for g in gender_counts.index])
    plt.title('Distribution of Gender')
    plt.show()


    The output looks like this. The 0s and 1s in the data have been replaced with 'female' and 'male'.
    The data looks pretty normal, as we would expect, and there are no nulls, which is good.
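
    A minimal way to confirm there are no missing values (assuming the df loaded above):

    print(df.isna().sum()) #count of missing values per column - all zeros means no nulls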


    Another Scatter Plot
    Next, we look at the ‘FirstPurchaseAmount’ field.
    We create a scatter graph with the code below.

    
    import pandas as pd
    import matplotlib.pyplot as plt
    
    csvpath = r'c:\Users\Admin\Documents\Github\Python Code\outliercheck.csv'
    df = pd.read_csv(csvpath)
    df.columns # display the columns available
    
    #kind options: 'scatter', 'hist', 'bar',  
    df.plot(kind = 'scatter', x = 'CustomerID', y = 'FirstPurchaseAmount')
    plt.title("Customers by First Purchase Amount") # graph title
    plt.xlabel('CustomerID') #x axis label
    plt.ylabel('First Purchase Amount') #y axis label
      
    plt.show()


    The first thing we notice when we run the graph is that there is one major outlier in the data.

    Removing this customer should help improve the model, and we should be able to see more outliers in the graph once it is removed.

  • Examples of using the Python Seaborn Graphs Library

    This is pretty powerful. You can spend your time creating graphs one at a time, or you can create a pairs plot using the Seaborn library. The pairs plot graphs every combination of variables to create multiple graphs. The scatter graphs are below, and the simple code is further down the page. There is a big outlier for the first purchase amount that stands out. We need to fish that one out first and re-run.



    import seaborn as sns
    import matplotlib.pyplot as plt
    import pandas as pd
    
    csvpath = 'outliercheck.csv'
    df = pd.read_csv(csvpath)
    #df.columns # display the columns available
    
    pairplot = sns.pairplot(df)
    
    # Display the plot
    plt.show()

    If we sort the data frame by the 'FirstPurchaseAmount' field, we can have a look at the output.

    # Sort the DataFrame by the 'FirstPurchaseAmount' column
    df_sorted = df.sort_values(by='FirstPurchaseAmount', ascending=False)
    
    # Display the first few rows of the sorted DataFrame
    print(df_sorted.head())

    The record with CustomerID = 1 is the big outlier. It was possibly a test, which raises another question: is there test data that needs to be removed? Anyway, let's remove it.


    We can remove just this one customer using the following code. We can also remove the Age and Gender fields, as they looked normal in the earlier checks, and dropping them will reduce the number of graphs in the pair plot.

    #remove row where CustomerID = 1
    df_filtered = df[df['CustomerID'] != 1]
    # Remove the 'age', 'gender' and CustomerID columns
    df_filtered = df_filtered.drop(['Age', 'Gender', 'CustomerID'], axis=1)


    Below is the output of re-running the pair plot after removing the major outlier; I've also removed the Age, CustomerID, and Gender columns from the output as they aren't of interest.

    Now, we have a clearer view of the data. As expected, there are more outliers to remove. One common approach is the standard deviation method: for example, removing rows that fall more than 3 standard deviations from the mean trims the extreme values at both ends of the data. A minimal sketch of that approach is below.
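
    This sketch assumes the df_filtered dataframe from above, which contains only numeric columns after the drops; the imports are repeated so the snippet stands alone.

    import seaborn as sns
    import matplotlib.pyplot as plt
    
    cleaned = df_filtered.copy()
    for col in cleaned.columns:
        mean, std = cleaned[col].mean(), cleaned[col].std()
        #keep only rows within 3 standard deviations of the column mean
        cleaned = cleaned[(cleaned[col] - mean).abs() <= 3 * std]
    
    pairplot = sns.pairplot(cleaned) #re-run the pair plot on the cleaned data
    plt.show()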