This is pretty powerful. Instead of spending your time creating graphs one at a time, you can create a pairs plot using the Seaborn library. The pairs plot graphs every pairwise combination of variables in a single grid of scatter graphs. The simple code that produces it is below. There is a big outlier in the first purchase amount that stands out. We need to fish that one out first and re-run.
import seaborn as sns
import matplotlib.pyplot as plt
#from sklearn.datasets import fetch_california_housing
import pandas as pd
# Path to the CSV file holding the customer data
csvlocation = 'outliercheck.csv'
df = pd.read_csv(csvlocation)
#df.columns # display the columns available
pairplot = sns.pairplot(df)
# Display the plot
plt.show()
If we sort the data frame by the ‘FirstPurchaseAmount’ field, largest first, we can have a look at the output.
# Sort the DataFrame by the 'FirstPurchaseAmount' column, largest values first
df_sorted = df.sort_values(by='FirstPurchaseAmount', ascending=False)
# Display the first few rows of the sorted DataFrame
print(df_sorted.head())
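If only the top few rows matter, pandas also offers `nlargest`, which avoids sorting and then slicing. The snippet below is a minimal sketch on made-up data: the column names mirror the ones used in this post, but the values and the `toy` frame are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; column names follow the post, values are invented.
toy = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5, 6],
    "FirstPurchaseAmount": [99999.0, 120.0, 85.5, 240.0, 60.0, 150.0],
})

# nlargest returns the n rows with the highest values, ordered largest first.
top_rows = toy.nlargest(3, "FirstPurchaseAmount")
print(top_rows)
```

On the real data, the equivalent call would be `df.nlargest(5, 'FirstPurchaseAmount')`.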
The row with CustomerID = 1 is the big outlier. It was possibly a test, which raises another question: is there other test data that needs to be removed? Anyway, let’s remove it.
We can remove just this one customer using the following code. We can also drop the Age and Gender fields, as they looked normal in an earlier check, along with CustomerID, which is only an identifier. This reduces the number of graphs in the pair plot.
# Remove the row where CustomerID = 1
df_filtered = df[df['CustomerID'] != 1]
# Remove the 'Age', 'Gender' and 'CustomerID' columns
df_filtered = df_filtered.drop(['Age', 'Gender', 'CustomerID'], axis=1)
Below is the output of re-running the graphs after the major outlier is removed. I’ve also removed the Age, CustomerID, and Gender columns from the output as they aren’t of interest.
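Now we have a clearer view of the data. As expected, there are more outliers to remove. We can use the 3x standard deviation method, which drops any value more than three standard deviations from the mean. For roughly normally distributed data that trims only about 0.3% of the points in total. Cutting the top and bottom 2.5% would correspond to about two standard deviations instead.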
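As a rough sketch of how the 3x standard deviation rule could be applied with pandas: the helper name `remove_outliers_3sigma`, its parameters, and the toy data are assumptions for illustration, not code from the original analysis. It uses the sample standard deviation (the pandas default) and keeps only rows that fall within the limit in every listed column.

```python
import pandas as pd

def remove_outliers_3sigma(frame, columns, n_std=3.0):
    """Keep rows within n_std standard deviations of the mean in every listed column.

    Hypothetical helper for illustration. Rows with missing values in any of
    the listed columns are dropped as well, because the comparison is False.
    """
    keep = pd.Series(True, index=frame.index)
    for col in columns:
        mean = frame[col].mean()
        std = frame[col].std()  # sample standard deviation (ddof=1)
        keep &= (frame[col] - mean).abs() <= n_std * std
    return frame[keep]

# Hypothetical toy data: twenty ordinary purchases and one extreme value.
toy = pd.DataFrame({"FirstPurchaseAmount": [100.0] * 10 + [110.0] * 10 + [5000.0]})

cleaned = remove_outliers_3sigma(toy, ["FirstPurchaseAmount"])
print(len(toy), "rows before,", len(cleaned), "rows after")
```

On the data in this post, the call would look something like `remove_outliers_3sigma(df_filtered, ['FirstPurchaseAmount'])`. Note that a single extreme value inflates both the mean and the standard deviation, so it can be worth re-running the filter after the worst offenders are gone.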