Supermarket Example
Import libraries and read in CSV file to data frame
The data comes from Kaggle (the Mall Customer Segmentation dataset).
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
df = pd.read_csv("Supermarket Customers.csv")
df.info()
First, we run some basic checks for data integrity.
The data includes 200 rows and 5 columns.
We can change Gender to numeric with the following
df['Gender'].replace('Female', 0, inplace=True)
df['Gender'].replace('Male', 1, inplace=True)
df.describe()
Then we check for nulls as follows:
df.isnull().sum() #check for nulls - gives a count of nulls values by column name.
Get counts of unique values
len(df['CustomerID'].unique()) #how many unique values are there
CustomerID: 200
Gender: 2
Age: 51
Annual Income (k$): 65
Spending Score (1-100): 85
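Rather than calling `unique()` column by column, pandas can produce all of these counts in one call with `nunique()`. A minimal sketch on a toy frame standing in for the supermarket data (the values below are illustrative, not from the real file):

```python
import pandas as pd

# Toy frame standing in for the supermarket data (illustrative values only)
df = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Age': [19, 21, 20, 23],
})

# nunique() returns the count of distinct values for every column at once
counts = df.nunique()
print(counts)
```

On the real data this reproduces the table above (200 CustomerIDs, 2 genders, and so on) in a single line.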
Fixing Null Values
Annual Income: We have 2 values missing from the Annual Income field. We can remove these rows from the data or use the averages to fill in the gaps.
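A sketch of the fill-with-average alternative, using `fillna` with the column mean (here we go on to drop the rows instead; the values below are illustrative, not from the real file):

```python
import pandas as pd
import numpy as np

# Toy frame with two missing incomes (illustrative values only)
df = pd.DataFrame({'Annual Income (k$)': [15.0, np.nan, 54.0, np.nan, 78.0]})

# Fill nulls with the column mean instead of dropping the rows
mean_income = df['Annual Income (k$)'].mean()
df['Annual Income (k$)'] = df['Annual Income (k$)'].fillna(mean_income)

print(df['Annual Income (k$)'].isnull().sum())  # no nulls remain
```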
Drop rows where annual income (k$) is null
df.dropna(subset=['Annual Income (k$)'], inplace=True)
df.isnull().sum() #re-run check for missing data
Then do the same for spending score:
df.dropna(subset=['Spending Score (1-100)'], inplace=True)
df.isnull().sum() #re-run check for missing data
df.info() # the number of rows is reduced to 197.
We can run the seaborn pairplot to plot each pairwise combination of variables.
pairplot = sns.pairplot(df)
plt.show()
From here we can see some very distinct-looking clusters across annual income and spending score. It also makes sense that these two would be related, so we can use them in the K-means model.
For the K-means model, we need to determine the value of K which is the number of clusters we want to identify.
We use the elbow method to do this as follows:
# Extracting features for clustering
X = df[['Annual Income (k$)', 'Spending Score (1-100)']]
# Using the elbow method to determine the optimal number of clusters
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
# Plotting the elbow graph
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
The graph produced from the elbow method is below. For the value of K, we select the point where the rate of decrease in WCSS drops sharply (the elbow). In this case, we select 5 clusters.
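As a complementary check on the chosen K (this step is an addition, not part of the original walkthrough), the silhouette score can be compared across candidate values; the best K tends to score highest. A sketch on synthetic 2-D data with 5 well-separated groups, standing in for the income/spending features:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D data with 5 well-separated groups (illustrative stand-in
# for the Annual Income / Spending Score features)
X, _ = make_blobs(n_samples=200, centers=5, cluster_std=0.6, random_state=42)

# Silhouette score for a few candidate values of K
scores = {}
for k in (3, 5, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))
```

Here the score peaks at the true number of groups, agreeing with what the elbow suggests.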
K-Means Model
Now we have the number of clusters to be used in the model, we can run the K-Means model.
# Fitting K-Means to the dataset with 5 clusters, based on the elbow graph
kmeans = KMeans(n_clusters=5, init='k-means++', max_iter=300, n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(X)
# Visualizing the clusters
plt.scatter(X.iloc[y_kmeans == 0, 0], X.iloc[y_kmeans == 0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(X.iloc[y_kmeans == 1, 0], X.iloc[y_kmeans == 1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(X.iloc[y_kmeans == 2, 0], X.iloc[y_kmeans == 2, 1], s=100, c='green', label='Cluster 3')
plt.scatter(X.iloc[y_kmeans == 3, 0], X.iloc[y_kmeans == 3, 1], s=100, c='orange', label='Cluster 4')
plt.scatter(X.iloc[y_kmeans == 4, 0], X.iloc[y_kmeans == 4, 1], s=100, c='purple', label='Cluster 5')
# Plotting the centroids of the clusters
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroids')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
The graph is below, with the 5 identified clusters colored accordingly. Once the clusters are identified, we can use them to segment our customers, which can then inform, for example, which marketing campaigns work best for each segment, using A/B testing and t-tests for significance.
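To use the segments downstream, the cluster labels can be attached back to the data frame and each segment profiled by its averages. A minimal sketch on synthetic stand-in data (the real walkthrough would use the `X` built from the CSV above):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the two features used in the model
features, _ = make_blobs(n_samples=200, centers=5, cluster_std=0.6, random_state=42)
X = pd.DataFrame(features, columns=['Annual Income (k$)', 'Spending Score (1-100)'])

# Attach the cluster label to each customer row
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
X['Cluster'] = kmeans.fit_predict(X[['Annual Income (k$)', 'Spending Score (1-100)']])

# Per-cluster averages characterise each segment
profile = X.groupby('Cluster').mean()
print(profile)
```

The profile table makes it easy to name segments (e.g. high income / low spend) before designing campaigns for each.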