Supermarket Example

Import libraries and read the CSV file into a data frame.
The data comes from Kaggle (the Mall Customers dataset).

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans

df = pd.read_csv("Supermarket Customers.csv")
df.info()


First, we run some basic checks for data integrity.
The data contains 200 rows and 5 columns.
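A couple of extra quick checks (these particular calls are our addition, not output shown in the original post):

df.shape   # (rows, columns) - expect (200, 5)
df.head()  # preview the first five rows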



We can change Gender to numeric with the following:

df['Gender'] = df['Gender'].replace({'Female': 0, 'Male': 1})
df.describe()
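As a quick sanity check (our addition), value_counts confirms the mapping left only the two expected codes:

df['Gender'].value_counts()  # expect only 0 (Female) and 1 (Male)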





Then we check for nulls as follows:


df.isnull().sum() # check for nulls - gives a count of null values by column name.

Get counts of unique values


len(df['CustomerID'].unique()) #how many unique values are there


CustomerID: 200
Gender: 2
Age: 51
Annual Income (k$): 65
Spending Score (1-100): 85
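To get all of these counts in one call (our addition, not in the original post), pandas can report per-column unique counts directly:

df.nunique()  # unique value count for every column at once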

Fixing Null Values
Annual Income: We have 2 values missing from the Annual Income field. We can either remove these rows from the data or use the column average to fill in the gaps (a sketch of the latter is below).
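A minimal sketch of the imputation alternative, in case we wanted to keep the rows (the walkthrough below drops them instead):

# Fill missing incomes with the column mean rather than dropping rows
df['Annual Income (k$)'] = df['Annual Income (k$)'].fillna(df['Annual Income (k$)'].mean())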

Drop rows where annual income (k$) is null

df.dropna(subset=['Annual Income (k$)'], inplace=True)
df.isnull().sum() #re-run check for missing data

Then do the same for spending score:

df.dropna(subset=['Spending Score (1-100)'], inplace=True)
df.isnull().sum() #re-run check for missing data
df.info() # the number of rows is reduced to 197.

We can run the seaborn pairplot to plot each pairwise combination of variables.

pairplot = sns.pairplot(df)
plt.show()


From here we can see some distinct-looking clusters around annual income and spending score. It also makes sense that these two variables would be related, so we can use them in the k-means model.

For the K-means model, we need to determine the value of K, which is the number of clusters we want to identify.
We use the elbow method to do this as follows:

# Extracting features for clustering
X = df[['Annual Income (k$)', 'Spending Score (1-100)']]

# Using the elbow method to determine the optimal number of clusters
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
# Plotting the elbow graph
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

The graph produced by the elbow method is below. For the value of K, we select the point where the rate of decrease in WCSS slows dramatically (the elbow). In this case, we select 5 clusters.
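As a rough numeric companion to the elbow graph (our addition, not part of the original analysis), we can print how much WCSS drops with each extra cluster; the improvement should tail off around the elbow:

# Percentage drop in WCSS for each additional cluster (uses wcss from above)
for k in range(1, 10):
    drop = (wcss[k - 1] - wcss[k]) / wcss[k - 1] * 100
    print(f"{k} -> {k + 1} clusters: WCSS falls by {drop:.1f}%")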

K-Means Model

Now we have the number of clusters to be used in the model, we can run the K-Means model.

## Use 5 clusters based on the elbow graph

# Fitting K-Means to the dataset with the optimal number of clusters (5, from the elbow graph)
kmeans = KMeans(n_clusters=5, init='k-means++', max_iter=300, n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(X)

# Visualizing the clusters
plt.scatter(X.iloc[y_kmeans == 0, 0], X.iloc[y_kmeans == 0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(X.iloc[y_kmeans == 1, 0], X.iloc[y_kmeans == 1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(X.iloc[y_kmeans == 2, 0], X.iloc[y_kmeans == 2, 1], s=100, c='green', label='Cluster 3')
plt.scatter(X.iloc[y_kmeans == 3, 0], X.iloc[y_kmeans == 3, 1], s=100, c='orange', label='Cluster 4')
plt.scatter(X.iloc[y_kmeans == 4, 0], X.iloc[y_kmeans == 4, 1], s=100, c='purple', label='Cluster 5')

# Plotting the centroids of the clusters
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroids')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()

The graph is below, with the 5 identified clusters colored accordingly. Once the clusters are identified, we can use the labels to segment our data, which can then be used to determine, for example, the best marketing campaign for each segment, using A/B testing and t-tests for significance.
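A minimal sketch of that segmentation step (the 'Cluster' column name is our choice, not from the original post): attach the labels back to the data frame and profile each segment:

# Attach cluster labels to the customers and summarize each segment
df['Cluster'] = y_kmeans
df.groupby('Cluster')[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']].mean()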

