Customer Segmentation in Python
Customer segmentation can help businesses tailor their marketing efforts and
improve customer satisfaction. Here’s how.
Let’s start by stating our goal: By applying RFM analysis and K-means
clustering to this dataset, we’d like to gain insights into customer behavior and
preferences.
First, let’s import the necessary libraries and the specific modules as needed:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
We need pandas and matplotlib for data exploration and visualization, and
the KMeans class from scikit-learn’s cluster module to perform K-Means
clustering.
As mentioned, we’ll use the Online Retail dataset. It contains transactional
records, including purchase dates, quantities, prices, and customer IDs.
Let’s read the data, which is originally in an Excel file, from its URL into a
pandas dataframe.
# Load the dataset from UCI repository
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx"
data = pd.read_excel(url)
Alternatively, you can download the dataset and read the Excel file into a
pandas dataframe locally.
Now let’s start exploring the dataset. Look at the first few rows of the dataset:
data.head()
Output of data.head()
Now call the describe() method on the dataframe to understand the numerical
features better:
data.describe()
We see that the “CustomerID” column is currently a floating point value. When
we clean the data, we’ll cast it into an integer:
Output of data.describe()
Also note that the dataset is quite noisy. The “Quantity” and “UnitPrice”
columns contain negative values:
Output of data.describe()
Let’s take a closer look at the columns and their data types:
data.info()
We see that the dataset has over 541K records and the “Description” and
“CustomerID” columns contain missing values:
Let’s get the count of missing values in each column:
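A minimal sketch of this step follows; it also drops the rows with missing “CustomerID” values and casts the column to an integer, as discussed above (the exact cleanup steps are assumed, since they aren’t shown here):
# Count missing values per column
print(data.isnull().sum())
# Drop rows without a CustomerID and cast the column to an integer
data = data.dropna(subset=['CustomerID'])
data['CustomerID'] = data['CustomerID'].astype(int)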
Also recall that the values in the “Quantity” and “UnitPrice” columns should be
strictly non-negative, but they contain negative values. So let's also drop the
records with negative values for “Quantity” and “UnitPrice”:
# Remove rows with negative Quantity and Price
data = data[(data['Quantity'] > 0) & (data['UnitPrice'] > 0)]
Let’s start out by defining a reference date snapshot_date that’s a day later than
the most recent date in the “InvoiceDate” column:
snapshot_date = max(data['InvoiceDate']) + pd.DateOffset(days=1)
Next, create a “Total” column that contains Quantity*UnitPrice for all the
records:
data['Total'] = data['Quantity'] * data['UnitPrice']
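The “Recency”, “Frequency”, and “MonetaryValue” metrics used below aren’t computed in the excerpt above, so here is a hedged sketch of a typical per-customer RFM aggregation (the column names match those referenced later):
# Aggregate per customer: days since last purchase, number of invoices, total spend
rfm = data.groupby('CustomerID').agg(
    Recency=('InvoiceDate', lambda x: (snapshot_date - x.max()).days),
    Frequency=('InvoiceNo', 'nunique'),
    MonetaryValue=('Total', 'sum')
).reset_index()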
We’ll essentially assign the values to five different bins, and map each bin to a
value. To help us fix the bin edges, let’s use the quantile values of the
“Recency”, “Frequency”, and “MonetaryValue” columns:
rfm.describe()
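The bin edge lists used below aren’t included in the excerpt; here is an illustrative way to define them from the quantiles (only the 250-day recency cut is confirmed by the text that follows; the other cut points are assumptions):
# Illustrative bin edges derived from the quantiles of each metric
recency_bins = [rfm['Recency'].min() - 1, 20, 50, 150, 250, rfm['Recency'].max()]
frequency_bins = [rfm['Frequency'].min() - 1, 2, 3, 10, 100, rfm['Frequency'].max()]
monetary_bins = [rfm['MonetaryValue'].min() - 3, 300, 600, 2000, 5000, rfm['MonetaryValue'].max()]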
Now that we’ve defined the bin edges, let’s map the scores to corresponding
labels between 1 and 5 (both inclusive):
# Calculate Recency score based on custom bins
rfm['R_Score'] = pd.cut(rfm['Recency'], bins=recency_bins, labels=range(1, 6), include_lowest=True)
# Reverse the Recency scores so that higher values indicate more recent
purchases
rfm['R_Score'] = 5 - rfm['R_Score'].astype(int) + 1
Notice that the R_Score, based on the bins, is 1 for recent purchases and 5 for
purchases made over 250 days ago. But we’d like the most recent purchases
to have an R_Score of 5 and purchases made over 250 days ago to have an
R_Score of 1.
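The frequency and monetary scores can be assigned the same way (a sketch, using the frequency_bins and monetary_bins defined earlier); unlike recency, higher raw values are already better, so no reversal is needed:
# Calculate Frequency and Monetary scores based on the custom bins
rfm['F_Score'] = pd.cut(rfm['Frequency'], bins=frequency_bins, labels=range(1, 6), include_lowest=True).astype(int)
rfm['M_Score'] = pd.cut(rfm['MonetaryValue'], bins=monetary_bins, labels=range(1, 6), include_lowest=True).astype(int)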
Let’s look at the first few rows of the R_Score, F_Score, and M_Score
columns:
# Print the first few rows of the RFM DataFrame to verify the scores
print(rfm[['R_Score', 'F_Score', 'M_Score']].head(10))
If you’d like, you can use these R, F, and M scores to carry out an in-depth
analysis. Or use clustering to identify segments with similar RFM
characteristics. We’ll choose the latter!
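The elbow curve mentioned next isn’t shown in the excerpt; here is a minimal sketch of how the feature matrix X and the curve could be produced (clustering directly on the three scores is an assumption on my part):
# Cluster on the R, F, and M scores
X = rfm[['R_Score', 'F_Score', 'M_Score']]
# Elbow method: plot inertia for a range of K values
inertia = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)
plt.plot(range(2, 11), inertia, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Inertia')
plt.show()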
We see that the curve elbows out at 4 clusters. So let’s divide the customer
base into four segments.
We’ve fixed K to 4. So let’s run the K-Means algorithm to get the cluster
assignments for all points in the dataset:
# Perform K-means clustering with best K
best_kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
rfm['Cluster'] = best_kmeans.fit_predict(X)
Now that we have the clusters, let’s try to characterize them based on the
RFM scores.
# Group by cluster and calculate mean values
cluster_summary = rfm.groupby('Cluster').agg({
'R_Score': 'mean',
'F_Score': 'mean',
'M_Score': 'mean'
}).reset_index()
The average R, F, and M scores for each cluster should already give you an
idea of the characteristics.
print(cluster_summary)
But let’s visualize the average R, F, and M scores for the clusters so it’s easy
to interpret:
colors = ['#3498db', '#2ecc71', '#f39c12', '#C9B1BD']
# Grouped bar chart of the average R, F, and M scores per cluster
cluster_summary.set_index('Cluster')[['R_Score', 'F_Score', 'M_Score']].plot(kind='bar', color=colors[:3])
plt.tight_layout()
plt.show()
Let’s visualize the distribution of the different clusters using a pie chart:
cluster_counts = rfm['Cluster'].value_counts()
# Pie chart of the share of customers in each cluster
plt.pie(cluster_counts, labels=cluster_counts.index, autopct='%1.1f%%', colors=colors)
plt.show()
Here we go! For this example, we have quite an even distribution of
customers across segments. So we can invest time and effort in retaining
existing customers, re-engaging with at-risk customers, and educating recent
customers.
Wrapping Up
And that’s a wrap! We went from over 154K customer records to 4 clusters in
7 easy steps. I hope you understand how customer segmentation allows you
to make data-driven decisions that influence business growth and customer
satisfaction by allowing for:
Personalization: Segmentation allows businesses to tailor their marketing
messages, product recommendations, and promotions to each customer
group's specific needs and interests.
Improved Targeting: By identifying high-value and at-risk customers,
businesses can allocate resources more efficiently, focusing efforts where
they are most likely to yield results.
Customer Retention: Segmentation helps businesses create retention
strategies by understanding what keeps customers engaged and satisfied.
As a next step, try applying this approach to another dataset, document your
journey, and share with the community! But remember, effective customer
segmentation and running targeted campaigns requires a good understanding
of your customer base—and how the customer base evolves. So it requires
periodic analysis to refine your strategies over time.
Dataset Credits
The Online Retail dataset used in this tutorial is available from the UCI
Machine Learning Repository.
Bala Priya C is a developer and technical writer from India. She likes working
at the intersection of math, programming, data science, and content creation.
Her areas of interest and expertise include DevOps, data science, and natural
language processing. She enjoys reading, writing, coding, and coffee!
Currently, she's working on learning and sharing her knowledge with the
developer community by authoring tutorials, how-to guides, opinion pieces,
and more. Bala also creates engaging resource overviews and coding
tutorials.
Customer segmentation is important for businesses to understand their target audience. Different
advertisements can be curated and sent to different audience segments based on their
demographic profile, interests, and affluence level.
There are many unsupervised machine learning algorithms that can help companies identify their
user base and create consumer segments.
In this article, we will be looking at a popular unsupervised learning technique called K-Means
clustering.
This algorithm can take in unlabelled customer data and assign each data point to clusters.
The goal of K-Means is to group all the data available into non-overlapping sub-groups that are
distinct from each other.
That means each sub-group/cluster will consist of features that distinguish them from other
clusters.
K-Means clustering is a commonly used technique by data scientists to help companies with
customer segmentation. It is an important skill to have, and most data science interviews will test
your understanding of this algorithm/your ability to apply it to real life scenarios.
Pre-requisites
Make sure to have the following libraries installed before getting started: pandas, numpy,
matplotlib, seaborn, scikit-learn, kneed.
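If any of these are missing, they can usually be installed from the command line with pip, for example:
pip install pandas numpy matplotlib seaborn scikit-learn kneed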
Run the following lines of code to import the necessary libraries and read the dataset:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from kneed import KneeLocator
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D
df = pd.read_csv('Mall_Customers.csv')
df.head()
There are five variables in the dataset. CustomerID is the unique identifier of each customer in
the dataset, and we can drop this variable. It doesn't provide us with any useful cluster
information.
Since gender is a categorical variable, it needs to be encoded and converted into numeric.
All other variables will be scaled to follow a normal distribution before being fed into the model.
We will standardize these variables with a mean of 0 and a standard deviation of 1.
Standardizing variables
First, let's standardize all variables in the dataset to get them around the same scale.
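The scaling code isn't included in the excerpt; here is a minimal sketch using StandardScaler on the numeric columns (the column names are taken from the plots later in the article):
# Standardize the numeric columns to zero mean and unit variance
num_cols = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']
scaler = StandardScaler()
scaled_features = pd.DataFrame(scaler.fit_transform(df[num_cols]), columns=num_cols)
scaled_features.head()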
We can see that all the variables have been transformed, and are now centered around zero.
The variable 'gender' is categorical, and we need to transform this into a numeric variable.
This means that we need to substitute numbers for each category. We can do this with Pandas
using pd.get_dummies().
# One-hot encode Gender and join it to the scaled features
gender = pd.get_dummies(df['Gender'], prefix='Gender')
newdf = scaled_features.join(gender)
newdf = newdf.drop(['Gender_Male'], axis=1)
newdf.head()
The values for 'Gender_Male' can be inferred from 'Gender_Female,' (that is, if
'Gender_Female' is 0, then 'Gender_Male' will be 1 and vice versa).
To learn more about one-hot encoding on categorical variables, you can watch this YouTube
video.
Silhouette coefficient
A silhouette coefficient, or a silhouette score is a metric used to evaluate the quality of clusters
created by the algorithm.
Silhouette scores range from -1 to +1. The higher the silhouette score, the better the model.
The silhouette score measures the distance between all the data points within the same cluster.
The lower this distance, the better the silhouette score.
It also measures the distance between an object and the data points in the nearest cluster. The
higher this distance, the better.
A silhouette score closer to +1 indicates good clustering performance, and a silhouette score
closer to -1 indicates a poor clustering model.
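Formally, for a point with mean intra-cluster distance a and mean nearest-cluster distance b, the silhouette coefficient is s = (b - a) / max(a, b).
The initial model referred to below isn't shown in the excerpt; here is a hedged sketch of fitting a first K-Means model on the preprocessed data and scoring it:
# Fit an initial K-Means model on the scaled, encoded features
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
kmeans.fit(newdf)
# Evaluate cluster quality with the silhouette coefficient
print(silhouette_score(newdf, kmeans.labels_, metric='euclidean'))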
This isn't a bad model, but we can do better and try to get higher cluster separation.
Before we do that, let's visualize the clusters we just built to get an idea of how well the
model is doing:
# Assign each customer to a cluster and add the labels to the data frame
clusters = kmeans.fit_predict(newdf)
newdf["label"] = clusters
# 3D scatter plot of the clusters (Age vs. Annual Income vs. Spending Score)
fig = plt.figure(figsize=(21, 10))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(newdf.Age[newdf.label == 0], newdf["Annual Income (k$)"][newdf.label == 0], newdf["Spending Score (1-100)"][newdf.label == 0], c='blue', s=60)
ax.scatter(newdf.Age[newdf.label == 1], newdf["Annual Income (k$)"][newdf.label == 1], newdf["Spending Score (1-100)"][newdf.label == 1], c='red', s=60)
ax.scatter(newdf.Age[newdf.label == 2], newdf["Annual Income (k$)"][newdf.label == 2], newdf["Spending Score (1-100)"][newdf.label == 2], c='green', s=60)
ax.scatter(newdf.Age[newdf.label == 3], newdf["Annual Income (k$)"][newdf.label == 3], newdf["Spending Score (1-100)"][newdf.label == 3], c='orange', s=60)
ax.view_init(30, 185)
plt.show()
From the above diagram, we can see that the cluster separation isn't great.
The red points are mixed with the blue, and the green are overlapping the orange.
This, along with the silhouette score, shows us that the model isn't performing too well.
Now, let's create a new model that has better cluster separability than this one.
PCA is a technique that helps us reduce the dimension of a dataset. When we run PCA on a data
frame, new components are created. These components explain the maximum variance in the
model.
We can select a subset of these variables and include them into the K-means model.
# Run PCA and inspect the variance explained by each component
pca = PCA(n_components=4)
principalComponents = pca.fit_transform(newdf.drop('label', axis=1))  # exclude the earlier cluster labels
features = range(pca.n_components_)
plt.bar(features, pca.explained_variance_ratio_, color='black')
plt.xlabel('PCA features')
plt.ylabel('variance %')
plt.xticks(features)
plt.show()
PCA_components = pd.DataFrame(principalComponents)
Based on this visualization, we can see that the first two PCA components explain around 70%
of the dataset variance.
Let's build the model again with the first two principal components, and decide on the number of
clusters to use:
ks = range(1, 10)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k)
    model.fit(PCA_components.iloc[:, :2])
    inertias.append(model.inertia_)
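To pick the number of clusters, the inertias can be plotted, or passed to KneeLocator (imported earlier) to find the elbow programmatically; a sketch:
# Plot the elbow curve
plt.plot(ks, inertias, '-o', color='black')
plt.xlabel('Number of clusters, k')
plt.ylabel('Inertia')
plt.xticks(ks)
plt.show()
# Locate the elbow programmatically; the analysis below proceeds with 4 clusters
kl = KneeLocator(list(ks), inertias, curve='convex', direction='decreasing')
print(kl.elbow)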
We can calculate the silhouette score for this model with 4 clusters:
model = KMeans(n_clusters=4)
model.fit(PCA_components.iloc[:,:2])
# silhouette score
print(silhouette_score(PCA_components.iloc[:, :2], model.labels_, metric='euclidean'))
The silhouette score of this model is 0.42, which is better than the previous model we created.
We can visualize the clusters for this model just like we did earlier:
# Re-fit K-Means on the first two principal components and visualize the four clusters
model = KMeans(n_clusters=4)
clusters = model.fit_predict(PCA_components.iloc[:, :2])
newdf["label"] = clusters
fig = plt.figure(figsize=(21, 10))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(newdf.Age[newdf.label == 0], newdf["Annual Income (k$)"][newdf.label == 0], newdf["Spending Score (1-100)"][newdf.label == 0], c='blue', s=60)
ax.scatter(newdf.Age[newdf.label == 1], newdf["Annual Income (k$)"][newdf.label == 1], newdf["Spending Score (1-100)"][newdf.label == 1], c='red', s=60)
ax.scatter(newdf.Age[newdf.label == 2], newdf["Annual Income (k$)"][newdf.label == 2], newdf["Spending Score (1-100)"][newdf.label == 2], c='green', s=60)
ax.scatter(newdf.Age[newdf.label == 3], newdf["Annual Income (k$)"][newdf.label == 3], newdf["Spending Score (1-100)"][newdf.label == 3], c='orange', s=60)
ax.view_init(30, 185)
plt.show()
Model 1 vs Model 2
Let's compare the cluster separability of this model to that of the first model:
Model 1 (left) vs Model 2 (right)
Notice that the clusters in the second model are much better separated than those in the first
model.
For these reasons, we can pick the second model to go forward with our analysis.
Cluster Analysis
Now that we're done building these different clusters, let's try to interpret them and look at the
different customer segments.
First, let's map the clusters back to the dataset and take a look at the head of the data frame.
df = pd.read_csv('Mall_Customers.csv')
df = df.drop(['CustomerID'],axis=1)
pred = model.predict(PCA_components.iloc[:,:2])
frame = pd.DataFrame(df)
frame['cluster'] = pred
frame.head()
Notice that each row in the data frame is now assigned to a cluster.
To compare attributes of the different clusters, let's find the average of all variables across each
cluster:
# Average each numeric variable by cluster
avg_df = frame.groupby('cluster', as_index=False).mean(numeric_only=True)
for col in ['Age', 'Spending Score (1-100)', 'Annual Income (k$)']:
    sns.barplot(x='cluster', y=col, data=avg_df)
    plt.show()
Cluster 0:
High average annual income, low spending.
Mean age is around 40 and gender is predominantly male.
Cluster 1:
Cluster 2:
Cluster 3:
Also, females are more highly represented in the entire dataset, which is why most clusters
contain a larger number of females than males. We can find the percentage of each gender
relative to the numbers in the entire dataset to give us a better idea of gender distribution.
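A quick, hedged sketch of one way to compute that with a crosstab, normalizing over each gender's total count in the dataset:
# Share of each gender that falls into every cluster
gender_pct = pd.crosstab(frame['cluster'], frame['Gender'], normalize='columns') * 100
print(gender_pct)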
Building personas around each cluster
Now that we know the attributes of each cluster, we can build personas around them.
Being able to tell a story around your analysis is an important skill to have as a data scientist.
This will help your clients or stakeholders understand your findings more easily.
This persona consists of middle-aged individuals who are very careful with money.
Despite having the highest average income compared to individuals in all other clusters, they
spend the least.
This might be because they have financial responsibilities - like saving up for their kid's higher
education.
Recommendation: Promos, coupons, and discount codes will attract individuals in this segment
due to their tendency to spend less.
They earn less and spend less, and are probably saving up for retirement.
Recommendation: Marketing to these individuals can be done through Facebook, which appeals
to an older demographic. Promote healthcare related products to people in this segment.
These are enthusiastic young individuals who enjoy living a good lifestyle, and tend to spend
above their means.
Recommendation: Since these are young individuals who spend a lot, providing them with travel
coupons or hotel discounts might be a good idea. Providing them with discounts off top clothing
and makeup brands would also work well for this segment.
These are individuals who have worked hard to build up a significant amount of wealth.
These individuals have likely just started a family, and are leading baby or family-focused
lifestyles. It is a good idea to promote baby or child related products to these individuals.
Recommendation: Due to their large spending capacity and their demographic, these individuals
are likely to be looking for properties to buy or invest in. They are also more likely than all other
segments to take out housing loans and make serious financial commitments.
Conclusion
We have successfully built a K-Means clustering model for customer segmentation. We also
explored cluster interpretation, and analyzed the behaviour of individuals in each cluster.
Finally, we took a look at some business recommendations that could be provided based on the
attributes of each individual in the cluster.
You can use the analysis above as starter code for any clustering or segmentation project in the
future.
Introduction
Let’s say you decided to buy a t-shirt from a brand online. Have you
ever thought about who else bought the same t-shirt?
People who are similar to you, right? Same age, same hobbies, same
gender, etc.
Business Scenario
All data has been collected through the loyalty cards they use at
checkout :)
We will utilize K-Means and PCA algorithms for this project and see
how we define new grouped customers!
Variable Description
Sex: 0: male, 1: female
Marital status: 0: single
Education: 0: other/unknown, 1: high school, 2: university, 3: graduate school
Occupation: 0: unemployed/unskilled
Settlement size: The size of the city that the customer lives in. 0: small city, 1: mid-sized city, 2: big city
We have datasets and know the business problem. Now, Let’s start
coding!
Importing Libraries
In this project, we will need a few libraries to help us along the way!
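The exact import cell isn't included in the excerpt; here is a minimal sketch (the CSV file name is a placeholder for wherever the loyalty-card data is saved):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
# Load the customer data collected through the loyalty cards (file name is illustrative)
df = pd.read_csv('segmentation_data.csv')
df.describe()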
We see that the mean of Age and Income is 35.90 and 120,954, respectively.
The describe() method is very useful for numerical columns.
df.info()
The df.info() method returns information about the DataFrame, including
the index data type and columns, non-null values, and memory usage.
We see that there are no missing values in the dataset and all the
variables are integers.
The correlation between Income and Occupation is 0.68. That means that if you
have a higher salary, you are more likely to have a higher-level occupation, such as a manager.
The next section will cover the segmentation. But before that, we need to scale
our data first.
4. Data Preprocessing
In general, we want to treat all the features equally, and we can achieve
that by transforming the features in such a way that their values fall
within the same numerical range, such as [0, 1].
Standardization
But How?
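A sketch of the scaling step and of the WCSS (elbow) curve discussed next, using StandardScaler and KMeans from scikit-learn:
# Standardize the features so they are all on the same scale
scaler = StandardScaler()
df_std = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
# Within-cluster sum of squares (inertia) for K = 1..10
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init='k-means++', random_state=42)
    km.fit(df_std)
    wcss.append(km.inertia_)
plt.plot(range(1, 11), wcss, marker='o', linestyle='--')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()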
The elbow in the graph is the four-cluster mark. This is the only place
until which the graph is steeply declining while smoothing out
afterward.
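With K fixed at 4, the model can be fit on the standardized data (a sketch):
# Fit the final K-Means model with 4 clusters
kmeans = KMeans(n_clusters=4, init='k-means++', random_state=42)
kmeans.fit(df_std)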
# We create a new data frame with the original features and add a new
# column with the assigned clusters for each point.
df_segm_kmeans = df.copy()
df_segm_kmeans['Segment K-means'] = kmeans.labels_
Let’s group the customers by clusters and see the average values for
each variable.
df_segm_analysis = df_segm_kmeans.groupby(['Segment K-means']).mean()
df_segm_analysis
It has almost the same number of men and women with an average age
of 56. Compared to other clusters, we realize that this is the oldest
segment.
This segment has the lowest values for the annual salary.
With low income living in small cities, it seems that this is a segment of
people with fewer opportunities.
This is the youngest segment with an average age of 29. They have a
medium level of education and an average income.
They also seem average on every other parameter, so we can label the
segment average or standard.
We can conclude that K-Means did a decent job! However, it’s hard to
separate segments from each other.
In the next section, we will combine PCA and K-Means to try to get a
better result.
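The PCA step itself isn't shown in the excerpt; here is a hedged sketch with three components (three is what the loadings table below uses):
# Fit PCA with 3 components on the standardized data
pca = PCA(n_components=3)
pca.fit(df_std)
# Loadings of each original feature on each component
print(pca.components_)
# Project the data onto the three components for later clustering
scores_pca = pca.transform(df_std)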
For instance, the first value of the array shows the loading of the first
feature on the first component.
Let’s put this information in a pandas data frame so that we can see it
nicely. The columns are the seven original features and the rows are the three
components that PCA gave us.
df_pca_comp = pd.DataFrame(data=pca.components_,
                           columns=df.columns,
                           index=['Component 1', 'Component 2', 'Component 3'])
df_pca_comp
plt.figure(figsize=(12,9))
sns.heatmap(df_pca_comp,
vmin = -1,
vmax = 1,
cmap = 'RdBu',
annot = True)
plt.yticks([0, 1, 2],
['Component 1', 'Component 2', 'Component 3'],
rotation = 45,
fontsize = 12)
plt.title('Components vs Original Features',fontsize = 14)
plt.show()
We see that there is a positive correlation between Component 1
and Age, Income, Occupation, and Settlement size. These are strictly
related to the career of a person. So this component shows the career
focus of the individual.
For the second component Sex, Marital status and Education are by
far the most prominent determinants.
For the final component, we realize that Age, Marital Status, and
Occupation are the most important features. We observed that marital
status and occupation load negatively but are still important.
Our new dataset is ready! It’s time to apply K-Means to our brand new
dataset with 3 components.
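A sketch of fitting K-Means on the PCA scores, again with 4 clusters as described below:
# Fit K-Means with 4 clusters on the PCA-transformed data
kmeans_pca = KMeans(n_clusters=4, init='k-means++', random_state=42)
kmeans_pca.fit(scores_pca)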
The K-Means algorithm has learnt from our new components and created 4
clusters. Let's look at the original dataset together with the new components and
labels.
# Combine the original features with the PCA scores and the new cluster labels
df_segm_pca_kmeans = pd.concat([df.reset_index(drop=True), pd.DataFrame(scores_pca)], axis=1)
df_segm_pca_kmeans.columns.values[-3:] = ['Component 1', 'Component 2', 'Component 3']
df_segm_pca_kmeans['Segment K-means PCA'] = kmeans_pca.labels_
df_segm_pca_kmeans.head()
That was one of the biggest goals of PCA: to reduce the number of
variables by combining them into bigger ones.
“Don’t find customers for your products, find products for your
customers.”
— Seth Godin
Conclusion
If you want to see the entire code in Jupyter notebook, it can be found
on my Github.