FIG. 02 Customer Segmentation, Unsupervised
Which customers
earn their keep.
A segmentation model that sorts every customer by value. Aims every marketing dollar at the 21% who generate 64% of revenue. The other 79% stop draining the budget.
Below: a live classifier, segment profiles, and a campaign ROI simulator. Try the ROI simulator first. Every chart is computed in your browser from the JSON the Python script writes to /assets/data/segmentation/. No hosted inference, no screenshots, no hand-picked numbers.
§ I Why segmentation matters
Mass marketing dilutes both signal and budget. A 1% response rate from a blanket email is a 99% waste — and many of the 1% who responded were going to buy anyway. Segmentation replaces that with targeted outreach: the right message to the right subset, priced against the revenue each segment actually produces.
Well-run segmentation programs routinely deliver double-digit percentage-point lift in campaign ROI over blanket sends. This page shows the method — RFM features, three clustering algorithms compared, each segment profiled against its revenue contribution — run against the public UCI Online Retail II dataset so every number is reproducible from the linked script.
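The RFM features behind everything below reduce each customer's history to three numbers: days since last purchase (Recency), distinct orders (Frequency), and total spend (Monetary). A minimal sketch of that aggregation, assuming column names as they appear in the Online Retail II CSV (`Invoice`, `InvoiceDate`, `Quantity`, `Price`, `Customer ID`) — the actual script may differ:

```python
import pandas as pd

def rfm_table(df: pd.DataFrame, snapshot: pd.Timestamp) -> pd.DataFrame:
    """Aggregate raw transactions into one R/F/M row per customer.

    Column names follow the Online Retail II CSV; adjust for your schema.
    """
    df = df.copy()
    df["Revenue"] = df["Quantity"] * df["Price"]
    return df.groupby("Customer ID").agg(
        recency=("InvoiceDate", lambda s: (snapshot - s.max()).days),
        frequency=("Invoice", "nunique"),
        monetary=("Revenue", "sum"),
    )

# Toy example: two customers, three line items
df = pd.DataFrame({
    "Customer ID": [1, 1, 2],
    "Invoice": ["A", "B", "C"],
    "InvoiceDate": pd.to_datetime(["2011-12-01", "2011-12-05", "2011-11-01"]),
    "Quantity": [2, 1, 3],
    "Price": [5.0, 10.0, 4.0],
})
rfm = rfm_table(df, snapshot=pd.Timestamp("2011-12-10"))
```

Recency is anchored to a snapshot date rather than "today" so the features stay stable across reruns of a static dataset.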
The winning model here is K-Means at k=4. Four segments unpack a real 1M-transaction book into groups with dramatically different economics — as you're about to see, one of them earns roughly 64% of revenue on 21% of customers.
§ II Data Profile
§ III Three clustering methods, one dataset
All three methods run on the same log-scaled R/F/M matrix. Higher silhouette and Calinski-Harabasz are better; lower Davies-Bouldin is better. No single metric decides on its own — silhouette is the primary criterion here. The winner row is highlighted.
| Method | Best k | Silhouette | Calinski Harabasz | Davies Bouldin | Inertia |
|---|---|---|---|---|---|
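A hedged sketch of how such a comparison table can be computed with scikit-learn. The page does not name the two non-K-Means methods, so Agglomerative clustering and a Gaussian mixture stand in here as assumptions, and a synthetic blob matrix stands in for the real R/F/M data:

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import (
    silhouette_score, calinski_harabasz_score, davies_bouldin_score,
)

def score_method(X, labels):
    """The three internal validity metrics used in the table above."""
    return {
        "silhouette": silhouette_score(X, labels),          # higher is better
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher
        "davies_bouldin": davies_bouldin_score(X, labels),  # lower is better
    }

rng = np.random.default_rng(42)
# Toy stand-in for the log-scaled R/F/M matrix: four separated blobs in 3-D
X = np.vstack([rng.normal(c, 0.3, (40, 3)) for c in (0, 2, 4, 6)])

methods = {
    "kmeans": KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X),
    "agglomerative": AgglomerativeClustering(n_clusters=4).fit_predict(X),
    "gmm": GaussianMixture(n_components=4, random_state=42).fit(X).predict(X),
}
table = {name: score_method(X, labels) for name, labels in methods.items()}
```

Each row of `table` corresponds to one row of the comparison above; on well-separated data at the right k, all three metrics tend to agree.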
§ IV · FIG. 02.1 K selection — silhouette and elbow, all three methods
Left: silhouette score vs k — higher is tighter clusters. Accent dot marks the chosen operational k. The k=2 column is shaded and labeled (degenerate): it wins on score because RFM data has a strong bimodal Pareto split, but two clusters is operationally useless for targeted outreach. Right: K-Means inertia elbow — diminishing returns after the operational k.
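The sweep behind both panels is a single loop over candidate k: fit K-Means, record the silhouette score for the left panel and the inertia for the elbow on the right. A minimal sketch on synthetic data with four planted clusters:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Four well-separated 3-D clusters standing in for the real feature matrix
X = np.vstack([rng.normal(c, 0.25, (40, 3)) for c in (0, 2, 4, 6)])

scores, inertias = {}, {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X)
    scores[k] = silhouette_score(X, km.labels_)   # left panel
    inertias[k] = km.inertia_                     # right panel (elbow)
```

Inertia always falls as k grows, which is why the elbow is read for diminishing returns rather than a minimum; silhouette, by contrast, peaks at the planted k here.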
§ V · FIG. 02.2 The segment map — 2,000 customers, PCA(2)
Each point is one customer. Axes are the top two principal components of the log-scaled R/F/M matrix. Hover a point for raw values. Click a legend entry to isolate a cluster.
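The 2-D coordinates for this map come from projecting the log-scaled, standardized R/F/M matrix onto its top two principal components. A sketch of that projection, with a lognormal matrix standing in for real R/F/M values:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Stand-in for 2,000 customers' raw R/F/M values (heavy-tailed, like spend)
rfm = rng.lognormal(mean=2.0, sigma=1.0, size=(2000, 3))

# Same preprocessing as clustering: log1p to tame the tail, then standardize
X = StandardScaler().fit_transform(np.log1p(rfm))

pca = PCA(n_components=2)
coords = pca.fit_transform(X)  # (2000, 2) — the scatter's x/y
```

Because the same fitted PCA is reused for the live classifier's inset, a new customer lands in exactly this coordinate system.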
§ VI · FIG. 02.3 Classify a customer, live
Inputs are log1p-transformed and standardized with the same scaler the model was trained on, then assigned to the nearest K-Means centroid by Euclidean distance. Move the sliders — assignment updates in real time.
- Customers: —
- Of base: —
- Of revenue: —
This is the same K-Means model used in Fig. 02.2. The inset re-projects through the same PCA so the "you" dot lives in the same space as the scatter above.
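The assignment logic the browser replicates can be sketched in a few lines: apply the training-time log1p + StandardScaler transform to the slider values, then take the nearest K-Means centroid by Euclidean distance. Toy training data stands in for the real matrix here:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Fit scaler and model on toy data standing in for the training matrix
rng = np.random.default_rng(42)
rfm = rng.lognormal(mean=2.0, sigma=1.0, size=(500, 3))
scaler = StandardScaler().fit(np.log1p(rfm))
km = KMeans(n_clusters=4, random_state=42, n_init=10).fit(
    scaler.transform(np.log1p(rfm))
)

def classify(recency: float, frequency: float, monetary: float) -> int:
    """Same transform as training, then nearest-centroid assignment."""
    x = scaler.transform(np.log1p([[recency, frequency, monetary]]))
    dists = np.linalg.norm(km.cluster_centers_ - x, axis=1)
    return int(np.argmin(dists))
```

Nearest-centroid-by-Euclidean-distance is exactly what `KMeans.predict` does, so the hand-rolled version and the library call must agree — which is what makes the in-browser port safe.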
§ VII · FIG. 02.4 Segment profiles
One card per cluster. The small bars show where each segment's median customer lands on the overall distribution for Recency, Frequency, and Monetary. The revenue-vs-customers bar tells you whether the segment is pulling its weight.
§ VIII · FIG. 02.5 Campaign ROI simulator
Pick a segment, set a per-customer campaign cost and an expected lift multiplier. The baseline response rate is the segment's own measured 60-day repurchase rate — not a fabricated number. The ROI math and all intermediates are computed live.
- Expected responders: size × repurchase rate
- Lifted response rate: baseline × lift
The 60-day repurchase rate is measured directly from the data — fraction of each segment's customers who bought in the last 60 days of the snapshot. The lift multiplier is user-adjustable — industry direct-response benchmarks typically land between 1.3× and 2.0× for win-back campaigns. Your mileage, as always, varies.
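The simulator's arithmetic can be sketched as below. The page shows the intermediates `size × repurchase rate` and `baseline × lift`; the revenue-per-responder term (`avg_order_value` here) is my assumption for closing the loop to ROI, not something the page states:

```python
def campaign_roi(segment_size: int, repurchase_rate: float,
                 avg_order_value: float, cost_per_customer: float,
                 lift: float = 1.5) -> float:
    """ROI of a targeted campaign against the segment's measured baseline.

    Assumed formula, matching the intermediates shown on the page:
      baseline responders = size x repurchase rate
      lifted rate         = baseline x lift (capped at 100%)
    Only responders above baseline count as incremental revenue.
    """
    baseline_responders = segment_size * repurchase_rate
    lifted_responders = segment_size * min(repurchase_rate * lift, 1.0)
    incremental_revenue = (lifted_responders - baseline_responders) * avg_order_value
    cost = segment_size * cost_per_customer
    return (incremental_revenue - cost) / cost
```

For example, a 1,000-customer segment with a 20% baseline rate, a hypothetical £50 average order, £1 per-customer cost, and 1.5× lift yields 100 incremental responders and an ROI of 4.0.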
§ IX Methodology & Colophon
UCI ML Repository · Online Retail II ↗ — two years of UK online retail transactions, roughly 1M rows before cleaning.
notebooks/segmentation_model.py ↗ — RFM aggregation, log1p + StandardScaler, three clustering methods × k grid, silhouette / Calinski-Harabasz / Davies-Bouldin, PCA(2) for the map, deterministic percentile-band auto-namer.
random_state=42 everywhere. Last regenerated —. Running python notebooks/segmentation_model.py twice on the same CSV produces byte-identical JSON.
Static snapshot — no drift between training and today. RFM captures transaction patterns but misses product affinity and channel mix. The live classifier snaps a new customer to the nearest frozen centroid; a production program would retrain on a schedule. Four segments is deliberately few for readability — real programs usually run a dozen or more.