ML.EVAL.CLUSTERING.SILHOUETTE_SCORE¶
Returns the mean silhouette coefficient over all samples.
Syntax¶
=ML.EVAL.CLUSTERING.SILHOUETTE_SCORE(X, labels, [metric], [sample_size], [random_state])
Arguments¶
| Name | Type | Default | Description |
|---|---|---|---|
| X | object | | DataFrame or 2-D array object of the data that was clustered. |
| labels | object | | DataFrame or array object of predicted cluster labels, one per row of X. |
| metric | Any | "euclidean" | Distance metric used to compute pairwise distances (e.g. 'euclidean', 'manhattan', 'cosine'). |
| sample_size | Any | None | If set, compute the silhouette on a random subsample of this many rows instead of the full dataset. |
| random_state | Any | None | Random seed for sample_size selection. Use a fixed integer for reproducible results. |
Returns¶
A single number between -1 and +1 — higher means better-separated clusters.
When to use¶
Use ML.EVAL.CLUSTERING.SILHOUETTE_SCORE to gauge how cleanly your clusters
separate. For each sample it compares its average distance to other points in
its own cluster against its average distance to the nearest other cluster, and
averages the result across all samples.
The score lives in [-1, 1]:
- +1 — clusters are tight and well-separated.
- 0 — clusters overlap; samples sit on or near a boundary.
- -1 — many samples are likely assigned to the wrong cluster.
It's especially handy alongside ML.INSPECT.INERTIA when picking
n_clusters: inertia always shrinks as k grows, but the silhouette score
peaks at a "natural" number of clusters and declines on either side.
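The per-sample comparison described above can be sketched directly in Python (assuming, as these functions appear to, a scikit-learn backend): for each sample, a is its mean distance to the rest of its own cluster, b is its mean distance to the nearest other cluster, and the silhouette is (b − a) / max(a, b), averaged over all samples. The data here is synthetic, purely for illustration.

```python
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_score

rng = np.random.default_rng(0)
# Two tight, well-separated blobs -> silhouette close to +1.
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)

D = pairwise_distances(X)  # O(n^2) pairwise distance matrix
scores = []
for i, li in enumerate(labels):
    own = (labels == li) & (np.arange(len(X)) != i)
    a = D[i, own].mean()                      # mean intra-cluster distance
    b = min(D[i, labels == lj].mean()         # mean distance to nearest other cluster
            for lj in np.unique(labels) if lj != li)
    scores.append((b - a) / max(a, b))

manual = float(np.mean(scores))
print(manual, silhouette_score(X, labels))  # the two values agree
```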
Examples¶
Score a fitted K-Means model's predicted labels on the data in A2:E101:
H1: =ML.CLUSTERING.KMEANS(3, "k-means++", "auto", 300, 0.0001, 0)
H2: =ML.FIT(H1, A2:E101)
H3: =ML.PREDICT(H2, A2:E101)
H4: =ML.EVAL.CLUSTERING.SILHOUETTE_SCORE(A2:E101, H3#)
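The four formulas above correspond roughly to this pipeline, sketched in Python under the assumption that these functions wrap scikit-learn (the synthetic three-blob matrix stands in for the spreadsheet range, and n_init=10 stands in for "auto" for portability across scikit-learn versions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Stand-in feature matrix: three well-separated blobs, 5 columns.
X = np.vstack([rng.normal(c, 0.4, (40, 5)) for c in (0.0, 3.0, 6.0)])

model = KMeans(n_clusters=3, init="k-means++", n_init=10,
               max_iter=300, tol=1e-4, random_state=0)  # mirrors the KMEANS arguments
labels = model.fit(X).predict(X)                        # ML.FIT + ML.PREDICT
score = silhouette_score(X, labels)                     # the SILHOUETTE_SCORE call
print(score)
```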
Score multiple k values to find the most natural cluster count, lining the
results up next to the inertia for the elbow chart:
A2: 2 B2: =ML.CLUSTERING.KMEANS(A2, "k-means++", "auto", 300, 0.0001, 0)
C2: =ML.FIT(B2, $A$25:$E$125)
D2: =ML.INSPECT.INERTIA(C2)
E2: =ML.PREDICT(C2, $A$25:$E$125)
F2: =ML.EVAL.CLUSTERING.SILHOUETTE_SCORE($A$25:$E$125, E2#)
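The k-sweep above can be sketched the same way (again assuming a scikit-learn backend): fit K-Means for each candidate k, record both inertia and silhouette, and pick the k where the silhouette peaks. The four-blob dataset here is synthetic, chosen so the natural cluster count is known.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

centers = [[0, 0], [6, 0], [0, 6], [6, 6]]  # four well-separated blobs
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5,
                  random_state=0)

results = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10,
                max_iter=300, tol=1e-4, random_state=0).fit(X)
    # Inertia always shrinks as k grows; silhouette peaks at the natural k.
    results[k] = (km.inertia_, silhouette_score(X, km.labels_))

best_k = max(results, key=lambda k: results[k][1])
print(best_k)  # 4: the silhouette peaks at the true number of blobs
```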
Remarks¶
- Pass the same feature matrix (X) you trained K-Means on, plus the predicted cluster labels — usually the output of ML.PREDICT against your fitted K-Means model.
- The score is undefined for n_clusters = 1 and for trivial inputs where every sample is its own cluster — those configurations raise an error.
- Silhouette scoring is O(n²) in memory and runtime; on very large datasets either subset the rows first or pass a sample_size to score on a random subset (use random_state for reproducibility).
- Pre-scale your features with ML.PREPROCESSING.STANDARD_SCALER before scoring — silhouette uses Euclidean distance by default, so unequal feature magnitudes will dominate the comparison.
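The last two remarks can be demonstrated in Python (scikit-learn assumed as the backend): subsampling bounds the O(n²) cost, and standardizing stops a large-magnitude feature from dominating the Euclidean distances. The dataset is contrived so that the informative feature is tiny and the noise feature is huge.

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Feature 0 separates the clusters; feature 1 is pure noise on a huge scale.
sep = np.vstack([rng.normal(0, 0.2, (200, 1)), rng.normal(4, 0.2, (200, 1))])
X = np.hstack([sep, rng.normal(0, 1000, (400, 1))])
labels = np.array([0] * 200 + [1] * 200)

raw = silhouette_score(X, labels)      # noise feature dominates -> near 0
scaled = silhouette_score(StandardScaler().fit_transform(X), labels)
sub = silhouette_score(X, labels, sample_size=100, random_state=0)
print(raw, scaled, sub)
```

Scaling recovers a clearly positive score; the subsampled call returns a (reproducible) estimate at a quarter of the pairwise-distance cost.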
See also¶
- ML.EVAL.CLUSTERING.ADJUSTED_MUTUAL_INFO_SCORE
- ML.EVAL.CLUSTERING.ADJUSTED_RAND_SCORE
- ML.EVAL.CLUSTERING.COMPLETENESS_SCORE
- ML.EVAL.CLUSTERING.FOWLKES_MALLOWS_SCORE
- ML.EVAL.CLUSTERING.HOMOGENEITY_SCORE
- ML.EVAL.CLUSTERING.MUTUAL_INFO_SCORE
- ML.EVAL.CLUSTERING.NORMALIZED_MUTUAL_INFO_SCORE
- ML.EVAL.CLUSTERING.RAND_SCORE
- ML.EVAL.CLUSTERING.V_MEASURE_SCORE
- ML.CLUSTERING.KMEANS
- ML.INSPECT.INERTIA
- ML.PREDICT