Why does k-means give different results on different runs?

Its objective is non-convex, so the algorithm converges to a local optimum that depends on the random initial centers; running it several times and keeping the best result is standard practice.

How do I choose the number of clusters k?

Common heuristics include the elbow in the within-cluster sum of squares, the gap statistic, and average silhouette width, though none is definitive and domain knowledge often guides the choice.

K-Means Clustering

K-means clustering partitions observations into a fixed number of clusters by minimizing the total within-cluster sum of squared distances to cluster centroids.

Definition

K-means clustering is a partitioning method that places a prescribed number of cluster centers and assigns each observation to its nearest center so as to minimize the total squared Euclidean distance from observations to their assigned centers.

Scope

This topic covers the within-cluster sum-of-squares objective, the iterative assignment-and-update algorithm that alternates between assigning points to the nearest centroid and recomputing centroids, the dependence on initialization and the resulting local optima, choice of the number of clusters, and the assumptions and limitations of the method.

Core questions

How can observations be partitioned to minimize within-cluster dispersion?
Why does the algorithm converge only to a local optimum and how is this mitigated?
How is the number of clusters chosen?
What cluster shapes and scales does the method implicitly assume?

Key theories

Within-cluster sum-of-squares minimization: K-means seeks the partition and set of centroids minimizing the total squared distance from points to their cluster centers, an objective for which the alternating assignment-update iteration monotonically decreases the criterion.
Local-optimum sensitivity: Because the objective is non-convex, the algorithm converges to a local minimum that depends on the initial centers, motivating multiple restarts and careful seeding.

Clinical relevance

K-means is a fast, scalable default for partitioning large datasets and is used in vector quantization, image color reduction, customer segmentation, and as an initialization for more complex models.

History

The centroid-based partitioning idea was formalized by MacQueen, who named k-means in 1967, building on Lloyd's earlier quantization algorithm. It became one of the most widely used clustering methods owing to its simplicity and speed.

Debates

Implicit assumptions of k-means: Minimizing squared Euclidean distance favors roughly spherical, equally sized clusters, so k-means can mislead when clusters are elongated, of unequal size, or non-convex, which motivates model-based or density-based alternatives.

Key figures

James MacQueen
Stuart Lloyd

Seminal works

hastie2009
everitt2011
macqueen1967

Frequently asked questions

Why does k-means give different results on different runs?: Its objective is non-convex, so the algorithm converges to a local optimum that depends on the random initial centers; running it several times and keeping the best result is standard practice.
How do I choose the number of clusters k?: Common heuristics include the elbow in the within-cluster sum of squares, the gap statistic, and average silhouette width, though none is definitive and domain knowledge often guides the choice.