Clustering Algorithms
Clustering algorithms partition data into groups of similar items, revealing natural structure without using any labels.
Definition
Clustering is the unsupervised partitioning of a dataset into groups such that points within a group are more similar to each other than to points in other groups, where similarity is defined by a distance or density criterion chosen for the application.
Scope
This topic covers the main families of clustering: centroid-based methods such as k-means, hierarchical agglomerative clustering that builds a tree of nested groups, density-based methods that find arbitrarily shaped clusters, and the choice of distance measures and the number of clusters. It addresses what makes a good clustering and why the problem is inherently ambiguous.
Core questions
- What makes a set of points a cluster?
- How does k-means iteratively minimize within-cluster variance?
- How is the number of clusters chosen?
- When do hierarchical or density-based methods outperform centroid methods?
Key theories
- k-means and Lloyd's algorithm
- k-means minimizes total squared distance to cluster centers by alternating assignment of points to nearest centers and recomputation of centers, a procedure that converges to a local optimum.
- Hierarchical clustering
- Agglomerative clustering repeatedly merges the closest groups to build a dendrogram, giving clusterings at every granularity and avoiding the need to fix the number of clusters in advance.
- Mixture-model clustering
- Treating clusters as components of a probabilistic mixture allows soft assignments and clusters of differing shape and size, connecting clustering to latent-variable density estimation.
Clinical relevance
Clustering underlies market segmentation, document and image organization, gene-expression grouping, and anomaly detection, and it is a primary tool of exploratory data analysis; because clusterings depend on the chosen distance and number of groups, results must be interpreted with care rather than treated as a unique ground truth.
History
The k-means procedure traces to Lloyd's 1957 quantization work, published in 1982, and to MacQueen's independent formulation. Hierarchical clustering arose in numerical taxonomy, and density-based methods such as DBSCAN extended clustering to arbitrarily shaped groups, together forming the standard toolkit of unsupervised grouping.
Key figures
- Stuart Lloyd
- James MacQueen
- Trevor Hastie
Related topics
Seminal works
- lloyd1982
- hastie2009
- bishop2006
Frequently asked questions
- Why does k-means require choosing the number of clusters?
- k-means optimizes the placement of a fixed number of centers, so that number is an input. Choosing it relies on heuristics such as the elbow method, silhouette scores, or domain knowledge, since adding more clusters always reduces within-cluster distance.
- Can different clustering methods give different answers?
- Yes. Because there is no single definition of a cluster, centroid, hierarchical, and density-based methods can produce different partitions of the same data, each valid under its own criterion. The right choice depends on the expected cluster shapes and the goal.