Combinatorial clustering
$k$-means
Agglomerative clustering
$k$-means
- Randomly assign each point to one of $k$ clusters
- For each cluster, calculate its mean center point
- Reassign each point to the cluster with the closest center
- Repeat steps 2-3 until no point changes cluster
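The loop above can be sketched in a few lines of NumPy. This is a minimal sketch, not a production implementation: it initializes by picking $k$ random points as the starting centers (a common stand-in for step 1's random assignment) and ignores the empty-cluster edge case.

```python
import numpy as np

def kmeans(points, k, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize by picking k distinct points as the starting centers
    # (a stand-in for step 1's random assignment)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    while True:
        # Step 3: assign each point to its closest center
        dists = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each cluster's mean center point
        # (assumes no cluster ends up empty)
        new_centers = np.array([points[labels == j].mean(axis=0)
                                for j in range(k)])
        # Step 4: stop once the centers, and hence the assignments, are stable
        if np.allclose(new_centers, centers):
            return labels, centers
        centers = new_centers
```

Each iteration only moves points to closer centers, so the within-cluster distances never increase and the loop terminates.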
$k$-means
kmeansStart(stuff, 2)
kmeansStop()
kmeansStart(stuff, 3)
kmeansStop()
kmeansStart(stuff, 4)
kmeansStop()
kmeansStart(stuff, 5)
kmeansStop()
Problems with k-means
- Dependent on initial conditions
- Only linear boundaries
- Clusters tend to be around the same size
- What if one cluster sits inside another?
- How do you know how many clusters to use?
- Sensitive to outliers? Use $k$-medoids
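A hedged sketch of the $k$-medoids idea (the function name and loop structure here are illustrative, not a reference implementation such as PAM): it runs the same loop as $k$-means, but each cluster's center is its medoid, the member point with the smallest total distance to the rest. Because the center is an actual data point, one extreme outlier cannot drag it away.

```python
import numpy as np

def k_medoids(points, k, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k distinct points as the initial medoids
    medoids = points[rng.choice(len(points), size=k, replace=False)]
    while True:
        # Assign each point to its closest medoid
        dists = np.linalg.norm(points[:, None] - medoids[None], axis=2)
        labels = dists.argmin(axis=1)
        # Replace each medoid with the member point minimizing the
        # total distance to the rest of its cluster
        new = []
        for j in range(k):
            members = points[labels == j]
            pairwise = np.linalg.norm(members[:, None] - members[None], axis=2)
            new.append(members[pairwise.sum(axis=1).argmin()])
        new = np.array(new)
        # Stop once the medoids no longer move
        if np.array_equal(new, medoids):
            return labels, medoids
        medoids = new
```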
Agglomerative single-linkage
Agglomerative single-linkage clustering is one of several "bottom-up" types of hierarchical clustering.
- Start with each point in its own cluster
- Compute the distance between each pair of clusters
  - Here (single linkage), that is the distance between the two closest points across the clusters
  - In average linkage, it is the average distance between points across the clusters
  - In complete linkage, it is the distance between the two furthest points across the clusters
- Merge the two clusters with the minimal distance
- Repeat steps 2 and 3 until you have the desired number of clusters
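The steps above can be sketched directly as a naive $O(n^3)$ implementation (a sketch for clarity, not efficiency; the function name is illustrative):

```python
import numpy as np

def single_linkage(points, n_clusters):
    # Step 1: every point starts in its own cluster (stored as index lists)
    clusters = [[i] for i in range(len(points))]
    # Precompute all pairwise point distances
    dist = np.linalg.norm(points[:, None] - points[None], axis=2)
    while len(clusters) > n_clusters:
        # Steps 2-3: cluster distance = distance between the two
        # closest points across the pair (single linkage)
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].min()
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        # Step 4: merge the two closest clusters
        clusters[a] += clusters.pop(b)
    return clusters
```

Swapping `.min()` for `.max()` or `.mean()` in the distance step turns this into complete-linkage or average-linkage clustering.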
Agglomerative single-linkage
agglo.kmeansStart(2)
agglo.kmeansStop()
agglo.agglomerativeStart(2, 100)
agglo.agglomerativeStop()
agglo.agglomerativeStart(4, 60)
agglo.agglomerativeStop()
Single-linkage clustering
- The naive implementation is slow: $O(n^3)$
- Exhibits "chaining": points in the same cluster can end up farther from each other than from points in another cluster
- Again, how do you know how many clusters to use?
- Alternatives?
  - Complete-linkage clustering
  - Average-linkage clustering
  - Also, spectral clustering
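In practice the linkage alternatives are a one-argument change in SciPy's hierarchical clustering API. A small sketch on made-up data (the points are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated made-up groups
pts = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])

results = {}
for method in ("single", "complete", "average"):
    Z = linkage(pts, method=method)        # full merge tree
    # Cut the tree so at most 2 clusters remain
    results[method] = fcluster(Z, t=2, criterion="maxclust")
```

(Spectral clustering lives elsewhere, e.g. scikit-learn's `SpectralClustering`.)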
Agglomerative
agglo2.kmeansStart(2)
agglo2.kmeansStop()
$k$-means
agglo2.agglomerativeStart(4, 100, "single")
agglo2.agglomerativeStop()
Single linkage
agglo2.agglomerativeStart(4, 100, "complete")
agglo2.agglomerativeStop()
Complete linkage
agglo2.agglomerativeStart(4, 100, "average")
agglo2.agglomerativeStop()
Average linkage