Clustering

I. Ozkan

Spring 2025

Readings

Book Chapter

An Introduction to Statistical Learning with Applications in R, Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, Chapter 12

Others

Hands-On Machine Learning with R, Bradley Boehmke & Brandon Greenwell, Chapters 20, 21, and 22

Introduction

Intro: Clustering

- How to define similar/different

Similarity measures are used to assess how similar or different the observations are

K-means Clustering

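\(W\left(C_k\right) = \sum_{x_i \in C_k}\left(x_{i} - \mu_k\right)^2\)

\(SS_{within} = \sum_{k=1}^{K}W\left(C_k\right) = \sum_{k=1}^{K}\sum_{x_i \in C_k}\left(x_i - \mu_k\right)^2\)

where \(\mu_k\) is the centroid (mean) of cluster \(C_k\) and \(K\) is the number of clusters (the \(k\) of k-means)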
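
As a small numerical check of the two formulas above, the following Python sketch computes \(W(C_k)\) for each cluster and adds the results into \(SS_{within}\). The function name `within_ss` and the toy points are illustrative assumptions, and squared Euclidean distance is assumed for observations with more than one coordinate.

```python
def within_ss(points, labels):
    """Return ({cluster label: W(C_k)}, SS_within) for labelled points."""
    # Group the observations by their cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    w = {}
    for label, members in clusters.items():
        dim = len(members[0])
        # mu_k: coordinate-wise mean of the cluster members.
        mu = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        # W(C_k): sum of squared distances of the members to mu_k.
        w[label] = sum(
            sum((m[d] - mu[d]) ** 2 for d in range(dim)) for m in members
        )
    return w, sum(w.values())


# Toy data (hypothetical): two clusters of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0)]
labels = [0, 0, 1, 1]
w, ss_within = within_ss(points, labels)
print(w, ss_within)  # {0: 2.0, 1: 8.0} 10.0
```

The tighter cluster (label 0) contributes less to \(SS_{within}\) than the more spread-out one (label 1).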

K-means Clustering: Complex Geometric Groupings

K-means Clustering: Algorithm

  1. Randomly select \(k\) observations from the data set to serve as the initial centers (or randomly generate \(k\) center points within the range of the data)

  2. Assign each remaining observation to its closest center (centroid)

  3. Compute the new center of each cluster as the mean of the observations assigned to it

  4. Reassign every observation to its closest updated centroid

  5. Repeat steps 3 and 4 until the cluster assignments stop changing
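
The steps above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a production implementation: the names `kmeans`, `sq_dist`, `mean_point`, the `seed` and `max_iter` parameters, and the toy data are all assumptions introduced here, and squared Euclidean distance is assumed as the dissimilarity measure.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: pick k observations at random as the initial centers.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 2 and 4: assign every observation to its closest center.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        # Step 5: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each center as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = mean_point(members)
    return labels, centers


# Toy data (hypothetical): two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (100, 100), (100, 101), (101, 100)]
labels, centers = kmeans(points, k=2)
print(labels, centers)
```

Because the initial centers are random, the cluster numbering can differ between seeds, and on less clearly separated data different starts can end in different local optima. Running the algorithm from several starts and keeping the solution with the smallest \(SS_{within}\) is a common remedy.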

Cluster Validity: How Many Clusters

Cluster Validity: Elbow Method

Cluster Validity: Silhouette

Silhouette (Peter Rousseeuw, 1987)

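\(Score=\frac{D(nearest)-D(intra)}{\max\left(D(nearest),\,D(intra)\right)}\)

where, for each observation, \(D(intra)\) is its mean distance to the other observations in its own cluster and \(D(nearest)\) is its mean distance to the observations in the nearest other cluster. The score lies between -1 and 1; values close to 1 indicate that the observation sits well inside its cluster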
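
A short Python sketch of the score may help make the formula concrete. It assumes the usual definition, in which \(D(intra)\) is the mean distance from an observation to the other members of its own cluster and \(D(nearest)\) is the smallest mean distance to the members of any other cluster. The function name `silhouette_scores`, the convention of scoring singleton clusters as 0, and the toy data are assumptions made for illustration.

```python
import math


def silhouette_scores(points, labels):
    """Return one silhouette score per observation."""
    scores = []
    for i, (p, own_label) in enumerate(zip(points, labels)):
        # Distances from p to every other observation, grouped by cluster.
        by_cluster = {}
        for j, (q, label) in enumerate(zip(points, labels)):
            if j != i:
                by_cluster.setdefault(label, []).append(math.dist(p, q))

        own = by_cluster.pop(own_label, [])
        if not own or not by_cluster:
            # Singleton cluster, or only one cluster overall: score 0.
            scores.append(0.0)
            continue

        d_intra = sum(own) / len(own)
        d_nearest = min(sum(d) / len(d) for d in by_cluster.values())
        denom = max(d_nearest, d_intra)
        scores.append((d_nearest - d_intra) / denom if denom > 0 else 0.0)
    return scores


# Toy data (hypothetical): two compact clusters far apart.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
labels = [0, 0, 1, 1]
scores = silhouette_scores(points, labels)
print(scores)
print(sum(scores) / len(scores))  # average silhouette width
```

The average of the per-observation scores is what is typically compared across candidate numbers of clusters; a larger average suggests a better-separated solution.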

Cluster Validity: The Gap Statistic

Gap Statistic (Robert Tibshirani, Guenther Walther, and Trevor Hastie, 2001)

Hierarchical Clustering

Hierarchical Clustering: Linkage

Hierarchical Clustering: Dendrogram

x1          x2
-0.4203567  -0.5210302
-0.1726331  -0.1559380
 1.1690312  -0.9490473

Hierarchical Clustering: Dendrogram

Number of Clusters

Cutting the Dendrogram

Implementation in R (Examples: Next Week)