I. Ozkan
Spring 2025
Book Chapter
An Introduction to Statistical Learning with Applications in R, Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, Chapter 12
Others
Hands-On Machine Learning with R, Bradley Boehmke & Brandon Greenwell, Chapter 20, 21, 22
Part of Unsupervised Learning Methods
Only a set of features, \(X_1, X_2, \cdots, X_p\), measured on \(n\) observations (no response variable \(Y\))
The goal is to discover interesting things about the measurements on \(X_1, X_2, \cdots, X_p\)
If there is a reason to believe that there are some homogeneous groups of observations in these \(n\) observations, clustering may be used to find them
Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set
It is widely applied in many areas of science, business, and beyond
How to define similar/different
Similarity measures are used to assess how similar or different these observations are
How to group observations: a wide range of algorithms is available, but only selected ones are covered in this course
How to visualize the grouping
How to interpret the grouping
Construct groups of observations (clusters) so that the total within-cluster variation is minimized
The standard algorithm is the Hartigan-Wong algorithm
Try to find the centroids of a fixed number of clusters of points in a high-dimensional space
Two choices must be pre-specified: the number of clusters \(k\) and the initial cluster centers
\(W\left(C_k\right) = \sum_{x_i \in C_k}\left(x_{i} - \mu_k\right)^2\)
\(x_i\) is an observation belonging to the cluster \(C_k\)
\(\mu_k\) is the mean value of the points assigned to the cluster \(C_k\)
Each observation is assigned to a given cluster such that the sum of squared distances to their assigned cluster centers is minimized
\(SS_{within} = \sum_{k=1}^{K}W\left(C_k\right) = \sum_{k=1}^{K}\sum_{x_i \in C_k}\left(x_i - \mu_k\right)^2\), where \(K\) is the number of clusters
1. Randomly select \(k\) observations from the data set to serve as the initial centers (or randomly generate center points within the range of the data)
2. Assign each remaining observation to its closest centroid
3. Compute the new centroid (mean) of each cluster
4. Reassign all observations using the updated centroids
5. Repeat steps 3 and 4 until the cluster assignments stop changing
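The steps above are implemented by R's built-in `kmeans()` function. A minimal sketch on the `USArrests` data (an illustrative dataset choice, not from the slides):

```r
set.seed(123)                               # initial centers are random, so fix the seed
df <- scale(USArrests)                      # standardize features before clustering
km <- kmeans(df, centers = 3, nstart = 25)  # 25 random starts, best solution kept
head(km$cluster)                            # cluster assignment of each observation
km$tot.withinss                             # total within-cluster sum of squares
```

Setting `nstart` to a value like 25 reruns the algorithm from multiple random initializations, reducing the risk of a poor local optimum.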
Assume 3 clusters (although clearly there are two)
Randomization: Select initial centers
The number of clusters must be specified before the clustering algorithm is applied
If a priori information about the number of groups exists, that number can be used
Specifying too few clusters may result in non-homogeneous groups
Specifying too many clusters may result in overfitting
The idea behind cluster validation is that clusters should be compact and well separated
Numerous cluster validity measures (called cluster validity indices) have been proposed and are widely used
Some of the selected ones will be considered in this course
Compute k-means clustering for different values of number of clusters, \(k\)
For each \(k\) calculate the total within-cluster sum of squares (WSS)
Plot the curve of WSS against the number of clusters, \(k\)
The location of a bend (i.e., elbow) in the plot is generally considered as an indicator of the appropriate number of clusters
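The elbow procedure can be sketched in a few lines of R (again using `USArrests` as an illustrative dataset):

```r
set.seed(123)
df <- scale(USArrests)
# total within-cluster SS for k = 1..10
wss <- sapply(1:10, function(k) kmeans(df, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The `factoextra` package wraps the same idea in `fviz_nbclust(df, kmeans, method = "wss")`.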
Silhouette (Peter Rousseeuw, 1987)
\(s(i)=\frac{b(i)-a(i)}{\max\left(a(i),\, b(i)\right)}\)
where \(a(i)\) is the average distance from observation \(i\) to the other observations in its own cluster (intra-cluster distance) and \(b(i)\) is the average distance from \(i\) to the observations in the nearest neighboring cluster
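Silhouette widths can be computed with `silhouette()` from the `cluster` package; a hedged sketch on `USArrests` (illustrative data, with an assumed \(k = 4\)):

```r
library(cluster)                            # provides silhouette()
set.seed(123)
df <- scale(USArrests)
km <- kmeans(df, centers = 4, nstart = 25)
sil <- silhouette(km$cluster, dist(df))     # per-observation silhouette widths
mean(sil[, "sil_width"])                    # average silhouette width; closer to 1 is better
```

Computing the average silhouette width for a range of \(k\) values and picking the maximizer is a common way to choose the number of clusters.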
Gap Statistics (Robert Tibshirani, Guenther Walther, and Trevor Hastie)
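The gap statistic is implemented by `clusGap()` in the `cluster` package; a sketch, with `B` (number of bootstrap reference sets) kept small for speed:

```r
library(cluster)                            # provides clusGap() and maxSE()
set.seed(123)
df <- scale(USArrests)
gap <- clusGap(df, FUN = kmeans, nstart = 25, K.max = 10, B = 50)
# smallest k whose gap is within one SE of the first local maximum
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
```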
An alternative approach to k-means clustering
The algorithm creates a hierarchy of clusters
There is no need to specify the number of clusters in advance
The result of the algorithm can be visualized using a dendrogram, an attractive tree-based representation
Source: https://bradleyboehmke.github.io/HOML/hierarchical.html
Can be divided into two main types
Agglomerative clustering (AGNES): Bottom-up
Divisive hierarchical clustering (DIANA): Top-down
| x1 | x2 | 
|---|---|
| -0.4203567 | -0.5210302 | 
| -0.1726331 | -0.1559380 | 
| 1.1690312 | -0.9490473 | 
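Agglomerative clustering can be sketched with base R's `hclust()`, and DIANA is available via the `cluster` package (`USArrests` is again an illustrative dataset, not from the slides):

```r
library(cluster)                            # provides diana()
df <- scale(USArrests)
d  <- dist(df, method = "euclidean")        # pairwise dissimilarity matrix
hc <- hclust(d, method = "ward.D2")         # AGNES-style bottom-up clustering
plot(hc, cex = 0.6)                         # draw the dendrogram
dv <- diana(df)                             # DIANA: top-down (divisive) clustering
dv$dc                                       # divisive coefficient
```

The `method` argument of `hclust()` selects the linkage (e.g. `"complete"`, `"average"`, `"ward.D2"`), which can substantially change the resulting tree.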
One can visually inspect the dendrogram and suggest the optimal number of clusters
Or, as with k-means clustering, the elbow, silhouette, and gap statistic methods may be used
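Once a number of clusters has been chosen, `cutree()` cuts the dendrogram at that level; a sketch assuming four clusters on `USArrests`:

```r
hc  <- hclust(dist(scale(USArrests)), method = "ward.D2")
grp <- cutree(hc, k = 4)                    # cut the tree into 4 groups
table(grp)                                  # cluster sizes
```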
R provides various packages for clustering analysis, including cluster, fpc, factoextra, and dbscan
The proxy package offers a flexible framework for defining custom similarity measures
We can visualize clustering results using R packages
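For example, `fviz_cluster()` from `factoextra` projects the observations onto the first two principal components and colors them by cluster (a sketch, assuming the package is installed):

```r
library(factoextra)                         # provides fviz_cluster()
set.seed(123)
df <- scale(USArrests)
km <- kmeans(df, centers = 4, nstart = 25)
fviz_cluster(km, data = df)                 # PCA-based 2-D cluster plot (ggplot object)
```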