I. Ozkan
Spring 2025
Book Chapter
An Introduction to Statistical Learning with Applications in R, Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, Chapter 12
Others
Hands-On Machine Learning with R, Bradley Boehmke & Brandon Greenwell, Chapter 20, 21, 22
Part of Unsupervised Learning Methods
Only a set of features, \(X_1, X_2, \cdots, X_p\), measured on \(n\) observations (no response variable \(Y\))
The goal is to discover interesting things about the measurements on \(X_1, X_2, \cdots, X_p\)
If there is a reason to believe that there are some homogeneous groups of observations in these \(n\) observations, clustering may be used to find them
Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set
It is widely applied in many areas of science, business, and beyond
How to define similar/different
Similarity measures are used to assess how similar or different these observations are
How to group observations: a wide range of algorithms is available, but only selected ones are covered in this course
How to visualize the grouping
How to interpret the grouping
Construct groups of observations (clusters) so that the total within-cluster variation is minimized
The standard algorithm is the Hartigan-Wong algorithm
Try to find the centroids of a fixed number of clusters of points in a high-dimensional space
Two choices must be pre-specified: the number of clusters \(k\) and the initial cluster centers
\(W\left(C_k\right) = \sum_{x_i \in C_k}\left(x_{i} - \mu_k\right)^2\)
\(x_i\) is an observation belonging to the cluster \(C_k\)
\(\mu_k\) is the mean value of the points assigned to the cluster \(C_k\)
Each observation is assigned to a given cluster such that the sum of squared distances to their assigned cluster centers is minimized
\(SS_{within} = \sum_{k=1}^{K}W\left(C_k\right) = \sum_{k=1}^{K}\sum_{x_i \in C_k}\left(x_i - \mu_k\right)^2\), where \(K\) is the number of clusters
1. Randomly select \(k\) observations from the data set to serve as the initial centers (or randomly generate center points within the range of the data)
2. Assign each remaining observation to its closest centroid
3. Compute the new centroid (mean) of each cluster
4. Reassign all observations using the updated centroids
5. Repeat steps 3 and 4 until the cluster assignments stop changing
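The steps above are implemented by R's built-in `kmeans()` function. A minimal sketch on the `USArrests` data (an illustrative dataset choice, not from the slides):

```r
set.seed(123)                               # initial centers are random, so fix the seed
df <- scale(USArrests)                      # standardize features before clustering
km <- kmeans(df, centers = 3, nstart = 25)  # 25 random starts, best solution kept
head(km$cluster)                            # cluster assignment of each observation
km$tot.withinss                             # total within-cluster sum of squares
```

Setting `nstart` to a value like 25 reruns the algorithm from multiple random initializations, reducing the risk of a poor local optimum.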
Assume 3 clusters (although clearly there are two)
Randomization: Select initial centers
The number of clusters must be specified before the clustering algorithm is applied
If a priori information about the number of groups exists, that number can be used
Specifying too few clusters may result in non-homogeneous groups
Specifying too many clusters may result in overfitting
The idea behind cluster validation is that clusters should be compact and well separated
Numerous cluster validity measures (called cluster validity indices) have been proposed and are widely used
Some of the selected ones will be considered in this course
Compute k-means clustering for different values of number of clusters, \(k\)
For each \(k\) calculate the total within-cluster sum of squares (WSS)
Plot the curve of WSS against the number of clusters, \(k\)
The location of a bend (i.e., elbow) in the plot is generally considered as an indicator of the appropriate number of clusters
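The elbow procedure can be sketched in a few lines of R (again using `USArrests` as an illustrative dataset):

```r
set.seed(123)
df <- scale(USArrests)
# total within-cluster SS for k = 1..10
wss <- sapply(1:10, function(k) kmeans(df, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The `factoextra` package wraps the same idea in `fviz_nbclust(df, kmeans, method = "wss")`.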
Silhouette (Peter Rousseeuw, 1987)
\(s(i)=\frac{b(i)-a(i)}{\max\left(a(i),\, b(i)\right)}\)
where \(a(i)\) is the average distance from observation \(i\) to the other observations in its own cluster (intra-cluster distance) and \(b(i)\) is the average distance from \(i\) to the observations in the nearest neighboring cluster
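Silhouette widths can be computed with `silhouette()` from the `cluster` package; a hedged sketch on `USArrests` (illustrative data, with an assumed \(k = 4\)):

```r
library(cluster)                            # provides silhouette()
set.seed(123)
df <- scale(USArrests)
km <- kmeans(df, centers = 4, nstart = 25)
sil <- silhouette(km$cluster, dist(df))     # per-observation silhouette widths
mean(sil[, "sil_width"])                    # average silhouette width; closer to 1 is better
```

Computing the average silhouette width for a range of \(k\) values and picking the maximizer is a common way to choose the number of clusters.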
Gap Statistics (Robert Tibshirani, Guenther Walther, and Trevor Hastie)
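The gap statistic is implemented by `clusGap()` in the `cluster` package; a sketch, with `B` (number of bootstrap reference sets) kept small for speed:

```r
library(cluster)                            # provides clusGap() and maxSE()
set.seed(123)
df <- scale(USArrests)
gap <- clusGap(df, FUN = kmeans, nstart = 25, K.max = 10, B = 50)
# smallest k whose gap is within one SE of the first local maximum
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
```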
An alternative approach to k-means clustering
The algorithm creates a hierarchy of clusters
There is no need to specify the number of clusters in advance
The result of the algorithm can be visualized using a dendrogram, an attractive tree-based representation
Source: https://bradleyboehmke.github.io/HOML/hierarchical.html
Can be divided into two main types
Agglomerative clustering (AGNES): Bottom-up
Divisive hierarchical clustering (DIANA): Top-down
| x1 | x2 | 
|---|---|
| -0.4203567 | -0.5210302 | 
| -0.1726331 | -0.1559380 | 
| 1.1690312 | -0.9490473 | 
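Agglomerative clustering can be sketched with base R's `hclust()`, and DIANA is available via the `cluster` package (`USArrests` is again an illustrative dataset, not from the slides):

```r
library(cluster)                            # provides diana()
df <- scale(USArrests)
d  <- dist(df, method = "euclidean")        # pairwise dissimilarity matrix
hc <- hclust(d, method = "ward.D2")         # AGNES-style bottom-up clustering
plot(hc, cex = 0.6)                         # draw the dendrogram
dv <- diana(df)                             # DIANA: top-down (divisive) clustering
dv$dc                                       # divisive coefficient
```

The `method` argument of `hclust()` selects the linkage (e.g. `"complete"`, `"average"`, `"ward.D2"`), which can substantially change the resulting tree.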
One can visually inspect the dendrogram and suggest the optimal number of clusters
Or, as with k-means clustering, the elbow, silhouette, and gap statistic methods may be used
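Once a number of clusters has been chosen, `cutree()` cuts the dendrogram at that level; a sketch assuming four clusters on `USArrests`:

```r
hc  <- hclust(dist(scale(USArrests)), method = "ward.D2")
grp <- cutree(hc, k = 4)                    # cut the tree into 4 groups
table(grp)                                  # cluster sizes
```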
R provides various packages for clustering analysis, including cluster, fpc, factoextra, and dbscan
The proxy package offers a flexible framework for defining custom similarity measures
We can visualize clustering results using R packages
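For example, `fviz_cluster()` from `factoextra` projects the observations onto the first two principal components and colors them by cluster (a sketch, assuming the package is installed):

```r
library(factoextra)                         # provides fviz_cluster()
set.seed(123)
df <- scale(USArrests)
km <- kmeans(df, centers = 4, nstart = 25)
fviz_cluster(km, data = df)                 # PCA-based 2-D cluster plot (ggplot object)
```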