I. Ozkan
Spring 2024
Shirkhorshidi AS, Aghabozorgi S, Wah TY. A Comparison Study on Similarity and Dissimilarity Measures in Clustering Continuous Data. PLoS One. 2015 Dec 11;10(12):e0144059. DOI: 10.1371/journal.pone.0144059
KNN is a non-parametric supervised learning algorithm that uses proximity (similarity) measures to make classifications or predictions
Clustering analysis involves grouping similar objects together based on certain criteria
The choice of a suitable similarity measure is crucial in both KNN and clustering algorithms
In this presentation, we will explore the properties of commonly used similarity measures
Similarity measures quantify the degree of resemblance between objects
They determine the distance between data points in a feature space
A similarity measure produces a numerical value indicating how closely related the data points are
This value plays a central role in clustering algorithms by determining which data points belong to the same cluster
The choice of similarity measure influences the clustering results significantly
Euclidean Distance
Manhattan Distance
Cosine Similarity
Jaccard Similarity
Pearson Correlation Coefficient
Mahalanobis Distance
Symmetry:
\(sim(A,B)=sim(B,A)\)
The similarity between data points \(A\) and \(B\) should be the same as the similarity between \(B\) and \(A\)
The order of comparison shouldn’t affect the result.
Non-negativity:
\(sim(A,B) \geq 0\)
The similarity measure must always be non-negative
The most similar data points receive the highest values (for similarity measures) or the lowest values (for distance measures), indicating maximal closeness
Identity:
\(sim(A,A)=1\)
Triangle Inequality:
\(sim(A,B)+sim(B,C) \geq sim(A,C)\)
For any three data points \(A\), \(B\), and \(C\), the similarity between \(A\) and \(C\) should not be greater than the sum of the similarities between \(A\) and \(B\) and between \(B\) and \(C\)
This property ensures consistency in measuring the relative closeness of data points
Range:
The range of the similarity measure should be defined
The similarity measure to be used must be chosen based on the data types to which it will be applied
The general data types are Numerical, Categorical, and Mixed
Some measures can only be applied to one of these types; other measures may be applied to both Numerical and Categorical data
In this presentation, we will not go into the details of using the same similarity measures for different data types
For Numerical Data:
Here are the widely used measures:
Euclidean Distance
Manhattan Distance (L1 norm)
Minkowski Distance
Cosine Similarity
Mahalanobis Distance
For Categorical Data:
Hamming Distance
Simple Matching Coefficient
Jaccard Similarity
For Mixed Data Types:
Gower's Distance
Kernel-Based Similarities:
RBF (Gaussian) Kernel
Euclidean Distance is the shortest distance between two data points
It is used with real-valued data
Formula:
The Euclidean distance between the \(i^{th}\) and \(j^{th}\) observations with \(p\) dimensions
\(d_E(x_i, x_j)=\left(\sum_{k=1}^{p}\left(x_{ik}-x_{jk} \right) ^2\right)^\frac{1}{2}\)
Use When:
Data is continuous and on the same scale
Shape and size of clusters are roughly spherical (e.g.,
K-Means)
When the shortest (straight-line) distance between points is meaningful
Avoid When:
Features have different scales (unless normalized).
Data has many outliers.
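A minimal R sketch of the formula above, using illustrative values and base R's dist():

```r
# Two observations with p = 3 features (illustrative values)
x_i <- c(1, 2, 3)
x_j <- c(2, 4, 6)

# Euclidean distance computed directly from the formula
sqrt(sum((x_i - x_j)^2))                    # 3.741657

# The same result with base R's dist()
dist(rbind(x_i, x_j), method = "euclidean")
```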
Formula:
The Manhattan distance between the \(i^{th}\) and \(j^{th}\) observations with \(p\) dimensions
\(d_{Man}(x_i, x_j)=\sum_{k=1}^{p}\left|x_{ik}-x_{jk}\right|\)
Use When:
You want to reduce sensitivity to outliers
Data has grid-like structure (e.g., city blocks)
When high-dimensional data make Euclidean distance less informative (curse of dimensionality)
Avoid When:
Straight-line (diagonal) closeness is meaningful, so the Euclidean assumption is preferred.
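A matching sketch for the Manhattan (L1) distance, with the same illustrative vectors:

```r
x_i <- c(1, 2, 3)
x_j <- c(2, 4, 6)

# Manhattan distance: sum of absolute coordinate differences
sum(abs(x_i - x_j))                         # 1 + 2 + 3 = 6
dist(rbind(x_i, x_j), method = "manhattan") # same result
```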
Formula:
The Minkowski distance of order \(q\) between the \(i^{th}\) and \(j^{th}\) observations with \(p\) dimensions
\(d_{Mw}(x_i, x_j)=\left(\sum_{k=1}^{p}\left|x_{ik}-x_{jk} \right| ^q\right)^\frac{1}{q}\)
Use When:
You want a flexible metric that interpolates between Manhattan (\(q=1\)) and Euclidean (\(q=2\))
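A short sketch showing the order parameter \(q\) (passed as the argument named p in base R's dist()); the value q = 3 is only illustrative:

```r
x_i <- c(1, 2, 3)
x_j <- c(2, 4, 6)
q   <- 3   # order of the Minkowski distance (q = 1: Manhattan, q = 2: Euclidean)

# Minkowski distance from the formula
sum(abs(x_i - x_j)^q)^(1 / q)

# base R's dist() takes the order through its argument named "p"
dist(rbind(x_i, x_j), method = "minkowski", p = q)
```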
Cosine of the angle between two vectors
Results in a value that is in the range of \([-1,1]\)
Formula:
\(\cos \theta = \frac{\mathbf x_i^\top \mathbf x_j}{||\mathbf x_i||_2 ||\mathbf x_j||_2}=\langle \frac{\mathbf x_i}{||\mathbf x_i||_2}, \; \frac{\mathbf x_j}{||\mathbf x_j||_2}\rangle\)
Use When:
Focus is on the direction of vectors, not magnitude
Data is sparse (e.g., text mining)
Avoid When:
Absolute differences are important
Vector magnitudes matter, e.g., when the lengths of the documents being compared are themselves informative
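A small sketch of cosine similarity as the inner product of L2-normalized vectors; the vectors are illustrative:

```r
x_i <- c(1, 0, 2, 3)
x_j <- c(0, 1, 2, 1)

# Cosine similarity: dot product divided by the product of the L2 norms
sum(x_i * x_j) / (sqrt(sum(x_i^2)) * sqrt(sum(x_j^2)))   # about 0.764
```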
Formula:
The Mahalanobis distance between the \(i^{th}\) and \(j^{th}\) observations with \(p\) dimensions
\(d_{Mah}(x_i, x_j)=\sqrt{(x_i-x_j)^T\Sigma^{-1}(x_i-x_j)}\)
where \(\Sigma^{-1}\) is the inverse of the covariance matrix
Use When:
Features are correlated.
Clusters may be ellipsoidal.
Avoid When:
Small sample size makes covariance estimation unstable.
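A sketch of the pairwise Mahalanobis distance on the numeric columns of iris, with the covariance matrix estimated from the data (the choice of rows 1 and 51 is arbitrary):

```r
X     <- as.matrix(iris[, 1:4])   # four correlated numeric features
S_inv <- solve(cov(X))            # inverse of the sample covariance matrix

x_i <- X[1, ]
x_j <- X[51, ]
d   <- x_i - x_j

sqrt(drop(t(d) %*% S_inv %*% d))  # Mahalanobis distance between the two observations
```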
For two binary vectors \(x_i\) (rows) and \(x_j\) (columns), the attribute counts can be arranged in a contingency table:
\(\begin{array}{c|cc} & 1 & 0\\ \hline 1 & a & b\\ 0 & c & d \end{array}\)
\(a\) is the number of attributes common to both \(x_i, x_j\)
\(b\) is the number of
attributes \(x_i\) has but \(x_j\) does not have
\(c\) is the number of attributes that \(x_j\) has that \(x_i\) does not have
\(d\) is the number of attributes which neither \(x_i\) nor \(x_j\) has
Let \(x_i=\{1,1,1,1\}\) and \(x_j=\{1,1,0,1\}\)
\(\begin{array}{c|cc} & 1 & 0\\ \hline 1 & 3 & 1\\ 0 & 0 & 0 \end{array}\)
Count of mismatched attributes (mismatched positions) of two vectors of equal length
Use When:
Attributes are binary or categorical with equal importance
Avoid When:
Attributes differ in relevance or have hierarchical relationships
Ex:
The Hamming distance for \(x_i=\{\color{blue}{1},\color{blue}{1},\color{red}{1},\color{blue}{1}\}\) and \(x_j=\{\color{blue}{1},\color{blue}{1},\color{red}{0},\color{blue}{1}\}\) is \(1\)
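The same example in R (a one-line sketch):

```r
x_i <- c(1, 1, 1, 1)
x_j <- c(1, 1, 0, 1)

sum(x_i != x_j)   # Hamming distance: number of mismatched positions = 1
```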
Matches / Total attributes.
\(\frac{a+d}{a+b+c+d}=\frac{3+0}{3+1+0+0}=\frac{3}{4}\)
Use When:
Binary symmetric attributes (e.g., Gender, Yes/No).
Avoid When:
Attributes are asymmetric (e.g., presence/absence data).
Formula:
\(\frac{a}{a+b+c}=\frac{3}{3+1}=\frac{3}{4}\)
Use When:
Binary, asymmetric attributes (e.g., market basket data).
Avoid When:
Absence of an attribute carries equal meaning as presence.
In this example, Jaccard Similarity and SMC are equal, but in general there is an important difference. Consider market basket analysis: assume the market sells \(p\) items and each customer's basket contains \(k\) items, where \(k \ll p\). The vectors representing customers' baskets then contain mostly \(0\)s, and Jaccard similarity is the more appropriate measure
Ex:
Let \(p=100\); customer \(i\) buys bread and cheese, and customer \(j\) buys bread and meat. The Jaccard similarity is then \(1/3\), whereas SMC is \((1+97)/100=0.98\)
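A sketch reproducing this market basket example; the item names and the padding item labels are purely illustrative:

```r
p <- 100                                        # number of items the market sells
all_items <- c("bread", "cheese", "meat", paste0("item", 1:(p - 3)))

x_i <- as.integer(all_items %in% c("bread", "cheese"))  # customer i's basket
x_j <- as.integer(all_items %in% c("bread", "meat"))    # customer j's basket

a  <- sum(x_i == 1 & x_j == 1)   # items both customers bought
b  <- sum(x_i == 1 & x_j == 0)   # items only customer i bought
cc <- sum(x_i == 0 & x_j == 1)   # items only customer j bought ("c" in the slides)
d  <- sum(x_i == 0 & x_j == 0)   # items neither bought

(a + d) / (a + b + cc + d)       # SMC: (1 + 97) / 100 = 0.98
a / (a + b + cc)                 # Jaccard: 1 / 3
```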
Gower’s distance for mixed data types
Combines different types (nominal, ordinal, interval, ratio)
To do so, it computes a score \(s_{ijk}\) for each feature \(k\) and combines them
Numeric variables (continuous or discrete): Range Normalized Manhattan
\(s_{ijk} = 1 - \frac{|x_{ik} - x_{jk}|}{R_k}\)
where \(R_k\) is the range of feature \(k\)
Categorical (nominal) and symmetric binary variables: Simple Matching
\(s_{ijk} =\left\{\begin{matrix}1 \text{ if }x_{ik} = x_{jk}\\0 \text{ if }x_{ik} \neq x_{jk}\end{matrix}\right.\)
Logical (asymmetric binary) variables: Jaccard-type
\(s_{ijk} = \left\{\begin{matrix}1 \text{ if }x_{ik} = \text{ true AND }x_{jk} = \text{ true } \\0 \text{ otherwise.}\end{matrix}\right.\)
The Total Similarity Score
\(S_{ij} = \frac{\sum_{k=1}^N{(s_{ijk}\cdot\delta_{ijk})}}{\sum_{k=1}^N{\delta_{ijk}}}.\)
where
\(\delta_{ijk}=1\) for all features on which the two objects can be compared, and \(\delta_{ijk}=0\) otherwise
The Corresponding Distance (Dissimilarity)
\(D_{ij} = \sqrt{1-S_{ij}}\)
Use When:
Data contains mixed types
Need a unified similarity score
Avoid When:
All attributes are of the same type (a simpler metric can be used).
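A hand-computed sketch of the Gower score for two observations with one numeric and one nominal feature; the feature names, values, and the assumed range are illustrative (in practice \(R_k\) is taken from the data, as the daisy() example later in the presentation does):

```r
# Two observations: a numeric feature "age" and a nominal feature "colour" (illustrative)
age    <- c(25, 40); R_age <- 60               # assume the observed range of age is 60
colour <- c("red", "red")

s_age    <- 1 - abs(age[1] - age[2]) / R_age   # range-normalized numeric score: 0.75
s_colour <- as.numeric(colour[1] == colour[2]) # simple matching for the nominal feature: 1

S_ij <- mean(c(s_age, s_colour))               # both features comparable, equal weights: 0.875
D_ij <- sqrt(1 - S_ij)                         # corresponding Gower distance
c(similarity = S_ij, distance = D_ij)
```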
Used in Spectral Clustering, SVM, or non-linear clustering:
Please refer to SVM notes
Use When:
You need to map data into a high-dimensional space for better separability.
Avoid When:
Interpretability is crucial.
Kernel parameter tuning (e.g., the RBF bandwidth \(\sigma\)) is difficult.
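A minimal sketch of a Gaussian (RBF) kernel similarity; rbf_similarity is a helper defined here (not a package function), and the bandwidth value is illustrative:

```r
# Gaussian (RBF) kernel similarity between two numeric vectors
rbf_similarity <- function(x, y, sigma = 1) {
  exp(-sum((x - y)^2) / (2 * sigma^2))
}

x_i <- c(1, 2, 3)
x_j <- c(2, 4, 6)

rbf_similarity(x_i, x_j, sigma = 2)   # near 1 for close points, near 0 for distant ones
```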
| Data Type | Recommended Similarity | Clustering Algorithm Examples |
|---|---|---|
| Numerical (scaled) | Euclidean, Manhattan, Cosine | K-Means, DBSCAN, Hierarchical |
| Categorical | Hamming, Jaccard, SMC | K-Modes, Agglomerative (custom distance) |
| Mixed | Gower | PAM (Partitioning Around Medoids), Hierarchical |
| Text / Sparse | Cosine, Jaccard | K-Means (with TF-IDF), LDA |
| Correlated Features | Mahalanobis | Gaussian Mixture Models |
| Nonlinear Structure | RBF Kernel | Spectral Clustering |
Normalization: Always normalize/standardize numerical features before using distance-based methods unless using Mahalanobis
Outliers: Prefer Manhattan or cosine similarity if your data is prone to outliers
High Dimensionality: Euclidean distance becomes less informative (“curse of dimensionality”). Use Cosine or Dimensionality Reduction
Interpretable Clustering: Prefer simple metrics (e.g., Euclidean or Jaccard) with hierarchical clustering
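A quick sketch of the normalization advice, standardizing the numeric iris columns before computing Euclidean distances:

```r
X        <- iris[, 1:4]     # numeric features measured on different scales
X_scaled <- scale(X)        # z-score standardization: mean 0, standard deviation 1

d_raw    <- dist(X,        method = "euclidean")
d_scaled <- dist(X_scaled, method = "euclidean")  # no longer dominated by large-scale features
```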
Understanding the properties of similarity measures is essential for selecting appropriate clustering algorithms
R provides a comprehensive set of tools for implementing clustering analysis and exploring different similarity measures
There is a CRAN Cluster Task View
Among others, the proxy and proxyC packages implement several similarity measures
Some packages have their own similarity measure functions, for example the daisy() function in the cluster package
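A sketch of how the proxy package can be used, assuming it is installed; the measure names shown ("cosine", "Manhattan") are looked up in its registry of measures:

```r
library(proxy)   # registry of similarity and distance measures

X <- as.matrix(iris[1:5, 1:4])

simil(X, method = "cosine")     # pairwise cosine similarities
dist(X,  method = "Manhattan")  # proxy's dist() also accepts registered measure names
```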
Example: Gower’s Similarity Coefficient
The first six rows of the iris data (Petal.Length, Petal.Width, Species):

| Petal.Length | Petal.Width | Species |
|---|---|---|
| 1.4 | 0.2 | setosa |
| 1.4 | 0.2 | setosa |
| 1.3 | 0.2 | setosa |
| 1.5 | 0.2 | setosa |
| 1.4 | 0.2 | setosa |
| 1.7 | 0.4 | setosa |
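The dissimilarities printed below can be reproduced with a call along these lines (the Gower ranges are then computed over these six rows only):

```r
library(cluster)

head(iris[, 3:5]) |>
  daisy(metric = "gower")   # prints the 6 x 6 Gower dissimilarities shown below
```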
```
## Dissimilarities :
##            1          2          3          4          5
## 2 0.00000000
## 3 0.08333333 0.08333333
## 4 0.08333333 0.08333333 0.16666667
## 5 0.00000000 0.00000000 0.08333333 0.08333333
## 6 0.58333333 0.58333333 0.66666667 0.50000000 0.58333333
##
## Metric :  mixed ;  Types = I, I, N
## Number of objects : 6
```
```r
library(cluster)               # provides daisy() for Gower's distance

head(iris[, 3:5]) |>
  daisy(metric = "gower") |>   # Gower's distance on mixed-type columns
  hclust() |>                  # hierarchical clustering on the dissimilarities
  plot()                       # plot the resulting dendrogram
```
R provides various packages for clustering analysis, including cluster, fpc, factoextra, and dbscan
The proxy package offers a flexible framework for defining custom similarity measures
We can visualize clustering results using R packages such as factoextra
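A sketch of one possible visualization pipeline, assuming the cluster and factoextra packages are installed; the choice of k = 3 is illustrative:

```r
library(cluster)      # daisy(), pam(), silhouette()
library(factoextra)   # fviz_silhouette() and other plotting helpers

gdist <- daisy(iris[, 3:5], metric = "gower")  # Gower distance on mixed-type columns
fit   <- pam(gdist, k = 3)                     # PAM clustering directly on the dissimilarities

fviz_silhouette(fit)                           # silhouette plot of the clustering result
```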