MIS 302

Similarity Measures

I. Ozkan

Spring 2024

Readings

Shirkhorshidi AS, Aghabozorgi S, Wah TY. A Comparison Study on Similarity and Dissimilarity Measures in Clustering Continuous Data. PLoS One. 2015 Dec 11;10(12):e0144059. doi:10.1371/journal.pone.0144059

Introduction

Importance of Similarity Measures

Commonly Used Similarity Measures

  • Euclidean Distance

  • Manhattan Distance

  • Cosine Similarity

  • Jaccard Similarity

  • Pearson Correlation Coefficient

  • Mahalanobis Distance

Properties of Similarity Measures

Symmetry: \(sim(A,B)=sim(B,A)\)

Non-negativity: \(sim(A,B) \geq 0\)

Maximal self-similarity: \(sim(A,A)=1\)

Triangle inequality, stated for the corresponding dissimilarity (e.g., \(d = 1 - sim\)): \(d(A,C) \leq d(A,B)+d(B,C)\)

The range of the similarity measure should be defined (e.g., \([0,1]\))
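These properties can be checked numerically; a minimal R sketch using the cosine similarity (defined later in this section) on two illustrative vectors:

cos_sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
a <- c(1, 0, 2); b <- c(2, 1, 0)
cos_sim(a, b) == cos_sim(b, a)  # symmetry: TRUE
cos_sim(a, b) >= 0              # non-negativity: TRUE (vectors here are non-negative)
all.equal(cos_sim(a, a), 1)     # maximal self-similarity: TRUE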

Similarity Measures in General

Similarity Measures

For Numerical Data: Euclidean, Manhattan, Minkowski, and Mahalanobis distances; Cosine similarity

For Categorical Data: Hamming distance, Simple Matching Coefficient (SMC), Jaccard similarity

For Mixed Data Types: Gower's similarity coefficient

Kernel Based Similarities: e.g., the RBF (Gaussian) kernel

Euclidean Distance (\(\ell_2\)-norm)

Formula:

The Euclidean distance between the \(i^{th}\) and \(j^{th}\) observations with \(p\) dimensions

\(d_E(x_i, x_j)=\left(\sum_{k=1}^{p}\left(x_{ik}-x_{jk} \right) ^2\right)^\frac{1}{2}\)

Use When:

Features are continuous, on comparable scales, and dense, and compact (roughly spherical) clusters are expected.

Avoid When:

Features are on very different scales (standardize first) or the data is high-dimensional and sparse.
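A minimal R sketch (two hypothetical observations x and y) applying the formula directly and checking against base R's dist():

x <- c(1, 2, 3); y <- c(4, 6, 8)
sqrt(sum((x - y)^2))  # 7.071068, by the formula
dist(rbind(x, y))     # the same value from base R (default method = "euclidean")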

Manhattan Distance (\(\ell_1\)-norm, City Block)

Formula:

The Manhattan distance between the \(i^{th}\) and \(j^{th}\) observations with \(p\) dimensions

\(d_{Man}(x_i, x_j)=\sum_{k=1}^{p}\left|x_{ik}-x_{jk}\right|\)

Use When:

Robustness to outliers matters, or distance is naturally measured along axes (grid-like movement).

Avoid When:

The Euclidean assumption is preferred (diagonal closeness).
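Likewise, a minimal sketch applying the \(\ell_1\) formula and checking against dist():

x <- c(1, 2, 3); y <- c(4, 6, 8)
sum(abs(x - y))                          # 12, by the formula
dist(rbind(x, y), method = "manhattan")  # the same value from base R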

Euclidean and Manhattan

Minkowski Distance

Formula:

The Minkowski distance of order \(q \geq 1\) between the \(i^{th}\) and \(j^{th}\) observations with \(p\) dimensions

\(d_{Mw}(x_i, x_j)=\left(\sum_{k=1}^{p}\left|x_{ik}-x_{jk}\right|^q\right)^\frac{1}{q}\)

Use When:

You want a flexible metric that interpolates between Manhattan (\(q=1\)) and Euclidean (\(q=2\))
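A minimal sketch; note that dist()'s argument p is the order \(q\) here, not the number of dimensions:

x <- c(1, 2, 3); y <- c(4, 6, 8)
dist(rbind(x, y), method = "minkowski", p = 1)  # 12: Manhattan
dist(rbind(x, y), method = "minkowski", p = 2)  # 7.071068: Euclidean
dist(rbind(x, y), method = "minkowski", p = 3)  # 6: (3^3 + 4^3 + 5^3)^(1/3)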

Minkowski, Unit Distance Circle

Cosine Similarity

Formula:

\(\cos \theta = \frac{\mathbf x_i^\top \mathbf x_j}{||\mathbf x_i||_2 ||\mathbf x_j||_2}=\langle \frac{\mathbf x_i}{||\mathbf x_i||_2}, \; \frac{\mathbf x_j}{||\mathbf x_j||_2}\rangle\)

Use When:

Only the direction of the vectors matters, not their magnitude (e.g., text data represented as TF-IDF vectors, sparse high-dimensional data).

Avoid When:

Vector magnitude carries meaningful information.
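A minimal sketch illustrating that cosine similarity ignores magnitude (y is a scaled copy of x, so the angle between them is zero):

cos_sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
x <- c(1, 2, 3)
y <- 2 * x         # same direction, different magnitude
cos_sim(x, y)      # 1: identical direction
dist(rbind(x, y))  # 3.741657: Euclidean distance still sees them as far apart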

Mahalanobis Distance

Formula:

The Mahalanobis distance between the \(i^{th}\) and \(j^{th}\) observations with \(p\) dimensions

\(d_{Mah}(x_i, x_j)=\sqrt{(x_i-x_j)^T\Sigma^{-1}(x_i-x_j)}\)

where \(\Sigma^{-1}\) is the inverse of the covariance matrix

Use When:

Features are correlated.

Clusters may be ellipsoidal.

Avoid When:

Small sample size makes covariance estimation unstable.
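A minimal sketch on the four numeric columns of iris, computing the pairwise Mahalanobis distance between the first two observations; base R's mahalanobis() helper instead returns squared distances to a fixed center:

X <- as.matrix(iris[, 1:4])
S_inv <- solve(cov(X))                      # inverse of the covariance matrix
diff12 <- X[1, ] - X[2, ]
sqrt(drop(t(diff12) %*% S_inv %*% diff12))  # d_Mah between observations 1 and 2
mahalanobis(X, center = colMeans(X), cov = cov(X))[1:3]  # squared distances to the mean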

Similarity Measures for Categorical Data

For two binary vectors \(x_i\) (rows) and \(x_j\) (columns), count the attribute positions in a \(2\times2\) table: \(a\) = both are \(1\); \(b\) = \(x_{i}=1\), \(x_{j}=0\); \(c\) = \(x_{i}=0\), \(x_{j}=1\); \(d\) = both are \(0\).

\(\begin{array}{ccc} &1 &0\\ 1& a & b\\ 0& c & d \end{array}\)

Let \(x_i=\{1,1,1,1\}\) and \(x_j=\{1,1,0,1\}\). Then:

\(\begin{array}{ccc} &1 &0\\ 1& 3 & 1\\ 0& 0 & 0 \end{array}\)
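A minimal R sketch reproducing the \(2\times2\) counts:

xi <- c(1, 1, 1, 1); xj <- c(1, 1, 0, 1)
tab <- table(factor(xi, levels = c(1, 0)), factor(xj, levels = c(1, 0)))
tab  # rows: x_i, columns: x_j; a = 3, b = 1, c = 0, d = 0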

Hamming Distance (One of the Edit Distances)

The count of mismatched attributes (mismatched positions) between two vectors of equal length

Use When:

Attributes are binary or categorical with equal importance

Avoid When:

Attributes differ in relevance or have hierarchical relationships

Ex:

The Hamming distance for \(x_i=\{\color{blue}{1},\color{blue}{1},\color{red}{1},\color{blue}{1}\}\) and \(x_j=\{\color{blue}{1},\color{blue}{1},\color{red}{0},\color{blue}{1}\}\) is \(1\)
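In R, this is a one-liner:

xi <- c(1, 1, 1, 1); xj <- c(1, 1, 0, 1)
sum(xi != xj)  # 1: one mismatched position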

Simple Matching Coefficient (SMC)

Matches / Total attributes.

\(\frac{a+d}{a+b+c+d}=\frac{3+0}{3+1+0+0}=\frac{3}{4}\)

Use When:

Binary symmetric attributes (e.g., Gender, Yes/No).

Avoid When:

Attributes are asymmetric (e.g., presence/absence data).
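For the running example, SMC is simply the proportion of matching positions; a minimal sketch:

xi <- c(1, 1, 1, 1); xj <- c(1, 1, 0, 1)
mean(xi == xj)  # 0.75 = (a + d) / (a + b + c + d)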

Jaccard Similarity

Formula:

\(\frac{a}{a+b+c}=\frac{3}{3+1+0}=\frac{3}{4}\)

Use When:

Binary, asymmetric attributes (e.g., market basket data).

Avoid When:

Absence of an attribute carries equal meaning as presence.

In this example, Jaccard similarity and SMC are equal, but in general there is an important difference. Consider market basket analysis: assume the market sells \(p\) items and each customer's basket contains \(k\) items, where \(k \ll p\). The vectors representing customers' baskets then contain many \(0\)s, and Jaccard similarity, which ignores the \(0\)–\(0\) matches, is more appropriate.

Ex:

Let \(p=100\); customer \(i\) buys bread and cheese, and customer \(j\) buys bread and meat. The Jaccard similarity is then \(1/3\), whereas SMC is \((1+97)/100=0.98\)
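A minimal sketch reproducing these numbers (the item positions standing in for bread, cheese, and meat are illustrative):

p <- 100
bi <- integer(p); bj <- integer(p)  # baskets as 0/1 vectors over p items
bi[c(1, 2)] <- 1                    # customer i: bread, cheese
bj[c(1, 3)] <- 1                    # customer j: bread, meat
a <- sum(bi == 1 & bj == 1)         # 1 (bread)
b <- sum(bi == 1 & bj == 0)         # 1 (cheese)
c_ <- sum(bi == 0 & bj == 1)        # 1 (meat)
d <- sum(bi == 0 & bj == 0)         # 97
a / (a + b + c_)                    # Jaccard: 1/3
(a + d) / (a + b + c_ + d)          # SMC: 0.98, inflated by the many 0-0 matches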

Gower’s Similarity Coefficient (Mixed Data)

Readings

Gower’s distance for mixed data types

Numeric variables (continuous or discrete): Range Normalized Manhattan

\(s_{ijk} = 1 - \frac{|x_{ik} - x_{jk}|}{R_k}\)

where, \(R_k\) is the range of the feature \(k\)

Binary and categorical (nominal) variables: simple matching

\(s_{ijk} =\left\{\begin{matrix}1 \text{ if }x_{ik} = x_{jk}\\0 \text{ if }x_{ik} \neq x_{jk}\end{matrix}\right.\)

Logical (asymmetric binary) variables, where \(0\)–\(0\) pairs are excluded via \(\delta_{ijk}\):

\(s_{ijk} = \left\{\begin{matrix}1 \text{ if }x_{ik} = \text{ true AND }x_{jk} = \text{ true } \\0 \text{ otherwise.}\end{matrix}\right.\)

The Total Similarity Score

\(S_{ij} = \frac{\sum_{k=1}^{p}{(s_{ijk}\cdot\delta_{ijk})}}{\sum_{k=1}^{p}{\delta_{ijk}}}.\)

where,

\(\delta_{ijk}=1\) for all features \(k\) that are comparable for the pair \((i,j)\) (e.g., non-missing); \(\delta_{ijk}=0\) otherwise

From Similarity to Distance

\(D_{ij} = \sqrt{1-S_{ij}}\)

Use When:

Data contains mixed types

Need a unified similarity score

Avoid When:

All attributes are of the same type (a simpler metric can be used).

Kernel-Based Similarities (Advanced)

Used in Spectral Clustering, SVM, or non-linear clustering:

Please refer to SVM notes

Use When:

You need to map data into a high-dimensional space for better separability.

Avoid When:

Interpretability is crucial.

Tuning the kernel parameter (e.g., the RBF bandwidth \(\sigma\)) is difficult.
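For reference, the RBF (Gaussian) kernel is \(k(x_i, x_j)=\exp\left(-\frac{\lVert x_i-x_j\rVert^2}{2\sigma^2}\right)\); a minimal sketch showing how the choice of \(\sigma\) (the values here are illustrative) changes the similarity:

rbf <- function(x, y, sigma) exp(-sum((x - y)^2) / (2 * sigma^2))
x <- c(1, 2, 3); y <- c(4, 6, 8)
rbf(x, y, sigma = 1)   # ~1.4e-11: far apart relative to the bandwidth
rbf(x, y, sigma = 10)  # ~0.78: the same pair looks similar with a wide bandwidth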

Data Types, Similarity and Clustering Algorithms

Data Type           | Recommended Similarity       | Clustering Algorithm Examples
--------------------|------------------------------|-------------------------------------------------
Numerical (scaled)  | Euclidean, Manhattan, Cosine | K-Means, DBSCAN, Hierarchical
Categorical         | Hamming, Jaccard, SMC        | K-Modes, Agglomerative (custom distance)
Mixed               | Gower                        | PAM (Partitioning Around Medoids), Hierarchical
Text / Sparse       | Cosine, Jaccard              | K-Means (with TF-IDF), LDA
Correlated Features | Mahalanobis                  | Gaussian Mixture Models
Nonlinear Structure | RBF Kernel                   | Spectral Clustering

Practical Considerations

Conclusion

Example: Gower’s Similarity Coefficient

library(tidyverse)
library(cluster)
library(gt)

head(iris[,3:5]) |> gt()
Petal.Length Petal.Width Species
1.4 0.2 setosa
1.4 0.2 setosa
1.3 0.2 setosa
1.5 0.2 setosa
1.4 0.2 setosa
1.7 0.4 setosa
head(iris[,3:5]) |> 
  daisy(metric = "gower")
## Dissimilarities :
##            1          2          3          4          5
## 2 0.00000000                                            
## 3 0.08333333 0.08333333                                 
## 4 0.08333333 0.08333333 0.16666667                      
## 5 0.00000000 0.00000000 0.08333333 0.08333333           
## 6 0.58333333 0.58333333 0.66666667 0.50000000 0.58333333
## 
## Metric :  mixed ;  Types = I, I, N 
## Number of objects : 6
head(iris[,3:5]) |> 
  daisy(metric = "gower") |> # gower's distance
  hclust() |>   # Hierarchical Clustering
  plot()  # plot the result
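As a cross-check of the formulas above, a minimal sketch reproducing the 0.08333333 entry for the pair (1, 3) by hand; note that daisy() reports the dissimilarity \(1-S_{ij}\) directly rather than \(\sqrt{1-S_{ij}}\):

x <- head(iris[, 3:5])
R_len <- diff(range(x$Petal.Length))                         # 1.7 - 1.3 = 0.4
R_wid <- diff(range(x$Petal.Width))                          # 0.4 - 0.2 = 0.2
d_len <- abs(x$Petal.Length[1] - x$Petal.Length[3]) / R_len  # 0.25
d_wid <- abs(x$Petal.Width[1] - x$Petal.Width[3]) / R_wid    # 0
d_spc <- as.numeric(x$Species[1] != x$Species[3])            # 0 (both setosa)
mean(c(d_len, d_wid, d_spc))                                 # 0.08333333, matching daisy()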

Implementation in R (Examples: Next Week)