I. Ozkan
Spring 2024
Shirkhorshidi AS, Aghabozorgi S, Wah TY. A Comparison Study on Similarity and Dissimilarity Measures in Clustering Continuous Data. PLoS One. 2015 Dec 11;10(12):e0144059. DOI: 10.1371/journal.pone.0144059
KNN is a non-parametric supervised learning algorithm that uses proximity (similarity) measures to make classifications or predictions
Clustering analysis involves grouping similar objects together based on certain criteria
The choice of a suitable similarity measure is crucial in both KNN and clustering algorithms
In this presentation, we will explore the properties of commonly used similarity measures
Similarity measures quantify the degree of resemblance between objects
They determine the distance between data points in a feature space
A similarity measure produces a numerical value indicating how closely related the data points are
This value plays a central role in clustering algorithms by determining which data points belong to the same cluster
The choice of similarity measure influences the clustering results significantly
Euclidean Distance
Manhattan Distance
Cosine Similarity
Jaccard Similarity
Pearson Correlation Coefficient
Mahalanobis Distance
Symmetry:
\(sim(A,B)=sim(B,A)\)
The similarity between data points \(A\) and \(B\) should be the same as the similarity between \(B\) and \(A\)
The order of comparison shouldn’t affect the result.
Non-negativity:
\(sim(A,B) \geq 0\)
The similarity measure must always be non-negative
The most similar data points receive the highest values (for similarity measures) or the lowest values (for distance measures), indicating maximal closeness
Identity:
\(sim(A,A)=1\)
Triangle Inequality:
\(sim(A,B)+sim(B,C) \geq sim(A,C)\)
For any three data points \(A\), \(B\), and \(C\), the similarity between \(A\) and \(C\) should not be greater than the sum of the similarities between \(A\) and \(B\) and between \(B\) and \(C\)
This property ensures consistency in measuring the relative closeness of data points
Range:
The range of the similarity measure should be defined
The similarity measure to be used must be chosen based on the data types to which it will be applied
The general data types are Numerical, Categorical, and Mixed
Some measures can only be applied to one of these types; other measures may be applied to both Numerical and Categorical data
In this presentation, we will not go into the details of using the same similarity measures for different data types
For Numerical Data:
Here are the widely used measures:
Euclidean Distance
Manhattan Distance (L1 norm)
Minkowski Distance
Cosine Similarity
Mahalanobis Distance
For Categorical Data:
Hamming Distance
Simple Matching Coefficient
Jaccard Similarity
For Mixed Data Types:
Gower's Distance
Kernel-Based Similarities:
RBF (Gaussian) Kernel
Euclidean Distance is the shortest distance between two data points
It is used with real-valued data
Formula:
The Euclidean distance between the \(i^{th}\) and \(j^{th}\) observations with \(p\) dimensions
\(d_E(x_i, x_j)=\left(\sum_{k=1}^{p}\left(x_{ik}-x_{jk} \right) ^2\right)^\frac{1}{2}\)
Use When:
Data is continuous and on the same scale
Shape and size of clusters are roughly spherical (e.g.,
K-Means)
When the shortest (straight-line) distance between points is meaningful
Avoid When:
Features have different scales (unless normalized).
Data has many outliers.
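A minimal R sketch of the formula above, using illustrative values and base R's dist():

```r
# Two observations with p = 3 features (illustrative values)
x_i <- c(1, 2, 3)
x_j <- c(2, 4, 6)

# Euclidean distance computed directly from the formula
sqrt(sum((x_i - x_j)^2))                    # 3.741657

# The same result with base R's dist()
dist(rbind(x_i, x_j), method = "euclidean")
```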
Formula:
The Manhattan distance between the \(i^{th}\) and \(j^{th}\) observations with \(p\) dimensions
\(d_{Man}(x_i, x_j)=\sum_{k=1}^{p}\left|x_{ik}-x_{jk}\right|\)
Use When:
You want to reduce sensitivity to outliers
Data has grid-like structure (e.g., city blocks)
When high-dimensional data make Euclidean distance less informative (curse of dimensionality)
Avoid When:
Straight-line (diagonal) closeness is meaningful, so the Euclidean assumption is preferred.
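A matching sketch for the Manhattan (L1) distance, with the same illustrative vectors:

```r
x_i <- c(1, 2, 3)
x_j <- c(2, 4, 6)

# Manhattan distance: sum of absolute coordinate differences
sum(abs(x_i - x_j))                         # 1 + 2 + 3 = 6
dist(rbind(x_i, x_j), method = "manhattan") # same result
```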
Formula:
The Minkowski distance of order \(q\) between the \(i^{th}\) and \(j^{th}\) observations with \(p\) dimensions
\(d_{Mw}(x_i, x_j)=\left(\sum_{k=1}^{p}\left|x_{ik}-x_{jk} \right| ^q\right)^\frac{1}{q}\)
Use When:
You want a flexible metric that interpolates between Manhattan (\(q=1\)) and Euclidean (\(q=2\))
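A short sketch showing the order parameter \(q\) (passed as the argument named p in base R's dist()); the value q = 3 is only illustrative:

```r
x_i <- c(1, 2, 3)
x_j <- c(2, 4, 6)
q   <- 3   # order of the Minkowski distance (q = 1: Manhattan, q = 2: Euclidean)

# Minkowski distance from the formula
sum(abs(x_i - x_j)^q)^(1 / q)

# base R's dist() takes the order through its argument named "p"
dist(rbind(x_i, x_j), method = "minkowski", p = q)
```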
Cosine of the angle between two vectors
Results in a value that is in the range of \([-1,1]\)
Formula:
\(\cos \theta = \frac{\mathbf x_i^\top \mathbf x_j}{||\mathbf x_i||_2 ||\mathbf x_j||_2}=\langle \frac{\mathbf x_i}{||\mathbf x_i||_2}, \; \frac{\mathbf x_j}{||\mathbf x_j||_2}\rangle\)
Use When:
Focus is on the direction of vectors, not magnitude
Data is sparse (e.g., text mining)
Avoid When:
Absolute differences are important
Vector magnitudes matter, e.g., when the lengths of the documents being compared are themselves informative
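A small sketch of cosine similarity as the inner product of L2-normalized vectors; the vectors are illustrative:

```r
x_i <- c(1, 0, 2, 3)
x_j <- c(0, 1, 2, 1)

# Cosine similarity: dot product divided by the product of the L2 norms
sum(x_i * x_j) / (sqrt(sum(x_i^2)) * sqrt(sum(x_j^2)))   # about 0.764
```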
Formula:
The Mahalanobis distance between the \(i^{th}\) and \(j^{th}\) observations with \(p\) dimensions
\(d_{Mah}(x_i, x_j)=\sqrt{(x_i-x_j)^T\Sigma^{-1}(x_i-x_j)}\)
where \(\Sigma^{-1}\) is the inverse of the covariance matrix
Use When:
Features are correlated.
Clusters may be ellipsoidal.
Avoid When:
Small sample size makes covariance estimation unstable.
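A sketch of the pairwise Mahalanobis distance on the numeric columns of iris, with the covariance matrix estimated from the data (the choice of rows 1 and 51 is arbitrary):

```r
X     <- as.matrix(iris[, 1:4])   # four correlated numeric features
S_inv <- solve(cov(X))            # inverse of the sample covariance matrix

x_i <- X[1, ]
x_j <- X[51, ]
d   <- x_i - x_j

sqrt(drop(t(d) %*% S_inv %*% d))  # Mahalanobis distance between the two observations
```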
For two binary vectors \(x_i\) (rows) and \(x_j\) (columns), the attribute counts can be arranged in a contingency table:
\(\begin{array}{c|cc} & 1 & 0\\ \hline 1 & a & b\\ 0 & c & d \end{array}\)
\(a\) is the number of attributes common to both \(x_i, x_j\)
\(b\) is the number of
attributes \(x_i\) has but \(x_j\) does not have
\(c\) is the number of attributes that \(x_j\) has that \(x_i\) does not have
\(d\) is the number of attributes which neither \(x_i\) nor \(x_j\) has
Let \(x_i=\{1,1,1,1\}\) and \(x_j=\{1,1,0,1\}\)
\(\begin{array}{c|cc} & 1 & 0\\ \hline 1 & 3 & 1\\ 0 & 0 & 0 \end{array}\)
Count of mismatched attributes (mismatched positions) of two vectors of equal length
Use When:
Attributes are binary or categorical with equal importance
Avoid When:
Attributes differ in relevance or have hierarchical relationships
Ex:
The Hamming distance for \(x_i=\{\color{blue}{1},\color{blue}{1},\color{red}{1},\color{blue}{1}\}\) and \(x_j=\{\color{blue}{1},\color{blue}{1},\color{red}{0},\color{blue}{1}\}\) is \(1\)
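The same example in R (a one-line sketch):

```r
x_i <- c(1, 1, 1, 1)
x_j <- c(1, 1, 0, 1)

sum(x_i != x_j)   # Hamming distance: number of mismatched positions = 1
```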
Matches / Total attributes.
\(\frac{a+d}{a+b+c+d}=\frac{3+0}{3+1+0+0}=\frac{3}{4}\)
Use When:
Binary symmetric attributes (e.g., Gender, Yes/No).
Avoid When:
Attributes are asymmetric (e.g., presence/absence data).
Formula:
\(\frac{a}{a+b+c}=\frac{3}{3+1}=\frac{3}{4}\)
Use When:
Binary, asymmetric attributes (e.g., market basket data).
Avoid When:
Absence of an attribute carries equal meaning as presence.
In this example, Jaccard Similarity and SMC are equal, but in general there is an important difference. Consider market basket analysis: assume the market sells \(p\) items and each customer's basket contains \(k\) items, where \(k \ll p\). The vectors representing customers' baskets then contain mostly \(0\)s, and Jaccard similarity is the more appropriate measure
Ex:
Let \(p=100\); customer \(i\) buys bread and cheese, and customer \(j\) buys bread and meat. The Jaccard similarity is then \(1/3\), whereas SMC is \((1+97)/100=0.98\)
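A sketch reproducing this market basket example; the item names and the padding item labels are purely illustrative:

```r
p <- 100                                        # number of items the market sells
all_items <- c("bread", "cheese", "meat", paste0("item", 1:(p - 3)))

x_i <- as.integer(all_items %in% c("bread", "cheese"))  # customer i's basket
x_j <- as.integer(all_items %in% c("bread", "meat"))    # customer j's basket

a  <- sum(x_i == 1 & x_j == 1)   # items both customers bought
b  <- sum(x_i == 1 & x_j == 0)   # items only customer i bought
cc <- sum(x_i == 0 & x_j == 1)   # items only customer j bought ("c" in the slides)
d  <- sum(x_i == 0 & x_j == 0)   # items neither bought

(a + d) / (a + b + cc + d)       # SMC: (1 + 97) / 100 = 0.98
a / (a + b + cc)                 # Jaccard: 1 / 3
```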
Gower’s distance for mixed data types
Combines different types (nominal, ordinal, interval, ratio)
To do so, it computes a score \(s_{ijk}\) for each feature \(k\) and combines them
Numeric variables (continuous or discrete): Range Normalized Manhattan
\(s_{ijk} = 1 - \frac{|x_{ik} - x_{jk}|}{R_k}\)
where \(R_k\) is the range of feature \(k\)
Categorical (nominal) and symmetric binary variables: Simple Matching
\(s_{ijk} =\left\{\begin{matrix}1 \text{ if }x_{ik} = x_{jk}\\0 \text{ if }x_{ik} \neq x_{jk}\end{matrix}\right.\)
Logical (asymmetric binary) variables: Jaccard-type
\(s_{ijk} = \left\{\begin{matrix}1 \text{ if }x_{ik} = \text{ true AND }x_{jk} = \text{ true } \\0 \text{ otherwise.}\end{matrix}\right.\)
The Total Similarity Score
\(S_{ij} = \frac{\sum_{k=1}^N{(s_{ijk}\cdot\delta_{ijk})}}{\sum_{k=1}^N{\delta_{ijk}}}.\)
where
\(\delta_{ijk}=1\) for all features on which the two objects can be compared, and \(\delta_{ijk}=0\) otherwise
The Corresponding Distance (Dissimilarity)
\(D_{ij} = \sqrt{1-S_{ij}}\)
Use When:
Data contains mixed types
Need a unified similarity score
Avoid When:
All attributes are of the same type (a simpler metric can be used).
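A hand-computed sketch of the Gower score for two observations with one numeric and one nominal feature; the feature names, values, and the assumed range are illustrative (in practice \(R_k\) is taken from the data, as the daisy() example later in the presentation does):

```r
# Two observations: a numeric feature "age" and a nominal feature "colour" (illustrative)
age    <- c(25, 40); R_age <- 60               # assume the observed range of age is 60
colour <- c("red", "red")

s_age    <- 1 - abs(age[1] - age[2]) / R_age   # range-normalized numeric score: 0.75
s_colour <- as.numeric(colour[1] == colour[2]) # simple matching for the nominal feature: 1

S_ij <- mean(c(s_age, s_colour))               # both features comparable, equal weights: 0.875
D_ij <- sqrt(1 - S_ij)                         # corresponding Gower distance
c(similarity = S_ij, distance = D_ij)
```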
Used in Spectral Clustering, SVM, or non-linear clustering:
Please refer to SVM notes
Use When:
You need to map data into a high-dimensional space for better separability.
Avoid When:
Interpretability is crucial.
Kernel parameter tuning (e.g., the RBF bandwidth \(\sigma\)) is difficult.
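A minimal sketch of a Gaussian (RBF) kernel similarity; rbf_similarity is a helper defined here (not a package function), and the bandwidth value is illustrative:

```r
# Gaussian (RBF) kernel similarity between two numeric vectors
rbf_similarity <- function(x, y, sigma = 1) {
  exp(-sum((x - y)^2) / (2 * sigma^2))
}

x_i <- c(1, 2, 3)
x_j <- c(2, 4, 6)

rbf_similarity(x_i, x_j, sigma = 2)   # near 1 for close points, near 0 for distant ones
```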
| Data Type | Recommended Similarity | Clustering Algorithm Examples |
|---|---|---|
| Numerical (scaled) | Euclidean, Manhattan, Cosine | K-Means, DBSCAN, Hierarchical |
| Categorical | Hamming, Jaccard, SMC | K-Modes, Agglomerative (custom distance) |
| Mixed | Gower | PAM (Partitioning Around Medoids), Hierarchical |
| Text / Sparse | Cosine, Jaccard | K-Means (with TF-IDF), LDA |
| Correlated Features | Mahalanobis | Gaussian Mixture Models |
| Nonlinear Structure | RBF Kernel | Spectral Clustering |
Normalization: Always normalize/standardize numerical features before using distance-based methods unless using Mahalanobis
Outliers: Prefer Manhattan or cosine similarity if your data is prone to outliers
High Dimensionality: Euclidean distance becomes less informative (“curse of dimensionality”). Use Cosine or Dimensionality Reduction
Interpretable Clustering: Prefer simple metrics (e.g., Euclidean or Jaccard) with hierarchical clustering
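A quick sketch of the normalization advice, standardizing the numeric iris columns before computing Euclidean distances:

```r
X        <- iris[, 1:4]     # numeric features measured on different scales
X_scaled <- scale(X)        # z-score standardization: mean 0, standard deviation 1

d_raw    <- dist(X,        method = "euclidean")
d_scaled <- dist(X_scaled, method = "euclidean")  # no longer dominated by large-scale features
```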
Understanding the properties of similarity measures is essential for selecting appropriate clustering algorithms
R provides a comprehensive set of tools for implementing clustering analysis and exploring different similarity measures
There is a CRAN Cluster Task View
Among others, the proxy and proxyC packages implement several similarity measures
Some packages have their own similarity measure functions, for example the daisy() function in the cluster package
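A sketch of how the proxy package can be used, assuming it is installed; the measure names shown ("cosine", "Manhattan") are looked up in its registry of measures:

```r
library(proxy)   # registry of similarity and distance measures

X <- as.matrix(iris[1:5, 1:4])

simil(X, method = "cosine")     # pairwise cosine similarities
dist(X,  method = "Manhattan")  # proxy's dist() also accepts registered measure names
```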
Example: Gower’s Similarity Coefficient
The first six rows of the iris data (Petal.Length, Petal.Width, Species):

| Petal.Length | Petal.Width | Species |
|---|---|---|
| 1.4 | 0.2 | setosa |
| 1.4 | 0.2 | setosa |
| 1.3 | 0.2 | setosa |
| 1.5 | 0.2 | setosa |
| 1.4 | 0.2 | setosa |
| 1.7 | 0.4 | setosa |
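The dissimilarities printed below can be reproduced with a call along these lines (the Gower ranges are then computed over these six rows only):

```r
library(cluster)

head(iris[, 3:5]) |>
  daisy(metric = "gower")   # prints the 6 x 6 Gower dissimilarities shown below
```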
```
## Dissimilarities :
##            1          2          3          4          5
## 2 0.00000000
## 3 0.08333333 0.08333333
## 4 0.08333333 0.08333333 0.16666667
## 5 0.00000000 0.00000000 0.08333333 0.08333333
## 6 0.58333333 0.58333333 0.66666667 0.50000000 0.58333333
##
## Metric :  mixed ;  Types = I, I, N
## Number of objects : 6
```
```r
library(cluster)               # provides daisy() for Gower's distance

head(iris[, 3:5]) |>
  daisy(metric = "gower") |>   # Gower's distance on mixed-type columns
  hclust() |>                  # hierarchical clustering on the dissimilarities
  plot()                       # plot the resulting dendrogram
```
R provides various packages for clustering analysis, including cluster, fpc, factoextra, and dbscan
The proxy package offers a flexible framework for defining custom similarity measures
We can visualize clustering results using R packages such as factoextra
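A sketch of one possible visualization pipeline, assuming the cluster and factoextra packages are installed; the choice of k = 3 is illustrative:

```r
library(cluster)      # daisy(), pam(), silhouette()
library(factoextra)   # fviz_silhouette() and other plotting helpers

gdist <- daisy(iris[, 3:5], metric = "gower")  # Gower distance on mixed-type columns
fit   <- pam(gdist, k = 3)                     # PAM clustering directly on the dissimilarities

fviz_silhouette(fit)                           # silhouette plot of the clustering result
```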