I. Ozkan
Spring 2025
Textbook
Some Slides are Adapted from an Additional Book Chapter
Hands-On Machine Learning with R
Some Slides are Adapted from this Tutorial
Decision Tree (Classification Tree), MIS 301
KNN MIS 301
Linear Probability Model MIS 301
Logistic Regression MIS 301
Support Vector Machine MIS 302
Part of Supervised Learning (input(s)-output)
For Classification Tasks
One of the best out-of-the-box classifiers
Keywords:
Maximal margin classifier: a simple and intuitive classifier
Support vector classifier: an extension of the maximal margin classifier
Support vector machine (SVM): a further extension of the support vector classifier
Hyperplane: a flat affine subspace of dimension \(p - 1\) in \(p\)-dimensional space, defined by the linear equation
\(f\left(X\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p = 0\)
For example, for \(p=2\) (a 2-D space) and \(p=3\) (a 3-D space) we have (adapted from the Hands-On Machine Learning with R book):
\(p=2 \implies \beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0\) (a line) and,
\(p=3 \implies \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 = 0\) (a plane)
A point lies on one side of the hyperplane if
\(f\left(X\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p > 0\)
or on the other side of the hyperplane if
\(f\left(X\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p < 0\)
If the two classes fall on opposite sides of \(f(X) = 0\), then \(f(X)\) defines a separating hyperplane
Let \(y_i \in \{1,-1\}\) represent the two classes, where \(y_i=1\) for \(f(X)>0\) and \(y_i=-1\) for \(f(X)<0\)
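As a minimal sketch of this decision rule (in R, with made-up coefficients chosen only for illustration), classification amounts to checking the sign of \(f(X)\):

```r
# A hypothetical hyperplane in p = 2 dimensions (coefficients are illustrative only)
beta0 <- -1
beta  <- c(2, 3)                                 # f(X) = -1 + 2*X1 + 3*X2

f        <- function(x) beta0 + sum(beta * x)    # evaluate f at a point x = (X1, X2)
classify <- function(x) ifelse(f(x) > 0, 1, -1)  # assign y by the sign of f(X)

classify(c(1, 1))   # f =  4 > 0  -> class  1
classify(c(0, 0))   # f = -1 < 0  -> class -1
```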
An unlimited number of separating hyperplanes may be drawn
Which one is optimal?
Maximal margin hyperplane and Optimal Separating Hyperplane are used interchangeably
The OSH separates the two classes while maximizing the distance to the closest points from either class
This maximized distance is referred to as the margin \(M\) (The Maximal Margin Classifier)
Geometrically:
Draw the polygons surrounding each class (convex hull)
Draw the shortest line segment that connects the two polygons (dotted line between two classes)
The perpendicular bisector of this line segment is the Optimal Separating Hyperplane
The margin boundaries pass through the support vectors (the 2 points shown as red triangles) and are parallel to the separating hyperplane
In this example the two classes can be separated by a linear hyperplane
Briefly, the maximal margin hyperplane is the solution to the optimization problem
\[\begin{align} &\underset{\beta_0, \beta_1, \dots, \beta_p}{\text{maximize}} \quad M \\ &\text{subject to} \quad \begin{cases} \sum_{j = 1}^p \beta_j^2 = 1,\\ y_i\left(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}\right) \ge M,\quad i = 1, 2, \dots, n \end{cases} \end{align}\]
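A minimal R sketch of fitting (approximately) the maximal margin classifier, assuming the e1071 package and simulated separable data (neither is named in the slides); a linear kernel with a very large cost penalty approximates the hard margin solution:

```r
library(e1071)                            # svm(); an assumed package choice

set.seed(1)
x <- matrix(rnorm(40), ncol = 2)
y <- rep(c(-1, 1), each = 10)
x[y == 1, ] <- x[y == 1, ] + 3            # shift one class so the data are separable
dat <- data.frame(x1 = x[, 1], x2 = x[, 2], y = factor(y))

# A huge cost penalty leaves essentially no room for margin violations
fit <- svm(y ~ ., data = dat, kernel = "linear", cost = 1e5, scale = FALSE)

fit$index        # row indices of the support vectors
plot(fit, dat)   # decision regions and support vectors
```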
Sometimes perfect separation is achievable, but not desirable
Let’s add a single outlier at the point \((0.5,1)\)
The data are still perfectly separable by a straight line
The decision boundaries obtained using logistic regression and the hard margin classifier (HMC) will not generalize well to new data (they are not robust to outliers in the feature space)
The LDA model seems to produce a more reasonable decision boundary
The constraints may be relaxed to get a better decision boundary; the result is called the soft margin classifier
The maximization problem is similar to the hard margin case with two additions: slack variables \(\epsilon_i\) that allow individual observations to be on the wrong side of the margin or the hyperplane, and \(C\) (the budget), a non-negative tunable hyperparameter:
\[\begin{align} &\underset{\beta_0, \beta_1, \dots, \beta_p}{\text{maximize}} \quad M \\ &\text{subject to} \quad \begin{cases} \sum_{j = 1}^p \beta_j^2 = 1,\\ y_i\left(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}\right) \ge M\left(1 - \epsilon_i\right), \quad i = 1, 2, \dots, n\\ \epsilon_i \ge 0, \\ \sum_{i = 1}^n \epsilon_i \le C\end{cases} \end{align}\]
\(\epsilon_i\) tells us where the \(i^{th}\) observation is located: \(\epsilon_i = 0\) means it is on the correct side of the margin, \(\epsilon_i > 0\) means it violates the margin, and \(\epsilon_i > 1\) means it is on the wrong side of the hyperplane
Observations that lie directly on the margin, or on the wrong side of the margin for their class, are known as support vectors. Only these observations affect the support vector classifier
\(C\) bounds the sum of the \(\epsilon_i\)’s, and so it determines the number and severity of the violations to the margin
When \(C\) is small (the cost of a violation is effectively high), the optimization seeks a narrow margin that is rarely violated; the classifier fits the data tightly, which may give low bias but high variance
When \(C\) is large, more violations are tolerated; the margin is wider and the fit is looser, which may give higher bias but lower variance
Left: Points 1 and 8 are on the wrong side of the margin, no observation is on the wrong side of the hyperplane
Right: Points 12 and 11 are on the wrong side of the hyperplane, points 1 and 8 are on the wrong side of the margin
Choosing the cost of violation is important: we want to avoid both underfitting and overfitting, so the balance matters. Cross-validation helps us choose a good value (sketched below)
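A sketch of both points, assuming the e1071 package and simulated overlapping data; note that e1071's `cost` argument is the penalty on margin violations, so it acts inversely to the budget \(C\) in the formulation above (a large `cost` corresponds to a small budget and a narrow margin):

```r
library(e1071)

set.seed(1)
x <- matrix(rnorm(40), ncol = 2)
y <- rep(c(-1, 1), each = 10)
x[y == 1, ] <- x[y == 1, ] + 1            # small shift: the classes overlap
dat <- data.frame(x1 = x[, 1], x2 = x[, 2], y = factor(y))

# Effect of the violation penalty: margin width vs. number of support vectors
fit_narrow <- svm(y ~ ., data = dat, kernel = "linear", cost = 100)    # narrow margin
fit_wide   <- svm(y ~ ., data = dat, kernel = "linear", cost = 0.01)   # wide margin
c(narrow = length(fit_narrow$index), wide = length(fit_wide$index))    # support vector counts

# 10-fold cross-validation over a grid of cost values
cv <- tune(svm, y ~ ., data = dat, kernel = "linear",
           ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
summary(cv)       # CV error for each cost value
cv$best.model     # the classifier refit with the best cost
```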
The soft margin classifier provides robustness against outliers and noisy data
Its performance is sensitive to the choice of the cost value
Introducing the cost parameter increases flexibility
Up to now, only linear decision boundaries have been discussed (often too restrictive to be useful in practice)
If the relationship is non-linear, then a linear decision boundary performs poorly
The kernel trick may be the answer to overcoming this
The kernel trick enlarges the feature space using basis functions (functions of the predictors, such as polynomials with quadratic, cubic, etc. terms)
In this enlarged (kernel-induced) feature space, a hyperplane can often separate the two classes
Left: binary classification problem (non-linear case)
Right: Enlarged feature space by adding a third feature, \(X_3 = X_1^2 + X_2^2\) (polynomial kernel function of degree 2)
\(X_3\) is smaller for all green points, hence the two classes can be perfectly separated by a (flat) hyperplane
The resulting decision boundary
The idea is to use the kernel trick to enlarge the feature space until the classes become separable by a hyperplane
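A brief R sketch of this enlargement, assuming simulated two-class data on concentric circles and the e1071 package (neither appears in the slides); adding the basis function \(X_3 = X_1^2 + X_2^2\) makes the classes separable with a plain linear kernel:

```r
library(e1071)

set.seed(1)
n     <- 200
r     <- c(runif(n / 2, 0, 1), runif(n / 2, 1.5, 2.5))   # inner (green) vs. outer (red) radii
theta <- runif(n, 0, 2 * pi)
dat <- data.frame(x1 = r * cos(theta),
                  x2 = r * sin(theta),
                  y  = factor(rep(c("green", "red"), each = n / 2)))

dat$x3 <- dat$x1^2 + dat$x2^2             # the added basis function X3 = X1^2 + X2^2

# In the enlarged (X1, X2, X3) space a flat hyperplane separates the classes,
# so a linear-kernel SVM suffices
fit <- svm(y ~ x1 + x2 + x3, data = dat, kernel = "linear", cost = 10)
mean(predict(fit, dat) == dat$y)          # training accuracy (should be 1, or very close)
```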
The support vector machine (SVM) is an extension of the support vector classifier that results from enlarging the feature space in a specific way, using kernels
Popular Kernel Functions
\(d^{th}\) degree polynomial: \(K\left(x, x'\right) = \gamma\left(1 + \langle x, x' \rangle\right)^d\) where the inner product is \(\langle x, x' \rangle = \sum_{j = 1}^p x_j x_j'\)
Radial basis function: \(K\left(x, x'\right) = \exp\left(-\gamma \lVert x - x'\rVert ^ 2\right)\)
Hyperbolic tangent (Sigmoid Kernel): \(K\left(x, x'\right) = \tanh\left(k_1 x \cdot x' + k_2\right)\)
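A sketch of these kernels as exposed by e1071::svm (an assumed interface, not named in the slides). In e1071/libsvm the kernels are parameterized as polynomial \(\left(\gamma \langle x, x'\rangle + c_0\right)^d\), radial \(\exp\left(-\gamma \lVert x - x'\rVert^2\right)\), and sigmoid \(\tanh\left(\gamma \langle x, x'\rangle + c_0\right)\), so the `degree`, `gamma`, and `coef0` arguments map onto the parameters above:

```r
library(e1071)

set.seed(1)
x <- matrix(rnorm(400), ncol = 2)
y <- factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1.5, "red", "green"))   # non-linear boundary
dat <- data.frame(x1 = x[, 1], x2 = x[, 2], y = y)

fit_poly <- svm(y ~ ., data = dat, kernel = "polynomial", degree = 2, cost = 1)
fit_rbf  <- svm(y ~ ., data = dat, kernel = "radial",     gamma  = 1, cost = 1)
fit_sig  <- svm(y ~ ., data = dat, kernel = "sigmoid",    gamma  = 1, cost = 1)

# Training accuracy of each kernel on this toy data set
sapply(list(polynomial = fit_poly, radial = fit_rbf, sigmoid = fit_sig),
       function(m) mean(predict(m, dat) == dat$y))
```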
Example