I. Ozkan
Spring 2025
Textbook
Some Slides are Adapted from an Additional Book Chapter
Hands-On Machine Learning with R
Some Slides are Adapted from this Tutorial
Decision Tree (Classification Tree), MIS 301
KNN MIS 301
Linear Probability Model MIS 301
Logistic Regression MIS 301
Support Vector Machine MIS 302
Part of Supervised Learning (input(s)-output)
For Classification Tasks
One of the best out-of-the-box classifiers
Keywords:
Maximal margin classifier: a simple and intuitive classifier
Support vector classifier: an extension of the maximal margin classifier
Support vector machine (SVM): a further extension of the support vector classifier
Hyperplane: a flat affine subspace of dimension \(p - 1\) in \(p\)-dimensional space, defined by the linear equation
\(f\left(X\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p = 0\)
For example, for \(p=2\) (a 2-D space) and \(p=3\) (a 3-D space) we have (adapted from the Hands-On Machine Learning with R book):
\(p=2 \implies \beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0\) (a line) and,
\(p=3 \implies \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 = 0\) (a plane)
A point lies on one side of the hyperplane if
\(f\left(X\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p > 0\)
or on the other side of the hyperplane if
\(f\left(X\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p < 0\)
If the two classes fall on opposite sides of \(f(X) = 0\), then \(f(X)\) defines a separating hyperplane
Let \(y_i \in \{1,-1\}\) represent the two classes, where \(y_i=1\) for \(f(X)>0\) and \(y_i=-1\) for \(f(X)<0\)
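As a minimal sketch of this decision rule (in R, with made-up coefficients chosen only for illustration), classification amounts to checking the sign of \(f(X)\):

```r
# A hypothetical hyperplane in p = 2 dimensions (coefficients are illustrative only)
beta0 <- -1
beta  <- c(2, 3)                                 # f(X) = -1 + 2*X1 + 3*X2

f        <- function(x) beta0 + sum(beta * x)    # evaluate f at a point x = (X1, X2)
classify <- function(x) ifelse(f(x) > 0, 1, -1)  # assign y by the sign of f(X)

classify(c(1, 1))   # f =  4 > 0  -> class  1
classify(c(0, 0))   # f = -1 < 0  -> class -1
```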
An unlimited number of separating hyperplanes may be drawn
Which one is optimal?
Maximal margin hyperplane and Optimal Separating Hyperplane are used interchangeably
The OSH separates the two classes while maximizing the distance to the closest points from either class
This maximized distance is referred to as the margin \(M\) (The Maximal Margin Classifier)
Geometrically:
Draw the polygons surrounding each class (convex hull)
Draw the shortest line segment that connects the two polygons (dotted line between two classes)
The perpendicular bisector of this line segment is the Optimal Separating Hyperplane
The margin boundaries pass through the support vectors (the 2 points shown as red triangles) and are parallel to the separating hyperplane
In this example the two classes can be separated by a linear hyperplane
Briefly, the maximal margin hyperplane is the solution to the optimization problem
\[\begin{align} &\underset{\beta_0, \beta_1, \dots, \beta_p}{\text{maximize}} \quad M \\ &\text{subject to} \quad \begin{cases} \sum_{j = 1}^p \beta_j^2 = 1,\\ y_i\left(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}\right) \ge M,\quad i = 1, 2, \dots, n \end{cases} \end{align}\]
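A minimal R sketch of fitting (approximately) the maximal margin classifier, assuming the e1071 package and simulated separable data (neither is named in the slides); a linear kernel with a very large cost penalty approximates the hard margin solution:

```r
library(e1071)                            # svm(); an assumed package choice

set.seed(1)
x <- matrix(rnorm(40), ncol = 2)
y <- rep(c(-1, 1), each = 10)
x[y == 1, ] <- x[y == 1, ] + 3            # shift one class so the data are separable
dat <- data.frame(x1 = x[, 1], x2 = x[, 2], y = factor(y))

# A huge cost penalty leaves essentially no room for margin violations
fit <- svm(y ~ ., data = dat, kernel = "linear", cost = 1e5, scale = FALSE)

fit$index        # row indices of the support vectors
plot(fit, dat)   # decision regions and support vectors
```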
Sometimes perfect separation is achievable, but not desirable
Let’s add a single outlier at the point \((0.5,1)\)
The data are still perfectly separable by a straight line
The decision boundaries obtained using logistic regression and the hard margin classifier (HMC) will not generalize well to new data (they are not robust to outliers in the feature space)
The LDA model seems to produce a more reasonable decision boundary
The constraints may be relaxed to get a better decision boundary; the result is called the soft margin classifier
The maximization problem is similar to the hard margin case with two additions: slack variables \(\epsilon_i\) that allow individual observations to be on the wrong side of the margin or the hyperplane, and \(C\) (the budget), a non-negative tunable hyperparameter:
\[\begin{align} &\underset{\beta_0, \beta_1, \dots, \beta_p}{\text{maximize}} \quad M \\ &\text{subject to} \quad \begin{cases} \sum_{j = 1}^p \beta_j^2 = 1,\\ y_i\left(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}\right) \ge M\left(1 - \epsilon_i\right), \quad i = 1, 2, \dots, n\\ \epsilon_i \ge 0, \\ \sum_{i = 1}^n \epsilon_i \le C\end{cases} \end{align}\]
\(\epsilon_i\) tells us where the \(i^{th}\) observation is located: \(\epsilon_i = 0\) means it is on the correct side of the margin, \(\epsilon_i > 0\) means it violates the margin, and \(\epsilon_i > 1\) means it is on the wrong side of the hyperplane
Observations that lie directly on the margin, or on the wrong side of the margin for their class, are known as support vectors. Only these observations affect the support vector classifier
\(C\) bounds the sum of the \(\epsilon_i\)’s, and so it determines the number and severity of the violations to the margin
When \(C\) is small (the cost of a violation is effectively high), the optimization seeks a narrow margin that is rarely violated; the classifier fits the data tightly, which may give low bias but high variance
When \(C\) is large, more violations are tolerated; the margin is wider and the fit is looser, which may give higher bias but lower variance
Left: Points 1 and 8 are on the wrong side of the margin, no observation is on the wrong side of the hyperplane
Right: Points 12 and 11 are on the wrong side of the hyperplane, points 1 and 8 are on the wrong side of the margin
Choosing the cost of violation is important: we want to avoid both underfitting and overfitting, so the balance matters. Cross-validation helps us choose a good value (sketched below)
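A sketch of both points, assuming the e1071 package and simulated overlapping data; note that e1071's `cost` argument is the penalty on margin violations, so it acts inversely to the budget \(C\) in the formulation above (a large `cost` corresponds to a small budget and a narrow margin):

```r
library(e1071)

set.seed(1)
x <- matrix(rnorm(40), ncol = 2)
y <- rep(c(-1, 1), each = 10)
x[y == 1, ] <- x[y == 1, ] + 1            # small shift: the classes overlap
dat <- data.frame(x1 = x[, 1], x2 = x[, 2], y = factor(y))

# Effect of the violation penalty: margin width vs. number of support vectors
fit_narrow <- svm(y ~ ., data = dat, kernel = "linear", cost = 100)    # narrow margin
fit_wide   <- svm(y ~ ., data = dat, kernel = "linear", cost = 0.01)   # wide margin
c(narrow = length(fit_narrow$index), wide = length(fit_wide$index))    # support vector counts

# 10-fold cross-validation over a grid of cost values
cv <- tune(svm, y ~ ., data = dat, kernel = "linear",
           ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
summary(cv)       # CV error for each cost value
cv$best.model     # the classifier refit with the best cost
```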
The soft margin classifier provides robustness against outliers and noisy data
Its performance is sensitive to the choice of the cost value
Introducing the cost parameter increases flexibility
Up to now, only linear decision boundaries have been discussed (often too restrictive to be useful in practice)
If the relationship is non-linear, then a linear decision boundary performs poorly
The kernel trick may be the answer to overcoming this
The kernel trick enlarges the feature space using basis functions (functions of the predictors, such as polynomials with quadratic, cubic, etc. terms)
In this enlarged (kernel-induced) feature space, a hyperplane can often separate the two classes
Left: binary classification problem (non-linear case)
Right: Enlarged feature space by adding a third feature, \(X_3 = X_1^2 + X_2^2\) (polynomial kernel function of degree 2)
\(X_3\) is smaller for all green points, hence the two classes can be perfectly separated by a (flat) hyperplane
The resulting decision boundary
The idea is to use the kernel trick to enlarge the feature space until the classes become separable by a hyperplane
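A brief R sketch of this enlargement, assuming simulated two-class data on concentric circles and the e1071 package (neither appears in the slides); adding the basis function \(X_3 = X_1^2 + X_2^2\) makes the classes separable with a plain linear kernel:

```r
library(e1071)

set.seed(1)
n     <- 200
r     <- c(runif(n / 2, 0, 1), runif(n / 2, 1.5, 2.5))   # inner (green) vs. outer (red) radii
theta <- runif(n, 0, 2 * pi)
dat <- data.frame(x1 = r * cos(theta),
                  x2 = r * sin(theta),
                  y  = factor(rep(c("green", "red"), each = n / 2)))

dat$x3 <- dat$x1^2 + dat$x2^2             # the added basis function X3 = X1^2 + X2^2

# In the enlarged (X1, X2, X3) space a flat hyperplane separates the classes,
# so a linear-kernel SVM suffices
fit <- svm(y ~ x1 + x2 + x3, data = dat, kernel = "linear", cost = 10)
mean(predict(fit, dat) == dat$y)          # training accuracy (should be 1, or very close)
```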
The support vector machine (SVM) is an extension of the support vector classifier that results from enlarging the feature space in a specific way, using kernels
Popular Kernel Functions
\(d^{th}\) degree polynomial: \(K\left(x, x'\right) = \gamma\left(1 + \langle x, x' \rangle\right)^d\) where the inner product is \(\langle x, x' \rangle = \sum_{j = 1}^p x_j x_j'\)
Radial basis function: \(K\left(x, x'\right) = \exp\left(-\gamma \lVert x - x'\rVert ^ 2\right)\)
Hyperbolic tangent (Sigmoid Kernel): \(K\left(x, x'\right) = \tanh\left(k_1 x \cdot x' + k_2\right)\)
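A sketch of these kernels as exposed by e1071::svm (an assumed interface, not named in the slides). In e1071/libsvm the kernels are parameterized as polynomial \(\left(\gamma \langle x, x'\rangle + c_0\right)^d\), radial \(\exp\left(-\gamma \lVert x - x'\rVert^2\right)\), and sigmoid \(\tanh\left(\gamma \langle x, x'\rangle + c_0\right)\), so the `degree`, `gamma`, and `coef0` arguments map onto the parameters above:

```r
library(e1071)

set.seed(1)
x <- matrix(rnorm(400), ncol = 2)
y <- factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1.5, "red", "green"))   # non-linear boundary
dat <- data.frame(x1 = x[, 1], x2 = x[, 2], y = y)

fit_poly <- svm(y ~ ., data = dat, kernel = "polynomial", degree = 2, cost = 1)
fit_rbf  <- svm(y ~ ., data = dat, kernel = "radial",     gamma  = 1, cost = 1)
fit_sig  <- svm(y ~ ., data = dat, kernel = "sigmoid",    gamma  = 1, cost = 1)

# Training accuracy of each kernel on this toy data set
sapply(list(polynomial = fit_poly, radial = fit_rbf, sigmoid = fit_sig),
       function(m) mean(predict(m, dat) == dat$y))
```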
Example