I. Ozkan
12 11 2020
Inputs: independent variables, covariates, predictors, features, regressors.
Output: dependent variable, response, variate, label, regressand.
Let \(Y\) be the dependent variable and \(X_1, X_2,\cdots,X_p\) be \(p\) different predictors.
Assume that there is some relationship between \(Y\) and \(X = (X_1, X_2,...,X_p)\):
\[Y=f(X)+\varepsilon\]
\[Y=f(X)+\varepsilon=Pattern+Error\]
Why estimate \(f()\)?
How do we estimate \(f()\)?
Prediction Accuracy vs Model Interpretability
Supervised vs Unsupervised Learning
Regression vs Classification Problems
\(X=(X_1, X_2, \cdots,X_p)\) are available but \(Y\) cannot be easily obtained.
We predict \(Y\) with \(\hat Y=\hat f(X)\), since \(E(\varepsilon)=0\)
\(\hat f()\) may be a black-box model: its exact form is not important as long as it predicts \(Y\) accurately
Two sources of error: reducible error (\(\hat f()\) is not a perfect estimate of \(f\), and can be improved) and irreducible error (even if \(\hat f()\) were a perfect estimate of \(f\), \(Y\) would still depend on \(\varepsilon\))
The expected value of the squared difference between actual and predicted value of \(Y\)
\[E(Y-\hat Y)^2=E[f(X)+\varepsilon -\hat f(X)]^2\] \[=\underbrace{[f(X) -\hat f(X)]^2}_{\text{reducible}} +\underbrace{\mathrm{Var}(\varepsilon)}_{\text{irreducible}}\]
\(\varepsilon\) may contain (i) unmeasured variables and (ii) unmeasurable variation
The focus is on minimizing the reducible error through different techniques for estimating \(f\)
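The decomposition above can be checked numerically. Below is a minimal sketch (the particular \(f\), \(\hat f\), and noise level are hypothetical, chosen only for illustration): we simulate \(Y=f(X)+\varepsilon\) with a known \(f\), use a deliberately imperfect \(\hat f\), and confirm that the expected squared error splits into a reducible part plus \(Var(\varepsilon)\).

```python
import numpy as np

rng = np.random.default_rng(0)

# True (usually unknown) pattern and noise: Y = f(X) + eps
f = lambda x: np.sin(x)
n = 100_000
x = rng.uniform(0.0, 3.0, n)
eps = rng.normal(0.0, 0.5, n)          # Var(eps) = 0.25, irreducible
y = f(x) + eps

# A deliberately imperfect estimate f_hat (a crude linear stand-in)
f_hat = lambda x: 0.4 + 0.3 * x

mse = np.mean((y - f_hat(x)) ** 2)            # E(Y - Y_hat)^2
reducible = np.mean((f(x) - f_hat(x)) ** 2)   # [f(X) - f_hat(X)]^2
irreducible = 0.25                            # Var(eps)

# MSE splits into reducible + irreducible (up to sampling noise)
print(mse, reducible + irreducible)
```

The cross term \(2E[(f(X)-\hat f(X))\varepsilon]\) vanishes because \(\varepsilon\) is independent of \(X\) with mean zero, which is why only the two pieces remain.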
The main aim is to understand the relationship between \(X\) and \(Y\), not necessarily to make predictions.
\(\hat f()\) should be chosen so that it is interpretable.
Questions are:
Which predictors are associated with the response?
What is the relationship between the response and each predictor?
Is the relationship simple or complicated?
Advertising data (first six rows; advertising budgets in thousands of dollars, sales in thousands of units):

| TV | radio | newspaper | sales | 
|---|---|---|---|
| 230.1 | 37.8 | 69.2 | 22.1 | 
| 44.5 | 39.3 | 45.1 | 10.4 | 
| 17.2 | 45.9 | 69.3 | 9.3 | 
| 151.5 | 41.3 | 58.5 | 18.5 | 
| 180.8 | 10.8 | 58.4 | 12.9 | 
| 8.7 | 48.9 | 75.0 | 7.2 | 
– Which media contribute to sales?
– Which media generate the biggest boost in sales?
– How much increase in sales is associated with a given increase in TV advertising?
In this course, some linear and non-linear approaches will be covered.
Parametric methods (estimating \(f()\) by estimating a set of parameters):
If the selected model performs poorly, one may choose a more flexible model. This may result in overfitting the data (the model follows the errors too closely)
As an example, consider the relationship between income, education, and seniority, where the true underlying relationship is shown below:
Fig. 2.3
Here is an example of a parametric approach,
\[income=\beta_0 + \beta_1 \times education + \beta_2 \times seniority\]
Then the estimated function:
Fig. 2.4
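As a sketch of this parametric approach, the linear model above can be fit by ordinary least squares. The data below are synthetic stand-ins (the coefficients 20, 3.0, and 0.15 and the variable ranges are made up for illustration, not taken from the Income data in the figure):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data standing in for the income example
n = 200
education = rng.uniform(10, 22, n)     # years of education
seniority = rng.uniform(0, 180, n)     # months of seniority
income = 20 + 3.0 * education + 0.15 * seniority + rng.normal(0, 5, n)

# Parametric assumption: income = b0 + b1*education + b2*seniority
X = np.column_stack([np.ones(n), education, seniority])
beta, *_ = np.linalg.lstsq(X, income, rcond=None)

b0, b1, b2 = beta
print(b0, b1, b2)   # estimates should be near 20, 3.0, 0.15
```

Estimating \(f\) has been reduced to estimating three numbers, which is exactly what makes parametric methods simple and interpretable.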
Compare this with a non-parametric approach shown below:
Fig. 2.5
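One simple non-parametric method is \(k\)-nearest-neighbours regression (not necessarily the method behind the figure): it estimates \(f(x_0)\) by averaging the responses of the \(k\) training points closest to \(x_0\), assuming no functional form at all. A toy sketch on synthetic one-predictor data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data with a clearly non-linear pattern
x = rng.uniform(0, 6, 300)
y = np.sin(x) + rng.normal(0, 0.2, 300)

def knn_predict(x0, k=15):
    """k-nearest-neighbours regression: average y over the k closest x."""
    idx = np.argsort(np.abs(x - x0))[:k]
    return y[idx].mean()

# The fit tracks sin(x) without any assumed equation for f
print(knn_predict(np.pi / 2), knn_predict(3 * np.pi / 2))
```

Note the trade-off: no functional form is assumed, but many more observations are needed than for a parametric fit, and the choice of \(k\) controls flexibility.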
Some models are more flexible, while others are less flexible
For example, linear regression is a less flexible model
Spline models are considerably more flexible
Lasso and ridge regression are examples of less flexible models, while trees, bagging, and boosting are examples of more flexible models
If inference is important, there are clear advantages to using less flexible models, because they are easier to interpret
From time to time, partially linear models may be used for inference when some of the variables exhibit a non-linear relationship with the dependent variable. If the effects of the variables that have a linear relationship with the dependent variable are important to analyze, then a partially linear model may be suitable
Most learning problems fall into one of two categories: supervised or unsupervised
Many classical statistical learning methods are examples of Supervised Learning (both \(Y\) and \(X_i\) are available)
Unsupervised Learning describes the more challenging situation in which the \(X_i\) are available but the response variable \(Y\) is not
Clustering Analysis is an example of Unsupervised learning
Fig. 2.8
Some variables are Categorical (qualitative) and some others are Numerical (quantitative)
Qualitative variables take categories as values (sometimes called classes), such as default/no default or cancer/no cancer
A problem with a quantitative response is referred to as a Regression Problem
A problem with a qualitative response is referred to as a Classification Problem
Some methods can be used with either quantitative or qualitative responses (Trees and Boosting, for example)
Fig. 2.9
Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model
As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease
Good test set performance of a statistical learning method requires low variance as well as low squared bias (the bias-variance trade-off)
Fig. 2.11
The process of evaluating a model’s performance is known as model assessment
The process of selecting the proper level of flexibility for a model is known as model selection
Two common Resampling Methods will be discussed later in this course