I. Ozkan, PhD
Professor
MIS
Cankaya
University
iozkan@cankaya.edu.tr
Spring 2025
Keywords:
Inputs: independent variables, covariates, predictors, features, regressors.
Output: dependent variable, response, variate, label, regressand.
Let \(Y\) be the dependent variable and \(X_1, X_2,\cdots,X_p\) be \(p\) different predictors.
Assume that there is some relationship between \(Y\) and \(X = (X_1, X_2,\cdots,X_p)\):
\[Y=f(X)+\varepsilon\]
\[Y=f(X)+\varepsilon=Pattern+Error\]
Why estimate \(f()\)?
How do we estimate \(f()\)?
Prediction Accuracy and Model Interpretability
Supervised vs Unsupervised Learning
Regression vs Classification Problems
\(X=(X_1, X_2, \cdots,X_p)\) are available, but \(Y\) cannot easily be obtained.
We predict \(Y\) with \(\hat Y=\hat f(X)\), since \(E(\varepsilon)=0\)
\(\hat f()\) may be a black box model: its exact form is not important as long as it predicts \(Y\) accurately
The error has a reducible part (\(\hat f()\) is not a perfect estimate of \(f\), and a better technique can reduce this) and an irreducible part (even if \(\hat f()\) were a perfect estimate of \(f\), \(Y\) would still depend on \(\varepsilon\), which cannot be predicted from \(X\))
The expected value of the squared difference between the actual and predicted values of \(Y\):
\[E(Y-\hat Y)^2=E\big[f(X)+\varepsilon -\hat f(X)\big]^2 =\underbrace{E\big[\big(f(X) -\hat f(X)\big)^2\big]}_{\text{reducible}} +\underbrace{\mathrm{Var}(\varepsilon)}_{\text{irreducible}}\]
\(\varepsilon\) may contain (i) unmeasured variables and (ii) unmeasurable variation
The focus is on minimizing the reducible error through different techniques for estimating \(f\)
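A minimal numerical sketch of this decomposition in Python (the data-generating process, noise level, and model choice below are all hypothetical, chosen only to illustrate the two error components):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data-generating process: Y = f(X) + eps with f(X) = sin(X)
n = 10_000
X = rng.uniform(0, 2 * np.pi, n)
eps = rng.normal(0, 0.5, n)          # irreducible noise, Var(eps) = 0.25
Y = np.sin(X) + eps

# A deliberately imperfect estimate f_hat: a straight line fit to (X, Y)
f_hat = np.polynomial.Polynomial.fit(X, Y, deg=1)

reducible = np.mean((np.sin(X) - f_hat(X)) ** 2)   # E[(f(X) - f_hat(X))^2]
total = np.mean((Y - f_hat(X)) ** 2)               # E[(Y - Y_hat)^2]

print(f"reducible part      : {reducible:.3f}")
print(f"irreducible Var(eps): {np.var(eps):.3f}")
print(f"total MSE           : {total:.3f}  # ~ reducible + irreducible")
```

Choosing a more flexible \(\hat f\) (e.g., a higher-degree polynomial) shrinks the reducible part, but \(\mathrm{Var}(\varepsilon)\) sets a floor that no model can beat.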
In the inference setting, the main aim is to understand the relationship between \(X\) and \(Y\); it is not necessarily to make predictions.
\(\hat f()\) should be chosen so that it is interpretable.
Typical questions are:
Which predictors are associated with the response?
What is the relationship between the response and each predictor?
Is the relationship simple or complicated?
Example: the Advertising data set (first ten observations; advertising budgets in thousands of dollars, sales in thousands of units):

TV | radio | newspaper | sales |
---|---|---|---|
230.1 | 37.8 | 69.2 | 22.1 |
44.5 | 39.3 | 45.1 | 10.4 |
17.2 | 45.9 | 69.3 | 9.3 |
151.5 | 41.3 | 58.5 | 18.5 |
180.8 | 10.8 | 58.4 | 12.9 |
8.7 | 48.9 | 75.0 | 7.2 |
57.5 | 32.8 | 23.5 | 11.8 |
120.2 | 19.6 | 11.6 | 13.2 |
8.6 | 2.1 | 1.0 | 4.8 |
199.8 | 2.6 | 21.2 | 10.6 |
– Which media contribute to sales?
– Which media generate the biggest boost in sales?
– How much increase in sales is associated with a given increase in TV advertising?
Both linear and non-linear approaches are available
Parametric methods (estimating \(f()\) by estimating a set of parameters):
Make an assumption about the functional form or shape of \(f()\), for example a linear model: \(f(X)=\beta_0+\beta_1 X_1+ \beta_2 X_2+ \cdots + \beta_p X_p\)
Use the training data to fit (or train) the model (for example using ordinary least squares)
If the selected model performs poorly, one may choose a more flexible model. This may result in overfitting the data (the model follows the errors too closely)
Example: the relationship among income, education, and seniority. (Figure: the true underlying non-linear surface of income over education and seniority.)
A parametric approach posits, for example,
\[income=\beta_0 + \beta_1 \times education + \beta_2 \times seniority\]
The estimated function is then the plane determined by the fitted coefficients. (Figure: the linear surface fit to the income data.)
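A minimal sketch of this parametric fit in Python (the data are simulated here; the coefficient values and noise level are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Simulated training data (hypothetical relationship)
n = 200
education = rng.uniform(10, 22, n)   # years of education
seniority = rng.uniform(0, 40, n)    # years of seniority
income = 20 + 3.0 * education + 0.8 * seniority + rng.normal(0, 5, n)

# Ordinary least squares estimates beta_0, beta_1, beta_2
X = np.column_stack([education, seniority])
model = LinearRegression().fit(X, income)

print("beta_0 (intercept):", round(model.intercept_, 2))
print("beta_1, beta_2    :", model.coef_.round(2))
print("predicted income for (education=16, seniority=10):",
      model.predict([[16, 10]]).round(2))
```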
Non-parametric methods: no explicit assumptions about the functional form of \(f()\); they can accommodate a wider range of possible shapes for \(f\)
An example with a non-parametric approach is sketched below. (Figure: a smooth non-parametric surface fit to the income data.)
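A minimal non-parametric sketch using k-nearest-neighbours regression (the data are simulated; KNN here is only one of many possible non-parametric methods):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)

# Simulated non-linear relationship: Y = sin(X) + noise
X = rng.uniform(0, 2 * np.pi, (300, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, 300)

# KNN assumes no functional form: it predicts by averaging
# the y-values of the k nearest training observations
knn = KNeighborsRegressor(n_neighbors=10).fit(X, y)
print(knn.predict([[np.pi / 2]]))   # should be close to sin(pi/2) = 1
```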
Some models are more flexible, while others are less flexible
For example, Linear Regression is an example of a less flexible model (MIS 301, MIS 302 (Review), MIS 305, MIS 403 (Review))
Spline regressions are considerably more flexible (MIS 403)
Lasso and Ridge regression (MIS 403) are examples of less flexible models, while Trees (MIS 301, MIS 403), Bagging, and Boosting are examples of more flexible models (MIS 403)
If inference is important, then there are clear advantages to using less flexible models, because they are easier to understand
From time to time, partially linear models may be used for inference when some of the variables exhibit a non-linear relationship with the dependent variable. If the effects of the variables that have a linear relationship with the dependent variable are important to analyze, then a partially linear model may be suitable (MIS 403)
Most learning problems fall into one of the following:
Supervised Learning
Unsupervised Learning
(Reinforcement Learning)
Many classical statistical learning methods are examples of Supervised Learning (since both \(Y\) and the \(X_i\) are available)
Unsupervised Learning describes the more challenging situation where the \(X_i\) are available but the response variable \(Y\) is not
Clustering Analysis is an example of Unsupervised learning (MIS 302, MIS 403)
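As a small illustration of unsupervised learning, a k-means clustering sketch (synthetic, unlabeled data; the group locations are made up):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)

# Two synthetic groups of points; note there is no response Y
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(5, 1, (50, 2))])

# k-means looks for structure using the inputs alone
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_.round(1))   # roughly (0, 0) and (5, 5)
```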
Some variables are Categorical (qualitative) and some others are Numerical (quantitative)
Qualitative variables take values in different categories (sometimes called classes), such as default/no default, cancer/no cancer, etc.
A regression problem is one in which the response variable is quantitative
A classification problem is one in which the response variable is qualitative
Some methods can be used either for quantitative or qualitative responses (Trees, Boosting for example)
Quality-of-fit measures (in general computed on test data), where \(\varepsilon_j=y_j-\hat y_j\) denotes the \(j\)-th prediction error (a sketch follows after this list):
Mean Absolute Error, MAE, \(\text{mean}(|\varepsilon_j|)\)
Mean Absolute Percentage Error, MAPE, \(\text{mean}(|p_{j}|), \: p_j=100\varepsilon_j/Y_j\)
Mean Squared Error, MSE, \(\text{mean}(\varepsilon_j^2)\)
Root Mean Squared Error, RMSE, \(\sqrt{\text{mean}(\varepsilon_j^2)}\)
K-Fold Cross Validation
…
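A minimal sketch computing these measures with NumPy (the actual values below are the first five sales figures from the Advertising table; the predictions are made up):

```python
import numpy as np

y = np.array([22.1, 10.4, 9.3, 18.5, 12.9])      # actual test values
y_hat = np.array([20.5, 11.2, 8.1, 19.0, 14.0])  # hypothetical predictions

eps = y - y_hat                        # prediction errors
mae = np.mean(np.abs(eps))             # Mean Absolute Error
mape = np.mean(np.abs(100 * eps / y))  # Mean Absolute Percentage Error
mse = np.mean(eps ** 2)                # Mean Squared Error
rmse = np.sqrt(mse)                    # Root Mean Squared Error

print(f"MAE={mae:.2f}  MAPE={mape:.1f}%  MSE={mse:.2f}  RMSE={rmse:.2f}")
```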
Variance refers to the amount by which \(\hat f()\) would change if we estimated it using a different training data set
Ideally, the estimate of \(f\) should not vary too much between training sets
High variance means that small changes in the training data lead to large changes in \(\hat f()\)
In general, more flexible statistical methods have higher variance
Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model
For example, Linear Regression may perform poorly if the true relationship is highly non-linear
In such cases, increasing the number of observations does not lead to better predictions
As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease
Good test set performance of a statistical learning method requires low variance as well as low squared bias; this is the bias-variance trade-off
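The trade-off can be stated precisely: for a test observation \(x_0\) with response \(y_0\), the expected test MSE decomposes as
\[E\big(y_0-\hat f(x_0)\big)^2=\mathrm{Var}\big(\hat f(x_0)\big)+\big[\mathrm{Bias}\big(\hat f(x_0)\big)\big]^2+\mathrm{Var}(\varepsilon)\]
so both the variance and the squared bias of \(\hat f\) must be kept low, while \(\mathrm{Var}(\varepsilon)\) again provides the irreducible floor.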
The process of evaluating a model’s performance is known as model assessment
The process of selecting the proper level of flexibility for a model is known as model selection
Two common resampling methods will be discussed (also covered in the MIS 403 course):
Cross-Validation
Bootstrap
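As a preview, a minimal 5-fold cross-validation sketch with scikit-learn (synthetic data; the linear model is only illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)

# Synthetic regression data
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 1, 100)

# 5-fold CV: fit on four folds, evaluate MSE on the held-out fold
scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring="neg_mean_squared_error")
print("estimated test MSE:", round(-scores.mean(), 2))
```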