I. Ozkan
12 11 2020
Inputs: independent variables, covariates, predictors, features, regressors.
Output: dependent variable, response, variate, label, regressand.
Let \(Y\) be the dependent variable and \(X_1, X_2,\cdots,X_p\) be \(p\) different predictors.
Assume that there is some relationship between \(Y\) and \(X = (X_1, X_2,...,X_p)\):
\[Y=f(X)+\varepsilon\]
\[Y=f(X)+\varepsilon=Pattern+Error\]
Why estimate \(f()\)?
How do we estimate \(f()\)?
Prediction Accuracy vs Model Interpretability
Supervised vs Unsupervised Learning
Regression vs Classification Problems
\(X=(X_1, X_2, \cdots,X_p)\) are available but \(Y\) cannot be easily obtained.
We predict \(Y\) with \(\hat Y=\hat f(X)\), since \(E(\varepsilon)=0\)
\(\hat f()\) may be a black-box model: its exact form is not important as long as it predicts \(Y\) accurately
Two sources of error: reducible error (\(\hat f()\) is not a perfect estimate of \(f\), and can be improved) and irreducible error (even if \(\hat f()\) were a perfect estimate of \(f\), \(Y\) would still depend on \(\varepsilon\))
The expected value of the squared difference between actual and predicted value of \(Y\)
\[E(Y-\hat Y)^2=E[f(X)+\varepsilon -\hat f(X)]^2\] \[=\underbrace{[f(X) -\hat f(X)]^2}_{\text{reducible}} +\underbrace{\mathrm{Var}(\varepsilon)}_{\text{irreducible}}\]
\(\varepsilon\) may contain (i) unmeasured variables and (ii) unmeasurable variation
The focus is on minimizing the reducible error through different techniques for estimating \(f\)
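The decomposition above can be checked numerically. Below is a minimal sketch (the particular \(f\), \(\hat f\), and noise level are hypothetical, chosen only for illustration): we simulate \(Y=f(X)+\varepsilon\) with a known \(f\), use a deliberately imperfect \(\hat f\), and confirm that the expected squared error splits into a reducible part plus \(Var(\varepsilon)\).

```python
import numpy as np

rng = np.random.default_rng(0)

# True (usually unknown) pattern and noise: Y = f(X) + eps
f = lambda x: np.sin(x)
n = 100_000
x = rng.uniform(0.0, 3.0, n)
eps = rng.normal(0.0, 0.5, n)          # Var(eps) = 0.25, irreducible
y = f(x) + eps

# A deliberately imperfect estimate f_hat (a crude linear stand-in)
f_hat = lambda x: 0.4 + 0.3 * x

mse = np.mean((y - f_hat(x)) ** 2)            # E(Y - Y_hat)^2
reducible = np.mean((f(x) - f_hat(x)) ** 2)   # [f(X) - f_hat(X)]^2
irreducible = 0.25                            # Var(eps)

# MSE splits into reducible + irreducible (up to sampling noise)
print(mse, reducible + irreducible)
```

The cross term \(2E[(f(X)-\hat f(X))\varepsilon]\) vanishes because \(\varepsilon\) is independent of \(X\) with mean zero, which is why only the two pieces remain.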
The main aim is to understand the relationship between \(X\) and \(Y\), not necessarily to make predictions.
\(\hat f()\) should be chosen so that it is interpretable.
Questions are:
Which predictors are associated with the response?
What is the relationship between the response and each predictor?
Is the relationship simple or complicated?
Advertising data (first six rows; advertising budgets in thousands of dollars, sales in thousands of units):

| TV | radio | newspaper | sales | 
|---|---|---|---|
| 230.1 | 37.8 | 69.2 | 22.1 | 
| 44.5 | 39.3 | 45.1 | 10.4 | 
| 17.2 | 45.9 | 69.3 | 9.3 | 
| 151.5 | 41.3 | 58.5 | 18.5 | 
| 180.8 | 10.8 | 58.4 | 12.9 | 
| 8.7 | 48.9 | 75.0 | 7.2 | 
– Which media contribute to sales?
– Which media generate the biggest boost in sales?
– How much increase in sales is associated with a given increase in TV advertising?
In this course, some linear and non-linear approaches will be covered.
Parametric methods (estimating \(f()\) by estimating a set of parameters):
If the selected model performs poorly, one may choose a more flexible model. This may result in overfitting the data (the model follows the errors too closely)
As an example, consider the relationship between income, education, and seniority, where the true underlying relationship is shown below:
Fig. 2.3
Here is an example of a parametric approach,
\[income=\beta_0 + \beta_1 \times education + \beta_2 \times seniority\]
Then the estimated function:
Fig. 2.4
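As a sketch of this parametric approach, the linear model above can be fit by ordinary least squares. The data below are synthetic stand-ins (the coefficients 20, 3.0, and 0.15 and the variable ranges are made up for illustration, not taken from the Income data in the figure):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data standing in for the income example
n = 200
education = rng.uniform(10, 22, n)     # years of education
seniority = rng.uniform(0, 180, n)     # months of seniority
income = 20 + 3.0 * education + 0.15 * seniority + rng.normal(0, 5, n)

# Parametric assumption: income = b0 + b1*education + b2*seniority
X = np.column_stack([np.ones(n), education, seniority])
beta, *_ = np.linalg.lstsq(X, income, rcond=None)

b0, b1, b2 = beta
print(b0, b1, b2)   # estimates should be near 20, 3.0, 0.15
```

Estimating \(f\) has been reduced to estimating three numbers, which is exactly what makes parametric methods simple and interpretable.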
Compare this with a non-parametric approach shown below:
Fig. 2.5
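One simple non-parametric method is \(k\)-nearest-neighbours regression (not necessarily the method behind the figure): it estimates \(f(x_0)\) by averaging the responses of the \(k\) training points closest to \(x_0\), assuming no functional form at all. A toy sketch on synthetic one-predictor data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data with a clearly non-linear pattern
x = rng.uniform(0, 6, 300)
y = np.sin(x) + rng.normal(0, 0.2, 300)

def knn_predict(x0, k=15):
    """k-nearest-neighbours regression: average y over the k closest x."""
    idx = np.argsort(np.abs(x - x0))[:k]
    return y[idx].mean()

# The fit tracks sin(x) without any assumed equation for f
print(knn_predict(np.pi / 2), knn_predict(3 * np.pi / 2))
```

Note the trade-off: no functional form is assumed, but many more observations are needed than for a parametric fit, and the choice of \(k\) controls flexibility.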
Some models are more flexible, while others are less flexible
For example, linear regression is a less flexible model
Spline models are considerably more flexible
Lasso and ridge regression are examples of less flexible models, while trees, bagging, and boosting are examples of more flexible models
If inference is important, there are clear advantages to using less flexible models, because they are easier to interpret
From time to time, partially linear models may be used for inference when some of the variables exhibit a non-linear relationship with the dependent variable. If the effects of the variables that have a linear relationship with the dependent variable are important to analyze, then a partially linear model may be suitable
Most learning problems fall into one of two categories: supervised or unsupervised
Many classical statistical learning methods are examples of Supervised Learning (both \(Y\) and \(X_i\) are available)
Unsupervised Learning describes the more challenging situation in which the \(X_i\) are available but the response variable \(Y\) is not
Clustering Analysis is an example of Unsupervised learning
Fig. 2.8
Some variables are Categorical (qualitative) and some others are Numerical (quantitative)
Qualitative variables take categories as values (sometimes called classes), such as default/no default or cancer/no cancer
A problem with a quantitative response is referred to as a Regression Problem
A problem with a qualitative response is referred to as a Classification Problem
Some methods can be used with either quantitative or qualitative responses (Trees and Boosting, for example)
Fig. 2.9
Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model
As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease
Good test set performance of a statistical learning method requires low variance as well as low squared bias (the bias-variance trade-off)
Fig. 2.11
The process of evaluating a model’s performance is known as model assessment
The process of selecting the proper level of flexibility for a model is known as model selection
Two common Resampling Methods will be discussed later in this course