I. Ozkan, PhD
Professor
MIS
Cankaya
University
iozkan@cankaya.edu.tr
Spring 2025
Keywords:
Inputs: independent variables, covariates, predictors, features, regressors.
Output: dependent variable, response, variate, label, regressand.
Let \(Y\) be the dependent variable and \(X_1, X_2,\cdots,X_p\) be \(p\) different predictors.
Assume that there is some relationship between \(Y\) and \(X = (X_1, X_2,\cdots,X_p)\):
\[Y=f(X)+\varepsilon\]
\[Y=f(X)+\varepsilon=Pattern+Error\]
Why estimate \(f()\)?
How do we estimate \(f()\)?
Prediction Accuracy and Model Interpretability
Supervised vs Unsupervised Learning
Regression vs Classification Problems
\(X=(X_1, X_2, \cdots,X_p)\) are available, but \(Y\) cannot easily be obtained.
We predict \(Y\) with \(\hat Y=\hat f(X)\), since \(E(\varepsilon)=0\)
\(\hat f()\) may be a black box model: its exact form is not important as long as it predicts \(Y\) accurately
The error has a reducible part (\(\hat f()\) is not a perfect estimate of \(f\), and a better technique can reduce this) and an irreducible part (even if \(\hat f()\) were a perfect estimate of \(f\), \(Y\) would still depend on \(\varepsilon\), which cannot be predicted from \(X\))
The expected value of the squared difference between the actual and predicted values of \(Y\):
\[E(Y-\hat Y)^2=E\big[f(X)+\varepsilon -\hat f(X)\big]^2 =\underbrace{E\big[\big(f(X) -\hat f(X)\big)^2\big]}_{\text{reducible}} +\underbrace{\mathrm{Var}(\varepsilon)}_{\text{irreducible}}\]
\(\varepsilon\) may contain (i) unmeasured variables and (ii) unmeasurable variation
The focus is on minimizing the reducible error through different techniques for estimating \(f\)
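A minimal numerical sketch of this decomposition in Python (the data-generating process, noise level, and model choice below are all hypothetical, chosen only to illustrate the two error components):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data-generating process: Y = f(X) + eps with f(X) = sin(X)
n = 10_000
X = rng.uniform(0, 2 * np.pi, n)
eps = rng.normal(0, 0.5, n)          # irreducible noise, Var(eps) = 0.25
Y = np.sin(X) + eps

# A deliberately imperfect estimate f_hat: a straight line fit to (X, Y)
f_hat = np.polynomial.Polynomial.fit(X, Y, deg=1)

reducible = np.mean((np.sin(X) - f_hat(X)) ** 2)   # E[(f(X) - f_hat(X))^2]
total = np.mean((Y - f_hat(X)) ** 2)               # E[(Y - Y_hat)^2]

print(f"reducible part      : {reducible:.3f}")
print(f"irreducible Var(eps): {np.var(eps):.3f}")
print(f"total MSE           : {total:.3f}  # ~ reducible + irreducible")
```

Choosing a more flexible \(\hat f\) (e.g., a higher-degree polynomial) shrinks the reducible part, but \(\mathrm{Var}(\varepsilon)\) sets a floor that no model can beat.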
In the inference setting, the main aim is to understand the relationship between \(X\) and \(Y\); it is not necessarily to make predictions.
\(\hat f()\) should be chosen so that it is interpretable.
Typical questions are:
Which predictors are associated with the response?
What is the relationship between the response and each predictor?
Is the relationship simple or complicated?
Example: the Advertising data set (first ten observations; advertising budgets in thousands of dollars, sales in thousands of units):

TV | radio | newspaper | sales |
---|---|---|---|
230.1 | 37.8 | 69.2 | 22.1 |
44.5 | 39.3 | 45.1 | 10.4 |
17.2 | 45.9 | 69.3 | 9.3 |
151.5 | 41.3 | 58.5 | 18.5 |
180.8 | 10.8 | 58.4 | 12.9 |
8.7 | 48.9 | 75.0 | 7.2 |
57.5 | 32.8 | 23.5 | 11.8 |
120.2 | 19.6 | 11.6 | 13.2 |
8.6 | 2.1 | 1.0 | 4.8 |
199.8 | 2.6 | 21.2 | 10.6 |
– Which media contribute to sales?
– Which media generate the biggest boost in sales?
– How much increase in sales is associated with a given increase in TV advertising?
Both linear and non-linear approaches are available
Parametric methods (estimating \(f()\) by estimating a set of parameters):
Make an assumption about the functional form or shape of \(f()\), for example a linear model: \(f(X)=\beta_0+\beta_1 X_1+ \beta_2 X_2+ \cdots + \beta_p X_p\)
Use the training data to fit (or train) the model (for example using ordinary least squares)
If the selected model performs poorly, one may choose a more flexible model. This may result in overfitting the data (the model follows the errors too closely)
Example: the relationship among income, education, and seniority. (Figure: the true underlying non-linear surface of income over education and seniority.)
A parametric approach posits, for example,
\[income=\beta_0 + \beta_1 \times education + \beta_2 \times seniority\]
The estimated function is then the plane determined by the fitted coefficients. (Figure: the linear surface fit to the income data.)
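A minimal sketch of this parametric fit in Python (the data are simulated here; the coefficient values and noise level are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Simulated training data (hypothetical relationship)
n = 200
education = rng.uniform(10, 22, n)   # years of education
seniority = rng.uniform(0, 40, n)    # years of seniority
income = 20 + 3.0 * education + 0.8 * seniority + rng.normal(0, 5, n)

# Ordinary least squares estimates beta_0, beta_1, beta_2
X = np.column_stack([education, seniority])
model = LinearRegression().fit(X, income)

print("beta_0 (intercept):", round(model.intercept_, 2))
print("beta_1, beta_2    :", model.coef_.round(2))
print("predicted income for (education=16, seniority=10):",
      model.predict([[16, 10]]).round(2))
```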
Non-parametric methods: no explicit assumptions about the functional form of \(f()\); they can accommodate a wider range of possible shapes for \(f\)
An example with a non-parametric approach is sketched below. (Figure: a smooth non-parametric surface fit to the income data.)
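A minimal non-parametric sketch using k-nearest-neighbours regression (the data are simulated; KNN here is only one of many possible non-parametric methods):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)

# Simulated non-linear relationship: Y = sin(X) + noise
X = rng.uniform(0, 2 * np.pi, (300, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, 300)

# KNN assumes no functional form: it predicts by averaging
# the y-values of the k nearest training observations
knn = KNeighborsRegressor(n_neighbors=10).fit(X, y)
print(knn.predict([[np.pi / 2]]))   # should be close to sin(pi/2) = 1
```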
Some models are more flexible, while others are less flexible
For example, Linear Regression is an example of a less flexible model (MIS 301, MIS 302 (Review), MIS 305, MIS 403 (Review))
Spline regressions are considerably more flexible (MIS 403)
Lasso and Ridge regression (MIS 403) are examples of less flexible models, while Trees (MIS 301, MIS 403), Bagging, and Boosting are examples of more flexible models (MIS 403)
If inference is important, then there are clear advantages to using less flexible models, because they are easier to understand
From time to time, partially linear models may be used for inference when some of the variables exhibit a non-linear relationship with the dependent variable. If the effects of the variables that have a linear relationship with the dependent variable are important to analyze, then a partially linear model may be suitable (MIS 403)
Most learning problems fall into one of the following:
Supervised Learning
Unsupervised Learning
(Reinforcement Learning)
Many classical statistical learning methods are examples of Supervised Learning (since both \(Y\) and the \(X_i\) are available)
Unsupervised Learning describes the more challenging situation where the \(X_i\) are available but the response variable \(Y\) is not
Clustering Analysis is an example of Unsupervised learning (MIS 302, MIS 403)
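As a small illustration of unsupervised learning, a k-means clustering sketch (synthetic, unlabeled data; the group locations are made up):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)

# Two synthetic groups of points; note there is no response Y
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(5, 1, (50, 2))])

# k-means looks for structure using the inputs alone
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_.round(1))   # roughly (0, 0) and (5, 5)
```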
Some variables are Categorical (qualitative) and some others are Numerical (quantitative)
Qualitative variables take values in different categories (sometimes called classes), such as default/no default, cancer/no cancer, etc.
A regression problem is one in which the response variable is quantitative
A classification problem is one in which the response variable is qualitative
Some methods can be used either for quantitative or qualitative responses (Trees, Boosting for example)
Quality-of-fit measures (in general computed on test data), where \(\varepsilon_j=y_j-\hat y_j\) denotes the \(j\)-th prediction error (a sketch follows after this list):
Mean Absolute Error, MAE, \(\text{mean}(|\varepsilon_j|)\)
Mean Absolute Percentage Error, MAPE, \(\text{mean}(|p_{j}|), \: p_j=100\varepsilon_j/Y_j\)
Mean Squared Error, MSE, \(\text{mean}(\varepsilon_j^2)\)
Root Mean Squared Error, RMSE, \(\sqrt{\text{mean}(\varepsilon_j^2)}\)
K-Fold Cross Validation
…
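A minimal sketch computing these measures with NumPy (the actual values below are the first five sales figures from the Advertising table; the predictions are made up):

```python
import numpy as np

y = np.array([22.1, 10.4, 9.3, 18.5, 12.9])      # actual test values
y_hat = np.array([20.5, 11.2, 8.1, 19.0, 14.0])  # hypothetical predictions

eps = y - y_hat                        # prediction errors
mae = np.mean(np.abs(eps))             # Mean Absolute Error
mape = np.mean(np.abs(100 * eps / y))  # Mean Absolute Percentage Error
mse = np.mean(eps ** 2)                # Mean Squared Error
rmse = np.sqrt(mse)                    # Root Mean Squared Error

print(f"MAE={mae:.2f}  MAPE={mape:.1f}%  MSE={mse:.2f}  RMSE={rmse:.2f}")
```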
Variance refers to the amount by which \(\hat f()\) would change if we estimated it using a different training data set
Ideally, the estimate of \(f\) should not vary too much between training sets
High variance means that small changes in the training data lead to large changes in \(\hat f()\)
In general, more flexible statistical methods have higher variance
Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model
For example, Linear Regression may perform poorly if the true relationship is highly non-linear
In such cases, increasing the number of observations does not lead to better predictions
As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease
Good test set performance of a statistical learning method requires low variance as well as low squared bias; this is the bias-variance trade-off
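The trade-off can be stated precisely: for a test observation \(x_0\) with response \(y_0\), the expected test MSE decomposes as
\[E\big(y_0-\hat f(x_0)\big)^2=\mathrm{Var}\big(\hat f(x_0)\big)+\big[\mathrm{Bias}\big(\hat f(x_0)\big)\big]^2+\mathrm{Var}(\varepsilon)\]
so both the variance and the squared bias of \(\hat f\) must be kept low, while \(\mathrm{Var}(\varepsilon)\) again provides the irreducible floor.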
The process of evaluating a model’s performance is known as model assessment
The process of selecting the proper level of flexibility for a model is known as model selection
Two common resampling methods will be discussed (also covered in the MIS 403 course):
Cross-Validation
Bootstrap
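As a preview, a minimal 5-fold cross-validation sketch with scikit-learn (synthetic data; the linear model is only illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)

# Synthetic regression data
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 1, 100)

# 5-fold CV: fit on four folds, evaluate MSE on the held-out fold
scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring="neg_mean_squared_error")
print("estimated test MSE:", round(-scores.mean(), 2))
```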