I. Ozkan
Spring 2025
Deals with obtaining the features (inputs) from data
Deals with predictive tasks; useful in settings such as:
Data Rich Environment
Lack of Human Expertise
Difficult to explain Human Expertise
Dynamic Systems, Changing with time
Needs for adaptation
Departments of business where analytics are valuable:
- Finance
- Marketing and Sales
- Supply Chain and Logistics
…
Learning:
Supervised Learning
Unsupervised Learning
Semi-Supervised Learning
Reinforcement Learning
Deep Learning (Part of both Supervised and Unsupervised Learning)
etc.
Task and Data:
Regression
Classification
Clustering
Forecasting
etc.
Huge Amount of data
100s of covariates
It has become more fashionable
Its algorithms have become more available
Computers are more powerful
Need to use different data types in modelling
Data to Pattern to [hopefully] theory is promising
…
| Supervised Learning | Unsupervised Learning | Reinforcement Learning |
|---|---|---|
| {Y;X} available | {X} available | Ex: Game |
| \(E[Y \mid X]\) | Pattern inside data | |
| \(P(Y=y \mid X=x)\) | Homogeneous groups | |
| Ex: Regression | Ex: Clustering | |
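The supervised/unsupervised contrast above can be sketched with a toy example: supervised learning fits \(E[Y \mid X]\) from {Y;X} pairs, while unsupervised learning looks for homogeneous groups in {X} alone. A minimal NumPy sketch (all data-generating numbers are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Supervised: {Y;X} available -> estimate E[Y | X] (here, a line)
x = rng.uniform(0, 10, 200)
y = 2.0 + 3.0 * x + rng.normal(0, 1, 200)
slope, intercept = np.polyfit(x, y, 1)      # least-squares fit

# Unsupervised: only {X} available -> look for homogeneous groups
X = np.concatenate([rng.normal(0, 1, 100), rng.normal(8, 1, 100)])
cut = (X.min() + X.max()) / 2               # crude 1-D split point
labels = (X > cut).astype(int)              # two homogeneous groups
```

In practice the grouping step would use a proper clustering algorithm (e.g. k-means); the midpoint cut is only a stand-in that works for well-separated one-dimensional data.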
 
\(Data=Pattern(s)+Error(s)\)
Example: Standard Regression
\(y=\beta_0+\beta_1 x_1+\beta_2 x_2+ \cdots + \beta_k x_k + \varepsilon\)
for some \(k \gg 2\)
This is equivalent to
\(Pattern=\beta_0+\beta_1 x_1+\beta_2 x_2+ \cdots + \beta_k x_k\) and \(Error=\varepsilon\)
(Assumptions are skipped)
Or put in another form:
\(\mu(x)=E[Y|X=x]=\hat \beta_0+\hat \beta_1 x_1+\hat \beta_2 x_2+ \cdots +\hat \beta_k x_k\)
given \(E[\varepsilon]=0\) and \(\hat \beta_i\) are the estimated coefficients.
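The conditional mean \(\mu(x)=E[Y|X=x]\) can be checked by simulation: generate data from a known pattern plus a zero-mean error, then average \(y\) in a narrow window around a chosen \(x\). A sketch with invented parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000
x = rng.uniform(0, 1, N)
eps = rng.normal(0, 1, N)          # E[eps] = 0
y = 1.0 + 2.0 * x + eps            # Data = Pattern + Error

# mu(x0) = E[Y | X = x0]: average y over a narrow window around x0
x0 = 0.5
window = np.abs(x - x0) < 0.01
mu_hat = y[window].mean()          # should be close to 1 + 2*0.5 = 2
```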
How to find the parameters, \(\hat \beta_i\):
\(MSE=\frac{1}{N} \sum_{i=1}^{N} (y_i-\mu(x_i))^2=\frac{1}{N} \sum_{i=1}^{N} \varepsilon_i^2\)
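Minimizing the MSE over the coefficients yields the OLS normal equations \((X'X)\hat\beta = X'y\). A NumPy sketch with invented true coefficients:

```python
import numpy as np

rng = np.random.default_rng(42)
N, k = 500, 3
X = rng.normal(size=(N, k))
beta_true = np.array([1.0, 2.0, -1.5, 0.5])        # beta_0 .. beta_3
y = beta_true[0] + X @ beta_true[1:] + rng.normal(0, 0.5, N)

# Minimizing MSE gives the normal equations: (X'X) beta = X'y
Xd = np.column_stack([np.ones(N), X])              # add intercept column
beta_hat = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)
mse = np.mean((y - Xd @ beta_hat) ** 2)            # close to Var(eps) = 0.25
```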
In most cases the number of observations, \(N\), is greater than the number of covariates (parameters), \(P\): \(N \gg P\)
If \(N \sim P\), the fit may fail due to a lack of degrees of freedom (and overfitting)
If \(N < P\), OLS fails.
\(\implies\) high dimensionality comes with difficulties.
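The \(N < P\) failure is easy to see: the \(P \times P\) matrix \(X'X\) has rank at most \(N\), so it is singular and the normal equations have no unique solution. A small check (dimensions chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(7)
N, P = 10, 20                  # fewer observations than parameters
X = rng.normal(size=(N, P))

XtX = X.T @ X                  # P x P, but rank at most N < P
rank = np.linalg.matrix_rank(XtX)
singular = rank < P            # True: (X'X)^{-1} does not exist, so
                               # the normal equations have infinitely
                               # many solutions and OLS fails
```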
Means:
Correlation vs. causation must be discussed (this is the main critique)
The error structure is important
Behavioral assessment of the model is crucial
Goodness of fit is not the main focus (though it is important)
Theories (Thought Exercise: Idea)
Lots of assumptions without a plausible way to test them (many of them are unrealistic)
Theories \(\implies\) Models
Estimate the models (OLS, IV regression, Maximum Likelihood, GMM, etc.)
Conclude with estimated parameters and standard errors
| Data | Causal | Predictive | 
|---|---|---|
| Observational | Good/Bad | Good/Bad | 
| Experimental | Good/Bad | Good/Bad | 
Let us consider two variables, \(Y\) and \(X\), and a causal structure such that \(X\) causes \(Y\). All of the alternatives are:
Experiment to remove the effects of potential confounding factors? (may solve some of the cases)
Split the sample randomly
Then:
\(X \implies Y\)
\(Y\) does not cause \(X\): the sample is split by chance, so \(Y\) causing \(X\) would mean chance causes \(X\)
\(Z\) causing both is still possible, but only by chance
It could still be by chance
It could be by selection, but selection should be excluded by the experimenter
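The argument can be illustrated by simulation: let a hypothetical confounder \(Z\) drive both \(X\) and \(Y\) while \(X\) has no causal effect on \(Y\). The observational regression slope is then nonzero, while assigning \(X\) by a coin flip (a random split) removes the bias. All numbers are invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100_000

# Confounded observational data: Z drives both X and Y;
# X has NO causal effect on Y
Z = rng.normal(size=N)
X_obs = Z + rng.normal(0, 1, N)
Y_obs = 2.0 * Z + rng.normal(0, 1, N)
slope_obs = np.polyfit(X_obs, Y_obs, 1)[0]   # biased: about 1, not 0

# Experiment: X assigned by chance, so neither Y nor Z can cause it
X_exp = rng.binomial(1, 0.5, N).astype(float)
Y_exp = 2.0 * Z + rng.normal(0, 1, N)        # X still has no effect
slope_exp = np.polyfit(X_exp, Y_exp, 1)[0]   # about 0
```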
OPEN DISCUSSION (ONCE MORE)