Learning From Data - Why

I. Ozkan

Spring 2025

Learning from Data.

Learning From Data (When)

Business Departments Where Analytics Is Valuable:

- Finance

Statistical Learning: Lots of Keywords

Learning:

Task and Data:

Why Statistical Learning in Business

Learning

| Supervised Learning | Unsupervised Learning | Reinforcement Learning |
|---|---|---|
| {Y;X} available | {X} available | Ex: Game |
| \(E[Y|X]\) | Pattern inside data | |
| \(P(Y=y|X=x)\) | Homogeneous Groups | |
| Ex: Regression | Ex: Clustering | |



  • Source: Wikipedia

Data Rich Environment: [Very] High Dimensionality

\(Data=Pattern(s)+Error(s)\)

Example: Standard Regression

\(y=\beta_0+\beta_1 x_1+\beta_2 x_2+ \cdots + \beta_k x_k + \varepsilon\)
for some \(k \gg 2\)

This is equivalent to

\(Pattern=\beta_0+\beta_1 x_1+\beta_2 x_2+ \cdots + \beta_k x_k \quad \text{and} \quad error=\varepsilon\)
(Assumptions are skipped)

Or put in another form:

\(\mu(x)=E[Y|X=x]=\hat \beta_0+\hat \beta_1 x_1+\hat \beta_2 x_2+ \cdots +\hat \beta_k x_k\)

given \(E[\varepsilon]=0\), where the \(\hat \beta_i\) are the estimated coefficients.

How to find the parameters, \(\hat \beta_i\): choose them to minimize the mean squared error (MSE)

\(MSE=\frac{1}{N+1} \sum_{i=0}^{N} (y_i-\mu(x_i))^2=\frac{1}{N+1} \sum_{i=0}^{N} \varepsilon_i^2\)
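The MSE criterion above can be sketched in code. This is a minimal illustration, not the course's own implementation: it uses simulated data with made-up coefficients (`true_beta`), two regressors instead of a large \(k\), and a hypothetical helper `fit_ols` that solves the normal equations, which is one standard way to find the coefficients minimizing the MSE.

```python
import random


def solve_linear_system(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [A | b]; copied so the inputs are not modified.
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def fit_ols(X, y):
    """Coefficients minimizing the MSE, intercept first (normal equations)."""
    Z = [[1.0] + list(row) for row in X]  # prepend the constant term
    p = len(Z[0])
    ZtZ = [[sum(r[a] * r[c] for r in Z) for c in range(p)] for a in range(p)]
    Zty = [sum(r[a] * yi for r, yi in zip(Z, y)) for a in range(p)]
    return solve_linear_system(ZtZ, Zty)


def predict(beta, row):
    """mu(x) = beta_0 + beta_1 x_1 + ... + beta_k x_k."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))


def mse(beta, X, y):
    """Mean of the squared residuals y_i - mu(x_i)."""
    return sum((yi - predict(beta, row)) ** 2 for row, yi in zip(X, y)) / len(y)


# Simulated data (an assumption for illustration): Data = Pattern + Error
rng = random.Random(0)
true_beta = [1.0, 2.0, -3.0]  # made-up pattern: 1 + 2*x1 - 3*x2
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]
y = [predict(true_beta, row) + rng.gauss(0, 0.5) for row in X]

beta_hat = fit_ols(X, y)
print("estimated coefficients:", [round(b, 3) for b in beta_hat])
print("in-sample MSE:", round(mse(beta_hat, X, y), 3))
```

In this sketch the estimated coefficients should land close to the made-up pattern, and the in-sample MSE should be close to the error variance used in the simulation (0.25).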

In Econometrics

\(\implies\) high dimensionality comes with difficulties.
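One such difficulty, chosen here only as an illustration, is overfitting: as the number of regressors \(k\) approaches the sample size, the in-sample MSE keeps falling even when the extra regressors are pure noise, while the error on new data tends to grow. The sketch below uses simulated data in which only the first regressor matters; the sample sizes, the values of \(k\), and the helper names are all assumptions.

```python
import random


def solve_linear_system(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def fit_ols(X, y):
    """Least-squares coefficients, intercept first (normal equations)."""
    Z = [[1.0] + list(row) for row in X]
    p = len(Z[0])
    ZtZ = [[sum(r[a] * r[c] for r in Z) for c in range(p)] for a in range(p)]
    Zty = [sum(r[a] * yi for r, yi in zip(Z, y)) for a in range(p)]
    return solve_linear_system(ZtZ, Zty)


def mse(beta, X, y):
    """Mean squared error of the fitted line on the data (X, y)."""
    total = 0.0
    for row, yi in zip(X, y):
        fitted = beta[0] + sum(b * x for b, x in zip(beta[1:], row))
        total += (yi - fitted) ** 2
    return total / len(y)


rng = random.Random(1)
K_MAX, N_TRAIN, N_TEST = 30, 40, 1000  # assumed sizes, for illustration


def simulate(n):
    """Only the first regressor enters the pattern; the rest are noise."""
    X = [[rng.gauss(0, 1) for _ in range(K_MAX)] for _ in range(n)]
    y = [1.0 + 2.0 * row[0] + rng.gauss(0, 1) for row in X]
    return X, y


X_train, y_train = simulate(N_TRAIN)
X_test, y_test = simulate(N_TEST)

results = {}
for k in (1, 10, 30):
    Xk_train = [row[:k] for row in X_train]  # use the first k regressors
    Xk_test = [row[:k] for row in X_test]
    beta = fit_ols(Xk_train, y_train)
    results[k] = (mse(beta, Xk_train, y_train), mse(beta, Xk_test, y_test))
    print(f"k={k:2d}  train MSE={results[k][0]:.3f}  test MSE={results[k][1]:.3f}")
```

Because the three models are nested, the training MSE cannot increase as \(k\) grows; the test MSE is what reveals the cost of the extra regressors.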

In Economics

Means:

Correlation vs Causation must be discussed (this one is the main critique)

Error structure is important

Behavioral assessment of the model is crucial

Goodness of fit is not the main focus (though it is important)

In Econometrics

Fundamental Table

| Data | Causal | Predictive |
|---|---|---|
| Observational | Good/Bad | Good/Bad |
| Experimental | Good/Bad | Good/Bad |

Let's consider two variables, \(Y\) and \(X\), with a causal structure in which \(X\) causes \(Y\). All of the alternatives are:

Causality (Will be back to this topic later)

It is possible then,

\(X \implies Y\)

\(Y\) does not cause \(X\): the sample is split by chance, so chance causes \(X\)

\(Z\) causing both is possible, but only by chance

It could still be by chance

It could be by selection, but it should be excluded by the experimenter
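The role of random assignment in the list above can be illustrated with a small simulation. This is a sketch under made-up assumptions: a hypothetical confounder `z` drives both `x` and `y` in the observational data, the true effect of `x` on `y` is set to 1, and in the experimental data `x` is assigned by a coin flip so that `z` cannot influence it.

```python
import random


def ols_slope(x, y):
    """Slope of the simple regression of y on x: cov(x, y) / var(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(42)
N = 20000
TRUE_EFFECT = 1.0  # assumed causal effect of X on Y

# Confounder Z (hypothetical): affects Y in both settings.
z = [rng.gauss(0, 1) for _ in range(N)]

# Observational data: Z also causes X, so X and Z are correlated.
x_obs = [zi + rng.gauss(0, 1) for zi in z]
y_obs = [TRUE_EFFECT * xi + 2.0 * zi + rng.gauss(0, 1)
         for xi, zi in zip(x_obs, z)]

# Experimental data: X is assigned by a coin flip, independent of Z.
x_exp = [float(rng.random() < 0.5) for _ in range(N)]
y_exp = [TRUE_EFFECT * xi + 2.0 * zi + rng.gauss(0, 1)
         for xi, zi in zip(x_exp, z)]

slope_obs = ols_slope(x_obs, y_obs)
slope_exp = ols_slope(x_exp, y_exp)
print("observational slope:", round(slope_obs, 2))
print("experimental slope: ", round(slope_exp, 2))
```

With these assumed numbers, the observational slope mixes the effect of `x` with that of `z` and settles near 2, while the experimental slope stays near the assumed true effect of 1.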

Why Data Analytics in Business



OPEN DISCUSSION (ONCE MORE)



An R Example

https://www.youtube.com/watch?v=GTgZfCltMm8