Data Analysis

Deals with obtaining the features (inputs) from data
Deals with predictive tasks such as:
Prediction
- Regression
- K-Nearest Neighbors
- Trees
Classification
- Logistic Regression
- K-Nearest Neighbors
- Trees
- Clustering
Forecasting (Not covered in this course)
Anomaly Detection (Not covered in this course)
Missing Data Imputation (Partly covered as first steps)
Ranking (Not covered in this course)
Recommendation/Decision

The Data Science Process

Cross-Industry Standard Process for Data Mining (CRISP-DM)

Data Collection/Usage: Principles

All processes (social, biological, chemical or physical) result in observations that we call data
Data sets are often used to obtain actionable information/knowledge, which may require model building, visualization, communication
Data sets are often lost during the process of obtaining actionable knowledge
There may be some rules to keep in mind in order to re-use data effectively

Principles of Data Usage in Analytics

Record Data (Original Data)
- If possible store in text or text compatible format (csv, tsv, dat etc)
- Always have a backup
- Do not modify raw data
Data Organization
- Data Structure: wide, long data (will be discussed), tidy data format, nesting structure, row, column and value labels (lowercase, no space etc)
- Folder Structure
- Stamp data collection time (if necessary)
- Do not summarize data in raw data folders
Computation with data
- All computations and modifications should be made outside of the original data folders
Data errors must be systematic not random
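
The principles above (keep raw data untouched, do all modifications outside the original data folder) can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `save_derived_copy` and the folder layout are hypothetical, and lowercasing the column labels stands in for whatever modification is actually needed.

```python
import csv
from pathlib import Path


def save_derived_copy(raw_file, derived_dir):
    """Read a raw CSV file and write a modified copy into a separate folder.

    Hypothetical helper: the raw file is only opened for reading, and the
    function refuses to write into the folder that holds the raw data.
    """
    raw_file = Path(raw_file)
    derived_dir = Path(derived_dir)
    if derived_dir.resolve() == raw_file.parent.resolve():
        raise ValueError("derived data must not be stored in the raw data folder")
    derived_dir.mkdir(parents=True, exist_ok=True)
    out_file = derived_dir / raw_file.name
    with raw_file.open(newline="") as src, out_file.open("w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is not None:
            # Example modification: lowercase column labels, no outer spaces.
            writer.writerow([name.strip().lower() for name in header])
            for row in reader:
                writer.writerow(row)
    return out_file
```

A call such as `save_derived_copy("data/raw/cities.csv", "data/derived")` (both paths are made up for illustration) would leave the raw file as it is and place the modified copy in the second folder.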

Data File Organization

Try to avoid names:
- with blank spaces
- with special symbols: ?, $, *, +, #, (, ), -, /, }, {, |, >, <, …
- that start with a number: start with a letter instead
- if a number is necessary, put it at the end, ex: city_1
Column names must be unique. Duplicated names are not allowed.
Names are case sensitive
Avoid blank rows in your data.
Delete any comments in your file
Use the four-digit year format for dates
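
As a rough sketch of how the naming rules above could be applied automatically, the helper below cleans a list of column names. The function name `clean_column_names` and the choice of prefixing `x` to names that start with a digit are assumptions made for this example, not a prescribed convention.

```python
import re


def clean_column_names(names):
    """Return lowercase names without spaces or special symbols.

    Hypothetical helper: raises ValueError if the cleaned names are not unique.
    """
    cleaned = []
    for name in names:
        new = name.strip().lower()
        # Replace every run of characters other than a-z and 0-9 with "_".
        new = re.sub(r"[^0-9a-z]+", "_", new).strip("_")
        # Names should not start with a number (assumed fix: prefix a letter).
        if new and new[0].isdigit():
            new = "x" + new
        cleaned.append(new)
    duplicates = sorted({n for n in cleaned if cleaned.count(n) > 1})
    if duplicates:
        raise ValueError(f"duplicated column names: {duplicates}")
    return cleaned


print(clean_column_names(["City Name", "Sales ($)", "2021 total"]))
```

Because the cleaned names are all lowercase, two names that differ only by case (for example `City` and `city`) are reported as duplicates instead of being silently kept.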

Steps of Data Analysis/Mining

Data Collection/preparation: Business Understanding
Exploratory Data Analysis: Data Understanding
Preprocessing: Data Preparation
Knowledge Extraction: Modeling
- Feature Selection (Important Explanatory Variables)
- Feature Engineering
- Model Selection/Committee of Models

Testing and Validation
- Model Diagnostics
- Evaluation
Application Deployment

Data Mining (When)

Data Rich Environment
Lack of Human Expertise
Difficult to explain Human Expertise
Dynamic Systems, Changing with time
Needs for adaptation

Lots of Keywords

Learning:

Supervised Learning
Unsupervised Learning
~~Semi-Supervised Learning~~

~~Reinforcement Learning~~
~~Deep Learning~~
etc.

Task and Data

Data Preparation
Regression
Classification
Clustering
etc.

Why Business Deals with Data

Huge Amount of data
100s of covariates
Learning from data becomes more fashionable
Learning algorithms become more available
Computers are more powerful
Need to use different data types in modelling
Data to Pattern to [hopefully] action is promising

…

Learning From Data

Supervised Learning	Unsupervised Learning	Reinforcement Learning
{Y;X} available	{X} available	Ex: Game
$E[Y \mid X]$	Pattern inside data
$P(Y=y \mid X=x)$	Homogeneous Groups
Ex: Regression	Ex: Clustering

Data Rich Environment: [Very] High Dimensionality

The main Goal is:

$Data=Pattern(s)+Error(s)$

Example: Standard Regression

$y=\beta_0+\beta_1 x_1+\beta_2 x_2+ \cdots + \beta_k x_k + \varepsilon$

for some $k \gg 2$

This is equivalent to

$y=\underbrace{\beta_0+\beta_1 x_1+\beta_2 x_2+ \cdots + \beta_k x_k}_\text{Pattern}+\underbrace{\varepsilon}_\text{Error}$

Or put in another form:

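$\mu(x)=E[Y|X=x]=\beta_0+\beta_1 x_1+\beta_2 x_2+ \cdots +\beta_k x_k$

given $E[\varepsilon|X=x]=0$. Replacing each $\beta_i$ with its estimate $\hat \beta_i$ gives the fitted pattern $\hat \mu(x)=\hat \beta_0+\hat \beta_1 x_1+\hat \beta_2 x_2+ \cdots +\hat \beta_k x_k$.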

How to find the parameters, $\hat \beta_i$:

An example: minimize the Mean Squared Error (MSE), which is Ordinary Least Squares (OLS) estimation

$MSE=\frac{1}{N+1} \sum_{i=0}^{N} (y_i-\mu(x_i))^2=\frac{1}{N+1} \sum_{i=0}^{N} \varepsilon_i^2$
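
To make the minimization concrete, here is a minimal sketch for the special case of a single regressor ($k=1$), where the coefficients that minimize the MSE have a well-known closed form. The function names and the toy numbers are made up for illustration; with $k \gg 2$ one would normally rely on a linear algebra or statistics library instead.

```python
def fit_simple_ols(x, y):
    """Closed-form least squares estimates for y = beta_0 + beta_1 * x + error."""
    n = len(x)  # n = N + 1 observations, indexed 0..N as in the MSE formula
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta_1 = s_xy / s_xx
    beta_0 = y_bar - beta_1 * x_bar
    return beta_0, beta_1


def mse(x, y, beta_0, beta_1):
    """Mean of the squared errors y_i - mu(x_i), with mu(x) = beta_0 + beta_1 * x."""
    errors = [yi - (beta_0 + beta_1 * xi) for xi, yi in zip(x, y)]
    return sum(e ** 2 for e in errors) / len(errors)


# Made-up toy data, roughly following y = 1 + 2x.
x_obs = [0, 1, 2, 3, 4]
y_obs = [1.1, 2.9, 5.2, 6.8, 9.0]

beta_0, beta_1 = fit_simple_ols(x_obs, y_obs)
print("estimates:", beta_0, beta_1)
print("MSE at the estimates:", mse(x_obs, y_obs, beta_0, beta_1))
print("MSE at other values: ", mse(x_obs, y_obs, beta_0 + 0.5, beta_1))
```

Any other pair of coefficients gives an MSE at least as large as the one at the estimates, which is what "minimize MSE" means here.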

In Business

Causal Relationships are important for strategic policy design and implementation

This means:

Correlation vs Causation must be discussed

Error structure is important

Incorporating behavioral assessments into the model is crucial

Fundamental Table

Data	Causal	Predictive
Observational	Good/Bad	Good/Bad
Experimental	Good/Bad	Good/Bad

Consider two related variables, $X$ and $Y$, and the hypothesized causal structure that $X$ causes $Y$. All of the alternative explanations are:

$X$ causes $Y$, (or shown as $X \implies Y$)

$Y$ causes $X$, $Y \implies X$

$Z$ causes both $X$ and $Y$, $Z \implies \{Y, X\}$ (but $Z$ may or may not be available)

By chance [remember the p-value]

By Selection
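
The third alternative (a common cause $Z$) can be illustrated with a small simulation. This is a sketch under assumed settings: the data are generated so that $X$ and $Y$ never influence each other, all distributions and sample sizes are arbitrary choices, and the helper names are hypothetical.

```python
import random


def pearson(a, b):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5


def simulate_confounding(n=5000, seed=0):
    """Z drives both X and Y; X and Y have no direct effect on each other."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [zi + rng.gauss(0, 1) for zi in z]
    y = [zi + rng.gauss(0, 1) for zi in z]
    corr_xy = pearson(x, y)
    # If Z is available, its contribution can be removed from X and Y.
    x_without_z = [xi - zi for xi, zi in zip(x, z)]
    y_without_z = [yi - zi for yi, zi in zip(y, z)]
    corr_without_z = pearson(x_without_z, y_without_z)
    return corr_xy, corr_without_z


corr_xy, corr_without_z = simulate_confounding()
print("correlation of X and Y:           ", round(corr_xy, 3))
print("correlation after removing Z part:", round(corr_without_z, 3))
```

In this setup the raw correlation is clearly positive although neither variable causes the other, and it is close to zero once the part due to $Z$ is removed. When $Z$ is not available, that correction is not possible.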

Causality (We will come back to this topic later)

Run an experiment to remove the effects of potential confounding factors (may solve some of the cases)
Split the sample randomly

Then the possibilities become:

$X \implies Y$ is still possible

$Y$ does not cause $X$: the sample is split by chance, so chance causes $X$

$Z$ may still be related to both $X$ and $Y$, but only by chance

It could still be by chance

It could be by selection, but it should be excluded by the experimenter
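
The effect of a random split can also be checked by simulation. The sketch below compares an observational setting, where a hidden $Z$ influences who receives the treatment $X$, with an experimental one, where $X$ is assigned by a coin flip. The true effect of 2.0, the weight of 3.0 on $Z$ and all distributions are assumptions chosen only for this example.

```python
import random

TRUE_EFFECT = 2.0  # assumed effect of X on Y in this simulation


def difference_in_means(x, y):
    """Mean of Y in the treated group (X = 1) minus mean in the control group."""
    treated = [yi for xi, yi in zip(x, y) if xi == 1]
    control = [yi for xi, yi in zip(x, y) if xi == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def estimated_effect(randomized, n=20000, seed=1):
    """Simulate data with a hidden confounder Z and estimate the effect of X."""
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)
        if randomized:
            # Experimental data: the sample is split by chance.
            xi = 1 if rng.random() < 0.5 else 0
        else:
            # Observational data: units with high Z tend to end up treated.
            xi = 1 if z + rng.gauss(0, 1) > 0 else 0
        yi = TRUE_EFFECT * xi + 3.0 * z + rng.gauss(0, 1)
        x.append(xi)
        y.append(yi)
    return difference_in_means(x, y)


print("observational estimate:", round(estimated_effect(randomized=False), 2))
print("experimental estimate: ", round(estimated_effect(randomized=True), 2))
```

With the random split, $Z$ is balanced across the two groups up to chance, so the difference in means lands near the assumed true effect. In the observational version it is inflated by the part of $Y$ that comes from $Z$.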

Fundamental Table

Data	Causal	Predictive
Observational	Bad	Good
Experimental	Good	Bad

$Theories \implies Data \implies Model$ is traditionally used for [causal] analytics
The causal structure is dictated by $Theories$.
- Example: Econometrics

Data Analytics for Business - Steps