I. Ozkan, PhD 
 Professor 
 MIS 
 Cankaya University
 
iozkan@cankaya.edu.tr
  
Spring 2025
Deals with obtaining the features (inputs) from data
Deals with predictive tasks such as:
Prediction
Classification
Forecasting (Not Covered in this course)
Anomaly Detection (Not covered in this course)
Missing Data Imputation (Partly covered as first steps)
Ranking (Not covered in this course)
Recommendation/Decision
All processes such as social, biological, chemical or physical processes result in observations that we call data
Data sets are often used to obtain actionable information/knowledge, which may require model building, visualization, communication
Data sets are often lost during the process of obtaining actionable knowledge
There are some rules to keep in mind in order to re-use data effectively
Record Data (Original Data)
If possible store in text or text compatible format (csv, tsv, dat etc)
Always have a backup
Do not modify raw data
Data Organization
Data Structure: wide and long data (will be discussed), tidy data format, nesting structure
Row, column and value labels (lowercase, no spaces, etc.)
Folder Structure
Stamp data collection time (if necessary)
Do not summarize data in raw data folders
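As a minimal illustration of the wide and long layouts mentioned above, the sketch below reshapes a small table in plain Python. The function name `wide_to_long`, the column names, and the values are all hypothetical and chosen only for the example.

```python
def wide_to_long(rows, id_col, var_name="variable", value_name="value"):
    """Turn one row per unit (wide) into one row per unit-variable pair (long)."""
    long_rows = []
    for row in rows:
        for col, val in row.items():
            if col == id_col:
                continue  # the identifier is repeated on every long row instead
            long_rows.append({id_col: row[id_col], var_name: col, value_name: val})
    return long_rows


# Made-up wide data: one row per unit, one column per measurement.
wide = [
    {"unit": "a", "score_2023": 1.0, "score_2024": 2.0},
    {"unit": "b", "score_2023": 3.0, "score_2024": 4.0},
]

long = wide_to_long(wide, id_col="unit")
for row in long:
    print(row)
```

In the long layout every row holds a single value together with the labels that identify it, which is the shape most tidy-data tools expect.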
Computation with data
Data errors must be systematic not random
Try to avoid names:
with blank spaces
with special symbols: ?, $, *, +, #, (, ), -, /, }, {, |, >, <, …
starting with a number: start with a letter instead
if a number is necessary, put it at the end, e.g. city_1
Column names must be unique. Duplicated names are not allowed.
R is case sensitive
Avoid blank rows in your data.
Delete any comments in your file
Use the four-digit year format for dates
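The naming rules above can be applied mechanically. The following is a minimal Python sketch, not a standard routine: the helper `clean_names`, the `x` prefix for names that start with a digit, and the sample names are assumptions made for illustration.

```python
import re


def clean_names(names):
    """Return lowercase, symbol-free, unique column names (ASCII only)."""
    cleaned = []
    seen = {}
    for name in names:
        # Lowercase, then replace every run of characters outside a-z and 0-9 with "_".
        new = re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")
        # Names should not start with a number: prefix a letter (assumed convention).
        if not new or new[0].isdigit():
            new = "x" + new
        # Names must be unique: append a counter to repeated names.
        count = seen.get(new, 0) + 1
        seen[new] = count
        cleaned.append(new if count == 1 else f"{new}_{count}")
    return cleaned


# Hypothetical raw column names that break several of the rules.
raw = ["City Name", "1st Quarter", "Sales ($)", "city name"]
print(clean_names(raw))
```

Because names are lowercased before comparison, `City Name` and `city name` are treated as duplicates, which avoids relying on case alone to tell columns apart.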
Data Collection/preparation: Business Understanding
Exploratory Data Analysis: Data Understanding
Preprocessing: Data Preparation
Knowledge Extraction: Modeling
Testing and Validation
Application Deployment
Data Rich Environment
Lack of Human Expertise
Difficult to explain Human Expertise
Dynamic Systems, Changing with time
Needs for adaptation
Learning:
Supervised Learning
Unsupervised Learning
Semi-Supervised Learning
Reinforcement Learning
Deep Learning
etc.
Task and Data 
Data Preparation
Regression
Classification
Clustering
etc.
Huge Amount of data
100s of covariates
Learning from data becomes more fashionable
Learning algorithms become more available
Computers are more powerful
Need to use different data types in modelling
Data to Pattern to [hopefully] action is promising
…
| Supervised Learning | Unsupervised Learning | Reinforcement Learning | 
|---|---|---|
| {Y;X} available | {X} available | Ex: Game | 
| \(E[Y \: given \: X]\) | Pattern inside data | |
| \(P(Y=y \: given \:X=x)\) | Homogeneous Groups | |
| Ex: Regression | Ex: Clustering | |
\(Data=Pattern(s)+Error(s)\)
Example: Standard Regression
\(y=\beta_0+\beta_1 x_1+\beta_2 x_2+ \cdots + \beta_k x_k + \varepsilon\)
for some \(k \gg 2\)
This is equivalent to
\(y=\underbrace{\beta_0+\beta_1 x_1+\beta_2 x_2+ \cdots + \beta_k x_k}_\text{Pattern}+\underbrace{\varepsilon}_\text{Error}\)
Or put in another form:
\(\mu(x)=E[Y|X=x]=\hat \beta_0+\hat \beta_1 x_1+\hat \beta_2 x_2+ \cdots +\hat \beta_k x_k\)
given \(E[\varepsilon]=0\), where \(\hat \beta_i\) are the estimated coefficients.
How to find the parameters, \(\hat \beta_i\): choose them to minimize the mean squared error (MSE):
\(MSE=\frac{1}{N+1} \sum_{i=0}^{N} (y_i-\mu(x_i))^2=\frac{1}{N+1} \sum_{i=0}^{N} \varepsilon_i^2\)
Means:
Correlation vs Causation must be discussed
Error structure is important
Behavioral assessment of the model is crucial
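To make the regression and MSE formulas above concrete, here is a minimal sketch in plain Python that fits the coefficients by least squares (via the normal equations) on simulated data. The coefficient values, sample size, and noise level are assumptions made for the example. The MSE divides by the number of observations, which matches \(N+1\) when the index runs from 0 to \(N\).

```python
import random


def solve(A, b):
    """Solve the linear system A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def fit_ols(X, y):
    """Least-squares coefficients [b0, b1, ..., bk] from the normal equations."""
    Z = [[1.0] + list(row) for row in X]  # prepend a column of ones for the intercept
    p = len(Z[0])
    ZtZ = [[sum(r[i] * r[j] for r in Z) for j in range(p)] for i in range(p)]
    Zty = [sum(r[i] * yi for r, yi in zip(Z, y)) for i in range(p)]
    return solve(ZtZ, Zty)


def predict(beta, row):
    """Pattern part: beta_0 + beta_1 x_1 + ... + beta_k x_k."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))


def mse(beta, X, y):
    """Mean of the squared differences between y and the fitted pattern."""
    return sum((yi - predict(beta, row)) ** 2 for row, yi in zip(X, y)) / len(y)


# Simulated data (assumed values): Data = Pattern + Error.
random.seed(0)
true_beta = [1.0, 2.0, -3.0]
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(500)]
y = [predict(true_beta, row) + random.gauss(0, 0.5) for row in X]

beta_hat = fit_ols(X, y)
print("estimated coefficients:", [round(b, 2) for b in beta_hat])
print("MSE at the estimates:", round(mse(beta_hat, X, y), 3))
print("MSE at the true coefficients:", round(mse(true_beta, X, y), 3))
```

On the sample used for fitting, the MSE at the estimates cannot exceed the MSE at any other coefficient vector, including the true one, because the estimates are defined as the minimizer.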
Fundamental Table
| Data | Causal | Predictive | 
|---|---|---|
| Observational | Good/Bad | Good/Bad | 
| Experimental | Good/Bad | Good/Bad | 
Let's consider two variables, \(Y\) and \(X\), and a causal structure in which \(X\) causes \(Y\). All of the alternatives are:
Experiment to remove the effects of potential confounding factors? (may solve some of the cases)
Sample split randomly
It is possible then,
\(X \implies Y\)
\(Y\) does not cause \(X\): since the sample is split by chance, chance causes \(X\)
\(Z\) causing both is still possible, but only by chance
It could still be by chance
It could be by selection, but it should be excluded by the experimenter
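The effect of random splitting can be illustrated with a small simulation. This is a sketch under assumed values: a confounder \(Z\) affects \(Y\) in both settings, the true effect of \(X\) on \(Y\) is set to 1, and all coefficients and sample sizes are invented for the example.

```python
import random


def slope(x, y):
    """Simple-regression slope of y on x: cov(x, y) / var(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var


random.seed(1)
n = 20000
effect = 1.0  # assumed true causal effect of X on Y

# Confounder Z, present in both settings.
z = [random.gauss(0, 1) for _ in range(n)]

# Observational data: Z drives X as well as Y.
x_obs = [zi + random.gauss(0, 1) for zi in z]
y_obs = [effect * xi + 2.0 * zi + random.gauss(0, 1) for xi, zi in zip(x_obs, z)]

# Experimental data: X is assigned by a coin flip, so only chance causes X.
x_exp = [float(random.random() < 0.5) for _ in range(n)]
y_exp = [effect * xi + 2.0 * zi + random.gauss(0, 1) for xi, zi in zip(x_exp, z)]

slope_obs = slope(x_obs, y_obs)
slope_exp = slope(x_exp, y_exp)
print("observational slope:", round(slope_obs, 2))
print("experimental slope:", round(slope_exp, 2))
```

With these assumed coefficients the observational slope should land near 2 rather than 1, because it absorbs the path through \(Z\), while the randomized slope should stay close to the true effect up to sampling noise.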
| Data | Causal | Predictive | 
|---|---|---|
| Observational | Bad | Good | 
| Experimental | Good | Bad | 
\(Theories \implies Data \implies Model\) is traditionally used for [causal] analytics
The causal structure is dictated by \(Theories\).
 
Software to be used in this course: R and RStudio