Data, Models & Exploratory Data Analysis

I. Ozkan

Fall 2025

Resources for Preliminary Readings

Data



“A datum is a single measurement of something on a scale that is understandable to both the recorder and the reader. Data is multiple such measurements.”



Everything is Data

Data Basics

Tidy Data:

Rows represent observations

Columns represent variables
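
A minimal sketch of what the tidy layout means in practice, assuming a small hypothetical "wide" table of exam scores (the data and the helper `to_tidy` are illustrative only and are not part of the course material). Each output record is one observation, and each key is one variable.

```python
# Hypothetical "wide" table: one row per student, one column per exam.
wide = [
    {"student": "A", "exam1": 70, "exam2": 85},
    {"student": "B", "exam1": 90, "exam2": 60},
]


def to_tidy(rows, id_key, value_keys, var_name, value_name):
    """Return one record per (id, variable) pair.

    In the result, rows are observations and columns are variables.
    """
    tidy = []
    for row in rows:
        for key in value_keys:
            tidy.append({id_key: row[id_key], var_name: key, value_name: row[key]})
    return tidy


tidy = to_tidy(wide, "student", ["exam1", "exam2"], "exam", "score")
for record in tidy:
    print(record)
```

The wide table stores the variable "exam" in its column names; the tidy version moves it into a column of its own.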

How to Obtain Data (important):

Data Types Once Again

Data and Procedure

How to deal with:
- Observational data, for causal/predictive model(s)
- Experimental data, for causal/predictive model(s)

How to summarize:

Both numerically and visually

*: The scope of this course is limited to all data types except interval-type values

Data and Procedure

How to Model:

Dependent Variable (Outcome, regressand):

In the Model; how to Use:

Independent Variable(s) (Feature, Input):

How to Use in the Model:

Independent Variable(s) (Feature, Input):

\(\text{Data} \implies \text{Pictures} \implies \text{Models and Stories}\)


https://friendly.github.io/6135/lectures/Overview.pdf

Exploratory Data Analysis (EDA)



We look at numbers or graphs and try to find patterns. We pursue leads suggested by background information, imagination, patterns perceived, and experience with other data analyses.


P. Diaconis. Theories of data analysis: From magical thinking through classical statistics. In D.C. Hoaglin, F. Mosteller, and J.W. Tukey, editors, Exploring Data Tables, Trends, and Shapes, chapter 1. Wiley, 1985.

Exploratory Data Analysis


The Four Rs of Exploratory Data Analysis

  1. Revelation: Visualization

  2. Residuals: Differences between the observed values of a variable and its predictions from some mathematical model

  3. Re-expression: Transformation

  4. Resistance: Outliers, Anomalies


P.F. Velleman and D.C. Hoaglin. Data analysis. In D.C. Hoaglin and D.S. Moore, editors, Perspectives on Contemporary Statistics, number 21 in MAA Notes, chapter 2. Mathematical Association of America, 1991.
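
Three of the four Rs can be illustrated numerically with a small sketch. The values below are hypothetical, and using the median as the "model" is an assumption made only to keep the example short.

```python
import math
import statistics

# Hypothetical, right-skewed values with one extreme observation.
values = [1.0, 2.0, 4.0, 8.0, 16.0, 1024.0]

# Resistance: the median barely reacts to the extreme value, the mean does.
mean_value = statistics.mean(values)
median_value = statistics.median(values)

# Re-expression: a log2 transform compresses the scale of the large value.
log_values = [math.log2(v) for v in values]

# Residuals: observed minus predicted; here the "model" is just the median.
residuals = [v - median_value for v in values]

print("mean:", mean_value, "median:", median_value)
print("log2 values:", log_values)
print("residuals:", residuals)
```

Revelation, the remaining R, corresponds to plotting these values before and after the transformation.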

Exploratory Data Analysis (EDA)

Key Properties of Data

Preliminary


Data Set(s) assumed in this course:


Tidy Data:

Rows represent observations

Columns represent variables

[Possible] Steps of Looking at a Dataset

Brief Data Cleaning Steps

How a dataset is cleaned depends on the data at hand, so not every step applies to every dataset.

Numerical Summaries

Scalar Variables:

First quartile (Q1): \(0.25(n+1)^{th} \text{ ordered position}\)

Mean: \(\bar x=\frac{1}{n}\sum_{i=1}^{n}x_i = \frac{x_1+x_2+ \cdots + x_n}{n}, \; n: \text{sample size}\)

Median: \(0.5(n+1)^{th} \text{ ordered position}\)

Standard deviation: \(\sigma_x=\sqrt{Var(x)}\)

Variance: \(Var(x)=E[(x-\bar x)^2]=\frac{1}{n-1} \sum_{i=1}^{n}(x_i-\bar x)^2\)

Skewness: \(skewness=\frac{1}{\sigma_x^3} E[(x-\bar x)^3] = \frac{1}{n} \frac {\sum_{i=1}^{n}(x_i-\bar x)^3}{\sigma_x^3}\)

Kurtosis: \(\kappa = \frac{1}{\sigma_x^4} E[(x-\bar x)^4]\)

Third quartile (Q3): \(0.75(n+1)^{th} \text{ ordered position}\)
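
The formulas above can be sketched in code as follows. This is a minimal illustration on hypothetical data, not a reference implementation. The helper names `value_at_position` and `summarize` are made up for this example, the quartiles use the \(p(n+1)\) ordered-position rule with linear interpolation for fractional positions (an assumption), and the kurtosis is estimated with a \(1/n\) average by analogy with the skewness formula. Statistical software often reports excess kurtosis, \(\kappa - 3\), instead; the negative kurtosis values in the iris tables of this deck appear to follow that convention.

```python
import math


def value_at_position(sorted_x, pos):
    """Value at a 1-based, possibly fractional, ordered position."""
    n = len(sorted_x)
    pos = min(max(pos, 1.0), float(n))  # keep the position inside the data
    lower = int(math.floor(pos))
    frac = pos - lower
    if frac == 0:
        return sorted_x[lower - 1]
    # Linear interpolation between the two neighbouring ordered values.
    return sorted_x[lower - 1] + frac * (sorted_x[lower] - sorted_x[lower - 1])


def summarize(x):
    """Numerical summaries following the formulas on this slide."""
    n = len(x)
    s = sorted(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / (n - 1)  # sample variance
    sd = math.sqrt(var)
    skew = sum((v - mean) ** 3 for v in x) / n / sd ** 3
    kurt = sum((v - mean) ** 4 for v in x) / n / sd ** 4  # not excess kurtosis
    return {
        "q1": value_at_position(s, 0.25 * (n + 1)),
        "median": value_at_position(s, 0.50 * (n + 1)),
        "q3": value_at_position(s, 0.75 * (n + 1)),
        "mean": mean,
        "var": var,
        "sd": sd,
        "skewness": skew,
        "kurtosis": kurt,
    }


# Hypothetical sample of n = 7 values.
result = summarize([2, 4, 4, 5, 7, 9, 11])
for name, value in result.items():
    print(f"{name:>8}: {value:.4f}")
```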

Categorical Variables:

Multivariate: Correlation Structure:

Multivariate: Cluster Structure:


A table is a good way to present some of these summaries.


Numerical Summaries Example: iris data

Selected Numerical Summaries (Iris Data)

Variable      TN   min  max  mean   median  SD     IQR
Petal.Length  150  1.0  6.9  3.758  4.35    1.765  3.5
Petal.Width   150  0.1  2.5  1.199  1.30    0.762  1.5
Sepal.Length  150  4.3  7.9  5.843  5.80    0.828  1.3
Sepal.Width   150  2.0  4.4  3.057  3.00    0.436  0.5


Numerical Summaries Example: iris data

More Numerical Summaries (Iris Data)

Statistic       Petal.Length  Petal.Width  Sepal.Length  Sepal.Width
TN              150           150          150           150
nNeg            0             0            0             0
nZero           0             0            0             0
nPos            150           150          150           150
NegInf          0             0            0             0
PosInf          0             0            0             0
NA_Value        0             0            0             0
Per_of_Missing  0             0            0             0
sum             563.7         179.9        876.5         458.6
min             1.0           0.1          4.3           2.0
max             6.9           2.5          7.9           4.4
mean            3.758         1.199        5.843         3.057
median          4.35          1.30         5.80          3.00
SD              1.765         0.762        0.828         0.436
CV              0.470         0.636        0.142         0.143
IQR             3.5           1.5          1.3           0.5
Skewness        -0.272        -0.102       0.312         0.316
Kurtosis        -1.396        -1.336       -0.574        0.181
10%             1.4           0.2          4.8           2.5
20%             1.5           0.2          5.0           2.7
LB.25%          -3.65         -1.95        3.15          2.05
UB.75%          10.35         4.05         8.35          4.05
nOutliers       0             0            0             4
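
A short sketch of how the bound and outlier rows could be computed, assuming that LB.25% and UB.75% denote the boxplot fences \(Q_1 - 1.5 \cdot IQR\) and \(Q_3 + 1.5 \cdot IQR\) and that CV is SD divided by the mean. Both readings are consistent with the tabulated values but are not stated in the slides. The quartiles 2.8 and 3.3 are the Sepal.Width values implied by the table under that reading; the `sample` list is hypothetical and is not the iris data.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def count_outliers(values, lower, upper):
    """Number of values falling strictly outside the fences."""
    return sum(1 for v in values if v < lower or v > upper)


# Sepal.Width quartiles implied by the table (assumed reading of LB/UB rows).
lower, upper = tukey_fences(2.8, 3.3)

# Hypothetical measurements, used only to exercise the outlier count.
sample = [2.0, 2.9, 3.0, 3.1, 4.1, 4.4]
n_out = count_outliers(sample, lower, upper)

# Coefficient of variation for Sepal.Width from the tabulated SD and mean.
cv = 0.436 / 3.057

print(f"fences: [{lower:.2f}, {upper:.2f}], outliers: {n_out}, CV: {cv:.3f}")
```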

Why We Need to Look at Data

Ref: Anscombe, Francis J. (1973). Graphs in statistical analysis. The American Statistician, 27, 17–21. doi:10.2307/2682899

Anscombe dataset summaries

dataset  xbar  ybar  r     intercept  slope
1        9.00  7.50  0.82  3.00       0.50
2        9.00  7.50  0.82  3.00       0.50
3        9.00  7.50  0.82  3.00       0.50
4        9.00  7.50  0.82  3.00       0.50
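
The summaries above can be reproduced with a few lines of code. The sketch below types in the four datasets from Anscombe (1973) and computes the means, the Pearson correlation, and the least-squares line by hand; the function name `summarize_xy` is made up for this example.

```python
import math

# Anscombe's quartet; datasets 1-3 share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    1: (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize_xy(x, y):
    """Return (xbar, ybar, r, intercept, slope) for a simple linear fit of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation
    slope = sxy / sxx               # least-squares slope
    intercept = ybar - slope * xbar
    return xbar, ybar, r, intercept, slope


summaries = {k: summarize_xy(x, y) for k, (x, y) in quartet.items()}
for k, stats in summaries.items():
    print(k, " ".join(f"{v:.2f}" for v in stats))
```

The four datasets agree on every one of these numbers to two decimals, yet their scatter plots look very different, which is why numerical summaries alone are not enough.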

Why We Need to Look at Data

Ref: Matejka, J., & Fitzmaurice, G. (2017). Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing. CHI 2017 Conference proceedings: ACM SIGCHI Conference on Human Factors in Computing Systems. Retrieved from https://www.research.autodesk.com/publications/same-stats-different-graphs/

Principles of Analytic Graphics

*: Reading Exercise: Roger D. Peng, Exploratory Data Analysis with R

Psychology Facts


https://friendly.github.io/6135/lectures/Overview.pdf

Visualization

Histogram

Boxplot

Boxplot (Be Careful)

*: Source: The R Graph Gallery

Boxplot for Comparison

Density

Bar Plot

Bar Plot (2)

Scatter Plot Y vs X

Categorical Data Mosaic Plot

Hair \ Eye  Brown  Blue  Hazel  Green
Black          68    20     15      5
Brown         119    84     54     29
Red            26    17     14     14
Blond           7    94     10     16
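
In a mosaic plot the tile sizes come from proportions of this table: one dimension of each tile reflects the marginal share of a category, the other the conditional share within it. The sketch below computes those proportions, assuming that rows are hair colors and columns are eye colors; the variable names are illustrative.

```python
eye_colors = ["Brown", "Blue", "Hazel", "Green"]

# Counts from the table above, keyed by hair color (assumed row variable).
counts = {
    "Black": [68, 20, 15, 5],
    "Brown": [119, 84, 54, 29],
    "Red": [26, 17, 14, 14],
    "Blond": [7, 94, 10, 16],
}

total = sum(sum(row) for row in counts.values())

# Marginal share of each hair color (tile width in a hair-by-eye mosaic).
hair_share = {hair: sum(row) / total for hair, row in counts.items()}

# Conditional share of each eye color within a hair color (tile height).
eye_given_hair = {
    hair: [c / sum(row) for c in row] for hair, row in counts.items()
}

for hair in counts:
    shares = ", ".join(
        f"{eye}={p:.2f}" for eye, p in zip(eye_colors, eye_given_hair[hair])
    )
    print(f"{hair:<6} share={hair_share[hair]:.2f}  {shares}")
```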

Quantile-Quantile Plot

Quantile-Quantile Plot

Correlation Plot

Scatter with Correlation Plot

              Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
Sepal.Length     1.0000000   -0.1175698     0.8717538    0.8179411
Sepal.Width     -0.1175698    1.0000000    -0.4284401   -0.3661259
Petal.Length     0.8717538   -0.4284401     1.0000000    0.9628654
Petal.Width      0.8179411   -0.3661259     0.9628654    1.0000000
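
A correlation matrix is easier to read once it is checked and ranked. The sketch below copies the matrix above, confirms that it is symmetric with a unit diagonal, and lists the variable pairs by absolute correlation; it is an illustration and not part of the original analysis.

```python
names = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]
corr = [
    [1.0000000, -0.1175698, 0.8717538, 0.8179411],
    [-0.1175698, 1.0000000, -0.4284401, -0.3661259],
    [0.8717538, -0.4284401, 1.0000000, 0.9628654],
    [0.8179411, -0.3661259, 0.9628654, 1.0000000],
]

k = len(names)

# Sanity checks: unit diagonal and symmetry.
is_valid = all(corr[i][i] == 1.0 for i in range(k)) and all(
    corr[i][j] == corr[j][i] for i in range(k) for j in range(k)
)

# Each unordered pair once, ranked by the strength of the linear association.
pairs = sorted(
    ((abs(corr[i][j]), names[i], names[j]) for i in range(k) for j in range(i + 1, k)),
    reverse=True,
)
strongest, weakest = pairs[0], pairs[-1]

print("valid matrix:", is_valid)
for strength, a, b in pairs:
    print(f"{a:<12} {b:<12} |r| = {strength:.3f}")
```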