Data, Models & Exploratory Data Analysis

I. Ozkan

Fall 2025

Resources for Preliminary Readings

Data

“A datum is a single measurement of something on a scale that is understandable to both the recorder and the reader. Data is multiple such measurements.”

Everything is Data

Data Basics

Data Set(s) assumed in this course:

Tidy Data:

Rows represent observations

Columns represent variables

Additional keywords: Wide Data and Long Data (for repeated measures)
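As a minimal sketch of the wide versus long distinction, the example below reshapes a small made-up repeated-measures table with base R's reshape(). The data frame and its column names (id, score_t1, score_t2) are hypothetical and only serve as an illustration.

```r
# Hypothetical wide data: one row per subject, one column per occasion
wide <- data.frame(id = 1:3,
                   score_t1 = c(10, 12, 9),
                   score_t2 = c(11, 15, 8))

# Long (tidy) form: one row per subject-occasion pair
long <- reshape(wide, direction = "long",
                varying = c("score_t1", "score_t2"),
                v.names = "score", timevar = "time",
                times = 1:2, idvar = "id")
long <- long[order(long$id, long$time), ]
print(long)
```

In the long form each row is one observation (a subject at one time point) and each column is one variable, which matches the tidy layout assumed in this course.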

How to Obtain Data (important):

Observational Data
Experimental Data

Data Types Once Again

Nominal/Categorical: Data with labels that need not correspond to quantitative levels, e.g., city names, gender, country.
Ordinal: Data with a set order or scale, e.g., satisfaction level, level of happiness.
Interval: Data with order and known exact differences, e.g., age brackets, income brackets.
Ratio: Data with a scale, known exact differences, and a zero; may be obtained by computation as a simple ratio.
Scalar/Real Valued: Data with a scale, known exact differences, and a zero; may take any value.
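One possible way to represent these types in R is sketched below. All values (cities, ages, incomes) are made up for illustration, and the mapping of each type to an R class is an assumption of this sketch rather than a fixed rule.

```r
# Nominal: labels with no order
city <- factor(c("Ankara", "Izmir", "Ankara"))

# Ordinal: labels with a set order
satisfaction <- factor(c("low", "high", "medium"),
                       levels = c("low", "medium", "high"), ordered = TRUE)

# Brackets built from a real-valued variable (hypothetical ages)
age <- c(23, 37, 58)
age_bracket <- cut(age, breaks = c(0, 30, 50, Inf))

# Ratio obtained by computation (hypothetical income / household size)
income_per_person <- c(3000, 4500, 5200) / c(1, 3, 2)

nlevels(city)            # number of distinct labels
satisfaction < "high"    # order comparisons work for ordered factors
table(age_bracket)       # counts per bracket
```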

Data and Procedure

How to deal with:
- Observational Data, for causal/predictive model(s)
- Experimental Data, for causal/predictive model(s)

How to summarize:

Nominal/Categorical
Ordinal
Interval
Ratio
Scalar/Real Valued

Both numerically and visually

*: This course covers all of these data types except Interval.

Data and Procedure

How to Model:

Dependent Variable (Outcome, regressand):

Nominal/Categorical or Ordinal: Linear Probability Model (Binary Dependent), Logit, Probit Regression, Decision Trees, Forests, Extreme Gradient Boosting, etc.
Scalar/Real Valued: Simple Linear, Spline, Lasso, Ridge, Elastic Net, other Non-Linear Regression, Decision Trees, Forests, Extreme Gradient Boosting, etc.
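A minimal sketch of how the outcome type drives the model choice, using the built-in mtcars data as a stand-in (the choice of data set and of wt as the single regressor is an assumption made for illustration):

```r
# Binary outcome (am: 0 = automatic, 1 = manual)
lpm_fit    <- lm(am ~ wt, data = mtcars)                       # linear probability model
logit_fit  <- glm(am ~ wt, data = mtcars,
                  family = binomial(link = "logit"))           # logit
probit_fit <- glm(am ~ wt, data = mtcars,
                  family = binomial(link = "probit"))          # probit

# Real-valued outcome (mpg): simple linear regression
lin_fit <- lm(mpg ~ wt, data = mtcars)

coef(logit_fit)
coef(lin_fit)

# Fitted values of the logit model are probabilities in (0, 1)
range(fitted(logit_fit))
```

The tree-based and regularized methods in the list need additional packages and are not shown here.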

How to Use in the Model:

Independent Variable(s) (Feature, Input):

Nominal/Categorical or Ordinal: e.g., Interaction Term, Dummy Variable
Scalar/Real Valued: e.g., Simple regressors, transformed variables
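The sketch below shows how R expands a categorical regressor into dummy variables and interaction terms, and how a transformed real-valued regressor can be written directly in a formula. The iris variables are used only as a convenient example.

```r
# Species is a factor with three levels; model.matrix() builds the dummy
# variables and the Species x Petal.Width interaction columns
X <- model.matrix(~ Species * Petal.Width, data = iris)
colnames(X)

# A transformed real-valued regressor inside a regression formula
fit <- lm(Sepal.Length ~ log(Petal.Length) + Species, data = iris)
names(coef(fit))
```

The first factor level (setosa) is the reference category, so only two dummy columns appear for the three species.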

$\text{Data} \implies \text{Pictures} \implies \text{Models and Stories}$

https://friendly.github.io/6135/lectures/Overview.pdf

Exploratory Data Analysis (EDA)

We look at numbers or graphs and try to find patterns. We pursue leads suggested by background information, imagination, patterns perceived, and experience with other data analyses.

P. Diaconis. Theories of data analysis: From magical thinking through classical statistics. In D.C. Hoaglin, F. Mosteller, and J.W. Tukey, editors, Exploring Data Tables, Trends, and Shapes, chapter 1. Wiley, 1985.

Exploratory Data Analysis

The Four R's of Exploratory Data Analysis

Revelation: Visualization
Residuals: Differences between the observed values of a variable and its predictions from some mathematical model
Re-expression: Transformation
Resistance: Outliers, Anomalies
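A small sketch of the four ideas in base R. The regression on iris and the two short vectors are illustrative choices, not part of the cited chapter.

```r
# Residuals: observed values minus predictions from a simple model
fit <- lm(Petal.Width ~ Petal.Length, data = iris)
res <- residuals(fit)

# Revelation: look at the residuals instead of only the coefficients
plot(fitted(fit), res, xlab = "Fitted", ylab = "Residual")

# Re-expression: a log transform turns multiplicative steps into equal steps
x <- c(1, 10, 100, 1000)
log10(x)

# Resistance: the median barely reacts to one extreme value, the mean does
y <- c(1, 2, 3, 4, 100)
mean(y)
median(y)
```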

P.F. Velleman and D.C. Hoaglin 1991. Data analysis. In D.C. Hoaglin and D.S. Moore, editors, Perspectives on Contemporary Statistics, number 21 in MAA Notes, chapter 2. Mathematical Association of America

Exploratory Data Analysis (EDA)

First Step for Data Analysis
No Formal Approach
A First look at the Data
Summarizing the Data
Understanding the key properties

Key Properties of Data

- Location of the Data (Mean, Median, etc)
- Variation in the Data (Range, Variance, Covariance, etc)
- Shape of the Data (Histograms, Density estimations, etc)
- Outliers
- Granularity
- Cluster Structure

Preliminary

Data Set(s) assumed in this course:

Tidy Data:

Rows represent observations

Columns represent variables

[Possible] Steps of Looking at a Dataset

Formulate your question: a useful way to guide the exploratory data analysis process
Read in your data, e.g., with readr::read_xxx(); if the dataset is messy, clean it (some packages are available for cleaning)
Check that this is the data you expected: number of columns, rows, etc.; dim(), nrow(), ncol()
Run str() to see the structure of the data
Look at the first few and last few observations of the data; head(), tail()
Start looking at the variables (columns) of the dataset
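The steps above can be sketched as follows. The built-in iris data stands in for a file you would read yourself; the file name in the comment is hypothetical.

```r
# In practice: df <- readr::read_csv("my_file.csv")   # hypothetical file name
df <- iris

dim(df); nrow(df); ncol(df)   # is this the size you expected?
str(df)                       # structure: types and first values of each column
head(df, 3)                   # first few observations
tail(df, 3)                   # last few observations

# Start looking at individual variables
sapply(df, class)
summary(df$Sepal.Length)
```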

Brief Data Cleaning Steps

Cleaning depends on the data at hand, so some of these steps may not apply.

Appropriate Variable Names: avoid names
- with blank spaces
- with special symbols: ?, $, *, +, #, (, ), -, /, }, {, |, >, <, …
- that start with a number: start with a letter instead (if necessary, put the number at the end, e.g., city_1)
- that duplicate another column name: column names must be unique (R is case sensitive)
Variable Formats: Should be appropriate
- Qualitative variables with correct labels
- Quantitative variables with appropriate scales
- Date Variables with correct date format
Duplicate rows and columns must be avoided
Blank rows and columns must be avoided
If necessary convert dataset into tidy format
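A minimal sketch of some of these steps in base R, on a small made-up data frame. The column names and values are hypothetical, and make.names() is only one of several ways to obtain syntactically valid names.

```r
# Hypothetical messy data: bad column names, a duplicate row, a blank row
raw <- data.frame(c("Ankara", "Ankara", NA, "Izmir"), c(1, 1, NA, 2))
names(raw) <- c("city name", "1st_score")

# Make the names syntactically valid and unique
names(raw) <- make.names(names(raw), unique = TRUE)

# Drop rows where every column is missing, then drop duplicate rows
no_blank <- raw[rowSums(!is.na(raw)) > 0, , drop = FALSE]
clean    <- no_blank[!duplicated(no_blank), , drop = FALSE]

# Give the qualitative variable an appropriate format
clean$city.name <- factor(clean$city.name)
str(clean)
```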

Numerical Summaries

Scalar Variables:

Number of Missing Values
Minimum
$1^{st}$ Quartile ($Q_1$): $25^{th} \: percentile$

$0.25(n+1)^{th} \text{ ordered position}$

Arithmetic Mean

$\bar x=\frac{1}{n}\sum_{i=1}^{n}x_i = \frac{x_1+x_2+ \cdots + x_n}{n}, \quad n: \text{sample size}$

Median ($Q_2$): $50^{th} \: percentile$

$0.5(n+1)^{th} \text{ ordered position}$

Standard deviation

$\sigma_x=\sqrt{Var(x)}$

$Var(x)=E((x-\bar x)^2)=\frac{1}{n-1} \sum_{i=1}^{n}(x_i-\bar x)^2$

Skewness

$skewness=\frac{1}{\sigma_x^3} E[(x-\bar x)^3] = \frac{1}{n} \frac {\sum_{i=1}^{n}(x_i-\bar x)^3}{\sigma_x^3}$

Kurtosis

$\kappa = \frac{1}{\sigma_x^4} E[(x-\bar x)^4]$

$3^{rd}$ Quartile ($Q_3$): $75^{th} \: percentile$

$0.75(n+1)^{th} \text{ ordered position}$

Maximum
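The formulas above can be checked by hand on a small sample. The sketch below uses a made-up vector and assumes that quantile(type = 6) is an acceptable stand-in for the $p(n+1)^{th}$ ordered-position rule; the moment formulas are coded exactly as written above.

```r
x <- c(1, 2, 3, 4, 5, 6, 7)   # hypothetical sample, n = 7
n <- length(x)

# type = 6 places the p-th quantile at the p(n+1)-th ordered position
q <- quantile(x, probs = c(0.25, 0.50, 0.75), type = 6)

x_bar  <- sum(x) / n                          # arithmetic mean
var_x  <- sum((x - x_bar)^2) / (n - 1)        # variance with n - 1
sd_x   <- sqrt(var_x)                         # standard deviation
skew_x <- (sum((x - x_bar)^3) / n) / sd_x^3   # skewness
kurt_x <- (sum((x - x_bar)^4) / n) / sd_x^4   # kurtosis

q
c(mean = x_bar, var = var_x, skewness = skew_x, kurtosis = kurt_x)
```

Software packages differ in whether they divide by $n$ or $n-1$ and in whether they subtract 3 from the kurtosis (excess kurtosis), so reported values may not match these formulas exactly.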

Categorical Variables:

Number of Missing Values

Proportions

Counts
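For a categorical variable these three summaries are one-liners in base R. The vector below is made up for illustration.

```r
x <- c("a", "b", NA, "a", "c", "a")   # hypothetical categorical variable

n_missing <- sum(is.na(x))            # number of missing values
counts    <- table(x)                 # counts (missing values excluded)
props     <- prop.table(counts)       # proportions among non-missing values

n_missing
counts
round(props, 2)
table(x, useNA = "ifany")             # counts including the missing category
```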

Multivariate: Correlation Structure:

Correlation Matrix

Multivariate: Cluster Structure:

[An Example] Dendrogram
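One possible way to draw such a dendrogram is hierarchical clustering on the standardized numeric columns of iris. The distance (Euclidean), linkage (complete), and number of groups (3) are assumptions of this sketch.

```r
num <- scale(iris[, 1:4])                      # standardize the numeric columns
hc  <- hclust(dist(num), method = "complete")  # hierarchical clustering

plot(hc, labels = FALSE, main = "Dendrogram of iris observations")

groups <- cutree(hc, k = 3)                    # cut the tree into 3 clusters
table(groups)
```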

A table is a good way to present some of these summaries.

Numerical Summaries Example: iris data

Selected Numerical Summaries: Iris Data
Vname	TN	min	max	mean	median	SD	IQR
Petal.Length	150	1.0	6.9	3.758	4.35	1.765	3.5
Petal.Width	150	0.1	2.5	1.199	1.30	0.762	1.5
Sepal.Length	150	4.3	7.9	5.843	5.80	0.828	1.3
Sepal.Width	150	2.0	4.4	3.057	3.00	0.436	0.5
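A table like this one can be built with a few sapply() calls. This is a sketch of one way to do it, not necessarily the code used for the slide.

```r
num <- iris[, c("Petal.Length", "Petal.Width", "Sepal.Length", "Sepal.Width")]

summ <- data.frame(
  Vname  = names(num),
  TN     = sapply(num, length),
  min    = sapply(num, min),
  max    = sapply(num, max),
  mean   = sapply(num, mean),
  median = sapply(num, median),
  SD     = sapply(num, sd),
  IQR    = sapply(num, IQR),
  row.names = NULL
)

# Round only the numeric columns for display
summ[-1] <- round(summ[-1], 3)
print(summ)
```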

Numerical Summaries Example: iris data

More Numerical Summaries: Iris Data
Statistic	Petal.Length	Petal.Width	Sepal.Length	Sepal.Width
TN	150	150	150	150
nNeg	0	0	0	0
nZero	0	0	0	0
nPos	150	150	150	150
NegInf	0	0	0	0
PosInf	0	0	0	0
NA_Value	0	0	0	0
Per_of_Missing	0	0	0	0
sum	563.7	179.9	876.5	458.6
min	1.0	0.1	4.3	2.0
max	6.9	2.5	7.9	4.4
mean	3.758	1.199	5.843	3.057
median	4.35	1.30	5.80	3.00
SD	1.765	0.762	0.828	0.436
CV	0.470	0.636	0.142	0.143
IQR	3.5	1.5	1.3	0.5
Skewness	-0.272	-0.102	0.312	0.316
Kurtosis	-1.396	-1.336	-0.574	0.181
10%	1.4	0.2	4.8	2.5
20%	1.5	0.2	5.0	2.7
LB.25%	-3.65	-1.95	3.15	2.05
UB.75%	10.35	4.05	8.35	4.05
nOutliers	0	0	0	4
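The CV, LB.25%, UB.75%, and nOutliers rows are consistent with the usual definitions: CV = SD / mean, and outlier bounds at $Q_1 - 1.5 \cdot IQR$ and $Q_3 + 1.5 \cdot IQR$. The sketch below reproduces those rows under that assumption, using R's default quantile type.

```r
num <- iris[, 1:4]

# Coefficient of variation: SD relative to the mean
cv <- sapply(num, function(v) sd(v) / mean(v))

# Lower and upper outlier bounds from the quartiles and the IQR
fences <- sapply(num, function(v) {
  q   <- quantile(v, c(0.25, 0.75))
  iqr <- q[[2]] - q[[1]]
  c(LB = q[[1]] - 1.5 * iqr, UB = q[[2]] + 1.5 * iqr)
})

# Number of observations outside the bounds, per variable
n_out <- sapply(names(num), function(nm) {
  sum(num[[nm]] < fences["LB", nm] | num[[nm]] > fences["UB", nm])
})

round(cv, 3)
round(fences, 2)
n_out
```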

Why We Need to Look at Data

Anscombe data set

Ref: Anscombe, Francis J. (1973). Graphs in statistical analysis. The American Statistician, 27, 17–21. doi:10.2307/2682899

anscombe dataset summaries
dataset	xbar	ybar	r	intercept	slope
1	9.00	7.50	0.82	3.00	0.50
2	9.00	7.50	0.82	3.00	0.50
3	9.00	7.50	0.82	3.00	0.50
4	9.00	7.50	0.82	3.00	0.50

All four data sets have nearly identical summaries (including regression intercepts and slopes), yet they look very different when plotted.
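The summaries can be recomputed from the anscombe data frame that ships with R (columns x1 to x4 and y1 to y4). The sketch below is one way to do it and also draws the four scatter plots.

```r
summ <- do.call(rbind, lapply(1:4, function(i) {
  x   <- anscombe[[paste0("x", i)]]
  y   <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)
  data.frame(dataset = i, xbar = mean(x), ybar = mean(y), r = cor(x, y),
             intercept = coef(fit)[[1]], slope = coef(fit)[[2]])
}))
print(round(summ, 2))

# The numbers agree; the pictures do not
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(op)
```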

Why We Need to Look at Data

Ref: Matejka, J., & Fitzmaurice, G. (2017). Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing. CHI 2017 Conference proceedings: ACM SIGCHI Conference on Human Factors in Computing Systems. Retrieved from https://www.research.autodesk.com/publications/same-stats-different-graphs/

Principles of Analytic Graphics

Show comparisons
Show Mechanism/Explanation
Show Multivariate Data
Integrate Evidence
Describe and Document the Evidence

*: Reading Exercise: Roger D. Peng, Exploratory Data Analysis with R

Psychology Facts

https://friendly.github.io/6135/lectures/Overview.pdf

Visualization

Histogram (univariate, bi-variate)
Density Estimation
Bar Plot
Scatter Plot ($Var_1 \: vs \: Var_2$)
Mosaic Plot (Categorical Data)
Box Plot
Quantile Plot
Correlation Plot

Histogram

Boxplot

Boxplots summarize the important characteristics of a univariate data distribution
- Location (center: Median)
- Variation (IQR)
- Shape (Symmetric?)
- Outliers
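The numbers behind a boxplot can be inspected with boxplot.stats(). The sketch below uses iris Sepal.Width as an example variable; the choice of variable is an assumption for illustration.

```r
bs <- boxplot.stats(iris$Sepal.Width)

bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # points drawn individually as outliers

# One variable, then the same variable compared across groups
boxplot(iris$Sepal.Width, horizontal = TRUE, main = "Sepal.Width")
boxplot(Sepal.Width ~ Species, data = iris, main = "Sepal.Width by Species")
```

Note that boxplot.stats() uses hinges rather than quantile(), so its box edges can differ slightly from $Q_1$ and $Q_3$ for some sample sizes.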

Boxplot (Be Careful)

*: Source: The R Graph Gallery

Boxplot for Comparison

A boxplot is a summary and may hide the shape of the distribution. A violin plot may be a better choice when densities need to be compared.

Density

Bar Plot

Bar Plot (2)

Scatter Plot Y vs X

Categorical Data Mosaic Plot

Using HairEyeColor data set (from datasets package)
Type ?HairEyeColor for source code

Hair / Eye	Brown	Blue	Hazel	Green
Black	68	20	15	5
Brown	119	84	54	29
Red	26	17	14	14
Blond	7	94	10	16
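The table above is the HairEyeColor array summed over Sex. A sketch of how to obtain it and draw a mosaic plot in base R (the plot title and coloring are arbitrary choices):

```r
# HairEyeColor is a 3-way table (Hair x Eye x Sex); keep Hair and Eye
hair_eye <- margin.table(HairEyeColor, margin = c(1, 2))
hair_eye

# Tile areas are proportional to the cell counts
mosaicplot(hair_eye, main = "Hair and eye color", color = TRUE)
```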

Quantile-Quantile Plot

Using precip data set (from datasets package)
Type ?qqplot for source code
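A minimal sketch of a normal quantile-quantile plot for precip, using qqnorm() and qqline() from the same help page. Whether the slide image was produced exactly this way is an assumption.

```r
# Sample quantiles of precip against theoretical normal quantiles
qq <- qqnorm(precip, main = "Normal Q-Q plot of precip",
             ylab = "Precipitation")
qqline(precip)   # reference line through the first and third quartiles

# qqnorm() also returns the plotted coordinates
str(qq)
```

Points that follow the reference line suggest an approximately normal distribution; systematic curvature at the ends suggests skewness or heavy tails.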

Quantile-Quantile Plot

Normal Distribution

Correlation Plot

Scatter with Correlation Plot

Iris data correlation matrix

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width
Sepal.Length	1.0000000	-0.1175698	0.8717538	0.8179411
Sepal.Width	-0.1175698	1.0000000	-0.4284401	-0.3661259
Petal.Length	0.8717538	-0.4284401	1.0000000	0.9628654
Petal.Width	0.8179411	-0.3661259	0.9628654	1.0000000
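The matrix above comes from cor() on the four numeric iris columns. The sketch below recomputes it and draws a scatter plot matrix with base R's pairs(); dedicated correlation-plot packages exist but are not assumed here.

```r
R <- cor(iris[, 1:4])   # Pearson correlations
round(R, 3)

# Scatter plot matrix, points colored by species
pairs(iris[, 1:4], col = iris$Species)
```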
