I. Ozkan
Fall 2025
 
 
 Everything is Data
Tidy Data:
Rows represent observations
Columns represent variables
How to Obtain Data (important):
Nominal/Categorical: Data with labels that need not correspond to quantitative levels, e.g., city names, gender, country.
Ordinal: Data with a set order or scale, e.g., satisfaction level, level of happiness.
Interval: Data with order and known exact differences, e.g., age brackets, income brackets.
Ratio: A simple ratio, possibly obtained by computation; the data have a scale, known exact differences, and a true zero.
Scalar/Real Valued: Data with a scale, known exact differences, and a zero, which may take any value.
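As a minimal sketch, the data types above map roughly onto base R classes as follows. The example values (cities, satisfaction levels, prices) are made up for illustration and are not from the course data.

```r
# Hypothetical example values, for illustration only
city <- factor(c("Ankara", "Izmir", "Ankara"))        # nominal: labels, no order
satisfaction <- factor(c("low", "high", "medium"),
                       levels = c("low", "medium", "high"),
                       ordered = TRUE)                 # ordinal: labels with an order
price <- c(12.5, 30.0, 7.25)                           # real valued, ratio scale

nlevels(city)               # number of distinct labels
satisfaction < "high"       # order comparisons work for ordered factors
price[2] / price[1]         # ratios are meaningful when zero is a true zero
```

Storing ordinal data as an ordered factor keeps the level order available to summaries and models, which a plain character vector would lose.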
How to deal with:
- Observational Data, for causal/predictive model(s)
- Experimental Data, for causal/predictive model(s)
How to summarize:
Both numerically and visually
*: This course covers all of these data types except Interval.
How to Model:
Dependent Variable (Outcome, regressand):
Nominal/Categorical or Ordinal: Linear Probability Model (binary dependent variable), Logit, Probit Regression, Decision Trees, Forests, Extreme Gradient Boosting, etc.
Scalar/Real Valued: Simple Linear, Spline, Lasso, Ridge, Elastic Net, other Non-Linear Regression, Decision Trees, Forests, Extreme Gradient Boosting, etc.
How to Use in the Model:
Independent Variable(s) (Feature, Input):
Nominal/Categorical or Ordinal: e.g., Interaction Term, Dummy Variable
Scalar/Real Valued: e.g., Simple regressors, transformed variables
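The following is a small sketch of how a categorical regressor enters a model as dummy variables and interaction terms, and how the outcome type changes the model family. It uses the built-in iris data; the particular formulas are illustrative choices, not models taken from the course.

```r
# Dummy coding and an interaction term via the model matrix
X <- model.matrix(~ Sepal.Length * Species, data = iris)
colnames(X)
# Species has 3 levels, so 2 dummy columns are created (setosa is the baseline),
# plus 2 interaction columns of the form Sepal.Length:Species<level>

# Real-valued outcome: linear regression with the same right-hand side
fit <- lm(Petal.Length ~ Sepal.Length * Species, data = iris)
coef(fit)

# Binary outcome: logit regression (is the flower virginica?)
logit_fit <- glm(I(Species == "virginica") ~ Sepal.Length,
                 data = iris, family = binomial)
coef(logit_fit)
```

R builds the dummy columns automatically when a factor appears in a formula, so the explicit `model.matrix()` call is only needed to inspect them.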
 
 We look at numbers or graphs and try to find
patterns. We pursue leads suggested by background information,
imagination, patterns perceived, and experience with other data
analyses.
 
  
 P. Diaconis. Theories of data
analysis: From magical thinking through classical statistics. In D.C.
Hoaglin, F. Mosteller, and J.W. Tukey, editors, Exploring Data Tables,
Trends, and Shapes, chapter 1. Wiley, 1985.
 
 The Four R's of Exploratory Data Analysis
  
Revelation: Visualization
Residuals: Differences between the observed values of a variable and its predictions from some mathematical model
Re-expression: Transformation
Resistance: Outliers, Anomalies
 
 P.F. Velleman and D.C. Hoaglin 1991. Data
analysis. In D.C. Hoaglin and D.S. Moore, editors, Perspectives on
Contemporary Statistics, number 21 in MAA Notes, chapter 2. Mathematical
Association of America 
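Three of the four R's can be sketched in a few lines of base R. This is only an illustration under assumed choices: the built-in precip and iris data, an artificial outlier of 1000, and a simple linear model are all picked here for convenience.

```r
x <- precip                          # built-in data: annual precipitation, US cities

# Resistance: adding one extreme value shifts the mean far more than the median
x_out <- c(x, 1000)                  # hypothetical outlier, for illustration only
shift <- c(mean = mean(x_out) - mean(x), median = median(x_out) - median(x))
shift

# Re-expression: a log transformation of the (strictly positive) values
log_x <- log(x)

# Residuals: observed values minus the predictions of a model
fit <- lm(Petal.Width ~ Petal.Length, data = iris)
res <- iris$Petal.Width - fitted(fit)
all.equal(unname(res), unname(residuals(fit)))
```

Revelation, the fourth R, is the plotting step: `hist(x)` and `hist(log_x)` would show the effect of the re-expression.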
First Step for Data Analysis
No Formal Approach
A First look at the Data
Summarizing the Data
Understanding the key properties
Data Set(s) assumed in this course:
 
 
Tidy Data:
Rows represent observations
Columns represent variables
Formulate your question: a useful way to guide the exploratory data analysis process
Read in your data: readr::read_xxx(); if the dataset is messy, clean it (several packages are available for cleaning)
Check that this is the data you expected (number of columns, rows, etc.): dim(), nrow(), ncol()
Run str() first to see the structure of the data
Look at the first few and last few observations of the data: head(), tail()
Start looking at the variables (columns) of the dataset
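The checklist above can be sketched as follows, using the built-in iris data as a stand-in for a freshly read file. For data on disk one would start with something like `readr::read_csv("path/to/file.csv")`, where the path is a placeholder.

```r
df <- iris          # stand-in for data read from a file

dim(df)             # number of rows and columns
nrow(df); ncol(df)
str(df)             # structure: variable types and first values
head(df, 3)         # first few observations
tail(df, 3)         # last few observations
summary(df)         # a first look at each variable (column)
```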
Cleaning depends on the data at hand, so some of these steps may not apply
Appropriate Variable Names: Avoiding some name notations; city_1)
Variable Formats: Should be appropriate
Duplicate rows and columns must be avoided
Blank rows and columns must be avoided
If necessary convert dataset into tidy format
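A minimal base R sketch of these cleaning steps is given below. The messy data frame `raw`, its column names, and its values are hypothetical and exist only to make the steps concrete; packages such as tidyr offer alternatives to `reshape()`.

```r
# Hypothetical messy data frame, for illustration only
raw <- data.frame(
  `City Name` = c("A", "B", "B", NA),
  `2023` = c(10, 20, 20, NA),
  `2024` = c(11, 21, 21, NA),
  check.names = FALSE
)

clean <- raw[rowSums(!is.na(raw)) > 0, ]   # drop blank (all-NA) rows
clean <- clean[!duplicated(clean), ]       # drop duplicate rows
names(clean)[1] <- "city"                  # short, syntactic variable name

# Wide (one column per year) to tidy/long (one row per city and year)
tidy <- reshape(clean, direction = "long",
                varying = c("2023", "2024"), v.names = "value",
                timevar = "year", times = c(2023, 2024), idvar = "city")
rownames(tidy) <- NULL
tidy
```

After the reshape each row is one observation (a city in a year) and each column is one variable, which is the tidy layout assumed in this course.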
Scalar Variables:
Number of Missing Values
Minimum
\(1^{st}\) Quartile (\(Q_1\)), the \(25^{th}\) percentile: \(0.25(n+1)^{th} \text{ ordered position}\)
Mean: \(\bar x=\frac{1}{n}\sum_{i=1}^{n}x_i = \frac{x_1+x_2+ \cdots + x_n}{n}\), where \(n\) is the sample size
Median: \(0.5(n+1)^{th} \text{ ordered position}\)
Standard Deviation: \(\sigma_x=\sqrt{Var(x)}\)
Variance: \(Var(x)=E((x-\bar x)^2)\), estimated by \(\frac{1}{n-1} \sum_{i=1}^{n}(x_i-\bar x)^2\)
Skewness: \(\frac{1}{\sigma_x^3} E[(x-\bar x)^3]\), estimated by \(\frac{1}{n} \frac {\sum_{i=1}^{n}(x_i-\bar x)^3}{\sigma_x^3}\)
Kurtosis: \(\kappa = \frac{1}{\sigma_x^4} E[(x-\bar x)^4]\)
\(3^{rd}\) Quartile (\(Q_3\)), the \(75^{th}\) percentile: \(0.75(n+1)^{th} \text{ ordered position}\)
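The summaries above can be computed by hand in base R, as in this sketch for `Petal.Length` from the built-in iris data. Two choices here are assumptions rather than the course's stated method: `type = 6` in `quantile()` is used because it corresponds to the \(p(n+1)\) ordered position, and skewness and kurtosis use central moments divided by \(n\) (several other estimators exist). The last quantity is the excess kurtosis, \(\kappa - 3\), which is zero for a normal distribution.

```r
x <- iris$Petal.Length
n <- length(x)

q <- quantile(x, c(0.25, 0.5, 0.75), type = 6)  # p(n+1) ordered positions
m <- mean(x)
v <- sum((x - m)^2) / (n - 1)                   # sample variance, as in var(x)
s <- sqrt(v)                                    # standard deviation

# Moment-based skewness and excess kurtosis (central moments divided by n)
m2 <- mean((x - m)^2)
skew <- mean((x - m)^3) / m2^1.5
ex_kurt <- mean((x - m)^4) / m2^2 - 3

round(c(mean = m, sd = s, skewness = skew, excess_kurtosis = ex_kurt), 3)
```

A kurtosis value below zero can only be an excess kurtosis, since \(\kappa\) itself is never smaller than 1.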
Categorical Variables:
Number of Missing Values
Proportions
Counts
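For a categorical variable the same three summaries take one line each in base R. In this sketch two values of `iris$Species` are set to `NA` purely for illustration; the original data have no missing values.

```r
species <- iris$Species
species[c(3, 77)] <- NA                  # hypothetical missing values

n_missing <- sum(is.na(species))         # number of missing values
counts <- table(species, useNA = "no")   # counts per category
props <- prop.table(counts)              # proportions of the non-missing values

n_missing
counts
round(props, 3)
```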
Multivariate: Correlation Structure:
Multivariate: Cluster Structure:
Selected Numerical Summaries (Iris Data)
| Vname | TN | min | max | mean | median | SD | IQR |
|---|---|---|---|---|---|---|---|
| Petal.Length | 150 | 1.0 | 6.9 | 3.758 | 4.35 | 1.765 | 3.5 |
| Petal.Width | 150 | 0.1 | 2.5 | 1.199 | 1.30 | 0.762 | 1.5 |
| Sepal.Length | 150 | 4.3 | 7.9 | 5.843 | 5.80 | 0.828 | 1.3 |
| Sepal.Width | 150 | 2.0 | 4.4 | 3.057 | 3.00 | 0.436 | 0.5 |
More Numerical Summaries (Iris Data)
| Statistic | Petal.Length | Petal.Width | Sepal.Length | Sepal.Width |
|---|---|---|---|---|
| TN | 150 | 150 | 150 | 150 |
| nNeg | 0 | 0 | 0 | 0 |
| nZero | 0 | 0 | 0 | 0 |
| nPos | 150 | 150 | 150 | 150 |
| NegInf | 0 | 0 | 0 | 0 |
| PosInf | 0 | 0 | 0 | 0 |
| NA_Value | 0 | 0 | 0 | 0 |
| Per_of_Missing | 0 | 0 | 0 | 0 |
| sum | 563.7 | 179.9 | 876.5 | 458.6 |
| min | 1.0 | 0.1 | 4.3 | 2.0 |
| max | 6.9 | 2.5 | 7.9 | 4.4 |
| mean | 3.758 | 1.199 | 5.843 | 3.057 |
| median | 4.35 | 1.30 | 5.80 | 3.00 |
| SD | 1.765 | 0.762 | 0.828 | 0.436 |
| CV | 0.470 | 0.636 | 0.142 | 0.143 |
| IQR | 3.5 | 1.5 | 1.3 | 0.5 |
| Skewness | -0.272 | -0.102 | 0.312 | 0.316 |
| Kurtosis | -1.396 | -1.336 | -0.574 | 0.181 |
| 10% | 1.4 | 0.2 | 4.8 | 2.5 |
| 20% | 1.5 | 0.2 | 5.0 | 2.7 |
| LB.25% | -3.65 | -1.95 | 3.15 | 2.05 |
| UB.75% | 10.35 | 4.05 | 8.35 | 4.05 |
| nOutliers | 0 | 0 | 0 | 4 |
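The LB.25%, UB.75%, and nOutliers rows are consistent with the usual 1.5 × IQR rule, \(Q_1 - 1.5\,IQR\) and \(Q_3 + 1.5\,IQR\). That reading is inferred from the table values rather than stated in the slides. A sketch for `Sepal.Width`, which also computes the coefficient of variation (CV = SD / mean):

```r
x <- iris$Sepal.Width

q1 <- quantile(x, 0.25, names = FALSE)
q3 <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1                       # interquartile range, as in IQR(x)

lb <- q1 - 1.5 * iqr                 # lower bound (fence)
ub <- q3 + 1.5 * iqr                 # upper bound (fence)
n_out <- sum(x < lb | x > ub)        # observations outside the bounds
cv <- sd(x) / mean(x)                # coefficient of variation

c(LB = lb, UB = ub, nOutliers = n_out, CV = round(cv, 3))
```

These are the same bounds a default box plot uses to decide which points to draw individually.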
Ref: Anscombe, Francis J. (1973). Graphs in statistical analysis. The American Statistician, 27, 17–21. doi:10.2307/2682899
| dataset | xbar | ybar | r | intercept | slope |
|---|---|---|---|---|---|
| 1 | 9.00 | 7.50 | 0.82 | 3.00 | 0.50 |
| 2 | 9.00 | 7.50 | 0.82 | 3.00 | 0.50 |
| 3 | 9.00 | 7.50 | 0.82 | 3.00 | 0.50 |
| 4 | 9.00 | 7.50 | 0.82 | 3.00 | 0.50 |
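The table can be recomputed from the `anscombe` data frame that ships with R (columns `x1` to `x4` and `y1` to `y4`). This sketch computes the same summaries for each of the four data sets and then plots them, which is where the differences become visible.

```r
# Summary statistics for each of the four Anscombe data sets
stats <- t(sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)
  c(dataset = i, xbar = mean(x), ybar = mean(y), r = cor(x, y),
    intercept = unname(coef(fit)[1]), slope = unname(coef(fit)[2]))
}))
round(stats, 2)

# Nearly identical numbers, very different pictures
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(op)
```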
Ref: Matejka, J., & Fitzmaurice, G. (2017). Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing. CHI 2017 Conference proceedings: ACM SIGCHI Conference on Human Factors in Computing Systems. Retrieved from https://www.research.autodesk.com/publications/same-stats-different-graphs/
Show comparisons
Show Mechanism/Explanation
Show Multivariate Data
Integrate Evidence
Describe and Document the Evidence
*: Reading Exercise: Roger D. Peng, Exploratory Data Analysis with R
Histogram (univariate, bivariate)
Density Estimation
Bar Plot
Scatter Plot (\(Var_1 \: vs \: Var_2\))
Mosaic Plot (Categorical Data)
Box Plot
Quantile Plot
Correlation Plot
Using HairEyeColor data set (from datasets package)
Type ?HairEyeColor to open the help page, which includes example code
| Hair \ Eye | Brown | Blue | Hazel | Green |
|---|---|---|---|---|
| Black | 68 | 20 | 15 | 5 |
| Brown | 119 | 84 | 54 | 29 |
| Red | 26 | 17 | 14 | 14 |
| Blond | 7 | 94 | 10 | 16 |
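`HairEyeColor` is stored as a three-way table (Hair × Eye × Sex), so a two-way table like the one above is obtained by summing over Sex. A minimal sketch, with the plot title chosen here for illustration:

```r
# Collapse the 3-way table over Sex, keeping Hair (rows) and Eye (columns)
hair_eye <- margin.table(HairEyeColor, margin = c(1, 2))
hair_eye

# Mosaic plot: tile areas are proportional to the cell counts
mosaicplot(hair_eye, main = "Hair and eye color", color = TRUE)
```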
Using precip data set (from datasets package)
Type ?qqplot to open the help page, which includes example code
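A normal quantile plot of `precip` can be drawn with `qqnorm()` and `qqline()`. The second half of this sketch rebuilds the theoretical quantiles by hand to show what is being plotted; the plot title is an illustrative choice.

```r
# Normal quantile-quantile plot of the built-in precip data
qq <- qqnorm(precip, main = "Normal Q-Q plot of precip")
qqline(precip)               # reference line through the first and third quartiles

# The same idea by hand: sorted data against theoretical normal quantiles
n <- length(precip)
theo <- qnorm(ppoints(n))
r_qq <- cor(theo, sort(precip))
r_qq
```

Points close to the reference line, and a correlation near 1, suggest the data are roughly normal; systematic curvature suggests skewness or heavy tails.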
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
|---|---|---|---|---|
| Sepal.Length | 1.0000000 | -0.1175698 | 0.8717538 | 0.8179411 |
| Sepal.Width | -0.1175698 | 1.0000000 | -0.4284401 | -0.3661259 |
| Petal.Length | 0.8717538 | -0.4284401 | 1.0000000 | 0.9628654 |
| Petal.Width | 0.8179411 | -0.3661259 | 0.9628654 | 1.0000000 |
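The correlation matrix above is what `cor()` returns for the four numeric iris columns. A short sketch, with a scatter plot matrix added as one simple way to look at the same structure visually:

```r
num <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]

R <- cor(num)                # Pearson correlation is the default method
round(R, 3)

# Pairwise scatter plots, colored by species
pairs(num, col = iris$Species)
```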