I. Ozkan
Fall 2025
Missing values: values that should have been recorded but were not
NA: Not Available
They need to be treated: Missing Data Treatment
This presentation is based on Lecture 5 of ETC5510: Introduction to Data Analysis https://mida-monash.netlify.app/slides/lecture_5a.pdf
# create x with some missing values 
x <- c(6, NA, 2, NA, 7, NA, 5)
# is there NA in this vector?
is.na(x)
## [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
# how many? 
# first can we use sum()
sum(1,2,3) # same as sum(c(1,2,3))
## [1] 6
sum(TRUE, FALSE, TRUE, TRUE)
## [1] 3
# TRUE is counted as 1, which is what we need 
# is.na(x): [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE 
sum(is.na(x)) # x has 3 NAs
## [1] 3
# proportion missing 
# number of missing values / total number of values 
length(x) # number of elements (valid for a vector) 
## [1] 7
# proportion
sum(is.na(x))/length(x)
## [1] 0.4285714
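Since TRUE counts as 1 and FALSE as 0, the same proportion can be obtained in one step with `mean()`. The following is a minimal sketch using the same vector; the name `prop_missing` is introduced here for illustration only.

```r
# same vector as above
x <- c(6, NA, 2, NA, 7, NA, 5)

# the mean of a logical vector is the share of TRUE values,
# i.e. sum(is.na(x)) / length(x)
prop_missing <- mean(is.na(x))
prop_missing
```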
3+4
## [1] 7
3 + NA
## [1] NA
# remember x 
x 
## [1]  6 NA  2 NA  7 NA  5
x+3 
## [1]  9 NA  5 NA 10 NA  8
sum(x)
## [1] NA
# but, let's see help page for sum() function 
# ?sum 
# Usage
# sum(..., na.rm = FALSE)
# please be careful when using the following functions 
sum(x, na.rm = TRUE)
## [1] 20
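The `na.rm` argument is not specific to `sum()`; other base R summaries such as `mean()` and `max()` accept it as well. A short sketch with the same vector (the variable names are illustrative):

```r
x <- c(6, NA, 2, NA, 7, NA, 5)

# without na.rm the result is NA, exactly as with sum(x)
mean_default <- mean(x)

# with na.rm = TRUE only the 4 observed values (6, 2, 7, 5) are used
mean_no_na <- mean(x, na.rm = TRUE)  # 20 / 4
max_no_na  <- max(x, na.rm = TRUE)

print(mean_default)
print(mean_no_na)
print(max_no_na)
```

Note that the divisor changes from 7 to 4 when the NAs are dropped, which is one reason to use `na.rm = TRUE` deliberately rather than by default.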
# remove the NAs first, then add the remaining numbers 
sum(na.omit(x))
## [1] 20

| x | y | z | 
|---|---|---|
| NA | 1 | 1 | 
| NA | 2 | 2 | 
| NA | 3 | 3 | 
| 4 | NA | 4 | 
| 5 | NA | 5 | 
| 6 | NA | 6 | 
| 7 | 7 | NA | 
| 8 | 8 | NA | 
| 9 | 9 | NA | 
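The slides do not show how this table is created. A minimal sketch, assuming it is the `dat_df` data frame that is passed to `vis_miss()` afterwards, could build it and count the missing values per column as follows:

```r
# assumed construction of the table above; the values are copied from it
dat_df <- data.frame(
  x = c(NA, NA, NA, 4, 5, 6, 7, 8, 9),
  y = c(1, 2, 3, NA, NA, NA, 7, 8, 9),
  z = c(1, 2, 3, 4, 5, 6, NA, NA, NA)
)

# number of NAs in each column
na_per_column <- colSums(is.na(dat_df))
na_per_column
```

Each column has three missing values, and no row is fully observed.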
# vis_miss() is available in the visdat package 
# install.packages("visdat")
library(visdat) 
vis_miss(dat_df)

na.omit / na.rm

Rows with NA in x removed:

| x | y | z | 
|---|---|---|
| 4 | NA | 4 | 
| 5 | NA | 5 | 
| 6 | NA | 6 | 
| 7 | 7 | NA | 
| 8 | 8 | NA | 
| 9 | 9 | NA | 

Rows with NA in y also removed:

| x | y | z | 
|---|---|---|
| 7 | 7 | NA | 
| 8 | 8 | NA | 
| 9 | 9 | NA | 

Rows with NA in z also removed (no rows are left):

| x | y | z | 
|---|---|---|

na.omit() removes every row that contains a missing value (na.rm = TRUE drops the missing values within a single calculation)
This loses data: some, most, or all of your data
This means removing/censoring observations
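The effect can be checked directly. The sketch below rebuilds the nine-row example table (the name `dat_df` and its construction are assumptions, since the slides only show the table) and compares row-wise deletion with column-wise `na.rm`:

```r
# assumed construction of the example table
dat_df <- data.frame(
  x = c(NA, NA, NA, 4, 5, 6, 7, 8, 9),
  y = c(1, 2, 3, NA, NA, NA, 7, 8, 9),
  z = c(1, 2, 3, 4, 5, 6, NA, NA, NA)
)

# na.omit() drops every row with at least one NA:
# here each row has one, so nothing is left
rows_left <- nrow(na.omit(dat_df))
rows_left

# na.rm = TRUE works per calculation: each column mean
# still uses its 6 observed values
col_means <- colMeans(dat_df, na.rm = TRUE)
col_means
```

Row-wise deletion discards all nine observations, while the column means are still computable from two thirds of the data.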
| temp | location | 
|---|---|
| 27 | inside | 
| 26 | inside | 
| NA | outside | 
| 29 | inside | 
| NA | outside | 
| 20 | outside | 
| 21 | outside | 
| 24 | inside | 
Basic summaries of missingness:
n_miss
n_complete
Dataframe summaries of missingness:
miss_var_summary
miss_case_summary
Note: These functions work with group_by
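To make clear what these summaries count, here is a base R sketch of the same ideas applied to the temp/location table above. It does not use the summary functions themselves; the data frame name `temp_df` and the variable names are hypothetical.

```r
# the temp/location table from above, typed in by hand
temp_df <- data.frame(
  temp = c(27, 26, NA, 29, NA, 20, 21, 24),
  location = c("inside", "inside", "outside", "inside",
               "outside", "outside", "outside", "inside")
)

total_missing  <- sum(is.na(temp_df))       # all NAs in the data frame
total_complete <- sum(!is.na(temp_df))      # all non-missing values
miss_per_var   <- colSums(is.na(temp_df))   # NAs per variable
miss_per_case  <- rowSums(is.na(temp_df))   # NAs per row (case)

# grouped summary: NAs in temp, split by location
miss_by_location <- tapply(is.na(temp_df$temp), temp_df$location, sum)
miss_by_location
```

In this table both missing temperatures were recorded outside, which is the kind of dependency a grouped summary is meant to reveal.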
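# n_miss(), n_complete() and miss_case_table() come from the naniar package
# group_by() and %>% come from dplyr
library(naniar)
library(dplyr)
# airquality data (from datasets)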
# check with ?airquality
# Daily air quality measurements in New York, May to September 1973
# A data frame with 153 observations on 6 variables
# total number of NAs
sum(is.na(airquality)) 
## [1] 44
n_miss(airquality) # same as above 
## [1] 44
# total number without NAs
sum(!is.na(airquality))
## [1] 874
n_complete(airquality) # same as above 
## [1] 874
# group_by example 
airquality %>%
  group_by(Month) %>%
  miss_case_table()
## # A tibble: 11 × 4
## # Groups:   Month [5]
##    Month n_miss_in_case n_cases pct_cases
##    <int>          <int>   <int>     <dbl>
##  1     5              0      24     77.4 
##  2     5              1       5     16.1 
##  3     5              2       2      6.45
##  4     6              0       9     30   
##  5     6              1      21     70   
##  6     7              0      26     83.9 
##  7     7              1       5     16.1 
##  8     8              0      23     74.2 
##  9     8              1       8     25.8 
## 10     9              0      29     96.7 
## 11     9              1       1      3.33
Drop the cases: when a small fraction of cases (around 5%) have several missing values; explore the data first
Drop the variables: when one or two variables, out of many, have a lot of missing values
Impute: when missing values are few in number but spread across many cases and variables
Carefully check the dependencies between missingness and existing variables to design the imputation
Using the mean or median of the complete cases for each variable (Not good generally)
Using models to predict missing values (Better)
Using a statistical distribution, e.g. a normal model, to simulate a value (Best)
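The three approaches can be sketched in base R on the `airquality` data. This is only an illustration under simple assumptions (a linear model of Ozone on Wind and Temp, and normally distributed residuals); the object names are hypothetical.

```r
aq <- airquality
miss <- is.na(aq$Ozone)   # which Ozone values are missing

# 1. overall median of the observed values
ozone_median <- ifelse(miss, median(aq$Ozone, na.rm = TRUE), aq$Ozone)

# 2. model prediction: lm() drops the incomplete rows when fitting;
#    Wind and Temp have no NAs, so every missing Ozone gets a prediction
fit <- lm(Ozone ~ Wind + Temp, data = aq)
ozone_lm <- aq$Ozone
ozone_lm[miss] <- predict(fit, newdata = aq[miss, ])

# 3. simulation: prediction plus a random draw from a normal
#    distribution with the model's residual standard deviation
set.seed(1)               # for reproducibility of the sketch
ozone_sim <- ozone_lm
ozone_sim[miss] <- ozone_lm[miss] + rnorm(sum(miss), mean = 0, sd = sigma(fit))

# none of the three versions contains NAs any more
c(anyNA(ozone_median), anyNA(ozone_lm), anyNA(ozone_sim))
```

The median version puts all imputed points at one value, the model version puts them exactly on the fitted surface, and the simulated version adds back variability around it.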
simputation package: Median (of Each Month Value)
library(dplyr)    # mutate() and %>%
library(ggplot2)  # ggplot(), geom_point(), theme_minimal()
airquality %>%  
  as.data.frame() %>% 
  mutate(Ozone_NA = is.na(Ozone)) %>% 
  simputation::impute_median(Ozone ~ Month) %>%
  ggplot(aes(x = Solar.R,
            y = Ozone,
            colour = Ozone_NA)) + 
  geom_point() + 
  theme_minimal()

simputation package: Mean (of Each Month Value)
airquality %>%  
  as.data.frame() %>% 
  mutate(Ozone_NA = is.na(Ozone)) %>% 
  simputation::impute_proxy(Ozone ~ mean(Ozone, na.rm = TRUE) | Month) %>%
  ggplot(aes(x = Solar.R,
            y = Ozone,
            colour = Ozone_NA)) + 
  geom_point() + 
  theme_minimal()

simputation package: Linear Model
airquality %>%  
  as.data.frame() %>% 
  mutate(Ozone_NA = is.na(Ozone)) %>% 
  simputation::impute_lm(Ozone ~ Wind + Temp + Solar.R) %>%
  ggplot(aes(x = Solar.R,
            y = Ozone,
            colour = Ozone_NA)) + 
  geom_point() + 
  theme_minimal()

simputation package: Decision Tree
airquality %>%  
  as.data.frame() %>% 
  mutate(Ozone_NA = is.na(Ozone)) %>% 
  simputation::impute_cart(Ozone ~ Wind + Temp + Solar.R) %>%
  ggplot(aes(x = Solar.R,
            y = Ozone,
            colour = Ozone_NA)) + 
  geom_point() + 
  theme_minimal()