are missing values missing at random

  • is the chance of a value being missing unrelated to any particular variable?

example 1

  • lifestyle survey
  • age, gender - respondents will answer truthfully
  • weight, smoking, exercise - respondents might lie
  • religion, sexual orientation - respondents may omit
  • long list of questions on diet
    • respondents get bored and do not answer the last few pages
    • last few pages are all questions about vegetables
    • keen respondents happen to all like vegetables
    • missing values here are not missing at random

example 2

  • retinal damage checkups for diabetes patients
  • eye drops sting a bit during checkup
  • retinal checkup also records blood pressure etc.
  • but if patients don’t show up for the retinal checkup, there won’t be blood pressure values
  • patients most likely to miss appointments are usually those in the poorest condition
  • these missing values are also not missing at random

  • various ways to deal with missing values
  • choice depends on why values are missing

missing data in survival analysis

  • why data are missing - need to get to know data
    • get to know how data was collected
    • EDA: tables, histograms
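
A minimal sketch of this kind of tabulation in Python, assuming records are stored as dictionaries with None marking a missing value. The toy records and the helper name missing_rate_by_group are made up for illustration.

```python
from collections import Counter

# Hypothetical toy records; None marks a missing value.
records = [
    {"gender": "F", "age": 34, "income": 52000},
    {"gender": "M", "age": 41, "income": None},
    {"gender": "F", "age": None, "income": 48000},
    {"gender": "M", "age": 29, "income": None},
    {"gender": "F", "age": 55, "income": 61000},
    {"gender": "M", "age": 38, "income": 45000},
]

# Table 1: number of missing values per variable.
missing_per_variable = Counter(
    name for row in records for name, value in row.items() if value is None
)


def missing_rate_by_group(rows, group_key, target_key):
    """Table 2: share of rows with a missing target, within each group."""
    totals, missing = Counter(), Counter()
    for row in rows:
        totals[row[group_key]] += 1
        if row[target_key] is None:
            missing[row[group_key]] += 1
    return {group: missing[group] / totals[group] for group in totals}


income_missing_by_gender = missing_rate_by_group(records, "gender", "income")

print(dict(missing_per_variable))
print(income_missing_by_gender)
```

If the missing rate differs a lot between groups, as it does between F and M in this toy data, that is a hint that the values are not missing completely at random.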

pattern of missingness


  • missing completely at random (MCAR)
  • e.g. males are just as likely to have a missing value as females


  • missing at random (MAR)
  • missingness can be explained by other variables in dataset
  • e.g. people with higher education less likely to disclose their income
  • can fill in missing value for one variable based on values of another variable


  • not missing at random (MNAR)
  • missingness depends on unobserved information: the missing value itself or a variable that was not captured
  • e.g. whether a value is missing depends on disease severity
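
The three patterns can be compared with a small simulation. This is only a sketch: the population, the missingness probabilities and the helper name observed_mean are all assumptions chosen for illustration.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: education (0 = lower, 1 = higher) and income.
# Income is higher on average in the higher-education group.
people = []
for _ in range(20000):
    education = rng.randint(0, 1)
    income = rng.gauss(40 + 20 * education, 10)
    people.append((education, income))


def observed_mean(rows, prob_missing):
    """Mean income among the rows whose income was not made missing."""
    kept = [inc for edu, inc in rows if rng.random() >= prob_missing(edu, inc)]
    return mean(kept)


true_mean = mean(inc for _, inc in people)

# MCAR: every value has the same 30% chance of being missing.
mcar_mean = observed_mean(people, lambda edu, inc: 0.3)

# MAR: missingness depends only on education, which is always observed.
mar_mean = observed_mean(people, lambda edu, inc: 0.6 if edu == 1 else 0.1)

# MNAR: missingness depends on the income value itself.
mnar_mean = observed_mean(people, lambda edu, inc: 0.6 if inc > 50 else 0.1)

print(f"true {true_mean:.2f}  MCAR {mcar_mean:.2f}  "
      f"MAR {mar_mean:.2f}  MNAR {mnar_mean:.2f}")
```

In this setup the MCAR mean stays close to the true mean, while the MAR and MNAR means are pulled downwards. The difference between the last two is that under MAR the bias can be corrected using education, which is in the dataset; under MNAR nothing in the dataset explains it.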

methods of handling missing data

complete case analysis

  • omit all records with missing values
  • if MCAR, estimates are unbiased; the smaller sample only costs precision, so it must still be large enough
  • if MAR or MNAR, estimates will generally be biased
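
A minimal sketch of complete case analysis, assuming records are dictionaries with None marking a missing value. The records and the function name complete_cases are hypothetical.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "weight": 70.5, "smoker": False},
    {"age": 41, "weight": None, "smoker": True},
    {"age": None, "weight": 82.0, "smoker": False},
    {"age": 29, "weight": 64.2, "smoker": None},
    {"age": 55, "weight": 90.1, "smoker": True},
]


def complete_cases(rows):
    """Keep only the rows in which no variable is missing."""
    return [row for row in rows if all(v is not None for v in row.values())]


complete = complete_cases(records)
print(f"kept {len(complete)} of {len(records)} records")
```

Note the explicit `is not None` test: a plain truthiness check would also drop legitimate values such as False or 0.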

mean substitution/ imputation

  • replace missing with mean of existing values for that variable
  • adv = does not change mean of variable
  • disadv = artificially decreases variance
  • disadv = makes it difficult to detect correlations between imputed variable and other variables
  • mean imputation gives biased results, should be avoided
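
The effect on mean and variance can be checked on a toy example. A sketch with made-up weights, using the population variance from the standard library:

```python
from statistics import mean, pvariance

# Hypothetical weights in kg; None marks a missing value.
weights = [62.0, None, 75.5, 81.0, None, 58.5, 90.0, None, 70.0]

observed = [w for w in weights if w is not None]
observed_mean = mean(observed)

# Replace every missing value with the mean of the observed values.
imputed = [observed_mean if w is None else w for w in weights]

print("mean:    ", observed_mean, "->", mean(imputed))
print("variance:", pvariance(observed), "->", pvariance(imputed))
```

The mean is unchanged, but the variance shrinks by the factor (number observed) / (total number), here 6/9, because the imputed values add no spread while the count goes up.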

multiple imputation

  • assumes MCAR or MAR
  • missing values imputed from a distribution
  • done multiple times to yield multiple complete datasets
  • each dataset analyzed, results combined
  • adv = yields unbiased results if the MCAR/MAR assumption holds and the imputation model is adequate
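
A simplified sketch of the impute, analyze, combine cycle for a single variable. Several things here are assumptions made for illustration: the simulated data, the normal imputation model fitted to a bootstrap resample, the choice of m = 20, and the use of Rubin's rules to combine results (the notes do not say which combining rule is meant).

```python
import random
from statistics import mean, stdev, variance

rng = random.Random(1)  # fixed seed for reproducibility

# Hypothetical data: 200 values, roughly 30% missing completely at random.
full = [rng.gauss(50, 10) for _ in range(200)]
data = [None if rng.random() < 0.3 else x for x in full]
observed = [x for x in data if x is not None]


def impute_once(values, observed_values):
    """Fill each gap with a draw from a normal fitted to a bootstrap resample."""
    resample = rng.choices(observed_values, k=len(observed_values))
    mu, sigma = mean(resample), stdev(resample)
    return [rng.gauss(mu, sigma) if v is None else v for v in values]


m = 20  # number of imputed datasets (arbitrary choice for this sketch)
estimates, within_variances = [], []
for _ in range(m):
    completed = impute_once(data, observed)
    # Analysis step: estimate the mean and its squared standard error.
    estimates.append(mean(completed))
    within_variances.append(variance(completed) / len(completed))

# Combine the m analyses with Rubin's rules.
pooled_estimate = mean(estimates)
within = mean(within_variances)
between = variance(estimates)
total_variance = within + (1 + 1 / m) * between

print(f"pooled mean {pooled_estimate:.2f}, standard error {total_variance ** 0.5:.2f}")
```

The between-imputation term is what separates this from single imputation: it carries the extra uncertainty caused by not knowing the missing values.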

maximum likelihood

  • assumes MCAR or MAR
  • variable assumed to be normally distributed
  • use existing values to calculate mean and variance
  • missing values drawn from the normal distribution fitted to the existing values
  • repeat until mean and variance of complete data converge to mean and variance of existing values
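
One way to read these steps is as an EM-style iteration. The sketch below is an assumption about what is meant, not a definitive implementation: it replaces the random draws with their expected values so that the iteration is deterministic, and the data are made up.

```python
from statistics import mean, pvariance

# Hypothetical measurements; None marks a missing value.
values = [4.1, None, 5.3, 6.0, None, 4.8, 5.5, None, 5.1, 4.6]
observed = [v for v in values if v is not None]
n = len(values)
n_missing = n - len(observed)
sum_obs = sum(observed)
sum_sq_obs = sum(v * v for v in observed)

# Deliberately poor starting guesses, to make the iteration visible.
mu, var = 0.0, 1.0
for iteration in range(1000):
    # E-step: expected value and expected square of each missing entry.
    expected_value = mu
    expected_square = mu * mu + var
    # M-step: recompute mean and variance on the "completed" data.
    new_mu = (sum_obs + n_missing * expected_value) / n
    new_var = (sum_sq_obs + n_missing * expected_square) / n - new_mu * new_mu
    converged = abs(new_mu - mu) < 1e-12 and abs(new_var - var) < 1e-12
    mu, var = new_mu, new_var
    if converged:
        break

print(f"converged after {iteration + 1} iterations: mean {mu:.4f}, variance {var:.4f}")
print(f"observed values:            mean {mean(observed):.4f}, "
      f"variance {pvariance(observed):.4f}")
```

For a single normal variable the iteration settles on the mean and (population) variance of the existing values, which matches the stopping rule in the last bullet. The approach becomes more useful with several correlated variables, where the other variables inform the missing ones.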

MNAR data need to be handled on a case-by-case basis