run with these predictors
- age, gender, copd, quintile, and ethnic group
- quintile is 5 socioeconomic categories (1 = most affluent, 5 = poorest),
- nationally weighted by population, meaning that 20% of England’s population live in areas of each quintile
library(survival)
g <- read.csv("GMPH_1K_final_2.csv", header = TRUE, sep = ",")
g %>% dim()
## [1] 1000 31
g %>% head()
## id death los age gender cancer cabg crt defib dementia diabetes hypertension
## 1 1 0 2 90 2 0 0 0 0 0 0 0
## 2 2 0 10 74 1 0 0 0 0 0 0 1
## 3 3 0 3 83 2 0 0 0 0 0 0 1
## 4 4 0 1 79 1 0 0 0 0 0 1 1
## 5 5 0 17 94 2 0 0 0 0 0 1 1
## 6 6 0 47 89 1 0 0 0 0 0 0 0
## ihd mental_health arrhythmias copd obesity pvd renal_disease valvular_disease
## 1 0 0 1 0 0 0 0 1
## 2 1 0 0 0 0 0 1 1
## 3 0 0 1 0 0 0 0 0
## 4 1 0 0 1 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 1 0 1
## metastatic_cancer pacemaker pneumonia prior_appts_attended prior_dnas pci
## 1 0 0 0 4 0 0
## 2 0 0 1 9 1 0
## 3 0 0 0 1 0 0
## 4 0 1 0 9 2 1
## 5 0 0 0 3 0 0
## 6 0 0 1 3 0 0
## stroke senile quintile ethnicgroup fu_time
## 1 0 0 2 NA 416
## 2 0 0 4 1 648
## 3 0 0 3 1 466
## 4 1 0 5 1 441
## 5 0 0 2 1 371
## 6 0 0 3 NA 47
g$gender <- as.factor(g$gender)
g$ethnicgroup <- as.factor(g$ethnicgroup)
g$copd <- as.factor(g$copd)
cox_convg <- coxph(Surv(fu_time, death) ~ age + gender + copd + quintile + ethnicgroup, data = g)
summary(cox_convg)
## Call:
## coxph(formula = Surv(fu_time, death) ~ age + gender + copd +
## quintile + ethnicgroup, data = g)
##
## n= 951, number of events= 468
## (49 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.062452 1.064443 0.005635 11.082 < 2e-16 ***
## gender2 -0.278350 0.757031 0.096478 -2.885 0.00391 **
## copd1 0.044956 1.045982 0.108316 0.415 0.67811
## quintile 0.068762 1.071181 0.035380 1.944 0.05195 .
## ethnicgroup2 0.009406 1.009450 0.325074 0.029 0.97692
## ethnicgroup3 -0.796419 0.450941 0.415449 -1.917 0.05524 .
## ethnicgroup9 0.463197 1.589146 0.360763 1.284 0.19916
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.0644 0.9395 1.0528 1.0763
## gender2 0.7570 1.3209 0.6266 0.9146
## copd1 1.0460 0.9560 0.8459 1.2934
## quintile 1.0712 0.9335 0.9994 1.1481
## ethnicgroup2 1.0095 0.9906 0.5338 1.9089
## ethnicgroup3 0.4509 2.2176 0.1998 1.0180
## ethnicgroup9 1.5891 0.6293 0.7836 3.2229
##
## Concordance= 0.665 (se = 0.013 )
## Likelihood ratio test= 159.2 on 7 df, p=<2e-16
## Wald test = 134.2 on 7 df, p=<2e-16
## Score (logrank) test = 132.4 on 7 df, p=<2e-16
running with quintiles as categorical leads to non-convergence
g$quintile <- factor(g$quintile)
cox_nonconvg <- coxph(Surv(fu_time, death) ~ age + gender + copd + quintile + ethnicgroup, data = g)
"Loglik converged before variable 4,5,6,7,8 ; coefficient may be infinite."
## [1] "Loglik converged before variable 4,5,6,7,8 ; coefficient may be infinite."
summary(cox_nonconvg)
## Call:
## coxph(formula = Surv(fu_time, death) ~ age + gender + copd +
## quintile + ethnicgroup, data = g)
##
## n= 951, number of events= 468
## (49 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 6.213e-02 1.064e+00 5.606e-03 11.082 < 2e-16 ***
## gender2 -3.104e-01 7.332e-01 9.695e-02 -3.201 0.00137 **
## copd1 6.439e-02 1.067e+00 1.092e-01 0.590 0.55548
## quintile1 1.422e+01 1.501e+06 9.177e+02 0.015 0.98764
## quintile2 1.391e+01 1.094e+06 9.177e+02 0.015 0.98791
## quintile3 1.431e+01 1.638e+06 9.177e+02 0.016 0.98756
## quintile4 1.427e+01 1.579e+06 9.177e+02 0.016 0.98759
## quintile5 1.430e+01 1.631e+06 9.177e+02 0.016 0.98756
## ethnicgroup2 1.402e-02 1.014e+00 3.261e-01 0.043 0.96570
## ethnicgroup3 -7.728e-01 4.617e-01 4.167e-01 -1.854 0.06369 .
## ethnicgroup9 5.211e-01 1.684e+00 3.623e-01 1.438 0.15038
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.064e+00 9.398e-01 1.0525 1.0759
## gender2 7.332e-01 1.364e+00 0.6063 0.8866
## copd1 1.067e+00 9.376e-01 0.8610 1.3211
## quintile1 1.501e+06 6.660e-07 0.0000 Inf
## quintile2 1.094e+06 9.141e-07 0.0000 Inf
## quintile3 1.638e+06 6.107e-07 0.0000 Inf
## quintile4 1.579e+06 6.332e-07 0.0000 Inf
## quintile5 1.631e+06 6.131e-07 0.0000 Inf
## ethnicgroup2 1.014e+00 9.861e-01 0.5352 1.9217
## ethnicgroup3 4.617e-01 2.166e+00 0.2040 1.0450
## ethnicgroup9 1.684e+00 5.939e-01 0.8277 3.4254
##
## Concordance= 0.67 (se = 0.013 )
## Likelihood ratio test= 169.6 on 11 df, p=<2e-16
## Wald test = 142.4 on 11 df, p=<2e-16
## Score (logrank) test = 143.1 on 11 df, p=<2e-16
quintile CIs infinitely wide
table quintile
table(g$quintile, exclude = NULL)
##
## 0 1 2 3 4 5 <NA>
## 4 138 205 211 220 216 6
- quintile 0 is where postal code was missing, couldn’t find socioeconomic status from national file
- regardless there’s only 4 in that group,
- but sometimes 4 is even enough for convergence,
- problem is deeper
table quintile with death
round(prop.table(table(g$quintile, g$death),1)*100,1)
##
## 0 1
## 0 100.0 0.0
## 1 43.5 56.5
## 2 54.1 45.9
## 3 49.8 50.2
## 4 52.7 47.3
## 5 50.5 49.5
- real problem is that quintile=0 is the reference group
- and none of the 4 patients were dead
- model had no reference group to compare to
solutions
change reference category
- R sets lowest category as reference by default
- in terms of socioeconomic status, usually interest in effect of lower status vs higher status {lower should be reference}
- so let’s set quintile = 1 as reference
combine categories
- quintile = 0 are not NAs
- they don’t have postal code, might be from overseas, homeless, new immigrants
- but quintile = 0 group so small, combining it with NA and creating a new group shouldn’t impact results
exclude patients
- if only 4/1000 patients, might be better to just exclude these
dropping variable
- if a lot of patients had quintile = 0, doesn’t make sense to drop all patients,
- better to drop that variable
EOF