run with these predictors

  • age, gender, copd, quintile, and ethnic group
  • quintile is 5 socioeconomic categories (1 = most affluent, 5 = poorest),
    • nationally weighted by population, meaning that 20% of England’s population live in areas of each quintile
library(survival)
g <- read.csv("GMPH_1K_final_2.csv", header = TRUE, sep = ",")
g %>% dim()
## [1] 1000   31
g %>% head()
##   id death los age gender cancer cabg crt defib dementia diabetes hypertension
## 1  1     0   2  90      2      0    0   0     0        0        0            0
## 2  2     0  10  74      1      0    0   0     0        0        0            1
## 3  3     0   3  83      2      0    0   0     0        0        0            1
## 4  4     0   1  79      1      0    0   0     0        0        1            1
## 5  5     0  17  94      2      0    0   0     0        0        1            1
## 6  6     0  47  89      1      0    0   0     0        0        0            0
##   ihd mental_health arrhythmias copd obesity pvd renal_disease valvular_disease
## 1   0             0           1    0       0   0             0                1
## 2   1             0           0    0       0   0             1                1
## 3   0             0           1    0       0   0             0                0
## 4   1             0           0    1       0   0             0                0
## 5   0             0           0    0       0   0             0                0
## 6   0             0           0    0       0   1             0                1
##   metastatic_cancer pacemaker pneumonia prior_appts_attended prior_dnas pci
## 1                 0         0         0                    4          0   0
## 2                 0         0         1                    9          1   0
## 3                 0         0         0                    1          0   0
## 4                 0         1         0                    9          2   1
## 5                 0         0         0                    3          0   0
## 6                 0         0         1                    3          0   0
##   stroke senile quintile ethnicgroup fu_time
## 1      0      0        2          NA     416
## 2      0      0        4           1     648
## 3      0      0        3           1     466
## 4      1      0        5           1     441
## 5      0      0        2           1     371
## 6      0      0        3          NA      47
g$gender <- as.factor(g$gender)
g$ethnicgroup <- as.factor(g$ethnicgroup)
g$copd <- as.factor(g$copd)

cox_convg <- coxph(Surv(fu_time, death) ~ age + gender + copd + quintile + ethnicgroup, data = g)
summary(cox_convg)
## Call:
## coxph(formula = Surv(fu_time, death) ~ age + gender + copd + 
##     quintile + ethnicgroup, data = g)
## 
##   n= 951, number of events= 468 
##    (49 observations deleted due to missingness)
## 
##                   coef exp(coef)  se(coef)      z Pr(>|z|)    
## age           0.062452  1.064443  0.005635 11.082  < 2e-16 ***
## gender2      -0.278350  0.757031  0.096478 -2.885  0.00391 ** 
## copd1         0.044956  1.045982  0.108316  0.415  0.67811    
## quintile      0.068762  1.071181  0.035380  1.944  0.05195 .  
## ethnicgroup2  0.009406  1.009450  0.325074  0.029  0.97692    
## ethnicgroup3 -0.796419  0.450941  0.415449 -1.917  0.05524 .  
## ethnicgroup9  0.463197  1.589146  0.360763  1.284  0.19916    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##              exp(coef) exp(-coef) lower .95 upper .95
## age             1.0644     0.9395    1.0528    1.0763
## gender2         0.7570     1.3209    0.6266    0.9146
## copd1           1.0460     0.9560    0.8459    1.2934
## quintile        1.0712     0.9335    0.9994    1.1481
## ethnicgroup2    1.0095     0.9906    0.5338    1.9089
## ethnicgroup3    0.4509     2.2176    0.1998    1.0180
## ethnicgroup9    1.5891     0.6293    0.7836    3.2229
## 
## Concordance= 0.665  (se = 0.013 )
## Likelihood ratio test= 159.2  on 7 df,   p=<2e-16
## Wald test            = 134.2  on 7 df,   p=<2e-16
## Score (logrank) test = 132.4  on 7 df,   p=<2e-16

running with quintiles as categorical leads to non-convergence

g$quintile <- factor(g$quintile)

cox_nonconvg <- coxph(Surv(fu_time, death) ~ age + gender + copd + quintile + ethnicgroup, data = g)
"Loglik converged before variable  4,5,6,7,8 ; coefficient may be infinite."
## [1] "Loglik converged before variable  4,5,6,7,8 ; coefficient may be infinite."
summary(cox_nonconvg)
## Call:
## coxph(formula = Surv(fu_time, death) ~ age + gender + copd + 
##     quintile + ethnicgroup, data = g)
## 
##   n= 951, number of events= 468 
##    (49 observations deleted due to missingness)
## 
##                    coef  exp(coef)   se(coef)      z Pr(>|z|)    
## age           6.213e-02  1.064e+00  5.606e-03 11.082  < 2e-16 ***
## gender2      -3.104e-01  7.332e-01  9.695e-02 -3.201  0.00137 ** 
## copd1         6.439e-02  1.067e+00  1.092e-01  0.590  0.55548    
## quintile1     1.422e+01  1.501e+06  9.177e+02  0.015  0.98764    
## quintile2     1.391e+01  1.094e+06  9.177e+02  0.015  0.98791    
## quintile3     1.431e+01  1.638e+06  9.177e+02  0.016  0.98756    
## quintile4     1.427e+01  1.579e+06  9.177e+02  0.016  0.98759    
## quintile5     1.430e+01  1.631e+06  9.177e+02  0.016  0.98756    
## ethnicgroup2  1.402e-02  1.014e+00  3.261e-01  0.043  0.96570    
## ethnicgroup3 -7.728e-01  4.617e-01  4.167e-01 -1.854  0.06369 .  
## ethnicgroup9  5.211e-01  1.684e+00  3.623e-01  1.438  0.15038    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##              exp(coef) exp(-coef) lower .95 upper .95
## age          1.064e+00  9.398e-01    1.0525    1.0759
## gender2      7.332e-01  1.364e+00    0.6063    0.8866
## copd1        1.067e+00  9.376e-01    0.8610    1.3211
## quintile1    1.501e+06  6.660e-07    0.0000       Inf
## quintile2    1.094e+06  9.141e-07    0.0000       Inf
## quintile3    1.638e+06  6.107e-07    0.0000       Inf
## quintile4    1.579e+06  6.332e-07    0.0000       Inf
## quintile5    1.631e+06  6.131e-07    0.0000       Inf
## ethnicgroup2 1.014e+00  9.861e-01    0.5352    1.9217
## ethnicgroup3 4.617e-01  2.166e+00    0.2040    1.0450
## ethnicgroup9 1.684e+00  5.939e-01    0.8277    3.4254
## 
## Concordance= 0.67  (se = 0.013 )
## Likelihood ratio test= 169.6  on 11 df,   p=<2e-16
## Wald test            = 142.4  on 11 df,   p=<2e-16
## Score (logrank) test = 143.1  on 11 df,   p=<2e-16

quintile CIs infinitely wide

table quintile

table(g$quintile, exclude = NULL)
## 
##    0    1    2    3    4    5 <NA> 
##    4  138  205  211  220  216    6
  • quintile 0 is where postal code was missing, couldn’t find socioeconomic status from national file
  • regardless there’s only 4 in that group,
    • but sometimes 4 is even enough for convergence,
    • problem is deeper

table quintile with death

round(prop.table(table(g$quintile, g$death),1)*100,1)
##    
##         0     1
##   0 100.0   0.0
##   1  43.5  56.5
##   2  54.1  45.9
##   3  49.8  50.2
##   4  52.7  47.3
##   5  50.5  49.5
  • real problem is that quintile=0 is the reference group
  • and none of the 4 patients were dead
  • model had no reference group to compare to

solutions

change reference category

  • R sets lowest category as reference by default
  • in terms of socioeconomic status, usually interest in effect of lower status vs higher status {lower should be reference}
  • so let’s set quintile = 1 as reference

combine categories

  • quintile = 0 are not NAs
  • they don’t have postal code, might be from overseas, homeless, new immigrants
  • but quintile = 0 group so small, combining it with NA and creating a new group shouldn’t impact results

exclude patients

  • if only 4/1000 patients, might be better to just exclude these

dropping variable

  • if a lot of patients had quintile = 0, doesn’t make sense to drop all patients,
  • better to drop that variable

EOF