library(survival)
library(dplyr)   # for the %>% pipe
g <- read.csv("GMPH_1K_final_2.csv", header = TRUE, sep = ",")
g %>% dim()
g$fu_time %>% typeof()
g$death %>% typeof()
## [1] 1000 31
## [1] "integer"
## [1] "integer"
run a simple Cox model with age as the only predictor
cox <- coxph(Surv(fu_time, death) ~ age, data = g)
summary(cox)
## Call:
## coxph(formula = Surv(fu_time, death) ~ age, data = g)
##
## n= 1000, number of events= 492
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.056005 1.057602 0.005193 10.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.058 0.9455 1.047 1.068
##
## Concordance= 0.651 (se = 0.013 )
## Likelihood ratio test= 138 on 1 df, p=<2e-16
## Wald test = 116.3 on 1 df, p=<2e-16
## Score (logrank) test = 115.7 on 1 df, p=<2e-16
- coef = log hazard ratio
- exp(coef) = hazard ratio (HR)
- an HR of 1.06 means each 1-year increase in age is associated with a 6% increase in the hazard (risk) of death
- older people have a higher risk of death than younger people (as expected)
- age is fitted as a continuous variable with a linear effect on the log hazard, so the estimated 6% increase per year is assumed constant across all ages
- se is small (only 1 predictor and lots of patients)
- p-value is tiny (age is a highly statistically significant predictor of the risk of death)
- concordance
  - similar to the c-statistic in logistic regression
  - the fraction of comparable pairs of patients in which the patient who died earlier was the one the model assigned the higher risk (i.e. the model ordered the pair correctly)
  - values above 0.5 are better than chance
- R^2 (not shown here; recent versions of the survival package no longer print it)
  - the proportion of variation in the outcome explained by the model
  - with only 1 predictor (age), neither concordance nor R^2 is very high
- likelihood ratio test / Wald test / score (log-rank) test
  - three tests of the null hypothesis that all coefficients are zero
  - a small p-value means the null is rejected
  - i.e. the model with age in it is significantly better than the null model (see the code sketch below)
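A minimal sketch checking these quantities from the fitted `cox` object above (`concordance()` and `anova()` come from the survival package; exact printed values depend on your survival version):

# hazard ratio per 1-year and per 10-year increase in age
exp(coef(cox)["age"])        # ~1.06: 6% higher hazard per extra year
exp(10 * coef(cox)["age"])   # the constant per-year HR compounds over 10 years
exp(confint(cox))            # 95% CI for the hazard ratio
concordance(cox)             # same concordance statistic as printed by summary(cox)
anova(cox)                   # likelihood ratio test against the null model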
practice: ethnic groups
library(survminer)
g$ethnicgroup <- factor(g$ethnicgroup)
cox_ethnicity <- coxph(Surv(fu_time, death) ~ ethnicgroup, data = g)
summary(cox_ethnicity)
## Call:
## coxph(formula = Surv(fu_time, death) ~ ethnicgroup, data = g)
##
## n= 957, number of events= 471
## (43 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## ethnicgroup2 -0.06428 0.93774 0.32000 -0.201 0.84078
## ethnicgroup3 -1.19586 0.30244 0.41108 -2.909 0.00362 **
## ethnicgroup9 0.07394 1.07674 0.35706 0.207 0.83596
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## ethnicgroup2 0.9377 1.0664 0.5008 1.7558
## ethnicgroup3 0.3024 3.3064 0.1351 0.6769
## ethnicgroup9 1.0767 0.9287 0.5348 2.1679
##
## Concordance= 0.516 (se = 0.006 )
## Likelihood ratio test= 12.99 on 3 df, p=0.005
## Wald test = 8.55 on 3 df, p=0.04
## Score (logrank) test = 9.61 on 3 df, p=0.02
- large standard errors and wide confidence intervals for the ethnic group coefficients
- to interpret the hazard ratios you need to know the reference category, which here is ethnicgroup 1 (white); see the relevel sketch after this list
- group 2 (black): hazard not significantly different from the reference
- group 3 (Indian): hazard significantly lower than the reference (HR 0.30)
- group 9: not significantly different
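A minimal sketch of changing the reference category; `g2` is a hypothetical copy so the later code, which keeps group 1 as the reference, is unaffected:

g2 <- g                                                  # hypothetical copy; g is left untouched
g2$ethnicgroup <- relevel(g2$ethnicgroup, ref = "3")     # make group 3 the reference level
coxph(Surv(fu_time, death) ~ ethnicgroup, data = g2)     # HRs are now relative to group 3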
43 patients had a missing ethnicity
- make an "unknown" ethnicity group (coded 8) and rerun
g$ethnicgroup %>% levels()
## [1] "1" "2" "3" "9"
levels(g$ethnicgroup) <- c(levels(g$ethnicgroup), "8")   # add a new level "8" for unknown ethnicity
g$ethnicgroup[is.na(g$ethnicgroup)] <- "8"               # recode the missing values to "8"
cox_ethnicity_noNAs <- coxph(Surv(fu_time, death) ~ ethnicgroup, data = g)
summary(cox_ethnicity_noNAs)
## Call:
## coxph(formula = Surv(fu_time, death) ~ ethnicgroup, data = g)
##
## n= 1000, number of events= 492
##
## coef exp(coef) se(coef) z Pr(>|z|)
## ethnicgroup2 -0.06573 0.93638 0.31999 -0.205 0.83725
## ethnicgroup3 -1.19368 0.30310 0.41107 -2.904 0.00369 **
## ethnicgroup9 0.08160 1.08502 0.35706 0.229 0.81923
## ethnicgroup8 -0.02353 0.97675 0.22363 -0.105 0.91621
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## ethnicgroup2 0.9364 1.0679 0.5001 1.7532
## ethnicgroup3 0.3031 3.2992 0.1354 0.6784
## ethnicgroup9 1.0850 0.9216 0.5389 2.1846
## ethnicgroup8 0.9767 1.0238 0.6301 1.5140
##
## Concordance= 0.518 (se = 0.008 )
## Likelihood ratio test= 12.95 on 4 df, p=0.01
## Wald test = 8.53 on 4 df, p=0.07
## Score (logrank) test = 9.58 on 4 df, p=0.05
- including the missing ethnicities as their own group did not change the results by much
- but with multiple predictors, missing values accumulate across variables, and dropping them can noticeably affect the results
- better to recode the NAs (as above) or impute them, and compare results before and after; a minimal comparison sketch follows
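A minimal sketch of that before/after comparison using the model fitted above; `cc` and `cox_cc` are hypothetical names for a complete-case refit:

cc <- subset(g, ethnicgroup != "8")            # hypothetical complete-case subset (drops the 43 recoded patients)
cc$ethnicgroup <- droplevels(cc$ethnicgroup)   # drop the now-empty "8" level before refitting
cox_cc <- coxph(Surv(fu_time, death) ~ ethnicgroup, data = cc)
# side-by-side log hazard ratios for the coefficients the two fits share
cbind(complete_case = coef(cox_cc),
      with_unknown  = coef(cox_ethnicity_noNAs)[names(coef(cox_cc))])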