intro
- multiple predictors have their own issues {collinearity?}
- run descriptive stats on age, gender, prior_dnas, ethnic group, COPD
library(survival)
g <- read.csv("GMPH_1K_final_2.csv", header = TRUE, sep = ",")
g %>% dim()
# g %>% head()
g$gender <- as.factor(g$gender)
g$prior_dnas <- as.factor(g$prior_dnas)
g$ethnicgroup <- as.factor(g$ethnicgroup)
g$copd <- as.factor(g$copd)
## [1] 1000 31
age
summary(g$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.00 73.00 80.00 78.73 87.00 102.00
- nice continuous variable without any NAs
gender
table(g$gender, exclude = NULL)
##
## 1 2
## 548 452
- nice binary variable with a lot of people in each category
copd
table(g$copd, exclude = NULL)
##
## 0 1
## 758 242
- having copd is a lot less than not having copd
- there’s usually under-recording of comorbidities for various reasons (e.g. not diagnosed)
ethnic group
table(g$ethnicgroup, exclude = NULL)
##
## 1 2 3 9 <NA>
## 889 17 34 17 43
levels(g$ethnicgroup) <- c(levels(g$ethnicgroup),"8")
g$ethnicgroup[is.na(g$ethnicgroup)] <- "8"
table(g$ethnicgroup, exclude = NULL)
##
## 1 2 3 9 8
## 889 17 34 17 43
- turn NAs into another category so can include those patients in analysis
dealing with prior_dnas
table(g$prior_dnas, exclude = NULL)
##
## 0 1 2 3 4 5 6 7 8 10
## 732 156 50 34 17 3 3 2 1 2
- option1 = pretend it’s continuous, assume linear relationship with outcome
- preferred for ordinal variables with lots of categories
- but need to check relation is indeed linear with outcome
- option2 = use existing categories
- categorising loses information
- fitting categorical variable with small categories will cause issues (more later)
- option3 = combine some small categories
- if relation is non-linear, and have small categories, then option3 is best
- assessing linearity is made harder by the small categories too
multivariate cox
# try model with prior_dnas as continuous var
g$prior_dnas <- as.integer(g$prior_dnas)
cox_multi <- coxph(Surv(fu_time, death) ~ age + gender + copd + prior_dnas + ethnicgroup, data = g)
summary(cox_multi)
## Call:
## coxph(formula = Surv(fu_time, death) ~ age + gender + copd +
## prior_dnas + ethnicgroup, data = g)
##
## n= 1000, number of events= 492
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.06214 1.06411 0.00552 11.257 < 2e-16 ***
## gender2 -0.25525 0.77473 0.09435 -2.705 0.00682 **
## copd1 0.13606 1.14576 0.10388 1.310 0.19025
## prior_dnas 0.17071 1.18614 0.04091 4.172 3.02e-05 ***
## ethnicgroup2 -0.27657 0.75838 0.34604 -0.799 0.42415
## ethnicgroup3 -0.82654 0.43756 0.41433 -1.995 0.04605 *
## ethnicgroup9 0.40316 1.49655 0.36082 1.117 0.26384
## ethnicgroup8 -0.04394 0.95701 0.22522 -0.195 0.84532
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.0641 0.9397 1.0527 1.0757
## gender2 0.7747 1.2908 0.6439 0.9321
## copd1 1.1458 0.8728 0.9347 1.4045
## prior_dnas 1.1861 0.8431 1.0947 1.2852
## ethnicgroup2 0.7584 1.3186 0.3849 1.4943
## ethnicgroup3 0.4376 2.2854 0.1943 0.9856
## ethnicgroup9 1.4966 0.6682 0.7378 3.0354
## ethnicgroup8 0.9570 1.0449 0.6155 1.4881
##
## Concordance= 0.667 (se = 0.013 )
## Likelihood ratio test= 168.9 on 8 df, p=<2e-16
## Wald test = 142 on 8 df, p=<2e-16
## Score (logrank) test = 140.2 on 8 df, p=<2e-16
- gender1 = “male”
- ethnicity1 = “white”
- changing reference category covered later
interpretation
- age, 1.06411 (6%) increase in risk of mortality for every year you age
- females (gender2) has 0.77473 (22%) lower risk of mortality following hospital admission for heart failure vs males
- prior_dnas is a significant predictor of risk of mortality,
- every 1 more missed visits = 1.18614 (18.6%) increase in risk
- ethnicgroup9 has 1.4966 (49.7%) increase in risk compared to ethnicgroup1,
- but results not significant, {probably because small group}
- each of the predictors is ADJUSTED for the other predictors {so results would change if you omitted some predictors}
trying survfit() different prior_dnas groupings
- looking at width of CI for each category will tell you if you have enough patients in each to get significant results
library(car) # need recode from car package, not from dplyr
table(g$prior_dnas, exclude = NULL)
##
## 1 2 3 4 5 6 7 8 9 10
## 732 156 50 34 17 3 3 2 1 2
g$prior_dnas <- factor(g$prior_dnas)
g$prior_dnas %>% levels()
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
g <- g %>% mutate(prior_dnas4groups = recode(prior_dnas,"c('4','5','6','7','8','9','10')='4'"))
table(g$prior_dnas4groups, exclude = NULL)
##
## 1 2 3 4
## 732 156 50 62
cox_prior_dnas4groups <- coxph(Surv(fu_time, death) ~ age + gender + copd + prior_dnas4groups + ethnicgroup, data = g)
summary(cox_prior_dnas4groups)
## Call:
## coxph(formula = Surv(fu_time, death) ~ age + gender + copd +
## prior_dnas4groups + ethnicgroup, data = g)
##
## n= 1000, number of events= 492
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.061951 1.063910 0.005493 11.279 < 2e-16 ***
## gender2 -0.267539 0.765260 0.094528 -2.830 0.00465 **
## copd1 0.130582 1.139491 0.104253 1.253 0.21037
## prior_dnas4groups2 0.052656 1.054067 0.127494 0.413 0.67960
## prior_dnas4groups3 0.353136 1.423525 0.198028 1.783 0.07454 .
## prior_dnas4groups4 0.786780 2.196314 0.193433 4.067 4.75e-05 ***
## ethnicgroup2 -0.125654 0.881920 0.328006 -0.383 0.70166
## ethnicgroup3 -0.841692 0.430981 0.415264 -2.027 0.04267 *
## ethnicgroup9 0.364488 1.439777 0.363224 1.003 0.31563
## ethnicgroup8 -0.066686 0.935489 0.225732 -0.295 0.76767
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.0639 0.9399 1.0525 1.0754
## gender2 0.7653 1.3067 0.6358 0.9210
## copd1 1.1395 0.8776 0.9289 1.3978
## prior_dnas4groups2 1.0541 0.9487 0.8210 1.3533
## prior_dnas4groups3 1.4235 0.7025 0.9656 2.0986
## prior_dnas4groups4 2.1963 0.4553 1.5033 3.2088
## ethnicgroup2 0.8819 1.1339 0.4637 1.6774
## ethnicgroup3 0.4310 2.3203 0.1910 0.9726
## ethnicgroup9 1.4398 0.6946 0.7065 2.9341
## ethnicgroup8 0.9355 1.0690 0.6010 1.4561
##
## Concordance= 0.669 (se = 0.013 )
## Likelihood ratio test= 169.7 on 10 df, p=<2e-16
## Wald test = 144.7 on 10 df, p=<2e-16
## Score (logrank) test = 142.3 on 10 df, p=<2e-16
- looks like prior_dnas4groups4 has 2.196314 risk compared to reference group
EOF