what is heart failure
- causes include: coronary heart disease, high blood pressure, arrythmia, alcoholism
- symptoms: shortness of breath, tiredness, swollen ankles
- high mortality and morbidity
- high rate of hospital admission
- readmission high, within 1-4 days of discharge
- hospital admin database gives us data to work with
library(survival)
run survfit() and KMplot
# read in data
g <- read.csv("GMPH_1K_final_2.csv", header = TRUE, sep = ",")
g %>% dim()
# set variable types
g[,"fu_time"] %>% typeof()
fu_time <- g[,"fu_time"]; fu_time %>% head(20)
g[,"death"] %>% typeof()
death <- g[,"death"]; death %>% head(20)
## [1] 1000 31
## [1] "integer"
## [1] 416 648 466 441 371 47 656 12 530 551 1059 90 538 289 334
## [16] 5 79 363 195 27
## [1] "integer"
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0
# survfit
km_fit <- survfit(Surv(fu_time, death) ~ 1)
# ~ 1 plots KM estimated probability of survival over time for ALL patients, no comparisons made
plot(km_fit)
interpret the function
- survfit()
- fits simple survival model,
- no predictor given, model has only the intercept (i.e. ~ 1)
- Surv()
- 2 arguments: follow-up time of each patient + if they died
- death = 1, people who died by the end of their follow-up period
- death = 0, people still alive by end of their follow-up period, censored
- survfit() produces KM estimates of “the probabilty of survival over time”,
- used by plot to produce KM curve
table of survival estimates
summary(km_fit, times = c(1:7, 30, 60, 90*(1:10)))
## Call: survfit(formula = Surv(fu_time, death) ~ 1)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 1 992 12 0.988 0.00346 0.981 0.995
## 2 973 7 0.981 0.00435 0.972 0.989
## 3 963 5 0.976 0.00489 0.966 0.985
## 4 954 6 0.970 0.00546 0.959 0.980
## 5 945 5 0.964 0.00590 0.953 0.976
## 6 938 1 0.963 0.00598 0.952 0.975
## 7 933 1 0.962 0.00606 0.951 0.974
## 30 865 39 0.921 0.00865 0.905 0.939
## 60 809 28 0.891 0.01010 0.871 0.911
## 90 770 24 0.864 0.01117 0.843 0.887
## 180 698 43 0.815 0.01282 0.790 0.841
## 270 653 24 0.787 0.01363 0.760 0.814
## 360 619 21 0.761 0.01428 0.733 0.789
## 450 525 44 0.705 0.01554 0.675 0.736
## 540 429 47 0.639 0.01681 0.607 0.673
## 630 362 32 0.589 0.01765 0.556 0.625
## 720 266 43 0.514 0.01876 0.479 0.552
## 810 190 31 0.448 0.01979 0.411 0.488
## 900 126 26 0.378 0.02098 0.339 0.421
- t=1, survival probability is high (99%)
- t=900, survival probability is low (38%)
- { longer you stay in hospital the lower your survival prob?
or is this just natural decrease in surv prob over time??? }
- see week1_4 for calculating survfit() output by hand
run survfit() with groups
g[,"gender"] %>% typeof()
## [1] "integer"
gender <- as.factor(g[,"gender"])
km_gender_fit <- survfit(Surv(fu_time, death) ~ gender)
plot(km_gender_fit)
compare survival using logRankTest
survdiff(Surv(fu_time, death) ~ gender, rho = 0)
## Call:
## survdiff(formula = Surv(fu_time, death) ~ gender, rho = 0)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## gender=1 548 268 271 0.0365 0.082
## gender=2 452 224 221 0.0448 0.082
##
## Chisq= 0.1 on 1 degrees of freedom, p= 0.8
- rho = 0 is default
- yields log-rank (aka Mantel-Haenszel) test
- log-rank test does NOT have proportional hazard assumption
- { is log-rank test non-parametric? }
- Observed = weighted observed number of events in each group
- Expected = weighted expected number of events in each group
# explore log-rank test
# ?survdiff
# tests if there's a difference btw 2 or more survival curves,
# using rho family of tests
g %>% filter(gender == 1) %>% summarise(., mean(fu_time)) # 460.8942
## mean(fu_time)
## 1 460.8942
g %>% filter(gender == 2) %>% summarise(., mean(fu_time)) # 472.0841
## mean(fu_time)
## 1 472.0841
log-rank test details
- https://en.wikipedia.org/wiki/Logrank_test
- compares survival distributions of 2 samples
- nonparametric
- good when data is right skewed and censored
- logrank test statistic
- compares estimates of hazard function of 2 groups at EACH time point
- constructed by computing O and E number of events in one group at each time point,
- then sum to obtain overall summary across all-time points where there was an event
- more details needed…
- null hypothesis
- 2 groups have identical hazard functions (survival curves)
- H0: h_1(t) = h_2(t)
- assumptions
- censoring is unrelated to hazard
- survival probability same regardless of when subject was recruited (early vs later in study)
exercise: compare survival of age groups
age_65plus <- ifelse(g[,"age"] >= 65, 1, 0)
table(age_65plus)
## age_65plus
## 0 1
## 115 885
# table(g[, "age"], age_65plus)
km_age65plus_fit <- survfit(Surv(fu_time, death) ~ age_65plus)
plot(km_age65plus_fit)
survdiff(Surv(fu_time, death) ~ age_65plus, rho=0)
## Call:
## survdiff(formula = Surv(fu_time, death) ~ age_65plus, rho = 0)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## age_65plus=0 115 18 67 35.85 41.7
## age_65plus=1 885 474 425 5.65 41.7
##
## Chisq= 41.7 on 1 degrees of freedom, p= 1e-10
EOF