This note describes a few elementary aspects of practical analysis of
survival data in R. For further information we refer to the book Modeling Survival Data by
Terry M. Therneau and Patricia M. Grambsch. Terry is the author of the survival
package for R,
which we use. A classical and monomental theoretical reference is Statistical Models Based on Counting Processes
by Andersen, Borgan, Gill and Keiding.
The analyses that we will do can all be done using the functions in the
R package survival
. That package also contains some example data sets.
library(survival)
## Loading required package: splines
data(pbc)
pbc[1:10,
c("time", "status", "age",
"edema", "albumin", "bili",
"protime")
]
## time status age edema albumin bili protime
## 1 400 2 58.77 1.0 2.60 14.5 12.2
## 2 4500 0 56.45 0.0 4.14 1.1 10.6
## 3 1012 2 70.07 0.5 3.48 1.4 12.0
## 4 1925 2 54.74 0.5 2.54 1.8 10.3
## 5 1504 1 38.11 0.0 3.53 3.4 10.9
## 6 2503 2 66.26 0.0 3.98 0.8 11.0
## 7 1832 0 55.53 0.0 4.09 1.0 9.7
## 8 2466 2 53.06 0.0 4.00 0.3 11.0
## 9 2400 2 42.51 0.0 3.08 3.2 11.0
## 10 51 2 70.56 1.0 2.74 12.6 11.5
The data set pbc
(primary biliary cirrhosis) contains the main variable time
, which is the
observed, potentially censored survival time and the status
, which indicates if
the observation has been censored. In addition there are 17 observed predictors, where we
show above five that in previous analyses have been found to be important.
We can produce a plot of the survival times with :
nr <- 50
plot(c(0, pbc$time[1]), c(1, 1), type = "l", ylim = c(0, nr + 1),
xlim = c(0, max(pbc$time[1:nr]) + 10), ylab = "nr",
xlab = "Survival time")
for(i in 2:nr)
lines(c(0,pbc$time[i]), c(i, i))
for(i in 1:nr) {
if(pbc$status[i] == 0)
points(pbc$time[i], i, col = "red", pch = 20) #censored
if(pbc$status[i] == 1)
points(pbc$time[i], i)
}
In the complete dataset there are 418 patients, and the status variable shows that 161 have died, 232 have been censored and then 25 have got a transplant. We exclude the 25 transplant observations from the further analysis. Furthermore, we change the status variable to a logical for subsequent use.
pbc <- subset(pbc, status != 1)
pbc <- transform(pbc, status = as.logical(status))
Estimation of the survival function using the Kaplan-Meier estimator can be done using
the survfit
function.
pbcSurv <- survfit(Surv(time, status) ~ 1, data = pbc)
pbcSurv
## Call: survfit(formula = Surv(time, status) ~ 1, data = pbc)
##
## records n.max n.start events median 0.95LCL 0.95UCL
## 393 393 393 161 3358 2847 3839
plot(pbcSurv, mark.time = FALSE)
The result from survfit is an R object of type
survfit
. Calling summary
with a survfit
object as argument
provides the nonparametric estimate of the survival function
including additional information as a long list. The plot with the added, pointwise confidence bands
is perhaps more informative. Setting mark.time = FALSE
prevents all the censored observations
to be marked on the plot by a “+”.
The function Surv
applied to the time
and status
variables
for the PBC data is a function that creates a survival
object. Functions like
survfit
, survreg
to be considered below and coxph
to be considered
later in the course use formulas for model specification – just like lm
and glm
do. However, the left hand side of the formula (the response)
must be a a survival
object as created by Surv
. This is the way the
censoring is specified.
A major point in a data set like the PBC data above is that we have
predictor information on the individuals. It is no problem to split
the estimation of the survival function according to a factor such as
the edema
variable in the PBC data.
pbcSurv <- survfit(Surv(time, status) ~ edema, data=pbc)
plot(pbcSurv, mark.time = FALSE, conf.int = TRUE,
col = c("red", "blue", "purple"))
From the plot it seems clear that individuals without edema have a larger survival time than those with edema, and those treated successfully or left untreated have a larger survival time than those treated unsuccessfully. This can also be tested formally with the log-rank test.
survdiff(Surv(time, status) ~ edema, data = pbc)
## Call:
## survdiff(formula = Surv(time, status) ~ edema, data = pbc)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## edema=0 332 116 145.38 5.94 61.6
## edema=0.5 41 26 13.00 12.99 14.2
## edema=1 20 19 2.61 102.79 105.4
##
## Chisq= 123 on 2 degrees of freedom, p= 0
But what about including the other predictors also – especially the continuous ones? This can be handled in the framework of proportional hazards model. We consider in this note the parametric models using a Weibull distribution as baseline model. We first show how the model can be fitted using the glm-framework (for a fixed shape parameter).
sigma <- 0.5 ## shape parameter gamma = 2 in the Weibull distribution
pbcSurvGlm <- glm(status ~ offset(log(time) / sigma) + as.factor(edema) + age, family = poisson, data = pbc)
summary(pbcSurvGlm)
##
## Call:
## glm(formula = status ~ offset(log(time)/sigma) + as.factor(edema) +
## age, family = poisson, data = pbc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.248 -0.895 -0.451 1.265 4.093
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -18.81513 0.44456 -42.32 < 2e-16 ***
## as.factor(edema)0.5 0.71475 0.22622 3.16 0.0016 **
## as.factor(edema)1 3.02079 0.25380 11.90 < 2e-16 ***
## age 0.04372 0.00813 5.38 7.6e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 858.50 on 392 degrees of freedom
## Residual deviance: 750.37 on 389 degrees of freedom
## AIC: 1080
##
## Number of Fisher Scoring iterations: 7
The standard errors and \(Z\)-scores are quite valid based on profile likelihood arguments
for a known shape parameter. However, it is inconvenient that we cannot estimate the shape
parameter easily, let alone accomodate for the added uncertainty due to estimation of this parameter.
Fortunately, the survreg
function provides an alternative way to fit these models.
The survreg
function fits AFT models for a number of different response distributions including
the log-logistic, the log-normal and the Weibull distribution. For the Weibull distribution the
proportional hazards model is also an AFT model, but with a different parametrization, which leads
notoriously to some confusions.
pbcSurv <- survreg(Surv(time, status) ~ as.factor(edema) + age, data = pbc, scale = sigma)
summary(pbcSurv)
##
## Call:
## survreg(formula = Surv(time, status) ~ as.factor(edema) + age,
## data = pbc, scale = sigma)
## Value Std. Error z p
## (Intercept) 9.4076 0.22228 42.32 0.00e+00
## as.factor(edema)0.5 -0.3574 0.11312 -3.16 1.58e-03
## as.factor(edema)1 -1.5104 0.12690 -11.90 1.15e-32
## age -0.0219 0.00407 -5.38 7.57e-08
##
## Scale fixed at 0.5
##
## Weibull distribution
## Loglik(model)= -1524 Loglik(intercept only)= -1578
## Chisq= 108.1 on 3 degrees of freedom, p= 0
## Number of Newton-Raphson Iterations: 6
## n= 393
The model fitted is the same, but the parameters are different. They are comparable as follows:
- coef(pbcSurv) / sigma
## (Intercept) as.factor(edema)0.5 as.factor(edema)1
## -18.81513 0.71475 3.02079
## age
## 0.04372
coef(pbcSurvGlm)
## (Intercept) as.factor(edema)0.5 as.factor(edema)1
## -18.81513 0.71475 3.02079
## age
## 0.04372
If you inspect the \(Z\)-scores in the two models, they are equal up to the sign. What is particularly confusing is the change of sign in the two parametrizations. In the hazards model, a positive value of \(\beta_j\) means that the hazard increases with increasing values of \(x_j\). The larger \(x_j\) is, the sooner the individual will die – at least from a population point of view. However, in the AFT model, a positive value of \(\beta_j\) means that the survival time is scaled by the factor \(e^{\beta_j x_j} > 1\), and thus the larger \(x_j\) is, the later will the individual die.
One thing we get for free using the survreg
function is the ability to estimate the
shape parameter (or the scale parameter \(\sigma\) in the AFT jargon).
pbcSurv <- survreg(Surv(time, status) ~ as.factor(edema) + age, data = pbc)
summary(pbcSurv)
##
## Call:
## survreg(formula = Surv(time, status) ~ as.factor(edema) + age,
## data = pbc)
## Value Std. Error z p
## (Intercept) 9.9557 0.36880 26.99 1.72e-160
## as.factor(edema)0.5 -0.6047 0.19351 -3.12 1.78e-03
## as.factor(edema)1 -2.0392 0.22214 -9.18 4.33e-20
## age -0.0274 0.00664 -4.12 3.77e-05
## Log(scale) -0.1692 0.06631 -2.55 1.07e-02
##
## Scale= 0.844
##
## Weibull distribution
## Loglik(model)= -1484 Loglik(intercept only)= -1523
## Chisq= 79.1 on 3 degrees of freedom, p= 0
## Number of Newton-Raphson Iterations: 5
## n= 393
We will, of course, prefer to fit models this way to get appropriate estimated standard errors in practice, but the relation to the Poisson regression model is not completely superfluous. It allows us, for instance, to draw on the uniqueness and existence results for the MLE (admitted, the results have to be modified slightly to accomodate for the offset, but this is not a problem).