Some survival analysis in R

This note describes a few elementary aspects of practical analysis of survival data in R. For further information we refer to the book Modeling Survival Data by Terry M. Therneau and Patricia M. Grambsch. Terry is the author of the survival package for R, which we use. A classical and monomental theoretical reference is Statistical Models Based on Counting Processes by Andersen, Borgan, Gill and Keiding.

The analyses that we will do can all be done using the functions in the R package survival. That package also contains some example data sets.

library(survival)
## Loading required package: splines
data(pbc)
pbc[1:10,
    c("time", "status", "age",
      "edema", "albumin", "bili", 
      "protime")
    ]
##    time status   age edema albumin bili protime
## 1   400      2 58.77   1.0    2.60 14.5    12.2
## 2  4500      0 56.45   0.0    4.14  1.1    10.6
## 3  1012      2 70.07   0.5    3.48  1.4    12.0
## 4  1925      2 54.74   0.5    2.54  1.8    10.3
## 5  1504      1 38.11   0.0    3.53  3.4    10.9
## 6  2503      2 66.26   0.0    3.98  0.8    11.0
## 7  1832      0 55.53   0.0    4.09  1.0     9.7
## 8  2466      2 53.06   0.0    4.00  0.3    11.0
## 9  2400      2 42.51   0.0    3.08  3.2    11.0
## 10   51      2 70.56   1.0    2.74 12.6    11.5

The data set pbc (primary biliary cirrhosis) contains the main variable time, which is the observed, potentially censored survival time and the status, which indicates if the observation has been censored. In addition there are 17 observed predictors, where we show above five that in previous analyses have been found to be important.

We can produce a plot of the survival times with :

nr <- 50
plot(c(0, pbc$time[1]), c(1, 1), type = "l", ylim = c(0, nr + 1),
     xlim = c(0, max(pbc$time[1:nr]) + 10), ylab = "nr",
     xlab = "Survival time")
for(i in 2:nr) 
  lines(c(0,pbc$time[i]), c(i, i))
for(i in 1:nr) {
  if(pbc$status[i] == 0) 
    points(pbc$time[i], i, col = "red", pch = 20) #censored
  if(pbc$status[i] == 1) 
    points(pbc$time[i], i)
}

plot of chunk unnamed-chunk-2

In the complete dataset there are 418 patients, and the status variable shows that 161 have died, 232 have been censored and then 25 have got a transplant. We exclude the 25 transplant observations from the further analysis. Furthermore, we change the status variable to a logical for subsequent use.

pbc <- subset(pbc, status != 1)
pbc <- transform(pbc, status = as.logical(status))

Estimation of the survival function using the Kaplan-Meier estimator can be done using the survfit function.

pbcSurv <- survfit(Surv(time, status) ~ 1, data = pbc)
pbcSurv
## Call: survfit(formula = Surv(time, status) ~ 1, data = pbc)
## 
## records   n.max n.start  events  median 0.95LCL 0.95UCL 
##     393     393     393     161    3358    2847    3839
plot(pbcSurv, mark.time = FALSE)

plot of chunk unnamed-chunk-4

The result from survfit is an R object of type survfit. Calling summary with a survfit object as argument provides the nonparametric estimate of the survival function including additional information as a long list. The plot with the added, pointwise confidence bands
is perhaps more informative. Setting mark.time = FALSE prevents all the censored observations to be marked on the plot by a “+”.

The function Surv applied to the time and status variables for the PBC data is a function that creates a survival object. Functions like survfit, survreg to be considered below and coxph to be considered later in the course use formulas for model specification – just like lm and glm do. However, the left hand side of the formula (the response) must be a a survival object as created by Surv. This is the way the censoring is specified.

Including predictors

A major point in a data set like the PBC data above is that we have predictor information on the individuals. It is no problem to split the estimation of the survival function according to a factor such as the edema variable in the PBC data.

pbcSurv <- survfit(Surv(time, status) ~ edema, data=pbc)
plot(pbcSurv, mark.time = FALSE, conf.int = TRUE, 
     col = c("red", "blue", "purple"))

plot of chunk unnamed-chunk-5

From the plot it seems clear that individuals without edema have a larger survival time than those with edema, and those treated successfully or left untreated have a larger survival time than those treated unsuccessfully. This can also be tested formally with the log-rank test.

survdiff(Surv(time, status) ~ edema, data = pbc)
## Call:
## survdiff(formula = Surv(time, status) ~ edema, data = pbc)
## 
##             N Observed Expected (O-E)^2/E (O-E)^2/V
## edema=0   332      116   145.38      5.94      61.6
## edema=0.5  41       26    13.00     12.99      14.2
## edema=1    20       19     2.61    102.79     105.4
## 
##  Chisq= 123  on 2 degrees of freedom, p= 0

Parametric survival regression modeling

But what about including the other predictors also – especially the continuous ones? This can be handled in the framework of proportional hazards model. We consider in this note the parametric models using a Weibull distribution as baseline model. We first show how the model can be fitted using the glm-framework (for a fixed shape parameter).

sigma <- 0.5  ## shape parameter gamma = 2 in the Weibull distribution
pbcSurvGlm <- glm(status ~ offset(log(time) / sigma) + as.factor(edema) + age, family = poisson, data = pbc)
summary(pbcSurvGlm)
## 
## Call:
## glm(formula = status ~ offset(log(time)/sigma) + as.factor(edema) + 
##     age, family = poisson, data = pbc)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -3.248  -0.895  -0.451   1.265   4.093  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -18.81513    0.44456  -42.32  < 2e-16 ***
## as.factor(edema)0.5   0.71475    0.22622    3.16   0.0016 ** 
## as.factor(edema)1     3.02079    0.25380   11.90  < 2e-16 ***
## age                   0.04372    0.00813    5.38  7.6e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 858.50  on 392  degrees of freedom
## Residual deviance: 750.37  on 389  degrees of freedom
## AIC: 1080
## 
## Number of Fisher Scoring iterations: 7

The standard errors and \(Z\)-scores are quite valid based on profile likelihood arguments for a known shape parameter. However, it is inconvenient that we cannot estimate the shape parameter easily, let alone accomodate for the added uncertainty due to estimation of this parameter. Fortunately, the survreg function provides an alternative way to fit these models.

The survreg function fits AFT models for a number of different response distributions including the log-logistic, the log-normal and the Weibull distribution. For the Weibull distribution the proportional hazards model is also an AFT model, but with a different parametrization, which leads notoriously to some confusions.

pbcSurv <- survreg(Surv(time, status) ~ as.factor(edema) + age, data = pbc, scale = sigma)
summary(pbcSurv)
## 
## Call:
## survreg(formula = Surv(time, status) ~ as.factor(edema) + age, 
##     data = pbc, scale = sigma)
##                       Value Std. Error      z        p
## (Intercept)          9.4076    0.22228  42.32 0.00e+00
## as.factor(edema)0.5 -0.3574    0.11312  -3.16 1.58e-03
## as.factor(edema)1   -1.5104    0.12690 -11.90 1.15e-32
## age                 -0.0219    0.00407  -5.38 7.57e-08
## 
## Scale fixed at 0.5 
## 
## Weibull distribution
## Loglik(model)= -1524   Loglik(intercept only)= -1578
##  Chisq= 108.1 on 3 degrees of freedom, p= 0 
## Number of Newton-Raphson Iterations: 6 
## n= 393

The model fitted is the same, but the parameters are different. They are comparable as follows:

- coef(pbcSurv) / sigma
##         (Intercept) as.factor(edema)0.5   as.factor(edema)1 
##           -18.81513             0.71475             3.02079 
##                 age 
##             0.04372
coef(pbcSurvGlm)
##         (Intercept) as.factor(edema)0.5   as.factor(edema)1 
##           -18.81513             0.71475             3.02079 
##                 age 
##             0.04372

If you inspect the \(Z\)-scores in the two models, they are equal up to the sign. What is particularly confusing is the change of sign in the two parametrizations. In the hazards model, a positive value of \(\beta_j\) means that the hazard increases with increasing values of \(x_j\). The larger \(x_j\) is, the sooner the individual will die – at least from a population point of view. However, in the AFT model, a positive value of \(\beta_j\) means that the survival time is scaled by the factor \(e^{\beta_j x_j} > 1\), and thus the larger \(x_j\) is, the later will the individual die.

One thing we get for free using the survreg function is the ability to estimate the shape parameter (or the scale parameter \(\sigma\) in the AFT jargon).

pbcSurv <- survreg(Surv(time, status) ~ as.factor(edema) + age, data = pbc)
summary(pbcSurv)
## 
## Call:
## survreg(formula = Surv(time, status) ~ as.factor(edema) + age, 
##     data = pbc)
##                       Value Std. Error     z         p
## (Intercept)          9.9557    0.36880 26.99 1.72e-160
## as.factor(edema)0.5 -0.6047    0.19351 -3.12  1.78e-03
## as.factor(edema)1   -2.0392    0.22214 -9.18  4.33e-20
## age                 -0.0274    0.00664 -4.12  3.77e-05
## Log(scale)          -0.1692    0.06631 -2.55  1.07e-02
## 
## Scale= 0.844 
## 
## Weibull distribution
## Loglik(model)= -1484   Loglik(intercept only)= -1523
##  Chisq= 79.1 on 3 degrees of freedom, p= 0 
## Number of Newton-Raphson Iterations: 5 
## n= 393

We will, of course, prefer to fit models this way to get appropriate estimated standard errors in practice, but the relation to the Poisson regression model is not completely superfluous. It allows us, for instance, to draw on the uniqueness and existence results for the MLE (admitted, the results have to be modified slightly to accomodate for the offset, but this is not a problem).