Estimation of the expected prediction error

This is an extended version of the R code presented in the lecture on Friday, May 23, 2014. It contains an implementation and an investigation of the three different simulation-based methods for estimating the expected prediction error.

General functions

We first implement a number of generic functions for computing training error and estimates of expected prediction error by different methods.

trainErr <- function(form = y ~ x, data) {
  ## Training error: mean squared error of the fitted model on the data used for fitting
  model <- lm(form, data = data)
  mean((data$y - predict(model))^2)
}

Function for computing an estimate of expected prediction error based on subsampling.

testSub <- function(form, data, B = 200, q = 0.75) {
  n <- nrow(data)
  PEsub <- vector("list", B)
  for(b in 1:B) {
    ## Subsampling a data set of size q * n
    i <- sample(n, floor(q * n))
    modelsub <- lm(form, data = data[i, ])
    muhat <- predict(modelsub, newdata = data[-i, ])
    PEsub[[b]] <- (data$y[-i] - muhat)^2
  }
  mean(unlist(PEsub))
}

Function for computing an estimate of expected prediction error based on cross-validation.

testCV <- function(form, data, B = 50, k = 4) {
  n <- nrow(data)
  PEcv <- vector("list", B)
  tmp <- numeric(n)
  for(b in 1:B) {
    ## Generating the random subdivision into groups
    group <- sample(rep(1:k, length.out = n))
    for(i in 1:k) {
      modelsub <- lm(form, data = data[group != i, ])
      muhat <- predict(modelsub, newdata = data[group == i, ])
      tmp[group == i] <- (data$y[group == i] - muhat)^2
    }
    PEcv[[b]] <- tmp
  }
  mean(unlist(PEcv))
}
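
As a side note, not part of the original code, leave-one-out cross-validation is the special case \(k = n\). Since every group then consists of a single observation, the subdivision is effectively non-random and a single round suffices, e.g. for a data frame data:

testCV(form, data, B = 1, k = nrow(data))  ## leave-one-out CV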

Function for computing an estimate of expected prediction error based on bootstrapping.

testBoot <- function(form, data, B = 200) {
  n <- nrow(data)
  PEboot <- vector("list", B)
  for(b in 1:B) {
    ## Bootstrap sample of size n drawn with replacement; the observations
    ## not drawn are used as test data
    i <- sample(n, n, replace = TRUE)
    modelsub <- lm(form, data = data[i, ])
    muhat <- predict(modelsub, newdata = data[-i, ])
    PEboot[[b]] <- (data$y[-i] - muhat)^2
  }
  mean(unlist(PEboot))
}
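
The vector data$y[-i] contains the responses for the observations that did not enter the bootstrap sample, the so-called out-of-bag observations. As a side remark, not from the original text, the probability that a given observation is out-of-bag is \((1 - 1/n)^n \approx e^{-1}\), so each bootstrap fit is effectively based on only about 63% of the distinct observations, which helps explain why the bootstrap estimates below tend to be somewhat larger than the subsampling and cross-validation estimates.

(1 - 1/20)^20  ## out-of-bag probability for n = 20, approximately 0.358
exp(-1)        ## large-n limit, approximately 0.368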

First test example

First we apply the different functions to a single small simulated data set for a simple model where we just do linear regression. We simulate the data set and compute the training error and the different estimates of the expected prediction error.

n <- 20
x <- rnorm(n)
yx <- data.frame(y = rnorm(n, x), x)
form <- y ~ x
trainErr(form, yx)
## [1] 1.041
testSub(form, yx)
## [1] 1.317
testCV(form, yx)
## [1] 1.368
testBoot(form, yx)
## [1] 1.474

We know that in this case the expected training error is \(1 - 2/n\) (which is 0.9 when \(n = 20\)), and the expected prediction error is \(1 + 2/n\) (which is 1.1 when \(n = 20\)). The subsampling and 4-fold cross-validation estimates have mean \(1 + 2/(qn)\), with \(q = 1 - 1/k\) for \(k\)-fold CV, so for \(q = 0.75\) and \(n = 20\) they have mean \(1.133\). It is a little less clear what the theoretical mean of the bootstrap-based estimates is. Note that the theoretical considerations are conditional on the predictors; this is reflected in the simulations below, where the predictors are simulated once and kept fixed across the simulations of the responses.
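
As a small sanity check of these theoretical values for \(n = 20\) and \(q = 0.75\):

1 - 2 / 20           ## expected training error: 0.9
1 + 2 / 20           ## expected prediction error: 1.1
1 + 2 / (0.75 * 20)  ## mean of the subsampling and 4-fold CV estimates: 1.133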

Simulation study

require(reshape2)
require(ggplot2)

In the simulations above we used a large number of replications (large \(B\)). In real examples this may be computationally prohibitive, since it requires a large number of refits of the model. Here we investigate by simulation the distributions of the estimators of the expected prediction error, and we study the effect of the size of \(B\).

First we consider just a single cross-validation round, corresponding to 4 refits of the model; the subsampling and bootstrap estimates are likewise based on only 4 refits (B = 4).

K <- 1000
n <- 20
form <- y ~ x
errhat <- vector("list", K)
x <- rnorm(n)
for(k in 1:K) {
  yx <- data.frame(y = rnorm(n, x), x)
  errhat[[k]] <- data.frame(train = trainErr(form, yx), 
                            sub = testSub(form, yx, B = 4),
                            CV = testCV(form, yx, B = 1),
                            boot = testBoot(form, yx, B = 4)
                            )
}
errhat <- do.call(rbind, errhat)
ggplot(melt(errhat), aes(x = variable, y = value)) +
  geom_violin(fill = gray(0.8)) +
  geom_hline(yintercept = 1.1, color = "red", alpha = 0.5) +
  geom_hline(yintercept = 0.9, color = "red", alpha = 0.5) +
  stat_summary(fun.y = mean, color = "blue", geom = "point") +
  coord_cartesian(ylim = c(0, 4))

(Figure: violin plots of the distributions of the four estimators (train, sub, CV, boot), with horizontal reference lines at 0.9 and 1.1 and the means marked as blue points.)


colMeans(errhat)  ## Means of estimators
##  train    sub     CV   boot 
## 0.8944 1.1515 1.1532 1.2307
apply(errhat, 2, sd) ## Standard errors of estimators
##  train    sub     CV   boot 
## 0.3110 0.5101 0.4307 0.5187

We observe that the training error generally underestimates the expected prediction error. The three sampling-based methods compensate for this bias. Note that the \(C_p\)-statistic corresponds exactly to the training error plus 0.2 in this case.
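
The penalty entering the \(C_p\)-statistic is \(2p\sigma^2/n\), which with \(p = 2\) parameters, \(\sigma^2 = 1\) and \(n = 20\) equals \(0.2\). A small illustration (the object Cp is introduced here for illustration only and is not used elsewhere):

Cp <- errhat$train + 2 * 2 * 1 / n  ## training error plus the penalty 0.2
mean(Cp)                            ## the mean training error plus 0.2, close to 1.1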

From the distributions of the estimates of the expected prediction error, it is clear that there is considerable variance. How much of this variance can be removed by increasing \(B\)? We investigate this by increasing \(B\) by a factor of 10.

errhat <- vector("list", K)
x <- rnorm(n)
for(k in 1:K) {
  yx <- data.frame(y = rnorm(n, x), x)
  errhat[[k]] <- data.frame(train = trainErr(form, yx), 
                            sub = testSub(form, yx, B = 40),
                            CV = testCV(form, yx, B = 10),
                            boot = testBoot(form, yx, B = 40)
                            )
}
errhat <- do.call(rbind, errhat)
ggplot(melt(errhat), aes(x = variable, y = value)) +
  geom_violin(fill = gray(0.8)) +
  geom_hline(yintercept = 1.1, color = "red", alpha = 0.5) +
  geom_hline(yintercept = 0.9, color = "red", alpha = 0.5) +
  stat_summary(fun.y = mean, color = "blue", geom = "point") +
  coord_cartesian(ylim = c(0, 4))

(Figure: violin plots as above, now with \(B\) increased by a factor of 10.)


colMeans(errhat)  ## Means of estimators
##  train    sub     CV   boot 
## 0.8922 1.1374 1.1375 1.2099
apply(errhat, 2, sd) ## Standard errors of estimators
##  train    sub     CV   boot 
## 0.2962 0.3942 0.3849 0.4144

The conclusion is that increasing \(B\) decreases the spread of the distribution, and thus the variance, for the subsampling and bootstrap methods. For cross-validation the decrease is detectable, but smaller.
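
As a minimal sketch of how the effect of \(B\) could be examined more systematically, one could compute, say, the cross-validation estimate for a grid of \(B\)-values on repeatedly simulated data sets and compare the standard deviations of the estimates (the objects Bs and sdCV below are introduced here for illustration and are not part of the code above):

Bs <- c(1, 5, 10, 20)
sdCV <- sapply(Bs, function(B) {
  est <- replicate(200, {
    yx <- data.frame(y = rnorm(n, x), x)
    testCV(form, yx, B = B)
  })
  sd(est)
})
names(sdCV) <- paste0("B = ", Bs)
sdCV  ## standard deviation of the CV estimator for each value of B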