In [None]:
library(AER)

# Instrumental Variables

by Jonas Peters, Niklas Pfister, 29.12.2017

This notebook aims to give you a basic understanding of the instrumental variable approach and when it can be used to infer causal relations.

The goal of this method is to estimate the causal effect of a predictor variable $X$ on a target variable $Y$ when the effect of $X$ on $Y$ is confounded. The idea of the instrumental variable approach is to account for this confounding by considering an additional variable $I$, called an instrument. Numerous extensions exist, but here we focus on the classical case, for which we provide two definitions.


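First, assume the following structural causal model (SCM), in which the noise variables $N_I$, $N_H$, $N_X$, $N_Y$ are jointly independent: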
\begin{align}
I &:= N_I\\
H &:= N_H\\ 
X &:= \gamma I + \delta_X H + N_X\\
Y &:= \beta X + \delta_Y H + N_Y.
\end{align}
The corresponding DAG looks as follows.
\begin{align}
 &\phantom{0}\\
 &\begin{array}{ccccc}
 & & &H & \\
 & &\phantom{abcdefgh}\overset{\delta_X}{\swarrow} & & \overset{\delta_Y}{\searrow}\phantom{abcdefgh}\\
 & & & & \\
 I &\overset{\gamma}{\longrightarrow} &X & \overset{\beta}{\longrightarrow} & Y\\
 \end{array}\\
 &\phantom{0}
\end{align}
Here, $I$ is called an instrumental variable for the causal effect of $X$ on $Y$. It is essential that $I$ affects $Y$ only via $X$ (and not directly).
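
To make the role of the hidden confounder concrete, the SCM above can be simulated. The following is a minimal sketch, assuming standard normal noise variables and hypothetical coefficient values ($\gamma = 1$, $\delta_X = 1$, $\delta_Y = 2$, $\beta = 0.5$) that are not taken from this notebook. The `_sim` names are chosen for illustration so that they do not clash with the data variables used later. It suggests that an ordinary regression of $Y$ on $X$ does not recover $\beta$ when $H$ is unobserved.

```r
# Simulate from the SCM with hypothetical coefficients (assumed for illustration).
set.seed(1)
n <- 10000
gamma <- 1; delta_X <- 1; delta_Y <- 2; beta <- 0.5

I_sim <- rnorm(n)                                      # instrument
H_sim <- rnorm(n)                                      # hidden confounder
X_sim <- gamma * I_sim + delta_X * H_sim + rnorm(n)    # predictor
Y_sim <- beta * X_sim + delta_Y * H_sim + rnorm(n)     # target

# OLS of Y on X ignores H, so its slope is biased for beta.
beta_ols <- unname(coef(lm(Y_sim ~ X_sim))["X_sim"])
print(c(true_beta = beta, ols = beta_ols))
```

With these values, the population OLS slope is $\beta + \delta_X \delta_Y \operatorname{var}(H) / \operatorname{var}(X) = 0.5 + 2/3$, so the OLS estimate should lie well above the true $\beta$.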



Second, it is possible to define instrumental variables without SCMs, too. Let us therefore write
\begin{equation}
Y = \beta X + \epsilon_Y
\end{equation}
(this can always be done). Here, $\epsilon_Y$ is allowed to depend on $X$ (if there is a confounder $H$ between $X$ and $Y$, this is likely to be the case). We then call a variable $I$ an instrumental variable if it satisfies the following two conditions:

1. $\operatorname{cov}(X,I)\neq 0$ (relevance)
2. $\operatorname{cov}(\epsilon_Y,I)=0$ (exogeneity).

Informally speaking, these conditions mean that $I$ affects $Y$ only through its effect on $X$.
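
The two definitions can be connected by a short calculation. As a sketch, assume that the noise variables in the SCM of the first definition are jointly independent with finite variances, that $\operatorname{var}(I) > 0$, and that $\gamma \neq 0$ (these assumptions are ours and are not stated explicitly above). Setting $\epsilon_Y := \delta_Y H + N_Y$, we obtain
\begin{align*}
\operatorname{cov}(X, I) &= \operatorname{cov}(\gamma I + \delta_X H + N_X,\, I) = \gamma \operatorname{var}(I) \neq 0,\\
\operatorname{cov}(\epsilon_Y, I) &= \operatorname{cov}(\delta_Y H + N_Y,\, N_I) = 0,
\end{align*}
so under these assumptions the instrument in the SCM satisfies both conditions of the second definition.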

## Estimation

We now illustrate how the instrumental variable $I$ can be used to estimate the causal effect $\beta$ in the model above. To this end, we use the CollegeDistance data set from [1], which is available in the R package AER.

In [None]:
# load CollegeDistance data set
data("CollegeDistance")
# read out relevant variables
Y <- CollegeDistance$score
X <- CollegeDistance$education
I <- CollegeDistance$distance

This data set consists of $4739$ observations on $14$ variables from a survey of high school students conducted by the Department of Education in $1980$, with a follow-up in $1986$. In this notebook, we only consider the following variables:
* $Y$ - base year composite test score. These are achievement tests given to high school seniors in the sample.
* $X$ - number of years of education.
* $I$ - distance from the closest 4-year college (in units of 10 miles).

We now assume that $I$ is a valid instrument (we come back to this question in Exercise 2 below). To estimate the causal effect of $X$ on $Y$ we can then use a so-called 2-stage least squares (2SLS) procedure, which goes as follows:
* Step 1: Regress $X$ on $I$ and compute the corresponding predicted values $\hat{X}$ of $X$.
* Step 2: Regress $Y$ on $\hat{X}$. The resulting regression coefficient is a consistent estimator of the causal effect of $X$ on $Y$.
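
The two steps above can be sketched in base R. The example below is an illustration on simulated data, not on the CollegeDistance data: the coefficients, the standard normal noise, and the `_sim` variable names are all assumptions made for this sketch.

```r
# Simulated data with a hidden confounder H (hypothetical coefficients).
set.seed(2)
n <- 10000
beta <- 0.5
I_sim <- rnorm(n)
H_sim <- rnorm(n)
X_sim <- I_sim + H_sim + rnorm(n)
Y_sim <- beta * X_sim + 2 * H_sim + rnorm(n)

# Step 1: regress X on I and keep the fitted values.
stage1 <- lm(X_sim ~ I_sim)
X_hat <- fitted(stage1)

# Step 2: regress Y on the fitted values; the slope estimates beta.
stage2 <- lm(Y_sim ~ X_hat)
beta_2sls <- unname(coef(stage2)["X_hat"])
print(c(true_beta = beta, two_stage = beta_2sls))
```

Note that the standard errors reported by the second `lm` call are not valid 2SLS standard errors, because they treat $\hat{X}$ as if it were observed. The function `ivreg` in the AER package, called for example as `ivreg(Y ~ X | I)`, is one way to obtain the same point estimate together with appropriate standard errors.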

The following four exercises go over some of the details of 2SLS and apply it to the above data set.

### Exercise 1
Assume the following two structural assignments
\begin{align*}
Y &:= \beta X + \epsilon_Y \\
X &:= \gamma I + \epsilon_X,
\end{align*}
where $\epsilon_X$ and $\epsilon_Y$ are not necessarily independent, but the instrument $I$ is assumed to satisfy conditions 1 and 2 above. Prove that the 2-stage least squares method does indeed give a consistent estimator of the causal effect $\beta$ in this case. Hint: For simplicity, you may also assume that $\operatorname{cov}(I, \epsilon_X)=0$.

### Solution 1

### End of Solution 1

### Exercise 2

Argue whether the variable $I$ can be used as an instrumental variable to infer the causal effect of $X$ on $Y$. Why might it not be a valid instrument? Hint: You can perform a regression to test whether there is a significant correlation between $I$ and $X$.

### Solution 2

### End of Solution 2

### Exercise 3
Use 2SLS to estimate the causal effect of $X$ on $Y$ based on the instrument $I$. Compare your results with a standard OLS regression of $Y$ on $X$ (that includes an intercept). What happens to the correlation between $X$ and the residuals in both methods? Which approach yields the smaller residual variance?

### Solution 3

### End of Solution 3

A slightly different approach to 2SLS is to use the formula

\begin{equation} \tag{1}
\beta=\frac{\operatorname{cov}(Y,I)}{\operatorname{cov}(X,I)}.
\end{equation}

This formula can be shown quite easily using the same setting as in Exercise 1 (try proving it). Replacing the population covariances by the corresponding empirical estimates again results in a consistent estimator.
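
As a quick numerical check of formula (1), the empirical covariance ratio can be computed on simulated data. This is a sketch under assumed settings (hypothetical coefficients, standard normal noise, illustrative `_sim` names), not an analysis of the CollegeDistance data.

```r
# Simulated data with a hidden confounder H (hypothetical coefficients).
set.seed(3)
n <- 10000
beta <- 0.5
I_sim <- rnorm(n)
H_sim <- rnorm(n)
X_sim <- I_sim + H_sim + rnorm(n)
Y_sim <- beta * X_sim + 2 * H_sim + rnorm(n)

# Plug empirical covariances into formula (1).
beta_cov <- cov(Y_sim, I_sim) / cov(X_sim, I_sim)
print(c(true_beta = beta, cov_ratio = beta_cov))
```

With a sample of this size, the estimate should be close to the value of $\beta$ used in the simulation.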

### Exercise 4
Apply the above estimator (1) to the CollegeDistance data and compare your result with the one from Exercise 3.

### Solution 4

### End of Solution 4

## References

[1] Kleiber, C. and Zeileis, A. (2008). Applied Econometrics with R. Springer-Verlag, New York.