{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "library(AER)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Instrumental Variables\n", "\n", "by Jonas Peters, Niklas Pfister, 29.12.2017\n", "\n", "This notebook aims to give you a basic understanding of the instrumental variable approach and when it can be used to infer causal relations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal of this method is to estimate the causal effect of a predictor variable $X$ on a target variable $Y$ if the effect from $X$ to $Y$ is confounded. The idea of the instrumental variable approach is to account for this confounding by considering an additional variable $I$ called an instrument. Although there exist numerous extensions, here, we focus on the classical case. We provide two definitions.\n", "\n", "\n", "First, assume the following SCM\n", "\\begin{align}\n", "I &:= N_I\\\\\n", "H &:= N_H\\\\ \n", "X &:= \\gamma I + \\delta_X H + N_X\\\\\n", "Y &:= \\beta X + \\delta_Y H + N_Y.\\\\\n", "\\end{align}\n", "The corresponding DAG looks as follows.\n", "\\begin{align}\n", " &\\phantom{0}\\\\\n", " &\\begin{array}{ccc}\n", " & & &H & \\\\\n", " & &\\phantom{abcdefgh}\\overset{\\delta_X}{\\swarrow} & & \\overset{\\delta_Y}{\\searrow}\\phantom{abcdefgh}\\\\\n", " & & & & \\\\\n", " I &\\overset{\\gamma}{\\longrightarrow} &X & \\overset{\\beta}{\\longrightarrow} & Y\\\\\n", " \\end{array}\\\\\n", " &\\phantom{0}\n", "\\end{align}\n", "Here, $I$ is called an instrumental variable for the causal effect from $X$ to $Y$. It is essential that $I$ effects $Y$ only via $X$ (and not directly).\n", "\n", "\n", "\n", "Second, it is possible to define instrumental variables without SCMs, too. Let us therefore write\n", "\\begin{equation}\n", "Y = \\beta X + \\epsilon_Y\n", "\\end{equation}\n", "(this can always be done). Here, $\\epsilon_Y$ is allowed to depend on $X$ (if there is a confounder $H$ between $X$ and $Y$, this is likely to be the case). We then call a variable $I$ an instrumental variable if it satisfies the following two conditions:\n", "\n", "1. $\\operatorname{cov}(X,I)\\neq 0$ (relevance)\n", "2. $\\operatorname{cov}(\\epsilon_Y,I)=0$ (exogenity).\n", "\n", "Informally speaking, these conditions mean that $I$ affects $Y$ only through its effect on $X$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Estimation\n", "\n", "We now want to illustrate how the instrumental variable $I$ can be used to estimate the causal effect $\\beta$ in the model above. To this end we use the CollegeDistance data set from [1] available in the R package AER." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# load CollegeDistance data set\n", "data(\"CollegeDistance\")\n", "# read out relevant variables\n", "Y <- CollegeDistance$score\n", "X <- CollegeDistance$education\n", "I <- CollegeDistance$distance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This data set consists of $4739$ observations on $14$ variables from high school student survey conducted by the Department of Education in $1980$, with a follow-up in $1986$. In this notebook, we only consider the following variables:\n", "* $Y$ - base year composite test score. 
"* $X$ - number of years of education.\n", "* $I$ - distance from the closest 4-year college (in units of 10 miles).\n", "\n", "We now assume that $I$ is a valid instrument (we come back to this question in Exercise 2 below). To estimate the causal effect of $X$ on $Y$, we can then use a so-called 2-stage least squares (2SLS) procedure, which goes as follows:\n", "* Step 1: Regress $X$ on $I$ and compute the corresponding predicted values $\\hat{X}$ of $X$.\n", "* Step 2: Regress $Y$ on $\\hat{X}$; the resulting regression coefficient is a consistent estimator of the causal effect of $X$ on $Y$.\n", "\n", "The following four exercises go over some of the details of 2SLS and apply it to the above data set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 1\n", "Assume the following two structural assignments\n", "\\begin{align*}\n", "Y &:= \\beta X + \\epsilon_Y \\\\\n", "X &:= \\gamma I + \\epsilon_X,\n", "\\end{align*}\n", "where $\\epsilon_X$ and $\\epsilon_Y$ are not necessarily independent, but the instrument $I$ is assumed to satisfy assumptions 1 and 2 above. Prove that the 2-stage least squares method does indeed give a consistent estimator of the causal effect in this case. Hint: For simplicity, you may also assume that $\\operatorname{cov}(I, \\epsilon_X)=0$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Solution 1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### End of Solution 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 2\n", "\n", "Argue whether the variable $I$ can be used as an instrumental variable to infer the causal effect of $X$ on $Y$. Why might it not be a valid instrument? Hint: You can perform a regression in order to test whether there is a significant correlation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Solution 2" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### End of Solution 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 3\n", "Use 2SLS to estimate the causal effect of $X$ on $Y$ based on the instrument $I$. Compare your results with a standard OLS regression of $Y$ on $X$ (that includes an intercept). What happens to the correlation between $X$ and the residuals in both methods? Which approach yields the smaller residual variance?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Solution 3" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### End of Solution 3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A slightly different approach to estimating $\\beta$ is to use the formula\n", "\n", "\\begin{equation} \\tag{1}\n", "\\beta=\\frac{\\operatorname{cov}(Y,I)}{\\operatorname{cov}(X,I)}.\n", "\\end{equation}\n", "\n", "This formula can be shown quite easily in the same setting as in Exercise 1 (try proving it). Replacing the population covariances by the corresponding empirical estimates again results in a consistent estimator."
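] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sanity check, the following sketch (with arbitrarily chosen coefficient values, not taken from the text) simulates data from the SCM in the introduction and illustrates that the estimator in (1) approximately recovers $\\beta$, whereas a plain OLS regression of $Y$ on $X$ is biased by the hidden confounder $H$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch: simulate the SCM from the introduction (arbitrary coefficients)\n", "# and compare the IV estimator from formula (1) with plain OLS.\n", "# The suffix _sim avoids overwriting the CollegeDistance variables Y, X, I loaded above.\n", "set.seed(1)\n", "n <- 10000\n", "gamma <- 2; delta_X <- 1; delta_Y <- 1; beta <- 3\n", "I_sim <- rnorm(n)\n", "H_sim <- rnorm(n)\n", "X_sim <- gamma * I_sim + delta_X * H_sim + rnorm(n)\n", "Y_sim <- beta * X_sim + delta_Y * H_sim + rnorm(n)\n", "# IV estimate via formula (1): close to the true beta = 3\n", "cov(Y_sim, I_sim) / cov(X_sim, I_sim)\n", "# OLS estimate: biased because of the hidden confounder H\n", "coef(lm(Y_sim ~ X_sim))[\"X_sim\"]"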
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 4\n", "Apply the above estimator (1) to CollegeDistance data and compare your result with the one from Exercise 3." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Solution 4" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### End of Solution 4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References\n", "\n", "[1] Kleiber, C., A. Zeileis (2008). Applied Econometrics with R. Springer-Verlag New York." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.3.2" } }, "nbformat": 4, "nbformat_minor": 2 }