{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "library(InvariantCausalPrediction)\n",
    "library(seqICP)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Invariant Causal Prediction\n",
    "\n",
    "by Jonas Peters, Niklas Pfister, 22.8.2017\n",
    "\n",
    "This notebook aims to give you a basic understanding of invariant causal prediction for causal inference. \n",
    "\n",
    "\n",
    "The method's goal is as follows: Suppose we are given data $(\\mathbf{X}_1, Y_1), \\ldots, (\\mathbf{X}_n, Y_n)$ from a target variable $Y$ and $d$ predictors $\\mathbf{X}$. We are then trying to determine the causal parents $\\operatorname{pa}(Y) \\subseteq \\{1, \\ldots, d\\}$ of $Y$. The inference will be based on heterogeneity in the data (e.g. the data come from different interventional settings)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Environment-based approach\n",
    "\n",
    "We start with a fundamental observation that we will exploit later.\n",
    "\n",
    "Assume the $(d+1)$-dimensional vectors $\\mathbf{Z}_i=(Z^0_i,Z^1_i,\\dots,Z^d_i)$ for $i = 1, \\ldots, n$ are independent observations generated by (potentially) different interventional settings of the same linear structural causal model (SCM) such that the induced graphs are directed and acyclic (i.e. DAGs). Assume further that none of the interventions occurs directly on the variable $Z^0$. Then, for $Y:=Z^0$ and $\\mathbf{X}:=(Z^1,\\dots,Z^d)$ we have the following invariance: There exists $\\beta\\in(\\mathbb{R}\\setminus\\{0\\})^{|\\operatorname{pa}(Y)|}$ such that for all $i\\in\\{1,\\dots,n\\}$\n",
    "it holds that\n",
    "\\begin{equation} \\tag{1}\n",
    "    Y_i=\\mu+\\mathbf{X}_i^{\\operatorname{pa}(Y)}\\beta+\\epsilon_i\\text{ and }\\epsilon_i \\perp\\!\\!\\!\\perp \\mathbf{X}_i^{\\operatorname{pa}(Y)},\n",
    "\\end{equation}\n",
    "where $\\epsilon_1,\\dots,\\epsilon_n$ are i.i.d. noise variables. "
   ]
  },
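  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The invariance (1) can be checked numerically on simulated data. The following is a minimal sketch under assumptions that are not part of this notebook's data: a hypothetical SCM in which <tt>X1</tt> is a parent and <tt>X2</tt> is a child of $Y$, with a shift intervention on <tt>X1</tt> and a larger noise variance for <tt>X2</tt> in the second environment. All names, coefficients and sample sizes are chosen for illustration only.\n",
    "\n",
    "```r\n",
    "set.seed(3)                       # arbitrary seed for reproducibility\n",
    "n <- 500                          # observations per environment (assumption)\n",
    "env <- rep(c(0, 1), each = n)     # 0 = observational, 1 = interventional\n",
    "X1 <- rnorm(2 * n) + 2 * env      # parent of Y, shifted in environment 1\n",
    "Y  <- 0.8 * X1 + rnorm(2 * n)     # Y itself is never intervened on\n",
    "X2 <- Y + rnorm(2 * n, sd = ifelse(env == 1, 3, 1))  # child of Y, noisier in environment 1\n",
    "\n",
    "# Slope of a simple linear regression of y on x, restricted to the rows in idx\n",
    "slope <- function(y, x, idx) unname(coef(lm(y[idx] ~ x[idx]))[2])\n",
    "\n",
    "parent_slopes <- c(env0 = slope(Y, X1, env == 0), env1 = slope(Y, X1, env == 1))\n",
    "child_slopes  <- c(env0 = slope(Y, X2, env == 0), env1 = slope(Y, X2, env == 1))\n",
    "print(round(rbind(parent = parent_slopes, child = child_slopes), 2))\n",
    "```\n",
    "\n",
    "In this setup the slope of $Y$ on its parent should be close to the structural coefficient 0.8 in both environments, whereas the slope of $Y$ on its child is expected to differ between the environments, because the child is not a parent of $Y$ and its noise variance changes."
   ]
  },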
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 1\n",
    "\n",
    "Generate one sample from the distribution induced by the linear SCM\n",
    "\\begin{equation}\n",
    "\\mathcal{S}:\\left\\{\n",
    "\\begin{split}\n",
    "X_i &= \\epsilon_i^1\\\\\n",
    "Y_i &= 1.5\\cdot X_i + \\epsilon_i^2\\, ,\n",
    "\\end{split}\\right.\n",
    "\\end{equation}\n",
    "and a second sample from the same SCM under a shift intervention on $X$. Plot both samples in the same $(X,Y)$-scatterplot using different colors. Does the conditional distribution of $Y|X$ remain invariant, i.e., is it the same in both samples? What about the distribution of $Y$?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Solution 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### End of Solution 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now assume that we are given the data and try to infer $\\operatorname{pa}(Y)$.\n",
    "The method of invariant causal prediction exploits the invariance (1) from above. It goes over all candidate parent sets $S \\subseteq \\{1, \\ldots, d\\}$ and finds all sets for which this invariance is satisfied.\n",
    "\n",
    "\n",
    "To get a better understanding of how exactly invariant causal prediction performs this search, we consider the following toy data set."
   ]
  },
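  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The search over candidate sets can be sketched in a few lines of base R. This is a simplified illustration and not the implementation used in the <tt>InvariantCausalPrediction</tt> package. The simulated data, the variable names, the level 0.05 and the invariance check (a t-test and an F-test on the residuals of a pooled regression, combined by a Bonferroni correction) are all assumptions made for this sketch. Following [1], the final estimate is the intersection of all sets for which invariance is not rejected.\n",
    "\n",
    "```r\n",
    "set.seed(1)                                   # arbitrary seed\n",
    "env <- rep(c(0, 1), times = c(200, 200))      # hypothetical environment labels\n",
    "n <- length(env)\n",
    "X1 <- rnorm(n) + 2 * env                      # parent of Y, shift intervention\n",
    "X2 <- rnorm(n)                                # unrelated to Y\n",
    "Y  <- 1.5 * X1 + rnorm(n)                     # no intervention on Y\n",
    "X3 <- Y + rnorm(n, mean = 2 * env)            # child of Y, shift intervention\n",
    "X  <- cbind(X1 = X1, X2 = X2, X3 = X3)\n",
    "\n",
    "# p-value for the hypothesis that the residuals of the pooled regression of Y on\n",
    "# the predictors in S have the same mean and variance in both environments\n",
    "invariance_pvalue <- function(S) {\n",
    "  fit <- if (length(S) == 0) lm(Y ~ 1) else lm(Y ~ X[, S, drop = FALSE])\n",
    "  r <- residuals(fit)\n",
    "  p_mean <- t.test(r[env == 0], r[env == 1])$p.value\n",
    "  p_var  <- var.test(r[env == 0], r[env == 1])$p.value\n",
    "  min(1, 2 * min(p_mean, p_var))              # Bonferroni over the two tests\n",
    "}\n",
    "\n",
    "# all subsets of the three predictor indices, including the empty set\n",
    "subsets <- c(list(integer(0)),\n",
    "             unlist(lapply(1:3, function(k) combn(3, k, simplify = FALSE)),\n",
    "                    recursive = FALSE))\n",
    "pvals <- sapply(subsets, invariance_pvalue)\n",
    "accepted <- subsets[pvals > 0.05]             # sets for which invariance is not rejected\n",
    "\n",
    "# estimate: variables contained in every accepted set (empty if none is accepted)\n",
    "parents_hat <- if (length(accepted) > 0) Reduce(intersect, accepted) else integer(0)\n",
    "print(colnames(X)[parents_hat])\n",
    "```\n",
    "\n",
    "Taking the intersection is a conservative choice: a variable is reported only if no accepted set can do without it, so the estimate may be empty even when $Y$ has parents."
   ]
  },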
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "load(file = \"./InvariantCausalPredictionData1.RData\")  # load data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have now loaded a sample consisting of the variables $Y$, $X^1$, $X^2$ and $X^3$. The variables correspond to the columns of the matrix <tt>\"data\"</tt> and the rows correspond to independent observations from an underlying SCM. The first $140$ rows are sampled from an observational distribution, while the remaining $80$ rows come from an interventional setting for which it is known that none of the interventions occurred directly on $Y$. In the following two exercises we will determine the parents of $Y$ using invariant causal prediction. First, we do this manually, and later we will make use of some functions already implemented in R."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 2\n",
    "\n",
    "Perform a regression of $Y$ on all possible sets of predictors (i.e. \\{X1\\}, \\{X2\\}, \\{X3\\}, \\{X1, X2\\}, \\{X1, X3\\}, \\{X2, X3\\}, \\{X1, X2, X3\\}). For each of the $7$ regressions plot the residuals vs the fitted values (this is called a Tukey-Anscombe plot). In each figure, plot the data points from the first environment in \"blue\" and the points from the second environment in \"red\". Determine whether the corresponding conditional remains invariant across the two environments. Moreover, check whether the distribution of $Y$ itself remains invariant. What is the parent set? Hint: Think about which sets are definitely *not* the correct parent sets."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Solution 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### End of Solution 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 3\n",
    "\n",
    "For the same data set apply the invariant causal prediction function <tt>ICP</tt> from the package <tt>InvariantCausalPrediction</tt> to determine the parent set. Hint: You will need to define a vector <tt>ExpInd</tt> which has the same length as the number of observations and indicates from which environment each observation comes (e.g. $0$ for observational data and $1$ for interventional data)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Solution 3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### End of Solution 3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Extension to an environment-free approach\n",
    "\n",
    "In the above exercises we knew which observations corresponded to the observational and which to the interventional setting. In this section we want to show that we can still apply a similar methodology even if this environment information is not known. All we need is a sequential ordering of the data. For example, the data could be grouped together for each environment or the interventions could change continuously across time. We illustrate this using the following toy example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "load(file = \"./InvariantCausalPredictionData2.RData\")  # load data2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The matrix <tt>data2</tt> contains the three variables $Y$, $X^1$ and $X^2$ as columns and each row corresponds to an independent observation from the same SCM under smoothly changing interventions. To be more precise, the interventions correspond to smooth shifts in the variance of the noise."
   ]
  },
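  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the phrase smooth shifts in the variance of the noise concrete, the following sketch simulates a sequence of this kind. It is purely illustrative: the SCM, the coefficients, the sample size and the linear growth of the noise standard deviation are assumptions, and the mechanism that generated <tt>data2</tt> may well differ.\n",
    "\n",
    "```r\n",
    "set.seed(2)                              # arbitrary seed\n",
    "n <- 300                                 # length of the sequence (assumption)\n",
    "sd_t <- 1 + 2 * seq_len(n) / n           # noise sd grows smoothly from about 1 to 3\n",
    "X1 <- rnorm(n, sd = sd_t)                # intervention on the noise of X1\n",
    "Y  <- -0.7 * X1 + rnorm(n)               # noise of Y keeps a constant variance\n",
    "X2 <- Y + rnorm(n, sd = sd_t)            # intervention on the noise of X2\n",
    "data_sim <- cbind(Y = Y, X1 = X1, X2 = X2)\n",
    "\n",
    "# empirical standard deviations in the first and last third of the sequence\n",
    "thirds <- list(first = 1:100, last = 201:300)\n",
    "print(sapply(thirds, function(idx) apply(data_sim[idx, ], 2, sd)))\n",
    "```\n",
    "\n",
    "No environment labels appear anywhere in this simulation. The only structure available to a method is the ordering of the rows, along which the marginal distributions of the predictors drift while the mechanism generating $Y$ from <tt>X1</tt> stays fixed."
   ]
  },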
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 4\n",
    "\n",
    "Use the invariant causal prediction function for sequential data <tt>seqICP</tt> from the package <tt>seqICP</tt> to find an estimate of the parent set for the variable $Y$. Set the parameter <tt>test</tt> to \"smooth.variance\"; this makes <tt>seqICP</tt> perform a hypothesis test tuned against alternatives that result from smooth variance interventions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Solution 4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### End of Solution 4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References\n",
    "\n",
    "[1] Peters, J., P. Bühlmann, and N. Meinshausen (2016). *Causal inference using invariant prediction:\n",
    "identification and confidence intervals*. Journal of the Royal Statistical Society, Series B (with discussion)\n",
    "78 (5), 947–1012.\n",
    "\n",
    "[2] Pfister, N., P. Bühlmann, and J. Peters (2017). *Invariant Causal Prediction for Sequential Data*.\n",
    "arXiv e-prints (1706.08058)."
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "3.3.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}