UCPH Statistics Seminar: Off-policy inference in contextual bandits

Title: Anytime-valid off-policy inference for contextual bandits

Abstract: Contextual bandit algorithms are becoming ubiquitous tools for adaptive sequential experimentation in healthcare and the tech industry. This adaptivity raises interesting but hard statistical inference questions: e.g., how would our algorithm have done if we had used a policy different from the logging policy that actually collected the data --- a problem known as ``off-policy evaluation'' (OPE).
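For concreteness, one standard way to formalize the OPE target (the notation here is illustrative and not taken from the announcement): given logged data $(X_i, A_i, R_i)$ of contexts, actions, and rewards collected under a possibly adaptive logging policy $h_i$, the importance-weighted estimate of the value of a target policy $\pi$ at time $t$ is
$$\widehat{\nu}_t \;=\; \frac{1}{t} \sum_{i=1}^{t} \frac{\pi(A_i \mid X_i)}{h_i(A_i \mid X_i)}\, R_i,$$
which, under the usual positivity condition on $h_i$, has conditional mean equal to the value of $\pi$ under each step's context distribution.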
Using modern martingale techniques, we present a suite of methods for OPE inference that can be used even while the original experiment is still running (that is, not necessarily post hoc), even when the logging policy is changing data-adaptively due to learning, and even if the context distributions are drifting over time.
Concretely, we derive confidence sequences --- sequences of confidence intervals that are uniformly valid over time --- for various functionals, including (time-varying) off-policy mean reward values, as well as the entire CDF of the off-policy rewards. All of our methods (a) are valid at arbitrary stopping times, (b) make only nonparametric assumptions, (c) do not require known bounds on the maximal importance weights, and (d) adapt to the empirical variance of our estimators.
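As a reminder for those unfamiliar with the term (stated in our own notation): a $(1-\alpha)$-level confidence sequence for a possibly time-varying target $(\nu_t)_{t \ge 1}$ is a data-dependent sequence of intervals $(C_t)_{t \ge 1}$ satisfying $\Pr(\nu_t \in C_t \text{ for all } t \ge 1) \ge 1-\alpha$. This time-uniform coverage is exactly what yields property (a): for any stopping time $\tau$, $\Pr(\nu_\tau \in C_\tau) \ge 1-\alpha$.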

Speaker: Ian Waudby-Smith