From Contextual Bandits to Conditional Treatment Effects



Log data is one of the most ubiquitous forms of data available, as it can be recorded from a variety of systems (e.g., search engines, recommender systems, ad placement) at little cost. The interaction logs of such systems typically contain a record of the input to the system (e.g., features describing the user), the prediction made by the system (e.g., a recommended list of news articles), and the feedback about the quality of this prediction (e.g., number of articles the user read). This feedback, however, is only partial (so-called ``contextual bandit feedback''): it is limited to the particular prediction shown by the system. This is fundamentally different from conventional supervised learning, where ``correct'' predictions (e.g., the best ranking of news articles for that user) together with a loss function provide full-information feedback. In this talk, I will explore approaches and methods for batch learning from logged bandit feedback (BLBF). Unlike the well-explored problem of online learning with bandit feedback, batch learning with bandit feedback does not require interactive experimental control of the underlying system, but merely exploits log data collected in the past. The talk explores how Empirical Risk Minimization can be used for BLBF, the suitability of various counterfactual risk estimators in this context, and a new learning method for structured output prediction in the BLBF setting. From this, I will draw connections to methods for estimating conditional average treatment effects.
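To make the idea of a counterfactual risk estimator concrete, here is a minimal sketch of one standard choice, inverse propensity scoring (IPS). The abstract does not say which estimators the talk covers, so this is an illustration only. The log format (context, action, loss, logging propensity), the names `ips_risk` and `target_prob`, and the toy data are all assumptions made for this example.

```python
def ips_risk(logs, target_prob):
    """Inverse propensity scoring (IPS) estimate of a new policy's expected loss.

    logs: list of (context, action, loss, propensity) tuples, where
        `propensity` is the probability with which the logging policy
        chose `action` for `context` (assumed known and nonzero).
    target_prob: function (context, action) -> probability that the
        new policy would choose `action` for `context`.
    """
    total = 0.0
    for context, action, loss, propensity in logs:
        # Reweight each logged loss by how much more (or less) likely
        # the new policy is to take the logged action.
        total += loss * target_prob(context, action) / propensity
    return total / len(logs)


# Hypothetical toy log: two actions, logged by a uniform policy
# (propensity 0.5 each). Action 0 always has loss 1.0, action 1 has loss 0.0.
toy_logs = [
    ("u1", 0, 1.0, 0.5),
    ("u2", 1, 0.0, 0.5),
    ("u3", 0, 1.0, 0.5),
    ("u4", 1, 0.0, 0.5),
]


def always_zero(context, action):
    """Deterministic policy that always picks action 0."""
    return 1.0 if action == 0 else 0.0


def always_one(context, action):
    """Deterministic policy that always picks action 1."""
    return 1.0 if action == 1 else 0.0


risk_zero = ips_risk(toy_logs, always_zero)
risk_one = ips_risk(toy_logs, always_one)
print(risk_zero, risk_one)
```

On this toy log the estimate ranks the policy that always picks action 1 ahead of the one that always picks action 0, even though neither policy was ever deployed. Choosing a policy by minimizing such an estimate over a policy class is the Empirical Risk Minimization view of BLBF in its simplest form.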


Co-author: Adith Swaminathan