Reinforcement Learning in Recommender Systems: Some Foundational and Practical Questions
I'll provide an overview of some recent work on several foundational questions in reinforcement learning (specifically Q-learning) and Markov decision processes, motivated in part by practical issues that arise when applying RL to online, user-facing applications such as recommender systems.
I'll first describe the notion of delusion in MDPs and RL, a problem that arises when using value- or Q-function approximation. Since any approximation limits the realizable policy class, methods like Q-learning and value iteration run the risk of fitting value estimates to inconsistent (i.e., unrealizable) labels, leading to a variety of pathological phenomena. We develop a policy-class consistent backup that fully resolves this problem and is guaranteed to find the optimal policy in the realizable policy class (and to do so in polynomial time under some assumptions).
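To make the inconsistency concrete, here is a minimal sketch (with hypothetical states, features, and numbers, not taken from the talk): a max-backup independently picks a greedy action at each state, but when the policy class is restricted, say, to linear policies over a 1-d feature, there may be no single policy in the class that realizes all of those greedy choices at once. This is the kind of unrealizable labeling that delusion refers to.

```python
# Illustrative only: two states with 1-d features, two actions (0 and 1).
# Policy class (assumed for this sketch): choose action 1 at state s
# iff w * x(s) > 0, for a single scalar weight w.
states = {"s1": 1.0, "s2": -1.0}  # state -> feature value x(s)

def realizable(greedy, states):
    """Brute-force search for one weight w whose linear policy agrees
    with every greedy action label. Returns True iff such w exists."""
    candidates = [i / 100.0 for i in range(-100, 101)]
    return any(
        all((w * states[s] > 0) == (greedy[s] == 1) for s in states)
        for w in candidates
    )

# Suppose independent max-backups select action 1 at BOTH states.
# No single w satisfies w * 1.0 > 0 and w * (-1.0) > 0 simultaneously,
# so these backup labels are jointly unrealizable: delusion.
print(realizable({"s1": 1, "s2": 1}, states))  # False

# By contrast, a consistent labeling is realizable (e.g., any w < 0).
print(realizable({"s1": 0, "s2": 1}, states))  # True
```

A policy-class consistent backup, by contrast, only backs up value estimates that some single policy in the class could jointly achieve.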
I'll next outline an approach to slate decomposition to manage Q-learning in settings where a set (or slate) of items is presented to a user, as is often the case in recommender systems. Under certain assumptions about user choice behavior, the Q-value of a slate decomposes into a specific function of item-level Q-values. As a consequence, we avoid the need to explore and generalize over the combinatorial space of slates. Experiments in both small simulated domains and a large-scale live recommender system demonstrate the efficacy and robustness of the approach.
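As a rough illustration of the idea (a sketch under assumed modeling choices, not the talk's exact formulation): if user choice follows a conditional (multinomial) logit model over the slate plus a no-click option, the slate's Q-value can be written as a choice-probability-weighted sum of item-level Q-values, so learning and generalization happen at the item level rather than over the exponential space of slates.

```python
import math

def slate_q_value(item_scores, item_q_values, null_score=0.0, null_q=0.0):
    """Slate Q-value under an assumed decomposition of the form
        Q(s, A) = sum_{i in A} P(i | s, A) * Q(s, i),
    where P(i | s, A) is a conditional-logit choice probability over the
    slate items plus a "no click" null option. All scores and Q-values
    here are illustrative inputs, not learned quantities."""
    # Choice probabilities: softmax over item scores and the null option.
    exps = [math.exp(v) for v in item_scores] + [math.exp(null_score)]
    total = sum(exps)
    probs = [e / total for e in exps]
    qs = list(item_q_values) + [null_q]
    # Expected long-term value of presenting this slate.
    return sum(p * q for p, q in zip(probs, qs))

# Two equally attractive items, each with item-level Q-value 1.0;
# the null option contributes 0, so the slate value is 2/3.
print(slate_q_value([0.0, 0.0], [1.0, 1.0]))
```

Note that with this structure, optimizing over slates reduces to selecting items by their (choice-weighted) item-level values, which is what makes the combinatorial problem tractable.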
Time permitting, I'll conclude with a few remarks about the role that reinforcement learning has to play in user-centric recommender systems and some of the interesting research challenges that face RL in this setting.
Craig Boutilier is Principal Scientist at Google. He works on various aspects of decision making under uncertainty, with a current focus on sequential decision models: reinforcement learning, Markov decision processes, temporal models, etc.
He was a Professor in the Department of Computer Science at the University of Toronto (on leave) and Canada Research Chair in Adaptive Decision Making for Intelligent Systems. He received his Ph.D. in Computer Science from the University of Toronto in 1992, and worked as an Assistant and Associate Professor at the University of British Columbia from 1991 until his return to Toronto in 1999. He served as Chair of the Department of Computer Science at Toronto from 2004 to 2010. He was co-founder (with Tyler Lu) of Granata Decision Systems from 2012 to 2015, until his move to Google in 2015.