Exploration by Optimization in Partial Monitoring

Tue, Nov 12, 2019, 4:30 pm
Sherrerd 101
Center for Statistics and Machine Learning
Operations Research and Financial Planning


In many real-world problems learners cannot directly observe their own rewards but can still infer whether some particular action is successful. How should a learner take actions to balance its need of information while maximizing their reward in this setting? Partial monitoring is a framework introduced a few decades ago to model learning situations of this kind. The simplest version of partial monitoring generalizes learning with full-, bandit-, or even graph-information based feedback. In this talk I will introduce this elegant framework, give a short overview of previous work, and introduce a new elegant algorithm which, unlike previous algorithms, is near-optimal across a wide range of partial monitoring games, while it is also able to beat algorithms specialized for specific types of partial monitoring games (e.g., Exp3 on bandits). The talk is based on joint work with Tor Lattimore.


Csaba Szepesvari is the team-lead for the "Foundations" team at DeepMind, UK and a Professor of Computing Science at the University of Alberta. He earned his PhD in 1999 from Jozsef Attila University, in Szeged, Hungary. He has authored three books and about 200 peer-reviewed journal and conference papers. He serves as the action editor of the Journal of Machine Learning Research and Machine Learning, as well as on various program committees. Dr. Szepesvari's interest is artificial intelligence (AI) and, in particular, principled approaches to AI that use machine learning. He is the co-inventor of UCT, a widely successful Monte-Carlo tree search algorithm, which ignited much work in AI, such as DeepMind's AlphaGo which defeated the top Go professional Lee Sedol in a landmark game. This work on UCT won the 2016 test-of-time award at ECML/PKDD.