## Abstracts – Online Learning

**Bandit regret scaling with the effective loss range**

Nicolò Cesa-Bianchi (Università degli Studio di Milano, Italy)

(This talk will include a short introduction to the field of online learning.)

Multiarmed bandits are a learning technique often used in online product recommendation, where training signals can only refer to the products that were actually shown to the user. In online learning performance is typically measured in terms of regret: the difference between the learner’s total gain in a sequence of recommendations and what it could have been gained by recommending the single best product to all users in the sequence. A recent result shows that it is impossible to guarantee that, whenever the gains across different products are in a small range for each user, the regret will scale with that range size. In this talk we show ways in which this negative result can be circumvented modulo some mild additional assumptions. For example, when rough estimates are available for the gains of each user, or when we know the gain of some unspecified product for each user.

Joint work with Ohad Shamir (Weizmann, Israel)

**Sparsity, variance and curvature in multi-armed bandits**

Sébastien Bubeck (Microsoft Research, USA)

In (online) learning theory the concepts of sparsity, variance and curvature are well-understood and are routinely used to obtain refined regret and generalization bounds. In this paper we further our understanding of these concepts in the more challenging limited feedback scenario. We consider the adversarial multi-armed bandit and linear bandit problems and solve several open problems pertaining to the existence of algorithms with good regret bounds under the following assumptions: (i) sparsity of the individual losses, (ii) small variation of the loss sequence, and (iii) curvature of the action set. Specifically we show that (i) for s-sparse losses one can obtain \sqrt{sT}-regret (solving an open problem by Kwon and Perchet), (ii) for loss sequences with variation bounded by Q one can obtain \sqrt{Q}-regret (solving an open problem by Kale and Hazan), and (iii) for linear bandit on an \ell_p^n ball one can obtain \sqrt{n T}-regret for p \in [1,2] and one must have a regret larger than n \sqrt{T} for p > 2 (solving an open problem by Bubeck, Cesa-Bianchi and Kakade.

Joint work with Michael B. Cohen and Yuanzhi Li.

**Online version of search problems**

Vianney Perchet (ENS Paris-Saclay & Criteo, France

In traditional search problems, an agent aims at exploring a graph in order to find an object as hidden as possible. We consider here the Bayesian repeated scenario with several iid instance of the same basic search problem. The objective is, within a given amount of time, to find as many objects as possible with the possibility to leave an instance for the next one at any moment.

We also consider the non-Bayesian case with a variant of regret minimization. We provide and analyses several algorithms with fast rates of convergence and/or guaranteed efficiency.

On optimal strategies via random playouts and perturbations

Jacob Abernethy (Georgia Institute of Technology, USA

In this talk we will explore the use of perturbation and randomization techniques for learning and decision-making. In particular, we will discuss a nice connection between regularization methods commonly used in statistical learning and perturbation methods developed for the online learning settings. We’ll explore applications to limited feedback settings, the use of randomized playouts in repeated games, and the relationship to differential privacy.

**Exploration in reinforcement learning: from tight regret analysis of finite MDPs to intrinsic motivation in deep RL**

Rémi Munos and Mohammad Gheshlaghi Azar (DeepMind, UK)

(This talk consists of two parts.)

In the first part we consider the problem of exploration in reinforcement learning for finite horizon and finite state space MDPs. We show that an optimistic modification to value iteration achieves a regret bound of O((HSAT)1/2) closing the gap with the established lower bound of Ω((HSAT)1/2), ignoring the higher order and logarithmics terms. This result improves over the best previous known bound O(HS(AT)1/2) achieved by the UCRL2 algorithm of Jaksch et al. (2010). Our analysis contains two key insights. We use careful application of concentration inequalities to the optimal value function as a whole, rather than to the transitions probabilities (to improve scaling in S), and we consider Bernstein-based “exploration bonuses” that use the empirical variance of the estimated values at the next states (to improve scaling in H).

The second part of this talk will consider the problem of exploration in MDPs with large (possibly infinite) state spaces where we possibly never go back to the same state twice. We will present heuristics based on deep learning representations in order to replace the optimism principle for exploration by an intrinsic motivation approach. We show how to use a density model to measure the novelty of a situation from which we can derive a pseudo-count that is used, similarly to the optimistic approach, to build an exploration bonus. We will illustrate those ideas on the infamously difficult Atari game Montezuma’s Revenge.