Multi-agent systems are becoming increasingly prevalent in machine learning domains such as online bidding, power-grid management, inventory control, and language models. How can we design efficient learning algorithms? Can we make use of multi-agent training to improve single-agent performance? How can we design scalable algorithms when the number of agents is massive? These are a few of the questions we aim to explore in this direction.
It has become common wisdom that adding tasks or training on multiple datasets can improve the performance of a learning algorithm. We introduce the Aligned Multi-Objective Optimization (AMOO) framework, which enables us to investigate this phenomenon from an optimization perspective. Specifically, we design a simple adaptation of gradient descent that is guaranteed to perform better in the presence of multi-objective feedback.
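We do not reproduce the AMOO update here; the following is a generic sketch, assuming a two-objective setting, of a gradient step along a common descent direction (an MGDA-style min-norm combination of the gradients). It is meant only to illustrate the kind of adaptation of gradient descent this line of work considers; the function names and toy objectives are ours.

```python
# A minimal sketch of a multi-objective gradient step that moves along a common
# descent direction for two objectives (an MGDA-style min-norm combination).
# This is only an illustration of adapting gradient descent to multi-objective
# feedback; it is NOT the AMOO update from the paper.
import numpy as np

def common_descent_step(x, grad_fns, lr=0.1):
    g1, g2 = (g(x) for g in grad_fns)          # gradients of the two objectives at x
    diff = g1 - g2
    denom = float(diff @ diff)
    # Min-norm convex combination gamma*g1 + (1-gamma)*g2 (closed form for two objectives).
    gamma = 0.5 if denom == 0 else float(np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0))
    d = gamma * g1 + (1.0 - gamma) * g2
    return x - lr * d

# Toy example: two aligned quadratics sharing the minimizer x* = 0.
f1_grad = lambda x: 2.0 * x                    # gradient of ||x||^2
f2_grad = lambda x: 8.0 * x                    # gradient of 4 * ||x||^2
x = np.ones(3)
for _ in range(100):
    x = common_descent_step(x, (f1_grad, f2_grad))
print(np.linalg.norm(x))                       # close to 0: both objectives decreased
```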
The state space of many RL and control problems contains information on an uncontrollable and time-correlated process, which we refer to as exogenous distractors. For example, the visual sensors of an autonomous car may detect irrelevant information about a flock of birds. Nevertheless, we would hope that the complexity of learning to drive is independent of the complexity of learning the movement of the flock. In this line of work, we aim to study algorithms for this setting that have no, or only mild, dependence on the complexity of the exogenous noise process.
We focus on a natural class of POMDPs in which there exists a hidden context that is not directly observed as part of the state space; we refer to these as latent MDPs (LMDPs). We study several aspects of this problem: (i) we establish a statistical separation between the scenario in which the hidden context is revealed at the end of each episode and the scenario in which it is not, (ii) we identify several assumptions under which this class of problems is learnable with polynomial sample complexity, and (iii) we study a sub-class of LMDPs for which we design a provably efficient algorithm without additional assumptions.
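To fix ideas (a sketch in our own notation rather than the papers' exact definitions), an LMDP can be formalized as a finite mixture of MDPs that share state and action spaces,

$$
\mathcal{M} = \big\{ (M_m, w_m) \big\}_{m=1}^{K}, \qquad M_m = (\mathcal{S}, \mathcal{A}, P_m, r_m, H), \qquad \sum_{m=1}^{K} w_m = 1,
$$

where at the start of each episode a latent index $m \sim w$ is drawn and the agent interacts with $M_m$ for $H$ steps without observing $m$.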
Motivated by this question, we study linear bandits supplied with offline data in which the feature vectors are only partially observed. We suggest a simple approach, based on geometric intuition, that utilizes the offline data to improve the performance of the online algorithm.
A recurring theme in applying sequential decision-making algorithms is the lack of feedback. Indeed, in most applications in which a human supplies the reward information, there is some limitation on the feedback. Motivated by this realization, we formulated the sequential budgeted learning setting, in which there is a limit on the available reward feedback. We studied a simple, principled approach -- which we refer to as confidence-budget matching -- for this setting and applied it to RL and contextual bandit problems. Towards a deeper understanding of the setting, we also studied problem-dependent guarantees for the multi-armed bandit problem in the presence of feedback constraints.
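To give a feel for the setting (this is not the confidence-budget matching algorithm itself), here is a hypothetical sketch of a UCB-style bandit loop in which reward feedback can be observed at most B times over a horizon of T rounds; the function name, the Bernoulli environment, and the querying rule are all illustrative assumptions of ours.

```python
# Hypothetical sketch of a budget-constrained multi-armed bandit loop.
# This only illustrates the setting -- reward feedback may be observed at most B
# times -- and is NOT the confidence-budget matching algorithm from the paper.
import numpy as np

def budgeted_ucb(means, T, B, rng=np.random.default_rng(0)):
    K = len(means)
    counts = np.zeros(K)          # number of observed rewards per arm
    sums = np.zeros(K)            # sum of observed rewards per arm
    budget = B
    total_reward = 0.0
    for t in range(T):
        # Optimistic index; arms with no observations get +inf to force exploration.
        with np.errstate(divide="ignore", invalid="ignore"):
            bonus = np.sqrt(2.0 * np.log(max(t, 1)) / counts)
        ucb = np.where(counts > 0, sums / np.maximum(counts, 1) + bonus, np.inf)
        arm = int(np.argmax(ucb))
        reward = rng.binomial(1, means[arm])   # environment feedback (Bernoulli arms)
        total_reward += reward
        if budget > 0:                         # feedback is revealed only while budget remains
            counts[arm] += 1
            sums[arm] += reward
            budget -= 1
    return total_reward

# Example: 3 arms, horizon 10000, only 500 reward observations allowed.
print(budgeted_ucb(means=[0.2, 0.5, 0.7], T=10_000, B=500))
```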
Motivated by the recent success of policy optimization in RL and safe RL, we established convergence guarantees for policy optimization techniques. Specifically, we analyzed the convergence of the TRPO algorithm (Schulman et al. 2015) and showed that it enjoys performance guarantees similar to those of the Conservative Policy Iteration algorithm (Kakade and Langford 2002). Additionally, we considered the exploration problem and derived performance guarantees for mirror-descent-type algorithms for MDPs and constrained MDPs.
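A representative mirror-descent policy update of the kind analyzed in this line of work takes the form below (a sketch in our notation, with step size $\eta$ and state-action values $Q^{\pi_k}$; the exact algorithms and rates are in the papers):

$$
\pi_{k+1}(\cdot \mid s) \in \arg\max_{\pi \in \Delta(\mathcal{A})} \Big\{ \eta \, \big\langle Q^{\pi_k}(s, \cdot), \pi \big\rangle - \mathrm{KL}\big(\pi \,\|\, \pi_k(\cdot \mid s)\big) \Big\},
$$

whose closed-form solution is $\pi_{k+1}(a \mid s) \propto \pi_k(a \mid s) \exp\big(\eta\, Q^{\pi_k}(s, a)\big)$.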
Lookahead policies have been a widely used tool in practical applications for several decades -- from Samuel's checkers-playing program to AlphaGo. Motivated by these successes, we studied lookahead policies from several perspectives and addressed when, why, and how they should be used. Furthermore, we established a close relation between lookahead policies and reward shaping, which allowed us to scale multi-step greedy algorithms to deep RL.
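As a concrete illustration (a minimal sketch with our own notation and implementation choices, not the algorithms from the papers), an h-step lookahead policy on a known tabular MDP acts greedily with respect to an h-step Bellman lookahead on top of a terminal value estimate:

```python
# Minimal sketch of an h-step lookahead (multi-step greedy) policy for a known
# tabular MDP with transition tensor P[s, a, s'] and reward matrix R[s, a].
# The implementation choices here are ours and are only meant to illustrate the idea.
import numpy as np

def h_step_lookahead_action(s, P, R, V, h, gamma=0.99):
    """Return the greedy action at state s with respect to an h-step lookahead
    on top of a terminal value estimate V (e.g., a learned value function)."""
    n_states, _ = R.shape

    def lookahead_value(state, depth):
        if depth == 0:
            return V[state]
        q = R[state] + gamma * P[state] @ np.array(
            [lookahead_value(sp, depth - 1) for sp in range(n_states)]
        )
        return q.max()

    q_root = R[s] + gamma * P[s] @ np.array(
        [lookahead_value(sp, h - 1) for sp in range(n_states)]
    )
    return int(np.argmax(q_root))

# Toy usage on a random 4-state, 2-action MDP with a zero terminal value estimate.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(4), size=(4, 2))   # P[s, a] is a distribution over next states
R = rng.random((4, 2))
print(h_step_lookahead_action(s=0, P=P, R=R, V=np.zeros(4), h=2))
```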