Past Seminars - Spring 2020

Tuesday 11th August 2020

Speaker: Hoi-To Wai (CUHK)

Title: Finite Time Analysis of Linear Two-Timescale Stochastic Approximation with Markovian Noise

Paper: https://arxiv.org/abs/2002.01268v1

Slides: here

More details:

Finite Time Analysis of Linear Two-Timescale Stochastic Approximation with Markovian Noise

Authors: Maxim Kaledin, Eric Moulines, Alexey Naumov, Vladislav Tadic, Hoi-To Wai

Abstract: The linear two-timescale stochastic approximation (SA) scheme is an important class of algorithms that has become popular in reinforcement learning (RL), particularly for the policy evaluation problem. Recently, a number of works have been devoted to establishing a finite-time analysis of the scheme, especially under the Markovian (non-i.i.d.) noise settings that are ubiquitous in practice. In this paper, we provide a finite-time analysis for linear two-timescale SA. Our bounds show that there is no discrepancy in the convergence rate between Markovian and martingale noise; only the constants are affected by the mixing time of the Markov chain. With an appropriate step-size schedule, the transient term in the expected error bound is o(1/k^c) and the steady-state term is O(1/k), where c > 1 and k is the iteration number. Furthermore, we present an asymptotic expansion of the expected error with a matching lower bound of Ω(1/k). A simple numerical experiment is presented to support our theory.
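
For readers new to the setting, the generic linear two-timescale SA recursion maintains a slow iterate and a fast iterate updated with noisy linear drift terms. The sketch below is illustrative only: the sampling oracle and the step-size exponents are placeholder assumptions, not the paper's exact TD/GTD instantiation or its tuned schedule.

    import numpy as np

    # Illustrative linear two-timescale SA: theta is the slow iterate, w the fast one.
    # `sample(k)` is a placeholder oracle returning noisy linear coefficients
    # (e.g., driven by a Markov chain sample at iteration k).
    def two_timescale_sa(sample, theta0, w0, n_iter=10000):
        theta, w = theta0.copy(), w0.copy()
        for k in range(1, n_iter + 1):
            alpha_k = 1.0 / k            # slow step size
            beta_k = 1.0 / k ** (2 / 3)  # fast step size (decays more slowly)
            A11, A12, A21, A22, b1, b2 = sample(k)
            theta = theta + alpha_k * (b1 - A11 @ theta - A12 @ w)
            w = w + beta_k * (b2 - A21 @ theta - A22 @ w)
        return theta, w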

Speaker bio: Hoi-To Wai is an Assistant Professor in the Department of Systems Engineering and Engineering Management at The Chinese University of Hong Kong (CUHK). He received his PhD degree from Arizona State University in Fall 2017, working with Prof. Anna Scaglione. Besides ASU and CUHK, he has held research positions at LIDS, MIT, working with Prof. Asuman E. Ozdaglar, and at Télécom ParisTech, Ecole Polytechnique, working with Prof. Eric Moulines. He received the 2017 Dean's Dissertation Award at ASU's Ira A. Fulton Schools of Engineering and a Best Student Paper Award at IEEE ICASSP 2018. His current research focuses on network science, graph learning problems, and optimization theory applied to machine learning and distributed signal processing.

Tuesday 4th August 2020

Speaker: Mengdi Wang (Princeton / DeepMind)

Title: Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation

Paper: https://arxiv.org/abs/2002.09516

Slides: here

More details:

Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation

Authors: Yaqi Duan, Mengdi Wang

Abstract: This paper studies the statistical theory of batch data reinforcement learning with function approximation. Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history generated by unknown behavioral policies. We study a regression-based fitted Q iteration method, and show that it is equivalent to a model-based method that estimates a conditional mean embedding of the transition operator. We prove that this method is information-theoretically optimal and has nearly minimal estimation error. In particular, by leveraging contraction property of Markov processes and martingale concentration, we establish a finite-sample instance-dependent error upper bound and a nearly-matching minimax lower bound. The policy evaluation error depends sharply on a restricted χ²-divergence over the function class between the long-term distribution of the target policy and the distribution of past data. This restricted χ²-divergence is both instance-dependent and function-class-dependent. It characterizes the statistical limit of off-policy evaluation. Further, we provide an easily computable confidence bound for the policy evaluator, which may be useful for optimistic planning and safe policy improvement.
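
As a rough illustration of regression-based fitted Q-iteration for policy evaluation, the sketch below evaluates a target policy from logged data with linear features in a discounted setting; the data layout is an assumption, and the paper's episodic details and error analysis are abstracted away.

    import numpy as np

    # phi_sa: (n, d) features of logged (state, action) pairs
    # rewards: (n,) logged rewards
    # phi_next_pi: (n, d) features of (next_state, action chosen by the target policy)
    def fitted_q_evaluation(phi_sa, rewards, phi_next_pi, gamma, n_rounds=200, reg=1e-6):
        d = phi_sa.shape[1]
        w = np.zeros(d)
        gram = phi_sa.T @ phi_sa + reg * np.eye(d)
        for _ in range(n_rounds):
            targets = rewards + gamma * (phi_next_pi @ w)  # Bellman targets under the target policy
            w = np.linalg.solve(gram, phi_sa.T @ targets)  # least-squares regression step
        return w  # estimated value of the target policy: Q_hat(s, a) = phi(s, a) @ w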

Speaker bio: Mengdi Wang is an associate professor at Princeton University and currently a visiting researcher at DeepMind. She received her PhD in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology in 2013, advised by Dimitri P. Bertsekas. Mengdi received the Young Researcher Prize in Continuous Optimization of the Mathematical Optimization Society in 2016 (awarded once every three years), the Princeton SEAS Innovation Award in 2016, the NSF CAREER Award in 2017, the Google Faculty Award in 2017, and the MIT Tech Review 35-Under-35 Innovation Award (China region) in 2018. Her research focuses on data-driven stochastic optimization and applications in machine learning and reinforcement learning.

Tuesday 28th July 2020

Speaker: Kwang-Sung Jun (U. Arizona)

Title: Crush Optimism with Pessimism: Structured Bandits Beyond Asymptotic Optimality

Paper: http://arxiv.org/abs/2006.08754

Slides: here


More details:

Crush Optimism with Pessimism: Structured Bandits Beyond Asymptotic Optimality

Authors: Kwang-Sung Jun, Chicheng Zhang

Abstract: We study stochastic structured bandits for minimizing regret. The fact that the popular optimistic algorithms do not achieve the asymptotic instance-dependent regret optimality (asymptotic optimality for short) has recently allured researchers. On the other hand, it is known that one can achieve a bounded regret (i.e., does not grow indefinitely with n) in certain instances. Unfortunately, existing asymptotically optimal algorithms rely on forced sampling that introduces an ω(1) term w.r.t. the time horizon n in their regret, failing to adapt to the "easiness" of the instance. In this paper, we focus on the finite hypothesis class and ask if one can achieve the asymptotic optimality while enjoying bounded regret whenever possible. We provide a positive answer by introducing a new algorithm called CRush Optimism with Pessimism (CROP) that eliminates optimistic hypotheses by pulling the informative arms indicated by a pessimistic hypothesis. Our finite-time analysis shows that CROP (i) achieves a constant-factor asymptotic optimality and, thanks to the forced-exploration-free design, (ii) adapts to bounded regret, and (iii) its regret bound scales not with the number of arms K but with an effective number of arms that we introduce. We also discuss a problem class where CROP can be exponentially better than existing algorithms in nonasymptotic regimes. Finally, we observe that even a clairvoyant oracle who plays according to the asymptotically optimal arm pull scheme may suffer a linear worst-case regret, indicating that it may not be the end of optimism.
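
For context, the finite-hypothesis-class setting works with a finite set of candidate mean-reward models and discards those contradicted by the data. The loop below only illustrates that elimination step, with a placeholder confidence radius; CROP itself additionally uses a pessimistic hypothesis to decide which informative arms to pull.

    import numpy as np

    # hypotheses: (m, K) candidate arm-mean vectors; means_hat/counts: empirical statistics;
    # conf_width(n) is a placeholder confidence radius after n pulls (not CROP's exact rule).
    def eliminate(hypotheses, means_hat, counts, conf_width):
        alive = []
        for h in hypotheses:
            consistent = all(
                counts[a] == 0 or abs(h[a] - means_hat[a]) <= conf_width(counts[a])
                for a in range(len(h))
            )
            if consistent:
                alive.append(h)
        return np.array(alive)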

Speaker bio: Kwang-Sung Jun is an assistant professor in the CS department at the University of Arizona. Before joining UA, he was a postdoc at Boston University with Dr. Francesco Orabona. Before that, he spent nine years at the University of Wisconsin-Madison, earning a PhD with Dr. Xiaojin (Jerry) Zhu and doing a postdoc with Drs. Robert Nowak, Rebecca Willett, and Stephen Wright. His research interests are bandits and online learning and, more broadly, interactive machine learning and adaptive data collection methods, with applications including scientific discovery.


Tuesday 21st July 2020

Speaker: Andrea Zanette (Stanford)

Title: Learning Near Optimal Policies with Low Inherent Bellman Error

Paper: https://arxiv.org/abs/2003.00153

Slides: here

More details:

Learning Near Optimal Policies with Low Inherent Bellman Error

Authors: Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, Emma Brunskill

Abstract: We study the exploration problem with approximate linear action-value functions in episodic reinforcement learning under the notion of low inherent Bellman error, a condition normally employed to show convergence of approximate value iteration. First we relate this condition to other common frameworks and show that it is strictly more general than the low rank (or linear) MDP assumption of prior work. Second we provide an algorithm with a high probability regret bound $\widetilde O(\sum_{t=1}^H d_t \sqrt{K} + \sum_{t=1}^H \sqrt{d_t}\, \mathcal{I}\, K)$ where $H$ is the horizon, $K$ is the number of episodes, $\mathcal{I}$ is the value of the inherent Bellman error and $d_t$ is the feature dimension at timestep $t$. In addition, we show that the result is unimprovable beyond constants and logs by showing a matching lower bound. This has two important consequences: 1) the algorithm has the optimal statistical rate for this setting, which is more general than prior work on low-rank MDPs; 2) the lack of closedness (measured by the inherent Bellman error) is only amplified by $\sqrt{d_t}$ despite working in the online setting. Finally, the algorithm reduces to the celebrated LinUCB when $H=1$ but with a different choice of the exploration parameter that allows handling misspecified contextual linear bandits. While computational tractability questions remain open for the MDP setting, this enriches the class of MDPs with a linear representation for the action-value function where statistically efficient reinforcement learning is possible.
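
Roughly, for linear value classes $\mathcal{Q}_t = \{\phi_t(\cdot,\cdot)^\top \theta : \theta \in \mathcal{B}_t\}$, the inherent Bellman error measures how far the Bellman backup of any representable next-step value function falls outside the current-step class; up to details of the parameter sets, it takes the form

$$\mathcal{I} \;=\; \max_{t}\; \sup_{\theta_{t+1} \in \mathcal{B}_{t+1}}\; \inf_{\theta_t \in \mathcal{B}_t}\; \sup_{(s,a)} \Big| \phi_t(s,a)^\top \theta_t - \big(\mathcal{T}_t Q_{\theta_{t+1}}\big)(s,a) \Big|,$$

where $\mathcal{T}_t$ is the Bellman operator at timestep $t$. Setting $\mathcal{I} = 0$ recovers the closedness condition used in analyses of approximate value iteration.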

Speaker bio: Andrea is a PhD candidate in the Institute for Computational and Mathematical Engineering at Stanford University, advised by Prof. Emma Brunskill and Prof. Mykel J. Kochenderfer. He also works closely with Alessandro Lazaric from Facebook Artificial Intelligence Research. His research focuses on provably efficient methods for reinforcement learning; in particular, he develops agents capable of autonomous exploration. His research is currently supported by Total.

Tuesday 7th July 2020

Speaker: Fei Feng (UCLA)

Title: Provably Efficient Exploration for RL with Unsupervised Learning

Paper: https://arxiv.org/abs/2003.06898

Slides: here

More details:

Provably Efficient Exploration for RL with Unsupervised Learning

Authors: Fei Feng, Ruosong Wang, Wotao Yin, Simon S. Du, Lin F. Yang

Abstract: We study how to use unsupervised learning for efficient exploration in reinforcement learning with rich observations generated from a small number of latent states. We present a novel algorithmic framework that is built upon two components: an unsupervised learning algorithm and a no-regret reinforcement learning algorithm. We show that our algorithm provably finds a near-optimal policy with sample complexity polynomial in the number of latent states, which is significantly smaller than the number of possible observations. Our result gives theoretical justification to the prevailing paradigm of using unsupervised learning for efficient exploration (Tang et al., 2017, Bellemare et al., 2016).
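
A schematic of the two-component framework (not the paper's algorithm): an unsupervised learner compresses rich observations into latent-state indices, and a tabular no-regret RL algorithm then runs on those indices. The environment interface and the collect_observations and TabularUCBAgent helpers below are hypothetical placeholders.

    from sklearn.cluster import KMeans

    # Hypothetical glue code illustrating "decode observations, then run tabular RL".
    def decode_then_learn(env, n_latent, n_warmup_obs, n_episodes,
                          collect_observations, TabularUCBAgent):
        obs_buffer = collect_observations(env, n_warmup_obs)      # warm-up data for the decoder
        decoder = KMeans(n_clusters=n_latent).fit(obs_buffer)     # unsupervised learning component
        agent = TabularUCBAgent(n_states=n_latent, n_actions=env.n_actions)
        for _ in range(n_episodes):
            obs, done = env.reset(), False
            while not done:
                s = int(decoder.predict(obs.reshape(1, -1))[0])   # decoded latent state
                a = agent.act(s)
                obs, r, done = env.step(a)
                agent.update(s, a, r)
        return agent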

Speaker bio: Fei Feng is a Ph.D. candidate at UCLA Math Department supervised by Wotao Yin. Her research interests are optimization, reinforcement learning (RL), and parallel computing, lately focusing on topics within RL such as transfer, efficient exploration, and the relationship with mean-field games.

Tuesday 30th June 2020

Speaker: Max Simchowitz (UC Berkeley)

Title: Improper Learning for Non-Stochastic Control

Paper: https://arxiv.org/abs/2001.09254

Slides: here

More details:

Improper Learning for Non-Stochastic Control

Authors: Max Simchowitz, Karan Singh, Elad Hazan

Abstract: We consider the problem of controlling a possibly unknown linear dynamical system with adversarial perturbations, adversarially chosen convex loss functions, and partially observed states, known as non-stochastic control. We introduce a controller parametrization based on the denoised observations, and prove that applying online gradient descent to this parametrization yields a new controller which attains sublinear regret vs. a large class of closed-loop policies. In the fully-adversarial setting, our controller attains an optimal regret bound of sqrt(T) when the system is known, and, when combined with an initial stage of least-squares estimation, T^{2/3} when the system is unknown; both yield the first sublinear regret for the partially observed setting. Our bounds are the first in the non-stochastic control setting that compete with all stabilizing linear dynamical controllers, not just state feedback. Moreover, in the presence of semi-adversarial noise containing both stochastic and adversarial components, our controller attains the optimal regret bounds of polylog(T) when the system is known, and sqrt(T) when unknown. To our knowledge, this gives the first end-to-end sqrt(T) regret for online Linear Quadratic Gaussian controller, and applies in a more general setting with adversarial losses and semi-adversarial noise.
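
Schematically, this line of work parametrizes the controller linearly in a window of past disturbance-like signals and runs projected online gradient descent on the induced convex losses; in the paper's partially observed setting, the role of those signals is played by the denoised observations. A generic form (not the paper's exact output-feedback parametrization) is

$$u_t \;=\; \sum_{i=1}^{m} M^{[i]}_t\, w_{t-i}, \qquad M_{t+1} \;=\; \Pi_{\mathcal{M}}\!\big(M_t - \eta\, \nabla_M \ell_t(M_t)\big),$$

where $w_{t-i}$ are the estimated disturbance/denoised-observation signals, $\ell_t$ is the counterfactual convex loss at round $t$, and $\Pi_{\mathcal{M}}$ projects onto a bounded set of controller parameters.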

Speaker bio: Max Simchowitz is a PhD student in the EECS department at UC Berkeley, co-advised by Ben Recht and Michael Jordan. He is currently supported by an Open Philanthropy fellowship, and in the past has received support from an NSF GRFP grant and a Berkeley Fellowship. His recent work focuses on the theoretical foundations of online linear control and reinforcement learning, with past research ranging broadly across topics in adaptive sampling, multi-arm bandits, complexity of convex and non-convex optimization, and fairness in machine learning.

Tuesday 23rd June 2020

Speaker: Nan Jiang (UIUC)

Title: Information-Theoretic Considerations in Batch Reinforcement Learning

Paper: https://arxiv.org/abs/1905.00360

Slides: here

More details:

Information-Theoretic Considerations in Batch Reinforcement Learning

Authors: Jinglin Chen, Nan Jiang

Abstract: Value-function approximation methods that operate in batch mode have foundational importance to reinforcement learning (RL). Finite sample guarantees for these methods often crucially rely on two types of assumptions: (1) mild distribution shift, and (2) representation conditions that are stronger than realizability. However, the necessity ("why do we need them?") and the naturalness ("when do they hold?") of such assumptions have largely eluded the literature. In this paper, we revisit these assumptions and provide theoretical results towards answering the above questions, and make steps towards a deeper understanding of value-function approximation.
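
For concreteness, the two assumptions are commonly stated (details vary across papers) as a bounded concentrability coefficient and completeness of the function class under the Bellman operator:

$$\sup_{\text{admissible }\nu}\,\Big\|\frac{\nu}{\mu}\Big\|_{\infty} \le C \quad \text{(mild distribution shift)}, \qquad \mathcal{T} f \in \mathcal{F}\ \ \text{for all } f \in \mathcal{F} \quad \text{(completeness)},$$

where $\mu$ is the data distribution. Completeness is strictly stronger than realizability, which only requires $Q^{\star} \in \mathcal{F}$.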

Speaker bio: Nan Jiang is an assistant professor of Computer Science at University of Illinois at Urbana-Champaign. Prior to joining UIUC, he was a postdoc researcher at Microsoft Research NYC. He received his PhD in Computer Science and Engineering at University of Michigan, advised by Satinder Singh. His research interests lie in the theory of reinforcement learning, mostly focusing on sample efficiency. Specific research topics include sample complexity of exploration under function approximation, state representation learning, off-policy evaluation, spectral learning of dynamical systems, etc. He is the recipient of the Best Paper Award at AAMAS 2015 and Rackham Predoctoral Fellowship in 2016.

Tuesday 16th June 2020

Speaker: Niao He (UIUC)

Title: A Unified Switching System Perspective and O.D.E. Analysis of Q-Learning Algorithms

Paper: https://arxiv.org/abs/1912.02270

Slides: here

More details:

A Unified Switching System Perspective and O.D.E. Analysis of Q-Learning Algorithms

Authors: Donghwan Lee, Niao He

Abstract: In this paper, we introduce a unified framework for analyzing a large family of Q-learning algorithms, based on switching system perspectives and ODE-based stochastic approximation. We show that the nonlinear ODE models associated with these Q-learning algorithms can be formulated as switched linear systems, and analyze their asymptotic stability by leveraging existing switching system theories. Our approach provides the first O.D.E. analysis of the asymptotic convergence of various Q-learning algorithms, including asynchronous Q-learning and averaging Q-learning. We also extend the approach to analyze Q-learning with linear function approximation and derive a new sufficient condition for its convergence.
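
Schematically, the ODE associated with tabular Q-learning is piecewise linear in the iterate, with the greedy policy acting as the switching signal:

$$\dot{q}_t \;=\; D\big(r + \gamma P\,\Pi_{\pi_{q_t}} q_t - q_t\big) \;=\; A_{\sigma(q_t)}\, q_t + b_{\sigma(q_t)},$$

where $D$ is a diagonal matrix of state-action visit frequencies, $P$ the transition matrix, $\Pi_{\pi_q}$ the selector of the greedy policy $\pi_q$, and $\sigma(q)$ the induced switching signal (this is a sketch of the viewpoint; the paper gives the precise models for each Q-learning variant). Stability of this switched linear system is what the switching-system theory is applied to.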

Speaker bio: Niao He is currently an Assistant Professor in the Department of Industrial and Enterprise Systems Engineering (ISE) at the University of Illinois at Urbana-Champaign (UIUC). She is also affiliated with the Department of Electrical & Computer Engineering and the Coordinated Science Laboratory (CSL). She received a Ph.D. in Operations Research under the supervision of Arkadi Nemirovski at the Georgia Institute of Technology in 2015 and a B.S. in Mathematics from the University of Science and Technology of China in 2010. Niao's research interests lie at the interface of optimization and machine learning, with a primary focus on developing fast algorithms and theoretical analyses for large-scale optimization problems that stem from big-data applications in machine learning, signal processing, statistical inference, reinforcement learning, healthcare analytics, financial engineering, etc.

Tuesday 9th June 2020

Speaker: Shie Mannor (Technion)

Title: Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs

Slides: here

Paper: https://arxiv.org/abs/1909.02769

More details:

Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs

Authors: Lior Shani, Yonathan Efroni, Shie Mannor

Abstract: Trust region policy optimization (TRPO) is a popular and empirically successful policy search algorithm in Reinforcement Learning (RL) in which a surrogate problem, that restricts consecutive policies to be 'close' to one another, is iteratively solved. Nevertheless, TRPO has been considered a heuristic algorithm inspired by Conservative Policy Iteration (CPI). We show that the adaptive scaling mechanism used in TRPO is in fact the natural "RL version" of traditional trust-region methods from convex analysis. We first analyze TRPO in the planning setting, in which we have access to the model and the entire state space. Then, we consider sample-based TRPO and establish O(1/sqrt(N)) convergence rate to the global optimum. Importantly, the adaptive scaling mechanism allows us to analyze TRPO in regularized MDPs for which we prove fast rates of O(1/N), much like results in convex optimization. This is the first result in RL of better rates when regularizing the instantaneous cost or reward.
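
Viewed as a trust-region/mirror-descent scheme, the per-state update has roughly the form

$$\pi_{k+1}(\cdot \mid s) \;\in\; \arg\max_{\pi \in \Delta_A}\; t_k\,\big\langle Q^{\pi_k}(s,\cdot),\, \pi \big\rangle \;-\; B_{\omega}\big(\pi,\, \pi_k(\cdot \mid s)\big),$$

where $B_{\omega}$ is a Bregman divergence: the Euclidean choice gives projected policy gradient, while the KL choice gives multiplicative updates, and the adaptive choice of the scaling $t_k$ is what drives the faster rates in regularized MDPs (this is a schematic rendering; the paper states the exact sample-based update).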

Speaker bio: Shie Mannor received the B.Sc. degree in electrical engineering, the B.A. degree in mathematics, and the Ph.D. degree in electrical engineering from the Technion—Israel Institute of Technology, Haifa, Israel, in 1996, 1996, and 2002, respectively. From 2002 to 2004, he was a Fulbright Scholar and a Post-Doctoral Associate with MIT. From 2004 to 2010, he was with the Department of Electrical and Computer Engineering, McGill University, where he was the Canada Research Chair in machine learning. Since 2008, he has been with the Faculty of Electrical Engineering, Technion—Israel Institute of Technology, where he is currently a Professor. His research interests include machine learning and pattern recognition, planning and control, multi-agent systems, and communications.

Tuesday 26th May 2020

Speaker: Adithya M. Devraj (University of Florida and Stanford University)

Title: Q-Learning with Uniformly Bounded Variance: Large Discounting Is Not a Barrier to Fast Learning

Slides: here

Paper: https://arxiv.org/abs/2002.10301

More details:

Q-Learning with Uniformly Bounded Variance: Large Discounting Is Not a Barrier to Fast Learning

Authors: Adithya M. Devraj, Sean P. Meyn

Abstract: Sample complexity bounds are a common performance metric in the Reinforcement Learning literature. In the discounted cost, infinite horizon setting, all of the known bounds have a factor that is a polynomial in 1/(1 − β), where β < 1 is the discount factor. For a large discount factor, these bounds seem to imply that a very large number of samples is required to achieve an ε-optimal policy. The objective of the present work is to introduce a new class of algorithms that have sample complexity uniformly bounded for all β < 1. One may argue that this is impossible, due to a recent min-max lower bound. The explanation is that this previous lower bound is for a specific problem, which we modify, without compromising the ultimate objective of obtaining an ε-optimal policy. Specifically, we show that the asymptotic variance of the Q-learning algorithm with an optimized step-size sequence is a quadratic function of 1/(1 − β); an expected, and essentially known result. The new relative Q-learning algorithm proposed here is shown to have asymptotic variance that is a quadratic in 1/(1 − ρ∗β), where 1 − ρ∗ > 0 is an upper bound on the spectral gap of an optimal transition matrix.
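
The tabular sketch below contrasts the standard Q-learning target with the relative variant described above: the relative update subtracts a fixed weighted average of the current Q-estimates, which shifts the fixed point by a constant offset without changing the greedy policy. The weighting mu and scaling kappa are placeholders; the exact choices and step-size schedule are specified in the paper.

    import numpy as np

    # One tabular Q-learning update; passing mu/kappa enables a relative-style
    # correction (placeholder constants, not the paper's exact parameters).
    def q_update(Q, s, a, r, s_next, beta, step, mu=None, kappa=0.0):
        target = r + beta * np.max(Q[s_next])
        if mu is not None:
            target -= kappa * float(mu @ Q.reshape(-1))  # subtract weighted average of Q
        Q[s, a] += step * (target - Q[s, a])
        return Q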

Speaker bio: Adithya is a postdoc at Stanford University, where he is working with Prof. Benjamin Van Roy. Before arriving at Stanford, he obtained his PhD in December 2019 from University of Florida, Gainesville, where he worked with Prof. Sean Meyn. His dissertation was on designing Reinforcement Learning algorithms with optimal variance.

Tuesday 19th May 2020

Speaker: Daniel Russo (Columbia University)

Title: Global Optimality Guarantees For Policy Gradient Methods

Slides: here

Paper: http://djrusso.github.io/docs/policy_grad_optimality.pdf

More details:

Global Optimality Guarantees For Policy Gradient Methods

Authors: Jalaj Bhandari, Daniel Russo

Abstract: Policy gradient methods are perhaps the most widely used class of reinforcement learning algorithms. These methods apply to complex, poorly understood control problems by performing stochastic gradient descent over a parameterized class of policies. Unfortunately, even for simple control problems solvable by classical techniques, policy gradient algorithms face non-convex optimization problems and are widely understood to converge only to local minima. This work identifies structural properties -- shared by finite MDPs and several classic control problems -- which guarantee that the policy gradient objective function has no suboptimal local minima despite being non-convex. When these assumptions are relaxed, our work gives conditions under which any local minimum is near-optimal, where the error bound depends on a notion of the expressive capacity of the policy class.
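
For reference, the objective in question is the expected return of a parameterized policy,

$$\max_{\theta}\; J(\theta) \;:=\; \mathbb{E}_{\pi_\theta}\Big[\textstyle\sum_{t} \gamma^{t}\, r(s_t, a_t)\Big],$$

which is non-convex in $\theta$ in general; the structural properties identified in the paper are what rule out suboptimal local minima of $J$ despite this non-convexity. (The discounted form shown here is one standard instantiation; the paper states its setting in its own notation.)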

Speaker bio: Daniel Russo joined the Decision, Risk, and Operations division of the Columbia Business School as an assistant professor in Summer 2017. Prior to joining Columbia, he spent one year as an assistant professor in the MEDS department at Northwestern's Kellogg School of Management and one year as a Postdoctoral Researcher at Microsoft Research in New England. Daniel received his PhD from Stanford University in 2015, where he was advised by Benjamin Van Roy. His research lies at the intersection of statistical machine learning and sequential decision-making, and contributes to the fields of online optimization, reinforcement learning, and sequential design of experiments. He is interested in the design and analysis of algorithms that learn over time to make increasingly effective decisions through interacting with a poorly understood environment.

Tuesday 12th May 2020

Speaker: Sarah Dean (UC Berkeley)

Title: On the Sample Complexity of the Linear Quadratic Regulator

Slides: here

Paper: https://arxiv.org/abs/1710.01688

More details:

On the Sample Complexity of the Linear Quadratic Regulator

Authors: Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, Stephen Tu

Abstract: This paper addresses the optimal control problem known as the Linear Quadratic Regulator in the case when the dynamics are unknown. We propose a multi-stage procedure, called Coarse-ID control, that estimates a model from a few experimental trials, estimates the error in that model with respect to the truth, and then designs a controller using both the model and uncertainty estimate. Our technique uses contemporary tools from random matrix theory to bound the error in the estimation procedure. We also employ a recently developed approach to control synthesis called System Level Synthesis that enables robust control design by solving a convex optimization problem. We provide end-to-end bounds on the relative error in control cost that are nearly optimal in the number of parameters and that highlight salient properties of the system to be controlled such as closed-loop sensitivity and optimal control magnitude. We show experimentally that the Coarse-ID approach enables efficient computation of a stabilizing controller in regimes where simple control schemes that do not take the model uncertainty into account fail to stabilize the true system.
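
As an illustration of the first stage of Coarse-ID control, the sketch below estimates the unknown dynamics (A, B) by ordinary least squares from a few rollouts excited by random inputs. The rollout data layout is an assumption; the error quantification and the System Level Synthesis step are the subject of the paper.

    import numpy as np

    # Least-squares identification of x_{t+1} = A x_t + B u_t + w_t from rollouts.
    # rollouts: list of (X, U) with X of shape (T+1, n) states and U of shape (T, m) inputs.
    def estimate_dynamics(rollouts):
        Z, Y = [], []
        for X, U in rollouts:
            Z.append(np.hstack([X[:-1], U]))  # regressors [x_t, u_t]
            Y.append(X[1:])                   # targets x_{t+1}
        Z, Y = np.vstack(Z), np.vstack(Y)
        theta, *_ = np.linalg.lstsq(Z, Y, rcond=None)  # theta.T = [A_hat  B_hat]
        n = rollouts[0][0].shape[1]
        return theta[:n].T, theta[n:].T       # A_hat (n x n), B_hat (n x m)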

Speaker bio: Sarah Dean is a PhD candidate in the Department of Electrical Engineering and Computer Science at UC Berkeley, where she is advised by Ben Recht. She received her BSE in electrical engineering and mathematics from the University of Pennsylvania in 2016. Sarah is broadly interested in designing methods for and understanding the implications of data-driven optimization. She has worked specifically on topics in optimal control, machine learning, and computational imaging, with motivating applications in robotics and recommender systems.

Tuesday 5th May 2020

Speaker: Matthieu Geist (Google)

Title: Leverage the Average: An Analysis of Regularization in RL

Slides: here

Paper: arxiv.org/abs/2003.14089

More details:

Leverage the Average: An Analysis of Regularization in RL

Authors: Nino Vieillard, Tadashi Kozuno, Bruno Scherrer, Olivier Pietquin, Rémi Munos, Matthieu Geist

Abstract: Building upon the formalism of regularized Markov decision processes, we study the effect of Kullback-Leibler (KL) and entropy regularization in reinforcement learning. Through an equivalent formulation of the related approximate dynamic programming (ADP) scheme, we show that a KL penalty amounts to averaging q-values. This equivalence allows drawing connections between a priori disconnected methods from the literature, and proving that a KL regularization indeed leads to averaging errors made at each iteration of value function update. With the proposed theoretical analysis, we also study the interplay between KL and entropy regularization. When the considered ADP scheme is combined with neural-network-based stochastic approximations, the equivalence is lost, which suggests a number of different ways to do regularization. Because this goes beyond what we can analyse theoretically, we extensively study this aspect empirically.
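
The key equivalence can be seen schematically: with a KL penalty of weight $\lambda$ toward the previous policy, the greedy step becomes multiplicative, and unrolling it shows the policy is a softmax over the sum (equivalently, a scaled average) of all past q-estimates:

$$\pi_{k+1}(\cdot \mid s) \;\propto\; \pi_k(\cdot \mid s)\, e^{q_k(s,\cdot)/\lambda} \;\propto\; \exp\!\Big(\tfrac{1}{\lambda}\textstyle\sum_{j \le k} q_j(s,\cdot)\Big) \quad \text{(for a uniform initial policy)},$$

so errors made in successive q-estimates are averaged rather than compounded, which is the mechanism behind the error-propagation results.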

Speaker bio: Matthieu Geist obtained an Electrical Engineering degree and an MSc degree in Applied Mathematics in Sept. 2006 (Supélec, France), a PhD degree in Applied Mathematics in Nov. 2009 (University Paul Verlaine of Metz, France), and a Habilitation degree in Feb. 2016 (University Lille 1, France). Between Feb. 2010 and Sept. 2017, he was an assistant professor at CentraleSupélec, France. In Sept. 2017, he joined the University of Lorraine, France, as a full professor in Applied Mathematics (Interdisciplinary Laboratory for Continental Environments, CNRS-UL). Since Sept. 2018, he has been on secondment at Google Brain as a research scientist (Paris, France). His research interests include machine learning, especially reinforcement learning and imitation learning.

Tuesday 28th April 2020

Speaker: Chi Jin (Princeton University)

Title: Provably Efficient Reinforcement Learning with Linear Function Approximation

Slides: here

Paper: arxiv.org/abs/1907.05388v2

More details:

Provably Efficient Reinforcement Learning with Linear Function Approximation

Authors: Chi Jin, Zhuoran Yang, Zhaoran Wang, Michael I. Jordan

Abstract: Modern Reinforcement Learning (RL) is commonly applied to practical problems with an enormous number of states, where function approximation must be deployed to approximate either the value function or the policy. The introduction of function approximation raises a fundamental set of challenges involving computational and statistical efficiency, especially given the need to manage the exploration/exploitation tradeoff. As a result, a core RL question remains open: how can we design provably efficient RL algorithms that incorporate function approximation? This question persists even in a basic setting with linear dynamics and linear rewards, for which only linear function approximation is needed.

This paper presents the first provable RL algorithm with both polynomial runtime and polynomial sample complexity in this linear setting, without requiring a "simulator" or additional assumptions. Concretely, we prove that an optimistic modification of Least-Squares Value Iteration (LSVI)---a classical algorithm frequently studied in the linear setting---achieves $\tilde{O}(\sqrt{d^3 H^3 T})$ regret, where $d$ is the ambient dimension of the feature space, $H$ is the length of each episode, and $T$ is the total number of steps. Importantly, such regret is independent of the number of states and actions.
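
A condensed sketch of the optimistic LSVI backup: at each step it ridge-regresses optimistic targets onto the features and adds an elliptical-confidence bonus. The data layout, feature map phi, and bonus scale beta below are illustrative assumptions (the paper additionally truncates values at H and specifies beta precisely).

    import numpy as np

    # data[h]: list of (s, a, r, s_next) transitions collected at step h across episodes.
    def lsvi_ucb_backup(data, phi, actions, H, d, lam=1.0, beta=1.0):
        w = [np.zeros(d) for _ in range(H + 1)]
        Lam_inv = [np.eye(d) / lam for _ in range(H + 1)]

        def q_opt(h, s, a):  # optimistic Q-value: linear fit plus confidence bonus
            f = phi(s, a)
            return f @ w[h] + beta * np.sqrt(f @ Lam_inv[h] @ f)

        for h in reversed(range(H)):
            Phi = np.array([phi(s, a) for (s, a, _, _) in data[h]])
            Lam_inv[h] = np.linalg.inv(lam * np.eye(d) + Phi.T @ Phi)
            y = np.array([r + (max(q_opt(h + 1, s2, b) for b in actions) if h + 1 < H else 0.0)
                          for (_, _, r, s2) in data[h]])
            w[h] = Lam_inv[h] @ (Phi.T @ y)  # ridge regression on optimistic targets
        return w, Lam_inv                    # act greedily w.r.t. q_opt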

Speaker Bio: Chi Jin is an assistant professor of Electrical Engineering at Princeton University. He obtained his Ph.D. in Computer Science at UC Berkeley, advised by Michael I. Jordan, and received his B.S. in Physics from Peking University. His research interest lies in theoretical machine learning, with special emphases on nonconvex optimization and reinforcement learning. His representative work includes proving that noisy gradient descent and accelerated gradient descent escape saddle points efficiently, proving sample complexity bounds for Q-learning and LSVI with UCB, and designing near-optimal algorithms for minimax optimization. He is the recipient of the best paper award at the ICML 2018 workshop on "Exploration in Reinforcement Learning".