Carlos Florensa

PhD in Computer Science at UC Berkeley

About

Carlos Florensa is a Research Scientist at Covariant.ai, building universal AI that allows robots to see, reason, and act on the world around them. His latest research enables material-handling robots in warehouses to self-improve based on the experience collected by all deployed robots.

Carlos completed his Ph.D. in Artificial Intelligence in 2020 at the Robotics Learning Lab at UC Berkeley, under the supervision of Prof. Pieter Abbeel. His thesis focuses on enabling new robotics applications in variable environments with minimal supervision. To this end, he developed Deep Reinforcement Learning algorithms that leverage policy hierarchy, few-shot learning, and automatic curriculum generation as key elements to solve complex locomotion and manipulation tasks in ways that scale up to real-world scenarios. During his Ph.D., Carlos was supported first by the La Caixa Fellowship and later by the Berkeley Deep Drive Fellowship. In 2018 he was a research intern at DeepMind (London), working on self-supervised learning with Nicolas Heess. In 2019 he was a research intern at NVIDIA's Seattle Robotics Lab, working on uncertainty-aware policy optimization under the supervision of Dieter Fox.

Prior to his graduate work, Carlos obtained a double degree in Mathematics and Industrial Engineering with Honors in 2015 at the Center for High Interdisciplinary Training (CFIS) of the Polytechnic University of Catalonia. During this time, he conducted research in Operations Research for distribution systems. In 2015 he was a visiting scholar with Prof. Ignacio Grossmann at Carnegie Mellon University, writing his undergraduate thesis on multi-level optimization for capacity planning. He also studied efficient ways to control electric grids subject to increasingly stochastic sources (such as renewables), at the École Polytechnique Fédérale de Lausanne in 2014 with Prof. Rachid Cherkaoui and at Argonne National Laboratory in 2013 with Prof. Victor Zavala. He also worked at the Institute of Photonic Sciences (ICFO, Barcelona), using video processing for non-contact pulse oximetry with Prof. Turgut Durduran.

Publications

Which Mutual-Information Representation Learning Objectives are Sufficient for Control?

Mutual information maximization provides an appealing formalism for learning representations of data. In the context of reinforcement learning (RL), such representations can accelerate learning by discarding irrelevant and redundant information, while retaining the information necessary for control. Much of the prior work on these methods has addressed the practical difficulties of estimating mutual information from samples of high-dimensional observations, while comparatively less is understood about which mutual information objectives yield representations that are sufficient for RL from a theoretical perspective. In this paper, we formalize the sufficiency of a state representation for learning and representing the optimal policy, and study several popular mutual-information based objectives through this lens. Surprisingly, we find that two of these objectives can yield insufficient representations given mild and common assumptions on the structure of the MDP. We corroborate our theoretical results with empirical experiments on a simulated game environment with visual observations.
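
As a rough illustration of the kind of objectives analyzed, a few mutual-information quantities commonly maximized over a learned encoding z_t = φ(s_t) can be written as follows. The notation is a simplified sketch for intuition, not necessarily the exact formulation in the paper.

```latex
% Representative mutual-information objectives over a learned encoding
% z_t = \phi(s_t); whether maximizing each one yields a representation
% sufficient for learning the optimal policy is the question studied.
\begin{align}
  J_{\mathrm{fwd}}(\phi)   &= I\!\left(z_{t+1};\; z_t, a_t\right)      && \text{(forward / transition information)} \\
  J_{\mathrm{inv}}(\phi)   &= I\!\left(a_t;\; z_{t+1} \mid z_t\right)  && \text{(inverse information)} \\
  J_{\mathrm{state}}(\phi) &= I\!\left(z_t;\; z_{t+k}\right)           && \text{(state-only temporal information)}
\end{align}
```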

Cite as: Kate Rakelly, Abhishek Gupta, Carlos Florensa, Sergey Levine. Which Mutual-Information Representation Learning Objectives are Sufficient for Control? Advances in Neural Information Processing Systems, NeurIPS 2021

GUAPO: Guided Uncertainty-Aware Policy Optimization

Traditional robotic approaches rely on an accurate model of the environment, a detailed description of how to perform the task, and a robust perception system to keep track of the current state. On the other hand, learning-based approaches, such as Reinforcement Learning (RL), can operate directly from raw sensory inputs with only a reward signal to describe the task, but are extremely sample-inefficient and brittle. In this work we combine the strengths of model-based methods with the flexibility of learning-based methods to obtain a general method that is able to overcome inaccuracies in the robotics perception/actuation pipeline, while requiring minimal interaction with the environment. This is achieved by leveraging uncertainty estimates to divide the space in regions where the given model-based policy is reliable, and regions where it may have flaws or not be well defined. In these uncertain regions, we show that a local RL policy can be learned directly from raw sensory inputs. We test our algorithm, Guided Uncertainty-Aware Policy Optimization (GUAPO), on a real-world robot performing tight-fitting peg insertion.
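
A minimal sketch of the switching idea: trust the model-based controller where perception uncertainty is low, and hand control to a local RL policy in the uncertain region. The names `pose_estimate`, `model_based_action`, and `rl_policy`, as well as the threshold value, are hypothetical placeholders rather than the released implementation.

```python
import numpy as np

UNCERTAINTY_THRESHOLD = 0.02  # assumed tolerance on the pose estimate (illustrative)

def select_action(obs, pose_estimate, pose_std, rl_policy, model_based_action):
    """Blend model-based and model-free control based on perception uncertainty."""
    if np.max(pose_std) < UNCERTAINTY_THRESHOLD:
        # Perception is reliable here: follow the model-based policy,
        # e.g., a motion plan toward the estimated goal pose.
        return model_based_action(pose_estimate)
    # Inside the uncertain region: fall back to the local RL policy,
    # which acts directly on raw sensory inputs.
    return rl_policy(obs)
```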

Cite as: Michelle A. Lee*, Carlos Florensa*, Jonathan Tremblay, Nathan Ratliff, Animesh Garg, Fabio Ramos, and Dieter Fox. Guided Uncertainty-Aware Policy Optimization: Combining Learning and Model-Based Strategies for Sample-Efficient Policy Learning. International Conference on Robotics and Automation, ICRA 2020.

Best paper award talk at the Robot Learning workshop, NeurIPS 2019.

Videos and supplementary material available.

Goal-conditioned Imitation Learning

Designing rewards for Reinforcement Learning (RL) is challenging because the reward needs to convey the desired task, be efficient to optimize, and be easy to compute. The latter is particularly problematic when applying RL to robotics, where detecting whether the desired configuration has been reached might require considerable supervision and instrumentation. Furthermore, we are often interested in being able to reach a wide range of configurations, so setting up a different reward every time may be impractical. Methods like Hindsight Experience Replay (HER) have recently shown promise for learning policies able to reach many goals without the need for a reward function. Unfortunately, without tricks like resetting to points along the trajectory, HER might take a very long time to discover how to reach certain areas of the state-space. In this work we investigate different approaches to incorporating demonstrations to drastically speed up the convergence to a policy able to reach any goal, also surpassing the performance of an agent trained with other Imitation Learning algorithms. Furthermore, our method can be used when only trajectories without expert actions are available, which makes it possible to leverage kinesthetic or third-person demonstrations.
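
As an illustration of how demonstrations without expert actions can still supervise a goal-conditioned policy, the sketch below relabels later states of a demonstrated trajectory as hindsight goals. This is a simplified illustration of the relabeling idea, not the paper's code; the function and data layout are assumptions.

```python
import random

def relabel_demo(trajectory):
    """trajectory: list of (state, action) pairs; actions may be None for
    kinesthetic or third-person demos. Returns (state, goal, action) tuples
    where the goal is a state reached later in the same trajectory."""
    relabeled = []
    for t, (state, action) in enumerate(trajectory):
        future = random.randint(t, len(trajectory) - 1)
        goal = trajectory[future][0]          # hindsight goal: a future state
        relabeled.append((state, goal, action))
    return relabeled
```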

Cite as: Yiming Ding*, Carlos Florensa*, Mariano Phielipp, Pieter Abbeel. Goal-Conditioned Imitation Learning. Advances in Neural Information Processing Systems, NeurIPS 2019.

Code and videos available.


Sub-policy Adaptation for Hierarchical Reinforcement Learning

Hierarchical Reinforcement Learning is a promising approach to long-horizon decision-making problems with sparse rewards. Unfortunately, most methods still decouple the lower-level skill acquisition process from the training of a higher level that controls the skills in a new task. Treating the skills as fixed can lead to significant sub-optimality in the transfer setting. In this work, we propose a novel algorithm to discover a set of skills and continuously adapt them, along with the higher level, even when training on a new task. Our main contributions are two-fold. First, we derive a new hierarchical policy gradient, as well as an unbiased latent-dependent baseline, and we introduce Hierarchical Proximal Policy Optimization (HiPPO), an on-policy method to efficiently train all levels of the hierarchy simultaneously. Second, we propose a method of training time-abstractions that improves the robustness of the obtained skills to environment changes.
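
The skeleton below sketches how a rollout of such a two-level policy can be structured, with the high level re-sampling a latent skill every p steps and the skill-conditioned low level acting at every step. It is illustrative pseudostructure under assumed interfaces (`high_policy.sample`, `low_policy.sample`, a gym-style `env`), not the released code.

```python
def hierarchical_rollout(env, high_policy, low_policy, horizon, p=10):
    """Collect one trajectory with a two-level (skill-based) policy."""
    obs = env.reset()
    trajectory = []
    latent = None
    for t in range(horizon):
        if t % p == 0:
            latent = high_policy.sample(obs)      # pick / switch the active skill
        action = low_policy.sample(obs, latent)   # skill-conditioned primitive action
        next_obs, reward, done, _ = env.step(action)
        # Both levels are later updated jointly with a hierarchical policy
        # gradient, using a latent-dependent baseline to reduce variance.
        trajectory.append((obs, latent, action, reward))
        obs = next_obs
        if done:
            break
    return trajectory
```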

Cite as: Alexander C. Li*, Carlos Florensa*, Ignasi Clavera, Pieter Abbeel. Sub-policy Adaptation for Hierarchical Reinforcement Learning. International Conference on Learning Representations, ICLR 2020.

Code and videos available.

Self-supervised Learning of Image Embedding for Continuous Control

Operating directly from raw high-dimensional sensory inputs such as images is still a challenge for robotic control. Recently, Reinforcement Learning methods have been proposed to solve specific tasks end-to-end, from pixels to torques. However, these approaches still require access to a specified reward, which may require specialized instrumentation of the environment. Furthermore, the obtained policy and representations tend to be task-specific and may not transfer well. In this work we investigate completely unsupervised learning of a general image embedding and control primitives, based on finding the shortest time to reach any state. We also introduce a new structure for the state-action value function that builds a connection between model-free and model-based methods, and improves the performance of the learning algorithm. We experimentally demonstrate these findings in three simulated robotic tasks.
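
One way to get intuition for the "shortest time to reach any state" idea is a goal-conditioned Q-function trained with a reward of -1 per step until the goal is reached, so that -Q approximates the number of steps between two states. This is only a schematic illustration under that assumption; the paper's actual value-function structure differs.

```python
import numpy as np

def td_target(q_next, reached_goal, gamma=0.99):
    """One-step target for Q(s, a, g) under a -1-per-step reward."""
    if reached_goal:
        return 0.0                         # no more steps needed once at the goal
    return -1.0 + gamma * np.max(q_next)   # one step spent, plus best continuation
```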

Cite as: Carlos Florensa, Jonas Degrave, Nicolas Heess, Jost Tobias Springenberg, Martin Riedmiller. Self-supervised Learning of Image Embedding for Continuous Control. Contributed talk at the I2C workshop, Advances in Neural Information Processing Systems (NIPS) 2018.

Adaptive Variance for Changing Sparse-Reward Environments

Robots that are trained to perform a task in a fixed environment often fail when facing unexpected changes to the environment, due to a lack of exploration. We propose a principled way to adapt the policy for better exploration in changing sparse-reward environments. Unlike previous works that explicitly model environmental changes, we analyze the relationship between the value function and the optimal exploration for a Gaussian-parameterized policy, and show that our theory leads to an effective strategy for adjusting the variance of the policy, enabling fast adaptation to changes in a variety of sparse-reward environments.
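
The toy rule below conveys the flavor of variance adaptation: widen the exploration noise of a Gaussian policy when the critic's value estimate drops relative to what it used to predict, and shrink it as performance recovers. The thresholds and gains are made-up placeholders, and this heuristic is not the paper's exact derivation.

```python
def adapt_std(current_std, value_now, value_before, grow=1.5, shrink=0.95,
              drop_tolerance=0.1, min_std=0.05, max_std=1.0):
    """Adjust the Gaussian policy's standard deviation after an environment change."""
    if value_now < (1.0 - drop_tolerance) * value_before:
        new_std = current_std * grow     # value collapsed: likely a change, explore more
    else:
        new_std = current_std * shrink   # value holds up: reduce exploration noise
    return min(max(new_std, min_std), max_std)
```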

Cite as: Xingyu Lin, Pengsheng Guo, Carlos Florensa, David Held. Adaptive Variance for Changing Sparse-Reward Environments. International Conference on Robotics and Automation (ICRA) 2019

Automatic Goal Generation for Reinforcement Learning Agents

Reinforcement learning (RL) is a powerful technique to train an agent to perform a task; however, an agent that is trained using RL is only capable of achieving the single task that is specified via its reward function. Such an approach does not scale well to settings in which an agent needs to perform a diverse set of tasks, such as navigating to varying positions in a room or moving objects to varying locations. Instead, we propose a method that allows an agent to automatically discover the range of tasks that it is capable of performing in its environment. We use a generator network to propose tasks for the agent to try to accomplish, each task being specified as reaching a certain parametrized subset of the state-space. The generator network is optimized using adversarial training to produce tasks that are always at the appropriate level of difficulty for the agent, thus automatically producing a curriculum. We show that, by using this framework, an agent can efficiently and automatically learn to perform a wide set of tasks without requiring any prior knowledge of its environment, even when only sparse rewards are available.
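
The curriculum rests on keeping only "goals of intermediate difficulty": goals the current policy sometimes, but not always, reaches are labeled positive and used to train the goal generator. The sketch below shows that labeling step; the threshold values are illustrative, and the surrounding GAN training is omitted.

```python
R_MIN, R_MAX = 0.1, 0.9  # illustrative bounds on the acceptable success rate

def label_goals(goals, success_rates):
    """Return the goals whose empirical success rate is intermediate,
    i.e., neither trivially easy nor currently unreachable."""
    return [g for g, r in zip(goals, success_rates) if R_MIN <= r <= R_MAX]
```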

Cite as: Carlos Florensa*, David Held*, Xinyang Geng*, Pieter Abbeel. Automatic Goal Generation for Reinforcement Learning Agents. In International Conference on Machine Learning (ICML) 2018.

Supplementary material and videos available.

Reverse Curriculum Generation for Reinforcement Learning

Many relevant tasks require an agent to reach a certain state, or to manipulate objects into a desired configuration. For example, we might want a robot to align and assemble a gear onto an axle or insert and turn a key in a lock. These tasks present considerable difficulties for reinforcement learning approaches, since the natural reward function for such goal-oriented tasks is sparse and prohibitive amounts of exploration are required to reach the goal and receive a learning signal. Past approaches tackle these problems by manually designing a task-specific reward shaping function to help guide the learning. Instead, we propose a method to learn these tasks without requiring any prior task knowledge other than obtaining a single state in which the task is achieved. The robot is trained in "reverse", gradually learning to reach the goal from a set of starting positions increasingly far from the goal. Our method automatically generates a curriculum of starting positions that adapts to the agent's performance, leading to efficient training on such tasks. We demonstrate our approach on difficult simulated fine-grained manipulation problems, not solvable by state-of-the-art reinforcement learning methods.
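
The core loop can be sketched as follows: new start states are proposed by short random walks from previously successful starts, so the curriculum gradually moves away from the goal. The helper `env.random_walk_from` is a hypothetical stand-in for executing a few random actions from a given state, not a real API.

```python
import random

def expand_starts(env, good_starts, n_new=100, walk_len=10):
    """Propose new start states by short random walks from previously good starts."""
    new_starts = []
    for _ in range(n_new):
        seed = random.choice(good_starts)
        new_starts.append(env.random_walk_from(seed, steps=walk_len))
    return new_starts
```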

Cite as: Carlos Florensa, David Held, Markus Wulfmeier, Michael Zhang, Pieter Abbeel. Reverse Curriculum Generation for Reinforcement Learning. In Conference on Robot Learning (CoRL) 2017.

Supplementary material and videos available.

Stochastic Neural Networks for Hierarchical Reinforcement Learning

Deep reinforcement learning has achieved many impressive results in recent years. However, tasks with sparse rewards or long horizons continue to pose significant challenges. To tackle these important problems, we propose a general framework that first learns useful skills in a pre-training environment, and then leverages the acquired skills for learning faster in downstream tasks. Our approach brings together some of the strengths of intrinsic motivation and hierarchical methods: the learning of useful skills is guided by a single proxy reward, the design of which requires very minimal domain knowledge about the downstream tasks. Then a high-level policy is trained on top of these skills, significantly improving exploration and allowing the agent to tackle sparse rewards in the downstream tasks. To efficiently pre-train a large span of skills, we use Stochastic Neural Networks combined with an information-theoretic regularizer. Our experiments show that this combination is effective in learning a wide span of interpretable skills in a sample-efficient way, and can significantly boost the learning performance uniformly across a wide range of downstream tasks.
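
The information-theoretic regularizer can be pictured as a bonus that rewards making the visited region of the state space predictive of the active latent skill code. The sketch below is a simplified illustration under assumed names: `discriminator` is a hypothetical classifier estimating p(z | state features), trained alongside the skills.

```python
import numpy as np

def skill_reward(proxy_reward, state_features, latent_z, discriminator, alpha=0.1):
    """Proxy reward plus a mutual-information-style bonus for skill diversity."""
    # Log-probability that the discriminator recovers the active skill code
    mi_bonus = np.log(discriminator.prob(latent_z, state_features) + 1e-8)
    return proxy_reward + alpha * mi_bonus
```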

Cite as: Carlos Florensa, Yan Duan, Pieter Abbeel. Stochastic Neural Networks for Hierarchical Reinforcement Learning. In International Conference on Learning Representations (ICLR) 2017.

Code and videos available.

Capacity planning with competitive decision-makers: Trilevel MILP formulation, degeneracy, and solution approaches

Capacity planning addresses the decision problem of an industrial producer investing in infrastructure to satisfy future demand with the highest profit. Traditional models neglect the rational behavior of some external decision-makers by assuming either static competition or captive markets. We propose a mathematical programming formulation with three levels of decision-makers to capture the dynamics of duopolistic markets. The trilevel model is transformed into a bilevel optimization problem with mixed-integer variables in both levels by replacing the third-level linear program with its optimality conditions. We introduce new definitions required for the analysis of degeneracy in multilevel models, and develop two novel algorithms to solve these challenging problems. Each algorithm is shown to converge to a different type of degenerate solution. The computational experiments for capacity expansion in industrial gas markets show that no algorithm is strictly superior in terms of performance.
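
For context, the generic shape of such a trilevel problem can be sketched as below, with illustrative placeholder objectives f1, f2, f3 and feasible sets for the investing producer, the competitor, and the market-clearing problem. Replacing the innermost linear program by its optimality conditions is what collapses this structure into the bilevel mixed-integer problem solved in the paper.

```latex
\begin{align}
  \max_{x \in X} \quad & f_1\!\big(x,\; y^{*}(x),\; z^{*}(x, y^{*}(x))\big) \\
  \text{where} \quad y^{*}(x) \in \arg\max_{y \in Y(x)} \quad & f_2\!\big(x,\; y,\; z^{*}(x, y)\big) \\
  \text{and} \quad z^{*}(x, y) \in \arg\min_{z \in Z(x, y)} \quad & f_3(x, y, z)
\end{align}
```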

Cite as: Carlos Florensa, Pablo Garcia-Herreros, Pratik Misra, Erdem Arslan, Sanjay Mehta, Ignacio E. Grossmann. Capacity planning with competitive decision-makers: Trilevel MILP formulation, degeneracy, and solution approaches. European Journal of Operational Research, 2017.

“The magic of light!” - An entertaining optics and photonics awareness program

Illusionism provides a surprising and unforgettable way of explaining photonics to a wide audience. Imagine grabbing with your own hand an egg-sized photon with the same incredible properties as in a quantum computer! And what about touching the light beam which detects and removes diseased cells like in cutting edge medical prototypes? The art of magic allows promoting photonics, exploring advanced subjects in an understandable and palpable fashion that strongly inspires all ages.

Cite as: Carlos Florensa, Miriam Martí, S. Chaitanya Kumar, Silvia Carrasco. “The magic of light!” - An entertaining optics and photonics awareness program. Education and Training in Optics and Photonics, 2013.

Patents

Systems and methods for robotic perturbation

Various embodiments of the present technology generally relate to robotic devices and artificial intelligence. More specifically, some embodiments relate to a robotic device for picking items from a bin and perturbing items in a bin. The robotic device may include one or more picking elements and one or more perturbation elements for disturbing a present arrangement of items in the bin. In an exemplary embodiment, a perturbation element comprises a compressed air valve. In some implementations, the robotic device may also include one or more computer-vision systems. Based on image data from the one or more computer-vision systems, a strategy for picking up items from the bin is determined. When no strategies with high probability of success exist, the robotic device may perturb the contents of the bin to create new available pick-up points.
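
The decision logic described above can be pictured with a small sketch: if no pick candidate from the vision system is confident enough, perturb the bin and re-image before retrying. The threshold and the `candidates` data layout are hypothetical illustrations, not values or structures from the patent.

```python
PICK_CONFIDENCE = 0.5  # assumed threshold, for illustration only

def choose_action(candidates):
    """candidates: list of (grasp_pose, success_probability) from the vision system."""
    best_pose, best_prob = max(candidates, key=lambda c: c[1])
    if best_prob >= PICK_CONFIDENCE:
        return ("pick", best_pose)
    return ("perturb", None)   # no confident pick: disturb the bin and re-scan
```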

Cite as: Yan Duan, Ian Rust, Andrew Amir Vaziri, Xi Chen, Carlos Florensa. "Systems and methods for robotic picking and perturbation" - US Patent App. 17/014,194, 2021

Systems and methods for robotic picking

Various embodiments of the present technology generally relate to robotic devices and artificial intelligence. More specifically, some embodiments relate to a robotic device for picking items from a bin and perturbing items in a bin. In some implementations, the device may include one or more computer-vision systems. A computer-vision system, in accordance with the present technology, may use at least two two-dimensional images to generate three-dimensional (3D) information about the bin and items in the bin. Based on the 3D information, a strategy for picking up items from the bin is determined. When no strategies with high probability of success exist, the robotic device may perturb the contents of the bin to create new available pick-up points and re-attempt to pick up an item.

Cite as: Yan Duan, Xi Chen, Mostafa Rohaninejad, Nikhil Mishra, Yu Xuan Liu, Andrew Amir Vaziri, Haoran Tang, Yide Shentu, Ian Rust, Carlos Florensa. "Systems and methods for robotic picking" - US Patent App. 17/014,545, 2021

Guided uncertainty-aware policy optimization: combining model-free and model-based strategies for sample-efficient learning

A robot is controlled using a combination of model-based and model-free control methods. In some examples, the model-based method uses a physical model of the environment around the robot to guide the robot. The physical model is oriented using a perception system such as a camera. Characteristics of the perception system may be used to determine an uncertainty for the model. Based at least in part on this uncertainty, the system transitions from the model-based method to a model-free method where, in some embodiments, information provided directly from the perception system is used to direct the robot without reliance on the physical model.

Cite as: Jonathan Tremblay, Dieter Fox, Michelle Lee, Carlos Florensa, Nathan Donald Ratliff, Animesh Garg, Fabio Tozeto Ramos. "Guided uncertainty-aware policy optimization: combining model-free and model-based strategies for sample-efficient learning" - US Patent App. 16/780,465, 2021

Curriculum Vitae

C.V..pdf