George Tucker

Google Scholar - LinkedIn - GitHub

I am a research scientist on the Google Brain team. My current focus is on sequential models and reinforcement learning. We want to learn rich models and complex policies while keeping the amount of feedback required as small as possible.

I'm particularly interested in offline RL, where we have a dataset of previously collected experience, and want to train a policy solely on the dataset without further interaction with the environment. While this is certainly not the end goal, I think this is a fundamental building block of RL and has many real-world applications, so it is an important problem to understand well.

Previously, I was a research scientist on the Amazon Speech team in Boston, where I designed deep neural network acoustic models for small-footprint keyword spotting. Before joining Amazon, I was a visiting Postdoctoral Research Fellow in the Price lab at the Harvard School of Public Health, where I worked on methods for genetic risk prediction and association testing in genome-wide association studies (GWAS) with related individuals. I conducted my PhD research in the MIT Mathematics department in Professor Bonnie Berger's research group.

Offline RL

Reinforcement learning (RL) is a general framework for optimizing sequential decisions and is central to many real-world applications. A key challenge is that most RL algorithms assume active interaction with a live environment or simulator. Unfortunately, applying these algorithms to complex real-world problems is often prohibitive because of the difficulty of building a high-fidelity simulator and the cost and risk of interacting with the live environment. Fortunately, for many real-world applications, logged histories of past decisions and their corresponding quality metrics are available. This corresponds to the offline RL setting.
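To make the offline setting concrete, here is a minimal sketch of tabular Q-iteration that learns only from a fixed log of transitions and never queries an environment. The function name, the tuple layout `(state, action, reward, next_state, done)`, and the toy dataset are all illustrative assumptions; this is not the method of any paper listed on this page.

```python
import numpy as np

def offline_q_iteration(dataset, n_states, n_actions, gamma=0.9, n_iters=200):
    """Tabular Q-iteration on a fixed list of logged transitions.

    Each transition is (state, action, reward, next_state, done).
    State-action pairs that never appear in the log keep a value of zero.
    """
    q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        target_sum = np.zeros_like(q)
        counts = np.zeros_like(q)
        for s, a, r, s_next, done in dataset:
            # Bootstrap from the current estimate unless the episode ended.
            bootstrap = 0.0 if done else gamma * q[s_next].max()
            target_sum[s, a] += r + bootstrap
            counts[s, a] += 1
        seen = counts > 0
        # Average the targets for every pair that occurs in the log.
        q[seen] = target_sum[seen] / counts[seen]
    return q

# Hypothetical logged experience from a small 3-state problem.
dataset = [
    (0, 1, 0.0, 1, False),
    (1, 1, 1.0, 2, True),
    (0, 0, 0.0, 0, False),
    (1, 0, 0.0, 0, False),
]
q = offline_q_iteration(dataset, n_states=3, n_actions=2)
policy = q.argmax(axis=1)  # greedy policy derived purely from the log
```

In this tabular toy, unseen state-action pairs simply stay at zero. With function approximation, the `max` over actions can pick up unreliable value estimates for actions the dataset does not cover, which is one of the central difficulties of the offline setting.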

We have written a tutorial covering recent advances in Offline RL: Offline reinforcement learning: Tutorial, review, and perspectives on open problems. S Levine, A Kumar, G Tucker, J Fu.

Our research efforts in this area comprise:

Methods development.

Offline Q-Learning on Diverse Multi-Task Data Both Scales And Generalizes. A Kumar, R Agarwal, X Geng, G Tucker*, S Levine*. ICLR 2023 Top 5%.

Oracle Inequalities for Model Selection in Offline Reinforcement Learning. JN Lee, G Tucker, O Nachum, B Dai, E Brunskill. NeurIPS 2022.

Conservative Q-Learning for Offline Reinforcement Learning. A Kumar, A Zhou, G Tucker, S Levine. NeurIPS 2020.

Behavior Regularized Offline Reinforcement Learning. Y Wu, G Tucker, O Nachum.

Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. A Kumar, J Fu, M Soh, G Tucker, S Levine. NeurIPS 2019.


Datasets for Data-Driven Reinforcement Learning. J Fu, A Kumar, O Nachum, G Tucker, S Levine.

Applications engagements.

Working closely with internal product and applications teams to understand how offline RL fits into their problem settings and how they can leverage it.

Other Selected Google Brain publications

Coupled gradient estimators for discrete latent variables. Z Dong, A Mnih, G Tucker. NeurIPS 2021.

DisARM: An antithetic gradient estimator for binary latent variables. Z Dong, A Mnih, G Tucker. NeurIPS 2020 Spotlight.

On variational bounds of mutual information. B Poole, S Ozair, A Oord, AA Alemi, G Tucker. ICML 2019.

Doubly Reparameterized Gradient Estimators for Monte Carlo Objectives. G Tucker, D Lawson, S Gu, CJ Maddison.

The Mirage of Action-Dependent Baselines in Reinforcement Learning. G Tucker, S Bhupatiraju, S Gu, RE Turner, Z Ghahramani, S Levine.

Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling. C Riquelme, G Tucker, J Snoek.

REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. G Tucker, A Mnih, CJ Maddison, D Lawson, J Sohl-Dickstein.

Filtering Variational Objectives. CJ Maddison*, D Lawson*, G Tucker*, N Heess, M Norouzi, A Doucet, A Mnih, YW Teh.

Confidence Penalties

We propose regularizing neural networks by penalizing low-entropy output distributions. We show that this penalty, which has been shown to improve exploration in reinforcement learning, acts as a strong regularizer in supervised learning. We connect our confidence penalty to label smoothing through the direction of the KL divergence between the network's output distribution and the uniform distribution. We exhaustively evaluate our proposed confidence penalty and label smoothing (uniform and unigram) on 6 common benchmarks: image classification (MNIST and CIFAR-10), language modeling (Penn Treebank), machine translation (WMT'14 English-to-German), and speech recognition (TIMIT and WSJ). We find that both label smoothing and our confidence penalty improve state-of-the-art models across benchmarks without modifying existing hyper-parameters.
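As a rough illustration of the two regularizers compared above, the sketch below computes both losses in NumPy. The function names and the values of `beta` and `eps` are assumptions made for the example, not settings taken from the paper. The confidence penalty subtracts a multiple of the output entropy, which up to a constant adds KL(p || u). Uniform label smoothing mixes in the cross-entropy against the uniform distribution, which up to a constant adds KL(u || p).

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def confidence_penalty_loss(logits, labels, beta=0.1):
    """Negative log-likelihood minus beta times the output entropy."""
    log_p = log_softmax(logits)
    p = np.exp(log_p)
    nll = -log_p[np.arange(len(labels)), labels]
    entropy = -(p * log_p).sum(axis=-1)
    # Low-entropy (over-confident) outputs receive a larger loss.
    return (nll - beta * entropy).mean()

def label_smoothing_loss(logits, labels, eps=0.1):
    """Cross-entropy against (1 - eps) * one-hot + eps * uniform targets."""
    log_p = log_softmax(logits)
    nll = -log_p[np.arange(len(labels)), labels]
    # Cross-entropy between the uniform distribution and the model output.
    uniform_ce = -log_p.mean(axis=-1)
    return ((1.0 - eps) * nll + eps * uniform_ce).mean()

# Hypothetical batch of 2 examples with 4 classes.
logits = np.array([[4.0, 0.0, 0.0, 0.0], [0.0, 3.0, 0.0, 0.0]])
labels = np.array([0, 1])
penalized = confidence_penalty_loss(logits, labels)
smoothed = label_smoothing_loss(logits, labels)
```

Setting `beta=0` or `eps=0` recovers the plain negative log-likelihood in either function.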

Amazon Speech - Small-footprint acoustic models

Compacting Neural Network Classifiers via Dropout Training. Y Kubo, G Tucker, S Wiesler. NIPS 2016 workshop on Efficient Methods for Deep Neural Networks.

Max-pooling Loss Training of Long Short-Term Memory Networks for Small-footprint Keyword Spotting. M Sun, A Raju, G Tucker, S Panchapagesan, G Fu, A Mandal, et al. SLT 2016.

Model compression applied to small-footprint keyword spotting

Recently, a number of devices and services have enabled fully voice-based interfaces, such as Google Now, the iPhone 6s, and the Amazon Echo. For privacy reasons, these devices rely on the user to preface their commands with a keyword, such as "Alexa". Accurate on-device keyword spotting is critical to usability. In this work, we focused on keyword spotting (KWS) systems for small-footprint devices. In particular, we investigated the use of low-rank weight matrices and knowledge distillation applied to a deep neural network (DNN) based KWS system. We found that these techniques combine to give significant reductions in false alarms (FAs) and misses (~10% reduction in FAs at a fixed miss rate).
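The two compression ideas mentioned above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the system described in the paper: the low-rank factors here come from a truncated SVD of an existing weight matrix (the paper may instead train the factors directly), and the distillation loss is the common temperature-softened cross-entropy between teacher and student outputs. All function names, shapes, and the temperature are hypothetical.

```python
import numpy as np

def low_rank_factorize(weight, rank):
    """Approximate `weight` (m x n) by a @ b with a: m x rank, b: rank x n."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # fold singular values into the left factor
    b = vt[:rank]
    return a, b

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student against the softened teacher output."""
    teacher_p = np.exp(log_softmax(teacher_logits, temperature))
    student_log_p = log_softmax(student_logits, temperature)
    return -(teacher_p * student_log_p).sum(axis=-1).mean()

# Hypothetical 256 x 256 layer compressed to rank 32.
rng = np.random.default_rng(0)
weight = rng.normal(size=(256, 256))
a, b = low_rank_factorize(weight, rank=32)
full_params = weight.size          # 65536
low_rank_params = a.size + b.size  # 16384
```

Replacing one dense layer with the two thin factors reduces both the parameter count and the multiply count by the same ratio, which is why the approach suits small-footprint devices.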

HSPH & MIT - Medical Genetics


Genetic prediction based on either identity by state (IBS) sharing or pedigree information has been investigated extensively using Best Linear Unbiased Prediction (BLUP) methods. Such methods were pioneered in the plant and animal breeding literature and have since been applied to predict human traits with the aim of eventual clinical utility. However, methods to combine IBS sharing and pedigree information for genetic prediction in humans have not been explored. We introduce a two variance component model for genetic prediction: one component for IBS sharing and one for approximate pedigree structure, both estimated using genetic markers. In simulations using real genotypes from CARe and FHS family cohorts, we demonstrate that the two variance component model achieves gains in prediction r^2 over standard BLUP at current sample sizes, and we project based on simulations that these gains will continue to hold at larger sample sizes. Accordingly, in analyses of four quantitative phenotypes from CARe and two quantitative phenotypes from FHS, the two variance component model significantly improves prediction r^2 in each case, with up to a 20% relative improvement. We also find that standard mixed model association tests can produce inflated test statistics in data sets with related individuals, whereas the two variance component model corrects for inflation.
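The prediction step of a two variance component model can be sketched as follows. The sketch assumes the phenotype is mean-centered, that both relationship matrices are already available, and that the three variance components are already known; how the matrices and variance components are actually estimated in the paper is not shown here. The function name, argument names, and toy data are hypothetical.

```python
import numpy as np

def two_component_blup(k_ibs, k_ped, y_train, train_idx, test_idx,
                       var_ibs, var_ped, var_env):
    """BLUP of test phenotypes under Cov(y) = var_ibs*K_ibs + var_ped*K_ped + var_env*I."""
    # Combined genetic covariance across all individuals.
    g = var_ibs * k_ibs + var_ped * k_ped
    # Phenotypic covariance among training individuals.
    v_train = g[np.ix_(train_idx, train_idx)] + var_env * np.eye(len(train_idx))
    # Genetic covariance between test and training individuals.
    g_cross = g[np.ix_(test_idx, train_idx)]
    return g_cross @ np.linalg.solve(v_train, y_train)

# Toy example: 4 individuals, where 0 and 3 are related (coefficient 0.5)
# only through the pedigree-like matrix.
k_ibs = np.eye(4)
k_ped = np.eye(4)
k_ped[0, 3] = k_ped[3, 0] = 0.5
y_train = np.array([2.0, 5.0, -1.0])
pred = two_component_blup(k_ibs, k_ped, y_train,
                          train_idx=[0, 1, 2], test_idx=[3],
                          var_ibs=0.0, var_ped=1.0, var_env=1.0)
# pred is approximately [0.5]: half of individual 0's shrunken phenotype.
```

Setting `var_ped=0` reduces this to standard single-component BLUP on the IBS matrix.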


Using a reduced subset of SNPs in a linear mixed model can improve power for genome-wide association studies, yet this can result in insufficient correction for population stratification. We propose a hybrid approach using principal components that does not inflate statistics in the presence of population stratification and improves power over standard linear mixed models.
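One common way to use principal components alongside a mixed model is to compute the top components of the standardized genotype matrix and include them as fixed-effect covariates. The sketch below shows only that ingredient: computing the components and regressing them out of a phenotype. It is an assumption-laden illustration, not the hybrid procedure from the paper, and the function names and simulated data are hypothetical.

```python
import numpy as np

def top_principal_components(genotypes, n_components):
    """Top principal components of an (individuals x SNPs) genotype matrix."""
    centered = genotypes - genotypes.mean(axis=0)
    scale = centered.std(axis=0)
    scale[scale == 0] = 1.0  # leave monomorphic SNPs unscaled
    standardized = centered / scale
    u, s, _ = np.linalg.svd(standardized, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def residualize(values, covariates):
    """Remove an intercept and the given covariates by least squares."""
    design = np.column_stack([np.ones(len(values)), covariates])
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    return values - design @ coef

# Hypothetical data: 50 individuals, 200 SNPs coded as 0/1/2.
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(50, 200)).astype(float)
phenotype = rng.normal(size=50)
pcs = top_principal_components(genotypes, n_components=5)
adjusted = residualize(phenotype, pcs)
```

After this step the adjusted phenotype is orthogonal to the leading axes of genetic variation, which are the directions along which population stratification typically shows up.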