Ankit Gupta
Hi! I am a Research Scientist at IBM Research interested in Machine Learning and Natural Language Processing. Before this, I was a postdoc in Jonathan Berant's NLP group at Tel Aviv University. My PhD was on algebraic computation.
Email : ankitgupta(dot)iitkanpur(at)gmail(dot)com.
[CV] [Publications] [Talks] [Code]
My recent work has mostly focused on improving the efficiency & generalization of seq2seq models. Some of my work on Question Answering and Language Modeling is listed below:
"Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors" : We show that the conventional practice of directly fitting randomly-initialized models on a given dataset leads to gross overestimation of the differences between architectures and that pretraining with standard denoising objectives, using only the downstream task data, leads to dramatic gains across multiple architectures and to very small gaps between Transformers and state space models (SSMs). In contrast to prior works, we find vanilla Transformers to match the performance of S4 on Long Range Arena when pretrained, and we improve the best reported results of SSMs on the PathX-256 task by 20 absolute points.
Ido Amos, Jonathan Berant, Ankit Gupta
International Conference on Learning Representations (ICLR), 2024. (outstanding paper award)
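As a rough illustration of the recipe, here is a minimal sketch of a token-level denoising objective built from the downstream task's own sequences; the function and names are illustrative, not from the paper's code:

```python
import numpy as np

def denoising_example(tokens, mask_id, mask_prob=0.15, rng=None):
    """Corrupt a downstream-task sequence by masking random tokens;
    the model is pretrained to reconstruct the originals at the
    masked positions before any supervised training."""
    rng = rng or np.random.default_rng()
    tokens = np.asarray(tokens)
    mask = rng.random(tokens.shape) < mask_prob
    inputs = np.where(mask, mask_id, tokens)   # corrupted input
    targets = np.where(mask, tokens, -100)     # -100 = ignored by the loss
    return inputs, targets

# Usage: pretrain on the task's own data, then fine-tune as usual.
x, y = denoising_example([5, 17, 42, 8, 99], mask_id=0)
```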
"Simplifying and Understanding State Space Models with Diagonal Linear RNNs" : We show vanilla diagonal linear RNNs (DLR) to be as performant as previously-proposed SSMs in the presence of strong supervision, by characterizing the expressivity of SSMs & attention-based models via a suite of 13 synthetic seq2seq tasks involving interactions over tens of thousands of positions. SSMs report near-perfect performance on tasks that can be modeled via few convolutional kernels but struggle on tasks requiring many such kernels or when the desired sequence manipulation is context-dependent.
Ankit Gupta, Harsh Mehta, Jonathan Berant
arXiv preprint, 2022.
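The model at the center of this study is simple enough to state in a few lines. A minimal sketch of a single-input, single-output diagonal linear RNN (the names and complex parameterization are illustrative):

```python
import numpy as np

def dlr(u, lam, w):
    """Diagonal linear RNN: n independent scalar recurrences
    x_t = lam * x_{t-1} + u_t, read out as y_t = Re(w . x_t).
    Each step costs O(n); the whole input-output map is a convolution."""
    x, ys = np.zeros_like(lam), []
    for u_t in u:                      # u: scalar input sequence
        x = lam * x + u_t              # diagonal (elementwise) transition
        ys.append((w @ x).real)        # mix states into one output
    return np.array(ys)

rng = np.random.default_rng(0)
lam = np.exp(-0.1 + 1j * rng.uniform(0, np.pi, 16))   # stable complex poles
y = dlr(rng.standard_normal(100), lam, rng.standard_normal(16) + 0j)
```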
"Diagonal State Space Augmented Transformers for Speech Recognition" : We improve on the popular conformer architecture by replacing the depthwise temporal convolutions with diagonal state space (DSS) models. Comparing neural transducers with either conformer or our proposed DSS-augmented transformer (DSSformer) encoders, on Switchboard+Fisher 300/2000h, we reach a single model performance of 9.1%/6.9% WER on the combined test set of the Hub5 2000 evaluation, respectively, and on MALACH 176h we improve the WER by 7% relative over the previous best published result.
George Saon, Ankit Gupta, Xiaodong Cui
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023.
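A hedged sketch of the swap itself, on a simplified Conformer-style convolution module (normalization and activation details are omitted; `temporal_mixer` stands in for any module mapping (batch, time, d) to (batch, time, d), e.g. a depthwise convolution for the baseline or a DSS layer for the DSSformer variant):

```python
import torch.nn as nn

class TemporalMixerModule(nn.Module):
    """Simplified Conformer-style convolution module where the
    temporal mixer is pluggable: the only change between the
    baseline and the DSS-augmented variant is which mixer is passed."""
    def __init__(self, d_model, temporal_mixer):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pointwise_in = nn.Linear(d_model, 2 * d_model)
        self.glu = nn.GLU(dim=-1)                 # halves dim back to d_model
        self.temporal = temporal_mixer            # the only swapped piece
        self.pointwise_out = nn.Linear(d_model, d_model)

    def forward(self, x):                         # x: (batch, time, d_model)
        y = self.glu(self.pointwise_in(self.norm(x)))
        return x + self.pointwise_out(self.temporal(y))
```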
"Long Range Language Modeling via Gated State Spaces" : We propose a seq2seq model GSS (a gated version of DSS) that is more compute-efficient compared to well-tuned Transformer variants (XL / Block Recurrent / Memorizing) on autoregressive sequence modeling over English books, Github source code, arXiv Math articles.
Harsh Mehta, Ankit Gupta, Ashok Cutkosky, Behnam Neyshabur
International Conference on Learning Representations (ICLR), 2023.
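In the spirit of the architecture (a sketch only; the paper's exact dimensions and wiring differ), a gated state space block runs the SSM on a narrow branch and modulates it with a wide gating branch:

```python
import torch.nn as nn
import torch.nn.functional as F

class GatedSSMBlock(nn.Module):
    """Sketch of a gated state space block: a narrow SSM branch is
    gated multiplicatively by a wide feed-forward branch. `ssm` is a
    stand-in for a DSS-style sequence layer of matching width."""
    def __init__(self, d_model, d_ssm, d_expand, ssm):
        super().__init__()
        self.gate = nn.Linear(d_model, d_expand)   # wide gating branch
        self.proj_in = nn.Linear(d_model, d_ssm)   # narrow SSM branch
        self.ssm = ssm
        self.mix = nn.Linear(d_ssm, d_expand)
        self.proj_out = nn.Linear(d_expand, d_model)

    def forward(self, x):                          # x: (batch, time, d_model)
        g = F.gelu(self.gate(x))
        y = self.ssm(F.gelu(self.proj_in(x)))
        return self.proj_out(g * self.mix(y))      # multiplicative gating
```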
"On the Parameterization and Initialization of Diagonal State Space Models" : We systematically describe various design choices in parameterizing and computing diagonal SSMs, and perform a controlled empirical study ablating the effects of these choices. Our final model S4D is a simple diagonal version of S4 whose kernel computation requires just 2 lines of code and performs comparably to S4 in almost all settings, with state-of-the-art results for image, audio, and medical time-series domains, and averaging 85% on the Long Range Arena.
Albert Gu, Ankit Gupta, Karan Goel, Christopher Ré
Advances in Neural Information Processing Systems (NeurIPS), 2022.
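For flavor, a sketch of that kernel computation under zero-order-hold discretization, with B folded into C (variable names are illustrative, not the paper's code):

```python
import numpy as np

def s4d_kernel(A, C, dt, L):
    """Convolution kernel of the diagonal SSM x' = Ax + u, y = Cx:
    discretize the poles, then take a Vandermonde matrix-vector
    product -- essentially the 'two lines' below."""
    dA = np.exp(dt * A)                                   # discretized poles
    K = (C * (dA - 1) / A) @ dA[:, None] ** np.arange(L)  # Vandermonde product
    return K.real

rng = np.random.default_rng(0)
A = -0.5 + 1j * np.pi * np.arange(8)                      # S4D-Lin-style init
K = s4d_kernel(A, rng.standard_normal(8) + 0j, 0.01, 64)  # kernel of length 64
```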
"Diagonal State Spaces are as Effective as Structured State Spaces" : We propose a seq2seq model that uses diagonal state spaces (DSS) for contextualization & matches the performance of state-of-the-art models (such as S4) on benchmarks requiring long-range reasoning over text, images & audio, while being conceptually simpler & straightforward to implement.
Ankit Gupta, Albert Gu, Jonathan Berant
Advances in Neural Information Processing Systems (NeurIPS), 2022. (spotlight talk)
"SCROLLS: Standardized CompaRison Over Long Language Sequences" : We introduce SCROLLS, a suite of tasks that require reasoning over long texts. Tasks are structured in a unified text-to-text format & involve summarization, QA & NLI, covering multiple domains including literature, science, business & entertainment. [Data & Leaderboard]
Uri Shaham, Elad Segal, Maor Ivgi, Avia Efrat, Ori Yoran, Adi Haviv, Ankit Gupta, Wenhan Xiong, Mor Geva, Jonathan Berant, Omer Levy
Empirical Methods in Natural Language Processing (EMNLP), 2022.
"Analyzing Transformers in Embedding Space" : We present a theoretical analysis where all parameters of a trained Transformer are interpreted by projecting them into the embedding space, i.e. the space of vocabulary items they operate on. Applications include (a) interpreting parameters of pretrained/fine-tuned models in embedding space, (b) aligning the parameters of different models that share a vocabulary, and (c) constructing a classifier without training by ``translating'' the parameters of a fine-tuned classifier to parameters of a different model that was only pretrained.
Guy Dar, Mor Geva, Ankit Gupta, Jonathan Berant
Annual Conference of the Association for Computational Linguistics (ACL), 2023.
BlackboxNLP @ EMNLP 2022.
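A minimal sketch of the core operation (the names are illustrative; the paper develops this for attention and feed-forward parameters alike):

```python
import torch

def project_to_vocab(param_vec, embeddings, k=10):
    """Interpret a Transformer parameter vector by scoring it against
    the token embedding matrix (vocab_size x hidden_dim) and reading
    off the nearest vocabulary items."""
    scores = embeddings @ param_vec        # one score per vocabulary item
    return scores.topk(k).indices          # ids of the top-k tokens

# Usage (illustrative): take a feed-forward "value" column of a
# pretrained model and inspect which tokens it promotes.
#   E = model.get_input_embeddings().weight
#   ids = project_to_vocab(ff_value_vector, E)
#   print(tokenizer.convert_ids_to_tokens(ids.tolist()))
```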
"Memory-efficient Transformers via Top-k Attention" : We propose a highly accurate approximation of vanilla attention. (a) Memory usage is linear w.r.t input size, (b) is plug-n-play w.r.t vanilla attention based LMs like BERT, T5 (no corrective pre-training required), (c) can also be used at feed-forward layers, & (d) accuracy nearly-identical to vanilla Transformers in multiple setups including training from scratch, fine-tuning, and zero-shot inference.
Ankit Gupta, Guy Dar, Shaya Goodman, David Ciprut, Jonathan Berant
Workshop on Simple and Efficient Natural Language Processing (SustaiNLP) @ EMNLP 2021.
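A self-contained sketch of the idea (the paper additionally processes queries in chunks so the full score matrix is never materialized):

```python
import torch

def topk_attention(q, k, v, num_keep=32):
    """Approximate softmax attention: for each query, keep only its
    top scores, renormalize over the survivors, and mix only the
    corresponding values."""
    scores = q @ k.T / k.shape[-1] ** 0.5          # (Lq, Lk)
    num_keep = min(num_keep, scores.shape[-1])
    vals, idx = scores.topk(num_keep, dim=-1)      # best keys per query
    w = vals.softmax(dim=-1)                       # renormalize survivors
    return torch.einsum('qk,qkd->qd', w, v[idx])   # gather + weighted sum

q, k, v = torch.randn(128, 64), torch.randn(512, 64), torch.randn(512, 64)
out = topk_attention(q, k, v)                      # (128, 64)
```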
"Value-aware Approximate Attention" : We formulate a value-aware objective for approximating the dot-product attention & show, theoretically & empirically, that an optimal approximation of this value-aware objective substantially outperforms an optimal approximation that ignores values.
Ankit Gupta, Jonathan Berant
Empirical Methods in Natural Language Processing (EMNLP), 2021. (oral presentation)
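In symbols, with A the attention matrix, V the values, and Â the approximation (a sketch of the contrast, not the paper's exact notation):

```latex
\text{value-oblivious:}\quad \min_{\hat{A}} \;\|A - \hat{A}\|_F
\qquad\text{vs.}\qquad
\text{value-aware:}\quad \min_{\hat{A}} \;\|AV - \hat{A}V\|_F
```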
"Break It Down: A Question Understanding Benchmark" : Language models (BERT, etc) have achieved super-human performance on simple QA datasets like SQuAD. These models don't perform so well on complex questions requiring reasoning from multiple documents over a large knowledge base (wikipedia/web). [Data & Leaderboard]
Tomer Wolfson, Mor Geva, Ankit Gupta, Matt Gardner, Yoav Goldberg, Daniel Deutch, Jonathan Berant
Transactions of the Association for Computational Linguistics (TACL), 2020.
"Injecting Numerical Reasoning Skills into Language Models" : Language models (BERT, etc) are not trained to perform discrete operations such as signed combination of numbers, argmax, etc and hence do not perform well on tasks requiring numerical reasoning over text (DROP, etc). We propose a methodology for injecting synthetic skills directly into language models and outperform symbolic approaches.
Mor Geva*, Ankit Gupta*, Jonathan Berant
Annual Conference of the Association for Computational Linguistics (ACL), 2020.
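A toy sketch of the kind of synthetic example such pre-training can be built from (the templates are illustrative, not the paper's generation grammar):

```python
import random

def signed_combination_example(rng=random.Random(0)):
    """One synthetic (question, answer) pair exercising signed
    combinations of numbers, the sort of skill a language model
    can be pre-trained on before tackling DROP-style tasks."""
    nums = [rng.randint(1, 1000) for _ in range(rng.randint(2, 4))]
    signs = [rng.choice([1, -1]) for _ in nums]
    expr = ' '.join(f"{'+' if s > 0 else '-'} {n}"
                    for s, n in zip(signs, nums)).lstrip('+ ')
    return f"What is {expr}?", str(sum(s * n for s, n in zip(signs, nums)))

print(signed_combination_example())
# e.g. ('What is 12 + 305 - 7?', '310')
```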
"GMAT: Global Memory Augmentation for Transformers": We augment the recently-proposed sparse variants of Transformer with a dense attention-based global memory and demonstrate its utility on a wide range of tasks including synthetic tasks requiring global reasoning, masked LM, RC & compressing contextualized representations.
Ankit Gupta, Jonathan Berant
arXiv preprint, 2020.
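A minimal sketch of the augmentation (names and the split are illustrative; `sparse_attn_layer` stands in for any sparse-attention Transformer layer):

```python
import torch

def with_global_memory(sparse_attn_layer, x, memory):
    """Prepend M global memory vectors to a long input so that every
    position can attend to/from them densely while input-to-input
    attention stays sparse, then split the memory back off."""
    m = memory.expand(x.shape[0], -1, -1)            # (batch, M, d)
    h = sparse_attn_layer(torch.cat([m, x], dim=1))  # joint contextualization
    return h[:, m.shape[1]:], h[:, :m.shape[1]]      # updated input, memory
```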