ENTP: Encoder-only Next Token Prediction

Ethan Ewer*,1, Daewon Chae*,2, Thomas Zeng*,1, Jinkyu Kim2, Kangwook Lee1

1University of Wisconsin-Madison, 2Korea University

* Equal Contribution

[Paper] [Code] 

Abstract

Next-token prediction models have predominantly relied on decoder-only Transformers with causal attention, driven by the common belief that causal attention is essential to prevent "cheating" by masking future tokens. We challenge this widely accepted notion and argue that this design choice is about efficiency rather than necessity. While decoder-only Transformers are still a good choice for practical reasons, they are not the only viable option. In this work, we introduce Encoder-only Next Token Prediction (ENTP). We explore the differences between ENTP and decoder-only Transformers in expressive power and complexity, highlighting potential advantages of ENTP. We introduce the Triplet-Counting task and show, both theoretically and experimentally, that while ENTP can perform this task easily, a decoder-only Transformer cannot. Finally, we empirically demonstrate ENTP's superior performance across various realistic tasks, such as length generalization and in-context learning.
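Concretely, ENTP keeps the next-token-prediction objective but removes the causal mask: at every generation step the whole prefix is re-encoded with bidirectional attention, and the prediction is read off the last position. Below is a minimal sketch of this decoding loop, assuming a generic `encoder` module that maps token ids to per-position logits (the names are placeholders, not the released code).

```python
import torch

@torch.no_grad()
def entp_generate(encoder, prefix_ids, num_new_tokens):
    """Greedy decoding with an encoder-only (bidirectional) Transformer.

    Unlike a causal decoder, the encoder has no attention mask, so every
    generated token requires re-encoding the entire prefix from scratch;
    there is no KV cache to reuse.

    `encoder` is any module mapping token ids of shape (1, n) to logits of
    shape (1, n, vocab_size); it is a placeholder, not the released model.
    """
    ids = prefix_ids.clone()                      # shape (1, n)
    for _ in range(num_new_tokens):
        logits = encoder(ids)                     # full bidirectional pass over all n tokens
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # predict from the last position
        ids = torch.cat([ids, next_id], dim=1)    # append and repeat with n + 1 tokens
    return ids
```

Because nothing is cached across steps, each new token requires a fresh pass over the full prefix; this is the efficiency cost, rather than a correctness issue, that the abstract refers to.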

Summary of Results


Theory


We prove several theoretical results (see the paper for formal statements) comparing the expressive power of encoder-only and decoder-only Transformers.

We also characterize encoders and decoders by the time and space complexity they require to generate each additional token (here n is the sequence length, D the embedding dimension, and L the number of layers of the model).
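As a back-of-the-envelope illustration of where that gap comes from (the constants below and the exact accounting are our own rough cost model, not the paper's formal statement): a causal decoder can cache keys and values and process only the newest token, whereas ENTP re-encodes all n positions for every new token.

```python
def per_token_flops(n, D, L):
    """Rough multiply-accumulate counts for generating ONE additional token.

    Assumes a standard Transformer cost model (QKV/output projections plus an
    MLP of width 4D); constants are approximate and the paper's formal
    statement may differ.
    """
    # Decoder with a KV cache: only the new token is processed per layer.
    #   projections + MLP: ~12 * D^2, attention over n cached keys/values: ~2 * n * D
    decoder = L * (12 * D * D + 2 * n * D)

    # Encoder (ENTP): no causal mask, so all n positions are re-encoded.
    #   projections + MLP: ~12 * n * D^2, full self-attention: ~2 * n^2 * D
    encoder = L * (12 * n * D * D + 2 * n * n * D)

    return decoder, encoder
```

Under this cost model the encoder's per-token cost is roughly n times the decoder's, which matches the intuition that ENTP trades per-token compute for the ability to revisit the whole prefix without a causal mask.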

Triplet-Counting


We propose the autoregressive task Triplet-Counting, where the next token is generated as follows:
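The exact generation rule is given in the paper; the sketch below reflects our reading of it, and details such as the index range of the double loop and the final modulus are assumptions rather than the paper's precise definition.

```python
def triplet_counting_next_token(x):
    """Sketch of a Triplet-Counting-style next-token rule.

    Our reading of the task: count the pairs (i, j) whose values sum with the
    last token to 0 modulo the sequence length, then reduce the count. The
    index range and the final modulus below are assumptions; see the paper
    for the precise definition.
    """
    n = len(x)
    count = sum(1 for i in range(n) for j in range(n)
                if (x[i] + x[j] + x[-1]) % n == 0)
    return count % 2  # assumed reduction of the count; the paper's may differ
```

Note that the naive rule takes time quadratic in the sequence length for each new token, which is roughly the per-token compute an encoder has (it revisits all pairs of positions) but more than a causally-masked decoder gets; this is, roughly, the intuition behind the separation.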

When we trained small-scale decoders, they all failed to learn Triplet-Counting, whereas a small encoder (ENTP) quickly reached 100% sequence accuracy.

We also found that decoder-only LLMs struggle to learn Triplet-Counting.

More Realistic Settings


We train encoders and decoders to perform 3-digit addition and to model language on the OpenWebText dataset. Lastly, we compare their performance on various in-context learning (ICL) tasks. In general, the encoder achieves slightly higher accuracy or lower loss on each task than a decoder of the same model size.
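For the language-modeling comparison, the practical difference between the two architectures is where the next-token losses come from: a decoder obtains a loss at every position from a single causally-masked pass, whereas an encoder trained for next-token prediction needs a separate bidirectional pass per prefix length. Below is a minimal sketch of the two loss computations, assuming generic `decoder` and `encoder` modules that map token ids to per-position logits (placeholders, not the released training code).

```python
import torch
import torch.nn.functional as F

def decoder_lm_loss(decoder, ids):
    """One causally-masked pass yields a next-token loss at every position."""
    logits = decoder(ids)                                  # (B, n, V), causal attention inside
    return F.cross_entropy(logits[:, :-1].flatten(0, 1),   # predict token t+1 from position t
                           ids[:, 1:].flatten())

def entp_lm_loss(encoder, ids):
    """ENTP-style loss: one bidirectional pass per prefix length.

    Each prefix ids[:, :t] is encoded without a causal mask, and only the
    last position's logits are trained to predict token t. This is the
    training-cost overhead of ENTP relative to a decoder.
    """
    B, n = ids.shape
    losses = []
    for t in range(1, n):
        logits = encoder(ids[:, :t])                       # (B, t, V), full bidirectional attention
        losses.append(F.cross_entropy(logits[:, -1], ids[:, t]))
    return torch.stack(losses).mean()
```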

BibTeX


@misc{ewer2024entpencoderonlytokenprediction,
      title={ENTP: Encoder-only Next Token Prediction},
      author={Ethan Ewer and Daewon Chae and Thomas Zeng and Jinkyu Kim and Kangwook Lee},
      year={2024},
      eprint={2410.01600},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2410.01600},
}