ENTP: Encoder-only Next Token Prediction
Ethan Ewer*,1, Daewon Chae*,2, Thomas Zeng*,1, Jinkyu Kim2, Kangwook Lee1
1University of Wisconsin-Madison, 2Korea University
* Equal Contribution
Abstract
Next-token prediction models have predominantly relied on decoder-only Transformers with causal attention, driven by the common belief that causal attention is essential to prevent "cheating" by masking future tokens. We challenge this widely accepted notion and argue that this design choice is about efficiency rather than necessity. While decoder-only Transformers are still a good choice for practical reasons, they are not the only viable option. In this work, we introduce Encoder-only Next Token Prediction (ENTP). We explore the differences between ENTP and decoder-only Transformers in expressive power and complexity, highlighting potential advantages of ENTP. We introduce the Triplet-Counting task and show, both theoretically and experimentally, that while ENTP can perform this task easily, a decoder-only Transformer cannot. Finally, we empirically demonstrate ENTP's superior performance across various realistic tasks, such as length generalization and in-context learning.
Summary of Results
We show theoretically that the classes of causal models expressible by encoders and by decoders are distinct: neither class contains the other.
We propose the Triplet-Counting autoregressive task, which we conjecture is difficult for decoders to learn due to the bounded per-token computation of the decoder model. We also provide empirical results showing that decoders fail to learn this task while encoders succeed.
We compare encoder and decoder causal models in more realistic settings, e.g., addition, in-context learning (ICL), and language modeling.
Theory
We prove the following theoretical results (informal versions):
There exists a causal model that both decoders and encoders can represent exactly.
There exists a causal model that a decoder, but not an encoder, can represent exactly.
There exists a causal model that an encoder, but not a decoder (which uses positional embeddings), can represent exactly. A toy illustration of the structural difference behind these separations is sketched below.
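To make the intuition concrete, here is a minimal PyTorch sketch (not from the paper; the layer, dimensions, and prefix lengths are purely illustrative) of the structural difference these results exploit: under a causal mask, the representation of an early position is fixed once computed, whereas an unmasked encoder re-encodes every prefix, so early positions can depend on later tokens in that prefix.

```python
import torch
import torch.nn as nn

# Illustrative sketch (not from the paper): compare how the hidden state of
# position 0 behaves under a causal (decoder-style) mask vs. no mask
# (encoder-style, re-run on every prefix).
torch.manual_seed(0)
d_model, n = 16, 6
layer = nn.TransformerEncoderLayer(d_model, nhead=2, batch_first=True).eval()
x = torch.randn(1, n, d_model)

def first_token_state(prefix_len, causal):
    seq = x[:, :prefix_len]
    mask = None
    if causal:
        # standard upper-triangular -inf mask, so position i attends only to <= i
        mask = torch.triu(torch.full((prefix_len, prefix_len), float("-inf")), diagonal=1)
    with torch.no_grad():
        return layer(seq, src_mask=mask)[0, 0]

# Causal mask: position 0 attends only to itself, so its representation is the
# same for every prefix length (difference ~ 0).
print((first_token_state(3, causal=True) - first_token_state(6, causal=True)).abs().max())
# No mask: position 0 attends to the whole prefix, so its representation
# changes as the prefix grows (difference is clearly nonzero).
print((first_token_state(3, causal=False) - first_token_state(6, causal=False)).abs().max())
```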
We also characterize encoders and decoders by the time and space complexity they require to generate each additional token. (Here n is the sequence length, D the embedding dimension, and L the number of layers in the model.)
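As a back-of-envelope illustration (our own rough estimate under standard implementations, not the paper's exact characterization), the per-token costs look roughly like this: a decoder with a KV cache only processes the newest token, while an encoder without a cache re-encodes the full prefix at every step.

```python
# Rough per-token FLOP estimates (illustrative, up to constants; not the
# paper's exact statement). n: sequence length, D: embedding dimension,
# L: number of layers.
def decoder_flops_per_token(n, D, L):
    attn = n * D        # one new query attends over n cached keys/values
    proj = D * D        # projections / MLP for the single new token
    return L * (attn + proj)          # ~ L * (n*D + D^2), plus ~ n*D*L cache memory

def encoder_flops_per_token(n, D, L):
    attn = n * n * D    # full self-attention recomputed over the whole prefix
    proj = n * D * D    # projections / MLP for all n positions
    return L * (attn + proj)          # ~ L * (n^2*D + n*D^2)

n, D, L = 4096, 1024, 24
print(f"decoder: {decoder_flops_per_token(n, D, L):.2e} FLOPs per new token")
print(f"encoder: {encoder_flops_per_token(n, D, L):.2e} FLOPs per new token")
```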
Triplet-Counting
We propose the autoregressive task Triplet-Counting, in which the next token is a deterministic function of the entire prefix, obtained by counting token triplets that satisfy a fixed condition; a sketch of a rule in this spirit is given below.
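The exact generation rule is given in the paper; the following is only a hedged sketch of a rule in the same spirit (the precise condition and output used in the paper may differ). The key property is that producing each new token requires counting over on the order of n^2 index pairs, and this count must be recomputed as the prefix grows.

```python
# Hedged sketch of a Triplet-Counting-style rule (the paper's exact definition
# may differ): the next token is the parity of the number of index pairs (i, j)
# whose values, together with the last token, sum to 0 modulo the current
# sequence length. Each step costs Theta(n^2).
def next_token(x):
    n = len(x)
    count = sum(
        1
        for i in range(n)
        for j in range(n)
        if (x[i] + x[j] + x[-1]) % n == 0
    )
    return count % 2

seq = [2, 1, 3, 0]
for _ in range(4):            # autoregressively extend the sequence
    seq.append(next_token(seq))
print(seq)
```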
When we train small-scale decoders, they all fail to learn Triplet-Counting, whereas a small encoder (ENTP) quickly reaches 100% sequence accuracy.
Additionally, we find that decoder-only LLMs also struggle to learn Triplet-Counting.
More Realistic Settings
We train encoders and decoders to perform addition on 3-digit numbers and to do language modeling on the OpenWebText dataset. Lastly, we compare performance on various in-context learning (ICL) tasks. In general, the encoder achieves slightly higher accuracy or lower loss on each task than a decoder of the same model size.
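For concreteness, a 3-digit addition example might be formatted as a simple character-level string as in the sketch below (our own illustration; the exact formatting, tokenization, and any digit-ordering tricks used in the paper may differ).

```python
import random

# Illustrative data generation for the 3-digit addition task (not the paper's
# exact pipeline; the "a+b=c" formatting here is an assumption).
def make_addition_example(rng):
    a, b = rng.randint(0, 999), rng.randint(0, 999)
    return f"{a}+{b}={a + b}"

rng = random.Random(0)
print([make_addition_example(rng) for _ in range(3)])  # three "a+b=c" training strings
```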
BibTeX
@misc{ewer2024entpencoderonlytokenprediction,
title={ENTP: Encoder-only Next Token Prediction},
author={Ethan Ewer and Daewon Chae and Thomas Zeng and Jinkyu Kim and Kangwook Lee},
year={2024},
eprint={2410.01600},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2410.01600},
}