ENTP: Encoder-only Next Token Prediction
Ethan Ewer*,1, Daewon Chae*,2, Thomas Zeng*,1, Jinkyu Kim2, Kangwook Lee1
1University of Wisconsin-Madison, 2Korea University
* Equal Contribution
Next-token prediction is conventionally done using decoder-only Transformers with causal attention, as this approach allows for efficient reuse of keys and values. What if we were not compute-limited? Should we still use decoder-only Transformers? In this work, we introduce Encoder-only Next Token Prediction (ENTP). We explore the differences between ENTP and decoder-only Transformers in expressive power and complexity, highlighting potential advantages of ENTP in settings with unbounded compute. We introduce the Triplet-Counting task and show, both theoretically and experimentally, that while ENTP can perform this task easily, a decoder-only Transformer cannot. Finally, we empirically demonstrate the superior performance of ENTP across representative tasks where next-token prediction based Transformers can be evaluated, including addition, in-context learning, and language modeling.
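The core mechanical difference is in how each new token is produced: a decoder-only model attends causally and reuses cached keys and values, while ENTP re-encodes the entire prefix with bidirectional attention at every step and reads the prediction off the last position. The following is a minimal sketch of that generation loop, assuming a hypothetical encoder-only model that maps a (1, n) token tensor to (1, n, vocab) logits; it is an illustration, not the paper's released code.

```python
import torch

@torch.no_grad()
def entp_generate(model, prompt_ids, num_new_tokens):
    """Greedy ENTP-style generation sketch.

    `model` is assumed to be an encoder-only (bidirectional-attention)
    Transformer mapping a (1, n) token tensor to (1, n, vocab) logits.
    Because attention is not causal, nothing can be cached across steps:
    the full prefix is re-encoded from scratch for every new token.
    """
    ids = prompt_ids.clone()                       # shape (1, n)
    for _ in range(num_new_tokens):
        logits = model(ids)                        # re-encodes the whole prefix
        next_id = logits[:, -1, :].argmax(dim=-1)  # prediction at the last position
        ids = torch.cat([ids, next_id[:, None]], dim=1)
    return ids
```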
We show theoretically that the classes of causal models expressible by encoders and by decoders are distinct: neither class contains the other.
We propose the Triplet-Counting autoregressive task, which we conjecture is difficult for decoders to learn due to the bounded per-token computation of the decoder model. We also provide empirical results showing that decoders fail to learn this task while encoders succeed.
We explore and present results for encoder vs. decoder causal models in more realistic settings, e.g., addition, in-context learning (ICL), and language modeling.
We prove the following theoretical results (informal version):
There exists a causal model that both decoders and encoders can represent exactly.
There exists a causal model that a decoder, but not an encoder, can represent exactly.
There exists a causal model that an encoder, but not a decoder (which uses positional embeddings), can represent exactly.
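In set notation (our paraphrase; \(\mathcal{F}_{\mathrm{enc}}\) and \(\mathcal{F}_{\mathrm{dec}}\) are symbols introduced here for the classes of causal sequence functions exactly representable by encoders and by decoders, not the paper's notation), the three statements read
\[
\mathcal{F}_{\mathrm{enc}} \cap \mathcal{F}_{\mathrm{dec}} \neq \emptyset, \qquad
\mathcal{F}_{\mathrm{dec}} \setminus \mathcal{F}_{\mathrm{enc}} \neq \emptyset, \qquad
\mathcal{F}_{\mathrm{enc}} \setminus \mathcal{F}_{\mathrm{dec}} \neq \emptyset,
\]
i.e., the two classes intersect, but neither contains the other (the last separation is shown against decoders that use positional embeddings).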
We also characterize encoders and decoders via the time and space complexity they require to generate each additional token. (Here n is the sequence length, D the embedding dimension, and L the number of layers in the model.)
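As a rough standard accounting (ours, not a quote of the paper's exact bounds): a decoder can cache keys and values, so generating token n+1 only requires the computation attached to the new position, whereas ENTP must re-run bidirectional attention over the entire prefix.
\[
\text{Decoder with KV cache:}\quad O\!\big(L(nD + D^2)\big)\ \text{time per token},\qquad O(LnD)\ \text{cache memory};
\]
\[
\text{ENTP (re-encode):}\quad O\!\big(L(n^2 D + nD^2)\big)\ \text{time per token},\qquad O(nD)\ \text{activations, with no reusable cache}.
\]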
We propose the autoregressive Triplet-Counting task, in which the next token is determined by counting triplets among the previous tokens; a sketch of this style of task follows.
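As an illustration only, here is a minimal data-generation sketch. The concrete rule below (count the pairs (i, j) whose sum with the last token is divisible by the current length n, then emit that count mod 2) is an assumed Triplet-Counting-style rule for exposition; see the paper for the exact definition.

```python
import random

def next_token(x):
    """Assumed Triplet-Counting-style rule (for illustration, not
    necessarily the paper's exact definition): count pairs (i, j) with
    x[i] + x[j] + x[-1] divisible by n = len(x), and emit count mod 2."""
    n = len(x)
    count = sum(
        1
        for i in range(n)
        for j in range(n)
        if (x[i] + x[j] + x[-1]) % n == 0
    )
    return count % 2

def make_sequence(length, vocab=16, num_seed=4, seed=0):
    """Random seed tokens followed by tokens generated autoregressively
    from the rule above."""
    rng = random.Random(seed)
    x = [rng.randrange(vocab) for _ in range(num_seed)]
    while len(x) < length:
        x.append(next_token(x))
    return x

print(make_sequence(16))
```

Note that a naive computation of each next token takes time quadratic in the prefix length, which is the flavor of per-token workload the conjecture above concerns.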
When we trained small-scale decoders, they all failed to learn Triplet-Counting, whereas a small encoder (ENTP) quickly reached 100% sequence accuracy.
Additionally, we found that decoder-only LLMs also struggle to learn Triplet-Counting.
We train encoder-only and decoder-only models to perform addition on 3-digit numbers and to do language modeling on the OpenWebText dataset. Lastly, we compare performance on various in-context learning (ICL) tasks. In general, the encoder achieves slightly higher accuracy or lower loss on each task than a decoder of the same size.
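For the addition experiment, a minimal picture of the data is a set of character sequences like "123+456=579" that the model completes digit by digit. The formatting below (digit order, delimiters, no padding) is an assumption for illustration, not necessarily the setup used in the paper.

```python
import random

def make_addition_example(rng):
    """One 3-digit addition example as a plain character string.
    The "a+b=c" layout is an assumed format for illustration; work on
    Transformer addition often varies digit order and padding."""
    a = rng.randrange(100, 1000)
    b = rng.randrange(100, 1000)
    return f"{a}+{b}={a + b}"

rng = random.Random(0)
print([make_addition_example(rng) for _ in range(3)])
```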
@misc{ewer2024entpencoderonlytokenprediction,
title={ENTP: Encoder-only Next Token Prediction},
author={Ethan Ewer and Daewon Chae and Thomas Zeng and Jinkyu Kim and Kangwook Lee},
year={2024},
eprint={2410.01600},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2410.01600},
}