InCoder: A Generative Model for Code In-Filling and Synthesis

Daniel Fried*, Armen Aghajanyan*, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, and Mike Lewis

ICLR, 2023



Model weights and instructions:


Inserting and completing code in a single model

We train a generative, decoder-only Transformer using the causal-masking training objective (from CM3, Aghajanyan et al. 2022), which teaches the model to generate entire code files in arbitrary orders: regions of a file are masked out and moved to the end, so the model learns to produce them conditioned on the code on both sides. Here's an example where a single region is masked:
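As a rough illustration of this transformation, the sketch below moves one span of a small file to the end and leaves a sentinel in its place. The function name `causal_mask` and the sentinel spellings `<Mask:0>` and `<EOM>` are assumptions chosen for illustration; the released tokenizer may spell its special tokens differently.

```python
def causal_mask(document: str, start: int, end: int,
                mask: str = "<Mask:0>", eom: str = "<EOM>") -> str:
    """Move document[start:end] to the end, leaving a sentinel in its place.

    Hypothetical helper: the sentinel strings are illustrative, not
    necessarily the ones used by the released model.
    """
    if not 0 <= start <= end <= len(document):
        raise ValueError("span must lie inside the document")
    left, span, right = document[:start], document[start:end], document[end:]
    # Layout: left context, sentinel, right context, sentinel again,
    # the masked span, and finally an end-of-mask marker.
    return left + mask + right + mask + span + eom


source = "def add(a, b):\n    return a + b\n"
start = source.index("return")
end = start + len("return a + b")
masked = causal_mask(source, start, end)
print(masked)
```

Training left-to-right on sequences of this shape is what lets an ordinary decoder-only model see the right-hand context before it generates the masked region.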

Zero-shot generation for code tasks

At inference time, we can prompt our model with a document containing MASK tokens wherever we want it to insert code. This lets us perform many code tasks without any task-specific fine-tuning, including docstring generation, type hint prediction, variable renaming, cloze tasks, and more. Here are real outputs from our model:
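To make the prompting step concrete, here is a minimal sketch of zero-shot infilling for a docstring. It assumes the same illustrative sentinels (`<Mask:0>`, `<EOM>`) as in the training layout, and it uses a stand-in `fake_generate` function in place of the real model so the example runs on its own. All names here are hypothetical and do not come from the released code.

```python
from typing import Callable

MASK = "<Mask:0>"  # illustrative sentinel, assumed spelling
EOM = "<EOM>"      # illustrative end-of-mask marker, assumed spelling


def build_infill_prompt(left: str, right: str) -> str:
    """Left context, sentinel, right context, then the sentinel again
    to cue the model to start generating the missing span."""
    return left + MASK + right + MASK


def infill(left: str, right: str, generate: Callable[[str], str]) -> str:
    """Ask `generate` for the missing span and splice it back in."""
    completion = generate(build_infill_prompt(left, right))
    # Keep only the text produced before the end-of-mask marker.
    span = completion.split(EOM, 1)[0]
    return left + span + right


def fake_generate(prompt: str) -> str:
    # Stand-in for a real model call; returns a canned completion.
    return '"""Return the sum of a and b."""' + EOM + "ignored"


left = "def add(a, b):\n    "
right = "\n    return a + b\n"
result = infill(left, right, fake_generate)
print(result)
```

Swapping `fake_generate` for a call to an actual language model is the only change needed to turn this into a working infilling loop. The same pattern covers type hints or variable names by moving where the mask sits.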

See more examples here: Examples

Trained on open-source code and StackOverflow

Unlike past work, our model's training data consists only of permissively licensed code (Apache 2.0, MIT, BSD-2, and BSD-3) from online sources such as GitHub and GitLab, as well as StackOverflow. We focus on Python and JavaScript but include 28 languages in total, amounting to ~200GB of data after deduplication, filtering, and decontamination. See our paper for details.

Model available in HuggingFace's Transformers

6.7B parameter version:

1.3B parameter version:

See our README for instructions on the required versions of transformers and tokenizers, and for examples of how to do infilling.


Thanks to Lucile Saulnier, Leandro von Werra, Nicolas Patry, Suraj Patil, Omar Sanseviero, and others at HuggingFace for help with the model release, and to Naman Goyal and Stephen Roller for the code our demo was based on!