InCoder: A Generative Model for Code In-Filling and Synthesis
Daniel Fried*, Armen Aghajanyan*, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, and Mike Lewis
ICLR, 2023
Paper: https://arxiv.org/abs/2204.05999
Demo: https://huggingface.co/spaces/facebook/incoder-demo
Model weights and instructions: https://github.com/dpfried/incoder/blob/main/README.md
Examples: https://sites.google.com/view/incoder-code-models/home/examples
Inserting and completing code in a single model
We train a generative, decoder-only Transformer with the causal-masking objective introduced for CM3 (Aghajanyan et al., 2022). The objective masks out regions of a code file and moves them to the end of the file, so the model learns to generate entire code files in arbitrary orderings.
[Figure: an example where a single region is masked.]
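The masking transformation can be sketched in a few lines of Python. This is a minimal illustration, not the training code: it works on characters rather than tokens, and the sentinel spellings `<|mask:k|>` and `<|endofmask|>`, along with the helper names `causal_mask` and `random_causal_mask`, are assumptions made for this example.

```python
import random

SENTINEL = "<|mask:{}|>"        # assumed spelling of the mask sentinel
END_OF_MASK = "<|endofmask|>"   # assumed spelling of the end-of-span marker


def causal_mask(document: str, start: int, end: int, k: int = 0) -> str:
    """Replace document[start:end] with a sentinel and move the span to the end.

    The result can be modeled left to right: the model first sees the
    document with a hole in it, then learns to produce the missing span.
    """
    sentinel = SENTINEL.format(k)
    span = document[start:end]
    with_hole = document[:start] + sentinel + document[end:]
    return with_hole + sentinel + span + END_OF_MASK


def random_causal_mask(document: str, rng: random.Random) -> str:
    """Mask one randomly chosen, non-empty span of a non-empty document."""
    start = rng.randrange(len(document))
    end = rng.randrange(start + 1, len(document) + 1)
    return causal_mask(document, start, end)


doc = "def add(a, b):\n    return a + b\n"
start = doc.index("a + b")
example = causal_mask(doc, start, start + len("a + b"))
print(example)
```

In the printed example, the expression `a + b` is replaced by the sentinel in the function body and reappears after the second sentinel at the end of the document.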
Zero-shot generation for code tasks
At inference time, we can prompt our model with a document containing MASK tokens wherever we want it to insert code. This lets us perform many code tasks without any task-specific fine-tuning, including docstring generation, type hint prediction, variable renaming, and cloze tasks.
[Figure: real outputs from our model.]
See more examples: https://sites.google.com/view/incoder-code-models/home/examples
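As a rough sketch of how such a prompt might be assembled and its output stitched back together, the Python below fills the gaps between consecutive document parts one region at a time. The sentinel spellings, the `infill` helper, and the `generate` callable (any function that returns the model's continuation of a prompt) are assumptions for illustration; the readme describes the exact procedure used with the released models.

```python
from typing import Callable, List

EOM = "<|endofmask|>"  # assumed end-of-span marker


def make_sentinel(i: int) -> str:
    """Assumed spelling of the sentinel for masked region i."""
    return f"<|mask:{i}|>"


def infill(parts: List[str], generate: Callable[[str], str]) -> str:
    """Fill the gap between each pair of consecutive parts.

    `generate` is a hypothetical callable that takes a prompt and returns
    only the text the model produces after it.
    """
    # Document with one sentinel per gap.
    prompt = ""
    for i, part in enumerate(parts):
        prompt += part
        if i < len(parts) - 1:
            prompt += make_sentinel(i)

    infills = []
    for i in range(len(parts) - 1):
        prompt += make_sentinel(i)              # ask for the contents of region i
        completion = generate(prompt)
        infilled = completion.split(EOM, 1)[0]  # keep text before the end marker
        infills.append(infilled)
        prompt += infilled + EOM                # condition later regions on this one

    # Stitch the original parts and the generated spans back together.
    result = parts[0]
    for infilled, part in zip(infills, parts[1:]):
        result += infilled + part
    return result


# Usage with a stand-in for the model, here asking for a docstring.
seen_prompts = []


def fake_generate(prompt: str) -> str:
    seen_prompts.append(prompt)
    return '"""Add two numbers."""' + EOM + " ignored"


parts = ["def add(a, b):\n    ", "\n    return a + b\n"]
filled = infill(parts, fake_generate)
print(filled)
```

Swapping `fake_generate` for a function that samples from the model turns the same loop into zero-shot docstring generation, type hint prediction, or any other task that can be phrased as filling a hole.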
Trained on open-source code and StackOverflow
Unlike past work, our model's training data consists only of permissively licensed code (Apache 2.0, MIT, BSD-2, and BSD-3) from online sources such as GitHub and GitLab, plus content from StackOverflow. We focus on Python and JavaScript but include 28 languages in total, for about 200GB of data after deduplication, filtering, and decontamination. See our paper for details.
Demo available
Try the demo at https://huggingface.co/spaces/facebook/incoder-demo
Model available in HuggingFace's Transformers
6.7B parameter version: https://huggingface.co/facebook/incoder-6B
1.3B parameter version: https://huggingface.co/facebook/incoder-1B
See our readme (https://github.com/dpfried/incoder/blob/main/README.md) for instructions on the required versions of transformers and tokenizers, and for examples of how to do infilling.
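For orientation, loading one of these checkpoints with the standard Transformers auto classes might look like the sketch below. It assumes PyTorch and a compatible `transformers` install; the helper names (`load_incoder`, `complete`) and the sampling settings are illustrative choices for this example, and the readme remains the reference for required versions and recommended settings.

```python
MODEL_IDS = {
    "6.7B": "facebook/incoder-6B",
    "1.3B": "facebook/incoder-1B",
}


def load_incoder(size: str = "1.3B"):
    """Download (on first use) and return the tokenizer and model."""
    # Imported lazily so the module can be inspected without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = MODEL_IDS[size]
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    return tokenizer, model


def complete(tokenizer, model, prompt: str, max_new_tokens: int = 64) -> str:
    """Sample a left-to-right continuation of `prompt`.

    The sampling parameters below are assumptions for this sketch.
    """
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output = model.generate(
        input_ids,
        do_sample=True,
        top_p=0.95,
        temperature=0.2,
        max_new_tokens=max_new_tokens,
    )
    # Leave whitespace untouched, since it is significant in code.
    return tokenizer.decode(output[0], clean_up_tokenization_spaces=False)


if __name__ == "__main__":
    tok, lm = load_incoder("1.3B")
    print(complete(tok, lm, "def fibonacci(n):\n"))
```

The 1.3B checkpoint is the lighter option for a first experiment; the 6.7B checkpoint needs considerably more memory.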
Credits
Thanks to Lucile Saulnier, Leandro von Werra, Nicolas Patry, Suraj Patil, Omar Sanseviero, and others at HuggingFace for help with the model release, and to Naman Goyal and Stephen Roller for the code our demo was based on!