Proprietary LLMs such as OpenAI’s GPT-4, Google’s Bard, and Anthropic’s Claude have substantially transformed the landscape of machine learning and NLP research. Significant gains have been achieved across a variety of tasks, including question answering, dialogue, puzzle solving, summarization, text games, and robot control, simply by calling the LLM to generate text given a prompt. However, the proprietary nature of these LLMs has driven a paradigm shift in AI research, which until recently was built largely on open-source foundations. For example, earlier breakthroughs such as BERT and Word2Vec were quickly made publicly available.
Open-source models offer two immediate advantages. First, they are free to use, unlike proprietary LLMs that carry subscription costs; this lowers the barrier to entry for researchers and practitioners and enables community-led development. Second, open-source models are more transparent, which can inform better decisions about when to use them and allow their capabilities to be calibrated more accurately. In contrast, using proprietary LLMs can become prohibitively expensive, excluding all but well-funded groups, and the lack of access to their training data and model internals reduces transparency.
To reduce this reliance on proprietary models, several open-source alternatives have been developed, such as LLaMA [Touvron et al., 2023], GPT-J [Wang and Komatsuzaki, 2021], Phi-1 [Gunasekar et al., 2023], and Alpaca [Taori et al., 2023]. Some of these, such as Alpaca and Phi-1, use a proprietary LLM to generate training text, in effect training a language model using another language model (LM2LM). However, despite significant interest from both the research community and industry, a significant performance gap remains between these models and proprietary LLMs such as GPT-4, even when the latter are used for LM2LM training. This has led to the belief that the LM2LM approach is challenging for developing open-source alternatives [Gudibande et al., 2023]. Further, the terms and conditions of a proprietary LLM may preclude approaches that rely on it to train other LLMs. Together, this may mean that novel approaches are necessary to develop open-source LLMs without the resources deployed for training proprietary ones.
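To make the LM2LM setup concrete, the following is a minimal sketch, in Python with Hugging Face transformers, of an Alpaca-style pipeline: responses from a teacher LLM to seed prompts are collected as supervised data, and an open student model is fine-tuned on them with the standard causal language-modeling loss. The `query_teacher` function, the seed prompts, and the `gpt2` checkpoint are illustrative placeholders, not part of any specific published recipe.

```python
# Minimal LM2LM sketch: fine-tune an open "student" model on text produced
# by a "teacher" LLM. All names below are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def query_teacher(prompt: str) -> str:
    # Hypothetical stand-in for a proprietary-LLM API call; returns canned
    # text so the sketch runs end to end without network access.
    return "A stand-in teacher response for: " + prompt

seed_prompts = [
    "Explain overfitting in one sentence.",
    "Write a haiku about open-source software.",
]

# Stage 1: build a (prompt, response) dataset from the teacher.
pairs = [(p, query_teacher(p)) for p in seed_prompts]

# Stage 2: fine-tune an open student model on the teacher's outputs.
student_name = "gpt2"  # small open checkpoint, chosen only for illustration
tokenizer = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

student.train()
for prompt, response in pairs:
    batch = tokenizer(prompt + "\n" + response,
                      return_tensors="pt", truncation=True)
    # Standard causal-LM objective: the labels are the input ids themselves.
    loss = student(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In practice, pipelines such as Alpaca’s mask the prompt tokens out of the loss and train on tens of thousands of teacher responses; the sketch only shows the shape of the pipeline. Whether such imitation data can close the gap to the teacher is precisely the question raised by Gudibande et al. [2023].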
Our workshop’s main goal is to bring together members of the research community from both academia and industry to understand how to design better open-source LLMs and bridge the gap between them and proprietary LLMs. One core question the workshop seeks to answer is whether better data design and algorithmic innovations can close this gap. Can an LM2LM approach gradually build a series of better language models from the current set of available open-source LLMs? And, when the terms of usage of proprietary LLMs permit it, can an LM2LM approach use those models to build better open-source LLMs? As mentioned earlier, this question has attracted attention on both sides, with several algorithmic works proposing ways to do so (e.g., Alpaca, Phi-1) and others arguing it is unlikely to succeed [Gudibande et al., 2023]. Beyond these questions, the workshop will cover a wide range of topics, including but not limited to:
Algorithms: What are better algorithms for training open-source LLMs? We welcome papers that discuss novel approaches for fine-tuning and aligning LLMs, including better learning algorithms, optimization methods, training pipelines, and model innovations.
Evaluation: How should we evaluate open-source and proprietary LLMs? In particular, open-source LLMs provide access to model weights, which enables more thorough evaluations as well as mechanistic interpretability. Can we evaluate open-source LLMs using proprietary LLMs? If so, what is a reliable and reproducible way to do so?
Transparency and Safety: What are good practices for making an open-source LLM more transparent and safe? How can full knowledge of the dataset, code, and model internals be used to make better-informed decisions about when to use the LLM? How can this knowledge be used to implement guardrails for safer LLM use?
Data: How can we build high-quality, diverse, and open datasets for building open-source LLMs? How do we evaluate which data sources are helpful and which are not?
Scale: Can we develop small open-source language models that run on-device and enable fast inference? Can this be done via efficient distillation from an LLM (see the sketch after this list)?
Community: How can we develop collective efforts for training open-source LLMs? Can there be a Linux/GNU-style way of developing open-source LLMs in which multiple efforts build on top of each other rather than competing? This could include modular approaches for merging language models, building datasets, running evaluations, training models, and researching new capabilities to further improve open LLMs in the long term.
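To illustrate the distillation question raised under the Scale topic, below is a minimal white-box knowledge-distillation step in Python: a small student is trained to match a larger open teacher’s next-token distribution via a KL-divergence loss. The checkpoint names, temperature, and single-batch setup are assumptions made for illustration only.

```python
# Minimal white-box distillation sketch: match a small student's next-token
# distribution to a larger open teacher's using a KL-divergence loss.
# Checkpoint names and hyperparameters are illustrative placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name, student_name = "gpt2-medium", "gpt2"  # open, shared vocabulary
tokenizer = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name).eval()
student = AutoModelForCausalLM.from_pretrained(student_name)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)
temperature = 2.0  # softens both distributions before matching them

batch = tokenizer("Open-source models enable on-device inference.",
                  return_tensors="pt")

with torch.no_grad():  # the teacher supplies targets, not gradients
    teacher_logits = teacher(**batch).logits
student_logits = student(**batch).logits

# KL divergence from the teacher's softened distribution to the student's,
# computed over the vocabulary at every position (Hinton-style distillation).
loss = F.kl_div(
    F.log_softmax(student_logits / temperature, dim=-1),
    F.softmax(teacher_logits / temperature, dim=-1),
    reduction="batchmean",
) * temperature ** 2

loss.backward()
optimizer.step()
optimizer.zero_grad()
```

This form of distillation requires access to the teacher’s logits, which is itself an argument for open models: a black-box proprietary teacher exposes only sampled text, forcing the LM2LM data-generation route sketched earlier.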
Sebastien Bubeck
Dawn Song
Angela Fan
Tatsunori Hashimoto
Jason Weston
Jonathan Chang
Yejin Choi
Jiaxin Huang
Jiacheng (Gary) Liu
Dipendra Misra
Valentina Pyatkin
Roberta Raileanu