A multilingual, open-source suite of high-quality LLMs for European languages.
New Website:
Goal
To build an open-source European Large Language Model that supports all 24 official European Union languages, as well as a few other strategically important languages.
News
12-12-2024 We released our 9B model on Hugging Face on 2 December, and it reached 50k downloads in its first week! It is also currently the top-performing 9B model on the OpenGPTX European leaderboard, behind only Gemma 2, which is effectively a larger model since its 9B parameter count excludes embeddings.
22-10-2024 We are attending the EuroHPC User Day in Amsterdam in October; please reach out if you would like to meet us in person!
24-09-2024 Our paper on the 1.7B EuroLLM model is published on arXiv: https://arxiv.org/abs/2409.16235
Duration
Project Timeline: 1 May 2024 - 30 April 2025
We thank EuroHPC for the HPC resources used to support this work through grant EHPC-EXT-2023E01-042.
Deliverables
A series of models of different sizes (1B, 9B, and 22B), trained on 4T tokens, balancing effectiveness and efficiency
A multimodal model which can process and understand speech or text input
Full project codebase available to the public with detailed data and model descriptions
Models pretrained and finetuned on text from all supported languages for better performance, especially for native speakers of those languages.
Models trained to process speech and text in various languages, enabling them to support spoken languages and recognize prosody and emotion.
Models that offer high performance on various tasks in multiple languages, including QA, summarisation, and translation.
Models that can be used freely by all researchers, organisations and citizens of Europe.
Models
EuroLLM-9B is a 9B parameter model trained on similar data to EuroLLM-1.7B.
EuroLLM-1.7B
EuroLLM-1.7B is a 1.7B parameter model trained on 4 trillion tokens divided across the considered languages and several data sources: Web data, parallel data (en-xx and xx-en), and high-quality datasets. EuroLLM-1.7B-Instruct was further instruction tuned on EuroBlocks, an instruction tuning dataset with focus on general instruction-following and machine translation.
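As a minimal sketch, the released models can be loaded through the Hugging Face `transformers` library. The model id `utter-project/EuroLLM-1.7B` and the plain translation-style prompt below are assumptions for illustration; check the project's Hugging Face page for the exact identifiers and recommended prompting.

```python
# Hedged sketch: running EuroLLM-1.7B via Hugging Face transformers.
# MODEL_ID is an assumption -- verify it on the project's Hugging Face page.
MODEL_ID = "utter-project/EuroLLM-1.7B"


def build_translation_prompt(source: str, src_lang: str, tgt_lang: str) -> str:
    # Simple continuation-style prompt for the base (non-instruct) model;
    # the model is expected to continue with the translation.
    return f"{src_lang}: {source}\n{tgt_lang}:"


if __name__ == "__main__":
    # Heavy imports kept inside the guard so the helper above stays
    # importable even without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    prompt = build_translation_prompt("Hello, Europe!", "English", "Portuguese")
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```

For the instruction-tuned variant (EuroLLM-1.7B-Instruct), the model's chat template via `tokenizer.apply_chat_template` would be the more appropriate entry point.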
Do you have questions?
Contact eurollm.info@gmail.com for more information about the project.