Welcome to the Data Filtering Challenge for Training Edge Language Models!
The rapid development of language models (LMs) has catalyzed breakthroughs across various domains, including natural language understanding, robotics, and digital human interaction. Compared with general large LMs, which are difficult to deploy on resource-constrained edge devices, edge LMs fine-tuned for target downstream tasks have the potential to achieve both greater efficiency and higher task accuracy. However, this fine-tuning hinges on the availability of high-quality, diverse datasets. The Data Filtering Challenge for Training Edge Language Models seeks to unite academic researchers, industry experts, and AI enthusiasts to develop data filtering techniques that refine datasets driving the next generation of edge LMs.
This challenge invites participants to create data filtering techniques and submit datasets refined by these methods, aiming to significantly enhance the performance of edge LMs on downstream tasks. With a focus on improving model accuracy and applicability across crucial domains, participants will have the opportunity to push the frontier of edge LMs and gain recognition within the AI community. For fine-tuning, the challenge uses Weight-Decomposed Low-Rank Adaptation (DoRA), which derives efficient task-specific edge LMs from pre-trained ones with fewer resources, making it well suited to devices such as smartphones and portable robots.
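To give a rough sense of how DoRA works (this is an illustrative sketch, not the challenge's official training code), the PyTorch snippet below implements a DoRA-style linear layer following the published formulation W' = m * (W0 + B A) / ||W0 + B A||: the frozen pretrained weight is decomposed into a magnitude and a direction, a LoRA-style low-rank update adjusts the direction, and only the small matrices A, B and the magnitude vector m are trained. Class and parameter names, the rank, and the initialization scale are our assumptions.

```python
import torch
import torch.nn as nn

class DoRALinear(nn.Module):
    """Minimal DoRA-style adapter around a frozen pretrained linear layer."""

    def __init__(self, base: nn.Linear, rank: int = 16):
        super().__init__()
        # Frozen pretrained weight W0, shape (out_features, in_features).
        self.register_buffer("W0", base.weight.detach().clone())
        self.bias = base.bias  # reuse the base layer's bias (may be None)
        # Trainable low-rank update, delta_W = B @ A (LoRA-style).
        # Init scale 0.01 is illustrative; the paper uses Kaiming init for A.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        # Trainable magnitude m, initialized from the norms of W0
        # (one scalar per output row here, a common implementation choice).
        self.m = nn.Parameter(self.W0.norm(p=2, dim=1, keepdim=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Direction: pretrained weight plus low-rank update, renormalized.
        directed = self.W0 + self.B @ self.A
        direction = directed / directed.norm(p=2, dim=1, keepdim=True)
        # Recombine: W' = m * (W0 + B A) / ||W0 + B A||.
        return nn.functional.linear(x, self.m * direction, self.bias)
```

Because only A, B, and m receive gradients, the per-task adapter stays small enough to store and swap on resource-constrained devices.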
For questions/comments about the challenge, please join our Discord Server.
Participants are encouraged to develop and apply data filtering techniques to curate datasets optimized for key use cases in edge LM deployment. These datasets should enhance the performance of edge LMs in diverse scenarios, including:
Roleplay in interactive digital environments
Function calling on mobile devices
Robotics for autonomous tasks
Retrieval-augmented generation (RAG) tasks
The goal is for edge LMs fine-tuned on these curated datasets to demonstrate significant improvements across these use cases. In particular, participants should highlight how their datasets, coupled with DoRA-enhanced models, improve performance.
More details can be found on the Problem page.
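To make the filtering task concrete, a submission pipeline might start from simple, transparent filters before moving to learned quality scorers. The sketch below is one illustrative approach, not a prescribed method: the file names, the "response" field, and the thresholds are all placeholder assumptions rather than parts of the official starter kit. It removes exact duplicates and applies basic quality heuristics to a JSONL dataset.

```python
import hashlib
import json

def passes_heuristics(text: str) -> bool:
    """Cheap quality checks; thresholds are illustrative, tune per domain."""
    words = text.split()
    if len(words) < 5:                       # drop near-empty samples
        return False
    if len(set(words)) / len(words) < 0.3:   # drop highly repetitive text
        return False
    non_ascii = sum(1 for c in text if ord(c) > 127)
    return non_ascii / max(len(text), 1) < 0.5  # drop mostly-garbled text

seen = set()
with open("raw.jsonl") as fin, open("filtered.jsonl", "w") as fout:
    for line in fin:
        sample = json.loads(line)
        text = sample.get("response", "")
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen or not passes_heuristics(text):
            continue  # skip exact duplicates and low-quality samples
        seen.add(digest)
        fout.write(line)
```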
News
04/15/2025: Our website is online!
05/14/2025: The starter kit has been released!
Timeline
Website Release: April 15th, 2025
Toolkit Release: May 14th, 2025
Registration Deadline: July 15th, 2025
Submission Deadline: September 1st, 2025
Award Notification: October 1st, 2025
Awards Ceremony / Workshop: December 2nd, 2025
Awards
Grand Prize: $10,000 + $1,000 Lambda Credits
Category-Specific Awards: 4 x $3,000 + $500 Lambda Credits
Innovation Award: $3,000 + $300 Lambda Credits
Organizers
Shizhe Diao (NVIDIA)
Yonggan Fu (NVIDIA & Georgia Institute of Technology)
Xin Dong (NVIDIA)
Peter Belcak (NVIDIA)
Lexington Whalen (NVIDIA & Georgia Institute of Technology)
Mostofa Patwary (NVIDIA)
Mohammad Shoeybi (NVIDIA)
Wenfei Zhou (NVIDIA)
Jan Kautz (NVIDIA)
Yingyan (Celine) Lin (NVIDIA & Georgia Institute of Technology)
Pavlo Molchanov (NVIDIA)