Data Filtering Challenge for Training Edge Language Models

Welcome to the Data Filtering Challenge for Training Edge Language Models!

Introduction and Motivation

The rapid development of language models (LMs) has catalyzed breakthroughs across various domains, including natural language understanding, robotics, and digital human interaction. Compared with general large LMs, which are difficult to deploy on resource-constrained edge devices, edge LMs fine-tuned for target downstream tasks have the potential to achieve both greater efficiency and higher task accuracy. However, this fine-tuning hinges on the availability of high-quality, diverse datasets. The Data Filtering Challenge for Training Edge Language Models seeks to unite academic researchers, industry experts, and AI enthusiasts to develop data filtering techniques that refine datasets driving the next generation of edge LMs.

This challenge invites participants to create data filtering techniques and submit datasets refined by these methods, aiming to significantly enhance the performance of edge LMs on downstream tasks. With a focus on improving model accuracy and applicability across crucial domains, participants will have the opportunity to push the frontier of edge LMs and gain recognition within the AI community. For this edition, we are focusing on a method known as Low-Rank Adaptation (LoRA), which enables the creation of efficient, task-specific edge LMs from pre-trained ones using far fewer resources than full fine-tuning, making it ideal for devices such as smartphones and portable robots.
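For readers less familiar with LoRA: it freezes the pre-trained weight matrix and learns only a small low-rank update on top of it. The sketch below illustrates the idea in PyTorch; it is a minimal illustration, not part of the challenge setup, and the class name LoRALinear and the default rank and alpha values are our own illustrative choices.

```python
# Minimal sketch of a LoRA-style adapter (assumes PyTorch is installed).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen pre-trained linear layer with a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pre-trained weights stay frozen
        # Low-rank factors: A projects down to `rank`, B projects back up.
        # Only these small matrices are trained, which keeps fine-tuning cheap.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + (alpha / r) * B A x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Example: adapt a single 768-wide projection layer.
layer = LoRALinear(nn.Linear(768, 768))
```

Because only the low-rank factors are trainable, the number of optimized parameters drops by orders of magnitude relative to full fine-tuning, which is what makes the approach attractive for edge deployment.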

Scope of this Challenge

Participants are encouraged to develop and apply data filtering techniques to curate datasets optimized for key use cases in edge LM deployment. These datasets aim to enhance the performance of edge LMs across a diverse set of deployment scenarios.

The goal is to ensure that edge LMs, continuously trained on these curated datasets, demonstrate significant improvements across these use cases. In particular, participants should highlight how these datasets, coupled with LoRA-enhanced models, improve accuracy and performance.
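As a concrete starting point, a data filtering technique can be as simple as a model-based quality score with a threshold. The sketch below shows a perplexity-based filter; it is only one possible baseline, not a challenge requirement, and it assumes the Hugging Face transformers library is available. The gpt2 scoring model and the threshold of 100 are illustrative placeholders.

```python
# Minimal sketch of a perplexity-based quality filter
# (assumes the Hugging Face transformers library).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Score a candidate training example; lower usually means cleaner text."""
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token-level cross-entropy
    return torch.exp(loss).item()

def keep(text: str, threshold: float = 100.0) -> bool:
    """Simple filter rule: retain examples whose perplexity falls below a threshold."""
    return perplexity(text) < threshold
```

In practice, competitive submissions will likely combine several such signals, such as deduplication, domain relevance, and toxicity screening, rather than relying on a single threshold.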

More details can be found on the Problem page.

News

Our website is online!

Challenge Timeline

Jan. 24, 2025

Jan. 24, 2025

Feb. 15, 2025

May 31, 2025

Jun. 20, 2025

Summer 2025

Awards

$10,000

$3,000

$3,000

Sponsors

Contest Organizers

Shizhe Diao, NVIDIA
Yonggan Fu, Georgia Institute of Technology
Xin Dong, NVIDIA
Peter Belcak, NVIDIA
Lexington Whalen, Georgia Institute of Technology
Jan Kautz, NVIDIA
Yingyan (Celine) Lin, Georgia Institute of Technology
Pavlo Molchanov, NVIDIA