Welcome to the Data Filtering Challenge for Training Edge Language Models!
The rapid development of language models (LMs) has catalyzed breakthroughs across various domains, including natural language understanding, robotics, and digital human interaction. Compared with general large LMs, which are difficult to deploy on resource-constrained edge devices, edge LMs fine-tuned for target downstream tasks have the potential to achieve both greater efficiency and higher task accuracy. However, this fine-tuning hinges on the availability of high-quality, diverse datasets. The Data Filtering Challenge for Training Edge Language Models seeks to unite academic researchers, industry experts, and AI enthusiasts to develop data filtering techniques that refine datasets driving the next generation of edge LMs.
This challenge invites participants to create data filtering techniques and submit datasets refined by these methods, aiming to significantly enhance the performance of edge LMs on downstream tasks. With a focus on improving model accuracy and applicability across crucial domains, participants will have the opportunity to push the frontier of edge LMs and gain recognition within the AI community. For fine-tuning, the challenge uses Weight-Decomposed Low-Rank Adaptation (DoRA), which derives efficient task-specific edge LMs from pre-trained ones with fewer resources, making it well suited to devices such as smartphones and portable robots.
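To give a rough sense of how DoRA works (this is an illustrative sketch, not the challenge's official training code), the PyTorch snippet below implements a DoRA-style linear layer following the published formulation W' = m * (W0 + B A) / ||W0 + B A||: the frozen pretrained weight is decomposed into a magnitude and a direction, a LoRA-style low-rank update adjusts the direction, and only the small matrices A, B and the magnitude vector m are trained. Class and parameter names, the rank, and the initialization scale are our assumptions.

```python
import torch
import torch.nn as nn

class DoRALinear(nn.Module):
    """Minimal DoRA-style adapter around a frozen pretrained linear layer."""

    def __init__(self, base: nn.Linear, rank: int = 16):
        super().__init__()
        # Frozen pretrained weight W0, shape (out_features, in_features).
        self.register_buffer("W0", base.weight.detach().clone())
        self.bias = base.bias  # reuse the base layer's bias (may be None)
        # Trainable low-rank update, delta_W = B @ A (LoRA-style).
        # Init scale 0.01 is illustrative; the paper uses Kaiming init for A.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        # Trainable magnitude m, initialized from the norms of W0
        # (one scalar per output row here, a common implementation choice).
        self.m = nn.Parameter(self.W0.norm(p=2, dim=1, keepdim=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Direction: pretrained weight plus low-rank update, renormalized.
        directed = self.W0 + self.B @ self.A
        direction = directed / directed.norm(p=2, dim=1, keepdim=True)
        # Recombine: W' = m * (W0 + B A) / ||W0 + B A||.
        return nn.functional.linear(x, self.m * direction, self.bias)
```

Because only A, B, and m receive gradients, the per-task adapter stays small enough to store and swap on resource-constrained devices.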
For questions/comments about the challenge, please join our Discord Server.
Participants are encouraged to develop and apply data filtering techniques to curate datasets optimized for key use cases in edge LM deployment. These datasets should enhance the performance of edge LMs in diverse scenarios, including:
Roleplay in interactive digital environments
Function calling on mobile devices
Robotics for autonomous tasks
Retrieval-augmented generation (RAG) tasks
The goal is for edge LMs fine-tuned on these curated datasets to demonstrate significant improvements across these use cases. In particular, participants should highlight how their datasets, coupled with DoRA-enhanced models, improve performance.
More details can be found on the Problem page.
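To make the filtering task concrete, a submission pipeline might start from simple, transparent filters before moving to learned quality scorers. The sketch below is one illustrative approach, not a prescribed method: the file names, the "response" field, and the thresholds are all placeholder assumptions rather than parts of the official starter kit. It removes exact duplicates and applies basic quality heuristics to a JSONL dataset.

```python
import hashlib
import json

def passes_heuristics(text: str) -> bool:
    """Cheap quality checks; thresholds are illustrative, tune per domain."""
    words = text.split()
    if len(words) < 5:                       # drop near-empty samples
        return False
    if len(set(words)) / len(words) < 0.3:   # drop highly repetitive text
        return False
    non_ascii = sum(1 for c in text if ord(c) > 127)
    return non_ascii / max(len(text), 1) < 0.5  # drop mostly-garbled text

seen = set()
with open("raw.jsonl") as fin, open("filtered.jsonl", "w") as fout:
    for line in fin:
        sample = json.loads(line)
        text = sample.get("response", "")
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen or not passes_heuristics(text):
            continue  # skip exact duplicates and low-quality samples
        seen.add(digest)
        fout.write(line)
```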
News
04/15/2025: Our website is online!
05/14/2025: The starter kit has been released!
Timeline
Website Release: April 15th, 2025
Toolkit Release: May 14th, 2025
Registration Deadline: July 15th, 2025
Submission Deadline: September 1st, 2025
Award Notification: October 1st, 2025
Awards Ceremony / Workshop: December 2nd, 2025
Awards
Grand Prize: $10,000 + $1,000 Lambda Credits
Category-Specific Awards: 4 x $3,000 + $500 Lambda Credits
Innovation Award: $3,000 + $300 Lambda Credits
Organizers
Shizhe Diao (NVIDIA)
Yonggan Fu (NVIDIA & Georgia Institute of Technology)
Xin Dong (NVIDIA)
Peter Belcak (NVIDIA)
Lexington Whalen (NVIDIA & Georgia Institute of Technology)
Mostofa Patwary (NVIDIA)
Mohammad Shoeybi (NVIDIA)
Wenfei Zhou (NVIDIA)
Jan Kautz (NVIDIA)
Yingyan (Celine) Lin (NVIDIA & Georgia Institute of Technology)
Pavlo Molchanov (NVIDIA)