We have an ongoing commitment to support you, Cohere Labs community members, as you pursue and achieve your research dreams. As part of this, we invite you all to publish with the affiliation “Cohere Labs Community”.
Those who achieve publication solely with our affiliation are celebrated as Cohere Labs Community Researchers. This is to recognize that publishing is always an achievement, but it is all the more significant when accomplished without the backing of a formal lab or institution. Our most sincere congratulations to those who unlock this honour!
On this page, get to know researchers worldwide who have published with this affiliation or who have made connections in the Cohere Labs Open Science Community to support their research.
Community Researchers
Shivalika Singh
Discord: @shivalikasingh
Freddie Vargus
Discord: @freddiev4
Daniel D'souza
Discord: @danieldsouza
Abinaya Mahendiran
Discord: @abinayamahendiran
Herumb Shandilya
Discord: @krypticmouse
Deividas Mataciunas
Discord: @deivm
Hakimeh Fadaei
Ifeoma Okoh
Aisha Alaagib
Discord: @aisha5803
Oshan Mudannayake
Vu Minh Chien
Surya Guthikonda
Discord: @suryaguthikonda
Niklas Muennighoff
Discord: @muennighoff
Zheng-Xin Yong
Discord: @yongzx
Neel Bhandari
Discord: @neel.bhandari
Hui-Lee Ooi
Discord: @huilee_
Research Abstract
Aya Dataset:
Datasets are foundational to many breakthroughs in modern artificial intelligence. Many recent achievements in the space of natural language processing (NLP) can be attributed to the finetuning of pre-trained models on a diverse set of tasks that enables a large language model (LLM) to respond to instructions. Instruction fine-tuning (IFT) requires specifically constructed and annotated datasets. However, existing datasets are almost all in the English language. In this work, our primary goal is to bridge the language gap by building a human-curated instruction-following dataset spanning 65 languages. We worked with fluent speakers of languages from around the world to collect natural instances of instructions and completions. Furthermore, we create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages. In total, we contribute four key resources: we develop and open-source the Aya Annotation Platform, the Aya Dataset, the Aya Collection, and the Aya Evaluation Suite. The Aya initiative also serves as a valuable case study in participatory research, involving collaborators from 119 countries. We see this as a valuable framework for future research collaborations that aim to bridge gaps in resources.
Aya Model:
Recent breakthroughs in large language models (LLMs) have centered around a handful of data-rich languages. What does it take to broaden access to breakthroughs beyond first-class citizen languages? Our work introduces Aya, a massively multilingual generative language model that follows instructions in 101 languages, of which over 50% are considered lower-resourced. Aya outperforms mT0 and BLOOMZ on the majority of tasks while covering double the number of languages. We introduce extensive new evaluation suites that broaden the state of the art for multilingual evaluation across 99 languages -- including discriminative and generative tasks, human evaluation, and simulated win rates that cover both held-out tasks and in-distribution performance. Furthermore, we conduct detailed investigations on the optimal finetuning mixture composition, data pruning, as well as the toxicity, bias, and safety of our models.
Story of how this paper came together...
In 2022, the staff and leads at the Cohere Labs Open Science Community set out to work on a large open science initiative. With members from all over the world, we aimed to leverage our global reach to make a significant impact in the field of multilingual AI. In January 2023, an introductory call was held, and three technical teams were formed to begin work on the project.
The project, named Aya after the Twi word for "fern," a symbol of endurance and resourcefulness, aimed to create a state-of-the-art multilingual dataset and language model that would serve 101 languages. We faced numerous challenges, including coordinating a diverse group of contributors, ensuring data quality and safety, and overcoming technical hurdles. The biggest challenge was data collection: connecting with speakers of these languages worldwide and engaging them to write and annotate data in their language. Despite these obstacles, the project connected with 3,000 collaborators worldwide and made significant progress, releasing the largest-ever collection of human-annotated multilingual instruction data and a state-of-the-art multilingual model.
Community Researchers
Research Abstract
Reward models (RMs) have driven the state-of-the-art performance of LLMs today by enabling the integration of human feedback into the language modeling process. However, RMs are primarily trained and evaluated in English, and their capabilities in multilingual settings remain largely understudied. In this work, we conduct a systematic evaluation of several reward models in multilingual settings. We first construct the first-of-its-kind multilingual RM evaluation benchmark, M-RewardBench, consisting of 2.87k preference instances across 23 typologically diverse languages, which tests the chat, safety, reasoning, and translation capabilities of RMs. We then rigorously evaluate a wide range of reward models on M-RewardBench, offering fresh insights into their performance across diverse languages. We identify a significant gap in RMs' performances between English and non-English languages and show that RM preferences can change substantially from one language to another. We also present several findings on how different multilingual aspects impact RM performance. Specifically, we show that the performance of RMs improves with improved translation quality. Similarly, we demonstrate that the models exhibit better performance for high-resource languages. We release the M-RewardBench dataset and the codebase from this study to facilitate a better understanding of RM evaluation in multilingual settings.
Story of how this paper came together...
Community Researchers
Dipika Khullar
Discord: @dipika6486
Dominik Krzemiński
Discord: @dokato
Jekaterina Novikova
Discord: @jekaterina_n
Rishabh Maheshwary
Sharad Duwal
Jebish Purbey
Discord: @jebish7
Azmine Toushik Wasi
Discord: @azminetoushikwasi
Bardia Soltani Moakhar
Discord: @bardia4530
Maral Jabbari Shiviari
Discord: @maraljs
MohammadAmin Farahani Fard
Discord: @ma.farahani
Silvia Fernandez
Discord: @sil22
Dmitry Abulkhanov
Discord: @dmitry_abulkhanov
Drishti Sharma
Discord: @drishti.sharma
Johan Obando-Ceron
Discord: @johanobando
Setayesh Heydari
Discord: @seta_hydri
Research Abstract
The evaluation of vision-language models (VLMs) has mainly relied on English-language benchmarks, leaving significant gaps in both multilingual and multicultural coverage. While multilingual benchmarks have expanded in both size and languages, many rely on translations of English datasets, failing to capture cultural nuances. In this work, we propose Kaleidoscope, the most comprehensive exam benchmark to date for the multilingual evaluation of vision-language models. Kaleidoscope is a large-scale, in-language multimodal benchmark designed to evaluate VLMs across diverse languages and visual inputs. Kaleidoscope covers 18 languages and 14 different subjects, amounting to a total of 20,911 multiple-choice questions. Built through an open science collaboration with a diverse group of researchers worldwide, Kaleidoscope ensures linguistic and cultural authenticity. We evaluate top-performing multilingual vision-language models and find that they perform poorly on low-resource languages and in complex multimodal scenarios. Our results highlight the need for progress on culturally inclusive multimodal evaluation frameworks.
Story of how this paper came together...
Community Researchers
Abhipsha Das
Anthony Susevski
Discord: @asusevski
S M Iftekhar Uddin
Shayekh Bin Islam
Discord: @shayekhbinislam
Drishti Sharma
Discord: @drishti.sharma
Ashvanth.S
Discord: @ashpun
Research Abstract
The rapid development of large Vision-Language Models (VLMs) has led to impressive results on academic benchmarks, primarily in widely spoken languages. However, significant gaps remain in the ability of current VLMs to handle low-resource languages and varied cultural contexts, largely due to a lack of high-quality, diverse, and safety-vetted data. Consequently, these models often struggle to understand low-resource languages and cultural nuances in a manner free from toxicity. To address these limitations, we introduce Maya, an open-source Multimodal Multilingual model. Our contributions are threefold: 1) a multilingual image-text pretraining dataset in eight languages, based on the LLaVA pretraining dataset; 2) a thorough analysis of toxicity within the LLaVA dataset, followed by the creation of a novel toxicity-free version across eight languages; and 3) a multilingual image-text model supporting these languages, enhancing cultural and linguistic comprehension in vision-language tasks.
Story of how this paper came together...
The idea for this paper began when I, Nahid, realized that Aya, a multilingual LLM, did not yet have a multimodal counterpart. In April 2024, I proposed the idea of building Maya—Multimodal Aya—within the Cohere Labs community. The concept quickly gained interest, and with Sara Hooker leading Cohere Labs at the time, the excitement around the project grew. A formal project proposal was created, and soon after, Karthik and Surya stepped up to co-lead the effort. From the early discussions to the final stages, they remained committed, helping shape Maya into a collaborative and sustained effort that brought this paper to life.
What sets Maya apart is its focus on linguistic and cultural inclusivity in vision-language modeling. While rapid progress in large VLMs has produced strong results on academic benchmarks, these models often underperform in low-resource languages and culturally diverse contexts. To address this gap, Maya introduces three main contributions: a multilingual image-text pretraining dataset spanning eight languages, a comprehensive analysis of toxicity in the widely used LLaVA dataset, and a multilingual VLM trained on a toxicity-mitigated dataset. Our work involved filtering 7,531 toxic image-text pairs and creating a safer dataset to support inclusive, culturally aware multimodal learning. This commitment to responsible dataset curation is central to Maya’s mission of building more equitable AI systems.
The impact of this work has already extended beyond the model itself—two CVPR 2025 workshop papers were accepted based on research emerging from Maya. The project stands as a testament to the power of open collaboration, rigorous analysis, and a shared vision for building AI that better represents the diversity of human language and culture. The Maya project is open source and available for the community at: https://github.com/nahidalam/maya
Research Abstract
In the machine learning ecosystem, hardware selection is often regarded as a mere utility, overshadowed by the spotlight on algorithms and data. This oversight is particularly problematic in contexts like ML-as-a-service platforms, where users often lack control over the hardware used for model deployment. How does the choice of hardware impact generalization properties? This paper investigates the influence of hardware on the delicate balance between model performance and fairness. We demonstrate that hardware choices can exacerbate existing disparities, attributing these discrepancies to variations in gradient flows and loss surfaces across different demographic groups. Through both theoretical and empirical analysis, the paper not only identifies the underlying factors but also proposes an effective strategy for mitigating hardware-induced performance imbalances.
Story of how this paper came together...
Wei-Yin Ko
Discord: @weiyinko
Karina Nguyen
Discord: @karinanguyen
Daniel D’souza
Discord: @danieldsouza
Randall Balestriero
Discord: @rb426768
Research Abstract
Ensembling multiple Deep Neural Networks (DNNs) is a simple and effective way to improve top-line metrics and to outperform a larger single model. In this work, we go beyond top-line metrics and instead explore the impact of ensembling on subgroup performances. Surprisingly, we observe that even with a simple homogeneous ensemble -- all the individual DNNs share the same training set, architecture, and design choices -- the minority group performance disproportionately improves with the number of models compared to the majority group, i.e. fairness naturally emerges from ensembling. Even more surprising, we find that this gain keeps occurring even when a large number of models is considered, e.g. 20, despite the fact that the average performance of the ensemble plateaus with fewer models. Our work establishes that simple DNN ensembles can be a powerful tool for alleviating disparate impact from DNN classifiers, thus curbing algorithmic harm. We also explore why this is the case. We find that even in homogeneous ensembles, varying the sources of stochasticity through parameter initialization, mini-batch sampling, and data-augmentation realizations results in different fairness outcomes.
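For readers who want to see the shape of this measurement, the sketch below trains a handful of identical models that differ only in random seed, averages their predicted probabilities, and reports per-subgroup accuracy against a single member. The synthetic data, group attribute, model size, and seed count are illustrative placeholders, not the paper's experimental setup.

```python
# Minimal sketch: a homogeneous ensemble of small MLPs that differ only in random seed,
# with accuracy reported per subgroup. Synthetic data stands in for a real benchmark.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic data with an imbalanced "group" attribute standing in for a demographic subgroup.
X, y = make_classification(n_samples=4000, n_features=20, weights=[0.7, 0.3], random_state=0)
group = (np.random.RandomState(0).rand(len(y)) < 0.2).astype(int)  # 1 = minority group
X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(X, y, group, test_size=0.25, random_state=0)

# Same data, architecture, and hyperparameters; only the seed varies across members.
members = [MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=s).fit(X_tr, y_tr)
           for s in range(5)]

# Ensemble by averaging predicted probabilities.
probs = np.mean([m.predict_proba(X_te) for m in members], axis=0)
pred = probs.argmax(axis=1)

for g in (0, 1):
    mask = g_te == g
    print(f"group {g}: ensemble acc = {(pred[mask] == y_te[mask]).mean():.3f}, "
          f"single-model acc = {(members[0].predict(X_te)[mask] == y_te[mask]).mean():.3f}")
```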
Community Researchers
Research Abstract
An ideal detection system for machine-generated content should work well on any generator, as more advanced LLMs emerge every day. Existing systems often struggle to accurately identify AI-generated content in shorter texts. Further, not all texts are entirely authored by a human or an LLM, so we focus on the partial case: human-LLM co-authored texts. Our paper introduces a set of models built for the task of token classification, trained on an extensive collection of human-machine co-authored texts, which perform well on texts from unseen domains, texts from unseen generators, texts by non-native speakers, and adversarial inputs. We also introduce a new dataset of over 2.4M such texts, mostly co-authored with several popular proprietary LLMs, covering 23 languages. We present findings on our models' performance for each domain and generator, along with comparisons of performance against each adversarial method, across input-text lengths, and between the characteristics of generated texts and the original human-authored texts.
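As a rough illustration of the token-classification framing (not the paper's actual models or training data), the sketch below loads a generic encoder with a two-label token-classification head and assigns a human/machine label to every token; a fine-tuned checkpoint would be needed for meaningful predictions.

```python
# Illustrative sketch only: framing human-vs-machine boundary detection as token
# classification with a generic encoder. The checkpoint and labels are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["HUMAN", "MACHINE"]
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased",
                                                        num_labels=len(labels))

text = "I drafted the opening myself, and an assistant completed the rest of the paragraph."
enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
offsets = enc.pop("offset_mapping")[0]

with torch.no_grad():
    logits = model(**enc).logits  # (1, seq_len, num_labels); head here is untrained

# Each token gets a human/machine label; contiguous MACHINE spans mark co-authored regions.
pred = logits.argmax(dim=-1)[0]
for (start, end), p in zip(offsets.tolist(), pred.tolist()):
    if end > start:  # skip special tokens
        print(text[start:end], labels[p])
```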
Story of how this paper came together...
Gbemileke Onilude
Discord: @leke5577
Research Abstract
Multilingual models are often particularly dependent on scaling to generalize to a growing number of languages. Compression techniques are widely relied upon to reconcile the growth in model size with real world resource constraints, but compression can have a disparate effect on model performance for low-resource languages. It is thus crucial to understand the trade-offs between scale, multilingualism, and compression. In this work, we propose an experimental framework to characterize the impact of sparsifying multilingual pre-trained language models during fine-tuning. Applying this framework to mBERT named entity recognition models across 40 languages, we find that compression confers several intriguing and previously unknown generalization properties. In contrast to prior findings, we find that compression may improve model robustness over dense models. We additionally observe that under certain sparsification regimes compression may aid, rather than disproportionately impact the performance of low-resource languages.
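As a concrete, hedged example of one such sparsification regime, the sketch below applies global unstructured magnitude pruning to the linear layers of a small stand-in network; the paper's actual framework, mBERT NER models, and pruning schedules are not reproduced here.

```python
# Minimal sketch of one sparsification regime (global unstructured magnitude pruning)
# applied to a stand-in network rather than the paper's mBERT NER models.
import torch
import torch.nn as nn
from torch.nn.utils import prune

model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 9))  # stand-in for an NER head

# Collect prunable (module, parameter) pairs and remove the 50% smallest-magnitude weights globally.
to_prune = [(m, "weight") for m in model.modules() if isinstance(m, nn.Linear)]
prune.global_unstructured(to_prune, pruning_method=prune.L1Unstructured, amount=0.5)

# Fine-tuning would continue here with the pruning masks applied; afterwards the
# sparsity is made permanent by removing the re-parametrization.
for module, name in to_prune:
    prune.remove(module, name)

sparsity = sum((m.weight == 0).sum().item() for m, _ in to_prune) / \
           sum(m.weight.numel() for m, _ in to_prune)
print(f"overall weight sparsity: {sparsity:.2%}")
```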
Story of how this paper came together...
Community Researchers
Fraser Mince
Discord: @fraser8168
Dzung Dinh
Discord: @toolazy._
Jonas Kgomo
Research Abstract
Pushing the boundaries of machine learning often requires exploring different hardware and software combinations. However, the freedom to experiment across different tooling stacks can be at odds with the drive for efficiency, which has produced increasingly specialized AI hardware and incentivized consolidation around a narrow set of ML frameworks. Exploratory research can be restricted if software and hardware are co-evolving, making it even harder to stray away from mainstream ideas that work well with popular tooling stacks. While this friction increasingly impacts the rate of innovation in machine learning, to our knowledge the lack of portability in tooling has not been quantified. In this work, we ask: How portable are popular ML software frameworks? We conduct a large-scale study of the portability of mainstream ML frameworks across different hardware types. Our findings paint an uncomfortable picture -- frameworks can lose more than 40% of their key functions when ported to other hardware. Worse, even when functions are portable, the slowdown in their performance can be extreme and render performance untenable. Collectively, our results reveal how costly straying from a narrow set of hardware-software combinations can be - and suggest that specialization of hardware impedes innovation in machine learning research.
Story of how this paper came together...
Research Abstract
Vision Transformers (ViTs) have demonstrated state-of-the-art performance on many computer vision tasks. Unfortunately, deploying these large-scale ViTs is resource-consuming and impossible for many mobile devices. While most in the community are building larger and larger ViTs, we ask a completely opposite question: How small can a ViT be within the tradeoffs of accuracy and inference latency that make it suitable for mobile deployment? We look into a few ViTs specifically designed for mobile applications and observe that they modify the transformer's architecture or are built around the combination of CNN and transformer. Recent work has also attempted to create sparse ViT networks and proposed alternatives to the attention module. In this paper, we study these architectures, identify the challenges and analyze what really makes a vision transformer suitable for mobile applications. We aim to serve as a baseline for future research directions and hopefully lay the foundation to choose the exemplary vision transformer architecture for your application running on mobile devices.
Story of how this paper came together...
Christopher Akiki
Research Abstract
The BigCode project is an open-scientific collaboration working on the responsible development of large language models for code. This tech report describes the progress of the collaboration until December 2022, outlining the current state of the Personally Identifiable Information (PII) redaction pipeline, the experiments conducted to de-risk the model architecture, and the experiments investigating better preprocessing methods for the training data. We train 1.1B parameter models on the Java, JavaScript, and Python subsets of The Stack and evaluate them on the MultiPL-E text-to-code benchmark. We find that more aggressive filtering of near-duplicates can further boost performance and, surprisingly, that selecting files from repositories with 5+ GitHub stars deteriorates performance significantly. Our best model outperforms previous open-source multilingual code generation models (InCoder-6.7B and CodeGen-Multi-2.7B) in both left-to-right generation and infilling on the Java, JavaScript, and Python portions of MultiPL-E, despite being a substantially smaller model.
Story of how this paper came together...
Community Researchers
Everlyn Asiko Chimoto
Discord: @everlyn_asiko
Jay Gala
Orevaoghene Ahia
Research Abstract
Neural Machine Translation models are extremely data- and compute-hungry. However, not all data points contribute equally to model training and generalization. Data pruning to remove the low-value data points has the benefit of drastically reducing the compute budget without a significant drop in model performance. In this paper, we propose a new data pruning technique: Checkpoints Across Time (CAT), that leverages early model training dynamics to identify the most relevant data points for model performance. We benchmark CAT against several data pruning techniques including COMET-QE, LASER and LaBSE. We find that CAT outperforms the benchmarks on Indo-European languages on multiple test sets. When applied to English-German, English-French and English-Swahili translation tasks, CAT achieves comparable performance to using the full dataset, while pruning up to 50% of training data. We inspect the data points that CAT selects and find that it tends to favour longer sentences and sentences with unique or rare words.
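The sketch below illustrates the general shape of checkpoint-based pruning with synthetic numbers: per-example losses logged at a few early checkpoints are turned into a score, and only the highest-scoring half of the data is kept. The variance-based score is a placeholder for exposition, not necessarily CAT's exact criterion.

```python
# Illustrative sketch of checkpoint-based data pruning: score each training example from
# per-example losses logged at a few early checkpoints and keep the top half.
import numpy as np

rng = np.random.default_rng(0)
n_examples, n_checkpoints = 10_000, 3
# Stand-in for per-example losses recorded at early checkpoints during NMT training.
losses = rng.gamma(shape=2.0, scale=1.0, size=(n_checkpoints, n_examples))

# Here, examples whose loss changes most across early checkpoints are treated as most informative.
scores = losses.var(axis=0)
keep_fraction = 0.5
keep_idx = np.argsort(scores)[-int(keep_fraction * n_examples):]

print(f"kept {len(keep_idx)} of {n_examples} examples "
      f"({100 * keep_fraction:.0f}% of the data) for continued training")
```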
Story of how this paper came together...
Nishaanth Kanna Ravichandran
Discord: @nishaanthkanna
Research Abstract
Continual learning (CL) in large language models (LLMs) is an evolving domain that focuses on developing efficient and sustainable training strategies to adapt models to emerging knowledge and achieve robustness in dynamic environments. Our primary emphasis is on continual domain-adaptive pretraining, a process designed to equip LLMs with the ability to integrate new information from various domains while retaining previously learned knowledge. Since existing works concentrate mostly on continual fine-tuning for a limited selection of downstream tasks or training domains, we introduce a new benchmark designed to measure the adaptability of LLMs to changing pretraining data landscapes. We further examine the impact of model size on learning efficacy and forgetting, as well as how the progression and similarity of emerging domains affect the knowledge transfer within these models.
Our findings uncover several key insights: (i) continual pretraining consistently improves the <1.5B models studied in this work and is also superior to domain adaptation, (ii) larger models always achieve better perplexity than smaller ones when continually pretrained on the same corpus, (iii) smaller models are particularly sensitive to continual pretraining, showing the most significant rates of both learning and forgetting, (iv) continual pretraining boosts the downstream task performance of the GPT-2 family, (v) continual pretraining enables LLMs to specialize better when the sequence of domains shows semantic similarity, while randomizing training domains leads to better transfer and final performance otherwise. We posit that our research establishes a new benchmark for CL in LLMs, providing a more realistic evaluation of knowledge retention and transfer across diverse domains.
Story of how this paper came together...
Community Researchers
Karishma Thakrar
Katrina Lawrence
Discord: @katrina.lawrence
Kyle Howard
Research Abstract
Stylistic text generation plays a vital role in enhancing communication by reflecting the nuances of individual expression. This paper presents a novel approach for generating text in a specific speaker's style across different languages. We show that by leveraging only 100 lines of text, an individual's unique style can be captured as a high-dimensional embedding, which can be used for both text generation and stylistic translation. This methodology breaks down the language barrier by transferring the style of a speaker between languages. The paper is structured into three main phases: augmenting the speaker's data with stylistically consistent external sources, separating style from content using machine learning and deep learning techniques, and generating an abstract style profile by mean pooling the learned embeddings. The proposed approach is shown to be topic-agnostic, with test accuracy and F1 scores of 74.9% and 0.75, respectively. The results demonstrate the potential of the style profile for multilingual communication, paving the way for further applications in personalized content generation and cross-linguistic stylistic transfer.
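The sketch below shows the mean-pooling step of a style profile using an off-the-shelf sentence encoder; the encoder checkpoint and toy sentences are assumptions for illustration, and the paper's style/content separation pipeline is not reproduced.

```python
# Minimal sketch of a mean-pooled style profile with a generic sentence encoder.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # generic encoder used for illustration

def style_profile(lines):
    """Embed each line of a speaker's text and mean-pool into one style vector."""
    embeddings = encoder.encode(lines, normalize_embeddings=True)
    profile = embeddings.mean(axis=0)
    return profile / np.linalg.norm(profile)

speaker_a = ["Well, I reckon we ought to take the long road anyhow.",
             "Folks around here never did care much for hurry."]
speaker_b = ["The quarterly figures indicate a measurable decline in throughput.",
             "We should prioritize remediation before the next review cycle."]

profile_a, profile_b = style_profile(speaker_a), style_profile(speaker_b)
new_line = style_profile(["I figure the old path suits us just fine."])

# Cosine similarity against each profile attributes the new line to the closer style.
print("similarity to A:", float(new_line @ profile_a))
print("similarity to B:", float(new_line @ profile_b))
```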
Story of how this paper came together...
Alif Munim
Discord: @biggmon
Research Abstract
Chest X-rays (CXRs) play an integral role in driving critical decisions in disease management and patient care. While recent innovations have led to specialized models for various CXR interpretation tasks, these solutions often operate in isolation, limiting their practical utility in clinical practice. We present MedRAX, the first versatile AI agent that seamlessly integrates state-of-the-art CXR analysis tools and multimodal large language models into a unified framework. MedRAX dynamically leverages these models to address complex medical queries without requiring additional training. To rigorously evaluate its capabilities, we introduce ChestAgentBench, a comprehensive benchmark containing 2,500 complex medical queries across 7 diverse categories. Our experiments demonstrate that MedRAX achieves state-of-the-art performance compared to both open-source and proprietary models, representing a significant step toward the practical deployment of automated CXR interpretation systems.
Story of how this paper came together...
Community Researchers
Mohammed Hamdy
Discord: @_mohamdy
Research Abstract
Large Language Models (LLMs) are increasingly used in working environments for a wide range of tasks, excelling at solving individual problems in isolation. However, are they also able to effectively collaborate over long-term interactions? To investigate this, we introduce MemoryCode, a synthetic multi-session dataset designed to test LLMs' ability to track and execute simple coding instructions amid irrelevant information, simulating a realistic setting. While all the models we tested handle isolated instructions well, even the performance of state-of-the-art models like GPT-4o deteriorates when instructions are spread across sessions. Our analysis suggests this is due to their failure to retrieve and integrate information over long instruction chains. Our results highlight a fundamental limitation of current LLMs, restricting their ability to collaborate effectively in long interactions.
Story of how this paper came together...
Drishti Sharma
Discord: @drishti.sharma
Research Abstract
This paper presents a detailed system description of our entry for the CHiPSAL 2025 shared task, focusing on language detection, hate speech identification, and target detection in Devanagari script languages. We experimented with a combination of large language models and their ensembles, including MuRIL, IndicBERT, and Gemma-2, and leveraged unique techniques like focal loss to address challenges in the natural understanding of Devanagari languages, such as multilingual processing and class imbalance. Our approach achieved competitive results across all tasks: F1 of 0.9980, 0.7652, and 0.6804 for Sub-tasks A, B, and C respectively. This work provides insights into the effectiveness of transformer models in tasks with domain-specific and linguistic challenges, as well as areas for potential improvement in future iterations.
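For readers unfamiliar with focal loss, the sketch below gives a minimal PyTorch implementation of the standard formulation, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t); the gamma and alpha values are common defaults, not necessarily those used in the submission.

```python
# Minimal sketch of focal loss for class-imbalanced classification.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), averaged over the batch."""
    log_probs = F.log_softmax(logits, dim=-1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    loss = -((1.0 - pt) ** gamma) * log_pt
    if alpha is not None:                      # optional per-class weights
        loss = loss * alpha.gather(0, targets)
    return loss.mean()

logits = torch.randn(8, 3)                     # e.g. 3 classes in a hate-speech sub-task
targets = torch.randint(0, 3, (8,))
print(focal_loss(logits, targets).item())
```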
Story of how this paper came together...
Community Researcher
Femiloye Oyerinde
Discord: @phemiloyeai
Research Abstract
Ultra-fine-grained image recognition (UFGIR) categorizes objects with extremely small differences between classes, such as distinguishing between cultivars within the same species, as opposed to species-level classification in fine-grained image recognition (FGIR). The difficulty of this task is exacerbated due to the scarcity of samples per category. To tackle these challenges we introduce a novel approach employing down-sampling inter-layer adapters in a parameter-efficient setting, where the backbone parameters are frozen and we only fine-tune a small set of additional modules. By integrating dual-branch down-sampling, we significantly reduce the number of parameters and floating-point operations (FLOPs) required, making our method highly efficient. Comprehensive experiments on ten datasets demonstrate that our approach obtains outstanding accuracy-cost performance, highlighting its potential for practical applications in resource-constrained environments. In particular, our method increases the average accuracy by at least 6.8% compared to other methods in the parameter-efficient setting while requiring at least 123x fewer trainable parameters compared to current state-of-the-art UFGIR methods and reducing FLOPs by 30% on average compared to other methods.
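A minimal sketch of the parameter-efficient setup described above: the backbone is frozen and only a small bottleneck adapter and classification head are trained. The adapter here is a generic down-project/up-project block for illustration, not the paper's dual-branch down-sampling design.

```python
# Minimal sketch of parameter-efficient fine-tuning: frozen ViT backbone, trainable adapter + head.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16

class BottleneckAdapter(nn.Module):
    """Generic residual adapter: down-project, non-linearity, up-project."""
    def __init__(self, dim, reduction=8):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)   # down-sampling projection
        self.up = nn.Linear(dim // reduction, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

backbone = vit_b_16(weights=None)                      # pretrained weights omitted in this sketch
for p in backbone.parameters():
    p.requires_grad = False                            # freeze the backbone

# In a full setup, adapters like this are inserted between the frozen encoder blocks.
adapter = BottleneckAdapter(dim=768)
head = nn.Linear(768, 100)                             # e.g. 100 cultivar classes

trainable = sum(p.numel() for p in list(adapter.parameters()) + list(head.parameters()))
total = trainable + sum(p.numel() for p in backbone.parameters())
print(f"trainable parameters: {trainable:,} of {total:,} ({100 * trainable / total:.2f}%)")
```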
Story of how this paper came together...
Research Abstract
Large Language Models (LLMs) have shown remarkable capabilities, but their development has primarily focused on English and other high-resource languages, leaving many languages underserved. We present our latest Hindi-English bilingual LLM Mantra-14B with a ~3% average improvement in benchmark scores over both languages, outperforming models twice its size. Using a curated dataset of 485K samples of English and Hindi instruction data, we instruction-tuned models such as Qwen-2.5-14B-Instruct and Phi-4 to improve performance over both English and Hindi. Our experiments, encompassing seven different LLMs of varying parameter sizes and over 140 training attempts with varying English-Hindi training data ratios, demonstrated that it is possible to significantly improve multilingual performance without compromising native performance. Further, our approach avoids resource-intensive techniques like vocabulary expansion or architectural modifications, thus keeping the model size small. Our results indicate that modest fine-tuning with culturally and locally informed data can bridge performance gaps without incurring significant computational overhead. We release our training code, datasets, and models under MIT and Apache licenses to aid further research towards under-represented and low-resource languages.
Story of how this paper came together...
In 2024, we met during the Aya expedition. None of us had any prior experience in post-training, so we decided to give it a try together. Hindi was the common language among the collaborators, so we decided to build a Hindi-English bilingual model. The result was a model that outperformed models of the same size in both languages. We open-sourced the models as well as the datasets and released the paper along the way.
Model: https://huggingface.co/large-traversaal/Mantra-14B
Dataset: https://huggingface.co/datasets/1024m/PHI-4-Hindi-Instruct-Data
Community Researcher
Ahmad Mustafa Anis
Discord: @.ahmadmustafaanis
Research Abstract
Vision Language Models (VLMs) have demonstrated significant potential in various downstream tasks, including Image/Video Generation, Visual Question Answering, Multimodal Chatbots, and Video Understanding. However, these models often struggle with basic image transformations. This paper investigates the image-level understanding of VLMs, specifically CLIP by OpenAI and SigLIP by Google. Our findings reveal that these models lack comprehension of multiple image-level augmentations. To facilitate this study, we created an augmented version of the Flickr8k dataset, pairing each image with a detailed description of the applied transformation. We further explore how this deficiency impacts downstream tasks, particularly in image editing, and evaluate the performance of state-of-the-art Image2Image models on simple transformations.
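The kind of probe described above can be sketched with an off-the-shelf CLIP checkpoint: compare an original and a transformed image against captions that do and do not mention the transformation. The checkpoint, image path, and captions below are illustrative assumptions, not the paper's augmented Flickr8k data.

```python
# Illustrative probe: does CLIP prefer a caption that describes the applied transformation?
import torch
from PIL import Image
from torchvision.transforms.functional import rotate
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")          # placeholder path to any local test image
rotated = rotate(image, angle=90)          # one simple image-level augmentation

captions = ["a photo of a dog", "a photo of a dog rotated by 90 degrees"]
inputs = processor(text=captions, images=[image, rotated], return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**inputs)

# logits_per_image[i, j]: similarity between image i and caption j. If the model
# understood the augmentation, the rotated image should prefer the rotated caption.
print(out.logits_per_image.softmax(dim=-1))
```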
Story of how this paper came together...
Shayekh Bin Islam
Discord: @shayekhbinislam
Research Abstract
Retrieval-Augmented Generation (RAG) has been shown to enhance the factual accuracy of Large Language Models (LLMs), but existing methods often suffer from limited reasoning capabilities in effectively using the retrieved evidence, particularly when using open-source LLMs. To mitigate this gap, we introduce a novel framework, Open-RAG, designed to enhance reasoning capabilities in RAG with open-source LLMs. Our framework transforms an arbitrary dense LLM into a parameter-efficient sparse mixture of experts (MoE) model capable of handling complex reasoning tasks, including both single- and multi-hop queries. Open-RAG uniquely trains the model to navigate challenging distractors that appear relevant but are misleading. As a result, Open-RAG leverages latent learning, dynamically selecting relevant experts and integrating external knowledge effectively for more accurate and contextually relevant responses. In addition, we propose a hybrid adaptive retrieval method to determine retrieval necessity and balance the trade-off between performance gain and inference speed. Experimental results show that the Llama2-7B-based Open-RAG outperforms state-of-the-art LLMs and RAG models such as ChatGPT, Self-RAG, and Command R+ in various knowledge-intensive tasks.
Story of how this paper came together...
Community Researcher
Ahmad Mustafa Anis
Discord: @drishti.sharma
Research Abstract
This paper presents the system description of our entry for the COLING 2025 FMD challenge, focusing on misinformation detection in financial domains. We experimented with a combination of large language models, including Qwen, Mistral, and Gemma-2, and leveraged pre-processing and sequential learning not only to identify fraudulent financial content but also to generate coherent and concise explanations that clarify the rationale behind the classifications. Our approach achieved competitive results with an F1-score of 0.8283 for classification and a ROUGE-1 of 0.7253 for explanations. This work highlights the transformative potential of LLMs in financial applications, offering insights into their capabilities for combating misinformation and enhancing transparency while identifying areas for future improvement in robustness and domain adaptation.
Story of how this paper came together...
Srishti Gureja
Discord: @srishtigureja
Mohammed Hamdy
Discord: @_mohamdy
Research Abstract
Synthetic data generation with Large Language Models is a promising paradigm for augmenting natural data over a nearly infinite range of tasks. Given this variety, direct comparisons among synthetic data generation algorithms are scarce, making it difficult to understand where improvement comes from and what bottlenecks exist. We propose to evaluate algorithms via the makeup of synthetic data generated by each algorithm in terms of data quality, diversity, and complexity. We choose these three characteristics for their significance in open-ended processes and the impact each has on the capabilities of downstream models. We find quality to be essential for in-distribution model generalization, diversity to be essential for out-of-distribution generalization, and complexity to be beneficial for both. Further, we emphasize the existence of Quality-Diversity trade-offs in training data and the downstream effects on model performance. We then examine the effect of various components in the synthetic data pipeline on each data characteristic. This examination allows us to taxonomize and compare synthetic data generation algorithms through the components they utilize and the resulting effects on data QDC composition. This analysis extends into a discussion on the importance of balancing QDC in synthetic data for efficient reinforcement learning and self-improvement algorithms. Analogous to the QD trade-offs in training data, often there exist trade-offs between model output quality and output diversity which impact the composition of synthetic data. We observe that many models are currently evaluated and optimized only for output quality, thereby limiting output diversity and the potential for self-improvement. We argue that balancing these trade-offs is essential to the development of future self-improvement algorithms and highlight a number of works making progress in this direction.
Story of how this paper came together...
Story of how this paper came together...
In July 2022, Sara posted in the #find-collaborators channel of the C4AI Discord looking for additional collaborators for a project with Randall on how model ensemble designs impact per-class performance for classification tasks. Wei-Yin answered the call, as he was looking for an opportunity to work on his first ML research project, along with Daniel and Karina. The focus was on finding empirical results showing how much image classification accuracy improves via ensembles. While we only saw small overall accuracy gains from adding more models to the ensemble, we noticed that the majority of the gains came from classes that were underrepresented and/or underperforming with just a single model. Thus, we refocused to examine the fairness implications of model ensembling. We found that deep ensembling drastically improved performance on the underrepresented blond-male category in CelebA, even though the male category is not explicitly trained on. Furthermore, we realized that these model ensembles can be created simply by varying the batch ordering during training. The improvement in fairness from ensembling emerges without explicit design, hence the claim that fairness "naturally emerges"!
The paper was accepted to the Algorithmic Fairness through the Lens of Time workshop (AFT 2023) at NeurIPS 2023.