We have an ongoing commitment to support you, Cohere Labs community members, as you pursue and achieve your research dreams. As part of this, we invite you all to publish with the affiliation “Cohere Labs Community”.
Those who achieve publication solely with our affiliation are celebrated as Cohere Labs Community Researchers. This is to recognize that publishing is always an achievement, but it is all the more significant when accomplished without the backing of a formal lab or institution. Our most sincere congratulations to those who unlock this honour!
On this page, get to know researchers worldwide who have published with this affiliation or who have made connections in the Cohere Labs Open Science Community to support their research.
Community Researchers
Shivalika Singh
Discord: @shivalikasingh
Freddie Vargus
Discord: @freddiev4
Daniel D'souza
Discord: @danieldsouza
Abinaya Mahendiran
Discord: @abinayamahendiran
Herumb Shandilya
Discord: @krypticmouse
Deividas Mataciunas
Discord: @deivm
Hakimeh Fadaei
Ifeoma Okoh
Aisha Alaagib
Discord: @aisha5803
Oshan Mudannayake
Vu Minh Chien
Surya Guthikonda
Discord: @suryaguthikonda
Niklas Muennighoff
Discord: @muennighoff
Zheng-Xin Yong
Discord: @yongzx
Neel Bhandari
Discord: @neel.bhandari
Hui-Lee Ooi
Discord: @huilee_
Research Abstract
Aya Dataset:
Datasets are foundational to many breakthroughs in modern artificial intelligence. Many recent achievements in the space of natural language processing (NLP) can be attributed to the finetuning of pre-trained models on a diverse set of tasks that enables a large language model (LLM) to respond to instructions. Instruction fine-tuning (IFT) requires specifically constructed and annotated datasets. However, existing datasets are almost all in the English language. In this work, our primary goal is to bridge the language gap by building a human-curated instruction-following dataset spanning 65 languages. We worked with fluent speakers of languages from around the world to collect natural instances of instructions and completions. Furthermore, we create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages. In total, we contribute four key resources: we develop and open-source the Aya Annotation Platform, the Aya Dataset, the Aya Collection, and the Aya Evaluation Suite. The Aya initiative also serves as a valuable case study in participatory research, involving collaborators from 119 countries. We see this as a valuable framework for future research collaborations that aim to bridge gaps in resources.
Aya Model:
Recent breakthroughs in large language models (LLMs) have centered around a handful of data-rich languages. What does it take to broaden access to breakthroughs beyond first-class citizen languages? Our work introduces Aya, a massively multilingual generative language model that follows instructions in 101 languages, of which over 50% are considered lower-resourced. Aya outperforms mT0 and BLOOMZ on the majority of tasks while covering double the number of languages. We introduce extensive new evaluation suites that broaden the state of the art for multilingual evaluation across 99 languages -- including discriminative and generative tasks, human evaluation, and simulated win rates that cover both held-out tasks and in-distribution performance. Furthermore, we conduct detailed investigations on the optimal finetuning mixture composition, data pruning, as well as the toxicity, bias, and safety of our models.
Story of how this paper came together...
In 2022, the staff and leads at the Cohere Labs Open Science Community set out to work on a large open science initiative. With members from all over the world, we aimed to leverage our global reach to make a significant impact in the field of multilingual AI. In January 2023, an introductory call was held, and three technical teams were formed to begin work on the project.
The project, named Aya after the Twi word for "fern," a symbol of endurance and resourcefulness, aimed to create a state-of-the-art multilingual dataset and language model that would serve 101 languages. We faced numerous challenges, including coordinating a diverse group of contributors, ensuring data quality and safety, and overcoming technical hurdles. The biggest challenge was data collection: connecting with speakers of these languages worldwide and engaging them to write and annotate data in their language. Despite these obstacles, the project connected with 3,000 collaborators worldwide and made significant progress, releasing the largest-ever collection of human-annotated multilingual instruction data and a state-of-the-art multilingual model.
Community Researchers
Research Abstract
Reward models (RMs) have driven the state-of-the-art performance of LLMs today by enabling the integration of human feedback into the language modeling process. However, RMs are primarily trained and evaluated in English, and their capabilities in multilingual settings remain largely understudied. In this work, we conduct a systematic evaluation of several reward models in multilingual settings. We first construct the first-of-its-kind multilingual RM evaluation benchmark, M-RewardBench, consisting of 2.87k preference instances across 23 typologically diverse languages, which tests the chat, safety, reasoning, and translation capabilities of RMs. We then rigorously evaluate a wide range of reward models on M-RewardBench, offering fresh insights into their performance across diverse languages. We identify a significant gap in RMs' performances between English and non-English languages and show that RM preferences can change substantially from one language to another. We also present several findings on how different multilingual aspects impact RM performance. Specifically, we show that the performance of RMs improves with improved translation quality. Similarly, we demonstrate that the models exhibit better performance for high-resource languages. We release the M-RewardBench dataset and the codebase from this study to facilitate a better understanding of RM evaluation in multilingual settings.
Story of how this paper came together...
Community Researchers
Dipika Khullar
Discord: @dipika6486
Dominik Krzemiński
Discord: @dokato
Jekaterina Novikova
Discord: @jekaterina_n
Rishabh Maheshwary
Sharad Duwal
Jebish Purbey
Discord: @jebish7
Azmine Toushik Wasi
Discord: @azminetoushikwasi
Bardia Soltani Moakhar
Discord: @bardia4530
Maral Jabbari Shiviari
Discord: @maraljs
MohammadAmin Farahani Fard
Discord: @ma.farahani
Silvia Fernandez
Discord: @sil22
Dmitry Abulkhanov
Discord: @dmitry_abulkhanov
Drishti Sharma
Discord: @drishti.sharma
Johan Obando-Ceron
Discord: @johanobando
Setayesh Heydari
Discord: @seta_hydri
Research Abstract
The evaluation of vision-language models (VLMs) has mainly relied on English-language benchmarks, leaving significant gaps in both multilingual and multicultural coverage. While multilingual benchmarks have expanded in both size and languages, many rely on translations of English datasets, failing to capture cultural nuances. In this work, we propose Kaleidoscope, the most comprehensive exam benchmark to date for the multilingual evaluation of vision-language models. Kaleidoscope is a large-scale, in-language multimodal benchmark designed to evaluate VLMs across diverse languages and visual inputs. Kaleidoscope covers 18 languages and 14 different subjects, amounting to a total of 20,911 multiple-choice questions. Built through an open science collaboration with a diverse group of researchers worldwide, Kaleidoscope ensures linguistic and cultural authenticity. We evaluate top-performing multilingual vision-language models and find that they perform poorly on low-resource languages and in complex multimodal scenarios. Our results highlight the need for progress on culturally inclusive multimodal evaluation frameworks.
Story of how this paper came together...
Community Researchers
Abhipsha Das
Anthony Susevski
Discord: @asusevski
S M Iftekhar Uddin
Shayekh Bin Islam
Discord: @shayekhbinislam
Drishti Sharma
Discord: @drishti.sharma
Ashvanth.S
Discord: @ashpun
Research Abstract
The rapid development of large Vision-Language Models (VLMs) has led to impressive results on academic benchmarks, primarily in widely spoken languages. However, significant gaps remain in the ability of current VLMs to handle low-resource languages and varied cultural contexts, largely due to a lack of high-quality, diverse, and safety-vetted data. Consequently, these models often struggle to understand low-resource languages and cultural nuances in a manner free from toxicity. To address these limitations, we introduce Maya, an open-source Multimodal Multilingual model. Our contributions are threefold: 1) a multilingual image-text pretraining dataset in eight languages, based on the LLaVA pretraining dataset; 2) a thorough analysis of toxicity within the LLaVA dataset, followed by the creation of a novel toxicity-free version across eight languages; and 3) a multilingual image-text model supporting these languages, enhancing cultural and linguistic comprehension in vision-language tasks.
Story of how this paper came together...
The idea for this paper began when I, Nahid, realized that Aya, a multilingual LLM, did not yet have a multimodal counterpart. In April 2024, I proposed the idea of building Maya—Multimodal Aya—within the Cohere Labs community. The concept quickly gained interest, and with Sara Hooker leading Cohere Labs at the time, the excitement around the project grew. A formal project proposal was created, and soon after, Karthik and Surya stepped up to co-lead the effort. From the early discussions to the final stages, they remained committed, helping shape Maya into a collaborative and sustained effort that brought this paper to life.
What sets Maya apart is its focus on linguistic and cultural inclusivity in vision-language modeling. While rapid progress in large VLMs has produced strong results on academic benchmarks, these models often underperform in low-resource languages and culturally diverse contexts. To address this gap, Maya introduces three main contributions: a multilingual image-text pretraining dataset spanning eight languages, a comprehensive analysis of toxicity in the widely used LLaVA dataset, and a multilingual VLM trained on a toxicity-mitigated dataset. Our work involved filtering 7,531 toxic image-text pairs and creating a safer dataset to support inclusive, culturally aware multimodal learning. This commitment to responsible dataset curation is central to Maya’s mission of building more equitable AI systems.
The impact of this work has already extended beyond the model itself—two CVPR 2025 workshop papers were accepted based on research emerging from Maya. The project stands as a testament to the power of open collaboration, rigorous analysis, and a shared vision for building AI that better represents the diversity of human language and culture. The Maya project is open source and available for the community at: https://github.com/nahidalam/maya
Research Abstract
In the machine learning ecosystem, hardware selection is often regarded as a mere utility, overshadowed by the spotlight on algorithms and data. This oversight is particularly problematic in contexts like ML-as-a-service platforms, where users often lack control over the hardware used for model deployment. How does the choice of hardware impact generalization properties? This paper investigates the influence of hardware on the delicate balance between model performance and fairness. We demonstrate that hardware choices can exacerbate existing disparities, attributing these discrepancies to variations in gradient flows and loss surfaces across different demographic groups. Through both theoretical and empirical analysis, the paper not only identifies the underlying factors but also proposes an effective strategy for mitigating hardware-induced performance imbalances.
Story of how this paper came together...
Wei-Yin Ko
Discord: @weiyinko
Karina Nguyen
Discord: @karinanguyen
Daniel D’souza
Discord: @danieldsouza
Randall Balestriero
Discord: @rb426768
Research Abstract
Ensembling multiple Deep Neural Networks (DNNs) is a simple and effective way to improve top-line metrics and to outperform a larger single model. In this work, we go beyond top-line metrics and instead explore the impact of ensembling on subgroup performances. Surprisingly, we observe that even with a simple homogeneous ensemble -- all the individual DNNs share the same training set, architecture, and design choices -- the minority group performance disproportionately improves with the number of models compared to the majority group, i.e. fairness naturally emerges from ensembling. Even more surprising, we find that this gain keeps occurring even when a large number of models is considered, e.g. 20, despite the fact that the average performance of the ensemble plateaus with fewer models. Our work establishes that simple DNN ensembles can be a powerful tool for alleviating disparate impact from DNN classifiers, thus curbing algorithmic harm. We also explore why this is the case. We find that even in homogeneous ensembles, varying the sources of stochasticity through parameter initialization, mini-batch sampling, and data-augmentation realizations results in different fairness outcomes.
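For readers who want to see the shape of this measurement, the sketch below trains a handful of identical models that differ only in random seed, averages their predicted probabilities, and reports per-subgroup accuracy against a single member. The synthetic data, group attribute, model size, and seed count are illustrative placeholders, not the paper's experimental setup.

```python
# Minimal sketch: a homogeneous ensemble of small MLPs that differ only in random seed,
# with accuracy reported per subgroup. Synthetic data stands in for a real benchmark.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic data with an imbalanced "group" attribute standing in for a demographic subgroup.
X, y = make_classification(n_samples=4000, n_features=20, weights=[0.7, 0.3], random_state=0)
group = (np.random.RandomState(0).rand(len(y)) < 0.2).astype(int)  # 1 = minority group
X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(X, y, group, test_size=0.25, random_state=0)

# Same data, architecture, and hyperparameters; only the seed varies across members.
members = [MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=s).fit(X_tr, y_tr)
           for s in range(5)]

# Ensemble by averaging predicted probabilities.
probs = np.mean([m.predict_proba(X_te) for m in members], axis=0)
pred = probs.argmax(axis=1)

for g in (0, 1):
    mask = g_te == g
    print(f"group {g}: ensemble acc = {(pred[mask] == y_te[mask]).mean():.3f}, "
          f"single-model acc = {(members[0].predict(X_te)[mask] == y_te[mask]).mean():.3f}")
```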
Community Researchers
Research Abstract
An ideal detection system for machine-generated content should work well on any generator, as more advanced LLMs emerge every day. Existing systems often struggle to accurately identify AI-generated content in shorter texts. Further, not all texts are entirely authored by a human or an LLM, so we focus on the partial case: human-LLM co-authored texts. Our paper introduces a set of models built for the task of token classification, trained on an extensive collection of human-machine co-authored texts, which perform well on texts from unseen domains, texts from unseen generators, texts by non-native speakers, and adversarial inputs. We also introduce a new dataset of over 2.4M such texts, mostly co-authored with several popular proprietary LLMs, covering 23 languages. We present findings on our models' performance for each domain and generator, along with comparisons of performance against each adversarial method, across input-text lengths, and between the characteristics of generated texts and the original human-authored texts.
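As a rough illustration of the token-classification framing (not the paper's actual models or training data), the sketch below loads a generic encoder with a two-label token-classification head and assigns a human/machine label to every token; a fine-tuned checkpoint would be needed for meaningful predictions.

```python
# Illustrative sketch only: framing human-vs-machine boundary detection as token
# classification with a generic encoder. The checkpoint and labels are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["HUMAN", "MACHINE"]
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased",
                                                        num_labels=len(labels))

text = "I drafted the opening myself, and an assistant completed the rest of the paragraph."
enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
offsets = enc.pop("offset_mapping")[0]

with torch.no_grad():
    logits = model(**enc).logits  # (1, seq_len, num_labels); head here is untrained

# Each token gets a human/machine label; contiguous MACHINE spans mark co-authored regions.
pred = logits.argmax(dim=-1)[0]
for (start, end), p in zip(offsets.tolist(), pred.tolist()):
    if end > start:  # skip special tokens
        print(text[start:end], labels[p])
```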
Story of how this paper came together...
Gbemileke Onilude
Discord: @leke5577
Research Abstract
Multilingual models are often particularly dependent on scaling to generalize to a growing number of languages. Compression techniques are widely relied upon to reconcile the growth in model size with real world resource constraints, but compression can have a disparate effect on model performance for low-resource languages. It is thus crucial to understand the trade-offs between scale, multilingualism, and compression. In this work, we propose an experimental framework to characterize the impact of sparsifying multilingual pre-trained language models during fine-tuning. Applying this framework to mBERT named entity recognition models across 40 languages, we find that compression confers several intriguing and previously unknown generalization properties. In contrast to prior findings, we find that compression may improve model robustness over dense models. We additionally observe that under certain sparsification regimes compression may aid, rather than disproportionately impact the performance of low-resource languages.
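As a concrete, hedged example of one such sparsification regime, the sketch below applies global unstructured magnitude pruning to the linear layers of a small stand-in network; the paper's actual framework, mBERT NER models, and pruning schedules are not reproduced here.

```python
# Minimal sketch of one sparsification regime (global unstructured magnitude pruning)
# applied to a stand-in network rather than the paper's mBERT NER models.
import torch
import torch.nn as nn
from torch.nn.utils import prune

model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 9))  # stand-in for an NER head

# Collect prunable (module, parameter) pairs and remove the 50% smallest-magnitude weights globally.
to_prune = [(m, "weight") for m in model.modules() if isinstance(m, nn.Linear)]
prune.global_unstructured(to_prune, pruning_method=prune.L1Unstructured, amount=0.5)

# Fine-tuning would continue here with the pruning masks applied; afterwards the
# sparsity is made permanent by removing the re-parametrization.
for module, name in to_prune:
    prune.remove(module, name)

sparsity = sum((m.weight == 0).sum().item() for m, _ in to_prune) / \
           sum(m.weight.numel() for m, _ in to_prune)
print(f"overall weight sparsity: {sparsity:.2%}")
```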
Story of how this paper came together...
Community Researchers
Fraser Mince
Discord: @fraser8168
Dzung Dinh
Discord: @toolazy._
Jonas Kgomo
Research Abstract
Pushing the boundaries of machine learning often requires exploring different hardware and software combinations. However, the freedom to experiment across different tooling stacks can be at odds with the drive for efficiency, which has produced increasingly specialized AI hardware and incentivized consolidation around a narrow set of ML frameworks. Exploratory research can be restricted if software and hardware are co-evolving, making it even harder to stray away from mainstream ideas that work well with popular tooling stacks. While this friction increasingly impacts the rate of innovation in machine learning, to our knowledge the lack of portability in tooling has not been quantified. In this work, we ask: How portable are popular ML software frameworks? We conduct a large-scale study of the portability of mainstream ML frameworks across different hardware types. Our findings paint an uncomfortable picture -- frameworks can lose more than 40% of their key functions when ported to other hardware. Worse, even when functions are portable, the slowdown in their performance can be extreme and render performance untenable. Collectively, our results reveal how costly straying from a narrow set of hardware-software combinations can be - and suggest that specialization of hardware impedes innovation in machine learning research.
Story of how this paper came together...
Research Abstract
Vision Transformers (ViTs) have demonstrated state-of-the-art performance on many computer vision tasks. Unfortunately, deploying these large-scale ViTs is resource-consuming and impossible for many mobile devices. While most in the community are building larger and larger ViTs, we ask a completely opposite question: How small can a ViT be within the tradeoffs of accuracy and inference latency that make it suitable for mobile deployment? We look into a few ViTs specifically designed for mobile applications and observe that they modify the transformer's architecture or are built around the combination of CNN and transformer. Recent work has also attempted to create sparse ViT networks and proposed alternatives to the attention module. In this paper, we study these architectures, identify the challenges and analyze what really makes a vision transformer suitable for mobile applications. We aim to serve as a baseline for future research directions and hopefully lay the foundation to choose the exemplary vision transformer architecture for your application running on mobile devices.
Story of how this paper came together...
Christopher Akiki
Research Abstract
The BigCode project is an open-scientific collaboration working on the responsible development of large language models for code. This tech report describes the progress of the collaboration until December 2022, outlining the current state of the Personally Identifiable Information (PII) redaction pipeline, the experiments conducted to de-risk the model architecture, and the experiments investigating better preprocessing methods for the training data. We train 1.1B parameter models on the Java, JavaScript, and Python subsets of The Stack and evaluate them on the MultiPL-E text-to-code benchmark. We find that more aggressive filtering of near-duplicates can further boost performance and, surprisingly, that selecting files from repositories with 5+ GitHub stars deteriorates performance significantly. Our best model outperforms previous open-source multilingual code generation models (InCoder-6.7B and CodeGen-Multi-2.7B) in both left-to-right generation and infilling on the Java, JavaScript, and Python portions of MultiPL-E, despite being a substantially smaller model.
Story of how this paper came together...
Community Researchers
Everlyn Asiko Chimoto
Discord: @everlyn_asiko
Jay Gala
Orevaoghene Ahia
Research Abstract
Neural Machine Translation models are extremely data- and compute-hungry. However, not all data points contribute equally to model training and generalization. Data pruning to remove the low-value data points has the benefit of drastically reducing the compute budget without a significant drop in model performance. In this paper, we propose a new data pruning technique: Checkpoints Across Time (CAT), that leverages early model training dynamics to identify the most relevant data points for model performance. We benchmark CAT against several data pruning techniques including COMET-QE, LASER and LaBSE. We find that CAT outperforms the benchmarks on Indo-European languages on multiple test sets. When applied to English-German, English-French and English-Swahili translation tasks, CAT achieves comparable performance to using the full dataset, while pruning up to 50% of training data. We inspect the data points that CAT selects and find that it tends to favour longer sentences and sentences with unique or rare words.
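The sketch below illustrates the general shape of checkpoint-based pruning with synthetic numbers: per-example losses logged at a few early checkpoints are turned into a score, and only the highest-scoring half of the data is kept. The variance-based score is a placeholder for exposition, not necessarily CAT's exact criterion.

```python
# Illustrative sketch of checkpoint-based data pruning: score each training example from
# per-example losses logged at a few early checkpoints and keep the top half.
import numpy as np

rng = np.random.default_rng(0)
n_examples, n_checkpoints = 10_000, 3
# Stand-in for per-example losses recorded at early checkpoints during NMT training.
losses = rng.gamma(shape=2.0, scale=1.0, size=(n_checkpoints, n_examples))

# Here, examples whose loss changes most across early checkpoints are treated as most informative.
scores = losses.var(axis=0)
keep_fraction = 0.5
keep_idx = np.argsort(scores)[-int(keep_fraction * n_examples):]

print(f"kept {len(keep_idx)} of {n_examples} examples "
      f"({100 * keep_fraction:.0f}% of the data) for continued training")
```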
Story of how this paper came together...
Nishaanth Kanna Ravichandran
Discord: @nishaanthkanna
Research Abstract
Continual learning (CL) in large language models (LLMs) is an evolving domain that focuses on developing efficient and sustainable training strategies to adapt models to emerging knowledge and achieve robustness in dynamic environments. Our primary emphasis is on continual domain-adaptive pretraining, a process designed to equip LLMs with the ability to integrate new information from various domains while retaining previously learned knowledge. Since existing works concentrate mostly on continual fine-tuning for a limited selection of downstream tasks or training domains, we introduce a new benchmark designed to measure the adaptability of LLMs to changing pretraining data landscapes. We further examine the impact of model size on learning efficacy and forgetting, as well as how the progression and similarity of emerging domains affect the knowledge transfer within these models.
Our findings uncover several key insights: (i) continual pretraining consistently improves the <1.5B models studied in this work and is also superior to domain adaptation, (ii) larger models always achieve better perplexity than smaller ones when continually pretrained on the same corpus, (iii) smaller models are particularly sensitive to continual pretraining, showing the most significant rates of both learning and forgetting, (iv) continual pretraining boosts the downstream task performance of the GPT-2 family, (v) continual pretraining enables LLMs to specialize better when the sequence of domains shows semantic similarity, while randomizing training domains leads to better transfer and final performance otherwise. We posit that our research establishes a new benchmark for CL in LLMs, providing a more realistic evaluation of knowledge retention and transfer across diverse domains.
Story of how this paper came together...
Community Researchers
Karishma Thakrar
Katrina Lawrence
Discord: @katrina.lawrence
Kyle Howard
Research Abstract
Stylistic text generation plays a vital role in enhancing communication by reflecting the nuances of individual expression. This paper presents a novel approach for generating text in a specific speaker's style across different languages. We show that by leveraging only 100 lines of text, an individual's unique style can be captured as a high-dimensional embedding, which can be used for both text generation and stylistic translation. This methodology breaks down the language barrier by transferring the style of a speaker between languages. The paper is structured into three main phases: augmenting the speaker's data with stylistically consistent external sources, separating style from content using machine learning and deep learning techniques, and generating an abstract style profile by mean pooling the learned embeddings. The proposed approach is shown to be topic-agnostic, with test accuracy and F1 scores of 74.9% and 0.75, respectively. The results demonstrate the potential of the style profile for multilingual communication, paving the way for further applications in personalized content generation and cross-linguistic stylistic transfer.
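The sketch below shows the mean-pooling step of a style profile using an off-the-shelf sentence encoder; the encoder checkpoint and toy sentences are assumptions for illustration, and the paper's style/content separation pipeline is not reproduced.

```python
# Minimal sketch of a mean-pooled style profile with a generic sentence encoder.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # generic encoder used for illustration

def style_profile(lines):
    """Embed each line of a speaker's text and mean-pool into one style vector."""
    embeddings = encoder.encode(lines, normalize_embeddings=True)
    profile = embeddings.mean(axis=0)
    return profile / np.linalg.norm(profile)

speaker_a = ["Well, I reckon we ought to take the long road anyhow.",
             "Folks around here never did care much for hurry."]
speaker_b = ["The quarterly figures indicate a measurable decline in throughput.",
             "We should prioritize remediation before the next review cycle."]

profile_a, profile_b = style_profile(speaker_a), style_profile(speaker_b)
new_line = style_profile(["I figure the old path suits us just fine."])

# Cosine similarity against each profile attributes the new line to the closer style.
print("similarity to A:", float(new_line @ profile_a))
print("similarity to B:", float(new_line @ profile_b))
```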
Story of how this paper came together...
Alif Munim
Discord: @biggmon
Research Abstract
Chest X-rays (CXRs) play an integral role in driving critical decisions in disease management and patient care. While recent innovations have led to specialized models for various CXR interpretation tasks, these solutions often operate in isolation, limiting their practical utility in clinical practice. We present MedRAX, the first versatile AI agent that seamlessly integrates state-of-the-art CXR analysis tools and multimodal large language models into a unified framework. MedRAX dynamically leverages these models to address complex medical queries without requiring additional training. To rigorously evaluate its capabilities, we introduce ChestAgentBench, a comprehensive benchmark containing 2,500 complex medical queries across 7 diverse categories. Our experiments demonstrate that MedRAX achieves state-of-the-art performance compared to both open-source and proprietary models, representing a significant step toward the practical deployment of automated CXR interpretation systems.
Story of how this paper came together...
Community Researchers
Mohammed Hamdy
Discord: @_mohamdy
Research Abstract
Large Language Models (LLMs) are increasingly used in working environments for a wide range of tasks, excelling at solving individual problems in isolation. However, are they also able to effectively collaborate over long-term interactions? To investigate this, we introduce MemoryCode, a synthetic multi-session dataset designed to test LLMs' ability to track and execute simple coding instructions amid irrelevant information, simulating a realistic setting. While all the models we tested handle isolated instructions well, even the performance of state-of-the-art models like GPT-4o deteriorates when instructions are spread across sessions. Our analysis suggests this is due to their failure to retrieve and integrate information over long instruction chains. Our results highlight a fundamental limitation of current LLMs, restricting their ability to collaborate effectively in long interactions.
Story of how this paper came together...
Drishti Sharma
Discord: @drishti.sharma
Research Abstract
This paper presents a detailed system description of our entry for the CHiPSAL 2025 shared task, focusing on language detection, hate speech identification, and target detection in Devanagari script languages. We experimented with a combination of large language models and their ensembles, including MuRIL, IndicBERT, and Gemma-2, and leveraged unique techniques like focal loss to address challenges in the natural understanding of Devanagari languages, such as multilingual processing and class imbalance. Our approach achieved competitive results across all tasks: F1 of 0.9980, 0.7652, and 0.6804 for Sub-tasks A, B, and C respectively. This work provides insights into the effectiveness of transformer models in tasks with domain-specific and linguistic challenges, as well as areas for potential improvement in future iterations.
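For readers unfamiliar with focal loss, the sketch below gives a minimal PyTorch implementation of the standard formulation, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t); the gamma and alpha values are common defaults, not necessarily those used in the submission.

```python
# Minimal sketch of focal loss for class-imbalanced classification.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), averaged over the batch."""
    log_probs = F.log_softmax(logits, dim=-1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    loss = -((1.0 - pt) ** gamma) * log_pt
    if alpha is not None:                      # optional per-class weights
        loss = loss * alpha.gather(0, targets)
    return loss.mean()

logits = torch.randn(8, 3)                     # e.g. 3 classes in a hate-speech sub-task
targets = torch.randint(0, 3, (8,))
print(focal_loss(logits, targets).item())
```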
Story of how this paper came together...
Community Researcher
Femiloye Oyerinde
Discord: @phemiloyeai
Research Abstract
Ultra-fine-grained image recognition (UFGIR) categorizes objects with extremely small differences between classes, such as distinguishing between cultivars within the same species, as opposed to species-level classification in fine-grained image recognition (FGIR). The difficulty of this task is exacerbated due to the scarcity of samples per category. To tackle these challenges we introduce a novel approach employing down-sampling inter-layer adapters in a parameter-efficient setting, where the backbone parameters are frozen and we only fine-tune a small set of additional modules. By integrating dual-branch down-sampling, we significantly reduce the number of parameters and floating-point operations (FLOPs) required, making our method highly efficient. Comprehensive experiments on ten datasets demonstrate that our approach obtains outstanding accuracy-cost performance, highlighting its potential for practical applications in resource-constrained environments. In particular, our method increases the average accuracy by at least 6.8% compared to other methods in the parameter-efficient setting while requiring at least 123x fewer trainable parameters compared to current state-of-the-art UFGIR methods and reducing FLOPs by 30% on average compared to other methods.
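A minimal sketch of the parameter-efficient setup described above: the backbone is frozen and only a small bottleneck adapter and classification head are trained. The adapter here is a generic down-project/up-project block for illustration, not the paper's dual-branch down-sampling design.

```python
# Minimal sketch of parameter-efficient fine-tuning: frozen ViT backbone, trainable adapter + head.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16

class BottleneckAdapter(nn.Module):
    """Generic residual adapter: down-project, non-linearity, up-project."""
    def __init__(self, dim, reduction=8):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)   # down-sampling projection
        self.up = nn.Linear(dim // reduction, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

backbone = vit_b_16(weights=None)                      # pretrained weights omitted in this sketch
for p in backbone.parameters():
    p.requires_grad = False                            # freeze the backbone

# In a full setup, adapters like this are inserted between the frozen encoder blocks.
adapter = BottleneckAdapter(dim=768)
head = nn.Linear(768, 100)                             # e.g. 100 cultivar classes

trainable = sum(p.numel() for p in list(adapter.parameters()) + list(head.parameters()))
total = trainable + sum(p.numel() for p in backbone.parameters())
print(f"trainable parameters: {trainable:,} of {total:,} ({100 * trainable / total:.2f}%)")
```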
Story of how this paper came together...
Research Abstract
Large Language Models (LLMs) have shown remarkable capabilities, but their development has primarily focused on English and other high-resource languages, leaving many languages underserved. We present our latest Hindi-English bilingual LLM Mantra-14B with a ~3% average improvement in benchmark scores over both languages, outperforming models twice its size. Using a curated dataset of 485K samples of English and Hindi instruction data, we instruction-tuned models such as Qwen-2.5-14B-Instruct and Phi-4 to improve performance over both English and Hindi. Our experiments, encompassing seven different LLMs of varying parameter sizes and over 140 training attempts with varying English-Hindi training data ratios, demonstrated that it is possible to significantly improve multilingual performance without compromising native performance. Further, our approach avoids resource-intensive techniques like vocabulary expansion or architectural modifications, thus keeping the model size small. Our results indicate that modest fine-tuning with culturally and locally informed data can bridge performance gaps without incurring significant computational overhead. We release our training code, datasets, and models under MIT and Apache licenses to aid further research towards under-represented and low-resource languages.
Story of how this paper came together...
In 2024, we met during the Aya expedition. None of us had any prior experience in post-training, so we decided to give it a try together. Hindi was the common language among the collaborators, so we decided to build a Hindi-English bilingual model. The result was a model that outperformed models of the same size in both languages. We open-sourced the models as well as the datasets and released the paper along the way.
Model: https://huggingface.co/large-traversaal/Mantra-14B
Dataset: https://huggingface.co/datasets/1024m/PHI-4-Hindi-Instruct-Data
Community Researcher
Ahmad Mustafa Anis
Discord: @.ahmadmustafaanis
Research Abstract
Vision Language Models (VLMs) have demonstrated significant potential in various downstream tasks, including Image/Video Generation, Visual Question Answering, Multimodal Chatbots, and Video Understanding. However, these models often struggle with basic image transformations. This paper investigates the image-level understanding of VLMs, specifically CLIP by OpenAI and SigLIP by Google. Our findings reveal that these models lack comprehension of multiple image-level augmentations. To facilitate this study, we created an augmented version of the Flickr8k dataset, pairing each image with a detailed description of the applied transformation. We further explore how this deficiency impacts downstream tasks, particularly in image editing, and evaluate the performance of state-of-the-art Image2Image models on simple transformations.
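The kind of probe described above can be sketched with an off-the-shelf CLIP checkpoint: compare an original and a transformed image against captions that do and do not mention the transformation. The checkpoint, image path, and captions below are illustrative assumptions, not the paper's augmented Flickr8k data.

```python
# Illustrative probe: does CLIP prefer a caption that describes the applied transformation?
import torch
from PIL import Image
from torchvision.transforms.functional import rotate
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")          # placeholder path to any local test image
rotated = rotate(image, angle=90)          # one simple image-level augmentation

captions = ["a photo of a dog", "a photo of a dog rotated by 90 degrees"]
inputs = processor(text=captions, images=[image, rotated], return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**inputs)

# logits_per_image[i, j]: similarity between image i and caption j. If the model
# understood the augmentation, the rotated image should prefer the rotated caption.
print(out.logits_per_image.softmax(dim=-1))
```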
Story of how this paper came together...
Shayekh Bin Islam
Discord: @shayekhbinislam
Research Abstract
Retrieval-Augmented Generation (RAG) has been shown to enhance the factual accuracy of Large Language Models (LLMs), but existing methods often suffer from limited reasoning capabilities in effectively using the retrieved evidence, particularly when using open-source LLMs. To mitigate this gap, we introduce a novel framework, Open-RAG, designed to enhance reasoning capabilities in RAG with open-source LLMs. Our framework transforms an arbitrary dense LLM into a parameter-efficient sparse mixture of experts (MoE) model capable of handling complex reasoning tasks, including both single- and multi-hop queries. Open-RAG uniquely trains the model to navigate challenging distractors that appear relevant but are misleading. As a result, Open-RAG leverages latent learning, dynamically selecting relevant experts and integrating external knowledge effectively for more accurate and contextually relevant responses. In addition, we propose a hybrid adaptive retrieval method to determine retrieval necessity and balance the trade-off between performance gain and inference speed. Experimental results show that the Llama2-7B-based Open-RAG outperforms state-of-the-art LLMs and RAG models such as ChatGPT, Self-RAG, and Command R+ in various knowledge-intensive tasks.
Story of how this paper came together...
Community Researcher
Ahmad Mustafa Anis
Discord: @drishti.sharma
Research Abstract
This paper presents the system description of our entry for the COLING 2025 FMD challenge, focusing on misinformation detection in financial domains. We experimented with a combination of large language models, including Qwen, Mistral, and Gemma-2, and leveraged pre-processing and sequential learning not only to identify fraudulent financial content but also to generate coherent and concise explanations that clarify the rationale behind the classifications. Our approach achieved competitive results with an F1-score of 0.8283 for classification and a ROUGE-1 of 0.7253 for explanations. This work highlights the transformative potential of LLMs in financial applications, offering insights into their capabilities for combating misinformation and enhancing transparency while identifying areas for future improvement in robustness and domain adaptation.
Story of how this paper came together...
Srishti Gureja
Discord: @srishtigureja
Mohammed Hamdy
Discord: @_mohamdy
Research Abstract
Synthetic data generation with Large Language Models is a promising paradigm for augmenting natural data over a nearly infinite range of tasks. Given this variety, direct comparisons among synthetic data generation algorithms are scarce, making it difficult to understand where improvement comes from and what bottlenecks exist. We propose to evaluate algorithms via the makeup of synthetic data generated by each algorithm in terms of data quality, diversity, and complexity. We choose these three characteristics for their significance in open-ended processes and the impact each has on the capabilities of downstream models. We find quality to be essential for in-distribution model generalization, diversity to be essential for out-of-distribution generalization, and complexity to be beneficial for both. Further, we emphasize the existence of Quality-Diversity trade-offs in training data and the downstream effects on model performance. We then examine the effect of various components in the synthetic data pipeline on each data characteristic. This examination allows us to taxonomize and compare synthetic data generation algorithms through the components they utilize and the resulting effects on data QDC composition. This analysis extends into a discussion on the importance of balancing QDC in synthetic data for efficient reinforcement learning and self-improvement algorithms. Analogous to the QD trade-offs in training data, often there exist trade-offs between model output quality and output diversity which impact the composition of synthetic data. We observe that many models are currently evaluated and optimized only for output quality, thereby limiting output diversity and the potential for self-improvement. We argue that balancing these trade-offs is essential to the development of future self-improvement algorithms and highlight a number of works making progress in this direction.
Story of how this paper came together...
Story of how this paper came together...
In July 2022, Sara posted in the #find-collaborators channel of the C4AI Discord looking for additional collaborators for a project with Randall on how model ensemble designs impact per-class performance for classification tasks. Wei-Yin answered the call, as he was looking for an opportunity to work on his first ML research project, along with Daniel and Karina. The focus was on finding empirical results showing how much image classification accuracy improves via ensembles. While we only saw small overall accuracy gains from adding more models to the ensemble, we noticed that the majority of the gains came from classes that were underrepresented and/or underperforming with just a single model. Thus, we refocused to examine the fairness implications of model ensembling. We found that deep ensembling drastically improved performance on the underrepresented blond-male category in CelebA, even though the male category is not explicitly trained on. Furthermore, we realized that these model ensembles can be created simply by varying the batch ordering during training. The improvement in fairness from ensembling emerges without explicit design, hence the claim that fairness "naturally emerges"!
The paper was accepted to the Algorithmic Fairness through the Lens of Time workshop (AFT 2023) at NeurIPS 2023.