Abstract
Retrieval-augmented generation (RAG) is a framework widely used in advanced AI systems such as ChatGPT, Microsoft Copilot, and Perplexity.ai. In the typical scenario, given an English question, the RAG pipeline retrieves relevant passages from an English corpus, concatenates the passages with the question, and feeds the resulting prompt into the LLM.
Informative passages help the model respond with precise and accurate answers. However, users with diverse backgrounds may ask questions in a non-English language (X), and the retrieved passages can also be written in diverse languages.
Therefore, it is important to (1) ensure that LLMs can utilize passages in whatever language they appear, and (2) work out the optimal retrieval strategy. For instance, retrieval can be performed in the English corpus (typically the corpus with the richest passages), in the X corpus (sometimes the useful information is present only in a non-English language), or in a mixture of both. After the assessment and analysis, you will gain a clear understanding of the RAG framework in non-English scenarios and, as a bonus, may propose improvements.
Description
Retrieval-augmented generation (RAG) has shown a powerful capability to generate accurate and precise responses to user questions. In the popular English RAG setup, given an English question (Q_EN), relevant passages are retrieved from the English corpus, filled into the prompt, and fed together with the question to the LLM. However, when users ask questions in non-English languages, we need to think twice about the retrieval strategy and the generation quality, because there is no guarantee that LLMs can consistently use passages in any language. For instance, the model may prefer passages in one language over passages in another language that carry exactly the same information. On the input end, the model's ability to understand a passage may vary when the same content is written in different languages; on the generation end, the model may fail to respond in the user's language when affected by the language of the retrieved passages. In this project, you will first investigate a simple scenario (the XQUAD dataset) where each question is provided with one informative passage translated into multiple languages. Based on the findings on XQUAD, you will extend the study to an open-domain QA dataset (Global-MMLU) where you need to do the retrieval on your own. Based on the retrieved passages, you can analyze the LLMs' behaviors and properties in non-English RAG. You should work with questions in at least ONE non-English language X, and the retrieved corpora (databases) should also cover at least two languages.
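To make the setup concrete, here is a minimal sketch of the prompt-assembly step; the template itself is an illustrative assumption, not a fixed format:

```python
# Minimal sketch of RAG prompt assembly: retrieved passages are
# concatenated with the question before the LLM call.

def build_prompt(question: str, passages: list[str]) -> str:
    """Fill retrieved passages and the question into one prompt string."""
    context = "\n\n".join(f"Passage {i + 1}: {p}" for i, p in enumerate(passages))
    return f"{context}\n\nQuestion: {question}\nAnswer:"

# Toy usage; in the project, the passages come from the retriever.
print(build_prompt("Who wrote Faust?",
                   ["Faust was written by Johann Wolfgang von Goethe."]))
```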
Start-up experiment
[Single-passage retrieval]: It is good to start from a simple scenario where only one passage is concatenated with the question at a time. You can check the difference in answer accuracy (i.e., whether the gold answer is a substring of the LLM response) when the provided passage has the same content but in different languages. This reflects the LLM's preference for passages in a specific language, and you may analyze the potential reasons behind it. The XQUAD dataset is a natural choice here because it provides, for each question, the informative passage and its translations into multiple languages. But feel free to use other suitable datasets.
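A minimal sketch of this check, assuming the XQUAD copies on the Hugging Face hub (dataset id google/xquad with per-language configs such as "xquad.en"; see the dataset card) and leaving the LLM call to you:

```python
# Sketch: same XQUAD question with its passage in two languages, scored
# by the substring test.
from datasets import load_dataset

# XQUAD ships one config per language; rows are parallel across configs
# (match by "id" to be safe).
en = load_dataset("google/xquad", "xquad.en", split="validation")
de = load_dataset("google/xquad", "xquad.de", split="validation")

def is_correct(response: str, gold_answer: str) -> bool:
    """Exact-substring accuracy: gold answer contained in the response."""
    return gold_answer.strip().lower() in response.strip().lower()

ex_en, ex_de = en[0], de[0]
gold = ex_en["answers"]["text"][0]
prompts = [f"Passage: {p}\n\nQuestion: {ex_en['question']}\nAnswer:"
           for p in (ex_en["context"], ex_de["context"])]
# Feed each prompt to your LLM and score the response with is_correct(...).
print(is_correct("Toy response containing " + gold, gold))   # True
```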
Elective ideas for multi-passage retrieval
Moving toward a more practical scenario, more than one passage can now be retrieved per question. Global-MMLU is a suitable dataset: it contains 14,042 questions across 42 languages (you can pick a subset of languages to study, but use at least two).
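As a starting point, a sketch for loading two language subsets, assuming the Hugging Face release of Global-MMLU (dataset id CohereForAI/Global-MMLU with per-language configs; check the dataset card for the exact field names):

```python
# Sketch: load Global-MMLU questions for two languages.
from datasets import load_dataset

de = load_dataset("CohereForAI/Global-MMLU", "de", split="test")
zh = load_dataset("CohereForAI/Global-MMLU", "zh", split="test")

ex = de[0]
print(ex["question"])
print(ex["option_a"], ex["option_b"], ex["option_c"], ex["option_d"])
print("gold letter:", ex["answer"])
```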
[Best retrieval strategy] Since no contextual passages are provided in the dataset, you need to perform retrieval over the Wikipedia corpora yourself, and you can compare different retrieval strategies, such as (1) cross-lingual retrieval (compute similarities between the original question and passages in all corpora), (2) multilingual retrieval (translate the question into different languages and always retrieve from the corpus matching the question's language), or (3) other options; a retrieval sketch follows below. With the retrieval results, you can check the LLM's accuracy on different passage combinations and see which works best.
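Both strategies share the same retrieval core. A minimal sketch, assuming question and passage embeddings from the same encoder are already loaded as NumPy arrays (random vectors stand in for the real embedding files so the snippet runs):

```python
# Cosine-similarity retrieval over precomputed passage embeddings.
import numpy as np

def retrieve_top_k(question_emb: np.ndarray,
                   passage_embs: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k passages most similar to the question."""
    q = question_emb / np.linalg.norm(question_emb)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    return np.argsort(-(p @ q))[:k]     # highest cosine similarity first

# Toy usage with random stand-ins for the provided embeddings.
rng = np.random.default_rng(0)
print(retrieve_top_k(rng.normal(size=768), rng.normal(size=(1000, 768))))
```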
[Desired generation language] In addition, you can verify whether the responses are always in the desired language by checking whether the gold answer in X is a substring of the LLM response.
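The substring test can be made more robust by pairing it with off-the-shelf language identification; a sketch, assuming the langid package (any language-ID tool works):

```python
# Sketch: check both answer correctness and response language.
import langid   # pip install langid

def in_language_x(response: str, gold_answer_x: str, lang_code: str) -> bool:
    """Gold answer in X appears, and the response is identified as X."""
    predicted_lang, _score = langid.classify(response)
    return gold_answer_x.lower() in response.lower() and predicted_lang == lang_code

print(in_language_x("Die Hauptstadt von Deutschland ist Berlin.", "Berlin", "de"))
```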
[Input understanding] Moreover, Global-MMLU lets you disentangle the LLM's input understanding from its capability to respond in the correct language. Specifically, this is a multiple-choice task where four options are provided in the prompt and the LLM is expected to generate an option letter. In this case, accuracy reveals whether the LLM can understand the retrieved documents in different languages, independently of its output language.
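Scoring then reduces to extracting the option letter from the response and comparing it with the gold letter; a minimal sketch (the regex is a simple heuristic, not a definitive parser):

```python
# Sketch: option-letter extraction for multiple-choice scoring.
import re

def extract_choice(response: str) -> str | None:
    """Return the first standalone A-D letter in the response, if any."""
    match = re.search(r"\b([ABCD])\b", response)
    return match.group(1) if match else None

print(extract_choice("The correct answer is B."))        # -> B
print(extract_choice("Ich wähle Option C, weil ..."))    # -> C
```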
[Challenge ⭐] It would be promising to study a scenario where more than two languages are covered; ideally, a practical system should be able to exploit many corpora at once.
[Challenge ⭐⭐] Given the findings from the above experiments, can you think of any methods to boost the LLM's utilization of multilingual passages and achieve higher accuracy?
Materials
Since the questions may be written in non-English languages, the LLMs we work with should support multiple languages; candidates include BLOOM, Qwen-2.5, Llama-3.2, Gemma-3, Aya Expanse, etc.
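For reference, a minimal sketch of querying one such model through Hugging Face transformers; the model id Qwen/Qwen2.5-7B-Instruct is just one example from the list, and chat-style message lists require a recent transformers version:

```python
# Sketch: ask a multilingual open-weight LLM a non-English question.
from transformers import pipeline

chat = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")
messages = [{"role": "user", "content": "¿Cuál es la capital de Perú?"}]
out = chat(messages, max_new_tokens=32)
print(out[0]["generated_text"][-1]["content"])   # assistant reply
```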
Datasets: XQUAD (questions, each with one passage and its translations), Global-MMLU (questions together with their embeddings), and Wikipedia Passage Embeddings (embeddings of Wikipedia passages, which can be combined with the question embeddings for cosine-similarity retrieval).
References
Cross-Lingual Consistency of Factual Knowledge in Multilingual Language Models. EMNLP 2023
Retrieval-based Language Models and Applications. ACL 2023 Tutorial
Dense Passage Retrieval for Open-Domain Question Answering (DPR). EMNLP 2020
Large Dual Encoders Are Generalizable Retrievers (GTR, a popular T5-based dense retriever). EMNLP 2022
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (RAG). NeurIPS 2020