French: Trace two circles with diameters 2m and 1m, respectively. Leave a 1.5m gap between their outer edges.
German: From the right position, perform an "L"-shaped movement with vertical and horizontal lengths 4 & 3 meters.
Turkish: Trace two circles with diameters 2m and 1m, respectively. Leave a 1.5m gap between their outer edges.
Maori: From your left position, make an "L" shape move of vertical and horizontal lengths 2 and 1.5 meters, respectively.
Tok Pisin: Trace two circles with diameters 2m and 1m, respectively. Leave no gap between their outer edges.
Language intermixed: From your current position, move in a rectangular pattern (Chuvash). The length and width of the rectangle should be 3m x 2m, respectively (Malay).
Different languages: Navigate back and forth between the coordinates (2, 2, 0) and the passageway.
High-resource languages: Move in a circular pattern of diameter 2m.
Vulnerable languages: Move between the coordinates (2, 2, 0) and (-4, -1, 0). After completing these tasks, return to (0, 0, 0) and make a circular movement with a radius of 0.5m.
Low-resource languages: Navigate between the locations where one can find a spoon and enjoy nature while having lunch. Afterwards, return to (0,0,0).
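The geometric instructions above (circles of a given diameter, a fixed gap between outer edges) ultimately have to become positions a robot can drive to. The following is a minimal sketch of that conversion, not ReLI's actual planner: the function names `circle_waypoints` and `second_circle_center`, the choice of eight waypoints per circle, and the placement of the second circle along the +x axis are all assumptions made for illustration.

```python
import math


def circle_waypoints(diameter_m, center=(0.0, 0.0), num_points=8):
    """Return (x, y) waypoints evenly spaced on a circle.

    The start point is repeated at the end so the path closes the loop.
    num_points is an illustrative choice, not a value from the source.
    """
    radius = diameter_m / 2.0
    cx, cy = center
    points = []
    for i in range(num_points + 1):  # +1 repeats the start point
        theta = 2.0 * math.pi * i / num_points
        points.append((cx + radius * math.cos(theta),
                       cy + radius * math.sin(theta)))
    return points


def second_circle_center(d1_m, d2_m, gap_m, first_center=(0.0, 0.0)):
    """Centre of a second circle whose outer edge sits gap_m from the first.

    Assumes (hypothetically) that the second circle is placed along +x.
    Centre distance = radius 1 + gap + radius 2.
    """
    cx, cy = first_center
    return (cx + d1_m / 2.0 + gap_m + d2_m / 2.0, cy)


# "Trace two circles with diameters 2m and 1m ... Leave a 1.5m gap."
first = circle_waypoints(2.0)
center2 = second_circle_center(2.0, 1.0, 1.5)
second = circle_waypoints(1.0, center=center2)
```

With a gap of 0 m, as in the Tok Pisin example, the same helper places the two circles so that their edges touch.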
To evaluate ReLI's multilingual generalisation, we conducted large-scale experiments across 140 representative languages selected from the ISO 639 catalogue and distributed across all continents. We grouped the languages by their resource availability, with high-resource languages characterised by strong digital presence and large-scale corpora, low-resource languages defined by limited data and weaker institutional support, and vulnerable languages encompassing creoles, vernaculars, and endangered dialects that remain partially decodable by large language models. Our evaluation covered diverse language families, including Indo-European, Afro-Asiatic, Austro-Asiatic, Sino-Tibetan, and Niger-Congo, ensuring that both globally dominant and digitally scarce languages were represented in assessing ReLI's ability to ground linguistic diversity into real-world robotic affordances.
Distribution of the 140 representative languages used for ReLI benchmarking. We prioritised low-resource and vulnerable languages in our selection criteria, as we posit that this rigorously tests the robustness and efficacy of ReLI (bottom left). Further, to promote inclusive and accessible HRI, we ensured that the selected languages are distributed across the world’s continents (top).
To capture the full complexity of multilingual human-robot interaction, we designed a benchmark of task instructions that test ReLI's abilities in parsing, environment-based decision making, numeric reasoning, conditional branching, and multimodal understanding. These tasks were formalised into five categories: goal-directed navigation, movement commands without explicit targets, information and visuo-lingual queries, zero-shot and few-shot object navigation, and contextual reasoning where implicit references must be understood. Each language was evaluated with 130 balanced trials spanning short- and long-horizon tasks, resulting in more than 70K multi-turn interactions. To cover languages not supported by existing translation systems, such as Cherokee, Bislama, and African Pidgin, we generated interlingual translations with GPT-4o and validated them against the NLLB-200 baseline using BLEU, BERTScore, and other metrics. The two sets of translations showed near-equal lexical similarity and over 87% semantic alignment, which supports the reliability of the translations.
Distribution of task instructions in our benchmarking dataset. Short-horizon tasks consist of atomic actions requiring little to no planning, and long-horizon tasks demand strategic reasoning, multistep action sequencing, and user consent prior to execution.
Task execution success rate across languages and task instructions (top), along with short- and long-horizon performance comparison (bottom). ReLI maintained robust, language-agnostic execution accuracy, close to or above 90–95% for most tasks.
We conducted experiments in both simulated and real-world environments using different robotic embodiments, each equipped with RGB-D and LiDAR sensors. Simulation experiments were performed in a multi-room office-like environment, while real-world trials were conducted in our laboratory with typical furnishings. Spoken instructions were captured using built-in microphones. We tested several LLMs, including LLaMA 3.2 and OpenAI's GPT-4o and GPT-4o-mini; GPT-4o provided the most reliable performance and was therefore used for the final results.
ReLI demonstrated near-perfect handling of instructions in major high-resource languages, including English, Spanish, French, and German, with instruction parsing accuracy (IPA) consistently in the 97–99% range. Response times remained fast (~2.10–2.20 seconds), meeting the standards of an efficient multilingual system. While Indo-European languages performed exceptionally well, a slight drop was observed for languages such as Arabic and Chinese, mainly due to the complexities of inputting non-Latin (e.g., logographic) characters in the interaction interface rather than limitations of the model itself. Notably, English and Spanish maintained the highest IPA and task success rate (TSR) throughout.
Despite limited training data, ReLI achieved performance comparable to high-resource settings across many low-resource languages. Irish, Sicilian, Shona, Yoruba, and Javanese each exceeded 96% IPA and TSR, while more challenging languages like Serbian, Tibetan, Burmese, and Fijian showed slightly lower scores (<95%). Even so, ReLI sustained a strong success rate of 92–98%, with average response times ranging from 2.12 to 2.76 seconds, only marginally higher than for high-resource counterparts. This demonstrates ReLI's capacity to generalise effectively across linguistically diverse, resource-scarce settings.
ReLI remained robust even for creoles and vernaculars that typically have few or virtually no computational resources and little recognised status. It maintained an average IPA and TSR above 94%. For instance, Nigerian Pidgin, Tok Pisin, and Haitian Creole approached the performance of high-resource languages, which indicates ReLI’s ability to exploit their lexical overlap with high-resource languages such as English and French. However, some creoles, e.g., Bislama, exhibited slightly lower IPA and TSR scores. Breton, Tiv, Cherokee, Acholi, and Aramaic illustrate the challenges inherent in truly limited resources: these languages showed somewhat lower IPA and TSR alongside higher average response times (e.g., above 2.4 s). Nonetheless, the overall performance across these languages remained strong.
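The per-tier figures reported above can be reproduced from raw trial logs with a simple aggregation. The sketch below is illustrative only: the `Trial` record, its field names, and the `summarise` function are hypothetical, and the sample rows are made-up placeholders rather than benchmark data. It assumes IPA is the percentage of trials whose instruction was parsed correctly, TSR the percentage of trials executed successfully, and ART the mean response time in seconds.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trial:
    """One benchmark trial (hypothetical schema, not from the source)."""
    language: str
    tier: str          # "high", "low", or "vulnerable"
    parsed_ok: bool    # instruction parsed correctly
    task_ok: bool      # task executed successfully
    response_s: float  # response time in seconds


def summarise(trials):
    """Group trials by resource tier and compute IPA, TSR (in %) and ART (s)."""
    groups = defaultdict(list)
    for t in trials:
        groups[t.tier].append(t)

    summary = {}
    for tier, items in groups.items():
        n = len(items)
        summary[tier] = {
            "IPA": 100.0 * sum(t.parsed_ok for t in items) / n,
            "TSR": 100.0 * sum(t.task_ok for t in items) / n,
            "ART": sum(t.response_s for t in items) / n,
        }
    return summary


# Made-up placeholder rows, only to show the call shape.
trials = [
    Trial("English", "high", True, True, 2.0),
    Trial("English", "high", True, False, 2.2),
    Trial("Bislama", "vulnerable", False, False, 2.6),
]
summary = summarise(trials)
```

Grouping by `language` instead of `tier` would give the per-language breakdown using the same three formulas.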
French Language
German Language
Chinese Language
Italian Language
English Language
Arabic Language
If you use this work in your research, please cite it using the following BibTeX entry:
@article{xxx2025reli,
title={ReLI: A Language-Agnostic Approach to Human-Robot Interaction},
author={Anonymous Researchers},
journal={xxxxxxxx, xxxxxx},
year={2025}
}
This project has received funding from xxxxxxx (xxxxx) - No #xxxx.