At Honda Research Institute Europe we have designed, built, and open-sourced AURA - a humanoid robot that collaborates naturally with small human groups. AURA features two 7-DoF arms with three-finger grippers, plus a neck and head that integrate mirror eyes and directional microphones to deliver explainable social cues in real time. Its architecture fuses scene perception, dialogue capture, situation understanding, and behavior generation with the commonsense reasoning power of state-of-the-art Vision-Language Models (VLMs). Beyond obeying explicit commands, AURA autonomously decides how and when to assist - or when to stay silent - so that support is helpful rather than disruptive.
MirrorEyes literalizes the idiom “eyes are a window to the mind”: each digital pupil becomes a live, mirrored vignette of whatever the robot is truly attending to. By fusing that visual snippet with coordinated eye-and-head motion, users gain an intuitive, three-dimensional pointer - no extra icons or subtitles needed.
A 24 × 7 cm 1280 × 400-px IPS LCD sits behind a custom acrylic lens whose 80 % transmittance hides bezels and adds depth. Two micro-servos embedded in the ear “fins” provide ±45 ° rotation for secondary cues (e.g., perk-up, droop). The display-and-ear assembly is mounted on a direct-drive pan-tilt unit (600 °/s, < 0.1 ° repeatability) so the robot can swing its gaze as quickly as a person glances across a table yet hold still for prolonged foveation.
We model six degrees of freedom: virtual eyeball pitch/yaw, physical neck pan/tilt, and optional ear rotation. A weighted, differential inverse-kinematics solver prioritises high-frequency saccades in the eyes while distributing slower corrections to the neck, thereby emulating the human vestibulo-ocular reflex. During larger gestures—nods, shakes, “look-over-there” sweeps—the solver injects additional constraints so that the pupils cling to the target despite head motion.
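To make the eye-neck coordination concrete, here is a minimal Python sketch of the frequency-split idea for the eye and neck joints only. It stands in for the actual weighted differential inverse-kinematics solver with a simple rate-limited neck and instantaneously re-targeted eyes; all joint names, speeds, and limits are illustrative assumptions, not AURA's implementation.

```python
import numpy as np

class GazeSolver:
    """Frequency-split gaze control sketch: fast virtual eyes, slow physical neck."""

    def __init__(self, dt=1.0 / 60.0, neck_speed=2.0, eye_limit=0.6):
        self.dt = dt                  # control period [s]
        self.neck_speed = neck_speed  # max neck rate [rad/s] (assumed value)
        self.eye_limit = eye_limit    # virtual eyeball range [rad] (assumed value)
        self.neck = np.zeros(2)       # physical neck pan/tilt
        self.eyes = np.zeros(2)       # virtual eyeball yaw/pitch

    def step(self, target):
        """target: desired gaze direction (yaw, pitch) in the torso frame [rad]."""
        # Neck: move slowly toward the target so the eyes drift back toward
        # their neutral position (the low-frequency correction).
        err = target - self.neck
        limit = self.neck_speed * self.dt
        self.neck += np.clip(err, -limit, limit)

        # Eyes: instantly absorb whatever error remains after the neck update.
        # This emulates saccades and, while the neck is still moving, keeps the
        # pupils locked on the target as the vestibulo-ocular reflex would.
        self.eyes = np.clip(target - self.neck, -self.eye_limit, self.eye_limit)
        return self.neck.copy(), self.eyes.copy()
```

Because the eyes always take up the residual error while the neck slowly re-centers, the pupils stay on the target throughout head motion, which is the vestibulo-ocular behavior described above.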
Every render frame, a 256 × 256 px tile is cropped around the attended point in the RGB-D camera feed, flipped horizontally to preserve left-right semantics, and alpha-blended at variable opacity atop the pupil layer. Dynamic reflection scaling ensures distant and close objects remain recognizable. The entire OpenCV-based compositing step runs at 60 fps on a single Jetson Orin core, leaving ample headroom for perception and planning tasks.
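The per-frame compositing can be sketched in a few lines of OpenCV; the function name, default opacity, and fixed-size resize below are assumptions, and the real pipeline additionally scales the reflection with object distance.

```python
import cv2

def render_mirror_pupil(frame, attend_xy, pupil_layer, opacity=0.6, tile=256):
    """Crop a tile around the attended pixel, mirror it, and blend it onto the
    pupil texture. Hypothetical helper; pupil_layer must match frame's dtype
    and channel count."""
    h, w = frame.shape[:2]
    x, y = attend_xy
    half = tile // 2

    # Clamp the crop window to the image borders.
    x0, y0 = max(0, x - half), max(0, y - half)
    x1, y1 = min(w, x + half), min(h, y + half)
    crop = frame[y0:y1, x0:x1]

    # Horizontal flip preserves the left/right semantics of a mirror reflection.
    crop = cv2.flip(crop, 1)

    # Resize to the pupil texture and alpha-blend at the requested opacity.
    crop = cv2.resize(crop, (pupil_layer.shape[1], pupil_layer.shape[0]))
    return cv2.addWeighted(crop, opacity, pupil_layer, 1.0 - opacity, 0.0)
```

Called once per render frame with the current gaze pixel, this produces the mirrored vignette that is drawn into each pupil.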
Ten primitives sit on a continuum from human-like to “augmented-reality” cues:
1. Stylized iris and pupil
2. Reduced pupil size
3. Positive state
4. Negative state
5. Eyes closed
Superhuman augmentations:
6. Processing animation
7. Color coding of robot state
8. Focus on a person (reflection blurred)
9. Focus on objects with lowered reflection opacity plus a loading animation indicating processing
10. Focus on an object with a brief overexposure (flash) indicating first registration
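As one way to parameterize this continuum in software, the illustrative sketch below maps each primitive to a handful of rendering parameters; the names and values are assumptions, not the shipped implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class PupilStyle:
    """Rendering parameters for one display primitive (illustrative values)."""
    reflection_opacity: float        # 0 = plain stylized eye, 1 = full mirror tile
    blur_person: bool = False        # blur the reflection when a person is focused
    overlay: Optional[str] = None    # e.g. "loading", "flash", "state_color"

PRIMITIVES = {
    "stylized_iris":       PupilStyle(0.0),
    "reduced_pupil":       PupilStyle(0.0),
    "positive_state":      PupilStyle(0.0, overlay="state_color"),
    "negative_state":      PupilStyle(0.0, overlay="state_color"),
    "eyes_closed":         PupilStyle(0.0),
    "processing":          PupilStyle(0.0, overlay="loading"),
    "state_color_coding":  PupilStyle(0.0, overlay="state_color"),
    "person_focus":        PupilStyle(0.6, blur_person=True),
    "object_focus":        PupilStyle(0.3, overlay="loading"),
    "object_registration": PupilStyle(0.6, overlay="flash"),
}
```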
With 33 participants, MirrorEyes cut error-interruption time by ~8 % and lifted UEQ-S user-experience scores from “Good” to “Excellent” versus eyes-only baselines—evidence that the design is not just eye-catching but truly helpful.
Please visit our paper website for more details about MirrorEyes.
Inspired by Japanese-animation aesthetics, the head combines a friendly silhouette with a versatile 12-inch display that can render animated eyes, icons, and any HDMI-streamed content. Twin micro-servos driven by an Arduino Nano rotate the side-mounted “ears”, enriching the robot’s emotional repertoire.
The whole module rides on a high-speed, whisper-quiet pan-tilt unit, enabling expressive nods, shakes, and directional gazing. All structural parts are 3D-printed in light-grey ABS to match the robot arms; an acrylic cover with 80 % transmittance masks the bezels and adds depth to the screen. Multiple print-test-refine loops optimized stiffness, weight, and maintainability.
The motor behavior is generated through a hybrid pipeline: LLM prompts propose high-level social cues that are translated into rule-based motion primitives, producing responses that feel both vivid and timely. We will release the complete CAD files, firmware, and bill-of-materials as open-source so other researchers can replicate the head using inexpensive off-the-shelf components.
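A minimal sketch of such a hybrid pipeline is shown below, assuming a hypothetical query_llm callable and an invented cue vocabulary and primitive set; the released firmware and prompts may differ.

```python
from dataclasses import dataclass

@dataclass
class MotionPrimitive:
    """One rule-based head/ear motion (parameters are illustrative)."""
    name: str
    pan_deg: float = 0.0
    tilt_deg: float = 0.0
    ear_deg: float = 0.0
    duration_s: float = 0.5

# Rule-based mapping from symbolic social cues to concrete motions.
CUE_TO_PRIMITIVES = {
    "nod": [MotionPrimitive("tilt_down", tilt_deg=-15), MotionPrimitive("tilt_up", tilt_deg=15)],
    "shake": [MotionPrimitive("pan_left", pan_deg=-20), MotionPrimitive("pan_right", pan_deg=20)],
    "perk_up": [MotionPrimitive("ears_up", ear_deg=30, duration_s=0.3)],
    "look_at_speaker": [MotionPrimitive("orient")],
}

def expression_for(dialogue_context: str, query_llm) -> list[MotionPrimitive]:
    """query_llm: any callable that returns a single cue word for the prompt."""
    prompt = (
        "You control a robot head. Given the dialogue context, answer with "
        f"exactly one cue from {sorted(CUE_TO_PRIMITIVES)}.\n"
        f"Context: {dialogue_context}"
    )
    cue = query_llm(prompt).strip().lower()
    # Fall back to a neutral orientation if the LLM proposes an unknown cue.
    return CUE_TO_PRIMITIVES.get(cue, CUE_TO_PRIMITIVES["look_at_speaker"])
```

One benefit of keeping the final mapping rule-based is that whatever the LLM proposes is constrained to a small set of safe, well-timed head and ear motions.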
For more detail about the LLM-based expression generation, please see our paper webpage.
The robot is equipped with a range of capabilities for interacting helpfully and intelligently with one or more humans. These comprise perceptive abilities to understand the situation and context, and actionable abilities to make the interaction helpful and enjoyable.
The perceptive abilities include:
action detection to recognize what the humans are doing
open-world perception to recognize novel objects
active object inspection to examine objects physically
person identification to remember previously met humans
These abilities fuel the V/LLM-based framework with the situational information required to decide what to do; a sketch of this decision step follows the list below.
The actionable abilities include:
collaborative actions to physically cooperate
explainable interaction to inform the human about current plans – or explain observed mistakes and ask for help
mirror eyes to intuitively and visually convey the robot's attention and intention
flexible turn-taking for fluid interactions with less waiting
multi-party interaction to provide support within a group
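The sketch below illustrates, under stated assumptions, how the fused perceptive signals could feed an LLM-based decision step in the spirit of AttentiveSupport, with staying silent as the default whenever acting would be disruptive. The scene format, action set, and query_llm helper are hypothetical, not the published system's interfaces.

```python
import json

ACTIONS = ["stay_silent", "speak", "hand_over_object", "point_at_object"]

def decide_support(scene: dict, query_llm) -> dict:
    """scene: fused output of action detection, object perception, and person
    identification, e.g. {"persons": [...], "objects": [...], "last_utterance": "..."}."""
    prompt = (
        "You are a helpful robot assisting a small group. Decide whether to "
        "help right now or stay silent so you do not disrupt the group.\n"
        f"Allowed actions: {ACTIONS}\n"
        f"Scene: {json.dumps(scene)}\n"
        'Reply as JSON: {"action": ..., "target": ..., "reason": ...}'
    )
    try:
        decision = json.loads(query_llm(prompt))
    except (ValueError, TypeError):
        # Unparseable model output: default to the non-disruptive choice.
        decision = {"action": "stay_silent", "target": None,
                    "reason": "could not parse model output"}
    if decision.get("action") not in ACTIONS:
        # Unknown action: again fall back to staying silent.
        decision = {"action": "stay_silent", "target": None,
                    "reason": "model proposed an unknown action"}
    return decision
```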
For more detail about the VLM-based robotic system, please visit our paper websites: CuriousRobot and AttentiveSupport.
Matti Krüger, Senior Scientist
Stephan Hasler, Senior Scientist
Daniel Tanneberg, Senior Scientist
Mark Dunn, Senior Engineer
Oliver Schön, Engineer
Jörg Deigmöller, Senior Scientist
Anna Belardinelli, Principal Scientist
Felix Ocker, Senior Scientist
Fan Zhang, Senior Scientist
Michael Gienger, Chief Scientist
Wang, Chao, et al. "LaMI: Large Language Models for Multi-Modal Human-Robot Interaction." Extended Abstracts of the CHI Conference on Human Factors in Computing Systems. 2024.
Krüger, Matti, et al. "Mirror Eyes: Explainable Human-Robot Interaction at a Glance." arXiv preprint arXiv:2506.18466 (2025).
Tanneberg, Daniel, et al. "To Help or Not to Help: LLM-based Attentive Support for Human-Robot Group Interactions." 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2024.
Leusmann, Jan, et al. "Investigating LLM-Driven Curiosity in Human-Robot Interaction." Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. 2025.
Joublin, Frank, et al. "CoPAL: Corrective Planning of Robot Actions with Large Language Models." 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024.
Deigmoeller, Joerg, et al. "CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition." arXiv preprint arXiv:2506.20373 (2025).
When the Honda Research Institutes were founded in Japan, the United States, and Europe in 2003, our central focus was research into Computational Intelligence, Optimization, and Robotics. Please visit our website for more information.
We are continuously developing our robotic platform, guided by the vision of creating a human-aware, safe, responsive, and open-source robotic system.
We are currently seeking internship students and exploring opportunities for broader collaborations. We welcome candidates with expertise in one of the following areas:
Robotic perception, manipulation, or mechanical design
Industrial design & Human–Computer Interaction
Please feel free to contact us at info [at] honda-ri [dot] de.