Workshop on Multimodal Understanding and Learning for Embodied Applications

[Update September 23rd] We have put together a tentative workshop program. Please note that this is a full-day workshop.

[Update August 23rd] Our workshop will be a full-day workshop on October 25th. We will have 4 full-paper presentations and 4 posters. We are currently finalizing the arrangements with a few invited speakers. To register for the workshop (and the conference), please go to :

[Update August 12th] We have sent out the notifications of paper/poster acceptance. Please contact us if you haven't received the notification.

  • We would like to thank the authors for submitting to the MULEA 2019 Workshop. Our thanks also go to the program committee members for reviewing the submissions. Thank you!

[Update August 12th] We are finally ready to announce the accepted papers and posters. We will send out the notification emails today (2019/08/12 Monday). We apologize again for the delay.

[Update August 8th] Due to a delay in reviewing the submissions, we have to push back the notification date by a few days. The current plan is to make the notification by Saturday morning (August 10th). We apologize for the delay.

Welcome to the website of the MULEA 2019 Workshop

The MULEA 2019 Workshop will be held in conjunction with the 2019 ACM Multimedia Conference in Nice, France, from 2019/10/21 through 2019/10/25. The full name of the workshop is The 1st International Workshop on Multimodal Understanding and Learning for Embodied Applications.

The workshop focuses on embodied applications, which cover many of the “fashionable” applications in AI, such as robotics, autonomous driving, multimodal chatbots, and simulated games. It also covers many new and exciting research areas, including the following:


  1. Multimodal context understanding.
    • Context includes the environment, task/goal states, dynamic objects in the scene, activities, etc. Relevant research streams include visually grounded learning, context understanding, and environmental modeling, which includes 3D environment modeling and understanding. Language grounding is also a topic of interest: connecting the vision and language modalities is essential in applications such as question answering and image captioning. Other relevant research areas include multimodal understanding, context modeling, and grounded dialog systems.
  2. Knowledge inference.
    • Knowledge in this multimedia scenario is represented with knowledge graphs, scene graphs, memory, etc. Representing contextual knowledge is a topic that has attracted much interest, and goal-driven knowledge representation and reasoning are also new research directions. Deep learning methods are well suited to handling unstructured multimodal knowledge signals.
  3. Embodied learning.
    • Building on context understanding and knowledge representation, the policy generates actions for intelligent agents to achieve goals or finish tasks. The input signals are multimodal, such as images or dialogues. The learning policies need not only to provide short-term reactions but also to plan their actions to achieve long-term goals optimally. The actions may also involve navigation and localization, which are mainstream topics in the robotics and self-driving vehicle fields. This area is closely related to reinforcement learning, and the algorithms are driven by multiple industrial applications in robotics, self-driving vehicles, simulated games, multimodal chatbots, etc.