We are very pleased to announce that Cynthia Matuszek (University of Maryland, Baltimore County) and Siddharth Narayanaswamy (University of Oxford) will be giving invited talks at the workshop! 

Workshop schedule:
9.00 - 9.55 Siddharth Narayanaswamy, "Language- and Model-Driven Machine Intelligence"
9.55 - 10.20 Contributed talk (van der Sluis, et al.)
10.20 - 10.45 Contributed talk (Pustejovsky, et al.)
10.45 - 11.05 Coffee break
11.05 - 11.30 Contributed talk (Lücking)
11.30 - 11.55 Contributed talk (Bhaskar, et al.)
11.55 - 12.50 Cynthia Matuszek, "Grounded Language Acquisition: A Physical Agent Approach"

Note that there are multiple workshops taking place in the afternoon as well, so stick around!

Accepted papers:
"Indexicals as Weak Descriptors," Andy Lücking
"Text-Picture Relations in Multimodal Instructions," Ielka van der Sluis, Anne Nienke Eppinga and Gisela Redeker
"Exploring Multi-Modal Text+Image Models to Distinguish between Abstract and Concrete Nouns," Sai Abishek Bhaskar, Maximilian Köper, Sabine Schulte im Walde and Diego Frassinelli
"Creating Common Ground through Multimodal Simulations," James Pustejovsky, Nikhil Krishnaswamy and Bruce Draper

Cynthia Matuszek: A critical component of understanding human language is the ability to map words and ideas in that language to aspects of the external world. This mapping, called the symbol grounding problem, has been studied since the early days of artificial intelligence; however, advances in language processing, sensory, and motor systems have only recently made it possible to directly interact with tangibly grounded concepts. In this talk, I describe how we use robotics to explicitly acquire and use physically grounded language--specifically, how robots can learn to follow instructions, understand descriptions of objects, and build models of language and the physical world from interactions with users. I will describe our work on building a learning system that can ground English commands and descriptions from examples, making it possible for robots to learn from untrained end-users in an intuitive, natural way, and describe applications of our work in following directions and learning about objects. Finally, I will discuss how robots with these learning capabilities address a number of near-term

van der Sluis, et al.: This paper presents a multi-method approach to the description and evaluation of multimodal content in situated contexts. First, a corpus of 15 simple first-aid instructions that include texts and pictures is shown to exhibit variation in content and presentation. We then report a user study in which four versions of a tick-removal instruction were used to test the effects of the relative placement of text and pictures in a particular instruction. The participants' processing of the instruction and their task performance were video-recorded and registered with an eye tracker. Questionnaires and interviews were used to measure comprehension, recall and the instruction's attractiveness. Results show that users first read at least some of the text before looking at the pictures, and prefer to have the pictures placed to the right or below the text.

Pustejovsky, et al.: The demand for more sophisticated human-computer interactions is rapidly increasing, as users become more accustomed to conversation-like interactions with their devices. In this paper, we examine this changing landscape in the context of human-machine interaction in a shared workspace to achieve a common goal. In our prototype system, people and avatars cooperate to build blocks-world structures through the interaction of language, gesture, vision, and action. This provides a platform to study computational issues involved in multimodal communication. In order to establish elements of the common ground in discourse between speakers, we have created an embodied 3D simulation, enabling both the generation and interpretation of multiple modalities, including language, gesture, and the visualization of objects moving and agents acting in their environment. The simulation is built on the modeling language VoxML, which encodes objects with rich semantic typing and action affordances, and actions themselves as multimodal programs, enabling contextually salient inferences and decisions in the environment. We illustrate this with a walk-through of multimodal communication in a shared task.

Lücking: Indexicals have a couple of uses that are in conflict with the traditional view that they directly refer to indices in the utterance situation. But how do they refer instead? It is argued that indexicals have both an indexical and a descriptive aspect -- what are called weak descriptors. The indexical aspect anchors them in the actual situation of utterance; the weak descriptive aspect singles out the referent. Descriptive uses of "today" are then attributed to calendric coercion, which is triggered by quantificational elements. This account provides a grammatically motivated formal link to descriptive uses. With regard to some uses of "I", a tentative contiguity rule is proposed as the reference rule for the first-person pronoun, oriented toward recent hearer-oriented accounts in philosophy.

Bhaskar, et al.: This paper explores variants of multi-modal computational models that aim to distinguish between abstract and concrete nouns. We assumed that textual vs. visual modalities might have different strengths in providing information on abstract vs. concrete words. While the overall predictions of our models were highly successful (reaching an accuracy of 96.45% in a binary classification and a Spearman correlation of 0.86 in a regression analysis), the differences between the textual, visual and combined modalities were, however, negligible; hence both text and images seem to provide reliable, non-complementary information to represent both abstract and concrete words.