Daily life is filled with encounters with other people. Normally, we quickly and effortlessly understand the meaning of the actions they perform; this is a remarkable human capacity. Doing so is key to normal social life, because those actions provide vital clues about others’ intentions, beliefs, and personalities. For example, on seeing a family member chopping vegetables in the kitchen, we know that he intends to cook a meal; seeing a friend opening an umbrella suggests that she believes it will soon rain; and observing a stranger make a donation at a shop entrance indicates that she may be an empathetic person. How action understanding is so readily achieved remains poorly understood.
Our project offers a novel view of human action understanding as arising from the interaction of two mental processes. Perceptual systems gather evidence about the actions we see, extracting the objects, movements, body postures, and scene context that make up an action. Returning to the cooking example, these systems would locate and identify the knife, cutting board, vegetables, and other objects; compute the posture of the cook, his grasp of the knife, and its up-and-down movements; and describe the layout of the scene and identify it as a kitchen.
Evidence from these perceptual systems interacts with a mental library of “action frames”, each of which captures the typical roles, relationships, and reasons that make up an action. For example, an action frame for “cooking” captures our knowledge that cooking generally involves manipulating food ingredients, using certain tools and movements, with the goal of transforming them into an edible result, and that all of this typically takes place in a kitchen. Action frames also express some of our (normally unconscious) knowledge about the probabilities associated with actions. For example, we know that chopping motions are more likely to occur with a knife than a spoon; that stirring occurs often in cooking but also in painting; and that the kinds of actions seen in a kitchen tend not to overlap with those typically seen in a garage. Action understanding arises when the activity of the perceptual systems and the action frames converges on a consistent interpretation: one in which the key roles of the action frame are filled and competing, less likely action frames are excluded.
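To make the idea concrete, the toy sketch below (in Python) shows one way a library of action frames could be represented and scored against perceptual evidence. This is purely illustrative: the frames, roles, fillers, and probabilities are invented for this example and are not the project’s actual model.

    # Illustrative sketch only: frames, roles, and probabilities are invented.
    ACTION_FRAMES = {
        "cooking": {
            "tool":     {"knife": 0.6, "spoon": 0.3, "brush": 0.1},
            "movement": {"chopping": 0.5, "stirring": 0.4, "sweeping": 0.1},
            "setting":  {"kitchen": 0.8, "garage": 0.1, "garden": 0.1},
        },
        "painting": {
            "tool":     {"brush": 0.7, "spoon": 0.2, "knife": 0.1},
            "movement": {"sweeping": 0.5, "stirring": 0.4, "chopping": 0.1},
            "setting":  {"garage": 0.4, "kitchen": 0.3, "garden": 0.3},
        },
    }

    def frame_scores(evidence):
        """Score each frame by how well the observed role fillers fit it.

        `evidence` maps roles (e.g. "tool") to the filler reported by the
        perceptual systems (e.g. "knife"). Each frame's score is the product
        of its filler probabilities, then normalised across frames.
        """
        scores = {}
        for frame, roles in ACTION_FRAMES.items():
            score = 1.0
            for role, filler in evidence.items():
                # Small floor so an unlisted filler does not zero out a frame.
                score *= roles.get(role, {}).get(filler, 0.01)
            scores[frame] = score
        total = sum(scores.values())
        return {frame: score / total for frame, score in scores.items()}

    print(frame_scores({"tool": "knife", "movement": "chopping", "setting": "kitchen"}))
    # {'cooking': 0.99, 'painting': 0.01} (approximately)

Here the “cooking” frame wins because every observed role filler fits it well, while the competing, less likely “painting” frame is effectively excluded, mirroring the convergence described above.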
We are testing this framework with a combination of human judgments in simple, computer-based action perception tasks and simple but powerful “neural network” computer models that capture some features of those tasks. The models allow us to state our predictions in a precise, quantitative way and to generate new predictions about how action understanding behaviour will unfold. With this combined approach, we hope to demonstrate how our framework explains at least some of the human ability to understand others’ actions efficiently. We also see potential for this framework to inform research in other disciplines that have a stake in how human observers understand the meaning and learning opportunities behind others’ actions.
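To give a flavour of how such models can behave, the sketch below implements a small interactive-activation-style network: “frame” units receive support from clamped perceptual evidence and inhibit one another until activity settles on a single interpretation. The units, weights, and parameters are invented for illustration and stand in for, rather than reproduce, the project’s models.

    import numpy as np

    # Illustrative sketch only: units and weights are invented.
    features = ["knife", "chopping", "kitchen", "brush", "sweeping", "garage"]
    frames = ["cooking", "painting"]

    # Feature-to-frame support: rows are frames, columns are features.
    W = np.array([
        [1.0, 1.0, 1.0, 0.0, 0.0, 0.0],   # cooking
        [0.2, 0.0, 0.0, 1.0, 1.0, 1.0],   # painting
    ])

    inhibition = 0.8                                      # rival frames suppress each other
    evidence = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])  # clamped: knife, chopping, kitchen

    act = np.zeros(len(frames))
    for step in range(50):
        support = W @ evidence - inhibition * (act.sum() - act)   # evidence minus rivals
        new_act = np.clip(act + 0.1 * (support - act), 0.0, 1.0)  # leaky, bounded update
        if np.abs(new_act - act).max() < 1e-4:                    # activity has settled
            break
        act = new_act

    print({f: float(a) for f, a in zip(frames, act.round(2))})
    # {'cooking': 1.0, 'painting': 0.0}

The key property this illustrates is the one described above: evidence and stored knowledge interact over time, and the network settles when one frame’s roles are consistently filled and its competitors are suppressed.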
This project is funded by the ESRC grant "A dynamic interactive account of human visual action understanding" to Paul Downing, Angelika Lingnau, and Marieke Mur.