AVM22: Cross-modality brain signals: auditory, visual and motor

Topic leaders

  • Claire Pelofi & Malcolm Slaney

  • Cornelia Fermüller & Ryad Benosman


The focus of this workgroup is on understanding the interplay between neural representations of auditory, visual, and motor cues and statistical knowledge of speech, music and action. To this end, the group will pursue two complementary efforts.

A first neuroscience project will investigate predictive coding in the multisensory context of watching videos. Before the workshop, EEG, MEG, fMRI and pupillometry data will be collected from subjects watching the same movie in different languages and videos of violin players. Using this data, the aim will be to investigate questions of auditory and visual saliency, audio-visual integration and its coupling with motor areas, and attentional decoding.

A second computational perception project will study aspects of motor learning using vision and audio in the context of learning to play the violin. We will develop software to monitor a student's gross and fine motor movements, i.e., the posture, which remains a challenge for learners for many years, the fast finger movements of the left hand, and the bow movement with the right hand and arm. Using a dataset of students repeatedly playing the same pieces, collected with a motion capture system, vision, event sensors, and audio, we will develop causal models and use techniques for visualizing deep networks to gain insight into how specific movements and changes in movement affect the sound features.


1) Decoding neural signals

  • Introduce statistical models from deep neural networks for decoding audio-visual interactions and audio-motor interactions. Examples include decoding from multiple subjects watching movies (audio only, video only, both), and analysis of video segments (film score, dialog, concurrent speech).

  • Introduce high-level knowledge from additional sources for decoding viewers' attention. These include data from multiple languages or responses to human actions and facial movements.

2) Visual and auditory analysis for monitoring and assisting string players

  • Sensor fusion in machine learning models for computing human pose from video and audio. Since the sound is due to human motion, there is a close relationship between the vision and the sound we perceive, and by combining the two modalities, we expect better models. We can improve existing posture estimation methods using models of the instrumentalist’s motion, or train posture models from scratch using vision and audio. We plan to explore which features of sound are most relevant and model the specific characteristics of instrument playing to improve posture estimation.

  • Event vision models to create representations of the fast finger motions. We will explore event representations characterizing the finger motions by learning features from point clouds, such as descriptions of local event surfaces (HOTS (Lagorce et al., 2016) and HATS (Sironi et al., 2018)). We plan to explore transferring these features to classic video and combining audio and vision for better finger motion descriptions.

  • Causal models of movement and sound in playing the instrument. Certain movements of the hand or bow cause certain sounds. Erroneous posture and incorrect hand and arm movements distort the sound. We would like to understand how specific movements and changes in movement affect the sound features. We will use techniques for visualizing and interpreting deep networks to gain insight into these features (Zeiler and Fergus, 2014; Selvaraju, 2017).
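
To make the notion of a local event surface from the event vision item above more concrete, the following is a minimal sketch of an exponentially decayed time surface in the spirit of HOTS (Lagorce et al., 2016). It is an illustration only, not the workgroup's implementation: the function name time_surface, the (x, y, t, p) event layout, and the radius and tau values are assumptions chosen for this example.

```python
import numpy as np

def time_surface(events, width, height, radius=2, tau=50e3):
    """Return one exponentially decayed time surface per event.

    events: sequence of (x, y, t, p) tuples, ordered by time, with integer
    pixel coordinates, a timestamp t, and a polarity p in {0, 1}.
    The layout and parameter values are assumptions for this sketch.
    """
    # Most recent timestamp per polarity and pixel, padded by `radius`
    # so that patches near the border stay in bounds.
    # -inf marks pixels that have not fired yet (they decay to 0).
    last = np.full((2, height + 2 * radius, width + 2 * radius), -np.inf)
    surfaces = []
    for x, y, t, p in events:
        last[p, y + radius, x + radius] = t
        # Square neighbourhood centred on the current event.
        patch = last[p, y:y + 2 * radius + 1, x:x + 2 * radius + 1]
        # Recent neighbours are close to 1, old or silent ones close to 0.
        surfaces.append(np.exp(-(t - patch) / tau))
    return np.stack(surfaces)

# Two events of the same polarity on neighbouring pixels, 10 time units apart.
events = [(1, 1, 0, 1), (2, 1, 10, 1)]
surfaces = time_surface(events, width=4, height=4, radius=1, tau=10)
```

Each surface is a small patch that could serve as a local feature for clustering or classification. HATS (Sironi et al., 2018) additionally averages such surfaces over cells and time windows, which this sketch does not attempt.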

Materials, Equipment, and Tutorials:


We will provide EEG, MEG, pupillometry and fMRI data of participants watching the movie Forrest Gump. This database will be used to tackle questions of multisensory decoding and across-modality predictive coding.

We will prepare data from violin students playing specific pieces designed to teach violin techniques, so-called études. Students with a few years of experience will be recorded in one-hour sessions in which they repeatedly play a few études and are given feedback on form between pieces. Data will be recorded with a Vicon MoCap system, cameras, microphones, an event sensor, and myoelectric sensors.


BrainVision 64-electrode EEG setup: 4 caps, two bundles of 32 electrodes, amplifier and battery, one data acquisition computer, and one presentation computer. Tobii hardware for pupillometry.

A small MoCap system and Nexus software for human motion capture, standard video cameras, a DVS sensor, microphones, and a Jamstik Guitar Trainer.


General survey of EEG/MEG preprocessing methods, denoising and ERP analysis, and presentation of all hardware elements involved in the collection of data (types of electrodes, magnetic vs. electric activity, etc.).

Lecture on advanced techniques of EEG/MEG decoding, such as Temporal Response Function (TRF), Denoising Source Separation (DSS), linear classifiers and Deep Neural Networks (DNN).

Introduction to the use of the Vicon MoCap system and Nexus software.

Overview lecture on the basics of human motion modeling and on the use of existing software.
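
As a rough illustration of the Temporal Response Function (TRF) technique named in the decoding lecture above, the sketch below estimates a forward TRF by ridge regression on time-lagged copies of a stimulus feature. It is a toy example under stated assumptions, not the tutorial's material: the helper names lag_matrix and fit_trf, the synthetic stimulus and kernel, and the ridge value are all hypothetical.

```python
import numpy as np

def lag_matrix(stimulus, n_lags):
    """Stack delayed copies of a 1-D stimulus: column k holds stimulus[t - k]."""
    n = len(stimulus)
    X = np.zeros((n, n_lags))
    for k in range(n_lags):
        X[k:, k] = stimulus[:n - k]
    return X

def fit_trf(stimulus, response, n_lags, ridge=1.0):
    """Ridge-regression estimate of the kernel mapping stimulus to response."""
    X = lag_matrix(stimulus, n_lags)
    # Regularised normal equations: (X'X + ridge * I) w = X'y
    XtX = X.T @ X + ridge * np.eye(n_lags)
    return np.linalg.solve(XtX, X.T @ response)

# Synthetic check: simulate a "neural" response as the stimulus convolved
# with a known kernel plus noise, then recover the kernel.
rng = np.random.default_rng(0)
stimulus = rng.standard_normal(2000)        # e.g. an audio envelope (assumed)
true_kernel = np.array([0.0, 0.5, 1.0, 0.3, -0.2])
response = np.convolve(stimulus, true_kernel)[:len(stimulus)]
response = response + 0.1 * rng.standard_normal(len(stimulus))

estimated = fit_trf(stimulus, response, n_lags=5, ridge=1.0)
```

With real EEG/MEG data the response would have one column per channel and the ridge parameter would typically be chosen by cross-validation. Both are omitted here to keep the sketch short.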

Relevant Literature:

(Lagorce et al., 2016) Lagorce, X., Orchard, G., Galluppi, F., Shi, B.E. and Benosman, R.B., 2016. HOTS: a hierarchy of event-based time-surfaces for pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(7), pp.1346-1359.

(Lee et al., 2019) Lee, H.Y., Yang, X., Liu, M.Y., Wang, T.C., Lu, Y.D., Yang, M.H. and Kautz, J., 2019. Dancing to music. arXiv preprint arXiv:1911.02001.

(Li et al., 2021) Li, R., Yang, S., Ross, D.A. and Kanazawa, A., 2021. AI Choreographer: Music conditioned 3D dance generation with AIST++. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 13401-13412).

(Selvaraju, 2017) Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D. and Batra, D., 2017. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision (pp. 618-626).

(Sironi et al., 2018) Sironi, A., Brambilla, M., Bourdis, N., Lagorce, X. and Benosman, R., 2018. HATS: Histograms of averaged time surfaces for robust event-based object classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1731-1740).

(Shlizerman et al., 2018) Shlizerman, E., Dery, L., Schoen, H. and Kemelmacher-Shlizerman, I., 2018. Audio to body dynamics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7574-7583).

(TELMI, 2016) Technology Enhanced Learning of Musical Instrument Performance, 2016-19.

(Zeiler and Fergus, 2014) Zeiler, M.D. and Fergus, R., 2014, September. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (pp. 818-833). Springer, Cham.