Projects
Ongoing projects
E-SSL
Efficient Self-Supervised Learning for Inclusive and Innovative Speech Technologies
ANR PRC E-SSL
Runtime: to be started (3 years)
Role: Co-author of proposal, beneficiary, work package leader
Partners: LIA (Université d'Avignon et des Pays de Vaucluse), LAMSADE (Paris Sciences & Lettres)
Summary: Following previous major advances, self-supervised learning (SSL) has recently emerged as one of the most promising artificial intelligence methods. With this technique, it becomes feasible to take advantage of the colossal amounts of existing unlabelled data to significantly improve the results of various AI systems. In particular, the field of speech processing (SP) is being rapidly transformed by the rise of SSL, driven by massive industrial investments and an explosion of data made available by a few companies. Although incredibly powerful, SSL models are so complex that researchers and industry must acquire extraordinary computing capacities, which drastically reduces both access to fundamental research in this field and its deployment in real products. The E-SSL project aims at re-empowering the scientific community and the speech industry with the necessary control over self-supervised learning, in order to ensure its fair evolution and deployment by facilitating both academic research and its transfer to industry. In practice, E-SSL holistically addresses three key issues of self-supervised learning for speech representations: its computational efficiency, its societal impact, and the feasibility of its extension to future products.
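As an illustration of the kind of SSL speech representations at stake, the sketch below extracts utterance-level features from a pretrained wav2vec 2.0 model via the HuggingFace transformers library; the checkpoint choice, the dummy audio, and the mean-pooling step are illustrative assumptions, not deliverables of E-SSL.

    # Minimal sketch: reusing a pretrained SSL speech encoder as a feature
    # extractor, so downstream tasks can be trained without the massive
    # compute needed for pre-training. Checkpoint choice is illustrative.
    import torch
    from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

    extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
    model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
    model.eval()

    # One second of dummy 16 kHz audio stands in for a real recording.
    waveform = torch.zeros(16000)

    inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(inputs.input_values).last_hidden_state  # (1, frames, 768)

    # Mean-pool over time to obtain a single utterance-level representation.
    utterance_embedding = hidden.mean(dim=1)
    print(utterance_embedding.shape)  # torch.Size([1, 768])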
THERADIA
Thérapies Digitales Augmentées par l'IA - Digital Therapies Augmented with AI
BPI PSPC THERADIA
Runtime: 01.03.2020 - 31.03.2025
Role: Co-author of proposal, beneficiary, work package leader
Partners: GIPSA-Lab (UGA), EMC (Université Lyon 2), ATOS, SBT Human Matter, PERTIMM
Summary: Digital technology plays a key role in the transformation of medicine. Beyond the simple computerisation of healthcare systems, many non-drug treatments are now possible thanks to digital technology. Thus, interactive stimulation exercises can be offered to people suffering from cognitive disorders, such as developmental disorders, neurodegenerative diseases, stroke, or trauma. The efficacy of these new treatments, which are still primarily offered face-to-face by therapists, can be greatly improved if patients can pursue them at home. However, patients are then left to their own devices, which can be problematic. The aim is to endow a system for autonomous cognitive remediation with a conversational agent capable of providing social presence, coaching, and support when necessary. The challenges facing the implementation and long-term acceptability of such a technology are numerous.
Completed projects
WELLBOT
Weakly supervised learning of human affective behaviors from multimodal interactions with a chatbot
ANRT No 2019/0729
Runtime: 01.11.2019 - 31.03.2023
Role: Lead author of proposal, main beneficiary
Partners: ATOS
Summary: Recent advances in deep learning have shown promising results in the field of affective computing, which aims to provide machines with the ability to detect and interpret human emotions in order to respond to them appropriately. One of the dominant tasks in the field consists in automatically quantifying attributes of human behaviour on predefined scales of emotion, such as arousal (ranging from passive to active) or valence (ranging from pleasant to unpleasant), from signals recorded by sensors, e.g., audio, video, and/or text data. These attributes of emotion can then be exploited by a machine, such as a chatbot, a robot, or a vocal assistant, to monitor the affective behaviours displayed by the user over the course of the interaction and adapt the dialogue accordingly. Machine learning algorithms based on deep learning are, however, known to require a large amount of data to generalise well to unseen expressions, which is a common issue when the task consists in predicting expressions of human behaviour “in-the-wild”. Whereas automatic speech recognition (ASR) systems usually rely on several thousands of hours of speech data, the datasets available for emotion recognition research rarely exceed ten hours of data. The objectives of this project are twofold: (i) to develop a system that performs real-time sensing of affective attributes from speech and/or text data collected from a person while interacting with a chatbot, and (ii) to exploit newly collected data in a weakly supervised approach to make the inference model more robust to unseen expressions.
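As an illustration of the sensing task, the sketch below shows a minimal recurrent regressor mapping sequences of acoustic feature vectors (e.g., 88-dimensional eGeMAPS-like descriptors) to frame-level arousal and valence; the architecture, dimensions, and value range are illustrative assumptions, not the project's actual model.

    # Minimal sketch of continuous affect prediction: a recurrent regressor
    # mapping a sequence of acoustic feature vectors to frame-level arousal
    # and valence. Architecture and feature size are illustrative.
    import torch
    import torch.nn as nn

    class AffectRegressor(nn.Module):
        def __init__(self, n_features=88, hidden=64):
            super().__init__()
            self.gru = nn.GRU(n_features, hidden, batch_first=True)
            self.head = nn.Linear(hidden, 2)  # [arousal, valence]

        def forward(self, x):
            out, _ = self.gru(x)               # (batch, time, hidden)
            return torch.tanh(self.head(out))  # values in [-1, 1] per frame

    model = AffectRegressor()
    features = torch.randn(4, 100, 88)  # 4 utterances, 100 frames, 88 features
    predictions = model(features)       # (4, 100, 2)
    print(predictions.shape)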
DE-ENIGMA
Multi-Modal Human-Robot Interaction for Teaching and Expanding Social Imagination in Autistic Children
EU Horizon 2020 Research & Innovation Action (RIA) #688835
Runtime: 01.02.2016 - 31.07.2019
Role: Co-author of proposal, beneficiary, work package leader
Partners: University of Twente, Savez udruzenja Srbije za pomoc osobama sa autizmom (Federation of Associations of Serbia for Assistance to Persons with Autism), Autism-Europe, IDMIND, University College London, University of Passau, Institute of Mathematics "Simion Stoilow" of the Romanian Academy, Imperial College London
Summary: Autism Spectrum Conditions (ASC, frequently defined as ASD -- Autism Spectrum Disorders) are neurodevelopmental conditions characterised by social communication difficulties and restricted and repetitive behaviour patterns. There are over 5 million people with autism in Europe – around 1 in every 100 people – affecting the lives of over 20 million people each day. Alongside their difficulties, individuals with ASC tend to have intact and sometimes superior abilities to comprehend and manipulate closed, rule-based, predictable systems, such as robot-based technology. Over the last couple of years, this has led to several attempts to teach emotion recognition and expression to individuals with ASC using humanoid robots, which has been shown to be very effective as an integral part of psychoeducational therapy for children with ASC. The main reason for this is that humanoid robots are perceived by children with autism as more predictable, less complicated, less threatening, and more comfortable to communicate with than humans, with all their complex and frightening subtleties and nuances. The proposed project aims to create and evaluate the effectiveness of such a robot-based technology directed at children with ASC. This technology will enable robust, context-sensitive (such as user- and culture-specific), multimodal (including facial, bodily, vocal, and verbal cues) and naturalistic human-robot interaction (HRI) aimed at enhancing the social imagination skills of children with autism. The proposed work will include the design of effective and user-adaptable robot behaviours for the target user group, leading to more personalised and effective therapies than previously realised. Carers will be offered their own supportive environment, including professional information, reports of the child's progress and use of the system, and forums for parents and therapists.
iHEARu
Intelligent systems' Holistic Evolving Analysis of Real-life Universal speaker characteristics
FP7 ERC Starting grant #338164
Runtime: 01.01.2014 - 31.12.2018
Role: Participant
Partners: University of Passau, TUM
Summary: Recently, automatic speech and speaker recognition have matured to the degree that they have entered the daily lives of thousands of Europe's citizens, e.g., on their smartphones or in call services. During the next years, speech processing technology will move to a new level of social awareness to make interaction more intuitive, speech retrieval more efficient, and lend additional competence to computer-mediated communication and speech-analysis services in the commercial, health, security, and further sectors. To reach this goal, rich speaker traits and states such as age, height, personality, and physical and mental state, as carried by the tone of the voice and the spoken words, must be reliably identified by machines. In the iHEARu project, ground-breaking methodology, including novel techniques for multi-task and semi-supervised learning, will deliver for the first time intelligent, holistic, and evolving analysis in real-life conditions of universal speaker characteristics, which have so far been considered only in isolation. Today's sparseness of annotated realistic speech data will be overcome by large-scale speech and meta-data mining from public sources such as social media, crowd-sourcing for labelling and quality control, and shared semi-automatic annotation. All stages, from pre-processing and feature extraction to statistical modelling, will evolve in "life-long learning" according to new data, by utilising feedback, deep, and evolutionary learning methods. Human-in-the-loop system validation and novel perception studies will analyse the self-organising systems and the relation of automatic signal processing to human interpretation in a previously unseen variety of speaker classification tasks. The project's work plan gives the unique opportunity to transfer current world-leading expertise in this field into a new de-facto standard of speaker characterisation methods and open-source tools ready for tomorrow's challenge of socially aware speech analysis.
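To make the multi-task idea concrete, the sketch below shows a shared encoder with one output head per speaker trait, trained with a joint loss so that the tasks inform each other; the chosen traits, feature dimensionality, and losses are illustrative assumptions, not the project's actual design.

    # Minimal sketch of multi-task speaker-trait analysis: a shared encoder
    # with one head per trait, trained jointly. Traits and sizes are
    # illustrative assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiTaskTraitModel(nn.Module):
        def __init__(self, n_features=384, hidden=128):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
            self.age_head = nn.Linear(hidden, 1)          # regression (years)
            self.gender_head = nn.Linear(hidden, 2)       # classification
            self.personality_head = nn.Linear(hidden, 5)  # Big Five scores

        def forward(self, x):
            z = self.encoder(x)
            return self.age_head(z), self.gender_head(z), self.personality_head(z)

    model = MultiTaskTraitModel()
    x = torch.randn(8, 384)  # batch of utterance-level acoustic features
    age, gender, personality = model(x)
    loss = (F.mse_loss(age, torch.randn(8, 1))
            + F.cross_entropy(gender, torch.randint(0, 2, (8,)))
            + F.mse_loss(personality, torch.randn(8, 5)))
    loss.backward()          # one joint multi-task update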
AMI
Audiovisual Markers of perceived speech Intelligibility
NeuroCog IDEX UGA, "Investissements d'avenir" framework: ANR-15-IDEX-02
Runtime: 01.02.2018 - 26.07.2018
Role: Co-author of proposal, beneficiary
Partners: Laboratoire d'Informatique de Grenoble (LIG), team GETALP, Laboratoire Grenoble Images Parole Signal Automatique (GIPSA), team PCMD, University of Grenoble Alpes
Summary: We are not passive when we listen to an interlocutor, but produce many non-verbal cues, such as short vocal expressions (e.g., "hum hum"), particular head movements (e.g., nodding), facial mimics (e.g., frowning), or specific body postures. Studies have shown that the multimodal cues produced by an individual while listening to an interlocutor are relatively powerful in social interactions, because they serve, for example, to mark agreement or disagreement with what is said, the degree of appreciation of the partner, or the engagement in the interaction (e.g., attentive or passive listening), or help to manage speaking turns. On the other hand, little is known about the markers used to encode information related to the level of understanding and the evaluation of the intelligibility of perceived speech. There are indeed many contextual factors that can have a major impact on the quality of speech production, and therefore on the intelligibility of the message produced, e.g., benign diseases, intoxication, cognitive load, physical load, affect, stress, noisy environments, or a wide audience. Although non-verbal cues are very powerful in social interactions - a simple sigh can tell more than a long speech - their study in the context of speech intelligibility remains largely unexplored. The identification and fine-grained analysis of these non-verbal markers by automatic processing methods will enable us to finely characterise a listener's reactions to the intelligibility of perceived speech, and to identify, where possible, the multimodal encoding strategies employed according to the encountered disturbances.
MixedEmotions
Social Semantic Emotion Analysis for Innovative Multilingual Big Data Analytics Markets
EU Horizon 2020 Innovation Action (IA) #644632 - 12.5% acceptance rate in the call
Runtime: 01.04.2015 - 31.03.2017
Role: Co-author of proposal, beneficiary, work package leader
Partners: NUI Galway, Univ. Polit. Madrid, University of Passau, Expert Systems, Paradigma Tecnológico, TU Brno, Sindice Ltd., Deutsche Welle, Phonexia SRO, Adoreboard, Millward Brown
Summary: MixedEmotions will develop innovative multilingual multi-modal Big Data analytics applications that will analyse a more complete emotional profile of user behaviour using data from mixed input channels: multilingual text data sources, A/V signal input (multilingual speech, audio, video), social media (social networks, comments), and structured data. Commercial applications (implemented as pilot projects) will be in Social TV, Brand Reputation Management, and Call Centre Operations. Making sense of accumulated user interaction from different data sources, modalities, and languages is challenging and has not yet been explored in full in an industrial context. Commercial solutions exist, but they do not address the multilingual aspect in a robust and large-scale setting, do not scale up to the huge data volumes that need to be processed, and do not integrate emotion analysis observations across data sources and/or modalities at a meaningful level. MixedEmotions will implement an integrated Big Linked Data platform for emotion analysis across heterogeneous data sources, different languages, and modalities, building on existing state-of-the-art tools, services, and approaches, to enable the tracking of emotional aspects of user interaction and feedback at an entity level. The MixedEmotions platform will provide an integrated solution for: large-scale emotion analysis and fusion on heterogeneous, multilingual, text, speech, video, and social media data streams, leveraging open-access and proprietary data sources, and exploiting social context by leveraging social network graphs; and semantic-level emotion information aggregation and integration through robust extraction of social semantic knowledge graphs for emotion analysis along multidimensional clusters.
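As a minimal illustration of fusion across modalities, the sketch below combines per-modality emotion scores into a single profile using confidence weights; the modality names, emotion labels, and weights are illustrative assumptions, not the platform's actual pipeline.

    # Minimal sketch of late fusion: per-modality emotion scores are
    # combined into one profile with a confidence weight per modality.
    from typing import Dict

    def fuse_emotions(scores: Dict[str, Dict[str, float]],
                      weights: Dict[str, float]) -> Dict[str, float]:
        """Weighted average of per-modality emotion scores."""
        fused: Dict[str, float] = {}
        total = sum(weights[m] for m in scores)
        for modality, emotions in scores.items():
            for emotion, value in emotions.items():
                fused[emotion] = fused.get(emotion, 0.0) + weights[modality] * value
        return {e: v / total for e, v in fused.items()}

    scores = {
        "text":  {"joy": 0.7, "anger": 0.1},
        "audio": {"joy": 0.5, "anger": 0.3},
        "video": {"joy": 0.6, "anger": 0.2},
    }
    print(fuse_emotions(scores, {"text": 1.0, "audio": 0.5, "video": 0.5}))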
PROPEREMO
Production and Perception of Emotions: An affective sciences approach
FP7 ERC Advanced grant #230331
Runtime: 01.03.2008 - 28.02.2015
Role: Participant
Partners: University of Geneva (PI: Klaus Scherer), TUM, Free University of Berlin
Summary: Emotion is a prime example of the complexity of human mind and behaviour, a psychobiological mechanism shaped by language and culture, which has puzzled scholars in the humanities and social sciences over the centuries. In an effort to reconcile conflicting theoretical traditions, we advocate a componential approach which treats event appraisal, motivational shifts, physiological responses, motor expression, and subjective feeling as dynamically interrelated and integrated components during emotion episodes. Using a prediction-generating theoretical model, we will address both production (elicitation and reaction patterns) and perception (observer inference of emotion from expressive cues). Key issues are the cognitive architecture and mental chronometry of appraisal, neurophysiological structures of relevance and valence detection, the emergence of conscious feelings due to the synchronization of brain/body systems, the generating mechanism for motor expression, the dimensionality of affective space, and the role of embodiment and empathy in perceiving and interpreting emotional expressions. Using multiple paradigms in laboratory, game, simulation, virtual reality, and field settings, we will critically test theory-driven hypotheses by examining brain structures and circuits (via neuroimagery), behaviour (via monitoring decisions and actions), psychophysiological responses (via electrographic recording), facial, vocal, and bodily expressions (via micro-coding and image processing), and conscious feeling (via advanced self-report procedures). In this endeavour, we benefit from extensive research experience, access to outstanding infrastructure, advanced analysis and synthesis methods, validated experimental paradigms as well as, most importantly, from the joint competence of an interdisciplinary affective science group involving philosophers, linguists, psychologists, neuroscientists, behavioural economists, anthropologists, and computer scientists.
IM2 - EMOTIBOARD
Emotionally Enriched Remote Collaborative Interactions
Swiss National Foundation (SNF) - National Center of Competence in Research (NCCR)
Runtime: 01.08.2011 - 31.12.2013
Role: Participant
Partners: University of Fribourg, Ecole Polytechnique Fédérale de Lausanne (EPFL)
Summary: The EmotiBoard is an interdisciplinary project, led by Denis Lalanne and Fabien Ringeval at the Department of Informatics, in collaboration with Prof. Juergen Sauer and Andreas Sonderegger from the Department of Psychology. The goal of this project is to develop a methodology for emotionally enriching remote collaborative interactions; members of virtual teams are indeed less productive and less affectively committed than collocated teams, in particular in difficult work situations such as intercultural teamwork. The project involves both real-time emotion recognition from multimodal inputs (speech and electrodermal activity - EDA) and an adapted visual representation of emotion as feedback, so that one can see the emotions of his/her remote partner. Various studies were performed with the EmotiBoard to investigate the benefits of such emotional feedback for improving the quality of remote collaborative interactions, in particular for emotion awareness.
USIT
Use of Signal and Image Processing for Speech Disorders
Orange Foundation (France)
Runtime: 2006-2011
Role: Participant
Partners: Université Pierre et Marie Curie (UPMC), Université Paris Diderot - Paris 7, Hôpital la Pitié-Salpêtrière, Hôpital Enfants-malades Necker
Summary: The goal of this research project is to create and analyse a corpus of pathological speech (e.g., autism spectrum conditions, specific language impairment), in close collaboration with clinicians and psychologists. The main goal is to determine which prosodic features are specific to the studied developmental disorders, in order to improve both diagnosis and medical care. One major problem reported in the literature is the lack of consensus in the subjective description of the prosody of autistic children, whether in its grammatical, pragmatic, or affective functions. An automatic approach allows us to overcome not only the bias introduced by human judgements, but also the difficulty of analysing in detail a large set of acoustic correlates of prosody.
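As an illustration of automatic prosodic analysis, the sketch below computes simple pitch statistics with librosa's pYIN tracker; the synthetic input, pitch bounds, and chosen statistics are illustrative assumptions, not the project's actual feature set.

    # Minimal sketch of extracting basic prosodic descriptors (pitch
    # statistics) with librosa. A synthetic tone stands in for speech.
    import numpy as np
    import librosa

    # Synthetic 220 Hz tone replaces a real recording for this example.
    y = librosa.tone(220, sr=16000, duration=1.0)
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=16000
    )
    f0_voiced = f0[voiced_flag]  # keep only voiced frames

    # Simple utterance-level prosodic statistics over voiced frames.
    print("mean F0 (Hz):", float(np.nanmean(f0_voiced)))
    print("F0 range (Hz):", float(np.nanmax(f0_voiced) - np.nanmin(f0_voiced)))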