This work addresses the challenge of explainability in Deep Learning within the medical domain, focusing specifically on chest X-rays. To enhance explainability, we propose a closed pipeline comprising separate models that integrate natural language processing techniques—leveraging insights from radiologist reports—with localization methods to produce visually interpretable outputs highlighting pathological regions. Our preliminary results demonstrate success in generating concise diagnostic reports and semantically grounding them in chest X-ray images. Additionally, we developed a specialized medical metric to assess the similarity of medical phrases, as existing natural language metrics do not adequately capture the required semantics. The current phase of this PhD focuses on refining these individual models and developing a final hybrid model capable of providing confidence scores for localized pathology regions.
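To illustrate why generic surface-overlap metrics fall short on radiology language, the short sketch below scores two kinds of phrase pairs with a plain unigram-F1 overlap (a stand-in for n-gram metrics such as BLEU/ROUGE). This is a generic illustration, not the metric developed in this work, and the example phrases are hypothetical.

```python
# Minimal illustration (not the proposed metric): unigram F1 overlap applied
# to chest X-ray phrases, showing where surface-overlap metrics mislead.

def unigram_f1(reference: str, candidate: str) -> float:
    """Token-level F1 between two phrases, ignoring word order and meaning."""
    ref, cand = set(reference.lower().split()), set(candidate.lower().split())
    if not ref or not cand:
        return 0.0
    overlap = len(ref & cand)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Semantically equivalent phrasings score low...
print(unigram_f1("no evidence of pneumothorax", "pneumothorax is absent"))                    # ~0.29
print(unigram_f1("cardiac silhouette is enlarged", "findings consistent with cardiomegaly"))  # 0.0
# ...while a contradictory pair scores high, because only one word differs.
print(unigram_f1("small right pleural effusion", "large right pleural effusion"))             # 0.75
```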
To address the environmental impact of human activity, both individuals and industries must adopt sustainable practices. In manufacturing, this includes enhancing product reliability, reducing waste, and ensuring safety, with quality control playing a key role. While off-the-shelf solutions efficiently detect simple defects in standardized production lines, they fail for complex or subjective issues, leaving manual, error-prone processes. This thesis aims to develop a robust computer vision system for quality control, overcoming challenges like subtle defect discrimination, rare and costly annotations, and stringent hardware constraints. The project will explore semi-supervised learning, lightweight model design, and potentially extend to root cause analysis through multi-modal reasoning.
Knee injuries, particularly those involving the anterior cruciate ligament (ACL), are among the most common sports-related injuries and present significant challenges for return-to-sport (RTS) processes, with only two-thirds of athletes regaining their previous performance levels post-rehabilitation. Despite promising results from machine learning (ML) methods for injury prediction, issues such as limited generalization and inconsistent predictive factors have hindered their widespread application. Recent advances in deep learning (DL), particularly its ability to analyze large datasets and medical imaging with human-level accuracy, offer new avenues for improving knee injury prevention and management.
This project aims to develop a novel DL-based approach using biomechanical data from ballistic motor tasks to identify subtle force production differences as functional markers of knee impairment. By transforming time-series data into image representations, the project seeks to train neural networks capable of predicting knee injury risks or recurrence with greater accuracy, providing an integrative and interdisciplinary solution for injury prevention and post-surgical monitoring in athletes.
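As an illustration of turning a 1-D biomechanical signal into an image that a convolutional network can consume, the sketch below encodes a force trace as a Gramian Angular Summation Field. The encoding and the simulated signal are illustrative assumptions, not necessarily the representation chosen in the project.

```python
import numpy as np

def gramian_angular_field(signal: np.ndarray) -> np.ndarray:
    """Encode a 1-D time series as a Gramian Angular Summation Field image.

    Steps: rescale to [-1, 1], map each value to an angle phi = arccos(x),
    then build the matrix G[i, j] = cos(phi_i + phi_j).
    """
    s_min, s_max = signal.min(), signal.max()
    x = 2.0 * (signal - s_min) / (s_max - s_min) - 1.0   # rescale to [-1, 1]
    x = np.clip(x, -1.0, 1.0)                            # guard against rounding
    phi = np.arccos(x)                                   # polar-coordinate angles
    return np.cos(phi[:, None] + phi[None, :])           # N x N image

# Hypothetical ground-reaction-force trace from a ballistic task (e.g. a jump).
t = np.linspace(0, 1, 128)
force = np.exp(-((t - 0.4) ** 2) / 0.01) + 0.1 * np.random.randn(128)
image = gramian_angular_field(force)
print(image.shape)  # (128, 128) -> can be fed to a 2-D CNN
```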
I contributed to the Connect Talent MILCOM project "Multimodal Imaging and Learning for COmputational Medicine", which combines data science and health and focuses on the application of machine learning to multimodal medical image analysis. In the framework of this project, my research addressed the following problems:
Context: Current trends in medical image analysis have shown the effectiveness of Machine Learning (ML) and Deep Learning (DL) in devising computer-aided solutions for a plethora of medical applications and imaging modalities. However, ML/DL models remain data-hungry, requiring large, high-quality datasets annotated by medical experts. Yet one rarely has a perfectly sized and carefully labeled dataset with which to train an ML/DL model, particularly in medical imaging, where data and annotations are expensive to acquire. In addition, there is never full consensus among medical expert annotators. Consequently, there is a need for innovative methodologies that enable annotation-efficient deep learning for medical imaging. To address this challenge, our research focuses on: i) how to learn with a limited and incomplete quantity of annotations, and ii) how to leverage unannotated data.
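One standard way to leverage unannotated data, given here only as a minimal sketch of the general idea rather than the method developed in this research, is confidence-thresholded pseudo-labeling: the model's own high-confidence predictions on unlabelled images are treated as training targets. The toy model, threshold, and loss weighting below are assumptions.

```python
import torch
import torch.nn.functional as F

def semi_supervised_step(model, labelled, unlabelled, threshold=0.9, weight=0.5):
    """One training step mixing a supervised loss with a pseudo-label loss.

    labelled:   (images, labels) batch with expert annotations
    unlabelled: images batch without annotations
    """
    x_l, y_l = labelled
    sup_loss = F.cross_entropy(model(x_l), y_l)

    with torch.no_grad():                         # pseudo-labels from the current model
        probs = F.softmax(model(unlabelled), dim=1)
        conf, pseudo = probs.max(dim=1)
        mask = conf >= threshold                  # keep only confident predictions

    if mask.any():
        unsup_loss = F.cross_entropy(model(unlabelled[mask]), pseudo[mask])
    else:
        unsup_loss = torch.tensor(0.0, device=x_l.device)

    return sup_loss + weight * unsup_loss

# Hypothetical usage with a toy classifier:
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(32 * 32, 3))
loss = semi_supervised_step(model,
                            (torch.randn(8, 1, 32, 32), torch.randint(0, 3, (8,))),
                            torch.randn(16, 1, 32, 32))
loss.backward()
```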
Context: The generalization of deep learning approaches has been an issue in real clinical scenarios, where the characteristics of the collected data differ widely between medical centers and scanners. This generalization issue stems from the domain-shift problem, for example: appearance variations across modalities, inter-patient anatomical structure variations, and different clinical sites with different acquisition parameters. Consequently, there is a need for innovative methodologies that enable generalizable deep learning approaches. Typically, this problem is addressed via unsupervised domain adaptation (UDA) strategies, where one assumes no labels are available for the target domain. The core idea of UDA is to go through an adaptation phase using a non-linear mapping to find a common domain-invariant representation, or latent space Z. The domain shift in Z can be reduced by enforcing the two domains' distributions to be closer via a suitable loss (e.g. Maximum Mean Discrepancy). Since Z is common to all domains that share the same label space, projected labeled source-domain samples can be used to train a segmenter for all domains. To address this challenge, our research focuses on: i) developing an unsupervised adaptation method that takes advantage of a model learned on a source-domain dataset (e.g. MRI images) to address a related problem (e.g. a segmentation task) on target images (e.g. CT images collected from different clinical sites), for which no annotations are available; ii) investigating the possibility of building a general segmenter for any organ with minimal task-specific annotations, while still leveraging information from other tasks.
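As a concrete example of the alignment loss mentioned above, the sketch below computes a Gaussian-kernel Maximum Mean Discrepancy between source and target features in the latent space Z; the kernel bandwidths and feature dimensions are illustrative assumptions.

```python
import torch

def mmd_rbf(source: torch.Tensor, target: torch.Tensor, sigmas=(1.0, 5.0, 10.0)) -> torch.Tensor:
    """MMD^2 estimate with a sum of Gaussian (RBF) kernels.

    source, target: (n_s, d) and (n_t, d) feature batches from the shared latent space Z.
    """
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2                    # pairwise squared distances
        return sum(torch.exp(-d2 / (2 * s ** 2)) for s in sigmas)

    k_ss = kernel(source, source).mean()
    k_tt = kernel(target, target).mean()
    k_st = kernel(source, target).mean()
    return k_ss + k_tt - 2 * k_st                      # smaller => distributions closer in Z

# Hypothetical encoder outputs for an MRI (source) and CT (target) batch:
z_source, z_target = torch.randn(32, 128), torch.randn(32, 128) + 0.5
alignment_loss = mmd_rbf(z_source, z_target)           # added to the segmentation loss
print(alignment_loss.item())
```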
I was part of the AGPIG team at GIPSA-Lab (now the ACTIV team). During my Ph.D., my research addressed the problem of automatic analysis of macro and micro facial expressions, specifically micro-expression detection and macro-expression recognition using machine learning and deep learning frameworks. We analyzed facial expressions in images and video sequences. In the framework of this project, my research addressed the following problems:
Macro-Facial Expressions (MaEs) occur when the subject agrees to express a given emotion. As a consequence, MaEs last over a substantial period of time and cover several regions of the face. They are frequent and involve more conscious control. MaEs can occur spontaneously, as an involuntary manifestation of an emotional state, or they can be posed as the result of a deliberate effort to communicate an emotional signal. Posed MaEs induce exaggerated movements and changes in the location and appearance of facial features. By contrast, spontaneous MaEs are more subtle but still produce visible facial movements, and they typically evolve differently over time than posed MaEs. In my dissertation, we first dedicate our work to MaE analysis, where expressions are categorized into basic (e.g., joy or anger) and non-basic (e.g., worried or ashamed) emotions, and we focus on spontaneous facial expressions associated with less constrained environmental conditions. Facial expressions can appear different under extrinsic and intrinsic variations even when performed by the same identity. Such within-identity changes can overwhelm the variations due to identity differences and make Facial Expression Recognition (FER) challenging, especially in unconstrained conditions. The complexity of discerning whether two Facial Expressions (FEs) reveal similar or different emotions is therefore shifted to the detection and description of a rich set of discriminative features. Tackling these issues requires extracting visual features that reflect the essential visual content of the FE images while being robustly invariant to intrinsic and extrinsic factors. We focus on finding the best feature representation able to discriminate between FEs regardless of the acquisition conditions or the expressions themselves. Therefore, we design several levels of appearance-based facial feature representations: low-level, mid-level and hierarchical features.
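As an illustrative example of a low-level appearance-based descriptor of the kind referred to above (the actual descriptors and parameters in the dissertation may differ), the sketch below computes uniform Local Binary Pattern histograms over a grid of facial blocks and concatenates them into one feature vector.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_grid_descriptor(face: np.ndarray, grid=(4, 4), points=8, radius=1) -> np.ndarray:
    """Concatenate uniform-LBP histograms computed on a grid of face blocks."""
    lbp = local_binary_pattern(face, P=points, R=radius, method="uniform")
    n_bins = points + 2                                  # number of uniform patterns
    rows, cols = grid
    h, w = face.shape
    histograms = []
    for i in range(rows):
        for j in range(cols):
            block = lbp[i * h // rows:(i + 1) * h // rows,
                        j * w // cols:(j + 1) * w // cols]
            hist, _ = np.histogram(block, bins=n_bins, range=(0, n_bins), density=True)
            histograms.append(hist)
    return np.concatenate(histograms)                    # one vector per face image

# Hypothetical 64x64 grayscale face crop:
face = np.random.rand(64, 64)
descriptor = lbp_grid_descriptor(face)
print(descriptor.shape)  # (4 * 4 * 10,) = (160,)
```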
In real-world applications, FE databases are inconsistent with one another. For instance, face images of the same expression may have different appearances within the same database, while different expressions may have similar appearances across subjects from different databases. Such inconsistency is due to varying domains arising from extrinsic and intrinsic factors, such as different cameras, illuminations, populations, acquisition setups, and participants' cultural background or personality. As a consequence of this inconsistency between databases, performance degrades when a FER system is trained on one source domain and tested on another target domain. This distribution-mismatch problem is often referred to as domain shift. In this context, a robust FER model must take special care during the learning process to infer models that adapt well to the test data on which they are deployed. Several critical issues associated with the target domain induce the domain-shift problem, mainly three: 1) inter-subject expression variations, i.e., the way an expression is produced is inconsistent across different people; 2) large variance in face pose, illumination, occlusions, camera changes, and image resolution; and 3) spontaneous expressions with various intensities. Hence, we study how to adapt facial expression models trained on a particular visual domain (posed expression datasets) to a new domain (spontaneous expression datasets) by learning a non-linear transformation that minimizes the effect of domain-shift changes in the feature distribution. We also study the problem of Zero-Shot Learning (ZSL) for FER, where facial expression classes in the test set are unseen during training. We direct our research toward transfer learning, where we aim to adapt facial expression models to new domains and tasks. We studied domain adaptation and zero-shot learning to develop a method that solves the two tasks jointly. Our method is suitable for unlabelled target datasets coming from a different data distribution than the source domain while sharing the same label space and task, and for unlabelled target datasets with different feature and label distributions than the source domain. To permit knowledge transfer between domains and tasks, we use Euclidean learning and Convolutional Neural Networks to design a non-linear mapping function that maps the visual information coming from facial expressions into a semantic space derived from a Natural Language model that encodes visual attribute descriptions or label information descriptions. The consistency between the two subspaces is maximized by aligning them using the visual feature distribution. Chapter 4 of my dissertation describes the domain adaptation and zero-shot learning approaches in full.
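A minimal sketch of the visual-to-semantic mapping idea is given below; the network sizes, the cosine-alignment loss, and the use of precomputed class word embeddings are assumptions for illustration, not the exact design in Chapter 4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualToSemantic(nn.Module):
    """Non-linear mapping from CNN visual features to a semantic embedding space."""
    def __init__(self, visual_dim=512, semantic_dim=300, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(visual_dim, hidden),
                                 nn.ReLU(),
                                 nn.Linear(hidden, semantic_dim))

    def forward(self, v):
        return F.normalize(self.net(v), dim=1)           # unit-norm semantic prediction

# Hypothetical semantic prototypes (e.g. word embeddings of expression names),
# including classes never seen during training:
prototypes = F.normalize(torch.randn(7, 300), dim=1)     # 7 expression classes
model = VisualToSemantic()

# Training: align mapped visual features with the prototype of their (seen) class.
features = torch.randn(16, 512)                          # CNN features of seen classes
labels = torch.randint(0, 5, (16,))                      # only classes 0-4 are seen
pred = model(features)
loss = (1 - F.cosine_similarity(pred, prototypes[labels])).mean()
loss.backward()

# Zero-shot inference: assign the nearest prototype, which may be an unseen class.
test_pred = model(torch.randn(4, 512))
predicted_class = (test_pred @ prototypes.T).argmax(dim=1)
print(predicted_class)
```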
Micro-expressions (MiEs) result from either conscious suppression or unconscious repression of expressions, when a person experiences an emotion but attempts to mask the facial deformations. Understanding MiEs helps to identify deception and the true mental state of a person. Unlike macro-facial expressions, which typically last 0.5-4 seconds and can therefore be immediately recognized by humans, MiEs generally last less than 0.2 seconds and are very subtle, which makes them difficult to spot and recognize. To improve people's capacity to identify and recognize MiEs, researchers in psychology have developed Micro Expression Training Tools to train specialists. However, even with these training tools, the accuracy of visual reading of MiEs by experts is only around 45%. Clearly, spotting and recognizing MiEs with the human eye is an extremely difficult task, and more descriptive facial feature displacements and motion information are needed. In my dissertation, we propose a process for MiE spotting, that is, identifying their temporal and spatial locations in a video sequence while effectively dealing with parasitic movements. As the duration of a MiE is very short, capturing its speed and subtlety requires a high-speed camera during data acquisition. However, the use of a high-speed camera also tends to capture parasitic motions and deformations, such as those related to head movements, eye blinks, gaze direction, and mouth opening or closing. These parasitic movements, along with other facial muscle activations, are often amplified and end up being confused with MiEs. As a result, it is essential to eliminate interference from facial information unrelated to MiEs while emphasizing the important characteristics of MiEs. On top of that, detecting a MiE requires a method that effectively captures subtle facial motions and subtle local spatial deformations. In this sense, our objectives were: i) to spot MiE segments (onset-offset frames); ii) to pinpoint their subtle local spatial deformations over facial regions; and iii) to effectively deal with parasitic movements by distinguishing motions related to MiEs from other facial events. To achieve these objectives, we first needed to consider how MiEs are produced. MiEs tend to be naturally infrequent, since people try not to produce them and very specific conditions are required to evoke them. Therefore, only a small amount of MiE data, accompanied by parasitic movements and deformations, can be collected even when the acquisition process takes place in a strictly controlled environment. With few data available for single MiE segments, typically about 2 to 20 frames from onset to offset when recorded with a high-speed camera at 60 fps, it is difficult to use supervised learning methods to automate MiE segment detection. As a consequence, we proposed a weakly supervised method in which we reformulate the problem of MiE spotting as a problem of anomaly detection. All facial motions and deformations except those caused by MiEs are considered Natural Facial Behaviour (NFB) events. NFB motions and spatial deformations are learned so that MiE motions and deformations can be detected in a frame as deviations from what the model has learned.
By reformulating the problem as anomaly detection, we alleviate the main challenge of dealing with a small number of MiE segments, since we mainly work with NFB events, which are frequent. More importantly, NFB segments are easier to handle because their larger motions and deformations make it possible to extract discriminant spatio-temporal information from them. Our method is composed of a deep Recurrent Convolutional Auto-Encoder that captures spatial and motion feature changes of natural facial behaviours. Then, a statistical model based on a Gaussian Mixture Model is learned to estimate the probability density function of normal facial behaviours and to associate a discriminating score for spotting micro-expressions. Finally, an adaptive thresholding technique is proposed for separating micro-expressions from natural facial behaviours. Chapter 5 of my dissertation describes the work on micro facial expression analysis in full.
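A simplified sketch of the scoring and thresholding stages could look as follows; it assumes the auto-encoder already provides a feature vector per frame, and the number of mixture components and the threshold rule are illustrative assumptions rather than the exact settings in Chapter 5.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical per-frame features produced by the recurrent convolutional auto-encoder
# (e.g. latent codes or reconstruction-error statistics) for natural facial behaviours.
nfb_features = np.random.randn(5000, 64)

# Fit the density of normal facial behaviours with a Gaussian Mixture Model.
gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
gmm.fit(nfb_features)

# Adaptive threshold derived from the training log-likelihoods (illustrative rule):
train_scores = gmm.score_samples(nfb_features)
threshold = train_scores.mean() - 3.0 * train_scores.std()

# At test time, frames whose log-likelihood falls below the threshold are
# flagged as candidate micro-expression frames (anomalies w.r.t. NFB).
test_features = np.random.randn(200, 64)
test_scores = gmm.score_samples(test_features)
candidate_mie_frames = np.where(test_scores < threshold)[0]
print(candidate_mie_frames)
```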