Multimodal Audio-based Disease Prediction with Transformer-based Hierarchical Fusion Network


Jinjin Cai*, Ruiqi Wang*, Dezhong Zhao,  Ziqin Yuan, Victoria McKenna, Aaron Friedman, Ryan Boente, 

Rachel Foot, Susan Storey, Sudip Vhaduri, and Byung-Cheol Min

*: Equal contribution

 Purdue University, Beijing University of Chemical Technology, New York University, University of Cincinnati, Indiana University

[code]

Abstract

Audio-based disease prediction is emerging as a promising supplement to traditional medical diagnosis methods, facilitating early, convenient, and non-invasive disease detection and prevention. Multimodal fusion, which integrates features from various domains within or across bio-acoustic modalities, has proven effective in enhancing diagnostic performance. However, most existing methods in the field employ unilateral fusion strategies that focus solely on either intra-modal or inter-modal fusion. This approach limits the full exploitation of the complementary nature of diverse acoustic feature domains and bio-acoustic modalities. Additionally, the inadequate and isolated exploration of latent dependencies within modality-specific and modality-shared spaces curtails their capacity to manage the inherent heterogeneity in multimodal data. To fill these gaps, we propose a transformer-based hierarchical fusion network designed for general multimodal audio-based disease prediction. Specifically, we integrate intra-modal and inter-modal fusion in a hierarchical manner, encoding the complementary correlations within each modality and across modalities, respectively. Comprehensive experiments demonstrate that our model achieves state-of-the-art performance in predicting three diseases: COVID-19, Parkinson's disease, and pathological dysarthria, showcasing its promising potential in a broad range of audio-based disease prediction tasks. Additionally, extensive ablation studies and qualitative analyses highlight the significant benefits of each main component within our model.


Framework Overview

Illustration of the proposed AuD-Former framework: This illustration showcases the framework using cough, respiration, and speech modalities as example inputs; however, the framework is versatile and can accommodate a variety of bio-audio modalities. Initially, multimodal low-level acoustic features extracted from multiple bio-audio sources undergo temporal and positional embedding processes, resulting in sequences of temporal unimodal features. These sequences are input into an intra-modal representation learning module composed of multiple intra-modal transformer networks. This module produces unimodal representations, which effectively capture intra-modal dependencies within each modality-specific context. Subsequently, these unimodal representations are concatenated and, along with a low-level fusion representation, fed into an inter-modal representation learning module. This module constructs a high-level fusion representation that encodes latent cross-modal complementarities within a shared modality space. Finally, the high-level fusion representation passes through a prediction layer, consisting of a multi-head attention sub-layer followed by two linear sub-layers, to produce the disease prediction.
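
The PyTorch sketch below illustrates this hierarchical intra-then-inter fusion flow only; it is not the authors' implementation. The temporal/positional embedding, the handling of the low-level fusion representation (here simplified to plain concatenation), and all layer sizes (`d_model`, `n_heads`, `n_layers`) are placeholder assumptions.

```python
import torch
import torch.nn as nn

class HierarchicalFusionSketch(nn.Module):
    """Illustrative intra-modal -> inter-modal fusion pipeline (not the official code)."""

    def __init__(self, feat_dims, d_model=64, n_heads=4, n_layers=2, n_classes=2):
        super().__init__()
        make_encoder = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        # Temporal embedding per modality (positional encoding omitted for brevity)
        self.embed = nn.ModuleList([nn.Conv1d(d, d_model, kernel_size=1) for d in feat_dims])
        # Intra-modal transformers: one per bio-audio modality (e.g., cough, respiration, speech)
        self.intra = nn.ModuleList([make_encoder() for _ in feat_dims])
        # Inter-modal transformer over the concatenated unimodal representations
        self.inter = make_encoder()
        # Prediction layer: multi-head attention sub-layer followed by two linear sub-layers
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                  nn.Linear(d_model, n_classes))

    def forward(self, xs):  # xs: list of (batch, time_m, feat_dim_m) tensors
        uni = [self.intra[m](self.embed[m](x.transpose(1, 2)).transpose(1, 2))
               for m, x in enumerate(xs)]
        fused = torch.cat(uni, dim=1)           # concatenated unimodal representations
        high = self.inter(fused)                # high-level fusion representation
        attended, _ = self.attn(high, high, high)
        return self.head(attended.mean(dim=1))  # disease prediction logits


# Example: three modalities with hypothetical 39-dimensional acoustic features each
# logits = HierarchicalFusionSketch([39, 39, 39])([cough_feats, breath_feats, speech_feats])
```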


Methodology

Intra-modal Representation Learning:

Illustration of the intra-modal transformer network.


Experimental Settings

Feature Descriptions:

The zero-crossing rate measures how often a signal changes its sign, that is, the number of times the speech signal shifts from positive to negative or from negative to positive within a given frame. A higher zero-crossing rate roughly corresponds to higher-frequency content, so the zero-crossing rate can serve as a coarse frequency feature of the signal.
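
A small NumPy sketch of this computation is shown below; the frame length and hop size are arbitrary example values, and librosa.feature.zero_crossing_rate offers an equivalent built-in.

```python
import numpy as np

def zero_crossing_rate(y, frame_length=2048, hop_length=512):
    """Fraction of adjacent-sample sign changes within each frame."""
    n_frames = 1 + (len(y) - frame_length) // hop_length
    zcr = np.empty(n_frames)
    for i in range(n_frames):
        frame = y[i * hop_length: i * hop_length + frame_length]
        # Proportion of sign flips between neighbouring samples in the frame
        zcr[i] = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
    return zcr
```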

Short-term energy represents the mean energy of each short frame, i.e., the average squared amplitude of the samples within that frame.
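
A corresponding sketch, using the same hypothetical framing parameters as above:

```python
import numpy as np

def short_term_energy(y, frame_length=2048, hop_length=512):
    """Mean squared amplitude of each short frame."""
    n_frames = 1 + (len(y) - frame_length) // hop_length
    return np.array([np.mean(y[i * hop_length: i * hop_length + frame_length] ** 2)
                     for i in range(n_frames)])
```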

The spectral centroid is a crucial parameter for describing timbre, as it represents the center of gravity of the frequency components. It is calculated as the energy-weighted average of the frequencies within a given range and is measured in Hz. The spectral centroid provides valuable information about the distribution of both frequency and energy within an audio signal.
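
A sketch of this energy-weighted average computed from an STFT power spectrogram is given below; note that librosa's built-in spectral_centroid weights by magnitude rather than power, so its values differ slightly.

```python
import numpy as np
import librosa

def spectral_centroid(y, sr, n_fft=2048, hop_length=512):
    """Energy-weighted mean frequency (Hz) per STFT frame."""
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length)) ** 2  # power spectrogram
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)                   # bin centre frequencies (Hz)
    return (freqs[:, None] * S).sum(axis=0) / (S.sum(axis=0) + 1e-10)
```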

Humans have relatively reduced sensitivity to the high-frequency components of speech signals. Although humans can perceive sounds up to 20,000 Hz, the ear processes sound waves at different frequencies non-linearly. To align with these auditory properties, researchers proposed the Mel scale and the corresponding Mel filterbank, a set of triangular bandpass filters arranged from low to high frequencies in a dense-to-sparse configuration.
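The triangular Mel filterbank can be built directly with librosa, as sketched below; the sampling rate, FFT size, and number of filters are arbitrary example values, and the Hz-to-Mel formula shown is the common HTK variant (librosa defaults to the Slaney variant).

```python
import numpy as np
import librosa

# One common Hz -> Mel mapping (HTK formula)
hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)

# 40 triangular bandpass filters: narrow and dense at low frequencies,
# wide and sparse at high frequencies, mirroring the ear's non-linear resolution
mel_fb = librosa.filters.mel(sr=16000, n_fft=2048, n_mels=40)   # shape: (40, 1025)
```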

The log-Mel spectrogram is obtained by applying the Mel filterbank to the linear spectrogram and taking the logarithm. The MFCCs are obtained by applying the Discrete Cosine Transform (DCT) to the log-Mel spectrogram and retaining the 2nd through 13th coefficients, yielding the 12 coefficients that form the MFCCs. These features effectively capture the key spectral characteristics of the speech signal while remaining consistent with human auditory perception.
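
A sketch of this pipeline with librosa and SciPy follows; the file name, filter count, and frame settings are placeholders, and librosa.feature.mfcc wraps the same steps in a single call.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

y, sr = librosa.load("sample.wav", sr=None)                  # hypothetical input recording
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40)  # Mel-scale power spectrogram
log_mel = librosa.power_to_db(mel)                           # log-Mel spectrogram
ceps = dct(log_mel, type=2, axis=0, norm="ortho")            # DCT along the Mel-filter axis
mfcc = ceps[1:13]                                            # keep the 2nd-13th coefficients (12 MFCCs)
```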

GFCC is similar to MFCC, but it uses a GammaTone filterbank in place of the Mel filterbank and logarithmic compression. GammaTone filters are a family of filter models designed to simulate the frequency decomposition performed by the cochlea: when external speech signals reach the cochlea's basilar membrane, they are decomposed by frequency, generating traveling waves that stimulate the auditory receptor cells. Compared with the Mel filterbank, the GammaTone filterbank is more sensitive to high-frequency information and reduces energy leakage; it also offers advantages in robustness and noise immunity.
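
Since the text does not spell out a GFCC recipe, the sketch below follows one common formulation: filter the waveform with a bank of 4th-order GammaTone filters at ERB-spaced centre frequencies, take frame-level channel energies, apply cubic-root compression in place of the logarithm, and finish with a DCT. All parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve
from scipy.fftpack import dct

def erb(f):
    """Equivalent rectangular bandwidth (Hz) at centre frequency f."""
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def gammatone_ir(fc, sr, duration=0.064, order=4):
    """Impulse response of a 4th-order GammaTone filter centred at fc."""
    t = np.arange(0.0, duration, 1.0 / sr)
    b = 1.019 * erb(fc)
    return t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

def gfcc(y, sr, n_filters=32, n_ceps=13, frame_length=1024, hop_length=512, fmin=50.0):
    fmax = sr / 2.0
    # ERB-rate-spaced centre frequencies between fmin and fmax
    erb_rate = lambda f: 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)
    inv_erb_rate = lambda e: (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37
    centres = inv_erb_rate(np.linspace(erb_rate(fmin), erb_rate(fmax), n_filters))

    # Decompose the waveform with each GammaTone filter (cochlea-like frequency analysis)
    channels = np.stack([np.abs(fftconvolve(y, gammatone_ir(fc, sr), mode="same"))
                         for fc in centres])

    # Frame-level channel energies, cubic-root compression, then DCT
    n_frames = 1 + (channels.shape[1] - frame_length) // hop_length
    energies = np.stack([(channels[:, i * hop_length: i * hop_length + frame_length] ** 2).mean(axis=1)
                         for i in range(n_frames)], axis=1)
    return dct(np.cbrt(energies), type=2, axis=0, norm="ortho")[:n_ceps]
```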

In CQCC feature extraction, the constant-Q transform (CQT) is first applied to the input audio signal; its geometrically spaced frequency bins give a non-linear frequency resolution that closely follows musical pitch relationships. The log power spectrum of the CQT is then taken, and a DCT yields the cepstral coefficients.
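
A simplified sketch of this pipeline with librosa is given below. The full CQCC recipe also uniformly resamples the log power spectrum before the DCT, which is omitted here; the bin counts and input file are placeholder assumptions.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

y, sr = librosa.load("sample.wav", sr=None)                    # hypothetical input recording
C = librosa.cqt(y, sr=sr, fmin=librosa.note_to_hz("C1"),
                n_bins=84, bins_per_octave=12)                 # constant-Q transform
log_power = np.log(np.abs(C) ** 2 + 1e-10)                     # log power spectrum
cqcc = dct(log_power, type=2, axis=0, norm="ortho")[:20]       # first 20 cepstral coefficients
```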