Rapid urbanization increases community, construction, and transportation noise in residential areas, and relying solely on sound pressure level (SPL) for noise control is inadequate. This work develops an end-to-end IoT system that uses edge devices to extract real-time urban sound metadata, including sound type, location, duration, occurrence rate, loudness, and azimuth, across nine residential areas, with data aggregated on a cloud platform for detailed analytics and visualization. The system integrates hardware, software, cloud technologies, and signal processing algorithms; it provides insights into residential noise and highlights the practical challenges of managing a large-scale sensor network.
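As a minimal illustration of the SPL metric the abstract argues is insufficient on its own, the sketch below (our own simplification, not the paper's pipeline) computes the root-mean-square sound pressure level of an audio frame in dB relative to the standard 20 µPa reference:

```python
import numpy as np

P_REF = 20e-6  # standard reference pressure, 20 micropascals

def spl_db(frame: np.ndarray) -> float:
    """Root-mean-square sound pressure level of one audio frame, in dB SPL.

    Assumes the frame holds calibrated pressure samples in pascals.
    """
    rms = np.sqrt(np.mean(np.square(frame)))
    return 20.0 * np.log10(rms / P_REF)

# A 1 kHz tone with 1 Pa amplitude has RMS ~0.707 Pa, i.e. ~91 dB SPL.
t = np.arange(0, 0.1, 1 / 16000.0)
tone = np.sin(2 * np.pi * 1000 * t)
print(round(spl_db(tone), 1))  # 91.0
```

A single scalar like this says nothing about what made the sound, where it came from, or how often it recurs, which is exactly the metadata the proposed edge devices extract.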
SINGA:PURA is a polyphonic urban sound dataset with spatiotemporal context, collected via a wireless acoustic sensor network deployed across Singapore. The project aims to identify and mitigate noise sources in Singapore, with applications in sound event detection, classification, and localization. The dataset includes a hierarchical label taxonomy compatible with existing datasets, and our paper details the data collection, annotation, and processing methodologies, an exploratory data analysis, and the performance of a baseline model on the dataset.
Sound event localization and detection (SELD) comprises sound event detection and direction-of-arrival estimation, which are challenging to optimize jointly. We introduce the Spatial cue-Augmented Log-SpectrogrAm (SALSA) feature, which accurately maps signal power and source directional cues to resolve overlapping sound sources. Experiments on the TAU-NIGENS Spatial Sound Events 2021 dataset demonstrated that SALSA features significantly outperformed state-of-the-art features, improving F1 scores and localization recall by up to 16% and 7%, respectively, depending on the microphone array format.
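To make the idea of mapping signal power and directional cues onto the same time-frequency grid concrete, here is a simplified sketch of a SALSA-style input feature. The published SALSA feature uses eigenvector-based spatial cues; this version stacks log-power spectrograms with interchannel phase differences, a simplification in the spirit of the phase-based variant, and the STFT parameters are arbitrary:

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Naive STFT: Hann-windowed frames, real FFT. Returns (frames, bins)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.asarray(frames), axis=-1)

def salsa_like(multichannel, ref=0):
    """multichannel: (channels, samples) -> stacked (features, frames, bins)."""
    specs = np.stack([stft(ch) for ch in multichannel])    # (C, T, F) complex
    log_power = np.log(np.abs(specs) ** 2 + 1e-8)          # signal-power cues
    phase_ref = np.angle(specs[ref])
    ipd = [np.angle(specs[c]) - phase_ref                  # directional cues
           for c in range(specs.shape[0]) if c != ref]
    return np.concatenate([log_power, np.stack(ipd)], axis=0)

mics = np.random.randn(4, 16000)  # synthetic 4-channel input, 1 s at 16 kHz
feat = salsa_like(mics)
print(feat.shape)  # (7, 61, 257): 4 log-spectrograms + 3 phase-difference maps
```

Because both cue types share one time-frequency grid, a downstream network can associate each detected event directly with its direction, which is what resolves overlapping sources.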
Signal processing methods for sound source direction-of-arrival estimation create spatial pseudo-spectra where local maxima indicate source directions, but these spectra remain noisy after smoothing due to varying levels of noise, reverberation, and overlapping sources. Additionally, the unknown number of sources complicates peak selection, leading to potential errors. Convolutional neural networks, proven effective in image processing and direction-of-arrival estimation, generalize well across environments. The proposed 2D CNN with multi-task learning robustly estimates source numbers and directions from spatial pseudo-spectra, enhancing generalization and outperforming traditional methods in diverse noise, reverberation, and source conditions based on simulation and experimental results.
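The brittleness of hand-tuned peak selection that the proposed CNN replaces can be seen in a toy example. Below, a synthetic 1-D azimuth pseudo-spectrum contains two true sources plus noise; with the source count unknown, a fixed threshold can return spurious or missed peaks. All values are illustrative assumptions, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
az = np.arange(360)
# Two Gaussian "source" lobes at 90 and 240 degrees, plus additive noise.
spectrum = (np.exp(-0.5 * ((az - 90) / 5.0) ** 2)
            + 0.7 * np.exp(-0.5 * ((az - 240) / 5.0) ** 2)
            + 0.1 * rng.standard_normal(360))

def naive_peaks(spec, thresh):
    """Local maxima above a fixed threshold: brittle when the number of
    sources (and hence the right threshold) is unknown."""
    return [i for i in range(1, len(spec) - 1)
            if spec[i] > spec[i - 1] and spec[i] > spec[i + 1]
            and spec[i] > thresh]

peaks = naive_peaks(spectrum, thresh=0.5)
print(peaks)  # local maxima cluster near 90 and 240; noise may add extras
```

A learned model sidesteps the threshold entirely by jointly predicting the source count and the directions from the pseudo-spectrum.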
A GENERAL NETWORK ARCHITECTURE FOR SOUND EVENT LOCALIZATION AND DETECTION USING TRANSFER LEARNING AND RECURRENT NEURAL NETWORK
The polyphonic sound event detection and localization (SELD) task is complex due to challenges in optimizing both sound event detection (SED) and direction-of-arrival (DOA) estimation simultaneously within a single network. Our proposed SELD architecture addresses this by integrating independently pretrained sub-networks for SED and DOA estimation, augmented by a recurrent layer that aligns their outputs without knowledge of their upstream algorithms. This modular approach accommodates various existing SED and DOA methods, allowing independent refinement of sub-networks. Experimental results on the DCASE 2020 SELD dataset demonstrate competitive performance across different algorithms and audio formats, and the source code is publicly available on GitHub.
FRCRN: BOOSTING FEATURE REPRESENTATION USING FREQUENCY RECURRENCE FOR MONAURAL SPEECH ENHANCEMENT
In this paper, a convolutional recurrent network (CRN) with a convolutional recurrent encoder-decoder (CRED) structure is introduced to enhance monaural speech, using a feedforward sequential memory network (FSMN) for efficient frequency recurrence. Applying frequency recurrence to 3D convolutional feature maps improves the feature representation along the frequency axis and captures long-range frequency correlations in speech inputs. The framework, termed Frequency Recurrent CRN (FRCRN), additionally incorporates stacked FSMN layers between the encoder and decoder to model temporal dynamics. FRCRN achieves state-of-the-art results on wideband benchmark datasets and ranked second in the real-time fullband track of the ICASSP 2022 Deep Noise Suppression challenge based on Mean Opinion Score (MOS) and Word Accuracy (WAcc).
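The core idea of frequency recurrence can be sketched with a much simpler stand-in for the FSMN: a scan along the frequency axis of a feature map so that each bin's representation depends on lower-frequency bins. The exponential recurrence and weight below are our own illustrative assumptions, not the FRCRN implementation:

```python
import numpy as np

def freq_recurrence(feat, w=0.5):
    """feat: (time, freq, channels). Run a simple exponential recurrence
    along the frequency axis, so bin f aggregates bins 0..f."""
    out = np.empty_like(feat)
    state = np.zeros((feat.shape[0], feat.shape[2]))
    for f in range(feat.shape[1]):
        state = w * state + (1 - w) * feat[:, f, :]  # carry info up in freq
        out[:, f, :] = state
    return out

x = np.random.randn(10, 257, 4)  # 10 frames, 257 bins, 4 channels
print(freq_recurrence(x).shape)  # (10, 257, 4)
```

The FSMN plays this role far more expressively (learned memory taps instead of a fixed decay), but the data flow along the frequency axis is the same.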
A Sequence Matching Network for Polyphonic Sound Event Localization and Detection
Sound event detection and direction-of-arrival estimation typically rely on distinct input features: time-frequency patterns for event detection and magnitude or phase differences between microphones for direction-of-arrival estimation. Previous methods often use shared input features and train both tasks jointly or in a two-stage transfer-learning approach. In contrast, the proposed approach separates the learning of these tasks into two steps: first detecting sound events and estimating directions-of-arrival independently, then aligning their outputs using a deep neural network. This modular strategy enhances system flexibility and performance, as demonstrated by improved results on the DCASE 2019 dataset compared to existing state-of-the-art methods.
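A toy version of the final alignment step may help: given frame-wise SED class probabilities and frame-wise DOA estimates produced by independent sub-systems, pair each active detection with the direction estimated for the same frame. The papers perform this matching with a learned network; the thresholded lookup below is our own simplification, with made-up inputs:

```python
import numpy as np

def align(sed_probs, doa_az, thresh=0.5):
    """sed_probs: (T, n_classes) detection probabilities per frame.
    doa_az: (T,) azimuth estimate per frame, in degrees.
    Returns (frame, class, azimuth) triples for active detections."""
    out = []
    for t in range(sed_probs.shape[0]):
        for c in np.flatnonzero(sed_probs[t] > thresh):
            out.append((t, int(c), float(doa_az[t])))
    return out

sed = np.array([[0.9, 0.1], [0.8, 0.6], [0.2, 0.7]])  # 3 frames, 2 classes
doa = np.array([30.0, 32.0, 120.0])
print(align(sed, doa))
# [(0, 0, 30.0), (1, 0, 32.0), (1, 1, 32.0), (2, 1, 120.0)]
```

Replacing this rule with a trained sequence-matching network lets the system learn which DOA estimate belongs to which overlapping event, rather than assuming one direction per frame.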
Urbanization has led to major roads being built closer to residential areas, with traffic noise from inconsiderate driving behaviors and aging vehicles disturbing residents, especially at night. To address this issue, an automatic Noisy Vehicle Surveillance Camera (NoivelCam) system has been designed to capture and document noise violations by photographing the license plates of offending vehicles. An initial deployment of NoivelCam in Singapore has demonstrated its effectiveness in monitoring vehicle noise levels on highways.