Acoustic Animal Classification System for Exotic Species and Source Separation
Admin 2021-04-08
1. Introduction
- Exotic species are continuously entering our country, and monitoring them is vital to protecting biodiversity. We therefore propose a deep-learning-based acoustic classification system that monitors and identifies exotic species.
2. Main algorithm and principle
- An overview of the proposed system is shown in Figures 1 and 3. First, the acoustic data transmitted from the recording/data-transmission device is preprocessed. The data is then assigned to one of the predefined species by the classification system. The system identifies and classifies each species in real time, even when multiple species are simultaneously present in the recording.
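The document does not specify the preprocessing, so as an illustrative sketch only, a common front end for acoustic classifiers is a framed log-magnitude spectrogram; the frame and hop sizes below (25 ms / 10 ms at 16 kHz) are assumptions:

```python
import numpy as np

def log_spectrogram(signal, frame_len=400, hop=160, eps=1e-10):
    """Frame the waveform, window each frame, and take the log-magnitude FFT.
    frame_len/hop correspond to 25 ms / 10 ms at 16 kHz (assumed values)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hanning(frame_len)       # window to reduce spectral leakage
    spec = np.abs(np.fft.rfft(frames, axis=1))    # magnitude spectrum per frame
    return np.log(spec + eps)                     # log compression

# a 1-second dummy clip at 16 kHz
feat = log_spectrogram(np.random.randn(16000))
print(feat.shape)  # (98, 201): 98 frames, frame_len // 2 + 1 frequency bins
```

The resulting time-frequency matrix is what the classification network would consume.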
Figure 1. System Structure
- Dynamic Filter Temporal Efficient Network (TENet) [1, 4]
o The filter generator produces weights of the neural network, and the dynamic filter layer uses the generated weights for the filtering process.
o Combination of Depthwise Convolution, Temporal Convolution, and Pointwise Convolution
o Compared to the Temporal Convolution Network (TCNet), the number of parameters and amount of computation are reduced.
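The parameter saving from the depthwise/pointwise split can be checked with simple arithmetic; the channel count (64) and kernel size (9) below are illustrative assumptions, not the network's actual configuration:

```python
def conv1d_params(c_in, c_out, k):
    # standard temporal convolution: every output channel sees every input channel
    return c_in * c_out * k

def separable_params(c, k):
    # depthwise temporal conv (k weights per channel) + 1x1 pointwise channel mixing
    return c * k + c * c

full = conv1d_params(64, 64, 9)   # 36864 parameters
sep = separable_params(64, 9)     # 4672 parameters
print(full, sep, round(full / sep, 1))  # roughly a 7.9x reduction
```

The same factoring also reduces multiply-accumulate operations by a similar ratio, which is the source of the savings over TCNet claimed above.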
- Metric Learning [2]
o Minimize same-species vector distance (L2 norm)
o Maximize different-species vector distance (L2 norm)
o Maximize same-species vector similarity (cosine similarity)
o Minimize different-species vector similarity (cosine similarity)
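These objectives are what a triplet loss [2] implements. A minimal numpy sketch (the embeddings and margin here are toy values, not the system's):

```python
import numpy as np

def l2(a, b):
    return np.linalg.norm(a - b)

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Pull same-species embeddings together, push different-species apart.
    Shown with the L2 distance; the cosine variant swaps in cos() with flipped signs."""
    return max(0.0, l2(anchor, positive) - l2(anchor, negative) + margin)

a = np.array([1.0, 0.0])   # anchor embedding
p = np.array([0.9, 0.1])   # same species: already close, loss is zero
n = np.array([-1.0, 0.2])  # different species: already far by more than the margin
loss = triplet_loss(a, p, n)
```

Training minimizes this loss over many (anchor, positive, negative) triplets, so embeddings of the same species cluster while different species separate.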
Figure 2. Machine learning model through dynamic filter and metric learning
Figure 3. System Structure
- Source Separation [5]
Animal-sound datasets cover diverse species, many of which produce similar calls, and real-world recordings also contain various background noises. Disregarding these factors can cause some species to be misclassified. The following model was proposed to address these problems.
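Time-domain separators in the DPRNN family are commonly trained and evaluated with scale-invariant SNR; treating that as the objective here is an assumption, and the two-species mixture below is synthetic:

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR of an estimated source against the reference,
    the usual objective for time-domain separation (assumed here)."""
    ref = ref - ref.mean()
    est = est - est.mean()
    proj = (est @ ref) / (ref @ ref + eps) * ref   # projection of estimate onto reference
    noise = est - proj                              # everything not explained by the reference
    return 10 * np.log10((proj @ proj) / (noise @ noise + eps))

rng = np.random.default_rng(0)
bird = rng.standard_normal(8000)
frog = rng.standard_normal(8000)
mixture = bird + frog                 # two species calling at once

perfect = si_snr(bird, bird)          # perfect estimate: very high SI-SNR
baseline = si_snr(mixture, bird)      # unseparated mixture: roughly 0 dB
```

A separator improves on the mixture's SI-SNR, giving the downstream classifier a cleaner per-species signal.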
Figure 4. DPRNN + TENet Structure
- Transformer-based classification model [6, 7]
o The transformer is a matrix-product-based non-local structure that identifies important patterns with fewer learnable parameters.
o The attention mechanism is extended to multi-head attention, which has multiple heads.
o The transformer structure handles signals of varying length and induces the network to extract global features.
o Channel attention is used to extract the important features, yielding more accurate feature representations.
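A minimal numpy sketch of multi-head attention illustrating the two properties above (one global weighted sum per frame, and independence from sequence length); the learned projections are replaced by identity slices for brevity, which is a simplification, not the model's actual layer:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, n_heads=4):
    """Scaled dot-product attention split into heads.
    x: (T, D) frame features; a real layer learns W_q, W_k, W_v per head."""
    T, D = x.shape
    d = D // n_heads
    heads = []
    for h in range(n_heads):
        q = k = v = x[:, h * d:(h + 1) * d]   # per-head slice (identity projection)
        scores = q @ k.T / np.sqrt(d)         # (T, T) pairwise frame similarity
        heads.append(softmax(scores) @ v)     # each frame attends to all frames
    return np.concatenate(heads, axis=1)      # (T, D)

# the same function works for any sequence length T
out = multi_head_attention(np.random.randn(37, 32))
```

Because the score matrix is built from the input itself, no weight shape depends on T, which is why the structure accommodates variable-length signals.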
3. Demo
- Enabled access to the system from mobile and PC clients using Django
- Set up a deep-learning server and mobile/PC clients
- Client: sends recorded audio files to the server
- Server: runs the classification algorithm on the received audio files and sends the results to the client
- Visualization of the waveform and per-interval analysis of the audio files
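The client-server round trip can be sketched as follows; the JSON schema, species name, and function names are all hypothetical, and the classifier is stubbed out in place of the real Django view and deep model:

```python
import json

def classify(audio_bytes):
    """Stub for the server-side classifier; the real system runs the
    deep model on the uploaded clip. The result schema is an assumption."""
    return {"intervals": [{"start": 0.0, "end": 2.5, "species": "bullfrog"}]}

def handle_upload(audio_bytes):
    # server: run classification, return the result to the client as JSON
    return json.dumps(classify(audio_bytes))

def render_result(payload):
    # client: show per-interval species labels alongside the waveform
    result = json.loads(payload)
    return [f"{seg['start']:.1f}-{seg['end']:.1f}s: {seg['species']}"
            for seg in result["intervals"]]

lines = render_result(handle_upload(b"\x00" * 16000))
print(lines)  # ['0.0-2.5s: bullfrog']
```

In the deployed demo the transport would be an HTTP upload to the Django server rather than a direct function call.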
Figure 6. Image demo example
4. References
[1] Donghyeon Kim, Kyungdeuk Ko, Jeong-gi Kwak, David K. Han, Hanseok Ko, "Lightweight Dynamic Filter for Keyword Spotting", arXiv preprint arXiv:2019.11165, 2021.
[2] Elad Hoffer, Nir Ailon, "Deep Metric Learning Using Triplet Network", Similarity-Based Pattern Recognition: Third International Workshop, Proceedings 3, Springer International, pp. 84-92, October 2015.
[3] Gwantae Kim, David K. Han, Hanseok Ko, "Feedback Module Based Convolution Neural Networks for Sound Event Classification", IEEE Access, vol. 9, pp. 150993-151003, 2021.
[4] Ximin Li, Xiaodong Wei, Xiaowei Qin, "Small-Footprint Keyword Spotting with Multi-Scale Temporal Convolution", arXiv preprint arXiv:1904.03814, 2019.
[5] Yongmin Kim, Chulwon Choi, Yuanming Li, Hanseok Ko, "Animal Sound Separation Using Dual-Path RNN and Classifier Loss", ICA 2022.
[6] Avi Gazenli, Gadi Zimerman, Tal Ridnik, Gilad Sharir, Asaf Noy, "End-to-End Audio Strikes Back: Boosting Augmentations Towards an Efficient Audio Classification Network", arXiv preprint arXiv:2204.11479, 2022.
[7] Jie Hu, Li Shen, Gang Sun, "Squeeze-and-Excitation Networks", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132-7141, 2018.