BattleSound: A Game Sound Benchmark for Sound Detection in a Battle Game
Sungho Shin, Seongju Lee, Changhyun Jun, Kyoobin Lee
School of Integrated Technology, Gwangju Institute of Science and Technology (GIST)
A haptic sensor coupled to a gamepad or headset is frequently used to enhance the sense of immersion for game players. However, providing haptic feedback for appropriate sound effects requires specialized audio engineering techniques to identify target sounds, which vary from game to game. We propose a deep learning-based method for sound event detection (SED) to determine the optimal timing of haptic feedback in extremely noisy environments. To accomplish this, we introduce the BattleSound dataset, which contains a large volume of game sound recordings of game effects and other distracting sounds, including voice chat, from the game PlayerUnknown's Battlegrounds (PUBG). Given the highly noisy and distracting nature of war-game environments, we set the annotation interval to 0.5 s, significantly shorter than that of existing SED benchmarks, to increase the likelihood that each annotated label contains sound from a single source. As a baseline, we adopt mobile-sized deep learning models to perform two tasks: weapon sound event detection (WSED) and voice chat activity detection (VCAD). The accuracy of the models trained on BattleSound exceeded 90% for both tasks; thus, BattleSound enables real-time game sound recognition in noisy environments via deep learning. In addition, we demonstrated that performance degraded significantly when the annotation interval exceeded 0.5 s, indicating that BattleSound's short annotation interval is advantageous for SED applications that demand real-time inference.
You can download the dataset at the following link.
The zip file contains three folders, each named after a class; the audio samples for each class are stored inside the corresponding folder.
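A minimal indexing sketch for the unpacked archive, assuming the folders are named after the three classes used below (VOICE, WEAPON, MIXTURE) and that the samples are WAV files; the root path is hypothetical:

```python
from pathlib import Path

# Hypothetical extraction directory; folder names are assumed to match the
# class names (VOICE, WEAPON, MIXTURE) and the samples to be .wav files.
root = Path("BattleSound")
samples = [
    (wav_path, class_dir.name)  # (audio file, class label)
    for class_dir in sorted(root.iterdir()) if class_dir.is_dir()
    for wav_path in sorted(class_dir.glob("*.wav"))
]
```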
Real-time voice chat activity detection (VCAD) is the task of identifying voices in the streamed audio input. Typically, two to four players forming a team can communicate via voice chat while playing PUBG. Because multiple players speak concurrently and loudly, several parts of the recorded voice contain noise and overlapping sounds. In addition, weapon sounds are frequently mingled with voices, making them difficult to distinguish. To recognize the voice in the streamed audio input, we developed a VCAD model using deep learning. The VOICE- and MIXTURE-labeled samples in BattleSound were treated as target voice samples, whereas the WEAPON-labeled samples were used as non-target samples.
Weapon sound event detection (WSED) is the task of detecting weapon sounds, such as gunshots and bomb explosions, from the streamed audio input in real time. To enhance realism, many game devices provide visual or haptic feedback in response to game effects such as shooting or striking. If these game effects can be detected from the sound, the timing of the feedback delivery can be determined automatically. Therefore, we developed a deep learning model capable of detecting weapon sounds from streamed audio. We used the WEAPON- and MIXTURE-labeled samples in BattleSound as target samples and the VOICE-labeled samples as non-target samples.
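The two task setups above amount to binary relabelings of the three dataset classes. A minimal sketch of that mapping, using the class names as written above; the dictionary layout and helper function are our own, not part of the dataset:

```python
# Binary label per task; 1 = target, 0 = non-target, per the descriptions above.
VCAD_LABELS = {"VOICE": 1, "MIXTURE": 1, "WEAPON": 0}  # is a voice present?
WSED_LABELS = {"WEAPON": 1, "MIXTURE": 1, "VOICE": 0}  # is a weapon sound present?

def task_label(class_name: str, task: str) -> int:
    """Map a BattleSound class name to a binary label for 'vcad' or 'wsed'."""
    return (VCAD_LABELS if task == "vcad" else WSED_LABELS)[class_name]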
2.1. Audio Signal
All samples in BattleSound are recorded at a 16 kHz sampling rate. Each training and validation sample is 0.5 s long; in other words, each input signal has a dimension of 1 × 8000.
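A minimal loading sketch, assuming torchaudio and a hypothetical file name; it segments a recording into the 0.5 s, 1 × 8000 inputs described above:

```python
import torchaudio

SAMPLE_RATE = 16_000             # 16 kHz, as specified above
CLIP_SAMPLES = SAMPLE_RATE // 2  # 0.5 s -> 8000 samples

# Hypothetical file name; waveform has shape (channels, num_samples).
waveform, sr = torchaudio.load("weapon_0001.wav")
assert sr == SAMPLE_RATE

# Non-overlapping 0.5 s segments of shape (1, 8000); drop any short remainder.
segments = [s for s in waveform[:1].split(CLIP_SAMPLES, dim=1)
            if s.shape[1] == CLIP_SAMPLES]
```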
2.2. Spectrogram
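A minimal sketch of computing a power spectrogram from one 0.5 s segment; the STFT parameters (n_fft=512, hop_length=128) are assumed values for illustration, not settings taken from the paper:

```python
import torchaudio

# Assumed STFT parameters: n_fft=512 (257 frequency bins), hop_length=128.
spec_transform = torchaudio.transforms.Spectrogram(n_fft=512, hop_length=128, power=2.0)
spectrogram = spec_transform(segments[0])  # first segment from the Section 2.1 sketch
# spectrogram.shape == (1, 257, 63) for a 1 x 8000 input: freq bins x frames
```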
2.3. Mel-spectrogram
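Similarly, a log-scaled mel-spectrogram can be computed as below; n_mels=64, the dB scaling, and the STFT settings are assumptions, not the paper's configuration:

```python
import torchaudio

# Assumed parameters: 64 mel bands over the same assumed STFT settings.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16_000, n_fft=512, hop_length=128, n_mels=64)
to_db = torchaudio.transforms.AmplitudeToDB(stype="power")
log_mel = to_db(mel_transform(segments[0]))  # shape: (1, 64, 63) — mel bands x frames
```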