BattleSound: A Game Sound Benchmark for Sound Detection in a Battle Game
Sungho Shin, Seongju Lee, Changhyun Jun, Kyoobin Lee
School of Integrated Technology, Gwangju Institute of Science and Technology (GIST)
A haptic sensor coupled to a gamepad or headset is frequently used to enhance the sense of immersion for game players. However, providing haptic feedback for appropriate sound effects requires specialized audio engineering techniques to identify target sounds, which vary from game to game. We propose a deep learning-based method for sound event detection (SED) to determine the optimal timing of haptic feedback in extremely noisy environments. To accomplish this, we introduce the BattleSound dataset, which contains a large volume of game sound recordings of game effects and other distracting sounds, including voice chat, from the game PlayerUnknown's Battlegrounds (PUBG). Given the highly noisy and distracting nature of war-game environments, we set the annotation interval to 0.5 s, significantly shorter than that of existing SED benchmarks, to increase the likelihood that each annotated label contains sound from a single source. As a baseline, we adopt mobile-sized deep learning models to perform two tasks: weapon sound event detection (WSED) and voice chat activity detection (VCAD). The accuracy of the models trained on BattleSound exceeded 90% for both tasks; thus, BattleSound enables real-time game sound recognition in noisy environments via deep learning. In addition, we demonstrated that performance degraded significantly when the annotation interval exceeded 0.5 s, indicating that BattleSound's short annotation interval is advantageous for SED applications that demand real-time inference.
You can download the dataset at the following link.
The zip file contains three folders, each named after a class; the audio samples for each class are stored inside the corresponding folder.
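A minimal indexing sketch for the unpacked archive, assuming the folders are named after the three classes used below (VOICE, WEAPON, MIXTURE) and that the samples are WAV files; the root path is hypothetical:

```python
from pathlib import Path

# Hypothetical extraction directory; folder names are assumed to match the
# class names (VOICE, WEAPON, MIXTURE) and the samples to be .wav files.
root = Path("BattleSound")
samples = [
    (wav_path, class_dir.name)  # (audio file, class label)
    for class_dir in sorted(root.iterdir()) if class_dir.is_dir()
    for wav_path in sorted(class_dir.glob("*.wav"))
]
```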
Real-time voice chat activity detection (VCAD) is the task of identifying voices in the streamed audio input. Typically, two to four players forming a team can communicate via voice chat while playing PUBG. Because multiple players speak concurrently and loudly, several parts of the recorded voice contain noise and overlapping sounds. In addition, weapon sounds are frequently mingled with voices, making them difficult to distinguish. To recognize the voice in the streamed audio input, we developed a VCAD model using deep learning. The VOICE- and MIXTURE-labeled samples in BattleSound were treated as target voice samples, whereas the WEAPON-labeled samples were used as non-target samples.
Weapon sound event detection (WSED) is the task of detecting weapon sounds, such as gunshots and bomb explosions, from the streamed audio input in real time. To enhance realism, many game devices provide visual or haptic feedback in response to game effects such as shooting or striking. If these game effects can be detected from the sound, the timing of the feedback delivery can be determined automatically. Therefore, we developed a deep learning model capable of detecting weapon sounds from streamed audio. We used the WEAPON- and MIXTURE-labeled samples in BattleSound as target samples and the VOICE-labeled samples as non-target samples.
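The two task setups above amount to binary relabelings of the three dataset classes. A minimal sketch of that mapping, using the class names as written above; the dictionary layout and helper function are our own, not part of the dataset:

```python
# Binary label per task; 1 = target, 0 = non-target, per the descriptions above.
VCAD_LABELS = {"VOICE": 1, "MIXTURE": 1, "WEAPON": 0}  # is a voice present?
WSED_LABELS = {"WEAPON": 1, "MIXTURE": 1, "VOICE": 0}  # is a weapon sound present?

def task_label(class_name: str, task: str) -> int:
    """Map a BattleSound class name to a binary label for 'vcad' or 'wsed'."""
    return (VCAD_LABELS if task == "vcad" else WSED_LABELS)[class_name]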
2.1. Audio Signal
All samples in BattleSound are recorded at a 16 kHz sampling rate. Each training and validation sample is 0.5 s long; in other words, each input signal has a dimension of 1 × 8000.
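A minimal loading sketch, assuming torchaudio and a hypothetical file name; it segments a recording into the 0.5 s, 1 × 8000 inputs described above:

```python
import torchaudio

SAMPLE_RATE = 16_000             # 16 kHz, as specified above
CLIP_SAMPLES = SAMPLE_RATE // 2  # 0.5 s -> 8000 samples

# Hypothetical file name; waveform has shape (channels, num_samples).
waveform, sr = torchaudio.load("weapon_0001.wav")
assert sr == SAMPLE_RATE

# Non-overlapping 0.5 s segments of shape (1, 8000); drop any short remainder.
segments = [s for s in waveform[:1].split(CLIP_SAMPLES, dim=1)
            if s.shape[1] == CLIP_SAMPLES]
```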
2.2. Spectrogram
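A minimal sketch of computing a power spectrogram from one 0.5 s segment; the STFT parameters (n_fft=512, hop_length=128) are assumed values for illustration, not settings taken from the paper:

```python
import torchaudio

# Assumed STFT parameters: n_fft=512 (257 frequency bins), hop_length=128.
spec_transform = torchaudio.transforms.Spectrogram(n_fft=512, hop_length=128, power=2.0)
spectrogram = spec_transform(segments[0])  # first segment from the Section 2.1 sketch
# spectrogram.shape == (1, 257, 63) for a 1 x 8000 input: freq bins x frames
```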
2.3. Mel-spectrogram
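Similarly, a log-scaled mel-spectrogram can be computed as below; n_mels=64, the dB scaling, and the STFT settings are assumptions, not the paper's configuration:

```python
import torchaudio

# Assumed parameters: 64 mel bands over the same assumed STFT settings.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16_000, n_fft=512, hop_length=128, n_mels=64)
to_db = torchaudio.transforms.AmplitudeToDB(stype="power")
log_mel = to_db(mel_transform(segments[0]))  # shape: (1, 64, 63) — mel bands x frames
```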