Red Hen Audio Tagger
Red Hen Audio Tagger (RHAT) by Sabyasachi Ghosal, Austin Bennett, & Mark Turner
The Red Hen Audio Tagger (RHAT) is a novel, publicly available open-source platform developed by Red Hen Lab. RHAT employs deep learning models to tag audio elements frame by frame, generating metadata tags that can be utilized in various data formats for analysis. RHAT seamlessly integrates with widely used linguistic research tools like ELAN: the researcher can use RHAT to tag audio content automatically and display those tags alongside other ELAN annotation tiers. RHAT additionally complements existing Red Hen pipelines devoted to natural language processing, speech-to-text processing, body pose analysis, optical character recognition, named entity recognition, computer vision, semantic frame recognition, and so on. These cooperating Red Hen pipelines are research tools to advance the science of multimodal communication.
RHAT tags streams of audio data via a deep learning model. Since a single stream of data can contain multiple sound effects at once, audio tagging is a multi-label classification problem. RHAT is a pipeline; it automatically pre-processes each audiovisual file in the input list, runs the model on it, and generates a file of tags suitable for ingestion into an annotation application, with timestamps and confidence ratings for each tag. The pipeline can be modified by swapping out the model used. Which model is deemed best will depend upon the nature of the research project, so RHAT treats the model as a modular plug-in component. RHAT's tags are generated frame by frame, using existing pre-trained deep learning models (such as YamNet). The tags are stored in two kinds of files that differ not in data but in metadata format:
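The multi-label, frame-by-frame tagging described above can be sketched in a few lines. The class names and score matrix below are illustrative stand-ins (a real YamNet run produces 521 AudioSet classes per frame); the thresholding logic is the point: each frame may yield several tags, each with its own confidence.

```python
import numpy as np

# Hypothetical frame-by-frame score matrix: rows = audio frames,
# columns = sound classes (a real model such as YamNet scores 521
# AudioSet classes). Values are per-class confidences in [0, 1].
CLASS_NAMES = ["Speech", "Music", "Dog", "Siren"]  # illustrative subset
scores = np.array([
    [0.91, 0.12, 0.02, 0.01],   # frame 0: speech only
    [0.85, 0.64, 0.03, 0.02],   # frame 1: speech and music overlap
    [0.10, 0.72, 0.05, 0.01],   # frame 2: music only
])

def tag_frames(scores, class_names, threshold=0.5):
    """Return, for each frame, every (tag, confidence) pair above threshold.

    Multi-label: a single frame may produce several tags at once,
    which is why audio tagging is not a one-label-per-frame problem.
    """
    tags = []
    for frame_scores in scores:
        frame_tags = [(class_names[i], float(s))
                      for i, s in enumerate(frame_scores) if s >= threshold]
        tags.append(frame_tags)
    return tags

for i, frame_tags in enumerate(tag_frames(scores, CLASS_NAMES)):
    print(i, frame_tags)
```

Frame 1 receives two tags (Speech and Music), illustrating why each tag carries its own confidence score in the output files described below.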
SFX Files: Red Hen created the SFX format to comply with existing Red Hen data formats. An SFX file begins with a TOP block, consisting of some rows of UTF-8 text presenting information such as filename, source, and so on. In the Red Hen structure, a single data asset (such as an audiovisual file and all its metadata) may consist of many separate files of data and metadata; all share the same filename but carry different filename extensions. The filename is unique and informative to a human reader at a glance. The data and metadata in all the various files related to a data asset can be collocated according to their timestamps. A data file might have extension .mp4 or .wav or any other common extension that indicates a data format. A metadata file for that data might have extension .txt (for closed captions), .tpt (for transcript), .ocr (for optical character recognition of on-screen text), .seg (for various kinds of metadata, including frame tagging, natural language processing tags, sentiment ratings, etc.), and so on. If a data asset is already in the Red Hen dataset, RHAT copies the TOP block of that data asset's .seg file into the beginning of the corresponding SFX file. After this TOP block, the SFX file lists audio tags in JSON format, with time intervals keyed to frame-by-frame analysis, along with a confidence score for each tag. The volume and specificity of the metadata in an SFX file can easily overwhelm the user: such a file can consist of hundreds or thousands of lines. Accordingly, the user might prefer to use a parser to render the metadata in a user-friendly format. Many such parsers are available; Red Hen recommends jq queries, along with data filters to arrange and filter the metadata.
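To make the SFX layout concrete, here is a minimal parsing sketch in Python, doing the kind of confidence filtering one might otherwise do with a jq query. The header rows and the JSON field names (start, end, tag, confidence) are assumptions for illustration, not the official SFX schema; only the overall shape (a plain-text TOP block followed by one JSON record per tagged interval) comes from the description above.

```python
import json

# Illustrative SFX-style content: plain-text TOP-block rows, then one
# JSON object per tagged interval. Field names are hypothetical.
sfx_text = """\
TOP|20220101120000|2022-01-01_1200_US_Example_News
COL|Communication Studies Archive, UCLA
{"start": 0.00, "end": 0.96, "tag": "Speech", "confidence": 0.93}
{"start": 0.48, "end": 1.44, "tag": "Music", "confidence": 0.61}
{"start": 0.96, "end": 1.92, "tag": "Speech", "confidence": 0.41}
"""

def parse_sfx(text, min_confidence=0.5):
    """Yield tag records above a confidence cutoff, skipping header rows."""
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # TOP-block metadata row, not a JSON tag line
        record = json.loads(line)
        if record["confidence"] >= min_confidence:
            yield record

for rec in parse_sfx(sfx_text):
    print(rec["tag"], rec["start"], rec["end"], rec["confidence"])
```

With the 0.5 cutoff, the low-confidence third record is filtered out, which is the sort of reduction that keeps a thousand-line SFX file readable.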
CSV Files: CSV files are standard across research and can be configured to suit user purposes in a variety of ways. A CSV file generated by the current version of RHAT provides metadata for all the audio tags. Confidence scores are included for each recognition. The audio tags and confidence scores produced by RHAT can be imported to ELAN, an annotation tool developed by The Language Archive, a branch of the Max Planck Institute for Psycholinguistics in Nijmegen. ELAN allows users to annotate and transcribe audio or video recordings, manually or semi-automatically. ELAN is very widely used by linguists, gesture scholars, and communication scientists. Red Hen offers a system for importing RHAT CSV files directly into ELAN using ELAN’s native import function. Such an import quickly and easily creates Red Hen Tagger ELAN tiers alongside any ELAN tiers previously or subsequently created by the researcher. The ELAN user follows usual ELAN practices to import an audio or audiovisual file, import an ELAN template in EAF format, and annotate in standard ELAN fashion. At any point during work, the ELAN user can deploy RHAT to create a CSV file of audio annotations and incorporate them directly into the ELAN tagging project.
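A short sketch of generating such a CSV follows. The tier name and column layout here (Tier, Begin Time, End Time, Annotation, Confidence) are assumptions chosen to resemble what ELAN's CSV import expects, not the exact columns RHAT emits.

```python
import csv
import io

# Hypothetical tag records: (tier, begin_seconds, end_seconds, tag, confidence).
rows = [
    ("RHAT_Audio", 0.00, 0.96, "Speech", 0.93),
    ("RHAT_Audio", 0.48, 1.44, "Music", 0.61),
]

def write_rhat_csv(rows, fileobj):
    """Write tag records as a CSV with a header row.

    Column names are illustrative; ELAN's import dialog lets the user
    map columns to tiers and time slots on import.
    """
    writer = csv.writer(fileobj)
    writer.writerow(["Tier", "Begin Time", "End Time", "Annotation", "Confidence"])
    for tier, begin, end, tag, conf in rows:
        writer.writerow([tier, f"{begin:.2f}", f"{end:.2f}", tag, f"{conf:.2f}"])

buf = io.StringIO()
write_rhat_csv(rows, buf)
print(buf.getvalue())
```

Writing one row per tagged interval, with begin and end times in seconds, keeps the file directly usable as an ELAN tier after import.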
Source Code
RHAT was originally created as a project in Red Hen Lab Google Summer of Code 2022. The current version can be found here.
Published Paper
RHAT is described in a paper published in Linguistics Vanguard, where you can read more (https://doi.org/10.1515/lingvan-2022-0130).
Open Issues/Future Work
Add input and output validation to the scripts, and remove hardcoded values.
Evaluate other pretrained models and compare their performance with YamNet's.
Generate tags for Red Hen's videos.