ECHO (Ethical Communication for Human Outreach) Dataset
Despite the growing urgency of combating online hate speech, no comprehensive multilingual speech benchmark yet spans both low-resource and high-resource languages. This gap leaves the speech technology community without a shared, standardized foundation for developing safe, fair, and inclusive speech moderation systems. The Multilingual Hate Speech Detection (MHSD) dataset addresses it by introducing the first cross-lingual benchmark for hate speech detection directly from spoken audio.

MHSD combines raw speech recordings, aligned automatic speech recognition (ASR) transcripts, and carefully validated annotations across ten languages: Tamil, Bengali, Hindi, Malayalam, Telugu, French, Spanish, English, Chinese, and Japanese. Together these languages represent over 3 billion speakers worldwide. The dataset contains approximately 15,000 labeled audio clips, each between 5 and 15 seconds long. Every clip is segmented, aligned with its transcript, and annotated as hate or non-hate, following language-specific guidelines intended to ensure cultural relevance and annotation consistency.

The dataset is organized by language: each language has its own folder containing the audio files and their corresponding transcripts. This layout supports both multimodal (speech + text) and single-modality research, enabling work on fair and robust hate speech detection across diverse linguistic contexts.
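The language-wise folder layout described above could be traversed roughly as follows. This is a minimal sketch, not the dataset's official loader. The `.wav` and `.txt` extensions, the convention that an audio file and its transcript share a file stem, and the function name `pair_clips` are all assumptions made for illustration, since the description does not specify file naming or where the hate/non-hate labels are stored.

```python
from pathlib import Path
from typing import Dict, List, Tuple


def pair_clips(
    root: Path,
    audio_ext: str = ".wav",  # assumed audio extension, not stated in the description
    text_ext: str = ".txt",   # assumed transcript extension, not stated in the description
) -> Dict[str, List[Tuple[Path, Path]]]:
    """Map each language folder under `root` to its (audio, transcript) pairs.

    Assumes a transcript has the same file stem as its audio clip.
    Audio files without a matching transcript are skipped.
    """
    pairs: Dict[str, List[Tuple[Path, Path]]] = {}
    # Each immediate subdirectory of the root is treated as one language.
    for lang_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        matched: List[Tuple[Path, Path]] = []
        for audio in sorted(lang_dir.glob(f"*{audio_ext}")):
            transcript = audio.with_suffix(text_ext)
            if transcript.exists():
                matched.append((audio, transcript))
        pairs[lang_dir.name] = matched
    return pairs
```

For speech-only experiments the transcript path in each pair can simply be ignored, and for text-only experiments the audio path can be; multimodal work would read both.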