Evaluation and Ranking Criteria
Model evaluation will be performance-based, using three metrics:
1. Macro-F1 Score: Selected as the principal evaluation metric because it accounts for the class imbalance between the hate and non-hate categories.
2. Inter-Language Fairness Variance (IFV): IFV quantifies how consistently a speech classification model performs across different languages. It addresses the ethical concern that models trained predominantly on high-resource languages may underperform on underrepresented ones, potentially amplifying linguistic bias. To compute IFV, the Macro-F1 score is first calculated separately for each language subset in the evaluation dataset. Let Mᵢ denote the Macro-F1 score for language Lᵢ, and let n be the total number of languages evaluated.
Once all per-language scores are obtained, IFV is computed as the variance of these values, reflecting the spread in performance across languages. A lower IFV indicates fairer, more consistent performance, while a higher IFV suggests uneven or biased behavior. The formula for IFV is given as

IFV = (1/n) Σᵢ (Mᵢ − M̄)²,  i = 1, …, n,

where M̄ = (1/n) Σᵢ Mᵢ is the mean of the per-language Macro-F1 scores (a short computation sketch is given after this list).
3. Acoustic–Lexical Overlap Error: In multilingual hate-speech detection, systems may perform well in one modality (e.g., text) while failing in another (e.g., speech). For example, a model may detect explicit slurs in transcripts but miss hostile tone or sarcasm in audio. Conversely, a speech model may capture prosodic aggression but fail on languages with transcription errors.
To evaluate whether a model’s predictions are stable, consistent, and robust across modalities, we introduce the Acoustic–Lexical Overlap Error (ALOE). This metric measures how often predictions from the audio, text, and multimodal models disagree with the ground truth and with each other. ALOE is designed specifically for ECHO because the challenge includes three parallel tasks: Task 1 (audio-only prediction, Aᵢ), Task 2 (text-only prediction, Tᵢ), and Task 3 (multimodal prediction, Mᵢ). This structure enables a tri-view consistency evaluation: ALOE measures cross-modality prediction stability in multilingual hate speech detection by comparing the outputs of the three challenge tasks against the ground-truth label yᵢ, where all values are binary (0 or 1). For each test sample, an overlap error is calculated from the disagreements between the three predictions and the ground-truth label, and ALOE is reported as the average of these per-sample errors over the test set (one possible formulation is sketched after this list).
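As a concrete illustration of the IFV computation described in item 2, the sketch below calculates per-language Macro-F1 with scikit-learn and takes the variance of those scores. This is a minimal sketch: the function name and the input layout (flat arrays of labels, predictions, and language tags) are illustrative assumptions, not part of any official evaluation script.

```python
import numpy as np
from sklearn.metrics import f1_score

def inter_language_fairness_variance(y_true, y_pred, languages):
    """Compute IFV as the variance of per-language Macro-F1 scores.

    y_true, y_pred : binary labels and predictions (0 = non-hate, 1 = hate)
    languages      : language identifier for each sample
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    languages = np.asarray(languages)

    # Macro-F1 computed separately on each language's subset (the Mᵢ values).
    per_language_f1 = [
        f1_score(y_true[languages == lang], y_pred[languages == lang], average="macro")
        for lang in np.unique(languages)
    ]

    # IFV is the variance of the per-language scores; lower values indicate
    # fairer, more consistent cross-lingual performance.
    return float(np.var(per_language_f1))
```

For example, with two languages scoring 0.80 and 0.70 Macro-F1, the mean is 0.75 and IFV = ((0.80 − 0.75)² + (0.70 − 0.75)²) / 2 = 0.0025.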
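The sketch below gives one possible instantiation of the per-sample overlap error described in item 3, assuming that each sample's error averages (a) the disagreement of Aᵢ, Tᵢ, and Mᵢ with the ground truth yᵢ and (b) their pairwise disagreement with each other, with ALOE taken as the mean per-sample error over the test set. The function name, the equal weighting of the two components, and the input layout are illustrative assumptions rather than the official formulation.

```python
import numpy as np

def acoustic_lexical_overlap_error(audio_pred, text_pred, multi_pred, y_true):
    """One possible instantiation of ALOE (illustrative, not the official formula).

    audio_pred, text_pred, multi_pred : binary predictions from Tasks 1-3 (Aᵢ, Tᵢ, Mᵢ)
    y_true                            : binary ground-truth labels (yᵢ)
    """
    A = np.asarray(audio_pred)
    T = np.asarray(text_pred)
    M = np.asarray(multi_pred)
    y = np.asarray(y_true)

    # Disagreement of each modality with the ground truth (0 or 1 per sample,
    # since all values are binary).
    gt_disagreement = (np.abs(A - y) + np.abs(T - y) + np.abs(M - y)) / 3.0

    # Pairwise disagreement of the three modality predictions with each other.
    cross_disagreement = (np.abs(A - T) + np.abs(A - M) + np.abs(T - M)) / 3.0

    # Per-sample overlap error averages the two components; ALOE is the mean
    # over all test samples (lower is better).
    per_sample_error = (gt_disagreement + cross_disagreement) / 2.0
    return float(per_sample_error.mean())
```

Under this reading, a lower ALOE means the three task submissions agree with the reference labels and with one another more often, i.e., the model behaves consistently across modalities.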