Evaluation and Ranking Criteria
Model evaluation will be performance-based, using three metrics:
1. Macro-F1 Score: Selected as the principal evaluation metric because it accounts for the class imbalance between the hate and non-hate categories.
2. Inter-Language Fairness Variance (IFV): IFV quantifies how consistently a speech classification model performs across different languages. It addresses the ethical concern that models trained predominantly on high-resource languages may underperform on underrepresented ones, potentially amplifying linguistic bias. To compute IFV, the Macro-F1 score is first calculated separately for each language subset in the evaluation dataset. Let Mᵢ denote the Macro-F1 score for language Lᵢ, and let n be the total number of languages evaluated.
Once all per-language scores are obtained, IFV is computed as the variance of these values, reflecting the spread in performance across languages. A lower IFV indicates fairer, more consistent performance, while a higher IFV suggests uneven or biased behavior. The formula for IFV is given as

IFV = (1/n) Σᵢ (Mᵢ − M̄)²,  i = 1, …, n,

where M̄ = (1/n) Σᵢ Mᵢ is the mean of the per-language Macro-F1 scores (a short computation sketch is given after this list).
3. Acoustic–Lexical Overlap Error: In multilingual hate-speech detection, systems may perform well in one modality (e.g., text) while failing in another (e.g., speech). For example, a model may detect explicit slurs in transcripts but miss hostile tone or sarcasm in audio. Conversely, a speech model may capture prosodic aggression but fail on languages with transcription errors.
To evaluate whether a model’s predictions are stable, consistent, and robust across modalities, we introduce the Acoustic–Lexical Overlap Error (ALOE). This metric measures how often predictions from the audio, text, and multimodal models disagree with the ground truth and with each other. ALOE is designed specifically for ECHO because the challenge includes three parallel tasks: Task 1 (audio-only prediction, Aᵢ), Task 2 (text-only prediction, Tᵢ), and Task 3 (multimodal prediction, Mᵢ). This structure enables a tri-view consistency evaluation: ALOE measures cross-modality prediction stability in multilingual hate speech detection by comparing the outputs of the three challenge tasks against the ground-truth label yᵢ, where all values are binary (0 or 1). For each test sample, an overlap error is calculated from the disagreements between the three predictions and the ground-truth label, and ALOE is reported as the average of these per-sample errors over the test set (one possible formulation is sketched after this list).
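As a concrete illustration of the IFV computation described in item 2, the sketch below calculates per-language Macro-F1 with scikit-learn and takes the variance of those scores. This is a minimal sketch: the function name and the input layout (flat arrays of labels, predictions, and language tags) are illustrative assumptions, not part of any official evaluation script.

```python
import numpy as np
from sklearn.metrics import f1_score

def inter_language_fairness_variance(y_true, y_pred, languages):
    """Compute IFV as the variance of per-language Macro-F1 scores.

    y_true, y_pred : binary labels and predictions (0 = non-hate, 1 = hate)
    languages      : language identifier for each sample
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    languages = np.asarray(languages)

    # Macro-F1 computed separately on each language's subset (the Mᵢ values).
    per_language_f1 = [
        f1_score(y_true[languages == lang], y_pred[languages == lang], average="macro")
        for lang in np.unique(languages)
    ]

    # IFV is the variance of the per-language scores; lower values indicate
    # fairer, more consistent cross-lingual performance.
    return float(np.var(per_language_f1))
```

For example, with two languages scoring 0.80 and 0.70 Macro-F1, the mean is 0.75 and IFV = ((0.80 − 0.75)² + (0.70 − 0.75)²) / 2 = 0.0025.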
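The sketch below gives one possible instantiation of the per-sample overlap error described in item 3, assuming that each sample's error averages (a) the disagreement of Aᵢ, Tᵢ, and Mᵢ with the ground truth yᵢ and (b) their pairwise disagreement with each other, with ALOE taken as the mean per-sample error over the test set. The function name, the equal weighting of the two components, and the input layout are illustrative assumptions rather than the official formulation.

```python
import numpy as np

def acoustic_lexical_overlap_error(audio_pred, text_pred, multi_pred, y_true):
    """One possible instantiation of ALOE (illustrative, not the official formula).

    audio_pred, text_pred, multi_pred : binary predictions from Tasks 1-3 (Aᵢ, Tᵢ, Mᵢ)
    y_true                            : binary ground-truth labels (yᵢ)
    """
    A = np.asarray(audio_pred)
    T = np.asarray(text_pred)
    M = np.asarray(multi_pred)
    y = np.asarray(y_true)

    # Disagreement of each modality with the ground truth (0 or 1 per sample,
    # since all values are binary).
    gt_disagreement = (np.abs(A - y) + np.abs(T - y) + np.abs(M - y)) / 3.0

    # Pairwise disagreement of the three modality predictions with each other.
    cross_disagreement = (np.abs(A - T) + np.abs(A - M) + np.abs(T - M)) / 3.0

    # Per-sample overlap error averages the two components; ALOE is the mean
    # over all test samples (lower is better).
    per_sample_error = (gt_disagreement + cross_disagreement) / 2.0
    return float(per_sample_error.mean())
```

Under this reading, a lower ALOE means the three task submissions agree with the reference labels and with one another more often, i.e., the model behaves consistently across modalities.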