B-RIGHT: Benchmark Re-evaluation for Integrity in Generalized Human-Object Interaction Testing [ArXiv'25] [paper]
Who Should Read This Paper
Researchers and practitioners focusing on human-object interaction (HOI) detection or vision-language tasks
Anyone concerned with imbalanced datasets or skewed evaluation metrics in computer vision
Those developing benchmarking protocols for fair model comparisons across different classes or tasks
What the Paper Covers
Balanced HOI Dataset: Introduces a new dataset, B-RIGHT, with a uniform number of examples per class (50 train, 10 test) for each of its 351 HOI classes (see the sampling sketch after this list)
Revisiting HICO-DET Imbalance: Shows how long-tail distributions in HICO-DET can inflate or deflate model performance metrics, leading to misleading comparisons
Automated Generation & Filtering: Leverages text-to-image diffusion, vision-language models, and LLMs to augment data and filter out low-quality samples (see the pipeline sketch after this list)
Balanced Zero-Shot Evaluation: Proposes a zero-shot HOI test set that systematically measures generalization to unseen interactions while avoiding skewed results (sketched below)
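As a rough illustration of the balanced construction described above, the sketch below draws a fixed number of training and test examples for every HOI class. The record fields (`image_id`, `hoi_class`) and the handling of under-represented classes are assumptions made for illustration, not the paper's exact procedure.

```python
# Minimal sketch of per-class balanced splitting (assumed annotation format).
import random
from collections import defaultdict

TRAIN_PER_CLASS = 50   # per-class counts reported for B-RIGHT
TEST_PER_CLASS = 10

def balanced_split(records, seed=0):
    """Group records by HOI class and draw a fixed number for train and test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec in records:
        by_class[rec["hoi_class"]].append(rec)

    train, test = [], []
    for hoi_class, recs in by_class.items():
        rng.shuffle(recs)
        needed = TRAIN_PER_CLASS + TEST_PER_CLASS
        if len(recs) < needed:
            # B-RIGHT closes such gaps with generated images; here we only flag them.
            print(f"class {hoi_class}: only {len(recs)} of {needed} examples available")
        train.extend(recs[:TRAIN_PER_CLASS])
        test.extend(recs[TRAIN_PER_CLASS:needed])
    return train, test
```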
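The generation-and-filtering step can be pictured as the loop below. `generate_image`, `vlm_contains_interaction`, and `llm_make_prompt` are hypothetical stand-ins for the paper's text-to-image diffusion model, vision-language verifier, and prompt-writing LLM; the retry budget and prompt template are likewise illustrative.

```python
def llm_make_prompt(verb, obj):
    # Placeholder for an LLM that writes a scene description for the HOI pair.
    return f"A photo of a person {verb} a {obj}, realistic, full scene."

def fill_class(verb, obj, target_count, generate_image, vlm_contains_interaction):
    """Generate candidate images for one HOI class and keep only those that a
    vision-language model verifies as actually showing the interaction."""
    kept, attempts = [], 0
    while len(kept) < target_count and attempts < 10 * target_count:
        attempts += 1
        prompt = llm_make_prompt(verb, obj)
        image = generate_image(prompt)                   # text-to-image diffusion call
        if vlm_contains_interaction(image, verb, obj):   # quality / consistency filter
            kept.append(image)
    return kept
```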
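The balanced zero-shot evaluation can be sketched as holding out whole verb-object pairs from training and capping each held-out pair at the same number of test images. The hold-out mechanism below is an assumption, not necessarily how B-RIGHT selects its unseen pairs.

```python
# Illustrative sketch of a balanced zero-shot split over verb-object pairs.
import random
from collections import defaultdict

def zero_shot_split(records, holdout_pairs, per_pair=10, seed=0):
    """Remove held-out verb-object pairs from training and build a test set
    with the same number of examples for every unseen pair."""
    rng = random.Random(seed)
    train, by_pair = [], defaultdict(list)
    for rec in records:
        pair = (rec["verb"], rec["object"])
        if pair in holdout_pairs:
            by_pair[pair].append(rec)
        else:
            train.append(rec)

    zs_test = []
    for pair, recs in by_pair.items():
        rng.shuffle(recs)
        zs_test.extend(recs[:per_pair])   # cap each unseen pair at per_pair images
    return train, zs_test
```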
Real-World Applications
Fair Model Validation: Facilitates rigorous testing of HOI detectors in various real-world scenarios (e.g., robotics, video analytics), ensuring that no single class dominates results
Benchmarking & Leaderboards: Offers a less biased dataset for evaluating new HOI detection methods, leading to more reliable performance rankings
Data Generation Pipelines: Highlights how automatic prompt-based generation plus multi-step filtering can address dataset scarcity or fill in coverage gaps without excessive manual labeling
Key Strengths
Uniform Class Representation: Ensures an equal number of instances per HOI class, mitigating long-tail issues and reducing score variance
Robust Ranking & Analysis: Exposes significant ranking shifts among state-of-the-art detectors once class imbalance is removed (see the rank-correlation sketch after this list)
Scalable, Automated Process: Employs a generation-and-filtering pipeline (vision-language models + LLMs) that streamlines dataset balancing
Comprehensive Zero-Shot Benchmark: Evaluates unseen verb–object pairs under balanced conditions, revealing more about true model generalization
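One way to quantify the ranking shifts noted above is a rank correlation between leaderboard orders on HICO-DET and on B-RIGHT. The detector names and mAP values below are invented placeholders, and Kendall's tau is just one reasonable choice of statistic.

```python
from scipy.stats import kendalltau

# Invented mAP scores for three detectors under the two benchmarks.
hico_det_map = {"model_A": 34.1, "model_B": 32.8, "model_C": 31.5}
b_right_map  = {"model_A": 28.0, "model_B": 30.2, "model_C": 29.1}

models = sorted(hico_det_map)
tau, _ = kendalltau([hico_det_map[m] for m in models],
                    [b_right_map[m] for m in models])
print(f"Kendall's tau between the two rankings: {tau:.2f}")
# A tau well below 1.0 indicates that the leaderboard order changes once
# class imbalance is removed.
```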