Training-Free Region-Level Semantic Understanding via Interpreting Internal Representations of Vision-Language Models