Improving Concept Alignment in Vision-Language Concept Bottleneck Models

Nithish Muthuchamy Selvaraj*, Xiaobao Guo*, Bingquan Shen†,

Adams Wai-Kin Kong*, Alex Kot*

*Nanyang Technological University, †DSO National Laboratories

Abstract

Concept Bottleneck Models (CBM) map the input image to a high-level, human-understandable concept space and then make class predictions based on these concepts. Recent approaches automate the construction of CBMs by prompting Large Language Models (LLM) to generate text concepts and then using Vision Language Models (VLM) to obtain concept scores to train a CBM. However, it is desirable to build CBMs with concepts defined by human experts rather than LLM-generated concepts, to make them more trustworthy. In this work, we take a closer look at the faithfulness of VLM concept scores for such expert-defined concepts in domains like fine-grained bird species classification and animal classification. Our investigations reveal that frozen VLMs, like CLIP, struggle to correctly associate a concept with the corresponding visual input despite achieving high classification performance. To address this, we propose a novel Contrastive Semi-Supervised (CSS) learning method that uses a few labeled concept examples to improve concept alignment (activate truthful visual concepts) in the CLIP model. Extensive experiments on three benchmark datasets show that our approach substantially increases both concept accuracy and classification accuracy, yet requires only a fraction of the human-annotated concept labels. To further improve classification performance, we also introduce a new class-level intervention procedure for fine-grained classification problems that identifies the confounding classes and intervenes on their concept space to reduce errors.

Problem Statement

Recent approaches in Explainable AI automate the construction of Concept Bottleneck Models (CBM) with LLMs and VLMs. They first prompt an LLM to generate a set of interpretable natural-language concepts for the given classification task and then use a VLM to generate concept scores (by leveraging its image-text alignment scores) for all samples. Finally, an interpretable classifier is trained on these concept scores.
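A minimal sketch of this pipeline, assuming OpenAI's `clip` package; the concept list, `num_classes`, and data wiring are illustrative placeholders, not the paper's exact setup:

```python
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder expert-defined concepts; real CBMs use hundreds (e.g., 312 attributes for CUB).
concepts = ["red wing", "hooked beak", "striped tail"]
text_tokens = clip.tokenize(concepts).to(device)

@torch.no_grad()
def concept_scores(images):
    """Image-text alignment scores between a batch of preprocessed images and the concepts."""
    img = model.encode_image(images)      # (B, D) image embeddings
    txt = model.encode_text(text_tokens)  # (K, D) concept embeddings
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return img @ txt.T                    # (B, K) cosine similarities = concept scores

# An interpretable linear classifier is then trained on these concept scores.
num_classes = 200  # e.g., CUB-200
classifier = torch.nn.Linear(len(concepts), num_classes).to(device)
```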

In this work, we investigate whether the concept scores of VLMs like CLIP faithfully represent the visual truth. Experiments on three benchmark datasets (CUB, RIVAL, and AwA2) expose two problems.

1. Low Concept Accuracy - The CLIP model has low concept accuracy despite achieving high classification performance, i.e., the concept scores do not faithfully represent the visual input, which makes the resulting model less trustworthy (see the measurement sketch after this list).

2. Incorrect Concept Association - For challenging classification problems like CUB, the CLIP model is sensitive and biased towards the primary color of the bird and associates this dominant color with all other body parts. It struggles to correctly attribute fine-grained concepts to the visual input.
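One way to quantify the first problem is to binarize the concept scores and compare them with ground-truth concept annotations. A hypothetical sketch of such a metric (the paper's exact definition may differ):

```python
import torch

def concept_accuracy(scores, labels, threshold=0.5):
    """Fraction of concept predictions matching binary ground-truth annotations.

    scores: (N, K) concept scores, assumed normalized to [0, 1]
    labels: (N, K) binary ground-truth concept labels
    """
    preds = (scores >= threshold).float()
    return (preds == labels.float()).float().mean().item()
```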

Evaluation of CLIP concept scores.

Improving Concept Alignment with Contrastive Semi-Supervised (CSS) Learning

- Obtaining supervisory concept labels for all training samples to improve concept alignment is cumbersome. Hence, we propose a novel Contrastive Semi-Supervised (CSS) learning approach that improves the concept alignment in Vision-Language Concept Bottleneck Models (VL-CBM) with only a few labeled concept examples.

- Our approach encourages consistent concept activations within the same class whilst discriminating (contrasting) them from those of other classes. It then uses a few labeled concept examples per class (semi-supervision) to align the activations with the ground truth; a sketch of this objective follows the list.

- Our CSS method substantially increases concept accuracy (+39.1% for CUB, +18.63% for RIVAL, +39.11% for AwA2) and enhances overall classification accuracy (+5.61% for CUB, +2.84% for RIVAL, +3.06% for AwA2) with only a handful of human-annotated concept labels per class.
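A minimal sketch of such an objective, assuming batched concept scores and a boolean mask marking the few annotated examples per class; the function signature and loss weighting are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def css_loss(scores, class_labels, labeled_mask, concept_labels,
             temperature=0.1, lam=1.0):
    """Hypothetical CSS objective: class-contrastive term + sparse concept supervision.

    scores:         (N, K) predicted concept activations (treated as logits here)
    class_labels:   (N,) integer class ids
    labeled_mask:   (N,) bool, True where human concept annotations exist
    concept_labels: (N, K) binary concept annotations (valid under the mask)
    """
    n = scores.size(0)
    z = F.normalize(scores, dim=-1)
    sim = (z @ z.T) / temperature  # pairwise similarity of concept activations
    eye = torch.eye(n, dtype=torch.bool, device=scores.device)
    log_prob = F.log_softmax(sim.masked_fill(eye, -1e9), dim=-1)

    # Pull together concept activations of the same class, push apart other classes.
    pos = (class_labels[:, None] == class_labels[None, :]) & ~eye
    contrastive = -(log_prob * pos.float()).sum(-1) / pos.sum(-1).clamp(min=1)

    # Semi-supervision: align the few labeled examples with the ground truth
    # (assumes every batch contains at least one labeled example).
    supervised = F.binary_cross_entropy_with_logits(
        scores[labeled_mask], concept_labels[labeled_mask].float())

    return contrastive.mean() + lam * supervised
```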

Visualization of the top-8 concept scores for the CUB (birds) and AwA2 (animals) datasets. Incorrectly activated concepts are highlighted in red.

Grad-CAM visualization of learned concepts.

Class-level Intervention

- Intervention in CBMs is typically performed at the instance level, where a handful of error images are arbitrarily chosen for debugging. However, choosing the error images is non-trivial for fine-grained classification problems, which have few samples per class and visually similar classes.

- We propose a class-level intervention approach for fine-grained classification problems, where we first compute an error matrix to identify the “confounding classes” and then intervene on the error images of these classes to further increase classification performance (see the sketch below).
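A sketch of the confounding-class identification step, assuming integer label arrays; the subsequent intervention on the selected images' concept space is not shown:

```python
import numpy as np

def confounding_pairs(y_true, y_pred, num_classes, top_k=5):
    """Build a class-level error matrix and rank the most confused class pairs.

    E[i, j] counts samples of true class i misclassified as class j; the largest
    off-diagonal entries point to the "confounding classes" whose error images
    are then selected for concept-level intervention.
    """
    E = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        if t != p:
            E[t, p] += 1
    pairs = [(E[i, j], i, j)
             for i in range(num_classes) for j in range(num_classes) if i != j]
    return sorted(pairs, reverse=True)[:top_k]
```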

Confounding classes - birds that are visually similar yet belong to different sub-species.

Class-level intervention of CSS VL-CBM for the confounding classes - California Gull, Western Gull, Common Tern, and Arctic Tern.