Weakly Supervised Object Localization

Evaluating Weakly Supervised Object Localization Methods Right

People (*: equal contribution)

Junsuk Choe1,3*, Seong Joon Oh2*, Seungho Lee1, Sanghyuk Chun3, Zeynep Akata4, Hyunjung Shim1

1 School of Integrated Technology, Yonsei University 2 Clova AI Research, LINE Plus Corp. 3 Clova AI Research, NAVER Corp. 4 University of Tübingen

Abstract

Weakly-supervised object localization (WSOL) has gained popularity in recent years for its promise to train localization models with only image-level labels. Since the seminal WSOL work on class activation mapping (CAM), the field has focused on how to expand the attention regions to cover objects more broadly and localize them better. However, these strategies rely on full localization supervision for validating hyperparameters and for model selection, which is in principle prohibited under the WSOL setup. In this paper, we argue that the WSOL task is ill-posed with only image-level labels, and propose a new evaluation protocol where full supervision is limited to a small held-out set that does not overlap with the test set. We observe that, under our protocol, the five most recent WSOL methods have not made a major improvement over the CAM baseline. Moreover, we report that existing WSOL methods have not reached the few-shot learning baseline, in which the full supervision available at validation time is used for model training instead. Based on our findings, we discuss some future directions for WSOL.

Introduction

Overview of WSOL performances 2016-2019. The left plot shows that recent improvements in WSOL are illusory, owing to (1) differing amounts of implicit full supervision through validation and (2) a fixed score-map threshold used to generate object boxes. Under our evaluation protocol, with the same validation set size and an oracle threshold for each method, CAM is still the best. In fact, our few-shot learning baseline, i.e., using the validation supervision (10 samples/class) at training time, outperforms existing WSOL methods.

Problem definition

WSOL as MIL. WSOL can be interpreted as a patch classification task trained with multiple-instance learning (MIL). The score map s(X) is thresholded at τ to estimate the true mask T.
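As a minimal illustration of this thresholding step (a sketch, assuming a score map already normalized to [0, 1]; the array size and threshold value are illustrative):

import numpy as np

def estimate_mask(score_map, tau):
    """Binarize the score map s(X) at threshold tau to estimate the mask T."""
    # score_map: (H, W) array of patch/pixel scores, assumed to lie in [0, 1].
    return score_map >= tau

score_map = np.random.rand(14, 14)  # stand-in for a real CAM output
mask = estimate_mask(score_map, tau=0.5)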


Ill-posedness of Weakly Supervised Object Localization

Ill-posed WSOL: an example. Scoring according to the true posterior s(M) = p(Y|M) may not lead to the correct prediction of the true patch label T if background cues are more strongly associated with the image-level label than the foreground cues (e.g., p(duck|water) > p(duck|feet)).

Evaluation Protocol for WSOL


Masks: PxAP (Eq. 1&2). We define the pixel precision and recall at threshold τ as PxPrec(τ) := |T^τ ∩ T| / |T^τ| and PxRec(τ) := |T^τ ∩ T| / |T|, where T^τ := {(i, j) | s(X)_ij ≥ τ} is the predicted mask and T is the ground-truth mask. For threshold independence, we define and use the pixel average precision, PxAP := Σ_l PxPrec(τ_l)(PxRec(τ_l) − PxRec(τ_{l−1})), the area under the pixel precision-recall curve.
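The following is a minimal sketch of PxAP under stated assumptions: a uniform threshold grid and single-image computation (the benchmark aggregates pixel counts over the whole evaluation split; num_thresholds is an illustrative choice):

import numpy as np

def pixel_average_precision(score_map, gt_mask, num_thresholds=100):
    """Sketch of PxAP: area under the pixel precision-recall curve."""
    # Sweep thresholds from high to low so that recall is non-decreasing.
    thresholds = np.linspace(1.0, 0.0, num_thresholds)
    precisions, recalls = [], []
    num_gt = max(int(gt_mask.sum()), 1)
    for tau in thresholds:
        pred = score_map >= tau                   # predicted mask T^tau
        tp = np.logical_and(pred, gt_mask).sum()  # |T^tau ∩ T|
        precisions.append(tp / max(int(pred.sum()), 1))
        recalls.append(tp / num_gt)
    precisions, recalls = np.array(precisions), np.array(recalls)
    # PxAP := sum_l PxPrec(tau_l) * (PxRec(tau_l) - PxRec(tau_{l-1}))
    return float(np.sum(precisions[1:] * np.diff(recalls)))

score_map = np.random.rand(64, 64)
gt_mask = np.zeros((64, 64), dtype=bool)
gt_mask[16:48, 16:48] = True
print(pixel_average_precision(score_map, gt_mask))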

Bounding boxes: MaxBoxAccV2 (Eq. 3). Given the ground-truth boxes B^(n), we define the box accuracy at score-map threshold τ and IoU threshold δ as BoxAccV2(τ, δ) := (1/N) Σ_n 1[IoU(box(s(X^(n)), τ), B^(n)) ≥ δ], where N is the number of evaluation images. Here, box(s(X^(n)), τ) is the set of tightest boxes around each connected component of the mask {(i, j) | s(X^(n))_ij ≥ τ}, and IoU(boxes_A, boxes_B) is the best (maximal) value among the IoUs across the sets boxes_A and boxes_B. For score-map threshold independence, we report the box accuracy at the optimal threshold, the maximal box accuracy MaxBoxAccV2(δ) := max_τ BoxAccV2(τ, δ), as the final performance metric. We average the performance across δ ∈ {0.3, 0.5, 0.7} to address diverse demands for localization granularity.
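Below is a sketch of MaxBoxAccV2 under stated assumptions (not the official implementation): connected components come from scipy.ndimage.label, the threshold grid is a uniform linspace, and boxes use an inclusive-pixel (x0, y0, x1, y1) convention:

import numpy as np
from scipy import ndimage

def boxes_from_score_map(score_map, tau):
    """Tightest box around each connected component of {(i, j) | s(X)_ij >= tau}."""
    labeled, num_components = ndimage.label(score_map >= tau)
    boxes = []
    for k in range(1, num_components + 1):
        ys, xs = np.where(labeled == k)
        boxes.append((xs.min(), ys.min(), xs.max(), ys.max()))  # (x0, y0, x1, y1)
    return boxes

def iou(box_a, box_b):
    """IoU of two (x0, y0, x1, y1) boxes with inclusive pixel coordinates."""
    x0, y0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x1, y1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x1 - x0 + 1) * max(0, y1 - y0 + 1)
    area = lambda b: (b[2] - b[0] + 1) * (b[3] - b[1] + 1)
    return inter / (area(box_a) + area(box_b) - inter)

def max_box_acc_v2(score_maps, gt_boxes, deltas=(0.3, 0.5, 0.7), num_thresholds=100):
    """BoxAccV2 maximized over tau, then averaged over IoU thresholds delta."""
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    result = []
    for delta in deltas:
        box_accs = []
        for tau in thresholds:
            hits = 0
            for s, gts in zip(score_maps, gt_boxes):
                preds = boxes_from_score_map(s, tau)
                # IoU(boxes_A, boxes_B): best value across the two sets of boxes.
                best = max((iou(p, g) for p in preds for g in gts), default=0.0)
                hits += best >= delta
            box_accs.append(hits / len(score_maps))
        result.append(max(box_accs))  # MaxBoxAccV2(delta) := max_tau BoxAccV2(tau, delta)
    return float(np.mean(result))

score_maps = [np.random.rand(64, 64) for _ in range(4)]
gt_boxes = [[(10, 10, 40, 40)] for _ in range(4)]  # one ground-truth box per image
print(max_box_acc_v2(score_maps, gt_boxes))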

Data Contribution

WSOL is an ill-posed problem when only image-level labels are available (see the paper for the argument). To solve the WSOL task at all, a certain amount of full supervision is inevitable, and prior WSOL approaches have used different amounts of implicit and explicit full supervision (usually through validation). We propose to fix the amount of full supervision per method by carefully designing validation splits (called train-fullsup in the paper), so that every method uses the same amount of localization-labelled validation data.

We propose three disjoint splits for every dataset: train-weaksup, train-fullsup, and test. The train-weaksup split contains images with weak supervision (image-level labels). The train-fullsup split contains images with full supervision (either bounding boxes or binary masks); users are free to use it for hyperparameter search, model selection, ablative studies, or even model fitting. The test split contains images with full supervision and must be used only for the final performance report. For example, checking the test results multiple times with different model configurations violates the protocol, as the learner then implicitly uses more full supervision than allowed. The splits and their roles are explained more extensively in the paper; a minimal sketch of a protocol-compliant experiment follows.
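In this sketch, train_wsol_model and evaluate are hypothetical stubs standing in for actual training and PxAP/MaxBoxAccV2 evaluation, not functions from the released code; the point is that train-fullsup drives model selection while the test split is evaluated exactly once:

import random

def train_wsol_model(split, lr):
    """Hypothetical stand-in for WSOL training on image-level labels only."""
    return {"lr": lr}

def evaluate(model, split):
    """Hypothetical stand-in for PxAP / MaxBoxAccV2 evaluation on a split."""
    return random.random()

best_score, best_model = -1.0, None
for lr in (1e-3, 1e-4, 1e-5):                      # hyperparameter search
    model = train_wsol_model("train-weaksup", lr)  # weak supervision only
    score = evaluate(model, "train-fullsup")       # full supervision for selection
    if score > best_score:
        best_score, best_model = score, model

print(evaluate(best_model, "test"))                # the test split is touched exactly once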

Experimental Results

Re-evaluating WSOL. How much have WSOL methods improved upon the vanilla CAM model? Test split results are shown relative to the vanilla CAM performance (increase or decrease). Hyperparameters have been optimized over the identical train-fullsup split for all WSOL methods and the FSL baseline: 10, 5, and 5 fully supervised samples per class for ImageNet, CUB, and OpenImages, respectively.

Publication

Evaluating Weakly Supervised Object Localization Methods Right

Junsuk Choe*, Seong Joon Oh*, Seungho Lee, Sanghyuk Chun, Zeynep Akata, and Hyunjung Shim, IEEE CVPR, 2020

Links

[pdf] [code]