Given a specific task, developers can collect numerous deep neural networks (DNNs) from public sources for efficient reuse and avoid redundant training from scratch. However, estimating the performance (e.g., accuracy and robustness) of multiple DNNs and recommending which model should be used is challenging due to the scarcity of labeled data and the demand for domain expertise. Existing model selection approaches are mainly sampling-based, where a small set of data is sampled and manually labeled to discriminate among DNNs. Moreover, due to the randomness of sampling, the resulting performance ranking is not deterministic. In this paper, we propose LaF, a labeling-free model selection approach, to overcome the limitations of labeling effort and sampling randomness. Our extensive experiments on 9 benchmark datasets spanning the image, text, and source-code domains, together with 165 DNNs, demonstrate that LaF outperforms baseline methods by up to 0.74 in Spearman's correlation and is robust to both synthetic and natural data distribution shifts.
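To make the evaluation criterion concrete, the following minimal sketch (with hypothetical accuracy values) shows how a model selection method can be scored by Spearman's rank correlation between its estimated model ranking and the ground-truth ranking:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical ground-truth accuracies of 5 candidate DNNs on a
# fully labeled test set (unavailable in the labeling-free setting).
true_accuracy = np.array([0.91, 0.87, 0.93, 0.78, 0.85])

# Hypothetical performance estimates produced by a model selection
# method (e.g., LaF or a sampling-based baseline) without full labels.
estimated_score = np.array([0.88, 0.84, 0.95, 0.72, 0.80])

# Spearman's rank correlation compares the two induced rankings;
# 1.0 means the estimated ranking perfectly matches the true one.
rho, p_value = spearmanr(true_accuracy, estimated_score)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")
```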
Table 1 Datasets and collected DNNs. Each numeric value reports a DNN's accuracy or robustness on the corresponding ID or OOD test set.
Table 1 (continued) Average robustness of the DNNs on MNIST-C and CIFAR-10-C.