Predictor-based Neural Architecture Search (NAS) employs an architecture performance predictor to improve sample efficiency. However, predictor-based NAS suffers from a severe “cold-start” problem, since a large amount of architecture-performance data is required to obtain a working predictor. In this paper, we focus on exploiting information in cheaper-to-obtain performance estimations (i.e., low-fidelity information) to mitigate the large data requirements of predictor training. Despite the intuitiveness of this idea, we observe that using inappropriate low-fidelity information can even damage the prediction ability, and that different search spaces have different preferences for low-fidelity information types. To solve this problem and better fuse the beneficial information provided by different types of low-fidelity information, we propose a novel dynamic ensemble predictor framework that comprises two steps. In the first step, we train different sub-predictors on different types of available low-fidelity information to extract beneficial knowledge as low-fidelity experts. In the second step, we learn a gating network that dynamically outputs a set of weighting coefficients conditioned on each input neural architecture, which are used to combine the predictions of the low-fidelity experts in a weighted sum. The overall predictor is optimized on a small set of actual architecture-performance data to fuse the knowledge from the different low-fidelity experts into the final prediction. We conduct extensive experiments across five search spaces with different architecture encoders under various experimental settings. For example, our method improves the Kendall's Tau correlation coefficient between actual performance and predicted scores from 0.2549 to 0.7064 with only 25 actual architecture-performance data points on NDS-ResNet. Our method can easily be incorporated into existing predictor-based NAS frameworks to discover better architectures.
Cold-start Problem: Predictor-based NAS trains an approximate performance predictor and uses it to rank unseen architectures without actually training them. However, predictor-based NAS suffers from a severe “cold-start” problem: acquiring the architecture-performance data needed to train a working predictor from scratch usually incurs a considerable cost.
Motivation: This work focuses on exploiting the information in other, cheaper-to-obtain performance estimations (i.e., low-fidelity information) to mitigate the data requirements of predictor training. Intuitively, utilizing other low-fidelity information (e.g., grasp and plain) for predictor training can help mitigate the cold-start problem. One can anticipate that training with this information might bring improvements in two aspects.
The ranking information contained in some indicators (e.g., one-shot and zero-shot estimations) might help the predictor achieve better ranking quality.
Learning to fit other low-fidelity information could encourage the predictor to extract better architecture representations.
Utilization Problem: A straightforward way of utilizing low-fidelity information is to pretrain the predictor on a single type of low-fidelity information and finetune it on a small amount of actual architecture-performance data. We conduct a preliminary experiment in the Table below and make the following observations.
Low-fidelity information does have the potential to significantly improve prediction ability with limited actual architecture-performance data.
Inappropriate low-fidelity information types can even damage the prediction ability.
Different search spaces have different preferences for low-fidelity information types.
A high ranking quality of the low-fidelity information does not necessarily indicate its utilization effectiveness.
The "Low-fidelity Corr." and relative Kendall's Tau improvement achieved by utilizing different typical types of low-fidelity information. Specifically, we construct the predictor with an LSTM encoder and train it with a ranking loss. All architectures in the training split are used for pretraining, while the first 1% of architectures by index, together with their actual performance, are used for finetuning. "Low-fidelity Corr." denotes the Kendall's Tau correlation between the low-fidelity information and the actual performance.
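The naive pretrain-then-finetune baseline examined above can be sketched as follows. This is a toy illustration only: the linear predictor, data shapes, and learning rates are assumptions for the sketch, whereas the experiment above uses an LSTM encoder trained with a ranking loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 16-d architecture encodings, a cheap low-fidelity
# signal (e.g., a zero-shot score) for many architectures, and actual
# accuracies for only a handful of them.
n_pretrain, n_finetune, dim = 200, 10, 16
x_pre = rng.normal(size=(n_pretrain, dim))
y_low = x_pre @ rng.normal(size=dim)          # abundant low-fidelity targets
x_ft = rng.normal(size=(n_finetune, dim))
y_true = x_ft @ rng.normal(size=dim)          # scarce actual performance

def sgd_fit(w, x, y, lr, steps):
    """Plain least-squares SGD on (x, y) for a linear predictor."""
    for _ in range(steps):
        grad = 2 * x.T @ (x @ w - y) / len(y)
        w = w - lr * grad
    return w

w = np.zeros(dim)
# Phase 1: pretrain on the single chosen type of low-fidelity information.
w = sgd_fit(w, x_pre, y_low, lr=0.05, steps=200)
# Phase 2: finetune on the small set of actual architecture-performance data.
w = sgd_fit(w, x_ft, y_true, lr=0.05, steps=200)

pred = x_ft @ w
```

The key limitation visible even in this sketch is that only one low-fidelity signal (`y_low`) enters training, so the choice of which signal to use is left entirely to the practitioner.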
That is to say, despite the intuitiveness of this idea, which types of low-fidelity information are useful for performance prediction is unclear to practitioners beforehand. In addition, different types of low-fidelity information could provide beneficial information from different aspects, but the naive method described above can only utilize one type. Therefore, it would be better if we could fuse the knowledge from multiple types of low-fidelity information in an automated way.
We propose a novel dynamic ensemble predictor framework, whose core is a learnable gating network that maps the neural architecture to a set of weighting coefficients to be used in ensembling predictions of different low-fidelity experts. The framework comprises two steps.
In the first step, we pretrain different low-fidelity experts on different types of available low-fidelity information to extract beneficial knowledge.
In the second step, the overall predictor is finetuned on the actual architecture-performance data to fuse knowledge from different types of low-fidelity information to make the final prediction.
In this way, we can not only leverage multiple types of low-fidelity information in architecture performance prediction but also balance their contributions in an automatic and dynamic fashion, freeing practitioners from having to decide which low-fidelity information to use.
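The two-step framework above can be sketched in a few lines: frozen low-fidelity experts from step 1, plus a gating network that maps each architecture encoding to per-architecture weighting coefficients in step 2. All shapes, initializations, and the toy experts below are illustrative assumptions, not the paper's actual architecture encoders or sub-predictors.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

class DynamicEnsemblePredictor:
    """Minimal sketch: a linear gating network weights the predictions
    of pretrained low-fidelity experts in a per-input weighted sum."""

    def __init__(self, experts, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.experts = experts           # step 1: pretrained, then kept fixed
        self.gate_w = rng.normal(scale=0.1, size=(dim, len(experts)))

    def forward(self, x):
        # Gating network: weighting coefficients conditioned on this input.
        alpha = softmax(x @ self.gate_w)            # sums to 1
        scores = np.array([e(x) for e in self.experts])
        return float(alpha @ scores), alpha         # weighted-sum prediction

# Toy stand-ins for sub-predictors trained on different low-fidelity
# signals (e.g., one-shot, zero-shot, early-epoch estimations).
experts = [lambda x: float(x.sum()),
           lambda x: float(x.max()),
           lambda x: float((x ** 2).mean())]

model = DynamicEnsemblePredictor(experts, dim=8)
pred, alpha = model.forward(np.ones(8))
```

In the full method, `gate_w` (and optionally the experts) would be finetuned on the small set of actual architecture-performance data; because the weights are a convex combination, each prediction stays within the range spanned by the experts' scores.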
First, we evaluate our proposed method's prediction ability with different proportions of training samples on four NAS benchmarks: NAS-Bench-201, NAS-Bench-301, NDS-ResNet / ResNeXt-A, and MobileNet-V3. As shown in the Table below, our method consistently outperforms the vanilla predictor training method. For example, with GATES as the encoder and 1% training samples on NAS-Bench-201, our method achieves a Kendall's Tau correlation of 0.8244 between the predicted scores and actual performance, much better than the vanilla method (0.7332).
The Kendall's Tau (averaged over five runs) of using different encoders on NAS-Bench-201, NAS-Bench-301, NDS-ResNet, NDS-ResNeXt-A, and MobileNet-V3. The standard deviation is shown in the subscript. "Vanilla" denotes directly training the predictor with ground-truth accuracies, without utilizing low-fidelity information.
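For reference, the Kendall's Tau metric reported throughout counts concordant versus discordant pairs between two rankings. The sketch below is the tie-free Tau-a variant, a simplification; library implementations such as `scipy.stats.kendalltau` also handle ties.

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's Tau-a: (concordant - discordant) / total pairs.
    Assumes no ties in either sequence."""
    n = len(a)
    s = 0
    for i, j in combinations(range(n), 2):
        # +1 if the pair is ordered the same way in both sequences, else -1.
        s += 1 if (a[i] - a[j]) * (b[i] - b[j]) > 0 else -1
    return s / (n * (n - 1) / 2)

# Perfectly concordant rankings give 1.0; fully reversed rankings give -1.0.
print(kendall_tau([1, 2, 3, 4], [10, 20, 30, 40]))   # → 1.0
print(kendall_tau([1, 2, 3, 4], [40, 30, 20, 10]))   # → -1.0
```

A value of 0.8244 therefore means roughly 91% of all architecture pairs are ranked in the same order by the predictor and by actual performance.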
Then we conduct architecture search to verify whether our proposed method can mitigate the “cold-start” problem. For example, on NAS-Bench-201 and NAS-Bench-301 as shown below, compared with various baseline search methods, our method discovers high-performance architectures much faster.
Comparison with other search strategies on NAS-Bench-201 (left) and NAS-Bench-301 (right). We report the test accuracy of the architecture with the highest reward among all sampled architectures.