Neural architecture search (NAS) can automatically discover architectures that outperform handcrafted ones for various applications. However, early NAS methods suffer from an extremely heavy computational burden and can take tens of thousands of GPU hours to run. One of the major reasons for this computational challenge is that evaluating each candidate architecture is slow, since it involves a full training and testing process. In recent years, many studies have focused on developing more efficient performance estimators for neural architectures.
The One-shot Estimator (OSE) is one of the most popular and commonly-used efficient estimators; it amortizes the architecture training costs by sharing the parameters of one "supernet" among all architectures. Recently, the Zero-shot Estimator (ZSE) has attracted much attention because it involves no training process and achieves even higher efficiency.
Despite their efficiency, both OSE and ZSE suffer from low estimation quality. One obvious drawback is the poor ranking correlation between the OSE/ZSE estimations and the ground-truth performances. However, existing studies do not thoroughly evaluate and analyze this estimation quality, which hinders future research in this field. In this post, we present some observations and analyses drawn from the evaluation results and share several useful suggestions for better utilizing the OSE and ZSE. We also present several promising directions for improving these efficient estimators based on our analysis. We hope our research can bring some new knowledge to the NAS community and promote future studies.
One-shot Estimator (OSE) Traditional NAS methods conduct a costly separate training process to acquire suitable parameters for evaluating each candidate architecture. To make NAS computationally tractable, ENAS [1] proposes the parameter-sharing technique to accelerate architecture evaluation. Specifically, ENAS constructs an over-parametrized super network (i.e., "supernet") such that all architectures in the search space are its sub-architectures. During the search process, candidate architectures are evaluated on the validation data using the corresponding subset of the supernet weights, without undergoing a separate training process. Following this work, the parameter-sharing technique has been widely used for architecture search in different search spaces and incorporated with different search strategies. We refer to the parameter-sharing estimations as "one-shot" estimations, since they only require the training cost of one supernet.
Fig 1. Framework of one-shot estimation.
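In code, the parameter-sharing idea can be illustrated with a toy sketch (all class, variable, and operation names here are hypothetical, and a real supernet stores weight tensors rather than single scalars): every (edge, operation) pair owns one shared weight, a training step only updates the weights of the sampled sub-architecture, and evaluating any sub-architecture just reads out its subset of shared weights.

```python
# Toy sketch of parameter sharing (hypothetical names, not the ENAS code):
# the supernet keeps one shared weight per (edge, operation) pair, and a
# candidate architecture reuses the subset of weights for its own operations.

class Supernet:
    def __init__(self, num_edges, op_names):
        # one shared scalar per (edge, op); real supernets store tensors
        self.shared = {(e, op): 0.0 for e in range(num_edges)
                       for op in op_names}

    def train_step(self, arch, grads):
        # only the weights used by the sampled architecture are updated
        for (e, op), g in zip(arch, grads):
            self.shared[(e, op)] -= 0.1 * g

    def extract(self, arch):
        # evaluate a sub-architecture with its shared weights, no retraining
        return [self.shared[(e, op)] for (e, op) in arch]

supernet = Supernet(num_edges=3, op_names=["conv_3x3", "skip", "pool"])
arch_a = [(0, "conv_3x3"), (1, "skip"), (2, "pool")]
arch_b = [(0, "conv_3x3"), (1, "pool"), (2, "pool")]
supernet.train_step(arch_a, grads=[1.0, 1.0, 1.0])
# arch_b was never trained itself, yet its first and last weights have
# already been updated through sharing with arch_a
print(supernet.extract(arch_b))
```

This also shows where the estimation bias comes from: arch_b is scored with weights that were optimized for arch_a, not for itself.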
Zero-shot Estimator (ZSE) More recently, in order to further reduce the architecture evaluation cost, several studies [2, 3] introduce "zero-shot" estimators that involve no training at all. The ZSEs evaluated in our study include grad_norm, plain, snip, grasp, fisher, synflow, jacob_cov, relu_logdet, and the assembled indicator vote. Apart from jacob_cov and relu_logdet [3], the other ZSEs originate from network pruning and measure the approximate loss change when certain parameters or activations are pruned [2].
Fig 2. Framework of zero-shot estimation.
Fig 3. Some commonly-used ZSEs drawn from the pruning literature.
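To make these scores concrete, here is a minimal numpy sketch of the synflow idea for a toy two-layer linear network (the function name and the closed-form gradients are ours; real implementations run autograd on the actual model). Synflow replaces the weights by their absolute values, feeds an all-ones input, takes the sum of the outputs as the loss L, and scores the network by the sum over parameters of |w · ∂L/∂w|.

```python
import numpy as np

def synflow_score(w1, w2):
    """Sketch of the synflow ZSE for a toy linear net out = w2 @ (w1 @ x).

    Synflow uses |weights|, an all-ones input, and L = sum(outputs); the
    score is the sum over parameters of |w * dL/dw|.  This toy version
    computes the gradients in closed form instead of using autograd.
    """
    a1, a2 = np.abs(w1), np.abs(w2)
    x = np.ones(w1.shape[1])
    ones_out = np.ones(w2.shape[0])
    # L = 1^T (a2 @ a1 @ x); by the chain rule:
    # dL/da2[i, j] = (a1 @ x)[j]      -> outer(1, a1 @ x)
    # dL/da1[j, k] = (a2^T 1)[j] x[k] -> outer(a2^T 1, x)
    grad_a2 = np.outer(ones_out, a1 @ x)
    grad_a1 = np.outer(a2.T @ ones_out, x)
    return float(np.sum(np.abs(a1 * grad_a1)) + np.sum(np.abs(a2 * grad_a2)))

score = synflow_score(np.array([[1., -1.], [2., 0.]]), np.array([[1., 1.]]))
```

Note that the score can only grow as weights and layers are added, which already hints at the size bias of synflow discussed in the analysis section.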
NAS benchmarks are proposed to enable researchers to verify the effectiveness of NAS methods efficiently. Fig 4 shows the five popular NAS benchmarks used in our evaluation study.
NAS-Bench-101 (NB101) [4] provides the performances of the 423k valid architectures in a cell-based search space. However, the OSE cannot be easily applied to the whole NB101 search space due to its specific channel number rule. To reuse NB101 for benchmarking the OSE, NAS-Bench-1shot1 (NB1shot) [5] picks out three sub-spaces of NB101 for which a supernet can be easily constructed. In our study, we use the largest sub-space in NB1shot, NB1shot-3, and use the name "NB101" to refer to it.
NAS-Bench-201 (NB201) [6] constructs a cell-based NAS search space with 15625 DAGs that contain 4 nodes, 6 edges, and 5 operation choices. Complete training information for all of these architectures is provided.
NAS-Bench-301 (NB301) [7] is proposed as a benchmark for the commonly-used DARTS search space, which contains over 10^18 architectures. It adopts a surrogate-based methodology that predicts architecture performances from the measured performances of about 60k anchor architectures.
NDS ResNet & ResNeXt-A [8] provide architecture performances for two non-topological search spaces, which are quite different from the above three topological spaces. The ResNet search space contains depth and width decisions, and the ResNeXt-A search space contains decisions w.r.t. depth, width, bottleneck width ratio, and number of groups. We sample 5000 architectures from each of these two spaces to evaluate the OSE and ZSE.
Fig 4. Five popular NAS benchmarks used in our evaluation study.
The evaluation criteria used to evaluate the OSEs and ZSEs include:
Pearson coefficient of linear correlation (LC).
Kendall’s Tau ranking correlation (KD): The relative difference between the numbers of concordant and discordant pairs.
Spearman’s ranking correlation (SpearmanR): The Pearson correlation coefficient between the ranking variables.
Precision@K (P@topK): The proportion of true top-K architectures among the top-K architectures ranked by the scores.
Precision@K (P@bottomK): The proportion of true bottom-K architectures among the bottom-K architectures ranked by the scores.
BestRanking@K (BR@K): The best normalized ranking among the top-K proportion of architectures ranked by the scores.
WorstRanking@K (WR@K): The worst normalized ranking among the top-K proportion of architectures ranked by the scores.
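These ranking criteria are straightforward to implement; below is a minimal stdlib-only sketch (the function names are ours), where larger scores and larger GTs both mean "better" and rank 0 is the best rank:

```python
from itertools import combinations

def kendall_tau(scores, gts):
    """KD: relative difference between concordant and discordant pairs."""
    pairs = list(combinations(range(len(scores)), 2))
    c = sum((scores[i] - scores[j]) * (gts[i] - gts[j]) > 0 for i, j in pairs)
    d = sum((scores[i] - scores[j]) * (gts[i] - gts[j]) < 0 for i, j in pairs)
    return (c - d) / len(pairs)

def _rank_desc(values):
    # rank 0 = best (highest value)
    order = sorted(range(len(values)), key=lambda i: -values[i])
    return {i: r for r, i in enumerate(order)}

def precision_at_k(scores, gts, k):
    """P@topK: fraction of the true top-k archs among the score top-k."""
    s_rank, g_rank = _rank_desc(scores), _rank_desc(gts)
    top_s = {i for i, r in s_rank.items() if r < k}
    top_g = {i for i, r in g_rank.items() if r < k}
    return len(top_s & top_g) / k

def best_ranking_at_k(scores, gts, k):
    """BR@K: best normalized GT ranking among the score top-k."""
    s_rank, g_rank = _rank_desc(scores), _rank_desc(gts)
    return min(g_rank[i] for i, r in s_rank.items() if r < k) / len(scores)

def worst_ranking_at_k(scores, gts, k):
    """WR@K: worst normalized GT ranking among the score top-k."""
    s_rank, g_rank = _rank_desc(scores), _rank_desc(gts)
    return max(g_rank[i] for i, r in s_rank.items() if r < k) / len(scores)
```

For example, a perfectly ranked estimator gets KD = 1.0 and BR@K = 0.0, while a perfectly reversed one gets KD = -1.0.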
Tab 1 shows the hyper-parameters used for training the supernet on the different NAS benchmarks. All training and evaluation are conducted on CIFAR-10, the dataset used by all five benchmarks. 80% of the training set is used as the training data, while the other 20% is used as the validation data. We run every supernet training process with three random seeds (20, 2020, 202020).
Tab 1. Supernet training hyper-parameters.
For a more convincing evaluation, we use a relatively large number of architectures to evaluate the OSEs/ZSEs. Specifically, on NB101/NB201/NB301/NDS-ResNet/NDS-ResNeXt-A, we use 5000/15625/5896/5000/5000 architectures to calculate the evaluation criteria, which is more comprehensive than existing studies. We believe our evaluation results can serve as a strong baseline for future studies.
In the following, we present the important observations about OSEs/ZSEs from the experiments and propose some strategies to improve their estimation quality. Specifically, we inspect the OSEs/ZSEs from three aspects:
How well are the OSE/ZSE estimations correlated with the standalone architecture performances?
How and why do the OSE/ZSE estimations have bias and variance?
How can the OSEs/ZSEs be improved?
The first aspect gives us a comprehensive understanding of the OSE/ZSE estimations; the observations (denoted as O) drawn from the evaluation results are shown below. Then, we further analyze why the OSEs/ZSEs do not perform well from the perspectives of bias and variance; these analyses are denoted as A. Finally, based on the above observations and analyses, some useful suggestions (denoted as S) are shared with the community.
O1: On NB201/NB301/NDS-ResNet/NDS-ResNeXt-A, longer supernet training brings ranking quality improvements. And OSEs are better at distinguishing bad architectures than good ones (P@top<P@bottom).
O2: On NB101, longer training does not help after 20 epochs. And bad architectures are harder to distinguish (P@top>P@bottom), since the GTs of top architectures on NB101 are not as concentrated as those on NB201/NB301 (as shown in Fig 6 (left)).
O3: On NB301, using loss as the OS score is much better than using acc, since loss carries more information about prediction confidence (as shown in Fig 6 (right)).
Fig 5: Trends of different evaluation criteria on the five NAS benchmarks.
Fig 6. The GT distributions of NB101/NB201/NB301 (left), and the OS acc and OS loss on NB201/NB301 (right).
O4: The trend and performance of the OSE on NB101 are quite different from those on NB201/NB301 (e.g., different convergence speeds, different relations between P@top K and P@bottom K). We believe this comes from the structural difference between the architectures on NB101 and those on NB201/NB301. Architectures on NB101 use the "OON" (operation-on-node) representation, which leads to a large sharing extent in the supernet and thus limits the potential estimation quality on NB101.
O5: On NB201, the OSE mainly learns to reduce its chance of regarding bad architectures as good ones in the middle and late training stages. As shown in Fig 7 (top), the WR/Acc@Ks become better and better as training progresses. This is important since, in a NAS flow where one takes out several top-ranked architectures and conducts final training, the stability of BR/Acc@K is of concern. And because the BR/Acc@Ks converge very fast, a higher WR/Acc@K means relatively stable performances of the top-K architectures.
O6: Although the ranking quality criteria on NB301 are worse than those on NB201, OSEs can still help find architectures with satisfying accuracy in the harder and better NB301 search space. As shown in Fig 7, the WorstAcc@5% on NB301 is higher than that on NB201, since the GT distribution of NB301 lies in the high-performance region. This tells us that when analyzing the ranking quality, we still need to consider the absolute accuracy distribution.
Fig 7: WR@K on NB201 (top) & NB301 (bottom).
Fig 8. Criteria variation on NB301 as the batch number (X-axis) changes. Right: The histogram of intra-level KDs using 10-batch OS acc, where the "levels" are partitioned according to the 1-batch OS acc.
O7: On both NB201 and NB301, using more validation data improves the estimation quality (as shown in Fig 8 (left)). Interestingly, when the supernet is not well trained (at epoch 200), the criteria decrease with more batches. We further analyze this phenomenon and find that when the supernet is under-trained, its ability to distinguish intra-level architectures is weak (as shown in Fig 8 (right)), and using more data results in more levels, which makes the architectures harder for the OSE to distinguish.
O8: On NB201, no matter which GT accuracy we use, the KD and P@top 5% of using the OS accuracy on CIFAR-10 (denoted as C10) are the highest. We conduct this experiment on NB201, which also provides the architectures' performances on CIFAR-100 (denoted as C100) and ImageNet-16-120 (denoted as IN120). The results in the red rectangle in Fig 9 show this counter-intuitive observation.
We speculate that since the numbers of classes on CIFAR-100 and ImageNet-16-120 are larger than that on CIFAR-10 (100 vs. 10), the classification head might become a parameter-sharing bottleneck. For example, when the initial channel number is 16, the classification head on these two datasets is constructed from a global average pooling layer and a single linear layer that converts 16 × 2^2 = 64 units to 100 units. This compact FC layer might lack the representational capability to be shared by many architectures, which can cause all architectures to be too under-trained to reflect their standalone rankings correctly.
Fig 9: KD and P@top 5% across different datasets on NB201.
Fig 10: Influence of layer proxy (left) and channel proxy (right).
O9: The channel proxy has little influence, while the layer proxy reduces the reliability of search results (as shown in Fig 10). Due to memory and time constraints, it is common to use a shallower or thinner proxy model in the search process. This observation suggests that, for cell-based search spaces, proxy-less search w.r.t. the layer number is worth studying.
A1: At the architecture level, sub-architectures in the supernet have different amounts of computation and thus converge at different speeds, which leads to underestimation of the larger architectures in the early training stages. As shown in Fig 11, we divide the architectures into five groups according to their amount of computation (FLOPs), and show the KD & average RD in each group (the group with the smallest FLOPs is at the leftmost). In the early training stages, the average RD shows a decreasing trend across groups because larger models converge more slowly. As training goes on, the absolute average RD decreases, indicating that the issue of underestimating larger models is alleviated.
Fig 11: Arch-level bias. Y axis left/right: KD τ / Average RD within the complexity group.
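This grouped-bias analysis can be sketched in a few lines (the function name and the exact definition of RD are our assumptions; here we take RD as the normalized ranking difference between the GT rank and the OS rank, so a positive group mean indicates the OSE overestimates that group):

```python
def group_bias_by_flops(flops, os_scores, gts, n_groups=5):
    """Split archs into FLOPs groups (smallest first) and report the mean
    ranking difference per group.  Assumption: RD_i = (gt_rank_i -
    os_rank_i) / N with rank 0 best, so RD > 0 means the OSE ranks the
    architecture better than its GT rank, i.e. overestimates it."""
    n = len(flops)
    def rank_desc(vals):
        order = sorted(range(n), key=lambda i: -vals[i])
        return {i: r for r, i in enumerate(order)}
    os_rank, gt_rank = rank_desc(os_scores), rank_desc(gts)
    by_flops = sorted(range(n), key=lambda i: flops[i])
    size = n // n_groups
    groups = [by_flops[g * size:(g + 1) * size] for g in range(n_groups)]
    return [sum((gt_rank[i] - os_rank[i]) / n for i in g) / len(g)
            for g in groups]

# toy case: OSE prefers small archs, while bigger archs are actually better
bias = group_bias_by_flops([1, 2, 3, 4], [4, 3, 2, 1], [1, 2, 3, 4],
                           n_groups=2)
```

In this toy case the small-FLOPs group gets a positive mean RD (overestimated) and the large-FLOPs group a negative one (underestimated), mirroring the early-training behavior described above.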
A2: At the operation level, OSEs prefer some types of operations over others, which leads to improper estimations. We inspect the changes in GT and OS accuracy when one operation is mutated into another; an example is shown in Fig 12. We can see that, on NB301, mutating skip_connect to dil_conv_3x3 increases the OS acc of many architectures, but the GTs of most of these architectures do not increase, which means the estimations are improper. Thus, this result tells us that the OSE underestimates skip_connect and overestimates dil_conv on NB301.
Fig 12: Op-level bias. Mutating skip_connect to dil_conv_3x3.
A3: The multi-model forgetting phenomenon causes poor estimation quality. Due to the parameter sharing and the random-sample training scheme, the training of subsequent architectures overwrites the weights of previous ones, thus degrading their OS accs. We verify this "multi-model forgetting" phenomenon in Fig 13. We can also see that, as training progresses, the variance of the FVs decreases, which is natural due to the learning rate decay.
Fig 13: Multi-model forgetting phenomenon on NB101/NB201/NB301.
A4: Ranking instability also contributes to the poor estimation quality. As shown in Fig 14, the criteria (i.e., relative KD, relative P@top/bottomK) are calculated between two sets of adjacent OS estimations, with the estimations of the later checkpoint taken as the GT. We can see that, with insufficient training (in the early epochs), the ranking stability is quite low. And on NB301, even with sufficient training, the ranking stability of top architectures is still not high (relP@top 0.5% ∼0.46).
Fig 14: Ranking stability of OSEs on NB301.
S1: Use temporal ensembling to reduce ranking instability and improve the estimation in some cases.
A previous study [9] proposed to stabilize the OS estimations by temporally averaging the weights of supernet checkpoints. Fig 14 shows that this ensemble technique can reduce the ranking variances. And as shown in Fig 15, temporally ensembling 3/5 checkpoints can even bring improvements on NB201, though it brings no bias improvements on NB301.
Fig 15: Effect of ensemble techniques on OSEs on NB201 (top) and NB301 (bottom).
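The weight-averaging idea is simple to sketch (a minimal sketch, assuming checkpoints are represented as dicts mapping parameter names to values; real checkpoints hold tensors, and the averaging is element-wise):

```python
def temporal_ensemble(checkpoints):
    """Average each shared parameter over the last few supernet
    checkpoints, a sketch of the weight-averaging idea in [9]."""
    n = len(checkpoints)
    return {name: sum(ckpt[name] for ckpt in checkpoints) / n
            for name in checkpoints[0]}

# e.g. averaging one shared parameter over three checkpoints
avg = temporal_ensemble([{"w": 1.0}, {"w": 2.0}, {"w": 3.0}])
```

The averaged weights are then used in place of the latest checkpoint when computing the OS estimations.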
S2: Do not use multiple MC samples on NB101/NB201/NB301.
We compare the results of using different MC sample numbers S in supernet training. We also adapt the FairNAS [10] sampling strategy to the NB201/NB301 spaces (a special case of MC sampling with S=5/7 for NB201/NB301). As shown in Tab 2, on NB101/NB301, using S>1 brings only a small KD improvement, while on NB201, the KD even degrades slightly when S>1. Moreover, in most cases, using multiple MC samples cannot improve the OSE's ability to distinguish good architectures (lower P@top 5%). Thus, there is no need to use multiple MC samples on NB101/NB201/NB301.
Tab 2: Results of MC sample num (S) and FairNAS when the training converges.
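For reference, FairNAS-style strict-fairness sampling can be sketched as follows (a toy sketch with hypothetical names, for a space where each edge independently picks one operation): per training step, one random permutation of the operations is drawn for each edge, yielding len(ops) architectures in which every operation appears exactly once on every edge.

```python
import random

def fairnas_step_samples(num_edges, ops, rng):
    """One FairNAS-style training step: draw len(ops) architectures such
    that, on every edge, each candidate operation is used exactly once
    across the drawn architectures (strict sampling fairness)."""
    perms = [rng.sample(ops, len(ops)) for _ in range(num_edges)]
    return [tuple(perms[e][s] for e in range(num_edges))
            for s in range(len(ops))]

samples = fairnas_step_samples(3, ["conv", "skip", "pool"],
                               random.Random(0))
```

Gradients from all sampled architectures in a step are accumulated before the shared weights are updated, so every operation on every edge receives exactly one update per step.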
S3: Use a fair sampling strategy.
The random-sample training scheme of the supernet causes estimation variances, and an improper sampling distribution leads to estimation biases. One of the main causes is that architectures are sampled from an unfair distribution, i.e., some architectures have undesirably higher sampling probabilities.
Take NB201 as an example: it contains many isomorphic architectures with different representations. Even with rather sufficient training, the supernet still significantly overestimates some simple architectures, since their equivalent sampling probability is higher and the shared parameters are trained towards their desired directions. To this end, we propose a De-Isomorphic Sampling strategy to improve the fairness. Specifically, we sample architectures during training from the search space with isomorphic duplicates removed. As shown in Fig 16, if the deiso sampling strategy is not used, the quality of the estimations for top architectures decreases as training progresses.
Fig 16: Comparison of Iso / Deiso sampling strategy.
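The de-isomorphic sampling idea can be sketched as follows (a toy sketch; `canon` stands for a hypothetical function mapping an architecture encoding to a canonical key shared by all of its isomorphic forms, and computing it for real cell encodings requires a graph-canonicalization step):

```python
import random
from collections import defaultdict

def deiso_sample(archs, canon, rng):
    """Pick a training architecture so that every isomorphism class is
    equally likely, instead of every encoding: classes with many
    isomorphic encodings no longer get proportionally more training."""
    classes = defaultdict(list)
    for a in archs:
        classes[canon(a)].append(a)
    key = rng.choice(sorted(classes))  # uniform over isomorphism classes
    return rng.choice(classes[key])    # then uniform within the class

# toy encoding: an unordered pair of ops, so sorting the pair is a valid
# canonical form; ("conv", "skip") and ("skip", "conv") are isomorphic
archs = [("conv", "skip"), ("skip", "conv"), ("pool", "pool")]
canon = lambda a: tuple(sorted(a))
```

With naive uniform sampling over encodings, the ("conv", "skip") class would be trained twice as often as ("pool", "pool"); with `deiso_sample`, both classes are sampled equally often.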
S4: Reduce the sharing extent of the supernet.
Due to the parameter-sharing technique, a large sharing extent is harmful to the OSE estimations. Dynamic search space (SS) pruning, which prunes the SS based on a certain indicator, proves to be an effective way of improving the OSE quality. To this end, we propose two directions for developing practical dynamic SS pruning methods: 1) Per-architecture (soft) pruning with a jointly updated controller, where the controller gives higher sampling probability to architectures with higher OS scores. 2) Per-decision pruning with a jointly updated controller, where the controller learns to assign different probabilities to architectural decisions instead of architectures.
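One simple way to realize the first direction is score-weighted sampling (a sketch under our own assumptions; the softmax-style sampling rule and temperature are ours, and the actual controller in the case study is evolutionary-based): architectures with higher OS scores are sampled more often for supernet training, so low-scoring ones are softly pruned rather than removed outright.

```python
import math
import random

def controller_sample(archs, os_scores, temperature, rng):
    """Per-architecture soft pruning sketch: sample a training
    architecture with probability proportional to exp(score / T), so
    low-scoring architectures are gradually pruned from the training
    distribution instead of being removed outright."""
    weights = [math.exp(s / temperature) for s in os_scores]
    return rng.choices(archs, weights=weights, k=1)[0]

# with a low temperature, the higher-scoring architecture dominates
arch = controller_sample(["arch_a", "arch_b"], [1.0, 0.0], 0.1,
                         random.Random(0))
```

The temperature controls how aggressively the space is pruned: a high temperature approaches uniform sampling, while a low one concentrates training on the current top architectures.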
We follow the first direction and conduct a case study on NB301. The results shown in Fig 17 reveal the potential of dynamic SS pruning for improving the OSE quality, especially for good architectures.
Fig 17: Per-architecture soft pruning with evolutionary-based controllers on NB301: Quality comparison of the OSE estimations after one-shot training and controller-guided training.
S5: Do not use BN affine operations in the search process.
Fig 18 compares the ranking quality of OSEs trained with and without the affine operations in BNs. We can see that training the supernet without BN affine operations improves the estimation quality.
Fig 18: Effect of BN affine operations in OSE on NB201 (top) and NB301 (bottom).
From Tab 3, we can observe that:
O1: ZSEs still perform worse than OSEs at their current stage. In particular, ZSEs perform very poorly on NB101.
O2: Simply counting the number of ReLU layers (denoted as relu) can achieve very competitive ranking quality.
O3: The relative effectiveness of ZSEs varies between search spaces.
O4: Ensembling the best three ZSEs (denoted as vote) does not bring improvements over the best one ZSE.
Tab 3: Comparison of ZSEs and OSEs on NB101/NB201/NB301.
O5: ZSEs cannot benefit from one-shot training, and ZSEs that utilize high-order information (i.e., gradients) provide their best estimations with randomly initialized weights. As shown in Tab 4, using randomly initialized weights to compute the ZSEs performs better than using the one-shot supernet weights.
Tab 4: Quality of zero-shot estimations as training progresses.
A1: At the architecture level, some ZSEs have an excessive preference for the largest architectures, which leads to improper estimations.
For example, Fig 19 shows the top-3 architectures on NB201 according to the synflow values. As can be seen, these three architectures are the largest ones on NB201, but they are not the best-performing architectures.
Besides, snip, grad_norm, fisher, and grasp all show an improper preference for architectures with gradient explosion, since they are designed to measure parameter-wise sensitivity.
Fig 19: Top-3 ZS ranked archs on NB201 by synflow.
A2: At the operation level, some ZSEs have an excessive preference for certain operations.
For example, both relu_logdet and jacob_cov show an improper preference for 1×1 convolution over 3×3 convolution, which accounts for their weak performance in identifying good architectures. As shown in Fig 20, architectures with conv 1x1 obtain higher ZS scores than those with conv 3x3 in most cases.
Fig 20: Scatter plots of GT (Y-axis) vs. ZS (X-axis) scores; different subplots stand for different edges (6 edges in total), and different colors & markers stand for different operation types on that edge.
Following S4 for the OSEs, a promising direction is to adopt a more appropriate sharing extent for the supernet. Specifically, we conduct an experiment (shown in Fig 21) and find that a larger sharing extent accelerates the training but cannot reach a high saturating performance. Therefore, a natural idea is to use a large sharing extent in the early stage and then gradually reduce it.
Fig 21: Comparison of supernets with small and large sharing extent. The left part shows a cell-architecture that uses Supernet-1 (top) and Supernet-2 (bottom). The right part shows the KD of Supernet-1 and Supernet-2 throughout the training process on NB201 and NB301.
Based on the above idea, we propose to employ Curriculum Learning On the Sharing Extent (CLOSE) of the supernet (as shown in Fig 22 (left)). To enable adaptation of the sharing extent during training, we design a novel supernet, CLOSENet (as shown in Fig 22 (right)), whose sharing extent can be easily adjusted.
Fig 22: CLOSE strategy (left) and CLOSENet (right).
We evaluate CLOSE on NB201/NB301/NDS-ResNet/NDS-ResNeXt-A. CLOSENet achieves a higher KD and P@top5% on all the NAS benchmarks. Moreover, throughout the training process, CLOSENet consistently achieves higher ranking quality, which implies its superiority to the vanilla supernet under any supernet training budget.
[1]. Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. ICML 2018.
[2]. Mohamed S. Abdelfattah, Abhinav Mehrotra, Łukasz Dudziak, and Nicholas D. Lane. ZeroCost Proxies for Lightweight NAS. ICML 2021.
[3]. Joseph Mellor, Jack Turner, Amos Storkey, and Elliot J. Crowley. Neural architecture search without training. ICML 2021.
[4]. Chris Ying, Aaron Klein, Eric Christiansen, Esteban Real, Kevin Murphy, and Frank Hutter. Nas-bench-101: Towards reproducible neural architecture search. ICML 2019.
[5]. Arber Zela, Julien Siems, and Frank Hutter. Nas-bench-1shot1: Benchmarking and dissecting one-shot neural architecture search. ICLR 2020.
[6]. Xuanyi Dong and Yi Yang. Nas-bench-201: Extending the scope of reproducible neural architecture search. ICLR 2020.
[7]. Julien Siems, Lucas Zimmer, Arber Zela, Jovita Lukasik, Margret Keuper, and Frank Hutter. Nas-bench-301 and the case for surrogate benchmarks for neural architecture search. arXiv preprint, 2020.
[8]. Ilija Radosavovic, Justin Johnson, Saining Xie, Wan-Yen Lo, and Piotr Dollár. On network design spaces for visual recognition. CVPR 2019.
[9]. Ronghao Guo, Chen Lin, Chuming Li, Keyu Tian, Ming Sun, Lu Sheng, and Junjie Yan. Powering one-shot topological nas with stabilized share-parameter proxy. In European Conference on Computer Vision, pages 625–641. Springer, 2020.
[10]. Xiangxiang Chu, Bo Zhang, Ruijun Xu, and Jixiang Li. Fairnas: Rethinking evaluation fairness of weight sharing neural architecture search. arXiv preprint arXiv:1907.01845, 2019.