Search Strategy Improvement
Our work on search strategy improvement centers on a core idea: through learnable modeling of the search space (i.e., a performance predictor over architectures), we can identify which regions of the search space are worth exploring, and thereby accelerate the exploration process. This is the underlying idea of predictor-based NAS methods, whether parametric or non-parametric (e.g., Bayesian optimization). Here, we present our studies along this direction.
Following the intuition that "an architecture describes how the data flows and gets processed", we propose to encode an architecture by mimicking how it processes data. The proposed GATES is an encoding method tailored to data-processing directed acyclic graphs (DAGs): it matches the nature of this type of data and intrinsically encodes equivalent / isomorphic architectures to the same embedding. We also propose to train the predictor with a ranking loss, since providing a correct ranking is usually more important in NAS than predicting performance values that are close in absolute terms.
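To make the idea concrete, here is a minimal, illustrative sketch (not the exact GATES implementation): a virtual piece of "information" is propagated along the architecture DAG, and each operation acts as a learned soft gate that scales the information flowing through it, so the embedding depends only on the computation the DAG describes. The class and attribute names (`GatesLikeEncoder`, `op_emb`, `x_proj`) are illustrative.

```python
import torch
import torch.nn as nn

class GatesLikeEncoder(nn.Module):
    """Illustrative GATES-style encoder: propagate virtual "information" along
    the architecture DAG, with each operation acting as a learned soft gate."""

    def __init__(self, num_op_types, hid_dim=64):
        super().__init__()
        self.op_emb = nn.Embedding(num_op_types, hid_dim)  # one embedding per op type
        self.x_proj = nn.Linear(hid_dim, hid_dim)          # transforms incoming information
        self.gate = nn.Linear(hid_dim, hid_dim)            # derives the gate from the op embedding

    def forward(self, node_ops, edges, num_nodes):
        """node_ops: op-type id per node (node 0 = input, never a destination).
        edges: list of (src, dst) pairs, assumed to be in topological order."""
        hid_dim = self.op_emb.embedding_dim
        # virtual "information" carried by each node; the input node starts with ones
        info = [torch.zeros(hid_dim) for _ in range(num_nodes)]
        info[0] = torch.ones(hid_dim)
        for src, dst in edges:
            op_e = self.op_emb(torch.tensor(node_ops[dst]))
            gate = torch.sigmoid(self.gate(op_e))           # the operation acts as a soft gate
            info[dst] = info[dst] + gate * self.x_proj(info[src])
        return info[-1]                                     # embedding of the output node
```

In this sketch, edges are processed in topological order and incoming information is aggregated by summation, so two isomorphic DAGs yield the same embedding regardless of how their nodes are numbered, which mirrors the isomorphism-invariance property described above.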
TA-GATES improves on GATES by producing contextualized embeddings for different operations, even those of the same type. Two operations of the same type can serve different functional roles depending on their architectural context, yet plain GATES assigns them identical embeddings. To obtain a more discriminative encoding, TA-GATES builds on the intuition that "an architecture not only describes how the data flows and gets processed in the forward propagation, but also determines the learning dynamics of the model". Accordingly, we propose to encode an architecture by mimicking its training process: the encoder conducts several forward and backward passes on the architecture DAG and updates the operation embeddings in each backward pass. TA-GATES also enables several interesting applications: (1) any-time performance prediction, which can be handy in early-stop / multi-fidelity NAS; (2) a natural joint encoding of other factors in the deep learning pipeline (e.g., loss design, hyper-parameters) for joint AutoDL. Because the encoding process of TA-GATES explicitly mimics model training, it is straightforward to find the counterparts of those training-time factors in the encoding process.
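A hedged sketch of this training-analogous encoding, under simplifying assumptions: the encoder keeps one embedding per operation instance (not just per type) and alternates forward passes over the DAG with a learned update step that plays the role of the backward pass, so operations of the same type end up with different, context-dependent embeddings. Unlike the actual TA-GATES, the sketch uses each node's forward information as the update signal instead of a genuine reverse-direction pass; names such as `TaGatesLikeEncoder` and `op_update` are illustrative.

```python
import torch
import torch.nn as nn

class TaGatesLikeEncoder(nn.Module):
    """Illustrative TA-GATES-style encoder: alternate forward passes over the DAG
    with learned updates of per-operation embeddings, so operations of the same
    type acquire different, context-dependent embeddings."""

    def __init__(self, num_op_types, hid_dim=64, num_steps=2):
        super().__init__()
        self.base_op_emb = nn.Embedding(num_op_types, hid_dim)
        self.x_proj = nn.Linear(hid_dim, hid_dim)
        self.gate = nn.Linear(hid_dim, hid_dim)
        # stands in for the "backward pass": refines each op embedding from context
        self.op_update = nn.Sequential(
            nn.Linear(2 * hid_dim, hid_dim), nn.ReLU(), nn.Linear(hid_dim, hid_dim))
        self.num_steps = num_steps

    def _forward_pass(self, op_embs, edges, num_nodes):
        hid_dim = op_embs.shape[1]
        info = [torch.zeros(hid_dim) for _ in range(num_nodes)]
        info[0] = torch.ones(hid_dim)
        for src, dst in edges:
            gate = torch.sigmoid(self.gate(op_embs[dst]))
            info[dst] = info[dst] + gate * self.x_proj(info[src])
        return info

    def forward(self, node_ops, edges, num_nodes):
        # start from the shared type-level embeddings, one copy per node
        op_embs = self.base_op_emb(torch.tensor(node_ops))
        for _ in range(self.num_steps):
            info = self._forward_pass(op_embs, edges, num_nodes)
            # "backward" analogue: update every operation embedding from its own
            # node's information, making the embeddings context-dependent
            ctx = torch.stack(info)                               # (num_nodes, hid_dim)
            op_embs = op_embs + self.op_update(torch.cat([op_embs, ctx], dim=-1))
        return self._forward_pass(op_embs, edges, num_nodes)[-1]  # output-node embedding
```

Reading out the embedding after each refinement step (rather than only after the last one) is what makes an any-time-style prediction interface conceivable in this kind of scheme.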
Based on the intuition that "low-fidelity information can benefit the learning of the performance model", we propose DELE, a dynamic mixture-of-experts predictor framework that fuses beneficial knowledge from different low-fidelity experts. Specifically, each low-fidelity expert is trained with the aid of one type of low-fidelity information (e.g., zero-shot evaluation scores, complexity scores, and so on), and a dynamic ensemble of these experts is then trained using only a small set of ground-truth performance data. DELE is orthogonal to GATES and TA-GATES (encoder designs specialized to the characteristics of neural architectures) in that every expert in DELE can itself be a GATES or TA-GATES predictor.
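The following sketch illustrates only the dynamic-ensemble part, with illustrative names (`DynamicEnsemblePredictor`, `gating`): each expert maps an architecture embedding to a score, and an input-dependent gating network decides how much weight each expert receives for a given architecture. The two-stage training (experts on their low-fidelity signals first, then the ensemble on scarce ground-truth data) is indicated in comments rather than implemented.

```python
import torch
import torch.nn as nn

class DynamicEnsemblePredictor(nn.Module):
    """Illustrative dynamic mixture-of-experts predictor in the spirit of DELE.
    Training stages (not shown): (1) train each expert against one low-fidelity
    signal, e.g., a zero-shot score; (2) train the gating network (and optionally
    fine-tune the experts) on a small set of ground-truth accuracies."""

    def __init__(self, emb_dim=64, num_experts=3):
        super().__init__()
        # each expert maps an architecture embedding to a scalar score;
        # in DELE, each expert could itself be a GATES / TA-GATES encoder + head
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(emb_dim, 64), nn.ReLU(), nn.Linear(64, 1))
             for _ in range(num_experts)])
        # input-dependent ("dynamic") gating weights over the experts
        self.gating = nn.Linear(emb_dim, num_experts)

    def forward(self, arch_emb):
        scores = torch.cat([e(arch_emb) for e in self.experts], dim=-1)  # (B, E)
        weights = torch.softmax(self.gating(arch_emb), dim=-1)           # (B, E)
        return (weights * scores).sum(dim=-1)                            # (B,)
```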
Research List
Predictor-based NAS relies on an architecture-performance predictor to evaluate candidate architectures efficiently, and the predictor's prediction ability strongly influences the effectiveness of the search process. Generally, the predictor first encodes the input architecture into a latent embedding and then produces the predicted score with an MLP. Traditionally, the encoder is an LSTM, an MLP, or a GCN. In this paper, we propose a graph-based encoder that models the information flow in the architecture, yielding a better embedding ability and thus better search efficiency.
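As an illustration of this encode-then-score pipeline and of the ranking-based training mentioned above, here is a hedged sketch: a predictor wrapping an arbitrary encoder with an MLP head, plus a pairwise hinge ranking loss. The names (`ArchPredictor`, `pairwise_ranking_loss`) and the margin value are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class ArchPredictor(nn.Module):
    """Encoder (any of MLP / LSTM / GCN / GATES-style) followed by an MLP scoring head."""

    def __init__(self, encoder, emb_dim=64):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Sequential(nn.Linear(emb_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, arch):
        return self.head(self.encoder(arch)).squeeze(-1)   # one scalar score per architecture


def pairwise_ranking_loss(pred, target, margin=0.1):
    """Hinge loss over all pairs: if target_i > target_j, push pred_i above pred_j."""
    diff_pred = pred.unsqueeze(1) - pred.unsqueeze(0)       # [i, j] = pred_i - pred_j
    diff_true = target.unsqueeze(1) - target.unsqueeze(0)   # [i, j] = target_i - target_j
    mask = (diff_true > 0).float()                          # pairs where i should rank higher
    return (mask * torch.relu(margin - diff_pred)).sum() / mask.sum().clamp(min=1)
```

Training with such a pairwise loss only requires the predictor to order architectures correctly, which is what the NAS search procedure ultimately consumes.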
Neural architecture search aims to shift the design of neural network (NN) architectures from manual effort to algorithmic design. In this setting, the NN architecture itself can be viewed as data that needs to be modeled, and better modeling can help explore novel architectures automatically and open the black box of automated architecture design. To this end, this work proposes a new encoding scheme for neural architectures, the Training-Analogous Graph-based ArchiTecture Encoding Scheme (TA-GATES). TA-GATES encodes an NN architecture in a way that is analogous to its training. Extensive experiments demonstrate that the flexibility and discriminative power of TA-GATES lead to better modeling of NN architectures. We expect our methodology of explicitly modeling the NN training process to benefit broader automated deep learning systems.
To mitigate the "cold-start" problem of predictor-based NAS, we incorporate low-fidelity estimations into predictor training via "Warmup" or "Multi-fidelity Training". Extensive experiments on NAS-Bench-201, NAS-Bench-301, NDS, and MobileNetV3 demonstrate the effectiveness of our method.
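A minimal sketch of the "Warmup" variant, assuming a predictor model and the `pairwise_ranking_loss` sketched above: the predictor is first pre-trained to rank architectures by a cheap low-fidelity signal and then fine-tuned on the small set of ground-truth accuracies. "Multi-fidelity Training", which mixes both signals during training, is not shown; all names and hyper-parameters here are illustrative.

```python
import torch

def train_with_warmup(predictor, archs, low_fid_scores, gt_archs, gt_accs,
                      warmup_epochs=50, finetune_epochs=100, lr=1e-3):
    """Illustrative "Warmup" scheme: pre-train the predictor to rank architectures
    by a cheap low-fidelity signal, then fine-tune on scarce ground-truth accuracy."""
    opt = torch.optim.Adam(predictor.parameters(), lr=lr)

    for _ in range(warmup_epochs):              # stage 1: rank by low-fidelity scores
        loss = pairwise_ranking_loss(predictor(archs), low_fid_scores)
        opt.zero_grad()
        loss.backward()
        opt.step()

    for _ in range(finetune_epochs):            # stage 2: rank by ground-truth accuracy
        loss = pairwise_ranking_loss(predictor(gt_archs), gt_accs)
        opt.zero_grad()
        loss.backward()
        opt.step()

    return predictor
```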