With the rapid evolution of embedded deep-learning computing systems, applications powered by deep learning are moving from the cloud to the edge. When neural networks (NNs) are deployed on devices operating in complex environments, various types of faults may occur: soft errors caused by cosmic radiation and radioactive impurities, voltage instability, aging, temperature variations, malicious attacks, etc. The safety risk of deploying NNs is therefore drawing much attention. In this paper, after analyzing the possible faults in various types of NN accelerators, we formalize and implement various fault models from the algorithmic perspective. We propose Fault-Tolerant Neural Architecture Search (FT-NAS) to automatically discover convolutional neural network (CNN) architectures that are resilient to the various faults in today's devices. We then incorporate fault-tolerant training (FTT) into the search process to achieve better results, which is referred to as FTT-NAS. Experiments on CIFAR-10 show that the discovered architectures significantly outperform manually designed baseline architectures, with comparable or fewer floating-point operations (FLOPs) and parameters. Specifically, with the same fault settings, F-FTT-Net, discovered under the feature fault model, achieves an accuracy of 86.2% (vs. 68.1% for MobileNet-V2), and W-FTT-Net, discovered under the weight fault model, achieves an accuracy of 69.6% (vs. 60.8% for ResNet-18). By inspecting the discovered architectures, we find that the operation primitives, the weight quantization range, the model capacity, and the connection pattern all influence the fault resilience of NN models.
There exist hardware-related reliability issues when deploying NNs on today's embedded devices. In this paper, instead of redesigning the hardware for reliability, we attempt to mitigate this problem from the algorithmic perspective. Intuitively, the neural architecture might also be important for fault tolerance, since it determines the "path" of fault propagation. We conduct experiments on the CIFAR-10 dataset showing that fault tolerance varies among neural architectures, which motivates employing neural architecture search (NAS) techniques to design fault-tolerant neural architectures. We emphasize that our work is orthogonal to most previous methods based on hardware or mapping-strategy design. To the best of our knowledge, our work is the first to improve algorithmic fault resilience by optimizing the NN architecture.
As the basis of our research, we first formalize the MiBB feature fault model and the adSAF weight fault model. Examples of both are shown below.
An example of injecting feature faults under the MiBB fault model (soft errors in FPGA LUTs).
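As a rough illustration, MiBB (multiple independent bit-bias) feature faults can be modeled as independent per-bit biases injected into quantized feature maps. The snippet below is a minimal NumPy sketch, not the paper's implementation; the function name, parameters, and the assumption of 8-bit unsigned activations are ours.

```python
import numpy as np

def inject_mibb(features, p_bit=1e-4, n_bits=8, seed=None):
    """Sketch of MiBB feature fault injection.

    Each bit position of each quantized feature value is independently
    faulty with probability p_bit; a faulty bit adds a signed bias of
    +/- 2^k to the value, emulating soft errors in FPGA LUTs.
    Assumes unsigned n_bits-quantized activations (e.g. post-ReLU).
    """
    rng = np.random.default_rng(seed)
    out = features.astype(np.int32).copy()
    for k in range(n_bits):
        faulty = rng.random(out.shape) < p_bit       # which values get a bias on bit k
        sign = rng.choice([-1, 1], size=out.shape)   # bias direction
        out = out + faulty * sign * (1 << k)
    # clip back to the quantization range
    return np.clip(out, 0, (1 << n_bits) - 1)
```

With `p_bit = 0` the features pass through unchanged, so the same forward-pass code can serve both the normal and the fault-injected condition.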
An example of injecting weight faults under the adSAF fault model (SAF errors in RRAM cells).
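Analogously, the adSAF weight fault behavior can be approximated as stuck-at faults on the bits of quantized weights. The sketch below is illustrative only and assumes unsigned 8-bit weight codes; the function name, parameters, and fault probabilities are our assumptions, not the paper's exact setup.

```python
import numpy as np

def inject_adsaf(weights_q, p_sa0=0.05, p_sa1=0.05, n_bits=8, seed=None):
    """Sketch of stuck-at fault (SAF) injection into quantized weights.

    Each bit of each n_bits-quantized weight code is independently
    stuck at 0 with probability p_sa0 or stuck at 1 with probability
    p_sa1, emulating SAF defects in RRAM cells.
    """
    rng = np.random.default_rng(seed)
    w = weights_q.astype(np.int32).copy()
    for k in range(n_bits):
        bit = 1 << k
        sa0 = rng.random(w.shape) < p_sa0
        sa1 = rng.random(w.shape) < p_sa1
        w = np.where(sa0, w & ~bit, w)        # stuck-at-0 clears the bit
        w = np.where(sa1 & ~sa0, w | bit, w)  # stuck-at-1 sets the bit
    return w
```

Since the faults hit weight codes rather than activations, a single injection per forward pass (rather than per layer) is enough to emulate a defective crossbar mapping in this simplified view.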
The figure below illustrates the overall search framework of FTT-NAS. In the first step, we train a supernet with fault-tolerant training (FTT), minimizing the loss under both the normal condition and the fault-injected condition. In the second step, we conduct architecture search in the supernet to discover architectures that minimize both the normal loss and the fault-injected loss; a hyper-parameter controls the trade-off between the two. The architecture with the maximum reward is taken as the search result. In the third step, we apply fault-tolerant training to the discovered architecture.
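The two-term objective described above can be written as a simple convex combination of the clean loss and the fault-injected loss. The helper below is a minimal sketch; the function and parameter names are ours, and `alpha` stands in for the trade-off hyper-parameter mentioned above.

```python
def ftt_loss(loss_clean, loss_faulty, alpha=0.5):
    """FTT objective: trade off clean accuracy vs. fault resilience.

    alpha = 0 recovers ordinary training on clean inputs only;
    alpha = 1 trains purely on fault-injected forward passes.
    """
    return (1.0 - alpha) * loss_clean + alpha * loss_faulty
```

In a training loop, `loss_clean` comes from a normal forward pass and `loss_faulty` from a forward pass with faults injected into features or weights, after which the combined scalar is backpropagated as usual.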
A comparison of different architectures under the MiBB feature fault model is shown below. Our discovered architectures outperform the baselines by a large margin.
Comparison of different architectures under the MiBB feature fault model
A comparison of different architectures under the adSAF weight fault model is shown below. Our discovered architectures outperform the baselines by a large margin.
Comparison of different architectures under the adSAF weight fault model
@article{ning2021ftt,
title={FTT-NAS: Discovering fault-tolerant convolutional neural architecture},
author={Ning, Xuefei and Ge, Guangjun and Li, Wenshuo and Zhu, Zhenhua and Zheng, Yin and Chen, Xiaoming and Gao, Zhen and Wang, Yu and Yang, Huazhong},
journal={ACM Transactions on Design Automation of Electronic Systems (TODAES)},
volume={26},
number={6},
pages={1--24},
year={2021},
publisher={ACM New York, NY}
}