Deep learning (DL) has been applied widely, and the quality of DL systems has become crucial, especially for safety-critical applications. Existing work mainly focuses on the quality analysis of DL models, but pays little attention to the underlying libraries and frameworks on which all DL models depend. In this work, we propose Audee, a novel approach for testing DL libraries and localizing bugs. Audee adopts a search-based approach and implements three different mutation strategies to generate diverse test cases by exploring combinations of model structures, parameters, weights and inputs. Audee is able to detect three types of bugs: logic bugs, crashes and Not-a-Number (NaN) bugs. In particular, for logic bugs, Audee adopts a cross-reference check to detect behavioral inconsistencies across multiple frameworks (e.g., TensorFlow and PyTorch), which indicate potential bugs in their implementations. For NaN bugs, Audee adopts a heuristic-based approach to generate DNNs that tend to output outliers (i.e., extremely large or small values), since such values are likely to produce NaN values. Furthermore, Audee leverages a causal-testing-based technique to localize the layers and parameters that cause inconsistencies or bugs. To evaluate the effectiveness of our approach, we applied Audee to four DL frameworks, i.e., TensorFlow, CNTK, Theano, and PyTorch. In total, we generated 260 models, which cover 25 widely used APIs in the four frameworks. The results demonstrate that Audee is effective in detecting inconsistencies, crashes and NaN bugs. In total, 26 unique, previously unknown bugs were discovered, and seven of them have already been confirmed by the developers.
To evaluate the effectiveness of Audee and understand the root causes of inconsistencies and errors, we design large-scale experiments to answer the following research questions:
RQ1: How effective is Audee in detecting inconsistencies?
RQ2: How useful is Audee in localizing the layers and parameters that cause the inconsistencies?
RQ3: How effective is Audee in detecting NaNs?
RQ4: What are the root causes of inconsistencies and bugs? How many unique bugs were found by Audee?
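To make the notions behind RQ1 and RQ3 concrete, the following is a minimal sketch of a cross-reference check over the outputs that several backends produce for the same model and input. It is an illustration under stated assumptions, not Audee's implementation: the function names, the backend labels, the maximum absolute difference used as the distance, and the threshold of 1e-3 are all hypothetical choices made for this example.

```python
import math
from itertools import combinations


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length outputs."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same length")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def has_nan(output):
    """True if any value in the output vector is NaN."""
    return any(math.isnan(v) for v in output)


def cross_reference_check(outputs, threshold=1e-3):
    """Compare the outputs of one model across backends.

    outputs:   dict mapping a backend name to its output vector (list of floats).
    threshold: hypothetical tolerance; a pairwise difference above it is
               reported as an inconsistency.
    """
    # Backends whose output contains NaN are reported separately.
    nan_backends = sorted(name for name, out in outputs.items() if has_nan(out))
    inconsistent = []
    for (name_a, out_a), (name_b, out_b) in combinations(sorted(outputs.items()), 2):
        # NaN outputs cannot be compared meaningfully, so skip those pairs.
        if name_a in nan_backends or name_b in nan_backends:
            continue
        diff = max_abs_diff(out_a, out_b)
        if diff > threshold:
            inconsistent.append((name_a, name_b, diff))
    return {"nan": nan_backends, "inconsistent": inconsistent}


# Made-up outputs for four hypothetical backends running the same model.
report = cross_reference_check({
    "backend_a": [0.10, 0.70, 0.20],
    "backend_b": [0.10, 0.70, 0.20],
    "backend_c": [0.10, 0.55, 0.35],
    "backend_d": [float("nan"), 0.50, 0.50],
})
print(report)
```

In this made-up example, backend_c disagrees with both backend_a and backend_b, which suggests (but does not prove) that the fault lies in backend_c, and backend_d is flagged for producing NaN. A pairwise check of this kind only signals that an inconsistency exists; identifying the responsible layer or parameter is the separate localization step addressed by RQ2.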
Due to the page limit, further details on the bugs and bug reports, the API summary and seed DNNs, and the Top-1 change rate are available here: