Data Fault Localization for Deep Neural Networks

Abstract

Rich datasets have empowered various deep learning (DL) applications, leading to remarkable success in many fields. However, alongside these benefits, data faults hidden in the datasets can cause DL applications to behave unpredictably and may even lead to massive monetary losses and loss of life. To alleviate this problem, we propose a dynamic data fault localization approach, namely DFauLo, to locate mislabeled and noisy data in deep learning datasets. DFauLo is inspired by conventional mutation-based code fault localization, but utilizes the differences between DNN mutants to amplify and identify potential data faults.

Specifically, it first generates multiple DNN model mutants of the original trained DNN model, extracts features from these mutants, and maps the extracted features into a suspiciousness score indicating the probability that a given data item is a data fault. Moreover, DFauLo is the first dynamic data fault localization technique, prioritizing the suspected data based on user feedback and providing generalizability to data faults unseen during training. To validate DFauLo, we extensively evaluate it on 26 cases with various fault types, data types, and model structures. We also evaluate DFauLo on three widely used benchmark datasets. The results show that DFauLo outperforms the state-of-the-art techniques in almost all cases and locates hundreds of real data faults of different types in the benchmark datasets. Additionally, DFauLo can effectively purify DL datasets and further improve the performance of various DL applications.
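
The mutant-based scoring idea described above can be illustrated with a minimal sketch. It is not DFauLo's implementation: the feature choice (per-mutant disagreement with the given label and per-mutant loss), the logistic mapping, the hand-picked weights, and all function names are assumptions made for illustration, and the mutants' softmax outputs are supplied directly rather than produced by real DNN mutants.

```python
import math

def extract_features(label, mutant_probs):
    """Build one sample's feature vector from the mutants' class probabilities.

    Assumed features per mutant: a 0/1 flag for disagreeing with the given
    label, and the cross-entropy loss on the given label.
    """
    features = []
    for probs in mutant_probs:
        predicted = max(range(len(probs)), key=probs.__getitem__)
        features.append(1.0 if predicted != label else 0.0)
        features.append(-math.log(max(probs[label], 1e-12)))
    return features

def suspiciousness(features, weights, bias):
    """Map features to a score in (0, 1) with a logistic function (assumed)."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def rank_by_suspicion(labels, all_mutant_probs, weights, bias):
    """Return sample indices sorted from most to least suspicious, plus scores."""
    scores = [
        suspiciousness(extract_features(y, probs), weights, bias)
        for y, probs in zip(labels, all_mutant_probs)
    ]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Toy data: 3 samples, 2 hypothetical mutants, 2 classes.
labels = [0, 1, 0]
all_mutant_probs = [
    [[0.90, 0.10], [0.80, 0.20]],  # both mutants agree with label 0
    [[0.95, 0.05], [0.70, 0.30]],  # both mutants contradict label 1
    [[0.60, 0.40], [0.40, 0.60]],  # mutants split on label 0
]
weights = [1.0, 0.5, 1.0, 0.5]  # placeholder weights, chosen by hand
bias = -2.0

order, scores = rank_by_suspicion(labels, all_mutant_probs, weights, bias)
print(order)
```

In this toy setup the sample whose label every mutant contradicts is ranked first for inspection. A feedback-driven variant would re-fit the weights as a user confirms or rejects the top-ranked items, which this sketch does not attempt.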