DFauLo has several hyper-parameters, and different settings may affect the stability of its performance in application scenarios.
To analyze the impact of these parameters on the effectiveness of \tool, we select 7 subjects (\#ID 1, 6, 11, 16, 17, 24, 25) that cover diverse fault types, data types, and model structures. For each setting, we run DFauLo 30 times and report the average RAUC. In addition, denoting the results under the default setting as A and those under another setting as B, we conduct the Mann-Whitney U test \cite{nachar2008mann} to check whether A is significantly different from B.
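The comparison between A and B can be sketched as follows. This is a minimal illustration, not the evaluation script of \tool: it assumes a two-sided test computed with the normal approximation (tie and continuity corrections), and the RAUC values are synthetic placeholders generated only to make the example runnable. All function and variable names are illustrative.

```python
import math
import random


def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test (normal approximation).

    Returns (U statistic of sample a, approximate two-sided p-value).
    """
    n1, n2 = len(a), len(b)
    n = n1 + n2
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    rank_sum_a = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u_a = rank_sum_a - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:  # all observations identical: no evidence of a difference
        return u_a, 1.0
    z = max(abs(u_a - mean_u) - 0.5, 0.0) / math.sqrt(var_u)
    p_value = math.erfc(z / math.sqrt(2))  # equals 2 * (1 - Phi(z))
    return u_a, min(p_value, 1.0)


# Synthetic placeholder RAUC values for 30 repetitions per setting (NOT real results).
rng = random.Random(0)
rauc_default = [0.950 + rng.gauss(0, 0.002) for _ in range(30)]  # setting A
rauc_variant = [0.950 + rng.gauss(0, 0.002) for _ in range(30)]  # setting B

u_stat, p = mann_whitney_u(rauc_default, rauc_variant)
print(f"U = {u_stat:.1f}, p = {p:.3f}, significant at 0.05: {p < 0.05}")
```

In practice, a library routine such as \texttt{scipy.stats.mannwhitneyu} would normally be preferred over a hand-written implementation.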
Three groups of parameter settings are evaluated:
1) Retraining Epoch. (Default: 10) We investigate whether increasing or decreasing the number of retraining epochs affects the effectiveness of \tool;
2) Remove Ratio. (Default: $\alpha=5\%$) $\alpha$ controls the amount of data with sparse features that is discarded; it is a hyper-parameter of the input\&output layer mutant;
3) Iteration Batchsize. (Default: 200) The \model model of \tool is updated dynamically during the iterations of manual data review, and the iteration batchsize determines how frequently tasks are released and \model is updated.
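One way to organize such an experiment is to vary a single parameter at a time while keeping the others at their defaults. The sketch below assumes this one-at-a-time design. Only the three default values are taken from the text; the candidate values and all names are hypothetical and serve purely as illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Settings:
    retrain_epochs: int = 10    # default retraining epoch
    remove_ratio: float = 0.05  # default alpha = 5%
    batch_size: int = 200       # default iteration batchsize


# Hypothetical candidate values; only the defaults above come from the text.
CANDIDATES = {
    "retrain_epochs": [5, 10, 20],
    "remove_ratio": [0.01, 0.05, 0.10],
    "batch_size": [100, 200, 400],
}


def one_at_a_time(default, candidates):
    """Yield (parameter name, settings) pairs that differ from the default in one field."""
    for name, values in candidates.items():
        for value in values:
            if value != getattr(default, name):
                yield name, replace(default, **{name: value})


DEFAULT = Settings()
variants = list(one_at_a_time(DEFAULT, CANDIDATES))
for name, settings in variants:
    print(name, settings)
```

Each generated setting would then be run 30 times and compared against the default setting.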
Results.
The table below presents the experimental results. For almost all subjects and settings, the average RAUC changes by less than 0.001. The statistical analysis shows that \tool is not sensitive to the Retraining Epoch parameter, and that the impacts of Remove Ratio and Iteration Batchsize are also small.
Discussion.
We can draw the following conclusions from Table \ref{tab:RQ2-sensitivy}:
1) \tool does not require many epochs to generate mutated models, and even 5 epochs are sufficient to amplify the behavioral differences between clean and faulty data;
2) The remove ratio affects the effectiveness of \tool only to a small extent, and the best remove ratio differs between classification and regression datasets;
3) Increasing the iteration frequency (i.e., choosing a smaller batchsize) slightly improves the performance of \tool, but it also introduces more manual work in task release and aggregation. Testers should balance the iteration frequency against the required localization accuracy when choosing an appropriate batchsize.
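The cost side of this trade-off can be made concrete with simple arithmetic: for a fixed amount of reviewed data, the number of task releases (and \model updates) is the review budget divided by the batchsize, rounded up. The sketch below uses a hypothetical budget of 2000 reviewed items, a figure not taken from the experiments, and illustrative names.

```python
import math


def review_rounds(num_reviewed: int, batch_size: int) -> int:
    """Number of task releases needed to review num_reviewed items."""
    return math.ceil(num_reviewed / batch_size)


REVIEW_BUDGET = 2000  # hypothetical number of items to review manually

for batch_size in (100, 200, 400):
    print(batch_size, review_rounds(REVIEW_BUDGET, batch_size))
# prints: 100 20, then 200 10, then 400 5
```

Halving the batchsize from the default of 200 to 100 therefore doubles the number of release and aggregation rounds for the same review budget.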