Finding 1.1. The testing effectiveness (i.e., the number of detected bugs) is significantly affected by the task domains from which seeds are drawn.
For example, CRADLE fails to detect any DL library bugs in the NLP domain, and MUFFIN does not work in the ASR domain or on MXNet. Even within the same domain, existing approaches perform differently with different seeds, e.g., AUDEE failed to detect bugs with mutation seeds {N1, N2} but succeeded with seed C5.
Therefore, it is necessary to incorporate seeds from different task domains when generating test inputs.
Finding 1.2. The inputs that trigger bugs lack diversity.
Since most test inputs of existing approaches are generated from a given set of seeds via mutations or templates, the inputs are highly similar to one another (some are even duplicates). According to Table I, the higher the diversity of the test inputs, the more bugs are detected.
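To make the diversity argument concrete, the sketch below illustrates one simple way such a diversity score could be computed; it is not the metric used in Table I. It assumes, purely for illustration, that each generated model is represented by its layer-type sequence and that diversity is the average normalized pairwise edit distance over those sequences. The layer names and example seeds are hypothetical.

```python
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance between two layer-type sequences."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

def diversity_score(models):
    """Average normalized pairwise edit distance.
    0 means all inputs are identical; values near 1 mean highly diverse."""
    pairs = list(combinations(models, 2))
    if not pairs:
        return 0.0
    total = sum(edit_distance(a, b) / max(len(a), len(b), 1) for a, b in pairs)
    return total / len(pairs)

# Mutation-based generation tends to yield near-duplicates of the seed:
mutated = [
    ["Conv2D", "ReLU", "MaxPool", "Dense"],   # seed
    ["Conv2D", "ReLU", "MaxPool", "Dense"],   # duplicate
    ["Conv2D", "ReLU", "AvgPool", "Dense"],   # one-layer mutation
]
# Seeds drawn from different task domains differ far more:
cross_domain = [
    ["Conv2D", "ReLU", "MaxPool", "Dense"],    # image classification
    ["Embedding", "LSTM", "Dense"],            # NLP
    ["Conv1D", "BatchNorm", "GRU", "Dense"],   # ASR-style
]
print(f"mutation-based diversity: {diversity_score(mutated):.2f}")
print(f"cross-domain diversity:   {diversity_score(cross_domain):.2f}")
```

Under this (assumed) metric, the mutation-based set scores much lower than the cross-domain set, which mirrors the observation that mutating a handful of seeds produces highly similar, sometimes duplicated, test inputs.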