This website presents more detailed results of the empirical study. We applied DFauLo to sort the well-known official dataset and presented the sorted images to five independent crowdsourced workers for inspection. Each image was shown to the workers together with its corresponding label. We also sampled a reference set from the entire dataset and provided it to the workers for analysis.
Workers were asked to score each image on the scale {-2, -1, 0, 1, 2}, with lower scores indicating greater confidence that the image does not match its label. In addition, we suggested that workers record what is defective about any image they scored negatively.
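To make the scoring scheme concrete, the following is a minimal sketch of how one worker's judgment could be recorded and checked against the allowed scale. The `Annotation` class, its field names, and the per-image mean in `mean_score_per_image` are illustrative assumptions, not part of the study; the text above does not say how the five workers' scores were combined.

```python
from dataclasses import dataclass
from statistics import mean

# The five allowed scores described above.
VALID_SCORES = {-2, -1, 0, 1, 2}


@dataclass
class Annotation:
    """One worker's judgment of one image (hypothetical record layout)."""
    worker_id: str
    image_id: str
    score: int
    defect_note: str = ""  # optional note, intended for negative scores

    def __post_init__(self):
        # Reject anything outside the five-point scale.
        if self.score not in VALID_SCORES:
            raise ValueError(f"score must be one of {sorted(VALID_SCORES)}")


def mean_score_per_image(annotations):
    """Average the scores per image (an assumed aggregation, for illustration)."""
    by_image = {}
    for a in annotations:
        by_image.setdefault(a.image_id, []).append(a.score)
    return {image_id: mean(scores) for image_id, scores in by_image.items()}
```

Under this assumed aggregation, an image whose mean score is well below zero is one that most workers believed does not match its label.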