In the first case, 3 designs are retrieved for the control group and 10, 30, and 50 designs for the three experimental groups. Across the 10 participants, an average of 1.7/3 (56.7%) of the control-group designs are marked as related candidates. The designs from the experimental groups are marked as related to the query at a higher rate: 7.6/10 (76%), 19.8/30 (66%), and 48.8/50 (97.6%), respectively. For the largest experimental group, all ten participants consider more than 45 of the 50 designs useful, and 6 of them select all 50 designs as useful candidates.
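The percentages above are simply the mean number of designs marked as related divided by the number of designs retrieved. The following minimal sketch reproduces that arithmetic from the figures reported in the text; the group labels and the helper name `related_rate` are illustrative assumptions, not part of the study materials.

```python
# Mean number of designs marked as related and number of designs retrieved,
# per group, as reported in the text. The dictionary keys are illustrative labels.
groups = {
    "control (3)": (1.7, 3),
    "experimental (10)": (7.6, 10),
    "experimental (30)": (19.8, 30),
    "experimental (50)": (48.8, 50),
}


def related_rate(mean_marked, retrieved):
    """Return the share of retrieved designs marked as related, in percent."""
    return 100.0 * mean_marked / retrieved


# Round to one decimal place, matching the precision used in the text.
rates = {name: round(related_rate(m, n), 1) for name, (m, n) in groups.items()}

for name, rate in rates.items():
    print(f"{name}: {rate}%")
```

Note that the rate is normalized by the number of retrieved designs, so groups of different sizes can be compared directly.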
From the bar chart above, we can see that our method outperforms the control group in satisfaction (3.9/5, 4.1/5, and 5/5 vs. 2.7/5). In the control group, participants' satisfaction scores vary widely; one participant even rates the control-group results 1 out of 5. In contrast, most participants rate the experimental groups above 4 out of 5. In particular, all participants give the third experimental group the highest score.
For this case, the diversity rating of the experimental groups is clearly higher than that of the control group (2.4/5, 3.9/5, and 4.8/5 vs. 1.9/5). This is consistent with human perception, since the control-group results contain only 3 images, compared with 10, 30, and 50 images in the experimental groups. In the control group, only 3 out of 10 participants give a high diversity rating (3 or above), whereas in the experimental groups most participants do; in particular, 8 out of 10 participants give full marks to the experimental group with 50 designs.