Image captioning (IC) systems, including Microsoft Azure Cognitive Service, are widely used to convert image content into descriptive natural language. However, inaccurate captions can lead to serious misinterpretations. Testing techniques such as MetaIC and ROME have been developed to expose these issues, yet they face three notable challenges. First, they are labor-intensive, relying on detailed manual annotations such as object bounding boxes to construct test cases. Second, the realism of the generated images is compromised: MetaIC inserts unrelated objects, while ROME fails to remove objects cleanly. Third, the diversity of the generated test suites is limited. MetaIC can only insert specific objects to avoid overlap, whereas ROME can derive at most 3^n - 2^n test cases from a seed image containing n objects.
In this study, we present SPOLRE, a novel automated tool designed for Semantic Preserving Object Layout Reconstruction in image captioning system testing. SPOLRE is based on the insight that modifying the arrangement of objects within an image does not alter its inherent semantics. We utilize four semantic-preserving transformation techniques—translation, rotation, mirroring, and scaling—to modify object layouts autonomously, eliminating the need for manual annotation. This approach enables the creation of realistic and varied test suites for IC system testing. Our evaluation shows that more than 75% of survey respondents find the images produced by SPOLRE more realistic than those generated by state-of-the-art (SOTA) methods. Additionally, SPOLRE excels at identifying caption errors, detecting 31,544 incorrect captions across seven IC systems with an average precision of 91.62%. This significantly outperforms other methods, which achieve an average precision of only 85.65% and identify 17,160 incorrect captions. Notably, SPOLRE exposes 6,236 unique issues within Microsoft Azure Cognitive Service, highlighting its effectiveness against one of the most advanced IC systems available.
Figure 1: SPOLRE overview with two major parts: Image Processing and Text Processing.
For each input image, we apply semantic segmentation to identify the types of objects present and generate their corresponding masks.
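As an illustration of this step, the sketch below assumes the segmentation model has already produced a per-pixel class-label map; splitting that map into one binary mask per object class is then straightforward. The label map, class indices, and names here are toy stand-ins, not SPOLRE's actual data:

```python
import numpy as np

def masks_from_label_map(label_map, class_names):
    """Split a per-pixel class-label map into one binary mask per class.

    label_map: 2-D int array; each pixel holds a class index (0 = background).
    class_names: hypothetical mapping from class index to object name.
    """
    masks = {}
    for idx in np.unique(label_map):
        if idx == 0:          # skip background pixels
            continue
        masks[class_names[idx]] = (label_map == idx)
    return masks

# Toy 4x4 label map: class 1 = "dog", class 2 = "ball".
label_map = np.array([
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 2],
    [0, 0, 0, 2],
])
masks = masks_from_label_map(label_map, {1: "dog", 2: "ball"})
```

Each resulting mask can then be manipulated independently in the later layout-transformation step.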
We then apply a novel inpainting-based extraction algorithm that recursively extracts accurate object masks, ensuring precise object delineation.
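One plausible reading of this recursive loop is sketched below: segment, erase the detected objects by inpainting, then re-segment so that objects occluded in earlier rounds can surface. The `segment` and `inpaint` callables are hypothetical stand-ins for the underlying models, not any real API:

```python
def extract_masks(image, segment, inpaint, max_rounds=3):
    """Recursively extract object masks via segmentation plus inpainting.

    segment(image) -> {object_name: mask} is a stand-in for the segmentation
    model; inpaint(image, masks) -> image erases the given objects. Both are
    hypothetical interfaces used only to illustrate the control flow.
    """
    masks = {}
    for _ in range(max_rounds):
        found = segment(image)
        new = {k: v for k, v in found.items() if k not in masks}
        if not new:
            break                       # nothing further uncovered; stop
        masks.update(new)
        image = inpaint(image, new)     # remove found objects to reveal others
    return masks
```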
Utilizing the obtained object masks, we perform transformations based on metamorphic relations to generate a complete mask that reflects a varied layout.
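The four layout transformations can be sketched on binary masks as follows; the function names and signatures are illustrative, not SPOLRE's actual API:

```python
import numpy as np

def translate(mask, dy, dx):
    """Shift the object's pixels by (dy, dx); pixels moved off-canvas are dropped."""
    out = np.zeros_like(mask)
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    ys, xs = ys + dy, xs + dx
    keep = (0 <= ys) & (ys < h) & (0 <= xs) & (xs < w)
    out[ys[keep], xs[keep]] = True
    return out

def mirror(mask):
    """Flip the object horizontally."""
    return mask[:, ::-1]

def rotate(mask, quarter_turns=1):
    """Rotate the canvas counterclockwise by multiples of 90 degrees."""
    return np.rot90(mask, quarter_turns)

def scale(mask, factor):
    """Nearest-neighbour upscaling by an integer factor (the canvas grows too)."""
    return mask.repeat(factor, axis=0).repeat(factor, axis=1)

def compose_layout(shape, object_masks):
    """Union the per-object masks into the complete mask of the new layout."""
    complete = np.zeros(shape, dtype=bool)
    for m in object_masks:
        complete |= m
    return complete
```

Because each transformation only relocates or reorients an object, the set of objects in the complete mask — and hence the expected caption content — is unchanged.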
With the complete mask in hand, we employ a diffusion model to render a new image that aligns with the modified layout. This image is then processed by the IC system under test.
SPOLRE parses the generated captions using Part-Of-Speech (POS) tagging to identify objects and their quantities.
We then compare these elements with the ground truth. Any discrepancy in object types or counts is flagged as a violation.
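A minimal sketch of this checking logic; a real implementation would rely on a POS tagger, so the tiny noun and number lexicons below are toy stand-ins that keep the example self-contained:

```python
import re
from collections import Counter

# Toy stand-ins for a POS tagger's noun vocabulary and for number words.
NOUNS = {"dog", "dogs", "ball", "balls", "person", "people"}
NUMBERS = {"a": 1, "an": 1, "one": 1, "two": 2, "three": 3}
SINGULAR = {"dogs": "dog", "balls": "ball", "people": "person"}

def parse_caption(caption):
    """Extract object counts from a caption, e.g. 'two dogs' -> {'dog': 2}."""
    counts = Counter()
    pending = 1                         # quantity carried to the next noun
    for tok in re.findall(r"[a-z]+", caption.lower()):
        if tok in NUMBERS:
            pending = NUMBERS[tok]
        elif tok in NOUNS:
            counts[SINGULAR.get(tok, tok)] += pending
            pending = 1
    return counts

def violations(caption, ground_truth):
    """Return {object: (reported, expected)} for every type or count mismatch."""
    parsed = parse_caption(caption)
    diffs = {}
    for obj in set(parsed) | set(ground_truth):
        if parsed.get(obj, 0) != ground_truth.get(obj, 0):
            diffs[obj] = (parsed.get(obj, 0), ground_truth.get(obj, 0))
    return diffs
```

For example, against a ground truth of two dogs and one ball, the caption "a dog and a ball" would be flagged for under-reporting the dog count.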