Seeing is Fixing
Overview of GUIRepair.
The left shows the system input, and the right illustrates the step-by-step repair workflow.
2-1 Repository Checkout
Please check out all SWE-bench M task instances by running the following command:
>>> cd Data && python get_all_reproduce_scenario.py
2-2 Issue Resolution
You can run the repair workflow to generate patches for the SWE-bench M instance.
If you want to run the GUIRepair_base implementation using o4-mini as the base model, please run:
>>> cd Code && bash 1_run_GUIRepair_base_o4mini.sh
If you want to run the GUIRepair_full implementation using o4-mini as the base model, please run:
>>> cd Code && bash 4_run_GUIRepair_FULL_o4mini.sh
Similarly, you can run GUIRepair implementations with other base models:
>>> bash 4_run_GUIRepair_FULL_GPT4o.sh
>>> bash 4_run_GUIRepair_FULL_GPT4-1.sh
2-3 Result Evaluation
If you want to quickly replicate the experimental results of the paper, you can run the following scripts to perform the evaluation.
For SWE-bench Multimodal, evaluation can only be carried out via sb-cli; read the guideline to install sb-cli.
Then, you can run the evaluation scripts to get the repair results of o4-mini, GPT-4.1, and GPT-4o:
>>> cd Result && bash val_o4mini.sh
>>> cd Result && bash val_gpt4.1.sh
>>> cd Result && bash val_gpt4o.sh
Finally, you will receive the evaluation reports for o4-mini, GPT-4.1, and GPT-4o.
We release the patches generated by GUIRepair with different models, along with the corresponding evaluation reports.
In addition, we also release the reasoning traces of GUIRepair when solving SWE-bench M task instances with different models, which reveal how GUIRepair performs the repair workflow to solve each problem step by step.
Considering the poor readability of the images in the paper, we show two cases more clearly on this web page to clarify the key role of GUIRepair's two cross-modal transformation components.
To further clarify the contribution of the Image2Code component, here we provide a case study to analyze how the Image2Code component works.
As shown in Figure 1-a, we present the task instance eslint-15243, which GUIRepair$_{I2C}$ successfully resolved but GUIRepair$_{base}$ failed to fix. In Figure 1-a, GUIRepair$_{I2C}$ accurately located the buggy file \textit{cli.js} and generated a fix consistent with the developer patch, whereas GUIRepair$_{base}$ did not locate the correct buggy file, causing its fix to fail.
Further, we analyze the reasoning traces behind these two variants to reveal why GUIRepair$_{I2C}$ was able to locate the buggy file accurately. As shown in Figure 1-b, GUIRepair$_{base}$ reads the issue report directly to locate suspicious files in the codebase; however, since the model lacks project-specific knowledge, it struggles to accurately locate the code files relevant to the multimodal issue.
In contrast, in Figure 1-c, GUIRepair$_{I2C}$ first learns the project details in the knowledge mining phase, where it reads the key documents and learns the role of the cli.js file. Then, having acquired that project knowledge, the model considers cli.js a suspicious file when performing fault localization on the multimodal issue report. In short, GUIRepair$_{base}$ misses the chance to localize the buggy file due to its lack of project knowledge, while GUIRepair$_{I2C}$ compensates with the Image2Code component, which improves overall effectiveness.
Figure 1-a: The developer/GUIRepair_base/GUIRepair_I2C patch of eslint-15243.
Figure 1-b: The reasoning trace of GUIRepair_base to solve eslint-15243.
Figure 1-c: The reasoning trace of GUIRepair_I2C to solve eslint-15243.
Here we show a task instance next-4182 that only GUIRepair$_{full}$ can resolve. As shown in Figure 2-a, the issue report for this instance does not provide complete reproduction code, which leaves GUIRepair$_{C2I}$ unable to perform patch validation for patch selection. Furthermore, even though GUIRepair$_{I2C}$ can generate the repro code during the fault comprehension phase, it lacks the Code2Image module and therefore cannot fully utilize the generated repro. Hence, these variants cannot solve this case.
At the same time, we show the reasoning trace of GUIRepair$_{full}$ in Figure 2-b. Benefiting from the Image2Code component, the model successfully generates repro code (2️⃣ Repro Generation) for the bug scenario after learning project-specific knowledge (1️⃣ Knowledge Mining), which helps it locate the correct buggy file cascader-select.jsx (3️⃣ File Localization).
Then, the Code2Image component uses the previously generated repro code to capture the visual effect of the fixing behavior. When the validation iteration reaches Patch_8 (7️⃣ Patch Selection), the model observes that the buggy component renders properly on the web page and concludes that the patch resolves the issue scenario.
In this case, reproduction code generation is a key factor in solving the problem: it tightly links the Image2Code and Code2Image components to better understand the bug and validate the patch. Considering that most multimodal instances lack reproduction links, the full repair potential can only be unlocked by using both key components together. Overall, GUIRepair requires not only Image2Code to understand the multimodal problem and generate the reproduction code, but also Code2Image to use that code to capture the actual effect of the fixing behavior; the two are inextricably intertwined.
Figure 2-a: The issue report of next-4182. Note: left is the textual description, right is the image in the issue report.
Figure 2-b: The reasoning trace of GUIRepair_full to solve next-4182.
Implementation Details
In terms of parameter settings, we follow the experience of Agentless and Agentless Lite.
By default, we query the chat model with greedy decoding (i.e., temperature = 0).
In the 1️⃣ knowledge mining phase, we set the temperature for the chat model to 0, the sampling times to 1, and let the model return the Top-6 relevant documents. Then, we set the chunk size for the embedding model to 512, chunk overlap to 0, and select the Top-6 relevant documents retrieved by the model. Finally, we merge the results of the above two parts to get the final document list.
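The merging of the two document lists can be sketched as follows. This is a minimal illustration: the function name, the document names, and the merge order (chat-model results first) are our assumptions, not GUIRepair's actual implementation.

```python
def merge_document_lists(chat_docs, embed_docs):
    """Merge two ranked Top-6 document lists, deduplicating
    while keeping first-seen rank order."""
    merged = []
    for doc in chat_docs + embed_docs:
        if doc not in merged:
            merged.append(doc)
    return merged

# Hypothetical Top-6 results from the chat model and the embedding model.
chat_top = ["README.md", "docs/cli.md", "docs/config.md"]
embed_top = ["docs/cli.md", "docs/rules.md"]
final_docs = merge_document_lists(chat_top, embed_top)
# → ['README.md', 'docs/cli.md', 'docs/config.md', 'docs/rules.md']
```

The deduplication keeps each document once while preserving the rank each retriever assigned to it.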
In the 2️⃣ repro generation phase, we use the default chat model settings and one-shot prompting to generate the reproduction code. In particular, if complete reproduction code is provided in the issue report, GUIRepair uses it directly; otherwise, GUIRepair performs repro generation.
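The reuse-or-generate decision can be sketched as below. The fenced-code heuristic and all names are illustrative assumptions; the paper does not specify how GUIRepair detects existing reproduction code in an issue report.

```python
import re

def find_repro_code(issue_body):
    """Extract fenced code blocks from an issue report, if any.
    A heuristic stand-in for GUIRepair's repro detection."""
    return re.findall(r"```[\w-]*\n(.*?)```", issue_body, flags=re.DOTALL)

# Hypothetical issue report containing a complete repro snippet.
issue = "Dropdown crashes when opened:\n```jsx\n<Select open />\n```"
repro_blocks = find_repro_code(issue)
# If no repro code is found, fall back to model-based repro generation.
needs_generation = not repro_blocks
```

With this sketch, an issue that already ships a runnable snippet skips the generation step entirely.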
In the 3️⃣ file localization phase, we set the chat model's temperature to 1 and sampling times to 2 to obtain a diverse distribution, and ask the model to return all suspicious files. Besides, we keep the same settings for the embedding model as in the knowledge mining phase and return the Top-4 buggy files. Similarly, we merge the results of the above two parts to get the final file list. In particular, we set the maximum number of candidate buggy files to 4; if the final file list exceeds this maximum, we ask the chat model (default settings) to read the candidates and select the Top-4 key files to alleviate the context limit.
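The capped merge for the final file list can be sketched as follows. All names are illustrative, and where GUIRepair asks the chat model to select the Top-4 key files, this sketch simply truncates as a stand-in.

```python
MAX_BUG_FILES = 4  # maximum number of candidate buggy files

def build_file_list(chat_files, embed_files):
    """Merge the chat-model and embedding-model suspicious-file lists,
    deduplicate, and cap at MAX_BUG_FILES. (GUIRepair uses the chat
    model to pick the Top-4; truncation is a stand-in here.)"""
    merged = []
    for path in chat_files + embed_files:
        if path not in merged:
            merged.append(path)
    return merged[:MAX_BUG_FILES]

# Hypothetical suspicious files from the two retrievers.
chat_files = ["lib/cli.js", "lib/options.js", "lib/linter.js"]
embed_files = ["lib/options.js", "lib/init.js", "lib/rules.js"]
final_files = build_file_list(chat_files, embed_files)
```

The cap keeps the localized context small enough to fit the later prompts within the model's context limit.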
In the 4️⃣ hunk localization phase, we set the temperature to 0.7 and sampling times to 2 to cover potential buggy code snippets/hunks and provide adequate context (i.e., repair ingredients) for patch generation. In particular, for bug elements that are not inside a class/function, we set the context window to 500 lines to intercept the code hunk where the bug element is located.
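The 500-line interception can be sketched as below. The paper does not say whether the window is centered on the bug element, so a centered window clamped at file boundaries is our assumption, and all names are illustrative.

```python
CONTEXT_WINDOW = 500  # lines kept around a bug element outside any class/function

def extract_hunk(source_lines, bug_line, window=CONTEXT_WINDOW):
    """Return a window of lines around a 1-indexed bug line.
    Assumes a centered window, clamped at file boundaries."""
    start = max(0, bug_line - 1 - window // 2)
    end = min(len(source_lines), start + window)
    return source_lines[start:end]

# A hypothetical 1000-line file with a bug element at line 600.
source = [f"line {i}" for i in range(1, 1001)]
hunk = extract_hunk(source, bug_line=600, window=10)  # small window for display
```

With the default `window=500`, the hunk supplies roughly 250 lines of context on each side of the bug element.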
In the 5️⃣ patch generation phase, we use the greedy decoding strategy (default settings) to generate one patch and the multi-sampling strategy (temperature = 1, sampling times = 39) to return 39 more patches. Finally, we obtain at most 40 patch candidates.
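The candidate assembly (one greedy patch plus 39 sampled patches, at most 40 after deduplication) can be sketched as follows, with a stub in place of the real model call; all names are illustrative assumptions.

```python
import itertools

_counter = itertools.count()

def stub_model(prompt, temperature=0):
    """Stand-in for a real LLM call; cycles through 5 distinct patches."""
    return f"patch_{next(_counter) % 5}"

def generate_patch_candidates(model, prompt, n_samples=39):
    """One greedy patch (temperature 0) plus n_samples sampled patches
    (temperature 1), deduplicated: at most n_samples + 1 candidates."""
    patches = [model(prompt, temperature=0)]  # greedy decoding
    patches += [model(prompt, temperature=1) for _ in range(n_samples)]
    candidates = []
    for p in patches:
        if p not in candidates:
            candidates.append(p)
    return candidates

candidates = generate_patch_candidates(stub_model, "fix the dropdown bug")
```

Deduplication is why the pool is "at most" 40: samples that repeat the greedy patch or each other collapse into one candidate.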
In the 6️⃣ image capturing phase, we use fnm to manage different Node.js versions and npm/pnpm/yarn as the JavaScript package managers; we then use Playwright to replay and capture the issue/patch scenario in browsers.
In the 7️⃣ patch selection phase, we keep the default chat model settings and only submit the Top-1 valid patch to the SWE-bench evaluation platform.