Xuefei (Julie) Wang1 Kai A. Horstmann2 Ethan Lin2 Jonathan Chen2
Alexander R. Farhang1 Sophia Stiles1 Atharva Sehgal3 Jonathan Light4
David Van Valen1 Yisong Yue1 Jennifer J. Sun2
1 Caltech 2 Cornell 3 UT Austin 4 Rensselaer Polytechnic Institute
Adapting production-level computer vision tools to bespoke scientific datasets is a critical "last mile" bottleneck. Current solutions are impractical: fine-tuning requires large annotated datasets scientists often lack, while manual code adaptation costs scientists weeks to months of effort. We consider using AI agents to automate this manual coding, and focus on the open question of optimal agent design for this targeted task. We introduce a systematic evaluation framework for agentic code optimization and use it to study three production-level biomedical imaging pipelines. We demonstrate that a simple agent framework consistently generates adaptation code that outperforms human-expert solutions. Our analysis reveals that common, complex agent architectures are not universally beneficial, leading to a practical roadmap for agent design. We open source our framework and validate our approach by deploying agent-generated functions into a production pipeline, demonstrating a clear pathway for real-world impact.
Research Question: What is the simplest, most practical agent framework that can reliably adapt a fixed, pretrained production tool to a new, bespoke dataset?
We explored the following components of the agent design space (a minimal sketch of how they fit together follows the list):
Task Prompt: The initial problem specification, which is required for the coding agent to understand the goal and write the desired functions.
Coding Agent: An LLM-based agent whose role is to generate candidate functions.
Execution Agent: A module that receives the generated function, embeds it into the scientific workflow, executes the pipeline, and returns execution feedback and scores back to the coding agent.
Data Prompt: Context on the nature of the data.
API List: A list of relevant APIs with docstrings.
LLM Type: The model used for the Coding Agent, encompassing a wide range of options, varying in sizes, training focuses, development (open-source vs. closed-source), and providers.
Expert Functions: Inserts human-expert optimized functions into the prompt to study if they provide guidance and serve as effective in-context examples.
Function Bank: A persistent memory of the agent's previously generated functions. When enabled, selected previous functions are fed back into the agent's prompt to guide further exploration.
AutoML Agent: Adds an explicit hyperparameter search step. When enabled, this agent is invoked periodically to analyze generated functions from the function bank, identify optimizable parameters, and run a hyperparameter search to tune them.
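To make the interplay of these components concrete, the snippet below is a minimal, illustrative sketch of the resulting optimization loop. It is not our released framework: every identifier (`call_llm`, `run_pipeline_with`, `hyperparameter_search`, etc.) is a placeholder stub.

```python
"""Minimal, illustrative sketch of the agent loop; all identifiers are placeholders."""
from dataclasses import dataclass


@dataclass
class Candidate:
    code: str
    score: float
    feedback: str


def call_llm(prompt: str) -> str:
    """Coding Agent: query an LLM for a candidate adaptation function (stub)."""
    raise NotImplementedError


def run_pipeline_with(code: str) -> tuple[str, float]:
    """Execution Agent: embed the code in the workflow, run it, return (feedback, score) (stub)."""
    raise NotImplementedError


def hyperparameter_search(bank: list[Candidate]) -> Candidate:
    """AutoML Agent: mine the bank for tunable parameters and sweep them (stub)."""
    raise NotImplementedError


def build_prompt(task_prompt, data_prompt, api_list, expert_functions, bank):
    """Assemble the Coding Agent prompt from whichever components are enabled."""
    parts = [task_prompt, data_prompt]
    if api_list:
        parts.append("Relevant APIs:\n" + "\n".join(api_list))
    if expert_functions:
        parts.append("Expert reference functions:\n" + "\n".join(expert_functions))
    if bank:  # Function Bank: feed back top previously generated functions
        top = sorted(bank, key=lambda c: c.score, reverse=True)[:3]
        parts.append("Previously generated functions:\n" + "\n".join(c.code for c in top))
    return "\n\n".join(parts)


def optimize(task_prompt, data_prompt, api_list=None, expert_functions=None,
             n_iters=20, use_function_bank=True, automl_every=5):
    bank: list[Candidate] = []
    for i in range(n_iters):
        prompt = build_prompt(task_prompt, data_prompt, api_list, expert_functions,
                              bank if use_function_bank else None)
        code = call_llm(prompt)                     # Coding Agent proposes a function
        feedback, score = run_pipeline_with(code)   # Execution Agent scores it
        bank.append(Candidate(code, score, feedback))
        if automl_every and (i + 1) % automl_every == 0:
            bank.append(hyperparameter_search(bank))  # periodic AutoML refinement
    return max(bank, key=lambda c: c.score)
```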
We study three production-level biomedical imaging pipelines: MedSAM, Cellpose, and Polaris.
API Space Analysis: Polaris and Cellpose have concentrated spaces, dominated by a few strongly connected key APIs, while other APIs are used only occasionally. In contrast, MedSAM has a highly dispersed space, with strong edges spread more evenly across the graph.
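As an illustration, one plausible way to construct such an API-space graph (an assumed methodology, not necessarily the exact analysis used here) is to parse each generated function and count which APIs of interest co-occur within it:

```python
# Illustrative sketch: build an API co-occurrence graph from generated functions.
# `functions` is assumed to be a list of Python source strings drawn from the
# function bank, and `api_names` the set of API identifiers of interest.
import ast
from collections import Counter
from itertools import combinations


def api_calls(source: str, api_names: set[str]) -> set[str]:
    """Return which APIs of interest a generated function actually calls."""
    used = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            fn = node.func
            name = fn.attr if isinstance(fn, ast.Attribute) else getattr(fn, "id", None)
            if name in api_names:
                used.add(name)
    return used


def cooccurrence_graph(functions: list[str], api_names: set[str]) -> Counter:
    """Edge weight = number of generated functions in which two APIs appear together."""
    edges = Counter()
    for src in functions:
        used = sorted(api_calls(src, api_names))
        edges.update(combinations(used, 2))
    return edges
```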
Parameter Space Analysis: The parameter space analysis shows that the Polaris task is defined by a hard-to-optimize parameter space, driven specifically by the threshold_abs parameter of peak_local_max, where we found a drastic, systematic gap between the agent's suggested values and the optimal range. In contrast, all other tasks and functions had easy-to-optimize parameter spaces, with agent-proposed distributions aligning well with the optimal ones.
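For intuition, the sketch below shows the kind of grid sweep over `threshold_abs` that an AutoML step could run; the spot image, ground truth, and `score_detections` matcher are placeholders, not part of the actual Polaris pipeline.

```python
# Illustrative sketch of sweeping the hard-to-optimize `threshold_abs` parameter
# of skimage.feature.peak_local_max; `score_detections` is a placeholder metric.
import numpy as np
from skimage.feature import peak_local_max


def detect_spots(image: np.ndarray, threshold_abs: float, min_distance: int = 3) -> np.ndarray:
    """Candidate spot-detection step: returns (row, col) coordinates of local peaks."""
    return peak_local_max(image, min_distance=min_distance, threshold_abs=threshold_abs)


def sweep_threshold(image, ground_truth, score_detections,
                    thresholds=np.logspace(-3, 0, 25)):
    """Score each threshold_abs value against ground truth and return the best one."""
    scores = {t: score_detections(detect_spots(image, t), ground_truth) for t in thresholds}
    best = max(scores, key=scores.get)
    return best, scores
```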
This allows us to categorize our tasks as follows: MedSAM pairs a dispersed API space with an easy-to-optimize parameter space; Cellpose pairs a concentrated API space with an easy-to-optimize parameter space; and Polaris pairs a concentrated API space with a hard-to-optimize parameter space.
[Figures: analysis of solution diversity and length; analysis of the Polaris `threshold_abs` parameter; analysis of novel APIs and API usage with and without the API list]
Expert Functions. This component is highly beneficial for hard-to-optimize parameter spaces but detrimental to dispersed API spaces. Polaris saw a massive benefit from the added parameter information. Meanwhile, MedSAM (dispersed API space) was likely harmed because the component restricted its necessary exploration. Cellpose's easy-to-optimize parameter space meant it received no comparable boost, while its concentrated API space meant it was not harmed by the restriction, resulting in a moderate positive effect.
Reasoning LLM. The Reasoning LLM’s impact is best understood by its uneven exploratory behavior: it excelled at function diversity but failed at parameter search. This enhanced diversity is likely beneficial for MedSAM’s dispersed API space. Conversely, it is more constrained in its parameter choices, preventing it from finding the optimal parameters on Polaris. As before, the more neutral Cellpose task saw a moderate performance boost.
Function Bank. The Function Bank’s diversity-boosting effect acted as a double-edged sword. While it led to better performance for Cellpose and Polaris, it surprisingly hurt MedSAM scores. We observed that, in a dispersed space, the component encourages the agent to build progressively longer solutions, which may eventually become detrimental.
Data Prompt. Ablating the data prompt consistently worsened performance, which suggests that the data context is essential for agents to write suitable functions.
API List. Ablating the API list consistently improved scores. Further analysis showed that agents can discover novel APIs regardless of whether an API list is present, but do so more consistently without it, suggesting that the LLM's latent knowledge might be sufficient for solving these tasks. Additionally, providing the list appears to introduce a harmful bias, evidenced by unexpectedly high usage of specific APIs such as remove_small_objects and remove_small_holes. While these APIs are relevant and are used even when the list is omitted, their usage became strongly and disproportionately biased when the API list was provided. This suggests that the default choice should be to omit the list unless the required APIs lie beyond the LLM's intrinsic knowledge.
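To quantify such a bias, per-API usage rates can be compared between functions generated with and without the list; the sketch below (reusing the assumed `api_calls` helper from the API-space sketch above) illustrates one way this could be done.

```python
# Illustrative sketch: compare per-API usage rates between functions generated
# with vs. without the API list in the prompt. Reuses the api_calls() helper
# sketched earlier; both arguments are lists of generated source strings.
from collections import Counter

APIS_OF_INTEREST = {"remove_small_objects", "remove_small_holes", "peak_local_max"}


def usage_rates(functions: list[str]) -> dict[str, float]:
    """Fraction of generated functions that call each API of interest."""
    counts = Counter()
    for src in functions:
        counts.update(api_calls(src, APIS_OF_INTEREST))
    return {api: counts[api] / max(len(functions), 1) for api in APIS_OF_INTEREST}


def usage_bias(with_list: list[str], without_list: list[str]) -> dict[str, float]:
    """Ratio > 1 means the API is used disproportionately when the list is provided."""
    w, wo = usage_rates(with_list), usage_rates(without_list)
    return {api: (w[api] + 1e-9) / (wo[api] + 1e-9) for api in APIS_OF_INTEREST}
```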