We evaluated the generated Abstract Syntax Trees (ASTs). Because an LLM does not adhere to rules as strictly as a rule-based AST parser does, we engaged experts to assess the reasonableness of the generated ASTs. Minor discrepancies were tolerated in this evaluation.
We employed ChatGPT to identify matching code within codebases, and it ranked the expressions by their degree of similarity. Interestingly, we observed that ChatGPT occasionally generated expressions not present in the codebase, seemingly deriving them from previously encountered data, even though we stipulated that answers must originate from the codebase. Nevertheless, the correct answer was typically found among the top three expressions in its ranked list.
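The two checks described above (whether a returned expression actually occurs in the codebase, and whether the correct answer falls within the top three) could be sketched as follows. This is only an illustrative sketch, not the procedure we used: the helper names `normalize`, `split_by_presence`, and `hit_at_k`, the whitespace-only normalization, and the sample data are all assumptions introduced for the example.

```python
def normalize(expr: str) -> str:
    """Collapse runs of whitespace so trivially different spellings compare equal."""
    return " ".join(expr.split())


def split_by_presence(ranked, codebase_exprs):
    """Split a ranked list into expressions found in the codebase and those not found."""
    known = {normalize(e) for e in codebase_exprs}
    present = [e for e in ranked if normalize(e) in known]
    absent = [e for e in ranked if normalize(e) not in known]
    return present, absent


def hit_at_k(ranked, correct, k=3):
    """Return True if the correct expression is among the first k ranked items."""
    target = normalize(correct)
    return any(normalize(e) == target for e in ranked[:k])


# Hypothetical data, for illustration only.
codebase_exprs = ["a + b * c", "foo(x, y)", "x[i] - 1"]
ranked = ["foo(x,  y)", "bar(x)", "a + b * c"]  # "bar(x)" is not in the codebase

present, absent = split_by_presence(ranked, codebase_exprs)
found_in_top3 = hit_at_k(ranked, "a + b * c", k=3)
```

Here `absent` collects the expressions that the model produced without support in the codebase, while `found_in_top3` records whether the correct answer appears within the first three ranks. Exact matching after whitespace normalization is a deliberately strict criterion; a looser comparison would change which expressions count as present.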