To investigate RQ3, we extended the previous experiments by generating 10 additional tokens for each test suite, starting from the last token of the query. We explore how different token positions affect coverage analysis, on the assumption that they may exhibit varying coverage behaviors and thus reveal more comprehensive insights into model behavior. Our key question is: "How does generating additional tokens affect diversity assessment among test suites in LLM testing?"
A further goal is to identify the optimal token positions for LLM testing, reducing computational cost while still achieving effective testing results. These positions are crucial for determining the most informative moments at which to measure coverage, thereby optimizing the balance between testing effort and resource expenditure. After generating each new token, we calculated the corresponding coverage rates for the model, which allowed us to examine how coverage evolves as the model generates tokens beyond the initial query. As in the previous experiments, we quantified the impact of these tokens on coverage analysis through the RCG computed from NC and TKNC.
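The per-token measurement loop described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: `generate_next_token`, the 64-neuron activation probe, and the NC threshold of 0.5 are all hypothetical stand-ins for the actual model and coverage criteria.

```python
# Sketch: generate tokens one at a time and record a coverage rate after
# each generation step. All model internals here are toy stand-ins.
import random

def generate_next_token(context):
    # Hypothetical stand-in for an LLM forward pass: returns a token id
    # and the layer activations produced while generating it.
    random.seed(len(context))  # deterministic toy behavior
    token = random.randrange(1000)
    activations = [random.uniform(-1, 1) for _ in range(64)]
    return token, activations

def neuron_coverage(covered, activations, threshold=0.5):
    # NC-style bookkeeping: a neuron counts as covered once its
    # activation exceeds the threshold at any generation step so far.
    for i, act in enumerate(activations):
        if act > threshold:
            covered.add(i)
    return len(covered) / len(activations)

def coverage_per_token(query_tokens, num_new_tokens=10):
    # Starting from the last token of the query, append one generated
    # token at a time and record the cumulative coverage rate.
    covered = set()
    context = list(query_tokens)
    rates = []
    for _ in range(num_new_tokens):
        token, acts = generate_next_token(context)
        context.append(token)
        rates.append(neuron_coverage(covered, acts))
    return rates

rates = coverage_per_token([1, 2, 3], num_new_tokens=10)
```

Because the covered set only grows, the recorded rates are non-decreasing across token positions; comparing consecutive rates is one way to locate the positions where coverage gains flatten out.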
The RCG results are calculated based on NC and TKNC for different token positions in the target LLMs; each model generates 10 tokens per query.