This page demonstrates how EvoGen can efficiently identify test inputs that trigger recurrent generation.
Initialization
EvoGen first generates N random test inputs, each of length L, by sampling tokens from the vocabulary V. These form the initial population, which is evolved over subsequent generations.
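A minimal sketch of this step, assuming tokens are drawn uniformly at random from V (names such as initialize_population and vocab are ours, for illustration only):

```python
import random

def initialize_population(vocab, n_inputs, input_length, seed=0):
    """Build the initial population: n_inputs token sequences of length
    input_length, each token sampled uniformly from the vocabulary."""
    rng = random.Random(seed)
    return [
        [rng.choice(vocab) for _ in range(input_length)]
        for _ in range(n_inputs)
    ]
```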
Selection
In each iteration, every input in the population is evaluated multiple times to approximate the expected fitness score under the randomness of model sampling. The inputs with the highest average fitness scores are kept in the population; this is the selection step.
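A sketch of selection, under the assumption that the average is a plain arithmetic mean over repeated queries; fitness_fn here stands in for the fitnessScore described below:

```python
def select(population, fitness_fn, n_trials, n_survivors):
    """Keep the inputs with the best average fitness. Each input is
    scored n_trials times because model sampling is stochastic, so
    averaging approximates the expected fitness."""
    def avg_fitness(tokens):
        return sum(fitness_fn(tokens) for _ in range(n_trials)) / n_trials
    return sorted(population, key=avg_fitness, reverse=True)[:n_survivors]
```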
Offspring
Following selection, the chosen inputs generate offspring until the pool is replenished. The offspring are produced by mutating concatenations of randomly selected parents.
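A sketch of this step; the concatenate-then-mutate recipe follows the description above, while truncating children back to length L and using a per-token mutation probability are assumptions of ours:

```python
import random

def replenish(parents, vocab, input_length, pool_size, mutation_rate, seed=0):
    """Refill the pool with mutated children of random parent pairs.
    Each child concatenates two randomly chosen parents, is truncated
    back to input_length, and then has every token independently
    replaced by a random vocabulary token with probability
    mutation_rate."""
    rng = random.Random(seed)
    offspring = []
    while len(parents) + len(offspring) < pool_size:
        child = (rng.choice(parents) + rng.choice(parents))[:input_length]
        child = [
            rng.choice(vocab) if rng.random() < mutation_rate else token
            for token in child
        ]
        offspring.append(child)
    return parents + offspring
```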
The fitness function (fitnessScore) measures the token count of each generated response, prioritizing prompts that produce the longest outputs so that they remain in the population pool. Because it observes only the model's output, this strategy keeps high-performing candidates alive while remaining applicable in black-box optimization scenarios.
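In code, fitnessScore can be as simple as the following sketch; query_model is a placeholder for whatever black-box API serves the model under test, and count_tokens for its tokenizer:

```python
def fitness_score(prompt_tokens, query_model, count_tokens,
                  max_new_tokens=4096):
    """fitnessScore: generate one response and return its length in
    tokens. Longer responses score higher, steering the search toward
    prompts that trigger recurrent generation."""
    response = query_model(prompt_tokens, max_new_tokens=max_new_tokens)
    return count_tokens(response)
```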
We manually analyzed 200 randomly selected responses from the 2,631 prompts and found them to be highly repetitive. To quantify this repetition, we calculated the mean and standard deviation of the 4-gram self-similarity metric for both the 2,631 responses and a comparative sample of 5,000 responses from ShareGPT.
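The exact definition of the metric is not reproduced here; the sketch below assumes one common formulation, the fraction of repeated 4-grams in a response, where 0 means every 4-gram is unique and values near 1 indicate heavy repetition:

```python
from statistics import mean, stdev

def ngram_self_similarity(tokens, n=4):
    """Fraction of n-gram occurrences that repeat an earlier
    occurrence within the same sequence."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

def corpus_stats(responses, n=4):
    """Mean and standard deviation of self-similarity over a corpus
    of tokenized responses."""
    scores = [ngram_self_similarity(r, n) for r in responses]
    return mean(scores), stdev(scores)
```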
The table above presents the average number of attempts needed by EvoGen and a random generator to find recurrent samples. The random generator failed to find any such sample within 10,000 attempts for GPT-4o, GPT-4o-mini, and OpenAI's o1 and o3-mini. EvoGen achieves an average speed-up of at least 335% over the random generator.