Therefore, to construct the test suites listed in the table below, we select the following two widely used datasets and create one complementary dataset ourselves:
Alpaca-gpt4: Primarily used for fine-tuning LLMs, this dataset contains instruction-following tasks that emulate routine question-and-answer scenarios in everyday settings, and it serves as the source of normal queries for the test suites. Typical prompts include “Give three tips for staying healthy” and “What are the three primary colors?”.
JailBreakV: This dataset is tailored to assessing the robustness of LLMs against jailbreak attacks. Each entry pairs a rejected query with a corresponding attack query derived from it using an attack template~(designed to induce the model to output malicious content).
Synonymous Query Dataset: To construct synonymous queries, we use GPT-4 to generate a synonymous paraphrase for each of the first 500 queries in the Alpaca-gpt4 dataset, e.g., “What is the capital of France?” and “Name the capital city of France.” This complementary dataset serves as the source of synonymous queries for the test suites; a generation sketch follows this list.
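For concreteness, the snippet below sketches how such paraphrases can be generated; the prompt wording, model identifier, and client usage are illustrative assumptions rather than our exact generation pipeline.

```python
# Sketch: generating a synonymous paraphrase with GPT-4 (illustrative only).
# The prompt wording and model name below are assumptions, not our exact setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def paraphrase(query: str) -> str:
    """Ask GPT-4 for one paraphrase that preserves the query's meaning."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Rewrite the user's query as a synonymous paraphrase. "
                        "Keep the meaning identical and return only the rewritten query."},
            {"role": "user", "content": query},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()

# Example: the first Alpaca-gpt4 queries are paired with their paraphrases.
print(paraphrase("What is the capital of France?"))
# e.g., "Name the capital city of France."
```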
[Table: Distribution of test suites across datasets]
In this study, we comprehensively evaluate four well-known open-source LLMs that vary significantly in size, architecture, and origin: OPT-125M, Llama-2-7B-Chat, Pythia-12B, and Gemma-2-27B-it. This selection covers a broad spectrum of model characteristics, allowing us to observe coverage behavior across different dimensions; a loading sketch follows the model descriptions below.
OPT-125M: Developed by Meta AI, this foundational model has 125 million parameters and is specialized for generative text tasks; its training data are drawn primarily from CommonCrawl.
Llama-2-7B-Chat: From Meta, this 7-billion-parameter model is fine-tuned for chat applications, using supervised fine-tuning and reinforcement learning from human feedback (RLHF) to ensure safe and contextually relevant conversations.
Pythia-12B: Developed by EleutherAI, this 12-billion-parameter, research-oriented model is trained on the Pile dataset and intended for in-depth analysis of the behaviors and limitations of large models.
Gemma-2-27B-it: Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. They are English, text-to-text, decoder-only LLMs with open weights for both pre-trained and instruction-tuned variants, and are well suited to text generation tasks such as question answering, summarization, and reasoning. We use the 27-billion-parameter instruction-tuned variant, whose relatively small size allows deployment in resource-limited environments such as a laptop, desktop, or private cloud infrastructure.
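For reference, the sketch below shows how the four checkpoints might be loaded with Hugging Face transformers; the repository IDs are assumptions inferred from the model names, and device placement and quantization are omitted.

```python
# Sketch: loading the four evaluated checkpoints with Hugging Face transformers.
# The repository IDs below are assumptions inferred from the model names.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_IDS = [
    "facebook/opt-125m",              # OPT-125M
    "meta-llama/Llama-2-7b-chat-hf",  # Llama-2-7B-Chat (gated; requires access approval)
    "EleutherAI/pythia-12b",          # Pythia-12B
    "google/gemma-2-27b-it",          # Gemma-2-27B-it (gated; requires access approval)
]

def load(model_id: str):
    """Load a tokenizer/model pair; device placement is left to the caller."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
    return tokenizer, model
```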
We follow the settings of prior studies, with appropriate adjustments for practical application to LLMs. Below, we briefly describe the basic configuration of each coverage criterion used in our experiments and the rationale behind these choices. Note that our experiments focus on trends in coverage change rather than precise numerical values.
NC requires an activation threshold T to determine whether a neuron is activated. Because OPT-125M, Llama-2-7B-Chat, Pythia-12B, and Gemma-2-27B-it differ in size and activation functions, which significantly affects the distribution of neuron activations, we empirically set T to 0.1, 0.25, 0.75, and 50, respectively, as these values gave the best performance.
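For clarity, the sketch below illustrates how T enters the NC computation over pre-recorded activations; the flattened per-layer activation layout is a simplifying assumption, and the hooking code that records activations is omitted.

```python
# Sketch: Neuron Coverage (NC) over pre-recorded activations.
# `activations` maps layer name -> array of shape (num_inputs, num_neurons);
# this layout is an assumption for illustration, not our exact harness.
import numpy as np

def neuron_coverage(activations: dict[str, np.ndarray], T: float) -> float:
    covered, total = 0, 0
    for acts in activations.values():
        # A neuron counts as covered if it exceeds T on at least one input.
        covered += int((acts > T).any(axis=0).sum())
        total += acts.shape[1]
    return covered / total

# Example with the threshold used for Llama-2-7B-Chat (T = 0.25):
rng = np.random.default_rng(0)
fake_acts = {"layer0": rng.normal(size=(8, 16)), "layer1": rng.normal(size=(8, 16))}
print(neuron_coverage(fake_acts, T=0.25))
```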
TKNC requires a parameter K to determine the number of top neurons selected. For all models, we set K to 10.
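The sketch below shows how K is used, under the same assumed activation layout as the NC sketch above.

```python
# Sketch: Top-K Neuron Coverage (TKNC). For each input and each layer, the K
# most activated neurons are marked as covered; TKNC is the fraction of neurons
# covered at least once. The activation layout mirrors the NC sketch above.
import numpy as np

def top_k_neuron_coverage(activations: dict[str, np.ndarray], K: int = 10) -> float:
    covered, total = 0, 0
    for acts in activations.values():                 # (num_inputs, num_neurons)
        top_k = np.argsort(acts, axis=1)[:, -K:]      # indices of top-K per input
        covered += np.unique(top_k).size              # distinct neurons ever in the top-K
        total += acts.shape[1]
    return covered / total
```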
TKNP is similar to TKNC in that it also takes a top-K parameter K. However, our experiments show that, due to the scale of LLMs, setting K too high causes nearly every new input to form a new pattern; we therefore set K to 1.
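The following sketch illustrates why a large K is problematic: each input contributes one pattern (the tuple of its top-K neuron indices per layer), and with many layers and a large K virtually every input yields a previously unseen tuple. The activation layout is again an assumed simplification.

```python
# Sketch: Top-K Neuron Patterns (TKNP). Each input's pattern is the tuple of its
# top-K neuron indices in every layer; TKNP counts the distinct patterns seen.
import numpy as np

def top_k_neuron_patterns(activations: dict[str, np.ndarray], K: int = 1) -> int:
    num_inputs = next(iter(activations.values())).shape[0]
    patterns = set()
    for i in range(num_inputs):
        pattern = tuple(
            tuple(sorted(np.argsort(acts[i])[-K:].tolist()))  # top-K neurons per layer
            for acts in activations.values()
        )
        patterns.add(pattern)
    return len(patterns)
```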
TFC requires a parameter T that sets the minimum distance between different clusters. Here we again scale with model size, setting T to 5, 50, 500, and 1000, respectively.
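As a rough illustration, the sketch below shows the distance-thresholded clustering we assume TFC performs: an input opens a new cluster only if its activation vector lies farther than T from every existing cluster representative. Euclidean distance over a flattened activation vector is our simplifying assumption.

```python
# Sketch: TFC-style coverage. A new input raises coverage only if its activation
# vector is farther than T from every existing cluster representative.
import numpy as np

def tfc_coverage(activation_vectors: np.ndarray, T: float) -> int:
    """activation_vectors: (num_inputs, dim). Returns the number of clusters formed."""
    representatives: list[np.ndarray] = []
    for vec in activation_vectors:
        if all(np.linalg.norm(vec - rep) > T for rep in representatives):
            representatives.append(vec)
    return len(representatives)
```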
NLC does not require a preset parameter. However, since we do not have access to the complete training data that would normally serve as prior knowledge for NLC, we compute it directly on the different test suites.