Quality Assurance of LLMs
Quality assurance of deep learning-driven NLP software has recently garnered significant interest from both industry and academia. On one side, related research seeks to evaluate the trustworthiness of these models more thoroughly and comprehensively through empirical studies. On the other side, there is a concerted push toward devising advanced techniques to predict failures, identify ethical concerns, and enhance various capabilities of current models.
Regarding empirical evaluation, several benchmarks have been proposed, addressing factual consistency [137], [149], [173], [174], robustness [110], [175], toxicity [176], and hallucination [149], [156] in tasks like QA and text summarization. These benchmarks comprise datasets that are either human-labeled [149], [174], extracted from external resources [156], [173], transformed from other datasets [110], [175], or labeled/generated by AI models [137], [176]. While many studies target specific facets of AI models for select tasks, the multifaceted nature of LLMs warrants broader evaluations. Recent research delves into multiple capabilities of LLMs, encompassing faithfulness of QA [177], security of generated code [178] and its correctness [179], mathematical capabilities [180], and logical reasoning skills [181]. Notably, HELM [182] stands out as an important study in this domain: it conducts extensive tests across seven metrics in 42 scenarios for 30 language models, offering comprehensive insight into the current landscape of LLMs. DecodingTrust [9] is another prominent benchmark that assesses LLMs from diverse perspectives of trustworthiness. In our work, we select two salient tasks from this study: adversarial detection and OOD detection.
These empirical studies reveal that while LLMs excel at various tasks, they often lack trustworthiness and transparency. To tackle these shortcomings, recent studies suggest several promising directions, such as data-centric methods [183]–[186], uncertainty estimation [144], [187]–[192], controlled decoding [193]–[196], self-refinement [197]–[201], and leveraging external knowledge during inference [162], [202]–[206].
Data-centric approaches are model-agnostic and formulate related problems as unintended behavior detection. Typically, these methods gather data and train classifiers to identify undesired content. A notable instance is OpenAI’s moderation system, offered as an API service [185]. This system’s training data encompasses content related to sexuality, hate, violence, self-harm, and harassment. Uncertainty estimation, often lightweight and black-box in nature, uses uncertainty scores as indicators of the models’ trustworthiness. Manakul et al. [144], for instance, introduce a black-box hallucination detection technique based on token-level prediction likelihood and entropy, while Huang et al. [189] explore the efficacy of both single and multi-inference uncertainty estimation methods.
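To make these token-level scores concrete, the sketch below (a minimal illustration, not code from [144] or [189]; all names are ours) computes sequence-level uncertainty indicators from the model's per-token probability distributions, which are assumed to be available from the decoding step.

```python
import numpy as np

def token_uncertainty_scores(token_probs, chosen_ids, eps=1e-12):
    """Simple token-level uncertainty indicators for one generated sequence.

    token_probs: (T, V) array -- the per-step probability distribution over
                 the vocabulary (softmax output of the LLM).
    chosen_ids:  (T,) array   -- the token actually generated at each step.
    Higher scores suggest the model was less certain about its own output.
    """
    token_probs = np.asarray(token_probs, dtype=np.float64)
    chosen_ids = np.asarray(chosen_ids, dtype=int)

    # Negative log-likelihood of each generated token.
    chosen_p = token_probs[np.arange(len(chosen_ids)), chosen_ids]
    nll = -np.log(chosen_p + eps)

    # Entropy of the full predictive distribution at each step.
    entropy = -np.sum(token_probs * np.log(token_probs + eps), axis=1)

    return {
        "avg_nll": float(nll.mean()),        # average token "surprise"
        "max_nll": float(nll.max()),         # most surprising token
        "avg_entropy": float(entropy.mean()),
        "max_entropy": float(entropy.max()),
    }
```

Sequences with high maximum negative log-likelihood or entropy can then be flagged for further inspection, which is the general spirit of the detection methods above.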
While the above two approaches focus more on detection, the remaining three aim to directly improve the generated content. Controlled decoding techniques freeze the base LLM while guiding the text generation to achieve the desired attributes. Mireshghallah et al. [195], for example, propose energy-based models to steer the distribution of generated text toward desired attributes, such as unbiased content. Cao et al. [196] suggest employing dead-end analysis to reduce LLM toxicity. Drawing inspiration from human introspection, self-refinement methods have been introduced. Huang et al. [199] instruct LLMs to generate confident answers for unlabeled questions, which are then used in further training. Madaan et al. [200] suggest that LLMs critique and refine their own outputs. Lastly, LLMs augmented with external databases can address the "brain-in-the-vat" dilemma [207], leading to more accurate inferences. Examples include WIKI-based chatbots [204] and Retrieval-Augmented LLMs [205].
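As an illustration of the generate-critique-refine idea (a schematic sketch rather than the exact procedure of [200]; `llm` is a placeholder for any text-completion callable), the loop below asks the model to critique its own answer and rewrite it until no issues are reported.

```python
from typing import Callable

def self_refine(llm: Callable[[str], str], question: str, max_rounds: int = 3) -> str:
    """Schematic generate -> critique -> refine loop.

    `llm` is assumed to map a prompt string to a completion string and stands
    in for any chat/completion API.
    """
    answer = llm(f"Answer the question:\n{question}")
    for _ in range(max_rounds):
        critique = llm(
            "Review the answer below for factual errors or unsupported claims.\n"
            f"Question: {question}\nAnswer: {answer}\n"
            "Reply 'OK' if there are no issues, otherwise list the problems."
        )
        if critique.strip().upper().startswith("OK"):
            break  # the model considers its own answer acceptable
        answer = llm(
            f"Question: {question}\nPrevious answer: {answer}\n"
            f"Identified problems: {critique}\n"
            "Rewrite the answer so that these problems are fixed."
        )
    return answer
```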
Among the relevant studies, the works by Azaria et al. [81] and Li et al. [158] bear the closest resemblance to ours. Unlike the majority of the aforementioned approaches, which adopt black-box methodologies, these two works analyze the relationship between LLMs' internal states and their trustworthiness. Azaria et al. utilize the hidden-layer activations of LLMs as features to train a classifier for assessing the truthfulness of generated content. Li et al. first probe LLMs to find the correlation between truthfulness and attention heads and subsequently leverage this insight for inference-time intervention, aiming to produce more accurate responses. In contrast, our framework emphasizes holistic model extraction and stateful analysis, offering a more systematic exploration of the stateful characteristics inherent to LLMs.
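To give a flavor of the activation-probing idea (a simplified sketch under our own assumptions, not the exact setup of [81], which trains a small feed-forward classifier), the snippet below fits a linear probe on hidden-layer activations that are assumed to have been extracted beforehand, e.g., the last-token hidden state of a chosen layer for each labeled statement.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def train_truthfulness_probe(activations: np.ndarray, labels: np.ndarray):
    """Fit a linear probe mapping hidden activations to a truthfulness label.

    activations: (N, D) matrix -- one hidden-state vector per statement,
                 assumed to be extracted from a fixed layer of the LLM.
    labels:      (N,) array in {0, 1} -- 0 = false statement, 1 = true statement.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        activations, labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_train, y_train)
    print(f"Held-out probe accuracy: {probe.score(X_test, y_test):.3f}")
    return probe
```

A held-out accuracy well above chance would indicate that truthfulness-related information is linearly decodable from the chosen layer.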
Model-based Analysis for Stateful DNNs
Interpreting the behavior of stateful deep neural networks is challenging, considering the potentially countless concrete
states the model can reach and its near black-box nature. Fortunately, there are already some successful attempts for the RNN-series, a representative stateful architecture before the transformer era. Some theoretical research indicates that, while RNNs are Turing-complete [208], practical constraints such as finite precision and limited computation time render them equivalent to finite-state automata (FSA) [209], [210]. These insights potentially bridge the gap between the intricate black-box nature of RNNs and the well-understood FSAs, which have been rigorously examined in classical formal theory.
Interestingly, attempts to leverage FSAs for RNN analysis predate these theoretical explorations, originating as early as the 1990s. These studies first abstract the concrete (hidden) state space and then build FSAs that mirror the RNN's behavior. Omlin et al. introduce a method to segment the hidden state space into q equal intervals, with q being the quantization level [118]. Zeng et al. [119] and Cechin et al. [120] propose using K-means to cluster concrete states into abstract states. These pioneering efforts from the pre-deep learning era paved the way for subsequent model-based analysis of more sophisticated RNNs.
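The two abstraction strategies can be sketched as follows (an illustrative simplification with assumed inputs, not code from [118]–[120]): interval-based quantization discretizes each hidden dimension into q equal bins, while K-means groups concrete states into k abstract states.

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_states(hidden_states: np.ndarray, q: int = 4) -> np.ndarray:
    """Interval abstraction: min-max normalize each hidden dimension and split
    it into q equal bins; the per-dimension bin indices identify the abstract
    state of each concrete hidden state (one row per time step)."""
    lo, hi = hidden_states.min(axis=0), hidden_states.max(axis=0)
    normalized = (hidden_states - lo) / (hi - lo + 1e-12)
    return np.minimum((normalized * q).astype(int), q - 1)

def cluster_states(hidden_states: np.ndarray, k: int = 32) -> np.ndarray:
    """K-means abstraction: assign each concrete hidden state to one of k
    abstract states (its cluster id)."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(hidden_states)
```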
The advent of deep learning has ushered in two transformative shifts in the field: an influx of data and increasingly complex architectures. Model-based analysis has evolved accordingly. These efforts broadly fall into two categories: those that focus on extracting a transparent surrogate model replicating RNN decisions [56], [60], [122], [211]–[216], and those emphasizing transition traces with semantic meanings tied to downstream tasks [29], [54], [121], [126], [217].
For the former, one line of research takes a more formal approach to FSA extraction, such as using Angluin's L∗ algorithm [211] and its variant [212], or computing the Hankel matrix of a black-box system and constructing weighted automata from it [213]. These strategies treat RNNs as teachers and craft automata through querying. Alternatively, a more empirical path focuses on analyzing direct transition traces derived from training data. For example, Dong et al. first obtain symbolic states by clustering concrete hidden states and then build probabilistic automata with a learning algorithm [60]. Zhang et al. use similar methods to build symbolic states but enhance the context-awareness of the extracted model by compositing adjacent states [122]. Merrill et al. introduce an automata extraction technique based on state merging, which outperforms k-means [215]. Hong et al. utilize a transition path matching method, integrate the identified patterns with state merging, and offer a more systematic approach to constructing automata [216]. All these methods aim to extract automata that are maximally consistent with the source RNNs.
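The empirical extraction path can be illustrated with a simple sketch (our own simplification, not the algorithm of [60] or [122]): once each concrete trace is mapped to a sequence of abstract states (e.g., via the clustering above), a discrete probabilistic model is obtained by normalizing transition counts.

```python
import numpy as np

def build_probabilistic_automaton(abstract_traces, n_states: int) -> np.ndarray:
    """Estimate a transition matrix from abstract state traces.

    abstract_traces: iterable of sequences of abstract state ids in
                     range(n_states), one sequence per input to the model.
    Returns an (n_states, n_states) row-stochastic matrix P, where P[i, j] is
    the empirical probability of moving from abstract state i to state j.
    """
    counts = np.zeros((n_states, n_states), dtype=np.float64)
    for trace in abstract_traces:
        for src, dst in zip(trace[:-1], trace[1:]):
            counts[src, dst] += 1.0
    row_sums = counts.sum(axis=1, keepdims=True)
    # States that never appear as a source keep an all-zero row.
    return np.divide(counts, row_sums, out=np.zeros_like(counts),
                     where=row_sums > 0)
```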
Rather than creating an exact FSA mirroring a target DNN's behavior, stakeholders may prioritize specific properties of stateful software systems, such as security, safety, privacy, and correctness. Consequently, some studies focus on these specific properties and on insights obtained from the extracted FSA instead of seeking perfect decision alignment. For instance, DeepStellar [54] and its successor, Marble [121], delve into the adversarial robustness of RNNs using discrete probabilistic models. Conversely, AbASG [217] employs automata for adversarial sequence generation. DeepMemory [126] analyzes RNN memorization and its associated security and privacy implications using semantic Markov models. RNNRepair [29] repairs an RNN through model-based analysis and guidance. DeepSeer [218] employs finite automata as the central methodology in an interactive, human-centered design for RNN debugging. The diverse successes of these methods underscore the efficacy of model-based analysis for stateful DNN systems.
Our work differs from the above studies in two key aspects. First, we endeavor to develop a universal analysis framework that supports versatile property analysis across a broad spectrum of tasks in stateful DNN software systems in a plug-and-play manner. Second, our emphasis lies on the Transformer architecture and the corresponding LLMs. These models operate on a very distinct mechanism (e.g., the attention mechanism) and adhere to unique training workflows. Recent studies find that the Transformer has much greater empirical representation power than the LSTM in simulating pushdown automata, calling for adapted analysis methods [219]. On the other hand, various papers have pointed out that some important capabilities of the Transformer, including factual associations [220] and object identification [221], stem from propagating complex information through tokens, inherently exhibiting stateful characteristics. While some related studies have investigated enhancing language models with finite automata for improved performance [222] or constraining their outputs with DFAs [223], a comprehensive model-based analysis framework remains absent.
LLMs and Software Engineering
Recently, a growing body of research has shown that LLMs hold great potential in various phases throughout the software production lifecycle. Many researchers and industrial practitioners have investigated and examined the capabilities of LLMs for a large spectrum of applications in the software engineering domain, such as code generation [1], [224]–[227], code summarization [228]–[230], program synthesis [231], [232], test case generation [233]–[236], and bug fixing [237]–[241].
In particular, Dong et al. [227] leverage ChatGPT to build a self-collaboration framework for code generation, in which multiple LLM instances are assigned different roles (i.e., coder, tester, etc.) following a general software development schema. Such an LLM-powered self-collaboration framework achieves state-of-the-art performance in solving complex real-world code generation tasks. Ahmed et al. [228] investigate the effectiveness of few-shot training of an LLM (Codex [242]) for code summarization. Their experimental results confirm that leveraging data from the same project with few-shot training is a promising approach to improve the LLM's performance in code summarization. Nijkamp et al. [224] release a family of LLMs (CODEGEN) trained on both natural language and programming language data to demonstrate the ability of LLMs in program synthesis. In addition, Lemieux et al. [234] incorporate an LLM into the loop of search-based software testing (SBST) to improve test case generation for the programs under test. Last but not least, Sobania et al. [237] study the capability of ChatGPT in software bug localization and fixing. These works demonstrate the potential of LLMs as enablers and boosters to accelerate the software production lifecycle.
Despite the promising SE task-handling capabilities of LLMs, existing works [9], [63], [66], [155], [243], [244] have also pointed out that current LLMs can suffer from critical quality issues across different SE tasks. Specifically, developers sometimes find it hard to understand the code generation process and the code produced by LLMs, and LLMs have also exhibited incorrect behaviors, generating suboptimal or erroneous solutions [66]. Such concerns about the trustworthiness of LLMs and the quality of their outputs greatly hinder further adaptation and deployment of LLMs in safety-, reliability-, security-, and privacy-related SE applications. Moreover, although a large body of current work in the SE community focuses on leveraging LLMs to further promote and accelerate SE applications from different aspects, less attention has been paid to applying and adapting existing SE methodologies to safeguard the trustworthiness of LLMs. Given the fast-increasing adoption of LLM-based techniques at various key stages of the SE lifecycle, LLMs are expected to play an increasingly important role in the next few years. It is therefore of great importance to establish an early foundation toward a more systematic analysis of LLMs, to better interpret their behavior, to understand the potential risks of using them, and to equip researchers and developers with more tangible guidance (e.g., concrete analysis results and feedback) that facilitates the continuous enhancement of LLMs for practical usage. To bridge this gap and inspire further research along this direction, we hope that LUNA, as a basic analysis framework for LLMs, can help researchers and practitioners conduct deeper exploration and exploitation and design novel quality assurance solutions toward trustworthy LLMs in practice.