LUNA is a model-based analysis framework crafted to investigate the trustworthiness of LLMs. At a high level, LUNA includes four key stages: abstract model construction, semantics binding, model quality metrics, and practical application.
Abstract model construction. The first step is to build the abstract model, which plays a central role in our analysis framework. To enable universal analysis of LLMs, LUNA is designed to support an assortment of abstraction factors, i.e., dimension reduction (PCA), abstract state partition (grid-based and cluster-based partition), and abstract model types (DTMC and HMM).
Semantics binding. Given the abstract state space, an important step is to determine what information contained in the states can support the analysis process. Thus, after the abstract model is built, we bind semantics, i.e., the level of satisfaction of the LLM with respect to the specific trustworthiness perspective. The semantics of the model represent the behavior logic of the LLM and empower an in-depth analysis.
Model quality assessment. A crucial step before practical application is evaluating the quality of the constructed model. To do so, we leverage two sets of metrics: abstract model-wise metrics and semantics-wise metrics. We collect abstract model-wise metrics from existing works to measure the quality of the abstract model. To evaluate the quality of the semantics binding, we additionally propose semantics-wise metrics.
Practical application. LLMs can occasionally fabricate answers or generate erroneous outputs. To enhance the trustworthiness of LLMs, it is important to detect such abnormal behaviors. After constructing the abstract model, we utilize it for a common LLM analysis task, specifically, the detection of abnormal behaviors.
Taking both trustworthiness perspective-specific data and the subject LLM as inputs, we first profile the given model to extract the concrete states and traces, i.e., the outputs of the decoder blocks. Then, we leverage the extracted data to construct our abstract model. In this work, we mainly study two state-based models, DTMC and HMM, depicted in the figure below. The construction of these two models is described as follows.
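To make the profiling step concrete, the sketch below shows one way to extract per-token decoder block outputs with the HuggingFace transformers API; the checkpoint name and the concatenation of all blocks are illustrative assumptions, not a prescribed setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; any causal LM exposing hidden states works the same way.
tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")

inputs = tok("Which is denser, water vapor or air?", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple: the embedding output followed by one tensor
# per decoder block, each of shape (batch, seq_len, hidden_dim).
concrete_states = torch.cat(out.hidden_states[1:], dim=-1)
print(concrete_states.shape)  # (1, seq_len, 32 * 4096) for LLaMA-7b
```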
DTMC Construction
We outline the steps to construct the abstract DTMC, which comprise state abstraction and transition abstraction.
The state abstraction aims to build the abstract state space $\bar{S}$, which includes two steps: dimension reduction and state space partition.
The dimension of a concrete state equals the number of neurons in the decoder block outputs, which is typically too high to analyze directly. For instance, with 32 decoder blocks of 4,096 dimensions each, the hidden states for a single token in LLaMA-7b have 131,072 dimensions. Thus, we first apply dimension reduction to reduce the dimension of the concrete states and ease the analysis complexity. In particular, we leverage Principal Component Analysis (PCA) [130] to transform the original data into a $k$-dimensional vector that retains the most important patterns and variations in the original data.
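A minimal sketch of this step with scikit-learn's PCA is shown below; the target dimension $k$ and the placeholder data are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Concrete states stacked row-wise; real LLaMA-7b traces are 131,072-dimensional,
# we use a smaller placeholder here for brevity.
states = np.random.randn(1000, 4096).astype(np.float32)

k = 10  # target dimension, a hyperparameter of the abstraction
pca = PCA(n_components=k)
compressed = pca.fit_transform(states)      # shape: (1000, k)
print(pca.explained_variance_ratio_.sum())  # variance retained by k components
```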
Then, we perform state space partition to construct the abstract state space. We use two partitioning approaches that are commonly used in recent works [54], [121], [126]: grid-based partition and cluster-based partition. For grid-based partition, we first apply multi-step abstraction to incorporate information from nearby temporal steps. The abstraction is essentially created by sliding an $N$-step window over the trace; for instance, for $N = 2$, $\{s_i, s_{i+1}\}$ and $\{s_{i+1}, s_{i+2}\}$ are two different multi-step abstractions. Then, we apply grid partition: each dimension of the $k$-dimensional space is uniformly divided into $m$ grids, where $c^i_j$ denotes the $j$-th grid of the $i$-th dimension. The compressed concrete states that fall into the same grid are assigned to the same abstract state, i.e., $\bar{s} = \{s_i \mid s_i^1 \in c^1 \wedge \cdots \wedge s_i^k \in c^k\}$. For cluster-based partition, we utilize existing clustering algorithms, e.g., Gaussian Mixture Model (GMM) [131] and KMeans [132], to assign the compressed concrete states into $n$ groups, where each group is considered an abstract state.
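The following sketch illustrates both partition strategies on the PCA-compressed states, assuming scikit-learn; the grid count $m$ and cluster count $n$ are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

compressed = np.random.randn(1000, 3)  # PCA-reduced states, shape (num_tokens, k)

# Grid-based partition: split each dimension uniformly into m grids; the tuple
# of per-dimension grid indices identifies one abstract state.
m = 5
lo, hi = compressed.min(axis=0), compressed.max(axis=0)
inner_edges = [np.linspace(lo[d], hi[d], m + 1)[1:-1]
               for d in range(compressed.shape[1])]
grid_states = np.stack(
    [np.digitize(compressed[:, d], inner_edges[d])
     for d in range(compressed.shape[1])],
    axis=1,
)

# Cluster-based partition: each of the n clusters is one abstract state.
n = 10
kmeans_states = KMeans(n_clusters=n, n_init=10, random_state=0).fit_predict(compressed)
gmm_states = GaussianMixture(n_components=n, random_state=0).fit_predict(compressed)
```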
HMM Construction
HMM [133]–[135] is designed to capture the sequential dependencies within the data and can provide a probability distribution over possible sequences. Hence, we also choose HMM to model the hidden state traces.
The construction of the HMM is as follows. We first define the hidden state space $S$ by choosing the number of hidden states, and the observation space $\bar{O}$ as all abstract states seen in the abstract state space built during DTMC construction (Section 3.2.1). Then, we use the standard HMM fitting procedure, the Baum-Welch algorithm [136] (an Expectation-Maximization algorithm), to compute the transition probability $\bar{P}$, the emission function $\bar{E}$, and the initial state probability function $\bar{I}$. The Baum-Welch algorithm alternates between an expectation step, which calculates the conditional expectation given the observed traces, and a maximization step, which updates the parameters $\bar{P}$, $\bar{E}$, and $\bar{I}$ to maximize the likelihood of the observations. The fitted HMM can then be used to infer the most probable sequence of hidden states underlying a sequence of observed abstract states (e.g., via the Viterbi algorithm). The constructed HMM is capable of analyzing and predicting future text and outputs based on the probabilistic modelling of the historical data, i.e., the fitted $\bar{P}$, $\bar{E}$, and $\bar{I}$.
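A minimal sketch with the hmmlearn library is shown below, assuming the abstract states from the DTMC construction are encoded as integer observation symbols; the trace data and the number of hidden states are toy values.

```python
import numpy as np
from hmmlearn import hmm

# Toy observation sequences: abstract-state indices from the DTMC abstraction.
traces = [[0, 3, 2, 2, 1], [0, 1, 2, 4], [3, 2, 1, 1, 0]]

X = np.concatenate(traces).reshape(-1, 1)  # hmmlearn expects one 2D column
lengths = [len(t) for t in traces]

# Baum-Welch (EM) fitting of start, transition, and emission probabilities.
model = hmm.CategoricalHMM(n_components=4, n_iter=100, random_state=0)
model.fit(X, lengths)

print(model.startprob_)     # fitted initial state distribution
print(model.transmat_)      # fitted transition probabilities
print(model.emissionprob_)  # fitted emission function

# Viterbi decoding: most probable hidden-state sequence for a new trace.
logprob, hidden = model.decode(np.array([[0, 3, 2]]).T, algorithm="viterbi")
```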
To enable an effective quality analysis, we bind semantics, which reflect the LLM's performance regarding specific trustworthiness perspectives, to the abstract model.
Definition 3 (Semantics). The concrete semantics $\theta \in \mathbb{R}^n$ of a concrete state sequence $\tau_k = \langle s_i, \ldots, s_{i+k-1} \rangle$ represents the level of satisfaction of the LLM w.r.t. the trustworthiness perspectives.
Intuitively, the semantics reflect the condition of the LLM regarding the desired trustworthiness perspective. Assume $k = 1$; as shown in Figure 3, when the LLM falls in states $\bar{s}_0$, $\bar{s}_1$, $\bar{s}_2$, and $\bar{s}_3$, it is considered to be in a normal status, while state $\bar{s}_4$ is considered an abnormal state for the model. Moreover, we perform semantics abstraction to obtain the abstract semantics $\bar{\theta}$: we take the average of all concrete semantics within an abstract state as its abstract semantics. The essence of our semantics binding lies in its ability to align the internal states of an LLM with externally observable, task-specific behaviors. Such a semantics interpretation therefore acts as a bridge, connecting the abstract behavior captured by the model to the real-world implications of that behavior. Note that when $k = 1$, the sequence contains only one state; to ease the notation, we omit $k$.
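A minimal sketch of the semantics abstraction, assuming each concrete state already carries a scalar semantics score from the trustworthiness perspective (the values below are toy data):

```python
import numpy as np
from collections import defaultdict

# (abstract_state_id, concrete_semantics) pairs gathered during profiling.
records = [(0, 1.0), (0, 0.5), (1, 0.25), (2, 0.75), (1, 0.25)]

buckets = defaultdict(list)
for state, score in records:
    buckets[state].append(score)

# Abstract semantics: the average of all concrete semantics in each abstract state.
abstract_semantics = {s: float(np.mean(v)) for s, v in buckets.items()}
print(abstract_semantics)  # {0: 0.75, 1: 0.25, 2: 0.75}
```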
Once the abstract model is built, an important step before its concrete application is to assess the quality of the abstract model, typically by means of suitable metrics. In this work, we collect and summarize a set of metrics characterizing the quality of the model. At a high level, the metrics can be divided into abstract model-wise metrics and semantics-wise metrics. Below, we briefly introduce these metrics.
To evaluate the quality of the constructed model, we collect metrics that are widely used in the literature [54], [121], [126], [138], [139] to assess the model from diverse aspects, as displayed in Table 1. We refer to these as abstract model-wise metrics and categorize them into three types: basic, state-level, and model-level. Basic metrics comprise succinctness (the abstraction level of the state space), coverage (how many newly encountered states are unseen in the state space), and sensitivity (whether abstract states differ under small perturbations). State-level metrics cover state type classification, e.g., sink states [124], which helps identify properties of the Markov model, e.g., absorbability (whether the Markov chain cannot escape some undesirable states) [140]. For model-level metrics, we compute the stationary distribution entropy [139] and perplexity [141], which reflect the stability of the model and how well it fits the training distribution, respectively.
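As an illustration of the model-level metrics, the sketch below computes the stationary distribution entropy and a per-trace perplexity for a DTMC transition matrix; the exact formulations in Table 1 may differ, so treat this as an assumption-laden approximation.

```python
import numpy as np

def stationary_entropy(P):
    """Entropy of the stationary distribution pi satisfying pi = pi @ P."""
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])  # eigenvector for eigenvalue 1
    pi = np.abs(pi) / np.abs(pi).sum()                    # normalize to a distribution
    return float(-(pi * np.log(pi + 1e-12)).sum())

def trace_perplexity(P, trace):
    """exp of the average negative log transition probability along a trace."""
    logps = [np.log(P[a, b] + 1e-12) for a, b in zip(trace[:-1], trace[1:])]
    return float(np.exp(-np.mean(logps)))

P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])
print(stationary_entropy(P), trace_perplexity(P, [0, 0, 1, 2, 2]))
```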
Note that the abstract model-wise metrics do not involve the semantics, which capture the level of satisfaction w.r.t. trustworthiness. To our knowledge, however, little work provides general metrics to measure the quality of an abstract model in terms of semantics. To fill this gap, we propose semantics-wise metrics, as shown in Table 2. The semantics-wise metrics are divided into basic, trace-level, and surprise-level. Basic semantics-wise metrics comprise semantics preciseness (the average preciseness of the abstract semantics) and semantics entropy (the randomness and unpredictability of the semantics space). Trace-level metrics quantify how the semantics change temporally, including value diversity (instant value and n-gram value) and derivative diversity (n-gram derivative) [142]. Surprise-level metrics evaluate how surprising a change of the semantics is by means of Bayesian reasoning [143].
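As one concrete instance, semantics entropy can be estimated by discretizing the bound semantics values and computing Shannon entropy; the binning scheme below is an illustrative choice rather than the paper's exact formulation.

```python
import numpy as np

def semantics_entropy(values, bins=10):
    # Histogram the semantics values, then compute Shannon entropy (in bits);
    # higher entropy indicates a more random, less predictable semantics space.
    hist, _ = np.histogram(values, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

print(semantics_entropy([0.12, 0.23, 0.01, 0.15, 0.85]))
```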
Recent works demonstrate that abstract models have extensive analysis capability for stateful DNN systems [29], [54], [126]. Here, to validate the practicality of our constructed abstract models, we mainly apply them to abnormal behavior detection, a common analysis demand for LLMs [144], [145]. As introduced in Section 2, abnormal behavior refers to unintended expressions of the LLM, e.g., making up answers or generating biased output [146]–[150]. To detect such behavior, we leverage the abstract model together with the semantics. The detection procedure is as follows. Given an output text and the abstract state trace $\{\bar{s}_1, \ldots, \bar{s}_n\}$, we first acquire the corresponding semantics trace $\{\bar{\theta}(\bar{s}_1), \ldots, \bar{\theta}(\bar{s}_n)\}$. Then, we compute a semantics score as the mean of the semantics sequence, namely $\mathrm{AVG}(\bar{\theta}(\bar{s}_i))$. Finally, we compare the computed score against the ground truth to determine the performance of classifying the output text as normal/abnormal behavior.
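The sketch below shows the detection procedure end to end, under the assumption of a fixed decision threshold tuned on labeled data; the helper name and threshold value are hypothetical.

```python
import numpy as np

def detect_abnormal(abstract_trace, abstract_semantics, threshold=0.1):
    """Classify an output text as abnormal from its abstract state trace.

    abstract_trace: abstract state ids [s1, ..., sn] for the output text
    abstract_semantics: mapping from abstract state to bound semantics value
    threshold: hypothetical decision boundary, tuned against ground-truth labels
    """
    sem_trace = [abstract_semantics[s] for s in abstract_trace]
    score = float(np.mean(sem_trace))  # AVG over the semantics sequence
    return score, score < threshold   # low satisfaction => abnormal

# Abbreviated version of the Figure 4 example (middle states omitted, so the
# score differs from the 0.04 reported for the full trace).
semantics = {199: 0.12, 5: 0.23, 159: 0.01, 4: 0.15}
score, is_abnormal = detect_abnormal([199, 5, 159, 4], semantics)
```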
Here, we provide a running example of hallucination detection, as shown in Figure 4, to show how we use the abstract model to detect abnormal behaviors. The prompt to the LLM is "Which is denser, water vapor or air?", and the LLM answers "Water vapor is denser than air." The corresponding abstract state sequence is $\bar{s}_{199} \to \bar{s}_5 \to \cdots \to \bar{s}_{159} \to \bar{s}_4$, and the semantics sequence is $0.12 \to 0.23 \to \cdots \to 0.01 \to 0.15$. The computed semantics score is 0.04, and we identify the answer as abnormal behavior. Moreover, we can see that state $\bar{s}_{159}$ is an abnormal state, which indicates that the LLM becomes abnormal at the word "denser". Such semantics-based LLM behavior interpretation enables a human-understandable way to explain and analyze the quality of the LLM w.r.t. different trustworthiness perspectives. It is worth noting that our framework is designed to be adaptable to various practical applications (e.g., OOD detection, adversarial attack detection, etc.).
More concrete examples can be found in the Concrete Application section.