On this page, we highlight two examples, OOD Detection and Adversarial Attack, to demonstrate how LUNA works on these tasks. For the hallucination example, please see the detailed discussion in our paper.
In our OOD detection example, shown in the provided figure, the sentiment shifts after an out-of-distribution sample undergoes a Shakespearean-style mutation. The resulting abstract state sequence progresses from s6 to s10, accompanied by a declining semantics sequence that ends in a semantics score of 0.34. This is well below the in-distribution sample's score of 0.78. The gap between the two scores shows how our framework can discern and flag out-of-distribution instances. This semantics-based view of model behavior offers a clear, human-comprehensible way to evaluate the model's reliability from diverse perspectives, whether for OOD detection or other applications.
The second example is adversarial behavior detection. Given the query to the LLM, "What spacecraft did the Soviets use to send animals to space and around the moon's orbit?", the original response was well-founded and accurate, citing the Soviet Union's use of Zond 5 on September 15, 1968, to send animals around the Moon. The LLM's confidence score was 1, and the semantics score, derived from the abstract state sequence and the semantics sequence of 0.36, 0.31, and 0.92, was 0.5.
When given the adversarial input, "the soviet union throw station animals around the moon on september 15 , 1968 , aboard zond 5 , and it was think they might presently ingeminate the exploit with human cosmonaut.", the LLM failed to recognize the distortions in the phrasing and returned a confidence score of 0. The accompanying abstract state sequence pinpointed states s33 and s19 as significant deviations from the expected behavior, with s33 identified as an anomalous state. The semantics sequence of 0.01 and 0.82 led to a combined semantics score of 0.31.
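Both examples come down to the same comparison: a perturbed sample receives a noticeably lower semantics score than its reference sample. As a minimal sketch of that comparison only, and not LUNA's actual implementation, the snippet below flags a sample when its score drops by at least a fixed margin. The function names `score_drop` and `is_flagged` and the `min_drop` threshold of 0.15 are hypothetical and chosen purely for illustration. The scores themselves are the ones quoted above. How LUNA aggregates a semantics sequence into a single score is described in our paper and is not reproduced here.

```python
def score_drop(reference: float, observed: float) -> float:
    """Return how far the observed semantics score falls below the reference."""
    return reference - observed


def is_flagged(reference: float, observed: float, min_drop: float = 0.15) -> bool:
    """Flag a sample whose score is at least `min_drop` below the reference.

    `min_drop` is a hypothetical threshold for illustration, not a LUNA setting.
    """
    return score_drop(reference, observed) >= min_drop


# (reference score, observed score) pairs quoted in the two examples.
examples = {
    "ood_detection": (0.78, 0.34),       # in-distribution vs. mutated sample
    "adversarial_attack": (0.5, 0.31),   # original vs. adversarial input
}

for name, (reference, observed) in examples.items():
    drop = round(score_drop(reference, observed), 2)
    print(f"{name}: drop={drop}, flagged={is_flagged(reference, observed)}")
```

A fixed margin is the simplest possible rule. In practice, a threshold would need to be calibrated on held-out in-distribution data.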