Discussion
Abstract Model Construction for LLM
Prior studies have demonstrated that a well-constructed abstract model can serve as an indicator that reveals the internal behavior of the target neural network model [29], [54], [57], [121], [126]. Adequate model construction techniques are vital for faithfully reflecting the corresponding characteristics of the studied system. Nevertheless, given the very large model size and the distinctive self-attention mechanism of LLMs, it remains unclear to what extent existing methods are effective on LLMs. Hence, our framework, LUNA, integrates three state abstraction methods and two model construction techniques, with a total of 180 different parameter configurations, to extensively explore the effectiveness of popular model-based analysis approaches.
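To make this pipeline concrete, below is a minimal sketch of how per-token hidden states could be abstracted with one of the cluster-based methods; the PCA reduction, the component count, the cluster count and the function names are illustrative assumptions rather than LUNA's evaluated configurations.

```python
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def abstract_traces(hidden_states, trace_lengths, n_dims=10, n_states=50):
    """Map concrete hidden states to abstract state IDs, trace by trace.

    hidden_states: (N, D) array of per-token hidden vectors, all traces
        concatenated along the first axis.
    trace_lengths: per-trace token counts summing to N.
    """
    # Reduce dimensionality, then partition the reduced space into
    # abstract states via clustering (KMeans here; GMM is analogous).
    reduced = PCA(n_components=n_dims).fit_transform(hidden_states)
    labels = KMeans(n_clusters=n_states, n_init=10).fit_predict(reduced)
    # Re-split the flat label sequence back into per-trace state sequences.
    traces, start = [], 0
    for length in trace_lengths:
        traces.append(labels[start:start + length].tolist())
        start += length
    return traces
```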
From the evaluation results, we find that the cluster-based state partition methods (KMeans, GMM) and the grid-based method have distinct advantages on different model quality measurement metrics. Meanwhile, in terms of model construction methods, DTMC exhibits performance close to or beyond that of HMM on most metrics, which implies that it is a strong candidate for modelling the state transition features of LLMs. It is worth noting that the efficacy of the abstraction and modelling techniques varies across tasks and trustworthiness perspectives. For instance, KMeans achieves superior scores on Succinctness and Coverage on both the TruthfulQA and SST-2 datasets but relatively inadequate performance on the AdvGLUE++ dataset. This finding signifies that an explicit selection of methods and appropriate parameter tuning are necessary to maximize the effectiveness of existing abstract model construction techniques. Therefore, advanced and LLM-specific abstract model construction techniques are called for to capture and represent the behavior characteristics of LLMs regardless of the type of task or trustworthiness perspective.
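As a companion sketch, a DTMC over these abstract states can be estimated by maximum likelihood from the observed transitions; the optional smoothing constant below is our own assumption, not a setting from the paper.

```python
import numpy as np

def build_dtmc(traces, n_states, smoothing=0.0):
    """Estimate a DTMC transition matrix from abstract state traces."""
    counts = np.full((n_states, n_states), smoothing, dtype=float)
    for trace in traces:
        # Count every consecutive state pair (s_t, s_{t+1}).
        for src, dst in zip(trace, trace[1:]):
            counts[src, dst] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows with no outgoing transitions stay all-zero; these are
    # candidate "sink states" in the metric discussed below.
    return np.divide(counts, row_sums, out=np.zeros_like(counts),
                     where=row_sums > 0)
```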
Abstract Model Quality Measurement
In this work, we select as many as 12 metrics to establish a relatively comprehensive understanding of the quality of the constructed model from both abstract model-wise and semantics-wise angles. In particular, abstract model-wise metrics assess the intrinsic properties of the model regardless of the subject trustworthiness perspective, such as the stability of the model and how well it fits the distribution of the training data. We notice that Coverage and Succinctness, which measure the level of compression of the abstract model, provide more insights for dimension reduction and abstract state partition. Moreover, Stationary Distribution Entropy, Perplexity and Sink State are more useful for guiding the selection of model construction methods and subsequent parameter tuning. Such metrics help to enhance the quality of the model towards better fitting of the training distribution and robustness against small perturbations.
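For illustration, two of these model-wise signals can be computed from a DTMC transition matrix using the standard definitions of stationary distribution (via power iteration) and Shannon entropy; LUNA's exact formulations may differ.

```python
import numpy as np

def stationary_distribution_entropy(P, n_iters=1000, tol=1e-10):
    """Shannon entropy of the (approximate) stationary distribution of P,
    estimated by power iteration from a uniform start."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(n_iters):
        nxt = pi @ P
        if np.abs(nxt - pi).sum() < tol:
            break
        pi = nxt
    pi = pi / pi.sum()  # renormalize in case mass leaked at all-zero rows
    nz = pi[pi > 0]
    return float(-(nz * np.log(nz)).sum())

def sink_states(P):
    """Indices of abstract states with no outgoing probability mass."""
    return np.where(P.sum(axis=1) == 0)[0]
```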
In contrast, semantics-wise metrics measure the quality of the model in terms of the degree of satisfaction w.r.t. trustworthiness perspectives. In particular, from Section 4.5, we notice that Preciseness, Entropy and n-gram Value Trend correlate more strongly with the performance of the model across different trustworthiness perspectives. Some metrics may suit certain applications better than others. For example, Surprise Level and n-gram Derivative Trend are more effective at describing the quality of the model for adversarial and hallucination detection.
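As an example of how such a semantics-wise signal might be operationalized, the sketch below scores a trace by the average negative log-probability of its transitions under the abstract model, which is one plausible reading of a surprise-level style metric; the paper's precise definition may differ.

```python
import numpy as np

def surprise_level(trace, P, eps=1e-12):
    """Average negative log transition probability along one abstract trace.
    Higher scores mean the abstract model finds the trajectory more
    unexpected, the kind of signal used to flag suspicious generations."""
    log_probs = [np.log(P[src, dst] + eps)
                 for src, dst in zip(trace, trace[1:])]
    return float(-np.mean(log_probs)) if log_probs else 0.0
```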
In general, different metrics are needed to collaboratively guide the construction of the abstract model and secure its quality from diverse aspects. Also, some metrics are potentially better suited to specific downstream tasks or trustworthiness perspectives; thus, more research is needed to identify suitable metrics for particular applications or quality requirements.
Model-based Quality Assurance for LLM
The fast-growing popularity of LLMs highlights their escalating influence across academia and industry [5], [7], [162]. Witnessing the adoption of LLMs in a large spectrum of practical applications, LLMs are expected to serve as foundation models to boost the software development lifecycle, in which trustworthiness is critical. Namely, quality assurance techniques designed explicitly for LLMs are urgently needed to enable the deployment of LLMs in more safety-, reliability-, security- and privacy-related applications.
Our framework LUNA aims to provide a general and versatile platform that assembles various modelling methods, downstream tasks and trustworthiness perspectives to safeguard the quality of LLMs. Moreover, considering the extensibility of the framework, LUNA is expected to serve as a foundation that enables follow-up research to implement new advanced techniques for more diverse tasks across different domains. The results from Section 4 confirm that the abstract model can act as a beacon to disclose abnormalities in the LLM when it generates responses to different inputs. Specifically, the abstract model extracts and inspects the inner behavior of the LLM to detect whether it is under unintended conditions that can possibly produce nonfactual or erroneous outputs. The model embeds semantics w.r.t. different trustworthiness perspectives to extend its capability to tackle diverse quality concerns. In addition, we believe our framework can play a role in a broad range of quality assurance directions, such as online monitoring [163]–[166], fault localization [167], [168], test case generation [19], [21], [169], [170] and output repair [28], [171], [172]. For instance, by leveraging the trajectories of the states and the corresponding semantics w.r.t. a specific output, it is possible to trace back and precisely localize the faulty segments within the output tokens.
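As a speculative illustration of this trace-back idea, one could flag the output token positions whose abstract-state transitions are improbable under the DTMC; the threshold below is an illustrative assumption, not LUNA's fault localization procedure.

```python
def localize_faulty_tokens(trace, P, threshold=0.01):
    """Indices of output tokens whose abstract-state transition has
    unusually low probability under the transition matrix P."""
    return [i + 1 for i, (src, dst) in enumerate(zip(trace, trace[1:]))
            if P[src, dst] < threshold]
```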
In this paper, we take an early step to present a model-based LLM analysis framework, LUNA, to initiate exploratory research towards the quality assurance of LLMs. Our experimental results show that the abstract model can capture the abnormal behaviors of the LLM from its hidden state information. We apply a series of modelling techniques with a diverse set of quality measurement metrics to deliver a comprehensive understanding of the capability and effectiveness of our framework. Overall, we find that LUNA effectively detects suspicious generations of LLMs w.r.t. different trustworthiness perspectives.
REFERENCES
[1] P. Vaithilingam, T. Zhang, and E. L. Glassman, “Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models,” in Chi conference on human factors in computing systems extended abstracts, 2022, pp. 1–7.
[2] C. S. Xia and L. Zhang, “Conversational automated program repair,” arXiv preprint arXiv:2301.13246, 2023.
[3] W. Zhang, Y. Deng, B. Liu, S. J. Pan, and L. Bing, “Sentiment analysis in the era of large language models: A reality check,” arXiv preprint arXiv:2305.15005, 2023.
[4] Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis, H. Chan, and J. Ba, “Large language models are human-level prompt engineers,” arXiv preprint arXiv:2211.01910, 2022.
[5] “Chatgpt,” http://chat.openai.com, 2023.
[6] “Gpt4,” https://openai.com/gpt-4, 2023.
[7] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
[8] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg et al., “Sparks of artificial general intelligence: Early experiments with gpt-4,” arXiv preprint arXiv:2303.12712, 2023.
[9] B. Wang, W. Chen, H. Pei, C. Xie, M. Kang, C. Zhang, C. Xu, Z. Xiong, R. Dutta, R. Schaeffer et al., “Decodingtrust: A comprehensive assessment of trustworthiness in gpt models,” arXiv preprint arXiv:2306.11698, 2023.
[10] H. Raj, D. Rosati, and S. Majumdar, “Measuring reliability of large language models through semantic consistency,” arXiv preprint arXiv:2211.05853, 2022.
[11] B. Wang, S. Wang, Y. Cheng, Z. Gan, R. Jia, B. Li, and J. Liu, “Infobert: Improving robustness of language models from an information theoretic perspective,” arXiv preprint arXiv:2010.02329, 2020.
[12] B. AlKhamissi, M. Li, A. Celikyilmaz, M. Diab, and M. Ghazvininejad, “A review on language models as knowledge bases,” arXiv preprint arXiv:2204.06031, 2022.
[13] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung, “Survey of hallucination in natural language generation,” ACM Comput. Surv., vol. 55, no. 12, mar 2023. [Online]. Available: https://doi.org/10.1145/3571730
[14] A. Abid, M. Farooqi, and J. Zou, “Persistent anti-muslim bias in large language models,” in Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, 2021, pp. 298–306.
[15] J. Maynez, S. Narayan, B. Bohnet, and R. McDonald, “On faithfulness and factuality in abstractive summarization,” arXiv preprint arXiv:2005.00661, 2020.
[16] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausman et al., “Do as i can, not as i say: Grounding language in robotic affordances,” arXiv preprint arXiv:2204.01691, 2022.
[17] S. Wang, Z. Zhao, X. Ouyang, Q. Wang, and D. Shen, “Chatcad: Interactive computer-aided diagnosis on medical image using large language models,” arXiv preprint arXiv:2302.07257, 2023.
[18] K. Pei, Y. Cao, J. Yang, and S. Jana, “Deepxplore: Automated whitebox testing of deep learning systems,” in Proceedings of the 26th Symposium on Operating Systems Principles, 2017, pp. 1–18.
[19] L. Ma, F. Juefei-Xu, F. Zhang, J. Sun, M. Xue, B. Li, C. Chen, T. Su, L. Li, Y. Liu et al., “Deepgauge: Multi-granularity testing criteria for deep learning systems,” in Proceedings of the 33rd ACM/IEEE international conference on automated software engineering, 2018, pp. 120–131.
[20] J. Kim, R. Feldt, and S. Yoo, “Guiding deep learning system testing using surprise adequacy,” in 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 2019, pp. 1039–1049.
[21] X. Xie, L. Ma, F. Juefei-Xu, M. Xue, H. Chen, Y. Liu, J. Zhao, B. Li, J. Yin, and S. See, “Deephunter: a coverage-guided fuzz testing framework for deep neural networks,” in Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, 2019, pp. 146–157.
[22] L. Ma, F. Zhang, J. Sun, M. Xue, B. Li, F. Juefei-Xu, C. Xie, L. Li, Y. Liu, J. Zhao et al., “Deepmutation: Mutation testing of deep learning systems,” in 2018 IEEE 29th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2018, pp. 100–111.
[23] Y. Tian, K. Pei, S. Jana, and B. Ray, “Deeptest: Automated testing of deep-neural-network-driven autonomous cars,” in Proceedings of the 40th international conference on software engineering, 2018, pp. 303–314.
[24] M. Zhang, Y. Zhang, L. Zhang, C. Liu, and S. Khurshid, “Deeproad: Gan-based metamorphic testing and input validation framework for autonomous driving systems,” in 2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2018, pp. 132–142.
[25] H. Wang, B. Ustun, and F. Calmon, “Repairing without retraining: Avoiding disparate impact with counterfactual distributions,” in International Conference on Machine Learning. PMLR, 2019, pp. 6618–6627.
[26] M. Sotoudeh and A. V. Thakur, “Correcting deep neural networks with small, generalizing patches,” in Workshop on Safety and Robustness in Decision Making, 2019.
[27] H. Zhang and W. Chan, “Apricot: A weight-adaptation approach to fixing deep learning models,” in 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2019, pp. 376–387.
[28] B. Yu, H. Qi, Q. Guo, F. Juefei-Xu, X. Xie, L. Ma, and J. Zhao, “Deeprepair: Style-guided repairing for deep neural networks in the real-world operational environment,” IEEE Transactions on Reliability, vol. 71, no. 4, pp. 1401–1416, 2021.
[29] X. Xie, W. Guo, L. Ma, W. Le, J. Wang, L. Zhou, Y. Liu, and X. Xing, “Rnnrepair: Automatic rnn repair via model-based analysis,” in International Conference on Machine Learning. PMLR, 2021, pp. 11 383–11 392.
[30] Q. Hu, Y. Guo, M. Cordy, X. Xie, L. Ma, M. Papadakis, and Y. Le Traon, “An empirical study on data distribution-aware test selection for deep learning enhancement,” ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 31, no. 4, pp. 1–30, 2022.
[31] X. Gao, Y. Feng, Y. Yin, Z. Liu, Z. Chen, and B. Xu, “Adaptive test selection for deep neural networks,” in Proceedings of the 44th International Conference on Software Engineering, 2022, pp. 73–85.
[32] Z. Yang, J. Shi, M. H. Asyrofi, and D. Lo, “Revisiting neuron coverage metrics and quality of deep neural networks,” in 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 2022, pp. 408–419.
[33] V. Riccio and P. Tonella, “When and why test generators for deep learning produce invalid inputs: an empirical study,” in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2023, pp. 1161–1173.
[34] J. Wang, H. Qiu, Y. Rong, H. Ye, Q. Li, Z. Li, and C. Zhang, “Bet: black-box efficient testing for convolutional neural networks,” in Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, 2022, pp. 164–175.
[35] J.-t. Huang, J. Zhang, W. Wang, P. He, Y. Su, and M. R. Lyu, “Aeon: a method for automatic evaluation of nlp test cases,” in Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, 2022, pp. 202–214.
[36] Z. Wei, H. Wang, I. Ashraf, and W.-K. Chan, “Deeppatch: Maintaining deep learning model programs to retain standard accuracy with substantial robustness improvement,” ACM Transactions on Software Engineering and Methodology, 2023.
[37] R. Schumi and J. Sun, “Semantic-based neural network repair,” arXiv preprint arXiv:2306.07995, 2023.
[38] Y. Zhang, Z. Wang, J. Jiang, H. You, and J. Chen, “Toward improving the robustness of deep learning models via model transformation,” in Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, 2022, pp. 1–13.
[39] Y. Li, M. Chen, and Q. Xu, “Hybridrepair: towards annotation-efficient repair for deep learning models,” in Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, 2022, pp. 227–238.
[40] T. Zohdinasab, V. Riccio, and P. Tonella, “Deepatash: Focused test generation for deep learning systems,” 2023.
[41] T. Zohdinasab, V. Riccio, A. Gambi, and P. Tonella, “Efficient and effective feature space exploration for testing deep learning systems,” ACM Trans. Softw. Eng. Methodol., vol. 32, no. 2, mar 2023. [Online]. Available: https://doi.org/10.1145/3544792
[42] N. Humbatova, G. Jahangirova, and P. Tonella, “Deepcrime: from real faults to mutation testing tool for deep learning,” in 2023 IEEE/ACM 45th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion), 2023, pp. 68–72.
[43] A. Stocco, P. J. Nunes, M. D’Amorim, and P. Tonella, “Thirdeye: Attention maps for safe autonomous driving systems,” in Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, 2022, pp. 1–12.
[44] J. Kim, G. An, R. Feldt, and S. Yoo, “Learning test-mutant relationship for accurate fault localisation,” Information and Software Technology, p. 107272, 2023.
[45] J. Sohn, S. Kang, and S. Yoo, “Arachne: Search-based repair of deep neural networks,” ACM Transactions on Software Engineering and Methodology, vol. 32, no. 4, pp. 1–26, 2023.
[46] J. Kim, N. Humbatova, G. Jahangirova, P. Tonella, and S. Yoo, “Repairing dnn architecture: Are we there yet?” in 2023 IEEE Conference on Software Testing, Verification and Validation (ICST), 2023, pp. 234–245.
[47] A. Stocco, M. Weiss, M. Calzana, and P. Tonella, “Misbehaviour prediction for autonomous driving systems,” in Proceedings of the ACM/IEEE 42nd international conference on software engineering, 2020, pp. 359–371.
[48] N. Humbatova, G. Jahangirova, and P. Tonella, “Deepcrime: mutation testing of deep learning systems based on real faults,” in Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, 2021, pp. 67–78.
[49] J. Zhou, F. Li, J. Dong, H. Zhang, and D. Hao, “Cost-effective testing of a deep learning model through input reduction,” in 2020 IEEE 31st International Symposium on Software Reliability Engineering (ISSRE), 2020, pp. 289–300.
[50] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel, “Handwritten digit recognition with a back-propagation network,” Advances in neural information processing systems, vol. 2, 1989.
[51] D. E. Rumelhart, G. E. Hinton, R. J. Williams et al., “Learning internal representations by error propagation,” 1985.
[52] T. Zohdinasab, V. Riccio, A. Gambi, and P. Tonella, “Deephyperion: exploring the feature space of deep learning-based systems through illumination search,” in Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, 2021, pp. 79–90.
[53] B. C. Hu, L. Marsso, K. Czarnecki, and M. Chechik, “What to check: Systematic selection of transformations for analyzing reliability of machine vision components,” in 2022 IEEE 33rd International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2022, pp. 49–60.
[54] X. Du, X. Xie, Y. Li, L. Ma, Y. Liu, and J. Zhao, “Deepstellar: Model-based quantitative analysis of stateful deep learning systems,” in Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2019, pp. 477–487.
[55] X. Ren, Y. Lin, Y. Xue, R. Liu, J. Sun, Z. Feng, and J. S. Dong, “Deeparc: Modularizing neural networks for the model maintenance,” in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), 2023, pp. 1008–1019.
[56] I. Khmelnitsky, D. Neider, R. Roy, X. Xie, B. Barbot, B. Bollig, A. Finkel, S. Haddad, M. Leucker, and L. Ye, “Property-directed verification and robustness certification of recurrent neural networks,” in Automated Technology for Verification and Analysis: 19th International Symposium, ATVA 2021, Gold Coast, QLD, Australia, October 18–22, 2021, Proceedings 19. Springer, 2021, pp. 364–380.
[57] J. Song, X. Xie, and L. Ma, “Siege: A semantics-guided safety enhancement framework for ai-enabled cyber-physical systems,” IEEE Transactions on Software Engineering, 2023.
[58] X. Xie, J. Song, Z. Zhou, F. Zhang, and L. Ma, “Mosaic: Model-based safety analysis framework for ai-enabled cyber-physical systems,” arXiv preprint arXiv:2305.03882, 2023.
[59] R. Pan and H. Rajan, “Decomposing convolutional neural net- works into reusable and replaceable modules,” in Proceedings of the 44th International Conference on Software Engineering, 2022, pp. 524–535.
[60] G. Dong, J. Wang, J. Sun, Y. Zhang, X. Wang, T. Dai, J. S. Dong, and X. Wang, “Towards interpreting recurrent neural networks through probabilistic abstraction,” in Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, 2020, pp. 499–510.
[61] H. Qi, Z. Wang, Q. Guo, J. Chen, F. Juefei-Xu, F. Zhang, L. Ma, and J. Zhao, “Archrepair: Block-level architecture-oriented repairing for deep neural networks,” ACM Transactions on Software Engineering and Methodology, vol. 32, no. 5, pp. 1–31, 2023.
[62] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” 2023.
[63] X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, and H. Wang, “Large language models for software engineering: A systematic literature review,” 2023.
[64] Y. Charalambous, N. Tihanyi, R. Jain, Y. Sun, M. A. Ferrag, and L. C. Cordeiro, “A new era in software security: Towards self-healing software via large language models and formal verification,” 2023.
[65] D. Lo, “Trustworthy and synergistic artificial intelligence for software engineering: Vision and roadmaps,” 2023.
[66] A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo, and J. M. Zhang, “Large language models for software engineering: Survey and open problems,” arXiv preprint arXiv:2310.03533, 2023.
[67] R. Luo, L. Sun, Y. Xia, T. Qin, S. Zhang, H. Poon, and T.-Y. Liu, “Biogpt: generative pre-trained transformer for biomedical text generation and mining,” Briefings in Bioinformatics, vol. 23, no. 6, p. bbac409, 2022.
[68] R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Saravia, A. Poulton, V. Kerkez, and R. Stojnic, “Galactica: A large language model for science,” arXiv preprint arXiv:2211.09085, 2022.
[69] Y. Shen, L. Heacock, J. Elias, K. D. Hentel, B. Reig, G. Shih, and L. Moy, “Chatgpt and other large language models are double-edged swords,” p. e230163, 2023.
[70] T. Kocmi and C. Federmann, “Large language models are state-of-the-art evaluators of translation quality,” arXiv preprint arXiv:2302.14520, 2023.
[71] E. Kasneci, K. Seßler, S. Küchemann, M. Bannert, D. Dementieva, F. Fischer, U. Gasser, G. Groh, S. Günnemann, E. Hüllermeier et al., “Chatgpt for good? on opportunities and challenges of large language models for education,” Learning and individual differences, vol. 103, p. 102274, 2023.
[72] D. B. Lenat, “Cyc: A large-scale investment in knowledge infrastructure,” Commun. ACM, vol. 38, no. 11, pp. 33–38, nov 1995. [Online]. Available: https://doi.org/10.1145/219717.219745
[73] Y. Liu, T. Han, S. Ma, J. Zhang, Y. Yang, J. Tian, H. He, A. Li, M. He, Z. Liu et al., “Summary of chatgpt/gpt-4 research and perspective towards the future of large language models,” arXiv preprint arXiv:2304.01852, 2023.
[74] R. Mao, Q. Liu, K. He, W. Li, and E. Cambria, “The biases of pre-trained language models: An empirical study on prompt-based sentiment analysis and emotion detection,” IEEE Transactions on Affective Computing, 2022.
[75] T. Zhang, F. Ladhak, E. Durmus, P. Liang, K. McKeown, and T. B. Hashimoto, “Benchmarking large language models for news summarization,” arXiv preprint arXiv:2301.13848, 2023.
[76] F. F. Xu, U. Alon, G. Neubig, and V. J. Hellendoorn, “A systematic evaluation of large language models of code,” in Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, 2022, pp. 1–10.
[77] T. Ahmed and P. Devanbu, “Few-shot training llms for project-specific code-summarization,” in Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, 2022, pp. 1–5.
[78] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE transactions on neural networks, vol. 5, no. 2, pp. 157–166, 1994.
[79] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” in International conference on machine learning. PMLR, 2013, pp. 1310–1318.
[80] I. Malkiel, D. Ginzburg, O. Barkan, A. Caciularu, J. Weill, and N. Koenigstein, “Interpreting bert-based text similarity via activation and saliency maps,” in Proceedings of the ACM Web Conference 2022, 2022, pp. 3259–3268.
[81] A. Azaria and T. Mitchell, “The internal state of an llm knows when it’s lying,” arXiv preprint arXiv:2304.13734, 2023.
[82] H. Chefer, S. Gur, and L. Wolf, “Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 397–406.
[83] X. Li, H. Xiong, X. Li, X. Wu, X. Zhang, J. Liu, J. Bian, and D. Dou, “Interpretable deep learning: Interpretation, interpretability, trustworthiness, and beyond,” Knowledge and Information Systems, vol. 64, no. 12, pp. 3197–3234, 2022.
[84] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 4171–4186. [Online]. Available: https://aclanthology.org/N19-1423
[85] P. He, X. Liu, J. Gao, and W. Chen, “Deberta: Decoding-enhanced bert with disentangled attention,” arXiv preprint arXiv:2006.03654, 2020.
[86] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized BERT pretraining approach,” CoRR, vol. abs/1907.11692, 2019. [Online]. Available: http://arxiv.org/abs/1907.11692
[87] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, Jul. 2020, pp. 7871–7880. [Online]. Available: https://aclanthology.org/2020.acl-main.703
[88] Y. Tay, M. Dehghani, V. Q. Tran, X. Garcia, J. Wei, X. Wang, H. W. Chung, D. Bahri, T. Schuster, S. Zheng, D. Zhou, N. Houlsby, and D. Metzler, “UL2: Unifying language learning paradigms,” in The Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=6ruVLB727MC
[89] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020.
[90] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhari- wal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
[91] M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why should I trust you?’ Explaining the predictions of any classifier,” in Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 2016, pp. 1135–1144.
[92] H. Qiu, S. Zhang, A. Li, H. He, and Z. Lan, “Latent jailbreak: A benchmark for evaluating text safety and output robustness of large language models,” arXiv preprint arXiv:2307.08487, 2023.
[93] Y. Li, F. Wei, J. Zhao, C. Zhang, and H. Zhang, “Rain: Your language models can align themselves without finetuning,” arXiv preprint arXiv:2309.07124, 2023.
[94] A. Helbling, M. Phute, M. Hull, and D. H. Chau, “Llm self defense: By self examination, llms know they are being tricked,” arXiv preprint arXiv:2308.07308, 2023.
[95] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané, “Concrete problems in ai safety,” arXiv preprint arXiv:1606.06565, 2016.
[96] D. Hendrycks and K. Gimpel, “A baseline for detecting misclas- sified and out-of-distribution examples in neural networks,” in International Conference on Learning Representations, 2017. [Online]. Available: https://openreview.net/forum?id=Hkg4TI9xl
[97] K. Lee, K. Lee, H. Lee, and J. Shin, “A simple unified framework for detecting out-of-distribution samples and adversarial attacks,” Advances in neural information processing systems, vol. 31, 2018.
[98] S. Liang, Y. Li, and R. Srikant, “Enhancing the reliability of out-of-distribution image detection in neural networks,” in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=H1VGkIxRZ
[99] P. Morteza and Y. Li, “Provable guarantees for understanding out-of-distribution detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 7, 2022, pp. 7831–7840.
[100] H. Lang, Y. Zheng, Y. Li, J. Sun, F. Huang, and Y. Li, “A survey on out-of-distribution detection in nlp,” arXiv preprint arXiv:2305.03236, 2023.
[101] U. Arora, W. Huang, and H. He, “Types of out-of-distribution texts and how to detect them,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, pp. 10 687–10 701. [Online]. Available: https://aclanthology.org/2021.emnlp-main.835
[102] J. Ren, J. Luo, Y. Zhao, K. Krishna, M. Saleh, B. Lakshminarayanan, and P. J. Liu, “Out-of-distribution detection and selective generation for conditional language models,” in The Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=kJUS5nD0vPB
[103] A. Kamath, R. Jia, and P. Liang, “Selective question answering under domain shift,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, Jul. 2020, pp. 5684–5696. [Online]. Available: https://aclanthology.org/2020.acl-main.503
[104] J. Wang, X. Hu, W. Hou, H. Chen, R. Zheng, Y. Wang, L. Yang, H. Huang, W. Ye, X. Geng et al., “On the robustness of chatgpt: An adversarial and out-of-distribution perspective,” arXiv preprint arXiv:2302.12095, 2023.
[105] K. Krishna, J. Wieting, and M. Iyyer, “Reformulating unsupervised style transfer as paraphrase generation,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, Nov. 2020, pp. 737–762. [Online]. Available: https://aclanthology.org/2020.emnlp-main.55
[106] B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndić, P. Laskov, G. Giacinto, and F. Roli, “Evasion attacks against machine learning at test time,” in Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2013, Prague, Czech Republic, September 23-27, 2013, Proceedings, Part III 13. Springer, 2013, pp. 387–402.
[107] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” in 2nd International Conference on Learning Representations (ICLR), 2014.
[108] I. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in International Conference on Learning Representations, 2015. [Online]. Available: http://arxiv.org/abs/1412.6572
[109] A. Chakraborty, M. Alam, V. Dey, A. Chattopadhyay, and D. Mukhopadhyay, “Adversarial attacks and defences: A survey,” arXiv preprint arXiv:1810.00069, 2018.
[110] B. Wang, C. Xu, S. Wang, Z. Gan, Y. Cheng, J. Gao, A. H. Awadallah, and B. Li, “Adversarial glue: A multi-task benchmark for robustness evaluation of language models,” in Advances in Neural Information Processing Systems, 2021.
[111] S. Goyal, S. Doddapaneni, M. M. Khapra, and B. Ravindran, “A survey of adversarial defenses and robustness in nlp,” ACM Computing Surveys, vol. 55, no. 14s, pp. 1–39, 2023.
[112] V. Raunak, A. Menezes, and M. Junczys-Dowmunt, “The curious case of hallucinations in neural machine translation,” 2021.
[113] A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell, and K. Saenko, “Object hallucination in image captioning,” 2019.
[114] P. Koehn and R. Knowles, “Six challenges for neural machine translation,” arXiv preprint arXiv:1706.03872, 2017.
[115] W. Kryściński, B. McCann, C. Xiong, and R. Socher, “Evaluating the factual consistency of abstractive text summarization,” arXiv preprint arXiv:1910.12840, 2019.
[116] B. Bi, C. Wu, M. Yan, W. Wang, J. Xia, and C. Li, “Incorporating external knowledge into machine reading for generative question answering,” ArXiv, vol. abs/1909.02745, 2019. [Online]. Available: https://api.semanticscholar.org/CorpusID:202234053
[117] A. Balakrishnan, J. Rao, K. Upasani, M. White, and R. Subba, “Constrained decoding for neural NLG from compositional representations in task-oriented dialogue,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, Jul. 2019, pp. 831–844. [Online]. Available: https://www.aclweb.org/anthology/P19-1080
[118] C. W. Omlin and C. L. Giles, “Extraction of rules from discrete-time recurrent neural networks,” Neural networks, vol. 9, no. 1, pp. 41–52, 1996.