Grounding and Evaluation for Large Language Models (Tutorial)

Overview

With the ongoing rapid adoption of Artificial Intelligence (AI)-based systems in high-stakes domains such as financial services, healthcare and life sciences, hiring and human resources, education, societal infrastructure, and national security, it is crucial to develop and deploy the underlying AI models and systems in a responsible manner and ensure their trustworthiness, safety, and observability. Our focus is on large language models (LLMs) and other generative AI models and applications. Such models and applications need to be evaluated and monitored not only for accuracy and quality-related metrics but also for robustness against adversarial attacks and under distribution shifts; bias and discrimination against underrepresented groups; security and privacy protection; interpretability; hallucinations and other ungrounded or low-quality outputs; harmful content (such as sexual, racist, and hateful responses); jailbreaks of safety and alignment mechanisms; prompt injection attacks; misinformation and disinformation; fake, misleading, and manipulative content; copyright infringement; and other responsible AI dimensions.

In this tutorial, we first highlight key harms associated with generative AI systems with a focus on ungrounded answers (hallucinations), jailbreaks and prompt injection attacks, harmful content, and copyright infringement. We then discuss how to effectively address potential risks and challenges, following the framework of identification, measurement, mitigation (with four mitigation layers at model, safety system, application, and positioning), and operationalization. We present real-world LLM use cases, practical challenges, best practices, lessons learned from deploying solution approaches in industry, and key open problems. Our goal is to stimulate further research on grounding and evaluation of LLMs, and enable researchers and practitioners to build more robust and trustworthy LLM applications.

Contributors

Krishnaram Kenthapadi (Oracle Health AI, USA)

Mehrnoosh Sameki (Microsoft Azure AI, USA)

Ankur Taly (Google Cloud AI, USA)

Tutorial Logistics

KDD'24 Tutorial Slides (TBA)

Tutorial Video Recording: KDD'24 Video (TBA)

Grounding and Evaluation for LLMs Survey Paper

Contributor Bios

Krishnaram Kenthapadi is the Chief Scientist, Clinical AI at Oracle Health, where he leads the AI initiatives for the Clinical Digital Assistant and other Oracle Health products. Previously, as the Chief AI Officer & Chief Scientist of Fiddler AI, he led initiatives on generative AI (e.g., Fiddler Auditor, an open-source library for evaluating and red-teaming LLMs before deployment, and safety, observability, and feedback mechanisms for LLMs in production) and on AI safety, alignment, observability, and trustworthiness, as well as Fiddler’s technical strategy, customer-driven innovation, and thought leadership. Prior to that, he was a Principal Scientist at Amazon AWS AI, where he led the fairness, explainability, privacy, and model understanding initiatives for the Amazon AWS AI platform. Before joining Amazon, he led similar efforts on the LinkedIn AI team and served as LinkedIn’s representative on Microsoft’s AI and Ethics in Engineering and Research (AETHER) Advisory Board. Earlier, he was a Researcher at Microsoft Research Silicon Valley Lab. Krishnaram received his Ph.D. in Computer Science from Stanford University in 2006. His work has been recognized through awards at NAACL, WWW, SODA, CIKM, the ICML AutoML workshop, and Microsoft’s AI/ML conference (MLADS). He has published 50+ papers with 7,000+ citations and filed 150+ patents (72 granted). He has presented tutorials on privacy, fairness, explainable AI, ML model monitoring, responsible AI, and trustworthy generative AI in industry at forums such as KDD '18, '19, '22, '23, WSDM '19, WWW '19, '20, '21, '23, FAccT '20, '21, '22, '23, AAAI '20, '21, and ICML '21, '23, and instructed a course on responsible AI at Stanford.

Mehrnoosh Sameki is a principal product manager and the responsible AI tools area lead at Microsoft, where she leads a group of AI product managers developing and delivering cutting-edge tools for model evaluation and responsible AI, both in open source and through the Azure AI platform, for generative AI solutions and ML models. She is also an adjunct assistant professor of computer science at Boston University, where she earned her PhD in 2017. Mehrnoosh has spoken at several industry forums (including Microsoft Build) and has presented tutorials on fairness, ML model monitoring, and responsible AI in industry at forums such as KDD '19, '22, WWW '21, '23, FAccT '21, '22, AAAI '21, and ICML '21.

Ankur Taly is a Staff Research Scientist at Google Cloud AI, leading initiatives on LLM grounding, evaluation of the factual consistency of LLM responses, and data-centric AI. Previously, he was the head of data science at Fiddler AI, and prior to that, he was a Staff Research Scientist at Google Brain, where he carried out research in explainable AI and is best known for his contribution to developing and applying Integrated Gradients (6,000+ citations), an interpretability algorithm for deep networks. His research in this area has resulted in publications at top-tier machine learning conferences (ICML 2017, ACL 2018) and in prestigious journals such as the journal of the American Academy of Ophthalmology (AAO) and the Proceedings of the National Academy of Sciences (PNAS). He has also given invited talks (Slides, Video) at several academic and industrial venues, including UC Berkeley (DREAMS seminar), SRI International, a Dagstuhl seminar, and Samsung AI Research. Besides explainable AI, Ankur has a broad research background and has published 25+ papers in several other areas, including Computer Security, Programming Languages, Formal Verification, and Machine Learning. He has served on several conference program committees (PLDI 2014 and 2019, POST 2014, PLAS 2013), given guest lectures in graduate courses, and instructed a short course on distributed authorization at the FOSAD summer school in 2016. Ankur obtained his Ph.D. in Computer Science from Stanford University in 2012 and a B.Tech. in Computer Science from IIT Bombay in 2007.

Tutorial Outline and Description

The tutorial will consist of the following parts:

Introduction and overview of the generative AI landscape

We give an overview of the generative AI landscape in industry and motivate the topic of the tutorial with the following questions: What constitutes generative AI? Why is generative AI an important topic? What are the key applications of generative AI being deployed across different industry verticals? Why is it crucial to develop and deploy generative AI models and applications in a responsible manner?

Holistic Evaluation of LLMs

We highlight key challenges that arise when developing and deploying LLMs and other generative AI models in enterprise settings, and present an overview of solution approaches and open problems. We discuss evaluation dimensions such as truthfulness, safety and alignment, bias and fairness, robustness and security, privacy, model disgorgement and unlearning, copyright infringement, calibration and confidence, and transparency and causal interventions.

Grounding for LLMs

We then provide a deeper discussion of grounding for LLMs, that is, ensuring that every claim in the response generated by an LLM can be attributed to a document in the user-specified knowledge base. We highlight how grounding differs from factuality in the context of LLMs, and present technical solution approaches such as retrieval-augmented generation, constrained decoding, evaluation, guardrails, revision, and corpus tuning.
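To make the grounding idea concrete, below is a minimal, self-contained Python sketch of retrieval-augmented generation with a post-hoc grounding check. The toy knowledge base, the word-overlap retriever, the placeholder generate() function, and the sentence-level support threshold are illustrative assumptions for exposition only, not the specific techniques presented in the tutorial.

# Toy sketch of retrieval-augmented generation (RAG) with a post-hoc
# grounding check. All components below (knowledge base, retriever,
# generate() stub, overlap threshold) are illustrative stand-ins.

import re
from typing import List, Tuple

KNOWLEDGE_BASE = [
    "Acme Bank raised its savings interest rate to 4.5% in May 2024.",
    "Acme Bank was founded in 1932 and is headquartered in Boston.",
]

def tokenize(text: str) -> set:
    return set(re.findall(r"[a-z0-9%.]+", text.lower()))

def retrieve(query: str, docs: List[str], k: int = 2) -> List[str]:
    """Rank documents by word overlap with the query (toy retriever)."""
    scored = sorted(docs, key=lambda d: len(tokenize(d) & tokenize(query)), reverse=True)
    return scored[:k]

def generate(query: str, context: List[str]) -> str:
    """Placeholder for an LLM call that answers only from the given context."""
    # In practice this would prompt a model with the retrieved context and
    # instruct it to answer using ONLY that context.
    return "Acme Bank's savings rate is 4.5%, effective May 2024."

def grounding_check(answer: str, context: List[str], threshold: float = 0.6) -> List[Tuple[str, bool]]:
    """Flag answer sentences whose tokens are not mostly covered by some context document."""
    context_tokens = [tokenize(d) for d in context]
    results = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        tokens = tokenize(sentence)
        supported = any(len(tokens & ct) / max(len(tokens), 1) >= threshold for ct in context_tokens)
        results.append((sentence, supported))
    return results

if __name__ == "__main__":
    query = "What is Acme Bank's current savings interest rate?"
    context = retrieve(query, KNOWLEDGE_BASE)
    answer = generate(query, context)
    for sentence, supported in grounding_check(answer, context):
        print(("GROUNDED  " if supported else "UNGROUNDED") + " | " + sentence)

In practice, the lexical-overlap test would typically be replaced by a natural language inference or attribution model, and generate() by a grounded LLM call; the tutorial discusses such approaches in detail.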

LLM Operations and Observability

We present processes and best practices for addressing grounding- and evaluation-related challenges in real-world LLM application settings. We discuss mechanisms for managing safety risks and vulnerabilities associated with deployed LLM and generative AI applications, as well as practical approaches for monitoring the underlying models and systems with respect to quality and other responsible AI metrics.
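As a rough illustration of the monitoring discussion above, the sketch below logs per-response quality signals and raises an alert when a rolling groundedness rate drops below a threshold. The ResponseRecord fields, the window size, the 0.9 threshold, and the alert() stub are hypothetical choices, not a prescribed observability setup.

# Toy sketch of LLM observability: track per-response quality signals in a
# rolling window and alert when the grounded-response rate degrades.

from collections import deque
from dataclasses import dataclass

@dataclass
class ResponseRecord:
    grounded: bool        # e.g., output of an automated grounding check
    flagged_unsafe: bool  # e.g., output of a safety classifier

class LLMMonitor:
    """Rolling-window monitor over recent LLM responses (illustrative only)."""

    def __init__(self, window: int = 100, min_grounded_rate: float = 0.9):
        self.records = deque(maxlen=window)
        self.min_grounded_rate = min_grounded_rate

    def log(self, record: ResponseRecord) -> None:
        self.records.append(record)
        if self.grounded_rate() < self.min_grounded_rate:
            self.alert(f"Grounded-response rate dropped to {self.grounded_rate():.2f}")

    def grounded_rate(self) -> float:
        return sum(r.grounded for r in self.records) / len(self.records) if self.records else 1.0

    def unsafe_rate(self) -> float:
        return sum(r.flagged_unsafe for r in self.records) / len(self.records) if self.records else 0.0

    def alert(self, message: str) -> None:
        # Stand-in for integration with paging or dashboarding systems.
        print("ALERT:", message)

if __name__ == "__main__":
    monitor = LLMMonitor(window=10)
    for i in range(12):
        # Simulate a stream of responses where every third one fails the grounding check.
        monitor.log(ResponseRecord(grounded=(i % 3 != 0), flagged_unsafe=False))

In a production setting, such signals would typically be aggregated over time windows and user or data segments, and fed into dashboards and alerting pipelines.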

We will present real-world LLM case studies across different application domains such as healthcare, financial services, hiring, conversational assistants, and search and recommendation systems, and discuss solution approaches for addressing the above challenges, highlighting practical challenges, best practices, lessons learned from deploying solution approaches in industry, and key open problems. We hope that our tutorial will inform both researchers and practitioners, stimulate further research on grounding and evaluation approaches for LLMs and other generative AI models, and pave the way for building more reliable generative AI models and applications in the future.

This tutorial is aimed at attendees with a wide range of interests and backgrounds, both in academia and industry: researchers interested in learning about grounding, evaluation, and, more broadly, responsible AI techniques and tools in the context of LLMs and generative AI models, as well as practitioners interested in implementing such tools for various LLM and generative AI applications. We will not assume any prerequisite knowledge, and we will present the advances, challenges, and opportunities related to evaluation and grounding of LLMs by building intuition, to ensure that the material is accessible to all attendees.

Related Tutorials and Resources