Session Abstracts

You, Your Data, and AI Sessions I Room: 208

Opening Remarks

Matei Zaharia I Room 208 I 6:10 PM

K Prize, Terminal-Bench, and the Noble Quest for Hard, Relevant Benchmarks

Andy Konwinski I Room 208 I 6:20 PM

Andy Konwinski (co-founder of Databricks, Perplexity, and Laude Ventures) will explain why hard, relevant benchmarks are crucial for AI progress. He will preview the next phase of the K Prize and introduce a new terminal usage benchmark, T-Bench. Mike Merrill and Alex Shaw of the T-Bench team will describe and demo the benchmark.

Housekeeping

Denny Lee I Room 208 I 6:40 PM

Evaluation and Measurement Sessions I Room: 206

Improving GenAI Quality Starts with Measuring It

Pallavi Koppol I Room 206 I 7:00 PM

Many organizations struggle to monitor and improve their GenAI applications because (1) they do not know how to define or measure “quality,” and (2) while human feedback remains the gold standard for subjective evaluations, it does not scale to enterprise needs. This presentation details best practices for GenAI quality measurement, procuring data from human annotators, and calibrating LLM judges to enable cost-effective evaluation at scale.

Reward Models: The secret ingredient of the era of experience

Alex Trott & Jonathan Chang I Room 206 I 7:20 PM

Reward models are versatile tools that can be used to guide, filter and specialize language models for specific applications. This talk covers the ways we currently use RMs at Databricks to improve the quality of our customers' models and touches on ideas for future applications.

TAO: Building AI with the data you have

Brandon Cui I Room 206 I 7:40 PM

Test time adaptive optimization (TAO) is a model specialization method that improves AI quality using only unlabeled data. TAO enables enterprises to bring the quality of inexpensive open source models, like llama, within the quality of proprietary models like GPT-4o and o3-mini.

Optimizing an LLM system with Patronus's Percival: a SOTA Evaluation Agent

Anand Kannappan I Room 206 I 8:00 PM

What is Percival:

Percival is a SOTA AI companion for every AI team that needs to debug and optimize their AI outputs. Percival ingests an unlimited number of traces from any LLM-based workflow, detects 60+ failure modes for a system, and optimizes prompts. It has saved engineering teams hundreds of hours in analyzing individual traces, clustering errors, and prompt engineering.

History of Percival:

When we created evaluator models to help AI teams monitor outputs, teams often asked us “What do I do after I see my errors in AI outputs? How do I know how to change my prompts?”

When teams were trying to evaluate agents in particular, the vector space of answers was unconstrained and traditional evaluation practices like golden dataset creation fell short. An agent can select any trajectory of tools to call, and covering all of these permutations in a golden dataset is almost impossible. However, one agent can evaluate another agent.

Percival’s vision field of potential errors is larger than any human’s. It can traverse through another agent’s vector space of errors quickly to cluster failure and improve prompts. Percival can also learn from errors, making its prediction of solutions more accurate than a single human’s.

Llama developer toolkit

Gayathri Murali I Room 206 I 8:20 PM

See for yourself how to unleash the power of Llama models and achieve next-level performance with our curated set of practical tools, techniques, and recipes. Tap into Llama Model Customization and bring your existing data to easily fine-tune a smaller Llama model for your specific workflow.

LLMOps and Productionizing AI Sessions I Room: 213

From ML to GenAI: Expanding MLflow for the Full Spectrum of AI

Cathy Yin I Room 213 I 7:00 PM

MLflow started as a tool for experiment tracking and model management in traditional ML, but has since evolved to support deep learning and Generative AI workflows. In this talk, we’ll give a fast-paced overview of new capabilities like tracing, GenAI evaluation, monitoring, and prompt registry — all designed to support modern LLM workflows. We’ll also introduce the newly launched managed MLflow on Databricks, which offers a seamless onboarding experience with free credits, alongside the open-source offering developers already know.

MLflow Tracing: The Evolution of Observability in the GenAI Application Development

Yuki Watanabe I Room 213 I 7:40 PM

MLflow Tracing, first unveiled at DAIS 2024, has evolved from a basic debugging tool to the cornerstone of enterprise GenAI development, empowering thousands of organizations to build complex applications with full visibility. In this session, we will explore how Tracing transforms the development workflows and previews key MLflow 3.0 enhancements for evaluation, human-in-the-loop feedback, and production monitoring.

DSPy + MLflow: Building Smarter LLM Systems with Optimization and Observability

Tomu Hirata I Room 213 I 7:20 PM

The Science behind Knowledge Assistant (KA) - Retrieval and Feedback

Andrew Drozdov I Room 213 I 8:00 PM

In this talk, we'll explore the science behind Knowledge Assistant, our multi-tiered compound AI system designed to generate comprehensive, grounded, and accurate answers to your complex questions. We'll walk through key components of the system that deliver Deep Data Intelligence, including data ingestion, generative search, and domain-adaptive customization through feedback.

How LanceDB tackles the challenges of Multimodal Data

Jonathan Hsieh I Room 213 I 8:20 PM

Powering AI-based applications and model training pipelines with images, text, audio, and video often entails complex data plumbing, slow iterations, and siloed storage. LanceDB confronts these challenges head-on by unifying raw content, metadata, and embeddings in a single, columnar AI-native database with built-in, sub-millisecond vector search. Discover how teams at Harvey, Runway, Midjourney, Character AI, and World Labs have leveraged LanceDB to power their services, slash data plumbing by up to 80%, accelerate experimentation, and ship features faster.

GenAI Infrastructure Sessions I Room: 209

Custom Training and Finetuning on Databricks

Harsh Panchal & Tejas Sundaresan I Room 209 I 7:00 PM

Builder Document Parsing

Ziyi Yang & Jasmine Collins I Room 209 I 7:20 PM

Unstructured data in the form of pdfs, docx, raw images, spreadsheets and presentations are a hugely important data source for enterprises. Legal contracts, mortgage and loan documents, financial documents, internal strategy, etc. are examples of important documents enterprises need to understand. We will talk about the work we're doing to make it easy for you to process this data and get it into databricks in a form that can be used by the rest of our AI and existing tools.

Information Extraction with LLMs

Ivan Zhou & Krista Opsahl-Ong I Room 209 I 7:40 PM

A key capability of the modern AI stack is turning long unstructured documents like contracts and transcripts into structured information that can be processed by traditional data pipelines as well as AI agents. Learn how you can do this at scale with high quality and low costs through our optimization.

Building Fast, Scalable Foundation Model APIs

Daya Khudia & Asfandyar Qureshi I Room 209 I 8:00 PM

Efficiently serving foundation models at scale requires more than just model weights and GPUs—it demands system-level optimizations across the entire inference stack. In this talk, we’ll share engineering insights from building fast, scalable APIs for foundation model inference.

We’ll discuss real-world challenges such as framework-level inefficiencies, kernel bottlenecks, and accuracy-issues in low-precision optimizations. A key focus will be on optimized serving of parameter-efficient fine-tuning (PEFT) methods like LoRA, including runtime techniques for dynamically loading and serving multiple adapters with minimal overhead.

We'll also cover strategies for reducing framework overhead, optimizing batch formation for LoRA requests, and leveraging CUDA graphs—along with the practical issues that come with them.

Whether you're building high-throughput inference systems or looking to productionize fine-tuned LLMs, this talk offers practical takeaways grounded in hands-on experience.

Demystifying Batch Inference On Databricks

Andrew Shieh I Room 209 I 8:20 PM

Using AI Functions to instantly run batch inference on millions of data records without manually provisioning compute might seem like magic, but there’s sophisticated engineering behind this large-scale distributed AI infrastructure. This session takes you under the hood of Databricks’ batch inference system, revealing how AI Functions, LLM inference runtimes, and a central coordinator work together to deliver high-throughput performance at scale. We’ll explore how the system achieves zero startup time and eliminates idle compute costs through shared compute pools, while addressing the challenges of scaling, fairness, and load balancing.

Page updated

Google Sites

Report abuse