AI Agents That Matter
by Sayash Kapoor, Benedikt Stroebl, Zachary S. Siegel, Nitya Nadgir, Arvind Narayanan
Analyzes shortcomings of the current state of agent evaluations, and outlines steps to mitigate these shortcomings.
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
by Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, Azalia Mirhoseini
Studies scaling inference compute through repeated sampling to improve the performance of large language models.
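Repeated sampling is typically measured with the pass@k metric: the probability that at least one of k sampled attempts is correct. A minimal sketch of the standard unbiased estimator from n total samples with c correct ones (the function name is illustrative; the estimator form follows Chen et al.'s widely used formulation, not this paper specifically):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: the probability that at least one
    of k samples drawn (without replacement) from n total samples,
    c of which are correct, is correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset
        # must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 samples of which 1 is correct, pass@1 is 0.5, and pass@2 is 1.0, reflecting that more samples raise the chance of at least one success.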
The Ethics of Advanced AI Assistants
by Iason Gabriel, Arianna Manzini et al.
Analyzes the ethical and societal risks posed by advanced AI assistants, discusses considerations for deploying them, and offers recommendations for various stakeholders.
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
by Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan
Presents an evaluation framework for language models consisting of real-world software engineering problems from GitHub issues and pull requests across 12 Python repositories.
Tau-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
by Shunyu Yao, Noah Shinn, Pedram Razavi, Karthik Narasimhan
A benchmark that simulates conversations between users and language agents in real-world domains. The authors also introduce pass^k, a metric that evaluates agent reliability over multiple trials.
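Whereas pass@k asks whether any of k trials succeeds, pass^k asks whether all k independent trials succeed, rewarding consistency. A minimal sketch of one way to estimate it from n observed trials with c successes (the function name and exact estimator form are illustrative, not taken from the paper):

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k: the probability that all k i.i.d. trials
    succeed, given c observed successes out of n trials (k <= n).
    Computed as the chance that a random k-subset of the n trials
    contains only successes."""
    if c < k:
        # Not enough successes for any k-subset to be all-successful.
        return 0.0
    return comb(c, k) / comb(n, k)
```

Note that pass^k falls quickly as k grows unless the success rate is near 1, which is what makes it a stricter reliability measure than single-trial accuracy.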
Lessons from the Trenches on Reproducible Evaluation of Language Models
by Stella Biderman, Hailey Schoelkopf, Lintang Sutawika et al.
Provides guidance for researchers on evaluating language models.
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
by Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, Christopher Potts
Introduces DSPy, a programming model that abstracts language model pipelines as graphs; DSPy includes modules that can call language models and compilers to optimize its pipeline.
Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs
by Krista Opsahl-Ong, Michael J Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, Omar Khattab
Studies optimizing instructions and few-shot demonstrations for multi-stage language model programs, comparing three distinct strategies.
SPADE: Synthesizing Data Quality Assertions for Large Language Model Pipelines
by Shreya Shankar, Haotian Li, Parth Asawa, Madelon Hulsebos, Yiming Lin, J.D. Zamfirescu-Pereira, Harrison Chase, Will Fu-Hinthorn, Aditya G. Parameswaran, Eugene Wu
Presents a method for developers to synthesize data quality assertions to identify when LLMs are making mistakes by analyzing prompt version history.
Cognitive Architectures for Language Agents
by Theodore R. Sumers, Shunyu Yao, Karthik Narasimhan, Thomas L. Griffiths
Proposes a framework for organizing recently developed language agents and identifying directions for future development.
Learning by Demonstration for a Collaborative Planning Environment
by Karen Myers, Jake Kolojejchic, Carl Angiolillo, Tim Cummings, Tom Garvey, Matt Gaston, Melinda Gervasio, Will Haines, Chris Jones, Kellie Keifer, Janette Knittel, David Morley, William Ommert, Scott Potter
Describes deploying a learning by demonstration technology to help automate repetitive tasks.
Large Language Models Still Can’t Plan (A Benchmark for LLMs on Planning and Reasoning about Change)
by Karthik Valmeekam, Alberto Olmo, Sarath Sreedharan, Subbarao Kambhampati
Introduces a framework to test large language models on planning and reasoning about change.
Can Large Language Models Reason and Plan?
by Subbarao Kambhampati
Explains how the author tested language models’ reasoning and planning capabilities, along with methods for improving their performance. Concludes that models perform retrieval, which can be mistaken for reasoning.
Chilling autonomy: Policy enforcement for human oversight of AI agents
by Peter Cihon
Details how policymakers, researchers, and developers can enforce human oversight of AI agents.
Richard Feynman on education in Brazil
Excerpt by Richard Feynman on Rob Shearer's blog
An excerpt about learning from Richard Feynman’s 1985 book, “Surely You’re Joking, Mr. Feynman!”
Artificial Intelligence: A Modern Approach
by Stuart Russell and Peter Norvig
A book that explores the overall field of artificial intelligence.
A Python framework to help build AI applications
A Python library that simplifies working with structured outputs from large language models.
A software engineering model evaluated on SWE-bench and designed to emulate the thought processes of human engineers.
A framework with six open-source libraries that assists with building large language model-based applications.
A framework that optimizes the weights and prompts of language models algorithmically by abstracting model pipelines as graphs.