AI Agents That Matter
by Sayash Kapoor, Benedikt Stroebl, Zachary S. Siegel, Nitya Nadgir, Arvind Narayanan
Analyzes shortcomings of the current state of agent evaluations, and outlines steps to mitigate these shortcomings.
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
by Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, Azalia Mirhoseini
Studies scaling inference compute through repeated sampling to improve the performance of large language models.
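Repeated sampling is typically measured with the pass@k metric: the probability that at least one of k sampled attempts is correct. A minimal sketch of the standard unbiased estimator from n total samples with c correct ones (the function name is illustrative; the estimator form follows Chen et al.'s widely used formulation, not this paper specifically):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: the probability that at least one
    of k samples drawn (without replacement) from n total samples,
    c of which are correct, is correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset
        # must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 samples of which 1 is correct, pass@1 is 0.5, and pass@2 is 1.0, reflecting that more samples raise the chance of at least one success.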
The Ethics of Advanced AI Assistants
by Iason Gabriel, Arianna Manzini et al.
Analyzes the ethical and societal risks posed by advanced AI assistants, discusses considerations for deploying them, and offers recommendations for various stakeholders.
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
by Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan
Presents an evaluation framework for language models consisting of real-world software engineering problems from GitHub issues and pull requests across 12 Python repositories.
Tau-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
by Shunyu Yao, Noah Shinn, Pedram Razavi, Karthik Narasimhan
A benchmark that simulates conversations between users and language agents in real-world domains. The authors also introduce pass^k, a metric that evaluates agent reliability over multiple trials.
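Whereas pass@k asks whether any of k trials succeeds, pass^k asks whether all k independent trials succeed, rewarding consistency. A minimal sketch of one way to estimate it from n observed trials with c successes (the function name and exact estimator form are illustrative, not taken from the paper):

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k: the probability that all k i.i.d. trials
    succeed, given c observed successes out of n trials (k <= n).
    Computed as the chance that a random k-subset of the n trials
    contains only successes."""
    if c < k:
        # Not enough successes for any k-subset to be all-successful.
        return 0.0
    return comb(c, k) / comb(n, k)
```

Note that pass^k falls quickly as k grows unless the success rate is near 1, which is what makes it a stricter reliability measure than single-trial accuracy.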
Lessons from the Trenches on Reproducible Evaluation of Language Models
by Stella Biderman, Hailey Schoelkopf, Lintang Sutawika et al.
Provides guidance for researchers on evaluating language models.
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
by Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, Christopher Potts
Introduces DSPy, a programming model that abstracts language model pipelines as graphs; DSPy includes modules that can call language models and compilers to optimize its pipeline.
Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs
by Krista Opsahl-Ong, Michael J Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, Omar Khattab
Studies optimizing instructions and few-shot demonstrations for multi-stage language model programs, comparing three distinct strategies.
SPADE: Synthesizing Data Quality Assertions for Large Language Model Pipelines
by Shreya Shankar, Haotian Li, Parth Asawa, Madelon Hulsebos, Yiming Lin, J.D. Zamfirescu-Pereira, Harrison Chase, Will Fu-Hinthorn, Aditya G. Parameswaran, Eugene Wu
Presents a method for developers to synthesize data quality assertions to identify when LLMs are making mistakes by analyzing prompt version history.
Cognitive Architectures for Language Agents
by Theodore R. Sumers, Shunyu Yao, Karthik Narasimhan, Thomas L. Griffiths
Proposes a framework for organizing recently developed language agents and identifying directions for future development.
Learning by Demonstration for a Collaborative Planning Environment
by Karen Myers, Jake Kolojejchic, Carl Angiolillo, Tim Cummings, Tom Garvey, Matt Gaston, Melinda Gervasio, Will Haines, Chris Jones, Kellie Keifer, Janette Knittel, David Morley, William Ommert, Scott Potter
Describes deploying a learning by demonstration technology to help automate repetitive tasks.
Large Language Models Still Can’t Plan (A Benchmark for LLMs on Planning and Reasoning about Change)
by Karthik Valmeekam, Alberto Olmo, Sarath Sreedharan, Subbarao Kambhampati
Introduces a framework to test large language models on planning and reasoning about change.
Can Large Language Models Reason and Plan?
by Subbarao Kambhampati
Explains how the author tested language models’ reasoning and planning capabilities, along with methods for improving their performance. Concludes that models perform retrieval, which can be mistaken for reasoning.
Chilling autonomy: Policy enforcement for human oversight of AI agents
by Peter Cihon
Details how policymakers, researchers, and developers can enforce human oversight of AI agents.
Richard Feynman on education in Brazil
Excerpt by Richard Feynman on Rob Shearer's blog
An excerpt about learning from Richard Feynman’s 1985 book, “Surely You’re Joking, Mr. Feynman!”
Artificial Intelligence: A Modern Approach
by Stuart Russell and Peter Norvig
A book that explores the overall field of artificial intelligence.
A Python framework to help build AI applications
A Python library that simplifies working with structured outputs from large language models.
A software engineering model evaluated on SWE-bench and designed to emulate the thought processes of human engineers.
A framework with six open-source libraries that assists with building large language model-based applications.
A framework that optimizes the weights and prompts of language models algorithmically by abstracting model pipelines as graphs.