Knowledge
SummEdits: https://github.com/salesforce/factualNLG
Reasoning Capability
ARC Prize: https://arcprize.org/
AgentBench: https://github.com/THUDM/AgentBench
SWE-bench: https://github.com/princeton-nlp/SWE-bench
LeanDojo: https://leandojo.org/
HumanEval: https://github.com/openai/human-eval
Chain-of-Thought Hub: https://github.com/FranxYao/chain-of-thought-hub
WebArena: https://arxiv.org/abs/2307.13854
BrowseComp: https://openai.com/index/browsecomp/
TravelPlanner: https://osu-nlp-group.github.io/TravelPlanner/
SciInstruct: https://arxiv.org/abs/2401.07950
Alignment
SafetyBench: https://github.com/thu-coai/SafetyBench
Benchmark Surveys
MetaTool: https://arxiv.org/pdf/2310.03128.pdf
LLM-eval-survey: https://github.com/MLGroupJLU/LLM-eval-survey
Multi-Modal
LAB-Bench: https://github.com/Future-House/LAB-Bench
MicroVQA: https://jmhb0.github.io/microvqa/
Scaling Laws
(To be updated.)
(Note: some papers belong to more than one category.)
LLM Capabilities & Eval
Reasoning (Prompt)
Self-Discover: Large Language Models Self-Compose Reasoning Structures
Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models
Measuring and Improving the Faithfulness of Model-Generated Reasoning
Understanding Reasoning
Why Think Step by Step? Reasoning Emerges from the Locality of Experience
Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective
Meta-Prompt Tuning
Process Reward
Search
Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping
Mathematical Discoveries from Program Search with Large Language Models
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Agent Frameworks
Scalable Oversight
Measuring Progress on Scalable Oversight for Large Language Models
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Embodied Agent
Voyager: An Open-Ended Embodied Agent with Large Language Models
Eureka: Human-Level Reward Design via Coding Large Language Models
GenSim: Generating Robotic Simulation Tasks via Large Language Models
Code Generation
Parsel: Algorithmic Reasoning with Language Models by Composing Decompositions
ALGO: Synthesizing Algorithmic Programs with LLM-Generated Oracle Verifiers
Scattered Forest Search: Smarter Code Space Exploration with LLM Inference
Tool-Use
ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
Toolformer: Language Models Can Teach Themselves to Use Tools
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning
Theorem Proving & Mathematical Reasoning
Mathematical Discoveries from Program Search with Large Language Models
STP: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving
Test-time Scaling
Auto Agent Design
Learning to Think
Agents for Science Workflows
Visual Programming & Spatial Reasoning