Knowledge
SummEdits: https://github.com/salesforce/factualNLG
Reasoning Capability
AgentBench: https://github.com/THUDM/AgentBench
SWE-bench: https://github.com/princeton-nlp/SWE-bench
LeanDojo: https://leandojo.org/
HumanEval: https://github.com/openai/human-eval
Chain-of-Thought Hub: https://github.com/FranxYao/chain-of-thought-hub
WebArena: https://arxiv.org/abs/2307.13854
Alignment
SafetyBench: https://github.com/thu-coai/SafetyBench
Benchmark Surveys
MetaTool: https://arxiv.org/pdf/2310.03128.pdf
LLM-eval-survey: https://github.com/MLGroupJLU/LLM-eval-survey
(To be updated.)
(Note: some papers belong in more than one category.)
LLM Capabilities & Eval
Reasoning (Prompt)
Self-Discover: Large Language Models Self-Compose Reasoning Structures
Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models
More Reasoning
Understanding Reasoning
Why think step by step? Reasoning emerges from the locality of experience
Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective
Meta-Prompt Tuning
Process Reward
Search
Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping
Mathematical discoveries from program search with large language models
Scalable Oversight
Measuring Progress on Scalable Oversight for Large Language Models
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Embodied Agent
Voyager: An Open-Ended Embodied Agent with Large Language Models
Eureka: Human-Level Reward Design via Coding Large Language Models
GenSim: Generating Robotic Simulation Tasks via Large Language Models
Code Generation
Parsel: Algorithmic Reasoning with Language Models by Composing Decompositions
ALGO: Synthesizing Algorithmic Programs with LLM-Generated Oracle Verifiers
Tool-Use
ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
Toolformer: Language Models Can Teach Themselves to Use Tools
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
Theorem Proving & Mathematical Reasoning