Readings & Resources

Related Courses

Berkeley CS 294

Stanford CS 224N

Stanford CS 324

Waterloo CS 886

Benchmarks

Knowledge

MMLU: https://github.com/hendrycks/test?tab=readme-ov-file
SummEdits: https://github.com/salesforce/factualNLG

Reasoning Capability

ARC Prize: https://arcprize.org/
AgentBench: https://github.com/THUDM/AgentBench
SWE-bench: https://github.com/princeton-nlp/SWE-bench
LeanDojo: https://leandojo.org/
MATH: https://github.com/hendrycks/math/
HumanEval: https://github.com/openai/human-eval
Chain-of-Thought Hub: https://github.com/FranxYao/chain-of-thought-hub
WebArena: https://arxiv.org/abs/2307.13854
BrowseComp: https://openai.com/index/browsecomp/
GAIA: https://arxiv.org/abs/2311.12983
TravelPlanner: https://osu-nlp-group.github.io/TravelPlanner/
SciInstruct: https://arxiv.org/abs/2401.07950

Alignment

SafetyBench: https://github.com/thu-coai/SafetyBench

Benchmark Surveys

MetaTool: https://arxiv.org/pdf/2310.03128.pdf
LLM-eval-survey: https://github.com/MLGroupJLU/LLM-eval-survey

Multi-Modal

CURIE: https://github.com/google/curie
LAB-Bench: https://github.com/Future-House/LAB-Bench
MicroVQA: https://jmhb0.github.io/microvqa/

Scaling Laws

CAMEL: https://github.com/camel-ai/camel

References

(To be updated.)
(Note: some papers belong in more than one category.)

LLM Capabilities & Eval

Reasoning (Prompt)

Understanding Reasoning

Meta-Prompt Tuning

Process Reward

Search

Agent Frameworks

Scalable Oversight

Embodied Agent

Code Generation

Tool-Use

Theorem Proving & Mathematical Reasoning

Test-time Scaling

Auto Agent Design

Learning to Think

Agents for Science Workflows

Visual Programming & Spatial Reasoning
