Readings & Resources

Related Courses

Stanford CS 324

Waterloo CS 886

Benchmarks

Knowledge

MMLU: https://github.com/hendrycks/test?tab=readme-ov-file
SummEdits: https://github.com/salesforce/factualNLG

Reasoning Capability

AgentBench: https://github.com/THUDM/AgentBench
SWE-bench: https://github.com/princeton-nlp/SWE-bench
LeanDojo: https://leandojo.org/
MATH: https://github.com/hendrycks/math/
HumanEval: https://github.com/openai/human-eval
Chain-of-Thought Hub: https://github.com/FranxYao/chain-of-thought-hub
WebArena: https://arxiv.org/abs/2307.13854

Alignment

SafetyBench: https://github.com/thu-coai/SafetyBench

Benchmark Surveys

MetaTool: https://arxiv.org/pdf/2310.03128.pdf
LLM-eval-survey: https://github.com/MLGroupJLU/LLM-eval-survey

References

(To be updated.)
(Note: some papers belong in more than one category.)

LLM Capabilities & Eval

Reasoning (Prompt)

More Reasoning

Understanding Reasoning

Meta-Prompt Tuning

Process Reward

Search

Scalable Oversight

Embodied Agent

Code Generation

Tool-Use

Theorem Proving & Mathematical Reasoning
