The seminar explores prompt engineering techniques for enabling LLMs to handle complex tasks, as well as the use of LLMs to evaluate complex outputs. It features both literature and experimental topics. The literature topics aim to summarize the state of the art in applying and evaluating LLMs. The experimental topics aim to verify the utility of advanced prompt engineering techniques by applying them to tasks beyond those used for illustration and evaluation in the respective papers.
This seminar is organized by Dr. Jennifer D'Souza.
The seminar is available to master's students of the "Electrical Engineering and Information Technology" and "Computer Science" programs at Leibniz University Hannover and is currently offered in the Winter Semester 2025/2026 (WiSe 25/26).
The seminar grade depends on the talk (~50%), the article (~40%), and participation (~10%). All three aspects need to be passed individually.
Slides of the kickoff meeting on 14-10-2025 (uploaded to eLearning)
The start date of the student talks is TBD.
In this seminar, you will
read, understand, and explore scientific literature
critically summarize the state of the art concerning your topic
give a presentation about your topic (before the submission of the report)
Benchmarks Measuring LLM Reasoning and Truthfulness
Wang, Yubo, et al. "Mmlu-pro: A more robust and challenging multi-task language understanding benchmark." Advances in Neural Information Processing Systems 37 (2024): 95266-95290.
Li, Tianle, et al. "From live data to high-quality benchmarks: The arena-hard pipeline." Blog post.[Accessed 07-02-2025] (2024).
White, Colin, et al. "LiveBench: A challenging, contamination-limited LLM benchmark." arXiv preprint arXiv:2406.19314 (2024).
Scaling, Quantifying, and Extrapolating the Capabilities of LLMs
Gadre, Samir Yitzhak, et al. "Language models scale reliably with over-training and on downstream tasks." arXiv preprint arXiv:2403.08540 (2024).
Wu, Yangzhen, et al. "Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models." arXiv preprint arXiv:2408.00724 (2024).
Chen, Zhengyu, et al. "Revisiting scaling laws for language models: The role of data quality and training strategies." Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.
Measuring the Mathematical Abilities of LLMs
He, Chaoqun, et al. "Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems." arXiv preprint arXiv:2402.14008 (2024).
Petrov, Ivo, et al. "Proof or bluff? evaluating llms on 2025 usa math olympiad." arXiv preprint arXiv:2503.21934 (2025).
Gao, Bofei, et al. "Omni-math: A universal olympiad level mathematic benchmark for large language models." arXiv preprint arXiv:2410.07985 (2024).
LLMs as a Judge - I
Kim, Seungone, et al. "Prometheus 2: An open source language model specialized in evaluating other language models." arXiv preprint arXiv:2405.01535 (2024).
Wang, Yicheng, et al. "Dhp benchmark: Are llms good nlg evaluators?" arXiv preprint arXiv:2408.13704 (2024).
Chehbouni, Khaoula, et al. "Neither valid nor reliable? investigating the use of llms as judges." arXiv preprint arXiv:2508.18076 (2025).
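For the experimental track of this topic, a minimal sketch of rubric-based LLM-as-a-judge scoring may help as a starting point. It is only illustrative: call_llm is a placeholder for whatever chat-completion endpoint you use, and the rubric, scale, and parsing convention are assumptions rather than the protocol of any of the papers above.

import re
from typing import Callable

def judge(call_llm: Callable[[str], str], question: str, answer: str) -> int:
    """Score an answer on a 1-5 rubric using an LLM as the judge.

    `call_llm` is a stand-in for any chat-completion call; the rubric
    below is illustrative, not taken from a specific paper.
    """
    prompt = (
        "You are a strict grader. Rate the answer to the question on a scale "
        "from 1 (useless) to 5 (complete, correct, well-justified).\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with a short justification followed by 'Score: <1-5>'."
    )
    reply = call_llm(prompt)
    match = re.search(r"Score:\s*([1-5])", reply)
    return int(match.group(1)) if match else 0  # 0 signals an unparsable verdict

# Example with a dummy model that always returns the same verdict:
if __name__ == "__main__":
    dummy = lambda prompt: "The answer is partially correct. Score: 3"
    print(judge(dummy, "What is 2+2?", "5"))  # -> 3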
LLMs as a Judge - II
Gu, Jiawei, et al. "A survey on llm-as-a-judge." arXiv preprint arXiv:2411.15594 (2024).
Alignment of LLMs as Evaluators
Kim, Seungone, et al. "Prometheus 2: An open source language model specialized in evaluating other language models." arXiv preprint arXiv:2405.01535 (2024).
Zhu, Lianghui, Xinggang Wang, and Xinlong Wang. "Judgelm: Fine-tuned large language models are scalable judges." arXiv preprint arXiv:2310.17631 (2023).
Hashemi, Helia, et al. "LLM-rubric: A multidimensional, calibrated approach to automated evaluation of natural language texts." arXiv preprint arXiv:2501.00274 (2024).
Can LLMs Evaluate Themselves?
Panickssery, Arjun, Samuel Bowman, and Shi Feng. "Llm evaluators recognize and favor their own generations." Advances in Neural Information Processing Systems 37 (2024): 68772-68802.
Can LLMs Reliably Evaluate Themselves? A Probabilistic VC Framework. OpenReview: https://openreview.net/forum?id=BuNhkdvEDz
Yuan, Weizhe, et al. "Self-rewarding language models." Forty-first International Conference on Machine Learning. 2024.
Synthetic Evaluation Suites for LLMs
Li, Tianle, et al. "From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline." arXiv preprint arXiv:2406.11939 (2024).
Li, Tianle, et al. "Bench-O-Matic: Automating Benchmark Curation from Crowdsourced Data."
Li, Tianle, et al. "From live data to high-quality benchmarks: The arena-hard pipeline." Blog post.[Accessed 07-02-2025] (2024).
Self-refinement of LLMs
Madaan, Aman, et al. "Self-refine: Iterative refinement with self-feedback." Advances in Neural Information Processing Systems 36 (2024).
Ranaldi, Leonardo, and Andrè Freitas. "Self-Refine Instruction-Tuning for Aligning Reasoning in Language Models." arXiv preprint arXiv:2405.00402 (2024).
Kamoi, Ryo, et al. "When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs." arXiv preprint arXiv:2406.01297 (2024).
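For the experimental track of this topic, the generate-critique-refine cycle can be prototyped in a few lines. The sketch below is a generic loop in the spirit of Self-Refine, not the authors' implementation: call_llm is a placeholder for any chat-completion call, and the 'NO ISSUES' stop marker and the iteration budget are illustrative choices.

from typing import Callable

def self_refine(call_llm: Callable[[str], str], task: str, max_iters: int = 3) -> str:
    """Iteratively improve a draft using the model's own feedback.

    Generic generate -> critique -> refine loop in the spirit of Self-Refine;
    the 'NO ISSUES' stop marker and iteration budget are illustrative choices.
    """
    draft = call_llm(f"Solve the following task:\n{task}")
    for _ in range(max_iters):
        feedback = call_llm(
            f"Task:\n{task}\n\nDraft answer:\n{draft}\n\n"
            "List concrete problems with the draft, or reply 'NO ISSUES'."
        )
        if "NO ISSUES" in feedback:
            break
        draft = call_llm(
            f"Task:\n{task}\n\nDraft answer:\n{draft}\n\n"
            f"Feedback:\n{feedback}\n\nRewrite the answer, fixing every issue."
        )
    return draft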
LLMs Hallucination
Alansari, Aisha, and Hamzah Luqman. "Large Language Models Hallucination: A Comprehensive Survey." arXiv e-prints (2025): arXiv-2510.
Yan, Shi-Qi, et al. "Corrective retrieval augmented generation." (2024).
Bao, Forrest Sheng, et al. "Faithbench: A diverse hallucination benchmark for summarization by modern llms." arXiv preprint arXiv:2410.13210 (2024).
Jailbreak Attacks on LLMs
Yi, Sibo, et al. "Jailbreak attacks and defenses against large language models: A survey." arXiv preprint arXiv:2407.04295 (2024).
Chao, Patrick, et al. "Jailbreakbench: An open robustness benchmark for jailbreaking large language models." Advances in Neural Information Processing Systems 37 (2024): 55005-55029.
Evaluating Morality in LLMs
Ji, Jianchao, et al. "Moralbench: Moral evaluation of llms." ACM SIGKDD Explorations Newsletter 27.1 (2025): 62-71.
Jiao, Junfeng, et al. "LLM ethics benchmark: a three-dimensional assessment system for evaluating moral reasoning in large language models." Scientific Reports 15.1 (2025): 34642.
Seror, Avner. "The moral mind(s) of large language models." arXiv preprint arXiv:2412.04476 (2024).
Biases in LLMs
Gallegos, Isabel O., et al. "Bias and fairness in large language models: A survey." Computational Linguistics 50.3 (2024): 1097-1179.
Jung, Dahyun, et al. "FLEX: A Benchmark for Evaluating Robustness of Fairness in Large Language Models." arXiv preprint arXiv:2503.19540 (2025).
Jin, Jiho, et al. "Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations." arXiv preprint arXiv:2503.06987 (2025).
How can LLMs act as Lifelong Learners?
Wang, Renzhi, and Piji Li. "MEMoE: Enhancing Model Editing with Mixture of Experts Adaptors." arXiv preprint arXiv:2405.19086 (2024).
Wang, Renzhi, and Piji Li. "LEMoE: Advanced Mixture of Experts Adaptor for Lifelong Model Editing of Large Language Models." arXiv preprint arXiv:2406.20030 (2024).
Yang, Shu, et al. "MoRAL: MoE Augmented LoRA for LLMs' Lifelong Learning." arXiv preprint arXiv:2402.11260 (2024).
Evaluation of Code Writing Ability of LLMs
Jain, Naman, et al. "Livecodebench: Holistic and contamination free evaluation of large language models for code." arXiv preprint arXiv:2403.07974 (2024).
Zhang, Linghao, et al. "SWE-bench Goes Live!" arXiv preprint arXiv:2505.23419 (2025).
Gu, Alex, et al. "Cruxeval: A benchmark for code reasoning, understanding and execution." arXiv preprint arXiv:2401.03065 (2024).
Evaluating Scientific Image and Text Generation
Chang, Yifan, et al. "SridBench: Benchmark of Scientific Research Illustration Drawing of Image Generation Model." arXiv preprint arXiv:2505.22126 (2025).
Zadeh, Fatemeh Pesaran, et al. "Text2chart31: Instruction tuning for chart generation with automatic feedback." arXiv preprint arXiv:2410.04064 (2024).
Su, Weihang, et al. "Benchmarking Computer Science Survey Generation." arXiv e-prints (2025): arXiv-2508.
D'Souza, Jennifer, Hamed Babaei Giglou, and Quentin Münch. "YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering." arXiv preprint arXiv:2505.14279 (2025).
LLMs for Translation
Gain, Baban, Dibyanayan Bandyopadhyay, and Asif Ekbal. "Bridging the linguistic divide: A survey on leveraging large language models for machine translation." arXiv preprint arXiv:2504.01919 (2025).
"Joint speech and text machine translation for up to 100 languages." Nature 637, no. 8046 (2025): 587-593.
Zhang, Ran, Wei Zhao, and Steffen Eger. "How good are llms for literary translation, really? literary translation evaluation with humans and llms." arXiv preprint arXiv:2410.18697 (2024).
LLMs as Interfaces to Datastores
Hong, Zijin, et al. "Next-generation database interfaces: A survey of llm-based text-to-sql." IEEE Transactions on Knowledge and Data Engineering (2025).
Li, Jinyang, et al. "Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls." Advances in Neural Information Processing Systems 36 (2023): 42330-42357.
Ozsoy, Makbule Gulcin, et al. "Text2cypher: Bridging natural language and graph databases." arXiv preprint arXiv:2412.10064 (2024).
LLMs for Programming
Zhang, Linghao, et al. "SWE-bench Goes Live!" arXiv preprint arXiv:2505.23419 (2025).
Jain, Naman, et al. "Livecodebench: Holistic and contamination free evaluation of large language models for code." arXiv preprint arXiv:2403.07974 (2024).
Yang, John, et al. "Swe-agent: Agent-computer interfaces enable automated software engineering." Advances in Neural Information Processing Systems 37 (2024): 50528-50652.
LLMs in Education
Kestin, Greg, et al. "AI tutoring outperforms in-class active learning: an RCT introducing a novel research-based design in an authentic educational setting." Scientific Reports 15.1 (2025): 17458.
Macina, Jakub, et al. "Mathtutorbench: A benchmark for measuring open-ended pedagogical capabilities of llm tutors." arXiv preprint arXiv:2502.18940 (2025).
Xu, Hanyi, et al. "Large language models for education: A survey." arXiv preprint arXiv:2405.13001 (2024).
Scientific Ideation Assistance
Si, Chenglei, Diyi Yang, and Tatsunori Hashimoto. "Can llms generate novel research ideas? a large-scale human study with 100+ nlp researchers." arXiv preprint arXiv:2409.04109 (2024).
Gu, Tianyang, et al. "LLMs can realize combinatorial creativity: generating creative ideas via LLMs for scientific research." arXiv preprint arXiv:2412.14141 (2024).
Li, Sitong, et al. "A review of llm-assisted ideation." arXiv preprint arXiv:2503.00946 (2025).
Generating Scientific Discoveries
Shojaee, Parshin, et al. "Llm-srbench: A new benchmark for scientific equation discovery with large language models." arXiv preprint arXiv:2504.10415 (2025).
Kim, Heegyu, et al. "Towards Fully-Automated Materials Discovery via Large-Scale Synthesis Dataset and Expert-Level LLM-as-a-Judge." arXiv preprint arXiv:2502.16457 (2025).
Mandal, Indrajeet, et al. "Evaluating large language model agents for automation of atomic force microscopy." Nature Communications 16.1 (2025): 9104.
Systematic Literature Review
Delgado-Chaves, Fernando M., et al. "Transforming literature screening: The emerging role of large language models in systematic reviews." Proceedings of the National Academy of Sciences 122.2 (2025): e2411962122.
Tang, Xuemei, Xufeng Duan, and Zhenguang G. Cai. "Large Language Models for Automated Literature Review: An Evaluation of Reference Generation, Abstract Writing, and Review Composition." arXiv preprint arXiv:2412.13612 (2024).
Mikriukov, Andrei, et al. "AI Tools for Automating Systematic Literature Reviews." Proceedings of the 2025 International Conference on Software Engineering and Computer Applications. 2025.
Generative AI Assistance in Chemistry/Material Sciences
M. Bran, Andres, et al. "Augmenting large language models with chemistry tools." Nature Machine Intelligence 6.5 (2024): 525-535.
Liu, Siyu, et al. "MatTools: Benchmarking Large Language Models for Materials Science Tools." arXiv preprint arXiv:2505.10852 (2025).
Yanguas-Gil, Angel, et al. "Benchmarking large language models for materials synthesis: The case of atomic layer deposition." Journal of Vacuum Science & Technology A 43.3 (2025).
Experimental Design & Self-Driving Labs
Hartung, Thomas. "AI, Agentic Models and Lab Automation for Scientific Discovery–the beginning of scAInce." Frontiers in Artificial Intelligence 8 (2025): 1649155.
Mandal, Indrajeet, et al. "Evaluating large language model agents for automation of atomic force microscopy." Nature Communications 16.1 (2025): 9104.
Brunnsåker, D., et al. "Agentic AI Integrated with Scientific Knowledge: Laboratory Validation in Systems Biology." (2025).
Scientific Knowledge Graphs & Curation
Buehler, Markus J. "Generative retrieval-augmented ontologic graph and multiagent strategies for interpretive large language model-based materials design." ACS Engineering Au 4.2 (2024): 241-277.
Belova, Margarita, et al. "GraphMERT: Efficient and Scalable Distillation of Reliable Knowledge Graphs from Unstructured Data." arXiv preprint arXiv:2510.09580 (2025).
Zheng, Tianshi, et al. "From automation to autonomy: A survey on large language models in scientific discovery." arXiv preprint arXiv:2505.13259 (2025).
Agentic AI for Scientific Discovery
Gridach, Mourad, et al. "Agentic ai for scientific discovery: A survey of progress, challenges, and future directions." arXiv preprint arXiv:2503.08979 (2025).
Web & interactive agents (acting in realistic environments)
Chezelles, De, et al. "The browsergym ecosystem for web agent research." arXiv preprint arXiv:2412.05467 (2024).
Zhou, Shuyan, et al. "Webarena: A realistic web environment for building autonomous agents." arXiv preprint arXiv:2307.13854 (2023).
Drouin, Alexandre, et al. "Workarena: How capable are web agents at solving common knowledge work tasks?" arXiv preprint arXiv:2403.07718 (2024).
Autonomous coding agents (from sandbox to real repos)
Yang, John, et al. "Swe-agent: Agent-computer interfaces enable automated software engineering." Advances in Neural Information Processing Systems 37 (2024): 50528-50652.
Jimenez, Carlos E., et al. "Swe-bench: Can language models resolve real-world github issues?" arXiv preprint arXiv:2310.06770 (2023).
Wang, Xingyao, et al. "Opendevin: An open platform for ai software developers as generalist agents." arXiv preprint arXiv:2407.16741 (2024).
Multi-agent collaboration & debate (coordination for tougher reasoning)
Liang, Tian, et al. "Encouraging divergent thinking in large language models through multi-agent debate." arXiv preprint arXiv:2305.19118 (2023).
Estornell, Andrew, and Yang Liu. "Multi-LLM debate: Framework, principals, and interventions." Advances in Neural Information Processing Systems 37 (2024): 28938-28964.
Wu, Qingyun, et al. "Autogen: Enabling next-gen LLM applications via multi-agent conversations." First Conference on Language Modeling. 2024.
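A round-based debate protocol is easy to prototype for the experimental track. In the sketch below every "agent" is the same placeholder model call_llm prompted independently; the number of agents, the number of rounds, and the final aggregation prompt are illustrative assumptions rather than the setup of any specific paper.

from typing import Callable, List

def debate(call_llm: Callable[[str], str], question: str,
           n_agents: int = 3, n_rounds: int = 2) -> str:
    """Round-based multi-agent debate with a single underlying model.

    Every 'agent' is the same model prompted independently; in each round
    agents see the other agents' latest answers and may revise their own.
    The aggregation prompt at the end is an illustrative choice.
    """
    answers: List[str] = [
        call_llm(f"Answer the question:\n{question}") for _ in range(n_agents)
    ]
    for _ in range(n_rounds):
        new_answers = []
        for i in range(n_agents):
            others = "\n---\n".join(a for j, a in enumerate(answers) if j != i)
            new_answers.append(call_llm(
                f"Question:\n{question}\n\nYour previous answer:\n{answers[i]}\n\n"
                f"Other agents answered:\n{others}\n\n"
                "Point out flaws in the other answers and give your revised answer."
            ))
        answers = new_answers
    return call_llm(
        f"Question:\n{question}\n\nCandidate answers:\n" + "\n---\n".join(answers) +
        "\n\nSummarize the debate and state the single best final answer."
    )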
Programmatic & structured reasoning (planning/decomposition)
Besta, Maciej, et al. "Graph of thoughts: Solving elaborate problems with large language models." Proceedings of the AAAI conference on artificial intelligence. Vol. 38. No. 16. 2024.
Sel, Bilgehan, et al. "Algorithm of thoughts: Enhancing exploration of ideas in large language models." arXiv preprint arXiv:2308.10379 (2023).
Khattab, Omar, et al. "Dspy: Compiling declarative language model calls into self-improving pipelines." arXiv preprint arXiv:2310.03714 (2023).
Agentic RAG & retrieval planning (Part 1)
Edge, Darren, et al. "From local to global: A graph rag approach to query-focused summarization." arXiv preprint arXiv:2404.16130 (2024).
Yan, Shi-Qi, et al. "Corrective retrieval augmented generation." (2024).
Verma, Prakhar, et al. "Plan* rag: Efficient test-time planning for retrieval augmented generation." arXiv preprint arXiv:2410.20753 (2024).
Agentic RAG & retrieval planning (Part 2)
Zhao, Siyun, et al. "Retrieval augmented generation (rag) and beyond: A comprehensive survey on how to make your llms use external data more wisely." arXiv preprint arXiv:2409.14924 (2024).
Long context & persistent memory
Ding, Yiran, et al. "Longrope: Extending llm context window beyond 2 million tokens." arXiv preprint arXiv:2402.13753 (2024).
Munkhdalai, Tsendsuren, Manaal Faruqui, and Siddharth Gopal. "Leave no context behind: Efficient infinite context transformers with infini-attention." arXiv preprint arXiv:2404.07143 (2024).
Gao, Tianyu, et al. "How to train long-context language models (effectively)." arXiv preprint arXiv:2410.02660 (2024).
Test-time adaptation & self-improvement
Zuo, Yuxin, et al. "Ttrl: Test-time reinforcement learning." arXiv preprint arXiv:2504.16084 (2025).
Hübotter, Jonas, et al. "Efficiently learning at test-time: Active fine-tuning of llms." arXiv preprint arXiv:2410.08020 (2024).
Yuan, Weizhe, et al. "Self-rewarding language models." Forty-first International Conference on Machine Learning. 2024.
Document/charts multimodal reasoning (grounding math & layout)
Nacson, Mor Shpigel, et al. "Docvlm: Make your vlm an efficient reader." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025.
Xu, Zhengzhuo, et al. "Chartbench: A benchmark for complex visual reasoning in charts." arXiv preprint arXiv:2312.15915 (2023).
Lu, Pan, et al. "Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts." arXiv preprint arXiv:2310.02255 (2023).
Time-series foundation models (planning/forecasting as complex tasks)
Ansari, Abdul Fatir, et al. "Chronos: Learning the language of time series." arXiv preprint arXiv:2403.07815 (2024).
Toner, William, et al. "Performance of zero-shot time series foundation models on cloud data." arXiv preprint arXiv:2502.12944 (2025).
Jin, Ming, et al. "Time-llm: Time series forecasting by reprogramming large language models." arXiv preprint arXiv:2310.01728 (2023).
Agent safety & evaluation under realistic risk
Zhang, Zhexin, et al. "Agent-safetybench: Evaluating the safety of llm agents." arXiv preprint arXiv:2412.14470 (2024).
Yuan, Tongxin, et al. "R-judge: Benchmarking safety risk awareness for llm agents." arXiv preprint arXiv:2401.10019 (2024).
OpenAI Model Spec (Draft) — principles for model/agent behavior (policy lens). https://cdn.openai.com/spec/model-spec-2024-05-08.html
Prompt Search / Breeding
Fernando, Chrisantha, et al. "Promptbreeder: Self-referential self-improvement via prompt evolution." arXiv preprint arXiv:2309.16797 (2023).
Guo, Qingyan, et al. "Connecting large language models with evolutionary algorithms yields powerful prompt optimizers." arXiv preprint arXiv:2309.08532 (2023).
Yang, Chengrun, et al. "Large language models as optimizers." The Twelfth International Conference on Learning Representations. 2023.
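The evolutionary flavour of these papers can be approximated with a simple mutate-and-select loop. The sketch below is only a caricature of Promptbreeder/EvoPrompt: call_llm and score are placeholders you must supply, the mutation instruction is an assumption, and the self-referential evolution of mutation prompts from Promptbreeder is omitted.

import random
from typing import Callable, List, Tuple

def evolve_prompts(call_llm: Callable[[str], str],
                   score: Callable[[str], float],
                   seed_prompts: List[str],
                   generations: int = 5,
                   population: int = 6) -> Tuple[str, float]:
    """Evolutionary prompt search in the spirit of Promptbreeder/EvoPrompt.

    `score(prompt)` should evaluate a task prompt on a small dev set (e.g.
    accuracy); the mutation instruction below is an illustrative assumption.
    Promptbreeder additionally evolves the mutation prompts themselves,
    which this sketch omits.
    """
    pool = [(p, score(p)) for p in seed_prompts]
    for _ in range(generations):
        pool.sort(key=lambda ps: ps[1], reverse=True)
        pool = pool[:population]                      # keep the fittest prompts
        parent = random.choice(pool[: max(2, population // 2)])[0]
        child = call_llm(
            "Rewrite the following task instruction so that a language model "
            f"follows it more reliably. Keep the task unchanged.\n\n{parent}"
        )
        pool.append((child, score(child)))
    return max(pool, key=lambda ps: ps[1])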
Active Prompt
Diao, Shizhe, et al. "Active prompting with chain-of-thought for large language models." arXiv preprint arXiv:2302.12246 (2023).
Kung, Po-Nien, et al. "Active instruction tuning: Improving cross-task generalization by training on prompt sensitive tasks." arXiv preprint arXiv:2311.00288 (2023).
Xia, Yu, et al. "From selection to generation: A survey of llm-based active learning." arXiv preprint arXiv:2502.11767 (2025).
Contrastive Prompting
Yao, Liang. "Large language models are contrastive reasoners." arXiv preprint arXiv:2403.08211 (2024).
Chia, Yew Ken, et al. "Contrastive chain-of-thought prompting." arXiv preprint arXiv:2311.09277 (2023).
Zhong, Qihuang, et al. "ROSE doesn't do that: Boosting the safety of instruction-tuned large language models with reverse prompt contrastive decoding." arXiv preprint arXiv:2402.11889 (2024).
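The core idea of contrastive chain-of-thought prompting, i.e. showing the model both a valid and an invalid reasoning chain for the same demonstration, can be captured in a small prompt builder. The exemplars and the instruction wording below are placeholders, not the templates from the papers above.

def contrastive_cot_prompt(question: str,
                           demo_question: str,
                           good_rationale: str,
                           bad_rationale: str) -> str:
    """Build a contrastive chain-of-thought prompt.

    The demonstration shows both a valid and an invalid reasoning chain for
    the same example, following the idea of contrastive CoT prompting; the
    wording of the instructions is an illustrative assumption.
    """
    return (
        f"Example question: {demo_question}\n"
        f"Correct reasoning: {good_rationale}\n"
        f"Incorrect reasoning (do NOT reason like this): {bad_rationale}\n\n"
        "Now answer the new question. Think step by step and avoid the kind of "
        "mistake shown above.\n"
        f"Question: {question}\nAnswer:"
    )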
Prompt Optimization
Li, Wenwu, et al. "A survey of automatic prompt engineering: An optimization perspective." arXiv preprint arXiv:2502.11560 (2025).
Yuksekgonul, Mert, et al. "Textgrad: Automatic 'differentiation' via text." arXiv preprint arXiv:2406.07496 (2024).
Hsieh, Cho-Jui, et al. "Automatic engineering of long prompts." arXiv preprint arXiv:2311.10117 (2023).
Limitations of LLMs
Abbe, Emmanuel, et al. "How far can transformers reason? the globality barrier and inductive scratchpad." Advances in Neural Information Processing Systems 37 (2024): 27850-27895.
Kostikova, Aida, et al. "LLLMs: A Data-Driven Survey of Evolving Research on Limitations of Large Language Models." arXiv preprint arXiv:2505.19240 (2025).
Roh, Jaechul, et al. "Break-The-Chain: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation." arXiv preprint arXiv:2506.06971 (2025).
Retrieval-augmented Generation
Edge, Darren, et al. "From local to global: A graph rag approach to query-focused summarization." arXiv preprint arXiv:2404.16130 (2024).
Yan, Shi-Qi, et al. "Corrective retrieval augmented generation." (2024).
Yu, Hao, et al. "Evaluation of retrieval-augmented generation: A survey." CCF Conference on Big Data. Singapore: Springer Nature Singapore, 2024.
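For quick experiments, a retrieve-then-read pipeline can be built with nothing but the standard library. The toy retriever below uses bag-of-words cosine similarity; the cited papers study far stronger retrievers (dense embeddings, graph-structured indexes) and, in the case of corrective RAG, feedback on retrieval quality. call_llm is again a placeholder.

import math
from collections import Counter
from typing import Callable, List

def retrieve(query: str, docs: List[str], k: int = 3) -> List[str]:
    """Rank documents by cosine similarity of bag-of-words counts (toy retriever)."""
    def vec(text: str) -> Counter:
        return Counter(text.lower().split())
    q = vec(query)
    def cosine(d: Counter) -> float:
        dot = sum(q[t] * d[t] for t in q)
        norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
        return dot / norm if norm else 0.0
    return sorted(docs, key=lambda doc: cosine(vec(doc)), reverse=True)[:k]

def rag_answer(call_llm: Callable[[str], str], query: str, docs: List[str]) -> str:
    """Retrieve-then-read: stuff the top-k passages into the prompt and generate."""
    context = "\n---\n".join(retrieve(query, docs))
    return call_llm(
        "Answer the question using only the context below. If the context is "
        f"insufficient, say so.\n\nContext:\n{context}\n\nQuestion: {query}"
    )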
Eliciting Reasoning in LLMs - I
Sel, Bilgehan, et al. "Algorithm of thoughts: Enhancing exploration of ideas in large language models." arXiv preprint arXiv:2308.10379 (2023).
Yao, Shunyu, et al. "Tree of thoughts: Deliberate problem solving with large language models." Advances in neural information processing systems 36 (2023): 11809-11822.
Besta, Maciej, et al. "Graph of thoughts: Solving elaborate problems with large language models." Proceedings of the AAAI conference on artificial intelligence. Vol. 38. No. 16. 2024.
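The search-over-thoughts idea shared by these papers can be sketched as a small breadth-first loop: propose several next steps per partial solution, score them, and keep only the best few. The prompts, beam width, and depth below are illustrative assumptions; score may itself be an LLM-based evaluator.

from typing import Callable, List

def tree_of_thoughts(call_llm: Callable[[str], str],
                     score: Callable[[str], float],
                     task: str, depth: int = 3,
                     branch: int = 3, beam: int = 2) -> str:
    """Breadth-first Tree-of-Thoughts-style search over partial solutions.

    At each level, `branch` continuations are proposed for every kept state and
    only the `beam` highest-scoring states survive. `score` may itself be an
    LLM-based evaluator; the prompts here are illustrative assumptions.
    """
    states: List[str] = [""]                      # partial reasoning traces
    for _ in range(depth):
        candidates = []
        for state in states:
            for _ in range(branch):
                step = call_llm(
                    f"Task: {task}\nReasoning so far:\n{state}\n"
                    "Propose the single next reasoning step."
                )
                candidates.append(state + "\n" + step)
        states = sorted(candidates, key=score, reverse=True)[:beam]
    return max(states, key=score)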
Eliciting Reasoning in LLMs - II
Wang, Xuezhi, et al. "Self-consistency improves chain of thought reasoning in language models." arXiv preprint arXiv:2203.11171 (2022).
Lightman, Hunter, et al. "Let's verify step by step." The Twelfth International Conference on Learning Representations. 2023.
Liang, Tian, et al. "Encouraging divergent thinking in large language models through multi-agent debate." arXiv preprint arXiv:2305.19118 (2023).
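Self-consistency in particular is simple to reproduce: sample several chain-of-thought completions at nonzero temperature and return the majority final answer. The "Answer: ..." extraction convention in the sketch below is an assumption, not part of the original method, and call_llm is a placeholder for a sampled chat-completion call.

from collections import Counter
from typing import Callable

def self_consistency(call_llm: Callable[[str], str], question: str, n: int = 10) -> str:
    """Sample n chain-of-thought answers and return the majority final answer.

    Assumes the model is sampled with nonzero temperature and ends each reply
    with a line of the form 'Answer: <final answer>' (an illustrative convention).
    """
    def final_answer(reply: str) -> str:
        for line in reversed(reply.splitlines()):
            if line.strip().lower().startswith("answer:"):
                return line.split(":", 1)[1].strip()
        return reply.strip()

    prompt = f"{question}\nThink step by step, then finish with 'Answer: <final answer>'."
    votes = Counter(final_answer(call_llm(prompt)) for _ in range(n))
    return votes.most_common(1)[0][0]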
Finetuning Pretrained Language Models (PLMs) with Objectives
Meng, Yu, Mengzhou Xia, and Danqi Chen. "Simpo: Simple preference optimization with a reference-free reward." Advances in Neural Information Processing Systems 37 (2024): 124198-124235.
Hong, Jiwoo, Noah Lee, and James Thorne. "Orpo: Monolithic preference optimization without reference model." arXiv preprint arXiv:2403.07691 (2024).
Ethayarajh, Kawin, et al. "Kto: Model alignment as prospect theoretic optimization." arXiv preprint arXiv:2402.01306 (2024).
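As a reading aid, the sketch below shows a reference-free preference loss in the spirit of SimPO, where the implicit reward is the length-normalized log-likelihood of a response and the chosen response must beat the rejected one by a margin. The hyperparameter values are illustrative, and ORPO and KTO use different objectives.

import numpy as np

def simpo_style_loss(logp_chosen: np.ndarray, len_chosen: np.ndarray,
                     logp_rejected: np.ndarray, len_rejected: np.ndarray,
                     beta: float = 2.0, gamma: float = 0.5) -> float:
    """Reference-free preference loss in the spirit of SimPO.

    logp_* are summed token log-probabilities per example, len_* the response
    lengths. The implicit reward is the length-normalized log-likelihood, and
    the loss pushes the chosen response to beat the rejected one by a margin
    gamma. The beta and gamma values here are illustrative, not tuned.
    """
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    margin = reward_chosen - reward_rejected - gamma
    # -log sigmoid(margin), written via logaddexp for numerical stability
    return float(np.mean(np.logaddexp(0.0, -margin)))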
Parameter-efficient Finetuning of Pretrained Language Models (PLMs)
Liu, Shih-Yang, et al. "Dora: Weight-decomposed low-rank adaptation." Forty-first International Conference on Machine Learning. 2024.
Meng, Fanxu, Zhaohui Wang, and Muhan Zhang. "Pissa: Principal singular values and singular vectors adaptation of large language models." Advances in Neural Information Processing Systems 37 (2024): 121038-121072.
Li, Yang, Shaobo Han, and Shihao Ji. "Vb-lora: Extreme parameter efficient fine-tuning with vector banks." Advances in Neural Information Processing Systems 37 (2024): 16724-16751.
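All three papers build on the low-rank adaptation idea, so a minimal numpy sketch of a LoRA-style forward pass may help fix the notation: the frozen weight is augmented by a trainable low-rank product, and only the low-rank factors are updated. The shapes, scaling, and initialization below follow the common LoRA convention; DoRA, PiSSA, and VB-LoRA change how this update is parameterized or initialized.

import numpy as np

def lora_forward(x: np.ndarray, W: np.ndarray,
                 A: np.ndarray, B: np.ndarray, alpha: float = 16.0) -> np.ndarray:
    """Forward pass of a linear layer with a LoRA-style low-rank update.

    The frozen weight W (d_out x d_in) is augmented by B @ A, where A is
    (r x d_in) and B is (d_out x r) with rank r << min(d_in, d_out), so only
    r * (d_in + d_out) parameters are trained. The alpha/r scaling follows
    the usual LoRA convention.
    """
    r = A.shape[0]
    delta = (alpha / r) * (B @ A)          # low-rank update, same shape as W
    return x @ (W + delta).T

# Toy shapes: batch of 4 inputs, d_in=8, d_out=6, rank r=2
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x, W = rng.normal(size=(4, 8)), rng.normal(size=(6, 8))
    A, B = rng.normal(size=(2, 8)), np.zeros((6, 2))   # B = 0 => no change at init
    print(lora_forward(x, W, A, B).shape)              # (4, 6)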
Small LLMs are the Future?
Belcak, Peter, et al. "Small Language Models are the Future of Agentic AI." arXiv preprint arXiv:2506.02153 (2025).
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv preprint arXiv:2404.14219 (2024). https://arxiv.org/abs/2404.14219
Li, Ethan, et al. "Apple intelligence foundation language models: Tech report 2025." arXiv preprint arXiv:2507.13575 (2025).
New LLM Architectures: Beyond Transformers
Gu, Albert, and Tri Dao. "Mamba: Linear-time sequence modeling with selective state spaces." First Conference on Language Modeling. 2024.
Lieber, Opher, et al. "Jamba: A hybrid transformer-mamba language model." arXiv preprint arXiv:2403.19887 (2024).
LLMs as Mixture of Experts (MoE)
Yang, An, et al. "Qwen3 technical report." arXiv preprint arXiv:2505.09388 (2025).
Liu, Aixin, et al. "Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model." arXiv preprint arXiv:2405.04434 (2024).
Jiang, Albert Q., et al. "Mixtral of experts." arXiv preprint arXiv:2401.04088 (2024).
The following survey articles and tutorials are good starting points for getting an overview of the topics of the seminar:
Zhao, et al.: A Survey of Large Language Models. arXiv:2303.18223 [cs.CL]
Mialon, et al.: Augmented Language Models: A Survey. arXiv:2302.07842 [cs.CL]
Wang, et al.: A Survey on Large Language Model based Autonomous Agents. arXiv:2308.11432 [cs.CL]