Teaching
Solving Complex Tasks Using Large Language Models (LLMs)
The seminar explores prompt engineering techniques for enabling LLMs to handle complex tasks, as well as the use of LLMs to evaluate complex outputs. It offers both literature and experimental topics. Literature topics aim to summarize the state of the art in applying and evaluating LLMs. Experimental topics aim to verify the utility of advanced prompt engineering techniques by applying them to tasks beyond those used for illustration and evaluation in the respective papers.
Organization
This seminar is organized by Dr. Jennifer D'Souza.
The seminar is available to master's students of the "Electrical Engineering and Information Technology" and "Computer Science" programs at the University.
Presentation Schedule (this link will be updated after the literature/topic selection)
Slides of the kickoff meeting on 15-10-2024 (uploaded to eLearning)
Goals
In this seminar, you will
read, understand, and explore scientific literature
critically summarize the state-of-the-art concerning your topic
give a presentation about your topic (before the submission of the report)
Topics
Prompt Engineering
From Self-Consistency to MedPrompt: Improving Results by Ensembling LLMs
Wang, Xuezhi, et al. "Self-Consistency Improves Chain of Thought Reasoning in Language Models." arXiv preprint arXiv:2203.11171 (2022).
Nori, Harsha, et al. “Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine.” arXiv preprint arXiv:2311.16452 (2023).
Zhao, Wayne Xin, et al. "A Survey of Large Language Models." arXiv preprint arXiv:2303.18223 (2023).
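For the experimental topics, a technique like self-consistency can be prototyped in a few lines: sample several chain-of-thought completions and majority-vote the final answers. A minimal sketch, assuming a hypothetical complete(prompt, temperature) wrapper around whatever LLM API the experiments use:

```python
from collections import Counter

def self_consistency(complete, prompt, n_samples=10, temperature=0.7):
    """Majority-vote over sampled reasoning paths (Wang et al., 2022).
    `complete` is a hypothetical stand-in for an LLM API call."""
    answers = []
    for _ in range(n_samples):
        reasoning = complete(prompt, temperature=temperature)
        # Convention assumed here: the prompt instructs the model to end
        # with a line of the form "Answer: <value>".
        for line in reversed(reasoning.splitlines()):
            if line.startswith("Answer:"):
                answers.append(line.removeprefix("Answer:").strip())
                break
    if not answers:
        return None
    answer, _count = Counter(answers).most_common(1)[0]
    return answer
```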
Prompt Search / Breeding
Fernando, Chrisantha, et al. “Promptbreeder: Self-referential self-improvement via prompt evolution.” arXiv preprint arXiv:2309.16797 (2023).
Liu, Pengfei, et al. “Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing.” ACM Computing Surveys 55.9 (2023): 1–35.
Active Prompt
Diao, Shizhe, et al. "Active Prompting with Chain-of-Thought for Large Language Models." arXiv preprint arXiv:2302.12246 (2023).
Mavromatis, Costas, et al. “Which Examples to Annotate for In-Context Learning? Towards Effective and Efficient Selection.” arXiv, October 30, 2023.
Contrastive Prompting
Chia, Yew Ken, et al. “Contrastive Chain-of-Thought Prompting.” arXiv preprint arXiv:2311.09277 (2023).
Paranjape, Bhargavi, et al. “Prompting contrastive explanations for commonsense reasoning tasks.” arXiv preprint arXiv:2106.06823 (2021).
Prompt Optimization
Chen, Xiang, et al. "KnowPrompt: Knowledge-aware prompt-tuning with synergistic optimization for relation extraction." Proceedings of the ACM Web Conference 2022. 2022.
Chen, Yongchao, et al. "Prompt optimization in multi-step tasks (PROMST): Integrating human feedback and preference alignment." arXiv preprint arXiv:2402.08702 (2024).
Wang, Xinyuan, et al. "PromptAgent: Strategic planning with language models enables expert-level prompt optimization." arXiv preprint arXiv:2310.16427 (2023).
Wen, Yuxin, et al. "Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery." Advances in Neural Information Processing Systems 36 (2024).
Limitations of LLMs
Berglund, Lukas, et al. "The Reversal Curse: LLMs Trained on 'A Is B' Fail to Learn 'B Is A'." arXiv preprint arXiv:2309.12288 (2023).
Kaddour, Jean, et al. "Challenges and Applications of Large Language Models." arXiv preprint arXiv:2307.10169 (2023).
Retrieval-augmented Generation
Lewis, Patrick, et al. "Retrieval-augmented generation for knowledge-intensive nlp tasks." Advances in Neural Information Processing Systems 33 (2020): 9459-9474.
Sun, Zhiqing, et al. "Recitation-augmented language models." arXiv preprint arXiv:2210.01296 (2022).
Su, Weihang, et al. "DRAGIN: Dynamic retrieval augmented generation based on the real-time information needs of large language models." arXiv preprint arXiv:2403.10081 (2024).
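All three papers share the same retrieve-then-read loop: fetch passages relevant to the query and condition the generator on them. A minimal sketch, assuming hypothetical embed() and complete() wrappers for an embedding model and an LLM:

```python
import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=3):
    # Cosine similarity between the query and every passage embedding.
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    return [docs[i] for i in np.argsort(-sims)[:k]]

def rag_answer(complete, embed, question, docs, doc_vecs):
    """Retrieve-then-read in the spirit of Lewis et al. (2020).
    `complete` and `embed` are hypothetical API wrappers."""
    passages = retrieve(embed(question), doc_vecs, docs)
    context = "\n\n".join(passages)
    prompt = (f"Use only the context below to answer.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
    return complete(prompt)
```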
Eliciting Reasoning in LLMs - I
Wei, Jason, et al. "Chain-of-thought prompting elicits reasoning in large language models." Advances in neural information processing systems 35 (2022): 24824-24837.
Yao, Shunyu, et al. "ReAct: Synergizing reasoning and acting in language models." arXiv preprint arXiv:2210.03629 (2022).
Caufield, J. Harry, et al. "Structured prompt interrogation and recursive extraction of semantics (SPIRES): A method for populating knowledge bases using zero-shot learning." Bioinformatics 40.3 (2024): btae104.
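Chain-of-thought prompting needs no machinery beyond the prompt itself: one worked exemplar is enough to make the model emit intermediate reasoning steps. A minimal illustration in the style of Wei et al. (2022); the arithmetic exemplar is our own, not taken from the paper:

```python
# One-shot chain-of-thought prompt (exemplar invented for illustration).
COT_PROMPT = """\
Q: A baker has 3 trays of 12 rolls each and sells 17 rolls. How many are left?
A: 3 trays of 12 rolls are 3 * 12 = 36 rolls. After selling 17, there are
36 - 17 = 19 rolls left. The answer is 19.

Q: {question}
A:"""

def chain_of_thought(complete, question):
    """`complete` is a hypothetical LLM API wrapper."""
    return complete(COT_PROMPT.format(question=question))
```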
Eliciting Reasoning in LLMs - II
Zhang, Zhuosheng, et al. "Automatic chain of thought prompting in large language models." arXiv preprint arXiv:2210.03493 (2022).
Zheng, Chuanyang, et al. "Progressive-hint prompting improves reasoning in large language models." arXiv preprint arXiv:2304.09797 (2023).
Yao, Shunyu, et al. "Tree of thoughts: Deliberate problem solving with large language models." Advances in Neural Information Processing Systems 36 (2024).
Besta, Maciej, et al. "Graph of thoughts: Solving elaborate problems with large language models." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38. No. 16. 2024.
Finetuning
Finetuning Pretrained Language Models (PLMs) with Objectives
Ziegler, Daniel M., et al. "Fine-tuning language models from human preferences." arXiv preprint arXiv:1909.08593 (2019).
Ding, Ning, et al. "Parameter-efficient fine-tuning of large-scale pre-trained language models." Nature Machine Intelligence 5.3 (2023): 220-235.
Hong, Yihuai, et al. "Dissecting Fine-Tuning Unlearning in Large Language Models." arXiv preprint arXiv:2410.06606 (2024).
Shamsabadi, Mahsa, et al. "Large Language Models for Scientific Information Extraction: An Empirical Study for Virology." Findings of the Association for Computational Linguistics: EACL 2024. 2024.
Parameter-efficient Finetuning of Pretrained Language Models (PLMs)
Hu, Edward J., et al. "LoRA: Low-rank adaptation of large language models." arXiv preprint arXiv:2106.09685 (2021).
Hayou, Soufiane, Nikhil Ghosh, and Bin Yu. "LoRA+: Efficient low rank adaptation of large models." arXiv preprint arXiv:2402.12354 (2024).
Dettmers, Tim, et al. "QLoRA: Efficient finetuning of quantized LLMs." Advances in Neural Information Processing Systems 36 (2024).
Xu, Lingling, et al. "Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment." arXiv preprint arXiv:2312.12148 (2023).
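LoRA itself reduces to one line of algebra: the frozen pretrained weight W is augmented with a trainable low-rank update, h = Wx + (alpha/r) BAx, so only the small matrices A and B are trained. A numpy sketch of the forward pass (shapes, scaling, and the zero-initialization of B follow Hu et al., 2021; the dimensions are illustrative):

```python
import numpy as np

d_out, d_in, r, alpha = 64, 64, 8, 16
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))      # pretrained weight, kept frozen
A = rng.normal(size=(r, d_in)) * 0.01   # trainable, rank r
B = np.zeros((d_out, r))                # trainable, zero-init so the
                                        # adapter starts as a no-op

def lora_forward(x):
    # h = W x + (alpha / r) * B A x
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
assert np.allclose(lora_forward(x), W @ x)  # holds at initialization
```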
Evaluation
Benchmarks Measuring LLM Reasoning and Truthfulness
Clark, Peter, et al. "Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge." arXiv preprint arXiv:1803.05457 (2018).
Zellers, Rowan, et al. "HellaSwag: Can a machine really finish your sentence?." arXiv preprint arXiv:1905.07830 (2019).
Hendrycks, Dan, et al. "Measuring massive multitask language understanding." arXiv preprint arXiv:2009.03300 (2020).
Lin, Stephanie, Jacob Hilton, and Owain Evans. "TruthfulQA: Measuring how models mimic human falsehoods." arXiv preprint arXiv:2109.07958 (2021).
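Operationally, all four benchmarks reduce to the same harness: pose a multiple-choice question, parse a predicted letter, and count matches against the gold label. A minimal sketch with a hypothetical complete() wrapper; real harnesses add few-shot exemplars and more robust answer parsing:

```python
def evaluate_mcq(complete, items):
    """items: list of (question, choices, gold_letter) triples.
    Returns accuracy over the items."""
    correct = 0
    for question, choices, gold in items:
        options = "\n".join(f"{letter}. {text}"
                            for letter, text in zip("ABCD", choices))
        prompt = (f"{question}\n{options}\n"
                  "Answer with a single letter.\nAnswer:")
        prediction = complete(prompt).strip()[:1].upper()
        correct += prediction == gold
    return correct / len(items)
```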
Scaling, Quantifying, and Extrapolating the Capabilities of LLMs
Rae, Jack W., et al. "Scaling language models: Methods, analysis & insights from training gopher." arXiv preprint arXiv:2112.11446 (2021).
Srivastava, Aarohi, et al. "Beyond the imitation game: Quantifying and extrapolating the capabilities of language models." arXiv preprint arXiv:2206.04615 (2022).
Measuring the Mathematical Abilities of LLMs
Cobbe, Karl, et al. "Training verifiers to solve math word problems." arXiv preprint arXiv:2110.14168 (2021).
Hendrycks, Dan, et al. "Measuring mathematical problem solving with the math dataset." arXiv preprint arXiv:2103.03874 (2021).
Mishra, Swaroop, et al. "Lila: A unified benchmark for mathematical reasoning." arXiv preprint arXiv:2210.17517 (2022).
Lewkowycz, Aitor, et al. "Solving quantitative reasoning problems with language models." Advances in Neural Information Processing Systems 35 (2022): 3843-3857.
LLMs as Evaluation Metrics
Fu, Jinlan, et al. "GPTScore: Evaluate as You Desire." arXiv preprint arXiv:2302.04166 (2023).
Kocmi, Tom, et al. "Large Language Models Are State-of-the-Art Evaluators of Translation Quality." arXiv preprint arXiv:2302.14520 (2023).
Leiter, Christoph, et al. “The Eval4NLP 2023 Shared Task on Prompting Large Language Models as Explainable Metrics.” arXiv, October 30, 2023.
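The common recipe in these papers is to turn the judgment itself into a prompt: describe the scale, show the source and the candidate output, and parse a score from the completion. A sketch of direct-assessment scoring in the spirit of Kocmi et al.; the template paraphrases the idea rather than reproducing any paper's exact wording, and complete() is a hypothetical LLM wrapper:

```python
def llm_translation_score(complete, source, translation,
                          src_lang="German", tgt_lang="English"):
    """Ask the LLM for a 0-100 quality judgment of a translation."""
    prompt = (
        f"Score the following translation from {src_lang} to {tgt_lang} "
        "on a scale from 0 (no meaning preserved) to 100 (perfect).\n"
        f"{src_lang} source: {source}\n"
        f"{tgt_lang} translation: {translation}\n"
        "Respond with the score only.\nScore:"
    )
    return float(complete(prompt).strip())
```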
LLMs as Evaluators
Kim, Seungone, et al. "Prometheus: Inducing fine-grained evaluation capability in language models." The Twelfth International Conference on Learning Representations. 2023.
Kim, Seungone, et al. "Prometheus 2: An open source language model specialized in evaluating other language models." arXiv preprint arXiv:2405.01535 (2024).
Vu, Tu, et al. "Foundational autoraters: Taming large language models for better automatic evaluation." arXiv preprint arXiv:2407.10817 (2024).
Alignment of LLMs as Evaluators
Liu, Yinhong, et al. "Aligning with human judgement: The role of pairwise preference in large language model evaluators." arXiv preprint arXiv:2403.16950 (2024).
Thakur, Aman Singh, et al. "Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges." arXiv preprint arXiv:2406.12624 (2024).
Li, Zongjie, et al. "Split and merge: Aligning position biases in large language model based evaluators." arXiv preprint arXiv:2310.01432 (2023).
Can LLMs Evaluate Themselves?
Deutsch, Daniel, et al. "On the Limitations of Reference-Free Evaluations of Generated Text." Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (2022): 10960-10977.
Ouyang, Long, et al. "Training Language Models to Follow Instructions with Human Feedback." arXiv preprint arXiv:2203.02155 (2022).
Rafailov, Rafael, et al. "Direct Preference Optimization: Your Language Model Is Secretly a Reward Model." arXiv preprint arXiv:2305.18290 (2023).
Synthetic Evaluation Suites for LLMs
Lei, Fangyu, et al. "S3Eval: A synthetic, scalable, systematic evaluation suite for large language models." arXiv preprint arXiv:2310.15147 (2023).
Iskander, Shadi, et al. "Quality Matters: Evaluating Synthetic Data for Tool-Using LLMs." arXiv preprint arXiv:2409.16341 (2024).
Zhao, Chenyang, et al. "Self-guide: Better task-specific instruction following via self-synthetic finetuning." arXiv preprint arXiv:2407.12874 (2024).
He, Chaoqun, et al. "UltraEval: A Lightweight Platform for Flexible and Comprehensive Evaluation for LLMs." arXiv preprint arXiv:2404.07584 (2024).
LLMs with Tools as Evaluation Metrics
Fernandes, Patrick, et al. “The Devil Is in the Errors: Leveraging Large Language Models for Fine-Grained Machine Translation Evaluation.” arXiv, August 14, 2023.
Kocmi, Tom, et al. "GEMBA-MQM: Detecting Translation Quality Error Spans with GPT-4." arXiv preprint arXiv:2310.13988 (2023).
Shu, Lei, et al. “Fusion-Eval: Integrating Evaluators with LLMs.” arXiv, November 15, 2023.
Self-refinement of LLMs
Madaan, Aman, et al. "Self-refine: Iterative refinement with self-feedback." Advances in Neural Information Processing Systems 36 (2024).
Ranaldi, Leonardo, and André Freitas. "Self-Refine Instruction-Tuning for Aligning Reasoning in Language Models." arXiv preprint arXiv:2405.00402 (2024).
Kamoi, Ryo, et al. "When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs." arXiv preprint arXiv:2406.01297 (2024).
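The self-refine loop of Madaan et al. alternates three prompts over the same model: draft, feedback, revision, stopping when the feedback step reports nothing to fix. A minimal sketch with a hypothetical complete() wrapper; the "STOP" convention is our own illustrative choice:

```python
def self_refine(complete, task, max_rounds=3):
    """Iterative refinement with self-feedback, after Madaan et al."""
    draft = complete(f"Task: {task}\nWrite a first attempt.")
    for _ in range(max_rounds):
        feedback = complete(
            f"Task: {task}\nAttempt:\n{draft}\n"
            "Give concrete feedback, or reply STOP if nothing needs fixing.")
        if feedback.strip().upper().startswith("STOP"):
            break
        draft = complete(
            f"Task: {task}\nAttempt:\n{draft}\nFeedback:\n{feedback}\n"
            "Rewrite the attempt, applying the feedback.")
    return draft
```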
Task Contamination
Li, Changmao, et al. “Task Contamination: Language Models May Not Be Few-Shot Anymore.” arXiv preprint arXiv:2312.16337 (2023).
Roberts, Manley, et al. “Data Contamination Through the Lens of Time.” arXiv preprint arXiv:2310.10628 (2023).
Jiang, Minhao, et al. "Investigating Data Contamination for Pre-training Language Models." arXiv preprint arXiv:2401.06059 (2024).
LLM Hallucination
Mickus, Timothee, et al. "SemEval-2024 Shared Task 6: SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes." arXiv preprint arXiv:2403.07726 (2024).
Yuan, Hongbang, et al. "Whispers that Shake Foundations: Analyzing and Mitigating False Premise Hallucinations in Large Language Models." arXiv preprint arXiv:2402.19103 (2024).
Sansford, Hannah, et al. "GraphEval: A knowledge-graph based LLM hallucination evaluation framework." arXiv preprint arXiv:2407.10793 (2024).
Rahman, A. B. M., et al. "DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation." arXiv preprint arXiv:2406.09155 (2024).
Jailbreak Attacks on LLMs
Doumbouya, Moussa Koulako Bala, et al. "h4rm3l: A Dynamic Benchmark of Composable Jailbreak Attacks for LLM Safety Assessment." arXiv preprint arXiv:2408.04811 (2024).
Wang, Hao, et al. "From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings." arXiv preprint arXiv:2402.16006 (2024).
Yi, Sibo, et al. "Jailbreak attacks and defenses against large language models: A survey." arXiv preprint arXiv:2407.04295 (2024).
Evaluating Morality in LLMs
Nie, Allen, et al. "Moca: Measuring human-language model alignment on causal and moral judgment tasks." Advances in Neural Information Processing Systems 36 (2023): 78360-78393.
Jin, Zhijing, et al. "When to make exceptions: Exploring language models as accounts of human moral judgment." Advances in neural information processing systems 35 (2022): 28458-28473.
Scherrer, Nino, et al. "Evaluating the moral beliefs encoded in llms." Advances in Neural Information Processing Systems 36 (2024).
Guha, Neel, et al. "LegalBench: A collaboratively built benchmark for measuring legal reasoning in large language models." Advances in Neural Information Processing Systems 36 (2024).
Biases in LLMs - I
Jones, Erik, and Jacob Steinhardt. "Capturing failures of large language models via human cognitive biases." Advances in Neural Information Processing Systems 35 (2022): 11785-11799.
Schramowski, Patrick, et al. "Large pre-trained language models contain human-like biases of what is right and wrong to do." Nature Machine Intelligence 4.3 (2022): 258-268.
Ferrara, Emilio. "Should ChatGPT be biased? Challenges and risks of bias in large language models." arXiv preprint arXiv:2304.03738 (2023).
Biases in LLMs - II
Gallegos, Isabel O., et al. "Bias and fairness in large language models: A survey." Computational Linguistics (2024): 1-79.
Liang, Paul Pu, et al. "Towards understanding and mitigating social biases in language models." International Conference on Machine Learning. PMLR, 2021.
Kotek, Hadas, Rikker Dockum, and David Sun. "Gender bias and stereotypes in large language models." Proceedings of the ACM Collective Intelligence Conference. 2023.
How can LLMs act as Lifelong Learners?
Wang, Renzhi, and Piji Li. "MEMoE: Enhancing Model Editing with Mixture of Experts Adaptors." arXiv preprint arXiv:2405.19086 (2024).
Wang, Renzhi, and Piji Li. "LEMoE: Advanced Mixture of Experts Adaptor for Lifelong Model Editing of Large Language Models." arXiv preprint arXiv:2406.20030 (2024).
Yang, Shu, et al. "MoRAL: MoE Augmented LoRA for LLMs' Lifelong Learning." arXiv preprint arXiv:2402.11260 (2024).
Evaluation of Code Writing Ability of LLMs
Chen, Mark, et al. “Evaluating large language models trained on code.” arXiv preprint arXiv:2107.03374 (2021).
Le, Triet HM, et al. “Deep learning for source code modeling and generation: Models, applications, and challenges.” ACM Computing Surveys (CSUR) 53.3 (2020): 1–38.
Evaluating Scientific Image and Text Generation
Belouadi, Jonas, et al. "AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with TikZ." arXiv preprint arXiv:2310.00367 (2024).
Belouadi, Jonas, et al. "DeTikZify: Synthesizing Graphics Programs for Scientific Figures and Sketches with TikZ." arXiv preprint arXiv:2405.15306 (2024).
Nechakhin, Vlad, et al. "Evaluating Large Language Models for Structured Science Summarization in the Open Research Knowledge Graph." Information 15.6 (2024): 328. https://doi.org/10.3390/info15060328.
Babaei Giglou, Hamed, et al. “LLMs4Synthesis: Leveraging Large Language Models for Scientific Synthesis.” arXiv, September 27, 2024.
LLMs as Mixture of Experts (MoE)
Cai, Weilin, et al. "A survey on mixture of experts." arXiv preprint arXiv:2407.06204 (2024).
Companion repository to the survey: https://github.com/withinmiaov/A-Survey-on-Mixture-of-Experts
Jiang, Albert Q., et al. "Mixtral of experts." arXiv preprint arXiv:2401.04088 (2024).
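A Mixtral-style MoE layer routes each token to the top-k of n expert networks and mixes their outputs with renormalized gate weights, so only a fraction of the parameters is active per token. A numpy sketch of the routing step (top-2 routing over 8 experts follows Jiang et al., 2024; the experts here are plain linear maps for brevity):

```python
import numpy as np

n_experts, k, d = 8, 2, 16
rng = np.random.default_rng(0)
gate_W = rng.normal(size=(n_experts, d))            # router weights
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_layer(x):
    logits = gate_W @ x
    top = np.argsort(-logits)[:k]      # indices of the k selected experts
    weights = softmax(logits[top])     # renormalize over the selected ones
    # Only the selected experts run; this sparsity is the compute saving.
    return sum(w * (experts[i] @ x) for w, i in zip(weights, top))

y = moe_layer(rng.normal(size=d))
```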
Applications
WebAPI Query Planning Using LLMs
Chen, Zui, et al. “Symphony: Towards natural language query answering over multi-modal data lakes.” Conference on Innovative Data Systems Research, CIDR. 2023.
Urban, Matthias, et al. “CAESURA: Language Models as Multi-Modal Query Planners.” arXiv preprint arXiv:2308.03424 (2023).
Wang, Lei, et al. "A Survey on Large Language Model based Autonomous Agents." arXiv preprint arXiv:2308.11432 (2023).
Attribute Value Normalization Using LLMs
Jaimovitch-López, Gonzalo, et al. “Can language models automate data wrangling?.” Machine Learning 112.6 (2023): 2053–2082.
Bogatu, Alex, et al. "Towards automatic data format transformations: Data wrangling at scale." Data Analytics: 31st British International Conference on Databases (BICOD 2017). 2017.
LLMs for Literary Translation and Evaluation
Fonteyne, Margot, et al. "Literary Machine Translation under the Magnifying Glass: Assessing the Quality of an NMT-Translated Detective Novel on Document Level." Proceedings of the Twelfth Language Resources and Evaluation Conference (2020): 3790-3798.
Karpinska, Marzena, et al. “Large Language Models Effectively Leverage Document-Level Context for Literary Translation, but Critical Errors Persist.” arXiv, May 22, 2023.
Wang, Longyue, et al. "Document-Level Machine Translation with Large Language Models." Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (2023): 16646-16661.
LLMs for Synthetic Training Data Generation
Piedboeuf, Frédéric, et al. "Is ChatGPT the Ultimate Data Augmentation Algorithm?" Findings of the Association for Computational Linguistics: EMNLP 2023. 2023.
Pal, Koyena, et al. “Generative Benchmark Creation for Table Union Search.” arXiv, August 7, 2023.
LLMs as Interfaces to Datastores
Diallo, Papa Abdou Karim Karou, et al. "A Comprehensive Evaluation of Neural SPARQL Query Generation from Natural Language Questions." arXiv preprint (2023).
Li, Jinyang, et al. "Can LLM already serve as a database interface? A big bench for large-scale database grounded text-to-SQLs." Advances in Neural Information Processing Systems 36 (2024).
Lehmann, Jens, et al. "Beyond boundaries: A human-like approach for question answering over structured and unstructured information sources." Transactions of the Association for Computational Linguistics 12 (2024): 786-802.
LLMs for Arithmetic and Symbolic Reasoning Tasks
Yin, Shuo, et al. "MuMath-Code: Combining Tool-Use Large Language Models with Multi-perspective Data Augmentation for Mathematical Reasoning." arXiv preprint arXiv:2405.07551 (2024).
Gao, Luyu, et al. "PAL: Program-aided language models." International Conference on Machine Learning. PMLR, 2023.
Imani, Shima, et al. "MathPrompter: Mathematical reasoning using large language models." arXiv preprint arXiv:2303.05398 (2023).
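The shared idea of these papers is to let the model write code for the arithmetic instead of doing it in text, then execute that code. A deliberately simplified PAL-style sketch with a hypothetical complete() wrapper; note that exec() on model output is unsafe outside a sandbox:

```python
def pal_answer(complete, question):
    """Program-aided reasoning after Gao et al. (2023):
    the LLM writes Python, the interpreter does the math."""
    prompt = (
        "Write Python code that computes the answer to the question "
        "and stores it in a variable named `answer`.\n"
        f"Question: {question}\nCode:")
    code = complete(prompt)
    namespace = {}
    exec(code, namespace)  # UNSAFE outside a sandbox; illustration only
    return namespace["answer"]
```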
LLMs for Programming
Wang, Yue, et al. "CodeT5+: Open code large language models for code understanding and generation." arXiv preprint arXiv:2305.07922 (2023).
Luo, Ziyang, et al. "WizardCoder: Empowering code large language models with Evol-Instruct." arXiv preprint arXiv:2306.08568 (2023).
Wei, Yuxiang, et al. "Magicoder: Source code is all you need." arXiv preprint arXiv:2312.02120 (2023).
Liu, Jiawei, et al. "Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation." Advances in Neural Information Processing Systems 36 (2024).
LLMs for Computer Tasks
Shi, Tianlin, et al. "World of bits: An open-domain platform for web-based agents." International Conference on Machine Learning. PMLR, 2017.
Humphreys, Peter C., et al. "A data-driven approach for learning to control computers." International Conference on Machine Learning. PMLR, 2022.
Kim, Geunwoo, et al. "Language Models can Solve Computer Tasks." arXiv preprint arXiv:2303.17491 (2023).
Payan, Justin, et al. "InstructExcel: A Benchmark for Natural Language Instruction in Excel." arXiv preprint arXiv:2310.14495 (2023).
LLMs in Education
Fuchs, Kevin. "Exploring the opportunities and challenges of NLP models in higher education: is Chat GPT a blessing or a curse?." Frontiers in Education. Vol. 8. Frontiers Media SA, 2023.
Javaid, Mohd, et al. "Unlocking the opportunities through ChatGPT Tool towards ameliorating the education system." BenchCouncil Transactions on Benchmarks, Standards and Evaluations 3.2 (2023): 100115.
Laato, Samuli, et al. "AI-assisted learning with ChatGPT and large language models: Implications for higher education." 2023 IEEE International Conference on Advanced Learning Technologies (ICALT). IEEE, 2023.
Xiao, Changrong, et al. "Evaluating reading comprehension exercises generated by LLMs: A showcase of ChatGPT in education applications." Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023). 2023.
LLM-based Agents / OpenAI Assistants
Wang, Lei, et al. "A Survey on Large Language Model based Autonomous Agents." arXiv preprint arXiv:2308.11432 (2023).
Agent Cooperation
Park, Joon Sung, et al. “Generative agents: Interactive simulacra of human behavior.” Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology. 2023.
Zhuge, Mingchen, et al. “Mindstorms in Natural Language-Based Societies of Mind.” arXiv preprint arXiv:2305.17066 (2023).
Suzgun, Mirac, and Adam Tauman Kalai. "Meta-Prompting: Enhancing Language Models with Task-Agnostic Scaffolding." arXiv preprint arXiv:2401.12954 (2024).
Wang, Lei, et al. "A Survey on Large Language Model based Autonomous Agents." arXiv preprint arXiv:2308.11432 (2023).
Commonsense Question Answering
Molfese, Francesco Maria, et al. "ZEBRA: Zero-Shot Example-Based Retrieval Augmentation for Commonsense Question Answering." arXiv preprint arXiv:2410.05077 (2024).
Wang, Chenhao, et al. "Leros: Learning Explicit Reasoning on Synthesized Data for Commonsense Question Answering." Proceedings of LREC-COLING 2024.
Siriwardhana, Shamane, et al. "Improving the domain adaptation of retrieval augmented generation (RAG) models for open domain question answering." Transactions of the Association for Computational Linguistics 11 (2023): 1-17.
Shwartz, Vered, et al. "Unsupervised commonsense question answering with self-talk." arXiv preprint arXiv:2004.05483 (2020).
Scientific Applications of LLMs
Scientific Ideation Assistance
Qi, Biqing, et al. "Large language models are zero shot hypothesis proposers." arXiv preprint arXiv:2311.05965 (2023).
Huang, Qian, et al. "Benchmarking large language models as AI research agents." arXiv preprint arXiv:2310.03302 (2023).
Baek, Jinheon, et al. "ResearchAgent: Iterative research idea generation over scientific literature with large language models." arXiv preprint arXiv:2404.07738 (2024).
Si, Chenglei, et al. "Can LLMs generate novel research ideas? A large-scale human study with 100+ NLP researchers." arXiv preprint arXiv:2409.04109 (2024).
Generating Scientific Discoveries
Lu, Chris, et al. "The AI scientist: Towards fully automated open-ended scientific discovery." arXiv preprint arXiv:2408.06292 (2024).
Lála, Jakub, et al. "PaperQA: Retrieval-augmented generative agent for scientific research." arXiv preprint arXiv:2312.07559 (2023).
Cai, Hengxing, et al. "SciAssess: Benchmarking LLM proficiency in scientific literature analysis." arXiv preprint arXiv:2403.01976 (2024).
Kang, Hao, and Chenyan Xiong. "ResearchArena: Benchmarking LLMs' Ability to Collect and Organize Information as Research Agents." arXiv preprint arXiv:2406.10291 (2024).
Systematic Literature Review
Sami, Abdul Malik, et al. "System for systematic literature review using multiple AI agents: Concept and an empirical evaluation." arXiv preprint arXiv:2403.08399 (2024).
Agarwal, Shubham, et al. "LitLLM: A Toolkit for Scientific Literature Review." arXiv preprint arXiv:2402.01788 (2024).
Susnjak, Teo, et al. "Automating research synthesis with domain-specific large language model fine-tuning." arXiv preprint arXiv:2404.08680 (2024).
DeYoung, Jay, et al. "Do Multi-Document Summarization Models Synthesize?." Transactions of the Association for Computational Linguistics 12 (2024): 1043-1062.
Generative AI Assistance in Chemistry/Material Sciences
Bran, Andres M., et al. "ChemCrow: Augmenting large-language models with chemistry tools." arXiv preprint arXiv:2304.05376 (2023).
Guo, Taicheng, et al. "What can large language models do in chemistry? a comprehensive benchmark on eight tasks." Advances in Neural Information Processing Systems 36 (2023): 59662-59688.
Schilling-Wilhelmi, Mara, et al. "From Text to Insight: Large Language Models for Materials Science Data Extraction." arXiv preprint arXiv:2407.16867 (2024).
Mirza, Adrian, et al. "Are large language models superhuman chemists?." arXiv preprint arXiv:2404.01475 (2024).
Foundation Transformer LLMs
Compute-Optimal LLMs?
Hoffmann, Jordan, et al. "Training compute-optimal large language models." arXiv preprint arXiv:2203.15556 (2022).
Hoffmann, Jordan, et al. "An empirical analysis of compute-optimal large language model training." Advances in Neural Information Processing Systems 35 (2022): 30016-30030.
Li, Ming, et al. "From quantity to quality: Boosting llm performance with self-guided data selection for instruction tuning." arXiv preprint arXiv:2308.12032 (2023).
Advances in Generative Pre-trained Transformer (GPT) Language Models
Radford, Alec, et al. "Improving language understanding by generative pre-training." OpenAI (2018).
Radford, Alec, et al. "Language models are unsupervised multitask learners." OpenAI blog 1.8 (2019): 9.
Brown, Tom, et al. "Language models are few-shot learners." Advances in neural information processing systems 33 (2020): 1877-1901.
Ouyang, Long, et al. "Training language models to follow instructions with human feedback." Advances in neural information processing systems 35 (2022): 27730-27744.
Open-source Foundation Models and Large-scale Pre-training Dataset Compilation Workflows
Touvron, Hugo, et al. "LLaMA: Open and efficient foundation language models." arXiv preprint arXiv:2302.13971 (2023).
Wenzek, Guillaume, et al. "CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data." Proceedings of the Twelfth LREC. 2020.
Gao, Leo, et al. "The Pile: An 800GB dataset of diverse text for language modeling." arXiv preprint arXiv:2101.00027 (2020).
Penedo, Guilherme, et al. "The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only." arXiv preprint arXiv:2306.01116 (2023).
Getting started
The following survey articles are good starting points for getting an overview of the seminar topics:
Zhao, Wayne Xin, et al. "A Survey of Large Language Models." arXiv preprint arXiv:2303.18223 (2023).
Mialon, Grégoire, et al. "Augmented Language Models: a Survey." arXiv preprint arXiv:2302.07842 (2023).
Wang, Lei, et al. "A Survey on Large Language Model based Autonomous Agents." arXiv preprint arXiv:2308.11432 (2023).