Lit Review (in progress) 

Psychology 4 LLMs (Psychological Evals of LLM Behaviors)

Personality tests

Safdari, M., Serapio-García, G., Crepy, C., Fitz, S., Romero, P., Sun, L., ... & Matarić, M. (2023). Personality traits in large language models. arXiv preprint arXiv:2307.00184.

Pan, K., & Zeng, Y. (2023). Do LLMs possess a personality? Making the MBTI test an amazing evaluation for large language models. arXiv preprint arXiv:2307.16180.

Dorner, F. E., Sühr, T., Samadi, S., & Kelava, A. (2023). Do personality tests generalize to Large Language Models?. arXiv preprint arXiv:2311.05297.

Huang, J. T., Wang, W., Li, E. J., Lam, M. H., Ren, S., Yuan, Y., ... & Lyu, M. R. (2023). Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench. arXiv preprint arXiv:2310.01386.

Ettinger, A. (2020). What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics, 8, 34-48.

Gupta, A., Song, X., & Anumanchipalli, G. (2023). Investigating the Applicability of Self-Assessment Tests for Personality Measurement of Large Language Models. arXiv preprint arXiv:2309.08163.

Other psych tests (perception, reasoning, etc.)

Levesque, H. J. (2011). The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.

Nadeem, M., Bethke, A., & Reddy, S. (2021). StereoSet: Measuring stereotypical bias in pretrained language models. arXiv preprint arXiv:2004.09456.

Bhagavatula, C., Bras, R. L., Malaviya, C., Sakaguchi, K., Holtzman, A., Rashkin, H., Downey, D., Yih, W. T., & Choi, Y. (2020). Abductive commonsense reasoning. arXiv preprint arXiv:1908.05739.

Hudson, D. A., & Manning, C. D. (2019). GQA: A new dataset for real-world visual reasoning and compositional question answering. Conference on Computer Vision and Pattern Recognition (CVPR).

General discussion about the internal state of LLMs

Bender, E. M., & Koller, A. (2020). Climbing towards NLU: On meaning, form, and understanding in the age of data. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 5185-5198.

Marcus, G. (2020). The next decade in AI: four steps towards robust artificial intelligence. arXiv preprint arXiv:2002.06177.

LLMs For Psychology Research and Therapy

Ke, L., Tong, S., Cheng, P., & Peng, K. (2024). Exploring the Frontiers of LLMs in Psychological Applications: A Comprehensive Review. arXiv preprint arXiv:2401.01519.

Rao, H., Leung, C., & Miao, C. (2023). Can ChatGPT assess human personalities? A general evaluation framework. arXiv preprint arXiv:2303.01248.

General LLM Benchmarks 

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

AgentBench: Evaluating LLMs as Agents

Test Images for Robin

Downloaded the image, converted it to JPG, and hosted it on imgbb (temporarily, for 1 hr).