Lit Review (in progress) 

Psychology for LLMs (Psychological Evals of LLM Behavior)

Personality tests


Safdari, M., Serapio-García, G., Crépy, C., Fitz, S., Romero, P., Sun, L., ... & Matarić, M. (2023). Personality traits in large language models. arXiv preprint arXiv:2307.00184. https://arxiv.org/pdf/2307.00184.pdf



Pan, K., & Zeng, Y. (2023). Do LLMs possess a personality? Making the MBTI test an amazing evaluation for large language models. arXiv preprint arXiv:2307.16180. https://arxiv.org/pdf/2307.16180



Dorner, F. E., Sühr, T., Samadi, S., & Kelava, A. (2023). Do personality tests generalize to Large Language Models? arXiv preprint arXiv:2311.05297. https://arxiv.org/pdf/2311.05297


Huang, J. T., Wang, W., Li, E. J., Lam, M. H., Ren, S., Yuan, Y., ... & Lyu, M. R. (2023). Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench. arXiv preprint arXiv:2310.01386. https://arxiv.org/pdf/2310.01386.pdf


Ettinger, A. (2020). What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics, 8, 34-48. https://arxiv.org/abs/1907.13528


Gupta, A., Song, X., & Anumanchipalli, G. (2023). Investigating the Applicability of Self-Assessment Tests for Personality Measurement of Large Language Models. arXiv preprint arXiv:2309.08163. https://arxiv.org/pdf/2309.08163

Other psych tests (perception, reasoning, etc.)

Levesque, H. J. (2011). The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning. (PDF: Levesque.pdf at commonsensereasoning.org)


Nadeem, M., Bethke, A., & Reddy, S. (2021). StereoSet: Measuring stereotypical bias in pretrained language models. arXiv preprint arXiv:2004.09456. https://arxiv.org/abs/2004.09456


Bhagavatula, C., Bras, R. L., Malaviya, C., Sakaguchi, K., Holtzman, A., Rashkin, H., Downey, D., Yih, W.T. & Choi, Y. (2020). Abductive commonsense reasoning. arXiv preprint arXiv:1908.05739. https://arxiv.org/abs/1908.05739


Hudson, D. A., & Manning, C. D. (2019). GQA: A new dataset for real-world visual reasoning and compositional question answering. Conference on Computer Vision and Pattern Recognition (CVPR). https://openaccess.thecvf.com/content_CVPR_2019/html/Hudson_GQA_A_New_Dataset_for_Real-World_Visual_Reasoning_and_Compositional_CVPR_2019_paper.html

General discussion about the internal state of LLMs

Bender, E. M., & Koller, A. (2020). Climbing towards NLU: On meaning, form, and understanding in the age of data. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 5185-5198. https://aclanthology.org/2020.acl-main.463/


Marcus, G. (2020). The next decade in AI: four steps towards robust artificial intelligence. arXiv preprint arXiv:2002.06177. https://arxiv.org/abs/2002.06177

LLMs for Psychology Research and Therapy

Ke, L., Tong, S., Chen, P., & Peng, K. (2024). Exploring the Frontiers of LLMs in Psychological Applications: A Comprehensive Review. arXiv preprint arXiv:2401.01519. https://arxiv.org/pdf/2401.01519


Rao, H., Leung, C., & Miao, C. (2023). Can ChatGPT assess human personalities? A general evaluation framework. arXiv preprint arXiv:2303.01248. https://arxiv.org/pdf/2303.01248

General LLM Benchmarks 

Srivastava, A., et al. (2022). Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615. https://arxiv.org/abs/2206.04615

Liu, X., et al. (2023). AgentBench: Evaluating LLMs as Agents. arXiv preprint arXiv:2308.03688. https://arxiv.org/abs/2308.03688

Test Images for Robin

Downloaded the image, used iloveimg.com to convert it to JPG, and hosted it on imgbb (temporarily, for 1 hour). A scripted alternative is sketched below.
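If this conversion/hosting step needs to be repeated, here is a minimal Python sketch that replaces the manual workflow: Pillow does the JPG conversion and the image is uploaded via imgbb's public upload API. The endpoint (https://api.imgbb.com/1/upload), the key/image/expiration parameters, the 3600-second expiry, and the IMGBB_API_KEY environment variable are assumptions based on imgbb's documented API and should be verified before use; an imgbb API key is required.

```python
# Minimal sketch: convert an image to JPG and host it on imgbb with a ~1-hour expiry.
# Assumes imgbb's public upload API (https://api.imgbb.com/1/upload) and an API key
# provided via the IMGBB_API_KEY environment variable; verify against imgbb's docs.
import base64
import os
import sys

import requests
from PIL import Image


def convert_to_jpg(src_path: str, dst_path: str = "converted.jpg") -> str:
    """Convert any Pillow-readable image to JPEG (replaces the iloveimg.com step)."""
    Image.open(src_path).convert("RGB").save(dst_path, "JPEG")
    return dst_path


def upload_to_imgbb(jpg_path: str, expiration_seconds: int = 3600) -> str:
    """Upload the JPG to imgbb and return the hosted URL (expires after ~1 hour)."""
    with open(jpg_path, "rb") as f:
        payload = {
            "key": os.environ["IMGBB_API_KEY"],            # assumed: imgbb API key
            "image": base64.b64encode(f.read()).decode(),  # base64-encoded image data
            "expiration": expiration_seconds,              # assumed: expiry in seconds
        }
    resp = requests.post("https://api.imgbb.com/1/upload", data=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()["data"]["url"]


if __name__ == "__main__":
    jpg = convert_to_jpg(sys.argv[1])
    print(upload_to_imgbb(jpg))
```

Usage (assuming the key is set): IMGBB_API_KEY=... python host_image.py robin_test.png prints the temporary hosted URL to pass to the model.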