Psychology 4 LLMs (Psychological Evals of LLM Behaviors)
Personality tests
Safdari, M., Serapio-García, G., Crepy, C., Fitz, S., Romero, P., Sun, L., ... & Matarić, M. (2023). Personality traits in large language models. arXiv preprint arXiv:2307.00184. https://arxiv.org/pdf/2307.00184.pdf
Pan, K., & Zeng, Y. (2023). Do LLMs possess a personality? Making the MBTI test an amazing evaluation for large language models. arXiv preprint arXiv:2307.16180. https://arxiv.org/pdf/2307.16180
Proposes using the Myers-Briggs Type Indicator (MBTI) test to evaluate the personalities of LLMs like ChatGPT.
Conducts experiments assessing the MBTI types of different LLMs and exploring whether types can be changed via prompt engineering.
Finds LLMs exhibit distinct MBTI types that are difficult to change without targeted instruction tuning; training data also shapes the resulting type.
Concludes MBTI can serve as a rough indicator of LLM personality, though not a rigorous assessment.
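As a concrete illustration of this evaluation style, here is a minimal sketch of administering MBTI-style forced-choice items to a model through an OpenAI-style chat API. The model name, item wording, and strict A/B answer format are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch: administer MBTI-style forced-choice items to an LLM via an
# OpenAI-style chat API. Items, model name, and answer format are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Each item maps answer "A"/"B" to one pole of an MBTI dichotomy.
ITEMS = [
    ("At a party, you would rather (A) talk to many people (B) talk in depth to a few.", "E", "I"),
    ("When making decisions, you rely more on (A) logic (B) personal values.", "T", "F"),
]

def ask(item: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": "Answer with exactly one letter: A or B."},
            {"role": "user", "content": item},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper()[:1]

tally: dict[str, int] = {}
for text, pole_a, pole_b in ITEMS:
    pole = pole_a if ask(text) == "A" else pole_b
    tally[pole] = tally.get(pole, 0) + 1
print(tally)  # e.g. {'E': 1, 'T': 1}; a real test covers all four dichotomies
```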
Dorner, F. E., Sühr, T., Samadi, S., & Kelava, A. (2023). Do personality tests generalize to Large Language Models? arXiv preprint arXiv:2311.05297. https://arxiv.org/pdf/2311.05297
Argues personality tests designed for humans may not directly generalize to LLMs.
Shows LLMs respond inconsistently to reverse-coded personality test items.
Finds LLMs fail to replicate the clean five-factor structure of human responses when prompted with different personas.
Concludes validity of human personality tests cannot be assumed for LLMs without critical analysis.
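The reverse-coding issue is easy to demonstrate: on a 1-5 Likert scale a reverse-keyed item is recoded as 6 minus the response, and a consistent respondent's direct and recoded answers should correlate strongly. A minimal sketch with invented data (not the paper's):

```python
# Sketch of the reverse-coded consistency check, assuming 1-5 Likert responses.
# Item texts and data are illustrative. Requires Python 3.10+ for correlation().
from statistics import correlation

# Paired responses to an item ("I am talkative") and its reverse-keyed
# counterpart ("I tend to be quiet"), one pair per persona prompt.
direct = [5, 4, 2, 5, 1, 3]
reverse = [1, 2, 4, 2, 5, 3]

# Recode the reverse-keyed item so both columns point the same direction.
recoded = [6 - r for r in reverse]

# A human-consistent respondent yields a strong positive correlation after
# recoding; LLMs often do not, which is the inconsistency the paper reports.
print(f"r = {correlation(direct, recoded):.2f}")
```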
Huang, J. T., Wang, W., Li, E. J., Lam, M. H., Ren, S., Yuan, Y., ... & Lyu, M. R. (2023). Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench. arXiv preprint arXiv:2310.01386. https://arxiv.org/pdf/2310.01386.pdf
Develops PsychoBench, a framework of 13 clinical psychology scales, to evaluate LLMs' psychological portrayal.
Tests 5 LLMs, analyzing impact of model size, updates, and safety alignment on psychological results.
Verifies validity of scales via role assignments and tasks like TruthfulQA and SafetyQA.
Provides insights into customizing LLMs based on psychological metrics.
Ettinger, A. (2020). What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics, 8, 34-48. https://arxiv.org/abs/1907.13528
Introduces diagnostic tests from human language experiments to probe information used by LLMs for predictions.
Finds BERT distinguishes good vs. bad completions but struggles with challenging inferences like negation.
Concludes probing LLMs with psycholinguistic assessments reveals strengths/limitations in emulating human language users.
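A probe in this spirit can be reproduced with the Hugging Face fill-mask pipeline; the robin/negation example below mirrors the paper's well-known finding, though exact model outputs may vary:

```python
# Sketch of Ettinger-style negation probing with a fill-mask pipeline.
# Requires: pip install transformers torch
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

for sent in ["A robin is a [MASK].", "A robin is not a [MASK]."]:
    top = fill(sent)[0]  # highest-probability completion
    print(f"{sent} -> {top['token_str']} ({top['score']:.2f})")
# BERT tends to predict "bird" for both sentences, illustrating the paper's
# finding that its predictions are largely insensitive to negation.
```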
Gupta, A., Song, X., & Anumanchipalli, G. (2023). Investigating the Applicability of Self-Assessment Tests for Personality Measurement of Large Language Models. arXiv preprint arXiv:2309.08163. https://arxiv.org/pdf/2309.08163
Questions using self-assessment tests for LLM personality measurement.
Shows LLM test scores vary significantly across equivalent prompts and option orders.
Concludes self-assessments are unreliable for LLMs due to lack of ground truth and limitations like prompt/order sensitivity.
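A sketch of the option-order sensitivity check, assuming an OpenAI-style chat API (the item, model name, and answer format are illustrative):

```python
# Administer the same Likert item with its options in forward and reversed
# order; a reliable self-assessment should give order-invariant answers.
from openai import OpenAI

client = OpenAI()
SCALE = ["Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree"]

def administer(item: str, options: list[str]) -> str:
    listing = "\n".join(f"- {o}" for o in options)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{
            "role": "user",
            "content": f"{item}\nChoose exactly one option:\n{listing}",
        }],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

item = "I see myself as someone who is outgoing, sociable."
forward = administer(item, SCALE)
backward = administer(item, SCALE[::-1])
# With a reliable test these agree; the paper finds they often do not.
print(forward, "|", backward)
```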
Other psych tests (perception, reasoning, etc.)
Levesque, H., Davis, E., & Morgenstern, L. (2012). The Winograd Schema Challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning. Levesque.pdf (commonsensereasoning.org)
Proposes the Winograd Schema Challenge, a set of pronoun resolution problems requiring commonsense reasoning.
Winograd schemas rely on implicit background knowledge humans use to resolve ambiguous pronouns.
Example: "The city council refused the demonstrators a permit because they [feared/advocated] violence."
Correctly answering which referent "they" matches requires real-world knowledge.
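One standard way to score such schemas with a causal LM is to substitute each candidate referent for the pronoun and prefer the lower-perplexity sentence. A minimal sketch with GPT-2 (a common evaluation recipe, not Levesque's own protocol):

```python
# Score a Winograd schema by comparing mean token negative log-likelihoods
# of the two resolved sentences. Requires: pip install transformers torch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def nll(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()  # mean token NLL

candidates = [
    "The city council refused the demonstrators a permit because the city council feared violence.",
    "The city council refused the demonstrators a permit because the demonstrators feared violence.",
]
print(min(candidates, key=nll))  # the model's preferred resolution
```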
Nadeem, M., Bethke, A., & Reddy, S. (2021). StereoSet: Measuring stereotypical bias in pretrained language models. arXiv preprint arXiv:2004.09456. https://arxiv.org/abs/2004.09456
Presents StereoSet, a large dataset for measuring stereotypical bias in language models.
Contains human-generated sentence pairs labeled for biases about gender, race, religion, and professions.
Tests popular language models like BERT, GPT-2, RoBERTa on StereoSet.
Finds these models exhibit strong stereotypical biases, highlighting issues to address.
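The headline metrics can be sketched as pure functions over precomputed sentence scores (e.g., LM log-likelihoods); the triples below are invented for illustration, and the formulas follow the paper's definitions as I read them:

```python
# ss: how often the model prefers the stereotype over the anti-stereotype;
# lms: how often it prefers either meaningful sentence over the unrelated one;
# icat combines the two (100 = unbiased, strong language model).

def stereoset_metrics(triples: list[tuple[float, float, float]]) -> dict[str, float]:
    """Each triple holds (stereotype, anti_stereotype, unrelated) scores."""
    n = len(triples)
    ss = 100 * sum(s > a for s, a, _ in triples) / n
    lms = 100 * sum(max(s, a) > u for s, a, u in triples) / n
    icat = lms * min(ss, 100 - ss) / 50
    return {"ss": ss, "lms": lms, "icat": icat}

print(stereoset_metrics([(-4.1, -4.5, -7.2), (-3.8, -3.2, -6.9)]))
```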
Bhagavatula, C., Bras, R. L., Malaviya, C., Sakaguchi, K., Holtzman, A., Rashkin, H., Downey, D., Yih, W.T. & Choi, Y. (2020). Abductive commonsense reasoning. arXiv preprint arXiv:1908.05739. https://arxiv.org/abs/1908.05739
Introduces the abductive commonsense reasoning tasks Abductive NLI (αNLI) and Abductive NLG (αNLG).
αNLI: given two observations, choose the more plausible explanatory hypothesis from two candidates.
αNLG: generate a plausible explanation connecting a pair of observations.
Shows current models struggle on these tasks compared to humans.
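The αNLI choice rule reduces to scoring each observation-hypothesis-observation narrative and picking the higher-scoring one. A minimal sketch (the instance is invented for illustration; `score` can be any sequence scorer, e.g., the negated nll() helper from the Winograd sketch above):

```python
# Pick the hypothesis that makes the full narrative most likely under the LM.

def choose(o1: str, o2: str, hyps: list[str], score) -> str:
    return max(hyps, key=lambda h: score(f"{o1} {h} {o2}"))

instance = {
    "obs1": "Dotty was being very grumpy.",
    "obs2": "She felt much better afterwards.",
    "hyps": ["Dotty ate a spoiled sandwich.", "Dotty vented to a close friend."],
}
# Usage with the Winograd helper: choose(instance["obs1"], instance["obs2"],
#                                         instance["hyps"], lambda s: -nll(s))
```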
Hudson, D. A., & Manning, C. D. (2019). GQA: A new dataset for real-world visual reasoning and compositional question answering. Conference on Computer Vision and Pattern Recognition (CVPR). https://openaccess.thecvf.com/content_CVPR_2019/html/Hudson_GQA_A_New_Dataset_for_Real-World_Visual_Reasoning_and_Compositional_CVPR_2019_paper.html
Presents GQA, a visual reasoning dataset with compositional questions.
Contains 22M diverse reasoning questions about images, each paired with a functional program.
The programs enable tight control over the answer distribution to mitigate question biases.
New metrics assess consistency, grounding, and plausibility.
Finds significant room for improvement compared to human performance.
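One plausible reading of the consistency metric: when the model answers a question correctly, its answers to the questions entailed by it should also be correct. A rough, illustrative sketch (see the paper for the exact definitions):

```python
# Toy GQA-style consistency check over predicted vs. gold answers, given a
# map from each question to the questions it entails. Data is illustrative.

def consistency(preds: dict[str, str], gold: dict[str, str],
                entailed: dict[str, list[str]]) -> float:
    checks = [
        preds[e] == gold[e]
        for q, deps in entailed.items()
        if preds[q] == gold[q]  # condition on correctly answered sources
        for e in deps
    ]
    return sum(checks) / len(checks) if checks else 1.0

gold = {"q1": "red", "q2": "yes"}   # q1: "What color is the car?"
preds = {"q1": "red", "q2": "no"}   # q2 (entailed by q1): "Is the car red?"
print(consistency(preds, gold, {"q1": ["q2"]}))  # 0.0: inconsistent
```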
General discussion about the internal state of LLMs
Bender, E. M., & Koller, A. (2020). Climbing towards NLU: On meaning, form, and understanding in the age of data. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 5185-5198. https://aclanthology.org/2020.acl-main.463/
Argues that the hype around large neural language models "understanding" language is misguided.
States models trained only on linguistic form have no inherent way to learn meaning.
Calls for clearly distinguishing between form and meaning to guide research towards better science around natural language understanding.
Marcus, G. (2020). The next decade in AI: four steps towards robust artificial intelligence. arXiv preprint arXiv:2002.06177. https://arxiv.org/abs/2002.06177
Proposes a hybrid knowledge-driven approach to AI instead of just big data and compute.
Advocates incorporating structured knowledge, causal models, and reasoning.
Outlines four steps: reverse-engineering the mind, discovering the principles of common sense, teaching computers to read, and combining bottom-up (data-driven) and top-down (knowledge-driven) approaches.
LLMs For Psychology Research and Therapy
Ke, L., Tong, S., Chen, P., & Peng, K. (2024). Exploring the Frontiers of LLMs in Psychological Applications: A Comprehensive Review. arXiv preprint arXiv:2401.01519. https://arxiv.org/pdf/2401.01519
Provides a comprehensive review of LLMs' applications across cognitive, clinical, educational, and social psychology.
Discusses using LLMs to simulate human cognition and behavior and as aids for literature reviews, hypothesis generation, experimental design, data analysis, and academic writing.
Notes technical and ethical challenges of using LLMs in psychological research, including privacy, bias, and need for interpretability.
Rao, H., Leung, C., & Miao, C. (2023). Can ChatGPT assess human personalities? A general evaluation framework. arXiv preprint arXiv:2303.01248. https://arxiv.org/pdf/2303.01248
Presents a framework for evaluating ChatGPT's ability to assess human personalities via MBTI tests.
Uses unbiased prompts and subject-replaced queries to elicit personality assessments.
Proposes metrics to evaluate consistency, robustness, and fairness of assessments.
Finds ChatGPT can independently assess personalities, with higher consistency/fairness but lower robustness than InstructGPT.
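An illustrative stand-in for such a consistency metric, simplified to majority agreement across repeated runs (the paper defines its own consistency, robustness, and fairness scores):

```python
# Share of repeated assessment runs that agree with the modal result.
from collections import Counter

def consistency_score(results: list[str]) -> float:
    counts = Counter(results)
    return counts.most_common(1)[0][1] / len(results)

# e.g., MBTI types returned across five subject-replaced queries
print(consistency_score(["INTJ", "INTJ", "INTP", "INTJ", "INTJ"]))  # 0.8
```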
General LLM Benchmarks
Srivastava, A., Rastogi, A., Rao, A., et al. (2022). Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615. https://arxiv.org/abs/2206.04615