Using Large Language Models to score journal articles for quality
To investigate the extent to which large language models and other artificial intelligence methods can support peer and expert review of academic documents, including by developing and evaluating algorithms and strategies for this purpose.
Develop and evaluate the ability of Large Language Models to predict the peer review quality scores that human experts would give to published academic research.
Promote a discussion about responsible and ethical uses of AI in peer/expert evaluation of academic research.
Individual research quality scores from ChatGPT and Gemini have little value for research evaluation, but averaging five or more scores per article makes them useful in most fields.
ChatGPT 4o and ChatGPT 4o-mini scores correlate more highly with expert scores than citation-based indicators do for most fields.
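The averaging strategy behind this finding can be sketched in a few lines of Python; the five scores below are hypothetical, standing in for repeated submissions of the same article to an LLM:

```python
from statistics import mean

# Hypothetical quality scores (REF-style 1-4 scale) returned by five
# independent submissions of the same article to an LLM.
scores = [3, 2, 3, 4, 3]

# Averaging repeated scores smooths out the noise in any single LLM
# response, which is why the mean is more useful than one raw score.
average_score = mean(scores)
print(average_score)
```

In practice the project's findings suggest submitting each article to the model at least five times and averaging, rather than relying on any single response.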
Langfeldt, L., Aksnes, D. W., Karlstrøm, H., & Thelwall, M. (2026). Large Language Models for Departmental Expert Review Quality Scores. arXiv preprint arXiv:2601.18945.
Thelwall, M., Schroeder, R., & Dhanda, M. (2026). Can ChatGPT be a good follower of academic paradigms? Research quality evaluations in conflicting areas of sociology. Journal of Data and Information Science.
Thelwall, M. & Mohammadi, E. (2026). Can small and reasoning Large Language Models score journal articles for research quality and do averaging and few-shot help? Scientometrics.
Thelwall, M. (2026). ChatGPT estimates of the quality of published conference papers from their titles and abstracts. Data Technologies and Applications.
Thelwall, M. (2025). Designing large language model prompts to extract scores from messy text: A shared dataset and challenge. Trends in Information Management, 13(2), paper 1. https://lis.uok.edu.in/Files/9ebfb2f2-5003-47a4-9dfe-d3cdcc6a2020/Journal/1435d56f-dbb2-4364-b3ef-2fe4237f73d7.pdf https://arxiv.org/abs/2601.18271
Thelwall, M. (2026). Large Language Models and responsible research evaluation: An extension of the Leiden Manifesto. Scientometrics. https://doi.org/10.1007/s11192-026-05552-x
Thelwall, M. (2026). Do Large Language Models know basic facts about journal articles? Journal of Documentation. https://doi.org/10.1108/JD-11-2025-0330
Thelwall, M. & Nunkoo, R. (2025). A Global South strategy for evaluating research value with ChatGPT. Quantitative Science Studies.
Thelwall, M. (2025). Can smaller large language models evaluate research quality? Malaysian Journal of Library and Information Science, 30(2), 66-81. https://doi.org/10.22452/mjlis.vol30no2.4
Thelwall, M. & Yang, Y. (2025). Implicit and explicit research quality score probabilities from ChatGPT. https://arxiv.org/abs/2506.13525
Thelwall, M. (2025). Quantitative Methods in Research Evaluation: Citation Indicators, Altmetrics, and Artificial Intelligence. White Rose University Press. https://doi.org/10.48550/arXiv.2407.00135
Thelwall, M. (2025). Responsible Uses of Large Language Models in research evaluation. In 20th International Society for Scientometrics and Informetrics Conference, Volume 1, pp. 71-80.
Thelwall, M. (2025). Research quality evaluation by AI in the era of Large Language Models: Advantages, disadvantages, and systemic effects. Scientometrics.
Thelwall, M. (2025). In which fields do ChatGPT 4o scores align better than citations with research quality? https://arxiv.org/abs/2504.04464
Kousha, K., & Thelwall, M. (2025). Assessing the societal influence of academic research with ChatGPT: Impact case study evaluations. Journal of the Association for Information Science and Technology.
Thelwall, M. & Cox, A. (2025). Estimating the quality of academic books from their descriptions with ChatGPT. Journal of Academic Librarianship.
Thelwall, M., Jiang, X., & Bath, P. (2025). Estimating the quality of published medical research with ChatGPT. Information Processing & Management, 62(4), 104123. https://doi.org/10.1016/j.ipm.2025.104123
Thelwall, M., & Kousha, K. (2025). Journal Quality Factors from ChatGPT: More meaningful than Impact Factors? Journal of Data and Information Science. https://doi.org/10.2478/jdis-2025-0016
Thelwall, M., & Kurt, Z. (2024). Research evaluation with ChatGPT: Is it age, country, length, or field biased? arXiv preprint arXiv:2411.09768.
Thelwall, M., & Yaghi, A. (2024). In which fields can ChatGPT detect journal article quality? An evaluation of REF2021 results. arXiv preprint arXiv:2409.16695.
Thelwall, M. & Yaghi, A. (2025). Evaluating the predictive capacity of ChatGPT for academic peer review outcomes across multiple platforms. Scientometrics, to appear.
Thelwall, M. (2025). Is Google Gemini better than ChatGPT at evaluating research quality? Journal of Data and Information Science, 10(2), 1–5. https://doi.org/10.2478/jdis-2025-0014 with extended version here: https://doi.org/10.6084/m9.figshare.28089206.v1
Thelwall, M. (2025). Evaluating research quality with Large Language Models: An analysis of ChatGPT’s effectiveness with different settings and inputs. Journal of Data and Information Science, 10(1), 7-25. https://doi.org/10.2478/jdis-2025-0011
Thelwall, M. (2024). Can ChatGPT evaluate research quality? Journal of Data and Information Science, 9(2), 1–21. https://doi.org/10.2478/jdis-2024-0013
Thelwall, M. (2024). ChatGPT for complex text evaluation tasks. Journal of the Association for Information Science and Technology. http://doi.org/10.1002/asi.24966
Python code for ChatGPT.
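Code of this kind might look like the following minimal sketch, which assumes the official `openai` Python package and a REF-style 1*-4* scoring instruction; the prompt wording, function names, and model choice are illustrative, not the project's actual code:

```python
def build_prompt(title: str, abstract: str) -> str:
    """Build an illustrative quality-scoring prompt (not the project's actual prompt)."""
    return (
        "Score the following journal article for research quality on the "
        "REF scale from 1* to 4*, considering originality, significance "
        "and rigour. Reply with the score and a short justification.\n\n"
        f"Title: {title}\nAbstract: {abstract}"
    )


def score_article(title: str, abstract: str, model: str = "gpt-4o-mini") -> str:
    """Submit one article to ChatGPT and return its free-text report.

    Requires the openai package to be installed and the OPENAI_API_KEY
    environment variable to be set.
    """
    from openai import OpenAI  # imported here so build_prompt works without it

    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(title, abstract)}],
    )
    return response.choices[0].message.content
```

The free-text report returned by `score_article` would then need a separate extraction step to recover a numeric score.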
Windows program Webometric Analyst to extract scores from free text reports from ChatGPT and Gemini (AI or LLM menu).
Contact any team member to get more information about the project or to ask for a speaker at your event.
This project is funded by ESRC Metascience grant UKRI1079.