Using Large Language Models to score journal articles for quality
To investigate the extent to which large language models and other artificial intelligence methods can support peer and expert review of academic documents. This includes developing and evaluating relevant algorithms and strategies.
Develop and evaluate the ability of Large Language Models to predict the peer review quality scores that human experts would give to published academic research.
Promote a discussion about responsible and ethical uses of AI in peer/expert evaluation of academic research.
Individual research quality scores from ChatGPT and Gemini have little value for research evaluation, but averaging five or more scores per article makes them useful in most fields (a minimal averaging sketch follows these findings).
ChatGPT 4o and ChatGPT 4o-mini scores correlate more highly with expert scores than citation-based indicators do for most fields.
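To illustrate the averaging strategy in the first finding, the minimal Python sketch below asks a ChatGPT model for several independent quality scores for the same article and averages them. It assumes the openai Python package (v1+), an OPENAI_API_KEY environment variable, a 1-4 scoring scale, and illustrative prompt wording; it is a sketch of the idea, not the project's exact prompt or configuration.

import re
import statistics
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are an expert assessor. Rate the research quality of the following "
    "article abstract on a scale of 1 (lowest) to 4 (highest) and end your "
    "report with a line of the form 'Score: N'."
)

def score_once(abstract):
    """Request one free-text quality report and parse the score from it."""
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": abstract},
        ],
    )
    match = re.search(r"Score:\s*([1-4])", reply.choices[0].message.content)
    return int(match.group(1)) if match else None

def average_score(abstract, repeats=5):
    """Average several independent scores; repetition reduces random variation."""
    scores = [s for s in (score_once(abstract) for _ in range(repeats)) if s is not None]
    return statistics.mean(scores) if scores else None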
Thelwall, M., & Yang, Y. (2025). Implicit and explicit research quality score probabilities from ChatGPT. https://arxiv.org/abs/2506.13525
Thelwall, M. (2025). Quantitative Methods in Research Evaluation: Citation Indicators, Altmetrics, and Artificial Intelligence. White Rose University Press. https://doi.org/10.48550/arXiv.2407.00135
Thelwall, M. (2025). Responsible uses of Large Language Models in research evaluation. In 20th International Society of Scientometrics and Informetrics Conference (Vol. 1, pp. 71-80).
Thelwall, M. (2025). Research quality evaluation by AI in the era of Large Language Models: Advantages, disadvantages, and systemic effects. Scientometrics.
Thelwall, M. (2025). In which fields do ChatGPT 4o scores align better than citations with research quality? https://arxiv.org/abs/2504.04464
Kousha, K., & Thelwall, M. (2025). Assessing the societal influence of academic research with ChatGPT: Impact case study evaluations. Journal of the Association for Information Science and Technology.
Thelwall, M. & Cox, A. (2025). Estimating the quality of academic books from their descriptions with ChatGPT. Journal of Academic Librarianship.
Thelwall, M., Jiang, X., & Bath, P. (2025). Estimating the quality of published medical research with ChatGPT. Information Processing & Management, 62(4), 104123. https://doi.org/10.1016/j.ipm.2025.104123
Thelwall, M., & Kousha, K. (2025). Journal Quality Factors from ChatGPT: More meaningful than Impact Factors? Journal of Data and Information Science. https://doi.org/10.2478/jdis-2025-0016
Thelwall, M., & Kurt, Z. (2024). Research evaluation with ChatGPT: Is it age, country, length, or field biased? arXiv preprint arXiv:2411.09768.
Thelwall, M., & Yaghi, A. (2024). In which fields can ChatGPT detect journal article quality? An evaluation of REF2021 results. arXiv preprint arXiv:2409.16695.
Thelwall, M. & Yaghi, A. (2025). Evaluating the predictive capacity of ChatGPT for academic peer review outcomes across multiple platforms. Scientometrics, to appear.
Thelwall, M. (2025). Is Google Gemini better than ChatGPT at evaluating research quality? Journal of Data and Information Science, 10(2), 1–5. https://doi.org/10.2478/jdis-2025-0014 (extended version: https://doi.org/10.6084/m9.figshare.28089206.v1)
Thelwall, M. (2025). Evaluating research quality with Large Language Models: An analysis of ChatGPT’s effectiveness with different settings and inputs. Journal of Data and Information Science, 10(1), 7-25. https://doi.org/10.2478/jdis-2025-0011
Thelwall, M. (2024). Can ChatGPT evaluate research quality? Journal of Data and Information Science, 9(2), 1–21. https://doi.org/10.2478/jdis-2024-0013
Thelwall, M. (2024). ChatGPT for complex text evaluation tasks. Journal of the Association for Information Science and Technology. https://doi.org/10.1002/asi.24966
Python code for querying ChatGPT.
Windows program Webometric Analyst, which extracts scores from free-text ChatGPT and Gemini reports (AI or LLM menu); a minimal extraction sketch is below.
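As a scriptable illustration of score extraction, the short Python sketch below pulls a 1-4 score out of a free-text report with regular expressions. The report phrasing and patterns are assumptions for illustration only and do not reproduce Webometric Analyst's own extraction rules.

import re

def extract_score(report):
    """Return the first 1-4 quality score stated in a free-text report, or None."""
    patterns = [
        r"[Ss]core(?:\s+of)?\s*[:=]?\s*([1-4])\*?",  # e.g. "score: 3" or "score of 3*"
        r"\b([1-4])\s*(?:\*|out of 4)",              # e.g. "3*" or "3 out of 4"
    ]
    for pattern in patterns:
        match = re.search(pattern, report)
        if match:
            return int(match.group(1))
    return None  # no recognisable score in the report

print(extract_score("Overall, I would give this article a score of 3*."))  # prints 3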
Contact any team member for more information about the project or to request a speaker for your event.
This project is funded by ESRC Metascience grant UKRI1079.