Dr Anne Littlewood
Information Specialist, YHEC
The field of artificial intelligence (AI) is growing in the life sciences, with an increasing number of studies investigating whether AI tools can be - and should be - used in healthcare research [1]. Here, we interview YHEC's Dr Anne Littlewood, an Information Specialist who has been exploring the use of ChatGPT, a generative AI chatbot, for conducting searches for systematic reviews.
Tell us a bit about yourself and your role.
As an Information Specialist, I undertake literature searching for a range of YHEC projects, from systematic reviews to more pragmatic reviews. A key part of my role is to create (often complex) search strategies for electronic databases to identify evidence for ongoing projects.
Can you give us some background about using ChatGPT for systematic review searching?
There's been a lot of interest recently in AI and how it might be applied to create efficiency savings in systematic reviews, particularly around using large language models (LLMs) like ChatGPT [2]. Searching the literature to identify evidence is one of the key stages in the systematic review process, and the role of LLMs for undertaking such searches is one area that might be a potential timesaver. It may also have other benefits, allowing us to investigate different options within a search more quickly and easily to improve the search strategy and, ultimately, the systematic review. While LLMs could improve efficiency in creating a search strategy, it’s important to note that published literature does not currently support the use of ChatGPT to produce comprehensive searches suitable for a systematic review context [3].
Why are you interested in this?
I have worked on systematic review projects as an information specialist for 15 years. In that time, the literature base has grown exponentially, and it is getting more difficult to break through the "noise" to find the evidence that is needed to answer healthcare research questions. This is why YHEC is keen to explore whether LLMs can perform some of the functions of an expert searcher and whether LLMs could be used to make our work more efficient.
Although I can see how this could inform my own work, I do have some concerns that novice searchers may think that LLMs can undertake comprehensive literature searches suitable for a systematic review and that they may use these tools uncritically. I believe it is important to understand the strengths and limitations of LLMs for this type of work to help identify where AI could provide useful input when conducting searches for systematic reviews.
Who else is this research relevant for?
Our research is relevant for other researchers who are conducting systematic reviews, particularly Information Specialists and Medical Librarians who will be conducting searches.
Can you briefly summarise the methods that were used?
To see how a general user might fare, the LLM used in this research was the free version of ChatGPT. We fed five prompts into ChatGPT to carry out tasks that might typically be undertaken when preparing literature searches for a systematic review. We initially asked the LLM to write a search strategy that could find evidence in one of the core medical databases: Ovid MEDLINE. We used postpartum depression as an example topic. Search strategies in the systematic review context aim to be as comprehensive as possible and usually contain indexing terms (in this case, Medical Subject Headings or MeSH) along with keyword searching for maximum retrieval. They make use of Boolean logic (for example, AND, OR and NOT) and truncation (searching on the stem of a word to identify plurals and term variants using a truncation symbol). We expected to find all these features in ChatGPT's search. We also asked ChatGPT to limit the search to randomised controlled trials, a restriction that is often applied in systematic reviews of healthcare interventions.
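For readers unfamiliar with Ovid syntax, the features described above might look something like the following sketch. This is an illustrative example of the kind of strategy an Information Specialist might draft, not ChatGPT's actual output; the line numbering, the MeSH term "Depression, Postpartum", the proximity operator (adj3) and the field tags (.ti,ab,kf. for title, abstract and author keywords) follow standard Ovid MEDLINE conventions:

```
1. Depression, Postpartum/
2. ((postpartum or post-partum or postnatal or post-natal or puerperal) adj3 depress*).ti,ab,kf.
3. 1 or 2
```

Line 1 searches the controlled indexing, line 2 the free text with truncation, and line 3 combines them with Boolean OR.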
Another key task is translating the search strategy for use in other databases, which may have very different indexing terms and search syntax. Next, we asked ChatGPT to convert its MEDLINE search strategy for use in CINAHL via Ebsco, a separate database supplied by a different vendor. We also wanted to see how well ChatGPT might perform in providing the building blocks of a search strategy, so we asked it to suggest indexing terms for a particular concept (IL-1 inhibitors) and a list of synonyms for a different concept (detection) to test its capabilities.
What were your key findings?
Writing a Boolean search and limiting it to randomised controlled trials
In answer to Prompt 1 (Figure 1), ChatGPT produced a search strategy that looked good but had some flaws (Figure 2). It did include indexing terms, keyword searching using synonyms for "postpartum depression", Boolean logic, and truncation. However, the number of synonyms or variant terms for postpartum depression was very limited; for example, it did not explore concepts such as depression in new mothers or new parents. Terms like these should be added to capture all the available evidence. Truncation was also inconsistently applied: "depress*" was included as a term, which would pick up depressed, depression, depressions and so on, but "anxiety" was not truncated to pick up anxieties, and anxious did not feature at all.
For Prompt 2 (Figure 1), ChatGPT did successfully limit the search to randomised controlled trials (Figure 3), using a recognised set of validated search terms (a search filter) from the Cochrane Handbook for Systematic Reviews of Interventions [4]. This was impressive, but there are two versions of this filter, and the version used by ChatGPT was more focused and precise than the sensitivity-maximising version. This may be acceptable in the context of this example, but some knowledge of searching is needed to recognise this and to decide which version of the filter is most appropriate. At this stage, ChatGPT also introduced a limit to English language records only. It was not prompted to do this, and the language limit doesn't form part of the Cochrane search filter. Limiting to English language studies only would not generally be acceptable for a systematic review. It's also worth noting that the syntax used by ChatGPT to limit the search by language is syntax that won't run in the database.
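To illustrate what such a filter looks like, the sensitivity- and precision-maximising version of the Cochrane RCT filter for Ovid MEDLINE reads approximately as below. This is reproduced from memory for illustration; anyone applying it should verify the current wording against the Cochrane Handbook [4]:

```
1. randomized controlled trial.pt.
2. controlled clinical trial.pt.
3. randomized.ab.
4. placebo.ab.
5. clinical trials as topic.sh.
6. randomly.ab.
7. trial.ti.
8. or/1-7
9. exp animals/ not humans.sh.
10. 8 not 9
```

The sensitivity-maximising version adds further broad terms and therefore retrieves more records; choosing between the two versions is exactly the kind of judgement call that requires searching expertise.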
Our research showed that taking a search strategy written by ChatGPT and trying to run it uncritically in Ovid MEDLINE would result in a suboptimal search in the systematic review context. However, the rapid generation of a full strategy by ChatGPT – even if flawed – clearly has potential value to the experienced, critical searcher; the strategy can help inform early thinking on search challenges, strategy structure, and potential indexing and textword search terms.
Converting the search to a different database
In answer to Prompt 3 (Figure 1), the search strategy produced for CINAHL via Ebsco looked satisfactory on the surface but, on further analysis, there were serious problems with the indexing terms and search syntax. In this case, ChatGPT misinterpreted a search of title, abstract and keywords and instead created a strategy designed to search the full text of the documents. This would dramatically increase the number of records to screen and lead to less efficiency, not more! The key indexing term "postpartum depression" was also incorrectly entered in the strategy and retrieved zero results. The Boolean "AND" and "OR" searches were incorrectly rendered and would have searched for the wrong concept. The search that was produced by ChatGPT was unusable, although it looked as if it would work. It is possible that a novice searcher would not recognise or understand the flaws in the ChatGPT search and would attempt to run it, leading to the retrieval of many irrelevant studies, which would take additional time to screen. The search could also miss potentially eligible studies, possibly introducing bias into the systematic review as a result.
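To show why translation is not trivial, here is an illustrative sketch of how the Ovid strategy above would need to be re-expressed in CINAHL via Ebsco. The CINAHL subject heading, proximity operator (N3) and field tags (TI/AB for title/abstract, MH for exact subject heading) follow standard Ebsco conventions; this is our example, not ChatGPT's output:

```
S1  MH "Depression, Postpartum"
S2  TI (postpartum N3 depress*) OR AB (postpartum N3 depress*)
S3  S1 OR S2
```

Note that almost every element changes between platforms: the subject heading vocabulary, the field tags, the proximity operator and the line-numbering convention. These platform-specific details are precisely what ChatGPT rendered incorrectly.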
Using ChatGPT to create the building blocks of a search strategy
For Prompt 4 (Figure 1), ChatGPT did retrieve a list of indexing terms for IL-1 inhibitors from MEDLINE, terms that we might potentially add to a search strategy (Figure 4). However, although most of the terms were correct, some of the terms presented were not indexing terms. "Cytokine inhibitors" was a term in the list, but this is not a MeSH heading. Anakinra is a type of IL-1 inhibitor but does not exist as a MeSH heading either. Hallucinations are a well-known issue with ChatGPT, and this seems to be another example of an area where, if ChatGPT doesn't "know", it makes a best guess [5].
If ChatGPT failed to accurately identify MeSH, could it potentially be used to identify variant terms for a text word search? An Information Specialist will search on text words that might appear in the title, abstract or keywords of a record, and will need to identify as many synonyms as possible to make the search sensitive. For Prompt 5, the list of variant terms for "detection" produced by ChatGPT was good: one that we could use as a starting point when considering variant search terms for the concept. We think there may be a good case for using ChatGPT in this way – as one of the sources we might use at the start of strategy development for harvesting potential terms. At this point we wouldn't want to rely on it solely, but it could have value.
What can we conclude from these results?
Although we have found that ChatGPT is not ready to consistently produce final, finished search strategies suitable for use in systematic reviews, we feel that it has potential value as a tool to assist the experienced, critical searcher with strategy development. Our findings chime with those in published studies on the topic [3].
Do you think anything might change as a result of this research? Why?
LLMs have promise for search strategy development, and there is a case for using them for some tasks, particularly developing initial search terms. We will continue to look into where and how AI tools can add value to our methods, and we'll incorporate them into our processes going forward.
If someone was to do the research again, would you recommend that they do anything differently?
We know from previous research that there are issues with ChatGPT and replication [6]. It would be useful to know whether the same prompts run at a different time would result in the same search strategy, or something else. This is something we are planning to explore further.
Where should future researchers in this area focus their attention?
We have only looked at ChatGPT for this research, but there are now numerous LLMs available. It would be interesting to compare their performance and see how they measure up! It's also the case that LLMs are evolving very quickly, so it would definitely be worth repeating the experiment to see if ChatGPT improves its search strategy development capabilities over time.
If you wanted people to take away one thing from this research, what would it be?
Some experience and knowledge in systematic review search methods is needed before using ChatGPT to create search strategies for systematic reviews. Although the output of ChatGPT looks impressive, a critical eye is required before using any of its suggestions. At YHEC, we believe that our searches should be researcher-led and AI empowered, and not AI-led. This research supports that approach.
Contact us
To find out more about our capabilities in searching for evidence and support for reviews, please contact us at yhec@york.ac.uk.
References
1. Senthil R, Anand T, Somala CS, Saravanan KM. Bibliometric analysis of artificial intelligence in healthcare research: Trends and future directions. Future Healthcare Journal. 2024;11(3):100182.
2. Alshami A, Elsayed M, Ali E, Eltoukhy AE, Zayed T. Harnessing the power of ChatGPT for automating systematic review process: Methodology, case study, limitations, and future directions. Systems. 2023;11(7):351.
3. Parisi V, Sutton A. The role of ChatGPT in developing systematic literature searches: an evidence summary. Journal of EAHIL. 2024;20(2):30-34.
4. Lefebvre C, Glanville J, Briscoe S, Featherstone R, Littlewood A, Metzendorf M-I, et al. Chapter 4: Searching for and selecting studies. In: Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, et al., editors. Cochrane Handbook for Systematic Reviews of Interventions version 6.3: Cochrane; 2022.
5. Alkaissi H, McFarlane SI. Artificial hallucinations in ChatGPT: implications in scientific writing. Cureus. 2023;15(2).
6. Al-Dujaili Z, Omari S, Pillai J, Al Faraj A. Assessing the accuracy and consistency of ChatGPT in clinical pharmacy management: a preliminary analysis with clinical pharmacy experts worldwide. Research in Social and Administrative Pharmacy. 2023;19(12):1590-94.
Posted: 04 February 2025