Can AI enhance CGIAR syntheses and quality assurance?
Two preliminary assessments
By Mac Millan • Published on 11/12/2023
The seductions of artificial intelligence (AI) are arguably especially great for professionals working with large bodies of scientific information. AI evangelists tell us that new language models offer an elegant solution for all knowledge workers who need to review copious amounts of information to marshal evidence for one purpose or another. These models can do the heavy lifting, they promise, and do so at warp speed.
But questions remain. Can AI produce something as useful as it is fast? Is it a helpful assistant or a blunt instrument? Are AI products carefully observed or coarsened summaries? Clear-eyed or clunky approximations? And can AI products connect dots in new ways? Do they master or reduce complexity? What about deeper narratives that may be at play? What of nuance? Of ambiguity?
What is lost in translation—in reading, summarizing, synthesizing and assessing large amounts of text—may be significant. While nuanced details may be lost in AI-generated products, for example, cognitive, confirmation and other biases may corrupt human work. With the recent (and stratospheric) ascendancy of AI—specifically the large language models like ChatGPT that are bringing AI to the masses—CGIAR's Portfolio Performance Unit (PPU) recently conducted two retrospective assessments of how useful AI might be in CGIAR's portfolio management work: crawling through a mass of information to produce useful syntheses and assessments.
AI to enhance syntheses?
In the first assessment, PPU prompted ChatGPT (running GPT-4) to summarize a mass of recommendations that had previously been summarized by a senior communications expert, Laura Reumann. Reumann compared her manual synthesis with ChatGPT's automated synthesis of the same source document, which consisted of adaptive management recommendations made in 2022 by 31 CGIAR Initiatives and the GENDER Impact Platform to better align themselves with shifting resources, budgets and situations. A total of 197 recommendations were made, comprising 16,520 words across 63 pages. Reumann's analysis and synthesis of these recommendations took her about 8 hours to complete; ChatGPT's took mere seconds, although Reumann estimates that a further 2–3 hours would be needed to manually review and edit the AI text.
Reumann concludes that ChatGPT proved highly effective in providing a broad overview of the themes and recommendations noted by the CGIAR Initiatives. Use of AI reduced the complexity of summarizing the extensive source text and information. It consistently identified key points and themes and enhanced the readability of the text. By assembling draft texts quickly and managing the bulk of the initial summary work, AI made this complex task more approachable.
On the other hand, Reumann noted that ChatGPT sometimes over-simplified content, with some anodyne results. Producing a higher quality, more useful product, one with greater explanatory power, would thus necessitate manually integrating essential context, more salient and granular details (including some specific statistics), and the insights gained from them into the AI product. ChatGPT was also unable to fully capture strategic communication nuances and key priorities that require an understanding of the broader context, something that Reumann, with her years of experience in CGIAR communications, was able to provide. Finally, Reumann noted that more detailed and precise prompts would be needed for ChatGPT to align its output with CGIAR's specific style guidelines.
Reumann concludes that use of AI in such communications work can be beneficial when employed as complementary to human work rather than as a standalone solution. She cautions against using AI as a primary method of generating content but notes that better prompt design could enhance its utility. While ChatGPT can assist in summarizing large amounts of information, she argues, it still requires careful integration with human skills to produce syntheses that are both comprehensive and nuanced, that are aligned to CGIAR’s style and messages, and that are sufficiently detailed to offer the CGIAR community “illuminating insights” and “actionable intelligence”.
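Reumann's point about prompt design can be illustrated with a minimal sketch. The function below assembles a summarization prompt that encodes explicit style constraints; the word limit and style rules shown are hypothetical placeholders, not CGIAR's actual guidelines.

```python
# Minimal sketch of a prompt builder that encodes style constraints.
# The word limit and style rules below are illustrative assumptions,
# not CGIAR's actual style guidelines.
def build_synthesis_prompt(source_text: str,
                           word_limit: int = 500,
                           style_rules: tuple = ("use active voice",
                                                 "use British English spelling")) -> str:
    """Assemble a summarization prompt with explicit style constraints."""
    rules = "\n".join(f"- {rule}" for rule in style_rules)
    return (
        f"Summarize the recommendations below in at most {word_limit} words.\n"
        f"Preserve specific statistics and named Initiatives.\n"
        f"Follow these style rules:\n{rules}\n\n"
        f"SOURCE TEXT:\n{source_text}"
    )

prompt = build_synthesis_prompt("Initiative A recommends reallocating budget...")
```

Front-loading constraints like these is one way to reduce the manual editing pass Reumann describes, though the output would still need human review.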
This comparative analysis by CGIAR’s Portfolio Performance Unit was made to assess the viability and value of using this and similar AI tools for comparable future tasks. The results in this case suggest that it may be best to view the AI draft as a reasonably fair summary, one that highlights key general themes and provides some specific examples—a ‘starting summary’ that could, with further refining of prompts, likely be improved. The human draft, on the other hand, is closer to a synthesis, which, along with its summarized information, offers more detailed context and deeper analysis.
AI to enhance quality assurance?
CGIAR’s existing quality assurance (QA) process requires assessors to spend large amounts of time on each result reported by CGIAR researchers. The evidence researchers report to bolster their claims about the results of their innovations can be a lot to read: some results consist of 50-page documents as well as papers in academic journals. Evaluating the “readiness level” of a given innovation requires a detailed review of all this evidence. The question PPU is asking is: can AI help assessors work faster and more accurately? If so, which tasks or fields should human assessors spend time on, and which could they leave to AI?
In a recent “deep dive” PPU meeting, the unit’s expert in AI matters, Ebenge Usip, and its expert in quality assurance, Mariajulia Mariani, provided a case study of how AI models might greatly speed up the time it takes to assess results submitted by CGIAR researchers to the QA Platform.
With recent advances in AI, Usip said, large language models such as Claude from Anthropic, GPT-3.5, and GPT-4 can be leveraged to read and summarize large bodies of text in a matter of minutes. The time it takes to review and summarize evidence of up to 110 pages in length was shortened to less than 5 minutes for some of the results that Usip processed.
A short-term solution to reducing the time it takes to conduct QA, Usip explained, is to use an offline tool that can provide AI assessments for specific aspects of the QA process (currently “readiness level”, “tags” and “geographic location”). Ideally, in the longer term, he said, integrating this kind of work into the QA platform would probably be more beneficial, as the models could also handle some preliminary steps, such as producing a summary of the evidence and auto-completing certain fields that the models populate reliably.
For his review, Usip passed to large language models the full evidence and results of the innovations that researchers reported on, and QA assessors assessed, in 2022. He passed all this information to Claude 2, which, with its very large context window, was able to condense 50-plus pages into 300-word summaries identifying the who, what, where, when and why of each innovation, its key facts and figures, and its objectives.
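The summarization step described above might look roughly like the sketch below, using Anthropic's Python SDK. The model name, token cap and prompt wording are assumptions for illustration, not details reported from the meeting.

```python
# Sketch of the long-document summarization step, assuming Anthropic's
# Python SDK ("anthropic" package). The model name, prompt wording and
# token cap are illustrative assumptions.
SUMMARY_PROMPT = (
    "Summarize the evidence below in at most 300 words. Identify the who, "
    "what, where, when and why of the innovation, its key facts and "
    "figures, and its objectives.\n\nEVIDENCE:\n{evidence}"
)

def summarize_evidence(evidence: str, model: str = "claude-2.1") -> str:
    """Send the full evidence text to a large-context model and return
    a roughly 300-word structured summary."""
    import anthropic  # imported here so the sketch loads without the SDK installed
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model=model,
        max_tokens=600,  # roughly 300 words of output
        messages=[{"role": "user",
                   "content": SUMMARY_PROMPT.format(evidence=evidence)}],
    )
    return response.content[0].text
```

The appeal of a large context window is that a 50-page evidence file can be sent whole, without the chunk-and-merge workarounds smaller-context models require.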
He then combined all of this information and prompted an OpenAI model to read the summaries and evaluate the readiness level of each reported innovation. In this process, the AI had no knowledge of the readiness level claimed by the researchers; this was a completely independent evaluation.
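The independence of this evaluation, withholding the researcher-claimed score from the model, can be sketched as a prompt builder. The field names and the 0–9 scale below are assumptions for illustration, not CGIAR's actual QA schema.

```python
# Sketch of building an independent readiness-level evaluation prompt.
# Field names and the 0-9 scale are illustrative assumptions, not
# CGIAR's actual QA schema.
def build_evaluation_prompt(record: dict) -> str:
    """Build an evaluation prompt from a result record, deliberately
    excluding the researcher-claimed readiness level so the model's
    score is independent of the claim."""
    withheld = {"claimed_readiness_level"}  # never shown to the model
    evidence = "\n".join(f"{key}: {value}" for key, value in record.items()
                         if key not in withheld)
    return ("Based only on the summary below, assign an innovation "
            "readiness level from 0 (idea) to 9 (proven at scale). "
            "Explain your reasoning, then state the level.\n\n" + evidence)

record = {"summary": "A drought-tolerant bean variety tested on many farms...",
          "claimed_readiness_level": 7}
prompt = build_evaluation_prompt(record)
```

Keeping the claimed score out of the prompt is what makes the later human-versus-AI comparison meaningful: agreement then reflects the evidence, not anchoring on the researcher's number.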
Results: Two-thirds of the sample showed close alignment between human and AI scores of innovation readiness.
Usip was able to gather 61 results with readily accessible evidence. For 24 of these, the AI's readiness scores were lower than those claimed by researchers. Generally, this is a sign of good AI performance: it indicates that the AI read through everything and paid close attention to all the details. In addition, Usip's QA colleague Mariajulia Mariani conducted her own manual assessment of 7 examples and found that in 6 of the 7 instances the AI assessments were probably correct and the human assessments probably incorrect, due to human error of some kind. (This will be further evaluated.)
For 8 results, the AI gave exactly the same score as that claimed by the researchers, showing clear alignment between the AI and researcher assessments. Finally, for the remaining 29 results, the AI gave readiness scores higher than those claimed by the researchers. Usip explained that these were probably the result of the AI mistaking the detailed research plans reported by researchers for proven research results. He will revise the prompt he used to make the distinction between research plans and results clearer.
In summary, Usip reported that 39 of the 61 results were accurate or acceptable (at most 1 point off from the readiness level claimed by researchers), and that the approach could cut the time needed to conduct QA by about two-thirds. So, as far as innovation readiness levels go, PPU's AI assessment produced a 60–66% positive performance, with some improvements yet to be made.
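Tallies like those above can be reproduced with a small scoring comparison. The score pairs below are invented toy values for illustration, not the actual 61 QA results.

```python
# Toy sketch of comparing researcher-claimed vs AI-assigned readiness
# levels. The score pairs below are invented, not the actual QA data.
def compare_scores(pairs, tolerance=1):
    """Tally AI-vs-human score differences.

    `pairs` is a list of (claimed, ai) readiness levels. Returns counts
    of AI-lower, exact-match, AI-higher, and 'acceptable' results
    (AI score within `tolerance` points of the claimed level)."""
    lower = sum(1 for claimed, ai in pairs if ai < claimed)
    equal = sum(1 for claimed, ai in pairs if ai == claimed)
    higher = sum(1 for claimed, ai in pairs if ai > claimed)
    acceptable = sum(1 for claimed, ai in pairs
                     if abs(ai - claimed) <= tolerance)
    return {"lower": lower, "equal": equal,
            "higher": higher, "acceptable": acceptable}

toy_pairs = [(7, 5), (6, 6), (4, 7), (8, 7), (3, 3)]
stats = compare_scores(toy_pairs)
```

On the real data this would yield the reported breakdown of 24 lower, 8 equal and 29 higher scores, with 39 of 61 within one point.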
This AI approach could now be widened to include more straightforward items in the QA process such as long and short titles, tags, and geographic locations. This should save time for the human QA assessors, who could then focus on more critical aspects of their QA work.
Yet to be determined is whether CGIAR will use AI in future to focus on specific areas of reported results, to focus on all results, or to have human assessors ask AI to focus only on areas where scores differ.