Cochrane has undertaken an innovative study to assess the efficacy of artificial intelligence (AI) tools in enhancing evidence synthesis. This article details the rigorous, multi-stage selection process employed to choose AI tools for evaluation in its platform study. It outlines the criteria and values Cochrane prioritizes, such as alignment with responsible AI use, methodological rigor, and human oversight, providing a transparent framework for AI tool developers and users interested in responsibly integrating AI into evidence synthesis practices.
This section opens by explaining that Cochrane launched a study to evaluate AI tools for evidence synthesis and received 48 proposals. Because resources allowed only two tools to be selected, a comprehensive selection process was needed. It clarifies that assessment at this stage was based on information provided by developers, not actual tool usage, with real-world usability and suitability to be determined in the subsequent platform study. This methodical approach ensures a fair and transparent selection, setting the stage for the rigorous evaluation to follow.
The first phase of the selection involved an initial screening performed by the Cochrane Central Executive Team. Proposals were evaluated against three primary criteria. Firstly, "Cochrane workflow coverage" prioritized tools applicable to screening (abstracts and full-texts) and data extraction, as these stages are identified as most promising for AI application, particularly with large language models (LLMs). Secondly, "Maturity" focused on immediately beneficial tools, filtering out those not yet available or lacking prior evaluation. Lastly, "Affordability" was a key consideration, ensuring that selected tools could be potentially scaled up after the study without prohibitive costs. This systematic initial filter streamlined the pool of applicants.
Submissions that passed the initial screening were then subjected to a detailed scoring process against key criteria drawn from the original call for AI tool proposals. These criteria included: alignment with the Responsible AI Use in Evidence Synthesis (RAISE) recommendations, particularly the responsible handover framework in RAISE 3; the robustness of evaluation standards and validation approaches; the tool's usability and compatibility with Cochrane's methodological standards and review workflows; and the developer's commitment to capacity-building, support, and an interoperable infrastructure. Notably, the first two criteria, RAISE alignment and evaluation standards, carried double weight, reflecting their critical importance to Cochrane's mission of rigorous evidence synthesis.
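The weighting scheme described above can be illustrated with a minimal sketch. Only the fact that the first two criteria count double comes from the article; the criterion keys, scoring scale, and example scores below are hypothetical.

```python
# Hedged illustration of the weighted scoring described above.
# RAISE alignment and evaluation standards count double (per the article);
# the criterion names, 0-5 scale, and score values are invented examples.

WEIGHTS = {
    "raise_alignment": 2,        # double weight (stated in the article)
    "evaluation_standards": 2,   # double weight (stated in the article)
    "usability_compatibility": 1,
    "capacity_building_support": 1,
}

def weighted_total(scores: dict) -> float:
    """Sum each criterion score multiplied by its weight."""
    return sum(WEIGHTS[c] * s for c, s in scores.items())

# Two hypothetical submissions scored 0-5 per criterion.
tool_a = {"raise_alignment": 4, "evaluation_standards": 5,
          "usability_compatibility": 3, "capacity_building_support": 4}
tool_b = {"raise_alignment": 3, "evaluation_standards": 3,
          "usability_compatibility": 5, "capacity_building_support": 5}

print(weighted_total(tool_a))  # 4*2 + 5*2 + 3 + 4 = 25
print(weighted_total(tool_b))  # 3*2 + 3*2 + 5 + 5 = 22
```

Under this scheme, tool A's stronger RAISE alignment and evaluation evidence outweigh tool B's higher usability and support scores, which is exactly the prioritization the doubled weights are meant to encode.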
Following the scoring phase, the 18 submissions that passed initial screening were narrowed down: the nine highest-scoring proposals were presented to a scientific advisory group of seven volunteers from the joint AI Methods Group. During discussions, two tools were removed due to the absence of publicly available evaluation studies, emphasizing Cochrane's commitment to transparency and verifiable performance. The remaining seven tools were then independently ranked by the advisory group members, and the top two were selected as the initial candidates for the platform study. The remaining ranked tools form a reserve list, available for inclusion if the primary selections do not meet the predefined standards outlined in the study protocol.
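The article does not say how the seven members' independent rankings were combined. One common aggregation is to average each tool's rank position (lower is better); the sketch below assumes that method purely for illustration, with invented tool names and only three example rankings.

```python
# Hypothetical rank aggregation for the advisory-group step.
# The aggregation method (mean rank) and all data are assumptions;
# the article only states that members ranked tools independently
# and the top two became the initial candidates.
from statistics import mean

# Each inner list is one member's ranking, best first (names invented).
rankings = [
    ["T1", "T2", "T3", "T4", "T5", "T6", "T7"],
    ["T2", "T1", "T4", "T3", "T5", "T7", "T6"],
    ["T1", "T3", "T2", "T5", "T4", "T6", "T7"],
]

def mean_rank(tool: str) -> float:
    """Average 1-based position of a tool across all members' rankings."""
    return mean(r.index(tool) + 1 for r in rankings)

tools = sorted(rankings[0], key=mean_rank)
selected, reserve = tools[:2], tools[2:]
print(selected)  # two lowest mean ranks -> initial candidates
print(reserve)   # the rest form the reserve list
```

Here T1 (mean rank ~1.33) and T2 (mean rank 2.0) would be selected, with the remainder held in reserve, mirroring the two-plus-reserve structure described in the article.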
Beyond the formal scoring, several overarching themes emerged during the selection process that significantly influenced the final choices. These factors provide valuable guidance for AI tool developers aiming to integrate their solutions into the evidence synthesis community. Cochrane encourages developers to consider these elements as they design and refine their AI offerings, ensuring they meet the high standards required for reliable and impactful evidence generation. These themes are crucial for fostering responsible AI development and adoption in the field.
A significant preference was given to AI tools capable of supporting multiple stages of systematic review workflows, specifically both study screening and data extraction. The rationale was to minimize the need for Cochrane authors to learn and switch between numerous new tools, thereby streamlining the review process. However, this preference came with a critical caveat: each distinct AI application embedded within a tool had to demonstrate its own merit and be separately assessed and validated. Developers were encouraged to adopt a holistic perspective, ensuring that all AI systems within their tool, along with user interactions, consistently met expected standards of rigor and integrity, in line with the RAISE recommendations.
Cochrane initially expressed a strong preference for open-source tools, particularly those developed within the academic community, over commercial alternatives. This preference stemmed from several benefits: open-source tools typically incur lower financial costs for the Cochrane community, facilitate greater transparency and reproducibility (especially when not reliant on third-party LLMs), and allow Cochrane to freely adapt them to its specific workflows and methodologies. However, the selection process revealed fewer mature open-source proposals than anticipated. Consequently, this factor had to be balanced against other crucial criteria. Only one open-source tool progressed to the final ranking stage but was not selected as an initial candidate because it lacked data extraction support. Cochrane remains keen to engage with open-source developers and emphasizes transparency in funding and declarations of interest for all tools.
A non-negotiable requirement for consideration was the availability of publicly accessible validation studies for AI tools. Any tools lacking such studies, or those requiring non-disclosure agreements for access, were immediately excluded from the shortlist. Cochrane rigorously focused on tools with methodologically sound and transparently reported validation studies, addressing concerns about replicability and the reliability of performance claims. Assumptions, often unstated, within these studies were also critically evaluated to determine if performance claims would generalize across Cochrane’s broad scope of health and social care topics. The article cites multiple RAISE recommendations emphasizing complete, transparent, and public reporting of evaluations, avoiding misleading claims, and respecting accuracy standards expected by the evidence synthesis community, urging developers not to hide information behind "commercial confidentiality."
While the precise mechanisms for human oversight would be thoroughly evaluated during the platform study, the initial selection prioritized proposals that clearly demonstrated a commitment to human-centered design principles. This meant ensuring that users were fully aware of all automated decisions made by the AI and possessed the capability to manually override these decisions. Crucially, tools that featured agentic workflows, where multiple tasks were executed autonomously without any human intervention or oversight, were explicitly excluded from consideration. This criterion underscores Cochrane's commitment to maintaining human control and accountability in the evidence synthesis process, aligning with RAISE recommendations for human-centered AI systems.
For AI tools powered by large language models (LLMs), performance can be highly sensitive to the quality and specificity of user-defined prompts. A significant concern arose with tools where the burden of "prompt engineering" fell directly on the user, often without adequate guiding interfaces. Cochrane recognized that its authors may not possess specialized prompt engineering skills, raising questions about whether the impressive performance claims observed in controlled environments would translate to real-world usage where authors must craft prompts specific to their reviews. Consequently, tools that placed a heavy reliance on advanced user prompting, and where concerns existed about the ability of Cochrane authors to use them reliably even with provided training and support, were deprioritized. This highlights the importance of user-friendliness and accessibility in AI tool design.
A fundamental requirement for all AI tool developers was demonstrating clear and public adherence to legal and ethical standards. Tools that failed to provide this assurance were excluded. Specific examples of non-compliance that led to exclusion included: policies allowing the use of uploaded user content to train the AI system, tools that imposed limitations on the subsequent use of generated outputs, and features designed to streamline the import of full-text articles without adequate safeguards to help users avoid potential copyright infringements. The article cites RAISE recommendations emphasizing awareness and adherence to national and international guidelines (like the EU AI Act), transparency regarding data usage, and ensuring that ethical, legal, and regulatory standards are followed throughout AI model development to prevent copyright infringement or unlawful use.
This section outlines the ongoing nature of Cochrane’s initiative, emphasizing that the selection process detailed is merely the initial phase of their 2026 platform study of AI tools. It reaffirms that all decisions regarding tool inclusion were based on the information provided by developers, and that more concrete conclusions about the usability and suitability of these tools will emerge from the later, practical stages of the study, with findings to be shared upon availability. The article stresses that this is a pioneering effort to develop more timely evidence synthesis methods without compromising integrity and rigor. It expresses gratitude to all AI tool developers who submitted proposals and looks forward to further collaboration with those committed to responsible AI development, aiming to provide accessible, trusted evidence for better, more equitable health outcomes globally.
This subheading serves as a brief signpost, offering readers a direct link to additional information about Cochrane's broader initiative. It directs interested parties to an article titled "Cochrane launches innovative study to assess AI tools for evidence synthesis," allowing them to delve deeper into the overarching goals and scope of the platform study. This demonstrates Cochrane's commitment to transparency and providing comprehensive resources to its community and the wider public.
This section directly addresses AI tool developers, clarifying the stringent conditions for using Cochrane data. It states unequivocally that AI developers are subject to the same terms and conditions for re-use as human users and must obtain explicit authorization before utilizing data from Cochrane reviews for AI development, training, or implementation. The article emphasizes that all terms of the license associated with each review apply, and no implied permission exists without a formal license. Furthermore, any use of Cochrane data must strictly adhere to Cochrane’s data re-use terms and conditions, including appropriate citation and accreditation. To reinforce legal and ethical responsibilities, it also refers developers to Wiley’s Statement on Illegal Scraping of AI Copyright, highlighting the importance of respecting intellectual property rights in AI development.