Wang Zifeng, Cao Lang, Danek Benjamin, Jin Qiao, Lu Zhiyong, Sun Jimeng
Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign, Urbana, IL, USA.
Keiji.AI Inc, Seattle, USA.
NPJ Digit Med. 2025 Aug 8;8(1):509. doi: 10.1038/s41746-025-01840-7.
Clinical evidence synthesis largely relies on systematic reviews (SRs) of clinical studies from the medical literature. Here, we propose a generative artificial intelligence (AI) pipeline named TrialMind to streamline the study search, study screening, and data extraction tasks in SRs. We chose published SRs to build TrialReviewBench, a benchmark containing 100 SRs and 2,220 clinical studies. For study search, TrialMind achieves high recall rates (0.711-0.834 vs. a human baseline of 0.138-0.232). For study screening, it outperforms previous document-ranking methods by a 1.5-2.6-fold margin. For data extraction, it exceeds GPT-4's accuracy by 16-32%. In a pilot study, human-AI collaboration with TrialMind improved recall by 71.4% and reduced screening time by 44.2%; in data extraction, accuracy increased by 23.5% with a 63.4% reduction in time. Medical experts preferred TrialMind's synthesized evidence over GPT-4's in 62.5%-100% of cases. These findings demonstrate the promise of human-AI collaboration for accelerating clinical evidence synthesis.