
Accelerating clinical evidence synthesis with large language models.

Authors

Wang Zifeng, Cao Lang, Danek Benjamin, Jin Qiao, Lu Zhiyong, Sun Jimeng

Affiliations

Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign, Urbana, IL, USA.

Keiji.AI Inc, Seattle, USA.

Publication

NPJ Digit Med. 2025 Aug 8;8(1):509. doi: 10.1038/s41746-025-01840-7.

DOI: 10.1038/s41746-025-01840-7
PMID: 40775042
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12331930/
Abstract

Clinical evidence synthesis largely relies on systematic reviews (SR) of clinical studies from the medical literature. Here, we propose a generative artificial intelligence (AI) pipeline named TrialMind to streamline the study search, study screening, and data extraction tasks in SR. We chose published SRs to build TrialReviewBench, which contains 100 SRs and 2,220 clinical studies. For study search, TrialMind achieves high recall rates (ours 0.711-0.834 vs. a human baseline of 0.138-0.232). For study screening, TrialMind outperforms previous document-ranking methods by a factor of 1.5-2.6. For data extraction, it outperforms GPT-4's accuracy by 16-32%. In a pilot study, human-AI collaboration with TrialMind improved recall by 71.4% and reduced screening time by 44.2%, while in data extraction, accuracy increased by 23.5% with a 63.4% time reduction. Medical experts preferred TrialMind's synthesized evidence over GPT-4's in 62.5%-100% of cases. These findings show the promise of accelerating clinical evidence synthesis through human-AI collaboration.
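The search-stage recall figures quoted above follow the standard definition: the fraction of studies that belong in the review which the search actually retrieves. A minimal sketch of that computation, with illustrative function and variable names not taken from the paper:

```python
def recall(retrieved, relevant):
    """Fraction of the relevant studies that the search actually retrieved."""
    relevant = set(relevant)
    return len(set(retrieved) & relevant) / len(relevant)

# Toy example: 10 studies belong in the review, the search surfaces 8 of them
# plus one irrelevant hit. Extra hits do not lower recall, which is why SR
# search pipelines optimize recall first and rely on screening for precision.
gold = {f"PMID{i}" for i in range(10)}
hits = {f"PMID{i}" for i in range(8)} | {"PMID99"}
print(round(recall(hits, gold), 3))  # 0.8
```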


Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/48c4/12331930/4c65923b9afe/41746_2025_1840_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/48c4/12331930/9da640aefa9d/41746_2025_1840_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/48c4/12331930/74f5b786f39e/41746_2025_1840_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/48c4/12331930/aacafd080e51/41746_2025_1840_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/48c4/12331930/e3ccd07bc9d1/41746_2025_1840_Fig5_HTML.jpg

Similar Articles

1. Accelerating clinical evidence synthesis with large language models.
   NPJ Digit Med. 2025 Aug 8;8(1):509. doi: 10.1038/s41746-025-01840-7.
2. Enhancing systematic literature reviews with generative artificial intelligence: development, applications, and performance evaluation.
   J Am Med Inform Assoc. 2025 Apr 1;32(4):616-625. doi: 10.1093/jamia/ocaf030.
3. Examining the Role of Large Language Models in Orthopedics: Systematic Review.
   J Med Internet Res. 2024 Nov 15;26:e59607. doi: 10.2196/59607.
4. Health Care Social Robots in the Age of Generative AI: Protocol for a Scoping Review.
   JMIR Res Protoc. 2025 Apr 14;14:e63017. doi: 10.2196/63017.
5. Artificial intelligence for diagnosing exudative age-related macular degeneration.
   Cochrane Database Syst Rev. 2024 Oct 17;10(10):CD015522. doi: 10.1002/14651858.CD015522.pub2.
6. Validation of automated paper screening for esophagectomy systematic review using large language models.
   PeerJ Comput Sci. 2025 Apr 30;11:e2822. doi: 10.7717/peerj-cs.2822. eCollection 2025.
7. Development of a GPT-4-Powered Virtual Simulated Patient and Communication Training Platform for Medical Students to Practice Discussing Abnormal Mammogram Results With Patients: Multiphase Study.
   JMIR Form Res. 2025 Apr 17;9:e65670. doi: 10.2196/65670.
8. Large Language Models and Empathy: Systematic Review.
   J Med Internet Res. 2024 Dec 11;26:e52597. doi: 10.2196/52597.
9. Unveiling GPT-4V's hidden challenges behind high accuracy on USMLE questions: Observational Study.
   J Med Internet Res. 2025 Feb 7;27:e65146. doi: 10.2196/65146.
10. Automated Paper Screening for Clinical Reviews Using Large Language Models: Data Analysis Study.
    J Med Internet Res. 2024 Jan 12;26:e48996. doi: 10.2196/48996.

Cited By

1. CLEAR: A vision to support clinical evidence lifecycle with continuous learning.
   J Biomed Inform. 2025 Jul 29;169:104884. doi: 10.1016/j.jbi.2025.104884.
2. A foundation model for human-AI collaboration in medical literature mining.
   ArXiv. 2025 Jan 27:arXiv:2501.16255v1.

References

1. Streamlining systematic reviews with large language models using prompt engineering and retrieval augmented generation.
   BMC Med Res Methodol. 2025 May 10;25(1):130. doi: 10.1186/s12874-025-02583-5.
2. High-performance automated abstract screening with large language model ensembles.
   J Am Med Inform Assoc. 2025 May 1;32(5):893-904. doi: 10.1093/jamia/ocaf050.
3. Large language models for conducting systematic reviews: on the rise, but not yet ready for use-a scoping review.
   J Clin Epidemiol. 2025 May;181:111746. doi: 10.1016/j.jclinepi.2025.111746. Epub 2025 Feb 26.
4. Summarizing, Simplifying, and Synthesizing Medical Evidence Using GPT-3 (with Varying Success).
   Proc Conf Assoc Comput Linguist Meet. 2023 Jul;2023:1387-1407. doi: 10.18653/v1/2023.acl-short.119.
5. : a large language model that generates search queries for systematic reviews.
   JAMIA Open. 2024 Sep 25;7(3):ooae098. doi: 10.1093/jamiaopen/ooae098. eCollection 2024 Oct.
6. Closing the gap between open source and commercial large language models for medical evidence summarization.
   NPJ Digit Med. 2024 Sep 9;7(1):239. doi: 10.1038/s41746-024-01239-w.
7. Performance of two large language models for data extraction in evidence synthesis.
   Res Synth Methods. 2024 Sep;15(5):818-824. doi: 10.1002/jrsm.1732. Epub 2024 Jun 19.
8. Potential Roles of Large Language Models in the Production of Systematic Reviews and Meta-Analyses.
   J Med Internet Res. 2024 Jun 25;26:e56780. doi: 10.2196/56780.
9. Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews: Comparative Analysis.
   J Med Internet Res. 2024 May 22;26:e53164. doi: 10.2196/53164.
10. Leveraging generative AI for clinical evidence synthesis needs to ensure trustworthiness.
    J Biomed Inform. 2024 May;153:104640. doi: 10.1016/j.jbi.2024.104640. Epub 2024 Apr 10.