A systematic review of large language model (LLM) evaluations in clinical medicine.

Authors

Shool Sina, Adimi Sara, Saboori Amleshi Reza, Bitaraf Ehsan, Golpira Reza, Tara Mahmood

Affiliations

Center for Technology and Innovation in Cardiovascular Informatics, Rajaie Cardiovascular Medical and Research Center, Iran University of Medical Sciences, Tehran, Iran.

Rajaie Cardiovascular Medical and Research Center, Iran University of Medical Sciences, Tehran, 1995614331, Iran.

Publication details

BMC Med Inform Decis Mak. 2025 Mar 7;25(1):117. doi: 10.1186/s12911-025-02954-4.


DOI:10.1186/s12911-025-02954-4
PMID:40055694
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11889796/
Abstract

BACKGROUND: Large Language Models (LLMs), advanced AI tools based on transformer architectures, demonstrate significant potential in clinical medicine by enhancing decision support, diagnostics, and medical education. However, their integration into clinical workflows requires rigorous evaluation to ensure reliability, safety, and ethical alignment.

OBJECTIVE: This systematic review examines the evaluation parameters and methodologies applied to LLMs in clinical medicine, highlighting their capabilities, limitations, and application trends.

METHODS: A comprehensive review of the literature was conducted across PubMed, Scopus, Web of Science, IEEE Xplore, and arXiv databases, encompassing both peer-reviewed and preprint studies. Studies were screened against predefined inclusion and exclusion criteria to identify original research evaluating LLM performance in medical contexts.

RESULTS: The results reveal a growing interest in leveraging LLM tools in clinical settings, with 761 studies meeting the inclusion criteria. While general-domain LLMs, particularly ChatGPT and GPT-4, dominated evaluations (93.55%), medical-domain LLMs accounted for only 6.45%. Accuracy emerged as the most commonly assessed parameter (21.78%). Despite these advancements, the evidence base highlights certain limitations and biases across the included studies, emphasizing the need for careful interpretation and robust evaluation frameworks.

CONCLUSIONS: The exponential growth in LLM research underscores their transformative potential in healthcare. However, addressing challenges such as ethical risks, evaluation variability, and underrepresentation of critical specialties will be essential. Future efforts should prioritize standardized frameworks to ensure safe, effective, and equitable LLM integration in clinical practice.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ea81/11889796/db49ca5ffca2/12911_2025_2954_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ea81/11889796/f3319879b60e/12911_2025_2954_Fig1_HTML.jpg
