Shool Sina, Adimi Sara, Saboori Amleshi Reza, Bitaraf Ehsan, Golpira Reza, Tara Mahmood
Center for Technology and Innovation in Cardiovascular Informatics, Rajaie Cardiovascular Medical and Research Center, Iran University of Medical Sciences, Tehran, Iran.
Rajaie Cardiovascular Medical and Research Center, Iran University of Medical Sciences, Tehran, 1995614331, Iran.
BMC Med Inform Decis Mak. 2025 Mar 7;25(1):117. doi: 10.1186/s12911-025-02954-4.
BACKGROUND: Large Language Models (LLMs), advanced AI tools based on transformer architectures, demonstrate significant potential in clinical medicine by enhancing decision support, diagnostics, and medical education. However, their integration into clinical workflows requires rigorous evaluation to ensure reliability, safety, and ethical alignment.
OBJECTIVE: This systematic review examines the evaluation parameters and methodologies applied to LLMs in clinical medicine, highlighting their capabilities, limitations, and application trends.
METHODS: A comprehensive review of the literature was conducted across PubMed, Scopus, Web of Science, IEEE Xplore, and arXiv databases, encompassing both peer-reviewed and preprint studies. Studies were screened against predefined inclusion and exclusion criteria to identify original research evaluating LLM performance in medical contexts.
RESULTS: The results reveal a growing interest in leveraging LLM tools in clinical settings, with 761 studies meeting the inclusion criteria. While general-domain LLMs, particularly ChatGPT and GPT-4, dominated evaluations (93.55%), medical-domain LLMs accounted for only 6.45%. Accuracy emerged as the most commonly assessed parameter (21.78%). Despite these advancements, the evidence base highlights certain limitations and biases across the included studies, emphasizing the need for careful interpretation and robust evaluation frameworks.
CONCLUSIONS: The exponential growth in LLM research underscores their transformative potential in healthcare. However, addressing challenges such as ethical risks, evaluation variability, and underrepresentation of critical specialties will be essential. Future efforts should prioritize standardized frameworks to ensure safe, effective, and equitable LLM integration in clinical practice.