Su Hankun, Sun Yuanyuan, Li Ruiting, Zhang Aozhe, Yang Yuemeng, Xiao Fen, Duan Zhiying, Chen Jingjing, Hu Qin, Yang Tianli, Xu Bin, Zhang Qiong, Zhao Jing, Li Yanping, Li Hui
Department of Reproductive Medicine, Xiangya Hospital Central South University, Changsha, China.
Clinical Research Center for Women's Reproductive Health in Hunan Province, Changsha, China.
J Med Internet Res. 2025 Jun 9;27:e72062. doi: 10.2196/72062.
BACKGROUND: The integration of large language models (LLMs) into medical diagnostics has garnered substantial attention due to their potential to enhance diagnostic accuracy, streamline clinical workflows, and address health care disparities. However, the rapid evolution of LLM research necessitates a comprehensive synthesis of their applications, challenges, and future directions.
OBJECTIVE: This scoping review aimed to provide an overview of the current state of research regarding the use of LLMs in medical diagnostics. The study sought to answer four primary subquestions: (1) Which LLMs are commonly used? (2) How are LLMs assessed in diagnosis? (3) What is the current performance of LLMs in diagnosing diseases? (4) Which medical domains are investigating the application of LLMs?
METHODS: This scoping review was conducted according to the Joanna Briggs Institute Manual for Evidence Synthesis and adheres to the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews). The Web of Science, PubMed, Embase, IEEE Xplore, and ACM Digital Library databases were searched for relevant literature from 2022 to 2025. Articles were screened and selected based on predefined inclusion and exclusion criteria. Bibliometric analysis was performed using VOSviewer to identify major research clusters and trends. Data extraction included details on LLM types, application domains, and performance metrics.
RESULTS: The field is rapidly expanding, with a surge in publications after 2023. GPT-4 and its variants dominated research (70/95, 74% of studies), followed by GPT-3.5 (34/95, 36%). Key applications included disease classification (text or image-based), medical question answering, and diagnostic content generation. LLMs demonstrated high accuracy in specialties like radiology, psychiatry, and neurology but exhibited biases in race, gender, and cost predictions. Ethical concerns, including privacy risks and model hallucination, alongside regulatory fragmentation, were critical barriers to clinical adoption.
CONCLUSIONS: LLMs hold transformative potential for medical diagnostics but require rigorous validation, bias mitigation, and multimodal integration to address real-world complexities. Future research should prioritize explainable artificial intelligence frameworks, specialty-specific optimization, and international regulatory harmonization to ensure equitable and safe clinical deployment.