Assessing the research landscape and clinical utility of large language models: a scoping review.

Author affiliations

Temerty Faculty of Medicine, University of Toronto, 1 King's College Cir, M5S 1A8, Toronto, ON, Canada.

Cumming School of Medicine, University of Calgary, 3330 Hospital Dr NW, T2N 4N1, Calgary, AB, Canada.

Publication information

BMC Med Inform Decis Mak. 2024 Mar 12;24(1):72. doi: 10.1186/s12911-024-02459-6.

DOI:10.1186/s12911-024-02459-6
PMID:38475802
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10936025/
Abstract

IMPORTANCE

Large language models (LLMs) like OpenAI's ChatGPT are powerful generative systems that rapidly synthesize natural language responses. Research on LLMs has revealed their potential and pitfalls, especially in clinical settings. However, the evolving landscape of LLM research in medicine has left several gaps regarding their evaluation, application, and evidence base.

OBJECTIVE

This scoping review aims to (1) summarize current research evidence on the accuracy and efficacy of LLMs in medical applications, (2) discuss the ethical, legal, logistical, and socioeconomic implications of LLM use in clinical settings, (3) explore barriers and facilitators to LLM implementation in healthcare, (4) propose a standardized evaluation framework for assessing LLMs' clinical utility, and (5) identify evidence gaps and propose future research directions for LLMs in clinical applications.

EVIDENCE REVIEW

We screened 4,036 records from MEDLINE, EMBASE, CINAHL, medRxiv, bioRxiv, and arXiv from January 2023 (inception of the search) to June 26, 2023 for English-language papers and analyzed findings from 55 worldwide studies. Quality of evidence was reported based on the Oxford Centre for Evidence-based Medicine recommendations.

FINDINGS

Our results demonstrate that LLMs show promise in compiling patient notes, assisting patients in navigating the healthcare system, and to some extent, supporting clinical decision-making when combined with human oversight. However, their utilization is limited by biases in training data that may harm patients, the generation of inaccurate but convincing information, and ethical, legal, socioeconomic, and privacy concerns. We also identified a lack of standardized methods for evaluating LLMs' effectiveness and feasibility.

CONCLUSIONS AND RELEVANCE

This review thus highlights potential future directions and questions to address these limitations and to further explore LLMs' potential in enhancing healthcare delivery.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4eb6/10936025/1aa037e2436e/12911_2024_2459_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4eb6/10936025/78d197a94d4c/12911_2024_2459_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4eb6/10936025/ff64d13e8e16/12911_2024_2459_Fig2_HTML.jpg

