Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review.

Author Information

Bedi Suhana, Liu Yutong, Orr-Ewing Lucy, Dash Dev, Koyejo Sanmi, Callahan Alison, Fries Jason A, Wornow Michael, Swaminathan Akshay, Lehmann Lisa Soleymani, Hong Hyo Jung, Kashyap Mehr, Chaurasia Akash R, Shah Nirav R, Singh Karandeep, Tazbaz Troy, Milstein Arnold, Pfeffer Michael A, Shah Nigam H

Affiliations

Department of Biomedical Data Science, Stanford School of Medicine, Stanford, California.

Clinical Excellence Research Center, Stanford University, Stanford, California.

Publication Information

JAMA. 2025 Jan 28;333(4):319-328. doi: 10.1001/jama.2024.21700.

Abstract

IMPORTANCE

Large language models (LLMs) can assist in various health care activities, but current evaluation approaches may not adequately identify the most useful application areas.

OBJECTIVE

To summarize existing evaluations of LLMs in health care in terms of 5 components: (1) evaluation data type, (2) health care task, (3) natural language processing (NLP) and natural language understanding (NLU) tasks, (4) dimension of evaluation, and (5) medical specialty.

DATA SOURCES

A systematic search of PubMed and Web of Science was performed for studies published between January 1, 2022, and February 19, 2024.

STUDY SELECTION

Studies evaluating 1 or more LLMs in health care.

DATA EXTRACTION AND SYNTHESIS

Three independent reviewers categorized studies via keyword searches based on the data used, the health care tasks, the NLP and NLU tasks, the dimensions of evaluation, and the medical specialty.
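
To make the categorization step concrete, below is a minimal sketch of how a keyword-based screen along the five components could look. The component names, keyword lists, and the categorize function are illustrative assumptions added for exposition; they are not drawn from the review's actual extraction protocol.

# Illustrative sketch only: a keyword-based tagger that sorts a study abstract
# into the five components used by the review (evaluation data type, health
# care task, NLP/NLU task, evaluation dimension, medical specialty).
# All keyword lists and category labels below are hypothetical examples.

KEYWORDS = {
    "evaluation_data_type": {
        "real patient care data": ["electronic health record", "ehr", "clinical notes"],
        "examination questions": ["usmle", "licensing exam", "board exam"],
    },
    "health_care_task": {
        "making diagnoses": ["diagnos"],
        "assigning billing codes": ["billing code", "icd-10"],
    },
    "nlp_nlu_task": {
        "question answering": ["question answering", "multiple-choice"],
        "summarization": ["summariz"],
    },
    "evaluation_dimension": {
        "accuracy": ["accuracy", "correct answers"],
        "fairness, bias, and toxicity": ["bias", "fairness", "toxicity"],
    },
    "medical_specialty": {
        "ophthalmology": ["ophthalmolog"],
        "internal medicine": ["internal medicine"],
    },
}

def categorize(abstract: str) -> dict:
    """Return, for each component, the category labels whose keywords appear."""
    text = abstract.lower()
    return {
        component: [label for label, words in categories.items()
                    if any(word in text for word in words)]
        for component, categories in KEYWORDS.items()
    }

if __name__ == "__main__":
    example = ("We evaluate GPT-4 accuracy on USMLE-style multiple-choice "
               "questions and assess bias across demographic subgroups.")
    print(categorize(example))

A screen like this would at most serve as a first pass; per the abstract, the categorization itself was performed by 3 independent reviewers.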

RESULTS

Of 519 studies reviewed, published between January 1, 2022, and February 19, 2024, only 5% used real patient care data for LLM evaluation. The most common health care tasks were assessing medical knowledge such as answering medical licensing examination questions (44.5%) and making diagnoses (19.5%). Administrative tasks such as assigning billing codes (0.2%) and writing prescriptions (0.2%) were less studied. For NLP and NLU tasks, most studies focused on question answering (84.2%), while tasks such as summarization (8.9%) and conversational dialogue (3.3%) were infrequent. Almost all studies (95.4%) used accuracy as the primary dimension of evaluation; fairness, bias, and toxicity (15.8%), deployment considerations (4.6%), and calibration and uncertainty (1.2%) were infrequently measured. Finally, in terms of medical specialty area, most studies were in generic health care applications (25.6%), internal medicine (16.4%), surgery (11.4%), and ophthalmology (6.9%), with nuclear medicine (0.6%), physical medicine (0.4%), and medical genetics (0.2%) being the least represented.

CONCLUSIONS AND RELEVANCE

Existing evaluations of LLMs mostly focus on accuracy of question answering for medical examinations, without consideration of real patient care data. Dimensions such as fairness, bias, and toxicity and deployment considerations received limited attention. Future evaluations should adopt standardized applications and metrics, use clinical data, and broaden focus to include a wider range of tasks and specialties.
