Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review.

Author Information

Bedi Suhana, Liu Yutong, Orr-Ewing Lucy, Dash Dev, Koyejo Sanmi, Callahan Alison, Fries Jason A, Wornow Michael, Swaminathan Akshay, Lehmann Lisa Soleymani, Hong Hyo Jung, Kashyap Mehr, Chaurasia Akash R, Shah Nirav R, Singh Karandeep, Tazbaz Troy, Milstein Arnold, Pfeffer Michael A, Shah Nigam H

Affiliations

Department of Biomedical Data Science, Stanford School of Medicine, Stanford, California.

Clinical Excellence Research Center, Stanford University, Stanford, California.

Publication Information

JAMA. 2025 Jan 28;333(4):319-328. doi: 10.1001/jama.2024.21700.

Abstract

IMPORTANCE

Large language models (LLMs) can assist in various health care activities, but current evaluation approaches may not adequately identify the most useful application areas.

OBJECTIVE

To summarize existing evaluations of LLMs in health care in terms of 5 components: (1) evaluation data type, (2) health care task, (3) natural language processing (NLP) and natural language understanding (NLU) tasks, (4) dimension of evaluation, and (5) medical specialty.

DATA SOURCES

A systematic search of PubMed and Web of Science was performed for studies published between January 1, 2022, and February 19, 2024.

STUDY SELECTION

Studies evaluating 1 or more LLMs in health care.

DATA EXTRACTION AND SYNTHESIS

Three independent reviewers categorized studies via keyword searches based on the data used, the health care tasks, the NLP and NLU tasks, the dimensions of evaluation, and the medical specialty.
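
To make the categorization step concrete, below is a minimal sketch of how a keyword-based screen along the five components could look. The component names, keyword lists, and the categorize function are illustrative assumptions added for exposition; they are not drawn from the review's actual extraction protocol.

# Illustrative sketch only: a keyword-based tagger that sorts a study abstract
# into the five components used by the review (evaluation data type, health
# care task, NLP/NLU task, evaluation dimension, medical specialty).
# All keyword lists and category labels below are hypothetical examples.

KEYWORDS = {
    "evaluation_data_type": {
        "real patient care data": ["electronic health record", "ehr", "clinical notes"],
        "examination questions": ["usmle", "licensing exam", "board exam"],
    },
    "health_care_task": {
        "making diagnoses": ["diagnos"],
        "assigning billing codes": ["billing code", "icd-10"],
    },
    "nlp_nlu_task": {
        "question answering": ["question answering", "multiple-choice"],
        "summarization": ["summariz"],
    },
    "evaluation_dimension": {
        "accuracy": ["accuracy", "correct answers"],
        "fairness, bias, and toxicity": ["bias", "fairness", "toxicity"],
    },
    "medical_specialty": {
        "ophthalmology": ["ophthalmolog"],
        "internal medicine": ["internal medicine"],
    },
}

def categorize(abstract: str) -> dict:
    """Return, for each component, the category labels whose keywords appear."""
    text = abstract.lower()
    return {
        component: [label for label, words in categories.items()
                    if any(word in text for word in words)]
        for component, categories in KEYWORDS.items()
    }

if __name__ == "__main__":
    example = ("We evaluate GPT-4 accuracy on USMLE-style multiple-choice "
               "questions and assess bias across demographic subgroups.")
    print(categorize(example))

A screen like this would at most serve as a first pass; per the abstract, the categorization itself was performed by 3 independent reviewers.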

RESULTS

Of 519 studies reviewed, published between January 1, 2022, and February 19, 2024, only 5% used real patient care data for LLM evaluation. The most common health care tasks were assessing medical knowledge such as answering medical licensing examination questions (44.5%) and making diagnoses (19.5%). Administrative tasks such as assigning billing codes (0.2%) and writing prescriptions (0.2%) were less studied. For NLP and NLU tasks, most studies focused on question answering (84.2%), while tasks such as summarization (8.9%) and conversational dialogue (3.3%) were infrequent. Almost all studies (95.4%) used accuracy as the primary dimension of evaluation; fairness, bias, and toxicity (15.8%), deployment considerations (4.6%), and calibration and uncertainty (1.2%) were infrequently measured. Finally, in terms of medical specialty area, most studies were in generic health care applications (25.6%), internal medicine (16.4%), surgery (11.4%), and ophthalmology (6.9%), with nuclear medicine (0.6%), physical medicine (0.4%), and medical genetics (0.2%) being the least represented.

CONCLUSIONS AND RELEVANCE

Existing evaluations of LLMs mostly focus on accuracy of question answering for medical examinations, without consideration of real patient care data. Dimensions such as fairness, bias, and toxicity and deployment considerations received limited attention. Future evaluations should adopt standardized applications and metrics, use clinical data, and broaden focus to include a wider range of tasks and specialties.
