Bedi Suhana, Liu Yutong, Orr-Ewing Lucy, Dash Dev, Koyejo Sanmi, Callahan Alison, Fries Jason A, Wornow Michael, Swaminathan Akshay, Lehmann Lisa Soleymani, Hong Hyo Jung, Kashyap Mehr, Chaurasia Akash R, Shah Nirav R, Singh Karandeep, Tazbaz Troy, Milstein Arnold, Pfeffer Michael A, Shah Nigam H
Department of Biomedical Data Science, Stanford School of Medicine, Stanford, California.
Clinical Excellence Research Center, Stanford University, Stanford, California.
JAMA. 2025 Jan 28;333(4):319-328. doi: 10.1001/jama.2024.21700.
Importance: Large language models (LLMs) can assist in various health care activities, but current evaluation approaches may not adequately identify the most useful application areas.
Objective: To summarize existing evaluations of LLMs in health care in terms of 5 components: (1) evaluation data type, (2) health care task, (3) natural language processing (NLP) and natural language understanding (NLU) tasks, (4) dimension of evaluation, and (5) medical specialty.
Data Sources: A systematic search of PubMed and Web of Science was performed for studies published between January 1, 2022, and February 19, 2024.
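For readers who want to reproduce a date-bounded literature pull of this kind, a minimal sketch using the NCBI E-utilities esearch endpoint is shown below; the search term is a hypothetical placeholder, not the review's actual query string, and Web of Science has no comparable free endpoint, so that arm of the search would need its own export.

    # Illustrative sketch: a date-bounded PubMed search via NCBI E-utilities.
    # The search term below is a hypothetical placeholder, not the review's query.
    import requests

    ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    params = {
        "db": "pubmed",
        "term": '"large language model" AND (health OR clinical)',  # hypothetical
        "datetype": "pdat",        # filter on publication date
        "mindate": "2022/01/01",   # January 1, 2022
        "maxdate": "2024/02/19",   # February 19, 2024
        "retmax": 1000,
        "retmode": "json",
    }
    resp = requests.get(ESEARCH, params=params, timeout=30)
    resp.raise_for_status()
    pmids = resp.json()["esearchresult"]["idlist"]  # PubMed IDs of matching records
    print(f"{len(pmids)} records retrieved")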
Study Selection: Studies evaluating 1 or more LLMs in health care.
Data Extraction and Synthesis: Three independent reviewers categorized studies via keyword searches based on the data used, the health care tasks, the NLP and NLU tasks, the dimensions of evaluation, and the medical specialty.
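As a rough illustration of how keyword-based categorization along these axes might work, the sketch below tags an abstract on two of the five axes. The keyword lists and the categorize helper are hypothetical illustrations, not the reviewers' actual coding scheme.

    # Minimal sketch of keyword-based categorization along the review axes.
    # AXES and categorize() are hypothetical, not the reviewers' actual scheme.
    AXES = {
        "data_type": {
            "real patient": ["electronic health record", "clinical notes"],
            "examination": ["licensing exam", "USMLE"],
        },
        "nlp_task": {
            "question answering": ["question answering", "multiple choice"],
            "summarization": ["summarize", "summarization"],
        },
    }

    def categorize(abstract: str) -> dict:
        """Assign each axis the first category whose keywords appear."""
        text = abstract.lower()
        labels = {}
        for axis, categories in AXES.items():
            for category, keywords in categories.items():
                if any(kw.lower() in text for kw in keywords):
                    labels[axis] = category
                    break
        return labels

    print(categorize("We evaluate GPT-4 on USMLE multiple choice questions."))
    # -> {'data_type': 'examination', 'nlp_task': 'question answering'}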
Results: Of the 519 studies reviewed (published between January 1, 2022, and February 19, 2024), only 5% used real patient care data for LLM evaluation. The most common health care tasks were assessing medical knowledge, such as answering medical licensing examination questions (44.5%), and making diagnoses (19.5%). Administrative tasks such as assigning billing codes (0.2%) and writing prescriptions (0.2%) were less studied. For NLP and NLU tasks, most studies focused on question answering (84.2%), whereas tasks such as summarization (8.9%) and conversational dialogue (3.3%) were infrequent. Almost all studies (95.4%) used accuracy as the primary dimension of evaluation; fairness, bias, and toxicity (15.8%), deployment considerations (4.6%), and calibration and uncertainty (1.2%) were infrequently measured. Finally, in terms of medical specialty, most studies addressed generic health care applications (25.6%), internal medicine (16.4%), surgery (11.4%), and ophthalmology (6.9%), with nuclear medicine (0.6%), physical medicine (0.4%), and medical genetics (0.2%) the least represented.
Conclusions and Relevance: Existing evaluations of LLMs mostly focus on accuracy of question answering for medical examinations, without consideration of real patient care data. Dimensions such as fairness, bias, and toxicity, as well as deployment considerations, received limited attention. Future evaluations should adopt standardized applications and metrics, use clinical data, and broaden their focus to include a wider range of tasks and specialties.