Diabetes Technology Society, Burlingame, CA, USA.
School of Medicine, Johns Hopkins University, Baltimore, MD, USA.
BMC Med Inform Decis Mak. 2024 Nov 26;24(1):357. doi: 10.1186/s12911-024-02757-z.
Large language models (LLMs), most notably ChatGPT, released since November 30, 2022, have drawn growing attention to their use in medicine, particularly for supporting clinical decision-making. However, there is little consensus in the medical community on how LLM performance in clinical contexts should be evaluated.
We performed a literature review of PubMed to identify publications between December 1, 2022, and April 1, 2024, that discussed assessments of LLM-generated diagnoses or treatment plans.
We selected 108 relevant articles from PubMed for analysis. The most frequently used LLMs were GPT-3.5, GPT-4, Bard, LLaMa/Alpaca-based models, and Bing Chat. The five most frequently used criteria for scoring LLM outputs were "accuracy", "completeness", "appropriateness", "insight", and "consistency".
The most frequently used criteria for defining high-quality LLM outputs have been selected consistently by researchers over the past 1.5 years. We identified a high degree of variation in how studies reported their findings and assessed LLM performance. Standardized reporting of qualitative evaluation metrics that assess the quality of LLM outputs could be developed to facilitate research on LLMs in healthcare.