Qualitative metrics from the biomedical literature for evaluating large language models in clinical decision-making: a narrative review.

Affiliations

Diabetes Technology Society, Burlingame, CA, USA.

School of Medicine, Johns Hopkins University, Baltimore, MD, USA.

Publication information

BMC Med Inform Decis Mak. 2024 Nov 26;24(1):357. doi: 10.1186/s12911-024-02757-z.

Abstract

BACKGROUND

Large language models (LLMs) released since November 30, 2022, most notably ChatGPT, have drawn attention to their use in medicine, particularly for supporting clinical decision-making. However, there is little consensus in the medical community on how LLM performance in clinical contexts should be evaluated.

METHODS

We performed a literature review of PubMed to identify publications between December 1, 2022, and April 1, 2024, that discussed assessments of LLM-generated diagnoses or treatment plans.
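
A date-bounded PubMed search like the one described can be expressed through the NCBI E-utilities `esearch` endpoint. The following is a minimal sketch that only builds the request URL: the function name `build_pubmed_search_url` and the query string are hypothetical illustrations, since the abstract does not give the review's actual search terms.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities endpoint for searching Entrez databases.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(term, start_date, end_date, retmax=200):
    """Build an esearch URL restricted to a publication-date window.

    Dates use the YYYY/MM/DD format expected by E-utilities.
    """
    params = {
        "db": "pubmed",         # search the PubMed database
        "term": term,           # free-text or fielded query
        "datetype": "pdat",     # filter on publication date
        "mindate": start_date,  # inclusive lower bound
        "maxdate": end_date,    # inclusive upper bound
        "retmax": retmax,       # maximum number of PMIDs returned
        "retmode": "json",      # ask for a JSON response
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Hypothetical query: NOT the search string used in the review.
url = build_pubmed_search_url(
    '"large language model" AND (diagnosis OR treatment)',
    "2022/12/01",
    "2024/04/01",
)
print(url)
```

The date window matches the one stated in the methods; everything else about the query is an assumption and would need to be replaced with the review's real search strategy to reproduce its results.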

RESULTS

We selected 108 relevant articles from PubMed for analysis. The most frequently used LLMs were GPT-3.5, GPT-4, Bard, LLaMa/Alpaca-based models, and Bing Chat. The five most frequently used criteria for scoring LLM outputs were "accuracy", "completeness", "appropriateness", "insight", and "consistency".

CONCLUSIONS

The most frequently used criteria for defining high-quality LLMs have been consistently selected by researchers over the past 1.5 years. We identified a high degree of variation in how studies reported their findings and assessed LLM performance. Standardized reporting of qualitative evaluation metrics that assess the quality of LLM outputs can be developed to facilitate research studies on LLMs in healthcare.

