Qualitative metrics from the biomedical literature for evaluating large language models in clinical decision-making: a narrative review.

Affiliations

Diabetes Technology Society, Burlingame, CA, USA.

School of Medicine, Johns Hopkins University, Baltimore, MD, USA.

Publication information

BMC Med Inform Decis Mak. 2024 Nov 26;24(1):357. doi: 10.1186/s12911-024-02757-z.

DOI: 10.1186/s12911-024-02757-z

PMID: 39593074

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11590327/

Abstract

BACKGROUND

Large language models (LLMs) released since November 30, 2022, most notably ChatGPT, have drawn growing attention to their use in medicine, particularly for supporting clinical decision-making. However, there is little consensus in the medical community on how LLM performance in clinical contexts should be evaluated.

METHODS

We performed a literature review of PubMed to identify articles published between December 1, 2022, and April 1, 2024, that discussed assessments of LLM-generated diagnoses or treatment plans.

RESULTS

We selected 108 relevant articles from PubMed for analysis. The most frequently used LLMs were GPT-3.5, GPT-4, Bard, LLaMa/Alpaca-based models, and Bing Chat. The five most frequently used criteria for scoring LLM outputs were "accuracy", "completeness", "appropriateness", "insight", and "consistency".

CONCLUSIONS

The most frequently used criteria for defining high-quality LLMs have been consistently selected by researchers over the past 1.5 years. We identified a high degree of variation in how studies reported their findings and assessed LLM performance. Standardized reporting of qualitative evaluation metrics that assess the quality of LLM outputs can be developed to facilitate research studies on LLMs in healthcare.

Abstract (Chinese translation)