
Analyzing evaluation methods for large language models in the medical field: a scoping review.

Author information

Institute for Innovation in Digital Healthcare, Yonsei University, Seoul, Republic of Korea.

Department of Human Systems Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea.

Publication information

BMC Med Inform Decis Mak. 2024 Nov 29;24(1):366. doi: 10.1186/s12911-024-02709-7.

Abstract

BACKGROUND

Owing to the rapid growth in the popularity of Large Language Models (LLMs), various performance evaluation studies have been conducted to confirm their applicability in the medical field. However, there is still no clear framework for evaluating LLMs.

OBJECTIVE

This study reviews studies on LLM evaluations in the medical field and analyzes the research methods used in these studies. It aims to provide a reference for future researchers designing LLM studies.

METHODS & MATERIALS

We conducted a scoping review of three databases (PubMed, Embase, and MEDLINE) to identify LLM-related articles published between January 1, 2023, and September 30, 2023. We analyzed the types of methods, number of questions (queries), evaluators, repeat measurements, additional analysis methods, use of prompt engineering, and metrics other than accuracy.

RESULTS

A total of 142 articles met the inclusion criteria. LLM evaluation was primarily categorized as either providing test examinations (n = 53, 37.3%) or being evaluated by a medical professional (n = 80, 56.3%), with some hybrid cases (n = 5, 3.5%) or a combination of the two (n = 4, 2.8%). Most studies had 100 or fewer questions (n = 18, 29.0%), 15 (24.2%) performed repeated measurements, 18 (29.0%) performed additional analyses, and 8 (12.9%) used prompt engineering. For medical assessment, most studies used 50 or fewer queries (n = 54, 64.3%), had two evaluators (n = 43, 48.3%), and 14 (14.7%) used prompt engineering.

CONCLUSIONS

More research is required regarding the application of LLMs in healthcare. Although previous studies have evaluated performance, future studies will likely focus on improving performance. A well-structured methodology is required for these studies to be conducted systematically.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7e1b/11606129/bd4ba13c86d0/12911_2024_2709_Fig1_HTML.jpg
