
Evaluation Framework of Large Language Models in Medical Documentation: Development and Usability Study.

Affiliations

Department of Digital Health, Samsung Advanced Institute of Health Sciences and Technology (SAIHST), Sungkyunkwan University, Seoul, Republic of Korea.

Department of Nursing, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, Republic of Korea.

Publication Information

J Med Internet Res. 2024 Nov 20;26:e58329. doi: 10.2196/58329.

Abstract

BACKGROUND

The advancement of large language models (LLMs) offers significant opportunities for health care, particularly in the generation of medical documentation. However, challenges related to ensuring the accuracy and reliability of LLM outputs, coupled with the absence of established quality standards, have raised concerns about their clinical application.

OBJECTIVE

This study aimed to develop and validate an evaluation framework for assessing the accuracy and clinical applicability of LLM-generated emergency department (ED) records, with the goal of enhancing artificial intelligence integration in health care documentation.

METHODS

We organized the Healthcare Prompt-a-thon, a competitive event designed to explore the capabilities of LLMs in generating accurate medical records. The event involved 52 participants who generated 33 initial ED records using HyperCLOVA X, a Korean-specialized LLM. We applied a dual evaluation approach. First, clinical evaluation: 4 medical professionals evaluated the records using a 5-point Likert scale across 5 criteria: appropriateness, accuracy, structure/format, conciseness, and clinical validity. Second, quantitative evaluation: we developed a framework to categorize and count errors in the LLM outputs, identifying 7 key error types. Statistical methods, including Pearson correlation and intraclass correlation coefficients (ICC), were used to assess consistency and agreement among evaluators.
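For illustration, the reliability statistics named above can be computed as in the following Python sketch. The records, rater labels, and scores are hypothetical placeholders, not the study's data; the sketch assumes long-format ratings and uses the pingouin package for ICC and scipy for Pearson correlation.

```python
# Illustrative sketch of the reliability statistics described in METHODS.
# All data below are hypothetical placeholders, not the study's ratings.
import pandas as pd
import pingouin as pg
from scipy.stats import pearsonr

# Long-format ratings: 4 raters each score every record on a 1-5 Likert scale.
ratings = pd.DataFrame({
    "record": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4],
    "rater":  ["A", "B", "C", "D"] * 4,
    "score":  [4, 5, 4, 4, 3, 3, 2, 3, 5, 5, 4, 5, 2, 3, 2, 2],
})

# Intraclass correlation coefficients (interrater reliability).
icc = pg.intraclass_corr(data=ratings, targets="record",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC", "pval"]])

# Test-retest reliability: Pearson r between two rating sessions.
session1 = [4.0, 3.5, 4.5, 2.0, 5.0]
session2 = [4.5, 3.0, 4.0, 2.5, 5.0]
r, p = pearsonr(session1, session2)
print(f"Test-retest Pearson r = {r:.3f} (P = {p:.3g})")
```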

RESULTS

The clinical evaluation demonstrated strong interrater reliability, with ICC values ranging from 0.653 to 0.887 (P<.001), and a test-retest reliability Pearson correlation coefficient of 0.776 (P<.001). Quantitative analysis revealed that invalid generation errors were the most common, constituting 35.38% of total errors, while structural malformation errors had the most significant negative impact on the clinical evaluation score (Pearson r=-0.654; P<.001). A strong negative correlation was found between the number of quantitative errors and clinical evaluation scores (Pearson r=-0.633; P<.001), indicating that higher error rates corresponded to lower clinical acceptability.
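The error-score association reported above is a standard Pearson correlation between per-record totals. A minimal sketch with invented numbers follows; the actual per-record data are not given in the abstract.

```python
# Hypothetical per-record pairs: total quantitative errors vs. mean clinical score.
from scipy.stats import pearsonr

error_counts    = [2, 7, 0, 5, 9, 1, 4, 3]                   # errors per ED record
clinical_scores = [4.5, 2.8, 4.9, 3.2, 2.1, 4.6, 3.5, 4.0]   # mean 5-point score

# A negative r indicates that more errors correspond to lower
# clinical acceptability, mirroring the study's finding.
r, p = pearsonr(error_counts, clinical_scores)
print(f"Pearson r = {r:.3f} (P = {p:.3g})")
```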

CONCLUSIONS

Our research provides robust support for the reliability and clinical acceptability of the proposed evaluation framework. It underscores the framework's potential to mitigate clinical burdens and foster the responsible integration of artificial intelligence technologies in health care, suggesting a promising direction for future research and practical applications in the field.

