Abstract

BACKGROUND

Large language models (LLMs) have shown promise in various professional fields, including medicine and law. However, their performance in highly specialized tasks, such as extracting ICD-10-CM codes from patient notes, remains underexplored.

OBJECTIVE

The primary objective was to evaluate the performance of different LLMs in extracting ICD-10-CM codes and compare it with that of a human coder.

METHODS

We evaluated the performance of six LLMs (GPT-3.5, GPT-4, Claude 2.1, Claude 3, Gemini Advanced, and Llama 2-70b) in extracting ICD-10-CM codes against that of a human coder. We used deidentified inpatient notes from authentic patient cases in the American Health Information Management Association Vlab. We calculated percent agreement and Cohen's kappa values to assess the agreement between each LLM and the human coder. We then identified the reasons for discrepancies in code extraction by the LLMs in a 10% random subset.
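The two agreement statistics named above can be illustrated with a minimal sketch. The abstract does not state how the items being compared were defined, so this example assumes the simplest case: one primary-diagnosis code per note from each rater. The function names and the four-note sample data are hypothetical and are not taken from the study.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Percentage of items on which both raters assigned the same label."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * matches / len(rater_a)


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b) / 100.0
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: probability that both raters pick the same label
    # if each labels independently according to their own label frequencies.
    labels = counts_a.keys() | counts_b.keys()
    p_e = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    if p_e == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical primary-diagnosis codes for four notes (illustrative only).
human = ["I10", "E11.9", "J18.9", "I10"]
model = ["I10", "E11.9", "I10", "N39.0"]

agreement = percent_agreement(human, model)
kappa = cohen_kappa(human, model)
print(f"percent agreement = {agreement:.1f}%, kappa = {kappa:.3f}")
```

In this toy example the raters match on two of four notes, yet kappa is noticeably lower than the raw agreement because some matches are expected by chance. A kappa near zero, like the values reported in the results, means the observed agreement is about what chance alone would produce.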

RESULTS

Among 50 inpatient notes, the human coder extracted 165 unique ICD-10-CM codes. The LLMs extracted a significantly higher number of unique ICD-10-CM codes than the human coder, with Llama 2-70b extracting the most (658) and Gemini Advanced the fewest (221). GPT-4 achieved the highest percent agreement with the human coder at 15.2%, followed by Claude 3 (12.7%) and GPT-3.5 (12.4%). Cohen's kappa values indicated minimal to no agreement, ranging from -0.02 to 0.01. When focusing on the primary diagnosis, Claude 3 achieved the highest percent agreement (26%) and kappa value (0.25). Reasons for discrepancies in code extraction varied among LLMs and included extraction of codes for diagnoses not confirmed by providers (60% with GPT-4), extraction of nonspecific codes (25% with GPT-3.5), extraction of codes for signs and symptoms despite the presence of a more specific diagnosis (22% with Claude 2.1), and hallucinations (35% with Claude 2.1).

CONCLUSION

Current LLMs perform poorly in extracting ICD-10-CM codes from inpatient notes when compared with a human coder.


Extracting International Classification of Diseases Codes from Clinical Documentation Using Large Language Models.
