Extracting International Classification of Diseases Codes from Clinical Documentation Using Large Language Models.

Authors

Simmons Ashley, Takkavatakarn Kullaya, McDougal Megan, Dilcher Brian, Pincavitch Jami, Meadows Lukas, Kauffman Justin, Klang Eyal, Wig Rebecca, Smith Gordon, Soroush Ali, Freeman Robert, Apakama Donald J, Charney Alexander W, Kohli-Seth Roopa, Nadkarni Girish N, Sakhuja Ankit

Affiliations

Department of Human Performance - Health Informatics and Information Management, West Virginia University, Morgantown, West Virginia, United States.

Division of Nephrology, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, United States.

Publication information

Appl Clin Inform. 2025 Mar;16(2):337-344. doi: 10.1055/a-2491-3872. Epub 2024 Nov 28.

Abstract

BACKGROUND

Large language models (LLMs) have shown promise in various professional fields, including medicine and law. However, their performance in highly specialized tasks, such as extracting ICD-10-CM codes from patient notes, remains underexplored.

OBJECTIVE

The primary objective was to evaluate the performance of different LLMs in extracting ICD-10-CM codes and compare it with that of a human coder.

METHODS

We evaluated the performance of six LLMs (GPT-3.5, GPT-4, Claude 2.1, Claude 3, Gemini Advanced, and Llama 2-70b) in extracting ICD-10-CM codes against that of a human coder. We used deidentified inpatient notes of authentic patient cases from the American Health Information Management Association Vlab. We calculated percent agreement and Cohen's kappa values to assess the agreement between the LLMs and the human coder. We then identified reasons for discrepancies in code extraction by LLMs in a 10% random subset.
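The two agreement measures named above can be illustrated with a minimal Python sketch. It covers only the simplest case, one label per note from each rater (for example, a primary diagnosis code); the abstract does not state how agreement was computed over full code sets, so this is not the authors' procedure. The function name and the sample codes are hypothetical and chosen purely for illustration.

```python
from collections import Counter
from typing import Sequence, Tuple


def percent_agreement_and_kappa(
    rater_a: Sequence[str], rater_b: Sequence[str]
) -> Tuple[float, float]:
    """Return (observed agreement, Cohen's kappa) for two raters.

    Assumes exactly one label per item from each rater, aligned by position.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)

    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each label, multiply the two raters' marginal
    # frequencies, then sum over labels.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return p_o, float("nan")
    return p_o, (p_o - p_e) / (1.0 - p_e)


# Made-up example: four notes, one primary code per note from each rater.
human = ["I10", "E11.9", "N17.9", "J18.9"]
model = ["I10", "E11.9", "N17.9", "R05.9"]
agreement, kappa = percent_agreement_and_kappa(human, model)
print(f"percent agreement = {agreement:.1%}, kappa = {kappa:.3f}")
```

Because kappa subtracts the agreement expected by chance, it stays close to the raw percent agreement when many distinct codes are in use and chance matches are rare.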

RESULTS

Among 50 inpatient notes, the human coder extracted 165 unique ICD-10-CM codes. The LLMs extracted a significantly higher number of unique ICD-10-CM codes than the human coder, with Llama 2-70b extracting the most (658) and Gemini Advanced the fewest (221). GPT-4 achieved the highest percent agreement with the human coder at 15.2%, followed by Claude 3 (12.7%) and GPT-3.5 (12.4%). Cohen's kappa values indicated minimal to no agreement, ranging from -0.02 to 0.01. When focusing on the primary diagnosis, Claude 3 achieved the highest percent agreement (26%) and kappa value (0.25). Reasons for discrepancies in code extraction varied among LLMs and included extraction of codes for diagnoses not confirmed by providers (60% with GPT-4), extraction of nonspecific codes (25% with GPT-3.5), extraction of codes for signs and symptoms despite the presence of a more specific diagnosis (22% with Claude 2.1), and hallucinations (35% with Claude 2.1).

CONCLUSION

Current LLMs perform poorly in extracting ICD-10-CM codes from inpatient notes when compared with a human coder.
