Extracting International Classification of Diseases Codes from Clinical Documentation Using Large Language Models.

Authors

Simmons Ashley, Takkavatakarn Kullaya, McDougal Megan, Dilcher Brian, Pincavitch Jami, Meadows Lukas, Kauffman Justin, Klang Eyal, Wig Rebecca, Smith Gordon, Soroush Ali, Freeman Robert, Apakama Donald J, Charney Alexander W, Kohli-Seth Roopa, Nadkarni Girish N, Sakhuja Ankit

Affiliations

Department of Human Performance - Health Informatics and Information Management, West Virginia University, Morgantown, West Virginia, United States.

Division of Nephrology, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, United States.

Publication

Appl Clin Inform. 2025 Mar;16(2):337-344. doi: 10.1055/a-2491-3872. Epub 2024 Nov 28.

DOI: 10.1055/a-2491-3872
PMID: 39608761
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12020521/

Abstract

BACKGROUND

Large language models (LLMs) have shown promise in various professional fields, including medicine and law. However, their performance in highly specialized tasks, such as extracting ICD-10-CM codes from patient notes, remains underexplored.

OBJECTIVE

The primary objective was to evaluate and compare the performance of different LLMs in extracting ICD-10-CM codes with that of a human coder.

METHODS

We evaluated the performance of six LLMs (GPT-3.5, GPT-4, Claude 2.1, Claude 3, Gemini Advanced, and Llama 2-70b) in extracting ICD-10-CM codes against that of a human coder. We used deidentified inpatient notes of authentic patient cases from the American Health Information Management Association (AHIMA) Vlab. We calculated percent agreement and Cohen's kappa values to assess the agreement between each LLM and the human coder. We then identified the reasons for discrepancies in code extraction by the LLMs in a 10% random subset.
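
The two agreement measures named in the methods can be illustrated with a minimal Python sketch. The abstract does not say how the authors set up the calculation, so everything here is an assumption: each code in a small candidate list is treated as one binary decision (extracted or not) per rater, and the function names, the candidate list, and the two code sets are hypothetical illustrations, not study data. Cohen's kappa is computed in its standard form, (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance.

```python
from typing import Sequence


def percent_agreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Share of items on which both raters made the same decision."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = percent_agreement(a, b)
    labels = set(a) | set(b)
    # Chance agreement: product of each rater's marginal rate, summed over labels.
    p_e = sum((list(a).count(l) / n) * (list(b).count(l) / n) for l in labels)
    if p_e == 1.0:
        return 1.0  # degenerate case: both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical candidate codes and extractions for a single note (not study data).
candidates = ["I10", "E11.9", "N17.9", "R50.9", "J18.9", "Z79.4"]
human_codes = {"I10", "E11.9", "N17.9"}
llm_codes = {"I10", "R50.9", "J18.9", "N17.9"}

# One binary decision per candidate code: 1 = extracted, 0 = not extracted.
human_vec = [int(c in human_codes) for c in candidates]
llm_vec = [int(c in llm_codes) for c in candidates]

print(f"percent agreement: {percent_agreement(human_vec, llm_vec):.2f}")
print(f"Cohen's kappa:     {cohens_kappa(human_vec, llm_vec):.2f}")
```

In this toy case the raters agree on half of the decisions, yet kappa comes out at about zero because that is exactly the agreement expected by chance given how often each rater says "extracted". This is why the two measures are usually reported together.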

RESULTS

Among 50 inpatient notes, the human coder extracted 165 unique ICD-10-CM codes. The LLMs extracted a significantly higher number of unique ICD-10-CM codes than the human coder, with Llama 2-70b extracting the most (658) and Gemini Advanced the fewest (221). GPT-4 achieved the highest percent agreement with the human coder at 15.2%, followed by Claude 3 (12.7%) and GPT-3.5 (12.4%). Cohen's kappa values indicated minimal to no agreement, ranging from -0.02 to 0.01. When focusing on the primary diagnosis, Claude 3 achieved the highest percent agreement (26%) and kappa value (0.25). Reasons for discrepancies in code extraction varied among LLMs and included extraction of codes for diagnoses not confirmed by providers (60% with GPT-4), extraction of nonspecific codes (25% with GPT-3.5), extraction of codes for signs and symptoms despite the presence of a more specific diagnosis (22% with Claude 2.1), and hallucinations (35% with Claude 2.1).

CONCLUSION

Current LLMs perform poorly in extracting ICD-10-CM codes from inpatient notes when compared against a human coder.
