


Autonomous International Classification of Diseases Coding Using Pretrained Language Models and Advanced Prompt Learning Techniques: Evaluation of an Automated Analysis System Using Medical Text.

Authors

Zhuang Yan, Zhang Junyan, Li Xiuxing, Liu Chao, Yu Yue, Dong Wei, He Kunlun

Affiliations

Medical Big Data Research Center, Chinese PLA General Hospital, Beijing, China.

School of Computer Science & Technology, Beijing Institute of Technology, Beijing, China.

Publication

JMIR Med Inform. 2025 Jan 6;13:e63020. doi: 10.2196/63020.

DOI: 10.2196/63020
PMID: 39761555
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11747532/
Abstract

BACKGROUND

Machine learning models can reduce the burden on doctors by converting medical records into International Classification of Diseases (ICD) codes in real time, thereby enhancing the efficiency of diagnosis and treatment. However, the task faces challenges such as small datasets, diverse writing styles, unstructured records, and the need for semimanual preprocessing. Existing approaches, such as naive Bayes, Word2Vec, and convolutional neural networks, have limitations in handling missing values and understanding the context of medical texts, leading to high error rates. We developed a fully automated pipeline based on the Key-BERT (bidirectional encoder representations from transformers) approach and large-scale medical records for continued pretraining, which effectively converts long free text into standard ICD codes. By adjusting parameter settings, such as mixed templates and soft verbalizers, the model can adapt flexibly to different requirements, enabling task-specific prompt learning.

OBJECTIVE

This study aims to propose a prompt learning real-time framework based on pretrained language models that can automatically label long free-text data with ICD-10 codes for cardiovascular diseases without the need for semiautomatic preprocessing.

METHODS

We integrated 4 components into our framework: a medically pretrained BERT, a keyword filtration BERT applied in functional order, a fine-tuning phase, and task-specific prompt learning utilizing mixed templates and soft verbalizers. The framework was validated on a multicenter medical dataset for automated ICD coding of 13 common cardiovascular diseases (584,969 records). Its performance was compared against RoBERTa (robustly optimized BERT pretraining approach), XLNet (extreme language network), and various BERT-based fine-tuning pipelines. We also evaluated the framework's performance under different prompt learning and fine-tuning settings. Finally, few-shot learning experiments were conducted to assess the feasibility and efficacy of our framework in scenarios involving small- to mid-sized datasets.
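To make the prompt-learning step concrete, the sketch below shows the general idea of a mixed template (hand-written text plus soft slots around a mask position) and a soft verbalizer (scoring the mask-position hidden state against one learned embedding per class). This is a minimal illustration under stated assumptions, not the authors' implementation: the template wording, the toy class list, and the random vectors standing in for encoder states and label embeddings are all hypothetical.

```python
import numpy as np

# Hypothetical ICD-10 classes for illustration only (not the paper's label set).
CLASSES = ["I10 hypertension", "I21 myocardial infarction", "I48 atrial fibrillation"]

def mixed_template(record: str) -> str:
    # A mixed template combines fixed text with trainable soft tokens and a
    # mask slot; soft tokens are shown here as literal [SOFT] placeholders.
    return f"[SOFT] Clinical note: {record} [SOFT] Diagnosis: [MASK]"

def soft_verbalizer(mask_hidden: np.ndarray, label_emb: np.ndarray) -> np.ndarray:
    # A soft verbalizer maps the hidden state at the [MASK] position to class
    # probabilities via learned per-class embeddings, instead of fixed
    # label words: dot-product logits followed by a softmax.
    logits = label_emb @ mask_hidden
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

rng = np.random.default_rng(0)
hidden = rng.normal(size=8)           # stand-in for the encoder's [MASK] state
label_emb = rng.normal(size=(3, 8))   # one trainable vector per class

prompt = mixed_template("Chest pain with ST elevation on ECG.")
probs = soft_verbalizer(hidden, label_emb)
print(prompt)
print({c: round(float(p), 3) for c, p in zip(CLASSES, probs)})
```

In training, both the soft-token embeddings and the verbalizer's label embeddings would be optimized jointly with (or instead of) the encoder weights; the frozen random vectors here only demonstrate the data flow.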

RESULTS

Compared with traditional pretraining and fine-tuning pipelines, our approach achieved a higher micro-F1-score of 0.838 and a macro-area under the receiver operating characteristic curve (macro-AUC) of 0.958, which is 10% higher than other methods. Among different prompt learning setups, the combination of mixed templates and soft verbalizers yielded the best performance. Few-shot experiments showed that performance stabilized and the AUC peaked at 500 shots.
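The two headline metrics (micro-F1 and macro-AUC) can be reproduced on toy data as follows. This is a self-contained sketch assuming one-vs-rest multilabel evaluation; the labels and scores are made up and bear no relation to the study's results.

```python
def micro_f1(y_true, y_pred):
    # Micro-F1 pools true positives, false positives, and false negatives
    # across all classes before computing a single F1 score.
    tp = sum(t == p == 1 for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp))
    fp = sum(t == 0 and p == 1 for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp))
    fn = sum(t == 1 and p == 0 for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp))
    return 2 * tp / (2 * tp + fp + fn)

def auc(labels, scores):
    # Rank-based AUC for a single class: the probability that a random
    # positive outscores a random negative (ties count half).
    pos = [s for t, s in zip(labels, scores) if t == 1]
    neg = [s for t, s in zip(labels, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_auc(y_true, y_score):
    # Macro-AUC averages the per-class AUCs with equal weight per class.
    n_classes = len(y_true[0])
    per_class = [auc([row[c] for row in y_true], [row[c] for row in y_score])
                 for c in range(n_classes)]
    return sum(per_class) / n_classes

# Toy multilabel data: 4 records, 2 classes.
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_pred = [[1, 0], [0, 1], [1, 0], [0, 0]]
scores = [[0.9, 0.2], [0.1, 0.8], [0.7, 0.4], [0.3, 0.1]]

print(round(micro_f1(y_true, y_pred), 3), round(macro_auc(y_true, scores), 3))
# → 0.857 1.0
```

Micro averaging weights every label decision equally (so frequent codes dominate), while macro-AUC weights every class equally, which is why papers on imbalanced ICD label sets typically report both.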

CONCLUSIONS

These findings underscore the effectiveness and superior performance of prompt learning and fine-tuning for subtasks within pretrained language models in medical practice. Our real-time ICD coding pipeline efficiently converts detailed medical free text into standardized labels, offering promising applications in clinical decision-making. It can assist doctors unfamiliar with the ICD coding system in organizing medical record information, thereby accelerating the medical process and enhancing the efficiency of diagnosis and treatment.


Figures (from PMC11747532):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0478/11747532/68f1dc61c0de/medinform_v13i1e63020_fig1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0478/11747532/9625467ebd8f/medinform_v13i1e63020_fig2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0478/11747532/78a451eb5c10/medinform_v13i1e63020_fig3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0478/11747532/7b4405a74555/medinform_v13i1e63020_fig4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0478/11747532/8d82262f764c/medinform_v13i1e63020_fig5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0478/11747532/f225d89ad292/medinform_v13i1e63020_fig6.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0478/11747532/5079adf22a8a/medinform_v13i1e63020_fig7.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0478/11747532/1a37c15ea4fb/medinform_v13i1e63020_fig8.jpg

Similar Articles

1. Autonomous International Classification of Diseases Coding Using Pretrained Language Models and Advanced Prompt Learning Techniques: Evaluation of an Automated Analysis System Using Medical Text. JMIR Med Inform. 2025 Jan 6;13:e63020. doi: 10.2196/63020.
2. Comparison of different feature extraction methods for applicable automated ICD coding. BMC Med Inform Decis Mak. 2022 Jan 12;22(1):11. doi: 10.1186/s12911-022-01753-5.
3. Automatic International Classification of Diseases Coding System: Deep Contextualized Language Model With Rule-Based Approaches. JMIR Med Inform. 2022 Jun 29;10(6):e37557. doi: 10.2196/37557.
4. Classifying social determinants of health from unstructured electronic health records using deep learning-based natural language processing. J Biomed Inform. 2022 Mar;127:103984. doi: 10.1016/j.jbi.2021.103984. Epub 2022 Jan 7.
5. Developing an ICD-10 Coding Assistant: Pilot Study Using RoBERTa and GPT-4 for Term Extraction and Description-Based Code Selection. JMIR Form Res. 2025 Feb 11;9:e60095. doi: 10.2196/60095.
6. Artificial Intelligence Learning Semantics via External Resources for Classifying Diagnosis Codes in Discharge Notes. J Med Internet Res. 2017 Nov 6;19(11):e380. doi: 10.2196/jmir.8344.
7. Evaluating Medical Entity Recognition in Health Care: Entity Model Quantitative Study. JMIR Med Inform. 2024 Oct 17;12:e59782. doi: 10.2196/59782.
8. Few-Shot Learning for Clinical Natural Language Processing Using Siamese Neural Networks: Algorithm Development and Validation Study. JMIR AI. 2023 May 4;2:e44293. doi: 10.2196/44293.
9. When BERT meets Bilbo: a learning curve analysis of pretrained language model on disease classification. BMC Med Inform Decis Mak. 2022 Apr 5;21(Suppl 9):377. doi: 10.1186/s12911-022-01829-2.
10. A Comparative Analysis of Machine-Learning Algorithms for Automated International Classification of Diseases (ICD)-10 Coding in Malaysian Death Records. Cureus. 2025 Jan 12;17(1):e77342. doi: 10.7759/cureus.77342. eCollection 2025 Jan.

Cited By

1. Artificial Intelligence to Improve Clinical Coding Practice in Scandinavia: Crossover Randomized Controlled Trial. J Med Internet Res. 2025 Jul 3;27:e71904. doi: 10.2196/71904.
2. Experience of Cardiovascular and Cerebrovascular Disease Surgery Patients: Sentiment Analysis Using the Korean Bidirectional Encoder Representations from Transformers (KoBERT) Model. JMIR Med Inform. 2025 May 30;13:e65127. doi: 10.2196/65127.

References

1. Clinical Prompt Learning With Frozen Language Models. IEEE Trans Neural Netw Learn Syst. 2024 Nov;35(11):16453-16463. doi: 10.1109/TNNLS.2023.3294633. Epub 2024 Oct 29.
2. Secondary Use of Clinical Problem List Entries for Neural Network-Based Disease Code Assignment. Stud Health Technol Inform. 2023 May 18;302:788-792. doi: 10.3233/SHTI230267.
3. Fine-tuning large neural language models for biomedical natural language processing. Patterns (N Y). 2023 Apr 14;4(4):100729. doi: 10.1016/j.patter.2023.100729.
4. Extracting Medical Information From Free-Text and Unstructured Patient-Generated Health Data Using Natural Language Processing Methods: Feasibility Study With Real-world Data. JMIR Form Res. 2023 Mar 7;7:e43014. doi: 10.2196/43014.
5. Transformer-based models for ICD-10 coding of death certificates with Portuguese text. J Biomed Inform. 2022 Dec;136:104232. doi: 10.1016/j.jbi.2022.104232. Epub 2022 Oct 25.
6. RadBERT: Adapting Transformer-based Language Models to Radiology. Radiol Artif Intell. 2022 Jun 15;4(4):e210258. doi: 10.1148/ryai.210258. eCollection 2022 Jul.
7. Automatic International Classification of Diseases Coding System: Deep Contextualized Language Model With Rule-Based Approaches. JMIR Med Inform. 2022 Jun 29;10(6):e37557. doi: 10.2196/37557.
8. Automated ICD coding for primary diagnosis via clinically interpretable machine learning. Int J Med Inform. 2021 Sep;153:104543. doi: 10.1016/j.ijmedinf.2021.104543. Epub 2021 Jul 27.
9. Medical code prediction via capsule networks and ICD knowledge. BMC Med Inform Decis Mak. 2021 Jul 30;21(Suppl 2):55. doi: 10.1186/s12911-021-01426-9.
10. A narrative review of the impact of the transition to ICD-10 and ICD-10-CM/PCS. JAMIA Open. 2019 Dec 26;3(1):126-131. doi: 10.1093/jamiaopen/ooz066. eCollection 2020 Apr.