Autonomous International Classification of Diseases Coding Using Pretrained Language Models and Advanced Prompt Learning Techniques: Evaluation of an Automated Analysis System Using Medical Text.
Authors
Zhuang Yan, Zhang Junyan, Li Xiuxing, Liu Chao, Yu Yue, Dong Wei, He Kunlun
Affiliations
Medical Big Data Research Center, Chinese PLA General Hospital, Beijing, China.
School of Computer Science & Technology, Beijing Institute of Technology, Beijing, China.
Publication
JMIR Med Inform. 2025 Jan 6;13:e63020. doi: 10.2196/63020.
BACKGROUND
Machine learning models can reduce the burden on doctors by converting medical records into International Classification of Diseases (ICD) codes in real time, thereby enhancing the efficiency of diagnosis and treatment. However, this task faces challenges such as small datasets, diverse writing styles, unstructured records, and the need for semimanual preprocessing. Existing approaches, such as naive Bayes, Word2Vec, and convolutional neural networks, have limitations in handling missing values and understanding the context of medical texts, leading to high error rates. We developed a fully automated pipeline based on the Key-BERT (bidirectional encoder representations from transformers) approach and large-scale medical records for continued pretraining, which effectively converts long free text into standard ICD codes. By adjusting parameter settings, such as mixed templates and soft verbalizers, the model can adapt flexibly to different requirements, enabling task-specific prompt learning.
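The mixed-template and soft-verbalizer ideas mentioned above can be sketched in a few lines of PyTorch. Everything here is hypothetical, not the authors' implementation: a tiny randomly initialized encoder stands in for the pretrained BERT, and `MixedTemplatePromptModel`, its dimensions, and the token ids are invented for illustration only.

```python
import torch
import torch.nn as nn

# Illustrative sketch (not the paper's code): a mixed template combines fixed
# "hard" prompt tokens with trainable "soft" tokens, and a soft verbalizer
# replaces hand-picked label words with a learnable projection from the
# [MASK]-position hidden state to label logits.
HIDDEN, VOCAB, NUM_LABELS = 64, 100, 13  # 13 cardiovascular ICD codes

class MixedTemplatePromptModel(nn.Module):
    def __init__(self, num_soft_tokens=4):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)  # stands in for the PLM's embedding table
        # Trainable soft-prompt vectors, prepended to every input (mixed template).
        self.soft_tokens = nn.Parameter(torch.randn(num_soft_tokens, HIDDEN) * 0.02)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=4, batch_first=True),
            num_layers=1)  # stands in for BERT
        # Soft verbalizer: a learnable mapping to label logits instead of fixed label words.
        self.soft_verbalizer = nn.Linear(HIDDEN, NUM_LABELS)

    def forward(self, input_ids, mask_pos):
        x = self.embed(input_ids)                               # (B, T, H)
        soft = self.soft_tokens.unsqueeze(0).expand(x.size(0), -1, -1)
        x = torch.cat([soft, x], dim=1)                         # prepend soft tokens
        h = self.encoder(x)
        # Read out the hidden state at the [MASK] position (shifted by the soft prefix).
        mask_h = h[torch.arange(x.size(0)), mask_pos + self.soft_tokens.size(0)]
        return self.soft_verbalizer(mask_h)                     # (B, NUM_LABELS)

model = MixedTemplatePromptModel()
ids = torch.randint(0, VOCAB, (2, 10))  # toy "record + hard template" token ids
logits = model(ids, mask_pos=torch.tensor([9, 9]))
print(logits.shape)  # torch.Size([2, 13])
```

In this setup only the soft tokens and verbalizer need task-specific tuning, which is one motivation for prompt learning on small medical datasets.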
OBJECTIVE
This study aims to propose a prompt learning real-time framework based on pretrained language models that can automatically label long free-text data with ICD-10 codes for cardiovascular diseases without the need for semiautomatic preprocessing.
METHODS
We integrated 4 components into our framework: a medical pretrained BERT, a keyword filtration BERT in a functional order, a fine-tuning phase, and task-specific prompt learning utilizing mixed templates and soft verbalizers. This framework was validated on a multicenter medical dataset for the automated ICD coding of 13 common cardiovascular diseases (584,969 records). Its performance was compared against the robustly optimized BERT pretraining approach (RoBERTa), extreme language network (XLNet), and various BERT-based fine-tuning pipelines. Additionally, we evaluated the framework's performance under different prompt learning and fine-tuning settings. Furthermore, few-shot learning experiments were conducted to assess the feasibility and efficacy of our framework in scenarios involving small- to mid-sized datasets.
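As a toy illustration of the two headline metrics used in the evaluation, the sketch below computes micro-F1 and macro-AUC with scikit-learn on synthetic 13-class data. The data and scores are invented for illustration and have no relation to the study's results:

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

rng = np.random.default_rng(0)
n, k = 200, 13                                # 13 classes, as in the ICD coding task
y_true = rng.integers(0, k, size=n)           # toy gold labels
scores = rng.random((n, k))
scores[np.arange(n), y_true] += 1.0           # make the toy classifier informative
probs = scores / scores.sum(axis=1, keepdims=True)
y_pred = probs.argmax(axis=1)

# Micro-F1 pools all individual decisions; macro-AUC averages one-vs-rest
# ROC AUCs over classes, weighting rare and common codes equally.
micro_f1 = f1_score(y_true, y_pred, average="micro")
macro_auc = roc_auc_score(y_true, probs, multi_class="ovr", average="macro")
print(round(micro_f1, 3), round(macro_auc, 3))
```

Macro averaging matters here because ICD code frequencies are typically imbalanced, so a macro metric prevents common codes from dominating the score.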
RESULTS
Compared with traditional pretraining and fine-tuning pipelines, our approach achieved a higher micro-F1-score of 0.838 and a macro-area under the receiver operating characteristic curve (macro-AUC) of 0.958, 10% higher than those of other methods. Among the different prompt learning setups, the combination of mixed templates and soft verbalizers yielded the best performance. Few-shot experiments showed that performance stabilized and the AUC peaked at 500 shots.
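A minimal sketch of the few-shot setup described above, assuming "k shots" means k labeled records per class; the helper `sample_k_shots` and the toy data are hypothetical, not the study's code:

```python
import random
from collections import defaultdict

def sample_k_shots(records, k, seed=0):
    """records: list of (text, label) pairs; returns up to k examples per label."""
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[1]].append(rec)
    rng = random.Random(seed)           # fixed seed for reproducible subsets
    subset = []
    for label, recs in by_label.items():
        rng.shuffle(recs)
        subset.extend(recs[:k])         # cap each class at k examples
    return subset

# Toy data: 3 classes, 10 records each. In a few-shot sweep one would grow k
# (e.g. 50, 100, 250, 500) and watch where validation AUC plateaus.
data = [(f"note {i}", i % 3) for i in range(30)]
print(len(sample_k_shots(data, k=4)))  # 12
```

Stratifying the subsample per class, as above, keeps rare ICD codes represented even at small k.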
CONCLUSIONS
These findings underscore the effectiveness and superior performance of prompt learning and fine-tuning for subtasks within pretrained language models in medical practice. Our real-time ICD coding pipeline efficiently converts detailed medical free text into standardized labels, offering promising applications in clinical decision-making. It can assist doctors unfamiliar with the ICD coding system in organizing medical record information, thereby accelerating the medical process and enhancing the efficiency of diagnosis and treatment.