Chen Pei-Fu, Chen Kuan-Chih, Liao Wei-Chih, Lai Feipei, He Tai-Liang, Lin Sheng-Che, Chen Wei-Jen, Yang Chi-Yu, Lin Yu-Cheng, Tsai I-Chang, Chiu Chi-Hao, Chang Shu-Chih, Hung Fang-Ming
Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan.
Department of Anesthesiology, Far Eastern Memorial Hospital, New Taipei City, Taiwan.
JMIR Med Inform. 2022 Jun 29;10(6):e37557. doi: 10.2196/37557.
The tenth revision of the International Classification of Diseases (ICD-10) is widely used in epidemiological research and health management. Its clinical modification (CM) and procedure coding system (PCS) expand the diagnosis and procedure codes to capture more clinical detail and are applied in diagnosis-related groups for reimbursement. This expansion of the code set has made coding time-consuming and less accurate. State-of-the-art models using deep contextual word embeddings have been applied to automatic multilabel ICD-10 text classification. Beyond the discharge diagnoses (DD) used as input, performance can be improved by appropriate preprocessing of text from other document types, such as medical history, comorbidity and complication, surgical method, and special examination.
This study aims to develop a model for ICD-10 multilabel classification by combining a contextual language model with rule-based preprocessing methods.
We retrieved electronic health records from a medical center. We first compared word embedding methods and then, using the best-performing embeddings, compared preprocessing methods. For ICD-10-CM prediction, we compared biomedical bidirectional encoder representations from transformers (BioBERT), clinical generalized autoregressive pretraining for language understanding (Clinical XLNet), a label tree-based attention-aware deep model for high-performance extreme multilabel text classification (AttentionXML), and word-to-vector (Word2Vec). To compare preprocessing methods for ICD-10-CM, we used DD, medical history, and comorbidity and complication as inputs and evaluated prediction performance with different preprocessing steps, including ICD-10 definition training, external cause code removal, number conversion, and combination code filtering. For ICD-10-PCS, the model was trained on different combinations of DD, surgical method, and keywords of special examination. The micro F score and the micro area under the receiver operating characteristic curve (AUROC) were used to compare model performance across preprocessing methods.
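To make the Methods concrete, the sketch below illustrates how such rule-based preprocessing and a BioBERT multilabel classifier could be set up in Python with Hugging Face Transformers. This is a minimal sketch under stated assumptions: the external cause regex, the placeholder combination code list, the digit-to-word reading of "number conversion", and num_labels are all illustrative guesses, since the paper's exact rules are not given in the abstract.

```python
import re

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical illustration of the rule-based preprocessing named in the
# Methods; the paper's exact rules are not stated in the abstract.

# ICD-10-CM external cause codes fall in the V00-Y99 range (Chapter 20).
EXTERNAL_CAUSE = re.compile(r"^[VWXY]\d{2}")

# Placeholder combination codes (codes that bundle a diagnosis with a
# manifestation); the actual filter list used in the paper is unknown.
COMBINATION_CODES = {"E11.21", "I25.110"}

DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three",
               "4": "four", "5": "five", "6": "six", "7": "seven",
               "8": "eight", "9": "nine"}


def remove_external_cause(codes):
    """Drop external cause codes from the label set."""
    return [c for c in codes if not EXTERNAL_CAUSE.match(c)]


def filter_combination_codes(codes):
    """Drop combination codes that bundle several conditions."""
    return [c for c in codes if c not in COMBINATION_CODES]


def convert_numbers(text):
    """Spell out isolated digits so the subword tokenizer sees words
    (one possible reading of the abstract's 'number conversion')."""
    return re.sub(r"\b(\d)\b", lambda m: DIGIT_WORDS[m.group(1)], text)


# Multilabel fine-tuning head on the public BioBERT checkpoint;
# num_labels is a placeholder for the size of the retained code set.
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
model = AutoModelForSequenceClassification.from_pretrained(
    "dmis-lab/biobert-base-cased-v1.1",
    num_labels=1000,
    problem_type="multi_label_classification",  # sigmoid outputs + BCE loss
)
```

In Transformers, problem_type="multi_label_classification" trains one sigmoid output per label with binary cross-entropy, which matches the multilabel framing of ICD-10 coding.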
BioBERT had an F score of 0.701, outperforming Clinical XLNet, AttentionXML, and Word2Vec. For ICD-10-CM, the model's F score increased significantly from 0.749 (95% CI 0.744-0.753) to 0.769 (95% CI 0.764-0.773) with ICD-10 definition training, external cause code removal, number conversion, and combination code filtering. For ICD-10-PCS, the F score increased significantly from 0.670 (95% CI 0.663-0.678) to 0.726 (95% CI 0.719-0.732) with the combination of DD, surgical methods, and keywords of special examination. With our preprocessing methods, the model achieved the highest AUROC of 0.853 (95% CI 0.849-0.855) for ICD-10-CM and 0.831 (95% CI 0.827-0.834) for ICD-10-PCS.
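For reference, the micro-averaged F score and micro AUROC reported above pool true and false positives across all labels before scoring. A minimal scikit-learn sketch with toy placeholder data (the arrays and the 0.5 threshold are illustrative, not the paper's):

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Toy multilabel data: 4 notes x 3 ICD-10 codes (illustrative only).
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_prob = np.array([[0.9, 0.2, 0.7],
                   [0.1, 0.8, 0.3],
                   [0.6, 0.7, 0.2],
                   [0.2, 0.1, 0.9]])

# Micro-averaging aggregates over all (note, code) pairs before scoring.
y_pred = (y_prob >= 0.5).astype(int)  # assumed decision threshold
micro_f = f1_score(y_true, y_pred, average="micro")
micro_auroc = roc_auc_score(y_true, y_prob, average="micro")
print(f"micro F: {micro_f:.3f}, micro AUROC: {micro_auroc:.3f}")
```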
Our model, which combines a pretrained contextual language model with rule-based preprocessing, outperforms the state-of-the-art models for both ICD-10-CM and ICD-10-PCS prediction. This study highlights the importance of rule-based preprocessing grounded in coders' coding rules.