Bern University of Applied Sciences, Switzerland.
Stud Health Technol Inform. 2024 Aug 22;316:1008-1012. doi: 10.3233/SHTI240580.
Coding according to the International Classification of Diseases (ICD)-10 and its clinical modifications (CM) is inherently complex and expensive. Natural Language Processing (NLP) assists by simplifying the analysis of unstructured data from electronic health records, thereby facilitating diagnosis coding. This study investigates the suitability of transformer models for ICD-10 classification, considering both encoder and encoder-decoder architectures. The analysis is performed on clinical discharge summaries from the Medical Information Mart for Intensive Care (MIMIC)-IV dataset, which contains an extensive collection of electronic health records. Pre-trained models such as BioBERT, ClinicalBERT, ClinicalLongformer, and ClinicalBigBird are adapted for the coding task, incorporating specific preprocessing techniques to enhance performance. The findings indicate that increasing context length improves accuracy, and that the difference in accuracy between encoder and encoder-decoder models is negligible.
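The general approach described here, adapting a pre-trained clinical encoder to assign multiple ICD-10 codes per discharge summary, can be illustrated with a minimal sketch. This is not the authors' code: the Hugging Face checkpoint ID, the number of target codes, and the 0.5 decision threshold are illustrative assumptions, and longer-context models such as ClinicalLongformer or ClinicalBigBird would simply replace the checkpoint and raise `max_length`.

```python
# Minimal sketch (assumptions noted): multi-label ICD-10 classification of a
# discharge summary with a pre-trained clinical BERT encoder.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "emilyalsentzer/Bio_ClinicalBERT"  # assumed checkpoint; a longer-context model could be substituted
NUM_CODES = 50                                # assumed label space, e.g. the 50 most frequent ICD-10 codes

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID,
    num_labels=NUM_CODES,
    problem_type="multi_label_classification",  # sigmoid outputs, one per code
)

summary = "Patient admitted with acute exacerbation of COPD ..."
inputs = tokenizer(summary, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Each code is predicted independently; threshold the per-code probabilities.
predicted = (torch.sigmoid(logits) > 0.5).nonzero(as_tuple=True)[1].tolist()
print(predicted)  # indices of the predicted ICD-10 codes
```

In practice the classification head would be fine-tuned on labelled MIMIC-IV summaries before inference; the sketch only shows the model setup and prediction step.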