Projection Word Embedding Model With Hybrid Sampling Training for Classifying ICD-10-CM Codes: Longitudinal Observational Study.

Authors

Lin Chin, Lou Yu-Sheng, Tsai Dung-Jang, Lee Chia-Cheng, Hsu Chia-Jung, Wu Ding-Chung, Wang Mei-Chuen, Fang Wen-Hui

Affiliations

Graduate Institute of Life Sciences, National Defense Medical Center, Taipei, Taiwan.

School of Public Health, National Defense Medical Center, Taipei, Taiwan.

Publication information

JMIR Med Inform. 2019 Jul 23;7(3):e14499. doi: 10.2196/14499.

DOI:10.2196/14499

PMID:31339103

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC6683650/

Abstract

BACKGROUND

Most current state-of-the-art models for searching International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) codes use word embedding technology to capture useful semantic properties. However, they are limited by the quality of the initial word embeddings. Word embeddings trained on electronic health records (EHRs) are considered the best, but their vocabulary diversity is limited by the previous medical records they are trained on. Thus, we require a word embedding model that maintains the vocabulary diversity of open internet databases and the medical terminology understanding of EHRs. Moreover, we need to consider a particularity of disease classification: discharge notes present only positive disease descriptions.

OBJECTIVE

We aimed to propose a projection word2vec model and a hybrid sampling method. In addition, we aimed to conduct a series of experiments to validate the effectiveness of these methods.

METHODS

We compared the projection word2vec model with the traditional word2vec model using two corpus sources: English Wikipedia and PubMed journal abstracts. We used seven published datasets to measure the medical semantic understanding of the word2vec models and used these embeddings to identify the three-character-level ICD-10-CM diagnostic codes in a set of discharge notes. Building on the improved embeddings, we also applied the hybrid sampling method to improve accuracy. A total of 94,483 labeled discharge notes from the Tri-Service General Hospital of Taipei, Taiwan, from June 1, 2015, to June 30, 2017, were used. To evaluate model performance, 24,762 discharge notes from July 1, 2017, to December 31, 2017, from the same hospital were used. Moreover, 74,324 additional discharge notes collected from seven other hospitals were tested. The F-measure, which is the major global measure of effectiveness, was adopted.
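
As an illustration of the evaluation described above, the following is a minimal sketch of how a mean F-measure over three-character ICD-10-CM categories could be computed. It assumes the balanced F-measure (F1) and a simple average across categories; the abstract does not state the weighting or averaging scheme, so both are assumptions. The function names (`to_category`, `f_measure`, `mean_f_measure`) and the toy notes are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def to_category(code: str) -> str:
    """Truncate a full ICD-10-CM code (e.g. 'E11.65') to its three-character category ('E11')."""
    return code.replace(".", "").upper()[:3]


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Balanced F-measure (F1) from raw counts; returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mean_f_measure(gold, predicted):
    """Average per-category F1 over every category seen in gold or predicted labels.

    gold and predicted are parallel lists; each element holds the codes of one note.
    Averaging choice is an assumption of this sketch, not the study's stated method.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for gold_codes, pred_codes in zip(gold, predicted):
        g = {to_category(c) for c in gold_codes}
        p = {to_category(c) for c in pred_codes}
        for cat in g & p:
            counts[cat]["tp"] += 1
        for cat in p - g:
            counts[cat]["fp"] += 1
        for cat in g - p:
            counts[cat]["fn"] += 1
    scores = {cat: f_measure(**c) for cat, c in counts.items()}
    mean = sum(scores.values()) / len(scores) if scores else 0.0
    return mean, scores


# Toy example: three discharge notes with made-up gold and predicted code sets.
gold = [["I21.4", "E11.9"], ["J18.9"], ["E11.65"]]
predicted = [["I21.9", "J18.9"], ["J18.9"], ["E11.9"]]
mean_f, per_category = mean_f_measure(gold, predicted)
print(f"mean F-measure: {mean_f:.4f}")
```

Truncating before comparison means that a prediction of `I21.9` counts as correct for a gold label of `I21.4`, which reflects the three-character granularity named in the abstract.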

RESULTS

For medical semantic understanding, the original EHR embeddings and PubMed embeddings outperformed the original Wikipedia embeddings. After projection training was applied, the projection Wikipedia embeddings improved markedly but did not reach the level of the original EHR embeddings or PubMed embeddings. In the subsequent ICD-10-CM coding experiment, the model that used both projection PubMed and Wikipedia embeddings had the highest testing mean F-measure (0.7362 and 0.6693 in Tri-Service General Hospital and the seven other hospitals, respectively). Moreover, the hybrid sampling method was found to improve the model performance (F-measure=0.7371 and 0.6698, respectively).

CONCLUSIONS

The word embeddings trained using EHR and PubMed could understand medical semantics better, and the proposed projection word2vec model improved the ability of medical semantics extraction in Wikipedia embeddings. Although the improvement from the projection word2vec model in the real ICD-10-CM coding task was not substantial, the models could effectively handle emerging diseases. The proposed hybrid sampling method enables the model to behave like a human expert.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a4d8/6683650/6370a5011be6/medinform_v7i3e14499_fig1.jpg
