Lin Chin, Lou Yu-Sheng, Tsai Dung-Jang, Lee Chia-Cheng, Hsu Chia-Jung, Wu Ding-Chung, Wang Mei-Chuen, Fang Wen-Hui
Graduate Institute of Life Sciences, National Defense Medical Center, Taipei, Taiwan.
School of Public Health, National Defense Medical Center, Taipei, Taiwan.
JMIR Med Inform. 2019 Jul 23;7(3):e14499. doi: 10.2196/14499.
Most current state-of-the-art models for searching the International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) codes use word embedding technology to capture useful semantic properties. However, they are limited by the quality of the initial word embeddings. Word embeddings trained on electronic health records (EHRs) are considered the best, but their vocabulary diversity is limited to the terms that appear in previous medical records. Thus, we require a word embedding model that maintains the vocabulary diversity of open internet databases and the medical terminology understanding of EHRs. Moreover, we need to consider a particularity of disease classification: discharge notes present only positive disease descriptions.
We aimed to propose a projection word2vec model and a hybrid sampling method. In addition, we aimed to conduct a series of experiments to validate the effectiveness of these methods.
We compared the projection word2vec model and the traditional word2vec model using two corpus sources: English Wikipedia and PubMed journal abstracts. We used seven published datasets to measure the medical semantic understanding of the word2vec models and used these embeddings to identify the three-character-level ICD-10-CM diagnostic codes in a set of discharge notes. Building on the improved embeddings, we also applied the hybrid sampling method to improve accuracy. A total of 94,483 labeled discharge notes from the Tri-Service General Hospital of Taipei, Taiwan, from June 1, 2015, to June 30, 2017, were used. To evaluate the model performance, 24,762 discharge notes from July 1, 2017, to December 31, 2017, from the same hospital were used. Moreover, 74,324 additional discharge notes collected from seven other hospitals were tested. The F-measure was adopted as the major global measure of effectiveness.
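As an illustration of the evaluation metric named above, the following is a minimal Python sketch of a per-code F-measure and its mean across codes. It assumes that each discharge note carries a set of three-character ICD-10-CM codes and that the "mean F-measure" is an unweighted average over codes; the abstract does not specify the averaging scheme, and the function names and toy data are hypothetical.

```python
from collections import defaultdict


def per_code_f_measure(true_codes, predicted_codes):
    """Return {code: F-measure} given parallel lists of code sets, one set per note.

    Assumption: F-measure here is the balanced F1 = 2*TP / (2*TP + FP + FN).
    """
    tp = defaultdict(int)  # code predicted and present
    fp = defaultdict(int)  # code predicted but absent
    fn = defaultdict(int)  # code present but not predicted
    for truth, pred in zip(true_codes, predicted_codes):
        for code in pred & truth:
            tp[code] += 1
        for code in pred - truth:
            fp[code] += 1
        for code in truth - pred:
            fn[code] += 1

    scores = {}
    for code in set(tp) | set(fp) | set(fn):
        denom = 2 * tp[code] + fp[code] + fn[code]
        scores[code] = 2 * tp[code] / denom if denom else 0.0
    return scores


def mean_f_measure(true_codes, predicted_codes):
    """Unweighted mean of the per-code F-measures (an assumed averaging scheme)."""
    scores = per_code_f_measure(true_codes, predicted_codes)
    return sum(scores.values()) / len(scores) if scores else 0.0


# Toy example with made-up labels for three discharge notes.
truth = [{"I10", "E11"}, {"J18"}, {"I10"}]
pred = [{"I10"}, {"J18", "E11"}, {"I10", "J18"}]
scores = per_code_f_measure(truth, pred)
mean_score = mean_f_measure(truth, pred)
```

In this toy example, `I10` is always predicted correctly, `E11` is never matched, and `J18` has one true positive and one false positive, so the mean reflects all three codes equally regardless of how often each occurs.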
In medical semantic understanding, the original EHR embeddings and PubMed embeddings outperformed the original Wikipedia embeddings. After projection training was applied, the projection Wikipedia embeddings improved noticeably but did not reach the level of the original EHR embeddings or PubMed embeddings. In the subsequent ICD-10-CM coding experiment, the model that used both projection PubMed and Wikipedia embeddings had the highest testing mean F-measure (0.7362 and 0.6693 in Tri-Service General Hospital and the seven other hospitals, respectively). Moreover, the hybrid sampling method was found to improve the model performance (F-measure=0.7371 and 0.6698, respectively).
Word embeddings trained on EHRs and PubMed captured medical semantics better, and the proposed projection word2vec model improved the extraction of medical semantics in Wikipedia embeddings. Although the improvement from the projection word2vec model in the real ICD-10-CM coding task was not substantial, the models could effectively handle emerging diseases. The proposed hybrid sampling method enables the model to behave like a human expert.