Finch Anthony, Crowell Alexander, Bhatia Mamta, Parameshwarappa Pooja, Chang Yung-Chieh, Martinez Jose, Horberg Michael
Kaiser Permanente Mid-Atlantic Permanente Medical Group, Rockville, Maryland, USA.
Kaiser Permanente Mid-Atlantic Permanente Research Institute, Rockville, Maryland, USA.
JAMIA Open. 2021 Mar 16;4(1):ooab022. doi: 10.1093/jamiaopen/ooab022. eCollection 2021 Jan.
To construct and publicly release a set of medical concept embeddings for ICD-10 codes that explicitly incorporate the hierarchical information of medical codes into the embedding formulation.
We trained concept embeddings with several new extensions to the Word2Vec algorithm on a dataset of approximately 600,000 patients from a major integrated healthcare organization in the Mid-Atlantic US. Our concept embeddings included additional entities to account for the medical categories assigned to codes by the Clinical Classifications Software Refined (CCSR) dataset. We compared these results against publicly released sets of pretrained embeddings and alternative training methodologies.
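The core idea of folding CCSR categories into the embedding vocabulary can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the code-to-category mapping, the `CAT_` token prefix, and the interleaving scheme are all assumptions made for demonstration.

```python
# Hypothetical sketch: augment a patient's ICD-10 code sequence with CCSR
# category tokens so that a Skip-Gram model learns embeddings for both
# codes and categories in a shared vector space.
# The mapping below is illustrative, not the real CCSR crosswalk.
CCSR_MAP = {
    "E11.9": "END005",  # type 2 diabetes -> an endocrine category (illustrative)
    "I10":   "CIR007",  # essential hypertension (illustrative)
}

def augment_with_categories(codes, mapping):
    """Interleave each code with its CCSR category token.

    Word2Vec then treats category tokens as ordinary vocabulary items,
    so codes sharing a category acquire similar contexts.
    """
    out = []
    for code in codes:
        out.append(code)
        cat = mapping.get(code)
        if cat is not None:
            out.append("CAT_" + cat)
    return out

patient_sequence = ["E11.9", "I10"]
sentence = augment_with_categories(patient_sequence, CCSR_MAP)
print(sentence)  # ['E11.9', 'CAT_END005', 'I10', 'CAT_CIR007']
```

The augmented sentences would then be fed to a standard Skip-Gram trainer; co-training (mentioned in the Results) presumably updates code and category vectors jointly in this shared vocabulary.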
We found that Word2Vec models that included hierarchical data outperformed ordinary Word2Vec alternatives on tasks comparing naïve clusters to the canonical ones provided by CCSR. Our Skip-Gram model trained on both codes and categories achieved 61.4% normalized mutual information with the canonical labels, compared with 57.5% for traditional Skip-Gram. Across models trained on two different outcomes, including hierarchical embedding data improved classification performance 96.2% of the time. Controlling for all other variables, co-training embeddings improved classification performance 66.7% of the time. All of our models outperformed the competitive benchmarks.
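The normalized mutual information (NMI) scores above measure agreement between induced clusters and canonical CCSR labels. A minimal self-contained sketch of the metric is below; the arithmetic-mean normalization is an assumption, since the abstract does not specify which variant was used.

```python
# Sketch of normalized mutual information between two label assignments,
# e.g. embedding-derived clusters vs. canonical CCSR categories.
# Normalization by the arithmetic mean of the entropies is assumed here.
from collections import Counter
from math import log

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def nmi(a, b):
    """NMI(a, b) = I(a; b) / mean(H(a), H(b)); 1.0 for identical partitions."""
    n = len(a)
    joint = Counter(zip(a, b))
    ca, cb = Counter(a), Counter(b)
    mi = sum((c / n) * log((c * n) / (ca[x] * cb[y]))
             for (x, y), c in joint.items())
    denom = (entropy(a) + entropy(b)) / 2
    return mi / denom if denom > 0 else 1.0

# Identical partitions (up to relabeling) give NMI = 1.0.
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

Higher NMI against the CCSR labels indicates that the naïve clusters recover more of the canonical category structure.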
We found significant evidence that our proposed algorithms can express the hierarchical structure of medical codes more fully than ordinary Word2Vec models, and that this improvement carries forward into classification tasks. As part of this publication, we have released several sets of pretrained medical concept embeddings using the ICD-10 standard which significantly outperform other well-known pretrained vectors on our tested outcomes.