Kang Byungkon, Yoon Jisang, Kim Ha Young, Jo Sung Jin, Lee Yourim, Kam Hye Jin
Department of Computer Science, State University of New York, Incheon, South Korea.
Graduate School of Information, Yonsei University, Seoul, South Korea.
J Am Med Inform Assoc. 2021 Jul 14;28(7):1489-1496. doi: 10.1093/jamia/ocab030.
Accessing medical data from multiple institutions is difficult owing to the interinstitutional diversity of vocabularies. Standardization schemes, such as the common data model, have been proposed as solutions to this problem, but such schemes require expensive human supervision. This study aims to construct a trainable system that can automate the process of semantic interinstitutional code mapping.
To automate mapping between source and target codes, we compute the embedding-based semantic similarity between corresponding descriptive sentences. We also implement a systematic approach for preparing training data for similarity computation. Experimental results are compared to traditional word-based mappings.
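As a rough illustration of embedding-based matching between code descriptions, the sketch below ranks target codes by cosine similarity to a source description. The character n-gram count vector is only a toy stand-in for the learned sentence embeddings the study refers to, and the names `embed`, `cosine`, `rank_targets`, and the sample codes are hypothetical.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Toy stand-in embedding (assumption): counts of character n-grams.

    A trained sentence encoder would replace this function in practice.
    """
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[key] for key, v in a.items())  # Counter yields 0 if missing
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_targets(source_desc, targets, k=3):
    """Return the k target codes whose descriptions best match source_desc."""
    query = embed(source_desc)
    scored = [(cosine(query, embed(desc)), code) for code, desc in targets.items()]
    scored.sort(reverse=True)
    return [code for _, code in scored[:k]]


# Hypothetical target vocabulary: code -> descriptive sentence.
targets = {
    "T1": "blood glucose measurement",
    "T2": "chest x-ray",
    "T3": "serum sodium level",
}
best = rank_targets("glucose, blood", targets, k=1)
```

Swapping `embed` for a model that returns dense vectors would leave the ranking logic unchanged, provided `cosine` is adapted to the vector type.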
The proposed model is compared against Usagi, the state-of-the-art automated matching system of the Observational Medical Outcomes Partnership common data model. By incorporating multiple negative training samples per positive sample, our semantic matching method significantly outperforms Usagi. Its matching accuracy is at least 10% greater than that of Usagi, and this trend is consistent across various top-k measurements.
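The two ideas in this paragraph, pairing each positive sample with several negatives and scoring by top-k accuracy, can be sketched as follows. The abstract does not say how negatives are chosen, so uniform random sampling is used here purely as an assumption; `build_training_pairs`, `top_k_accuracy`, and the sample codes are hypothetical names.

```python
import random


def build_training_pairs(gold, target_codes, n_neg=4, seed=0):
    """Build (source, target, label) triples with n_neg negatives per positive.

    gold maps each source code to its correct target code. Negatives are
    drawn uniformly at random (assumption; not the study's stated method).
    """
    rng = random.Random(seed)
    pairs = []
    for src, pos in gold.items():
        pairs.append((src, pos, 1))
        candidates = [t for t in target_codes if t != pos]
        for neg in rng.sample(candidates, min(n_neg, len(candidates))):
            pairs.append((src, neg, 0))
    return pairs


def top_k_accuracy(rankings, gold, k):
    """Fraction of source codes whose correct target is in the top k."""
    hits = sum(1 for src, ranked in rankings.items() if gold[src] in ranked[:k])
    return hits / len(rankings)


# Hypothetical data for illustration only.
gold = {"s1": "a", "s2": "a"}
pairs = build_training_pairs(gold, ["a", "b", "c", "d", "e"], n_neg=3)
rankings = {"s1": ["a", "b", "c"], "s2": ["b", "a", "c"]}
acc_at_1 = top_k_accuracy(rankings, gold, k=1)
acc_at_2 = top_k_accuracy(rankings, gold, k=2)
```

Because the choice of negatives is reported to affect performance, the sampling step is the natural place to substitute a different selection strategy.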
The proposed deep learning-based mapping approach outperforms previous simple word-level matching algorithms because it can account for contextual and semantic information. Additionally, we demonstrate that the manner in which negative training samples are selected significantly affects the overall performance of the system.
Incorporating the semantics of code descriptions increases matching accuracy significantly more than traditional text co-occurrence-based approaches do. The negative training sample collection methodology is also an important component of the proposed trainable system and can be adopted in both present and future related systems.