National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA.
J Am Med Inform Assoc. 2020 Oct 1;27(10):1538-1546. doi: 10.1093/jamia/ocaa136.
The study sought to explore the use of deep learning techniques to measure the semantic relatedness between Unified Medical Language System (UMLS) concepts.
Concept sentence embeddings were generated for UMLS concepts by applying the word embedding models BioWordVec and various flavors of BERT to concept sentences formed by concatenating UMLS terms. Graph embeddings were generated by graph convolutional networks and 4 knowledge graph embedding models, using graphs built from UMLS hierarchical relations. Semantic relatedness was measured by the cosine between the concepts' embedding vectors. Performance was compared with 2 traditional path-based measurements (shortest path and Leacock-Chodorow) and with cui2vec, publicly available concept embeddings generated from large biomedical corpora. The concept sentence embeddings were also evaluated on a word sense disambiguation (WSD) task. The reference standards were the semantic relatedness and semantic similarity datasets from the University of Minnesota, concept pairs generated from the Standardized MedDRA Queries, and the MeSH (Medical Subject Headings) WSD corpus.
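The two families of measurements described above can be illustrated with a minimal sketch. It is not the study's implementation: the word vectors are made-up 3-dimensional values rather than BioWordVec output, averaging word vectors is only one simple way to embed a concept sentence, the toy hierarchy and all function names are hypothetical, and the Leacock-Chodorow function uses the commonly cited form -log(p / 2D), where p is the path length counted in nodes and D is the depth of the hierarchy.

```python
import math
from collections import deque

# Made-up toy word vectors (assumption: NOT taken from BioWordVec).
WORD_VECTORS = {
    "heart": [0.9, 0.1, 0.0],
    "attack": [0.6, 0.4, 0.1],
    "myocardial": [0.8, 0.2, 0.0],
    "infarction": [0.8, 0.3, 0.1],
    "bone": [0.1, 0.2, 0.8],
    "fracture": [0.0, 0.1, 0.9],
}


def sentence_embedding(sentence):
    """Embed a concept sentence as the mean of its known word vectors."""
    vectors = [WORD_VECTORS[w] for w in sentence.lower().split() if w in WORD_VECTORS]
    if not vectors:
        raise ValueError("no known words in sentence")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_graph(edges):
    """Build an undirected adjacency map from (parent, child) pairs."""
    graph = {}
    for parent, child in edges:
        graph.setdefault(parent, set()).add(child)
        graph.setdefault(child, set()).add(parent)
    return graph


def shortest_path_edges(graph, start, goal):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def leacock_chodorow(path_edges, max_depth):
    """-log(p / (2 * D)), with p counted in nodes (edges + 1)."""
    return -math.log((path_edges + 1) / (2.0 * max_depth))


# Hypothetical toy hierarchy (assumption: not real UMLS relations).
HIERARCHY = build_graph([
    ("disease", "heart disease"),
    ("heart disease", "myocardial infarction"),
    ("disease", "injury"),
    ("injury", "fracture"),
])
HIERARCHY_DEPTH = 3  # nodes on the longest root-to-leaf path of the toy tree

mi = sentence_embedding("myocardial infarction heart attack")
fx = sentence_embedding("bone fracture")
ha = sentence_embedding("heart attack")
print("cosine(MI, heart attack):", round(cosine(mi, ha), 3))
print("cosine(MI, fracture):", round(cosine(mi, fx), 3))
```

In this toy setup the embedding-based score depends only on the words in the concept sentences, while the path-based scores depend only on where the concepts sit in the hierarchy, which is why the two kinds of measurement can disagree.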
Sentence embeddings generated by BioWordVec outperformed all other methods used individually in semantic relatedness measurements. Graph convolutional network graph embedding uniformly outperformed path-based measurements and was better than some word embeddings for the Standardized MedDRA Queries dataset. Combining word and graph embeddings achieved the best performance on all datasets. For WSD, the enhanced versions of BERT outperformed BioWordVec.
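The abstract does not state how the word and graph embeddings were combined. One plausible approach, shown here purely as an assumption, is to L2-normalize each embedding and concatenate them before taking the cosine. All vectors and function names below are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def combine(word_vec, graph_vec):
    """Concatenate the unit-normalized word and graph embeddings of a concept."""
    return l2_normalize(word_vec) + l2_normalize(graph_vec)


# Made-up embeddings for two concepts (assumption: illustrative values only).
word_a, graph_a = [0.8, 0.2, 0.1], [0.3, 0.9]
word_b, graph_b = [0.7, 0.3, 0.0], [0.9, 0.2]

combined_score = cosine(combine(word_a, graph_a), combine(word_b, graph_b))
print("word only:", round(cosine(word_a, word_b), 3))
print("graph only:", round(cosine(graph_a, graph_b), 3))
print("combined:", round(combined_score, 3))
```

With unit-normalized parts of equal weight, the cosine of the concatenated vectors equals the mean of the word-only and graph-only cosines, so neither source dominates just because its raw vectors are longer or have more dimensions.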
Word and graph embedding techniques can be used to harness terms and relations in the UMLS to measure semantic relatedness between concepts. Concept sentence embedding outperforms path-based measurements and cui2vec, and can be further enhanced by combining with graph embedding.