Department of Computer Science and Engineering, University of North Texas, Denton, Texas, United States of America.
PLoS One. 2021 May 4;16(5):e0251094. doi: 10.1371/journal.pone.0251094. eCollection 2021.
Embeddings of Medical Subject Headings (MeSH) terms have become a foundation for many downstream bioinformatics tasks. Recent studies learn these embeddings from different data sources: a corpus in which each document is indexed by a set of MeSH terms, the MeSH term ontology, and the semantic predications between MeSH terms (extracted from SemMedDB). While each of these data sources contributes to learning MeSH term embeddings, current approaches do not incorporate all of them in the learning process. The challenge is that the structured relationships between MeSH terms differ across the data sources, and no existing approach fuses such complex data into MeSH term embedding learning. In this paper, we study the problem of incorporating the corpus, the ontology, and the semantic predications to learn the embeddings of MeSH terms. We propose a novel framework, Corpus, Ontology, and Semantic predications-based MeSH term embedding (COS), to generate high-quality MeSH term embeddings. COS converts the corpus, ontology, and semantic predications into MeSH term sequences, merges these sequences, and learns MeSH term embeddings from the merged sequences. Extensive experiments on different datasets show that COS outperforms various baseline embeddings and traditional non-embedding-based baselines.
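The convert-then-merge idea can be illustrated with a minimal sketch. The abstract does not say how each source is turned into sequences or which model learns the embeddings, so everything below is an assumption made for illustration: the toy data, the function names, the choice of one sequence per document, per ontology path, and per predication, and the single-parent view of the ontology. The co-occurrence counts at the end only stand in for the kind of signal a sequence-based embedding model might consume; they are not the COS learning step.

```python
from collections import Counter

# Toy inputs (hypothetical, not taken from the paper's datasets).
docs = [
    {"Neoplasms", "Antineoplastic Agents"},
    {"Breast Neoplasms", "Tamoxifen"},
]
parent_of = {  # child -> parent; assumes a single parent per term
    "Breast Neoplasms": "Neoplasms",
    "Tamoxifen": "Antineoplastic Agents",
}
triples = [("Tamoxifen", "TREATS", "Breast Neoplasms")]


def corpus_sequences(docs):
    """Assumed conversion: one sequence per document (its sorted index terms)."""
    return [sorted(terms) for terms in docs]


def ontology_sequences(parent_of):
    """Assumed conversion: one sequence per term, walking up to its root."""
    sequences = []
    for term in sorted(parent_of):
        path = [term]
        # Stop at the root, or if a parent would repeat (guards against cycles).
        while path[-1] in parent_of and parent_of[path[-1]] not in path:
            path.append(parent_of[path[-1]])
        sequences.append(path)
    return sequences


def predication_sequences(triples):
    """Assumed conversion: one [subject, object] sequence per predication."""
    return [[subj, obj] for subj, _predicate, obj in triples]


def merge_sequences(*sources):
    """Concatenate the per-source sequence lists into one training set."""
    merged = []
    for sequences in sources:
        merged.extend(sequences)
    return merged


def cooccurrence_counts(sequences, window=2):
    """Count unordered term pairs that fall within `window` positions."""
    counts = Counter()
    for seq in sequences:
        for i, left in enumerate(seq):
            for right in seq[i + 1 : i + 1 + window]:
                if left != right:
                    counts[tuple(sorted((left, right)))] += 1
    return counts


merged = merge_sequences(
    corpus_sequences(docs),
    ontology_sequences(parent_of),
    predication_sequences(triples),
)
counts = cooccurrence_counts(merged)
```

In this toy example the pair ("Breast Neoplasms", "Tamoxifen") is supported by both the corpus and a predication, which is the kind of cross-source evidence that merging the sequences makes available to a single learner.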