Faculty of Computing, Harbin Institute of Technology, Harbin 150001, China.
Department of Medical Informatics, School of Biomedical Engineering and Informatics, Nanjing Medical University, Nanjing 211166, China.
Brief Bioinform. 2022 Sep 20;23(5). doi: 10.1093/bib/bbac318.
Measuring the semantic similarity between Gene Ontology (GO) terms is a fundamental step in numerous functional bioinformatics applications. To fully exploit the metadata of GO terms, word-embedding-based methods have recently been proposed to map GO terms to low-dimensional feature vectors. However, these representation methods commonly overlook the key information hidden in the overall GO structure and in the relationships among GO terms. In this paper, we propose a novel representation model for GO terms, named GT2Vec, which jointly considers the GO graph structure, learned through graph contrastive learning, and the semantic descriptions of GO terms, encoded with BERT encoders. We evaluate our method on a protein similarity task across a collection of benchmark datasets. The experimental results demonstrate the effectiveness of jointly encoding graph structure and textual node descriptions to learn vector representations for GO terms.
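To make the idea of a joint representation concrete, the following is a minimal sketch and not the GT2Vec implementation. The abstract does not state how the structural and textual vectors are fused or how term similarities are aggregated into a protein score, so both choices here are assumptions: fusion is done by concatenating L2-normalized vectors, and protein similarity uses a best-match average of cosine similarities. The function names, term labels and vector values are all hypothetical toy placeholders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def joint_embedding(structure_vec, text_vec):
    """Fuse a graph-structure vector and a text vector for one GO term.

    Assumption: fusion by concatenating the two normalized vectors.
    The abstract does not specify the actual fusion rule.
    """
    return l2_normalize(structure_vec) + l2_normalize(text_vec)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def protein_similarity(terms_a, terms_b, embeddings):
    """Best-match average over the GO terms annotating two proteins.

    Assumption: this aggregation is a common choice for GO-based protein
    similarity, not necessarily the one used in the paper.
    """
    if not terms_a or not terms_b:
        return 0.0
    best_a = [max(cosine(embeddings[a], embeddings[b]) for b in terms_b)
              for a in terms_a]
    best_b = [max(cosine(embeddings[b], embeddings[a]) for a in terms_a)
              for b in terms_b]
    return (sum(best_a) + sum(best_b)) / (len(best_a) + len(best_b))


# Toy inputs: hypothetical term labels with made-up 2-d structure and text vectors.
structure_vecs = {"term_x": [1.0, 0.0], "term_y": [0.9, 0.1], "term_z": [0.0, 1.0]}
text_vecs = {"term_x": [0.5, 0.5], "term_y": [0.6, 0.4], "term_z": [0.1, 0.9]}

embeddings = {
    term: joint_embedding(structure_vecs[term], text_vecs[term])
    for term in structure_vecs
}

protein_1 = ["term_x", "term_y"]
protein_2 = ["term_y", "term_z"]
print(protein_similarity(protein_1, protein_2, embeddings))
```

In a real pipeline, the structure vectors would come from a trained graph encoder and the text vectors from a BERT-style encoder applied to each term's name and definition; only the fusion and scoring steps are sketched here.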