Department of Computer Engineering, Ajou University, Suwon 16499, South Korea.
Department of Biostatistics, Epidemiology & Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA.
Bioinformatics. 2021 Sep 29;37(18):2971-2980. doi: 10.1093/bioinformatics/btab193.
Knowledge in the Gene Ontology (GO) and Gene Ontology Annotation (GOA) can be manipulated primarily through vector representations of GO terms and genes. Previous studies have represented GO terms and genes or gene products as numeric vectors in Euclidean space, using embedding methods such as those based on Word2Vec, in order to measure their semantic similarity. However, embedding large graph-structured data in Euclidean space cannot avoid losing information about latent hierarchies, so the semantics of GO and GOA are not captured optimally. Hyperbolic spaces such as the Poincaré ball, by contrast, are better suited to modeling hierarchies: because of their negative curvature, distances grow exponentially as points approach the boundary.
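The geometric property mentioned above can be illustrated with the standard distance function of the Poincaré ball model, d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))). The following is a minimal sketch, not code from the HiG2Vec repository; the function names and the example points are assumptions chosen for illustration only.

```python
import math


def poincare_distance(u, v):
    """Distance between two points inside the unit Poincaré ball.

    Illustrative helper (hypothetical name), using the standard formula
    d(u, v) = arcosh(1 + 2*|u - v|^2 / ((1 - |u|^2) * (1 - |v|^2))).
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


def euclidean_distance(u, v):
    """Ordinary Euclidean distance, for comparison."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    # Two pairs of points with the same Euclidean gap (0.1):
    # one pair near the center of the ball, one pair near its boundary.
    near_center = ([0.0, 0.0], [0.1, 0.0])
    near_boundary = ([0.85, 0.0], [0.95, 0.0])
    for name, (p, q) in (("center", near_center), ("boundary", near_boundary)):
        print(name, euclidean_distance(p, q), poincare_distance(p, q))
```

In this sketch both pairs are equally far apart in the Euclidean sense, but the pair near the boundary is much farther apart in hyperbolic distance. This extra room toward the boundary is what lets tree-like structures with many leaves be embedded with low distortion.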
In this article, we propose hierarchical representations of GO and genes (HiG2Vec), obtained by applying Poincaré embedding, which is specialized for representing hierarchies, in a two-step procedure: GO embedding and gene embedding. Through experiments, we show that our model represents the hierarchical structure better than other approaches and predicts interactions between genes or gene products as well as or better than previous studies. The results indicate that HiG2Vec is superior to other methods both in capturing GO and gene semantics and in data utilization. It can be robustly applied to manipulate various kinds of biological knowledge.
https://github.com/JaesikKim/HiG2Vec.
Supplementary data are available at Bioinformatics online.