Yu Guoxian, Zhao Yingwen, Lu Chang, Wang Jun
College of Computer and Information Science, Southwest University, Chongqing 400715, China.
Comput Biol Chem. 2017 Dec;71:264-273. doi: 10.1016/j.compbiolchem.2017.09.010. Epub 2017 Oct 4.
Gene Ontology (GO) is a standardized, controlled vocabulary of terms that describe the molecular functions, biological roles, and cellular locations of proteins. GO terms and the GO hierarchy are updated regularly as biological knowledge accumulates. GO contains more than 50,000 terms, and each protein is annotated with several to dozens of them, so accurately predicting associations between proteins and this large set of terms is challenging. To address this, we propose a method called Hashing GO for protein function prediction (HashGO for short). HashGO first stores the available GO annotations of proteins in a protein-term association matrix. It then tailors a graph hashing method to explore the underlying structure among GO terms and to obtain a series of hash functions that compress the high-dimensional protein-term association matrix into a low-dimensional one. Next, HashGO computes the semantic similarity between proteins from Hamming distances on that low-dimensional matrix. Finally, it predicts the missing annotations of a protein from the annotations of its semantic neighbors. Experiments on archived GO annotations of two model species (yeast and human) show that HashGO not only predicts functions more accurately than related approaches but also runs faster.
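The four steps described in the abstract (association matrix, hashing, Hamming-distance similarity, neighbor-based prediction) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the graph hashing method, so the sketch substitutes sign-of-random-projection hashing as a stand-in, and the function names (`hash_codes`, `hamming_distances`, `predict_scores`), the parameters `n_bits` and `k`, and the similarity-weighted voting rule are all assumptions made for illustration.

```python
import numpy as np


def hash_codes(Y, n_bits, seed=0):
    """Compress each row of a binary protein-term matrix Y into n_bits bits.

    Assumption: uses sign-of-random-projection hashing as a stand-in,
    NOT the graph hashing method tailored in the paper.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((Y.shape[1], n_bits))
    centered = Y - Y.mean(axis=0)  # center columns so the bits are not all alike
    return (centered @ W > 0).astype(np.uint8)


def hamming_distances(B):
    """Pairwise Hamming distances between the rows of a binary code matrix."""
    return (B[:, None, :] != B[None, :, :]).sum(axis=2)


def predict_scores(Y, B, k):
    """Score every protein-term pair from the annotations of the k nearest
    proteins in Hamming space (hypothetical similarity-weighted vote)."""
    n, n_bits = B.shape
    k = min(k, n - 1)  # a protein cannot be its own neighbor
    D = hamming_distances(B).astype(float)
    np.fill_diagonal(D, np.inf)  # exclude self from the neighbor search
    scores = np.zeros(Y.shape, dtype=float)
    for i in range(n):
        nbrs = np.argsort(D[i], kind="stable")[:k]
        sim = 1.0 - D[i, nbrs] / n_bits  # similarity in [0, 1]
        total = sim.sum()
        if total > 0:
            weights = sim / total
        else:
            weights = np.full(len(nbrs), 1.0 / len(nbrs))  # fall back to uniform
        scores[i] = weights @ Y[nbrs]
    return scores
```

Under these assumptions, a term with a high score for a protein that does not currently carry it would be a candidate missing annotation. Computing distances on short binary codes rather than on the full term dimension is where a hashing approach would be expected to save time.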