College of Computer and Information Science, Southwest University, Chongqing 400715, China.
School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing 100044, China; Beijing Key Laboratory of Intelligent Processing for Building Big Data, Beijing 100044, China.
Genomics. 2019 May;111(3):334-342. doi: 10.1016/j.ygeno.2018.02.008. Epub 2018 Feb 23.
Gene Ontology (GO) uses structured vocabularies (or terms) to describe the molecular functions, biological roles, and cellular locations of gene products in a hierarchical ontology. GO annotations associate genes with GO terms and indicate that the given gene products carry out the biological functions described by the relevant terms. However, predicting the correct GO annotations for genes from the massive set of terms defined by GO is a difficult challenge. To address this challenge, we introduce a Gene Ontology Hierarchy Preserving Hashing (HPHash) based semantic method for gene function prediction. HPHash first measures the taxonomic similarity between GO terms. It then uses a hierarchy preserving hashing technique to keep the hierarchical order between GO terms and to optimize a series of hashing functions that encode the massive set of GO terms as compact binary codes. After that, HPHash uses these hashing functions to project the gene-term association matrix into a low-dimensional one and performs semantic similarity based gene function prediction in the low-dimensional space. Experimental results on three model species (Homo sapiens, Mus musculus and Rattus norvegicus) for interspecies gene function prediction show that HPHash performs better than other related approaches and is robust to the number of hash functions. In addition, we use HPHash as a plugin for BLAST-based gene function prediction. In these experiments, HPHash again significantly improves the prediction performance. The code of HPHash is available at: http://mlda.swu.edu.cn/codes.php?name=HPHash.
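The three steps described above (term similarity, hashing of terms, prediction in the compressed space) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a toy hierarchy with hypothetical term names, uses Jaccard overlap of ancestor sets as a stand-in for the taxonomic similarity, and swaps the learned hierarchy preserving hash functions for plain random-hyperplane hashing. The function names and the neighbour-weighted scoring rule are likewise assumptions made for illustration.

```python
import numpy as np

# Toy GO-like DAG: child -> list of parents (hypothetical terms, not real GO IDs).
parents = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
    "rna_binding": ["binding"],
    "kinase": ["catalysis"],
}
terms = list(parents)


def ancestors(term):
    """Return the term together with all of its ancestors in the DAG."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def term_similarity():
    """Jaccard overlap of ancestor sets (assumed stand-in for taxonomic similarity)."""
    anc = [ancestors(t) for t in terms]
    m = len(terms)
    sim = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            sim[i, j] = len(anc[i] & anc[j]) / len(anc[i] | anc[j])
    return sim


def hash_terms(sim, n_bits, seed=0):
    """Encode each term as n_bits binary values with random hyperplanes.

    Assumption: HPHash learns its hash functions so that the hierarchy is
    preserved; random hyperplanes are only a simple substitute here.
    """
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((sim.shape[1], n_bits))
    centered = sim - sim.mean(axis=0)
    return (centered @ planes > 0).astype(float)


def predict(assoc, codes, query):
    """Score every term for gene `query` from its neighbours in the compressed space."""
    low = assoc @ codes  # genes x n_bits
    norms = np.linalg.norm(low, axis=1, keepdims=True)
    unit = low / np.where(norms == 0, 1.0, norms)  # avoid division by zero
    gene_sim = unit @ unit.T  # cosine similarity between genes
    weights = gene_sim[query].copy()
    weights[query] = 0.0  # a gene does not vote for itself
    return weights @ assoc  # similarity-weighted vote over terms


# Made-up gene-term association matrix: 4 genes x 6 terms, columns ordered as `terms`.
assoc = np.array([
    [1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 1, 0],
], dtype=float)

sim = term_similarity()
codes = hash_terms(sim, n_bits=4)
scores = predict(assoc, codes, query=0)
for term, score in sorted(zip(terms, scores), key=lambda pair: -pair[1]):
    print(f"{term}: {score:.3f}")
```

In this sketch the number of bits (`n_bits`) plays the role of the number of hash functions. Genes are compared using `n_bits` columns instead of one column per GO term, which is where the dimensionality reduction comes from.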