Liu Bin, Jiang Shuangyan, Zou Quan
School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China.
Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China.
Brief Bioinform. 2020 Jan 17;21(1):298-308. doi: 10.1093/bib/bby104.
As one of the most important fundamental problems in protein sequence analysis, protein remote homology detection is critical for both theoretical research (protein structure and function studies) and real-world applications (drug design). Although several computational predictors have been proposed, their detection performance is still limited. In this study, we treat protein remote homology detection as a document retrieval task: proteins are considered as documents, and the aim is to find the documents in a database that are highly related to the query documents. A protein similarity network was constructed based on the true labels of the proteins in the database, and the query proteins were then connected to the network based on the similarity scores calculated by three ranking methods: PSI-BLAST, Hmmer and HHblits. The PageRank algorithm and the Hyperlink-Induced Topic Search (HITS) algorithm were each run on this network to move the homologous proteins of the query proteins into the neighborhood of the query proteins. Finally, the PageRank and HITS algorithms were combined, and a predictor called HITS-PR-HHblits was proposed to further improve predictive performance. On the SCOP and SCOPe benchmark datasets, the proposed protocols outperformed other state-of-the-art methods. For the convenience of most experimental scientists, a web server for HITS-PR-HHblits was established at http://bioinformatics.hitsz.edu.cn/HITS-PR-HHblits, through which users can easily obtain results without working through the mathematical details. The HITS-PR-HHblits predictor is a protocol for protein remote homology detection that uses different sets of programs, and it is expected to become a useful computational tool for proteome analysis.
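The network-based re-ranking idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the toy labels, the similarity scores, the unit edge weights between same-label proteins, the damping factor, and the helper names `build_network`, `pagerank` and `hits` are all assumptions made for illustration, and the way the paper combines PageRank with HITS is not reproduced here.

```python
from itertools import combinations

# Hypothetical toy data: family labels of database proteins, and similarity
# scores between one query and its search hits (stand-ins for HHblits output).
LABELS = {"d1": "famA", "d2": "famA", "d3": "famA", "d4": "famB", "d5": "famB"}
QUERY_SCORES = {"d1": 0.9, "d4": 0.2}


def build_network(labels, query_scores, query="query"):
    """Return an undirected weighted graph as {node: {neighbor: weight}}."""
    graph = {name: {} for name in labels}
    graph[query] = {}
    # Assumption: database proteins sharing a label are linked with weight 1.
    for a, b in combinations(labels, 2):
        if labels[a] == labels[b]:
            graph[a][b] = graph[b][a] = 1.0
    # The query is linked to each hit, weighted by its similarity score.
    for name, score in query_scores.items():
        if score > 0:
            graph[query][name] = graph[name][query] = score
    return graph


def pagerank(graph, restart, damping=0.85, iters=100):
    """Power iteration in which every random jump returns to `restart`."""
    rank = {node: 1.0 / len(graph) for node in graph}
    for _ in range(iters):
        new = {node: 0.0 for node in graph}
        for node, nbrs in graph.items():
            total = sum(nbrs.values())
            if total == 0:
                # A node without edges sends its mass back to the restart node.
                new[restart] += damping * rank[node]
                continue
            for nbr, weight in nbrs.items():
                new[nbr] += damping * rank[node] * weight / total
        new[restart] += 1.0 - damping
        rank = new
    return rank


def _normalize(scores):
    """Scale scores to unit Euclidean length (no-op if all scores are zero)."""
    norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
    return {node: v / norm for node, v in scores.items()}


def hits(graph, iters=100):
    """HITS updates; on this undirected graph in-links equal out-links."""
    hub = {node: 1.0 for node in graph}
    auth = dict(hub)
    for _ in range(iters):
        auth = _normalize(
            {n: sum(w * hub[m] for m, w in graph[n].items()) for n in graph}
        )
        hub = _normalize(
            {n: sum(w * auth[m] for m, w in graph[n].items()) for n in graph}
        )
    return hub, auth


graph = build_network(LABELS, QUERY_SCORES)
rank = pagerank(graph, restart="query")
hub, auth = hits(graph)

# Re-rank the database proteins by their PageRank score relative to the query.
ranking = sorted(LABELS, key=rank.get, reverse=True)
print(ranking)
```

In this toy network, d2 and d3 receive no direct score from the query, yet the PageRank ordering places them above d4 because they are tightly linked to the strong hit d1. This is the kind of effect the abstract describes as moving homologous proteins into the neighborhood of the query.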