Application of learning to rank to protein remote homology detection.

Author information

School of Computer Science and Technology, Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China and Gordon Life Science Institute, Belmont, MA 02478, USA.

School of Computer Science and Technology.

Publication information

Bioinformatics. 2015 Nov 1;31(21):3492-8. doi: 10.1093/bioinformatics/btv413. Epub 2015 Jul 10.

DOI:10.1093/bioinformatics/btv413

PMID:26163693

Abstract

MOTIVATION

Protein remote homology detection is one of the fundamental problems in computational biology. It aims to find protein sequences in a database of known structures that are evolutionarily related to a given query protein. Some computational methods, such as PSI-BLAST, HHblits and ProtEmbed, treat this problem as a ranking problem and achieve state-of-the-art performance. This raises the possibility of combining these methods to improve predictive performance. We therefore propose a new computational method for protein remote homology detection, ProtDec-LTR, which combines various ranking methods in a supervised manner using the Learning to Rank (LTR) algorithm derived from natural language processing.
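To make the idea of supervised combination concrete, the following is a minimal sketch of one generic learning-to-rank formulation, a linear model trained with a pairwise logistic loss. It is an illustration only: the abstract does not state which LTR algorithm or features ProtDec-LTR uses, and the function names, the three-element feature vectors (standing in for normalized scores from three base rankers) and all numeric values are hypothetical.

```python
import math
import random


def score(w, x):
    """Linear combination of base-ranker scores for one candidate."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_pairwise_ranker(queries, n_features, epochs=200, lr=0.1, seed=0):
    """Fit weights so that, within each query, homologs outscore non-homologs.

    queries: list of candidate lists; each candidate is (features, label),
             label 1 = true homolog of the query, 0 = unrelated.
    """
    rng = random.Random(seed)
    w = [0.0] * n_features
    # Build (positive, negative) pairs; pairs never cross query boundaries.
    pairs = []
    for candidates in queries:
        pos = [x for x, y in candidates if y == 1]
        neg = [x for x, y in candidates if y == 0]
        pairs.extend((p, q) for p in pos for q in neg)
    for _ in range(epochs):
        rng.shuffle(pairs)
        for p, q in pairs:
            diff = [a - b for a, b in zip(p, q)]
            margin = score(w, diff)
            # Gradient step on log(1 + exp(-margin)); the clamp avoids overflow.
            g = 1.0 / (1.0 + math.exp(min(margin, 50.0)))
            w = [wi + lr * g * di for wi, di in zip(w, diff)]
    return w


def rank(w, candidates):
    """Return candidate feature vectors sorted best-first by combined score."""
    return sorted(candidates, key=lambda x: score(w, x), reverse=True)


# Hypothetical toy data: each vector holds made-up scores from three base rankers.
queries = [
    [([0.9, 0.8, 0.1], 1), ([0.7, 0.2, 0.9], 1),
     ([0.3, 0.4, 0.8], 0), ([0.1, 0.3, 0.2], 0)],
    [([0.8, 0.6, 0.5], 1), ([0.4, 0.1, 0.6], 0), ([0.2, 0.5, 0.1], 0)],
]
w = train_pairwise_ranker(queries, n_features=3)
ranked_first_query = rank(w, [x for x, _ in queries[0]])
```

The learned weights play the role of the supervised combiner: base rankers whose scores separate homologs from non-homologs in the training queries receive larger weights. A production system would likely use an established LTR library and held-out evaluation rather than this hand-rolled loop.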

RESULTS

Experimental results on a widely used benchmark dataset showed that ProtDec-LTR achieves an ROC1 score of 0.8442 and an ROC50 score of 0.9023, outperforming all the individual predictors and some state-of-the-art methods. These results indicate that it is appropriate to treat protein remote homology detection as a ranking problem, and that predictive performance can be improved by combining different ranking approaches in a supervised manner using LTR.
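For readers unfamiliar with the metrics, ROC1 and ROC50 are instances of the ROCn score: the area under the ROC curve up to the first n false positives, normalized to lie between 0 and 1. The sketch below computes it for a single ranked list. The function name is hypothetical, and the handling of lists with fewer than n false positives (missing false positives are treated as ranked after all hits) is an assumed convention; the abstract does not describe the exact evaluation protocol used.

```python
def roc_n(labels_ranked, n):
    """ROCn score for one query.

    labels_ranked: booleans in ranked order (best hit first),
                   True = true homolog, False = false positive.
    n: number of false positives up to which the area is accumulated.
    """
    total_pos = sum(1 for y in labels_ranked if y)
    if total_pos == 0 or n <= 0:
        return 0.0
    tp = 0    # true positives seen so far
    fp = 0    # false positives seen so far
    area = 0  # sum over false positives of the true positives ranked above them
    for y in labels_ranked:
        if y:
            tp += 1
        else:
            fp += 1
            area += tp
            if fp == n:
                break
    # Assumed convention: if the list holds fewer than n false positives,
    # the missing ones count as ranked below every retrieved hit.
    if fp < n:
        area += (n - fp) * tp
    return area / (n * total_pos)


# Hypothetical ranked list: hit, miss, hit, miss.
example = [True, False, True, False]
roc1 = roc_n(example, 1)
roc2 = roc_n(example, 2)
```

A benchmark-level score would typically average this per-query value over all queries.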

AVAILABILITY AND IMPLEMENTATION

For users' convenience, the software tools for the three basic ranking predictors and the Learning to Rank algorithm are provided at http://bioinformatics.hitsz.edu.cn/ProtDec-LTR/home/

CONTACT

bliu@insun.hit.edu.cn

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
