IEEE/ACM Trans Comput Biol Bioinform. 2019 Jul-Aug;16(4):1203-1210. doi: 10.1109/TCBB.2018.2789880. Epub 2018 Jan 5.
Protein remote homology detection is one of the most challenging tasks in sequence analysis and has been studied extensively. Methods based on discriminative models and methods based on ranking have both achieved state-of-the-art performance, and the two kinds of methods are complementary. In this study, three long short-term memory (LSTM) models, ULSTM, BLSTM, and CNN-BLSTM, were applied to construct predictors for protein remote homology detection. These models automatically extract both local and global sequence-order information. Combined with position-specific scoring matrices (PSSMs), CNN-BLSTM achieved the best performance among the three LSTM-based models; we named this method CNN-BLSTM-PSSM. Finally, a new method called ProtDet-CCH was proposed by combining CNN-BLSTM-PSSM with the ranking method HHblits. Tested on a widely used SCOP benchmark dataset, ProtDet-CCH achieved an ROC score of 0.998 and an ROC50 score of 0.982, significantly outperforming other existing state-of-the-art methods. Experimental results on two updated SCOPe independent datasets showed that ProtDet-CCH achieves stable performance. Furthermore, the method can provide useful insights for studying the features and motifs of protein families and superfamilies. It is anticipated that ProtDet-CCH will become a very useful tool for protein remote homology detection.
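The abstract reports ROC and ROC50 scores without defining them. As commonly used in remote homology benchmarks, ROC50 is the area under the ROC curve up to the first 50 false positives, normalized to the range 0 to 1. The sketch below illustrates that conventional definition only; the function name `roc_n`, the tie-breaking rule, and the handling of lists with fewer than `n` negatives are assumptions for illustration, not details taken from the paper.

```python
def roc_n(scores, labels, n=50):
    """Normalized area under the ROC curve up to the first n false positives.

    scores: predicted scores, higher means more likely homologous.
    labels: 1 (or True) for true homologs, 0 (or False) otherwise.
    """
    # Rank by descending score. Assumption: ties are broken pessimistically,
    # placing negatives ahead of positives with the same score.
    ranked = sorted(zip(scores, labels), key=lambda p: (-p[0], p[1]))
    total_pos = sum(1 for _, y in ranked if y)
    total_neg = len(ranked) - total_pos
    # Assumption: if fewer than n negatives exist, use all of them.
    limit = min(n, total_neg)
    if total_pos == 0 or limit == 0:
        raise ValueError("need at least one positive and one negative")

    tp = fp = area = 0
    for _, y in ranked:
        if y:
            tp += 1
        else:
            fp += 1
            area += tp  # true positives ranked above this false positive
            if fp == limit:
                break
    return area / (limit * total_pos)


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.6]
    labels = [1, 0, 1, 0]
    print(roc_n(scores, labels))       # 0.75 (only 2 negatives, so this is the full AUC)
    print(roc_n(scores, labels, n=1))  # 0.5
```

With `n` at least as large as the number of negatives, the value coincides with the ordinary ROC score (AUC), which is why the two metrics are usually reported side by side.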