The influence of gapped positions in multiple sequence alignments on secondary structure prediction methods.

Author information

Simossis V A, Heringa J

Affiliation

Bioinformatics Section, Faculty of Sciences, Vrije Universiteit, De Boelelaan 1081A, 1081 HV Amsterdam, The Netherlands.

Publication information

Comput Biol Chem. 2004 Dec;28(5-6):351-66. doi: 10.1016/j.compbiolchem.2004.09.005.

PMID:15556476

Abstract

All currently leading protein secondary structure prediction methods use a multiple protein sequence alignment to predict the secondary structure of the top sequence. In most of these methods, alignment positions showing a gap in the top sequence are deleted prior to prediction, which shrinks the alignment and discards position-specific information. In this paper we investigate the effect of this removal of information on secondary structure prediction accuracy. To this end, we have designed SymSSP, an algorithm that post-processes the predicted secondary structure of all sequences in a multiple sequence alignment by (i) making use of the alignment's evolutionary information and (ii) re-introducing most of the information that would otherwise be lost. The post-processed information is then given to a new dynamic programming routine that produces an optimally segmented consensus secondary structure for each of the multiple alignment sequences. We have tested our method on the state-of-the-art secondary structure prediction methods PHD, PROFsec, SSPro2 and JNET using the HOMSTRAD database of reference alignments. Our consensus-deriving dynamic programming strategy consistently improves the segmentation quality of the predictions more than the commonly used majority voting technique does. In addition, we have applied several weighting schemes from the literature to our novel consensus-deriving dynamic programming routine. Finally, we have investigated the level of noise introduced by prediction errors into the consensus and show that predictions of the edges of helices and strands are wrong half of the time for all four tested prediction methods.
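The two baseline operations the abstract refers to, deleting alignment columns that are gapped in the top sequence and deriving a consensus by majority voting, can be illustrated with a minimal Python sketch. This is not SymSSP and not its dynamic programming routine, whose details the abstract does not give. The function names, the three-state alphabet (H, E, C), the toy alignment, and the alphabetical tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter

GAP = "-"


def drop_top_gap_columns(alignment):
    """Remove every column in which the top (first) sequence has a gap.

    Mirrors the preprocessing step described in the abstract: the
    alignment shrinks and the deleted columns' information is lost.
    """
    top = alignment[0]
    keep = [i for i, ch in enumerate(top) if ch != GAP]
    return ["".join(seq[i] for i in keep) for seq in alignment]


def majority_vote(aligned_predictions):
    """Column-wise majority vote over aligned secondary structure strings.

    Gapped positions do not vote. Ties are broken alphabetically
    (C before E before H); this rule is an assumption of the sketch.
    """
    consensus = []
    for column in zip(*aligned_predictions):
        votes = Counter(state for state in column if state != GAP)
        if not votes:
            consensus.append(GAP)
            continue
        # Highest count first, then alphabetical order for ties.
        best = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))[0][0]
        consensus.append(best)
    return "".join(consensus)


# Hypothetical toy alignment: the top sequence has a gap in column 3.
alignment = ["MK-TAY", "MKQTAY", "M-QTA-"]
shrunk = drop_top_gap_columns(alignment)

# Hypothetical per-sequence predictions, mapped back onto the alignment.
predictions = ["CH-HHC", "CHHHEC", "C-HEE-"]
consensus = majority_vote(predictions)
```

In this toy case the third column is dropped entirely when only the top sequence is predicted, whereas voting over predictions for all aligned sequences still yields a state for that column. Majority voting treats each column independently, so it places no constraint on segment lengths or boundaries; that is the segmentation aspect on which the abstract reports the dynamic programming consensus performing better.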