Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA 22908, USA.
BMC Bioinformatics. 2010 Mar 22;11:146. doi: 10.1186/1471-2105-11-146.
Although the pairwise alignments produced by sequence similarity searches are a powerful tool for identifying homologous proteins (proteins that share a common ancestor and a similar structure), pairwise sequence alignments often fail to represent accurately the structural alignments inferred from three-dimensional coordinates. Because sequence alignment algorithms produce optimal alignments, the best structural alignments must reflect suboptimal sequence alignment scores. We therefore examined a range of suboptimal sequence alignments and a range of scoring parameters to understand better which sequence alignments are likely to be more structurally accurate.
We compared near-optimal protein sequence alignments produced by the Zuker algorithm and a set of probabilistic alignments produced by the probA program with structural alignments produced by four different structure alignment algorithms. There is significant overlap between the solution spaces of structural alignments and both the near-optimal sequence alignments produced by commonly used scoring parameters for sequences that share significant sequence similarity (E-values < 10^-5) and the ensemble of probA alignments. We constructed a logistic regression model incorporating three input variables derived from sets of near-optimal alignments: robustness, edge frequency, and maximum bits-per-position. A ROC analysis shows that this model more accurately classifies amino acid pairs (edges in the alignment path graph) according to the likelihood of appearance in structural alignments than the robustness score alone. We investigated various trimming protocols for removing incorrect edges from the optimal sequence alignment; the most effective protocol is to remove matches from the semi-global optimal alignment that are outside the boundaries of the local alignment, although trimming according to the model-generated probabilities achieves a similar level of improvement. The model can also be used to generate novel alignments by using the probabilities in lieu of a scoring matrix. These alignments are typically better than the optimal sequence alignment, and include novel correct structural edges. We find that the probA alignments sample a larger variety of alignments than the Zuker set, which more frequently results in alignments that are closer to the structural alignments, but that using the probA alignments as input to the regression model does not increase performance.
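As an illustration of the idea described above, the following minimal Python sketch scores each alignment edge with a logistic function of the three named input variables and then trims edges whose probability falls below a cutoff. The coefficients, the cutoff, and the names `Edge`, `edge_probability`, and `trim_alignment` are hypothetical placeholders chosen for this example; they are not the fitted values or the implementation from the study.

```python
import math
from typing import List, NamedTuple


class Edge(NamedTuple):
    """One aligned residue pair (an edge in the alignment path graph)."""
    i: int                        # residue index in the first sequence
    j: int                        # residue index in the second sequence
    robustness: float             # input variable 1
    edge_frequency: float         # input variable 2
    max_bits_per_position: float  # input variable 3


# Hypothetical coefficients for illustration only (not fitted values).
INTERCEPT = -3.0
WEIGHTS = (0.5, 2.0, 1.0)


def edge_probability(edge: Edge) -> float:
    """Logistic model: estimated probability that the edge is structurally correct."""
    z = (INTERCEPT
         + WEIGHTS[0] * edge.robustness
         + WEIGHTS[1] * edge.edge_frequency
         + WEIGHTS[2] * edge.max_bits_per_position)
    return 1.0 / (1.0 + math.exp(-z))


def trim_alignment(edges: List[Edge], threshold: float = 0.5) -> List[Edge]:
    """Keep only edges whose modelled probability reaches the (assumed) cutoff."""
    return [e for e in edges if edge_probability(e) >= threshold]


if __name__ == "__main__":
    alignment = [
        Edge(10, 12, robustness=2.0, edge_frequency=1.0, max_bits_per_position=1.0),
        Edge(11, 13, robustness=0.0, edge_frequency=0.0, max_bits_per_position=0.0),
    ]
    for e in trim_alignment(alignment):
        print(e.i, e.j, round(edge_probability(e), 3))
```

In practice the coefficients would be estimated from edges labelled by their presence in reference structural alignments, and the cutoff would be chosen from the ROC curve rather than fixed at 0.5.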
The pool of suboptimal pairwise protein sequence alignments substantially overlaps structure-based alignments for pairs with statistically significant similarity, and a regression model based on information contained in this alignment pool improves the accuracy of pairwise alignments with respect to structure-based alignments.