Gotoh O
Department of Biochemistry, Saitama Cancer Center Research Institute, Japan.
J Mol Biol. 1996 Dec 13;264(4):823-38. doi: 10.1006/jmbi.1996.0679.
The relative performances of four strategies for aligning a large number of protein sequences were assessed by referring to corresponding structural alignments of 54 independent families. A multiple sequence alignment of each family was constructed by a given method from the sequences of known structures and their homologues, and the subset consisting of the sequences of known structures was extracted from the whole alignment and compared with the structural counterpart in a residue-to-residue fashion. Gap-opening and -extension penalties were optimized for each family and method. Each of the four multiple alignment methods gave significantly more accurate alignments than the conventional pairwise method. In addition, a clear difference in performance was detected among three of the four multiple alignment methods examined. The currently most popular progressive method ranked worst among the four, and the randomized iterative strategy that optimizes the sum-of-pairs score ranked next worst. The two best-performing strategies, one of which was newly developed, both pursue an optimal weighted sum-of-pairs score, where the pair weights were introduced to correct for uneven representations of subgroups in a family. The new method uses doubly nested iterations to make the alignment, phylogenetic tree and pair weights mutually consistent. Most importantly, the improvement in accuracy of alignments obtained by these iterative methods over the pairwise or progressive method tends to increase with decreasing average sequence identity, implying that iterative refinement is more effective for the generally difficult alignment of remotely related sequences. Four well-known amino acid substitution matrices were also tested in combination with the various methods. However, the effects of substitution matrices were found to be minor in the framework of multiple alignment, and the same order of relative performance of the alignment methods was observed with any of the matrices.
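The two quantities the abstract relies on, a weighted sum-of-pairs score and a residue-to-residue comparison against a reference alignment, can be illustrated with a minimal Python sketch. This is not the paper's implementation. The function names, the toy match/mismatch scores (standing in for a substitution matrix), the gap penalty values, and the convention that a gap of length k costs one opening penalty plus k-1 extension penalties are all assumptions made for illustration. The pair weights are supplied by the caller here, whereas the paper derives them from a phylogenetic tree.

```python
from itertools import combinations

GAP = "-"


def pair_score(row_a, row_b, match=1.0, mismatch=-1.0,
               gap_open=-2.0, gap_extend=-0.5):
    """Score the pairwise alignment induced by two rows of a multiple alignment.

    Assumed toy scoring: match/mismatch replaces a substitution matrix, and a
    gap of length k costs gap_open + (k - 1) * gap_extend.
    """
    score = 0.0
    state = None  # None, "gap_a" or "gap_b": which row is currently gapped
    for x, y in zip(row_a, row_b):
        if x == GAP and y == GAP:
            continue  # column does not exist in this pairwise projection
        if x == GAP or y == GAP:
            new_state = "gap_a" if x == GAP else "gap_b"
            score += gap_extend if state == new_state else gap_open
            state = new_state
        else:
            score += match if x == y else mismatch
            state = None
    return score


def weighted_sp_score(rows, weights):
    """Sum of pairwise scores, each multiplied by its pair weight.

    weights maps (i, j) with i < j to a non-negative weight (hypothetical
    input; uniform weights give the plain sum-of-pairs score).
    """
    return sum(weights[(i, j)] * pair_score(rows[i], rows[j])
               for i, j in combinations(range(len(rows)), 2))


def aligned_pairs(row_a, row_b):
    """Set of (index in a, index in b) for residues placed in the same column."""
    pairs = set()
    ia = ib = 0
    for x, y in zip(row_a, row_b):
        if x != GAP and y != GAP:
            pairs.add((ia, ib))
        if x != GAP:
            ia += 1
        if y != GAP:
            ib += 1
    return pairs


def agreement(test_rows, ref_rows):
    """Fraction of residue pairs aligned in the reference that the test
    alignment reproduces, pooled over all sequence pairs."""
    matched = total = 0
    for i, j in combinations(range(len(ref_rows)), 2):
        ref = aligned_pairs(ref_rows[i], ref_rows[j])
        test = aligned_pairs(test_rows[i], test_rows[j])
        matched += len(ref & test)
        total += len(ref)
    return matched / total if total else 0.0


if __name__ == "__main__":
    reference = ["AC-GT", "ACCGT", "A--GT"]   # stand-in for a structural alignment
    candidate = ["A-CGT", "ACCGT", "A--GT"]   # stand-in for a method's output
    uniform = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0}
    print(weighted_sp_score(reference, uniform))
    print(agreement(candidate, reference))
```

Down-weighting pairs from an over-represented subgroup (smaller entries in `weights`) reduces their influence on the total, which is the role the abstract assigns to the pair weights. The agreement measure here is one simple way to compare alignments residue by residue and may differ from the exact measure used in the paper.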