Golubchik Tanya, Wise Michael J, Easteal Simon, Jermiin Lars S
School of Biological Sciences, University of Sydney, Sydney, Australia.
Mol Biol Evol. 2007 Nov;24(11):2433-42. doi: 10.1093/molbev/msm176. Epub 2007 Aug 20.
Multiple sequence alignment (MSA) is a crucial first step in the analysis of genomic and proteomic data. Commonly occurring sequence features, such as deletions and insertions, are known to affect the accuracy of MSA programs, but the extent to which alignment accuracy is affected by the positions of insertions and deletions has not been examined independently of other sources of sequence variation. We assessed the performance of 6 popular MSA programs (ClustalW, DIALIGN-T, MAFFT, MUSCLE, PROBCONS, and T-COFFEE) and one experimental program, PRANK, on amino acid sequences that differed only by short regions of deleted residues. The analysis showed that the absence of residues often led to an incorrect placement of gaps in the alignments, even though the sequences were otherwise identical. In data sets containing sequences with partially overlapping deletions, most MSA programs preferentially aligned the gaps vertically at the expense of incorrectly aligning residues in the flanking regions. Of the programs assessed, only DIALIGN-T was able to place overlapping gaps correctly relative to one another, but this was usually context dependent and was observed only in some of the data sets. In data sets containing sequences with non-overlapping deletions, both DIALIGN-T and MAFFT (G-INS-I) were able to align gaps with near-perfect accuracy, but only MAFFT produced the correct alignment consistently. The same was true for data sets that comprised isoforms of alternatively spliced gene products: both DIALIGN-T and MAFFT produced highly accurate alignments, with MAFFT being the more consistent of the 2 programs. Other programs, notably T-COFFEE and ClustalW, were less accurate. For all data sets, alignments produced by different MSA programs differed markedly, indicating that reliance on a single MSA program may give misleading results. 
It is therefore advisable to use more than one MSA program when dealing with sequences that may contain deletions or insertions, particularly for high-throughput and pipeline applications where manual refinement of each alignment is not practicable.
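The vertical stacking of partially overlapping gaps described above, and the advice to compare the output of more than one program, can be illustrated with a small sketch. It is not the scoring procedure used in the study: it assumes a generic sum-of-pairs style measure (the fraction of residue pairs aligned in a reference that a second alignment also aligns), and the three toy sequences and the names `aligned_pairs` and `sp_agreement` are invented here for illustration.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    `alignment` maps a sequence name to its gapped row ('-' marks a gap).
    Each residue is identified as (name, index in the ungapped sequence).
    """
    names = sorted(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[n]) != length for n in names):
        raise ValueError("all rows must have the same length")
    counters = {n: 0 for n in names}  # residues seen so far in each row
    pairs = set()
    for col in range(length):
        present = []
        for n in names:
            if alignment[n][col] != "-":
                present.append((n, counters[n]))
                counters[n] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sp_agreement(test, reference):
    """Fraction of residue pairs in `reference` that `test` also aligns."""
    ref = aligned_pairs(reference)
    if not ref:
        return 1.0
    return len(aligned_pairs(test) & ref) / len(ref)


# Hypothetical toy data: one full-length sequence and two copies that
# differ from it only by partially overlapping deletions.
reference = {
    "full": "MKTAYIAKQR",
    "del1": "MKT---AKQR",
    "del2": "MKTAY---QR",
}

# Same sequences, but the gaps of del2 are stacked under those of del1,
# which pairs residues A and Y of del2 with the wrong residues of "full".
stacked = {
    "full": "MKTAYIAKQR",
    "del1": "MKT---AKQR",
    "del2": "MKT---AYQR",
}

print(f"agreement with reference: {sp_agreement(stacked, reference):.3f}")
```

In this toy case the stacked alignment recovers 17 of the 19 reference residue pairs. Applying the same function to the outputs of two different MSA programs (with either one passed as `reference`) gives a rough indication of how far they disagree, which is one way a pipeline could flag alignments for closer inspection.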