European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, United Kingdom.
Department of Biochemistry and Immunology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil.
Mol Biol Evol. 2018 Jul 1;35(7):1783-1797. doi: 10.1093/molbev/msy055.
Accurate reconstruction of ancestral states is a critical evolutionary analysis when studying ancient proteins and comparing biochemical properties between parental or extinct species and their extant relatives. It relies on multiple sequence alignment (MSA) which may introduce biases, and it remains unknown how MSA methodological approaches impact ancestral sequence reconstruction (ASR). Here, we investigate how MSA methodology modulates ASR using a simulation study of various evolutionary scenarios. We evaluate the accuracy of ancestral protein sequence reconstruction for simulated data and compare reconstruction outcomes using different alignment methods. Our results reveal biases introduced not only by aligner algorithms and assumptions, but also tree topology and the rate of insertions and deletions. Under many conditions we find no substantial differences between the MSAs. However, increasing the difficulty for the aligners can significantly impact ASR. The MAFFT consistency aligners and PRANK variants exhibit the best performance, whereas FSA displays limited performance. We also discover a bias towards reconstructed sequences longer than the true ancestors, deriving from a preference for inferring insertions, in almost all MSA methodological approaches. In addition, we find measures of MSA quality generally correlate highly with reconstruction accuracy. Thus, we show MSA methodological differences can affect the quality of reconstructions and propose MSA methods should be selected with care to accurately determine ancestral states with confidence.
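As an illustration of the two kinds of outcome the abstract describes (reconstruction accuracy and a bias towards over-long ancestors), the following is a minimal sketch of how a reconstructed ancestor might be scored against a simulated true ancestor. It is not the authors' evaluation code. The function names, the gap character, and the assumption that both sequences are supplied as gapped strings in a shared column coordinate system are all hypothetical choices made for this example.

```python
GAP = "-"  # assumed gap symbol; not specified in the source


def ungapped_length(seq: str) -> int:
    """Number of residues in a gapped sequence."""
    return sum(1 for ch in seq if ch != GAP)


def length_bias(true_anc: str, recon_anc: str) -> int:
    """Reconstructed length minus true length.

    A positive value means the reconstruction is longer than the
    true ancestor, i.e. insertions were over-inferred.
    """
    return ungapped_length(recon_anc) - ungapped_length(true_anc)


def column_accuracy(true_anc: str, recon_anc: str) -> float:
    """Fraction of columns where the two ancestors agree.

    Assumes both strings share the same column coordinates (hypothetical
    setup). Columns that are gaps in both sequences are ignored; a residue
    opposite a gap counts as a mismatch.
    """
    if len(true_anc) != len(recon_anc):
        raise ValueError("sequences must have the same number of columns")
    cols = [(t, r) for t, r in zip(true_anc, recon_anc)
            if not (t == GAP and r == GAP)]
    if not cols:
        return 1.0
    return sum(t == r for t, r in cols) / len(cols)
```

For example, with a true ancestor `"MKT-AYLV"` and a reconstruction `"MKTGAYIV"`, `length_bias` returns 1 (one spurious inserted residue) and `column_accuracy` returns 0.75 (six of eight columns agree).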