Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ 85721, USA.
California Academy of Sciences, San Francisco, CA 94118, USA.
Syst Biol. 2021 Apr 15;70(3):440-462. doi: 10.1093/sysbio/syaa064.
Alignment is a crucial issue in molecular phylogenetics because different alignment methods can potentially yield very different topologies for individual genes. But it is unclear if the choice of alignment methods remains important in phylogenomic analyses, which incorporate data from hundreds or thousands of genes. For example, problematic biases in alignment might be multiplied across many loci, whereas alignment errors in individual genes might become irrelevant. The issue of alignment trimming (i.e., removing poorly aligned regions or missing data from individual genes) is also poorly explored. Here, we test the impact of 12 different combinations of alignment and trimming methods on phylogenomic analyses. We compare these methods using published phylogenomic data from ultraconserved elements (UCEs) from squamate reptiles (lizards and snakes), birds, and tetrapods. We compare the properties of alignments generated by different alignment and trimming methods (e.g., length, informative sites, missing data). We also test whether these data sets can recover well-established clades when analyzed with concatenated (RAxML) and species-tree methods (ASTRAL-III), using the full data (~5000 loci) and subsampled data sets (10% and 1% of loci). We show that different alignment and trimming methods can significantly impact various aspects of phylogenomic data sets (e.g., length, informative sites). However, these different methods generally had little impact on the recovery and support values for well-established clades, even across very different numbers of loci. Nevertheless, our results suggest several "best practices" for alignment and trimming. Intriguingly, the choice of phylogenetic methods impacted the phylogenetic results most strongly, with concatenated analyses recovering significantly more well-established clades (with stronger support) than the species-tree analyses.
[Alignment; concatenated analysis; phylogenomics; sequence length heterogeneity; species-tree analysis; trimming].
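The alignment properties compared in the abstract (length, informative sites, missing data) and the 10% and 1% locus subsamples can be illustrated with a minimal Python sketch. This is not the authors' pipeline. The function names `alignment_stats` and `subsample_loci` are hypothetical, and the sketch assumes that gaps, `?`, and `N` count as missing data and that "informative sites" means parsimony-informative sites (at least two character states, each present in at least two taxa).

```python
import random
from collections import Counter

# Assumption: gaps, '?' and 'N' are all treated as missing data.
MISSING = set("-?N")


def alignment_stats(seqs):
    """Summarize one aligned locus given a dict of taxon -> aligned sequence."""
    rows = [s.upper() for s in seqs.values()]
    length = len(rows[0])
    if any(len(r) != length for r in rows):
        raise ValueError("sequences must be aligned to equal length")
    informative = 0
    missing = 0
    for column in zip(*rows):
        # Count each observed character state, ignoring missing data.
        counts = Counter(c for c in column if c not in MISSING)
        missing += len(column) - sum(counts.values())
        # Parsimony-informative: two or more states, each seen in >= 2 taxa.
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            informative += 1
    return {
        "length": length,
        "informative_sites": informative,
        "missing_fraction": missing / (length * len(rows)),
    }


def subsample_loci(locus_names, fraction, seed=0):
    """Draw a reproducible random subset of loci (e.g. fraction=0.1 or 0.01)."""
    names = list(locus_names)
    k = max(1, round(len(names) * fraction))
    return sorted(random.Random(seed).sample(names, k))


# Toy four-taxon alignment, invented purely for illustration.
toy = {"a": "ACGT-", "b": "ACGTN", "c": "ATGA?", "d": "ATGAC"}
stats = alignment_stats(toy)

# Hypothetical locus names standing in for a ~5000-locus UCE data set.
loci = [f"uce-{i}" for i in range(5000)]
ten_percent = subsample_loci(loci, 0.10)
one_percent = subsample_loci(loci, 0.01)
```

Computing these summaries for each locus under each alignment and trimming combination would give the kind of per-data-set comparison the abstract describes. Fixing the seed keeps the subsampled locus sets identical across methods, so any difference in results reflects the method rather than the draw.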