Department of Biological Sciences, Oakland University, Rochester, MI, USA.
Center for Data Science and Big Data Analytics, Oakland University, Rochester, MI, USA.
Methods Mol Biol. 2022;2569:167-188. doi: 10.1007/978-1-0716-2691-7_8.
Over the past three decades, computational capabilities have grown so rapidly that they have given rise to many computationally intensive fields of science, such as phylogenomics. As more genomes are sequenced across the three domains of life, larger and more species-complete phylogenetic reconstructions are leading to a better understanding of the tree of life and of evolutionary history in deep time. However, these large datasets pose unique modeling and computational challenges: accurately describing the evolutionary process of thousands of species is still beyond the capability of current models, while the computational burden limits our ability to test multiple hypotheses. It is therefore common practice to reduce the size of a dataset by selecting a subset of species to represent each clade (taxon sampling). Unfortunately, this process is subjective, and comparisons of large tree of life studies show that the choice and number of species in a dataset can alter the topology obtained. Taxon sampling is thus itself a process that needs to be fully investigated to determine its effect on phylogenetic stability. Here, we present the theory and practical application of an automated pipeline that can be easily implemented to explore the effect of taxon sampling on phylogenetic reconstructions. This approach was recently applied in a study of Terrabacteria, which demonstrated its power for investigating the accuracy of deep nodes in a phylogeny.
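As a rough illustration of what automating taxon sampling can involve, the sketch below draws several random subsets of species per clade so that a tree could be inferred from each subset and the resulting topologies compared. It is not the pipeline described in this chapter: the function names (`sample_taxa`, `replicate_samples`), the `per_clade` parameter, and the toy clade and species labels are all assumptions made for this example.

```python
import random


def sample_taxa(clades, per_clade, seed):
    """Pick up to `per_clade` species at random from each clade.

    `clades` maps a clade name to a list of species labels (hypothetical input).
    A fixed seed makes each draw reproducible.
    """
    rng = random.Random(seed)
    chosen = []
    for clade in sorted(clades):
        members = sorted(clades[clade])
        # Small clades contribute all of their species.
        k = min(per_clade, len(members))
        chosen.extend(rng.sample(members, k))
    return chosen


def replicate_samples(clades, per_clade, n_replicates, base_seed=0):
    """Build several independent taxon samples, one seed per replicate."""
    return [
        sample_taxa(clades, per_clade, base_seed + i)
        for i in range(n_replicates)
    ]


# Toy input with placeholder names; real data would come from a genome set.
clades = {
    "clade_A": ["sp1", "sp2", "sp3", "sp4"],
    "clade_B": ["sp5", "sp6", "sp7"],
    "clade_C": ["sp8"],
}

replicates = replicate_samples(clades, per_clade=2, n_replicates=3)
for i, taxa in enumerate(replicates):
    print(f"replicate {i}: {taxa}")
```

Each replicate would then be passed to an alignment and tree-inference step of the user's choice; nodes recovered across all replicates can be regarded as stable with respect to taxon sampling, while nodes that change deserve closer scrutiny.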