Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA.
Philos Trans R Soc Lond B Biol Sci. 2022 Oct 10;377(1861):20210244. doi: 10.1098/rstb.2021.0244. Epub 2022 Aug 22.
With the increased availability of sequence data and even of fully sequenced and assembled genomes, phylogeny estimation of very large trees (even of hundreds of thousands of sequences) is now a goal for some biologists. Yet constructing these phylogenies requires a complex pipeline that presents analytical and computational challenges, especially when the number of sequences is very large. In the past few years, new methods have been developed that aim to enable highly accurate phylogeny estimation on these large datasets, including divide-and-conquer techniques for multiple sequence alignment and/or tree estimation, methods that can estimate species trees from multi-locus datasets while addressing heterogeneity due to biological processes (e.g. incomplete lineage sorting and gene duplication and loss), and methods to add sequences into large gene trees or species trees. Here we present some of these recent advances and discuss opportunities for future improvements. This article is part of a discussion meeting issue 'Genomic population structures of microbial pathogens'.