Department of Computational Biology, University of Lausanne, Lausanne, Switzerland.
SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland.
Nat Biotechnol. 2024 Jan;42(1):139-147. doi: 10.1038/s41587-023-01753-4. Epub 2023 Apr 20.
Current methods for inferring phylogenetic trees require running complex pipelines at substantial computational and labor costs, with additional constraints on sequencing coverage and on assembly and annotation quality, especially for large datasets. To overcome these challenges, we present Read2Tree, which directly processes raw sequencing reads into groups of corresponding genes and bypasses traditional steps in phylogeny inference, such as genome assembly, annotation and all-versus-all sequence comparisons, while retaining accuracy. In a benchmark encompassing a broad variety of datasets, Read2Tree is 10-100 times faster than assembly-based approaches and in most cases more accurate, the exception being when sequencing coverage is high and the reference species are very distant. Here, to illustrate the broad applicability of the tool, we reconstruct a yeast tree of life of 435 species spanning 590 million years of evolution. We also apply Read2Tree to >10,000 Coronaviridae samples, accurately classifying highly diverse animal samples and near-identical severe acute respiratory syndrome coronavirus 2 sequences on a single tree. The speed, accuracy and versatility of Read2Tree enable comparative genomics at scale.