Guerrini Veronica, Conte Alessio, Grossi Roberto, Liti Gianni, Rosone Giovanna, Tattini Lorenzo
Dipartimento di Informatica, University of Pisa, Pisa, Italy.
CNRS UMR 7284, INSERM U1081, Université Côte d'Azur, Nice, France.
Algorithms Mol Biol. 2023 Aug 3;18(1):11. doi: 10.1186/s13015-023-00232-4.
Molecular phylogenetics studies the evolutionary relationships among the individuals of a population through their biological sequences. It can provide insights into the origin and evolution of viral diseases, or highlight complex evolutionary trajectories. A key task is inferring phylogenetic trees from any type of sequencing data, including raw short reads. Yet several tools require pre-processed input, e.g., the output of complex computational pipelines based on de novo assembly or on mapping against a reference genome. As sequencing technologies become cheaper, there is increasing pressure to design methods that analyze their outputs directly. From this viewpoint, there is growing interest in alignment-, assembly-, and reference-free methods that can work on several types of data, including raw reads.
We present phyBWT2, an improved version of phyBWT (Guerrini et al. in 22nd International Workshop on Algorithms in Bioinformatics (WABI) 242:23:1-23:19, 2022). Both tools reconstruct phylogenetic trees directly, bypassing both alignment against a reference genome and de novo assembly. They exploit the combinatorial properties of the extended Burrows-Wheeler Transform (eBWT) and the corresponding eBWT positional clustering framework to detect relevant blocks of the longest shared substrings of varying length (unlike k-mer-based approaches, which need to fix the length k a priori). As a result, they provide novel alignment-, assembly-, and reference-free methods that build partition trees without relying on the pairwise comparison of sequences, thus avoiding the use of a distance matrix to infer phylogeny. In addition, phyBWT2 outperforms phyBWT in terms of running time: it reconstructs phylogenetic trees step by step by considering multiple partitions at once, whereas phyBWT considers just one partition at a time.
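To make the idea of shared blocks in a BWT of a collection more concrete, here is a minimal Python sketch. It is not the phyBWT2 implementation. It assumes a naive construction in which every sequence gets a `$` end-marker and all suffixes are sorted explicitly, with ties broken by sequence index, and it replaces the eBWT positional clustering framework with a much simpler rule: consecutive rows whose suffixes share a prefix of at least `min_len` symbols form one block. The names `build_rows`, `lcp`, `shared_blocks`, and `min_len` are hypothetical and chosen only for illustration.

```python
from typing import FrozenSet, List, Tuple

# One row per suffix: (suffix, index of the source sequence, preceding symbol).
Row = Tuple[str, int, str]


def build_rows(seqs: List[str]) -> List[Row]:
    """Sort all suffixes of all sequences (naive; assumes no '$' in the input)."""
    rows: List[Row] = []
    for color, s in enumerate(seqs):
        t = s + "$"  # end-marker, smaller than A, C, G, T in ASCII order
        for i in range(len(t)):
            # t[i - 1] is the symbol preceding the suffix (cyclically for i == 0)
            rows.append((t[i:], color, t[i - 1]))
    # Equal suffixes from different sequences are ordered by sequence index.
    rows.sort(key=lambda r: (r[0], r[1]))
    return rows


def lcp(a: str, b: str) -> int:
    """Length of the longest common prefix, not counting end-markers."""
    n = 0
    for x, y in zip(a, b):
        if x != y or x == "$":
            break
        n += 1
    return n


def shared_blocks(rows: List[Row], min_len: int) -> List[FrozenSet[int]]:
    """Group consecutive rows sharing a prefix of length >= min_len.

    Returns, for each block of at least two rows, the set of sequence
    indices ("colors") that occur in it. This thresholding is a simplified
    stand-in for positional clustering, not the published procedure.
    """
    blocks: List[FrozenSet[int]] = []
    current = [0]
    for i in range(1, len(rows)):
        if lcp(rows[i - 1][0], rows[i][0]) >= min_len:
            current.append(i)
        else:
            if len(current) > 1:
                blocks.append(frozenset(rows[j][1] for j in current))
            current = [i]
    if len(current) > 1:
        blocks.append(frozenset(rows[j][1] for j in current))
    return blocks


if __name__ == "__main__":
    seqs = ["ACGTAC", "ACGTTC", "TTGCAA"]
    rows = build_rows(seqs)
    print("BWT of the collection:", "".join(r[2] for r in rows))
    print("Color sets of shared blocks:", shared_blocks(rows, min_len=3))
```

In this toy input, the first two sequences share the substring ACGT, so with `min_len=3` every block contains exactly sequences 0 and 1. Each such color set suggests a candidate grouping of sequences, and it is obtained without computing any pairwise distance. Real tools build the transform with dedicated algorithms, since sorting explicit suffixes takes quadratic space.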
Based on the results of experiments on sequencing data, we conclude that our method can handle datasets of different types (short reads, contigs, or entire genomes) and produce trees of quality comparable to the benchmark phylogeny. Overall, the experiments confirm the effectiveness of phyBWT2, which improves on the performance of its previous version, phyBWT, while preserving the accuracy of the results.