Balaban Metin, Bristy Nishat Anjum, Faisal Ahnaf, Bayzid Md Shamsuzzoha, Mirarab Siavash
Bioinformatics and System Biology Program, University of California San Diego, San Diego, CA 92093, USA.
Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh.
Bioinform Adv. 2022 Aug 12;2(1):vbac055. doi: 10.1093/bioadv/vbac055. eCollection 2022.
While alignment has been the dominant approach for determining homology prior to phylogenetic inference, alignment-free methods can simplify the analysis, especially when analyzing genome-wide data. Furthermore, alignment-free methods present the only option for emerging forms of data, such as genome skims, which do not permit assembly. Despite the appeal, alignment-free methods have not been competitive with alignment-based methods in terms of accuracy. One limitation of alignment-free methods is their reliance on simplified models of sequence evolution such as Jukes-Cantor. If we can estimate frequencies of base substitutions in an alignment-free setting, we can compute pairwise distances under more complex models. However, since the strand of DNA sequences is unknown for many forms of genome-wide data, which arguably present the best use case for alignment-free methods, the most complex models that one can use are the so-called no strand-bias models. We show how to calculate distances under a four-parameter no strand-bias model called TK4 without relying on alignments or assemblies. The main idea is to replace letters in the input sequences and recompute Jaccard indices between k-mer sets. However, on larger genomes, we also need to compute the number of k-mer mismatches after replacement due to random chance as opposed to homology. We show in simulation that alignment-free distances can be highly accurate when genomes evolve under the assumed models and study the accuracy on assembled and unassembled biological data.
Our software is available open source at https://github.com/nishatbristy007/NSB.
Supplementary data are available at Bioinformatics Advances online.
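The central idea in the abstract, replacing letters and recomputing Jaccard indices between k-mer sets, can be illustrated with a minimal sketch. This is not the NSB implementation and does not compute TK4 distances. It only shows the building blocks under stated assumptions: canonical k-mers are used because the strand is unknown, the Jaccard index is converted to a per-site mismatch fraction with the commonly used approximation D = 1 - (2J/(1+J))^(1/k), and the Jukes-Cantor correction is applied as the simple baseline model. The recoding scheme (A to C, T to G) and all function names are hypothetical choices made for illustration. The scheme was picked only because it commutes with reverse complementation.

```python
import math

# Complement table; also valid for recoded sequences that use a subset of ACGT.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Hypothetical recoding for illustration: merge A into C and T into G.
# It commutes with complementation, so strand orientation still does not matter.
RECODE_AC_TG = str.maketrans({"A": "C", "T": "G"})


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def canonical_kmers(seq, k):
    """Set of canonical k-mers: the smaller of each k-mer and its reverse complement."""
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        kmers.add(min(kmer, reverse_complement(kmer)))
    return kmers


def jaccard(a, b):
    """Jaccard index of two sets; two empty sets are treated as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mismatch_fraction(j, k):
    """Per-site mismatch fraction implied by Jaccard index j, assuming
    independent substitutions and no chance k-mer matches."""
    if j <= 0.0:
        return 1.0
    return 1.0 - (2.0 * j / (1.0 + j)) ** (1.0 / k)


def jc69_distance(p):
    """Jukes-Cantor distance for an observed mismatch fraction p."""
    if p >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def recoded_mismatch_fraction(seq1, seq2, k):
    """Mismatch fraction after recoding; A<->C and T<->G differences vanish,
    so only the remaining substitution classes contribute."""
    r1 = seq1.translate(RECODE_AC_TG)
    r2 = seq2.translate(RECODE_AC_TG)
    j = jaccard(canonical_kmers(r1, k), canonical_kmers(r2, k))
    return mismatch_fraction(j, k)
```

Comparing the mismatch fraction before and after a recoding indicates how much of the divergence falls in the merged substitution class, which is the kind of information a multi-parameter model needs. The sketch ignores k-mer matches that arise by random chance, which the abstract notes must be accounted for on larger genomes, particularly after recoding shrinks the alphabet.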