Genome-wide alignment-free phylogenetic distance estimation under a no strand-bias model.

Author information

Balaban Metin, Bristy Nishat Anjum, Faisal Ahnaf, Bayzid Md Shamsuzzoha, Mirarab Siavash

Affiliations

Bioinformatics and System Biology Program, University of California San Diego, San Diego, CA 92093, USA.

Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh.

Publication information

Bioinform Adv. 2022 Aug 12;2(1):vbac055. doi: 10.1093/bioadv/vbac055. eCollection 2022.

DOI:10.1093/bioadv/vbac055

PMID:35992043

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9383262/

Abstract

While alignment has been the dominant approach for determining homology prior to phylogenetic inference, alignment-free methods can simplify the analysis, especially when analyzing genome-wide data. Furthermore, alignment-free methods present the only option for emerging forms of data, such as genome skims, which do not permit assembly. Despite the appeal, alignment-free methods have not been competitive with alignment-based methods in terms of accuracy. One limitation of alignment-free methods is their reliance on simplified models of sequence evolution such as Jukes-Cantor. If we can estimate frequencies of base substitutions in an alignment-free setting, we can compute pairwise distances under more complex models. However, since the strand of DNA sequences is unknown for many forms of genome-wide data, which arguably present the best use case for alignment-free methods, the most complex models that one can use are the so-called no strand-bias models. We show how to calculate distances under a four-parameter no strand-bias model called TK4 without relying on alignments or assemblies. The main idea is to replace letters in the input sequences and recompute Jaccard indices between k-mer sets. However, on larger genomes, we also need to compute the number of k-mer mismatches after replacement due to random chance as opposed to homology. We show in simulation that alignment-free distances can be highly accurate when genomes evolve under the assumed models and study the accuracy on assembled and unassembled biological data.
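The letter-replacement idea described in the abstract can be illustrated with a minimal Python sketch. It is not the NSB implementation. The function names, the choice of merging A with T and C with G, and the toy sequences are assumptions made only for illustration; the abstract does not say which replacements the authors use or how the resulting Jaccard indices are converted into TK4 parameters. The sketch uses canonical k-mers (the smaller of a k-mer and its reverse complement) so that the unknown strand does not affect the result.

```python
# Illustrative sketch only; not the NSB implementation.

DNA_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Hypothetical replacement: merge A with T (as W) and C with G (as S).
# Each merged letter is its own complement, so strand handling stays consistent.
WS_REPLACEMENT = {"A": "W", "T": "W", "C": "S", "G": "S"}
WS_COMPLEMENT = {"W": "W", "S": "S"}


def reverse_complement(seq, complement):
    """Complement every letter using the given table, then reverse the string."""
    return seq.translate(str.maketrans(complement))[::-1]


def canonical_kmers(seq, k, complement):
    """Set of canonical k-mers: each k-mer is stored as the lexicographically
    smaller of itself and its reverse complement, making the set strand-agnostic."""
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        kmers.add(min(kmer, reverse_complement(kmer, complement)))
    return kmers


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B|; defined as 1.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def replace_letters(seq, replacement):
    """Apply a letter replacement (alphabet reduction) to a sequence."""
    return seq.translate(str.maketrans(replacement))


def jaccard_before_and_after(x, y, k):
    """Jaccard index on the original alphabet and after the W/S replacement."""
    raw = jaccard(canonical_kmers(x, k, DNA_COMPLEMENT),
                  canonical_kmers(y, k, DNA_COMPLEMENT))
    xm = replace_letters(x, WS_REPLACEMENT)
    ym = replace_letters(y, WS_REPLACEMENT)
    merged = jaccard(canonical_kmers(xm, k, WS_COMPLEMENT),
                     canonical_kmers(ym, k, WS_COMPLEMENT))
    return raw, merged
```

For the toy pair `"AAAA"` and `"AAAT"` with `k = 2`, `jaccard_before_and_after` returns `(0.5, 1.0)`: the single A-to-T difference lowers the Jaccard index on the original alphabet but disappears once A and T are merged. Comparing Jaccard indices across different replacements is what carries information about specific substitution types. On real genomes, as the abstract notes, chance k-mer matches after replacement would also have to be accounted for, which this sketch does not attempt.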

AVAILABILITY AND IMPLEMENTATION

Our software is available open source at https://github.com/nishatbristy007/NSB.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics Advances online.
