Performance comparison between k-tuple distance and four model-based distances in phylogenetic tree reconstruction.

Authors

Yang Kuan, Zhang Liqing

Affiliation

Virginia Bioinformatics Institute, Virginia, USA.

Publication information

Nucleic Acids Res. 2008 Mar;36(5):e33. doi: 10.1093/nar/gkn075. Epub 2008 Feb 22.

DOI:10.1093/nar/gkn075

PMID:18296485

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC2275138/

Abstract

Phylogenetic tree reconstruction requires construction of a multiple sequence alignment (MSA) from sequences. Computationally, it is difficult to achieve an optimal MSA for many sequences. Moreover, even if an optimal MSA is obtained, it may not be the true MSA that reflects the evolutionary history of the underlying sequences. Errors can therefore be introduced during MSA construction, which in turn affect the subsequent phylogenetic tree construction. To circumvent this issue, we extend the application of the k-tuple distance to phylogenetic tree reconstruction. The k-tuple distance between two sequences is the sum of the differences in frequency, over all possible tuples of length k, between the sequences, and it can be estimated without MSAs. It has traditionally been used to build a fast 'guide tree' to assist the construction of MSAs. Using 1470 simulated sets of sequences generated under different evolutionary scenarios, and both neighbor-joining and BioNJ trees, we compared the performance of the k-tuple distance with four commonly used distance estimators: Jukes-Cantor, Kimura, F84 and Tamura-Nei. These four estimators are model-based, as each of them takes a specific substitution model into account in order to compute the distance between a pair of already aligned sequences. Results show that trees constructed from the k-tuple distance are more accurate than those from the other distances most of the time; when the divergence between the underlying sequences is high, tree accuracy with the k-tuple distance can be twice that of the other estimators, or higher. Furthermore, as the k-tuple distance avoids the need to construct an MSA, it can save a tremendous amount of time in phylogenetic tree reconstruction when the data include a large number of sequences.
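The contrast the abstract draws between an alignment-free distance and a model-based one can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract only says "sum of the differences in frequency", so the sketch assumes absolute differences of length-normalised tuple frequencies; the function names and the default `k = 3` are likewise hypothetical. The Jukes-Cantor function uses the standard formula d = -3/4 ln(1 - 4p/3), where p is the proportion of mismatched sites in an existing alignment.

```python
from collections import Counter
from math import inf, log


def ktuple_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-tuple (k-mer) in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def ktuple_distance(a: str, b: str, k: int = 3) -> float:
    """Alignment-free distance between two unaligned sequences.

    ASSUMPTION: sums absolute differences of relative k-tuple
    frequencies; the paper's exact normalisation may differ.
    """
    ca, cb = ktuple_counts(a, k), ktuple_counts(b, k)
    na, nb = sum(ca.values()), sum(cb.values())
    if na == 0 or nb == 0:
        raise ValueError("both sequences must be at least k long")
    # Tuples absent from both sequences contribute 0, so the union suffices.
    return sum(abs(ca[t] / na - cb[t] / nb) for t in ca.keys() | cb.keys())


def jukes_cantor_distance(a: str, b: str) -> float:
    """Jukes-Cantor distance for two ALIGNED sequences ('-' marks a gap)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Skip any column in which either sequence has a gap.
    sites = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x != "-" and y != "-"]
    if not sites:
        raise ValueError("no ungapped sites to compare")
    p = sum(x != y for x, y in sites) / len(sites)
    # The correction is undefined once p reaches the saturation value 3/4.
    return inf if p >= 0.75 else -0.75 * log(1 - 4 * p / 3)


if __name__ == "__main__":
    # Sequences of different length need no alignment for the k-tuple distance.
    print(ktuple_distance("ACGTACGT", "ACGTAGT", k=3))
    # The model-based distance needs equal-length, already aligned input.
    print(jukes_cantor_distance("ACGTACGT", "ACGTA-GT"))
```

Either function can fill a pairwise distance matrix for a neighbor-joining or BioNJ implementation. Only the Jukes-Cantor route requires the sequences to be aligned first, which is the step the abstract identifies as a source of both error and computing time.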

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6fae/2275138/894891d777a3/gkn075f1.jpg

Similar articles

LifePrint: a novel k-tuple distance method for construction of phylogenetic trees.

Adv Appl Bioinform Chem. 2011;4:13-27. doi: 10.2147/AABC.S15021. Epub 2011 Jan 20.

Assessment of protein distance measures and tree-building methods for phylogenetic tree reconstruction.

Mol Biol Evol. 2005 Nov;22(11):2257-64. doi: 10.1093/molbev/msi224. Epub 2005 Jul 27.

PhyPA: Phylogenetic method with pairwise sequence alignment outperforms likelihood methods in phylogenetics involving highly diverged sequences.

Mol Phylogenet Evol. 2016 Sep;102:331-43. doi: 10.1016/j.ympev.2016.07.001. Epub 2016 Jul 1.

Evidence of Statistical Inconsistency of Phylogenetic Methods in the Presence of Multiple Sequence Alignment Uncertainty.

Genome Biol Evol. 2015 Jul 1;7(8):2102-16. doi: 10.1093/gbe/evv127.

SATe-II: very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees.

Syst Biol. 2012 Jan;61(1):90-106. doi: 10.1093/sysbio/syr095. Epub 2011 Dec 1.

Evolutionary distances between nucleotide sequences based on the distribution of substitution rates among sites as estimated by parsimony.

Mol Biol Evol. 1997 Mar;14(3):287-98. doi: 10.1093/oxfordjournals.molbev.a025764.

On the quality of tree-based protein classification.

Bioinformatics. 2005 May 1;21(9):1876-90. doi: 10.1093/bioinformatics/bti244. Epub 2005 Jan 12.

Covariance of maximum likelihood evolutionary distances between sequences aligned pairwise.

BMC Evol Biol. 2008 Jun 23;8:179. doi: 10.1186/1471-2148-8-179.

Statistical Inconsistency of Maximum Parsimony for k-Tuple-Site Data.

Bull Math Biol. 2019 Apr;81(4):1173-1200. doi: 10.1007/s11538-018-00552-2. Epub 2019 Jan 3.

Cited by

Self-mediated positive selection of T cells sets an obstacle to the recognition of nonself.

Proc Natl Acad Sci U S A. 2021 Sep 14;118(37). doi: 10.1073/pnas.2100542118.

A tail of two pandas- whole genome k-mer signature analysis of the red panda (Ailurus fulgens) and the Giant panda (Ailuropoda melanoleuca).

BMC Genomics. 2021 Apr 1;22(1):228. doi: 10.1186/s12864-021-07531-3.

An updated evolutionary study of the Notch family reveals a new ancient origin and novel invariable motifs as potential pharmacological targets.

PeerJ. 2020 Nov 5;8:e10334. doi: 10.7717/peerj.10334. eCollection 2020.

Phylogenetic double placement of mixed samples.

Bioinformatics. 2020 Jul 1;36(Suppl_1):i335-i343. doi: 10.1093/bioinformatics/btaa489.

K-mer-Based Motif Analysis in Insect Species across , , and Genera and Its Application to Species Classification.

Comput Math Methods Med. 2019 Nov 15;2019:4259479. doi: 10.1155/2019/4259479. eCollection 2019.

Alignment-free method for DNA sequence clustering using Fuzzy integral similarity.

Sci Rep. 2019 Mar 6;9(1):3753. doi: 10.1038/s41598-019-40452-6.

Skmer: assembly-free and alignment-free sample identification using genome skims.

Genome Biol. 2019 Feb 13;20(1):34. doi: 10.1186/s13059-019-1632-4.

Pathogen diversity drives the evolution of generalist MHC-II alleles in human populations.

PLoS Biol. 2019 Jan 31;17(1):e3000131. doi: 10.1371/journal.pbio.3000131. eCollection 2019 Jan.

Peptide presentation by HLA-DQ molecules is associated with the development of immune tolerance.

PeerJ. 2018 Jul 3;6:e5118. doi: 10.7717/peerj.5118. eCollection 2018.

Mash: fast genome and metagenome distance estimation using MinHash.

Genome Biol. 2016 Jun 20;17(1):132. doi: 10.1186/s13059-016-0997-x.

