Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison.

Affiliations

Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37073 Göttingen, Germany and Laboratoire Statistique et Génome, Université d'Évry Val d'Essonne, UMR CNRS 8071, USC INRA, 23 Boulevard de France, 91037 Évry, France.

Publication information

Bioinformatics. 2014 Jul 15;30(14):2000-8. doi: 10.1093/bioinformatics/btu331. Epub 2014 May 13.

DOI:10.1093/bioinformatics/btu331

PMID:24828656

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4080746/

Abstract

MOTIVATION

Alignment-based methods for sequence analysis have various limitations if large datasets are to be analysed. Therefore, alignment-free approaches have become popular in recent years. One of the best-known alignment-free methods is the average common substring approach, which defines a distance measure on sequences based on the average length of the longest common words between them. Herein, we generalize this approach by considering longest common substrings with k mismatches. We present a greedy heuristic to approximate the length of such k-mismatch substrings, and we describe kmacs, an efficient implementation of this idea based on generalized enhanced suffix arrays.
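The greedy heuristic described above can be illustrated with a minimal Python sketch. It is not the kmacs implementation: it replaces the generalized enhanced suffix arrays with a naive quadratic scan, and all function names (`longest_exact_match`, `extend_with_mismatches`, `avg_k_mismatch_length`, `acs_style_distance`) are hypothetical. The extension rule (take the longest exact match and extend it without gaps until the (k+1)-th mismatch) and the log-based normalization follow the usual description of the average common substring distance; both should be read as assumptions, since the abstract does not spell them out.

```python
import math


def longest_exact_match(s1, i, s2):
    """Return (length, j) of the longest exact match between s1[i:] and any s2[j:].

    Naive scan over all j; a suffix-array-based tool would do this far faster.
    """
    best_len, best_j = 0, 0
    for j in range(len(s2)):
        n = 0
        while i + n < len(s1) and j + n < len(s2) and s1[i + n] == s2[j + n]:
            n += 1
        if n > best_len:
            best_len, best_j = n, j
    return best_len, best_j


def extend_with_mismatches(s1, i, s2, j, start, k):
    """Extend a gap-free match of length `start` at (i, j), allowing up to k mismatches.

    Stops just before the (k+1)-th mismatch or at the end of either sequence.
    """
    n, mismatches = start, 0
    while i + n < len(s1) and j + n < len(s2):
        if s1[i + n] != s2[j + n]:
            if mismatches == k:
                break
            mismatches += 1
        n += 1
    return n


def avg_k_mismatch_length(s1, s2, k):
    """Average, over all positions i in s1, of the greedy k-mismatch match length."""
    if not s1 or not s2:
        raise ValueError("sequences must be non-empty")
    total = 0
    for i in range(len(s1)):
        length, j = longest_exact_match(s1, i, s2)
        total += extend_with_mismatches(s1, i, s2, j, length, k)
    return total / len(s1)


def acs_style_distance(s1, s2, k):
    """Symmetrized distance using an ACS-style log normalization (an assumption)."""

    def one_way(a, b):
        avg = avg_k_mismatch_length(a, b, k)
        if avg == 0:
            return math.inf  # no common characters at all
        # The second term is a correction so that comparing a sequence
        # with itself gives a value close to zero.
        return math.log(len(b)) / avg - 2 * math.log(len(a)) / len(a)

    return 0.5 * (one_way(s1, s2) + one_way(s2, s1))
```

For example, `avg_k_mismatch_length("ACGT", "ACGT", 0)` averages the match lengths 4, 3, 2 and 1 to 2.5. Because the extension starts only from the longest exact match, the result is a lower bound on the true longest k-mismatch substring length at each position, which is why the abstract calls it an approximation.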

RESULTS

To evaluate the performance of our approach, we applied it to phylogeny reconstruction using a large number of DNA and protein sequence sets. In most cases, phylogenetic trees calculated with kmacs were more accurate than trees produced with established alignment-free methods that are based on exact word matches. Especially on protein sequences, our method seems to be superior. On simulated protein families, kmacs even outperformed a classical approach to phylogeny reconstruction using multiple alignment and maximum likelihood.

AVAILABILITY AND IMPLEMENTATION

kmacs is implemented in C++, and the source code is freely available at http://kmacs.gobics.de/.
