Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis

Publication information

Brief Bioinform. 2014 Nov;15(6):890-905. doi: 10.1093/bib/bbt052. Epub 2013 Jul 31.

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4296134/

Abstract

Modern sequencing and genome assembly technologies have provided a wealth of data, which will soon require an analysis by comparison for discovery. Sequence alignment, a fundamental task in bioinformatics research, may be used but with some caveats. Seminal techniques and methods from dynamic programming are proving ineffective for this work owing to their inherent computational expense when processing large amounts of sequence data. These methods are prone to giving misleading information because of genetic recombination, genetic shuffling and other inherent biological events. New approaches from information theory, frequency analysis and data compression are available and provide powerful alternatives to dynamic programming. These new methods are often preferred, as their algorithms are simpler and are not affected by synteny-related problems. In this review, we provide a detailed discussion of computational tools, which stem from alignment-free methods based on statistical analysis from word frequencies. We provide several clear examples to demonstrate applications and the interpretations over several different areas of alignment-free analysis such as base-base correlations, feature frequency profiles, compositional vectors, an improved string composition and the D2 statistic metric. Additionally, we provide detailed discussion and an example of analysis by Lempel-Ziv techniques from data compression.
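As a rough illustration of two of the ideas named in the abstract, the sketch below counts overlapping words (k-mers), computes the basic, uncentered form of the D2 statistic as the sum over words of the product of their counts in the two sequences, and counts Lempel-Ziv phrases using a simple exhaustive parse. The function names (`kmer_counts`, `d2`, `lz_complexity`) and the toy sequences are illustrative assumptions, not code from the review, and the normalized or centered D2 variants discussed in the literature are not reproduced here.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(x: str, y: str, k: int) -> int:
    """Basic D2: sum over k-words of (count in x) * (count in y)."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    # Counter returns 0 for words absent from y, so unshared words add nothing.
    return sum(n * cy[w] for w, n in cx.items())


def lz_complexity(seq: str) -> int:
    """Number of phrases in a Lempel-Ziv style exhaustive parse of seq.

    Each phrase is the shortest substring starting at the current
    position that has not already occurred in the text preceding
    its own last character.
    """
    i, phrases, n = 0, 0, len(seq)
    while i < n:
        length = 1
        # Extend the phrase while it can still be copied from earlier text.
        while i + length <= n and seq[i:i + length] in seq[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases
```

With these definitions, `d2("ACGTACGT", "ACGT", 2)` evaluates to 6 (the words AC, CG and GT each contribute 2 × 1), and `lz_complexity("AACGTACCATTG")` evaluates to 7 (phrases A, AC, G, T, ACC, AT, TG). Neither function needs an alignment, which is the point the abstract makes about word-based and compression-based comparison.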
