Sequence Comparison Without Alignment: The SpaM Approaches.

Author information

Burkhard Morgenstern

Affiliation

University of Göttingen, Department of Bioinformatics (IMG), Göttingen, Germany.

Publication information

Methods Mol Biol. 2021;2231:121-134. doi: 10.1007/978-1-0716-1036-7_8.

DOI:10.1007/978-1-0716-1036-7_8

PMID:33289890

Abstract

Sequence alignment is at the heart of DNA and protein sequence analysis. For the data volumes now produced by massively parallel sequencing technologies, however, pairwise and multiple alignment methods are often too slow. Fast alignment-free approaches to sequence comparison have therefore become popular in recent years. Most of these approaches are based on word frequencies, for words of a fixed length, or on word-matching statistics. Other approaches use the length of maximal word matches. While these methods are very fast, most of them rely on ad hoc measures of sequence similarity or dissimilarity that are hard to interpret. In this chapter, I describe a number of alignment-free methods that we developed in recent years. Our approaches are based on spaced-word matches ("SpaM"), i.e., inexact word matches that are allowed to contain mismatches at certain predefined positions. Unlike most previous alignment-free approaches, our approaches are able to accurately estimate phylogenetic distances between DNA or protein sequences using a stochastic model of molecular evolution.
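To make the idea of a spaced-word match concrete, here is a minimal Python sketch. It is not taken from the chapter or from any SpaM program: the function names, the `'1'`/`'0'` pattern encoding (match and don't-care positions), and the use of the Jukes-Cantor formula as the stochastic model are all assumptions chosen for illustration. The sketch finds all position pairs where two DNA sequences agree at every match position of a pattern, counts mismatches at the don't-care positions, and converts that fraction into a distance. It applies no filtering of random background matches, which a practical method would need.

```python
import math
from collections import defaultdict


def spaced_words(seq, pattern):
    """Map each spaced word of `seq` to the list of positions where it occurs.

    `pattern` is a string of '1' (match position) and '0' (don't-care position).
    The spaced word at position i is built from the characters of `seq`
    at the match positions of the pattern, anchored at i.
    """
    match_pos = [k for k, c in enumerate(pattern) if c == "1"]
    index = defaultdict(list)
    for i in range(len(seq) - len(pattern) + 1):
        word = "".join(seq[i + k] for k in match_pos)
        index[word].append(i)
    return index


def spaced_word_matches(seq1, seq2, pattern):
    """Return all (i, j) where seq1 at i and seq2 at j agree at every match position."""
    index1 = spaced_words(seq1, pattern)
    index2 = spaced_words(seq2, pattern)
    return [
        (i, j)
        for word, positions1 in index1.items()
        for i in positions1
        for j in index2.get(word, [])
    ]


def mismatch_fraction(seq1, seq2, pattern, matches):
    """Fraction of mismatches at the don't-care positions over all given matches."""
    dont_care = [k for k, c in enumerate(pattern) if c == "0"]
    total = len(matches) * len(dont_care)
    if total == 0:
        return None  # nothing to estimate from
    mismatches = sum(
        seq1[i + k] != seq2[j + k] for i, j in matches for k in dont_care
    )
    return mismatches / total


def jukes_cantor(p):
    """Jukes-Cantor estimate of substitutions per site from mismatch fraction p."""
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    s1, s2, pat = "ACGTACGT", "ACCTACGT", "101"
    matches = spaced_word_matches(s1, s2, pat)
    p = mismatch_fraction(s1, s2, pat, matches)
    print(len(matches), p, jukes_cantor(p))
```

On sequences this short, most matches are chance hits between unrelated positions, so the resulting number is not a meaningful distance; the example only shows the mechanics of matching at fixed positions and reading off mismatches at the remaining ones.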

Similar articles

Fast and accurate phylogeny reconstruction using filtered spaced-word matches.

Bioinformatics. 2017 Apr 1;33(7):971-979. doi: 10.1093/bioinformatics/btw776.

Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches.

Nucleic Acids Res. 2014 Jul;42(Web Server issue):W7-11. doi: 10.1093/nar/gku398. Epub 2014 May 14.

Fast alignment-free sequence comparison using spaced-word frequencies.

Bioinformatics. 2014 Jul 15;30(14):1991-9. doi: 10.1093/bioinformatics/btu177. Epub 2014 Apr 3.

Prot-SpaM: fast alignment-free phylogeny reconstruction based on whole-proteome sequences.

Gigascience. 2019 Mar 1;8(3). doi: 10.1093/gigascience/giy148.

Read-SpaM: assembly-free and alignment-free comparison of bacterial genomes with low sequencing coverage.

BMC Bioinformatics. 2019 Dec 17;20(Suppl 20):638. doi: 10.1186/s12859-019-3205-7.

The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances.

PLoS One. 2020 Feb 10;15(2):e0228070. doi: 10.1371/journal.pone.0228070. eCollection 2020.

Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison.

Bioinformatics. 2014 Jul 15;30(14):2000-8. doi: 10.1093/bioinformatics/btu331. Epub 2014 May 13.

Accurate multiple alignment of distantly related genome sequences using filtered spaced word matches as anchor points.

Bioinformatics. 2019 Jan 15;35(2):211-218. doi: 10.1093/bioinformatics/bty592.

rasbhari: Optimizing Spaced Seeds for Database Searching, Read Mapping and Alignment-Free Sequence Comparison.

PLoS Comput Biol. 2016 Oct 19;12(10):e1005107. doi: 10.1371/journal.pcbi.1005107. eCollection 2016 Oct.

Cited by

CGRWDL: alignment-free phylogeny reconstruction method for viruses based on chaos game representation weighted by dynamical language model.

Front Microbiol. 2024 Mar 20;15:1339156. doi: 10.3389/fmicb.2024.1339156. eCollection 2024.

AlcoR: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data.

Gigascience. 2022 Dec 28;12. doi: 10.1093/gigascience/giad101. Epub 2023 Dec 13.

How to optimally sample a sequence for rapid analysis.

Bioinformatics. 2023 Feb 3;39(2). doi: 10.1093/bioinformatics/btad057.

App-SpaM: phylogenetic placement of short reads without sequence alignment.

Bioinform Adv. 2021 Oct 13;1(1):vbab027. doi: 10.1093/bioadv/vbab027. eCollection 2021.

Convolutional Neural Network Applied to SARS-CoV-2 Sequence Classification.

Sensors (Basel). 2022 Jul 31;22(15):5730. doi: 10.3390/s22155730.

The complexity landscape of viral genomes.

Gigascience. 2022 Aug 11;11. doi: 10.1093/gigascience/giac079.

Insertions and deletions as phylogenetic signal in an alignment-free context.

PLoS Comput Biol. 2022 Aug 8;18(8):e1010303. doi: 10.1371/journal.pcbi.1010303. eCollection 2022 Aug.

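The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances.

PLoS One. 2020 Feb 10;15(2):e0228070. doi: 10.1371/journal.pone.0228070. eCollection 2020.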

References

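The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances.

PLoS One. 2020 Feb 10;15(2):e0228070. doi: 10.1371/journal.pone.0228070. eCollection 2020.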

Whole-proteome tree of life suggests a deep burst of organism diversity.

Proc Natl Acad Sci U S A. 2020 Feb 18;117(7):3678-3686. doi: 10.1073/pnas.1915766117. Epub 2020 Feb 4.

Alignment-Free Sequence Analysis and Applications.

Annu Rev Biomed Data Sci. 2018 Jul;1:93-114. doi: 10.1146/annurev-biodatasci-080917-013431. Epub 2018 Apr 25.

Dashing: fast and accurate genomic distances with HyperLogLog.

Genome Biol. 2019 Dec 4;20(1):265. doi: 10.1186/s13059-019-1875-0.

Mash Screen: high-throughput sequence containment estimation for genome discovery.

Genome Biol. 2019 Nov 5;20(1):232. doi: 10.1186/s13059-019-1841-x.

Benchmarking of alignment-free sequence comparison methods.

Genome Biol. 2019 Jul 25;20(1):144. doi: 10.1186/s13059-019-1755-7.

Evolution of biosequence search algorithms: a brief survey.

Bioinformatics. 2019 Oct 1;35(19):3547-3552. doi: 10.1093/bioinformatics/btz272.

Skmer: assembly-free and alignment-free sample identification using genome skims.

Genome Biol. 2019 Feb 13;20(1):34. doi: 10.1186/s13059-019-1632-4.

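Accurate multiple alignment of distantly related genome sequences using filtered spaced word matches as anchor points.

Bioinformatics. 2019 Jan 15;35(2):211-218. doi: 10.1093/bioinformatics/bty592.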

Alignment-free sequence comparison: benefits, applications, and tools.

Genome Biol. 2017 Oct 3;18(1):186. doi: 10.1186/s13059-017-1319-7.
