Accurate multiple alignment of distantly related genome sequences using filtered spaced word matches as anchor points.

Affiliations

Department of Bioinformatics, Institute of Microbiology and Genetics.

Center for Computational Sciences, University of Goettingen, Goettingen, Germany.

Publication information

Bioinformatics. 2019 Jan 15;35(2):211-218. doi: 10.1093/bioinformatics/bty592.

DOI: 10.1093/bioinformatics/bty592

PMID: 29992260

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6330006/

Abstract

MOTIVATION

Most methods for pairwise and multiple genome alignment use fast local homology search tools to identify anchor points, i.e. high-scoring local alignments of the input sequences. Sequence segments between those anchor points are then aligned with slower, more sensitive methods. Finding suitable anchor points is therefore crucial for genome sequence comparison; speed and sensitivity of genome alignment depend on the underlying anchoring methods.
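The anchor-then-fill strategy described above can be illustrated with a minimal sketch. It assumes anchors are ungapped and given as hypothetical `(start_a, start_b, length)` triples that are already sorted, collinear and non-overlapping; the function name and the tuple layout are illustrative and do not come from any particular aligner. The function returns the inter-anchor segments that a slower, more sensitive method would then align.

```python
def segments_between_anchors(len_a, len_b, anchors):
    """Return the unaligned regions left between consecutive anchors.

    anchors: list of (start_a, start_b, length) triples, assumed to be
    sorted, collinear and non-overlapping (an assumption of this sketch).
    Each result is ((a_from, a_to), (b_from, b_to)) with half-open intervals.
    """
    gaps = []
    prev_a = prev_b = 0
    for start_a, start_b, length in anchors:
        # Region between the end of the previous anchor and the start of this one.
        gaps.append(((prev_a, start_a), (prev_b, start_b)))
        prev_a, prev_b = start_a + length, start_b + length
    # Trailing region after the last anchor.
    gaps.append(((prev_a, len_a), (prev_b, len_b)))
    return gaps
```

With sequences of length 20 and 22 and two anchors `(2, 3, 4)` and `(10, 12, 5)`, the sketch returns three segment pairs: one before the first anchor, one between the anchors, and one after the last anchor.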

RESULTS

In this article, we use filtered spaced word matches to generate anchor points for genome alignment. For a given binary pattern representing match and don't-care positions, we first search for spaced-word matches, i.e. ungapped local pairwise alignments with matching nucleotides at the match positions of the pattern and possible mismatches at the don't-care positions. Those spaced-word matches that have similarity scores above some threshold value are then extended using a standard X-drop algorithm; the resulting local alignments are used as anchor points. To evaluate this approach, we used the popular multiple-genome-alignment pipeline Mugsy and replaced the exact word matches that Mugsy uses as anchor points with our spaced-word-based anchor points. For closely related genome sequences, the two anchoring procedures lead to multiple alignments of similar quality. For distantly related genomes, however, alignments calculated with our filtered-spaced-word matches are superior to alignments produced with the original Mugsy program where exact word matches are used to find anchor points.
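The three steps described above (spaced-word matching, score filtering, X-drop extension) can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the pattern `"1101"`, the +1/-1 scoring, scoring only the don't-care positions, the ungapped extension, and all function names are hypothetical choices made for this example.

```python
from collections import defaultdict


def spaced_word(seq, pos, pattern):
    """Characters of seq at the match ('1') positions of pattern, starting at pos."""
    return "".join(seq[pos + k] for k, p in enumerate(pattern) if p == "1")


def spaced_word_matches(a, b, pattern):
    """Yield (i, j) where a and b agree at every match position of the pattern."""
    index = defaultdict(list)
    for i in range(len(a) - len(pattern) + 1):
        index[spaced_word(a, i, pattern)].append(i)
    for j in range(len(b) - len(pattern) + 1):
        for i in index.get(spaced_word(b, j, pattern), ()):
            yield i, j


def dont_care_score(a, b, i, j, pattern, match=1, mismatch=-1):
    """Similarity score over the don't-care ('0') positions (assumed +1/-1 scoring)."""
    return sum(match if a[i + k] == b[j + k] else mismatch
               for k, p in enumerate(pattern) if p == "0")


def xdrop_extend(a, b, i, j, x, step, match=1, mismatch=-1):
    """Ungapped X-drop extension from (i, j) in direction step (+1 or -1).

    Stops once the running score falls more than x below the best score
    seen so far; returns the length of the best-scoring extension.
    """
    score = best = best_len = k = 0
    while 0 <= i + step * k < len(a) and 0 <= j + step * k < len(b):
        score += match if a[i + step * k] == b[j + step * k] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score > x:
            break
    return best_len


def anchors(a, b, pattern, threshold, x):
    """Filtered, extended spaced-word matches as (start_a, start_b, length)."""
    width = len(pattern)
    found = set()
    for i, j in spaced_word_matches(a, b, pattern):
        if dont_care_score(a, b, i, j, pattern) < threshold:
            continue  # filter out low-scoring (likely random) matches
        right = xdrop_extend(a, b, i + width, j + width, x, +1)
        left = xdrop_extend(a, b, i - 1, j - 1, x, -1)
        found.add((i - left, j - left, left + width + right))
    return sorted(found)
```

For example, `"TTACGTGG"` and `"CCACATCC"` share one spaced-word match at position `(2, 2)`: the windows `ACGT` and `ACAT` agree at the match positions and differ at the don't-care position. With a threshold of 0 that match is discarded; with a threshold of -1 it is kept, and the X-drop extension leaves it unchanged because the flanking bases do not match.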

AVAILABILITY AND IMPLEMENTATION

http://spacedanchor.gobics.de.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
