一种使用稀疏后缀数组在大型序列数据集中查找最大精确匹配的实用算法。

A practical algorithm for finding maximal exact matches in large sequence datasets using sparse suffix arrays.

作者信息

Khan Zia, Bloom Joshua S, Kruglyak Leonid, Singh Mona

机构信息

Department of Computer Science, Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA.

出版信息

Bioinformatics. 2009 Jul 1;25(13):1609-16. doi: 10.1093/bioinformatics/btp275. Epub 2009 Apr 23.

DOI:10.1093/bioinformatics/btp275

PMID:19389736

原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC2732316/

Abstract

MOTIVATION

High-throughput sequencing technologies place ever increasing demands on existing algorithms for sequence analysis. Algorithms for computing maximal exact matches (MEMs) between sequences appear in two contexts where high-throughput sequencing will vastly increase the volume of sequence data: (i) seeding alignments of high-throughput reads for genome assembly and (ii) designating anchor points for genome-genome comparisons.

RESULTS

We introduce a new algorithm for finding MEMs. The algorithm leverages a sparse suffix array (SA), a text index that stores every K-th position of the text. In contrast to a full text index that stores every position of the text, a sparse SA occupies much less memory. Even though we use a sparse index, the output of our algorithm is the same as a full text index algorithm as long as the space between the indexed suffixes is not greater than a minimum length of a MEM. By relying on partial matches and additional text scanning between indexed positions, the algorithm trades memory for extra computation. The reduced memory usage makes it possible to determine MEMs between significantly longer sequences.

AVAILABILITY

Source code for the algorithm is available under a BSD open source license at http://compbio.cs.princeton.edu/mems. The implementation can serve as a drop-in replacement for the MEMs algorithm in MUMmer 3.

摘要

动机

高通量测序技术对现有的序列分析算法提出了越来越高的要求。用于计算序列之间最大精确匹配（MEM）的算法出现在两个场景中，高通量测序将极大地增加序列数据量：（i）为基因组组装对高通量读段进行比对的种子设定，以及（ii）为基因组-基因组比较指定锚定位置。

结果

我们引入了一种用于查找MEM的新算法。该算法利用了稀疏后缀数组（SA），这是一种存储文本中每隔K个位置的文本索引。与存储文本每个位置的全文本索引相比，稀疏SA占用的内存要少得多。即使我们使用的是稀疏索引，只要索引后缀之间的间距不大于MEM的最小长度，我们算法的输出就与全文本索引算法相同。通过依赖部分匹配以及在索引位置之间进行额外的文本扫描，该算法用内存换取了额外的计算量。减少的内存使用使得确定显著更长序列之间的MEM成为可能。

可用性

该算法的源代码可在http://compbio.cs.princeton.edu/mems以BSD开源许可获取。该实现可以作为MUMmer 3中MEM算法的直接替代品。

相似文献

A practical algorithm for finding maximal exact matches in large sequence datasets using sparse suffix arrays.

Bioinformatics. 2009 Jul 1;25(13):1609-16. doi: 10.1093/bioinformatics/btp275. Epub 2009 Apr 23.

E-MEM: efficient computation of maximal exact matches for very large genomes.

Bioinformatics. 2015 Feb 15;31(4):509-14. doi: 10.1093/bioinformatics/btu687. Epub 2014 Oct 17.

Linear-time computation of minimal absent words using suffix array.

BMC Bioinformatics. 2014 Dec 20;15(1):388. doi: 10.1186/s12859-014-0388-9.

slaMEM: efficient retrieval of maximal exact matches using a sampled LCP array.

Bioinformatics. 2014 Feb 15;30(4):464-71. doi: 10.1093/bioinformatics/btt706. Epub 2013 Dec 11.

essaMEM: finding maximal exact matches using enhanced sparse suffix arrays.

Bioinformatics. 2013 Mar 15;29(6):802-4. doi: 10.1093/bioinformatics/btt042. Epub 2013 Jan 24.

A space and time-efficient index for the compacted colored de Bruijn graph.

Bioinformatics. 2018 Jul 1;34(13):i169-i177. doi: 10.1093/bioinformatics/bty292.

Space efficient computation of rare maximal exact matches between multiple sequences.

J Comput Biol. 2008 May;15(4):357-77. doi: 10.1089/cmb.2007.0105.

andi: fast and accurate estimation of evolutionary distances between closely related genomes.

Bioinformatics. 2015 Apr 15;31(8):1169-75. doi: 10.1093/bioinformatics/btu815. Epub 2014 Dec 10.

Efficient estimation of pairwise distances between genomes.

Bioinformatics. 2009 Dec 15;25(24):3221-7. doi: 10.1093/bioinformatics/btp590. Epub 2009 Oct 13.

Finding Maximal Exact Matches Using the r-Index.

J Comput Biol. 2022 Feb;29(2):188-194. doi: 10.1089/cmb.2021.0445. Epub 2022 Jan 17.

引用本文的文献

Graphite: painting genomes using a colored de Bruijn graph.

NAR Genom Bioinform. 2024 Oct 23;6(4):lqae142. doi: 10.1093/nargab/lqae142. eCollection 2024 Sep.

Mastering DNA chromatogram analysis in Sanger sequencing for reliable clinical analysis.

J Genet Eng Biotechnol. 2023 Nov 13;21(1):115. doi: 10.1186/s43141-023-00587-6.

Reference-based genome compression using the longest matched substrings with parallelization consideration.

BMC Bioinformatics. 2023 Sep 30;24(1):369. doi: 10.1186/s12859-023-05500-z.

The genome of Nautilus pompilius illuminates eye evolution and biomineralization.

Nat Ecol Evol. 2021 Jul;5(7):927-938. doi: 10.1038/s41559-021-01448-6. Epub 2021 May 10.

Calibrating Seed-Based Heuristics to Map Short Reads With Sesame.

Front Genet. 2020 Jun 25;11:572. doi: 10.3389/fgene.2020.00572. eCollection 2020.

Gclust: A Parallel Clustering Tool for Microbial Genomic Data.

Genomics Proteomics Bioinformatics. 2019 Oct;17(5):496-502. doi: 10.1016/j.gpb.2018.10.008. Epub 2020 Jan 7.

Pairwise alignment of nucleotide sequences using maximal exact matches.

BMC Bioinformatics. 2019 May 21;20(1):261. doi: 10.1186/s12859-019-2827-0.

Α Quantum Pattern Recognition Method for Improving Pairwise Sequence Alignment.

Sci Rep. 2019 May 10;9(1):7226. doi: 10.1038/s41598-019-43697-3.

De novo assembly of wheat root transcriptomes and transcriptional signature of longitudinal differentiation.

PLoS One. 2018 Nov 5;13(11):e0205582. doi: 10.1371/journal.pone.0205582. eCollection 2018.

Comparing fixed sampling with minimizer sampling when using k-mer indexes to find maximal exact matches.

PLoS One. 2018 Feb 1;13(2):e0189960. doi: 10.1371/journal.pone.0189960. eCollection 2018.

本文引用的文献

Real-time DNA sequencing from single polymerase molecules.

Science. 2009 Jan 2;323(5910):133-8. doi: 10.1126/science.1162986. Epub 2008 Nov 20.

A physical map of the 1-gigabase bread wheat chromosome 3B.

Science. 2008 Oct 3;322(5898):101-4. doi: 10.1126/science.1161847.

Space efficient computation of rare maximal exact matches between multiple sequences.

J Comput Biol. 2008 May;15(4):357-77. doi: 10.1089/cmb.2007.0105.

Bioinformatics challenges of new sequencing technology.

Trends Genet. 2008 Mar;24(3):142-9. doi: 10.1016/j.tig.2007.12.006. Epub 2008 Feb 11.

High-throughput sequence alignment using Graphics Processing Units.

BMC Bioinformatics. 2007 Dec 10;8:474. doi: 10.1186/1471-2105-8-474.

Evolution of genes and genomes on the Drosophila phylogeny.

Nature. 2007 Nov 8;450(7167):203-18. doi: 10.1038/nature06341.

GAME: a simple and efficient whole genome alignment method using maximal exact match filtering.

Comput Biol Chem. 2005 Jun;29(3):244-53. doi: 10.1016/j.compbiolchem.2005.04.004.

MAVID: constrained ancestral alignment of multiple sequences.

Genome Res. 2004 Apr;14(4):693-9. doi: 10.1101/gr.1960404.

Whole-genome shotgun assembly and comparison of human genome assemblies.

Proc Natl Acad Sci U S A. 2004 Feb 17;101(7):1916-21. doi: 10.1073/pnas.0307971100. Epub 2004 Feb 9.

Versatile and open software for comparing large genomes.

Genome Biol. 2004;5(2):R12. doi: 10.1186/gb-2004-5-2-r12. Epub 2004 Jan 30.

文献AI研究员

20分钟写一篇综述，助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型，支持多种主流文档格式。

立即体验

一种使用稀疏后缀数组在大型序列数据集中查找最大精确匹配的实用算法。

A practical algorithm for finding maximal exact matches in large sequence datasets using sparse suffix arrays.

作者信息

Khan Zia, Bloom Joshua S, Kruglyak Leonid, Singh Mona

机构信息

Department of Computer Science, Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA.

出版信息

Bioinformatics. 2009 Jul 1;25(13):1609-16. doi: 10.1093/bioinformatics/btp275. Epub 2009 Apr 23.

DOI:10.1093/bioinformatics/btp275

PMID:19389736

原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC2732316/

Abstract

MOTIVATION

RESULTS

AVAILABILITY

摘要

动机

结果

可用性

该算法的源代码可在http://compbio.cs.princeton.edu/mems以BSD开源许可获取。该实现可以作为MUMmer 3中MEM算法的直接替代品。

一种使用稀疏后缀数组在大型序列数据集中查找最大精确匹配的实用算法。

A practical algorithm for finding maximal exact matches in large sequence datasets using sparse suffix arrays.

作者信息

机构信息

出版信息

MOTIVATION

RESULTS

AVAILABILITY

动机

结果

可用性

相似文献

引用本文的文献

本文引用的文献

文献AI研究员

用中文搜PubMed

文档翻译

Suppr 超能文献

一种使用稀疏后缀数组在大型序列数据集中查找最大精确匹配的实用算法。

A practical algorithm for finding maximal exact matches in large sequence datasets using sparse suffix arrays.

作者信息

机构信息

出版信息

MOTIVATION

RESULTS

AVAILABILITY

动机

结果

可用性