• 文献检索
  • 文档翻译
  • 深度研究
  • 学术资讯
  • Suppr Zotero 插件Zotero 插件
  • 邀请有礼
  • 套餐&价格
  • 历史记录
应用&插件
Suppr Zotero 插件Zotero 插件浏览器插件Mac 客户端Windows 客户端微信小程序
定价
高级版会员购买积分包购买API积分包
服务
文献检索文档翻译深度研究API 文档MCP 服务
关于我们
关于 Suppr公司介绍联系我们用户协议隐私条款
关注我们

Suppr 超能文献

核心技术专利:CN118964589B侵权必究
粤ICP备2023148730 号-1Suppr @ 2026

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验

两种用于查找序列间近似重叠的高效技术。

Two Efficient Techniques to Find Approximate Overlaps between Sequences.

作者信息

Haj Rachid Maan

机构信息

Qatar University, P.O. Box 2713, Doha, Qatar.

出版信息

Biomed Res Int. 2017;2017:2731385. doi: 10.1155/2017/2731385. Epub 2017 Feb 15.

DOI:10.1155/2017/2731385
PMID:28293632
原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC5331309/
Abstract

The next-generation sequencing (NGS) technology outputs a huge number of sequences (reads) that require further processing. After applying prefiltering techniques in order to eliminate redundancy and to correct erroneous reads, an overlap-based assembler typically finds the longest exact suffix-prefix match between each ordered pair of the input reads. However, another trend has been evolving for the purpose of solving an approximate version of the overlap problem. The main benefit of this direction is the ability to skip time-consuming error-detecting techniques which are applied in the prefiltering stage. In this work, we present and compare two techniques to solve the approximate overlap problem. The first adapts a compact prefix tree to efficiently solve the approximate all-pairs suffix-prefix problem, while the other utilizes a well-known principle, namely, the pigeonhole principle, to identify a potential overlap match in order to ultimately solve the same problem. Our results show that our solution using the pigeonhole principle has better space and time consumption over an FM-based solution, while our solution based on prefix tree has the best space consumption between all three solutions. The number of mismatches (hamming distance) is used to define the approximate matching between strings in our work.

摘要

下一代测序(NGS)技术会输出大量需要进一步处理的序列(读段)。在应用预过滤技术以消除冗余并纠正错误读段之后,基于重叠的组装器通常会在输入读段的每一对有序读段之间找到最长的精确后缀-前缀匹配。然而,为了解决重叠问题的近似版本,另一种趋势正在发展。这个方向的主要好处是能够跳过在预过滤阶段应用的耗时的错误检测技术。在这项工作中,我们提出并比较了两种解决近似重叠问题的技术。第一种技术采用紧凑前缀树来高效解决近似全对后缀-前缀问题,而另一种技术利用一个著名的原理,即鸽巢原理,来识别潜在的重叠匹配,以便最终解决相同的问题。我们的结果表明,我们使用鸽巢原理的解决方案在空间和时间消耗方面优于基于FM的解决方案,而我们基于前缀树的解决方案在所有三种解决方案中具有最佳的空间消耗。在我们的工作中,错配数(汉明距离)用于定义字符串之间的近似匹配。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/073c/5331309/18e3b0fe1c62/BMRI2017-2731385.alg.003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/073c/5331309/274a6d64a9b9/BMRI2017-2731385.001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/073c/5331309/efd67d030b3e/BMRI2017-2731385.alg.001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/073c/5331309/240cbe9517d8/BMRI2017-2731385.alg.002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/073c/5331309/18e3b0fe1c62/BMRI2017-2731385.alg.003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/073c/5331309/274a6d64a9b9/BMRI2017-2731385.001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/073c/5331309/efd67d030b3e/BMRI2017-2731385.alg.001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/073c/5331309/240cbe9517d8/BMRI2017-2731385.alg.002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/073c/5331309/18e3b0fe1c62/BMRI2017-2731385.alg.003.jpg

相似文献

1
Two Efficient Techniques to Find Approximate Overlaps between Sequences.两种用于查找序列间近似重叠的高效技术。
Biomed Res Int. 2017;2017:2731385. doi: 10.1155/2017/2731385. Epub 2017 Feb 15.
2
A Practical and Scalable Tool to Find Overlaps between Sequences.一种用于查找序列间重叠部分的实用且可扩展的工具。
Biomed Res Int. 2015;2015:905261. doi: 10.1155/2015/905261. Epub 2015 Apr 19.
3
A memory-efficient data structure representing exact-match overlap graphs with application for next-generation DNA assembly.一种内存效率高的数据结构,用于表示精确匹配的重叠图,适用于下一代 DNA 组装。
Bioinformatics. 2011 Jul 15;27(14):1901-7. doi: 10.1093/bioinformatics/btr321. Epub 2011 Jun 2.
4
A short note on dynamic programming in a band.关于带动态规划的简要说明。
BMC Bioinformatics. 2018 Jun 15;19(1):226. doi: 10.1186/s12859-018-2228-9.
5
DBG2OLC: Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies.DBG2OLC:利用第三代测序技术的长错误读长进行大规模基因组的高效组装。
Sci Rep. 2016 Aug 30;6:31900. doi: 10.1038/srep31900.
6
Omega: an overlap-graph de novo assembler for metagenomics.Omega:一种用于宏基因组学的重叠图从头组装器。
Bioinformatics. 2014 Oct;30(19):2717-22. doi: 10.1093/bioinformatics/btu395. Epub 2014 Jun 19.
7
Error Tree: A Tree Structure for Hamming and Edit Distances and Wildcards Matching.错误树:用于汉明距离、编辑距离和通配符匹配的树结构。
J Comput Biol. 2015 Dec;22(12):1118-28. doi: 10.1089/cmb.2015.0132. Epub 2015 Sep 24.
8
libFLASM: a software library for fixed-length approximate string matching.libFLASM:一个用于固定长度近似字符串匹配的软件库。
BMC Bioinformatics. 2016 Nov 10;17(1):454. doi: 10.1186/s12859-016-1320-2.
9
QuorUM: An Error Corrector for Illumina Reads.QuorUM:Illumina测序读数的纠错工具
PLoS One. 2015 Jun 17;10(6):e0130821. doi: 10.1371/journal.pone.0130821. eCollection 2015.
10
Starcode: sequence clustering based on all-pairs search.星码:基于全对搜索的序列聚类。
Bioinformatics. 2015 Jun 15;31(12):1913-9. doi: 10.1093/bioinformatics/btv053. Epub 2015 Jan 31.

引用本文的文献

1
Genome assembly composition of the String "ACGT" array: a review of data structure accuracy and performance challenges.字符串“ACGT”阵列的基因组组装组成:数据结构准确性和性能挑战综述
PeerJ Comput Sci. 2023 Jul 13;9:e1180. doi: 10.7717/peerj-cs.1180. eCollection 2023.

本文引用的文献

1
A Practical and Scalable Tool to Find Overlaps between Sequences.一种用于查找序列间重叠部分的实用且可扩展的工具。
Biomed Res Int. 2015;2015:905261. doi: 10.1155/2015/905261. Epub 2015 Apr 19.
2
Using the Sadakane compressed suffix tree to solve the all-pairs suffix-prefix problem.使用笹钟根压缩后缀树解决所有后缀对前缀问题。
Biomed Res Int. 2014;2014:745298. doi: 10.1155/2014/745298. Epub 2014 Apr 16.
3
A bioinformatician's guide to the forefront of suffix array construction algorithms.生物信息学家的后缀数组构建算法前沿指南。
Brief Bioinform. 2014 Mar;15(2):138-54. doi: 10.1093/bib/bbt081. Epub 2014 Jan 10.
4
Next-generation sequence assembly: four stages of data processing and computational challenges.下一代序列组装:数据处理的四个阶段和计算挑战。
PLoS Comput Biol. 2013;9(12):e1003345. doi: 10.1371/journal.pcbi.1003345. Epub 2013 Dec 12.
5
The MaSuRCA genome assembler.马苏尔卡基因组组装器。
Bioinformatics. 2013 Nov 1;29(21):2669-77. doi: 10.1093/bioinformatics/btt476. Epub 2013 Aug 29.
6
Readjoiner: a fast and memory efficient string graph-based sequence assembler.Readjoiner:一种快速且内存高效的基于字符串图的序列拼接器。
BMC Bioinformatics. 2012 May 6;13:82. doi: 10.1186/1471-2105-13-82.
7
Efficient de novo assembly of large genomes using compressed data structures.利用压缩数据结构进行高效的从头基因组组装。
Genome Res. 2012 Mar;22(3):549-56. doi: 10.1101/gr.126953.111. Epub 2011 Dec 7.
8
Fast and SNP-tolerant detection of complex variants and splicing in short reads.快速且耐受 SNP 的短读长中复杂变体和剪接检测
Bioinformatics. 2010 Apr 1;26(7):873-81. doi: 10.1093/bioinformatics/btq057. Epub 2010 Feb 10.