• 文献检索
  • 文档翻译
  • 深度研究
  • 学术资讯
  • Suppr Zotero 插件Zotero 插件
  • 邀请有礼
  • 套餐&价格
  • 历史记录
应用&插件
Suppr Zotero 插件Zotero 插件浏览器插件Mac 客户端Windows 客户端微信小程序
定价
高级版会员购买积分包购买API积分包
服务
文献检索文档翻译深度研究API 文档MCP 服务
关于我们
关于 Suppr公司介绍联系我们用户协议隐私条款
关注我们

Suppr 超能文献

核心技术专利:CN118964589B侵权必究
粤ICP备2023148730 号-1Suppr @ 2026

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验

一种内存效率高的数据结构,用于表示精确匹配的重叠图,适用于下一代 DNA 组装。

A memory-efficient data structure representing exact-match overlap graphs with application for next-generation DNA assembly.

机构信息

Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269, USA.

出版信息

Bioinformatics. 2011 Jul 15;27(14):1901-7. doi: 10.1093/bioinformatics/btr321. Epub 2011 Jun 2.

DOI:10.1093/bioinformatics/btr321
PMID:21636593
原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC3129531/
Abstract

MOTIVATION

Exact-match overlap graphs have been broadly used in the context of DNA assembly and the shortest super string problem where the number of strings n ranges from thousands to billions. The length ℓ of the strings is from 25 to 1000, depending on the DNA sequencing technologies. However, many DNA assemblers using overlap graphs suffer from the need for too much time and space in constructing the graphs. It is nearly impossible for these DNA assemblers to handle the huge amount of data produced by the next-generation sequencing technologies where the number n of strings could be several billions. If the overlap graph is explicitly stored, it would require Ω(n(2)) memory, which could be prohibitive in practice when n is greater than a hundred million. In this article, we propose a novel data structure using which the overlap graph can be compactly stored. This data structure requires only linear time to construct and and linear memory to store.

RESULTS

For a given set of input strings (also called reads), we can informally define an exact-match overlap graph as follows. Each read is represented as a node in the graph and there is an edge between two nodes if the corresponding reads overlap sufficiently. A formal description follows. The maximal exact-match overlap of two strings x and y, denoted by ov(max)(x, y), is the longest string which is a suffix of x and a prefix of y. The exact-match overlap graph of n given strings of length ℓ is an edge-weighted graph in which each vertex is associated with a string and there is an edge (x, y) of weight ω=ℓ-|ov(max)(x, y)| if and only if ω ≤ λ, where |ov(max)(x, y)| is the length of ov(max)(x, y) and λ is a given threshold. In this article, we show that the exact-match overlap graphs can be represented by a compact data structure that can be stored using at most (2λ-1)(2⌈logn⌉+⌈logλ⌉)n bits with a guarantee that the basic operation of accessing an edge takes O(log λ) time. We also propose two algorithms for constructing the data structure for the exact-match overlap graph. The first algorithm runs in O(λℓnlogn) worse-case time and requires O(λ) extra memory. The second one runs in O(λℓn) time and requires O(n) extra memory. Our experimental results on a huge amount of simulated data from sequence assembly show that the data structure can be constructed efficiently in time and memory.

AVAILABILITY

Our DNA sequence assembler that incorporates the data structure is freely available on the web at http://www.engr.uconn.edu/~htd06001/assembler/leap.zip

摘要

动机

精确匹配重叠图在 DNA 组装和最短超字符串问题的背景下得到了广泛应用,其中字符串的数量 n 从数千到数十亿不等。字符串的长度 ℓ 从 25 到 1000 不等,具体取决于 DNA 测序技术。然而,许多使用重叠图的 DNA 组装器在构建图时需要花费太多的时间和空间。对于下一代测序技术产生的大量数据,这些 DNA 组装器几乎不可能处理,其中字符串的数量 n 可能是数十亿。如果显式存储重叠图,则需要 Ω(n(2))的内存,当 n 大于 1 亿时,这在实践中可能是不可行的。在本文中,我们提出了一种新的数据结构,使用该数据结构可以紧凑地存储重叠图。这种数据结构仅需要线性时间来构建,并且仅需要线性内存来存储。

结果

对于给定的输入字符串集(也称为读取),我们可以非正式地将精确匹配重叠图定义如下。每个读取都表示为图中的一个节点,如果对应的读取重叠足够,则两个节点之间存在边。正式描述如下。两个字符串 x 和 y 的最大精确匹配重叠,记为 ov(max)(x, y),是最长的字符串,它是 x 的后缀和 y 的前缀。给定长度为 ℓ 的 n 个字符串的精确匹配重叠图是一个带权图,其中每个顶点都与一个字符串相关联,如果且仅当 ω ≤ λ 时,存在边 (x, y),权重 ω=ℓ-|ov(max)(x, y)|,其中 |ov(max)(x, y)|是 ov(max)(x, y)的长度,λ 是给定的阈值。在本文中,我们表明可以使用最多 (2λ-1)(2⌈logn⌉+⌈logλ⌉)n 位的紧凑数据结构表示精确匹配重叠图,并保证访问边的基本操作时间为 O(log λ)。我们还提出了两种用于构建精确匹配重叠图数据结构的算法。第一种算法在最坏情况下运行时间为 O(λℓnlogn),需要 O(λ)额外的内存。第二种算法在时间和内存方面都运行 O(λℓn)。我们在来自序列组装的大量模拟数据上的实验结果表明,该数据结构可以高效地构建。

可用性

我们的包含该数据结构的 DNA 序列组装器可在 http://www.engr.uconn.edu/~htd06001/assembler/leap.zip 上免费获得。

相似文献

1
A memory-efficient data structure representing exact-match overlap graphs with application for next-generation DNA assembly.一种内存效率高的数据结构,用于表示精确匹配的重叠图,适用于下一代 DNA 组装。
Bioinformatics. 2011 Jul 15;27(14):1901-7. doi: 10.1093/bioinformatics/btr321. Epub 2011 Jun 2.
2
Readjoiner: a fast and memory efficient string graph-based sequence assembler.Readjoiner:一种快速且内存高效的基于字符串图的序列拼接器。
BMC Bioinformatics. 2012 May 6;13:82. doi: 10.1186/1471-2105-13-82.
3
Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs.用于构建大型双向 de Bruijn 图的高效并行和外核算法。
BMC Bioinformatics. 2010 Nov 15;11:560. doi: 10.1186/1471-2105-11-560.
4
Fast exact algorithms for the closest string and substring problems with application to the planted (L, d)-motif model.快速精确算法求解最接近字符串和子字符串问题及其在 (L, d)-基序模型中的应用。
IEEE/ACM Trans Comput Biol Bioinform. 2011 Sep-Oct;8(5):1400-10. doi: 10.1109/TCBB.2011.21.
5
Efficient construction of an assembly string graph using the FM-index.利用 FM 索引高效构建组装字符串图。
Bioinformatics. 2010 Jun 15;26(12):i367-73. doi: 10.1093/bioinformatics/btq217.
6
FSG: Fast String Graph Construction for De Novo Assembly.FSG:用于从头组装的快速字符串图构建
J Comput Biol. 2017 Oct;24(10):953-968. doi: 10.1089/cmb.2017.0089. Epub 2017 Jul 17.
7
Information-optimal genome assembly via sparse read-overlap graphs.通过稀疏读段重叠图实现信息最优的基因组组装
Bioinformatics. 2016 Sep 1;32(17):i494-i502. doi: 10.1093/bioinformatics/btw450.
8
Coverage-preserving sparsification of overlap graphs for long-read assembly.重叠图的覆盖保持稀疏化用于长读长组装。
Bioinformatics. 2023 Mar 1;39(3). doi: 10.1093/bioinformatics/btad124.
9
Graph mining for next generation sequencing: leveraging the assembly graph for biological insights.用于下一代测序的图挖掘:利用组装图获取生物学见解。
BMC Genomics. 2016 May 6;17:340. doi: 10.1186/s12864-016-2678-2.
10
HyDA-Vista: towards optimal guided selection of k-mer size for sequence assembly.HyDA-Vista:迈向序列组装中k-mer大小的最优引导选择
BMC Genomics. 2014;15 Suppl 10(Suppl 10):S9. doi: 10.1186/1471-2164-15-S10-S9. Epub 2014 Dec 12.

引用本文的文献

1
A Comprehensive Study of De Novo Genome Assemblers: Current Challenges and Future Prospective.从头基因组组装软件的综合研究:当前挑战与未来展望
Evol Bioinform Online. 2018 Feb 20;14:1176934318758650. doi: 10.1177/1176934318758650. eCollection 2018.
2
A Practical and Scalable Tool to Find Overlaps between Sequences.一种用于查找序列间重叠部分的实用且可扩展的工具。
Biomed Res Int. 2015;2015:905261. doi: 10.1155/2015/905261. Epub 2015 Apr 19.
3
Evolutionary potential, cross-stress behavior and the genetic basis of acquired stress resistance in Escherichia coli.大肠杆菌获得性应激抗性的进化潜力、交叉应激行为和遗传基础。
Mol Syst Biol. 2013;9:643. doi: 10.1038/msb.2012.76.
4
Readjoiner: a fast and memory efficient string graph-based sequence assembler.Readjoiner:一种快速且内存高效的基于字符串图的序列拼接器。
BMC Bioinformatics. 2012 May 6;13:82. doi: 10.1186/1471-2105-13-82.

本文引用的文献

1
Efficient construction of an assembly string graph using the FM-index.利用 FM 索引高效构建组装字符串图。
Bioinformatics. 2010 Jun 15;26(12):i367-73. doi: 10.1093/bioinformatics/btq217.
2
Genome assembly reborn: recent computational challenges.基因组组装重生:近期的计算挑战
Brief Bioinform. 2009 Jul;10(4):354-66. doi: 10.1093/bib/bbp026. Epub 2009 May 29.
3
The fragment assembly string graph.片段组装字符串图。
Bioinformatics. 2005 Sep 1;21 Suppl 2:ii79-85. doi: 10.1093/bioinformatics/bti1114.