• 文献检索
  • 文档翻译
  • 深度研究
  • 学术资讯
  • Suppr Zotero 插件Zotero 插件
  • 邀请有礼
  • 套餐&价格
  • 历史记录
应用&插件
Suppr Zotero 插件Zotero 插件浏览器插件Mac 客户端Windows 客户端微信小程序
定价
高级版会员购买积分包购买API积分包
服务
文献检索文档翻译深度研究API 文档MCP 服务
关于我们
关于 Suppr公司介绍联系我们用户协议隐私条款
关注我们

Suppr 超能文献

核心技术专利:CN118964589B侵权必究
粤ICP备2023148730 号-1Suppr @ 2026

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验

通过稀疏读段重叠图实现信息最优的基因组组装

Information-optimal genome assembly via sparse read-overlap graphs.

作者信息

Shomorony Ilan, Kim Samuel H, Courtade Thomas A, Tse David N C

机构信息

Department of Electrical Engineering & Computer Sciences, University of California, Berkeley, CA, USA.

Department of Electrical Engineering & Computer Sciences, University of California, Berkeley, CA, USA Department of Electrical Engineering, Stanford University, Stanford, CA, USA.

出版信息

Bioinformatics. 2016 Sep 1;32(17):i494-i502. doi: 10.1093/bioinformatics/btw450.

DOI:10.1093/bioinformatics/btw450
PMID:27587667
Abstract

MOTIVATION

In the context of third-generation long-read sequencing technologies, read-overlap-based approaches are expected to play a central role in the assembly step. A fundamental challenge in assembling from a read-overlap graph is that the true sequence corresponds to a Hamiltonian path on the graph, and, under most formulations, the assembly problem becomes NP-hard, restricting practical approaches to heuristics. In this work, we avoid this seemingly fundamental barrier by first setting the computational complexity issue aside, and seeking an algorithm that targets information limits In particular, we consider a basic feasibility question: when does the set of reads contain enough information to allow unambiguous reconstruction of the true sequence?

RESULTS

Based on insights from this information feasibility question, we present an algorithm-the Not-So-Greedy algorithm-to construct a sparse read-overlap graph. Unlike most other assembly algorithms, Not-So-Greedy comes with a performance guarantee: whenever information feasibility conditions are satisfied, the algorithm reduces the assembly problem to an Eulerian path problem on the resulting graph, and can thus be solved in linear time. In practice, this theoretical guarantee translates into assemblies of higher quality. Evaluations on both simulated reads from real genomes and a PacBio Escherichia coli K12 dataset demonstrate that Not-So-Greedy compares favorably with standard string graph approaches in terms of accuracy of the resulting read-overlap graph and contig N50.

AVAILABILITY

Available at github.com/samhykim/nsg

CONTACT

courtade@eecs.berkeley.edu or dntse@stanford.edu

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

摘要

动机

在第三代长读长测序技术的背景下,基于读段重叠的方法有望在组装步骤中发挥核心作用。从读段重叠图进行组装的一个基本挑战是,真实序列对应于图上的一条哈密顿路径,并且在大多数情况下,组装问题会变成NP难问题,这使得实际方法只能采用启发式算法。在这项工作中,我们先将计算复杂性问题搁置一旁,转而寻求一种针对信息限制的算法,从而避开了这个看似根本性的障碍。具体而言,我们考虑一个基本的可行性问题:读段集合何时包含足够的信息以实现对真实序列的明确重建?

结果

基于对这个信息可行性问题的见解,我们提出了一种算法——非贪婪算法——来构建一个稀疏读段重叠图。与大多数其他组装算法不同,非贪婪算法具有性能保证:只要满足信息可行性条件,该算法就能将组装问题简化为所得图上的欧拉路径问题,因此可以在线性时间内求解。在实际应用中,这种理论保证转化为更高质量的组装结果。对来自真实基因组的模拟读段以及一个PacBio大肠杆菌K12数据集的评估表明,在所得读段重叠图的准确性和重叠群N50方面,非贪婪算法优于标准的弦图方法。

可用性

可在github.com/samhykim/nsg获取

联系方式

courtade@eecs.berkeley.edu或dntse@stanford.edu

补充信息

补充数据可在《生物信息学》在线获取。

相似文献

1
Information-optimal genome assembly via sparse read-overlap graphs.通过稀疏读段重叠图实现信息最优的基因组组装
Bioinformatics. 2016 Sep 1;32(17):i494-i502. doi: 10.1093/bioinformatics/btw450.
2
An efficient and scalable graph modeling approach for capturing information at different levels in next generation sequencing reads.一种高效且可扩展的图模型构建方法,用于捕获下一代测序读段中不同层次的信息。
BMC Bioinformatics. 2013;14 Suppl 11(Suppl 11):S7. doi: 10.1186/1471-2105-14-S11-S7. Epub 2013 Nov 4.
3
Coverage-preserving sparsification of overlap graphs for long-read assembly.重叠图的覆盖保持稀疏化用于长读长组装。
Bioinformatics. 2023 Mar 1;39(3). doi: 10.1093/bioinformatics/btad124.
4
Improving the sensitivity of long read overlap detection using grouped short k-mer matches.利用分组短 k-mer 匹配提高长读重叠检测的灵敏度。
BMC Genomics. 2019 Apr 4;20(Suppl 2):190. doi: 10.1186/s12864-019-5475-x.
5
Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs.用于构建大型双向 de Bruijn 图的高效并行和外核算法。
BMC Bioinformatics. 2010 Nov 15;11:560. doi: 10.1186/1471-2105-11-560.
6
Integration of string and de Bruijn graphs for genome assembly.用于基因组组装的弦图与德布鲁因图整合
Bioinformatics. 2016 May 1;32(9):1301-7. doi: 10.1093/bioinformatics/btw011. Epub 2016 Jan 10.
7
Graph mining for next generation sequencing: leveraging the assembly graph for biological insights.用于下一代测序的图挖掘:利用组装图获取生物学见解。
BMC Genomics. 2016 May 6;17:340. doi: 10.1186/s12864-016-2678-2.
8
A memory-efficient data structure representing exact-match overlap graphs with application for next-generation DNA assembly.一种内存效率高的数据结构,用于表示精确匹配的重叠图,适用于下一代 DNA 组装。
Bioinformatics. 2011 Jul 15;27(14):1901-7. doi: 10.1093/bioinformatics/btr321. Epub 2011 Jun 2.
9
Read mapping on de Bruijn graphs.在德布鲁因图上进行读段映射。
BMC Bioinformatics. 2016 Jun 16;17(1):237. doi: 10.1186/s12859-016-1103-9.
10
Bit-parallel sequence-to-graph alignment.位并行序列到图的对齐。
Bioinformatics. 2019 Oct 1;35(19):3599-3607. doi: 10.1093/bioinformatics/btz162.

引用本文的文献

1
Sama: a contig assembler with correctness guarantee.Sama:一种具有正确性保证的重叠群组装程序。
Algorithms Mol Biol. 2025 Jun 3;20(1):9. doi: 10.1186/s13015-025-00280-y.
2
Mechanisms for Hiding Sensitive Genotypes with Information-Theoretic Privacy.利用信息论隐私隐藏敏感基因型的机制
IEEE Trans Inf Theory. 2022 Jun;68(6):4090-4105. doi: 10.1109/tit.2022.3156276. Epub 2022 Mar 3.
3
Coverage-preserving sparsification of overlap graphs for long-read assembly.重叠图的覆盖保持稀疏化用于长读长组装。
Bioinformatics. 2023 Mar 1;39(3). doi: 10.1093/bioinformatics/btad124.
4
Improving RNA Assembly via Safety and Completeness in Flow Decompositions.通过流分解中的安全性和完整性提高 RNA 组装
J Comput Biol. 2022 Dec;29(12):1270-1287. doi: 10.1089/cmb.2022.0261. Epub 2022 Oct 25.
5
Skmer: assembly-free and alignment-free sample identification using genome skims.Skmer:使用基因组草图进行无组装和无比对的样本识别。
Genome Biol. 2019 Feb 13;20(1):34. doi: 10.1186/s13059-019-1632-4.
6
Optimal compressed representation of high throughput sequence data via light assembly.通过轻量级组装实现高通量序列数据的最优压缩表示
Nat Commun. 2018 Feb 8;9(1):566. doi: 10.1038/s41467-017-02480-6.
7
HINGE: long-read assembly achieves optimal repeat resolution.HINGE:长读长组装可实现最佳的重复序列解析。
Genome Res. 2017 May;27(5):747-756. doi: 10.1101/gr.216465.116. Epub 2017 Mar 20.