• 文献检索
  • 文档翻译
  • 深度研究
  • 学术资讯
  • Suppr Zotero 插件Zotero 插件
  • 邀请有礼
  • 套餐&价格
  • 历史记录
应用&插件
Suppr Zotero 插件Zotero 插件浏览器插件Mac 客户端Windows 客户端微信小程序
定价
高级版会员购买积分包购买API积分包
服务
文献检索文档翻译深度研究API 文档MCP 服务
关于我们
关于 Suppr公司介绍联系我们用户协议隐私条款
关注我们

Suppr 超能文献

核心技术专利:CN118964589B侵权必究
粤ICP备2023148730 号-1Suppr @ 2026

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验

利用 FM 索引高效构建组装字符串图。

Efficient construction of an assembly string graph using the FM-index.

机构信息

Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, UK.

出版信息

Bioinformatics. 2010 Jun 15;26(12):i367-73. doi: 10.1093/bioinformatics/btq217.

DOI:10.1093/bioinformatics/btq217
PMID:20529929
原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC2881401/
Abstract

MOTIVATION

Sequence assembly is a difficult problem whose importance has grown again recently as the cost of sequencing has dramatically dropped. Most new sequence assembly software has started by building a de Bruijn graph, avoiding the overlap-based methods used previously because of the computational cost and complexity of these with very large numbers of short reads. Here, we show how to use suffix array-based methods that have formed the basis of recent very fast sequence mapping algorithms to find overlaps and generate assembly string graphs asymptotically faster than previously described algorithms.

RESULTS

Standard overlap assembly methods have time complexity O(N(2)), where N is the sum of the lengths of the reads. We use the Ferragina-Manzini index (FM-index) derived from the Burrows-Wheeler transform to find overlaps of length at least tau among a set of reads. As well as an approach that finds all overlaps then implements transitive reduction to produce a string graph, we show how to output directly only the irreducible overlaps, significantly shrinking memory requirements and reducing compute time to O(N), independent of depth. Overlap-based assembly methods naturally handle mixed length read sets, including capillary reads or long reads promised by the third generation sequencing technologies. The algorithms we present here pave the way for overlap-based assembly approaches to be developed that scale to whole vertebrate genome de novo assembly.

摘要

动机

序列组装是一个难题,随着测序成本的大幅降低,其重要性再次凸显。大多数新的序列组装软件都是从构建 de Bruijn 图开始的,因为以前使用的基于重叠的方法由于计算成本和非常大量的短读的复杂性而被回避。在这里,我们展示了如何使用后缀数组方法,这些方法已成为最近非常快速的序列映射算法的基础,以比以前描述的算法更快的渐近速度找到重叠并生成组装字符串图。

结果

标准的重叠组装方法的时间复杂度为 O(N(2)),其中 N 是读长的总和。我们使用源自 Burrows-Wheeler 变换的 Ferragina-Manzini 索引 (FM-index) 在一组读长中找到至少 tau 长的重叠。除了一种找到所有重叠然后实现传递约简以生成字符串图的方法外,我们还展示了如何直接仅输出不可约重叠,大大缩小内存需求并将计算时间减少到 O(N),与深度无关。基于重叠的组装方法自然可以处理混合长度的读长集,包括毛细管读长或第三代测序技术承诺的长读长。我们在这里提出的算法为基于重叠的组装方法的开发铺平了道路,这些方法可以扩展到整个脊椎动物基因组从头组装。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2e32/2881401/0253daf9a9d0/btq217f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2e32/2881401/0bc8b54c321e/btq217f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2e32/2881401/0253daf9a9d0/btq217f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2e32/2881401/0bc8b54c321e/btq217f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2e32/2881401/0253daf9a9d0/btq217f2.jpg

相似文献

1
Efficient construction of an assembly string graph using the FM-index.利用 FM 索引高效构建组装字符串图。
Bioinformatics. 2010 Jun 15;26(12):i367-73. doi: 10.1093/bioinformatics/btq217.
2
FSG: Fast String Graph Construction for De Novo Assembly.FSG:用于从头组装的快速字符串图构建
J Comput Biol. 2017 Oct;24(10):953-968. doi: 10.1089/cmb.2017.0089. Epub 2017 Jul 17.
3
String graph construction using incremental hashing.使用增量哈希的字符串图构建。
Bioinformatics. 2014 Dec 15;30(24):3515-23. doi: 10.1093/bioinformatics/btu578. Epub 2014 Sep 2.
4
Readjoiner: a fast and memory efficient string graph-based sequence assembler.Readjoiner:一种快速且内存高效的基于字符串图的序列拼接器。
BMC Bioinformatics. 2012 May 6;13:82. doi: 10.1186/1471-2105-13-82.
5
Efficient de novo assembly of large genomes using compressed data structures.利用压缩数据结构进行高效的从头基因组组装。
Genome Res. 2012 Mar;22(3):549-56. doi: 10.1101/gr.126953.111. Epub 2011 Dec 7.
6
Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs.用于构建大型双向 de Bruijn 图的高效并行和外核算法。
BMC Bioinformatics. 2010 Nov 15;11:560. doi: 10.1186/1471-2105-11-560.
7
LSG: An External-Memory Tool to Compute String Graphs for Next-Generation Sequencing Data Assembly.LSG:一种用于为下一代测序数据组装计算字符串图的外部存储工具。
J Comput Biol. 2016 Mar;23(3):137-49. doi: 10.1089/cmb.2015.0172.
8
Benchmarking of de novo assembly algorithms for Nanopore data reveals optimal performance of OLC approaches.用于纳米孔数据的从头组装算法基准测试揭示了重叠布局一致(OLC)方法的最佳性能。
BMC Genomics. 2016 Aug 22;17 Suppl 7(Suppl 7):507. doi: 10.1186/s12864-016-2895-8.
9
Compact representation of k-mer de Bruijn graphs for genome read assembly.用于基因组读取组装的 k-mer de Bruijn 图的紧凑表示。
BMC Bioinformatics. 2013 Oct 23;14:313. doi: 10.1186/1471-2105-14-313.
10
SAGE: String-overlap Assembly of GEnomes.SAGE:基因组的字符串重叠组装。
BMC Bioinformatics. 2014 Sep 15;15(1):302. doi: 10.1186/1471-2105-15-302.

引用本文的文献

1
Theoretical Analysis of Sequencing Bioinformatics Algorithms and Beyond.测序生物信息学算法及其他方面的理论分析
Commun ACM. 2023 Jul;66(7):118-125. doi: 10.1145/3571723. Epub 2023 Jun 22.
2
Telomere-to-telomere assembly of cassava genome reveals the evolution of cassava and divergence of allelic expression.木薯基因组的端粒到端粒组装揭示了木薯的进化和等位基因表达的差异。
Hortic Res. 2023 Oct 5;10(11):uhad200. doi: 10.1093/hr/uhad200. eCollection 2023 Nov.
3
Parasite manipulation of host phenotypes inferred from transcriptional analyses in a trematode-amphipod system.

本文引用的文献

1
Fast and accurate long-read alignment with Burrows-Wheeler transform.基于 Burrows-Wheeler 变换的快速准确长读比对。
Bioinformatics. 2010 Mar 1;26(5):589-95. doi: 10.1093/bioinformatics/btp698. Epub 2010 Jan 15.
2
SOAP2: an improved ultrafast tool for short read alignment.SOAP2:一种用于短读序列比对的改进型超快速工具。
Bioinformatics. 2009 Aug 1;25(15):1966-7. doi: 10.1093/bioinformatics/btp336. Epub 2009 Jun 3.
3
Genome assembly reborn: recent computational challenges.基因组组装重生:近期的计算挑战
从扁形动物-端足类系统中转录分析推断寄生虫对宿主表型的操纵。
Mol Ecol. 2023 Sep;32(18):5028-5041. doi: 10.1111/mec.17093. Epub 2023 Aug 4.
4
Evaluation of Different SNP Analysis Software and Optimal Mining Process in Tree Species.不同SNP分析软件及树种最佳挖掘流程的评估
Life (Basel). 2023 Apr 22;13(5):1069. doi: 10.3390/life13051069.
5
Weaving place-based knowledge for culturally significant species in the age of genomics: Looking to the past to navigate the future.在基因组学时代编织关于具有文化意义物种的地方知识:回首过去,展望未来。
Evol Appl. 2022 Apr 8;15(5):751-772. doi: 10.1111/eva.13367. eCollection 2022 May.
6
An optimized FM-index library for nucleotide and amino acid search.一个用于核苷酸和氨基酸搜索的优化FM索引库。
Algorithms Mol Biol. 2021 Dec 31;16(1):25. doi: 10.1186/s13015-021-00204-6.
7
High contiguity de novo genome assembly and DNA modification analyses for the fungus fly, Sciara coprophila, using single-molecule sequencing.利用单分子测序技术对粪蝇 Sciara coprophila 进行高连续性从头基因组组装和 DNA 修饰分析。
BMC Genomics. 2021 Sep 6;22(1):643. doi: 10.1186/s12864-021-07926-2.
8
Genome-scale sequencing and analysis of human, wolf, and bison DNA from 25,000-year-old sediment.从 2.5 万年前的沉积物中提取的人类、狼和野牛的基因组测序和分析。
Curr Biol. 2021 Aug 23;31(16):3564-3574.e9. doi: 10.1016/j.cub.2021.06.023. Epub 2021 Jul 12.
9
gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections.gsufsort:为字符串集合构建后缀数组、最长公共前缀数组和Burrows-Wheeler变换
Algorithms Mol Biol. 2020 Sep 22;15:18. doi: 10.1186/s13015-020-00177-y. eCollection 2020.
10
OGRE: Overlap Graph-based metagenomic Read clustEring.OGRE:基于重叠图的宏基因组读聚类。
Bioinformatics. 2021 May 17;37(7):905-912. doi: 10.1093/bioinformatics/btaa760.
Brief Bioinform. 2009 Jul;10(4):354-66. doi: 10.1093/bib/bbp026. Epub 2009 May 29.
4
Fast and accurate short read alignment with Burrows-Wheeler transform.使用Burrows-Wheeler变换进行快速准确的短读比对。
Bioinformatics. 2009 Jul 15;25(14):1754-60. doi: 10.1093/bioinformatics/btp324. Epub 2009 May 18.
5
Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.短DNA序列与人类基因组的超快速且内存高效比对。
Genome Biol. 2009;10(3):R25. doi: 10.1186/gb-2009-10-3-r25. Epub 2009 Mar 4.
6
ABySS: a parallel assembler for short read sequence data.ABySS:一种用于短读长序列数据的并行汇编器。
Genome Res. 2009 Jun;19(6):1117-23. doi: 10.1101/gr.089532.108. Epub 2009 Feb 27.
7
Velvet: algorithms for de novo short read assembly using de Bruijn graphs.《天鹅绒:使用德布鲁因图进行从头短读长拼接的算法》
Genome Res. 2008 May;18(5):821-9. doi: 10.1101/gr.074492.107. Epub 2008 Mar 18.
8
De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer.从头开始的细菌基因组测序:在台式计算机上组装数百万条非常短的读段。
Genome Res. 2008 May;18(5):802-9. doi: 10.1101/gr.072033.107. Epub 2008 Mar 10.
9
Short read fragment assembly of bacterial genomes.细菌基因组的短读片段组装
Genome Res. 2008 Feb;18(2):324-30. doi: 10.1101/gr.7088808. Epub 2007 Dec 14.
10
The fragment assembly string graph.片段组装字符串图。
Bioinformatics. 2005 Sep 1;21 Suppl 2:ii79-85. doi: 10.1093/bioinformatics/bti1114.