BASE: a practical de novo assembler for large genomes using long NGS reads.

Authors

Liu Binghang, Liu Chi-Man, Li Dinghua, Li Yingrui, Ting Hing-Fung, Yiu Siu-Ming, Luo Ruibang, Lam Tak-Wah

Affiliation

Bioinformatics Algorithms Research Laboratory, Department of Computer Science, University of Hong Kong, Pokfulam, Hong Kong.

Publication

BMC Genomics. 2016 Aug 31;17 Suppl 5(Suppl 5):499. doi: 10.1186/s12864-016-2829-5.

DOI:10.1186/s12864-016-2829-5
PMID:27586129
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5009518/

Abstract

BACKGROUND

De novo genome assembly using NGS data remains a computation-intensive task, especially for large genomes. In practice, efficiency is often the primary concern, which favors a fast assembler such as SOAPdenovo2. Yet SOAPdenovo2, which is based on the de Bruijn graph, fails to take full advantage of longer NGS reads (say, 150 bp to 250 bp from Illumina HiSeq and MiSeq). Assemblers based on string graphs (e.g., SGA), though less popular and very slow, are better suited to longer reads.
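
One way to see why a de Bruijn graph may under-use long reads is the toy sketch below. It is not SOAPdenovo2's implementation; the function names `de_bruijn_edges` and `branching_nodes` and the two example reads are made up for illustration. Each read is cut into k-mers, so any repeat longer than k-1 becomes a branching node even when a single read spans the repeat and its unique flanks.

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def branching_nodes(graph):
    """Return the nodes with more than one successor, i.e. ambiguous extensions."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)

# Two hypothetical reads that share the repeat "ACGT" in different contexts.
reads = ["TTACGTGG", "CCACGTAA"]

# A short k splits the repeat away from its flanks and creates a branch.
print(branching_nodes(de_bruijn_edges(reads, 4)))
# A k close to the read length keeps the context, so no branch remains.
print(branching_nodes(de_bruijn_edges(reads, 7)))
```

In this toy case the branch disappears once k approaches the read length, which is the information an overlap-based or seed-extension method can use directly from the full read.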

METHODS

This paper presents a new de novo assembler called BASE. It enhances the classic seed-extension approach by indexing the reads efficiently to generate adaptive seeds that have a high probability of appearing uniquely in the genome. These seeds form the basis on which BASE builds extension trees and then uses reverse validation to remove branches based on read coverage and paired-end information, yielding high-quality consensus sequences of the reads that share each seed. These consensus sequences are then extended into contigs.
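
The idea of an adaptive seed can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not state: the seed is grown one base at a time until its occurrence count in the read set drops to a threshold (`max_count`, a stand-in for the expected sequencing depth), and a naive substring scan replaces the efficient read index that BASE uses. The names `count_occurrences` and `adaptive_seed` and the example reads are hypothetical.

```python
def count_occurrences(reads, pattern):
    """Count occurrences of pattern across all reads (overlaps included)."""
    total = 0
    for read in reads:
        start = read.find(pattern)
        while start != -1:
            total += 1
            start = read.find(pattern, start + 1)
    return total

def adaptive_seed(reads, read, start, min_len, max_count):
    """Grow a seed from read[start:] until it occurs at most max_count times.

    Assumption: a seed seen no more often than the expected depth is
    likely to come from a single genomic location.
    Returns None if even the longest candidate is still too frequent.
    """
    for length in range(min_len, len(read) - start + 1):
        seed = read[start:start + length]
        if count_occurrences(reads, seed) <= max_count:
            return seed
    return None

# Hypothetical reads: "ACG" is repetitive, "ACGT" is not.
reads = ["ACGTTTGA", "CCACGTTT", "ACGAAGGC", "TTACGAAG"]
print(adaptive_seed(reads, reads[0], start=0, min_len=3, max_count=2))
```

A seed in a repetitive region ends up longer than one in a unique region, which is what makes the seed length adaptive rather than a fixed k.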

RESULTS

Experiments on two bacterial and four human datasets show the advantage of BASE in both contig quality and speed when dealing with longer reads. In the bacterial experiment, two datasets with read lengths of 100 bp and 250 bp were used. For the 250 bp dataset in particular, BASE gives much better quality than SOAPdenovo2 and SGA and is similar to SPAdes. Regarding speed, BASE is consistently a few times faster than SPAdes and SGA, but still slower than SOAPdenovo2. BASE and SOAPdenovo2 are further compared using human datasets with read lengths of 100 bp, 150 bp and 250 bp. BASE shows a higher N50 for all datasets, and the improvement becomes more significant when the read length reaches 250 bp. In addition, BASE is more memory-efficient than SOAPdenovo2 on sequencing data that contain errors.
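
For readers unfamiliar with the N50 metric used above, the following sketch computes it under the common definition: the length of the contig at which the running total, taken from the longest contig downward, first reaches half of the total assembly length. The function name `n50` and the example lengths are illustrative and are not taken from the paper.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running sum to half the total is the N50.
        if running >= half_total:
            return length
    return 0

# Total length is 54; 10 + 9 + 8 = 27 reaches half, so the N50 is 8.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

A higher N50 means that more of the assembly sits in long contigs, although it says nothing about whether those contigs are correct.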

CONCLUSIONS

BASE is a practically efficient tool for constructing contigs, with a significant improvement in quality for long NGS reads. It is relatively easy to extend BASE to include scaffolding.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1512/5009518/2c8e2cded8ec/12864_2016_2829_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1512/5009518/9e0bb67b09c2/12864_2016_2829_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1512/5009518/b41a1dff384d/12864_2016_2829_Fig2_HTML.jpg
