SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing.

Author information

Dohm Juliane C, Lottaz Claudio, Borodina Tatiana, Himmelbauer Heinz

Affiliation

Max-Planck-Institute for Molecular Genetics, 14195 Berlin-Dahlem, Germany.

Publication information

Genome Res. 2007 Nov;17(11):1697-706. doi: 10.1101/gr.6435207. Epub 2007 Oct 1.

DOI:10.1101/gr.6435207

PMID:17908823

Full-text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC2045152/

Abstract

The latest revolution in the DNA sequencing field has been brought about by the development of automated sequencers that are capable of generating giga base pair data sets quickly and at low cost. Applications of such technologies seem to be limited to resequencing and transcript discovery, due to the shortness of the generated reads. In order to extend the fields of application to de novo sequencing, we developed the SHARCGS algorithm to assemble short-read (25-40-mer) data with high accuracy and speed. The efficiency of SHARCGS was tested on BAC inserts from three eukaryotic species, on two yeast chromosomes, and on two bacterial genomes (Haemophilus influenzae, Escherichia coli). We show that 30-mer-based BAC assemblies have N50 sizes >20 kbp for Drosophila and Arabidopsis and >4 kbp for human in simulations taking missing reads and wrong base calls into account. We assembled 949,974 contigs with length >50 bp, and only one single contig could not be aligned error-free against the reference sequences. We generated 36-mer reads for the genome of Helicobacter acinonychis on the Illumina 1G sequencing instrument and assembled 937 contigs covering 98% of the genome with an N50 size of 3.7 kbp. With the exception of five contigs that differ in 1-4 positions relative to the reference sequence, all contigs matched the genome error-free. Thus, SHARCGS is a suitable tool for fully exploiting novel sequencing technologies by assembling sequence contigs de novo with high confidence and by outperforming existing assembly algorithms in terms of speed and accuracy.

Similar articles

Benchmarking of de novo assembly algorithms for Nanopore data reveals optimal performance of OLC approaches.

BMC Genomics. 2016 Aug 22;17 Suppl 7(Suppl 7):507. doi: 10.1186/s12864-016-2895-8.

Performance comparison of second- and third-generation sequencers using a bacterial genome with two chromosomes.

BMC Genomics. 2014 Aug 21;15(1):699. doi: 10.1186/1471-2164-15-699.

Brief Bioinform. 2021 Sep 2;22(5). doi: 10.1093/bib/bbab022.

Identifying wrong assemblies in de novo short read primary sequence assembly contigs.

J Biosci. 2016 Sep;41(3):455-74. doi: 10.1007/s12038-016-9630-0.

SOPRA: Scaffolding algorithm for paired reads via statistical optimization.

BMC Bioinformatics. 2010 Jun 24;11:345. doi: 10.1186/1471-2105-11-345.

Pebble and rock band: heuristic resolution of repeats and scaffolding in the velvet short-read de novo assembler.

PLoS One. 2009 Dec 22;4(12):e8407. doi: 10.1371/journal.pone.0008407.

COPE: an accurate k-mer-based pair-end reads connection tool to facilitate genome assembly.

Bioinformatics. 2012 Nov 15;28(22):2870-4. doi: 10.1093/bioinformatics/bts563. Epub 2012 Oct 8.

Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm.

Genome Res. 2017 May;27(5):787-792. doi: 10.1101/gr.213405.116. Epub 2017 Jan 27.

Nanopore sequencing and full genome de novo assembly of human cytomegalovirus TB40/E reveals clonal diversity and structural variations.

BMC Genomics. 2018 Aug 2;19(1):577. doi: 10.1186/s12864-018-4949-6.

Cited by

Conventional and Omics Approaches for Understanding the Abiotic Stress Response in Cereal Crops-An Updated Overview.

Plants (Basel). 2022 Oct 26;11(21):2852. doi: 10.3390/plants11212852.

Empirical evaluation of methods for genome assembly.

PeerJ Comput Sci. 2021 Jul 9;7:e636. doi: 10.7717/peerj-cs.636. eCollection 2021.

Features of sRNA biogenesis in rice revealed by genetic dissection of sRNA expression level.

Comput Struct Biotechnol J. 2020 Oct 23;18:3207-3216. doi: 10.1016/j.csbj.2020.10.012. eCollection 2020.

GAAP: A Genome Assembly + Annotation Pipeline.

Biomed Res Int. 2019 Jun 26;2019:4767354. doi: 10.1155/2019/4767354. eCollection 2019.

BMC Genomics. 2019 Jun 6;20(Suppl 5):425. doi: 10.1186/s12864-019-5702-5.

GMASS: a novel measure for genome assembly structural similarity.

BMC Bioinformatics. 2019 Mar 18;20(1):147. doi: 10.1186/s12859-019-2710-z.

TraRECo: a greedy approach based de novo transcriptome assembler with read error correction using consensus matrix.

BMC Genomics. 2018 Sep 4;19(1):653. doi: 10.1186/s12864-018-5034-x.

Survey of gene splicing algorithms based on reads.

Bioengineered. 2017 Nov 2;8(6):750-758. doi: 10.1080/21655979.2017.1373538. Epub 2017 Sep 21.

A scalable and memory-efficient algorithm for de novo transcriptome assembly of non-model organisms.

BMC Genomics. 2017 May 24;18(Suppl 4):387. doi: 10.1186/s12864-017-3735-1.

The A, C, G, and T of Genome Assembly.

Biomed Res Int. 2016;2016:6329217. doi: 10.1155/2016/6329217. Epub 2016 May 10.

References

Assembling millions of short DNA sequences using SSAKE.

Bioinformatics. 2007 Feb 15;23(4):500-1. doi: 10.1093/bioinformatics/btl629. Epub 2006 Dec 8.

Whole-genome re-sequencing.

Curr Opin Genet Dev. 2006 Dec;16(6):545-52. doi: 10.1016/j.gde.2006.10.009. Epub 2006 Oct 18.

Who ate whom? Adaptive Helicobacter genomic changes that accompanied a host jump from early humans to large felines.

PLoS Genet. 2006 Jul;2(7):e120. doi: 10.1371/journal.pgen.0020120. Epub 2006 Jun 15.

Genome sequencing in microfabricated high-density picolitre reactors.

Nature. 2005 Sep 15;437(7057):376-80. doi: 10.1038/nature03959. Epub 2005 Jul 31.

Whole-genome sequence assembly for mammalian genomes: Arachne 2.

Genome Res. 2003 Jan;13(1):91-6. doi: 10.1101/gr.828403.

The phusion assembler.

Genome Res. 2003 Jan;13(1):81-90. doi: 10.1101/gr.731003.

Initial sequencing and comparative analysis of the mouse genome.

Nature. 2002 Dec 5;420(6915):520-62. doi: 10.1038/nature01262.

Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes.

Science. 2002 Aug 23;297(5585):1301-10. doi: 10.1126/science.1072104. Epub 2002 Jul 25.

ARACHNE: a whole-genome shotgun assembler.

Genome Res. 2002 Jan;12(1):177-89. doi: 10.1101/gr.208902.

An Eulerian path approach to DNA fragment assembly.

Proc Natl Acad Sci U S A. 2001 Aug 14;98(17):9748-53. doi: 10.1073/pnas.171285098.
