NeatFreq: reference-free data reduction and coverage normalization for De Novo sequence assembly.

Author information

McCorrison Jamison M, Venepally Pratap, Singh Indresh, Fouts Derrick E, Lasken Roger S, Methé Barbara A

Affiliations

Informatics Core Services, The J. Craig Venter Institute (JCVI), 9704 Medical Center Drive, Rockville, MD, 20850, USA.

Department of Genomic Medicine, The J. Craig Venter Institute (JCVI), 9704 Medical Center Drive, Rockville, MD, 20850, USA.

Publication information

BMC Bioinformatics. 2014 Nov 19;15(1):357. doi: 10.1186/s12859-014-0357-3.

DOI:10.1186/s12859-014-0357-3

PMID:25407910

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4245761/

Abstract

BACKGROUND

Deep shotgun sequencing on next-generation sequencing (NGS) platforms has contributed significant amounts of data to enrich our understanding of genomes, transcriptomes, amplified single-cell genomes, and metagenomes. However, deep coverage variations in short-read data sets and the high sequencing error rates of modern sequencers present new computational challenges in data interpretation, including mapping and de novo assembly. New lab techniques such as multiple displacement amplification (MDA) of single cells and sequence-independent single-primer amplification (SISPA) allow for sequencing of organisms that cannot be cultured, but generate highly variable coverage due to amplification biases.

RESULTS

Here we introduce NeatFreq, a software tool that reduces a data set to more uniform coverage by clustering and selecting from reads binned by their median k-mer frequency (RMKF) and uniqueness. Previous algorithms normalize read coverage based on RMKF, but do not include methods for the preferred selection of (1) extremely low coverage regions produced by extremely variable sequencing of random-primed products and (2) 2-sided paired-end sequences. The algorithm increases the incorporation of the most unique, lowest-coverage segments of a genome using an error-corrected data set. NeatFreq was applied to bacterial, viral plaque, and single-cell sequencing data. The algorithm increased the rate at which the most unique reads in a genome were included in the assembled consensus while also reducing the count of duplicative and erroneous contigs (strings of high-confidence overlaps) in the deliverable consensus. The results obtained from conventional Overlap-Layout-Consensus (OLC) assembly were compared to simulated multi-de Bruijn graph assembly alternatives trained for variable coverage input, using sequence before and after normalization of coverage. Coverage reduction was shown to increase processing speed and reduce memory requirements when using conventional bacterial assembly algorithms.
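The idea of binning reads by median k-mer frequency and then selecting from each bin can be illustrated with a small Python sketch. This is not NeatFreq's implementation: the function names, the `target` parameter, and the proportional random subsampling rule are all assumptions made for illustration, and the sketch ignores the uniqueness-based selection, paired-end handling, and error correction that the abstract describes.

```python
from collections import Counter, defaultdict
from statistics import median
import random


def kmers(seq, k):
    """Return every overlapping k-mer in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads, k):
    """Count k-mer occurrences across the whole read set."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def rmkf(read, counts, k):
    """Median frequency of the read's k-mers, floored to an integer bin label."""
    freqs = [counts[km] for km in kmers(read, k)]
    return int(median(freqs)) if freqs else 0


def normalize_coverage(reads, k=5, target=4, seed=0):
    """Hypothetical selection rule: keep low-frequency bins whole and
    randomly subsample each higher bin by the factor target / bin frequency."""
    rng = random.Random(seed)
    counts = count_kmers(reads, k)

    # Group reads into bins keyed by their median k-mer frequency.
    bins = defaultdict(list)
    for read in reads:
        bins[rmkf(read, counts, k)].append(read)

    kept = []
    for freq in sorted(bins):
        members = bins[freq]
        if freq <= target:
            # Low-coverage reads are retained in full.
            kept.extend(members)
        else:
            # Over-represented reads are thinned, keeping at least one.
            n = max(1, round(len(members) * target / freq))
            kept.extend(rng.sample(members, n))
    return kept


# Toy input: one heavily over-sampled read plus one unique read.
reads = ["ACGTACGTAC"] * 20 + ["TTGACCAGGT"]
kept = normalize_coverage(reads, k=5, target=4)
print(len(reads), "->", len(kept))
```

In this toy input the over-sampled read falls into a high-frequency bin and is thinned, while the unique read sits in the lowest bin and is always retained, which mirrors the stated goal of preserving low-coverage regions while flattening coverage spikes.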

CONCLUSIONS

The normalization of deep coverage spikes, which would otherwise inhibit consensus resolution, enables High Throughput Sequencing (HTS) assembly projects to consistently run to completion with existing assembly software. The NeatFreq software package is free, open source and available at https://github.com/bioh4x/NeatFreq.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8574/4245761/29951d4b9fd8/12859_2014_357_Fig1_HTML.jpg

Similar articles

Benchmarking of de novo assembly algorithms for Nanopore data reveals optimal performance of OLC approaches.

BMC Genomics. 2016 Aug 22;17 Suppl 7(Suppl 7):507. doi: 10.1186/s12864-016-2895-8.

Assembly of long error-prone reads using de Bruijn graphs.

Proc Natl Acad Sci U S A. 2016 Dec 27;113(52):E8396-E8405. doi: 10.1073/pnas.1604560113. Epub 2016 Dec 12.

Subset selection of high-depth next generation sequencing reads for de novo genome assembly using MapReduce framework.

BMC Genomics. 2015;16 Suppl 12(Suppl 12):S9. doi: 10.1186/1471-2164-16-S12-S9. Epub 2015 Dec 9.

Improving the sensitivity of long read overlap detection using grouped short k-mer matches.

BMC Genomics. 2019 Apr 4;20(Suppl 2):190. doi: 10.1186/s12864-019-5475-x.

QuorUM: An Error Corrector for Illumina Reads.

PLoS One. 2015 Jun 17;10(6):e0130821. doi: 10.1371/journal.pone.0130821. eCollection 2015.

A spectral algorithm for fast de novo layout of uncorrected long nanopore reads.

Bioinformatics. 2017 Oct 15;33(20):3188-3194. doi: 10.1093/bioinformatics/btx370.

The present and future of de novo whole-genome assembly.

Brief Bioinform. 2018 Jan 1;19(1):23-40. doi: 10.1093/bib/bbw096.

Assessing the impact of exact reads on reducing the error rate of read mapping.

BMC Bioinformatics. 2018 Nov 6;19(1):406. doi: 10.1186/s12859-018-2432-7.

RResolver: efficient short-read repeat resolution within ABySS.

BMC Bioinformatics. 2022 Jun 21;23(1):246. doi: 10.1186/s12859-022-04790-z.

Cited by

Software Choice and Sequencing Coverage Can Impact Plastid Genome Assembly-A Case Study in the Narrow Endemic .

Front Plant Sci. 2022 Jul 6;13:779830. doi: 10.3389/fpls.2022.779830. eCollection 2022.

A simple guide to de novo transcriptome assembly and annotation.

Brief Bioinform. 2022 Mar 10;23(2). doi: 10.1093/bib/bbab563.

Impact of intrapartum and postnatal antibiotics on the gut microbiome and emergence of antimicrobial resistance in infants.

Sci Rep. 2019 Jul 23;9(1):10635. doi: 10.1038/s41598-019-46964-5.

Improving in-silico normalization using read weights.

Sci Rep. 2019 Mar 26;9(1):5133. doi: 10.1038/s41598-019-41502-9.

In silico read normalization using set multi-cover optimization.

Bioinformatics. 2018 Oct 1;34(19):3273-3280. doi: 10.1093/bioinformatics/bty307.

Neocortical Association Cell Types in the Forebrain of Birds and Alligators.

Curr Biol. 2018 Mar 5;28(5):686-696.e6. doi: 10.1016/j.cub.2018.01.036. Epub 2018 Feb 15.

Microbial Community Composition and Functional Capacity in a Terrestrial Ferruginous, Sulfate-Depleted Mud Volcano.

Front Microbiol. 2017 Nov 2;8:2137. doi: 10.3389/fmicb.2017.02137. eCollection 2017.

HSV-1 clinical isolates with unique in vivo and in vitro phenotypes and insight into genomic differences.

J Neurovirol. 2017 Apr;23(2):171-185. doi: 10.1007/s13365-016-0485-9. Epub 2016 Oct 13.

De novo meta-assembly of ultra-deep sequencing data.

Bioinformatics. 2015 Jun 15;31(12):i9-16. doi: 10.1093/bioinformatics/btv226.
