McCorrison Jamison M, Venepally Pratap, Singh Indresh, Fouts Derrick E, Lasken Roger S, Methé Barbara A
Informatics Core Services, The J. Craig Venter Institute (JCVI), 9704 Medical Center Drive, Rockville, MD, 20850, USA.
Department of Genomic Medicine, The J. Craig Venter Institute (JCVI), 9704 Medical Center Drive, Rockville, MD, 20850, USA.
BMC Bioinformatics. 2014 Nov 19;15(1):357. doi: 10.1186/s12859-014-0357-3.
Deep shotgun sequencing on next-generation sequencing (NGS) platforms has contributed significant amounts of data to enrich our understanding of genomes, transcriptomes, amplified single-cell genomes, and metagenomes. However, wide coverage variation in short-read data sets and the high sequencing error rates of modern sequencers present new computational challenges in data interpretation, including mapping and de novo assembly. New lab techniques such as multiple displacement amplification (MDA) of single cells and sequence-independent single-primer amplification (SISPA) allow for sequencing of organisms that cannot be cultured, but generate highly variable coverage due to amplification biases.
Here we introduce NeatFreq, a software tool that reduces a data set to more uniform coverage by clustering and selecting from reads binned by their median k-mer frequency (RMKF) and uniqueness. Previous algorithms normalize read coverage based on RMKF, but do not include methods for the preferred selection of (1) extremely low coverage regions produced by extremely variable sequencing of random-primed products and (2) 2-sided paired-end sequences. Working from an error-corrected data set, the algorithm increases the incorporation of the most unique, lowest coverage segments of a genome. NeatFreq was applied to bacterial, viral plaque, and single-cell sequencing data. The algorithm increased the rate at which the most unique reads in a genome were included in the assembled consensus, while also reducing the count of duplicative and erroneous contigs (strings of high-confidence overlaps) in the deliverable consensus. Results from conventional Overlap-Layout-Consensus (OLC) assembly were compared to simulated multi-de Bruijn graph assembly alternatives trained for variable coverage input, using sequence before and after normalization of coverage. Coverage reduction was shown to increase processing speed and reduce memory requirements when using conventional bacterial assembly algorithms.
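The binning idea described above can be illustrated with a minimal sketch. This is not NeatFreq's implementation: the function names, the exact-frequency bins, the `target` parameter, and the random down-sampling rule are all assumptions made for illustration, and the uniqueness-based and paired-end selection steps are omitted. It only shows how reads might be grouped by median k-mer frequency and how high-frequency bins might be thinned while low-coverage reads are kept in full.

```python
from collections import Counter
from math import ceil
from statistics import median
import random


def kmers(seq, k):
    """Return all overlapping k-mers of a read (empty list if the read is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def read_median_kmer_freq(reads, k):
    """Compute, for each read, the median data-set-wide count of its k-mers."""
    counts = Counter(km for r in reads for km in kmers(r, k))
    freqs = []
    for r in reads:
        ks = kmers(r, k)
        # Assumption: reads too short to contain a k-mer get frequency 0.
        freqs.append(median(counts[km] for km in ks) if ks else 0)
    return freqs


def normalize(reads, k=5, target=3, seed=0):
    """Keep every read in bins at or below `target`; randomly thin the other bins.

    Hypothetical rule: a bin with median frequency f keeps roughly
    len(bin) * target / f reads (at least one), chosen at random.
    """
    rng = random.Random(seed)
    bins = {}
    for idx, f in enumerate(read_median_kmer_freq(reads, k)):
        bins.setdefault(f, []).append(idx)

    kept = []
    for f, members in bins.items():
        if f <= target:
            kept.extend(members)  # low-coverage reads are always retained
        else:
            n_keep = max(1, ceil(len(members) * target / f))
            kept.extend(rng.sample(members, n_keep))
    return [reads[i] for i in sorted(kept)]
```

In this toy scheme, a read whose k-mers are rare across the data set is never discarded, while reads from a deep coverage spike are sampled down toward the chosen target.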
The normalization of deep coverage spikes, which would otherwise inhibit consensus resolution, enables high-throughput sequencing (HTS) assembly projects to consistently run to completion with existing assembly software. The NeatFreq software package is free, open source, and available at https://github.com/bioh4x/NeatFreq.