QuorUM: An Error Corrector for Illumina Reads.

Authors

Marçais Guillaume, Yorke James A, Zimin Aleksey

Affiliation

IPST, University of Maryland, College Park, MD, USA.

Publication

PLoS One. 2015 Jun 17;10(6):e0130821. doi: 10.1371/journal.pone.0130821. eCollection 2015.

Abstract

MOTIVATION

Illumina sequencing data can provide high coverage of a genome with relatively short reads (most often 100 bp to 150 bp) at low cost. Even with a low error rate (advertised as 1%), 100× coverage Illumina data has, on average, an error in some read at every base in the genome. These errors complicate handling of the data because they produce a large number of low-count erroneous k-mers in the reads. However, the reads contain enough information to correct most of the sequencing errors, which makes subsequent use of the data (e.g. for mapping or assembly) easier. Here we use the term "error correction" to denote the reduction in errors due to both changes in individual bases and trimming of unusable sequence. We developed error correction software called QuorUM. QuorUM is aimed mainly at error-correcting Illumina reads for subsequent assembly. It is designed around the novel idea of minimizing the number of distinct erroneous k-mers in the output reads while preserving the most true k-mers, and we introduce a composite statistic π that measures how successful we are at achieving this dual goal. We evaluate the performance of QuorUM by correcting actual Illumina reads from genomes for which a reference assembly is available.
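The claim that sequencing errors show up as low-count erroneous k-mers can be illustrated with a toy example. The sketch below is not QuorUM's algorithm; it only counts k-mers in a hypothetical set of reads (20 error-free copies of one made-up read plus one copy with a single substitution) and lists the k-mers seen exactly once. The read sequence, the value k = 5, and the function name `kmer_counts` are assumptions made for illustration. It also spells out the coverage arithmetic: at 100× coverage and a 1% per-base error rate, about one erroneous base call is expected per genome position.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer (length-k substring) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Expected number of erroneous base calls covering one genome position.
coverage = 100
error_rate = 0.01  # the advertised rate quoted in the abstract
expected_errors_per_position = coverage * error_rate

# Hypothetical toy data: 20 error-free reads and one read with a substitution.
true_read = "ACGTTGCATGGACCTAGCTA"
bad_read = true_read[:10] + "T" + true_read[11:]  # index 10: 'G' -> 'T'
reads = [true_read] * 20 + [bad_read]

k = 5
counts = kmer_counts(reads, k)

# k-mers seen only once are candidates for being sequencing errors.
low_count = sorted(kmer for kmer, c in counts.items() if c == 1)
print(expected_errors_per_position)
print(low_count)
```

In this toy data a single substituted base introduces k new k-mers, each seen only once, while the true k-mers keep counts close to the coverage. That gap in counts is the kind of information an error corrector can exploit.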

RESULTS

We produce trimmed and error-corrected reads that result in assemblies with longer contigs and fewer errors. We compared QuorUM against several published error correctors and found that it is the best performer on most of the metrics we use. QuorUM is implemented efficiently, makes use of current multi-core computing architectures, and is suitable for large data sets (1 billion bases checked and corrected per day per core). We also demonstrate that a third-party assembler (SOAPdenovo) benefits significantly from QuorUM error-corrected reads: for the data sets investigated, they improve N50 contig size by a factor of 1.1 to 4 compared to running SOAPdenovo on the original reads.
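For readers unfamiliar with the N50 metric quoted above, here is a minimal sketch of the common definition: the largest contig length L such that contigs of length at least L together cover at least half of the total assembled length. The abstract does not say which exact variant the authors computed (for example, some studies normalize by genome size instead of assembly size), so this is an assumption, and the contig lengths below are hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths.

    Uses the common definition based on total assembly length; this is an
    assumption, not necessarily the exact variant used in the paper.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half the total is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths, for illustration only.
example = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(example))
```

Under this definition, a "factor of 1.1 to 4 improvement" means the N50 of the assembly built from corrected reads is 1.1 to 4 times the N50 of the assembly built from the original reads.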

AVAILABILITY

QuorUM is distributed as an independent software package and as a module of the MaSuRCA assembly software. Both are available under the GPL open source license at http://www.genome.umd.edu.

CONTACT

gmarcais@umd.edu.
