Department of Computer Science, San Diego State University, San Diego, California, United States of America.
PLoS One. 2011 Mar 9;6(3):e17288. doi: 10.1371/journal.pone.0017288.
High-throughput sequencing technologies have strongly impacted microbiology, providing a rapid and cost-effective way of generating draft genomes and exploring microbial diversity. However, sequences obtained from impure nucleic acid preparations may contain DNA from sources other than the sample. Such sequence contamination is a serious concern for the quality of the data used in downstream analysis, because it can cause misassembly of sequence contigs and lead to erroneous conclusions. The removal of sequence contaminants is therefore a necessary step for all sequencing projects. We developed DeconSeq, a robust framework for the rapid, automated identification and removal of sequence contamination in longer-read datasets (150 bp mean read length). DeconSeq is publicly available in standalone and web-based versions. The results can be exported for subsequent analysis, and the databases used by the web-based version are automatically updated on a regular basis. DeconSeq categorizes possible contamination sequences, eliminates redundant hits with higher similarity to non-contaminant genomes, and provides graphical visualizations of the alignment results and classifications. Using DeconSeq, we analyzed 202 previously published microbial and viral metagenomes for possible human DNA contamination and found possible contamination in 145 (72%) of them, with up to 64% contaminating sequences in a single metagenome. This new framework allows scientists to automatically detect and efficiently remove unwanted sequence contamination from their datasets while eliminating critical limitations of current methods. DeconSeq's web interface is simple and user-friendly. The standalone version allows offline analysis and integration into existing data processing pipelines. DeconSeq's results reveal whether the sequencing experiment has succeeded, whether the correct sample was sequenced, and whether the sample contains any sequence contamination from DNA preparation or host. In addition, the analysis of 202 metagenomes demonstrated significant contamination of the non-human-associated metagenomes, suggesting that this method is appropriate for screening all metagenomes. DeconSeq is available at http://deconseq.sourceforge.net/.
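The classification step described above (flag reads that align to a contaminant genome, but drop hits that match a non-contaminant genome more closely) can be illustrated with a minimal Python sketch. This is not DeconSeq's code: the `Hit` record, the `classify_reads` function, the database labels, and the identity and coverage thresholds are all hypothetical choices made for illustration, and alignments are assumed to have been computed beforehand by some external aligner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One alignment of a read against a reference database (hypothetical record)."""
    read_id: str
    database: str    # "contaminant" or "retain" (labels assumed for this sketch)
    identity: float  # percent identity of the alignment, 0-100
    coverage: float  # percent of the read covered by the alignment, 0-100


def classify_reads(read_ids, hits, min_identity=94.0, min_coverage=90.0):
    """Label each read 'clean' or 'contaminant'.

    The thresholds are illustrative assumptions, not values from the abstract.
    """
    # Keep the best-scoring hit per (read, database) that passes both thresholds.
    best = {}
    for hit in hits:
        if hit.identity < min_identity or hit.coverage < min_coverage:
            continue
        key = (hit.read_id, hit.database)
        score = (hit.identity, hit.coverage)
        if key not in best or score > best[key]:
            best[key] = score

    labels = {}
    for read_id in read_ids:
        contaminant = best.get((read_id, "contaminant"))
        retain = best.get((read_id, "retain"))
        if contaminant is None:
            # No acceptable hit to a contaminant genome.
            labels[read_id] = "clean"
        elif retain is not None and retain > contaminant:
            # Redundant hit: the read is more similar to a non-contaminant genome.
            labels[read_id] = "clean"
        else:
            labels[read_id] = "contaminant"
    return labels


# Toy input: four reads, made-up alignment values.
hits = [
    Hit("r1", "contaminant", 99.0, 100.0),
    Hit("r2", "contaminant", 95.0, 98.0),
    Hit("r2", "retain", 99.5, 100.0),
    Hit("r3", "contaminant", 80.0, 100.0),
]
labels = classify_reads(["r1", "r2", "r3", "r4"], hits)
fraction = sum(v == "contaminant" for v in labels.values()) / len(labels)
print(labels, fraction)
```

In this toy input only `r1` is flagged: `r2` has a stronger hit in the non-contaminant database, `r3` falls below the assumed identity threshold, and `r4` has no hits at all. Treating a tie between the two databases as contamination is a further assumption of the sketch.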