Fast identification and removal of sequence contamination from genomic and metagenomic datasets.

Affiliation

Department of Computer Science, San Diego State University, San Diego, California, United States of America.

Publication information

PLoS One. 2011 Mar 9;6(3):e17288. doi: 10.1371/journal.pone.0017288.

DOI:10.1371/journal.pone.0017288

PMID:21408061

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3052304/

Abstract

High-throughput sequencing technologies have strongly impacted microbiology, providing a rapid and cost-effective way of generating draft genomes and exploring microbial diversity. However, sequences obtained from impure nucleic acid preparations may contain DNA from sources other than the sample. Those sequence contaminations are a serious concern to the quality of the data used for downstream analysis, causing misassembly of sequence contigs and erroneous conclusions. Therefore, the removal of sequence contaminants is a necessary and required step for all sequencing projects. We developed DeconSeq, a robust framework for the rapid, automated identification and removal of sequence contamination in longer-read datasets (>150 bp mean read length). DeconSeq is publicly available as standalone and web-based versions. The results can be exported for subsequent analysis, and the databases used for the web-based version are automatically updated on a regular basis. DeconSeq categorizes possible contamination sequences, eliminates redundant hits with higher similarity to non-contaminant genomes, and provides graphical visualizations of the alignment results and classifications. Using DeconSeq, we conducted an analysis of possible human DNA contamination in 202 previously published microbial and viral metagenomes and found possible contamination in 145 (72%) metagenomes with as high as 64% contaminating sequences. This new framework allows scientists to automatically detect and efficiently remove unwanted sequence contamination from their datasets while eliminating critical limitations of current methods. DeconSeq's web interface is simple and user-friendly. The standalone version allows offline analysis and integration into existing data processing pipelines. DeconSeq's results reveal whether the sequencing experiment has succeeded, whether the correct sample was sequenced, and whether the sample contains any sequence contamination from DNA preparation or host. In addition, the analysis of 202 metagenomes demonstrated significant contamination of the non-human associated metagenomes, suggesting that this method is appropriate for screening all metagenomes. DeconSeq is available at http://deconseq.sourceforge.net/.
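
The classification step described in the abstract (flag reads that align to a contaminant genome, but drop hits that are more similar to a non-contaminant genome) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not DeconSeq's actual implementation: the `Hit` record, the identity and coverage cutoffs, the `similarity` score, and all function names are hypothetical and do not come from the abstract.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """Best alignment of one read against one database (hypothetical record)."""
    identity: float  # percent identity over the aligned region, 0-100
    coverage: float  # percent of the read covered by the alignment, 0-100


# Illustrative cutoffs only; assumptions, not values taken from the abstract.
MIN_IDENTITY = 90.0
MIN_COVERAGE = 90.0


def passes(hit: Optional[Hit]) -> bool:
    """True if a hit exists and meets both illustrative cutoffs."""
    return (
        hit is not None
        and hit.identity >= MIN_IDENTITY
        and hit.coverage >= MIN_COVERAGE
    )


def similarity(hit: Hit) -> float:
    """Assumed similarity score: identity weighted by read coverage."""
    return hit.identity * hit.coverage / 100.0


def classify_read(contaminant_hit: Optional[Hit], retain_hit: Optional[Hit]) -> str:
    """Label a read 'contaminant' or 'clean' from its two best hits."""
    if not passes(contaminant_hit):
        return "clean"
    # Redundant hit: the read matches a non-contaminant genome more closely.
    if passes(retain_hit) and similarity(retain_hit) > similarity(contaminant_hit):
        return "clean"
    return "contaminant"


def contaminated_percent(labels: List[str]) -> float:
    """Percentage of reads labelled as contaminant."""
    if not labels:
        return 0.0
    return 100.0 * labels.count("contaminant") / len(labels)


labels = [
    classify_read(Hit(99.0, 98.0), None),             # strong contaminant hit only
    classify_read(Hit(95.0, 95.0), Hit(99.0, 99.0)),  # closer to a retained genome
    classify_read(None, None),                        # no hits at all
    classify_read(Hit(80.0, 98.0), None),             # identity below the cutoff
]
print(labels, contaminated_percent(labels))
```

Keeping the two hits separate makes the "redundant hit" rule explicit: a read is only removed when its contaminant alignment is at least as good as its best alignment to a genome the user wants to retain.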
