
Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank.

Affiliations

School of Biological Sciences, Seoul National University, Seoul, 08826, South Korea.

Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, 21218, Maryland, USA.

Publication information

Genome Biol. 2020 May 12;21(1):115. doi: 10.1186/s13059-020-02023-1.

DOI:10.1186/s13059-020-02023-1
PMID:32398145
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7218494/
Abstract

Genomic analyses are sensitive to contamination in public databases caused by incorrectly labeled reference sequences. Here, we describe Conterminator, an efficient method to detect and remove incorrectly labeled sequences by an exhaustive all-against-all sequence comparison. Our analysis reports contamination of 2,161,746, 114,035, and 14,148 sequences in the RefSeq, GenBank, and NR databases, respectively, spanning the whole range from draft to "complete" model organism genomes. Our method scales linearly with input size and can process 3.3 TB in 12 days on a 32-core computer. Conterminator can help ensure the quality of reference databases. Source code (GPLv3): https://github.com/martin-steinegger/conterminator.
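The idea of flagging mislabeled sequences by comparing every sequence against every other can be illustrated with a toy sketch. The code below is not Conterminator's implementation; it is a minimal illustration that assumes exact shared k-mers as the similarity signal and a hypothetical `(seq_id, kingdom, sequence)` record layout. The function name `flag_cross_kingdom` and the example records are invented for illustration. Hashing every k-mer once keeps the work proportional to the total input length for a fixed `k`, which is the general kind of strategy that lets an all-against-all comparison avoid quadratic cost.

```python
from collections import defaultdict


def flag_cross_kingdom(records, k=8):
    """Return IDs of sequences that share an exact k-mer with a sequence
    labeled as a different kingdom (toy stand-in for contamination candidates).

    records: iterable of (seq_id, kingdom, sequence) tuples (hypothetical layout).
    """
    # Map each k-mer to the set of (seq_id, kingdom) pairs that contain it.
    kmer_owners = defaultdict(set)
    for seq_id, kingdom, seq in records:
        for i in range(len(seq) - k + 1):
            kmer_owners[seq[i:i + k]].add((seq_id, kingdom))

    # A k-mer seen under more than one kingdom label marks all its owners.
    flagged = set()
    for owners in kmer_owners.values():
        kingdoms = {kingdom for _, kingdom in owners}
        if len(kingdoms) > 1:
            flagged.update(seq_id for seq_id, _ in owners)
    return flagged


# Invented example: "bact1" and "human1" share the 8-mer TTTGACCA.
records = [
    ("bact1", "Bacteria", "ACGTACGTTTGACCA"),
    ("human1", "Metazoa", "GGGTTTGACCAGGG"),
    ("plant1", "Viridiplantae", "CCCCCCCCCCCC"),
]
print(sorted(flag_cross_kingdom(records)))  # ['bact1', 'human1']
```

The sketch flags both sides of a cross-kingdom match; deciding which of the two records is the contaminant needs additional evidence that this toy example does not model.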


Similar articles

1
Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank.
Genome Biol. 2020 May 12;21(1):115. doi: 10.1186/s13059-020-02023-1.
2
Rapid and sensitive detection of genome contamination at scale with FCS-GX.
Genome Biol. 2024 Feb 26;25(1):60. doi: 10.1186/s13059-024-03198-7.
3
Human contamination in bacterial genomes has created thousands of spurious proteins.
Genome Res. 2019 Jun;29(6):954-960. doi: 10.1101/gr.245373.118. Epub 2019 May 7.
4
VecScreen_plus_taxonomy: imposing a tax(onomy) increase on vector contamination screening.
Bioinformatics. 2018 Mar 1;34(5):755-759. doi: 10.1093/bioinformatics/btx669.
5
VADR: validation and annotation of virus sequence submissions to GenBank.
BMC Bioinformatics. 2020 May 24;21(1):211. doi: 10.1186/s12859-020-3537-3.
6
Rapid and sensitive detection of genome contamination at scale with FCS-GX.
bioRxiv. 2023 Jun 6:2023.06.02.543519. doi: 10.1101/2023.06.02.543519.
7
RabbitTClust: enabling fast clustering analysis of millions of bacteria genomes with MinHash sketches.
Genome Biol. 2023 May 17;24(1):121. doi: 10.1186/s13059-023-02961-6.
8
Matching curated genome databases: a non trivial task.
BMC Genomics. 2008 Oct 24;9:501. doi: 10.1186/1471-2164-9-501.
9
[Analysis, identification and correction of some errors of model refseqs appeared in NCBI Human Gene Database by in silico cloning and experimental verification of novel human genes].
Yi Chuan Xue Bao. 2004 May;31(5):431-43.
10
MMseqs software suite for fast and deep clustering and searching of large protein sequence sets.
Bioinformatics. 2016 May 1;32(9):1323-30. doi: 10.1093/bioinformatics/btw006. Epub 2016 Jan 6.

Cited by

1
The new microbiome on the block: challenges and opportunities of using human tumor sequencing data to study microbes.
Nat Methods. 2025 Sep 15. doi: 10.1038/s41592-025-02807-y.
2
Protein Structural Phylogenetics.
Genome Biol Evol. 2025 Jul 30;17(8). doi: 10.1093/gbe/evaf139.
3
Targeted decontamination of sequencing data with CLEAN.
NAR Genom Bioinform. 2025 Jul 24;7(3):lqaf105. doi: 10.1093/nargab/lqaf105. eCollection 2025 Sep.
4
The spatiotemporal distribution of human pathogens in ancient Eurasia.
Nature. 2025 Jul;643(8073):1011-1019. doi: 10.1038/s41586-025-09192-8. Epub 2025 Jul 9.
5
Identification of bacteriophage DNA in human umbilical cord blood.
JCI Insight. 2025 Jul 8;10(13). doi: 10.1172/jci.insight.183123.
6
Apicortin, a Putative Apicomplexan-Specific Protein, Is Present in Deep-Branching Opisthokonts.
Biology (Basel). 2025 May 28;14(6):620. doi: 10.3390/biology14060620.
7
Refinement of the Reference Viral Database (RVDB) for improving bioinformatics analysis of virus detection by high-throughput sequencing (HTS).
mSphere. 2025 Jul 29;10(7):e0028625. doi: 10.1128/msphere.00286-25. Epub 2025 Jun 23.
8
Feature Architecture-Aware Ortholog Search With fDOG Reveals the Distribution of Plant Cell Wall-Degrading Enzymes Across Life.
Mol Biol Evol. 2025 Jun 4;42(6). doi: 10.1093/molbev/msaf120.
9
In Vouchers We (Hope to) Trust: Unveiling Hidden Errors in GenBank's Tetrapod Taxonomic Foundations.
Mol Ecol. 2025 Jul;34(13):e17812. doi: 10.1111/mec.17812. Epub 2025 Jun 3.
10
Small amounts of misassembly can have disproportionate effects on pangenome-based metagenomic analyses.
mSphere. 2025 May 27;10(5):e0085724. doi: 10.1128/msphere.00857-24. Epub 2025 Apr 29.

References

1
Contaminations in (meta)genome data: An open issue for the scientific community.
IUBMB Life. 2020 Apr;72(4):698-705. doi: 10.1002/iub.2216. Epub 2019 Dec 23.
2
Pavian: interactive analysis of metagenomics data for microbiome studies and pathogen identification.
Bioinformatics. 2020 Feb 15;36(4):1303-1304. doi: 10.1093/bioinformatics/btz715.
3
Large-scale sequence comparisons with sourmash.
F1000Res. 2019 Jul 4;8:1006. doi: 10.12688/f1000research.19675.1. eCollection 2019.
4
FDA-ARGOS is a database with public quality-controlled reference genomes for diagnostic use and regulatory science.
Nat Commun. 2019 Jul 25;10(1):3313. doi: 10.1038/s41467-019-11306-6.
5
Recompleting the Caenorhabditis elegans genome.
Genome Res. 2019 Jun;29(6):1009-1022. doi: 10.1101/gr.244830.118. Epub 2019 May 23.
6
Human contamination in bacterial genomes has created thousands of spurious proteins.
Genome Res. 2019 Jun;29(6):954-960. doi: 10.1101/gr.245373.118. Epub 2019 May 7.
7
KrakenUniq: confident and fast metagenomics classification using unique k-mer counts.
Genome Biol. 2018 Nov 16;19(1):198. doi: 10.1186/s13059-018-1568-0.
8
UniProt: a worldwide hub of protein knowledge.
Nucleic Acids Res. 2019 Jan 8;47(D1):D506-D515. doi: 10.1093/nar/gky1049.
9
GenBank.
Nucleic Acids Res. 2019 Jan 8;47(D1):D94-D99. doi: 10.1093/nar/gky989.
10
Clustering huge protein sequence sets in linear time.
Nat Commun. 2018 Jun 29;9(1):2542. doi: 10.1038/s41467-018-04964-5.