Phylogeny-aware identification and correction of taxonomically mislabeled sequences.

Authors

Kozlov Alexey M, Zhang Jiajie, Yilmaz Pelin, Glöckner Frank Oliver, Stamatakis Alexandros

Affiliations

The Exelixis Lab, Scientific Computing Group, Heidelberg Institute for Theoretical Studies, Schloss-Wolfsbrunnenweg 35, 69118 Heidelberg, Germany

Publication information

Nucleic Acids Res. 2016 Jun 20;44(11):5022-33. doi: 10.1093/nar/gkw396. Epub 2016 May 10.

DOI:10.1093/nar/gkw396

PMID:27166378

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4914121/

Abstract

Molecular sequences in public databases are mostly annotated by the submitting authors without further validation. This procedure can generate erroneous taxonomic sequence labels. Mislabeled sequences are hard to identify, and they can induce downstream errors because new sequences are typically annotated using existing ones. Furthermore, taxonomic mislabelings in reference sequence databases can bias metagenetic studies which rely on the taxonomy. Despite significant efforts to improve the quality of taxonomic annotations, the curation rate is low because of the labor-intensive manual curation process. Here, we present SATIVA, a phylogeny-aware method to automatically identify taxonomically mislabeled sequences ('mislabels') using statistical models of evolution. We use the Evolutionary Placement Algorithm (EPA) to detect and score sequences whose taxonomic annotation is not supported by the underlying phylogenetic signal, and automatically propose a corrected taxonomic classification for those. Using simulated data, we show that our method attains high accuracy for identification (96.9% sensitivity/91.7% precision) as well as correction (94.9% sensitivity/89.9% precision) of mislabels. Furthermore, an analysis of four widely used microbial 16S reference databases (Greengenes, LTP, RDP and SILVA) indicates that they currently contain between 0.2% and 2.5% mislabels. Finally, we use SATIVA to perform an in-depth evaluation of alternative taxonomies for Cyanobacteria. SATIVA is freely available at https://github.com/amkozlov/sativa.
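The abstract's workflow (flag sequences whose annotated lineage conflicts with the phylogenetic signal, propose a replacement, then score the result by sensitivity and precision) can be illustrated with a minimal sketch. This is not SATIVA's implementation or its API: the `Placement` record, the `confidence` field, and the 0.9 threshold are hypothetical stand-ins for whatever a placement step would report.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class Placement:
    """Hypothetical per-sequence result of a placement step (not SATIVA output)."""
    seq_id: str
    annotated: tuple[str, ...]  # lineage taken from the database label
    proposed: tuple[str, ...]   # lineage suggested by the placement
    confidence: float           # support for the proposed lineage, 0..1


def first_conflict_rank(annotated, proposed):
    """Return the index of the first rank where the lineages differ, else None."""
    for rank, (a, p) in enumerate(zip(annotated, proposed)):
        if a != p:
            return rank
    return None


def flag_mislabels(placements, min_confidence=0.9):
    """Map seq_id -> proposed lineage for well-supported conflicts.

    The threshold is an assumed illustrative value.
    """
    flagged = {}
    for pl in placements:
        conflict = first_conflict_rank(pl.annotated, pl.proposed)
        if conflict is not None and pl.confidence >= min_confidence:
            flagged[pl.seq_id] = pl.proposed
    return flagged


def sensitivity_precision(flagged_ids, true_mislabel_ids):
    """Sensitivity = TP / (TP + FN); precision = TP / (TP + FP)."""
    tp = len(flagged_ids & true_mislabel_ids)
    fp = len(flagged_ids - true_mislabel_ids)
    fn = len(true_mislabel_ids - flagged_ids)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, precision
```

With simulated data, where the set of truly mislabeled sequences is known, `sensitivity_precision(set(flagged), true_ids)` gives the two figures of the kind the abstract reports for identification.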

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9dab/4914121/bfc60a5eb168/gkw396fig1.jpg
